OLIVIA Dataset

Introduction

This repository contains the data sources used in the OLIVIA project for region representations. The OLIVIA dataset is a collection of data sources describing different regional attributes; studying these attributes can help with pandemic understanding and response analysis.

Installation

This package can be installed via pip:

pip3 install --upgrade olivia_dataset

or via the source:

pip3 install -e .

Getting Started

The first step is to configure the package. Run the following command and enter the required information:

olivia_dataset_config

Afterwards, whenever you want to refresh the live repository, run the following command:

olivia_dataset_refresh
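
If the refresh needs to be triggered from a script (for example, a scheduled job), the command can be wrapped as in the minimal sketch below. It assumes that `olivia_dataset_refresh` is on the PATH after installation and that it runs without interactive prompts once `olivia_dataset_config` has been completed; neither is confirmed here. The helper name `run_cli` is hypothetical and not part of the package.

```python
import shutil
import subprocess


def run_cli(command: str) -> bool:
    """Run a console command if it is on PATH; return True on exit code 0.

    `run_cli` is a hypothetical helper, not part of olivia_dataset.
    """
    # Refuse to run if the command is not installed / not on PATH.
    if shutil.which(command) is None:
        return False
    # check=False: report failure through the return value instead of raising.
    result = subprocess.run([command], check=False)
    return result.returncode == 0


if __name__ == "__main__":
    # Assumption: the package was already configured with olivia_dataset_config.
    ok = run_cli("olivia_dataset_refresh")
    print("refresh succeeded" if ok else "refresh failed or command not found")
```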

Live Repository

The most recent versions of the data files are available in the live dataset gdrive repository.

Documentation

The package documentation is available at this link.

Citation

Please use the following citation:

@inproceedings{fazeli2021statistical,
  title={Statistical Analytics and Regional Representation Learning for COVID-19 Pandemic Understanding},
  author={Fazeli, Shayan and Moatamed, Babak and Sarrafzadeh, Majid},
  booktitle={2021 IEEE 9th International Conference on Healthcare Informatics (ICHI)},
  pages={248--257},
  year={2021},
  organization={IEEE}
}

Data Sources

  • Laboratory confirmed COVID-19 Associated Hospitalizations [link]

  • A Weekly Influenza Surveillance Report Prepared by the Influenza Division [link]

    • CAUTION: the data contains minor recording errors that users should be aware of (e.g., a “434” value for New York has been recorded as “4334”).

  • COVID-19 Cases, Deaths, and Recoveries across the United States [link]

  • US Census Data [link]

    • We have been using this variant (and the file for Census 2017) shared by the Kaggle community. More recent data might be available in the official repository.

  • US Mortality Rates by County 1980-2014

    • The data sources: [link], [link]

    • We use this dataset to derive static county features from its age-standardized values.

  • Diversity Index of US counties [link]

    • Using the census data, we should be able to compute more recent and more accurate county diversity indices, and eventually stop using this data.

  • US Drought Monitor [link], [link]

  • US ICU beds (evaluated by Kaiser Health News) [link]

  • Election results [link]

    • To extract county-based features, we focused on 2016, which was the most recent US presidential election at the time. More information is currently available in Harvard Dataverse.

  • US Household Income Statistics [link]

  • Food Business Features - From National Restaurants Association - State level [link]

  • Life expectancy, obesity, and physical activity [link]

  • Alcohol [link]

  • Diabetes [link]

  • HRSA Health Center COVID-19 Vaccine Program Participants [link]
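
The note above about recomputing county diversity indices from census data can be sketched as follows. The source does not state which definition of the diversity index it uses; this sketch assumes the Gini-Simpson form, i.e. the probability that two randomly chosen residents belong to different groups, 1 - sum(p_i^2). The function name, group labels, and counts are hypothetical and not taken from the OLIVIA package or the census files.

```python
from typing import Mapping


def diversity_index(group_counts: Mapping[str, float]) -> float:
    """Gini-Simpson diversity index for one county.

    Assumption: the index is 1 - sum(p_i ** 2), where p_i is the
    population share of group i. The actual definition used by the
    linked dataset may differ.
    """
    total = sum(group_counts.values())
    if total <= 0:
        raise ValueError("total population must be positive")
    # Sum of squared shares = chance that two random residents share a group.
    same_group = sum((count / total) ** 2 for count in group_counts.values())
    return 1.0 - same_group


if __name__ == "__main__":
    # Hypothetical county with illustrative (not real) group counts.
    county = {"group_a": 600, "group_b": 300, "group_c": 100}
    print(round(diversity_index(county), 4))
```

A value of 0 means the whole population falls in one group; values closer to 1 indicate a more even spread across many groups.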

Additional Data Repositories Available Online

  • COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University [link]

    • Links to the official state/county dashboards can be found there.

  • County-level Socioeconomic Data for Predictive Modeling of Epidemiological Effects - [link]

    • This is very similar to our effort in gathering regional features and attributes.

    • The two collections are not identical, however. While they share some key features, such as COVID-19 outcomes, they mainly complement each other.

      • That repository includes information on some additional features, such as crime and education.

  • KFF Data Repository

    • This data includes vaccination-related information as well as policy-related information - [link]