Characterizing diversity in biodata and producing visualizations
Clone the repository, then install the package with pip:
$ pip install <path-to-package>
This will expose all of the package's command line entry points.
For example, to assess the diversity of your data, use the assess_diversity command:
$ assess_diversity --help
usage: assess_diversity [-h] [-v] input_data output_dir
assess the diversity of your data
positional arguments:
input_data Path to the csv file containing the data you want to assess.
output_dir Path to a directory where results will stored.
optional arguments:
-h, --help show this help message and exit
-v, --verbose Increase logging verbosity.
To run the tool on the example data, use the following command:
$ assess_diversity input/ipums_test_cleaned.csv output
The package is pip installable. During development, you can install it in editable mode:
$ pip install -e <path-to-package>
In editable mode, changes to the source take effect the next time you run a command, so you do not need to reinstall the package to test it.
The most significant risk factor for developing dementia is age 1. Aging is also the dominant risk factor in developing clinically significant atherosclerotic lesion formation 2. Heart failure is common in the older population and increases progressively with advancing age 3.
Age can also be an issue for the young. Asthma is more common in children than adults 4. Acute tonsillitis is most common in children 5. About 60% of all babies have jaundice 6.
For many precision medicine analyses, age isn't one field, it's many: age at symptom onset, age at diagnosis, age at certain treatment dates and age at death. For some analyses age in years is sufficient, whereas for others intervals between events measured in days are more appropriate.
Be aware of how dates may be blurred. Health data providers may want to withhold or generalise fields such as birth dates because they feel these fields present a significant risk that pseudonymised patient records could be re-identified. They may remove precision by reporting year and month only, or they may impute a day based on some formula. It is important to note the details of how dates were reported, especially across data sources that were produced by different organisations.
Prefer using intervals over dates as early as possible in data processing. Working with time intervals rather than specific dates and times is one way to mitigate concerns about re-identifiability. A birth date of October 12, 1954 could always be used to help identify patients in unrelated data sets, but 55 as an 'age of symptom onset' is a less identifiable variable.
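As a minimal sketch of replacing dates with intervals, the following uses only the Python standard library. The function names, field names and the onset and diagnosis dates are hypothetical and chosen for illustration; they are not part of this package.

```python
from datetime import date


def age_in_years(birth: date, event: date) -> int:
    """Return the number of completed years between birth and event."""
    years = event.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the event year.
    if (event.month, event.day) < (birth.month, birth.day):
        years -= 1
    return years


def days_between(start: date, end: date) -> int:
    """Return the interval between two events in whole days."""
    return (end - start).days


# Hypothetical patient dates, used only to derive intervals.
birth = date(1954, 10, 12)
onset = date(2010, 3, 1)
diagnosis = date(2010, 6, 15)

# Keep the intervals and drop the raw dates from downstream processing.
record = {
    "age_at_onset_years": age_in_years(birth, onset),
    "onset_to_diagnosis_days": days_between(onset, diagnosis),
}
print(record)
```

Once the intervals are derived, the raw dates can be removed from the working data set so that later steps never see them.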
Be aware of inaccurate death data. Some of your age calculations may relate to 'age at death', or they may assume a patient is alive when they are not. In some parts of the world, hospitals may record that a patient received a treatment but only record death if it occurs on their premises. If the patient later died at home or died at another hospital, the health records may not accurately indicate if and when someone passed away. You may need to link patient records from a health provider with data from patient death registries.
Age bands are often used to make data less identifiable. Health data providers may choose to make each patient record have an age band instead of a specific age (eg: 30-34 instead of 32). If you are harmonising multiple patient data sources, you may need to consider how to reconcile between different band intervals or with exact ages.
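One possible way to reconcile exact ages and narrow bands is to coarsen everything to the widest band that all sources can support. The sketch below assumes bands are written as "low-high" with inclusive bounds and that the target bands start at multiples of the band width; both are assumptions, and the helper names are hypothetical.

```python
def parse_band(band: str) -> tuple[int, int]:
    """Split a band such as '30-34' into inclusive integer bounds."""
    low, high = band.split("-")
    return int(low), int(high)


def to_band(age: int, width: int = 10) -> str:
    """Place an exact age into a band of the given width, e.g. 32 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def coarsen(band: str, width: int = 10) -> str:
    """Map a narrow band onto the wider band that fully contains it."""
    low, high = parse_band(band)
    target = to_band(low, width)
    _, target_high = parse_band(target)
    if high > target_high:
        # The source band straddles two target bands, so it cannot be mapped safely.
        raise ValueError(f"band {band} does not nest inside a {width}-year band")
    return target


print(to_band(32))        # exact age from one source
print(coarsen("30-34"))   # 5-year band from another source
```

Bands that straddle two target bands raise an error here rather than being guessed, because any assignment would misplace some patients.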
Check the date format used in your data sets. It's a good idea to convert all dates to a single canonical format so you can apply the same data calculations across all your data sources. For example, 07/06/2008 can be June 7, 2008 in a British health data set or July 6, 2008 in a US data set.
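A minimal sketch of converting dates to one canonical format is shown below. It assumes you know which convention each source uses; the source names and the mapping are hypothetical. ISO 8601 (YYYY-MM-DD) is used as the canonical format because it is unambiguous.

```python
from datetime import datetime

# Hypothetical mapping from data source to the date format it uses.
SOURCE_DATE_FORMATS = {
    "uk_source": "%d/%m/%Y",  # day first
    "us_source": "%m/%d/%Y",  # month first
}


def to_iso(raw: str, source: str) -> str:
    """Parse a date string using its source's format and return YYYY-MM-DD."""
    fmt = SOURCE_DATE_FORMATS[source]
    return datetime.strptime(raw, fmt).date().isoformat()


print(to_iso("07/06/2008", "uk_source"))
print(to_iso("07/06/2008", "us_source"))
```

Parsing with an explicit format per source avoids relying on automatic detection, which cannot tell the two readings of 07/06/2008 apart.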
Sex refers to biological characteristics, whereas gender is based on socially constructed features. Both variables are most accurately characterised as a spectrum of values, but in data collection they are most often treated as Male or Female. Sex and gender are often wrongly thought to be interchangeable concepts, and this can be reflected in how health data are reported.
The spectrum for sex covers male, female and Disorders of Sex Development (DSD), which are a collection of congenital conditions associated with atypical development of internal and external genital structures 7. At a genetic level, sex is not just a matter of XX for female and XY for male. The range is better described as 'a range of chromosome complements, hormone balances, and phenotypic variations that determine sex' 8.
Gender identity is also a spectrum with more than 60 terms associated with it 9.
There is ample evidence that common demographic factors can be highly correlated with various diseases. For example, sex and gender can be important factors. Women have a higher incidence and prevalence of autoimmune diseases than men, and for several autoimmune diseases 85% or more of patients are female 10. Clinical observation shows that men and women differ in the prevalence, symptoms, and treatment responses of several psychiatric disorders, including schizophrenia 11. Transgender women are 49 times more likely to be living with HIV than other adults of reproductive age 12.
Sex and gender tend to be reported as binary variables in health. The variety of other values those variables can assume is often lost in reporting.
The main challenge of processing sex and gender is standardising values across data sets that have been coded with different data dictionaries. You may have to do several semantic mapping activities which address both the names and coding values of variables you believe describe sex. This table illustrates the variety of slightly different codings amongst different types of health data sets:
Category | Phenotypic sex classification (NHS Data Dictionary) | CDC Coronavirus Report | ISD Scotland Data Dictionary | US Health Information Knowledge Base Sex/Gender HL7 |
---|---|---|---|---|
Male | 1 | 1 | 1 | 1 |
Female | 2 | 2 | 2 | 2 |
Other | - | 3 | - | - |
Indeterminate | 9 | - | - | - |
Unknown | X | 9 | - | - |
Ambiguous | - | - | - | - |
Not Known | - | - | - | 0 |
Not | - | - | 8 | 9 |
- Are variables capturing sex really describing gender?
- Is there a default value? This can help you tell the difference between a value which may have been pre-populated on an electronic form and one where someone had to make an active decision.
- Is the value sex assigned at birth or currently assigned sex?
- Was the sex value self-reported by a patient or assigned by a medical professional? Check whether the variable indicates whether the patient stated their sex or not.
- Be aware of the compatibility of meanings of missing values (eg: not applicable, not specified, not known, unknown). Unknown and not known are equivalent, but not specified may indicate a patient's intent not to provide information.
- Be aware of the semantic equivalence of miscellaneous coding values (eg: other, indeterminate, ambiguous).
- What was the original data type of variables stored in the system? Some databases store sex as a binary data type that can only record a value as a "0" or a "1". Other systems may use an integer data type for a variable that may have "1" or a "2".
- Look for other information in your data set to support better diversity:
  - Look for ICD 10 codes F64.* for "Gender identity disorders"
  - Examine Witchel's paper on Disorders of Sex Development for a classification of disorders
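The semantic mapping described above can be sketched as a per-source lookup table. The code values below are copied from the first two coding columns of the table above and should be checked against each source's own data dictionary before use. The source keys, the harmonised labels and the function name are hypothetical.

```python
# Per-source maps from raw code to a harmonised label (illustrative only).
CODE_MAPS = {
    "nhs": {"1": "male", "2": "female", "9": "indeterminate", "X": "unknown"},
    "cdc": {"1": "male", "2": "female", "3": "other", "9": "unknown"},
}


def harmonise_sex(raw, source: str) -> str:
    """Translate a raw sex code from a given source into a harmonised label."""
    if raw is None or str(raw).strip() == "":
        # Keep empty fields distinct from an explicit 'unknown' code.
        return "missing"
    code = str(raw).strip().upper()
    # Flag codes that are not in the dictionary instead of guessing a meaning.
    return CODE_MAPS[source].get(code, "unmapped")


print(harmonise_sex(9, "nhs"))
print(harmonise_sex("9", "cdc"))
```

The same raw code maps to different labels depending on the source, which is why the lookup is keyed by source first. Codes are compared as strings so that integer and text columns are handled the same way.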
The concepts of race and ethnicity are problematic in precision medicine because they are largely social constructs that don't map well to biological characteristics. Race has only ever alluded to a small number of morphological phenotypes, most of which are not relevant to molecular aspects of disease mechanisms. Ethnicity denotes groups that share a common identity-based ancestry, language or culture 13.
Ethnicity and race can be a strong risk factor in disease response. The prevalence of coronary heart disease amongst the South Asian population in the UK is higher compared to the general population 14. Type 2 diabetes is more prevalent in Asian and Black populations than it is in white populations 15. The overall mortality rate for sarcoidosis has been shown to be eight times higher in African Americans than in Caucasians 16. More recently, it is becoming clear that COVID mortality rates for some ethnic groups are much higher than for White people 17.
These disparities are largely due to societal inequity between different groups and to environmental factors that contribute to different health outcomes. There is very little genetic difference between groups: any two individuals share at least 99.5% of their DNA 18.
The most important biological concept related to ancestry is the haplotype, a set of DNA variations that tend to be inherited together. A haplotype can be either a combination of alleles or a set of single nucleotide polymorphisms (SNPs) found on the same chromosome 19.
The link between social concepts such as race with biological concepts like haplotypes can be vague and inaccurate. Ethnicity is a more specific indicator of biological ancestry than race, but it too can provide a poor link. Its assignment to patients can vary based on an individual's assessment and be influenced by the culture in which that individual lives. Like assignments of race, assignments of ethnicity are often too reliant on a handful of visible phenotypes. In clinical data sets that contain race and ethnicity fields, the broader the coverage of categories that appear in the data, the more likely it is that at a biological level, there will be more variety in haplotypes.
Although race and ethnicity may be poor proxies for genetic variability, they can be much better proxies for long standing health inequities, which can in turn suggest broader environmental factors that may pressure aspects of genetic expression. For example, if a racial or ethnic group tends to experience economic deprivation, then people from that group may not be able to afford treatments. If they do not seek health care opportunities at all because of cost barriers, then they will not leave behind any health records which could inform machine learning algorithms. We address deprivation more in our section for Socio-Economic Status.
Ethnicity and race values are subjective. If the race or ethnicity of a patient is assigned, you should understand what criteria were used for classification and consider what kinds of biases might have been involved. More often the values for these variables tend to be assigned by patients themselves, which can also be subjective.
Ethnicity and race are often conflated. Both race and ethnicity are social constructs but often race forms broad headings for ethnicity classifications.
Ethnicity classifications are often not mutually exclusive. Categories of ethnicity can include aspects of nationality, geography and skin colour which are not mutually exclusive. As an example of how this can happen, consider the NHS's ethnic category code list.
Now imagine a patient who was born in 1940 in India to a white Welsh mother and a non-white Muslim father who was born and raised in India. In 1947, her family moved from India to Pakistan, a country which only came into existence during the Indian Partition. Later in life she moved to the UK, and became settled in Scotland. Several of these codes may apply and her perception of which ones are most relevant might change over time:
CB: Scottish
CX: Mixed White
C3: Other white, white unspecified
C: Any other White background
A: British, Mixed British
J: British Pakistani or Pakistani
H: Indian or British Indian
F: White and Asian
CC: Welsh
The vagueness in state-recognised ethnic categories can also change over time. For example, forms for the US census only began to allow participants to specify multiple racial categories in 2000 20.
The value in self-reported ethnicity fields may not be in the accuracy of any one patient's records but in the category coverage of many patients' records. The broader the coverage, the less homogeneous patient cohorts are likely to be.
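One simple way to quantify category coverage is the fraction of categories in a reference code list that appear at least once in a cohort. This is a sketch under stated assumptions, not a description of what assess_diversity computes. The reference list below is a hypothetical subset of the codes quoted above, and the cohort values are made up for illustration.

```python
from collections import Counter

# Hypothetical reference list: a small subset of ethnic category codes.
REFERENCE_CODES = {"A", "C", "F", "H", "J"}


def category_coverage(values, reference) -> float:
    """Return the fraction of reference categories observed at least once."""
    observed = set(values) & set(reference)
    return len(observed) / len(reference)


# Made-up cohort of self-reported codes.
cohort = ["A", "A", "H", "J", "A", "A"]

coverage = category_coverage(cohort, REFERENCE_CODES)
counts = Counter(cohort)  # per-category counts show how uneven the cohort is

print(coverage)
print(counts.most_common())
```

Coverage alone ignores how evenly patients are spread across categories, so it is worth reading alongside the per-category counts.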
Harmonising different coding systems can be challenging. It can be difficult to standardise race and ethnicity fields across data sources which use different national classification systems. For example, the HL7 FHIR race codes are very different from the NHS ethnicity codes. Significant work may be required to reconcile values from American and British health data.
Socioeconomic status (SES) is "...the social standing or class of an individual or group. It is often measured as a combination of education, income and occupation." 21.
In health research, SES quantifies aspects of health inequalities, some of which may be health inequities. A health inequality is "...any difference in the distribution of health status or health determinants between different population groups" 22. A health inequity is "...a specific type of health inequality that denotes an unjust difference in health." 23
Young people tend to enjoy better health than old people. This is a health inequality because the difference is related to biology and is not usually considered preventable. In the USA, African-Americans have experienced a disproportionate number of COVID-19 fatalities relative to their share of the general population. This is a health inequity because many causes of the disproportion relate to aspects of social justice 24.
Socio-economic status is often measured either for individuals or for regions. Individual SES measurements can measure income, educational attainment, or occupation. Area-based SES measurements can be based on factors such as average neighbourhood income or be based on complex area deprivation index systems. Often area-based SES measurements are used as a proxy when individual SES data are not available.
The importance of SES for health is well appreciated in epidemiology. According to 25: "Individuals with lower SES experience more chronic disease, are less likely to receive preventive care, and have shorter life expectancies." Low SES has almost the same effect on health as smoking or a sedentary lifestyle 26. Kivimaki's study found that socioeconomic status was associated with increased risk for 18 of 56 conditions. Globally, poorer older adults experience more dental disease and disability 27.
Sometimes socioeconomic measurements are left out of patient recruitment processes for clinical trials. For example, the prevalence of Chronic Obstructive Pulmonary Disease (COPD) and asthma is associated with socioeconomic status. However, "...deprivation is rarely considered in typical large-scale efficacy randomised trials that recruit highly selected patient populations" 28.
Data sources for precision medicine can reflect a bias in SES levels. For example, in the Danish National Birth Cohort, groups with low socioeconomic values for education, occupation and income status are underrepresented compared to the background population 29. UK Biobank participants are more likely to live in less socioeconomically deprived areas than non-participants 30.
Patients from low socio-economic levels may have trouble affording treatments or managing the travel logistics of getting to a treatment centre. They may also tend to have health journeys that are fragmented across multiple healthcare organisations 31. These problems can mean patients from poorer backgrounds are less represented in data sets.
SES is poorly covered in health data sets.
SES has different scales. The UK uses the National Statistics Socio-Economic Classification (NS-SEC) system 32. UK Biobank uses the Townsend Deprivation Index 33. In India, Prasad, Pareek, and Kuppuswamy scales are used to measure the SES of a family 34. For small areas, England uses the English Indices of Deprivation (IoD) 35, whereas the Canadian province of Ontario uses the Ontario Marginalization Index 36.
Be aware of the limitations of using area-based indicators for individuals. Just because an individual lives in a neighbourhood with a given socioeconomic status does not mean that person has a similar status.
Be aware of how SES is calculated across data sources. For example, some measures of SES rely on an asset-based wealth index, whereas others use income and expenditure.