What is the name of your project?
MedLink
What is the purpose of your project?
The purpose of our project is to compare the computational requirements of realistic probabilistic record linkage at different scales, specifically thousands, millions, and billions of records. Probabilistic record linkage is a vital technique in fields such as healthcare, the social sciences, and data analytics, where accurate and efficient matching of records across multiple datasets is crucial. By conducting this comparative analysis, we aim to provide insight into the computational challenges and resource requirements of scaling probabilistic record linkage algorithms to large datasets. This research will contribute to the development of scalable record linkage solutions and inform decision-making about data management and processing strategies.
In particular, we are focusing on record linkage between the US Census and public health datasets, a linkage that is commonly made because census data (on population distributions, densities, and demographics) is so useful for assessing the accuracy and potential biases of public health data.
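Since the request is ultimately about the cost of running this kind of comparison at scale, here is a minimal sketch of the Fellegi-Sunter-style scoring we have in mind. In practice we would use a dedicated linkage tool (e.g. Splink) rather than hand-rolled scoring; the column names, blocking key, and m/u probabilities below are illustrative placeholders, not our final specification.

```python
# Minimal Fellegi-Sunter-style scoring sketch (illustrative only): block on ZIP,
# then sum per-field log-likelihood-ratio weights for each candidate pair.
# Column names and the (m, u) probabilities are placeholders, not our final spec.
import numpy as np
import pandas as pd

def agreement_weight(a: pd.Series, b: pd.Series, m: float, u: float) -> np.ndarray:
    """log2(m/u) when the field agrees, log2((1-m)/(1-u)) when it disagrees."""
    agree = a.astype(str).str.lower() == b.astype(str).str.lower()
    return np.where(agree, np.log2(m / u), np.log2((1 - m) / (1 - u)))

def score_pairs(census: pd.DataFrame, ehr: pd.DataFrame) -> pd.DataFrame:
    # Blocking on ZIP keeps the comparison space far below the full cross product.
    pairs = census.merge(ehr, on="zip", suffixes=("_cen", "_ehr"))
    total = np.zeros(len(pairs))
    for field, m, u in [("first_name", 0.95, 0.01),
                        ("last_name", 0.95, 0.005),
                        ("dob", 0.98, 0.001)]:
        total += agreement_weight(pairs[f"{field}_cen"], pairs[f"{field}_ehr"], m, u)
    pairs["match_weight"] = total
    return pairs.sort_values("match_weight", ascending=False)
```

The blocking step is what the thousands/millions/billions comparison really stresses: the number of candidate pairs, not the raw record count, drives the compute budget.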
Who is involved in the project? Which of these people will have direct access to the pseudopeople input data?
This project is being run by Prof. Josephine Bloggs (Nonesuch University, Texas) in collaboration with Drs Sue Denim and Dee Plume at the Centers for Disease Control (Atlanta). Drs Denim and Plume are experts in electronic health record (EHR) data, and will advise Prof. Bloggs on how to adapt pseudopeople's existing simulated data to be appropriately similar to the sort of noisy real-world EHR data which might be linked to real census data in future work, based on the results of this project. As a consequence, all of them will have direct access to the pseudopeople input data and to the EHR-like records derived from it.
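To make the "adapt pseudopeople's existing simulated data" step concrete, here is a rough sketch of the kind of post-processing we mean, run against the sample population that ships with pseudopeople. The generate_decennial_census() entry point is pseudopeople's documented API, but the column names (middle_initial, date_of_birth, last_name) and the corruption rates are assumptions for illustration, not a finalized protocol.

```python
# Illustrative sketch: derive an "EHR-like" noisy extract from pseudopeople output.
# Column names and corruption rates are assumptions, not a finalized protocol.
import numpy as np
import pandas as pd
import pseudopeople as psp

rng = np.random.default_rng(12345)

census = psp.generate_decennial_census()  # sample population; full US needs the requested data
ehr_like = census.copy()

# EHR extracts rarely carry every census field: drop middle initial entirely
# and blank out roughly 10% of dates of birth.
ehr_like = ehr_like.drop(columns=["middle_initial"], errors="ignore")
dob_mask = rng.random(len(ehr_like)) < 0.10
ehr_like.loc[dob_mask, "date_of_birth"] = pd.NA

# Introduce simple typos in roughly 5% of last names (swap two adjacent characters).
def swap_adjacent_chars(name):
    if not isinstance(name, str) or len(name) < 2:
        return name
    i = int(rng.integers(0, len(name) - 1))
    chars = list(name)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

typo_mask = rng.random(len(ehr_like)) < 0.05
ehr_like.loc[typo_mask, "last_name"] = ehr_like.loc[typo_mask, "last_name"].map(swap_adjacent_chars)
```

pseudopeople also exposes its own configurable noise, so some of this post-processing may turn out to be achievable by configuration rather than code.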
What funding is the project under? What expectations with respect to open access and access to data come with that funding?
Our project is funded by the National Institutes of Health, for whom we have written a Data Management and Sharing Plan. Essentially, this states that we have an obligation to share the final dataset used for the analysis. This is not the same as sharing the pseudopeople data or the healthcare data; rather, it is only those variables and rows from the merged dataset that are used in the final analysis.
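For clarity about what "the final dataset used for the analysis" means operationally, here is a minimal sketch of the extract we would deposit; the column names are hypothetical placeholders for whatever variables the final analysis actually uses.

```python
# Sketch of the shareable analysis extract; column names are hypothetical placeholders.
import pandas as pd

ANALYSIS_COLUMNS = ["pair_id", "blocking_key", "match_weight", "predicted_match", "runtime_seconds"]

def make_shareable_extract(merged: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows and variables used in the final analysis; raw pseudopeople
    fields and simulated-EHR identifiers are not retained in the deposited file."""
    used = merged.loc[merged["used_in_final_analysis"], ANALYSIS_COLUMNS]
    return used.reset_index(drop=True)
```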
We commit to:
be responsive to further questions from interested parties
deprecate and replace our version of the pseudopeople input data when a new version is released
What data would you like to request?
Full US
Rhode Island
Other (may not be available immediately)
Other data - more explanation
No response
In this hypothetical example, the one thing I think we should edit is the explanation of who will have direct access to the data. Instead of positing that "Drs Denim and Plume are tasked with preparing the public health data, which is then linked by Prof. Bloggs", let's make them advisors who don't directly access the public health data, either. (Because how would they prepare the public health data to link to simulated census data without knowing about all of the simulated people in the census data?)
So "Drs Denim and Plume are experts in electronic health record (EHR) data, and will advise Prof. Bloggs on how to adapt pseudopeople's existing simulated data to be appropriately similar to the sort of noisy real-world EHR data which might be linked to real census data in future work, based on the results of this project."
aflaxman changed the title from "[Data access request]: MedLink project" to "[EXAMPLE of a Data access request]: (Hypothetical) MedLink project --- see this for inspiration if you are requesting data" on Oct 31, 2023.