Commit 3630095

Add more info to README, add csv files.
1 parent 57c2426 commit 3630095

3 files changed: +55589 -0 lines changed

README.md

+21
@@ -37,15 +37,36 @@ Our hope is that the community will get involved with curation of the dataset pr
Suggested improvements should come in via pull requests, where each pull request provides proposed modifications (including potentially supporting tools/scripts, data, references, or links to the same) and a clear explanation of these changes.
Thus, over time the current, curated database is expected to move away from simply reflecting the contents of the Excel spreadsheet and become more valuable.

Some specific points of curation which will be needed include:
- Separation of different types of data; for example, the main tab in the database Excel spreadsheet (and the data in `guthrie_database.csv`) contains not just hydration free energies but other properties with other units, e.g. the entries for phenol include values reported in mg/L, g/m^3, etc.
- Unit handling; values are present in both kJ/mol and kcal/mol
- Checking of molecule names against SMILES and stereochemistry; I (DLM) previously gave Peter some tools to help with this, but I do not know if he has used them

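As an illustration of the unit-handling point above, a conversion helper might look like the following. This is only a sketch: the function name `to_kcal_per_mol` and the unit strings it accepts are assumptions rather than part of this repository, and how units are actually recorded in `guthrie_database.csv` would need to be checked first. It uses the thermochemical calorie (1 kcal = 4.184 kJ).

```python
# Hypothetical helper, not part of this repository.
KJ_PER_KCAL = 4.184  # thermochemical calorie, exact by definition


def to_kcal_per_mol(value, unit):
    """Convert an energy given in kJ/mol or kcal/mol to kcal/mol."""
    unit = unit.strip().lower()
    if unit == 'kcal/mol':
        return float(value)
    if unit == 'kj/mol':
        return float(value) / KJ_PER_KCAL
    # Other properties (e.g. mg/L, g/m^3) are not energies; refuse to guess.
    raise ValueError('not an energy unit: %r' % unit)
```

Raising an error for anything that is not an energy unit keeps non-energy properties from being silently mixed in with the hydration free energies.
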
## Manifest
- `GuthrieDatabase_April14.zip`: Guthrie database (Excel spreadsheet) as it was provided
- `guthrie_database.csv`: CSV export of the main tab of the Excel spreadsheet
- `guthrie_references_and_status.csv`: Additional tab of the Excel spreadsheet, which provides definitions of the references and reports on Peter's progress in extracting data from those references; may highlight other areas where more data is still available

There is also data/curation work in an additional tab of the spreadsheet, Sheet 2, which may be useful but is not yet present here as a separate file.

## Using the dataset
The data set can be loaded easily in Python using `pandas`, for example:
```python
import pandas
# Read the exported main tab; the file is latin1-encoded
db = pandas.read_csv('guthrie_database.csv', encoding='latin1')
# Keep only the rows whose Name column is exactly 'phenol'
data = db[db.Name=='phenol']
```
This loads the database and extracts all data for entries with the molecule name phenol.
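
If `pandas` is not available, a similar selection can be sketched with the standard library's `csv` module. The helper name `rows_named` is hypothetical, and unlike the exact comparison in the `pandas` example it ignores case and surrounding whitespace, on the assumption that names in the spreadsheet may not be perfectly uniform. The only column it relies on is `Name`, as used in the `pandas` example.

```python
import csv


def rows_named(csv_file, name):
    """Return rows (as dicts) whose 'Name' field matches `name`,
    ignoring case and surrounding whitespace."""
    wanted = name.strip().lower()
    reader = csv.DictReader(csv_file)
    return [row for row in reader
            if (row.get('Name') or '').strip().lower() == wanted]


# Usage, with the same file name and encoding as the pandas example:
# with open('guthrie_database.csv', encoding='latin1', newline='') as f:
#     phenol_rows = rows_named(f, 'phenol')
```

All values come back as strings, so numeric fields would still need to be converted before use.
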

## Authors
### Primary author
- J. Peter Guthrie (University of Western Ontario)

### Other contributors
- David L. Mobley, UC Irvine, who maintains this repository
- Probably students and others who worked with Dr. Guthrie over the years, but I (DLM) do not have their information

## Acknowledgments
- James Guthrie, who made this data available and gave permission to post it publicly; he does not want any credit for this, but he should certainly be acknowledged.
