README.md
@@ -37,15 +37,36 @@ Our hope is that the community will get involved with curation of the dataset pr
Suggested improvements should come in via pull requests, where each pull request provides proposed modifications (including potentially supporting tools/scripts, data, references, or links to the same) and a clear explanation of these changes.
Thus, over time the current, curated database is expected to move away from simply reflecting the contents of the Excel spreadsheet and become more valuable.

Some specific points of curation that will be needed include:
- Separation of different types of data: the main tab in the database Excel spreadsheet (and the data in `guthrie_database.csv`) contains not just hydration free energies but also other properties with other units. For example, the entries for phenol include values reported in mg/L, g/m^3, etc.
- Unit handling; values are present in both kJ/mol and kcal/mol
- Checking of molecule names against SMILES and stereochemistry; I (DLM) previously gave Peter some tools to help with this, but I do not know whether he has used them
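
As a starting point for the unit-handling item above, a conversion helper might look like the following minimal sketch. It assumes unit labels are spelled `kJ/mol` and `kcal/mol` (the exact spelling and column layout in `guthrie_database.csv` have not been checked here), and the function name `to_kcal_per_mol` is hypothetical. The only fixed fact it relies on is the thermochemical definition 1 kcal = 4.184 kJ.

```python
KJ_PER_KCAL = 4.184  # thermochemical calorie: 1 kcal = 4.184 kJ


def to_kcal_per_mol(value, unit):
    """Convert an energy given in kJ/mol or kcal/mol to kcal/mol.

    The unit spellings accepted here are an assumption, not taken
    from the database itself.
    """
    key = unit.strip().lower()
    if key == "kcal/mol":
        return float(value)
    if key == "kj/mol":
        return float(value) / KJ_PER_KCAL
    # Anything else (e.g. mg/L) is not an energy and is rejected
    raise ValueError(f"unsupported unit: {unit!r}")
```

Values in other units, such as the mg/L entries mentioned above, raise an error rather than being silently converted, so they can be separated out first.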

## Manifest
- `GuthrieDatabase_April14.zip`: Guthrie database (Excel spreadsheet) as it was provided
- `guthrie_database.csv`: Exported CSV file of the main tab of the Excel spreadsheet
- `guthrie_references_and_status.csv`: Additional tab of the Excel spreadsheet, which defines the references and reports on Peter's progress in extracting data from them; it may highlight other areas where more data is still available

There is also data/curation work in an additional tab of the spreadsheet, Sheet 2, which may be useful but is not yet present here as a separate file.

## Using the dataset

The dataset can be loaded easily in Python using `pandas`, for example:
```python
import pandas

# Load the exported main tab; the file is read as latin1
db = pandas.read_csv('guthrie_database.csv', encoding='latin1')
# Select every row whose Name column is exactly 'phenol'
data = db[db.Name == 'phenol']
```
This loads the database and extracts all entries for the molecule named phenol.
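
Where `pandas` is not available, roughly the same lookup can be sketched with the standard library `csv` module. This assumes the first row of the CSV is a header containing a `Name` column, as the `pandas` example implies; the helper name `rows_by_name` is hypothetical. It also matches names case-insensitively and ignores surrounding whitespace, which is a choice made for this sketch rather than something the dataset is known to require.

```python
import csv


def rows_by_name(path, name, encoding="latin1"):
    """Return all rows (as dicts) whose Name column matches `name`.

    Matching ignores case and leading/trailing whitespace.
    """
    wanted = name.strip().lower()
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle)  # uses the header row for keys
        return [
            row for row in reader
            if (row.get("Name") or "").strip().lower() == wanted
        ]
```

For example, `rows_by_name('guthrie_database.csv', 'phenol')` would return a list of dictionaries, one per matching row, with every value left as a string.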

## Authors
### Primary author
- J. Peter Guthrie (University of Western Ontario)

### Other contributors
- David L. Mobley, UC Irvine, who maintains this repository
- Probably students and others who worked with Dr. Guthrie over the years, but I (DLM) do not have their information

## Acknowledgments
- James Guthrie, who made this data available and gave permission to post it publicly; he does not want any credit for this, but he should certainly be acknowledged.