Add pandas helper functions for get_stat_* #144

Merged 37 commits on Aug 26, 2020
Changes from 23 commits
e05e404
Add pandas wrapper function for time series data frame.
tjann Aug 23, 2020
2f070fd
Save work so far on pandas.
tjann Aug 24, 2020
b1feeca
Minor edits.
tjann Aug 24, 2020
516033b
Add function for creating covariate pandas df.
tjann Aug 24, 2020
529aeb3
Add latest date sorting to covariate as well. Add test for covariate …
tjann Aug 24, 2020
a7868e2
stat_vars_test: make response and expected response strings consisten…
tjann Aug 24, 2020
ea3c2ff
Add an example for covariate_pd_input
tjann Aug 24, 2020
ab3f755
Make stat_var examples quoting consistent.
tjann Aug 24, 2020
e72ae4a
Create dcpandas module that uses pandas natively.
tjann Aug 24, 2020
32a0284
Do the python release in another PR.
tjann Aug 24, 2020
160eee6
Remove stale refs in datacommons library to pandas features.
tjann Aug 24, 2020
771b0a5
Update pandas readme.
tjann Aug 24, 2020
5a86466
Cleanup format.
tjann Aug 24, 2020
5780970
Remove pd-related mocks from python testing.
tjann Aug 24, 2020
cb83487
Cosmetics.
tjann Aug 24, 2020
85b3a9b
Update docstring
tjann Aug 24, 2020
4044c03
Fix import statement for pip. Always sort time series df columns.
tjann Aug 24, 2020
d6290be
Restore pandas setup to prepare for release.
tjann Aug 24, 2020
4bba808
change _group_stat_all_by_obs_options mode parameter to time_series b…
tjann Aug 24, 2020
c81eaa6
Address some documentation suggestions from cyin.
tjann Aug 24, 2020
d3a618d
Fix bug from reassigning parameter time_series value in _group_stat_a…
tjann Aug 24, 2020
a4bcf4e
Make df_builder examples more readable.
tjann Aug 24, 2020
a2202c0
Update the docstrings for both PyPI release setup*.py files. Change d…
tjann Aug 24, 2020
0ebb20f
Rename time_series parameter to keep_series for _group_stat_all_by_ob…
tjann Aug 24, 2020
efd2e0c
dcpandas to datacommons_pandas, including all datacommons functions
tjann Aug 25, 2020
f645f3f
Fix various docstrings.
tjann Aug 25, 2020
1ea347d
Merge branch 'master' into pandas-funcs
tjann Aug 25, 2020
5454a93
Merge branch 'pandas-funcs' of github.com:tjann/api-python into panda…
tjann Aug 25, 2020
18cb93e
Add optional args to pandas lib build_time_series to pass onto python…
tjann Aug 25, 2020
51582c8
Update docstrings for time series funcs.
tjann Aug 25, 2020
a49a095
Remove will from CHANGELOG.
tjann Aug 25, 2020
7f46fdd
Reference TODO for cloudbuild pandas-python sync check. Update change…
tjann Aug 25, 2020
87b13ec
Rename covariate* to multivariate*, address cyin's comments on df_bui…
tjann Aug 25, 2020
f116e42
Update docstring for _group_stat_all_by_obs_options.
tjann Aug 25, 2020
975c956
Make err msg for _group_stat_all_by_obs_options no data more general.
tjann Aug 25, 2020
9865e09
Parameterize some pandas lib example functions.
tjann Aug 26, 2020
14dea40
Released pandas.
tjann Aug 26, 2020
4 changes: 2 additions & 2 deletions README.md
@@ -21,9 +21,9 @@ understanding API usage.
For more detail on getting started with the API, please visit our
[API Overview](http://docs.datacommons.org/api/).

After you're ready to use the API, you can refer to `datacommons/examples` for
When you are ready to use the API, you can refer to `datacommons/examples` for
examples on how to use this package to perform various tasks. More tutorials and
documentation can be found at [tutorials](https://datacommons.org/colab)!
documentation can be found on our [tutorials page](https://datacommons.org/colab)!

## About Data Commons

20 changes: 10 additions & 10 deletions datacommons/examples/stat_vars.py
@@ -11,7 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Basic examples for StatisticalVariable-based param_set Commons API functions."""
"""Basic examples for StatisticalVariable-based param_set Data Commons API functions."""

from __future__ import absolute_import
from __future__ import division
@@ -25,16 +25,16 @@ def main():
param_sets = [
{
'place': 'geoId/06085',
'stat_var': 'Count_Person',
'stat_var': "Count_Person",
},
{
'place': 'geoId/06085',
'stat_var': 'Count_Person',
'stat_var': "Count_Person",
'date': '2018',
},
{
'place': 'geoId/06085',
'stat_var': 'Count_Person',
'stat_var': "Count_Person",
'date': '2018',
'measurement_method': 'CensusACS5yrSurvey',
},
@@ -111,20 +111,20 @@ def call_str(pvs):

pp = pprint.PrettyPrinter(indent=4)
print(
"\nget_stat_all(['geoId/06085', 'country/FRA'], ['Median_Age_Person', 'Count_Person'])"
'\nget_stat_all(["geoId/06085", "country/FRA"], ["Median_Age_Person", "Count_Person"])'
)
print('>>> ')
pp.pprint(
dc.get_stat_all(['geoId/06085', 'country/FRA'],
['Median_Age_Person', 'Count_Person']))
dc.get_stat_all(["geoId/06085", "country/FRA"],
["Median_Age_Person", "Count_Person"]))

print(
"\nget_stat_all(['badPlaceId', 'country/FRA'], ['Median_Age_Person', 'Count_Person'])"
'\nget_stat_all(["badPlaceId", "country/FRA"], ["Median_Age_Person", "Count_Person"])'
)
print('>>> ')
pp.pprint(
dc.get_stat_all(['badPlaceId', 'country/FRA'],
['Median_Age_Person', 'Count_Person']))
dc.get_stat_all(["badPlaceId", "country/FRA"],
["Median_Age_Person", "Count_Person"]))


if __name__ == '__main__':
102 changes: 52 additions & 50 deletions datacommons/stat_vars.py
@@ -20,13 +20,8 @@
from __future__ import division
from __future__ import print_function

from datacommons.utils import _API_ROOT, _API_ENDPOINTS, _ENV_VAR_API_KEY

import collections
import json
import os
import six.moves.urllib.error
import six.moves.urllib.request
import six

import datacommons.utils as utils

@@ -148,55 +143,62 @@ def get_stat_all(places, stat_vars):
>>> get_stat_all(["geoId/05", "geoId/06"], ["Count_Person", "Count_Person_Male"])
{
"geoId/05": {
"Count_Person": [
{
"val": {
"2010": 1633,
"2011": 1509,
"2012": 1581,
},
"observationPeriod": "P1Y",
"importName": "Wikidata",
"provenanceDomain": "wikidata.org"
},
{
"val": {
"2010": 1333,
"2011": 1309,
"2012": 131,
"Count_Person": {
"sourceSeries": [
{
"val": {
"2010": 1633,
"2011": 1509,
"2012": 1581,
},
"observationPeriod": "P1Y",
"importName": "Wikidata",
"provenanceDomain": "wikidata.org"
},
"observationPeriod": "P1Y",
"importName": "CensusPEPSurvey",
"provenanceDomain": "census.gov"
{
"val": {
"2010": 1333,
"2011": 1309,
"2012": 131,
},
"observationPeriod": "P1Y",
"importName": "CensusPEPSurvey",
"provenanceDomain": "census.gov"
}
],
}
],
"Count_Person_Male": [
{
"val": {
"2010": 1633,
"2011": 1509,
"2012": 1581,
},
"observationPeriod": "P1Y",
"importName": "CensusPEPSurvey",
"provenanceDomain": "census.gov"
}
],
},
"Count_Person_Male": {
"sourceSeries": [
{
"val": {
"2010": 1633,
"2011": 1509,
"2012": 1581,
},
"observationPeriod": "P1Y",
"importName": "CensusPEPSurvey",
"provenanceDomain": "census.gov"
}
],
}
},
"geoId/02": {
"Count_Person": [],
"Count_Person_Male": [
{
"val": {
"2010": 13,
"2011": 13,
"2012": 322,
},
"observationPeriod": "P1Y",
"importName": "CensusPEPSurvey",
"provenanceDomain": "census.gov"
"Count_Person": {},
"Count_Person_Male": {
"sourceSeries": [
{
"val": {
"2010": 13,
"2011": 13,
"2012": 322,
},
"observationPeriod": "P1Y",
"importName": "CensusPEPSurvey",
"provenanceDomain": "census.gov"
}
]
}
],
}
}
"""
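
The docstring change above nests each StatisticalVariable's series under a `sourceSeries` key and shows a variable with no data as an empty dict. A minimal sketch of walking that shape follows. The helper name `iter_source_series` is hypothetical and not part of the library, and the sample values are copied from the docstring example.

```python
def iter_source_series(stat_all):
    """Yield (place, stat_var, series) for each source series in a
    get_stat_all-style response. Hypothetical helper, not library API."""
    for place, stat_vars in stat_all.items():
        for stat_var, data in stat_vars.items():
            # A stat var with no data is an empty dict, so default to [].
            for series in data.get("sourceSeries", []):
                yield place, stat_var, series


# Sample data shaped like the updated docstring example.
response = {
    "geoId/05": {
        "Count_Person": {
            "sourceSeries": [
                {"val": {"2010": 1633, "2011": 1509, "2012": 1581},
                 "observationPeriod": "P1Y",
                 "importName": "Wikidata",
                 "provenanceDomain": "wikidata.org"},
                {"val": {"2010": 1333, "2011": 1309, "2012": 131},
                 "observationPeriod": "P1Y",
                 "importName": "CensusPEPSurvey",
                 "provenanceDomain": "census.gov"},
            ],
        },
    },
    "geoId/02": {
        "Count_Person": {},
    },
}

# One row per source series: place, stat var, import name, latest date key.
rows = [(place, sv, s["importName"], max(s["val"]))
        for place, sv, s in iter_source_series(response)]
```

Using `data.get("sourceSeries", [])` means the empty `"Count_Person": {}` entry for `geoId/02` contributes no rows instead of raising a `KeyError`.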
70 changes: 39 additions & 31 deletions datacommons/test/stat_vars_test.py
@@ -29,34 +29,42 @@
import datacommons.utils as utils
import json
import unittest
import six
import six.moves.urllib as urllib

# Reusable parts of REST API /stat/all response.
CA_COUNT_PERSON = {
"isDcAggregate":
"true",
"sourceSeries": [
{
"val": {
"1990": 23640,
"1991": 24100,
"1992": 25090,
},
"observationPeriod": "P1Y",
"importName": "WorldDevelopmentIndicators",
"provenanceDomain": "worldbank.org"
"sourceSeries": [{
"val": {
"1990": 23640,
"1991": 24100,
"1993": 25090,
},
{
"val": {
"1790": 3929214,
"1800": 5308483,
"1810": 7239881,
},
"measurementMethod": "WikidataPopulation",
"importName": "WikidataPopulation",
"provenanceDomain": "wikidata.org"
"observationPeriod": "P1Y",
"importName": "WorldDevelopmentIndicators",
"provenanceDomain": "worldbank.org"
}, {
"val": {
"1790": 3929214,
"1800": 5308483,
"1810": 7239881,
},
"measurementMethod": "WikidataPopulation",
"importName": "WikidataPopulation",
"provenanceDomain": "wikidata.org"
}, {
"val": {
"1890": 28360,
"1891": 24910,
"1892": 25070,
},
]
"measurementMethod": "OECDRegionalStatistics",
"observationPeriod": "P1Y",
"importName": "OECDRegionalDemography",
"provenanceDomain": "oecd.org"
}]
}

CA_COUNT_PERSON_MALE = {
@@ -100,7 +108,7 @@
}]
}

HU22_MEDIAN_AGE_PERSON = {
CA_MEDIAN_AGE_PERSON = {
"sourceSeries": [{
"val": {
"1990": 12,
@@ -138,37 +146,37 @@ def read(self):
if req.get_full_url(
) == stat_value_url_base + '?place=geoId/06&stat_var=Count_Person':
# Response returned when querying with basic args.
return MockResponse(json.dumps({'value': 123}))
return MockResponse(json.dumps({"value": 123}))
if req.get_full_url(
) == stat_value_url_base + '?place=geoId/06&stat_var=Count_Person&date=2010':
# Response returned when querying with observationDate.
return MockResponse(json.dumps({'value': 133}))
return MockResponse(json.dumps({"value": 133}))
if (req.get_full_url() == stat_value_url_base +
'?place=geoId/06&stat_var=Count_Person&' +
'date=2010&measurement_method=CensusPEPSurvey&' +
'observation_period=P1Y&unit=RealPeople&scaling_factor=100'):
# Response returned when querying with above optional params.
return MockResponse(json.dumps({'value': 103}))
return MockResponse(json.dumps({"value": 103}))

# Mock responses for urlopen requests to get_stat_series.
if req.get_full_url(
) == stat_series_url_base + '?place=geoId/06&stat_var=Count_Person':
# Response returned when querying with basic args.
return MockResponse(json.dumps({'series': {'2000': 1, '2001': 2}}))
return MockResponse(json.dumps({"series": {"2000": 1, "2001": 2}}))
if (req.get_full_url() == stat_series_url_base +
'?place=geoId/06&stat_var=Count_Person&' +
'measurement_method=CensusPEPSurvey&observation_period=P1Y&' +
'unit=RealPeople&scaling_factor=100'):

# Response returned when querying with above optional params.
return MockResponse(json.dumps({'series': {'2000': 3, '2001': 42}}))
return MockResponse(json.dumps({"series": {"2000": 3, "2001": 42}}))
if (req.get_full_url() == stat_series_url_base +
'?place=geoId/06&stat_var=Count_Person&' +
'measurement_method=DNE'):

# Response returned when data not available for optional parameters.
# /stat/series?place=geoId/06&stat_var=Count_Person&measurement_method=DNE
return MockResponse(json.dumps({'series': {}}))
return MockResponse(json.dumps({"series": {}}))

# Mock responses for urlopen requests to get_stat_all.
if req.get_full_url() == stat_all_url_base:
@@ -204,7 +212,7 @@ def read(self):
"geoId/06": {
"statVarData": {
"Count_Person": CA_COUNT_PERSON,
"Median_Age_Person": HU22_MEDIAN_AGE_PERSON
"Median_Age_Person": CA_MEDIAN_AGE_PERSON
}
},
"nuts/HU22": {
@@ -274,7 +282,7 @@ def test_basic(self, urlopen):
"""Calling get_stat_series with minimal and proper args."""
# Call get_stat_series
stats = dc.get_stat_series('geoId/06', 'Count_Person')
self.assertEqual(stats, {'2000': 1, '2001': 2})
self.assertEqual(stats, {"2000": 1, "2001": 2})

@patch('six.moves.urllib.request.urlopen', side_effect=request_mock)
def test_opt_args(self, urlopen):
@@ -283,7 +291,7 @@ def test_opt_args(self, urlopen):
# Call get_stat_series with all optional args
stats = dc.get_stat_series('geoId/06', 'Count_Person',
'CensusPEPSurvey', 'P1Y', 'RealPeople', 100)
self.assertEqual(stats, {'2000': 3, '2001': 42})
self.assertEqual(stats, {"2000": 3, "2001": 42})

# Call get_stat_series with non-satisfiable optional args
stats = dc.get_stat_series('geoId/06', 'Count_Person', 'DNE')
@@ -316,7 +324,7 @@ def test_basic(self, urlopen):
exp = {
"geoId/06": {
"Count_Person": CA_COUNT_PERSON,
"Median_Age_Person": HU22_MEDIAN_AGE_PERSON
"Median_Age_Person": CA_MEDIAN_AGE_PERSON
},
"nuts/HU22": {
"Count_Person": HU22_COUNT_PERSON,
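
The tests above route `urlopen` through a `request_mock` function that returns canned JSON per URL. A self-contained sketch of that pattern follows. It makes several assumptions: it patches the standard-library `urllib.request.urlopen` rather than `six.moves.urllib.request.urlopen`, the URL is a placeholder, and `fetch_series` is a hypothetical stand-in for the library call under test.

```python
import json
import urllib.request
from unittest import mock

# Placeholder URL for illustration only; not the real API root.
SERIES_URL = ("https://example.invalid/stat/series"
              "?place=geoId/06&stat_var=Count_Person")


class MockResponse:
    """Minimal stand-in for the object returned by urlopen."""

    def __init__(self, json_data):
        self.json_data = json_data

    def read(self):
        return self.json_data


def request_mock(req, *args, **kwargs):
    """Return a canned response for known URLs, fail loudly otherwise."""
    if req.get_full_url() == SERIES_URL:
        return MockResponse(json.dumps({"series": {"2000": 1, "2001": 2}}))
    raise ValueError("Unexpected URL: " + req.get_full_url())


def fetch_series(url):
    """Hypothetical client function standing in for the code under test."""
    req = urllib.request.Request(url)
    return json.loads(urllib.request.urlopen(req).read())["series"]


# side_effect makes the patched urlopen delegate every call to request_mock.
with mock.patch("urllib.request.urlopen", side_effect=request_mock):
    stats = fetch_series(SERIES_URL)
```

Raising on unknown URLs, rather than returning an empty response, makes a mistyped query string show up as a test error instead of a misleading empty result.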
21 changes: 21 additions & 0 deletions dcpandas/CHANGELOG.md
@@ -0,0 +1,21 @@
# Changelog

## 0.0.1

**Date** - 08/24/2020

**Release Tag** - [pd.0.0.1](https://github.com/datacommonsorg/api-python/releases/tag/pd0.0.1)

**Release Status** - Current head of branch [`master`](https://github.com/datacommonsorg/api-python/tree/master)

Added pandas wrapper functions.

- `build_time_series` constructs a pd.Series for a given StatisticalVariable and Place, with dates as the index of the time series.
- `build_time_series_dataframe` constructs a pd.DataFrame for a given StatisticalVariable and a set of Places. The DataFrame has Places as the index and dates as the columns.
- `build_covariate_dataframe` constructs a covariate pd.DataFrame for a set of StatisticalVariables and a set of Places. The DataFrame has Places as the index and StatisticalVariables as the columns. The values are the most recent values for the chosen StatVarObservation options.
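
As a rough illustration of the shapes described in the list above (not the library's implementation), the sketch below builds a date-indexed `pd.Series` and a Places-by-dates `pd.DataFrame` from plain dictionaries. The dictionaries are made-up sample data shaped like `get_stat_series` results.

```python
import pandas as pd

# Made-up series keyed by date, one dict per place.
series_by_place = {
    "geoId/06": {"2001": 2, "2000": 1},
    "geoId/05": {"2000": 3, "2002": 5},
}

# One place, one variable: dates become the index, sorted chronologically.
ts = pd.Series(series_by_place["geoId/06"]).sort_index()

# Many places, one variable: places as the index, dates as sorted columns.
df = pd.DataFrame.from_dict(series_by_place, orient="index")
df = df.reindex(sorted(df.columns), axis=1)
```

Dates missing for a place, such as `"2001"` for `geoId/05`, appear as `NaN` in the DataFrame.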

For multi-place functions, when a StatisticalVariable has multiple StatVarObservation options,
Data Commons chooses the set of StatVarObservation options that covers the most places. This
ensures that the data fetched for a StatisticalVariable is comparable across places.
When there is a tie, we select the set of StatVarObservation options whose latest
available date, for any place, is the most recent.
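
The selection rule described above can be sketched in plain Python. This is one illustrative reading of the paragraph, not the library's code: `pick_options` and the `(place, options, latest_date)` tuples are hypothetical, and `options` stands for whatever identifies a set of StatVarObservation options (here, an import name).

```python
from collections import defaultdict


def pick_options(observations):
    """Pick the options set covering the most places, breaking ties by
    the latest date available for any place. Hypothetical helper."""
    places = defaultdict(set)
    latest = {}
    for place, options, latest_date in observations:
        places[options].add(place)
        # Same-format date strings (e.g. "2012") compare correctly as text.
        latest[options] = max(latest.get(options, ""), latest_date)
    # Rank by coverage first, then by most recent date.
    return max(places, key=lambda o: (len(places[o]), latest[o]))


# Made-up sample: two options sets tie on coverage, one has newer data.
observations = [
    ("geoId/05", "CensusPEPSurvey", "2012"),
    ("geoId/06", "CensusPEPSurvey", "2012"),
    ("geoId/05", "Wikidata", "2012"),
    ("geoId/06", "Wikidata", "2014"),
    ("geoId/02", "OtherSource", "2019"),
]
chosen = pick_options(observations)
```

In this sample both `CensusPEPSurvey` and `Wikidata` cover two places, so the tie goes to `Wikidata`, whose latest date is newer. `OtherSource` has the newest data overall but covers only one place, so it is not chosen.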