Keeping the country list up-to-date #44

Open
korayal opened this issue Mar 6, 2025 · 1 comment

Comments


korayal commented Mar 6, 2025

Hi, I just tried using this library to check against the ISO-3166 codes supported by the libphonenumber library, which uses the C bindings of Google's libphonenumber.

It appears that two regions are in that list but not in this one. Both seem to be subdivisions of SH, but they also have their own codes:

  • AC - Ascension Island
  • TA - Tristan da Cunha

Since we have the temporarily assigned XK added to this list, should we also include these two above?
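A check like this can be sketched as a set difference over alpha-2 codes. The two sets below are small illustrative samples, not the real lists from either library, and `missing_codes` is a hypothetical helper name.

```python
def missing_codes(reference, candidate):
    """Return the codes present in `reference` but absent from `candidate`, sorted."""
    return sorted(set(reference) - set(candidate))

# Illustrative samples only (assumption): a few regions a phone-number library
# might support, and a few codes a country library might know about.
phone_regions = {"AC", "SH", "TA", "US", "XK"}
country_codes = {"SH", "US", "XK"}

print(missing_codes(phone_regions, country_codes))  # ['AC', 'TA']
```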

--
on a side note,

The countries.csv file in the repository has not been updated in the last 8 years, and the list of supported regions for the libphonenumber library hasn't changed in the last 4 years either. Still, is it possible to get some details on what conditions a change to the CSV needs to meet, and on how that CSV is generated?

You probably know about it already, but there is a repository called datasets/country-codes which contains a list of countries with a lot of details at https://github.com/datasets/country-codes/blob/main/data/country-codes.csv
(it is also missing the ISO codes for AC and TA, as well as XK, which the current CSV includes but which the datasets project declined to include for now: https://github.com/datasets/country-codes/issues/66)

I tried to re-generate it with some python code via that source:

import pandas as pd
cs = pd.read_csv('https://github.com/byteverse/country/raw/refs/heads/main/countries.csv', dtype=str)
csn = pd.read_csv('https://github.com/datasets/country-codes/raw/refs/heads/main/data/country-codes.csv', dtype=str).rename(columns={
    "official_name_en": "name", 
    "ISO3166-1-Alpha-2":"alpha-2", 
    "ISO3166-1-Alpha-3":"alpha-3",
    "ISO3166-1-numeric":"country-code",
    "Region Name": "region",
    "Sub-region Name": "sub-region",
    "Region Code": "region-code",
    "Sub-region Code": "sub-region-code",
})

def zeroPadDigits(s):
    # Pad numeric strings to three digits; leave missing values (NaN) untouched.
    if isinstance(s, str):
        return f"{int(s):03d}"
    else:
        return s

csn["iso_3166-2"] = csn["alpha-2"].apply(lambda s: f"ISO 3166-2:{s}")
csn["country-code"] = csn["country-code"].apply(zeroPadDigits)
csn["region-code"] = csn["region-code"].apply(zeroPadDigits)
csn["sub-region-code"] = csn["sub-region-code"].apply(zeroPadDigits)
csn = csn[["name", "alpha-2", "alpha-3","country-code","iso_3166-2","region","sub-region","region-code","sub-region-code"]]
csn.to_csv('~/Downloads/countries-new.csv', index=False)
columns_to_compare = pd.Index(['alpha-2', 'alpha-3', 'country-code'])
# Rows that appear in only one of the two files, judged on the three code columns.
print(pd.concat([cs[columns_to_compare], csn[columns_to_compare]]).drop_duplicates(keep=False))

(which also compares the two versions against three relevant codes)

The downside is that I see a lot of changes in the sub-region and sub-region-code columns. In any case, I'm attaching the generated CSV in case it's of any use.

countries-new.csv

@andrewthad
Member

Thank you for the thoughtful issue. I'll try to cover everything that you touched on, but let me know if I miss anything.

First, I'll provide a little historical context. My two primary motivations for writing this library, both related to work, are decoding (and normalizing) countries provided by firewall logs (PAN and Fortigate) and similarly decoding (and normalizing) countries listed in Geolite's data set. For this reason, there is much more emphasis on the decoding functions in this library than on the encoding functions. If I recall correctly, support for the XK country code was added because I started seeing it in Geolite's CSV file one day, and I needed to be able to handle it. As a more general guiding philosophy for this library, it makes sense to lean in the direction of supporting more country codes rather than fewer. If the decode functions in this library result in territorial distinctions that are not acceptable for someone's application, it's easy for that user to replace occurrences of one country with another. But if we go the other way and combine two country codes during decoding, there's no good way for a user to recover the original distinguished territories.
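To illustrate that asymmetry, here is a minimal Python sketch. It is not this library's API; the mapping and the function name are hypothetical. Collapsing distinguished codes after decoding is a simple lookup, whereas nothing comparable exists for splitting codes that were already merged.

```python
# Hypothetical application-level policy (assumption): report AC and TA as SH.
COLLAPSE = {"AC": "SH", "TA": "SH"}

def collapse_territory(alpha2):
    """Map a decoded alpha-2 code to the code the application wants to report."""
    return COLLAPSE.get(alpha2, alpha2)

decoded = ["AC", "SH", "TA", "US"]
print([collapse_territory(c) for c in decoded])  # ['SH', 'SH', 'SH', 'US']
```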

The countries.csv file in this repo is just something I found somewhere when I was writing the library. It's not special, and it might have things that are incorrect. It's probably better to use the one from the country-codes repo you linked as a source of truth. If you look in code-generation/app-countries/Main.hs at the continents functions, you'll see that the only current use of subregions is to recover the distinction between North America and South America. I don't recall why I needed to support extracting a continent from a country (it was something that work required), but there was some reason for it at the time. Anyway, the point is that it might not matter that the subregions have changed, because all of the generated code might end up the same.
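One way to check that before switching CSVs would be to derive the continent from both files and list only the countries where the result differs. The sketch below is in Python rather than Haskell; `derive_continent` is a hypothetical stand-in for the logic in Main.hs, its rule for South America is an assumption, and the sample rows are made up.

```python
def derive_continent(region, sub_region):
    # Assumed rule (not taken from Main.hs): the sub-region is consulted only
    # to split the Americas into North America and South America.
    if region == "Americas":
        return "South America" if sub_region == "South America" else "North America"
    return region

def continent_changes(old_rows, new_rows):
    """Return {alpha2: (old, new)} for codes whose derived continent differs.

    Each argument maps an alpha-2 code to a (region, sub-region) pair.
    Codes present in only one of the two inputs are ignored.
    """
    changes = {}
    for code in sorted(old_rows.keys() & new_rows.keys()):
        old = derive_continent(*old_rows[code])
        new = derive_continent(*new_rows[code])
        if old != new:
            changes[code] = (old, new)
    return changes

# Made-up sample rows for illustration: sub-region names differ between the
# two inputs, but the derived continent does not.
old_rows = {"FR": ("Europe", "Western Europe"), "CA": ("Americas", "Northern America")}
new_rows = {"FR": ("Europe", "West Europe"), "CA": ("Americas", "North America")}
print(continent_changes(old_rows, new_rows))  # {}
```

An empty result would suggest that the sub-region changes do not affect the generated continent code; any entries would need a closer look.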

Back to the original question about AC and TA, yes, I think it absolutely makes sense to add them. They need numeric codes as well. My cursory search online didn't turn up anything, but please PR a change if you have information for them.

There are no build instructions for this library, and you may have already figured this part out, but just in case:

cabal build -w ghc-9.6.6 all
./dist-newstyle/build/x86_64-linux/ghc-9.6.6/code-generation-0.1.0.0/x/country-code-generation/build/country-code-generation/country-code-generation

That will regenerate the Haskell source code based on the CSV and aliases.txt.
