Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LT_All.csv lines have one extra tab #1

Open
zaclittleberry opened this issue Sep 10, 2022 · 4 comments
Open

LT_All.csv lines have one extra tab #1

zaclittleberry opened this issue Sep 10, 2022 · 4 comments

Comments

@zaclittleberry
Copy link

Hey Josh. I'm trying to import the LT_All.csv into postgres using the COPY table("col1", "col2") FROM file.csv command in postgres, but it kept complaining about there being extra data in the rows than expected. I finally realized that there probably shouldn't be a tab after the last piece of data on a line: For example, it should be NOTES\n not NOTES\t\n.

I confirmed this was the case by throwing the csv into LibreOffice calc and re-saving it and looking at the file again (no extra tab, and no problem importing).

Do you think the way you are saving the csv could be modified to not have that extra tab?

Note: Unfortunately postgres doesn't allow not mapping a CSV column into Postgres with copy, so my options are to pre-process the file (either by re-saving, or programmatically) or to add a junk column to the end of my postgres table; neither of which are ideal, but are doable if we can't get LT_All.csv modified during write.

@zaclittleberry
Copy link
Author

alright. It's actually a bit more complicated than that. Even adding a "trailing" column to my sql table doesn't fix things. The row with this docket number "MJ-02301-LT-0000076-2020" still doesn't come out right (and some more after it). I think there's also something going on with row quoting.

One way I am determining this is by doing a diff of the csv directly from the repo, vs one I've opened in libreoffice calc, added a trailing column to, and resaved (which removes trailing tab as a major diff issue, but there's still more going on).

@pinkushn
Copy link
Owner

pinkushn commented Sep 12, 2022 via email

@zaclittleberry
Copy link
Author

Hey Josh! Looks like you made a quick fix to the tabs issue, Thanks!

I've been working on this too. I realized if I parsed the csv with a library that was a bit more forgiving and then re-wrote the data out to file it would create a normalized version of the csv. Using this method I've resolved the quotation marks issue with the file. I thought it was also fixing the extra tab/column issue, but seeing that you updated the file, I'm not sure if it was just your changes or if mine would have done that too.

Taking a quick look at the original csv and the normalized version in a diff viewer (such as meld), it looks like fields with quotes or commas should be surrounded in quotation marks, and quotes in-text get escaped with... a quotation mark? weird; I think this is configurable though.

For reference:

The script that does the full pipeline: https://gitlab.com/gazedev/node-typescript-api/-/blob/add-db-migrations/api/src/reset-download-normalize-and-import-csv.ts

The script where I normalize the csv: https://gitlab.com/gazedev/node-typescript-api/-/blob/add-db-migrations/api/src/normalize-csv.ts

The source and normalized CSVs (note. I'm not committed to keeping these up forever, so future humans, expect these links to be broken, but hopefully there's enough context here that if you've stumbled upon this, you can figure out whatever problem you're having): https://drive.google.com/drive/folders/1L_e3vw4leATZY3rzZKeNZY9ZMvziiEwP?usp=sharing

@pinkushn
Copy link
Owner

pinkushn commented Sep 16, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants