LT_All.csv lines have one extra tab #1

zaclittleberry · 2022-09-10T23:49:28Z

Hey Josh. I'm trying to import LT_All.csv into Postgres using the COPY table("col1", "col2") FROM file.csv command, but it kept complaining that the rows contained more data than expected. I finally realized that there probably shouldn't be a tab after the last piece of data on a line. For example, a line should end with NOTES\n, not NOTES\t\n.

I confirmed this was the case by throwing the csv into LibreOffice calc and re-saving it and looking at the file again (no extra tab, and no problem importing).

Do you think the way you are saving the csv could be modified to not have that extra tab?

Note: Unfortunately, Postgres's COPY doesn't let you skip a column that is present in the CSV, so my options are to pre-process the file (either by re-saving it or programmatically) or to add a junk column to the end of my Postgres table. Neither is ideal, but both are doable if we can't get LT_All.csv fixed at write time.
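
For anyone taking the programmatic pre-processing route, here is a minimal sketch of stripping the extra tab. It assumes the file is tab-delimited and that no quoted field contains an embedded newline (otherwise line-by-line processing is not safe). It also cannot tell a stray tab apart from a genuinely empty last column, so it only suits files where every line carries the extra tab. The function names and paths are made up for illustration.

```python
def strip_trailing_tab(line: str) -> str:
    """Remove one trailing tab that sits just before the line ending, if any."""
    body = line.rstrip("\r\n")          # the line without its terminator
    ending = line[len(body):]           # the original terminator ("\n", "\r\n", or "")
    if body.endswith("\t"):
        body = body[:-1]                # drop exactly one tab, not all of them
    return body + ending


def strip_trailing_tabs(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, removing the extra tab from every line."""
    # newline="" keeps the original line endings untouched on read and write.
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_trailing_tab(line))
```

Something like strip_trailing_tabs("LT_All.csv", "LT_All.clean.csv") would then produce a file that COPY can map column for column.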

zaclittleberry · 2022-09-11T01:30:43Z

Alright, it's actually a bit more complicated than that. Even adding a "trailing" column to my SQL table doesn't fix things. The row with docket number "MJ-02301-LT-0000076-2020" still doesn't come out right (nor do some rows after it). I think there's also something going on with row quoting.

One way I am determining this is by diffing the csv straight from the repo against one I've opened in LibreOffice Calc, added a trailing column to, and re-saved (which removes the trailing tab as a major source of diff noise, but there's still more going on).

pinkushn · 2022-09-12T01:09:22Z

Hey, Zac. Glad you're digging in, and sorry for the hitches. So far, we've only imported into spreadsheets (Excel and Google Sheets), and they are apparently more forgiving than Postgres!

With tonight's update to the csv files, I've removed the trailing tab from each line.

Regarding row quoting, it looks like we are getting tripped up by at least one typo in a court pdf where a quotation mark was used in place of a possessive apostrophe. Here's a snippet from the state's pdf for MJ-02301-LT-0000076-2020:

[image: image.png]

As I scan through the data, I see quotation marks in two general categories: 1) typos and 2) marks used to indicate nicknames or abbreviations.

Do you have a suggestion for how to handle quotation marks? Some possibilities: I could fully remove them, or 'escape' them with a preceding character (not sure what Postgres would want).

:) Josh
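
On the question of what Postgres would want: when COPY runs in CSV format, the default escape character is the quote character itself, so an embedded quotation mark is written twice inside a quoted field. Below is a minimal sketch of producing that form with Python's standard csv module. The row contents are invented for illustration, and the comma delimiter is an assumption; this is not the code that generates LT_All.csv.

```python
import csv
import io


def to_csv_line(fields, delimiter=","):
    """Serialize one row, quoting only where needed and doubling embedded quotes."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,   # quote only fields containing special characters
        lineterminator="\n",
    )
    writer.writerow(fields)
    return buf.getvalue()


# Hypothetical row: a quotation mark typed where an apostrophe was intended.
row = ["EXAMPLE-DOCKET", 'Tenant"s rent unpaid']
line = to_csv_line(row)
print(line, end="")  # EXAMPLE-DOCKET,"Tenant""s rent unpaid"

# Reading the line back recovers the original text, quotation mark included.
parsed = next(csv.reader(io.StringIO(line)))
```

Escaping this way keeps the text exactly as it appears in the court pdf, whereas removing the quotation marks would silently change the data.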

zaclittleberry · 2022-09-16T14:29:21Z

Hey Josh! Looks like you made a quick fix to the tabs issue. Thanks!

I've been working on this too. I realized that if I parsed the csv with a library that was a bit more forgiving and then wrote the data back out to a file, I would get a normalized version of the csv. Using this method, I've resolved the quotation marks issue with the file. I thought it was also fixing the extra tab/column issue, but seeing that you updated the file, I'm not sure whether it was just your changes or whether mine would have done that too.

Taking a quick look at the original csv and the normalized version in a diff viewer (such as meld), it looks like fields containing quotes or commas should be surrounded by quotation marks, and quotes within the text get escaped with... a quotation mark? Weird. I think this is configurable, though.
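
The parse-then-rewrite idea can be sketched in a few lines of Python. This is not the normalize-csv.ts script linked below, just an illustration under stated assumptions: the source is tab-delimited, no field contains an embedded tab or newline, and the output should be comma-delimited. The function name normalize is made up.

```python
import csv
import io


def normalize(raw_text, in_delimiter="\t", out_delimiter=","):
    """Re-serialize loosely formatted delimited text with standard CSV quoting."""
    # QUOTE_NONE makes the reader treat quotation marks as ordinary characters,
    # so a stray quote cannot swallow the rest of the row.
    reader = csv.reader(
        io.StringIO(raw_text, newline=""),
        delimiter=in_delimiter,
        quoting=csv.QUOTE_NONE,
    )
    out = io.StringIO()
    # QUOTE_MINIMAL wraps a field in quotes only if it contains the delimiter,
    # a quote, or a line break; embedded quotes are doubled by default.
    writer = csv.writer(
        out,
        delimiter=out_delimiter,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Hypothetical input row with quotation marks and a comma inside one field.
raw = 'EXAMPLE-DOCKET\tsaid "hi", then left\n'
print(normalize(raw), end="")  # EXAMPLE-DOCKET,"said ""hi"", then left"
```

The doubled quotation mark is the csv module's default. Passing doublequote=False together with an escapechar to csv.writer switches to a preceding escape character instead, which is the configurable part.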

For reference:

The script that does the full pipeline: https://gitlab.com/gazedev/node-typescript-api/-/blob/add-db-migrations/api/src/reset-download-normalize-and-import-csv.ts

The script where I normalize the csv: https://gitlab.com/gazedev/node-typescript-api/-/blob/add-db-migrations/api/src/normalize-csv.ts

The source and normalized CSVs (note: I'm not committed to keeping these up forever, so future humans, expect these links to be broken, but hopefully there's enough context here that if you've stumbled upon this, you can figure out whatever problem you're having): https://drive.google.com/drive/folders/1L_e3vw4leATZY3rzZKeNZY9ZMvziiEwP?usp=sharing

pinkushn · 2022-09-16T19:26:29Z

Most excellent! Need anything further from my end (beyond ongoing updates)?
