LT_All.csv lines have one extra tab #1
Alright, it's actually a bit more complicated than that. Even adding a "trailing" column to my SQL table doesn't fix things. The row with docket number "MJ-02301-LT-0000076-2020" still doesn't come out right (nor do some rows after it). I think there's also something going on with row quoting. One way I'm checking this is by diffing the CSV straight from the repo against one I've opened in LibreOffice Calc, added a trailing column to, and re-saved (which removes the trailing tab as a major source of diffs, but there's still more going on).
Hey, Zac. Glad you're digging in, and sorry for the hitches. So far, we've
only imported into spreadsheets (Excel and Google Sheets) and they are
apparently more forgiving than postgres!
With tonight's update to the csv files, I've removed the trailing tab per
line.
Regarding row quoting, it looks like we are getting messed up by at least
one instance of a typo on a court pdf where a quotation mark was used in
place of a possessive apostrophe. Here's a snippet from the state's pdf for
MJ-02301-LT-0000076-2020:
[image: image.png]
As I scan through the data, I see quotation marks in two general
categories: 1) Typos and 2) Used to indicate nicknames or abbreviations. Do
you have a suggestion for how to handle quotation marks? Some
possibilities: I could fully remove them, or 'escape' them with a preceding
character (not sure what postgres would want).
:)
Josh
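The two options Josh lists can be sketched as follows. This is only an illustration in Python, not code from this project; the function names and the sample value are made up. On the "what postgres would want" question: PostgreSQL's `COPY` in CSV format by default expects an embedded quote to be doubled and the field wrapped in quotes (its `ESCAPE` option defaults to the `QUOTE` character), which is what the second function does.

```python
def strip_quotes(field: str) -> str:
    """Option 1: drop every double quote from the field."""
    return field.replace('"', "")


def escape_quotes(field: str) -> str:
    """Option 2: double each embedded quote and wrap the field in quotes."""
    return '"' + field.replace('"', '""') + '"'


# Made-up sample value, not taken from the court data.
sample = 'Robert "Bob" Smith'
print(strip_quotes(sample))   # Robert Bob Smith
print(escape_quotes(sample))  # "Robert ""Bob"" Smith"
```

Removing the quotes loses the nickname markers, while doubling them keeps the text as it appears on the court PDF, typos included.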
(Replying by email to Zac Littleberry's comment of Sat, Sep 10, 2022, 9:30 PM.)
Hey Josh! Looks like you made a quick fix to the tabs issue, thanks!
I've been working on this too. I realized that if I parsed the CSV with a more forgiving library and then wrote the data back out to a file, I'd get a normalized version of the CSV. Using this method, I've resolved the quotation marks issue with the file. I thought it was also fixing the extra tab/column issue, but since you've updated the file, I'm not sure whether that was just your changes or whether mine would have done it too.
Taking a quick look at the original CSV and the normalized version in a diff viewer (such as meld), it looks like fields with quotes or commas should be surrounded by quotation marks, and quotes in the text get escaped with... a quotation mark? Weird; I think this is configurable though.
For reference:
The script that does the full pipeline:
https://gitlab.com/gazedev/node-typescript-api/-/blob/add-db-migrations/api/src/reset-download-normalize-and-import-csv.ts
The script where I normalize the CSV:
https://gitlab.com/gazedev/node-typescript-api/-/blob/add-db-migrations/api/src/normalize-csv.ts
The source and normalized CSVs (note: I'm not committed to keeping these up forever, so future humans, expect these links to be broken, but hopefully there's enough context here that if you've stumbled upon this, you can figure out whatever problem you're having):
https://drive.google.com/drive/folders/1L_e3vw4leATZY3rzZKeNZY9ZMvziiEwP?usp=sharing
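In case the links above do go dead, here is a rough Python sketch of the same parse-then-rewrite idea. It is not Zac's script (that one is TypeScript and lives at the GitLab links above). It assumes the file is tab-delimited, and the `normalize` function and sample rows are made up for illustration.

```python
import csv
import io


def normalize(raw: str, delimiter: str = "\t") -> str:
    """Re-emit delimited text with standard CSV quoting."""
    # QUOTE_NONE makes the reader treat every quote as ordinary data, so a
    # stray quote cannot swallow the delimiters that follow it.
    reader = csv.reader(io.StringIO(raw, newline=""), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    out = io.StringIO(newline="")
    # QUOTE_MINIMAL quotes only the fields that need it and doubles any
    # quote characters inside them.
    writer = csv.writer(out, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL,
                        lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Made-up rows; the second field of the first row starts with a stray quote.
raw = 'D1\t"Acme Rentals\tnote one\nD2\tBeta LLC\tnote two\n'
print(normalize(raw), end="")
```

The doubled quote in the output is the "escaped with a quotation mark" behavior noted above. In Python's `csv` module it is controlled by the `doublequote` and `escapechar` options, so it is indeed configurable.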
Most excellent! Need anything further from my end (beyond ongoing updates)?
(Replying by email to Zac Littleberry's comment of Fri, Sep 16, 2022, 10:29 AM.)
Hey Josh. I'm trying to import LT_All.csv into postgres using the
`COPY table("col1", "col2") FROM file.csv`
command, but it kept complaining about there being more data in the rows than expected. I finally realized that there probably shouldn't be a tab after the last piece of data on a line. For example, it should be `NOTES\n`, not `NOTES\t\n`. I confirmed this was the case by opening the CSV in LibreOffice Calc, re-saving it, and looking at the file again (no extra tab, and no problem importing).
Do you think the way you are saving the CSV could be modified to not have that extra tab?
Note: Unfortunately, postgres's COPY can't skip a CSV column (every column in the file has to map to a table column), so my options are to pre-process the file (either by re-saving it or programmatically) or to add a junk column to the end of my postgres table. Neither is ideal, but both are doable if we can't get LT_All.csv modified during write.
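The "programmatic" pre-processing option could look something like the Python sketch below. It is only an illustration under the assumption that every line carries exactly one unwanted tab before the line ending; the function name is made up.

```python
def strip_trailing_tab(line: str) -> str:
    """Remove a single tab that sits directly before the line ending."""
    body = line.rstrip("\r\n")      # the line without its terminator
    ending = line[len(body):]       # the terminator itself, kept as-is
    if body.endswith("\t"):
        body = body[:-1]
    return body + ending


print(repr(strip_trailing_tab("NOTES\t\n")))  # 'NOTES\n'
print(repr(strip_trailing_tab("NOTES\n")))    # 'NOTES\n'
```

One caveat: if the last real column of a row is empty, its delimiter also shows up as a trailing tab, and this function cannot tell the two cases apart. Fixing the file at write time avoids that ambiguity.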