
Bugfix for Complex JSON Array Unexpected Behavior #90

Open
wants to merge 7 commits into master

Conversation


@hdwatts hdwatts commented Jan 11, 2021

Status

READY

Description

I had been dealing with issues similar to Issue #78 - and pulled in PR #81 to help resolve those issues. However, I found a few further exceptions with complex array structures and schemas. I believe this PR is an alternative to PR #81, although their PR mentions fillGaps which I don't think I address here. Unfortunately there are no tests in that PR so I do not know if this addresses the fillGaps issue mentioned.

The method used here is to keep a stricter track of the csv headers, and ensure that we are only ever adding to a line if the current element is supposed to come after the previous element.

```js
[{
  companies: [
    {
      display_name: "A",
      investments: [{ name: "B" }, { name: "C" }],
    },
    {
      display_name: "B",
      investments: [{ name: "D" }, { name: "E" }],
    },
  ]
}]
```

Old Logic

| Step | Current Row | Comments |
| --- | --- | --- |
| 0 | `[companies.display_name, companies.investments.name]` | Calculate headers; new row needed |
| 1 | `[A, null]` | Populate company display name |
| 2 | `[A, B]` | Populate first investment name |
| 3 | `[A, B]` | New row needed |
| 4 | `[null, C]` | Populate second investment name |
| 5 | `[B, C]` | "B" gets placed here since the cell was null, falsely applying investment "C" to company "B" |

New Logic

| Step | Current Row | Comments |
| --- | --- | --- |
| 0 | `[companies.display_name, companies.investments.name]` | Calculate headers; new row needed |
| 1 | `[A, null]` | Populate company display name |
| 2 | `[A, B]` | Populate first investment name; it comes after the previously placed "A", so it is placed on the same row |
| 3 | `[A, B]` | New row needed |
| 4 | `[null, C]` | "C" is placed at its correct index; new row needed |
| 5 | `[B, null]` | "B" is associated with a header that comes before the previously placed "C", so it goes to a new row, as appropriate |

The other exceptions are noted in the tests - please let me know if you have any questions.

Related PRs

List related PRs against other branches:

| branch | PR |
| --- | --- |
| Original Bugfix PR | link |

Todos

- Tests
- Documentation

Steps to Test or Reproduce

Outline the steps to test or reproduce the PR here.

Tests have been added.

```shell
npm run test
```

Impacted Areas in Application

List general components of the application that this PR will affect:

- Exporting complex JSON array structures as CSV

@AckerApple
Collaborator

AckerApple commented Jan 12, 2021

Link to example of current results relevant to claim in description

(UPDATE: note to self that issuer has provided updated link example in next comment)

@hdwatts
Author

hdwatts commented Jan 12, 2021

> Link to example of current results relevant to claim in description

Apologies, I should have wrapped the object with brackets in my PR (updated). You will see the behavior then.

Here is an updated link.

@AckerApple
Collaborator

Acknowledged. Due to complexity it will take me another round of reviewing and examining.

It would behoove you to attempt a most basic summation for me, if possible. It's taking me a lot of energy to compare results and dissect exactly what the all-encompassing view of this work is.

I believe you may have something here. Just trying to be extra careful to make sure we have full understanding and solid documentation.

If a week or more goes by, please feel free to ping a comment again, as my plate is fairly full.

@AckerApple AckerApple self-assigned this Jan 13, 2021
@hdwatts
Author

hdwatts commented Jan 14, 2021

Sure thing, here is a breakdown of my changes. Apologies for the long write up, I just got diagnosed
with COVID and so have a lot of time on my hands.

The Problem

As stated in my initial write-up, I noticed that as the results within a complex JSON array were being
iterated upon, there were easily replicable situations where items would get inserted into the wrong
row of the CSV.

The table copied from above demonstrates the logic flow.

| Step | Current Row | Comments |
| --- | --- | --- |
| 0 | `[companies.display_name, companies.investments.name]` | Calculate headers; new row needed |
| 1 | `[A, null]` | Populate company display name |
| 2 | `[A, B]` | Populate first investment name |
| 3 | `[A, B]` | New row needed |
| 4 | `[null, C]` | Populate second investment name |
| 5 | `[B, C]` | "B" gets placed here since the cell was null, falsely applying investment "C" to company "B" |

The issue noted in step 5 is due to the current code only checking if the header index is not undefined.

```js
//
// Line 96 in lib/parser/csv.js
//
let elementHeaderIndex = getHeaderIndex(element.item);
if (currentRow[elementHeaderIndex] != undefined) {
	...
```

This check returns false in the above example, where we are attempting to insert the second display name. Therefore the
display name does not get put into a new row and instead gets put on the previous row.
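The faulty placement can be reproduced with a minimal standalone sketch of the old row-filling loop (the `fillRowsOld` name and the flat element list are illustrative, not the library's exact internals):

```javascript
// Minimal sketch of the OLD logic: a new row is started only
// when the target cell is already occupied.
function fillRowsOld(headers, elements) {
  const newRow = () => new Array(headers.length).fill(null);
  const rows = [];
  let currentRow = newRow();
  for (const element of elements) {
    const idx = headers.indexOf(element.item);
    if (currentRow[idx] != null) { // only check: cell occupied?
      rows.push(currentRow);
      currentRow = newRow();
    }
    currentRow[idx] = element.value;
  }
  rows.push(currentRow);
  return rows;
}

const headers = ['companies.display_name', 'companies.investments.name'];
const elements = [
  { item: 'companies.display_name', value: 'A' },
  { item: 'companies.investments.name', value: 'B' },
  { item: 'companies.investments.name', value: 'C' },
  { item: 'companies.display_name', value: 'B' },
  { item: 'companies.investments.name', value: 'D' },
  { item: 'companies.investments.name', value: 'E' },
];
console.log(fillRowsOld(headers, elements));
// [['A','B'], ['B','C'], [null,'D'], [null,'E']]
// Company "B" lands next to investment "C" from company "A".
```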

My Proposed Solution

My proposed solution is to adjust the check for a new row to be whether the column at the current row is
not undefined OR it is attempting to insert an element at an index before the previously placed item.

This, of course, needs to rely on the assumption that the data coming in is in an order sorted according to
the headers of the inputted JSON object. Alas, that was not the case, so I also had to make adjustments to
lib/parser/handler.js.

The goal with the changes there is to return the same array of { item: ..., value: ...} objects that is
currently returned, however in order of the header's appearance within the JSON structure.

Ex:

```js
[
	{
		"a": {
			"b": true,
			"c": [{
					"d": 1
				},
				{
					"d": 2
				},
				{
					"d": 3
				},
				{
					"d": 4
				}
			],
			"e": [{
					"f": 1
				},
				{
					"f": 2
				}
			]
		}
	}
]
```

Becomes:

```js
[
	{ item: "a.b", value: true },
	{ item: "a.c.d", value: 1 },
	{ item: "a.c.d", value: 2 },
	{ item: "a.c.d", value: 3 },
	{ item: "a.c.d", value: 4 },
	{ item: "a.e.f", value: 1 },
	{ item: "a.e.f", value: 2 },
]
```
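The target ordering can be sketched with a simple recursive flattener. This is a hypothetical standalone helper for illustration, not the library's actual `_handleArray` (which also tracks depth and special cases covered below):

```javascript
// Recursively flatten a JSON value into { item, value } pairs,
// emitting them in the order the keys appear in the schema.
function flatten(value, prefix = '') {
  if (Array.isArray(value)) {
    return value.flatMap(v => flatten(v, prefix));
  }
  if (value !== null && typeof value === 'object') {
    return Object.keys(value).flatMap(key =>
      flatten(value[key], prefix ? `${prefix}.${key}` : key));
  }
  return [{ item: prefix, value }];
}

const input = [{ a: { b: true, c: [{ d: 1 }, { d: 2 }], e: [{ f: 1 }] } }];
console.log(flatten(input));
// [ { item: 'a.b', value: true },
//   { item: 'a.c.d', value: 1 },
//   { item: 'a.c.d', value: 2 },
//   { item: 'a.e.f', value: 1 } ]
```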

While this was already the case for "perfect" JSON schemas, if there were keys out of order or there were
missing keys these items would be returned out of order. These situations were being properly handled in
the existing "undefined check" implementation, but would break with my "header index check".

Therefore, changes had to be made in two files:

  • lib/parser/handler.js to ensure that the results are sorted normalized to the header indexes
  • lib/parser/csv.js to add the header index check to the insert new row logic

The Implementation

lib/parser/handler.js

```diff
 constructor(options) {
 	...
+	this._headers = []
 }
```

We need to track headers for this JSON object on a class level.

```diff
 _handleArray(array) {
 	...
+	const getHeaderIndex = function(item) {
+		let index = self._headers.indexOf(item);
+		if (index === -1) {
+			if (item === null) {
+				self._headers.unshift(item);
+			} else {
+				self._headers.push(item);
+			}
+			index = self._headers.indexOf(item);
+		}
+		return index
+	}
+	const sortByHeaders = function(itemA, itemB) {
+		return getHeaderIndex(itemA.item) - getHeaderIndex(itemB.item);
+	}
 	...
 }
```

As we are going to be sorting by the headers stored at the class level, we need a sortByHeaders
function and a way to check a header value against the stored headers. The getHeaderIndex function
has additional logic for when the header is null. That handles situations where the user has an array
of strings in their object, which requires a special delimiter, and appeases the firstElementWithoutItem
logic that occurs down the line. Basically, if it is a null header, it should always be at the front.
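In isolation, the null special case behaves like this (a self-contained version of the helper above, outside the handler class):

```javascript
// Headers are appended in discovery order, except null,
// which is always unshifted to the front.
const headers = [];
function getHeaderIndex(item) {
  let index = headers.indexOf(item);
  if (index === -1) {
    if (item === null) {
      headers.unshift(item);
    } else {
      headers.push(item);
    }
    index = headers.indexOf(item);
  }
  return index;
}

getHeaderIndex('a.b'); // 0 - first header seen
getHeaderIndex('a.c'); // 1 - appended after it
getHeaderIndex(null);  // 0 - unshifted to the front
getHeaderIndex('a.b'); // 1 - existing headers shifted right
// headers is now [null, 'a.b', 'a.c']
```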

Now, the _handleArray function works recursively through the JSON object, populating the item (header)
and value (the value) for each key as it works through and appending those results to a wider result
array.

This gets tricky because we cannot just sort these items at the end, as that will end up with the results
ignoring how deeply nested they are and just sorting according to the JSON schema.

Example:

```js
[
	{
		c: [
			{
				a: "Name 1",
				b: "Field 1"
			},
			{
				a: "Name 2",
				b: "Field 2"
			}
		]
	}
]
```

Would result in:

```js
[
	{ item: "c.a", value: "Name 1" },
	{ item: "c.a", value: "Name 2" },
	{ item: "c.b", value: "Field 1" },
	{ item: "c.b", value: "Field 2" },
]
```

Therefore, we have to sort only the results at the depth they are at, which gets complex because
we are returning the results as we iterate through recursively. To handle this, I've added a _depth
field to the object. This tracks how deep we are in the recursive stack, as we want to be sure
that we are not inappropriately sorting items that have already been sorted at their respective depth.

```diff
+      for (let bIndex=0; bIndex < resultCheckType.length; bIndex++) {
+        getHeaderIndex(resultCheckType[bIndex].item);
+        resultCheckType[bIndex]._depth = (resultCheckType[bIndex]._depth || 0) + 1
```

While iterating within the above loop, I check the current item's depth. If it is 1, then
it is safe to sort, and I add it to a toSort array. Once I hit an item with a depth that isn't 1,
and there are items to sort, it is time to sort those items by the headers using the sort function
above. After sorting, I iterate through the sorted array and place the items back into the main
result array in their sorted order. This is slightly convoluted, as I can't just make a new array without
risking losing the firstElementWithoutItem references...

```diff
+	if (resultCheckType[bIndex]._depth === 1) {
+		toSort.push(resultCheckType[bIndex]);
+	} else if (toSort.length > 0) {
+		const sorted = toSort.sort(sortByHeaders)
+		for (let cIndex = 0; cIndex < sorted.length; cIndex++) {
+			resultCheckType[bIndex - sorted.length + cIndex] =
+				sorted[cIndex];
+		}
+		toSort = []
+	}
```

Then the same code is executed again at the end of the for loop, just in case we had any items left
in the toSort array.
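Pulled out of the handler, that buffering pass looks roughly like this. The function and parameter names are illustrative; the real code operates on `resultCheckType` inside `_handleArray`:

```javascript
// Sort each contiguous run of depth-1 items by header order,
// writing them back in place so deeper (already-sorted) items
// keep their positions.
function sortDepthOneRuns(results, headerIndexOf) {
  let toSort = [];
  const flush = end => {
    if (toSort.length === 0) return;
    toSort.sort((a, b) => headerIndexOf(a.item) - headerIndexOf(b.item));
    for (let c = 0; c < toSort.length; c++) {
      results[end - toSort.length + c] = toSort[c];
    }
    toSort = [];
  };
  for (let b = 0; b < results.length; b++) {
    if (results[b]._depth === 1) {
      toSort.push(results[b]);
    } else {
      flush(b); // run ended just before index b
    }
  }
  flush(results.length); // same code again for any leftover items
  return results;
}

const demo = [
  { item: 'b', _depth: 1 }, { item: 'a', _depth: 1 },
  { item: 'x', _depth: 2 },
  { item: 'd', _depth: 1 }, { item: 'c', _depth: 1 },
];
const order = { a: 0, b: 1, c: 2, d: 3, x: 9 };
console.log(sortDepthOneRuns(demo, i => order[i]).map(r => r.item));
// [ 'a', 'b', 'x', 'c', 'd' ] - each depth-1 run sorted, 'x' untouched
```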

I've put some output on how the result array looks during this recursive process at this gist.

The end result is an array of items and values that are in a proper order according to the JSON's schema!

lib/parser/csv.js

```diff
 _parseArray(json, stream) {
 	...
+	let normalizedHeaders = []
 	...
+	let getNormalizedIndex = function(header) {
+		var index = normalizedHeaders.indexOf(header)
+		if (index === -1) {
+			normalizedHeaders.push(header)
+			index = normalizedHeaders.indexOf(header)
+		}
+		return index
+	}
 	...
 }
```

The first changes are relatively self-explanatory. We can't use the existing this._headers array and
getHeaderIndex function, as that order can be adjusted with the headers option field. Therefore I
use a new normalizedHeaders array and a new getNormalizedIndex function, which ignores the
headers field, ensuring we keep the order from the JSON schema.

```diff
 fillRows = function(result) {
 	...
+	let lastIndex = -1
 	...
 	for (let element of result) {
 		let elementHeaderIndex = getHeaderIndex(element.item);
+		let normalizedIndex = getNormalizedIndex(element.item)
+		if (
+			currentRow[elementHeaderIndex] != undefined ||
+			normalizedIndex < lastIndex
+		) {
 			fillAndPush(currentRow);
 			currentRow = newRow();
 		}
 		emptyRowIndexByHeader[elementHeaderIndex] = emptyRowIndexByHeader[elementHeaderIndex] || 0;
+		lastIndex = normalizedIndex;
 		...
 	}
 	...
 }
```

And finally, here we have the check of the normalized header index against the most recently placed
normalized index. In instances where the same element is being placed twice, the undefined check will cause a new row.
Now, when an item would be placed backwards, we also create a new row.
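Putting both conditions together, the new row-break rule can be sketched as a standalone approximation (not the exact `fillRows` code, which also handles fill values and header remapping):

```javascript
// NEW logic: break the row when the cell is occupied OR the element
// belongs to a column before the previously placed one.
function fillRowsNew(headers, elements) {
  const newRow = () => new Array(headers.length).fill(null);
  const rows = [];
  let currentRow = newRow();
  let lastIndex = -1;
  for (const element of elements) {
    const idx = headers.indexOf(element.item);
    if (currentRow[idx] != null || idx < lastIndex) {
      rows.push(currentRow);
      currentRow = newRow();
    }
    currentRow[idx] = element.value;
    lastIndex = idx;
  }
  rows.push(currentRow);
  return rows;
}

const headers = ['companies.display_name', 'companies.investments.name'];
console.log(fillRowsNew(headers, [
  { item: 'companies.display_name', value: 'A' },
  { item: 'companies.investments.name', value: 'B' },
  { item: 'companies.investments.name', value: 'C' },
  { item: 'companies.display_name', value: 'B' },
  { item: 'companies.investments.name', value: 'D' },
  { item: 'companies.investments.name', value: 'E' },
]));
// [['A','B'], [null,'C'], ['B','D'], [null,'E']]
// "B" now starts its own row instead of absorbing investment "C".
```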

Conclusion

While my proposed solution solves for my exact problem, I'm very new to this library and have not touched
any of the options beyond adjusting my changes based on the tests. I am open to any questions, comments,
and concerns you may have, and look forward to collaborating with you further!

@kaue
Owner

kaue commented Sep 29, 2021

@AckerApple when you have time please follow up on that last round of review.
I will try to do the same, I still have a lot to catch up on this thread.

@hdwatts Thanks for this contribution, it's really well documented 😄

I'm sorry that it's taking me so much time to reply; I'm hoping that next year I will be able to be more active in this community.

@AckerApple
Collaborator

@kaue you bet friend, I will do a review within a max of 4 days. Fairly loaded, but I know you're just asking for an eyeball review. I will respond in time.

@hdwatts
Author

hdwatts commented Oct 7, 2021

@kaue @AckerApple Thanks for checking back in on this, guys. I'll note that we've been using these changes on our production platform to export results from our GraphQL API over the past year and have yet to encounter any issues.

@hdwatts hdwatts changed the title Alternative Bugfix for Complex JSON Array Unexpected Behavior Bugfix for Complex JSON Array Unexpected Behavior Dec 8, 2021
@loganpowell

loganpowell commented Jan 20, 2022

I would like to use the fixes proposed by @hdwatts... any chance you'll merge it?

EDIT:

I've spoken without fully testing. Now that I have, I get a number of new rows with mostly empty cells for nested objects... Not sure why...

@kaue
Owner

kaue commented Jan 21, 2022

@loganpowell I will try to find some time to review this PR.

But if you can, please try using @hdwatts' branch and see if the fix is working properly.

Sorry for the delay guys, busy times :/
