Very skewed distribution of duration on large datasets #11

franc00018 · 2015-09-21T17:54:34Z

Hi, I'm trying to use your code as part of an algorithm on a large dataset using Scala and Apache Spark. I'm having great results in terms of accuracy but I did if on several samples of GPS tracklog data and have a very skewed distribution of duration

Metric	Min	25th percentile	Median	75th percentile	Max
Duration	10 s	1.2 min	6.2 min	12 min	51 min
GC Time	0.2 s	2 s	4 s	4 s	10 s
Input	25.5 MB	128.1 MB	128.1 MB	128.1 MB	128.1 MB
Output	18.4 MB	93.3 MB	93.6 MB	93.7 MB	94.0 MB

I would like to know it this is an expected behaviour of this algorithm or if you have some tips and tricks to have a more stable results for any dataset (having less variance, maybe at the cost of having an higher average duration.

AReallyGoodName · 2015-09-21T23:41:10Z

The average complexity of the kd-tree, O(logn), should be what we see since it's a large dataset (barring some bug). It would be good to narrow down further (halve the datasets, take the slowest half and repeat). Perhaps there's a set of lookups that cause a particular issue.

I suspect this could be an old fashion memory caching issue. One thing i would try is to sort the datasets by location to some degree first. Points near each other in the real world will also traverse the same parts of the KD-Tree.

hoerup · 2015-09-22T08:22:03Z

I have made the observation that if the points I search for is within the area covered by the points in K-D tree everything goes smoothly but when I search for nearest from a set of points well outside the covered area the search is significantly slower.

Example
Number of points in KDtree: 749063
Time for finding 1000 nearest when points are well outside K-d tree area: 29 seconds
Time for finding 1000 nearest when points are in the middle of the K-d tree area: less than a second

kutt4n · 2017-01-18T09:34:29Z

@franc00018 - did you face any serialisation issues while using it with spark ?

franc00018 · 2017-01-24T04:21:17Z

Not of what I remember. My code completed successfully. It was on Spark 1.2.1. I unfortunately don't have access to the data anymore so I can't easily reproduce.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Very skewed distribution of duration on large datasets #11

Very skewed distribution of duration on large datasets #11

franc00018 commented Sep 21, 2015

AReallyGoodName commented Sep 21, 2015

hoerup commented Sep 22, 2015

kutt4n commented Jan 18, 2017

franc00018 commented Jan 24, 2017

Very skewed distribution of duration on large datasets #11

Very skewed distribution of duration on large datasets #11

Comments

franc00018 commented Sep 21, 2015

AReallyGoodName commented Sep 21, 2015

hoerup commented Sep 22, 2015

kutt4n commented Jan 18, 2017

franc00018 commented Jan 24, 2017