Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very skewed distribution of duration on large datasets #11

Open
franc00018 opened this issue Sep 21, 2015 · 4 comments
Open

Very skewed distribution of duration on large datasets #11

franc00018 opened this issue Sep 21, 2015 · 4 comments

Comments

@franc00018
Copy link

Hi, I'm trying to use your code as part of an algorithm on a large dataset using Scala and Apache Spark. I'm having great results in terms of accuracy but I did if on several samples of GPS tracklog data and have a very skewed distribution of duration

Metric Min 25th percentile Median 75th percentile Max
Duration 10 s 1.2 min 6.2 min 12 min 51 min
GC Time 0.2 s 2 s 4 s 4 s 10 s
Input 25.5 MB 128.1 MB 128.1 MB 128.1 MB 128.1 MB
Output 18.4 MB 93.3 MB 93.6 MB 93.7 MB 94.0 MB

I would like to know it this is an expected behaviour of this algorithm or if you have some tips and tricks to have a more stable results for any dataset (having less variance, maybe at the cost of having an higher average duration.

@AReallyGoodName
Copy link
Owner

The average complexity of the kd-tree, O(logn), should be what we see since it's a large dataset (barring some bug). It would be good to narrow down further (halve the datasets, take the slowest half and repeat). Perhaps there's a set of lookups that cause a particular issue.

I suspect this could be an old fashion memory caching issue. One thing i would try is to sort the datasets by location to some degree first. Points near each other in the real world will also traverse the same parts of the KD-Tree.

@hoerup
Copy link

hoerup commented Sep 22, 2015

I have made the observation that if the points I search for is within the area covered by the points in K-D tree everything goes smoothly but when I search for nearest from a set of points well outside the covered area the search is significantly slower.

Example
Number of points in KDtree: 749063
Time for finding 1000 nearest when points are well outside K-d tree area: 29 seconds
Time for finding 1000 nearest when points are in the middle of the K-d tree area: less than a second

@kutt4n
Copy link

kutt4n commented Jan 18, 2017

@franc00018 - did you face any serialisation issues while using it with spark ?

@franc00018
Copy link
Author

Not of what I remember. My code completed successfully. It was on Spark 1.2.1. I unfortunately don't have access to the data anymore so I can't easily reproduce.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants