-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Very skewed distribution of duration on large datasets #11
Comments
The average complexity of the kd-tree, O(logn), should be what we see since it's a large dataset (barring some bug). It would be good to narrow down further (halve the datasets, take the slowest half and repeat). Perhaps there's a set of lookups that cause a particular issue. I suspect this could be an old fashion memory caching issue. One thing i would try is to sort the datasets by location to some degree first. Points near each other in the real world will also traverse the same parts of the KD-Tree. |
I have made the observation that if the points I search for is within the area covered by the points in K-D tree everything goes smoothly but when I search for nearest from a set of points well outside the covered area the search is significantly slower. Example |
@franc00018 - did you face any serialisation issues while using it with spark ? |
Not of what I remember. My code completed successfully. It was on Spark 1.2.1. I unfortunately don't have access to the data anymore so I can't easily reproduce. |
Hi, I'm trying to use your code as part of an algorithm on a large dataset using Scala and Apache Spark. I'm having great results in terms of accuracy but I did if on several samples of GPS tracklog data and have a very skewed distribution of duration
I would like to know it this is an expected behaviour of this algorithm or if you have some tips and tricks to have a more stable results for any dataset (having less variance, maybe at the cost of having an higher average duration.
The text was updated successfully, but these errors were encountered: