allow continuous weights for knn graph #763
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff            @@
##             main     #763    +/-   ##
========================================
  Coverage    85.4%    85.4%
========================================
  Files         150      150
  Lines       15975    15984     +9
========================================
+ Hits        13641    13650     +9
  Misses       2334     2334
Co-authored-by: James Gaboardi <[email protected]>
But why? There's a kernel builder that does exactly this:

Graph.build_kernel(df, k=5, kernel="identity")
# or
Graph.build_kernel(df, k=5, kernel=None)

The result will be exactly the same as in this PR. This works analogously in the weights module, if anyone wants to use that over Graph. I am not a big fan of duplicating functionality. If you have point locations and want to use distance, either raw or passed through a kernel, as weights, use the kernel builder. If you want just the K nearest neighbors, use KNN. I can see the reasoning for adding a binary bool argument here, but that opens the door to exposing kernels as well, and we end up with two build methods that significantly overlap. |
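For concreteness, a minimal sketch of the two routes being compared, using a toy GeoDataFrame. The builder calls follow the comment above; the data and values are purely illustrative:

import geopandas
from shapely.geometry import Point
from libpysal import graph

# toy point data, purely illustrative
df = geopandas.GeoDataFrame(
    geometry=[Point(xy) for xy in [(0, 0), (1, 0), (2, 1), (0, 2), (3, 3), (1, 4)]]
)

# plain K nearest neighbors: binary membership weights
g_knn = graph.Graph.build_knn(df, k=2)

# same neighbor sets, but with raw distances kept as weights,
# via the kernel builder with an identity (or no) kernel
g_dist = graph.Graph.build_kernel(df, k=2, kernel="identity")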
yeah, i'm halfway down that rabbit hole at the moment. Was ready to expose alpha but not kernel. I think the key here is that we allow kernel to subset by k. I'm cool with that, the only trouble is it's not immediately obvious that you can do knn weights with build_kernel. |
I think it is not necessarily an API issue but a documentation one. We should probably explicitly point to build_kernel. |
agree |
thinking about this a little more though, it doesn't feel super intuitive that the kernel builder is the place to get distance-weighted knn. I'm not saying we should do it another way, but for someone unfamiliar with the library, you need to have a good handle on what you're doing to accomplish distance-weighted knn. just to say we really need to document the flexibility of the kernel builder well |
It seems that this API is not very well thought through.

If we match the distance band API in KNN, then the only thing the kernel builder can do while the others cannot is specification of bandwidth. From the teaching perspective, the kernel builder can be tricky to explain, so having those options on KNN as well might make the most sense. We can then treat the kernel builder as a more advanced option. We should probably also expose distance decay in Kernel, but bandwidth does not really map well onto alpha, so not sure about that. In the end, we will indeed have a ton of overlap, but it is all passed through the same underlying kernel code anyway. |
I think the fact that these are all kernels suggests we should implement this by exposing kernel arguments to knn. The APIs should look basically identical, but knn should use a boxcar kernel and set k to some value by default.

For "alpha", I think it makes more sense to expose a "power" kernel, and let the bandwidth be its argument, just like the other kernel functions. I'm split, though, as to whether to use d**(-bandwidth), d**(-1/bandwidth), or d**(bandwidth).

The first "flips" the semantics of bandwidth (small::local similarity, big::global similarity). The second fixes the semantics but is unintuitive: if you've explicitly requested the "power" kernel, you'd expect bandwidth=2 to give d**(-2). The third fixes the semantics too (assuming larger negatives are "smaller"), but it would again feel weird to specify a negative bandwidth as the "normal" value. |
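As a concrete illustration of the trade-off, a small numpy sketch of the three candidate parameterizations (the distances and the bandwidth value are arbitrary examples, not part of any proposed API):

import numpy as np

d = np.array([0.5, 1.0, 2.0, 4.0])  # pairwise distances to weight
bandwidth = 2

# option 1: d**(-bandwidth); bandwidth=2 gives d**-2, but a larger
# bandwidth now means a *more* local weighting
opt1 = d ** (-bandwidth)

# option 2: d**(-1/bandwidth); the bandwidth semantics are fixed, but
# bandwidth=2 gives d**-0.5 rather than the d**-2 a user might expect
opt2 = d ** (-1 / bandwidth)

# option 3: d**(bandwidth); to get decay the user passes a negative
# value directly, e.g. bandwidth = -2
opt3 = d ** (-2)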
+1 (i was gonna end up adding a power kernel as part of this PR but wasn't sure how to handle alpha). I think I lean toward the first option. We know we're in distance-decay land, so I tend to think of the exponent as positive (and the "-" as part of the formula). |
> Thinking about this more, I am actually not sure if I like passing alpha as bandwidth. It is inconsistent with the rest of the kernels then. Also, distance band has both bandwidth and alpha arguments (though not used together), so we shall aim for some degree of consistency between these.

@martinfleis I think you mean this over here? i think levi's proposal is to remove the alpha parameter and have everything work through a kernel with args (dist, threshold) (the way the power kernel is currently implemented in distance_band).

so to get distance decay weights in dist (or KNN), you'd do

build_distance_band(df, k, kernel='power', bandwidth=1)

instead of

build_distance_band(df, k, binary=False, alpha=-1)

(right?) that seems consistent and pretty reasonable to me |
Or, option 3:

build_distance_band(df, k, kernel='power', bandwidth=-1)

to achieve d**-1 |
I like it. No preference on positive/negative power.
i think it boils down to whether the kernel is truncated or not. For gaussian, exponential, and power kernels (or a user function), the distance band builder will be cut (so the neighbor set will only include those up to threshold and the min weight will be f(threshold, threshold)), but the knn builder won't be. |
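A rough numpy sketch of that truncation distinction, assuming a simple exponential kernel; the names and values here are illustrative, not the Graph internals:

import numpy as np

def exponential(distance, bandwidth):
    # simple exponential decay kernel
    return np.exp(-distance / bandwidth)

d = np.array([0.5, 1.5, 2.5, 4.0])  # distances to candidate neighbors
threshold = 2.5

# distance-band style: truncate at the threshold, so the neighbor set only
# keeps pairs with d <= threshold and the minimum retained weight is
# exponential(threshold, threshold)
band_weights = exponential(d[d <= threshold], bandwidth=threshold)

# knn style (untruncated): keep the k nearest pairs however far away
# they are, and weight all of them
k = 3
nearest = np.sort(d)[:k]
knn_weights = exponential(nearest, bandwidth=threshold)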
I would argue against removing a parameter (or two, if I read this correctly), since that may break a lot of legacy code that may depend on this. It's safer to leave it in there and either ignore it or replace it internally with the new arguments. Or maybe I am misinterpreting the “instead of” above.

I generally argue against breaking things, but instead make silent replacements, e.g., the GMM_Error that replaces all the old GM_Error calls but leaves them in place in case legacy code uses it. Not everyone is involved in this on a day to day basis and there is lots of legacy code out there :-)

The weights functions were some of the first things written in PySAL, almost 20 years ago (or more), before many of the modern libraries were available. So, I would not argue that they are not dated, but I would argue for keeping the “old” APIs functioning.

L.
good point. I should've been clearer that we plan to leave the W class untouched, so anything in this discussion refers only to the Graph class (which is new enough that major changes shouldn't break any critical code--certainly no legacy stuff) |
Fair enough. It does say that “the Graph is currently experimental and its API is incomplete and unstable” :-) That’s why I haven’t taken a close look yet. I have too much legacy stuff ...
we're almost there! I'm pretty sure we have feature parity between Graph and W at this point, so just making some final API tweaks. Used directly, Graph isn't compatible with spreg just yet (because there's still an explicit check for W), though. |
Awesome! But we currently have a bunch of new stuff in the development version of spreg that affects nearly all files in the repository, so I would strongly advise against working on this at this stage. Let me check with Luc how we are for a new release, and it would be great to allow for both W and Graph. |
It makes sense to me, it is just the name bandwidth that does not feel right for a power exponent. Not saying that we should do that, but the cleanest would be to keep the parameter named alpha. |
I think we should try to phrase things generically when there is no prior reason to pick one notation over another. This is why other stats packages use names (cov, loc, shape, scale) rather than notational conventions (Sigma, beta, alpha, etc.). Alpha is called "k" in the scaling literature and usually called "beta" in the gravity model literature. For the Gaussian kernel, bandwidth is usually called "sigma." Do we expose those? I'm not a fan of exposing the "p" for Minkowski metrics, either, but we're trying to match the scipy.spatial API.

Reflecting on it, "alpha" is a decay parameter: large values are associated with stronger locality. To me, this implies decay is like an inverse bandwidth. Hence, I think option 3 addresses this by making alpha = -1*bandwidth, while I find option 2 only makes sense if we constrain bandwidth to be positive. |
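To make the inverse-bandwidth point concrete, a tiny sketch of how option 3 would map the familiar alpha onto bandwidth (illustrative numbers only):

import numpy as np

d = np.array([0.5, 1.0, 2.0])

alpha = 2                  # classic gravity-style decay exponent
decay = d ** (-alpha)      # the usual d**(-alpha) weights

bandwidth = -alpha         # under option 3, weights = d**bandwidth
assert np.allclose(decay, d ** bandwidth)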
this is a quick way to address #762 in the graph, though we could also allow other kernel transformations or allow a power argument