
allow continuous weights for knn graph #763

Closed
wants to merge 2 commits into from

Conversation

knaaptime
Member

this is a quick way to address #762 in the graph, though we could also allow other kernel transformations or allow a power argument


codecov bot commented Aug 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.4%. Comparing base (27776c3) to head (12db0dd).
Report is 37 commits behind head on main.


@@          Coverage Diff          @@
##            main    #763   +/-   ##
=====================================
  Coverage   85.4%   85.4%           
=====================================
  Files        150     150           
  Lines      15975   15984    +9     
=====================================
+ Hits       13641   13650    +9     
  Misses      2334    2334           
Files Coverage Δ
libpysal/graph/base.py 96.5% <100.0%> (+<0.1%) ⬆️

... and 1 file with indirect coverage changes

Co-authored-by: James Gaboardi <[email protected]>
@martinfleis
Member

But why? There's a kernel builder that does exactly this.

Graph.build_kernel(df, k=5, kernel="identity")
# or
Graph.build_kernel(df, k=5, kernel=None)

The result will be exactly the same as in this PR. This works analogously in the weights module, if anyone wants to use that over Graph.

I am not a big fan of duplication of functionality. If you have point locations and want to use distance, either raw or passed through a kernel, as weights, use the kernel builder. If you want just K nearest neighbors, use KNN. I can see the reasoning for adding a binary bool argument here, but that opens the door to exposing kernels as well, and we end up with two build methods that significantly overlap.

@knaaptime
Member Author

yeah, i'm halfway down that rabbit hole at the moment. Was ready to expose alpha but not kernel.

I think the key here is that we allow kernel to subset by k, but don't allow KNNs to have a kernel applied. There's already a way to do what the user wants, but it's only exposed one way, not both.

I'm cool with that; the only trouble is it's not immediately obvious that you can do knn weights with `build_kernel`, but you can't do kernel weights with `build_knn`

@martinfleis
Member

I think it is not necessarily an API issue but a documentation one. We should probably explicitly point to build_kernel from build_knn with a note that if you want to treat distances in some way, that is a way to go.

@knaaptime
Member Author

agree

@knaaptime knaaptime closed this Aug 8, 2024
@knaaptime
Member Author

thinking about this a little more though, it doesn't feel super intuitive that build_distance_band allows you to apply a kernel or return [weighted] distances but build_knn doesn't. Internally, distance band is nothing more than a switch between boxcar, identity and power functions

I'm not saying we should do it another way, but for someone unfamiliar with the library, you need to have a good handle on what you're doing to accomplish distance-weighted knn

just to say we really need to document the flexibility of the kernel builder well
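A minimal sketch of that switch in plain Python (illustrative only; this is not libpysal's actual `_kernel` helper, which covers more kernels and operates on sparse structures):

```python
def kernel_weight(distances, kernel="boxcar", bandwidth=1.0, alpha=-1.0):
    """Toy version of the kernel switch described above (not libpysal's code).

    boxcar   -> binary membership within bandwidth (classic distance band)
    identity -> raw distances as weights
    power    -> distance-decay weights d**alpha
    """
    if kernel == "boxcar":
        return [1 if d <= bandwidth else 0 for d in distances]
    if kernel == "identity":
        return list(distances)
    if kernel == "power":
        return [d ** alpha for d in distances]
    raise ValueError(f"unknown kernel: {kernel!r}")

d = [0.5, 1.0, 2.0]
print(kernel_weight(d, "boxcar", bandwidth=1.0))  # [1, 1, 0]
print(kernel_weight(d, "power", alpha=-1.0))      # [2.0, 1.0, 0.5]
```

With this picture, "binary" distance band is just the boxcar case, and distance-weighted variants are the identity/power cases applied to the same neighbor set.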

@martinfleis
Member

It seems that this API is not very well thought through.

  1. KNN offers just binary
  2. Distance band offers binary, exposes kernel and distance decay alpha
  3. Kernel has k and bandwidth which identify the same neighbors as KNN and Distance band respectively, and offers kernel function but does not expose distance decay alpha nor a kernel that would result in the same treatment, unless you pass a custom callable.

If we match the distance band API in KNN, then the only thing kernel builder can do while others cannot is specification of metric and processing of precomputed distances via a kernel.

From the teaching perspective, kernel builder can be tricky to explain, so having those options on KNN as well might make the most sense. We can then treat kernel builder as a more advanced option.

We should probably also expose distance decay in Kernel but bandwidth does not really map well onto alpha, so not sure about that.

In the end, we will indeed have a ton of overlap but it is all passed through _kernel anyway, so not necessarily implementation-wise. It may just be trickier to explain what the build_kernel is for given 95% of its functionality is covered by more explicit build methods.

@ljwolf
Member

ljwolf commented Aug 9, 2024

I think the fact that these are all kernels suggests we should implement this by exposing kernel arguments to knn. The APIs should look basically identical, but knn should use a boxcar kernel and set k to some value by default.

For "alpha", I think it makes more sense to expose a "power" kernel, and let the bandwidth be its argument, just like other kernel functions.

I'm split, though, as to whether to use d**(-bandwidth), d**(-1/bandwidth), or d**(bandwidth)

The first "flips" the semantics of bandwidth (small::local similarity, big::global similarity). The second fixes the semantics but is unintuitive: if you've explicitly requested the "power" kernel, you'd expect "bandwidth=2" to be d**(-2). The third fixes the semantics too (assuming larger negatives are "smaller"), but it would again feel weird to specify a negative bandwidth as the "normal" value.
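To make the trade-off concrete, here is a quick numeric comparison of the three options (pure illustration; `bw` stands in for the proposed bandwidth argument):

```python
# Three candidate semantics for a "power" kernel weight at distance d:
def opt1(d, bw):
    return d ** (-bw)        # bw is a positive decay exponent; bigger bw, faster decay

def opt2(d, bw):
    return d ** (-1.0 / bw)  # bigger bw, slower decay (bandwidth-like), but surprising exponent

def opt3(d, bw):
    return d ** bw           # caller passes the (negative) exponent directly

d = 2.0
print(opt1(d, 2))   # 0.25  -- "flips" bandwidth semantics
print(opt2(d, 2))   # ~0.71 -- keeps semantics, unintuitive as "power"
print(opt3(d, -2))  # 0.25  -- keeps semantics, negative "bandwidth" reads oddly
```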

@knaaptime
Member Author

knaaptime commented Aug 9, 2024

+1 (i was gonna end up adding a power kernel as part of this PR but wasn't sure how to handle alpha)

I think I lean toward the first option. We know we're in distance-decay land so I tend to think of the exponent as positive (and the "-" as part of the formula)

@knaaptime
Member Author

Thinking about this more, I am actually not sure if I like passing alpha as bandwidth. It is inconsistent with the rest of the kernels then. Also, distance band has both bandwidth and alpha arguments (though not used together), so we shall aim for some degree of consistency between these.

@martinfleis I think you mean this over here?

i think levi's proposal is to remove the alpha parameter and have everything work through a kernel with args (dist, threshold) (the way the power kernel is currently implemented in distance_band)

so to get distance decay weights in dist (or KNN), you'd do

build_distance_band(df, k, kernel='power', bandwidth=1)

instead of

build_distance_band(df, k, binary=False, alpha=-1)

(right?)

that seems consistent and pretty reasonable to me

@ljwolf
Member

ljwolf commented Aug 9, 2024

build_distance_band(df, k, kernel='power', bandwidth=1)

Or, option 3:

build_distance_band(df, k, kernel='power', bandwidth=-1)

To achieve d**-1

@knaaptime
Member Author

I like it. No preference on positive/negative power.

In the end, we will indeed have a ton of overlap but it is all passed through _kernel anyway, so not necessarily implementation-wise. It may just be trickier to explain what the build_kernel is for given 95% of its functionality is covered by more explicit build methods.

i think it boils down to whether the kernel is truncated or not. For gaussian, exponential, and power kernels (or a user function), the distance band builder's result will be truncated (so the neighbor set will only include pairs up to threshold and the min weight will be f(threshold, threshold)), but build_kernel will have all-pairs neighbors and min weight f(max(dist), threshold). That's enough reason to keep both--cause then we have a way of getting to the bounded exponential/power without adding an argument 😆
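That truncation distinction can be sketched like this (an illustration with a hypothetical exponential kernel, not the library's implementation):

```python
import math

def exponential(d, bandwidth):
    # hypothetical kernel for illustration
    return math.exp(-d / bandwidth)

distances = [0.5, 1.0, 3.0]  # pairwise distances to one focal point
threshold = 1.0

# distance band: drop pairs beyond the threshold, then weight the rest
band_weights = [exponential(d, threshold) for d in distances if d <= threshold]

# kernel builder: keep all pairs, weight everything
kernel_weights = [exponential(d, threshold) for d in distances]

print(band_weights)    # the d=3.0 pair is cut; min weight is f(threshold, threshold)
print(kernel_weights)  # all pairs kept; min weight is f(max(dist), threshold)
```

The distance-band version bounds the support of the kernel, which is exactly the behavior you cannot reach through build_kernel without an extra argument.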

@lanselin
Member

lanselin commented Aug 9, 2024 via email

@knaaptime
Member Author

good point. I should've been clearer that we plan to leave the W class untouched, so anything in this discussion refers only to the Graph class (which is new enough that major changes shouldn't break any critical code--certainly no legacy stuff)

@lanselin
Member

lanselin commented Aug 9, 2024 via email

@knaaptime
Member Author

we're almost there! I'm pretty sure we have feature parity between Graph and W at this point, so just making some final API tweaks.

used directly, Graph isn't compatible with spreg just yet (because there's still an explicit check for W), though Graph.to_W() works fine if you want to kick the tires

@pedrovma
Member

pedrovma commented Aug 9, 2024

used directly, Graph isn't compatible with spreg just yet (because there's still an explicit check for W), though Graph.to_W() works fine if you want to kick the tires

Awesome! But we currently have a bunch of new stuff in the development version of spreg that affects nearly all files in the repository, so I would strongly advise against working on this at this stage. Let me check with Luc how we are for a new release, and it would be great to allow for both W and Graph.

@martinfleis
Member

i think levi's proposal would is to remove the alpha parameter and have everything work through a kernel with args (dist, threshold) (the way the power kernel is currently implemented in distance_band)

It makes sense to me, it is just the name bandwidth that is throwing me off, if it is used as alpha. That is not bandwidth any longer so passing the value through feels confusing.

Not saying that we should do that, but the cleanest would be build_distance_band(df, k, kernel='power', param=1) or something like that, with a name generic enough to reflect bandwidth or alpha, depending on the function used under the hood.

@ljwolf
Member

ljwolf commented Aug 12, 2024

I think we should try to phrase things generically when there is no prior reason to pick one notation over another. This is behind why other stats packages use names (cov, loc, shape, scale) rather than notational conventions (Sigma, beta, alpha, etc).

Alpha is called "k" in the scaling literature, and usually called "beta" in gravity model literature. For the Gaussian kernel, bandwidth is usually called "sigma." Do we expose those? I'm not a fan of exposing the "p" for Minkowski metrics, either, but we're trying to match the scipy.spatial API.

Reflecting on it, "alpha" is a decay parameter: large values are associated with stronger locality. To me, this implies decay is like an inverse bandwidth.

Hence, I think option 3 addresses this by making alpha = -1*bandwidth, while I find option 2 only makes sense if we constrain bandwidth to be positive.
