
allow continuous weights for knn graph #763

Closed
wants to merge 2 commits into from

Conversation

knaaptime
Member

this is a quick way to address #762 in the graph, though we could also allow other kernel transformations or allow a power argument


codecov bot commented Aug 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.4%. Comparing base (27776c3) to head (12db0dd).
Report is 37 commits behind head on main.


@@          Coverage Diff          @@
##            main    #763   +/-   ##
=====================================
  Coverage   85.4%   85.4%           
=====================================
  Files        150     150           
  Lines      15975   15984    +9     
=====================================
+ Hits       13641   13650    +9     
  Misses      2334    2334           
Files Coverage Δ
libpysal/graph/base.py 96.5% <100.0%> (+<0.1%) ⬆️

... and 1 file with indirect coverage changes

Co-authored-by: James Gaboardi <[email protected]>
@martinfleis
Member

But why? There's a kernel builder that does exactly this.

Graph.build_kernel(df, k=5, kernel="identity")
# or
Graph.build_kernel(df, k=5, kernel=None)

The result will be exactly the same as in this PR. This works analogously in the weights module, if anyone wants to use that over Graph.

I am not a big fan of duplication of functionality. If you have point locations and want to use distance, either raw or passed through a kernel, as weights, use the kernel builder. If you want just K nearest neighbors, use KNN. I can see the reasoning for adding a binary bool argument here, but that opens the door to exposing kernels as well, and we end up with two build methods that significantly overlap.

@knaaptime
Member Author

yeah, i'm halfway down that rabbit hole at the moment. Was ready to expose alpha but not kernel.

I think the key here is that we allow kernel to subset by k, but don't allow KNNs to have a kernel applied. There's already a way to do what the user wants, but it's only exposed one way, not both.

I'm cool with that; the only trouble is it's not immediately obvious that you can do knn weights with `build_kernel`, but you can't do kernel weights with `build_knn`

@martinfleis
Member

I think it is not necessarily an API issue but a documentation one. We should probably explicitly point to build_kernel from build_knn with a note that if you want to treat distances in some way, that is a way to go.

@knaaptime
Member Author

agree

@knaaptime knaaptime closed this Aug 8, 2024
@knaaptime
Member Author

thinking about this a little more though, it doesn't feel super intuitive that build_distance_band allows you to apply a kernel or return [weighted] distances but build_knn doesn't. Internally, distance band is nothing more than a switch between boxcar, identity and power functions

I'm not saying we should do it another way, but for someone unfamiliar with the library, you need to have a good handle on what you're doing to accomplish distance-weighted knn

just to say we really need to document the flexibility of the kernel builder well
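A minimal sketch of that switch in plain Python (illustrative only; this is not libpysal's actual `_kernel` helper, which covers more kernels and operates on sparse structures):

```python
def kernel_weight(distances, kernel="boxcar", bandwidth=1.0, alpha=-1.0):
    """Toy version of the kernel switch described above (not libpysal's code).

    boxcar   -> binary membership within bandwidth (classic distance band)
    identity -> raw distances as weights
    power    -> distance-decay weights d**alpha
    """
    if kernel == "boxcar":
        return [1 if d <= bandwidth else 0 for d in distances]
    if kernel == "identity":
        return list(distances)
    if kernel == "power":
        return [d ** alpha for d in distances]
    raise ValueError(f"unknown kernel: {kernel!r}")

d = [0.5, 1.0, 2.0]
print(kernel_weight(d, "boxcar", bandwidth=1.0))  # [1, 1, 0]
print(kernel_weight(d, "power", alpha=-1.0))      # [2.0, 1.0, 0.5]
```

With this picture, "binary" distance band is just the boxcar case, and distance-weighted variants are the identity/power cases applied to the same neighbor set.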

@martinfleis
Member

It seems that this API is not very well thought through.

  1. KNN offers just binary
  2. Distance band offers binary, exposes kernel and distance decay alpha
  3. Kernel has k and bandwidth which identify the same neighbors as KNN and Distance band respectively, and offers kernel function but does not expose distance decay alpha nor a kernel that would result in the same treatment, unless you pass a custom callable.

If we match the distance band API in KNN, then the only thing kernel builder can do while others cannot is specification of metric and processing of precomputed distances via a kernel.

From the teaching perspective, kernel builder can be tricky to explain, so having those options on KNN as well might make the most sense. We can then treat kernel builder as a more advanced option.

We should probably also expose distance decay in Kernel but bandwidth does not really map well onto alpha, so not sure about that.

In the end, we will indeed have a ton of overlap but it is all passed through _kernel anyway, so not necessarily implementation-wise. It may just be trickier to explain what the build_kernel is for given 95% of its functionality is covered by more explicit build methods.

@ljwolf
Member

ljwolf commented Aug 9, 2024

I think the fact that these are all kernels suggests we should implement this by exposing kernel arguments to knn. The APIs should look basically identical, but knn should use a boxcar kernel and set k to some value by default.

For "alpha", I think it makes more sense to expose a "power" kernel, and let the bandwidth be its argument, just like other kernel functions.

I'm split, though, as to whether to use d**(-bandwidth), d**(-1/bandwidth), or d**(bandwidth)

The first "flips" the semantics of bandwidth (small::local similarity, big::global similarity). The second fixes the semantics but is unintuitive: if you've explicitly requested the "power" kernel, you'd expect "bandwidth=2" to be d**(-2). The third fixes the semantics too (assuming larger negatives are "smaller"), but it would again feel weird to specify a negative bandwidth as the "normal" value.
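To make the trade-off concrete, here is a quick numeric comparison of the three options (pure illustration; `bw` stands in for the proposed bandwidth argument):

```python
# Three candidate semantics for a "power" kernel weight at distance d:
def opt1(d, bw):
    return d ** (-bw)        # bw is a positive decay exponent; bigger bw, faster decay

def opt2(d, bw):
    return d ** (-1.0 / bw)  # bigger bw, slower decay (bandwidth-like), but surprising exponent

def opt3(d, bw):
    return d ** bw           # caller passes the (negative) exponent directly

d = 2.0
print(opt1(d, 2))   # 0.25  -- "flips" bandwidth semantics
print(opt2(d, 2))   # ~0.71 -- keeps semantics, unintuitive as "power"
print(opt3(d, -2))  # 0.25  -- keeps semantics, negative "bandwidth" reads oddly
```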

@knaaptime
Member Author

knaaptime commented Aug 9, 2024

+1 (i was gonna end up adding a power kernel as part of this PR but wasn't sure how to handle alpha)

I think I lean toward the first option. We know we're in distance-decay land so I tend to think of the exponent as positive (and the "-" as part of the formula)

@knaaptime
Member Author

Thinking about this more, I am actually not sure if I like passing alpha as bandwidth. It is inconsistent with the rest of the kernels then. Also, distance band has both bandwidth and alpha arguments (though not used together), so we shall aim for some degree of consistency between these.

@martinfleis I think you mean this over here?

i think levi's proposal is to remove the alpha parameter and have everything work through a kernel with args (dist, threshold) (the way the power kernel is currently implemented in distance_band)

so to get distance decay weights in dist (or KNN), you'd do

build_distance_band(df, k, kernel='power', bandwidth=1)

instead of

build_distance_band(df, k, binary=False, alpha=-1)

(right?)

that seems consistent and pretty reasonable to me

@ljwolf
Member

ljwolf commented Aug 9, 2024

build_distance_band(df, k, kernel='power', bandwidth=1)

Or, option 3:

build_distance_band(df, k, kernel='power', bandwidth=-1)

To achieve d**-1

@knaaptime
Member Author

I like it. No preference on positive/negative power.

In the end, we will indeed have a ton of overlap but it is all passed through _kernel anyway, so not necessarily implementation-wise. It may just be trickier to explain what the build_kernel is for given 95% of its functionality is covered by more explicit build methods.

i think it boils down to whether the kernel is truncated or not. For gaussian, exponential, and power kernels (or a user function), the distance band builder's result will be truncated (so the neighbor set will only include pairs up to threshold and the min weight will be f(threshold, threshold)), but build_kernel will have all-pairs neighbors and min weight f(max(dist), threshold). That's enough reason to keep both--cause then we have a way of getting to the bounded exponential/power without adding an argument 😆
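That truncation distinction can be sketched like this (an illustration with a hypothetical exponential kernel, not the library's implementation):

```python
import math

def exponential(d, bandwidth):
    # hypothetical kernel for illustration
    return math.exp(-d / bandwidth)

distances = [0.5, 1.0, 3.0]  # pairwise distances to one focal point
threshold = 1.0

# distance band: drop pairs beyond the threshold, then weight the rest
band_weights = [exponential(d, threshold) for d in distances if d <= threshold]

# kernel builder: keep all pairs, weight everything
kernel_weights = [exponential(d, threshold) for d in distances]

print(band_weights)    # the d=3.0 pair is cut; min weight is f(threshold, threshold)
print(kernel_weights)  # all pairs kept; min weight is f(max(dist), threshold)
```

The distance-band version bounds the support of the kernel, which is exactly the behavior you cannot reach through build_kernel without an extra argument.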

@lanselin
Member

lanselin commented Aug 9, 2024 via email

@knaaptime
Member Author

good point. I should've been clearer that we plan to leave the W class untouched, so anything in this discussion refers only to the Graph class (which is new enough that major changes shouldn't break any critical code--certainly no legacy stuff)

@lanselin
Member

lanselin commented Aug 9, 2024 via email

@knaaptime
Member Author

we're almost there! I'm pretty sure we have feature parity between Graph and W at this point, so just making some final API tweaks.

used directly, Graph isn't compatible with spreg just yet (because there's still an explicit check for W), though Graph.to_W() works fine if you want to kick the tires

@pedrovma
Member

pedrovma commented Aug 9, 2024

used directly, Graph isn't compatible with spreg just yet (because there's still an explicit check for W), though Graph.to_W() works fine if you want to kick the tires

Awesome! But we currently have a bunch of new stuff in the development version of spreg that affects nearly all files in the repository, so I would strongly advise against working on this at this stage. Let me check with Luc how we are for a new release, and it would be great to allow for both W and Graph.

@martinfleis
Member

i think levi's proposal would is to remove the alpha parameter and have everything work through a kernel with args (dist, threshold) (the way the power kernel is currently implemented in distance_band)

It makes sense to me, it is just the name bandwidth that is throwing me off, if it is used as alpha. That is not bandwidth any longer so passing the value through feels confusing.

Not saying that we should do that, but the cleanest would be build_distance_band(df, k, kernel='power', param=1) or something like that, with a name generic enough to reflect bandwidth or alpha, depending on the function used under the hood.

@ljwolf
Member

ljwolf commented Aug 12, 2024

I think we should try to phrase things generically when there is no prior reason to pick one notation over another. This is behind why other stats packages use names (cov, loc, shape, scale) rather than notational conventions (Sigma, beta, alpha, etc).

Alpha is called "k" in the scaling literature, and usually called "beta" in gravity model literature. For the Gaussian kernel, bandwidth is usually called "sigma." Do we expose those? I'm not a fan of exposing the "p" for Minkowski metrics, either, but we're trying to match the scipy.spatial API.

Reflecting on it, "alpha" is a decay parameter: large values are associated with stronger locality. To me, this implies decay is like an inverse bandwidth.

Hence, I think option 3 addresses this by making alpha = -1*bandwidth, while I find option 2 only makes sense if we constrain bandwidth to be positive.
