This repository has been archived by the owner on Oct 8, 2021. It is now read-only.

updates in graph_traversal #332

Merged
merged 17 commits into from
Apr 24, 2016


CarloLucibello
Contributor

@CarloLucibello CarloLucibello commented Apr 11, 2016

As a follow-up to the discussion in #329.

The crucial line here is the one that uses get to treat arrays and dictionaries on the same footing.

the "add neighborhood" commit got in by mistake, I'll remove it later
this also adds the neighborhood and egonet functions, based on breadth first search.

this also fixes #330
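
To make the `get` remark concrete, here is a minimal sketch in current Julia. It is not the PR's code: `vertex_color`, `dense`, and `sparse` are names assumed for illustration, and treating a missing entry as color 0 (unseen) is an assumption based on the colormap convention quoted in the review below.

```julia
# Hypothetical helper (not LightGraphs API): look up a vertex color,
# falling back to 0 (unseen) when the colormap has no entry for `v`.
vertex_color(colormap, v::Integer) = get(colormap, v, 0)

dense  = zeros(Int, 5)      # dense colormap: one slot per vertex
sparse = Dict{Int,Int}()    # sparse colormap: only touched vertices are stored

dense[2]  = 1               # mark vertex 2 as seen in both maps
sparse[2] = 1

# The same call works for both containers.
for cm in (dense, sparse)
    println(typeof(cm), ": ", vertex_color(cm, 2), " ", vertex_color(cm, 4))
end
```

The point of the design is that `get` with a default is defined in Base for both arrays and dictionaries, so the traversal code does not need a separate branch per container type.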

@CarloLucibello CarloLucibello changed the title from "[WIP] relax colormap in typetraversal" to "[WIP] relax colormap in traversal" Apr 11, 2016
@codecov-io

Current coverage is 97.64%

Merging #332 into master will decrease coverage by -0.18% as of a53bab0


visitor::SimpleGraphVisitor) # the visitor
graph::SimpleGraph, # the graph
queue::Vector{Int}, # an (initialized) queue that stores the active vertices
colormap, # an (initialized) color-map to indicate status of vertices (0=unseen, 1=seen, 2=closed)
Owner

This violates the LG principle of having type annotations for all parameters. Can we do something else?

Contributor

Does Union{AbstractArray,Dict} work? We could give that a name.

Contributor Author

Are we sure we want to go fully typed here? We lose flexibility and gain nothing in performance.

@CarloLucibello
Contributor Author

I made Dict{Int,Int}() the default for vertexcolormap, since some experiments with large erdos_renyi graphs and is_bipartite (which also relies on connected_components) showed up to 4x performance improvements in some cases and never a regression.

@jpfairbanks
Contributor

I think we should have three methods on this function: one with no type constraints, another with an AbstractArray{T}, and the third with Dict{Int,T}. The current code can go in the unconstrained method and the other two methods would provide the "types as documentation" that @sbromberger wants.

@sbromberger
Owner

@jpfairbanks My initial reaction is that this is probably the worst of all possible solutions.

I'd MAYBE be ok with Associative and AbstractArray, as long as one converted to the other (so we don't have specialized code).

I don't see the use case, though. Aside from some minor additional convenience, why are Int-based colormaps not workable? What's next, changing vertex types?

@jpfairbanks
Contributor

Do you dislike the type parameterization T or the three definitions?

@sbromberger
Owner

I'm honestly not a fan of either. What's wrong with integer colors? We use ints for coloring all over the place: partitions, biconnected components, .... this change will introduce an inconsistency AND make things more complex without an identified use case.

@jpfairbanks
Contributor

I will leave the contained type for now and say it can stay as ints.

Making the color map able to support both Dict and Vector is important for exactly the use case Carlo has found. If you want to do a BFS but only go a constant number of hops, then you shouldn't allocate O(nv) memory. That is why we need to support some sparse vector for the colormap. Carlo has some experiments that show Dict works well enough.
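
A rough sketch of that use case, written for current Julia and assuming a plain adjacency list instead of a LightGraphs graph: a hop-limited BFS whose only bookkeeping is a Dict, so memory grows with the number of vertices reached and not with nv. The name `neighborhood_sketch` and its signature are hypothetical; this is not the `neighborhood` function added by this PR.

```julia
# Hypothetical sketch: `adj[u]` lists the neighbors of vertex `u`.
# Returns the vertices within `maxhops` hops of `source`, in BFS order.
function neighborhood_sketch(adj::Vector{Vector{Int}}, source::Int, maxhops::Int)
    dist = Dict{Int,Int}(source => 0)   # sparse map: doubles as the "seen" set
    queue = [source]                    # grows only with the vertices reached
    head = 1                            # index of the next vertex to expand
    while head <= length(queue)
        u = queue[head]
        head += 1
        dist[u] == maxhops && continue  # do not expand past the hop limit
        for v in adj[u]
            if !haskey(dist, v)         # unseen vertex
                dist[v] = dist[u] + 1
                push!(queue, v)
            end
        end
    end
    return queue
end

# Path graph 1 - 2 - 3 - 4 - 5
path = [[2], [1, 3], [2, 4], [3, 5], [4]]
println(neighborhood_sketch(path, 1, 2))
```

Nothing in the function allocates an array of length nv, which is the property the comment above is asking for when the call is repeated many times on a large graph.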

@sbromberger
Owner

But here's the thing: if you don't want to allocate nv Ints (and, let's face it, for 1 million nodes that's 8 MB of 64-bit Ints) because you're concerned about memory consumption, I don't see how a Dict is going to help you, since you're at the same allocation if you traverse more than nv / 4 nodes. This becomes even worse when you use data types that are larger than 64 bits.

I'm missing something. What, specifically, is the use case with the limitation?

All else being equal, we want to limit memory allocation. However, in this case, we're faced with a tradeoff: get more efficient memory allocs in limited, specific defined cases, but (on the other hand) increase the code complexity AND create a situation where we get HORRIBLE memory allocs in some other defined cases. I'm not sure this is a good tradeoff.

@CarloLucibello
Contributor Author

The problem for me is not memory consumption but performance: the reason I started implementing neighborhood is that I need it inside an algorithm that has to call it many, many times. Each call would then allocate and deallocate O(nv) memory, and this would take most of the computational time.

Unless we want to optionally pass to neighborhood an initialized colormap, which would be really a questionable design choice, the only solution here is to allow both dicts and vectors for colormap.

I don't really see any problem in this:
What do we gain? More flexibility, reduced memory consumption, and improved or on-par performance in all the cases I tested (not many, but still something).
What do we lose? Code complexity? Not true: I only had to change two lines of code to generalize graph_traversal. Horrible memory allocation in some cases? I experienced quite the opposite:

####
julia> g=erdos_renyi(100_000,100_000)
{100000, 100000} undirected graph

# BEFORE
julia> @time is_bipartite(g)
  0.045502 seconds (84.06 k allocations: 10.419 MB, 21.31% gc time)
false

julia> @time is_bipartite(g)
  0.034474 seconds (84.06 k allocations: 10.419 MB)
false

#AFTER
julia> @time is_bipartite(g)
  0.037780 seconds (84.96 k allocations: 9.797 MB)
false

julia> @time is_bipartite(g)
  0.036506 seconds (84.96 k allocations: 9.797 MB)
false
#####
julia> g=erdos_renyi(1_000_000,10_000_000)
{1000000, 10000000} undirected graph

#BEFORE
julia> @time is_bipartite(g)
  1.038482 seconds (74 allocations: 40.599 MB, 0.41% gc time)
false

julia> @time is_bipartite(g)
  1.041845 seconds (74 allocations: 40.599 MB, 0.31% gc time)
false

#AFTER
julia> @time is_bipartite(g)
  1.086443 seconds (119 allocations: 33.325 MB, 0.21% gc time)
false

julia> @time is_bipartite(g)
  1.074744 seconds (119 allocations: 33.325 MB, 0.13% gc time)
false

# case in which it is actually bipartite
julia> g=erdos_renyi(100_000,10_000)
{100000, 10000} undirected graph

#BEFORE
julia> @time is_bipartite(g)
  0.232985 seconds (481.32 k allocations: 1.243 GB, 18.40% gc time)
true

julia> @time is_bipartite(g)
  0.278150 seconds (481.32 k allocations: 1.243 GB, 31.56% gc time)

#AFTER
julia> @time is_bipartite(g)
  0.099805 seconds (484.05 k allocations: 167.945 MB, 34.27% gc time)
true

julia> @time is_bipartite(g)
  0.080207 seconds (484.05 k allocations: 167.945 MB, 20.56% gc time)
true

These results are not entirely clear to me and should be investigated. In any case, there is no regression in any of the runs, and in one case there is a huge improvement using dicts.

"""
function components(labels::Vector{Int})
d = Dict{Int, Int}()
c = Vector{Vector{Int}}()
i = 1
for (v,l) in enumerate(labels)
-index = get(d, l, i)
-d[l] = index
+index = get!(d, l, i)
Owner

This is nice.
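
For readers unfamiliar with `get!`: it returns the stored value if the key exists, and otherwise inserts the default and returns it, which is why it can replace a `get` followed by an assignment. The sketch below shows the pattern in a self-contained grouping function for current Julia. `group_by_label` is an illustrative name, and the loop body after the `get!` call is an assumption about how such a function could continue, not the LightGraphs source.

```julia
# Hypothetical sketch: group vertex indices by their component label.
# Returns the groups and the map from label to group index.
function group_by_label(labels::Vector{Int})
    d = Dict{Int,Int}()          # label => index into `c`
    c = Vector{Vector{Int}}()    # one vector of vertices per label
    i = 1                        # index the next new label will receive
    for (v, l) in enumerate(labels)
        index = get!(d, l, i)    # look up, inserting `i` if `l` is new
        if length(c) >= index
            push!(c[index], v)   # known label: append to its group
        else
            push!(c, [v])        # new label: start a new group
            i += 1
        end
    end
    return c, d
end

groups, index_of = group_by_label([1, 1, 3, 3, 1])
println(groups)
```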

@sbromberger
Owner

@CarloLucibello your numbers are compelling, and my tests bear out what you're saying. Thanks for sticking with this.

Please see more comments in code. I think we can do this so that it remains consistent with LG standards (which should probably be codified at some point).

@jpfairbanks - I think I've been able to do this without type parameterization, of which I'm not a huge fan. Take a look?

@jpfairbanks
Contributor

@CarloLucibello this actually shows a deeper problem in is_bipartite. I agree that we should loosen the type for colormap to include Dict because of the neighborhood usecase. But this benchmark is actually diagnosing a different performance problem.

I know that I always think about the single connected component case and then do whatever it takes to apply that to each component. This strategy is efficient when the number of components is small. Based on reading the code, that is what happened in is_bipartite. Here is the code:

############################################
# Test graph for bipartiteness             #
############################################
type BipartiteVisitor <: SimpleGraphVisitor
    bipartitemap::Vector{UInt8}
    is_bipartite::Bool
end

BipartiteVisitor(n::Int) = BipartiteVisitor(zeros(UInt8,n), true)

function examine_neighbor!(visitor::BipartiteVisitor, u::Int, v::Int, vcolor::Int, ecolor::Int)
    if vcolor == 0
        visitor.bipartitemap[v] = (visitor.bipartitemap[u] == 1)? 2:1
    else
        if visitor.bipartitemap[v] == visitor.bipartitemap[u]
            visitor.is_bipartite = false
        end
    end
    return visitor.is_bipartite
end

"""Will return `true` if graph `g` is
[bipartite](https://en.wikipedia.org/wiki/Bipartite_graph).
"""
is_bipartite(g::SimpleGraph, s::Int) = _bipartite_visitor(g, s).is_bipartite

function is_bipartite(g::SimpleGraph)
    cc = filter(x->length(x)>2, connected_components(g))
    return all(x->is_bipartite(g,x[1]), cc)
end

function _bipartite_visitor(g::SimpleGraph, s::Int)
    nvg = nv(g)
    visitor = BipartiteVisitor(nvg)
    traverse_graph(g, BreadthFirst(), s, visitor)
    return visitor
end

This code is correct and efficient when the number of connected components is small.
But when there are many small components, as in the case erdos_renyi(n,m) where m < n, it is not efficient. The reason is that each connected component leads to a call to _bipartite_visitor(g, s), which allocates a length-nv array for the visitor. Then traverse_graph(g, BreadthFirst(), s, visitor) also allocates a fresh queue on each invocation because we are not reusing the memory for the queue.

The implementation with Dict alleviates one of these problems because _bipartite_visitor will only allocate an empty Dict and an empty queue which is O(1) memory per connected component. You still are not reusing the same memory between connected components calls so the implementation is not optimal, but that is a lower order effect. The optimal solution to accelerating is_bipartite is to reuse the bipartitemap vector for all the connected components.

We should fix this and add a note to the docs saying that if you notice terrible performance for many small connected components, then it is probably a bug. But we can also relax the colormap to be AbstractArray or Associative.
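
As an illustration of the "reuse one array" idea, here is a minimal sketch for current Julia. It assumes a plain adjacency list instead of a LightGraphs graph and does not use the visitor machinery, so it is not the fix that went into this PR; `is_bipartite_sketch` and `side` are hypothetical names.

```julia
# Hypothetical sketch: 2-color every connected component with BFS,
# sharing a single `side` vector and a single queue across components.
function is_bipartite_sketch(adj::Vector{Vector{Int}})
    n = length(adj)
    side = zeros(UInt8, n)    # 0 = unvisited, 1 and 2 = the two sides
    queue = Int[]             # allocated once, emptied per component
    for s in 1:n
        side[s] != 0 && continue      # already colored by an earlier BFS
        side[s] = 1
        empty!(queue)
        push!(queue, s)
        while !isempty(queue)
            u = popfirst!(queue)
            for v in adj[u]
                if side[v] == 0
                    side[v] = side[u] == 1 ? 2 : 1   # opposite side of u
                    push!(queue, v)
                elseif side[v] == side[u]
                    return false      # edge inside one side: odd cycle
                end
            end
        end
    end
    return true
end

path_graph = [[2], [1, 3], [2]]
triangle   = [[2, 3], [1, 3], [1, 2]]
println(is_bipartite_sketch(path_graph), " ", is_bipartite_sketch(triangle))
```

With this layout the total allocation is one vector of length nv regardless of how many components the graph has, which is the behavior described above as optimal.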

@sbromberger
Owner

@jpfairbanks if I'm understanding correctly, though, this isn't too big an issue since the sizes of the connected components sum up to nv(g) in any case, so the array allocation doesn't really impact memory, just the number of allocations which we have in any case. Is this not right?

@CarloLucibello
Contributor Author

@sbromberger I think you are right
@jpfairbanks I'll fix is_bipartite in this PR as you suggested

So, with regard to vertexcolormap and edgecolormap, I wouldn't even bother parametrizing them, but from the above discussion it seems you are inclined toward <: Union{AbstractArray,Associative} and <:Associative respectively, right?

PS: I didn't know Associative existed, nice to hear.

@jpfairbanks
Contributor

@sbromberger yeah the number of large allocations is the problem. using a dict reduces the sizes of each allocation and reusing one big array is optimal because it allocates the right amount of memory all at once.

@CarloLucibello yes.

@sbromberger
Owner

Wait, no. Why Union? Why not Dict{Int,Int} as it currently stands? If we're going to do it, let's do it.

@jpfairbanks
Contributor

We still want some things to use Arrays, such as connected components where it is easy and efficient to use Array{Int, 1}. If you really want strict typing, Union{Array{Int,1}, Dict{Int,Int}} gets the job done.
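
A small sketch of what a strictly typed Union signature could look like, in current Julia syntax. The alias name `VertexColorMap` and the function `mark_seen!` are assumptions made for illustration and are not part of LightGraphs. Note that `Associative`, mentioned elsewhere in this thread, was renamed `AbstractDict` in later Julia versions.

```julia
# Hypothetical alias: accept either a dense or a sparse integer colormap.
const VertexColorMap = Union{Vector{Int}, Dict{Int,Int}}

# Hypothetical function: mark vertex `v` as seen (color 1).
# The annotation documents the accepted containers; the body is shared.
function mark_seen!(colormap::VertexColorMap, v::Int)
    colormap[v] = 1
    return colormap
end

println(mark_seen!(zeros(Int, 3), 2))
println(mark_seen!(Dict{Int,Int}(), 2))
```

Other containers, such as a `Vector{Float64}`, have no matching method and are rejected at dispatch, which gives the "types as documentation" effect without writing specialized code per container.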

function is_bipartite(g::SimpleGraph)
-cc = filter(x->length(x)>2, connected_components(g))
-return all(x->is_bipartite(g,x[1]), cc)
+vmap = Dict{Int,Int}()
Contributor

This doesn't need to be a Dict because it will be full at the end. A plain old array will work fine. The memory reuse is correct now.

Contributor Author

it will be full only in the case the graph is actually bipartite. It is hard to say what is the most efficient choice on average here

@CarloLucibello CarloLucibello changed the title from "[WIP] relax colormap in traversal" to "updates in graph_traversal" Apr 18, 2016
@CarloLucibello CarloLucibello mentioned this pull request Apr 18, 2016
@@ -123,7 +133,7 @@ function discover_vertex!(vis::TarjanVisitor, v)
end

function examine_neighbor!(vis::TarjanVisitor, v, w, w_color::Int, e_color::Int)
-if w_color > 0 # 1 means added seen, but not explored; 2 means closed
+if w_color >= 0 # >=0 means added seen
Owner

Contributor Author

good catch! thanks

@CarloLucibello
Contributor Author

This should be ready to merge. I know it has 17 commits, but even with careful squashing I wouldn't go with fewer than 10, since there are a lot of incremental fixes that are explained in the commit messages. I'd rather not go through the pain of cherry squashing, but I can come up with just one big commit if asked.

@CarloLucibello
Contributor Author

bump on this

@jpfairbanks
Contributor

tests pass => 🚢 it

@sbromberger sbromberger merged commit 0cf3103 into sbromberger:master Apr 24, 2016
@sbromberger
Owner

sbromberger commented Apr 24, 2016

Awesome. Thanks, Dr. @CarloLucibello! (and also Dr. @jpfairbanks for the review.)

Successfully merging this pull request may close these issues.

traverse_graph is not pure and should end with !
4 participants