updates in graph_traversal #332

CarloLucibello · 2016-04-11T08:41:20Z

as a follow up of the discussion in #329

the crucial line here is the one that uses get to treat arrays and dictionaries on the same footing
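A minimal sketch of that idea in present-day Julia, where `Base.get` accepts both an array with an integer index and a dictionary with a key. The helper name `vertex_color` is hypothetical and is not taken from the PR.

```julia
# Hypothetical helper, not taken from the PR: look up the color of vertex `v`,
# treating a missing entry as 0 (unseen).
vertex_color(colormap, v::Int) = get(colormap, v, 0)

dense  = [0, 1, 2]        # Vector colormap: one slot per vertex
sparse = Dict(2 => 1)     # Dict colormap: only touched vertices are stored

vertex_color(dense, 2)    # color stored at index 2
vertex_color(sparse, 2)   # color stored under key 2
vertex_color(sparse, 3)   # key absent, so the default 0 is returned
```

Because the lookup goes through `get` with a default, the calling code does not need to know which container it was handed.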

~~the "add neighborhood" commit got in by mistake, I'll remove it later~~
this also adds the neighborhood and egonet functions, based on breadth first search.

this also fixes #330

codecov-io · 2016-04-11T08:53:27Z

Current coverage is `97.64%`

Merging #332 into master will decrease coverage by 0.18% as of a53bab0


sbromberger · 2016-04-11T15:55:43Z

src/traversals/bfs.jl

-    visitor::SimpleGraphVisitor)  # the visitor
+    graph::SimpleGraph,                 # the graph
+    queue::Vector{Int},                 # an (initialized) queue that stores the active vertices
+    colormap,                           # an (initialized) color-map to indicate status of vertices (0=unseen, 1=seen, 2=closed)


This violates the LG principle of having type annotations for all parameters. Can we do something else?

Does Union{AbstractArray,Dict} work? We could give that a name.

are we sure we want to go fully typed here? we lose flexibility and gain nothing in performance

CarloLucibello · 2016-04-12T08:20:03Z

I made Dict{Int,Int}() the default for vertexcolormap, since some experiments with large erdos_renyi graphs and is_bipartite (which also relies on connected_components) showed up to 4x performance improvements in some cases and never a regression.

jpfairbanks · 2016-04-12T11:20:44Z

I think we should have three methods on this function: one with no type constraints, another with an AbstractArray{T}, and the third with Dict{Int,T}. The current code can go in the unconstrained method and the other two methods would provide the "types as documentation" that @sbromberger wants.

sbromberger · 2016-04-12T16:02:43Z

@jpfairbanks My initial reaction is that this is probably the worst of all possible solutions.

I'd MAYBE be ok with Associative and AbstractArray, as long as one converted to the other (so we don't have specialized code).

I don't see the use case, though. Aside from some minor additional convenience, why are Int-based colormaps not workable? What's next, changing vertex types?

jpfairbanks · 2016-04-12T17:45:29Z

Do you dislike the type parameterization T or the three definitions?

sbromberger · 2016-04-12T22:39:00Z

I'm honestly not a fan of either. What's wrong with integer colors? We use ints for coloring all over the place: partitions, biconnected components, .... this change will introduce an inconsistency AND make things more complex without an identified use case.

jpfairbanks · 2016-04-13T00:48:42Z

I will leave the contained type for now and say it can stay as ints.

Making the color map able to support both Dict and Vector is important for exactly the use case Carlo has found. If you want to do a BFS but only go a constant number of hops, then you shouldn't allocate O(nv) memory. That is why we need to support some sparse vector for the colormap. Carlo has some experiments that show Dict works well enough.
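The hop-limited case can be sketched as follows. This is an illustration only: `adj` is a plain adjacency list standing in for a graph, and `neighborhood_sketch` is a hypothetical name, not the `neighborhood` function added in this PR.

```julia
# Sketch only: BFS that stops expanding at `maxhops`, using a Dict so that
# memory grows with the number of visited vertices rather than with nv.
function neighborhood_sketch(adj::Vector{Vector{Int}}, source::Int, maxhops::Int)
    dist = Dict{Int,Int}(source => 0)   # sparse map: only visited vertices appear
    queue = [source]
    head = 1                            # index of the next vertex to expand
    while head <= length(queue)
        u = queue[head]
        head += 1
        dist[u] == maxhops && continue  # do not expand past the hop limit
        for v in adj[u]
            if !haskey(dist, v)
                dist[v] = dist[u] + 1
                push!(queue, v)
            end
        end
    end
    return queue                        # vertices within `maxhops` of `source`, in BFS order
end

# Path graph 1 - 2 - 3 - 4 - 5
adj = [[2], [1, 3], [2, 4], [3, 5], [4]]
neighborhood_sketch(adj, 3, 1)
```

With a dense vector in place of `dist`, every call would have to allocate and zero `length(adj)` entries even when only a handful of vertices are reached.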

sbromberger · 2016-04-13T01:47:38Z

But here's the thing: if you don't want to allocate nv Ints (and, let's face it, for 1 million nodes that's 8 MB with 64-bit Ints) because you're concerned about memory consumption, I don't see how a Dict is going to help you, since you're at the same allocation if you traverse more than nv / 4 nodes. This becomes even worse when you use data types that are larger than 64 bits.

I'm missing something. What, specifically, is the use case with the limitation?

All else being equal, we want to limit memory allocation. However, in this case, we're faced with a tradeoff: get more efficient memory allocs in limited, specific defined cases, but (on the other hand) increase the code complexity AND create a situation where we get HORRIBLE memory allocs in some other defined cases. I'm not sure this is a good tradeoff.

CarloLucibello · 2016-04-13T09:54:49Z

the problem to me is not memory consumption, it is performance: the reason I started implementing neighborhood is that I need it inside an algorithm that has to call it many many times. Now, each call would imply allocating and deallocating O(nv) memory, and this would take most of the computational time.

Unless we want to optionally pass to neighborhood an initialized colormap, which would be really a questionable design choice, the only solution here is to allow both dicts and vectors for colormap.

I don't really see any problem in this:
what do we gain? more flexibility, reduced memory consumption, and improved or on-par performance in all the cases I tested (not many, but still something)
what do we lose? code complexity? not true, I just had to change two lines of code to generalize graph_traversal. Horrible memory allocation in some cases? I experienced quite the opposite:

####
julia> g=erdos_renyi(100_000,100_000)
{100000, 100000} undirected graph

# BEFORE
julia> @time is_bipartite(g)
  0.045502 seconds (84.06 k allocations: 10.419 MB, 21.31% gc time)
false

julia> @time is_bipartite(g)
  0.034474 seconds (84.06 k allocations: 10.419 MB)
false

#AFTER
julia> @time is_bipartite(g)
  0.037780 seconds (84.96 k allocations: 9.797 MB)
false

julia> @time is_bipartite(g)
  0.036506 seconds (84.96 k allocations: 9.797 MB)
false
#####
julia> g=erdos_renyi(1_000_000,10_000_000)
{1000000, 10000000} undirected graph

#BEFORE
julia> @time is_bipartite(g)
  1.038482 seconds (74 allocations: 40.599 MB, 0.41% gc time)
false

julia> @time is_bipartite(g)
  1.041845 seconds (74 allocations: 40.599 MB, 0.31% gc time)
false

#AFTER
julia> @time is_bipartite(g)
  1.086443 seconds (119 allocations: 33.325 MB, 0.21% gc time)
false

julia> @time is_bipartite(g)
  1.074744 seconds (119 allocations: 33.325 MB, 0.13% gc time)
false

# case in which it is actually bipartite
julia> g=erdos_renyi(100_000,10_000)
{100000, 10000} undirected graph

#BEFORE
julia> @time is_bipartite(g)
  0.232985 seconds (481.32 k allocations: 1.243 GB, 18.40% gc time)
true

julia> @time is_bipartite(g)
  0.278150 seconds (481.32 k allocations: 1.243 GB, 31.56% gc time)

#AFTER
julia> @time is_bipartite(g)
  0.099805 seconds (484.05 k allocations: 167.945 MB, 34.27% gc time)
true

julia> @time is_bipartite(g)
  0.080207 seconds (484.05 k allocations: 167.945 MB, 20.56% gc time)
true

These results are not entirely clear to me and should be investigated. In any case, none of them shows a regression, and in one case there is a huge improvement using dicts.

sbromberger · 2016-04-13T13:49:54Z

src/connectivity.jl

 """
 function components(labels::Vector{Int})
    d = Dict{Int, Int}()
    c = Vector{Vector{Int}}()
    i = 1
    for (v,l) in enumerate(labels)
-        index = get(d, l, i)
-        d[l] = index
+        index = get!(d, l, i)


This is nice.
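For readers unfamiliar with the difference, a small sketch of `get` versus `get!` on a `Dict` (the keys and values are arbitrary illustration data):

```julia
d = Dict{Int,Int}()

get(d, 7, 1)      # returns the default 1 but leaves `d` empty
isempty(d)        # still true

get!(d, 7, 1)     # returns 1 and also stores 7 => 1 in `d`
get!(d, 7, 99)    # key already present, so the stored value is returned
```

The single `get!` call therefore replaces the lookup-then-store pair removed in the diff.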

sbromberger · 2016-04-13T14:09:57Z

@CarloLucibello your numbers are compelling, and my tests bear out what you're saying. Thanks for sticking with this.

Please see more comments in code. I think we can do this so that it remains consistent with LG standards (which should probably be codified at some point).

@jpfairbanks - I think I've been able to do this without type parameterization, of which I'm not a huge fan. Take a look?

jpfairbanks · 2016-04-13T17:20:33Z

@CarloLucibello this actually shows a deeper problem in is_bipartite. I agree that we should loosen the type for colormap to include Dict because of the neighborhood usecase. But this benchmark is actually diagnosing a different performance problem.

I know that I always think about the single connected component case and then do whatever it takes to apply that to each component. This strategy is efficient when the number of components is small. Based on reading the code, that is what happened in is_bipartite. Here is the code:

############################################
# Test graph for bipartiteness             #
############################################
type BipartiteVisitor <: SimpleGraphVisitor
    bipartitemap::Vector{UInt8}
    is_bipartite::Bool
end

BipartiteVisitor(n::Int) = BipartiteVisitor(zeros(UInt8,n), true)

function examine_neighbor!(visitor::BipartiteVisitor, u::Int, v::Int, vcolor::Int, ecolor::Int)
    if vcolor == 0
        visitor.bipartitemap[v] = (visitor.bipartitemap[u] == 1)? 2:1
    else
        if visitor.bipartitemap[v] == visitor.bipartitemap[u]
            visitor.is_bipartite = false
        end
    end
    return visitor.is_bipartite
end

"""Will return `true` if graph `g` is
[bipartite](https://en.wikipedia.org/wiki/Bipartite_graph).
"""
is_bipartite(g::SimpleGraph, s::Int) = _bipartite_visitor(g, s).is_bipartite

function is_bipartite(g::SimpleGraph)
    cc = filter(x->length(x)>2, connected_components(g))
    return all(x->is_bipartite(g,x[1]), cc)
end

function _bipartite_visitor(g::SimpleGraph, s::Int)
    nvg = nv(g)
    visitor = BipartiteVisitor(nvg)
    traverse_graph(g, BreadthFirst(), s, visitor)
    return visitor
end

This code is correct and efficient when the number of connected components is small.
But when there are many small components, as in the case of erdos_renyi(n,m) with m < n, it is not efficient. The reason is that each connected component leads to a call to _bipartite_visitor(g, s), which allocates a length-nv array for the visitor. Then traverse_graph(g, BreadthFirst(), s, visitor) also allocates a fresh queue on each invocation, because we are not reusing the memory for the queue.

The implementation with Dict alleviates one of these problems because _bipartite_visitor will only allocate an empty Dict and an empty queue which is O(1) memory per connected component. You still are not reusing the same memory between connected components calls so the implementation is not optimal, but that is a lower order effect. The optimal solution to accelerating is_bipartite is to reuse the bipartitemap vector for all the connected components.
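A rough sketch of the reuse idea, written against a plain adjacency list rather than the LightGraphs types. The function name `is_bipartite_sketch` is hypothetical, and this is not the code that ended up in the PR.

```julia
# Sketch under assumptions: one `side` vector and one queue are allocated once
# and shared by every connected component.
function is_bipartite_sketch(adj::Vector{Vector{Int}})
    side = zeros(UInt8, length(adj))    # 0 = unseen, 1 and 2 = the two sides
    queue = Int[]
    for s in 1:length(adj)
        side[s] != 0 && continue        # already reached from an earlier start vertex
        side[s] = 1
        empty!(queue)                   # reuse the same queue object
        push!(queue, s)
        head = 1
        while head <= length(queue)
            u = queue[head]
            head += 1
            for v in adj[u]
                if side[v] == 0
                    side[v] = side[u] == 1 ? 2 : 1
                    push!(queue, v)
                elseif side[v] == side[u]
                    return false        # an edge joins two vertices on the same side
                end
            end
        end
    end
    return true
end

is_bipartite_sketch([[2], [1], [4], [3], Int[]])   # two edges plus an isolated vertex
is_bipartite_sketch([[2, 3], [1, 3], [1, 2]])      # a triangle
```

The total work is proportional to the number of vertices plus edges, regardless of how many components the graph has.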

We should fix this and add a note to the docs saying that if you notice terrible performance for many small connected components, then it is probably a bug. But we can also relax the colormap to be AbstractArray or Associative.

sbromberger · 2016-04-13T18:49:52Z

@jpfairbanks if I'm understanding correctly, though, this isn't too big an issue since the sizes of the connected components sum up to nv(g) in any case, so the array allocation doesn't really impact memory, just the number of allocations which we have in any case. Is this not right?

CarloLucibello · 2016-04-13T20:17:19Z

@sbromberger I think you are right
@jpfairbanks I'll fix is_bipartite in this PR as you suggested

so, with regard to vertexcolormap and edgecolormap, I wouldn't even bother parametrizing them, but from the above discussion it seems you are inclined towards <: Union{AbstractArray,Associative} and <: Associative respectively, right?

PS I didn't know the existence of Associative, nice to hear

jpfairbanks · 2016-04-13T21:53:24Z

@sbromberger yeah the number of large allocations is the problem. using a dict reduces the sizes of each allocation and reusing one big array is optimal because it allocates the right amount of memory all at once.

@CarloLucibello yes.

sbromberger · 2016-04-13T22:23:35Z

Wait, no. Why Union? Why not Dict{Int,Int} as it currently stands? If we're going to do it, let's do it.

jpfairbanks · 2016-04-15T21:34:43Z

We still want some things to use Arrays, such as connected components where it is easy and efficient to use Array{Int, 1}. If you really want strict typing, Union{Array{Int,1}, Dict{Int,Int}} gets the job done.
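A sketch of what such a strictly typed signature could look like in current Julia. The alias `VertexColorMapSketch` and the function `mark_seen!` are hypothetical names chosen for illustration; they are not from the PR.

```julia
# Hypothetical alias, not taken from the PR.
const VertexColorMapSketch = Union{Vector{Int},Dict{Int,Int}}

# One method covers both containers, since `get` and indexed assignment
# are defined for each of them.
function mark_seen!(colormap::VertexColorMapSketch, v::Int)
    get(colormap, v, 0) == 0 && (colormap[v] = 1)
    return colormap
end

mark_seen!(zeros(Int, 4), 2)      # dense: preallocated, one slot per vertex
mark_seen!(Dict{Int,Int}(), 2)    # sparse: grows only as vertices are touched
```

Any other container type, for example a `Dict{String,Int}`, is rejected at dispatch, which is the "types as documentation" effect discussed earlier in the thread.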

jpfairbanks · 2016-04-16T20:29:24Z

src/traversals/bfs.jl

 function is_bipartite(g::SimpleGraph)
    cc = filter(x->length(x)>2, connected_components(g))
-    return all(x->is_bipartite(g,x[1]), cc)
+    vmap = Dict{Int,Int}()


This doesn't need to be a Dict because it will be full at the end. A plain old array will work fine. The memory reuse is correct now.

it will be full only in the case the graph is actually bipartite. It is hard to say what is the most efficient choice on average here

sbromberger · 2016-04-18T14:29:21Z

src/connectivity.jl

@@ -123,7 +133,7 @@ function discover_vertex!(vis::TarjanVisitor, v)
 end

 function examine_neighbor!(vis::TarjanVisitor, v, w, w_color::Int, e_color::Int)
-    if w_color > 0 # 1 means added seen, but not explored; 2 means closed
+    if w_color >= 0 # >=0 means added seen


Careful. See #71, JuliaAttic/OldGraphs.jl#187, and JuliaAttic/OldGraphs.jl#188.

good catch! thanks

CarloLucibello · 2016-04-19T09:29:35Z

This should be ready to merge. I know, it has 17 commits, but even with careful squashing I wouldn't go with fewer than 10, since there are a lot of incremental fixes that are explained in the commit messages. I wouldn't go through the pain of cherry squashing, but I can come up with just one big commit if asked.

CarloLucibello · 2016-04-23T14:42:26Z

bump on this

jpfairbanks · 2016-04-23T17:07:47Z

tests pass => 🚢 it

sbromberger · 2016-04-24T02:22:35Z

Awesome. Thanks, Dr. @CarloLucibello! (and also Dr. @jpfairbanks for the review.)

CarloLucibello changed the title ~~[WIP] relax colormap in typetraversal~~ [WIP] relax colormap in traversal Apr 11, 2016

sbromberger reviewed Apr 11, 2016

CarloLucibello mentioned this pull request Apr 11, 2016

missing edgecolormap in BFS traversal #331

Closed

CarloLucibello added 7 commits April 12, 2016 09:59

small improvement to connected_components

9873d0f

add neighborhood

042c7b5

traverse_graph -> traverse_graph!

4a09f3b

relax colormap type in traversal

d522ed1

-add edgecolormap to BFS

cec121f

- relax type of edgecolormap

add DummyEdgeMap and make it default

d55b136

make Dict default vertexcolormap

7d5e2d9

CarloLucibello force-pushed the visit branch from 4a4680a to 7d5e2d9 Compare April 12, 2016 08:15

CarloLucibello added 3 commits April 12, 2016 10:30

remove TreeBFSVisitor : end deprecation

13a9676

make vertexcolormap indicate distances from root in BFS

33b7a8e

simplify gdistances relying on vertecolormap

c571fdf

sbromberger reviewed Apr 13, 2016

CarloLucibello added 5 commits April 14, 2016 07:30

- add optional dir=:out/:in for BFS

6443079

- reimplement neighborhood using BFS

add egonet

c4db3ff

improve is_bipartite performance

bcf3afc

update docs

4f4b4d3

another improve for is_bipartite

d2afba9

jpfairbanks reviewed Apr 16, 2016

add AbstractEdgeMap and AbstractVertexMap and use them in graph_traversal

cc42c7d

CarloLucibello changed the title ~~[WIP] relax colormap in traversal~~ updates in graph_traversal Apr 18, 2016

CarloLucibello mentioned this pull request Apr 18, 2016

add neighborhood #329

Closed

sbromberger reviewed Apr 18, 2016

uniform color convention for BFS and DFS

e8d7351

sbromberger merged commit 0cf3103 into sbromberger:master Apr 24, 2016

CarloLucibello mentioned this pull request Apr 29, 2016

strongly connected components are not strongly connected? #348

Closed
