Knot detection is useful to find hunks of the graph that are either cyclically orphaned from the rest of the graph or that do not allow the reader to escape once they have entered the "event horizon."
If you reverse the direction of the links, you can use a knot detection algorithm to find portions of the graph that aren't linked to from anywhere outside the strongly connected portion. This can help find orphans that are only reachable through BackLinks.
Once the knots/orphans have been found, steps can be taken to "untie" them so that the PageDatabase is more unified.
[CategoryGraphTheory] [CategoryIndexingScheme]
The ShortestPathPages algorithm can detect knots, but it's not very good at it.
After poking around on the Internet for a while, I've discovered the only decent algorithm for detecting all knots in a graph is the distributed version, which I would have to purchase from the IEEE. It isn't really very simple nor necessarily efficient. So, I figure I might as well come up with my own. I'll just keep a few notes here until I derive something. First, definitions.
Diffusion propogates information throughout the network according to some rules in order to detect certain properties. The rules are the key. In this case, we attempt to find ties.
Given that, if we give each node n a unique colour n' which we propogate down through the directed graph it roots, we can find all the ties very quickly. For each link l = s->t of each node t, if s has not been coloured with t' then there is a tie T named t' which includes all the nodes coloured t'.
The trouble with that algorithm is that it is very inefficient. In the worst case, a cyclic graph, you require N squared amount of storage, where N is the number of nodes, and the algorithm runs in O(N squared) time. [I have a hunch, actually that a hypercube will run in worse time.] For a graph as large as MeatballWiki's, this is impractical.
We might improve efficiency by noting that if a node m is coloured n', then all the nodes M coloured m' are also coloured n'. This is not to say that we should recolour all of M to be exclusively n' as m may form part of the event horizon between n' and m'. However, this observation doesn't help much unless we can somehow compress the information. Otherwise, the algorithm would be basically the same as computing the minimum spanning tree of the graph.
We might also just try to construct recursively the system of knots and ties based on the following properties:
Alone, these two properties can only detect ties that are directed acyclic graphs (of ties). However, also note that
If we begin by identifying the leaves of the graph G and the ties that are DAGs built on top of these leaves, and then removing them, then either the entire graph G is such a DAG and we're done or there must be a cycle somewhere in the graph. We know this is the case because if there weren't a cycle in the directed graph, the entire graph is a directed acyclic graph. Therefore, there must be a largest cycle somewhere in G after removing the initial ties.
I envision this algorithm working by creating a graph with two node types. One type is the normal type, representing a page, and the second type is a tie type representing a tie subgraph. We organize the nodes in a stack of lists of nodes. Each layer contains page nodes with the same number of child page nodes.
Initially, layer 0 would include only the leaves, as they have no children at all. We convert all of these to tie nodes and remove them, but in the process we move each of their parents down one layer. This is because we have changed one of their children from a page node to a tie node, reducing the number of their child page nodes by one.
We continue until layer 0 is empty. This will actually identify all the directed acyclic graphs built from leaves. Once this is done, there is either one tie node left representing the entire graph (and the stack is empty) or there is a cycle somewhere in the graph. In fact, if you start from any page node left in the graph and traverse a path from page node to page node, you are guaranteed to eventually form a cycle. If you couldn't, then there must be an end node without any child page nodes, and that would be a new tie.
Conjecture In this new graph, an edge between page nodes either belongs to a cycle or forms part of the event horizon between a tie and a larger graph.
I have no idea what the IEEE method is. But how about this:
If we just start at a random node, and run along a random path, we either find a cycle or get a tie (or communicating class of size one) at the last node. If we find a cycle, we identify those nodes and carry on. If we run out of routes out of a communicating class, we mark it as a tie and, again, never reenter it.
To store this, we need:
Hence the memory requirements are O(N); since we never reenter a node after leaving it, the execution time should be O(E+N), E the number of edges in the graph.
At each step, you try to follow a new edge from the last node (cc) on the path. If that new edge leads to an exhausted node, ignore it. If it leads to an edge (cc) already on the path, we have extended our cc. Otherwise, it's a new node and we carry on from there.
If we have extended our cc, we need to take all the nodes on the cycle, pick one and give its cc number to all the other nodes in the cycle, as well as adding them to its linked list. Having done this, we replace all those nodes on the path we're constructing by the single one representing the cc.
If we run out of untried edges out of a cc, it's a tie, so we mark it as such and back out of it.
If we have backed out of our entire path, which must happen sooner or later as we run out of edges, we check to see if there are any entirely unnoticed edges left, and repeat the whole shebang starting from them if so. If not, we've finished.
To save scanning the entire path each time we find a new node to see if we've found a cycle, we can just store the fact it's on our path alongside the node.
A node can only ever be in one of these states:
How does this sound? -- ChrisPurcell
What happens if you're traversing the path P=<a1,a2,...> and you find a cycle C a6,a7,...,ai. Either a5->a6 is the event horizon of a tie, or a5 could be part of a larger cycle C' that incorporates part of C. It's not clear how the cycle part of your algorithm would behave if it encounters another part of P. This is not to say you aren't on the right track, just that it's not straightforward at the moment. -- SunirShah
Sorry, when I said "cycle" I meant any set of communicating nodes. Can't think of a better word for that right now. If you have such a C' then the cycle a6..ai will have a path back to a5 off one of its nodes, so the algorithm will find it, because when you search for edges out of a cycle you look through all the nodes in that cycle. -- ChrisPurcell
Also, even if you identify all the nodes linked to from a cycle C, there may be a cycle C' inside C that is tied off from C. Think concentric circles, with links only going from outer rings to inner rings. I don't think your algorithm detects these. This is the point where I'm stuck with my construction algorithm. -- SunirShah
If C' is tied off from C, it cannot be a part of it. Every cycle is communicating, since you can use it to travel from every node in it to every other node. -- ChrisPurcell
I'm confused now. You're equivocating the terms strongly connected and cycle. A cycle is always strongly connected, but a strongly connected subgraph is not necessarily a cycle. -- SunirShah
Ah, I'm probably way off with my notation. My algorithm finds all the communicating classes; every communicating class can be covered by a cycle, possibly one repeating nodes (can't remember if this is allowed in the usual definition of a cycle, but hey). In your concentric circles example, the circles are the ccs found by my algorithm.
It won't tell you which ccs have event-horizon edges to other ccs as I've written it. Not hard to add, though, especially since if you order the classes by when they're marked as exhausted, the direction of the event-horizon edges will be aligned with this ordering.
I rewrote my exposition to not abuse the word "cycle". Am I being any less confusing? -- ChrisPurcell