Ever wondered how your disjoint-set data looked like visually? If you know what this data structure is, you can skip to the Visualization section, else if you’re curious what this is about and how that can help you, proceed to the overview.
Quick Overview of Disjoint-set Data Structure
The Algorithms, Part 1 Coursera class taught by Robert Sedgewick presents this data structure better than I could. I recommend it to you either as a starter on algorithms or as a refresh. They actually call it Union-Find in this class, most likely for its intended usage rather than its structure. Because this data abstraction is meant to resolve a connectivity problem. For example, can you tell if p and q are connected in the following maze?
To come up with a way that provides consistent result is what the disjoint-set (or union-find) data structure achieves. This data structure is a collection (or forest) of trees of connected nodes. Each node is identified by a number and each contain a number value to target its parent, if any. If it’s a parent itself or it is a single tree node, then the value of the node will be its own ID. The following image shows from the start how you could end up with a single tree structure by connecting every node.
To recap, all nodes that are part of a tree are connected between each others. Hence, the p and q nodes should be part of the same tree if those are connected. If they are not, they will be part of two different tree structures.
If you think about the requirements of the data structure that will store this disjoint-set forest, it looks like a perfect match for an array of integer:
- Every node is identified by an integer.
- Every node contains an integer that targets its parent (or possibly itself if it is a root).
So an array of integer could be an optimized data structure to represent our trees of connected nodes. It’s not easy to visually represent this as the number grows (and the subject of this post is to help on that regard). But this image should help you to translate the trees of connected nodes into an array of integer:
The left side displays the tree that the array of integer on the right side represents. The upper part of the array indicates the index (or node identification) and the lower part its value (or parent node). For example, the node 0 has node 1 as its parent or, the array position 0 has a value of 1. Furthermore, the node 1 is a parent or, the array position 1 also contains a value of 1 (to indicate it self-references and hence is a tree root).
The lower section of the image demonstrates how an union between node 5 and 9 works out, each of them initially part of different trees. There are considerations on which tree should connect with which one for performance reasons. Some details about its implementation are explained on the Wikipedia page dedicated to it but there is more meat in the books on Algorithms.
With this data structure, the big-O notation (without path compression) is O(log n) for connecting and finding out connections between two nodes, hence it scales pretty well. The usage of this data abstraction can vary, but it could be used to find out if two persons in your social network are connected and how many levels distance them, for example.
The main purpose of this post is to talk about the data visualization of this data structure. My curiosity was sparked on how could be rendered such a data structure, depending on the ordering of the unions or how large it is. It’s not easy to figure out the forest of trees when you look at the array of integers, as this is not a natural human representation.
I was wondering if the algorithm for the disjoint-set data could be improved as I was working on the Percolation assignment of the Algorithms, Part 1 Coursera class. The algorithm is used to solve this problem. Percolation refers to the movement and filtering of fluids through porous materials; could water fall down through the cracks? Would it percolate?
I wanted to perfect the algorithm to prevent a certain problem with backwashing (it’s not important to explain here but basically, it’s when percolation happens but connected cases from the bottom and unconnected to the top are filled with fluid; it should not happen!). But I found it hard to improve it without visually looking at the rendered forests. Hence I decided to render it using some nice graphs.
I looked at a few libraries online and TreeViz got my attention for its nice looking aesthetic, low effort implementation for extension and MIT/LGPL license. It has poor documentation though but the code is nice enough so there were no big troubles after all. It has hyperbolic tree and circular treemap visualizations that I particularly like to look at how the disjoint-set data is rendered.
You can play with the graphics: the hyperbolic tree can be moved around to play with the connections and you can drill down with the circular treemap to inspect inner connections. I’ve added some popup labels on every element to describe its path:
The path starts from the parent to the leaf, the arrow actually targeting a child.
The Virtual Node
As the data structure is a forest of trees, there are several root nodes to organize. Many of them can be a root with no children. Hence there is a virtual node that serves as a parent to the real root nodes. This is not really a choice of mine as the tree representations usually have one root node and here, I try to show a forest with a visualizations meant for a single tree. So let’s pretend that the virtual node is the sky that allows us to have an overview on the whole forest. :)
On offered choices to represent a forest of trees, I haven’t seen much that got my attention. I have not searched exhaustively though, so drop by with a comment if you know of better alternative than a tweaked tree representation to display a forest of trees. That said, nothing prevents us from starting from our current representation and suggest improvements.
There are a few improvements that I see and they share the same characteristic: make it easier to navigate through a large data structure.
Search single and multiple nodes
With a large amount of data, manually searching for a node is like finding a needle in a haystack. It’s definitely one feature that should be in this visualization.
It would be great too to search and highlight multiple nodes at the same time, to see where all of them are relatives to each others.
Highlight path of searched nodes
Once the multiple nodes search feature is implemented, highlighting the path between those nodes if connected would be useful. Interestingly, it would be a nice exercise for picking and using an algorithm fit for that functionality. The union-find data structure wouldn’t be a good fit as this only tells you if two elements are connected and possibly how deep, but not the path to connect these.
Reduce noise of single roots
A simple way to reduce the noise of visualization of union-find nodes with large data set is to filter out the non-connected nodes.
I like colors, especially when they are used to convey information. At the moment, there are only three colors used to denote the depth of the nodes: white for unconnected, blue for connected and first levels and orange for connected and last levels. It could actually be a slight improvement to select one colors with different shades to carry the depth information.
More information with coloring can help the search functionality as well. One could perform several searches of nodes and assign different highlighting colors. Then, we could identify patterns with multiple concurrent searches.
There are a few cases where dynamic animations would be beneficial to the visualizations. For example, the sunray tree visualization is pretty good for handling a large amount of data, but it isn’t crystal clear at first that clicking on a child actually switches the outside rings to the children of the clicked element. An animation that would move over the children to the lower ring and enlarge it to 100% of the circumference would tell the user right away of what’s happening.
Statistics would be a welcomed additions: number of nodes, connected and non-connected nodes, maximum depth, distribution of weight, etc…
Please drop a comment with your suggestion and I’ll put it in there, with proper credits of course.
First, thanks to Werner Randelshofer for his amazing work on the visualizations offering leveraged by this union-find visualizer: it’s 90% his effort, I just plugged in an extension for disjoint-set data structure. Also, if that code was not shared with open-source in mind, it wouldn’t look that nice and quick to implement.
The code for union-find visualizer is hosted on GitHub. Instructions on how to download and install is explained there. The source code contains the extension for the TreeViz visualization graphics.
The visualizer supports two file formats: a) A one-liner file that contains a textual representation of the array of the disjoint-set data structure and b) a file that contains the percolation unions with the same format as the Algorithms’ Coursera class.
Percolation class is an implementation of mine developed during the Algorithms’ Coursera class. Only the class files are given though so don’t look for the source code of percolation in there. :) It’s possible though to plug-in your own implementation with few additional changes. Those are described on the GitHub project page. Hack it as much as you’d like and share any valuable changes with me if you do so!