Disjoint-Set Data Visualization

Ever wondered how your disjoint-set data looked like visually? If you know what this data structure is, you can skip to the Visualization section, else if you’re curious what this is about and how that can help you, proceed to the overview.

union-find-hypertree-25

Quick Overview of Disjoint-set Data Structure

The Algorithms, Part 1 Coursera class taught by Robert Sedgewick presents this data structure better than I could. I recommend it to you either as a starter on algorithms or as a refresh. They actually call it Union-Find in this class, most likely for its intended usage rather than its structure. Because this data abstraction is meant to resolve a connectivity problem. For example, can you tell if p and q are connected in the following maze?

connectivity2

To come up with a way that provides consistent result is what the disjoint-set (or union-find) data structure achieves. This data structure is a collection (or forest) of trees of connected nodes. Each node is identified by a number and each contain a number value to target its parent, if any. If it’s a parent itself or it is a single tree node, then the value of the node will be its own ID. The following image shows from the start how you could end up with a single tree structure by connecting every node.

quick-find-nodes

To recap, all nodes that are part of a tree are connected between each others. Hence, the p and q nodes should be part of the same tree if those are connected. If they are not, they will be part of two different tree structures.

If you think about the requirements of the data structure that will store this disjoint-set forest, it looks like a perfect match for an array of integer:

  • Every node is identified by an integer.
  • Every node contains an integer that targets its parent (or possibly itself if it is a root).

So an array of integer could be an optimized data structure to represent our trees of connected nodes. It’s not easy to visually represent this as the number grows (and the subject of this post is to help on that regard). But this image should help you to translate the trees of connected nodes into an array of integer:

quick-union-overview

The left side displays the tree that the array of integer on the right side represents. The upper part of the array indicates the index (or node identification) and the lower part its value (or parent node). For example, the node 0 has node 1 as its parent or, the array position 0 has a value of 1. Furthermore, the node 1 is a parent or, the array position 1 also contains a value of 1 (to indicate it self-references and hence is a tree root).

The lower section of the image demonstrates how an union between node 5 and 9 works out, each of them initially part of different trees. There are considerations on which tree should connect with which one for performance reasons. Some details about its implementation are explained on the Wikipedia page dedicated to it but there is more meat in the books on Algorithms.

With this data structure, the big-O notation (without path compression) is O(log n) for connecting and finding out connections between two nodes, hence it scales pretty well. The usage of this data abstraction can vary, but it could be used to find out if two persons in your social network are connected and how many levels distance them, for example.

Visualization

The main purpose of this post is to talk about the data visualization of this data structure. My curiosity was sparked on how could be rendered such a data structure, depending on the ordering of the unions or how large it is. It’s not easy to figure out the forest of trees when you look at the array of integers, as this is not a natural human representation.

I was wondering if the algorithm for the disjoint-set data could be improved as I was working on the Percolation assignment of the Algorithms, Part 1 Coursera class. The algorithm is used to solve this problem. Percolation refers to the movement and filtering of fluids through porous materials; could water fall down through the cracks? Would it percolate?

percolation

I wanted to perfect the algorithm to prevent a certain problem with backwashing (it’s not important to explain here but basically, it’s when percolation happens but connected cases from the bottom and unconnected to the top are filled with fluid; it should not happen!). But I found it hard to improve it without visually looking at the rendered forests. Hence I decided to render it using some nice graphs.

I looked at a few libraries online and TreeViz got my attention for its nice looking aesthetic, low effort implementation for extension and MIT/LGPL license. It has poor documentation though but the code is nice enough so there were no big troubles after all. It has hyperbolic tree and circular treemap visualizations that I particularly like to look at how the disjoint-set data is rendered.

union-find-treemap-25

You can play with the graphics: the hyperbolic tree can be moved around to play with the connections and you can drill down with the circular treemap to inspect inner connections. I’ve added some popup labels on every element to describe its path:

union-find-treemap-25-label

The path starts from the parent to the leaf, the arrow actually targeting a child.

The Virtual Node

As the data structure is a forest of trees, there are several root nodes to organize. Many of them can be a root with no children. Hence there is a virtual node that serves as a parent to the real root nodes. This is not really a choice of mine as the tree representations usually have one root node and here, I try to show a forest with a visualizations meant for a single tree. So let’s pretend that the virtual node is the sky that allows us to have an overview on the whole forest. :)

union-find-hypertree-virtual

Possible Improvements

On offered choices to represent a forest of trees, I haven’t seen much that got my attention. I have not searched exhaustively though, so drop by with a comment if you know of better alternative than a tweaked tree representation to display a forest of trees. That said, nothing prevents us from starting from our current representation and suggest improvements.

There are a few improvements that I see and they share the same characteristic: make it easier to navigate through a large data structure.

union-find-hypertree-400

Search single and multiple nodes

With a large amount of data, manually searching for a node is like finding a needle in a haystack. It’s definitely one feature that should be in this visualization.

It would be great too to search and highlight multiple nodes at the same time, to see where all of them are relatives to each others.

Highlight path of searched nodes

Once the multiple nodes search feature is implemented, highlighting the path between those nodes if connected would be useful. Interestingly, it would be a nice exercise for picking and using an algorithm fit for that functionality. The union-find data structure wouldn’t be a good fit as this only tells you if two elements are connected and possibly how deep, but not the path to connect these.

Reduce noise of single roots

A simple way to reduce the noise of visualization of union-find nodes with large data set is to filter out the non-connected nodes.

Better colors

I like colors, especially when they are used to convey information. At the moment, there are only three colors used to denote the depth of the nodes: white for unconnected, blue for connected and first levels and orange for connected and last levels. It could actually be a slight improvement to select one colors with different shades to carry the depth information.

More information with coloring can help the search functionality as well. One could perform several searches of nodes and assign different highlighting colors. Then, we could identify patterns with multiple concurrent searches.

Animations

There are a few cases where dynamic animations would be beneficial to the visualizations. For example, the sunray tree visualization is pretty good for handling a large amount of data, but it isn’t crystal clear at first that clicking on a child actually switches the outside rings to the children of the clicked element. An animation that would move over the children to the lower ring and enlarge it to 100% of the circumference would tell the user right away of what’s happening.

union-find-sunray-400

Statistics

Statistics would be a welcomed additions: number of nodes, connected and non-connected nodes, maximum depth, distribution of weight, etc…

Your suggestion

Please drop a comment with your suggestion and I’ll put it in there, with proper credits of course.

The Code

First, thanks to Werner Randelshofer for his amazing work on the visualizations offering leveraged by this union-find visualizer: it’s 90% his effort, I just plugged in an extension for disjoint-set data structure. Also, if that code was not shared with open-source in mind, it wouldn’t look that nice and quick to implement.

The code for union-find visualizer is hosted on GitHub. Instructions on how to download and install is explained there. The source code contains the extension for the TreeViz visualization graphics.

The visualizer supports two file formats: a) A one-liner file that contains a textual representation of the array of the disjoint-set data structure and b) a file that contains the percolation unions with the same format as the Algorithms’ Coursera class.

The provided Percolation class is an implementation of mine developed during the Algorithms’ Coursera class. Only the class files are given though so don’t look for the source code of percolation in there. :) It’s possible though to plug-in your own implementation with few additional changes. Those are described on the GitHub project page. Hack it as much as you’d like and share any valuable changes with me if you do so!

Advertisements

Coming up to speed with the DevOps community

Last Monday, I went to the DevOps Montreal community meetup. I really enjoyed the presentations on that evening. People were quite friendly too. They had three presentations which covered:

  • the history of Linux containers,
  • the docker.io tools to leverage LXC and
  • dokku, a simple Heroku-compatible single-host PaaS, powered by docker!

Despite being given by different persons, the presentations had a lot of consistency between them, material-wise. For someone like me not working in the DevOps space, there was a lot to catch. The history of LXC made it really easy to identify the current trend occurring in this space, a trend that I would like to mention in this blog post.

Linux containers

But first, let’s quickly go over what is LXC. LXC stands for LinuX Containers. It allows system-level virtualization. It is akin to build the Linux kernel in user mode, that is, run the Linux kernel as a user process, within a Linux environment that also runs the Linux kernel in system mode. Hence you have an isolated kernel in user memory and it can launches processes of its own, encapsulating libraries and binaries within its own namespace. Except that LXC adds extra mechanism to do it properly, and eventually should be most secure. What’s the advantage of such an approach? A quote from Glauber Costa summarizes it well:

[…] should it be possible for the operating system to ensure that excessive resource usage by one group of processes doesn’t interfere with another group of processes? Should it be possible for a single kernel to
provide resource-usage statistics for a logical group of processes? Likewise, should the kernel be able to allow multiple processes to transparently use port 80?

Glauber Costa, Parallels (SWSoft / company behind OpenVZ)
http://lwn.net/Articles/524952/

So you get a way to virtualize your applications within the system, without the full-blown virtualization engine. Unlike the user-mode kernel method, this one will share the system kernel between the host and virtualized environments. This has great advantages in terms of performance and finer granularity of application’s encapsulation. If you want to know more, take a look at Simon Boulet slides

The trend

So LXC allows finer-grained virtualization. And the DevOps community is using this property to accomplish a trend that is not unfamiliar to the software development community. Both communities need to scale quick and effective on dimensions that were not matched yet in the history of computing.

These trends are expressed with two manifestos that came to life in the past years coming; one manifesto for one community. They share similar end-goals and some of the same techniques: the twelve factor app and the reactive manifesto. The first comes from the DevOps community and the second from the functional programming ‘inspired’ community.

For example, the idea for predictable and secure deployment for DevOps goes as far to say that your deployments should be immutable. That should sound very familiar if you are a proponent of the reactive manifesto. Immutability escapes the more traditional paradigm of shared state and all the implied locking strategies that one need to use to manage concurrent accesses. Chad Fowler wrote an interesting blog entry on the topic of immutability in the DevOps space. In order to scale and avoid side-effects, maximum immutability should be encouraged.

That’s where the advantages of something like docker.io kicks in, as you develop an automated package of self-contained libraries and executables meant for one sole application running on the server. This package can be deployed on any OS that has the proper Linux kernel, which contains the LXC feature. It should be deployed only once. No updates are allowed. When you want to re-deploy, you re-build the entire LXC container with an automated process.

Pattern matching

As a last one and quickly, there was an interesting idea thrown by Chad Fowler in his talk in the Rails Israel 2013 conference. He mentions it around 30:55 in this video:

“What if your routes files read like strongly typed functional pattern matching with destructuring and supported federated, heterogenous services?”

The idea is good. Instead of wiring all of the different servers manually and explicitly, we could define the properties a method needs and the server candidates that fit the criteria would get wired automatically. This is a common technique known as pattern matching in the functional community, applied to the DevOps space. Notice a trend? It’s exciting to see all of these change in the past years in a space that looked mostly static on how to do things.

Participate

On this final note, I’d like to stress how great the talks were in the DevOps Montreal meetup. I encourage any software developers in Montreal to attend these, even if you do not identify yourself as a DevOp: we all get to do deployments eventually, they tend to be meticulous and annoying. When it happens, you want to do it right.

The bed bugs of software

There are different types of software bugs listed on Wikipedia. They are categorized on the type of errors such as logic, syntax, resource, etc… None carries personality and express the frustration they cause. And it needs to change… especially when we consider that the term itself, bug, is a colorful way for naming a software error.

So I’d like to contribute more colors to the software landscape by suggesting one more type of bugs: the bed ones. The kind of minor and pesky bugs that won’t kill but leads to sleep deprivation:

Bed Bug

The characteristics of such bug is as follows:

  1. It hides well, you don’t see it easily.
  2. You need special tooling and attention to zoom in on it.
  3. It doesn’t cause big harm, but it possibly has greater side-effects.
  4. The fix to get rid of this bug is generally simple.

I’ve encountered one this week. Let’s examine a special case that led to this one character change fix:

bed_bug
Java supports boolean values in two different forms: primitive and object. The fix implied a switch from the Object type (denoted with a camel case, the big B version) to the primitive type. Often this type of change is done for a taxed method to improve performance. But that is not the case here. This method is part of a class that implements an Apache log4j Filter abstract class, hence this is a method that will get called very few times in the application’s life cycle.

But let’s rewind to the encountered symptoms prior to the fix, so we can review the bed bug afterward:

  1. The web application start-up crashed with Guice injection errors.
  2. The Guice configuration did not change in the past months and was working fine during all this time.
  3. The error did not always occur, sometimes the application started just well.
  4. The error occurred mostly in our deployed local instance of Tomcat.
  5. The error did not occur in our deployed online instance of the application (also in Tomcat).
  6. The error occurred once in a while with the GWT Eclipse plug-in, but a refresh/rebuild seemed to get rid of it.

There were several errors at start-up and some of these were red-herring to the real error. First, the traditional Guice injection error report:

oct. 26, 2013 3:38:46 PM org.apache.catalina.core.StandardContext listenerStart
SEVERE: Exception sending context initialized event to listener instance of class a.company.ui.server.guice.WebAppModuleContextListener
com.google.inject.CreationException: Guice creation errors:

1) No implementation for a.company.domain.dao.DomainDao was bound.
while locating a.company.domain.dao.DomainDao
for parameter 1 at a.company.domain.server.ApplicationServiceImpl.<init>(ApplicationServiceImpl.java:128)
at a.company.domain.server.service.RPCModule.bindServices(RPCModule.java:51)
(+ 10 more..)

Or other thread crashing for file management, reporting that an attribute for the log4j filter configuration was missing:

SEVERE: Exception sending context destroyed event to listener instance of class a.company.domain.server.FileCleanerListener
java.lang.NullPointerException: The presence option was not configured for key 'SessionId'.
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:208)
at com.modelsolv.log.MdcPresenceFilter.decide(MdcPresenceFilter.java:33)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:244)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.slf4j.impl.Log4jLoggerAdapter.debug(Log4jLoggerAdapter.java:209)

There was a lot of these exceptions reported and our team couldn’t figure out what the real problem was. Keep in mind that these symptoms showed up randomly too. But at some point, I noticed these logs that seemed related to the previous error reported:

log4j:WARN Failed to set property [present] to value "false".
log4j:WARN Failed to set property [present] to value "true".
log4j:WARN Failed to set property [present] to value "false".
log4j:WARN Failed to set property [present] to value "true".

Why would this simple property of the log4j configuration failed to be set? Is it really set in the log?

<appender name="consoleMdcAppender">
  <layout>
  <param name="ConversionPattern"
    value="%d{MM/dd/yyyy HH:mm} %5p [%t] (%F:%L) (Session: %X{SessionId}, User: %X{UserId}) - %m%n" />
  </layout>
  <filter>
    <param name="key" value="SessionId" />
    <param name="present" value="true" />
  </filter>
</appender>

This property is well configured. But not well set as the logs report. And as it was not correctly set, the safe guard condition within the MdcPresenceFilter threw an exception as it has a mis-configured present member, which led to the thread crashing:

@Override
public int decide(LoggingEvent event) {
  Preconditions.checkNotNull(key, "The MDC key was not configured.");
  Preconditions.checkNotNull(present, format("The presence option was not configured for key '%s'.", key));

Attaching a debugger and setting a breakpoint to the code location where the parameters were set for log4j was the next logical step. The org.apache.log4j.config.PropertySette class was the right place to look into. After some debug sessions, it turned out that the boolean value of the present filter parameter could not always be correctly set. There is a PropertySetter#convertArg method that is responsible for converting the textual value of the log4g configuration into the proper Java type. If you looked at my previous posted configuration, you’ll get this should be converted to a boolean value. This is a snapshot of the debugger at the exact condition where log4j had to realize the same thing:

condition-not-picked

But it never got into this condition. Ever. The type parameter presented in this snapshot was determined by previous log4j logic that introspected the MdcPresenceFilter class and its methods. It found this excerpt:

public void setPresent(String present) {
  setPresent(Boolean.parseBoolean(present));
}

Hence it asserted that the correct type to set the textual “true” or “false” value had to be java.lang.Boolean. But although a quick glance to the condition in the snapshot could make you think that it does support the Boolean object type, it does not:

} else if (Boolean.TYPE.isAssignableFrom(type)) {

The key element of the faulty line is that it refers to Boolean.TYPE. The latter refers, in fact, to the primitive boolean type of Java. It’s quite different from the full-blown Object java.lang.Boolean type. Hence it means that only the primitive type of boolean is supported by the log4j filter parameters. It’s no big deal so far. But here’s the breaker that made this bug inconsistent in its behavior and why it worked sometimes… and some others not:

public void setPresent(boolean present) {
  this.present = present;
}

public void setPresent(String present) {
  setPresent(Boolean.parseBoolean(present));
}

At the time I wrote the MdcFilterPresence class, I overloaded the setPresent method with a version that accepts a String parameter and that automatically converts it to a Boolean object. As it turns out, the log4j library decides to convert the textual filter parameter of “true” or “false” to, sometimes a String, or some other times a Boolean. Yes that’s right, it’s not always the same! When it decides to convert it to a String, the application starts correctly, else when it picks the Boolean conversion, the application crashes, as the latter type conversion is not supported by log4j. That made the diagnosis of this bug a bit harder because no clear constants were obvious and it all seemed like red herrings.

Why the log4j library arbitrarily decides to pick either Boolean or String as a conversion is non-deterministic per my observation and entirely depends on the java.beans.GenericBeanInfo class, as shown by these screenshots:

Boolean was picked.

Boolean was picked.

String was picked.

String was picked.

These are the kind of bugs that creep in your bed and make your work a bit harder. All of those swayed away with a single character change fix. As I like to put it about bug fixes: “the amount of time required to find a bug is opposite to the amount of changes required to fix that bug.” Not always true, but once in a while, it bites you hard.