Storing hierachical data in MySQL

I will need to be able to store hierachical data in a number of parts of the project management system, such as nested projects, nested tasks, comments etc. A standard relational database, MySQL in this case, isn’t optimised for storing hierachical data unfortunately.

There are 2 main ways to store this data in the database, which will be discussed here in brief. I will skip a lot of the technical details about both implementations as I would just be repeating what I found in the following 2 articles which where the basis of this research.

http://articles.sitepoint.com/article/hierarchical-data-database

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

The Adjacency List Model

This is by far the easiest to implement and much easier to understand initially. Unfortunately, it’s also the worst performer when working with large datasets.

The way it works, is by making each node (item) point to its parent node. This means that to construct a tree representing the hierachy, we need to make a query for each node, to get all subnodes (even if there are none). This can be done as a single query by using a series of left joins on itself, but this requires knowing the max depth to make the query. Also, the data returned in the examples given in the MySQL article aren’t as easy to process as they could be.

Main Disadvantages

  • Either recursive or a can become very complex due to number of joins
  • Too see the full path, you need to know the depth
  • Need to be careful when deleting to not orphan sub trees
  • Can only easily traverse up the tree, and not down

Modified Preorder Tree Traversal

Using this method, we assign each node a left and right value (can’t name the column left or right, as they’re reserved words) and these values let us see which nodes are under a given node. Starting from the root, we give it a left value of 1, and then work our way left, giving each left value one greater then the last. When we reach a leaf (node with no children) we give it a right number, which is one more then the left. We then work our way back up and accross. (The articles explain it better and give some diagrams to explain it).

It is now possible to retreive the entire tree in a single query, and on a randomly created tree with 10,000+ nodes, this takes ~0.05 seconds (this will increase slightly as the query logic grows to include other constraints).

Updates are now slightly more complex then with the Adjacency list model. Instead of just adding a new row, we need to create a space for it by changing the left and right values for all nodes to the right of this one and then insert it. Despite that, the 3 queries used to insert a node take between 0.1 and 0.2 seconds on the same test tree (10,000+ rows)+

In conclusion, I will be using the MPTT method in most cases, although th ALM may be used for limited hierachical data sets. I will discuss how this fits into a real application in a later post, when I explore the how to structure the project hierachy.

Edit

Commonly used columns in queries are normaly indexed for faster retreival. This actually has a negative impact when using MPTT, as the longer operation is the insert, and if the left and right columns are indexed, as well as changing the pointers, it needs to update the indexes

Bookmark and Share

Posted in Research. RSS. Trackback.

Leave a Reply