Data Art with BBC Backstage

Graphviz Jungle

This project is an experiment in sketching the BBC site map as static images of its underlying graph structure. In these sketches, each web page in the site is represented as a text node, showing the page URL and/or title, and the links between the pages are shown as lines.

The sketches are made with a short Python script that uses the Graphviz program to create large images of the BBC site map from data extracted from the BBC Jungle API. Graphviz has a number of algorithms for laying out large graphs, which allows experimentation with the look and readability of the result.

View Graphviz images //

  • Graphviz 01
    Graphviz BBC Sitemap
  • Graphviz 03
    Graphviz BBC Nature Sitemap

How it was built //

Get the source code for the projects here:
source code here

Huge XML dumps of the BBC site are periodically created in Jungle format. Each dump contains a brief description of every page in the BBC site and of how the pages are linked. One of these dumps is included:
Jungle_1274975002_nature.xml

For this example, the first step is to cut out a single branch of the dump so that its size is manageable. The nature part of the site is used, that is, everything under www.bbc.co.uk/nature. This is done with the Python lxml library and XPath syntax. See edit_jungle_xml.py
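As a rough illustration of the branch-cutting step, here is a minimal sketch. It is not the contents of edit_jungle_xml.py. The element name `node`, the attributes `url` and `title`, and the `<jungle>` wrapper are assumptions made for the example, since the real Jungle schema is not shown here. The sketch uses the standard-library xml.etree.ElementTree as a stand-in for lxml (lxml's etree module offers a similar API with fuller XPath support), and it is written for Python 3 rather than the Python 2.6 listed in the requirements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a Jungle dump.
SAMPLE = """<jungle>
  <node url="www.bbc.co.uk" title="BBC">
    <node url="www.bbc.co.uk/nature" title="Nature">
      <node url="www.bbc.co.uk/nature/animals" title="Animals"/>
    </node>
    <node url="www.bbc.co.uk/news" title="News"/>
  </node>
</jungle>"""


def cut_branch(xml_text, branch_url):
    """Return the XML of the subtree rooted at the node with the given url."""
    root = ET.fromstring(xml_text)
    # XPath-style query: any descendant <node> whose url attribute matches.
    # This simple string formatting assumes branch_url contains no quotes.
    branch = root.find(".//node[@url='%s']" % branch_url)
    if branch is None:
        raise ValueError("no node found for " + branch_url)
    return ET.tostring(branch, encoding="unicode")


nature_xml = cut_branch(SAMPLE, "www.bbc.co.uk/nature")
print(nature_xml)
```

Everything outside the selected branch, such as the news page in the sample, is dropped, which is what keeps the later graph small enough to draw.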

The main program graphviz_jungle_nature.py has two parts. First, it parses the Jungle XML file to extract all node information and create a DOT file for use by Graphviz. Second, it calls the Graphviz program sfdp to turn the DOT file into the large image. DOT files contain a sequence of text instructions that define the nodes and edges that Graphviz will draw, plus formatting instructions for how to draw them. In this case, the 'prism' overlap-removal routine is used to adjust the node positions so that the nodes don't overlap. See www.graphviz.org/doc.
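The second part, calling sfdp on a DOT file, might look roughly like the sketch below. The function names, file names and the tiny DOT sample are hypothetical and not taken from graphviz_jungle_nature.py. The sfdp options used (`-Goverlap=prism`, `-T` and `-o`) are standard Graphviz command-line options. The sketch targets Python 3 and only runs sfdp if it is found on the PATH.

```python
import shutil
import subprocess

# A tiny DOT file: one formatting instruction for all nodes, then one edge.
EXAMPLE_DOT = (
    "digraph sitemap {\n"
    "  node [shape=box];\n"
    '  "www.bbc.co.uk/nature" -> "www.bbc.co.uk/nature/animals";\n'
    "}\n"
)


def sfdp_command(dot_path, image_path, image_format="png"):
    """Build the sfdp command line with prism overlap removal enabled."""
    return ["sfdp", "-Goverlap=prism", "-T" + image_format,
            dot_path, "-o", image_path]


def render(dot_path, image_path):
    """Run sfdp if it is installed; return True if an image was requested."""
    if shutil.which("sfdp") is None:
        return False  # Graphviz is not on the PATH, so nothing is drawn
    subprocess.check_call(sfdp_command(dot_path, image_path))
    return True


print(" ".join(sfdp_command("nature.dot", "nature.png")))
```

Keeping the command construction separate from the call makes it easy to try other settings, for example a different output format, without touching the rest of the program.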

The program also uses the lxml library to parse and process the XML data with XPath syntax. In this case, it extracts all nodes that have children, which effectively extracts only the trunk and branches of the BBC nature site, but not the leaf nodes. This is done only to limit the size of the resulting tree and to show the overall site structure. With this extracted XML information, it then writes a sequence of DOT instructions to a file. Each node and edge in the DOT file is also given a number of formatting attributes, such as box style and colour, which can easily be changed within the Python code.
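The idea of keeping only the nodes that have children and writing them out as DOT can be sketched as follows. This is an illustration under assumptions, not the code in graphviz_jungle_nature.py: the `node` element with `url` and `title` attributes is a made-up stand-in for the real Jungle schema, the formatting attributes are arbitrary examples, and the standard-library xml.etree.ElementTree is used in place of lxml (with lxml, a query along the lines of `//node[node]` would express the same selection). The sketch is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the nature branch of a Jungle dump.
SAMPLE = """<jungle>
  <node url="www.bbc.co.uk/nature" title="Nature">
    <node url="www.bbc.co.uk/nature/animals" title="Animals">
      <node url="www.bbc.co.uk/nature/animals/birds" title="Birds"/>
    </node>
    <node url="www.bbc.co.uk/nature/plants" title="Plants"/>
  </node>
</jungle>"""


def quote(text):
    """Wrap text in double quotes for DOT, escaping embedded double quotes."""
    return '"' + text.replace('"', '\\"') + '"'


def jungle_to_dot(xml_text):
    """Return DOT text containing only the nodes that have child nodes."""
    root = ET.fromstring(xml_text)
    lines = [
        "digraph jungle {",
        # Default formatting for every node; change here to restyle the graph.
        "  node [shape=box, style=filled, fillcolor=lightgrey];",
    ]
    # "node[node]" means: a <node> element with at least one <node> child.
    for parent in root.findall(".//node[node]"):
        lines.append("  %s [label=%s];"
                     % (quote(parent.get("url")), quote(parent.get("title", ""))))
        # Link the parent only to those children that are themselves kept.
        for child in parent.findall("node[node]"):
            lines.append("  %s -> %s;"
                         % (quote(parent.get("url")), quote(child.get("url"))))
    lines.append("}")
    return "\n".join(lines) + "\n"


dot_text = jungle_to_dot(SAMPLE)
print(dot_text)
```

In the sample, the Plants and Birds pages are leaves and so do not appear in the output. Only Nature and Animals are written, joined by a single edge.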

More details on each step of the program can be found in the inline annotation in the Python file.

Requirements:

  • Python 2.6
  • Graphviz
  • lxml