

Data Art with BBC Backstage

News Cluster


News Cluster analyses a set of news stories and groups them into related clusters using a statistical technique widely used in science and engineering.

In biology, for example, characteristic features from a set of field samples, such as flower shapes and sizes, or DNA fragments, are analysed with clustering techniques to determine species and produce taxonomies. Clustering analysis shows how the samples relate to each other and uncovers any natural groupings amongst them.

In our case, we use the title words of each news story as the characteristic features and group the stories according to shared terms. The more title words a pair of news stories share, the more closely they are related in the analysis.
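As a minimal sketch of this idea, the snippet below turns each title into a set of words and scores a pair of titles by how many words they share. The Jaccard distance used here, the function names and the sample titles are all assumptions made for illustration; the article does not say which distance measure News Cluster used, nor whether common words were filtered out.

```python
import re

def title_words(title):
    """Return the set of lowercase words in a title."""
    return set(re.findall(r"[a-z0-9']+", title.lower()))

def title_distance(a, b):
    """Jaccard distance (an assumed measure): 0.0 when the word sets are
    identical, 1.0 when the titles share no words at all."""
    wa, wb = title_words(a), title_words(b)
    if not wa and not wb:
        return 0.0
    return 1.0 - len(wa & wb) / len(wa | wb)

# Hypothetical titles, not taken from the BBC feed.
titles = [
    "Shuttle launch delayed by weather",
    "Weather delays shuttle launch again",
    "Mars rover finds new rock",
]

d01 = title_distance(titles[0], titles[1])  # three shared words
d02 = title_distance(titles[0], titles[2])  # no shared words
```

The two shuttle titles come out closer to each other than either is to the Mars rover title, which is the behaviour the clustering relies on.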

The clustering algorithm works by first finding the two most closely related news stories and joining them into a group. It then repeats this process, treating the group as a single story: it again finds the closest pair and again joins that pair into a group. The process is repeated until all stories are joined together. The result is a large binary tree, known as a dendrogram, where the leaves of the tree are the news stories, and the branches and sub-branches represent a hierarchy of groups that reflects how the leaves are clustered.
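The merge loop described above can be sketched in plain Python as follows. This is an illustrative, deliberately naive version rather than the code behind News Cluster: the function names, the sample stories and the choice of average linkage (measuring the distance between two groups as the mean distance over all cross pairs) are assumptions.

```python
from itertools import combinations

def agglomerate(items, distance):
    """Build a binary tree by repeatedly merging the two closest groups.

    Each group is a (tree, members) pair. A leaf tree is the item itself;
    a merged tree is a (left, right) tuple. Expects at least one item.
    """
    groups = [(item, [item]) for item in items]

    def group_distance(g1, g2):
        # Average linkage (an assumption): mean distance over all cross pairs.
        pairs = [(a, b) for a in g1[1] for b in g2[1]]
        return sum(distance(a, b) for a, b in pairs) / len(pairs)

    while len(groups) > 1:
        # Find the indices of the closest pair of groups.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: group_distance(groups[ij[0]], groups[ij[1]]),
        )
        merged = ((groups[i][0], groups[j][0]), groups[i][1] + groups[j][1])
        # Replace the pair with the merged group (j > i, so delete j first).
        del groups[j]
        del groups[i]
        groups.append(merged)
    return groups[0][0]

def shared_word_distance(a, b):
    """Hypothetical distance: 1 minus the fraction of shared words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(wa & wb) / len(wa | wb)

# Hypothetical stories, not taken from the BBC feed.
stories = [
    "shuttle launch delayed",
    "shuttle launch today",
    "mars rover stuck",
    "mars rover update",
]
tree = agglomerate(stories, shared_word_distance)
```

For these four stories the two shuttle titles are joined first, then the two Mars rover titles, and finally the two groups are joined at the root. Recomputing every group distance on each pass is slow for large inputs; a library routine would be the practical choice for a few hundred stories.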

As well as analysing the news stories into groups, the program also generates a number of different views of the results.

The image below left shows a dendrogram tree generated from a few hundred science news stories taken from the BBC website. Moving from right to left, you can see the successive grouping of stories at each step of the process described above. The stories at the far right-hand side had the most shared words in their titles.

Having generated this tree and also a measure of the distance between each pair of stories or group of stories, it is then possible to cleave the tree into a small number of main branches which contain the natural clusters that exist within the data.
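One way to build the tree and cut it into flat clusters is with SciPy's hierarchical clustering routines, which the project reports using. The sketch below is a guess at the general shape of such an analysis, not the project's code: the word-presence matrix, the Jaccard metric, average linkage and the 0.8 distance threshold are all assumptions chosen for this small example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical titles, not taken from the BBC feed.
titles = [
    "shuttle launch delayed",
    "shuttle launch today",
    "mars rover stuck",
    "mars rover update",
]

# One row per story, one boolean column per word in the vocabulary.
vocab = sorted({word for title in titles for word in title.split()})
features = np.array(
    [[word in title.split() for word in vocab] for title in titles],
    dtype=bool,
)

# Condensed matrix of pairwise distances between stories.
distances = pdist(features, metric="jaccard")

# Agglomerative clustering; each row of `tree` records one merge step.
tree = linkage(distances, method="average")

# Cut the tree: stories whose merge distance is at most 0.8 share a label.
labels = fcluster(tree, t=0.8, criterion="distance")
```

With this threshold the two shuttle stories receive one label and the two Mars rover stories another. Raising the threshold produces fewer, larger clusters; lowering it produces more, smaller ones.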

The image below right shows these clustered groups arranged by size of cluster. You can clearly see that the analysis has succeeded in identifying a number of news story themes, in this case the biggest ones being about climate change, the space station, an Arctic expedition, a shuttle launch and the Mars rover.

View Large images //

  • Dendrogram Tree
  • Cluster Map

How it was built //

News Cluster was written in Python using the Wing IDE. The Python Imaging Library (PIL) was used for image generation and the SciPy hierarchical clustering library was used for the analysis.
A good technical introduction to how the clustering algorithm works using a biological example can be found here:
http://users.soe.ucsc.edu/~eads/iris.html

The news stories were taken directly from the BBC RSS feed. Each news item in the RSS XML contains the title words used in the analysis, and a link to a thumbnail on the bbc.img.co.uk site used in the generated images.
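Extracting the title and thumbnail link from each item might look roughly like the sketch below, which uses only the Python standard library. The sample XML is hypothetical, and the `media:thumbnail` element with its namespace is an assumption about the feed's layout rather than something stated here; a real script would download the feed instead of using an inline string.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the general shape of an RSS 2.0 feed. The media
# namespace and thumbnail element are assumptions about the feed layout.
SAMPLE_RSS = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Shuttle launch delayed</title>
      <media:thumbnail url="http://example.com/shuttle.jpg"/>
    </item>
    <item>
      <title>Mars rover update</title>
    </item>
  </channel>
</rss>"""

MEDIA_NS = {"media": "http://search.yahoo.com/mrss/"}

def parse_items(rss_text):
    """Return a list of (title, thumbnail_url) pairs, one per RSS item.

    thumbnail_url is None when an item has no thumbnail element.
    """
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        thumb = item.find("media:thumbnail", MEDIA_NS)
        url = thumb.get("url") if thumb is not None else None
        items.append((title, url))
    return items

stories = parse_items(SAMPLE_RSS)
```

The titles then feed the clustering step, while the thumbnail links can be kept alongside them for drawing the images.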