News Cluster analyses a set of news stories and groups them into related clusters using a statistical technique widely used in science and engineering.
In biology, for example, characteristic features from a set of field samples, such as flower shapes and sizes, or DNA fragments, are analysed with clustering techniques to determine species and produce taxonomies. Clustering analysis shows how the samples relate to each other and uncovers any natural groupings amongst them.
In our case, we use the words in each news story's title as the characteristic features and group the stories according to shared terms. The more title words a pair of news stories share, the more closely they are related in the analysis.
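The idea of measuring relatedness by shared title words can be sketched in a few lines of Python. This is only an illustration, not the program's actual code: the function names are hypothetical, and the choice of Jaccard distance (shared words divided by total distinct words, subtracted from one) is an assumption made here to turn "more shared words" into a number between 0 and 1.

```python
import re


def title_words(title):
    """Lower-case a title and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def shared_words(title_a, title_b):
    """Count the distinct words that appear in both titles."""
    return len(title_words(title_a) & title_words(title_b))


def distance(title_a, title_b):
    """Jaccard distance (an assumed measure): 0.0 means the titles use
    exactly the same words, 1.0 means they share none."""
    a, b = title_words(title_a), title_words(title_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


if __name__ == "__main__":
    print(shared_words("Mars rover finds water", "Rover stuck on Mars"))
    print(distance("Mars rover finds water", "Rover stuck on Mars"))
```

A real analysis would probably also drop very common words such as "the" and "on" before comparing titles, since they say little about the subject of a story.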
The clustering algorithm works by first finding the two most closely related news stories and joining them into a group. It then repeats this process, treating the new group as a single story: it again finds the closest pair and joins that pair into a group. The process is repeated until all stories are joined together. The result is a large binary tree, called a dendrogram, in which the leaves are the news stories and the branches and sub-branches form a hierarchy of groups that reflects how the leaves are clustered.
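The repeated pair-joining described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the program's own implementation: a joined group is treated as a single story by pooling the title words of its members, closeness is measured with Jaccard distance, and the tree is stored as nested tuples of the form (left, right, distance). All names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Distance between two word sets: 0.0 identical, 1.0 nothing shared."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def build_dendrogram(titles):
    """Join the closest pair over and over until one tree remains.

    Leaves are title strings; each internal node is a tuple
    (left_subtree, right_subtree, distance_at_which_they_were_joined).
    """
    if not titles:
        raise ValueError("need at least one title")
    # Each working cluster is (subtree, pooled word set).
    clusters = [(t, set(t.lower().split())) for t in titles]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: jaccard_distance(clusters[p[0]][1], clusters[p[1]][1]),
        )
        (tree_i, words_i), (tree_j, words_j) = clusters[i], clusters[j]
        d = jaccard_distance(words_i, words_j)
        # Assumption: the new group behaves like one story whose
        # "title" is the union of its members' words.
        merged = ((tree_i, tree_j, d), words_i | words_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


if __name__ == "__main__":
    print(build_dendrogram([
        "mars rover lands",
        "mars rover stuck",
        "climate change report",
    ]))
```

This brute-force search compares every pair at every step, which is adequate for a few hundred stories but would need a smarter approach for much larger collections.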
As well as grouping the news stories, the program generates a number of different views of the results.
The image below left shows a dendrogram generated from a few hundred science news stories taken from the BBC website. Moving from right to left, you can see the successive grouping of stories at each step of the process described above. The stories at the far right-hand side had the most shared words in their titles.
Once this tree has been generated, together with a measure of the distance between each pair of stories or groups of stories, it can be cleaved into a small number of main branches that contain the natural clusters within the data.
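One way to cleave such a tree is to walk down from the root and split any branch whose two halves were joined at a distance greater than a chosen cut-off. The sketch below assumes the tree is stored as nested tuples of the form (left, right, distance) with title strings as leaves, and that a fixed threshold is an acceptable way to pick the main branches. Both are assumptions made for illustration; the program may choose its cut differently.

```python
def leaves(node):
    """Return all story titles under a node, left to right."""
    if isinstance(node, str):
        return [node]
    left, right, _ = node
    return leaves(left) + leaves(right)


def cut_tree(node, threshold):
    """Split the tree into clusters.

    A branch is kept whole when its two halves were joined at a
    distance no greater than `threshold`; otherwise it is split and
    each half is examined in turn.
    """
    if isinstance(node, str):
        return [[node]]
    left, right, dist = node
    if dist <= threshold:
        return [leaves(node)]
    return cut_tree(left, threshold) + cut_tree(right, threshold)


if __name__ == "__main__":
    tree = (
        "climate change report",
        ("mars rover lands", "mars rover stuck", 0.5),
        1.0,
    )
    clusters = cut_tree(tree, threshold=0.6)
    # Arrange the clusters from largest to smallest.
    for cluster in sorted(clusters, key=len, reverse=True):
        print(len(cluster), cluster)
```

A lower threshold yields more, smaller clusters, and a higher one yields fewer, larger clusters, so the cut-off controls how many main branches come out of the tree.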
The image below right shows these clustered groups arranged by size of cluster. You can clearly see that the analysis has succeeded in identifying a number of news story themes, in this case the biggest ones being about climate change, the space station, an Arctic expedition, a shuttle launch and the Mars rover.