New Feature: News API – Clustering

Datetime: 2016-08-23 02:14:25          Topic: Cluster Analysis

Introduction

Last week we showed you how to search and sort news stories by video and image volume. Today we are going to introduce you to another recently added News API feature: Clustering.

What is Clustering?

Clustering is a data mining technique that groups similar objects together. It’s an unsupervised method, meaning that data is grouped without any pre-defined labels or categories, and it is used in exploratory data analysis to find hidden patterns or groupings in data. It’s a common technique in text mining.

In relation to our News API, clustering allows you to group similar news stories that are returned from your specific search or query, without the need for pre-trained classifiers or labels.

As Donald Hebb’s principle is often summarized, “cells that fire together wire together”. This principle is relevant to clustering in that it refers to how the brain uses coincidence for association. Similar or coincidental News API stories are clustered using a measure of similarity and the semantic importance of words and phrases within the content.

source: http://sherrytowers.com/2013/10/24/k-means-clustering/

Clustering with the News API

Clustering is now available for the following News API endpoints:

  • /stories
  • /related_stories
  • /coverages

We’ve also added three algorithms that you can choose from when clustering, depending on the type of data and the format of results that you require.

1. STC (Suffix Tree Clustering)

STC is a linear-time clustering algorithm (linear in the size of the document set) that is based on identifying phrases that are common to groups of documents. A phrase is an ordered sequence of one or more words. The algorithm treats each document as a string of words rather than an unordered collection of words, and so it can make use of word order and proximity.

2. K-means

K-means clustering aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean. That mean serves as a prototype of the cluster.
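To make the "nearest mean" idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on plain 2-D points. It is a generic illustration only, not the News API's implementation, which clusters text rather than coordinates. The function name `kmeans`, the sample points, and the naive "first k points" initialisation are all assumptions made for this example.

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Partition points into k clusters by nearest mean (Lloyd's algorithm)."""
    # Naive initialisation for this sketch: use the first k points as the means.
    means = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        assignment = [
            min(range(k), key=lambda c: dist(p, means[c])) for p in points
        ]
        # Update step: each mean moves to the centroid of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old mean if a cluster ends up empty
                means[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, means


# Made-up sample data: two visibly separate groups of points.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
assignment, means = kmeans(points, k=2)
print(assignment)
```

Real implementations usually pick the starting means more carefully and stop as soon as the assignments no longer change. The fixed iteration count here just keeps the sketch short.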

3. Lingo

Lingo is an algorithm for clustering search results that emphasizes the quality of cluster descriptions. It uses algebraic transformations of the term-document matrix and frequent phrase extraction based on suffix arrays.

Why is clustering news stories useful?

Topical Analysis

Clustering enables you to group specific topic areas that may otherwise not be available as entities, concepts or classification labels.

Deduplication

Because Clustering groups stories with semantic similarities, it enables users to extract the most relevant story from each cluster, thus performing a deduplication of sorts. To do this, you can take one story ID from each cluster according to your own search requirements. For example, you could take the story ID with the most social shares, the most recently published, or the ID with the highest volume of images/videos.
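The selection step described above could be sketched roughly as follows. The data shapes here (`stories`, `clusters`, and field names such as `social_shares`, `published_at` and `story_ids`) are hypothetical stand-ins chosen for illustration, not the News API's actual response schema, and the numbers are made up.

```python
# Hypothetical, simplified data: field names and values are assumptions
# for this sketch, not the real News API response format.
stories = {
    "s1": {"social_shares": 120, "published_at": "2016-06-27"},
    "s2": {"social_shares": 340, "published_at": "2016-06-25"},
    "s3": {"social_shares": 15, "published_at": "2016-06-29"},
    "s4": {"social_shares": 95, "published_at": "2016-06-26"},
    "s5": {"social_shares": 40, "published_at": "2016-06-28"},
}
clusters = [
    {"label": "Investment in Ireland", "story_ids": ["s1", "s2", "s3"]},
    {"label": "Applying for Irish Passports", "story_ids": ["s4", "s5"]},
]


def pick_representatives(clusters, stories, score):
    """Keep one story ID per cluster: the one with the highest score."""
    return {
        cluster["label"]: max(
            cluster["story_ids"], key=lambda story_id: score(stories[story_id])
        )
        for cluster in clusters
    }


# One representative per cluster, chosen by social shares or by recency.
most_shared = pick_representatives(clusters, stories, lambda s: s["social_shares"])
most_recent = pick_representatives(clusters, stories, lambda s: s["published_at"])
print(most_shared)
print(most_recent)
```

Passing the scoring rule in as a function keeps the deduplication logic the same whichever criterion you choose.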

Narrowing your search by cluster

Clustering is a great way to narrow News API search results and dive even deeper into specific areas or topics of interest. By taking the returned clusters of stories, you can use their labels (by appending them to your query) to further narrow your search. This will lead to results being returned from the API that only belong to a specific cluster.

As a use case example, consider a news application. Grouping stories into clusters by similarity makes it easier to present an end user with intuitively grouped news stories. For example, a user can pick a group (cluster) that interests them in their news app and drill down into it, so that they only see results from their chosen cluster or topic.

As an example, let’s take the search query:

“Brexit” AND “Ireland”

Here is a sample of returned stories:

Perhaps we are now interested in exploring the topic of Irish passport applications within our original search. We then create a new search:

“Brexit” AND “Ireland” AND “Irish Passports”

This produces more specific results and clusters relating to our chosen subject area: the spike in Irish passport applications in the context of Brexit and Ireland.
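Appending a cluster label to an existing query, as in the example above, can be done with a small helper. This is only a sketch of the string handling. The function name `narrow_query` is made up for illustration, and the quoted-term `AND` syntax simply mirrors the queries shown in this post.

```python
def narrow_query(terms, cluster_label):
    """Build a query that adds a cluster label to the original search terms."""
    # Quote every term so that multi-word labels are treated as one phrase.
    quoted = [f'"{term}"' for term in [*terms, cluster_label]]
    return " AND ".join(quoted)


query = narrow_query(["Brexit", "Ireland"], "Irish Passports")
print(query)  # "Brexit" AND "Ireland" AND "Irish Passports"
```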

Examples

To show you examples of the clusters that are generated from specific searches, we’ve chosen two topics that are currently popular in world news and selected the top clusters from each search query.

Brexit AND Ireland

As one of Britain’s biggest trading partners, the potential impact of Brexit on the Irish people and economy has been a hot topic of conversation since the poll results were released last week.

Below we have embedded the API call and JSON results for the “Brexit AND Ireland” search query, using the lingo algorithm and only returning English language stories:

Query:

Results:

We’ve then listed 5 of the top clusters returned and used a simple column chart to visualize the grouping of the different stories and the labels they were automatically assigned.

  1. Investment in Ireland
  2. Peace Deal
  3. Risk
  4. Applying for Irish Passports
  5. Financial Services

Cluster Story Volumes – Brexit AND Ireland

Perhaps unsurprisingly, the concept of ‘Investment in Ireland’ represents the largest cluster. With the highest percentage of Irish exports going to Britain, the concern here is certainly warranted.

Two other stand-out clusters are ‘Peace Deal’ and ‘Applying for Irish Passports’. Without getting too deep into the politics, Brexit would result in the island of Ireland containing the EU-member Republic and the non-EU North. As for the passport cluster, many Britons have been scrambling to see if they are entitled to an Irish passport, in the hope that they can continue to travel freely within the EU. Google search trends saw a sharp spike in Irish passport-related searches immediately after the Brexit results.

Olympics AND Golf

For the first time since 1904, golf will be contested at the Olympic Games in Brazil this year. However, the future of the sport at the games is already in doubt as a number of high-profile golfers are withdrawing or threatening to do so, citing Zika Virus fears.

Below we have embedded the API call and JSON results for the “Olympics AND Golf” search query, using the lingo algorithm and only returning English language stories:

Query:

Results:

For this example, we’ve taken the top six clusters from the “Olympics AND Golf” search query:

  1. Jordan Spieth
  2. Louis Oosthuizen of South Africa
  3. Zika Virus Fears
  4. Day’s Withdrawal
  5. Wins on the PGA Tour
  6. Olympic Success

Cluster Story Volumes – Olympics AND Golf

For the non-golfers among you, the three players mentioned (Day, Spieth and Oosthuizen) are among a large group of golfers who are unlikely to travel to Brazil this summer. Day and Spieth are ranked #1 and #2 in the world respectively, so you can see the gravity of the situation. What should be a celebration of golf’s return to the Olympics has become a bit of a media circus, as we can see from our clusters above, where player withdrawals and the Zika Virus have taken center stage.

Usage Tips

Clustering is disabled by default, so you will need to pass the cluster parameter to one of the three endpoints and set it to true.

Lingo is the default algorithm, so you will need to set cluster.algorithm to either ‘stc’ or ‘kmeans’ should you wish to use one of those instead.

For the best quality clustering results and to avoid irrelevant labels, we recommend that you set your language of choice.
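Putting these tips together, a request URL might be assembled along the following lines. Only the `cluster` and `cluster.algorithm` parameters, the three endpoint paths, and the algorithm names come from this post. The base URL is a placeholder, and the `text` and `language` parameter names are assumptions, so check the News API documentation for the real ones.

```python
from urllib.parse import urlencode

# Placeholder host for illustration only; not the real News API base URL.
BASE_URL = "https://api.example.com/news"

ENDPOINTS = {"/stories", "/related_stories", "/coverages"}
ALGORITHMS = {"lingo", "stc", "kmeans"}


def build_cluster_request(endpoint, query, algorithm="lingo", language="en"):
    """Return a request URL with clustering switched on."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"clustering is not available for {endpoint}")
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown clustering algorithm: {algorithm}")
    params = {
        "text": query,          # assumed name for the search-query parameter
        "language": language,   # assumed name for the language filter
        "cluster": "true",      # clustering is disabled by default
        "cluster.algorithm": algorithm,
    }
    return f"{BASE_URL}{endpoint}?{urlencode(params)}"


url = build_cluster_request("/stories", "Brexit AND Ireland", algorithm="stc")
print(url)
```

Validating the endpoint and algorithm up front means a typo fails fast on the client rather than producing a confusing API response.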

Conclusion

Using the News API’s clustering features, you can easily group or cluster stories that are semantically related, a task that would usually only be possible with a significant amount of code and data mining knowledge.




