This is one of only two posts from this blog in 2012, but it is a useful one.
From the post:
A common desire when working with natural language is topic discovery. That is, given a set of documents (eg. tweets, blog posts, emails) you would like to discover the topics inherent in those documents. Often this method is used to summarize a large corpus of text so it can be quickly understood what that text is ‘about’. You can go further and use topic discovery as a way to classify new documents or to group and organize the documents you’ve done topic discovery on.
The post walks through using Pig and Mallet on a newsgroup data set.
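To make the idea of topic discovery concrete, here is a minimal sketch in plain Python. It is not taken from the post and is not how Pig or Mallet are invoked; it is a toy collapsed Gibbs sampler for LDA, one common approach to topic modeling. The function name `lda_gibbs`, the hyperparameter values, and the six-document corpus are all assumptions made up for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents.

    Hypothetical helper for illustration only; alpha and beta are
    assumed smoothing values, not tuned settings.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start by assigning every token to a random topic.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Weight each topic by how much this document uses it
                # and how strongly it favors this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Report the three most frequent words still assigned to each topic.
    top_words = [
        [w for w, c in topic_word[k].most_common(3) if c > 0]
        for k in range(num_topics)
    ]
    return top_words, doc_topic


# Made-up corpus standing in for newsgroup posts.
docs = [
    "hockey team game season player".split(),
    "game player team score hockey".split(),
    "season score game team".split(),
    "disk drive memory card driver".split(),
    "driver card disk windows memory".split(),
    "windows drive memory disk".split(),
]

topics, doc_topic = lda_gibbs(docs)
for k, words in enumerate(topics):
    print(f"topic {k}: {words}")
```

The top words per topic serve as the summary of what the corpus is 'about', and each row of `doc_topic` shows how many of a document's tokens landed in each topic, which is what you would use to group the documents. On a corpus this small the split depends on the random seed, so treat the printed topics as illustrative.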
I have been thinking about getting one of those unlimited download newsgroup accounts.
Maybe I need to go ahead and start building some newsgroup data sets.