An NLP Approach to Analyzing Twitter, Trump, and Profanity

Datetime: 2016-08-23 03:18:23 | Topic: Natural Language Processing

Who swears more? Do Twitter users who mention Donald Trump swear more than those who mention Hillary Clinton? Let’s find out by taking a natural language processing (NLP) approach to analyzing tweets.

This walkthrough will provide a basic introduction to help developers of all backgrounds and abilities get started with the NLP microservices available on Algorithmia. We’ll show you how to chain them together to perform light analysis on unstructured text. Unfamiliar with NLP? Our gentle introduction to NLP will help you get started.

We know that getting started with a new platform or developer tool is an investment in time and energy. Sometimes it can be hard to find the information you need in order to start exploring on your own. That’s why we’ve centralized all our information in the Algorithmia Developer Center and API Docs, where users will find helpful hints, code snippets, and getting started guides. These guides are designed to help developers integrate algorithms into applications and projects, learn how to host their trained machine learning models, or build their own algorithms for others to use via an API endpoint.

Now, let’s tackle a project using some algorithms to retrieve content, and analyze it using NLP. What better place to start than Twitter, and analyzing our favorite presidential candidates?

Twitter, Trump, and Profanity: An NLP Approach

First, let’s find the Twitter-related algorithms on Algorithmia. Go to the search bar at the top of the navigation and type in “Twitter”:

You’ll get quite a few results, but find the one called Retrieve Tweets with Keyword, and check out the algorithm page, which shows the algorithm’s description, pricing, and permission settings:

If you are interested in learning more about the basics of the platform, including the algorithm profile page, visit the Developer Center’s Basic Guides section.

The algorithm description provides information about the input and output data structures expected, as well as any other requirements. For instance, Retrieve Tweets with Keyword requires your Twitter API authentication keys.
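For example, the input to Retrieve Tweets with Keyword is a JSON object like the one below (this mirrors the structure we’ll use in the script later in this post; the credential values are placeholders):

{
    "query": "Donald Trump OR Trump",
    "numTweets": "700",
    "auth": {
        "app_key": "consumer_key",
        "app_secret": "consumer_secret",
        "oauth_token": "access_token",
        "oauth_token_secret": "access_secret"
    }
}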

At the bottom of every algorithm page we provide code samples showing the input, the output, and how to call the algorithm in Python, Rust, Ruby, JavaScript, NodeJS, cURL, CLI, Java, or Scala. If you have questions about the details of using the Algorithmia API, check out the API docs.

Alright, let’s get started!

Here’s the overall structure of our project:

+-- profanity_demo
|   +-- data
|       +-- Donald-Trump-OR-Trump.csv
|       +-- Hillary-Clinton-OR-Hillary.csv
|   +-- logs
|       +-- twitter_pull_data.log
|   +-- profanity_analysis.py
|   +-- twitter_pull_data.py

You’ll need a free Algorithmia account to complete this project. Sign up for free and receive an extra 10,000 credits. Overall, the project will process around 700 tweets, with emoticons and other special characters stripped out; this means that if a tweet contains only URLs and emoticons, it won’t be analyzed. Once we pull our data from the Twitter API, we’ll clean it up with some regex, remove stop words, and then find our swear words.

Step One: Retrieve Tweets with Keyword

We’ll use the Retrieve Tweets with Keyword algorithm first in order to query tweets from the Twitter Search API:

import os
import csv
import sys
import logging
import Algorithmia

# Logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logFile = logging.FileHandler('logs/twitter_pull_data.log')
logFile.setLevel(logging.INFO)
# Creating a custom log format for each line in the log file
formatter = logging.Formatter('%(asctime)s : %(levelname)s : %(message)s')
logFile.setFormatter(formatter)
logger.addHandler(logFile)

# Pass in string query as sys.argv
q_input = sys.argv[1]

def pull_tweets():
    input = {
        "query": q_input,
        "numTweets": "700",
        "auth": {
            "app_key": 'consumer_key',
            "app_secret": 'consumer_secret',
            "oauth_token": 'access_token',
            "oauth_token_secret": 'access_secret'
        }
    }
    client = Algorithmia.client('algorithmia_api_key')
    algo = client.algo('twitter/RetrieveTweetsWithKeyword/0.1.3')
    tweet_list = [{'user_id': record['user']['id'],
                   'retweet_count': record['retweet_count'],
                   'text': record['text']}
                  for record in algo.pipe(input).result]
    return tweet_list

def write_data():
    # Write tweet records to csv for later data processing
    data = pull_tweets()
    filename = os.path.join(q_input.replace(' ', '-'))
    try:
        with open('data/{0}.csv'.format(filename), 'w') as f:
            fieldnames = ['user_id', 'retweet_count', 'text']
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            for record in data:
                writer.writerow(record)
    except Exception as e:
        logger.info(e)

if __name__ == '__main__':
    write_data()

Okay, let’s go over the main parts of the code snippet. This algorithm takes a nested dictionary called ‘input’ that contains the keys ‘query’, ‘numTweets’, and ‘auth’, the last of which is a dictionary itself. The key ‘query’ is set to the global variable q_input, which holds the command-line argument passed when executing the script; in our case it will hold a presidential nominee’s name. The key ‘numTweets’ is set to the number of tweets you want to extract, and the dictionary ‘auth’ holds the authentication keys and tokens that you got from Twitter.

As you write the pull_tweets() function, pay attention to the line that sets the variable ‘client’ to Algorithmia.client('algorithmia_api_key'). This is where you pass in the API key you were assigned when you signed up for your Algorithmia account. If you don’t recall where to find it, it’s on the My Profile page under the Credentials section.

Next, notice the variable ‘algo’. This is where we pass in the path to the algorithm we’re using. Each algorithm’s documentation gives you the appropriate path in the code examples section at the bottom of its algorithm page.

And last, the list comprehension ‘tweet_list’ holds our data after looping through algo.pipe(input).result, the result of calling the algorithm with our input variable.

Now, you simply write your data to a CSV file named after your query. Note: if your query is a space-separated string, the script joins it with dashes.
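For example, here’s the filename logic from write_data() applied to our query (a two-line illustration, not part of the script):

query = 'Donald Trump OR Trump'
filename = query.replace(' ', '-')  # 'Donald-Trump-OR-Trump' -> data/Donald-Trump-OR-Trump.csv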

Step Two: Collecting Data

It’s time to call our script with the query ‘Donald Trump OR Trump’, which will grab tweets containing either ‘Donald Trump’ or ‘Trump’, and then write a file called ‘Donald-Trump-OR-Trump.csv’ to your data folder.

python twitter_pull_data.py 'Donald Trump OR Trump'

Try running the script again, but this time passing in ‘Hillary Clinton OR Hillary’ as the query.
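python twitter_pull_data.py 'Hillary Clinton OR Hillary'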

With both CSV files in our data folder, we can now create a script called profanity_analysis.py.

Step Three: Data Preprocessing

In this next script, we’ll first clean up our dirty data: get rid of emoticons, hashtags, RTs, and so on. Then, we’ll explore the English stop words and profanity algorithms.

import os
import re
import csv
import sys
import Algorithmia as alg

# Add in your Algorithmia API key
client = alg.client('algorithmia_api_key')

def read_data():
    """Create the list of Tweets from your query."""
    filename = os.path.join(sys.argv[1].replace(' ', '-'))
    with open('data/{0}.csv'.format(filename)) as data_file:
        data_object = csv.DictReader(data_file, delimiter=',')
        text_data = [tweets['text'] for tweets in data_object]
    return text_data

def process_text():
    """Remove emoticons, numbers, etc. and return a list of cleaned tweets."""
    stripped_text = [
        re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?" +
               sys.argv[1].lower(), '', tweets.lower()).strip()
        for tweets in read_data()
    ]
    return stripped_text
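To see what process_text() does to a single tweet, here’s a quick, self-contained illustration (the sample tweet is made up, and 'trump' stands in for sys.argv[1].lower()):

import re

sample = "RT @someone: Trump rally tonight!! https://t.co/abc123"
# The same pattern process_text() builds, with the lowercased query appended
pattern = "(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?" + "trump"
print(re.sub(pattern, '', sample.lower()).strip())  # -> 'trump rally tonight'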

Our first step in cleaning up the data was to use some regex to remove emoticons and numbers. Then, we call the Retrieve Stop Words algorithm to further scrub our data; this also helps the Profanity algorithm run a little faster, since it doesn’t have to parse all the common English words that provide no value.

def remove_stop_words():
    """Remove stop words in tweets."""
    algo = client.algo('nlp/RetrieveStopWords/0.1.1')
    # Input is an empty list
    stop_word_list = algo.pipe([])
    # If our word is not in the stop list then we add it to our word list
    clean_text = ' '.join([word for sentence in process_text()
                           for word in sentence.split(' ')
                           if word not in stop_word_list.result])
    return clean_text

That’s it for cleaning up our tweets!

Step Four: Checking Tweets for Profanity

Now, we’ll check out the Profanity Detection algorithm and discover the swear words in our tweets. The algorithm does a basic string match against a list of around 340 words from noswearing.com. Check out the Profanity Detection algorithm page to learn more about the details, and about how you can customize the word list by adding your own offensive words, since fun new offensive colloquialisms are added to the English language every day. Don’t believe us? Just check out Urban Dictionary for some new favorites that have popped up.
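Purely as a hypothetical sketch of that customization (the two-element input below is an assumption, not the confirmed API; check the algorithm page for the real schema):

# Hypothetical input shape: the corpus plus a list of extra words to flag.
# Verify against the Profanity Detection docs before relying on this.
algo = client.algo('nlp/ProfanityDetection/0.1.2')
result = algo.pipe([remove_stop_words(), ["darn", "frak"]]).result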

The profanity function is fairly straightforward:

def profanity():
    """Returns a dictionary of swear words and their frequency"""
    algo = client.algo('nlp/ProfanityDetection/0.1.2')
    # Pass in the clean list of tweets combined into a single corpus
    result = algo.pipe([remove_stop_words()]).result
    # Total profanity in corpus
    total = sum(result.values())
    print("total swear words", result, total)

You’re simply passing in the cleaned text, stripped of English stop words. We’ve joined the tweets into a single corpus since we’re interested in the total profanity across all the tweets in our data, rather than the profanity of each tweet. Our function profanity() prints out both the result of the algorithm and the total number of swear words. At the time of this writing, the query ‘Donald Trump OR Trump’ returned 30 swear words, and ‘Hillary Clinton OR Hillary’ returned 8.
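One note: the snippets above only define functions, so to run the analysis you’d add a small entry point to the bottom of profanity_analysis.py (our addition, mirroring twitter_pull_data.py):

if __name__ == '__main__':
    profanity()

Then call it with the same query you used to pull the data:

python profanity_analysis.py 'Donald Trump OR Trump'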

When we pulled our Twitter data, we also grabbed the user_id and the retweet count of each tweet. This is useful because you might want to gauge the popularity of a tweet with some light analysis, for example checking whether tweets that use more profanity tend to be more or less popular, as in the sketch below.
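Here’s a rough sketch of that idea (the retweet threshold of 10 is an arbitrary assumption, and we simply compare totals between the two groups rather than computing a real probability):

import csv
import Algorithmia as alg

client = alg.client('algorithmia_api_key')

def profanity_by_popularity(path, threshold=10):
    """Compare total profanity in popular vs. less popular tweets."""
    algo = client.algo('nlp/ProfanityDetection/0.1.2')
    with open(path) as f:
        rows = list(csv.DictReader(f))
    # Split the tweets on retweet count and join each group into one corpus
    popular = ' '.join(r['text'] for r in rows if int(r['retweet_count']) > threshold)
    unpopular = ' '.join(r['text'] for r in rows if int(r['retweet_count']) <= threshold)
    for label, corpus in (('popular', popular), ('unpopular', unpopular)):
        counts = algo.pipe([corpus]).result
        print(label, sum(counts.values()))

profanity_by_popularity('data/Donald-Trump-OR-Trump.csv')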

Next Steps

Be sure to check out our other NLP algorithms, such as Social Sentiment Analysis or LDA (tags). Microservices like AnalyzeTweets combine the previously mentioned algorithms with one that retrieves tweets, returning the negative and positive sentiment of each tweet along with its negative and positive LDA topics. There is no shortage of combinations you can create for quick exploratory analysis, and you can add algorithms such as Profanity Detection or Nudity Detection to your app to make sure your content is family friendly.
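As one last, heavily hedged sketch, calling AnalyzeTweets might look something like this (the algorithm path, version, and input schema here are assumptions based on the pattern of the other microservices in this post; confirm them on the algorithm page):

import Algorithmia as alg

client = alg.client('algorithmia_api_key')
# Path and version are assumptions; check the AnalyzeTweets algorithm page
algo = client.algo('nlp/AnalyzeTweets/0.1.3')
input = {
    "query": "Donald Trump OR Trump",
    # Placeholder Twitter credentials, as in twitter_pull_data.py
    "auth": {
        "app_key": 'consumer_key',
        "app_secret": 'consumer_secret',
        "oauth_token": 'access_token',
        "oauth_token_secret": 'access_secret'
    }
}
result = algo.pipe(input).result  # expected: per-tweet sentiment and LDA output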

Enjoy exploring the platform, and as always, if you have any questions feel free to reach out!




