With Google just about to launch a point of presence in Sydney for its Google Cloud Platform (GCP), we thought it timely to explore Google’s Cloud Natural Language API, part of the Google Cloud Platform Machine Learning suite. Plenty of articles outline different ways to do text analysis, but what Google offers is a kind of black box: you simply call an API and get a predicted value back. For the average developer this means we no longer need to be statisticians, and we don’t have to accumulate the vast amount of data this kind of analysis requires. Sure, we forgo the ability to fine-tune the algorithm, but there’s definitely a market for getting productive right away and building important applications instead of building everything from the ground up.
What we’ll try to do is use the sentiment analysis and entity analysis from the language API. The input data will be streams from Twitter that include trump as the keyword so we can get plenty of traffic.
Sentiment analysis gives us a floating point score between -1 and 1 for the entire text string. Anything < 0 is considered ‘Negative’ and anything > 0 is ‘Positive’. 0 and the small values around it will be considered ‘Neutral’.
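As a minimal sketch of that mapping, the function below turns a score into one of the three labels. The width of the neutral band (`neutralBand`) is our own assumption for illustration; the API only returns the score and leaves the cut-off to us.

```go
package main

import "fmt"

// neutralBand is a hypothetical cut-off, not something the API defines.
// Scores within +/- neutralBand of zero are treated as neutral.
const neutralBand = 0.1

// classify maps a sentiment score in [-1, 1] to a coarse label.
func classify(score float32) string {
	switch {
	case score > neutralBand:
		return "POSITIVE"
	case score < -neutralBand:
		return "NEGATIVE"
	default:
		return "NEUTRAL"
	}
}

func main() {
	for _, s := range []float32{-0.6, 0.05, 0.7} {
		fmt.Printf("score %f is %s\n", s, classify(s))
	}
}
```

Widening or narrowing `neutralBand` is a judgment call that depends on how noisy the input text is.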
Entity analysis tries to give us the subjects and nouns of the tweet.
Why is this a big deal?
This next part is a bit of speculation, but we’ll try to guess the effort Google is providing for us when we use these APIs. Let’s just pick sentiment analysis alone and try to figure out at a high level what we would need to do if we had to implement it ourselves.
First we need to collect data from the various data sources, be it tweets, articles, comments, databases etc. So the main tasks here are handling the different input sources, getting a sufficient quantity of data, and making sure that the data is backed up and highly available for processing. This means using something like Cloud Storage, AWS S3, HA databases, or heaven forbid, rolling our own infrastructure. We would also have to filter the data by language and ensure that it is cleaned before we store it. If it goes into a database we may also need to transform and structure it into our columns. We can collect a lot of data, but we think Google would have access to plenty themselves. There are entire jobs dedicated to just doing ETL.
Training the data
If we approach this as a classification problem, we need to take a subset of the data and label it. In this case we need to take some sample tweets and label each one ourselves as positive, negative, or neutral. Perhaps we can automate this process by allowing the end users of our product to *like* some text, thumbs up or thumbs down, or say nothing about it and then feed it back into the learning model.
The goal of our processing is to build a final model that can take some input X and give it a score between -1 and 1 that translates to a sentiment. But what’s involved in the processing?
Imagine if we had to batch process terabytes to petabytes of text. How much training data is representative enough of that size? How much processing resource from an infrastructure side do we need to dedicate to process it all and how often do we have to re-process it when we get new data?
We need to spin up a cluster of computers that work together and use our algorithm to:
- Pull the large quantities of text from the datasource(s) from the input stage.
- Tokenize the string.
- Remove stop words.
- Find relevance to the corpus / bounded contexts.
- Train the data set.
- Test the data set and confirm accuracy.
- Publish the final model.
There are plenty of smart people out there, but Google has plenty of them as well to work their mathematical magic on this algorithm. If we were to do it ourselves, I can imagine spinning up some expensive Spark cluster to do all that batch processing and write the model back to storage again. We would then test and re-test many algorithms to see which one fits for the moment. Finally we can move on to the easy part:
Providing the model in an API
Our final output will need to take this generated model and provide a scalable API for our consumers. This is probably the simplest of all the tasks and is simply another web app which we can put on an autoscaling group on our platform of choice. The problem here is that the model doesn’t stay relevant forever. We need to go through the entire process again and re-publish our model to stay relevant. So the entire process goes through a rinse and repeat with constant adjustments to the algorithm and that’s just for one language alone!
Phew, that’s a lot of work! That’s what AWS, Google and Azure are trying to solve for us by enabling us to write smarter apps more quickly. They’re not perfect, but like anything that’s machine learning related, more data and refinements will yield better results.
So let’s take a look at what the pricing model looks like for the consumer after all that.
Natural Language Pricing
From Google’s Pricing calculator we can estimate the cost of calling the natural language API:
    Entity Recognition: 100,000 records
    Sentiment Analysis: 100,000 records
    Syntax Analysis: 0 records
    $190.00
    Total Estimated Cost: $190.00 per 1 month
That’s really cheap! Let’s compare that to the cost of buying a coffee:
    Small flat white: 2 per day ($3.50 ea)
    Sentiment Analysis: Positive
    Total Estimated Cost: $140 per 20 days (I don't drink them on weekends)
Getting tweets from Golang
Set up your Twitter app
Navigate to https://apps.twitter.com to create a new Twitter application. There are 4 things we need from setting up this app: the consumer key, consumer secret, access token and access secret. The streaming API is free, but you don’t get 100% of the tweets and they arrive at a lower stream rate. That’s good enough for now. If you’re using Git Bash for Windows or Bash then have the following filled out in your shell profile:

    # Twitter Creds
    export TWITTER_CONSUMER_KEY=<value>
    export TWITTER_CONSUMER_SECRET=<value>
    export TWITTER_ACCESS_TOKEN=<value>
    export TWITTER_ACCESS_SECRET=<value>

The example program will also take them in as command line arguments but for testing it will also pick up these ENV vars.
Once those are added, make sure they’re reflected in your current shell by issuing the source ~/.profile command.
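The "flags with environment fallback" pattern can be sketched as follows. This is not the example program's actual code: the flag names, the `credentials` type and the `loadCredentials` helper are all hypothetical. Only the four `TWITTER_*` variable names come from the profile snippet above.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// credentials holds the four values from the Twitter app page.
type credentials struct {
	ConsumerKey, ConsumerSecret, AccessToken, AccessSecret string
}

// loadCredentials parses command line flags, using the TWITTER_* environment
// variables as defaults. The flag names are hypothetical. getenv is passed in
// so the lookup can be swapped out when testing.
func loadCredentials(args []string, getenv func(string) string) (credentials, error) {
	var c credentials
	fs := flag.NewFlagSet("tweets", flag.ContinueOnError)
	fs.StringVar(&c.ConsumerKey, "consumer-key", getenv("TWITTER_CONSUMER_KEY"), "Twitter consumer key")
	fs.StringVar(&c.ConsumerSecret, "consumer-secret", getenv("TWITTER_CONSUMER_SECRET"), "Twitter consumer secret")
	fs.StringVar(&c.AccessToken, "access-token", getenv("TWITTER_ACCESS_TOKEN"), "Twitter access token")
	fs.StringVar(&c.AccessSecret, "access-secret", getenv("TWITTER_ACCESS_SECRET"), "Twitter access secret")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if c.ConsumerKey == "" || c.ConsumerSecret == "" || c.AccessToken == "" || c.AccessSecret == "" {
		return c, errors.New("missing Twitter credentials: set the TWITTER_* variables or pass the flags")
	}
	return c, nil
}

func main() {
	if _, err := loadCredentials(os.Args[1:], os.Getenv); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("Twitter credentials loaded")
}
```

A flag passed explicitly wins over the environment variable, which is handy when testing with a second set of keys.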
The Go app in its entirety is available here: Natural Language API Example in Go.
The Twitter streaming part is almost a copy and paste from the library it uses which you can peruse. We’ll focus on the part where we process the tweet.
The actual calls to the API sit in the getSentiment() function calls. They use protocol buffers, which the SDK already provides for us. This should be faster than the RESTful API since the data arrives as binary and we deserialize it in the app. Minus the error checking, it only took us a few lines of code to get what we wanted.
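For comparison, this is roughly what the RESTful route involves. The sketch below is not code from the example app. It uses only the standard library to build the JSON body for a sentiment request and to decode a reply, with field names following the public v1 REST reference as we understand it. The helper names are hypothetical and the sample response is made up for illustration; no network call is made.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// document and sentimentRequest mirror the JSON body of a sentiment request.
type document struct {
	Type    string `json:"type"`
	Content string `json:"content"`
}

type sentimentRequest struct {
	Document     document `json:"document"`
	EncodingType string   `json:"encodingType"`
}

// sentimentResponse keeps only the fields this sketch cares about.
type sentimentResponse struct {
	DocumentSentiment struct {
		Score     float32 `json:"score"`
		Magnitude float32 `json:"magnitude"`
	} `json:"documentSentiment"`
	Language string `json:"language"`
}

// buildRequest (hypothetical helper) encodes a plain-text document as JSON.
func buildRequest(text string) ([]byte, error) {
	return json.Marshal(sentimentRequest{
		Document:     document{Type: "PLAIN_TEXT", Content: text},
		EncodingType: "UTF8",
	})
}

// parseScore (hypothetical helper) extracts the document-level score.
func parseScore(body []byte) (float32, error) {
	var resp sentimentResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, err
	}
	return resp.DocumentSentiment.Score, nil
}

func main() {
	req, err := buildRequest("This Is the One Thing Alec Baldwin Likes About Donald Trump")
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(string(req))

	// sample is a made-up reply body, not captured from the real service.
	sample := []byte(`{"documentSentiment":{"magnitude":0.7,"score":0.7},"language":"en"}`)
	score, err := parseScore(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("score %f\n", score)
}
```

With the SDK, the generated protocol buffer types replace these hand-written structs, which is a large part of why the real code stays so short.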
Final output with entities
    2017/03/10 09:53:21 Starting Stream...
    2017/03/10 09:53:23 Entity name: RT @InxsyS
    2017/03/10 09:53:23 Entity name: Trump
    2017/03/10 09:53:23 Entity name: 'T
    2017/03/10 09:53:23 Entity name: Repugnants
    2017/03/10 09:53:23 Entity name: LIE
    2017/03/10 09:53:23 Entity name: Sphincter
    2017/03/10 09:53:23 Entity name: LIE
    2017/03/10 09:53:23 Entity name: ADMINISTRATION
    2017/03/10 09:53:23 Entity name: PUBLIC
    2017/03/10 09:53:23 Entity name: AMERICAN
    2017/03/10 09:53:23 RT @InxsyS: Trump,Sphincter & All Repugnants- YOU DON'T GET TO TELL LIE AFTER LIE TO THE AMERICAN PUBLIC--THEN SAY YOUR ADMINISTRATION IS… is NEGATIVE with score -0.600000
    2017/03/10 09:53:26 Entity name: Alec Baldwin
    2017/03/10 09:53:26 Entity name: This Is the One
    2017/03/10 09:53:26 Entity name: Donald Trump
    2017/03/10 09:53:26 Entity name: https://t.co/XXarEybxTE https://t.co/FnG55iSbvn
    2017/03/10 09:53:26 This Is the One Thing Alec Baldwin Likes About Donald Trump https://t.co/XXarEybxTE https://t.co/FnG55iSbvn is POSITIVE with score 0.700000
    2017/03/10 09:53:28 Entity name: RT @CBSThisMorning
    2017/03/10 09:53:28 Entity name: Trump
    2017/03/10 09:53:28 Entity name: Sean Spicer
    2017/03/10 09:53:28 Entity name: White House
    2017/03/10 09:53:28 Entity name: investigation
    2017/03/10 09:53:28 Entity name: Justice Department
    2017/03/10 09:53:28 RT @CBSThisMorning: Sean Spicer says White House is "not aware" of any Justice Department investigation into President Trump.
    https://t.co/… is POSITIVE with score 0.200000
    2017/03/10 09:53:34 Entity name: A Brand Name
    2017/03/10 09:53:34 Entity name: Hedge Fund Happy Hour: Trump
    2017/03/10 09:53:34 Entity name: Mar-a-Lago
    2017/03/10 09:53:34 Entity name: NYT
    2017/03/10 09:53:34 Entity name: The New York Times https://t.co/tBP18mNEb9
    2017/03/10 09:53:34 Entity name: ALEXANDRA STEVENSON
    2017/03/10 09:53:34 "A Brand Name for a Hedge Fund Happy Hour: Trump’s Mar-a-Lago" by ALEXANDRA STEVENSON via NYT The New York Times https://t.co/tBP18mNEb9 is POSITIVE with score 0.600000
Some of the positives are pretty weak: the closer the score is to 1, the more likely the text really is positive.
The entity analysis seems to do a pretty decent job. I can picture that working well with other ML projects, chatbots and the like.
That being said, if the purpose is to combat trolling and figure out the toxicity level of comments, Google did recently release the Perspective API. This is a really interesting project to help clean up the Internet, which it drastically needs.
Machine Learning is certainly an interesting domain and we’re excited about the pace at which the big cloud providers are innovating in this space. Once GCP is in Sydney we can’t wait to experiment more with Google’s suite of services.
Thank you for reading!