Analysing News Article Content with Google Cloud Natural Language API


In my previous blog post I showed how to use AI Platform Training to fine-tune a custom NLP model using PyTorch and the transformers library. In this post we take advantage of Google’s pre-trained AI models for NLP and use Cloud Natural Language API to analyse text.

Google’s pre-trained machine learning APIs are great for building working AI prototypes and proofs of concept in a matter of hours. Google’s Cloud Natural Language API allows you to do named entity recognition, sentiment analysis, content classification and syntax analysis using a simple REST API. Client libraries are available for Python, Go, Java, Node.js, Ruby, PHP and C#. In this post we’ll be using the Python client library.


Before we jump in, let’s define our use case. To highlight the simplicity and power of the API, I’m going to use it to analyse the contents of news articles. In particular, I want to find out if the latest articles published in The Guardian’s world news section contain mentions of famous people and if those mentions have a positive or a negative sentiment. I also want to find out the overall sentiment of the news articles. To do this, we will go through a number of steps.

  1. We will use The Guardian’s RSS feed to extract links to the latest news articles in the world news section.
  2. We will download the HTML content of the articles published in the past 24 hours and extract the article text in plain text.
  3. We will analyse the overall sentiment of the text using Cloud Natural Language.
  4. We will extract named entities from the text using Cloud Natural Language.
  5. We will go through all named entities of type PERSON and see if they have a Wikipedia entry (for the purposes of this post, this will be our measure of the person being “famous”).
  6. Once we’ve identified all the mentions of “famous people”, we analyse the sentiment of the sentences mentioning them.
  7. Finally, we will print the names, Wikipedia links and the sentiments of the mentions of all the “famous people” in each article, together with the article title, url and the overall sentiment of the article.

We will do all this using GCP AI Platform Notebooks.

To launch a new notebook, make sure you are logged in to the Google Cloud Console and have an active project selected. Navigate to AI Platform Notebooks and select New Instance. For this demo you don’t need a very powerful notebook instance, so we will change some of the defaults to save cost. First, select Python 3 (without CUDA) from the list and give your notebook a name. Next, click the edit icon next to Instance properties and select n1-standard-1 as the Machine type. You will see that the estimated cost of running this instance is only $0.041 per hour.

Select Machine type

Once you have created the instance and it is running, click the Open JupyterLab link of your notebook instance. Once you’re in JupyterLab, create a new Python 3 notebook.

Steps 1–2: Extract the Latest News Articles

We start by installing some required Python libraries. The following command uses pip to install lxml, Beautiful Soup and Feedparser. We use lxml and Beautiful Soup for processing and parsing the HTML content. Feedparser will be used to parse the RSS feed to identify the latest news articles and to get the links to the full text of those articles.

!pip install lxml bs4 feedparser

Once we have installed the required libraries we need to import them together with the other libraries we need for extracting the news article content. Next, we will define the url to the RSS feed as well as the time period we want to limit our search to. We will then define two functions we will use to extract the main article text from the HTML document. The text_from_html function will parse the HTML file, extract the text from that file and use the tag_visible function to filter out all but the main article text.
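The two helper functions are not shown in full in this excerpt; a minimal sketch of how tag_visible and text_from_html might look, based on a common Beautiful Soup pattern, is below (the post installs lxml, which could be passed as the parser instead of the built-in one used here):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    """Filter out text nodes that are not part of the visible article body."""
    if element.parent.name in ("style", "script", "head", "title", "meta", "[document]"):
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(html):
    """Parse an HTML document and return its visible text as one plain-text string."""
    soup = BeautifulSoup(html, "html.parser")
    texts = soup.find_all(string=True)          # every text node in the document
    visible_texts = filter(tag_visible, texts)  # drop scripts, styles, comments, etc.
    return " ".join(t.strip() for t in visible_texts if t.strip())
```

Note that this keeps all visible text on the page, not only the article body; a production version would also filter by the CSS selectors of the article container.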

Once we have defined these functions we will parse the RSS feed, identify the articles published in the past 24 hours and extract the required attributes for those articles. We will need the article title, link, publishing time and, using the functions defined above, the plain text version of the article text.


Steps 3–7: Analyse the Content Using Cloud Natural Language API

To use the Natural Language API we will import the required libraries.

from google.cloud import language_v1
from google.cloud.language_v1 import enums

Next, we define the main function for the demo print_sentiments(document). In this function, in 21 lines of code, we will do all the needed text analysis as well as print the results to view the output. The function takes document as the input, analyses the contents and prints the results. We will look at the contents of the document input later.

To use the API we need to initialise the LanguageServiceClient. We then define the encoding type, which we need to pass together with the document to the API.

The first API call analyze_entities(document, encoding_type=encoding_type) takes the input document and the encoding type and returns a response of the following form:

{
  "entities": [
    {
      object(Entity)
    }
  ],
  "language": string
}

We will then call the API to analyse the sentiment of the document as well as to get the sentiments of each sentence in the document. The response has the following form:

{
  "documentSentiment": {
    object(Sentiment)
  },
  "language": string,
  "sentences": [
    {
      object(Sentence)
    }
  ]
}

The overall document sentiment is stored in annotations.document_sentiment.score. We assign the document an overall sentiment POSITIVE if the score is above 0, NEGATIVE if it is less than 0 and NEUTRAL if it is 0.
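The threshold rule above can be expressed as a small helper (the function name is my own):

```python
def sentiment_category(score):
    """Map a Natural Language sentiment score in [-1.0, 1.0] to a label."""
    if score > 0:
        return "POSITIVE"
    if score < 0:
        return "NEGATIVE"
    return "NEUTRAL"
```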

We then go through all the entities identified by the API and create a list of those entities that have the type PERSON. Once we have this list, we loop through it and check which entities have a wikipedia_url key in their metadata. As mentioned, we use this as our measure of the person being "famous". When we identify a "famous person", we print the person's name and the link to their Wikipedia entry.
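This filtering step can be sketched as follows, over data shaped like the `entities` field of the analyze_entities response (plain dicts here; the client library exposes the same fields as object attributes):

```python
def famous_people(entities):
    """Keep PERSON entities that carry a wikipedia_url in their metadata."""
    people = [e for e in entities if e["type"] == "PERSON"]
    return [
        (p["name"], p["metadata"]["wikipedia_url"])
        for p in people
        if "wikipedia_url" in p.get("metadata", {})
    ]
```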

We then check the sentiment annotated sentences for occurrence of the identified “famous person” and use the same values as above to determine the sentiment category of those sentences. Finally, we print all the sentiments of all the sentences mentioning the person.
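A sketch of this per-sentence check, over data shaped like the `sentences` field of the analyze_sentiment response (the function name and dict layout mirror the REST response; the client library exposes the same fields as attributes):

```python
def mention_sentiments(name, sentences):
    """Return (mention_number, category) for each sentence mentioning `name`."""
    results = []
    mention = 0
    for sentence in sentences:
        if name in sentence["text"]["content"]:
            mention += 1
            score = sentence["sentiment"]["score"]
            category = "POSITIVE" if score > 0 else "NEGATIVE" if score < 0 else "NEUTRAL"
            results.append((mention, category))
    return results
```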

Now that we have extracted the text from the news site and defined the function to analyse the contents of each article, all we need to do is go through the articles and call the function. The input for the function is a dictionary containing the plain text contents of the article, the type of the document (which in our case is PLAIN_TEXT) and the language of the document (which for us is English). We also print the title of each article and the link to the article.
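The input dictionary might be built like this (a sketch using the REST-style string value for the type; with the Python client library the type would instead come from enums.Document.Type.PLAIN_TEXT):

```python
def make_document(text):
    """Build the document payload passed to the Natural Language API calls."""
    return {
        "content": text,        # plain text extracted from the article
        "type": "PLAIN_TEXT",   # REST-style name for the document type
        "language": "en",       # the articles we analyse are in English
    }
```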

For demo purposes we limit our analysis to the first 3 articles. The code for the above steps is displayed below together with the output of running that code.

##################################################

‘We have to win’: Myanmar protesters persevere as forces ramp up violence
https://www.theguardian.com/world/2021/feb/28/we-have-to-win-myanmar-protesters-persevere-as-forces-ramp-up-violence
Overall sentiment: NEGATIVE

Person: Min Aung Hlaing
- Wikipedia: https://en.wikipedia.org/wiki/Min_Aung_Hlaing
- Sentence: 1 mentioning Min Aung Hlaing is: NEUTRAL

Person: Aung San Suu Kyi
- Wikipedia: https://en.wikipedia.org/wiki/Aung_San_Suu_Kyi
- Sentence: 1 mentioning Aung San Suu Kyi is: POSITIVE

##################################################

White House defends move not to sanction Saudi crown prince
https://www.theguardian.com/world/2021/feb/28/white-house-defends-not-sanction-saudi-crown-prince-khashoggi-killing
Overall sentiment: NEGATIVE

Person: Joe Biden
- Wikipedia: https://en.wikipedia.org/wiki/Joe_Biden
- Sentence: 1 mentioning Joe Biden is: NEGATIVE

Person: Mark Warner
- Wikipedia: https://en.wikipedia.org/wiki/Mark_Warner
- Sentence: 1 mentioning Mark Warner is: NEGATIVE

Person: Khashoggi
- Wikipedia: https://en.wikipedia.org/wiki/Jamal_Khashoggi
- Sentence: 1 mentioning Khashoggi is: NEGATIVE
- Sentence: 2 mentioning Khashoggi is: NEGATIVE
- Sentence: 3 mentioning Khashoggi is: NEGATIVE

Person: Jen Psaki
- Wikipedia: https://en.wikipedia.org/wiki/Jen_Psaki
- Sentence: 1 mentioning Jen Psaki is: NEGATIVE

Person: Democrats
- Wikipedia: https://en.wikipedia.org/wiki/Democratic_Party_(United_States)
- Sentence: 1 mentioning Democrats is: NEGATIVE

Person: Gregory Meeks
- Wikipedia: https://en.wikipedia.org/wiki/Gregory_Meeks
- Sentence: 1 mentioning Gregory Meeks is: POSITIVE

Person: Prince Mohammed
- Wikipedia: https://en.wikipedia.org/wiki/Mohammed_bin_Salman
- Sentence: 1 mentioning Prince Mohammed is: NEGATIVE

##################################################

Coronavirus live news: South Africa lowers alert level; Jordan ministers sacked for breaches
https://www.theguardian.com/world/live/2021/feb/28/coronavirus-live-news-us-approves-johnson-johnson-vaccine-auckland-starts-second-lockdown-in-a-month
Overall sentiment: NEGATIVE

Person: Germany
- Wikipedia: https://en.wikipedia.org/wiki/Germany
- Sentence: 1 mentioning Germany is: NEGATIVE
- Sentence: 2 mentioning Germany is: NEUTRAL

Person: Nick Thomas-Symonds
- Wikipedia: https://en.wikipedia.org/wiki/Nick_Thomas-Symonds
- Sentence: 1 mentioning Nick Thomas-Symonds is: NEGATIVE

Person: Cyril Ramaphosa
- Wikipedia: https://en.wikipedia.org/wiki/Cyril_Ramaphosa
- Sentence: 1 mentioning Cyril Ramaphosa is: NEGATIVE

Person: Raymond Johansen
- Wikipedia: https://en.wikipedia.org/wiki/Raymond_Johansen
- Sentence: 1 mentioning Raymond Johansen is: NEGATIVE

Person: Archie Bland
- Wikipedia: https://en.wikipedia.org/wiki/Archie_Bland
- Sentence: 1 mentioning Archie Bland is: NEUTRAL

##################################################

As you can see, the three articles we analysed all have an overall negative sentiment. We also found quite a few mentions of people with Wikipedia entries, along with the sentiments of the sentences mentioning them.

Conclusion

As we saw, the Cloud Natural Language API is a super simple and powerful tool that allows us to analyse text with just a few lines of code. This is great when you are working on a new use case and need to quickly test the feasibility of an AI-based solution. It is also the go-to resource when you don’t have data to train your own machine learning model for the task. However, if you need to create a more customised model for your use case, I recommend using AutoML Natural Language or training your own model using AI Platform Training.

Hope you enjoyed this demo. Feel free to contact me if you have any questions.

Aarne Talman, Global Machine Learning Practice Lead
