Introduction

Digital transformation (DX) is accelerating, and with it the need to make sense of massive collections of text. Latent Dirichlet Allocation (LDA), a popular approach for discovering hidden topics in text data, is one effective way to tackle this problem.

This article shows how to apply LDA to the AG News dataset, a large collection of news stories well suited to text analysis. As organizations move toward data-driven decision-making, LDA is a valuable tool for surfacing the key trends and themes hidden in their text.

What is Topic Modeling?

Topic modeling is a technique that uses unsupervised machine learning to discover hidden patterns in a text corpus by grouping similar words into clusters, revealing underlying topics. For example, it can reveal that a pile of documents naturally separates into invoices, complaints, and contracts based on the vocabulary each one uses.

With huge amounts of mostly unstructured data generated daily, manually sorting through it is impractical. Topic modeling automates this process, helping businesses quickly extract insights from unstructured data.

Introducing LDA: A Popular Topic Modeling Technique

One of the most popular methods for topic modeling is called Latent Dirichlet Allocation, or LDA for short. Here’s how it works in simple terms:

  1. LDA assumes that each document talks about a mix of topics.
  2. Each topic is associated with certain words.
  3. The computer tries to figure out which words go together to form topics, and which topics are present in each document.

For example, if many articles contain words like “player,” “team,” “score,” and “championship,” LDA might identify this as a “Sports” topic.
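
To make this generative story concrete, here is a toy sketch; the vocabulary and every probability in it are invented for illustration (real LDA learns these values from data):

import numpy as np

rng = np.random.default_rng(0)

# Invented vocabulary and topic-word probabilities (each row sums to 1)
vocab = ["player", "team", "score", "market", "stock", "profit"]
topic_word = np.array([
    [0.30, 0.30, 0.30, 0.04, 0.03, 0.03],  # a "Sports"-like topic
    [0.03, 0.04, 0.03, 0.30, 0.30, 0.30],  # a "Business"-like topic
])

# Each document is a mix of topics; a Dirichlet draw gives the mixture
doc_topic = rng.dirichlet(alpha=[0.5, 0.5])

# Generate a short document word by word: pick a topic, then a word from it
words = [vocab[rng.choice(6, p=topic_word[rng.choice(2, p=doc_topic)])]
         for _ in range(8)]
print(doc_topic.round(2), words)

LDA inverts this process: given only the documents, it estimates the topic-word and document-topic distributions.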

The AG News Dataset

For this blog post, we’ll be using a collection of news articles called the AG News dataset. It consists of:

  • 120,000 training samples and 7,600 test samples
  • 4 classes: World, Sports, Business, and Sci/Tech
  • Each class contains 30,000 training samples and 1,900 test samples

This dataset is particularly suitable for topic modeling due to its diverse range of news articles across different categories.

Preparing the Data

Before applying LDA, we need to preprocess our data. Here's a Python script to get us started:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the data
df = pd.read_csv("/kaggle/input/ag-news-classification-dataset/train.csv")

# Combine title and description into a single text field
df['Text'] = df['Title'] + ' ' + df['Description']

# Create the document-term matrix, pruning very common and very rare terms
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(df['Text'])

This script does the following:

  1. Loads the AG News dataset
  2. Combines the title and description of each article
  3. Creates a document-term matrix using CountVectorizer, which:
    • Removes common English stop words
    • Ignores terms that appear in more than 95% of the documents (max_df=0.95)
    • Ignores terms that appear in fewer than 2 documents (min_df=2)
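
As a quick sanity check (a minimal sketch; the exact vocabulary size depends on your copy of the data), we can inspect the matrix we just built:

print(doc_term_matrix.shape)                   # (number of documents, vocabulary size)
print(vectorizer.get_feature_names_out()[:5])  # a few of the learned terms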

Implementing LDA

Now that our data is prepared, let's implement LDA:

from sklearn.decomposition import LatentDirichletAllocation

# Set up and train an LDA model with 10 topics
lda_model = LatentDirichletAllocation(n_components=10, random_state=42)
lda_output = lda_model.fit_transform(doc_term_matrix)

# Print the top 10 words for each topic
# (argsort is ascending, so the reversed slice picks the 10 largest weights)
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda_model.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")

Here is the result of this script:

  • Topic 0: oil, 39, prices, quot, oracle, peoplesoft, said, european, bid, reuters
  • Topic 1: gt, lt, 39, world, font, new, year, dvd, face, said
  • Topic 2: 39, open, world, gaza, final, game, test, australia, cup, israeli
  • Topic 3: reuters, percent, 39, sales, quarter, stocks, new, said, profit, year
  • Topic 4: 39, said, space, reuters, ap, iran, nuclear, new, palestinian, people
  • Topic 5: microsoft, new, 39, software, internet, service, computer, security, search, mobile
  • Topic 6: 39, ap, game, win, night, new, victory, team, league, lead
  • Topic 7: gt, lt, reuters, com, said, fullquote, million, company, new, target
  • Topic 8: iraq, said, president, 39, ap, bush, reuters, killed, afp, minister
  • Topic 9: 39, ap, year, season, coach, new, sports, football, time, team

Analyzing the Results

After running LDA, you might find topics that correspond roughly to the original categories in the AG News dataset (a quick empirical check is sketched after the list). For example:

  • World: topic 0, topic 4, topic 8
  • Sports: topic 2, topic 6, topic 9
  • Business: topic 3, topic 7
  • Sci/Tech: topic 5
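
One way to check this mapping empirically is to assign each document its highest-weight topic and cross-tabulate against the true labels. A minimal sketch, assuming the Kaggle CSV names its label column "Class Index" (1 = World, 2 = Sports, 3 = Business, 4 = Sci/Tech):

# Reuses df and lda_output from the steps above
dominant_topic = lda_output.argmax(axis=1)
print(pd.crosstab(df['Class Index'], dominant_topic))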

However, some topics contain junk tokens like 'gt', 'lt', 'quot', and '39'. These are remnants of HTML entities in the raw text (&gt;, &lt;, &quot;, &#39;) whose '&' and ';' characters were stripped during tokenization, leaving the bare entity names behind. Let's try to improve our model with a different approach to text preprocessing.
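
Incidentally, these artifacts could also be removed at the source by decoding the entities before vectorizing, for example with Python's standard html module (a minimal sketch; below we take the spaCy route instead):

import html

# Turn '&lt;' back into '<', '&#39;' into "'", and so on, so the bare
# tokens 'lt', 'gt', 'quot', and '39' never reach the vectorizer
df['Text'] = df['Text'].apply(html.unescape)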

Improving Our Results with spaCy

We can improve our results by using more advanced text preprocessing techniques. One powerful tool for this is spaCy, a popular NLP library in Python. Let’s try some modifications to our pipeline:

  • Part-of-speech filtering: By keeping only certain parts of speech, we focus on the most meaningful words and reduce noise.
  • Lemmatization: This reduces words to their base form, which can help group related words together (e.g., “running,” “ran,” and “runs” all become “run”).
  • Better handling of proper nouns: spaCy's part-of-speech tagger reliably identifies proper nouns (PROPN), which are often important in news articles. (spaCy also offers named entity recognition, though we don't use it below.)

import spacy

# Requires the model to be installed first:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

def clean_data(data):
    # Keep only content-bearing parts of speech and lemmatize each token
    include_pos = ["NOUN", "VERB", "ADV", "PROPN", "ADJ"]
    tokens = nlp(data)
    tokens = [t.lemma_ for t in tokens if t.pos_ in include_pos]
    return " ".join(tokens)

# Note: applying nlp row by row is slow on 120,000 documents;
# nlp.pipe is a faster batched alternative
df["NormalizeText"] = df.Text.apply(clean_data)

After preprocessing the data, we can re-vectorize the cleaned text and apply LDA as before. A minimal sketch follows; it uses 4 topics, an assumption we make to match the dataset's four categories, since four topics are reported below.
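
# Second pass (sketch): rebuild the document-term matrix from the
# spaCy-normalized text and refit LDA
# (n_components=4 is our assumption; the first run used 10)
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(df['NormalizeText'])

lda_model = LatentDirichletAllocation(n_components=4, random_state=42)
lda_output = lda_model.fit_transform(doc_term_matrix)

feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda_model.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")

Here are the resulting topics: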

  • Topic 0: oil, price, reuters, stock, rise, high, new, say, dollar, year
  • Topic 1: space, win, say, new, world, athens, team, year, gold, nasa
  • Topic 2: say, red, reuters, league, sox, ap, united, new, israeli, quot
  • Topic 3: microsoft, new, software, company, google, search, service, web, user, internet

Conclusion

Using LDA, we were able to uncover meaningful topics within the AG News dataset, demonstrating the power of unsupervised learning for exploring unlabeled text. As organizations continue to embrace digital transformation, techniques like LDA will play an essential role in gaining insights from vast amounts of text data. By refining our text preprocessing methods, we can further enhance the quality of the topics discovered, making LDA even more valuable for real-world applications.