Introduction

Digital transformation (DX) is accelerating, and with it the need to make sense of massive collections of text. Latent Dirichlet Allocation (LDA), a popular approach for discovering hidden topics in text data, is one effective way to tackle this problem.

This article shows how to apply LDA to the AG News dataset, a large collection of news stories well suited to text analysis. As organizations move toward data-driven decision-making, LDA is a valuable tool for surfacing the key trends and themes hidden in their text.

What is Topic Modeling?

Topic modeling is a technique that uses unsupervised machine learning to discover hidden patterns in a text corpus by grouping similar words into clusters, revealing underlying topics. For example, it can reveal that a pile of documents naturally separates into invoices, complaints, and contracts based on the vocabulary each one uses.

With huge amounts of mostly unstructured data generated daily, manually sorting through it is impractical. Topic modeling automates this process, helping businesses quickly extract insights from unstructured data.

Introducing LDA: A Popular Topic Modeling Technique

One of the most popular methods for topic modeling is called Latent Dirichlet Allocation, or LDA for short. Here’s how it works in simple terms:

  1. LDA assumes that each document talks about a mix of topics.
  2. Each topic is associated with certain words.
  3. The computer tries to figure out which words go together to form topics, and which topics are present in each document.

For example, if many articles contain words like “player,” “team,” “score,” and “championship,” LDA might identify this as a “Sports” topic.
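
To make this generative story concrete, here is a toy sketch; the vocabulary and every probability in it are invented for illustration (real LDA learns these values from data):

import numpy as np

rng = np.random.default_rng(0)

# Invented vocabulary and topic-word probabilities (each row sums to 1)
vocab = ["player", "team", "score", "market", "stock", "profit"]
topic_word = np.array([
    [0.30, 0.30, 0.30, 0.04, 0.03, 0.03],  # a "Sports"-like topic
    [0.03, 0.04, 0.03, 0.30, 0.30, 0.30],  # a "Business"-like topic
])

# Each document is a mix of topics; a Dirichlet draw gives the mixture
doc_topic = rng.dirichlet(alpha=[0.5, 0.5])

# Generate a short document word by word: pick a topic, then a word from it
words = [vocab[rng.choice(6, p=topic_word[rng.choice(2, p=doc_topic)])]
         for _ in range(8)]
print(doc_topic.round(2), words)

LDA inverts this process: given only the documents, it estimates the topic-word and document-topic distributions.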

The AG News Dataset

For this blog post, we’ll be using a collection of news articles called the AG News dataset. It consists of:

  • 120,000 training samples and 7,600 test samples
  • 4 classes: World, Sports, Business, and Sci/Tech
  • Each class contains 30,000 training samples and 1,900 test samples

This dataset is particularly suitable for topic modeling due to its diverse range of news articles across different categories.

Preparing the Data

Before applying LDA, we need to preprocess our data. Here's a Python script to get us started:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the data
df = pd.read_csv("/kaggle/input/ag-news-classification-dataset/train.csv")

# Combine title and description into a single text field
df['Text'] = df['Title'] + ' ' + df['Description']

# Create the document-term matrix, pruning very common and very rare terms
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(df['Text'])

This script does the following:

  1. Loads the AG News dataset
  2. Combines the title and description of each article
  3. Creates a document-term matrix using CountVectorizer, which:
    • Removes common English stop words
    • Ignores terms that appear in more than 95% of the documents (max_df=0.95)
    • Ignores terms that appear in fewer than 2 documents (min_df=2)
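
As a quick sanity check (a minimal sketch; the exact vocabulary size depends on your copy of the data), we can inspect the matrix we just built:

print(doc_term_matrix.shape)                   # (number of documents, vocabulary size)
print(vectorizer.get_feature_names_out()[:5])  # a few of the learned terms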

Implementing LDA

Now that our data is prepared, let's implement LDA:

from sklearn.decomposition import LatentDirichletAllocation

# Set up and train an LDA model with 10 topics
lda_model = LatentDirichletAllocation(n_components=10, random_state=42)
lda_output = lda_model.fit_transform(doc_term_matrix)

# Print the top 10 words for each topic
# (argsort is ascending, so the reversed slice picks the 10 largest weights)
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda_model.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")

Here is the result of this script:

  • Topic 0: oil, 39, prices, quot, oracle, peoplesoft, said, european, bid, reuters
  • Topic 1: gt, lt, 39, world, font, new, year, dvd, face, said
  • Topic 2: 39, open, world, gaza, final, game, test, australia, cup, israeli
  • Topic 3: reuters, percent, 39, sales, quarter, stocks, new, said, profit, year
  • Topic 4: 39, said, space, reuters, ap, iran, nuclear, new, palestinian, people
  • Topic 5: microsoft, new, 39, software, internet, service, computer, security, search, mobile
  • Topic 6: 39, ap, game, win, night, new, victory, team, league, lead
  • Topic 7: gt, lt, reuters, com, said, fullquote, million, company, new, target
  • Topic 8: iraq, said, president, 39, ap, bush, reuters, killed, afp, minister
  • Topic 9: 39, ap, year, season, coach, new, sports, football, time, team

Analyzing the Results

After running LDA, you might find topics that correspond roughly to the original categories in the AG News dataset (a quick empirical check is sketched after the list). For example:

  • World: topic 0, topic 4, topic 8
  • Sports: topic 2, topic 6, topic 9
  • Business: topic 3, topic 7
  • Sci/Tech: topic 5
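
One way to check this mapping empirically is to assign each document its highest-weight topic and cross-tabulate against the true labels. A minimal sketch, assuming the Kaggle CSV names its label column "Class Index" (1 = World, 2 = Sports, 3 = Business, 4 = Sci/Tech):

# Reuses df and lda_output from the steps above
dominant_topic = lda_output.argmax(axis=1)
print(pd.crosstab(df['Class Index'], dominant_topic))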

However, some topics contain junk tokens like 'gt', 'lt', 'quot', and '39'. These are remnants of HTML entities in the raw text (&gt;, &lt;, &quot;, &#39;) whose '&' and ';' characters were stripped during tokenization, leaving the bare entity names behind. Let's try to improve our model with a different approach to text preprocessing.
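
Incidentally, these artifacts could also be removed at the source by decoding the entities before vectorizing, for example with Python's standard html module (a minimal sketch; below we take the spaCy route instead):

import html

# Turn '&lt;' back into '<', '&#39;' into "'", and so on, so the bare
# tokens 'lt', 'gt', 'quot', and '39' never reach the vectorizer
df['Text'] = df['Text'].apply(html.unescape)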

Improving Our Results with spaCy

We can improve our results by using more advanced text preprocessing techniques. One powerful tool for this is spaCy, a popular NLP library in Python. Let’s try some modifications to our pipeline:

  • Part-of-speech filtering: By keeping only certain parts of speech, we focus on the most meaningful words and reduce noise.
  • Lemmatization: This reduces words to their base form, which can help group related words together (e.g., “running,” “ran,” and “runs” all become “run”).
  • Better handling of proper nouns: spaCy's part-of-speech tagger reliably identifies proper nouns (PROPN), which are often important in news articles. (spaCy also offers named entity recognition, though we don't use it below.)

import spacy

# Requires the model to be installed first:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

def clean_data(data):
    # Keep only content-bearing parts of speech and lemmatize each token
    include_pos = ["NOUN", "VERB", "ADV", "PROPN", "ADJ"]
    tokens = nlp(data)
    tokens = [t.lemma_ for t in tokens if t.pos_ in include_pos]
    return " ".join(tokens)

# Note: applying nlp row by row is slow on 120,000 documents;
# nlp.pipe is a faster batched alternative
df["NormalizeText"] = df.Text.apply(clean_data)

After preprocessing the data, we can re-vectorize the cleaned text and apply LDA as before. A minimal sketch follows; it uses 4 topics, an assumption we make to match the dataset's four categories, since four topics are reported below.
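
# Second pass (sketch): rebuild the document-term matrix from the
# spaCy-normalized text and refit LDA
# (n_components=4 is our assumption; the first run used 10)
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(df['NormalizeText'])

lda_model = LatentDirichletAllocation(n_components=4, random_state=42)
lda_output = lda_model.fit_transform(doc_term_matrix)

feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda_model.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")

Here are the resulting topics: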

  • Topic 0: oil, price, reuters, stock, rise, high, new, say, dollar, year
  • Topic 1: space, win, say, new, world, athens, team, year, gold, nasa
  • Topic 2: say, red, reuters, league, sox, ap, united, new, israeli, quot
  • Topic 3: microsoft, new, software, company, google, search, service, web, user, internet

Conclusion

Using LDA, we were able to uncover meaningful topics within the AG News dataset, demonstrating the power of unsupervised learning for exploring unlabeled text. As organizations continue to embrace digital transformation, techniques like LDA will play an essential role in gaining insights from vast amounts of text data. By refining our text preprocessing methods, we can further enhance the quality of the topics discovered, making LDA even more valuable for real-world applications.