Text Mining is the process of extracting high-quality information from text. It involves techniques such as tokenization, stemming, lemmatization, part-of-speech tagging, named entity recognition, topic modeling, and sentiment analysis to transform unstructured text data into structured insights. The ultimate goal is to discover patterns, trends, and valuable knowledge that would otherwise remain hidden within large volumes of text.
Gensim is an open-source Python library specifically designed for unsupervised topic modeling and natural language processing. It is highly optimized for handling large text corpora efficiently, even those that don't fit into memory. Gensim's core strength lies in its ability to implement robust algorithms like Latent Semantic Analysis (LSA/LSI), Latent Dirichlet Allocation (LDA), and Word2Vec, among others.
Key concepts in Gensim include:
- Corpus: A collection of digital documents. In Gensim, a corpus is typically represented as a stream of documents, where each document is a list of word IDs and their frequencies (bag-of-words model).
- Dictionary: A mapping between words and their unique integer IDs. This is crucial for converting text into a numerical representation suitable for machine learning models.
- Models: Gensim provides implementations for various models:
  - LDA (Latent Dirichlet Allocation): A generative statistical model in which sets of observations are explained by unobserved groups that account for why some parts of the data are similar. For text, this means discovering the abstract 'topics' that occur in a collection of documents.
  - LSA (Latent Semantic Analysis): A technique that analyzes relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
  - Word2Vec/Doc2Vec: Neural network-based models for learning word and document embeddings, capturing semantic relationships between words and documents (a minimal Word2Vec sketch follows this list).
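
Word2Vec is not used in the LDA example later in this section, so here is a minimal sketch of training word embeddings with Gensim's Word2Vec. The toy sentences and the hyperparameter values (`vector_size`, `window`, `min_count`, `sg`, `epochs`) are illustrative assumptions, not tuned recommendations.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus, invented purely for illustration
sentences = [
    ["dog", "barks", "loudly"],
    ["cat", "meows", "softly"],
    ["dog", "chases", "cat"],
    ["fox", "jumps", "over", "dog"],
]

# Train a small skip-gram model (sg=1); the values below are illustrative, not tuned
w2v_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(w2v_model.wv["dog"][:5])                   # first 5 dimensions of the 'dog' vector
print(w2v_model.wv.most_similar("dog", topn=3))  # nearest tokens in this toy corpus
```

On a corpus this small the similarities are not meaningful; the point is the API shape: train on tokenized sentences, then query vectors and nearest neighbors through `model.wv`.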
A typical Text Mining workflow with Gensim involves:
1. Text Preprocessing: Cleaning and transforming raw text. This includes tokenization (breaking text into words), removing stop words (common words like 'the', 'is'), and often stemming (reducing words to their root form) or lemmatization (reducing words to their base dictionary form).
2. Creating a Dictionary: Mapping preprocessed words to unique numerical IDs using `gensim.corpora.Dictionary`.
3. Creating a Corpus: Converting the preprocessed text into a bag-of-words (BoW) format, where each document is represented as a list of (word_id, word_frequency) tuples. This is done using the created dictionary (for collections too large for memory, see the streaming sketch after this list).
4. Training a Model: Applying a topic model (e.g., LDA) to the BoW corpus to discover underlying topics.
5. Interpreting Results: Analyzing the generated topics and their associated keywords to gain insights into the document collection.
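
Step 3 builds the corpus as an in-memory list, which is fine for small collections. Because Gensim only needs an iterable that yields one bag-of-words document at a time, large collections can instead be streamed from disk. The sketch below illustrates the idea under a few assumptions: the `BowCorpusStream` class name and the `documents.txt` path are invented, and `preprocess` and `dictionary` refer to the objects built in the example code that follows.

```python
class BowCorpusStream:
    """Yields one bag-of-words document at a time, so the full corpus never sits in memory."""

    def __init__(self, path, dictionary):
        self.path = path              # hypothetical text file with one document per line
        self.dictionary = dictionary  # a previously built gensim.corpora.Dictionary

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield self.dictionary.doc2bow(preprocess(line))

# Hypothetical usage: stream documents straight into model training
# streaming_corpus = BowCorpusStream("documents.txt", dictionary)
# lda_model = gensim.models.LdaMulticore(corpus=streaming_corpus, id2word=dictionary, num_topics=2)
```

A class is used rather than a generator function because Gensim may iterate over the corpus several times (once per pass), and a class-based iterable can be restarted.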
Example Code

```python
import gensim
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np

# Ensure NLTK data is available if running for the first time
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')  # Open Multilingual Wordnet, needed by the lemmatizer

# --- 1. Text Preprocessing ---
def lemmatize_stemming(text):
    return SnowballStemmer('english').stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:  # filter out stop words and short words
            result.append(lemmatize_stemming(token))
    return result

# Sample documents
documents = [
    "The quick brown fox jumps over the lazy dog. The fox is known for its agility.",
    "Machine learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to 'learn' from data.",
    "Natural language processing (NLP) is a subfield of artificial intelligence, computer science, and linguistics concerned with the interactions between computers and human (natural) languages.",
    "A dog barks at night. The owner takes care of the dog. Dogs are loyal pets.",
    "Python is a popular programming language often used for data science and machine learning tasks.",
    "Data science involves extracting knowledge and insights from structured and unstructured data."
]

processed_docs = [preprocess(doc) for doc in documents]
print("Processed Documents Example:", processed_docs[0])

# --- 2. Creating a Dictionary ---
dictionary = corpora.Dictionary(processed_docs)
# Filter out tokens that appear in fewer than 2 documents or in more than 50% of documents
dictionary.filter_extremes(no_below=2, no_above=0.5)
print("\nDictionary Example (first 5):", list(dictionary.items())[:5])

# --- 3. Creating a Corpus (Bag-of-Words) ---
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
print("\nCorpus Example (first document):", corpus[0])

# --- 4. Training an LDA Model ---
num_topics = 2  # try to find 2 topics in this small collection
lda_model = gensim.models.LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    random_state=100,
    chunksize=100,
    passes=10,
    per_word_topics=True
)

# --- 5. Interpreting Results ---
print("\n--- LDA Topics ---")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

# Example: get the topic distribution for the first document
print(f"\nTopic distribution for document 1: {documents[0]}")
doc_topics = lda_model.get_document_topics(corpus[0])
print(doc_topics)  # (topic_id, probability) tuples

# You can also get the most probable topic for a document
if doc_topics:
    most_probable_topic = max(doc_topics, key=lambda item: item[1])
    print(f"Most probable topic for document 1: Topic {most_probable_topic[0]} with probability {most_probable_topic[1]:.2f}")
```
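
Once trained, the model can also score text it has never seen by pushing the new text through the same preprocessing and dictionary. The snippet below is a small follow-up sketch that reuses `preprocess`, `dictionary`, and `lda_model` from the example above; the `new_doc` string is an invented example.

```python
# Score an unseen document with the trained model (reuses objects defined above)
new_doc = "Artificial intelligence systems learn patterns from large amounts of data."  # invented example
new_bow = dictionary.doc2bow(preprocess(new_doc))
print(lda_model.get_document_topics(new_bow))  # (topic_id, probability) tuples for the new text
```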