spaCy is an open-source software library for advanced Natural Language Processing (NLP) in Python. It's designed specifically for production use, offering a non-destructive tokenization pipeline and highly optimized components written in Cython.
Key features and capabilities of spaCy include:
1. Tokenization: Breaking down text into individual words, punctuation, and other linguistic units (tokens).
2. Part-of-Speech (PoS) Tagging: Assigning grammatical categories (like noun, verb, adjective) to each token.
3. Named Entity Recognition (NER): Identifying named entities in text and classifying them into predefined categories such as persons, organizations, locations, and dates.
4. Dependency Parsing: Analyzing the grammatical structure of sentences by establishing relationships between words.
5. Lemmatization: Reducing words to their base or dictionary form (lemma) to normalize text.
6. Sentence Segmentation: Dividing text into individual sentences.
7. Text Classification: Categorizing text into predefined classes.
8. Word Vectors/Embeddings: Representing words as numerical vectors to capture semantic relationships.
9. Rule-based Matching: Allowing users to define custom rules for finding patterns in text.
10. Pre-trained Models: spaCy provides trained statistical models for various languages, which include support for PoS tagging, dependency parsing, NER, and word vectors.
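As a quick illustration of rule-based matching (feature 9), the sketch below uses spaCy's Matcher on a blank English pipeline, so no trained model download is required. The pattern name and example text are invented for demonstration:

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline provides tokenization only -- no model download needed.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Pattern: a token whose lowercase form is "hello", followed by "world".
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
matcher.add("GREETING", [pattern])

doc = nlp("Hello world! hello WORLD again.")
matches = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matches)
```

Because the pattern matches on the LOWER attribute, it finds both casings of the phrase.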
Unlike other NLP libraries that might offer a broad range of algorithms, spaCy focuses on providing the most efficient and practical implementations for common NLP tasks, making it a popular choice for building real-world NLP applications and pipelines. It is known for its speed, ease of use, and robust integration with modern deep learning frameworks.
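The pipeline idea is central to spaCy: after tokenization, a Doc object flows through an ordered list of components. A minimal sketch, using only the built-in sentencizer so no model download is needed:

```python
import spacy

# Start from a blank English pipeline and add one rule-based component.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# pipe_names lists the components applied in order after tokenization.
print(nlp.pipe_names)
```

Trained models such as en_core_web_sm ship with a longer pipeline (tagger, parser, NER, and more) configured the same way.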
Example Code
# First, install spaCy and download a language model (e.g., "en_core_web_sm"):
#   pip install spacy
#   python -m spacy download en_core_web_sm

import spacy

# Load the small English language model
nlp = spacy.load("en_core_web_sm")

text = "Apple is looking at buying U.K. startup for $1 billion. John Doe, CEO of ExampleCorp, will attend the conference."

# Process the text with the NLP pipeline
doc = nlp(text)

print("\n--- Tokenization and Part-of-Speech Tagging ---")
for token in doc:
    # str() keeps True/False readable; aligning the bool directly would print 1/0
    print(f"{token.text:<15} {token.pos_:<10} {token.lemma_:<15} {str(token.is_stop):<5}")

print("\n--- Named Entity Recognition ---")
for ent in doc.ents:
    print(f"{ent.text:<25} {ent.label_:<15} {spacy.explain(ent.label_)}")

print("\n--- Dependency Parsing ---")
for token in doc:
    print(f"{token.text:<15} {token.dep_:<15} {token.head.text:<15} {spacy.explain(token.dep_)}")

print("\n--- Sentence Segmentation ---")
for sent in doc.sents:
    print(sent.text)
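If downloading a trained model is not an option, sentence segmentation can still be sketched with the lightweight rule-based sentencizer on a blank pipeline (the example text here is invented):

```python
import spacy

# Blank pipeline + sentencizer: splits on sentence-final punctuation
# without requiring a trained statistical model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("spaCy is fast. It is also easy to use. Try it!")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

The trained models instead use the dependency parser (or a dedicated senter component) for segmentation, which is more robust on abbreviations and unusual punctuation.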