Processing Natural Language with Python and Friends

Python is a common choice for data science work, and its string-handling strengths make it especially useful for working with natural language. While the nltk library opened this field up to Python users, the newer spaCy improves processing speed by implementing its core in Cython. Benchmarks show its strength in production settings compared with more traditional approaches such as Stanford’s CoreNLP. This post outlines examples from the spaCy course material and documentation; it also uses nltk to provide datasets. Additional examples come from:

This tutorial introduces the basics of working with natural language in Python, covering the following topics:

  • Extract linguistic features: part-of-speech tags, dependencies, named entities
  • Work with pre-trained statistical models
  • Find words and phrases using Matcher and PhraseMatcher match rules
  • Best practices for working with the data structures Doc, Token, Span, Vocab, Lexeme
  • Find semantic similarities using word vectors
  • Write custom pipeline components with extension attributes

Configure Environment

Ensure that spaCy is installed (e.g. pip install spacy). Language models are also necessary:

$ python -m spacy download en_core_web_sm
# or: pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz --no-deps
$ python -m spacy validate
$ python -m spacy download en_core_web_lg --force
import numpy as np
from spacy.lang.en import English

nlp = English()
import nltk
# nltk.download('gutenberg')  # may be needed once to fetch the corpus
print( nltk.corpus.gutenberg.fileids())
emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
emma = emma.replace('\n',' ')
docEmma = nlp(emma)

Finding words, phrases, names and concepts

Documents, spans, and tokens

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)
I
# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)
tree kangaroos
tree kangaroos and narwhals

Lexical attributes

# Process the text
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are.")

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == "%":
            print("Percentage found:", token.text)
Percentage found: 60
Percentage found: 4

Context-specific linguistic attributes (using models)

The model provides the binary weights that enable spaCy to make predictions. It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline. All models include a meta.json that defines the language to initialize, the pipeline component names to load as well as general meta information like the model name, version, license, data sources, author and accuracy figures (if available). Model packages include a strings.json that stores the entries in the model’s vocabulary and the mapping to hashes. This allows spaCy to only communicate in hashes and look up the corresponding string if needed.
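To inspect this metadata from Python, a loaded model exposes it via nlp.meta. A minimal sketch (the exact keys vary by model package and version):

import spacy

nlp = spacy.load('en_core_web_sm')
meta = nlp.meta
print( meta['lang'], meta['name'], meta.get('version') )
print( meta.get('pipeline') )      # component names declared by the package
print( meta.get('accuracy', {}) )  # accuracy figures, if the package ships them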

The en_core_web_lg (788 MB) compared to en_core_web_sm (10 MB):

  • LAS: 90.07% vs 89.66%
  • POS: 96.98% vs 96.78%
  • UAS: 91.83% vs 91.53%
  • NER F-score: 86.62% vs 85.86%
  • NER precision: 87.03% vs 86.33%
  • NER recall: 86.20% vs 85.39%

All that while en_core_web_lg is roughly 79 times larger and therefore loads much more slowly.

In spaCy, attributes that return strings usually end with an underscore (pos_) – attributes without the underscore return an ID.

  • The dep_ attribute returns the predicted dependency label.
  • The head attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.
  • The doc.ents property lets you access the named entities predicted by the model.
#model package
#$ python -m spacy download en_core_web_sm
#load models
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("She ate the pizza")
#iterate over the tokens
for token in doc:
    #print the text and the predicted part-of-speech tag
    print(token.i, token.text, token.pos_)
0 She PRON
1 ate VERB
2 the DET
3 pizza NOUN
#syntactic dependencies
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate
#process a text
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
Apple ORG
U.K. GPE
$1 billion MONEY
for tok in doc:
    print( tok.text, tok.ent_type_, end=" ")
Apple ORG is  looking  at  buying  U.K. GPE startup  for  $ MONEY 1 MONEY billion MONEY 
#common tags and labels
print( spacy.explain('GPE') )
print( spacy.explain('NNP') )
print( spacy.explain('dobj') )
Countries, cities, states
noun, proper singular
direct object
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc1 = nlp("This is a sentence.")
doc2 = nlp("This is another sentence.")
displacy.render([doc1, doc2], style="dep")
[displaCy dependency visualizations rendered for the two sentences]
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

options = {"compact": True, "bg": "#09a3d5","color": "white", "font": "Source Sans Pro"}

text = """In ancient Rome, some neighbors live in three adjacent houses. In the center is the house of Senex, who lives there with wife Domina, son Hero, and several slaves, including head slave Hysterium and the musical's main character Pseudolus."""
doc = nlp(text)
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, style="dep", options=options, jupyter=True)
[displaCy dependency visualization rendered for each sentence of the text]
import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style="ent")
When Sebastian Thrun PERSON started working on self-driving cars at Google ORG in 2007 DATE , few people outside of the company took him seriously.
colors = {"ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}
options = {"ents": ["ORG"], "colors": colors}

displacy.render(doc, style="ent", options=options)
When Sebastian Thrun started working on self-driving cars at Google ORG in 2007, few people outside of the company took him seriously.
displaCy can also render manually prepared data ("manual" mode), where the words/arcs or entities are supplied as plain dictionaries:

{
    "words": [
        {"text": "This", "tag": "DT"},
        {"text": "is", "tag": "VBZ"},
        {"text": "a", "tag": "DT"},
        {"text": "sentence", "tag": "NN"}
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "det", "dir": "left"},
        {"start": 1, "end": 3, "label": "attr", "dir": "right"}
    ]
}

ex = [{"text": "But Google is starting from behind.",
       "ents": [{"start": 4, "end": 10, "label": "ORG"}],
       "title": None}]
html = displacy.render(ex, style="ent", manual=True)
But Google ORG is starting from behind.

Rule-based matching

  • Match exact token texts: [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
  • Match lexical attributes: [{'LOWER': 'iphone'}, {'LOWER': 'x'}]
  • Match any token attributes: [{'LEMMA': 'buy'}, {'POS': 'NOUN'}]
# Import the Matcher
from spacy.matcher import Matcher

text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

#match_id: hash value of the pattern name
#start: start index of matched span
#end: end index of matched span

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
iPhone X
# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]
# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": ____}, {"POS": ____}, {"POS": ____, "OP": ____}]

Large-scale data analysis with spaCy

Vocab, hashes, lexeme

The Vocab stores data shared across multiple documents. A Doc contains the words in context, together with their part-of-speech tags and dependencies. The string store maps the hashes in the vocab back to their text.

A Lexeme object is a hash-keyed entry in the vocabulary (vocab). Lexemes hold context-independent information about a word, such as its text or whether it consists of alphabetic characters. They don’t have part-of-speech tags, dependencies or entity labels; those depend on the context.

nlp.vocab.length
498
#Hashes can't be reversed – that's why we need to provide the shared vocab
coffee_hash = nlp.vocab.strings['coffee']
print(coffee_hash)
3197928453018144401
# Raises an error if we haven't seen the string before
string = nlp.vocab.strings[3197928453018144401]
doc = nlp("I love coffee")
print('hash value:', nlp.vocab.strings['coffee'], doc.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[3197928453018144401])
hash value: 3197928453018144401 3197928453018144401
string value: coffee
#contains the context-independent information
doc = nlp("I love coffee")
lexeme = nlp.vocab['coffee']

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)
coffee 3197928453018144401 True

Doc, span, and token

Doc is created automatically when you process a text with the nlp object. But you can also instantiate the class manually. It takes three arguments: the shared vocab, the words and the spaces.

A Span is a slice of a Doc consisting of one or more tokens. The Span takes at least three arguments: the doc it refers to, and the start and end index of the span (with end index exclusive).

Doc and Span are very powerful types that hold references to, and relationships between, words and sentences:

  • Convert results to strings as late as possible (see the short sketch after this list).
  • Use token attributes if available – for example, token.i for the token index. This lets you reuse them in spaCy.
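A small illustration of the first point, using a hypothetical filtering task: keep Token objects while working, and convert to strings only when printing.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Berlin looks like a nice city")

# keep Token objects while filtering; convert to strings only at the end
proper_nouns = [token for token in doc if token.pos_ == 'PROPN']
print([token.text for token in proper_nouns])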
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
doc
Hello world!
# Import the Span class
from spacy.tokens import Span

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

# Add span to the doc.ents
doc.ents = [span_with_label]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])
[('Hello world', 'GREETING')]

Similarity and vectors

In order to use similarity, you need a larger spaCy model that has word vectors included (en_core_web_lg, en_core_web_md – but not _sm). That is because similarity is determined using word vectors. Word vectors are generated using an algorithm like Word2Vec and lots of text. The default distance is cosine similarity, but can be adjusted.

For a more in-depth look, the source code for .similarity shows:

return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)
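As a sanity check, the cosine can be recomputed by hand with numpy and compared with the value returned by .similarity (assumes a model with vectors, such as en_core_web_lg, is loaded):

import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")

# manual cosine similarity should match Doc.similarity
manual = np.dot(doc1.vector, doc2.vector) / (doc1.vector_norm * doc2.vector_norm)
print( manual, doc1.similarity(doc2) )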

# Load a larger model with vectors
nlp = spacy.load('en_core_web_lg')

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))
0.8627203210548107
0.7369546
nlp = spacy.load('en_core_web_sm')
doc1 = nlp("I")
print( doc1.vector.shape )
(96,)
nlp = spacy.load('en_core_web_lg')
doc1 = nlp("I")
print( doc1.vector.shape )
(300,)
nlp = spacy.load('en_core_web_sm')

doc1 = nlp("I")
doc2 = nlp("like")
doc3 = nlp("I like")
doc4 = nlp("I like pizza")

print( doc1.vector.shape, ' ', doc1.vector_norm )
print( doc2.vector.shape, ' ', doc2.vector_norm )
print( doc3.vector.shape, ' ', doc3.vector_norm )
print( doc4.vector.shape, ' ', doc4.vector_norm )
(96,)   23.1725315055188
(96,)   21.75560300132138
(96,)   17.23478412191207
(96,)   14.848700829346688
doc1[0].vector == doc3[0].vector
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False])
print( np.dot(doc1.vector, doc2.vector) )
print( np.linalg.norm(doc2.vector, ord=2) )
print( np.linalg.norm(doc1.vector, ord=2) == np.linalg.norm(doc2.vector, ord=2) )
print( doc2.vector_norm )
21.871986
4.6767497
True
4.676749731219555
doc1 = nlp("pizza like I")
doc2 = nlp("I like pizza")
doc3 = nlp("pizza I like")
doc4 = nlp("like pizza I")
doc5 = nlp("I pizza like")
doc6 = nlp("like I pizza")

lDoc = [doc1, doc2, doc3, doc4, doc5, doc6]
result = np.zeros((6,6))
for i,doc1 in enumerate(lDoc):
    for j, doc2 in enumerate(lDoc):
        result[i,j] = doc1.similarity(doc2)
result
array([[1.        , 0.99999994, 0.99999994, 0.99999995, 0.99999994,
        0.99999994],
       [0.99999994, 1.        , 0.99999993, 0.99999994, 0.99999993,
        0.99999992],
       [0.99999994, 0.99999993, 1.        , 0.99999994, 0.99999993,
        0.99999993],
       [0.99999995, 0.99999994, 0.99999994, 1.        , 0.99999994,
        0.99999994],
       [0.99999994, 0.99999993, 0.99999993, 0.99999994, 1.        ,
        0.99999993],
       [0.99999994, 0.99999992, 0.99999993, 0.99999994, 0.99999993,
        1.        ]])
nlp = spacy.load('en_core_web_lg')

doc1 = nlp("apples oranges fruit")
print( doc1[0].vector_norm )
print( doc1[1].vector_norm )
print( doc1[2].vector_norm )

print( doc1[0].similarity(doc1[1]))
print( doc1[0].similarity(doc1[2]))
6.895898
6.949064
7.294794
0.77809423
0.72417974
doc1 = nlp("apples")
doc2 = nlp("apples apples apples apples apples apples apples apples apples apples apples apples apples apples apples apples apples apples")

print(doc1.vector_norm)
print(doc2.vector_norm)
print(doc1.similarity(doc2))
6.895897646384268
6.895897762990182
1.0000000930092277
doc1 = nlp("apples fruit")
doc2 = nlp("apples apples apples apples apples apples apples apples apples apples apples apples apples apples apples apples apples apples fruit")

print(doc1.vector_norm)
print(doc2.vector_norm)
print(doc1.similarity(doc2))
6.588359567392134
6.816139083280562
0.9383865534490474
# Load a larger model with vectors
#nlp = spacy.load('en_core_web_lg')

doc = nlp("I have a banana")
# Access the word vector via the token.vector attribute
vector = doc[3].vector
print( type(vector) )
print( vector.shape)
<class 'numpy.ndarray'>
(300,)
vector[0:10]
array([ 0.20228 , -0.076618,  0.37032 ,  0.032845, -0.41957 ,  0.072069,
       -0.37476 ,  0.05746 , -0.012401,  0.52949 ], dtype=float32)
#no universal definition for similarity 
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")

print(doc1.similarity(doc2))
0.9501447503553421

Pattern matching

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns
pattern1 = [{"LOWER": "Amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad-free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", None, pattern1)
matcher.add("PATTERN2", None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)
import json
from spacy.lang.en import English

with open("exercises/countries.json") as f:
    COUNTRIES = json.loads(f.read())

nlp = English()
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", None, *patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

Processing Pipelines

Out-of-the-box pipeline components

First, the tokenizer is applied to turn the string of text into a Doc object. Next, a series of pipeline components is applied to the Doc in order. In this case, the tagger, then the parser, then the entity recognizer. Finally, the processed Doc is returned, so you can work with it.

Name     Description               Creates
tagger   Part-of-speech tagger     Token.tag
parser   Dependency parser         Token.dep, Token.head, Doc.sents, Doc.noun_chunks
ner      Named entity recognizer   Doc.ents, Token.ent_iob, Token.ent_type
textcat  Text classifier           Doc.cats

Because text categories are always very specific, the text classifier is not included in any of the pre-trained models by default. But you can use it to train your own system.
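Conceptually, calling nlp(text) runs the tokenizer and then each pipeline component in order. A rough sketch of what happens under the hood (spaCy v2 API, as used throughout this post):

import spacy

nlp = spacy.load('en_core_web_sm')
text = "This is a sentence."

# roughly what nlp(text) does internally
doc = nlp.make_doc(text)              # tokenizer: text -> Doc
for name, component in nlp.pipeline:  # tagger, parser, ner, ... in order
    doc = component(doc)

print([token.pos_ for token in doc])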

#initialize the language, add the pipeline and load in the binary model weights
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
#list of pipeline component names
print(nlp.pipe_names)
['tagger', 'parser', 'ner']
#list of (name, component) tuples
print(nlp.pipeline)
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7f4d64e3ada0>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7f4d4ab97108>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7f4d4ab97228>)]

Custom pipeline components

Custom components are executed automatically when you call the nlp object on a text. They’re especially useful for adding your own custom metadata to documents and tokens. You can also use them to update built-in attributes, like the named entity spans.

A custom component:

  • takes a doc, modifies it and returns it
  • can be added using the nlp.add_pipe method

Custom components can only modify the Doc and can’t be used to update weights of other components directly.

def custom_component(doc):
    # Do something to the doc here
    return doc

# last/first/before/after control where the component is inserted
nlp.add_pipe(custom_component, last=True)   # or first=True, before='ner', after='ner'
#simple component

# Create the nlp object
nlp = spacy.load('en_core_web_sm')

# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print('Doc length:', len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Print the pipeline component names
print('Pipeline:', nlp.pipe_names)

# Process a text
doc = nlp("Hello world!")
Pipeline: ['custom_component', 'tagger', 'parser', 'ner']
Doc length: 3
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)

# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc

# Add the component to the pipeline after the 'ner' component
nlp.add_pipe(animal_component, after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])
animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tagger', 'parser', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]

Custom extension attributes

Custom attributes:

  • add custom metadata to documents, tokens and spans
  • are accessible via the ._ property, to distinguish them from built-in attributes
  • are registered on the global Doc, Token or Span using the set_extension method

There are three kinds of extensions:

  • Attribute extensions
  • Property extensions
  • Method extensions

doc._.title = 'My document'
token._.is_color = True
span._.has_color = False
# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)
#Attribute extension
from spacy.tokens import Token

# Set extension on the Token with default value
Token.set_extension('is_color', default=False)
doc = nlp("The sky is blue.")
# Overwrite extension attribute value
doc[3]._.is_color = True
#Property (getter/setter) extension: Token
from spacy.tokens import Token

# Define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension('is_color', getter=get_is_color, force=True)

doc = nlp("The sky is blue.")
print(doc[3]._.is_color, '-', doc[3].text)
True - blue
#Property (getter/setter) extension: Span
from spacy.tokens import Span

# Define getter function
def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension('has_color', getter=get_has_color)

doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)
True - sky is blue
False - The sky
#Method (pass an argument) extension
from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension('has_token', method=has_token)

doc = nlp("The sky is blue.")
print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')
True - blue
False - cloud

Scaling and performance

Streaming

  • Use nlp.pipe method
  • Processes texts as a stream, yields Doc objects
  • Much faster than calling nlp on each text
#bad
%timeit docs = [nlp(text) for text in LOTS_OF_TEXTS]

#good
%timeit docs = list(nlp.pipe(LOTS_OF_TEXTS))
#this idiom is useful for associating metadata with the doc
from spacy.tokens import Doc

Doc.set_extension('id', default=None)
Doc.set_extension('page_number', default=None)

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']

Pipeline configuration

Only run the models you need.

#slow
doc = nlp("Hello world")

#fast - only runs tokenizer, not all models
doc = nlp.make_doc("Hello world!")
#disable tagger and parser
#restores them after the with block
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.ents)

Training a neural network model

A list of available English models can be found in the spaCy documentation.

spaCy supports updating existing models with more examples, and training new models.

  • Update an existing model: a few hundred to a few thousand examples
  • Train a new category: a few thousand to a million examples; spaCy’s English models: 2 million words

This is essential for text classification, very useful for entity recognition and a little less critical for tagging and parsing.

Creating training data

Because the entity recognizer predicts entities in context, it also needs to be trained on entities together with their surrounding context.

Use Matcher to quickly create training data for NER models.

  • Create a doc object for each text using nlp.pipe.
  • Match on the doc and create a list of matched spans.
  • Get (start character, end character, label) tuples of matched spans.
  • Format each example as a tuple of the text and a dict, mapping ’entities’ to the entity tuples.
  • Append the example to TRAINING_DATA and inspect the printed data.
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

TEXTS = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

nlp = English()
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]

# Add patterns to the matcher
matcher.add("GADGET", None, pattern1, pattern2)
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("exercises/iphone.json") as f:
    TEXTS = json.loads(f.read())

nlp = English()
matcher = Matcher(nlp.vocab)
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]
matcher.add("GADGET", None, pattern1, pattern2)

TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")

Training the model

  • Loop for a number of times.
  • Shuffle the training data.
  • Divide the data into batches.
  • Update the model for each batch.
  • Save the updated model.
import spacy
import random
import json

with open("exercises/gadgets.json") as f:
    TRAINING_DATA = json.loads(f.read())

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("GADGET")

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model
        nlp.update(texts, annotations, losses=losses)
        print(losses)

Problems with updating:

  • if you don’t provide examples of the original labels, the model can ‘forget’ them by adjusting too much to the new data (catastrophic forgetting)
  • the label scheme needs to be consistent and not too specific; for example, CLOTHING works better than ADULT_CLOTHING and CHILDRENS_CLOTHING

You can create those additional examples by running the existing model over data and extracting the entity spans you care about. You can then mix those examples in with your existing data and update the model with annotations of all labels.
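A minimal sketch of that idea, assuming a hypothetical list of domain texts and that we want to preserve the model’s existing ORG and GPE predictions:

import spacy

nlp = spacy.load('en_core_web_sm')

# hypothetical raw texts from the target domain
texts = ["Apple is opening a new office in London.", "I bought a new iPhone X."]

EXTRA_DATA = []
for doc in nlp.pipe(texts):
    # keep the model's own predictions for labels we don't want it to forget
    entities = [(ent.start_char, ent.end_char, ent.label_)
                for ent in doc.ents if ent.label_ in ('ORG', 'GPE')]
    EXTRA_DATA.append((doc.text, {'entities': entities}))

# mix EXTRA_DATA into TRAINING_DATA before calling nlp.update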

If the decision is difficult to make based on the context, the model can struggle to learn it. The label scheme also needs to be consistent and not too specific. You can always add a rule-based system later to go from generic to specific.
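A rough sketch of that last point, with hypothetical label names and keyword lists: the statistical model predicts a generic CLOTHING label, and a rule-based pass afterwards maps it to something more specific.

from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp("The store sells a blue romper for toddlers")

# pretend the statistical model predicted a generic CLOTHING entity
doc.ents = [Span(doc, 4, 6, label='CLOTHING')]   # "blue romper"

CHILD_TERMS = {'onesie', 'romper'}               # hypothetical keyword list

def refine_label(ent):
    # rule-based pass from generic to specific
    if ent.label_ == 'CLOTHING' and any(token.lower_ in CHILD_TERMS for token in ent):
        return 'CHILDRENS_CLOTHING'
    return ent.label_

print([(ent.text, refine_label(ent)) for ent in doc.ents])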

Applications

Working in a customer environment

Serialization refers to the process of converting an object in memory to a byte stream that can be stored on disk or sent over a network. spaCy’s serialization guide provides the latest recommendations.

The DocBin class lets you efficiently serialize the information from a collection of Doc objects. You can control which information is serialized by passing a list of attribute IDs, and optionally also specify whether the user data is serialized. The DocBin is faster and produces smaller data sizes than pickle, and allows you to deserialize without executing arbitrary Python code.

Typical serialization

Note: the NameError tracebacks below come from running these cells in a fresh session, where spacy had not yet been imported and the earlier objects were not defined; the pickle files saved in a previous session are loaded instead.

# Load a larger model with vectors
nlp = spacy.load('en_core_web_lg')

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-a430716dafad> in <module>()
      1 # Load a larger model with vectors
----> 2 nlp = spacy.load('en_core_web_lg')
      3 
      4 # Compare two documents
      5 doc1 = nlp("I like fast food")
NameError: name 'spacy' is not defined
! ls ../tmp;
spacy-pizza.bz	spacy-pizza.mdl
import pickle

data_dict = {'doc1':doc1,'nlp':nlp}
filename = '../tmp/spacy-pizza.mdl'
outfile = open(filename,'wb')

pickle.dump(data_dict, outfile)
outfile.close()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-2-9c7a736efa99> in <module>()
      1 import pickle
      2 
----> 3 data_dict = {'doc1':doc1,'nlp':nlp}
      4 filename = '../tmp/spacy-pizza.mdl'
      5 outfile = open(filename,'wb')
NameError: name 'doc1' is not defined
infile = open(filename,'rb')
new_dict = pickle.load(infile)
infile.close()
new_dict['doc1'].similarity(doc2)
/opt/conda/envs/beakerx/lib/python3.6/runpy.py:193: ModelsWarning: [W007] The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.
  "__main__", mod_spec)
0.953127971438058

Serialization with compression

import bz2
import pickle

filename = '../tmp/spacy-pizza.bz'
#sfile = bz2.BZ2File('smallerfile', 'w')
#pickle.dump(data_dict, sfile)
#outfile.close()


outfile = bz2.BZ2File(filename, 'wb')
pickle.dump(data_dict, outfile, protocol=2)
outfile.close()
! ls ../tmp
spacy-pizza.bz	spacy-pizza.mdl
infile = bz2.BZ2File(filename, 'rb')
myobj = pickle.load(infile)
infile.close()
myobj
{'doc1': This is a sentence., 'nlp': <spacy.lang.en.English at 0x7f4d606145c0>}
myobj['doc1'].similarity(doc2)
/opt/conda/envs/beakerx/lib/python3.6/runpy.py:193: ModelsWarning: [W007] The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.
  "__main__", mod_spec)
0.953127971438058

spaCy methods for encapsulating texts

#fast method
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
doc = nlp(text)

# All strings mapped to integers, for easy export to numpy
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
np_array = doc.to_array("POS")

import pickle
serialized = pickle.dumps(np_array, protocol=0) # protocol 0 is printable ASCII
deserialized_array = pickle.loads(serialized)
#comprehensive approach
import spacy
from spacy.tokens import DocBin

doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
texts = ["Some text", "Lots of texts...", "..."]
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
bytes_data = doc_bin.to_bytes()    #.to_disk("/path")
# Deserialize later, e.g. in a new process
nlp = spacy.blank("en")
doc_bin = DocBin().from_bytes(bytes_data)    #.from_disk("/path")
docs = list(doc_bin.get_docs(nlp.vocab))

# with store_user_data=True, custom extension data is restored as well
from spacy.tokens import Doc

Doc.set_extension("my_custom_attr", default=None)
print([doc._.my_custom_attr for doc in docs])

Pickle Doc to include dependencies

When pickling spaCy’s objects like the Doc or the EntityRecognizer, keep in mind that they all require the shared Vocab (which includes the string to hash mappings, label schemes and optional vectors). This means that their pickled representations can become very large, especially if you have word vectors loaded, because it won’t only include the object itself, but also the entire shared vocab it depends on.

doc = nlp("This is a sentence.")
assert len(nlp.vocab) > 0
data = {"doc":doc, "nlp":nlp}