I'll be speaking about all of this at the O'Reilly AI Superstream: NLP in Production on May 11, 11AM EST. This, along with other talks, is about understanding the most common obstacles for getting NLP models into production and learning how to overcome them. Sign in, or sign up for your free trial: https://oreil.ly/Xx8An
This Document Match Math Machine write-up, and all the code that accompanies it, can be found in its DagsHub repo. DagsHub, much like GitHub, is focused specifically on meeting the needs of data scientists and machine learning engineers. I hope you will sign up for a free account. DagsHub does NOT pay me to promote them; I just approve of their mission. And, yes, I still love GitHub, GitLab, etc. They each have their own great purposes.
Truly, this could also be called a text match math machine. It can be used for many processes where finding the closest matching text from one source to another helps identify similarities across large collections of similar source materials. What's most important to appreciate here are the concepts behind how it can be used; I'll point these out as we work through the code. In short, this math machine approach can be applied to many "find the best text match" activities in language processing.
Please note the following high-level principles.
The history of this work is as follows:
Let's cover at least the basic math to get a feel for how it works to our benefit.
The cosine of the angle between two vectors is their dot product divided by the product of their magnitudes:
$$ cos(\theta) = \frac{\vec{x} \cdot \vec{y}} {\lVert \vec{x} \rVert \lVert \vec{y} \rVert} $$
where
$$ \vec{x} \cdot \vec{y} = x_1 y_1 + x_2 y_2 + \dots + x_n y_n $$
and
$$ \lVert \vec{x} \rVert = \sqrt{ \vec{x} \cdot \vec{x} } = \sqrt{ x_1^2 + x_2^2 + \dots + x_n^2 } $$
The closer the two vectors are to one another, the closer $ cos(\theta) $ is to 1.0!
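To make the formula concrete, here is a tiny NumPy sketch (not part of the repo code) that computes $ cos(\theta) $ for a couple of small vectors:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # points in exactly the same direction as x
z = np.array([3.0, 0.0, 0.0])   # points in a rather different direction

# cos(theta) = (x . y) / (||x|| * ||y||)
cos_xy = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
cos_xz = np.dot(x, z) / (np.linalg.norm(x) * np.linalg.norm(z))
print(round(cos_xy, 6))   # 1.0      -- the directions match exactly
print(round(cos_xz, 6))   # 0.267261 -- a much weaker match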
See the file in this repo named Fake_Document_Creation.py to see how the fake documents were generated. The code in that file is repeated in the next three code cells.
from faker import Faker

# Seed the generator so the fake documents are reproducible
Faker.seed(0)
fake = Faker()

# Write 100 fake documents of roughly 20 sentences each
for i in range(100):
    a_doc = fake.paragraph(nb_sentences=20)
    file_name = f'./documents/doc_{i}.txt'
    with open(file_name, 'w') as f:
        f.write(a_doc)
The 100 documents created by that code are already in the documents subdirectory of this repo. However, I encourage you to run the code again with different parameters. It may be best to create a small set of small files so that you can more easily inspect the various components of the math machine.
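For example, something like the following sketch (the ./small_documents directory and these parameter values are hypothetical choices, not part of the repo) writes five three-sentence files that are easy to inspect by eye:

import os
from faker import Faker

Faker.seed(42)
fake = Faker()

os.makedirs('./small_documents', exist_ok=True)   # hypothetical scratch directory
for i in range(5):
    a_doc = fake.paragraph(nb_sentences=3)        # much shorter documents
    with open(f'./small_documents/doc_{i}.txt', 'w') as f:
        f.write(a_doc)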
Python's Natural Language Toolkit (NLTK) is great. However, I rarely work with what would be called "natural" language. I've learned through R&D (trial and error) that my own tokenizers work best on the technical documents that I tend to process. Writing my own tokenizers allows me to get exactly what I need. The code blocks below make up my tokenizer, which is easy to extend when needed, and whose methods I can rearrange into any order a special case might require.
When this class, `Micro_Tokenizer`, is instantiated, you pass in either a file name or text, not both. Once it has its `text` attribute loaded, it runs the default routines:

- `__whitespace_tokenize_text__`
- `__space_out_punc_marks_in_tokens__`
- `__segregate_character_types__`

The value of running the methods in this order will become apparent as we walk through them. Note that all methods have double underscores before and after their names; this indicates that they are meant to be private. I intended for them to be called only from the `__init__` method. This behavior can, of course, be changed if you wish to build on this code.
class Micro_Tokenizer:
    def __init__(self, file_name='', text=''):
        # Accept either a file name or raw text, not both
        if file_name:
            self.file_name = file_name
            self.__load_file_data__()
            self.text = self.Text.lower()
        elif text:
            self.text = text.lower()
        # Default tokenization pipeline
        self.__whitespace_tokenize_text__()
        self.__space_out_punc_marks_in_tokens__()
        self.__segregate_character_types__()
The `__retokenize__` method simply makes sure that each token is a separate string. If a token is a list, that list is maintained and appended to the new retokenized list; this is a special feature for grouping items that belong together for specific needs, and it is NOT likely to be encountered in this simple example. If a token is a string with spaces, the items in it have been separated by another routine, and they now become separate tokens. Please note that `token.split()` creates a list, and adding two lists simply appends the items of the second list to the end of the first.
    def __retokenize__(self):
        new_tokens = []
        for token in self.tokens:
            if type(token) is list:
                # Keep grouped items together as a single element
                new_tokens.append(token)
            elif type(token) is str:
                # split() returns a list, so += extends new_tokens with its items
                new_tokens += token.split()
        self.tokens = new_tokens
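To see what `__retokenize__` does in isolation, here is a tiny standalone sketch of the same loop run over a made-up token list (including the grouped-list case, even though it rarely comes up in this example):

tokens = ['hello', 'world .', ['grouped', 'items'], 'a b c']
new_tokens = []
for token in tokens:
    if type(token) is list:
        new_tokens.append(token)        # grouped items stay together as one element
    elif type(token) is str:
        new_tokens += token.split()     # strings with spaces become separate tokens
print(new_tokens)
# ['hello', 'world', '.', ['grouped', 'items'], 'a', 'b', 'c']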
The `__load_file_data__` method robustly acquires the text from the file name that has been passed in.
    def __load_file_data__(self):
        try:
            # Try a strict UTF-8 read first
            with open(self.file_name, 'r', encoding='utf8') as f:
                self.Text = f.read()
        except UnicodeDecodeError:
            # Fall back to ignoring undecodable bytes
            with open(self.file_name, 'r', encoding='utf8',
                      errors='ignore') as f:
                self.Text = f.read()
The `__whitespace_tokenize_text__` method simply uses all whitespace to initially separate the text into individual tokens. This will not treat punctuation marks separately, though, so we will need a method to separate punctuation marks from alphabetic characters.
    def __whitespace_tokenize_text__(self):
        # str.split() with no arguments splits on any run of whitespace
        self.tokens = self.text.split()
And the `__space_out_punc_marks_in_tokens__` method does put spaces between punctuation marks, and between punctuation marks and alphabetic characters. Notice the call to the `__retokenize__` method at the end, which converts the strings with spaces into separate tokens.
    def __space_out_punc_marks_in_tokens__(self):
        for t in range(len(self.tokens)):
            # Grouped-list tokens are left untouched
            if type(self.tokens[t]) is list:
                continue
            new_string = ''
            for thing in self.tokens[t]:
                # PUNCS is a module-level collection of punctuation characters
                if thing in PUNCS:
                    new_string += f' {thing} '
                else:
                    new_string += thing
            self.tokens[t] = new_string.strip()
        self.__retokenize__()
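As a quick illustration of the inner loop (assuming here only that PUNCS is some collection of punctuation characters, such as string.punctuation; its real definition is not shown in this excerpt):

import string

PUNCS = set(string.punctuation)   # assumption: any collection of punctuation characters works here

token = '(hello,world)!'
new_string = ''
for thing in token:
    if thing in PUNCS:
        new_string += f' {thing} '
    else:
        new_string += thing
print(new_string.strip())
# ( hello , world )  !   -- the extra inner spaces disappear when __retokenize__ splits on whitespace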
But wait! There's more! If we see a change in character type within a token string, we put spaces at those boundaries too, and then retokenize afterwards. This ensures that we have good separation.
    def __segregate_character_types__(self):
        for t in range(len(self.tokens)):
            element = self.tokens[t]
            new_element = ''
            # Compare each character with the next one and insert a space
            # wherever the character type changes
            for i in range(len(element) - 1):
                if element[i].isalpha() and element[i+1].isalpha():
                    new_element += element[i]
                elif element[i].isdigit() and element[i+1].isdigit():
                    new_element += element[i]
                elif element[i] in PUNCS and element[i+1] in PUNCS:
                    new_element += f'{element[i]} '
                else:
                    new_element += f'{element[i]} '
            new_element += element[-1]
            self.tokens[t] = new_element
        self.__retokenize__()
Let's run a little test. The code in the cell below sits at the bottom of the `Doc_Tokenizer.py` script file and will run if you simply call Python on the file OR run it from within your IDE.
if __name__ == "__main__":
    my_text = """
    This is a simple set of sentences.
    We are only using them to test the micro_tokenizer to see what it does.
    This is only a small simple test.
    """
    mc = Micro_Tokenizer(text=my_text)
    print(mc.tokens)
Note from the output that the token list allows repeated tokens. At this point, we do want this. The build of the vectors that make up the corpus matrix will change these repeats into a count for each token.
Output:
['this', 'is', 'a', 'simple', 'set', 'of', 'sentences', '.', 'we', 'are',
'only', 'using', 'them', 'to', 'test', 'the', 'micro', '_', 'tokenizer',
'to', 'see', 'what', 'it', 'does', '.', 'this', 'is', 'only', 'a', 'small',
'simple', 'test', '.']
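As a small preview of that build step, here is a sketch of turning the repeated tokens above into per-token counts, which is the same idea the `Corpus_Matrix` class implements below with its `tocab` dictionary:

# mc is the Micro_Tokenizer instance from the test above
token_counts = {}
for t in mc.tokens:
    token_counts[t] = token_counts.get(t, 0) + 1
print(token_counts['this'])   # 2 -- 'this' starts two of the sentences
print(token_counts['.'])      # 3 -- one per sentence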
The imports below are for all the classes in the following subsections of this section. Please note that I have a habit of importing `sys` even if I don't use it; when developing such code, I will frequently inject a `sys.exit()` to force my code to stop at a certain point when there is code below it. NOTE that many tools come from the `__General_Tools_Module__` module. The git repo for that is in my Python_Tools repo. I hope you will clone it and make good use of it. Don't hesitate to suggest improvements and additions too.
import __General_Tools_Module__ as gtm
import os
import copy
import numpy as np
import pandas as pd
import scipy.sparse as ss
import scipy.sparse.linalg as ssla
import sys
from Doc_Tokenizer import Micro_Tokenizer
Corpus_Matrix Class

The `Corpus_Matrix` class directly builds, grows, and maintains the corpus matrix for a group of documents.

The `__init__` method:
class Corpus_Matrix:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        gtm.prepare_dir(data_dir)
        # File names for the saved tokabulary and corpus matrix
        self.tocab_fn = f'{data_dir}/tocab.json'
        self.cm_fn = f'{data_dir}/cm.npz'
        tocab_exists = os.path.isfile(self.tocab_fn)
        corpus_matrix_exists = os.path.isfile(self.cm_fn)
        check = (tocab_exists and corpus_matrix_exists)
        if check:
            # Reload previously saved state
            self.tocab = gtm.load_object_from_json_file(self.tocab_fn)
            self.corpus_matrix = ss.load_npz(self.cm_fn)
        else:
            # Start with an empty tokabulary; the matrix is created
            # when the first document vector is added
            self.tocab = {}
The `add_new_doc_vector` method is the workhorse of the `Corpus_Matrix` class.

It first establishes the `previous_token_count` so it knows how much to grow the corpus matrix after processing the new set of tokens being passed in. It also makes a copy of the current tokabulary so that we can add to it.

Note the for loop. The main tokabulary is a dictionary with the tokens as keys and the values always 0. If the current token is already in the main tokabulary, it simply re-establishes that token with a value of 0; if it is not, it is appended to the end of the `tocab` dictionary. If the current token is not yet in the copy of the main tokabulary, it is added there with a value of 0, and either way the count of that token's occurrences in the current document is then incremented by 1.

The token count is checked after the for loop to see if it has grown. The change in token count equals the number of new rows of zeroes that need to be added to the bottom of the existing corpus matrix. We then collect the tokens and their counts in separate lists.

The if block is important. If the current instance of this class already has a `corpus_matrix`, we add the new zero rows to the bottom of the existing matrix, build a single-column csc matrix from the per-document token counts, and hstack that column onto the right side of the corpus matrix. If the `corpus_matrix` attribute has not yet been created, we create a csc matrix from a Pandas DataFrame where the index is the tocab_list and the values are the occurrence rates of the tokens, which is exactly like the second step for growing the corpus matrix; we are simply "creating" the matrix, first column, with the first document.

After each document alters the corpus matrix, the corpus matrix and the tokabulary are saved. Saving these is fast and ensures we do not lose our work.
    def add_new_doc_vector(self, tokens):
        previous_token_count = len(self.tocab)
        # Per-document copy of the tokabulary used to count occurrences
        this_tocab = copy.deepcopy(self.tocab)
        for t in tokens:
            self.tocab[t] = 0
            if t not in this_tocab:
                this_tocab[t] = 0
            this_tocab[t] += 1
        new_token_count = len(self.tocab)
        # Number of zero rows to add to the bottom of the matrix
        new_rows = new_token_count - previous_token_count
        tocab_list = list(self.tocab.keys())
        values = list(this_tocab.values())
        if hasattr(self, "corpus_matrix"):
            self.corpus_matrix = \
                self.__add_zeros_rows_to_bottom_of_matrix__(new_rows)
            column = ss.csc_matrix(
                pd.DataFrame(index=tocab_list, data=values))
            self.corpus_matrix = \
                ss.hstack([self.corpus_matrix, column])
        else:
            # First document: the matrix is just this one column
            self.corpus_matrix = ss.csc_matrix(
                pd.DataFrame(index=tocab_list, data=values))
        ss.save_npz(self.cm_fn, self.corpus_matrix)
        gtm.store_object_to_json_file(self.tocab, self.tocab_fn)
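As a rough usage sketch (the ./tmp_cm_data scratch directory is hypothetical, and this assumes __General_Tools_Module__ is importable as in the imports above), two tiny token lists grow the matrix like this:

cm = Corpus_Matrix('./tmp_cm_data')                    # hypothetical empty scratch directory
cm.add_new_doc_vector(['red', 'fish', 'blue', 'fish'])
print(cm.corpus_matrix.shape)                          # (3, 1): rows red, fish, blue; one column
cm.add_new_doc_vector(['one', 'fish', 'two', 'fish'])
print(cm.corpus_matrix.shape)                          # (5, 2): two new token rows, one new column
print(cm.corpus_matrix.toarray())
# [[1 0]     red
#  [2 2]     fish
#  [1 0]     blue
#  [0 1]     one
#  [0 1]]    two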
In order to appreciate the actions of the `replace_vectors_with_vectors` method, you want to remember how Python slices lists: from the first index up to, but not including, the last index. `del_cols` is a list of column indices that will be replaced by `new_vecs`. Based on the current corpus matrix and the new vectors, we determine the number of zero rows to add to the bottom. Then the updated corpus matrix becomes an hstack of the columns to the left of the deleted range, the new vectors, and the columns to the right of the deleted range.

During such an operation, we might end up with some rows that have no counts for a given token. If so, we remove those all-zero rows.
    def replace_vectors_with_vectors(self, del_cols, new_vecs):
        # Extend the matrix rows by the number of extra rows in new_vecs
        cm_rows = self.corpus_matrix.shape[0]
        new_vecs_rows = new_vecs.shape[0]
        new_rows = new_vecs_rows - cm_rows
        self.corpus_matrix = \
            self.__add_zeros_rows_to_bottom_of_matrix__(new_rows)
        # Replace the old columns with the new ones in new_vecs
        ds1 = del_cols[0]
        ds2 = del_cols[-1] + 1
        self.corpus_matrix = ss.hstack(
            [self.corpus_matrix[:, :ds1],
             new_vecs,
             self.corpus_matrix[:, ds2:]])
        self.corpus_matrix = self.remove_zero_rows(self.corpus_matrix)
Based on the descriptions above, and the simplicity of the code in the next cell for the `__add_zeros_rows_to_bottom_of_matrix__` method, I trust the code needs no explanation. Run it in small steps with a small example if you do not follow the details. The main thing to note is that a `vstack`, rather than an `hstack`, is used, because new rows are being added.
    def __add_zeros_rows_to_bottom_of_matrix__(self, num_rows):
        cols = self.corpus_matrix.shape[1]
        the_types = self.corpus_matrix.dtype
        new_rows = ss.csc_matrix(np.zeros((num_rows, cols), dtype=the_types))
        # Return a new corpus matrix -
        # user decides when to update corpus matrix
        corpus_matrix = ss.csc_matrix(ss.vstack(
            [self.corpus_matrix, new_rows]))
        return corpus_matrix
The last method, `remove_zero_rows`, just checks whether there are any rows that are all zeroes. If a row is all zeroes, there are no longer any occurrences of that token, and the tokabulary and the corpus matrix should be adjusted accordingly.
    def remove_zero_rows(self, sparse_csc_matrix):
        # sparse_csc_matrix is a scipy sparse csc matrix;
        # we want to remove all-zero rows from it
        sparse_csr_matrix = sparse_csc_matrix.tocsr()
        # Row indices that contain at least one nonzero entry
        nonzero_row_indice, _ = sparse_csr_matrix.nonzero()
        unique_nonzero_indice = np.unique(nonzero_row_indice)
        sparse_csr_matrix_wo_zero_rows = sparse_csr_matrix[unique_nonzero_indice]
        sparse_csc_matrix = sparse_csr_matrix_wo_zero_rows.tocsc()
        return sparse_csc_matrix
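The same trick in isolation, with a throwaway 3 x 2 matrix that has one all-zero row:

import numpy as np
import scipy.sparse as ss

m = ss.csc_matrix(np.array([[1, 0],
                            [0, 0],     # all-zero row that should be dropped
                            [2, 3]]))
csr = m.tocsr()
rows, _ = csr.nonzero()        # row index of every stored nonzero entry
keep = np.unique(rows)         # array([0, 2])
print(csr[keep].toarray())     # [[1 0]
                               #  [2 3]]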
Manage_Corpus_Matrix Class

Armed with the methods of the `Corpus_Matrix` class, we can now establish a class that manages the corpus matrix. I wonder what would be a good name for this class. How about `Manage_Corpus_Matrix`? Sounds good.
"But Thom", you ask, "Why do we need this class?" That's an excellent question. I'm glad you asked.
The last class focused on the corpus matrix itself. This class manages the corpus matrix and tracks it with respect to the tasks at hand. The `Corpus_Matrix` class is meant to remain mostly the same from use to use; this `Manage_Corpus_Matrix` class is meant to change as needed for each new use.

In this simple example, notice how we have an object to track column-to-document info. We can find the column that is closest in content to the new document, but we also need a way to track which document that column is associated with. This part may not change often from use to use, because it's a base need. However, I've used this class in some strange situations, and based on specific needs, its methods and their code can become quite deep.

For example, what if you had another object that tracked document categories too? Once you found the closest matching document or documents to a new incoming document, you could also report which category that document belonged to. There are many more such cases that could be considered.
class Manage_Corpus_Matrix:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.cm = Corpus_Matrix(data_dir)
        # Mapping of corpus matrix column index -> document ID
        self.col_to_doc_fn = f'{self.data_dir}/col_to_doc_info.json'
        self.col_to_doc_info_exists = os.path.isfile(self.col_to_doc_fn)
        if self.col_to_doc_info_exists:
            self.col_to_doc_info = \
                gtm.load_object_from_json_file(self.col_to_doc_fn)
        else:
            self.col_to_doc_info = {}
The next method illustrates the previous point more clearly. We need this method to:
- list the document IDs already tracked in the column-to-document info,
- find the incoming documents that are not yet in the corpus matrix,
- tokenize each new document and add its vector to the corpus matrix, and
- record which new column belongs to which document ID.
Finally, we store that column-to-document object to file after each incremental growth.
    def grow_corpus_matrix_from_documents(self, docs_D):
        # Document IDs already represented as columns in the corpus matrix
        current_doc_ID_list = list(self.col_to_doc_info.values())
        new_doc_IDs = [id for id in docs_D if id not in current_doc_ID_list]
        for doc_ID in new_doc_IDs:
            mt = Micro_Tokenizer(text=docs_D[doc_ID])
            self.cm.add_new_doc_vector(mt.tokens)
            # The new document is now the rightmost column of the matrix
            col = self.cm.corpus_matrix.shape[1] - 1
            self.col_to_doc_info[col] = doc_ID
            gtm.store_object_to_json_file(
                self.col_to_doc_info, self.col_to_doc_fn)
Doc_Match_Math_Machine Class

Finally, we get to the main class, `Doc_Match_Math_Machine`, which helps us find the closest matching existing document to a new document. NOTE, to help avoid possible confusion: this class does not build or manage the growth and updates of the corpus matrix and its associated documents. It is only used to find the closest matching document in the corpus to the new document.

The `__init__` method:
class Doc_Match_Math_Machine:
    def __init__(self, cm_data_dir, top_n=3):
        # cm is the corpus matrix of document vectors
        self.cm = Corpus_Matrix(cm_data_dir)
        # Precompute the norm of every document column
        self.cm.docs_norms = ssla.norm(self.cm.corpus_matrix, axis=0)
        self.col_to_doc_fn = f'{cm_data_dir}/col_to_doc_info.json'
        self.col_to_doc_info_D = gtm.load_object_from_json_file(
            self.col_to_doc_fn)
        self.top_n = top_n
And FINALLY, we get to the determination of how to find the closest matches.
- `ccn` is the cosine closeness numerators;
- `ccd` is the cosine closeness denominators;
- we divide `ccn` by `ccd` to get an array of closeness calculations;
- and finally, we receive the `matches_D` dictionary with the information we need.
    def get_closest_match_columns(self, doc_text):
        matches_D = {}
        mt_doc = Micro_Tokenizer(text=doc_text)
        # Start from the corpus tokabulary so the vector lines up with the matrix rows
        doc_tokens = copy.deepcopy(self.cm.tocab)
        num_new_tokens = 0
        for t in mt_doc.tokens:
            if t not in doc_tokens:
                doc_tokens[t] = 0
                num_new_tokens += 1
            doc_tokens[t] += 1
        doc_vector = np.array(list(doc_tokens.values()))
        doc_vector = doc_vector.reshape((1, len(doc_vector)))
        doc_vector = ss.csc_matrix(doc_vector, dtype=np.float64)
        doc_norm = ss.linalg.norm(doc_vector)
        ''' Find the closest documents '''
        # Pad the corpus matrix with zero rows for tokens it has never seen
        temp_cm = self.cm.__add_zeros_rows_to_bottom_of_matrix__(
            num_new_tokens)
        ''' ccn = cosine_closeness_numerators
            ccd = cosine_closeness_denominators '''
        ccn = doc_vector * temp_cm
        ccd = doc_norm * self.cm.docs_norms
        simils = ccn / ccd
        simils = np.array(simils)[0]
        # Sort columns by similarity, highest first, and keep the top_n
        the_indices = np.flip(np.argsort(simils))
        the_indices = the_indices[:self.top_n]
        for col in the_indices:
            doc_ID = self.col_to_doc_info_D[str(col)]
            cos = round(simils[col], 6)
            matches_D[cos] = {
                "col": col,
                "doc_ID": doc_ID
            }
        return matches_D
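To see the `ccn` / `ccd` step in miniature, here is a standalone sketch with a tiny made-up count matrix standing in for the corpus matrix (three token rows, two document columns) and a query row vector:

import numpy as np
import scipy.sparse as ss
import scipy.sparse.linalg as ssla

corpus = ss.csc_matrix(np.array([[1.0, 0.0],    # 3 token rows, 2 document columns
                                 [2.0, 1.0],
                                 [0.0, 4.0]]))
docs_norms = ssla.norm(corpus, axis=0)          # one norm per document column

query = ss.csc_matrix(np.array([[1.0, 2.0, 0.0]]))   # 1 x 3 row of query token counts
query_norm = ssla.norm(query)

ccn = query * corpus                  # 1 x 2: dot product with every document column at once
ccd = query_norm * docs_norms         # 1 x 2: products of the magnitudes
simils = np.array(ccn / ccd)[0]
print(np.round(simils, 6))            # approximately [1.0, 0.2169]; column 0 points exactly the same way as the query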
This was perhaps the toughest part of preparing this notebook.
Creating good fake data is simple, but it's not always easy!
Document 73, `doc_73.txt`, has the following text.
Feeling out Mrs nearly way great a campaign. Professor behavior environmental item. Also watch wife run behavior recently court. Huge add half up trip foot face. Team number wife as from south detail include. From run manager cell away mind. Effect culture glass executive in movement may poor. Education career control discussion whose. Night skill exist. Skin single at right together coach production professor. Determine what institution others home job bar exist. Late responsibility expect south. Whether operation necessary some color step account join.
It doesn't read much like a real story, does it? I worked a little while to make it have some sort of flow. I tried NOT to add many words or punctuation marks, and to create a possible story using the existing words. Don't judge me; I didn't want to kill myself over this part.
Professor Ives and his wife, Mrs Ives, were recently in court. The late night campaign for his career in the south to reform education on environmental behavior was nearly over. His team had responsibility to control discussions. It was their job to include a number for each detail. The professor must determine what institution others also watch to run successfully. From the south, the manager must run his cell to effect the culture as from an "away mind". This would add huge gains - up to half. Feeling out a great way to give their poor campaign coach a movement by foot trip does exist. He may expect to face the executives above the glass ceiling and show each item of his skill. Whether this operation is necessary and gives some color to their account is in question. They must skin this single cat. The production of the coach at home to add a job for each person is the bar. It does exist. Each must join this step.
Now it's time to put all of this to use!
We will now move to the content of the file `Doc_Match_Math_Machine_Usage.py` and see how all this code works in concert when we instantiate the classes and use them. We want to see if `doc_100.txt`, in the top-level directory, best matches `doc_73.txt`, in the documents directory.
I promise I did NOT adjust anything to get things to work.
This is an honest test with fake data.
Also, consider the historical points at the top of this report regarding the initial development of this work.
We import our handy `__General_Tools_Module__` again as `gtm`. We of course need our `Doc_Match_Math_Machine_Parts` so that we can instantiate the classes for our document matching math machine. And we set up pprint so that we can pretty print our results. Please use this convenience form of pprint also, `pp = pprint.PrettyPrinter(indent=2)`; it's very handy.

The `File_Manager` from the `__General_Tools_Module__` is used to collect all the file names in the `./documents` subdirectory. A dictionary comprehension is used to create a dictionary of indices to file names.

We establish a data directory for our corpus matrix data and instantiate the `Manage_Corpus_Matrix` class to create our corpus matrix. Then we call the `grow_corpus_matrix_from_documents` method of that class to build our initial corpus matrix.

If you were building a large corpus matrix from a large corpus, the initial build could take some time. How could it be made faster? I've not yet needed to move to big data routines for this type of work. However, I am confident that this approach could be rewritten with fast big data routines.

Finally, we instantiate the `Doc_Match_Math_Machine` itself. We obtain the text for our new file, then send that text as an argument to the `get_closest_match_columns` method of our math machine. Finally, we pretty print our top 3 matches.
import __General_Tools_Module__ as gtm
import Doc_Match_Math_Machine_Parts as dm3p
import pprint
pp = pprint.PrettyPrinter(indent=2)
""" Gather Docs for building the corpus """
fm = gtm.File_Manager("./documents")
docs_D = {k: v for k, v in enumerate(fm.full_file_list)}
cm_data_dir = "./cm_data"
mcm = dm3p.Manage_Corpus_Matrix(cm_data_dir)
mcm.grow_corpus_matrix_from_documents(docs_D)
dm3 = dm3p.Doc_Match_Math_Machine(cm_data_dir, top_n=3)
new_file = "doc_100.txt"
new_text = gtm.get_text_from_file(new_file)
matches_D = dm3.get_closest_match_columns(new_text)
pp.pprint(matches_D)
Oh my! It worked. I confess that I am always thrilled when this works for each new case. This is the first time it has worked on a fake-data case since development.
Output:
{ 0.52267: {'col': 43, 'doc_ID': './documents/doc_9.txt'},
0.530232: {'col': 23, 'doc_ID': './documents/doc_38.txt'},
0.645827: {'col': 1, 'doc_ID': './documents/doc_73.txt'}}
As you can imagine, if you are comfortable thinking in far more than 3 dimensions, the dimensions here are formed by content only: each distinct token is its own dimension.
But what about base word form relations? Exactly. How would we make sure that "seem", "seems", and "seemingly" are all closely related?
Having trained GloVe from scratch using Python libraries, I have been frustrated by this. You'd think that the English language, at least, would have some lookup system that lets us find the base form of each word, and that such a system would also show us all the words that can be derived from that base form with prefixes and suffixes. But no, no such thing exists.
I was considering trying to do this myself with other tools that I've made for language processing.
I am glad I did not. Why? Because when I saw the clever way that transformer tokenizers work, I decided that they were best. They relate the base forms of various words in, in my opinion, a much more clever way. Maybe the version I was considering would do better in the long run, but I can't take the time to check it yet, and the transformer tokenizers are VERY good. Thus, I think this work can be taken to the next level of improvement by using transformer tokenizers to tokenize the documents.
What have we really done here? We've made each token that occurs in a corpus of documents into a dimension, a unit vector, to create a very high dimensional space. Then we counted how many times each of those tokens appeared in each of many documents. Each document became a column of a huge sparse matrix. Each of those columns had occurrence rates of each token to form the numbers of that column. Thus, the occurrence rates represent a pseudo distance in each of the token dimensions. Each column is a vector in that hyper space.
Now a new document comes to us. We turn it into a new vector that can exist in our corpus hyper space. It must have numbers to work in our math machine. Then, using basic trigonometry at a VERY large scale, we seek to find which vector (column / document) in our corpus matrix is closest to that new vector in the corpus hyper space. Finally, because we did a good job tracking our documents to columns and our tokens to indices, we do, in fact, find the best matching document.
At the end, I guessed that this work could be further improved by using transformer-type tokenizers to tokenize the documents, but that needs further investigation and is, for now, only a hypothesis.