
Document Match Math Machine

$$ \cos(\theta) = \frac{\vec{x} \cdot \vec{y}} {\lVert \vec{x} \rVert \lVert \vec{y} \rVert} $$

I'll be speaking about all of this at the O'Reilly AI Superstream: NLP in Production on May 11, 11AM EST. This, along with other talks, is about understanding the most common obstacles for getting NLP models into production and learning how to overcome them. Sign in, or sign up for your free trial: https://oreil.ly/Xx8An

This Document Match Math Machine document, and all the code that accompanies it, can be found in its DagsHub repo. DagsHub, much like GitHub, is focused specifically on meeting the needs of data scientists and machine learning engineers. I hope you will sign up for a free account. DagsHub does NOT pay me to promote them. I just approve of their mission. And, yes, I still love GitHub, GitLab, etc. They each have their own great purposes.

Truly, this could also be called a text match math machine. It can be used for many processes where finding the closest matching text from one source to another helps identify similarities across large collections of similar source material. What matters most here are the concepts behind how it can be used, and I'll try to note such things as we work through the code. This math machine approach can be applied to many such "find the best text match" activities in language processing.

Please note the following high level principles.

  1. This is called a math machine. It is not a machine learning model. Machine learning models are also math machines, but this math machine, represented by the collection of code in this repo, is NOT trained. The entire system of files herein goes through the following process:
    1. each file has its words, punctuation marks, and even numbers separated into a set of tokens;
    2. the set of tokens builds a tokabulary, somewhat like a vocabulary, except that this tokabulary is a growing set of tokens found across all documents;
    3. each token in the tokabulary, that can grow over time, receives an ID;
    4. each ID represents a row in vectors that represent each document;
  2. Because it is NOT trained, it can grow in real time, "on the fly", very quickly:
    1. each new document can be added to the end of the corpus matrix;
    2. any documents determined to be updates can replace the existing vector representing that document in the corpus matrix;
    3. the number of new tokens found will cause the corpus matrix to grow by the same number of rows;
    4. a similar check is done during replacement to remove tokens that are no longer used, where possible (a small sketch of this bookkeeping follows this list).
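
To make that bookkeeping concrete, here is a minimal sketch in plain Python. The names toy_docs, tocab, and doc_vectors are illustrative only - they are not identifiers from this repo - but the steps mirror the list above: tokens get IDs, IDs become rows, and each document becomes a column of counts.

# Illustrative sketch only - not repo code.
toy_docs = {
    'doc_a': ['the', 'cat', 'sat', '.'],
    'doc_b': ['the', 'dog', 'sat', 'and', 'sat', '.'],
}

tocab = {}        # token -> row ID; grows as new tokens appear
doc_vectors = {}  # doc name -> list of counts, one slot per token ID

for name, tokens in toy_docs.items():
    for t in tokens:
        if t not in tocab:
            tocab[t] = len(tocab)    # a new token gets the next row ID
    counts = [0] * len(tocab)
    for t in tokens:
        counts[tocab[t]] += 1        # occurrence count for this document
    doc_vectors[name] = counts

print(tocab)        # {'the': 0, 'cat': 1, 'sat': 2, '.': 3, 'dog': 4, 'and': 5}
print(doc_vectors)  # {'doc_a': [1, 1, 1, 1], 'doc_b': [1, 0, 2, 1, 1, 1]}

Notice that doc_a's vector is shorter than doc_b's because new tokens arrived later; in the real machine this is exactly why rows of zeroes get appended to the bottom of the existing corpus matrix.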

The history of this work is as follows:

  1. I first tested this at large scale with very large, randomly generated sparse matrices, to make sure speed and memory would not become a problem before investing the work to code a full implementation.
  2. I then used very small toy dataset cases to develop the routines to tokenize the documents and build the routines for text to numbers and find the best matches.
  3. Finally, I tested a large real world example to serve needs in a previous role to ensure it worked at scale on real data, and it did.

The Math

Let's cover at least the basic math to get a feel for how it works to our benefit.

Cosine Similarity

The cosine of the angle between two vectors is their dot product divided by the product of their magnitudes:

$$ \cos(\theta) = \frac{\vec{x} \cdot \vec{y}} {\lVert \vec{x} \rVert \lVert \vec{y} \rVert} $$

where

$$ \vec{x} \cdot \vec{y} = x_1 y_1 + x_2 y_2 + \dots + x_n y_n $$

and

$$ \lVert \vec{x} \rVert = \sqrt{ \vec{x} \cdot \vec{x} } = \sqrt{ x_1^2 + x_2^2 + \dots + x_n^2 } $$

The more closely the two vectors point in the same direction, the closer $ \cos(\theta) $ is to 1.0!
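
As a quick numeric check, here is a minimal sketch with NumPy (the two count vectors are made up for illustration):

import numpy as np

# Two made-up count vectors
x = np.array([1.0, 2.0, 0.0, 3.0])
y = np.array([1.0, 1.0, 0.0, 4.0])

cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(round(cos_theta, 4))   # 0.9449 - fairly close to 1 because x and y point in similar directions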

Creating Some Fake Documents

See the file in this repo named Fake_Document_Creation.py to see how the fake documents were generated. The code in that file is repeated in the next code cell.

from faker import Faker
Faker.seed(0)
fake = Faker()
for i in range(100):
    a_doc = fake.paragraph(nb_sentences=20)
    file_name = f'./documents/doc_{i}.txt'
    with open(file_name, 'w') as f:
        f.write(a_doc)

The 100 documents created by that code are already in the documents subdirectory of this repo. However, I encourage you to run the code again with different parameters. It may be best to create a small set of small files so that you can inspect the various components of the math machine more easily.

Tokenizing The Documents

Python's Natural Language Toolkit (NLTK) is great. However, I rarely work with what would be called "natural" language. I've learned through R&D (trial and error) that my own tokenizers work best on the technical documents I tend to process. Programming my own tokenizers lets me get exactly what I need. The code blocks below make up my tokenizer, which is easy to extend when needed, and whose methods I can rearrange into whatever order a special case might require.

When this class, Micro_Tokenizer, is instantiated, you pass in either a file name or text - not both. Once it has its text attribute loaded, it runs the default routines:

  1. __whitespace_tokenize_text__
  2. __space_out_punc_marks_in_tokens__
  3. __segregate_character_types__

The value of running the methods in this order will become apparent as we walk through them. Note that all methods have double underscores before and after their names. This is to indicate that they are meant to be private; I intended them to be called only from the __init__ method. This behavior can of course be changed if you wish to build on this code.

class Micro_Tokenizer:
    def __init__(self, file_name='', text=''):
        if file_name:
            self.file_name = file_name
            self.__load_file_data__()
            self.text = self.Text.lower()
        elif text:
            self.text = text.lower()

        self.__whitespace_tokenize_text__()
        self.__space_out_punc_marks_in_tokens__()
        self.__segregate_character_types__()

The __retokenize__ method simply makes sure that each token is a separate string. If a token is a list, that list is kept intact and appended to the new retokenized list; this is a special feature for grouping items that belong together for specific needs, and it is NOT likely to be encountered in this simple example. If a token is a string with spaces, the items in it have been separated by another routine, and they will now become separate tokens. Please note that token.split() creates a list, and adding two lists simply appends the items of the second list to the end of the first one (a tiny illustration follows the method below).

    def __retokenize__(self):
        new_tokens = []

        for token in self.tokens:
            if type(token) is list:
                new_tokens.append(token)
            elif type(token) is str:
                new_tokens += token.split()

        self.tokens = new_tokens
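
A tiny illustration of the two behaviors __retokenize__ relies on (plain Python, not repo code):

# str.split() turns a spaced-out string into a list of tokens
spaced = 'sentences . '
print(spaced.split())    # ['sentences', '.']

# adding two lists appends the second list's items to the end of the first
tokens = ['this', 'is']
tokens += spaced.split()
print(tokens)            # ['this', 'is', 'sentences', '.']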

The __load_file_data__ method simply reads the text from the file name that was passed in, with a fallback in case of encoding problems.

    def __load_file_data__(self):
        try:
            with open(self.file_name, 'r', encoding='utf8', 
                      errors='ignore') as f:
                self.Text = f.read()
        except UnicodeDecodeError:
            with open(self.file_name, 'r') as f:
                self.Text = f.read()

The __whitespace_tokenize_text__ method simply uses all whitespace to initially separate the text into individual tokens. This will not separate punctuation marks from the characters around them, though, so we will need a method to split punctuation marks away from alpha characters.

    def __whitespace_tokenize_text__(self):
        self.tokens = self.text.split()

And the __space_out_punc_marks_in_tokens__ method does exactly that: it puts spaces between punctuation marks and between punctuation marks and alpha characters. (PUNCS is a collection of punctuation characters defined elsewhere in the repo's tokenizer code.) Notice the call to the __retokenize__ method at the end, which converts strings with spaces into separate tokens.

    def __space_out_punc_marks_in_tokens__(self):
        for t in range(len(self.tokens)):
            if type(self.tokens[t]) is list:
                continue
            new_string = ''
            for thing in self.tokens[t]:
                if thing in PUNCS:
                    new_string += f' {thing} '
                else:
                    new_string += thing

            self.tokens[t] = new_string.strip()

        self.__retokenize__()

But wait! There's more! If we see a change in character type within a token (for example, from letters to digits), we put a space at that boundary and then retokenize afterwards. This ensures good separation.

    def __segregate_character_types__(self):
        for t in range(len(self.tokens)):
            element = self.tokens[t]
            new_element = ''
            for i in range(len(element) - 1):
                if element[i].isalpha() and element[i+1].isalpha():
                    new_element += element[i]
                elif element[i].isdigit() and element[i+1].isdigit():
                    new_element += element[i]
                elif element[i] in PUNCS and element[i+1] in PUNCS:
                    new_element += f'{element[i]} '
                else:
                    new_element += f'{element[i]} '

            new_element += element[-1]
            self.tokens[t] = new_element

        self.__retokenize__()

Let's run a little test. The code in the cell below is at the bottom of the Doc_Tokenizer.py script file and will run if you simply run the file with Python OR run it from within your IDE.

if __name__ == "__main__":
    my_text = """
    This is a simple set of sentences.
    We are only using them to test the micro_tokenizer to see what it does.
    This is only a small simple test.
    """

    mc = Micro_Tokenizer(text=my_text)
    print(mc.tokens)

Note from the output that the token list allows repeats of tokens. At this point, we do want this. The build of the vectors that make up the corpus matrix will change this to a count for each token (a small example after the output below shows the idea).

Output:
['this', 'is', 'a', 'simple', 'set', 'of', 'sentences', '.', 'we', 'are',
 'only', 'using', 'them', 'to', 'test', 'the', 'micro', '_', 'tokenizer',
 'to', 'see', 'what', 'it', 'does', '.', 'this', 'is', 'only', 'a', 'small',
 'simple', 'test', '.']
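
As a preview of that step, here is a minimal sketch that turns the token list above into per-token counts. It uses collections.Counter, which is not what the repo code uses, but it shows the idea of collapsing repeats into counts:

from collections import Counter

tokens = ['this', 'is', 'a', 'simple', 'set', 'of', 'sentences', '.', 'we', 'are',
          'only', 'using', 'them', 'to', 'test', 'the', 'micro', '_', 'tokenizer',
          'to', 'see', 'what', 'it', 'does', '.', 'this', 'is', 'only', 'a', 'small',
          'simple', 'test', '.']

counts = Counter(tokens)
print(counts['.'])     # 3
print(counts['this'])  # 2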

The Classes Of The Doc Match Math Machine

The imports below are for all the classes in the following subsections of this section. Please note that I have a habit of importing sys even if I don't use it. When developing such code, I will frequently inject a sys.exit() to force my code to stop at a certain point when there is code below it.

NOTE that many tools come from the `__General_Tools_Module__` module. That module lives in my Python_Tools repo. I hope you will clone it and make good use of it. Don't hesitate to suggest improvements and additions too.

import __General_Tools_Module__ as gtm
import os
import copy
import numpy as np
import pandas as pd
import scipy.sparse as ss
import scipy.sparse.linalg as ssla
import sys

from Doc_Tokenizer import Micro_Tokenizer

The Corpus_Matrix Class

The Corpus_Matrix class directly builds, grows, and maintains the corpus matrix for a group of documents.

The __init__ method:

  • prepares a directory to store the corpus matrix math machine data;
  • establishes file names for the tokabulary file and for the corpus matrix file;
  • establishes boolean variables for the existence of the tokabulary and corpus matrix;
  • if there is an existing tocab and corpus matrix, they are loaded, and the work of growing or updating them starts from there;
  • if they do not exist, an empty tocab attribute dictionary is started.

class Corpus_Matrix:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        gtm.prepare_dir(data_dir)

        self.tocab_fn = f'{data_dir}/tocab.json'
        self.cm_fn = f'{data_dir}/cm.npz'

        tocab_exists = os.path.isfile(self.tocab_fn)
        corpus_matrix_exists = os.path.isfile(self.cm_fn)

        check = (tocab_exists and corpus_matrix_exists)

        if check:
            self.tocab = gtm.load_object_from_json_file(self.tocab_fn)
            self.corpus_matrix = ss.load_npz(self.cm_fn)
        else:
            self.tocab = {}

The add_new_doc_vector method is the workhorse of the Corpus_Matrix class. It first records the previous_token_count so it knows how much to grow the corpus matrix after processing the new set of tokens being passed in. It also makes a copy of the current tokabulary (this_tocab) so that we can count into it.

Note the for loop. The main tokabulary, self.tocab, is a dictionary with the tokens as keys and the values always 0. If the current token is already in the main tokabulary, the assignment simply re-establishes it with a value of 0; if it is not, the token is appended to the end of the dictionary. If the current token is not yet in the copy, this_tocab, it is added there with a value of 0 as well. Then, whether it was new or not, that token's occurrence count in the copy is incremented by 1, so the copy ends up holding the token counts for the current document.

The token count is checked after the for loop to see if it has grown. The change in token count equals the number of rows of zeroes that need to be added to the bottom of the existing corpus matrix. We also collect the tokens and their counts into separate lists.

The if block is important. If the current instance of this class already has a corpus_matrix:

  1. we first add the needed number of rows of zeroes to the bottom of it;
  2. then we create a sparse csc matrix from a Pandas DataFrame whose index is tocab_list and whose values are the occurrence counts of the tokens;
  3. we treat this new sparse csc matrix as a column vector;
  4. we append that column to the end of the corpus matrix using scipy.sparse's hstack function.

If the corpus_matrix attribute has not yet been created, we simply create a csc matrix from a Pandas DataFrame exactly as in the second step above. We are "creating" the matrix, first column, with the first document.

After each document alters the corpus matrix, the corpus matrix and the tokabulary are saved. Saving these is fast and ensures we do not lose our work. A short usage sketch follows the method below.

    def add_new_doc_vector(self, tokens):
        previous_token_count = len(self.tocab)
        this_tocab = copy.deepcopy(self.tocab)

        for t in tokens:
            self.tocab[t] = 0
            if t not in this_tocab:
                this_tocab[t] = 0
            this_tocab[t] += 1

        new_token_count = len(self.tocab)

        new_rows = new_token_count - previous_token_count
        tocab_list = list(self.tocab.keys())
        values = this_tocab.values()

        if hasattr(self, "corpus_matrix"):
            self.corpus_matrix = \
                self.__add_zeros_rows_to_bottom_of_matrix__(new_rows)
            column = ss.csc_matrix(
                pd.DataFrame(index=tocab_list, data=values))
            self.corpus_matrix = \
                ss.hstack([self.corpus_matrix, column])
        else:
            self.corpus_matrix = ss.csc_matrix(
                pd.DataFrame(index=tocab_list, data=values))

        ss.save_npz(self.cm_fn, self.corpus_matrix)
        gtm.store_object_to_json_file(self.tocab, self.tocab_fn)
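
Here is a short usage sketch of the class so far. It assumes the repo modules above are importable; the ./toy_cm_data directory name is made up for illustration.

# Usage sketch only - assumes Corpus_Matrix and Micro_Tokenizer are importable.
cm = Corpus_Matrix('./toy_cm_data')

mt = Micro_Tokenizer(text='The cat sat. The dog sat too.')
cm.add_new_doc_vector(mt.tokens)

mt = Micro_Tokenizer(text='A dog ran. The cat ran too.')
cm.add_new_doc_vector(mt.tokens)

# rows = tokens in the tocab so far, columns = 2 documents
print(cm.corpus_matrix.shape)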

In order to appreciate the actions of the replace_vectors_with_vectors method, you want to remember how Python slicing works - from the first index up to, but not including, the last index. The del_cols argument is a list of column indices that will be replaced by new_vecs. Based on the current corpus matrix and the new vectors, we determine the number of rows of zeroes to add to the bottom. Then we make the updated corpus matrix be:

  1. the old corpus matrix columns to the left of the replaced block;
  2. the new vectors in the middle;
  3. and the old corpus matrix columns to the right of the replaced block (a toy example of this slicing follows the method below).

During such an operation, we might end up with some rows that have no counts for a given token. If so, we remove those rows with all zeroes.

    def replace_vectors_with_vectors(self, del_cols, new_vecs):
        # Extend the matrix rows by the number of extra rows in new_vecs
        cm_rows = self.corpus_matrix.shape[0]
        new_vecs_rows = new_vecs.shape[0]
        new_rows = new_vecs_rows - cm_rows
        self.corpus_matrix = \
            self.__add_zeros_rows_to_bottom_of_matrix__(new_rows)

        # Replace the old columns with the new ones in new_vecs
        ds1 = del_cols[0]
        ds2 = del_cols[-1] + 1
        self.corpus_matrix = ss.hstack(
            [self.corpus_matrix[:, :ds1],
             new_vecs,
             self.corpus_matrix[:, ds2:]])

        self.corpus_matrix = self.remove_zero_rows(self.corpus_matrix)
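
Here is a toy example of that slicing-and-hstack idea on a small sparse matrix (plain SciPy, not repo code):

import numpy as np
import scipy.sparse as ss

# A 3x4 toy 'corpus matrix' with columns 0..3
cm = ss.csc_matrix(np.array([[1, 2, 3, 4],
                             [0, 1, 0, 1],
                             [5, 0, 0, 2]]))

# Replace columns 1 and 2 with a single new column
new_vecs = ss.csc_matrix(np.array([[9], [9], [9]]))
del_cols = [1, 2]

updated = ss.hstack([cm[:, :del_cols[0]],        # columns to the left
                     new_vecs,                   # replacement column(s)
                     cm[:, del_cols[-1] + 1:]])  # columns to the right

print(updated.toarray())
# [[1 9 4]
#  [0 9 1]
#  [5 9 2]]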

Based on descriptions above, and the simplicity of the code in the next cell for the __add_zeros_rows_to_bottom_of_matrix__ method, I trust the code needs no explanation. Run it in small steps using a small example if you do not understand the details that occur between the steps. The main thing to note is that a vstack rather than an hstack is used, because new rows are being added.

    def __add_zeros_rows_to_bottom_of_matrix__(self, num_rows):
        cols = self.corpus_matrix.shape[1]
        the_types = self.corpus_matrix.dtype

        new_rows = ss.csc_matrix(np.zeros((num_rows, cols), dtype=the_types))

        # Return a new corpus matrix -
        #     user decides when to update corpus matrix
        corpus_matrix = ss.csc_matrix(ss.vstack(
            [self.corpus_matrix, new_rows]))

        return corpus_matrix

The last method, remove_zero_rows, just checks whether there are any rows that are all zeroes. A row of all zeroes means there are no longer any occurrences of that token anywhere in the corpus, so the tokabulary and the corpus matrix should be adjusted. (A toy example follows the method below.)

    def remove_zero_rows(self, sparse_csc_matrix):
        # sparse_csc_matrix is a scipy sparse csc matrix; we want to remove all rows that are entirely zero
        sparse_csr_matrix = sparse_csc_matrix.tocsr()
        nonzero_row_indice, _ = sparse_csr_matrix.nonzero()
        unique_nonzero_indice = np.unique(nonzero_row_indice)
        sparse_csr_matrix_wo_zero_rows = sparse_csr_matrix[unique_nonzero_indice]
        sparse_csc_matrix = sparse_csr_matrix_wo_zero_rows.tocsc()

        return sparse_csc_matrix
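
And a toy example of dropping all-zero rows the same way (again, plain SciPy for illustration):

import numpy as np
import scipy.sparse as ss

m = ss.csc_matrix(np.array([[1, 0],
                            [0, 0],   # this row is all zeroes
                            [2, 3]]))

csr = m.tocsr()
nonzero_rows, _ = csr.nonzero()
keep = np.unique(nonzero_rows)
print(csr[keep].toarray())
# [[1 0]
#  [2 3]]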

The Manage_Corpus_Matrix Class

Armed with the methods for the Corpus_Matrix class, we can now establish a class that manages the corpus matrix. I wonder what would be a good name for this class. How about Manage_Corpus_Matrix? Sounds good.

"But Thom", you ask, "Why do we need this class?" That's an excellent question. I'm glad you asked.

The last class focused on the corpus matrix itself. This class manages the corpus matrix and tracks it with respect to the tasks at hand. The last Corpus_Matrix class is meant to remain mostly the same from use to use. This Manage_Corpus_Matrix class is meant to change as needed for each new use.

In this simple example, notice how we have an object to track column-to-document info. We can find the column that is closest in content to the new document, but we also need a way to track which document that column is associated with. This part may not change much from use to use, because it's a base need. However, I've used this class in some strange situations, and depending on specific needs, its methods and their code can become quite deep.

For example, what if you had another object that tracked document categories too? Once you found the closest matching document or documents to a new incoming document, you could also report which category that document belonged to. Many more such cases could be considered.

class Manage_Corpus_Matrix:
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.cm = Corpus_Matrix(data_dir)

        self.col_to_doc_fn = f'{self.data_dir}/col_to_doc_info.json'
        self.col_to_doc_info_exists = os.path.isfile(self.col_to_doc_fn)

        if self.col_to_doc_info_exists:
            self.col_to_doc_info = \
                gtm.load_object_from_json_file(self.col_to_doc_fn)
        else:
            self.col_to_doc_info = {}

The next method illustrates the previous point more clearly. We need this method to:

  1. look through our documents;
  2. ensure we don't reuse documents already in the corpus matrix;
  3. tokenize them;
  4. grow our corpus matrix with them;
  5. and update any other tracking objects as we go.

Finally we store that object to file after each incremental growth.

    def grow_corpus_matrix_from_documents(self, docs_D):
        current_doc_ID_list = list(self.col_to_doc_info.values())
        new_doc_IDs = [id for id in docs_D if id not in current_doc_ID_list]
        for doc_ID in new_doc_IDs:
            mt = Micro_Tokenizer(text=docs_D[doc_ID])
            self.cm.add_new_doc_vector(mt.tokens)
            col = self.cm.corpus_matrix.shape[1] - 1

            self.col_to_doc_info[col] = doc_ID

            gtm.store_object_to_json_file(
                self.col_to_doc_info, self.col_to_doc_fn)

The Doc_Match_Math_Machine Class

Finally, we get to the main class, Doc_Match_Math_Machine, which helps us find the closest matching existing document to a new document. NOTE, to help avoid possible confusion: this class does not build or manage the growth and updates of the corpus matrix and its associated documents. It is only used to find the closest matching document in the corpus to the new document.

The init method:

  1. loads the corpus_matrix;
  2. calculates the Euclidean norms for all the document vectors (columns) in the corpus matrix (a quick check of this follows the code below);
  3. loads the column to document dictionary;
  4. and sets the number of top matches to return to be a class attribute.

class Doc_Match_Math_Machine:
    def __init__(self, cm_data_dir, top_n=3):
        # cm is the corpus matrix of document vectors
        self.cm = Corpus_Matrix(cm_data_dir)
        self.cm.docs_norms = ssla.norm(self.cm.corpus_matrix, axis=0)

        self.col_to_doc_fn = f'{cm_data_dir}/col_to_doc_info.json'
        self.col_to_doc_info_D = gtm.load_object_from_json_file(
            self.col_to_doc_fn)
        self.top_n = top_n
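
A quick check of what that norms line computes (a minimal sketch, not repo code): ssla.norm with axis=0 returns one Euclidean norm per column, that is, one per document vector.

import numpy as np
import scipy.sparse as ss
import scipy.sparse.linalg as ssla

m = ss.csc_matrix(np.array([[3.0, 0.0],
                            [4.0, 2.0]]))

print(ssla.norm(m, axis=0))                 # [5. 2.] - one norm per column
print(np.linalg.norm(m.toarray(), axis=0))  # same values on the dense array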

And FINALLY, we get to the determination of how to find the closest matches.

  1. Set up a dictionary to hold the top matching information.
  2. Tokenize the new document that will be matched against the corpus.
  3. Copy the current tokabulary.
  4. Determine occurrence rates of tokens in the new document.
  5. Condition the document occurrence rate vector into the needed form.
  6. Calculate the Euclidean norm for the new document vector.
  7. Create a temporary corpus matrix that grows by the number of new tokens introduced by the new document.
  8. Perform a cosine similarity between the new document and all the vectors in the corpus matrix:
    1. The ccn is the cosine closeness numerators;
    2. The ccd is the cosine closeness denominators;
    3. You have to love NumPy and SciPy array calculation!;
    4. Divide ccn by ccd to get an array of closeness calculations;
    5. The closest values are the ones closest, or even equal, to 1.0;
    6. Do some NumPy sorting and flipping to order the columns from closest to 1 down to furthest from 1 (a tiny demonstration follows the method below);
    7. Build a dictionary to relate column numbers, in order of best match down, to documents.

And finally, we receive the matches_D dictionary with the information we need.

    def get_closest_match_columns(self, doc_text):
        matches_D = {}
        mt_doc = Micro_Tokenizer(text=doc_text)

        doc_tokens = copy.deepcopy(self.cm.tocab)
        num_new_tokens = 0
        for t in mt_doc.tokens:
            if t not in doc_tokens:
                doc_tokens[t] = 0
                num_new_tokens += 1
            doc_tokens[t] += 1
        
        doc_vector = np.array(list(doc_tokens.values()))
        doc_vector = doc_vector.reshape((1, len(doc_vector)))

        doc_vector = ss.csc_matrix(doc_vector, dtype=np.float64)
        doc_norm = ss.linalg.norm(doc_vector)

        ''' Find the closest documents '''
        temp_cm = self.cm.__add_zeros_rows_to_bottom_of_matrix__(
            num_new_tokens)
        ''' ccn = cosine_closeness_numerators
            ccd = cosine_closeness_denominators '''
        ccn = doc_vector * temp_cm
        ccd = doc_norm * self.cm.docs_norms
        simils = ccn / ccd
        simils = np.array(simils)[0]

        the_indices = np.flip(np.argsort(simils))
        the_indices = the_indices[:self.top_n]

        for col in the_indices:
            doc_ID = self.col_to_doc_info_D[str(col)]

            cos = round(simils[col], 6)
            matches_D[cos] = {
                "col": col,
                "doc_ID": doc_ID
            }

        return matches_D
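
The sorting-and-flipping step can feel backwards the first time you see it, so here is a tiny demonstration with made-up similarity values:

import numpy as np

simils = np.array([0.12, 0.95, 0.40, 0.77])

ordered = np.flip(np.argsort(simils))  # column indices, best match first
print(ordered)      # [1 3 2 0]
print(ordered[:3])  # top 3 columns: [1 3 2]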

Creating A Fake Match Worthy File To Document 73

This was perhaps the toughest part of preparing this notebook. Creating good fake data is simple, but it's not always easy! Document 73, doc_73.txt, has the following text.

Feeling out Mrs nearly way great a campaign. Professor behavior environmental item. Also watch wife run behavior recently court. Huge add half up trip foot face. Team number wife as from south detail include. From run manager cell away mind. Effect culture glass executive in movement may poor. Education career control discussion whose. Night skill exist. Skin single at right together coach production professor. Determine what institution others home job bar exist. Late responsibility expect south. Whether operation necessary some color step account join.

Doesn't read like a real story very well, does it? I worked a little while to make it have some sort of flow. I tried NOT to add many words or punctuation marks and to create a possible story using the existing words. Don't judge me - I didn't want to kill myself over this part.

Professor Ives and his wife, Mrs Ives, were recently in court. The late night campaign for his career in the south to reform education on environmental behavior was nearly over. His team had responsibility to control discussions. It was their job to include a number for each detail. The professor must determine what institution others also watch to run successfully. From the south, the manager must run his cell to effect the culture as from an "away mind". This would add huge gains - up to half. Feeling out a great way to give their poor campaign coach a movement by foot trip does exist. He may expect to face the executives above the glass ceiling and show each item of his skill. Whether this operation is necessary and gives some color to their account is in question. They must skin this single cat. The production of the coach at home to add a job for each person is the bar. It does exist. Each must join this step.

Implementing A Specific Doc Match Math Machine

Now it's time to put all of this to use! We will now move to the content of the file Doc_Match_Math_Machine_Usage.py and see how all this code works in concert when we instantiate the classes and use them. We want to see if doc_100.txt, in the top level directory, best matches doc_73.txt, in the documents directory. I promise I did NOT adjust anything to get things to work. This is an honest test with fake data. Also, consider the historical points at the top of this report regarding the early development of this approach.

We import our handy __General_Tools_Module__ again as gtm. We of course need Doc_Match_Math_Machine_Parts so that we can instantiate the classes for our document matching math machine. And we set up pprint so that we can pretty print our results. Please use this convenience form of pprint too - pp = pprint.PrettyPrinter(indent=2) - it's very handy.

The File_Manager from the __General_Tools_Module__ is used to collect all the file names in the ./documents subdirectory. A dictionary comprehension then builds docs_D, a dictionary mapping each file name to the text of that file. We establish a data directory for our corpus matrix data and instantiate the Manage_Corpus_Matrix class to create our corpus matrix. Then we call the grow_corpus_matrix_from_documents method of that class to build our initial corpus matrix.

If you were building a large corpus matrix from a large corpus, the initial build could take some time. How could it be made faster? I've not yet needed to move to big data routines for this type of work. However, I am confident that this approach could be rewritten on top of fast big data tooling if that ever became necessary.

Finally, we instantiate the Doc_Match_Math_Machine itself. We obtain the text of our new file and pass it as the argument to the get_closest_match_columns method of our math machine. Then we pretty print our top 3 matches.

import __General_Tools_Module__ as gtm
import Doc_Match_Math_Machine_Parts as dm3p
import pprint

pp = pprint.PrettyPrinter(indent=2)


""" Gather Docs for building the corpus """
fm = gtm.File_Manager("./documents")

# Map each document's file name to its text so the tokenizer receives document
# content and the file name can serve as the document ID.
docs_D = {fn: gtm.get_text_from_file(fn) for fn in fm.full_file_list}

cm_data_dir = "./cm_data"
mcm = dm3p.Manage_Corpus_Matrix(cm_data_dir)
mcm.grow_corpus_matrix_from_documents(docs_D)

dm3 = dm3p.Doc_Match_Math_Machine(cm_data_dir, top_n=3)

new_file = "doc_100.txt"
new_text = gtm.get_text_from_file(new_file)
matches_D = dm3.get_closest_match_columns(new_text)

pp.pprint(matches_D)

Oh my! It worked. I confess that I am always thrilled when this works for each new case. This is the first time it has worked on a fake data case since development.

Output:

{ 0.52267: {'col': 43, 'doc_ID': './documents/doc_9.txt'},
  0.530232: {'col': 23, 'doc_ID': './documents/doc_38.txt'},
  0.645827: {'col': 1, 'doc_ID': './documents/doc_73.txt'}}

Future Improvements

As you can imagine, if you are comfortable thinking about far more than 3 dimensions, you will notice that the dimensions here are formed by surface content only. But what about base word form relations? Exactly. How would we make sure that seem and seems and seemingly are all closely related? Having trained GloVe from scratch using Python libraries, I have been frustrated by this. You'd think that the English language at least would have some lookup system that lets us know the base form of each word, and that such a system would also show us all the words that can be derived from that base form with prefixes and suffixes. But no. No such thing exists. I was considering trying to build this myself with other tools that I've made for language processing.

I am glad I did not. Why? Because when I saw the clever way that transformer tokenizers work, I decided they were best. They do, in my opinion, a much more clever job of relating the base forms of various words. Maybe the version I was thinking about would do better in the long run, but I can't take the time to check it yet, and the transformer tokenizers are VERY good. Thus, I think this work can be taken to the next level of improvement by using transformer tokenizers to tokenize the documents.

Summary

What have we really done here? We've made each token that occurs in a corpus of documents into a dimension, a unit vector, to create a very high dimensional space. Then we counted how many times each of those tokens appeared in each of many documents. Each document became a column of a huge sparse matrix. Each of those columns had occurrence rates of each token to form the numbers of that column. Thus, the occurrence rates represent a pseudo distance in each of the token dimensions. Each column is a vector in that hyper space.

Now a new document comes to us. We turn it into a new vector that can exist in our corpus hyper space. It must have numbers to work in our math machine. Then, using basic trigonometry at a VERY large scale, we seek to find which vector (column / document) in our corpus matrix is closest to that new vector in the corpus hyper space. Finally, because we did a good job tracking our documents to columns and our tokens to indices, we do, in fact, find the best matching document.

At the end, Thom guessed that this work could be further improved by using transformer-type tokenizers to tokenize the documents, but this needs further investigation and is for now only a hypothesis.
