I have run into a fortunate double problem in the last few weeks. In my research as a fellow at HOPE, I am increasingly finding documents whose text appears only in print. So I face a double problem: the text is neither available to read on my machine nor searchable. So 1) to the extent necessary, I want to save a personal copy of a text, and 2) I would like that copy to yield OCR text so that I can both search the text and analyze it using computational methods. In future posts I will discuss computational analysis.

Today I will keep the discussion simple. I need to create a search engine, and I would like it to do more than the queries you could accomplish with, say, native calls to a SQL database. Instead, I am vectorizing the text. I first became familiar with vectorization when I wanted to pass a large text to ChatGPT. After seeing how my friend Cameron Harwick used vectorization for cluster analysis, it seemed obvious that I should use a similar method for searching my blog posts.

For those who have not noticed, I have constructed this blog from scratch. At the moment it is quite simple. Over time, I expect to add gadgets with the aim of building similar tools for research. The first obvious improvement is to build this search bar (now in the top-right corner) for the purpose of prototyping software for search of a treasure trove of unpublished work.

Before building, it is useful to understand that text vectorization gives each document a coordinate in some high-dimensional space (I count over 7000 dimensions for each post). You can then measure how alike two texts are by calculating the cosine similarity of their vectors (here, the search phrase and each text of interest). Since my aim here is not to explain cosine similarity but rather to consider its usefulness in building a homegrown search function, it suffices to explain that the cosine similarity of two documents is calculated as follows:

$$A \cdot B = \|A\|\,\|B\|\cos\theta$$

Presuming the two vectors are equi-dimensional:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
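
To make the formula concrete, here is a minimal Python sketch of the same calculation on toy vectors. The three-word vocabulary and the names `query`, `post_a`, `post_b`, and `post_c` are made up for illustration; they are not taken from the blog's actual index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat its similarity as 0
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical term counts over the toy vocabulary ["search", "vector", "blog"]
query = [1, 1, 0]
post_a = [2, 2, 0]  # same direction as the query, only longer
post_b = [0, 0, 3]  # shares no terms with the query
post_c = [1, 0, 1]  # shares one term with the query

for name, post in [("post_a", post_a), ("post_b", post_b), ("post_c", post_c)]:
    print(name, round(cosine_similarity(query, post), 3))
```

Because only the angle matters, `post_a` scores roughly 1 even though it is twice as long as the query, `post_b` scores 0, and `post_c` lands at roughly 0.5.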

The process for accomplishing this is not too painful. Using LLMs to build out the JavaScript greatly sped up development for me. I vectorize my posts using the TfidfVectorizer class from sklearn (scikit-learn).

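import json

from sklearn.feature_extraction.text import TfidfVectorizer

# strip_tags and slugify are helper functions assumed to be defined or imported elsewhere.
# posts is a pandas DataFrame with 'blog', 'index', and 'date' columns.
def generate_search_index(posts):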
    # Extract text content from posts
    documents = posts['blog'].apply(strip_tags).tolist()
    titles = posts['index'].tolist()
    html_titles = posts['index'].apply(slugify).tolist()
    dates = posts['date'].dt.strftime('%Y-%m-%d').tolist()
    urls = [f"InCognito/{html_title}.html" for html_title in html_titles]

    # Initialize the TfidfVectorizer
    vectorizer = TfidfVectorizer(stop_words='english')
    # Fit and transform the documents
    doc_vectors = vectorizer.fit_transform(documents)
    # Get feature names
    feature_names = vectorizer.get_feature_names_out()
    # Convert document vectors to arrays
    doc_vectors_array = doc_vectors.toarray()

    # Prepare the search data
    search_data = {
        'feature_names': feature_names.tolist(),
        'documents': []
    }

    # For each document, store the TF-IDF vector in the search index
    for title, html_title, date, content, url, vector in zip(titles, html_titles, dates, documents, urls, doc_vectors_array):
        # Convert the vector to a list for JSON serialization
        vector_list = vector.tolist()
        search_data['documents'].append({
            'id': html_title,
            'title': title,
            'date': date,
            'content': content,
            'url': url,
            'vector': vector_list
        })

    # Write the search data to a JSON file
    with open('search_index.json', 'w', encoding='utf-8') as f:
        json.dump(search_data, f, ensure_ascii=False)
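
One possible refinement, offered only as a sketch: with thousands of dimensions per post, most entries in each TF-IDF vector are zero, so the JSON file could store just the nonzero weights. The helpers `to_sparse` and `sparse_dot` below are hypothetical names, and the JavaScript shown in this post expects dense arrays, so it would need matching changes before this format could be used.

```python
import json

def to_sparse(vector):
    """Keep only nonzero weights, keyed by feature index.

    Keys are strings because JSON object keys are always strings.
    """
    return {str(i): w for i, w in enumerate(vector) if w != 0}

def sparse_dot(a, b):
    """Dot product of two sparse vectors produced by to_sparse."""
    # Iterate over the shorter mapping to do less work
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[i] for i, w in a.items() if i in b)

# Toy dense vectors standing in for rows of doc_vectors_array
dense_doc = [0.0, 0.6, 0.0, 0.0, 0.8]
dense_query = [0.0, 1.0, 0.0, 0.0, 0.0]

sparse_doc = to_sparse(dense_doc)
sparse_query = to_sparse(dense_query)

print(sparse_doc)
print(len(json.dumps(dense_doc)), "characters dense vs", len(json.dumps(sparse_doc)), "sparse")
print(sparse_dot(sparse_doc, sparse_query))
```

TfidfVectorizer normalizes each document vector to unit length by default (`norm='l2'`), so when the query vector is also normalized, this dot product equals the cosine similarity.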

Because the posts are already vectorized and saved to JSON, the page does not have to recompute their vectors each time the text is searched, which speeds up loading. Next we have to vectorize the search query using JavaScript, then calculate its cosine similarity with each post. The results are sorted by cosine similarity to the search text.

Other script is involved, but the core JavaScript functions are:

function buildQueryVector(tokens, featureNames) {
    var vector = new Array(featureNames.length).fill(0);
    var termCounts = {};
    tokens.forEach(function(token) {
        if (termCounts[token]) {
            termCounts[token] += 1;
        } else {
            termCounts[token] = 1;
        }
    });
    var totalTerms = tokens.length;
    featureNames.forEach(function(term, index) {
        if (termCounts[term]) {
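            // Calculate term frequency (TF) only: the IDF weights fitted in Python
            // are not stored in search_index.json, so the query is not IDF-weighted
            var tf = termCounts[term] / totalTerms;
            vector[index] = tf;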
        }
    });
    // Normalize the vector
    var magnitude = Math.sqrt(vector.reduce((sum, val) => sum + val * val, 0));
    if (magnitude > 0) {
        vector = vector.map(function(val) { return val / magnitude; });
    }
    return vector;
}

function cosineSimilarity(vecA, vecB) {
    var dotProduct = 0;
    var normA = 0;
    var normB = 0;
    for (var i = 0; i < vecA.length; i++) {
        dotProduct += vecA[i] * vecB[i];
        normA += vecA[i] * vecA[i];
        normB += vecB[i] * vecB[i];
    }
    if (normA === 0 || normB === 0) {
        return 0;
    } else {
        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

function search(query) {
    var featureNames = searchData.feature_names;
    var documents = searchData.documents;

    // Tokenize the query with spell correction
    var queryTokens = tokenize(query, featureNames);

    // Build the query vector
    var queryVector = buildQueryVector(queryTokens, featureNames);

    var results = [];

    // Compute cosine similarity with each document
    documents.forEach(function(doc) {
        var docVector = doc.vector;
        var similarity = cosineSimilarity(queryVector, docVector);
        if (similarity > 0) {
            results.push({
                title: doc.title,
                url: doc.url,
                date: doc.date,
                content: doc.content,
                score: similarity
            });
        }
    });

    // Sort results by similarity score
    results.sort(function(a, b) {
        return b.score - a.score;
    });

    return results;
}
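
For checking the index outside the browser, the same ranking logic can be mirrored in Python. This is a rough sketch rather than the code the blog runs: the toy index is invented for illustration, and a plain lowercase-and-split tokenizer stands in for the JavaScript `tokenize` function, whose spell correction is not reproduced here.

```python
import math
from collections import Counter

def build_query_vector(tokens, feature_names):
    """Term-frequency vector over the index vocabulary, scaled to unit length."""
    counts = Counter(tokens)
    total = len(tokens)
    vector = [counts[term] / total if total else 0.0 for term in feature_names]
    magnitude = math.sqrt(sum(v * v for v in vector))
    if magnitude > 0:
        vector = [v / magnitude for v in vector]
    return vector

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, search_data):
    """Rank documents by cosine similarity to the query, dropping zero scores."""
    # Assumption: simple whitespace tokenization, no spell correction
    tokens = query.lower().split()
    query_vector = build_query_vector(tokens, search_data["feature_names"])
    results = []
    for doc in search_data["documents"]:
        score = cosine_similarity(query_vector, doc["vector"])
        if score > 0:
            results.append({"title": doc["title"], "score": score})
    results.sort(key=lambda r: r["score"], reverse=True)
    return results

# Toy index in the same shape as search_index.json (values are made up)
toy_index = {
    "feature_names": ["cosine", "search", "segregation"],
    "documents": [
        {"title": "toy post A", "vector": [0.6, 0.8, 0.0]},
        {"title": "toy post B", "vector": [0.0, 0.0, 1.0]},
    ],
}

for hit in search("cosine search", toy_index):
    print(hit["title"], round(hit["score"], 3))
```

Only "toy post A" is returned for this query, since "toy post B" shares no terms with it and is filtered out by the `score > 0` check, just as in the JavaScript version.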

The search seems to function adequately as is. Still, I will be considering potential improvements. For example, I imagine that searches might benefit from lemmatizing the tokenized text. This is a long-term project, on which I intend to make progress during my stay at HOPE. I will summarize major developments, technical and literary, here.

Building a Basic Search Engine with Text Vectorization
