I have run into a fortunate double problem in the last few weeks. In my research as a fellow at HOPE, I have been finding that, increasingly, text of documents that I am finding only appear in print. So I face the double problem that the text is neither available to be read on my machine nor searchable. So 1) to the extent necessary, I want to save a personal copy of a text and 2) I would like that copy to generate ORM text so that I can both search the text and analyze it using computational methods. In future posts I will discuss computational analysis.
Today I will keep the discussion simple. I need to create a search engine, and I would like that search engine to do more than calls that you could accomplish, say, in a SQL database using native calls. Instead, I am vectorizing the text. I had been made familiar with vectorization in the past when I wanted to pass a large text to ChatGPT. Seeing how my friend, Cameron Harwick, used vectorization for cluster analysis, it seemed obvious that I should use a similar method for searching my blog posts.
For those who have not noticed, I have constructed this blog from scratch. At the moment it is quite simple. Over time, I expect to add gadgets with the aim of building similar tools for research. The first obvious improvement is to build this search bar (now in the top-right corner) for the purpose of prototyping software for search of a treasure trove of unpublished work.
Before building, it is useful to understand that text vectorization gives each document a coordinate in some high-dimensional space (I count over 7000 dimensions each post). You can then calculate the similarity between different texts by calculating the cosine similarity of two texts (the search phrase and the texts of interest in this case). Since my aim here is not to explain cosine similarity, but rather, to consider its usefulness in building a homegrown search function, it suffices to explain that the cosine similarity of two documents is calculated as follows:
$$AB = ||A||\ ||B||cos\theta$$
Presuming the two vectors are equi-dimensional:
$$cos(\theta)=\frac{AB}{||A||\ ||B||} = \frac{\Sigma_{i=1}^{n}{A_iB_i}}{\sqrt{\Sigma_{i=1}^{n}{A_i^2}}\sqrt{\Sigma_{i=1}^{n}{B_i^2}}}$$
The process for accomplishing this is not too painful. Using LLMs for building out the JavaScript greatly sped up development for me. I vectorize my posts using the TfidfVectorizer() module from sklearn.