Wednesday | 25 MAR 2026

Full Text Search in BASIC

fts, search, universe, basic, multivalue

My server rebooted and I lost my solr service. The problem with running services seems to be starting them up again. I thought I had gotten it working properly with systemd, but it looks like I did not.

I could debug it and figure out what happened, but I also don't quite care. My searching hasn't changed enough that I actually miss solr.

For those keeping track, I've now done substring searching, then solr searching, and now, for the next attempt, full text search with BM25 ranking. Luckily, BM25 is a very simple algorithm for ranking search results by how relevant they are.

The core flow is:

1. Tokenize your data
2. Create an inverted lookup
3. Gather results for the given tokenized query
4. Rank the results

The tokens can be split on word boundaries, or for more robust results, you can create n-grams.

This is inspired by FTS5 in sqlite. I had Claude analyze the sqlite codebase and come up with guidance material for me, and then I read a few blog posts to fully get the idea. The goal here was to learn and, to be entirely honest, I still feel like it was cheating to have Claude look at sqlite instead of doing that myself. I'm fine with using someone else's walkthrough of sqlite but using Claude's feels a bit slimy. I have two wolves inside me, one that hates AI and one that loves AI apparently.

Either way, FTS requires a processing step where you create the index, which I'm not a fan of. The simplest option is to rebuild the index daily. I could also make it smarter and update only the index entries for posts that changed. There are probably a number of enhancements that can be made so the indexing step isn't felt so acutely, but it will always be there when you run it from scratch.

That's my largest issue with it. The gain is that I won't lose search functionality ever again. The index may be out of date or disappear entirely, but the core logic is always right there. There is a value to that which I think I'm going to appreciate.

Tokenizing and n-grams

The first step in setting up full text search is to tokenize the data. We need to take a stream of text and make it into tokens. The goal is to use these tokens to create an inverted index.

Hello, World!

The above breaks into 2 tokens because I'm ignoring punctuation:

hello
world

I also normalized the data to lower case. This style of tokenizing splits on non-alphanumeric characters. However, this means that I need to match whole words to get results.
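As a rough Python sketch of this word-boundary style (the regex splitter and function name are my own stand-ins, not part of the BASIC implementation later in this post):

```python
import re

def tokenize_words(text):
    # Lowercase, then split on runs of non-alphanumeric characters,
    # dropping the empty strings left by leading/trailing punctuation.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]
```

So `tokenize_words("Hello, World!")` gives `['hello', 'world']`.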

If I have a typo, this won't pull in any results because it's unlikely that the typo has enough data associated with it.

This is where using n-grams comes in handy. Instead of tokenizing on punctuation, I can tokenize based on a set number of characters.

The most common example looks to be trigrams, meaning the data is split into groups of three characters with a sliding window. This results in many tokens and explodes the index size, but it does make searching more useful.

Hello, World!

Becomes:

hel
ell
llo
wor
orl
rld

This way I can now pull in results that have hellllo and worldly in them rather than just the words hello and world.
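The sliding window is a small change on top of the word splitter. Here's a Python sketch of the idea (again my own illustration, not the BASIC code; note that words shorter than n are kept whole, matching the boundary behaviour described below):

```python
import re

def ngrams(text, n=3):
    # Split into lowercase alphanumeric runs, then slide an
    # n-character window over each run.
    grams = []
    for word in re.split(r"[^a-z0-9]+", text.lower()):
        if 0 < len(word) < n:
            grams.append(word)  # short leftovers become whole tokens
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams
```

`ngrams("Hello, World!")` gives `['hel', 'ell', 'llo', 'wor', 'orl', 'rld']`.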

Now let's look at some code:

   SUBROUTINE FTS.TOKENIZE(ORIGINAL.DOCUMENT.TEXT,TOKENS,MAX.SIZE)
*
* COMPILER DIRECTIVES
*
   $DEFINE DATABASE.QM
   $DEFINE PLATFORM.LINUX
*
   $IFDEF DATABASE.QM
      $CATALOGUE LOCAL
   $ENDIF
*
   DOCUMENT.TEXT = OCONV(ORIGINAL.DOCUMENT.TEXT,'MCL')
*
   TOKENS = ''
   NUMBER.OF.CHARACTERS = LEN(DOCUMENT.TEXT)
*
   TOKEN = ''
   TOKEN.POS = 1
*
   FOR I = 1 TO NUMBER.OF.CHARACTERS + 1
      CHARACTER = DOCUMENT.TEXT[I,1]
*
      IF (CHARACTER >= 'a' AND CHARACTER =< 'z') OR (CHARACTER >= 'A' AND CHARACTER =< 'Z') OR (CHARACTER >= '0' AND CHARACTER =< '9') THEN
         IF LEN(TOKEN) = MAX.SIZE THEN
            GOSUB ADD.TOKEN
            I = I - MAX.SIZE
         END ELSE
            TOKEN = TOKEN : CHARACTER
         END
*
      END ELSE
         IF TOKEN # '' THEN
            GOSUB ADD.TOKEN
         END
      END
   NEXT I
*
   RETURN
*
*********************  S U B R O U T I N E  *********************
*
ADD.TOKEN:NULL
*
   TOKENS<1,-1> = TOKEN
   TOKENS<2,-1> = TOKEN.POS
*
   TOKEN.POS = TOKEN.POS + 1
   TOKEN = ''
*
   RETURN
*
* END OF PROGRAM.
*
END
*

The core logic is that we take in the DOCUMENT.TEXT and start chopping it up. We loop over it character by character, and we have a few conditions for when we consider a token built and ready to be added to our list.

Once the token is added to the list, it's cleared to build the next token.

The conditions to build a token: once we have a character to add, we check whether the token has reached its limit. If it has, we add the token to the list and reset; otherwise we append the character to the token.

The limit is how we build an n-gram, in my case specifically a trigram.

We run one extra iteration of the loop so that the last token left over is always added to the token list. This way I don't need to handle it outside the loop.

There is a major issue in this code though: it is quite inefficient. Once a token is built, I clear it right away and move the position back, which means I actually go through the data three times. What would be better is to build the token from positions. Instead of building TOKEN in memory, I could simply keep track of positions and then use DOCUMENT.TEXT[I,3] to get the right substring.

This would likely be faster but I like the way I've done it here as it is explicit. These changes are things that I think should come afterwards.

Indexing

Now let's get to the actual indexing. This part is relatively straightforward. Once the data has been fully tokenized, we loop over the tokens and update the index file. For each token, we add the document ID and the word position to that token's record.

If the token isn't already in the file, we'll add it.

This looks like:

   SUBROUTINE FTS.INDEX(DOCUMENT.ID,TOKENS,INDEX.FILE)
*
* COMPILER DIRECTIVES
*
   $DEFINE DATABASE.QM
   $DEFINE PLATFORM.LINUX
*
   $IFDEF DATABASE.QM
      $CATALOGUE LOCAL
   $ENDIF
*
   NUMBER.OF.TOKENS = DCOUNT(TOKENS<1>,@VM)
*
   FOR I = 1 TO NUMBER.OF.TOKENS
      TOKEN = TOKENS<1,I>
      WORD.POSITION = TOKENS<2,I>
*
      READ INDEX.MATRIX FROM INDEX.FILE, TOKEN ELSE INDEX.MATRIX = ''
*
      LOCATE(DOCUMENT.ID,INDEX.MATRIX,1;DOC.POS;'AR') THEN
         INDEX.MATRIX<2,DOC.POS,-1> = WORD.POSITION
      END ELSE
         INDEX.MATRIX<1> = INSERT(INDEX.MATRIX<1>,1,DOC.POS,0,DOCUMENT.ID)
         INDEX.MATRIX<2> = INSERT(INDEX.MATRIX<2>,1,DOC.POS,0,WORD.POSITION)
      END
*
      WRITE INDEX.MATRIX ON INDEX.FILE, TOKEN
   NEXT I
*
   RETURN
*
* END OF PROGRAM.
*
END
*

This is quite short. We keep a record per token, and the record contains a multivalue list of documents and word positions.

This indexing step is where we will spend the bulk of our time. I made the LOCATE statements sorted, so there is a small speed up there. I feel like there are more gains to be made here.
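Conceptually, the inverted index is a nested mapping, token → document → positions. A Python sketch of the same structure (the names are mine; the BASIC version stores this as one multivalued record per token):

```python
from collections import defaultdict

def build_index(documents):
    # documents: doc_id -> ordered token list.
    # Returns token -> {doc_id: [word positions]}.
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in documents.items():
        for position, token in enumerate(tokens, start=1):
            index[token][doc_id].append(position)
    return index
```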

Searching and Ranking

With the indexing done, it's time to do the searching and ranking. This is also relatively straightforward. The first thing we need to do is take the search query and tokenize it the same way we did the data. A mismatch here will cause issues; for example, if you use a 2-gram tokenizer against 3-gram data, you won't actually get any results.

Once the query is tokenized, we simply get all of the documents that are tied to each of the tokens in the query.

This might be a very large list of documents or it could be very small. Regardless, once we have the documents, we need to rank how relevant they are.
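The gathering step is just the union of the postings for each query token. In Python terms (a sketch, assuming an index shaped as token → {document: positions}):

```python
def gather_documents(query_tokens, index):
    # Union of every document containing at least one query token.
    matches = set()
    for token in query_tokens:
        matches.update(index.get(token, {}))
    return matches
```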

The searching code is below:

   SUBROUTINE FTS.SEARCH(QUERY.STRING,MAX.SIZE,RESULT.LIST,SOURCE.FILE,INDEX.FILE,STATS.FILE)
*
* COMPILER DIRECTIVES
*
   $DEFINE DATABASE.QM
   $DEFINE PLATFORM.LINUX
*
   $IFDEF DATABASE.QM
      $CATALOGUE LOCAL
   $ENDIF
*
   READ TOTAL.STATS FROM STATS.FILE,'TOTAL.STATS' ELSE TOTAL.STATS = ''
*
   TOTAL.TOKENS = TOTAL.STATS<1>
   TOTAL.DOCUMENTS = TOTAL.STATS<2>
*
   CALL FTS.TOKENIZE(QUERY.STRING,TOKENS,MAX.SIZE)
*
   NUMBER.OF.SEARCH.TERMS = DCOUNT(TOKENS<1>,@VM)
*
   DOCUMENT.LIST = ''
   FOR I = 1 TO NUMBER.OF.SEARCH.TERMS
      TOKEN = TOKENS<1,I>
*
      READ INDEX.MATRIX FROM INDEX.FILE, TOKEN ELSE INDEX.MATRIX = ''
      DOCUMENT.ID.LIST = INDEX.MATRIX<1>
*
      NUMBER.OF.DOCUMENTS = DCOUNT(DOCUMENT.ID.LIST<1>,@VM)
*
      FOR DOC.CTR = 1 TO NUMBER.OF.DOCUMENTS
         DOC.ID = DOCUMENT.ID.LIST<1,DOC.CTR>
*
         LOCATE(DOC.ID,DOCUMENT.LIST,1;DOC.POS) ELSE
            DOCUMENT.LIST<1,-1> = DOC.ID
         END
      NEXT DOC.CTR
   NEXT I
*
   RESULT.LIST = ''
   NUMBER.OF.DOCUMENTS = DCOUNT(DOCUMENT.LIST<1>,@VM)
*
   FOR DOC.CTR = 1 TO NUMBER.OF.DOCUMENTS
      DOC.ID = DOCUMENT.LIST<1,DOC.CTR>
*
      LOCATE(DOC.ID,RESULT.LIST,1;DOC.POS) ELSE
         SCORE = 0
         GOSUB BM25.SCORE
*
         LOCATE(SCORE,RESULT.LIST,2;SCORE.POS;'DR') ELSE NULL
*
         RESULT.LIST<1> = INSERT(RESULT.LIST<1>,1,SCORE.POS,0,DOC.ID)
         RESULT.LIST<2> = INSERT(RESULT.LIST<2>,1,SCORE.POS,0,SCORE)
      END
   NEXT DOC.CTR
*
   RETURN
*

The code has two loops: the first gathers the documents, while the second does the ranked sorting of the results.

I won't describe the algorithm, as I didn't fully understand it; I only implemented it. A good article by Evan Schwartz, Understanding the BM25 Full Text Search Algorithm, helped me quite a bit.

*********************  S U B R O U T I N E  *********************
*
BM25.SCORE:NULL
*
   K1 = 1.2
   B = 0.75
*
   AVG =  TOTAL.TOKENS / TOTAL.DOCUMENTS
*
   SCORE = 0
*
   READ DOCUMENT.TOKENS.COUNT FROM STATS.FILE, DOC.ID ELSE DOCUMENT.TOKENS.COUNT = 0
*
   D = DOCUMENT.TOKENS.COUNT
*
   FOR TERM.CTR = 1 TO NUMBER.OF.SEARCH.TERMS
      TERM = TOKENS<1,TERM.CTR>
*
      READ TERM.MATRIX FROM INDEX.FILE, TERM ELSE TERM.MATRIX = ''
*
      NUMBER.OF.DOCUMENTS.WITH.TERM = DCOUNT(TERM.MATRIX<1>,@VM)
*
      LOCATE(DOC.ID,TERM.MATRIX<1>,1;DOC.POS) THEN
         OCCURENCES = DCOUNT(TERM.MATRIX<2,DOC.POS>,@SVM)
      END ELSE
         OCCURENCES = 0
      END
*
      IDF = LN( (TOTAL.DOCUMENTS - NUMBER.OF.DOCUMENTS.WITH.TERM + 0.5) / (NUMBER.OF.DOCUMENTS.WITH.TERM + 0.5) + 1)
*
      NUMERATOR = OCCURENCES * (K1 + 1)
      DENOMINATOR = OCCURENCES + K1 * (1 - B + B * D  / AVG)
*
      SCORE = SCORE + IDF * (NUMERATOR / DENOMINATOR)
   NEXT TERM.CTR
*
   RETURN
*
* END OF PROGRAM.
*
   END
*

The algorithm translated quite well to BASIC and for the most part everything just worked.
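For my own future reference, the scoring subroutine maps to this Python sketch. It uses the same k1, b, and IDF formula as the BASIC above; the index and length structures are assumptions of the sketch, not the actual record layout:

```python
import math

def bm25_score(query_tokens, doc_id, index, doc_lengths, k1=1.2, b=0.75):
    # index: token -> {doc_id: [positions]}
    # doc_lengths: doc_id -> token count for that document
    total_docs = len(doc_lengths)
    avg_length = sum(doc_lengths.values()) / total_docs
    d = doc_lengths[doc_id]
    score = 0.0
    for term in query_tokens:
        postings = index.get(term, {})
        docs_with_term = len(postings)
        occurrences = len(postings.get(doc_id, []))
        idf = math.log((total_docs - docs_with_term + 0.5)
                       / (docs_with_term + 0.5) + 1)
        numerator = occurrences * (k1 + 1)
        denominator = occurrences + k1 * (1 - b + b * d / avg_length)
        score += idf * numerator / denominator
    return score
```

A document with more occurrences of a term scores higher than one with fewer, and a document with none contributes nothing.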

The End

Yeah.

This is mostly just documentation for myself should I need to implement this again in the future.

The search still isn't that great, but at least it works and I can find some stuff again. The biggest drawback is that my search results take a bit too long to show up. I can fix this by caching more information in the index. Right now, rendering search results forces a database access and the markdown parser to kick in, which doesn't actually need to happen.