2025-10-22
Notes on Solr

search, java, solr

I want to add real searching to my blog; currently I only do substring matching and it is a bit lacking. I would also like to see the context around a search query so that I can quickly tell if a post is worth reading.

Solr is a Java application that works as a microservice from what I can tell. It spins up its own servers that have access to a search index. You query these servers and they return results as a JSON response. Solr is designed to run multiple nodes on multiple ports, and it looks like it uses a coordination tool called ZooKeeper.

Solr is a bit heavy but this should prove useful when I want to search all my documents rather than just my blog. The biggest worry is that search is a big enough project that it could be its own thing. Hopefully Solr is simple enough that it's worth using for something as small as my blog.

Installation

The first step is to get Solr set up. This is actually straightforward.

Install Java 11 and lsof:

sudo yum install java-11-openjdk-devel
sudo yum install lsof

Then we can download Solr:

wget "https://dlcdn.apache.org/solr/solr/9.9.0/solr-9.9.0.tgz"

Untar the file and cd into the directory:

tar xvf solr-9.9.0.tgz
cd solr-9.9.0/

Now we can make sure we have solr available:


bin/solr --version
Solr version is: 9.9.0

Update the Limits

We need to do one more thing before we can use solr: bump up the number of open files a process can have.

You can check your limit by doing:

ulimit -n

The default is 1024. Solr expects this to be at least 65000.

We need to add the following lines to /etc/security/limits.conf:

...
*   soft    nofile  65000
*   soft    nproc   65000
# End of file

This will bump up the limit.

We then need to restart the machine for these changes to take effect.

The First Tutorial

Now with solr installed and available we can do the first tutorial.

One problem that isn't clear anywhere is that you need to enable the right modules so that PDFs are processed.

Uncomment the following line from bin/solr.in.sh:

SOLR_MODULES=extraction,ltr

Then we can start the tutorial:

bin/solr start -e cloud

This will prompt for the number of nodes we want to use and the ports we want to run the nodes on.

We can accept the defaults here. Once solr has started, we will see the following if we do a netstat:

> netstat -tulpn | grep java
tcp6       0      0 127.0.0.1:6574          :::*                    LISTEN      2094/java           
tcp6       0      0 127.0.0.1:7983          :::*                    LISTEN      1946/java           
tcp6       0      0 127.0.0.1:7574          :::*                    LISTEN      2094/java           
tcp6       0      0 127.0.0.1:8983          :::*                    LISTEN      1946/java           
tcp6       0      0 127.0.0.1:9983          :::*                    LISTEN      1946/java           

There are 5 listening ports but only 2 java processes. I can see the two node ports, and I'm guessing the extra ports (6574, 7983 and 9983) are being used by the nodes for something; possibly one of them is ZooKeeper.

We will then be prompted for our collection name and collection config. For this the tutorial asks us to use techproducts. This is an example that comes with solr.

The remaining prompts can be defaulted and we should then see that solr is up and running:

SolrCloud example running, please visit: http://localhost:8983/solr

I'm doing this on another machine so I'm going to have nginx proxy pass to solr.

server {
    listen 8984;
    
    location / {
        add_header 'Access-Control-Allow-Origin' '*';
        proxy_pass http://localhost:8983;
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

This lets me keep using localhost with solr rather than figuring out how to make solr listen on the other interfaces.

Now I can visit the web page and see the solr admin page.
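
To sanity-check the proxy from my own machine, a plain curl against the proxied port should return Solr's system info. This is just a sketch: solr-host is a placeholder for the box running nginx, and /solr/admin/info/system is a standard Solr admin endpoint.

# solr-host is a placeholder for the machine running nginx in front of solr
curl "http://solr-host:8984/solr/admin/info/system"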

Ingesting Data

The next step is to ingest some data so that we have some real content in solr.

bin/solr post -c techproducts example/exampledocs/*

Hopefully you don't run into any problems, as debugging mine was a bit of a pain in the ass.

This section was actually drastically longer with 3-4 solutions that didn't pan out.
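
One quick way to confirm the ingest worked is to count everything in the collection: q=*:* matches all documents and rows=0 suppresses the documents themselves, so the numFound field in the response is the number of indexed documents.

curl "http://localhost:8983/solr/techproducts/select?q=*:*&rows=0"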

Querying Solr

Once the documents have been indexed, we can then query solr:

curl "http://localhost:8983/solr/techproducts/select?q=foundation"

We can also limit the fields we see:

curl "http://localhost:8983/solr/techproducts/select?q=foundation&fl=id,name"

Restarting Solr

Now that we have solr running and a collection we can query, let's cover shutting solr down and bringing it back up. This is spread across two tutorials but I wish it had been brought up much earlier; it would have helped during the debugging.

bin/solr stop --all

Once it's stopped, we need to bring up each solr node individually. The first node also starts the embedded ZooKeeper (on port 9983) and the second node connects to it.

bin/solr start -c -p 8983 --solr-home example/cloud/node1/solr
bin/solr start -c -p 7574 --solr-home example/cloud/node2/solr -z localhost:9983

Now we can do the above queries again and everything should be working.
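
To double-check that both nodes actually came back, bin/solr status lists the running Solr instances (a standard Solr CLI command):

bin/solr status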

The Second Tutorial

The second tutorial on the Solr website covers schemaless collections. This is interesting as I didn't see the schema for the first tutorial.

Now let's create a new collection:

bin/solr create -c films --shards 2 --replication-factor 2

We then need to update the collection schema with two things. We want the name field to be a single text field, and we want a _text_ catch-all field that holds all the data; this will be the default field for searching.

curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' http://localhost:8983/solr/films/schema

Now for the catch-all field:

curl -X POST -H 'Content-type:application/json' --data-binary '{"add-copy-field" : {"source":"*","dest":"_text_"}}' http://localhost:8983/solr/films/schema

We then index the data:

bin/solr post -c films example/films/films.json

We can look at the schema that was generated by doing:

curl "http://localhost:8983/solr/films/schema"

We can then query our new collection:

curl "http://localhost:8983/solr/films/select?q=comedy&fl=id,name"

We can also use facets; these will need some more study:

curl "http://localhost:8983/solr/films/select?q=\*:*&rows=0&facet=true&facet.field=genre_str"

Facets seem to be a way of bucketing data, in this case by genre. You can also create range facets where you bucket data by some date period; a sketch of that is below.
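
For example, something like this should bucket films by release year, assuming the films data has an initial_release_date date field (facet.range and its start/end/gap parameters are standard Solr options; %2B is a URL-encoded +):

# initial_release_date is assumed to be a date field in the films data
curl "http://localhost:8983/solr/films/select?q=*:*&rows=0&facet=true&facet.range=initial_release_date&facet.range.start=NOW-25YEAR&facet.range.end=NOW&facet.range.gap=%2B1YEAR"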

The key lessons in this tutorial were creating new collections and using facets to look at the data. Facets seem to be similar to filters and groupings.

The Third Tutorial

The third tutorial covers indexing your own data. This is actually quite short and seems a bit redundant. Helpful but redundant.

First create a new collection:

bin/solr create -c localDocs --shards 2 -rf 2

Create the catch-all field:

curl -X POST -H 'Content-type:application/json' --data-binary '{"add-copy-field" : {"source":"*","dest":"_text_"}}' http://localhost:8983/solr/localDocs/schema

Index my blog posts:

bin/solr post -c localDocs ~/BLOG/LOCAL-DOCUMENTS/*

The key thing here is that solr only works with structured data. I had to transform my markdown into json and then import it.

This makes sense, as I do need some structured data: things like the URL of the blog post and the date need to be available to solr.

I exported my blog posts with the following json structure:

{
   "id": "Unique Id",
   "title": "Blog Post",
   "filename": "BlogPost.html",
   "date": "2023-07-05T00:00:00Z",
   "markdown": ""
}
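
A rough sketch of that kind of transform using jq; the path, title, and date handling here are stand-ins rather than how my export actually works:

# A sketch only: path/to/posts is a placeholder, the title is taken from the
# first line of the file, and the date is hard-coded instead of coming from
# real post metadata. Requires jq 1.6+ for --rawfile.
for f in path/to/posts/*.md; do
    jq -n \
        --arg id "$(basename "$f" .md)" \
        --arg title "$(head -n 1 "$f")" \
        --arg filename "$(basename "$f" .md).html" \
        --arg date "2023-07-05T00:00:00Z" \
        --rawfile markdown "$f" \
        '{id: $id, title: $title, filename: $filename, date: $date, markdown: $markdown}'
done | jq -s '.' > localDocs.json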

Now I can do regular searches against my posts:

curl "http://localhost:8983/solr/localDocs/select?q=cheatsheet"

I can also get the context using the highlighting function of solr:

curl "http://localhost:8983/solr/localDocs/select?q=cheatsheet&hl=true*&hl.fl=*"

I can also do fuzzy searching:

curl "http://localhost:8983/solr/localDocs/select?q=sola~1&hl=true*&hl.fl=title%20markdown"

Conclusion

I think I have a working solr instance now and I'm starting to get comfortable with the ideas. I'll need to do some more reading but for the most part it seems like a simple enough process.

Spin up solr, create a collection, create the catch-all field and then add some data.

The big thing seems to be the tuning and actually getting relevant results. I also don't want to go through the interactive prompts to set things up again, so I need to learn the commands it's really running; a first pass at that is below.
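
Pieced together from the commands used above, a non-interactive setup would look something like this (mycollection and the docs path are placeholders, and the --solr-home paths reuse the example/cloud directories that the interactive run created):

# start two cloud nodes; the first also runs the embedded ZooKeeper on 9983
bin/solr start -c -p 8983 --solr-home example/cloud/node1/solr
bin/solr start -c -p 7574 --solr-home example/cloud/node2/solr -z localhost:9983

# create a collection, add the catch-all field, then index some documents
# (mycollection and path/to/docs are placeholders)
bin/solr create -c mycollection --shards 2 -rf 2
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-copy-field": {"source":"*","dest":"_text_"}}' http://localhost:8983/solr/mycollection/schema
bin/solr post -c mycollection path/to/docs/*.json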

The next step would be to hook up my blog to use solr for searches.

The tutorials themselves are useful but I spent quite a bit of time debugging and running into problems. I think this is where the real learning was happening so I'm happy to do it. I wonder if using AI would have been better as it likely could have pointed me in the right direction. However I would miss out on the goose chases and the ancient stackoverflow answers.

I now see that there has always been some sort of documentation problem and it looks like the problem I ran into has cropped up a few times across the years. This isn't something I would see if I got the right answer immediately.

I'll also need to create a cheatsheet of sorts for solr as it would be nice to simplify things down to the essentials. I quite like that solr was simply a download and then running a binary. No docker, no huge list of packages, no weird daemons. I had to install java and then away we go. This was a nice experience.