Playing with htmlq

2025-10-12
web, sysadmin

htmlq is a utility for parsing out parts of an HTML page using selectors. This is powerful because you get full CSS selector logic, including wildcard-style and substring matching on attributes.

This is handy when you want to do some quick web scraping. For example, getting a list of links on a web page is trivial with htmlq.

The example I'll go through here is to get a list of links, titles and votes from Hacker News.

This will cover using specific selectors, removing nodes and wildcard selectors.
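
htmlq is distributed as a Rust crate, so if you don't have it yet, installing it through cargo should do the trick (assuming you already have a Rust toolchain):

cargo install htmlq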

All of the links

The first step is to just get all the links on the web page.

curl https://news.ycombinator.com/ | htmlq a -a href

The -a option specifies the attribute we want to pull. The command above finds every a tag and prints its href.

As you can see in the output below, we are getting everything.

https://news.ycombinator.com
news
newest
front
newcomments
ask
show
jobs
submit
login?goto=news
vote?id=45563359&how=up&goto=news
https://mnolangray.substack.com/p/everything-you-need-to-know-about
from?site=mnolangray.substack.com
user?id=bickfordb
item?id=45563359
hide?id=45563359&goto=news
item?id=45563359
vote?id=45559857&how=up&goto=news
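
To get a sense of just how much noise that is, you can count the links by piping to wc:

curl https://news.ycombinator.com/ | htmlq a -a href | wc -l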

Being more specific

We can be more specific so that we only get the links inside the main area:

curl https://news.ycombinator.com/ | htmlq ".titleline a" -a href

This will give us just the a tags that are inside elements with the titleline class.

For Hacker News, this still gives us extra links, because under each titleline there is also a link to the source website the story came from, and those get picked up since they are a tags too.

We can see that there are various extra links:

https://mnolangray.substack.com/p/everything-you-need-to-know-about
from?site=mnolangray.substack.com
https://github.com/chili-chips-ba/wireguard-fpga
from?site=github.com/chili-chips-ba
item?id=45561428
https://github.com/microsoft/edgeai-for-beginners
from?site=github.com/microsoft
https://underreacted.leaflet.pub/
from?site=leaflet.pub
https://xenodium.com/introducing-agent-shell
from?site=xenodium.com
https://wip.tf/posts/telefonefix-building-babys-first-international-landline/
from?site=wip.tf

Removing nodes

We can use htmlq to remove parts of the web page:

curl https://news.ycombinator.com/ | htmlq ".titleline a" -a href -r .sitebit

Here we remove the elements with the sitebit class. Then we are left with just the links we care about.

https://mnolangray.substack.com/p/everything-you-need-to-know-about
https://github.com/chili-chips-ba/wireguard-fpga
item?id=45561428
https://github.com/microsoft/edgeai-for-beginners
https://underreacted.leaflet.pub/
https://xenodium.com/introducing-agent-shell
https://wip.tf/posts/telefonefix-building-babys-first-international-landline/
https://buttondown.com/hillelwayne/archive/three-ways-formally-verified-code-can-go-wrong-in/
https://www.thisiscolossal.com/2025/09/2025-bird-photographer-of-the-year-contest/
https://3dpaws.comet.ucar.edu
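
As an aside, because htmlq takes standard CSS selectors, a child combinator should give a similar result without -r, assuming the source link stays nested one level deeper inside titleline than the title link itself:

curl https://news.ycombinator.com/ | htmlq '.titleline > a' -a href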

Getting Text

Now that we can get links, we can also get the text content of the a tags:

curl https://news.ycombinator.com/ | htmlq ".titleline a" -t -r .sitebit

We've removed the -a option and replaced it with -t. This gives us the text content of the elements matched by the selector.

Everything You Need to Know About [California] SB 79
Wireguard FPGA
Ask HN: What are you working on? (October 2025)
Edge AI for Beginners
My first week of vibecoding
Emacs agent-shell (powered by ACP)
Show HN: Baby's First International Landline
Three ways formally verified code can go wrong in practice
Bird Photographer of the Year Gives a Lesson in Planning and Patience
3D-Printed Automatic Weather Station
Macro Splats 2025
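
Since both htmlq runs walk the page in document order, the titles and the hrefs line up row for row, so you can join them with paste. A rough sketch, assuming bash for the process substitution (page is just an illustrative variable name):

page=$(curl https://news.ycombinator.com/)
paste <(echo "$page" | htmlq ".titleline a" -t -r .sitebit) \
      <(echo "$page" | htmlq ".titleline a" -a href -r .sitebit)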

Getting Scores

On Hacker News, the score for each post is inside a span whose id starts with score.

We can use htmlq to find the scores by doing:

curl https://news.ycombinator.com/ | htmlq '[id*="score"]' -t

This will find any element that has score anywhere in its id. The -t gives us the text inside each element.
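
Hacker News also happens to give those spans a score class at the time of writing, so if you'd rather match on the class instead of the id, something like this should work too:

curl https://news.ycombinator.com/ | htmlq '.score' -t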

We can also say that we want all the ids that start with score rather than just containing it:

curl https://news.ycombinator.com/ | htmlq '[id^="score"]' -t

This will give us:

30 points
335 points
103 points
106 points
103 points
8 points
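
Since each line starts with the number, you can also rank the front page by piping the output through sort; for example, to see the five highest scores:

curl https://news.ycombinator.com/ | htmlq '[id^="score"]' -t | sort -rn | head -5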

Voila! With that we were able to quickly get the links, titles and votes directly from the command line. This could then be composed with other Linux utilities to write quick web scrapers.