I've recently started reading The Magnus Archives instead of listening to it and realized that I much prefer this. Another podcast that I had queued up was the Trojan War Podcast. I looked for transcripts online but didn't find anything so I put together a little routine to do my own transcription.
This isn't a program; for now it's just a set of steps. It relies on AI and LLM tooling, so I'm a bit suspicious that not everything is truly accurate and that I could be inadvertently mucking things up.
The first step is to get the RSS feed so I can get a list of mp3 files.
wget https://trojanwarpodcast.com/feed/podcast/
I then grabbed the links and saved them to a file:
grep enclosure index.html | cut -d\" -f2 > files.txt
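The grep and cut pipeline works as long as the url is the first quoted attribute on each enclosure line. If that ever stops being true, a more defensive option is to parse the feed as XML. Here is a minimal Python sketch using only the standard library; the function name enclosure_urls is made up for illustration, and it assumes the feed is well-formed RSS.

```python
import xml.etree.ElementTree as ET


def enclosure_urls(feed_xml):
    """Return the url attribute of every <enclosure> element in an RSS feed.

    feed_xml is the raw feed content (bytes or str). The function name is
    hypothetical; it is not part of any library.
    """
    root = ET.fromstring(feed_xml)
    # iter() walks the whole tree, so nesting under <channel>/<item> is handled.
    return [
        enclosure.attrib["url"]
        for enclosure in root.iter("enclosure")
        if "url" in enclosure.attrib
    ]
```

To use it, read index.html in binary mode, pass the bytes to enclosure_urls, and write the returned URLs one per line to files.txt so that the wget step stays the same.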
The next step was to download the episodes:
wget -i files.txt
Now we can use Whisper to get the raw transcripts. I'm using uv to manage my Python projects:
uv run whisper audio/* --model turbo --output_format txt --output_dir txt/
This command expects the mp3s to be in an audio directory (wget saves into the current directory by default, so move them there first), and the output will end up in the txt directory.
At this point the transcript is just raw text: there is no formatting and there are no paragraph breaks.
The final step is to use Claude to clean up the transcript so that we can actually use it.
To do this, I have Claude Code on my machine, so I simply run claude and then prompt it to clean a file in the txt directory and save the cleaned version to another directory.
> Format and clean up txt/EPISODE.txt This is a transcript. Place the cleaned up file in the clean directory.
I'm a bit wary about trusting that Claude isn't somehow changing the content while cleaning it up. I would need to listen to the actual podcast to really confirm it. It's also probably possible to compute a score of how close the cleaned version is to the raw Whisper output. Comparing the text against the first episode, which I had already listened to, it looks fine, but it still gives me pause.
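One way to get such a score is to compare the two files word by word while ignoring case, punctuation, and line breaks, since those are the things the cleanup is supposed to change. The sketch below uses difflib from the Python standard library; the function names are made up for illustration, and the ratio it reports is only a rough signal, not proof that nothing was altered.

```python
import difflib
import re


def words(text):
    """Lowercase the text and keep only word tokens, dropping punctuation and layout."""
    return re.findall(r"[\w']+", text.lower())


def similarity(raw, cleaned):
    """Return a 0.0-1.0 ratio of how closely the word sequences match."""
    matcher = difflib.SequenceMatcher(None, words(raw), words(cleaned), autojunk=False)
    return matcher.ratio()


def word_changes(raw, cleaned):
    """List every inserted, deleted, or replaced run of words as (tag, raw_words, cleaned_words)."""
    a, b = words(raw), words(cleaned)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

A ratio of 1.0 means the cleaned file contains exactly the same words in the same order. Anything lower is worth inspecting with word_changes, which shows the specific words that were added, dropped, or swapped. Comparing a whole episode this way can be slow, because SequenceMatcher is quadratic in the worst case.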
I would really like it if this stuff were a bit more deterministic. I would have felt much more confident if a person had gone through and rearranged some things, fixed up some words, and capitalized the proper nouns.
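At least the paragraph-break part of the cleanup could be done deterministically. The following is a rough Python sketch, not what I actually ran: it joins the raw lines, splits on sentence-ending punctuation, and groups a fixed number of sentences per paragraph. The function name and the default of five sentences are arbitrary assumptions, and it does nothing about misheard words or capitalization.

```python
import re


def paragraphs(raw, sentences_per_paragraph=5):
    """Regroup raw transcript text into paragraphs without changing any words."""
    # Collapse newlines and repeated spaces into single spaces.
    text = " ".join(raw.split())
    # Split after ., ? or ! when followed by whitespace.
    # Abbreviations such as "Mr." will be split in the wrong place.
    sentences = re.split(r"(?<=[.?!])\s+", text)
    groups = [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]
    return "\n\n".join(groups)
```

Because this only moves whitespace around, the output always contains exactly the same words as the input, which is the property I can't easily guarantee with an LLM.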