A blogging workflow based on transcribing audio notes with Whisper

The problem

If you care about your privacy, cloud-based speech-to-text services are probably not a good idea. But how can you still get the convenience of quickly recording a (blog post) idea on your (Android) smartphone and having it transcribed into a (Markdown) file?

The solution

  1. Android’s Sound Recording app (in high quality mode to create .wav files).
  2. Syncthing, to get the recordings from the smartphone directly into the ~/blog/content/posts/ folder.
  3. Georgi Gerganov’s whisper.cpp repo.
  4. A bit of Bash-scripting, see below.

I had no previous experience with AI/LLM tooling, but having read Google’s “We Have No Moat” memo, I was pleasantly surprised by how easy it was to implement this workflow idea.

The first result was this previous blog post (in German). I didn’t “go meta” and also drafted this post with the described workflow.

Script setup

Admittedly, the following is not awesome, but it was a nice afternoon project on a rainy weekend day. The whole thing is executed in the

#!/bin/bash

file="$1"
slug="$2"

# https://github.com/ggerganov/whisper.cpp/
tool="$HOME/GitHub.com/whisper.cpp"
size="${3:-small}"
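
The script trusts its arguments as given. A minimal sketch of a guard that could run before anything else, assuming the same positional arguments as above; check_args and its exit codes are hypothetical and not part of the original script:

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): validate the positional
# arguments before any file is touched.
check_args() {
  if [ "$#" -lt 2 ]; then
    echo "usage: $0 <audio-file> <slug> [model-size]" >&2
    return 64   # assumed exit code, borrowed from the sysexits convention
  fi
  if [ ! -r "$1" ]; then
    echo "cannot read input file: $1" >&2
    return 66
  fi
  return 0
}

# Usage inside the script: abort early when the check fails.
# check_args "$@" || exit $?
```

The third argument stays optional because the script already falls back to small when it is missing.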

Audio preparation

Next, we convert the input file to the format whisper.cpp requires (16 kHz, mono, 16-bit PCM WAV),

  • overwriting any existing file with ffmpeg -y, and
  • suppressing any non-essential output with -v error:

temp="$slug.wav"
ffmpeg -y  \
  -v error  \
  -i "$file" \
  -ar 16000 -ac 1 -c:a pcm_s16le \
  "$temp"

# Yes, I like to align things ☺️
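
To confirm that the conversion produced what Whisper expects, the WAV header can be inspected without extra tools. The following is only a sketch: wav_field and is_whisper_ready are hypothetical helpers, and they assume a little-endian machine and the canonical header layout in which the "fmt " chunk directly follows "WAVE".

```shell
#!/bin/bash
# Read an unsigned integer of $3 bytes at byte offset $2 of file $1.
# Assumes a little-endian host, since od decodes in host byte order.
wav_field() {
  od -An -t "u$3" -j "$2" -N "$3" "$1" | tr -d ' \n'
}

# Succeeds only if the header describes 1 channel, 16000 Hz, 16 bits per sample.
# Assumes the canonical layout where the "fmt " chunk directly follows "WAVE".
is_whisper_ready() {
  [ "$(wav_field "$1" 22 2)" = "1" ] &&
    [ "$(wav_field "$1" 24 4)" = "16000" ] &&
    [ "$(wav_field "$1" 34 2)" = "16" ]
}

# Usage after the ffmpeg call:
# is_whisper_ready "$temp" || { echo "unexpected WAV format: $temp" >&2; exit 1; }
```

The offsets 22, 24 and 34 are the channel count, sample rate and bits-per-sample fields of a standard PCM header.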

Transcription with Whisper

This temp file is now processed into a .txt file, using the model size defined above:

"$tool/main" \
  --model "$tool/models/ggml-$size.bin" \
  --threads 8  \
  --output-txt  \
  --print-colors \
  --no-timestamps \
  --language auto  \
  "$temp"

The transcription progress and quality can be observed via the confidence-colored preview (--print-colors). In the few tests I ran, small was good enough. medium detected only a few more words correctly, so its 3x higher memory usage does not seem worth it for this use case of drafting a blog post.
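
Since the model size is just a file name suffix, a missing model file only shows up when whisper.cpp fails to load it. A small sketch of an explicit check, assuming the directory layout used above; model_path is a hypothetical helper, and the download hint refers to the helper script that ships in the models folder of the whisper.cpp repo:

```shell
#!/bin/bash
# Print the path of the ggml model for a given size, or fail with a hint.
# $1: whisper.cpp checkout, $2: model size (defaults to small, as in the script)
model_path() {
  local tool="$1" size="${2:-small}"
  local model="$tool/models/ggml-$size.bin"
  if [ ! -f "$model" ]; then
    echo "model not found: $model" >&2
    echo "hint: bash $tool/models/download-ggml-model.sh $size" >&2
    return 1
  fi
  printf '%s\n' "$model"
}

# Usage: model="$(model_path "$tool" "$size")" || exit 1
```

This also makes it cheap to try medium once and fall back to small if the model was never downloaded.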

Converting the transcript into a Hugo blog post draft

For convenience and Hugo-compatibility, the script also prepends metadata to the blog post’s .md file:

date="$(date -u +%Y-%m-%d)"
blog="$date-$slug.md"

cat >"$blog" <<HEREDOC
---
title: $(head -1 "$temp".txt)
date: "$date"
draft: true
---

$(cat "$temp".txt)
HEREDOC
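
One caveat with the unquoted title: if the first transcribed line contains a colon followed by a space, the YAML front matter may no longer parse. A hedged sketch of a variant that quotes and escapes the title; make_draft is a hypothetical function, not what the script above does, and it also strips leading whitespace while it is at it:

```shell
#!/bin/bash
# Print a Hugo draft for a transcript file.
# $1: transcript (.txt) file, $2: date string for the front matter
make_draft() {
  local transcript="$1" date="$2" title
  # First line as title: strip leading whitespace, then escape backslashes
  # and double quotes so the value is a valid double-quoted YAML string.
  title="$(head -n 1 "$transcript" |
    sed -e 's/^[[:space:]]*//' -e 's/\\/\\\\/g' -e 's/"/\\"/g')"
  printf '%s\n' '---' "title: \"$title\"" "date: \"$date\"" 'draft: true' '---' ''
  # Body: the whole transcript with leading whitespace removed from each line.
  sed 's/^[[:space:]]*//' "$transcript"
}

# Usage: make_draft "$temp.txt" "$date" > "$blog"
```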

Cleanup

For some reason, all transcribed lines are prefixed with whitespace, so we just strip that with sd. We also remove the temp and input files, so that my Android Sound Recorder doesn’t fill up with old cruft.

sd '^ ' '' "$blog"
rm "$temp"* "$file"
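
Deleting the recording is irreversible, so it may be worth doing only after the draft was actually written. A minimal sketch under that assumption; cleanup_if_ok is a hypothetical helper and not part of the original script:

```shell
#!/bin/bash
# Remove temp and input files only if the draft exists and is non-empty.
# $1: draft file, remaining arguments: files to delete.
cleanup_if_ok() {
  local blog="$1"
  shift
  if [ ! -s "$blog" ]; then
    echo "draft missing or empty, keeping: $*" >&2
    return 1
  fi
  rm -f -- "$@"
}

# Usage, mirroring the rm line above:
# cleanup_if_ok "$blog" "$temp" "$temp.txt" "$file"
```

If the transcription step failed, the original recording then survives for another attempt.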

Bonus: Sync with benefits

Because Syncthing copies the blog post files back to my Android, I can edit them when inspiration strikes. The blog’s .gitignore just needs a content/post/.st* rule, and Syncthing needs an img/ ignore rule to avoid cluttering Android’s Sound Recorder folder with blog post images.