ScribeSwift

Why it exists

The Retrieval Gap

Voice capture is fast. Speaking runs at 130–150 words per minute; fast typing reaches 80–90. More importantly, speaking does not require a keyboard, a flat surface, or eyes-on-screen. The best thoughts often arrive mid-run, almost asleep, between meetings — at exactly the moments when writing is not an option.

Native voice memo apps solve the capture step and fail at everything else. A list of recordings ordered by timestamp, titled by time of day, with no transcription and no search. Hundreds accumulate. None are findable. The only way to retrieve a specific thought is to remember roughly when it was recorded and scrub through audio.

AI transcription services solve retrieval but introduce a privacy trade-off: audio processed on third-party servers, billed per minute, subject to changing terms, dependent on external uptime. The user's voice — and everything said in it — transits infrastructure they do not control.

There is no version of "your voice stays private" that is compatible with a service that processes it remotely. Privacy in those products is a policy. In ScribeSwift it is an architecture.

ScribeSwift closes the retrieval gap without opening a privacy gap.

A third problem: most capture tools require decisions before anything is saved — what folder, what project, what type of note. Those decisions demand exactly the mental bandwidth that is in shortest supply when the thought arrives. ScribeSwift makes no such demand. Tap Record. Speak. Tap Stop. Classification, transcription, and retrieval happen after the thought is safely stored.

The Principles Behind It

Voice-first over text

Voice is not a fallback for when typing is inconvenient. It is the correct medium for thoughts that arrive during motion. The record button starts immediately — no loading screen, no modal asking you to categorise the thought before you have finished having it. Every second of friction between "I have a thought" and "it is safely saved" is a cost paid in lost ideas.

Privacy by architecture

Faster-Whisper and sentence-transformers are downloaded once and run on the ScribeSwift server itself. There is no API key, no external call, no third party anywhere in the data path. Privacy is not a setting or a policy; it is a structural constraint fixed before any other decision. Once fixed, it prevents a whole class of shortcuts: "just use the Whisper API for better accuracy" is structurally unavailable, not merely discouraged.

Local-first as durability

IndexedDB in the browser is the primary store. The server is a processing backend. Recording, playback, and keyword search all work without a network connection. If the server is down, nothing captured locally is lost. The app was designed to survive the server disappearing — capture never waits on infrastructure. This is the correct design for a tool you rely on to preserve things that matter.

Simplicity as maintenance

Vanilla ES modules. No framework. No build step. No database — JSON files on disk. The source code is the deployed code. A deploy is an rsync. A backup is a tar.gz. Every abstraction not added is maintenance debt not accumulated. A personal tool that requires regular toolchain maintenance will eventually not be maintained, and will stop being useful.

Founder observation

Why I Started Building It

This did not start as a transcription tool. The original question was simpler and harder to act on: what happens to all the thoughts that never make it into a notebook?

Not because there was no time to write them down. Because writing requires a kind of decision that is often not available in the moment — what to write, how to structure it, where to put it. Voice memos solve the capture problem. They do not solve retrieval, and they do not solve understanding. A year of voice recordings is not the same as understanding what you were thinking about that year. The recordings exist. The meaning — the threads, the patterns, the decisions made and avoided — has to be found separately.

The deeper interest is not note-taking. It is whether an honest, private archive of a person's own spoken thoughts — accumulated over time — can eventually reveal patterns that are difficult to see from inside your own head. Not because an AI interprets them. Because they are retrievable at all, in a form that lets the person do their own reading.

That requires a system that preserves voice accurately, keeps it private by architecture, and retrieves by meaning rather than by date or keyword.

ScribeSwift is the first practical step toward that. Today it captures, transcribes, embeds, and retrieves. The longer question is whether years of your own words can preserve enough context to help you understand yourself more accurately than memory alone.

How It Works

Every recording moves through six stages. The first is offline and instant. The remaining five require the server but are invisible to the user: they happen on sync, in the background, without blocking capture.

Record

The browser's MediaRecorder API captures audio as a WebM blob. The recording saves immediately to IndexedDB with status "pending" — safe the moment you tap Stop, before any network call, before any transcription. Works fully offline on the installed PWA.

Upload

When online, pending recordings upload via POST /api/upload — a FormData payload containing the audio blob and browser-generated metadata: UUID, filename, and timing data. Apache proxies the request to FastAPI on localhost:8000, which is never publicly exposed.

Transcribe

Faster-Whisper (base.en, CPU int8) transcribes the audio locally — no API key, no data leaving the server. The model lazy-loads on first call and caches in module scope; startup pays no load cost. A deterministic post-processor corrects whitespace, punctuation spacing, and sentence capitalisation. 48 unit tests cover the post-processing rules.
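A deterministic post-processor of this kind can be sketched in a few regex passes. This is a minimal illustration, not the project's actual rule set: the function name clean_transcript and the specific rules are assumptions, chosen only to show whitespace, punctuation-spacing, and capitalisation fixes that always give the same output for the same input.

```python
import re


def clean_transcript(text: str) -> str:
    """Hypothetical clean-up pass: whitespace, punctuation spacing, capitalisation."""
    # Collapse every run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove spaces that sit before punctuation: "world ." becomes "world."
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Add a space after clause punctuation when a letter follows directly.
    # Periods are skipped on purpose so "e.g." and "example.com" stay intact.
    text = re.sub(r"([,;:!?])(?=[A-Za-z])", r"\1 ", text)
    # Capitalise the first letter of the text and of each new sentence.
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text
```

Because every step is a pure string substitution, running the function twice gives the same result as running it once, which is what makes rules like these easy to cover with unit tests.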

Embed

all-MiniLM-L6-v2 encodes the cleaned transcript into a 384-dimensional unit vector — a semantic representation of the recording's meaning. Embeddings are L2-normalised at encode time, so cosine similarity reduces to a dot product. No vector database required at personal scale.
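The claim that cosine similarity reduces to a dot product for unit vectors can be checked with plain arithmetic. The sketch below uses two-dimensional lists in place of real 384-dimensional embeddings, and the helper names are illustrative rather than taken from the codebase.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Full cosine similarity, with the norms computed explicitly."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
unit_a, unit_b = l2_normalise(a), l2_normalise(b)

# For unit vectors both norms are 1, so the division in cosine() disappears.
similarity = dot(unit_a, unit_b)
```

Paying the normalisation cost once at encode time means every later comparison is a single pass of multiply-and-add, which is why a linear scan stays cheap at personal scale.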

Analyse

A pure-regex intelligence layer extracts a title (first sentence, opener-stripped, word-trimmed), summary, tags, memory type across seven categories, action items, and priority. No LLM. No inference cost. The same transcript always produces the same result. 41 unit tests cover title generation logic.
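The title rule (first sentence, opener-stripped, word-trimmed) might look roughly like the sketch below. The opener list, the eight-word limit, and the name make_title are all assumptions made for illustration; the text does not give the real rules.

```python
import re

# Hypothetical list of spoken openers to strip from the start of a title.
OPENERS = re.compile(
    r"^(?:(?:ok(?:ay)?|so|um+|uh+|well|right|note to self)[,.:]?\s+)+",
    re.IGNORECASE,
)
MAX_TITLE_WORDS = 8  # assumed limit, not taken from the source


def make_title(transcript: str) -> str:
    """Derive a title from the first sentence; empty string if nothing usable."""
    # Keep everything up to the first sentence-ending punctuation mark.
    first = re.split(r"(?<=[.!?])\s+", transcript.strip(), maxsplit=1)[0]
    first = first.rstrip(".!?")
    # Strip one or more leading filler openers ("Okay so, ...").
    first = OPENERS.sub("", first)
    words = first.split()
    if not words:
        return ""  # degrade to an empty field rather than guess
    title = " ".join(words[:MAX_TITLE_WORDS])
    return title[0].upper() + title[1:]
```

With no model in the loop, the same transcript always yields the same title, and a transcript that matches nothing degrades to an empty field.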

Retrieve by meaning

A search description is embedded at request time and compared against all stored vectors via cosine similarity — threshold 0.25. Related memories surface automatically in the detail view using the recording's stored embedding, with no new inference required. Keyword search runs client-side against IndexedDB and works offline.
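A thresholded search over unit vectors fits in a dozen lines. In this sketch the store is a plain dict of recording ID to embedding, the function name search is hypothetical, and treating 0.25 as an inclusive minimum is an assumption. Passing a recording's own stored vector as the query, with its ID excluded, gives the related-memories case.

```python
def search(query_vec, store, threshold=0.25, limit=10, exclude_id=None):
    """Rank stored unit vectors against a unit query vector.

    store maps recording ID -> L2-normalised embedding (an assumed layout).
    Returns (recording_id, score) pairs, best match first.
    """
    scored = []
    for rec_id, vec in store.items():
        if rec_id == exclude_id:
            continue  # skip the recording being viewed in the detail view
        # Dot product equals cosine similarity because both sides are unit length.
        score = sum(q * v for q, v in zip(query_vec, vec))
        if score >= threshold:
            scored.append((rec_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


store = {
    "a": [1.0, 0.0],
    "b": [0.6, 0.8],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
hits = search([1.0, 0.0], store)
related = search(store["a"], store, exclude_id="a")
```

The scan is linear in the number of recordings, which is the trade the text describes: no vector database, at the cost of touching every stored embedding per query.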

Architecture

Three layers: browser, Apache, FastAPI. No microservices, no message queue, no container orchestration. A single DigitalOcean Droplet running systemd, Apache, and uvicorn is everything the production system needs.

The browser runs a Vanilla JS PWA with no framework. A service worker (v44, cache-first) provides offline capability for the full app shell. IndexedDB holds all recordings — audio blobs, transcripts, and intelligence fields. The browser is the primary store; the server is secondary. If the server disappears, nothing captured locally is lost.

Apache handles SSL termination via Let's Encrypt and proxies /api/* to FastAPI on localhost:8000. The backend is never publicly exposed — the only path to it is through the Apache proxy. Authentication moved from Apache Basic Auth to FastAPI JWT in Phase 10A. Login sets a 30-day httpOnly cookie; Apache handles no auth logic of its own.

User data lives at SCRIBESWIFT_DATA_DIR/<user_uuid>/, outside the code tree. The deploy script runs rsync --archive --delete --exclude data/ — the exclusion is in the command, not in instructions to be followed carefully. Per-user isolation (Phase 10B) is implemented via UUID subdirectories, with path containment enforced in paths.py.
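Path containment of the kind attributed to paths.py can be sketched with pathlib. The function names and the example data directory below are hypothetical; the idea is only that every path is resolved first and then checked against the user's own root, so neither a crafted UUID nor a crafted relative path can climb out of it.

```python
import uuid
from pathlib import Path


def user_root(data_dir: str, user_uuid: str) -> Path:
    """Resolve the per-user directory; raises ValueError for a non-UUID name."""
    canonical = str(uuid.UUID(user_uuid))  # rejects values such as "../etc"
    return (Path(data_dir) / canonical).resolve()


def contained_path(data_dir: str, user_uuid: str, relative: str) -> Path:
    """Return a resolved path guaranteed to sit inside the user's directory."""
    root = user_root(data_dir, user_uuid)
    candidate = (root / relative).resolve()
    # After resolution, ".." segments and absolute paths show their real target.
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"{relative!r} escapes the user directory")
    return candidate
```

Checking after resolve() rather than scanning the string for ".." matters: an absolute path passed as the relative part contains no ".." at all, and only the resolved comparison rejects it.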

Layer	Technology	Why
Frontend	Vanilla JS ES modules	No build step; source code is deployed code; no framework upgrade cycle
PWA	Service Worker + IndexedDB	Offline-first; recording works without network; local store is primary
Transcription	Faster-Whisper base.en (CPU int8)	Local; ~150 MB; ~2s per 30s recording on CPU; no external dependency
Embeddings	all-MiniLM-L6-v2 (384-dim)	80 MB; 10ms CPU inference; L2-normalised output; excellent semantic quality
Intelligence	Pure regex (intelligence.py)	Deterministic, instant, auditable; zero inference cost per upload
Backend	FastAPI + Uvicorn	Clean route definitions; async-capable; dependency injection for auth
Storage	JSON files on disk	Inspectable with a text editor; backup is tar.gz; no external service
Web server	Apache2 (mod_proxy)	SSL termination; dumb reverse proxy; proven operational stability
Auth	FastAPI JWT (httpOnly cookie)	In-app login; 30-day session; bcrypt passwords; no browser credential dialogs
Infrastructure	DigitalOcean (Ubuntu 22.04)	Low cost; single-machine simplicity; straightforward systemd management

Engineering decisions

Key Engineering Decisions

IndexedDB name never renamed

The database identifier voiceinbox-db was preserved through the ScribeSwift branding rename. Changing it would open a new, empty database in the browser, silently disconnecting the user from months of recordings with no error message and no recovery path that does not require DevTools. For a personal memory tool, that failure mode is unacceptable. The identifier stays as voiceinbox-db permanently, regardless of any future branding change.

Circular import resolved via callbacks

sync.js needs to re-render the timeline after a successful upload. timeline.js needs to trigger upload operations from retry buttons on recording cards. A direct import in either direction creates a cycle. The solution: recorder.js, the entry point, imports from both and passes callbacks into timeline.js via initTimelineEvents(). Timeline calls the callbacks; it never imports sync. The entry point is the only module aware of both sides.

Rule-based intelligence over LLMs

Title extraction, tag classification, action detection, and priority scoring use pure regex because they run on every upload. A failed LLM call degrades to silence; a failed regex match degrades to unknown type and empty fields, which is the correct default. For single-speaker English voice memos, regex covers the 90% case adequately. The remaining 10% degrades gracefully. Zero inference cost, deterministic output, auditable rules.

Lazy model loading with module-scope cache

Faster-Whisper and sentence-transformers load on first request and cache in Python module scope, so startup pays no model-load cost. The first transcription after a restart pays an extra ~2s for model load; every subsequent transcription skips the load and pays only inference time.

Data outside the code tree

SCRIBESWIFT_DATA_DIR points to a directory outside the repository. The deploy script uses rsync --archive --delete --exclude data/; the exclusion is in the command, not in instructions to be followed carefully. No deploy, however careless, can reach user recordings. This is a structural guarantee, not an operational discipline.

Service worker update flow without skipWaiting

New service workers enter a waiting state on install; skipWaiting() is not called automatically. The app detects the waiting SW and shows an "Update ready" banner. The user taps Reload; the page sends SKIP_WAITING; the new SW activates; the page reloads. Activating mid-session without user consent could interrupt an active recording. The controlled flow prevents this at the cost of one deliberate user action.

Current Status

Active Development

Phases 1–10B complete · v0.4.0 · scribeswift.com

Full recording pipeline operational: capture, upload, transcription, embedding, rule-based intelligence, semantic search, and related-memory retrieval. JWT authentication and per-user data isolation complete (Phase 10B). 89 unit tests covering transcript post-processing and title generation. Full operational documentation for backup, restore, deployment validation, and account management.

What is running

Recording and transcription pipeline	Offline PWA capture, server-side Faster-Whisper transcription, service worker v44, deterministic post-processing
Semantic search and related memories	MiniLM embeddings stored per recording; cosine similarity search at threshold 0.25; automatic related-memory retrieval in the detail view
Multi-user data isolation	Per-user UUID directories, JWT HS256 auth (bcrypt passwords), admin scripts for account creation and removal; server migration pending
Operational tooling	Backup/restore scripts, 5-check healthcheck every 15 minutes, smoke tests, transcription model benchmarking scripts

Known limitations

Synchronous transcription	Upload blocks while Whisper runs; recordings over ~3 minutes risk a 502 timeout from Apache. Async job queue is designed and documented; not yet built.
iOS Safari audio recording	Not fully validated on iOS 17+ Safari; Android Chrome is the tested and reliable target for recording.

Next milestone

Phase 11 complete — first external tester using the production server

Lessons Learned

Privacy as constraint eliminates a class of shortcuts

Making privacy structural — not a feature, not a setting — prevents the entire category of decisions that would erode it. Constraints that cannot be traded away are more valuable than flexibility. They force better solutions within them. The intelligence layer is rule-based partly because an LLM API was structurally unavailable. The result is a more honest, more auditable, more reliable system.

Adequate is sufficient when determinism matters

The regex intelligence layer handles 90% of voice memo categorisation adequately. A failed regex match degrades to "unknown" type and empty fields, which is the correct default. Choosing LLM inference for this task would have added latency, cost, and hallucination risk for a marginal quality improvement, on a use case where an adequate, deterministic, instant result is a better outcome than one that is occasionally better but unpredictably wrong.

Reliability belongs in the architecture

A voice memory tool that loses data has failed. The reliability choices are structural: IndexedDB is primary, server is secondary; uploads are idempotent; a sync lock prevents concurrent-tab race conditions; data lives outside the code tree; audio playback has a three-stage fallback (local blob → server stream → notice). Reliability that depends on careful operation eventually fails when operation is not careful.
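One way to make uploads idempotent is to key them on the browser-generated UUID, so a retry after a lost response cannot create a second copy. The sketch below uses an in-memory dict where the real system writes files, and the class and field names are hypothetical.

```python
class UploadStore:
    """Toy store keyed by the browser-generated recording UUID."""

    def __init__(self):
        self._records = {}

    def save(self, recording_id: str, audio: bytes) -> dict:
        """Store audio once; a repeat upload returns the existing record."""
        existing = self._records.get(recording_id)
        if existing is not None:
            # Same ID seen before: report success without writing again.
            return {"id": recording_id, "created": False, "size": len(existing)}
        self._records[recording_id] = audio
        return {"id": recording_id, "created": True, "size": len(audio)}

    def count(self) -> int:
        return len(self._records)


store = UploadStore()
first = store.save("rec-001", b"webm-bytes")
retry = store.save("rec-001", b"webm-bytes")  # e.g. the first response was lost
```

Because the client generates the ID before the first attempt, it can retry as often as it needs without coordinating with the server.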

Explicit is more maintainable than magic

The circular import solution — callbacks passed through the entry point — is more lines than a framework's event system would require. It is also more readable, more traceable, and more maintainable. No-framework means the deployed code is the source code, readable three years from now without touching a toolchain. Explicit choices compound into maintainable systems.

Where it goes next

Future Roadmap

Near term — observation before implementation (Phase 12). Phase 12 does not begin until the first external tester has been using the production system for two to four weeks. Async transcription — returning a job ID immediately and polling for completion — eliminates the synchronous timeout risk on long recordings without any visible change to the upload experience. The design is complete and documented; it waits on a real user signal before being built.

Medium term — settings and portability. A transcription settings panel surfaces model selection (base.en through medium.en) without requiring server configuration file edits. The WHISPER_MODEL environment variable already makes this a zero-code-change switch; the settings panel exposes it to the operator. Date range filtering and export to Markdown address the retrieval and portability needs that accumulate in any archive used over years.

Long term — WASM transcription and local AI worker. whisper.cpp compiles to WebAssembly. A browser-based transcription worker would enable fully offline transcription — no server needed — using a model cached in the browser after first download. The local AI worker architecture moves heavier analysis (7B LLM enrichment, cross-memory pattern detection, weekly digests) to the user's home server or desktop, keeping the VPS lean. The metadata JSON schema already supports arbitrary extra fields — adding localAnalysis fields is backward-compatible.

Long term — cognitive continuity. A personal archive becomes more valuable as it grows. Years of recordings create context that cannot be reconstructed from memory alone. Future development may explore ways of surfacing recurring themes, decisions, questions, and patterns across long periods of time while preserving the system's privacy-first architecture.

ScribeSwift is not finished.

It is running, being used, and being improved in response to real use — not in response to plans.

The engineering decisions — no framework, no database, local AI, offline-first — are not choices made for novelty. They are the consequence of working backward from a fixed constraint: the system must remain private, reliable, and operable over years without requiring constant attention.

That constraint produces a simpler system than the alternative. Simpler is more likely to still be running in three years.
