Production Project
A self-hosted personal voice memory system. Speak a thought, tap Stop. Your server transcribes it locally using Faster-Whisper, embeds it with sentence-transformers, and makes it retrievable by meaning — so a recording from two months ago can be found by describing what you meant, not by remembering the exact words you used.
No cloud API. No third-party service in the data path. Every component — transcription, embedding, intelligence — runs on hardware you control. Production-deployed at scribeswift.com. v0.4.0, eleven phases complete, first external tester onboarding in progress.
ScribeSwift is built on one conviction: the thought you speak is worth keeping as precisely as the thought you had.
Why it exists
Voice capture is fast. Speaking runs at 130–150 words per minute; fast typing reaches 80–90. More importantly, speaking does not require a keyboard, a flat surface, or eyes-on-screen. The best thoughts often arrive mid-run, almost asleep, between meetings — at exactly the moments when writing is not an option.
Native voice memo apps solve the capture step and fail at everything else. A list of recordings ordered by timestamp, titled by time of day, with no transcription and no search. Hundreds accumulate. None are findable. The only way to retrieve a specific thought is to remember roughly when it was recorded and scrub through audio.
AI transcription services solve retrieval but introduce a privacy trade-off: audio processed on third-party servers, billed per minute, subject to changing terms, dependent on external uptime. The user's voice — and everything said in it — transits infrastructure they do not control.
There is no version of "your voice stays private" that is compatible with a service that processes it remotely. Privacy in those products is a policy. In ScribeSwift it is an architecture.
ScribeSwift closes the retrieval gap without opening a privacy gap.
A third problem: most capture tools require decisions before anything is saved — what folder, what project, what type of note. Those decisions demand exactly the mental bandwidth that is in shortest supply when the thought arrives. ScribeSwift makes no such demand. Tap Record. Speak. Tap Stop. Classification, transcription, and retrieval happen after the thought is safely stored.
Design principles
Voice is not a fallback for when typing is inconvenient. It is the correct medium for thoughts that arrive during motion. The record button starts immediately — no loading screen, no modal asking you to categorise the thought before you have finished having it. Every second of friction between "I have a thought" and "it is safely saved" is a cost paid in lost ideas.
Faster-Whisper and sentence-transformers are downloaded once and run locally. There is no API key, no external call, no third party anywhere in the data path. Privacy is not a setting or a policy — it is a structural constraint fixed before any other decision. Once fixed, it prevents a whole class of shortcuts: "just use the Whisper API for better accuracy" is structurally unavailable, not merely discouraged.
IndexedDB in the browser is the primary store. The server is a processing backend. Recording, playback, and keyword search all work without a network connection. If the server is down, nothing captured locally is lost. The app was designed to survive the server disappearing — capture never waits on infrastructure. This is the correct design for a tool you rely on to preserve things that matter.
Vanilla ES modules. No framework. No build step. No database — JSON files on disk. The source code is the deployed code. A deploy is an rsync. A backup is a tar.gz. Every abstraction not added is maintenance debt not accumulated. A personal tool that requires regular toolchain maintenance will eventually not be maintained, and will stop being useful.
Founder observation
This did not start as a transcription tool. The original question was simpler and harder to act on: what happens to all the thoughts that never make it into a notebook?
Not because there was no time to write them down. Because writing requires a kind of decision that is often not available in the moment — what to write, how to structure it, where to put it. Voice memos solve the capture problem. They do not solve retrieval, and they do not solve understanding. A year of voice recordings is not the same as understanding what you were thinking about that year. The recordings exist. The meaning — the threads, the patterns, the decisions made and avoided — has to be found separately.
The deeper interest is not note-taking. It is whether an honest, private archive of a person's own spoken thoughts — accumulated over time — can eventually reveal patterns that are difficult to see from inside your own head. Not because an AI interprets them. Because they are retrievable at all, in a form that lets the person do their own reading.
That requires a system that preserves voice accurately, keeps it private by architecture, and retrieves by meaning rather than by date or keyword.
ScribeSwift is the first practical step toward that. Today it captures, transcribes, embeds, and retrieves. The longer question is whether years of your own words can preserve enough context to help you understand yourself more accurately than memory alone.
The pipeline
Every recording moves through six stages. The first is offline and instant. The remaining five require the server but are invisible to the user — they happen on sync, in the background, without blocking capture.
The browser's MediaRecorder API captures audio as a WebM blob. The recording saves immediately to IndexedDB with status "pending" — safe the moment you tap Stop, before any network call, before any transcription. Works fully offline on the installed PWA.
When online, pending recordings upload via POST /api/upload — a FormData payload containing the audio blob and browser-generated metadata: UUID, filename, and timing data. Apache proxies the request to FastAPI on localhost:8000, which is never publicly exposed.
Faster-Whisper (base.en, CPU int8) transcribes the audio locally — no API key, no data leaving the server. The model lazy-loads on first call and caches in module scope; startup pays no load cost. A deterministic post-processor corrects whitespace, punctuation spacing, and sentence capitalisation. 48 unit tests cover the post-processing rules.
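The deterministic post-processing step can be illustrated with a minimal sketch. The function name `clean_transcript` and the specific regex rules below are assumptions made for illustration; they are not the project's actual rules, only one way to correct whitespace, punctuation spacing, and sentence capitalisation without any model.

```python
import re


def clean_transcript(text: str) -> str:
    """Hypothetical post-processor: same input always yields the same output."""
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove whitespace that sits before punctuation: "world ." -> "world."
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Insert one space after punctuation when a letter follows directly.
    text = re.sub(r"([,.;:!?])(?=[A-Za-z])", r"\1 ", text)
    # Capitalise the first letter of the text and of each sentence.
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text


print(clean_transcript("  hello   world .this is  a test ,ok ?yes"))
```

Rules this simple would mangle abbreviations and domain names (a full stop followed by a letter is treated as a sentence boundary), so a real rule set needs exceptions, which is the kind of edge case unit tests are suited to pin down.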
all-MiniLM-L6-v2 encodes the cleaned transcript into a 384-dimensional unit vector — a semantic representation of the recording's meaning. Embeddings are L2-normalised at encode time, so cosine similarity reduces to a dot product. No vector database required at personal scale.
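The claim that cosine similarity reduces to a dot product for unit vectors follows from the definition cos(a, b) = a·b / (|a| |b|): when |a| = |b| = 1 the denominator is 1. A small standard-library sketch (helper names are illustrative, and two-dimensional vectors stand in for the 384-dimensional embeddings):

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Full cosine similarity, valid for vectors of any non-zero length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# After normalisation the dot product alone gives the cosine similarity.
assert math.isclose(cosine(a, b), dot(l2_normalise(a), l2_normalise(b)))
```

Normalising once at encode time means each comparison at search time is a single pass of multiply-and-add, with no square roots.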
A pure-regex intelligence layer extracts a title (first sentence, opener-stripped, word-trimmed), summary, tags, memory type across seven categories, action items, and priority. No LLM. No inference cost. The same transcript always produces the same result. 41 unit tests cover title generation logic.
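A rule-based layer of this kind can be sketched in a few lines. Everything named below is hypothetical: the two categories, the opener list, and the function names are placeholders for illustration, since the project's seven categories and its actual patterns are not listed here.

```python
import re

# Hypothetical rules; the real layer has seven categories not shown here.
TYPE_RULES = [
    ("task", re.compile(r"\b(i need to|remember to|todo|to-do)\b", re.I)),
    ("idea", re.compile(r"\b(what if|idea|maybe we could)\b", re.I)),
]
# Hypothetical list of spoken openers to strip from the start of a title.
OPENER = re.compile(r"^(so|okay|ok|um|uh|well|right)[,\s]+", re.I)


def make_title(transcript: str, max_words: int = 8) -> str:
    """First sentence, openers stripped, trimmed to max_words."""
    first = re.split(r"(?<=[.!?])\s+", transcript.strip(), maxsplit=1)[0]
    first = first.rstrip(".!?")
    while True:  # strip stacked openers such as "Okay, so ..."
        stripped = OPENER.sub("", first)
        if stripped == first:
            break
        first = stripped
    title = " ".join(first.split()[:max_words])
    return title[:1].upper() + title[1:]


def classify(transcript: str) -> str:
    """Return the first matching type, or "unknown" when no rule matches."""
    for name, pattern in TYPE_RULES:
        if pattern.search(transcript):
            return name
    return "unknown"


def analyse(transcript: str) -> dict:
    return {"title": make_title(transcript), "type": classify(transcript)}
```

Because the functions are pure, the same transcript always maps to the same result, and each rule can be tested in isolation.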
A search description is embedded at request time and compared against all stored vectors via cosine similarity — threshold 0.25. Related memories surface automatically in the detail view using the recording's stored embedding, with no new inference required. Keyword search runs client-side against IndexedDB and works offline.
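At personal scale the search step can be a linear scan over the stored vectors. The sketch below assumes the embeddings are already unit length and held in a dictionary keyed by recording ID; the function names, the `limit` parameter, and the inclusive comparison against the 0.25 threshold are assumptions, not details confirmed by the project.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, stored, threshold=0.25, limit=10):
    """Rank stored unit vectors against a query unit vector.

    stored: dict mapping recording id -> unit-length embedding.
    Returns (id, score) pairs at or above the threshold, best first.
    """
    scored = [
        (rec_id, _dot(query_vec, vec)) for rec_id, vec in stored.items()
    ]
    hits = [pair for pair in scored if pair[1] >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:limit]


def related(rec_id, stored, threshold=0.25, limit=5):
    """Related memories: reuse the recording's stored vector as the query."""
    others = {k: v for k, v in stored.items() if k != rec_id}
    return search(stored[rec_id], others, threshold, limit)


# Two-dimensional toy vectors stand in for 384-dimensional embeddings.
stored = {
    "a": [1.0, 0.0],
    "b": [0.6, 0.8],
    "c": [0.0, 1.0],
}
print(search([1.0, 0.0], stored))
```

The `related` helper shows why the detail view needs no new inference: the query vector is simply read back from storage.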
Technical architecture
Three layers. Browser — Apache — FastAPI. No microservices, no message queue, no container orchestration. A single DigitalOcean Droplet running systemd, Apache, and uvicorn is everything the production system needs.
The browser runs a Vanilla JS PWA with no framework. A service worker (v44, cache-first) provides offline capability for the full app shell. IndexedDB holds all recordings — audio blobs, transcripts, and intelligence fields. The browser is the primary store; the server is secondary. If the server disappears, nothing captured locally is lost.
Apache handles SSL termination via Let's Encrypt and proxies /api/* to FastAPI on localhost:8000. The backend is never publicly exposed; the only path to it is through the Apache proxy. Authentication moved from Apache Basic Auth to FastAPI JWT in Phase 10A. Login sets a 30-day httpOnly cookie; Apache handles no auth logic of its own.
User data lives at SCRIBESWIFT_DATA_DIR/<user_uuid>/, outside the code tree. The deploy script runs rsync --archive --delete --exclude data/, so the exclusion is in the command, not in instructions to be followed carefully. Per-user isolation (Phase 10B) is implemented via UUID subdirectories, with path containment enforced in paths.py.
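Path containment for per-user directories can be sketched as follows. This is not the contents of paths.py; the function names are hypothetical, and the approach shown (validate the UUID, resolve the candidate path, then confirm it still sits under the user's directory) is one common way to implement such a check.

```python
import uuid
from pathlib import Path


def user_dir(data_root: Path, user_uuid: str) -> Path:
    """Return the user's directory; raises ValueError for a non-UUID id."""
    canonical = str(uuid.UUID(user_uuid))  # rejects "../x" and similar
    return data_root.resolve() / canonical


def safe_path(data_root: Path, user_uuid: str, relative_name: str) -> Path:
    """Resolve a file path and refuse anything outside the user's directory."""
    base = user_dir(data_root, user_uuid)
    candidate = (base / relative_name).resolve()
    try:
        candidate.relative_to(base)  # raises ValueError if outside base
    except ValueError:
        raise PermissionError(f"{relative_name!r} escapes the user directory")
    return candidate
```

Resolving before checking matters: a name such as `../someone-else/recording.json` looks like it starts inside the user's directory until `..` is collapsed.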
| Layer | Technology | Why |
|---|---|---|
| Frontend | Vanilla JS ES modules | No build step; source code is deployed code; no framework upgrade cycle |
| PWA | Service Worker + IndexedDB | Offline-first; recording works without network; local store is primary |
| Transcription | Faster-Whisper base.en (CPU int8) | Local; ~150 MB; ~2s per 30s recording on CPU; no external dependency |
| Embeddings | all-MiniLM-L6-v2 (384-dim) | 80 MB; 10ms CPU inference; L2-normalised output; excellent semantic quality |
| Intelligence | Pure regex (intelligence.py) | Deterministic, instant, auditable; zero inference cost per upload |
| Backend | FastAPI + Uvicorn | Clean route definitions; async-capable; dependency injection for auth |
| Storage | JSON files on disk | Inspectable with a text editor; backup is tar.gz; no external service |
| Web server | Apache2 (mod_proxy) | SSL termination; dumb reverse proxy; proven operational stability |
| Auth | FastAPI JWT (httpOnly cookie) | In-app login; 30-day session; bcrypt passwords; no browser credential dialogs |
| Infrastructure | DigitalOcean (Ubuntu 22.04) | Low cost; single-machine simplicity; straightforward systemd management |
Engineering decisions
The IndexedDB database name voiceinbox-db was preserved through the ScribeSwift branding rename. Changing it would open a new empty database in the browser, silently disconnecting the user from months of recordings with no error message and no recovery path that does not require DevTools. For a personal memory tool, that failure mode is unacceptable. The identifier stays as voiceinbox-db permanently, regardless of any future branding change.
sync.js needs to re-render the timeline after a successful upload. timeline.js needs to trigger upload operations from retry buttons on recording cards. A direct import in either direction creates a cycle. The solution: recorder.js, the entry point, imports from both and passes callbacks into timeline.js via initTimelineEvents(). Timeline calls the callbacks; it never imports sync. The entry point is the only module aware of both sides.
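The wiring pattern is language-independent, so it can be sketched in Python even though the real modules are JavaScript. The class and method names below are hypothetical stand-ins for timeline.js, sync.js, and recorder.js; passing the re-render step to the sync side as a callback is also an assumption made to keep the sketch symmetrical.

```python
class Timeline:
    """Stand-in for timeline.js: knows nothing about the sync side."""

    def __init__(self):
        self._on_retry = None
        self.render_count = 0

    def init_events(self, on_retry):
        # Mirrors initTimelineEvents(): callbacks are injected, not imported.
        self._on_retry = on_retry

    def render(self):
        self.render_count += 1

    def click_retry(self, recording_id):
        # A retry button calls whatever the entry point handed in.
        self._on_retry(recording_id)


class Sync:
    """Stand-in for sync.js: reports completion through a callback."""

    def __init__(self, on_uploaded):
        self._on_uploaded = on_uploaded
        self.uploaded = []

    def upload(self, recording_id):
        self.uploaded.append(recording_id)
        self._on_uploaded()


def wire_up():
    """Stand-in for recorder.js: the only place aware of both sides."""
    timeline = Timeline()
    sync = Sync(on_uploaded=timeline.render)
    timeline.init_events(on_retry=sync.upload)
    return timeline, sync
```

Tracing a retry is a matter of reading `wire_up`: the button calls `sync.upload`, which calls `timeline.render` on success.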
A failed regex match degrades to "unknown" type and empty fields, which is the correct default. For single-speaker English voice memos, regex covers the 90% case adequately. The remaining 10% degrades gracefully. Zero inference cost, deterministic output, auditable rules.
SCRIBESWIFT_DATA_DIR points to a directory outside the repository. The deploy script uses rsync --archive --delete --exclude data/, so the exclusion is in the command, not in instructions to be followed carefully. No deploy, however careless, can reach user recordings. This is a structural guarantee, not an operational discipline.
skipWaiting() is not called automatically when a new service worker installs. The app detects the waiting SW and shows an "Update ready" banner. The user taps Reload; the page sends SKIP_WAITING; the new SW activates; the page reloads. Activating mid-session without user consent could interrupt an active recording. The controlled flow prevents this at the cost of one deliberate user action.
Project status
Phases 1–10B complete · v0.4.0 · scribeswift.com
Full recording pipeline operational: capture, upload, transcription, embedding, rule-based intelligence, semantic search, and related-memory retrieval. JWT authentication and per-user data isolation complete (Phase 10B). 89 unit tests covering transcript post-processing and title generation. Full operational documentation for backup, restore, deployment validation, and account management.
What is running
Known limitations
Phase 11 complete — first external tester using the production server
Lessons from building this
Making privacy structural — not a feature, not a setting — prevents the entire category of decisions that would erode it. Constraints that cannot be traded away are more valuable than flexibility. They force better solutions within them. The intelligence layer is rule-based partly because an LLM API was structurally unavailable. The result is a more honest, more auditable, more reliable system.
The regex intelligence layer handles roughly 90% of voice memo categorisation adequately. A failed regex match degrades to "unknown" type and empty fields, which is the correct default. Choosing LLM inference for this task would have added latency, cost, and hallucination risk for a marginal quality improvement. For this use case, a result that is adequate, deterministic, and instant is better than one that is occasionally better but unpredictably wrong.
A voice memory tool that loses data has failed. The reliability choices are structural: IndexedDB is primary, server is secondary; uploads are idempotent; a sync lock prevents concurrent-tab race conditions; data lives outside the code tree; audio playback has a three-stage fallback (local blob → server stream → notice). Reliability that depends on careful operation eventually fails when operation is not careful.
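Idempotent uploads can be sketched by keying each stored record on the browser-generated UUID, so a retried upload finds the existing record instead of creating a duplicate. The function name, the file layout, and the write-then-rename step are assumptions for illustration, not the project's upload handler.

```python
import json
from pathlib import Path


def store_upload(user_dir: Path, recording_id: str, metadata: dict):
    """Store metadata once per recording id.

    Returns (stored_metadata, created). A repeated call with the same id
    returns the record already on disk and writes nothing.
    """
    target = user_dir / f"{recording_id}.json"
    if target.exists():
        # Retry of an upload that already succeeded: return what is stored.
        return json.loads(target.read_text(encoding="utf-8")), False
    user_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file, then rename, so a crash mid-write never
    # leaves a half-written record under the final name.
    tmp = user_dir / f"{recording_id}.json.tmp"
    tmp.write_text(json.dumps(metadata), encoding="utf-8")
    tmp.replace(target)
    return metadata, True
```

With this shape, the client can retry after a dropped connection without first asking whether the earlier attempt landed.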
The circular import solution — callbacks passed through the entry point — is more lines than a framework's event system would require. It is also more readable, more traceable, and more maintainable. No-framework means the deployed code is the source code, readable three years from now without touching a toolchain. Explicit choices compound into maintainable systems.
Where it goes next
Near term — observation before implementation (Phase 12). Phase 12 does not begin until the first external tester has been using the production system for two to four weeks. Async transcription — returning a job ID immediately and polling for completion — eliminates the synchronous timeout risk on long recordings without any visible change to the upload experience. The design is complete and documented; it waits on a real user signal before being built.
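The job-ID-and-polling idea can be illustrated with a small in-process sketch. It is not the documented Phase 12 design, which is not reproduced here; the class name, the state strings, and the use of a thread pool are all assumptions, and an in-memory job table like this one would lose pending jobs on a server restart.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class TranscriptionJobs:
    """Hypothetical job registry: submit returns at once, status is polled."""

    def __init__(self, transcribe, max_workers=1):
        self._transcribe = transcribe  # callable: audio path -> transcript
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def submit(self, audio_path):
        """Queue the work and return a job id without waiting for it."""
        job_id = str(uuid.uuid4())
        self._futures[job_id] = self._pool.submit(self._transcribe, audio_path)
        return job_id

    def status(self, job_id):
        """Report the current state of a job for a polling client."""
        future = self._futures.get(job_id)
        if future is None:
            return {"state": "not_found"}
        if not future.done():
            return {"state": "pending"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": str(error)}
        return {"state": "done", "transcript": future.result()}
```

The upload request would return the job ID straight away, and the client would poll `status` until the state is no longer `pending`, so no single HTTP request has to outlast a long transcription.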
Medium term — settings and portability. A transcription settings panel surfaces model selection (base.en through medium.en) without requiring server configuration file edits. The WHISPER_MODEL environment variable already makes this a zero-code-change switch; the settings panel exposes it to the operator. Date range filtering and export to Markdown address the retrieval and portability needs that accumulate in any archive used over years.
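Reading the model name from WHISPER_MODEL and loading the model lazily might look like the sketch below. The function names, the allow-list, and the choice of base.en as the fallback are assumptions; the `loader` argument is a stand-in so the example runs without the transcription library installed.

```python
import os

# English-only Whisper sizes in the range the settings panel would expose.
ALLOWED_MODELS = ("base.en", "small.en", "medium.en")

_model = None  # module-scope cache, filled on first use


def configured_model_name(environ=os.environ):
    """Return the model named by WHISPER_MODEL, defaulting to base.en."""
    name = environ.get("WHISPER_MODEL", "base.en")
    if name not in ALLOWED_MODELS:
        raise ValueError(f"unsupported WHISPER_MODEL: {name!r}")
    return name


def get_model(loader, environ=os.environ):
    """Load the model on the first call only; later calls reuse it.

    loader is any callable taking a model name. With faster-whisper it
    could be: lambda n: WhisperModel(n, device="cpu", compute_type="int8")
    """
    global _model
    if _model is None:
        _model = loader(configured_model_name(environ))
    return _model
```

Validating the name against an allow-list means a typo in the environment fails loudly at first use instead of triggering an unexpected model download.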
Long term — WASM transcription and local AI worker. whisper.cpp compiles to WebAssembly. A browser-based transcription worker would enable fully offline transcription, with no server needed, using a model cached in the browser after first download. The local AI worker architecture moves heavier analysis (7B LLM enrichment, cross-memory pattern detection, weekly digests) to the user's home server or desktop, keeping the VPS lean. The metadata JSON schema already supports arbitrary extra fields, so adding localAnalysis fields is backward-compatible.
Long term — cognitive continuity. A personal archive becomes more valuable as it grows. Years of recordings create context that cannot be reconstructed from memory alone. Future development may explore ways of surfacing recurring themes, decisions, questions, and patterns across long periods of time while preserving the system's privacy-first architecture.
ScribeSwift is not finished.
It is running, being used, and being improved in response to real use — not in response to plans.
The engineering decisions — no framework, no database, local AI, offline-first — are not choices made for novelty. They are the consequence of working backward from a fixed constraint: the system must remain private, reliable, and operable over years without requiring constant attention.
That constraint produces a simpler system than the alternative. Simpler is more likely to still be running in three years.