Why most internal tools become unmaintainable in 18 months
Internal tools start as a Google Sheet, become a flaky Streamlit prototype, and end up as the codebase nobody touches. Here is what we have learned about extending the runway.
The pattern is universal
Every operations team we have worked with tells the same story. Year one: someone in finance builds a Google Sheet that automates a manual reconciliation. By month four it has 47 tabs and three different VBA macros. Year two: a contractor wraps it in a Streamlit prototype "just to make it shareable". By month eighteen the contractor is gone, the Streamlit app is one Python version behind, no one knows where the credentials live, and the team is back to running the original spreadsheet by hand, because at least they understand that one.
This is the canonical lifecycle of an unmanaged internal tool. It is not a story about bad engineers or lazy product owners. It is structural.
Three failure modes that almost always show up
Failure mode one: hidden state. The tool depends on something nobody documented — a CSV that someone downloads from an external system every Tuesday morning, an API key that lives in one specific browser tab, a Google Sheet that one person has manually formatted. The tool works as long as that ritual is performed. The day the ritual stops, the tool stops. Worse: nobody knows for weeks because the tool degrades silently rather than throwing errors.
Failure mode two: schema drift. The original tool was designed against a snapshot of the operational reality. Then ops added a new product line. Then sales added a new tier. Then finance added a new revenue category. Each addition required a small surgery in the tool. After eighteen months of small surgeries, the schema looks nothing like the original design and the tool only works for the cases that have been hand-patched.
Failure mode three: orphaned ownership. The person who built the tool moves teams, leaves the company, or gets pulled onto something more strategic. The tool has no owner. When it breaks, every team thinks the other team is fixing it. Eventually someone declares it "the tool that doesn't work anymore" and a parallel manual process emerges.
What we do differently
When we are commissioned to build internal tools at Bright Line, we have a small set of opinionated rules:
Rule one: every tool has a runbook from day one. Not a wiki page nobody reads. A README in the repo, kept up to date, that lists the cron schedules, the credentials and where they live, the dependencies, the typical failure modes, and the on-call rotation. If the runbook is wrong, that is a P1 bug.
Rule two: the tool emits its own health. Every job logs to a single observability stream — we use Cloud Logging or Datadog depending on the client stack. If a job has not run in twice its expected interval, an alert fires. We do not wait for a human to notice the tool is broken.
Rule three: schema migrations are first-class. Even small internal tools get an Alembic-style migration system. New ops requirements become migrations, not hand-edits. After eighteen months, you can replay the schema history and understand exactly how the business model has evolved.
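For a tool too small to justify Alembic itself, the same idea fits in a page. A sketch of a versioned migration runner against SQLite (the table names and migrations are invented for illustration, echoing the product-line and tier additions from failure mode two):

```python
import sqlite3

# Ordered migration history. New ops requirements are appended here,
# never applied as hand-edits to a live table.
MIGRATIONS = [
    (1, "CREATE TABLE revenue (id INTEGER PRIMARY KEY, amount REAL, category TEXT)"),
    (2, "ALTER TABLE revenue ADD COLUMN product_line TEXT"),  # ops added a product line
    (3, "ALTER TABLE revenue ADD COLUMN tier TEXT"),          # sales added a tier
]

def migrate(conn: sqlite3.Connection) -> int:
    """Apply any migrations newer than the recorded version; return the new version."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, sql in MIGRATIONS:
        if version > current:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
            current = version
    conn.commit()
    return current
```

Because every applied version is recorded, `migrate` is idempotent: running it on an already-current database is a no-op, and the `MIGRATIONS` list doubles as the replayable schema history.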
Rule four: we pick a boring stack. Postgres, Python or TypeScript, FastAPI or Next.js, Docker. Not because these are exciting. Because two years from now someone else needs to be able to pick up the codebase without learning a new framework.
The honest economics
Building an internal tool the right way the first time costs about 2x what the Streamlit prototype costs. The Streamlit prototype, if it survives 18 months, costs about 5-10x its initial build to keep limping along — measured in operations team hours, contractor patches, missed insights, and the eventual full rewrite.
The math works out the same way every time: pay 2x up front, save 5-10x over three years, and the team trusts the tool, which means they actually use it.
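The multipliers above can be turned into a worked comparison. The 2x and 5-10x factors are from the argument itself; the $20k prototype build cost and the 0.5x upkeep multiplier for the properly built tool are purely hypothetical figures for illustration:

```python
def total_cost(build_cost: float, upkeep_multiplier: float) -> float:
    """Cost of ownership over the period: initial build plus ongoing upkeep."""
    return build_cost + upkeep_multiplier * build_cost

prototype_build = 20_000            # hypothetical prototype build cost
proper_build = 2 * prototype_build  # "pay 2x up front"

# Prototype upkeep runs 5-10x its build cost; assume (our figure, not the
# source's) that runbooks, alerts, and migrations hold proper upkeep to 0.5x.
savings_low = total_cost(prototype_build, 5) - total_cost(proper_build, 0.5)
savings_high = total_cost(prototype_build, 10) - total_cost(proper_build, 0.5)
```

Under these assumptions the properly built tool comes out $60k-$160k ahead, before counting the harder-to-price effects: missed insights and the parallel manual process that emerges once the team stops trusting the tool.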
If you have a Streamlit graveyard you would like to talk through, we are happy to take a look.