Backend · Databases · Systems

Databases from First Principles

Why any system that outlives a process needs a database, how DBMS earns its place over raw files, and how to design schemas you won't regret.

PostgreSQLSchema DesignMigrationsIndexing

1. Why databases?

A running process lives in RAM. The moment it exits, every byte it held vanishes. Databases exist for one fundamental reason: persistence — data must survive process restarts, machine reboots, and hardware failures.

Secondary to persistence comes access. It is not enough to store data; you need to retrieve, update, and delete it efficiently and correctly, even when multiple clients are touching the same records at the same time.

2. Storage hierarchy

Not all storage is equal. Every engineering decision about where to put data is a trade-off between speed and cost.

A typical server has ~16 GB RAM and ~1 TB secondary storage. Databases live on disk; Redis is the notable exception that keeps its working set in RAM.

Most databases (PostgreSQL, MySQL) are disk-based: durable by default, with buffer pools in RAM for hot pages. Redis is memory-first: blindingly fast, but data size is bounded by available RAM and durability requires explicit configuration.

3. DBMS

Storing bytes on disk is trivial. A plain text file can technically be a database. The question is: what do you lose without a proper Database Management System?

Organisation

Data is structured so reads and writes are efficient — not a sequential scan every time.

Access (CRUD)

A query language to create, read, update, and delete without writing custom parsers.

Integrity

Constraints enforce valid data — e.g. a payment_amount column only accepts numbers, never a string.

Concurrency

Two users updating the same record simultaneously must not corrupt each other's writes.

Security

Role-based access control determines who can read or modify which tables.

Transactions

All-or-nothing semantics: a bank transfer either debits and credits, or neither happens.

A plain text file handles none of this well. Parsing is manual, there is no concurrency model, and every application reimplements the same low-level plumbing. DBMS is that plumbing, done once and done right.

4. Relational vs NoSQL

Two dominant families, each with a different answer to the schema-flexibility trade-off.

Relational databases enforce a fixed schema — great for data integrity. NoSQL documents are flexible — each can have different fields, but you trade consistency guarantees for that flexibility.

The practical heuristic: if your data has a well-defined, stable shape and relationships between entities matter, use relational. If your schema is genuinely dynamic (a CMS where each article type looks different), NoSQL makes sense — but be deliberate, because flexibility often becomes inconsistency in production.

5. Why PostgreSQL

When in doubt, start with PostgreSQL. Four reasons:

Open source — no licensing cost, no vendor lock-in, enormous community.
SQL + extensions — a 1400-page spec, plus PostGIS for geospatial, pgvector for embeddings, and hundreds more.
JSONB — you get document-store flexibility when you need it, without switching databases.
ACID guarantees — transactions are correct by default. You opt into relaxed consistency, not into strictness.

6. Data types

Category	Types	Notes
Numeric	int, bigint, smallint	Use `bigint` for IDs to avoid overflow
Decimal	numeric, decimal	Exact precision — use for money
Float	real, float8	Faster but imprecise — use for ML scores, not currency
Text	text, varchar(n)	`text` has a ~250 MB upper limit; prefer it over varchar unless a hard cap is needed
Time	timestamptz, date	Always use `timestamptz` (with timezone)
JSON	json, jsonb	`json` = plain text. `jsonb` = binary + indexable. Always prefer `jsonb`
Other	uuid, bool, bytea	Use `uuid` as PK for distributed systems

numeric vs float: numeric is exact (arbitrary precision), float is a fast approximation. Use numeric for any financial value. Use float when you need speed and can tolerate rounding (e.g., confidence scores).

7. Migrations

A migration is a versioned SQL file that describes a single incremental change to your schema. The migration runner applies them in sequence and tracks which have been applied.

Each migration file has an up (apply) and a down (rollback). The runner tracks applied migrations in a schema_migrations table.

Why migrations over ad-hoc ALTER TABLE?

Schema changes are version-controlled alongside code.
Every environment (dev, staging, prod) applies the same sequence.
Rollbacks are explicit and tested, not improvised under pressure.

Tool of choice: dbmate — zero dependencies, works with PostgreSQL, MySQL, SQLite.

8. Relationships

The three fundamental relationship patterns and how they map to table design.

Referential integrity

Foreign keys are not just documentation — they enforce that related data stays consistent. Two key behaviours on delete:

ON DELETE RESTRICT — prevents deleting a user if they own projects. The database blocks the operation.
ON DELETE CASCADE — deleting a user also deletes all their projects. Useful, but destructive — be explicit.

Referential integrity is the first line of defence against data corruption at the database layer, independent of application code.

9. Queries

A real-world query for a user list with their profile embedded as JSON — useful for APIs that need to return a single object per user:

SELECT
  u.*,
  to_json(up.*) AS profile
FROM users u
LEFT JOIN user_profile up ON u.id = up.user_id
ORDER BY u.created_at DESC;

LEFT JOIN ensures users with no profile still appear. to_json(up.*) collapses the joined row into a single JSON column — avoids N+1 queries in the application layer.

10. Indexing

An index is a separate data structure that maps column values to their row locations — a lookup table that trades write overhead and storage for faster reads.

Primary keys are indexed automatically. Any column used frequently in WHERE, JOIN, or ORDER BY clauses should have an explicit index.

-- PostgreSQL creates this automatically for PKs:
CREATE INDEX ON users (id);

-- Add manually for frequently filtered columns:
CREATE INDEX ON projects (owner_id);
CREATE INDEX ON events (created_at DESC);

11. Patterns worth knowing

You probably don't need full normalisation

Normalisation eliminates redundancy. Denormalisation trades redundancy for read performance. In practice, a moderate level of denormalisation (e.g. storing a display name directly on an orders table) avoids expensive joins on hot paths. Measure before normalising everything.

Separate tables for frequently-changing fields

If one column changes constantly but doesn't affect the core entity (e.g. a last_seen_at timestamp on users, or a status field on a long-lived job), moving it to a separate table reduces row-level lock contention and keeps audit history clean.

JSONB for genuinely dynamic data

PostgreSQL's jsonb column lets you store semi-structured data without a separate NoSQL cluster. Index into it with GIN indexes. Useful for metadata blobs or per-tenant configuration that varies unpredictably.

One-to-one: make child PK the FK

For a true 1:1 relationship (e.g. users → user_profile), set the child table's primary key to also be the foreign key referencing the parent. This eliminates a separate surrogate key and makes the join trivially efficient.

References

PostgreSQL Documentation — the 1400-page reference that covers everything
dbmate — lightweight migration tool
How databases evolved — a good video overview of database history and architecture
Kleppmann, M. — Designing Data-Intensive Applications(O'Reilly, 2017) — the definitive book on storage engines, replication, and transactions