Crypto-Shredding in Practice: Field-Level Encryption with Vault Transit at Scale
TL;DR — We stopped letting databases hold our secrets in plaintext and made services encrypt fields before they touch storage. We used HashiCorp Vault Transit (no keys in apps), scoped keys so we can crypto-shred on demand, migrated live traffic with dual-writes + a backfill, and then—after an ugly incident—added circuit breakers, short-TTL caches for low-sensitivity fields, and graceful degradation. Now “delete” means “no one can decrypt,” not “we hope replicas and backups agree.”
0) Why crypto-shredding (in human words)
Lawyers ask for “prove you deleted it.” Storage systems answer “I deleted it… in this replica… eventually.” That mismatch is how arguments start.
We made peace by changing the unit of deletion: not the row, but the ability to decrypt the row. If we can destroy the decryption path (rotate past it or nuke the key), the bits can sit in a thousand backups and still be useless. That’s crypto-shredding.
Simple rules we stuck to
- Encrypt in the service layer, not inside the DB. The DB becomes a blob farm, not a key party.
- Use Transit as an HSM-ish envelope service. We send plaintext, get ciphertext. Keys never live in app memory.
- Scope keys to the thing you might need to shred: product/tenant/region/dataset, and sometimes per-user via derived context.
- Decryption is a privilege, not plumbing. It needs a purpose, policy, and audit trail. Treat the service as a potential decryption oracle and design accordingly.
1) What we actually built
A tiny shared library we call securitas. Every profile-facing service loads it at boot. It gives you:
- encrypt(field, value, { key, context }) and decrypt(field, blob, { key, context })
- built-in metrics (QPS, p95/p99, error taxonomy), and “plaintext touched” counters
- circuit breaking and a very short-TTL allowlisted cache for low-sensitivity reads
Happy path, without a diagram:
- Service calls Transit to encrypt → gets a ciphertext
- Service stores ciphertext in NoSQL
- On read, service fetches ciphertext
- Service calls Transit to decrypt only if the caller is allowed to see it
- Service returns either masked or plaintext based on policy
That’s the shape—no secrets at rest in the datastore, and no keys in the app.
2) Vault Transit: the bits that matter
Key layout
- One Transit key per shred-scope (e.g., profiles.customer.v1, profiles.partner.v1; sometimes per-tenant).
- Keys created with deletion_allowed=true so we can do irreversible deletion if policy requires.
- Keys set to derived=true, with a context like tenant_id:user_id:field_name passed on every call. That keeps blast radii tight.
- Regular rotation: we advance latest_version, keep min_encryption_version pinned to the latest so new writes use it, and raise min_decryption_version when retiring old ciphertext.
Node.js client sketch
import createVault from 'node-vault';

const vault = createVault({
  apiVersion: 'v1',
  endpoint: process.env.VAULT_ADDR,
  // Auth via cloud IAM / AppRole. We used cloud IAM per service.
});

export async function encrypt({ key, plaintext, context }) {
  // Transit expects base64 for both plaintext and (for derived keys) context
  const res = await vault.write(`transit/encrypt/${key}`, {
    plaintext: Buffer.from(plaintext, 'utf8').toString('base64'),
    context: context && Buffer.from(context, 'utf8').toString('base64'),
  });
  return res.data.ciphertext; // vault:vX:...
}

export async function decrypt({ key, ciphertext, context }) {
  const res = await vault.write(`transit/decrypt/${key}`, {
    ciphertext,
    context: context && Buffer.from(context, 'utf8').toString('base64'),
  });
  return Buffer.from(res.data.plaintext, 'base64').toString('utf8');
}
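And the key lifecycle from the layout above, as a sketch reusing that same vault client (the key name is an example):

// Create a derived key for a shred-scope; context becomes mandatory for encrypt/decrypt
await vault.write('transit/keys/profiles.customer.v1', { derived: true });

// Allow irreversible deletion later, and pin version floors, via the key's config endpoint
await vault.write('transit/keys/profiles.customer.v1/config', {
  deletion_allowed: true,
  min_decryption_version: 1,
});

// Rotate: latest_version advances; Vault enforces the min_* floors on encrypt/decrypt
await vault.write('transit/keys/profiles.customer.v1/rotate', {});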
Identity & policy shape
Each service authenticates to Vault with a cloud IAM role bound to only its Transit paths:
- transit/encrypt/profiles.customer.*
- transit/decrypt/profiles.customer.*
- We split encrypt vs decrypt policies where possible. Decrypt is rarer, more monitored, sometimes MFA-gated for ops tooling.
- Audit devices are on. Every decrypt log says “who, what, when, why (purpose string).”
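In policy terms, a sketch of the split (Vault policy HCL; Transit's encrypt and decrypt endpoints want the update capability):

# Encrypt-only policy for a writer service
path "transit/encrypt/profiles.customer.*" {
  capabilities = ["update"]
}

# Decrypt policy lives separately, attached only where reads truly need plaintext
path "transit/decrypt/profiles.customer.*" {
  capabilities = ["update"]
}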
3) Wiring it into services (and keeping plaintext out of your RAM)
We wrapped the client in a tiny facade that bakes in the guardrails:
// securitas/index.js
import LRU from 'lru-cache';
import CircuitBreaker from 'opossum';
import * as transit from './transit-client.js';

const cache = new LRU({ max: 50_000, ttl: 30_000 }); // allowlisted fields only
const breaker = new CircuitBreaker(transit.decrypt, {
  timeout: 1500,                // cut fast, probe later
  errorThresholdPercentage: 60, // open when flapping
  resetTimeout: 4000,
});

export async function encryptField(key, value, context) {
  // Keep plaintext lifetimes minimal; avoid string copies if possible
  return transit.encrypt({ key, plaintext: value, context });
}

export async function decryptField(key, ciphertext, context, { cacheable = false } = {}) {
  const ck = `${key}:${ciphertext}`;
  if (cacheable && cache.has(ck)) return cache.get(ck);
  const plain = await breaker.fire({ key, ciphertext, context });
  if (cacheable) cache.set(ck, plain);
  return plain;
}
Field hygiene tips that saved us:
- Store derivatives for searching/sorting (hashes, last-4, tokens); never query on plaintext. See the sketch after this list.
- Decrypt only at the edge of the response and drop the buffer immediately. If your language lets you zeroize, do it.
- Make “can I see this field?” a first-class authorization check, not a vibe.
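A minimal sketch of the derivative idea, assuming an application-managed HMAC key for equality-search tokens (storing and rotating that HMAC key is its own design problem):

import { createHmac } from 'node:crypto';

// Deterministic token for equality lookups; the DB indexes this, never the plaintext
export function searchToken(hmacKey, fieldName, value) {
  return createHmac('sha256', hmacKey).update(`${fieldName}:${value}`).digest('hex');
}

// Non-reversible display derivative for UX (e.g., "••• 1234")
export const last4 = (value) => value.slice(-4);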
4) The migration we didn’t want to postpone (or page through)
We had millions of plaintext records. The plan had two tracks:
A) Dual-writes first (ship the write path)
Once securitas was live, new writes stored both the legacy plaintext and the encrypted field for a short window. Reads preferred encrypted and fell back to plaintext if missing. This turned on the future before rewriting the past.
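The read path during the window looked roughly like this (field names as in the handler example later; decryptField is the securitas facade):

async function readPhoneField(rec, ctx) {
  // Prefer ciphertext; fall back to legacy plaintext only while it still exists
  if (rec.phone_enc) {
    return decryptField('profiles.customer.v1', rec.phone_enc, ctx);
  }
  return rec.phone; // legacy plaintext, removed at cutover
}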
B) Backfill next (rewrite the past without drama)
Workers pulled IDs in chunks, called Transit, wrote encrypted fields into shadow attributes (or a parallel collection), then upserted back idempotently. We rate-limited Transit, jittered retries, and punted persistent failures into a DLQ.
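The worker, reduced to its bones (db, dlq, and the chunking are illustrative; rate limiting and jittered retries are omitted here):

async function backfillChunk(ids) {
  for (const id of ids) {
    try {
      const rec = await db.get(id);
      if (!rec || rec.phone_enc) continue; // idempotent: skip already-migrated rows
      const ctx = `${rec.tenantId}:${id}:phone`;
      const enc = await encryptField('profiles.customer.v1', rec.phone, ctx);
      await db.update(id, { phone_enc: enc }); // shadow attribute; plaintext untouched for now
    } catch (err) {
      await dlq.push({ id, error: String(err) }); // persistent failures go to the DLQ
    }
  }
}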
Cutover playbook
- Flip feature flags so reads go “decrypt-first”
- Freeze legacy plaintext writes
- Let the grace window expire
- Drop plaintext columns with a guarded migration step
The key to sleeping at night: verification without logging secrets—we compared hashes of plaintext vs decrypt(encrypt(plaintext)) on the fly.
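The check itself, as a sketch; both sides reduce to digests so neither value ever reaches a log line:

import { createHash } from 'node:crypto';

const sha256 = (s) => createHash('sha256').update(s, 'utf8').digest('hex');

async function verifyRow(rec, ctx) {
  const roundTrip = await decryptField('profiles.customer.v1', rec.phone_enc, ctx);
  return sha256(roundTrip) === sha256(rec.phone); // compare digests, not plaintext
}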
5) Crypto-shredding semantics (how we actually “delete”)
We support two primitives:
Hard shred (no take-backs)
- Create keys with deletion_allowed=true.
- When the contract says “forget everything for Tenant-7,” we delete profiles.customer.Tenant7 (or the scoped key) and keep the audit trail. All ciphertext under that key is unrecoverable.
Soft shred (forward-only)
- Raise min_decryption_version so old ciphertext becomes undecryptable, then re-encrypt on access. Great for periodic retirement or data minimization without detonating an entire scope.
Granularity tip: If you’ll be asked to erase per user, push that identity into derived context and keep a map. Then you can retire a micro-scope without bulldozing the block.
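Both primitives as node-vault calls (key names are examples; the hard delete only succeeds because deletion_allowed was set up front):

// Hard shred: the key dies; every ciphertext under it is permanently unreadable
await vault.delete('transit/keys/profiles.customer.Tenant7');

// Soft shred: ciphertext from versions below the floor can no longer be decrypted
await vault.write('transit/keys/profiles.customer.v1/config', {
  min_decryption_version: 4, // illustrative version number
});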
6) The outage that forced us to grow up
We didn’t lose Vault. We lost time. A different control-plane component hiccupped, Transit latencies spiked, and suddenly decryption sat on the hot path of profile reads. The blast radius felt bigger than it should have.
What hurt:
- Retries piled up and created a mini-thundering herd.
- Thread pools filled with “waiting on crypto.”
- Some endpoints had no “okay, mask it and move on” mode.
What we changed:
- HA Vault (multi-node, multi-AZ) with local HAProxy doing health-based routing.
- Budgeted retries (exponential backoff + jitter, small caps; see the sketch after this list) and circuit breakers that open quickly and half-open with probes.
- A short-TTL in-process cache for explicitly low-sensitivity fields (think display names), TTL ≤ 60s. Never cache high-sensitivity PII or secrets.
- Bulkheads: separate pools for crypto I/O and business I/O so one can drown without taking the other with it.
- Purpose binding everywhere: a decrypt without a purpose string does not pass go. We rate-limit by purpose class.
- Degrade modes: mask values where the UX can tolerate it; 503 where it can’t. Failing fast beats failing noisy.
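The budgeted-retry shape from that list, as a minimal sketch (numbers are illustrative, not our production values):

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetryBudget(fn, { attempts = 3, baseMs = 50, capMs = 400 } = {}) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Full jitter: random delay up to the exponential bound, hard-capped
      await sleep(Math.random() * Math.min(capMs, baseMs * 2 ** i));
    }
  }
  throw lastErr; // budget exhausted: let the breaker or a degrade mode take over
}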
7) Run it like you mean it (ops guardrails)
- Policies: split encrypt/decrypt; short-lived tokens for ops with MFA.
- Metrics: encrypt/decrypt QPS, p95/p99, error taxonomy (client vs server), cache hit ratio, key-version spread.
- Alarms: rising p99, decrypts from surprising call paths, encryption using non-latest versions.
- Runbooks: rotation, shard/key deletion drills, Transit node replacement, failover tests. Practice the scary ones.
8) What gets better, what gets trickier
The wins
- A stolen DB dump is boring without Transit and the right policy.
- Backups/replicas are safer by default.
- “Delete” is a procedure with proof, not a best-effort hope.
The trade-offs
- Your service is a decryption oracle unless you make it otherwise. Authorization and purpose-limited decrypts are table stakes.
- Transit latency now matters. Invest in HA and graceful fallback before your first incident.
- Querying encrypted fields is a paradigm shift—prepare hashes, tokens, and UX that doesn’t require plaintext to search.
9) A small, honest code slice
// Example usage inside a handler
import { encryptField, decryptField } from '@org/securitas';

async function storeProfile(p) {
  const ctx = `${p.tenantId}:${p.userId}:phone`;
  const phoneEnc = await encryptField('profiles.customer.v1', p.phone, ctx);
  await db.update(p.userId, {
    phone_enc: phoneEnc,
    phone_last4: p.phone.slice(-4),
  });
}

async function readProfile(userId, viewer) {
  const rec = await db.get(userId);
  const canSee = await can(viewer, 'read.phone', userId);
  return {
    ...rec,
    phone: canSee
      ? await decryptField('profiles.customer.v1', rec.phone_enc, `${rec.tenantId}:${userId}:phone`, { cacheable: false })
      : mask(rec.phone_last4),
  };
}
10) Testing, drills, and audits (aka future you says thanks)
- Property-based tests: decrypt(encrypt(x)) === x for weird Unicode, long strings, edge bytes (see the sketch after this list).
- Load tests with Transit throttled to force the breaker/caching logic to earn its keep.
- Lifecycle audits: prove (in a sandbox) that deletion_allowed keys actually die, and that raising min_decryption_version does what you think.
- Red-team the oracle: try to coerce decrypts via unexpected endpoints/purposes. Make sure logs and alerts tell on you.
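A sketch of the round-trip property with fast-check, assuming a Jest-style runner and a sandbox Vault with a derived key named test.roundtrip:

import fc from 'fast-check';
import { encrypt, decrypt } from './transit-client.js';

test('decrypt(encrypt(x)) === x', async () => {
  await fc.assert(
    fc.asyncProperty(
      // minLength 1: skip the empty string to keep the sketch simple
      fc.fullUnicodeString({ minLength: 1, maxLength: 4096 }),
      async (x) => {
        const ctx = 'tenant:user:field'; // derived keys need a stable context
        const blob = await encrypt({ key: 'test.roundtrip', plaintext: x, context: ctx });
        return (await decrypt({ key: 'test.roundtrip', ciphertext: blob, context: ctx })) === x;
      }
    ),
    { numRuns: 50 }
  );
});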
11) “Why not…?” mini-FAQ
Why not TDE (Transparent Data Encryption)? Great for lost disks; useless for stolen query privileges and DB snapshots. We wanted field-level control and per-scope shredding.
Why not roll our own KMS client-side? Transit gives us derived keys, versioning, audit, and policy out of the box. Also: fewer ways to slice your foot.
What about search over encrypted data? We didn’t reinvent crypto. We used hashes/tokens/last-N alongside ciphertext, and designed queries/UX to live with that.
Closing
We didn’t set out to become key janitors. We set out to make “delete” mean something. The trick wasn’t a clever cipher; it was moving the boundary: encrypt outside the datastore, keep keys outside the app, and make decrypt scarce, purposeful, and observable. Do that, rehearse the migrations, and harden for the day Transit blips—and you’ll have both defense in depth and a clean deletion story when it matters.