Symfony to the Cloud: Twelve Factors, Thirteen Services on Guillaume Delré

Eleven Out of Twelve

Sun, 17 May 2026 15:00:00 +0000

The composer.json in each service had this in its post-install-cmd section:

"post-install-cmd": [
    "bin/console cache:clear --env=prod",
    "bin/console doctrine:migrations:migrate --no-interaction"
]

post-install-cmd runs during composer install, which in the production Dockerfile runs during the image build. There is no database available during a Docker build. The migration command either failed silently, or connected to nothing, or was skipped by Doctrine when it couldn’t find a schema to compare against. In any case, it didn’t migrate anything.

This is a clean violation of Factor XII: admin processes (migrations, one-off scripts, console tasks) should run in the same environment as the application, against the actual production data. Running them at build time inverts the relationship. The image shouldn’t know about the database. The database should be there when the image needs it.

The move to the entrypoint

The migration command moved from composer.json to docker-entrypoint.sh. The shift looks small on a diff. The implications are not.

The entrypoint runs when the container starts, not when the image is built. The database is reachable. The entrypoint waits for it — up to 60 seconds, one attempt per second — before doing anything:

# Poll once per second, 60 attempts at most, until a trivial query succeeds.
ATTEMPTS_LEFT_TO_REACH_DATABASE=60
until [ "$ATTEMPTS_LEFT_TO_REACH_DATABASE" -eq 0 ] || \
  DATABASE_ERROR=$(php bin/console dbal:run-sql -q "SELECT 1" 2>&1); do
    sleep 1
    ATTEMPTS_LEFT_TO_REACH_DATABASE=$((ATTEMPTS_LEFT_TO_REACH_DATABASE - 1))
done

# Out of attempts: print the last error captured from the console and give up.
if [ "$ATTEMPTS_LEFT_TO_REACH_DATABASE" -eq 0 ]; then
    echo "$DATABASE_ERROR"
    exit 1
fi

If the database doesn’t respond within 60 seconds, the container exits with an error and Kubernetes restarts it. Once the database is ready, the migration runs:

if [ "$( find ./migrations -iname '*.php' -print -quit )" ]; then
    php bin/console doctrine:migrations:migrate --no-interaction --all-or-nothing
fi

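Two changes from the original command. First, --all-or-nothing wraps the whole batch in a single transaction, so that if any migration fails, the entire batch rolls back. That guarantee holds on a database with transactional DDL, such as PostgreSQL; MySQL commits DDL statements implicitly, so the rollback is weaker there. Second, the find guard skips the command entirely if there are no migration files, which is useful for services that don’t use Doctrine migrations at all.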

This is genuinely better. The database is present. The migration runs in the real environment. The --all-or-nothing flag adds atomicity that the build-time version never had.
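The wait loop above can also be written as a small reusable function. What follows is a minimal sketch, not the platform’s actual entrypoint: the name wait_until and its argument order are hypothetical, and the usage lines are commented out because they need a real Symfony console and a database.

```shell
# wait_until ATTEMPTS DELAY CMD [ARGS...]
# Hypothetical helper: retries CMD until it succeeds or ATTEMPTS runs out.
# On exhaustion, prints the last captured error to stderr and returns 1.
wait_until() {
    attempts_left=$1
    delay=$2
    shift 2
    last_error=""
    while [ "$attempts_left" -gt 0 ]; do
        if last_error=$("$@" 2>&1); then
            return 0
        fi
        attempts_left=$((attempts_left - 1))
        if [ "$attempts_left" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    printf '%s\n' "$last_error" >&2
    return 1
}

# Usage in an entrypoint (same commands as above: 60 attempts, 1s apart):
#   wait_until 60 1 php bin/console dbal:run-sql -q "SELECT 1" || exit 1
#   php bin/console doctrine:migrations:migrate --no-interaction --all-or-nothing
```

The behaviour is the same as the inline loop. The only gain is that the retry budget and the probed command become arguments instead of being baked into the script.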

What it doesn’t solve

Two pods redeploying simultaneously both run the entrypoint. Both reach the database. Both find pending migrations. Both call doctrine:migrations:migrate.

Doctrine has a tracking mechanism rather than a true lock: a doctrine_migration_versions table that records which migrations have run, and the command checks it before applying. Under normal conditions, when runs don’t overlap, this is fine: the second pod finds the table up to date and exits cleanly. The real failure modes are more specific: a migration long enough that the database lock times out before it completes, letting a second runner start the same migration before the first has finished; or a pod that crashes mid-migration before recording the version in the table, leaving the schema in an applied-but-unregistered state that the next pod will try to apply again.

The team’s position is explicit: a brief deployment downtime is acceptable. Application versions aren’t necessarily forward-compatible with older schema versions, so running N and N+1 simultaneously against the same database isn’t safe anyway. The deployment strategy is Recreate: all old pods are terminated before any new pods start. The migration runs on first startup, no overlap between versions. It works.
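For reference, switching an existing Deployment to Recreate is a one-field change. The sketch below only prints the patch; the function name and the deployment name in the usage comment are placeholders, not taken from the platform’s manifests.

```shell
# Prints a JSON merge patch that switches a Deployment to the Recreate strategy.
# rollingUpdate is set to null because the API refuses rollingUpdate settings
# on a Deployment whose strategy type is Recreate.
recreate_patch() {
    printf '%s\n' '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'
}

# Usage (deployment name is a placeholder):
#   kubectl patch deployment authentication --type merge -p "$(recreate_patch)"
```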

But “it works” and “it’s the right architecture” are different answers.

What would be different

Factor XII says admin processes should run in “one-off processes.” A process that runs once, for a specific purpose, against the production environment. The entrypoint is not one-off — it runs every time a container starts, including restarts, scaling events, and Kubernetes node movements.

Three alternatives exist, each with a different answer to the question of ownership:

A Kubernetes init container runs before the main container starts, in the same pod. It could run the migration, exit, and let the main container start only after it succeeds. The migration is isolated from the application runtime. The downside: the init container is another image to build and maintain, and it runs on every pod start — so a 14-service platform starting simultaneously still has a potential race.

A Kubernetes Job runs once, on demand or triggered by a deployment pipeline. It can be made to run before any pods are updated — serial, isolated, with a clear success or failure signal. The race condition goes away. The complexity moves to the deployment process: the Job must complete before the Deployment rollout begins, and the CI pipeline must coordinate both.

A Helm hook is the same concept expressed declaratively in the Helm chart. A pre-upgrade hook runs the migration before the application pods are updated. It’s the most idiomatic Kubernetes answer. It also means the Helm chart is now responsible for running migrations — a decision that belongs to whoever owns the chart.

That last sentence is why the entrypoint hasn’t changed. Moving migrations out of the application means deciding that the deployment infrastructure — not the application itself — is responsible for the schema. It’s a governance question as much as a technical one, and governance questions take longer to resolve than code changes.
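To make the Job alternative concrete, here is roughly what the pipeline side could look like. This is a sketch under assumptions: the function name, manifest paths and timeouts are invented for illustration, and KUBECTL is overridable only so the sequence can be dry-run without a cluster.

```shell
# KUBECTL can be overridden (for a dry run or a test double).
KUBECTL=${KUBECTL:-kubectl}

# deploy_with_migration JOB_MANIFEST JOB_NAME DEPLOY_MANIFEST DEPLOY_NAME
# Runs the migration Job to completion before touching the Deployment.
deploy_with_migration() {
    job_manifest=$1 job_name=$2 deploy_manifest=$3 deploy_name=$4

    $KUBECTL apply -f "$job_manifest" || return 1
    # Blocks until the Job reports Complete. A Job that fails is only
    # noticed here when the timeout expires.
    $KUBECTL wait --for=condition=complete "job/$job_name" --timeout=300s || return 1
    $KUBECTL apply -f "$deploy_manifest" || return 1
    $KUBECTL rollout status "deployment/$deploy_name" --timeout=300s
}

# Usage (all names are placeholders):
#   deploy_with_migration k8s/migrate-job.yaml migrate k8s/authentication.yaml authentication
```

The ordering is the whole point: the Deployment manifest is not applied until the Job has completed, so exactly one process runs the migration per release. In practice the Job would probably need a name unique to each release, since most of a Job’s spec cannot be changed once it exists.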

The honest end

The migration block in the entrypoint is two lines. Literally: the if [ "$( find ./migrations... )" ] guard, and the php bin/console doctrine:migrations:migrate that follows. Eleven other factors have clean resolutions. The cache moved to Redis. The logs go to stdout. The filesystem is an S3 bucket. The CI assembles production images from the same commit it tests. The secrets don’t travel in image layers.

Factor XII has an answer. It’s just not the final one.

The migrations run at startup, with a real database, with atomicity, with a bounded retry window. That’s better than running at build time against nothing. Whether they eventually move to a Job or a Helm hook is a conversation about who owns the schema — a question that a kubectl apply can’t answer.

Ready Is Not the Same as Started

Sun, 17 May 2026 10:00:00 +0000

The rolling deploy looked clean. A new pod started. Kubernetes saw the healthcheck pass — php -v returned zero — and began routing traffic to the new container.

For the next forty seconds — out of a possible sixty — that container was polling for the database.

Requests that landed on it during that window got errors. Not many — the window was short — but enough to show up as noise in the monitoring. The kind of noise that gets dismissed as a transient network issue and filed nowhere. The deploy succeeded. The pod eventually became ready. The mechanism that caused it was still there, waiting for the next deploy.

The entrypoint script does five things before FrankenPHP starts: copy a version file, verify the vendor directory, wait up to sixty seconds for the database, run pending migrations, install assets and set filesystem permissions. In Docker Compose, this is invisible. In Kubernetes, the gap becomes traffic.

The gap between started and ready

Kubernetes decides whether to send traffic to a pod by watching its readiness probe. A pod whose readiness probe passes receives requests. A pod whose readiness probe fails is removed from the load balancer rotation until it recovers. This is the mechanism that makes rolling deploys safe: Kubernetes doesn’t cut over to a new pod until that pod says it’s ready.

The compose.yaml defines a healthcheck on every service:

healthcheck:
    test: [ "CMD", "php", "-v" ]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 10s

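php -v succeeds the moment the PHP binary is present, which is true from the first millisecond of container life. The start_period: 10s is a grace period during which failed checks don’t count against retries; it doesn’t delay the checks or make them any stricter. With interval: 30s, the first check runs about thirty seconds after the container starts, and it passes. But the entrypoint polling loop runs for up to sixty seconds before FrankenPHP even starts. The container is reported healthy while the application is still waiting for the database.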

The Dockerfile has a better signal:

HEALTHCHECK --start-period=60s CMD curl -f http://localhost:2019/metrics || exit 1

Port 2019 is Caddy’s admin endpoint, which also exposes its built-in metrics, embedded directly in FrankenPHP. The endpoint is Prometheus-compatible and only responds once Caddy’s HTTP stack is fully initialized and PHP workers are accepting connections. php -v exits in fifty milliseconds regardless of what the application is doing: it checks the binary, not the server. :2019/metrics only answers when the server is actually serving. It is also not an endpoint added just for the probe: every service in the platform already has it scraped by Prometheus, so the signal is live regardless of any healthcheck configuration.

That’s closer. But in Kubernetes, the HEALTHCHECK instruction is ignored entirely. Kubernetes uses its own probe configuration. Without explicit probe definitions in the Kubernetes manifests, there are no readiness checks — and a pod is considered ready the moment its container starts.

Which means: pod starts, entrypoint begins polling, Kubernetes routes traffic, application is not yet serving. Requests arrive at a container that isn’t ready to handle them.
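The difference between the two checks fits in a few lines of shell. This is a hypothetical helper, not something from the platform’s images: PROBE_URL and CURL are overridable variables introduced for the sketch, and the two-second timeout is an arbitrary choice.

```shell
# Hypothetical readiness check: passes only when the server answers,
# unlike `php -v`, which passes as soon as the binary exists.
PROBE_URL=${PROBE_URL:-http://localhost:2019/metrics}
CURL=${CURL:-curl}

ready_check() {
    # -f: treat HTTP error statuses as failure; --max-time: never hang the probe.
    $CURL -fsS -o /dev/null --max-time 2 "$PROBE_URL"
}

# Usage:
#   ready_check && echo "serving" || echo "not ready yet"
```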

Three signals, three questions

Kubernetes separates container lifecycle into three distinct questions, each with its own probe type:

startupProbe — “Has the application finished starting?” Fires repeatedly until it passes, then hands off to liveness. Prevents the liveness probe from killing a container that’s legitimately slow to initialize. For a container whose entrypoint can take sixty seconds, this is the right tool.

readinessProbe — “Is the application ready to handle requests?” Fails and passes throughout the container’s life. When it fails, the pod is removed from the load balancer. This is what makes a rolling deploy safe.

livenessProbe — “Is the application still alive?” If it fails, Kubernetes restarts the container. Meant to catch hung processes, not slow startups.

The sixty-second polling loop belongs in the startupProbe’s patience, not in application code:

startupProbe:
    httpGet:
        path: /metrics
        port: 2019
    failureThreshold: 12    # 12 attempts × 5s = 60s max
    periodSeconds: 5

Once the startupProbe passes, a readinessProbe on the same endpoint takes over, telling Kubernetes when the pod is safe to receive traffic, and a livenessProbe watches for hung processes. But the startupProbe is the one that absorbs the slow start. The entrypoint polling loop becomes redundant: its job was to keep the container alive while the database caught up. Without it, the application attempts to connect, fails, and the container exits. Kubernetes restarts the container, with a growing back-off between attempts, and the startupProbe begins a fresh retry cycle each time, until the database responds and the application starts cleanly. The retry responsibility moves from inside the entrypoint to the orchestrator, which is exactly where it belongs.
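The arithmetic behind the probe’s patience is simple enough to script when tuning it. A small sketch, with hypothetical function names; the optional third argument stands for initialDelaySeconds, which the manifest above leaves at its default of zero.

```shell
# startup_budget FAILURE_THRESHOLD PERIOD_SECONDS [INITIAL_DELAY_SECONDS]
# Approximate upper bound, in seconds, that a startupProbe tolerates
# before Kubernetes restarts the container.
startup_budget() {
    echo $(( ${3:-0} + $1 * $2 ))
}

# failure_threshold_for SECONDS PERIOD_SECONDS
# Smallest failureThreshold that covers SECONDS (rounds up).
failure_threshold_for() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

startup_budget 12 5          # prints 60
failure_threshold_for 60 5   # prints 12
failure_threshold_for 61 5   # prints 13
```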

The migration problem

The polling loop is the most visible issue, but the migrations create a subtler one.

With a rolling deploy and two replicas, Kubernetes starts a new pod while the old one still serves traffic. Both pods run the same entrypoint. Both reach doctrine:migrations:migrate.

Doctrine’s migration table tracks which migrations have already executed, so a completed migration won’t run twice. But if two pods start simultaneously and both see a pending migration, both attempt to run it at the same time. Whether that’s safe depends on the migration: additive schema changes are usually fine; destructive ones less so. And you don’t get to choose which ones run on a deploy that didn’t expect to coordinate. --all-or-nothing wraps migrations in a transaction and rolls back everything if one fails — it’s about atomicity within a single run, not coordination across processes.

The cleaner approach separates the two concerns into two init containers: one that waits for the database, one that runs migrations. The main container starts only after both complete:

initContainers:
    - name: wait-for-db
      image: authentication:latest
      command: ["php", "bin/console", "dbal:run-sql", "-q", "SELECT 1"]
    - name: migrate
      image: authentication:latest
      command: ["php", "bin/console", "doctrine:migrations:migrate", "--no-interaction", "--all-or-nothing"]

Both init containers reuse the application image. That’s not waste: they need the same PHP binary and the same environment wiring to reach the database and resolve the migration classes. A lighter purpose-built image would reduce startup overhead, but would require maintaining a separate PHP installation in sync with the main image.

Even with init containers, multiple pods starting simultaneously (initial deploy, after a node failure, or under autoscaling pressure) will each attempt to run migrations. Solving that properly, through a Helm pre-upgrade hook, a maxSurge: 0 strategy, or a separate migration Job, is a topic in itself. What matters here is that the entrypoint is the wrong place to host that decision: it can’t coordinate across pods, and it ties migration execution to application startup in a way that’s hard to untangle later. The question of which approach fits this codebase, and why the entrypoint hasn’t been replaced, gets its own treatment in the next article in this series.

Factor XII of the twelve-factor methodology — admin processes run in the same environment as the application — is satisfied either way. The question is whether “same environment” means “same entrypoint script” or “same image, separate process”. In Kubernetes, the latter is safer.

What the entrypoint’s real job is

Strip out the database wait (now a startupProbe or init container), the migrations (now an init container or Job), and the assets install (a build-time operation that belongs in the Dockerfile), and the entrypoint has one remaining job: start the application.

exec docker-php-entrypoint "$@"

Factor IX of the twelve-factor app asks for fast startup and graceful shutdown. A container whose startup takes sixty seconds because it’s waiting for external dependencies is not fast. It means rolling deploys are slow, recovery after a crash is slow, and horizontal scale-out creates a sixty-second gap before each new pod contributes.

Fast startup is not just a nice-to-have. It’s what makes the rest of the cloud model work. When a pod can start in seconds, the orchestrator can scale aggressively and recover quickly. When it takes a minute, you add headroom everywhere — longer probe timeouts, larger deployment windows, more conservative scaling policies — and the system becomes rigid.

The Docker Compose tax

The entrypoint accumulates these responsibilities for a reason. In Docker Compose, there is no init container concept. There is no startupProbe. Services declare depends_on, but without health conditions, that’s just startup ordering — not readiness. The entrypoint fills the gap.

This is not a design flaw. It’s a reasonable adaptation to the constraints of Docker Compose. The script works. It handles edge cases (the database timeout, unrecoverable errors, missing migrations directory). Someone tested it.

The issue is the assumption that the same script works equally well in Kubernetes. It runs. The application eventually starts. But it bypasses the probe system that makes Kubernetes deployments reliable, and it puts migration responsibility in a place where coordination across pods is difficult to reason about.

Several of the changes in this series (media storage, secrets in image layers, log handlers, service dependencies, CI environment parity, cache adapters) were changes to application code or configuration. This one is different. It requires the infrastructure to gain awareness of what “ready” means for this application, and it requires the entrypoint to give up responsibilities it currently owns.

That’s a harder conversation. But the startupProbe is waiting for it.

The Cache That Was Lying to Us

Sat, 16 May 2026 15:00:00 +0000

The first time we ran two replicas of the same Symfony service behind a load balancer, everything looked fine. Health checks passed. Traffic split cleanly. Response times were good.

Then someone noticed the rate limiter was acting strange. Hit the API five times, get blocked. Hit it again on the next request, get through. Depending on which pod answered, you were a different person.

That was the cache talking. One config line, replicated across thirteen services, was blocking horizontal scaling entirely.

One config file, thirteen times

We were preparing a platform of thirteen Symfony microservices to move to Kubernetes. The stack was already in good shape: FrankenPHP for the HTTP server, multi-stage Dockerfiles, a GitLab CI that pushed tagged images to a cloud registry. The pieces were there. We just needed to verify nothing would break when we started scaling pods horizontally.

A good checklist for that kind of audit is the twelve-factor app methodology — twelve principles for building software that runs cleanly in cloud environments. Most factors were already covered without us doing anything deliberate about it.

Factor VII (port binding) came for free. FrankenPHP embeds Caddy directly into the PHP process. The container exposes its own HTTP endpoint, no Apache or Nginx to bolt on. The image is self-contained, which is exactly what the factor requires:

HEALTHCHECK --start-period=60s CMD curl -f http://localhost:2019/metrics || exit 1

Factor II (dependencies) was handled by composer.json and the Dockerfile extensions. Factor X (dev/prod parity) was covered enough for our scope: same image, same backing services locally and in CI, which is the part that actually matters for what we were auditing.

Then I got to Factor VI.

The problem with “it works on one server”

Factor VI says processes must share nothing. Nothing written to disk between requests, nothing in local memory that another instance can’t see. If you need to persist state, put it in a backing service — a database, a cache cluster, a queue. The process itself stays disposable.

I opened authentication/config/packages/cache.yaml. Then content/config/packages/cache.yaml. Then media/config/packages/cache.yaml.

framework:
    cache:
        app: cache.adapter.filesystem

Thirteen services. Thirteen times, word for word.

Every instance of every service was writing its cache to the local filesystem. Which meant every pod had its own private cache, invisible to every other pod. When the load balancer sent a request to pod A, it got pod A’s cached version of reality. Pod B had built its own. They might have been generated at different times, from different source data, or one of them might not have been built yet at all.

The rate limiter was the most visible symptom because it had a counter. But the same divergence affected every piece of data we were caching: serializer metadata, route collections, Doctrine result caches. Two users sending identical requests could get different responses depending on which node happened to pick up the connection.
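The divergence is easy to reproduce in miniature. The following is a toy model, not the platform’s rate limiter: each “pod” is just a temporary directory standing in for its local filesystem cache, and the function names are invented for the illustration.

```shell
# hit CACHE_DIR: increments a request counter stored in CACHE_DIR
# and prints the new value.
hit() {
    counter_file="$1/rate_limit_count"
    count=0
    [ -f "$counter_file" ] && count=$(cat "$counter_file")
    count=$((count + 1))
    echo "$count" > "$counter_file"
    echo "$count"
}

# simulate REQUESTS DIR_A DIR_B: alternates requests between two "pods"
# and prints the highest count any single pod observed.
simulate() {
    requests=$1 dir_a=$2 dir_b=$3
    i=0 highest=0
    while [ "$i" -lt "$requests" ]; do
        if [ $((i % 2)) -eq 0 ]; then seen=$(hit "$dir_a"); else seen=$(hit "$dir_b"); fi
        [ "$seen" -gt "$highest" ] && highest=$seen
        i=$((i + 1))
    done
    echo "$highest"
}

pod_a=$(mktemp -d); pod_b=$(mktemp -d); shared=$(mktemp -d)
simulate 10 "$pod_a" "$pod_b"     # private caches: prints 5
simulate 10 "$shared" "$shared"   # shared backend: prints 10
rm -rf "$pod_a" "$pod_b" "$shared"
```

Ten requests, and with private storage no pod ever sees more than five of them. A limit of five per client would never trigger cleanly; a limit of ten would never trigger at all.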

Redis was already there

This is the part that stings a little. Redis was already in the stack. Every service had it configured via SncRedisBundle:

# config/packages/snc_redis.yaml — present on all 13 services
snc_redis:
    clients:
        default:
            type: 'phpredis'
            alias: 'default'
            dsn: '%env(IN_MEM_STORE__URI)%'

Factor IV of the twelve-factor app says backing services should be attached resources, interchangeable through configuration. Redis was exactly that: reachable via an environment variable, ready to be swapped for a managed instance in the cloud. The plumbing was done. We just weren’t using it for the application cache.

Some services even had it right for specific pools. The rate limiter in the authentication service:

pools:
    rate_limiter.cache:
        adapter: cache.adapter.redis

Which explains the inconsistency we saw first. The rate limit count went to Redis (shared across pods). The cache backing the rate limit check went to the filesystem (local to the pod). Two sources of truth, one invisible to the other.

The fix was one changed line per service, plus one line pointing the adapter at the existing client:

framework:
    cache:
        app: cache.adapter.redis
        default_redis_provider: snc_redis.default

Thirteen files. Thirteen identical changes. The kind of fix that makes you feel like you should have caught it earlier, except it’s perfectly invisible when you’re running a single instance.
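With thirteen copies of the same file, a quick audit is worth scripting. A minimal sketch, assuming the services sit side by side under one root directory with the config/packages/cache.yaml layout shown earlier; the function name is hypothetical.

```shell
# services_using_adapter ROOT ADAPTER
# Lists service directories under ROOT whose config/packages/cache.yaml
# mentions ADAPTER (for example cache.adapter.filesystem).
services_using_adapter() {
    root=$1 adapter=$2
    for file in "$root"/*/config/packages/cache.yaml; do
        [ -f "$file" ] || continue
        if grep -qF "$adapter" "$file"; then
            # Strip the root prefix, then everything after the service name.
            service=${file#"$root"/}
            echo "${service%%/*}"
        fi
    done
}

# Usage, from the directory that contains the service checkouts:
#   services_using_adapter . cache.adapter.filesystem
```

An empty result after the change is the signal that no service was missed.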

What needs to move to Redis

The filesystem cache violated Factor VI (processes carry local state they shouldn’t) and Factor VIII (you can’t scale out without sharing that state). They’re the same problem seen from two angles: VI describes what’s wrong, VIII describes what you can’t do because of it.

With a shared cache backend, a second pod is safe. The two pods build the same cache, see the same invalidations, agree on the same rate limits. You can add a third pod under load and remove it when traffic drops. The orchestrator handles it; the application doesn’t need to know.

Without it, horizontal scaling is a liability. More pods means more divergence, more “works on my machine” bugs that are impossible to reproduce locally because local only runs one container.

Sessions had the same problem — and potentially a worse one. Twelve of the thirteen services were using session.storage.factory.native — which writes sessions to the filesystem by default. A user whose request lands on pod A gets a session tied to pod A. Their next request goes to pod B. Session gone, they’re logged out. Only one service had RedisSessionHandler configured.

The partial mitigation is that most of the platform runs stateless JWT-based APIs, so session usage is limited. But “limited” isn’t “zero”. The services that do create sessions — authentication flows, temporary state during OAuth handshakes — have a user-visible failure mode waiting for the second pod. Either those sessions get moved to Redis, or the code that creates them gets removed. Leaving them as-is is a decision that waits for the first user whose session disappears without explanation.

The other kind of state

Redis fixes the cross-pod problem. FrankenPHP introduces a different one worth knowing about.

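In the standard PHP-FPM model, each request starts from a clean slate. The worker processes themselves are reused, but everything the application built in memory (every object, every cached value, every computed result) is torn down with the response. The application is stateless between requests by construction.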

FrankenPHP has a worker mode that doesn’t follow that model. In worker mode, a single PHP process boots once, loads the kernel, wires the container, and handles multiple successive requests without restarting. Request throughput improves: no autoloader cold start, no container rebuild per request, fewer allocations. The tradeoff is that the PHP process now has a lifecycle that spans requests.

For cache, this adds a wrinkle. An array adapter or APCu pool accumulates entries across requests on the same worker. A cache invalidation pushed to Redis reaches the other pods immediately — but doesn’t clear what’s sitting in a worker’s in-process memory. Two requests on the same pod can see different things: one hits a warm in-memory entry, the next triggers a Redis fetch after the in-process entry expires.

The platform keeps worker mode disabled (APP__WORKER_MODE__ENABLED=false). It’s available — the infrastructure is there, the flag is wired — but it’s not active. The performance gain didn’t justify the audit. Every cache pool would need to be verified against worker-mode semantics; every place where state leaks between requests would become a potential bug.

The conservative position: keep PHP stateless at the process level even when the runtime doesn’t require it. Factor VI’s shared-nothing principle applies not just to the filesystem — it applies to the process itself.

What was already working

To be fair to the codebase: the Symfony Scheduler was already using Redis for distributed locks:

$schedule->lock($this->lockFactory->createLock('schedule_purge'));

In a multi-pod environment, you don’t want five instances running the same purge job simultaneously. The lock prevents it. Redis makes the lock visible across pods. Whoever wrote the scheduler knew exactly what they were doing.

The same reasoning just hadn’t propagated to the cache configuration — probably because when you’re running a single instance, cache.adapter.filesystem is invisible. It works, it’s fast, it requires zero configuration. The problem only appears at two.

The four questions

Factor VI catches most applications off guard during a cloud migration. Not because developers don’t know about stateless processes — they usually do — but because the filesystem is always there, and the problem stays hidden until you try to run a second instance.

Before scaling a Symfony service horizontally, four questions are worth answering:

Where does the application cache go? (cache.adapter.filesystem needs to become cache.adapter.redis)
Where do sessions go? (session.storage.factory.native needs Redis — or remove sessions entirely if you’re JWT-only)
Does anything write to var/ at runtime that another pod would need to read?
Is there anything in your code path that needs to be mutually exclusive across pods? (If yes, that’s a job for the Symfony Lock component backed by Redis, not a local mutex.)

If the answers all point to shared backing services, you’re ready. If any of them points to the local filesystem, production will find the pod that built its cache three hours ago and serve it to the user who least expects it.
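The third question, about runtime writes under var/, is the hardest to answer by reading configuration. One way to approach it is to look at what actually changed on disk after exercising the service. This is a sketch with illustrative paths and a hypothetical function name, not a tool from the platform.

```shell
# files_written_since MARKER DIR
# Lists regular files under DIR modified more recently than MARKER.
files_written_since() {
    find "$2" -type f -newer "$1" | sort
}

# Usage (paths are illustrative):
#   touch /tmp/deploy-marker        # once the container has finished booting
#   ...send some traffic to the service...
#   files_written_since /tmp/deploy-marker var/
```

Anything the listing shows is a candidate for the question: would another pod ever need to read this file?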

Fifteen Minutes Before the First Test

Sat, 16 May 2026 10:00:00 +0000

The pipeline had two stages that had nothing to do with code: provision and deprovision. Between them, in sequence, came phpunit, phpmetrics, and behat.

stages:
  - build
  - provision
  - phpunit
  - phpmetrics
  - behat
  - deprovision
  - deploy

Before the first assertion ran, fifteen minutes had passed. Terraform had cloned an infrastructure repository, authenticated to Azure, and applied a VM configuration. Ansible had connected to the new VM, installed PHP, configured the application, wired up a database and a Redis instance. Then the tests ran. Then Terraform destroyed what Ansible had built.

For every pipeline. From every branch. For every pull request, from open to merge.

What those fifteen minutes were missing

The provision stage set up two services: PostgreSQL and Redis. Three services that the application depended on in production were absent: RabbitMQ, MinIO, and Varnish.

RabbitMQ processed all asynchronous work — 56 consumers across 14 microservices. MinIO handled media storage. Varnish fronted the HTTP cache. In CI, none of them existed. Tests that exercised message queuing or file storage had two options: skip these paths, or leave them untested until staging. Varnish is a different case: tests hit the application directly and intentionally bypass the cache layer, so its absence in CI is a deliberate choice rather than a gap.

This is the problem Factor X describes as the environment gap. The gap here wasn’t a matter of configuration — it was structural. The VM was built by Ansible from a script in a separate repository. It wasn’t a container image. It wasn’t versioned alongside the application. If a branch modified the RabbitMQ message topology, there was no way to test that modification in CI. The topology change and the code that relied on it would only meet in staging.

The provisioning job itself is part of the problem:

launch_vm:
  stage: provision
  script:
    - git clone git@gitlab.internal:infra/ci-vm.git
    - cd ci-vm
    - az login --service-principal -u $ARM_CLIENT_ID ...
    - terraform apply -var "prefix=${CI_PIPELINE_ID}-vm" ...
    - sleep 45
    - ansible-playbook behat/test-env.yml ...

The sleep 45 is there because Ansible needs the VM to finish booting before it can connect. It’s not an oversight — it’s the minimum time a freshly provisioned VM needs before SSH works. It’s baked into the process.

What replaced it

The new pipeline has no provision stage. It has no deprovision stage. The environment is the images, and the images exist before the tests begin.

Each test job declares its dependencies as Docker services:

services:
  - name: $REGISTRY_URL/platform/rabbitmq:$CI_COMMIT_REF_SLUG
    alias: rabbitmq
  - name: $REGISTRY_URL/platform/minio:$CI_COMMIT_REF_SLUG
    alias: minio
  - name: redis:7.4.1
    alias: redis
  - name: $ARTIFACTORY_URL/postgresql:13
    alias: postgresql

The services start in parallel when the job begins. Before the test script runs, a before_script waits for all of them to be ready:

before_script:
  - $CI_PROJECT_DIR/dockerize
      -wait tcp://postgresql:5432
      -wait tcp://rabbitmq:5672
      -wait tcp://minio:9000
      -wait tcp://redis:6379
      -timeout 120s

From pipeline start to first assertion: ninety seconds — assuming images are already cached on the runner; a cold pull adds time, but becomes negligible once the pipeline has run once on a given branch.
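What the wait step does is nothing more exotic than polling TCP ports. The sketch below is a rough shell equivalent for illustration only: it is not how dockerize is implemented, the function names are invented, and it assumes an nc binary that supports the -z flag.

```shell
# split_target tcp://HOST:PORT  ->  prints "HOST PORT"
split_target() {
    target=${1#tcp://}
    echo "${target%:*} ${target##*:}"
}

# wait_for_tcp TARGET TIMEOUT_SECONDS
# Assumption: `nc` with the -z flag is available in the job image.
wait_for_tcp() {
    host_port=$(split_target "$1")
    host=${host_port% *}
    port=${host_port#* }
    remaining=$2
    until nc -z "$host" "$port" 2>/dev/null; do
        remaining=$((remaining - 1))
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
    done
}

# Usage:
#   wait_for_tcp tcp://postgresql:5432 120 || exit 1
```

An open port says the service is accepting connections, not that it has finished loading its state. For images that carry their state built in, that distinction is usually small.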

What $CI_COMMIT_REF_SLUG means

The timing is the visible result. What produces it is more interesting: the image names.

$REGISTRY_URL/platform/rabbitmq:$CI_COMMIT_REF_SLUG is not the official RabbitMQ image from Docker Hub. It’s an image built by the same pipeline, from the same branch, at the same commit as the code being tested. The RabbitMQ image carries the topology: a definitions.json with every exchange, every queue, every binding, every dead-letter configuration — versioned in git alongside the application that depends on them.

If a branch modifies the messaging topology, the CI pipeline builds a new RabbitMQ image that includes those modifications, then runs the tests against it. The topology change and the code that relies on it are tested together, at the same commit, before anything reaches staging.

The same logic applies to MinIO, as described in the first article in this series: the MinIO image carries preloaded test fixtures. The CI environment doesn’t need a setup step to populate storage. The state is built in.

The test runner itself follows the same pattern. Each job uses a debug variant of the application image — built from the same branch, same commit — with the test dependencies included:

image: $REGISTRY_URL/platform/$service:$CI_COMMIT_REF_SLUG-debug

The whole environment assembles from artifacts built at the same point in the git history.
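To see how a branch name turns into those image tags, here is an approximation of the slug rule GitLab documents for CI_COMMIT_REF_SLUG: lowercase, everything outside a-z and 0-9 replaced by a dash, at most 63 characters, no leading or trailing dash. The helper names and the registry host are made up for the example.

```shell
# ref_slug REF_NAME: approximates GitLab's CI_COMMIT_REF_SLUG.
ref_slug() {
    printf '%s\n' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | sed 's/[^a-z0-9]/-/g' \
        | cut -c1-63 \
        | sed 's/^-*//; s/-*$//'
}

# image_ref REGISTRY SERVICE REF_NAME [SUFFIX]
image_ref() {
    echo "$1/platform/$2:$(ref_slug "$3")${4:+-$4}"
}

image_ref registry.example.com rabbitmq 'feature/New_Topology'
# prints registry.example.com/platform/rabbitmq:feature-new-topology
image_ref registry.example.com authentication 'feature/New_Topology' debug
# prints registry.example.com/platform/authentication:feature-new-topology-debug
```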

What this required dropping

Behat and the provisioned VM were coupled. The Behat test suite ran against an HTTP server on the VM; removing the VM meant removing Behat.

That turned out not to be the obstacle it looked like. The Behat suite lived in a separate repository, required the VM to run, and had accumulated significant maintenance overhead. PHPUnit, running inside the application container with Docker services, covered the same scenarios through a more direct path: functional tests exercising the HTTP layer, unit tests for individual components, suites organized per feature area and generated dynamically into parallel CI jobs.

The BDD layer went away. The test coverage stayed — and could now run against the actual services.

Factor X, applied

Factor X is often read as “use the same database locally as in production.” That’s the simplest version. The deeper version is about the gap between what you test and what you ship.

The gap in the old pipeline was wide: a manually configured VM, missing key services, rebuilt from scratch on every run. The gap in the new pipeline is narrow: the CI assembles the environment from the same images as production, built from the same commit as the code under test.

The fifteen minutes of Terraform and Ansible were not just slow. They were building something that wasn’t what production ran, every time, before any test could begin. The ninety seconds of docker pull assemble exactly what production runs, and the tests that follow exercise that, not an approximation of it.

The Host That Hid the Graph

Fri, 15 May 2026 15:00:00 +0000

Every service in the platform had these six variables:

APP__GATEWAY__PRIVATE__HOST="platform.internal"
APP__GATEWAY__PRIVATE__PORT=80
APP__GATEWAY__PRIVATE__SCHEME="http"
APP__GATEWAY__PUBLIC__HOST="platform.internal"
APP__GATEWAY__PUBLIC__PORT=80
APP__GATEWAY__PUBLIC__SCHEME="http"

Thirteen services, six variables each, one value. Reading any service’s configuration, the architecture looked flat. Everything talked to the same host. That was the whole picture.

It wasn’t.

How the gateway worked

The gateway sat in front of every service and handled all inter-service traffic. A service calling the content API would construct a request to http://platform.internal/content/api/ — the gateway received it, identified the target from the URL path, and forwarded it to the right backend. Every inter-service HTTP client in framework.yaml followed the same pattern:

content.client:
    base_uri: "%http_client.gateway.base_uri%/content/api/"
    headers:
        Host: "%env(APP__GATEWAY__PRIVATE__HOST)%"

The http_client.gateway.base_uri parameter was assembled from the GATEWAY vars. The gateway knew where each service ran. The services didn’t need to know. From their perspective, everything was platform.internal.
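The routing idea can be sketched in a few lines of Python (an illustration language here, not the gateway's real implementation). The backend addresses and the `route` function are hypothetical, and the sketch assumes the gateway forwards the path unchanged, which the source does not confirm.

```python
from urllib.parse import urlsplit

# Hypothetical routing table: first path segment -> backend address.
BACKENDS = {
    "content": "http://content-backend:8080",
    "media": "http://media-backend:8080",
}


def route(url, backends=BACKENDS):
    """Pick the backend from the first path segment of the request URL."""
    path = urlsplit(url).path
    segment = path.lstrip("/").split("/", 1)[0]
    if segment not in backends:
        raise LookupError(f"no backend registered for '{segment}'")
    # Assumption: the path is forwarded as-is to the selected backend.
    return f"{backends[segment]}{path}"
```

The table lives in one place, which is exactly why no individual service ever had to list its own dependencies.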

This worked. For years, it worked well. Adding a service meant adding one DNS alias in the gateway config, not touching thirteen .env files. The gateway abstracted the topology. The services stayed decoupled from the infrastructure detail of who ran where.

What the gateway was absorbing

The abstraction had a cost that didn’t show up until you tried to read the system.

Looking at content’s env file, you saw six gateway variables and nothing else about inter-service communication. To find out that content called config, conversion, media, and shorty, you had to read framework.yaml. To find out that pilot called ten other services, you had to trace through the HTTP clients one by one and count.

The number was ten. Authentication, bam, config, content, conversion, media, product, shorty, sitemap, social. Ten of the platform’s thirteen services that pilot depended on at runtime, none of them visible from its configuration. Six variables said: talk to the gateway. They said nothing about the shape of what lay behind it.

That information existed — in the code, in the framework config, in the heads of the people who had built those integrations. It just didn’t live anywhere you could read at a glance.

What Kubernetes made explicit

On-premise, the gateway was a single resolvable hostname. One DNS record, one set of variables, one place to update. Kubernetes doesn’t work that way. Each service gets its own DNS name inside the cluster — content.namespace.svc.cluster.local, conversion.namespace.svc.cluster.local. Inter-service traffic goes directly, service to service, not through a shared gateway.

Moving to Kubernetes meant the gateway abstraction had to give way. Each service needed to know, concretely, where each of its dependencies lived. The six generic variables couldn’t express that.

The refactor replaced them with per-target HOST variables — one per service dependency, named for the target:

# content/.env — content calls these four services
APP__CONFIG__HOST="platform.internal"
APP__CONVERSION__HOST="platform.internal"
APP__MEDIA__HOST="platform.internal"
APP__SHORTY__HOST="platform.internal"

# pilot/.env — ten service dependencies
APP__AUTHENTICATION__HOST="platform.internal"
APP__BAM__HOST="platform.internal"
APP__CONFIG__HOST="platform.internal"
APP__CONTENT__HOST="platform.internal"
APP__CONVERSION__HOST="platform.internal"
APP__MEDIA__HOST="platform.internal"
APP__PRODUCT__HOST="platform.internal"
APP__SHORTY__HOST="platform.internal"
APP__SITEMAP__HOST="platform.internal"
APP__SOCIAL__HOST="platform.internal"

Each HTTP client in framework.yaml got its own base_uri built from its target’s HOST variable, and the Host header gave way to a User-Agent that identified the caller:

content.client:
    base_uri: "%env(APP__HTTP__SCHEME)%://%env(APP__CONTENT__HOST)%:%env(APP__HTTP__PORT)%/content/api/"
    headers:
        User-Agent: "Platform Content - %semver%"

The change isn’t cosmetic. In the old setup, the explicit Host header ensured requests reached the correct gateway virtual host regardless of URL resolution. In the new setup, each client points directly at its target’s DNS name — the right Host is derived from the base_uri automatically. The header slot doesn’t go empty: User-Agent now identifies the calling service, which surfaces in logs and distributed traces without any additional instrumentation.
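To make the assembly concrete, here is a small Python sketch of how a per-target base URI falls out of the variables. The variable names match the configuration above; the function name, the generalisation of the `/content/api/` path pattern to other targets, and the `platform` namespace in the usage note are assumptions.

```python
def build_base_uri(env, target):
    """Compose a client base URI from the shared scheme/port and the target's HOST."""
    scheme = env["APP__HTTP__SCHEME"]
    port = env["APP__HTTP__PORT"]
    host = env[f"APP__{target.upper()}__HOST"]
    # Assumption: every target follows the /<service>/api/ pattern shown for content.
    return f"{scheme}://{host}:{port}/{target.lower()}/api/"
```

With `APP__CONTENT__HOST="platform.internal"` the result matches the local setup; with a value such as `content.platform.svc.cluster.local` the same code points straight at the cluster-internal service.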

The discomfort of legibility

pilot’s env file went from six gateway variables to ten service-specific HOST variables. The file got longer. The architecture didn’t get simpler: the ten dependencies were there before and they’re still there now. What changed is that they’re readable.

Factor III says to store configuration in the environment. The old approach satisfied that literally: six variables, all in env files, none hardcoded. But variables that collapse the entire dependency graph into a single opaque hostname aren’t really configuration — they’re a shorthand that trades legibility for convenience. Factor III doesn’t ask only that configuration be externalized — it implicitly assumes the externalized configuration remains informative.

The refactor didn’t simplify anything. It made the complexity visible. pilot’s ten HOST variables document, in the .env file itself, the ten services it depends on. A new team member reading that file learns something real about the architecture. The old file taught them that there was a gateway.
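One way to see that the file now documents the graph: the dependency list can be read mechanically. A minimal Python sketch, assuming the `APP__<TARGET>__HOST` naming convention shown above holds for every dependency; the function name is hypothetical.

```python
import re

# Matches single-segment targets only, so the old APP__GATEWAY__PRIVATE__HOST
# style variables are deliberately not counted as dependencies.
HOST_VAR = re.compile(r"^APP__([A-Z0-9]+)__HOST=", re.MULTILINE)


def service_dependencies(env_text):
    """List the services an .env file declares as dependencies."""
    return sorted(name.lower() for name in HOST_VAR.findall(env_text))
```

Run against the old six-variable file, this returns an empty list. Run against the new content file, it returns the four services content calls.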

There’s a version of this story where you read the final state and conclude the team did unnecessary work — they replaced six variables with ten, all pointing at the same host anyway. In local development, platform.internal still resolves to the same place. The functional behavior didn’t change.

The change is in what the configuration communicates. In Kubernetes, the HOST values diverge: each target gets its own cluster-internal DNS name, different per environment. The variables now carry real information. The refactor prepared the config to be honest about a topology it had been quietly simplifying for years.

No Witnesses

Fri, 15 May 2026 10:00:00 +0000

The service had crashed. We had the alert. We had the timestamp down to the second. We had Loki open and a query ready.

What we didn’t have was any logs from the five minutes before the crash.

Promtail was running. It was healthy. It had been collecting logs from every other service without issue. But for this one, in the window that mattered, there was nothing. The service had crashed without leaving a trace.

The setup that looked correct

The logging stack was reasonable. Each service wrote structured JSON to stdout using Monolog’s logstash formatter:

stdout:
    type: stream
    path: "php://stdout"
    level: "%env(MONOLOG_LEVEL__DEFAULT)%"
    formatter: 'monolog.formatter.logstash'

Promtail collected container output via the Docker socket, parsed the JSON, extracted labels, pushed to Loki:

scrape_configs:
    -
        job_name: docker
        docker_sd_configs:
            -
                host: unix:///var/run/docker.sock
                refresh_interval: 5s
        pipeline_stages:
            -
                drop:
                    older_than: 168h
            -
                json:
                    expressions:
                        level: level
                        msg: message
                        service: service
            -
                labels:
                    level:
                    service:
        relabel_configs:
            -
                source_labels: [ '__meta_docker_container_log_stream' ]
                target_label: stream

Two stages in that pipeline do more work than the others. The json stage extracts level and service from each log line; the labels stage immediately following promotes them to Loki index labels, making {service="content", level="error"} a direct index lookup rather than a full-text scan across stored lines. The stream relabeling preserves whether a line came from stdout or stderr — a distinction that becomes queryable once Monolog sends errors to stderr and everything else to stdout. The drop older_than: 168h stage is a safety valve: if Promtail restarts after a long gap and replays buffered lines, anything older than seven days is discarded before reaching Loki.
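The effect of those stages on a single line can be modelled in Python. This is a simplified sketch of the behaviour described above, not Promtail's implementation: the function name and the handling of non-JSON lines are assumptions.

```python
import json
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=168)  # mirrors drop.older_than: 168h (seven days)


def process_line(line, timestamp, stream, now):
    """Return (labels, line) for a kept line, or None if the drop stage discards it."""
    if now - timestamp > MAX_AGE:
        return None
    labels = {"stream": stream}  # from the relabel rule on the container log stream
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        record = {}  # assumption: non-JSON lines are kept, just without extra labels
    if isinstance(record, dict):
        for key in ("level", "service"):  # only these two are promoted to labels
            if key in record:
                labels[key] = str(record[key])
    return labels, line
```

Note that `msg` is extracted by the json stage but never promoted: only `level` and `service` become index labels.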

In theory: logs go to stdout, Promtail reads stdout, logs appear in Loki. The twelve-factor app methodology describes exactly this model for Factor XI — treat logs as event streams, write to stdout, let the environment handle collection and routing.

The application had stdout. Promtail was reading stdout. What could go wrong.

What fingers_crossed takes with it

In production, the when@prod block replaced the simple stream handler with something more sophisticated:

when@prod:
    monolog:
        handlers:
            main:
                type: fingers_crossed
                action_level: error
                handler: main_group
                excluded_http_codes: [404]

The excluded_http_codes: [404] line is itself a tell: without it, every 404 from a scanner or crawler triggers a full buffer flush, dumping megabytes of debug logs for malformed URLs. Someone had already learned that the hard way.

fingers_crossed is a well-known Monolog pattern. The idea is elegant: don’t flood production logs with debug noise, but if something goes wrong, retroactively show what happened before the error. The handler buffers every log record in memory. The moment it sees an error, it flushes the entire buffer to the nested handler — giving you the full context leading up to the failure.

The problem is what happens when the failure isn’t a logged error. It’s an OOM kill. A SIGKILL from the orchestrator. A segfault. A process that stops responding and gets forcibly terminated.

In those cases, fingers_crossed never reaches its action_level. The buffer exists, full of the last five minutes of activity, and it vanishes with the process. The logs were there. They were in memory. They died before reaching stdout.
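The failure mode is easy to reproduce with a toy model. The Python class below is a sketch of the buffering behaviour described above, not Monolog's code; the class name is hypothetical and a plain list stands in for stdout.

```python
import logging


class FingersCrossedBuffer:
    """Toy model: hold records in memory until one reaches the action level."""

    def __init__(self, sink, action_level=logging.ERROR):
        self.sink = sink            # list standing in for the nested stdout handler
        self.action_level = action_level
        self.buffer = []
        self.triggered = False

    def handle(self, level, message):
        if self.triggered:
            # After activation, records pass straight through.
            self.sink.append((level, message))
            return
        self.buffer.append((level, message))
        if level >= self.action_level:
            self.triggered = True
            self.sink.extend(self.buffer)
            self.buffer.clear()
```

If the process is killed before any record reaches `action_level`, `sink` is still empty: everything the service logged exists only in `buffer`, and `buffer` goes away with the process.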

Factor IX of the twelve-factor app talks about disposability: processes should start fast and stop gracefully. On a clean shutdown (SIGTERM), a well-behaved process finishes its current work and exits. But crashes are not clean shutdowns, and memory buffers are not crash-safe. The service had been disposable in the sense that we could restart it; it was not disposable in the sense that its exit was transparent.

The files nobody was reading

There was a second problem, quieter but just as persistent.

Every service had a main_group handler that routed logs to two destinations in parallel:

main_group:
    type: group
    members: [main_file, stdout]

main_file:
    type: stream
    path: "%kernel.logs_dir%/%kernel.environment%.log"
    formatter: "monolog.formatter.logstash"

var/log/prod.log was being written on every service, in every environment, including production. The same content that went to stdout also went to a file inside the container. The file grew without rotation. The file was not accessible to Promtail (which read from the Docker socket, not from the container filesystem). The file consumed disk space. Nobody was reading it.

The audit channel was worse:

audit_file:
    type: stream
    path: "%kernel.logs_dir%/audit.log"
    formatter: 'monolog.formatter.line'

audit:
    type: group
    members: [audit_file, stderr]
    channels: ['audit']

Audit logs went to stderr (visible to Promtail) and to audit.log (not visible to Promtail). The format in the file was a plain line format, not the structured JSON that Promtail expected. In practice, the audit trail existed in two places: one queryable, one buried in a container directory that survived only as long as the container did.

What Factor XI actually requires

The eleventh factor is direct about this: an app should not concern itself with routing or storage of its output stream. It writes to stdout. Everything else is the environment’s job.

That means no file handlers in production. Not as a backup. Not for audit trails. Not “just in case”. The moment an application starts managing files, it takes on responsibility for rotation, retention, disk space, and accessibility — none of which belong inside a container.

The fix for the file handlers is straightforward. In when@prod, remove every *_file handler and every group that includes one. The audit channel gets the same treatment: stderr only, structured JSON, no file:

when@prod:
    monolog:
        handlers:
            stdout:
                type: stream
                path: "php://stdout"
                # defaults to "warning" — overridable per-deploy via env var for targeted debugging
                level: "%env(default:default_log_level:MONOLOG_LEVEL__DEFAULT)%"
                formatter: 'monolog.formatter.logstash'

            stderr:
                type: stream
                path: "php://stderr"
                level: error
                formatter: 'monolog.formatter.logstash'

            main:
                type: group
                members: [stdout]
                channels: ['!event', '!http_client', '!doctrine', '!deprecation', '!audit']

            audit:
                type: stream
                path: "php://stderr"
                level: debug
                formatter: 'monolog.formatter.logstash'
                channels: ['audit']

stdout for the main channel. stderr for errors and audit. Nothing else. Promtail picks up both via the Docker socket. The container writes nothing to disk. And audit logs are now structured JSON, queryable in Loki alongside everything else.

The harder question about fingers_crossed

The file handlers were easy. fingers_crossed is more nuanced.

The pattern solves a real problem: in a busy production service, logging everything at debug level creates noise and cost. fingers_crossed lets you capture context without paying for it unless something actually goes wrong. It is a reasonable tradeoff when the failure mode you’re protecting against is an application-level error (an exception, a 500, a slow query).

It is not a reasonable tradeoff when the failure mode is a process crash. And in a Kubernetes environment, process crashes happen: OOM evictions, liveness probe failures, node pressure. Exactly the cases where you most need the logs.

One approach: keep fingers_crossed but reduce the buffer size. By default it keeps everything since the last reset. Set buffer_size: 50 and you cap memory usage, which also limits what gets lost on crash. You won’t have the full context, but you’ll have the last fifty records. This patches the blast radius rather than removing the root cause: the opacity still depends on an error threshold that may never fire.

Another approach: accept that debug logs are expensive and raise the default log level in production. Then you don’t need fingers_crossed at all — if info and above go directly to stdout, nothing is ever buffered.

The approach we landed on: drop fingers_crossed, raise the default level to warning, keep a debug override available via env var for targeted investigation. The logs we care about appear immediately. The ones we don’t are never written. Nothing is buffered.
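The same policy, expressed with Python's standard logging module as an analogy: a default of warning, an environment override, unbuffered JSON lines on stdout. The variable name comes from the configuration above; the formatter fields and function name are assumptions, and Python only accepts its own level names.

```python
import io
import json
import logging
import os
import sys


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "channel": record.name,
            "message": record.getMessage(),
        })


def build_logger(name, env=os.environ, stream=None):
    """Log straight to the stream; level defaults to warning unless overridden."""
    # Python knows DEBUG/INFO/WARNING/ERROR/CRITICAL, not every Monolog level.
    level_name = env.get("MONOLOG_LEVEL__DEFAULT", "warning").upper()
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.Logger(name)  # standalone logger, not registered globally
    logger.setLevel(level_name)
    logger.addHandler(handler)
    return logger
```

Records below the threshold are never written, and records at or above it are written immediately. There is no buffer to lose.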

Crashes don’t flush

Factor XI and Factor IX meet at the same point: a process dying mid-request. Another article in this series described the illusion of a service that worked perfectly on one pod but quietly misbehaved on two. This is the same illusion, one layer up: a service that appeared to log correctly, until the moment it most needed to.

The rule for production Monolog is blunt: if it doesn’t reach stdout or stderr before the process exits, it doesn’t exist. A file handler inside a container is invisible to the log collector and dies with the pod. A fingers_crossed buffer is invisible to the log collector and dies with the process.

Production tends to create the conditions where you need logs the most — OOM pressure, cascading failures, bad deploys — and those are exactly the conditions where both of these patterns fail you simultaneously. Write to stdout, default to a level that doesn’t require buffering, and make the override available for when you actually need to debug something. The logs will be there. They won’t be waiting for an error threshold that never fires.

What Survives the Build

Thu, 14 May 2026 15:00:00 +0000

At some point during a cloud migration audit, someone ran this:

docker run --rm <image> php -r "var_dump(require '.env.local.php');"

The output showed everything that composer dump-env prod had compiled into the image at build time. Which meant it showed everything that had been in the .env file when the image was built. Which meant it showed these, among others:

INFLUXDB_INIT_ADMIN_TOKEN=
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=admin123
BLACKFIRE_CLIENT_ID=
BLACKFIRE_CLIENT_TOKEN=
BLACKFIRE_SERVER_ID=
BLACKFIRE_SERVER_TOKEN=
NGROK_AUTHTOKEN=replace-me-optionnal

Twenty-five variables in total. Every credential that had accumulated in the root .env over three years, now permanent in an image layer.

How `dump-env` works

composer dump-env prod is a legitimate Symfony optimization. Instead of parsing .env files on every request, the runtime loads a pre-compiled PHP array from .env.local.php. Faster and simpler.

The problem is what it reads. The Dockerfile copies the repository into the image with COPY . ./, .env included. Then dump-env prod reads that file and compiles every variable into .env.local.php. The image ships with a frozen snapshot of the credentials that were in .env at build time.

Docker layers are immutable archives. Even if a subsequent step removed .env from the container filesystem, the layer containing it would still exist inside the image. docker save produces a tarball of every layer; extracting any file from any point in the build history is straightforward. The credentials are invisible at runtime. They are not gone.
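A simplified model of why deletion doesn't help, using in-memory tar archives in Python. Real image layers carry more metadata, and the `app/` path is an assumption, but the principle is the same: a later layer records a whiteout marker (the `.wh.` prefix from the OCI image layer format), while the earlier archive is left untouched.

```python
import io
import tarfile


def make_layer(files):
    """Build an in-memory tar archive from a {path: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_from_layer(layer, path):
    """Return the file's bytes if this layer contains it, else None."""
    with tarfile.open(fileobj=io.BytesIO(layer), mode="r") as tar:
        try:
            member = tar.getmember(path)
        except KeyError:
            return None
        return tar.extractfile(member).read()


# Layer 1: "COPY . ./" brings .env into the image.
layer_copy = make_layer({"app/.env": b"GF_SECURITY_ADMIN_PASSWORD=admin123\n"})
# Layer 2: a later "rm .env" only adds a whiteout entry hiding the file.
layer_rm = make_layer({"app/.wh..env": b""})
```

The merged filesystem hides `.env`, but anyone holding the image can still open the first archive and read it.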

Factor V calls this out directly: a build artifact should be environment-agnostic, with config arriving at the release step from outside. Once credentials are compiled in, the image is no longer portable. You can’t promote it across environments. You build twice and hope the second build behaves like the first.

How twenty-five variables accumulate

Before tracing how this gets fixed, it’s worth understanding how it happened.

The BLACKFIRE_* tokens are the easy case to understand. A team member sets up profiling, needs to share the configuration, and the repository is already open to everyone. One line in .env is the path of least resistance. The InfluxDB and Grafana credentials follow the same logic — shared tooling, shared repo, one commit.

Then there are the variables that reveal a different kind of drift. In some of the service-level .env files:

APP__RATINGS__SERIALS='{"brand1":{"fr":"12345"},...}'  # ~40 lines of JSON
APP__YOUTUBE__CREDENTIALS='{"brand1":{"client_id":"xxx","refresh_token":"yyy"},...}'

Audience measurement serial numbers. YouTube API refresh tokens per brand. These aren’t secrets in the Blackfire sense. They’re business data — the kind of values that vary between brands and environments, that someone decided to version in .env because they behaved like configuration and .env was where configuration lived.

Twenty-five variables is the sum of incremental decisions, none of which felt wrong in isolation. The problem is structural: when .env is the only answer available, everything starts looking like it belongs there.

Where things actually belong

Emptying the file required answering one question for each variable: where does this actually belong?

The answers revealed three categories that the team had never explicitly named:

Static config lives in code. Business rules, routing logic, Symfony parameter files — anything that doesn’t vary between deployments. A change requires a rebuild. The JSON blobs for audience measurement serials turned out not to be static config at all: they were queried from a dedicated Config service at runtime. They had no business being in a file.

Environment config varies between deployments: hostnames, connection strings, third-party credentials. This is what Factor III means by “config in environment variables” — real OS-level variables injected by the runtime, never files that travel with the code. In Kubernetes, this becomes a ConfigMap for non-sensitive values and a Kubernetes Secret for credentials. The choice for secrets management was SOPS — credentials are encrypted and committed to git, rather than stored in an external vault like Azure Key Vault or HashiCorp Vault. A vault trades simplicity for auditability: automatic rotation, centralized audit logs, workload identity-based access with no key to protect. SOPS trades those capabilities for a simpler operational model — no external service to query at deploy time, secrets travel through the normal code review process, git history serves as the audit trail. The accepted downsides are manual rotation and the responsibility of protecting the decryption key itself. For the team’s scale, the tradeoff was deliberate.

Dynamic config changes without a deployment: editorial parameters, per-brand thresholds, content moderation settings. It belongs in a database, managed through the application’s Config service. Some of what had accumulated in .env files was this category all along, passing as static defaults because it changed rarely enough that nobody noticed.

Once the categories had names, the variables sorted themselves. The root .env ended at four lines:

DOMAIN=platform.127.0.0.1.sslip.io
XDEBUG_MODE=off
SERVER_NAME=:80
APP_ENV=dev

Safe defaults. Nothing sensitive. dump-env prod now compiles empty strings; real values arrive at runtime from Kubernetes.

The PostgreSQL image

The PostgreSQL image used in CI has a hardcoded password:

FROM postgres:15
ENV POSTGRES_PASSWORD=admin123

This looks like the same problem. It isn’t, because the threat model is different. The CI database is ephemeral — it exists for the duration of a pipeline run, contains no real data, and runs in an isolated network. A hardcoded password on a throwaway test database is an acceptable risk, not a policy exception.

In production, the question doesn’t arise: the platform uses Azure Database for PostgreSQL Flexible Server, a managed PostgreSQL service. There is no Docker image. Credentials arrive via Helm chart injection, never touching a layer.

What survives the build now

The image that ships to production now contains a guarantee: var_dump(require '.env.local.php') returns only empty strings and safe defaults. The credentials aren’t there because they were never put there — they arrive at runtime, from outside.
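That guarantee is simple enough to check mechanically. A hedged Python sketch of such a check, assuming the compiled variables have already been exported to a plain mapping; the function name and the allowlist approach are illustrative, not the team's actual tooling.

```python
def leaked_values(compiled_env, safe_defaults):
    """Return names whose compiled value is neither empty nor an allowed default."""
    return sorted(
        name
        for name, value in compiled_env.items()
        if value != "" and safe_defaults.get(name) != value
    )
```

An empty result means the image carries only empty strings and the explicitly allowed defaults. Anything else names a value that was baked in at build time.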

That’s the responsibility boundary dump-env had been quietly erasing: the image is the application, the runtime is the environment. They should not know each other’s secrets.

The Ghost of the CI Runner

Thu, 14 May 2026 10:00:00 +0000

APP__COLD_STORAGE__FILESYSTEM_PATH="/home/jenkins-slave/share_media/media"
APP__COLD_STORAGE__FILESYSTEM_PATH_CACHE="/home/jenkins-slave/share_media/media/cache"
APP__COLD_STORAGE__RAW_IMAGE_PATH="/home/jenkins-slave/share_media/media_raw"
APP__SHARE_STORAGE__FILESYSTEM_PATH="/home/jenkins-slave/share_storage"

These lines were in the production .env of the media service. Not staging. Not a local override. Production, committed to the repository, read on every startup.

The paths end where you’d expect: /media, /share_storage. They start somewhere more surprising: /home/jenkins-slave, the home directory of a CI runner from an old Jenkins setup.

How a runner’s home directory ends up in production config

The platform had grown from a single machine. One server ran everything — the application, the CI runner, the database, the file storage. Files moved between the app and the CI system via NFS: a directory mounted on the same host, accessible to both the containers and the runner.

The path /home/jenkins-slave/share_media was where the NFS share landed on that machine. When the team migrated to Docker Compose, the containers inherited the NFS mount. The path made it into the .env because the application needed to know where to find files. Nobody changed it because it worked. The mount was still there. The path was valid. The application started. Files appeared where they should.

Three years later, nobody thought about it at all. It was just how the media path was configured.

What kubectl apply found

The first kubectl apply for the media service ended with a pod stuck in CrashLoopBackOff. The container started. The entrypoint ran. The application tried to access /home/jenkins-slave/share_media/media. No such file or directory. No NFS mount. No runner.

The path didn’t document a design decision. It documented the machine that happened to be running at the time the .env was written.

This is what Factor IV of the twelve-factor app is warning against. Backing services — storage, queues, databases — should be attached resources, configured via URL or connection string, interchangeable between environments without code changes. A filesystem path on a shared host is not a backing service. It’s a physical assumption about the machine. When the machine changes, the assumption fails.

The path was the symptom

The obvious first step was removing the runner reference:

APP__COLD_STORAGE__FILESYSTEM_PATH="/share_media/media"
APP__SHARE_STORAGE__FILESYSTEM_PATH="/share_storage"

Cleaner. No more CI references in a production config. Still not right. The application still assumed a POSIX filesystem, either a volume mount or a directory on the node. In Kubernetes, a volume written by pods on different nodes requires a ReadWriteMany PersistentVolumeClaim. Most storage providers don’t support it. Those that do tend to be slow and expensive. And even where it works, you’ve replaced one shared filesystem assumption with another.

Renaming the path bought time. It didn’t fix the problem.

The problem was that roughly twelve terabytes of images (originals and pre-generated derivatives in multiple formats, from multiple editorial brands) were treated as a directory. A directory can’t be mounted cleanly across pods. A backing service can.

Flysystem as the shape of the solution

The media service already had a Flysystem dependency. Three concrete adapters — local filesystem, AWS S3, Azure Blob — and one lazy adapter sitting on top:

# config/packages/flysystem.yaml
flysystem:
    storages:
        media.storage.local:
            adapter: 'local'
            options:
                directory: "/"

        media.storage.aws:
            adapter: 'aws'
            options:
                client: 'aws_client_service'
                bucket: 'media'
                streamReads: true

        media.storage:
            adapter: 'lazy'
            options:
                source: '%env(APP__FLYSYSTEM_MEDIA_STORAGE)%'

All application code depends on media.storage. It doesn’t know whether files live on the filesystem or in a cloud bucket. One environment variable determines which backend is active:

APP__FLYSYSTEM_MEDIA_STORAGE=media.storage.aws   # production
APP__FLYSYSTEM_MEDIA_STORAGE=media.storage.local  # local fallback still available

The path is gone. The filesystem assumption is gone. What remains is a service name — an attached resource in the twelve-factor sense, configurable without rebuilding the image.
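The lazy indirection itself is a small pattern. A minimal Python sketch of the idea, not Flysystem's implementation: both class names are hypothetical and in-memory objects stand in for the real local and S3 adapters. Only the variable name and the storage names come from the configuration above.

```python
class MemoryStorage:
    """In-memory stand-in for a real adapter (local disk, S3, Azure Blob)."""

    def __init__(self):
        self.objects = {}

    def write(self, path, data):
        self.objects[path] = data

    def read(self, path):
        return self.objects[path]


class LazyStorage:
    """Resolve the real backend on first use, from one environment variable."""

    def __init__(self, registry, env, variable="APP__FLYSYSTEM_MEDIA_STORAGE"):
        self._registry = registry
        self._env = env
        self._variable = variable
        self._backend = None

    def _resolve(self):
        if self._backend is None:
            # The variable holds a storage name, not a path or a URL.
            self._backend = self._registry[self._env[self._variable]]
        return self._backend

    def write(self, path, data):
        self._resolve().write(path, data)

    def read(self, path):
        return self._resolve().read(path)
```

Application code only ever holds a `LazyStorage`. Switching backends means changing the variable's value, not the code and not the image.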

The same pattern extends to the thumbnail cache. LiipImagine generates resized images on demand; both the source originals and the generated cache go through separate Flysystem adapters:

liip_imagine:
    loaders:
        default:
            flysystem:
                filesystem_service: 'media.storage'
    # generated thumbnails go through a cache resolver, not a loader
    resolvers:
        default_cache:
            flysystem:
                filesystem_service: 'media.cache.storage'

Two environment variables, two buckets. The full pipeline — receive upload, store original, generate thumbnail, cache it — is cloud-portable without touching a line of PHP.

What this doesn’t cover is moving the data. The lazy adapter changes one environment variable. Getting twelve terabytes from an NFS mount into an S3 bucket is a different project — a migration window, double-write during cutover, verification that nothing was missed.

What MinIO makes possible in CI

Production uses S3. Local development uses MinIO, an S3-compatible object store that runs in a Docker container. The AWS adapter talks to MinIO locally and to S3 in production. The application doesn’t notice the difference:

# local/CI
APP__FLYSYSTEM_MEDIA_STORAGE=media.storage.aws
APP__MINIO_ENDPOINT=http://minio:9000
APP__MINIO_ACCESS_KEY=minioadmin
APP__MINIO_SECRET_KEY=minioadmin

The same code, the same adapter, a different endpoint. No mocking, no special test paths, no environment-specific branches.

But the CI configuration goes one step further. The MinIO image used in the pipeline isn’t the standard upstream one. It’s a custom image built with test fixtures preloaded:

FROM minio/minio:latest
COPY tests/fixtures/ /fixtures_media/

Every CI run starts with a MinIO instance that already contains the data the test suite expects. No setup script, no seed command, no “wait for fixtures to load” step before tests begin. The initial state of the test environment is part of the build artifact.

Factor V applied to test infrastructure: the environment state is built, versioned, and immutable. The CI pipeline builds the MinIO image from the same source and at the same commit as the application image. The test fixtures and the code that exercises them are always in sync.

The S3 tradeoff, honestly

S3 introduces a latency cost that local storage doesn’t have. The first bytes of a file take 10 to 30 milliseconds to arrive from S3 — that’s the documented first-byte latency for the service, not a measurement on this specific workload.

At 300 requests per second, the reasoning for accepting it was this: most reads hit already-generated thumbnails in the S3-backed cache, not the original files. A freshly uploaded image pays the cold-miss penalty once, on the first thumbnail request. Everything after that is a cache hit. Whether the actual tail latency under load bore that reasoning out required performance testing that was tracked separately — the architecture decision and the validation were decoupled.
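The reasoning rests on a cache-aside read path, which can be sketched in Python. This is an illustration of the argument, not the LiipImagine implementation: the class name, the key layout and the injected `resize` callable are assumptions, and plain dicts stand in for the two buckets.

```python
class ThumbnailService:
    """Cache-aside: read the original only when the thumbnail is not cached yet."""

    def __init__(self, originals, cache, resize):
        self.originals = originals    # dict standing in for the originals bucket
        self.cache = cache            # dict standing in for the cache bucket
        self.resize = resize          # callable(data, filter_name) -> bytes
        self.original_reads = 0

    def get(self, path, filter_name):
        key = f"{filter_name}/{path}"
        if key in self.cache:
            return self.cache[key]    # common case: one read from the cache bucket
        self.original_reads += 1      # cold miss: fetch the original and generate
        thumbnail = self.resize(self.originals[path], filter_name)
        self.cache[key] = thumbnail
        return thumbnail
```

However many times a given thumbnail is requested, the original is read once per filter. What the sketch cannot show is the tail latency under load, which is the part left to the load tests.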

The tradeoff was accepted: predictable behavior across multiple pods, no shared-state problems, a storage layer that scales without coordination. The full measurement story belongs in the load test report, not here.

The ghost leaves

The path /home/jenkins-slave no longer appears in the configuration. But what it pointed to was a coupling that predated Docker, predated microservices, predated any conversation about cloud migration. The CI runner and the production application shared a filesystem because they lived on the same machine. Nobody designed it that way. It accumulated.

A pod crash-looping on a path that shouldn’t have existed forced the question: why does this application assume a specific CI runner is present on the host? The answer was “because it always has.” That’s not a reason. It’s a history.

Renaming the path was a paper fix. Flysystem’s lazy adapter was the actual answer — not because it’s more elegant, but because it makes the storage backend a decision that belongs to the environment, not to the application. The container starts, reads one variable, connects to whatever is at the other end. It doesn’t know whether that’s a bucket in a data center or a container on a laptop.

The runner’s home directory is gone from the config. What replaced it is a service name. That’s the difference.