
Architecture for Zero Downtime in Critical Applications

11 min read · October 1, 2025

[Figure: Kubernetes deployment architecture with ArgoCD and GitOps]

Summary

How do you achieve 100% uptime when code changes daily? This article covers the architecture required to eliminate downtime in business-critical systems. We use an anonymized transformation of a leading Norwegian technology company ("NordFinans") as an example of how a monolithic legacy system was replaced with modern microservices.

The focus is on the technical implementation of Zero Downtime Deployment (ZDD). We dive deep into how to configure Kubernetes, database schemas, and application logic to achieve continuous delivery without interruption – regardless of whether your stack is built on PHP, Python, Go, or Node.js.

Background: The Fear of Deployment

The Old World

Many companies recognize the situation NordFinans faced: a massive monolithic application that had grown unwieldy. The deployment process was manual, slow, and fraught with risk:

Maintenance windows: Updates required planned downtime, typically scheduled for late evenings. This created frustration for users who expected 24/7 services, and wore out developers who had to work nights.

"Deployment Fear": Because each rollout was a major operation, they were postponed as long as possible. This led to a vicious cycle of enormous code conflicts and increased probability of errors.

Inefficient scaling: During traffic peaks, the entire monolith had to be scaled up, even if only a small part of the system was under pressure.

The Goal

To meet today's availability requirements, the company set three absolute technical requirements:

Requirement          | Goal
Deployment Frequency | From monthly to daily rollouts
Change Failure Rate  | Under 1% failure rate during deployment
True Zero Downtime   | No interrupted sessions or 5xx errors

Strategy: Hybrid CI/CD with GitLab and GitHub

To balance internal security with external availability, a hybrid strategy was chosen. This is a model that works well for companies that have both proprietary core business and public integrations.

GitLab: The Core for DevSecOps

GitLab (Self-Managed) was chosen as the primary platform for internal source code and infrastructure.

Why: GitLab offers a complete package with source code, CI pipelines, container registry, and security scanning (SAST/DAST) in one closed ecosystem.

Kubernetes integration: Via GitLab Agent, the platform team can control access granularly without exposing sensitive access keys to developers.

GitHub: The Public Face

Public SDKs and partner integrations live on GitHub.

Why: GitHub is the industry standard for open-source.

GitHub Actions: Here, Actions are used to run public tests and publish packages to registries like NPM, PyPI, and Packagist.

Synchronization

To avoid fragmentation, GitLab functions as "Source of Truth". Code is automatically mirrored to GitHub, so developers only deal with one dashboard, while the code lives in two places.

GitOps: The Engine Under the Hood

To achieve zero downtime, manual error sources must be eliminated. Manual execution of kubectl apply was therefore strictly forbidden in favor of a pure GitOps model.

ArgoCD as Traffic Controller

ArgoCD was implemented to synchronize the state in Git with the state in the Kubernetes cluster.

Pull-based model: Instead of the CI server "pushing" changes to the cluster (which requires the CI server to have admin access to prod), ArgoCD "pulls" changes from a separate manifest repo.

Security: This means the CI system never has direct access to the production environment, which eliminates a massive attack surface.

The Flow from Code to Prod

  1. CI (Build): Developer pushes code. Pipeline runs tests, builds Docker image, and scans for vulnerabilities.
  2. CD (Update): If the build succeeds, the CI job updates the version tag in a separate manifest repo.
  3. Sync: ArgoCD detects the change, calculates the difference, and rolls out the change in a controlled manner in Kubernetes.
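Step 2 can be sketched as a GitLab CI job. This is an illustrative fragment, not the company's actual pipeline: the repository URL, token variable, and file names are assumptions.

```yaml
# Hypothetical .gitlab-ci.yml job: after a successful build, bump the
# image tag in the separate manifest repo that ArgoCD watches.
update-manifests:
  stage: deploy
  image: alpine/git
  script:
    - git clone "https://gitlab-ci-token:${MANIFEST_TOKEN}@gitlab.example.com/platform/k8s-manifests.git"
    - cd k8s-manifests
    - sed -i "s|image:.*/app:.*|image: registry.example.com/app:${CI_COMMIT_SHORT_SHA}|" deployment.yaml
    - git commit -am "Release ${CI_COMMIT_SHORT_SHA}"
    - git push origin main
```

Note that the job pushes to the manifest repo only; it never talks to the cluster, which is exactly what keeps production credentials out of CI.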

Technical Deep Dive: How to Achieve 100% Uptime?

Replacing the engine on a plane while it's in the air requires precision. Here are the specific configurations that make it possible to roll out new versions during working hours without lost requests.

The Rolling Update Strategy

The default behavior of Kubernetes is a "Rolling Update", but the default settings are often too aggressive for critical applications. The strategy must be tuned to guarantee full capacity throughout the rollout.
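A conservative configuration might look like the following (names and replica count are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take an old pod down before a new one is Ready
      maxSurge: 1         # add at most one extra pod during the rollout
```

With maxUnavailable set to 0, Kubernetes always waits for a new pod to pass its readiness probe before terminating an old one, so serving capacity never dips below the configured replica count.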

Graceful Shutdown: The Solution to 502 Bad Gateway

The most common error when transitioning to Kubernetes is ignoring the application's lifecycle. When a pod is about to die, two things happen simultaneously (asynchronously):

  1. Kubernetes removes the pod's IP from the load balancers.
  2. Kubernetes sends SIGTERM to the container to stop the process.

The problem: Processes like Nginx, Go binaries, or Node.js often shut down faster than Kubernetes can propagate the network-rule update across the cluster. The result? Traffic is routed to a pod that has just died, and the user sees "502 Bad Gateway".
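A common mitigation is a preStop hook that keeps the pod serving for a few seconds while the endpoint removal propagates, combined with a grace period long enough for in-flight requests to drain. A sketch (values are illustrative, not tuned for any specific workload):

```yaml
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            # Delay SIGTERM so load balancers stop sending traffic
            # before the process begins shutting down.
            command: ["sh", "-c", "sleep 5"]
```

The application should still handle SIGTERM gracefully (stop accepting new connections, finish in-flight work); the sleep only closes the race window between endpoint removal and process shutdown.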

Probes: The Art of Health Checks

Liveness Probe: "Am I alive?" Checks whether the process is running. Best practice: keep it simple, and never check the database connection here. If the database goes down, every pod will restart simultaneously in an endless loop.

Readiness Probe: "Am I ready to receive traffic?". Best Practice: Check if the application can actually do work (e.g., db connection ok, cache warm). If this fails, the pod is taken out of the traffic flow without being restarted.
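The two probes above might be wired up like this (paths and port are hypothetical, assuming the app exposes cheap /healthz and deeper /ready endpoints):

```yaml
containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /healthz      # cheap: is the process responding at all?
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready        # deeper: DB connection ok, cache warm, etc.
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
```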

The Database: The Biggest Challenge

Code is ephemeral, but data is persistent. How do you update a database schema without locking tables or crashing the old version of the code that's still running during a rollout?

The solution is the Expand-Contract (Parallel Change) pattern.

Phase 1: Expand

Say we're renaming a column from address to billing_address. We add the new column but keep the old one, then roll out the code. Now both columns exist.

Phase 2: Migrate (Dual Write)

The application is updated to write to both columns but read from the new one. A background script moves old data.

Phase 3: Contract

When we're sure all pods are running new code that uses billing_address, we remove the old column in a final migration.

This requires discipline, but guarantees that the database is never "out of sync" with any version of the application running live.
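The three phases can be sketched as separate SQL migrations, each deployed only after the previous code rollout is complete (table and column names are illustrative, assuming a PostgreSQL-style database):

```sql
-- Phase 1 (Expand): add the new column; old code keeps using "address".
ALTER TABLE customers ADD COLUMN billing_address TEXT;

-- Phase 2 (Migrate): backfill old rows while the application dual-writes.
UPDATE customers SET billing_address = address WHERE billing_address IS NULL;

-- Phase 3 (Contract): run only after every pod reads billing_address.
ALTER TABLE customers DROP COLUMN address;
```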

Application Level: Preparing for the Cloud

Regardless of whether the application is written in PHP, Python, Go, or Node.js, a "Zero Downtime" environment requires the application to follow the 12-factor principles.

Configuration and Environment Variables

In a dynamic cluster, you can't rely on .env files that change on disk. All configuration must be injected as Environment Variables from Kubernetes ConfigMaps and Secrets.

  • Node.js/Python/Go: Read directly from the environment (process.env / os.environ).
  • PHP: Ensure that configuration cached during build time doesn't contain hardcoded paths that change in production.
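In Python, for example, reading injected configuration is a one-liner per value. The variable names below are illustrative; in the cluster they would be populated from ConfigMaps and Secrets:

```python
import os

# Read configuration from the environment instead of a .env file on disk.
# Sensible defaults make local development work without a cluster.
DATABASE_URL = os.environ.get("APP_DATABASE_URL", "postgres://localhost/dev")
CACHE_TTL = int(os.environ.get("APP_CACHE_TTL", "300"))
```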

Queue Systems and Serialization

An often overlooked trap is asynchronous jobs (RabbitMQ, Redis, Kafka). When a new version is deployed, there may be jobs in the queue that are serialized with the old code structure. If a worker with new code picks up an old job payload, the application may crash.

Solution: Use versioned queues, or ensure that job payloads are always backward compatible. For major changes ("breaking changes"), the queue must be drained before updating.
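One way to make payloads backward compatible is to embed a schema version in every job, so a new-code worker can still process messages serialized by old code. A minimal sketch in Python (field names are hypothetical):

```python
import json

def handle_payload(raw: bytes) -> str:
    """Dispatch on the payload's schema version."""
    job = json.loads(raw)
    version = job.get("version", 1)  # old producers omitted the field
    if version == 1:
        # v1 payloads used the old "address" field
        return job["address"]
    return job["billing_address"]

old_job = json.dumps({"address": "Oslo"}).encode()
new_job = json.dumps({"version": 2, "billing_address": "Bergen"}).encode()
```

The same idea applies regardless of broker: the version travels with the message, not with the worker.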

Results and Business Value

Implementation of this architecture yielded the following results:

Deployment Frequency: Increased from monthly to 8.5 times per day. Developers now deploy small changes continuously.

Lead Time: Reduced from 5 days to 45 minutes (from commit to production).

Availability: Achieved 99.99% uptime the first year, even through major refactorings.

Culture: "Deployment fear" disappeared. Tuesday evening is no longer "on-call evening", but free time.

Conclusion

Zero downtime is not magic, and it's not something you get "for free" just by choosing Kubernetes. It's the result of a deliberate architectural strategy that combines robust CI/CD pipelines, declarative infrastructure (GitOps), and a deep understanding of the application's lifecycle.

For companies that want to compete in a market that demands 24/7 availability, the investment in this architecture is not just an IT cost, but a fundamental business advantage.

At PXL, we help ambitious companies modernize their deployment strategy. We have broad experience setting up scalable, fault-tolerant environments for applications built in everything from PHP and Python to Go and Node.js. We assist with the entire journey – from containerization and CI/CD design to Kubernetes operations and monitoring.