Zero Downtime: Crypto Wallet Migration to EKS

How we eliminated outages during 500% load spikes and reduced infrastructure costs by $45,600 per year



A crypto trading app: a seamless migration from AWS Lambda to Amazon EKS (100% IaC on Pulumi) with zero downtime.

TEAM

DevOps engineers

PERIOD OF COLLABORATION

5 months (April – September 2024)

CLIENT’S LOCATION

USA and Canada

When Marketing Success Becomes an Infrastructure Challenge

Our client, a leading crypto wallet processing terabytes of data, relied entirely on an AWS Lambda architecture. While stable under normal conditions, the system critically failed during peak marketing activities.

During free crypto giveaways, instant load spikes of 300–500% caused the infrastructure to crash for 5–10 minutes. For a financial product of this level, even a few seconds of downtime destroys user trust, making an immediate architectural overhaul absolutely necessary.

Project Metrics


-37% AWS costs (actual savings of $45,600 per year)


3.4x increase in deployment frequency and 0-second "cold starts"


99.97% Uptime across two regions simultaneously (measured over 4 months)


Why the "Serverless" Architecture Stopped Working

The problem wasn't just the downtime itself, but the deep architectural limits of the old system. The cost of downtime was too high, and to stabilize the product we had to eliminate three fundamental technical pain points at once:

1

Challenge #1: Databases were "suffocating"

During promotional crypto giveaways, traffic spiked by 300–500%, instantly spinning up thousands of concurrent AWS Lambda functions. Because the legacy event-driven architecture relied on these stateless functions, each instance opened its own direct connection to the database at the same moment, creating an uncontrolled flood of connection requests. With no pooling or traffic-distribution mechanism in between, database resources were exhausted within seconds: the database could not keep up with the query volume, choked under the pressure, and brought the entire platform down for 5–10 minutes at the exact moment user activity peaked.
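The arithmetic behind that connection storm is worth spelling out. The sketch below uses illustrative numbers of our own (the baseline concurrency and the connection cap are assumptions, not the client's real limits), but it shows why a 500% spike in stateless functions translates directly into an exhausted database:

```typescript
// Back-of-the-envelope model of the Lambda connection storm.
// All numbers are illustrative assumptions, not the client's actual limits.

const baselineConcurrency = 400;  // concurrent Lambda executions under normal load
const spikeMultiplier = 5;        // the 500% giveaway spike
const connectionsPerInstance = 1; // each stateless function opens its own connection
const dbMaxConnections = 1000;    // a typical managed-database connection cap

const peakConnections =
  baselineConcurrency * spikeMultiplier * connectionsPerInstance;

console.log(`Peak connection demand: ${peakConnections}`); // 2000
console.log(`Database cap: ${dbMaxConnections}`);
console.log(`Cap exceeded: ${peakConnections > dbMaxConnections}`); // true
```

Long-lived pods on EKS avoid this failure mode because a fixed set of replicas shares pooled connections, so peak demand is bounded by pool size rather than by however many instances the platform decides to spin up.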

2

Challenge #2: Technical Limits and Slowness

The architecture capped every process at Lambda's 15-minute execution limit. Furthermore, some functions had "cold starts" of up to 20 seconds, and building a reliable multi-region infrastructure on such a setup was extremely difficult.

3

Challenge #3: High and "Blind" Costs

Running long-lived processes on Lambda was significantly more expensive than on regular servers: the cloud bill came to roughly $10,200 per month. Worse, the costs were opaque, and the business didn't fully understand where exactly the money was going.

Rescue Strategy: From Quick Fixes to Global Migration

Understanding the criticality of the situation, we did not immediately rewrite the entire system, but instead applied a cautious two-step approach. We needed to stop the crashes as quickly as possible and only then build a reliable foundation for the future.

After stabilization, we began the step-by-step migration of dependent services to Amazon EKS (Kubernetes). To ensure maximum reliability (SLA) and reduce user latency, we immediately deployed the architecture across two regions (multi-region), which guaranteed uninterrupted operation even in the event of local AWS failures. We also implemented smart autoscaling and gave developers the ability to create ephemeral environments (Preview environments) for secure code testing.

We left this block for those who love technical depth. Here is exactly how we migrated a huge financial product from AWS Lambda to Kubernetes while preserving every transaction.

Architectural Debates: Why EKS and not the simpler ECS?

At the start of the project, we faced a choice: where to migrate? The client's team had relevant experience working with Amazon ECS but not much understanding of EKS. It seemed logical to choose ECS—it is much simpler to set up, and migrating to it would have taken less time. However, we convinced the team to take the more difficult route. Our arguments:

  • Scale dictates the rules: At large scale (terabytes of data, dozens of services, millions of requests) it is too late to think about "simplicity." You need tools that provide maximum control.
  • Tailored approach: With ECS, you are limited to the functionality Amazon provides. EKS is more complex, but with the right expertise it allows a result tailored precisely to the workload.
  • Money: Most importantly, EKS allows the most flexible configuration of cost-optimized autoscaling. At these volumes, proper optimization makes EKS 37% cheaper.

The Biggest Technical Challenge: State Handling

The most difficult stage of the project was migrating and configuring multi-region replication for the stateful data stores behind Hydra OAuth and for CockroachDB. This process required extremely careful state handling. The strictest condition was that all of it had to happen with zero downtime, so that wallet users noticed nothing. On top of that, we had to very cautiously transition the native AWS event-driven logic to Kafka.

Absolute Automation: GitOps, ArgoCD, and Argo Rollouts

We completely abandoned manual cluster management and introduced a strict GitOps enforcement approach:

  • Git as the Single Source of Truth: All deployments are automated and launched exclusively from git commits. We completely excluded manual infrastructure changes via kubectl. If something is not in the code, it does not exist in the system.
  • Managing Multiple Clusters: To deploy configurations to all 6 clusters across our regions at once, we used ArgoCD ApplicationSets.
  • Zero-Downtime Releases: To ensure updates were absolutely invisible to the end user, we implemented Argo Rollouts. This allowed us to perform safe canary deployments.
  • Safe Rollbacks: Thanks to GitOps, rolling the system back after an error is now elementary: a simple git revert (instead of the complex Lambda version management we had before).
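To make the fan-out concrete, here is a small sketch of what an ApplicationSet list generator does under the hood: it renders one Application per cluster entry from a single template. The cluster names and servers below are hypothetical placeholders, not the client's real inventory:

```typescript
// Sketch of an ArgoCD ApplicationSet list generator: one Application is
// rendered per cluster entry from a single template.
// Cluster names and servers are hypothetical placeholders.

interface ClusterEntry {
  name: string;
  server: string;
}

interface RenderedApp {
  name: string;
  destServer: string;
  path: string;
}

function renderApplications(
  appName: string,
  path: string,
  clusters: ClusterEntry[]
): RenderedApp[] {
  // Equivalent to templating `{{name}}` / `{{server}}` in an ApplicationSet.
  return clusters.map((c) => ({
    name: `${appName}-${c.name}`,
    destServer: c.server,
    path,
  }));
}

// Six clusters across two regions, mirroring the layout described below.
const clusters: ClusterEntry[] = [
  { name: "prod-ca", server: "https://prod-ca.example" },
  { name: "staging-ca", server: "https://staging-ca.example" },
  { name: "loadtest-ca", server: "https://loadtest-ca.example" },
  { name: "tooling-ca", server: "https://tooling-ca.example" },
  { name: "dev-us", server: "https://dev-us.example" },
  { name: "data-us", server: "https://data-us.example" },
];

const apps = renderApplications("wallet-api", "deploy/wallet-api", clusters);
console.log(apps.length); // 6: one commit fans out to every cluster
```

With this model, adding a seventh cluster is a one-line change to the generator list in git, and ArgoCD reconciles the new Application automatically.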

The Ultimate Technology Stack

  • Cloud & Compute: 6 EKS clusters in 2 AWS regions. The ca-central-1 region holds Production, Staging, Load Testing, and Tooling, while us-east-1 is allocated for Development and Data.
  • Cost Ops: Smart autoscaling with Karpenter, mixing Spot and On-Demand instances, plus a move to Graviton ARM64 processors (r8g family). Automatic node consolidation (bin-packing) via Karpenter reduced the number of nodes by 30-35%.
  • FinOps & Cost Management: Costs on the old Lambda setup were opaque, so we implemented Cloud Intelligence Dashboards (CID), Cost and Usage Reports (CUR 2.0), and QuickSight. Costs are now attributed down to the individual namespace.
  • Data & Mesh: Istio (mTLS, traffic management, circuit breaking) is responsible for service communication. Data layer: Amazon RDS Aurora (multi-region), DynamoDB (global tables), ElastiCache (Valkey), MSK (Kafka), and CockroachDB.
  • Security & Observability: Datadog (APM), Prometheus, Grafana, OpenTelemetry for comprehensive monitoring. Security is handled by External Secrets Operator, AWS Secrets Manager, Cert-manager, Kyverno (policy enforcement), and Cloudflare Tunnel for zero-trust access.
  • IaC: 100% of the infrastructure in Pulumi (TypeScript): 67 micro-stacks and a custom component library.
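As an illustration of how those cost levers fit together, below is a sketch of a Karpenter NodePool combining Spot-first capacity, Graviton (arm64, r8g) instances, and consolidation. The field names follow Karpenter's NodePool API, but the values and the pool name are our own illustrative assumptions; in the real setup such resources are defined through Pulumi rather than written by hand:

```typescript
// Sketch of a Karpenter NodePool expressing the cost levers described above:
// Spot-first capacity, Graviton (arm64, r8g family), and consolidation.
// Field names follow Karpenter's NodePool API; the values and the pool
// name are illustrative, not taken from the client's real configuration.

const nodePool = {
  apiVersion: "karpenter.sh/v1",
  kind: "NodePool",
  metadata: { name: "wallet-general" }, // hypothetical name
  spec: {
    template: {
      spec: {
        requirements: [
          // Prefer Spot, fall back to On-Demand when Spot is reclaimed.
          { key: "karpenter.sh/capacity-type", operator: "In", values: ["spot", "on-demand"] },
          // Graviton ARM64 only, restricted to the r8g family.
          { key: "kubernetes.io/arch", operator: "In", values: ["arm64"] },
          { key: "karpenter.k8s.aws/instance-family", operator: "In", values: ["r8g"] },
        ],
      },
    },
    disruption: {
      // Consolidation (bin-packing) is what delivers the 30-35% node reduction.
      consolidationPolicy: "WhenEmptyOrUnderutilized",
    },
  },
};

const capacity =
  nodePool.spec.template.spec.requirements.find(
    (r) => r.key === "karpenter.sh/capacity-type"
  )?.values ?? [];
console.log(capacity.join(",")); // spot,on-demand
```

Listing both spot and on-demand lets Karpenter fall back to On-Demand capacity when Spot is reclaimed, which is what keeps the discount from turning into downtime.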

An Honest Look (Lessons Learned)

We believe in transparency, so we are sharing what worked and what we would have done differently:

  • What worked perfectly: Allocating a full 2 months (April–June 2024) to build a reliable EKS foundation before application migration. Phased migration: we started with non-critical services (Airbyte), and only then moved the critical Hydra OAuth. Early deployment of monitoring systems (Datadog, Prometheus) also helped a lot.
  • What we would change: We should have designed the multi-region architecture from day one; adding it later required complex refactoring. Likewise, the cost-saving tools (Karpenter, Graviton) and FinOps dashboards should have been enabled from the start to give the client immediate cost transparency and savings.

Final Results: Speed, Economy, and Stability

This architectural transformation did not just solve the crashing problem—it fundamentally changed how the business operates with infrastructure. We not only eliminated all bottlenecks but also ensured absolute budget transparency and created an environment where developers can release new features exponentially faster. Today, the crypto wallet continues to grow actively, and our team remains a reliable partner. The process of cost optimization and improved monitoring is now an ongoing joint initiative.

1

Development Speed (Velocity)

We completely eliminated the 15-minute execution limit and "cold start" delays.
Thanks to the transition to GitOps, the release frequency increased by 3.4 times, and the Mean Time To Recovery (MTTR) decreased by 68%—now system rollback is a simple git revert.

2

Reliability for the Business

The system achieved 99.97% uptime. All updates now occur without a single second of downtime (zero-downtime). Developers received a powerful, easily scalable platform, which allowed the client to implement complex technologies such as Airflow, Flink, and CockroachDB.

3

Budget Optimization (FinOps)

Overall costs fell by 37%, saving the client $45,600 per year (the bill decreased from $10,200 to ~$6,400/month). The use of Spot Instances via Karpenter provided up to an 80% discount, Graviton ARM64 added ~20% savings, and smart node packing reduced their total number by 30-35%. The problem of "blind" costs was 100% solved.

Let's arrange a free consultation

Just fill out the form below and we will contact you via email to arrange a free call to discuss your project and an estimate.