Engineering Journey: The "Quick Win" vs. Security Dilemma
During the Discovery phase, our team faced a major architectural decision. The first option, a quick-win strategy, was to improve the existing legacy architecture on AWS ECR/ECS so the client could go live as soon as possible and start generating revenue to fund further improvements. The second option was to rebuild everything from scratch on Amazon EKS. We made the case to the client that staying on the old architecture risked compromising the integrity and stability of production from day one, potentially costing them the trust of their earliest users. The client decided not to take that risk: we stood up EKS clusters in parallel with the old system and carried out a seamless migration. Only after that did the focus shift to post-migration tasks: security hardening, cost optimization, and bug fixes.
Architectural Leadership and Lessons Learned
We encountered the fastest SDLC we had ever worked with: release cycles of one day or less, accompanied by dozens of micro-calls daily. Because the client had neither a dedicated architect nor a CTO, and their development team was relatively junior, we had to introduce a rigorous code review practice. This decision saved production from many hours of patching and downtime. The key lesson: under these conditions, it is essential to push the client from day one to hire a dedicated architect. Also worth highlighting is our close collaboration with the client's AI team, which saved the business from serious reputational damage. Knowledge transfer was virtually absent on the client side, but we successfully ran a fast internal onboarding when a second DevOps engineer joined the project.
Full Technology Stack and Killer Features
We achieved 100% infrastructure-as-code coverage with Terraform and fully implemented a GitOps approach with Argo CD. This let us roll out changes in minutes and reduced rollbacks to a simple, safe git revert command.
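The rollback flow mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: the helper names, the `main` branch, and the `origin` remote are hypothetical, and it assumes the Argo CD application is set to sync automatically, so that pushing the revert commit is enough to restore the previous state.

```python
import subprocess
from typing import List


def build_rollback_commands(
    bad_commit: str, branch: str = "main", remote: str = "origin"
) -> List[List[str]]:
    """Return the git commands that undo one commit in a GitOps repo.

    Branch and remote names are assumptions; adjust them to the real repo.
    """
    if not bad_commit:
        raise ValueError("bad_commit must be a non-empty commit reference")
    return [
        # Create a new commit that reverses the faulty change.
        ["git", "revert", "--no-edit", bad_commit],
        # Publish it; the GitOps controller is assumed to pick it up.
        ["git", "push", remote, branch],
    ]


def rollback(bad_commit: str, repo_dir: str = ".", dry_run: bool = True) -> List[List[str]]:
    """Print the rollback commands, or execute them when dry_run is False."""
    commands = build_rollback_commands(bad_commit)
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True stops the rollback if any git command fails.
            subprocess.run(cmd, cwd=repo_dir, check=True)
    return commands


if __name__ == "__main__":
    rollback("abc1234", dry_run=True)
```

Because the revert is an ordinary commit, the rollback itself goes through the same review and audit trail as any other change, which is what makes it safe.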
- Core Infrastructure: The foundation consists of two isolated Amazon EKS clusters. Network communication and traffic management are handled by the Istio service mesh, which also provides mTLS encryption and deep observability.
- AWS Services Integration: We configured VPC Peering for secure connectivity with the MongoDB Atlas database. We also integrated Amazon CloudFront (CDN), AWS WAF, and the AWS Load Balancer Controller (ALB/NLB), with AWS Security Hub and Amazon Inspector handling continuous vulnerability monitoring.
- Extended FinOps Stack: Node management is fully delegated to Karpenter, which automatically provisions a mix of On-Demand and Spot capacity, with Spot instances delivering up to 80% savings. Karpenter's auto-consolidation (bin-packing) feature further reduced the node count by 30–35%. For additional savings, we migrated workloads to Graviton ARM64 processors (r8g family), which proved ~20% cheaper than standard x86.
- Observability & Security Ecosystem: Although the system supports Prometheus, Grafana, and OpenTelemetry for comprehensive monitoring, one of the standout decisions was to move away from the standard Prometheus/Grafana stack in favor of Datadog APM, which gave developers the interface they needed. Sensitive data management is automated via External Secrets Operator integrated with AWS Secrets Manager, and TLS certificate issuance is handled by cert-manager.
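
To show how the FinOps figures in the list above might combine, here is a rough back-of-the-envelope sketch. It rests on several assumptions that do not come from the project: the baseline bill and the Spot share are hypothetical inputs, cost is assumed to scale linearly with node count, and the three savings are assumed to be independent and therefore multiplicative. Real bills will differ.

```python
def estimated_monthly_cost(
    baseline: float,
    spot_share: float,
    spot_discount: float = 0.80,      # "up to 80%" Spot savings (upper bound)
    consolidation: float = 0.30,      # lower end of the 30-35% node reduction
    graviton_discount: float = 0.20,  # "~20% cheaper" Graviton vs. x86
) -> float:
    """Estimate compute cost after applying the three optimizations.

    baseline   -- hypothetical monthly compute cost before optimization
    spot_share -- hypothetical fraction of capacity running on Spot (0..1)
    """
    if not 0.0 <= spot_share <= 1.0:
        raise ValueError("spot_share must be between 0 and 1")

    # Fewer nodes after bin-packing; assumes cost is proportional to node count.
    cost = baseline * (1.0 - consolidation)
    # Cheaper per-node price on Graviton.
    cost *= 1.0 - graviton_discount
    # Blend On-Demand and discounted Spot capacity.
    cost *= (1.0 - spot_share) + spot_share * (1.0 - spot_discount)
    return cost


if __name__ == "__main__":
    # Hypothetical example: a 10,000/month baseline with half the fleet on Spot.
    print(round(estimated_monthly_cost(10_000, spot_share=0.5), 2))
```

With these illustrative inputs the estimate comes to about 3,360 per month, roughly a third of the baseline. The "up to 80%" figure is an upper bound, so a lower `spot_discount` gives a more conservative estimate.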