Case study

Building a serverless data pipeline for an e-commerce store

on Feb 06, 2020
About the client
Alisa used to be an unusual e-commerce store: Shopify served as the storefront, Brightpearl as the warehouse system, and Corezoid, a visual programming platform from Privat Bank, handled data synchronization between them.

Such a setup allowed the whole infrastructure to be provisioned without a development team and in a very short time.

Challenges of the project

As the business grew, the client added more products, vendors and attributes in the warehouse, which increased the number of synchronization requests between Brightpearl and Shopify. The situation worsened when Privat Bank moved to another platform and began deprecating Corezoid, leaving the client a short window to migrate to another platform.

Initial investigation revealed a number of critical issues:

  • The integration was built before Shopify released its GraphQL API, so every integration used the less efficient REST API.
  • Both Shopify and Brightpearl limit the number of API requests that can be made within a given time window.
  • The integration platform had no database for persistent data (e.g. product IDs). This caused additional requests to both sides, which made the API throttling worse.
  • All deployments involved manual changes to the Corezoid configuration, which led to mistakes and downtime.
  • The integration kept no transaction state, which made issues difficult to debug and failed transactions impossible to re-run.

Solutions we offered to the client

Alpacked prioritized, planned and broke down the work into the following milestones:

  1. High-level analysis of the current infrastructure: discovery of the main processes, bottlenecks and pain points.
  2. Definition of the main data sync flows: product quantity and attribute synchronization, new product info sync, and the custom tagging required by Shopify.
  3. Discuss tech stack and approach.
  4. Implementation of the main nodes of the data pipeline: handling of events from Brightpearl, event and event status tracking, GraphQL endpoint to Shopify.
  5. Implementation of the data storage and caching.
  6. Data sync flow refactoring based on the new infrastructure: remove recurring requests to both sides of the pipeline, which are replaced by the storage and cache.
  7. Implement data sync flows one by one based on the purpose and priority.
  8. Implement manual and automatic functionality for failed event re-processing.
  9. Implement logging and monitoring.
  10. Document steps.
  11. Customer training and demos.
  12. Production testing.
  13. Post-release support and bug fixing.

Implementation

Initial analysis revealed two key points about the customer's infrastructure.

  • Every data sync process is triggered by changes that operators make in the warehouse during business hours. These changes turned out to be spiky and to arrive in batches, as the lion's share of them come from imports of CSV files containing lists of updated product descriptions.

  • A significant number of requests to both sides of the pipeline were made to obtain information that either never changes (product ID) or changes rarely (product description and attributes).

Based on these findings, it was suggested to break the application down into a set of microservices and adopt the serverless approach. This approach handles both spiky load and completely idle off-peak periods effectively, as the customer doesn't pay a single cent while nothing is happening.
Figure: Traffic spikes
Figure: Brightpearl API traffic pattern

AWS and Serverless framework

AWS and the Serverless Framework were chosen as the tech stack.

AWS offers a set of services that combine well into a solid pipeline. Each of them can be managed through the Serverless Framework and launched locally, which eases local development. The following services were chosen:

  • API Gateway as the main point of entry for all events;
  • Lambda as a FaaS;
  • SQS as a queue that manages all messages and priorities, and ensures that an event either gets delivered or is saved for further investigation in case of a failure (dead-letter queue);
  • DynamoDB as storage for rarely changing information and for event metadata (event id, timestamp, description, etc.);
  • S3 as storage for event payloads, used to re-deliver failed events or to debug;
  • CloudWatch as the logging and monitoring solution, and as a dashboard;
  • X-Ray as a debugging and visualization platform.
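
How these services could fit together at the entry point is sketched below in Python. This is an illustration under stated assumptions, not the project's actual code: the function name `ingest_event`, the item fields, the S3 key layout and the message shape are all hypothetical. The AWS clients are passed in as arguments so the logic can run without an AWS account; in a deployment they would be a boto3 S3 client, a DynamoDB Table resource and an SQS client, whose `put_object`, `put_item` and `send_message` calls are used here.

```python
import json
import time
import uuid


def ingest_event(body, s3, events_table, sqs, bucket, queue_url):
    """Entry-point logic for a webhook call arriving through API Gateway.

    The clients are injected so the function can be exercised locally.
    All field names and the key layout are assumptions for illustration.
    """
    event_id = str(uuid.uuid4())
    payload_key = f"events/{event_id}.json"

    # The full payload goes to S3 so a failed event can be re-delivered or debugged.
    s3.put_object(Bucket=bucket, Key=payload_key, Body=json.dumps(body))

    # Only lightweight metadata is written to DynamoDB.
    events_table.put_item(Item={
        "event_id": event_id,
        "timestamp": int(time.time()),
        "status": "RECEIVED",
        "payload_key": payload_key,
    })

    # The queue decouples ingestion from processing.
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"event_id": event_id, "payload": body}),
    )

    # API Gateway proxy integrations expect a statusCode/body response.
    return {"statusCode": 202, "body": json.dumps({"event_id": event_id})}


class Recorder:
    """In-memory stand-in that records every call; for local experiments only."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return record


s3, table, sqs = Recorder(), Recorder(), Recorder()
response = ingest_event({"sku": "A-1", "quantity": 3}, s3, table, sqs,
                        bucket="example-bucket", queue_url="https://example/queue")
print(response["statusCode"], [name for name, _ in s3.calls + table.calls + sqs.calls])
```

Passing the clients in, rather than creating them inside the function, is a design choice made only to keep the sketch testable without AWS.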

A simplified architecture of the data pipeline is depicted below:


Figure: Alisa serverless architecture

The main data sync flow nodes were designed to be completely stateless and reusable, and to serve as a single source of truth for any kind of event that reaches them.

An example of the product SKU sync is depicted below in the form of a flowchart:
Figure: Product SKU sync

Key features of the flowchart above:

  • Guaranteed delivery: the event payload is returned to SQS in case of a failure or a throttle.

  • Guaranteed ability to investigate and debug an unhandled error during event processing, using the native SQS dead-letter queue functionality.

  • Complete application decoupling that relies heavily on the microservice architecture and SQS.

  • Event tracking: the events table in DynamoDB is updated at each stage of the data processing pipeline.

  • Data caching and a reduced number of API calls, achieved by saving immutable product data in a DynamoDB table.
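
The first and last points can be illustrated with a minimal Python sketch of a queue consumer. It is not the project's code: the `Throttled` exception, the message fields and the callables `fetch_from_shopify` and `push_to_shopify` are hypothetical, and a plain dict stands in for the DynamoDB cache table. The only AWS behaviour it relies on is that a Lambda function triggered by SQS receives messages under `event["Records"]`, and that an unhandled exception leaves them in the queue to be retried and, after the configured number of receives, moved to the dead-letter queue.

```python
import json


class Throttled(Exception):
    """Hypothetical error raised when a downstream API rate limit is hit."""


def resolve_shopify_id(sku, cache, fetch_from_shopify):
    """Return the Shopify ID for a SKU, calling the API only on a cache miss."""
    if sku not in cache:
        cache[sku] = fetch_from_shopify(sku)  # may raise Throttled
    return cache[sku]


def handle_sqs_batch(event, cache, fetch_from_shopify, push_to_shopify):
    """Process the records of one SQS-triggered invocation.

    Exceptions are deliberately not caught: a failed invocation leaves the
    messages in the queue, so they are retried or end up in the dead-letter queue.
    """
    for record in event["Records"]:
        message = json.loads(record["body"])
        shopify_id = resolve_shopify_id(message["sku"], cache, fetch_from_shopify)
        push_to_shopify(shopify_id, message["quantity"])


# Local usage with stand-in callables: two messages for the same SKU
# cause a single lookup, because the second one is served from the cache.
lookups, updates = [], []


def fake_fetch(sku):
    lookups.append(sku)
    return "shopify-" + sku


batch = {"Records": [
    {"body": json.dumps({"sku": "A-1", "quantity": 3})},
    {"body": json.dumps({"sku": "A-1", "quantity": 5})},
]}
handle_sqs_batch(batch, {}, fake_fetch, lambda pid, qty: updates.append((pid, qty)))
print(lookups, updates)
```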

Event tracking

The ability to track events, debug failed transactions and re-process failed events was one of the key features the client wanted in the project.

A number of production-grade systems solve these problems, such as Jaeger or Epsagon, but a fully serverless design and cost efficiency were key factors, so it was decided to implement an in-house solution and integrate it with AWS X-Ray for better visibility.

  • DynamoDB was chosen to store event data. Initial analysis and discussions with the customer showed that there would be millions of write requests and only rare read requests, mainly caused by investigations of failed transactions. To make the setup both cost-efficient and resilient, it was decided to store only metadata in DynamoDB, which is enough to locate any other information needed.
    Figure: metadata in DynamoDB

  • The event payload is kept in S3 with lifecycle policies that archive all events older than a month. Since an event represents a data sync operation, its payload goes stale quickly and there is no need to keep it for years.

  • Each Lambda function inherits a method responsible for event tracking and a method that redefines error handling. This way, the function keeps updating the DynamoDB table with the progress of event processing, and even when an exception occurs, it records the exception details and cause before terminating and returning the object to the queue.

  • CloudWatch was used for data tracking, logging and monitoring.
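
The inherited tracking and error-handling methods might look roughly like the Python base class below. This is a sketch under assumptions: the class and field names are invented, and the table object only needs a `put_item(Item=...)` method, as a boto3 DynamoDB Table resource has. With a real table keyed on `event_id`, each `put_item` replaces the previous item, so the table always shows the latest state.

```python
import time
import traceback


class TrackedHandler:
    """Base class giving every function the same tracking and error handling."""

    name = "base"  # hypothetical identifier stored with every status update

    def __init__(self, events_table):
        self.events_table = events_table

    def track(self, event_id, status, **details):
        """Write the current processing state of an event to the events table."""
        self.events_table.put_item(Item={
            "event_id": event_id,
            "function": self.name,
            "status": status,
            "timestamp": int(time.time()),
            **details,
        })

    def process(self, event):
        raise NotImplementedError

    def __call__(self, event, context=None):
        event_id = event["event_id"]
        self.track(event_id, "STARTED")
        try:
            result = self.process(event)
        except Exception as exc:
            # Record the exception and its trace, then re-raise so the
            # message goes back to the queue.
            self.track(event_id, "FAILED", error=repr(exc),
                       trace=traceback.format_exc())
            raise
        self.track(event_id, "DONE")
        return result


class ListTable:
    """In-memory stand-in that keeps every write; for local experiments only."""

    def __init__(self):
        self.items = []

    def put_item(self, Item):
        self.items.append(Item)


class QuantitySync(TrackedHandler):
    name = "quantity-sync"

    def process(self, event):
        if event["quantity"] < 0:
            raise ValueError("negative quantity")
        return event["quantity"]


table = ListTable()
handler = QuantitySync(table)
handler({"event_id": "e1", "quantity": 2})
print([item["status"] for item in table.items])
```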

Once event tracking is in place, an event re-processing pipeline can be built around it. With the setup described above, the only thing needed to re-process a whole transaction is its event_id. Given an event_id, the system can look up the timestamp and Lambda function name in DynamoDB, fetch the event payload from S3, compose an event and send it to the Lambda function for re-processing.
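
The lookup-fetch-invoke sequence can be sketched in Python as follows. The function name `reprocess`, the item fields and especially the S3 key layout are assumptions, since the article does not describe them. The client methods mirror boto3 (`get_item` on a DynamoDB Table resource, `get_object` on an S3 client, `invoke` on a Lambda client), but in-memory stand-ins are used so the sketch runs without AWS.

```python
import io


def reprocess(event_id, events_table, s3, lambda_client, bucket):
    """Rebuild a stored event from its ID and send it back for processing."""
    # 1. Metadata lookup: which function handled the event, and when.
    item = events_table.get_item(Key={"event_id": event_id}).get("Item")
    if item is None:
        raise KeyError(f"unknown event_id: {event_id}")

    # 2. Payload lookup in S3. This key layout is an assumption.
    key = f"{item['function']}/{item['timestamp']}/{event_id}.json"
    payload = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # 3. Asynchronous invocation of the original function with the stored payload.
    lambda_client.invoke(
        FunctionName=item["function"],
        InvocationType="Event",
        Payload=payload,
    )
    return item["function"]


class FakeTable:
    def __init__(self, items):
        self.items = items

    def get_item(self, Key):
        item = self.items.get(Key["event_id"])
        return {"Item": item} if item else {}


class FakeS3:
    def __init__(self, objects):
        self.objects = objects

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self.objects[(Bucket, Key)])}


class FakeLambda:
    def __init__(self):
        self.invocations = []

    def invoke(self, **kwargs):
        self.invocations.append(kwargs)
        return {"StatusCode": 202}


table = FakeTable({"e1": {"event_id": "e1", "function": "quantity-sync",
                          "timestamp": 1580947200}})
s3 = FakeS3({("payloads", "quantity-sync/1580947200/e1.json"): b'{"sku": "A-1"}'})
lambda_client = FakeLambda()
print(reprocess("e1", table, s3, lambda_client, bucket="payloads"))
```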

Automatic re-processing is applied whenever information is missing from DynamoDB. For example, if a Lambda function receives a request to sync the quantity of a product that doesn't yet exist in the database, it terminates the current process and sends the product id to another Lambda function, which adds the product to the database. Once the product is added, an automatic re-processing job is started.
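
One possible shape of this defer-and-replay flow is sketched below in Python. How the real pipeline hands the original request to the re-processing job is not described, so carrying it along in a `replay` field is purely an assumption, as are the function names and the dict used in place of the DynamoDB table.

```python
def sync_quantity(message, products, enqueue_product_import):
    """Update the stored quantity, or defer when the product is not known yet.

    Returns "SYNCED" or "DEFERRED".
    """
    product_id = message["product_id"]
    if product_id not in products:
        # Hand the ID to the import step together with the original message,
        # so that it can be replayed once the product has been stored.
        enqueue_product_import({"product_id": product_id, "replay": message})
        return "DEFERRED"
    products[product_id]["bp_availability"] = message["quantity"]
    return "SYNCED"


def import_product(job, products, fetch_product):
    """Store the missing product, then replay the deferred message."""
    products[job["product_id"]] = fetch_product(job["product_id"])
    return sync_quantity(job["replay"], products,
                         enqueue_product_import=lambda _job: None)


# Local usage: a list stands in for the queue feeding the import function.
products, import_jobs = {}, []
first = sync_quantity({"product_id": "p1", "quantity": 7}, products, import_jobs.append)
second = import_product(import_jobs[0], products, lambda pid: {"sku": "A-1"})
print(first, second, products)
```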

In some cases, human error may cause discrepancies between Shopify and Brightpearl data. To resolve them right away instead of waiting for an automatic event to arrive and trigger a fix, a dedicated Lambda function was created. It searches the database for product ids whose bp_availability and sh_availability quantities differ. The list of product ids it finds is then sent to the pipeline entry point, mimicking a real event.
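
The comparison itself is simple, as the Python sketch below shows. Only the attribute names bp_availability and sh_availability come from the text; the `product_id` field, the function names and the shape of the synthetic event are assumptions, and a list of dicts stands in for the items returned by a table scan.

```python
def find_discrepancies(items):
    """Return product IDs whose Brightpearl and Shopify quantities differ."""
    return [
        item["product_id"]
        for item in items
        if item.get("bp_availability") != item.get("sh_availability")
    ]


def build_synthetic_events(product_ids):
    """Wrap each ID in a message shaped like a warehouse event (hypothetical shape)."""
    return [{"product_id": pid, "source": "reconciliation"} for pid in product_ids]


items = [
    {"product_id": "p1", "bp_availability": 5, "sh_availability": 5},
    {"product_id": "p2", "bp_availability": 3, "sh_availability": 4},
]
print(build_synthetic_events(find_discrepancies(items)))
```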


Logging and monitoring

All logs were configured to be pushed to CloudWatch for further aggregation and use. CloudWatch is a comprehensive solution that covered several tasks at once:

  • Logging storage
  • Monitoring system
  • Visualization dashboard

Each Lambda function streams its logs into the corresponding CloudWatch log group. This makes it possible to create custom metrics on specific log groups and use them for simple data aggregation and monitoring.
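
A custom metric of this kind is defined through a metric filter. The Python sketch below only builds the arguments for the CloudWatch Logs `put_metric_filter` call. The log group name, the filter pattern and the `Pipeline` namespace are made-up examples, not values from the project.

```python
def metric_filter_params(log_group, pattern, metric_name, namespace="Pipeline"):
    """Build keyword arguments for the CloudWatch Logs put_metric_filter call."""
    return {
        "logGroupName": log_group,
        "filterName": f"{metric_name}-filter",
        "filterPattern": pattern,
        "metricTransformations": [{
            "metricName": metric_name,
            "metricNamespace": namespace,
            "metricValue": "1",  # each matching log entry adds 1 to the metric
        }],
    }


params = metric_filter_params(
    log_group="/aws/lambda/quantity-sync",  # hypothetical log group
    pattern='"THROTTLED"',                  # count entries containing this term
    metric_name="ThrottledRequests",
)
print(params)
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("logs").put_metric_filter(**params)`.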

For example, the following screenshot shows CloudWatch counting specific log entries and converting that count into a metric.
Figure: AWS CloudWatch

AWS provides a powerful tool for log searching and aggregation: CloudWatch Logs Insights. Combined with the event tracking system, it makes it possible to find all information about a trace within seconds:
Figure: serverless AWS Lambda statistics

CloudWatch Logs Insights also makes it possible to analyse the infrastructure itself, thanks to the native integrations AWS provides behind the scenes. Given the key role Lambda plays in the whole infrastructure, its efficiency and cost need to be monitored.
Figure: AWS Lambda statistics
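
Two queries of the kind described above are sketched below as Python helpers that build the query strings. The helper names are invented, and the trace query assumes that each function writes the event id into its log messages, which the article does not state explicitly. The fields @timestamp, @log, @message, @type, @duration, @billedDuration and @maxMemoryUsed are ones CloudWatch Logs Insights provides for Lambda logs.

```python
def trace_query(event_id, limit=50):
    """Query returning every log line that mentions the given event ID."""
    if '"' in event_id:
        raise ValueError("event_id must not contain double quotes")
    return (
        "fields @timestamp, @log, @message\n"
        f'| filter @message like "{event_id}"\n'
        "| sort @timestamp asc\n"
        f"| limit {limit}"
    )


def lambda_report_query(bin_size="1h"):
    """Query aggregating Lambda REPORT lines into duration and memory statistics."""
    return (
        'filter @type = "REPORT"\n'
        "| stats count(*) as invocations, avg(@duration) as avg_ms, "
        "max(@duration) as max_ms, sum(@billedDuration) as billed_ms, "
        f"max(@maxMemoryUsed) as max_memory_bytes by bin({bin_size})"
    )


print(trace_query("3f0c9a52"))
print(lambda_report_query())
```

The resulting strings can be pasted into the Logs Insights console or passed as `queryString` to the CloudWatch Logs `start_query` API.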

The last CloudWatch feature put to use was its dashboards and visualization tools. To give the client the best observability, the data was grouped according to business logic and links to the dashboards were provided.
Figure: AWS CloudWatch dashboards (four screenshots)

Results and achievements

The microservice architecture makes it possible to migrate the pipeline part by part, following the business logic, which eases the migration and reduces possible downtime. At the same time, an immutable pipeline allows for a soft launch: a new webhook is added to the source of the events while the old one stays in place, and everything can be verified before a hard cutover. Event tracking, monitoring and logging make it possible to follow the first hours after go-live closely and make sure nothing happens unnoticed.

The migration resulted in a significant decrease in the hosting bill and improved quality of service, as no transactions went missing and all data was eventually synced. The whole process became more transparent to the client, who finally gained the ability to watch the application at work.


Number of API requests to Brightpearl after the refactoring process:
Figure: API requests to Brightpearl

Lambda statistics per month:

  • 1,731,233 events processed a month
  • 111 hours of computing

Figure: Lambda statistics

Hosting costs of the serverless pipeline per month: $18.31
Figure: Hosting costs of the serverless pipeline
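
As a rough illustration of what these figures imply, the short Python calculation below derives the average compute time and cost per event. It assumes that the 111 hours are total Lambda execution time and that the $18.31 covers the whole pipeline, so the per-event cost includes storage, queueing and logging as well as compute.

```python
EVENTS_PER_MONTH = 1_731_233
COMPUTE_HOURS = 111
MONTHLY_COST_USD = 18.31

# Average compute time spent on one event, in seconds.
avg_seconds_per_event = COMPUTE_HOURS * 3600 / EVENTS_PER_MONTH

# Hosting cost normalised to one million processed events.
cost_per_million_events = MONTHLY_COST_USD / EVENTS_PER_MONTH * 1_000_000

print(f"{avg_seconds_per_event * 1000:.0f} ms of compute per event")
print(f"${cost_per_million_events:.2f} per million events")
```

That works out to roughly 231 ms of compute per event and about $10.58 per million events.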