Internal quality control for efficient management

Client guide

Yevhenii Hordashnyk

CTO, co-founder

Aug 09, 2020 · Jul 18, 2023

25 months

Adoption length

8 Juniors

Participated in our research and workflow optimization

~10 000$

Already saved on operational expenses

Veni (Background and Challenge)
During workflow optimization, we noticed an interesting rate of mistakes in our employees' work. A deep analysis of performance showed that we were losing about 15% of our income to correcting mistakes or finding out why a given situation had happened, because we could not allow ourselves to solve our internal problems at the clients' expense.

As a result, we decided to sort the mistakes by type and find their root causes. Based on the results of this analysis, we drew up a plan and a pool of tasks for further workflow optimization.

Vidi (Investigation)
We asked our employees to duplicate their tasks from different projects in our internal Jira. Even though keeping the internal Jira up to date cost the company money, it gave us an opportunity to collect valuable statistical data for further analysis.

We selected all tickets of the "bug" type, as well as all tickets that had been Rejected for any reason. After a long manual study of these tickets, we identified the following main causes of incidents and mistakes made by our employees:

Knowledge Transfer issues. An employee may leave the company or take PTO or maternity/paternity leave. A new person then takes their place without being familiar with the project's specifications, tasks and so on.
Experience issues. A person simply does not have enough experience to make the right decision in a given situation. This is caused more by the wide range of technologies in IT than by a lack of experience at a particular junior/middle/senior level.
Emergency issues. An emergency may occur at any time, and in some cases there is simply no time to ask a colleague or mentor for advice.
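The ticket selection described above can be sketched in a few lines of Python. This is only an illustration: the ticket records, the field names (`issue_type`, `status`, `root_cause`) and the cause labels are assumptions made for the example, not the structure of our actual Jira export, and the root cause is still assigned manually during the review.

```python
from collections import Counter

# Hypothetical ticket records, e.g. parsed from a Jira export.
# Field names and values are assumptions made for this sketch.
TICKETS = [
    {"key": "OPS-1", "issue_type": "Bug", "status": "Done", "root_cause": "knowledge_transfer"},
    {"key": "OPS-2", "issue_type": "Task", "status": "Rejected", "root_cause": "experience"},
    {"key": "OPS-3", "issue_type": "Task", "status": "Done", "root_cause": None},
    {"key": "OPS-4", "issue_type": "Bug", "status": "Rejected", "root_cause": "emergency"},
    {"key": "OPS-5", "issue_type": "Bug", "status": "Done", "root_cause": "experience"},
]


def select_incident_tickets(tickets):
    """Keep tickets that are bugs or were rejected for any reason."""
    return [
        t for t in tickets
        if t["issue_type"].lower() == "bug" or t["status"].lower() == "rejected"
    ]


def count_by_root_cause(tickets):
    """Count incident tickets per manually assigned root-cause label."""
    return Counter(t["root_cause"] or "unclassified" for t in tickets)


incidents = select_incident_tickets(TICKETS)
summary = count_by_root_cause(incidents)
for cause, count in summary.most_common():
    print(f"{cause}: {count}")
```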

Vici (Solution)
Learn how we managed to fix these issues.

Knowledge Transfer issues

The simplest solution that comes to mind would be to force people to hand over project and status details when leaving the project. However, this is not always possible. Consequently, we had to figure out how to make a handover possible at any time.

As a result, we decided to borrow the solution from the "Dev" part of our component.

We always advise our clients to adopt the Infrastructure-as-Code (IaaC) concept wherever possible. Terraform, CloudFormation, Ansible, Chef: each of them lets you describe infrastructure more comprehensively and much more efficiently than regular documentation does. We decided to implement Code Styling at Alpacked and clearly define the company's standards for all the main IaaC tools we work with. For example, an Ansible inventory always has to follow a strict hierarchy: environment type, then cloud type, then purpose. Each host must have an explicitly specified path to its keys, a Python version, and a brief comment about the host's purpose.


Before

[prod]
Host1
Host2
…
HostN
[dev]
Host3
Host4

In this case we see a real lack of information. A new specialist could not make any sense of it, so it took up to 2 months to figure out the process flow.

After

[all:children]
production
development

[production:children]
prod-aws
prod-gcp

[prod-aws:children]
prod-frontend
prod-backend
prod-ci

[prod-ci]
# Jenkins Slave1 that is used to build some stuff
Host1 ansible_ssh_private_key_file=private_keys/key ansible_python_interpreter=/usr/bin/python2.7

[prod-frontend]
# A nodeJS instance for frontend with SSR
Host2 ansible_ssh_private_key_file=private_keys/key ansible_python_interpreter=/usr/bin/python2.7

[development:children]
dev-aws
dev-gcp

[dev-aws:children]
dev-frontend
dev-backend
dev-ci

[dev-frontend]
# A nodeJS instance for frontend with SSR
Host2 ansible_ssh_private_key_file=private_keys/some-dev-key ansible_python_interpreter=/usr/bin/python2.7
Host3

This way, a new person only needs to know that the project uses Ansible for IaaC. Opening the inventory is enough to learn about the underlying infrastructure: its purpose, the number of hosts, the clouds in use and the purpose of each host. No more struggling to figure out what is hosted, where, why and how.
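A convention like this is easier to keep when it is checked automatically. The sketch below is a minimal, hypothetical linter for an INI-style inventory, not a tool we ship: it only verifies that every host line sets `ansible_ssh_private_key_file` and `ansible_python_interpreter` and that the host's group carries a purpose comment. It does not validate the environment/cloud/purpose hierarchy, and the function name `lint_inventory` is our own invention for the example.

```python
import re

REQUIRED_VARS = ("ansible_ssh_private_key_file", "ansible_python_interpreter")
# Matches "[group]" as well as "[group:children]" / "[group:vars]".
SECTION_RE = re.compile(r"^\[([^\]:]+)(?::(\w+))?\]$")


def lint_inventory(text):
    """Return a list of human-readable violations of the house style."""
    problems = []
    section_kind = None   # None for a host group, "children"/"vars" otherwise
    has_comment = False   # was a purpose comment seen in the current group?
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        match = SECTION_RE.match(line)
        if match:
            section_kind = match.group(2)
            has_comment = False
            continue
        if line.startswith("#"):
            has_comment = True
            continue
        if section_kind is not None:
            continue  # child group names and group vars are not host lines
        host, *pairs = line.split()
        defined = {p.split("=", 1)[0] for p in pairs if "=" in p}
        for var in REQUIRED_VARS:
            if var not in defined:
                problems.append(f"line {number}: {host} is missing {var}")
        if not has_comment:
            problems.append(f"line {number}: {host} has no purpose comment")
    return problems


SAMPLE = """\
[prod-ci]
# Jenkins build agent
Host1 ansible_ssh_private_key_file=private_keys/key ansible_python_interpreter=/usr/bin/python2.7

[dev-frontend]
Host3
"""

for problem in lint_inventory(SAMPLE):
    print(problem)
```

A check of this kind could run in the same pull request review that is used for any other infrastructure change.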

Experience issues

We address experience issues in two ways:

Internal Code Review. One of the most important outcomes of the IaaC approach is the ability to implement Code Review for infrastructure changes. Any change to the architecture ends up in a pull request, which is reviewed both by a peer, to eliminate obvious mistakes, and by a mentor, to confirm that the problem is solved correctly.
Documented Best Practices. Once a month we hold a general demo where employees present the work they've done and pay special attention to their solutions to difficult problems. Practice shows that such demos increase an employee's involvement in a project and the staff's responsibility for the result.

Emergency issues

At Alpacked, besides technical skills, we strive to develop and improve the skills needed for effective management and HR operations.

We were impressed by the Site Reliability Engineering book by Google. It describes the idea of incident reports, which we tried to adopt at Alpacked.

Now every mistake is an incident: we treat it as a case to study instead of blaming an employee. We try to determine the root cause of the incident, the type of employee involvement, the way the employee reacted to the incident, and the technical means that were used to fix the issue. As soon as the analysis is completed, we proceed to the next step.
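To make the analysis concrete, the factors listed above can be captured in a small record. The following is a sketch only: the class name, field names and sample values are illustrative assumptions, not the actual template of our incident report.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentReport:
    """Blameless record of one incident; all field names are illustrative."""
    title: str
    root_cause: str
    employee_involvement: str   # type of involvement in the incident
    reaction: str               # how the employee reacted to the incident
    technical_means: List[str] = field(default_factory=list)  # what was used to fix it

    def is_complete(self) -> bool:
        """The analysis counts as complete once every factor is filled in."""
        return all([
            self.title,
            self.root_cause,
            self.employee_involvement,
            self.reaction,
            self.technical_means,
        ])


# Hypothetical example values.
report = IncidentReport(
    title="Deployment rolled back",
    root_cause="missing handover notes",
    employee_involvement="performed the deployment",
    reaction="escalated to a mentor",
    technical_means=["rollback via CI job"],
)
print(report.is_complete())
```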

We were lucky enough to discover a series of books written by Jocko Willink and Leif Babin. The Dichotomy of Leadership is one of my favorites. One of its chapters stresses the importance of a balance between clear standards and room for creativity. The main idea is that everybody needs to be able to solve issues independently, but in unexpected emergency situations it is recommended to refer to previously created Standard Operating Procedures (SOPs).

Adopting this idea at Alpacked resulted in the following process: once the incident analysis described above is done and all the factors have been investigated, the case is turned into an SOP.

For example, one of our e-commerce clients once got hacked. We found that out during a scheduled audit, and our first move was to remove all traces of the break-in to stop the leak of the customer's data. In the process, all the code and infrastructure changes made by the hacker were wiped out, which made investigating the incident impossible. As a result, we created the following security incident SOP:

1. Note the time when the breach was found or reported by the customer
2. Notify your direct supervisor about the break-in and treat it as your highest priority until resolved
3. Create an AMI of the EC2 instance for further investigation (if applicable)
4. Create a copy of the whole project for further investigation
5. Create a copy of the database for further investigation
6. Immediately remove malicious code
7. Identify the time interval during which the system was compromised
8. Double-check the security logs and reports generated immediately before and after the break-in, in the following order:
- SSH logins
- Check AIDE report
- Check if EBS volume snapshot exists
- Consider Web Server logs as your main source of information
- Check security reports
9. Check that no JS has been injected into the checkout/add card process
- If JS and DB change monitoring wasn't working, use the Wayback Machine to find the approximate date of injection
10. Identify the type of data that was compromised
11. Structure your findings from steps 8-10 and present them to your direct supervisor
12. Based on the findings, figure out how the attacker compromised the system
- If this is a known attack vector that has already been mitigated by Alpacked, find out why the customer hasn't been protected yet
- If this is a known attack vector and the attacker was still able to break in, or if this is a previously unknown attack vector, create a document that describes it and suggest a solution that will prevent it in the future
13. Fill in the Incident report and share it with your direct supervisor
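The key lesson behind this SOP is an ordering constraint: evidence (the AMI, the project copy, the database copy) is preserved before malicious code is removed. A minimal sketch of how a checklist could enforce that order is shown below. The class and step names are hypothetical, and since the AMI step applies only where relevant, the list of required preservation steps can be overridden.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List

# Abbreviated, hypothetical names for the evidence-preserving SOP steps.
PRESERVE_STEPS = ["create_ami", "copy_project", "copy_database"]
REMOVE_STEP = "remove_malicious_code"


@dataclass
class SecurityIncidentChecklist:
    """Tracks completed steps and blocks cleanup until evidence is preserved."""
    preserve_steps: List[str] = field(default_factory=lambda: list(PRESERVE_STEPS))
    completed: Dict[str, datetime] = field(default_factory=dict)

    def complete(self, step: str) -> None:
        """Mark a step as done, recording when it happened (UTC)."""
        if step == REMOVE_STEP:
            missing = [s for s in self.preserve_steps if s not in self.completed]
            if missing:
                raise RuntimeError(f"preserve evidence first: {', '.join(missing)}")
        self.completed[step] = datetime.now(timezone.utc)


checklist = SecurityIncidentChecklist()
try:
    checklist.complete(REMOVE_STEP)      # too early: nothing has been copied yet
except RuntimeError as error:
    print(error)

for step in PRESERVE_STEPS:
    checklist.complete(step)
checklist.complete(REMOVE_STEP)          # allowed now
print(sorted(checklist.completed))
```

Recording a timestamp per step also helps with the later steps that ask for the time interval of the compromise and for a structured report.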

Our results and achievements

This graph presents an analysis of the results of implementing the described practice. Here you can see four values plotted over time:

the average number of incidents per person,
the level of company losses caused by incidents (calculated as the number of incidents multiplied by the average number of hours required to resolve one),
company losses caused by internal processes (internal meetings, documentation, planning, etc),
the total loss of the company.
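As a rough illustration of how these values relate, the sketch below computes the incident loss as defined above and assumes that the total loss is simply the sum of incident losses and internal-process hours. That sum is our assumption for the example, and the figures are hypothetical; they are not read from the graph.

```python
def incident_loss(incident_count: float, avg_hours_to_fix: float) -> float:
    """Incident losses in hours: number of incidents times average hours to resolve one."""
    return incident_count * avg_hours_to_fix


def total_loss(incident_count: float, avg_hours_to_fix: float, internal_hours: float) -> float:
    """Assumption: total loss = incident loss + hours spent on internal processes."""
    return incident_loss(incident_count, avg_hours_to_fix) + internal_hours


# Hypothetical monthly figures, not taken from the graph.
before = total_loss(incident_count=10, avg_hours_to_fix=8, internal_hours=20)
after = total_loss(incident_count=4, avg_hours_to_fix=8, internal_hours=40)
print(before, after)
```

In this made-up case the internal time doubles, yet the total loss still falls because fewer incidents need to be resolved.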

The period from the 1st to the 7th month shows the state of the losses and incidents prior to the implementation of the described methods.
Average number of incidents: 12.3, Average loss rate: 104.2.

Over the next few months, a record amount of time was spent on implementing these processes, as well as on creating documentation, SOPs, etc.

After this point, it is easy to see Total Losses trending downward thanks to a better ratio of the number of incidents to internal time. In other words, our experience has shown that increasing the internal time spent on implementing the methods described above ultimately reduces losses and, consequently, increases profits in the long term.

As you can see, we could not reduce the number of incidents to zero or a near-zero value. We believe this is caused by the randomness of some incidents, which cannot be foreseen or prevented. Nevertheless, the level of incidents we were able to achieve seems good enough to justify small and medium-sized IT companies spending resources on implementing these practices.
