Client guide: Internal quality control for efficient management
Find out how we optimized our workflow and reduced the number of mistakes made by our junior specialists. We are confident you will want to adopt this practice in your business!
25 months
Adoption length

8 Juniors
Participated in our research and workflow optimization

~$10,000
Already saved on operational expenses
Yevhenii Hordashnyk
DevOps consultant, co-founder
Veni (Background and Challenge)
While optimizing our workflow, we noticed a striking rate of mistakes in our employees' work. A deep performance analysis showed that we were losing about 15% of our income on correcting mistakes or investigating why a given situation had happened, because we could not allow ourselves to solve our internal problems at the clients' expense.

As a result, we decided to sort the mistakes by type and find their root causes. Based on the results of this analysis, we drew up a plan and a pool of tasks for further workflow optimization.

Vidi (Investigation)
We asked our employees to duplicate their tasks from different projects in our internal Jira. Even though keeping the internal Jira up to date cost the company money, it gave us the opportunity to collect precious statistical data for further analysis.

We filtered all the tickets of the "bug" type, as well as all the tickets that had been Rejected for any reason (see the collection sketch after this list). After a long manual study of the tickets, we identified the following main causes of the incidents and mistakes made by our employees:

  • Knowledge Transfer issues. Employees may leave the company or take PTO or maternity/paternity leave, which results in a new person taking their place, and the newcomer is not familiar with the specifics of the project, its tasks, and so on.

  • Experience issues. Sometimes a person simply does not have enough experience to make the right decision in a given situation. This is more a consequence of the wide range of technologies in the IT industry than of a lack of experience on the part of junior/middle/senior specialists.

  • Emergency issues. An emergency can occur at any time, and in some cases there is simply no time to ask a colleague or mentor for advice.
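For illustration, the kind of ticket pull we used for this analysis can be scripted against the Jira REST API. The sketch below is illustrative rather than our actual tooling: the domain, credentials, project key, and JQL are placeholders.

# collect_tickets.py: a minimal sketch of pulling "bug" and Rejected tickets
# from Jira for offline root-cause analysis. The domain, credentials, and
# project key are placeholders; adjust the JQL to your own workflow.
import requests

JIRA = "https://example.atlassian.net"   # placeholder Jira Cloud domain
AUTH = ("bot@example.com", "api-token")  # basic auth: email + API token
JQL = 'project = INTERNAL AND (issuetype = Bug OR status = Rejected)'

def fetch_all(jql):
    """Page through /rest/api/2/search until every matching issue is fetched."""
    start, issues = 0, []
    while True:
        resp = requests.get(
            f"{JIRA}/rest/api/2/search",
            params={"jql": jql, "startAt": start, "maxResults": 100,
                    "fields": "issuetype,status,created,assignee"},
            auth=AUTH, timeout=30)
        resp.raise_for_status()
        page = resp.json()
        issues.extend(page["issues"])
        start += len(page["issues"])
        if start >= page["total"] or not page["issues"]:
            return issues

tickets = fetch_all(JQL)
print(f"{len(tickets)} tickets to classify by root cause")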
Vici (Solution)
Learn how we managed to fix these issues.
Knowledge Transfer issues

The simplest solution that comes to mind would be to force people to hand over project and status details when leaving the project. However, this is not always possible. Consequently, we had to figure out how to make a handover possible at any time.

As a result, we decided to borrow the solution from the "Dev" part of DevOps.

We always suggest that our clients adopt the Infrastructure-as-Code (IaaC) concept wherever possible. Terraform, CloudFormation, Ansible, Chef... Each of them allows you to describe the infrastructure in a more comprehensive way and much more efficiently than regular documentation does. We decided to implement Code Styling at Alpacked and clearly define the company's standards for all the main IaaC tools we work with. As an example: an Ansible inventory always needs to follow a strict hierarchy of environment type, cloud type, and purpose. It has to have an explicitly specified path to keys, a Python version, and a brief comment about each host's purpose.
Before:

[prod]
Host1
Host2
HostN

[dev]
Host3
Host4

In this case we see a real lack of information. A new specialist couldn't make any sense of it, so it took up to 2 months to figure out the process flow.

After:

[all]
production
development

[production:children]
prod-aws
prod-gcp

[prod-aws:children]
prod-frontend
prod-backend
prod-ci

[prod-ci]
# Jenkins Slave1 that is used to build some stuff
Host1 ansible_ssh_private_key_file=private_keys/key ansible_python_interpreter=/usr/bin/python2.7

[prod-frontend]
# A nodeJS instance for frontend with SSR
Host2 ansible_ssh_private_key_file=private_keys/key ansible_python_interpreter=/usr/bin/python2.7

[dev:children]
dev-aws
dev-gcp

[dev-aws:children]
dev-frontend
dev-backend
dev-ci

[dev-frontend]
# A nodeJS instance for frontend with SSR
Host2 ansible_ssh_private_key_file=private_keys/some-dev-key ansible_python_interpreter=/usr/bin/python2.7
Host3

This way, a new person only needs to know that the project uses Ansible for IaaC. Opening the inventory is enough to learn about the underlying infrastructure, its purpose, the number of hosts, the clouds used, and each host's role. No more struggling to figure out what is hosted, where, why, and how.
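To keep inventories compliant with this standard, a check like the sketch below can be run before review. It is an illustrative sketch, not our production tooling: it assumes the INI layout shown in the "After" example and only verifies the required variables and the purpose comment.

# check_inventory.py: an illustrative lint for the inventory standard above.
# Group and variable names are taken from the "After" example; everything
# else is a sketch rather than our exact tooling.
import sys

REQUIRED_VARS = ("ansible_ssh_private_key_file", "ansible_python_interpreter")

def check(path):
    errors = []
    group, prev_comment = None, False
    with open(path) as inventory:
        for raw in inventory:
            line = raw.strip()
            if not line:
                continue
            if line.startswith("#"):           # purpose comment for the next host
                prev_comment = True
                continue
            if line.startswith("["):           # new group header
                group = line.strip("[]")
                prev_comment = False
                continue
            # Only leaf groups hold real hosts; [all] and *:children list groups.
            if group and group != "all" and not group.endswith(":children"):
                host = line.split()[0]
                for var in REQUIRED_VARS:
                    if var not in line:
                        errors.append(f"{group}/{host}: missing {var}")
                if not prev_comment:
                    errors.append(f"{group}/{host}: no purpose comment above")
            prev_comment = False
    return errors

if __name__ == "__main__":
    problems = check(sys.argv[1])
    print("\n".join(problems) or "inventory follows the standard")
    sys.exit(1 if problems else 0)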
Experience issues

We address experience issues with two practices:

  • Internal Code Review. One of the most important outcomes of the IaaC approach is the ability to apply Code Review to infrastructure changes. Any change to the architecture ends up in an appropriate pull request, which is reviewed both by a peer, to eliminate obvious mistakes, and by a mentor, to confirm that the problem is solved correctly.
  • Documented Best Practices. Once a month we hold a general demo where employees present the work they have done, paying special attention to their solutions to difficult problems. As practice shows, such demos increase employees' involvement in a project and their responsibility for the result.
Emergency issues

At Alpacked, besides technical skills, we strive to develop and improve the skills needed for effective management and HR operations.

We were impressed by Google's Site Reliability Engineering book. It defines the idea of incident reports, which we tried to adopt at Alpacked.

Now every mistake is an incident: we treat it as a case instead of blaming an employee. We try to determine the root cause of the incident, the type of the employee's involvement, the way they reacted to the incident, and the technical means used to fix the issue. As soon as this analysis is completed, we proceed to the next step.
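For illustration, the information gathered at this step fits a simple record like the sketch below; the field names mirror the analysis just described and are our illustrative choice, not a fixed schema.

# A minimal sketch of an incident record; the fields mirror the analysis
# described above, and their names are illustrative, not a fixed schema.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class IncidentReport:
    reported_at: datetime                 # when the incident was found or reported
    root_cause: str                       # outcome of the root-cause analysis
    involvement: str                      # how the employee was involved
    reaction: str                         # how they reacted once it surfaced
    fixes_applied: List[str] = field(default_factory=list)  # technical means used
    sop_id: Optional[str] = None          # filled in once the case becomes an SOP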

We were lucky enough to get familiar with a series of books by Jocko Willink and Leif Babin. The Dichotomy of Leadership is one of my favorites. One of its chapters presents the idea of balancing clear standards against room for creativity: everybody needs the ability to solve issues independently, but in unexpected emergency situations it is better to fall back on previously created Standard Operating Procedures.

Adopting this idea at Alpacked resulted in the following process: once the analysis described above is done and all the factors are investigated, the case is turned into an SOP.

For example, one of our e-commerce clients once got hacked. We found that out during a scheduled audit, and first of all we tried to remove all consequences of the break-in to stop the leak of the customer's data. In the process, all the code and infrastructure changes made by the hacker were wiped out, which made investigating the incident impossible. As a result, we created the following security incident SOP:

  1. Take note of the time when the breach was found or reported by the customer

  2. Notify your direct supervisor about the break-in and treat it as your highest priority until resolved

  3. Create an AMI of the affected EC2 instance for further investigation (if applicable; see the sketch after this list)

  4. Create a copy of the whole project for further investigation

  5. Create a copy of the database for further investigation

  6. Immediately remove the malicious code

  7. Identify the time interval during which the system was compromised

  8. Double-check the security logs and reports generated immediately before and after the break-in, in the following order:

    1. SSH logins

    2. Check the AIDE report

    3. Check whether an EBS volume snapshot exists

    4. Consider web server logs your main source of information

    5. Check security reports

  9. Check that no JS has been injected into the checkout/add-card process

    1. If JS and DB change monitoring wasn't working, use the Wayback Machine to find the approximate date of injection (see the sketch after this list)

  10. Identify the type of data that was compromised

  11. Structure your findings from steps 8-10 and present them to your direct supervisor

  12. Based on the findings, figure out how the attacker compromised the system

    1. If this is a known attack vector that has already been mitigated by Alpacked, find out why the customer hasn't been protected yet

    2. If this is a known attack vector and the attacker still managed to break in, or if this is a previously unknown attack vector, create a document that describes it and suggest a solution that will block it in the future

  13. Fill in the incident report and share it with your direct supervisor
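Steps 3 and 9.1 are easy to script. The sketch below is illustrative rather than our verbatim tooling: the instance ID and URL are placeholders, and error handling is omitted for brevity.

# evidence.py: an illustrative sketch for SOP steps 3 and 9.1.
from datetime import date

import boto3      # AWS SDK, for the AMI in step 3
import requests   # for the Wayback Machine CDX API in step 9.1

def capture_ami(instance_id):
    """Step 3: freeze the compromised instance as an AMI, without rebooting,
    so the evidence is preserved before any cleanup starts."""
    ec2 = boto3.client("ec2")
    image = ec2.create_image(
        InstanceId=instance_id,
        Name=f"incident-{date.today().isoformat()}",
        Description="Forensic copy taken before malicious code removal",
        NoReboot=True,  # don't alter the running state
    )
    return image["ImageId"]

def wayback_snapshots(url, limit=10):
    """Step 9.1: list archived captures of a page to bracket the approximate
    date when malicious JS was injected."""
    resp = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params={"url": url, "output": "json", "limit": limit,
                "fl": "timestamp,digest"},
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json()
    return rows[1:]  # first row is the header

if __name__ == "__main__":
    print(capture_ami("i-0123456789abcdef0"))          # placeholder ID
    for ts, digest in wayback_snapshots("example.com/checkout"):
        print(ts, digest)  # a digest change hints at a content change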

Our results and achievements
This graph presents an analysis of the results of implementing the described practices. It shows four time series:

  1. the average number of incidents per person,

  2. the company's losses caused by incidents (calculated as the number of incidents multiplied by the average number of hours required to eliminate one; see the sketch after this list),

  3. the company's losses caused by internal processes (internal meetings, documentation, planning, etc.),

  4. the total losses of the company.
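A quick back-of-the-envelope check of metric 2, using the pre-adoption averages quoted below (the per-incident effort is implied by the formula, not something we measured directly):

# Implied effort per incident, assuming loss = incidents * avg_hours_to_fix.
avg_incidents = 12.3   # average incidents per person, months 1-7
avg_loss_rate = 104.2  # average loss rate over the same period
print(round(avg_loss_rate / avg_incidents, 1))  # -> 8.5 hours per incident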

The period from the 1st to the 7th month shows the state of losses and incidents prior to the implementation of the described methods.

Average number of incidents: 12.3. Average loss rate: 104.2.

Over the next few months, a record amount of time was spent on implementing these processes, creating documentation, SOPs, etc.

After this point, it is easy to notice that the Total Losses curve tends to decrease thanks to a better ratio of the number of incidents to internal time. In other words, our experience has shown that increasing the amount of internal time invested in the methods described above ultimately decreases losses and, consequently, increases profits in the long term.

As you can see, we could not reduce the number of incidents to zero or near zero. We believe this is due to the randomness of some incidents that cannot be foreseen or prevented. Nevertheless, the incident level we did achieve seems to make it worthwhile for small and medium-sized IT companies to spend resources on implementing this practice.
