This post is also posted at the Mercari Engineering Blog as part of the Merpay Tech Openness Month.
The 7th day’s post of Merpay Tech Openness Month 2021 is brought to you by @adlerhsieh from the Merpay Payment Platform team.
What is ChatOps?
For those who are not familiar with ChatOps, ChatOps is a style of running our system operations. The interactions are based on sending and receiving messages on messaging platforms like Slack or Microsoft Teams. It feels like talking to a team member, and the team member will try to get things done for you. What an enjoyable experience!
In Merpay, members spend a lot of time on Slack since it’s our primary tool for communication. It is natural that when we do our daily operations, we want to do it on a platform that we’re already familiar with, instead of coming up with another web app that requires 10 pages of onboarding documents.
The basic form of using Slack to do daily operations is to talk to a bot and perform certain actions. There are many amazing integrations on Slack that allow us to interact with it in different ways, including slash commands, or setting up a webhook with specific keywords as its trigger. Usually, there is a server somewhere else, receiving messages from Slack and responding accordingly.
With that, there are a million possibilities for a team to create workflows to resolve any issues.
The Benefit of Using ChatOps
So, how is this different from traditional system operation approaches? For example, why don't we just build a web app to handle system tasks like reporting, monitoring, and sending emails? Or create a CLI command that can run in anyone's local environment?
No Need to Build a New UI
The most productive part about using ChatOps is that the UI has already been built. Slack provides a highly customizable way to input data and can display rich text responses. Building a web app requires a developer to build a frontend application with authentication, interface design, and a lot of other work. Building a ChatOps integration in Slack lets us skip all that.
Authentication is already handled as well. Since users interact with Slack, and Slack is responsible for talking to the server, Slack takes care of authenticating the user. All we need to do on the server side is verify that each request really comes from Slack and control access permissions. It's a lot simpler than building a web app.
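As a rough sketch of the "verify that the request comes from Slack" step, the snippet below follows Slack's signing-secret scheme: Slack sends a timestamp and an HMAC-SHA256 signature (in the X-Slack-Request-Timestamp and X-Slack-Signature headers), and the server recomputes the signature over "v0:timestamp:body". The function name and the 5-minute replay window are assumptions for illustration, not necessarily how our bot does it.

```python
import hashlib
import hmac
import time
from typing import Optional


def is_valid_slack_request(signing_secret: str, timestamp: str, body: bytes,
                           signature: str, now: Optional[float] = None,
                           max_age_seconds: int = 300) -> bool:
    """Return True if the request carries a valid, recent Slack signature.

    Hypothetical helper: the name and the max_age_seconds default are
    assumptions for this sketch.
    """
    now = time.time() if now is None else now
    try:
        request_time = int(timestamp)
    except ValueError:
        return False
    # Reject old (or far-future) requests to limit replay attacks.
    if abs(now - request_time) > max_age_seconds:
        return False
    # Slack signs the string "v0:<timestamp>:<raw request body>".
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The check has to run on the raw request body, before any form or JSON parsing, because the signature covers the exact bytes Slack sent.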
Compared with building a CLI program, using Slack doesn’t require installing dependencies or prerequisites. For a CLI program, depending on the programming language, users might need to install various dependencies on their environment. Also, even the simplest binary executable still has to be maintained and upgraded on the client side. On the other hand, Slack is a tool that everyone will install on their computer for communication among team members. No other actions are required to use this tool. It's already there!
Rich Text Output
Modern messaging platforms like Slack provide a rich text output for a response from a bot user. For operations that require a complicated layout of messages, elements can be composed as blocks or attachments on Slack. The following is an example of the use of attachments:
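To make the idea concrete, here is a minimal sketch of a message payload with one attachment. The keys (color, title, text, fields) are part of Slack's attachment format, but the function name, the service name, and the release-check content are made up for illustration.

```python
import json
from typing import Any, Dict


def build_release_message(service: str, version: str, schema_ok: bool) -> Dict[str, Any]:
    """Build a Slack message payload with a single attachment.

    Hypothetical example: the wording and fields are assumptions.
    """
    return {
        "text": f"Release check for {service}",
        "attachments": [
            {
                # "good" and "danger" are color names Slack understands.
                "color": "good" if schema_ok else "danger",
                "title": f"{service} {version}",
                "text": "All checks passed." if schema_ok else "Schema check failed.",
                # Short fields are rendered side by side.
                "fields": [
                    {"title": "Version", "value": version, "short": True},
                    {"title": "Schema", "value": "OK" if schema_ok else "Mismatch", "short": True},
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_release_message("payment-api", "v1.2.3", True), indent=2))
```

The resulting dictionary is what a bot would send as the JSON body of its reply.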
How is ChatOps Used in Merpay?
In Merpay, we use Slack to make our workflows easy to access. There are many Slack apps that we're currently using for different purposes. For example, we have an app for creating security inquiries, an app for translation, and an app for automatically replying to HR-related questions.
For the engineering team, one big project that we're working on at the moment is to have a universal operation bot for the entire engineering team. Here are some very useful workflows that we're currently using:
This is one of the main reasons why we wanted to start this project. There are several teams in Merpay that consist of more than ten members working on the same service. In order to manage the release workflow and make sure nothing is missed during the process, we created a "release checklist" for the person in charge of the release to go through all the details before releasing the change to the production environment. Over time, the checklist grew from two items to ten or twelve, and we started to notice that it was taking up a lot of developers' time.
In order to save developers' time, we started this operation bot and allowed it to automate some of the actions. This includes awesome features like:
- Database schema checking: it goes to the production environment database and checks whether the current schema matches the schema used by the codebase. If any table has been changed, database migrations have to be done before the changes are released.
- A utility action to check the latest version of the deployment. No more messing up the tags!
- And many more features that we are currently working on!
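The schema check in the list above can be pictured as a comparison between the schema the codebase expects and the schema found in the database. The sketch below is only an illustration of that idea: the function name and the dictionary representation (table name mapped to a set of column names) are assumptions, and how each schema is actually loaded is left out.

```python
from typing import Dict, List, Set

Schema = Dict[str, Set[str]]  # table name -> column names


def find_schema_drift(expected: Schema, actual: Schema) -> List[str]:
    """List tables and columns the code expects but the database lacks.

    Hypothetical helper for illustration, not the bot's real check.
    """
    problems: List[str] = []
    for table, columns in sorted(expected.items()):
        if table not in actual:
            problems.append(f"missing table: {table}")
            continue
        for column in sorted(columns - actual[table]):
            problems.append(f"missing column: {table}.{column}")
    return problems


if __name__ == "__main__":
    expected = {"users": {"user_id", "name", "email"}, "payments": {"id"}}
    actual = {"users": {"user_id", "name"}}
    for problem in find_schema_drift(expected, actual):
        print(problem)
```

An empty result means the release can proceed; anything else means a migration has to run first.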
Running Database Queries
One pain point in our system is that developers do not have direct access to the production database. One principle is that database queries should be reviewed before running on the production database, in order to avoid running dangerous queries. Some of us might have the experience of accidentally deleting records that are not supposed to be deleted!
This is a good security practice, but it slows us down during an incident. So we started to look for a way to run a safe query without waiting for a colleague's review. After investigating, we found that 95% of the queries we run are the same queries. For example: select * from users where user_id = ''; The solution is to save these queries somewhere and run them with Slack commands! That way, we can ensure that the queries are predefined and safe, and let the bot run them for us without having to ask another colleague to review them.
It turns out to be very helpful. Developers who are handling incidents can focus on the investigation without having to wait for another colleague to review the database query. And other members don't have to be distracted by this unless it's a custom query.
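A minimal sketch of the predefined-query idea follows, using an in-memory SQLite database so it runs anywhere. The registry name, the query key, and the table columns are assumptions for illustration. The important part is that the SQL text is fixed in advance and users only supply parameter values, which are passed as bound parameters rather than pasted into the query.

```python
import sqlite3
from typing import Any, Dict, List, Sequence, Tuple

# Hypothetical registry of reviewed queries. Only values are user-supplied.
PREDEFINED_QUERIES: Dict[str, str] = {
    "user_by_id": "SELECT user_id, name FROM users WHERE user_id = ?",
}


def run_predefined_query(conn: sqlite3.Connection, name: str,
                         params: Sequence[Any]) -> List[Tuple[Any, ...]]:
    """Run a query from the registry; reject anything not predefined."""
    if name not in PREDEFINED_QUERIES:
        raise KeyError(f"unknown query: {name}")
    # Bound parameters keep user input out of the SQL text itself.
    return conn.execute(PREDEFINED_QUERIES[name], tuple(params)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users VALUES ('u1', 'Alice')")
    print(run_predefined_query(conn, "user_by_id", ["u1"]))
```

A custom query that is not in the registry is rejected, so it still has to go through the normal review process.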
Running Automated Test Suites
Our QA team is very busy, and not all of them have the time to master the CLI. We have an automated test suite built for the QA team to run on the test environment, but it has to be triggered by a script through the CLI. It is good, but could be better. We ended up creating a tool on the operation bot to trigger the test suite through Slack. It resolves an onboarding issue for the team. No more installing tools through the CLI!
This is the part that I love the most! Even though not many people are using it, the bot is able to collect memes from Reddit and post them to a Slack channel. Don't judge! It has helped me kill a lot of time when GitHub is down.
How We Structure Our App
The structure of the operation bot is very straightforward:
We run a server on GCP CloudRun, and set up a webhook on Slack to post any message received to the server. The server will talk to other external services that we're using.
The benefit of GCP CloudRun is cost control. A bot is an internal tool, so it receives far fewer requests than the servers in our production environment. It's very common for the server to receive no requests for four or five hours. CloudRun can scale the service down to zero instances during those idle periods, so we don't pay for a server that is doing nothing. When a request arrives, it starts an instance to handle the request.
One concern with a utility tool like this is security control: how much power do we want to give this bot? We had several discussions and concluded that it is better to give the bot/server only read access to the resources it operates on, except for low-risk services like TODO item management or creating memes. Because the bot service has access to multiple services and their databases, it's better to leave the important tasks to the services themselves. For actions like updating database records, we will use the bot to create a Kubernetes deployment file and use the target service to actually execute the deployment.
Also, the bot is designed to be used by multiple engineering teams. Since every team has different requirements in their daily operations, the bot has a plugin system. Every piece of functionality is created as a customizable plugin, and if a team needs to reuse the same functionality, all it has to do is create a config file and put that plugin on the list.
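One way to picture such a plugin system is a registry of named handlers plus a per-team config that lists which ones to enable. The sketch below is an assumption about the shape of the idea, not the bot's actual code: the decorator, the plugin names, and the config keys are all hypothetical, and the dictionary stands in for a parsed config file.

```python
from typing import Any, Callable, Dict, List

Handler = Callable[[List[str]], str]

# Hypothetical global registry: plugin name -> handler function.
PLUGINS: Dict[str, Handler] = {}


def plugin(name: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler under the given plugin name."""
    def register(handler: Handler) -> Handler:
        PLUGINS[name] = handler
        return handler
    return register


@plugin("version")
def latest_version(args: List[str]) -> str:
    return f"would look up the latest version of {args[0]}"


@plugin("memes")
def memes(args: List[str]) -> str:
    return "would post a meme"


def load_team_plugins(team_config: Dict[str, Any]) -> Dict[str, Handler]:
    """Return only the plugins a team has listed in its config."""
    enabled = team_config.get("plugins", [])
    unknown = [name for name in enabled if name not in PLUGINS]
    if unknown:
        raise ValueError(f"unknown plugins: {', '.join(unknown)}")
    return {name: PLUGINS[name] for name in enabled}


if __name__ == "__main__":
    team = load_team_plugins({"team": "payment-platform", "plugins": ["version"]})
    print(team["version"](["payment-api"]))
```

With this layout, enabling an existing feature for another team is a config change rather than a code change.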
There are still many things we want to do in the future. For example, the server does not have to listen only to Slack webhooks. It can listen to other sources of "chatting", like GitHub comments. GitHub has a great ecosystem for managing projects, releases, code branches, and pull requests. It would be very beneficial to connect to GitHub and integrate that into the workflow. For example, if someone leaves a comment on a PR that includes specific keywords like a database query, GitHub can send a webhook to the server, and it will execute the database query and automatically comment on the same thread with the result.
We also want to resolve issues like the input language. For now, because there are many parameters and arguments, the server can only understand the input if it looks like a command on the CLI. It cannot understand human language yet. For example, it can understand "db query X", but it cannot understand "could you help me run a db query X?" Natural language processing might not be a priority for this tool, but wouldn't it be cool if it felt like talking to an actual team member who is helping with the daily operations?
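The limitation is easy to see in a small sketch of CLI-style parsing. The snippet below splits the message the way a shell would and only accepts it when the first word is a known command. The set of command names and the function name are assumptions for illustration.

```python
import shlex
from typing import List, Optional, Tuple

# Hypothetical set of top-level commands the bot understands.
KNOWN_COMMANDS = {"db", "release", "test"}


def parse_command(text: str) -> Optional[Tuple[str, List[str]]]:
    """Parse a CLI-style message into (command, arguments).

    Returns None when the message does not start with a known command.
    """
    try:
        tokens = shlex.split(text)
    except ValueError:
        # Unbalanced quotes, e.g. an apostrophe in a normal sentence.
        return None
    if not tokens or tokens[0] not in KNOWN_COMMANDS:
        return None
    return tokens[0], tokens[1:]


if __name__ == "__main__":
    print(parse_command("db query X"))
    print(parse_command("could you help me run a db query X?"))
```

The first message parses into a command and its arguments, while the second is rejected because "could" is not a command, even though a human would read both as the same request.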
We’re looking forward to seeing the growth of this tool!