Operational readiness is often the most neglected part of software development. Looked at closely, though, it is a crucial one: if it is not done right, all the effort put into building a valuable product can be wasted, because the product becomes inaccessible to your customers. That is a situation no business and no developer wants to be in.
So what exactly is operational readiness?
Operational readiness means that we are able, and confident, to serve actual production traffic to the feature we have implemented, and that the feature works as expected under that traffic.
How can we achieve operational readiness for our systems?
-
Infrastructure scaling
Here we need to ensure that we have enough infrastructure to support our customers, i.e. the ability to scale our infrastructure to handle peak traffic. For example: Do we have enough servers to bear the traffic? Do we have circuit breakers that can prevent a complete shutdown of our system? Do our cloud resources have the correct configuration to scale as demand increases?
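To make the circuit breaker idea concrete, here is a minimal sketch in Python. The names `CircuitBreaker` and `CircuitOpenError`, and the default threshold and timeout values, are assumptions made for illustration and do not come from any particular library. After a number of consecutive failures the breaker rejects calls for a while instead of piling more load onto a failing dependency.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical, single-threaded circuit breaker for illustration only."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency is failing; call rejected")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to a dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fall back or fail fast. A production version would also need thread safety and metrics, which this sketch leaves out.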
-
Use case scaling
Here we need to ensure that our use case implementation can actually serve customers across their usage patterns. For example, say the use case is that a customer should be able to upload a file on our portal. Ideal scaling here means that our implementation should be able to handle a file of any size.
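One common way to keep an upload handler independent of file size is to process the file in fixed-size chunks rather than reading it into memory at once. The sketch below assumes the upload is exposed as a file-like object; `save_upload` and `CHUNK_SIZE` are hypothetical names chosen for this example.

```python
import io

CHUNK_SIZE = 64 * 1024  # 64 KiB per read; an illustrative value, tune for your system


def save_upload(source, destination, chunk_size=CHUNK_SIZE):
    """Copy a file-like upload to destination in chunks; return bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        destination.write(chunk)
        total += len(chunk)
    return total


# Usage: memory use stays around chunk_size regardless of upload size.
upload = io.BytesIO(b"x" * 200_000)
stored = io.BytesIO()
written = save_upload(upload, stored, chunk_size=1024)
```

In practice "any size" still has limits such as disk space and request timeouts, so it is worth deciding on an explicit maximum and testing near it.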
-
Monitoring Dashboards
Here we ensure that we can monitor the health of our system in real time (or periodically) and can easily identify the problem in case of a failure. This may include jobs that regularly check system health, graphical representations of the real-time metrics emitted by the system itself, etc. Ideally we should monitor every aspect of our system to the extent possible, which immensely helps developers who are trying to find out what has gone wrong.
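A periodic health check job can be as simple as running a set of named checks and recording which ones fail. The following is a rough sketch; `run_health_checks` and the check names in the usage example are made up for illustration.

```python
def run_health_checks(checks):
    """Run each named check and return {name: "ok" or "failing"}.

    `checks` maps a name to a zero-argument callable that returns True when
    healthy. A check that raises is reported as failing rather than crashing
    the whole job.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception:
            results[name] = "failing"
    return results


def broken_dependency_check():
    # Stand-in for a check whose dependency is unreachable.
    raise ConnectionError("dependency unreachable")


# Usage with placeholder checks; real ones would ping a database, a queue, etc.
status = run_health_checks({
    "database": lambda: True,
    "payment_dependency": broken_dependency_check,
})
```

The resulting dictionary could be emitted as metrics and plotted on a dashboard, so that a failing component is visible at a glance.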
-
Auto cut alarms
Here we make sure that we get timely notifications when the system is not behaving as expected. For example: setting thresholds for CPU utilization, endpoint spillovers, dependency throttling, high disk usage, latency increasing beyond limits, etc. This reduces the downtime of our system, because we can start working on an issue as soon as it happens instead of waiting for a customer to tell us that something is wrong, and hence fewer customers are impacted.
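The threshold idea can be sketched as a small function that compares current metric values against configured limits. The metric names, limits, and the `Threshold` and `breached` names below are all assumptions for illustration; a real setup would normally rely on the alarm feature of your monitoring system instead of hand-written code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # name of the metric being watched
    limit: float  # alarm when the current value goes above this


def breached(thresholds, metrics):
    """Return the names of metrics whose current value exceeds its limit.

    Metrics missing from `metrics` are treated as not breached here; in a
    real system, missing data is often worth its own alarm.
    """
    return [t.metric for t in thresholds if metrics.get(t.metric, 0.0) > t.limit]


# Illustrative limits, not recommendations.
thresholds = [
    Threshold("cpu_utilization_percent", 80.0),
    Threshold("disk_usage_percent", 90.0),
    Threshold("p99_latency_ms", 500.0),
]
current = {"cpu_utilization_percent": 93.5, "disk_usage_percent": 41.0, "p99_latency_ms": 500.0}
alarms = breached(thresholds, current)
```

Each name in `alarms` would then trigger a notification or ticket for the on-call developer.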
-
Know the limits
Here we benchmark our system's capacity in terms of real-time traffic handling, long-running async processes, etc. The outcome of this exercise is that we know when our system will fail. This is crucial information, since based on it we can scale our systems with some buffer to absorb peak periods. It saves cost by avoiding over-scaling, helps in designing other systems that will depend on this one, and lets us scale up or down efficiently when the expected traffic is already known.
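Once benchmarking has told us how much traffic one server can handle, sizing with a buffer is simple arithmetic. The sketch below assumes capacity scales linearly with server count, which is a simplification; `servers_needed` and the numbers in the usage example are hypothetical.

```python
import math


def servers_needed(peak_rps, per_server_rps, buffer_fraction=0.25):
    """Estimate server count for a peak load plus a safety buffer.

    peak_rps:        expected peak requests per second
    per_server_rps:  benchmarked requests per second one server sustains
    buffer_fraction: extra headroom, e.g. 0.25 for 25 percent
    """
    if per_server_rps <= 0:
        raise ValueError("per_server_rps must be positive")
    return math.ceil(peak_rps * (1 + buffer_fraction) / per_server_rps)


# Example: 1000 rps peak, 150 rps per server, 25 percent buffer.
# 1000 * 1.25 / 150 is about 8.33, which rounds up to 9 servers.
count = servers_needed(1000, 150, 0.25)
```

Without the buffer the same load would need 7 servers, so the headroom costs two extra servers here, which is the kind of trade-off the benchmark numbers let us reason about.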
-
Creating Runbooks
Runbooks are documents that contain all the information (or at least links to the information) required about the functioning of a specific module or feature. A runbook contains design documents, dashboards, system dependencies and their contact information, how to handle failure scenarios, standard operating procedures, etc. This helps on-calls (the developers monitoring the system) figure out the resolution steps for an issue, or at least reach the correct team to resolve it.
-
Review
Once we have done all of the above as per the system's requirements, we need to review it with senior engineers. This gives us their perspective on operability and surfaces gaps or possible improvements. An external perspective is needed because nobody knows everything, and eventually we need to monitor our system in the best way possible.
These are some (not an exhaustive list) of the ways in which we can launch our feature to customers smoothly and confidently.
Why should we do this?
Smooth roll-out of our feature.
Ability to gracefully handle failures.
Reduce system downtime.
Reduce or avoid negative customer impact.
And this extra effort saves developers' time later on when fixing an issue.
Hopefully, we now have a better understanding of operational readiness and how critical it is to us and our businesses. If you have ideas on other, better ways to achieve operational readiness, feel free to add them in the comments.
Want to know more about another tech topic or aspect of "Growing as software developer"? Feel free to add it in the comments.
Source and Credits : https://www.dcpsoftwaresolutions.com/2023/10/operational-readiness.html