IT Operations – Towards automation and prediction
2019 July 01 - 862 words - 5 mins - monitoring - Also available in Dutch
“I have been helping customers with IT Operations issues for many years. One common pattern is to use Splunk as a central analytics platform. It can accelerate IT processes and make them more efficient,” Product Manager Erwin Vrolijk explains.
Everything related to IT is changing rapidly. Vrolijk notices that today’s customer questions are different from before. “The environments are becoming more technically complex and IT Operations must be able to properly manage such an environment. In the past it was often about simple questions. How full is the memory? What about the network load? Now, environments are changing much faster and responsibilities are more complicated, among other things due to the different types of cloud models. Security and compliance have become considerations and organizations want to look at this environment from different job levels and departments.”
Five steps
He lays out a roadmap for customers to optimize IT processes while maintaining maximum control. “We have identified and described different phases, each with its own characteristics and approach. Before moving towards AI Ops you need to sort out a number of things first, because there is a dependency between the different phases. There are five steps in total, which indicate the degree of maturity. The fifth step is the ultimate goal. At that point, with the help of Artificial Intelligence, companies are fully predictive.”
Reactive
The first step is called reactive. “Companies often work as some kind of fire brigade; they solve each problem separately, without any form of centralization or automation. That takes a lot of time.”
Expectative
The second step is called expectative, also known as wait and see. “The known, recurring problems and the responses to them have been mapped and are dealt with faster. You do not yet see problems coming, and new, unknown incidents are still being solved with great difficulty. You can respond faster, but the incidents are still dealt with separately.”
Operational Visibility
After that, operational visibility is the third step. “The technical chain that is needed to deliver services has been mapped. As a result, problems are identified faster and a good estimate of the impact of a disruption can be made.”
IT Insights
Step four is called IT insights. “It is crucial that the information from the previous phase is now no longer used only by IT, but is also used to give direction to the business processes. Data becomes part of the processes and is actively used to support critical decisions. Do I choose new hardware or the cloud? Will I invest in my database platform or in my network?”
AI Ops
Vrolijk calls the ultimate fifth step AI Ops. “At this point, decisions are automated and incidents are predicted so that potential problems are solved before they occur. This is the ultimate goal of every organization.”
Approach
“We first analyze what phase companies are in, individually for each department and even every application,” Vrolijk explains. “We look at the pain points. For example, are certain data sources missing or are SLAs not being achieved because systems are slow or fail? This way we decide where to start. If you want to be able to predict whether an application will fail, you must first map out how that component performs.”
Examples
“A customer who had developed a private cloud environment had an issue with a business-critical application that would occasionally crash at unpredictable moments. This can have many causes. We gained insight into the entire stack, from hardware to the application. That way we could quickly identify that the problem had to be somewhere in the application. Because we measured CPU usage and memory usage, among other things, we were able to see, partly thanks to all the data in our platform, that there was a memory leak in a specific module of the application. As a result, the application gradually used more and more memory and eventually crashed. We were able to solve that, and with the same software we validated that the problem was actually solved. What made it even better was that, thanks to those specific measurements, we were able to predict a new and comparable problem without the customer ever having been bothered by it. The same indicators lit up for another module of the same application, and so we prevented another crash and the associated downtime.”
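The kind of indicator described here, memory usage that keeps climbing until the application crashes, can be illustrated with a simple trend check. The sketch below is an assumption for illustration only and not the method used in the customer case: it fits a least-squares line through periodic memory samples and flags sustained growth. The function names, the threshold `min_slope_mb`, and the sample values are all hypothetical.

```python
from typing import Optional, Sequence


def memory_growth_slope(samples_mb: Sequence[float]) -> float:
    """Least-squares slope of memory usage, in MB per sample interval."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def looks_like_leak(samples_mb: Sequence[float], min_slope_mb: float = 1.0) -> bool:
    """Flag a series whose memory grows by at least min_slope_mb per interval.

    The threshold is a hypothetical tuning value, not a recommended default.
    """
    return memory_growth_slope(samples_mb) >= min_slope_mb


def samples_until_limit(samples_mb: Sequence[float], limit_mb: float) -> Optional[float]:
    """Estimate how many intervals remain before usage reaches limit_mb.

    Returns None when usage is flat or shrinking.
    """
    slope = memory_growth_slope(samples_mb)
    if slope <= 0:
        return None
    return max(0.0, (limit_mb - samples_mb[-1]) / slope)


# Hypothetical samples: one module growing steadily, one staying flat.
leaking = [500 + 10 * i for i in range(20)]
stable = [500, 502, 499, 501, 500, 498, 501, 500]

print(looks_like_leak(leaking), looks_like_leak(stable))
print(samples_until_limit(leaking, limit_mb=1000))
```

Running the same check per module is one way the "same indicators" could light up for a second module before it crashes. A real deployment would also need to handle restarts, garbage-collection sawtooth patterns, and noisy samples.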
“We have seen situations in which more than ten thousand alerts were sent to an operator every day. That is impossible to process. Operators usually see no more than five percent, while the rest also contains important problems. Organizations that are a bit further along the roadmap can cluster alerts with the help of machine learning and reduce them to, for example, ten groups. This makes the work of the analyst or the operator much more efficient.”
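To make the idea of reducing thousands of alerts to a handful of groups concrete, here is a minimal sketch. It does not use machine learning and is not the platform's clustering. It swaps in a much simpler technique: greedy grouping by token overlap (Jaccard similarity) after masking numbers. The function names, the `threshold` value, and the alert messages are hypothetical.

```python
import re
from typing import FrozenSet, List, Sequence, Tuple


def tokenize(message: str) -> FrozenSet[str]:
    """Lowercase the message and mask digit runs so variants share tokens."""
    normalized = re.sub(r"\d+", "<num>", message.lower())
    return frozenset(re.findall(r"[a-z<>_]+", normalized))


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Share of tokens two messages have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_alerts(messages: Sequence[str], threshold: float = 0.6) -> List[List[str]]:
    """Greedily assign each alert to the first group it resembles.

    Each group is represented by the tokens of its first member. The
    threshold is a hypothetical tuning value.
    """
    clusters: List[Tuple[FrozenSet[str], List[str]]] = []
    for message in messages:
        tokens = tokenize(message)
        for representative, members in clusters:
            if jaccard(representative, tokens) >= threshold:
                members.append(message)
                break
        else:
            # No existing group is similar enough, so start a new one.
            clusters.append((tokens, [message]))
    return [members for _, members in clusters]


# Hypothetical alert messages.
alerts = [
    "Disk 91% full on host web01",
    "Disk 97% full on host web02",
    "Login failed for user alice from 10.0.0.5",
    "Login failed for user bob from 10.0.0.9",
    "Disk 88% full on host db01",
]

groups = cluster_alerts(alerts)
for group in groups:
    print(len(group), "x", group[0])
```

With these five messages the sketch yields two groups, one for disk alerts and one for failed logins, so the operator reviews two items instead of five. A machine-learning approach would learn the groupings from the data instead of relying on a fixed similarity threshold.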
Future
Vrolijk has a clear vision of the future. “In five years’ time, what we regard as the fifth step, AI Ops, will largely be a reality. Part of IT Operations will be automated by algorithms, and the people and teams responsible for it can deal with other, more difficult problems.”
This article was originally written for and published at SMT