Monitoring System: How I Found Out We Need It
While I was studying at FIT (Faculty of Information Technology) at VUT in Brno, I spent 5 months in Finland as part of the Erasmus program. After returning to Brno, I found accommodation with friends, and we had to furnish it first. That was the beginning of my business. To explain: buying furniture cost me all my savings and I had to start working.
I was young and ambitious, and I felt I could do everything, and do it better than others. That was probably the reason why I couldn't get the job I imagined. So I obtained a trade license and began to work as a computer “repairman” (I had been interested in computers since primary school, so I had some knowledge and experience). From the beginning I worked in the classic break-fix mode (i.e. customers only call when something goes wrong). The advantage is that you have no responsibility for anything. The disadvantage is that revenue is hard to plan, because you do not know when things will break.
Steps towards the first monitoring system
Over time, I started working for some smaller companies that already had a server and at least a dozen computers. This was no longer break-fix mode. These companies wanted to be sure that someone was taking care of the backups, updating the computers, and taking responsibility if anything happened. In the beginning I used “brute force”. Once a month, I visited every customer, manually checked the computers and installed updates. Every week I connected to the servers remotely, did the same as on the workstations and additionally checked the backups.
Although this approach worked, it had several disadvantages:
Low checkup frequency
I checked the servers every Sunday. I made my coffee in the morning, sat down at the computer, connected to the servers one by one and checked everything. When I was finished, I knew everything was fine (servers were updated and backups were working). However, a week is 7 days long. By Wednesday I felt a little uncertain about whether the servers were still all right (backups running, servers not shut down, no disk dropped out of the RAID). By Friday I was very nervous and looked forward to easing my conscience with the regular Sunday checkup. I only knew the servers were okay on Sunday morning, right after checking them; for the rest of the week I could only hope. The customer did not get the level of service I wished for (although it was better than nobody checking the servers at all), and the uncertainty worried me too. 🙁
When I came to the customer once a month, I went around checking and updating the computers. That meant, however, that each user could not work on their computer for around 20 minutes. Some were okay with it (they made a coffee, smoked a cigarette, or did an “offline” job), others were annoyed (they had a lot of work, could not keep up, and I was delaying them). From a business-school point of view, it was uneconomical: the owner of the company paid not only for my time but also for the time of the employee who could not work during the PC intervention.
High time consumption
This approach was time-consuming: I had to get to the customer, update and check the computers manually, then drive back. All this just to know that on 1 day of the month the computers were okay.
For example, checking that the backups on a server are okay takes about 10 minutes (connect to the server, wait for everything to load, run the backup program, go through the logs, log off). If I had checked them every working day, that would have been 21 x 10 minutes = 3.5 hours of work per server per month. Customers did not want to pay for that, and if I had to check 11 servers every day, it would take me almost 40 hours of “slave” work per month (1 working week).
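The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the figures (10 minutes per check, 21 working days, 11 servers) come from the text, while the function and constant names are made up for illustration.

```python
# Figures from the text; the names are illustrative only.
MINUTES_PER_CHECK = 10
WORKING_DAYS_PER_MONTH = 21
SERVERS = 11


def monthly_hours(servers: int,
                  minutes_per_check: int = MINUTES_PER_CHECK,
                  working_days: int = WORKING_DAYS_PER_MONTH) -> float:
    """Hours per month spent checking `servers` servers once every working day."""
    return servers * minutes_per_check * working_days / 60


if __name__ == "__main__":
    print(monthly_hours(1))        # one server: 3.5 hours per month
    print(monthly_hours(SERVERS))  # eleven servers: 38.5 hours per month
```

So the “almost 40 hours” is 38.5 hours, which is roughly one full working week spent on nothing but opening backup logs.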
As the customer base gradually grew, I knew that this was not sustainable and that something had to be done about it. I think I started with the Nagios monitoring system, then moved to Zabbix, tested Centreon for a while, and finally settled on Icinga. It was not as complex as the system we have now, but it was a significant step forward. Finally, with one glance at the web browser, I could see that computers, servers, and the network were okay (services running, backups running, the network available, computers protected by antivirus…). For things I used to check by hand (the backups, for example), I now had a live status update with “no” work.
Years of gradual evolution
We switched from Icinga to GFI MAX (now called SolarWinds RMM) in 2013, after the gentlemen from PBcom “persuaded” us. SolarWinds RMM costs some $$ and is paid for as a service (that is, we pay each month depending on how many devices we monitor). The new system brought new features and the ability to do our work better. On the other hand, it was a financial burden for us (Icinga is free) and unfortunately came with some new problems (it's interesting how many things didn't work on a paid system).
Using trial and error, we gradually reshaped and improved the system. Whenever something broke and we did not find out in time, we coded our own “check” (checking script) or countermeasure. My desire for standardization also helped (Standardization – Doing IT as Simple as That), and the monitoring system began to keep an eye on whether everything was set up the same way everywhere.
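To give an idea of what such a “check” can look like, here is a minimal sketch in Python. It is not one of our actual scripts: the function name, the 26-hour threshold and the idea of judging a backup by the age of the newest file in a folder are all assumptions made for illustration. The exit codes follow the common Nagios plugin convention (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).

```python
import sys
import time
from pathlib import Path

# Nagios-style plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def check_backup_age(backup_dir, max_age_hours=26.0, now=None):
    """Return (exit_code, message) based on the newest file in backup_dir.

    Hypothetical check: assumes the backup job writes at least one new
    file into backup_dir on every run.
    """
    now = time.time() if now is None else now
    directory = Path(backup_dir)
    if not directory.is_dir():
        return UNKNOWN, f"UNKNOWN - {directory} is not a directory"

    # Modification times of all regular files directly inside the folder.
    mtimes = [p.stat().st_mtime for p in directory.iterdir() if p.is_file()]
    if not mtimes:
        return CRITICAL, f"CRITICAL - no backup files in {directory}"

    age_hours = (now - max(mtimes)) / 3600
    if age_hours > max_age_hours:
        return CRITICAL, f"CRITICAL - newest backup is {age_hours:.1f} h old"
    return OK, f"OK - newest backup is {age_hours:.1f} h old"


# Run as a script with the backup folder as the only argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    code, message = check_backup_age(sys.argv[1])
    print(message)
    sys.exit(code)
```

The monitoring agent would run a script like this on a schedule and raise an alarm on any non-zero exit code, which replaces the 10-minute manual walk through the backup logs.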
The current capabilities of our monitoring system are described on the company’s website PATRON-IT: What we check (in Czech) or in the following article.
What does the competition do?
Curious as I am, I try to find out what others are doing and how we are doing in the market. So far, I have noticed that there are IT outsourcing companies that do not use any monitoring system, which is a shame. After all, we, the IT outsourcing companies, should be ahead in the IT field, innovating and doing the work to the best of our knowledge and conscience.
I have also met companies that have a system but are more or less fighting it (as we were a few years ago, before we made all the adjustments). The monitoring system monitors a lot of things and overwhelms them with errors or irrelevant alarms, and management has little time, energy or willingness to move further.
Next should come the companies that use fine-tuned monitoring systems. However, I have not encountered any yet 🙁 (I am only talking about IT outsourcing companies; ISPs are a whole different beast). But I believe they exist. The most likely explanation is that new customers come to us mainly because they are dissatisfied with their existing IT company. Customers do not leave great IT companies, so we never hear about them. 🙂
If you use a monitoring system, I would be glad if you shared your experience. We can grab a coffee/beer in Brno or Hradec Králové. I'll also be in Prague for a week in mid-February. 🙂
Next week I will write in detail about the monitoring system we are using. If you wish, please submit your email below so I can send you a reminder when the article comes out. EDIT 5.2.2018: Already published: Monitoring System – What Can We Do With It.