I don’t know about you, but these summer holidays are passing really quickly. In previous years there was less work during the summer and I had a chance to rest. Now customers use the holidays for new implementations (server purchases, network redesigns). No wonder: such changes are less disruptive to the business at this time of year. But it is a lot to take on in combination with the summer holidays.
Back to our topic. I have long been meaning to describe how we measure server and system performance, find bottlenecks (the reasons for slowdowns), and design adequate new servers. Judging by conversations with colleagues in the industry and by our own experience, I would say that we are doing pretty well. 🙂 We can estimate the speed-up gained from purchasing a new server/cluster to within 10-15%.
A few notes to introduce context:
- I want this series of articles to be practical, so that you can fix things yourself.
- I will provide the theory wherever I can. If I make a mistake, I will be pleased to be corrected.
- Information is relevant to stand-alone servers (physical or virtual) and small clusters.
- The conclusions/advice I am sharing were either difficult to extract from someone (which only confirms that the information is gold) or laboriously measured by ourselves.
Where it starts
I spent a few hours wondering where to start the “performance tuning/measurement” topic. Should I write it as a story or as a reference manual? I have decided to start the way it usually starts for us (as well as for you)…
One beautiful day, the customer/boss calls us/you to say that the system is slow. And that he/she wants something done about it…
Collect as much information from the user as possible
I remember how much I used to piss my colleagues off at the beginning. They came for advice, but instead of a solution they got told off for not having all of the information. So they went back to the customer and asked for the rest. When I see them now passing this “method” on to their new colleagues, it warms my heart: they understand the point of it and know that it was not me being a smartass.
The fastest and easiest way to get to the cause of a slow system/server (use the same procedure even if something doesn’t work at all) is to ask the following questions:
What is slow?
Users tend to generalize. Especially in the beginning. Initial information is: “everything is slow” or “nothing works”. I get it. They have a lot of work to do, the management pushes them, the computer doesn’t work.
Don’t settle for this information. Keep searching. Ask: “Is the whole system slow or just some module/function? Are other applications slow?” Ideally, list a few applications that run on the PC and a few that are provided from the server and ask whether each of them is slow. You’ll find out whether to focus on the PC, the network, or the server.
Since when is it slow?
Since when has the user observed the slowdown? Did it start 5 minutes ago, this morning or a week ago? Occasionally the user will say that he/she hasn’t used the application for a while, so he/she doesn’t know. Ask whether it was OK the last time it was used.
How slow is it?
My favorite question. 🙂 “Very” and “a little” are not answers. You need more accurate data. For example: “Normally the program starts within 1 second, today it took 5 seconds.” Yes, users often do not know. However, you are trying to find out whether the slowdown is 10% or 200%. This is a big difference (the latter being easier to solve).
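To turn the user's two numbers into a figure you can compare later, a minimal sketch follows. The function name `slowdown_percent` is made up for this illustration; it simply expresses the extra time as a percentage of the normal time.

```python
def slowdown_percent(baseline_s: float, observed_s: float) -> float:
    """Return the extra time as a percentage of the baseline time."""
    if baseline_s <= 0:
        raise ValueError("baseline must be a positive number of seconds")
    return (observed_s - baseline_s) / baseline_s * 100.0


# The example from the text: normally 1 second, today 5 seconds.
print(f"{slowdown_percent(1.0, 5.0):.0f}% slower")    # 400% slower
# A barely noticeable case: normally 10 seconds, today 11 seconds.
print(f"{slowdown_percent(10.0, 11.0):.0f}% slower")  # 10% slower
```

Write the figure down. After you change something in the system, the same measurement tells you whether the change helped.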
Let them show you the problem
If possible, have the user demonstrate the problem. It will help you because:
- When you change/fix something in the system, you will immediately verify that you have solved the problem
- You may notice other issues/indications during the demo.
- Or you will find that the user misdescribes the problem. Ordinary users do not have as much technical knowledge as you do to decide what is and what is not relevant.
Ask other users (possibly at other locations) whether they see the same problem, or whether everything works for them.
How to deal with answers
The crucial question is “since when?”.
If the system/server has been slow for a long time and has kept slowing down as the system’s data, the number of users and the number of features have grown, then you either need a more powerful server (how to design it is covered in the following article) or you have to convince the manufacturer of the information system to optimize it (we have never managed to do so).
If the slowdown appeared suddenly and recently, rejoice. Such a problem is usually easy to fix. You may even discover the cause while gathering information from users.
If you do not know the cause, go through this easy diagram. It will help you decide where to focus your next effort.
PC, server problem
Most often, some program has started to hog resources. The fastest solution: open “resmon” (Start -> Run -> “resmon.exe”). Check whether any process is overloading the processor, memory, or disks (I will share how to identify bottlenecks in the following article). If you find one, stop that process.
If you don’t see any problematic process (or it keeps reappearing), focus on what has changed on the PC/server since it last worked properly.
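If you cannot open resmon interactively (for example on a remote machine), a rough substitute for the memory part of the check can be scripted. The sketch below assumes the CSV output of the built-in Windows command `tasklist /FO CSV /NH`; the helper names and the sample rows are made up for illustration, and it covers memory only, not processor or disk load.

```python
import csv
import io
import re
import subprocess


def parse_tasklist_csv(text):
    """Parse `tasklist /FO CSV /NH` output into (image name, PID, memory in KB)."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        name, pid, mem = fields[0], fields[1], fields[4]
        # "3,512,004 K" -> 3512004; separators differ between locales
        mem_kb = int(re.sub(r"\D", "", mem) or 0)
        rows.append((name, int(pid), mem_kb))
    return rows


def top_memory(rows, n=5):
    """Return the n processes with the largest memory usage."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]


def read_tasklist():
    """Run tasklist on the local Windows machine and return its raw output."""
    result = subprocess.run(["tasklist", "/FO", "CSV", "/NH"],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical sample output, not taken from a real machine.
SAMPLE = (
    '"System Idle Process","0","Services","0","8 K"\n'
    '"sqlservr.exe","2140","Services","0","3,512,004 K"\n'
    '"chrome.exe","9312","Console","1","412,880 K"\n'
)

for name, pid, mem_kb in top_memory(parse_tasklist_csv(SAMPLE), n=2):
    print(f"{name} (PID {pid}): {mem_kb / 1024:.0f} MB")
```

On a real Windows machine, replace `SAMPLE` with `read_tasklist()`.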
Our experience – most often it is:
- Windows updates (Control Panel -> Programs and Features -> Installed Updates)
- Application updates/New application (Control Panel -> Programs and Features, Sort by installed date)
- Antivirus updates
- Driver/firmware updates
- Changing system settings (e.g., power profile)
- New scheduled tasks
- HW problems (e.g., a damaged HDD, RAID, or battery).
If it looks like a problem with the Internet connection on your side:
- Have a look at the border router: how much data flows through it to and from the Internet? If the line is busy, find out who is downloading/streaming and terminate that connection (or apply QoS).
- Try “ping -n 30 220.127.116.11” (even if Google doesn’t like it :-)). You will find out the packet loss and latency (loss should be 0% and latency at most 100 ms). If the values are bad and the line is not fully used, contact your ISP (if the line is busy, however, the bad values are due to policing/shaping at the ISP).
- Run a speed test. Does the line have the right speed? (Treat the test as a guideline only: there may be aggregation on the line or existing traffic, so the result may not be accurate.) If it does not, contact the ISP for help.
Here and there you will find cases where only specific network traffic behaves strangely (e.g., as in our story “How We Helped Vodafone: Problem Diagnostics (Part 1/2)“). An unenviable job awaits you. You had better hope that it resolves itself. If it does not, you will need to get in touch with the ISP and bring a smart network engineer into play.
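The ping check from the list above can be sketched in a few lines. The sketch assumes the English-language summary that Windows `ping` prints; the function names, the sample output and the address 192.0.2.1 (a documentation address) are made up for illustration. The thresholds are the ones from the text: 0% loss and at most 100 ms latency.

```python
import re
from typing import Optional, Tuple


def parse_ping_summary(output: str) -> Tuple[int, Optional[int]]:
    """Extract (loss in %, average latency in ms) from Windows ping output."""
    loss = re.search(r"\((\d+)% loss\)", output)
    avg = re.search(r"Average = (\d+)ms", output)
    if loss is None:
        raise ValueError("no packet-loss summary found in ping output")
    # The latency line is missing when every packet was lost.
    return int(loss.group(1)), (int(avg.group(1)) if avg else None)


def assess(loss_pct: int, avg_ms: Optional[int], line_busy: bool) -> str:
    """Apply the rule of thumb from the text: 0% loss, latency up to 100 ms."""
    if loss_pct == 0 and avg_ms is not None and avg_ms <= 100:
        return "line looks healthy"
    if line_busy:
        return "line is busy: find the heavy traffic or apply QoS first"
    return "bad values on a line that is not fully used: contact the ISP"


# Hypothetical sample output, not a real measurement.
SAMPLE = """Ping statistics for 192.0.2.1:
    Packets: Sent = 30, Received = 29, Lost = 1 (3% loss),
Approximate round trip times in milli-seconds:
    Minimum = 18ms, Maximum = 240ms, Average = 35ms
"""

loss_pct, avg_ms = parse_ping_summary(SAMPLE)
print(assess(loss_pct, avg_ms, line_busy=False))
```

Whether the line is busy is something you read off the border router; the sketch just takes it as an input.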
I hope you like the article. Please share your feedback, experience, insights or any questions you might have. I will be happy to include them in the following article. I wish you a peaceful holiday, free of any problems.
And a piece of good advice at the end, which sometimes saves hours of futile searching. Do not dogmatically believe everything users tell you. Sometimes they are confused or distort something. Or they don’t tell the truth. 🙂