I do not like problems that appear on their own. They mean unexpected extra work, but above all they are hard to explain to customers (few understand that things can break by themselves). The story I want to tell you is exactly that kind of problem.
It's a wonderful story because we eliminated the problem, yet we still do not know what caused it (maybe some of you do; please let us know). It is also interesting that the customer's ISP (Vodafone) paid us for the solution because they did not know how to solve the problem themselves. Now, on to the story.
The customer called us one afternoon to report trouble with VoIP (long lag on calls) and with the information system (the IS was very slow). Both services were hosted in the datacenter and were reached from the local network via an IPSec VPN.
Quick environment description (see figure):
- The customer has its own routers in an HA cluster.
- The ISP provides a primary (wireless) and a backup (xDSL) line at the customer site, with public IPs that are routable over both lines. Switches and routers are connected in HA.
- Both the primary and the backup line have the last mile operated by an ISP subcontractor (so it's not that easy).
Quick problem diagnostics
First, we checked the server load in the datacenter and the line load at both the customer site and the datacenter. Everything was fine. Then we tried pinging the datacenter's internal and external IPs to verify latency. Hit! Latency to the internal IP (via the IPSec VPN) was far too high: approximately 500 ms, versus 7 ms to the external IP.
Diagnostic information:
- Ping from LAN to the external IP of the router in the datacenter: 7 ms.
- Ping from LAN to the internal IP of the router in the datacenter (via IPSec VPN): 500 ms.
- WAN load on both sides is below line capacity; nothing has changed on the customer side.
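The comparison above boils down to measuring the same destination over two paths and looking at the difference. A minimal Python sketch of that check follows. The function names, the 20 ms threshold and the sample values are our own assumptions for illustration (the samples are made up around the 7 ms and 500 ms we observed, not real measurements):

```python
from statistics import median


def tunnel_overhead_ms(direct_rtts_ms, tunnel_rtts_ms):
    """Return the extra median round-trip time of the tunnel path, in ms."""
    return median(tunnel_rtts_ms) - median(direct_rtts_ms)


def tunnel_is_suspicious(direct_rtts_ms, tunnel_rtts_ms, max_overhead_ms=20.0):
    """Flag the tunnel when its overhead exceeds a threshold.

    max_overhead_ms is an assumed value; pick one that fits your own links.
    """
    return tunnel_overhead_ms(direct_rtts_ms, tunnel_rtts_ms) > max_overhead_ms


# Illustrative samples only: pings to the external IP vs. pings via IPSec.
direct = [7.1, 6.9, 7.0, 7.3]
tunnel = [498.0, 503.0, 500.0, 505.0]

print(round(tunnel_overhead_ms(direct, tunnel), 2))
print(tunnel_is_suspicious(direct, tunnel))
```

Using the median rather than the mean keeps a single lost or delayed ping from skewing the comparison.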
This won't be easy
It was a rather interesting situation, one we had not encountered before. We hoped it would go away the same way it came (unfortunately, it didn't). We called the ISP. The customer has a high-level SLA with the ISP, so the ISP should have taken care of it. Unfortunately, they told us that everything was OK on their side and that they had not made any changes. This is precisely the point where you find out that the problem will not be resolved in two hours.
At this point, we had to find out whether the problem was on the ISP side (the route), the datacenter side, or the customer side (routers, firewall, some incompatibility). Since the ISP and the datacenter are difficult to work with (large corporations), going through them would take a long time and the hourly costs would be high.
So we decided to start with ourselves. We took two Mikrotik routers, set up an IPSec tunnel between them and tested the latency (i.e., we made sure the tunnel was functional and compatible). One Mikrotik stayed connected at our office. The second we brought to the customer and connected in place of the router cluster. We pinged through the new IPSec tunnel and ... the result was as bad as with the original LAN-to-DC (datacenter) tunnel. Latency through the IPSec tunnel was again about 500 ms. This ruled out the customer side (the routers had been replaced with other, proven hardware). At the same time, it ruled out the datacenter side (the other end of the tunnel was not in the datacenter but at our office).
Diagnostic information:
- Ping from LAN to the external IP of the router in the datacenter: 7 ms.
- Ping from LAN to the internal IP of the router in the datacenter (via IPSec VPN): 500 ms.
- WAN load on both sides is below line capacity; nothing has changed on the customer side.
- The problem is not on the customer side: tested with other routers that work elsewhere. Latency still 500 ms.
- The problem is not on the datacenter side: tested with an IPSec tunnel terminated at our office. Latency still 500 ms.
- The problem must therefore be on the ISP side.
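The reasoning behind this list is plain elimination: swap out a component, and if the symptom persists, that component is cleared. As a rough Python sketch (the function name and the component labels are ours, and it assumes there is exactly one faulty component):

```python
def remaining_suspects(suspects, experiments):
    """Narrow down suspects from swap experiments.

    experiments is a list of (replaced_components, symptom_persisted) pairs.
    Assumes a single fault: if the symptom persists, the replaced components
    are cleared; if it disappears, the fault must be among them.
    """
    remaining = set(suspects)
    for replaced, symptom_persisted in experiments:
        if symptom_persisted:
            remaining -= set(replaced)
        else:
            remaining &= set(replaced)
    return remaining


suspects = {"customer routers", "datacenter", "ISP path"}

# The Mikrotik test replaced both tunnel endpoints at once:
# the customer's router cluster and the datacenter end (moved to our office).
experiments = [
    ({"customer routers", "datacenter"}, True),  # latency stayed at ~500 ms
]

print(remaining_suspects(suspects, experiments))
```

With the one experiment we ran, only the ISP path is left, which matches the conclusion above.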
So we now faced the task of explaining all this to ISP support and hoping it would reach the right people, with enough knowledge and authority to solve the problem. Personally, I do not like turning to any support, as it is rarely time-efficient. First you have to explain the problem to the first person, who makes you repeat tests you have already run, and then you explain it all again to the next person. It is particularly time-consuming when a time zone difference between you and technical support comes into play. But I hope to get rid of these prejudices in the future.
For how we managed to solve the problem and how Vodafone paid for the solution, see the second part of the article, “How Have We Helped Vodafone: Solving the Problem (Part 2/2)”.