This is the second part of our story with Vodafone, which I began in the article “How Have We Helped Vodafone: Problem Diagnostics (Part 1/2)”. We already know that the problem is on Vodafone’s side. Unfortunately, this is a classic corporation, and in my experience negotiating with one is a problem in itself.
Corporate curse
I called the ISP and explained (or tried to explain) the problem. I described what we had already ruled out and why the problem had to be on their side. The man at the support desk thanked me and said he would arrange everything. I was pleasantly surprised at how well it went.
Because the customer has a high SLA, a technician from the ISP’s subcontractor (the sub-supplier providing the wireless last mile) arrived at the customer’s site within a few hours and tested the line. While the primary line was being checked, the customer was switched to the backup line. As it happened, everything worked over the backup link (which only confirms that the problem is on the ISP side).
The following day the ISP informed us that the line had been measured and everything was fine … at least for them. The customer’s IPSec VPN still didn’t work.
I made another call to the ISP:
- Me: “…I want to know what the technician tested.”
- ISP: “…common line parameters …”
- Me: “But we do not have common line problems. We have latency problems with IPSec traffic. I never said that the line was down.”
- ISP: “Okay, we’ll send the technician once again.”
- Me: “Thank you. Please explain to him that we have problems with IPSec latency. Everything else on the line is OK.”
Three hours later:
- Technician: “… here I am, I’m going to test the line.”
- Me: “Okay, can you please just tell me what you are going to measure?”
- Technician: “…you are supposed to have some latency problems …”
- Me: “We do not have problems with normal latency. We have problems with the latency of the IPSec connection. Did they tell you that? Can you tell me how you are going to measure it?”
- Technician: “Nobody told me that. I am supposed to measure the basic parameters of the line.”
So it was another pointless visit by the technician. Hours of phone calls to the ISP followed, along with discussions with technicians at every level and checks of their routers and infrastructure. To outsiders it may seem like an amusing story about “fighting windmills”. Unfortunately, it went on for more than two weeks, and the customer suffered financial damage.
Then one day, everything escalated to a teleconference: us, the customer’s top management, senior ISP management, and the ISP’s chief technician. The ISP admitted that they did not know what to do about it. On top of that, because Christmas was approaching, there was a freeze on any changes to the infrastructure. Since I had expected something like this, I had a backup plan: a procedure for diagnosing the fault more precisely without touching the ISP infrastructure. The ISP liked the plan and agreed to reimburse all costs.
PATRON-IT – Vodafone 1:0
You’re probably wondering what kind of technique I had come up with.
Here is my thought process:
- I knew that the delay occurred only on IPSec traffic.
- But I did not know where exactly.
- We measured the same IPSec delays after replacing the router with another brand, and also when testing tunnels to other endpoints. The fault therefore had to be somewhere in the Vodafone infrastructure.
- If the latency affected all traffic, I could use tracert to find out at which hop it arose (or how it was spread along the path).
- If I could use a traceroute that carried IPSec traffic as its payload, I could figure out where the latency arises … but I didn’t find such a tool.
- I thought I could somehow simulate it.
- Then I realized that it would not work with just any IPSec traffic (the fault would not show up).
Quick and simple solution:
- Finally, I found the Ostinato tool. It lets you craft arbitrary Ethernet packets and send them through your network card.
- I captured packets of the customer’s real IPSec traffic.
- I copied the captured packets and modified L2 (MACs, CRC) and L3 (IP TTL) to simulate a traceroute.
- I connected my notebook in place of the customer’s router and sent the packets through the network card while running a packet capture to see the ICMP messages reporting their expiration.
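The TTL trick from the steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the actual Ostinato workflow: the function names (`with_ttl`, `ttl_probes`, `sample_esp_packet`) and the sample packet with documentation addresses are hypothetical, and the sketch only rewrites the IP layer (TTL plus header checksum). Ethernet framing and the actual sending are left out.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """One's-complement checksum over an IPv4 header (RFC 1071 style)."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def with_ttl(packet: bytes, ttl: int) -> bytes:
    """Copy a raw IPv4 packet, set a new TTL and recompute the header checksum."""
    ihl = (packet[0] & 0x0F) * 4           # header length in bytes
    header = bytearray(packet[:ihl])
    header[8] = ttl                        # TTL lives at byte offset 8
    header[10:12] = b"\x00\x00"            # zero the checksum before recomputing
    header[10:12] = struct.pack("!H", ipv4_checksum(bytes(header)))
    return bytes(header) + packet[ihl:]    # payload (e.g. ESP) stays untouched


def ttl_probes(packet: bytes, max_hops: int = 8) -> list:
    """Traceroute-style series: the same packet with TTL 1, 2, ..., max_hops."""
    return [with_ttl(packet, ttl) for ttl in range(1, max_hops + 1)]


def sample_esp_packet() -> bytes:
    """Hypothetical stand-in for a captured packet (IP protocol 50 = ESP)."""
    payload = b"\x00\x00\x10\x01" + b"\x00\x00\x00\x01" + b"\xAA" * 16
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(payload), 0x1234, 0, 64, 50, 0,
        bytes([192, 0, 2, 10]), bytes([198, 51, 100, 20]),
    )
    return with_ttl(header + payload, 64)


if __name__ == "__main__":
    for probe in ttl_probes(sample_esp_packet(), max_hops=5):
        print("TTL", probe[8], "checksum", hex(struct.unpack("!H", probe[10:12])[0]))
```

Each probe expires one hop further along the path, so the router that drops it should answer with an ICMP Time Exceeded message. Comparing the timing of those replies hop by hop is what narrows down where the delay appears.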
To my great surprise, everything looked fine. For fifteen minutes I sat quietly and wondered what I could have done wrong. Then I realized that the test was one-way only (i.e. IPSec packets in the customer -> internet direction), while the delay could be in the opposite direction, so I repeated the test the other way. Bingo! The delay was caused by the ISP router located directly at the customer’s site. We passed our conclusions and evidence to the ISP, the router was replaced with another one, and everything started to work.
Unfortunately, nobody knows what technology or fault caused this strange behavior on the router. I’m a little sorry about that, and I still wonder. For us, however, the most important thing was that the customer once again had a functioning network.
Conclusion
As I wrote at the beginning, this was the type of problem where, once it pops up, you can start crossing days off the calendar, because you can do nothing else but work on finding the solution. And you do not know whether that will take you five hours or a hundred. On top of that, you cannot expect anyone to miraculously help you, because this is a problem beyond all manuals.