Our “suffering” with Microsoft Office 365 took place in the summer of 2016. Even then I knew I would want to write about it one day. Time has passed since then and many other projects have been finished. Now, just before Christmas, a little free time has appeared out of nowhere, so I sat down to prepare the article. I have to admit that as I read back through the email thread, the memories returned and my blood started to boil. Still, I will try to stay focused and refrain from sarcastic remarks.
Solving our problem took over 60 emails, 80 hours of work and a month of calendar time. In the end, just as in the Vodafone case (“How Have We Helped Vodafone: Problem Diagnostics”), the cause of the problem was never found (… though I know that Microsoft knows).
In May 2016, we became an indirect CSP partner of Microsoft Office 365
It is a program that allows us to sell all Office 365 services (and now MS Azure as well) with monthly billing. This makes the services easier to obtain. The customer receives a single monthly invoice from us that also covers our work and any HW/SW. The customer does not have to deal with online card payments and foreign invoices (until then, payment was possible only over the Internet and the invoices came from Microsoft’s Irish branch), or buy annual licenses through the OLP program (which was not very flexible).
Because we could now easily resell Office 365, we decided to start using Exchange Online Protection (EOP). We have a number of customers who run their own Microsoft Exchange server (mainly thanks to the formerly cheap SBS editions of Windows) and need to deal with spam and viruses. At the same time, we have plenty of customers who use “Exchange Online”, which includes this antispam. Because of our “belief” in standardization (“Standardization – doing IT as simple as that”), we decided to use this solution (of course, its functionality and price suit us, and it can be used with servers other than MS Exchange).
Problem description
We had already switched several customers to the solution and were about to migrate another one (100+ licenses). We set everything up and tested it. After using the solution for a couple of days, the customer reported that emails were being randomly delayed. About 5% of emails were delayed by tens of minutes to hours. This was a big problem because the customer depends heavily on email. We ran “message tracking” in the EOP console (which is nicely done) and found (see figure) that the emails were being held up at the handoff between Office 365 and the customer’s Exchange server (with EOP, the mail flow is: “sending e-mail server” -> Office 365 -> “receiving e-mail server”).
The transfer error was “connection refused”, so I would have expected something to be actively rejecting the communication (the email server rejecting the session after the TCP connection is established, an RST during the TCP handshake, or an ICMP port unreachable message). However, the logs of the boundary router showed no connection from any Microsoft IP address at that time (we learned which server had tried to deliver from other EOP reports). From this we concluded that the fault might be on Microsoft’s side, and we opened a ticket.
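The difference between an active refusal and a connection that silently goes nowhere can be checked from any outside machine. Here is a minimal sketch in Python, assuming a plain TCP probe is enough; `probe_tcp` is a hypothetical helper name, and the result only reflects the path from the probing machine, not from Microsoft’s servers.

```python
import socket


def probe_tcp(host: str, port: int = 25, timeout: float = 5.0) -> str:
    """Classify a single TCP connection attempt to host:port.

    "open"    - the handshake completed
    "refused" - the peer (or a device in front of it) answered with a reset
    "timeout" - nothing answered within the timeout
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Anything else (no route, DNS failure, ...) is reported verbatim.
        return f"error: {exc}"


if __name__ == "__main__":
    # Local demonstration: a listening socket is "open"; once it is closed,
    # the same port actively refuses the connection.
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    print(probe_tcp("127.0.0.1", port))
    server.close()
    print(probe_tcp("127.0.0.1", port))
```

A “refused” result means that something answered the attempt. A refusal that leaves no trace in the logs of the device in front of the server is exactly the contradiction described above.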
Someone from Microsoft Office 365 support contacted us promptly (I think within about an hour) and started working on the problem with us. That was the most positive part. The negative part is that they are masters at requesting more information: for every 5 minutes of their time, we spend about 30 minutes collecting what they ask for. They often demand unnecessary things, yet they will not move on until they get them. It is an unequal fight with a presumption that the error is on your side.
Starting line
After exchanging about 5 emails, the status was as follows:
- Problem: about 5% of the customer’s emails are delayed by tens of minutes to hours.
- What we found: emails are delayed at the handover between Office 365 and the customer’s Exchange server, with the error “[{LED=450 4.4.316 Connection refused};{MSG=Socket error code 10061};{FQDN=server_dns_name};{IP=customer_ip};{LRT=date}]”
- What we delivered:
  - EOP tracking logs for randomly selected delayed emails.
  - Email headers of the delayed emails.
  - Logs from the internal Exchange server.
  - Logs from the boundary router (containing the accepted and rejected connections to the internal Exchange server, port 25/TCP).
- Additional information: the router’s time is synchronized over NTP and it runs current firmware; we use the same router with the same settings elsewhere and have no problems with EOP there; MS Exchange is up to date.
- Where we stood: Microsoft only considers the possibility that the fault is on our side. 🙁
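The error detail in the tracking log is a list of `{KEY=value}` pairs, which is easy to pull apart when there are many delayed messages to go through. Below is a minimal Python sketch; the function names are hypothetical and the FQDN, IP and date values in the sample are made up for illustration. Socket error code 10061 is the Windows Sockets code WSAECONNREFUSED, so it says the same thing as the SMTP text: connection refused.

```python
import re

# Windows Sockets error codes relevant here.
WINSOCK_ERRORS = {
    10060: "WSAETIMEDOUT (connection timed out)",
    10061: "WSAECONNREFUSED (connection refused)",
}


def parse_eop_detail(detail: str) -> dict:
    """Split '[{KEY=value};{KEY=value};...]' into a dict of stripped strings."""
    pairs = re.findall(r"\{(\w+)=([^}]*)\}", detail)
    return {key: value.strip() for key, value in pairs}


def socket_error_code(fields: dict):
    """Return the numeric socket error from the MSG field, or None."""
    match = re.search(r"Socket error code (\d+)", fields.get("MSG", ""))
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Illustrative sample: FQDN, IP and LRT are placeholder values.
    sample = (
        "[{LED=450 4.4.316 Connection refused};"
        "{MSG=Socket error code 10061};"
        "{FQDN=mail.example.com};{IP=203.0.113.10};{LRT=2016-07-01 10:00:05}]"
    )
    fields = parse_eop_detail(sample)
    code = socket_error_code(fields)
    print(fields["LED"], "|", WINSOCK_ERRORS.get(code, "unknown"))
```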
Neither the Exchange logs nor the router logs indicate that the unsuccessful delivery attempts ever reached even our router; no corresponding connection is visible in them (see picture; times in EOP are in UTC, on the router in UTC+2).
Further communication with Microsoft support looked like this:
- Microsoft asks us to turn off the firewall on the boundary router. This is pointless: if the firewall were blocking the traffic, we would see “rejected connections” in the router logs. But we keep quiet and do as they wish. Of course, the problem is not resolved.
- This time we are to disable SMTP inspection on the router and enable more detailed logging on Exchange. Okay, I can accept the idea with the inspection (even though we do not have it turned on). Why we should turn on detailed logging on Exchange, however, I do not understand, since the failing connections never even reach Exchange. We do everything and send the logs.
- To help Microsoft, we get the ISP involved. The ISP runs a packet capture on its router (one hop before the customer’s router), recording every TCP SYN packet to port 25 with its exact time and source IP. We then pick out some emails delayed during the capture window and provide all the logs for them (this takes a while and, unfortunately, has to be repeated after every step Microsoft recommends). We hand over the capture, and once again it shows no TCP connection attempts for the failed handoffs. To me, this means the problem cannot be on the customer’s side!
- Microsoft thanks us and writes that the “connection refused” error means we are actively rejecting the connection on our side, and that the culprit is the customer’s router or Exchange server, which cannot handle more connections. Hmm, what can one say. The previous step clearly showed that the connection attempts did not even reach the ISP’s router. My arguments (see above) followed, along with some phone calls and needless aggravation.
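The comparison we kept repeating, EOP delivery attempts against what the router log or the ISP capture actually saw, can be sketched in a few lines of Python. This is only an illustration under assumptions: the timestamp format, the 60-second tolerance, the function names and all sample data are made up; the only facts taken from our case are that EOP reports times in UTC while the router logged in UTC+2.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def parse_ts(text: str, utc_offset_hours: int = 0) -> datetime:
    """Parse a timestamp and attach the UTC offset of the log it came from."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.strptime(text, FMT).replace(tzinfo=tz)


def unmatched_attempts(eop_attempts, seen_connections, tolerance_s=60):
    """Return EOP attempts with no matching connection in the other log.

    eop_attempts:     list of (UTC timestamp string, sending server IP)
    seen_connections: list of (timezone-aware datetime, source IP)
    """
    missing = []
    for ts_text, ip in eop_attempts:
        attempt = parse_ts(ts_text)  # EOP times are UTC
        hit = any(
            src == ip and abs((seen - attempt).total_seconds()) <= tolerance_s
            for seen, src in seen_connections
        )
        if not hit:
            missing.append((ts_text, ip))
    return missing


if __name__ == "__main__":
    # Illustrative data only; the IP is from a documentation address range.
    eop_attempts = [
        ("2016-07-01 10:00:05", "198.51.100.7"),
        ("2016-07-01 10:15:00", "198.51.100.7"),
    ]
    router_log = [
        (parse_ts("2016-07-01 12:00:06", utc_offset_hours=2), "198.51.100.7"),
    ]
    print(unmatched_attempts(eop_attempts, router_log))
```

Every attempt this returns is one that EOP claims to have made but that never showed up at the router or at the ISP, which is the evidence we kept sending to Microsoft.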
To be continued
I would like to pause the article here. I wanted to tell the whole story in one piece, but as I was writing, so many emotions came up that I simply couldn’t. In the next part, I will describe how we kept fighting support and how, out of desperation, we tried to push things through the distributor.
If you have personal experience of your own, please share it in the comments.
(Continued in: How Did We Fight Microsoft Office 365 Support (Part 2/2): Microsoft Is Always Right)