EPYC Case Study: 15th-Generation DELL Servers. AMD Once Again

By Martin Melich 2020-06-01

Many articles have been written about AMD starting to put pressure on Intel. Even Intel’s stubborn fans are happy about this, as Intel, now under pressure, is making its products more affordable. The second-generation AMD EPYC processors are part of the fifteenth-generation DELL PowerEdge servers.

According to a number of reviews, the AMD EPYC processors are tens of percent cheaper than comparable Intel counterparts. Our customers will especially welcome this change when buying servers. On the other hand, we sometimes hear “stories” that these processors are not suitable for some applications (e.g. databases).

We set out to find the true potential of servers with AMD processors and to decide whether or not to recommend them to our customers. Our partner, DNS a.s., has been kind enough to lend us one of the first 15th-generation Dell servers with an AMD processor in the Czech Republic to test.

My last AMD CPU was in my high school gaming computer (Athlon 2500+ Barton @ 3200+). Similarly, all of our servers to date use Intel processors, and I guess I am not mistaken in saying that the same goes for all of us. I’ve been looking forward to the testing.

The most common CPU bottlenecks

Before we recommend a server solution to our customers, we thoroughly measure their environment. The aim is to optimize costs: we want the customer to cover their needs without paying extra for overkill or saving in the wrong area. By measuring how their systems behave and where the performance bottlenecks are, we can predict and guarantee in advance how much faster the employees using an information system will be able to work.

The heart of each server is the CPU-RAM duet and its interplay with other components such as SSDs, network cards, and other PCIe devices. Regarding the CPU, I most often encounter 4 kinds of bottlenecks:

The first and most restrictive is “single-thread performance”. Most software is still not written to split one computational problem across multiple CPU cores. In this case, performance is proportional to processor frequency (architecture plays a role too), and all you can do is buy a high-frequency processor.
The overall workload is too high, caused by a large number of users, a large number of virtual servers on one hypervisor, or simply software that uses the performance of multiple cores (either efficiently or inefficiently). A CPU with more cores helps in this case.
In the third scenario, I often see insufficient communication speed between memory and CPU. Intel has a 6-channel memory controller. To achieve the highest data flow to and from the operating memory, all 6 channels must be populated evenly. RAM is thus bought not only to increase the total capacity but also to increase the speed at which you can store and read data. This is typical of high-performance computing. For example, there may be only a 10 GB matrix in RAM, but the speed of a numerical simulation depends directly on the throughput of the memory bus.
Finally, there are latencies caused by memory access. Measurements often show that single-socket servers do better than multiprocessor systems. In a single-socket server, the memory controllers do not have to take inter-processor communication and the NUMA architecture into account, so they are simpler and faster.
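The third bottleneck can be illustrated with the usual peak-bandwidth estimate for DDR memory: one channel moves 8 bytes per transfer, so the theoretical ceiling grows linearly with the number of evenly populated channels. The sketch below is only an illustration. The DDR4-2933 transfer rate is an assumption made for the example, not a figure from the measurements in this article, and real throughput always stays below this ceiling.

```python
# Theoretical peak memory bandwidth: channels x transfers/s x 8 bytes.
# ASSUMPTION: DDR4-2933 modules (2933 MT/s); not a figure from this article.
BYTES_PER_TRANSFER = 8  # one DDR4 channel has a 64-bit data bus


def peak_bandwidth_mb_s(populated_channels: int, mega_transfers_per_s: int = 2933) -> int:
    """Upper bound on memory throughput in MB/s for evenly populated channels."""
    return populated_channels * mega_transfers_per_s * BYTES_PER_TRANSFER


if __name__ == "__main__":
    for channels in (2, 4, 6):
        print(f"{channels} channels: {peak_bandwidth_mb_s(channels):,} MB/s peak")
```

Leaving four of six channels empty cuts the ceiling to a third, which is why the number of installed modules matters as much as their total capacity.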

Why comparing the frequency and number of cores is not enough

My methodology was rather simple up until now, as I have only been testing Intel processors. Within the same architecture, performance is proportional to frequency. The only other thing is to install the RAM properly so as “not to screw up” (i.e. always populate all memory channels evenly ☺).

If I wanted to answer a customer’s question, “How much faster will the new server be compared to the 6-year-old one?”, benchmarks would do the work for me by quantifying the inter-generational improvement, that is, the increase in performance per unit of frequency on top of the increase in frequency itself. Among other things, this testing also lets us verify the real computing power, i.e. the correct firmware settings, after purchasing and installing a new server.

However, AMD has a different architecture. It reminds me of a combined athletics event, where the results of individual disciplines are converted into points. Up until now, the athletes have been improving evenly every year in every discipline. Now competitors from another continent have entered the race who cannot throw as far or handle the hammer as well, but have much better results in endurance and the high jump. Suddenly, the overall score can be the same, although there are significant differences between disciplines. ☺

When comparing Intel and AMD, synthetic tests will answer the question of which processor is better in a particular discipline. It’s up to you to answer the big question: what type of workload does your particular environment consist of? This means identifying how all the individual applications behave, which is quite time-consuming in practice. You therefore focus on the software tools and computing tasks that are dominant for a particular customer.

After reading through a lot of reviews and a few case studies, I formed the following expectations regarding the 2nd gen AMD EPYC:

AMD EPYC is going to be faster in tasks that utilize overall throughput between CPU and RAM
AMD EPYC is going to be slower in single-thread performance than the most powerful Intel processors
AMD EPYC is going to be worse in SQL performance than Intel
AMD EPYC is going to be cheaper at the same processing power

How we measured

As the internet is full of comparative synthetic tests, I decided to do a practical comparative case study. We asked 4 of our customers to prepare their typical computing tasks. I first measured the tasks on their Intel-based servers and then on a borrowed Dell PowerEdge R7515 server with a 2nd gen AMD EPYC processor.

I always transferred the whole virtual servers (Hyper-V) to the borrowed server, repeated each measurement several times and did so outside the customer’s working hours. We borrowed the following server from our partner:

The borrowed server: DELL PowerEdge R7515, AMD EPYC 7302P processor, 16 cores at 3.0 GHz, 128 GB RAM (8 channels x 16 GB), 960 GB read-intensive SSD. Firmware set to the performance profile, real frequency 3256 MHz thanks to Dell Controlled Turbo, transition to economy C states disabled.

Figure 1 CPU and RAM parameters – borrowed server – HWiNFO64 app

Figure 2 Synthetic RAM Test – Borrowed Server – MLC app (Memory Latency Checker)

Figure 3 Synthetic CPU Test – Borrowed Server – userbenchmark.com app

Customer 1 – High-performance computing

The first customer specializes in physics research, namely numerical simulations. We measured the performance of their environment recently. They had a numerical simulation whose calculation took about 40 minutes on one core, 20 minutes on two cores, and 20 minutes even on four cores.☺ The goal was to find out why the calculation did not accelerate any further.

The study found that each computational core required about 9000 MB/s of memory bandwidth, while the whole task occupied only 7 GB of RAM. Their quad-core 3.6 GHz server with 16 GB RAM was unable to accelerate any further as it had only two memory channels. The server’s SSDs were practically idle.

At that time, by moving to a suitably designed DELL R640 server, we managed to shorten the run time to “less than three minutes” (165 seconds, to be exact), at which point the memory bus was fully saturated by this type of load at approximately 118,500 MB/s. I was curious to see how the borrowed server with the AMD processor would handle this type of load, as its memory bus is designed differently.
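The behaviour described above (40, 20 and again 20 minutes on one, two and four cores) fits a very simple bandwidth-bound model, sketched below. It is only an approximation under stated assumptions: every core needs the same 9000 MB/s, the run time scales inversely with the number of cores that can be fed, and the old dual-channel server tops out at roughly two cores’ worth of bandwidth. That last value is inferred from the timings, not measured.

```python
# Bandwidth-bound scaling model. ASSUMPTIONS: every core needs the same
# memory bandwidth, and run time scales inversely with the cores that can be fed.
PER_CORE_DEMAND_MB_S = 9_000   # per-core demand from the measurement above
SINGLE_CORE_MINUTES = 40.0     # run time on one core of the old server


def usable_cores(cores: int, bus_mb_s: float) -> float:
    """Cores that can be kept busy before the memory bus saturates."""
    return min(cores, bus_mb_s / PER_CORE_DEMAND_MB_S)


def estimated_minutes(cores: int, bus_mb_s: float) -> float:
    """Estimated run time when only the fed cores contribute."""
    return SINGLE_CORE_MINUTES / usable_cores(cores, bus_mb_s)


# ASSUMPTION: ceiling of the old dual-channel server, inferred from the fact
# that going from two to four cores brought no speed-up.
OLD_BUS_MB_S = 2 * PER_CORE_DEMAND_MB_S

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"{n} cores: {estimated_minutes(n, OLD_BUS_MB_S):.0f} min")
    fed = usable_cores(16, 118_500)
    print(f"cores fed at 118,500 MB/s: {fed:.1f}")
```

With the bus saturating at about 118,500 MB/s, the same model says that only around 13 cores can be fed at full speed, so adding cores beyond that point would not help this particular task.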

My expectations were set high thanks to the great presentation by Drew Gallatin from Netflix at EuroBSDcon 2019. Their conclusion in terms of memory throughput was that one AMD EPYC processor was almost as powerful for them as two Intels and, more importantly, at only a quarter of the total cost.

Customer’s server: DELL PowerEdge R640, 2x Intel Xeon Gold 6244 processor, 8 cores at 3.6 GHz, 192 GB RAM (12 channels x 16 GB), 1.92 TB read-intensive SSD. Firmware set to the performance profile, real frequency 4.28 GHz thanks to Dell Controlled Turbo, transition to economy C states disabled.

Figure 4 CPU and RAM parameters – Customer 1 – HWiNFO64 app

Figure 5 Synthetic RAM Test – Customer 1 – MLC app (Memory Latency Checker)

Figure 6 CPU Synthetic Test – Customer 1 – userbenchmark.com app

I restored the customer’s VM on the borrowed server and started the measurement. The chart below shows the effect of the number of cores on the computation time for both Intel (customer’s server) and AMD (borrowed server).

The final task duration on the borrowed server (AMD) was 182 seconds. The same task on the customer’s server (2x Intel) takes 165 seconds. AMD takes about 10% longer, but the processors cost roughly one-eighth as much (approximate prices: Intel 2x 80,000 CZK, AMD 1x 20,000 CZK)!
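The two ratios in the paragraph above can be checked with a few lines of arithmetic. This is just a worked example using the run times and the rough processor prices quoted here; the helper name is made up for the illustration.

```python
# Figures from the measurement above; prices are the article's rough estimates.
INTEL_SECONDS, AMD_SECONDS = 165, 182
INTEL_CPU_COST_CZK = 2 * 80_000   # two Xeon Gold 6244
AMD_CPU_COST_CZK = 1 * 20_000     # one EPYC 7302P


def relative_slowdown(slow: float, fast: float) -> float:
    """How much longer `slow` takes than `fast`, as a fraction."""
    return slow / fast - 1


if __name__ == "__main__":
    print(f"AMD takes {relative_slowdown(AMD_SECONDS, INTEL_SECONDS):.1%} longer")
    ratio = INTEL_CPU_COST_CZK // AMD_CPU_COST_CZK
    print(f"AMD processor cost is 1/{ratio} of the Intel processor cost")
```

Note that the one-eighth figure compares the total processor cost per server (two Intel CPUs against one AMD CPU), not the price of a single chip.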

This confirmed my expectations. With its architecture, AMD is clearly ahead here. However, it has to be noted that this computational task was not demanding in terms of processor frequency or SSDs. The flow of data between CPU and RAM was the bottleneck, which is a common situation for similar computing tasks. We would definitely go for AMD EPYC when designing a new server for this customer.

Figure 7 Comparison of Calculation Task Duration – Customer 1

Customer 2 – Developer

The second customer is a development company for which we run a hypervisor hosting their development environment. The server has the same configuration as that of Customer 1. This configuration is my favorite due to its high single-thread performance. ☺

The measured task is a static PHP code analysis performed on a Linux server. The customer chose this task because it is the bottleneck in their work. I found that any such analysis of one project fully utilizes one core, the RAM traffic is negligible and the SSDs do literally nothing. An obvious case of single-thread load.

Figure 8 Core Load Sample in Performance Monitor – Customer 2

I expected Intel to clearly dominate. The Intel Xeon Gold 6244 is the fastest eight-core processor ever. By default, the processor operates at its base frequency of 3.6 GHz. Thanks to Dell Controlled Turbo technology, it can operate steadily at 4.28 GHz.

The customer selected a specific static analysis and ran it 6 times in a row on their Intel server. We moved their virtual server to AMD and ran the same analysis. The results are as follows:

2x Intel Xeon Gold 6244 – average task run time 106.4 seconds
1x AMD EPYC 7302P – average task run time 124.2 seconds
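Averages like the two above come from repeating the same task several times. A minimal sketch of such a harness is shown below. It is not the tooling used for this case study, and the workload in the example is a hypothetical stand-in for the real PHP static analysis.

```python
import statistics
import time
from typing import Callable


def measure(task: Callable[[], object], runs: int = 6) -> dict:
    """Run `task` several times and summarise the wall-clock durations."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean_s": statistics.mean(durations),
        # The spread shows whether the runs are repeatable at all.
        "stdev_s": statistics.stdev(durations) if runs > 1 else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical stand-in for the real workload.
    result = measure(lambda: sum(i * i for i in range(200_000)))
    print(result)
```

Reporting the standard deviation next to the mean helps to tell a real difference between two servers from ordinary run-to-run noise.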

Figure 9 2x Intel Xeon Gold 6244 (Left) 1x AMD EPYC 7302P (right)

Figure 10 Time required for static PHP code analysis – customer 2

The expectation that Intel would be faster has been confirmed. However, I thought the difference would be more pronounced. In the Geekbench 5 synthetic benchmark, we can see that Intel is more powerful in many disciplines.

However, I need to get back to the idea of disciplines that I described earlier in the article. When designing a server, we need to know the type of the customer’s load. For example, in floating-point operations, which are the main load for our customer (or at least for this task), the performance of both processors is almost identical.

At the same time, we must remember that we are comparing 2x 8-core Intel CPUs with a single 16-core AMD CPU at a lower frequency and a lower price. I have therefore been very pleasantly surprised by AMD.
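To see why the gap is smaller than the clock difference suggests, the measured times can be normalised by frequency. The sketch below is a rough estimate only: it assumes the loaded core ran at the sustained Dell Controlled Turbo frequency for the whole task on both servers, which was not verified separately.

```python
# Measured averages and sustained frequencies quoted in this article.
# ASSUMPTION: the loaded core ran at the sustained frequency the whole time.
INTEL = {"seconds": 106.4, "ghz": 4.28}
AMD = {"seconds": 124.2, "ghz": 3.256}


def giga_cycles(cpu: dict) -> float:
    """Approximate clock cycles (in billions) spent on the task."""
    return cpu["seconds"] * cpu["ghz"]


if __name__ == "__main__":
    print(f"time ratio AMD/Intel:   {AMD['seconds'] / INTEL['seconds']:.2f}")
    print(f"clock ratio Intel/AMD:  {INTEL['ghz'] / AMD['ghz']:.2f}")
    print(f"cycles ratio AMD/Intel: {giga_cycles(AMD) / giga_cycles(INTEL):.2f}")
```

Under that assumption, AMD is about 17% slower in wall-clock time while Intel’s clock is roughly 31% higher, so the AMD core needs fewer cycles for the same work on this task.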

Figure 11 Comparison of AMD EPYC 7302P and Intel Xeon Gold 6244 in synthetic tests.

Intel Turbo and Dell Controlled Turbo sidenote

I have already mentioned both of these technologies in the article, but have not yet explained them. Both are used to squeeze extra performance out of the processors, but each works a little differently.

For example, by default, the Intel Xeon Gold 6244 processor operates at 3.6 GHz. Thanks to Intel Turbo technology, it can increase its frequency up to 4.4 GHz for short periods, until the temperature approaches the TJMAX limit.

Short-term high performance is the advantage of Intel Turbo. The disadvantage is the so-called “jitter”, i.e. the constant switching of bus multipliers and frequency according to load. It takes a non-zero time to get the processor from the idle state to maximum performance, especially when the economy C states are enabled, which is the default.

Dell does the same with its Dell Controlled Turbo technology, but with a slight twist. The processor is held at the maximum frequency that the server can cool permanently. This frequency varies by model; it is always higher than the base frequency and always lower than the maximum Intel Turbo frequency.

Specifically, the Intel Xeon Gold 6244 processor has a permanent Dell Controlled Turbo frequency of 4.28 GHz and a temporary Intel Turbo frequency of 4.4 GHz. The difference is minimal in this case, especially since it is a well-made eight-core with a more advanced cooling system. For the AMD EPYC 7302P, the base frequency is 3.0 GHz and the permanent DCT frequency is 3.256 GHz.
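Expressed as a gain over the base clock, the frequencies above look as follows. This is a small worked example using only the numbers quoted in this section; the dictionary and function names are made up for the illustration.

```python
# Frequencies in GHz as quoted in this section.
XEON_6244 = {"base": 3.6, "dct": 4.28, "turbo": 4.4}
EPYC_7302P = {"base": 3.0, "dct": 3.256}


def uplift(base_ghz: float, boosted_ghz: float) -> float:
    """Fractional frequency gain over the base clock."""
    return boosted_ghz / base_ghz - 1


if __name__ == "__main__":
    print(f"Xeon 6244 DCT:   +{uplift(XEON_6244['base'], XEON_6244['dct']):.1%}")
    print(f"Xeon 6244 Turbo: +{uplift(XEON_6244['base'], XEON_6244['turbo']):.1%}")
    print(f"EPYC 7302P DCT:  +{uplift(EPYC_7302P['base'], EPYC_7302P['dct']):.1%}")
```

The sustained gain is noticeably larger on the Xeon than on the EPYC, which is worth keeping in mind when comparing the two by base frequency alone.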

Customer 3 – IS Helios and SQL

The third customer is an engineering company using the Helios information system in conjunction with an SQL server. They run the very same server as the first customer. The customer has long struggled with optimizing Helios, which has customized modules. While the customer is pushing the vendor to optimize Helios at the level of software architecture, the vendor is unable to implement new features, let alone speed optimizations. Unfortunately, this is a common scenario that I see across information systems in practice.

The customer prepared 3 tasks that are repeated every day and are time-consuming. We ran the measurements on their server and on the borrowed one. I expected Intel to be faster, mainly thanks to its better single-thread performance. To my surprise, AMD came out ahead, which made no sense.

I repeated the measurement, this time with a larger number of monitored indicators. I found that during the SQL server measurements all CPU cores were used, but not to the maximum. SSD write operations showed up in the measurements, so my suspicion turned in that direction.

Figure 12 CPU Load Process with AMD – Customer 3

I measured the SSD performance on both the customer’s and the borrowed server. The customer’s Intel server has a larger SSD with better sequential write and read that can also handle more total 4k IOPS, but the SSD in the AMD server provides somewhat better 4k IOPS in single-thread write/read.

Figure 13 Comparing the SSD measurements of both servers
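The single-thread 4k write figure is the one that matters for this kind of load, because each small write has to be confirmed before the next one is issued. The sketch below shows the idea in a few lines. It is not the tool used for the measurements in this article, and results from such a script are affected by the file system, caching and the virtualization layer, so a dedicated storage benchmark is the better choice for real decisions.

```python
import os
import tempfile
import time


def sync_write_iops(directory: str, blocks: int = 200, block_size: int = 4096) -> float:
    """Write blocks one at a time, flushing each to the device, and return IOPS."""
    payload = os.urandom(block_size)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(blocks):
            os.write(fd, payload)
            os.fsync(fd)  # wait until the block is reported as written
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    # Guard against a zero interval on very fast (e.g. RAM-backed) storage.
    return blocks / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Point this at a directory on the drive you want to test.
    target = tempfile.gettempdir()
    print(f"{sync_write_iops(target):.0f} IOPS (4 KiB, one write at a time)")
```

Because the loop never has more than one write in flight, the result reflects write latency rather than the drive’s total throughput.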

I created a RAM disk on the AMD server and redirected TempDB to it to take the SSD out of the equation. The tasks ran even faster. The load on one core increased; the increase had been there before, only less pronounced.

We could probably get NVMe drives with better write latency than the existing SSD to accelerate the process further, but the cost of the entire server would increase significantly. The final bottleneck could then be the single-thread load, which is probably responsible for part of the data operations and depends on the design of the task in Helios itself.

Instead of comparing the CPUs, I actually found out what the real problem was. Both CPUs were fast enough for the bottleneck to show up elsewhere. A real-life situation. Based on this finding, the customer would be fine with a significantly cheaper AMD server or a cheaper Intel server with less powerful CPUs. To take advantage of the Intel Xeon Gold 6244 performance in this case, storage latencies would need to be reduced.

Figure 14 Process of CPU Load With AMD And RAM Disk For TempDB – Customer 3

Figure 15 Resulting times, Intel vs. AMD – Customer 3

Customer 4 – Money S5 and SQL

The customer has a 3-year-old server that serves as a hypervisor for multiple VMs. We use one VM to run Money S5 over MS SQL with a business intelligence module. Several tasks in Money S5 and BI take a long time and the customer would like to speed them up. We were interested in how much we could accelerate the process by “merely” exchanging the server for the borrowed one.

Customer’s server: DELL PowerEdge R530, 2x Intel Xeon E5-2620 v3 processor, 6 cores at 2.4 GHz, 96 GB RAM (6x 16 GB), 1.92 TB read-intensive SSD. Firmware set to the performance profile, real frequency 2.54 GHz thanks to Dell Controlled Turbo, transition to economy C states disabled.

Figure 16 CPU and RAM parameters – Customer 4 – HWiNFO64 app

The workload of the tasks selected by the customer is a combination of single-thread and multi-thread, and the processor is the bottleneck. By simply replacing the server, the time required to complete the tasks was reduced by an average of 30%, which is not bad. The Intel Xeon E5-2620 v3 processor is, of course, as old as the Dell 13th generation server. However, it’s pleasant to see that when replacing the server, the customer can choose a cheaper single-socket server and get higher performance (about 42% higher) at the same time.
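A shorter run time and a higher performance are two views of the same result, and the conversion between them is easy to get wrong. The worked example below shows the relation. It uses the rounded 30% from the text, so it lands slightly above the 42% quoted above, which presumably comes from the unrounded measurements.

```python
def speedup_from_time_reduction(reduction: float) -> float:
    """Convert a fractional run-time reduction into a fractional performance gain."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    # Work per unit of time grows as 1 / (remaining share of the time).
    return 1 / (1 - reduction) - 1


if __name__ == "__main__":
    gain = speedup_from_time_reduction(0.30)
    print(f"30% less time means about {gain:.0%} more work per unit of time")
```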

Figure 17 Intel vs. AMD – Customer 4

Conclusion

The new AMD EPYC processor platform in DELL servers has surprised me. It is a clear winner over Intel for computations bound by RAM throughput. You get higher performance for significantly lower purchase costs. It is also a very versatile processor for classic use at a good price.

Intel is still #1 when it comes to the highest single-thread performance. For example, if you have software optimized for AVX-512 instructions, Intel is a much more powerful choice.

Which processor is the winner? Unfortunately, there is no universal winner. It will be a different manufacturer and model for each customer. When designing a server, it is crucial to correctly identify the customer’s workload, to understand how software tools behave and to measure them accurately. It is also important to measure everything after you deploy the server. You must verify that the server is performing as expected. Incorrect server settings, driver errors, or other incompatibilities can halve the server performance.

We, as well as our customers, have been so impressed by the new Dell PowerEdge servers with the second-gen AMD EPYC that we have already placed an order for three of them. It’s been 15 years since I last ordered an Advanced Micro Devices CPU.