The obvious choice for software-defined IT infrastructure is to use high-end x86 processors. Millions of IT environments depend on x86. When it comes to alternatives, the choices have been pretty limited so far.
Arm may be on its way to transforming your software-defined datacenter. The Arm server market share is small, nowhere near that of industry-standard x86 servers, but an Arm server can be a good solution for several use cases.
Hyperscale companies like Amazon, Google, and Facebook are interested in Arm-based servers primarily as a way to diversify their CPU supply chain, reduce their dependence on vendors like Intel and AMD, reduce TCO and even eliminate the middleman by building whole custom chips and servers for themselves. Has the time come for smaller public and private clouds to adopt Arm processors?
We outline the recent developments in the Arm server ecosystem and examine the choice between x86 servers and Arm servers.
The Arm bazaar approach
Unlike Intel and AMD, who are vertically integrated to a large degree (they design the instruction set, the CPU cores and the CPU chips, then manufacture, sell and support them), Arm’s approach is much more distributed. This means there are different companies involved in the various stages of the process and best-of-breed solutions have a much better chance of reaching each market segment.
In the Arm ecosystem approach, Arm Holdings designs the instruction set and may license CPU core designs to “licensees”. A second set of companies (Samsung, Qualcomm, Apple and many more) design customized CPU cores for specific applications and use cases. A third set of companies may build systems on a chip (SoCs) with Arm’s “standard” CPU cores or custom CPU cores. The SoCs are manufactured in one of the open fabs (TSMC, Samsung, GlobalFoundries).
Arm’s IP is used in many areas, perhaps most popularly in smartphones. To illustrate the reach of the Arm architecture, every x86 server has multiple Arm cores in various components such as the BMC, HDDs, SSDs, NVMe drives, and RAID controllers. So unknowingly you have been using Arm CPUs in your servers, just not as the main processor. 🙂
Also available on the market are chips with 2-4 Arm cores, which are not as interesting for this review, but may have interesting applications in “the edge”. What sets the three server chips discussed here (from Ampere, Marvell and Qualcomm) apart is their high core count and high single-thread performance.
All three chips have been through a complex history of projects being started and stopped, IP being sold and bought. Two out of the three products use core IP revived from an abandoned project, and at least one of these products went through a phase where the company said publicly they would discontinue it. Luckily for users of CPUs, all three are currently available on the market through various OEMs.
All three use the same ARMv8-A instruction set aka ARM64 aka AArch64 and are compliant with Arm’s SBSA, which is a specification guaranteeing OS and software compatibility between servers from different CPU vendors and OEMs.
Think of it like x86 applications. If you have an application running on an HPE server with an Intel Xeon CPU, would it work on a Dell/AMD EPYC server? Of course it would! You would be mad at your application vendor if it didn’t. The aim of the Arm server ecosystem is the same – code built for ARM64 should run on any ARM64 server, whoever the CPU vendor, whoever the OEM.
Arm-based servers – are they fast?
Arm CPUs are known for their power efficiency, so most people assume an Arm CPU can only be slow and frugal, never fast and power-hungry.
To disprove this, we can simply point to the recent reports of Apple’s iPad outperforming Intel-based laptops, even in single-thread apps: “iPad Pro A12X benchmarks rival MacBook Pros with Intel Core i7 CPUs”. The fairness of Geekbench comparisons between macOS/x86 and iOS/ARM64 is questionable, but even taking that into account, this is still eyebrow-raising news. Certainly unexpected by most.
As a computer architect and technologist, I can say there’s nothing in the ARMv8 instruction set that would prevent it from being used in high-performance applications. To the contrary, it being a cleaner design has certain performance advantages in the “front-end” of a modern superscalar out-of-order CPU.
The hard part for server CPUs is not the instruction set. It is also not the technology for putting many CPU cores on a single chip. The really hard part, which has taken years to perfect, is the design of fast cores, capable of high single-thread performance. These core designs are useful not only for server chips but also for desktops, tablets, and phones.
The three server CPUs discussed here use three distinct CPU core designs, each with its own pros and cons, but all three have reasonable single-thread performance.
In our internal testing, the Arm CPUs were able to approximately match the single-thread performance of Intel Xeon Scalable cores running at 2 GHz. Results vary widely between workloads, so a single average like this doesn’t really do them justice, but it is the best we can do. This is a major achievement! The technology will improve over time to cover more of the high-end segment, where Intel is charging top dollar for the fastest CPU cores available today.
In terms of core-count and thread-count scalability, the Ampere chip (up to 32 cores/32 threads), the Marvell chip (up to 32 cores/128 threads) and the Qualcomm chip (up to 48 cores/48 threads) exceed the core count of reasonably priced Intel Xeon CPUs (Gold 6138 with 20 cores/40 threads) by a factor of 1.6 to 2.4. Thread counts range from 0.8 times (Ampere) to 3.2 times (Marvell) that of the Xeon. They all seem to be well put together, with large L3 caches, plenty of PCIe lanes and DDR4 channels for IO and memory, and plenty of internal bandwidth.
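The core and thread ratios can be checked with a few lines of Python. This is only a sketch of the arithmetic using the counts quoted above; the dictionary and function names are illustrative.

```python
# Maximum (cores, threads) per chip, as quoted in the text.
chips = {
    "Ampere": (32, 32),
    "Marvell": (32, 128),
    "Qualcomm": (48, 48),
}
xeon_gold_6138 = (20, 40)  # the x86 reference part named in the text


def ratio_vs_xeon(chip):
    """Return (core ratio, thread ratio) of an Arm chip relative to the Xeon."""
    cores, threads = chips[chip]
    return cores / xeon_gold_6138[0], threads / xeon_gold_6138[1]


for name in chips:
    core_ratio, thread_ratio = ratio_vs_xeon(name)
    print(f"{name}: {core_ratio:.1f}x cores, {thread_ratio:.1f}x threads")
```

By core count all three chips come out ahead of the Xeon; by thread count the picture is more mixed, because the Xeon runs two threads per core.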
In terms of power efficiency, the leanest of the three chips fits 2 times more CPU cores than Intel Xeon Scalable in the same 120W TDP (power/cooling) envelope. So roughly 2:1.
In terms of cost per core, mid-range Arm-based server chips are typically priced at 1/2 to 1/4 of the cost of Intel’s Xeon or AMD’s EPYC. Where Intel and AMD are charging more than $100 per core in server chips, the Arm camp is charging $30-$50 per core.
So to summarize the performance status:
– single-thread performance (OLTP, low latency) – like an Intel Skylake core at approx. 2 GHz
– multi-thread performance (analytics, batch processing, compute-heavy workloads) – roughly matches high-end Xeons at a fraction of the power
– cost per core – Arm servers are cheaper per core and on top of that have lower power consumption
Bottom line: on a pure economic basis, Arm servers seem to be the winner.
Would my software run on an Arm server?
Software compatibility with Arm has improved greatly over the last few years. All major Linux distributions (Ubuntu, SUSE, Red Hat) support ARM64 as a first-tier target platform. The popularity of the Raspberry Pi 3 (an ARM64 Linux computer from 2016) and its countless alternatives has pushed support for ARM64 in all kinds of software. Support in Linux distros may not be at exactly the same level as x86, but it’s pretty close.
For example, Ubuntu is building, testing and maintaining 50,000 software packages for ARM64, so everything you expect and need from Python, Node.js, Ruby, Go and Java, through runtimes, databases, modules, libraries, web servers and on to all kinds of tools is installed the same way as on x86. Just apt install packagename. Simple.
Even VMware, who are in a much more traditional space, announced they’ll support ESXi on ARM64.
Kubernetes, Docker, OpenStack, KVM, MariaDB, PostgreSQL, Apache, Nginx and everything else you’d expect work on ARM64. So a lot of the infrastructure and platforms we are used to in the Linux ecosystem are simply available to use.
StorPool on ARM64
For StorPool’s solutions, which are high-performance storage systems, Arm servers make a lot of sense. And StorPool now works on Arm, as well.
In particular use cases, they can help us achieve unprecedented power efficiency and storage density. Arm server chips with many PCIe lanes are especially interesting for high-density NVMe storage nodes and for high-density HDD storage nodes.
Arm compute servers connecting to a separate StorPool storage system also make sense for some use cases, like web and mobile applications.
And finally, hyperconverged Arm servers, where applications and the storage system run on the same physical nodes, will also work great with StorPool because of its extreme CPU efficiency. We push a lot of performance using a small number of CPU cores.
As usual, when designing a large system worth hundreds of thousands or millions of dollars, it is worth going through a full solution engineering process to get the most value out of your investment.
When to buy Arm servers and when x86 servers?
The Arm server ecosystem has really picked up in 2018. For the first time ever, there are chips with many fast cores available. This is a game changer.
We are now moving into a period where the IT industry recognizes Arm as a serious alternative for server CPUs. The total addressable market for CPUs in servers is predicted to be $20bn/year by 2020. How much of that Arm will be able to capture in the next two years is anyone’s guess. The inertia of x86 in servers is considerable, so it won’t be easy.
Arm servers don’t provide the best single-thread performance available on the market (yet). For that go to Intel. Arm servers have better power efficiency (per unit of compute) and better cost (per unit of compute). So any application, except the most latency-sensitive applications, would benefit from using Arm servers. For example, if your application crunches data for 1 second on a single CPU thread while a person is waiting, Arm might not be the best choice. If your application can distribute that same work to 100 tasks of 10 ms each, Arm is a great choice.
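The difference between the two cases can be made concrete with a deliberately simple model. It assumes the tasks are fully independent, equal in size and carry no scheduling overhead, which real workloads rarely achieve; the function name is illustrative and the 32 cores match the smallest of the Arm chips discussed earlier.

```python
import math


def wall_time_ms(task_ms, n_tasks, cores):
    """Wall-clock time to run n_tasks equal tasks on `cores` cores.

    Idealized: tasks are independent and there is no scheduling overhead,
    so the tasks simply run in ceil(n_tasks / cores) rounds.
    """
    rounds = math.ceil(n_tasks / cores)
    return rounds * task_ms


# One 1-second task: only one core can help, so per-core speed is everything.
print(wall_time_ms(task_ms=1000, n_tasks=1, cores=32))

# The same work as 100 tasks of 10 ms each: core count dominates.
print(wall_time_ms(task_ms=10, n_tasks=100, cores=32))
```

Under these assumptions the serial case takes 1000 ms regardless of core count, while the split case finishes in 4 rounds of 10 ms, so a modest deficit in per-core speed matters far less.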
The cost of switching to Arm should not be underestimated. In the “business as usual” case, you can buy x86 servers and be done with it. If you buy an Arm server, you are buying into a new CPU architecture, with its own set of things to learn and issues to watch out for. Not that x86 servers are completely easy and issue-free, not at all, but at least we are used to them and we know what issues to expect.
If you are running mostly proprietary third-party (vendor) applications, to which you don’t have the source code, supporting Arm would be a matter of working with your vendors. Given the virtually zero market share of Arm servers in 2018, it will be an uphill battle to convince them to port to ARM64.
In terms of porting your own applications, if they are written in a scripting language the effort would mostly be concentrated on dependencies and libraries. So fairly small effort. On the other hand, if your application is written in a compiled language like C, Go or Rust, at the very least you are looking at re-compilation and removing any x86-specific tweaks you might have. So perhaps a larger effort to take into account. In both cases, automated tests and deployments will make the transition much easier.
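For scripted applications, one small step in such a port is making any architecture-specific choice explicit (which native library or wheel to load, for example). A minimal sketch in Python, using the standard library's platform.machine(); the cpu_family helper and the name sets are our own illustration, not an exhaustive list of what every OS reports.

```python
import platform

# Names commonly reported by platform.machine() for each family.
# Linux reports "aarch64" on ARM64; some other systems report "arm64".
ARM64_NAMES = {"aarch64", "arm64"}
X86_64_NAMES = {"x86_64", "amd64"}


def cpu_family(machine=None):
    """Classify a machine string (default: this host) as arm64, x86_64 or other."""
    name = (machine or platform.machine()).lower()
    if name in ARM64_NAMES:
        return "arm64"
    if name in X86_64_NAMES:
        return "x86_64"
    return "other"


print(f"Running on: {cpu_family()}")
```

Keeping this decision in one place makes it easier for automated tests to exercise both code paths from a single machine by passing the machine string explicitly.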
The larger and more homogeneous your environment, the more it makes sense to consider Arm servers. For example, assuming the application would work well on Arm, if you have 1000 servers doing the same thing, it is easy to justify. If you have 1000 servers all doing different tasks in different ways, it probably doesn’t make sense to convert. It makes sense for cattle and it’s harder to justify for pets.
And if you need to do more compute work with less energy, Arm servers are a great choice.
[ 2018-11-27 update:
Just 5 days after we published this article there was another major development in the Arm ecosystem.
AWS released the Amazon EC2 A1 instances. A1 instances run on an in-house developed multi-core Arm chip with Cortex-A72 cores. Performance per core of the A1 instance types is slightly lower than that of the best available ARM64 cores mentioned above, but still good.
In terms of pricing, Amazon rents the a1.xlarge for $0.102 per hour (on-demand pricing, 4 ARM64 cores with 8 GB RAM), which is roughly the same as an M5 “general purpose” instance with the same amount of memory (m5.large, x86, 2 threads, 1 core, 8 GB RAM, $0.096/hour on-demand). Depending on the exact workload, latency-insensitive scale-out workloads would typically be cheaper to run on A1 than on M5 “general purpose” instances. ]
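To put the two instance types on a common footing, the hourly price can be divided by the number of vCPUs. The sketch below uses only the 2018 on-demand figures quoted in the update; the names are illustrative, and note that an A1 vCPU is a full Arm core while an M5 vCPU is one x86 hardware thread, so the two are not directly equivalent.

```python
# On-demand prices and sizes as quoted in the update (2018 figures).
instances = {
    "a1.xlarge": {"usd_per_hour": 0.102, "vcpus": 4, "ram_gb": 8},  # 4 Arm cores
    "m5.large": {"usd_per_hour": 0.096, "vcpus": 2, "ram_gb": 8},   # 2 x86 threads
}


def usd_per_vcpu_hour(name):
    """Hourly on-demand price divided by the instance's vCPU count."""
    spec = instances[name]
    return spec["usd_per_hour"] / spec["vcpus"]


for name in instances:
    print(f"{name}: ${usd_per_vcpu_hour(name):.4f} per vCPU-hour")
```

On these figures the A1 works out to about $0.0255 per vCPU-hour against $0.048 for the M5, roughly half the price per vCPU for the same amount of memory. Whether that translates into cheaper work depends on how well the workload scales across the extra, slower cores.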