Power Savings for Cloud Infrastructure
Find out why cloud builders are searching for power-optimizing tools and technologies and how these transform their cloud infrastructure!
Introduction to power savings for cloud infrastructure
During the last year, the price of electricity and other energy resources has grown rapidly across multiple countries in Europe. Rising energy prices directly impact the costs of cloud builders like Managed Services Providers, Hosting Services Providers, Cloud Services Providers, enterprises, and SaaS vendors. Since many of these companies operate their own on-premises cloud infrastructure, whether the company owns the servers or leases them from a colocation provider, they are taking a hit as their overhead bills rise and derail their total cost of ownership projections.
This trend is forcing many companies to review and optimize power use by turning off sections of their infrastructure and seeking other cost-optimization tactics.
This analysis provides an overview of cost optimization approaches covering both hardware and software.
Disclaimer: The information provided on this page reflects the ongoing examination of the power-savings topic that StorPool’s solution experts have carried out over the past few years.
Hyper-converged Infrastructure (HCI) approach for improving power saving
Hyper-converged infrastructure is a cloud infrastructure design approach that combines different infrastructure components (compute and storage resources) on identical building blocks of hardware. Typically based on standard x86 servers, HCI eliminates the need to purchase expensive, single-job-focused, specialized equipment for each cloud component. HCI designs reduce cost in the following ways:
- Lowest possible upfront capital expenses – less total hardware is needed to do the same work;
- Lower complexity of the whole cloud as there is no need to maintain different hardware components for each cloud “job”;
- Higher redundancy as the HCI servers are interchangeable and the cloud workloads can run on any of them;
- Higher workload density per rack unit means cloud builders need to rent and manage less floor space or rack space for the same number of workloads;
- Less hardware needs less cooling and consumes less electricity;
- Putting off certain costs until the need arises – with the right hardware and software, cloud builders can choose to scale up storage and memory capacity only when needed, saving their initial cash for other critical projects and ongoing costs.
However, HCI optimizations only work when it is possible to use the free CPU resources of the storage system hardware to perform computing. That is not always possible – e.g., when a different OS is required for storage and compute nodes or when the storage system uses too much of the CPU and memory resources of each node. Additionally, if scaling needs to be done with full HCI nodes every time, in cases where only compute or storage resources need to be added, expansion can be costly in terms of capital expenses and power efficiency.
With StorPool Storage, the initial choice of infrastructure model is no longer a barrier for future scaling. A cloud builder could start with a three-node HCI cluster and later add a new HCI type member or additional dedicated storage-only or compute-only nodes that extend only the storage or compute capacity. Also, a cloud builder can start with dedicated storage nodes and later scale their infrastructure with HCI, compute-only, or storage-only nodes.
StorPool is the most reliable and speedy storage platform on the market. Public and private cloud builders – Managed Services Providers, Hosting Services Providers, Cloud Services Providers, enterprises, and SaaS vendors – use StorPool Storage as the foundation for their clouds.
StorPool converts sets of standard servers into primary storage systems for large-scale cloud infrastructure. The software comes as a fully hands-off solution – we design, deploy, tune, monitor, and maintain your StorPool Storage system so that your users experience a speedy and reliable service.
Hardware Planning approach for improving power saving
In addition to their upfront cost, hardware components have operational expenses too. Each hardware component is a “cash consumer” because it continuously draws electricity. So, doing the same job with less hardware can significantly reduce costs and free up cash for other purposes.
If a single-socket server can do the job, there is no need for a dual-socket one. In the latter case, the presence of the second CPU only increases power consumption (by the CPU itself, the links between the two CPUs, etc.) and can quickly eat up the single-rack power limit. Yet, if the use case requires more CPU cores than a single socket can provide, a dual-socket server could be more power-efficient than two single-socket servers. That is because the dual-socket node still uses a single chipset on the motherboard and the same NIC(s), GPUs, coolers, etc.
Running two nodes when all virtual machines (VMs) could fit on one is pure spending without value, unless disaster recovery RPO and RTO objectives mandate it. Given the doubling of hardware components (power supplies, coolers, NICs, etc.), two half-loaded physical servers will consume more power than one loaded at 85-90%. If the workloads and software allow it, the second node can be powered off and stay ready to return to work if the first one fails, needs maintenance, or additional resources are required to handle peak loads. Of course, there is a tradeoff: the boot time of an offline node versus the power the same node uses to stay idle.
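The consolidation math can be sketched with a simple linear power model. The idle and peak wattage figures below are illustrative assumptions, not measurements of any particular server; the point is that the savings from consolidating roughly equal the idle draw of the node that is switched off.

```python
# Illustrative sketch: why one well-loaded server can beat two half-loaded ones.
# The idle/peak wattage figures are assumed values, not measurements.

def server_power(load, idle_w=120.0, peak_w=400.0):
    """Simple linear power model: fixed idle draw plus load-proportional draw."""
    return idle_w + (peak_w - idle_w) * load

# Two servers each at ~45% load vs one server at ~90% load (same total work)
two_half_loaded = 2 * server_power(0.45)
one_loaded = server_power(0.90)

print(f"Two servers @45%: {two_half_loaded:.0f} W")   # 492 W
print(f"One server  @90%: {one_loaded:.0f} W")        # 372 W
print(f"Savings: {two_half_loaded - one_loaded:.0f} W")  # 120 W = one node's idle draw
```

Under this model, the difference is exactly the idle wattage of the second node, which is why powering off underutilized servers pays off even before accounting for duplicated cooling and power-supply losses.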
For larger deployments, chassis with a modular system design provide high density, performance, efficiency, and cost-effectiveness. Such systems share some hardware (e.g., power supplies) and save rack space (e.g., eight servers in a 4U chassis).
StorPool Storage doesn’t sell hardware because our customers typically know best which hardware vendor they prefer and how they like to obtain their cloud components (buying or renting them, prepaid or on a lease, etc.). For us, it is enough that the chosen hardware complies with the StorPool System Requirements. Still, an integral part of StorPool’s offering is helping customers design their cloud with the right hardware for their specific use cases. In the early stage of each project, we always ask what components the customer plans to use for their particular use case (e.g., databases, IaaS services, web hosting, etc.). We do this not only to confirm hardware compatibility or check for known issues but also to help with component choice by applying our experience with hundreds of deployments across different use cases.
We are always ready to help our customers achieve their goals with less, right-sized hardware – and thus lower initial and operational spending.
Hardware Utilization approach for improving power saving
Until recently, the common answer to insufficient performance was “just throw more RAM, CPU cores, and storage devices at it.” Nowadays, this is no longer so easy. With the current shortages in the hardware components market and significantly increased delivery times, obtaining new compute, network, and storage resources can be time-consuming. This is time during which a cloud builder’s business suffers or, at least, cannot grow because of the lack of resources. Yet, even after the long-awaited new components are unboxed and installed, the problems do not end. Increased power prices and inefficient usage of hardware resources immediately raise the power consumption and cost per workload. And this is reflected in the electricity bill, right from the first month.
That market situation forces companies to rethink and optimize their hardware utilization so they can achieve their goals with what they already have. “Hardware packing” is a set of approaches that aim to run the maximum number of valuable workloads on the minimum amount of hardware. There are three general points of consideration:
- Get rid of unused hardware components
Why spend power and money on a RAID controller when the server has only NVMe devices on PCIe slots or just a single boot device? Why buy and “feed” a dual-port 100 GbE NIC when the motherboard’s integrated 2×10 GbE NIC is enough for the job? If the 100 GbE NIC is unavoidable (e.g., already bought or integrated into the motherboard), connecting it to a 10 or 25 Gbps network will use less power. The difference is not big for a single server, but it becomes significant for a fleet of hundreds of nodes;
- Concentrate workloads on fewer servers
Each server that does not have a bottleneck can take on additional tasks, so the workloads in the cloud can be concentrated on a subset of nodes. Load-free servers will use less power while idle or can even be powered off until they are needed to take on peak loads. If the resources are sufficient, it is more efficient to pack 20 VMs on a single hypervisor than to disperse them over two, three, or more nodes;
- Make a schedule of the workloads and plan the needed resources for them
During weekdays, developers and QAs may need hundreds, even thousands of VMs or containers for their regular jobs, but there is no need for such a fleet to stay online and consume power during the weekends. Nightly builds and tests can run on the same groups of nodes where the daily reports are generated during working hours – there is no value in keeping spare resources online for both 24/7. The dynamic scheduler of the Cloud Management Platform in use can be helpful here. Depending on the level of automation, these schedulers can stop unnecessary workloads and free the resources they used based on various criteria. Ideally, they can automatically migrate and pack workloads onto only a few servers, freeing the rest and putting the off-loaded ones to sleep.
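The workload-concentration idea above is, at its core, a bin-packing problem. A minimal sketch, using the classic first-fit-decreasing heuristic with assumed vCPU counts (the capacities and VM sizes are illustrative, not tied to any real scheduler):

```python
# Minimal sketch of "hardware packing": first-fit-decreasing bin packing that
# consolidates VMs onto as few hypervisors as possible, so the remaining
# hosts can be idled or powered off. Sizes/capacities (in vCPUs) are assumed.

def pack_vms(vm_sizes, host_capacity):
    """Place VM sizes onto hosts via first-fit-decreasing; return per-host lists."""
    hosts = []  # each entry: [remaining_capacity, [vm_size, ...]]
    for vm in sorted(vm_sizes, reverse=True):   # largest VMs first
        for host in hosts:
            if host[0] >= vm:                   # first host with enough room
                host[0] -= vm
                host[1].append(vm)
                break
        else:                                   # no existing host fits: open a new one
            hosts.append([host_capacity - vm, [vm]])
    return [h[1] for h in hosts]

# 20 VMs with mixed vCPU counts, hypervisors with 32 vCPUs each
vms = [8, 8, 6, 6, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1]
placement = pack_vms(vms, host_capacity=32)
print(f"{len(vms)} VMs packed onto {len(placement)} hosts")  # 20 VMs onto 2 hosts
```

Real cloud schedulers also weigh RAM, network, and anti-affinity constraints, but the same principle applies: fill a subset of hosts well and let the rest go dark.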
The Role of The Software
An often overlooked factor is the role of software in achieving higher efficiency in hardware utilization. Like in hardware packing, the density of active workloads has a noticeable impact on power consumption. Each application needs resources but there is no reason to occupy a whole server for a few tasks. With well-designed software, it is easy to know what resources are necessary for a particular case, and so reserve and use them as intended. While the reservation is the first step, the utilization of resources is what matters for performance and power consumption. CPU cycles require electricity. Context-switching and rescheduling are operations that consume CPU cycles – reducing them results in running more workloads per watt.
All StorPool Storage deployments are limited to a set of CPU cores on the servers where the software runs. Each StorPool service is pinned to a particular CPU thread, eliminating all rescheduling and context-switching overhead. In each StorPool deployment, we use Linux Control Groups (cgroups) to reserve the exact amount of CPU threads and RAM required for each StorPool service. This approach has several benefits:
- Predictable and constant performance levels – StorPool services do not compete with other running processes for resources;
- No strict need for dedicated servers for every single cloud “job” – hyperconverging and mixing storage and compute workloads is possible and safe;
- Power efficiency – as all unused resources can be safely put in deep idle state or even disabled, the electricity bill for the cloud is optimized;
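The general technique of restricting a process to a fixed set of cores can be illustrated with Python’s scheduler-affinity calls on Linux. This is a generic, Linux-only sketch of the pinning concept, not StorPool’s actual mechanism (StorPool pins individual service threads and uses cgroups for CPU and RAM reservation); the choice of core 0 is an illustrative assumption.

```python
# Minimal Linux-only sketch of CPU pinning: restrict a process to a fixed
# set of cores so the scheduler never migrates it (and the remaining cores
# can idle). Illustrates the concept only; core numbers are assumed.
import os

def pin_to_cores(cores, pid=0):
    """Pin process `pid` (0 = current process) to the given CPU cores."""
    available = os.sched_getaffinity(pid)
    target = set(cores) & available      # never request cores the host lacks
    if target:
        os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)     # the affinity mask now in effect

# Pin the current process to core 0 (present on effectively any machine)
print("Now pinned to cores:", pin_to_cores({0}))
```

Pinning alone removes migration and most rescheduling overhead for the pinned process; combining it with cgroup CPU and memory limits, as described above, additionally keeps other processes off the reserved cores.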
Last but not least is the design of the software. The StorPool services use a set of methods to limit CPU usage for maintenance or peripheral tasks, like:
- Reduction of kernel calls with kernel-bypass functions. With Spectre and other vulnerabilities, the security cost of calling the kernel has grown significantly in the last few years. StorPool thus uses user-space drivers and limits its kernel calls to avoid time spent in vulnerability-mitigation-related code;
- Elimination of the scheduling overhead and context-switching, removing them as a source of wasted CPU cycles;
- Optimization for the lowest possible latency. The code is optimized to serve I/O requests as fast as physically possible and with the smallest possible amount of CPU time, thus accelerating the response to client requests and not wasting client CPU cycles on waiting. This is important because most client-side I/O interfaces do not have an asynchronous mode of operation. Therefore, with other storage products, compute clients spend a significant amount of time waiting for a response from the storage. StorPool resolves this challenge by minimizing I/O latency across the board;
- Powering down CPU cores not in use. When no requests are being processed, StorPool uses an internal “sleeping” method that powers down individual CPU cores and is more effective than what is available via the kernel or other methods. See more detailed information in StorPool’s FAQ.
Creating a power-efficient cloud is a process that challenges its decision-makers in several different ways. With StorPool, instead of just a vendor, our customers get a storage partner. Instead of providing only a software license with support in case of an incident, StorPool delivers much more:
- Assistance with the initial cloud design without forcing cloud builders to stick with the initial hardware components when expansion is needed in the future;
- Help with the selection of hardware, closely tailored to each specific use case to provide enough resources with maximum power efficiency;
- Remote installation and configuration, with each deployment tuned to deliver the best possible performance (IOPS) per watt;
- Monitoring of hundreds of metrics per second to open support tickets and maintain each storage system proactively, supported by enterprise-grade SLAs;
- All software upgrades are included in the license fee, and our team installs them non-disruptively to ensure that your storage system always runs optimally.