The silent revolution: the point where flash storage takes over for HDD
Flash storage technology and comparative cost
Flash storage has developed at a breakneck pace in recent years. Not only have devices become more resilient and faster, but there is also a new interface to consider. Traditional SSDs are SATA based or, in relatively rare cases, SAS based. This severely limits the performance envelope of the devices: SATA SSDs top out at about 550MB/s maximum throughput and offer around 50k small-file input/output operations per second (IOPS), regardless of the speed of the actual flash chips inside the device. This limitation is due to the data transfer speed of the bus and the need to translate each storage access request to a disk-based protocol (SATA/SAS) and, inside the SSD, back to a memory protocol. The same translation happens in reverse when data is read.
Enter Non-Volatile Memory Express (NVMe). This ‘interface’ is essentially a direct connection of the flash storage to PCIe lanes. A configuration of four lanes per NVMe device is common, though switching technology exists to multiplex NVMe devices so that more of them can be attached than there are PCIe lanes available.
NVMe devices typically top out above 2GB/s and can offer several hundred thousand IOPS - theoretically. They also consume a lot more CPU when operating in a software-defined storage environment, which limits performance somewhat. In practical applications, however, they are still much faster than traditional SSDs - at what is usually a very moderate cost delta.
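As a quick back-of-the-envelope check on those interface limits, here is a minimal sketch (Python) comparing the theoretical ceilings of SATA III and a four-lane PCIe 3.0 link. The line rates and encoding overheads are the standard published figures, not measurements of any particular drive, and real devices land somewhat below these ceilings.

```python
# Back-of-the-envelope interface bandwidth comparison (assumes SATA III
# and a PCIe 3.0 x4 link; real devices land somewhat below these ceilings).

SATA3_LINE_RATE_GBIT = 6.0      # SATA III line rate
SATA3_ENCODING = 8 / 10         # 8b/10b encoding overhead

PCIE3_LINE_RATE_GT = 8.0        # PCIe 3.0: 8 GT/s per lane
PCIE3_ENCODING = 128 / 130      # 128b/130b encoding overhead
NVME_LANES = 4                  # typical x4 NVMe device

sata_mb_s = SATA3_LINE_RATE_GBIT * 1000 / 8 * SATA3_ENCODING
pcie_mb_s = PCIE3_LINE_RATE_GT * 1000 / 8 * PCIE3_ENCODING * NVME_LANES

print(f"SATA III ceiling   : ~{sata_mb_s:.0f} MB/s")   # ~600 MB/s
print(f"PCIe 3.0 x4 ceiling: ~{pcie_mb_s:.0f} MB/s")   # ~3940 MB/s
```

This is why a SATA SSD hits a wall around 550MB/s no matter how fast its flash is, while an NVMe device has room to run.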
If the performance of SATA SSDs is insufficient for a specific use case, moving to SAS SSDs is usually not worth the expense. NVMe devices offer much better performance and are usually no more expensive than their SAS counterparts, so moving directly to NVMe is preferable.
One more note: if NVMe devices are given the same number of CPU cores as SATA/SAS SSDs, they are still somewhat faster and financially very comparable. The calculations below deliberately include more CPU cores for NVMe in the performance-oriented configurations.
Let's look at how the numbers work out for different situations.
Small Environments
Let’s have a look at a 100TB environment with increasing performance requirements. Consider the following table that looks at HDDs, SSDs, and NVMe. Street prices are in US$ x1000, and IOPS are rough estimates:
100TB | HDD 6TB 12/4U cost [x1000 US$] | HDD 2TB 20/2U cost [x1000 US$] | SSD Layout | SSD cost [x1000 US$] | NVMe Layout | NVMe cost [x1000 US$] |
10k IOPS | 132 | 135 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
30k IOPS | 345 | 271 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
50k IOPS | 559 | 441 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
100k IOPS | 1,117 | 883 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
200k IOPS | 2,206 | 1,767 | 7x 14x 3.84TB | 113 | 5x 10x 7.68TB | 113 |
500k IOPS | 5,530 | 4,419 | 14x 14x 1.92TB | 168 | 7x 14x 3.84TB | 133 |
1000k IOPS | 11,034 | 8,804 | 42x 14x 1.92TB | 470 | 13x 14x 2TB | 168 |
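To see how layouts like these translate into capacity, here is a minimal sketch (Python) that parses a "nodes x drives x drive size" layout string into raw capacity and a rough usable capacity. The 3-way replication factor is my assumption for illustration; the redundancy scheme behind the tables may differ.

```python
# Hypothetical helper to sanity-check layout strings like "5x 10x 7.68TB":
# raw capacity = nodes * drives per node * drive size, and usable capacity
# under an ASSUMED 3-way replication factor (not specified by the tables).

REPLICATION_FACTOR = 3  # assumption for illustration only

def capacities(layout: str) -> tuple[float, float]:
    nodes, drives, size = layout.lower().replace("tb", "").split("x")
    raw = int(nodes) * int(drives) * float(size)
    return raw, raw / REPLICATION_FACTOR

for layout in ("5x 10x 7.68TB", "5x 4x 15.36TB", "7x 14x 3.84TB"):
    raw, usable = capacities(layout)
    print(f"{layout:>15}: {raw:7.1f} TB raw, ~{usable:6.1f} TB usable")
```

Under that assumption, the 100TB layouts above come out at roughly 100-130TB usable, i.e. the target capacity plus some headroom.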
Middle of the Road
Of course, we all know that larger amounts of SSD storage are more expensive, so let's quadruple the storage requirement and see where we end up. HDDs should become more viable, wouldn’t you think?
400TB | HDD 6TB 12/4U cost [x1000 US$] | HDD 2TB 20/2U cost [x1000 US$] | SSD Layout | SSD cost [x1000 US$] | NVMe Layout | NVMe cost [x1000 US$] |
10k IOPS | 250 | 510 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
30k IOPS | 405 | 510 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
50k IOPS | 655 | 510 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
100k IOPS | 1,311 | 883 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
200k IOPS | 2,593 | 1,767 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
500k IOPS | 6,495 | 4,419 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
1000k IOPS | 12,961 | 8,804 | 27x 14x 3.84TB | 437 | 14x 14x 7.68TB | 413 |
A note about chassis: to get good performance out of NVMe devices, a lot more CPU cores are needed than in HDD-based solutions. Four OSDs per NVMe device and two cores per OSD is a good rule of thumb. This means that stuffing 24 NVMe devices into a 2U chassis and calling it a day is not going to provide exceptional performance. We recommend 1U chassis with 5-8 NVMe devices to reduce bottlenecking on the OSD code itself. (I'm also assuming that the network connectivity is up to transporting the enormous amount of data traffic.)
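As a quick illustration of why dense NVMe chassis run out of CPU, here is a minimal sizing sketch based on that rule of thumb; the device counts are just examples.

```python
# Rough CPU sizing sketch based on the rule of thumb above
# (4 OSDs per NVMe device, 2 cores per OSD). The chassis sizes are
# illustrative examples, not vendor recommendations.

OSDS_PER_NVME = 4
CORES_PER_OSD = 2

def cores_needed(nvme_count: int) -> int:
    """Cores required just to keep the OSD daemons fed."""
    return nvme_count * OSDS_PER_NVME * CORES_PER_OSD

for nvme_count in (5, 8, 24):
    print(f"{nvme_count:2d} NVMe devices -> ~{cores_needed(nvme_count)} cores")
# 5 -> 40, 8 -> 64, 24 -> 192 cores: a densely packed 2U NVMe box
# ends up CPU-bound long before the devices themselves are the limit.
```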
Petabyte Scale
If we enter petabyte scale, hard disks become slightly more viable, but at this scale (we are talking 64 4U nodes) the sheer physical size of the hard disk based cluster can become a problem:
1PB | HDD 6TB 12/4U cost [x1000 US$] | HDD 2TB 20/2U cost [x1000 US$] | SSD Layout | SSD cost [x1000 US$] | NVMe Layout | NVMe cost [x1000 US$] |
10k IOPS | 453 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
30k IOPS | 453 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
50k IOPS | 488 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
100k IOPS | 1,850 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
200k IOPS | 2,101 | 1,767 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
500k IOPS | 4,619 | 4,365 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
1000k IOPS | 10,465 | 8,720 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
So what do we learn from all this?
The days of HDD are numbered.
Even today, SSD is superior for most use cases, and SSD and NVMe are still nosediving in terms of cost per unit of capacity. SSD/NVMe based nodes also make for much more compact installations and are a lot less vulnerable to vibration, dust and heat.
The health question
Of course, cost isn't the only issue. SSDs do wear. The current crop is far more resilient long term than SSDs from a couple of years ago, but they will still eventually wear out. On the other hand, SSDs are not prone to the sudden catastrophic failures that a mechanical event or marginal manufacturing tolerances can trigger in an HDD. The good news, then, is that SSDs almost never fail suddenly. Instead they develop bad blocks, which for a time are replaced with fresh blocks from an invisible capacity reserve on the device. Wear leveling does all of this automatically, so you will not see any capacity degradation until the reserve runs out of blocks.
You can check the health of your SSDs using SMART (smartmontools on Linux), which will show how many blocks have been reallocated and the relative health of the drive as a percentage of the overall reserve capacity.
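For example, a minimal sketch (Python) that shells out to smartctl and filters for wear-related attributes might look like the following. The attribute names are common vendor examples and vary from drive to drive, and /dev/sda is just a placeholder device.

```python
# Minimal sketch: pull wear-related SMART attributes via smartctl
# (smartmontools, usually requires root). Attribute names differ by
# vendor -- the ones listed here are common examples, not a full list.
import subprocess

WEAR_ATTRS = ("Reallocated_Sector_Ct", "Wear_Leveling_Count",
              "Media_Wearout_Indicator", "Percent_Lifetime_Remain")

def wear_report(device: str) -> None:
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    for line in out.splitlines():
        if any(attr in line for attr in WEAR_ATTRS):
            print(line.strip())

wear_report("/dev/sda")   # placeholder SATA SSD; NVMe devices show up as /dev/nvme0 etc.
```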
Bonus round: SSD vs 10krpm
In the world of low latency and high IOPS, the HDD manufacturers’ answer is to bump up the rotation speed of the drives. Unfortunately, while this does make them faster, it also makes them more mechanically complex, more thermally stressed and, in a word, expensive. SSDs are naturally faster and mechanically simple. They also -- traditionally at least -- were more expensive than 10krpm disks, which is why storage providers have kept selling NAS and SAN systems with 10 or 15krpm disks. (I know this from experience, as I used to run high-performance environments for a web content provider.)
Now have a look at this:
Device type | Cost [US$] | Cost/GB [US$] |
HDD 1.8TB SAS 10krpm Seagate | 370 | 0.21 |
SSD 1.92TB Micron SATA | 335 | 0.17 |
NVMe 2.0TB Intel | 399 | 0.20 |
HDD 0.9TB Seagate | 349 | 0.39 |
So why is there so much resistance to moving beyond them? These are the two arguments against SSDs that I hear most often:
Lifespan: With today’s wear leveling, this issue has largely evaporated. Yes, it is possible to wear out an SSD, but have a look at the math: a read-optimized SSD is rated for about one Device Write Per Day (DWPD), that is, writing the whole capacity of the device once per day, over 5 years. Let’s compare this with a 1.8TB 10krpm HDD. With a workload that averages out at 70MB/s (a mix of small and large transfers) and a 70/30 read/write ratio, the write share is about 21MB/s, which means this 10krpm HDD can write roughly 1.81TB/day.
In other words, with a 1.92TB SSD rated at 1 DWPD (a write budget of 1.92TB/day), you won’t wear out the SSD under the same conditions within 5 years. If you want to step up to 3x DWPD (mixed use), the drive (at about US$350) still costs less than the HDD, and you will have enough endurance even for very write-heavy workloads.
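Here is the same endurance math as a small worked example; the 1.92TB capacity and 1 DWPD rating correspond to the read-optimized SATA SSD from the price table above.

```python
# Worked endurance math from the paragraph above: how much the HDD
# workload actually writes per day, versus the write budget that a
# 1 DWPD rating allows on a 1.92TB SSD.

AVG_THROUGHPUT_MB_S = 70        # mixed workload throughput
WRITE_SHARE = 0.30              # 70/30 read/write ratio
SECONDS_PER_DAY = 86_400

hdd_writes_tb_day = AVG_THROUGHPUT_MB_S * WRITE_SHARE * SECONDS_PER_DAY / 1e6
ssd_budget_tb_day = 1.92 * 1.0  # 1.92TB device rated at 1 DWPD

print(f"HDD workload writes: {hdd_writes_tb_day:.2f} TB/day")  # ~1.81
print(f"SSD 1 DWPD budget  : {ssd_budget_tb_day:.2f} TB/day")  # 1.92
```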
TCO: It is true that an SSD uses more power as throughput increases. Most top out at about 3x the power consumption of a comparable HDD when driven hard, while providing ~10x the throughput and >100x the small block IOPS of the HDD. If the SSD is ambling along at the sedate pace of a 10krpm HDD, it will actually consume less power than the HDD. And if you stress the performance envelope of the SSD, you would need a large number of HDDs to match a single SSD's performance, which would not even be in the same ballpark in either initial cost or TCO.
In other words, imagine having to put up a whole node with 20 HDDs to match the performance of this single $350 mixed use SSD that consumes 20W at full tilt operation. You would have to buy a $4000 server with 20 $370 HDDs -- which would, by the way, consume an average of maybe 300W.
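A rough five-year cost sketch of that comparison, assuming an illustrative electricity price of US$0.15/kWh (my assumption) and ignoring the chassis the SSD itself sits in, looks like this:

```python
# Rough 5-year cost sketch of the comparison above. The electricity
# price is an illustrative assumption; hardware prices and power draws
# are the ones quoted in the text. Directional only.

KWH_PRICE_USD = 0.15            # assumed electricity price
HOURS_5Y = 5 * 365 * 24

def five_year_cost(hw_usd: float, watts: float) -> float:
    return hw_usd + watts / 1000 * HOURS_5Y * KWH_PRICE_USD

ssd_option = five_year_cost(350, 20)               # one mixed-use SSD
hdd_option = five_year_cost(4000 + 20 * 370, 300)  # server + 20 HDDs

print(f"Single SSD, 5 years : ${ssd_option:,.0f}")   # ~$480
print(f"20-HDD node, 5 years: ${hdd_option:,.0f}")   # ~$13,400
```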
So as you can see, an SSD is the better deal, even from a purely financial perspective, whether you drive it hard or not.
Of course there are always edge cases. Ask us about your specific use case, and we can do a tailored head-to-head comparison for your specific application.
So what’s next?
We are already nearing the point where NVMe will supersede the SATA and SAS interfaces in SSDs. So the SSDs that came out on top when we started this discussion are already on their way out again. NVMe has the advantage of being an interface created specifically for flash memory. It does not pretend to be an HDD, as SAS and SATA do, so it does not need the protocol translation that turns the HDD access protocol into a flash memory access protocol internally and back again on the way out. You can see the difference by looking at the performance envelopes of the devices.
New flash memory technologies keep pushing the performance envelope, and the interface increasingly hampers performance, so the shift from SAS/SATA to NVMe is imminent. NVMe even comes in multiple form factors: one closely resembling 2.5” HDDs for hot-swap purposes, and one (M.2, which resembles a memory module) for internal storage that does not need hot-swap capability. Intel’s ruler design and Supermicro’s M.2 carriers will further increase storage density with NVMe devices.
On the horizon, new technologies such as Intel Optane increase performance and resilience to wear even further, though currently at a much higher cost than traditional flash.
Maybe a few years from now everything will be nonvolatile memory and we can simply cut power to the devices. Either way, we will see further increases in density, performance and reliability, and further decreases in cost.
Welcome to the future of storage!