
The silent revolution: the point where flash storage takes over for HDD

Christian Huebner - September 11, 2019
For years the old paradigm has held true: if you wanted fast, you bought flash, but it was going to cost you; if you wanted cheap, large HDD-based servers were the go-to. The old standby for reasonably fast block storage, the 24x 2TB chassis, was ubiquitous. For years, it looked like flash would be relegated to performance tiers. But is this actually true? I've been suspicious for some time, but a few days ago I did a comparison for an internal project, and what I saw surprised even me.

Flash storage technology and comparative cost

Flash storage has developed at a breakneck pace in recent years. Not only have devices become more resilient and faster, but there is also a new interface to consider. Traditional SSDs are SATA based or, in relatively rare cases, SAS based. This severely limits the performance envelope of the devices: SATA SSDs top out at about 550MB/s maximum throughput and offer around 50k small-block input/output operations per second (IOPS), regardless of the speed of the actual chips inside the device.
This limitation is due to the data transfer speed of the bus and the need to translate each storage access request into a disk-based protocol (SATA/SAS) and then, inside the SSD, back into a memory protocol. The same translation happens in reverse when data is read.
Enter Non-Volatile Memory Express (NVMe). This ‘interface’ is essentially a direct connection of the flash storage to PCIe lanes. A configuration of four lanes per NVMe device is common, though multiplexing technology exists so that more devices can be attached than there are PCIe lanes available.
NVMe devices typically top out above 2GB/s, and can offer several hundred thousand IOPS - theoretically. They also consume a lot more CPU when operating in a software defined storage environment, which limits performance somewhat. However, in practical application they are still much faster than traditional SSDs - at what is usually a very moderate cost delta. 
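To put rough numbers on the interface ceilings involved, here is a small back-of-the-envelope sketch in Python. The encoding overheads and per-lane figures are the commonly quoted values for SATA III and PCIe 3.0, used here as approximations rather than vendor specifications:

```python
# Back-of-the-envelope interface bandwidth ceilings (approximate figures).

def sata3_ceiling_mb_s():
    """SATA III: 6 Gbit/s line rate with 8b/10b encoding (~20% overhead)."""
    line_rate_gbit = 6.0
    payload_gbit = line_rate_gbit * 8 / 10   # 8b/10b encoding
    return payload_gbit * 1000 / 8           # ~600 MB/s; ~550 MB/s in practice

def pcie3_x4_ceiling_mb_s(lanes=4):
    """PCIe 3.0: roughly 985 MB/s usable per lane (8 GT/s, 128b/130b encoding)."""
    per_lane_mb = 985.0
    return per_lane_mb * lanes               # ~3940 MB/s theoretical for x4

print(f"SATA III ceiling   : ~{sata3_ceiling_mb_s():.0f} MB/s")
print(f"PCIe 3.0 x4 ceiling: ~{pcie3_x4_ceiling_mb_s():.0f} MB/s")
```

Even before any protocol translation overhead, an x4 NVMe device has roughly seven times the raw interface bandwidth of a SATA SSD.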
If the performance of SATA SSDs is insufficient for a specific use case, moving to SAS SSDs is usually not worth the expense. NVMe devices offer much better performance and are usually no more expensive than their SAS counterparts, so moving directly to NVMe is preferable.
One more note: if NVMe devices operate with the same number of CPU cores as SATA SSDs, they are still somewhat faster and financially comparable. The calculations below include additional CPU cores for NVMe in the performance-oriented configurations.
Let's look at how the numbers work out for different situations.

Small Environments

Let’s have a look at a 100TB environment with increasing performance requirements. Consider the following table that looks at HDDs, SSDs, and NVMe. Street prices are in US$x1000, and IOPS are rough estimates:
| 100TB | HDD 6TB 12/4U cost [x1000 US$] | HDD 2TB 20/2U cost [x1000 US$] | SSD layout | SSD cost [x1000 US$] | NVMe layout | NVMe cost [x1000 US$] |
|---|---|---|---|---|---|---|
| 10k IOPS | 132 | 135 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
| 30k IOPS | 345 | 271 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
| 50k IOPS | 559 | 441 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
| 100k IOPS | 1,117 | 883 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
| 200k IOPS | 2,206 | 1,767 | 7x 14x 3.84TB | 113 | 5x 10x 7.68TB | 113 |
| 500k IOPS | 5,530 | 4,419 | 14x 14x 1.92TB | 168 | 7x 14x 3.84TB | 133 |
| 1000k IOPS | 11,034 | 8,804 | 42x 14x 1.92TB | 470 | 13x 14x 2TB | 168 |
In this relatively small cluster, as expected, HDDs are no longer viable. The more IOPS required, the more extra capacity must be purchased to provide enough spindles. This culminates in a completely absurd $11 million for a 1000K IOPS cluster built on 6TB hard disks.
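To make the "more IOPS means more spindles" effect concrete, here is a minimal sizing sketch. The per-drive figures (roughly 150 IOPS for a 7.2k-rpm HDD, 6TB capacity, 3x replication) are illustrative assumptions of mine, not the exact inputs behind the table above:

```python
# Minimal HDD sizing sketch: drive count is the larger of what capacity needs
# and what the IOPS target needs (illustrative inputs, not the table's exact ones).

def hdd_drives_needed(usable_tb, iops_target,
                      drive_tb=6, drive_iops=150, replication=3):
    capacity_driven = usable_tb * replication / drive_tb
    iops_driven = iops_target / drive_iops
    return max(capacity_driven, iops_driven)

for iops in (10_000, 100_000, 1_000_000):
    n = hdd_drives_needed(usable_tb=100, iops_target=iops)
    print(f"{iops:>9} IOPS -> ~{n:.0f} x 6TB HDDs")
```

Once the IOPS target dominates, the drive count (and with it chassis count, rack space, and power) grows with IOPS rather than with capacity, which is exactly where the HDD cost curve explodes.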

Middle of the Road

Of course, we all know that larger amounts of SSD storage are more expensive, so let's quadruple the storage requirement and see where we end up. HDDs should become more viable, wouldn’t you think?
| 400TB | HDD 6TB 12/4U cost [x1000 US$] | HDD 2TB 20/2U cost [x1000 US$] | SSD layout | SSD cost [x1000 US$] | NVMe layout | NVMe cost [x1000 US$] |
|---|---|---|---|---|---|---|
| 10k IOPS | 250 | 510 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 30k IOPS | 405 | 510 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 50k IOPS | 655 | 510 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 100k IOPS | 1,311 | 883 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 200k IOPS | 2,593 | 1,767 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 500k IOPS | 6,495 | 4,419 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 1000k IOPS | 12,961 | 8,804 | 27x 14x 3.84TB | 437 | 14x 14x 7.68TB | 413 |
Surprise! Again we find that HDD is only viable for the slower speed requirements of archival storage. Note that even the 15.36TB NVMe solution is not much more expensive than the SSD solution!
A note about chassis: to get good performance out of NVMe devices, many more CPU cores are needed than in HDD-based solutions. Four OSDs per NVMe device and two cores per OSD is a good rule of thumb. This means that stuffing 24 NVMe devices into a 2U chassis and calling it a day is not going to provide exceptional performance. We recommend 1U chassis with 5-8 NVMe devices to reduce bottlenecking on the OSD code itself. (I'm also assuming that the network connectivity is up to transporting the enormous amount of data traffic.)
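To see what that rule of thumb means in practice, the sketch below simply multiplies it out per node; the handful of cores reserved for the OS and networking is my own illustrative assumption:

```python
# Rule-of-thumb core budget for a Ceph NVMe node:
# 4 OSDs per NVMe device, 2 cores per OSD, plus a few cores for OS/network.

def cores_for_nvme_node(nvme_devices, osds_per_nvme=4, cores_per_osd=2,
                        base_cores=4):  # base_cores is an illustrative assumption
    return nvme_devices * osds_per_nvme * cores_per_osd + base_cores

for n in (6, 12, 24):
    print(f"{n:>2} NVMe devices -> ~{cores_for_nvme_node(n)} cores recommended")
```

A 24-device node would want close to 200 cores to keep the OSDs fed, which is why a smaller 1U design tends to give better per-device performance.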

Petabyte Scale

If we enter petabyte scale, hard disks become slightly more viable, but at this scale (we are talking 64 4U nodes) the sheer physical size of the hard disk based cluster can become a problem:
| 1PB | HDD 6TB 12/4U cost [x1000 US$] | HDD 2TB 20/2U cost [x1000 US$] | SSD layout | SSD cost [x1000 US$] | NVMe layout | NVMe cost [x1000 US$] |
|---|---|---|---|---|---|---|
| 10k IOPS | 453 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 30k IOPS | 453 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 50k IOPS | 488 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 100k IOPS | 1,850 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 200k IOPS | 2,101 | 1,767 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 500k IOPS | 4,619 | 4,365 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 1000k IOPS | 10,465 | 8,720 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
Note: The performance data for SSD and NVMe OSDs is estimated conservatively. Depending on the use case, performance will vary.
So what do we learn from all this? 
The days of HDD are numbered. 
For most use cases, SSD is already superior today. SSD and NVMe prices are also still dropping rapidly per unit of capacity. SSD/NVMe-based nodes make for much more compact installations and are far less vulnerable to vibration, dust, and heat.
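To put a rough number on the "more compact" claim at the 1PB scale: the sketch below uses the node counts from the table above and the 4U HDD chassis mentioned earlier, and assumes a 2U chassis for the flash nodes (my assumption, not a figure from the tables):

```python
# Approximate rack footprint at ~1PB, based on node counts from the tables.
# Chassis heights for the flash nodes (2U) are assumptions for illustration.

configs = {
    "HDD 6TB, 12 drives/4U": (64, 4),   # ~64 nodes per the text, 4U each
    "SSD 7.68TB, 14/node":   (34, 2),   # 34 nodes per the table, assumed 2U
    "NVMe 15.36TB, 14/node": (17, 2),   # 17 nodes per the table, assumed 2U
}

for name, (nodes, chassis_u) in configs.items():
    total_u = nodes * chassis_u
    racks = total_u / 42                 # standard 42U rack
    print(f"{name:<24} {total_u:>4}U  (~{racks:.1f} racks)")
```

Under those assumptions the HDD cluster fills around six racks, while the NVMe cluster fits comfortably into a single one, before even counting switches, cabling, and cooling.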

The health question

Of course, cost isn't the only issue. SSDs do wear. The current crop is far more resilient over the long term than SSDs from a couple of years ago, but they will still eventually wear out. On the other hand, unlike HDDs, SSDs are not prone to sudden catastrophic failure triggered by a mechanical event or marginal manufacturing tolerances.
The good news is that, in almost all cases, SSDs do not fail suddenly. They develop bad blocks, which for a time are replaced with fresh blocks from an invisible capacity reserve on the device. Wear leveling handles all of this automatically, and you will not see any capacity degradation until the reserve runs out of blocks.
You can check the health of the SSDs by using SMART (smartmontools on Linux), which will show how many blocks have been relocated, and the relative health of the drive as a percentage of the overall reserve capacity.  
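If you want to automate that check, a small wrapper around smartctl (part of smartmontools) is enough. Note that wear-related attribute names vary by vendor and interface, so the strings matched below are examples rather than a universal list:

```python
# Quick SSD wear check using smartctl (smartmontools).
# Attribute names differ by vendor/interface; the ones matched here are examples.
import subprocess

WEAR_HINTS = ("Media_Wearout_Indicator", "Wear_Leveling_Count",
              "Percent_Lifetime_Remain", "Reallocated_Sector_Ct",
              "Percentage Used")  # NVMe devices report "Percentage Used"

def ssd_wear_report(device):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    return [line.strip() for line in out.splitlines()
            if any(hint in line for hint in WEAR_HINTS)]

for line in ssd_wear_report("/dev/sda"):   # or e.g. /dev/nvme0 for NVMe
    print(line)
```

Feeding this into your monitoring system lets you replace worn drives on your own schedule instead of reacting to failures.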

Bonus round: SSD vs 10krpm

In the world of low latency and high IOPS, the answer from HDD manufacturers has been to bump up the rotation speed of their drives. Unfortunately, while this does make them faster, it also makes them more mechanically complex, more thermally stressed, and, in a word, expensive.
SSDs are naturally faster and mechanically simple. They were also -- traditionally at least -- more expensive than 10krpm disks, which is why storage providers have kept selling NAS and SAN systems with 10 or 15krpm disks. (I know this from experience, as I used to run high performance environments for a web content provider.)
Now have a look at this:
| Device type | Cost [US$] | Cost/GB [US$] |
|---|---|---|
| HDD 1.8TB SAS 10krpm Seagate | 370 | 0.21 |
| SSD 1.92TB Micron SATA | 335 | 0.17 |
| NVMe 2.0TB Intel | 399 | 0.20 |
| HDD 0.9TB Seagate | 349 | 0.39 |
In other words, 10krpm drives are obsolete not only from the cost/performance ratio, but even from the cost/capacity ratio! The 15krpm drives are even worse. The hard disks in this sector have no redeeming qualities; they are more expensive, drastically slower, more mechanically complex, and cost enormous amounts of money to run.
So why is there so much resistance to moving beyond them? I have heard two main arguments against SSDs:
Lifespan: With today’s wear leveling, this issue has largely evaporated. Yes, it is possible to wear out an SSD, but have a look at the math: a read-optimized SSD is rated for about one Device Write Per Day (DWPD), that is, one write of the whole capacity of the device per day, over 5 years. Let’s compare this with a 1.8TB 10krpm HDD. With a workload that averages 70MB/s (a mix of small and large transfers) at a 70/30 read/write ratio, this 10krpm HDD writes about 1.81TB per day.
In other words, you won’t wear out the SSD under the same conditions within 5 years. If you want to step up to a 3x DWPD (mixed use) drive, at about US$350 it still costs less than the HDD, and you will have enough endurance even for very write-heavy workloads.
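Here is that lifespan arithmetic spelled out, using the 1.92TB SSD from the price table above and the workload just described; the inputs are the article's figures, the script just multiplies them out:

```python
# Lifespan check: does the HDD-equivalent workload exceed 1 DWPD on a 1.92TB SSD?

ssd_capacity_tb = 1.92        # read-optimized SSD, rated ~1 DWPD for 5 years
workload_mb_s   = 70          # average total throughput of the workload
write_fraction  = 0.30        # 70/30 read/write ratio

writes_tb_per_day = workload_mb_s * write_fraction * 86_400 / 1_000_000
dwpd_used = writes_tb_per_day / ssd_capacity_tb

print(f"Writes per day : {writes_tb_per_day:.2f} TB")   # ~1.81 TB/day
print(f"DWPD consumed  : {dwpd_used:.2f}")              # ~0.95 of the 1.0 rating
```

The workload lands just under the 1 DWPD rating, so even a read-optimized drive survives the full 5-year rating period under these conditions.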
TCO: It is true that an SSD uses more power as throughput increases. Most top out at about 3x the power consumption of a comparable HDD when driven hard, but they also provide ~10x the throughput and >100x the small-block IOPS of the HDD. If the SSD is ambling along at the sedate pace of a 10krpm HDD, it will consume less power than the HDD. And if you stress the performance envelope of the SSD, you would need a large number of HDDs to match a single SSD, which puts the HDD solution in a different ballpark altogether in both initial cost and TCO.
In other words, imagine having to put up a whole node with 20 HDDs to match the performance of this single $350 mixed use SSD that consumes 20W at full tilt operation. You would have to buy a $4000 server with 20 $370 HDDs -- which would, by the way, consume an average of maybe 300W. 
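Put into numbers, that thought experiment looks roughly like this; the server price, drive prices, and power figures are the rough estimates quoted above:

```python
# Rough initial-cost and power comparison from the thought experiment above.

# Single mixed-use SSD (rough figures from the text)
ssd_cost, ssd_watts = 350, 20

# HDD node needed to approach the same performance (rough figures from the text)
server_cost, hdd_cost, hdd_count, node_watts = 4000, 370, 20, 300

hdd_node_cost = server_cost + hdd_cost * hdd_count

print(f"Single SSD : ${ssd_cost:>6}, ~{ssd_watts} W")
print(f"HDD node   : ${hdd_node_cost:>6}, ~{node_watts} W")   # $11,400, ~300 W
```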
So as you can see, an SSD is the better deal, even from a purely financial perspective, whether you drive it hard or not.
Of course there are always edge cases. Ask us about your specific use case, and we can do a tailored head-to-head comparison for your specific application.

So what’s next?

We are already nearing the point where NVMe will supersede the SATA or SAS interface in SSDs. So the SSDs, which came out on top when we started this discussion, are already on their way out again.
NVMe has the advantage of being an interface created specifically for flash memory. It does not pretend to be an HDD, as SAS and SATA do, so it does not need the protocol translation that converts the HDD access protocol into a flash memory access protocol internally and converts it back on the way out. You can see the difference by looking at the performance envelope of the devices.
New flash memory technologies keep pushing the performance envelope, and the interface increasingly hampers performance, so the shift from SAS/SATA to NVMe is imminent. NVMe even comes in multiple form factors: one closely resembling 2.5” HDDs for hot-swap purposes, and one (M.2, which resembles a memory module) for internal storage that does not need hot-swap capability. Intel’s ruler design and Supermicro’s M.2 carriers will further increase storage density with NVMe devices.
On the horizon, new technologies such as Intel Optane again increase performance and resilience to wear, though currently at a much higher cost than traditional flash.
Maybe a few years from now everything will be nonvolatile memory and we can simply cut power to the devices. Either way, we will see further increases in density, performance, and reliability, and further decreases in cost.
Welcome to the future of storage!

Christian Huebner

With an MS in Electrical Engineering and a passion for software, Christian was drawn into IT right out of university. Stations as a technical instructor, senior support and systems engineer, developer, and architect followed, almost always in mission-critical Unix or Linux based IT environments. Every station has contributed understanding of additional technologies and their impact on key properties such as reliability, performance, scalability, and ease of use. As principal architect at Mirantis, technology is still at the heart of everything Christian does, but he feels the important part is to understand, manage, and optimize the relationship between business aspects and the technologies that best support them. Thus, a good part of his time is spent determining business impact on project design and technology influence on business decisions.
