I recently had an interesting time with a customer who is all too familiar with SANs. SAN vendors typically size at 180 IOPS per drive. This is a good conservative measure for SAN sizing, but the drives are capable of much more, and indeed we state higher numbers for Exadata. So, how could this be possible? Does Exadata have an enchantment spell that makes the drives magically spin faster? Maybe a space-time warp to service IO?
The Exadata X2-2 data sheet states “up to 50,000 IOPS” for a full rack of high performance 600GB 15K rpm drives. With 168 drives in a full rack, this works out to roughly 300 IOPS per drive. At first glance, 300 IOPS from a drive that spins at 250 revolutions per second seems strange. But really, it only means that on average you have to service more than one IO per revolution. So, how do you service more than one IO per revolution?
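As a quick sanity check, the per-drive arithmetic can be sketched as follows (assuming a full X2-2 rack of 14 storage cells with 12 drives each, i.e. 168 drives):

```python
# Back-of-the-envelope check of the X2-2 data sheet numbers.
# Assumption: a full rack has 14 storage cells x 12 drives = 168 drives.
rack_iops = 50_000
drives = 14 * 12                    # 168 drives in a full rack
rpm = 15_000

iops_per_drive = rack_iops / drives          # ~298, i.e. "300 IOPS"
revs_per_second = rpm / 60                   # 250 revolutions per second
ios_per_revolution = iops_per_drive / revs_per_second

print(f"{iops_per_drive:.0f} IOPS per drive")
print(f"{revs_per_second:.0f} revolutions per second")
print(f"{ios_per_revolution:.2f} IOs per revolution")
```

At 15K rpm the drive completes 250 revolutions per second, so 300 IOPS means servicing about 1.2 IOs per revolution.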
Drive command queuing and short stroking
Modern drives have the ability to queue more than one IO at a time. If the queues are deep enough and the seek distances are short enough, it is quite possible to exceed one IO per revolution. As you increase the queue depth, the probability that some IO in the queue can be serviced before a full revolution completes increases. Plenty of literature exists on this topic, and many have tested this phenomenon. The popular site “Tom’s Hardware” has tested a number of drives and shows that with a command queue depth of four, both the Hitachi and Seagate 15K rpm drives reach 300 IOPS per drive.
This effect of servicing more than one IO per revolution is amplified when the seek distances are short. There is an old benchmark trick of using only the outer portion of the drive to shrink the seek distance. Combined with command queuing, this further increases the probability of servicing more than one IO per revolution.
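The interplay of queue depth and seek distance can be illustrated with a toy Monte Carlo model. This is a minimal sketch, not a real drive model: it assumes seek time grows linearly with distance up to an assumed 6 ms full stroke, ignores transfer time, and the `simulate` and `service_time` functions are hypothetical names invented for this illustration.

```python
import random

# Toy Monte Carlo model of a 15K rpm drive servicing random reads.
# Assumptions (illustrative, not real drive specs): seek time grows
# linearly with distance up to a 6 ms full stroke; transfer time ignored.
ROT_PERIOD = 60.0 / 15_000          # 4 ms per revolution at 15K rpm
MAX_SEEK = 0.006                    # assumed full-stroke seek time (seconds)

def service_time(head_pos, now, req):
    """Seek plus rotational latency to reach a (radial, angular) request."""
    radial, angular = req
    seek = MAX_SEEK * abs(radial - head_pos)
    # platter angle when the seek completes, as a fraction of a revolution
    disk_angle = ((now + seek) / ROT_PERIOD) % 1.0
    rot_wait = ((angular - disk_angle) % 1.0) * ROT_PERIOD
    return seek + rot_wait

def simulate(queue_depth, stroke=1.0, n=20_000, seed=0):
    """Return IOPS for random reads confined to the outer `stroke` fraction."""
    rng = random.Random(seed)
    queue = [(rng.random() * stroke, rng.random()) for _ in range(queue_depth)]
    head_pos, now = 0.0, 0.0
    for _ in range(n):
        # the drive picks the queued request it can service soonest
        best = min(queue, key=lambda r: service_time(head_pos, now, r))
        now += service_time(head_pos, now, best)
        head_pos = best[0]
        queue.remove(best)
        queue.append((rng.random() * stroke, rng.random()))   # keep queue full
    return n / now

for qd in (1, 4, 16):
    print(f"queue depth {qd:2d}: "
          f"full stroke {simulate(qd):4.0f} IOPS, "
          f"short stroke {simulate(qd, stroke=0.3):4.0f} IOPS")
```

Under these assumptions, deepening the queue raises IOPS because some queued request is usually reachable before a full revolution passes, and restricting requests to a fraction of the stroke raises it further by cutting the seek component.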
But how can this old trick work with real world environments?
ASM intelligent data placement to the rescue
ASM has a feature, “Intelligent Data Placement” (IDP), that optimizes the placement of data so that the most active data resides on the outer portions of the drive. The drive is essentially split into “hot” and “cold” regions. This care in placement helps to reduce the seek distance and achieve a higher IOPS/drive. It is the realization of an old benchmark trick as a real feature in ASM.
The proof is in the pudding… the “calibrate” command shows drive capabilities
The “calibrate” command, which is part of the Exadata storage “cellcli” interface, is used to test the capabilities of the underlying components of Exadata storage. The throughput and IOPS of both the drives and the flash modules can be tested at any point to see whether they are performing up to expectations. The calibrate command uses the popular Orion IO test utility, which is designed to mimic Oracle IO patterns. The utility seeks randomly over the first half of the drive in order to show the capabilities of the drives. I have included example output from an X2-2 machine below.
CellCLI> calibrate
Calibration will take a few minutes...
Aggregate random read throughput across all hard disk luns: 1809 MBPS
Aggregate random read throughput across all flash disk luns: 4264.59 MBPS
Aggregate random read IOs per second (IOPS) across all hard disk luns: 4923
Aggregate random read IOs per second (IOPS) across all flash disk luns: 131197
Calibrating hard disks (read only) ...
Lun 0_0 on drive [20:0 ] random read throughput: 155.60 MBPS, and 422 IOPS
Lun 0_1 on drive [20:1 ] random read throughput: 155.95 MBPS, and 419 IOPS
Lun 0_10 on drive [20:10 ] random read throughput: 155.58 MBPS, and 428 IOPS
Lun 0_11 on drive [20:11 ] random read throughput: 155.13 MBPS, and 428 IOPS
Lun 0_2 on drive [20:2 ] random read throughput: 157.29 MBPS, and 415 IOPS
Lun 0_3 on drive [20:3 ] random read throughput: 156.58 MBPS, and 415 IOPS
Lun 0_4 on drive [20:4 ] random read throughput: 155.12 MBPS, and 421 IOPS
Lun 0_5 on drive [20:5 ] random read throughput: 154.95 MBPS, and 425 IOPS
Lun 0_6 on drive [20:6 ] random read throughput: 153.31 MBPS, and 419 IOPS
Lun 0_7 on drive [20:7 ] random read throughput: 154.34 MBPS, and 415 IOPS
Lun 0_8 on drive [20:8 ] random read throughput: 155.32 MBPS, and 425 IOPS
Lun 0_9 on drive [20:9 ] random read throughput: 156.75 MBPS, and 423 IOPS
Calibrating flash disks (read only, note that writes will be significantly slower) ...
Lun 1_0 on drive [FLASH_1_0] random read throughput: 273.25 MBPS, and 19900 IOPS
Lun 1_1 on drive [FLASH_1_1] random read throughput: 272.43 MBPS, and 19866 IOPS
Lun 1_2 on drive [FLASH_1_2] random read throughput: 272.38 MBPS, and 19868 IOPS
Lun 1_3 on drive [FLASH_1_3] random read throughput: 273.16 MBPS, and 19838 IOPS
Lun 2_0 on drive [FLASH_2_0] random read throughput: 273.22 MBPS, and 20129 IOPS
Lun 2_1 on drive [FLASH_2_1] random read throughput: 273.32 MBPS, and 20087 IOPS
Lun 2_2 on drive [FLASH_2_2] random read throughput: 273.92 MBPS, and 20059 IOPS
Lun 2_3 on drive [FLASH_2_3] random read throughput: 273.71 MBPS, and 20049 IOPS
Lun 4_0 on drive [FLASH_4_0] random read throughput: 273.91 MBPS, and 19799 IOPS
Lun 4_1 on drive [FLASH_4_1] random read throughput: 273.73 MBPS, and 19818 IOPS
Lun 4_2 on drive [FLASH_4_2] random read throughput: 273.06 MBPS, and 19836 IOPS
Lun 4_3 on drive [FLASH_4_3] random read throughput: 273.02 MBPS, and 19770 IOPS
Lun 5_0 on drive [FLASH_5_0] random read throughput: 273.80 MBPS, and 19923 IOPS
Lun 5_1 on drive [FLASH_5_1] random read throughput: 273.26 MBPS, and 19926 IOPS
Lun 5_2 on drive [FLASH_5_2] random read throughput: 272.97 MBPS, and 19893 IOPS
Lun 5_3 on drive [FLASH_5_3] random read throughput: 273.65 MBPS, and 19872 IOPS
CALIBRATE results are within an acceptable range.
As you can see, the drives can actually be driven even higher than the stated 300 IOPS per drive; in this run every disk delivers over 400 IOPS.
So, why can’t SANs achieve this high number?
A SAN dedicated to one server with one purpose should be able to take advantage of command queuing. But SANs are not typically configured in this manner. A SAN is a shared, general purpose disk infrastructure used by many departments and applications, from databases to email. When sharing resources on a SAN, great care is taken to ensure that the number of outstanding IO requests does not get too high and cause the fabric to reset. On Solaris, SAN vendors require setting the “sd_max_throttle” parameter, which limits the amount of IO presented to the SAN. This is typically set very conservatively, protecting the shared SAN resource by queuing the IO on the OS instead.
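On Solaris this throttle lives in /etc/system. The value below is purely illustrative; each SAN vendor publishes its own recommended setting, so consult their documentation before changing it:

```
* /etc/system -- cap the outstanding IOs per LUN presented to the SAN
* (example value only; use your SAN vendor's recommendation)
set sd:sd_max_throttle=20
```

With a cap like this, extra IO queues in the OS rather than reaching the drives, so the deep command queues that make 300+ IOPS per drive possible never form.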
Long story short…
A 180 IOPS/drive rule of thumb for SANs might be reasonable, but the “drive” is definitely capable of more.
Exadata has dedicated drives, is not artificially throttled, and can take full advantage of the drives’ capabilities.