Exadata drives exceed the laws of physics… ASM with intelligent placement improves IOPS

I recently had an interesting time with a customer who is all too familiar with SANs.  SAN vendors typically use IOPS/drive sizing numbers of 180 IOPS per drive.  This is a good, conservative measure for SAN sizing, but the drives are capable of much more, and indeed we state higher numbers with Exadata.  So, how could this be possible?  Does Exadata have an enchantment spell that makes the drives magically spin faster?  Or maybe a space-time warp to service IO?

The Exadata X2-2 data sheet states “up to 50,000 IOPS” for a full rack of high-performance 600GB 15K rpm drives.  This works out to roughly 300 IOPS per drive.  At first glance, 300 IOPS from a drive that spins at 250 revolutions per second might seem strange.  But really, it only means that you have to service, on average, more than one IO per revolution.  So, how do you service more than one IO per revolution?
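
A quick back-of-envelope check of those numbers, sketched below in Python.  It assumes the usual full-rack complement of 168 high-performance drives (14 storage cells with 12 drives each); that drive count is my assumption for the arithmetic, not a figure quoted in the data sheet line above.

rack_iops = 50000                 # "up to 50,000 IOPS" from the X2-2 data sheet
drives = 14 * 12                  # assumed full rack: 14 cells x 12 drives = 168
rpm = 15000                       # 15K rpm high-performance drives

iops_per_drive = rack_iops / drives                      # ~298 IOPS per drive
revs_per_second = rpm / 60                               # 250 revolutions per second
ios_per_revolution = iops_per_drive / revs_per_second    # ~1.2 IOs per revolution

print(f"{iops_per_drive:.0f} IOPS/drive, {revs_per_second:.0f} rev/s, "
      f"{ios_per_revolution:.2f} IOs per revolution")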

Drive command queuing and short stroking

Modern drives have the ability to queue up more than one IO at a time.  If queues are deep enough and the seek distance is short enough, it is more than possible to exceed one IO per revolution.  As you increase the queue depth, the probability of having an IO in the queue that can be serviced before a full revolution increases.  Lots of literature exists on this topic and many have tested this phenomenon.   The popular site “Tom’s Hardware” has tested a number of drives and shows that with a command queue depth of four, both the Hitachi and Seagate 15K rpm drives reach 300 IOPS per drive.
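
To get a feel for why deeper queues help, here is a small toy model in Python.  It is only a sketch of the idea, not of any real drive’s firmware: seek time is ignored entirely, and with N requests queued at random angular positions the drive simply services whichever request reaches the head first.

import random

# Toy model: rotation only, seek time ignored.  The expected rotational wait
# falls from half a revolution (queue depth 1) toward 1/(N+1) of a revolution
# as the queue deepens, which is what makes >1 IO per revolution possible.

T_REV_MS = 60000 / 15000      # one revolution of a 15K rpm drive = 4 ms
TRIALS = 200000

for depth in (1, 2, 4, 8):
    waits = [min(random.uniform(0, T_REV_MS) for _ in range(depth))
             for _ in range(TRIALS)]
    avg = sum(waits) / TRIALS
    print(f"queue depth {depth}: average rotational wait {avg:.2f} ms "
          f"({avg / T_REV_MS:.2f} of a revolution)")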

This effect of servicing more than one IO per revolution is enhanced when the seek distances are short.  There is an old benchmark trick to use only the outer portion of the drive to shrink the seek distance.  This technique combined with command queuing increases the probability of servicing more than one IO per revolution.

But how can this old trick work in real-world environments?

ASM intelligent data placement to the rescue

ASM has a feature called Intelligent Data Placement (IDP) that optimizes the placement of data such that the most active data resides on the outer portions of the drive.  The drive is essentially split into “Hot” and “Cold” regions.  This care in placement helps to reduce the seek distance and achieve higher IOPS per drive.   This is the realization of an old benchmark trick, using a real feature of ASM.
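
To illustrate how the hot/cold split and command queuing compound, here is another rough sketch in the same style.  The numbers are illustrative choices of mine (not IDP internals or the specs of any particular drive): a request costs a uniform 0.5-6.0 ms seek when data is spread across the whole platter, or 0.5-2.0 ms when the active data sits in an outer “hot” band, plus a uniform 0-4 ms rotational wait, and the drive always services the cheapest request in its queue.

import random

# Rough model of command queuing plus data placement.  The hot region is
# modeled simply as a smaller maximum seek; all timing values here are
# illustrative assumptions, not measured or quoted figures.

T_REV_MS = 60000 / 15000          # 4 ms per revolution at 15,000 rpm
TRIALS = 200000

def est_iops(queue_depth, max_seek_ms):
    total = 0.0
    for _ in range(TRIALS):
        costs = [random.uniform(0.5, max_seek_ms) + random.uniform(0, T_REV_MS)
                 for _ in range(queue_depth)]
        total += min(costs)       # the cheapest queued request goes first
    return 1000 / (total / TRIALS)

for depth in (1, 4, 8):
    spread = est_iops(depth, max_seek_ms=6.0)   # data spread over the platter
    hot = est_iops(depth, max_seek_ms=2.0)      # hot data in the outer band
    print(f"depth {depth}: spread ~{spread:.0f} IOPS, hot region ~{hot:.0f} IOPS")

The absolute numbers are not meant to match any particular drive; the point is the trend, namely that deeper queues and shorter seeks compound.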

the proof is in the pudding… “calibrate” command shows drive capabilities

The “calibrate” command, which is part of the Exadata storage “cellcli” interface, is used to test the capabilities of the underlying components of Exadata storage.  The throughput and IOPS of both the drives and the Flash modules can be tested at any point to see if they are performing up to expectations.  The calibrate command uses the popular Orion IO test utility, designed to mimic Oracle IO patterns.   This utility randomly seeks over the first half of the drive in order to show the capabilities of the drives.  I have included example output from an X2-2 machine below.

CellCLI> calibrate
Calibration will take a few minutes...
Aggregate random read throughput across all hard disk luns: 1809 MBPS
Aggregate random read throughput across all flash disk luns: 4264.59 MBPS
Aggregate random read IOs per second (IOPS) across all hard disk luns: 4923
Aggregate random read IOs per second (IOPS) across all flash disk luns: 131197
Calibrating hard disks (read only) ...
Lun 0_0  on drive [20:0     ] random read throughput: 155.60 MBPS, and 422 IOPS
Lun 0_1  on drive [20:1     ] random read throughput: 155.95 MBPS, and 419 IOPS
Lun 0_10 on drive [20:10    ] random read throughput: 155.58 MBPS, and 428 IOPS
Lun 0_11 on drive [20:11    ] random read throughput: 155.13 MBPS, and 428 IOPS
Lun 0_2  on drive [20:2     ] random read throughput: 157.29 MBPS, and 415 IOPS
Lun 0_3  on drive [20:3     ] random read throughput: 156.58 MBPS, and 415 IOPS
Lun 0_4  on drive [20:4     ] random read throughput: 155.12 MBPS, and 421 IOPS
Lun 0_5  on drive [20:5     ] random read throughput: 154.95 MBPS, and 425 IOPS
Lun 0_6  on drive [20:6     ] random read throughput: 153.31 MBPS, and 419 IOPS
Lun 0_7  on drive [20:7     ] random read throughput: 154.34 MBPS, and 415 IOPS
Lun 0_8  on drive [20:8     ] random read throughput: 155.32 MBPS, and 425 IOPS
Lun 0_9  on drive [20:9     ] random read throughput: 156.75 MBPS, and 423 IOPS
Calibrating flash disks (read only, note that writes will be significantly slower) ...
Lun 1_0 on drive [FLASH_1_0] random read throughput: 273.25 MBPS, and 19900 IOPS
Lun 1_1 on drive [FLASH_1_1] random read throughput: 272.43 MBPS, and 19866 IOPS
Lun 1_2 on drive [FLASH_1_2] random read throughput: 272.38 MBPS, and 19868 IOPS
Lun 1_3 on drive [FLASH_1_3] random read throughput: 273.16 MBPS, and 19838 IOPS
Lun 2_0 on drive [FLASH_2_0] random read throughput: 273.22 MBPS, and 20129 IOPS
Lun 2_1 on drive [FLASH_2_1] random read throughput: 273.32 MBPS, and 20087 IOPS
Lun 2_2 on drive [FLASH_2_2] random read throughput: 273.92 MBPS, and 20059 IOPS
Lun 2_3 on drive [FLASH_2_3] random read throughput: 273.71 MBPS, and 20049 IOPS
Lun 4_0 on drive [FLASH_4_0] random read throughput: 273.91 MBPS, and 19799 IOPS
Lun 4_1 on drive [FLASH_4_1] random read throughput: 273.73 MBPS, and 19818 IOPS
Lun 4_2 on drive [FLASH_4_2] random read throughput: 273.06 MBPS, and 19836 IOPS
Lun 4_3 on drive [FLASH_4_3] random read throughput: 273.02 MBPS, and 19770 IOPS
Lun 5_0 on drive [FLASH_5_0] random read throughput: 273.80 MBPS, and 19923 IOPS
Lun 5_1 on drive [FLASH_5_1] random read throughput: 273.26 MBPS, and 19926 IOPS
Lun 5_2 on drive [FLASH_5_2] random read throughput: 272.97 MBPS, and 19893 IOPS
Lun 5_3  on drive [FLASH_5_3] random read throughput: 273.65 MBPS, and 19872 IOPS
CALIBRATE results are within an acceptable range.

As you can see, the drives can actually be driven even higher than the stated 300 IOPS per drive; the calibrate run above shows well over 400 IOPS per disk LUN.

So, why can’t SANs achieve this high number?

A SAN that is dedicated to one server with one purpose should be able to take advantage of command queuing.  But SANs are not typically configured in this manner.  A SAN is a shared, general-purpose disk infrastructure used by many departments and applications, from database to email.   When sharing resources on a SAN, great care is taken to ensure that the number of outstanding IO requests does not get too high and cause the fabric to reset.  On Solaris, SAN vendors require setting the “sd_max_throttle” parameter, which limits the amount of IO presented to the SAN.  This is typically set very conservatively so as to protect the shared SAN resource by queuing the IO on the OS instead.
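
As a rough illustration of where a figure like 180 IOPS/drive comes from (using assumed numbers of my own, not a vendor spec): if host-side throttling effectively leaves a single outstanding request per drive, there is nothing for the drive to reorder, so every IO pays a full average seek plus half a revolution.

avg_seek_ms = 3.5                        # assumed average seek for a 15K rpm drive
avg_rot_ms = 0.5 * (60000 / 15000)       # half a revolution = 2 ms
print(f"~{1000 / (avg_seek_ms + avg_rot_ms):.0f} IOPS per drive")   # ~182 IOPS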

long story short…

A 180 IOPS/drive rule of thumb for SANs might be reasonable, but the “drive” is definitely capable of more.

Exadata has dedicated drives, is not artificially throttled, and can take full advantage of the drives’ capabilities.


6 Responses to “Exadata drives exceed the laws of physics… ASM with intelligent placement improves IOPS”


  1. Anton Pavlenko, May 17, 2011 at 3:15 am

    Hello Glenn,
    I was really surprised that you published such an article.
    First of all, with SAN storage there is always some amount of cache (in most cases gigabytes of it), and as soon as a host IO lands in this cache, the host considers the IO complete.  The storage may write that data to disk much later, so the array has a much bigger “buffer” than the disk itself; it can accumulate commands in the cache until another operation needs that spindle, then reorder the commands and write them to disk.
    Sharing disks in a SAN is not always bad; it can be really useful if you plan your allocation carefully.  You can share disks between IOPS-hungry and capacity-hungry systems and get a positive effect for both.

    • glennfawcett, May 17, 2011 at 9:00 am

      I certainly agree that SANs are good at sharing and buffering storage… indeed, SANs have a place in the enterprise, no question.  But the article was focused on showing how sizing rules of thumb for SANs don’t apply to Exadata.

  2. kevinclosson, November 1, 2011 at 1:07 pm

    Hi Glenn,

    You might want to investigate further whether the cellcli calibrate is a callout to orion. I’ve been out of the loop for over nine months but I don’t recall that being the case.


