Archive for the 'Oracle' Category

getSQLMON… mining SQLMON reports from Oracle

As an Oracle performance engineer, I often work with customers on benchmarks and diagnosing various issues.  Now, anyone can gather an AWR report and send it my way, but that is just the tip of the iceberg.  AWR reports are great for summarizing activity at a high level and can even identify a SQL statement that is taking longer than expected.  So, once we have identified a poorly performing SQL, how do we fix it?

…that is where SQLMON comes into the picture.

gathering sqlmon reports

The power of the SQLMON report is very well documented.  It provides insight into EXACTLY what was happening when that query was running on the system.  It documents the run time, SQL plan, CPU activity, IO activity, Offload, etc…  The report can be gathered by calling the “dbms_sqltune.report_sql_monitor” function.  Each SQL statement in Oracle maps to a specific “SQL_ID”, which is really just a hash of the SQL text….But, that is not all.
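For reference, here is a minimal sketch of pulling a single report from SQL*Plus (the SQL_ID is the one from the example later in this post; the formatting settings are just the usual ones for spooling a large CLOB):

# hedged sketch: dump the SQL Monitor report for one SQL_ID as HTML
sqlplus -s / as sysdba <<'EOF' > sqlmon_8g97qvjxkdvq5.html
set long 10000000 longchunksize 10000000 pagesize 0 linesize 32767 trimspool on feedback off
select dbms_sqltune.report_sql_monitor(
         sql_id       => '8g97qvjxkdvq5',
         type         => 'HTML',
         report_level => 'ALL')
from   dual;
EOF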

Every time you run a SQL statement, an execution plan must be created or reused, and each individual execution is identified by its own “sql_exec_id”….  But wait, this is still not all.

You may have run the same SQL multiple times.  Maybe the first run was on a quiet system and the second was during a batch window.  Each execution records its start time in the “sql_exec_start” column of the v$sql_monitor view.
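To see which executions are available for a given SQL_ID, a quick query against v$sql_monitor does the trick (a sketch; parallel query slaves will show up as extra rows per execution):

sqlplus -s / as sysdba <<'EOF'
-- list every monitored execution of one SQL_ID with its start time and duration
select sql_id, sql_exec_id,
       to_char(sql_exec_start, 'YYYY-MM-DD HH24:MI:SS') sql_exec_start,
       round(elapsed_time/1e6) duration_secs
from   v$sql_monitor
where  sql_id = '8g97qvjxkdvq5'
order  by sql_exec_start;
EOF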

You can quickly see how one SQL statement could have multiple runs and plans.  This can get messy and cumbersome to gather by hand.  Enterprise Manager allows you to look at individual runs and save the SQLMON file, but what if you have 20 executions with 3 different plans?

…clearly there has got to be a better way.

the birth of the getSQLMON script

I created a simple script to gather all SQLMON reports for a given portion of the SQL text.  It basically searches v$sqltext for any SQL_IDs that match a specific pattern, then creates a directory for this particular run and a subdirectory for every matching execution.

$ ./getSQLMON.sh
Usage: ./getSQLMON.sh <SQLPATTERN> 

$ ./getSQLMON.sh RTL19

SQL_ID        SQL_EXEC_ID      EPOCH SQLTXT               DURATION_SECS
------------- ----------- ---------- -------------------- -------------
8g97qvjxkdvq5    16777218 1439194608 select /*+ RTL19 */              8
8g97qvjxkdvq5    16777216 1439194779 select /*+ RTL19 */              3
8g97qvjxkdvq5    16777216 1439363758 select /*+ RTL19 */              2

Gather SQLMON reports for these SQLids
ALL done... Reports are in sqlmon_081215_1308


$ cd sqlmon_081215_1308

$ ls
sqlid_8g97qvjxkdvq5_epoch_1439194608 sqlid_8g97qvjxkdvq5_epoch_1439363758
sqlid_8g97qvjxkdvq5_epoch_1439194779 sqlmon_summary_081215_1308.txt


$ ls sqlid*
sqlid_8g97qvjxkdvq5_epoch_1439194608:
sqlmon_sqlid_8g97qvjxkdvq5_execid_16777218_epoch_1439194608_dt_081215_1308.html
sqlmon_sqlid_8g97qvjxkdvq5_execid_16777218_epoch_1439194608_dt_081215_1308.txt

sqlid_8g97qvjxkdvq5_epoch_1439194779:
sqlmon_sqlid_8g97qvjxkdvq5_execid_16777216_epoch_1439194779_dt_081215_1308.html
sqlmon_sqlid_8g97qvjxkdvq5_execid_16777216_epoch_1439194779_dt_081215_1308.txt

sqlid_8g97qvjxkdvq5_epoch_1439363758:
sqlmon_sqlid_8g97qvjxkdvq5_execid_16777216_epoch_1439363758_dt_081215_1308.html
sqlmon_sqlid_8g97qvjxkdvq5_execid_16777216_epoch_1439363758_dt_081215_1308.txt

Once the tool has been run, the directory structure can be distributed for further analysis.  This tool has been very helpful for me to mine SQLMON reports when running benchmarks or analyzing various running systems.  I have placed the getSQLMON tool in GitHub.

Hopefully it will be of use to you as well!! 

“external table write” wait events… but I am only running a query?

I was helping a customer debug some external table load problems.  They were developing code to do massive inserts via external tables.  As the code was being tested, we saw a fair number of tests that were doing simple queries of an external table.  I expected to see “external table read” wait events, but was surprised when we saw more “external table write” wait events than reads.

I thought this was due to writes to the “log” file and possibly the “bad” file, but I had to be sure.  I searched the docs but could not find a reference to this wait event.  Specifically, I was seeing the following:

WAIT #139931007587096: nam='external table write' ela= 7 filectx=139931005791096 file#=13 size=41 obj#=-1 tim=1398264093597968
WAIT #139931007587096: nam='external table write' ela= 3 filectx=139931005791096 file#=13 size=89 obj#=-1 tim=1398264093597987

I searched on how to decode the filectx and file# but still couldn’t find anything.  So, I resorted to my good old friend “strace” from the Linux side of the house.  By running “strace” on the oracle shadow process, I was able to confirm that these write events were indeed going to the LOG file for the external table.
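For reference, the trace itself can be captured with something along these lines (the PID is a placeholder for the shadow process; /proc then confirms what file descriptor 13 points at):

# attach to the oracle shadow process and log write() calls for later grep'ing
strace -f -p 12345 -e trace=write -o strace-truss-trace.txt
# confirm which file descriptor 13 actually is
ls -l /proc/12345/fd/13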

mylap:EXTW glennf$ egrep 'write\(13' strace-truss-trace.txt
 write(13, "\n\n LOG file opened at 04/23/14 0"..., 41) = 41
 write(13, "KUP-05004: Warning: Intra sour"..., 100) = 100
 write(13, "Field Definitions for table ET_T"..., 36) = 36
 write(13, " Record format DELIMITED, delim"..., 43) = 43
 write(13, " Data in file has same endianne"..., 51) = 51
 write(13, " Rows with all null fields are "..., 41) = 41
 write(13, "\n", 1) = 1
 write(13, " Fields in Data Source: \n", 26) = 26
 write(13, "\n", 1) = 1
 write(13, " ID "..., 47) = 47
 write(13, " Terminated by \"7C\"\n", 25) = 25
 write(13, " Trim whitespace same as SQ"..., 41) = 41
 write(13, " TDATE "..., 46) = 46
  ....
  ....

Each time you open an external table, the open time is logged as well as the table definition.  We have some very wide tables, so there was actually more data logged than queried.  With the proper amount of data now in the dat files, we are indeed seeing more “external table read” requests, as expected.   Regardless, this was a fun exercise.

So, the moral of the story… Sometimes you have to turn over a few rocks and drill down a bit to find the pot of gold.

 

Analyzing IO at the Cell level with cellcli… a new and improved script

Recently I had the pleasure of corresponding with Hans-Peter Sloot.  After looking at my simple tool in this post to gather cell IO data from cellcli, he took it several steps further and created a nice python version that goes to the next level to pull IO statistics from the cells.

current_rw_rq.py

This script breaks down the IO into “Small” and “Large” requests, as is commonly done by Enterprise Manager, and also provides a summary by cell.

Hans-Peter also added two other scripts to drill into the historical metric data available through cellcli.    Thanks for sharing your tools and further expanding my toolbox!

Analyzing IO at the Exadata Cell level… iostat summary

While analyzing Write-Back cache activity on Exadata storage cells, I wanted something to interactively monitor IO while I was running various tests.  The problem is summarizing the results from ALL of the storage cells.  So, I decided to use my old friend “iostat” and a quick easy script to roll up the results for both DISK and FLASH.  This allowed me to monitor the IOPS, IO size, wait times, and service times.

The “iostat-all.sh” tool shows the following data:

day          time    device   r/s    w/s rsec/s   wsec/s avgrq-sz await svctm
------------------------------------------------------------------------------
 2013-06-24 14:40:11 DISK  47  40252   54  2667941  66.15  0.28  0.07
 2013-06-24 14:40:11 FLASH  9  40354  322  2853674  70.70  0.13  0.13
 2013-06-24 14:41:13 DISK  48  39548   80  2691362  67.95  0.31  0.08
 2013-06-24 14:41:13 FLASH  9  53677  324  3975687  74.06  0.14  0.13
…

Hopefully this will be useful for those that like to dive into the weeds using our good old friends.

Dtrace probes in Oracle 12c… v$kernel_io_outlier is populated by dtrace!!

Oracle 12c certainly has some great features, but for a performance guy like myself, the performance monitoring features are particularly interesting.  There are three new v$ tables that track anomalies in the IO path.  The idea is to provide more information about really poorly performing IO that lasts more than 500ms.

  • V$IO_OUTLIER : tracks the attributes of an IO.  The size and latency, as well as ASM information, are recorded.
  • V$LGWRIO_OUTLIER : tracks information specifically on Log writer IO.
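A first look at these two is as simple as selecting from them; the column lists vary a bit by version, so a plain select is the safe way to explore:

sqlplus -s / as sysdba <<'EOF'
-- any IO (or log writer IO) slower than 500ms should show up here
select * from v$io_outlier;
select * from v$lgwrio_outlier;
EOF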

These two tables are going to be useful to monitor when performance issues occur.  I can already see the SQL scripts to monitor this activity starting to pile up.  But, there is one little extra table that dives even further into the IO stack using Dtrace.

  • “V$KERNEL_IO_OUTLIER” : This table dives into the KERNEL to provide information about Kernel IO.  It uses my old friend Dtrace to provide information about where the waits are occurring when Kernel IO is in play.  It shows the time for every step involved in the setup and teardown of Kernel IO, which allows us to more easily debug anomalies in the IO stack.

Back in 2009 when Oracle was buying Sun, I posted “Oracle buys Sun! Dtrace probes for Oracle?” and mused about how cool that would be… It is good to know that someone was listening 🙂

Analyzing IO at the Exadata Cell level… a simple tool for IOPS.

Lately I have been drawn into a fair number of discussions about IO characteristics while helping customers run benchmarks.  I have been working with a mix of developers, DBAs, sysadmins, and storage admins.  As I have learned, every group has their own perspective – certainly when it comes to IO and performance.

  • Most DBAs want to see data from the DB point of view, so AWRs or EM work just fine.
  • Most System Admins look at storage from the Filesystem or ASM disk level.
  • Storage Admins want to see what is going on within the array.
  • Performance geeks like myself like to see all up and down the stack 🙂

As part of pulling back the covers, I came up with a simple little tool for showing IOPS at the cell level.

Mining IO statistics from cellcli

The cellsrv process collects data about various events and performance metrics in an Exadata storage cell.  I certainly am a huge fan of the table and index usage data gathered using the “pythian_cell_cache_extract.pl” script written by Christo Kutrovsky.  It really provides a great look inside the Exadata Smart Flash Cache.  So, this got me to thinking.  What about IOPS data?

With the introduction of the Write Back Flash cache in X3, there is much more interest in what is going to flash vs disk – and in how what is written to flash is flushed to disk.

To look at all the current metrics gathered from the storage cells in your Exadata or SuperCluster, you can run “cellcli -e list metriccurrent” on all the storage cells.  The “metriccurrent” values are updated every minute by cellsrv to store performance data.  There are a few convenient metrics that can be used to sum up all the IOPS.

  • CD_IO_RQ_R_LG_SEC + CD_IO_RQ_R_SM_SEC
  • CD_IO_RQ_W_LG_SEC + CD_IO_RQ_W_SM_SEC

These metrics show the number of IOs per second for reads and writes.  By mining this data and breaking it down by “FD” vs “CD” object names, you can see hit ratios for reads from an overall cell point of view, but now you can also see how many writes are going to FLASH vs DISK.
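Here is a hedged sketch of the raw data pull behind that idea (this is not the published ciops-all.sh, and the cell_group file is an assumption):

# on a single cell: current small/large read/write request rates per celldisk;
# a metricObjectName starting with FD_ is a flash celldisk, CD_ is a hard disk
cellcli -e "list metriccurrent where name like 'CD_IO_RQ_.*_SEC' attributes name, metricObjectName, metricValue"

# or across every cell at once with dcli (note the escaped quotes around the pattern)
dcli -g cell_group -l celladmin cellcli -e "list metriccurrent where name like \'CD_IO_RQ_.*_SEC\'"

Summing the FD_ and CD_ values separately gives the flash vs disk IOPS totals that the script reports.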

The “ciops-all.sh” script will look at all the cells, sum up all the IOPS, and report the findings.  This is very useful for getting a quick look at the IO profile in the cells.

[oracle@exa6db01 WB]$ ./ciops-all.sh
FLASH_READ_IOPS: 6305
DISK_READ_IOPS: 213
FLASH_WRITE_IOPS: 488203
DISK_WRITE_IOPS: 6814
TOTAL_NUMBER_OF_DRIVES: 84
WRITE_PCT_to_FLASH: 98
READ_PCT_from_FLASH: 96
IOPS_PER_DISK: 83

This can be very helpful when trying to figure out if you need to go with high performance or high capacity disks.  This case shows most IO going to flash, with only 83 IOPS spilled to each disk.  So, in this case HC disks would be a fine choice.  With a simple modification, I made the “ciops-mon.sh” script to print out the throughput every few minutes to graph the results over time.

[Chart: ciops-mon.sh data from an X3-2 plotted over time]

This has been helpful as I have been investigating and explaining the inner workings of the Exadata Smart Flash Cache.  Hopefully, you will find this useful when trying to analyze and understand Exadata Cell level IO with your workload.

Solaris Eye for the Linux Guy… Part III (hugepages = ISM)

This post has been a long time coming, but recently I have started working on some SPARC SuperCluster POCs with customers and I am getting re-acquainted with my old friends Solaris and SPARC.

If you are a Linux performance guy, you have likely heard of HugePages.   Huge pages are used to increase the performance of large memory machines by requiring fewer TLB entries.  I am not going to go into the details of TLBs, but every modern chip supports multiple memory page sizes.

So how do you get huge pages with Solaris?

Do nothing – it is the DEFAULT with Oracle running on Solaris.

The “use_ism” parameter used to control this, but it has been moved to the hidden “_underbar” category these days since there is really no reason whatsoever to change it.   I remember doing tests back in the Solaris 8 days with/without ISM pages to show the performance differences, and truly it was and still is a good thing.

How are ISM/Huge pages used with Oracle running on Solaris?

Originally, ISM pages were only used for the SGA, so OLTP style environments benefited the most from ISM.   With Oracle 10g on Solaris, large pages were also allowed for the PGA.  So, if you were doing heavy PGA activity like a HASH join or sort, you would benefit from larger page sizes as well.

With Solaris, it is easy to see the page sizes of any running process by simply running the “pmap(1)” command.

root@ssc401:~# pmap -xs 23189
 23189: ora_pmon_HC1
 Address Kbytes RSS Anon Locked Pgsz Mode Mapped File
 0000000100000000 64 64 - - 8K r-x-- oracle
 0000000100010000 48 48 - - - r-x-- oracle
 000000010001C000 64 64 - - 8K r-x-- oracle
 ...
 000000010D420000 256 256 64 - 64K rwx-- oracle
 000000010D460000 64 64 - - - rwx-- oracle
 ...
 ... 
 000000010D540000 2304 2304 2304 - 64K rwx-- [ heap ]
 0000000380000000 262144 262144 - 262144 256M rwxsR [ ism shmid=0xf00007e ]
 0000000390000000 65536 65536 - 65536 4M rwxsR [ ism shmid=0xf00007e ]
 0000000400000000 31457280 31457280 - 31457280 2G rwxsR [ ism shmid=0x600007f ]
 0000000B80000000 1572864 1572864 - 1572864 256M rwxsR [ ism shmid=0x600007f ]
 0000000BE0000000 196608 196608 - 196608 4M rwxsR [ ism shmid=0x600007f ]
 0000000C00000000 24 24 - 24 8K rwxsR [ ism shmid=0x7000000 ]
 FFFFFFFF5A800000 16 16 - - 8K r-x-- libodm11.so
 FFFFFFFF5A902000 8 8 8 - 8K rwx-- libodm11.so
 FFFFFFFF60500000 64 64 - - 64K r-x-- libclsra11.so
 FFFFFFFF60510000 24 24 - - - r-x-- libclsra11.so
 ...
 FFFFFFFF7D1FC000 8 8 - - 8K r-x-- libsched.so.1
 FFFFFFFF7D1FE000 8 8 - - 8K r-x-- libdl.so.1
 FFFFFFFF7D300000 8 8 8 - 8K rw--- [ anon ]
 FFFFFFFF7D400000 8 8 8 - 8K rw--- [ anon ]
 FFFFFFFF7D500000 8 8 8 - 8K rw--- [ anon ]
 FFFFFFFF7FFE0000 128 128 128 - 64K rw--- [ stack ]

Notice that the “text”, “heap”, “anon”, “stack”, and shared memory mappings can all use different page sizes.  In this case, the SGA is backed by 2G, 256M, 4M, and 8K ISM pages.
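A quick way to see which page sizes the platform offers, and which ISM segments an instance actually got, is the following (the pgrep pattern is just an illustration):

# list every page size this Solaris/SPARC system supports
pagesize -a
# show only the ISM segments (and their page sizes) of a running pmon
pmap -xs $(pgrep -f ora_pmon) | grep ism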

So what about Dynamic ISM?  Should I use ISM or DISM?

Dynamic ISM was introduced to allow the SGA to be resized.  DISM is really just ISM that can be paged.  This can be useful when you have HUGE memory machines and want to resize the SGA without taking down the instance.  But why is this needed?

  • Single-Instance availability on HUGE machines that can dynamically add/replace MEMORY.  Machines like the E10k/E25k/M9000/M10… etc all allow you to add components on the fly without restarting Solaris.  Let’s say you have a failing memory board.  You could “shrink” the SGA so that it fits into the surviving memory while you service the faulty board.  Or, say you start with a half-populated machine.  You can add memory later and grow the SGA without stopping the instance.
  • Consolidation or Cloud like services.  In this environment you can resize running instances on the fly in order to free up memory for new instances.

Personally, I don’t see a lot of use for DISM with the SuperCluster.   If you have RAC you don’t need DISM for availability reasons, and with cloud/consolidation I think multiple instances within a single server is not the best practice going forward.   At one point you needed to use DISM for NUMA features, but that is not the case with 11.2.0.3.

Tuning is in the eye of the beholder… Memory is memory right?

It is human nature to draw from experiences to make sense of our surroundings.  This holds true in life and in performance tuning.   A veteran systems administrator will typically tune a system differently than an Oracle DBA.  This is fine, but often what is obvious to one is not to the other.  It is sometimes necessary to take a step back and tune from another perspective.

I recently ran across a few cases where a customer was tuning “Sorts” in the database by adding memory.  Regardless of your perspective, everyone knows memory is faster than disk; and the goal of any good tuner is to do as much in memory as possible.   So, when the systems administrator noticed that the “TEMP” disks for Oracle were doing a tremendous amount of IO, the answer was obvious right?

RamDisk to the rescue

To solve this problem, the savvy systems administrator added a RAM disk to the database.  Since it was only for “TEMP” space, this seemed reasonable.

ramdiskadm -a oratmp1 1024m
/dev/ramdisk/oratmp1

Indeed, user performance improved.  There are some minor issues around recovery upon system reboot or failure that are annoying, but they are easily addressed with startup scripts.  So, SLAs were met and everyone was happy.  And so things were fine for a few years.

Double the HW means double the performance… right?

Fast forward a few years.  The system was upgraded to keep up with demand by doubling the amount of memory and CPU resources.  Everything should be faster, right?  Well, not so fast.  This action increased the NUMA ratio of the machine, and after doubling memory and CPU the average user response time doubled from ~1 second to 2 seconds.  Needless to say, this was not going to fly.   Escalations were mounted and the pressure to resolve this problem reached a boiling point.  The Solaris support team was contacted by the systems administrator, and some of the best kernel engineers in the business began to dig into the problem, searching for ways to make the “ramdisk” respond faster in the face of an increased NUMA ratio.

A fresh set of eyes

Since I have worked with the Solaris support engineers on anything Oracle performance related for many years, they asked me to take a look.  I took a peek at the system and noticed the ramdisk in use for TEMP.  To me this seemed odd, but I continued to look at SQL performance.   Things became clear once I saw that “sort_area_size” was left at its default.

It turns out Oracle was attempting to do in-memory sorts, but with the default settings all users were spilling out to temp.  With hundreds of users on the system, this became a problem real fast.  I had the customer increase the sort_area_size until the sorts occurred in memory, without the added overhead of spilling out to disk (albeit fast disk).  With this slight adjustment, the average user response time was better than it had ever been.
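A quick way to confirm this kind of spill pattern (not from the original engagement, just the standard counters) is to compare in-memory and on-disk sorts:

sqlplus -s / as sysdba <<'EOF'
-- if "sorts (disk)" keeps climbing, work areas are spilling to TEMP
select name, value
from   v$sysstat
where  name in ('sorts (memory)', 'sorts (disk)');
EOF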

lessons learned

  • Memory is memory, but how you use it makes all the difference.
  • It never hurts to broaden your perspective and get a second opinion

Linux takes a page from Solaris… pmap available on Linux.

Recently, there was a thread on an internal alias of old Sun guys.  The problem at hand was to track down a process that was consuming memory on Linux.  This is the type of problem that can be solved many ways (ps, top, etc…), but to my amazement someone mentioned that pmap could be used on Linux…. I guess I didn’t get the memo 🙂

About 6 months back I wrote a few entries that discussed Solaris tools for the Linux guy in the “Solaris Eye for the Linux Guy” series.

These posts explored how to look at Solaris from a Linux point of view – pmap(1) should have been added to the mix.  The “pmap” command is an old Solaris command that shows the address-space mappings of a process.  I have used it for years to explore things such as how much “SGA” is allocated and what the actual memory page sizes are.  You can see the Heap, Stack, Shared, and Text mappings.  This can be extremely useful when investigating memory leaks and other memory related issues.  I am very happy to see that this has made it into the major Linux distributions.
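On Linux, the procps version gives the familiar extended output; a minimal example (the pgrep pattern is just an illustration) looks like this:

# per-mapping RSS for an Oracle background process on Linux
pmap -x $(pgrep -f ora_pmon) | head -20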

Exadata drives exceed the laws of physics… ASM with intelligent placement improves IOPS

I recently had an interesting time with a customer who is all too familiar with SANs.  SAN vendors typically use sizing numbers of 180 IOPS per drive.  This is a good conservative measure for SAN sizing, but the drives are capable of much more, and indeed we state higher numbers with Exadata.  So, how could this be possible?  Does Exadata have an enchantment spell that makes the drives magically spin faster?  Maybe a space-time warp to service IO?

The Exadata X2-2 data sheet states “up to 50,000 IOPS” for a full rack of high performance 600GB 15K rpm drives.  With 168 drives in a full rack (14 storage cells with 12 drives each), this works out to roughly 300 IOs per second per drive.  At first glance, 300 IOPS for a drive that spins at 250 revolutions per second seems strange.  But really, it only means that you have to, on average, service more than one IO per revolution.  So, how do you service more than one IO per revolution?

Drive command queuing and short stroking

Modern drives have the ability to queue up more than one IO at a time.  If the queues are deep enough and the seek distance is short enough, it is more than possible to exceed one IO per revolution.  As you increase the queue depth, the probability of having an IO in the queue that can be serviced before a full revolution increases.  Lots of literature exists on this topic and indeed many have tested this phenomenon.   The popular site “Tom’s Hardware” has tested a number of drives and shows that with a command queue depth of four, both the Hitachi and Seagate 15K rpm drives reach 300 IOPS per drive.

This effect of servicing more than one IO per revolution is enhanced when the seek distances are short.  There is an old benchmark trick to use only the outer portion of the drive to shrink the seek distance.  This technique combined with command queuing increases the probability of servicing more than one IO per revolution.

But how can this old trick work with real world environments?

ASM intelligent data placement to the rescue

ASM has a feature, “Intelligent Data Placement” (IDP), that optimizes the placement of data such that the most active data resides on the outer portions of the drive.  The drive is essentially split into “Hot” and “Cold” regions.  This care in placement helps to reduce the seek distance and achieve higher IOPS per drive.   This is the realization of an old benchmark trick, using a real feature in ASM.
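For reference, IDP is driven by ASM file and template attributes; a hedged sketch of marking a busy file HOT (the file name is illustrative, run against the ASM instance) looks like this:

sqlplus -s / as sysasm <<'EOF'
-- keep this datafile on the outer (hot) region of the disks
alter diskgroup data modify file '+DATA/orcl/datafile/users.259.679156903'
  attribute (hot mirrorhot);
EOF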

the proof is in the pudding… “calibrate” command shows drive capabilities

The “calibrate” command, which is part of the Exadata storage “cellcli” interface, is used to test the capabilities of the underlying components of Exadata storage.  The throughput and IOPS of both the drives and the Flash modules can be tested at any point to see if they are performing up to expectations.  The calibrate command uses the popular Orion IO test utility, designed to mimic Oracle IO patterns.   The utility randomly seeks over the first half of the drive in order to show the capabilities of the drives.  I have included example output from an X2-2 machine below.

CellCLI> calibrate
Calibration will take a few minutes...
Aggregate random read throughput across all hard disk luns: 1809 MBPS
Aggregate random read throughput across all flash disk luns: 4264.59 MBPS
Aggregate random read IOs per second (IOPS) across all hard disk luns: 4923
Aggregate random read IOs per second (IOPS) across all flash disk luns: 131197
Calibrating hard disks (read only) ...
Lun 0_0  on drive [20:0     ] random read throughput: 155.60 MBPS, and 422 IOPS
Lun 0_1  on drive [20:1     ] random read throughput: 155.95 MBPS, and 419 IOPS
Lun 0_10 on drive [20:10    ] random read throughput: 155.58 MBPS, and 428 IOPS
Lun 0_11 on drive [20:11    ] random read throughput: 155.13 MBPS, and 428 IOPS
Lun 0_2  on drive [20:2     ] random read throughput: 157.29 MBPS, and 415 IOPS
Lun 0_3  on drive [20:3     ] random read throughput: 156.58 MBPS, and 415 IOPS
Lun 0_4  on drive [20:4     ] random read throughput: 155.12 MBPS, and 421 IOPS
Lun 0_5  on drive [20:5     ] random read throughput: 154.95 MBPS, and 425 IOPS
Lun 0_6  on drive [20:6     ] random read throughput: 153.31 MBPS, and 419 IOPS
Lun 0_7  on drive [20:7     ] random read throughput: 154.34 MBPS, and 415 IOPS
Lun 0_8  on drive [20:8     ] random read throughput: 155.32 MBPS, and 425 IOPS
Lun 0_9  on drive [20:9     ] random read throughput: 156.75 MBPS, and 423 IOPS
Calibrating flash disks (read only, note that writes will be significantly slower) ...
Lun 1_0 on drive [FLASH_1_0] random read throughput: 273.25 MBPS, and 19900 IOPS
Lun 1_1 on drive [FLASH_1_1] random read throughput: 272.43 MBPS, and 19866 IOPS
Lun 1_2 on drive [FLASH_1_2] random read throughput: 272.38 MBPS, and 19868 IOPS
Lun 1_3 on drive [FLASH_1_3] random read throughput: 273.16 MBPS, and 19838 IOPS
Lun 2_0 on drive [FLASH_2_0] random read throughput: 273.22 MBPS, and 20129 IOPS
Lun 2_1 on drive [FLASH_2_1] random read throughput: 273.32 MBPS, and 20087 IOPS
Lun 2_2 on drive [FLASH_2_2] random read throughput: 273.92 MBPS, and 20059 IOPS
Lun 2_3 on drive [FLASH_2_3] random read throughput: 273.71 MBPS, and 20049 IOPS
Lun 4_0 on drive [FLASH_4_0] random read throughput: 273.91 MBPS, and 19799 IOPS
Lun 4_1 on drive [FLASH_4_1] random read throughput: 273.73 MBPS, and 19818 IOPS
Lun 4_2 on drive [FLASH_4_2] random read throughput: 273.06 MBPS, and 19836 IOPS
Lun 4_3 on drive [FLASH_4_3] random read throughput: 273.02 MBPS, and 19770 IOPS
Lun 5_0 on drive [FLASH_5_0] random read throughput: 273.80 MBPS, and 19923 IOPS
Lun 5_1 on drive [FLASH_5_1] random read throughput: 273.26 MBPS, and 19926 IOPS
Lun 5_2 on drive [FLASH_5_2] random read throughput: 272.97 MBPS, and 19893 IOPS
Lun 5_3  on drive [FLASH_5_3] random read throughput: 273.65 MBPS, and 19872 IOPS
CALIBRATE results are within an acceptable range.

As you can see, the drives can actually be driven even higher than the stated 300 IOPS per drive; this calibrate run shows over 400 IOPS on every hard disk.

So, why can’t SANs achieve this high number?

A SAN that is dedicated to one server with one purpose should be able to take advantage of command queuing.  But SANs are typically not configured in this manner.  A SAN is a shared, general purpose disk infrastructure used by many departments and applications, from Database to Email.   When sharing resources on a SAN, great care is taken to ensure that the number of outstanding IO requests does not get too high and cause the fabric to reset.  On Solaris, SAN vendors require setting the “sd_max_throttle” parameter, which limits the amount of IO presented to the SAN.  This is typically set very conservatively so as to protect the shared SAN resource by queuing the IO on the OS.
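For reference, that throttle typically ends up as an /etc/system entry along these lines (the value is illustrative and vendor specific, and a reboot is required for it to take effect):

set sd:sd_max_throttle=32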

long story short…

A 180 IOPS/drive rule of thumb for SANs might be reasonable, but the “drive” is definitely capable of more.

Exadata has dedicated drives, is not artificially throttled, and can take full advantage of the drives’ capabilities.