Archive Page 2

Tuning is in the eye of the beholder… Memory is memory, right?

It is human nature to draw from experience to make sense of our surroundings.  This holds true in life and in performance tuning.  A veteran systems administrator will typically tune a system differently than an Oracle DBA.  This is fine, but often what is obvious to one is not to the other.  It is sometimes necessary to take a step back and tune from another perspective.

I have recently run across a few cases where a customer was tuning “Sorts” in the database by adding memory. Regardless of your perspective, everyone knows memory is faster than disk, and the goal of any good tuner is to keep as much work in memory as possible.  So, when the systems administrator noticed that the “TEMP” disks for Oracle were doing a tremendous amount of IO, the answer was obvious, right?

RamDisk to the rescue

To solve this problem, the savvy systems administrator added a RAM disk to the database.  Since it was only for “TEMP” space, this seemed reasonable.

ramdiskadm -a oratmp1 1024m
/dev/ramdisk/oratmp1
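
From there, the rest of such a setup would look roughly like the sketch below (the filesystem, mount point, and tempfile size are illustrative, not taken from the actual system):

newfs /dev/ramdisk/oratmp1
mkdir /oratmp1
mount /dev/ramdisk/oratmp1 /oratmp1

SQL> create temporary tablespace temp_ram
       tempfile '/oratmp1/temp_ram01.dbf' size 900m;
SQL> alter database default temporary tablespace temp_ram;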

Indeed, user performance improved.  There are some minor issues around recovery upon system reboot or failure that are annoying, but they are easily addressed with startup scripts.  So, SLAs were met and everyone was happy.  And so things were fine for a few years.

Double the HW means double the performance… right?

Fast forward a few years.  The system was upgraded to keep up with demand by doubling the amount of memory and CPU resources.  Everything should be faster, right?  Well, not so fast.  This action increased the NUMA ratio of the machine.  After doubling memory and CPU, the average user response time doubled from ~1 second to 2 seconds.  Needless to say, this was not going to fly.  Escalations were mounted and the pressure to resolve this problem reached a boiling point.  The systems administrator contacted the Solaris support team, and some of the best kernel engineers in the business began to dig into the problem, searching for ways to make the “ramdisk” respond faster in the face of an increased NUMA ratio.

A fresh set of eyes

Since I have worked with the Solaris support engineers on anything Oracle performance related for many years, they asked me to take a look.  I took a peek at the system and noticed the ramdisk in use for TEMP.  To me this seemed odd, but I continued to look at SQL performance.  Things became clear once I saw that the “sort_area_size” was left at its default.

It turns out Oracle was attempting to do in-memory sorts, but with the default settings all users were spilling out to temp.  With hundreds of users on the system, this became a problem very quickly.  I had the customer increase the sort_area_size until the sorts occurred in memory, without the added overhead of spilling out to disk (albeit fast disk).  With this slight adjustment, the average user response time was better than it had ever been.
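
A rough sketch of the check and the adjustment involved (the 10MB value is purely illustrative, and sort_area_size only governs sorts when workarea_size_policy is MANUAL, so treat this as a sketch rather than a recipe):

SQL> select name, value from v$sysstat
       where name in ('sorts (memory)', 'sorts (disk)');
SQL> -- DEFERRED applies the new value to sessions created from this point on
SQL> alter system set sort_area_size = 10485760 deferred;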

lessons learned

  • Memory is memory, but how you use it makes all the difference.
  • It never hurts to broaden your perspective and get a second opinion.

Linux takes a page from Solaris… pmap available on Linux.

Recently, there was a thread on an internal alias of old Sun guys.  The problem at hand was to track down a process that was consuming memory on Linux.  This is the type of problem that can be solved many ways (ps, top, etc.), but to my amazement someone mentioned that pmap could be used on Linux… I guess I didn’t get the memo 🙂

About 6 months back I wrote a couple of entries, “Solaris Eye for the Linux Guy” Parts 1 and 2, that discussed Solaris tools for the Linux guy.

These posts explored how to look at Solaris from a Linux point of view – pmap(1M) should have been added to the mix.  The “pmap(1M)” command is an old Solaris command that shows how a process’s virtual address space is mapped.  I have used it for years to explore things such as how much “SGA” is allocated and what the actual memory page sizes are.  You can see the Heap, Stack, Shared, and Txt mappings.  This can be extremely useful when investigating memory leaks and other memory-related issues.  I am very happy to see that this has made it into the major Linux distributions.
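
On Linux it behaves much as a Solaris admin would expect; something like the following (the PID is illustrative) dumps the mappings along with size and RSS, plus a summary line:

pmap -x 12626            # extended format: address, size (KB), RSS, dirty pages, mapping
pmap -x 12626 | tail -1  # the "total" line gives a quick size/RSS summary for the process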

Exadata drives exceed the laws of physics… ASM with intelligent placement improves IOPS

I recently had an interesting time with a customer who is all too familiar with SANs.  SAN vendors typically use a sizing figure of 180 IOPS per drive.  This is a good conservative measure for SAN sizing, but the drives are capable of much more, and indeed we quote higher numbers with Exadata.  So, how could this be possible?  Does Exadata have an enchantment spell that makes the drives magically spin faster?  Maybe a space-time warp to service IO?

The Exadata X2-2 data sheet states “up to 50,000 IOPS” for a full rack of high-performance 600GB 15K rpm drives.  Spread across the 168 drives in a full rack, this works out to roughly 300 IOs per second per drive.  At first glance, 300 IOPS for a drive that spins at 250 revolutions per second seems strange.  But really, it only means that, on average, you have to service more than one IO per revolution.  So, how do you service more than one IO per revolution?

Drive command queuing and short stroking

Modern drives have the ability to queue more than one IO at a time.  If the queues are deep enough and the seek distance is short enough, it is more than possible to exceed one IO per revolution.  As you increase the queue depth, the probability increases of having an IO in the queue that can be serviced before a full revolution completes.  Lots of literature exists on this topic, and indeed many have tested this phenomenon.  The popular site “Tom’s Hardware” has tested a number of drives and shows that, with a command queue depth of four, both the Hitachi and Seagate 15K rpm drives reach 300 IOPS per drive.

This effect of servicing more than one IO per revolution is enhanced when the seek distances are short.  There is an old benchmark trick of using only the outer portion of the drive to shrink the seek distance.  This technique, combined with command queuing, increases the probability of servicing more than one IO per revolution.

But how can this old trick work in real-world environments?

ASM intelligent data placement to the rescue

ASM has a feature, “Intelligent Data Placement” (IDP), that optimizes the placement of data so that the most active data resides on the outer portions of the drive.  The drive is essentially split into “Hot” and “Cold” regions.  This care in placement helps to reduce the seek distance and achieve a higher IOPS/drive.  It is the realization of an old benchmark trick, using a real feature in ASM.
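
A minimal sketch of how IDP is used (the diskgroup, template, and file names are illustrative, and IDP requires the diskgroup COMPATIBLE attributes to be at 11.2 or higher):

SQL> -- new files created with the DATAFILE template go to the hot region
SQL> alter diskgroup data modify template datafile attributes (hot mirrorhot);
SQL> -- move an existing file; the change takes effect as the data is rebalanced
SQL> alter diskgroup data modify file '+DATA/mydb/datafile/users.259.723575611'
       attribute (hot mirrorhot);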

the proof is in the pudding… “calibrate” command shows drive capabilities

The “calibrate” command, which is part of the Exadata storage “cellcli” interface, is used to test the capabilities of the underlying components of Exadata storage.  The throughput and IOPS of both the drives and the Flash modules can be tested at any point to see if they are performing up to expectations.  The calibrate command uses the popular Orion IO test utility, which is designed to mimic Oracle IO patterns.  It seeks randomly over the first half of the drive in order to show the capabilities of the drives.  I have included example output from an X2-2 machine below.

CellCLI> calibrate
Calibration will take a few minutes...
Aggregate random read throughput across all hard disk luns: 1809 MBPS
Aggregate random read throughput across all flash disk luns: 4264.59 MBPS
Aggregate random read IOs per second (IOPS) across all hard disk luns: 4923
Aggregate random read IOs per second (IOPS) across all flash disk luns: 131197
Calibrating hard disks (read only) ...
Lun 0_0  on drive [20:0     ] random read throughput: 155.60 MBPS, and 422 IOPS
Lun 0_1  on drive [20:1     ] random read throughput: 155.95 MBPS, and 419 IOPS
Lun 0_10 on drive [20:10    ] random read throughput: 155.58 MBPS, and 428 IOPS
Lun 0_11 on drive [20:11    ] random read throughput: 155.13 MBPS, and 428 IOPS
Lun 0_2  on drive [20:2     ] random read throughput: 157.29 MBPS, and 415 IOPS
Lun 0_3  on drive [20:3     ] random read throughput: 156.58 MBPS, and 415 IOPS
Lun 0_4  on drive [20:4     ] random read throughput: 155.12 MBPS, and 421 IOPS
Lun 0_5  on drive [20:5     ] random read throughput: 154.95 MBPS, and 425 IOPS
Lun 0_6  on drive [20:6     ] random read throughput: 153.31 MBPS, and 419 IOPS
Lun 0_7  on drive [20:7     ] random read throughput: 154.34 MBPS, and 415 IOPS
Lun 0_8  on drive [20:8     ] random read throughput: 155.32 MBPS, and 425 IOPS
Lun 0_9  on drive [20:9     ] random read throughput: 156.75 MBPS, and 423 IOPS
Calibrating flash disks (read only, note that writes will be significantly slower) ...
Lun 1_0 on drive [FLASH_1_0] random read throughput: 273.25 MBPS, and 19900 IOPS
Lun 1_1 on drive [FLASH_1_1] random read throughput: 272.43 MBPS, and 19866 IOPS
Lun 1_2 on drive [FLASH_1_2] random read throughput: 272.38 MBPS, and 19868 IOPS
Lun 1_3 on drive [FLASH_1_3] random read throughput: 273.16 MBPS, and 19838 IOPS
Lun 2_0 on drive [FLASH_2_0] random read throughput: 273.22 MBPS, and 20129 IOPS
Lun 2_1 on drive [FLASH_2_1] random read throughput: 273.32 MBPS, and 20087 IOPS
Lun 2_2 on drive [FLASH_2_2] random read throughput: 273.92 MBPS, and 20059 IOPS
Lun 2_3 on drive [FLASH_2_3] random read throughput: 273.71 MBPS, and 20049 IOPS
Lun 4_0 on drive [FLASH_4_0] random read throughput: 273.91 MBPS, and 19799 IOPS
Lun 4_1 on drive [FLASH_4_1] random read throughput: 273.73 MBPS, and 19818 IOPS
Lun 4_2 on drive [FLASH_4_2] random read throughput: 273.06 MBPS, and 19836 IOPS
Lun 4_3 on drive [FLASH_4_3] random read throughput: 273.02 MBPS, and 19770 IOPS
Lun 5_0 on drive [FLASH_5_0] random read throughput: 273.80 MBPS, and 19923 IOPS
Lun 5_1 on drive [FLASH_5_1] random read throughput: 273.26 MBPS, and 19926 IOPS
Lun 5_2 on drive [FLASH_5_2] random read throughput: 272.97 MBPS, and 19893 IOPS
Lun 5_3  on drive [FLASH_5_3] random read throughput: 273.65 MBPS, and 19872 IOPS
CALIBRATE results are within an acceptable range.

As you can see,  the drives can actually be driven even higher than the stated 300 IOPS per drive.

So, why can’t SANs achieve this high number?

A SAN that is dedicated to one server with one purpose should be able to take advantage of command queuing.  But SANs are not typically configured in this manner.  A SAN is a shared, general-purpose disk infrastructure used by many departments and applications, from database to email.  When sharing resources on a SAN, great care is taken to ensure that the number of outstanding IO requests does not get too high and cause the fabric to reset.  On Solaris, SAN vendors require setting the “sd_max_throttle” parameter, which limits the number of IOs presented to the SAN.  This is typically set very conservatively so as to protect the shared SAN resource by queuing the IO on the OS.
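
On Solaris this throttle typically ends up in /etc/system; the value below is purely illustrative, since each array vendor publishes its own recommendation:

* /etc/system: cap the number of outstanding IOs queued to each sd LUN
set sd:sd_max_throttle=32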

long story short…

A 180 IOPS/drive rule of thumb for SANs might be reasonable, but the “drive” is definitely capable of more.

Exadata has dedicated drives, is not artificially throttled, and can take full advantage of the drives’ capabilities.

Solaris Eye for the Linux Guy… Part II (oprofile, Dtrace, and Oracle Event Trace)

Proper tool for the job

My grandfather used to say to me: “Use the proper tool for the job.”  This is important to keep in mind when faced with performance issues.  When I am faced with performance problems in Oracle, I typically start with AWR reports or Enterprise Manager to get a high-level understanding of the workload.  To drill down further, the next step is to use Oracle “10046 event” tracing.  Cary Millsap created a methodology around event tracing, called “Method R”, which shows how to focus in on the source of a performance problem by analyzing the components that contribute to response time.  These are all fine places to start when analyzing performance problems from the “user” or “application” point of view.  But what happens if the OS is in peril?
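
For reference, the “10046 event” trace mentioned above is typically toggled at the session level as shown below; level 12 captures wait events and bind values, while level 8 captures waits only.

SQL> alter session set events '10046 trace name context forever, level 12';
SQL> -- run the statements of interest, then turn tracing off
SQL> alter session set events '10046 trace name context off';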

If you are experiencing high system time or strange application behavior, it is likely time to drill deeper with OS-based tools.  I mentioned in my last post that “prstat” is the “top” equivalent for Solaris.  “prstat” is the best place to start to see how processes are running on Solaris, but at some point you may need to drill down deeper to gain a better understanding of the problem.

On Linux, “oprofile” allows you to sample kernel and user code to build a profile of how the system and applications are behaving.  This is an incredibly useful tool, but it doesn’t exist on Solaris.  Luckily, there is something that is arguably better – Dtrace.

Solaris Dtrace(1m) / Linux “oprofile”

Dtrace was developed by kernel engineers for the release of Solaris 10 as a way to better debug and monitor Solaris.  Unlike “oprofile”, Dtrace is really an environment that involves writing code in “D” to make use of the numerous probes that exist.  Dtrace is really powerful, but it does require some heavy lifting to get started.  This is where the “Dtrace Toolkit” comes in handy.

The “Dtrace Toolkit” is a set of scripts that serve as a starting point for those interested in using Dtrace.  Also included in the “Dtrace Toolkit” are some really clever utilities.  My two favorites are the “hotkernel” and “hotuser” scripts.  These scripts analyze either the kernel or a user “PID” to show which routines are used most.  This can be extremely useful when diagnosing performance problems that extend beyond the V$ tables or Oracle “10046 event trace” data.

To illustrate how these utilities might be used, I have included output from a benchmark run below.

HOTKERNEL

root@apl5-1> ./hotkernel
Sampling... Hit Ctrl-C to end.
^C
FUNCTION                                                COUNT   PCNT
nxge`nxge_check_10g_link                                    1   0.0%
genunix`ioctl                                               1   0.0%
...
...
genunix`fop_read                                         5730   2.1%
genunix`kstrgetmsg                                       6091   2.2%
unix`utl0                                                7058   2.6%
FJSV,SPARC64-VII`cpu_halt_cpu                            7220   2.6%
FJSV,SPARC64-VII`copyout                                 9340   3.4%
ip`tcp_fuse_output                                      12637   4.6%
unix`_resume_from_idle                                  12922   4.7%
unix`disp_getwork                                       18864   6.8%
unix`mutex_enter                                        34033  12.3%

HOTUSER

root@apl5-1> ./hotuser -p 12626
Sampling... Hit Ctrl-C to end.
^C
FUNCTION                                                COUNT   PCNT
oracle`kxsInitExecutionHeap                                 1   0.0%
oracle`0x10b319ad0                                          1   0.0%
oracle`kews_pls_jvm_event_resume_i                          1   0.0%
oracle`0x10b319ac8                                          1   0.0%
oracle`kghfrh                                               1   0.0%
oracle`opiptc                                               1   0.0%
...
...
oracle`qertbFetchByRowID                                   91   1.0%
oracle`kghalf                                              94   1.1%
libc_psr.so.1`memcpy                                      102   1.2%
oracle`opikndf2                                           105   1.2%
oracle`kpofcr                                             113   1.3%
oracle`opiodr                                             120   1.4%
oracle`kslwtectx                                          120   1.4%
oracle`kslwt_update_stats_int                             126   1.4%
oracle`opitsk                                             126   1.4%
oracle`ksupucg                                            151   1.7%
oracle`nsbasic_brc                                        153   1.7%
oracle`kdxbrs1                                            187   2.1%
oracle`kdxlrs2                                            192   2.2%
oracle`kews_sqlcol_end                                    194   2.2%
oracle`opifch2                                            212   2.4%
oracle`opiexe                                             250   2.8%
oracle`skgslcas                                           265   3.0%
libc_psr.so.1`memset                                      416   4.7%
oracle`kcbgtcr                                            826   9.4%

You can begin to see how Dtrace can be used to see the effect of the workload on Solaris and to profile the user application – in this case an Oracle shadow process.  But this is just the beginning.  If you are so inclined, Dtrace can be used to correlate all sorts of performance data, both inside the kernel and in the application.

Solaris Eye for the Linux guy… or how I learned to stop worrying about Linux and Love Solaris (Part 1)

This entry goes out to my Oracle techie friends who have been in the Linux camp for some time now and are suddenly finding themselves needing to know more about Solaris… hmmmm… I wonder if this has anything to do with Solaris now being an available option with Exadata?  Or maybe the recent announcement that the SPARC T3 multiplier for T3-x servers is now 0.25.  Judging by my inbox recently, I suspect the renewed interest in Solaris will continue.

I have focused on Oracle database performance on Solaris for 14 years now.  In the last few years, I began to work on Exadata and found myself needing to learn the “Linux” way of performance analysis.  I will cover some basic tools needed for Oracle performance analysis with Solaris as well as some special performance topics.  I am not a deep-dive kernel type of guy, so don’t expect esoteric Dtrace scripts to impress your friends.  I am also not going to cover how to patch and maintain Solaris – that is out of scope.  With this in mind, let’s get started.

prstat(1M) … top on steroids!

Probably the first tool that the typical Linux performance guy reaches for is top.  It is part of every Linux distribution that I know of but is sadly missing from Solaris… But Solaris has something much better: “prstat(1M)”.  I know the name is boring, but it simply means “process status”.  This is the first place to get an idea of how processes are performing on a system, and quite likely the most useful tool in general.

# prstat
PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP      
 5623 oracle     32G   32G cpu63    0    0   0:02:23 1.0% oracle/1
 5625 oracle     32G   32G cpu60    0    0   0:02:22 1.0% oracle/1
 5627 oracle     32G   32G sleep    0    0   0:02:18 1.0% oracle/1
 5629 oracle     32G   32G cpu38    0    0   0:02:16 1.0% oracle/1
 5609 oracle     43M   38M sleep    0    4   0:01:21 0.6% rwdoit/1
 5605 oracle     43M   38M sleep    0    4   0:01:18 0.6% rwdoit/1
 5607 oracle     43M   38M sleep    0    4   0:01:18 0.6% rwdoit/1
 5601 oracle     43M   38M sleep    0    4   0:01:17 0.6% rwdoit/1
...
...
Total: 106 processes, 447 lwps, load averages: 5.03, 7.11, 19.48

top would show you…

# top
load averages:  5.06,  6.84, 18.81                                                                             14:50:13
109 processes: 100 sleeping, 1 stopped, 8 on cpu
CPU states: 92.7% idle,  4.0% user,  3.3% kernel,  0.0% iowait,  0.0% swap
Memory: 128G real, 60G free, 50G swap in use, 135G swap free

 PID USERNAME THR PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
 5623 oracle     1   0    0    0K    0K cpu57   3:08 66.41% oracle
 5625 oracle     1   0    0    0K    0K cpu33   3:06 66.02% oracle
 5627 oracle     1   0    0    0K    0K cpu14   3:02 64.45% oracle
 5629 oracle     1   0    0    0K    0K cpu41   2:59 63.28% oracle
 5609 oracle     1   0    4    0K    0K cpu31   1:46 37.70% rwdoit
 5605 oracle     1   0    4    0K    0K cpu55   1:43 36.33% rwdoit
 5607 oracle     1   0    4    0K    0K sleep   1:43 36.33% rwdoit
 5601 oracle     1   0    4    0K    0K cpu32   1:42 35.94% rwdoit

What is happening?  “top” reports CPU from the point of view of a single processor, whereas by default prstat(1M) reports each process’s share of the whole system.  To make prstat(1M) look more like top, you have to enable micro-state accounting with the “-m” option.  With the “-m” option, prstat(1M) shows CPU utilization of the various processes like top, but with a LOT more detail.  You have access to CPU time broken out by user and system.  You can find out the percentage of time spent in traps, in sleep, and waiting for CPU (“LAT”).  Finally, you can see the number of voluntary and involuntary context switches along with the number of threads per process.

# prstat -m
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP 
 5623 oracle    48  18 0.0 0.0 0.0 0.0  29 4.3 38K   8 .3M   0 oracle/1
 5625 oracle    48  18 0.0 0.0 0.0 0.0  30 4.3 38K   9 .3M   0 oracle/1
 5627 oracle    43  21 0.0 0.0 0.0 0.0  30 5.5 32K   7 .3M   0 oracle/1
 5629 oracle    42  21 0.0 0.0 0.0 0.0  31 5.6 33K   6 .3M   0 oracle/1
 5609 oracle    17  21 0.0 0.0 0.0 0.0  57 5.7 33K   5 77K   0 rwdoit/1
 5605 oracle    20  17 0.0 0.0 0.0 0.0  59 4.5 38K   5 89K   0 rwdoit/1
 5607 oracle    17  19 0.0 0.0 0.0 0.0  58 5.4 32K   4 75K   0 rwdoit/1
 5601 oracle    20  16 0.0 0.0 0.0 0.0  59 4.5 38K   5 90K   0 rwdoit/1

There are a lot of options available to prstat(1M) so do take a look at the prstat(1M) man page. Also look at the “scalingbits” blog for an excellent discussion on prstat(1M) monitoring.  Stephan goes into much more detail about the utility and how to monitor by “zones” or “projects”… very useful stuff.

lsof = pfiles(1M)… the proc(1) commands

pfiles(1M) mirrors what “lsof” does, but there is much more information available on a per-process basis.  The proc(1) man page shows the available process-related commands: “pstack(1M)”, “pldd(1M)”, and “pflags(1M)” to name a few.  These utilities, referred to as process introspection commands, are detailed nicely on the solarisinternals.com website.
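
A quick sketch of these against an Oracle shadow process (the PID is illustrative):

pfiles 12626    # open files and sockets with offsets -- the lsof analog
pstack 12626    # user-level stack trace of each thread
pldd 12626      # shared libraries mapped into the process
pflags 12626    # tracing flags, pending signals, and current syscall per LWP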

vmstat(1M), mpstat(1M), iostat(1M), sar(1M)… basically the same!

It is good to know that some things are not all that different.  The above tools have minor differences, but generally look the same.  Some options are different or expanded.  Take, for example, iostat.

iostat(1M) on Solaris has a “-z” option that omits devices that are not doing any IO and would otherwise report only zeros.  The biggest issue with most of these tools comes into play when options are missing or the formatting is different.  This breaks scripts that have been developed to aid analysis.  It is not too hard to fix… just something that is going to have to be worked out.
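
For example, a common invocation on Solaris (the 5-second interval is arbitrary):

iostat -xnz 5    # extended device stats, descriptive device names, idle devices suppressed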

References

The best references for Solaris performance and analysis are the “Solaris Internals” and “Solaris Performance and Tools” books.  These books describe the architecture of Solaris and show how to analyze and diagnose performance issues… and you can get them on the Kindle 🙂  The books also have an accompanying website, “solarisinternals.com”, to continue the discussion.

That is all for now…

New OakTable member… It’s official!


After many long nights, technical discussions, and joint work with OakTable members, they decided to nominate me.  I just received notice and it is official!!  I am humbled and honored to be recognized along with this top-notch group of Oracle rock stars.

Oracle Open World presentation uploaded… Optimizing Oracle Databases on SPARC Enterprise M-Series Servers.

As promised, I have attached the slide deck we used for our presentation at Oracle Open World.  A big thanks to Allan for asking me to help… glad to do it!

ID#: S315915
Title: Optimizing Oracle Databases on Sun SPARC Enterprise M-Series Servers
Track: Sun SPARC Servers
Date: 20-SEP-10
Time: 12:30 – 13:30
Venue: Moscone South
Room: Rm 270
Slides: oow2010_db_on_mseries.pdf

There are a bunch of OTN white papers that were produced to show how best to run the Oracle stack on Sun servers.  I will try to post an index soon, but feel free to peruse the OTN site; there is lots of new content.

Oracle Open World 2010… Optimizing Oracle Databases on Sun SPARC Enterprise M-Series Servers

This year at OOW I will be co-presenting on Oracle performance on Sun servers. Stop by and say hi if you get a chance.

ID#: S315915
Title: Optimizing Oracle Databases on Sun SPARC Enterprise M-Series Servers
Track: Sun SPARC Servers
Date: 20-SEP-10
Time: 12:30 – 13:30
Venue: Moscone South
Room: Rm 270

Oracle 11g R1/R2 Real Application Clusters Handbook… Review in process

A few years back I had the pleasure of meeting Ben Prusinski at Oracle Open World.  Ben has published several books on Oracle internals and debugging.  In his most recent project he teams with Guenadi Jilevski and Syed Jaffer Hussain to write “Oracle 11g R1/R2 Real Application Clusters Handbook”.  I have only just cracked the cover, but I already know this will be a good resource for the beginner and the seasoned professional.  If you want to look for yourself, you can preview or order the book through Packt Publishing.

Open Storage S7000 with Exadata… a good fit for ETL/ELT operations.

I have worked on Exadata V2 performance projects with Kevin Closson for nearly a year now and have had the opportunity to evaluate several methods of loading data into a data warehouse. The most common, and by far the fastest method, involves the use of “External Tables”. External tables allow the user to define a table object made up of text files that live on a file system.   Using External Tables allows for standard SQL parallel query operations to be used to load data into permanent database tables.

SQL> alter session enable parallel dml ;
SQL> insert /*+ APPEND */ into mytable select * from ext_tab ;
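
For reference, a minimal external table over such flat files might look like the sketch below (the directory, file names, and columns are all illustrative):

SQL> create directory stage_dir as '/stage/files';
SQL> create table ext_tab (
       id       number,
       payload  varchar2(4000)
     )
     organization external (
       type oracle_loader
       default directory stage_dir
       access parameters (
         records delimited by newline
         fields terminated by ','
       )
       location ('file01.dat', 'file02.dat')
     )
     parallel;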

With the size and power of Exadata, businesses are creating larger and larger data warehouses.  There will often be dozens of machines that collect and stage data for ingest by the data warehouse.  This means the staging area for these flat files must be huge, really fast, and accessible from multiple networks.

What options are available for staging input files?

With Exadata V2, or any RAC environment, flat-file data has to be present on all nodes in order to fully utilize parallel query.  The natural first choice with Exadata V2 is to use DBFS.

DBFS comes with Exadata and allows for easy clustering across all Exadata database nodes.  The real data store for DBFS is a set of database tables residing in a tablespace within the database machine.  The DBFS client program is then used to mount the DBFS filesystem so that it appears to the Linux user as just another file system.  This allows file system data to be managed just like any other database data while using the full power of Exadata.  DBFS is quite fast and works well for housing external tables, but it does cut into the storage available for the data warehouse.  Also, since DBFS is simply a client on an Exadata database node, it uses CPU resources on the database machine to initially transfer or create the flat files.
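
Mounting DBFS on a database node looks roughly like this (the wallet-less form shown reads the DBFS user’s password from a file; the user, connect string, and mount point are illustrative):

nohup dbfs_client dbfs_user@mydb -o allow_other,direct_io /data/dbfs < passwd.txt &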

Open Storage S7000 a natural staging area for Exadata

If you want to extend the amount of storage available to stage data for your warehouse, the S7000 is an excellent choice.  The S7000 can stage files off traditional networks using 1GigE and 10GigE connections.  This allows multiple machines to seamlessly connect to the S7000 in order to stage data for ingest.  This activity has no effect on the Exadata users, since the S7000 is a self-contained storage server – unlike DBFS, which uses CPU cycles from the database grid to manage and store the flat-file data.

Once the data is on the S7000, we can use IPoIB and connect directly into the high-bandwidth InfiniBand network that is part of Exadata V2.  This allows the S7000 to be positioned neatly between Exadata and the traditional GigE networks.
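
On the database nodes the share then becomes an ordinary NFS mount over the IPoIB interface (the host, export, and options below are illustrative; large rsize/wsize values help sequential load throughput):

mount -t nfs -o rw,hard,tcp,rsize=32768,wsize=32768 s7410-ib:/export/stage /stage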

what about performance?

As part of a larger project, I was able to run a quick test.  I had the following:

  • S7410 with 12 drives
  • 128 x 1GB files on a share
  • 8 db nodes active (x4170) with the share mounted on all nodes.

I created an external table across all the files and performed two tests:

  1. Select count(*).
    SQL> select count(*) from ext_tab;
  2. Insert as Select “IAS”
    SQL> alter session enable parallel dml ;
    SQL> insert /*+APPEND */ into mytable select * from ext_tab;

Both when querying and when loading data with “IAS”, I was able to get 1.2GB/sec of throughput, just as I saw in my earlier tests with the S7000 and 10GigE.  That is over 4TB/hr with just one head node on the S7410.  With a clustered configuration and multiple mount points, the load rate could be increased even further.

summary

The Sun Open Storage S7410 server is an excellent choice for managing file system data.  With the ability to connect to multiple networks, it is a perfect fit for staging data for Exadata environments as well.