Storage

The pages in this section relate to storage and contain statistics, test results and other useful information (some of it historical).

 

Disk usage statistics 2008

Access-pattern usage statistics were gathered in 2008 to estimate the percentage of files not accessed for more than 60 days (2 months) and 160 days (5.5 months), respectively. The estimate was done on BlueArc storage using the BlueArc data migration tool (we acquired an evaluation license, which only allowed us to sort out those estimates).

This page is only a snapshot of the disk space usage at that time. For an up-to-date usage profile and scope, please consult the Resource Monitoring tools under Software Infrastructure.


filesystem | total data in f/s | >60 days: amount migrated, files migrated, % migrated | >160 days: amount migrated, files migrated, % migrated

User disk
/star/u | 731.13 GB | 493.88 GB, 4195712, 67.55% | 226.22 MB, 1670, 30.94%

User space (PWG, scratch)
/star/data01 | 756.88 GB | 117.04 MB, 2303, 0.01% | 17.53 MB, 1785, 0.00%
/star/data02 | 772.28 GB | 47.31 GB, 53222, 6.13% | 0, 0, 0.00%
/star/data05 | 2.12 TB | 499.59 GB, 441531, 23.00% | 2.52 MB, 80, 0.00%

General space for projects
/star/data03 | 854.42 GB | 713.03 GB, 1535, 83.45% | 0.00 Bytes, 0, 0.00%
/star/data04 | 702.38 GB | 372.76 GB, 67170, 53.07% | 0.00 Bytes, 0, 0.00%
/star/data06 | 609.42 GB | 560.95 GB, 105179, 92.05% | 254.42 GB, 21256, 41.75%
/star/data07 | 731.17 GB | 544.22 GB, 558413, 74.43% | 505.89 GB, 457460, 69.19%
/star/data08 | 989.15 GB | 815.23 GB, 1205452, 82.42% | 0, 0, 0.00%
/star/rcf | 873.58 GB | 369.68 GB, 41730, 42.32% | 152.46 GB, 123261, 17.41%
/star/simu | 235.25 GB | 225.87 GB, 33617, 95.75% | 225.80 GB, 33321, 95.75%

Note: data was gathered during FastOffline time (0% expected).

/star/data09 | 613.75 GB | 30.19 MB, 10, 0.00% | 25.80 MB, 5, 0.00%
/star/data10 | 483.96 GB | 28.92 MB, 272, 0.00% | 0.00 Bytes, 1, 0.00%

Institution's disks
institutions/bnl | 3.39 TB | 2.08 TB, 3490235, 61.36% | 1.40 TB, 3082800, 41.30%
institutions/lbl | 9.60 TB | 5.19 TB, 3765404, 54.00% | 2.93 TB, 3130797, 30.52%
institutions/mit | 894.69 GB | 510.66 GB, 78058, 57.01% | 303.60 GB, 67989, 33.93%
institutions/ucla | 1.44 TB | 911.61 GB, 2770202, 61.82% | 761.95 GB, 2439251, 51.61%
institutions/iucf | 785.49 GB | 185.14 GB, 99491, 23.57% | 26.44 GB, 36131, 3.37%
institutions/vecc | 731.06 GB | 401.20 GB, 258056, 54.88% | 270.28 GB, 55105, 36.94%
institutions/ksu | 647.62 GB | 197.30 GB, 87715, 30.47% | 83.99 GB, 76374, 12.97%
institutions/emn | 881.67 GB | 80.91 GB, 37272, 9.18% | 36.51 GB, 9139, 4.14%
institutions/uta | 407.68 GB | 351.52 GB, 20807, 86.22% | 195.66 GB, 13628, 47.99%

Production space
/star/data12 | 685.57 GB | 175.32 GB, 13967, 25.57% | 121.34 GB, 10674, 17.70%
/star/data13 | 1.64 TB | 388.72 GB, 20160, 23.15% | 307.55 GB, 17507, 18.31%
/star/data14 | 794.53 GB | 231.98 GB, 19797, 29.20% | 158.41 GB, 14020, 19.90%
/star/data15 | 791.08 GB | 231.74 GB, 16620, 29.29% | 128.60 GB, 11295, 16.26%
/star/data16 | 1.46 TB | 372.16 GB, 23055, 24.88% | 200.92 GB, 16978, 13.44%
/star/data17 | 869.25 GB | 229.98 GB, 14589, 26.46% | 128.80 GB, 10447, 14.82%
/star/data18 | 905.66 GB | 196.12 GB, 14115, 21.66% | 138.78 GB, 11227, 15.32%
/star/data19 | 696.64 GB | 156.17 GB, 12261, 22.42% | 88.87 GB, 8229, 12.76%
/star/data20 | 773.98 GB | 188.46 GB, 12299, 24.35% | 88.46 GB, 8298, 11.39%
/star/data21 | 710.00 GB | 189.90 GB, 12989, 26.75% | 111.88 GB, 9443, 15.76%
/star/data22 | 706.21 GB | 199.16 GB, 15688, 28.19% | 143.76 GB, 12504, 20.36%
/star/data24 | 798.54 GB | 213.04 GB, 17424, 30.17% | 151.88 GB, 13661, 19.02%
/star/data25 | 739.33 GB | 186.02 GB, 13804, 25.17% | 124.57 GB, 10678, 16.86%
/star/data26 | 798.36 GB | 528.72 GB, 13277, 66.23% | 497.14 GB, 11098, 62.27%
/star/data27 | 753.46 GB | 191.43 GB, 13667, 25.36% | 106.11 GB, 9768, 14.08%
/star/data28 | 797.47 GB | 204.35 GB, 13886, 25.60% | 118.36 GB, 9780, 14.85%
/star/data29 | 782.30 GB | 219.58 GB, 14096, 28.08% | 133.82 GB, 9763, 17.11%
/star/data30 | 812.72 GB | 122.56 GB, 8139, 15.08% | 61.27 GB, 4771, 7.54%
/star/data31 | 763.81 GB | 128.72 GB, 9629, 16.78% | 79.06 GB, 6735, 10.35%
/star/data32 | 1.54 TB | 353.22 GB, 23165, 22.38% | 178.26 GB, 14833, 11.30%
/star/data33 | 762.93 GB | 148.00 GB, 9974, 19.42% | 27.76 GB, 3957, 3.64%
/star/data34 | 1.45 TB | 428.74 GB, 26467, 28.83% | 208.23 GB, 15771, 14.02%
/star/data35 | 1.43 TB | 441.51 GB, 25620, 30.15% | 17.36 GB, 1251, 1.19%
/star/data36 | 1.42 TB | 476.07 GB, 28960, 32.74% | 256.63 GB, 21073, 17.65%
/star/data37 | 1.33 TB | 388.87 GB, 26653, 28.55% | 200.03 GB, 16368, 14.69%
/star/data38 | 1.52 TB | 449.63 GB, 27038, 28.85% | 164.45 GB, 13917, 10.57%
/star/data39 | 1.47 TB | 446.78 GB, 26317, 29.63% | 150.30 GB, 14513, 9.98%
/star/data40 | 1.62 TB | 446.26 GB, 22080, 26.90% | 134.60 GB, 10104, 8.11%
/star/data41 | 1.70 TB | 523.73 GB, 29711, 30.04% | 188.62 GB, 15561, 10.84%
/star/data42 | 1.56 TB | 409.76 GB, 26833, 25.65% | 140.44 GB, 13362, 8.79%
/star/data43 | 1.68 TB | 434.58 GB, 28622, 25.26% | 213.89 GB, 17479, 12.43%
/star/data44 | 1.70 TB | 500.49 GB, 28955, 28.75% | 181.79 GB, 15550, 10.44%
/star/data45 | 1.58 TB | 538.92 GB, 31417, 33.25% | 259.11 GB, 19413, 16.02%
/star/data46 | 5.34 TB | 1.76 TB, 52450, 32.96% | 737.01 GB, 32405, 13.48%
/star/data47 | 6.07 TB | 3.14 TB, 74679, 51.73% | 564.87 GB, 23736, 9.09%
/star/data48 | 5.18 TB | 2.49 TB, 82906, 48.07% | 887.51 GB, 42679, 16.73%
/star/data53 | 1.34 TB | 492.05 GB, 30129, 35.86% | 248.89 GB, 23693, 18.14%
/star/data54 | 1.20 TB | 420.97 GB, 28666, 34.26% | 294.70 GB, 22135, 23.98%
/star/data55 | 759.39 GB | 475.69 GB, 20009, 62.58% | 258.48 GB, 17584, 34.04%
 

IO performance page

Tests

  • BlueArc: 2 Titan heads, Fibre Channel attached, mounted over NFS on a farm of Linux 2.4.21-32.0.1.ELsmp #1 SMP nodes.

  • stargrid03 to data10 Linux stargrid03.rcf.bnl.gov 2.4.20-30.8.legacysmp #1 SMP Fri Feb 20 17:13:00 PST 2004 i686 i686 i386 GNU/Linux to a reserved NFS mounted volume.
     
  • PANASAS file system mounted on a Linux rplay15.rcf.bnl.gov 2.4.20-19.8smp #1 SMP Wed Apr 14 10:50:14 EDT 2004 i686 i686 i386 GNU/Linux.
    Consult the Panasas web site for more information about their products. This was tested in May 2004 (driver available).
  • PANASAS file system mounted on a Linux rplay21 2.4.20-19.8smp #1 SMP Mon Dec 29 16:49:59 EST 2003 i686 i686 i386 GNU/Linux.
    Consult the Panasas web site for more information about their products. This was tested on Dec 30th 2003 (driver version not available).
  • IDE RAID (SunOS rmine608 5.8 Generic_108528-22 sun4u sparc SUNW,Ultra-4)
  • SCSI vs IDE comparison.
    The SCSI disks are 36 GB, 10k rpm (QUANTUM ATLAS_V_36_SCA).
    The IDE disks are 40 GB, 5400 rpm (QUANTUM FIREBALLlct20 40).
    Both were on the same dual Pentium III machine, 1 GB of RAM, running Linux 7.2, kernel 2.4.9-31smp (done on Dec 13 2002).
     
  • Linux Client -> IBM Server tests. Comments are in the page (done on Jan 7th 2002).
  • MTI performances (done on Nov 17 2001)
  • LSI performances (before it died) (done on Nov 6th 2001)
    Notes: keep in mind that the 8 kByte (2^3) tests may not represent the best scanning region ...
    • Both vendors have very poor write performance; the maximum read/write seems to be around 16/47 MB/sec. This result is supported by IOzone.
    • The individual IO tests show similar MTI/LSI results; the LSI seems slightly better (its IO profile is flatter, with only small performance degradations depending on file size and/or block size).
    • The MTI tests contain several Ioperf tables made under different conditions in order to get a detailed view of the 'best performance' ridge.

 

Basics

This page contains results from different IO performance tests made on the RCF hardware. The performance tests are based on two programs:

  • IOzone: this benchmark generates and measures a variety of file operations. IOzone has been ported to many machines and runs under many operating systems. We used this program to get a quick estimate of the IO profile. I used the -a flag (automatic mode, full) but reported only the few results I found relevant, i.e.
    • Left side: the Write operations
    • Right side: the Read operations
    • From top to bottom: buffered IO, random IO, un-buffered IO.
    Note that the 'buffered' IO also benchmarks the OS's ability to flush data in/out of cache, as well as the C function implementation (asynchronous or not). On each plot, the x and y axes are in log2 kBytes; the initial gap up to 2^2 kBytes (4 kBytes) represents the minimal startup value, and the square hole at large file sizes is an artifact of this program.
     
  • Ioperf is a program I wrote myself after great frustration with the Bonnie results (Bonnie reports IO performance far higher than what the card can do, even in unbuffered mode, so it is wrong!!). In any case,
    • buffered C-IO (fwrite, fread) tests are made BUT ensuring that all IO is flushed when the test is done (sync); a minimal sketch of these access patterns is shown after this list. This was chosen to be representative of typical IO usage. However, the read operations may still be affected by buffering in the case of NFS transactions between client and server. We recommend testing performance on the local machine.
    • Character-per-character IO tests are displayed as a worst-case-scenario baseline.
    • Random seek tests are made over the entire file. This tests the device's ability to locate and access data at any place on the partition. These values also give you an idea of what to expect as fragmentation occurs.
    • Each result is tabulated by real time, CPU time (or maximum performance) and percentage efficiency. Note that the ratio (or percentage) may be affected by the system's load. However, all tests were done in no-load mode, so the realtime/CPUtime ratio is, in our case, a good measure of the system's response.
    • Ioperf creates a 400000 KB file for its test. That is 18.6 on the IOzone (log2) file size axis, a region where IOzone does not provide any results; one has to extrapolate visually ...
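To make the above concrete, here is a minimal C sketch of the kind of access patterns Ioperf times (buffered block IO followed by a sync, character-per-character IO, and random seeks). This is an illustration only, not Ioperf's actual code; the file name, record size and loop counts are made up.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>     /* sync() */

#define NBLOCKS   1000
#define BLOCKSIZE (100 * 1024)          /* 100 KB records (made-up value) */

int main(void)
{
    static char buf[BLOCKSIZE];
    int i;
    long filesize, offset;
    FILE *fp;

    memset(buf, 'x', sizeof buf);
    fp = fopen("ioperf_sketch.dat", "w+");
    if (!fp) { perror("fopen"); return 1; }

    /* 1. Buffered block writes (fwrite), then force everything out to disk
          with sync() so that OS caching does not hide the real cost. */
    for (i = 0; i < NBLOCKS; i++)
        fwrite(buf, 1, BLOCKSIZE, fp);
    fflush(fp);
    sync();

    /* 2. Character-per-character IO (putc): the worst-case baseline. */
    rewind(fp);
    for (i = 0; i < BLOCKSIZE; i++)
        putc('x', fp);
    fflush(fp);

    /* 3. Random seeks over the entire file, reading one record each time. */
    filesize = (long)NBLOCKS * BLOCKSIZE;
    for (i = 0; i < 100; i++) {
        offset = (long)((double)rand() / RAND_MAX * (filesize - BLOCKSIZE));
        fseek(fp, offset, SEEK_SET);
        fread(buf, 1, BLOCKSIZE, fp);
    }

    fclose(fp);
    remove("ioperf_sketch.dat");
    return 0;
}

Timing each of the three loops separately (real time vs CPU time) gives the kind of numbers Ioperf tabulates.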

 

Disk IO testing, comparative study 2008

Two test nodes, named eastwood and newman, are configured for testing in summer 2008. 

Testing is performed under Scientific Linux 5.1 unless otherwise stated.


Hardware

The basic hardware specs common to both nodes:


Manufacturer:  Penguin Computing

Model: Relion 230

Dual Intel(R) Xeon(TM) CPU 3.06GHz, w/ Hyper-threading

2 GB RAM, PC2100

6 IDE disks, 200GB each, all Western Digital "Caviar" series, model numbers starting with WD2000JB (there are some variations in the sub-model numbers, but I could find no documentation as to the distinctions amongst them).  Manufactured in late 2003 or early 2004.

The disks are on two controllers:

--onboard, Intel Corporation 82801CA Ultra ATA Storage Controller with 2 channels

  1. hda (primary master)
  2. hdc (secondary master)
  3. hdd (CD-ROM, secondary slave)

--PCI card, Promise Technology, Inc. PDC20268 (Ultra100 TX2)

  1. hde (primary master)
  2. hdf (primary slave)
  3. hdg (secondary master)
  4. hdh (secondary slave)

 

There are some RAID configurations that would not make much sense, such as combining a master and slave from the same channel into an array, because of the inherent limitation of IDE/ATA that the master and slave on a single channel cannot be accessed simultaneously.


Initial testing configuration is as follows:

On eastwood, no RAID is configured, and all drives are independent (other than the IDE master/slave connection).  The drives are configured as follows:

  • hda is part of a Logical Volume Group and hosts the OS on ext3
  • hdc -- ext2
  • hde -- ext2
  • hdf -- ext2
  • hdg -- ext3
  • hdh -- ext3

On newman, two software RAID arrays are configured in addition to the system disk (a sketch of how RAID0 striping maps offsets onto the member disks follows this list):

  • hda is part of a Logical Volume Group and hosts the OS on ext3
  • hdc, hde and hdg (all IDE masters) are part of a RAID5 (striped with distributed parity) array, with 256KB stripe size, ext3
  • hdf and hdh (both IDE slaves) are part of a RAID0 (striped, no parity) array, with 256KB stripe size, ext3
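As background for the stripe-size settings above (and for the stride-read discussion further down), here is a small C sketch of how RAID0 maps a logical array offset onto a member disk. This illustrates the standard round-robin chunk mapping only, not the Linux md driver's code; the chunk size and disk count match the newman RAID0 array described above.

#include <stdio.h>

#define CHUNK_BYTES (256 * 1024)   /* 256 KB stripe (chunk) size */
#define NDISKS      2              /* hdf + hdh in newman's RAID0 array */

/* Report which member disk, and where on it, a given array offset lands. */
static void locate(long long offset)
{
    long long chunk       = offset / CHUNK_BYTES;
    int       disk        = (int)(chunk % NDISKS);
    long long disk_offset = (chunk / NDISKS) * CHUNK_BYTES + offset % CHUNK_BYTES;
    printf("array offset %lld -> disk %d, offset %lld\n", offset, disk, disk_offset);
}

int main(void)
{
    locate(0);                /* first 256 KB chunk  -> disk 0 */
    locate(256 * 1024);       /* second chunk        -> disk 1 */
    locate(512 * 1024 + 100); /* third chunk         -> disk 0 again */
    return 0;
}

Sequential access therefore alternates between the two disks chunk by chunk, which is where the RAID0 speedup comes from; access patterns that repeatedly land in chunks belonging to the same disk get no benefit.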

 

Additional test possibilities

Here are some additional tests that have occurred to me, in no particular order:

 

  • Swap hdc and hdg for instance to see if the poorer performance of eastwood's hdc follows the drive, or sticks with the controller.

 

  • Try adding an additional PCI-X IDE controller.  Then the two current slaves could become masters and possibly all be accessible simultaneously, allowing 4 disks in RAID0 or RAID5 while avoiding the Intel controller, which may be inferior.

 

  • Try the "-e" option with IOzone to try to force disk access and reduce memory caching.  A good test set is hdh and raid0.  (These are done -- results to be posted)

 

  • Try IOzone with 1/2 or even 1/4 of the RAM (1GB or 512MB).  (tests with 1 GB RAM are done -- results to be posted)

 

  • Vary the RAID0 stripe size?

 

  • Vary the ext2/ext3 block size.

 

  • Try different ext3 journalling modes (journal, ordered (the default) and writeback).

 

  • Filesystems other than ext2/3 (Reiser, JFS, XFS, ?).  Support from Redhat for any filesystem other than ext2/3 is essentially non-existent, so any foray down this path will take some extra effort.

 

  • Try running through database benchmarking with multiple clients ("super parallel" access), in the manner of Mike DePhillips's testing of STAR offline database servers.

 

  • Multi-threaded/multi-process testing, possibly with IOzone.

 

  • IOzone has tests for mmap and POSIX async I/O.  I don't know if either of these is relevant to any STAR uses.

 

  • Try different I/O elevators (see for instance http://www.redhat.com/magazine/008jun05/features/schedulers/ )

 

  • IOperf - stalled on timing issues.

 

IOperf

Now trying tests with IOperf

( http://nucwww.chem.sunysb.edu/htbin/software_list.cgi?package=ioperf )

THIS PAGE DOES NOT CONTAIN MEANINGFUL NUMERICAL RESULTS YET!!!  THERE ARE AT LEAST TWO PROBLEMS WITH THE IOPERF NUMBERS TO BE RESOLVED (as described below).  Once resolved, this page will likely be completely rewritten.

The original IOperf code would not compile:

[root@dh34 ioperf]# make
cc -lm -o Ioperf ioperf.c
ioperf.c: In function 'main':
ioperf.c:133: error: 'CLK_TCK' undeclared (first use in this function)
ioperf.c:133: error: (Each undeclared identifier is reported only once
ioperf.c:133: error: for each function it appears in.)
ioperf.c: In function 'get_timer':
ioperf.c:477: error: 'CLK_TCK' undeclared (first use in this function)
make: *** [Ioperf] Error 1

Comments in /usr/include/time.h and /usr/include/bits/time.h led me to think that "CLK_TCK" is just an obsolete name for CLOCKS_PER_SEC, though this struck me as odd, because the ioperf code also uses CLOCKS_PER_SEC for CLK_SCALE.

I modified ioperf.h, changing

# define SCALE    CLK_TCK

to

# define SCALE    CLOCKS_PER_SEC

and it compiled, but there is clearly a problem with the real timing scale factor (perhaps simply off by 1 million?). 
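My working assumption about the cause (not verified against the full ioperf source): CLK_TCK was the historical name for the tick rate used by times(), and its POSIX replacement is sysconf(_SC_CLK_TCK) -- typically 100 on Linux -- whereas CLOCKS_PER_SEC is the scale for clock() and is fixed at 1000000. If ioperf divides times() ticks by SCALE, then substituting CLOCKS_PER_SEC for CLK_TCK would throw the real times off by a large constant factor. A minimal sketch of the conversion that goes with times():

#include <stdio.h>
#include <unistd.h>        /* sysconf() */
#include <sys/times.h>     /* times(), struct tms */

int main(void)
{
    struct tms t;
    clock_t start = times(&t);
    sleep(2);                                   /* stand-in for the timed IO work */
    clock_t end   = times(&t);

    long ticks_per_sec = sysconf(_SC_CLK_TCK);  /* typically 100; NOT CLOCKS_PER_SEC (1000000) */
    printf("elapsed: %.2f s\n", (double)(end - start) / ticks_per_sec);
    return 0;
}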

So for now, these "real time" results are obviously not realistic numbers, BUT they would probably be correct in relative (ratio) terms -- twice as fast is twice as fast.  The CPU % is meaningless for the same reason.

Another limitation (and a more important one for our purposes) in the "stock" ioperf appears to be a 2GB file size limit.  To get around this, I added "-D_FILE_OFFSET_BITS=64" to the CFLAGS line in the makefile.  I was able to write a 39GB file after this, but there is an overflow at some point in one or more variables, because the printed filesize was -1989672960 in the standard output.  (Even at 8GB, the file size overflow appears.)  While the html output has a different filesize than the standard output, it too appears to be incorrect above a certain point.  For instance, look at the third example below, in which the file size should be 8GB, but is instead only ~4GB.

 

With these two oddities (wrong real times and file sizes), I don't see much point in generating a lot of results at this time.

 

JUNE 27 update.  With some wanton changes of int, long int and size_t to 'long long int' in declarations and casts in the ioperf code, I was able to get correct file sizes everywhere (?) for a test with an 8GB file size.  This still leaves the timing issue unresolved, so the actual rates are still not meaningful, I don't think, but now perhaps the results can at least be contrasted amongst the different filesystems for relative differences.
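For illustration only (this is not the actual ioperf patch), the kind of truncation involved when a byte counter is still a 32-bit int:

#include <stdio.h>

int main(void)
{
    long long bytes_written = 8000000000LL;      /* e.g. an ~8 GB test file */

    /* Funneling the value through a 32-bit int truncates it; on the usual
       two's-complement platforms it shows up as a negative or nonsense
       number, much like the -1989672960 printed above. */
    printf("as int:       %d\n",   (int)bytes_written);
    printf("as long long: %lld\n", bytes_written);
    return 0;
}

Building with -D_FILE_OFFSET_BITS=64 only widens off_t for the filesystem calls; any counters, products or printf formats inside the program still have to be made 64-bit (long long / %lld) by hand.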

 

On to some results:

Ioperf -l 5 -s 100 -n 100000 -html -w -m eastwood-hda:

  Block IO Character Random
fwrite fread putc getc seek
Machine KB KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU
eastwood-hda 400000 372786579.68 141342.76 263746.51 2325581395.35 232558.14 1000000.00 275292498.28 40444.89 680660.70 507614213.20 50697.08 1001269.04 23418764.99 2341.88 1000000.00
eastwood-hda 400000 376470588.24 141843.97 265428.39 2325581395.35 233236.15 997093.02 273130761.35 40120.36 680777.51 507292327.20 50664.98 1001268.23 23418764.99 2344.69 998800.96
eastwood-hda 400000 377477194.09 142180.09 265503.81 2325581395.35 233009.71 998062.02 273473108.48 40053.40 682775.59 502512562.81 50209.21 1000845.49 23344123.51 2336.27 999200.64
eastwood-hda 400000 377982518.31 142348.75 265541.52 2325581395.35 232558.14 1000000.00 273224043.72 40000.00 683062.62 500469189.87 50031.27 1000325.09 23320895.52 2332.09 999995.72
eastwood-hda 400000 377073906.49 142450.14 264720.80 2322880371.66 232558.14 998843.93 273410799.73 39952.06 684352.77 499625281.04 49975.01 999763.80 23306980.91 2329.59 1000472.76

 

Ioperf -l 5 -s 100 -n 100000 -html -w -m newman-hda:

  Block IO Character Random
fwrite fread putc getc seek
Machine KB KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU
newman-hda 400000 349650349.65 140350.88 249125.87 2325581395.35 232558.14 1000000.00 268456375.84 41025.64 654362.42 506970849.18 50697.08 1000000.00 23141291.47 2308.66 1002369.67
newman-hda 400000 351185250.22 141843.97 247578.81 2325581395.35 232558.14 1000000.00 263504611.33 40962.62 643481.34 506970849.18 50729.23 999366.29 23086583.92 2305.93 1001184.83
newman-hda 400000 355977454.76 141843.97 251054.37 2330097087.38 233009.71 1000000.00 259852750.11 41025.64 633791.60 507185122.57 50718.51 1000000.54 23086583.92 2308.66 1000001.87
newman-hda 400000 353904003.54 142095.91 249160.34 2325581395.35 232558.14 1000000.00 261865793.78 41025.64 638715.02 507292327.20 50729.23 1000000.40 23086583.92 2307.29 1000592.42
newman-hda 400000 354421407.05 142146.41 249417.40 2322880371.66 232288.04 1000000.00 263608804.53 41017.23 643131.80 507356671.74 50748.54 999746.51 23097504.73 2307.57 1000947.87

 

(All this shows is that the two machines perform almost identically, as expected, since they are the same hardware.)

Ioperf -l 5 -s 100 -n 2000000 -html -w -m eastwood-hda

 

  Block IO Character Random
fwrite fread putc getc seek
Machine KB KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU
eastwood-hda 3805696 185888536.12 64921.46 286328.33 193546050.96 103697.44 186644.97 177961000.70 19129.87 930278.23 195073863.35 23687.89 823517.35 11054431.89 1105.44 1000000.00
eastwood-hda 3805696 82291879.40 29158.53 282278.70 86875854.88 46547.99 186637.18 80702092.49 8594.50 939085.77 87483051.72 10636.85 822453.78 11054431.89 1105.31 1000118.98
eastwood-hda 3805696 48867732.72 17233.71 283610.19 51358024.69 27474.19 186932.21 47592167.71 5070.20 938725.47 51647935.60 6283.02 822025.09 11033428.33 1103.21 1000118.75
eastwood-hda 3805696 32029118.79 11267.06 284316.11 33571635.68 17943.67 187094.82 31230001.18 3316.26 941804.92 33758817.36 4104.79 822425.98 11039327.51 1103.87 1000059.31
eastwood-hda 3805696 21913097.65 7686.38 285134.41 22880094.31 12236.46 186983.55 21257190.34 2260.60 940410.63 23025180.52 2798.51 822765.06 11040245.73 1103.97 1000047.45

 No explanation yet for why the Block IO and Character results decreased so dramatically with each iteration...  (FWIW, according to the ioperf man page, each iteration is an average of the current result with all previous results, so it would seem that the first iteration was much faster than subsequent iterations for some reason...)  This may be a side effect of the incorrect filesize, since there is an overflowed variable going into the calculations.

 

June 27 update: at least the file sizes are now as expected with an 8GB file after some changes to make variables long long ints:

 

  Block IO Character Random
fwrite fread putc getc seek
Machine KB KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU KB/sec Real KB/sec CPU %CPU
newman-hda 8000000 389863547.76 123057.99 316812.87 516295579.22 171710.67 300677.64 367073506.47 39414.69 931311.37 409563303.13 49200.49 832437.41 171026707.53 17102.67 1000000.00

 

 

 

 

IOzone tests

Aug 14 update - I am running tests on new machines with much (?) better hardware than was used in the tests below.  I will probably put the results in a different location (search for R610 in Drupal?).

 

First, a note -- there is some duplication in testing, because I reran all the original tests (which did not include small operations on large files) using full coverage in the second test round.  For what it's worth, I have not seen any surprises in this region.

The attached Excel files include test results for the test filesystems, using IOzone's default test suite with files up to 4GB.  All test results are plotted in the attached files; however, there is no attempt to make the vertical scales identical, so be sure to check the vertical scales before making any comparisons!  I may attempt to add a plotting routine that will find the maximum within all test results (e.g., maximum of all Writer tests, maximum of all Reader tests, etc.) and plot all the individual test results with the same maximum (i.e., all Writer graphs would have the same vertical scale and coloring), so the graphs can be directly compared without having to look at the vertical scaling, but this is a bit more work.

Typical commands are:

  • /opt/iozone/bin/iozone -Ra -g 4G -b eastwood-hda-iozone-4G.xls
  • /opt/iozone/bin/iozone -Raz -g 4G -b eastwood-hda-iozone-full-4G.xls
  • /opt/iozone/bin/iozone -Raz -g 4G -e -b eastwood-hda-with_flush-iozone-full-4G.xls

 

The meanings of the file name components:

"eastwood" or "newman" are the hostanmes

"hdX" or "raidX" are the device names.  hda, hdg, hdh, raid0 and raid5 have ext3 filesystems, while hdc, hde and hdf have ext2.

"full" in the file name indicates test coverage throughout the test range.  If "full" isn't in the file name, then the region of small operations on large file sizes was not tested.

"with_flush" indicates -e was used.

"4G" means the maximum file size tested.

"1GBRAM" means the system's RAM was reduced from 2GB to 1GB during the test.  (NB -- these tests are underway as I write this.)

 

I have tried to make performance comparisons one by one between various filesystems, and I'll describe some findings from the individual comparisons, followed by some summary thoughts.  <Need to update this section more carefully>

Comparing ext3 with ext2:

eg, eastwood's hda (ext3/system disk) vs eastwood's hdc (ext2) or

eastwood's hde (ext2) vs eastwood's hdg (ext3) or

eastwood's hdf (ext2) vs eastwood's hdh (ext3):

 

Read performance is essentially indistinguishable, with a few anomalous variations here and there.

Writing to ext2 is almost universally faster than writing to ext3, which is to be expected because of the overhead of keeping the journal on ext3.  Somewhat surprising to me, when writing small files (in which most or all of the work is done in memory and flushed to disk later), ext2 writes could be 1.5-2.5 times faster than ext3 writes.  As the file size gets larger (exceeding the system's available RAM for caching/buffering), ext2's writing advantage diminishes to only about ~15-25% for random writes and further down to about 10% for linear writes.  Writing a bunch of small chunks to a large file is less efficient, especially when a journal is involved, than writing the same data in fewer large chunks.  When issuing a lot of write commands, ext2 will gain more over ext3.

 

Comparing Master to Slave on a single IDE channel:

eg. eastwood's hde (a "master" with ext2) vs eastwood's hdf (a "slave" with ext2) or

eastwood's hdg ("master"/ext3) vs eastwood's hdh ("slave"/ext3):

hde vs hdf is essentially indistinguishable in all tests.  This is as expected -- the "master" and "slave" designations are not really meaningful terms anymore (and in fact, the terms are no longer used in recent ATA specifications).  hdg has a slight (~5%) edge over hdh in most disk-bound operations -- I'm going to dismiss this as insignificant for the time being.

 

Comparing eastwood to newman:

 eg. eastwood's hda vs newman's hda:

 Since the machines are identical (or very, very nearly so), little or no variation is expected, but it doesn't seem to have worked out that way... For the large file sizes (disk-bound operations), write performances are indistinguishable (if anything, eastwood has a slight edge), but for some reason reads on newman were consistently faster than on eastwood, by 20-25% (maxing out at ~55MB/sec compared to ~45MB/sec).  I don't have any explanation for this.  In fact, on eastwood, writing was faster than reading in comparable tests!  This is certainly a surprising result...

Comparing Controller to Controller on eastwood:

eg., eastwood's hda (Intel 82801CA controller onboard) vs. eastwood's hdg (Promise Technology PDC20268 PCI card) or

eastwood's hdc (Intel 82801CA controller onboard) vs eastwood's hde (Promise Technology PDC20268 PCI card) :

(caveats about this comparison -- hda is a system disk (/), so may have some slight contention with system operations during the test, and the disks on the two different controllers are not exactly the same models -- the major model numbers are all the same, but the minor revision numbers are different.  I couldn't find any documentation about the differences between the minor versions.)

The disks on the Promise Controller are consistently faster than those on the Intel controller.  Typical performance comparison is 40-50MB/sec on the Intel controller vs 60-70 MB/sec on the Promise controller.  To investigate if this is a controller difference, or a difference in the minor disk versions, I could swap some disks around and see if the performance stays with the Controller, or follows the disk around.  If you read this and would like to see this tried, let me know, otherwise I won't give it a high priority. :-)

 

Comparing single disk to RAID0 (with two disks):

eg., eastwood's hdh vs newman's raid0 (Why cross machines instead of comparing newman-hda to newman-raid0?  Because as we saw on eastwood (above), the drives on the Intel controller are consistently slower than the drives on the Promise controller.  The RAID0 array on newman consists of two drives on the Promise controller, so it seems better to compare it to eastwood drives on the Promise controller, rather than a drive on the Intel controller on newman.)

The RAID0 array is 50-100% faster in almost all cases, pretty much as one would expect when both disks can be issued commands simultaneously.  The overhead of software RAID in this case appears negligible.  The advantage is smallest when the RAIDed disks are not necessarily accessible simultaneously, because subsequent accesses may land on a single drive.  An extreme example of this appears to be in the stride-read results using small accesses, where the single disk is actually faster.  (Stride reading is reading a chunk of size X, seeking ahead Y bytes, reading X bytes, seeking Y bytes again and repeating; a rough sketch of this access pattern follows below.)  For certain values of X and Y (and depending on the RAID stripe size), reads may all occur on the same disk, negating the RAID0 advantage (or perhaps even giving the single disk the edge, as may be the case for a couple of these test results), though the outperformance of the single disk in these two cases is beyond any explanation I can come up with.  The IOzone documentation does not explain the relationship between the "chunk size" variable and X and Y.
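For concreteness, a stride read looks roughly like the C sketch below (an illustration, not IOzone's code; the file name and the X/Y values are made up).  With 2 data disks and a 256 KB chunk, a step of X + Y = 512 KB between read starts can keep every read on the same member disk, which is the situation described above.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    const size_t X = 64 * 1024;        /* read chunk size */
    const off_t  Y = 448 * 1024;       /* seek between reads, so X + Y = 512 KB */

    char *buf = malloc(X);
    int   fd  = open("testfile.dat", O_RDONLY);
    if (fd < 0 || buf == NULL) { perror("setup"); return 1; }

    ssize_t n;
    while ((n = read(fd, buf, X)) > 0) {     /* read X bytes ...            */
        if (lseek(fd, Y, SEEK_CUR) < 0)      /* ... then skip ahead Y bytes */
            break;
    }

    free(buf);
    close(fd);
    return 0;
}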

 

Comparing RAID5 (with three disks) to the rest:

The RAID5 array on newman includes disks on both controllers (specifically hdc on the Intel controller and hde and hdg on the Promise controller) and is also a mixture of minor disk versions, so there's no other filesystem to compare it to that is "fair".  The hdc drive (or Intel controller) might be "crippling" it, or at least acting as a bottleneck.   Compared to eastwood's hdg, the overall performance is relatively close, with the RAID5 array generally having an edge on reading, but falling behind in writing.  I'm unwilling to try to draw any conclusions from these RAID5 test results, and I doubt there is any configuration of disks possible with this particular hardware to form a "good" RAID5 array.  Ideally we should have three (or more) SCSI or SATA drives on a fast PCI (-X, -E, whatever) bus, which is the sort of thing we'd have with any recent server hardware. 

 

Can we do anything else with IOzone and this hardware?

We can look a bit at the effect(s) of parallel/multi-threaded applications on performance.  I have run some tests with multiple threads accessing the disk, which is likely frequently the case with STAR offline database servers.  Some analysis will follow shortly...

 

 

Simple tests -- hdparm and dd

Starting with the basics, hdparm:

Five samples of the following:

On eastwood:

  • hdparm -t -T /dev/hd{a,c,e,f,g,h}1
  • hdparm -t -T /dev/mapper/VolGroup00-LogVol00 (/ on hda)

On newman:

  • hdparm -t -T /dev/md{0,1}
  • hdparm -t -T /dev/mapper/VolGroup00-LogVol00 (/ on hda)

 

                       Timing cached reads (MB/s)   Timing buffered disk reads (MB/s)

eastwood - LV on hda   1162.59 +/- 3.84               54.39 +/- 0.42
eastwood - hda1        1167.32 +/- 2.31               54.55 +/- 0.21
eastwood - hdc1        1168.34 +/- 4.75               35.95 +/- 0.68
eastwood - hdc1 ***    1027.43 +/- 4.89               42.65 +/- 0.39
eastwood - hde1        1167.39 +/- 3.41               57.5  +/- 0.04
eastwood - hdf1        1167.72 +/- 2.03               59.34 +/- 0.26
eastwood - hdg1        1166.46 +/- 2.63               57.27 +/- 0.09
eastwood - hdh1        1166.20 +/- 3.59               54.36 +/- 0.10
newman - LV on hda     1180.26 +/- 10.32              51.70 +/- 1.51
newman - md0 (RAID5)   1211.01 +/- 43.51              66.67 +/- 0.30
newman - md1 (RAID0)   1197.65 +/- 15.77             113.67 +/- 0.45

*** -- for this test, I moved this disk to the Promise controller in position hde1 to see if it would improve.  While there appears to be improvement in this test, the IOzone results show no improvement.  Even with this "improved" hdparm result, it is clear this disk really is inferior for some reason.

 

Items of note:

 

  • The Logical Volumes on the two nodes (hda in both cases) are essentially identical (as expected). 
  • hdc on eastwood is significantly slower for some reason (not expected).  If this holds up in other tests and is true on newman, it will skew the RAID tests, since hdc is used in the RAID5 array on newman.  (I originally hypothesized that this was a feature/flaw of the onboard Intel controller, since the masters and slaves on the Promise controller are indistinguishable, but after further testing (dd, IOzone and some hardware swapping), I have concluded that this disk (and likely all of the "-00FUA0" models) is simply not up to par with the "-00GVA0" models.)
  • Other than hdc, the individual disks on eastwood all perform within a few percent of each other.
  • ext2 and ext3 are essentially identical (as expected for reading)
  • The RAID arrays on newman are noticeably faster than individual disks (as expected for reading stripes in parallel).  The RAID5 array is not nearly as fast as the RAID0 array, though.  This could be the "hdc" effect, in which case the results are not really apples-to-apples comparisons.

 

 

On to dd:

Here are the test commands, using 1GB write/read and then 10GB write/read:

time dd if=/dev/zero of=/$drive/test.zero bs=1024 count=1000000

time dd of=/dev/zero if=/$drive/test.zero bs=1024 count=1000000

time dd if=/dev/zero of=/$drive/test.zero bs=1024 count=10000000

time dd of=/dev/zero if=/$drive/test.zero bs=1024 count=10000000

 

This sequence was run once on eastwood and five times on newman.  (hda on eastwood was tested twice, by accident, but with surprisingly different results in the 10GB tests.)  This is essentially an idealized test, the results of which are unlikely to be matched in a production system -- during this test there should be little or no rotational or seek latency (because the disks are mostly empty and the reading and writing proceed sequentially rather than randomly), plus there should be no contention from multiple processes.

 

DRIVE OPERATION REAL (s) USER SYS dd TIME dd RATE (MB/s)
eastwood - hda 1GB write 9.848 0.436 7.947 9.84596 104
eastwood - hda 1GB read 2.426 0.401 2.026 2.42409 422
eastwood - hda 10GB write 287.458 5.170 88.529 286.89 35.7
eastwood - hda 10GB read 289.671 3.647 24.788 289.328 35.4
eastwood - hda 1GB write 12.032 0.512 8.462 11.0798 92.4
eastwood - hda 1GB read 2.734 0.402 2.333 2.73233 375
eastwood - hda 10GB write 203.015 5.146 89.103 202.397 50.6
eastwood - hda 10GB read 192.057 4.195 24.033 191.866 53.4
eastwood - hdc 1GB write 6.242 0.386 4.273 6.22181 165
eastwood - hdc 1GB read 2.695 0.434 2.262 2.69342 380
eastwood - hdc 10GB write 232.640 3.874 44.764 232.588 44
eastwood - hdc 10GB read 253.590 3.583 26.264 253.474 40.4
eastwood - hdc (moved to Promise controller)*** 1GB write 4.594±0.011 0.4582±0.009 4.136±0.015 4.592±0.011 223±0.7
eastwood - hdc (moved to Promise controller)*** 1GB read 2.604±0.010 0.4512±0.015 2.154±0.015 2.602±0.010 393.4±1.5
eastwood - hdc (moved to Promise controller)*** 10GB write 231.193±0.923 4.3206±0.095 45.051±0.144 231.146±0.907 44.28±0.16
eastwood - hdc (moved to Promise controller)*** 10GB read 222.119±0.458 3.758±0.048 26.514±0.724 222.074±0.451 46.14±0.09
eastwood - hde 1GB write 5.920 0.377 4.295 5.86033 175
eastwood - hde 1GB read 2.680 0.415 2.266 2.67825 382
eastwood - hde 10GB write 185.871 4.072 44.518 185.808 55.1
eastwood - hde 10GB read 192.924 3.648 25.640 192.841 53.1
eastwood - hdf 1GB write 5.864 0.395 4.293 5.8289 176
eastwood - hdf 1GB read 2.691 0.401 2.290 2.68856 381
eastwood - hdf 10GB write 174.073 3.899 44.818 174.004 58.8
eastwood - hdf 10GB read 282.014 3.847 27.576 281.878 36.3
eastwood - hdg 1GB write 11.149 0.489 8.396 11.1013 92.2
eastwood - hdg 1GB read 2.721 0.440 2.281 2.71868 377
eastwood - hdg 10GB write 181.620 5.316 87.690 181.573 56.4
eastwood - hdg 10GB read 183.700 3.880 24.359 183.613 55.8
eastwood - hdh 1GB write 10.714 0.536 8.175 10.6598 96.1
eastwood - hdh 1GB read 2.710 0.427 2.284 2.7075 378
eastwood - hdh 10GB write 190.202 5.180 87.511 190.156 53.9
eastwood - hdh 10GB read 194.078 4.013 24.908 194.012 52.8
newman - hda 1GB write 10.670±1.308 0.504±0.026 8.242±0.065 10.628±1.281 97.6 ±12.99
newman - hda 1GB read 3.582±1.222 0.440±0.012 2.281±0.041 2.720±0.043 376.6 ±6.0
newman - hda 10GB write 269.355±5.300 5.173±0.131 88.529±0.373 268.738±5.294 38.1 ±0.7
newman - hda 10GB read 211.516±4.548 4.331±0.263 24.237±0.281 211.124±1.035 48.5 ±1.0
newman - raid0 1GB write 10.012±0.459 0.525±0.029 8.605±0.128 9.972±0.453 102.9 ±4.6
newman - raid0 1GB read 2.764±0.029 0.426±0.020 2.338±0.043 2.762±0.029 370.8 ±3.7
newman - raid0 10GB write 123.691±3.427 5.130±0.378 86.677±0.700 123.636±3.427 82.9 ± 2.4
newman - raid0 10GB read 88.485±1.196 3.969±0.198 25.252±0.263 88.415±1.180 115.8 ±1.5
newman - raid5 1GB write 9.701±0.158 0.535±0.010 8.923±0.070 9.693±0.156 105.6 ±1.7
newman - raid5 1GB read 2.974±0.103 0.439±0.029 2.535±0.115 2.971±0.104 345.0 ±12.2
newman - raid5 10GB write 205.112±2.087 5.692±0.106 100.492±0.552 205.090±2.068 49.9 ±0.5
newman - raid5 10GB read 156.661±0.334 4.528±0.343 38.091±1.381 156.623±0.347 65.4 ±0.2

 

Observations:

  • 1GB reads are likely dominated by file caching in RAM, since the file had just been written.  1GB writes may be faster than disk I/O because some of the writes may have only been buffered in RAM rather than actually written to disk at the end of the dd command.
  • We see again that hdc is not able to perform as well as the drives on the Promise Controller (hde and up).  Based on the results and the IOzone results after switching it to the Promise controller, the conclusion is that this drive is not up to par.  An open question is whether this applies to all the "-00FUA0" model drives or just this one. 
  • The discrepancy between the first two hda tests on eastwood is disturbing, and may be worth further testing.
  • Eastwood's hdf (ext2) 10GB read is surprisingly low.  Perhaps this whole test suite should be run several more times to average out possible aberrations.
  • From these results and the hdparm results, it appears that uncontested single drives max out at about 55-60 MB/sec for both reads and writes, while RAID0 is roughly twice as fast.  The RAID5 results are not encouraging, but they may be affected by the presence of hdc in the array.

 

Summary thoughts

Some conclusions and general suggestions, some based on the data, some on general principles:

1.  RAID0 can significantly improve performance over a single drive, if it isn't bottlenecked by some other limitation of the hardware.  Of course, for X drives in the array, it is X times more likely to suffer a failure than a single drive.  From the test results so far, not much can be said about RAID5.  In principle, however, RAID5 should outperform bare drives and be a bit worse than RAID0.  Like RAID0, it should improve as more drives are added (within reason).

2.  The limitations of P-ATA/IDE drives and controllers are something to watch out for when planning drive layouts.  (Fortunately, IDE is obsolete at this point, so this is unlikely to be a factor in future purchases.)

3.  Those filesystems that will be accessed the most should be put on disks in such a way as to allow simultaneous access if at all possible.  In the case of STAR database servers, /tmp (or wherever temp space is directed to), swap space and the database files (and if possible, system files) should all be on individual drives (and in the case of IDE drives, on separate channels).  (Of course, if servers are having to go into swap space at all, then performance is already suffering, and more RAM would probably help more than anything else.)

4.  For the STAR database server slaves, if they are using temp space (whether /tmp or redirected somewhere else), then as is the case with swap space, more RAM would likely help; but to gain a bit in writing to the temp space, make it ext2 rather than ext3, whether RAIDed or not.  Presumably it matters little if the filesystem gets corrupted during a crash -- at worst, just reformat it.  It is, after all, only temp space...