The pages in this section relate to storage and contain statistics, test results and other useful information (some of it historical).
Access-pattern statistics were gathered in 2008 to estimate the percentage of files not accessed for more than 60 days (2 months) and 160 days (5.5 months) respectively. The estimate was made on BlueArc storage using the BlueArc data migration tool (we acquired an evaluation version, which only allowed us to produce these estimates).
This page is only a snapshot of the disk space. For an up-to-date usage profile and usage scope, please consult the Resource Monitoring tools under Software Infrastructure.
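(For reference, a rough estimate of this kind can also be made on any mounted filesystem with standard tools, assuming atime updates are enabled on the mount; the path below is just an example, and this is not how the numbers in the table were produced -- those came from the BlueArc tool:)
find /star/u -type f -atime +60 -printf '%s\n' | awk '{n++; s+=$1} END {printf "%d files, %.2f GB not accessed in 60+ days\n", n, s/1024/1024/1024}'
find /star/u -type f -atime +160 -printf '%s\n' | awk '{n++; s+=$1} END {printf "%d files, %.2f GB not accessed in 160+ days\n", n, s/1024/1024/1024}'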
| Filesystem | Rule | Data in f/s | Amount migrated | Files migrated | Percent migrated | Rule | Amount migrated | Files migrated | Percent migrated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| User disk | | | | | | | | | |
| /star/u | >60 days | 731.13 GB | 493.88 GB | 4195712 | 67.55% | >160 days | 226.22 MB | 1670 | 30.94% |
| User space (PWG, scratch) | | | | | | | | | |
| /star/data01 | >60 days | 756.88 GB | 117.04 MB | 2303 | 0.01% | >160 days | 17.53 MB | 1785 | 0.00% |
| /star/data02 | >60 days | 772.28 GB | 47.31 GB | 53222 | 6.13% | >160 days | 0 | 0 | 0.00% |
| /star/data05 | >60 days | 2.12 TB | 499.59 GB | 441531 | 23.00% | >160 days | 2.52 MB | 80 | 0.00% |
| General space for projects | | | | | | | | | |
| /star/data03 | >60 days | 854.42 GB | 713.03 GB | 1535 | 83.45% | >160 days | 0.00 Bytes | 0 | 0.00% |
| /star/data04 | >60 days | 702.38 GB | 372.76 GB | 67170 | 53.07% | >160 days | 0.00 Bytes | 0 | 0.00% |
| /star/data06 | >60 days | 609.42 GB | 560.95 GB | 105179 | 92.05% | >160 days | 254.42 GB | 21256 | 41.75% |
| /star/data07 | >60 days | 731.17 GB | 544.22 GB | 558413 | 74.43% | >160 days | 505.89 GB | 457460 | 69.19% |
| /star/data08 | >60 days | 989.15 GB | 815.23 GB | 1205452 | 82.42% | >160 days | 0 | 0 | 0.00% |
| /star/rcf | >60 days | 873.58 GB | 369.68 GB | 41730 | 42.32% | >160 days | 152.46 GB | 123261 | 17.41% |
| /star/simu | >60 days | 235.25 GB | 225.87 GB | 33617 | 95.75% | >160 days | 225.80 GB | 33321 | 95.75% |
| Data was gathered during FastOffline time (0% expected) | | | | | | | | | |
| /star/data09 | >60 days | 613.75 GB | 30.19 MB | 10 | 0.00% | >160 days | 25.80 MB | 5 | 0.00% |
| /star/data10 | >60 days | 483.96 GB | 28.92 MB | 272 | 0.00% | >160 days | 0.00 Bytes | 1 | 0.00% |
| Institutions' disks | | | | | | | | | |
| institutions/bnl | >60 days | 3.39 TB | 2.08 TB | 3490235 | 61.36% | >160 days | 1.40 TB | 3082800 | 41.30% |
| institutions/lbl | >60 days | 9.60 TB | 5.19 TB | 3765404 | 54.00% | >160 days | 2.93 TB | 3130797 | 30.52% |
| institutions/mit | >60 days | 894.69 GB | 510.66 GB | 78058 | 57.01% | >160 days | 303.60 GB | 67989 | 33.93% |
| institutions/ucla | >60 days | 1.44 TB | 911.61 GB | 2770202 | 61.82% | >160 days | 761.95 GB | 2439251 | 51.61% |
| institutions/iucf | >60 days | 785.49 GB | 185.14 GB | 99491 | 23.57% | >160 days | 26.44 GB | 36131 | 3.37% |
| institutions/vecc | >60 days | 731.06 GB | 401.20 GB | 258056 | 54.88% | >160 days | 270.28 GB | 55105 | 36.94% |
| institutions/ksu | >60 days | 647.62 GB | 197.30 GB | 87715 | 30.47% | >160 days | 83.99 GB | 76374 | 12.97% |
| institutions/emn | >60 days | 881.67 GB | 80.91 GB | 37272 | 9.18% | >160 days | 36.51 GB | 9139 | 4.14% |
| institutions/uta | >60 days | 407.68 GB | 351.52 GB | 20807 | 86.22% | >160 days | 195.66 GB | 13628 | 47.99% |
| Production space | | | | | | | | | |
| /star/data12 | >60 days | 685.57 GB | 175.32 GB | 13967 | 25.57% | >160 days | 121.34 GB | 10674 | 17.70% |
| /star/data13 | >60 days | 1.64 TB | 388.72 GB | 20160 | 23.15% | >160 days | 307.55 GB | 17507 | 18.31% |
| /star/data14 | >60 days | 794.53 GB | 231.98 GB | 19797 | 29.20% | >160 days | 158.41 GB | 14020 | 19.90% |
| /star/data15 | >60 days | 791.08 GB | 231.74 GB | 16620 | 29.29% | >160 days | 128.60 GB | 11295 | 16.26% |
| /star/data16 | >60 days | 1.46 TB | 372.16 GB | 23055 | 24.88% | >160 days | 200.92 GB | 16978 | 13.44% |
| /star/data17 | >60 days | 869.25 GB | 229.98 GB | 14589 | 26.46% | >160 days | 128.80 GB | 10447 | 14.82% |
| /star/data18 | >60 days | 905.66 GB | 196.12 GB | 14115 | 21.66% | >160 days | 138.78 GB | 11227 | 15.32% |
| /star/data19 | >60 days | 696.64 GB | 156.17 GB | 12261 | 22.42% | >160 days | 88.87 GB | 8229 | 12.76% |
| /star/data20 | >60 days | 773.98 GB | 188.46 GB | 12299 | 24.35% | >160 days | 88.46 GB | 8298 | 11.39% |
| /star/data21 | >60 days | 710.00 GB | 189.90 GB | 12989 | 26.75% | >160 days | 111.88 GB | 9443 | 15.76% |
| /star/data22 | >60 days | 706.21 GB | 199.16 GB | 15688 | 28.19% | >160 days | 143.76 GB | 12504 | 20.36% |
| /star/data24 | >60 days | 798.54 GB | 213.04 GB | 17424 | 30.17% | >160 days | 151.88 GB | 13661 | 19.02% |
| /star/data25 | >60 days | 739.33 GB | 186.02 GB | 13804 | 25.17% | >160 days | 124.57 GB | 10678 | 16.86% |
| /star/data26 | >60 days | 798.36 GB | 528.72 GB | 13277 | 66.23% | >160 days | 497.14 GB | 11098 | 62.27% |
| /star/data27 | >60 days | 753.46 GB | 191.43 GB | 13667 | 25.36% | >160 days | 106.11 GB | 9768 | 14.08% |
| /star/data28 | >60 days | 797.47 GB | 204.35 GB | 13886 | 25.60% | >160 days | 118.36 GB | 9780 | 14.85% |
| /star/data29 | >60 days | 782.30 GB | 219.58 GB | 14096 | 28.08% | >160 days | 133.82 GB | 9763 | 17.11% |
| /star/data30 | >60 days | 812.72 GB | 122.56 GB | 8139 | 15.08% | >160 days | 61.27 GB | 4771 | 7.54% |
| /star/data31 | >60 days | 763.81 GB | 128.72 GB | 9629 | 16.78% | >160 days | 79.06 GB | 6735 | 10.35% |
| /star/data32 | >60 days | 1.54 TB | 353.22 GB | 23165 | 22.38% | >160 days | 178.26 GB | 14833 | 11.30% |
| /star/data33 | >60 days | 762.93 GB | 148.00 GB | 9974 | 19.42% | >160 days | 27.76 GB | 3957 | 3.64% |
| /star/data34 | >60 days | 1.45 TB | 428.74 GB | 26467 | 28.83% | >160 days | 208.23 GB | 15771 | 14.02% |
| /star/data35 | >60 days | 1.43 TB | 441.51 GB | 25620 | 30.15% | >160 days | 17.36 GB | 1251 | 1.19% |
| /star/data36 | >60 days | 1.42 TB | 476.07 GB | 28960 | 32.74% | >160 days | 256.63 GB | 21073 | 17.65% |
| /star/data37 | >60 days | 1.33 TB | 388.87 GB | 26653 | 28.55% | >160 days | 200.03 GB | 16368 | 14.69% |
| /star/data38 | >60 days | 1.52 TB | 449.63 GB | 27038 | 28.85% | >160 days | 164.45 GB | 13917 | 10.57% |
| /star/data39 | >60 days | 1.47 TB | 446.78 GB | 26317 | 29.63% | >160 days | 150.30 GB | 14513 | 9.98% |
| /star/data40 | >60 days | 1.62 TB | 446.26 GB | 22080 | 26.90% | >160 days | 134.60 GB | 10104 | 8.11% |
| /star/data41 | >60 days | 1.70 TB | 523.73 GB | 29711 | 30.04% | >160 days | 188.62 GB | 15561 | 10.84% |
| /star/data42 | >60 days | 1.56 TB | 409.76 GB | 26833 | 25.65% | >160 days | 140.44 GB | 13362 | 8.79% |
| /star/data43 | >60 days | 1.68 TB | 434.58 GB | 28622 | 25.26% | >160 days | 213.89 GB | 17479 | 12.43% |
| /star/data44 | >60 days | 1.70 TB | 500.49 GB | 28955 | 28.75% | >160 days | 181.79 GB | 15550 | 10.44% |
| /star/data45 | >60 days | 1.58 TB | 538.92 GB | 31417 | 33.25% | >160 days | 259.11 GB | 19413 | 16.02% |
| /star/data46 | >60 days | 5.34 TB | 1.76 TB | 52450 | 32.96% | >160 days | 737.01 GB | 32405 | 13.48% |
| /star/data47 | >60 days | 6.07 TB | 3.14 TB | 74679 | 51.73% | >160 days | 564.87 GB | 23736 | 9.09% |
| /star/data48 | >60 days | 5.18 TB | 2.49 TB | 82906 | 48.07% | >160 days | 887.51 GB | 42679 | 16.73% |
| /star/data53 | >60 days | 1.34 TB | 492.05 GB | 30129 | 35.86% | >160 days | 248.89 GB | 23693 | 18.14% |
| /star/data54 | >60 days | 1.20 TB | 420.97 GB | 28666 | 34.26% | >160 days | 294.70 GB | 22135 | 23.98% |
| /star/data55 | >60 days | 759.39 GB | 475.69 GB | 20009 | 62.58% | >160 days | 258.48 GB | 17584 | 34.04% |
This page contains results from various IO performance tests made on the RCF hardware. The performance tests are based on two programs, described below.
Two test nodes, named eastwood and newman, were configured for testing in the summer of 2008.
Testing is performed under Scientific Linux 5.1 unless otherwise stated.
The basic hardware specs common to both nodes:
Manufacturer: Penguin Computing
Model: Relion 230
Dual Intel(R) Xeon(TM) CPU 3.06GHz, w/ Hyper-threading
2 GB RAM, PC2100
6 IDE disks, 200GB each, all Western Digital "Caviar" series, model numbers starting with WD2000JB (there are some variations in the sub-model numbers, but I could find no documentation as to the distinctions amongst them). Manufactured in late 2003 or early 2004.
The disks are on two controllers:
--onboard, Intel Corporation 82801CA Ultra ATA Storage Controller with 2 channels
--PCI card, Promise Technology, Inc. PDC20268 (Ultra100 TX2)
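(For reference, the controller and drive identification above can be cross-checked from the OS with standard tools, e.g.:)
lspci | grep -i -E 'ide|storage'      # should list both the onboard 82801CA and the Promise PDC20268
hdparm -i /dev/hda | grep -i model    # reports the drive model/firmware string (WD2000JB-...)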
There are some RAID configurations that would not make much sense, such as combining a master and slave from the same channel into an array, because of the inherent limitation of IDE/ATA that the master and slave on a single channel cannot be accessed simultaneously.
Initial testing configuration is as follows:
On eastwood, no RAID is configured, and all drives are independent (other than the IDE master/slave connection). The drives are configured as follows:
On newman, two software RAID arrays are configured in addition to the system disk:
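(The array details are missing from this copy of the page. As an illustration only -- the device names, partition numbers and defaults below are assumptions rather than the recorded newman layout, although the RAID5 members hdc, hde and hdg and the ext3 filesystems on both arrays are mentioned later in the text -- software arrays of this kind are typically created with mdadm roughly as follows:)
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/hdc1 /dev/hde1 /dev/hdg1   # RAID5
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/hdf1 /dev/hdh1             # RAID0 on the Promise controller
mkfs.ext3 /dev/md0
mkfs.ext3 /dev/md1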
Here are some additional tests that have occurred to me, in no particular order:
( http://nucwww.chem.sunysb.edu/htbin/software_list.cgi?package=ioperf )
THIS PAGE DOES NOT CONTAIN MEANINGFUL NUMERICAL RESULTS YET!!! THERE ARE AT LEAST TWO PROBLEMS WITH THE IOPERF NUMBERS TO BE RESOLVED (as described below). Once resolved, this page will likely be completely rewritten.
The original IOperf code would not compile:
[root@dh34 ioperf]# make
cc -lm -o Ioperf ioperf.c
ioperf.c: In function 'main':
ioperf.c:133: error: 'CLK_TCK' undeclared (first use in this function)
ioperf.c:133: error: (Each undeclared identifier is reported only once
ioperf.c:133: error: for each function it appears in.)
ioperf.c: In function 'get_timer':
ioperf.c:477: error: 'CLK_TCK' undeclared (first use in this function)
make: *** [Ioperf] Error 1
Comments in /usr/include/time.h and /usr/include/bits/time.h lead me to think that "CLK_TCK" is just an obsolete name for CLOCKS_PER_SEC, though this struck me as odd, because the ioperf code also uses CLOCKS_PER_SEC for CLK_SCALE.
I modified ioperf.h, changing
# define SCALE CLK_TCK
to
# define SCALE CLOCKS_PER_SEC
and it compiled, but there is clearly a problem with the real timing scale factor (perhaps simply off by 1 million?).
So for now, these "real time" results are obviously not realistic numbers, BUT they would probably be correct in relative (ratio) terms -- twice as fast is twice as fast. The CPU % is meaningless for the same reason.
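(A possible explanation for the scale, though I have not verified how ioperf takes its timestamps: CLOCKS_PER_SEC is the unit of clock(), i.e. CPU time, and is fixed at 1,000,000 on POSIX/XSI systems, while the tick unit for times(), i.e. real/wall time, is sysconf(_SC_CLK_TCK), typically 100. If real time is measured in times() ticks but divided by a CLOCKS_PER_SEC-based SCALE, the real times come out roughly 10,000x too small and the "KB/sec Real" values correspondingly too large, which is at least the right ballpark for the numbers below. The tick rate is easy to check:)
getconf CLK_TCK    # ticks per second used by times(); typically prints 100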
Another limitation (and a more important one for our purposes) in the "stock" ioperf appears to be a 2GB file size limit. To get around this, I added "-D_FILE_OFFSET_BITS=64" to the CFLAGS line in the makefile. I was able to write a 39GB file after this, but there is an overflow at some point in one or more variables, because the printed file size was -1989672960 in the standard output. (Even at 8GB, the file size overflow appears.) While the html output has a different file size than the standard output, it too appears to be incorrect above a certain point. For instance, look at the third example below, in which the file size should be 8GB, but is instead only ~4GB.
With these two oddities (wrong real times and file sizes), I don't see much point in generating a lot of results at this time.
JUNE 27 update: With some wanton changes of int, long int and size_t to 'long long int' in declarations and casts in the ioperf code, I was able to get correct file sizes everywhere (?) for a test with an 8GB file size. This still leaves the timing issue unresolved, so I don't think the actual rates are meaningful yet, but now the results can at least be compared amongst the different filesystems for relative differences.
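(A simple cross-check of the reported sizes, independent of ioperf's own printout, is just to look at the test file on disk; the path here is a placeholder:)
stat -c %s /path/to/ioperf/testfile   # exact size in bytes
ls -lh /path/to/ioperf/testfile       # human-readable size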
On to some results:
Ioperf -l 5 -s 100 -n 100000 -html -w -m eastwood-hda:
(In the tables below, the Block IO group is fwrite/fread, the Character group is putc/getc, and the Random group is seek.)
Machine | KB | fwrite KB/sec Real | fwrite KB/sec CPU | fwrite %CPU | fread KB/sec Real | fread KB/sec CPU | fread %CPU | putc KB/sec Real | putc KB/sec CPU | putc %CPU | getc KB/sec Real | getc KB/sec CPU | getc %CPU | seek KB/sec Real | seek KB/sec CPU | seek %CPU |
eastwood-hda | 400000 | 372786579.68 | 141342.76 | 263746.51 | 2325581395.35 | 232558.14 | 1000000.00 | 275292498.28 | 40444.89 | 680660.70 | 507614213.20 | 50697.08 | 1001269.04 | 23418764.99 | 2341.88 | 1000000.00 |
eastwood-hda | 400000 | 376470588.24 | 141843.97 | 265428.39 | 2325581395.35 | 233236.15 | 997093.02 | 273130761.35 | 40120.36 | 680777.51 | 507292327.20 | 50664.98 | 1001268.23 | 23418764.99 | 2344.69 | 998800.96 |
eastwood-hda | 400000 | 377477194.09 | 142180.09 | 265503.81 | 2325581395.35 | 233009.71 | 998062.02 | 273473108.48 | 40053.40 | 682775.59 | 502512562.81 | 50209.21 | 1000845.49 | 23344123.51 | 2336.27 | 999200.64 |
eastwood-hda | 400000 | 377982518.31 | 142348.75 | 265541.52 | 2325581395.35 | 232558.14 | 1000000.00 | 273224043.72 | 40000.00 | 683062.62 | 500469189.87 | 50031.27 | 1000325.09 | 23320895.52 | 2332.09 | 999995.72 |
eastwood-hda | 400000 | 377073906.49 | 142450.14 | 264720.80 | 2322880371.66 | 232558.14 | 998843.93 | 273410799.73 | 39952.06 | 684352.77 | 499625281.04 | 49975.01 | 999763.80 | 23306980.91 | 2329.59 | 1000472.76 |
Ioperf -l 5 -s 100 -n 100000 -html -w -m newman-hda:
Machine | KB | fwrite KB/sec Real | fwrite KB/sec CPU | fwrite %CPU | fread KB/sec Real | fread KB/sec CPU | fread %CPU | putc KB/sec Real | putc KB/sec CPU | putc %CPU | getc KB/sec Real | getc KB/sec CPU | getc %CPU | seek KB/sec Real | seek KB/sec CPU | seek %CPU |
newman-hda | 400000 | 349650349.65 | 140350.88 | 249125.87 | 2325581395.35 | 232558.14 | 1000000.00 | 268456375.84 | 41025.64 | 654362.42 | 506970849.18 | 50697.08 | 1000000.00 | 23141291.47 | 2308.66 | 1002369.67 |
newman-hda | 400000 | 351185250.22 | 141843.97 | 247578.81 | 2325581395.35 | 232558.14 | 1000000.00 | 263504611.33 | 40962.62 | 643481.34 | 506970849.18 | 50729.23 | 999366.29 | 23086583.92 | 2305.93 | 1001184.83 |
newman-hda | 400000 | 355977454.76 | 141843.97 | 251054.37 | 2330097087.38 | 233009.71 | 1000000.00 | 259852750.11 | 41025.64 | 633791.60 | 507185122.57 | 50718.51 | 1000000.54 | 23086583.92 | 2308.66 | 1000001.87 |
newman-hda | 400000 | 353904003.54 | 142095.91 | 249160.34 | 2325581395.35 | 232558.14 | 1000000.00 | 261865793.78 | 41025.64 | 638715.02 | 507292327.20 | 50729.23 | 1000000.40 | 23086583.92 | 2307.29 | 1000592.42 |
newman-hda | 400000 | 354421407.05 | 142146.41 | 249417.40 | 2322880371.66 | 232288.04 | 1000000.00 | 263608804.53 | 41017.23 | 643131.80 | 507356671.74 | 50748.54 | 999746.51 | 23097504.73 | 2307.57 | 1000947.87 |
(All this shows is that the two machines perform almost identically, as expected, since they are the same hardware.)
Ioperf -l 5 -s 100 -n 2000000 -html -w -m eastwood-hda
Machine | KB | fwrite KB/sec Real | fwrite KB/sec CPU | fwrite %CPU | fread KB/sec Real | fread KB/sec CPU | fread %CPU | putc KB/sec Real | putc KB/sec CPU | putc %CPU | getc KB/sec Real | getc KB/sec CPU | getc %CPU | seek KB/sec Real | seek KB/sec CPU | seek %CPU |
eastwood-hda | 3805696 | 185888536.12 | 64921.46 | 286328.33 | 193546050.96 | 103697.44 | 186644.97 | 177961000.70 | 19129.87 | 930278.23 | 195073863.35 | 23687.89 | 823517.35 | 11054431.89 | 1105.44 | 1000000.00 |
eastwood-hda | 3805696 | 82291879.40 | 29158.53 | 282278.70 | 86875854.88 | 46547.99 | 186637.18 | 80702092.49 | 8594.50 | 939085.77 | 87483051.72 | 10636.85 | 822453.78 | 11054431.89 | 1105.31 | 1000118.98 |
eastwood-hda | 3805696 | 48867732.72 | 17233.71 | 283610.19 | 51358024.69 | 27474.19 | 186932.21 | 47592167.71 | 5070.20 | 938725.47 | 51647935.60 | 6283.02 | 822025.09 | 11033428.33 | 1103.21 | 1000118.75 |
eastwood-hda | 3805696 | 32029118.79 | 11267.06 | 284316.11 | 33571635.68 | 17943.67 | 187094.82 | 31230001.18 | 3316.26 | 941804.92 | 33758817.36 | 4104.79 | 822425.98 | 11039327.51 | 1103.87 | 1000059.31 |
eastwood-hda | 3805696 | 21913097.65 | 7686.38 | 285134.41 | 22880094.31 | 12236.46 | 186983.55 | 21257190.34 | 2260.60 | 940410.63 | 23025180.52 | 2798.51 | 822765.06 | 11040245.73 | 1103.97 | 1000047.45 |
I have no explanation for why the Block IO and Character results decreased so dramatically with each iteration... (FWIW, according to the ioperf man page, each iteration is an average of the current result with all previous results, so it would seem that the first iteration was much faster than subsequent iterations for some reason...) This may be a side effect of the incorrect file size, since there is an overflowed variable going into the calculations.
June 27 update: at least the file sizes are now as expected with an 8GB file after some changes to make variables long long ints:
Machine | KB | fwrite KB/sec Real | fwrite KB/sec CPU | fwrite %CPU | fread KB/sec Real | fread KB/sec CPU | fread %CPU | putc KB/sec Real | putc KB/sec CPU | putc %CPU | getc KB/sec Real | getc KB/sec CPU | getc %CPU | seek KB/sec Real | seek KB/sec CPU | seek %CPU |
newman-hda | 8000000 | 389863547.76 | 123057.99 | 316812.87 | 516295579.22 | 171710.67 | 300677.64 | 367073506.47 | 39414.69 | 931311.37 | 409563303.13 | 49200.49 | 832437.41 | 171026707.53 | 17102.67 | 1000000.00 |
First, a note -- There is some duplication in testing, because I reran all the original tests (which did not include small file operations on large files) using full coverage in the second test round. For what it's worth, I have not seen any surprises in this region.
Attached Excel files include test results for the test filesystems, using IOzone's default test suite with files up to 4GB. All test results are plotted in the attached files, however there is no attempt to make the vertical scales identical, so be sure to check the vertical scales before making any comparisons! I may attempt to add a plotting routine that will find the maximum within all test results (eg, maximum of all Writer tests, maximum of all Reader tests, etc.) and plot all the individual test results with the same maximum (ie. all Writer graphs would have the same vertical scale and coloring), so the graphs can be directly compared without having to look at the vertical scaling, but this is a bit more work.
Typical commands are:
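(The commands themselves are missing from this copy of the page; the lines below are a reconstruction based on the file-naming conventions that follow, with assumed mount points and output names, not the exact commands used:)
iozone -a -g 4g -b eastwood_hda_4G.xls -f /mnt/hda/iozone.tmp                         # default automatic suite up to 4GB, Excel-compatible output
iozone -az -e -g 4g -b eastwood_hda_full_with_flush_4G.xls -f /mnt/hda/iozone.tmp     # "full" record-size coverage, flush (-e) included in timings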
The meanings of the file name components:
"eastwood" or "newman" are the hostanmes
"hdX" or "raidX" are the device names. hda, hdg, hdh, raid0 and raid5 have ext3 filesystems, while hdc, hde and hdf have ext2.
"full" in the file name indicates test coverage throughout the test range. If "full" isn't in the file name, then the region of small operations on large file sizes was not tested.
"with_flush" indicates -e was used.
"4G" means the maximum file size tested.
"1GBRAM" means the system's RAM was reduced from 2GB to 1GB during the test. (NB -- these tests are underway as I write this.)
I have tried to make performance comparisons one by one between various filesystems and I'll describe some findings of the individual comparisons, followed by some summary thoughts. <Need to update this section more carefully>
eg, eastwood's hda (ext3/system disk) vs eastwood's hdc (ext2) or
eastwood's hde (ext2) vs eastwood's hdg (ext3) or
eastwood's hdf (ext2) vs eastwood's hdh (ext3):
Read performance is essentially indistinguishable, with a few anomalous variations here and there.
Writing to ext2 is almost universally faster than writing to ext3, which is to be expected because of the overhead of keeping the journal on ext3. Somewhat surprising to me, in writing small files (in which most or all of the work is done in memory and flushed to disk later), ext2 writes could be 1.5-2.5 times faster than ext3 writes. As the file size gets larger (exceeding the system's available RAM for caching/buffering), ext2's writing advantage diminishes to only about ~15-25% for random writes and further down to about 10% for linear writes. Writing a bunch of small chunks to a large file is less efficient, especially when a journal is involved, than writing the same data in fewer large chunks. When issuing a lot of write commands, ext2 will gain more over ext3.
eg. eastwood's hde (a "master" with ext2) vs eastwood's hdf (a "slave" with ext2) or
eastwood's hdg ("master"/ext3) vs eastwood's hdh ("slave"/ext3):
hde vs hdf is essentially indistinguishable in all tests. This is as expected -- the "master" and "slave" designations are not really meaningful terms anymore (and in fact, the terms are no longer used in recent ATA specifications). hdg has a slight (~5%) edge over hdh in most disk-bound operations -- I'm going to dismiss this as insignificant for the time being.
eg. eastwood's hda vs newman's hda:
Since the machines are identical (or very, very nearly so), little or no variation is expected, but it doesn't seem to have worked out that way... For the large file sizes (disk-bound operations), write performances are indistinguishable (if anything, eastwood has a slight edge), but for some reason reads on newman were consistently faster than on eastwood, by 20-25% (maxing out at ~55MB/sec compared to ~45MB/sec). I don't have any explanation for this. In fact, on eastwood, writing was faster than reading in comparable tests! This is certainly a surprising result...
eg., eastwood's hda (Intel 82801CA controller onboard) vs. eastwood's hdg (Promise Technology PDC20268 PCI card) or
eastwood's hdc (Intel 82801CA controller onboard) vs eastwood's hde (Promise Technology PDC20268 PCI card) :
(caveats about this comparison -- hda is a system disk (/), so may have some slight contention with system operations during the test, and the disks on the two different controllers are not exactly the same models -- the major model numbers are all the same, but the minor revision numbers are different. I couldn't find any documentation about the differences between the minor versions.)
The disks on the Promise Controller are consistently faster than those on the Intel controller. Typical performance comparison is 40-50MB/sec on the Intel controller vs 60-70 MB/sec on the Promise controller. To investigate if this is a controller difference, or a difference in the minor disk versions, I could swap some disks around and see if the performance stays with the Controller, or follows the disk around. If you read this and would like to see this tried, let me know, otherwise I won't give it a high priority. :-)
eg., eastwood's hdh vs newman's raid0 (Why cross machines instead of comparing newman-hda to newman-raid0? Because as we saw on eastwood (above), the drives on the Intel controller are consistently slower than the drives on the Promise controller. The RAID0 array on newman consists of two drives on the Promise controller, so it seems better to compare it to eastwood drives on the Promise controller, rather than a drive on the Intel controller on newman.)
The RAID0 array is 50-100% faster in almost all cases, pretty much as one would expect when both disks can be issued commands simultaneously. The overhead of software RAID in this case appears negligible. The advantage is smallest when the RAIDed disks are not necessarily accessible simultaneously, because subsequent accesses may land on a single drive. An extreme example of this appears to be in the stride-read results using small accesses, where the single disk is actually faster. (Stride reading is reading a chunk of size X, seeking ahead Y bytes, reading X bytes, seeking Y bytes again, and repeating.) For certain values of X and Y (and depending on the RAID stripe size), reads may all occur on the same disk, negating the RAID0 advantage (or perhaps even giving the single disk the edge, as may be the case for a couple of these test results, though the outperformance of the single disk in these two cases is beyond any explanation I can come up with). The IOzone documentation does not explain the relationship between the "chunk size" variable and X and Y.
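(For anyone wanting to poke at the stride-read behaviour outside of IOzone, the access pattern can be crudely emulated with dd by reading one chunk and then skipping ahead; the sizes, count and path below are arbitrary examples:)
X=65536; STRIDE=4        # read X=64KB, then skip ahead (STRIDE-1)*X = 192KB, i.e. one read every 256KB
i=0
while [ $i -lt 1000 ]; do
    dd if=/mnt/raid0/bigfile of=/dev/null bs=$X skip=$((i*STRIDE)) count=1 2>/dev/null
    i=$((i+1))
done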
The RAID5 array on newman includes disks on both controllers (specifically hdc on the Intel controller and hde and hdg on the Promise controller) and is also a mixture of minor disk versions, so there's no other filesystem to compare it to that is "fair". The hdc drive (or Intel controller) might be "crippling" it, or at least acting as a bottleneck. Compared to eastwood's hdg, the overall performance is relatively close, with the RAID5 array generally having an edge on reading, but falling behind in writing. I'm unwilling to try to draw any conclusions from these RAID5 test results, and I doubt there is any configuration of disks possible with this particular hardware to form a "good" RAID5 array. Ideally we should have three (or more) SCSI or SATA drives on a fast PCI (-X, -E, whatever) bus, which is the sort of thing we'd have with any recent server hardware.
We can look a bit at the effect(s) of parallel/multi-threaded applications on performance. I have run some tests with multiple threads accessing the disk, which is likely frequently the case with STAR offline database servers. Some analysis will follow shortly...
Five samples of the following:
On eastwood:
On newman:
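(The original command lines are not reproduced in this copy; the two result columns below are exactly what hdparm prints for its -T "Timing cached reads" and -t "Timing buffered disk reads" tests, so each sample was presumably of this form -- the device lists and LV path are assumptions:)
hdparm -tT /dev/VolGroup00/LogVol00 /dev/hda1 /dev/hdc1 /dev/hde1 /dev/hdf1 /dev/hdg1 /dev/hdh1   # eastwood (example)
hdparm -tT /dev/VolGroup00/LogVol00 /dev/md0 /dev/md1                                             # newman (example)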
| Device | Timing cached reads (MB/s) | Timing buffered disk reads (MB/s) |
| --- | --- | --- |
| eastwood - LV on hda | 1162.59 +/- 3.84 | 54.39 +/- 0.42 |
| eastwood - hda1 | 1167.32 +/- 2.31 | 54.55 +/- 0.21 |
| eastwood - hdc1 | 1168.34 +/- 4.75 | 35.95 +/- 0.68 |
| eastwood - hdc1 *** | 1027.43 +/- 4.89 | 42.65 +/- 0.39 |
| eastwood - hde1 | 1167.39 +/- 3.41 | 57.5 +/- 0.04 |
| eastwood - hdf1 | 1167.72 +/- 2.03 | 59.34 +/- 0.26 |
| eastwood - hdg1 | 1166.46 +/- 2.63 | 57.27 +/- 0.09 |
| eastwood - hdh1 | 1166.20 +/- 3.59 | 54.36 +/- 0.10 |
| newman - LV on hda | 1180.26 +/- 10.32 | 51.70 +/- 1.51 |
| newman - md0 (RAID5) | 1211.01 +/- 43.51 | 66.67 +/- 0.30 |
| newman - md1 (RAID0) | 1197.65 +/- 15.77 | 113.67 +/- 0.45 |

*** -- for this test, I moved this disk to the Promise controller (in position hde1) to see if it would improve. While there appears to be improvement in this test, the IOzone results show no improvement. Even with this "improved" hdparm result, it is clear this disk really is inferior for some reason.
Items of note:
Here are the test commands, using 1GB write/read and then 10GB write/read:
time dd if=/dev/zero of=/$drive/test.zero bs=1024 count=1000000
time dd of=/dev/zero if=/$drive/test.zero bs=1024 count=1000000
time dd if=/dev/zero of=/$drive/test.zero bs=1024 count=10000000
time dd of=/dev/zero if=/$drive/test.zero bs=1024 count=10000000
This sequence was run once on eastwood and five times on newman. (hda on eastwood was tested twice (by accident, but with surprisingly different results in the 10GB tests)). This is essentially an idealized test, the results of which are unlikely to be matched in a production system -- during this test there should be little or no rotational or seek latency (because the disks are mostly empty and the reading and writing proceeds sequentially, rather than randomly, plus there should be no contention from multiple processes.)
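(For the record, the repeated newman samples amount to wrapping the sequence above in a loop along these lines; the mount-point names are illustrative, /dev/null is used here as the read sink in place of /dev/zero, which behaves the same for writes, and this is a sketch rather than the exact procedure used:)
for run in 1 2 3 4 5; do
    for drive in hda raid0 raid5; do
        for count in 1000000 10000000; do      # ~1GB and ~10GB
            time dd if=/dev/zero of=/$drive/test.zero bs=1024 count=$count
            time dd if=/$drive/test.zero of=/dev/null bs=1024 count=$count
        done
        rm -f /$drive/test.zero
    done
done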
DRIVE | OPERATION | REAL (s) | USER (s) | SYS (s) | dd TIME (s) | dd RATE (MB/s) |
eastwood - hda | 1GB write | 9.848 | 0.436 | 7.947 | 9.84596 | 104 |
eastwood - hda | 1GB read | 2.426 | 0.401 | 2.026 | 2.42409 | 422 |
eastwood - hda | 10GB write | 287.458 | 5.170 | 88.529 | 286.89 | 35.7 |
eastwood - hda | 10GB read | 289.671 | 3.647 | 24.788 | 289.328 | 35.4 |
eastwood - hda | 1GB write | 12.032 | 0.512 | 8.462 | 11.0798 | 92.4 |
eastwood - hda | 1GB read | 2.734 | 0.402 | 2.333 | 2.73233 | 375 |
eastwood - hda | 10GB write | 203.015 | 5.146 | 89.103 | 202.397 | 50.6 |
eastwood - hda | 10GB read | 192.057 | 4.195 | 24.033 | 191.866 | 53.4 |
eastwood - hdc | 1GB write | 6.242 | 0.386 | 4.273 | 6.22181 | 165 |
eastwood - hdc | 1GB read | 2.695 | 0.434 | 2.262 | 2.69342 | 380 |
eastwood - hdc | 10GB write | 232.640 | 3.874 | 44.764 | 232.588 | 44 |
eastwood - hdc | 10GB read | 253.590 | 3.583 | 26.264 | 253.474 | 40.4 |
eastwood - hdc (moved to Promise controller)*** | 1GB write | 4.594±0.011 | 0.4582±0.009 | 4.136±0.015 | 4.592±0.011 | 223±0.7 |
eastwood - hdc (moved to Promise controller)*** | 1GB read | 2.604±0.010 | 0.4512±0.015 | 2.154±0.015 | 2.602±0.010 | 393.4±1.5 |
eastwood - hdc (moved to Promise controller)*** | 10GB write | 231.193±0.923 | 4.3206±0.095 | 45.051±0.144 | 231.146±0.907 | 44.28±0.16 |
eastwood - hdc (moved to Promise controller)*** | 10GB read | 222.119±0.458 | 3.758±0.048 | 26.514±0.724 | 222.074±0.451 | 46.14±0.09 |
eastwood - hde | 1GB write | 5.920 | 0.377 | 4.295 | 5.86033 | 175 |
eastwood - hde | 1GB read | 2.680 | 0.415 | 2.266 | 2.67825 | 382 |
eastwood - hde | 10GB write | 185.871 | 4.072 | 44.518 | 185.808 | 55.1 |
eastwood - hde | 10GB read | 192.924 | 3.648 | 25.640 | 192.841 | 53.1 |
eastwood - hdf | 1GB write | 5.864 | 0.395 | 4.293 | 5.8289 | 176 |
eastwood - hdf | 1GB read | 2.691 | 0.401 | 2.290 | 2.68856 | 381 |
eastwood - hdf | 10GB write | 174.073 | 3.899 | 44.818 | 174.004 | 58.8 |
eastwood - hdf | 10GB read | 282.014 | 3.847 | 27.576 | 281.878 | 36.3 |
eastwood - hdg | 1GB write | 11.149 | 0.489 | 8.396 | 11.1013 | 92.2 |
eastwood - hdg | 1GB read | 2.721 | 0.440 | 2.281 | 2.71868 | 377 |
eastwood - hdg | 10GB write | 181.620 | 5.316 | 87.690 | 181.573 | 56.4 |
eastwood - hdg | 10GB read | 183.700 | 3.880 | 24.359 | 183.613 | 55.8 |
eastwood - hdh | 1GB write | 10.714 | 0.536 | 8.175 | 10.6598 | 96.1 |
eastwood - hdh | 1GB read | 2.710 | 0.427 | 2.284 | 2.7075 | 378 |
eastwood - hdh | 10GB write | 190.202 | 5.180 | 87.511 | 190.156 | 53.9 |
eastwood - hdh | 10GB read | 194.078 | 4.013 | 24.908 | 194.012 | 52.8 |
newman - hda | 1GB write | 10.670±1.308 | 0.504±0.026 | 8.242±0.065 | 10.628±1.281 | 97.6 ±12.99 |
newman - hda | 1GB read | 3.582±1.222 | 0.440±0.012 | 2.281±0.041 | 2.720±0.043 | 376.6 ±6.0 |
newman - hda | 10GB write | 269.355±5.300 | 5.173±0.131 | 88.529±0.373 | 268.738±5.294 | 38.1 ±0.7 |
newman - hda | 10GB read | 211.516±4.548 | 4.331±0.263 | 24.237±0.281 | 211.124±1.035 | 48.5 ±1.0 |
newman - raid0 | 1GB write | 10.012±0.459 | 0.525±0.029 | 8.605±0.128 | 9.972±0.453 | 102.9 ±4.6 |
newman - raid0 | 1GB read | 2.764±0.029 | 0.426±0.020 | 2.338±0.043 | 2.762±0.029 | 370.8 ±3.7 |
newman - raid0 | 10GB write | 123.691±3.427 | 5.130±0.378 | 86.677±0.700 | 123.636±3.427 | 82.9 ± 2.4 |
newman - raid0 | 10GB read | 88.485±1.196 | 3.969±0.198 | 25.252±0.263 | 88.415±1.180 | 115.8 ±1.5 |
newman - raid5 | 1GB write | 9.701±0.158 | 0.535±0.010 | 8.923±0.070 | 9.693±0.156 | 105.6 ±1.7 |
newman - raid5 | 1GB read | 2.974±0.103 | 0.439±0.029 | 2.535±0.115 | 2.971±0.104 | 345.0 ±12.2 |
newman - raid5 | 10GB write | 205.112±2.087 | 5.692±0.106 | 100.492±0.552 | 205.090±2.068 | 49.9 ±0.5 |
newman - raid5 | 10GB read | 156.661±0.334 | 4.528±0.343 | 38.091±1.381 | 156.623±0.347 | 65.4 ±0.2 |
Observations:
Some conclusions and general suggestions, some based on the data, some on general principles:
1. RAID0 can significantly improve performance over a single drive, if it isn't bottlenecked by some other limitation of the hardware. Of course, with X drives in the array, it is roughly X times more likely to suffer a failure than a single drive: if each drive independently fails with probability p over a given period, the chance of losing the array is 1 - (1-p)^X ≈ X·p for small p, and with RAID0 any single-drive failure loses the entire array. From the test results so far, not much can be said about RAID5. On principle however, RAID5 should outperform bare drives and be a bit worse than RAID0. Like RAID0, it should improve as more drives are added (within reason).
2. The limitations of P-ATA/IDE drives and controllers are something to watch out for when planning drive layouts. (Fortunately, IDE is obsolete at this point, so this is unlikely to be a factor in future purchases.)
3. Those filesystems that will be accessed the most should be put on disks in such a way as to allow simultaneous access if at all possible. In the case of STAR database servers, /tmp (or wherever temp space is directed to), swap space and the database files (and, if possible, system files) should all be on individual drives (and, in the case of IDE drives, on separate channels). (Of course, if servers are having to go into swap space at all, then performance is already suffering, and more RAM would probably help more than anything else.)
4. For the STAR database server slaves, if they are using tmp space (whether /tmp or redirected somewhere else), then as with swap space, more RAM would likely help, but to gain a bit in writing to the temp space, make it ext2 rather than ext3, whether RAIDed or not. Presumably it does not matter if the filesystem gets corrupted during a crash -- at worst, just reformat it. It is, after all, only temp space... (A minimal sketch of such a setup follows below.)