Netgear ReadyNAS Pro testing

Device being tested: 

A Netgear ReadyNAS Pro, originally purchased with three Seagate ST3500630NS disk drives.  Three additional Seagate ST3500320NS drives were added, bringing the total storage space to 2287GB, using Netgear's X-RAID2 RAID level (the factory default configuration).

 

Model:        ReadyNAS Pro Business Edition [X-RAID2]
Serial:     1YA394RW002E7
Firmware:     RAIDiator 4.2.7 
Memory:     1024 MB [4-5-5-15 DDR2]
Volume C:     Online, X-RAID2, 6 disks, 0% of 2287 GB used

The following performance options are selected:

Enable disk write cache.  (a UPS is strongly recommended with this configuration)

Disable full data journaling.

 

The following performance options are not selected:

Disable journaling.

Optimize for OS X.  (N/A for our use - NFS shares to Linux hosts)

Enable fast CIFS writes.  (N/A for our use - NFS shares to Linux hosts)

Enable fast USB disk writes.  (No USB disks are attached)

 

For initial testing, the Netgear and test nodes are all connected to a LinkSys SLM2048 switch with 1Gb/s links on a private network with nothing else attached.

The original client test nodes were IBM eServer xSeries 335 [8676ABX] nodes (onl08, onl09 and onl10.starp.bnl.gov in the DAQ room) running Scientific Linux 4.6 (kernel 2.6.9-78.0.22.ELsmp) with 1GB of RAM each.  In mid-February 2010, those nodes were replaced with Dell PowerEdge 1750s running Scientific Linux 5.3, also with 1GB of RAM each.
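
For reference, the shares are mounted on the clients with plain default NFS options - nothing tuned.  A representative mount would look roughly like the following (the NAS hostname and export path are placeholders here, since they are not recorded above):

mkdir -p /mnt/onlnas
mount -t nfs <readynas-host>:/<export-path> /mnt/onlnas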

The simplest test -- one node using dd over an NFS mount (no optimization attempted in the NFS parameters) to write an 8GB file:

 

Using puny blocks to start:

[root@onl10 ~]# time sh -c "dd if=/dev/zero of=/mnt/onlnas/onl10.out bs=512 count=16000000"
16000000+0 records in
16000000+0 records out

real    3m21.438s
user    0m9.579s
sys     2m6.996s
 

That's 39MB/s (~310Mb/s - about 1/3 of a Gigabit network link)
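
(The rates quoted here and below are just bytes transferred divided by the elapsed wall time.  For example, the 39MB/s figure above is 16,000,000 x 512 bytes over the 201.4s "real" time, which can be reproduced with:

echo "scale=2; 16000000 * 512 / 1048576 / 201.4" | bc

giving 38.79 MiB/s, i.e. ~39MB/s, or ~310Mb/s.)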

[root@onl10 ~]# time sh -c "dd of=/dev/null if=/mnt/onlnas/onl10.out bs=512 count=16000000"
16000000+0 records in
16000000+0 records out

real    2m6.603s
user    0m7.949s
sys     0m52.956s
 

Reading that back is considerably faster:  62MB/s.

 

Then 8x bigger blocks (keeping the same file size):

[root@onl10 ~]# time sh -c "dd if=/dev/zero of=/mnt/onlnas/onl10.out bs=4096 count=2000000"
2000000+0 records in
2000000+0 records out

real    2m7.147s
user    0m1.222s
sys     0m36.865s
[root@onl10 ~]# ll -h /mnt/onlnas/onl10.out
-rw-r--r--  1 nfsnobody nfsnobody 7.7G Dec 18 19:02 /mnt/onlnas/onl10.out
 

or about 62MB/sec write speed (~500 Mb/sec - half of a Gigabit network connection)

Then reading the same file back:

[root@onl10 ~]# time sh -c "dd of=/dev/null if=/mnt/onlnas/onl10.out bs=4096 count=2000000"
2000000+0 records in
2000000+0 records out

real    1m50.421s
user    0m1.459s
sys     0m24.403s
 

A bit faster:  71MB/sec read speed.

 

Now with 8x bigger blocks again (wound up doing this twice, both results shown):

[root@onl10 ~]# time sh -c "dd if=/dev/zero of=/mnt/onlnas/onl10.out bs=32768 count=250000"
250000+0 records in
250000+0 records out

real    2m17.026s   2m15.972s
user    0m0.234s     0m0.238s
sys     0m40.734s    0m37.808s
 

~57MB/s average write speed

and reading with the larger blocks:

[root@onl10 ~]# time sh -c "dd of=/dev/null if=/mnt/onlnas/onl10.out bs=32768 count=250000"
250000+0 records in
250000+0 records out

real    2m15.195s
user    0m0.325s
sys     0m18.437s
 

Reading ~57MB/s - the same as the write rate.
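
As an aside, if these single-client block-size sweeps need repeating, a small loop along these lines (a sketch only - same ~8GB total size and mount point as above) would save some retyping:

for bs in 512 4096 32768; do
    count=$(( 8192000000 / bs ))
    echo "=== bs=$bs count=$count ==="
    time dd if=/dev/zero of=/mnt/onlnas/onl10.out bs=$bs count=$count
done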

 

Next - using multiple clients...

With three clients and a large block size:

[root@onl10 onlnas]# time sh -c "dd if=/dev/zero of=/mnt/onlnas/onl10.out bs=262144 count=3125"

(The same command was executed simultaneously on onl08 and onl09, with "onl10" replaced by "onl08" and "onl09" respectively.)
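
For the record, the three writers could also be launched from a single host at (nearly) the same moment with something like the following - a sketch only, assuming passwordless root ssh to the clients, which may or may not be set up here:

for n in onl08 onl09 onl10; do
    ssh root@$n "dd if=/dev/zero of=/mnt/onlnas/$n.out bs=262144 count=3125" &
done
wait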

This appears to have crashed the ReadyNAS unit.  I cannot ssh into it, nor is the web interface working (though it does respond to pings).  Since there is no terminal output on the box itself, I have no other means of interacting with the device other than a hard reboot.  I had top running on the ReadyNAS at the time of the crash - here are the top few processes:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10921 root      15  -5     0    0    0 R   12  0.0   1:27.64 nfsd
10918 root      15  -5     0    0    0 R   12  0.0   1:29.95 nfsd
10920 root      15  -5     0    0    0 R   10  0.0   1:29.20 nfsd
  639 root      15  -5     0    0    0 R    6  0.0  61:48.62 md2_raid5
10917 root      15  -5     0    0    0 R    5  0.0   1:22.19 nfsd
 1748 root      15  -5     0    0    0 S    4  0.0   0:04.27 kjournald2
10919 root      15  -5     0    0    0 S    3  0.0   1:21.43 nfsd
  263 root      20   0     0    0    0 R    2  0.0   0:38.95 pdflush
  262 root      20   0     0    0    0 D    1  0.0   0:39.91 pdflush
10922 root      15  -5     0    0    0 S    1  0.0   1:30.60 nfsd
10289 root      20   0  2304  788  516 R    0  0.1   6:53.48 top
 

It looks like the three clients initially connected and started writing, but then something went wrong.  On the clients, the dd processes are hung - after more than 30 minutes there is still no error, and Ctrl-C does not terminate them.  According to ps, each dd is in an uninterruptible sleep state.  A forced umount (after several attempts) doesn't remove the mount, but it does terminate the dd process with an I/O error.  The ReadyNAS unit does not reboot - it simply sits with "ReadyNAS" shown on its front-panel display, with no other sign of life (except the power button's LED).  In a normal boot cycle, activity starts within a couple of seconds, so whatever is wrong appears to occur before Linux is loaded - in other words, the hardware is suspect rather than a configuration problem, though that conclusion is tentative.
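
For anyone retracing this, the checks and the forced unmount were along these lines (a sketch; the exact invocations were not recorded):

# the hung dd shows state "D" (uninterruptible sleep):
ps -eo pid,stat,cmd | grep '[d]d if=/dev/zero'
# repeated forced unmounts eventually return an I/O error to dd, though the mount remains:
umount -f /mnt/onlnas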

I sought help from the ReadyNAS users' forum and got one substantive list of suggestions to try.  The first was to verify that all the disks are alive by putting them in another machine and running some diagnostics.

<Several days have gone by...>

The box is alive again, though I have no explanation of what was going wrong.  The only thing I did was verify that all six disks were functional in a couple of PCs by running the SeaTools short tests (SMART data, SMART's short drive self-test and Seagate's short read test); no problems were indicated.  I also put the unit through a few hours of memory test cycles (17+ full passes without any errors found), using a test built into the unit.  After this, the unit booted normally (though it did resync the array for some reason - no clues in the logs as to what happened).  It has survived for several days with no load, and I am starting testing again.

---

Using the same three clients (onl08, onl09 and onl10), I tried again having the clients write simultaneously like this:

[root@onl08 ~]# time sh -c "dd if=/dev/zero of=/mnt/onlnas/onl08.out bs=262144 count=8000"

Each client writes a 2GB file.  The results are remarkably good...  (too good to be true?):

real    1m1.235s     1m1.075s     1m0.653s
user    0m0.016s    0m0.015s     0m0.018s
sys     0m8.914s     0m7.284s     0m8.730s

That is, in slightly over 1 minute, 6GB was written - which works out to ~100MB/sec, or 800Mb/sec - darn near Gigabit Ethernet line speed.
 

Bumping up the file size to 6GB by increasing the block count to 24000:

After a couple of minutes, the unit locked up again, in the same state (answering pings, but otherwise unresponsive).  Rebooting with the power button failed again, but cycling the power switch restored it without any other fuss, and it did go into a resync again (I will wait for this to finish before trying anything else).  The NFS mounts came back (and the dd commands actually finished) without any action on my part other than the power cycle.  NOTE TO SELF:  keep an eye on the temperatures next time!  It is quite warm 15 minutes after rebooting - perhaps it is overheating.

 (Feb. 18, 2010 Note - the device has been up for 42 days now (without use).)

I now have the Dell PowerEdge 1750 client nodes (using onl09, onl10 and onl11) instead of the older IBMs. 

First test, writing with a single client:

[root@onl09 nas]# time sh -c "dd if=/dev/zero of=/mnt/nas/onl09.out bs=262144 count=8000"
8000+0 records in
8000+0 records out
2097152000 bytes (2.1 GB) copied, 20.8879 seconds, 100 MB/s

real    0m20.925s
user    0m0.017s
sys     0m5.243s
 

100MB/s = 800Mb/s - pretty close to gigabit line rate. 

 

Try again with 4GB:

[root@onl09 nas]# time sh -c "dd if=/dev/zero of=/mnt/nas/onl09.out bs=262144 count=16000"
16000+0 records in
16000+0 records out
4194304000 bytes (4.2 GB) copied, 41.4118 seconds, 101 MB/s

real    0m42.013s
user    0m0.030s
sys     0m10.847s
 

Good...  I noted the temperatures on the web interface.  "Temp 2" went from 21C to 28.5C, but all other temperatures showed no change.  Temp 2 quickly fell back down to 21C.

 

My methodology has not been very rigorous, but I have managed to freeze the system with only one client writing a 4GB file.  This tells me that the problem does not stem from excessive file size, nor does it depend on having multiple clients.  (In fact, it frankly makes the thing look like a piece of junk.)  It is notable that the temperature does rise during use, but with the limited web interface it is difficult to know precisely how the temperature varies, or whether a threshold can be determined.

One very disturbing note - when the NAS freezes, it takes the clients with it!  Nodes with NFS mounts of the NAS unit become non-responsive until the ReadyNAS is rebooted, which is a very dangerous failure mode.  The client nodes are not running the latest kernel releases for SL 5, so perhaps a fix for this exists there - I really didn't expect that the failure of an NFS server could effectively halt a client like this, especially since the mount point and its contents are not in any way core components of the client OS or filesystems.  So it is back to the drawing board, if I can turn up a convenient non-production test node to continue.  I will also check for any more updates from Netgear, on the off chance there are improvements to be had there, but if this doesn't turn around pretty dramatically, I can't see ever putting this unit into service.
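
For what it's worth, the default NFS "hard" mount behavior is the likely reason the clients block indefinitely when the server dies; remounting with "soft" (and "intr" on these older kernels) should make I/O return an error instead of hanging, at the cost of possible data corruption on interrupted writes.  A sketch only - not yet tried here, with the same placeholder hostname and export path as above:

mount -t nfs -o soft,intr,timeo=30,retrans=3 <readynas-host>:/<export-path> /mnt/onlnas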

Another idea...  I can try sharing it to Windows nodes and see whether a similar usage pattern over Samba/CIFS can also freeze the unit.
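
A quick way to generate a comparable load over CIFS - whether from an actual Windows node or, as a stand-in, from a Linux client with mount.cifs - might be something like the following (a sketch; the share name and credentials are placeholders):

mkdir -p /mnt/nas-cifs
mount -t cifs //<readynas-host>/<share> /mnt/nas-cifs -o user=<user>
time dd if=/dev/zero of=/mnt/nas-cifs/cifs-test.out bs=262144 count=16000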