dean performance issues - RHEL 7, XFS, RAID and IO schedulers

Background

A new RHEL 7 installation was performed on the former backup online web server (aka dean3) in mid-September 2016. In November it was transitioned to the primary server role. While the initial deployment went reasonably smoothly, signs of trouble were showing up by mid-December. The folks needing to work interactively on the machine (e.g. Leve, me, and Dmitry Arkhipkin) noticed that the experience was unpleasant at times, with long delays in responsiveness from the shell, seemingly related to I/O operations. Simple file saves after editing could take "30..60 seconds" as Dmitry reported on December 19. Subsequently, NFS users (Tonko first, followed by others, sometimes in non-mailing-list emails) began reporting that writes of their small files over NFS were taking a long time. Much discussion ensued internally (Jerome, Mike and I), followed by an open discussion in a thread started in HyperNews by Tonko: http://www.star.bnl.gov/HyperNews-star/protected/get/rts/813.html

Pieces of the Puzzle

RHEL7

A new thing for us, though it was first released in mid-2014 and is hardly a cutting-edge Linux distribution.


ITD/Network Backups

A disk read-intensive process, obviously. The full backup process runs weekly starting Friday evenings and can take days to finish (backing up 1.7 TB, over about 10 million inodes). While initially a concern and almost certainly contributing to the slowness, it is a victim more than a culprit. Stopping it does not help much (and of course we want backups!).

cgroups

It was proposed early on as a way to help "choose winners" in the I/O wars between processes; however, nothing was tried in practice. It is my intuition that trying to use cgroups to help at this point will not yield much improvement, if any. If we had a better understanding of the fundamental cause(s), perhaps this could play a role in some sort of work-around, but as is, it does not seem like a good avenue to pursue.
 

I/O schedulers for the disk I/O

CFQ vs deadline - both have been tried and neither is a magical cure. Both have various tunable parameters - with deadline, for example, I set "writes_starved" to 1 hoping to improve the latency for write operations, but this did not seem to help. It is hard (for me at least) to intuit the effects of many of these scheduler tunables, but still, these schedulers' timescales are fractions of a second - it is hard to see how they could lead to latencies of 15-30 seconds (or more) for even small writes, or why the latency is so variable.
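For reference, a minimal sketch of how the scheduler settings mentioned above can be inspected. The helper name active_scheduler and the device name sdX are placeholders of mine, not anything specific to dean; the sysfs paths are the standard ones for the single-queue schedulers in RHEL 7. The loop only reads, so it should be safe to run as-is; the commands that change anything are left as comments.

```shell
#!/bin/sh
# Extract the active scheduler from a sysfs-style line such as
# "noop [deadline] cfq" (the bracketed entry is the one in use).
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Report the active scheduler for every block device (read-only).
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%s: %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
done

# To switch a device (placeholder name sdX) to deadline and set the
# writes_starved tunable discussed above, as root:
#   echo deadline > /sys/block/sdX/queue/scheduler
#   echo 1 > /sys/block/sdX/queue/iosched/writes_starved
```

Changes made this way do not survive a reboot, which is convenient for experimenting.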

XFS

This is the default filesystem in RHEL 7 - new for us, but hardly a novel filesystem at this point - first used in IRIX in 1994, and brought into mainline Linux in the 2002-2004 timeframe.  EXT4 is often reported to be faster than XFS, but the difference is usually not significant (that I've seen).

RAID

RAID 5 is known to be less-than-ideal for writes, especially small writes. Because of the space efficiency however, RAID 5 was intentionally selected for the "large" /export filesystem. RAID 5 was also selected for the "/" filesystem, which looking back seems to have been an outright mistake on my part - it probably should have been a RAID 10. This is relevant because all the RAID arrays are spanned across the same four HDDs - all root filesystem I/O (e.g. system logs, httpd logs) is competing for the same disks as the NFS exported filesystem.
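To put rough numbers on the small-write penalty, here is a back-of-the-envelope sketch. The per-disk figure of 100 IOPS is purely an assumption for illustration (not a measurement of our drives), and the function name small_write_iops is made up. The penalty factors are the usual textbook ones: a small RAID 5 write costs four disk operations (read old data, read old parity, write new data, write new parity), while RAID 10 costs two (one write per mirror copy).

```shell
#!/bin/sh
# Estimate random small-write IOPS for an array.
#   $1 = number of disks
#   $2 = IOPS per disk (assumed value)
#   $3 = write penalty (4 for RAID 5, 2 for RAID 10)
small_write_iops() {
    echo $(( $1 * $2 / $3 ))
}

DISKS=4            # from the text: four HDDs
IOPS_PER_DISK=100  # assumption for illustration only

echo "RAID 5 : $(small_write_iops "$DISKS" "$IOPS_PER_DISK" 4) small writes/s"
echo "RAID 10: $(small_write_iops "$DISKS" "$IOPS_PER_DISK" 2) small writes/s"
```

Under these assumptions RAID 10 would offer about twice the small-write rate of RAID 5, and since every array here lives on the same four disks, that budget is shared between "/" and /export rather than available to each.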

NFS

Mostly using NFSv4 at this point - Google results for NFS tuning are almost entirely for NFSv2/v3, so I am having a hard time figuring out if there is anything to be done here.

Ideas going forward:


  • kernel upgrade (picks up a bug fix for XFS I/O issues when used with CFQ - a shot in the dark but quick to check)
     
  • play again with the IO scheduler, and a larger read-ahead if reads are the issue (if writes are, this will do nothing at all)
     
  • XFS - It may be a red herring, but would like to look into this:
    [root@dean ~]# xfs_logprint /dev/md124 | wc
    xfs_logprint: /dev/md124 contains a mounted and writable filesystem
    xfs_logprint: unknown log operation type (2900)
    Bad data in log
    
  • XFS - go back to ext4 (community experience indicates XFS still has bugs, as demonstrated by the kernel patch I found). This will cause major downtime.
     
  • Storage hardware improvements -
    • more drives (more spindles) or higher performance drives (if there are more slots, we can add more drives; within the RAID 5, we can replace drives one by one with higher-cost, higher-performance ones)
    • e.g., 2 mirrored SSDs for the operating system, so the OS would not stall as it does now; currently all I/O goes to the same set of drives, whether for the OS or the NFS partition (as a result, even system thread creation appears in the "D" state at times)
    • Unfortunately, any additional drives would have to be external (e.g. eSATA or external SAS enclosure - either would require an add-on controller)
       
  • exotic solutions (bcache, dmcache revisited with newer kernel; past work indicated no gain)
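
Whichever of these gets tried, it would help to have a quick way to see how many processes are stuck in the "D" (uninterruptible sleep, usually waiting on I/O) state mentioned above, before and after a change. A minimal sketch, assuming a ps that accepts the stat, pid and comm output fields (the procps one in RHEL 7 does); the function name d_state_filter is made up:

```shell
#!/bin/sh
# Keep only lines whose first field (the process state) starts with "D".
# Input format: "STATE PID COMMAND", one process per line.
d_state_filter() {
    awk '$1 ~ /^D/'
}

# Snapshot of processes currently in D state (read-only).
if command -v ps >/dev/null 2>&1; then
    ps -eo stat=,pid=,comm= | d_state_filter
fi
```

Run in a loop (or under watch) during a slow spell, this gives a crude picture of which processes are blocked on I/O at that moment.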