BA to GPFS plan

Post GPFS deployment action items & status

Known issues / actions

  • 2014/06/11 - Users have already reported observing a size oddity, which is documented here: copying files from GPFS to another filesystem may indeed lead to smaller disk usage, due to the GPFS file block size overhead. Nothing of concern, however.
  • 2014/06/12 - An issue was reported whereby creating a directory took no CPU time but spanned a long wall-clock time. This was confirmed and quantified: a single directory creation took > 1 minute (1 minute + 10-15 seconds) on average. The root cause was found to be a high level of IOPS generated by users, further traced down to fsync() / O_SYNC (direct-IO) operations on file descriptors (a minimal sketch reproducing the effect is shown after this list).
    • Since none of our framework code does this by default, Condor was assumed to be the culprit - the variable ENABLE_USERLOG_FSYNC was set to false on 06/13.
    • On 06/16, we continued to observe the O_SYNC effect, indicating the culprit had not been fully identified.
    • On 10/01, a new version of the STAR Scheduler was deployed, hiding the redirect of the user's STDOUT and STDERR to a local file (and moving them at the end of the job). This provides scalable, non-fragmented IO toward GPFS.
  • 2014/10/02 - The size of /gpfs01/star/pwg (aka /star/data01&02) was increased from 25 TB usable to 50 TB usable (100 TB allocated). It is worth noting, however, that the space usage was only around 50% (27% after the size increase). Users' quotas were also adjusted from 400 GB to 500 GB.
  • 2014/10/02 - A new partition was mounted as /gpfs01/star/pwg_tasks. The partition is 50 TB usable (modulo the soft quota).
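
The sketch below (not the diagnostic actually used, and with a placeholder target directory) illustrates the fsync() / O_SYNC effect noted above: it times the creation of small files with and without an explicit fsync(), so the synchronous-IO penalty on GPFS operations can be seen directly.

    #!/usr/bin/env python
    # Minimal sketch: time small-file creation with and without fsync() on a
    # GPFS path. The target directory is a placeholder - point it anywhere
    # writable under /gpfs01. Synchronous commits burn almost no CPU but can
    # dominate wall-clock time, as observed with the directory creations.
    import os
    import time

    TARGET = "/gpfs01/star/scratch/fsync_test"   # hypothetical test location
    NFILES = 100

    def create_files(do_fsync):
        if not os.path.isdir(TARGET):
            os.makedirs(TARGET)
        start = time.time()
        for i in range(NFILES):
            path = os.path.join(TARGET, "f%04d.dat" % i)
            f = open(path, "w")
            f.write("x" * 1024)
            if do_fsync:
                f.flush()
                os.fsync(f.fileno())   # force a synchronous commit, as fsync()/O_SYNC would
            f.close()
            os.remove(path)
        return time.time() - start

    print("buffered   : %6.2f s for %d files" % (create_files(False), NFILES))
    print("fsync()'ed : %6.2f s for %d files" % (create_files(True), NFILES))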

Infrastructure Notes

  • The disk space monitoring script (aka Disk Space overview / overall) will be broken until the moves are complete (I will then update it to accommodate the reality of our space hosted on GPFS).
    Status: resolved
  • Similarly, the WaterMark file deletion is not active. Please be fair to users (when it is restored, files will be deleted according to the previous rules).
    Status: resolved
  • As noted in this post, the old loaned scratch space was dropped on 2014/06/09.
  • The "quota" command is not working on GPFS volume or is incomplete. Especially, the partitions have soft quotas (which means that more space is allocated for the partition but not all is usable). This distract / confuse  users. The request was made of the RCF to rewrite the "quota" command but it does not appear this will be done (i.e. passed it on back to me as 'if you need it, you can write it'). 2014/09/03.
  • Space allocation on /gpfs01/star/pwg to be changed to a "group"-like mechanism.
    • The famous disk space usage committee requested that institution disk space continue to be available and that a mechanism be found to allow further purchases. I have consistently refused to continue to sustain this avenue - anyone in STAR who screamed and accused me of sabotage (no less) for not providing their precious space (turned from a benefit into a "due" / "demand" nightmare) is free to prospect and find a sustainable mechanism to do so, and I wish them good luck.
    • However, (a) not all members agreed (some strongly disagreed) with continuing this mode of operation, as it may create an unfair science advantage, but (b) UCLA brought a practical use case whereby, if users A and B are in the same research group and B does not use the space, A should be able to take it - it is in fact not uncommon to have one person at an institution responsible for a larger task than anyone else (producing analysis trees, for example).
    • I proposed a compromise solution, which is to have the space available on /gpfs01/star/pwg allocated not by user but by "group". However, the implementation is non-trivial and will have to rely on the new PhoneBook (to be finalized). A native Unix secondary-groups approach (maintained by hand) will not work as it is not practical - such maintenance showed in the early 2000s that it systematically falls apart due to the dynamics of our members/institutions.
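
Since the RCF will not provide a rewritten "quota", something along the lines of the sketch below could be used in the meantime (this is an assumption, not an existing tool): it only reports filesystem-level usage via statvfs(), so it shows the allocated (soft-quota inflated) sizes rather than per-user usage; a true per-user report would need GPFS's own quota commands (e.g. mmlsquota / mmrepquota).

    #!/usr/bin/env python
    # Hypothetical sketch of a quota-like report for the GPFS partitions.
    # It reports whole-partition usage only; per-user numbers would have to
    # come from the GPFS quota commands (mmlsquota / mmrepquota).
    import os

    # Illustrative list of mount points to report on - adjust as needed.
    PARTITIONS = [
        "/gpfs01/star/pwg",
        "/gpfs01/star/pwg_tasks",
        "/gpfs01/star/scratch",
    ]

    def report(path):
        st = os.statvfs(path)
        total = st.f_blocks * st.f_frsize
        used = (st.f_blocks - st.f_bfree) * st.f_frsize
        print("%-25s %6.1f TB used / %6.1f TB allocated (%4.1f%%)" %
              (path, used / 1e12, total / 1e12, 100.0 * used / total))

    for p in PARTITIONS:
        if os.path.isdir(p):
            report(p)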

Migration schedule

The following table describes the plan used for moving the BA partitions to GPFS. The information is, at this stage, historical.

BA FS name            | Size   | GPFS mount point     | Final size       | Backward-compatible link mapping                                               | Status
/star/grid            | 55 GB  | /gpfs01/star/grid    | 60 GB            | -                                                                              | Scheduled 2014/06/11 - ABORTED
/star/starlib         | 40 GB  | /gpfs01/star/starlib | 100 GB           | -                                                                              | Scheduled 2014/06/11 - ABORTED
/star/rcf             | 7 TB   | /gpfs01/star/rcf     | 10 TB            | -                                                                              | Done 06/09
/star/simu            | 3 TB   | /gpfs01/star/simu    | 3 TB             | -                                                                              | Done 05/17
/star/data01          | 4.2 TB | /gpfs01/star/pwg     | 25 TB            | /star/data01 -> /gpfs01/star                                                   | Done 06/09
/star/data02          | 4.2 TB | (same as data01)     |                  | /star/data02 -> /gpfs01/star                                                   |
/star/data03          | 3.5 TB | /gpfs01/star/daq     | 5 TB, then 10 TB | /star/data03 -> /gpfs01/star (so daq/ maps to the same name on GPFS)           | Done 05/19
/star/data04          | 2 TB   | /gpfs01/star/usim    | 2 TB             | /star/data04 -> /gpfs01/star AND, in /gpfs01/star, a ./sim -> ./usim           | Done 06/05, checked 06/06
/star/data05          | 8 TB   | /gpfs01/star/scratch | 15 TB            | /star/data05/scratch -> /gpfs01/star/scratch (move with care, at the very end) | Done 06/04
/star/data06          | 2 TB   | /gpfs01/star/subsysg | 5 TB             | /star/data06 -> /gpfs01/star/subsysg                                           | Done 05/19
/star/data07          | 2 TB   | (same as data06)     |                  | /star/data07 -> /gpfs01/star/subsysg                                           |
/star/test02/XROOTD   | ??     | /gpfs01/star/XROOTD  | 15 TB            | no link needed - internal usage in the XROOTD management script                | Done 05/19
/star/institution/uky | 1-2 TB | /gpfs01/star/i_uky   | 2 TB             | /star/institution/uky -> /gpfs01/star/i_uky                                    | Done 2014/06/12

Notes:
  • /star/grid - ABORTED: this space is used for OSG/Grid jobs; moving it to GPFS would require a consistent mount on ALL nodes at BNL.
  • /star/starlib - ABORTED: this space contains a software repository and many small files. The GPFS filesystem block size is currently 8 MB, and the minimum space occupied by a file is one sub-block, i.e. 1/32 of the block size, so any file created in GPFS occupies at least 256 KB. For 1,000 small files, an overhead of ~ 200 MB would occur (and it could reach GBs for more files); see the sketch after this table. We will re-address this later, but it has implications for moving filesystems such as /star/u to GPFS.
  • /star/rcf - found not mounted on the stargrid nodes on 2014/06/10, but fixed.
  • /star/data05 - was not mounted on rftexp, see ticket RT 2863. Problem resolved 06/06.
  • /star/institution/uky - this partition is our "test" benchmark for institution disk space allocation and migration to GPFS (thanks to UKY for agreeing to this). Copy done 06/10; overhead of ~ 200 GB due to the large number of files (see the /star/starlib note above).
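
As a back-of-the-envelope check of the sub-block overhead quoted in the /star/starlib note, the short sketch below computes the allocation overhead for a set of small files; the block size and sub-block fraction are the quoted values, while the file count and average file size are only illustrative.

    #!/usr/bin/env python
    # Back-of-the-envelope sketch of the GPFS sub-block allocation overhead.
    # Block size and sub-block fraction are the quoted values; the file count
    # and average file size are illustrative assumptions.
    BLOCK_SIZE = 8 * 1024 * 1024        # GPFS filesystem block size: 8 MB
    SUB_BLOCK = BLOCK_SIZE // 32        # minimum allocation per file: 256 KB

    def overhead_bytes(n_files, avg_file_size):
        """Extra space consumed when every file is smaller than one sub-block."""
        per_file = max(SUB_BLOCK - avg_file_size, 0)
        return n_files * per_file

    # 1,000 files of ~ 50 KB each: roughly 200 MB of allocation overhead.
    print("overhead ~ %.0f MB" % (overhead_bytes(1000, 50 * 1024) / 1e6))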

The entries below are spaces ported from BlueArc to GPFS following the initially planned usage (and after resolving the access pattern oddities discovered during the initial testing):
BA FS name                    | Size  | GPFS mount point  | Backward-compatible link mapping          | Status
/star/institution/lbl         | 10 TB | /gpfs01/star/i_$1 | /star/institution/$1 -> /gpfs01/star/i_$1 | Done 2014/07/15
/star/institution/lbl_scratch | 9 TB  | /gpfs01/star/i_$1 | /star/institution/$1 -> /gpfs01/star/i_$1 | Done 2014/07/15
/star/institution/lbl_prod    | 12 TB | /gpfs01/star/i_$1 | /star/institution/$1 -> /gpfs01/star/i_$1 | Done 2014/07/15
/star/institution/ucla        | 7 TB  | /gpfs01/star/i_$1 | /star/institution/$1 -> /gpfs01/star/i_$1 | Done 2014/07/15
/star/institution/bnl         | 16 TB | /gpfs01/star/i_$1 | /star/institution/$1 -> /gpfs01/star/i_$1 | Done 2014/07/21
/star/institution/npiascr     | 18 TB | /gpfs01/star/i_$1 | /star/institution/$1 -> /gpfs01/star/i_$1 | Done 2014/07/21
{All but EMN}                 |       | /gpfs01/star/i_$1 | /star/institution/$1 -> /gpfs01/star/i_$1 | Done 2014/07/23

The $1 notation acts as a regexp-like placeholder for the institution name: all spaces were copied to GPFS preserving the initial structure. The final GPFS sizes for the lbl, lbl_scratch and lbl_prod partitions were 10, 9 and 12 TB respectively.

Steps and general information

The migration from BlueArc to GPFS has begun. This migration is happening on the so-called "old" system while the new systems are being prepared for prime time. Note that this migration does NOT need to wait for the new systems to be online: the new systems can be added to the GPFS cluster and the data migrated transparently in the background from old to new. Hence, the current ongoing migration only needs to make the space "fit"; any later GPFS-to-GPFS move will be totally transparent.

Historical statement: the total space of the new system should be about ~ 700 TB of usable space (with replication x2), based on a 640+160 = 800 TB total space for a 4+1 server system (i.e. 5 servers, but we may configure one as a "backup") and a percentage of space left unused for overhead. In other words, each server holds 160 TB of storage that can be partitioned and formatted in whatever way is seen as optimal. The final space will depend on the choice of optimal formatting of the storage (the RAID is hardware based, unlike the older system which used software RAID). That final number will be known sometime in the week of 14W22.
  • Status: The formatting was done on 2014/06/09 - the total size is 1.424 PBytes, i.e. a storage space of minimally 712 TB (with replication x2). If the x2 replication is removed, the usable space doubles, hence falling within the [712, 1424] TB range. Note that each partition may or may not be set to use replication; at the first installment, all partitions will be set with a replication of x2 (a quick check of these figures follows).
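
As a quick check of the usable-space figures above (the raw capacity is the quoted 1.424 PB; the two replication factors simply bracket the quoted range):

    #!/usr/bin/env python
    # Quick check of the usable-space figures quoted above. The raw capacity
    # is the quoted 1.424 PB; replication factors of 2 and 1 bracket the
    # [712, 1424] TB range for the usable space.
    RAW_TB = 1424

    for replication in (2, 1):
        print("replication x%d -> %4d TB usable" % (replication, RAW_TB // replication))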

In a second and later phase, the system network back-end will be set up to provide a 2x10 Gb Ethernet link to the newly purchased Aristas (the initial installation provides only a 1 Gb link and no network path redundancy). At this stage, we measured up to 600 MB/sec per server (so 1/2 of the planned bandwidth capacity). The ETA for this final network modification and configuration has not been decided yet (awaiting the facility manager's confirmation of a PDU work). However, the network downtime will be combined with a major network reshape, and the facility will be down for a day.
  • The "new" storage was merged to the "old" one on 06/09 - transparent migration of files from old to new has begun. Users should not be affected. However, "some" users had directories on both data01 and data02 and the merging made the quota reach its limit (see RT #2868).
  • On 2014/06/16, the data migration from old to new began - estimates show it will take ~ 1.5 weeks, with no load impact on users.