Scheduled tasks for the network upgrade day and beyond

DOWNTIME IS SCHEDULED FOR JUNE 25th, STARTING 6:00 AM Eastern Time - ETA FOR RETURN TO SERVICE IS 5 PM (TBC).

Intro

The scheduled maintenance day as announced here is aimed at reshaping the RACF farm LAN. STAR will schedule additional work.
The STAR work plan was set in motion weeks ago and included replacement of the old DB nodes and an upgrade of the Web server.
The RCF work plan converged the week prior (and a bit more work is being scheduled by other teams, leveraging the global downtime).
Below are the current status and the plan for the maintenance day.

STAR work plan

Current status (Phase I)

The current status of the node replacement is that we phased in db10, db06 and db07 as 1:1 replacements and added db02, db04 and db05. In short, DB has given back 3 nodes so far and gained 3. db06 and db07 still need to be put into operation.

June 25th plan

Phase I - to be completed on June 25th

  • sun.star is assumed to be fully functional and tested using the ww2.star and drupal2.star aliases (TODO list Web server readiness)
  • On June 25th, enable sun.star as our primary Web server
    www.star and drupal.star will point to that node
    Note: THIS NEEDS TO BE ARRANGED WITH ITD [Will be taken care of by Dan F. @ ITD, confirmed 2013/06/16]
  • ON THAT DAY: A few nodes need to be moved - see image below (orange color)
    ATTENTION: heston needs to be moved above db10 AND robinson and omega swapped (all the other nodes shift by one slot)
  • NOTES:
    • 2013/06/21 - db06 and db07 confirmed functional and ready
    • 2013/06/21 - db03 could be removed on the 25th (already taken care of) - this will allow sliding in a new 2U right away on that day for the duvall/heston replacement

Phase II & III

A lot of work remains after the actions above. Once the physical work is done, all nodes will be regrouped to allow phase-in and rolling upgrades. For each step, the nodes marked for replacement are shown in light purple, and the work to be done in that phase is shown in green (any green), from lighter to darker green.

A few notes:
  • The plan will provide more DB servers than we started with (see note above: replaced 1:1 and added +3, with each node expected to perform better than the one it replaces)
  • db03 will fold onto db02
  • heston will be re-merged with duvall - it is believed the new, more powerful nodes will sustain the number of connections implied by "nova", for example
  • robinson will be updated - this may cause a primary db update downtime (Master) but does not require hardware intervention other than setting up the node
  • At the end of the work, the old db08 could be phased out (one more replacement node possibly exists and could be used)

Whole plan in phases



 

RCF / ITD networking work plan

(Courtesy: Shigeki Misawa / plan in flux as we are discussing it; see addenda)

In addition to upgrading the Terascale switch to Exascale, a reminder that much of the routing will change in order to logically move servers to different subnets. At the end of the network re-arrangement, networks will be distributed as follows (only STAR sections shown):

  1. Switch 803 (New Exascale for Linux Farm)
    1. 130.199.206.0/24 - Star
    2. 130.199.207.0/24 - Star
  2. Switch 804 (BCF Exascale for Linux Farm)
    1. 130.199.183.0/24 - Star
    2. 130.199.200.0/24 - Star
  3. Switch 801/802/805 ("Old" routers/Central Services/BlueArc)
    1. 130.199.180.0/25 - BlueArc/StarGrid
    2. 130.199.6.0/24 - RCF Facility services
Addendum:
  • JL: suggested configuration - use 204,205,206 and 207 for STAR (discussed with ITD/Networking 2013/06/20) - confirmed [alter expectations from plan below]
In summary the changes are as follows:
  1. Subnet 130.199.206.0/23 will be split into 130.199.206.0/24 and 130.199.207.0/24. Hosts will need to be reconfigured with the new subnet mask and the new router.
  2. Linux Farm nodes on 130.199.180.0/24 will move to 130.199.206.0/24
  3. Subnet 130.199.180.0/24 will be split into 130.199.180.0/25 and 130.199.180.128/25
    1. All BlueArc EVS's will move to 130.199.180.0/25 and move to an un-tagged interface
    2. Stargrid nodes will need a network configuration change (netmask/default router)
      NB: JL confirmed 2013/06/20 - RCF personnel will take care of this
  4. Subnet 130.199.6.0/24 hosts on switch 803/804 will move to switch 805
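
The subnet splits listed above can be checked with a short sketch using Python's standard ipaddress module. This is only an illustration of the arithmetic (new prefixes and netmasks); the helper new_subnet and the sample host address are hypothetical, and the new router addresses are not given in this plan, so they are not shown.

```python
import ipaddress

# Item 1: 130.199.206.0/23 is split into two /24 networks.
farm = ipaddress.ip_network("130.199.206.0/23")
farm_halves = list(farm.subnets(new_prefix=24))

# Item 3: 130.199.180.0/24 is split into two /25 networks.
svc = ipaddress.ip_network("130.199.180.0/24")
svc_halves = list(svc.subnets(new_prefix=25))

# Print each resulting network with the netmask hosts must be given.
for net in farm_halves + svc_halves:
    print(f"{net}  netmask {net.netmask}")


def new_subnet(host, candidates):
    """Return the candidate network that contains host, or None.

    Hypothetical helper, for illustration only.
    """
    addr = ipaddress.ip_address(host)
    for net in candidates:
        if addr in net:
            return net
    return None


# Hypothetical host address, not taken from the actual inventory.
example = new_subnet("130.199.207.15", farm_halves)
print(example, example.netmask)
```

A host that keeps its address after the /23 split only needs the netmask (and default router) changed; the helper shows which of the two /24 networks it falls into.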


Except for subnet 6 hosts on 803/804, the changes do not involve any physical moves of equipment or changes to host connectivity to network jacks. The networking team is expected to start work at 6:00 AM EDT on June 25.

Additional STAR related work

  • Requested to leverage the downtime to physically place the two STAR Xrootd redirectors on two separate PDUs AND racks (the current configuration groups the two together, destroying redundancy)
  • By the same token, requested to verify all supervisors and balance them out across the 4 STAR targeted subnets