Online network reshape: meeting notes for the week of Oct. 19, 2009

Three meetings were held this week to discuss the STAR online networking reshape plans.

The first meeting included Jeff Landgraf, Wayne Betts, Dan Orsatti (ITD) and Frank Burstein (ITD).  At this meeting the ITD network engineers presented two proposals for core network components, based on information previously provided to them by STAR.  One option was Force-10 based and the other Cisco-based, with costs of approximately $150,000 and $100,000 respectively.  Both included shared infrastructure for the DAQ/TRG and STARP networks, with redundant switches in the DAQ room to handle the two networks and meet DAQ’s relatively high performance needs there.  These ITD options are generally smart, expandable, highly configurable and well supported by ITD, and they meet the initial requirements.

However, in informal discussions since then, Bill Christie suggested that we consider the possibility of radiation damage and/or errors in any electronic equipment in the WAH.  While this had been mentioned as a possibility in the past, it was not generally taken seriously by those of us in STAR looking after the networks, nor is there any way for us to test it to a standard of “beyond reasonable doubt” (or any other standard, really).  At Bill’s suggestion, we (Jeff L., Wayne B., Jack E., Yuri G. and Bill C.) met with three members of C-AD’s networking group, who stated they were certain that radiation could impair switches and strongly advised that the equipment ITD proposed was inappropriate for a radiation area.  They also relayed feedback from individuals at two other laboratories that networking equipment in radiation areas is subject to upsets, with one explanation involving charge trapping in metal-oxide semiconductors.  At face value that explanation would suggest that newer (thus generally smaller) electronic components would be less susceptible; however, my intuition is that smaller electronics are denser and more easily upset by smaller deposited charges, and thus might be more susceptible.

Here are excerpts from the other labs:

From JLab:  "The flash memory loses its ability to hold data, making it
useless. We have worked around the problem by pulling cable or fiber
back to lower radiation areas wherever we can. Because we made these
cabling changes when we were only using cisco fixed-configuration
100Mbit switches ( 29XX models), I have no data for Gigabit switches.
Since our experience is that it's the flash memory that fails, I'd
expect no better performance from any other switches. All of our
switches that use modular supervisor modules are outside of radiation
areas."

From Fermilab:  "The typical devices used employ metal oxide
semiconductors and the lock up happens when ionizing radiation is
trapped in the gate region of the devices. We see this happen at our two
detectors (CDF and DZero) when losses go up and power supplies circuits
latch up. The other thing working in the positive direction is that when
IC feature sizes go down, there is less likelihood for the charge to get
trapped so they are more radiation tolerant. Having said all that I
can't answer your specific question because we don't put switches or
routers in the tunnel at all."
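For what it's worth, the two competing intuitions above can be phrased in standard single-event-upset terms (this is a generic textbook formulation, not something from these meetings): a node upsets when the collected charge exceeds its critical charge,

    Q_coll > Q_crit,    with Q_crit ≈ C_node × V_dd

Shrinking feature sizes reduce the volume in which charge can be deposited or trapped (the Fermilab point), but they also reduce the node capacitance and operating voltage, and hence Q_crit (the density worry), so the net effect is not obvious from scaling arguments alone.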

All this said, the general consensus was that we should move as much “intelligence” as far from the beam line as reasonably possible.  (Until now, the “big” switches on the platform have actually been about as close to the beam line as possible!)  This means putting any switches in rack rows 1C.  Given both the cost and the radiation concern, we (the STAR personnel) agreed to investigate less expensive switches than those ITD suggested, while still trying to provide some level of intelligence for monitoring.  We also reached a consensus that the DAQ/TRG and STARP networks should use common hardware wherever possible, and that, as time permits, we should remove as many SOHO-type unmanaged switches as possible, replacing them with well-documented and labelled patch panels feeding back to core switches.  The C-AD personnel also recommended Cisco’s 2950, 2960 and 3750 switches, and Garrett products in general.  One more miscellaneous tidbit, from Jack: we should avoid LanCast media converters.
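To make “some level of intelligence for monitoring” concrete: unlike the SOHO-type unmanaged switches we want to retire, managed switches such as the Cisco 2950/2960/3750 expose per-port traffic counters over SNMP.  Below is a minimal sketch of polling those counters with the stock net-snmp snmpget tool; the switch hostname and community string are placeholders rather than actual STAR values, and it assumes read-only SNMP has been enabled on the switch.

    #!/usr/bin/env python
    # Minimal sketch: read per-port traffic counters from a managed switch
    # over SNMP via the net-snmp "snmpget" tool.  The hostname and community
    # string are placeholders, not real STAR values.
    import subprocess

    SWITCH = "starp-sw-1c.example"   # hypothetical managed switch
    COMMUNITY = "public"             # read-only SNMP community (placeholder)

    def snmp_get(oid):
        """Fetch one OID; -Ovq makes snmpget print only the bare value."""
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Ovq", SWITCH, oid])
        return out.decode().strip()

    # Walk the first eight interfaces and print their byte counters.
    for port in range(1, 9):
        rx = snmp_get("IF-MIB::ifInOctets.%d" % port)
        tx = snmp_get("IF-MIB::ifOutOctets.%d" % port)
        print("port %2d: rx=%s tx=%s octets" % (port, rx, tx))

An unmanaged switch offers no such counters at all, which is the main argument for the patch-panels-plus-core-switches approach.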

The final meeting of the week included Jerome, Wayne and Matt Ahrenstein; Jerome was briefed on the two prior meetings and generally agreed with the direction we are taking.  At this meeting we also selected an additional area to try to clean up before the run: the racks on the west side, where there are at least four 8-port unmanaged switches (three on DAQ/TRG and one on STARP).  Jerome also suggested we consult with Shigeki from the RACF about the whole affair, and he is trying to arrange such a meeting as soon as possible.

In addition, Jeff has stated that while either ITD solution would meet DAQ's needs for several years, he believes he can obtain adequate performance for far less money with lower-end equipment.  Here is Jeff's latest on the DAQ needs for the network:

"My target is 20Gb/sec network capability across switches.   In likely 
scenarios, the network capability would be significantly higher than 
this because hi bandwidth nodes would all be on the same switch 
(ironically, the cheaper switches mostly seem to be line-speed switches 
internally, unlike the big cisco switches...)    However, in the current 
year, I'll have a hard limit of 12 gigabit ethernet cards incoming on 
EVBs for a hard max of 12Gb/sec.    The projected desired data, 
according to the trigger board is around 6Gb/sec (600MB/sec).   I don't 
expect much more than a factor of two through the EVBs above this 
600MB/sec in the lifetime of STAR (meaning current TPC + HFT + FGT), 
although there are big uncertainties particularly for the HFT.     The 
one lump in the planning involves potential L3 farms - and I don't know 
how this will play out.   There are many scenarios some of which would 
not impact the network (ie... specialized hardware plugged into the TPX 
machines...),  but my current approach is that the network needs will 
have to be incorporated in the L3 farm design plan..." 
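As a quick sanity check on these numbers, here is a back-of-the-envelope sketch.  The 10-bits-per-byte wire conversion is an assumption chosen to match the 600MB/sec ≈ 6Gb/sec equivalence in the quote (8 data bits plus protocol overhead); the other figures come straight from Jeff.

    # Back-of-the-envelope check of the DAQ bandwidth figures quoted above.
    # Assumption: ~10 bits per byte on the wire (8 data bits plus protocol
    # overhead), chosen to match the quote's 600MB/sec ~ 6Gb/sec equivalence.
    WIRE_BITS_PER_BYTE = 10

    target_gbps   = 20.0              # design target across switches
    evb_hard_max  = 12 * 1.0          # 12 gigabit NICs on the EVBs, Gb/sec

    desired_gbps  = 600.0 * WIRE_BITS_PER_BYTE / 1000.0   # ~6 Gb/sec
    lifetime_gbps = 2 * desired_gbps  # "not much more than a factor of two"

    print("EVB hard max this year : %.0f Gb/sec" % evb_hard_max)
    print("Projected desired rate : %.0f Gb/sec" % desired_gbps)
    print("Factor-of-two lifetime : %.0f Gb/sec" % lifetime_gbps)
    print("Headroom under target  : %.0f Gb/sec" % (target_gbps - lifetime_gbps))

On these assumptions the lifetime estimate (~12Gb/sec) sits under the 20Gb/sec target with roughly 8Gb/sec of headroom, consistent with Jeff's view that lower-end line-speed switches can suffice, the L3 farm question aside.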

Where does this leave us?  We need to quickly evaluate options for the “big” switches for the DAQ room and the South Platform.  The DAQ and Trigger groups have 3(?) similar managed switches that might be adequate for the South Platform (including a spare), and we should look into the Cisco models suggested by C-AD.  We should also let ITD make another round of suggestions based on our discussions to date, focusing especially on what to do with the large ITD switch in the DAQ room that currently carries the link to the rest of the campus “public” network.  And we need to do all of this quickly.