EPD EQ3 Issues

Executive Summary: The QT32C board in EQ3 slot 0x1C is not functional.  It should be replaced, but there is no spare.  The plan is to use empty channels on other boards in EQ3 to collect the data from the tiles associated with this board.

Note: Boards in EQ3 0x1C and 0x1D were swapped starting with run 22118058 and ending with run 22125017.

Details:

Eleanor noticed that starting with run 22048040 there was a significant rate of errors (>10%) showing up in the EQ3-EQ029 to EP002 connection (address 0x1C in EQ3).  Up until that point, the link had operated in a similar manner to previous years.  It should be noted that this was the first run after the access on Wednesday, February 17th (day 048).
 

Figure 1: Bit checking errors for 3 boards in EQ3 for run 22048040.  EQ025 and EQ02A are also both in EQ3; they show that one generally doesn't expect any errors.

There were no obvious issues in the ADC/TAC values for this board, so we did not do anything until work started on the centrality trigger, at which point the issue became more noticeable.


Figure 2: Left-hand plots are from run 22048014 (before the access); right-hand plots are from run 22048040 (after the access).  Channel 6 is unused (so it is allowed to float and reads high).  Channel 5 (coming from EQ029) looks normal on the left, but is stuck at zero on the right.

From Eleanor's email:

This situation, where EP002 always receives a truncated hit count of exactly 0 from EQ029 has remained stable ever since. Since the hit count is usually a small number, with higher multiplicity events being rare, it is the LSB of the truncated hit count that is wrong most often. This is exactly what Prashanth is seeing in his plots. 
NOTE: no errors show up in the plots on the right hand side of the L0trg plots because I do not have the bit checking up and going for the EPD QT->DSM links yet.
 
So, I think we have a hardware problem driving that group of 4 output bits, or receiving them at the DSM end of the link.
 
However, it has had no effect because of the way we use those bits.
EP002 calculates an actual hit count for each channel using the logic:
 
if max-TAC = 0 then hit-count = 0
else hit-count = truncated-hit-count+1
 
Since the truncated hit count from EQ029 is stuck at 0 this means the corrected hit-count is either 0 or 1. This is actually correct if the hit count should be 0 or 1, but it is incorrect if the hit count should have been 2 or higher. It is effectively a hit/no-hit flag.
EP002 then takes the hit-counts from all 7 used input channels and adds them up to get a total West hit count. We apply a threshold cut (EPD-WestHitCnt-th) to that total in order to make the EPD-W bit. The threshold value is currently set to 0, so all we need is at least one good hit somewhere on the West side. The (effective) hit flag coming from EQ029 works perfectly well for this. We don't know how many hits were seen by EQ029, but we know if it saw at least 1, and that is all we need.
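The hit-count logic described in the email can be sketched in a few lines of Python.  This is only an illustration, not the actual DSM code: the function names are made up here, and the strict "greater than" comparison against the threshold is an assumption inferred from the statement that a threshold of 0 requires at least one hit.

```python
def corrected_hit_count(max_tac, truncated_hit_count):
    """Rule quoted above: 0 if max-TAC is 0, otherwise truncated count + 1."""
    if max_tac == 0:
        return 0
    return truncated_hit_count + 1


def epd_west_bit(channels, threshold=0):
    """channels: (max_tac, truncated_hit_count) pairs for the used input channels.

    Assumption: the bit fires when the summed hit count exceeds the threshold.
    """
    total = sum(corrected_hit_count(tac, count) for tac, count in channels)
    return total > threshold


# A healthy link reporting 3 hits sends a truncated count of 2.
healthy = corrected_hit_count(max_tac=100, truncated_hit_count=2)  # 3
# The EQ029 link is stuck at 0, so the same 3 hits are counted as 1.
stuck = corrected_hit_count(max_tac=100, truncated_hit_count=0)    # 1
```

With the threshold at 0 the stuck link still sets the EPD-W bit correctly; the difference only matters once the threshold is raised, as for a multiplicity trigger.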

We decided that we wanted to correct this for the high multiplicity OO trigger.

On April 28th the last two QT32Cs (0x1C and 0x1D) in EQ3 were swapped to see whether the problem followed the board or not.  It had been noticed during that access that the cable from the QT in 0x1C to the DSM wasn't connected securely.  However, the problem followed the board, indicating that the QT32C board itself was the problem.  There was not enough time during the access to do any additional swaps, so we decided to leave things as they were until the next access.

Prior to the access, the average ADC per tile looked like:

Figure 3: Average ADC from run 2218015 (7.7 GeV running) prior to access.

Figure 5: Average ADC from run 22118058 (7.7 GeV running) after the access.  The apparent dead tiles are due to the mismatched mapping - the last board has an empty daughter card.

Comparing the pre-access plot with Fig 5, you can see that 4 tiles look dead.  The issue was that when the boards were switched (0x1C and 0x1D in EQ3) the jumpers on the boards were not changed to reflect the new addresses.  This means that the two boards are swapped in the data stream, and thus all the data taken while the boards were swapped (runs 22118058 through 22125017) will need to be remapped in the database.
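As a rough illustration of the remapping, the swap can be expressed as a run-dependent address lookup.  This is a sketch only: the function and constant names are hypothetical, it is not the actual database code, and the run range is taken from the note at the top of this page with both end runs assumed to be inclusive.

```python
# Runs during which the boards in EQ3 0x1C and 0x1D were swapped
# (assumed inclusive on both ends).
SWAP_FIRST_RUN = 22118058
SWAP_LAST_RUN = 22125017

# The jumpers were not changed, so each board kept reporting its old address.
SWAPPED_ADDRESS = {0x1C: 0x1D, 0x1D: 0x1C}


def nominal_address(run, stream_address):
    """Return the EQ3 address whose nominal tile mapping applies to the data."""
    if SWAP_FIRST_RUN <= run <= SWAP_LAST_RUN:
        return SWAPPED_ADDRESS.get(stream_address, stream_address)
    return stream_address


# During the swap, data labelled 0x1C belongs to the tiles nominally on 0x1D.
print(hex(nominal_address(22118058, 0x1C)))
```

All other EQ3 addresses, and all runs outside the swap period, pass through unchanged.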

During this run time, we started to see issues with EQ3 that were not quite recognized by the shift crews.  (Actually one did call me - I recognized it was an EQ3 issue and had them power cycle the crate which resolved it.  At the time I did not think much about it since EQ3 has always been a little finicky.)  An example is below.


Figure 6: Average ADC from Run 22120031.  The characteristic of an EQ3 problem can be seen here (EQ3 has boards from both the East and West sides in it).  However, since the rings aren't all purple, this shows not that EQ3 power was out but that the data was corrupted.

On Wednesday May 5th during the access, the boards in EQ3 0x1C and 0x1D were swapped again.  For those keeping track, this means the mapping is back to nominal and the bad board is in 0x1C.  The trigger team had tried to install a new QT32C board into EQ3 with no luck.  So we started the FXT running assuming we were in the same state as before (only losing some capability of the multiplicity trigger).


Figure 7: Average ADC for run 22126010, FXT 44.5 GeV - everything looks good (though we have also lost a channel on the West side).


Figure 8: Average ADC for run 22126014 - we see here that EQ3 is corrupted from the ring of lower signals and the two 120 degree cool spots on the inner tiles.  From here we can see that the board in EQ3 0x1C is completely dead (purple tiles on the right).

The issue was not noticed until the day shift leader realized that EQ3 was sending corrupt data in run 22126017.

So from this time period, we can summarize (from Akio):
22126010 - 13    All channels of EPD are good
22126014 - 21    All channels in EQ3 are bad (possibly out of sync)
22126022 - 23    All channels in EQ3 are bad (whole EQ3 out of run)
22126024 ~ 29    EQ3 0x1C masked out
22126027 ~ 29    EQ3 0x1C removed from trig (no effect other than clean up earliest tac W)
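For analysis bookkeeping, the summary above could be kept as a small lookup table.  The sketch below is hypothetical (the names are invented here), and it assumes that a range such as "22126010 - 13" means runs 22126010 through 22126013 inclusive.

```python
# (first run, last run, note), transcribed from the summary above.
RUN_STATUS = [
    (22126010, 22126013, "All channels of EPD are good"),
    (22126014, 22126021, "All channels in EQ3 are bad (possibly out of sync)"),
    (22126022, 22126023, "All channels in EQ3 are bad (whole EQ3 out of run)"),
    (22126024, 22126029, "EQ3 0x1C masked out"),
    (22126027, 22126029, "EQ3 0x1C removed from trig"),
]


def status_notes(run):
    """Return every note whose run range contains the given run."""
    return [note for first, last, note in RUN_STATUS if first <= run <= last]


# The last two ranges overlap, so a run such as 22126028 picks up both notes.
print(status_notes(22126028))
```

Returning a list rather than a single status keeps the overlapping entries from the summary intact.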

During this time EQ3 was power cycled and the trigger rebooted to no avail.  The board was masked out from this point on.  The rest of EQ3 seems to operate normally without EQ3 0x1C included.


Figure 9: ADC from 22126039 (FXT 44.5 GeV).  We see the empty spot for 0x1C on the west side - the east side really looks like it is getting smacked!

Several attempts to put in the new QT32C failed, so we had to accept that we're likely to start the OO running without the full EPD.  The next thing to try is to connect other boards to the diffRx that leads to this board.  The used channels for EQ3 are:


Figure 10: EPD channel mapping for EQ3.  The first 5 boards are "East" boards, the next 5 are "West" boards.

EPD PP9, 10, 11 and 12, TT2 and TT3, are the tiles associated with the bad board in EQ3 0x1C.

These correspond to RxB channels 1 and 2 with sectors 9O, 9E, 10O, 10E, 11O, 11E, 12O, 12E in the CAMAC crate.

Connect QT32C EQ025 in EQ3 0x16 channels 24,25,26,27 to
RxB channel 1 sectors 9O, 9E, 10O, 10E

Connect QT32C EQ02A in EQ3 0x1D channels 24,25,26,27 to
RxB channel 1 sectors 11O, 11E, 12O, 12E

Connect QT32B EQ028 in EQ3 0x1A channels 8 - 16 to
RxB channel 2 sectors 9O, 9E, 10O, 10E, 11O, 11E, 12O, 12E

We needed to mask out the 4 channels from the board in 0x16, as that is the "East" oriented board.  Everything else is already "West".  This can be done by changing the EQ3_QT5_DD_Chnl_mask label (change from 0 to 255 (or 0xFF)).  This was done by Jeff in the new Tier 1 file.  Prashanth will fix the mapping; we are now just waiting for the O+O running to start (late Saturday night).