Bug 3376 Notes

Updated on Thu, 2019-01-24 15:35. Originally created by jwebb on 2019-01-17 12:17.

01/24/2019

NOTE: We have ElectricFence installed. This is a heap memory debugger that replaces the malloc functionality. So we can try compiling / running the code with it linked in (?) and see if it tells us something.

... doesn't tell me anything useful. It quickly runs out of memory, before ROOT is even up and running. The issue here has to do with some limitations on memory imposed by the OS, and the way efence allocates. These limitations can be changed by a sufficiently privileged user...
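For reference, a minimal sketch of the general guard-page technique that debuggers of this kind rely on. This is not Electric Fence's actual source, and the names GuardedBlock, guardedAlloc and guardedFree are made up for illustration. Each request gets its own mapping, with the user bytes placed so that they end at an inaccessible page. The notes above do not say which OS limit was hit; one plausible candidate (an assumption, not confirmed for this job) is a per-process cap on the number of mappings, since every small allocation costs at least two pages and its own mapping.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>

// Hypothetical bookkeeping record for one guarded allocation.
struct GuardedBlock {
  void*       base;  // start of the whole mapping (needed for munmap)
  std::size_t size;  // total mapped bytes, guard page included
  char*       user;  // pointer handed back to the caller
};

// Map enough pages for n bytes plus one trailing guard page, then make the
// guard page inaccessible so that writing past the end faults immediately.
GuardedBlock guardedAlloc(std::size_t n) {
  const long pageSize = sysconf(_SC_PAGESIZE);
  assert(pageSize > 0);
  const std::size_t page      = static_cast<std::size_t>(pageSize);
  const std::size_t dataPages = (n + page - 1) / page;
  const std::size_t total     = (dataPages + 1) * page;

  void* base = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED) return GuardedBlock{nullptr, 0, nullptr};

  char* guard = static_cast<char*>(base) + dataPages * page;
  if (mprotect(guard, page, PROT_NONE) != 0) {
    munmap(base, total);
    return GuardedBlock{nullptr, 0, nullptr};
  }
  // The user region ends exactly where the guard page begins.
  return GuardedBlock{base, total, guard - n};
}

void guardedFree(const GuardedBlock& b) {
  if (b.base != nullptr) munmap(b.base, b.size);
}
```

With this layout a 10 byte request already consumes two pages, which is why a job that makes many small allocations can exhaust OS limits long before the physical memory is gone.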

01/17/2019

The code appears to crash due to memory corruption that affects StEvent/StTpcRawData. The problem is encountered when StTpcRTSHitMaker loops over sectors, rows and hits. The call on (or near) line 232 to StTpcDigitalSector::getTimeAdc(...) results in the crash. Inside the call, the code accesses the StDigitalTimeBins, looping over them to get each StDigitalPair. Here is where the corruption is seen.

Int_t StTpcDigitalSector::getTimeAdc(Int_t row, Int_t pad, 
				     UChar_t ADCs[__MaxNumberOfTimeBins__], 
				     UShort_t IDTs[__MaxNumberOfTimeBins__]) {
  // 10-> 8 conversion
  // no conversion
  UInt_t nTimeSeqs = 0;
  memset (ADCs, 0, __MaxNumberOfTimeBins__*sizeof(UChar_t));
  memset (IDTs, 0, __MaxNumberOfTimeBins__*sizeof(UShort_t));
  StDigitalTimeBins* TrsPadData = timeBinsOfRowAndPad(row,pad);
  if (! TrsPadData) return nTimeSeqs;
  StDigitalTimeBins &trsPadData = *TrsPadData;
  nTimeSeqs = trsPadData.size();
  if (! nTimeSeqs) return nTimeSeqs;
  for (UInt_t i = 0; i < nTimeSeqs; i++) {
    StDigitalPair &digPair = trsPadData[i];
    UInt_t ntbk = digPair.size();                  // Corruption ntbk too large
    UInt_t tb   = digPair.time();                  
    UInt_t isIdt= digPair.isIdt();
    for (UInt_t j = 0; j < ntbk; j++, tb++) {
      if (digPair.adc()[j] <= 0) continue;
      ADCs[tb] = log10to8_table[digPair.adc()[j]];
      if (isIdt) IDTs[tb] = digPair.idt()[j];
    } 
  }
  return nTimeSeqs;
 }
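Since the inner loop indexes ADCs[tb] with tb = time() + j, a corrupted (too large) ntbk walks off the end of the fixed-size output arrays. A minimal sketch of a range check that would turn this into a countable, reportable condition instead of a crash is given below. It only detects the symptom and does not locate the source of the corruption. SketchPair, kMaxTimeBins (with an assumed value of 512) and fillAdcChecked are hypothetical stand-ins, not the STAR classes, and the 8-bit saturation replaces the real log10to8_table lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-ins; these are not the real STAR definitions.
constexpr std::size_t kMaxTimeBins = 512;  // assumed value, plays the role of __MaxNumberOfTimeBins__

struct SketchPair {
  std::size_t        firstTimeBin;  // counterpart of StDigitalPair::time()
  std::vector<short> adc;           // counterpart of StDigitalPair::adc(); size() is ntbk
};

// Fill `out` from `pairs`, skipping any pair whose time-bin range would run
// past the end of `out`. Returns the number of pairs that were skipped.
int fillAdcChecked(const std::vector<SketchPair>& pairs,
                   unsigned char (&out)[kMaxTimeBins]) {
  std::memset(out, 0, sizeof(out));
  int skipped = 0;
  for (const SketchPair& p : pairs) {
    const std::size_t ntbk = p.adc.size();
    // Written this way to avoid unsigned overflow in firstTimeBin + ntbk.
    if (p.firstTimeBin >= kMaxTimeBins ||
        ntbk > kMaxTimeBins - p.firstTimeBin) {
      ++skipped;
      continue;
    }
    for (std::size_t j = 0; j < ntbk; ++j) {
      if (p.adc[j] <= 0) continue;
      const std::size_t tb = p.firstTimeBin + j;
      assert(tb < kMaxTimeBins);  // guaranteed by the range check above
      // Saturate to 8 bits; the real code uses log10to8_table here.
      out[tb] = static_cast<unsigned char>(p.adc[j] > 255 ? 255 : p.adc[j]);
    }
  }
  return skipped;
}
```

Logging the sector, row, pad and pair index whenever a pair is skipped would show whether the bad entries cluster anywhere, although the notes above suggest that they do not.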

The crash happens in about 1 in 10 jobs.

The crash always happens on event 5.

Which "pair" is affected is not reproducible.

The crash occurs using both fzd input and geant.root input. (The geant.root file can be created using code from before the 32/64-bit changes, with the same result.)

The chain options can be pruned down, and the corruption still observed.

void bug3376c() {
  gROOT->LoadMacro("bfc.C");
//TString _chain = "trs,fss,y2006h,Idst,IAna,l0,tpcI,TpxClu,ftpc,Tree,logger,ITTF,Sti,genvtx,NoSsdIt,NoSvtIt,MakeEvent,bbcSim,tofsim,tags,emcY2,EEfs,evout,IdTruth,geantout,big,in,MiniMcMk,clearmem";
  TString _chain = "trs,fss,y2006h,idst iana l0 tpci tpxclu ftpc,                                            MakeEvent,                                                     big,in,         clearmem";
  bfc(7,_chain,"fzd/rcf9991_01_1000evts.geant.root");
};

Starting from the minimal set above, the code becomes stable. (Removal of MuDst maker is sufficient).

Looked at the Coverity reports for all makers in the reduced chain above. Fixed any buffer overflows found (or confirmed with assert statements that they are never reached in the code).

Found one real issue in St_db_Maker (which Victor fixed and committed).

The code still crashes.

There are numerous other defects ID'd by Coverity: uninitialized members, etc. (often false defects, because Coverity doesn't understand the memset "trick" used frequently in our code to initialize classes).
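As an illustration of why such reports can be false positives, here is a minimal sketch of memset-based initialization. The class PadSummary is hypothetical, and the exact form of the idiom in our code may differ. The constructor never names the individual members, so a checker that does not model memset sees them as uninitialized even though every one of them is zeroed.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical class, not taken from the STAR code base.
class PadSummary {
 public:
  PadSummary() {
    // Zero every data member in one call instead of listing each of them
    // in a member initializer list.
    std::memset(&mData, 0, sizeof(mData));
  }
  int   nHits()  const { return mData.nHits; }
  float charge() const { return mData.charge; }
  short row()    const { return mData.row; }

 private:
  struct Data {
    int   nHits;
    float charge;
    short row;
  };
  static_assert(std::is_trivially_copyable<Data>::value,
                "memset is only safe on trivially copyable data");
  Data mData;
};

// Small usage check: a default-constructed object reads back as all zeros.
inline void checkZeroInitialized() {
  PadSummary s;
  assert(s.nHits() == 0);
  assert(s.charge() == 0.0f);
  assert(s.row() == 0);
}
```

The idiom stops being safe as soon as the zeroed region contains a member with a non-trivial constructor or a virtual table pointer, so the flagged classes still deserve a quick look before they are dismissed.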

Bottom line is we need Insure++.

One final note: this appears to be something which accumulates, i.e. running a single event per job (#5) I have no crashes in 60 events. But if I extend to events 5 to 10, I see crashes at about the same rate as before.
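A rough estimate of how significant the clean single-event runs are, assuming that the 60 events correspond to 60 independent single-event jobs and that, if nothing accumulated, each would crash with the roughly 1 in 10 probability quoted above:

```latex
P(\text{0 crashes in 60 jobs}) = (1 - 0.1)^{60} \approx 1.8 \times 10^{-3}
```

Under those assumptions a statistical fluctuation is an unlikely explanation for the clean runs, which supports the picture of state building up over several events.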
