Recoverred Ticket 3336 (BTOF related)

Thu Feb 22 09:03:20 2018: Request 3336 was acted upon.
 Transaction: Ticket created by genevb
       Queue: bugs-high
     Subject: Optimized libraries missing west BTOF geometry


It was reported that BTOF matching efficiencies were
low in the recent P17ii AuAu54 physics production:

http://www.star.bnl.gov/HyperNews-star/protected/get/starsoft/10022.html
http://www.star.bnl.gov/HyperNews-star/protected/get/startof/2734.html

This was traced to missing west-side BTOF geometry
in the production. Further investigation led to finding
that the problem still exists, and is only there when
using an optimized library.

For example, from the DEV nightly tests, this simple
grep of log files reveals the issue with reduced numbers
for the optimized library:

grep "#(proj)" /star/rcf/test/dev/daq_sl302.stica_opt/Wed/year_2017/AuAu54_production_2017/st_physics_18155036_raw_2000012.log | head -5
StBTofMatchMaker:INFO  - #(daq): 11 #(proj): 9 #(prim proj): 2 #(matched): 2 #(single valid): 2 #(single prim valid): 1
StBTofMatchMaker:INFO  - #(daq): 973 #(proj): 269 #(prim proj): 202 #(matched): 173 #(single valid): 173 #(single prim valid): 141
StBTofMatchMaker:INFO  - #(daq): 713 #(proj): 222 #(prim proj): 158 #(matched): 138 #(single valid): 138 #(single prim valid): 102
StBTofMatchMaker:INFO  - #(daq): 61 #(proj): 25 #(prim proj): 20 #(matched): 19 #(single valid): 19 #(single prim valid): 16
StBTofMatchMaker:INFO  - #(daq): 25 #(proj): 12 #(prim proj): 0 #(matched): 3 #(single valid): 3 #(single prim valid): 0

grep "#(proj)" /star/rcf/test/dev/daq_sl302.stica/Wed/year_2017/AuAu54_production_2017/st_physics_18155036_raw_2000012.log | head -5
StBTofMatchMaker:INFO  - #(daq): 11 #(proj): 22 #(prim proj): 6 #(matched): 5 #(single valid): 5 #(single prim valid): 3
StBTofMatchMaker:INFO  - #(daq): 973 #(proj): 589 #(prim proj): 416 #(matched): 391 #(single valid): 391 #(single prim valid): 309
StBTofMatchMaker:INFO  - #(daq): 713 #(proj): 423 #(prim proj): 306 #(matched): 290 #(single valid): 290 #(single prim valid): 219
StBTofMatchMaker:INFO  - #(daq): 61 #(proj): 41 #(prim proj): 35 #(matched): 32 #(single valid): 32 #(single prim valid): 28
StBTofMatchMaker:INFO  - #(daq): 25 #(proj): 16 #(prim proj): 0 #(matched): 5 #(single valid): 5 #(single prim valid): 0

I have not yet determined when the problem first appeared.

-Gene



Daniel, would you be able to have a look at some
of the Run 17 pp physics productions and let us
know the state of the BTOF geometry? We need
to understand which produced datasets are
flawed.

There are st_physics files produced in P18ia,
and st_W files produced in P17ig. Please let
me know if you need additional help finding
files.

Thanks,
-Gene




Actually, I found another easy way to see
whether the problem existed in produced
data:

Open a MuDst file in ROOT and type:
MuDst.Draw("BTofHit.mTray:BTofHit.mIndex2Global","","zcol",10)
This plot is half empty when match-making
finds nothing on one side of the barrel.

With this I can try to look at some production
MuDsts myself as well and will report back here.

-Gene




Findings:

P17ig production of Run 17 pp510 st_W stream
is GOOD (completed).

P17ii production of Run 17 AuAu54
is BAD (completed).

P18ia production of Run 17 pp510 st_physics stream
is BAD (ongoing).

And I expect the P17ii production of Run 5 CuCu200
(ongoing) is unaffected as there was no BTOF. However...

The current leading theory is that we have
an issue with optimized libraries after the
OS upgrade to Scientific Linux 7, and not due
to any particular code commits. If this is true,
we also need to answer the question of whether
BTOF's geometry is only one of multiple parts
of the STAR geometry which may be impacted,
and we just haven't yet noticed others.

-Gene




Hi Gene,

It's important to note.  There has been no issue identified with the ROOT
geometry.  Rather, the issue is with the GEANT geometry.  The new compiler,
under certain levels of optimization, decides to prune some FORtran DO loops.

I believe the BTOF codes still use the GEANT geometry, which means it may
be impacted by this issue.

I just did a quick test... converted the y2017 geometry (optimized and non-
optimized) from ROOT to TGeo, dumped into a CINT file.  I can then do a
diff between the two.

I see no changes to the BTOF geometry, at least in terms of the geometry tree.
Everything gets created and placed.  (And likely placed at the correct positions).

There are numerous changes to other geometries, which are worrisome.
We don't use optimized geant in simulation, and only use it in
reconstruction for the BTOF.

Jason

On 02/22/2018 11:20 AM




Another piece of information (suggested
by Dmitri Smirnov):

I processed some Run 17 pp510 events in
optimized SL17g, SL17h, SL17i, and SL18a,
all with the same chain as the P17ig st_W
production. The data looks OK in SL17g
and SL17h, but BAD in SL17i and SL18a.

The SL17g and SL17h libraries that we use
for ".sl73_gcc485" were compiled in
Sept and Oct of 2017. SL17i was compiled
on December 13th, immediately after the
SL7 upgrade to the farm.

So...the possibility that it was a change in
STAR code between SL17h and SL17i is not
yet closed.  Some possible tests include:

a) Re-compiling SL17h codes in SL17h anew
to see if they become problematic.

b) Trying codes that are different between
SL17h and SL17i in the other library to see
if that makes SL17h OK or SL17i BAD.

-Gene




  Gene, below is some comments:

The SL17g and SL17h libraries that we use
for ".sl73_gcc485" were compiled in
Sept and Oct of 2017.
  SL17g and SL17h were compiled naturally on
  SL6.3 platform with gcc 4.8.2 and when we run them
  on the farm which has SL7.3 platform it uses
  link
   .sl73_gcc485 -> .sl64_gcc482/

  SL17i and SL18a have been compiled naturally on
  SL7.3 platform with gcc 4.8.5 compiler .

   Lidia.





This is a comment.  It is not sent to the Requestor(s):


On Feb 26, 2018, at 1:07 PM, Jerome LAURET via RT <bugs-high-comment@sun.star.bnl.gov> wrote:
Gene could you please provide a how to reproduce / illustrate this issue?
Here's a way that's faster than looking at a MuDst:

root4star -b -q -l 'bfc.C(1,"DbV20171001 pp2017a StiCA btof mtd mtdCalib PicoVtxDefault fmsDat fmsPoint fpsDat BEmcChkStat OSpaceZ2 OGridLeak3D -hitfilt","/star/rcf/test/daq/2017/069/st_physics_18069061_raw_2000021.daq")' | grep "proj)"

Something like this is GOOD:
StBTofMatchMaker:INFO  - #(daq): 30 #(proj): 932 #(prim proj): 100 #(matched): 13 #(single valid): 13 #(single prim valid): 7

Something like this (with about half values for the ones after the first) is BAD:
StBTofMatchMaker:INFO  - #(daq): 30 #(proj): 478 #(prim proj): 60 #(matched): 6 #(single valid): 6 #(single prim valid): 5

-Gene


This is a comment.  It is not sent to the Requestor(s):

More information:

I ran a script to check out and locally compile each
subdirectory of pams and StRoot individually, using...

cons FDEBUG="-O0 -g" DEBUG="-O0 -g"

...while in optimized DEV. This was 254 jobs. All of
them ran identically (with a bad result) except for a
single job, for StRoot/StDbUtilities, and it was only
different in a very minor way, not having ~twice the
BTOF matches.

No smoking guns for a piece of code that is having
optimization problems. Confusing.

-Gene



This is a comment.  It is not sent to the Requestor(s):

Hi Gene,

Decided to spend a few minutes poking around.

1) If you run the bfc interactively...

void runBfc() { gROOT->LoadMacro("bfc.C");
bfc(0,"DbV20171001 pp2017a StiCA btof mtd mtdCalib PicoVtxDefault fmsDat fmsPoint fpsDat BEmcChkStat OSpaceZ2 OGridLeak3D 
-hitfilt","/star/rcf/test/daq/2017/069/st_physics_18069061_raw_2000021.daq");
    chain->Clear(); chain->Make();
    new TBrowser();
}

You can navigate to bfc/.make/geant/.const/HALL/...

and down the the BTOF geometry.  Both the east and west halves of the BTOF are present (BTOH and BTO1).
(When I draw the geometry, and in the hierarchy listed in the browser...)

2) If you look at the log file, you get the following output from StBTofMatchMaker, where the geometry
is initialized...

Output for SL17i, optimized --

StBTofMatchMaker:INFO  - [StBTofGeometry] retrieving geometry alignment parameters from database
StBTofMatchMaker:INFO  - [StBTofGeometry] # of sectors = 120
StBTofMatchMaker:INFO  - [StBTofGeometry] Done. NValidTrays = 120  NModulesInTray = 32
StBTofMatchMaker:INFO  - StBTofGeometry::Init -- GEMTOF-tray module indexes will be corrected for year 2017
StBTofMatchMaker:INFO  - CPU time for StBTofGeometry initialization : 0.12 sec
StBTofMatchMaker:INFO  - StBTofMatchMaker -- welcome
StBTofMatchMaker:INFO  -  Number of BTOF Hits = 33
StBTofMatchMaker:INFO  - #(daq): 30 #(proj): 478 #(prim proj): 60 #(matched): 6 #(single valid): 6 #(single prim valid): 5

Output for SL17h, optimized --

StBTofMatchMaker:INFO  - [StBTofGeometry] retrieving geometry alignment parameters from database
StBTofMatchMaker:INFO  - [StBTofGeometry] # of sectors = 120
StBTofMatchMaker:INFO  - [StBTofGeometry] Done. NValidTrays = 120  NModulesInTray = 32
StBTofMatchMaker:INFO  - StBTofGeometry::Init -- GEMTOF-tray module indexes will be corrected for year 2017
StBTofMatchMaker:INFO  - CPU time for StBTofGeometry initialization : 0.15 sec
StBTofMatchMaker:INFO  - StBTofMatchMaker -- welcome
StBTofMatchMaker:INFO  -  Number of BTOF Hits = 33
StBTofMatchMaker:INFO  - #(daq): 30 #(proj): 932 #(prim proj): 100 #(matched): 13 #(single valid): 13 #(single prim valid): 7

..............

I don't see any obvious difference attributable to the geometry.

Jason
This is a comment.  It is not sent to the Requestor(s):

Adding my 2cents: 

in both cases the "[StBTofGeometry]" which is responsible for handling the BTOF geometry happily reports 120 trays. i.e. the 60+60 trays on both halves. The number is the resultant of a loop working its way through a list of geometry nodes to define trays (see StBTofGeometry::InitFromStar(TVolume *starHall)). It would be interesting to see what the debug messages would say at that point. From all I can see there are 120 trays, but it is not too obvious where they end up ...


-f.
This is a comment.  It is not sent to the Requestor(s):

Hi Frank,

Be careful what you wish for...  I ran optimized, SL17h and i, with debug bfc option.
I don't see anything obvious, other than there is too much output for me to go through
to figure things out.  The logs are here, if you want to take a look...

$ ls /star/simu/simu/jwebb/2018/02-26-2018-bug3336/*.log

Cheers,
Jason
This is a comment.  It is not sent to the Requestor(s):

Speaking with Jason earlier, I said...
"I tried running with each of our libraries
from pams, StRoot, and StarVMC
unoptimized, and nothing fixed this.
What's left except root4star itself?"

The obvious answer to that question is,
"Yes, root4star itself."

So, I executed the unoptimized root4star
in optimized DEV, and the optimized root4star
in unoptimized DEV: the formed is GOOD and
the latter is BAD, implicating root4star itself.

I further executed the optimized root4star
from SL17h in optimized DEV, and that works
fine too, while optimized root4star from SL17i
in optimized DEV does not. So the case is
very strong against optimized root4star since
the OS upgrade.

The next question is which particular component
of root4star is the problem, and the likely candidate
seems to be the linked GEANT.

Unfortunately, as far as I understand, we cannot
run root.exe and load in an unoptimized GEANT
at run-time to isolate GEANT. So...we would need
to re-link root4star using an unoptimized GEANT
as a test. I'm not familiar with doing this, but it
would seem the next logical step.

Other ideas on how to proceed are welcome.
There is of course the possibility that we simply
move forward (which we are doing regardless of
this particular issue) without GEANT in the production
chain, as noted in this thread:

http://www.star.bnl.gov/HyperNews-star/protected/get/startof/2731.html

It's interesting to note that if GEANT turns out to
be the problem, the timing of preparing to run
our reconstruction without GEANT occurred nearly
perfectly (off by only a small amount) with the
day that GEANT's vulnerability struck us....
Coincidence? 

-Gene
This is a comment.  It is not sent to the Requestor(s):

This is helpful. Thanks Jason! The match maker here seems to be finding matches in all trays. If this is the optimized version that is supposed to reproduce the lack of trays in the west BTOF, then I don't see much or any evidence of that. In the first event tracks with matched BTOF hits in trayID<60 and trayID>60 are all found and passed on.

 




This is a comment.  It is not sent to the Requestor(s):

Well...it appears not to be GEANT (as Jerome
suggested it couldn't be), but some other
fortran code used in root4star...

With a little help from Jason on re-compiling root4star,
I have isolated a pair of code commits between SL17h
and SL17i that make or break the BTOF-matching:

http://www.star.bnl.gov/cgi-bin/protected/cvsweb.cgi/asps/Simulation/starsim/rebank/aslsetba.age.diff?r1=1.1;r2=1.2;f=h
http://www.star.bnl.gov/cgi-bin/protected/cvsweb.cgi/asps/Simulation/starsim/rebank/aslgetba.age.diff?r1=1.2;r2=1.3;f=h

Jason noted at the time of the commit that these were
to force the compiler not to optimize some loops
involving variables by making them "volatile". However,
this appears to be having some other side effect.

Re-compiling and running in optimized DEV with
the old version of either (or both) of these files
gives a GOOD result:

setenv NODEBUG 1
stardev

...then...

cvs co asps/rexe
cvs co asps/Simulation
rm asps/Simulation/starsim/rebank/aslsetba.age
cvs co -r SL17h asps/Simulation/starsim/rebank/aslsetba.age
cons

OR

cvs co asps/rexe
cvs co asps/Simulation
rm asps/Simulation/starsim/rebank/aslgetba.age
cvs co -r SL17h asps/Simulation/starsim/rebank/aslgetba.age
cons

OR

cvs co asps/rexe
cvs co -r SL17h asps/Simulation
cons

I should explicitly note that I specifically tried recompiling
and running the current versions of these files and that
result is BAD:

cvs co asps/rexe
cvs co asps/Simulation
cons

I should also note that there was one other code change
in these directories between SL17h and SL17i, but
back-stepping this change has no effect (still BAD):

http://www.star.bnl.gov/cgi-bin/protected/cvsweb.cgi/asps/rexe/TGeant3/TGiant3.h.diff?r1=1.1;r2=1.2;f=h

Alternatively, compiling the new codes without optimization
but running in optimized DEV gives a GOOD result too:

cvs co asps/rexe
cvs co asps/Simulation
cons FDEBUG="-O0 -g" DEBUG="-O0 -g"

_____________

My summarizing thoughts so far:

I have isolated code changes (not the OS upgrade itself)
that are tied to the current optimization problem for BTOF-
matching. However, the code changes were introduced to
fix an issue with optimized compilation under the new OS,
so the reverting these changes is not likely to be a viable
solution.

The fact that either code change can be reverted to fix the
problem implies an issue with information that crosses
codes (non-local data?).

Nevertheless, regardless of whether we understand the
issue, specifically compiling these codes unoptimized
seems to be one option for moving forward.

-Gene

p.s. I noticed that one of the now-volatile variables in
aslgetba.age (ldetp) was also listed in a SAVE statement.
Anyone know if this can lead to problems? I've no
expertise in these attributes.




On 02/28/2018

This is a comment.  It is not sent to the Requestor(s):

Hi Gene,
in the code you provided
       INTEGER       NPMAX,Npar,LdetP,I,PAR(NPmax)
Must be replaced by:
       INTEGER       NPMAX,Npar,LdetP,I,PAR(*)
I look the call of this function.
NPmax could be negative. For non optimized mode it is irrelevant.
but for optimized, size of array is negative and compiler force
to quit the loop.
PAR(*) means it is an array with arbitrary size. It is user responsibility
to work with it.

Victor

This is a comment.  It is not sent to the Requestor(s):

Hi, Victor

I tried this for both files in question in optimized SL18a, but it did not restore the proper functioning of the code (BTOF matching still wrong). It of course does not mean that you are wrong, but that this isn't sufficient to resolve the optimized compilation problems.

Index: asps/Simulation/starsim/rebank/aslgetba.age
===================================================================
RCS file: /afs/rhic.bnl.gov/star/packages/repository/asps/Simulation/starsim/rebank/aslgetba.age,v
retrieving revision 1.3
diff -r1.3 aslgetba.age
25c25
<       INTEGER NPMAX,Npar,LdetP,I,PAR(NPmax)
---
      INTEGER NPMAX,Npar,LdetP,I,PAR(*)
Index: asps/Simulation/starsim/rebank/aslsetba.age
===================================================================
RCS file: /afs/rhic.bnl.gov/star/packages/repository/asps/Simulation/starsim/rebank/aslsetba.age,v
retrieving revision 1.2
diff -r1.2 aslsetba.age
32c32
<       INTEGER    NPAR,LdetP,I,PAR(Npar)
---
      INTEGER    NPAR,LdetP,I,PAR(*)
Thanks,
-Gene








On 03/01/2018

This is a comment.  It is not sent to the Requestor(s):

Gene, did you use  the latest code, or the old one without
Jason's attempts to fix the problem?
Victor
This is a comment.  It is not sent to the Requestor(s):

Hi, Victor

Latest codes (where "volatile" was used).

-Gene
This is a comment.  It is not sent to the Requestor(s):

 > Latest codes (where "volatile" was used).
 >
To be sure it is better to use original code.
Victor
This is a comment.  It is not sent to the Requestor(s):

OK, I reverted to the SL17h versions (without "volatile")
and patched in the "*" arguments, while compiling and
running in optimized SL18a, but the BTOF result is still
BAD.

Thanks,
-Gene
This is a comment.  It is not sent to the Requestor(s):

It is a pity and very bad.
If it depends from our code, we can fix it.
If it is the bug in compiler optimization it could affect any code
Victor