Recoverred Ticket 3336 (BTOF related)
Updated on Wed, 2020-07-08 08:56. Originally created by jeromel on 2020-07-08 08:56.
Thu Feb 22 09:03:20 2018: Request 3336 was acted upon. Transaction: Ticket created by genevb Queue: bugs-high Subject: Optimized libraries missing west BTOF geometry It was reported that BTOF matching efficiencies were low in the recent P17ii AuAu54 physics production: http://www.star.bnl.gov/HyperNews-star/protected/get/starsoft/10022.html http://www.star.bnl.gov/HyperNews-star/protected/get/startof/2734.html This was traced to missing west-side BTOF geometry in the production. Further investigation led to finding that the problem still exists, and is only there when using an optimized library. For example, from the DEV nightly tests, this simple grep of log files reveals the issue with reduced numbers for the optimized library: grep "#(proj)" /star/rcf/test/dev/daq_sl302.stica_opt/Wed/year_2017/AuAu54_production_2017/st_physics_18155036_raw_2000012.log | head -5 StBTofMatchMaker:INFO - #(daq): 11 #(proj): 9 #(prim proj): 2 #(matched): 2 #(single valid): 2 #(single prim valid): 1 StBTofMatchMaker:INFO - #(daq): 973 #(proj): 269 #(prim proj): 202 #(matched): 173 #(single valid): 173 #(single prim valid): 141 StBTofMatchMaker:INFO - #(daq): 713 #(proj): 222 #(prim proj): 158 #(matched): 138 #(single valid): 138 #(single prim valid): 102 StBTofMatchMaker:INFO - #(daq): 61 #(proj): 25 #(prim proj): 20 #(matched): 19 #(single valid): 19 #(single prim valid): 16 StBTofMatchMaker:INFO - #(daq): 25 #(proj): 12 #(prim proj): 0 #(matched): 3 #(single valid): 3 #(single prim valid): 0 grep "#(proj)" /star/rcf/test/dev/daq_sl302.stica/Wed/year_2017/AuAu54_production_2017/st_physics_18155036_raw_2000012.log | head -5 StBTofMatchMaker:INFO - #(daq): 11 #(proj): 22 #(prim proj): 6 #(matched): 5 #(single valid): 5 #(single prim valid): 3 StBTofMatchMaker:INFO - #(daq): 973 #(proj): 589 #(prim proj): 416 #(matched): 391 #(single valid): 391 #(single prim valid): 309 StBTofMatchMaker:INFO - #(daq): 713 #(proj): 423 #(prim proj): 306 #(matched): 290 #(single valid): 290 #(single prim valid): 219 StBTofMatchMaker:INFO - #(daq): 61 #(proj): 41 #(prim proj): 35 #(matched): 32 #(single valid): 32 #(single prim valid): 28 StBTofMatchMaker:INFO - #(daq): 25 #(proj): 16 #(prim proj): 0 #(matched): 5 #(single valid): 5 #(single prim valid): 0 I have not yet determined when the problem first appeared. -Gene
Daniel, would you be able to have a look at some of the Run 17 pp physics productions and let us know the state of the BTOF geometry? We need to understand which produced datasets are flawed. There are st_physics files produced in P18ia, and st_W files produced in P17ig. Please let me know if you need additional help finding files. Thanks, -Gene
Actually, I found another easy way to see whether the problem existed in produced data: Open a MuDst file in ROOT and type: MuDst.Draw("BTofHit.mTray:BTofHit.mIndex2Global","","zcol",10) This plot is half empty when match-making finds nothing on one side of the barrel. With this I can try to look at some production MuDsts myself as well and will report back here. -Gene
Findings: P17ig production of Run 17 pp510 st_W stream is GOOD (completed). P17ii production of Run 17 AuAu54 is BAD (completed). P18ia production of Run 17 pp510 st_physics stream is BAD (ongoing). And I expect the P17ii production of Run 5 CuCu200 (ongoing) is unaffected as there was no BTOF. However... The current leading theory is that we have an issue with optimized libraries after the OS upgrade to Scientific Linux 7, and not due to any particular code commits. If this is true, we also need to answer the question of whether BTOF's geometry is only one of multiple parts of the STAR geometry which may be impacted, and we just haven't yet noticed others. -Gene
Hi Gene, It's important to note. There has been no issue identified with the ROOT geometry. Rather, the issue is with the GEANT geometry. The new compiler, under certain levels of optimization, decides to prune some FORtran DO loops. I believe the BTOF codes still use the GEANT geometry, which means it may be impacted by this issue. I just did a quick test... converted the y2017 geometry (optimized and non- optimized) from ROOT to TGeo, dumped into a CINT file. I can then do a diff between the two. I see no changes to the BTOF geometry, at least in terms of the geometry tree. Everything gets created and placed. (And likely placed at the correct positions). There are numerous changes to other geometries, which are worrisome. We don't use optimized geant in simulation, and only use it in reconstruction for the BTOF. Jason On 02/22/2018 11:20 AM
Another piece of information (suggested by Dmitri Smirnov): I processed some Run 17 pp510 events in optimized SL17g, SL17h, SL17i, and SL18a, all with the same chain as the P17ig st_W production. The data looks OK in SL17g and SL17h, but BAD in SL17i and SL18a. The SL17g and SL17h libraries that we use for ".sl73_gcc485" were compiled in Sept and Oct of 2017. SL17i was compiled on December 13th, immediately after the SL7 upgrade to the farm. So...the possibility that it was a change in STAR code between SL17h and SL17i is not yet closed. Some possible tests include: a) Re-compiling SL17h codes in SL17h anew to see if they become problematic. b) Trying codes that are different between SL17h and SL17i in the other library to see if that makes SL17h OK or SL17i BAD. -Gene
Gene, below is some comments:
The SL17g and SL17h libraries that we use for ".sl73_gcc485" were compiled in Sept and Oct of 2017.
SL17g and SL17h were compiled naturally on SL6.3 platform with gcc 4.8.2 and when we run them on the farm which has SL7.3 platform it uses link .sl73_gcc485 -> .sl64_gcc482/ SL17i and SL18a have been compiled naturally on SL7.3 platform with gcc 4.8.5 compiler . Lidia.
This is a comment. It is not sent to the Requestor(s):
On Feb 26, 2018, at 1:07 PM, Jerome LAURET via RT <bugs-high-comment@sun.star.bnl.gov> wrote: Gene could you please provide a how to reproduce / illustrate this issue?
Here's a way that's faster than looking at a MuDst: root4star -b -q -l 'bfc.C(1,"DbV20171001 pp2017a StiCA btof mtd mtdCalib PicoVtxDefault fmsDat fmsPoint fpsDat BEmcChkStat OSpaceZ2 OGridLeak3D -hitfilt","/star/rcf/test/daq/2017/069/st_physics_18069061_raw_2000021.daq")' | grep "proj)" Something like this is GOOD: StBTofMatchMaker:INFO - #(daq): 30 #(proj): 932 #(prim proj): 100 #(matched): 13 #(single valid): 13 #(single prim valid): 7 Something like this (with about half values for the ones after the first) is BAD: StBTofMatchMaker:INFO - #(daq): 30 #(proj): 478 #(prim proj): 60 #(matched): 6 #(single valid): 6 #(single prim valid): 5 -Gene
This is a comment. It is not sent to the Requestor(s): More information: I ran a script to check out and locally compile each subdirectory of pams and StRoot individually, using... cons FDEBUG="-O0 -g" DEBUG="-O0 -g" ...while in optimized DEV. This was 254 jobs. All of them ran identically (with a bad result) except for a single job, for StRoot/StDbUtilities, and it was only different in a very minor way, not having ~twice the BTOF matches. No smoking guns for a piece of code that is having optimization problems. Confusing. -Gene
This is a comment. It is not sent to the Requestor(s): Hi Gene, Decided to spend a few minutes poking around. 1) If you run the bfc interactively... void runBfc() { gROOT->LoadMacro("bfc.C"); bfc(0,"DbV20171001 pp2017a StiCA btof mtd mtdCalib PicoVtxDefault fmsDat fmsPoint fpsDat BEmcChkStat OSpaceZ2 OGridLeak3D -hitfilt","/star/rcf/test/daq/2017/069/st_physics_18069061_raw_2000021.daq"); chain->Clear(); chain->Make(); new TBrowser(); } You can navigate to bfc/.make/geant/.const/HALL/... and down the the BTOF geometry. Both the east and west halves of the BTOF are present (BTOH and BTO1). (When I draw the geometry, and in the hierarchy listed in the browser...) 2) If you look at the log file, you get the following output from StBTofMatchMaker, where the geometry is initialized... Output for SL17i, optimized -- StBTofMatchMaker:INFO - [StBTofGeometry] retrieving geometry alignment parameters from database StBTofMatchMaker:INFO - [StBTofGeometry] # of sectors = 120 StBTofMatchMaker:INFO - [StBTofGeometry] Done. NValidTrays = 120 NModulesInTray = 32 StBTofMatchMaker:INFO - StBTofGeometry::Init -- GEMTOF-tray module indexes will be corrected for year 2017 StBTofMatchMaker:INFO - CPU time for StBTofGeometry initialization : 0.12 sec StBTofMatchMaker:INFO - StBTofMatchMaker -- welcome StBTofMatchMaker:INFO - Number of BTOF Hits = 33 StBTofMatchMaker:INFO - #(daq): 30 #(proj): 478 #(prim proj): 60 #(matched): 6 #(single valid): 6 #(single prim valid): 5 Output for SL17h, optimized -- StBTofMatchMaker:INFO - [StBTofGeometry] retrieving geometry alignment parameters from database StBTofMatchMaker:INFO - [StBTofGeometry] # of sectors = 120 StBTofMatchMaker:INFO - [StBTofGeometry] Done. NValidTrays = 120 NModulesInTray = 32 StBTofMatchMaker:INFO - StBTofGeometry::Init -- GEMTOF-tray module indexes will be corrected for year 2017 StBTofMatchMaker:INFO - CPU time for StBTofGeometry initialization : 0.15 sec StBTofMatchMaker:INFO - StBTofMatchMaker -- welcome StBTofMatchMaker:INFO - Number of BTOF Hits = 33 StBTofMatchMaker:INFO - #(daq): 30 #(proj): 932 #(prim proj): 100 #(matched): 13 #(single valid): 13 #(single prim valid): 7 .............. I don't see any obvious difference attributable to the geometry. Jason
This is a comment. It is not sent to the Requestor(s): Adding my 2cents: in both cases the "[StBTofGeometry]" which is responsible for handling the BTOF geometry happily reports 120 trays. i.e. the 60+60 trays on both halves. The number is the resultant of a loop working its way through a list of geometry nodes to define trays (see StBTofGeometry::InitFromStar(TVolume *starHall)). It would be interesting to see what the debug messages would say at that point. From all I can see there are 120 trays, but it is not too obvious where they end up ... -f.
This is a comment. It is not sent to the Requestor(s): Hi Frank, Be careful what you wish for... I ran optimized, SL17h and i, with debug bfc option. I don't see anything obvious, other than there is too much output for me to go through to figure things out. The logs are here, if you want to take a look... $ ls /star/simu/simu/jwebb/2018/02-26-2018-bug3336/*.log Cheers, Jason
This is a comment. It is not sent to the Requestor(s): Speaking with Jason earlier, I said... "I tried running with each of our libraries from pams, StRoot, and StarVMC unoptimized, and nothing fixed this. What's left except root4star itself?" The obvious answer to that question is, "Yes, root4star itself." So, I executed the unoptimized root4star in optimized DEV, and the optimized root4star in unoptimized DEV: the formed is GOOD and the latter is BAD, implicating root4star itself. I further executed the optimized root4star from SL17h in optimized DEV, and that works fine too, while optimized root4star from SL17i in optimized DEV does not. So the case is very strong against optimized root4star since the OS upgrade. The next question is which particular component of root4star is the problem, and the likely candidate seems to be the linked GEANT. Unfortunately, as far as I understand, we cannot run root.exe and load in an unoptimized GEANT at run-time to isolate GEANT. So...we would need to re-link root4star using an unoptimized GEANT as a test. I'm not familiar with doing this, but it would seem the next logical step. Other ideas on how to proceed are welcome. There is of course the possibility that we simply move forward (which we are doing regardless of this particular issue) without GEANT in the production chain, as noted in this thread: http://www.star.bnl.gov/HyperNews-star/protected/get/startof/2731.html It's interesting to note that if GEANT turns out to be the problem, the timing of preparing to run our reconstruction without GEANT occurred nearly perfectly (off by only a small amount) with the day that GEANT's vulnerability struck us.... Coincidence? -Gene
This is a comment. It is not sent to the Requestor(s): This is helpful. Thanks Jason! The match maker here seems to be finding matches in all trays. If this is the optimized version that is supposed to reproduce the lack of trays in the west BTOF, then I don't see much or any evidence of that. In the first event tracks with matched BTOF hits in trayID<60 and trayID>60 are all found and passed on.
This is a comment. It is not sent to the Requestor(s): Well...it appears not to be GEANT (as Jerome suggested it couldn't be), but some other fortran code used in root4star... With a little help from Jason on re-compiling root4star, I have isolated a pair of code commits between SL17h and SL17i that make or break the BTOF-matching: http://www.star.bnl.gov/cgi-bin/protected/cvsweb.cgi/asps/Simulation/starsim/rebank/aslsetba.age.diff?r1=1.1;r2=1.2;f=h http://www.star.bnl.gov/cgi-bin/protected/cvsweb.cgi/asps/Simulation/starsim/rebank/aslgetba.age.diff?r1=1.2;r2=1.3;f=h Jason noted at the time of the commit that these were to force the compiler not to optimize some loops involving variables by making them "volatile". However, this appears to be having some other side effect. Re-compiling and running in optimized DEV with the old version of either (or both) of these files gives a GOOD result: setenv NODEBUG 1 stardev ...then... cvs co asps/rexe cvs co asps/Simulation rm asps/Simulation/starsim/rebank/aslsetba.age cvs co -r SL17h asps/Simulation/starsim/rebank/aslsetba.age cons OR cvs co asps/rexe cvs co asps/Simulation rm asps/Simulation/starsim/rebank/aslgetba.age cvs co -r SL17h asps/Simulation/starsim/rebank/aslgetba.age cons OR cvs co asps/rexe cvs co -r SL17h asps/Simulation cons I should explicitly note that I specifically tried recompiling and running the current versions of these files and that result is BAD: cvs co asps/rexe cvs co asps/Simulation cons I should also note that there was one other code change in these directories between SL17h and SL17i, but back-stepping this change has no effect (still BAD): http://www.star.bnl.gov/cgi-bin/protected/cvsweb.cgi/asps/rexe/TGeant3/TGiant3.h.diff?r1=1.1;r2=1.2;f=h Alternatively, compiling the new codes without optimization but running in optimized DEV gives a GOOD result too: cvs co asps/rexe cvs co asps/Simulation cons FDEBUG="-O0 -g" DEBUG="-O0 -g" _____________ My summarizing thoughts so far: I have isolated code changes (not the OS upgrade itself) that are tied to the current optimization problem for BTOF- matching. However, the code changes were introduced to fix an issue with optimized compilation under the new OS, so the reverting these changes is not likely to be a viable solution. The fact that either code change can be reverted to fix the problem implies an issue with information that crosses codes (non-local data?). Nevertheless, regardless of whether we understand the issue, specifically compiling these codes unoptimized seems to be one option for moving forward. -Gene p.s. I noticed that one of the now-volatile variables in aslgetba.age (ldetp) was also listed in a SAVE statement. Anyone know if this can lead to problems? I've no expertise in these attributes.
On 02/28/2018
This is a comment. It is not sent to the Requestor(s): Hi Gene, in the code you provided INTEGER NPMAX,Npar,LdetP,I,PAR(NPmax) Must be replaced by: INTEGER NPMAX,Npar,LdetP,I,PAR(*) I look the call of this function. NPmax could be negative. For non optimized mode it is irrelevant. but for optimized, size of array is negative and compiler force to quit the loop. PAR(*) means it is an array with arbitrary size. It is user responsibility to work with it. Victor
This is a comment. It is not sent to the Requestor(s): Hi, Victor I tried this for both files in question in optimized SL18a, but it did not restore the proper functioning of the code (BTOF matching still wrong). It of course does not mean that you are wrong, but that this isn't sufficient to resolve the optimized compilation problems. Index: asps/Simulation/starsim/rebank/aslgetba.age =================================================================== RCS file: /afs/rhic.bnl.gov/star/packages/repository/asps/Simulation/starsim/rebank/aslgetba.age,v retrieving revision 1.3 diff -r1.3 aslgetba.age 25c25 < INTEGER NPMAX,Npar,LdetP,I,PAR(NPmax) ---
INTEGER NPMAX,Npar,LdetP,I,PAR(*)
Index: asps/Simulation/starsim/rebank/aslsetba.age =================================================================== RCS file: /afs/rhic.bnl.gov/star/packages/repository/asps/Simulation/starsim/rebank/aslsetba.age,v retrieving revision 1.2 diff -r1.2 aslsetba.age 32c32 < INTEGER NPAR,LdetP,I,PAR(Npar) ---
INTEGER NPAR,LdetP,I,PAR(*)
Thanks, -Gene
On 03/01/2018
This is a comment. It is not sent to the Requestor(s): Gene, did you use the latest code, or the old one without Jason's attempts to fix the problem? Victor
This is a comment. It is not sent to the Requestor(s): Hi, Victor Latest codes (where "volatile" was used). -Gene
This is a comment. It is not sent to the Requestor(s): > Latest codes (where "volatile" was used). > To be sure it is better to use original code. Victor
This is a comment. It is not sent to the Requestor(s): OK, I reverted to the SL17h versions (without "volatile") and patched in the "*" arguments, while compiling and running in optimized SL18a, but the BTOF result is still BAD. Thanks, -Gene
This is a comment. It is not sent to the Requestor(s): It is a pity and very bad. If it depends from our code, we can fix it. If it is the bug in compiler optimization it could affect any code Victor
»
- jeromel's blog
- Login or register to post comments