How to determine what is happening in slow or stuck processes?

This tutorial is very specific to the cases where you have a process running (batch typically) and one of two conditions below is happening and you do not know why (but eventually, would like too :-) ):

  • the program/chain is "slow" that is, there is no apparent reason for this to happen (no special load on the machine, your code should run fine and you know this) and you would like to estimate where (which portion of the code) does your program/chain spends its time idling
  • the program/chain you are running is simply stuck and do not move into its processing stack, do not consume any CPU etc ...

Your program/chain is "slow"

Slow programs are still programs in execution mode. The best way to see what is happening is to use "strace".

[An example will be provided soon]

Your program/chain is "stuck"

If your program is really stuck, you will not be able to strace it as explained in the previous section. In fact, here is an example of what you would see under such conditions:

% ps -ef | grep root4star
starreco 11209 10955 97 May20 ?        1-07:30:59 root4star -b -q 
bfc.C(100000,"DbV20080512 ry2007 in tpc_daq tpc fcf svt_daq SvtD Physics 
Cdst Kalman l0 tags Tree evout l3onl emcDY2 fpd ftpc trgd ZDCvtx -dstout 
CMuDst hitfilt Corr4 OSpaceZ2 OGridLeak3D","st_upc_8097121_raw_1150008.daq")
starreco  9178  8953  0 19:39 pts/0    00:00:00 grep root4star

% strace -p 11209
Process 11209 attached - interrupt to quit
...

strace would not show any unrolling call stack but simply wait there forever. You know you have a "stuck" process for sure and need to revert to the use of gdb. For the same process, this is what to do

 

% gdb
GNU gdb Red Hat Linux (6.3.0.0-1.143.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu".
(gdb) attach 11209
Attaching to process 11209
Reading symbols from 
/afs/rhic.bnl.gov/star/packages/release10/SL08b/.sl44_gcc346/OBJ/asps/rexe/root4star.3...done.
Using host libthread_db library "/lib/tls/libthread_db.so.1".
Reading symbols from 
/afs/rhic.bnl.gov/star/ROOT/20/5.12.00/.sl44_gcc346/root/lib/libCore.so...done.
Loaded symbols for 
/afs/rhic.bnl.gov/star/ROOT/5.12.00/.sl44_gcc346/root/lib/libCore.so
...

there will be a lot of shared library loaded and finally

Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
0x43f00fa1 in StarMagField::Search (this=0x54e16008, N=37, Xarray=0x54ec64c4, x=-nan(0x400000), low=@0x43f0a4c4)
    at .sl44_gcc346/OBJ/StRoot/StarMagField/StarMagField.cxx:1084
1084          if ( (Int_t)( x >= Xarray[middle] ) == ascend )
#0  0x43f00fa1 in StarMagField::Search (this=0x54e16008, N=37, Xarray=0x54ec64c4, x=-nan(0x400000), low=@0x43f0a4c4)
    at .sl44_gcc346/OBJ/StRoot/StarMagField/StarMagField.cxx:1084
#1  0x43f0071b in StarMagField::Interpolate3DBfield (this=0x54e16008, r=-nan(0x400000), z=-2.99321151e+33,
    phi=-nan(0x400000), Br_value=@0x45, Bz_value=@0x45, Bphi_value=@0x45) 
    at .sl44_gcc346/OBJ/StRoot/StarMagField/StarMagField.cxx:844
#2  0x43efeafb in StarMagField::B3DField (this=0x54e16008, x=0xbffc8e70, B=0xbffc8e60)
    at .sl44_gcc346/OBJ/StRoot/StarMagField/StarMagField.cxx:482
#3  0x4d00c203 in StFtpcTrack::MomentumFit (this=0x1b5d72d8, vertex=0xbffc9610)
    at .sl44_gcc346/OBJ/StRoot/StFtpcTrackMaker/StFtpcTrackingParams.hh:344
#4  0x4d00b391 in StFtpcTrack::Fit (this=0x1b5d72d8, vertex=0xbffc9610, max_Dca=100, primary_fit=true)
    at .sl44_gcc346/OBJ/StRoot/StFtpcTrackMaker/StFtpcTrack.cc:575
#5  0x4d0257cc in StFtpcTracker::Fit (this=0xbffc93f0, primary_fit=Variable "primary_fit" is not available.
) at .sl44_gcc346/OBJ/StRoot/StFtpcTrackMaker/StFtpcTracker.cc:751
#6  0x4d01ee7e in StFtpcTrackMaker::Make (this=0xc9cf538) at .sl44_gcc346/OBJ/StRoot/StFtpcTrackMaker/StFtpcTracker.hh:64
#7  0x4285b0c0 in StMaker::Make (this=0xc9c58c0) at .sl44_gcc346/OBJ/StRoot/StChain/StMaker.cxx:965
#8  0x4285b0c0 in StMaker::Make (this=0xaf3e4c0) at .sl44_gcc346/OBJ/StRoot/StChain/StMaker.cxx:965
#9  0x4285426b in StChain::Make (this=0xaf3e4c0) at .sl44_gcc346/OBJ/StRoot/StChain/StChain.cxx:105
#10 0x4296022d in StBFChain::Make (this=0xaf3e4c0) at .sl44_gcc346/OBJ/StRoot/StBFChain/StBFChain.h:78
#11 0x4285522d in StMaker::IMake (this=0xaf3e4c0, number=275) at .sl44_gcc346/OBJ/StRoot/StChain/StMaker.h:110
#12 0x4295fd89 in StBFChain::Make (this=0xaf3e4c0, number=275) at .sl44_gcc346/include/StChain.h:49
#13 0x428545b0 in StChain::EventLoop (this=0xaf3e4c0, jBeg=1, jEnd=100000, outMk=0x0) 
    at .sl44_gcc346/OBJ/StRoot/StChain/StChain.cxx:165
#14 0x4286d4a2 in G__StChain_Cint_541_9_0 () at /afs/rhic.bnl.gov/star/ROOT/5.12.00/.sl44_gcc346/root/include/TString.h:248
#15 0x40855c0c in G__ExceptionWrapper () from /afs/rhic.bnl.gov/star/ROOT/5.12.00/.sl44_gcc346/root/lib/libCint.so
#16 0x40911cda in G__call_cppfunc () from /afs/rhic.bnl.gov/star/ROOT/5.12.00/.sl44_gcc346/root/lib/libCint.so
#17 0x408ffd40 in G__interpret_func () from /afs/rhic.bnl.gov/star/ROOT/5.12.00/.sl44_gcc346/root/lib/libCint.so
#18 0x408ebb95 in G__getfunction () from /afs/rhic.bnl.gov/star/ROOT/5.12.00/.sl44_gcc346/root/lib/libCint.so
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb)

Withing this stack, you see that the reason for this particular process to be stuck is

0x43f00fa1 in StarMagField::Search (this=0x54e16008, N=37, Xarray=0x54ec64c4, x=-nan(0x400000), low=@0x43f0a4c4)
    at .sl44_gcc346/OBJ/StRoot/StarMagField/StarMagField.cxx:1084
1084          if ( (Int_t)( x >= Xarray[middle] ) == ascend )
  • A variable X is being used in a if() statement test condition but its value is NaN (-nan(0.x400000)) and this cause the call stack freeze.
  • This problem is happening in StarMagField::Search at line 1084 (code StRoot/StarMagField/StarMagField.cxx)
  • The calling upper function is StFtpcTrack::Fit in StFtpcTrack.cc but it is not clear the problem starts there (the NaN especially starts in StarMagField::Interpolate3DBfield and propagates to StarMagField::Search)

Your bug reporting could begin OR if this is your code, now starts the debugging process, editing code, setting breakpoints and stepping in the call stack inspecting the variables and why/when they are assigned to NaN and so on. Your logic may simply be obviously flaw so your first step is definitely editing the code and understanding its logic.