A summary of our Photon XS Measurement Efforts
The starting points for our work attempting to measure the photon cross section were a pair of papers (OPAL: hep-ex/0305075; ZEUS: Phys. Lett. B 413 (1997) 201-260) describing the ZEUS and OPAL approaches to photon measurement. The papers identified two strongly discriminating variables useful in separating pion-like and photon-like events:
sigma_cluster: the sum of the second theta and phi moments of the cluster, written in terms of the moment definitions used later in this note as sigma_cluster = theta_2 + phi_2, with theta_2 = (1/Egamma) * sum_i Ei * (theta_i - theta_seed)^2 (and likewise for phi_2)
f_max: the fraction of the candidate's energy in the high tower, f_max = Emax / Egamma
In the above, Ei is the energy in a particular tower, with phi_i and theta_i the position of that tower; Egamma is the total energy of the candidate; and Emax is the energy of the high tower in the cluster.
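As a concrete illustration of these definitions, here is a minimal sketch of computing f_max and sigma_cluster from a cluster of towers. The Tower struct and function names are hypothetical, not the names used in our actual code:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Tower { double e, theta, phi; };  // energy and position of one tower

// f_max: fraction of the candidate's total energy in the seed (high) tower.
double f_max(const std::vector<Tower>& c, int seed) {
    double e_gamma = 0;
    for (const auto& t : c) e_gamma += t.e;
    return c[seed].e / e_gamma;
}

// sigma_cluster: energy-weighted second moments in theta and phi,
// computed about the seed tower's center, summed.
double sigma_cluster(const std::vector<Tower>& c, int seed) {
    double e_gamma = 0, th2 = 0, ph2 = 0;
    for (const auto& t : c) e_gamma += t.e;
    for (const auto& t : c) {
        th2 += t.e * std::pow(t.theta - c[seed].theta, 2);
        ph2 += t.e * std::pow(t.phi - c[seed].phi, 2);
    }
    return (th2 + ph2) / e_gamma;
}
```

A single-photon-like candidate concentrates essentially all of its energy in the seed tower, giving f_max near 1 and sigma_cluster near 0; pion-like candidates spread out and move away from both limits.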
As well as an isolation condition, the frixione condition (used in the OPAL paper; the ZEUS paper uses a simpler version): for every cone of radius delta <= delta_0 around the candidate, the transverse energy inside the cone must satisfy sum E_T <= epsilon * Egamma * (1 - cos delta) / (1 - cos delta_0), with epsilon = 0.2 in our case.
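A sketch of checking this condition for one candidate follows. The Deposit struct and the defaults eps = 0.2 and cone size R = 1.0 are illustrative choices here, not necessarily the values in our code; since the bound only jumps at each deposit's radius, it is enough to test the condition there:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Deposit { double et, dr; };  // transverse energy and eta-phi distance from candidate

// Frixione-style isolation: for every radius r up to the cone size R,
// the transverse energy inside r must stay below
// eps * e_gamma * (1 - cos r) / (1 - cos R).
bool frixione_isolated(std::vector<Deposit> d, double e_gamma,
                       double eps = 0.2, double R = 1.0) {
    std::sort(d.begin(), d.end(),
              [](const Deposit& a, const Deposit& b) { return a.dr < b.dr; });
    double sum = 0;
    for (const auto& dep : d) {
        if (dep.dr > R) break;
        sum += dep.et;
        double bound = eps * e_gamma * (1 - std::cos(dep.dr)) / (1 - std::cos(R));
        if (sum > bound) return false;  // too much energy too close in
    }
    return true;
}
```

The key feature of the condition is that the allowed energy shrinks to zero as delta -> 0, so soft wide-angle radiation is tolerated but any hard deposit near the candidate fails it.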
Even with these variables, there remains a large background population throughout the signal region, so it's not possible to use them to label individual events as photons or background with any confidence. Instead, both approaches attempt to model the background contribution in these regions through various simulations. The main background particles are identified as pions, etas, and antineutrons, though in our work we're mainly using the background sample provided by the basic jet MC.
The final counting is done (in the OPAL paper) by fitting the measured two-dimensional distribution of sigma_cluster vs f_max to a sum of photon and background distributions, as simulated.
We are currently running two approaches in parallel - the first mimics the OPAL approach, while the second uses an LDA to construct a discriminant, plots the distributions of that discriminant for the simulated photon and background, and fits the data to the sum of those two distributions.
Details about MC:
The signal Monte Carlo is obtained from pythia simulations of the prompt-photon processes listed below. The background Monte Carlo was originally going to be a sum of single-particle simulations of the main background particles, but since that did not work very well, we now use pythia jet simulations instead.
For the photon sample, we use pythia IDs 14 (q qbar->g gamma) , 18 (q qbar->gamma gamma), 29 (q g->q gamma), 114 (g g->gamma gamma), and 115 (gg->g gamma), visible in the plot below. Note that since our 'background' is any particle other than a prompt photon, and since our prompt photon processes generate particles that are not prompt photons, we expect some of our background candidates to come from the pythia IDs listed above. The spike at 0 is from the gamma_jet files, for which the pythia IDs do not seem to be recorded in our ntuple. We don't currently understand why this is, though this may be a code issue.
Our MC is generated in partonic-pt bins, each with the proper falling distribution in pt, meaning every bin has more events at its lower edge than at its higher. To each bin we also apply a weighting factor so that the bins have the correct relative weights.
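Per the weight definition given in the ntuple variable list below, each bin's weight is its cross section divided by the number of events generated in it. A minimal sketch (the numbers in the test are placeholders, not actual STAR cross sections):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct PtBin { double xsec; long n_gen; };  // cross section and generated count for one partonic bin

// weight = xsec / n_gen, so bins generated with comparable statistics but
// very different cross sections get the correct relative normalization.
std::vector<double> bin_weights(const std::vector<PtBin>& bins) {
    std::vector<double> w;
    for (const auto& b : bins) w.push_back(b.xsec / b.n_gen);
    return w;
}
```

This is what lets a handful of events from a low-pt bin legitimately outweigh many events from a high-pt bin, which is also the root of the single-spike KDE problem discussed under Current Issues.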
The base format of the MC is MuDst, along with the associated .geant.root and other .*.root files. The StGammaMaker code is run over the entire set of these to generate a matching set of .gamma_tree.root files containing all the information we're concerned with: the energies of all towers in a cluster of nine centered on the highest tower not already assigned to a cluster in a given event, the tracks in the vicinity (within an eta-phi radius of 1), the SMD strip information underneath the high tower, and the energies in all pre- and postshower layers of that cluster. The code also performs a cut of sorts, removing any candidate with less than 5 GeV.
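The clustering pass described above can be sketched as follows. This is a simplified stand-in for what StGammaMaker does, with hypothetical names, no detector-edge handling, and no phi wrap-around:

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>

struct Tow { int ieta, iphi; double e; bool used = false; };

// Repeatedly take the highest tower not yet assigned to a cluster, claim
// the 3x3 block of towers around it, and keep the candidate only if its
// total energy passes the 5 GeV cut.
std::vector<double> build_candidates(std::vector<Tow>& towers) {
    std::vector<double> energies;
    while (true) {
        int seed = -1;
        for (size_t i = 0; i < towers.size(); ++i)
            if (!towers[i].used && (seed < 0 || towers[i].e > towers[seed].e))
                seed = (int)i;
        if (seed < 0 || towers[seed].e <= 0) break;  // nothing left to cluster
        double sum = 0;
        for (auto& t : towers)
            if (!t.used && std::abs(t.ieta - towers[seed].ieta) <= 1
                        && std::abs(t.iphi - towers[seed].iphi) <= 1) {
                t.used = true;  // each tower joins at most one cluster
                sum += t.e;
            }
        if (sum >= 5.0) energies.push_back(sum);
    }
    return energies;
}
```

Because towers are claimed greedily from the highest seed down, a soft cluster overlapping a hard one loses its shared towers to the hard cluster.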
The output files can be found at:
/star/institutions/mit/wleight/simulations/gamma_jet*.root
and
/star/u/rcorliss/sandbox1/photon*.gamma_tree.root
From these source files we create a single master ntuple (/star/u/rcorliss/sandbox1/huge_weighted_combined.ntuple.root) containing the following variables for each candidate:
event_index -- an integer identifying which event in a particular file the candidate originates from.
id -- a deprecated integer labeling the candidate's geant particle id #
process_id -- the integer value of the pythia process id that created the candidate's particle
detector_id -- a deprecated integer labeling the candidate as either a barrel or endcap event
isForTraining -- a boolean value labeling a candidate to be used as our known sample or our unknown sample
isPromptPhoton -- a boolean value labeling the candidate as a prompt photon (coming from the pythia record)
isInBarrel -- a boolean value, true when the candidate is located in the barrel
isInEndcap -- a (most likely redundant) boolean value, true when the candidate is located in the endcap
weight -- the statistical importance of the candidate, taken as the cross section for that partonic bin divided by the number of events generated in that bin
parton_pt -- the partonic pt from the pythia record of the candidate's event
true_pt -- the true particle pt from the pythia record, using the ->pt () method
pt -- the measured pt of the candidate
parton_ptbin -- the value of the midpoint of the partonic pt bin
ptbin -- the value of the midpoint of the measured pt bin
e_seed -- the energy in the candidate's central (and thus highest-energy) tower
e_rest -- the energy in the towers other than the central tower
e_pre1 -- the energy in the first preshower layer (or only, in the case of barrel events)
e_pre2 -- the energy in the second preshower layer (always 0 in the case of barrel events)
e_post -- the energy in the postshower layer (always 0 for barrel events)
e_u -- the energy in the SMD u plane for endcap, or eta plane for barrel
e_v -- the energy in the SMD v plane for endcap, or phi for barrel
n_tracks -- the number of tracks pointing toward the candidate
n_towers -- the number of towers above threshold
n_pre1 -- the number of preshower 1 layers in the candidate above threshold
n_pre2 -- the number of preshower 2 layers in the candidate above threshold
n_post -- the number of postshower layers in the candidate above threshold
n_u -- the number of u (eta) strips above threshold
n_v -- the number of v (phi) strips above threshold
sum_pt_3 -- the sum of all the pts of all tracks and EMC towers within .3 in eta-phi space of the candidate
sum_pt_5 -- the sum of all the pts of all tracks and EMC towers within .5 in eta-phi space of the candidate
sum_pt_7 -- the sum of all the pts of all tracks and EMC towers within .7 in eta-phi space of the candidate
sum_track_pt_3 -- the sum of all the pts of all tracks within .3 in eta-phi space of the candidate
sum_track_pt_5 -- the sum of all the pts of all tracks within .5 in eta-phi space of the candidate
sum_track_pt_7 -- the sum of all the pts of all tracks within .7 in eta-phi space of the candidate
sum_tower_pt_3 -- the sum of all the pts of all EMC towers within .3 in eta-phi space of the candidate
sum_tower_pt_5 -- the sum of all the pts of all EMC towers within .5 in eta-phi space of the candidate
sum_tower_pt_7 -- the sum of all the pts of all EMC towers within .7 in eta-phi space of the candidate
sum_post_3 -- the sum of all the pts of all EMC postshowers within .3 in eta-phi space of the candidate
sum_post_5 -- the sum of all the pts of all EMC postshowers within .5 in eta-phi space of the candidate
sum_post_7 -- the sum of all the pts of all EMC postshowers within .7 in eta-phi space of the candidate
isIsolated -- a boolean telling us whether the candidate satisfies the frixione isolation condition described above
frixione_fraction -- the largest value (instead of .2) for which the candidate satisfies the frixione condition
theta_1 -- the first theta moment of the candidate's towers, computed about the central tower's center
theta_2 -- the second theta moment of the candidate's towers, computed about the central tower's center
theta_3 -- the third theta moment of the candidate's towers, computed about the central tower's center
phi_1 -- the first phi moment of the candidate's towers, computed about the central tower's center
phi_2 -- the second phi moment of the candidate's towers, computed about the central tower's center
phi_3 -- the third phi moment of the candidate's towers, computed about the central tower's center
f_max -- the fraction of the total candidate energy found in the central tower
sigma_cluster -- the sum of theta2 and phi2
Two-Dimensional Fits:
For convenience, we divide the single ntuple into multiple pieces: a signal and a background set for each of the endcap and the barrel. In each of these we apply a cut on isForTraining to get our 'known' sample. Cuts are then applied: each candidate in this sample must satisfy the frixione isolation condition and have no tracks pointing to it. Distributions are generated, and before the fit is done a smoothing process is applied to them (see KDE below). We use the smoothed two-dimensional distributions as the basis functions for a fit to the unknown sample, fitting each pt bin individually and then checking our results against the actual values in the unknown sample.
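Treating each (sigma_cluster, f_max) cell as one flattened bin, the fit for the signal and background yields can be sketched as an unweighted linear least-squares problem with a closed-form 2x2 solution. The real fit would account for bin errors; the function name is hypothetical:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Find yields (a, b) minimizing the squared difference between the data
// histogram and a*signal + b*background, via the 2x2 normal equations.
void template_fit(const std::vector<double>& data,
                  const std::vector<double>& sig,
                  const std::vector<double>& bkg,
                  double& a, double& b) {
    double ss = 0, bb = 0, sb = 0, sd = 0, bd = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        ss += sig[i] * sig[i];
        bb += bkg[i] * bkg[i];
        sb += sig[i] * bkg[i];
        sd += sig[i] * data[i];
        bd += bkg[i] * data[i];
    }
    double det = ss * bb - sb * sb;  // small when templates are too similar
    a = (bb * sd - sb * bd) / det;
    b = (ss * bd - sb * sd) / det;
}
```

The determinant going small when the two templates look alike is exactly why the discriminating power of sigma_cluster and f_max matters: the more the signal and background shapes differ, the better constrained the yields are.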
The below is an example of a fit attempt showing both the LDA (blue) and the 2D approach (red). (The blue is essentially measuring the total cross section of photons+background.)
KDE:
To overcome the finite statistics in our known sample, we apply smoothing. The textbook method is to replace each data point with a gaussian centered at that point, which lets us interpolate into regions where there is no data. To simplify (and speed up) this process, we instead replace every bin of a finely-binned histogram of the distribution with a gaussian. The end result is the smoothing of sharp peaks, at the cost of erasing any useful information about small-scale structure in the distribution. Proofs exist that, if the gaussian width is reduced as more data are added, the KDE'd distribution converges to the actual parent distribution. A rough example of this smoothing follows:
Before smoothing:
After smoothing:
There are still some troubles with how, exactly, to perform this smoothing. The widths of the gaussians must be carefully tuned in each axis, and we also tune them based on the local density of data, which gives us more parameters to tweak.
Another way to look at this process is as a non-parametric fit to our sample, using a sum of gaussians that will generate a smooth but non- polynomial sort of function.
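A one-dimensional sketch of the binned version described above: every bin is replaced by a gaussian of the same integral, centered on that bin, and the gaussians are summed into a smoothed histogram. The width is in units of bins here; tuning it (per axis, and possibly by local density) is the hard part mentioned above:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Binned KDE: spread each occupied bin of h into a gaussian of the same
// integral, centered on that bin, with standard deviation `width` bins.
std::vector<double> kde_smooth(const std::vector<double>& h, double width) {
    const double PI = 3.141592653589793;
    std::vector<double> out(h.size(), 0.0);
    double norm = 1.0 / (width * std::sqrt(2.0 * PI));
    for (size_t i = 0; i < h.size(); ++i) {
        if (h[i] == 0) continue;  // empty bins contribute nothing
        for (size_t j = 0; j < h.size(); ++j) {
            double z = (double(j) - double(i)) / width;
            out[j] += h[i] * norm * std::exp(-0.5 * z * z);
        }
    }
    return out;
}
```

A single-bin spike comes out as a broad symmetric bump, which is both the point of the method and, for highly weighted lone events, the source of the trouble noted under Current Issues.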
LDA:
Michael Betancourt has spoken about this topic several times, but the basic approach is very simple. We create a signal and a background matrix from our training samples, with rows corresponding to entries and columns to the measured parameters in each entry. Through some simple linear algebra, this gives us a vector in our parameter space that defines the direction separating signal and background best. (This assumes that the matrix is not singular, which is not guaranteed in general, but is something we can control ourselves.)
The basic procedure is:
1 = M_signal * F,  0 = M_background * F
class_i = M_ij * F_j
F_k = (M^-1)_ki * class_i
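The three lines above can be sketched for two parameters, with signal rows targeting 1 and background rows targeting 0 stacked into M, and the least-squares solution F = (M^T M)^-1 M^T class playing the role of the inverse for a non-square M. The function names are hypothetical; the real analysis does this over the full parameter lists:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Solve the 2-parameter least-squares problem min |M*F - class|^2 via the
// 2x2 normal equations; rows are training entries, cls is 1/0 per row.
std::array<double, 2> lda_fit(const std::vector<std::array<double, 2>>& rows,
                              const std::vector<double>& cls) {
    double a = 0, b = 0, c = 0, y0 = 0, y1 = 0;  // accumulated M^T M and M^T class
    for (size_t i = 0; i < rows.size(); ++i) {
        a += rows[i][0] * rows[i][0];
        b += rows[i][0] * rows[i][1];
        c += rows[i][1] * rows[i][1];
        y0 += rows[i][0] * cls[i];
        y1 += rows[i][1] * cls[i];
    }
    double det = a * c - b * b;  // vanishes for degenerate parameters (the singular case)
    return {(c * y0 - b * y1) / det, (a * y1 - b * y0) / det};
}

// The discriminant for a candidate is just the dot product F . x.
double discriminant(const std::array<double, 2>& F,
                    const std::array<double, 2>& x) {
    return F[0] * x[0] + F[1] * x[1];
}
```

The vanishing determinant is the singular-matrix problem mentioned above: a parameter that is constant (or a linear combination of others) in the training sample, like e_pre2 at high pt, makes the system unsolvable and has to be dropped.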
We feed the LDA the following parameters:
For the EEMC LDA (the ones in parentheses are only used for pt<25, after which they lead to singular matrices and need to be dropped):
isIsolated
pt
e_seed
e_pre1
(e_pre2)
(e_post)
n_towers
n_pre1
(n_pre2)
(n_post)
sum_pt_3
sum_track_pt_3
sum_tower_pt_3
sum_tower_pt_5
sum_tower_pt_7
(sum_post_3)
(sum_post_5)
(sum_post_7)
frixione_fraction
sigma_cluster
f_max
For the BEMC LDA:
isIsolated
pt
e_seed
n_tracks
n_towers
sum_pt_3
sum_track_pt_3
sum_tower_pt_3
frixione_fraction
sigma_cluster
f_max
This reduces our task from the multidimensional fit above to a fit of a single variable, the discriminant, calculated by taking the dot product of our F and a particular candidate's parameter vector; it tells us how far into photon-like parameter space the candidate lies. This has fewer parameters to tweak and is faster to perform, since the number of bins is vastly reduced. As this approach matures, we will have to spend some time making sure that the higher-order correlations in the MC correctly represent the real data from the detector.
The code for these approaches is in the following files:
on rcf:
/star/u/rcorliss/sandbox1/StRoot/CreateTrainingTree.C -- Makes the basic ntuple, a single file from all gamma_jet and photon MC files. Technically, the most up-to-date version of this is on Ross's laptop, but that version should be in sync.
/star/u/rcorliss/sandbox1/StRoot/readNtupleClassic.C -- Reads the ntuple (or ntuples, if it's been split for endcap/barrel, photon/background, etc) and performs the 2D fit discussed above, saving various plots. As before, this is a copy of the version on Ross's laptop.
on deltag5:
/Users/rcorliss/trainGammas.C -- takes in the training tree and builds the LDA vector discussed above. These are saved as a series of text files in a subdirectory '/weights/'
/Users/rcorliss/classify_ntuple.C -- reads the full ntuple and creates a new ntuple that includes only the basic parameters of the previous, adding in the discriminant as a leaf.
back on rcf:
/star/u/rcorliss/sandbox1/StRoot/readNtupleSpiffy.C -- reads in the classed ntuple(s) and performs a one-dimensional fit on the discriminant. As before, this is a copy of the version on Ross's laptop.
Current Goal:
Right now we hope to perfect our analysis process to the point that we can input a MC sample with a known number of photons in each pt bin and extract those numbers. At that point, we will be ready to move on to looking at actual data.
Current Issues:
1) We can't currently read the pythia records of the gamma_jet sample correctly. We've contacted Renee Fatemi for help in this, and are
simultaneously rooting through our code to make sure the issue is not on our end.
2) Because of the relative weighting of the various partonic pt bins, and the underconstrained measured value of the photon candidate's pt, it is possible for a few highly weighted events to dominate a measured-pt bin that is primarily populated by a higher partonic bin. This causes problems for the KDE, since these single-bin spikes turn into broad gaussian peaks.
We've considered generating more MC to wash out these sparse events, and are also looking into alternatives to the KDE that might not be so sensitive to them.
Longer-Term Issues:
1) The SMD information is not currently being used, since there is some doubt about how well the MC mocks up the actual detector response. Mike and others have done good work to start resolving this, and we will need to develop good ways to integrate the SMD into our 2D fit approach. The LDA approach is natural - we just include moments and energies in the SMD response into our existing list of parameters.
2) We don't currently have many connections between our MC study and actual data. For the LDA approach especially, we'll need to ensure that higher-order correlations in the MC are also present in the data.
3) There is currently no way to bridge between barrel and endcap in a gamma candidate's myTowers. This would help recover the photons that fall near that edge in either detector. There is, however, the capacity to check for energy within certain eta-phi radii, which is capable of jumping this boundary.