A Summary of Our Photon XS Measurement Efforts

The starting points for our work on measuring the photon cross section were two papers (OPAL: hep-ex/0305075; ZEUS: Phys. Lett. B 413 (1997) 201-260) describing the OPAL and ZEUS approaches to photon measurement. The papers identify two strongly discriminating variables useful in separating pion-like and photon-like events:

sigma_cluster:

    sigma_cluster = sum_i E_i * [ (theta_i - theta_0)^2 + (phi_i - phi_0)^2 ] / E_gamma

f_max:

    f_max = E_max / E_gamma

    In the above, the sums run over the towers in the cluster, E_i is the energy in a particular tower, (theta_i, phi_i) is the position of that tower, (theta_0, phi_0) is the center of the central (seed) tower, E_gamma is the total energy of the candidate, and E_max is the energy of the high tower in the cluster.
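For concreteness, a minimal sketch of computing these two shape variables from a cluster's towers, following the definitions above; the function and variable names here are our own, not from the papers:

    // Hypothetical helper: compute f_max and sigma_cluster for a cluster of
    // n towers.  e[i] is the energy of tower i and (theta[i], phi[i]) its
    // position; tower 0 is taken to be the central (seed) tower.
    void clusterShapes(int n, const double* e, const double* theta,
                       const double* phi, double& f_max, double& sigma_cluster) {
      double e_gamma = 0., e_max = 0., sigma = 0.;
      for (int i = 0; i < n; ++i) {
        e_gamma += e[i];
        if (e[i] > e_max) e_max = e[i];      // energy of the high tower
      }
      for (int i = 0; i < n; ++i) {
        double dth = theta[i] - theta[0];    // moments about the seed center
        double dph = phi[i] - phi[0];
        sigma += e[i] * (dth * dth + dph * dph);
      }
      f_max = e_max / e_gamma;               // fraction of energy in high tower
      sigma_cluster = sigma / e_gamma;       // = theta_2 + phi_2
    }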

As well as an isolation condition:

frixione (used in the OPAL paper; the ZEUS paper uses a simpler version): in its standard form, for every cone of eta-phi radius r < R around the candidate,

    sum_{i in cone} E_T,i <= 0.2 * E_T,gamma * (1 - cos r) / (1 - cos R)

where the sum runs over all other particles within r of the candidate and R is the full isolation cone radius.
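A sketch of one way to test this condition, assuming the standard form above with epsilon = 0.2 and a placeholder cone radius R = 1.0 (all names here are our own): sort the surrounding particles by their eta-phi distance from the candidate and check the running E_T sum at each particle's radius, where the bound is tightest:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct ConeEntry { double r, et; };  // eta-phi distance and transverse energy

    // Hypothetical check: true if a candidate of transverse energy etGamma
    // satisfies the frixione condition for cone radius R and fraction eps.
    bool isFrixioneIsolated(std::vector<ConeEntry> cone, double etGamma,
                            double R = 1.0, double eps = 0.2) {
      std::sort(cone.begin(), cone.end(),
                [](const ConeEntry& a, const ConeEntry& b) { return a.r < b.r; });
      double sumEt = 0.;
      for (const ConeEntry& c : cone) {
        if (c.r > R) break;                  // only particles inside the cone
        sumEt += c.et;
        // the bound must hold for every r <= R; it is tightest at each r_i
        if (sumEt > eps * etGamma * (1. - std::cos(c.r)) / (1. - std::cos(R)))
          return false;
      }
      return true;
    }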

Even with these variables, there remains a large background population throughout the signal region, so it's not possible to use them to label individual events as photons or background with any confidence. Instead, both approaches attempt to model the background contribution in these regions through various simulations. The main background particles are identified as pions, etas, and antineutrons, though in our work we're mainly using the background sample provided by the basic jet MC.

The final counting is done (in the OPAL paper) by fitting the measured two-dimensional distribution of sigma_cluster vs f_max to a sum of photon and background distributions, as simulated.

We are currently running two approaches in parallel: the first mimics the OPAL approach, while the second uses a linear discriminant analysis (LDA) to construct a discriminant, plots the distributions of that discriminant for the simulated photon and background samples, and fits the data to the sum of those two distributions.

Details about MC:

The signal Monte Carlo is obtained from pythia simulations, using the prompt-photon pythia processes listed below. The background Monte Carlo was originally going to be a sum of single-particle simulations of the main background particles, but as that didn't work very well, we now use pythia jet simulations instead.

For the photon sample, we use pythia process IDs 14 (q qbar -> g gamma), 18 (q qbar -> gamma gamma), 29 (q g -> q gamma), 114 (g g -> gamma gamma), and 115 (g g -> g gamma), visible in the plot below. Note that since our 'background' is any particle other than a prompt photon, and since our prompt-photon processes also generate particles that are not prompt photons, we expect some of our background candidates to come from the pythia IDs listed above. The spike at 0 is from the gamma_jet files, for which the pythia IDs do not seem to be recorded in our ntuple. We don't currently understand why this is, though it may be a code issue.

Our MC is generated in partonic pt bins, each with the proper falling distribution in pt, meaning every bin has more events at its lower end than at its higher end. To each bin we also apply a weighting factor so that the bins have the correct relative weights.
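As a concrete illustration of the weighting (see the weight variable below), every event from a given partonic bin enters our histograms with that bin's cross section divided by the number of events generated in it; a trivial sketch, with names of our own choosing:

    #include "TH1D.h"

    // Sketch of the relative bin weighting: an event from partonic bin b is
    // filled with weight sigma_b / N_b.
    void fillWeighted(TH1D* h, double pt, double sigma_b, long nGen_b) {
      h->Fill(pt, sigma_b / (double)nGen_b);
    }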
The base format of the MC is MuDst, along with the associated .geant.root and other .*.root files. The StGammaMaker code is run over the entire set of these to generate a matching set of .gamma_tree.root files containing all the information we're concerned with: the energies of all towers in a cluster of nine centered on the highest tower not already assigned to a cluster in a given event, the tracks in the vicinity (within an eta-phi radius of 1), the SMD strip information underneath the high tower, and the energies in all pre- and post-shower layers of that cluster. The maker also performs a cut of sorts, removing any candidate with less than 5 GeV.

The output files can be found at:
 /star/institutions/mit/wleight/simulations/gamma_jet*.root
 and
 /star/u/rcorliss/sandbox1/photon*.gamma_tree.root

From these source files we create a single master ntuple (/star/u/rcorliss/sandbox1/huge_weighted_combined.ntuple.root) containing the following variables for each candidate:

event_index -- an integer identifying which event in a particular file the candidate originates from.
id -- a deprecated integer labeling the candidate's geant particle id #
process_id -- the integer value of the pythia process id that created the candidate's particle
detector_id -- a deprecated integer labeling the candidate as either a barrel or endcap event
isForTraining -- a boolean value labeling a candidate as part of our known (training) sample or our unknown sample
isPromptPhoton -- a boolean value labeling the candidate as a prompt photon (coming from the pythia record)
isInBarrel -- a boolean value that is true if the candidate is located in the barrel
isInEndcap -- a most-likely redundant boolean that is true if the candidate is located in the endcap
weight -- the statistical importance of the candidate, taken as the cross section for that partonic bin divided by the number of events generated in that bin
parton_pt -- the partonic pt from the pythia record of the candidate's event
true_pt -- the true particle pt from the pythia record, using the ->pt() method
pt -- the measured pt of the candidate
parton_ptbin -- the value of the midpoint of the partonic pt bin
ptbin -- the value of the midpoint of the measured pt bin
e_seed -- the energy in the candidate's central (and thus highest-energy) tower
e_rest -- the energy in the towers other than the central tower
e_pre1 -- the energy in the first preshower layer (the only one, in the case of barrel events)
e_pre2 -- the energy in the second preshower layer (always 0 for barrel events)
e_post -- the energy in the postshower layer (always 0 for barrel events)
e_u -- the energy in the SMD u plane for the endcap, or the eta plane for the barrel
e_v -- the energy in the SMD v plane for the endcap, or the phi plane for the barrel
n_tracks -- the number of tracks pointing toward the candidate
n_towers -- the number of towers in the candidate above threshold
n_pre1 -- the number of preshower 1 elements in the candidate above threshold
n_pre2 -- the number of preshower 2 elements in the candidate above threshold
n_post -- the number of postshower elements in the candidate above threshold
n_u -- the number of u (eta) strips above threshold
n_v -- the number of v (phi) strips above threshold
sum_pt_3 -- the sum of the pts of all tracks and EMC towers within 0.3 in eta-phi space of the candidate
sum_pt_5 -- the sum of the pts of all tracks and EMC towers within 0.5 in eta-phi space of the candidate
sum_pt_7 -- the sum of the pts of all tracks and EMC towers within 0.7 in eta-phi space of the candidate
sum_track_pt_3 -- the sum of the pts of all tracks within 0.3 in eta-phi space of the candidate
sum_track_pt_5 -- the sum of the pts of all tracks within 0.5 in eta-phi space of the candidate
sum_track_pt_7 -- the sum of the pts of all tracks within 0.7 in eta-phi space of the candidate
sum_tower_pt_3 -- the sum of the pts of all EMC towers within 0.3 in eta-phi space of the candidate
sum_tower_pt_5 -- the sum of the pts of all EMC towers within 0.5 in eta-phi space of the candidate
sum_tower_pt_7 -- the sum of the pts of all EMC towers within 0.7 in eta-phi space of the candidate
sum_post_3 -- the sum of the pts of all EMC postshower elements within 0.3 in eta-phi space of the candidate
sum_post_5 -- the sum of the pts of all EMC postshower elements within 0.5 in eta-phi space of the candidate
sum_post_7 -- the sum of the pts of all EMC postshower elements within 0.7 in eta-phi space of the candidate
isIsolated -- a boolean telling us whether the candidate satisfies the frixione isolation condition described above
frixione_fraction -- the largest value of the isolation fraction (in place of the default 0.2) for which the candidate still satisfies the frixione condition
theta_1 -- the first theta moment of the candidate's towers, computed about the central tower's center
theta_2 -- the second theta moment of the candidate's towers, computed about the central tower's center
theta_3 -- the third theta moment of the candidate's towers, computed about the central tower's center
phi_1 -- the first phi moment of the candidate's towers, computed about the central tower's center
phi_2 -- the second phi moment of the candidate's towers, computed about the central tower's center
phi_3 -- the third phi moment of the candidate's towers, computed about the central tower's center
f_max -- the fraction of the total candidate energy found in the central tower
sigma_cluster -- the sum of theta_2 and phi_2
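For reference, a minimal sketch of reading a few of these branches back out of the master ntuple; the tree name "candidates" is an assumption, and error handling is omitted:

    #include "TFile.h"
    #include "TTree.h"

    void readMaster() {
      TFile f("/star/u/rcorliss/sandbox1/huge_weighted_combined.ntuple.root");
      TTree* t = (TTree*)f.Get("candidates");  // tree name is a guess
      Float_t pt, weight, f_max, sigma_cluster;
      t->SetBranchAddress("pt", &pt);
      t->SetBranchAddress("weight", &weight);
      t->SetBranchAddress("f_max", &f_max);
      t->SetBranchAddress("sigma_cluster", &sigma_cluster);
      for (Long64_t i = 0; i < t->GetEntries(); ++i) {
        t->GetEntry(i);
        // ... use the candidate here ...
      }
    }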

Two-Dimensional Fits:

For convenience, we divide the single ntuple into multiple pieces: a signal and a background set for each of the endcap and the barrel. In each of these we apply a cut on isForTraining to get our 'known' sample. Further cuts are then applied: each candidate in this sample must satisfy the frixione isolation condition and have no tracks pointing to it. Distributions are generated, but before the fit is done a smoothing process is applied to them (see KDE below). We use the smoothed two-dimensional distributions as the basis functions for a fit to the unknown sample, fitting each pt bin individually and then checking our results against the actual values in the unknown sample.
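As an illustration of the per-bin fit, a minimal sketch using ROOT's TFractionFitter, which works bin by bin and so accepts 2D histograms directly; the file and histogram names are placeholders, and this is not necessarily line-for-line what readNtupleClassic.C does:

    #include <cstdio>
    #include "TFile.h"
    #include "TH2D.h"
    #include "TObjArray.h"
    #include "TFractionFitter.h"

    void fitOnePtBin() {
      TFile f("templates.root");                            // placeholder file
      TH2D* sig  = (TH2D*)f.Get("h2_signal_smoothed");      // smoothed photon template
      TH2D* bkg  = (TH2D*)f.Get("h2_background_smoothed");  // smoothed background template
      TH2D* data = (TH2D*)f.Get("h2_unknown");              // the unknown sample

      TObjArray mc(2);
      mc.Add(sig);
      mc.Add(bkg);
      TFractionFitter fit(data, &mc);
      if (fit.Fit() == 0) {                                 // 0 = converged
        double frac, err;
        fit.GetResult(0, frac, err);                        // photon fraction
        printf("photon yield = %f +/- %f\n",
               frac * data->Integral(), err * data->Integral());
      }
    }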

Below is an example of a fit attempt showing both the LDA (blue) and the 2D approach (red). (The blue is essentially measuring the total cross section of photons plus background.)

KDE:

To overcome the finite statistics in our known sample, we apply smoothing. The textbook method is to replace each data point with a gaussian centered at that point, which lets us interpolate into regions where there is no data. To simplify (and speed up) this process, we instead replace every bin of a finely-binned histogram of the distribution with a gaussian. The end result is the smoothing of sharp peaks, in exchange for erasing any useful information on small-scale structures in the distribution. Proofs exist that, provided the gaussian width is reduced appropriately as more data are added, the KDE'd distribution converges to the actual parent distribution. A rough example of this smoothing follows:

Before smoothing:

After smoothing:

There are still some troubles with how, exactly, to perform this smoothing. The widths of the gaussians must be carefully tuned in each axis, and we also tune them based on the local density of data, which gives us more parameters to tweak.

Another way to look at this process is as a non-parametric fit to our sample, using a sum of gaussians that generates a smooth but non-polynomial sort of function.
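To illustrate, a one-dimensional sketch of this bin-by-bin smoothing; the real distributions are two-dimensional, and our widths are tuned per axis and by local density, so the single fixed sigma here is a placeholder:

    #include "TH1D.h"
    #include "TMath.h"

    // Replace every bin of a finely-binned histogram with a gaussian of
    // width sigma centered on that bin, weighted by the bin's content.
    TH1D* kdeSmooth(const TH1D* in, double sigma) {
      TH1D* out = (TH1D*)in->Clone("smoothed");
      out->Reset();
      for (int i = 1; i <= in->GetNbinsX(); ++i) {
        double w = in->GetBinContent(i);
        if (w == 0.) continue;
        double mu = in->GetBinCenter(i);
        for (int j = 1; j <= out->GetNbinsX(); ++j)
          out->AddBinContent(j, w * TMath::Gaus(out->GetBinCenter(j), mu, sigma, kTRUE));
      }
      out->Scale(in->Integral() / out->Integral());  // preserve normalization
      return out;
    }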

LDA:

Michael Betancourt has spoken about this topic several times, but the basic approach is very simple. We create a signal and a background matrix from our training samples, with rows corresponding to entries and columns to the measured parameters in each entry. Through some simple linear algebra, this gives us a vector in our parameter space, normal to the hyperplane that best separates signal and background. (This assumes that the matrix is not singular, which is not guaranteed in general, but is something we can tweak ourselves.)

The basic procedure is:

    1 = M_signal * F        0 = M_background * F
    class_i = M_ij * F_j
    F_k = (M^-1)_ki * class_i
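Since M is not square, the inverse above must be read as a pseudoinverse. A minimal sketch of the solve via the normal equations, using ROOT's matrix classes; this is our reading of the procedure, not necessarily what trainGammas.C does line for line:

    #include "TMatrixD.h"
    #include "TVectorD.h"

    // M stacks the signal rows (target class 1) above the background rows
    // (target class 0); cls holds those targets.  Returns F, one component
    // per input parameter.
    TVectorD solveLDA(const TMatrixD& M, const TVectorD& cls) {
      TMatrixD Mt(TMatrixD::kTransposed, M);  // M^T
      TMatrixD MtM(Mt, TMatrixD::kMult, M);   // M^T M (must be non-singular)
      MtM.Invert();
      return MtM * (Mt * cls);                // F = (M^T M)^-1 M^T class
    }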

We feed the LDA the following parameters:
For the EEMC LDA (the parameters in parentheses are only used for pt < 25; above that they lead to singular matrices and need to be dropped):
isIsolated
pt
e_seed
e_pre1
(e_pre2)
(e_post)
n_towers
n_pre1
(n_pre2)
(n_post)
sum_pt_3
sum_track_pt_3
sum_tower_pt_3
sum_tower_pt_5
sum_tower_pt_7
(sum_post_3)
(sum_post_5)
(sum_post_7)
frixione_fraction
sigma_cluster
f_max

For the BEMC LDA:
isIsolated
pt
e_seed
n_tracks
n_towers
sum_pt_3
sum_track_pt_3
sum_tower_pt_3
frixione_fraction
sigma_cluster
f_max

This reduces our task from the multidimensional fit above to a fit of a single variable, the discriminant, calculated by taking the dot product of our F and a particular candidate's parameter vector, which tells us how far into photon-like parameter space the candidate sits. This has fewer parameters to tweak and is faster to perform, since the number of bins is vastly reduced. As this approach matures, we will have to spend some time making sure that the higher-order correlations in the MC correctly represent the real data from the detector.
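The classification step itself is then just this dot product; a trivial sketch with hypothetical names:

    // Discriminant of a candidate with parameter vector x against the LDA
    // vector F; larger values mean more photon-like.
    double discriminant(int n, const double* F, const double* x) {
      double d = 0.;
      for (int k = 0; k < n; ++k) d += F[k] * x[k];
      return d;
    }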

The code for these approaches is in the following files:

on rcf:
/star/u/rcorliss/sandbox1/StRoot/CreateTrainingTree.C -- makes the basic ntuple, a single file from all gamma_jet and photon MC files. Technically, the most up-to-date version of this is on Ross's laptop, but this version should be in sync.
/star/u/rcorliss/sandbox1/StRoot/readNtupleClassic.C -- reads the ntuple (or ntuples, if it's been split for endcap/barrel, photon/background, etc.) and performs the 2D fit discussed above, saving various plots. As before, this is a copy of the version on Ross's laptop.

on deltag5:
/Users/rcorliss/trainGammas.C -- takes in the training tree and builds the LDA vector discussed above. The vectors are saved as a series of text files in a subdirectory '/weights/'.
/Users/rcorliss/classify_ntuple.C -- reads the full ntuple and creates a new ntuple that includes only the basic parameters of the previous one, adding in the discriminant as a leaf.

back on rcf:

/star/u/rcorliss/sandbox1/StRoot/readNtupleSpiffy.C -- reads in the classed ntuple(s) and performs a one-dimensional fit on the discriminant. As before, this is a copy of the version on Ross's laptop.

Current Goal:

     Right now we hope to perfect our analysis process to the point that we can input an MC sample with a known number of photons in each pt bin and extract those numbers. At that point, we will be ready to move on to looking at actual data.

Current Issues:

1) We can't currently read the pythia records of the gamma_jet sample correctly. We've contacted Renee Fatemi for help with this, and are simultaneously rooting through our code to make sure the issue is not on our end.

2) Because of the relative weighting of the various partonic pt bins, and the poorly constrained measured pt of the photon candidate, it is possible for a few highly weighted events to dominate a measured-pt bin that is primarily populated by a higher partonic bin. This causes problems with the KDE, since these single-bin spikes turn into broad gaussian peaks.
        We've considered generating more MC to wash out these sparse events, and are also looking into alternatives to the KDE that might not be so sensitive to them.

Longer-Term Issues:

1) The SMD information is not currently being used, since there is some doubt about how well the MC mocks up the actual detector response. Mike and others have done good work to start resolving this, and we will need to develop good ways to integrate the SMD into our 2D fit approach. For the LDA approach the integration is natural: we just add the moments and energies of the SMD response to our existing list of parameters.

2) We don't currently have many connections between our MC study and actual data. For the LDA approach especially, we'll need to ensure that the higher-order correlations in the MC are also present in the data.

3) There is currently no way to bridge between the barrel and the endcap in a gamma candidate's myTowers collection. Bridging would help recover the photons that fall near that edge in either detector. There is, however, the capacity to check for energy within certain eta-phi radii, which is capable of jumping this boundary.