Systematic Errors in the Photon Signal Extraction

To extract the prompt photon signal from the overwhelming QCD background, the photon analysis proceeds in two major steps.  First, all relevant variables are mapped to a single one-dimensional discriminant using a machine learning technique known as a Gaussian process.  Second, this one-dimensional spectrum is fit to the sum of signal and background shapes inferred from simulation.  The fitted normalization of the signal shape (and its uncertainty) immediately gives an estimate of the true photon source strength.
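The second step can be illustrated with a minimal sketch.  It assumes binned, unit-normalized signal and background shapes and uses a weighted linear least-squares fit as a stand-in for whatever fitter the analysis actually employs; the function name fit_templates and the Gaussian weights are assumptions made for illustration, and neither the Gaussian process step nor the finite Monte Carlo statistics are modeled.

```python
import numpy as np

def fit_templates(data, signal_shape, background_shape):
    """Fit data = n_sig * signal + n_bkg * background by weighted least squares.

    Both shapes are normalized to unit area, so the fitted coefficients
    are yields.  Per-bin variances are approximated by max(data, 1),
    a Gaussian approximation to a Poisson likelihood fit.
    """
    data = np.asarray(data, dtype=float)
    s = np.asarray(signal_shape, dtype=float)
    b = np.asarray(background_shape, dtype=float)
    s = s / s.sum()
    b = b / b.sum()
    sigma = np.sqrt(np.maximum(data, 1.0))
    # Design matrix and data vector, each scaled by the per-bin uncertainty.
    A = np.column_stack([s, b]) / sigma[:, None]
    t = data / sigma
    cov = np.linalg.inv(A.T @ A)      # covariance of the two yields
    yields = cov @ (A.T @ t)          # (n_sig, n_bkg)
    errors = np.sqrt(np.diag(cov))    # statistical uncertainties
    return yields, errors
```

The first entry of yields, with the first entry of errors, plays the role of the signal normalization and its uncertainty described above.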

An accurate estimate assumes that the signal and background shapes, and ultimately the simulation from which they were inferred, are consistent with the data.  The usual physicist's approach is to vary parameters in the simulation (cuts, gains, efficiencies, etc.) to determine the effect on the signal estimate, but the fitting procedure itself offers a more direct test.

If the simulation and data are consistent, then the residuals of the signal/background fit should be consistent with Poisson fluctuations (or, given sufficient counts, Gaussian fluctuations).  Because of uncertainties in the simulation (due to finite Monte Carlo statistics), the number of degrees of freedom, and hence the appropriate reduced chi-squared, is not straightforward to calculate, and the usual frequentist hypothesis tests are not applicable.  There is another consequence of consistency, however: the fluctuations are independent, so there should be no systematic autocorrelations in the residuals.  Bounding any autocorrelation and its effect on the extracted signal provides an immediate bound on the systematic error of the fit.
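As a rough illustration of what an autocorrelation in the residuals means, the sketch below computes normalized residuals and their sample autocorrelation at a given lag.  The function names and the choice of the lag-1 statistic are assumptions for illustration; the text does not specify which measure the analysis uses.

```python
import numpy as np

def pulls(data, model):
    """Normalized residuals, (data - model) / sqrt(model), per bin."""
    data = np.asarray(data, dtype=float)
    model = np.asarray(model, dtype=float)
    # Guard against empty model bins when forming the Poisson width.
    return (data - model) / np.sqrt(np.maximum(model, 1.0))

def lag_autocorrelation(residuals, lag=1):
    """Sample autocorrelation of the residuals at the given lag (lag >= 1)."""
    r = np.asarray(residuals, dtype=float)
    d = r - r.mean()
    denom = np.sum(d * d)
    if denom == 0.0:
        return 0.0
    return float(np.sum(d[:-lag] * d[lag:]) / denom)
```

For independent fluctuations over N bins this statistic scatters around zero with a width of roughly 1/sqrt(N), while a smooth trend across neighboring bins drives it toward +1.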

In the prompt photon analysis, the signal distribution is largely restricted to y > 0, where y is the Gaussian process discriminant.  Systematic residuals in the region y > 0 affect the signal directly, while residuals in the region y < 0 affect the signal only implicitly through changes in the background distribution.  The residual behavior seen in data is systematic across all energy bins (except for those where the residuals are consistent with noise), and hence will be illustrated with only a single bin here.

There are clearly three distinct behaviors evident in the residuals; each is discussed below.  In order to decouple the systematic autocorrelation from the expected statistical fluctuations, the residuals must be fit with a smooth function.  Given the limited size of each region, however, these fits will not fully remove the sensitivity to fluctuations, and hence the resulting estimates will be conservative.  The assumptions taken below to bound the effect on the extracted signal only make the estimates more conservative.  Indeed, applying this procedure to simulation results in overestimates: the bounds come out at 10%-20% of the statistical uncertainty, whereas a tight bound would vanish.

Left Interval

In the background-dominated region y < -0.7 there is a systematic deficit in the data (or equivalently an excess in the simulation).  A conservative resolution of this deficit would be to increase the background distribution everywhere, including the signal region y > 0.  The increase there would then imply an excess of signal events in the extraction.  Numerically, this excess is bounded by the increase in background counts in the signal region after the deficit is resolved.
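One possible reading of this bound is sketched below.  It assumes that the deficit is resolved by a single fractional rescaling of the background, set by the summed residual in y < -0.7 relative to the background counts there (equivalent to fitting a constant to the residuals), and that the same fraction is then applied to the background in y > 0.  The function name, the inputs, and the rescaling prescription are illustrative assumptions, not the analysis code.

```python
import numpy as np

def left_interval_bound(y, residual, background, left_edge=-0.7, signal_edge=0.0):
    """Bound on the signal excess from the left-interval discrepancy.

    y          : bin centers of the discriminant
    residual   : data minus fitted model, per bin
    background : fitted background counts, per bin
    """
    y = np.asarray(y, dtype=float)
    residual = np.asarray(residual, dtype=float)
    background = np.asarray(background, dtype=float)
    left = y < left_edge
    signal_region = y > signal_edge
    # Fractional size of the discrepancy in the background-dominated region.
    fraction = abs(residual[left].sum()) / background[left].sum()
    # Change in background counts in the signal region under the same rescaling.
    return float(fraction * background[signal_region].sum())
```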

Middle Interval

Centered in the residuals is a clear linear trend.  Any effect from the trend in the background region y < 0 is dominated by the systematic in the Left Interval and will consequently be ignored.  The integrated trend in the signal region, however, implies an excess of signal events.
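The integrated trend can be sketched as a straight-line fit to the residuals followed by a sum of the fitted line over the signal-region bins.  The interval edges of -0.7 and 0.8 are taken from the neighboring intervals and are an assumption here, as are the function name and the use of an unweighted fit.

```python
import numpy as np

def middle_interval_bound(y, residual, lo=-0.7, hi=0.8, signal_edge=0.0):
    """Sum of the fitted linear residual trend over bins with y > signal_edge."""
    y = np.asarray(y, dtype=float)
    residual = np.asarray(residual, dtype=float)
    middle = (y >= lo) & (y <= hi)
    # Unweighted straight-line fit to the residuals in the middle interval.
    slope, intercept = np.polyfit(y[middle], residual[middle], 1)
    signal_region = middle & (y > signal_edge)
    # Summing the smooth fit, not the raw residuals, suppresses
    # bin-to-bin statistical fluctuations.
    return float(np.sum(slope * y[signal_region] + intercept))
```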

Right Interval

The deviation in the extreme signal region y > 0.8 implies a deficit of signal events in the extraction.  Because of the small width of this region, a fit doesn't help to remove the statistical fluctuations here and the error is bounded by the total summed residuals.

The effects of the two excesses (Left and Middle Intervals) are added in quadrature to give the low systematic, while the deficit (Right Interval) immediately gives the high systematic.
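The quadrature sum used for the two same-direction contributions can be written as a small helper.  Adding in quadrature treats the contributions as independent; the function name and the example numbers in the comment are purely illustrative.

```python
import math

def combine_in_quadrature(*terms):
    """Square root of the sum of squares of the given contributions."""
    return math.sqrt(sum(t * t for t in terms))

# Illustrative numbers only: contributions of 3 and 4 events combine to 5.
```

A single contribution passes through unchanged, which matches the systematic that is taken directly from one interval.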