Unfolding details

Details and formalism regarding unfolding, specifically addressing the effect of the range used.

Formalism


We consider 1D unfolding. Let the generated and reconstructed bins be enumerated 1, 2, ..., n, and let there be an additional reconstructed bin n+1 that corresponds to "not reconstructed". Let the unfolding matrix be defined on bins n1 to n2. Let i always index generated bins and j always index reconstructed bins. We then consider several cases: whether a given pi0 falls within the generated set of bins [n1, n2] or not, and whether the same pi0 falls within the reconstructed set of bins [n1, n2] or not. Note: not being in the reconstructed set of bins [n1, n2] could mean being reconstructed in other bins or not being reconstructed at all (bin n+1).

The following table details how each case is handled.
Case  Generated  Reconstructed  How it is handled
1     IN         IN             Usual domain of the smearing matrix
2     IN         OUT            Accounted for by the reconstruction efficiency
3     OUT        IN             A background that needs to be removed
4     OUT        OUT            Not relevant
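The four cases can be sketched as a small classification function. This is only an illustration: the function name classify and the returned labels are hypothetical and are not taken from the analysis code.

```python
def classify(gen_bin, reco_bin, n1, n2):
    """Return how a pi0 is handled, given its generated and reconstructed bins.

    Bins are 1-based. reco_bin == n + 1 means "not reconstructed", which is
    automatically outside [n1, n2] as long as n2 <= n.
    """
    gen_in = n1 <= gen_bin <= n2
    reco_in = n1 <= reco_bin <= n2
    if gen_in and reco_in:
        return "smearing matrix"   # generated IN, reconstructed IN
    if gen_in:
        return "efficiency"        # generated IN, reconstructed OUT
    if reco_in:
        return "background"        # generated OUT, reconstructed IN
    return "not relevant"          # generated OUT, reconstructed OUT
```

For example, with n = 9 and an unfolding range of [3, 8], a pi0 generated in bin 5 and not reconstructed (bin 10) is counted by the efficiency, while one generated in bin 2 and reconstructed in bin 3 is background.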

 


Cases 1 & 2: Generated IN

There is an interplay between the definitions of the smearing matrix and the reconstruction efficiency. One can choose to define them relative to "reconstructed in" or just relative to "reconstructed." As long as both are defined consistently, the normalization factor cancels mathematically and has no effect. The product of the reconstruction efficiency and smearing matrix is related to the conditional probability formula, the probability of reconstructed p_T being formula given the generated p_T is formula, and the two options represent two ways of splitting this conditional probability into two conditional probabilities.

Option 1: Relative to "reconstructed in"

The smearing matrix and reconstruction efficiency are numeric, binned estimates of the underlying probability distributions
  • Smearing matrix: formula
  • Reconstruction efficiency: formula
In terms of the migration matrix formula, they are defined as

formula

formula

Option 2: Relative to "reconstructed"

The smearing matrix and reconstruction efficiency are numeric, binned estimates of the underlying probability distributions
  • Smearing matrix: formula
  • Reconstruction efficiency: formula
In terms of the migration matrix formula, they are defined as

formula

formula
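One way to write the two options explicitly is sketched here. The symbols are assumptions chosen for illustration rather than the original notation: M_{ij} is the migration matrix entry for generated bin i and reconstructed bin j (with j = n+1 meaning not reconstructed), S_{ij} is the smearing matrix, and \epsilon_i is the reconstruction efficiency.

```latex
% Option 1: relative to "reconstructed in [n_1, n_2]"
S^{(1)}_{ij} = \frac{M_{ij}}{\sum_{j'=n_1}^{n_2} M_{ij'}}, \qquad
\epsilon^{(1)}_{i} = \frac{\sum_{j'=n_1}^{n_2} M_{ij'}}{\sum_{j'=1}^{n+1} M_{ij'}}

% Option 2: relative to "reconstructed" (any of the bins 1..n)
S^{(2)}_{ij} = \frac{M_{ij}}{\sum_{j'=1}^{n} M_{ij'}}, \qquad
\epsilon^{(2)}_{i} = \frac{\sum_{j'=1}^{n} M_{ij'}}{\sum_{j'=1}^{n+1} M_{ij'}}

% In both cases the normalization cancels in the product
\epsilon_{i} \, S_{ij} = \frac{M_{ij}}{\sum_{j'=1}^{n+1} M_{ij'}}
```

With these definitions the product is the fraction of everything generated in bin i that ends up reconstructed in bin j, independent of n1 and n2.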

Note

In both options 1 and 2, the numerator of the reconstruction efficiency is the denominator of the smearing matrix, and so the product is the same. Thus, while the numeric values of the smearing matrix and efficiency do change with changing n1 and n2, the cross section (which includes the product of the two) is unaffected by these changes to the smearing matrix and efficiency.
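The cancellation can be checked numerically. The following is a minimal sketch with a made-up toy migration matrix; the function names and the matrix layout (rows are generated bins 1..n, columns are reconstructed bins 1..n plus a final "not reconstructed" column) are assumptions for illustration, not the analysis code.

```python
import numpy as np

def split_option1(M, n1, n2):
    """Option 1: normalize the smearing matrix to 'reconstructed in [n1, n2]'."""
    sel = slice(n1 - 1, n2)                 # 1-based inclusive range -> 0-based slice
    reco_in = M[sel, sel].sum(axis=1)       # efficiency numerator = smearing denominator
    return M[sel, sel] / reco_in[:, None], reco_in / M[sel, :].sum(axis=1)

def split_option2(M, n1, n2):
    """Option 2: normalize the smearing matrix to 'reconstructed' (any bin 1..n)."""
    sel = slice(n1 - 1, n2)
    reco_any = M[sel, :-1].sum(axis=1)      # last column is 'not reconstructed'
    return M[sel, sel] / reco_any[:, None], reco_any / M[sel, :].sum(axis=1)

# Toy migration matrix with made-up entries (n = 6).
rng = np.random.default_rng(0)
M = rng.random((6, 7))

s1, e1 = split_option1(M, 2, 5)
s2, e2 = split_option2(M, 2, 5)
product1 = e1[:, None] * s1
product2 = e2[:, None] * s2
print(np.allclose(product1, product2))      # products agree between the two options
print(np.allclose(s1, s2))                  # the smearing matrices themselves differ

# Narrowing the range changes S and the efficiency, but not the product on shared bins.
s3, e3 = split_option1(M, 3, 5)
product3 = e3[:, None] * s3
print(np.allclose(product1[1:, 1:], product3))
```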

Case 3: Generated Out, Reconstructed In

This is the case where the choice of n1, n2 has an impact on the cross section.  One can write the unfolding equation as either
formula and formula from the migration matrix via
formula
where L is the luminosity of the data in mbarns (since the code normalizes the MC data to counts per mbarn).  The Monte Carlo value for formula is
formula
Setting formula, using the Monte Carlo to estimate all values, yields
formula.
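For orientation, one possible explicit form of the two background-removal options is sketched below. This is a reconstruction under assumptions, not the original equations: N_j is the data count in reconstructed bin j, b_j is the background from pi0s generated outside [n1, n2], f_j is the multiplicative factor, M_{ij} is the migration matrix normalized to counts per unit luminosity, and U_{ij} stands for whatever unfolding matrix is derived from the smearing matrix. Whether the code applies f_j by multiplying or dividing is a matter of convention that is assumed here.

```latex
% Subtractive form
N^{\mathrm{gen}}_{i} = \frac{1}{\epsilon_{i}} \sum_{j=n_1}^{n_2} U_{ij} \left( N_{j} - b_{j} \right),
\qquad b_{j} = L \sum_{i \notin [n_1, n_2]} M_{ij}

% Multiplicative form
N^{\mathrm{gen}}_{i} = \frac{1}{\epsilon_{i}} \sum_{j=n_1}^{n_2} U_{ij} \, f_{j} \, N_{j}

% Monte Carlo estimate of the reconstructed counts
N^{\mathrm{MC}}_{j} = L \sum_{i=1}^{n} M_{ij}

% Setting N_j - b_j = f_j N_j with all quantities taken from the Monte Carlo
f_{j} = 1 - \frac{\sum_{i \notin [n_1, n_2]} M_{ij}}{\sum_{i=1}^{n} M_{ij}}
      = \frac{\sum_{i=n_1}^{n_2} M_{ij}}{\sum_{i=1}^{n} M_{ij}}
```

In this form L cancels in f_j, so the multiplicative factor depends only on the shape of the migration matrix, while b_j depends on the absolute normalization.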

Thus far, the code uses the method with the multiplicative factor. A priori, neither method is necessarily preferable. Either choice accounts for events smearing from outside the range [n1, n2] into the range. Thus, as long as this background can be accurately determined, there is no need to have "buffer" bins, i.e. to show the results only for a smaller subset of [n1, n2].
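The difference between the two methods can be illustrated with a toy calculation. This sketch assumes that the background b is the luminosity-scaled sum of migration matrix entries generated outside [n1, n2], and that the factor f is one minus the background fraction; the function name, the matrix layout (rows are generated bins, columns are reconstructed bins plus a final "not reconstructed" column), and all numbers are made up for illustration.

```python
import numpy as np

def background_terms(M, n1, n2, lumi):
    """Return (b, f) for reconstructed bins n1..n2 from migration matrix M."""
    cols = slice(n1 - 1, n2)                       # reconstructed bins in range
    gen_in = np.zeros(M.shape[0], dtype=bool)
    gen_in[n1 - 1:n2] = True                       # generated bins in range
    from_outside = M[~gen_in, cols].sum(axis=0)    # generated OUT, reconstructed IN
    all_generated = M[:, cols].sum(axis=0)         # everything reconstructed IN
    b = lumi * from_outside                        # subtractive background
    f = 1.0 - from_outside / all_generated         # multiplicative factor
    return b, f

rng = np.random.default_rng(0)
M = rng.random((6, 7))                             # toy matrix, made-up entries
lumi = 2.0                                         # made-up luminosity
b, f = background_terms(M, 2, 5, lumi)

# If the data match the MC prediction, both methods give the same corrected counts.
mc_reco = lumi * M[:, 1:5].sum(axis=0)
print(np.allclose(mc_reco - b, f * mc_reco))

# If the data sit 20% above the MC, the two methods no longer agree.
data_reco = 1.2 * mc_reco
print(np.allclose(data_reco - b, f * data_reco))
```

The subtractive method trusts the absolute MC normalization of the background, while the multiplicative method trusts only the background fraction, which is one way to see why neither is obviously preferable.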

Best Option Given Our Circumstance

The following is Figure 2.3 from Analysis Note II.

The MC is normalized to the data in the region of 8-12 GeV. While the data is slightly higher than the MC in the 5-6 GeV bin, there is a significant difference in the 4-5 GeV bin. So far, all indications suggest that this is due to the trigger and to being so far below threshold. The simulated trigger thresholds for data (which are at least 10% higher than any hardware or MC filter thresholds) are 4.2 GeV (HT) and 6.0 GeV (TP). Note: the discrepancy begins almost exactly at 6.0 GeV.
 

Effect on the smearing matrix and reconstruction efficiency

Note that the pT resolution is smaller than the bin width. For any bin, the events smearing in come mainly from one adjacent bin: the one with the higher number of events near the shared bin edge. For most bins, the pT spectrum is higher at the lower pT edge, and so events smear in from below. For the 4-5 GeV bin, the spectrum is higher at the upper edge, and thus events smear in from above. The migration matrix (Figures 3.20 and 3.21 from the analysis note) shows that, of all events reconstructed with pT in 5-6 GeV, 75% are in the right bin, 12% came from 4-5 GeV and 10% came from the bin above.

However, since the data has more events than the MC in the 4-5 GeV bin, I would anticipate that the migration matrix underestimates the amount of smearing from 4-5 into 5-6. Thus, I have higher trust in the smearing matrix when it excludes the 4-5 GeV bin. It seems that unfolding with the 4-5 GeV bin in the smearing matrix overestimates the systematic uncertainty, as it introduces more systematics than are actually present in the central values obtained by unfolding without it.

In other words, the discrepancy between data and Monte Carlo does not affect (for a given generated p_T bin) the sum of the number reconstructed in 4-5 GeV and the number not reconstructed. It only affects the relative amount between these two columns of Figure 3.20. Thus, as long as n1 and n2 are chosen such that only the sum of bin j=2 (the 4-5 GeV bin) and bin j=10 (the not reconstructed or "no match" column) is used, the effect of the discrepancy should be minimized. We therefore trust unfolding starting at the 5-6 GeV bin and are a bit cautious about unfolding using the 4-5 GeV bin.
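The argument about the sum of columns j=2 and j=10 can be illustrated with a toy matrix. The sketch below assumes the "reconstructed in" normalization (Option 1), n = 9 so that column 10 is "not reconstructed", and made-up matrix entries; the function name smear_and_eff is hypothetical.

```python
import numpy as np

def smear_and_eff(M, n1, n2):
    """Smearing matrix and efficiency, normalized to 'reconstructed in [n1, n2]'."""
    sel = slice(n1 - 1, n2)
    reco_in = M[sel, sel].sum(axis=1)
    return M[sel, sel] / reco_in[:, None], reco_in / M[sel, :].sum(axis=1)

# Toy 9 x 10 migration matrix: column index 1 is j=2 (4-5 GeV),
# column index 9 is j=10 (not reconstructed).
rng = np.random.default_rng(1)
M = rng.random((9, 10)) + 0.1

# Mimic the data/MC discrepancy: move counts from 'not reconstructed' into the
# 4-5 GeV column, keeping the sum of the two columns fixed for every generated bin.
M_shift = M.copy()
moved = 0.5 * M[:, 9]
M_shift[:, 1] += moved
M_shift[:, 9] -= moved

# Unfolding range starting at 5-6 GeV (bin 3): nothing changes.
s_a, e_a = smear_and_eff(M, 3, 9)
s_b, e_b = smear_and_eff(M_shift, 3, 9)
print(np.allclose(s_a, s_b), np.allclose(e_a, e_b))

# Unfolding range starting at 4-5 GeV (bin 2): both quantities change.
s_c, e_c = smear_and_eff(M, 2, 9)
s_d, e_d = smear_and_eff(M_shift, 2, 9)
print(np.allclose(s_c, s_d), np.allclose(e_c, e_d))
```

In this toy, a shift that preserves the sum of the two columns leaves the smearing matrix and efficiency untouched only when the 4-5 GeV bin is excluded from the range.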

Effect on the background subtraction

Unlike the smearing matrix and reconstruction efficiency, the background involves summing over the generated bins, not the reconstructed bins. The effect of the MC having fewer events than the data in the 4-5 GeV bin is therefore less clear. If we assume that the MC generated the correct spectrum, and that the trouble involves an interplay between being below trigger threshold and nuances of the momentum distribution among constituents of the jet, then the discrepancy has a minor effect. In any case, it is not obvious that either the "subtract" or the "divide" method of removing the background has any advantage over the other.

Still to come...

  • Try unfolding using formula instead of formula to see what the effect is, and either choose one as preferable or assign a systematic.
    • The result is here.

Conclusions

Since the background factor f accounts for events smeared in from outside the bins included in the smearing matrix, there is no reason to have extra "buffer bins". Since the factor f is computed from the same basic migration matrix as the smearing matrix, inaccuracies in f are linked to inaccuracies in the larger smearing matrix that includes buffer bins. The only reason that one is more trustworthy than the other has to do with the sums. Since we believe the data/MC discrepancy does not affect the sum of column 2 and column 10, but only the relative amount between the two, we can state that using a "buffer bin", i.e. including the 4-5 GeV bin in the smearing matrix, is affected to a much higher degree by the discrepancy than excluding this bin and letting the smearing between bins 4-5 and 5-6 be accounted for by the background correction factor f.

Note: the statistical uncertainty on f is already propagated through to a systematic uncertainty on the cross section. One can thus argue that there is no problem with starting the unfolding at 5-6 GeV and showing 5-6 GeV as the first point of the cross section, and that the uncertainty assigned as the difference between this and unfolding starting with 4-5 GeV is overkill. To be conservative, I think leaving this extra uncertainty in is fine. But I do not see sufficient scientific justification for not showing the result for 5-6 GeV.