sgliske's blog

We consider 1D unfolding. Let the generated and reconstructed bins be enumerated 1,2,...,n, and let there be an additional reconstructed bin n+1, which corresponds to "not reconstructed." Let the unfolding matrix be defined on bins n_1 through n_2. Let i always index generated bins and j always index reconstructed bins. We then consider several cases: whether a pi0 falls within the generated set of bins [n_1,n_2] or not, and whether the same pi0 falls within the reconstructed set of bins [n_1,n_2] or not. Note: not being in the reconstructed set of bins [n_1,n_2] could mean being reconstructed in other bins or not being reconstructed at all (bin n+1).

The following table details how each case is handled.

Generated	Reconstructed	How it is handled
IN	IN	Usual domain of the smearing matrix
IN	OUT	Accounted for by the reconstruction efficiency
OUT	IN	Is a background that needs to be removed
OUT	OUT	Not relevant

Cases 1 & 2: Generated IN

The definitions of the smearing matrix and the reconstruction efficiency are interdependent. One can choose to define them relative to "reconstructed in" or just relative to "reconstructed." As long as both are defined consistently, the normalization factor cancels mathematically and has no effect. The product of the reconstruction efficiency and smearing matrix is related to the conditional probability $p(p_T^R |p_T^G)$, the probability that the reconstructed p_T is $p_T^R$ given that the generated p_T is $p_T^G$, and the two options represent two ways of splitting this conditional probability into two conditional probabilities.
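As a sketch of the two splittings, writing "reco in" for "reconstructed in [n_1,n_2]" and "reco" for "reconstructed in any of the bins 1,...,n" (shorthand assumed here to match the options below), the chain rule for conditional probabilities gives, for $p_T^R$ inside [n_1,n_2],

$p(p_T^R|p_T^G) = p(p_T^R|\mbox{reco in}, p_T^G) \, p(\mbox{reco in}|p_T^G) = p(p_T^R|\mbox{reco}, p_T^G) \, p(\mbox{reco}|p_T^G)$

The first factor in each product plays the role of the smearing matrix and the second the role of the reconstruction efficiency. Both products equal the same left-hand side, which is why the choice of normalization drops out.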

Option 1: Relative to "reconstructed in"

The smearing matrix and reconstruction efficiency are numeric, binned estimates of the underlying probability distributions:

Smearing matrix: $p(p_T^R|\mbox{reco in}, p_T^G)$
Reconstruction efficiency: $p(\mbox{reco in}|p_T^G)$

In terms of the migration matrix $M_{j,i}$, they are defined as

$S_{j,i} = \frac{ M_{j,i} }{ \sum_{j'=n_1}^{n_2} M_{j',i}}$

$\epsilon_i = \frac{ \sum_{j'=n_1}^{n_2} M_{j',i}}{ \sum_{j'=1}^{n+1} M_{j',i}}$

Option 2: Relative to "reconstructed"

The smearing matrix and reconstruction efficiency are numeric, binned estimates of the underlying probability distributions:

Smearing matrix: $p(p_T^R|\mbox{reco}, p_T^G)$
Reconstruction efficiency: $p(\mbox{reco}|p_T^G)$

In terms of the migration matrix $M_{j,i}$, they are defined as

$S_{j,i} = \frac{ M_{j,i} }{ \sum_{j'=1}^{n} M_{j',i}}$

$\epsilon_i = \frac{ \sum_{j'=1}^{n} M_{j',i}}{ \sum_{j'=1}^{n+1} M_{j',i}}$

Note

In both options 1 and 2, the numerator of the reconstruction efficiency is the denominator of the smearing matrix, so the product is the same. Thus, while the numeric values of the smearing matrix and efficiency do change with changing n1 and n2, the cross section (which includes the product of the two) is unaffected by these changes to the smearing matrix and efficiency.
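The cancellation can be checked numerically. The following is a minimal sketch, not the analysis code: the function name `smearing_and_efficiency`, the array layout (rows are reconstructed bins j with the last row being "not reconstructed", columns are generated bins i) and the random toy migration matrix are all assumptions made for illustration.

```python
import numpy as np

def smearing_and_efficiency(M, n1, n2, option):
    """Return (S, eps) for generated and reconstructed bins in [n1, n2].

    M has shape (n + 1, n): row j-1 is reconstructed bin j, the last row is
    the "not reconstructed" bin n+1, and column i-1 is generated bin i.
    n1 and n2 are 1-based and inclusive. (Hypothetical layout.)
    """
    n = M.shape[1]
    cols = slice(n1 - 1, n2)                  # generated bins i in [n1, n2]
    total = M[:, cols].sum(axis=0)            # sum over j = 1 .. n+1
    if option == 1:                           # relative to "reconstructed in"
        norm = M[n1 - 1:n2, cols].sum(axis=0)
    else:                                     # relative to "reconstructed"
        norm = M[:n, cols].sum(axis=0)
    S = M[n1 - 1:n2, cols] / norm             # S[j, i], normalized per column i
    eps = norm / total                        # eps[i]
    return S, eps

# Toy migration matrix with n = 5 bins plus the "not reconstructed" row.
rng = np.random.default_rng(0)
n = 5
M = rng.uniform(0.1, 1.0, size=(n + 1, n))

S1, eps1 = smearing_and_efficiency(M, 2, 4, option=1)
S2, eps2 = smearing_and_efficiency(M, 2, 4, option=2)

# S and eps differ between the options, but eps_i * S_{j,i} does not.
products_agree = np.allclose(S1 * eps1, S2 * eps2)
print(products_agree)
```

In this sketch `S1 * eps1` broadcasts the efficiency along the generated-bin axis, so each entry is the product of the efficiency for generated bin i and the smearing matrix element for (j, i).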

Case 3: Generated Out, Reconstructed In

This is the case where the choice of n1, n2 has an impact on the cross section. One can write the unfolding equation as either
$y_j - y^{OUT,IN}_j = \sum_{i=n_1}^{n_2} S_{j,i} x_i$
or
$y_j/f_j = \sum_{i=n_1}^{n_2} S_{j,i} x_i$,
either subtracting the background linearly or dividing by the factor $f_j$ (equivalently, multiplying by the signal fraction $1/f_j$). One can estimate $y^{OUT,IN}_j$ and $f_j$ from the migration matrix via
$y^{OUT,IN}_j = L \sum_{i=1}^{n_1-1} M_{j,i} + L \sum_{i=n_2+1}^{n} M_{j,i}$
where L is the luminosity of the data in inverse mbarns (since the code normalizes the MC data to counts per inverse mbarn). The Monte Carlo value for $y_j$, per unit luminosity, is
$y_j = \sum_{i=1}^{n} M_{j,i}$
Setting $y_j - y^{OUT,IN}_j = y_j/f_j$, using the Monte Carlo to estimate all values so that the factor of L cancels, yields
$f_j = \frac{\sum_{i=1}^{n} M_{j,i}}{\sum_{i=n_1}^{n_2} M_{j,i}}$.

Thus far, the code uses the method with the multiplicative factor. A priori, neither method is necessarily preferable. Note, however, that either choice accounts for events smearing from outside the range [n1 ,n2] into the range. Thus, as long as this background can be accurately determined, there is no need to have "buffer" bins, i.e. to show the results only for a smaller subset of [n1 ,n2].
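The two background treatments can be compared on a toy migration matrix. This is a sketch under stated assumptions, not the analysis code: the function `background_terms`, the array layout (rows are reconstructed bins with a trailing "not reconstructed" row, columns are generated bins 1,...,n) and the 20% data excess are hypothetical.

```python
import numpy as np

def background_terms(M, n1, n2, lumi=1.0):
    """Return (y, y_out_in, f) for reconstructed bins j in [n1, n2].

    M has shape (n + 1, n); n1 and n2 are 1-based and inclusive.
    """
    reco = slice(n1 - 1, n2)
    y = lumi * M[reco, :].sum(axis=1)               # all generated bins
    signal = lumi * M[reco, n1 - 1:n2].sum(axis=1)  # generated in [n1, n2]
    y_out_in = y - signal                           # generated OUT, reco IN
    f = y / signal                                  # background factor f_j
    return y, y_out_in, f

rng = np.random.default_rng(0)
M = rng.uniform(0.1, 1.0, size=(7, 6))              # toy matrix, n = 6
y, y_out_in, f = background_terms(M, n1=2, n2=5)

# On the Monte Carlo itself the two methods agree by construction.
mc_subtract = y - y_out_in
mc_divide = y / f

# If the data yield differs from the MC (here a made-up 20% excess),
# the two methods no longer give the same corrected yield.
y_data = 1.2 * y
data_subtract = y_data - y_out_in
data_divide = y_data / f
```

The subtraction treats the background as a fixed number of counts, while the division treats it as a fixed fraction of the measured yield, which is why the two only coincide when the data yield matches the Monte Carlo.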

Best Option Given Our Circumstance

The following is Figure 2.3 from Analysis Note II.

The MC is normalized to the data in the region of 8-12 GeV. While the data is slightly higher than the MC in the 5-6 GeV bin, there is a significant difference in the 4-5 GeV bin. So far, all indications suggest that this is due to the trigger, with this bin being so far below threshold. The simulated trigger thresholds (which are at least 10% higher than any hardware or MC filter thresholds) for data are 4.2 GeV (HT) and 6.0 GeV (TP). Note: the discrepancy begins almost exactly at 6.0 GeV.

Effect on the smearing matrix and reconstruction efficiency

Note that the pT resolution is smaller than the bin width. For any bin, the events smearing in come from the adjacent bin, namely the one with the higher number of events near the bin edge. For most bins, the pT spectrum is higher at the lower pT edge, and so events smear in from below. For the 4-5 bin, the upper edge is higher, and thus events smear in from above. The migration matrix (Figures 3.20 and 3.21 from the analysis note) shows that, of all events reconstructed with pT in 5-6 GeV, 75% are in the right bin, 12% came from 4-5 GeV and 10% came from the bin above. However, since the data has more events in the 4-5 bin, I would anticipate that the migration matrix underestimates the amount of smearing from 4-5 into 5-6. Thus, I have higher trust in the smearing matrix when it excludes the 4-5 GeV bin.

It seems that unfolding with the 4-5 GeV bin in the smearing matrix overestimates the systematic uncertainty, as it introduces more systematics than are actually present in the central values obtained by unfolding without it. In other words, the discrepancy between data and Monte Carlo does not affect (for a given generated p_T bin) the sum of the number of events reconstructed in 4-5 and the number not reconstructed. It only affects the relative amount between these two columns of Figure 3.20. Thus, as long as n1 and n2 are chosen such that only the sum of bin j=2 (the 4-5 GeV bin) and j=10 (the not reconstructed or "no match" column) is used, the effect of the discrepancy should be minimized. Thus we trust unfolding starting at bin 5-6 and are a bit cautious about unfolding using 4-5.
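The argument about the sum of bin j=2 and bin j=10 can be illustrated with a toy model. This is only a sketch: the random migration matrix, the helper `eff_times_smear`, and the choice to move half of the not-reconstructed events into bin j=2 are assumptions made for illustration, and do not represent the actual data/MC discrepancy.

```python
import numpy as np

def eff_times_smear(M, n1, n2):
    """Product eps_i * S_{j,i} for j, i in [n1, n2].

    By either option above this equals M_{j,i} divided by the sum of
    column i over all reconstructed bins j = 1 .. n+1.
    """
    total = M.sum(axis=0)
    return (M / total)[n1 - 1:n2, n1 - 1:n2]

rng = np.random.default_rng(1)
n = 9                                  # row index 9 is j = 10, "not reconstructed"
M = rng.uniform(0.1, 1.0, size=(n + 1, n))

# Toy distortion: for every generated bin, move events from "not
# reconstructed" (j = 10) into reconstructed bin j = 2, keeping their sum fixed.
M_shift = M.copy()
moved = 0.5 * M[n, :]
M_shift[1, :] += moved
M_shift[n, :] -= moved

# Unfolding from bin 3 upward only sees the sum of j = 2 and j = 10.
unchanged_from_3 = np.allclose(eff_times_smear(M, 3, 9),
                               eff_times_smear(M_shift, 3, 9))
# Unfolding from bin 2 upward uses the j = 2 row directly.
unchanged_from_2 = np.allclose(eff_times_smear(M, 2, 9),
                               eff_times_smear(M_shift, 2, 9))
print(unchanged_from_3, unchanged_from_2)
```

In this toy, excluding bin j=2 leaves the product of efficiency and smearing matrix untouched by the distortion, while including it does not.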

Effect on the background subtraction

Unlike for the smearing matrix and reconstruction efficiency, the background involves summing over the generated bins, not the reconstructed ones. It is less clear how to determine the effect of the MC having less data in the 4-5 bin. If we assume that the MC generated the correct spectrum, and that the troubles involve an interplay between being below trigger threshold and nuances of the momentum distribution among constituents of the jet, then the discrepancy has a minor effect. In any case, it is not obvious that either the "subtract" or the "divide" method of removing the background has any advantage over the other.

Still to come...

Try unfolding using the subtraction method instead of the multiplicative factor to see what the effect is, and either choose one as preferable or assign a systematic.
- The result is here.

Conclusions

Since the background factor f accounts for events smeared in from outside the bins included in the smearing matrix, there is no reason to have extra "buffer bins". Since the factor f is computed from the same basic migration matrix as the smearing matrix, there are links between inaccuracies in f and inaccuracies in the larger smearing matrix that includes buffer bins. The only reason that one is more trustworthy than the other has to do with the sums. Since we believe the data/MC discrepancy does not affect the sum of column 2 and column 10, but only the relative amount between the two, we can state that using a "buffer bin", i.e. including the 4-5 GeV bin in the smearing matrix, is affected to a much higher degree by the discrepancy than excluding this bin and letting the smearing between bins 4-5 and 5-6 be accounted for by the background correction factor f. Note: the statistical uncertainty on f is already propagated through to a systematic uncertainty on the cross section. One can thus argue that there is no problem starting unfolding at 5-6 and showing 5-6 as the first point of the cross section, and that the uncertainty assigned as the difference between this and unfolding starting with 4-5 is overkill. To be conservative, I think leaving this extra uncertainty in is fine. But I do not see sufficient scientific justification to not show the result for 5-6 GeV.
