Proposal for Run IV

Under:

Procedure proposal for production and QA in Year4 run

Jérôme LAURET & Lanny RAY, 2004

Summary: The qualitative increase in data volume for Run 4, together with the finite CPU capacity at RCF, precludes multiple reconstruction passes through the full raw data volume next year. This new computing situation, together with recent experience with production runs that were not pre-certified prior to full-scale production, motivates a significant change in the data quality assurance (QA) effort in STAR. This note describes the motivation and the proposed implementation plan.

Introduction

The projection for the next RHIC run (also called the Year 4 run, starting by the end of 2003) indicates a factor of five increase in the number of collected events compared to preceding runs. This will increase the required data-production turn-around time by an order of magnitude, from months to one year per full-scale production run. The qualitative increase in reconstruction demands, combined with an increasingly aggressive physics analysis program, will strain the available data-processing resources and poses a severe challenge to STAR and the RHIC computing community in delivering STAR’s scientific results on a reasonable time scale. This situation will become more and more problematic as our Physics program evolves to include rare probes. It is not unexpected and has been anticipated since before the inception of RCF. The STAR decadal plan (a 10 year projection of STAR activities and development) clearly describes the need for several upgrade phases, including a factor of 10 increase in data acquisition rate and analysis throughput by 2007.

An average of 1.2 passes through the raw data represents the ideal, minimal number needed to produce calibrated data summary tapes for physics analysis. In STAR, however, we have typically processed the raw data an average of 3.5 times; at each step, major improvements in the calibrations enabled more accurate reconstruction, resulting in greater precision in the physics measurements. The Year 4 data sample will include the new ¾-barrel EMC data, which makes it unlikely that sufficiently accurate calibrations and reconstruction can be achieved with only the ideal 1.2 passes: we foresee the need for additional calibration passes through the entire data set in order to accumulate enough statistics to push the energy calibration to the high-Pt limit.

While drastically diverging from the initial computing requirement plans (1), this mode of operation, in conjunction with the expanded production timetable, calls for a strengthening of the procedures for calibration, production and quality assurance.

The following table summarizes the expectations for ~70 million events with a mix of central and minbias triggers. Numbers of files and data storage requirements are also included for guidance (a small arithmetic sketch reproducing the totals follows the table).


Au+Au 200 (minbias)                 35 M central    35 M minbias    Total
--------------------------------------------------------------------------
No DAQ100 (1 pass)                  329 days        152 days        481 days
No DAQ100 (2 passes)                658 days        304 days        962 days
Assuming DAQ100 (1 pass)            246 days        115 days        361 days
Assuming DAQ100 (2 passes)          493 days        230 days        723 days
Total storage estimated (raw)       x               x               203 TB
Total storage estimated (1 pass)    x               x               203 TB
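
For orientation, the short Python sketch below reproduces the turn-around totals of the table from the per-sample estimates and scales them by the number of reconstruction passes. It is illustrative only: the per-sample durations are taken directly from the table, the function and dictionary names are ours, and small differences with respect to the table (e.g. 722 vs. 723 days) are rounding.

    # Illustrative only: reproduce the turn-around totals of the table above
    # from the per-sample estimates, and scale by the number of passes.
    # Durations (days) are taken from the table; names are hypothetical.

    ONE_PASS_DAYS = {
        "no_daq100":   {"central_35M": 329, "minbias_35M": 152},
        "with_daq100": {"central_35M": 246, "minbias_35M": 115},
    }

    def production_days(daq100: bool, passes: float) -> float:
        """Estimated wall-clock days for the full ~70 M event sample."""
        sample = ONE_PASS_DAYS["with_daq100" if daq100 else "no_daq100"]
        return passes * sum(sample.values())

    if __name__ == "__main__":
        for daq100 in (False, True):
            for passes in (1, 2):
                print(f"DAQ100={daq100!s:5}  passes={passes}  "
                      f"~{production_days(daq100, passes):.0f} days")
        # The 'ideal' 1.2 passes discussed above, assuming DAQ100:
        print(f"1.2 passes with DAQ100: ~{production_days(True, 1.2):.0f} days")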


Quality Assurance: Goals and proposed procedure for QA and productions

What is QA in STAR?

The goal of the QA activities in STAR is the validation of data and software, up to DST production. While QA testing can never be exhaustive, the intention is that data that pass the QA testing stage should be considered highly reliable for downstream physics analysis. In addition, QA testing should be performed soon after production of the data, so that errors and problems can be caught and fixed in a timely manner.

QA processes are run independently of the data taking and DST production. These processes contain the accumulated knowledge of the collaboration with respect to potential modes of failure of data taking and DST production, along with those physics distributions that are most sensitive to the health of the data and DST production software. The results probe the data in various ways:

  • At the most basic level, the questions asked are whether the data can be read and whether all the components expected in a given dataset are present. Failures at this level are often related to problems with computing hardware and software infrastructure.

  • At a more sophisticated level, distributions of physics-related quantities are examined, both as histograms and as scalar quantities extracted from the histograms and other distributions. These are compared to those of previous runs that are known to be valid, and the stability of the results is monitored. Observed changes must be understood in terms of changing running conditions or controlled changes in the software; otherwise an error flag should be raised. Deviations are not always bad, of course, and can signal new physics: QA must be used with care in areas where there is a danger of biasing the physics results of STAR (a minimal comparison sketch follows this list).
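
As a concrete (and deliberately simplified) illustration of the second kind of check, the sketch below compares a newly produced histogram to a reference histogram with a per-bin chi-square and raises a flag above a threshold. It is not the STAR autoQA code: the test, the threshold and all names are assumptions made for this example.

    # Minimal sketch of a reference-histogram comparison; NOT the autoQA code.
    # The chi-square test, the threshold and the example numbers are assumptions.

    def chi2_per_bin(new_hist, ref_hist):
        """Chi-square per bin between two binned distributions (same binning)."""
        assert len(new_hist) == len(ref_hist)
        chi2, nbins = 0.0, 0
        for n, r in zip(new_hist, ref_hist):
            if n + r > 0:                       # skip empty bins
                chi2 += (n - r) ** 2 / (n + r)  # Poisson-like errors assumed
                nbins += 1
        return chi2 / max(nbins, 1)

    def qa_flag(new_hist, ref_hist, threshold=2.0):
        """Flag a deviation from the reference for human follow-up.

        A flag is a prompt, not a verdict: deviations may reflect changed
        running conditions, controlled software changes, or new physics.
        """
        value = chi2_per_bin(new_hist, ref_hist)
        return ("FLAG" if value > threshold else "OK", value)

    # Example: a multiplicity-like distribution against a validated reference run.
    reference = [120, 340, 560, 410, 220, 90, 30]
    current   = [118, 352, 548, 430, 210, 85, 95]   # excess in the last bin
    print(qa_flag(current, reference))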

Varieties of QA in STAR

The focus of the QA activities until summer 2000 was on Offline DST production for the DEV branch of the library. With the start of data taking, the scope of QA has broadened considerably. There are, in fact, two different servers running autoQA processes (a schematic summary of these varieties is sketched after this list):

  • Offline QA. This autoQA-generated web page accesses QA results for all the varieties of Offline DST production:

    • Real data production by the Fast Offline framework. This is used to catch gross errors in data taking, online trigger and calibration, allowing the situation to be corrected before too much data is accumulated (this framework also provides on-the-fly calibration as the data are produced).

    • Nightly tests of real and Monte Carlo data (almost always using the DEV and NEW branches of the library). These are used principally to validate the migration of library versions.

    • Large scale production of real and Monte Carlo data (almost always using the PRO branch of the library). This is used to monitor the stability of DSTs for physics.

  • Online QA. This autoQA-generated web page accesses QA results for data in the Online event pool, both raw data and DST production that is run on the Online processors.
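
Purely as an illustration of how these varieties might be catalogued, the structure below summarizes the Offline and Online QA categories and the library branches named above. The branch names (DEV, NEW, PRO) and purposes come from the text; the layout and field names are our assumption, not the actual autoQA configuration.

    # Hypothetical summary of the QA varieties listed above; the structure and
    # field names are illustrative, not the actual autoQA configuration.

    QA_VARIETIES = {
        "offline": [
            {"name": "fast-offline",     "data": "real",
             "branch": None,             # branch not specified in this note
             "purpose": "catch gross data-taking/trigger/calibration errors early"},
            {"name": "nightly-tests",    "data": "real + MC",
             "branch": "DEV/NEW",
             "purpose": "validate migration of library versions"},
            {"name": "large-scale-prod", "data": "real + MC",
             "branch": "PRO",
             "purpose": "monitor stability of DSTs for physics"},
        ],
        "online": [
            {"name": "event-pool",       "data": "raw + DST",
             "branch": "online",
             "purpose": "QA of data in the Online event pool"},
        ],
    }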

The QA dilemma

While a QA shift is usually organized during data taking, later official production runs were only encouraged (not mandated) to be regularly QA-ed; typically there has been no organized QA effort for post-experiment DST production runs. This absence of organized quality assurance allowed several post-production problems to arise. They were eventually discovered at the (later) physics analysis stage, but by then the entire production run had been wasted. Examples include the following:

  1. missing physics quantities in the DSTs (e.g. V0, Kinks, etc ...)

  2. missing detector information, or collections of information, due to pilot errors or lack of code support

  3. improperly calibrated and unusable data

  4. ...

The net effect of such late discoveries is a drastic increase in the production cycle time: entire production passes have to be repeated, which a careful QA procedure could have prevented.

Production cycles and QA procedure

To address this problem we propose the following production and QA procedure for each major production cycle.

  1. A data sample (e.g. from a selected trigger setup or detector configuration) of not more than 100k events (Au+Au) or 500k events (p+p) will be produced prior to the start of the production of the entire data sample.

  2. This data sample will remain available on disk for a period of two weeks, or until all members of the QA team (as defined below) have approved the sample, whichever comes first.

  3. After the two-week review period, the remainder of the sample is produced without further delay, with or without the explicit approval of everyone on the QA team (a sketch of this sign-off logic follows this list).

  4. Production schedules will be rigorously maintained. Missing quantities detected after the start of the production run do not necessarily warrant a repetition of the entire run.

  5. The above policy does not apply to special or unique data samples involving calibration or reconstruction studies nor would it apply to samples having no overlaps with other selections. Such unique data samples include, for example, those containing a special trigger, magnetic field setting, beam-line constraint (fill transition), etc., which no other samples have and which, by their nature, require multiple reconstruction passes and/or special attention.
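
Items 2 and 3 define when full-scale production may begin: as soon as every QA team member has signed off, or once the two-week window has expired, whichever comes first. The sketch below captures that decision logic; the team names and dates are hypothetical and serve only to illustrate the rule.

    # Sketch of the sign-off logic in items 2 and 3 above: full-scale production
    # starts once every QA team member has approved the pre-production sample,
    # or when the two-week review window expires, whichever comes first.
    # Team membership and dates below are hypothetical.

    from datetime import date, timedelta

    REVIEW_WINDOW = timedelta(weeks=2)

    def may_start_full_production(sample_ready: date, today: date,
                                  team: set, approvals: set) -> bool:
        """True when the remainder of the data sample may be produced."""
        all_approved   = team <= approvals                 # every member signed off
        window_expired = today >= sample_ready + REVIEW_WINDOW
        return all_approved or window_expired

    # Example usage with an illustrative team of PWG and sub-system reviewers.
    team      = {"spectra", "hbt", "highpt", "tpc", "emc"}
    approvals = {"spectra", "tpc"}
    print(may_start_full_production(date(2004, 1, 5), date(2004, 1, 12), team, approvals))  # False
    print(may_start_full_production(date(2004, 1, 5), date(2004, 1, 20), team, approvals))  # True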

In order to carry out timely and accurate quality assurance evaluations during the proposed two-week period, we propose the formation of a permanent QA team consisting of:

  1. One or two members per Physics Working Group (PWG), under the responsibility of the PWG conveners. These individuals will rigorously check, via the autoQA system or analysis codes specific to the PWG, for the presence of the physics quantities of interest that are understood to be vital for the PWG’s Physics program and studies (a minimal checklist sketch follows this list).

  2. One or more detector sub-system experts from each of the major detector sub-systems in STAR. The goal of these individuals will be to ensure the presence and sanity of the data specific to that detector sub-system.

  3. With the understanding that the outcome of such a procedure and QA team has a direct positive impact on the Physics capabilities of each PWG, we recommend that this QA service work be done without shift sign-ups or shift credit, as is presently done for DAQ100 and ITTF testing.
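
As an illustration of the presence checks described in item 1, the sketch below verifies a per-PWG list of required DST quantities (e.g. the V0s and kinks mentioned earlier). The PWG names, the required-quantity lists and the event representation are assumptions for this example; a real check would go through the autoQA system or the PWG's own analysis code.

    # Minimal sketch of a per-PWG presence check for required DST quantities
    # (item 1 above).  The PWG names, required quantities and the DST
    # representation are illustrative assumptions.

    REQUIRED_QUANTITIES = {
        "strangeness": {"V0", "Kink", "Xi"},
        "spectra":     {"PrimaryTracks", "dEdx"},
        "ebye":        {"PrimaryTracks", "GlobalTracks"},
    }

    def missing_quantities(pwg: str, dst_contents: set) -> set:
        """Quantities the PWG needs that are absent from the produced DSTs."""
        return REQUIRED_QUANTITIES[pwg] - dst_contents

    # Example: a test-production DST missing the kink collection.
    produced = {"V0", "Xi", "PrimaryTracks", "GlobalTracks", "dEdx"}
    for pwg in REQUIRED_QUANTITIES:
        missing = missing_quantities(pwg, produced)
        status = "OK" if not missing else f"MISSING: {sorted(missing)}"
        print(f"{pwg:12s} {status}")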

Summary

Facing important challenges driven by the data volume and Physics needs, we propose an organized procedure for QA and production that relies on cohesive feedback from the PWGs and detector sub-system experts within well-defined time constraints. The intent is clearly to bring the data to readiness in the shortest possible turn-around time while avoiding later re-production and the attendant waste of CPU cycles and human hours.