Offline QA Shifts (Y2000 run)
Peter Jacobs, July 11, 2000
This document is a first try at describing procedures for the Offline
QA shift crew. As you will see, there are a number of open questions
concerning what should be done during this shift and how to do it,
whose answers we will have only after we gain experience with real
data. Please give feedback to the contacts listed under "STAR QA links
and contacts" on what you find confusing, what could be done better,
and what doesn't make any sense to you.
- Scope of the Offline QA shift activities
The proposed scope of the Offline QA shift is to assess the quality of
the DSTs being produced by the Offline Production team. There are
several classes of data to be examined:
- Large scale production of real data: data that will be used for physics analysis
- Large scale production of MC data: MC data that will be used for
detailed physics studies and corrections for data analysis.
- Nightly tests of real and MC data: limited number of events run in the DEV or NEW
branches of the library. These are used to test the libraries and validate them
prior to a new release and migration DEV->NEW->PRO.
- Express queue of real data: a small fraction (~5%?) of real data
will be channeled to an express production queue during the running of
the experiment, to serve as feedback to the crews running the
experiment. The results of this production should be reported as soon
as possible, typically at the 5 p.m. meeting in the counting house.
The autoQA system can apply arbitrary sets of "tests" to the scalars
extracted from the data by the QA macros, raising errors or warnings
when these tests are failed. Which tests and what cuts to apply to
real data are complex issues that can only be addressed after we gain
some experience. Consequently, the automated testing aspects of the
autoQA framework will be applied only at a very low level for real
data for this summer's run. The decision about the quality of the data
will have to be made by the shift crew, i.e. you, by looking at the
data in detail.
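To make the idea of "tests" on scalars concrete, here is a minimal sketch in Python. It is not the autoQA implementation. The class ScalarTest, the function run_tests, the scalar names used in the usage note, and the choice to treat a missing scalar as an error are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ScalarTest:
    """One hypothetical test applied to a single QA scalar."""
    scalar: str                      # name of the scalar to check
    passes: Callable[[float], bool]  # returns True if the value is acceptable
    severity: str                    # "warning" or "error" raised on failure


def run_tests(scalars: Dict[str, float],
              tests: List[ScalarTest]) -> List[Tuple[str, str]]:
    """Return a (severity, message) pair for every test that fails."""
    failures = []
    for test in tests:
        if test.scalar not in scalars:
            # Assumption: a scalar that is absent counts as an error.
            failures.append(("error", f"{test.scalar}: missing"))
        elif not test.passes(scalars[test.scalar]):
            value = scalars[test.scalar]
            failures.append((test.severity,
                             f"{test.scalar}: value {value} failed test"))
    return failures
```

For example, a test such as ScalarTest("n_events", lambda v: v > 0, "error") would raise an error for a run whose (hypothetical) n_events scalar is zero. Choosing the cuts is exactly the open question described above.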
- Use of autoQA
The principal tool for the Offline QA shift crew is the autoQA web
page. Discussion of QA in general and detailed usage of that page
can be found in "STAR QA for Offline Software", which you should be
familiar with before you read the rest of this document. Usage of
autoQA version 2 is very similar to the old autoQA (version 1), so if
you used that you should be able to understand the following.
There have, however, been many changes behind the scenes. The major changes are:
- autoQA now interfaces to the MySQL databases. It queries the
Production File Catalog for completed jobs, and writes QA information
back to a QA database. The latter can be used in future in the tag DB
or some other mechanism, once a reliable QA cycle is established.
- autoQA can now handle the range of data classes specified in the
introduction.
- All QA ROOT jobs are now run on rcas under LSF. This change was
necessary in anticipation of a large volume of QA processes once large
scale data taking starts. This of course also introduces another layer
of complexity into the QA framework, and monitoring of autoQA jobs on
rcas will be part of the QA shift work.
Some of the more complex displays of tables and documentation now
start an auxiliary browser window. If you started this window earlier
in an autoQA session and minimized it to get it out of the way while
you did other things, you may be confused about why the browser is not
responding when you make certain requests. It is in fact sending the
data to the minimized window, which you should re-expand.
- Offline QA Shift Tasks
- Which runs to examine?
Discuss the recent production with the Production Crew and establish a
prioritized list of runs to QA. The express queue mechanism is still
under discussion and is not set up yet, but once it is established it
should receive highest priority for timely feedback to the counting
house. The other criteria for setting priorities are whether urgent
feedback is needed for a library release and whether other runs require
special attention. Otherwise, the shift crew should look at the most
recent production that has been QA-ed under the various classes of
data.
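The ordering described above can be written down as a simple sort key. This is only a sketch of one reading of the priorities: the Run record and its fields (express, urgent, produced_at) are hypothetical names, and the real list should still be agreed with the Production Crew.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """Hypothetical record describing one produced run."""
    run_id: int
    express: bool = False      # came through the express queue
    urgent: bool = False       # library release feedback or special attention
    produced_at: float = 0.0   # production time, e.g. seconds since epoch


def prioritize(runs: List[Run]) -> List[Run]:
    """Express queue first, then urgent runs, then most recent production."""
    # False sorts before True, so negate the flags; negate the time so that
    # more recent production comes first within each group.
    return sorted(runs,
                  key=lambda r: (not r.express, not r.urgent, -r.produced_at))
```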
Since the autoQA mechanism queries the File Catalog once an hour (for
real data, less frequently for other data classes) and submits QA
batch jobs on rcas, there may be a significant delay between when
production is run and when the QA results become available. We will
have to monitor this process and adjust the procedures as
necessary. Feedback on this point from the shift crew is essential.
- How to look at a run
I will specify how to look at a run in the data class "Real Data
Production". Other data classes will have different selection
procedures, reflecting the differences in the File Catalog structure
for these different classes, but the changes should be obvious.
- Select "Real Data Production" from the pulldown menu in the banner.
- Use the pulldown menus to compose a DB query that includes the
run you are interested in. The simplest procedure at the moment is to
specify the runID and leave all other fields at "any". In the near
future these selections will include trigger, calibration and geometry
information. Note that the default for "QA status" is "done".
- Press "Display Datasets". A listing of all catalogued runs
corresponding to your query will appear in the upper frame.
- To examine the QA histograms, press the "QA details" button. In
the lower panel, a set of links to the histogram files will
appear. The format is gzipped postscript. If your browser is set up to
launch ghostview for files of type "ps", these files will be
automatically unzipped and displayed. Otherwise, you will have to do
something more complicated, such as save the file and view it another
way. Note that if the macro "bfcread_hist_to_ps" is reported to have
crashed, some or all histograms may be missing.
- To examine the QA scalars and tests, scroll past the histogram
links in the lower panel and push the button. Tables of scalars for
all the data branches will appear in the auxiliary window.
- To compare the QA scalars to similar runs, press the "Compare
reports" button. Details on how to proceed are found in the autoQA
documentation. Note that until more refined selections are available
for real data (e.g. comparing runs with identical trigger conditions
and processing chains), this facility will be of limited utility. Note
also that the planned functionality of automatically comparing to a
standard reference run has not yet been implemented, for similar
reasons.
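If your browser does not launch ghostview automatically, one way to "save the file and view it another way" is to save the gzipped postscript file and unzip it yourself. The sketch below uses only the Python standard library. The function name gunzip_postscript and the file name in the usage note are illustrative assumptions.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_postscript(src: str) -> Path:
    """Decompress a saved '.ps.gz' file next to itself and return the new path."""
    src_path = Path(src)
    if src_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src_path.name}")
    dst_path = src_path.with_suffix("")  # 'hists.ps.gz' -> 'hists.ps'
    with gzip.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)    # stream, so large files are fine
    return dst_path
```

Calling gunzip_postscript("hists.ps.gz") on a saved file (hypothetical name) would leave hists.ps in the same directory, which can then be opened with any postscript viewer.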
- What QA data to examine
This area needs significant discussion. What we are generally looking
for is that all data are present and can be read (scalar values should
appear in all branches) and that the results look physically
meaningful (e.g. vertex distribution histograms). Comparison to
previous, similar runs to check for stability is highly desirable but
it is not clear how to carry this out at present, for reasons
described above. We should revisit this question as we gain more
experience.
The principal QA tool is the histograms, generated by
bfcread_hist_to_ps. The number of QA histograms has grown enormously
over the past six months and needs to be pruned back to be useful to
the non-expert. This work is going on now (week of July 10) and more
information will be forthcoming.
Description of all the macros run by autoQA is found here. This
documentation is important for understanding the meaning of the
QA scalars.
Here are some general guidelines on what to report:
- Status of run: completed or, if not, the error status (segmentation violation, etc.)
- Macros that crashed
- Macros whose QA status is not "O.K." (At present, this means
simply that there is no data in the branch that macro is trying to
read. No additional tests are applied to the data.)
- Anomalous histograms and scalars - this is necessarily vague at this point.
More specific rules for what should be in the report will be
forthcoming. Input on this question is welcome.
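Until more specific rules exist, it may help to keep the items above in a fixed layout so that reports from different shifts are easy to compare. The following is one possible sketch, not an agreed format. The class RunReport and its field names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunReport:
    """Hypothetical per-run summary following the guidelines above."""
    run_id: str
    status: str                                   # "completed" or error status
    crashed_macros: List[str] = field(default_factory=list)
    not_ok_macros: List[str] = field(default_factory=list)
    anomalies: List[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Render the report as plain text suitable for pasting into a mail."""
        def fmt(items: List[str]) -> str:
            return ", ".join(items) if items else "none"

        lines = [
            f"Run {self.run_id}",
            f"  Status of run: {self.status}",
            f"  Macros that crashed: {fmt(self.crashed_macros)}",
            f"  Macros with QA status not O.K.: {fmt(self.not_ok_macros)}",
            f"  Anomalous histograms and scalars: {fmt(self.anomalies)}",
        ]
        return "\n".join(lines)
```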
- How to report results
Once per shift you should send a status report to the QA
hypernews forum:
starqa-hn@www.star.bnl.gov
If you are doing Offline QA shifts, you should subscribe to this forum.
The autoQA framework has a "comment" facility that allows the user to
annotate particular runs or to enter a "global comment" that will
appear chronologically in the listing of all runs. These are displayed
together with the datasets, and while not appropriate for lengthy
reports, can serve as flags for specific problems and supply
hyperlinks to longer reports. Note that this is not a high security
system (anyone can alter or delete your messages).
You do not need the QA Expert's password to use this facility. Press
the button "Add or edit comments" in the upper right part of the upper
panel. You will be asked for some identifying string that will be
attached to your comments. Enter your name and press return. You will
have to press "Display Datasets" again, at which point a button "Add
global comment" will appear below the pulldown menus, and each run
listing will have an "Add comment" button. Follow the
instructions. Messages are interpreted as html, so links to other
pages can be introduced. One possibility is to enter the hyperlink to
the QA report you have sent to starqa-hn. This can obviously be
automated, but it isn't yet and doing it by hand should be
straightforward.
- Checking QA jobs on rcas
Every two hours you should check the status of autoQA jobs running on
rcas, by clicking on "RCAS/LSF monitor" (upper right, under the "Add
or Edit Comments" button). You cannot alter jobs using this browser
unless you have the Expert's password, so there is no possibility of
doing damage. Select jobs called QA_TEST. Each of these is a set of QA
macros for a single run, which should require up to 10 minutes of CPU
time. The throughput of this system for QA is as yet unknown, but you
should check that jobs are not sitting in the PENDING queue for more
than an hour or two, and are not stalling while running (should not
take more than 15 minutes CPU). In case of problems, contact an expert.
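The two checks above (jobs pending too long, jobs using too much CPU) can be summarized as follows. This sketch does not talk to LSF or to the RCAS/LSF monitor. It assumes you have already copied the job information into Job records by hand or by some other means. The Job fields, the state strings "PENDING" and "RUNNING", and the choice of two hours as the pending limit are assumptions.

```python
from dataclasses import dataclass
from typing import List

PENDING_LIMIT_MIN = 120  # upper end of "an hour or two" (assumed)
CPU_LIMIT_MIN = 15       # running jobs should not exceed 15 minutes CPU


@dataclass
class Job:
    """Hypothetical snapshot of one batch job as seen in the monitor."""
    job_id: int
    name: str
    state: str           # "PENDING" or "RUNNING" (assumed labels)
    pending_min: float   # minutes spent waiting in the queue
    cpu_min: float       # CPU minutes used so far


def flag_jobs(jobs: List[Job]) -> List[str]:
    """Return a message for every QA_TEST job that looks stuck."""
    flagged = []
    for job in jobs:
        if job.name != "QA_TEST":
            continue  # only autoQA jobs are of interest here
        if job.state == "PENDING" and job.pending_min > PENDING_LIMIT_MIN:
            flagged.append(
                f"job {job.job_id}: pending for {job.pending_min:.0f} min")
        elif job.state == "RUNNING" and job.cpu_min > CPU_LIMIT_MIN:
            flagged.append(
                f"job {job.job_id}: {job.cpu_min:.0f} min CPU, possibly stalled")
    return flagged
```

Any job returned by flag_jobs is a candidate for contacting an expert. The thresholds will probably need adjusting once the real throughput is known.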
Peter Jacobs
Last modified: Tue Jul 11 02:35:05 EDT 2000