Software & Computing


General information

Meetings, meeting sessions and Reviews


Starting May 2020, after the S&C re-organization under the new STAR management, the S&C management team has a weekly meeting on Wednesdays from 12:00 to 13:00. The meeting currently takes place on BlueJeans. Link: https://bluejeans.com/727703210.

============ Before May 2020 ============
The S&C group holds weekly meetings on Wednesdays from 12:00 to 13:00 (noon to 1 PM) in building 510a, room 1-189 at BNL.
Additional regular meetings include
  • A Grid operation and activity meeting on Thursday, 13:00 to 14:00, building 510a, room 1-189 at BNL.
  • Before and during the run, a Friday "run preparation meeting" or "run status meeting" targeted toward the core team, the DAQ, Slow Control and Trigger groups, as well as the software sub-system coordinators.

Phone bridges are provided for these meetings and announced on the mailing lists.

Reviews

2011 - Sti, CA and Stv tracking component review

This page keeps information related to the 2011 tracking component review. The review will cover the state of the Cellular Automaton (CA) seed-finding component as well as the Virtual Monte-Carlo based tracker (Stv), and their relevance to STAR's future needs in terms of tracking capabilities.

Project goals

After a successful review of the ITTF/Sti tracker in 2004, the STAR collaboration approved the move to the new framework, which brought at the time unprecedented capabilities to the experiment and physics analysis. Sti allowed the STAR reconstruction approach to integrate other detector sub-systems into its tracking by providing methods to describe simple geometry models and to extrapolate tracks to the non-TPC detector planes, thereby correlating information across detector sub-systems. In 2005, STAR production switched to a Sti-based production and we have run in this mode ever since.

However, careful architectural considerations revealed a few areas where improvements seemed needed. Those are:

  • The need to maintain two different geometry models (one for reconstruction, one for simulation) increases the workforce load at a time when STAR is both active and ambitious in its future program and running thin on detector sub-system code developers. Beyond workforce considerations:
    • The two separate geometries have consequences for embedding and simulation, and hence for our ability to bring efficiency corrections to the next level of accuracy.
    • Material budgets were found to be ill-accounted for in reconstruction (dead material was not properly modeled in the Sti framework). The use of a common geometry model would have removed this issue.
  • Sti has some tracking restrictions - geometries made of planes and volumes perpendicular to the beam cannot be treated due to a technical choice (detector elements are organized in planes parallel to the beam, and sub-systems are assumed to be composed of elements replicated in phi). This would preclude tracking in detectors such as the FGT.
    • Our goal was to create an extended set of functionalities providing a truly complete integrated tracking approach, allowing the inclusion of hit information from other detectors (a key goal being the inclusion of detector hits placed in the forward direction).
  • The use of Monte-Carlo based propagators would allow better access to energy loss (Eloss), better predictors, and track swimming, allowing tracking in a non-constant B field (also not possible in Sti).

Additional considerations for the future of STAR were

  • A single yet flexible geometry model would allow STAR to be ready for GeantX (5 and TGeo based)
  • A flexible geometry model would allow STAR to better cope with the STAR-to-eSTAR migration (geometry morphing)
  • A revitalized framework would allow addressing long-standing issues of the event mode in simulation
    • While STAR has a FORtran-based approach allowing integration of some event generators, many newer generators are pure C++, making their integration into the STAR simulation framework difficult. A new generic model would allow a "plug-and-play" approach.
    • The use of non-perfect (misaligned) geometries has been lacking in the simulation framework and would be advisable
  • Novel algorithms have appeared in the community, leveraging and harvesting the power of multi-core and many-core architectures. Investigating speed and efficiency gains and evaluating the best use of resources is necessary for STAR's demanding physics program. Equally important, those new algorithms (Cellular Automaton being one) open the door to online tracking algorithms (GPU based).

 

Based on those considerations, several projects were launched and encouraged:

  • CA based tracking - the study of the CBM/ALICE Cellular Automaton algorithm for seed finding was launched in collaboration with our GSI colleagues. Multi-core aware, this simple algorithm is thought to provide speed gains in seed finding. Further work could spring from this evaluation (online HLT) if successful. The algorithm was shown to be portable to STAR, thanks to Yuri Fisyak and Ivan Kisel's team, and the product of this evaluation is now to be tested.
  • The VMC project - a three-part project (VMC tracking, VMC geometry, VMC simulation framework). The VMC geometry (a.k.a. AgML) has rapidly matured under the care of Jason Webb. The VMC tracking (a.k.a. Stv) has been developed by Victor Perevoztchikov and is thought to provide efficiency equal to Sti's (as well as implement all the features listed above).

We propose to review the AgML, CA and Stv components of this reshaping of our framework.

 

NB: Beyond the scope of this review, a key goal for VMC is to allow the inclusion of newer Geant versions and hence get ready to step away from Geant3 (barely maintainable), shed the FORtran baggage (Zebra and portability issues on 64-bit architectures), and remove the need for a special version of ROOT (root4star) hard-binding ROOT and STAR-specific non-dynamic runtime libraries.

 

Why a review?

  • All R&D projects are reviewed in STAR
    • The initial approach was to proceed with a "soft" PWG evaluation but (on second thought) that was not really an option …
    • An internal STAR review process should (and will) be established
  • Advantages
    • A review process provides strong and independent backing of the projects
    • A review process provides an independent set of guidance to management (S&C and PWG) on the path forward
    • Collaboration-wide scrutiny and endorsement across PWGs lessens the risks of finding problems later
  • Reminder: ITTF / Sti was not carried out without problems
    • The Sti review missed the UPC PWG's feedback - problems found a posteriori diverted attention and workforce to solving them
    • Problems are seen in HBT and fluctuation analyses when Run 4 is compared to Run 10
      • The HBT issues were not seen at the Sti evaluation - is it an analysis problem? Something else?
  • A review will also provide a good time to re-establish a solid baseline and get feedback from the PWGs on open issues, if any
    • This is even more important as STAR moves forward to a new set of detectors and high-precision measurements

 

 

Review charges

See attachment at the bottom of this page.

 

Review committee

Status:

  • 2011/08/12 Intent of a review brought to management (charges to be written).
                        The action item was to suggest a set of names for the committee.
  • 2011/08/18 Committee member suggestions provided at the management meeting. The Spokesperson decides he will contact the chair.
  • 2011/09/02 Charges sent to management for comments, along with a note that the charges may be long (the text is both for the committee and the reviewees). No feedback beyond the self-provided note.
  • 2011/10/07 Chair contacted - the process of selecting the committee is being worked out (Spokesperson or)
  • 2011/10/13 The Spokesperson delegates committee forming to the review Chair (Olga Evdokimov), S&C Leader (Jerome Lauret) and PAC (Xin Dong)
  • 2011/10/15 Committee assembled
  • 2011/10/31 Draft agenda made
  • 2011/11/01 Agenda presented and feedback requested
  • 2011/11/08 Final agenda crystallized

Members:

  • Olga Evdokimov (chair)
  • Claude Pruneau
  • Jim Thomas
  • Renee Fatemi                [EVO]
  • Aihong Tang
  • Thomas Ullrich              [EVO]
  • Jan Balewski                 [EVO]
  • Anselm Vossen

The agenda is ready and available on an access-restricted page.

Material

Below is a list of cross-references to other documents:

  • Meetings
    • (links to the meeting pages are access-restricted)
  • Data and simulation samples, tools, ... (guidance given on an access-restricted page)
    • Tools
    • Data selections
      • Location summary
      • Real data: (access-restricted page)
      • Simulations: (access-restricted page)
         
  • Nightly build (AutoBuild)

2021 TPC calibration review


 

Data readiness

The pages here relate to the data readiness sub-group of the S&C team. This area comprises calibration, database, and quality assurance.

Please consult the relevant page (access-restricted) for a responsibility chart.

Calibration

STAR Calibrations

In addition to the child pages listed below:

 

Calibration Run preparation

This page is meant to organize information about the calibration preparations and steps required for each subdetector for upcoming runs.

Run 18 Calibration Datasets

 Below are the calibration dataset considerations for isobar needs in Run 18:
  • EMC
    • emc-check runs once per fill (at the start)
      • For the purpose of EMC pedestals and status tables
      • 50k events
  • TOF
    • VPD
      • Resolution needs to be calibrated (for the trigger) and confirmed to be similar for both species
  • TPC
    • Laser runs
      • For the purpose of calibrating TPC drift velocities
      • Every few hours (as usual)
      • Either dedicated laser runs, or included in physics runs
      • ~5000 laser events
    • GMT inclusion in a fraction of events
      • For the purpose of understanding fluctuations in TPC SpaceCharge
      • The trigger choice is not very important, as long as it includes TPX and is present during most physics running
      • Something around 100-200 Hz is sufficient (too much will cause dead time issues for the trigger)
    • Vernier scans
      • For the purpose of understanding backgrounds in the TPC that may be different between species
      • Once for each species under typical operating conditions
      • 4 incremental steps of collision luminosity, each step ~1 minute long and on the order of ~50k events (total = ~4 minutes)
      • TPX, BEMC, BTOF must be included
      • Minimum bias trigger with no (or wide) vertex z cut
    • Low luminosity fill IF a TPC shorted field cage ring occurs
      • Only one species needed (unimportant which)
      • Minimum bias trigger with no (or wide) vertex z cut
      • ~1M events
      • ZDC coincidence rates below 3 kHz
    • Old ALTRO thresholds run
      • For the purpose of TPC dE/dx understanding
      • Only one species needed (unimportant which)
      • ~2M events
      • Could be at the end of a fill, or any other convenient time during typical operations
    • Magnetic field flipped to A ("forward") polarity before AuAu27 data
      • For the purpose of acquiring sufficient cosmic ray data with both magnetic fields to understand alignment of new iTPC sector


-Gene

Run preparation Run VI (2006)

This page is meant to organize information about the calibration preparations and steps required for each subdetector for upcoming runs. Items with an asterisk (*) need to be completed in advance of data.

For the run in winter 2006, the plan is to take primarily pp data. This may lead to different requirements than in the past.

TPC
  • * Code development for 1 Hz scalers.
  • * Testing of T0/Twist code with pp data.
  • * Survey alignment information into DB.
  • Drift velocity from lasers into DB (automated).
  • T0 calibration as soon as data starts.
  • Twist calibration as soon as data starts (for each field).
  • SpaceCharge and GridLeak (SpaceCharge and GridLeak Calibration How-To Guide) as soon as T0/Twist calibrated (for each field).
  • Would like a field flip to identify origins of offsets in SpaceCharge vs. scalers.
  • dEdx: final calibration after run ends by sampling over the whole run period, but initial calibration can be done once TPC momentum is well-calibrated (after distortion corrections).
FTPC
  • HV calibration performed at start of data-taking.
  • Temperatures into DB (automated)
  • Rotation alignment of each FTPC done for each field (needs calibrated vertex from TPC[+SVT][+SSD]). There was concern about doing this calibration with the pp data - status???
SVT
  • Steps for pp should be the same as heavy ion runs, but more data necessary.
  • Self-alignment (to be completed January 2006)
  • Requires well-calibrated TPC.
  • Requires a few million events (a week into the run?)
  • Would like field flip to discriminate between SVT/SSD alignment vs. TPC distortions.
SSD
  • * Code development in progress (status???)
  • Requires well-calibrated TPC.
EEMC
  • EEMC initial setting & monitoring during data taking relies on prompt and fully automatic muDst EzTree production for all minbias-only fast-only runs. Assume fast offline muDsts exist on disk for 1 week.
  • Initial settings for HV: 500k minbias fast-triggered events will give the slopes necessary to adjust relative gains. Same 60 GeV E_T maximum scale as in previous years.
  • Pedestals from 5k minbias events, once per fill.
  • Stability of towers from one 200k minbias run per fill.
  • Highly prescaled minbias and zerobias events in every run for "general monitoring" (e.g. correlated pedestal shift)
  • Offline calibrations unnecessary for production (can be done on the MuDst level).
  • "Basic" offline calibration from MIP response in 5M minbias fast events (taken a few weeks into the run)
  • "Final" offline calibration from pi0 (or other TBD) signal requires "huge" statistics of EHT and EJP triggers to do tower-by-tower (will need the full final dataset).
  • * Calibration codes exist, but are scattered. Need to be put in CVS with consistent paths and filenames.
BEMC
  • * LED runs to set HV for 60 GeV E_T maximum, changed from previous years (status???).
  • Online HV calibration from 300k minbias events (eta ring by eta ring) - requires "full TPC reconstruction".
  • MuDsts from calibration runs feedback to run settings (status???).
  • Pedestals from low multiplicity events from the event pool every 24 hours.
  • Offline calibrations unnecessary for production (can be done on the MuDst level).
  • "Final" offline tower-by-tower calibration from MIPs and electrons using several million events.
TOF
  • upVPD upgrade (coming in March 2006) for better efficiency and start resolution
  • Need TPC momentum calibrated first.
  • Requires several million events, less with upVPD (wait for upVPD???)

Run preparation Run VII

There is some question as to whether certain tasks need to be done this year because the detector was not moved during the shutdown period. Omitting such tasks should be justified before skipping!

TPC
  • Survey alignment information into DB (appears to be no survey data for TPC this year)
  • High stats check of laser drift velocity calibration once there's gas in the TPC: 30k laser events with and without B field.
  • Check of reversed GG cable on sector 8 (lasers) once there's gas in the TPC: same laser run as above
  • Drift velocity from laser runs (laser runs taken once every 3-4 hours, ~2-3k events, entire run) into DB (automated); check that it's working
  • T0 calibration, new method from laser runs (same laser runs as above).
  • Twist calibration as soon as data starts: ~100k events, preferably central/high multiplicity, near the start of running for each field
  • SpaceCharge and GridLeak (SpaceCharge and GridLeak Calibration How-To Guide) as soon as T0/Twist calibrated: ~100k events from various luminosities, for each field.
  • BBC scaler study for correlation with azimuthally asymmetric backgrounds in TPC: needs several days of generic data.
  • Zerobias run with fast detectors to study short time scale fluctuations in luminosity (relevant for ionization distortions): needs a couple minutes of sequential, high rate events taken at any time.
  • Need a field flip to identify origins of offsets in SpaceCharge vs. scalers as well as disentangling TPC distortions from silicon alignment.
  • dEdx: final calibration after run ends by sampling over the whole run period, but initial calibration can be done once TPC momentum is well-calibrated (after distortion corrections).
FTPC
  • HV calibration performed at start of data-taking (special HV runs).
  • Temperatures into DB (automated)
  • Rotation alignment of each FTPC done for each field (needs calibrated vertex from TPC[+SVT][+SSD]): generic collision data
SSD
  • Pulser runs (for initial gains and alive status?)
  • Initial alignment can be done using roughly-calibrated TPC: ~100k minbias events.
  • P/N Gain-matching (any special run requirements?)
  • Alignment, needs fully-calibrated TPC: 250k minbias events from one low luminosity (low TPC distortions/background/occupancy/pile-up) fill, for each field, +/-30cm vertex constraint; collision rate preferably below 1 kHz.
SVT
  • Temp oscillation check with lasers: generic data once triggers are set.
  • Initial alignment can be done using roughly-calibrated TPC+SSD: ~100k minbias events.
  • Alignment, needs fully-calibrated TPC: 250k minbias events from one low luminosity fill (see SSD).
  • End-point T0 + drift velocity, needs fully-calibrated SSD+TPC: same low luminosity runs for initial values, watched during rest of run.
  • Gains: same low luminosity runs.
EEMC
  • Timing scan of all crates: a few hours of beam time, ~6 minbias-fast runs (5 minutes each) for the TCD phase of all tower crates, another 6 minbias runs for the timing of the MAPMT crates, 2 days analysis
  • EEMC initial setting & monitoring during data taking: requests to process specific data will be made as needed during the run.
  • Initial settings for HV: 200k minbias fast-triggered events will give the slopes necessary to adjust relative gains, 2 days analysis
  • Pedestals (for offline DB) from 5k minbias events, once per 5 hours
  • Stability of towers from one 200k minbias-fast run per fill
  • "General monitoring" (e.g. correlated pedestal shift) from highly prescaled minbias and zerobias events in every run.
  • Beam background monitoring from highly prescaled EEMC-triggered events with TPC at the beginning of each fill.
  • Expect commissioning of triggers using EMC after one week of collisions
  • Offline calibrations unnecessary for production (can be done on the MuDst level).
  • "Basic" offline calibration from MIP response in 5M minbias fast events taken a few weeks into the run
  • "Final" offline calibration from pi0 (or other TBD) signal requires "huge" statistics of EHT and EJP triggers to do tower-by-tower (still undone for previous runs).
  • Calibration codes exist, but are scattered. Need to be put in CVS with consistent paths and filenames (status?)
BEMC
  • Timing scan of all crates
  • Online HV calibration of towers - do for outliers and new PMTs/crates; needed for EMC triggering. Needs ~5 minutes of minbias fast-triggered events (eta ring by eta ring) at the beginning of running (once a day for a few days) - same runs as for EEMC.
  • Online HV calibration of preshower - matching slopes. Not done before, will piggyback off other datasets.
  • Pedestals from low multiplicity events from the event pool every 24 hours.
  • Offline calibrations unnecessary for production (can be done on the MuDst level).
  • "Final" offline tower-by-tower calibration from MIPs and electrons using several million events
upVPD/TOF
  • upVPD (calibration?)
  • No TOF this year.

Run preparation Run VIII (2008)

This page is meant to organize information about the calibration preparations and steps required for each subdetector for upcoming runs.

Previous runs:

 


Red means that nothing has been done for this (yet), or that this needs to continue through the run.
Blue means that the data has been taken, but the calibration task is not completed yet.
Black indicates nothing (more) remaining to be done for this task.

 

TPC
  • Survey alignment information into DB (appears to be no survey data for TPC this year)
  • Drift velocity from laser runs (laser runs taken once every 3-4 hours, ~2-3k events, entire run) into DB (automated); check that it's working
  • T0 calibration, using vertex-matching (~500k events, preferably high multiplicity, once triggers are in place).
  • Twist calibration as soon as data starts: same data
  • SpaceCharge and GridLeak (SpaceCharge and GridLeak Calibration How-To Guide) as soon as T0/Twist calibrated: ~500k events from various luminosities, for each field.
  • BBC scaler study for correlation with azimuthally asymmetric backgrounds in TPC: needs several days of generic data.
  • Zerobias run with fast detectors to study short time scale fluctuations in luminosity (relevant for ionization distortions): needs a couple minutes of sequential, high rate events taken at any time.
  • dEdx: final calibration after run ends by sampling over the whole run period, but initial calibration can be done once TPC momentum is well-calibrated (after distortion corrections).
TPX
???
FTPC
  • HV calibration performed at start of data-taking (special HV runs).
  • Temperatures into DB (automated)
  • Rotation alignment of each FTPC done for each field (needs calibrated vertex from TPC[+SVT][+SSD]): generic collision data
EEMC
  • Timing scan of all crates: a few hours of beam time, ~6 fast runs (5 minutes each) for the TCD phase of all tower crates
  • EEMC initial setting & monitoring during data taking: requests to process specific data will be made as needed during the run.
  • Initial settings for HV: 500k minbias fast-triggered events will give the slopes necessary to adjust relative gains, 2 days analysis
  • Pedestals (for offline DB) from 5k minbias events, once per fill
  • "General monitoring" (e.g. correlated pedestal shift) from highly prescaled minbias and zerobias events in every run.
  • Beam background monitoring from highly prescaled EEMC-triggered events with TPC at the beginning of each fill.
  • Expect commissioning of triggers using EMC after one week of collisions
  • Offline calibrations unnecessary for production (can be done on the MuDst level).
  • "Basic" offline calibration from MIP response in 5M minbias fast events taken a few weeks into the run
  • "Final" offline calibration from pi0 (or other TBD) signal requires "huge" statistics of EHT and EJP triggers to do tower-by-tower.
  • Calibration codes exist, but are scattered. Need to be put in CVS with consistent paths and filenames (status?)
BEMC
  • Timing scan of all crates
  • Online HV calibration of towers - do for outliers and new PMTs/crates; needed for EMC triggering. Needs ~5 minutes of minbias fast-triggered events (eta ring by eta ring) at the beginning of running (once a day for a few days) - same runs as for EEMC.
  • Online HV calibration of preshower - matching slopes. (same data).
  • Pedestals from low multiplicity events from the event pool every 24 hours.
  • Offline calibrations unnecessary for production (can be done on the MuDst level).
  • "Final" offline tower-by-tower calibration from MIPs and electrons using several million events
PMD
  • Hot Cells (~100k generic events from every few days?)
  • Cell-by-cell gains (same data)
  • SM-by-SM gains (same data)
VPD/TOF
  • T-TOT (~3M TOF-triggered events)
  • T-Z (same data)

Calibration Schedules

The listed dates should be considered deadlines for production readiness. Known issues with any of the calibrations by the deadline should be well-documented by the subsystems.
  •  Run 15: projected STAR physics operations end date of 2015-06-19 (CAD) 2015-06-22
    1. pp200
      • tracking: 2015-07-22
      • all: 2015-08-22
    2. pAu200
      • tracking: 2015-08-22
      • all: 2015-09-05
    3. pAl200
      • tracking: 2015-09-22
      • all: 2015-10-06

Calibration topics by dataset

The focus here will be on topics of note, by dataset.

Run 12 CuAu200

Regarding the P14ia Preview Production of Run 12 CuAu200 from early 2014:

A check-list of observables and points to consider to help understand analyses' sensitivity to non-final calibrations.

To unambiguously see issues due to mis-calibration of the TPC, stringent selection of triggered-event tracks is necessary. Pile-up tracks are expected to be incorrect in many ways, and they constitute a larger and larger fraction of TPC tracks as luminosity grows, so their inclusion can lead to luminosity-dependent effects that appear to be mis-calibrations but are not.

  1. TPC dE/dx PID (not calibrated)
    1. differences between real data dE/dx peaks' means and widths vs. the nsigma provided in the MuDst
    2. variability of these differences with time
  2. TPC alignment (old one used)
    1. sector-by-sector variations in charge-separated signed DCAs and momentum spectra (e.g. h-/h+ pT spectra) that are time- and luminosity-independent
    2. differences in charge-separated invariant masses from expectations that are time- and luminosity-independent
    3. any momentum effects (including invariant masses) grow with momentum: delta(pT) is proportional to q*pT^2
      1. alternatively, and perhaps more directly, delta(q/pT) effects are constant, and one could look at charge-separated 1/pT from sector to sector
  3. TPC SpaceCharge & GridLeak (preliminary calibration)
    1. sector-by-sector variations in charge-separated DCAs and momentum spectra that are luminosity-dependent
    2. possible track splitting between TPC pad rows 13 and 14 + possible track splitting at z=0, in the radial and/or azimuthal direction
    3. differences in charge-separated invariant masses from expectations that are luminosity-dependent
    4. any momentum effects (including invariant masses) grow with momentum: delta(pT) is proportional to q*pT^2 (see the note after this list)
      1. alternatively, and perhaps more directly, delta(q/pT) effects are constant, and one could look at charge-separated 1/pT from sector to sector
  4. TPC T0 & drift velocities (preliminary calibration)
    1. track splitting at z=0, in the z direction
    2. splitting of primary vertices into two close vertices (and subsequent irregularities in primary track event-wise distributions)
  5. TOF PID (VPD not calibrated, BTOF calibration from Run 12 UU used)
    [particularly difficult to disentangle from TPC calibration issues]
    1. not expected to be much of an issue, as the "startless" mode of using BTOF was forced (no VPD) and the calibration used for BTOF is expected to be reasonable
    2. broadening of, and differences in mass^2 peak positions from expectations are more likely due to TPC issues (particularly if charge-dependent, as BTOF mis-calibrations should see no charge sign dependence)
    3. while TOF results may not be the best place to identify TPC issues, it is worth noting that BTOF-matching is beneficial to removing pile-up tracks from studies
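
A short aside on the q*pT^2 statements in items 2 and 3 (standard track-model reasoning, not from the original page): a track fit in a solenoidal field measures curvature, κ ∝ q/pT. A static distortion displaces hits by a roughly fixed amount and therefore biases the fitted curvature by a roughly constant δκ, i.e. δ(q/pT) = constant. Since pT ∝ q/κ,

δ(pT) = -(pT²/q) δ(q/pT)  ∝  q pT²

which is why such effects grow quadratically with momentum and flip sign with charge, and why charge-separated 1/pT, sector by sector, probes the distortion more directly.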

Docs

 Miscellaneous calibration-related documents

Intrinsic resolution in a tracking element

Foreword: This has probably been worked out in a textbook somewhere, but I wanted to write it down for my own sake. This is a re-write (hopefully clearer, with slightly better notation) of Appendix A of my PhD thesis (I don't think it was well-written there)...

-Gene

_______________________

Let's establish a few quantities:
  • E_intr : error on the measurement by the element in question
    • σ_intr² = ⟨E_intr²⟩ : the intrinsic resolution of the element, relating it to an ensemble of measurement errors
  • E_proj : error on the track projection to that element (excluding the element from the track fit)
    • σ_proj² = ⟨E_proj²⟩ : the resolution of the track projection to the element, relating it to an ensemble of projection errors
  • E_track : error on the track fit at the element, including the element in the fit
  • R_incl = E_intr - E_track : residual difference between the measurement and the inclusive track fit
    • σ_incl² = ⟨(E_intr - E_track)²⟩ : resolution from the inclusive residuals
  • R_excl = E_intr - E_proj : residual difference between the measurement and the exclusive track fit
    • σ_excl² = ⟨(E_intr - E_proj)²⟩ : resolution from the exclusive residuals
Let us further assume that the projection from the track fit excluding the element, E_proj, is uncorrelated with the intrinsic error of the measurement from the element: ⟨E_proj E_intr⟩ = 0. This implies that we can write:

σ_excl² = ⟨E_intr²⟩ + ⟨E_proj²⟩ = σ_intr² + σ_proj²

Our goal is to determine σ_intr given that we can only observe σ_incl and σ_excl.

To that end, we utilize a guess, σ'_intr, and write down a reasonable estimate of E_track using a weighted average of E_intr and E_proj, where the weights are w_proj = 1/σ_proj² and w_intr = 1/σ'_intr²:

E_track = [(w_intr E_intr) + (w_proj E_proj)] / (w_intr + w_proj)
 = [(E_intr / σ'_intr²) + (E_proj / σ_proj²)] / [(1/σ'_intr²) + (1/σ_proj²)]
 = [(σ_proj² E_intr) + (σ'_intr² E_proj)] / (σ'_intr² + σ_proj²)

Substituting this, we find...

σ_incl² = ⟨(E_intr - E_track)²⟩
 = ⟨E_intr²⟩ - 2 ⟨E_intr E_track⟩ + ⟨E_track²⟩
 = σ_intr² - 2 ⟨E_intr {[(σ_proj² E_intr) + (σ'_intr² E_proj)] / (σ'_intr² + σ_proj²)}⟩ + ⟨{[(σ_proj² E_intr) + (σ'_intr² E_proj)] / (σ'_intr² + σ_proj²)}²⟩

Dropping terms of ⟨E_intr E_proj⟩, replacing terms of ⟨E_proj²⟩ and ⟨E_intr²⟩ with σ_proj² and σ_intr² respectively, and multiplying through such that all terms on the right-hand side of the equation have the denominator (σ'_intr² + σ_proj²)², we find

σ_incl² = [(σ_intr² σ'_intr⁴) + (2 σ_intr² σ'_intr² σ_proj²) + (σ_intr² σ_proj⁴) - (2 σ_intr² σ'_intr² σ_proj²) - (2 σ_intr² σ_proj⁴) + (σ'_intr⁴ σ_proj²) + (σ_intr² σ_proj⁴)] / (σ'_intr² + σ_proj²)²
 = (σ_intr² σ'_intr⁴ + σ'_intr⁴ σ_proj²) / (σ'_intr² + σ_proj²)²
 = σ'_intr⁴ (σ_intr² + σ_proj²) / (σ'_intr² + σ_proj²)²

We can substitute for σ_proj² using σ_excl² = σ_intr² + σ_proj²:

σ_incl² = σ'_intr⁴ σ_excl² / (σ'_intr² + σ_excl² - σ_intr²)²
σ_incl = σ'_intr² σ_excl / (σ'_intr² + σ_excl² - σ_intr²)

And solving for σ_intr² we find:

σ_intr² = σ'_intr² + σ_excl² - (σ'_intr² σ_excl / σ_incl)
σ_intr = √{ σ_excl² - σ'_intr² [(σ_excl / σ_incl) - 1] }

This is an estimator of σ_intr. Ideally, σ_intr and σ'_intr should be the same. One can iterate a few times, starting with a good guess for σ'_intr and then replacing it in later iterations with the σ_intr found from the previous iteration, until the two are approximately equal.
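
Below is a minimal sketch of that iteration in C++ (not from the original note; the function name, tolerance, and input values are illustrative). The simple fixed-point update converges when σ_excl/σ_incl < 2; otherwise a damped update or a better starting guess is needed.

    #include <cmath>
    #include <cstdio>

    // Iterate sigma_intr = sqrt( sigma_excl^2 - sigma'_intr^2 * (sigma_excl/sigma_incl - 1) ),
    // feeding each estimate back in as the next guess sigma'_intr.
    double estimateIntrinsicResolution(double sigIncl, double sigExcl, double guess) {
      double sigIntr = guess;
      for (int i = 0; i < 50; ++i) {
        double s2 = sigExcl * sigExcl - sigIntr * sigIntr * (sigExcl / sigIncl - 1.0);
        if (s2 <= 0) break;                                // bad inputs or guess: stop before sqrt of a negative
        double next = std::sqrt(s2);
        if (std::fabs(next - sigIntr) < 1e-9) return next; // converged
        sigIntr = next;
      }
      return sigIntr;
    }

    int main() {
      // Purely illustrative residual widths (e.g. in cm): sigma_incl = 0.040, sigma_excl = 0.060.
      double sigIntr = estimateIntrinsicResolution(0.040, 0.060, 0.050);
      std::printf("estimated intrinsic resolution: %g\n", sigIntr);
      return 0;
    }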

_______

-Gene

SVT Calibrations

SVT Self-Alignment

Using tracks fit with SVT points plus a primary vertex alone, we can self-align the SVT using residuals to the fits. This document explains how this can be done; the method only works when the SVT is already rather well aligned and only small-scale alignment calibration remains. The technique explained herein also allows for calibration of the hybrid drift velocities.

TPC Calibrations

TPC Calibration & Data-Readiness Tasks:

Notes:
* "Run", with a capital 'R', refers to a year's Run period, e.g. Run 10)
* Not all people who have worked on various tasks are listed as they were recalled only from (faulty) memory and only primary persons are shown. Corrections and additions are welcome.

  1. TPC Survey (access-restricted page)
    • Should be done each time the TPC may be moved (e.g. STAR is rolled out and back into the WAH) (not done in years)
    • Must be done before magnet endcaps are moved in
    • Past workers: J. Castillo (Runs 3,4), E. Hjort (Runs 5,6), Y. Fisyak (Run 14)
  2. TPC Pad Pedestals
    • Necessary for online cluster finding and zero suppression
    • Uses turned down anode HV
    • Performed by DAQ group frequently during Run
    • Past workers: A. Ljubicic
  3. TPC Pad Relative Gains & Relative T0s
    • Necessary for online cluster finding and zero suppression
    • Uses pulsers
    • Performed by DAQ group occasionally during Run
    • Past workers: A. Ljubicic
  4. TPC Dead RDOs
    • Influences track reconstruction
    • Monitored by DAQ
    • Past workers: A. Ljubicic
  5. TPC Drift Velocity (access-restricted page)
    • Monitor continually during Run
    • Currently calibrated from laser runs and uploaded to the DB automatically
    • Past workers: J. Castillo (Runs 3,4), E. Hjort (Runs 5,6), A. Rose (Run 7), V. Dzhordzhadze (Run 8), S. Shi (Run 9), M. Naglis (Run 10), G. Van Buren (Run 12)
  6. TPC Anode HV
    • Trips should be recorded to avoid during reconstruction
    • Dead regions may influence track reconstruction
    • Reduced voltage will influence dE/dx
    • Dead/reduced voltage near inner/outer boundary will affect GridLeak
    • Past workers: G. Van Buren (Run 9)
  7. TPC Floating Gated Grid Wires
    • Wires no longer connected to voltage cause minor GridLeak-like distortions at one GG polarity
    • Currently known to be two wires in Sector 3 (seen in reversed polarity), and two wires in sector 8 (corrected with reversed polarity)
    • Past workers: G. Van Buren (Run 5)
  8. TPC T0s
    • Potentially: global, sector, padrow
    • Could be different for different triggers
    • Once per Run
    • Past workers: J. Castillo (Runs 3,4), Eric Hjort (Runs 5,6), G. Webb (Run 9), M. Naglis (Runs 10,11,12), Y. Fisyak (Run 14)
  9. (access-restricted task page)
    • Dependence of reconstructed time on pulse height seen only in Run 9 so far (un-shaped pulses)
    • Past workers: G. Van Buren (Run 9)
  10. TPC Field Cage Shorts (access-restricted page)
    • Known shorts can be automatically monitored and uploaded to the DB during Run
    • New shorts need determination of location and magnitude, and may require low luminosity data
    • Past workers: G. Van Buren (Runs 4,5,6,7,8,9,10)
  11. TPC Sector Alignment (access-restricted page)
    • Two parts: Inner/Outer Alignment, and Super-Sector Alignment
    • Requires low luminosity data
    • In recent years, done at least once per Run, and per magnetic field setting (perhaps not necessary)
    • Past workers: B. Choi (Run 1), H. Qiu (Run 7), G. Van Buren (Run 8), G. Webb (Run 9), L. Na (Run 10), Y. Fisyak (Run 14)
  12. TPC Clocking (Rotation of east half with respect to west)
    • Best done with low luminosity data
    • Calibration believed to be infrequently needed (not done in years)
    • Past workers: J. Castillo (Runs 3,4), Y. Fisyak (Run 14)
  13. TPC IFC Shift
    • Best done with low luminosity data
    • Calibration believed to be infrequently needed (not done in years)
    • Past workers: J. Dunlop (Runs 1,2), J. Castillo (Runs 3,4), Y. Fisyak (Run 14)
  14. TPC Twist (ExB) (Fine Global Alignment)
    • Best done with low luminosity data
    • At least once per Run, and per magnetic field setting
    • Past workers: J. Castillo (Runs 3,4), E. Hjort (Runs 5,6), A. Rose (Runs 7,8,9), Z. Ahammed (Run 10), R. Negrao (Run 11), M. Naglis (Run 12), J. Campbell (Runs 13,14), Y. Fisyak (Run 14)
  15. (access-restricted task page)
    • Done with low (no) luminosity data from known distortions
    • Done once (but could benefit from checking again)
    • Past workers: G. Van Buren (Run 4), M. Mustafa (Run 4 repeated)
  16. TPC SpaceCharge & GridLeak (access-restricted page)
    • At least once per Run, and per beam energy & species, and per magnetic field setting
    • Past workers: J. Dunlop (Runs 1,2), G. Van Buren (Runs 4,5,6,7,8-dAu,12-pp500), H. Qiu (Run 8-pp), J. Seele (Run 9-pp500), G. Webb (Run 9-pp200), J. Zhao (Run 10), A. Davila (Runs 11,12-UU192,pp200,pp500), D. Garand (Run 12-CuAu200), M Vanderbroucke (Run 13), M. Posik (Run 12-UU192 with new alignment), P. Federic (Run 12-CuAu200 R&D)
  17. TPC dE/dx (access-restricted page)
    • Once per Run, and per beam energy & species, and per magnetic field setting
    • Past workers: Y. Fisyak (Runs 1,2,3,4,5,6,7,9), P. Fachini (Run 8), L. Xue (Run 10), Y. Guo (Run 11), M. Skoby (Runs 9-pp200,12-pp200,pp500), R. Haque (Run 12-UU192,CuAu200)
  18. TPC Hit Errors
    • Once per Run, and per beam energy & species, and per magnetic field setting
    • Past workers: V. Perev (Runs 5,7,9), M. Naglis (Runs 9-pp200,10,11,12-UU193), R. Witt (Run 12-pp200,pp500, Run13)
  19. TPC Padrow 13 and Padrow 40 static distortions
    • Once ever
    • Past workers: Bum Choi (Run 1), J. Thomas (Runs 1,18,19), G. Van Buren (Runs 18,19), I. Chakaberia (Runs 18,19)

To better understand the effect of distortions on momentum measurements in the TPC, the attached sagitta.pdf file shows the relationship between a track's sagitta and its transverse momentum.
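
For reference, the standard sagitta relations (not taken from the attachment): a track with transverse momentum pT (GeV/c) in a solenoidal field B (T) follows a helix of radius R = pT / (0.3 B) (in meters), and over a chord of length L its sagitta is

s = R [1 - cos(L/2R)] ≈ L² / (8R) = 0.3 B L² / (8 pT)

Since s ∝ 1/pT, a distortion that changes the apparent sagitta by δs produces a relative momentum bias |δpT/pT| ≈ |δs/s|.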

Miscellaneous TPC calibration notes

The log file for automated drift velocity calculations is at ~starreco/AutoCalib.log.

Log files for fast offline production are at /star/rcf/prodlog/dev/log/daq.

The CVS area for TPC calibration related scripts, macros, etc., is StRoot/macros/calib.

Padrow 13 and Padrow 40 static distortions

These two distortion corrections focus on static non-uniformities in the electric field in the gating grid region of the TPC, radially between the inner and outer sectors. This region has an absence of gating grid wires where the subsector structures meet, allowing some bending of the equipotential lines, creating radial electric field components. In both cases, the region of distorted fields is quite localized near the endcap and only over a small radial extent, but this then distorts all drifting electrons (essentially from all drift distances equally) in that small radial region, affecting signal only for a few padrows on each side of the inner/outer boundary.

Padrow 13 Distortion
For the original TPC structure, there was simply nothing in the gap between the inner and outer sectors. More about the gap structure can be learned by looking at some of the GridLeak documentation (which describes a different distortion, in that it is dynamic with the ionization in the TPC). This static distortion was observed in the early operation of STAR when luminosities were low, well before the GridLeak distortion was ever observed. It was modeled as an offset potential on a small strip of the TPC drift field boundary, with the offset equal to a scale factor times the gating grid voltage, where the scale factor was calibrated from fits to the data. Below are the related distortion correction maps.

Padrow 40 Distortion
With the iTPC sectors, a wall was placed at the outer edge of the new inner sectors with conductive stripes on it to express potentials primarily to suppress the GridLeak distortion. It was understood that this would modify the static distortion, and in the maps below it is apparent that the static distortions became a few times larger with the wall, but in both cases only at the level of ~100 μm or less at the outermost padrow of the inner sectors. It is also worth noting that the sign of the distortion flipped with respect to Padrow 13.

This distortion correction is determined from the sum of 3 maps that represent the distortion contributions from (1) the wall's grounded structure, (2) the potential at the wall's tip [nominally -115 V], and (3) the potential on the wall's outer side [nominally -450 V]. Plots on this page were made using the nominal potentials.

Distortion correction maps:
The maps are essentially uniform in z, so the maps shown below focus only on the radial and azimuthal dependencies. Further, there is no distortion more than a few centimeters away from the inner/outer boundaries, and there is essentially a 12-fold symmetry azimuthally (though this is not strictly true if different wall voltages are used on different iTPC sectors), so the maps zoom in on a limited region of r and φ to offer a better view of the fine details. Also, the inner and outer edges of the nearest padrows are shown as thin black lines on some plots to help understand their proximity to the distortions.

Open each map in a new web browser tab or window to see higher resolution. Units are [cm] and [radian].

[Distortion correction maps shown here: Δ(r-φ) vs. r and φ, and Δ(r) vs. r and φ, for Padrow 13 and for Padrow 40; some versions include thin lines indicating the locations of rows 13/14 and 40/41.]


-Gene

RunXI dE/dx calibration recipe

This is a recipe for the Run XI dE/dx calibration, by Yi Guo.

TPC T0s

Global T0:

Twist (ExB) Distortion

 Run 12 calibration

ExB (twist) calibration procedure

In 2012, the procedure documentation was updated, including global T0 calibration:

Below are the older instructions.

________________
The procedure here is basically to calculate two beamlines, one using only west TPC data and one using only east TPC data, independently, and then adjust the XTWIST and YTWIST parameters so that the east and west beamlines meet at z=0. The calibration needs to be done every run for each B field configuration. The obtained parameters are stored in the tpcGlobalPosition table with four different flavors: FullmagFNegative, FullMagFPositive, HalfMagFPositive and HalfMagFNegative.

To calculate the beamline intercept the refitting code (originally written by Jamie Dunlop) is used. An older evr-based version used by Javier Castillo for the 2005 heavy ion run can be found at ~startpc/tpcwrkExB_2005, and a version used for the 2006 pp run that uses the minuit vertex finder can be found at ~hjort/tpcwrkExB_2006. Note that for the evr-based version the value of the B field is hard coded at line 578 of pams/global/evr/evr_am.F. All macros referred to below can be found under both of the tpcwrkExB_200X directories referred to above, and some of them under ~hjort have been extensively rewritten.

Step-by-step outline of the procedure:

1. If using evr set the correct B field and compile.
2. Use the "make_runs.pl" script to prepare your dataset. It will create links to fast offline event.root files in your runsXXX subdirectory (create it first, along with outdirXXX). The script will look for files that were previously processed in the outdirXXX file and skip over them.
3. Use the "submit.pl" script to submit your jobs. It has advanced options but the standard usage is "submit.pl rc runsXXX outdirXXX" where "rc" indicates to use the code for reconstructed real events. The jobs will create .refitter.root files in your outdirXXX subdirectory.
4. Next you create a file that lists all of the .refitter.root files. A command something like this should do it: "ls outdirFF6094 | grep refitter | awk '{print "outdirFF6094/" $1}' > outdirFF6094/root.files"
5. Next you run the make_res.C macro (in StRoot/macros). Note that the input and output files are hard coded in this macro. This will create a histos.root file.
6. Finally you run plot_vtx.C (in StRoot/macros) which will create plots showing your beamline intercepts. Note that under ~hjort/tpcwrkExB_2006 there is also a macro called plot_diff.C which can be used to measure the offset between the east/west beams more directly (useful for pp where data isn't as good).

Once you have made a good measurement of the offsets an iterative procedure is used to find the XTWIST and YTWIST that will make the offset zero:

7. In StRoot/StDbUtilities/StMagUtilities.cxx change the XTWIST and YTWIST parameters to what was used to process the files you analyzed in steps 1-6, and then compile.
8. Run the macro fitDCA2new.C (in StRoot/macros). Jim Thomas produces this macro and you might want to consult with him to see if he has a newer, better version. An up-to-date version as of early 2006 is under ~hjort/tpcwrkExB_2006. When you run this macro it will first ask for a B field and the correction mode, which is 0x20 for this correction. Then it will ask for pt, rapidity, charge and Z0 position. Only Z0 position is really important for our purposes here and typical values to use would be "1.0 0.1 1 0.001". The code will then report the VertexX and VertexY coordinates, which we will call VertexX0 and VertexY0 in the following steps.
9. If we now take VertexX0 and VertexY0 and our measured beamline offsets we can calculate the values for VertexX and VertexY that we want to obtain when we run fitDCA2new.C - call them VertexX_target and VertexY_target:

VertexX_target = (West_interceptX - East_interceptX)/2 + VertexX0
VertexY_target = (West_interceptY - East_interceptY)/2 + VertexY0

The game now is to modify XTWIST and YTWIST in StMagUtilities, recompile, rerun fitDCA2new.C and obtain values for VertexX and VertexY that match VertexX_target and VertexY_target (within 10 microns for heavy ion runs in the past).
10. Once you have found XTWIST and YTWIST parameters you are happy with, they can be entered into the db table tpcGlobalPosition as PhiXZ and PhiYZ.

However - IMPORTANT NOTE: XTWIST = 1000 * PhiXZ , but YTWIST = -1000 * PhiYZ.

NOTE THE MINUS SIGN!! What is stored in the database is PhiXZ and PhiYZ. But XTWIST and YTWIST are what are printed in the log files.


Enter the values into the db using AddGlobalPosition.C and a file like tpcGlobalPosition*.C. To check the correction you either need to use files processed in fast offline with your new XTWIST and YTWIST values or request (re)processing of files.
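
The arithmetic in steps 9-10 and the sign convention above, collected into a small self-contained C++ sketch (all variable names and numeric values are illustrative, not from any STAR macro):

    #include <cstdio>

    int main() {
      // Measured east/west beamline intercepts and the fitDCA2new.C reference
      // vertex (VertexX0, VertexY0); the numbers are made up for illustration.
      double westInterceptX = 0.250, eastInterceptX = 0.210;
      double westInterceptY = -0.180, eastInterceptY = -0.150;
      double vertexX0 = 0.0, vertexY0 = 0.0;

      // Step 9: targets to reproduce with fitDCA2new.C after adjusting the twists.
      double vertexX_target = (westInterceptX - eastInterceptX) / 2.0 + vertexX0;
      double vertexY_target = (westInterceptY - eastInterceptY) / 2.0 + vertexY0;

      // Step 10: conversion from the final twists to the DB table values.
      double xtwist = 0.145, ytwist = -0.210;   // made-up example values
      double phiXZ =  xtwist / 1000.0;          // XTWIST =  1000 * PhiXZ
      double phiYZ = -ytwist / 1000.0;          // YTWIST = -1000 * PhiYZ (note the minus sign)

      std::printf("targets: (%g, %g), PhiXZ=%g, PhiYZ=%g\n",
                  vertexX_target, vertexY_target, phiXZ, phiYZ);
      return 0;
    }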

Databases

STAR DATABASE INFORMATION PAGES

USEFUL LINKS:
  • Frequently Asked Questions
  • Database Structure Browser
  • Database Browsers: STAR, EMC, EEMC
  • Online Plots
  • How To Request New Table
  • Database Monitoring
  • RunLog
  • Onl2Ofl Migration
  • Online Server Port Map
  • How To Setup Replication Slave
  • Contact Persons for Remote DBs
  • DB Servers Details
  • Online DB Run Preparations
  • MQ-based API
  • DB Timestamps, Explained

 

 

Frequently Asked Questions

Frequently Asked Questions

Q: I am completely new to databases, what should I do first?
A: Please read this FAQ list and the database API documentation:
Database documentation
Then, please read the introductory material (access-restricted page).
Don't forget to log in; most of the information is STAR-specific and is protected. If our documentation pages are missing some information (that's possible), please ask questions on the db-devel mailing list.

Q: I think I've encountered a database-related bug; how can I report it?
A:
Please report it using the STAR RT system (create a ticket), or send your observations to the db-devel mailing list. Don't hesitate to send ANY db-related questions to the db-devel mailing list, please!

Q: I am a subsystem manager, and I have questions about a possible database structure for my subsystem. Whom should I contact to discuss this?
A: Dmitry Arkhipkin is the current STAR database administrator. You can contact him via email or phone, or just stop by his office at BNL:
Phone: (631)-344-4922
Email: arkhipkin@bnl.gov
Office: 1-182

Q: Why do I need the API at all, if I can access the database directly?
A:
There are a few points to consider:
    a) we need consistent data set conversion from the storage format to C++ and Fortran;
    b) our data formats change with time: we add new structures and modify old ones;
    c) direct queries are less efficient than API calls: no caching, no load balancing;
    d) direct queries mean more copy-paste code, which generally means more human errors.
We need the API to enable schema evolution, data conversion, caching, and load balancing.

Q: Why do we need all those databases?
A: STAR has lots of data, and its volume is growing rapidly. To operate efficiently, we must use a proven solution suitable for large data warehousing projects - that's why we have such a setup; there's simply no subpart we can ignore safely (without an overall performance penalty).

Q: It is so complex and hard to use, I'd rather stay with plain text files...
A:
We have a clean, well-defined API for both the Offline and FileCatalog databases, so you don't have to worry about internal db activity. Most db usage examples are only a few lines long, so it really is easy to use. The documentation directory (Drupal) is being improved constantly.

Q: I need to insert some data into the database; how can I get write access enabled?
A: Please send an email with your rcas login and the desired database domain (e.g. "Calibrations/emc/[tablename]") to arkhipkin@bnl.gov (or the current database administrator). Write access is not for everyone, though - make sure that you are either the subsystem coordinator or have proper permission for such data upload.

Q: How can I read some data from the database? I need a simple code example!
A: Please read this page: (access-restricted)

Q: How can I write something to the database? I need a simple code example!
A: Please read this page: (access-restricted)

Q: I'm trying to set the '001122' timestamp, but I cannot get records from the db; what's wrong?
A: In C++, numbers starting with '0' are octal, so 001122 is really translated to 594! If you need to use the '001122' timestamp (or any timestamp with leading zeros), it should be written simply as '1122', omitting all leading zeros.
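
A one-line demonstration of the pitfall (this is standard C++/C integer-literal behavior):

    #include <cstdio>

    int main() {
      int withLeadingZeros = 001122;  // leading zero makes this octal: 1*512 + 1*64 + 2*8 + 2 = 594
      int intended         = 1122;    // decimal, what was actually meant
      std::printf("%d %d\n", withLeadingZeros, intended);  // prints: 594 1122
      return 0;
    }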

Q: What time zone is used for database timestamps? I see EDT and GMT being used in the RunLog...
A:
All STAR databases use GMT timestamps, or UNIX time (seconds since epoch, no time zone). If you need to specify a date/time for a db request, please use a GMT timestamp.

Q: It is said that we need to document our subsystem's tables. I don't have the privilege to create new pages (or our group has another person responsible for Drupal pages); what should I do?
A: Please create a blog page with the documentation - every STAR user has this ability by default. It is possible to add the blog page to the subsystem documentation pages later (the webmaster can do that).

Q: Which file(s) are used by the Load Balancer to locate databases, and what is the order of precedence for those files (if many are available)?
A: The files searched by the LB are:

1. $DB_SERVER_LOCAL_CONFIG env var, should point to a new-LB-version schema xml file (set by default);
2. $DB_SERVER_GLOBAL_CONFIG env var, should point to a new-LB-version schema xml file (not set by default);
3. $STAR/StDb/servers/dbLoadBalancerGlobalConfig.xml : fallback for the LB, new schema expected.

If no usable LB configuration is found yet, the following files are used:

1. $STDB_SERVERS/dbServers.xml - old schema expected;
2. $HOME/dbServers.xml - old schema expected;
3. $STAR/StDb/servers/dbServers.xml - old schema expected.
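
A sketch of that precedence in C++ (illustrative only; the function name is invented, and the real load balancer additionally checks that each candidate file exists and is usable):

    #include <cstdlib>
    #include <string>

    // Return the first load-balancer configuration candidate, following the
    // precedence described above. Usability checks are omitted here; if none
    // of these new-schema candidates is usable, the old-schema dbServers.xml
    // files are tried in the order $STDB_SERVERS, $HOME, $STAR/StDb/servers.
    std::string lbConfigCandidate() {
      if (const char* p = std::getenv("DB_SERVER_LOCAL_CONFIG"))  return p;  // 1. set by default
      if (const char* p = std::getenv("DB_SERVER_GLOBAL_CONFIG")) return p;  // 2. not set by default
      return "$STAR/StDb/servers/dbLoadBalancerGlobalConfig.xml";            // 3. fallback
    }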
 

How-To: user section

Useful database tips and tricks for STAR activities are stored in this section.

Time Stamps


 

TIMESTAMPS

There are three timestamps used in STAR databases:

  • beginTime : the STAR user timestamp; it defines a validity range
  • entryTime : the time of insertion into the database
  • deactive : either 0 or a UNIX timestamp; used for turning off a row of data

EntryTime and deactive are essential for 'reproducibility' and 'stability' in production.

The beginTime is the STAR user timestamp. One manifestation of this is the time recorded by DAQ at the beginning of a run. It is valid until the beginning of the next run, so the end of validity is the next beginTime. In this example the time range will contain many event times, which are also defined by the DAQ system.

The beginTime can also be used in calibration/geometry to define a range of valid values.

EXAMPLE: (et = entryTime) The beginTime represents a 'running' timeline that marks changes in db records with respect to DAQ's event timestamp. In this example, say at some time, et1, I put an initial record in the db with daqtime=bt1. This data will now be used for all daqTimes later than bt1. Now, I add a second record at et2 (the time I write to the db) with beginTime=bt2 > bt1. At this point the 1st record is valid from bt1 to bt2 and the second is valid from bt2 to infinity. Now I add a 3rd record at et3 with bt3 < bt1, so that

 

1st valid bt1-to-bt2, 2nd valid bt2-to-infinity, 3rd is valid bt3-to-bt1.
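
In code terms (a sketch, not the StDbLib implementation), validity lookup is simply "latest beginTime not after the event time":

    #include <map>

    // beginTime -> record payload; the valid record for an event at time t is
    // the entry with the largest beginTime <= t.
    template <typename Payload>
    const Payload* validRecord(const std::map<long, Payload>& byBeginTime, long t) {
      auto it = byBeginTime.upper_bound(t);           // first beginTime > t
      if (it == byBeginTime.begin()) return nullptr;  // t precedes all records
      return &(--it)->second;
    }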

Let's say that after we put in the 1st record but before we put in the second one, Lydia runs a tagged production that we'll want to 'use' forever. Later I want to reproduce some of this production (e.g. embedding...) but the database has changed (we've added the 2nd and 3rd entries). I need to view the db as it existed prior to et2. To do this, whenever we run production, we define a productionTimestamp at that production time, pt1 (which in this example is < et2). pt1 is passed to the StDbLib code and the code requests only data that was entered before pt1. This is how production is 'reproducible'.

The mechanism also provides 'stability'. Suppose at time et2 the production was still running. Use of pt1 is a barrier that prevents the production from 'seeing' the later db entries.

Now let's assume that the 1st production is over, we have all 3 entries, and we want to run a new production. However, we decide that the 1st entry is no good and the 3rd entry should be used instead. We could delete the 1st entry so that the 3rd entry is valid from bt3-to-bt2, but then we could not reproduce the original production. So what we do is 'deactivate' the 1st entry with a timestamp, d1, and run the new production at pt2 > d1. The sql is written so that the 1st entry is ignored as long as pt2 > d1. But I can still run a production with pt1 < d1, which means the 1st entry was valid at time pt1, so it IS used.

One word of caution: you should not deactivate data without training!
Email your request to the database expert.

 

In essence the API will request data as follows:

entryTime < productionTime < deactive  ||  (entryTime < productionTime && deactive == 0)

To put this to use with the BFC, a user must use the dbv switch. For example, a chain that includes dbv20020802 will return values from the database as if today were August 2, 2002. In other words, the switch provides a user with a snapshot of the database from the requested time (which of course includes valid values older than that time). This ensures the reproducibility of production.
If you do not specify this tag (or directly pass a prodTime to StDbLib) then you'll get the latest (non-deactivated) DB records.
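
That selection rule, restated as a compact predicate (a sketch; the function and argument names are illustrative, not the StDbLib API):

    #include <ctime>

    // A row is visible to a production with timestamp prodTime if it was entered
    // before prodTime and either was never deactivated (deactive == 0) or was
    // deactivated only after prodTime.
    bool rowVisibleAt(std::time_t entryTime, std::time_t deactive, std::time_t prodTime) {
      return entryTime < prodTime && (deactive == 0 || prodTime < deactive);
    }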

 

Below is an example of the actual queries executed by the API:

 

select beginTime + 0 as mendDateTime, unix_timestamp(beginTime) as mendTime from eemcDbADCconf Where nodeID=16 AND flavor In('ofl') AND (deactive=0 OR deactive>=1068768000) AND unix_timestamp(entryTime)<=1068768000 AND beginTime>from_unixtime(1054276488) And elementID In(1) Order by beginTime limit 1

select unix_timestamp(beginTime) as bTime, eemcDbADCconf.* from eemcDbADCconf Where nodeID=16 AND flavor In('ofl') AND (deactive=0 OR deactive>=1068768000) AND unix_timestamp(entryTime)<=1068768000 AND beginTime<=from_unixtime(1054276488) AND elementID In(1) Order by beginTime desc limit 1

 

For a description of format see ....

 

 


Quality Assurance

Welcome to the Quality assurance and quality control pages.

Proposal and statements


Proposal for Run IV

Procedure proposal for production and QA in Year4 run

Jérôme LAURET & Lanny RAY, 2004

Summary: The qualitative increase in data volume for Run 4, together with the finite CPU capacity at RCF, precludes the possibility of multiple reconstruction passes through the full raw data volume next year. This new computing situation, together with recent experiences involving production runs which were not pre-certified prior to full-scale production, motivates a significant change in the data quality assurance (QA) effort in STAR. This note describes the motivation and the proposed implementation plan.

Introduction

The projection for the next RHIC run (also called the Year4 run, which will start by the end of 2003) indicates a factor of five increase in the number of collected events compared to preceding runs. This will increase the required data production turn-around time by an order of magnitude, from months to one year per full-scale production run. The qualitative increase in reconstruction demands, combined with an increasingly aggressive physics analysis program, will strain the available data processing resources and poses a severe challenge to STAR and the RHIC computing community for delivering STAR's scientific results on a reasonable time scale. This situation will become more and more problematic as our Physics program evolves to include rare probes. It is not unexpected and was anticipated since before the inception of RCF. The STAR decadal plan (a 10-year projection of STAR activities and development) clearly describes the need for several upgrade phases, including a factor of 10 increase in data acquisition rate and analysis throughput by 2007.

Typically, 1.2 represents an ideal, minimal number of passes through the raw data in order to produce calibrated data summary tapes for physics analysis. However, it is noteworthy that in STAR we have typically processed the raw data an average of 3.5 times, where at each step major improvements in the calibrations were made, enabling more accurate reconstruction and greater precision in the physics measurements. The Year 4 data sample in STAR will include the new ¾-barrel EMC data, which makes it unlikely that sufficiently accurate calibrations and reconstruction can be achieved with only the ideal 1.2 passes, as we foresee the need for additional calibration passes through the entire data sample in order to accumulate enough statistics to push the energy calibration to the high-Pt limit.

While drastically diverging from the initial computing requirement plans (1), this mode of operation, in conjunction with the expanded production timetable, calls for a strengthening of procedures for calibration, production and quality assurance.

The following table summarizes the expectations for ~70 million events with a mix of central and minbias triggers. Numbers of files and data storage requirements are also included for guidance.


Au+Au 200 (minbias)                35 M central    35 M minbias    Total
No DAQ100 (1 pass)                 329 days        152 days        481 days
No DAQ100 (2 passes)               658 days        304 days        962 days
Assuming DAQ100 (1 pass)           246 days        115 days        361 days
Assuming DAQ100 (2 passes)         493 days        230 days        723 days
Total storage estimated (raw)      x               x               203 TB
Total storage estimated (1 pass)   x               x               203 TB


Quality Assurance: Goals and proposed procedure for QA and productions

What is QA in STAR?

The goal of the QA activities in STAR is the validation of data and software, up to DST production. While QA testing can never be exhaustive, the intention is that data that pass the QA testing stage should be considered highly reliable for downstream physics analysis. In addition, QA testing should be performed soon after production of the data, so that errors and problems can be caught and fixed in a timely manner.

QA processes are run independently of the data taking and DST production. These processes contain the accumulated knowledge of the collaboration with respect to potential modes of failure of data taking and DST production, along with those physics distributions that are most sensitive to the health of the data and DST production software. The results probe the data in various ways:

  • At the most basic level, the questions asked are whether the data can be read and whether all the components expected in a given dataset are present. Failures at this level are often related to problems with computing hardware and software infrastructure.

  • At a more sophisticated level, distributions of physics-related quantities are examined, both as histograms and as scalar quantities extracted from the histograms and other distributions. These distributions are compared to those of previous runs that are known to be valid, and the stability of the results is monitored. If changes are observed, these must be understood in terms of changing running conditions or controlled changes in the software, otherwise an error flag should be raised (deviations are not always bad, of course, and can signal new physics: QA must be used with care in areas where there is a danger of biasing the physics results of STAR).
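
As an illustration of this kind of stability monitoring (a sketch only; the real comparisons are done with the autoQA tools, and the bin contents below are invented for the example), a reference comparison can be as simple as a normalized bin-by-bin chi-square:

# Sketch: flag a QA histogram that deviates from a validated reference.
# Bin contents are invented; real QA uses the autoQA histogram suites.
def chi2_per_bin(current, reference):
    # Normalize the reference to the current total, then form a
    # Poisson-style chi-square per bin.
    scale = float(sum(current)) / sum(reference)
    chi2, nbins = 0.0, 0
    for c, r in zip(current, reference):
        expected = r * scale
        if expected > 0:
            chi2 += (c - expected) ** 2 / expected
            nbins += 1
    return chi2 / nbins

current   = [120, 340, 560, 480, 210, 60]   # distribution from the new data
reference = [115, 350, 550, 490, 205, 65]   # same plot from a validated run

if chi2_per_bin(current, reference) > 2.0:  # threshold is an arbitrary example
    print("QA flag: distribution deviates from reference; investigate")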

Varieties of QA in STAR

The focus of QA activities until summer 2000 was on Offline DST production for the DEV branch of the library. With the inception of data taking, the scope of QA has broadened considerably. There are in fact two different servers running autoQA processes:

  • Offline QA. This autoQA-generated web page accesses QA results for all the varieties of Offline DST production:

    • Real data production produced by the Fast Offline framework. This is used to catch gross errors in data taking, online trigger and calibration, allowing the situation to be corrected before too much data is accumulated (this framework also provides on-the-fly calibration as the data is produced).

    • Nightly tests of real and Monte Carlo data (almost always using the DEV and NEW branches of the library). This is used principally to validate the migration of library versions.

    • Large scale production of real and Monte Carlo data (almost always using the PRO branch of the library). This is used to monitor the stability of DSTs for physics.

  • Online QA. This autoQA-generated web page accesses QA results for data in the Online event pool, both raw data and DST production that is run on the Online processors.

The QA dilemma

While a QA shift is usually organized during data taking, regular QA of the later, official production runs was encouraged but not mandated; typically, there has been no organized QA effort for post-experiment DST production runs. This absence of organized quality assurance allowed several post-production problems to arise. These were eventually discovered at the (later) physics analysis stage, but by then the entire production run was wasted. Examples include the following:

  1. missing physics quantities in the DSTs (e.g. V0, Kinks, etc ...)

  2. missing detector information or collections of information due to pilot errors or code support

  3. improperly calibrated and unusable data

  4. ...

The net effect of such late discoveries is a drastic increase in the production cycle time: entire production passes have to be repeated, which a careful QA procedure could have prevented.

Production cycles and QA procedure

To address this problem we propose the following production and QA procedure for each major production cycle.

  1. A data sample (e.g. from a selected trigger setup or detector configuration) of not more than 100k events (Au+Au) or 500k events (p+p) will be produced prior to the start of the production of the entire data sample.

  2. This data sample will remain available on disk for a period of two weeks or until all members of “a” QA team (as defined here) have approved the sample (whichever comes first).

  3. After the two week review period, the remainder of the sample is produced with no further delays, with or without the explicit approval of everyone in the QA team.

  4. Production schedules will be vigorously maintained. Missing quantities which are detected after the start of the production run do not necessarily warrant a repetition of the entire run.

  5. The above policy does not apply to special or unique data samples involving calibration or reconstruction studies nor would it apply to samples having no overlaps with other selections. Such unique data samples include, for example, those containing a special trigger, magnetic field setting, beam-line constraint (fill transition), etc., which no other samples have and which, by their nature, require multiple reconstruction passes and/or special attention.

In order to carry out timely and accurate quality assurance evaluations during the proposed two-week period, we propose the formation of a permanent QA team consisting of:

  1. One or two members per Physics Working group. This manpower will be under the responsibility of the PWG conveners. The aim of these individuals will be to rigorously check, via the autoQA system or analysis codes specific to the PWG, for the presence of the required physics quantities of interest to that PWG which are understood to be vital for the PWG’s Physics program and studies.

  2. One or more detector sub-system experts from each of the major detector sub-systems in STAR. The goal of these individuals will be to ensure the presence and sanity of the data specific to that detector sub-system.

  3. Within the understanding that the outcome of such procedure and QA team is a direct positive impact on the Physics capabilities of a PWG, we recommend that this QA service work be done without shift signups or shift credit as is presently being done for DAQ100 and ITTF testing.

Summary

Facing important challenges driven by the data volume and Physics needs, we have proposed an organized procedure for QA and production relying on cohesive feedback from the PWGs and detector sub-system experts within time-constraint guidelines. The intent is clearly to bring the data to readiness in the shortest possible turn-around time while avoiding later re-production that wastes CPU cycles and human hours.


Summary list of STAR QA Provisions

Summary of the provisions of Quality Assurance and Quality Control for the STAR Experiment


Online QA (near real-time data from the event pool)
  • Plots of hardware/electronics performance
    • Histogram generation framework and browsing tools are provided
    • Shift crew assigned to analyze
    • Plots are archived and available via web
    • Data can be re-checked
    • Yearly re-assessment of plot contents during run preparation meetings and via pre-run email request by the QA coordinator
  • Visualization of data
    • Event Display (GUI running at the control room)
  • DB data validity checks

FastOffline QA (full reconstruction within hours of acquisition)
  • Histogram framework, browsing, reporting, and archiving tools are provided
    • QA shift crew assigned to analyze and report
    • Similar yearly re-assessment of plot contents as Online QA plots
  • Data and histograms on disk for ~2 weeks and then archived to HPSS
    • Available to anyone
    • Variety of macros provided for customized studies (some available from histogram browser, e.g. integrate over runs)
  • Archived reports always available
    • Report browser provided

Reconstruction Code QA
  • Standardized test suite of numerous reconstruction chains in DEV library performed nightly
    • Analyzed by S&C team
    • Browser provided
    • Results kept at migration to NEW library
  • Standardized histogram suite recorded at library tagging (2008+)
    • Analyzed by S&C team
    • Test suite grows with newly identified problems
    • Discussions of analysis and new issues at S&C meetings
  • Test productions before full productions (overlaps with Production QA below)
    • Provided for calibration and PWG experts to analyze (intended to be a requirement of the PWGs, see Production cycles and QA procedure under Proposal for Run IV)
    • Available to anyone for a scheduled 2 weeks prior to commencing production
    • Discussions of analysis and new issues at S&C meetings

Production QA
  • All aspects of FastOffline QA also provided for Production QA (same tools)
    • Data and histograms are archived together (i.e. iff data, then histograms)
    • Same yearly re-assessment of plot contents as FastOffline QA plots (same plots)
    • Formerly analyzed during runs by persons on QA shift crew (2000-2005)
    • No current assignment of shift crew to analyze (2006+)
  • Visualization of data
    • Event Display: GUI, CLI, and visualization engine provided
  • See "Test productions before full production" under Reconstruction Code QA above (overlaps with Production QA)
  • Resulting data from productions are on disk and archived
    • Available to anyone (i.e. PWGs should take interest in monitoring the results)

Embedding QA
  • Standardized test suite of plots of baseline gross features of data
    • Analyzed by Embedding team
  • Provision for PWG specific (custom) QA analysis (2008+)

 

Offline QA

Offline QA Shift Resources

STAR Offline QA Documentation (start here!)

Quick Links: Shift Requirements, Automated Browser Instructions, You do not have access to view this node, Online RunLog Browser

 

Automated Offline QA Browser

Quick Links: You do not have access to view this node

QA Shift Report Forms

Quick Links: Issue Browser/Editor, Dashboard, Report Archive

QA Technical, Reference, and Historical Information

 

Reconstruction Code QA

As a minimal check on effects caused by any changes to reconstruction code, the following code and procedures are to be exercised:

  

  1. A suite of datasets has been selected which should serve as a reference basis for any changes. These datasets include:

    1. Real data from Run 7 AuAu at 200 GeV

    2. Simulated data using year 2007 geometry with AuAu at 200 GeV

    3. Real data from Run 8 pp at 200 GeV

    4. Simulated data using year 2008 geometry with pp at 200 GeV

     

  2. These datasets should be processed with BFC as follows to generate histograms in a hist.root file:

    1. root4star -b -q -l 'bfc.C(100,"P2007b,ittf,pmdRaw,OSpaceZ2,OGridLeak3D","/star/rcf/test/daq/2007/113/8113044/st_physics_8113044_raw_1040042.daq")'

    2. root4star -b -q -l 'bfc.C(100, "trs,srs,ssd,fss,y2007,Idst,IAna,l0,tpcI,fcf,ftpc,Tree,logger,ITTF,Sti,SvtIt,SsdIt,genvtx,MakeEvent,IdTruth,geant,tags,bbcSim,tofsim,emcY2,EEfs,evout,GeantOut,big,fzin,MiniMcMk,-dstout,clearmem","/star/rcf/simu/rcf1296_02_100evts.fzd")'

    3. root4star -b -q -l 'bfc.C(1000,"pp2008a,ittf","/star/rcf/test/daq/2008/043/st_physics_9043046_raw_2030002.daq")'

    4. ?

     

  3. The RecoQA.C macro generates CINT files from the hist.root files

    1. root4star -b -q -l 'RecoQA.C("st_physics_8113044_raw_1040042.hist.root")'

    2. root4star -b -q -l 'RecoQA.C("rcf1296_02_100evts.hist.root")'

    3. root4star -b -q -l 'RecoQA.C("st_physics_9043046_raw_2030002.hist.root")'

    4. ?

     

  4. The CINT files are then useful for comparison to the previous reference, or storage as the new reference for a given code library. To view these plots, simply execute the CINT file with root:

    1. root -l st_physics_8113044_raw_1040042.hist_1.CC
      root -l st_physics_8113044_raw_1040042.hist_2.CC

    2. root -l rcf1296_02_100evts.hist_1.CC
      root -l rcf1296_02_100evts.hist_2.CC

    3. root -l st_physics_9043046_raw_2030002.hist_1.CC
      root -l st_physics_9043046_raw_2030002.hist_2.CC

    4. ?

     

  5. One can similarly execute the reference CINT files for visual comparison: 

    1. root -l $STAR/StRoot/qainfo/st_physics_8113044_raw_1040042.hist_1.CC
      root -l $STAR/StRoot/qainfo/st_physics_8113044_raw_1040042.hist_2.CC

    2. root -l $STAR/StRoot/qainfo/rcf1296_02_100evts.hist_1.CC
      root -l $STAR/StRoot/qainfo/rcf1296_02_100evts.hist_2.CC

    3. root -l $STAR/StRoot/qainfo/st_physics_9043046_raw_2030002.hist_1.CC
      root -l $STAR/StRoot/qainfo/st_physics_9043046_raw_2030002.hist_2.CC

    4. ?

     

  6. Steps 1-3 above should be followed immediately upon establishing a new code library. At that point, the CINT files should be placed in the appropriate CVS directory, checked in, and then checked out (migrated) into the newly established library: 

    cvs co StRoot/qainfo
    mv *.CC StRoot/qainfo
    cvs ci -m "Update for library SLXXX" StRoot/qainfo
    cvs tag SLXXX StRoot/qainfo/*.CC
    cd $STAR
    cvs update StRoot/qainfo
    

     

Missing information will be filled in soon. We may also consolidate some of these steps into a single script; a possible sketch is below.
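
One possible consolidation (a sketch only, not an official STAR script; the chain, file names and event counts are copied from the examples above) would wrap steps 2-3 for a dataset:

#!/usr/bin/env python
# Sketch: run the BFC chain for one reference dataset, then generate the
# CINT comparison files with RecoQA.C. Not an official STAR script.
import subprocess

def run(cmd):
    print("+ " + cmd)
    subprocess.check_call(cmd, shell=True)

def reco_qa(nevents, chain, daqfile, histfile):
    run("root4star -b -q -l 'bfc.C(%d,\"%s\",\"%s\")'" % (nevents, chain, daqfile))
    run("root4star -b -q -l 'RecoQA.C(\"%s\")'" % histfile)

if __name__ == "__main__":
    # Example: the Run 7 AuAu reference dataset from steps 2.1 and 3.1
    reco_qa(100,
            "P2007b,ittf,pmdRaw,OSpaceZ2,OGridLeak3D",
            "/star/rcf/test/daq/2007/113/8113044/st_physics_8113044_raw_1040042.daq",
            "st_physics_8113044_raw_1040042.hist.root")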

 

 

Run QA


Helpful links:

Run 19 (BES II) QA

Run 19 (BES 2) Quality Assurance

Run Periods

Detector Resources

BBC BTOF BEMC EPD
eTOF GMT iTPC/TPC HLT
MTD VPD ZDC  

Other Resources

QA Experts:
  • BBC - Akio Ogawa
  • BTOF - Zaochen Ye
  • BEMC - Raghav Kunnawalkam Elayavalli
  • EPD  - Rosi Reed
  • eTOF - Florian Seck
  • GMT - Dick Majka
  • iTPC- Irakli Chakaberia
  • HLT - Hongwei Ke
  • MTD  - Rongrong Ma
  • VPD  -  Daniel Brandenburg
  • ZDC - Miroslav Simko and Lukas Kramarik
  • Offline-QA - Lanny Ray  + this week's Offline-QA shift taker
  • LFSUPC conveners: David Tlusty, Chi Yang, and Wangmei Zha 
    • delegate: Ben Kimelman
  • BulkCorr conveners: SinIchi Esumi,  Jiangyong Jia, and Xiaofeng Luo 
    • delegate: Takafumi Niida (BulkCorr)
  • PWGC - Zhenyu Ye
  • TriggerBoard (and BES focus group) - Daniel Cebra
  • S&C - Gene van Buren

Meeting Schedule

  • Weekly on Thursdays at 2pm EST
  • Blue Jeans information:
    To join the Meeting:
    https://bluejeans.com/967856029
    
    To join via Room System:
    Video Conferencing System: bjn.vc -or-199.48.152.152
    Meeting ID : 967856029
    
    To join via phone :
    1)  Dial:
    	+1.408.740.7256 (United States)
    	+1.888.240.2560 (US Toll Free)
    	+1.408.317.9253 (Alternate number)
    	(see all numbers - http://bluejeans.com/numbers)
    2)  Enter Conference ID : 967856029
    

AuAu 19.6GeV (2019)

Run 19 (BES-2) Au+Au @ √sNN=19.6 GeV

PWG QA resources:

 Direct links to the relevant Run-19 QA meetings:

  • You do not have access to view this node
  • You do not have access to view this node
  • You do not have access to view this node
  • You do not have access to view this node
  • You do not have access to view this node
  • You do not have access to view this node

 

LFSUPC Run-by-run QA

 

AuAu 11.5GeV (2020)

Run 20 (BES-2) Au+Au @ √sNN=11.5 GeV

PWG QA resources:

Event Level QA

 

Track QA (no track cuts)

 

Track QA (with track cuts)

 

nHits QA (no track cuts)

 

AuAu Fixed Target (2019)

 

Run 20 (BES II) QA

Run 20 (BES 2) Quality Assurance

Run Periods

Detector Resources

BBC BTOF BEMC EPD
eTOF GMT iTPC/TPC HLT
MTD VPD ZDC  

Other Resources

QA Experts:
  • BBC - Akio Ogawa
  • BTOF - Zaochen Ye
  • BEMC - Raghav Kunnawalkam Elayavalli
  • EPD  - Rosi Reed
  • eTOF - Florian Seck
  • GMT - Dick Majka
  • TPC- Irakli Chakaberia, Fleming Videbaek
  • HLT - Hongwei Ke
  • MTD  - Rongrong Ma
  • VPD  -  Daniel Brandenburg
  • ZDC - Miroslav Simko and Lukas Kramarik
  • Offline-QA - Lanny Ray
  • TriggerBoard - Daniel Cebra
  • S&C - Gene van Buren
Period 20a/b:
  • LFSUPC conveners: Wangmei Zha, Daniel Cebra, 
    • delegate: Ben Kimelman
  • BulkCorr conveners: SinIchi Esumi,  Jiangyong Jia, and Xiaofeng Luo 
    • delegate: Takafumi Niida (BulkCorr)
  • PWGC - Zhenyu Ye
Period 20b/c:
  • PWG Delegates
    • LFSUPC: Ben Kimelman, Chenliang Jin
    • BulkCorr: Kosuke Okubo, Ashish Pandav
    • HeavyFlavor: Kaifeng Shen, Yingjie Zhou
    • JetCorr: Tong Liu, Isaac Mooney
    • Spin/ColdQCD: Yike Xu
  • PWGC - Rongrong Ma


Meeting Schedule
  • Weekly on Fridays at noon EST/EDT
  • Blue Jeans information:
    Meeting URL
    https://bluejeans.com/563179247?src=join_info
    
    Meeting ID
    563 179 247
    
    Want to dial in from a phone?
    
    Dial one of the following numbers:
    +1.408.740.7256 (US (San Jose))
    +1.888.240.2560 (US Toll Free)
    +1.408.317.9253 (US (Primary, San Jose))
    +41.43.508.6463 (Switzerland (Zurich, German))
    +31.20.808.2256 (Netherlands (Amsterdam))
    +39.02.8295.0790 (Italy (Italian))
    +33.1.8626.0562 (Paris, France)
    +49.32.221.091256 (Germany (National, German))
    (see all numbers - https://www.bluejeans.com/premium-numbers)
    
    Enter the meeting ID and passcode followed by #
    
    Connecting from a room system?
    Dial: bjn.vc or 199.48.152.152 and enter your meeting ID & passcode
    

Fixed Target Au+Au (2020)

Run-20 (BES-2) RunQA :: Fixed Target Au+Au

Relevant Weekly Meetings

Run 21 (BES II) QA

Run 21 (BES 2) Quality Assurance

Run Period(s)

Detector Resources

BBC BTOF BEMC EPD
eTOF GMT iTPC/TPC HLT
MTD VPD ZDC  

Other Resources

QA Experts:
  • BBC - Akio Ogawa
  • BTOF - Zaochen Ye
  • BEMC - Raghav Kunnawalkam Elayavalli
  • EPD  - Joey Adams
  • eTOF - Philipp Weidenkaff
  • GMT - 
  • TPC- Flemming Videbaek
  • HLT - Hongwei Ke
  • MTD  - Rongrong Ma
  • VPD  -  Daniel Brandenburg
  • ZDC - Miroslav Simko and Lukas Kramarik
  • Offline-QA - Lanny Ray
  • TriggerBoard - Daniel Cebra
  • Production & Calibrations - Gene Van Buren
Period 21:
  • PWG Delegates
    • LFSUPC: Chenliang Jin, Ben Kimelman
    • BulkCorr: Kosuke Okubo, Ashish Pandav
    • HeavyFlavor - Kaifeng Shen, Yingjie Zhou
    • JetCorr - Tong Liu,  Isaac Mooney
    • Spin/ColdQCD : Yike Xu
  • PWGC - Rongrong Ma


Meeting Schedule
  • Weekly on Fridays at noon EST/EDT
  • Zoom information:
    Topic: STAR QA Board
    Time: This is a recurring meeting Meet anytime
    
    Join Zoom Meeting
    https://riceuniversity.zoom.us/j/95314804042?pwd=ZUtBMzNZM3kwcEU3VDlyRURkN3JxUT09
    
    Meeting ID: 953 1480 4042
    Passcode: 2021
    One tap mobile
    +13462487799,,95314804042# US (Houston)
    +12532158782,,95314804042# US (Tacoma)
    
    Dial by your location
            +1 346 248 7799 US (Houston)
            +1 253 215 8782 US (Tacoma)
            +1 669 900 6833 US (San Jose)
            +1 646 876 9923 US (New York)
            +1 301 715 8592 US (Washington D.C)
            +1 312 626 6799 US (Chicago)
    Meeting ID: 953 1480 4042
    Find your local number: https://riceuniversity.zoom.us/u/amvmEfhce
    
    Join by SIP
    95314804042@zoomcrc.com
    
    Join by H.323
    162.255.37.11 (US West)
    162.255.36.11 (US East)
    115.114.131.7 (India Mumbai)
    115.114.115.7 (India Hyderabad)
    213.19.144.110 (Amsterdam Netherlands)
    213.244.140.110 (Germany)
    103.122.166.55 (Australia)
    149.137.40.110 (Singapore)
    64.211.144.160 (Brazil)
    69.174.57.160 (Canada)
    207.226.132.110 (Japan)
    Meeting ID: 953 1480 4042
    Passcode: 2021
    

AuAu 7.7GeV (2021)

Run-21 (BES-2) RunQA :: Au+Au at 7.7GeV

Fixed Target Au+Au (2021)

Run-21 (BES-2) RunQA :: Fixed Target Au+Au

  • (((PLACEHOLDERS)))

Relevant Weekly Meetings
  • ...

Run 22 QA

Weekly on Fridays at noon EST/EDT

Zoom information:

=========================
Topic: STAR QA board meeting

Join ZoomGov Meeting
 
Meeting ID: 161 843 5669
Passcode: 194299
=========================

Mailing List:
=========================
https://lists.bnl.gov/mailman/listinfo/STAR-QAboard-l
=========================

BES-II Data QA:

Summary Page by Rongrong:
https://drupal.star.bnl.gov/STAR/pwg/common/bes-ii-run-qa

==================

Run QA: Ashik Ikbal, Li-Ke Liu (Prithwish Tribedy, Yu Hu as code developers)
Centrality: Zach Sweger, Shuai Zhou, Zuowen Liu (pileup rejection), Xin Zhang

  • Friday, Oct 22, 2021, 12pm BNL Time
    • List of variables from different groups
      • Daniel: centrality
      • Chenliang: LF
      • Ashish: bulk cor
      • Ashik: FCV
      • Kaifeng: HF
      • Tong: jet cor

Run23 QA Volunteers

General TPC QA: Lanny Ray (Texas)

PWG       Volunteers
  CF        Yevheniia Khyzhniak (Ohio)
            Muhammad Ibrahim Abdulhamid Elsayed (Egypt)
  FCV       Han-Sheng Li (Purdue)
            Yicheng Feng (Purdue)
            Niseem Magdy (SBU)
  LFSUPC    Hongcan Li (CCNU)
  HP        Andrew Tamis (Yale)
            Ayanabha Das (CTU)

Run23 QA helpers

 

Grid and Cloud

These pages are dedicated to the GRID effort in STAR as part of our participation in the Open Science Grid.

Our previous pages are being migrated to this area. Please find the previous content here.

Data Management

The data management section will have information on data transfer and development/consolidation of tools used in STAR for Grid data transfer.

 

SRM/DRM Testing June/July 2007

SRM/DRM Testing June/July 2007

Charge

From email:

We had a discussion with Arie Shoshani and group pertaining
to the use of SRM (client and site caching) in our analysis
scenario. We agreed we would proceed with the following plan,
giving ourselves the best shot at achieving the milestone we
have with the OSG.
- first of all, we will try to restore the SRM service both at
LBNL and BNL . This will require
* Disk space for the SRM cache at LBNL - 500 GB is plenty
* Disk space for the SRM cache at BNL - same size is fine

- we hope for a test of transfer to be passed to the OSG troubleshooting
team who will stress test the data transfer as we have defined i.e.
* size test and long term stability - we would like to define a test
where each job would transfer 500 MB of data from LBNL to BNL
We would like 100 jobs submitted at a time
For the test to be run for at least a few days
* we would like to be sure the test includes burst of
100 requests transfer /mn to SRM
+ the success matrix
. how many times the service had to be restarted
. % success on data transfer
+ we need to document the setup, i.e. number of streams
(MUST be greater than 1)

- whenever this test is declared successful, we would use
the deployment in our simulation production in real
production mode - the milestone would then be half
achieved

- To make our milestone fully completed, we would reach
+1 site. The question was which one?
* Our plan is to move to SRM v2.2 for this test - this
is the path which is more economical in terms of manpower,
OSG deliverables and allow for minimal reshuffling of
manpower and current assignment hence increasing our
chances for success.
* FermiGrid would not have SRM 2.2 however
=> We would then use UIC for this, possibly leveraging OSG
manpower to help with setting up a fully working
environment.

Our contact people would be

- Doug Olson for LBNL working with Alex Sim, Andrew Rose,
and Eric Hjort (whenever necessary)
* The work with the OSG troubleshooting team will be
coordinated from LBNL side
* We hope Andrew/Eric will work along with Alex to
set the test described above

- Wayne Betts for access to the infrastructure at BNL
(assistance from everyone to clean the space if needed)

- Olga Barannikova will be our contact for UIC - we will
come back to this later according to the strawman plan
above

As a reminder, I have discussed with Ruth that at
this stage, and after many years of work which are bringing
exciting and encouraging signs of success (the recent production
stability being one), I have no intent to move, re-scope
or re-schedule our milestone. Success of this milestone is the path
forward to making Grid computing part of our plan for the future.
As our visit was understood and help is mobilized, we clearly
see that success is reachable.

I count on all of you for full assistance with
this process.

Thank you,

--
,,,,,
( o o )
--m---U---m--
Jerome

Test Plan (Alex S., 14 June)

 

Hi all,

The following plan will be performed for the STAR SRM test by the SDM group
with BeStMan SRM v2.2.
Andrew Rose will, in the meantime, duplicate the successful analysis case
that Eric Hjort had previously.

1. small local setup
1.1. small number of analysis jobs will be submitted directly to PDSF job
queue.
1.2. A job will transfer files from datagrid.lbl.gov via gsiftp into the
PDSF project working cache.
1.3. a fake analysis will be performed to produce a result file.
1.4 the job will issue srm-client to call BeStMan to transfer the result
file out to datagrid.lbl.gov via gsiftp.

2. small remote setup
2.1. small number of analysis jobs will be submitted directly to PDSF job
queue.
2.2. A job will transfer files from stargrid?.rcf.bnl.gov via gsiftp into
the PDSF project working cache.
2.3. a fake analysis will be performed to produce a result file.
2.4 the job will issue srm-client to call BeStMan to transfer the result
file out to stargrid?.rcf.bnl.gov via gsiftp.

3. large local setup
3.1. about 100-200 analysis jobs will be submitted directly to PDSF job
queue.
3.2. A job will transfer files from datagrid.lbl.gov via gsiftp into the
PDSF project working cache.
3.3. a fake analysis will be performed to produce a result file.
3.4 the job will issue srm-client to call BeStMan to transfer the result
file out to datagrid.lbl.gov via gsiftp.

4. large remote setup
4.1. about 100-200 analysis jobs will be submitted directly to PDSF job
queue.
4.2. A job will transfer files from stargrid?.rcf.bnl.gov via gsiftp into
the PDSF project working cache.
4.3. a fake analysis will be performed to produce a result file.
4.4 the job will issue srm-client to call BeStMan to transfer the result
file out to stargrid?.rcf.bnl.gov via gsiftp.

5. small remote SUMS setup
5.1. small number of analysis jobs will be submitted to SUMS.
5.2. A job will transfer files from stargrid?.rcf.bnl.gov via gsiftp into
the PDSF project working cache.
5.3. a fake analysis will be performed to produce a result file.
5.4 the job will issue srm-client to call BeStMan to transfer the result
file out to stargrid?.rcf.bnl.gov via gsiftp.

6. large remote SUMS setup
6.1. about 100-200 analysis jobs will be submitted to SUMS.
6.2. A job will transfer files from stargrid?.rcf.bnl.gov via gsiftp into
the PDSF project working cache.
6.3. a fake analysis will be performed to produce a result file.
6.4 the job will issue srm-client to call BeStMan to transfer the result
file out to stargrid?.rcf.bnl.gov via gsiftp.

7. have Andrew and Lidia use the setup #6 to test with real analysis jobs
8. have a setup #5 on UIC and test
9. have a setup #6 on UIC and test
10. have Andrew and Lidia use the setup #9 to test with real analysis jobs

Any questions? I'll let you know when things are in progress.

-- Alex
   asim at lbl dot gov

Site Bandwidth Testing

This page is for archiving site bandwidth measurement tests.

The above is a bandwidth test done using the tool iperf (version iperf_2.0.2-4_i386) between the site KISTI (ui03.sdfarm.kr) and BNL (stargrid03) around the beginning of the year 2014. The connection was noted to collapse (drop to zero) a few times during testing before a full plot could be prepared.

The above histogram shows the number of simultaneous copies in one-minute bins, extracted from a few-week segment of the actual production at KISTI. Solitary copies are suppressed because they overwhelm the plot. Copies represent less than 1% of the jobs' total run time.


The above is a bandwidth test done using the tool iperf (version iperf_2.0.2-4_i386) between the site Dubna (lxpub01.jinr.ru) and BNL (stargrid01) on 8/14/2015. Beyond exactly 97 parallel connections the link was noted to collapse, with many parallel processes timing out; this behavior was consistent across three attempts and was not present at any lower number of parallel connections. It is suspected that a soft limit is placed on the number of parallel processes somewhere. The raw data is attached at the bottom.

The 2006 STAR analysis scenario

This page will describe in detail the STAR analysis scenario as it was in ~2006. This scenario involves SUMS grid job submission at RCF through Condor-G to PDSF, using SRMs at both ends to transfer input and output files in a managed fashion.

Transfer BNL/PDSF, summer 2009

This page will document the data transfers from/to PDSF to/from BNL in the summer/autumn of 2009.

October 17, 2009

I repeated earlier tests I had run with Dan Gunter (see "Previous results" below). It takes only 3 streams to saturate the 1 GigE network interface of stargrid04.

[stargrid04] ~/> globus-url-copy -vb file:///dev/zero gsiftp://pdsfgrid2.nersc.gov/dev/null
   2389704704 bytes        23.59 MB/sec avg        37.00 MB/sec inst

[stargrid04] ~/> globus-url-copy -vb -tcp-bs 8388608 file:///dev/zero gsiftp://pdsfgrid2.nersc.gov/dev/null
   1569718272 bytes        35.39 MB/sec avg        39.00 MB/sec inst

[stargrid04] ~/> globus-url-copy -vb -tcp-bs 4388608 file:///dev/zero gsiftp://pdsfgrid2.nersc.gov/dev/null
   1607467008 bytes        35.44 MB/sec avg        38.00 MB/sec inst

[stargrid04] ~/> globus-url-copy -p 2 -vb -tcp-bs 4388608 file:///dev/zero gsiftp://pdsfgrid2.nersc.gov/dev/null
   3414425600 bytes        72.36 MB/sec avg        63.95 MB/sec inst

[stargrid04] ~/> globus-url-copy -p 4 -vb -tcp-bs 4388608 file:///dev/zero gsiftp://pdsfgrid2.nersc.gov/dev/null
   8569487360 bytes       108.97 MB/sec avg       111.80 MB/sec inst

[stargrid04] ~/> globus-url-copy -p 3 -vb -tcp-bs 4388608 file:///dev/zero gsiftp://pdsfgrid2.nersc.gov/dev/null
   5576065024 bytes       106.36 MB/sec avg       109.70 MB/sec inst



[stargrid04] ~/> globus-url-copy -vb gsiftp://pdsfgrid2.nersc.gov/dev/zero file:///dev/null
    625999872 bytes         9.95 MB/sec avg        19.01 MB/sec inst

[stargrid04] ~/> globus-url-copy -vb -tcp-bs 4388608 gsiftp://pdsfgrid2.nersc.gov/dev/zero file:///dev/null
   1523580928 bytes        30.27 MB/sec avg        38.00 MB/sec inst

[stargrid04] ~/> globus-url-copy -vb -p 2 -tcp-bs 4388608 gsiftp://pdsfgrid2.nersc.gov/dev/zero file:///dev/null
   8712617984 bytes        71.63 MB/sec avg        75.87 MB/sec inst

[stargrid04] ~/> globus-url-copy -vb -p 3 -tcp-bs 4388608 gsiftp://pdsfgrid2.nersc.gov/dev/zero file:///dev/null
   7064518656 bytes       102.08 MB/sec avg       111.88 MB/sec inst

October 15, 2009 - evening

After replacing the network card with a 10 GigE one, so that we could plug directly into the core switch, a quick test gives:

 

[stargrid04] ~/> iperf -c pdsfsrm.nersc.gov -m -w 8388608 -t 120 -p 60005
------------------------------------------------------------
Client connecting to pdsfsrm.nersc.gov, TCP port 60005
TCP window size: 8.00 MByte
------------------------------------------------------------
[  3] local 130.199.6.109 port 50291 connected with 128.55.36.74 port 60005
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  4.39 GBytes    314 Mbits/sec
[  3] MSS size 1368 bytes (MTU 1408 bytes, unknown interface)

More work tomorrow.

October 15, 2009

Comparison between the signal from an optical tap at the NERSC border and the tcpdump on the node showed most of the loss happening between the border and pdsfsrm.nersc.gov.

More work was done to optimize single-stream throughput.

  • pdsfsrm was moved from a switch that serves the rack where it resides to a switch that is one level up and closer to the border
  • a configuration of the forcedeth driver was changed (options forcedeth optimization_mode=1 poll_interval=100 set in /etc/modprobe.conf).

The changes resulted in improved throughput, but it is still far from what it should be (see details below). We are going to insert a 10 GigE card into the node and move it even closer to the border.

Here are the results with those buffer memory settings as of the morning of 10/15/2009. There is a header from the first
measurement and then results from a few tests run minutes apart.

-------------------------------------------------------------------------
[stargrid04] ~/> iperf -c pdsfsrm.nersc.gov -m -w 8388608 -t 120 -p 60005
-------------------------------------------------------------------------
Client connecting to pdsfsrm.nersc.gov, TCP port 60005 TCP window size: 8.00 MByte
-------------------------------------------------------------------------
[ 3] local 130.199.6.109 port 44070 connected with 128.55.36.74 port 60005
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-120.0 sec 1.81 GBytes 129 Mbits/sec
[ 3] 0.0-120.0 sec 3.30 GBytes 236 Mbits/sec
[ 3] 0.0-120.0 sec 1.86 GBytes 133 Mbits/sec
[ 3] 0.0-120.0 sec 2.04 GBytes 146 Mbits/sec
[ 3] 0.0-120.0 sec 3.61 GBytes 258 Mbits/sec
[ 3] 0.0-120.0 sec 1.88 GBytes 135 Mbits/sec
[ 3] 0.0-120.0 sec 3.35 GBytes 240 Mbits/sec


Then I restored the "dtn" buffer memory settings (again, morning of 10/15/2009) and got similar if not worse results:


-------------------------------------------------------------------------
[stargrid04] ~/> iperf -c pdsfsrm.nersc.gov -m -w 8388608 -t 120 -p 60005
-------------------------------------------------------------------------
Client connecting to pdsfsrm.nersc.gov, TCP port 60005 TCP window size: 8.00 MByte
-------------------------------------------------------------------------
[ 3] local 130.199.6.109 port 44361 connected with 128.55.36.74 port 60005
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-120.0 sec 2.34 GBytes 168 Mbits/sec
[ 3] 0.0-120.0 sec 1.42 GBytes 101 Mbits/sec
[ 3] 0.0-120.0 sec 2.08 GBytes 149 Mbits/sec
[ 3] 0.0-120.0 sec 2.13 GBytes 152 Mbits/sec
[ 3] 0.0-120.0 sec 1.76 GBytes 126 Mbits/sec
[ 3] 0.0-120.0 sec 1.42 GBytes 102 Mbits/sec
[ 3] 0.0-120.0 sec 2.07 GBytes 148 Mbits/sec
[ 3] 0.0-120.0 sec 2.07 GBytes 148 Mbits/sec


And here, for comparison and to show how things vary with more or less the same load on pdsfgrid2, are results for the "dtn" settings
just like above, from the afternoon of 10/14/2009.


--------------------------------------------------------------------------------------
[stargrid04] ~/> iperf -c pdsfsrm.nersc.gov -m -w 8388608 -t 120 -p 60005
--------------------------------------------------------------------------------------
Client connecting to pdsfsrm.nersc.gov, TCP port 60005 TCP window size: 8.00 MByte
--------------------------------------------------------------------------------------
[ 3] local 130.199.6.109 port 34366 connected with 128.55.36.74 port 60005
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-120.0 sec 1.31 GBytes 93.5 Mbits/sec
[ 3] 0.0-120.0 sec 1.58 GBytes 113 Mbits/sec
[ 3] 0.0-120.0 sec 1.75 GBytes 126 Mbits/sec
[ 3] 0.0-120.0 sec 1.88 GBytes 134 Mbits/sec
[ 3] 0.0-120.0 sec 2.56 GBytes 183 Mbits/sec
[ 3] 0.0-120.0 sec 2.53 GBytes 181 Mbits/sec
[ 3] 0.0-120.0 sec 3.25 GBytes 232 Mbits/sec

Since the "80 Mb/s or worse" behavior persisted for a long time and was measured on various occasions, the new numbers are due to the forcedeth parameter or the switch change; most probably it was the switch. It is also true that the "dtn" settings were able to cope slightly better with the location on the Dell switch, but they seem not to do much when pdsfgrid2 is plugged directly into the "old pdsfcore" switch.

 

October 2, 2009

Notes on third party srm-copy to PDSF:

1) on PDSF interactive node, you need to set up your environment:

source /usr/local/pkg/OSG-1.2/setup.csh

2) srm-copy (recursive) has the following form:

srm-copy gsiftp://stargrid04.rcf.bnl.gov//star/institutions/lbl_prod/andrewar/transfer/reco/production_dAu/ReversedFullField/P08ie/2008/023b/  srm://pdsfsrm.nersc.gov:62443/srm/v2/server\?SFN=/eliza9/starprod/reco/production_dAu/ReversedFullField/P08ie/2008/023/  -recursive -td /eliza9/starprod/reco/production_dAu/ReversedFullField/P08ie/2008/023/

October 1, 2009

We conducted srm-copy tests between RCF and PDSF this week. Initially, the rates we saw for a third party srm-copy between RCF (stargrid04) and PDSF (pdsfsrm) are detailed in plots from Dan:

Per-stream GridFTP throughput: Mbits vs. time

 

 

September 24, 2009 

We updated the transfer procedure to make use of the OSG automated monitoring tools. Previously, the transfers ran between stargrid04 and one of the NERSC data transfer nodes. To take advantage of Dan's automated log harvesting, we're switching the target to pdsfsrm.nersc.gov.

Transfers between stargrid04 and pdsfsrm are fairly stable at ~20MBytes/sec (as reported by the "-vb" option in the globus-url-copy). The command used is of the form:

globus-url-copy -r -p 15 gsiftp://stargrid04.rcf.bnl.gov/[dir]/ gsiftp://pdsfsrm.nersc.gov/[target dir]/

Plots from the first set using the pdsfsrm node:

Data transfer rates vs. File Size.

The most recent rates seen are given in Dan's plots from Sept. 23rd:

Total data transferred

 

 

So, the data transfer is progressing at ~100-200 Mb/s. We will next compare to rates using the new BeStMan installation at PDSF.

 

Previous results

Tests have been repeated as a new node (stargrid10) became available. We ran from the SRM end host at PDSF, pdsfgrid2.nersc.gov, to the new stargrid10.rhic.bnl.gov endpoint at BNL. Because of firewalls we could only run from PDSF to BNL, not the other way. A 60-second test got about 75 Mb/s. This number is consistent with earlier iperf tests between stargrid02 and pdsfgrid2.

globus-url-copy with 8 streams would go up to 400 Mb/s, and with 16 streams 550 Mb/s. Also, with stargrid10 the transfer rates were the same to and from BNL.

Details below.

pdsfgrid2 59% iperf -s -f m -m -p 60005 -w 8388608 -t 60 -i 2
------------------------------------------------------------
Server listening on TCP port 60005
TCP window size: 16.0 MByte (WARNING: requested 8.00 MByte)
------------------------------------------------------------
[ 4] local 128.55.36.74 port 60005 connected with 130.199.6.208 port 36698
[ 4] 0.0- 2.0 sec 13.8 MBytes 57.9 Mbits/sec
[ 4] 2.0- 4.0 sec 19.1 MBytes 80.2 Mbits/sec
[ 4] 4.0- 6.0 sec 4.22 MBytes 17.7 Mbits/sec
[ 4] 6.0- 8.0 sec 0.17 MBytes 0.71 Mbits/sec
[ 4] 8.0-10.0 sec 2.52 MBytes 10.6 Mbits/sec
[ 4] 10.0-12.0 sec 16.7 MBytes 70.1 Mbits/sec
[ 4] 12.0-14.0 sec 17.4 MBytes 73.1 Mbits/sec
[ 4] 14.0-16.0 sec 16.1 MBytes 67.7 Mbits/sec
[ 4] 16.0-18.0 sec 15.8 MBytes 66.4 Mbits/sec
[ 4] 18.0-20.0 sec 17.5 MBytes 73.6 Mbits/sec
[ 4] 20.0-22.0 sec 17.6 MBytes 73.7 Mbits/sec
[ 4] 22.0-24.0 sec 18.1 MBytes 75.8 Mbits/sec
[ 4] 24.0-26.0 sec 19.5 MBytes 81.7 Mbits/sec
[ 4] 26.0-28.0 sec 19.3 MBytes 80.9 Mbits/sec
[ 4] 28.0-30.0 sec 13.8 MBytes 58.1 Mbits/sec
[ 4] 30.0-32.0 sec 14.5 MBytes 60.7 Mbits/sec
[ 4] 32.0-34.0 sec 14.7 MBytes 61.8 Mbits/sec
[ 4] 34.0-36.0 sec 14.6 MBytes 61.2 Mbits/sec
[ 4] 36.0-38.0 sec 17.2 MBytes 72.2 Mbits/sec
[ 4] 38.0-40.0 sec 19.5 MBytes 81.6 Mbits/sec
[ 4] 40.0-42.0 sec 19.5 MBytes 81.6 Mbits/sec
[ 4] 42.0-44.0 sec 19.5 MBytes 81.6 Mbits/sec
[ 4] 44.0-46.0 sec 19.5 MBytes 81.7 Mbits/sec
[ 4] 46.0-48.0 sec 19.5 MBytes 81.6 Mbits/sec
[ 4] 48.0-50.0 sec 19.1 MBytes 79.9 Mbits/sec
[ 4] 50.0-52.0 sec 19.3 MBytes 80.9 Mbits/sec
[ 4] 52.0-54.0 sec 19.4 MBytes 81.3 Mbits/sec
[ 4] 54.0-56.0 sec 19.4 MBytes 81.5 Mbits/sec
[ 4] 56.0-58.0 sec 19.5 MBytes 81.6 Mbits/sec
[ 4] 58.0-60.0 sec 19.5 MBytes 81.7 Mbits/sec
[ 4] 0.0-60.4 sec 489 MBytes 68.0 Mbits/sec
[ 4] MSS size 1368 bytes (MTU 1408 bytes, unknown interface)

The client was on stargrid10.

 

on stargrid10

from stargrid10 to pdsfgrid2:


[stargrid10] ~/> globus-url-copy -vb file:///dev/zero gsiftp://pdsfgrid2.nersc.gov/dev/null Source: file:///dev/ Dest: gsiftp://pdsfgrid2.nersc.gov/dev/

zero  ->  null
513802240 bytes 7.57 MB/sec avg 9.09 MB/sec inst

Cancelling copy...

[stargrid10] ~/> globus-url-copy -vb -p 4 file:///dev/zero gsiftp://pdsfgrid2.nersc.gov/dev/null Source: file:///dev/ Dest: gsiftp://pdsfgrid2.nersc.gov/dev/

zero  ->  null
1863843840 bytes 25.39 MB/sec avg 36.25 MB/sec inst

Cancelling copy...

[stargrid10] ~/> globus-url-copy -vb -p 6 file:///dev/zero gsiftp://pdsfgrid2.nersc.gov/dev/null Source: file:///dev/ Dest: gsiftp://pdsfgrid2.nersc.gov/dev/

zero  ->  null
3354394624 bytes 37.64 MB/sec avg 44.90 MB/sec inst

Cancelling copy...

[stargrid10] ~/> globus-url-copy -vb -p 8 file:///dev/zero gsiftp://pdsfgrid2.nersc.gov/dev/null Source: file:///dev/ Dest: gsiftp://pdsfgrid2.nersc.gov/dev/

zero  ->  null
5016649728 bytes 47.84 MB/sec avg 57.35 MB/sec inst

Cancelling copy...

[stargrid10] ~/> globus-url-copy -vb -p 12 file:///dev/zero gsiftp://pdsfgrid2.nersc.gov/dev/null Source: file:///dev/ Dest: gsiftp://pdsfgrid2.nersc.gov/dev/

zero  ->  null
5588647936 bytes 62.70 MB/sec avg 57.95 MB/sec inst

Cancelling copy...

[stargrid10] ~/> globus-url-copy -vb -p 16 file:///dev/zero gsiftp://pdsfgrid2.nersc.gov/dev/null Source: file:///dev/ Dest: gsiftp://pdsfgrid2.nersc.gov/dev/

zero  ->  null
15292432384 bytes 74.79 MB/sec avg 65.65 MB/sec inst

Cancelling copy...

 

 

and on stargrid10 the other way, from pdsfgrid2 to stargrid10 (similar although slightly better)

[stargrid10] ~/> globus-url-copy -vb gsiftp://pdsfgrid2.nersc.gov/dev/zero file:///dev/null Source: gsiftp://pdsfgrid2.nersc.gov/dev/ Dest: file:///dev/

zero  ->  null
1693450240 bytes 11.54 MB/sec avg 18.99 MB/sec inst

Cancelling copy...

[stargrid10] ~/> globus-url-copy -vb -p 4 gsiftp://pdsfgrid2.nersc.gov/dev/zero file:///dev/null Source: gsiftp://pdsfgrid2.nersc.gov/dev/ Dest: file:///dev/

zero  ->  null
12835618816 bytes 45.00 MB/sec avg 73.50 MB/sec inst

Cancelling copy...

[stargrid10] ~/> globus-url-copy -vb -p 8 gsiftp://pdsfgrid2.nersc.gov/dev/zero file:///dev/null Source: gsiftp://pdsfgrid2.nersc.gov/dev/ Dest: file:///dev/

zero  ->  null
14368112640 bytes 69.20 MB/sec avg 100.50 MB/sec inst

 

And now on pdsfgrid2, from pdsfgrid2 to stargrid10 (similar to the result for 4 streams in the same direction above):

pdsfgrid2 70% globus-url-copy -vb -p 4 file:///dev/zero gsiftp://stargrid10.rcf.bnl.gov/dev/null Source: file:///dev/ Dest: gsiftp://stargrid10.rcf.bnl.gov/dev/

zero  ->  null
20869021696 bytes 50.39 MB/sec avg 73.05 MB/sec inst

Cancelling copy...

And to stargrid02 it is really, really bad; but since the node is going away we won't be investigating the mystery.

pdsfgrid2 71% globus-url-copy -vb -p 4 file:///dev/zero gsiftp://stargrid02.rcf.bnl.gov/dev/null Source: file:///dev/ Dest: gsiftp://stargrid02.rcf.bnl.gov/dev/

zero  ->  null
275513344 bytes 2.39 MB/sec avg 2.40 MB/sec inst

Cancelling copy...

 

12 Mar 2009

Baseline from bwctl from the SRM end host at PDSF (pdsfgrid2.nersc.gov) to a perfSONAR endpoint at BNL (lhcmon.bnl.gov). Because of firewalls, we could only run from PDSF to BNL, not the other way around. Last I checked, this direction was getting about 5 Mb/s from SRM. A 60-second test to the perfSONAR host got about 275 Mb/s.

Summary: Current baseline from perfSONAR is more than 50X what we're seeing.

RECEIVER START
bwctl: exec_line: /usr/local/bin/iperf -B 192.12.15.23 -s -f m -m -p 5008 -w 8388608 -t 60 -i 2
bwctl: start_tool: 3445880257.865809
------------------------------------------------------------
Server listening on TCP port 5008
Binding to local address 192.12.15.23
TCP window size: 16.0 MByte (WARNING: requested 8.00 MByte)
------------------------------------------------------------
[ 14] local 192.12.15.23 port 5008 connected with 128.55.36.74 port 5008
[ 14] 0.0- 2.0 sec 7.84 MBytes 32.9 Mbits/sec
[ 14] 2.0- 4.0 sec 38.2 MBytes 160 Mbits/sec
[ 14] 4.0- 6.0 sec 110 MBytes 461 Mbits/sec
[ 14] 6.0- 8.0 sec 18.3 MBytes 76.9 Mbits/sec
[ 14] 8.0-10.0 sec 59.1 MBytes 248 Mbits/sec
[ 14] 10.0-12.0 sec 102 MBytes 428 Mbits/sec
[ 14] 12.0-14.0 sec 139 MBytes 582 Mbits/sec
[ 14] 14.0-16.0 sec 142 MBytes 597 Mbits/sec
[ 14] 16.0-18.0 sec 49.7 MBytes 208 Mbits/sec
[ 14] 18.0-20.0 sec 117 MBytes 490 Mbits/sec
[ 14] 20.0-22.0 sec 46.7 MBytes 196 Mbits/sec
[ 14] 22.0-24.0 sec 47.0 MBytes 197 Mbits/sec
[ 14] 24.0-26.0 sec 81.5 MBytes 342 Mbits/sec
[ 14] 26.0-28.0 sec 75.9 MBytes 318 Mbits/sec
[ 14] 28.0-30.0 sec 45.5 MBytes 191 Mbits/sec
[ 14] 30.0-32.0 sec 56.2 MBytes 236 Mbits/sec
[ 14] 32.0-34.0 sec 55.5 MBytes 233 Mbits/sec
[ 14] 34.0-36.0 sec 58.0 MBytes 243 Mbits/sec
[ 14] 36.0-38.0 sec 61.0 MBytes 256 Mbits/sec
[ 14] 38.0-40.0 sec 61.6 MBytes 258 Mbits/sec
[ 14] 40.0-42.0 sec 72.0 MBytes 302 Mbits/sec
[ 14] 42.0-44.0 sec 62.6 MBytes 262 Mbits/sec
[ 14] 44.0-46.0 sec 64.3 MBytes 270 Mbits/sec
[ 14] 46.0-48.0 sec 66.1 MBytes 277 Mbits/sec
[ 14] 48.0-50.0 sec 33.6 MBytes 141 Mbits/sec
[ 14] 50.0-52.0 sec 63.0 MBytes 264 Mbits/sec
[ 14] 52.0-54.0 sec 55.7 MBytes 234 Mbits/sec
[ 14] 54.0-56.0 sec 56.9 MBytes 239 Mbits/sec
[ 14] 56.0-58.0 sec 59.5 MBytes 250 Mbits/sec
[ 14] 58.0-60.0 sec 50.7 MBytes 213 Mbits/sec
[ 14] 0.0-60.3 sec 1965 MBytes 273 Mbits/sec
[ 14] MSS size 1448 bytes (MTU 1500 bytes, ethernet)
bwctl: stop_exec: 3445880322.405938

RECEIVER END

 

11 Feb 2009

By: Dan Gunter and Iwona Sakrejda

Measured between the STAR SRM hosts at NERSC/PDSF and Brookhaven:

  • pdsfgrid2.nersc.gov (henceforth, "PDSF")
  • stargrid02.rcf.bnl.gov (henceforth, "BNL")

Current data flow is from PDSF to BNL, but plans are to have data flow both ways.

All numbers are in megabits per second (Mb/s). Layer 4 (transport) protocol was TCP. Tests were at least 60 sec. long, 120 sec. for the higher numbers (to give it time to ramp up). All numbers are approximate, of course.

Both sides had recent Linux kernels with auto-tuning. The max buffer sizes were at Brian Tierney's recommended sizes.
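
For reference (a sketch only; the recommended values themselves are not recorded here), the Linux auto-tuning limits in question can be inspected like this:

# Sketch: print the kernel TCP buffer limits relevant to the tuning above.
# The sysctl keys are standard Linux; recommended values are not reproduced.
for key in ("net/core/rmem_max", "net/core/wmem_max",
            "net/ipv4/tcp_rmem", "net/ipv4/tcp_wmem"):
    try:
        with open("/proc/sys/" + key) as f:
            print(key.replace("/", ".") + " = " + f.read().strip())
    except IOError:
        print(key.replace("/", ".") + " not available")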

From BNL to PDSF

Tool: iperf

  • 1 stream: 50-60 Mb/s (but some dips around 5Mb/s)
  • 8 or 16 streams: 250-300Mb/s aggregate

Tool: globus-url-copy (see PDSF to BNL for details). This was to confirm that globus-url-copy and iperf were roughly equivalent.

  • 1 stream: ~70 Mb/s
  • 8 streams: 250-300 Mb/s aggregate. Note: got same number with PDSF iptables turned off.

From PDSF to BNL

Tool: globus-url-copy (gridftp) -- iperf could not connect, which we proved was due to BNL restrictions by temporarily disabling IPtables at PDSF. To avoid any possible I/O effects, we ran globus-url-copy from /dev/zero to /dev/null (a sweep sketch follows the list below).

  • 1 stream: 5 Mb/s
  • 8 streams: 40 Mb/s
  • 64 streams: 250-300 Mb/s aggregate. Note: got same number with PDSF iptables turned off.
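
Such stream-count scans can be semi-automated; below is a minimal sketch (assumptions: the endpoints are the ones above, GNU coreutils timeout is available, and globus-url-copy's -vb output matches the "MB/sec avg" lines shown in the logs on this page):

#!/usr/bin/env python
# Sketch: sweep globus-url-copy parallel-stream counts and record the last
# reported average throughput. Endpoints and output format are taken from
# the examples on this page; parsing is best-effort.
import re, subprocess

SRC = "file:///dev/zero"
DST = "gsiftp://pdsfgrid2.nersc.gov/dev/null"  # endpoint assumed from above

def measure(streams, seconds=60):
    cmd = ["timeout", str(seconds), "globus-url-copy", "-vb",
           "-p", str(streams), SRC, DST]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    # '-vb' prints lines like: '5016649728 bytes  47.84 MB/sec avg ...'
    rates = re.findall(r"([\d.]+)\s+MB/sec avg", out)
    return rates[-1] if rates else "n/a"

for p in (1, 8, 16, 64):
    print("p=%-3d  avg=%s MB/s" % (p, measure(p)))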

18 Aug 2008 - BNL (stargrid02) - LBLnet (dlolson)

Below are results from iperf tests BNL to LBL.
650 Mbps with very little loss is quite good.
For the uninformed (like me): we ran the iperf server on dlolson.lbl.gov
listening on port 40050, then ran the client on stargrid02.rcf.bnl.gov
sending UDP packets with a max rate of 1000 Mbps.

[olson@dlolson star]$ iperf -s -p 40050 -t 60 -i 1 -u
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 40.0-41.0 sec  78.3 MBytes    657 Mbits/sec  0.012 ms    0/55826 (0%)
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 41.0-42.0 sec  78.4 MBytes    658 Mbits/sec  0.020 ms    0/55946 (0%)
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 42.0-43.0 sec  78.4 MBytes    658 Mbits/sec  0.020 ms    0/55911 (0%)
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 43.0-44.0 sec  76.8 MBytes    644 Mbits/sec  0.023 ms    0/54779 (0%)
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 44.0-45.0 sec  78.4 MBytes    657 Mbits/sec  0.016 ms    7/55912 (0.013%)
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 45.0-46.0 sec  78.4 MBytes    658 Mbits/sec  0.016 ms    0/55924 (0%)
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 46.0-47.0 sec  78.3 MBytes    656 Mbits/sec  0.024 ms    0/55820 (0%)
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 47.0-48.0 sec  78.3 MBytes    657 Mbits/sec  0.016 ms    0/55870 (0%)



[stargrid02] ~/> iperf -c dlolson.lbl.gov -t 60 -i 1 -p 40050 -u -b 1000M
[ ID] Interval       Transfer     Bandwidth
[  3] 40.0-41.0 sec  78.3 MBytes    657 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 41.0-42.0 sec  78.4 MBytes    658 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 42.0-43.0 sec  78.4 MBytes    657 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 43.0-44.0 sec  76.8 MBytes    644 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 44.0-45.0 sec  78.4 MBytes    657 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 45.0-46.0 sec  78.4 MBytes    658 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 46.0-47.0 sec  78.2 MBytes    656 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 47.0-48.0 sec  78.3 MBytes    657 Mbits/sec

Additional notes:
The iperf server at BNL would not answer even though we used port 29000 with
GLOBUS_TCP_PORT_RANGE=20000,30000.

The iperf server at PDSF (pc2608) would not answer either.

 

25 August 2008 BNL - PDSF iperf results, after moving pdsf grid nodes to 1 GigE net

(pdsfgrid5) iperf % build/bin/iperf -s -p 40050 -t 20 -i 1 -u
------------------------------------------------------------
Server listening on UDP port 40050
Receiving 1470 byte datagrams
UDP buffer size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 128.55.36.73 port 40050 connected with 130.199.6.168 port 56027
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3]  0.0- 1.0 sec  78.5 MBytes    659 Mbits/sec  0.017 ms   14/56030 (0.025%)
[  3]  0.0- 1.0 sec  44 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3]  1.0- 2.0 sec  74.1 MBytes    621 Mbits/sec  0.024 ms    8/52834 (0.015%)
[  3]  1.0- 2.0 sec  8 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3]  2.0- 3.0 sec  40.4 MBytes    339 Mbits/sec  0.023 ms   63/28800 (0.22%)
[  3]  2.0- 3.0 sec  63 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3]  3.0- 4.0 sec  73.0 MBytes    613 Mbits/sec  0.016 ms  121/52095 (0.23%)
[  3]  3.0- 4.0 sec  121 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3]  4.0- 5.0 sec  76.6 MBytes    643 Mbits/sec  0.020 ms   18/54661 (0.033%)
[  3]  4.0- 5.0 sec  18 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3]  5.0- 6.0 sec  76.8 MBytes    644 Mbits/sec  0.015 ms   51/54757 (0.093%)
[  3]  5.0- 6.0 sec  51 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3]  6.0- 7.0 sec  77.1 MBytes    647 Mbits/sec  0.016 ms   40/55012 (0.073%)
[  3]  6.0- 7.0 sec  40 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3]  7.0- 8.0 sec  74.9 MBytes    628 Mbits/sec  0.040 ms   64/53414 (0.12%)
[  3]  7.0- 8.0 sec  64 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3]  8.0- 9.0 sec  76.0 MBytes    637 Mbits/sec  0.021 ms   36/54189 (0.066%)
[  3]  8.0- 9.0 sec  36 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3]  9.0-10.0 sec  75.6 MBytes    634 Mbits/sec  0.018 ms   21/53931 (0.039%)
[  3]  9.0-10.0 sec  21 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 10.0-11.0 sec  54.7 MBytes    459 Mbits/sec  0.038 ms   20/38994 (0.051%)
[  3] 10.0-11.0 sec  20 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 11.0-12.0 sec  75.6 MBytes    634 Mbits/sec  0.019 ms   37/53939 (0.069%)
[  3] 11.0-12.0 sec  37 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 12.0-13.0 sec  74.1 MBytes    622 Mbits/sec  0.056 ms    4/52888 (0.0076%)
[  3] 12.0-13.0 sec  24 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 13.0-14.0 sec  75.4 MBytes    633 Mbits/sec  0.026 ms  115/53803 (0.21%)
[  3] 13.0-14.0 sec  115 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 14.0-15.0 sec  77.1 MBytes    647 Mbits/sec  0.038 ms   50/54997 (0.091%)
[  3] 14.0-15.0 sec  50 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 15.0-16.0 sec  75.2 MBytes    631 Mbits/sec  0.016 ms   26/53654 (0.048%)
[  3] 15.0-16.0 sec  26 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 16.0-17.0 sec  78.2 MBytes    656 Mbits/sec  0.039 ms   39/55793 (0.07%)
[  3] 16.0-17.0 sec  39 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 17.0-18.0 sec  76.6 MBytes    643 Mbits/sec  0.017 ms   35/54635 (0.064%)
[  3] 17.0-18.0 sec  35 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 18.0-19.0 sec  76.5 MBytes    641 Mbits/sec  0.039 ms   23/54544 (0.042%)
[  3] 18.0-19.0 sec  23 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3] 19.0-20.0 sec  78.0 MBytes    654 Mbits/sec  0.017 ms    1/55624 (0.0018%)
[  3] 19.0-20.0 sec  29 datagrams received out-of-order
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3]  0.0-20.0 sec  1.43 GBytes    614 Mbits/sec  0.018 ms   19/1044598 (0.0018%)
[  3]  0.0-20.0 sec  864 datagrams received out-of-order


[stargrid02] ~/> iperf -c pdsfgrid5.nersc.gov -t 20 -i 1 -p 40050 -u -b 1000M
------------------------------------------------------------
Client connecting to pdsfgrid5.nersc.gov, UDP port 40050
Sending 1470 byte datagrams
UDP buffer size:   128 KByte (default)
------------------------------------------------------------
[  3] local 130.199.6.168 port 56027 connected with 128.55.36.73 port 40050
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  78.5 MBytes    659 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  1.0- 2.0 sec  74.1 MBytes    621 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  2.0- 3.0 sec  40.4 MBytes    339 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  3.0- 4.0 sec  73.0 MBytes    613 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  4.0- 5.0 sec  76.6 MBytes    643 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  5.0- 6.0 sec  76.8 MBytes    644 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  6.0- 7.0 sec  77.1 MBytes    647 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  7.0- 8.0 sec  74.8 MBytes    628 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  8.0- 9.0 sec  76.0 MBytes    637 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  9.0-10.0 sec  75.6 MBytes    634 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 10.0-11.0 sec  54.6 MBytes    458 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 11.0-12.0 sec  75.7 MBytes    635 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 12.0-13.0 sec  74.1 MBytes    622 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 13.0-14.0 sec  75.4 MBytes    633 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 14.0-15.0 sec  77.1 MBytes    647 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 15.0-16.0 sec  75.2 MBytes    631 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 16.0-17.0 sec  78.2 MBytes    656 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 17.0-18.0 sec  76.6 MBytes    643 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3] 18.0-19.0 sec  76.4 MBytes    641 Mbits/sec
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-20.0 sec  1.43 GBytes    614 Mbits/sec
[  3] Sent 1044598 datagrams
[  3] Server Report:
[ ID] Interval       Transfer     Bandwidth       Jitter   Lost/Total Datagrams
[  3]  0.0-20.0 sec  1.43 GBytes    614 Mbits/sec  0.017 ms   19/1044598 (0.0018%)
[  3]  0.0-20.0 sec  864 datagrams received out-of-order

Transfers to/from Birmingham

Introduction
Transfers where either the source or the target is on the Birmingham cluster. I am keeping a log of these as they come up. I don't do them too often, so it will take a while to accumulate enough data points to discern any patterns…

Date        Type                    Size    Command        Duration     p    agg. rate  rate/p     Source         Destination
2006.9.5    DAQ                     40 Gb   g-u-c          up to 12 hr  3-5  1 MB/s     ~0.2 MB/s  pdsfgrid1,2,4  rhilxs
2006.10.6   MuDst                   50 Gb   g-u-c          3-5 hr       15   ~3.5 MB/s  0.25 MB/s  rhilxs         pdsfgrid2,4,5
2006.10.20  event.root, geant.root  500 Gb  g-u-c -nodcau  38 hr        9    3.7 MB/s   0.41 MB/s  rhilxs         garchive

Notes
g-u-c is just shorthand for globus-url-copy.
'p' is the total number of simultaneous connections: the sum of the -p option values across all the g-u-c commands running together,
e.g. 4 g-u-c commands with no -p option give a total p=4, but 3 g-u-c commands with -p 5 give a total p=15.
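
As an illustration (a hedged sketch; the hosts and paths are placeholders, not endpoints from the table above), two g-u-c commands run together with -p 5 each give a total p=10:

# Two globus-url-copy commands run together, each opening 5 parallel
# streams (-p 5), for a total p of 10. Hosts and paths are placeholders.
globus-url-copy -p 5 gsiftp://source.example.edu/data/file1.daq file:///data/file1.daq &
globus-url-copy -p 5 gsiftp://source.example.edu/data/file2.daq file:///data/file2.daq &
wait    # both run simultaneously; their combined rate is the "agg. rate" above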

Links
May be useful for the beginner?
PDSF Grid info


Documentation

This page will add documents / documentation links and help for Grid beginners or experts. Those documents are either created by us or gathered from the internet.

Getting site information from VORS

VORS (Virtual Organization Resource Selector) provides information about grid sites similar to GridCat. You can find VORS information here.
As per information received at a GOC meeting on 8/14/06, VORS is to be the preferred OSG information service. VORS provides information through the HTTP protocol. This can be in plain text format or HTML; both are viewable from a web browser. For the HTML version use the link:

Virtual Organization Selection
The plain text version may be more important because it can be easily parsed by other programs. This allows for the writing of information service modules for SUMS in a simple way.

Step 1:

Go to the link below in a web browser:

VORS text interface

Note that to get the text version index.cgi is replaced with tindex.cgi. This will bring up a page that looks like this:

238,Purdue-Physics,grid.physics.purdue.edu:2119,compute,OSG,PASS,2006-08-21 19:16:25
237,Rice,osg-gate.rice.edu:2119,compute,OSG,FAIL,2006-08-21 19:17:07
13,SDSS_TAM,tam01.fnal.gov:2119,compute,OSG,PASS,2006-08-21 19:17:10
38,SPRACE,spgrid.if.usp.br:2119,compute,OSG,PASS,2006-08-21 19:17:51
262,STAR-Bham,rhilxs.ph.bham.ac.uk:2119,compute,OSG,PASS,2006-08-21 19:23:12
217,STAR-BNL,stargrid02.rcf.bnl.gov:2119,compute,OSG,PASS,2006-08-21 19:24:11
16,STAR-SAO_PAULO,stars.if.usp.br:2119,compute,OSG,PASS,2006-08-21 19:26:55
44,STAR-WSU,rhic23.physics.wayne.edu:2119,compute,OSG,PASS,2006-08-21 19:29:10
34,TACC,osg-login.lonestar.tacc.utexas.edu:2119,compute,OSG,FAIL,2006-08-21 19:30:23
19,TTU-ANTAEUS,antaeus.hpcc.ttu.edu:2119,compute,OSG,PASS,2006-08-21 19:30:54


This page holds little information about the site itself; however, it links each site with its resource number. The resource number is the first number on each line. In this example site STAR-BNL is resource 217.

Step 2:

To find out more useful information about the site, the resource number has to be applied to the link below (note I have already filled in 217 for STAR-BNL):

STAR-BNL VORS Information

The plain text information that comes back will look like this:

#VORS text interface (grid = All, VO = all, res = 217)
shortname=STAR-BNL
gatekeeper=stargrid02.rcf.bnl.gov
gk_port=2119
globus_loc=/opt/OSG-0.4.0/globus
host_cert_exp=Feb 24 17:32:06 2007 GMT
gk_config_loc=/opt/OSG-0.4.0/globus/etc/globus-gatekeeper.conf
gsiftp_port=2811
grid_services=
schedulers=jobmanager is of type fork
jobmanager-condor is of type condor
jobmanager-fork is of type fork
jobmanager-mis is of type mis
condor_bin_loc=/home/condor/bin
mis_bin_loc=/opt/OSG-0.4.0/MIS-CI/bin
mds_port=2135
vdt_version=1.3.9c
vdt_loc=/opt/OSG-0.4.0
app_loc=/star/data08/OSG/APP
data_loc=/star/data08/OSG/DATA
tmp_loc=/star/data08/OSG/DATA
wntmp_loc=: /tmp
app_space=6098.816 GB
data_space=6098.816 GB
tmp_space=6098.816 GB
extra_variables=MountPoints
SAMPLE_LOCATION default /SAMPLE-path
SAMPLE_SCRATCH devel /SAMPLE-path
exec_jm=stargrid02.rcf.bnl.gov/jobmanager-condor
util_jm=stargrid02.rcf.bnl.gov/jobmanager
sponsor_vo=star
policy=http://www.star.bnl.gov/STAR/comp/Grid


From the unix command line, wget can be used to collect this information. From inside a Java application the Socket class can be used to pull this information back as a String and then parse it as needed.
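
For example, a shell sketch of the two-step lookup (the VORS server address is not spelled out in the links above, so $VORS_BASE is a placeholder you must fill in; the query parameters mirror the "grid = All, VO = all, res = 217" header shown above):

VORS_BASE="http://vors.example.org/cgi-bin"    # placeholder for the real VORS host
# Step 1: fetch the plain text index and pull the resource number for STAR-BNL
RES=`wget -q -O - "$VORS_BASE/tindex.cgi" | awk -F, '/STAR-BNL/ {print $1}'`
# Step 2: fetch that resource's detail page and pick out a field of interest
wget -q -O - "$VORS_BASE/tindex.cgi?grid=All&VO=all&res=$RES" | grep '^gatekeeper='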

Globus 1.1.x

QuickStart.pdf is for Globus versions 1.1.3 / 1.1.4.

Globus Toolkit Error FAQ

Globus Toolkit Error FAQ

For GRAM error codes, follow this link.

The purpose of this document is to outline common errors encountered after the installation and setup of the Globus Toolkit.

  1. GRAM Job Submission failed because the connection to the server failed (check host and port) (error code 12)
  2. error in loading shared libraries
  3. ERROR: no valid proxy, or lifetime to small (one hour)
  4. GRAM Job submission failed because authentication with the remote server failed (error code 7)
  5. GRAM Job submission failed because authentication failed: remote certificate not yet valid (error code 7)
  6. GRAM Job submission failed because authentication failed: remote certificate has expired (error code 7)
  7. GRAM Job submission failed because data transfer to the server failed (error code 10)
  8. GRAM Job submission failed because authentication failed: Expected target subject name="/CN=host/hostname"
    Target returned subject name="/O=Grid/O=Globus/CN=hostname.domain.edu" (error code 7)
  9. Problem with local credentials no proxy credentials: run grid-proxy-init or wgpi first
  10. GRAM Job submission failed because authentication failed: remote side did not like my creds for unknown reason
  11. GRAM Job submission failed because the job manager failed to open stdout (error code 73)
    or
    GRAM Job submission failed because the job manager failed to open stderr (error code 74)
  12. GRAM Job submission failed because the provided RSL string includes variables that could not be identified (error code 39)
  13. 530 Login incorrect / FTP LOGIN REFUSED (shell not in /etc/shells)
  14. globus_i_gsi_gss_utils.c:866: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials: Couldn't verify the remote certificate
    OpenSSL Error: s3_pkt.c:1031: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert bad certificate (error code 7)
  15. globus_gsi_callback.c:438: globus_i_gsi_callback_cred_verify: Could not verify credential: self signed certificate in certificate chain (error code 7)
    or
    globus_gsi_callback.c:424: globus_i_gsi_callback_cred_verify: Can't get the local trusted CA certificate: Cannot find issuer certificate for local credential (error code 7)
  16. SSL3_GET_CLIENT_CERTIFICATE: no certificate returned
  17. undefined symbol: lutil_sasl_interact
    followed by a failure to load a module
    /usr/local/globus-2.4.2/etc/grid-info-slapd.conf: line 23: failed to load or initialize module libback_giis.la

  1. GRAM Job Submission failed because the connection to the server failed (check host and port) (error code 12)

    Diagnosis

    Your client is unable to contact the gatekeeper specified. Possible causes include:
    • The gatekeeper is not running
    • The host is not reachable.
    • The gatekeeper is on a non-standard port

    Solution

    Make sure the gatekeeper is being launched by inetd or xinetd. Review the Install Guide if you do not know how to do this. Check to make sure that ordinary TCP/IP connections are possible; can you ssh to the host, or ping it? If you cannot, then you probably can't submit jobs either. Check for typos in the hostname.

    Try telnetting to port 2119. If you see an "Unable to load shared library" message, the gatekeeper was not built statically and does not have an appropriate LD_LIBRARY_PATH set. If that is the case, either rebuild it statically, or set the environment variable for the gatekeeper. In inetd, use /usr/bin/env to wrap the launch of the gatekeeper, or in xinetd, use the "env=" option.

    Check the $GLOBUS_LOCATION/var/globus-gatekeeper.log if it exists. It may tell you that the private key is insecure, so it refuses to start. In that case, fix the permissions of the key to be read only by the owner.

    If the gatekeeper is on a non-standard port, be sure to use a contact string of host:port.
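
    These checks can be scripted; a minimal sketch, assuming the STAR-BNL gatekeeper as the example host:

    # Basic connectivity triage for error code 12; replace the host as needed.
    GK=stargrid02.rcf.bnl.gov
    ping -c 3 $GK          # is the host reachable?
    ssh $GK /bin/true      # do ordinary TCP/IP connections work?
    telnet $GK 2119        # does the gatekeeper answer on its standard port?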
    Back to top


  2. error in loading shared libraries

    Diagnosis

    LD_LIBRARY_PATH is not set.

    Solution

    If you receive this as a client, make sure to read in either $GLOBUS_LOCATION/etc/globus-user-env.sh (if you are using a Bourne-like shell) or $GLOBUS_LOCATION/etc/globus-user-env.csh (if you are using a C-like shell)
    Back to top


  3. ERROR: no valid proxy, or lifetime to small (one hour)

    Diagnosis

    You are running globus-personal-gatekeeper as root, or did not run grid-proxy-init.

    Solution

    Don't run globus-personal-gatekeeper as root. globus-personal-gatekeeper is designed to allow an ordinary user to establish a gatekeeper using a proxy from their personal certificate. If you are root, you should setup a gatekeeper using inetd or xinetd, and using your host certificates. If you are not root, make sure to run grid-proxy-init before starting the personal gatekeeper.
    Back to top


  4. GRAM Job submission failed because authentication with the remote server failed (error code 7)

    Diagnosis

    Check the $GLOBUS_LOCATION/var/globus-gatekeeper.log on the remote server. You will probably see something like:

    Authenticated globus user: /O=Grid/O=Globus/OU=your.domain/OU=Your Name
    Failure: globus_gss_assist_gridmap() failed authorization. rc =1

    Solution

    This indicates that your account is not in the grid-mapfile. Create the grid-mapfile in /etc/grid-security (or wherever the -gridmap flag in $GLOBUS_LOCATION/etc/globus-gatekeeper.conf points to) with an entry pairing your subject name to your user name. Review the Install Guide if you do not know how to do this.  If you see "rc = 7", you may have bad permissions on the /etc/grid-security/.  It needs to be readable so that users can see the certificates/ subdirectory.
    Back to top


  5. GRAM Job submission failed because authentication failed: remote certificate not yet valid (error code 7)

    Diagnosis

    This indicates that the remote host has a date set greater than five minutes in the future relative to the local host.

    Try typing "date -u" on both systems at the same time to verify this. (The "-u" specifies that the time should be displayed in universal time, also known as UTC or GMT.)

    Solution

    Ultimately, synchronize the hosts using NTP. Otherwise, unless you are willing to set the client host date back, you will have to wait until your system believes that the remote certificate is valid. Also, be sure to check your shell environment to see if you have any time zone variables set.
    Back to top


  6. GRAM Job submission failed because authentication failed: remote certificate has expired (error code 7)

    Diagnosis

    This indicates that the remote host has an expired certificate.

    To double-check, you can use grid-cert-info or grid-proxy-info. Use grid-cert-info on /etc/grid-security/hostcert.pem if you are dealing with a system level gatekeeper. Use grid-proxy-info if you are dealing with a personal gatekeeper.

    Solution

    If the host certificate has expired, use grid-cert-renew to get a renewal. If your proxy has expired, create a new one with grid-proxy-init.
    Back to top


  7. GRAM Job submission failed because data transfer to the server failed (error code 10)

    Diagnosis

    Check the $GLOBUS_LOCATION/var/globus-gatekeeper.log on the remote server. You will probably see something like:

    Authenticated globus user: /O=Grid/O=Globus/OU=your.domain/OU=Your Name
    Failure: globus_gss_assist_gridmap() failed authorization. rc =1

    Solution

    This indicates that your account is not in the grid-mapfile. Create the grid-mapfile in /etc/grid-security (or wherever the -gridmap flag in $GLOBUS_LOCATION/etc/globus-gatekeeper.conf points to) with an entry pairing your subject name to your user name. Review the Install Guide if you do not know how to do this.
    Back to top


  8. GRAM Job submission failed because authentication failed: Expected target subject name="/CN=host/hostname"
    Target returned subject name="/O=Grid/O=Globus/CN=hostname.domain.edu" (error code 7)

    Diagnosis

    New installations will often see errors like the above where the expected target subject name has just the unqualified hostname but the target returned subject name has the fully qualified domain name (e.g. expected is "hostname" but returned is "hostname.domain.edu").

    This is usually because the client looks up the target host's IP address in /etc/hosts and only gets the simple hostname back.

    Solution

    The solution is to edit the /etc/hosts file so that it returns the fully qualified domain name. To do this find the line in /etc/hosts that has the target host listed and make sure it looks like:

    xx.xx.xx.xx hostname.domain.edu hostname

    Where "xx.xx.xx.xx" should be the numeric IP address of the host and hostname.domain.edu should be replaced with the actual hostname in question. The trick is to make sure the full name (hostname.domain.edu) is listed before the nickname (hostname).

    If this only happens with your own host, see the explanation of the failed to open stdout error, specifically about how to set the GLOBUS_HOSTNAME for your host.
    Back to top


  9. Problem with local credentials no proxy credentials: run grid-proxy-init or wgpi first

    Diagnosis

    You do not have a valid proxy.

    Solution

    Run grid-proxy-init
    Back to top


  10. GRAM Job submission failed because authentication failed: remote side did not like my creds for unknown reason

    Diagnosis

    Check the $GLOBUS_LOCATION/var/globus-gatekeeper.log on the remote host. It probably says "remote certificate not yet valid". This indicates that the client host has a date set greater than five minutes in the future relative to the remote host.

    Try typing "date -u" on both systems at the same time to verify this. (The "-u" specifies that the time should be displayed in universal time, also known as UTC or GMT.)

    Solution

    Ultimately, synchronize the hosts using NTP. Otherwise, unless you are willing to set the client host date back, you will have to wait until the remote server believes that your proxy is valid. Also, be sure to check your shell environment to see if you have any time zone variables set.
    Back to top


  11. GRAM Job submission failed because the job manager failed to open stdout (error code 73)

    Or GRAM Job submission failed because the job manager failed to open stderr (error code 74)

    Diagnosis

    The remote job manager is unable to open a connection back to your client host. Possible causes include:
    • Bad results from globus-hostname. Try running globus-hostname on your client. It should output the fully qualified domain name of your host. This is the information which the GRAM client tools use to let the jobmanager on the remote server know who to open a connection to. If it does not give a fully qualified domain name, the remote host may be unable to open a connection back to your host.
    • A firewall. If a firewall blocks the jobmanager's attempted connection back to your host, this error will result.
    • Troubles in the ~/.globus/.gass_cache on the remote host. This is the least frequent cause of this error. It could relate to NFS or AFS issues on the remote host.
    • It is also possible that the CA that issued your Globus certificate is not trusted by your local host. Running 'grid-proxy-init -verify' should detect this situation.

    Solution

    Depending on the cause from above, try the following solutions:
    • Fix the result of 'hostname' itself. You can accomplish this by editing /etc/hosts and adding the fully qualified domain name of your host to this file. See how to do this in the explanation of the expected target subject error. If you cannot do this, or do not want to do this, you can set the GLOBUS_HOSTNAME environment variable to override the result of globus-hostname. Set GLOBUS_HOSTNAME to the fully qualified domain name of your host.
    • To cope with a firewall, use the GLOBUS_TCP_PORT_RANGE environment variable. If your host is behind a firewall, set GLOBUS_TCP_PORT_RANGE to the allowable incoming connections on your firewall. If the firewall is in front of the remote server, you will need the remote site to set GLOBUS_TCP_PORT_RANGE in the gatekeeper's environment to the allowable incoming range of the firewall in front of the remote server. If there are firewalls on both sides, perform both of the above steps. Note that the allowable ranges do not need to coincide on the two firewalls; it is, however, necessary that the GLOBUS_TCP_PORT_RANGE be valid for both incoming and outgoing connections of the firewall it is set for. (A minimal sketch of setting this variable follows this list.)
    • If you are working with AFS, you will want the .gass_cache directory to be a link to a local filesystem. If you are having NFS trouble, you will need to fix it, which is beyond the scope of this document.
    • Install the trusted CA for your certificate on the local system.
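
    For the firewall case above, a minimal client-side sketch (the port range is illustrative; use whatever your firewall actually allows):

    # Restrict Globus callback ports to a firewall-approved range before
    # submitting; 40000,41000 is an example range, not a recommendation.
    export GLOBUS_TCP_PORT_RANGE=40000,41000
    globus-job-run gatekeeper.example.edu/jobmanager /bin/date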


    Back to top
  12. GRAM Job submission failed because the provided RSL string includes variables that could not be identified (error code 39)

    Diagnosis

    You submitted a job which specifies an RSL substitution which the remote jobmanager does not recognize. The most common case is using a 2.0 version of globus-job-get-output with a 1.1.x gatekeeper/jobmanager.

    Solution

    Currently, globus-job-get-output will not work between a 2.0 client and a 1.1.x gatekeeper. Work is in progress to ensure interoperability by the final release. In the meantime, you should be able to modify the globus-job-get-output script to use $(GLOBUS_INSTALL_PATH) instead of $(GLOBUS_LOCATION).
    Back to top


  13. 530 Login incorrect / FTP LOGIN REFUSED (shell not in /etc/shells)

    Diagnosis

    The 530 Login incorrect usually indicates that your account is not in the grid-mapfile, or that your shell is not in /etc/shells.

    Solution

    If your account is not in the grid-mapfile, make sure to get it added. If it is in the grid-mapfile, check the syslog on the machine, and you may see the /etc/shells message. If that is the case, make sure that your shell (as listed in finger or chsh) is in the list of approved shells in /etc/shells.
    Back to top


  14. globus_i_gsi_gss_utils.c:866: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials: Couldn't verify the remote certificate
    OpenSSL Error: s3_pkt.c:1031: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert bad certificate (error code 7)

    Diagnosis

    This error message usually indicates that the server you are connecting to doesn't trust the Certificate Authority (CA) that issued your Globus certificate.

    Solution

    Either use a certificate from a different CA, or contact the administrator of the resource you are connecting to and request that they install the CA certificate in their trusted certificates directory.
    Back to top
  15. globus_gsi_callback.c:438: globus_i_gsi_callback_cred_verify: Could not verify credential: self signed certificate in certificate chain (error code 7)

    Or globus_gsi_callback.c:424: globus_i_gsi_callback_cred_verify: Can't get the local trusted CA certificate: Cannot find issuer certificate for local credential (error code 7)

    Diagnosis

    This error message indicates that your local system doesn't trust the certificate authority (CA) that issued the certificate on the resource you are connecting to.

    Solution

    You need to ask the resource administrator which CA issued their certificate and install the CA certificate in the local trusted certificates directory.
    Back to top  


  16. SSL3_GET_CLIENT_CERTIFICATE: no certificate returned

    Diagnosis

    This error message indicates that the name in the certificate for the remote party is not legal according to the local signing_policy file for that CA.

    Solution

    You need to verify you have the correct signing policy file installed for the CA by comparing it with the one distributed by the CA.
    Back to top
  17. undefined symbol: lutil_sasl_interact

    Diagnosis

    Globus replica catalog was installed along with MDS/Information Services.

    Solution

    Do not install the replica bundle into a GLOBUS_LOCATION containing other Information Services. The Replica Catalog is also deprecated - use RLS instead.
    Back to top

 

Intro to FermiGrid site for STAR users

The FNAL_FERMIGRID site policy and some documentation can be found here:

http://fermigrid.fnal.gov/policy.html

You must use VOMS proxies (rather than grid certificate proxies) to run at this site.  A brief intro to voms proxies is here:  Introduction to voms proxies for grid cert users

All users with STAR VOMS proxies are mapped to a single user account ("star").

Technical note: (Quoting from an email that Steve Timm sent to Levente) "Fermigrid1.fnal.gov is not a simple jobmanager-condor. It is emulating the jobmanager-condor protocol and then forwarding the jobs on to whichever clusters have got free slots, 4 condor clusters and actually one pbs cluster behind it too." For instance, I noticed jobs submitted to this gatekeeper winding up at the USCMS-FNAL-WC1-CE site in MonAlisa. (What are the other sites?)

You can use SUMS to submit jobs to this site (though this feature is still in beta testing) following this example:
star-submit-beta -p dynopol/FNAL_FERMIGRID jobDescription.xml

where jobDescription.xml is the filename of your job's xml file.
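
For completeness, here is a hedged sketch of creating and submitting a minimal job description; the XML tags follow common SUMS examples but should be verified against the SUMS documentation, and the paths are illustrative:

# Create a minimal, illustrative SUMS job description and submit it.
cat > jobDescription.xml <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<job>
    <command>/bin/hostname; /bin/date</command>
    <stdout URL="file:/star/u/youruser/fermigrid_test.out"/>
</job>
EOF
star-submit-beta -p dynopol/FNAL_FERMIGRID jobDescription.xml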

Site gatekeeper info:

Hostname:  fermigrid1.fnal.gov

condor queue is available (fermigrid1.fnal.gov/jobmanager-condor)

If no jobmanager is specified, the job runs on the gatekeeper itself (jobmanager-fork, I’d assume)

[stargrid02] ~/> globus-job-run fermigrid1.fnal.gov /bin/cat /etc/redhat-release

Scientific Linux Fermi LTS release 4.2 (Wilson)

Fermi worker node info:

[stargrid02] ~/> globus-job-run fermigrid1.fnal.gov/jobmanager-condor /bin/cat /etc/redhat-release

Scientific Linux SL release 4.2 (Beryllium)

 

[stargrid02] ~/> globus-job-run fermigrid1.fnal.gov/jobmanager-condor /usr/bin/gcc -v

Using built-in specs.

Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --disable-checking --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-java-awt=gtk --host=i386-redhat-linux

Thread model: posix

gcc version 3.4.4 20050721 (Red Hat 3.4.4-2)

 

There doesn't seem to be a GNU fortran compiler such as g77 on the worker nodes.

Open question:  What is the preferred file transfer mechanism?

In GridCat they list an SRM server at srm://fndca1.fnal.gov:8443/ but I have not made any attempt to use it.

Introduction to voms proxies for grid cert users

The information in a voms proxy is a superset of the information in a grid certificate proxy. This additional information includes details about the VO of the user. For users, the potential benefit is the possibility to work as a member of multiple VOs with a single DN and have your jobs accounted accordingly. Obtaining a voms proxy (if all is configured properly) is as simple as “voms-proxy-init -voms star” (this is of course for a member of the STAR VO).

Here is an example to illustrate the difference between grid proxies and voms proxies (note that the WARNING and Error lines at the top don’t seem to preclude the use of the voms proxy; I don’t know why those appear or what practical implications there are from the underlying cause, and I hope to update this info as I learn more):

[stargrid02] ~/> voms-proxy-info -all
WARNING: Unable to verify signature!
Error: Cannot find certificate of AC issuer for vo star
subject : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856/CN=proxy
issuer : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
identity : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
type : proxy
strength : 512 bits
path : /tmp/x509up_u2302
timeleft : 4:10:20
=== VO star extension information ===
VO : star
subject : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
issuer : /DC=org/DC=doegrids/OU=Services/CN=vo.racf.bnl.gov
attribute : /star/Role=NULL/Capability=NULL
timeleft : 4:10:19

 

[stargrid02] ~/> grid-proxy-info -all
subject : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856/CN=proxy
issuer : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
identity : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
type : full legacy globus proxy
strength : 512 bits
path : /tmp/x509up_u2302
timeleft : 4:10:14

 


In order to obtain the proxy, the VOMS server for the requested VO must be contacted, which introduces a dependency on a working VOMS server that doesn’t exist with a simple grid cert. It is worth further noting that either a VOMS or GUMS server (I should investigate this) will also be contacted by VOMS-aware gatekeepers to authenticate users at job submission time, behind the scenes. One goal (or at least one consequence) of this sort of usage is to eliminate static grid-map-files.

Something else to note (and investigate): the voms-proxy doesn’t necessarily last as long as the basic grid cert proxy; the voms part can apparently expire independently of the grid-proxy. Consider this example, in which the two expiration times are different:

[stargrid02] ~/> voms-proxy-info -all
WARNING: Unable to verify signature!
Error: Cannot find certificate of AC issuer for vo star
subject : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856/CN=proxy
issuer : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
identity : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
type : proxy
strength : 512 bits
path : /tmp/x509up_u2302
timeleft : 35:59:58
=== VO star extension information ===
VO : star
subject : /DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856
issuer : /DC=org/DC=doegrids/OU=Services/CN=vo.racf.bnl.gov
attribute : /star/Role=NULL/Capability=NULL
timeleft : 23:59:58

 

(Question: What determines the duration of the voms-proxy extension - the VOMS server or the user/client?)

Technical note 1: on stargrid02, the “vomses” file, which lists the URL for VOMS servers, was not in a default location used by voms-proxy-init, and thus it was not actually working (basically, it worked just like grid-proxy-init). I have put an existing vomses file in /opt/OSG-0.4.1/voms/etc and it seems content to use it.

Technical note 2: neither stargrid03’s VDT installation nor the WNC stack on the rcas nodes has VOMS tools. I’m guessing that the VDT stack is too old on stargrid03 and that voms-proxy tools are missing on the worker nodes because that functionality isn't really needed on a worker node.

Job Managers

Several job managers are available as part of any OSG/VDT/Globus deployment. They may restrict access to keywords fundamental to job control and efficiency, or may not even work.
The pages here document the needed changes or features.

Condor Job Manager

Condor job manager code is provided as-is for quick code inspection. The version below is from the OSG 0.4.1 software stack.

LSF job manager

LSF job manager code below is from globus 2.4.3.

SGE Job Manager

SGE job manager code was developed by the UK Grid eScience effort. It is provided as-is for quick code inspection. The version below is as integrated in VDT 1.5.2 (post OSG 0.4.1). Please note that the version below includes patches provided by the RHIC/STAR VO. Consult SGE Job Manager patch for more information.

Modifying Virtual Machine Images and Deploying Them


The steps:

  1. login to stargrid01

     

  2. Check that your ssh public key is in $HOME/.ssh/id_rsa.pub; if it is not, put it there.

     

  3. Select the base image you wish to modify. You will find the name of the image you are currently using for your cluster by looking inside:

    /star/u/lbhajdu/ec2/workspace-cloud-client-010/samples/[cluster description].xml
    

    Open up this file and you will find a structure that looks something like the one below. There are two <workspace> blocks, one for the gatekeeper and one for the worker nodes. The name of the worker node image is in the second block, between the <image> tags. So for the example below the name would be osgworker-012.

    <workspace>
    <name>head-node</name>
    <image>osgheadnode-012</image>
    <quantity>1</quantity>
    .
    .
    .
    </workspace>
    <workspace>
    <name>compute-nodes</name>
    <image>osgworker-012</image>
    <quantity>3</quantity>
    <nic interface="eth1">private</nic>
    .
    .
    .
    </workspace>

  4. To make a modification to the image, we have to mount/deploy it. Once we know the name, simply type:

    ./bin/cloud-client.sh --run --name [image name] --hours 50
    

    Where [image name] is the name we found in step 3. This image will be up for 50 hours. You will have to save the image before you run out of time, else all of your changes will be lost.

    The output of this command will look something like:

    [stargrid01] ~/ec2/workspace-cloud-client-010/> ./bin/cloud-client.sh --run --name osgworker-012 --hours 50
    (Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
    (New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-010/lib/globus')
    SSH public keyfile contained tilde:
    - '~/.ssh/id_rsa.pub' --> '/star/u/lbhajdu/.ssh/id_rsa.pub'
    Launching workspace.
    Workspace Factory Service:
    https://tp-vm1.ci.uchicago.edu:8445/wsrf/services/WorkspaceFactoryService
    Creating workspace "vm-003"... done.
    IP address: 128.135.125.29
    Hostname: tp-x009.ci.uchicago.edu
    Start time: Tue Jan 13 13:59:04 EST 2009
    Shutdown time: Thu Jan 15 15:59:04 EST 2009
    Termination time: Thu Jan 15 16:09:04 EST 2009
    Waiting for updates.
    "vm-003" reached target state: Running
    Running: 'vm-003'

    It will take some time for the command to finish, usually a few minutes. Make sure you do not lose the output of this command. Inside the output there are two pieces of information you must note: the hostname and the handle. In this example the hostname is tp-x009.ci.uchicago.edu and the handle is vm-003.

     

  5. Next log on to the host using the hostname from step 4. Note that your ssh public key will have been copied to /root/.ssh/id_rsa.pub. To log on type:

    ssh root@[hostname]

    Example:

    ssh root@tp-x009.ci.uchicago.edu
    
  6. Next make the change(s) you wish to make to the image (this step is up to you).

     

  7. To save the changes you will need the handle from step 4, and you will need to pick a name for the new image. Run this command:

    ./bin/cloud-client.sh --save --handle [handle name] --newname [new image name]

    Where [handle name] is replaced with the name of the handle and [new image name] is replaced with the new image’s name. If you do not use the --newname option you will overwrite your image. Here is an example with the values from above.

    ./bin/cloud-client.sh --save --handle vm-003 --newname starworker-sl08f
    The output will look something like this:
    [stargrid01] ~/ec2/workspace-cloud-client-010/> ./bin/cloud-client.sh --save --handle vm-004 --newname starworker-sl08e
    (Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
    (New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-010/lib/globus')
    Saving workspace.
    - Workspace handle (EPR): '/star/u/lbhajdu/ec2/workspace-cloud-client-010/history/vm-004/vw-epr.xml'
    - New name: 'starworker-sl08e'
    Waiting for updates.
    "Workspace #919": TransportReady, calling destroy for you.
    "Workspace #919" was terminated.
  8. This is an optional step: because the images can be several GB big, you may want to delete the old image with this command:

    ./bin/cloud-client.sh --delete --name [old image name] 
    

    This is what it would look like:

    (Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
    (New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-010/lib/globus')
    Deleting: gsiftp://tp-vm1.ci.uchicago.edu:2811//cloud/56441986/starworker-sl08f
    Deleted.

     

  9. To start up a cluster with the new image you will need to modify one of the:

    /star/u/lbhajdu/ec2/workspace-cloud-client-010/samples/[cluster description].xml

    files. Inside the <workspace> block for the worker node, replace the <image> value with the name of your own image from step 7. You can also set the number of worker node images you wish to bring up with the number in the <quantity> tag.

     

    Note: Be careful; remember there are usually at least two <workspace> blocks in each xml file.

     

  10. Next just bring up the cluster like any other VM cluster. (See my Drupal documentation.)

 

Rudiments of grid map files on gatekeepers

This is intended as a practical introduction to mapfiles for admins of new sites to help get the *basics* working and avoid some common problems with grid user management and accounting.

It should be stressed that manually maintaining mapfiles is the most primitive user management technique.  It is not scalable and it has been nearly obsoleted by two factors:

1.  There are automated tools for maintaining mapfiles (GUMS with VOMS in the background, for instance, but that's not covered here).

2.  Furthermore, VOMS proxies are replacing grid certificate proxies, and the authentication mechanism no longer relies on static grid mapfiles, but instead can dynamically authenticate against GUMS or VOMS servers directly for each submitted job.

But let's ignore all that and proceed with good old-fashioned hand edits of two critical files on your gatekeeper:

/etc/grid-security/grid-mapfile
and
$VDT_LOCATION/monitoring/grid3-user-vo-map.txt

(the location of the grid-mapfile in /etc/grid-security is not universal, but that's the default location)

In the grid-mapfile, you'll want entries like the following, in which user DNs are mapped to specific user accounts.  You can see from this example that multiple DNs can map to one user account (rgrid000 in this case):

#---- members of vo: star ----#
"/DC=org/DC=doegrids/OU=People/CN=Valeri Fine 224072" fine
"/DC=org/DC=doegrids/OU=People/CN=Wayne Betts 602856" wbetts
#---- members of vo: mis ----#
"/DC=org/DC=doegrids/OU=People/CN=John Rosheck (GridScan) 474533" rgrid000
"/DC=org/DC=doegrids/OU=People/CN=John Rosheck (GridCat) 776427" rgrid000

(The lines starting with '#' are comments and are ignored.)

You see that if you want to support the STAR VO, then you will need to include the DN for every STAR user with a grid cert (though as of this writing, it is only a few dozen, and only a few of them are actively submitting any jobs.  Those two above are just a sampling.)  You can support multiple VOs if you wish, as we see with the MIS VO.  But MIS is a special VO -- it is a core grid infrastructure VO, and the DNs shown here are special testing accounts that you'll probably want to include so that you appear healthy in various monitoring tools.

In the grid3-user-vo-map.txt file, things are only slightly more complicated, and could look like this:

#User-VO map
# #comment line, format of each regular line line: account VO
# Next 2 lines with VO names, same order, all lowercase, with case (lines starting with #voi, #VOc)
#voi mis star
#VOc MIS STAR
#---- accounts for vo: star ----#
fine star
wbetts star
#---- accounts for vo: mis ----#
rgrid000 mis

(Here one must be careful -- the '#' symbol denotes comments, but the two lines starting with #voi and #VOc are actually read by VORS (this needs to be fleshed out), so keep them updated with your site's actual supported VOs.)

In this example, we see that users 'fine' and 'wbetts' are mapped to the star VO, while 'rgrid000' is mapped to the mis VO.
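
Since the two files must stay consistent, a quick sanity check is possible; a minimal sketch, assuming the default file locations named above:

# Flag accounts in grid3-user-vo-map.txt that no DN in the grid-mapfile maps to.
MAP=/etc/grid-security/grid-mapfile
VOMAP=$VDT_LOCATION/monitoring/grid3-user-vo-map.txt
for acct in `grep -v '^#' $VOMAP | awk '{print $1}'`; do
    grep -q "\" $acct\$" $MAP || echo "WARNING: $acct has no DN mapped in $MAP"
done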

Maintaining this user-to-VO map is not critical to running jobs at your site, but it does have important functions:

1. MonAlisa uses this file in its accounting and monitoring (such as VO jobs per site)

2. VORS advertises the supported VOs at each site based on this file, and users use VORS to locate sites that claim to support their VO...  thus if you claim to support a VO that you don't actually support, then sooner or later someone from that VO will submit jobs to your site, which will fail and then THEY WILL REPORT YOU TO THE GOC!  

(Don't worry, there's no great penalty, just the shame of having to respond to the GOC ticket.  Note that updates to this file can take several hours to be noticed by VORS.)

If you aren't familiar with VORS or MonAlisa, then hop to it.  You can find links to both of them here:

http://www.opensciencegrid.org/?pid=1000098


SRM instructions for bulk file transfer to PDSF

These links describe how to do bulk file transfers from RCF to PDSF.

How to run the transfers

The first step is to figure out what files you want to transfer and make some file lists for SRM transfers:

At PDSF make subdirectories ~/xfer ~/hrm_g1 ~/hrm_g1/lists

Copy from ~hjort/xfer the files diskOrHpss.pl, ConfigModule.pm and Catalog.xml into your xfer directory.
You will need to contact ELHjort@lbl.gov to get Catalog.xml because it has administrative privileges in it.

Substitute your username for each "hjort" in ConfigModule.pm.

Then in your xfer directory run the script (in redhat8):

pdsfgrid1 88% diskOrHpss.pl
Usage: diskOrHpss.pl [production] [trgsetupname] [magscale]
e.g., diskOrHpss.pl P04ij ppMinBias FullField
pdsfgrid1 89%

Note that trgsetupname and magscale are optional. This script may take a while depending on what you specify. If all goes well you'll get some files created in your hrm_g1/lists directory. A brief description of the files the script created:

*.cmd: Commands to transfer files from RCF disks
*.srmcmd: Commands to transfer files from RCF HPSS

in lists:

*.txt: File list for transfers from RCF disks
*.rndm: Same as *.txt but randomized in order
*.srm: File list for transfer from RCF HPSS

Next you need to get your cert installed in the grid-mapfiles at PDSF and at RCF. At PDSF you do it in NIM. Pull up your personal info and find the "Grid Certificates" tab. Look at mine to see the form of what you need to enter there. For RCF go here:

http://orion.star.bnl.gov/STAR/comp/Grid/Infrastructure/certs-vomrs/

Also, you'll need to copy a file of mine into your directory:

cp ~hjort/hrm_g1/pdsfgrid1.rc ~/hrm_g1/pdsfgrid1.rc

That's the configuration file for the HRM running on pdsfgrid1. When you've got that done you can try to move some files by executing one of the srm-copy.linux commands found in the .cmd or .srmcmd file.

Monitoring transfers

You can tell if transfers are working from the messages in your terminal window.

You can monitor the transfer rate on the pdsfgrid1 ganglia page on the “bytes_in” plot. However, it’s also good to verify that rrs is entering the files into the file catalog as they are sunk into HPSS. This can be done with get_file_list.pl:

pdsfgrid1 172% get_file_list.pl -as Admin -keys 'filename' -limit 0 -cond 'production=P06ic' | wc -l
11611
pdsfgrid1 173%

A more specific set of conditions will of course result in a faster query. Note that the “-as Admin” part is required if you run this in the hrm_g1 subdirectory due to the Catalog.xml file. If you don't use it you will query the PDSF mirror of the BNL file catalog instead of the PDSF file catalog.

Running the HRM servers at PDSF

I suggest creating your own subdirectory ~/hrm_g1 similar to ~hjort/hrm_g1. Then copy from my directory to yours the following files:



setup

hrm

pdsfgrid1.rc

hrm_rrs.rc

Catalog.xml (coordinate permissions w/me)



Substitute your username for “hjort” in these files and then start the HRM by doing “source hrm”. Note that you need to run in redhat8, and since your .chos file is ignored on grid nodes you need to chos to redhat8 manually. If successful you should see the following 5 tasks running:



pdsfgrid1 149% ps -u hjort

PID TTY TIME CMD

8395 pts/1 00:00:00 nameserv

8399 pts/1 00:00:00 trm.linux

8411 pts/1 00:00:00 drmServer.linux

8461 pts/1 00:00:00 rrs.linux

8591 pts/1 00:00:00 java

pdsfgrid1 150%



Note that the “hrm” script doesn’t always work, depending on the state things are in, but it should always work if the 5 tasks shown above are all killed first.
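
A minimal sketch of that cleanup (bash syntax, while the scripts here are csh; pkill matching by name is approximate, so double-check with ps):

# Kill the five HRM-related tasks listed above, then verify before restarting.
for t in nameserv trm.linux drmServer.linux rrs.linux; do
    pkill -u $USER "$t"
done
pkill -u $USER java    # careful: this matches any java process you own
ps -u $USER            # confirm nothing is left, then "source hrm" again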

Running the HRM servers at RCF

I suggest creating your own subdirectory ~/hrm_grid similar to ~hjort/hrm_grid. Then copy from my directory to yours the following files:



srm.sh

hrm

bnl.rc

drmServer.linux (create the link)

trm.linux (create the link)



Substitute your username for “hjort” in these files and then start the HRM by doing “source hrm”. If successful you should see the following 3 tasks running:



[stargrid03] ~/hrm_grid/> ps -u hjort

PID TTY TIME CMD

13608 pts/1 00:00:00 nameserv

13611 pts/1 00:00:00 trm.linux

13622 pts/1 00:00:01 drmServer.linux

[stargrid03] ~/hrm_grid/>

Scalability Issue Troubleshooting at EC2

 

Running jobs at EC2 shows some scalability issues when more than 20-50 jobs are submitted at once. The pathology can only be seen once the jobs have completed their run cycle, that is to say, after the jobs copy back the files they have produced and the local batch system reports the job as having finished. The symptoms are as follows:

 

  1. No stdout from the job as defined in the .condorg file by “output=” comes back, and no stderr as defined by “error=” comes back.

It should be noted that the std output/error can be recovered from the gate keeper at EC2 by scp'ing it back. The std output/error resides in:

/home/torqueuser/.globus/job/[gk name]/*/stdout

/home/torqueuser/.globus/job/[gk name]/*/stderr

The command would be:

scp -r root@[gk name]:/home/torqueuser/.globus/job /star/data08/users/lbhajdu/vmtest/io/

 

  2. Jobs are still reported as running under condor_q on the submitting end long after they have finished, and the batch system on the other end reports them as finished.

 

Below is a standard sample condor_g file from a job:

 

[stargrid01] /<1>data08/users/lbhajdu/vmtest/> cat [the .condorg submit file]
globusscheduler = ec2-75-101-199-159.compute-1.amazonaws.com/jobmanager-pbs
output =/star/data08/users/starreco/prodlog/P08ie/log/C3A7967022377B3E5F2DCCE2C60CB79D_998.log
error =/star/data08/users/starreco/prodlog/P08ie/log/C3A7967022377B3E5F2DCCE2C60CB79D_998.err
log =schedC3A7967022377B3E5F2DCCE2C60CB79D_998.condorg.log
transfer_executable= true
notification =never
universe =globus
stream_output =false
stream_error =false
queue

 

The job parameters:

 

Work flow:

  1. Copy in event generator configuration

  2. Run raw event generator

  3. Copy back raw event file (*.fzd)

  4. Run reconstruction on raw events

  5. Copy back reconstructed files(*.root)

  6. Clean Up

 

Work flow processes : globus-url-copy -> pythia -> globus-url-copy -> root4star -> globus-url-copy

Note: Some low runtime processes not shown

Run time:

23 hours @ 1000 events

1 hour @ 10-100 events

Output:

15M rcf1504_*_1000evts.fzd

18M rcf1504_*_1000evts.geant.root

400K rcf1504_*_1000evts.hist.root

1.3M rcf1504_*_1000evts.minimc.root

3.7M rcf1504_*_1000evts.MuDst.root

60K rcf1504_*_1000evts.tags.root

14MB stdout log, later changed to 5KB by piping output to a file and copying it back via globus-url-copy.

Paths:

Jobs submitted from:

/star/data08/users/lbhajdu/vmtest/

Output copied back to:

/star/data08/users/lbhajdu/vmtest/data

STD redirect copied back to:

/star/data08/users/starreco/prodlog/P08ie/log

 

The tests:

  1. We first tested 100 nodes, with 14MB of text going to stdout. Failed with the symptoms above.

  2. Next test was with 10 nodes, with 14MB of text going to stdout. This worked without any problems.

  3. Next test was 20 nodes, with 14MB of text going to stdout. This worked without any problems.

  4. Next test was 40 nodes, with 14MB of text going to stdout. Failed with the symptoms above.

  5. Next we redirected (“>”) the output of the event generator and the reconstruction to a file and copied this file back directly with globus-url-copy after the job was finished. We tested again with 40 nodes; the stdout is now only 15K. This time it worked without any problems. (Was this just coincidence? A sketch of this redirect trick follows this list.)

     

  6. Next we tried 75 nodes with the redirected output trick. This failed with the symptoms above.

  7. Next we tried with 50 nodes. This failed with symptoms above.

  8. We have consulted Alain Roy, who has advised an upgrade of globus and condor-g; he says the upgrade of condor-g is most likely to help. Tim has upgraded the image with the latest version of globus, and I will be submitting from stargrid05, which has a newer condor-g version. The software versions are listed here:

    • Stargrid01

      • Condor/Condor-G 6.8.8

      • Globus Toolkit, pre web-services, client 4.0.5

      • Globus Toolkit, web-services, client 4.0.5

       

    • Stargrid05

      • $CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846

      • Globus Toolkit, pre web-services, client 4.0.7

      • Globus Toolkit, pre web-services, server 4.0.7

         

  9. We have tested on a five node cluster (1 head node, 4 workers) and discovered a problem with stargrid05: jobs do not get transferred over to the submitting side. The RCF has been contacted; we know this is on our side. It was decided we should not submit until we can try from stargrid05.
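
The redirect trick from tests 5 and 6 amounts to the following pattern inside the job script (a hedged sketch; the command and destination are illustrative, with the log area taken from the paths above):

# Keep the job's stdout tiny: write the verbose output to a file and copy
# the file back explicitly once the job is done. Names here are illustrative.
root4star -b -q bfc.C > job.log 2>&1
globus-url-copy file://$PWD/job.log \
    gsiftp://stargrid01.rcf.bnl.gov/star/data08/users/starreco/prodlog/P08ie/log/job.log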

 

Specification for a Grid efficiency framework

The following is an independently developed grid efficiency framework that will be consolidated with Lidia’s framework.  

The point of this work is to be able to add wrappers around the job that report back key parameters, such as the time it started, the time it stopped, the type of node it ran on, whether it was successful, and so on. These commands execute and return strings in the job's output stream. The strings can be parsed by an executable (I call it the job scanner) that extracts the parameters and writes them into a database. Later, other programs use this data to produce web pages and plots of any parameter we have recorded.

The image attached shows the relation between elements in my database and commands in my CSH. The commands in my CSH script will be integrated into SUMS soon. This will make it possible for any framework to parse out these parameters.   
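
As an illustration of the idea (a hedged sketch, not the actual CSH or SUMS implementation; the JOBINFO marker name is made up), a wrapper could emit greppable markers like these:

# Print key=value markers into the job's output stream for the "job scanner"
# to parse into the database.
echo "JOBINFO: startTime=`date +%s`"
echo "JOBINFO: node=`hostname`"
"$@"                   # run the wrapped job command
rc=$?
echo "JOBINFO: exitCode=$rc"
echo "JOBINFO: stopTime=`date +%s`"
exit $rc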

 

Starting up a Globus Virtual Workspace with STAR’s image.

The steps:

1) login to stargrid01

2) Check that your ssh public key is at $HOME/.ssh/id_rsa.pub. This will be the key the client package copies to the gatekeeper and client nodes under the root account, allowing password-free local login as root, which you will need in order to install grid host certs.

a. Note the file name and location must be exactly as defined above, or you must modify the path and name in the client configuration at ./workspace-cloud-client-009/conf/cloud.properties (more on this later).

b. If you're using a PuTTY-generated ssh public key, it will not work directly. You can simply edit it with a text editor to get it into this format. Below is an example of the right format A and the wrong format B. If it has multiple lines then it is in the wrong format.

Right format A:

ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAIEAySIkeTLsijvh1U01ass8XvfkBGocUePTkuG2F8TwRilq1gIcuTP5jBFSCF0eYXOpfNcgkujIsRj/+xS1QqM7c5Fs0hrRyLzyxgZrCKeXojVUFYfg9QuokqoY2ymgjxAdwNABKXI2IKMvM0UGBtmxphCuxUSUpMzNfmWk9H4HIrE=

Wrong format B:

---- BEGIN SSH2 PUBLIC KEY ----
Comment: "imported-openssh-key"
AAAAB3NzaC1yc2EAAAABJQAAAIEAySIkeTLsijvh1U01ass8XvfkBGocUePTkuG2
F8TwRilq1gIcuTP5jBFSCF0eYXOpfNcgkujIsRj/+xS1QqM7c5Fs0hrRyLzyxgZr
CKeXojVUFYfg9QuokqoY2ymgjxAdwNABKXI2IKMvM0UGBtmxphCuxUSUpMzNfmWk
9H4HIrE=
---- END SSH2 PUBLIC KEY ----

 

3) Get the grid client by copying the folder /star/u/lbhajdu/ec2/workspace-cloud-client-009 to your area. It is recommended you execute your commands from inside workspace-cloud-client-009. The manual describes all commands and paths relative to this directory; I will do the same.

a. This grid client is almost the same as the one you download from globus, except it has ./samples/star1.xml, which is configured to load STAR's custom image.

4) cd to the workspace-cloud-client-009 directory and type:

./bin/grid-proxy-init.sh  -hours 100

The output should look like this:

[stargrid01] ~/ec2/workspace-cloud-client-009/> ./bin/grid-proxy-init.sh
(Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
(New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-009/lib/globus')
Your identity: DC=org,DC=doegrids,OU=People,CN=Levente B. Hajdu 105387
Enter GRID pass phrase for this identity:
Creating proxy, please wait...
Proxy verify OK
Your proxy is valid until Fri Aug 01 06:19:48 EDT 2008



 

5) To start the cluster type:

./bin/cloud-client.sh --run --hours 1 --cluster samples/star1.xml

Two very important things you will want to note from this output are the cluster handle (usually something like “cluster-025”) and the gatekeeper name. It will take about 10 minutes to launch this cluster. The cluster will have one gatekeeper and one worker node. The maximum lifetime of the cluster is set in the command line arguments; more parameters are in the xml file (you will want to check with Tim before changing these).
If the command hangs up really quickly (within about a minute) and says something like “terminating cluster”, it usually means that you do not have a sufficient number of slots to run.

It should look something like this:

[stargrid01] ~/ec2/workspace-cloud-client-009/> ./bin/cloud-client.sh --run --hours 1 --cluster samples/star1.xml
(Overriding old GLOBUS_LOCATION '/opt/OSG-0.8.0-client/globus')
(New GLOBUS_LOCATION: '/star/u/lbhajdu/ec2/workspace-cloud-client-009/lib/globus')
SSH public keyfile contained tilde:
- '~/.ssh/id_rsa.pub' --> '/star/u/lbhajdu/.ssh/id_rsa.pub'
SSH known_hosts contained tilde:
- '~/.ssh/known_hosts' --> '/star/u/lbhajdu/.ssh/known_hosts'
Requesting cluster.
- head-node: image 'osgheadnode-012', 1 instance
- compute-nodes: image 'osgworker-012', 1 instance
Workspace Factory Service:
https://tp-grid3.ci.uchicago.edu:8445/wsrf/services/WorkspaceFactoryService
Creating workspace "head-node"... done.
- 2 NICs: 128.135.125.29 ['tp-x009.ci.uchicago.edu'], 172.20.6.70 ['priv070']
Creating workspace "compute-nodes"... done.
- 172.20.6.25 [ priv025 ]
Launching cluster-025... done.
Waiting for launch updates.
- cluster-025: all members are Running
- wrote reports to '/star/u/lbhajdu/ec2/workspace-cloud-client-009/history/cluster-025/reports-vm'
Waiting for context broker updates.
- cluster-025: contextualized
- wrote ctx summary to '/star/u/lbhajdu/ec2/workspace-cloud-client-009/history/cluster-025/reports-ctx/CTX-OK.txt'
- wrote reports to '/star/u/lbhajdu/ec2/workspace-cloud-client-009/history/cluster-025/reports-ctx'
SSH trusts new key for tp-x009.ci.uchicago.edu [[ head-node ]]

 

6) But hold on, you can’t submit yet: even though the grid map file has our DNs in it, the gatekeeper is not trusted. We will need to install an OSG host cert on the other side. Not just anybody can do this; Doug and Leve can do this at least (and I am assuming Wayne can too). Open up another terminal and log on to the newly instantiated gatekeeper as root. Example here:

[lbhajdu@rssh03 ~]$ ssh root@tp-x009.ci.uchicago.edu
The authenticity of host 'tp-x009.ci.uchicago.edu (128.135.125.29)' can't be established.
RSA key fingerprint is e3:a4:74:87:9e:69:c4:44:93:0c:f1:c8:54:e3:e3:3f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'tp-x009.ci.uchicago.edu,128.135.125.29' (RSA) to the list of known hosts.
Last login: Fri Mar 7 13:08:57 2008 from 99.154.10.107

 

7) Create a .globus directory:

[root@tp-x009 ~]# mkdir .globus

8) Go back to the stargrid node and copy over your grid cert and key:

[stargrid01] ~/.globus/> scp usercert.pem root@tp-x009.ci.uchicago.edu:/root/.globus
usercert.pem 100% 1724 1.7KB/s 00:00

[stargrid01] ~/.globus/> scp userkey.pem root@tp-x009.ci.uchicago.edu:/root/.globus
userkey.pem 100% 1923 1.9KB/s 00:00


9) Move over to /etc/grid-security/ on the gatekeeper:

cd /etc/grid-security/

10) Create a host cert here:

[root@tp-x009 grid-security]# cert-gridadmin -host 'tp-x002.ci.uchicago.edu' -email lbhajdu@bnl.gov -affiliation osg -vo star -prefix tp-x009
checking script version, V2-4, This is ok. except for gridadmin SSL_Server bug. Latest version is V2-6.
Generating a 2048 bit RSA private key
.......................................................................................................+++
.................+++
writing new private key to './tp-x009key.pem'
-----
osg
OSG
OSG:STAR
The next prompt should be for the passphrase for your
personal certificate which has been authorized to access the
gridadmin interface for this CA.
Enter PEM pass phrase:
Your new certificate and key files are ./tp-x009cert.pem ./tp-x009key.pem
move and rename them as you wish but be sure to protect the
key since it is not encrypted and password protected.

 

11) Change the permissions on the credentials:

[root@tp-x009 grid-security]# chmod 644 tp-x009cert.pem
[root@tp-x009 grid-security]# chmod 600 tp-x009key.pem

12) Delete the old host credentials:

[root@tp-x009 grid-security]# rm hostcert.pem
[root@tp-x009 grid-security]# rm hostkey.pem


12) Rename the credentials:

[root@tp-x009 grid-security]# mv tp-x009cert.pem hostcert.pem
[root@tp-x009 grid-security]# mv tp-x009key.pem hostkey.pem

 

13) Check grid functionality back on stargrid01:

[stargrid01] ~/admin_cert/> globus-job-run tp-x009.ci.uchicago.edu /bin/date
Thu Jul 31 18:23:55 CDT 2008

14) Do your grid work

15) When it's time for the cluster to go down (if there is unused time remaining), run the command below. Note that you will need the cluster handle from the command used to bring up the cluster.

./bin/cloud-client.sh --terminate --handle cluster-025

 

If there are problems:

First, try this web page:
http://workspace.globus.org/clouds/cloudquickstart.html
If there are still problems, try this mailing list:
workspace-user@globus.org
If there are still problems, contact Tim Freeman (tfreeman at mcs.anl.gov).

 

Troubleshooting gsiftp at STAR-BNL

An overview of the history of STAR troubleshooting, with the official involvement of the OSG Troubleshooting Team in late 2006 and early 2007, can be found here:
https://twiki.grid.iu.edu/twiki/bin/view/Troubleshooting/TroubleshootingStar

As of mid March, the biggest open issue is intermittently failing file transfers when multiple simultaneous connections are attempted from a single client node to the STAR-BNL gatekeeper. 

This article will initially summarize the tests and analysis conducted during the period from ~March 23 to March 29, though not in chronological order.  Updates on testing at later dates will be added sequentially at the bottom.

A typical test scenario goes like this: log on to a pdsf worker node, such as pc2607.nersc.gov, and execute a test script such as mytest.perl, which calls myguc.perl (actually myguc.pl, but I had to change the file extensions in order for drupal to handle these properly; by the way, both of these were originally written by Eric Hjort). Sample versions of these scripts are attached.  These start multiple transfers simultaneously (or very nearly so).
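For the curious, the gist of those scripts can be captured in a few lines of bash. This is only a sketch of the pattern, not the attached Perl; the gsiftp URL and file paths below are placeholders, not the actual test endpoints:

#!/bin/bash
# Launch N transfers nearly simultaneously, then count how many failed.
N=17
SRC="gsiftp://stargrid02.rcf.bnl.gov:2811/tmp/testfile"   # placeholder URL
for i in $(seq 1 $N); do
  globus-url-copy "$SRC" "file:///tmp/testcopy.$i" > "transfer.$i.log" 2>&1 &
done
wait
echo "failed: $(grep -l error transfer.*.log | wc -l) of $N"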

In a series of such tests, we've gathered a variety of information to try to understand what's going on.

In one test, in which 2 of 17 transfers failed, we had a tcpdump on the client node (attached as PDSF-tcpdump.pcap), the gridftp-auth.log and gridftp.log files from the gatekeeper (both attached) and I acquired the BNL firewall logs covering the test period (relevant portions (filtered by me) are attached as BNL_firewall.txt).

In a separate test, a tcpdump was acquired on the server side (attached as BNL-tcpdump.pcap).  In this test, 7 of 17 transfers failed.

Both tcpdumps are in a format that Wireshark (aka Ethereal) can handle, and Statistics -> Conversations List -> TCP is a good thing to look at early on (it also makes very useful filtering quite easy if you know how to use it).

Missing from all this is any info from the RCF firewall log, which I requested, but I got no response.  (The RCF firewall is inside the BNL firewall.)  From putting the pieces together as best I can without this, I doubt this firewall is interfering, but it would be good to verify if possible.

What follows is my interpretation of what goes on in a successful transfer:
    A.  The client establishes a connection to port 2811 on the server (the "control" connection).
    B.  Using this connection, the user is authenticated and the details of the requested transfer are exchanged.
    C.  A second port (within the GLOBUS_TCP_PORT_RANGE, if defined) is opened on the server.
    D.  The client (which is itself using a new port as well) connects to the new port on the server (the "transfer" connection) and the file is transfered.
    E.  The transfer connection is closed at the end of the transfer.
    F.  The control connection is closed.
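To watch these steps on the wire yourself, a capture along these lines picks up both the control and transfer connections for later inspection in Wireshark (a sketch: the interface name is an assumption, and the port range matches the GLOBUS_TCP_PORT_RANGE of 20000,30000 used on the BNL nodes):

# capture the control channel (2811) plus the likely transfer ports
tcpdump -i eth0 -s 0 -w gsiftp-test.pcap 'port 2811 or portrange 20000-30000'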

The failing connections seem to be breaking at step B, which I'll explain momentarily.  But first, I'd like to point out that if this is correct, then GLOBUS_TCP_PORT_RANGE and GLOBUS_TCP_SOURCE_RANGE and state files are all irrelevant, because the problem is occurring before those are ever consulted to open the transfer connection.  I point this out because the leading suspect up to this point has been a bug in the BNL Cisco firewalls that was thought to improperly handle new connections if source and destination ports were reused too quickly.

So, what is the evidence that the connection is failing at such an early point?  I'm afraid the only way to really follow along at this point is to dig into the tcpdumps yourself; otherwise you'll just have to take my interpretation on most of this.

1.  The gridftp-auth.log clearly shows successful authentications for the successful transfers and no mention of authentication for the failed transfers.
2.  From the tcpdumps, three conversation "types" are evident -- successful control connections, corresponding transfer connections, and failed control connections.  There are no remaining packets that would constitute a failed transfer connection.
3.  The failed control connections are obviously failed, in that they quickly degenerate into duplicate ACKs from the server and retransmissions from the client, which both sides are seeing.  I interpret this to mean that any intermediate firewalls aren't interfering at the end of the connection either, but I suppose it doesn't mean they haven't plucked a single packet out of the stream somewhere in the middle.

Here's what I've noticed from the packets in the tcpdumps when looking at the failed connection.  From the client side, it looks like the fifth packet in the conversation never reaches the server, since the server keeps repeating its previous ACK (SEQ=81).  From the server side, things are more peculiar.  There is a second [SYN,ACK] sent from the server AFTER the TCP connection is "open" (the server has already received an [ACK] from the client side in response to the first [SYN,ACK]).  This is strange enough, but looking back at the client tcpdump, it doesn't show any second [SYN,ACK] arriving at the client!

Why is this second [SYN,ACK] packet coming from the server, and why is it not received on the client side (apparently)?

At this point, I'm stumped.  I haven't painstakingly pieced together all the SEQ and ACK numbers from the packets to see if that reveals anything, but it probably best to leave that until we have simultaneous client and server dumps, at which point the correspondences should be easier to ferret out.  [note:  simultaneous dumps from a  test run were added on April 2 (tcpdump_BNL_1of12.pcap and tcpdump_PDSF_1of12.pcap).  See the updates section below.]

One more thing just for the record:  the client does produce an error message for the failed transfers, but it doesn't shed any more light on the matter:
error: globus_ftp_client: the server responded with an error
421 Idle Timeout: closing control connection.


Additional tests were also done, such as:

Iptables was disabled on the client side -- failures still occurred.
Iptables was disabled on the server side -- failures still occurred.

Similar tests have been launched by Eric and me from PDSF clients connecting to the STAR-WSU and NERSC-PDSF gatekeepers instead of STAR-BNL.  There were no unexplained failures at sites other than STAR-BNL, which seems to squarely put the blame somewhere at BNL.

Updates on March 30:

The network interfaces of client and server show no additional errors or dropped packets occurring during failed transfers (checked from the output of ifconfig on both sides).

Increased the control_preauth_timeout on the server to 120, then 300 seconds (default is 30).  Failures occurred with all settings.

Ran a test with GLOBUS_TCP_XIO_DEBUG set on the client side.  The resulting output of a failed transfer (with standard error and standard out intermixed) is attached as "g-u-c.TCP_XIO_DEBUG". 

Bumped up the server logging level to ALL (from the default "ERROR,WARN,INFO").  A test with 2 failures out of 12 is recorded in gridftp.log.ALL and gridftp-auth.log.ALL.  (The gridftp.log.ALL only shows activity for the 10 successful transfers, so it probably isn't useful.)  It appears that [17169] and [17190] in the gridftp-auth.log.ALL file represent failed transfers, but no clues as to the problem -- it just drops out at the point where the user is authenticated, so there's nothing new to be learned here as far as I can tell.  However, I do wonder about the fact that these two failing connections do seem to be the LAST two connections to be initiated on the server side, though they were the first and ninth connections in the order started from the client.  Looking at a small set of past results, the failed connections are very often the last ones to reach the server, regardless of the order started on the client, but this isn't 100%, and perhaps should be verified with a larger sample set.

Updates on April 2:

I've added simultaneous tcpdumps from the server and client side ("tcpdump_BNL_1of12.pcap" and "tcpdump_PDSF_1of12.pcap").  These are from a test with one failed connection out of 12.  Nothing new jumps out at me from these, with the same peculiar packets as described above.

I ran more than 10 tests using stargrid01 (inside the BNL and RCF firewalls) as the client host requesting 40 transfers each time, and saw no failures.  This seems strong evidence that the problem is somewhere in the network equipment, but where?  I have initiated a request for assistance from BNL ITD in obtaining the RCF firewall logs as well as any general assistance they can offer.

Updates on April 16:

In the past couple of weeks, a couple of things were tried, based on a brief conversation with a BNL ITD network admin:

1.  On the server (stargrid02.rcf.bnl.gov), I tried disabling TCP Windows Scaling ("sysctl -w net.ipv4.tcp_window_scaling=0") -- no improvement
2.  On the server (stargrid02.rcf.bnl.gov), I tried disabling MTU discovery ("sysctl -w net.ipv4.ip_no_pmtu_disc=1") -- no improvement

In response to a previous client log file with the TCP_XIO debug enabled, Charles Bacon contributed this via email: 

>Thanks for the -dbg+TCP logs!  I posted them in a new ticket at http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=5190
>The response from the GridFTP team, posted there, is:
>"""
>What this report shows me is that the client (globus-url-copy) successfully forms a TCP control channel connection with the server.  It then successfully reads the 220 Banner message from the server.  The client then attempts to authenicate with the server.  It sends the AUTH GSSAPI command and posts a read for a response.  It is this second read that times out.

>From what i see here the both sides believe the TCP connection is formed successfully, enough so that at least 1 message is sent from the server to the client (220 banner) and possibly 1 from the client to the server (AUTH GSSAPI, since we dont have server logs we cannot confirm the server actually received it).

>I think the next step should be looking at the gssapi authentication logs on the gridftp server side to see what commands were actually received and what replies sent.  I think debugging at the TCP layer may be premature and may be introducing some red herrings.

>To get the desired logs sent the env
>export GLOBUS_XIO_GSSAPI_FTP_DEBUG=255,filename
>"""
>So, is it possible to get this set in the env of the server you're using, trigger the problem, then send the resulting gridftp.log?


I have done that, and a sample log file (including log_level ALL) is attached as "gridftp-auth.xio_gssapi_ftp_debug.log".  This log file covers a sample test of 11 transfers in which 1 failed.

Updates on April 20:

Here is the Globus Bugzilla ticket on this matter:  http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=5190
They have suggested better debug logging parameters to log each transfer process in separate files and requested new logs with the xio_gssapi_ftp_debug enabled, which I will do, but currently have urgent non-grid matters to work on.

Also, we have been given the go ahead to do some testing with ATLAS gatekeepers in different parts of the BNL network structure, which may help isolate the problem, so this is also on the pending to-do list which should get started no later than Monday, April 23.

AJ Temprosa, from BNL's ITD networking group, has been assigned my open ticket.  On April 17 we ran a test with some logging at the network level, which he was going to look over, but I have not heard anything back from him, despite an email inquiry on April 19.

Updates on April 23:

Labelled gridftp logs with xio_gssapi_ftp_debug and the tcp_xio_debug enabled are attached as gridftp-auth.xio_gssapi_ftp_debug.labelled.log and gridftp-auth.xio_tcp_debug.labelled.log.  (By "labelled", I mean that each entry is tagged with the PID of the gsiftp server instance, so the intermixed messages can be easily sorted out.)  Ten of twenty-one transfers failed in this test.

I now have authorization and access to run tests with several additional gsiftp servers in different locations on the BNL network.  In a simple model of the situation, there are two firewalls involved -- the BNL firewall, and within that, the RACF firewall.  Two of the new test servers are in a location similar to stargrid02, which is inside both firewalls.  One new server is between the two firewalls, and one is outside both firewalls.  In ad hoc preliminary tests, the same sort of failures occurred with all the servers EXCEPT the one located outside both firewalls.  I've fed this back to the BNL ITD network engineer assigned to our open ticket and am still waiting for any response from him.

Updates on May 10:

[Long ago,] Eric Hjort did some testing with 1 second delays between successive connections and found no failures.  In recent limited testing with shorter delays, it appears that there is a threshold at about 0.1 sec.  With delays longer than 0.1 sec, I've not seen any failures of this sort.
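A sketch of that kind of throttled test (the URL and file names are illustrative; the 0.1 s figure is the threshold observed above):

#!/bin/bash
# Insert a configurable delay between successive transfer start-ups;
# failures seem to disappear once DELAY exceeds ~0.1 s.
DELAY=0.1
SRC="gsiftp://stargrid02.rcf.bnl.gov:2811/tmp/testfile"   # placeholder URL
for i in $(seq 1 20); do
  globus-url-copy "$SRC" "file:///tmp/copy.$i" > "t.$i.log" 2>&1 &
  sleep "$DELAY"
done
wait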

I installed the OSG-0.6.0 client package on presley.star.bnl.gov, which is between the RACF and BNL firewalls.  It also experiences failures when connecting to stargrid02 (inside the RACF firewall).

We've made additional tests with different server and client systems and collected additional firewall logs and tcpdumps.  For instance, using the g-u-c client on stargrid01.rcf.bnl.gov (inside both the RACF and BNL perimeter firewalls) and a gsiftp server on netmon.usatlas.bnl.gov (outside both firewalls) we see failures that appear to be the same.  I have attached firewall logs from both the RACF firewall ("RACF_fw_logs.txt") and the BNL firewall ("BNL_perimeter_fw_logs.txt") for a test with 4 failures out of 50 transfers (using a small 2.5 KB file).  Neither log shows anything out of the ordinary, with each expected connection showing up as permitted.  Tcpdumps from the client and server are also attached ("stargrid01-client.pcap" and "netmon-server.pcap" respectively).  They show a similar behaviour as in the previous dumps from NERSC and stargrid02, in which the failed connections appear to break immediately, with the client's first ACK packet somehow not quite being "understood" by the server.

RACF and ITD networking personnel have looked into this a bit.  To make a long story short, their best guess is "kernel bug, probably a race condition".  This is a highly speculative guess, with no hard evidence.  The fact that the problem has only been noticed when crossing firewalls at BNL casts doubt on this.  For instance, using a client on a NERSC host connecting to netmon, I've seen no failures, and I need to make this clear to them.  Based on tests with other clients (eg. presley.star.bnl.gov) and servers (eg. rftpexp01.rhic.bnl.gov), there is additional evidence that the problem only occurs when crossing firewalls at BNL, but I would like to quantify this, rather than relying on ad hoc testing by hand, with the hope of removing any significant possibility of statistical flukes in the test results so far.

Updates on May 25:

In testing this week, I have focused on eliminating a couple of suspects.  First, I replaced gsiftpd with a telnetd on stargrid03.rcf.bnl.gov.  The telnetd was set up to run under xinetd using port 2811, thus very similar to a stock gsiftp service (and conveniently punched through the various firewalls).  Testing this with connections from PDSF quickly turned up the same sort of broken connections as with gsiftp.  This seems to exonerate the globus/VDT/OSG software stack, though it doesn't rule out the possibility of a bug in a shared library that is used by the gsiftp server.
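The test service looked roughly like the stanza below (reconstructed from memory, so the service name and telnetd path are assumptions); it goes in /etc/xinetd.d/ followed by a xinetd reload:

service gsiftp-test
{
    type        = UNLISTED
    port        = 2811
    socket_type = stream
    protocol    = tcp
    wait        = no
    user        = root
    server      = /usr/sbin/in.telnetd
    disable     = no
}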

Next, I tried to eliminate xinetd.  To do this, I tried some http connections to a known web server without any problems.  I then set up an sshd on port 2811 on stargrid03.  In manual testing with an scp command, I found no failures.  I've scripted a test on pdsfgrid1 to run every 5 minutes that makes 30 scp connections to stargrid03 at BNL.  The results so far are tantalizing: there don't seem to be any failures of the sort we're looking for...  If this holds up, then xinetd becomes suspect #1.  There are also some interesting bug fixes in xinetd's history that seem suspiciously close to the symptoms we're experiencing, but I can't find much detail to corroborate with our situation.  See for instance:

https://rhn.redhat.com/errata/RHBA-2005-208.html , https://rhn.redhat.com/errata/RHBA-2005-546.html and http://www.xinetd.org/#changes

Here is a sample problem description:
"Under some circumstances, xinetd confused the file descriptor used for
logging with the descriptors to be passed to the new server. This resulted
in the server process being unable communicate with the client. This
updated package contains a backported fix from upstream. "

(NB - These Red Hat errata have been applied to stargrid02 and stargrid03, but there are prior examples of errata updates that failed to fix the problem they claimed :-( )

Updates on May 30:

SOLUTION(?)

By building xinetd from the latest source (v 2.3.14, released Oct. 24, 2005) and replacing the executable from the stock Red Hat RPM on stargrid02 (with prior testing on stargrid03), the connection problems disappeared.  (minor note:  I built it with the libwrap and loadavg options compiled in, as I think Red Hat does.)
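The build itself was the usual autoconf sequence; roughly (the configure flags are my best recollection of how the libwrap and loadavg support gets compiled in, so treat them as assumptions):

tar xzf xinetd-2.3.14.tar.gz && cd xinetd-2.3.14
./configure --with-libwrap --with-loadavg
make
# back up the vendor binary before overwriting it, then restart the service
cp /usr/sbin/xinetd /usr/sbin/xinetd.rpm-orig
cp xinetd/xinetd /usr/sbin/xinetd   # built binary location may vary by version
service xinetd restart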

For the record, here is some version information for the servers used in various testing to date:

stargrid02 and stargrid03 are identical as far as relevant software versions:
Linux stargrid02.rcf.bnl.gov 2.4.21-47.ELsmp #1 SMP Wed Jul 5 20:38:41 EDT 2006 i686 i686 i386 GNU/Linux
Red Hat Enterprise Linux WS release 3 (Taroon Update 8)
xinetd-2.3.12-6.3E.2 (the most recent update from Red Hat for this package for RHEL 3.  Confusingly enough, the CHANGELOG for this package indicates it is version 2:2.3.***13***-6.3E.2 (not 2.3.***12***))
Replacing this with xinetd-2.3.14 built from source has apparently fixed the problem on this node.

rftpexp01.rhic.bnl.gov (between the RACF and BNL firewalls):
Linux rftpexp01.rhic.bnl.gov 2.4.21-47.0.1.ELsmp #1 SMP Fri Oct 13 17:56:20 EDT 2006 i686 i686 i386 GNU/Linux
Red Hat Enterprise Linux WS release 3 (Taroon Update 8)
xinetd-2.3.12-6.3E.2 

netmon.usatlas.bnl.gov (outside the firewalls at BNL):
Linux netmon.usatlas.bnl.gov 2.6.9-42.0.8.ELsmp #1 SMP Tue Jan 23 13:01:26 EST 2007 i686 i686 i386 GNU/Linux
Red Hat Enterprise Linux WS release 4 (Nahant Update 4)
xinetd-2.3.13-4.4E.1 (the most recent update from Red Hat for this package in RHEL 4.)

Miscellaneous wrap-up:

As a confirmation of the xinetd conclusion, we ran some additional tests with a server at Wayne State with xinetd-2.3.12-6.3E (latest errata for RHEL 3.0.4). When crossing BNL firewalls (from stargrid01.rcf.bnl.gov for instance), we did indeed find connection failures. Wayne State then upgraded to RedHat's xinetd-2.3.12-6.3E.2 (latest errata for any version of RHEL 3) and the problems persisted. Upon building xinetd 2.3.14 from source, the connection failures disappeared. With two successful "fixes" in this manner, I filed a RedHat Bugzilla ticket:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243315

Open questions and remaining tests:

A time-consuming test of the firewalls would be a "tabletop" setup with two isolated nodes and a firewall to place between them at will.  In the absence of more detailed network analysis, that would seem to be the most definitive possible demonstration that the firewall is playing a role, though a negative result wouldn't be absolutely conclusive, since it may be configuration dependent as well.

Whatever YOU, the kind reader, suggest! :-)

Using the GridCat Python client at BNL

If you want to run the GridCat Python client, there is a problem on some nodes at BNL related to BNL's proxy settings. Here are some notes that may help.

First, you'll need to get the gcatc.py Python script itself and put it somewhere that you can access. Here is the URL I used to get it, though apparently others exist:

http://gdsuf.phys.ufl.edu:8080/releases/gridcat/gridcat-client/bin/gcatc.py

(I used wget on the node on which I planned to run it, you may get it any way that works.)

Now, the trick at BNL is to get the proxy set correctly. Even though nodes like stargrid02.rcf.bnl.gov have a default "http_proxy" environment variable, it seems that Python's httplib doesn't parse it correctly and thus it fails. But it is easy enough to override as needed.

For example, here is one way in a bash shell:

[root@stargrid02 root]# http_proxy=192.168.1.4:3128 python gcatc.py --directories STAR-WSU
griddir /data/r23b/grid
appdir /data/r20g/apps
datadir /data/r20g/data
tmpdir /data/r20g/tmp
wntmpdir /tmp

Similarly in a tcsh shell:

[stargrid02] ~/> env http_proxy=192.168.1.4:3128 python /tmp/gcatc.py --gsiftpstatus STAR-BNL
gsiftp_in Pass
gsiftp_out Pass

Doug's email of November 3, 2005 contained a more detailed shell script (that requires gcatc.py) to query lots of information: http://lists.bnl.gov/mailman/private/stargrid-l/2005-November/002426.html.
You could add the proxy modification into that script, presumably as a local variable.
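For instance, a tiny wrapper (hypothetical; adjust the proxy and the script path to your site) keeps the override out of the script itself:

#!/bin/bash
# force the BNL Web proxy for this one invocation of the GridCat client
export http_proxy=192.168.1.4:3128
exec python /tmp/gcatc.py "$@"

Then ./gcatc.sh --gsiftpstatus STAR-BNL behaves like the examples above.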

Grid Infrastructure

This page will be used for general information about our grid Infrastructure, news, upgrade stories, patches to the software stack, network configuration and studies etc ... Some documents containing local information are however protected.


External links

CERTS & VOMS/VOMRS

CERTS

If you do NOT have a grid certificate yet or need to renew your certificate, you need to either request a certificate or request a renewal. Instructions are available as:

A few notes
  • Your Sponsor and point of contact should be "Jerome Lauret", the STAR/VO representative (and not your supervisor's name or anyone else)
  • Note that, as with a request for a CERT, being added to the STAR VO requires approval from the STAR/RA and the STAR/VO representative (the RAs are aware of this; the best chance for your request to be promptly approved is to have the proper "Sponsor")
  • It does not hurt to specify that you belong to STAR when the ticket is created
  • Please indicate on the request for a CERTificate what your expected use of Grid services is (data transfer? running jobs? anything else?)
  • Requesting a CERT and using it binds you to the OSG Policy Agreement you have to accept during the request. Failure to comply or violations will lead to a revocation of your CERT's validity (in STAR, you have to expect that your VO representative will make sure the policy IS respected in full)
     
  • The big advantage of renewing a CERT rather than requesting a new one is that the CN will be preserved (so no need for a gridmap change)
  • The STAR VO does NOT accept CERTificates other than STAR-related CERTificates, that is, OSG DigiCert-Grid CERTs obtained for STAR-related work and purposes. A user owning a CERT from a different VO will not be granted membership in VOMS; request a new CERT uniquely associated with STAR-related work.
  • STAR rule of thumb / convention - Additional user certificates mapped to generic accounts: the CN would indicate the CERT owner's name. The generic account would appear in parentheses. An example: /CN=Lidia Didenko (starreco)
  • STAR rule of thumb / convention - Service certificates: the CN field shows the requestor of the certificate

VOMS and VOMRS

Having a CERT is the first step. You now need to be part of a Virtual Organization (VO).

STAR used VOMRS during the PPDG era and switched to VOMS in the OSG era to maintain its VO users' certificates.
Only VOMS is currently maintained. A VO is used as a centralized repository of user-based information so all sites on the grid can be updated on the addition (or removal) of identities. The VOMS service and Web interface are maintained by the RACF.
 

Using your Grid CERT to sign or encrypt Emails

Apart from allowing you to access the Grid, an SSL Client Certificate is imported into the Web browser from which you requested your Grid certificate. This certificate can be used to digitally sign or encrypt Email. For the latter, you will need the certificate from the corresponding partner in order to encrypt Emails. To make use of it, follow the guidance below.

    • Find in your browser's certificate management interface an 'export' or 'backup' option. This interface varies from browser to browser and from Email client to Email client. We have checked only in Thunderbird as an Email client and inventoried possible locations for browser-based tools.
      • Internet Explorer: "Tools -> Internet Options -> Content"
      • Netscape Communicator has a "Security" button on the top bar menu
      • Mozilla: "Edit -> Preferences -> Privacy and Security -> Certificates"
      • Thunderbird: "Tools -> Options -> Privacy -> Security -> View Certificate"
    • The file usually ends up with the extension .p12 or .pfx.
      ATTENTION: Although the backup/export process will ask you for a "backup password" (and further encrypt your CERT), please guard this file carefully. Store it OFF your computer or remove the file once you are done with this process.
  • After exporting your certificate from your Web browser, you will need to re-import it into your Mail client. Let's assume it is Thunderbird for simplicity.
  • FIRST:
    Verify you have the DOEGrids Certificate Authority already imported in your Mail client and/or upload it.
    Note that the DOEGrids Certificate Authority is a subordinate CA of the ESnet CA; therefore the ESnet CA root certificate should also be present. To check this:
    • Go to "Tools -> Options -> Privacy -> Security -> View Certificate"
    • Click on the "Authorities" tab
      • You should see both "DOEGrids CA 1" and "ESnet Root CA 1" under an "ESnet" tree as illustrated in this first picture
        Thunderbird CERT Manager

      • Be certain the "DOEGrids CA 1" is allowed to identify mail users. To do this, select the cert and click Edit. A window as illustrated in the next picture should appear. Both "This certificate can identify Web sites" and "This certificate can identify mail users" should be checked.
        Thunderbird CERT Manager, Edit CA
    • If you DO NOT SEE those certificate authorities, you will need to import them.
      • Do so by downloading the doe-certs.zip attached at the bottom of this document and unzipping it. Two files should be there.
      • Load them using the "Tools -> Options -> Privacy -> Security -> View Certificate -> Authorities -> Import" button.
      • A similar window as displayed above will appear, and you will need to check at least the box "This certificate can identify mail users".
  • Now, import your certificate.
    • Use the "Tools -> Options -> Privacy -> Security -> View Certificate -> Your Certificates" menu and click "Import"
    • A file browser will appear; select the file you exported from your browser. It will ask you for a password: use the same password you used during the export phase from your Web browser.
    • Click OK
    • You are set to go ...
Note: if it is the very first time you use the Thunderbird security device manager, an additional password dialog will appear asking for a "New Password" for the security device. This is NOT your backup password. You will need to remember this password, as Thunderbird will ask you for it each time you start Thunderbird and use a password or CERT for the first time during a session.

Usage note:
  • If you want a remote partner to send you encrypted messages, you MUST first send a digitally signed Email so your certificate's public part can be imported into his/her Email client Certificate Manager under "Other People's". When done for the first time, Thunderbird will ask you to set a certificate as the default certificate; the interface and selection are straightforward, so we will not detail the steps ...
  • If you want to send an encrypted message to a remote partner, you MUST have his public part imported into your Email client; then select the "Encrypt This Message" option in the Security drop-down menu of Thunderbird.
  • Whenever a certificate expires, DO NOT remove it from your Certificate Manager. If you do, you will no longer be able to read / decrypt old encrypted Emails.
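As a sanity check on the exported file, something like this openssl invocation (the filename is hypothetical) will prompt for the backup password and summarize the bundle without writing the key anywhere:

# inspect the exported PKCS#12 bundle: shows the cert chain and key info
openssl pkcs12 -in exported-grid-cert.p12 -info -noout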



OSG Issues

This page will anchor various OSG-related collaborative efforts.

SGE Job Manager patch

We should assemble on this page a draft that we want to send to the VDT guys about the SGE Job Manager.
  • Missing environment variables definition
    • In the BEGIN section, check whether $SGE_ROOT, $SGE_CELL and the commands ($qsub, $qstat, etc.) are defined properly
    • in the SUBMIT, POLL and CLEAR sections, locate the line
      $ENV{"SGE_ROOT"} = $SGE_ROOT;
      
      and add the line
      $ENV{"SGE_CELL"} = $SGE_CELL;
      
  • Bug in finding the correct job id when clearing jobs
    • in the CLEAR section, locate the line
      system("$qdel $job_id > /dev/null 2 > /dev/null");
      and replace for the following block
      $ENV{"SGE_ROOT"} = $SGE_ROOT;
      $ENV{"SGE_CELL"} = $SGE_CELL;
      $job_id =~ /(.*)\|(.*)\|(.*)/;
      $job_id = $1;
      system("$qdel $job_id > /dev/null 2 > /dev/null");
  • The SGE Job Manager modifies the definitions of both the standard output and standard error file names by appending .real. This procedure fails when a user specifies /dev/null for either of those files. The problem happens twice - once starting at line 318:
        #####
        # Where to write output and error?
        #
        if(($description->jobtype() eq "single") && ($description->count() > 1))
        {
          #####
          # It's a single job and we use job arrays
          #
          $sge_job_script->print("#\$ -o "
                                 . $description->stdout() . ".\$TASK_ID\n");
          $sge_job_script->print("#\$ -e "
                                 . $description->stderr() . ".\$TASK_ID\n");
        }
        else
        {
            # [dwm] Don't use real output paths; copy the output there later.
            #       Globus doesn't seem to handle streaming of the output
            #       properly and can result in the output being lost.
            # FIXME: We would prefer continuous streaming.  Try to determine
            #       precisely what's failing so that we can fix the problem.
            #       See Globus bug #1288.
          $sge_job_script->print("#\$ -o " . $description->stdout() . ".real\n");
          $sge_job_script->print("#\$ -e " . $description->stderr() . ".real\n");
        }
     
    
    and then again at line 659:
          if(($description->jobtype() eq "single") && ($description->count() > 1))
          #####
          # Jobtype is single and count>1. Therefore, we used job arrays. We
          # need to merge individual output/error files into one.
          #
          {
            # [dwm] Use append, not overwrite to work around file streaming issues.
            system ("$cat $job_out.* >> $job_out");
            system ("$cat $job_err.* >> $job_err");
          }
          else
          {
            # [dwm] We still need to append the job output to the GASS cache file.
            #       We can't let SGE do this directly because it appears to
            #       *overwrite* the file, not append to it -- which the Globus
            #       file streaming components don't seem to handle properly.
            #       So append the output manually now.
            system("$cat $job_out.real >> $job_out");
          }
    
  • The snippet of code above is also missing a statement for the standard error. At the end, instead of:
            #       So append the output manually now.
            system("$cat $job_out.real >> $job_out");
          }
    
    it should read:
            #       So append the output manually now.
            system("$cat $job_out.real >> $job_out");
            system("$cat $job_err.real >> $job_err");
          }
    
  • Additionally, if deployed in a CHOS environment, the job manager should be modified with the following additions at line 567:
        $ENV{"SGE_ROOT"} = $SGE_ROOT;
        if ( -r "$ENV{HOME}/.chos" ){
          $chos=`cat $ENV{HOME}/.chos`;
          $chos=~s/\n.*//;
          $ENV{CHOS}=$chos;
        }
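Once the patches are in place, a quick end-to-end check (the gatekeeper name below is a placeholder) exercises submission, polling and cleanup through the SGE jobmanager:

# a trivial job through the patched jobmanager; should print the worker's hostname
globus-job-run your-gatekeeper.example.edu/jobmanager-sge /bin/hostname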
    

gridftp update for VDT 1.3.9 or VDT 1.3.10

To install the updated gridftp server that includes a fix for secondary group membership:

for VDT 1.3.9 (which is what I got with OSG 0.4.0) in the OSG/VDT directory, do:

pacman -get http://vdt.cs.wisc.edu/vdt_139_cache:Globus-Updates

This nominally makes your VDT installation 1.3.9c, though it didn't update my vdt-version.info file accordingly -- it still says 1.3.9b

for VDT 1.3.10, similar installation should work:

pacman -get http://vdt.cs.wisc.edu/vdt_1310_cache:Globus-Updates
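To see what the update actually left you with, the VDT version query can be run from the installation directory (the tool name is my recollection, so treat it as an assumption):

# from the OSG/VDT directory, after sourcing the setup script
source setup.sh
vdt-version    # lists the versions VDT believes each component is at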

STAR VO Privilege Configuration

This page gives the GUMS and vomss configuration information for OSG sites to allow access for the STAR VO.

VOMS entry for edg-mkgridmap.conf
group vomss://vo.racf.bnl.gov:8443/edg-voms-admin/star osg-star

Example GUMS config:

<!-- 9 STAR -->
<groupMapping name='star' accountingVo='star' accountingDesc='STAR'>
<userGroup className='gov.bnl.gums.VOMSGroup'
url='https://vo.racf.bnl.gov:8443/edg-voms-admin/star/services/VOMSAdmin'
persistenceFactory='mysql'
name='osg-star'
voGroup="/star"
sslCertfile='/etc/grid-security/hostcert.pem'
sslKey='/etc/grid-security/hostkey.pem' ignoreFQAN="true"/>
<accountMapping className='gov.bnl.gums.GroupAccountMapper'
groupName='osg-star' /> </groupMapping>

Note that in the examples above "osg-star" refers to the local UID/GID names and can be replaced with whatever meets your local site policies.
Also the paths shown for sslKey and sslCertfile should be replaced with the correct values on your system.
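For a site that builds a plain grid-mapfile from the VOMS entry above rather than running GUMS, generation would look roughly like this (flags quoted from memory; check your edg-mkgridmap documentation):

# pull the STAR VO membership into the local grid-mapfile
edg-mkgridmap --conf=/etc/edg-mkgridmap.conf \
              --output=/etc/grid-security/grid-mapfile --safe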

Site information

This page will provide information specific to the STAR Grid sites.

BNL

GK Infrastructure

Gatekeeper Infrastructure

This page was last updated on May 17, 2016.

The nodes for STAR's grid-related activities at BNL are as follows:

Color coding

  • Black: in production (please, do NOT modify without prior warning)
  • Green: machine was setup for testing particular component or setup
  • Red: status unknown
  • Blue: available for upgrade upon approval
stargrid01
  • Usage: submit grid jobs FROM BNL from this node
  • Hardware: Dell PowerEdge 2950, dual quad-core Xeon E5440 (2.83 GHz / 1.333 GHz FSB), 16 GB RAM
  • OS: RHEL Client 5.11, gcc 4.3.2
  • Disks: 6 x 1 TB SATA2. 1 GB /boot (/dev/md0) is RAID 1 across all six drives. There are 3 RAID 1 arrays using pairs of disks (e.g. /dev/sda2 and /dev/sdb2 are one array); the various local mount points and swap space are logical volumes scattered across these RAIDed pairs. There are 2.68 TB of unassigned space in the current configuration.
  • NIC: 2 x 1 Gb/s (one in use for RACF IPMI/remote administration on a private network)
  • OSG base: OSG 3.2.25 Client software stack for job submission
  • Condor: 8.2.8-1.4 (part of the OSG install; only for grid submission, not part of RACF Condor)

stargrid02
  • Usage: file transfer (gridftp) server; former STAR-BNL site gatekeeper
  • Attention: the mappings on stargrid02 *formerly* were all grid mappings (i.e. to VO group accounts: osgstar, engage, ligo, etc.). On May 17, 2016, this was changed to map STAR VO users to individual user accounts (matching the behaviour of stargrid03 and stargrid04). This behavior may be changed back (TBD).
  • Hardware: Dell PowerEdge 2950, dual quad-core Xeon E5440 (2.83 GHz / 1.333 GHz FSB), 16 GB RAM
  • OS: RHEL Client 5.11, gcc 4.3.2
  • Disks: 6 x 1 TB SATA2, configured the same as stargrid01 above
  • NIC: 2 x 1 Gb/s (one in use for RACF IPMI/remote administration on a private network)
  • OSG base: OSG CE 3.1.23
  • Condor: 7.6.10 (RCF RPM), NON-FUNCTIONAL (non-working configuration)

stargrid03
  • Usage: file transfer (gridftp) server. To transfer using STAR individual user mappings, please use this node or stargrid04.
  • Hardware: Dell PowerEdge 2950, dual quad-core Xeon E5440 (2.83 GHz / 1.333 GHz FSB), 16 GB RAM
  • OS: RHEL Client 5.11, gcc 4.3.2
  • Disks: 6 x 1 TB SATA2, configured the same as stargrid01 above
  • NIC: 2 x 1 Gb/s (one in use for RACF IPMI/remote administration on a private network)
  • OSG base: OSG CE 3.1.18
  • Condor: 7.6.10 (RCF RPM), NON-FUNCTIONAL (non-working configuration)

stargrid04
  • Usage: file transfer (gridftp) server. To transfer using STAR individual user mappings, please use this node or stargrid03.
  • Hardware: Dell PowerEdge 2950, dual quad-core Xeon E5440 (2.83 GHz / 1.333 GHz FSB), 16 GB RAM
  • OS: RHEL Client 5.11, gcc 4.3.2
  • Disks: 6 x 1 TB SATA2, configured the same as stargrid01 above
  • NIC: 2 x 1 Gb/s (one in use for RACF IPMI/remote administration on a private network)
  • OSG base: OSG CE 3.1.23
  • Condor: 7.6.10 (RCF RPM), NON-FUNCTIONAL (non-working configuration)

 

 

stargrid0[234] are using the VDT-supplied gums client (version 1.2.16).
stargrid02 has a local hack in condor.pm to adjust the condor parameters for STAR users with local accounts.


All nodes have GLOBUS_TCP_PORT_RANGE=20000,30000 and matching firewall conduits for Globus and other dynamic grid service ports.

 

 

LBL

MIT

CMS Analysis Facility

MIT’s CMS Analysis Facility is a large Tier-2 computing center built for CMS user analyses. We’re looking into the viability of using it for STAR computing.

Initial Setup

First things first. I went to http://www2.lns.mit.edu/compserv/cms-acctappl.html and applied for a local account. The welcome message contained a link to the CMSAF User Guide found on this TWiki page.

AFS isn’t available on CMSAF, so I started a local tree at /osg/app/star/afs_rhic and began to copy over stuff. Here’s a list of what I copied so far (nodes are running SL 4.4):

CERNLIB
/afs/rhic.bnl.gov/asis/sl4/slc4_ia32_gcc345/cern

OPTSTAR
/afs/rhic.bnl.gov/i386_sl4/opt/star/sl44_gcc346

GROUP_DIR
/afs/rhic.bnl.gov/star/group

ROOT 5.12.00
/afs/rhic.bnl.gov/star/ROOT/5.12.00/root
/afs/rhic.bnl.gov/star/ROOT/5.12.00/.sl44_gcc346

SL07e (sl44_gcc346 only)
/afs/rhic.bnl.gov/star/packages/SL07e

I copied these precompiled libraries over instead of building them myself because of a tricky problem with the interactive nodes’ configuration. The main gateway node is a 64-bit machine, so regular attempts at compilation produce 64-bit libraries that we can’t use. CMSAF has a node reserved for 32-bit builds, but it’s running SL 3.0.5. We’re still working on a proper resolution of that problem. Perhaps we can force cons to do 32-bit compilations.
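If forcing 32-bit compilation is the way out, the underlying mechanism would be gcc's -m32 flag; a minimal sanity check of the toolchain follows (whether cons propagates the flag cleanly is exactly what remains to be tested):

# verify the 64-bit gateway can emit 32-bit binaries
echo 'int main(){return 0;}' > t32.c
gcc -m32 -o t32 t32.c
file t32   # should report an i386 (32-bit) executable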

The environment scripts are working, although I had to add more hacks than I thought were necessary. I only changed the following files:

  1. ~/.login
  2. ~/.cshrc
  3. $GROUP_DIR/site_post_setup.csh

It doesn't seem possible to change the default login shell (chsh and ypchsh both fail), so when you log in you need to type "tcsh" to get a working STAR environment (after copying my .login and .cshrc to your home directory, of course).
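One possible workaround (a sketch I have not tried on CMSAF) is to let the default shell hand over to tcsh at login:

# ~/.bash_profile: exec tcsh for interactive logins, since chsh/ypchsh fail
if [ -t 0 ] && [ -x /bin/tcsh ]; then
    exec /bin/tcsh -l
fi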

Basic interactive tests look good, and I’ve got a SUMS configuration that will do local job submissions to the Condor system (that’s a topic for another post). DB calls use the MIT database mirror. I think that’s all for now.

STAR Scheduler Configuration

I deployed a private build of SUMS (roughly 1.8.10) on CMSAF and made the following changes to globalConfig.xml to get local job submission working:

In the Queue List

In the Policy List

Now for the Dispatcher

And finally, here's the site configuration block

Database Mirror

MIT has a local slave connected to the STAR master database server.  A dbServers.xml with the following content will allow you to connect to it:


<StDbServer>
<server> star1 </server>
<host> star1.lns.mit.edu </host>
<port> 3316 </port>
<socket> /tmp/mysql.3316.sock </socket>
</StDbServer>
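Before pointing jobs at the mirror, you can check that it is reachable on its non-standard port (assuming the mysql client tools are installed and the server answers pings from your host):

# should print "mysqld is alive" if the MIT slave is up
mysqladmin -h star1.lns.mit.edu -P 3316 ping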

For more information on selecting database mirrors, please visit this page.  You can also view a heartbeat of all the STAR database slaves here.  Finally, if you're interested in setting up your own database slave, Michael DePhillips has put some preliminary instructions on the Drupal page.  Contact Michael for more info.

Guidelines For MIT Tier2 Job Requests

In order to facilitate the submission of jobs, all requests for the Tier2 must contain the following information.  Note that, because we cannot maintain stardev on the Tier2, all jobs must be run from a tagged release.  It is the user's responsibility to ensure that the requested job runs from a tagged release, with any necessary updates from CVS made explicit.

 

1.  Tagged release of the STAR environment from which the job will be run, e.g. SL08a.

2.  Link to all custom macros and/or  kumacs.

3.  Link to pams/ and StRoot/ directories containing any custom code, including all necessary CVS updates of the tagged release.

4.  List of commands to be executed, i.e. the contents of the <job></job> in your submission XML.

 

One is also free to include a custom log4j.xml, but this is not necessary.

MIT Simulation Productions

 

Production Name STAR Library Species Subprocesses PYTHIA Library BFC Geometry Notes
mit0000 SL08a pp200 QCD 2->2 pythia6410PionFilter "trs fss y2006c Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006c CKIN(3) = 4, CKIN(4) = 5
mit0001 SL08a pp200 QCD 2->2 pythia6410PionFilter "trs fss y2006c Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006c CKIN(3) = 5, CKIN(4) = 7
mit0002 SL08a pp200 QCD 2->2 pythia6410PionFilter "trs fss y2006c Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006c CKIN(3) = 7, CKIN(4) = 9
mit0003 SL08a pp200 QCD 2->2 pythia6410PionFilter "trs fss y2006c Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006c CKIN(3) = 9, CKIN(4) = 11
mit0004 SL08a pp200 QCD 2->2 pythia6410PionFilter "trs fss y2006c Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006c CKIN(3) = 11, CKIN(4) = 15
mit0005 SL08a pp200 QCD 2->2 pythia6410PionFilter "trs fss y2006c Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006c CKIN(3) = 15, CKIN(4) = 25
mit0006 SL08a pp200 QCD 2->2 pythia6410PionFilter "trs fss y2006c Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006c CKIN(3) = 25, CKIN(4) = 35

 

Kumacs slightly modified to incorporate local pythia libraries from ppQCDprod.kumac and ppWprod.kumac provided by Jim Sowinski

Production Name STAR Library Species Subprocesses PYTHIA Library BFC Geometry Notes
mit0007 SL08a pp500 W pythia6_410 "trs -ssd upgr13  Idst IAna l0 tpcI fcf -ftpc Tree logger ITTF Sti StiRnd  -IstIT -SvtIt -NoSvtIt SvtCL,svtDb -SsdIt MakeEvent McEvent geant evout geantout IdTruth  bbcSim emcY2 EEfs bigbig -dstout fzin -MiniMcMk McEvOut clearmem -ctbMatchVtx VFPPV eemcDb beamLine" upgr13 CKIN(3)=10, Custom BFC, vertex(0.1,-0.2,-60), beamLine matched
mit0008 SL08a pp500 QCD 2->2 pythia6_410 "trs -ssd upgr13  Idst IAna l0 tpcI fcf -ftpc Tree logger ITTF Sti StiRnd  -IstIT -SvtIt -NoSvtIt SvtCL,svtDb -SsdIt MakeEvent McEvent geant evout geantout IdTruth  bbcSim emcY2 EEfs bigbig -dstout fzin -MiniMcMk McEvOut clearmem -ctbMatchVtx VFPPV eemcDb beamLine" upgr13 CKIN(3)=20, CKIN(4)=30, Custom BFC, vertex(0.1,-0.2,-60), beamLine matched

mit0009 SL08a pp500 W pythia6410FGTFilter  "trs -ssd upgr13  Idst IAna l0 tpcI fcf -ftpc Tree logger ITTF Sti StiRnd  -IstIT -SvtIt -NoSvtIt SvtCL,svtDb -SsdIt MakeEvent McEvent geant evout geantout IdTruth  bbcSim emcY2 EEfs bigbig -dstout fzin -MiniMcMk McEvOut clearmem -ctbMatchVtx VFPPV eemcDb beamLine" upgr13 CKIN(3)=10, Custom BFC, vertex(0.1,-0.2,-60), beamLine matched
mit0010 SL08a pp500 QCD 2->2 pythia6410FGTFilter  "trs -ssd upgr13  Idst IAna l0 tpcI fcf -ftpc Tree logger ITTF Sti StiRnd  -IstIT -SvtIt -NoSvtIt SvtCL,svtDb -SsdIt MakeEvent McEvent geant evout geantout IdTruth  bbcSim emcY2 EEfs bigbig -dstout fzin -MiniMcMk McEvOut clearmem -ctbMatchVtx VFPPV eemcDb beamLine" upgr13 CKIN(3)=20, CKIN(4)=30, Custom BFC, vertex(0.1,-0.2,-60), beamLine matched
mit0011 SL08a pp500 QCD 2->2 pythia6410FGTFilterV2 "trs -ssd upgr13  Idst IAna l0 tpcI fcf -ftpc Tree logger ITTF Sti StiRnd  -IstIT -SvtIt -NoSvtIt SvtCL,svtDb -SsdIt MakeEvent McEvent geant evout geantout IdTruth  bbcSim emcY2 EEfs bigbig -dstout fzin -MiniMcMk McEvOut clearmem -ctbMatchVtx VFPPV eemcDb beamLine" upgr13 CKIN(3)=5, CKIN(4)=10, Custom BFC, vertex(0.1,-0.2,-60), beamLine matched
mit0012 SL08a pp500 QCD 2->2 pythia6410FGTFilter  "trs -ssd upgr13  Idst IAna l0 tpcI fcf -ftpc Tree logger ITTF Sti StiRnd  -IstIT -SvtIt -NoSvtIt SvtCL,svtDb -SsdIt MakeEvent McEvent geant evout geantout IdTruth  bbcSim emcY2 EEfs bigbig -dstout fzin -MiniMcMk McEvOut clearmem -ctbMatchVtx VFPPV eemcDb beamLine" upgr13 CKIN(3)=10, Custom BFC, vertex(0.1,-0.2,-60), beamLine matched
mit0013 SL08a pp500 QCD 2->2 pythia6410FGTFilterV2 "trs -ssd upgr13  Idst IAna l0 tpcI fcf -ftpc Tree logger ITTF Sti StiRnd  -IstIT -SvtIt -NoSvtIt SvtCL,svtDb -SsdIt MakeEvent McEvent geant evout geantout IdTruth  bbcSim emcY2 EEfs bigbig -dstout fzin -MiniMcMk McEvOut clearmem -ctbMatchVtx VFPPV eemcDb beamLine" upgr13 CKIN(3)=15, CKIN(4)=20, Custom BFC, vertex(0.1,-0.2,-60), beamLine matched
mit0014 SL08a pp500 QCD 2->2 pythia6410FGTFilterV2 "trs -ssd upgr13  Idst IAna l0 tpcI fcf -ftpc Tree logger ITTF Sti StiRnd  -IstIT -SvtIt -NoSvtIt SvtCL,svtDb -SsdIt MakeEvent McEvent geant evout geantout IdTruth  bbcSim emcY2 EEfs bigbig -dstout fzin -MiniMcMk McEvOut clearmem -ctbMatchVtx VFPPV eemcDb beamLine" upgr13 CKIN(3)=20, CKIN(4)=30, Custom BFC, vertex(0.1,-0.2,-60), beamLine matched
mit0015 SL08a pp500 QCD 2->2 pythia6410FGTFilterV2 "trs -ssd upgr13  Idst IAna l0 tpcI fcf -ftpc Tree logger ITTF Sti StiRnd  -IstIT -SvtIt -NoSvtIt SvtCL,svtDb -SsdIt MakeEvent McEvent geant evout geantout IdTruth  bbcSim emcY2 EEfs bigbig -dstout fzin -MiniMcMk McEvOut clearmem -ctbMatchVtx VFPPV eemcDb beamLine" upgr13 CKIN(3)=30, CKIN(4)=50, Custom BFC, vertex(0.1,-0.2,-60), beamLine matched
mit0016 SL08a pp500 QCD 2->2 pythia6410FGTFilterV2 "trs -ssd upgr13  Idst IAna l0 tpcI fcf -ftpc Tree logger ITTF Sti StiRnd  -IstIT -SvtIt -NoSvtIt SvtCL,svtDb -SsdIt MakeEvent McEvent geant evout geantout IdTruth  bbcSim emcY2 EEfs bigbig -dstout fzin -MiniMcMk McEvOut clearmem -ctbMatchVtx VFPPV eemcDb beamLine" upgr13 CKIN(3)=50, Custom BFC, vertex(0.1,-0.2,-60), beamLine matched

 

 

The seed for each file is given by 10000 * (Production Number) + (File Number). *The version of SL08c used is not the final version at RCF due to an unexpected update.
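For example, file 12 of production mit0007 (production number 7) uses seed 10000 * 7 + 12 = 70012.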

Production Name STAR Library Species Subprocess PYTHIA Library BFC Geometry Notes
mit0019 SL08c pp200 Prompt Photon p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=2, CKIN(4)=3, StGammaFilterMaker
mit0020 SL08c pp200 Prompt Photon p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=3, CKIN(4)=4, StGammaFilterMaker
mit0021 SL08c pp200 Prompt Photon p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=4, CKIN(4)=6, StGammaFilterMaker
mit0022 SL08c pp200 Prompt Photon p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=6, CKIN(4)=9, StGammaFilterMaker
mit0023 SL08c pp200 Prompt Photon p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=9, CKIN(4)=15, StGammaFilterMaker
mit0024 SL08c pp200 Prompt Photon p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=15, CKIN(4)=25, StGammaFilterMaker
mit0025 SL08c pp200 Prompt Photon p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=25, CKIN(4)=35, StGammaFilterMaker
mit0026 SL08c pp200 QCD p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=2, CKIN(4)=3, StGammaFilterMaker
mit0027 SL08c pp200 QCD p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=3, CKIN(4)=4, StGammaFilterMaker
mit0028 SL08c pp200 QCD p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=4, CKIN(4)=6, StGammaFilterMaker
mit0029 SL08c pp200 QCD p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=6, CKIN(4)=9, StGammaFilterMaker
mit0030 SL08c pp200 QCD p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=9, CKIN(4)=15, StGammaFilterMaker
mit0031 SL08c pp200 QCD p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=15, CKIN(4)=25, StGammaFilterMaker
mit0032 SL08c pp200 QCD p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=25, CKIN(4)=35, StGammaFilterMaker
mit0033 SL08c pp200 QCD p6410BemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=35, CKIN(4)=65, StGammaFilterMaker

 

 

The seed for each file is given by 10000 * (Production Number) + (File Number). *The version of SL08c used is not the final version at RCF due to an unexpected update.

Production Name STAR Library Species Subprocess PYTHIA Library BFC Geometry Notes
mit0034 SL08c pp200 Prompt Photon p6410EemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=2, CKIN(4)=3, StGammaFilterMaker
mit0035 SL08c pp200 Prompt Photon p6410EemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=3, CKIN(4)=4, StGammaFilterMaker
mit0036 SL08c pp200 Prompt Photon p6410EemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=4, CKIN(4)=6, StGammaFilterMaker
mit0037 SL08c pp200 Prompt Photon p6410EemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=6, CKIN(4)=9, StGammaFilterMaker
mit0038 SL08c pp200 Prompt Photon p6410EemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=9, CKIN(4)=15, StGammaFilterMaker
mit0039 SL08c pp200 Prompt Photon p6410EemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=15, CKIN(4)=25, StGammaFilterMaker
mit0040 SL08c pp200 QCD p6410EemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=2, CKIN(4)=3, StGammaFilterMaker
mit0041 SL08c pp200 QCD p6410EemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=3, CKIN(4)=4, StGammaFilterMaker
mit0042 SL08c pp200 QCD p6410EemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=4, CKIN(4)=6, StGammaFilterMaker
mit0043 SL08c pp200 QCD p6410EemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=6, CKIN(4)=9, StGammaFilterMaker
mit0044 SL08c pp200 QCD p6410EemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=9, CKIN(4)=15, StGammaFilterMaker
mit0045 SL08c pp200 QCD p6410EemcGammaFilter "trs fss y2006g Idst IAna l0 tpcI fcf ftpc Tree logger ITTF Sti VFPPV bbcSim tofsim tags emcY2 EEfs evout -dstout IdTruth geantout big fzin MiniMcMk clearmem eemcDb beamLine sdt20050727" y2006g CKIN(3)=15, CKIN(4)=25, StGammaFilterMaker

 

STAR environment on OS X

This page is obsolete -- please see Mac port of STAR offline software for the current status

In order of decreasing importance:

  1. pams - still can't get very far here.  No idea how the whole gcc -> agetof -> g77 chain works to compile "Mortran".  I know VMC is the future and all that, but I think we really do need pams in order to have a useful STAR cluster.
  2. dynamic library paths - specifying a relative object pathname to g++ -o means that the created dylib always looks for itself in the current directory on OS X.  In other words, the repository is useless.  Need to figure out how to tell cons to pass the absolute path when linking (see the sketch after this list).  Executables work just fine; it's only the .dylibs that have this problem.
  3. starsim - crashing on startup (!!!!! ZFATAL called from MZIOCH).  Hopefully this is related to the pams problems, although I do remember having some trouble linking.
  4. root4star - StarRoot, StUtilities, StarClassLibrary, St_base do not load automatically as I thought they were supposed to.  How do we do this at BNL?
  5. QtRoot - has its own build system that didn't work out of the box for me.  Disabled StEventDisplayMaker and St_geom_Maker until I figure this out.
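A hypothetical illustration of the install-name problem from item 2 (a sketch with a made-up libFoo, not the actual cons fix; the library path shown assumes the usual .$STAR_HOST_SYS/lib layout):

  # A relative -o path bakes a relative install name into the dylib:
  g++ -dynamiclib -o libFoo.dylib Foo.o
  otool -D libFoo.dylib        # prints "libFoo.dylib", so dyld only finds it in cwd
  # Passing an absolute install name at link time fixes lookup from anywhere:
  g++ -dynamiclib -install_name $STAR/.$STAR_HOST_SYS/lib/libFoo.dylib -o libFoo.dylib Foo.o
  # An already-built library can be repaired in place:
  install_name_tool -id $STAR/.$STAR_HOST_SYS/lib/libFoo.dylib libFoo.dylib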

Contents of $OPTSTAR

I went through the list of required packages in /afs/rhic.bnl.gov/star/common/AAAREADME and figured out which ones were installed by default in an Intel OS X 10.4.8 client.  Here's what I found:

  • perl 5.8.6:  /usr/bin/perl (slightly newer than requested 5.8.4)
  • make 3.8.0:  /usr/bin/make -> gnumake
  • tar (??):  /usr/bin/tar
  • flex 2.5.4:  /usr/bin/flex
  • libXpm 4.11:  /usr/X11R6/lib/libXpm.dylib
  • libpng:  not found
  • mysql:  not found
  • gcc 4.0.1: /usr/bin/gcc -> gcc-4.0 (yeah, I know.  Apple does not support gcc 3.x in 10.4 for Intel!  We can do gcc_select to go back to 3.3 on ppc though.)
  • dejagnu:  not found
  • gdb 6.3.50:  /usr/bin/gdb (instead of 5.2)
  • texinfo:  not found
  • emacs 21.2.1:  /usr/bin/emacs (instead of 20.7)
  • findutils:  not found
  • fileutils:  not found
  • cvs 1.11:  /usr/bin/cvs
  • grep 2.5.1:  /bin/grep (instead of 2.5.1a)
  • m4 1.4.2:  /usr/bin/m4 (instead of 1.4.1)
  • autoconf 2.59:  /usr/bin/autoconf (2.53)
  • automake 1.6.3:  /usr/bin/automake
  • libtool (??):  /usr/bin/libtool (1.5.8)

I was able to find nearly all of the missing packages in the unstable branch of Fink (on the Intel machine).  I wouldn't worry about the "unstable" moniker; as long as you don't do a blind update-all it's certainly possible to stick to a solid config, and several packages on the list are only available in unstable (only because they haven't yet gotten the votes to move them over to stable).  I've gone ahead and installed some of the missing packages in a fresh Fink installation and will serve it up over NFS at /Volumes/star1.lns.mit.edu/STAR/opt/star/osx48_i386_gcc401 (with a power_macintosh_gcc401 to match, although a more consistent $STAR_HOST_SYS would probably have been osx48_ppc_gcc401).

Here's a summary table of the packages installed in $OPTSTAR for the two OS X architectures at MIT.  Note that many of these packages have additional dependencies, so the full list of installed packages on each system (attached at the bottom of the page) is actually much longer.

package version
Fortran compiler gfortran 4.2 (i386), g77 3.4.3 (ppc)
libpng 1.2.12
mysql 5.0.16-1002 (5.0.27 will break!)
dejagnu skipped
texinfo 4.8
findutils 4.2.20
fileutils 5.96
qt-x11 3.3.7
slang 1.4.9
doxygen 1.4.6
lynx 2.8.5
ImageMagick 6.2.8
nedit 5.5
astyle 1.15.3 (ppc only)
unixodbc 2.2.11
myodbc not available (2.50.39, if we want it)
libxml 2.6.26


I also looked for the required perlmods in Fink.  I stuck with the default Perl 5.8.6, so I did not install the modules that say e.g. pm588 required.  I found that some of the modules are already part of the core distribution.  If the older ones hosted by STAR are still needed, let me know.  "Virtual package" means that it came with the OS already:

perlmod version
Compress-Zlib virtual package
DateManip 5.42a
DBI 1.53
DBD-mysql 3.0008
Digest-MD5 core module
HTML-Parser virtual package
HTML-Tagset 3.10
libnet not available
libwww-perl 5.805
LWPng-alpha not available
MD5 not available
MIME-Base64 3.05
Proc-ProcessTable 0.39-cvs20040222-sf77
Statistics-Descriptive 2.6
Storable core module
Time-HiRes core module
URI virtual package
XML-NamespaceSupport 1.08
XML-SAX 0.14
XML-Simple 2.16


There were some additional perlmods that install_perlmods listed as "Linux only" but Fink offered to install:

perlmod version
GD 2.30
perlindex not available
Pod-Escapes 1.04
Pod-Simple 3.04
Tk 804.026
Tk-HistEntry not available
Tk-Pod not available


Questions:

  • what was with all those soft-links (/usr/bin/sed -> /bin/sed, etc.) that Jerome had me make?  Will they be needed on every machine running the STAR environment (that would be a problem), or just on the one he was compiling on?
  • is perl in /usr/bin sufficient or do we need to put it in $OPTSTAR as directed in AAAREADME?
  • what to do about mysql? Is 5.0 back-compatible, or do we only need development headers and shared libraries?

 

Building PYTHIA dylibs with gfortran

The default makePythia6.macosx won't work out of the box for 10.4, since it requires g77.  Here's what I did to get the libraries built for Pythia 5:
$ gfortran -c jetset74.f
$ gfortran -c pythia5707.f
$ echo 'void MAIN__() {}' > main.c
$ gcc -c main.c
$ gcc -dynamiclib -flat_namespace -single_module -undefined dynamic_lookup \
      -install_name $OPTSTAR/lib/libPythia.dylib -o libPythia.dylib *.o
$ sudo cp libPythia.dylib $OPTSTAR/lib/.

and for Pythia 6:

$ export MACOSX_DEPLOYMENT_TARGET=10.4
$ gfortran -c pythia6319.f
In file pythia6319.f:50551

   IF (AAMAX.EQ.0D0) PAUSE 'SINGULAR MATRIX IN PYLDCM'
                     1
Warning: Obsolete: PAUSE statement at (1)
$ gfortran -fno-second-underscore -c tpythia6_called_from_cc.F
$ echo 'void MAIN__() {}' > main.c
$ gcc -c main.c
$ gcc -c pythia6_common_address.c
$ gcc -dynamiclib -flat_namespace -single_module -undefined dynamic_lookup \
      -install_name $OPTSTAR/lib/libPythia6.dylib -o libPythia6.dylib \
      main.o tpythia6_called_from_cc.o pythia6*.o
$ ln -s libPythia6.dylib libPythia6.so
$ sudo cp libPythia6.* $OPTSTAR/lib/.
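A quick way to check that a freshly built dylib actually resolves at run time (a sketch; any ROOT session will do):

  echo 'gSystem->Load("libPythia6");' | root -l -b
  # Load() returns 0 on success; a dyld error here usually points to a wrong
  # install_name or an undefined symbol.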

CERNLIB notes

All the CERNLIB libraries are static and the binaries depend only on system libraries, so the whole installation should be portable.  For PowerPC I had a CERNLIB 2005 build left over from a different Fink installation, so I just copied those binaries and libraries to the new location and downloaded the headers from CERN.  Fink doesn't support CERNLIB on Intel Macs, so for this build I used Robert Hatcher's excellent shell script:

http://home.fnal.gov/~rhatcher/macosx/readme.html

Hatcher's binaries link against the gfortran dylib, so I made sure to build them with gfortran from $OPTSTAR.

CERNLIB 2005 doesn't include libshift.a, but STAR really wants to link against it.  Here's a hack from Robert Hatcher to build your own:

cat > fakeshift.c << EOF
int rshift_(int* in, int* ishft) { return *in >> *ishft; }
int ishft_(int* in, int* ishft)
{
  if (*ishft == 0) return *in;
  if (*ishft > 0)  return *in << *ishft;
  else             return *in >> -*ishft;
}
EOF
gcc -O -fPIC -c fakeshift.c
g77 -fPIC -c getarg_stub.f
ar cr libshift.a fakeshift.o getarg_stub.o

ROOT build notes

Following the instructions at http://www.star.bnl.gov/STAR/comp/root/building_root.html was basically fine.  Here was my configure command for rootdeb:
./configure macosx --build=debug --enable-qt --enable-table --enable-pythia6 --enable-pythia \
    --with-pythia-libdir=$OPTSTAR/lib --with-pythia6-libdir=$OPTSTAR/lib \
    --with-qt-incdir=$OPTSTAR/include/qt

which resulted in the final list:

Enabled support for asimage, astiff, builtin_afterimage, builtin_freetype, builtin_pcre, builtin_zlib, cern, cintex, exceptions, krb5, ldap, mathcore, mysql, odbc, opengl, pch, pythia, pythia6, python, qt, qtgsi, reflex, shared, ssl, table, thread, winrtdebug, xml, xrootd.

I did run into a few snags:

  • MakeRootDir.pl didn't find my /usr/X11R6/bin/lndir automatically (even though it was in my $PATH), so I had to edit the script and do it manually.
  • Had to run MakeRootDir.pl twice to get root and rootdeb directory structures in place, editing the script in between.
  • CVS was a mess.  I had to drill down into each subdirectory that needed updating, and even then it puked out conflicts instead of patching the files, so I had to trash the originals first.  Also, I'm fairly sure that root5/qt/inc/TQtWidget.h should have been included in the v5-12-00f tag, since my first attempt at compiling failed without the HEAD version of that file.

 

Hacking the environment scripts

  • set rhflavor = "osx48_" in STAR_SYS to get the name I chose for $STAR_HOST_SYS
  • I installed Qt in $OPTSTAR, so group_env.csh fails to find it

Building STAR software

I'm working with a checked-out copy of the STAR software, modifying code where the fix is obvious.  So far I've got the following cons working:

cons %QtRoot %StEventDisplayMaker %pams %St_dst_Maker %St_geom_Maker

St_dst_Maker tries to subtract an int and a struct!  Pams is a crazy mess of VAX-style Fortran STRUCTUREs, but we really need it in order to run starsim.  I haven't delved too deeply into the QtRoot-related stuff; I'm sure Valeri can help when the time comes.  Hopefully we can get these things fixed without too much delay.

Power PC notes

  • why does everything insist on linking with libshift?  It's not a part of CERNLIB 2005, so I used Hatcher's hack to get around it and stuck libshift.a in $OPTSTAR/lib
  • libnsl is not needed on OS X, so we don't link against it anymore
  • remove -dynamiclib and -single_module for executables
  • cfortran.h can't identify our Fortran compiler -- define it as f2c
  • asps/Simulation/starsim/deccc/fputools.c won't compile under power pc (contains assembly code!) -- skip it for now
  • g++ root4star brings out lots of linking issues; one killer seems to be that libpacklib from Fink is missing the fzicv symbol.
    • one very hack solution:  install gfortran, use it to build CERNLIB with Hatcher script, replace libpacklib.a, copy libgcc.a and libgfortran.a from gcc 4.2.0 into $OPTSTAR/lib or other, then link against them explicitly
    • needed to -lstarsim to get gufile, srndmc symbols defined
  • <malloc.h> -- on Mac they decided to put this in /usr/include/malloc, so we add this to path in ConsDefs.pm
  • cons wanted to link starsim using gcc and statically include libstdc++; on Mac we'll let g++ do the work.  Also, -lstarsim seems to be included too early in the chain.  Need to talk to Jerome about proper way to fix this, but for now I can hack a fix.
  • PAMS -- ACK!

Problems requiring changes to codes:

  • struct mallinfo isn't available on OS X
    • for now we surround any mallinfo with #ifndef __APPLE__; Frank Laue says there may be a workaround
  • 'fabs' was not declared in this scope
    • add <cmath> in header
  • TCL.h from ROOT conflicts with system tcl.h because of case-insensitive FS
    • TCL.h renamed to TCernLib.h in newer ROOT versions (ROOT bug 19313)
    • copied TCL.h to TCernLib.h myself and added #ifdef __APPLE__ #include "TCernLib.h"
    • this problem will go away when we patch/upgrade ROOT
  • passing U_Int to StMatrix::inverse() when it wants a size_t
    • changed input to size_t (only affected StFtpcTrackingParams)
  • abs(float) is not legal
    • change to fabs(float) and #include <cmath>

Intel notes

Basic problem here is the (im)maturity of gfortran.  The current Fink unstable version 4.2.0-20060617 still does not include some intrinsic symbols (lshift, lstat) that we expect to be there.  Newer versions do have these symbols, and as soon as Fink updates I'll give it another go.  I may try installing gcc 4.3 from source in the meantime, but it's not a high priority.  Note that Intel machines should be able to run the Power PC build in translated mode with some hacking of the paths (force $STAR_HOST_SYS = osx48_power_macintosh_gcc401).

Xgrid

Summary of Apple's Xgrid cluster software and the steps we've taken to get it up and running at MIT.

http://deltag5.lns.mit.edu/xgrid/

Xgrid jobmanager status report

  • xgrid.pm can submit and cancel jobs successfully; I haven't tested "poll" since the server is running WS-GRAM.
  • The Xgrid SEG module monitors jobs successfully.  The current version of Xgrid logs directly to /var/log/system.log (only readable by the admin group), so there's a permissions issue to resolve there.  My understanding is that the SEG module can run with elevated permissions if needed, but at the moment I'm using ACLs to explicitly allow user "globus" to read the system.log (see the sketch after this list).  Unfortunately the ACLs get reset when the logs are rotated nightly.
  • CVS is up-to-date, but I can't promise that all of the Globus packaging stuff actually works.  I ended up installing both the Perl module and the C library into my Globus installation by hand.
  • Current test environment uses SimpleCA, but I've applied for a server certificate at pki1.doegrids.org as part of the STAR VO.
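The ACL mentioned in the second item above, as set on 10.4 (a sketch; the cron entry is a hypothetical workaround for the nightly rotation, not something we have automated yet):

  chmod +a "globus allow read" /var/log/system.log
  # hypothetical root cron entry to restore the ACL after the nightly rotation:
  # 0 1 * * * /bin/chmod +a "globus allow read" /var/log/system.log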

Important Outstanding Issues

  • streaming stdout/stderr and stagingOut files is a little tricky.  Xgrid requires an explicit call to "xgrid -job results", otherwise it  just keeps all job info in the controller DB.  I haven't yet figured out where to inject this system call in the WS-GRAM job life cycle, so I'm asking for help on gram-dev@globus.org.
  • Need to decide how to do authentication.  Xgrid offers two options at the extreme ends of the spectrum: a common password for all users, or K5 tickets.  Submitting a job using WS-GRAM involves a round trip user account -> container account -> user account via sudo, and I don't know how to forward a TGT for the user account through all of that.  I looked around and saw a "pkinit" effort that promised passwordless generation of TGTs from grid certs, but it doesn't seem quite ready for primetime.

USP

This is a copy of the web page that contains a log of the Sao Paulo grid activities. For the full documentation, please go to http://stars.if.usp.br:8080/~suaide/grid/

Installation

In order to be fully integrated into the STAR GRID you need to have the following items installed and running (the order in which I present the items is the order I installed them in the cluster). There is other software to install before full integration, but this is the current status of the integration.

Installing the batch system (SGE)

We decided to install SGE because it is the same system used at PDSF (so it is scheduler-compatible) and it is free. The SGE web site is here; you can download the latest version from it.

Instructions to install SGE

  1. Download from the SGE web site
  2. gunzip and untar the file
  3. cd to the directory
In the installation directory there are two PDF files.  sge-install.pdf contains instructions on how to install the system; sge-admin.pdf contains instructions on how to maintain the system and create batch queues. Our procedure to install the system was:
  1. In the batch system server (in our case, STAR1)

    1. Create the SGE_ROOT directory. In our case, mkdir /home/sge-root. This directory HAS to be available on all the exec nodes
    2. copy the entire content of the installation directory to the SGE_ROOT directory
    3. add the lines below to your /etc/services file
      sge_qmaster     19000/tcp
      sge_qmaster     19000/udp
      sge_execd       19001/tcp
      sge_execd       19001/udp
    4. cd to the SGE_ROOT directory
    5. Type ./install_qmaster
    6. follow the instructions in the screen. In our case, the answers to the questions were:
      1. Do you want to install Grid Engine under an user id other than >root< (y/n) >> n
      2. $SGE_ROOT = /home/sge-root
      3. Enter cell name >> star
      4. Do you want to select another qmaster spool directory (y/n) [n] >> n
      5. verify and set the file permissions of your distribution (y/n) [y] >> y
      6. Are all hosts of your cluster in a single DNS domain (y/n) [y] >> y
      7. Please choose a spooling method (berkeleydb|classic) [berkeleydb] >> classic
      8. You can change at any time the group id range in your cluster configuration. Please enter a range >> 20000-21000
      9. The pathname of the spool directory of the execution hosts. Default: [/home/sge-root/star/spool] >> [ENTER]
      10. Please enter an email address in the form >user@foo.com<. Default: [none] >> [PUT YOUR EMAIL]
      11. Do you want to change the configuration parameters (y/n) [n] >> n
      12. We can install the startup script that will start qmaster/scheduler at machine boot (y/n) [y] >> y
      13. Adding Grid Engine hosts. Do you want to use a file which contains the list of hosts (y/n) [n] >> n
      14. Host(s): star1 star2 star3 star4 ...... (ADD ALL HOSTS THAT WILL BE CONTROLLED BY THE BATCH SYSTEM)
      15. Do you want to add your shadow host(s) now? (y/n) [y] >> n
      16. Scheduler Tuning. Default configuration is [1] >> 1
      17. Proceed with the default answers until the end of the script
    7. You have installed the master system. To make sure the system starts at boot time, type
      ln -s /etc/init.d/sgemaster /etc/rc3.d/S95sgemaster
      ln -s /etc/init.d/sgemaster /etc/rc5.d/S95sgemaster
  2. Install the execution nodes (including the server, if it will also be an exec node). This needs to be done on ALL exec nodes

    1. add the lines below to your /etc/services file
      sge_qmaster     19000/tcp
      sge_qmaster     19000/udp
      sge_execd       19001/tcp
      sge_execd       19001/udp
    2. cd to your SGE_ROOT directory
    3. type ./install_execd
      1. Answer the question about the SGE_ROOT directory location
      2. Please enter cell name which you used for the qmaster. >> star
      3. Do you want to configure a local spool directory for this host (y/n) [n] >> n
      4. We can install the startup script that will start execd at machine boot (y/n) [y] >> y
      5. Do you want to add a default queue instance for this host (y/n) [y] >> n (WE WILL CREATE A QUEUE LATER)
      6. follow the default instructions until the end
    4. You have now installed the execution node. To start the system at boot time, type
      ln -s /etc/init.d/sgeexecd /etc/rc3.d/S96sgeexecd
      ln -s /etc/init.d/sgeexecd /etc/rc5.d/S96sgeexecd
  3. Install a default queue to your batch system

    1. type qmon
      It opens a GUI window from which you can configure the whole batch system.
    2. Click on the button QUEUE CONTROL
    3. Another screen opens with the queues you have in your system
    4. Click on ADD
    5. Fill in the fields; see the sge-admin.pdf file for instructions. It is very simple. A quick smoke test for the new queue is sketched below.
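Once the queue exists, a minimal smoke test (a sketch, assuming SGE_ROOT=/home/sge-root and cell name "star" as chosen above):

  source /home/sge-root/star/common/settings.sh   # sets SGE_ROOT, PATH, etc.
  echo 'hostname; date' | qsub -N smoke_test      # submit a trivial job from stdin
  qstat -f                                        # the job should appear, run and drain
  qacct -j smoke_test                             # accounting record once it finishes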

Installing GANGLIA

Additional information from the STAR web site

You can download the ganglia packages from their web site. You need to install the following packages:
  • gmond - the monitoring daemon. Should be installed on ALL machines in the cluster
  • gmetad - the information-gathering daemon. Should be installed on the machine that will collect the data (in our case, STAR1)
  • the web front end. This is nice to have but not essential. It creates a web page, like this one, with all the information about your cluster. You should have a web server running on the collector machine (STAR1) for this to work
  • rrdtool - the package that creates the plots in the web page. Necessary only if you have the web front end.
To install Ganglia, proceed with the following
  1. In each machine in the cluster

    1. Install the gmond package (change the name to match the version you are installing)
      rpm -ivh ganglia-gmond-3.0.1-1.i386.rpm
    2. edit the /etc/gmond.conf file. The only change I made in this file was
      cluster {
        name = "STAR"
      }
    3. Type
      ln -s /etc/init.d/gmond /etc/rc5.d/S97gmond
      ln -s /etc/init.d/gmond /etc/rc3.d/S97gmond
      /etc/init.d/gmond stop
      /etc/init.d/gmond start
  2. In the collector machine (STAR1)

    1. Install the gmetad, web and rrdtool packages (change the name to match the version you are installing)
      rpm -ivh ganglia-gmetad-3.0.1-1.i386.rpm
      rpm -ivh ganglia-web-3.0.1-1.noarch.rpm
      rpm -ivh rrdtool-1.0.28-1.i386.rpm
    2. edit the /etc/gmetad.conf file. The only change I made in this file was
      data_source "STAR" 10 star1:8649 star2:8649 star3:8649 star4:8649 star5:8649
    3. Type
      ln -s /etc/init.d/gmetad /etc/rc5.d/S98gmetad
      ln -s /etc/init.d/gmetad /etc/rc3.d/S98gmetad
      /etc/init.d/gmetad stop
      /etc/init.d/gmetad start
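Once gmond and gmetad are running, a quick sanity check (a sketch; gmond publishes its cluster XML on TCP port 8649 by default):

  telnet star1 8649    # should dump the cluster XML and disconnect
  # if the dump lists all the hosts, the web front end on STAR1 should show
  # them too once gmetad has polled the data_source defined above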

MonaLISA

Additional information from the STAR web site

To install MonALISA on your system you need to download the files from their web site. After you gunzip and untar the file, perform the following steps:
  1. Create a monalisa user in your master computer and its home directory
  2. cd to the monalisa installation dir
  3. type ./install.sh
  4. Answer the following questions:
    1. Please specify an account for the MonALISA service [monalisa]: [ENTER]
    2. Where do you want MonaLisa installed ? [/home/monalisa/MonaLisa] : [ENTER]
    3. Path to the java home []: [enter the path of your Java distribution]
    4. Please specify the farm name [star1]: [star]
    5. Answer the remaining questions as you wish
  5. Make sure that Monalisa will run after reboot by typing:
    ln -s /etc/init.d/MLD /etc/rc5.d/S80MLD
    ln -s /etc/init.d/MLD /etc/rc3.d/S80MLD
  6. You need to edit the following files in the directory /home/monalisa/MonaLisa/Services
    1. ml.properties
      MonaLisa.ContactName=your name
      MonaLisa.ContactEmail=xxx@yyyy.yyy
      MonaLisa.LAT=-23.25
      MonaLisa.LONG=-47.19
      lia.Monitor.group=OSG, star (note that we are part of both the OSG and STAR groups)
      lia.Monitor.useIPaddress=xxx.xxx.xxx.xxx (your IP)
      lia.Monitor.MIN_BIND_PORT=9000
      lia.Monitor.MAX_BIND_PORT=9010
  7. Tell MonALISA that SGE is the batch system: edit the Service/CMD/site_env file and add
    SGE_LOCATION=/home/sge-root
    export SGE_LOCATION
    SGE_ROOT=/home/sge-root
    export SGE_ROOT
It is important to make sure these ports are not blocked by your firewall, in case your system is behind one.

To start the MonaLisa service just type
/etc/init.d/MLD start
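To confirm that the bind ports chosen above are actually reachable, something like this netcat loop can be used (a sketch; run it from a machine outside your firewall):

  for port in `seq 9000 9010`; do
    nc -z stars.if.usp.br $port && echo "port $port open"
  done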

Requesting a GRID certificate

By the way, you will have to request (for Grid usage) a user certificate. For instructions, click on the link http://www.star.bnl.gov/STAR/comp/Grid/Infrastructure/#CERT

A grid installation will also require a "host" certificate (Jerome told me he never actually asked for one...).
The certificate arrived three days after I requested it (with some help from Jerome). I then followed
the instructions that came with the email to validate and export the certificate.

Installing OSG

I think this is the last step to become fully GRID integrated. I have not used the certificate I got so far; let's see. To install the OSG package I followed the instructions on the following web page

http://osg.ivdgl.org/twiki/bin/view/Documentation/OsgCEInstallGuide


The basic steps were
  1. Make sure pacman is installed. For this I had to update Python to a version above 2.3. Pacman is a package management system; it can be downloaded from here
  2. create a directory at /home/grid. This is where I installed the grid stuff. This directory needs to be visible on all the cluster machines
  3. I typed
    export VDT_LOCATION=/home/grid
    cd $VDT_LOCATION
    pacman -get OSG:ce
    I  just followed the log and answered the questions.
The entire installation process took about 20 minutes or so, but I imagine it depends on the network connection speed.

After the installation was done I typed source setup.sh to complete it. No messages on the screen...

Because our batch system is SGE, we need to install extra packages, as stated in the OSG documentation page. I typed:
pacman -get http://www.cs.wisc.edu/vdt/vdt_136_cache:Globus-SGE-Setup
and these extra packages were installed in a few seconds.

I just followed the instructions in the OSG installation guide and everything went fine. One important thing is the firewall setup. If you have a firewall running with MASQUERADE, in which your private network is not accessible from the outside world, and your gatekeeper is not the firewall machine, remember to open the necessary ports (above 1024) and redirect ports 2119, 2811 and 2812 to your gatekeeper machine. The command depends on your firewall program. If using iptables, just add the following rules to your filter tables:
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 2119 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 2119 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 2135 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 2135 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 2136 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 2136 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 2811 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 2811 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 2812 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 2812 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 2912 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 2912 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 7512 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 7512 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 8443 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 8443 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 19000 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 19000 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 19001 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 19001 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 20000:65000 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 20000:65000 -j DNAT --to $STAR1

where $GLOBALIP is the external IP of your firewall and $STAR1 is the IP of the machine running the GRID stuff.

I also had to modify the files /home/grid/setup.csh and setup.sh to fix the HOSTNAME and port range. I added, in each file:
setup.csh
setenv GLOBUS_TCP_PORT_RANGE "60000 65000"
setenv GLOBUS_HOSTNAME "stars.if.usp.br"
setup.sh
export GLOBUS_TCP_PORT_RANGE="60000 65000"
export GLOBUS_HOSTNAME="stars.if.usp.br"
This ensures that the port range opened in the firewall corresponds to the one used by the GRID environment. Also, because I run the firewall in masquerade mode, I had to set the hostname explicitly; otherwise it would pick up the machine name, and I do not want that to happen.

GridCat and making things to work...

It is very interesting to add your grid node to GridCat. It is a map, just like MonaLisa, but it performs periodic tests of your gatekeeper, making it easier to find problems (and, if you got to this point, there should be a few of them).

To add your gatekeeper to GridCat,  go to http://osg.ivdgl.org/twiki/bin/view/Integration/GridCat

You will have to fill in a form, following the instructions at the following link:

http://osg.ivdgl.org/twiki/bin/view/Documentation/OsgCEInstallGuide#OSG_Registration

If everything goes right, when your application is approved you will show up in the GridCat map, located at http://osg-cat.grid.iu.edu:8080

Well, this is where the debugging starts. Every 2-3 hours GridCat tests the gatekeepers and assigns a status light to each one, based on the test results. The tests are basically:
  • Authentication test
  • Hello world test
  • Batch submission (depends on your batch system)
    • submit a job
    • query the status of the job
    • cancel the job
  • file transfer (gridFtp)
This is where I spent my last few days trying to resolve the issues. Thanks a lot to all the people on the STAR-GRID list who helped me with suggestions. But I still had to find out a lot of stuff... this is what Google is made for... The main issue is the fact that our cluster is behind a firewall configured with masquerading: the internal IPs of the machines (including the gatekeeper) are not visible, and all the machines have the same IP (the gateway IP) for the outside world... I think I am the only one on the GRID with this kind of setup :)

How to turn authentication and hello world green?

This is the easiest... You need to map the following certificates in your grid-mapfile (/etc/grid-security/grid-mapfile):
"/DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) 693100" XXXX
"/DC=org/DC=doegrids/OU=People/CN=Bockjoo Kim 740786" XXXX
The username 'XXXX' is the local username in your cluster... After these certificates were added to my mapfile, the first two tests turned green.
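An authentication-only test can also be run by hand from any machine with a valid proxy (a sketch; "GRAM Authentication test successful" is the answer you want):

  grid-proxy-init
  globusrun -a -r stars.if.usp.br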

How to turn the batch system test green

It seems that SGE is not the preferred batch system on the GRID... too bad, because it is really nice and SIMPLE. Because of this the OSG interface to SGE does not work right... I hope the bugs are fixed in the next release, but just to keep a log of what I did (with a lot of help), in case they forget to fix it :)
  • mis-ci-functions
    • This file, located at $VDT_LOCATION/MIS-CI/etc/misci/, is responsible for checking your system roughly every 10 minutes and extracting information about your cluster. It uses the batch system to grab the information and, of course, it does not work with SGE. Replace the file with version 0.2.7, located here. Please check whether your version is newer than this one before replacing it...
  • sge.pm
    • This file is located at $VDT_LOCATION/globus/lib/perl/Globus/GRAM/JobManager/
    • Please check the following
      • In the BEGIN section
        • if $SGE_ROOT, $SGE_CELL and the commands ($qsub, $qstat, etc) are defined properly
      • In the submit section
        • Locate the line
          • $ENV{"SGE_ROOT"} = $SGE_ROOT;
        • add the line
          • $ENV{"SGE_CELL"} = $SGE_CELL;
      • The same in the pool section
      • In the clear section
        • locate the line  system("$qdel $job_id >/dev/null 2>/dev/null");
        • replace for the following
          •     $ENV{"SGE_ROOT"} = $SGE_ROOT;
                $ENV{"SGE_CELL"} = $SGE_CELL;
                $job_id =~ /(.*)\|(.*)\|(.*)/;
                $job_id = $1;
                system("$qdel $job_id");
This will make your batch tests turn green, meaning people can submit jobs, query them, cancel them, etc. I hope I did not miss anything here...
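After patching, the jobmanager can also be exercised directly instead of waiting 2-3 hours for the next GridCat pass (a sketch, assuming the SGE jobmanager was installed under the name jobmanager-sge):

  grid-proxy-init
  globus-job-run stars.if.usp.br/jobmanager-sge /bin/hostname
  # should print the name of whichever worker node SGE dispatched the job to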

Making gridFTP work

This was the most difficult part, because of my firewall configuration; thanks, Google, for making research on the web easier...

Before anything else, please check that the services are listed in your /etc/services file:
  globus-gatekeeper   2119/tcp   # Added by the VDT
  gsiftp              2811/tcp   # Added by the VDT
  gsiftp              2811/udp   # Added by the VDT
  gsiftp2             2812/tcp   # Added by the VDT
  gsiftp2             2812/udp   # Added by the VDT
If not, add them...

I started testing file transfer between gatekeepers by logging into another gatekeeper, getting my proxy (grid-proxy-init) and doing a file transfer with the command:
globus-url-copy -dbg file:///star/u/suaide/gram_job_mgr_13594.log gsiftp://stars.if.usp.br/home/star/c
The -dbg flag turns debugging on... Everything goes fine until it starts transferring the data (STOR /home/star/c): it hangs and times out. Researching on the web, I found a bug report at

http://bugzilla.globus.org/globus/show_bug.cgi?id=1127

And a quote in the bottom of the page:

" ... The wuftp based gridftp server is not supported behind a firewall. The problem is in reporting the external IP address in the PASV response. You can see this by using the -dbg flag to globus-url-copy. You will see the the PASV response specifies your internal IP address.

The server should, however, work for clients using PORT. ..."

which means I am doomed... Researching the web some more, I found some solutions, and what I did was:
  • replace file /etc/xinetd.d/gsiftp for this one
    service gsiftp
    {
         socket_type = stream
         protocol = tcp
         wait = no
         user = root
         instances = UNLIMITED
         cps = 400 10
         server = /auto/home/grid/vdt/sbin/vdt-run-gsiftp2.sh
         disable = no
    }
  •  restarted xinetd
  • modified the file /home/grid/globus/etc/gridftp.conf to
    # Configuration file for the new (3.9.5) GridFTP server
    inetd 1
    log_level ERROR,WARN,INFO,ALL
    log_single /auto/home/grid/globus/var/log/gridftp.log
    hostname "XXX.XXX.XXX.XXX"
  •  XXX.XXX.XXX.XXX is the IP of the gateway for the outside world
And this worked!!!!
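For the record, the transfer that used to hang can be repeated exactly as before; with the hostname forced in gridftp.conf, the PASV response now advertises the external IP and the STOR step completes:

  grid-proxy-init
  globus-url-copy -dbg file:///star/u/suaide/gram_job_mgr_13594.log gsiftp://stars.if.usp.br/home/star/c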

Now all tests are green and I am happy and tired!!! There are still a few issues left, basically in the cluster information query (number of CPUs, batch queues, etc.) that are related to mis-ci-functions (I think); I will have a look later.

Another important thing: if you plan to have a cluster running jobs from outside and making file transfers with gsiftp, the directory /etc/grid-security must be available on all machines in the cluster, even if they are not gatekeepers. Also, the grid setup should be executed on all the nodes (/home/grid/setup.csh). If not, when a job starts running on one of the nodes and attempts to transfer a file with globus-url-copy, it will fail. The solution I used was to keep the grid-security directory in /home/grid and make symbolic links on all the nodes; a sketch follows.
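A sketch of that workaround (assuming /home/grid is the NFS-shared area visible on every node; the login-script hook is hypothetical and site-dependent):

  # on the gatekeeper: keep the master copy on the shared area
  cp -a /etc/grid-security /home/grid/grid-security
  # on every worker node: point /etc/grid-security at the shared copy
  ln -s /home/grid/grid-security /etc/grid-security
  # and make sure batch shells pick up the grid environment, e.g.:
  echo 'source /home/grid/setup.csh' >> /etc/csh.cshrc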

WSU

Grid Production

Summary of Reconstruction Production on GRID
 



(Columns: dataset name, description, events, submit date, finish date, number of jobs submitted, output size, efficiency, cluster or site, CPU in hours.)

rcf1304 pp200/pythia6_410/55_65gev/cdf_a/y2006c/gheisha_on 118K 2007-06-11 2007-06-12 60 35GB 98% pdsf.nersc.gov 14hours
rcf1302 pp200/pythia6_410/45_55gev/cdf_a/y2006c/gheisha_on 118K 2007-06-01 2007-06-02 60 29.4GB 100% pdsf.nersc.gov 14hours
rcf1303 pp200/pythia6_410/35_45gev/cdf_a/y2006c/gheisha_on 119K 2007-06-02 2007-06-02 120 36.2GB 97% pdsf.nersc.gov 11hours
rcf1306 pp200/pythia6_410/25_35gev/cdf_a/y2006c/gheisha_on 393K 2007-06-04 2007-06-06 200 119GB 98% pdsf.nersc.gov 41hours
rcf1307 pp200/pythia6_410/15_25gev/cdf_a/y2006c/gheisha_on 391K 2007-06-06 2007-06-07 200 114GB 98% pdsf.nersc.gov 34hours
rcf1308 pp200/pythia6_410/11_15gev/cdf_a/y2006c/gheisha_on 416K 2007-06-08 2007-06-10 210 115GB 98% pdsf.nersc.gov 39hours
rcf1309 pp200/pythia6_410/9_11gev/cdf_a/y2006c/gheisha_on 409K 2007-06-10 2007-06-12 210 109GB 98% pdsf.nersc.gov 47hours
rcf1310 pp200/pythia6_410/7_9gev/cdf_a/y2006c/gheisha_on 420K 2007-06-13 2007-06-14 210 107GB 100% pdsf.nersc.gov 31hours
rcf1311 pp200/pythia6_410/5_7gev/cdf_a/y2006c/gheisha_on 394K 2007-06-14 2007-06-16 199 96GB 98% pdsf.nersc.gov 48hours
rcf1317 pp200/pythia6_410/4_5gev/cdf_a/y2006c/gheisha_on 683K 2007-06-16 2007-06-19 343 158GB 99% pdsf.nersc.gov 69hours
rcf1318 pp200/pythia6_410/3_4gev/cdf_a/y2006c/gheisha_on 688K 2007-06-19 2007-06-22 345 152GB 100% pdsf.nersc.gov 78hours
rcf1319 pp200/pythia6_410/minbias/cdf_a/y2006c/gheisha_on 201K 2007-06-22 2007-06-23 120 21GB 99% pdsf.nersc.gov 13hours
rcf1321 pp62/pythia6_410/3_4gev/cdf_a/y2006c/gheisha_on 250K 2007-06-25 2007-06-26 125 41GB 100% pdsf.nersc.gov 20hours
rcf1320 pp62/pythia6_410/4_5gev/cdf_a/y2006c/gheisha_on 400K 2007-06-26 2007-06-27 200 67GB 100% pdsf.nersc.gov 28hours
rcf1322 pp62/pythia6_410/5_7gev/cdf_a/y2006c/gheisha_on 218K 2007-06-24 2007-06-25 110 38GB 100% pdsf.nersc.gov 17hours
rcf1323 pp62/pythia6_410/7_9gev/cdf_a/y2006c/gheisha_on 220K 2007-06-29 2007-06-30 110 39GB 100% pdsf.nersc.gov 18hours
rcf1324 pp62/pythia6_410/9_11gev/cdf_a/y2006c/gheisha_on 220K 2007-06-30 2007-06-30 110 41GB 100% pdsf.nersc.gov 14hours
rcf1325 pp62/pythia6_410/11_15gev/cdf_a/y2006c/gheisha_on 220K 2007-07-01 2007-07-02 110 41GB 100% pdsf.nersc.gov 19hours
rcf1326 pp62/pythia6_410/15_25gev/cdf_a/y2006c/gheisha_on 220K 2007-07-03 2007-07-04 110 40GB 100% pdsf.nersc.gov 21hours
rcf1327 pp62/pythia6_410/25_35gev/cdf_a/y2006c/gheisha_on 220K 2007-07-04 2007-07-05 110 38GB 100% pdsf.nersc.gov 18hours
rcf1312 pp200/pythia6_410/7_9gev/bin1/y2004y/gheisha_on 539K 2007-07-13 2007-07-18 272 143GB 99.6% pdsf.nersc.gov 53hours
rcf1313 pp200/pythia6_410/9_11gev/bin2/y2004y/gheisha_on 758K 2007-07-19 2007-07-22 380 203GB 100% pdsf.nersc.gov 72hours
rcf1314 pp200/pythia6_410/11_15gev/bin3/y2004y/gheisha_on 116K 2007-07-31 2007-08-01 58 32GB 100% pdsf.nersc.gov 182hours
rcf1315 pp200/pythia6_410/11_15gev/bin4/y2004y/gheisha_on 420K 2007-08-04 2007-08-05 210 119GB 100% pdsf.nersc.gov 527hours
rcf1316 pp200/pythia6_410/11_15gev/bin5/y2004y/gheisha_on 158K 2007-08-08 2007-08-09 79 45GB 100% pdsf.nersc.gov 183hours

rcf1317 pp200/pythia6_410/4_5gev/cdf_a/y2006c/gheisha_on 683K              
rcf1318 pp200/pythia6_410/3_4gev/cdf_a/y2006c/gheisha_on 688K 2007-06-04 2007-06-04 360 83.4GB 95.8% fnal.gov 619hours
rcf1319 pp200/pythia6_410/minbias/cdf_a/y2006c/gheisha_on 201K 2007-06-04 2007-06-04 120 11.7GB 100.0% fnal.gov 105hours
rcf1320 pp62/pythia6_410/4_5gev/cdf_a/y2006c/gheisha_on 400K 2007-06-06 2007-06-06 200 35.7GB 100.0% fnal.gov 241hours
rcf1321 pp62/pythia6_410/3_4gev/cdf_a/y2006c/gheisha_on 250K 2007-06-06 2007-06-06 125 21.6GB 100.0% fnal.gov 139hours
rcf1322 pp62/pythia6_410/5_7gev/cdf_a/y2006c/gheisha_on 218K 2007-06-07 2007-06-07 110 20.1GB 100.0% fnal.gov 114hours
rcf1323 pp62/pythia6_410/7_9gev/cdf_a/y2006c/gheisha_on 220K 2007-06-07 2007-06-07 110 20.6GB 100.0% fnal.gov 112hours
rcf1324 pp62/pythia6_410/9_11gev/cdf_a/y2006c/gheisha_on 220K 2007-06-07 2007-06-07 110 20.6GB 99.0% fnal.gov 124hours
rcf1325 pp62/pythia6_410/11_15gev/cdf_a/y2006c/gheisha_on 220K 2007-06-07 2007-06-07 110 20.7GB 100.0% fnal.gov 91hours
rcf1326 pp62/pythia6_410/15_25gev/cdf_a/y2006c/gheisha_on 220K 2007-06-08 2007-06-08 110 20.2GB 100.0% fnal.gov 132hours
rcf1327 pp62/pythia6_410/25_35gev/cdf_a/y2006c/gheisha_on 220K 2007-06-08 2007-06-09 110 18.3GB 100.0% fnal.gov 133hours
rcf1501 pp200/pythia6_410/minbias/cdf_a/y2006g/gheisha_on 1.99M 2008-09-30 2008-11-12 1991 412GB 99.6% nersc.gov 1,026hours
rcf1504 1DplusOnly/gkine/pt10/eta1_5/y2005g/gheisha_on 1.097M 2009-01-16 2009-02-10 1102 80.7GB 99.9% nersc.gov 1,216hours
rcf9003 pp200/pythia6_410/5_7gev/cdf_a/y2007g/gheisha_on 389K part grid and part local, because of urgency (high priority) ec2.internal  
rcf9004 pp200/pythia6_410/7_9gev/cdf_a/y2007g/gheisha_on 408K part grid and part local, because of urgency (high priority) ec2.internal  
rcf9005 pp200/pythia6_410/9_11gev/cdf_a/y2007g/gheisha_on 401K 2009-03-07 2009-03-17 782 333.7GB 99.10% ec2.internal 13,022hours
rcf9010 pp200/pythia6_410/45_55gev/cdf_a/y2007g/gheisha_on 118K part grid and part local, because of urgency (high priority) ec2.internal  
rcf9011 pp200/pythia6_410/55_65gev/cdf_a/y2007g/gheisha_on 119K 2009-03-07 2009-03-11 295 108.4GB 100% ec2.internal 8,060hours
rcf10020 pp200/pythia6_422/2_3gev/tune100/y2005h/gheisha_on 115K 2010-4-07 2010-4-07 115 406.5GB 99.1% pdsf.nersc.gov 946hours
pdsf10021 pp200/pythia6_422/3_4gev/tune100/y2005h/gheisha_on 114K 2010-4-07 2010-4-09 115 438.4GB 99.1% pdsf.nersc.gov 1,728hours
pdsf10022 pp200/pythia6_422/4_5gev/tune100/y2005h/gheisha_on 114K 2010-4-08 2010-4-09 115 458.6GB 99.1% pdsf.nersc.gov 1,926hours
pdsf10023 pp200/pythia6_422/5_7gev/tune100/y2005h/gheisha_on 116K 2010-4-09 2010-4-12 115 983.1GB 96.6% pdsf.nersc.gov 1,293hours
pdsf10024 pp200/pythia6_422/7_9gev/tune100/y2005h/gheisha_on 1.19M 2010-4-08 2010-4-17 1200 9615.4GB 95.8% pdsf.nersc.gov 18,261hours
pdsf10025 pp200/pythia6_422/9_11gev/tune100/y2005h/gheisha_on 115K 2010-4-10 2010-4-12 115 1018.4GB 98.2% pdsf.nersc.gov 951hours
pdsf10026 pp200/pythia6_422/11_15gev/tune100/y2005h/gheisha_on 115K 2010-4-12 2010-4-13 115 509.5GB 94.6% pdsf.nersc.gov 965hours
pdsf10027 pp200/pythia6_422/15_25gev/tune100/y2005h/gheisha_on 112K 2010-4-13 2010-4-14 115 466.0GB 83.5% pdsf.nersc.gov 822hours
pdsf10028 pp200/pythia6_422/25_35gev/tune100/y2005h/gheisha_on 114K 2010-4-13 2010-4-13 115 525.7GB 89.5% pdsf.nersc.gov 999hours
pdsf10029 pp200/pythia6_422/35_infgev/tune100/y2005h/gheisha_on 104K 2010-4-13 2010-4-14 115 521.9GB 90.4% pdsf.nersc.gov 442hours
pdsf10030 AuAu7.7/hijing_382/B0_20/minbias/y2010a/gheisha_on 1.02M part grid and part local, because of urgency (high priority) pdsf.nersc.gov 129,939hours
pdsf10031 AuAu11.5/hijing_382/B0_20/minbias/y2010a/gheisha_on 400K 2010-08-06 2010-08-14 2000 14,598GB 94.5% pdsf.nersc.gov 95,938hours
pdsf10033 AuAu7.7/hijing_382/B0_20/minbias/y2010a/gheisha_on 3.0M 2010-12-06 2011-01-30 15,300 10TB 89.4% pdsf.nersc.gov 465,000hours
pdsf11010 pp200/pythia6_423/minbias/highptfilt/y2005i/tune_pro_pt0 3.4M 2011-02-14 2011-02-20 1,700 1.024TB 97.17% pdsf.nersc.gov 22,022hours
pdsf11000 pp200/pythia6_220/fmspi0filt/default/y2008e/gheisha_on 1.2M 2011-05-23 2011-06-01 600 403GB 20% pdsf.nersc.gov 9,800hours
pdsf11001 pp200/pythia6_220/minbias/default/y2008e/gheisha_on 300K 2011-05-21 2011-05-22 150 84GB 100% pdsf.nersc.gov 600hours
pdsf11002  dAu200/herwig_382/fmspi0filt/shadowing_on/y2008e/gheisha_on 200K 2011-06-02 2011-06-03 250 207GB 88% pdsf.nersc.gov 2500hours
pdsf11003 dAu200/herwig_382/fmspi0filt/shadowing_off/y2008e/gheisha_on 200K 2011-06-03 2011-06-04 250 233GB 100% pdsf.nersc.gov 2500hours
pdsf11011 pp200/pythia6_423/highptfilt/jp2filt/y2005i/tune_pro_pt0 45M (100k filtered) 2011-06-24 2011-07-14 4,500 653G 88.8% pdsf.nersc.gov 67,500hours
pdsf11010 pp200/pythia6_423/minbias/highptfilt/y2005i/tune_pro_pt0 (expanding statistics for the preexisting dataset pdsf11010; part grid, 2,940 jobs, and part local, because of urgency/high priority) 30.3M 2011-08-05 2011-08-26 5,500 8.50T (inc. .FZD) ??% pdsf.nersc.gov 82,500hours
pdsf11020 tracker review 2012 1K 2011-08-29 2011-09-05 10 484M 100% pdsf.nersc.gov 7.8hours
pdsf11021 tracker review 2012 10K 2011-08-29 2011-09-05 100 27G 100% pdsf.nersc.gov 600hours
pdsf11022 tracker review 2012 10K 2011-08-29 2011-09-06 250 102G 98.00% pdsf.nersc.gov 2,500hours
pdsf11023 tracker review 2012 10K 2011-08-29 2011-09-08 700 332G 98.14% pdsf.nersc.gov 3,500hours
pdsf11024 tracker review 2012 10K 2011-08-29 2011-09-05 100 28G 100% pdsf.nersc.gov 900hours
pdsf11025 tracker review 2012 10K 2011-08-29 2011-09-05 100 5.1G 100% pdsf.nersc.gov 200hours
pdsf11026 tracker review 2012 10K 2011-08-14 2011-08-16 200 94GB 100% pdsf.nersc.gov 2,000hours
pdsf11027 pending pending pending pending pending pending pending pending pending

Notes:


Notes for getting file size from catalog:

* This method is an approximation: the .fzd files are not cataloged, but their size is about the same as the geant.root files, so the total is estimated as:

[rcas6010] ~/> get_file_list.pl -keys 'sum(size)' -cond 'path~pp200/pythia6_422/2_3gev/tune100/y2005h/gheisha_on,storage=HPSS'
27758788009
[rcas6010] ~/> get_file_list.pl -keys 'sum(size)' -cond 'path~pp200/pythia6_422/2_3gev/tune100/y2005h/gheisha_on,filetype=MC_reco_geant,storage=HPSS'
14106791434
[rcas6010] ~/> echo `echo "(27758788009+14106791434)/100000000" | bc`" GB"
418 GB

The true dataset value is 406.5GB so there is a +2.75% error.
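The same estimate can be scripted for any dataset path (a sketch wrapping the exact commands above; substitute <dataset_path>):

  TOT=`get_file_list.pl -keys 'sum(size)' -cond 'path~<dataset_path>,storage=HPSS'`
  GEA=`get_file_list.pl -keys 'sum(size)' -cond 'path~<dataset_path>,filetype=MC_reco_geant,storage=HPSS'`
  echo `echo "($TOT+$GEA)/100000000" | bc`" GB"   # geant size counted twice to stand in for the .fzd files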

 

The dataset description can be found at:

http://www.star.bnl.gov/public/comp/prod/MCProdList.html

 

#Example of getting the size
SELECT CONCAT(SUM(size_workerNode) / 1000000000 , 'GB')
FROM MasterIO f
WHERE f.`jobID_MD5` = 'C88562E422AF05783ACF43F6172DC95A';

#Example of finding the start time
SELECT j.`jobID_MD5`, j.`submitTime`, f.`name_requester`
FROM MasterIO f, MasterJobEfficiency j
WHERE f.`jobID_MD5` = 'C88562E422AF05783ACF43F6172DC95A'
AND f.`jobID_MD5`= j.`jobID_MD5`
AND f.`name_requester` IS NOT NULL
ORDER BY `submitTime` ASC
LIMIT 3;


#Example of finding the end time
SELECT j.`jobID_MD5`, j.`endTime`, f.`name_requester`
FROM MasterIO f, MasterJobEfficiency j
WHERE f.`jobID_MD5` = 'C88562E422AF05783ACF43F6172DC95A'
AND f.`jobID_MD5`= j.`jobID_MD5`
AND f.`name_requester` IS NOT NULL
ORDER BY `endTime` DESC
LIMIT 3;

 

Notes on finding the number of events from the catalog:

[rcas6010] ~/> get_file_list.pl -keys 'sum(events)' -cond 'path~pp200/pythia6_422/2_3gev/tune100/y2005h/gheisha_on,filetype=MC_reco_geant,storage=HPSS'
115000

*Note: select only one type of file (filetype=MC_reco_geant), else you will be double-counting.

#Example of getting the production Efficiency:


SELECT concat(
((SELECT count(*) AS jobsCount FROM MasterJobEfficiency j
WHERE submitAttempt = 1
AND overAllState = 'success'
AND j.`jobID_MD5` = 'C88562E422AF05783ACF43F6172DC95A'
) * 100 ) /
(SELECT count(*) AS jobsCount FROM MasterJobEfficiency j
WHERE submitAttempt = 1
AND j.`jobID_MD5` = 'C88562E422AF05783ACF43F6172DC95A'
),'%');

 

#Example of getting the run time. Note there is a filter for runaway jobs:

SELECT AVG((`endTime` - `startTime`) / 60 / 60) FROM MasterJobEfficiency f
WHERE endTime > 0
AND startTime > 0
AND ((`endTime` - `startTime`) / 60 / 60) < 200
AND f.`jobID_MD5` = 'C88562E422AF05783ACF43F6172DC95A';

 

Monitoring

Ganglia

This page has moved to Ganglia monitoring system.

 

Metrics and Accounting

Let's try to post the grid metrics and accounting info here.

SUMS and Grid snapshots

On this page, we will save snapshots of how SUMS "sees" the Grid utilization.
Average dispatch time, all dispatchers, past 3 months (200607).
The jump of CondorRSLDispatcher is unexplained.

MySQL project activities

The STAR MySQL for GRID project is an effort to integrate MySQL databases into the GRID infrastructure. This means providing tools to help manage networks of replicated databases and providing GSI authentication for MySQL connections.

The current subprojects are:


Slides of the presentation given at LBL. They give a summary of the scope and aims of this project.
Two versions are posted: PowerPoint and HTML


Gabriele Carcassi - Richard Casella

GSI Enabled MySQL

Grid Security Infrastructure (GSI) is the mechanism used by the Globus Toolkit to enable secure authentication and communication for a Grid over an open network. GSI provides a number of useful services for Grids, including mutual authentication and single sign-on. For detailed information regarding GSI you can read the GSI overview from Globus. Enabling MySQL to use GSI security and authentication will let Grid users with grid proxy certificates communicate securely with MySQL daemons on the grid without having to authenticate again. Processes that have been scheduled and initiated on the grid by an authenticated user will likewise be able to communicate with MySQL daemons without further authentication.

GSI

GSI uses X.509 certificates and SSL providing:
  • secure communication
  • security across organizational boundaries
  • single sign-on for users of the Grid

MySQL

As of version 4.0.0, MySQL is both SSL and X.509 enabled.

By default, MySQL is not SSL enabled, since using encrypted connections to access the database would slow down transactions and MySQL is, by default, optimized for speed. Read the MySQL documentation on Using Secure Connections for details on how to set up MySQL for SSL, including how to create and set up the user certificates and grant the proper privileges for a user to authenticate.

The current implementation requires that the Certificate Authority (CA) certificate which signs the user and server certificates be available for the SSL/X509 configuration to work. This is fine for applications which do not work with GSI-enabled applications. It does not, however, fit the GSI model for authentication: with GSI, the CA only needs to sign user and service certificates. An example of a successful implementation of GSI over SSL on legacy software is the GSI Enabled OpenSSH.


Testing


Presentations

  • PPDG Collaboration Meeting presentation, June 10, 2003 - HTML - PPT

Richard A. Casella

GSI Enabled MySQL - Testing

To Grid-enable MySQL is to allow client authentication using X509 certificates, as used in the Globus Toolkit. Using the X509 certificates issued by the Globus Toolkit CA alleviates the need for the client to authenticate separately after issuing the "grid-proxy-init" command. To do this in MySQL, one needs to connect over an SSL-encrypted channel. This document outlines the steps needed to prepare MySQL for such connections, demonstrates a simple Perl DBI script which accomplishes the connection, and discusses future plans for testing and implementation.


Setup

  • MySQL
  • For a more in-depth explanation of the whys and hows, see the MySQL documentation. What is included here are excerpts and observations from that documentation.
    • Build MySQL with SSL enabled. The following conditions apply to MySQL 4.0.0 or greater. If you are running an older version, you should definitely check the documentation mentioned above.
      1. Install OpenSSL Library >= OpenSSL 0.9.6
      2. Configure and build with options --with-vio --with-openssl
      3. Check that your server supports OpenSSL by examining if SHOW VARIABLES LIKE 'have_openssl' returns YES
    • X509 Certificates
    • Check documentation for more detailed explanation of key creation.
      • Setup. First create a directory for the keys, copy and modify openssl.cnf
      • DIR=~/openssl
        PRIV=$DIR/private
        mkdir $DIR $PRIV $DIR/newcerts
        cp /usr/share/openssl.cnf $DIR/openssl.cnf
        replace .demoCA $DIR -- $DIR/openssl.cnf
      • Certificate Authority
      • openssl req -new -keyout cakey.pem -out $PRIV/cacert.pem -config $DIR/openssl.cnf
      • Server Request and Key
      • openssl req -new -keyout $DIR/server-key.pem -out $DIR/server-req.pem \
        -days 3600 -config $DIR/openssl.cnf
        openssl rsa -in $DIR/server-key.pem -out $DIR/server-key.pem
        openssl ca -policy policy_anything -out $DIR/server-cert.pem \
        -config $DIR/openssl.cnf -infiles $DIR/server-req.pem
      • Client Request and Key
      • openssl req -new -keyout $DIR/client-key.pem -out $DIR/client-req.pem \
        -days 3600 -config $DIR/openssl.cnf
        openssl rsa -in $DIR/client-key.pem -out $DIR/client-key.pem
        openssl ca -policy policy_anything -out $DIR/client-cert.pem \
        -config $DIR/openssl.cnf -infiles $DIR/client-req.pem
      • Init Files
      • MySQLd reads /etc/my.cnf at start-up time and needs to be made aware of the certificates there. Add the following lines (be sure to replace $DIR with the actual location) to /etc/my.cnf
        [server]
        ssl-ca=$DIR/cacert.pem
        ssl-cert=$DIR/server-cert.pem
        ssl-key=$DIR/server-key.pem
        Add the following lines (be sure to replace $DIR with the actual location) to ~/.my.cnf
        [client]
        ssl-ca=$DIR/cacert.pem
        ssl-cert=$DIR/client-cert.pem
        ssl-key=$DIR/client-key.pem
    • Grant Options
    • Again, the MySQL documentation should be consulted, but basically the following options are added to the grant options in the user table of the mysql database. Not all of them have been tested at this time, but they will be before all is said and done. These options are added as needed in the following manner...
      	  mysql> GRANT ALL PRIVILEGES ON test.* to username@localhost
      -> IDENTIFIED BY "secretpass" REQUIRE SSL;
      • REQUIRE SSL limits the server to allow only SSL connections
      • REQUIRE X509 &quot;issuer&quot; means that the client should have a valid certificate, but we do not care about the exact certificate, issuer or subject
      • REQUIRE ISSUER means the client must present a valid X509 certificate issued by issuer "issuer".
      • REQUIRE SUBJECT &quot;subject&quot; requires clients to have a valid X509 certificate with the subject "subject" on it.
      • REQUIRE CIPHER &quot;cipher&quot; is needed to ensure strong ciphers and keylengths will be used. (ie. REQUIRE CIPHER &quot;EDH-RSA-DES-CBC3-SHA&quot;)
  • Perl DBI/DBD
  • Perl DBI needs to connect over an SSL-encrypted connection, so SSL support must be enabled: configure DBD::mysql with -ssl, then build and install it on the machine where you will run your Perl code. A minimal connection check is sketched below.
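A minimal connection check (a sketch; "username", "secretpass" and the test database follow the GRANT example above, and $DIR is the key directory from the setup above):

  mysql --ssl-ca=$DIR/cacert.pem --ssl-cert=$DIR/client-cert.pem \
        --ssl-key=$DIR/client-key.pem -u username -p test \
        -e "SHOW STATUS LIKE 'Ssl_cipher'"
  # a non-empty Value (e.g. EDH-RSA-DES-CBC3-SHA) confirms the session is
  # encrypted; an empty Value means the connection fell back to plaintext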

Testing


Richard A. Casella

Infrastructure

The pages in this tree relate to the Infrastructure sub-group of the S&C team.

The areas comprise general infrastructure (software, web services, security, ...), online computing, operations and user support.

 

Online Computing

General

The online Web server front page is available here. This Drupal section will hold complementary information.
A list of all operation manuals (beyond detector sub-systems) is available at You do not have access to view this node.
Please use it as a startup page.

Detector sub-systems operation procedures - Updated 2008, requested confirmation for 2009

 

Online computing run preparation plans

This page will list, by year, action items, run plans and open questions. It will serve as a repository for the documents used as a basis for drawing up the requirements. To see documents in this tree, you must belong to the Software and Computing OG (the pages are not public).

Run 19

Feedback from software coordinators

Active feedback

Sub-system  Coordinator                    Calibration POC  Online monitoring POC
MTD         Rongrong Ma                    - same -         - same -
EMC         Raghav Kunnawalkam Elayavalli  Nick Lukow       - same - (Note: L2algo, bemc and bsmdstatus)
EPD         Prashant Shanmuganathan        N/A              - same -
BTOF        Frank Geurts                   - same -         Frank Geurts / Zaochen Ye
ETOF        Florian Seck                   - same -         Florian Seck / Philipp Weidenkaff
HLT         Hongwei Ke                     - same -         - same -

Other software coordinators

sub-system Coordinator
iTPC (TPC?) Irakli Chakaberia
Trigger Akio Ogawa
DAQ Jeff Landgraf
...  

Run 20

Status of calibration timeline initialization

In RUN: EEMC, EMC, EPD, ETOF, GMT, TPC, MTD, TOF
Test: FST, FCS, STGC (no tables)
Desired init dates were announced to all software coordinators:

- Geometry tag has a timestamp of 20191120
- Simulation timeline [20191115,20191120[
- DB initialization for real data [20191125,...]

     Please initialize your table content appropriately, i.e.
sim flavor initial values are entered at 20191115 up to 20191119
(please exclude the edge), ofl initial values at 20191125
(with the run starting on the 1st of December, even tomorrow's cosmic
and commissioning runs would pick up the proper values).

 

 

Status - 2019/12/10

EMC  = ready
ETOF = ready - initialized at 2019-11-25, no sim (confirming)
TPC  = NOT ready [look at year 19 for comparison]
MTD  = ready
TOF  = Partially ready? INL correction, T0, TDC, status and alignment tables initialized
EPD  = gain initialized at 2019-12-15 (!?), status not initialized, no sim

EEMC = ready? (*last init at 2017-12-20)
GMT  = ready (*no db tables)



Status - 2019/12/09

EMC  = ready
ETOF = ready? initialized at 2019-11-25, no sim
TPC  = NOT ready
MTD  = ready
TOF  = NOT ready
EPD  = gain initialized at 2019-12-15 (!?), status not initialized, no sim

EEMC = ready? (*last init at 2017-12-20)
GMT  = ready (*no db tables)

 

 

Software coordinator feedback for Run 20 - Point of Contacts

Sub-system         Coordinator                    Calibration POC  Online monitoring POC
MTD                Rongrong Ma                    - same -         - same -
EMC / EEMC         Raghav Kunnawalkam Elayavalli  Nick Lukow       - same - (Note: L2algo, bemc and bsmdstatus)
EPD                [TBC]                          - same -         - same -
BTOF               Frank Geurts                   - same -         Frank Geurts / Zaochen Ye
ETOF               Florian Seck                   - same -         Florian Seck / Philipp Weidenkaff
HLT                Hongwei Ke                     - same -         - same -
TPC                Irakli Chakaberia              - same -         Flemming Videbaek
Trigger detectors  Akio Ogawa                     - same -         - same -
DAQ                Jeff Landgraf                  N/A


---




Run 21

Status of calibration timeline initialization

- Geometry tag has a timestamp of 20201215
- Simulation timeline [20201210, 20201215]
- DB initialization for real data [20201220,...]

Status - 2020/12/10

 

Software coordinator feedback for Run 21 - Point of Contacts

Sub-system         Coordinator                     Calibration POC     Online monitoring POC
MTD                Rongrong Ma                     - same -            - same -
EMC / EEMC         Raghav Kunnawalkam Elayavalli   Nick Lukow          - same - (Note: L2algo, bemc and bsmdstatus)
EPD                Prashanth Shanmuganathan (TBC)  Skipper Kagamaster  - same -
BTOF               Zaochen Ye                      - same -            Frank Geurts / Zaochen Ye
ETOF               Philipp Weidenkaff              - same -            Philipp Weidenkaff
HLT                Hongwei Ke                      - same -            - same -
TPC                Yuri Fisyak                     - same -            Flemming Videbaek
Trigger detectors  Akio Ogawa                      - same -            - same -
DAQ                Jeff Landgraf                   N/A
Forward Upgrade    Daniel Brandenburg              - same -            FCS - Akio Ogawa; sTGC - Daniel Brandenburg; FST - Shenghui Zhang / Zhenyu Ye

---

Run 22

 

Status of calibration timeline initialization

- Geometry tag has a timestamp of 20211015
- Simulation timeline [20211015, 20211020[
- DB initialization for real data [20211025,...]

Status - 2021/10/13

 

Software coordinator feedback for Run 22 - Point of Contacts (TBC)

Sub-system         Coordinator                                       Calibration POC     Online monitoring POC
MTD                Rongrong Ma                                       - same -            - same -
EMC / EEMC         Raghav Kunnawalkam Elayavalli / Navagyan Ghimire  - same -            - same - (Note: L2algo, bemc and bsmdstatus)
EPD                Prashanth Shanmuganathan (TBC)                    Skipper Kagamaster  - same -
BTOF               Zaochen Ye                                        - same -            Frank Geurts / Zaochen Ye
ETOF               Philipp Weidenkaff                                - same -            Philipp Weidenkaff
HLT                Hongwei Ke                                        - same -            - same -
TPC                Yuri Fisyak                                       - same -            Flemming Videbaek
Trigger detectors  Akio Ogawa                                        - same -            - same -
DAQ                Jeff Landgraf                                     N/A
Forward Upgrade    Daniel Brandenburg                                - same -            FCS - Akio Ogawa; sTGC - Daniel Brandenburg; FST - Shenghui Zhang / Zhenyu Ye

---

Run XIII

Preparation meeting minutes

Database initialization check list

TPC Software  – Richard Witt          NO
GMT Software  – Richard Witt          NO
EMC2 Software - Alice Ohlson          Yes
FGT Software  - Anselm Vossen         Yes
FMS Software  - Thomas Burton         Yes
TOF Software  - Frank Geurts          Yes
Trigger Detectors  - Akio Ogawa       ??
HFT Software  - Spyridon Margetis     NO (no DB interface, hard-coded values in preview codes)

 

Calibration Point of Contacts per sub-system

If a name is missing, the POC role falls onto the coordinator.
                Coordinator           Possible POC
                ------------          ---------------
TPC Software  – Richard Witt          
GMT Software  – Richard Witt          
EMC2 Software - Alice Ohlson          Alice Ohlson  
FGT Software  - Anselm Vossen         
FMS Software  - Thomas Burton         Thomas Burton    
TOF Software  - Frank Geurts          
Trigger Detectors  - Akio Ogawa       
HFT Software  - Spyridon Margetis     Hao Qiu

Online Monitoring POC

The final list from the Spin PWGC can be found at You do not have access to view this node . The table below includes the Spin PWGC feedback merged with other feedback.

  Directories we inferred are being used (as reported in the RTS Hypernews):
  scaler - Len Eun and Ernst Sichtermann (LBL) - This directory usage was indirectly reported
  SlowControl - James F Ross (Creighton)
  HLT - Qi-Ye Shou - The 2012 directory had a recent timestamp but was owned by mnaglis. Aihong Tang contacted 2013/02/12; answer from Qi-Ye Shou 2013/02/12 - will be POC.
  fmsStatus - Yuxi Pan (UCLA) - This was not requested, but the 2011 directory is being overwritten by user=yuxip. FMS software coordinator contacted for confirmation 2013/02/12; Yuxi Pan confirmed 2013/02/13 as POC for this directory.

  Spin PWG monitoring related directories follow:
  L0trg - Pibero Djawotho (TAMU)
  L2algo - Maxence Vandenbroucke (Temple)
  cdev - Kevin Adkins (UKY)
  zdc - Len Eun and Ernst Sichtermann (LBL)
  bsmdStatus - Keith Landry (UCLA)
  emcStatus - Keith Landry (UCLA)
  fgtStatus - Xuan Li (Temple) - This directory is also being written by user=akio, causing protection access and possible clash problems. POC contacted on 2013/02/08; both Akio and POC contacted again 2013/02/12 -> confirmed as OK.
  bbc - Prashanth (KSU)



Run XIV


Preparation meeting minutes, links

  • You do not have access to view this node
  • You do not have access to view this node
  • You do not have access to view this node
  • You do not have access to view this node
  • You do not have access to view this node
  • You do not have access to view this node
  • You do not have access to view this node
  • You do not have access to view this node
  • You do not have access to view this node
  • You do not have access to view this node

Notes

  • 2013/11/15
    • Info gathering begins (directories/areas and Point of Contacts)
      Status:
      2013/11/22, directory structure, 2 people provided feedback, Renee coordinated the rest
      2013/11/25, calibration POC, 3 coordinators provided feedback - Closed 2013/12/04
      2013/12/04, geometry for Run 14,
       
    • Basic check: CERT for online is old if coming from the Wireless
      Status: fixed at ITD level, 2013/11/18 - the reverse proxy did not have the proper CERT
  • 2013/11/25

Database initialization check list

The actions suggested by this section have not started yet.

Sub-system Coordinator Check done
DAQ Jeff Landgraf
TPC Richard Witt
GMT Richard Witt
EMC2 Mike Skoby, Kevin Adkins
FMS Thomas Burton
TOF Daniel Brandenburg
MTD Rongrong Ma
HFT Spiros Margetis (not known)
Trigger Akio Ogawa
FGT Xuan Li


Calibration Point of Contacts per sub-system

"-" indicates no feedback was provided. But if a name is missing, the POC role falls onto the coordinator.

Sub-system Coordinator Calibration POC
DAQ Jeff Landgraf -
TPC Richard Witt -
GMT Richard Witt -
EMC2 Mike Skoby, Kevin Adkins -
FMS Thomas Burton -
TOF Daniel Brandenburg -
MTD Rongrong Ma Bingchu Huan
HFT Spiros Margetis Jonathan Bouchet
Trigger Akio Ogawa -
FGT Xuan Li N/A


Online Monitoring POC


scaler - Not needed 2013/11/25
SlowControl Chanaka DeSilva OKed at the second Run preparation meeting
HLT Zhengquia Zhang Learned incidentally on 2014/01/28
HFT Shusu Shi Learned about it on 2014/02/26
fmsStatus - Not needed 2013/11/25
L0trg Zilong Chang, Mike Skoby Informed 2013/11/10 and created 2013/11/15
L2algo Nihar Sahoo Informed 2013/11/25
cdev - Not needed 2013/11/25
zdc - May not be used (TBC)
bsmdStatus Janusz Oleniacz Info will be passed from Keith Landry 2014/01/20; possible backup, Leszek Kosarzewski 2014/03/26
emcStatus Janusz Oleniacz Info will be passed from Keith Landry 2014/01/20; possible backup, Leszek Kosarzewski 2014/03/26
fgtStatus - Not needed 2013/11/25
bbc Akio Ogawa Informed 2013/11/15, created same day


Run XV

Run 15 was prepared essentially by discussing with individuals; a comprehensive page was not maintained.

Run XVI


This page will contain feedback related to the preparation of the online setup.

 

Notes



 

Online Monitoring POC

scaler    
SlowControl    
HLT Zhengqiao Feedback 2015/11/24
HFT Guannan Xie Spiros: Feedback 2015/11/24
fmsStatus   Akio: Possibly not needed (TBC). 2016/01/13 noted this was not used in Run 15 and will probably never be used again.
fmsTrg   Confirmed needed 2016/01/13
fps   Akio: Not needed in Run 16? Perhaps later.
L0trg Zilong Chang Zilong: Feedback 2015/11/24
L2algo Kolja Kauder Kolja: will be POC - 2015/11/24
cdev Chanaka DeSilva  
zdc    
bsmdStatus Kolja Kauder Kolja: will be POC - 2015/11/24
bemcTrgDb Kolja Kauder Kolja: will be POC - 2015/11/24
emcStatus Kolja Kauder Kolja: will be POC - 2015/11/24
fgtStatus   Not needed since Run 14 ... May drop from the list
bbc Akio Ogawa Feedback 2015/11/24, needed
rp    

 

Calibration Point of Contacts per sub-system

Sub-system Coordinator Calibration POC
DAQ Jeff Landgraf -
TPC Richard Witt, Yuri Fisyak -
GMT Richard Witt -
EMC2 Kolja Kauder, Ting Lin -
FMS Oleg Eysser -
TOF Daniel Brandenburg -
MTD Rongrong Ma (same, confirmed 2015/11/24)
HFT Spiros Margetis Xin Dong
HLT Hongwei Ke (same, confirmed 2015/11/24)
Trigger Akio Ogawa -
RP Kin Yip -

 

Database initialization check list



 

Shift Accounting

This page will now hold the shift accounting pages. They complement the Shift Sign-up process by documenting it.

Run 18 shift dues


Run 18 Shift Dues & Notes


Period coordinators

As usual, period coordinators are pre-assigned, as arranged by the Spokespersons.

Special arrangements and requests

  1. Under the family-related policy, the following 6 weeks of offline QA shifts were pre-assigned:
    MAR 27 Kevin Adkins (Kentucky)
    APR 03 Kevin Adkins
    APR 10 Sevil Salur (Rutgers)
    APR 17 Richard Witt (USNA/Yale)
    MAY 22 Juan Romero (UC Davis)
    JUN 12 Terry Tarnowsky (Michigan State)
     
  2. Lanny Ray (UT Austin), as QA coordinator, is always pre-assigned the first QA week.
     
  3. FIAS remains in “catch-up mode” and is taking extra shifts above their dues. Pre-assigned shifts can be requested in this scenario. FIAS has been pre-assigned 4 Detector Op shifts.
     
  4. Bob Tribble (TAMU) requests the evening Shift leader slot during Apr 10-17.

Run 19 special requests

The following pre-assigned slot requests were made.
    9 WEEKS PRE-ASSIGNED QA AS FOLLOWS
    ==================================
    Lanny Ray (UT Austin) QA Mar 5
    Richard Witt (USNA/Yale) QA Mar 19
    Sevil Salur (Rutgers) QA Apr 16
    Wei Li (Rice) QA Apr 23
    Kevin Adkins (Kentucky) QA May 14
    Juan Romero (UC Davis) QA May 21
    Jana Bielcikova (NPI, Czech Acad of Sci) QA May 28  
    Yanfang Liu (TAMU) QA June 25 
    Yanfang Liu (TAMU) QA July 02
    
    8 WEEKS PRE-ASSIGNED REGULAR SHIFTS AS FOLLOWS
    ==================================
    Bob Tribble (BNL) Feb 05 SL evening 
    Daniel Kincses (Eotvos) Mar 12  DO Trainee Day
    Daniel Kincses (Eotvos) Mar 19  DO Day
    Mate Csanad (Eotvos) Mar 12 SC Day
    Ronald Pinter (Eotvos) Mar 19 SC Day
    Carl Gagliardi (TAMU)  May 14  SL day
    Carl Gagliardi (TAMU)  May 21 SL day 
    Grazyna Odyniec (LBNL) July 02 SL evening
    
    

Shift Dues and Special Requests Run 20

For the calculation of shift dues, there are two considerations.
1) The length of time of the various shift configurations (2 person, 4 person no trainees, 4 person with trainees, plus period coordinators/QA shifts)
2) The percent occupancy of the training shifts

For many years, 2) has hovered around 45%, which is what we used to calculate the dues.  Since STAR gives credit for training shifts (as we should), this needs to be factored in or we would not have enough shifts.

The sum total of shifts needed is then divided by the total number of authors, minus authors from Russian institutions who cannot come to BNL.

date                  weeks           crew           training           PC           OFFLINE          
11/26-12/10    2                  2                      0                  0           0           
12/10-12/24    2                  4                      2                 1            0   
12/24-6/30      27                4                      2                 1            1   
7/02-7/16        2                  4                      0                 1            1   

Adding these together (3x a shift for crew, 3x45% for training, plus pc plus offline) gives a total of 522 shifts.
The total number of shifters is 303 - 30 Russian collaborators = 273 people
Giving a total due of 1.9 per author.
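For the record, here is a worked check of that arithmetic (each period weighted by its number of weeks):

     2 x (3x2)                     =  12
     2 x (3x4 + 3x0.45x2 + 1)      =  31.4
    27 x (3x4 + 3x0.45x2 + 1 + 1)  = 450.9
     2 x (3x4 + 1 + 1)             =  28
                             total = 522.3 ~ 522, and 522 / 273 = 1.9

The corrected table below gives, by the same formula, 455.5 ~ 456 shifts and 456 / 273 = 1.7.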

For a given institution, the load is calculated as (# of authors - # of expert credits) x due, then set to an integer value, as cutting collaborators into pieces is non-collegial behavior.

However, this year, this should have been:
date                  weeks           crew           training           PC           OFFLINE          
11/26-12/10    2                  2                      0                  0           0           
12/10-12/24    2                  4                      2                 1            0   
12/24-6/02      23                4                      2                 1            1   
6/02-6/16        2                  4                      0                 1            1   

Adding these together (3x a shift for crew, 3x45% for training, plus pc plus offline) gives a total of 456 shifts for a total due of 1.7 per author.

We allowed some people to pre-sign up, for a couple of different reasons.

Family reasons so offline QA:
James Kevin Adkins
Jana Bielčíková
Sevil Salur
Md. Nasim
Yanfang Liu

Additionally, Lanny Ray is given the first QA shift of the year as our experienced QA shifter.

This year, to add an incentive to train for shift leader, we allowed people who were doing shift leader training to sign up for both their training shift and their "real" shift early:
Justin Ewigleben
Hanna Zbroszczyk
Jan Vanek
Maria Zurek
Mathew Kelsey
Kun Jiang
Yue-Hang Leung

Both Bob Tribble and Grazyna Odyniec signed up early for a shift leader position in recognition of their schedules and contributions.

This year, because of the date of Quark Matter and the STAR pre-QM meeting, several people were traveling on the Tuesday of the sign-up.  I signed these people up early, as I did not want to penalize some of our most active colleagues for the QM timing:
James Daniel  Brandenburg
Sooraj Radhakrishnan

3 other cases were allowed to pre-sign up:
Panjab University had a single person who had a visa to enter the US, and who had to take all of their shifts prior to the end of their contract in March.  So that the shifter could have some spaces in his shifts for sanity, I signed up:
Jagbir Singh
Eotvos Lorand University stated that travel is complicated for their group, so it would be good if they could ensure that they were all on shift at the same time.  Given that they are coming from Europe, I signed up:
Mate Csanad
Daniel Kincses
Roland Pinter
Srikanta Tripathy
Frankfurt Institute for Advanced Studies (FIAS) wanted to be able to bring Masters students to do shifts, but given the training requirements and the timing with school and travel from Europe, this leaves little availability for shift.  So I signed up:
Iouri Vassiliev
Artemiy Belousov
Grigory Kozlov

Tools

This is to serve as a repository of information about various STAR tools used in experimental operations.

Implementing SSL (https) in Tomcat using CA generated certificates

The reason for using a certificate from a CA, as opposed to a self-signed certificate, is that with a self-signed certificate the browser presents a warning screen and asks you to accept the certificate. Since the browser already ships with a list of trusted CAs, this step is not needed for a CA-issued certificate.
 
The following list of certificates and a key are needed:

/etc/pki/tls/certs/wildcard.star.bnl.gov.Nov.2012.cert – host cert.
/etc/pki/tls/private/wildcard.star.bnl.gov.Nov.2012.key – host key (don’t give this one out)
/etc/pki/tls/certs/GlobalSignIntermediate.crt – intermediate cert.
/etc/pki/tls/certs/GlobalSignRootCA_ExtendedSSL.crt –root cert.
/etc/pki/tls/certs/ca-bundle.crt – a big list of many certs.

Concatenate the following certs into one file; in this example I call it Global_plus_Intermediate.crt:
cat /etc/pki/tls/certs/GlobalSignIntermediate.crt > Global_plus_Intermediate.crt
cat /etc/pki/tls/certs/GlobalSignRootCA_ExtendedSSL.crt >> Global_plus_Intermediate.crt
cat /etc/pki/tls/certs/ca-bundle.crt >> Global_plus_Intermediate.crt

Run this command. Note that -name tomcat and -caname root should not be changed to any other value; the command would still work, but the result would fail under Tomcat. If it works you will be asked for a password; that password should be set to "changeit".

 openssl pkcs12 -export -in wildcard.star.bnl.gov.Nov.2012.cert -inkey wildcard.star.bnl.gov.Nov.2012.key -out mycert.p12 -name tomcat -CAfile Global_plus_Intermediate.crt -caname root -chain

Test the new p12 output file with this command:

keytool -list -v -storetype pkcs12 -keystore mycert.p12

Note it should say: "Certificate chain length: 3"


In Tomcat's server.xml file, add a connector that looks like this:
 

<Connector port="8443" protocol="HTTP/1.1" SSLEnabled="true"
           maxThreads="150" scheme="https" secure="true"
           keystoreFile="/home/lbhajdu/certs/mycert.p12" keystorePass="changeit"
           keystoreType="PKCS12" clientAuth="false" sslProtocol="TLS"/>


Note: the path should be set to the correct path of the certificate, and the p12 file should only be readable by the Tomcat account because it holds the host key.
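Once Tomcat is restarted, a quick sanity check (a hedged example; adjust the host and port if your connector differs) is to dump the served chain with openssl and confirm the certificates appear:

 openssl s_client -connect localhost:8443 -showcerts </dev/null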

Online Linux pool

March 15, 2012:

THIS PAGE IS OBSOLETE!  It was written as a guide in 2008 for documenting improvements in the online Linux pool, but has not been updated to reflect additional changes to the state of the pool, so not all details are up to date. 

One particular detail to be aware of:  the name of the pool nodes is now onlNN.starp.bnl.gov, where 01<=NN<=14.  The "onllinuxN" names were retired several years ago.

 

Historical page (circa 2008/9):

Online Linux pool for general experiment support needs

 

GOAL: 

Provide a Linux environment for general computing needs in support of experimental operations.

HISTORY (as of approximately June 2008):

A pool of 14 nodes, consisting of four different hardware classes (all circa 2001), has been in existence for several years.  For the last three (or more?) years, they have had Scientific Linux 3.x with support for the STAR software environment, along with access to various DAQ and Trigger data sources.  The number of significant users has probably been less than 20, with the heaviest usage related to L2.  User authentication was originally based on an antique NIS server, to which we had imported the RCF accounts and passwords.  Though still alive, this NIS information has not been maintained, and local accounts on each node became the norm, which of course is rather tedious.  Home directories come in three categories:  AFS, NFS on onllinux5, and local home directories on individual nodes.  Again, this gets rather tedious to maintain over time.

There are several "special" nodes to be aware of:

  1. Three of the nodes (onllinux1, 2 and 3) are in the Control Room for direct console login as needed.  (The rest are in the DAQ room.)
  2. onllinux5 has the NFS shared home directories (in /online/users).  (NB.  /online/users is being backed up by the ITD Networker backup system.)
  3. onllinux6 is (was?) used for many online database maintenance scripts (check with Mike DePhillips about this -- we had planned to move these scripts to onldb).
  4. onllinux1 was configured as an NIS slave server, in case the NIS master (starnis01) fails.

 

PLAN:

For the run starting in 2008 (2009?), we are replacing all of these nodes with newer hardware.

The basic hardware specs for the replacement nodes are:

Dual 2.4 GHZ Intel Xeon processors

1GB RAM

2 x 120 GB IDE disks

 

These nodes should be configured with Scientific Linux 4.5 (or 4.6 if we can ensure compatibility with STAR software) and support the STAR software environment.

They should have access to various DAQ and Trigger NFS shares.  Here is a starter list of mounts:

 

Shared DAQ and Trigger resources

SERVER DIRECTORY on SERVER LOCAL MOUNT POINT MOUNT OPTIONS
 evp.starp  /a  /evp/a  ro
 evb01.starp  /a  /evb01/a  ro
 evb01  /b  /evb01/b  ro
 evb01  /c  /evb01/c  ro
 evb01  /d  /evb01/d  ro
 evb02.starp  /a  /evb02/a  ro
 evb02  /b  /evb02/b  ro
 evb02  /c  /evb02/c  ro
 evb02  /d  /evb02/d  ro
 daqman.starp  /RTS  /daq/RTS  ro
 daqman  /data  /daq/data  rw
 daqman  /log  /daq/log  ro
 trgscratch.starp  /data/trgdata  /trg/trgdata  ro
 trgscratch.starp  /data/scalerdata  /trg/scalerdata  ro
 startrg2.starp  /home/startrg/trg/monitor/run9/scalers  /trg/scalermonitor  ro
 online.star  /export  /onlineweb/www  rw
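For illustration only, a minimal sketch of how a few of these could appear in /etc/fstab on a pool node (assuming plain NFS mounts; the actual mount options are up to the admins):

 # sketch of /etc/fstab entries for the shared DAQ/Trigger resources
 evp.starp.bnl.gov:/a        /evp/a     nfs  ro  0 0
 evb01.starp.bnl.gov:/a      /evb01/a   nfs  ro  0 0
 daqman.starp.bnl.gov:/data  /daq/data  nfs  rw  0 0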

 

 

WISHLIST Items with good progress:

  • <Uniform and easy to maintain user authentication system to replace the current NIS and local account mess.  Either a local LDAP, or a glom onto RCF LDAP seems most feasible> -- An ldap server (onlldap.starp.bnl.gov) has been set-up and the 15 onllinux nodes are authenticating to it *BUT* it is using NIS!
  • <Shared home directories across the nodes with backups> -- onlldap is also hosting the home directories and sharing them via NFS.  EMC Networker is backing up the home directories and Matt A. is receiving the email notifications.
  • <Integration into SSH key management system (mechanism depends upon user authentication method(s) selected).> --  The ldap server has been added to the STAR SSH key management system, and users are able to login to the new onlXX nodes with keys now.
  • <Common configuration management system> -- Webmin is in use.
  • <Ganglia monitoring of the nodes> -- I think this is done...
  • <Osiris monitoring of the nodes> -- I think this is done - Matt A. and Wayne B. are receiving the notices...

WISHLIST Items still needing significant work:

  • None?

 

SSH Key Management

Overview 

An SSH public key management system has been developed for STAR (see D. Arkhipkin et al 2008 J. Phys.: Conf. Ser. 119 072005), with two primary goals stemming from the heightened cyber-security scrutiny at BNL:

  • Use of two-factor authentication for remote logins
  • Identification and management of remote users accessing our nodes (in particular, the users of "group" accounts which are not tied to one individual) and achieve accountability

Users also benefit from a reduction in the number of passwords to remember and type.

 

In purpose, this system is similar to the RCF's key management system, but is somewhat more powerful because of its flexibility in the association of hosts (client systems), user accounts on those clients, and self-service key installation requests.

Here is a typical scenario of the system usage: 

  1. A sysadmin of a machine named FOO creates a user account named "JDOE" and, if not done already, installs the key_services client.
  2. A user account 'JDOE' on host 'FOO' is configured in the Key Management system by a key management administrator.
  3. John Doe uploads (via the web) his or her public ssh key (in openssh format).
  4. John Doe requests (via the web) that his key be added to JDOE's authorized_keys file on FOO.
  5. A key management administrator approves the request, and the key_services client places the key in ~JDOE/.ssh/authorized_keys.

At this point, John Doe has key-based access to JDOE@FOO.  Simple enough?  But wait, there's more!  Now John Doe realizes that he also needs access to the group account named "operator" on host BAR.  Since his key is already in the key management system he has only to request that his key be added to operator@BAR, and voila (subject to administrator approval), he can now login with his key to both JDOE@FOO and operator@BAR.  And if Mr. Doe should leave STAR, then an administrator simply removes him from the system and his keys are removed from both hosts.

Slightly Deeper...

There are three things to keep track of here -- people (and their SSH keys of course), host (client) systems, and user accounts on those hosts:

People want access to specific user accounts at specific hosts.

So the system maintains a list of user accounts for each host system, and a list of people associated with each user account at each host.
(To be clear -- the system does not have any automatic user account detection mechanism at this time -- each desired "user account@host" association has to be added "by hand" by an administrator.)

This Key Management system, as seen by the users (and admins), consists simply of users' web browsers (with https for encryption) and some PHP code on a web server (which we'll call "starkeyw") which inserts uploaded keys and user requests (and administrator's commands) to a backend database (which could be on a different node from the web server if desired). 

Behind the scenes, each host that is participating in the system has a keyservices client installed that runs as a system service.  The keyservices_client periodically (at five minute intervals by default) interacts with a different web server (serving different PHP code that we'll call starkeyd).  The backend database is consulted for the list of approved associations, and the appropriate keys are downloaded by the client and added to the authorized_keys files accordingly.

In our case, our primary web server at www.star.bnl.gov hosts all the STAR Key Manager (SKM) services (starkeyw and starkeyd via Apache, and a MySQL database), but they could each be on separate servers if desired.

Perhaps a picture will help.  See below for a link to an image labelled "SKMS in pictures".

Deployment Status and Future Plans

We have begun using the Key Management system with several nodes and are seeking to add more (currently on a voluntary basis).  Only RHEL 3/4/5 and Scientific Linux 3/4/5 with i386 and x86_64 kernels have been tested, but there is no reason to believe that the client couldn't be built on other Linux distributions or even Solaris.  We do not anticipate "forcing" this tool onto any detector sub-systems during the 2007 RHIC run, but we do expect it (or something similar) to become mandatory before any future runs.  Please contact one of the admins (Wayne Betts, Jerome Lauret or Mike Dephillips) if you'd like to volunteer or have any questions.

User access is currently based on RCF Kerberos authentication, but may be extended to additional authentication methods (eg., BNL LDAP) if the need arises.

Client RPMs (for some configurations) and SRPM's are available, and some installation details are available here: 

http://www.star.bnl.gov/~dmitry/skd_setup/

An additional related project is the possible implementation of a STAR ssh gateway system (while disallowing direct login to any of our nodes online) - in effect acting much like the current ssh gateway systems' role in the SDCC.  Though we have an intended gateway node online (stargw1.starp.bnl.gov, with a spare on hand as well), its use is not currently required.

 

Anxious to get started? 

Here you go: https://www.star.bnl.gov/starkeyw/ 

You can use your RCF username and Kerberos password to enter.

When uploading keys, use your SSH public keys - they need to be in OpenSSH format. If not, please consult SSH Keys and login to the SDCC.
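If your key was produced by an ssh.com-style client in the RFC 4716 / SECSH format, ssh-keygen can convert it to OpenSSH format; a hedged example (the file names are illustrative):

 % ssh-keygen -i -f mykey_secsh.pub > mykey_openssh.pub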

 
 

Software Infrastructure

On the menu today ...

 

General Information

SOFI stands for SOFtware Infrastructure. It includes any topic related to code standards, tools for compiling your code, and problems with base code and infrastructure. SOFI also addresses (or tries to address) your needs in terms of monitoring and easily managing activities and resources in the STAR environment.

 

Infrastructure & Software

Reporting problems

  • General RCF problems should be reported using Computing Facility Issue reporting system (RT).
    You should NOT use this system to report STAR-specific problems. Instead, use the STAR Request Tracking system described below.
  • To report STAR specific problems: Request Tracking (bug and issues tracker) system, 
    Submitting a problem (bug), help request or issue to the Request Tracking system using Email
    You can always submit a report to the bug tracking system by sending an Email directly; there is no need for a personalized account, and using the Web Interface is not mandatory. For each BugTracking category, an equivalent @www.star.bnl.gov mailing list exists (see the example after this list).
    The currently available queues are
    bugs-high problem with ANY STAR Software with a need to be fixed without delay
    bugs-medium problem with ANY STAR Software and must be fixed for the next release
    bugs-low problem with ANY STAR Software. Should be fixed for the next release
    comp-support General computing operation support (user, hardware and middleware provisioning)
    issues-infrstruct Any Infrastructure issues (General software and libraries, tools, network)
    issues-scheduler Issues related to the SUMS project (STAR Unified Meta-Scheduler)
    issues-xrootd Issues related to the (X)rootd distributed data usage
    issues-simu Issues related to Simulation
    grid-general STAR VO general Grid support : job submission, infrastructure, components, testing problem etc ...
    grid-bnl STAR VO, BNL Grid Operation support
    grid-lbl STAR VO, LBNL Grid Operation support
    wishlist Use it for suggesting what you would wish to see soon, what would be nice to have, etc ...

    You may use the guest account for accessing the Web interface. The magic word is here.
    • To create a ticket, select the queue (drop down menu after the create-ticket button). Queues are currently sorted by problem priority. Select the appropriate level. A wishlist queue has been created for your comments and suggestions. After the queue is selected, click on the create-ticket button and fill the form. Please, do not forget the usual information i.e. the result of STAR_LEVELS and of uname -a AND a description of how to reproduce the problem.
    • If you want to request a private account instead of using the guest account, send a message to the wishlist queue. There are 2 main reasons for requesting a personalized account :
      1. If you are planning to be an administrator or a watcher of the bug tracking system (that is, receive tickets automatically, take responsibility for solving them etc ...) you MUST have a private account.
      2. If you prefer to see the summary and progress of your own submitted tickets at login instead of seeing all tickets submitted under the guest account, you should also ask for a private account.
    • At login, the left side panels show the tickets you have requested and the tickets you own. The right panel shows the status of all queues. Having a private account does NOT mean that you cannot browse other users' tickets; it only affects the left panel summary.
    • To find a particular bug, click on search and follow the instructions.
    • Finally, if you would like to have a new queue created for a particular purpose (sub-system specific problems), feel free to request to setup such a queue.
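As an illustration of the Email route (a sketch only; the queue addresses follow the pattern described above, the description file is hypothetical, and STAR_LEVELS is assumed available in your STAR environment as referenced above):

 % ( STAR_LEVELS; uname -a; cat problem_description.txt ) | mail -s "problem report" bugs-low@www.star.bnl.gov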

 

General Tools 

Data location tools

Several tools exist to locate data both on disk and in HPSS. Some tools are available from the production page; we will list here only the tools we are developing for the future.
    • FileCatalog (command line interface get_file_list.pl and Perl module interface).
    • You do not have access to view this node

Resource Monitoring tools

Browsers

Web Sanity, Software & documentation tools

Web based access and tools

Web Sanity

Software & documentation auto-generation

  • Our STAR Software CVS Repositories browser
    Allows browsing the full offline and online CVS repositories with listings showing days since last modification, modifier, and log message of last commit, display and download (checkout) access to code, access to all file versions and tags, diff'ing between consecutive or arbitrary versions, direct file-level access to the cross-referenced presentation of a file, ... You can also sort
    1. by user
    2. recent history
  • Doxygen Code documentation (what is already doxygenized )
    and the User documentation (a quick startup ...) Our current Code documentation is generated using the doxygen generator. Two utilities exists to help you with this new documentation scheme :
    1. doxygenize is a utility which takes as argument a list of header files and modify them to include a "startup" doxygen tag documentation. It tries to guess the comment block, the author and the class name based on content. The current version also documents struct and enum lists. Your are INVITED TO CHECK the result before committing anything. I have tested on several class headers but there is always the exception where the parsing fails ...
    2. An interface to doxygen named doxycron.pl was created and installed on our Linux machines to satisfy the need of users to generate the documentation by themselves for checking purposes. That same generator interface is used to produce our Code documentation every day so, a simple convention has been chosen to accomplish both tasks. But why doxycron.plinstead of directly using doxygen? If you are a doxygen expert, the answer is 'indeed, why ?'. If not, I hope you will appreciate that doxycron.pl not only takes care of everything for you (like creating the directory structure, a default actually-functional configuration file, safely creating a new documentation set etc ....) but also adds a few more tasks to its list you normally have to do it yourself when using doxygen base tools (index creation, sorting of run-time errors etc ...). This being said, let me describe this tool now ...

      The syntax for doxycron.pl is
      % doxycron.pl [-i] PathWhereDocWillBeGenerated Path(s)ToScanForCode Project(s)Name SubDir(s)Tag

      The arguments are:
      • -i is used here to disable the doxytag execution, a useless pass if you only want to test your documentation.
      • PathWhereDocWillBeGenerated is the path where the documentation tree will be or TARGETD
      • Path(s)ToScanForCode is the path where the sources are or INDEXD (default is the comma separated list /afs/rhic.bnl.gov/star/packages/dev/include,/afs/rhic.bnl.gov/star/packages/dev/StRoot)
      • Project(s)Name is a project name (list) or PROJECT (default is the comma separated include,StRoot)
      • SubDir(s)Tag an optional tag (list) for an extra tree level or SUBDIR. The default is the comma separated list include, . Note that the last element is null i.e. "". When encountered, the null portion of a SUBDIR list will tell doxycron.pl to generate a searchable index based on all previous non-null SUBDIR entries in the list.

      Note that if one uses lists instead of single values, then, ALL arguments MUST be a list and the first 3 are mandatory.
      To pass an empty argument in a list, you must use quotations as in the following example

      % doxycron.pl /star/u/jeromel/work/doxygen /star/u/jeromel/work/STAR/.$STAR_HOST_SYS/DEV/include,/star/u/jeromel/work/STAR/DEV/StRoot include,StRoot 'include, '

      In order to make it clear what the conventions are, let's describe a step by step example as follow:

      Example 1 (simple / brief explanation):
      % doxycron.pl `pwd` `pwd`/dev/StRoot StRoot
      would create a directory dox/ in `pwd` containing the code documentation generated from the relative tree dev/StRoot for the project named StRoot. Likely, this (or similar) will generate the documentation you need.

      Example 2 (fully explained):
      % doxycron.pl /star/u/jeromel/work/doxygen /star/u/jeromel/work/STAR/DEV/StRoot Test
      In this example, I scan any source code found in my local cvs checked-out area /star/u/jeromel/work/STAR/DEV starting from StRoot. The output tree structure (where the documentation will end up) is requested to be in TARGETD=/star/u/jeromel/work/doxygen. In order to accomplish this, doxycron.pl will check and do the following:
      • Check that the doxygen program is installed
      • Create (if it does not exist) the $TARGETD/dox directory where everything will be stored and the tree will start
      • Search for a $TARGETD/dox/$PROJECT.cfg file. If it does not exist, a default configuration file will be created. In our example, the name of the configuration file defaults to /star/u/jeromel/work/doxygen/dox/Test.cfg. You can play with several configuration files by changing the project name. However, changing the project name would not lead to placing the documents in a different directory tree; you have to play with the $SUBDIR value for that.
      • The $SUBDIR variable is not used in our example. If I had chosen it to be, let's say, /bof, the documentation would have been created in $TARGETD/dox/bof instead, but the template is still expected to be $TARGETD/dox/$PROJECT.cfg.

      The configuration file should be considered as a template file, not a real configuration file. Any item appearing with a value like Auto-> or Fixed-> will be replaced on the fly by the appropriate value before doxygen is run. This ensures keeping the conventions tidy and clean. You actually do not have to think about it either; it works :) ... If it does not, please let me know. Note that the temporary configuration file will be created in /tmp on the local machine and left there after running.

      What else does one need to know: the way doxycron.pl works is the safest I could think of. Each new documentation set is re-generated from scratch, that is, using temporary directories, renaming old ones and deleting very old ones. After doxycron.pl has completed its tasks, you will end up with the directories $TARGETD/dox$SUBDIR/html and $TARGETD/dox$SUBDIR/latex. The result of the preceding execution of doxycron.pl will be in directories named html.old and latex.old.
      One thing will not work for users though: the indexing. The installation of the indexing mechanism in doxygen is currently not terribly flexible, and fixed values were chosen so that clicking on the Search index link will go to the cgi searching the entire main documentation pages.

      As a last note, doxygen understands ABSOLUTE path names only and therefore doxycron.pl will die out if you try to use relative paths as arguments. Just as a reminder, /titi/toto is an absolute path while things like ./ or ./tata are relative paths.
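For orientation, here is a minimal sketch (with a hypothetical class and author, for illustration only) of the kind of "startup" doxygen block that doxygenize aims to seed in a header, as referenced above:

 /*!
  * \class  StExampleMaker   (hypothetical name)
  * \author J. Doe
  *
  * One-line description of what the class does, to be completed by
  * the author; doxygen picks this block up when scanning the header.
  */
 class StExampleMaker {
  public:
   StExampleMaker();            //!< constructor
   virtual ~StExampleMaker();   //!< destructor
 };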

HPSS tools & services

  • How to retrieve files from HPSS. Please, use the Data Carousel and ONLY the DataCarousel.
    Note: DO NOT use hsi to
    retrieve files from HPSS - this access mode locks tape drives for exclusive use (only you, not shared with any other user) and has dire impacts on STAR's operations, from production to data restores. If you are caught using it, you will be banned from accessing HPSS (your privilege to access HPSS resources will be revoked).
    Again - Please, use the Data Carousel.
  • Archiving into HPSS
    Several utilities exist. You can find the reference on the RCF HPSS Service page. Those utilities will bring you directly into the Archive class of service. Note that the DataCarousel can retrieve files from ANY class of service. The preferred mode for archiving is the use of htar.
    NOTE: You should NOT abuse those services to retrieve massive amounts of files from HPSS (your operation will otherwise clash with other operations, including stalling or slowing down data production). Use the DataCarousel instead for massive file retrieval. Abuse may lead to suppression of access to the archival service.
    • For rftp, history is in an Hypernews post Using rftp . If you save individual files and have lots of files in a directory, please avoid causing a meta-data lookup. Meta-data lookup happens when you 'ls -l'. As a reminder, please keep in mind that HPSS is NOT made for small files or large numbers of files in directories, but for massive large-file storage (on 2007/10/10 for example, a user crashed HPSS with a single 'ls -l' lookup of a 3000-file directory). In that regard, rftp is most useful if you first create an archive of your files yourself (tar, zip, ...) and push the archive into HPSS afterward. If this is not your mode of operation, the preferred method is the use of htar, which provides a command-line direct HPSS archive creation interface.
    • htar is the recommended mode for archiving into HPSS. This utility provides a tar-like interface allowing for bundling together several files or an entire directory tree. Note the syntax of htar and especially the extract below from this thread:
      If you want the file to be created in /home/<username>/<subdir1> and <subdir1> does not exist yet, use
      % htar -Pcf /home/<username>/<subdir1>/<filename> <source>
      
      If you want the file to be created into /home/<username>/<subdir2> and <subdir2> already exists, use
      % htar -cf /home/<username>/<subdir2>/<filename> <source>
      
      Please consult the help on the web for more information about htar.
    • File size is limited to be <55GB, and if exceeded you will get Error -22. In this case consider using split-tar. A simple example/syntax on how to use split-tar is:
      % split-tar -s 55G -c blabla.tar blabla/
      This will at least create blabla-000.tar but also the next sequences (001, 002, ...), each of 55 GBytes, until all files from directory blabla/ are packed. The magic 55 G suggested herein and in many posts works for any generation of drive for the past decade, but a limit of 100-150 GB should also work on most media at BNL as of 2016. See this post for a summary of past pointers.
    • You may make split-tar archives cross-compatible with htar by creating the htar indexes afterward. To do this, use a command such as 
      % htar -X -E -f blabla-000.tar
      this will create blabla-000.tar.idx you will need to save in HPSS along the archive.

       

Batch system, resource management system

SUMS (aka, STAR Scheduler)


SUMS, the product of the STAR Scheduler project, stands for STAR Unified Meta-Scheduler. This tool is currently documented on its own pages. SUMS provides a uniform user interface for submitting jobs on "a" farm: regardless of the batch system used, the language it provides (in XML) is identical. The scheduling is controlled by policies handling all the details of fitting your jobs into the proper queue, requesting proper resource allocation and so on. In other words, it isolates users from the infrastructure details.

You would benefit from starting with the following documents:
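For flavor only, a rough sketch of what a SUMS job description looks like (element names and attributes are quoted from memory and should be checked against the SUMS documentation; the macro name, paths and catalog query are hypothetical):

 <?xml version="1.0" encoding="utf-8" ?>
 <job maxFilesPerProcess="100">
   <!-- the command run for each process; $FILELIST is filled in by SUMS -->
   <command>root4star -q -b myMacro.C\(\"$FILELIST\"\)</command>
   <stdout URL="file:/star/u/jdoe/logs/$JOBID.log" />
   <!-- input resolved through the FileCatalog -->
   <input URL="catalog:star.bnl.gov?production=P08ic,filetype=daq_reco_MuDst" nFiles="1000" />
 </job>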

LSF

LSF was dropped from BNL facility support in July 2008 due to licensing costs. Please refer to the historical revision for information about it. If a link brought you here, please update it or send a note to the page owner. Information has been kept unpublished: You do not have access to view this node.

Condor

Quick start ...

Condor Pools at BNL

The condor resources are segmented into four pools, extracted from this RACF page:

production jobs: +Experiment = "star", +Job_Type = "crs" -- high priority CRS jobs, no time limit, may use all the slots on CRS nodes and up to 1/2 of the available job slots per system on CAS; the CRS portion is not available to normal users, and using this Job_Type as a normal user will fail
users normal jobs: +Experiment = "star", +Job_Type = "cas" -- short jobs, 3 to 5 hour soft limit (when resources are requested by others), 40 hour hard limit - this has higher priority than the "long" Job_Type
user long jobs: +Experiment = "star", +Job_Type = "long" -- long running jobs, 5 day soft limit (when resources are requested by others), 10 day hard limit, may use 1 job slot per system on a subset of machines
general queue: +Experiment = "general" or +Experiment = "star" -- general queue shared by multiple experiments, 2 hours guaranteed time minimum (can be evicted afterward by any experiment's specific jobs claiming the slot)

 

The Condor configuration does not create a simple notion of queues but rather a notion of pools. Pools are groups of resources spanning all STAR machines (RCAS and RCRS nodes) and even other experiments' nodes. The first column tends to suggest four such pools, although we will see below that life is more complicated than that.

First, it is important to understand that the +Experiment attribute is only used for accounting purposes; what makes the difference between a user job, a production job or a general job is really the other attributes. 

Selection of how your jobs will run is the role of the +Job_Type attribute. When it is unspecified, the general queue (spanning all RHIC machines at the facility) is assumed, but your job may not have the same time limit. We will discuss the restrictions later. The 4th column of the table above shows the CPU time limits and additional constraints, such as the number of slots within a given category one may claim. Note that +Job_type="crs" is reserved and its access will be enforced by Condor (only starreco may access this type).

In addition to using +Job_type, which as we have seen controls what comes as close as possible to a queue in Condor, one may need to restrict one's jobs to run on a subset of machines by using the CPU_Type attribute in the Requirements tag (if you are not completely lost by now, you are good ;-0 ).  An example to illustrate this:

+Experiment = "star"
+Job_type = "cas"
Requirements = (CPU_type != "crs") && (CPU_Experiment == "star")

In this example, a cas job (interpret this as "a normal user analysis job") is being run on behalf of the experiment star. The CPUs / nodes requested are those belonging to the star experiment, and the nodes are not RCRS nodes. By specifying those two requirements, the user is trying to make sure that the jobs will run on RCAS nodes only (CPU_type != "crs") AND, regardless of a possible switch to +Experiment="general", that the jobs will still run on nodes belonging to STAR only.

In this second example

+Experiment = "star"
+Job_type = "cas"
Requirements = (CPU_Experiment == "star")

we have pretty much the same request as before, but the jobs may also run on RCRS nodes. However, if data production runs (+Job_type="crs", which only starreco may start), the user's jobs will likely be evicted (as production jobs have higher priority) and the user may not want to risk that, hence the first Requirements tag. A complete submit description file assembling these pieces is sketched below.
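Here is a minimal sketch of such a submit file (the executable and log file names are illustrative, not mandated; remember the environment caveats discussed further down this page):

 universe      = vanilla
 executable    = myjob.csh            # hypothetical user script; it must load the STAR environment itself
 getenv        = true                 # or set up the environment by hand in the script (see below)
 +Experiment   = "star"
 +Job_type     = "cas"
 requirements  = (CPU_type != "crs") && (CPU_Experiment == "star")
 output        = myjob.out
 error         = myjob.err
 log           = myjob.log
 queue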

Pool rules

A few rules apply, summarized below:

  • Production jobs cannot be evicted from their claimed slots ... since they have higher priority than user jobs even on CAS nodes, this means that as soon as production jobs start, their pool of slots will slowly but surely be taken - user jobs may use those slots at low points of utilization.
  • User jobs can be evicted. Eviction happens after 3 hours of runtime from the time they start, but only if the slot they are running in is claimed by other jobs. For example, if a production job wants a node being used by a user job that has been running for two hours, then that user job has one hour left before it gets kicked out ...
  • This time limit comes into effect when a higher priority job wants the slot (i.e. production vs. user or production)
  • general queue jobs are evicted after two hours of guaranteed time when the slot is wanted by ANY STAR job (production, user)
  • general queue jobs will also be evicted if they consume more than 750 MB of memory

This provides the general structure of the Condor policy in place for STAR. The other policy options in place go as follows:

  1. The following options apply to all machines: the 1 min load has to be less than 1.4 on a two-CPU node for a job to start
  2. General queue jobs will not start on any node unless 1 min < 1.4, swap > 200M, memory > 100M.
  3. User fairshare is in place.

In the land of confusion ...

Also, users are often confused about the meaning of their job priority. Condor will consider a user's job priority and submit jobs in priority order (where the larger the number, the more likely the job will start), but those priorities have NO meaning across two distinct users. In other words, it is not because user A sets job priorities larger by an order of magnitude compared to user B that his job will start first. Job priority only provides a mechanism for a user to specify which of their idle jobs in the queue are most important. Jobs with higher numerical priority should run before those with lower priority, although because jobs can be submitted from multiple machines, this is not always the case. Job priorities are listed by the condor_q command in the PRIO column.

The effective user priority is dynamic, on the other hand, and changes as a user has been given access to resources over a period of time. A lower numerical effective user priority (EUP) indicates a higher priority. Condor's fairshare mechanism is implemented via EUP. The condor_userprio command hence provides an indication of your fairshare standing.

You should be able to use condor_qedit to manually modify the "Priority" parameter, if desired; see the example just below. If a job does not run for weeks, there is likely a problem with its submit file or one of its inputs, in particular its Requirements line. You can use condor_q -analyze JOBID, or condor_q -better-analyze JOBID, to determine why it cannot be scheduled.
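For instance (a hedged example; the job ID is illustrative):

 % condor_qedit 26875.0 Priority 10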

 

What you need to know about Condor

First of all, we recommend you use SUMS to submit to Condor, as it takes care of adding the code, tricks and tweaks needed to make sure your jobs run smoothly. But if you really don't want to, here are a few issues you may encounter:

  • Unless you use the GetEnv=true datacard directive in your condor job description, Condor jobs will start with a blank set of environment variables, unlike a shell startup. In particular, none of
    SHELL, HOME, LOGNAME, PATH, TERM and MAIL
    will be defined. The absence of $HOME has the side effect that, whenever a job starts, your .cshrc and .login will not be seen, hence your STAR environment will not be loaded. You must take this into account and execute the STAR login by hand (within your job file).
    Note that using GetEnv=true has its own side effects, which include a full copy of the environment variables as defined on the submitter node. This will NOT be suitable for distributed computing jobs, and the use of the getenv() C primitive in your code is especially questionable (it will unlikely return a valid value).
    • STAR users may look at this post for more information on how to use ROOT function calls for defining some of the above.
    • You may also use the getent shell command (if it exists) to get the value of your home directory
    • A combination of getpwuid() and getpwnam() would allow you to define $USER and $HOME (see the sketch after this list)
       
  • Condor follows a multi-submitter node model with no centralized repository for all jobs. As a consequence, whenever you use a command such as condor_rm, you would kill the jobs you have submitted from that node only. To kill jobs submitted from other submitter nodes (any interactive node at BNL is a potential submitter node), you need to loop over the possibilities and use the -name command line option.
     
  • Condor will keep your jobs indefinitely in the Pool unless you either remove the jobs or specify a condition allowing jobs to be automatically removed based on status and expiration time. The few examples below could be used for the PeriodicRemove Condor datacard
    • To automatically remove jobs which have been in the queue for more than 2 days but marked as status 5 (held for one reason or another and not moving), use
      (((CurrentTime - EnteredCurrentStatus) > (2*24*3600)) && JobStatus == 5)
    • To automatically remove jobs running in the queue for more than 2 days but using less than 10% of the CPU (probably looping or inefficient jobs blocking a job slot), use
      (JobStatus == 2 && (CurrentTime - JobCurrentStartDate > (54000)) && 
                          ((RemoteUserCpu+RemoteSysCpu)/(CurrentTime-JobCurrentStartDate) < 0.10))
    The full current condition SUMS adds to each job is
    PeriodicRemove  = (JobStatus == 2 && (CurrentTime - JobCurrentStartDate > (54000)) && 
                       ((RemoteUserCpu+RemoteSysCpu)/(CurrentTime-JobCurrentStartDate) < 0.10)) || 
                      (((CurrentTime - EnteredCurrentStatus) > (2*24*3600)) && JobStatus == 5)
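As referenced in the environment discussion above, here is a minimal C sketch (illustrative only, not STAR-specific code) of recovering $USER and $HOME via getpwuid():

 #include <pwd.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>

 int main(void) {
     /* look up the passwd entry of the user the job runs as */
     struct passwd *pw = getpwuid(getuid());
     if (pw != NULL) {
         setenv("USER", pw->pw_name, 1);  /* define $USER */
         setenv("HOME", pw->pw_dir, 1);   /* define $HOME */
         printf("USER=%s HOME=%s\n", pw->pw_name, pw->pw_dir);
     }
     return 0;
 }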

Some condor commands

This is not meant to be an exhaustive set of commands nor a tutorial. You are invited to read the manpages for condor_submit, condor_rm, condor_q and condor_status. Those will cover most of what you will need on a daily basis. Help for version 6.9 is available online.

  • Query and information
    • condor_q -submitter $USER
      List jobs of specific submitter $USER from all the queues in the pool
    • condor_q -submitter $USER -format "%s\n" ClusterID
      Shows the JobID for all jobs of $USER. This command may succeed even when an unconstrained condor_q would fail under a large amount of jobs
    • condor_q -analyze $JOBID
      Perform an approximate analysis to determine how many resources are available to run the requested jobs.
    • condor_status -submitters
      shows the numbers of running/idle/held jobs for each user on all machines
    • condor_status -claimed
      Summarize jobs by servers as claimed
    • condor_status -avail
      Summarize resources which are available
       
  • Removing jobs, controlling them
    • condor_rm $USER
      removes all of your jobs submitted from this machine
    • condor_rm -name $node $USER
      removes all jobs for $USER submitted from machine $node
    • condor_rm -forcex $JOBID
      Forces the immediate local removal of jobs in undefined state (only affects jobs already being removed). This is needed if condor_q -submitter shows your job but condor_q -analyze $JOBID does not (indicating an out of sync information at Condor level).
    • condor_release $USER
      releases all of your held jobs back into the pending pool for $USER
    • condor_vacate -fast
      may be used to remove all jobs from the submitter node job queue. This is a fast mode command (no checks) and applies to running jobs (not pending ones)
       
  • More advanced
    • condor_status -constraint 'RemoteUser == "$USER@bnl.gov"'
      lists the machines on which your jobs are currently running
    • condor_q -submitter username -format "%d" ClusterId -format "  %d" JobStatus -format "   %s\n" Cmd
      shows the job id, status, and command for all of your jobs.  1==Idle, 2==Running for Status.  I use something like this because the default output of condor_q truncates the command at 80 characters and prevents you from seeing the actual scheduler job ID associated with the Condor job.  I'll work on improving this command, but this is what I've got for now.
    • To access the reason for job 26875.0 to be held from a submitter node advertized to be rcas6007, use the following command to have a human readable format
      condor_q -pool condor02.rcf.bnl.gov:9664 -name rcas6007 -format "%s\n" HoldReason 26875.0
       

 

Computing Environment


The pages below will give you a rapid overview of the computing environment at BNL, including information for visitors and employees, accessible printers, best practices, recommended tools for managing Windows.

FAQs and Tips

Software Site Licenses

Do we have a site license for software package XYZ?

The answer is (almost) always: No!
Neither STAR nor BNL has site licenses for any Microsoft product, Hummingbird Exceed, WinZIP, ssh.com's software, or much of anything intended to run on individual users' desktops. Furthermore, for most purposes BNL-owned computers do not qualify for academic software licenses, though exceptions do exist.

FAQ: PDF creation

How can I create a file in pdf format?

Without Adobe Acrobat (an expensive bit of software), this can be a daunting question. I am researching answers, some of which are available in my Windows software tips. Here is the gist of it in a nutshell as I write this -- there are online conversion services and OpenOffice is capable of exporting PDF documents.

FAQ: X Servers

What X server software should I use in Windows?

I recommend trying the X Server that is available freely with Cygwin, for which I have created some documentation here: Cygwin Tips. If you can't make that work for you, then I next recommend a commercial product called Xmanager, available from http://www.netsarang.com. Last time I checked, you could still download a fully functional version for a time-limited evaluation period.

TIP: Windows Hibernation trick

Hibernate or Standby -- There is a difference which you might find handy: 
  • "Standby" puts the machine in a low power state from which it can be woken up nearly instantly with some stimulus, such as a keystroke or mouse movement (much like a screensaver) but the state requires a continuous power source.  The power required is quite small compared to normal running, but it can eventually deplete the battery (or crash hard if the power is lost in the case of a desktop).
  • "Hibernate" actually dumps everything in memory to disk and turns off the computer, then upon restarting it reloads the saved memory and basically is back to where it was.  While hibernating, no power source is required.  It can't wake up quickly (it takes about as long as a normal bootup), but when it does wake up, (almost) everything is just the way you left it.  One caveat about networking is in order here:  Stateful connections (eg. ssh logins) are not likely to survive a hibernation mode (though you may be able to enable such a feature if you control both the client and server configurations), but most web browsing activity and email clients, which don't maintain an active connection, can happily resume where they left off.

Imagine:  the lightning is starting, and you've got 50 windows open on your desktop that would take an hour to restore from scratch.  You want to hibernate now!  Here's how to enable hibernating if it isn't showing up in the shutdown box: 
Open the Control Panels and open "Power Options".  Go to the "Hibernate" tab and make sure the the box to enable Hibernation is checked.  When you hit "Turn Off Computer" in the Start menu, if you still only see a Standby button, then try holding down a Shift key -- the Standby button should change to a Hibernate button.  Obvious, huh?

For the curious:
There are actually six (or seven depending on what you call "official") ACPI power states, but most motherboards/BIOSes only support a subset of these.  To learn more, try Googling "acpi power state", or you can start here as long as this link works.  (Note there is an error in the main post -- the S5 state is actually "Shutdown" in Microsoft's terminology). 
From the command line, you can play around with these things with such straightforward commands as:

%windir%\System32\rundll32.exe powrprof.dll,SetSuspendState 1 

Even more obvious, right?  If you like that, then try this on for size.

TIP: My new computer is broken!:

It's almost certainly true - your new computer is faulty and the manufacturer knows it!  Unfortunately, that's just a fact of life.  Straight out of the box, or after acquiring a used PC, you might just want to have a peek at the vendor's website for various updates that have been released.  BIOS updates for the motherboard are a good place to start, as they tend to fix all sorts of niggling problems.  Firmware updates for other components are common as are driver updates and software patches for pre-installed software.  I've solved a number of problems applying these types of updates, though it can take hours to go through them thoroughly and most of the updates have no noticeable effect.  And it is dangerous at times.  One anecdote to share here -- we had a common wireless PC Card adapter that was well supported in both Windows and Linux.  The vendor provided an updated firmware for the card, installed under Windows.  But it turned out that the Linux drivers wouldn't work with the updated firmware.  So back we went to reinstall a less new firmware.  You'll want to try to be intelligent and discerning in your choices.  Dell for instance does a decent job with this (your Dell Service Tag is one very useful key here), but still requires a lot from the updater to help ensure things go smoothly.  This of course is in addition to OS updates that are so vital to security and discussed elsewhere.

Printers


STAR's publicly available printers are listed below. 


IP name: lj4700.star.bnl.gov
  IP address: 130.199.16.220
  Model: HP Color LaserJet 4700DN
  Location: 510, room M1-16
  rcf2 queue name: lj4700-star
  Wireless (Corus) CUPS URL: http://cups.bnl.gov:631/printers/HP_Color_LaserJet_4700_2
  Features: color, duplexing; driver download site (search for LaserJet 4700; the PCL driver is recommended)

IP name: lj4700-2.star.bnl.gov
  IP address: 130.199.16.221
  Model: HP Color LaserJet 4700DN
  Location: 510, room M1-16
  rcf2 queue name: lj4700-2-star
  Wireless (Corus) CUPS URL: http://cups.bnl.gov:631/printers/lj4700-2.star.bnl.gov
  Features: color, duplexing; driver download site (search for LaserJet 4700; the PCL driver is recommended)

IP name: hp510hall.star.bnl.gov
  IP address: 130.199.16.222
  Model: HP LaserJet 2200DN
  Location: 510, outside 1-164
  rcf2 queue name: hp510hall
  Wireless (Corus) CUPS URL: http://cups.bnl.gov:631/printers/hp510hall
  Features: B&W, duplexing

IP name: starhp2.star.bnl.gov
  IP address: 130.199.16.223
  Model: HP LaserJet 8100DN
  Location: 510M, hallway
  rcf2 queue name: starhp2_p
  Wireless (Corus) CUPS URL: http://cups.bnl.gov:631/printers/starhp2.star.bnl.gov
  Features: B&W, duplexing

IP name: onlprinter1.star.bnl.gov
  IP address: 130.199.162.165
  Model: HP Color LaserJet 4700DN
  Location: 1006, Control Room
  rcf2 queue name: staronl1
  Wireless (Corus) CUPS URL: http://cups.bnl.gov/printers/onlprinter1.star.bnl.gov
  Features: color, duplexing

IP name: chprinter.star.bnl.gov
  IP address: 130.199.162.178
  Model: HP Color LaserJet 3800dtn
  Location: 1006C, mailroom
  rcf2 queue name: n/a
  Wireless (Corus) CUPS URL: N/A
  Features: color, duplexing

There are additional printing resources available at BNL, such as large format paper, plotters, lamination and such.  Email us at starsupport 'at' bnl.gov and we might be able to help you locate such a resource.

 

Printing from the wireless (Corus) network

The "standard" way of printing from the wireless network is to go through ITD's CUPS server on the wireless network.  How to do this varies from OS to OS, but here is a Windows walkthrough.  The key thing is getting the URI for the printer into the right place:
 

  • Open the Printers Control Panel and click "Add a Printer". 
  • Select the option to add a network printer.  (Ignore the list of printers that it generates automatically).
  • Click on the button or option for "the printer that I want isn't listed". 
  • Select the option for a shared printer and enter the CUPS URL from the list above for the printer you want.
    eg. http://cups.bnl.gov:631/printers/HP_Color_LaserJet_4700_2
  • On the next window, select the hardware manufacturer and model (if not listed, let Windows search for additional models).
  • Print a test page and cross your fingers... 
  • If your test print does not come out, it doesn't necessarily mean your configuration is wrong - sometimes a problem occurs on the CUPS server that prevents printing - it isn't always easy to tell where the fault lies.
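
If you are on Linux rather than Windows, a minimal sketch of the equivalent setup, assuming the CUPS client tools are installed (the local queue name "lj4700" here is just an illustration):

lpadmin -p lj4700 -E -v ipp://cups.bnl.gov:631/printers/HP_Color_LaserJet_4700_2
lp -d lj4700 test.ps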

 

Since printing through ITD's CUPS servers at BNL has not been very reliable, here are some less convenient alternatives to using the printers that you may find handy.  (Note that with these, you can even print on our printers while you are offsite - probably not something to do often, but might come in handy sometimes.)
 

1.  Use VPN.  But if you are avoiding the internal network altogether for some reason, or can't use the VPN client, then keep reading...

2.  Get your files to rcf2.rhic.bnl.gov and print from there.  Most of the printers listed above have rcf print queues (hence the "rcf2 queue name" field).  But if you want to use a printer for which there is no queue on rcf2, or you have a format or file type that you can't figure out how to print from rcf2, then the next tip might be what you need.

3.  SSH tunnels can provide a way to talk directly (sort-of) to almost any printer on the campus wired network.  At least as far as your laptop's print subsystem is concerned, you will be talking directly to the printer.  (This is especially nice if you want to make various configuration changes to the print job through a locally installed driver.)  But if you don't understand SSH tunnels, this is gonna look like gibberish:

Here is the basic idea, using the printer in the Control Room.
It assumes you have access to both the RSSH and STAR SSH gateways.

The ITD SSH gateways might also work in place of rssh (I haven't
tried them yet).  If they can talk directly to our printers,
then it would eliminate step C below.

A.  From your laptop:

ssh -A -L 9100:127.0.0.1:9100 <username>@rssh.rhic.bnl.gov

(Note 1:  -A is only useful if you are running an ssh-agent with a
loaded key, which I highly recommend)

(Note 2:   Unfortunately, the rssh gateways cannot talk directly to our
printers, so we have to create another tunnel to a node that can...  If the
ITD SSH gateways can communicate directly with the printers, then the
next hop would be unnecessary...)

B.  From the rssh session:

ssh -L 9100:130.199.162.165:9100 <username>@stargw1.starp.bnl.gov

(Note 1: 130.199.162.165 is the IP address of onlprinter1.star.bnl.gov -
it could be replaced with any printer's IP address on the wired network.)
(Note 2:  port 9100 is the HP JetDirect default port - non-HP printers
might not use this, and there are other ways of communicating with HP
network printers, so ymmv - but the general idea will work with most TCP 
communications, if you know the port number in use.)

C.  On your laptop, set up a local print queue as if you were going to
print directly to the printer over the network (with no intermediate
server), but instead of supplying the printer's IP address, use
127.0.0.1 instead.

D. Start printing...


If you close either of the ssh sessions above, you will have to
re-establish them before you can print again. 

The two ssh commands can be combined into one and you can create an alias to
save typing the whole thing each time.  (Or use PuTTY or some other GUI SSH client
wrapper to save these details for reuse.)
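
For example, something along these lines (a sketch only -- the -t forces a
terminal for the second hop; adjust usernames and hosts as needed):

ssh -A -t -L 9100:127.0.0.1:9100 <username>@rssh.rhic.bnl.gov \
    ssh -L 9100:130.199.162.165:9100 <username>@stargw1.starp.bnl.gov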

You could set up multiple printers this way, but to use them
simultaneously, you would need to use unique port numbers for each one
(though the port number at the end of the printer IP would stay 9100).

 

Direct connection, internal network

You can use direct connections to access the printers over the network.

  • Direct:  These printers accept direct TCP/IP connections, without any intermediate server. 
  • JetDirect (AppSocket) and lpd usually work under Linux. 
  • For Windows NT/2K/XP, a Standard TCP/IP port is usually the way to go. 

How to configure this varies with OS and your installed printing software.
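
As one illustration, on a CUPS-based Linux machine on the wired network, a raw JetDirect queue might be created like this (the queue name is arbitrary; supplying a PPD file with -P would enable driver features such as tray selection):

lpadmin -p onlprinter1 -E -v socket://130.199.162.165:9100
lp -d onlprinter1 myfile.ps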

Tips

What follows are miscellaneous tips and suggestions that will be irregularly maintained.

  • The 2-sided printers are configured to print 2-sided by default, but the default for many printer drivers will override this and specify 1-sided.  If you are printing from Windows, you can usually choose your preferences for this in the printer preferences or configuration GUI.  You may need to look in the Advanced Settings and/or Printing Defaults to enable 2-sided printing in Windows.
  • Depending on the print method and drivers used, from the Linux command line you may be able to specify various options for things like duplex printing.  To see available options for a given print queue, try the "lpoptions" command.  For instance, on rcf2 you could do "lpoptions -d xerox7300 -l".  In the output, you will find a line like this:  "Duplex/2-Sided Printing: DuplexNoTumble *DuplexTumble None"  (DuplexNoTumble is the same as flip on long edge, while DuplexTumble is the same as flip on short edge, and the * indicates the default setting.)  So to turn off duplex printing, you could do "lp -d xerox7300 -o Duplex=None <filename>".  Keep in mind that not all options listed by lpoptions may actually be supported by the printer, and the defaults (especially in the rcf queues) may not be what you'd like.  There are so many print systems, options and drivers in Linux/Unix that there's no way to quickly describe all the possible scenarios.
  • There is a handy utility called a2ps that is available in most Linux distributions. It is an "Any to PostScript" filter that started as a Text to PostScript converter, with pretty-printing features and all the expected features from this kind of program. But it is also able to deal with other file types (PostScript, Texinfo, compressed, whatever...) provided you have the necessary tools installed. (See the example after this list.)

  • psresize is another useful utility in Linux for dealing with undesired page sizes. If you are given a PostScript file that specifies A4 paper, but want to print it on US Letter-sized paper, then you can do:
    psresize -PA4 -pletter in.ps out.ps
    See the man page for more information.
  • Some of the newer printers have installation wizards for Windows that can be accessed through their web interfaces. I've had mixed success with the HP IPP installation wizards. The Xerox wizard (linked above) has worked well, though it pops up some unnecessary windows and is a bit on the slow side.

  • Windows 9x/Me users will likely have to install software on their machines in order to print directly to these printers. HP and Xerox have such software available for download from their respective support websites, but who uses these OSes anymore?

  • For Linux users setting up new machines, CUPS is the default printing system in recent distributions (unless upgrading from an older distribution, in which case LPRng may still be in use).  Given an appropriate PPD file, CUPS is capable of utilizing various print options, such as tray selection and duplexing, or at least you can create different queues with different options to a single printer.

  • There are other potentially useful printers around that are not catalogued here. Some are STAR printers out of the mainstream (like in 1006D), and some belong to other groups in the physics department.
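
As an illustration of the a2ps tip above (the file and queue names are only examples):

a2ps -2 -o listing.ps StMaker.cxx    (pretty-print a source file, two virtual pages per sheet)
lp -d lj4700-star listing.ps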

Quick (?) start guide for visitors with laptops

So you brought a laptop to BNL… and the first thing you want to do is get online, right?
Ok, here's a quick (?) guide to getting what you want without breaking too many rules.

Wired Options:

  • Visitors' network: Dark purple jacks (usually labeled VNxx) are on a visitors' network and are effectively outside of the BNL firewall. They support DHCP and do not require any sort of registration to use. Being outside the firewall can be advantageous, but will prevent you from
    using some network services within BNL (printing, for instance). (The rest of this page is largely irrelevant if you are using the visitors' network.)

  • BNL network: If it isn't dark purple (and it isn't a phone jack) then it is on the BNL network, which supports DHCP on most subnets. (NB. The 60/61 subnet (available in parts of 1006, including the WAH) has a locally managed DHCP server -- contact Wayne Betts to be added to the access list). All devices on the BNL networks are required to be registered based on the MAC address that is unique to each network interface. To help enforce this policy, if you request a DHCP
    address from an unregistered node, you will be assigned a restricted address. With a restricted IP address, your web browser will be automatically redirected to the BNL registration page, and you will be unable to surf anywhere else until you are registered.

    When registering a laptop, fill in "varies" for the location fields. For the computer name field, I recommend using "DHCP Client" (unless you have a static IP address of course).

    Previously registered users are encouraged to verify and update their registration information by going to http://register.bnl.gov from the machine to be updated.

    There you can also find out more about the registration system and find links to some useful information for network users.

     

Windows

This area is intended to provide information for STAR members to assist in configuring and using typical desktop/laptop PCs at BNL.

  Windows 2000/XP and Scientific Linux/Redhat Enterprise Linux are the preferred Operating Systems within STAR at BNL for desktop computing, though there is no formal requirement to use any particular OS.

  These pages are intended to be dynamic, subject to the constantly changing software world and user input.   Feedback from users -- what you find indispensable; what is misleading, confusing or flat-out wrong; and what is missing that you wish was here -- can help to significantly increase the value of these pages.

  Additional pages that are under consideration for creation:

  • Windows installation checklist (the basic software and configuration that should probably be on every Windows PC)
  • Linux installation checklist
  • Common Linux details and useful links, such as Linux equivalents to software for Windows.
  • Resources specific to the experiment operations (eg. common DAQ NFS mounts)
  • Publicly usable terminals

Cygwin installation and tips

To quote from the Cygwin website:  "Cygwin is a Linux-like environment for Windows."

The Linux-like nature is quite comprehensive...  You can *almost* forget that you are using a Windows OS -- most utilities and software that you are familiar with from your Linux experience are available in Cygwin.  For example, the Cygwin distribution has available an openssh client (and the server too, but I don't recommend you use it), PostScript and PDF viewers and editors, compression (eg. zip) utilities, software development tools and X Windows packages (more on X below). 

Using the Cygwin X server

An example of Cygwin's usefulness and cost-saving potential is the X server.  The Cygwin X server is, in most cases, easy and convenient to use in place of commercial X servers such as Hummingbird Exceed.  Here is the short version for those familiar with Cygwin installations:
  1. You need the xorg-x11-base and X-startup-scripts packages (and whatever dependencies they have, which the setup routine should solve for you).  You'll probably also want the xwinclip package.  All of these are in the X11 Category in the Cygwin Setup.
  2. Execute "startxwin.bat" (in <cygwin_root>/usr/X11R6/bin/).  That will start a stand-alone X Server and an xterm with a cygwin shell.   Edit this batch file as you see fit -- it includes documentation for a number of options. 
  3. If you are displaying windows from a remote session over ssh, be sure you have X tunneling enabled in your ssh client configuration.  Please do not try to open up your X server to the entire world with anything like "xhost +".  That is a *VERY BAD IDEA*.
  4. In light of step 3 above:  If you have a local firewall that asks about blocking access to the Xserver, you can usually block it without a problem -- if you have X forwarding enabled and working, then you are usually ok.  (If you believe a localhost-based firewall is interfering with X, try allowing only connections from the loopback/localhost address (127.0.0.1)).
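Once the X server and local xterm are up, a typical remote-display session might look like this (a sketch; -X requests X11 forwarding, and some setups need -Y for trusted forwarding instead):

ssh -X <username>@rssh.rhic.bnl.gov
xterm &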
Long version:  Walkthrough of a Cygwin installation (MS Word doc).

Subsidiary recommendation:

There is a handy tool for initiating shell connections to remote hosts (such as via ssh) and starting the Cygwin X server called Mortens Cygwin X-Launcher.  Coming soon (?): screenshots of the X-Launcher configuration that are most likely to be useful...

Installation Tip:

A Cygwin mirror is available at http://mirror.bnl.gov/cygwin/ making the installation go quite quickly if you are at BNL.  This is handy for the initial Cygwin installation and any subsequent use of the setup utility.  One potential catch for onsite users -- even if you intend to use the local mirror, you must still configure a BNL proxy server during Setup, as shown in this walkthrough of a Cygwin installation (MS Word format).
Please send comments, corrections and suggestions to Wayne Betts: wbetts {at} bnl.gov

FAQS and Tips that don't fit well elsewhere

Software Site Licenses:

Do we have a site license for software package XYZ?

The answer is (almost) always: No!
Neither STAR nor BNL have site licenses for any Microsoft product, Hummingbird Exceed, WinZIP, ssh.com's software or much of anything intended to run on individual users' desktops. Furthermore, for most purposes BNL-owned computers do not qualify for academic software licenses, though exceptions do exist.

FAQ: PDF creation:

How can I create a file in pdf format?

Without Adobe Acrobat (an expensive bit of software), this can be a daunting question. I am researching answers, some of which are available in my Windows software tips. Here is the gist of it in a nutshell as I write this -- there are online conversion services and OpenOffice is capable of exporting PDF documents.

FAQ: X Servers:

What X server software should I use in Windows?

I recommend trying the X Server that is available freely with Cygwin, for which I have created some documentation here: Cygwin Tips. If you can't make that work for you, then I next recommend a commercial product called Xmanager, available from http://www.netsarang.com. Last time I checked, you could still download a fully functional version for a time-limited evaluation period.

TIP: Windows Hibernation trick:

Hibernate or Standby -- There is a difference which you might find handy: 
  • "Standby" puts the machine in a low power state from which it can be woken up nearly instantly with some stimulus, such as a keystroke or mouse movement (much like a screensaver) but the state requires a continuous power source.  The power required is quite small compared to normal running, but it can eventually deplete the battery (or crash hard if the power is lost in the case of a desktop).
  • "Hibernate" actually dumps everything in memory to disk and turns off the computer, then upon restarting it reloads the saved memory and basically is back to where it was.  While hibernating, no power source is required.  It can't wake up quickly (it takes about as long as a normal bootup), but when it does wake up, (almost) everything is just the way you left it.  One caveat about networking is in order here:  Stateful connections (eg. ssh logins) are not likely to survive a hibernation mode (though you may be able to enable such a feature if you control both the client and server configurations), but most web browsing activity and email clients, which don't maintain an active connection, can happily resume where they left off.

Imagine:  the lightning is starting, and you've got 50 windows open on your desktop that would take an hour to restore from scratch.  You want to hibernate now!  Here's how to enable hibernating if it isn't showing up in the shutdown box: 
Open the Control Panels and open "Power Options".  Go to the "Hibernate" tab and make sure the box to enable Hibernation is checked.  When you hit "Turn Off Computer" in the Start menu, if you still only see a Standby button, then try holding down a Shift key -- the Standby button should change to a Hibernate button.  Obvious, huh?

For the curious:
There are actually six (or seven depending on what you call "official") ACPI power states (S0 working through S5 soft-off, plus mechanical-off if you count it), but most motherboards/BIOSes only support a subset of these.  To learn more, try Googling "acpi power state", or you can start here as long as this link works.  (Note there is an error in the main post -- the S5 state is actually "Shutdown" in Microsoft's terminology.)
From the command line, you can play around with these things with such straightforward commands as:
%windir%\System32\rundll32.exe powrprof.dll,SetSuspendState 1
Even more obvious, right?  If you like that, then try this on for size.
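
If the rundll32 incantation is too opaque, the powercfg utility can toggle hibernation support from a command prompt (available on XP SP2 and later, if memory serves -- treat this as a hint rather than gospel):

powercfg /hibernate on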

TIP: My new computer is broken!:

It's almost certainly true - your new computer is faulty and the manufacturer knows it!  Unfortunately, that's just a fact of life.  Straight out of the box, or after acquiring a used PC, you might just want to have a peek at the vendor's website for various updates that have been released.  BIOS updates for the motherboard are a good place to start, as they tend to fix all sorts of niggling problems.  Firmware updates for other components are common, as are driver updates and software patches for pre-installed software.  I've solved a number of problems by applying these types of updates, though it can take hours to go through them thoroughly and most of the updates have no noticeable effect.  And it is dangerous at times.  One anecdote to share here -- we had a common wireless PC Card adapter that was well supported in both Windows and Linux.  The vendor provided an updated firmware for the card, installed under Windows.  But it turned out that the Linux drivers wouldn't work with the updated firmware.  So back we went and reinstalled the older firmware.  You'll want to be intelligent and discerning in your choices.  Dell, for instance, does a decent job with this (your Dell Service Tag is one very useful key here), but still requires a lot from the updater to help ensure things go smoothly.  This is of course in addition to OS updates, which are vital to security and discussed elsewhere.



Please send comments, corrections and suggestions to Wayne Betts: wbetts {at} bnl.gov

Networking Software

Networking
Software

  • PuTTY:
     This is the preferred SSH client for Windows.  It is free, easy to use
    and well maintained for both security and bug issues.
     (As with everything, it is only "maintained" if you regularly check
    for updated versions!)
     Please note that most other SSH clients for Windows are NOT free for
    use on government computers or in the pursuit of lab business, though
    they might function just fine without payment.

  • WinSCP:  This is a fine graphical SFTP and SCP client utility with some additional features built in.

  • X servers (no, Exceed doesn't make the cut because of the high monetary cost):

    • Cygwin:  Please look at the separate Cygwin page for information on installing and configuring the Cygwin X server.

    • Xmanager:  I
      recommend that you use the Cygwin X server, but if you find something
      that it can't handle, then this is the recommended alternative. 
      It isn't free (but it does have fully functional time-limited
      evaluation license if you want to try it out.) 
      It is much cheaper than Exceed and seemingly just as capable, but
      without quite as much overhead. 
      I'm particularly interested in hearing about X Server alternatives, so
      let me know if you have a favorite!

  • Alternatives to Microsoft's Internet Explorer and Outlook Express:

     As
    the leading web browser and mail client, these two apps are the target
    of prolific viruses, trojans, malware and other nasties. 
    In addition to avoiding many of these, you may also like some of the
    features available in the alternatives (eg. tabbed browsing is a
    popular feature unavailable in IE). 
    Four alternatives are in common use (three of them share much of the
    same code-base -- Mozilla, Netscape Navigator and Firefox). 
     This review
    might help you sort out the differences.
     As with anything, your preference is yours to decide (and also, as
    with everything else here, feature and security updates are released
    quite often, so you might try to check for new versions regularly): 
    They are listed here from highest recommendation to lowest:

    1. Firefox/Thunderbird: 
      Though frequently mentioned as a pair, Firefox and Thunderbird are
      stand-alone applications. 
      Firefox is a web browser, and Thunderbird is an email client. 
      "Stand alone" here means that these can be installed separately from
      each other. 
      You can configure them to work with alternative software as you wish
      (eg. use Firefox for surfing, but set Outlook as your default mail
      client). Actually, you can generally mix and match pieces from all of
      these alternatives, but most of them start out with defaults tied to
      their suite companions. 
      Slight thumbs up to Firefox over the other alternatives because it has
      almost every feature found in the corresponding Mozilla suite, plus
      additional add-ons. 
      Vast numbers of independently produced add-ons and customizations are
      available as well.
    2. Mozilla Suite: 
      A suite that includes the big three:  a browser, email client and HTML
      editor. 
      This is a fine alternative, but as a browser alternative, this author
      gives the bigger thumbs up to its sibling, Firefox, listed above.
    3. Opera. 
      It is available in a free version with a "branding" bar that contains
      advertisements, or you can buy the product to remove this minor
      annoyance.  (Branding/non-branding examples.)
    4. Netscape: 
      The Netscape suite includes a browser (Navigator), email client (Mail),
      HTML editor (Composer) and other tidbits. 
      Of the three Mozilla-based browsers, this is probably the least used
      and has the most extraneous stuff thrown in, which is one of several
      reasons it gets last place in this list.
        It is good enough to recommend, but just not quite as highly as the
      others.

  • Java, WebStart, JRE, J2RE, JSDK,
    Microsoft VM and all that Jazz...: The author of this segment finds
    this to be very puzzling and sometimes frustrating stuff to understand,
    keep up with, and especially to try to explain clearly and succinctly. 
    <Melodrama> Imagine Sun, IBM and Microsoft all walked into a bar
    and had a few drinks. Heck, let Netscape walk in a few minutes later
    for good measure. 
    Fifty states' attorneys general plus the US AG and DOJ are to act as a
    referee. 
    Now imagine that you, a mere passerby on the street were harangued into
    cleaning up the inevitable bar fight, complete with broken bottles,
    flying bar stools and blood everywhere all while it is still going on. 
    That's not even close to how awful it is...</Melodrama>   Details
    to be filled in here!

  • OpenAFS, MIT Kerberos, Wake and Leash: Details to be filled in here!

  • Google Toolbar :
    This is a very convenient interface to initiate Google searches, plus a
    decent pop-up blocker. Unfortunately, it is only available for Internet
    Explorer (though other browsers may support similar features natively).

Please send comments, corrections and suggestions to Wayne Betts: wbetts {at} bnl.gov

Office applications and productivity software

Productivity software and viewers/utilities for various file types

  • OpenOffice -- Free and available on multiple platforms.  Perhaps the single best reason to use it is that it natively creates PDF format.  In addition to its own formats, it can read (and write) MS Word, Excel and PowerPoint files (usually -- sometimes formatting details go haywire, but they are constantly updating it.)

  • Adobe Reader -- Used for viewing PDF documents.  (You will probably want to install it with the very useful text search feature.) (Linux users can try xpdf as an alternative which is part of many distributions.)

  • Ghostscript and GSview: PostScript interpreter and viewer (and PDF too) that you probably want to have.

  • Online Document Conversion Services:  Neevia Technology and CERN Document Conversion Service both have file convertors that allow you to submit a variety of common (and uncommon) file formats in small numbers and produce files in different formats (PDF being of most interest probably).  Though not convenient for many files or very large files (and certainly inappropriate for confidential or non-public information), they are good to know about.  (Don't forget -- OpenOffice is able to export documents in PDF format too and handles a lot of file types.)

  • Graphics and Image Manipulation software:  The GIMP and ImageMagick are both quite capable tools available for free for multiple platforms.  Perhaps not perfect replacements for Adobe PhotoShop, but pretty darn good.  (If you are a PhotoShop veteran, then you'll have to spend some time learning the ropes, but it will probably be worth it.)

  • Compression Utilities:  WinZip is not free (though many, many people use it without payment).  Fortunately, there are freeware alternatives.  For instance:
    • 7Zip:  This is the current recommendation of this page, the reasons for which may be included in the future.
    • FreeZip (but not "FreeZip!" which is reported to contain spyware and/or adware)
    • ZipCentral
    • ZipItFast
    • ExtractNow
    • CAMUnZip
    • ZipWrangler
    • Freebyte Zip

  • If you've ever spent a few minutes waiting for MS Windows Search function to find a file on your system, then you might find the following can save you some time. The basic idea is similar to most internet search engines: index your files (while the computer would otherwise be idle so as not to slow things down for the user) and then consult the indexes when a search is requested:

    • Yahoo! Desktop Search:  This is a free version of a well respected product from X1 with a few features removed, such as indexing of remote drives, Eudora and Mozilla-based email.
    • Google Desktop Search:  Use Google's Desktop Search to quickly search for files on your computer using an indexing system much like Google's web indexing.  Not all file types are supported, but most common ones are, such as Outlook mail, MS Office documents and so on.
    • Copernic Desktop Search:  This is similar to the Google Desktop Search, but appears to be a bit more capable, though as of this writing I have not had time or cause to test it much.  User comments would be appreciated.
    • Windows 2000 and XP include an "Indexing Service" which (according to Microsoft) is "a base service [...] that extracts content from files and constructs an indexed catalog to facilitate efficient and rapid searching."  To configure the Indexing Service open Control Panels -> Administrative Tools -> Computer Management.  In the left pane, click the plus sign next to "Services and Applications", then right-click on the "Indexing Service" icon.  In the popup menu, select "All Tasks | Tune Performance".  The "Indexing Service Usage" dialog box will appear.  The Indexing Service is actually quite customizable, though doing so can add significantly to the resources required by the service.  A warning: it can eat up a surprising amount of disk space to maintain the indexes.  It has sped up basic searches for this author, but your mileage may vary in both search efficiency gains and overall performance penalty.

  • Cygwin: Cygwin has a number of utilities for handling, viewing and transforming file formats, so I have a separate page of Cygwin tips
  • Multimedia Players (work related, of course!)

    Pick one.  Use it.  If you find a format it doesn't support, try a different one, or go to the vendor's site and look for a download of an update or add-on (plug-in, patch, codec, etc.) for your format.  This isn't the place to go into the details, but some quick thoughts are included here:
    • Microsoft's Media Player -- you've almost certainly already got it, so why not use it? 
    • Real Player:  complaint -- by default it runs background processes continuously, pops up annoying little messages and practically begs you to register it, though registration isn't necessary for full functionality.  It isn't a big deal to disable these annoyances, but why should you have to?
    • Winamp:  There is a free version and an inexpensive "Pro" version that has CD burning.  It has been up and down over the years, with some versions much quirkier than others.  Currently it seems to be on par with the rest.
    • Apple's iTunes:  Though intended to suck you into Apple's music store, you can use the application without using the store.  In keeping with most Apple stuff, it seems to be well liked by those who like it.  Enough said.


Please send comments, corrections and suggestions to Wayne Betts: wbetts {at} bnl.gov

Performance and Security enhancement

Utilities for Security and Performance

If your computer seems to be running slower than it used to, pop-up advertising is appearing at an alarming rate, your web browser's settings keep changing in undesired ways, or you just want a better idea what your computer is up to (eg. "What the heck is PRPCUI.exe?"), here are some resources for understanding what's going on and making things better, presented roughly in order from those that require the least detailed understanding to those that require the most:
  • Ad-Aware:

    Ad-Aware was, not very long ago, *the* place to start for malware detection and removal, with the added bonus that it was free.  Alas, recent versions of Ad-Aware (even the Personal version) are no longer licensed quite so freely (let's be clear -- a DOE-owned computer shouldn't have it installed without a paid license.)  It is still free for personal use, so it is highly recommended for home and personal laptop use, though it may not be keeping up with the constantly expanding field, which is a common problem with this type of software.  One thing to keep in mind:  you must be sure to keep your definitions up-to-date, just like a virus scanner, in order to get the most benefit.
  • Spybot - Search&Destroy:

    This is the historical alternative to Ad-Aware, with similar good results "in the early days", but it too may be failing to keep up.  Unlike Ad-Aware, its license is quite liberal, so it can be installed as desired.  It has an "Advanced" mode, with a variety of additional tools beyond the basic malware scanner (but keep in mind that some of these features are indeed "Advanced" and not to be played with lightly).  Broken record time:  you must be sure to keep your definitions up-to-date.  You should also consider using the "Immunize" feature to prevent some infestations, and to blacklist some sites known to host various forms of malware.

     

  • Microsoft's Malicious Software Removal Tool:  This is a regularly updated (but far from comprehensive) online removal tool for Windows 2000 and Windows XP.  It isn't a bad idea to run this scanner once a month or whenever you suspect you might have caught "something".

     

  • Microsoft's AntiSpyware Beta:  Though called a Beta product, this is essentially a re-GUIed and slightly modified version of a long standing and respected commercial product that Microsoft recently purchased.   Some recent tests by more-or-less independent testers have shown this tool to be better even than the old reliables, Ad-Aware and SpyBot.

     

  • Defragmenting your hard drive is something to put on the calendar 2-4 times a year.  Because Windows' built-in defragmenter seems especially slow, and modern disk drives hold so much, this is something usually left running overnight.  Third-party alternatives exist that may do a better job in various ways.  Let's hope I get around to listing one or two here in the not-too-distant future...

     

  • CrapCleaner:  This is a system optimization tool for removing unnecessary temporary files and registry entries. The default installation creates a "Run CCleaner" entry in the Recycle Bin's context (right-click) menu.
  • Monitoring startup activity and services.

    Programs that start when you boot or login to your computer can be big performance drains, in addition to doing unwanted things.  The following may help you understand and control what's going on.  (N.B. Some of the following are capable of rendering your system unusable if not handled with care!  They may require significant understanding of Windows' internals to be most useful):

     

    • StartUp Monitor and Startup Control Panel.  These are separate utilities, but they are from the same source and complement each other nicely.  (The author of these has additional utilities that you may find worthwhile as well.)
    • msconfig.exe:  This is Windows' very own "System Configuration Utility", with which one can look at and configure system startup parameters and files, which is especially useful to see the effects of individual changes. You can hose things up quite badly in here, however, so be careful!
    • services.msc:  This provides a Management Console to configure the startup of various registered services.  This is useful for disabling unnecessary or unused Windows services.  A potentially informative feature in this Console is the "Description" column, though it can still be quite cryptic (or blank).
    • Merijn.org's website provides several downloads that you might find useful, such as HijackThis ("a general homepage hijack detector and remover"), CWShredder (CoolWebsearch removal tool) and StartUpList ("way better than msconfig")
    • BlackViper.com
    • http://www.sysinfo.org/ (slow site) 
    • Security Task Manager
    • http://www.sysinternals.com
    • HijackThis
    • BHODemon
  • Pop-up Blockers

Pop-up blocking software is increasingly unnecessary because other tools are including their own pop-up blockers.  Mozilla/Firefox, for instance, have built-in pop-up blockers.  Internet Explorer gained a pop-up blocker with Windows XP SP2.  The Google Toolbar (recommended in the "Networking Software" recommendations) has a pop-up stopper as well.  Still, you might find some utility in the products available from the PanicWare website.  Versions of their Pop-Up Stopper FREE Edition served this author quite well for over a year, but as noted above, it no longer seems as essential as in the past; the basic functionality has been supplanted by features in other software.
  • Microsoft Office updates

    Microsoft Office updates are a combination of security fixes, bug fixes and new features.  Though not emphasized as much as Windows Updates, the security fixes for Office are of similar importance.  Unfortunately, using the online updating system usually requires an installation CD that matches your product (for instance, "Office XP Pro" disks are not acceptable for updating "Office XP Standard").  Many people, for a variety of reasons, don't have their original installation CD(s).  If you do not have an acceptable installation CD available, then the online product update scan can still be used to determine which updates are applicable.  Then you can usually download full updates and apply them manually without the installation media.  (Browse for the downloads that match your product -- most are in self-extracting executable format.)

  • Clock keepers

  • Multi-desktop software

Other resources



Please send comments, corrections and suggestions to Wayne Betts: wbetts {at} bnl.gov

Required software and configuration for Windows PCs at BNL

BNL-specific requirements and configuration for networked Windows computers:

  • A file and real-time virus scanner with up-to-date virus patterns/definitions is REQUIRED!  (***Cyber-Security requirement***)

      Information about the BNL-supported products from TrendMicro is available from the BNL ITD group: TrendMicro at BNL.   It is critical that any anti-virus product receive regular updates (daily or even more often), which is sometimes difficult for mobile machines on a variety of networks.   Four similar products are available to try to meet the demands of our diverse environment:

    Windows desktops that reside on the BNL internal networks are best served by TrendMicro's basic OfficeScan product.   It has a master server inside the BNL firewall from which it receives updates and to which it reports infections.  Every Windows desktop system at BNL should be using this product, with very few exceptions.  You can
    click here to go to the online installation of the OfficeScan product.  (You'll need administrator privileges on your system for the installation.)

    Laptop users with wireless networking are encouraged to use a newer OfficeScan version that has a firewall module and is able to receive virus pattern updates from multiple sources -- so it can roam around on- and off-site and usually still reach an update server.  This OfficeScan version is also more capable of cleaning up some trojans and malware than the desktop version.   To install it in the standard way, you must already be on the BNL external wireless network and go here.   Repeat: you must be on the "BNLexternal" wireless network to use that link.

    BNL employees' personal home computers are permitted to use the PC-cillin product, which gets its updates from servers that are outside the BNL firewall (and it does not report infections to anybody at BNL).  PC-cillin includes a firewall module (OfficeScan does not) and PC-cillin has more (but quite limited) spy-ware and ad-ware detection capabilities.

    If you are running a Windows *Server* OS (if you are unsure, then you almost certainly are not!), then there is yet another option, for which you will need to contact ITD (help desk at x5522 or Jim McManus directly at x4107).

    For those readers to whom none of the above apply, which is to say, computers not owned by or used primarily at BNL or by BNL employees, I recommend (though can offer no significant assistance with) the following three free anti-virus products about which we (Wayne / Jerome) have read or heard good things:

    1. AVG Anti-Virus     - JL tried it for 3 months; it worked great but had a conflict with a fingerprint driver (thought to be a malicious script when activated)
    2. COMODO Free        - JL tried this for years and it works just fine and appears to be a great product considering the cost (none :-) ). The free version is for home users only so NOT to be installed on a BNL system for sure (usually the case of most Free AV).
    3. Microsoft Sec. E   - Microsoft Security Essentials is newer on the market but does a good job and supports Windows 7, Vista and XP

      Other anti-virus resources available include online scanners, such as HouseCall from TrendMicro and Symantec's Security Check.   Most major anti-virus vendors have something similar.   Relying on these online scanners as your primary defense is unwise.   In addition to the inconvenience of manually performing these scans, you really need a product monitoring your system at all times to prevent infections in the first place, rather than trying to clean up afterwards.   But since no two products catch and/or clean the same set of problems, occasionally using a second vendor's product can be useful.

     

  • Windows Critical Updates/SUS (***Cyber-Security requirement***)

      Windows systems must be regularly patched with "critical" updates.  Unfortunately, the BNL firewall and proxy configurations can interfere with the Windows Automatic Update feature in Windows 2000/XP (though you can still use Windows Updates in Internet Explorer if you have the proxies configured correctly, see below for proxy info).  To help with this situation, BNL ITD has set up a Software Update Services server to locally host critical updates.  To use this service (which places a notification icon in the System Tray when updates are available), please click here for more information and installation instructions.  (It is quite easy, but you must have administrative privileges.)   You can manually apply Windows updates (critical and otherwise) using Internet Explorer --  go to the Tools menu and click on "Windows Updates", at which point it is straightforward.  Note that in many cases, the machine must be rebooted to complete the update process.
  • Logon Banner (**Cyber-Security requirement**)

      As required by the DOE, please install a logon banner for BNL-owned or BNL-based computers.  (This includes other OSes as well -- essentially anything that you can log into is required to post a banner if technically possible.)  Click here for more information about logon banners at BNL. To install the banner:  Windows NT/2000/XP click here (must be an administrator to insert the registry changes).  Window 95/98 click here instead.
  • MAC Registration (**Cyber-Security requirement**)

  All networked devices on the BNL internal networks are required to be registered.   (NB--- Please do not attempt to register your machine while using STAR's cygnusb wireless access points.)   More specifically, each network interface is to be registered -- one computer might have multiple network interfaces, each of which requires a separate registration.   That's because the registration is keyed on a specific string assigned to each network interface by the manufacturer that is supposed to be unique in the world.   It is known as a "MAC", "ethernet" or "hardware" address and each network interface has one. (I.e., you must create a separate registration entry for each network card you use on a system.)   For more information, or to update your registration information, click here.  This requirement applies to things beyond typical PCs, such as remote network power supplies, VME processors and other networked equipment.   If you have such equipment that you cannot register (typically because it doesn't run any sort of web browser), then please contact ITD (x5522) or Wayne Betts for assistance in registering the system.   While not necessary, if you have the capability to verify that the MAC you are registering is in fact yours (Windows hint:  "ipconfig /all" or Linux hint:  "ifconfig" -- see the examples after this list), please do so.   Glitches in the system occasionally fail to properly keep track of the realtime IP-to-MAC mapping, and you, the adaptable human, can perhaps avert the unfortunate situation of misregistration.
  • Proxy servers

    As per 2017/11, please use direct connection to the network while at BNL.
  • Security Scanning

  The BNL networks are routinely scanned for vulnerabilities by ITD, auditors and even sometimes malicious intruders.  The most prevalent scan is done using Nessus, which looks for common network services and many known vulnerabilities.  Any user with a web browser can initiate a new scan of his host machine and look at the most recent scan results for his IP address by going to http://scanner.bnl.gov/.   (NB. When it requests an email address to send the results, you must use an address ending in bnl.gov, or it will reject you.)   The results can be daunting to interpret, so please ask for assistance if you are unsure how to interpret or correct any results.   Some results are "false positives" or uncorrectable but necessary, in which case they can be marked as such in the database.
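
As referenced in the MAC registration item above, finding your hardware address looks roughly like this (the interface name eth0 is only an example; adjust as needed):

ipconfig /all         (Windows: look for the "Physical Address" field)
/sbin/ifconfig eth0   (Linux: look for the "HWaddr" field)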

 


Please send comments, corrections and suggestions to Wayne Betts: wbetts {at} bnl.gov

Facility Access

A selection of tips on how to log in to the RCF
facility. We hope to augment these pages and add
information as users request or need it.

Getting a computer account in STAR

  1. Introduction
  2. Getting an account and performing work at BNL

 

Introduction

First of all, if you are a new user, WELCOME to the RHIC/STAR collaboration and experiment. STAR is located at Brookhaven National Laboratory and is one of the premier particle detectors in the world.

As a (new) STAR user, you will need to be granted access to our BNL Tier0 computing facility in order to have access to the offline and online infrastructure and resources. This includes accessing BNL remotely or directly while visiting us on site. Access includes data, the experiment, mailing lists, and desktop computers for visitors, to name only a few. As a National Facility under Department of Energy (DOE) regulations, a few steps are required for this to happen. Please follow them precisely and make sure you understand their relevance.

Note:

The DOE requires proper credentials for anyone accessing a computing "resource" and expects such individuals to keep their credentials up-to-date, i.e. in good standing. It is YOUR responsibility to keep valid credentials with Brookhaven National Laboratory's offices. Credentials include: being a valid and active STAR member, having a valid and active guest/user ID and appointment, and having and keeping proper training. Any missing component will cause an immediate closure of access to computing resources.

In many cases, we rely on account names matching the one created at the RCF (for example, Hypernews or Drupal accounts need an exact match to be approved) - this is enforced so we can accurately rely on the work already done by the RCF personnel and base our automation solely on "the RCF account exists and is active". The RCF personnel work with the user's office and other agencies to verify your credentials.


If you were a STAR user before and seek to re-activate your account, this page also has information for you.

 

Getting an account and performing work at BNL

Note that requesting either an appointment or a computing account implies a check by the facility and user office personnel of your good standing with RHIC/STAR as the affiliated experiment. Therefore, we urge you to follow the steps as described below.


ALL USERS - Ensure/Verify you are affiliated to STAR in our records

Whenever you join a group affiliated with STAR, please
  • Ask your council representative to send your information to the collaboration's record keeping person (at this point in time, this person is Liz Mogavero).
    Note: Your council representative IS the one responsible for keeping the list of authors and active members at all times. We will not (and cannot) consider requests coming from other STAR members.
     
  • Pro-actively check the presence of your name and record in our Phone Book.
    Note: If you are not in our Phone Book, you are simply NOT a STAR user as far as we know, as our PhoneBook is the central repository of active STAR members as defined by the STAR council representatives.
     

New users in STAR

  1. Request a Guest appointment
    You must be sure you have a valid guest appointment with the BNL User Office.

    Note 1: Requesting a Guest ID requires a procedure called “Foreign Visit and Assignment”. This procedure involves steps such as background checks with Counter Intelligence and approval from the Department of State. The procedure could take up to 60 days from the time it is started (sensitive countries may take 90 days).
    Note 2: If you have done this already and are a valid Guest, please go to this section.

    • Go to the Guest Registration Form and complete the registration as instructed.
      • Purpose of Visit: likely "Research" but if you come for other purposes, choose as appropriate ("CRADA" or "Interview" may apply, for example) 
      • Experiment/Facility:  "Physics Dept (RHIC/AGS)"
      • Facility Code: "RHIC"
      • Type of Research: "STAR"
      • Type of Access Requested: likely "Open Research" if you stated your visit purpose as "Research"
      • Subject Code for this Visit/Assignment: likely "General Physics"
         
  2. Be patient and wait for further instructions and the approval.
    • ONLY AFTER THE FIRST STEPS will you be able to proceed with the rest of the instructions below.
    • We will assume that you, from now on, have a valid Guest appointment and hold a Guest/BNL ID.
       
  3. Ensure you have the required and mandatory training
    You MUST take the Cyber Security training and course GE-CYBERSEC. This training is mandatory and access to the facility computing resources will NOT be granted without it.
    You are also requested to read the Personal User Agreement, which describes your responsibilities and the reasonable use and scope of personal use of computing equipment. In recent years, the BNL User Office has requested that the form be signed and returned for their records. Please do not skip any of these steps.
     
  4. Request an RCF account
    To request a new RCF account, start here. The fields are explained on this instruction page.
    Note: There is a "Contact information" field which is aimed to be filled using an existing RHIC member (holder of a valid account and appointment) who can vouch for you. Put your council representative or team lead name there OR (in case of interview / CRADA etc...) the name of your contact and host at BNL. DO NOT use your own name for this field. DO NOT use the name a person who is NOT yet a STAR Member.
     
  5. Additional steps are described below.

Previously a STAR user

If you were a STAR user before and consulting those pages, it may mean that either
  1. you cannot remember how to login and need access but you are in good standing (all training valid, your BNL appointment is valid)
  2. you have let your training expire but you are a valid BNL guest (your appointment with BNL has NOT expired)
  3. your BNL appointment is about to expire or has expired not long ago
  4. you are a RHIC user (from another experiment), and now coming to STAR
The instructions follow:
  • First of all, please make sure you are in the STAR PhoneBook as indicated here.
    If you were a member of another experiment before, you will be joining STAR either as a member of an existing institution or as part of a new institution. All membership handling is the responsibility of the STAR council (approval of new institutions) or your council representative. In both cases, we MUST find your name in our PhoneBook records.
     
  • Instructions for the several use cases above
    1. If you are in good standing but cannot remember your login information at the RCF facility, please see Account re-activation
       
    2. You have let your BNL training expire - likely, you have not renewed or taken the GE-CYBERSEC training available from the training page (please locate the course named GE-CYBERSEC in the list at the bottom). Within 24 hours of the training being taken/renewed again, the privilege to access the BNL computing resources using your RCF account will be re-established (the process is automatic).
       
    3. Your appointment is about to expire, or has expired not long ago - you will need to go to the Extension requests page.
      The Guest Central interface will help identify your status and appointment expiration. This form can be used by users who already have a BNL Guest ID. If you have let your appointment lapse for a long time, however, the form may tell you so (or may not show your old BNL badge/guest ID at all). In such a case, you should consider yourself a "New user" and follow the first set of instructions above.
      For an appointment renewal, the starting point will be the Guest Extension Form.
       
    4. If you were a RHIC user before and are now coming to STAR, you will need to follow the affiliation steps above (ensure you appear in the STAR PhoneBook).
    5. Additional steps are described below.

Additional steps for everyone

  1. Generate and upload your SSH keys to ensure secure login
    You may now read SSH Keys and login to the SDCC and following information in this section.
     
  2. Drupal access
    1. Log in to RCF node to verify your account username/password working
    2. Download 2-Factor Authentication app to your mobile device (application ranges from Google or Microsoft Authenticator, Duo Mobile, Authy, FreeOTP, Aegis, ...)
    3. Contact Dmitry Arkhipkin or Jerome Lauret on MatterMost (https://chat.sdcc.bnl.gov, choose "BNL Login" with your RCF username/password) to obtain the 2-Factor Authentication QR code.
    4. Use your RCF username/password + 2-Factor Authentication code (read from the app on your mobile device) to log in to drupal.
You may also be interested in
  • Web Access
  • You do not have access to view this node: your ssh keys will need to be uploaded to a different interface if you have a need to log in to the online setup - this is needed mostly by experts (not by all users in STAR).
  • Software Infrastructure for general information
  • Video Conferencing ... and the related comments at the bottom of some of those pages (viewable only if you are authenticated to Drupal).
All of those links are referenced on Software & Computing, the main page for Software and Computing ...
Wishing you a great time in STAR.

 

Account re-activation

The instructions here are for users who have an account at the RCF but have unfortunately let their BNL appointment expire or do not know how to access their (old) account.

Account expired or is disabled

First of all, please be sure you understand the requirements and rationale explained in Getting a computer account in STAR.
As soon as your appointment with BNL ends or expires, all access to BNL computing resources is closed / suspended; before it can be re-established, you MUST first renew your appointment. Until then, we will not provide you with any access, including access to Drupal (personal account) and mailing lists.

The simplest way to proceed is to

  • Check that you do have GE-CYBERSEC training. You can do this by checking your training records.
    • If you do not, please take this training NOW as any future request will be denied until this training is complete
  • Send an Email to RT-RACF-UserAccounts@bnl.gov requesting re-activation of your account. Specify the account name (Unix account name, not your name) if you remember it. If you don't, your full name may do. The RCF team will check your status (Cyber training, appointment status) and
    • if any is not valid, you will be notified and further actions will be needed.
    • If all is fine, they will re-activate your account after verifying with us that you are a valid STAR user. Please consult Getting a computer account in STAR for what this means ...

If your appointment has expired, you will need to renew it. Please, follow the instructions available here.

Chicken and Egg issue? Forgot your password but did not upload SSH keys

If your account is valid and so is your appointment, but you have not logged in to the facility for a while and hence are unable to upload your SSH keys (as described in SSH Keys and login to the SDCC and related documents), this section may be for you.

You cannot access the upload page unless you have a valid password, as access to the RCF requires a double authentication scheme (Kerberos password + SSH key). In case you have forgotten your password, you first have to send an Email to the RCF at RT-RACF-UserAccounts@bnl.gov asking for a password reset, and thereafter go to the SSH key upload interface and proceed.

Drupal access

This page describes how you can obtain access to the STAR Drupal pages. Please understand that your Drupal access is now tied to a valid SDCC login - no SDCC account, no access to Drupal. This is because we integrated the Drupal login with the common infrastructure (the login is Kerberos based). Here are the steps to gain access:

  1. Get a computer account in STAR, make sure you have a valid guest appointment and valid cyber training, then request a RCF account  

    https://drupal.star.bnl.gov/STAR/comp/sofi/facility-access/general-access

  2. Generate SSH keys and upload them to the SDCC - this formally has little to do with access to Drupal, but it will allow you to log in to the facility and verify you know your Kerberos password (it should have been sent to you via Email during the account creation, with instructions on how to change it)
    https://drupal.star.bnl.gov/STAR/comp/sofi/facility-access/ssh-keys
    Now log in to an RCF node - again, this is to ensure you have the proper credentials:

    1. ssh xxx@ssh.rhic.bnl.gov    (xxx is your username on RCF, enter the passphrase for your SSH key)
    2. kinit (enter your SDCC kerberos password - this is the one you will use for accessing Drupal)
  3. Download a 2-Factor Authentication app to your mobile device (applications range from Google or Microsoft Authenticator to Duo Mobile, Authy, FreeOTP, Aegis, ...)
  4. The FIRST time you log in to Drupal, you will actually NOT need a second factor (leave the "code" box blank) but MUST generate it right away (the second login will require it).
    1. To generate it, use the "(re)Create 2FA login" in the left hand-side menu, leave all default options (you need a time-based OTP), and import the QR code displayed.
    2. Once imported, a 6-digit code should appear in your app - this is the "code" you will need to enter in the future in the third field named "code". Note the code changes every 30 seconds.
  5. IF you have logged out without generating a QR code, or forgot to import it and have no "code", then you will need to contact Dmitry Arkhipkin or Jerome Lauret on MatterMost to obtain your 2FA QR code (both can generate it for you after the fact). The SDCC chat is available at https://chat.sdcc.bnl.gov. Use "BNL Login" and your SDCC account user/Kerberos password to log in. Then proceed as discussed above (import the QR code, make sure an entry appears, test your login right away, ...)
     
  6. You are set. Drupal login will now ask for your SDCC username and Kerberos password and a 6-digit code you read from the 2-Factor Authentication app.
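As an aside, if you ever need to check a code from the command line instead of a mobile app, a tool such as oathtool (from the OATH Toolkit; an illustrative sketch only, not an officially supported procedure) can compute the same time-based 6-digit code from the secret embedded in the QR code:

% oathtool --totp -b <base32_secret>    (prints the current 6-digit code; <base32_secret> is the secret from your QR code)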
 

SSH Keys and login to the SDCC

How to generate keys for about every platform ... and actually be able to log in to the SDCC

General

What you find below is especially useful for those of you that work on several machines and platforms in and out of BNL and need to use ssh key pairs to get into SDCC.

  • If you use Linux only or Windows only everywhere, all you need to do is follow the instructions on the SDCC web site and you are all set (see especially their Unix SSH Key generation page; a minimal Linux example is sketched right after this list).
  • Otherwise this page is for you.
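For the Linux-only case, the whole procedure boils down to something like the sketch below (the key type, size and file names are illustrative; the .pub file is what gets uploaded in section [B] further down):

% ssh-keygen -t rsa -b 2048 -f ~/.ssh/id_rsa    (choose a non-empty passphrase when prompted)
% cat ~/.ssh/id_rsa.pub                         (this is the public key you will upload)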

The findings on this web page are a combined effort of Jérôme Lauret, Jim Thomas, and Thomas Ullrich. All typos and mistakes on this page are my doing. I am also not going to discuss the wisdom of having to move private keys around - all I want to do is get things done.

The whole problem arises from the fact that there are 3 different formats to store SSH key pairs, and they are not compatible with one another:

  • ssh.com: Secure Shell is the company that invented the (now public) ssh protocol. They provide the (so far) best ssh version for Windows, which is far nicer than PuTTY. In particular, the File Browser it provides is much nicer than the scp command interface. It is free for academic/university sites.
  • PuTTY: a free ssh tool for Windows.
  • OpenSSH: runs on all Linux boxes and via cygwin on Windows.

Despite all claims, OpenSSH cannot export private keys into ssh.com format, nor can it import ssh.com private keys. Public keys seem to work but this is not what we want. So here is how it goes:

[A] Windows: follow one of the instructions below

  1. PuTTY (Windows)
    1. Download puttygen.exe from the PuTTY download page. You only need it once, but it might be good to keep it in case you need to regenerate your keys.
    2. Start the program puttygen.exe
      • Under parameters pick SSH-2 (RSA) and at least 2048 for the size of the key in bits (1024 was suggested historically but is no longer considered secure).
      • Then press the Generate button. You will be asked to move your mouse over the blank area.
      • Enter a passphrase in the corresponding fields. The passphrase is needed as it corresponds to a password. Make a mental note of it, as the keys will not be usable without it.
      • I recommend saving the "key fingerprint" too, since you will need it at the SDCC web site when uploading your public key. Just save it in a plain text file.
        Note: You can always generate it later from Linux with ssh-keygen -l -f <key_file>, but since you will need access to a Linux system to do this, it is important you keep a copy of it now so you can proceed with the rest of the instructions.
    3. Saving keys
      • Press Save Public Key. To not confuse all the keys you are going to generate, I strongly recommend calling it rsa_putty.pub.
      • Next press Save Private Key. Type rsa_putty as a name when prompted. PuTTY will automatically name it rsa_putty.ppk. That's your private key.
        Don't quit puttygen yet. Now comes the important stuff.
      • In the menu bar pick Conversions->Export OpenSSH key. When prompted, give a name that indicates that this is the private key for OpenSSH (Linux). I used rsa_openssh. Only the private key is stored, not the public one. We will generate the public key from the private one later.
      • In the menu bar pick Conversions->Export ssh.com key. When prompted, give a name that indicates that this is the private key for ssh.com. I used rsa_sshcom. Again, only the private key is stored; we will generate the public one from it later.
    4. All done. Now you have essentially 4 files: public and private keys for putty and private keys for ssh.com and OpenSSH.
  2. Getting ssh.com to work (Windows):
    1. Here I assume that you have SSHSecureShell (client) installed, that is the ssh.com version. Open a DOS (or cygwin) shell. We now need to generate a public key from the private key we got from puttygen. Best is to change into the directory where your private key is stored and type: ssh-keygen2 -D rsa_sshcom . Note that the command has a '2' at the end. This will generate a file called rsa_sshcom.pub containing the public key. Now you have your key pair.
    2. Launch SSH and pick from the menu bar Edit->Settings.
      Click on GlobalSettings/UserAuthentication/Keys and press the Import button. Point to your public key rsa_sshcom.pub. The private key will be automatically loaded too. That's it. Press OK and quit SSH. We are not quite ready yet. We still have to generate and upload the OpenSSH key to SDCC.
  3. Getting keys to work with OpenSSH/Linux:
    1. Copy the private key rsa_openssh to a Linux box (cygwin on Windows works of course too).
    2. Set the permissions such that only you can read the private key file:
      % chmod 600 rsa_openssh
    3. Generate the public key with:
      % ssh-keygen -y -f rsa_openssh > rsa_openssh.pub
    4. Now you have the key pair.
    5. To install the key pair on a Linux box copy rsa_openssh and rsa_openssh.pub to your ~/.ssh directory.
      Important: the keys ideally will be named id_rsa and id_rsa.pub, otherwise extra steps/options will be required to work with them. So it is recommended to also do
      % mv rsa_openssh ~/.ssh/id_rsa  

      % mv rsa_openssh.pub ~/.ssh/id_rsa.pub

      All done. Note that there is no need to put your key files on every machine to which you are going to connect. In fact, you should keep your private key file in as few places as possible -- just the source machine(s) from which you will initiate SSH connections. Your public key file is indeed safe to share with the public, so you need not be so careful with it, and in fact you will have to provide it to remote systems (such as in the next section) in order to use your keys at all.
       

[B] Uploading the public key to SDCC:

  1. https://web.racf.bnl.gov/Facility/SshKeys/UploadSshKey.php
  2. Make sure you upload the OpenSSH public key. Nothing else will work.
    You need to provide the key fingerprint which you hopefully saved from the instructions above.  In case of OpenSSH based keys, you can re-generate the fingerprint with
    % ssh-keygen -Emd5 -l -f <key_file>

Note that forcing the MD5 hash is important (the default hash is SHA256, which the RACF interface will not accept). All done.
If you followed all instructions you now have 3 key pairs (files). This covers essentially all SSH implementations there are. Wherever you go, whatever machine and system you deal with, one key pair will work. Keep them all in a very safe place.

 

[C] Done. What's next?

Uploading your keys to the SDCC and STAR SSH-key management interfaces

You need to upload your SSH keys only once. But after your first upload, please wait a while (30 minutes) before connecting to the SDCC SSH gatekeepers. For a basic connection, use:

% ssh -AX xxx@sssh.sdcc.bnl.gov
% rterm

The rterm command will open an X-terminal on a valid STAR interactive node. If you do NOT have an X11 server running on your computer, you can use the -i option of rterm for an interactive (non X-term based) session.
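For example, a session without a local X11 server would look like this (a minimal sketch following the commands above; xxx is your username):

% ssh -AX xxx@sssh.sdcc.bnl.gov
% rterm -i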

If you intend to log in to our online enclave, please check the instructions on You do not have access to view this node to request an account on the STAR SSH gateways and Linux pool (and upload your keys to the STAR SSH Key Management system). Note that you cannot upload your keys anywhere without a Kerberos password (both the SDCC and STAR interfaces will require a real account Kerberos password to log in). Logging in to the online enclave involves the following ssh connections:

% ssh -AX xxx@cssh.sdcc.bnl.gov
% ssh -AX xxx@stargw.starp.bnl.gov

A first thing to note is that the SDCC gatekeeper is here "cssh", as the network is separated into a "campus" side (cssh) and a ScienceZone side (sssh). For convenience, we have asked Cyber security to allow connections from "sssh" to our online enclave as well (so if you use sssh all the time, it will work).

As far as requesting an account online goes ... note that users do not request access to the individual stargw machines directly. Instead, a shared user database is kept on onlcs.starp.bnl.gov - approval for access to onlcs grants access to the stargw machines and the Online Linux Pool. Such access is typically requested on the user's behalf when the user requests access to the online resources following the instructions at You do not have access to view this node, though users may also initiate the request themselves.

Logging in to the stargw machines is most conveniently done Using the SSH Agent, and is generally done through the SDCC's SSSH gateways. This additional step of starting an agent will be removed whenever we become able to directly access the STAR SSH GW (as of 2009, this is not yet possible due to technical details).

See also


 

Caveats, issues, special cases and possible problems

Shortcut links

 

SSH side effects

Please note that if your remote account name is different from your RCF account name, you will need to use

% ssh -X username@rssh.rhic.bnl.gov

explicitly specifying the username, as the form

% ssh -X rssh.rhic.bnl.gov

will assume a username defaulting to your user name on the local machine (remote from the BNL ssh-daemon standpoint) where you issue the ssh command. This has been a source of confusion for a few users. The first form, by the way, is preferred as it always works and removes all ambiguity.

X11 Forwarding: -X or -Y ??

-X is used to automatically set the display environment over a secure channel (also called untrusted X11 forwarding). In other words, it enables X11 forwarding without having to grant remote applications the right to manipulate your Xserver parameters. If you want the ssh client to always act as if X11 forwarding was requested, add the following line to your /etc/ssh/ssh_config (or any /etc/ssh*/ssh*_config):

ForwardX11 yes

-Y enables trusted X11 forwarding. So, what does trusted mean? It means that the X-client will be allowed to gain full access to your Xserver, including changing X11 properties (i.e. attributes and values which alter the look and feel of opened X windows, or things such as mouse controls and position info, keyboard input reading and so on). Starting with OpenSSH 3.8, you will need to set

ForwardX11Trusted yes 

in the client configuration to allow remote nodes full access to your Xserver, as it is NOT enabled by default.

When to use trusted, when to use untrusted

Recent OpenSSH versions support both untrusted (-X) and trusted (-Y) X11 forwarding. As hinted above, the difference is the level of permissions the client application has on the Xserver running on the client machine. Untrusted (-X) X11 forwarding is more secure, but unfortunately several applications (especially older X-based applications) do not support running with fewer privileges and will eventually die and/or crash your entire Xserver session.

Dilemma? A rule of thumb is that while using trusted (-Y) X11 forwarding will cause fewer application problems for the near future, first try the more secure untrusted (-X) way and see what happens. If remote X applications fail with errors similar to the one below:

X Error of failed request: BadAtom (invalid Atom parameter)
  Major opcode of failed request: 18 (X_ChangeProperty)
  Atom id in failed request: 0x114
  Serial number of failed request: 370
  Current serial number in output stream: 372

you will have to use the trusted (-Y) connection.
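A quick way to test which mode you need is sketched below (xclock is just a convenient X client; any X application will do):

% ssh -X xxx@sssh.sdcc.bnl.gov    (try the untrusted mode first)
% xclock                          (if this fails with errors as above, reconnect with -Y)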

Per client / server setup?

Instead of a system-wide configuration, which would require your system administrator's assistance, you may create a config file in your home directory (client side) under the .ssh directory, i.e. $HOME/.ssh/config, with the following line:

ForwardX11Trusted yes 

But it gets better, as the config file allows per-host or per-domain configuration. For example, the below is valid:

Host *.edu
	ForwardX11 no
	User jlauret

Host *.starp.bnl.gov
	ForwardX11 yes
    	Cipher blowfish
	User jeromel

Host orion.star.bnl.gov
     ForwardAgent yes
     Cipher 3des
     ForwardX11Trusted yes

Host what.is.this
    User exampleoptions
    ServerAliveInterval=900
    Port 666
    Compression yes
    PasswordAuthentication no
    KeepAlive yes
    ForwardAgent yes
    ForwardX11 yes
    RhostsAuthentication no
    RhostsRSAAuthentication no
    RSAAuthentication yes
    TISAuthentication no
    FallBackToRsh no
    UseRsh no

As a side note, 3des is more secure than blowfish but also 3x slower. If both speed and security are important, use at least an AES cipher.

Kerberos hand-shake, How to.

OK, now you are logged in to the facility gatekeeper, but any subsequent login would ask for your password again (and this would defeat security). You can cure this problem by issuing the following command on the gatekeeper (we assume $user is your user name):

% kinit -5 -d -l 7d $user

-l 7d is used to request a long-lived K5 ticket (7 days of credentials). Note that you should afterward be granted an AFS token automatically upon login to the worker nodes of the facility. From the gatekeeper, the command

% rterm

will open a terminal on the least loaded node of the cluster where you are allowed to log in.

Generic (group) accounts

Due to policy regulations, group or generic account logins cannot be allowed at the facility unless the login is traceable to an individual. The way to log in is therefore to (see the sketch after this list):

  • Log to the gatekeeper using SSH keys under your PERSONAL account as described at SSH Keys and login to the SDCC
  • kinit -5 -4 -l 7d $gaccount
  • In the case of widely used generic accounts, one more jump to a "special" node will be necessary. For starreco and starlib for example, this additional gatekeeper node is rcas6003. From there, login to the rest of the facility can be done using rterm as usual (at least in STAR)
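Putting it all together, a typical session for a wide-use generic account could look like the sketch below (node names follow the starreco example above; xxx is your personal account):

% ssh -AX xxx@sssh.sdcc.bnl.gov    (personal account, SSH keys)
% kinit -5 -4 -l 7d starreco       (obtain credentials for the group account)
% ssh starreco@rcas6003            (extra jump needed for wide-use accounts)
% rterm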

Special nodes

This section is about standing on one foot, tapping on top of your head and chanting a mantra unless the moon is full (in which case the procedure involves parsley and sacrificial offerings). OK, we are in the realm of the very very special tricks for very very special nodes:

  • The rmine nodes CANNOT be connected to anymore. However, one can use rsec00.rhic.bnl.gov as a gatekeeper, using your desktop keys, and then jump from there to the rmine nodes.
    Scope: Subject to special authorization.
  • The test node aplay1.usatlas.bnl.gov cannot be accessed using a Kerberos trick. Since there are two hops from your machine to aplay1, you need to use the ssh-agent. See instructions on the Using the SSH Agent help page.

K5 Caveats

  • If you log in to gatekeeper GK1 for your personal account, you will need to choose another gatekeeper GK2 for your group account login. This avoids interference between Kerberos credentials.
  • Whenever you log in to a gatekeeper on which you know you had previously obtained Kerberos credentials, you should ensure the destruction of the previous credentials to avoid premature lifetime expiration. In other words, -l 7d will NOT give you a 7-day lifetime K5 ticket on a gatekeeper where previous credentials exist. To destroy previous credentials, be sure
    1. you do not still have open windows using the credentials. Check this by issuing klist and observing the listing. Valid credentials used in open sessions would look like this:
      Valid starting     Expires            Service principal
      12/26/06 10:59:28  12/31/06 10:59:28  krbtgt/RHIC.BNL.GOV@RHIC.BNL.GOV
             renew until 01/02/07 10:59:25
      12/26/06 10:59:30  12/31/06 10:59:28  host/rcas6005.rcf.bnl.gov@RHIC.BNL.GOV
             renew until 01/02/07 10:59:25
      12/26/06 11:11:48  12/31/06 10:59:28  host/rplay43.rcf.bnl.gov@RHIC.BNL.GOV
             renew until 01/02/07 10:59:25
      12/26/06 17:51:05  12/31/06 10:59:28  host/stargrid02.rcf.bnl.gov@RHIC.BNL.GOV
             renew until 01/02/07 10:59:25
      12/26/06 18:34:03  12/31/06 10:59:28  host/stargrid01.rcf.bnl.gov@RHIC.BNL.GOV
             renew until 01/02/07 10:59:25
      12/26/06 18:34:22  12/31/06 10:59:28  host/stargrid03.rcf.bnl.gov@RHIC.BNL.GOV
             renew until 01/02/07 10:59:25
      12/28/06 17:53:29  12/31/06 10:59:28  host/rcas6011.rcf.bnl.gov@RHIC.BNL.GOV
             renew until 01/02/07 10:59:25 
      
    2. If nothing appears to be relevant or existing, it is safe to issue the kdestroy command to wipe out all old credentials and then re-initiate a kinit.

 

 

Using the SSH Agent

General

The ssh-agent is a program you may use together with OpenSSH or similar SSH programs. The agent holds your decrypted private keys in memory, providing a secure way to avoid retyping the passphrase of the private key at every connection.

One advantage and common use of the agent is agent forwarding. Agent forwarding allows you to open ssh sessions without having to repeatedly type your passphrase as you make multiple SSH hops. Below, we provide instructions on starting the agent, loading your keys and how to use key forwarding.

Instructions

Starting the agent

The ssh-agent is started as follows:

% ssh-agent

Note however that the agent will immediately print output such as the example below:

% ssh-agent
SSH_AUTH_SOCK=/tmp/ssh-fxDmNwelBA/agent.5884; export SSH_AUTH_SOCK;
SSH_AGENT_PID=3520; export SSH_AGENT_PID;
echo Agent pid 3520;

 
It may not be immediately obvious to you, but you actually MUST execute those commands in your shell for the next steps to be effective.

Here is what I usually do: redirect the message to a file and source it from the shell like this:

% ssh-agent >agent.sh 
% source agent.sh

The commands above will create a script containing the necessary shell commands, then the source command will load the information into your shell. This assumes you are using sh. For csh, you need to use the setenv shell command to define both SSH_AUTH_SOCK and SSH_AGENT_PID. A simpler approach may however be to use

% ssh-agent csh

The command above will start a new shell in which the necessary environment variables are already defined (no sourcing needed).

Yet another method to start an agent and set the environment variables in tcsh or bash (and probably other shells) is this:
 

% eval `ssh-agent`
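Conversely, when you are completely done with the agent (for example at the end of your work session), it can be terminated cleanly with the standard -k option, which uses SSH_AGENT_PID to find and kill the running agent:

% eval `ssh-agent -k`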


Now that you've started an agent and set the environment variables to use it, the next step is to load your SSH key.

 

Loading a key

The agent alone is not very useful until you've actually put keys into it. All your agent key management is handled by the ssh-add command. If you run it without arguments, it will add any of the 'standard' keys $HOME/.ssh/identity, $HOME/.ssh/id_rsa, and $HOME/.ssh/id_dsa.

To be sure the agent has not loaded any id yet, you may use the -l option with ssh-add.  Here's what you should see if you have not loaded a key:

% ssh-add -l
The agent has no identities.

 

To load your key, simply type

% ssh-add
Enter passphrase for /home/jlauret/.ssh/id_rsa:
Identity added: /home/jlauret/.ssh/id_rsa (/home/jlauret/.ssh/id_rsa)

 

To verify that all is fine, you may again use the ssh-add command with the -l option. The result should now be different and similar to the below (if not, something went wrong).

% ssh-add -l
1024 34:a0:3f:56:6d:a2:02:d1:c5:23:2e:a0:27:16:3d:e5 /home/jlauret/.ssh/id_rsa (RSA)

 

If so, all is fine.

Agent forwarding

Two conditions need to be met for agent forwarding to function:

  • The server needs to be configured to accept forwards (enabled by default)
  • You need to use the ssh client with the -A option

Usage is simply

 

% ssh -A user@remotehost

 

And that is all. For every hop, you need to use the -A option to have the key forwarded throughout the chain of ssh logins. Ideally, you may want to use -AX (where "X" enables X11 forwarding).

Agent security concern

The ssh-agent creates a unix domain socket and then listens for connections from /usr/bin/ssh on this socket. It relies on simple unix permissions to prevent access to this socket, which means that any keys you put into your agent are available to anyone who can connect to this socket. BE AWARE that root in particular has access to any file, hence any socket, and as a consequence may acquire access to your remote system whenever you use an agent.

The man pages indicate you may use the -c option of ssh-add, and this indeed adds one more level of safety to the agent mechanism (the agent will ask for a passphrase confirmation at each new session). However, if root has its mind set on stealing a session, you are set for a lost battle from the start, so do not feel over-confident about this option.
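For reference, the confirmation mode is enabled when loading the key, as sketched below (this requires an ssh-askpass program to be installed to display the confirmation dialog):

% ssh-add -c ~/.ssh/id_rsa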

Additional information

The help pages below link to the OpenSSH implementation of the ssh client/server and other ssh related documentation from our site.

 

SSH connection stability

IF
  • your SSH connections from home are being closed
  • you get disconnected from nodes without any apparent reason
  • ... and you are a PuTTY user
  • ... or a Unix SSH client user
this page is for you. If you are a different kind of user, using different clients and so on, this page may still be informative and help you stabilize your connection (the same principles apply).

PuTTY users

To use PuTTY to connect to a gateway (from a home connection), you have to

  • set a session, be sure to enable SSH

  • go to the 'Connection' menu and make sure the following option boxes are checked

    • Disable Nagle's algorithm (TCP_NODELAY option)

    • Enable TCP keepalives (SO_KEEPALIVE option)

  • Furthermore, in 'Connection' -> 'SSH' -> 'Tunnels' enable the option

    • Enable X11 forwarding

    • Enable MIT-Magic-Cookie-1

  • Save the session

Documentation on those features (explanations for the interested) is provided at the end of this document.


SSH Users

SSH users who administer their own system should first make sure the relevant settings are turned on by default in the SSH client configuration file. The client configuration is likely located at /etc/ssh_config or /usr/local/etc/ssh_config depending on where ssh is installed.

But if you do NOT have access to the configuration file, the client can nonetheless be passed options on the command line. Those options have the same names as they would appear in the config file.

Especially, KEEP_ALIVE is controlled via the SSH configuration option TCPKeepAlive.

% ssh -o TCPKeepAlive=yes

You will note in the next section that a spoofing issue exists with keep-alive (I know it works well, but please consider the ServerAliveCountMax mechanism), so you may use instead

% ssh -o TCPKeepAlive=no -o ServerAliveInterval=15

Note that the value 15 in our example is purely empirical. There are NO magic values: you need to test your connection, detect when (after what time) you get kicked out and disconnected, and set the parameters of your client accordingly. Let's explain the defaults first and come back to this with a rule of thumb.

There are two relevant parameters (in addition to TCPKeepAlive):


ServerAliveInterval

Sets a timeout interval in seconds after which if no data has been received from the server, ssh will send a message through the encrypted channel to request a response from the server. The default is 0, indicating that these messages will not be sent to the server.

This option applies to protocol version 2 only.


ServerAliveCountMax

Sets the number of server alive messages (see above) which may be sent without ssh receiving any messages back from the server. If this threshold is reached while server alive messages are being sent, ssh will disconnect from the server, terminating the session. It is important to note that the use of server alive messages is very different from TCPKeepAlive (discussed above). The server alive messages are sent through the encrypted channel and therefore will not be spoofable. The TCP keepalive option enabled by TCPKeepAlive is spoofable. The server alive mechanism is valuable when the client or server depends on knowing when a connection has become inactive.

The default value is 3. If, for example, ServerAliveInterval (above) is set to 15, and ServerAliveCountMax is left at the default, if the server becomes unresponsive ssh will disconnect after approximately 45 seconds.


In our example

% ssh -o TCPKeepAlive=no -o ServerAliveInterval=15

The recipe should be: if you get disconnected after N seconds, play with the above and be sure to set

ServerAliveInterval*ServerAliveCountMax <= 0.8*N, N being the timeout. Since ServerAliveCountMax is typically not modified, in our example we assume the default value of 3 and therefore 3x15 = 45 seconds (and we guessed a disconnect every minute or so). If you set the value too low, the client will send too much "chatting" to the server and there will be a traffic impact.
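The same settings can be made permanent in your per-user client configuration ($HOME/.ssh/config), as in the sketch below (the host patterns and values are illustrative and should be tuned to your measured timeout N):

Host *.sdcc.bnl.gov *.rhic.bnl.gov
    TCPKeepAlive no
    ServerAliveInterval 15
    ServerAliveCountMax 3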


Appendix

Nagle's algorithm

This was written based on this article.

RPC implementations on TCP should disable Nagle. This reduces average RPC request latency on TCP, and makes network trace tools work a little nicer.

This option determines whether Nagle's algorithm is to be used. Nagle's algorithm tries to conserve bandwidth by minimizing the number of segments that are sent. When applications wish to decrease network latency and increase performance, they can disable Nagle's algorithm (that is, enable TCP_NODELAY). Data will be sent earlier, at the cost of an increase in bandwidth consumption.


KeepAlive

The KEEPALIVE option of the TCP/IP Protocol ensures that connections are kept alive even while they are idle. When a connection to a client is inactive for a period of time (the timeout period), the operating system sends KEEPALIVE packets at regular intervals. On most systems, the default timeout period is two hours (7,200,000 ms).

If the network hardware or software drops connections that have been idle for less than the two hour default, the Windows Client session will fail. KEEPALIVE timeouts are configured at the operating system level for all connections that have KEEPALIVE enabled.

If the network hardware or software (including firewalls) has an idle limit of one hour, then the KEEPALIVE timeout must be less than one hour. To rectify this situation, the TCP/IP KEEPALIVE settings can be lowered to fit inside the firewall limits. The implementation of TCP KEEPALIVE may vary from vendor to vendor. The original definition is quite old and described in RFC 1122.


MIT Magic cookie

To avoid unauthorized connections to your X display, the xauth mechanism for encrypted X connections is widely used. When you log in, a .Xauthority file is created in your home directory ($HOME). Even SSH initiates the creation of a magic cookie, and without it no display can be opened. Note that since the .Xauthority file IS the file containing the MIT magic cookie, if you ever run out of disk quota or the file system is full, this file CANNOT be created or updated (even by the sshd impersonating the user) and consequently no X connections can be opened.

The .Xauthority file sometimes contains information from older sessions, but this is not important, as a new key is created at every login session. The Xauthority mechanism is simple and powerful, and eliminates many of the security problems with X.




FileCatalog

The STAR FileCatalog is a set of tools and an API providing users access to the meta-data, file and replica information pertaining to all data produced by the RHIC/STAR experiment, through a unified schema-agnostic interface. The user never needs to know the details of the relations between elements (or keywords) but rather is provided with a flexible yet powerful query API allowing them to request any combination of 'keywords' based on sets of conditions, each composed of keyword/operator/value combinations. The user manual provides a list of keywords.

The STAR FileCatalog also provides multi-site support through the same API. In other words, the same set of tools and programmatic interfaces allows one to register, update and maintain a global catalog for the experiment, and to serve as a core component of the Data Management system. To date, the STAR FileCatalog holds information on 22 million files and 52 million active replicas.
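For illustration, a typical user query through the get_file_list.pl front-end could look like the sketch below (the keyword and condition values are illustrative only; the user manual provides the full keyword list):

% get_file_list.pl -keys 'path,filename' -cond 'filetype=daq_reco_MuDst,storage=NFS' -limit 10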

 

The history & version information

Manual

XML(s) examples

Examples and other documentation

 

A few examples will be left here to guide users and installers.

Data dictionary

This dictionary was created on 2012/03/12.

CollisionTypes

Field Type Null Default Comments
collisionTypeID smallint(6) No    
firstParticle varchar(10) No    
secondParticle varchar(10) No    
collisionEnergy float No 0  
collisionTypeIDate timestamp No CURRENT_TIMESTAMP  
collisionTypeCreator smallint(6) No 1  
collisionTypeCount int(11) Yes NULL  
collisionTypeComment text Yes NULL  

Creators

Field Type Null Default Comments
creatorID bigint(20) No    
creatorName varchar(15) Yes unknown  
creatorIDate timestamp No CURRENT_TIMESTAMP  
creatorCount int(11) Yes NULL  
creatorComment varchar(512) Yes NULL  

DetectorConfigurations

Field Type Null Default Comments
detectorConfigurationID int(11) No    
detectorConfigurationName varchar(50) Yes NULL  
dTPC tinyint(4) Yes NULL  
dSVT tinyint(4) Yes NULL  
dTOF tinyint(4) Yes NULL  
dEMC tinyint(4) Yes NULL  
dEEMC tinyint(4) Yes NULL  
dFPD tinyint(4) Yes NULL  
dFTPC tinyint(4) Yes NULL  
dPMD tinyint(4) Yes NULL  
dRICH tinyint(4) Yes NULL  
dSSD tinyint(4) Yes NULL  
dBBC tinyint(4) Yes NULL  
dBSMD tinyint(4) Yes NULL  
dESMD tinyint(4) Yes NULL  
dZDC tinyint(4) Yes NULL  
dCTB tinyint(4) Yes NULL  
dTPX tinyint(4) Yes NULL  
dFGT tinyint(4) Yes NULL  

DetectorStates

Field Type Null Default Comments
detectorStateID int(11) No    
sTPC tinyint(4) Yes NULL  
sSVT tinyint(4) Yes NULL  
sTOF tinyint(4) Yes NULL  
sEMC tinyint(4) Yes NULL  
sEEMC tinyint(4) Yes NULL  
sFPD tinyint(4) Yes NULL  
sFTPC tinyint(4) Yes NULL  
sPMD tinyint(4) Yes NULL  
sRICH tinyint(4) Yes NULL  
sSSD tinyint(4) Yes NULL  
sBBC tinyint(4) Yes NULL  
sBSMD tinyint(4) Yes NULL  
sESMD tinyint(4) Yes NULL  
sZDC tinyint(4) Yes NULL  
sCTB tinyint(4) Yes NULL  
sTPX tinyint(4) Yes NULL  
sFGT tinyint(4) Yes NULL  

EventGenerators

Field Type Null Default Comments
eventGeneratorID smallint(6) No    
eventGeneratorName varchar(30) No    
eventGeneratorVersion varchar(10) Yes 0  
eventGeneratorParams varchar(200) Yes NULL  
eventGeneratorIDate timestamp No CURRENT_TIMESTAMP  
eventGeneratorCreator smallint(6) No 1  
eventGeneratorCount int(11) Yes NULL  
eventGeneratorComment varchar(512) Yes NULL  

FileData

Field Type Null Default Comments
fileDataID bigint(20) No    
runParamID int(11) No 0  
fileName varchar(255) No    
baseName varchar(255) No   Name without extension
sName1 varchar(255) No   Will be used for name+runNumber
sName2 varchar(255) No   Will be used for name before runNumber
productionConditionID mediumint(9) Yes NULL  
numEntries mediumint(9) Yes 0  
md5sum varchar(32) Yes 0  
fileTypeID smallint(6) No 0  
fileSeq smallint(6) Yes NULL  
fileStream smallint(6) Yes 0  
fileDataIDate timestamp No CURRENT_TIMESTAMP  
fileDataCreator smallint(6) No 1  
fileDataCount int(11) Yes NULL  
fileDataComment text Yes NULL  

FileLocations

Field Type Null Default Comments
fileLocationID bigint(20) No    
fileDataID bigint(20) No 0  
filePathID bigint(20) No 0  
storageTypeID mediumint(9) No 0  
createTime timestamp No CURRENT_TIMESTAMP  
insertTime timestamp No 0000-00-00 00:00:00  
owner varchar(15) Yes NULL  
fsize bigint(20) Yes NULL  
storageSiteID smallint(6) No 0  
protection varchar(15) Yes NULL  
hostID mediumint(9) No 1  
availability tinyint(4) No 1  
persistent tinyint(4) No 0  
sanity tinyint(4) No 1  

FileLocationsID

Field Type Null Default Comments
fileLocationID bigint(20) No    

FileLocations_0

Field Type Null Default Comments
fileLocationID bigint(20) No    
fileDataID bigint(20) No 0  
filePathID bigint(20) No 0  
storageTypeID mediumint(9) No 0  
createTime timestamp No CURRENT_TIMESTAMP  
insertTime timestamp No 0000-00-00 00:00:00  
owner varchar(15) Yes NULL  
fsize bigint(20) Yes NULL  
storageSiteID smallint(6) No 0  
protection varchar(15) Yes NULL  
hostID mediumint(9) No 1  
availability tinyint(4) No 1  
persistent tinyint(4) No 0  
sanity tinyint(4) No 1  

FileLocations_1

Field Type Null Default Comments
fileLocationID bigint(20) No    
fileDataID bigint(20) No 0  
filePathID bigint(20) No 0  
storageTypeID mediumint(9) No 0  
createTime timestamp No CURRENT_TIMESTAMP  
insertTime timestamp No 0000-00-00 00:00:00  
owner varchar(15) Yes NULL  
fsize bigint(20) Yes NULL  
storageSiteID smallint(6) No 0  
protection varchar(15) Yes NULL  
hostID mediumint(9) No 1  
availability tinyint(4) No 1  
persistent tinyint(4) No 0  
sanity tinyint(4) No 1  

FileLocations_2

Field Type Null Default Comments
fileLocationID bigint(20) No    
fileDataID bigint(20) No 0  
filePathID bigint(20) No 0  
storageTypeID mediumint(9) No 0  
createTime timestamp No CURRENT_TIMESTAMP  
insertTime timestamp No 0000-00-00 00:00:00  
owner varchar(15) Yes NULL  
fsize bigint(20) Yes NULL  
storageSiteID smallint(6) No 0  
protection varchar(15) Yes NULL  
hostID mediumint(9) No 1  
availability tinyint(4) No 1  
persistent tinyint(4) No 0  
sanity tinyint(4) No 1  

FileLocations_3

Field Type Null Default Comments
fileLocationID bigint(20) No    
fileDataID bigint(20) No 0  
filePathID bigint(20) No 0  
storageTypeID mediumint(9) No 0  
createTime timestamp No CURRENT_TIMESTAMP  
insertTime timestamp No 0000-00-00 00:00:00  
owner varchar(15) Yes NULL  
fsize bigint(20) Yes NULL  
storageSiteID smallint(6) No 0  
protection varchar(15) Yes NULL  
hostID mediumint(9) No 1  
availability tinyint(4) No 1  
persistent tinyint(4) No 0  
sanity tinyint(4) No 1  

FileParents

Field Type Null Default Comments
parentFileID bigint(20) No 0  
childFileID bigint(20) No 0  

FilePaths

Field Type Null Default Comments
filePathID bigint(6) No    
filePathName varchar(255) No    
filePathIDate timestamp No CURRENT_TIMESTAMP  
filePathCreator smallint(6) No 1  
filePathCount int(11) Yes NULL  
filePathComment varchar(512) Yes NULL  

FileTypes

Field Type Null Default Comments
fileTypeID smallint(6) No    
fileTypeName varchar(30) No    
fileTypeExtension varchar(15) No    
fileTypeIDate timestamp No CURRENT_TIMESTAMP  
fileTypeCreator smallint(6) No 1  
fileTypeCount int(11) Yes NULL  
fileTypeComment varchar(512) Yes NULL  

Hosts

Field Type Null Default Comments
hostID smallint(6) No    
hostName varchar(30) No localhost  
hostIDate timestamp No CURRENT_TIMESTAMP  
hostCreator smallint(6) No 1  
hostCount int(11) Yes NULL  
hostComment varchar(512) Yes NULL  

ProductionConditions

Field Type Null Default Comments
productionConditionID smallint(6) No    
productionTag varchar(10) No    
libraryVersion varchar(10) No    
productionConditionIDate timestamp No CURRENT_TIMESTAMP  
productionConditionCreator smallint(6) No 1  
productionConditionCount int(11) Yes NULL  
productionConditionComment varchar(512) Yes NULL  

RunParams

Field Type Null Default Comments
runParamID int(11) No    
runNumber bigint(20) No 0  
dataTakingStart timestamp No 0000-00-00 00:00:00  
dataTakingEnd timestamp No 0000-00-00 00:00:00  
dataTakingDay smallint(6) Yes 0  
dataTakingYear smallint(6) Yes 0  
simulationParamsID int(11) Yes NULL  
runTypeID smallint(6) No 0  
triggerSetupID smallint(6) No 0  
detectorConfigurationID mediumint(9) No 0  
detectorStateID mediumint(9) No 0  
collisionTypeID smallint(6) No 0  
magFieldScale varchar(50) No    
magFieldValue float Yes NULL  
runParamIDate timestamp No CURRENT_TIMESTAMP  
runParamCreator smallint(6) No 1  
runParamCount int(11) Yes NULL  
runParamComment varchar(512) Yes NULL  

RunTypes

Field Type Null Default Comments
runTypeID smallint(6) No    
runTypeName varchar(255) No    
runTypeIDate timestamp No CURRENT_TIMESTAMP  
runTypeCreator smallint(6) No 1  
runTypeCount int(11) Yes NULL  
runTypeComment varchar(512) Yes NULL  

SimulationParams

Field Type Null Default Comments
simulationParamsID int(11) No    
eventGeneratorID smallint(6) No 0  
simulationParamIDate timestamp No CURRENT_TIMESTAMP  
simulationParamCreator smallint(6) No 1  
simulationParamCount int(11) Yes NULL  
simulationParamComment varchar(512) Yes NULL  

StorageSites

Field Type Null Default Comments
storageSiteID smallint(6) No    
storageSiteName varchar(30) No    
storageSiteLocation varchar(50) Yes NULL  
storageSiteIDate timestamp No CURRENT_TIMESTAMP  
storageSiteCreator smallint(6) No 1  
storageSiteCount int(11) Yes NULL  
storageSiteComment varchar(512) Yes NULL  

StorageTypes

Field Type Null Default Comments
storageTypeID mediumint(9) No    
storageTypeName varchar(6) No    
storageTypeIDate timestamp No CURRENT_TIMESTAMP  
storageTypeCreator smallint(6) No 1  
storageTypeCount int(11) Yes NULL  
storageTypeComment varchar(512) Yes NULL  

TriggerCompositions

Field Type Null Default Comments
fileDataID bigint(20) No 0  
triggerWordID mediumint(9) No 0  
triggerCount mediumint(9) Yes 0  

TriggerSetups

Field Type Null Default Comments
triggerSetupID smallint(6) No    
triggerSetupName varchar(50) No    
triggerSetupComposition varchar(255) No    
triggerSetupIDate timestamp No CURRENT_TIMESTAMP  
triggerSetupCreator smallint(6) No 1  
triggerSetupCount int(11) Yes NULL  
triggerSetupComment varchar(512) Yes NULL  

TriggerWords

Field Type Null Default Comments
triggerWordID mediumint(9) No    
triggerWordName varchar(50) No    
triggerWordVersion varchar(6) No V0.0  
triggerWordBits varchar(6) No    
triggerWordIDate timestamp No CURRENT_TIMESTAMP  
triggerWordCreator smallint(6) No 1  
triggerWordCount int(11) Yes NULL  
triggerWordComment varchar(512) Yes NULL  

Table creation and attributes

#use FileCatalog;

#
# All IDs are named after their respective table. This MUST
# remain like this.
#  eventGeneratorID        -> eventGenerator+ID       in 'EventGenerators'
#  detectorConfigurationID ->detectorConfiguration+ID in 'DetectorConfigurations'
#
# etc...
#

DROP TABLE IF EXISTS EventGenerators;
CREATE TABLE EventGenerators
(
  eventGeneratorID      SMALLINT     NOT NULL    AUTO_INCREMENT,
  eventGeneratorName    VARCHAR(30)        NOT NULL,
  eventGeneratorVersion VARCHAR(10)     NOT NULL,
  eventGeneratorParams  VARCHAR(200),

  eventGeneratorIDate   TIMESTAMP       NOT NULL,
  eventGeneratorCreator CHAR(15)        DEFAULT 'unknown' NOT NULL,
  eventGeneratorCount   INT,
  eventGeneratorComment TEXT,
  UNIQUE        EG_EventGeneratorUnique (eventGeneratorName, eventGeneratorVersion, eventGeneratorParams),
  PRIMARY KEY (eventGeneratorID)
) TYPE=MyISAM;

DROP TABLE IF EXISTS DetectorConfigurations; CREATE TABLE DetectorConfigurations
(
  detectorConfigurationID               INT             NOT NULL        AUTO_INCREMENT,
  detectorConfigurationName             VARCHAR(50)        NULL           UNIQUE,
  dTPC                                  TINYINT,
  dSVT                                  TINYINT,
  dTOF                                  TINYINT,
  dEMC                                  TINYINT,
  dEEMC                                 TINYINT,
  dFPD                                  TINYINT,
  dFTPC                                 TINYINT,
  dPMD                                  TINYINT,
  dRICH                                 TINYINT,
  dSSD                                  TINYINT,
  dBBC                                  TINYINT,
  dBSMD                                 TINYINT,
  dESMD                                 TINYINT,
  PRIMARY KEY (detectorConfigurationID)
) TYPE=MyISAM;


# Trigger related tables
DROP TABLE IF EXISTS TriggerSetups; CREATE TABLE TriggerSetups
(
   triggerSetupID               SMALLINT     NOT NULL    AUTO_INCREMENT,
   triggerSetupName             VARCHAR(50)        NOT NULL       UNIQUE,
   triggerSetupComposition      VARCHAR(255) NOT NULL,

   triggerSetupIDate            TIMESTAMP       NOT NULL,
   triggerSetupCreator          CHAR(15)       DEFAULT 'unknown' NOT NULL,
   triggerSetupCount            INT,
   triggerSetupComment          TEXT,
   PRIMARY KEY                  (triggerSetupID)
) TYPE=MyISAM;


DROP TABLE IF EXISTS TriggerCompositions; CREATE TABLE TriggerCompositions
(
  fileDataID                    BIGINT          NOT NULL,
  triggerWordID                 INT             NOT NULL,
  triggerCount                  MEDIUMINT       DEFAULT 0,
  PRIMARY KEY                   (fileDataID, triggerWordID)
) TYPE=MyISAM;



DROP TABLE IF EXISTS TriggerWords;
CREATE TABLE TriggerWords (
  triggerWordID         mediumint(9)   NOT NULL auto_increment,
  triggerWordName       varchar(50)  NOT NULL default '',
  triggerWordVersion    varchar(6)        NOT NULL default 'V0.0',
  triggerWordBits       varchar(6)   NOT NULL default '',
  triggerWordIDate      timestamp(14)       NOT NULL,
  triggerWordCreator    varchar(15)       NOT NULL default 'unknown',
  triggerWordCount      int(11)     default NULL,
  triggerWordComment    text,
  PRIMARY KEY           (triggerWordID),
  UNIQUE KEY TW_TriggerCharacteristic (triggerWordName,triggerWordVersion,triggerWordBits)
) TYPE=MyISAM;




DROP TABLE IF EXISTS CollisionTypes; CREATE TABLE CollisionTypes
(
  collisionTypeID SMALLINT NOT NULL AUTO_INCREMENT,
  firstParticle VARCHAR(10) NOT NULL,
  secondParticle VARCHAR(10) NOT NULL,
  collisionEnergy FLOAT NOT NULL,
  PRIMARY KEY (collisionTypeID)
) TYPE=MyISAM;


#
# A few dictionary tables
#
DROP TABLE IF EXISTS ProductionConditions; CREATE TABLE ProductionConditions
(
  productionConditionID         SMALLINT       NOT NULL      AUTO_INCREMENT,
  productionTag                 VARCHAR(10)   NOT NULL,
  libraryVersion                VARCHAR(10)   NOT NULL,

  productionConditionIDate      TIMESTAMP       NOT NULL,
  productionConditionCreator    CHAR(15)        DEFAULT 'unknown' NOT NULL,
  productionConditionCount      INT,
  productionConditionComments   TEXT,
  PRIMARY KEY                   (productionConditionID)
) TYPE=MyISAM;

DROP TABLE IF EXISTS StorageSites; CREATE TABLE StorageSites
(
  storageSiteID                 SMALLINT      NOT NULL     AUTO_INCREMENT,
  storageSiteName               VARCHAR(30)  NOT NULL,
  storageSiteLocation           VARCHAR(50),

  storageSiteIDate              TIMESTAMP       NOT NULL,
  storageSiteCreator            CHAR(15)       DEFAULT 'unknown' NOT NULL,
  storageSiteCount              INT,
  storageSiteComment            TEXT,
  PRIMARY KEY                   (storageSiteID)
) TYPE=MyISAM;

DROP TABLE IF EXISTS FileTypes; CREATE TABLE FileTypes
(
  fileTypeID                    SMALLINT NOT NULL        AUTO_INCREMENT,
  fileTypeName                  VARCHAR(30)    NOT NULL   UNIQUE,
  fileTypeExtension             VARCHAR(15)        NOT NULL,

  fileTypeIDate                 TIMESTAMP       NOT NULL,
  fileTypeCreator               CHAR(15) DEFAULT 'unknown' NOT NULL,
  fileTypeCount                 INT,
  fileTypeComment               TEXT,
  PRIMARY KEY                   (fileTypeID)
) TYPE=MyISAM;

DROP TABLE IF EXISTS FilePaths; CREATE TABLE FilePaths
(
  filePathID                    BIGINT         NOT NULL         AUTO_INCREMENT,
  filePathName                  VARCHAR(255)   NOT NULL         UNIQUE,

  filePathIDate                 TIMESTAMP       NOT NULL,
  filePathCreator               CHAR(15) DEFAULT 'unknown' NOT NULL,
  filePathCount                 INT,
  filePathComment               TEXT,
  PRIMARY KEY                   (filePathID)
) TYPE=MyISAM;

DROP TABLE IF EXISTS Hosts; CREATE TABLE Hosts
(
  hostID                        SMALLINT       NOT NULL         AUTO_INCREMENT,
  hostName                      VARCHAR(30)    NOT NULL DEFAULT 'localhost' UNIQUE,

  hostIDate                     TIMESTAMP       NOT NULL,
  hostCreator                   CHAR(15)     DEFAULT 'unknown' NOT NULL,
  hostCount                     INT,
  hostComment                   TEXT,
  PRIMARY KEY                   (hostID)
) TYPE=MyISAM;


DROP TABLE IF EXISTS RunTypes; CREATE TABLE RunTypes
(
  runTypeID                     SMALLINT  NOT NULL AUTO_INCREMENT,
  runTypeName                   VARCHAR(255)    NOT NULL   UNIQUE,

  runTypeIDate                  TIMESTAMP       NOT NULL,
  runTypeCreator                CHAR(15)  DEFAULT 'unknown' NOT NULL,
  runTypeCount                  INT,
  runTypeComment                TEXT,
  PRIMARY KEY                   (runTypeID)
) TYPE=MyISAM;


DROP TABLE IF EXISTS StorageTypes; CREATE TABLE StorageTypes
(
  storageTypeID                 MEDIUMINT       NOT NULL    AUTO_INCREMENT,
  storageTypeName               VARCHAR(6)   NOT NULL  UNIQUE,

  storageTypeIDate              TIMESTAMP       NOT NULL,
  storageTypeCreator            CHAR(15)       DEFAULT 'unknown' NOT NULL,
  storageTypeCount              INT,
  storageTypeComment            TEXT,
  PRIMARY KEY                   (storageTypeID)
) TYPE=MyISAM;





DROP TABLE IF EXISTS SimulationParams; CREATE TABLE SimulationParams
(
  simulationParamsID            INT             NOT NULL     AUTO_INCREMENT,
  eventGeneratorID              SMALLINT    NOT NULL,
  detectorConfigurationID       INT             NOT NULL,
  simulationParamComments       TEXT,
  PRIMARY KEY                   (simulationParamsID),
  INDEX         SP_EventGeneratorIndex          (eventGeneratorID),
  INDEX         SP_DetectorConfigurationIndex   (detectorConfigurationID)
) TYPE=MyISAM;

DROP TABLE IF EXISTS RunParams;
CREATE TABLE RunParams
(
  runParamID                  INT        NOT NULL AUTO_INCREMENT,
  runNumber                   BIGINT     NOT NULL UNIQUE,
  dataTakingStart             TIMESTAMP,
  dataTakingEnd               TIMESTAMP,
  simulationParamsID          INT       NULL,
  runTypeID                   SMALLINT     NOT NULL,
  triggerSetupID              SMALLINT      NOT NULL,
  detectorConfigurationID     INT            NOT NULL,
  collisionTypeID             SMALLINT             NOT NULL,
  magFieldScale               VARCHAR(50)    NOT NULL,
  magFieldValue               FLOAT,
  runComments                 TEXT,
  PRIMARY KEY                          (runParamID),
  INDEX RP_RunNumberIndex              (runNumber),
  INDEX RP_DataTakingStartIndex        (dataTakingStart),
  INDEX RP_DataTakingEndIndex          (dataTakingEnd),
  INDEX RP_MagFieldScaleIndex          (magFieldScale),
  INDEX RP_MagFieldValueIndex          (magFieldValue),
  INDEX RP_SimulationParamsIndex       (simulationParamsID),
  INDEX RP_RunTypeIndex                (runTypeID),
  INDEX RP_TriggerSetupIndex           (triggerSetupID),
  INDEX RP_DetectorConfigurationIndex  (detectorConfigurationID),
  INDEX RP_CollisionTypeIndex          (collisionTypeID)
) TYPE=MyISAM;

DROP TABLE IF EXISTS FileData; CREATE TABLE FileData
(
  fileDataID                    BIGINT          NOT NULL AUTO_INCREMENT,
  runParamID                    INT             NOT NULL,
  fileName                      VARCHAR(255)       NOT NULL,
  baseName                      VARCHAR(255)       NOT NULL COMMENT 'Name without extension',
  sName1                        VARCHAR(255) NOT NULL COMMENT 'Will be used for name+runNumber',
  sName2                        VARCHAR(255) NOT NULL COMMENT 'Will be used for name before runNumber',
  productionConditionID         INT             NULL,
  numEntries                    MEDIUMINT,
  md5sum                        CHAR(32)     DEFAULT 0,
  fileTypeID                    SMALLINT NOT NULL,
  fileSeq                       SMALLINT,
  fileStream                    SMALLINT,
  fileDataComments              TEXT,
  PRIMARY KEY                   (fileDataID),
  INDEX         FD_FileNameIndex                (fileName(40)),
  INDEX         FD_BaseNameIndex                (baseName),
  INDEX         FD_SName1Index                  (sName1),
  INDEX         FS_SName2Index                  (sName2),
  INDEX         FD_RunParamsIndex               (runParamID),
  INDEX         FD_ProductionConditionIndex     (productionConditionID),
  INDEX         FD_FileTypeIndex                (fileTypeID),
  INDEX         FD_FileSeqIndex                 (fileSeq),
  UNIQUE        FD_FileDataUnique               (runParamID, fileName, productionConditionID, fileTypeID, fileSeq)
) TYPE=MyISAM;



# FileParents
DROP TABLE IF EXISTS FileParents; CREATE TABLE FileParents
(
  parentFileID                  BIGINT          NOT NULL,
  childFileID                   BIGINT          NOT NULL,
  PRIMARY KEY                   (parentFileID, childFileID)
) TYPE=MyISAM;

# FileLocations
DROP TABLE IF EXISTS FileLocations; CREATE TABLE FileLocations
(
  fileLocationID                BIGINT          NOT NULL      AUTO_INCREMENT,
  fileDataID                    BIGINT          NOT NULL,
  filePathID                    BIGINT          NOT NULL,
  storageTypeID                 MEDIUMINT       NOT NULL,
  createTime                    TIMESTAMP,
  insertTime                    TIMESTAMP       NOT NULL,
  owner                         VARCHAR(30),
  fsize                         BIGINT,
  storageSiteID                 SMALLINT      NOT NULL,
  protection                    VARCHAR(15),
  hostID                        BIGINT          NOT NULL DEFAULT 1,
  availability                  TINYINT         NOT NULL DEFAULT 1,
  persistent                    TINYINT         NOT NULL DEFAULT 0,
  sanity                        TINYINT         NOT NULL DEFAULT 1,
  PRIMARY KEY                   (fileLocationID),
  INDEX         FL_FilePathIndex                (filePathID),
  INDEX         FL_FileDataIndex                (fileDataID),
  INDEX         FL_StorageTypeIndex             (storageTypeID),
  INDEX         FL_StorageSiteIndex             (storageSiteID),
  INDEX         FL_HostIndex                    (hostID),
  UNIQUE        FL_FileLocationUnique           (fileDataID, storageTypeID, filePathID, storageSiteID, hostID)
) TYPE=MyISAM;
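To illustrate how the normalized layout fits together, the sketch below resolves file locations back to full path and file names (the host and database names are illustrative, matching the XML configuration example below; access credentials are omitted):

% mysql -h duvall.star.bnl.gov FileCatalog -e "
    SELECT fp.filePathName, fd.fileName
      FROM FileLocations AS fl
      JOIN FilePaths     AS fp USING (filePathID)
      JOIN FileData      AS fd USING (fileDataID)
     LIMIT 5;"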

XML configuration

 

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE SCATALOG [
   <!ELEMENT SCATALOG (SITE*)>
       <!ATTLIST SCATALOG VERSION CDATA #REQUIRED>
   <!ELEMENT SITE (SERVER+)>
       <!ATTLIST SITE name (BNL | LBL) #REQUIRED>
       <!ATTLIST SITE description CDATA #IMPLIED>
       <!ATTLIST SITE URI CDATA #IMPLIED>
   <!ELEMENT SERVER (HOST+)>
       <!ATTLIST SERVER SCOPE (Master | Admin | User) #REQUIRED>
   <!ELEMENT HOST (ACCESS+)>
       <!ATTLIST HOST NAME CDATA #REQUIRED>
       <!ATTLIST HOST DBTYPE CDATA #IMPLIED>
       <!ATTLIST HOST DBNAME CDATA #REQUIRED>
       <!ATTLIST HOST PORT CDATA #IMPLIED>
   <!ELEMENT ACCESS EMPTY>
       <!ATTLIST ACCESS USER CDATA #IMPLIED>
       <!ATTLIST ACCESS PASS CDATA #IMPLIED>
]>



<SCATALOG VERSION="1.0.1">
        <SITE name="BNL">
                <SERVER SCOPE="Master">
                        <HOST NAME="mafata.wherever.net" DBNAME="Catalog_XXX" PORT="1234">
                                <ACCESS USER="Moi" PASS="HelloWorld"/>
                        </HOST>
                        <HOST NAME="mafata.wherever.net" DBNAME="Catalog_YYY" PORT="1235">
                                <ACCESS USER="Moi" PASS="HelloWorld"/>
                        </HOST>
                        <HOST NAME="duvall.star.bnl.gov" DBNAME="FileCatalog" PORT="">
                                <ACCESS USER="FC_master" PASS="AllAccess"/>
                        </HOST>
                </SERVER>
                <SERVER SCOPE="Admin">
                        <HOST NAME="duvall.star.bnl.gov" DBNAME="FileCatalog_BNL" PORT="">
                                <ACCESS USER="FC_admin" PASS="ExamplePassword"/>
                        </HOST>
                </SERVER>
                <SERVER SCOPE="User">
                        <HOST NAME="duvall.star.bnl.gov" DBNAME="FileCatalog_BNL" PORT="">
                                <ACCESS USER="FC_user" PASS="FCatalog"/>
                        </HOST>
                </SERVER>
        </SITE>
</SCATALOG>

Migration and notes from V01.265 to V01.275

This document is intended only for FileCatalog managers who have previously deployed an earlier version of the API and the older database table layout. It is NOT intended for users.

Reasoning for this upgrade and core of the upgrade

One of the major problems with the preceding database layout started to show itself when we reached 4 million entries (for some reason, we seem to have magic numbers). A dire restriction was the presence of the fields 'path' and 'nodename' in the FileLocations table. This table became unnecessarily large (of the order of a GB), and sorting and queries became slow and IO demanding (regardless of our careful indexing). The main action was to move both fields to separate tables. This change requires a two-step modification:

  1. reshape the database (leaving the old fields in place) and deploy the database API in cross-mode support
  2. run the normalization scripts filling the new tables and fields, deploy the final API and drop the obsolete columns (+ index rebuild)

The steps are more carefully described below ...

Step by step migration instructions

This has to be done in several steps for safety and the least interruption of service (although a pain for the manager). Note that you can do it much faster by cutting the Master/slave relationship, disabling all daemons auto-updating the database, proceeding with the table reshape and normalization script execution, dropping and rebuilding the indexes, deploying the point-of-no-return API, and restoring the Master/slave relation.

This upgrade works best if you have perl 5.8 or higher. Note that this transition will be the LAST one supporting perl 5.6 (get ready for a perl upgrade on your cluster).

We will assume you know how to connect to your database from an account able to manipulate and create any tables in the FileCatalog database.

Steps in Phase I

  1. Create the following tables
      DROP TABLE IF EXISTS FilePaths; CREATE TABLE FilePaths
      (
        filePathID                    BIGINT         NOT NULL         AUTO_INCREMENT,
        filePathName                  VARCHAR(255)   NOT NULL         UNIQUE,
        filePathCount                 INT,
        PRIMARY KEY                   (filePathID)
      ) TYPE=MyISAM;
    
      DROP TABLE IF EXISTS Hosts; CREATE TABLE Hosts 
     (
        hostID      smallint(6) NOT NULL auto_increment,
        hostName    varchar(30) NOT NULL default 'localhost',
        hostIDate   timestamp(14) NOT NULL,
        hostCreator varchar(15) NOT NULL default 'unknown',
        hostCount   int(11) default NULL,
        hostComment text,
        PRIMARY KEY (hostID),
        UNIQUE KEY  hostName (hostName)
      ) TYPE=MyISAM;
    
    
  2. Modify some tables and recreate one
         
         ALTER TABLE `FileLocations` ADD `filePathID` bigint(20) NOT NULL default '0' AFTER `fileDataID`;
         ALTER TABLE `FileLocations` ADD `hostID` bigint(20) NOT NULL default '1' AFTER `protection`;
         UPDATE `FileLocations` SET hostID=0;
    
         # note that I did that one from the Web interface (TBC)
         INSERT INTO Hosts VALUES(0,'localhost',NOW()+0,'',0,'Any unspecified node'); 
    
         ALTER TABLE `FileLocations` ADD INDEX ( `filePathID` );
    
         ALTER TABLE `FilePaths` ADD `filePathIDate` TIMESTAMP NOT NULL AFTER `filePathName` ;
         ALTER TABLE `FilePaths` ADD `filePathCreator` CHAR( 15 ) DEFAULT 'unknown' NOT NULL AFTER `filePathIDate` ;
         ALTER TABLE `FilePaths` ADD `filePathComment` TEXT AFTER `filePathCount`;
    
         ALTER TABLE `StorageSites` ADD  `storageSiteIDate` TIMESTAMP NOT NULL AFTER `storageSiteLocation` ;
         ALTER TABLE `StorageSites` ADD  `storageSiteCreator` CHAR( 15 ) DEFAULT 'unknown' NOT NULL AFTER `storageSiteIDate` ;
         ALTER TABLE `StorageSites` DROP `storageComment`;
         ALTER TABLE `StorageSites` ADD  `storageSiteComment` TEXT AFTER `storageSiteCount`;
    
         ALTER TABLE `StorageTypes` ADD `storageTypeIDate` TIMESTAMP NOT NULL AFTER `storageTypeName` ;
         ALTER TABLE `StorageTypes` ADD `storageTypeCreator` CHAR( 15 ) DEFAULT 'unknown' NOT NULL AFTER `storageTypeIDate` ;
    
    
         ALTER TABLE `FileTypes` ADD `fileTypeIDate` TIMESTAMP NOT NULL AFTER `fileTypeExtension` ;
         ALTER TABLE `FileTypes` ADD `fileTypeCreator` CHAR( 15 ) DEFAULT 'unknown' NOT NULL AFTER `fileTypeIDate` ;
         ALTER TABLE `FileTypes` ADD `fileTypeComment` TEXT AFTER `fileTypeCount`;
    
    
         ALTER TABLE `TriggerSetups` ADD `triggerSetupIDate` TIMESTAMP NOT NULL AFTER `triggerSetupComposition` ;
         ALTER TABLE `TriggerSetups` ADD `triggerSetupCreator` CHAR( 15 ) DEFAULT 'unknown' NOT NULL AFTER `triggerSetupIDate`;
         ALTER TABLE `TriggerSetups` ADD `triggerSetupCount`   INT AFTER `triggerSetupCreator`;
         ALTER TABLE `TriggerSetups` ADD `triggerSetupComment` TEXT  AFTER `triggerSetupCount`;
    
         ALTER TABLE `EventGenerators` ADD `eventGeneratorIDate` TIMESTAMP NOT NULL AFTER `eventGeneratorParams` ;
         ALTER TABLE `EventGenerators` ADD `eventGeneratorCreator` CHAR( 15 ) DEFAULT 'unknown' NOT NULL AFTER `eventGeneratorIDate` ;
         ALTER TABLE `EventGenerators` ADD `eventGeneratorCount`   INT AFTER `eventGeneratorCreator`;
    
         ALTER TABLE `RunTypes` ADD `runTypeIDate` TIMESTAMP NOT NULL AFTER `runTypeName` ;
         ALTER TABLE `RunTypes` ADD `runTypeCreator` CHAR( 15 ) DEFAULT 'unknown' NOT NULL AFTER `runTypeIDate` ;
    
         ALTER TABLE `ProductionConditions` DROP `productionComments`; 
         ALTER TABLE `ProductionConditions` ADD  `productionConditionIDate`   TIMESTAMP NOT NULL AFTER `libraryVersion`;
         ALTER TABLE `ProductionConditions` ADD  `productionConditionCreator` CHAR( 15 ) DEFAULT 'unknown' NOT NULL AFTER `productionConditionIDate`;
         ALTER TABLE `ProductionConditions` ADD  `productionConditionComment` TEXT AFTER `productionConditionCount`;
    
    
    
         #
         # This table was not shaped as a dictionary so needs to be re-created
         # Hopefully, was not filled prior (but will be this year)
         #
         DROP TABLE IF EXISTS TriggerWords; CREATE TABLE TriggerWords
         (
            triggerWordID           MEDIUMINT       NOT NULL        AUTO_INCREMENT,
            triggerWordName         VARCHAR(50)     NOT NULL,
            triggerWordVersion      CHAR(6)         NOT NULL DEFAULT "V0.0",
            triggerWordBits         CHAR(6)         NOT NULL,  
            triggerWordIDate        TIMESTAMP       NOT NULL,
            triggerWordCreator      CHAR(15)        DEFAULT 'unknown' NOT NULL,
            triggerWordCount        INT,
            triggerWordComment      TEXT,
            UNIQUE   TW_TriggerCharacteristic (triggerWordName, triggerWordVersion, triggerWordBits),
            PRIMARY KEY             (triggerWordID)
         ) TYPE=MyISAM;
  3. Deploy the new API, CVS version 1.62 of FileCatalog.pm

  4. Run the following utility scripts

    util/path_convert.pl
    util/host_convert.pl

    Note that those scripts use a new method, $fC->connect_as("Admin");, which assumes that the Master Catalog will be accessed using the XML connection description. Also, it should be obvious that

    use lib "/WhereverYourModulAPIisInstalled"; should be replaced by the appropriate path for your site (or test area). Finally, the scripts use API CVS version 1.62, which supports the transitional keywords Xpath and Xnode, allowing us to transfer the information from one field to one table.
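    Putting it together, the top of such a conversion script presumably looks like the minimal sketch below (illustrative only; the real scripts live in util/ and do much more):

      # Skeleton of a conversion-style script (sketch, not the actual code).
      use lib "/WhereverYourModulAPIisInstalled";   # adjust for your site
      use FileCatalog;

      my $fC = FileCatalog->new();
      # $fC->debug_on();               # uncomment for verbose diagnostics
      $fC->connect_as("Admin");        # credentials come from the XML description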

  5. Check that the Hosts table was filled properly and automatically with Creator/IDate
  6. Paranoia step: re-run the scripts mentioned two steps ago

    At this stage, ideally, nothing should happen (as you have already modified the records).
    A few tips before doing that:
    • % fC_cleanup.pl -modif node=localhost -cond node='' -doit
      would hopefully do nothing, but if you have messed something up in the past, hostName would be NULL and the above would be necessary.
    • After a full update, the following queries should return NOTHING
      % get_file_list.pl -keys flid -cond rfpid=0 -all -alls -as Admin
      % get_file_list.pl -keys flid -cond rhid=0 -all -alls -as Admin

      Those are equivalent to the SQL statements
      >SELECT FileLocations.fileLocationID FROM FileLocations WHERE FileLocations.filePathID = 0 LIMIT 0, 100
      >SELECT FileLocations.fileLocationID FROM FileLocations WHERE FileLocations.hostID = 0 LIMIT 0, 100 
    If either query does return anything, contact me for further investigation and database repairs. As a side note, the -as keyword was introduced recently; you should update your get_file_list.pl script if it is not available.
  7. Make a backup copy of the database for security (optional but safer). The backup can be done either via a mysql dump or, more trivially, via a cp -r of the database directory.
  8. Leave it running for a few days (should be fine) for confidence consolidation ;-)

You are ready for phase II. Hang on tight now ...

Steps in Phase II

These steps are now VERY intrusive and potentially destructive. Be careful from here on ...

  1. Stop all daemons and be sure that, during the rest of the operations, NO command attempts to manipulate the database. If you want to shield your users from the upgrade, stop all Master/slave relations.
  2. Connect to the master FileCatalog as administrator for that database and execute the following SQL commands
      > ALTER TABLE `FileLocations` ADD INDEX FL_HostIndex (hostID);
      > ALTER TABLE `FileLocations` DROP INDEX `FL_FileLocationUnique`, ADD UNIQUE (fileDataID, storageTypeID, filePathID, storageSiteID, hostID);
    
      # drop the columns not in use anymore / should also get rid of the associated
      # indexes.
      > ALTER TABLE `FileLocations` DROP COLUMN nodeName;
      > ALTER TABLE `FileLocations` DROP COLUMN filePath;
    
      # "rename" index / was created with a name difference to avoid clash for transition
      # now renamed for consistency
      > ALTER TABLE `FileLocations` DROP INDEX `filePathID`, ADD INDEX  FL_FilePathIndex (filePathID);
  3. OK, you should be done. Deploy CVS version 1.63, which corresponds to FileCatalog API version V01.275, or above ... (by the way, get_file_list.pl -V gives the API version).


 

A few notes

  • The new API is XML-connection aware via a non-mandatory module named XML::Simple. You should install that module, but there are some limitations if you are using Perl 5.6, i.e., you MUST use the schema with ONLY one choice per category (Admin, Master or User).
  • Your scripts will likely need to change if your database Master and Slave are not on the same node (i.e. the administration account for the FileCatalog can be used only on the database Master, and the regular user account on the Slave). There are a few forms of this, such as the one below:
    # Get connection fills the blanks while reading from XML
    # However, USER/PASSWORD presence are re-checked
    #$fC->debug_on();
    ($USER,$PASSWD,$PORT,$HOST,$DB) = $fC->get_connection("Admin");
    $port = $PORT if ( defined($PORT) );
    $host = $HOST if ( defined($HOST) );
    $db   = $DB   if ( defined($DB) );
    
    
    if ( defined($USER) ){   $user = $USER;}
    else {                   $user = "FC_admin";}
    
    if ( defined($PASSWD) ){ $passwd = $PASSWD;}
    else {                   print "Password for $user : ";
                             chomp($passwd = <STDIN>);}
    
    #
    # Now connect using a fully specified user/passwd/port/host/db
    #
    $fC->connect($user,$passwd,$port,$host,$db);

    or counting on the full definition in the XML file

    $fC    = FileCatalog->new();
    $fC->connect_as("Admin");
  • Note a small future convenience when XML is ON: connect_as() selects not only who you want to connect as, but where as well. In fact, the proper syntax is intent=SITE::User (for example, BNL::Admin is valid as well as LBL::User); see the short sketch after this list. This is only partly supported however.
  • The new version of the API automatically adds information to the dictionary tables. In particular, the account under which a new dictionary value was inserted (Creator) and the insertion date (IDate) are filled automatically. A side effect is that the new API is NOT compatible with the previous database table layout (no backward support will be attempted).
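For illustration, here is a minimal sketch of the site-qualified form quoted above (only partly supported, as noted; BNL::Admin is taken from the examples in this note):

    use FileCatalog;

    # Select both the site and the scope in one call; the XML connection
    # description provides the matching host, port and credentials.
    my $fC = FileCatalog->new();
    $fC->connect_as("BNL::Admin");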

Migration and notes from V01.275 to V01.280

This document is intended only for FileCatalog managers who have previously deployed an earlier version of the API and an older database table layout. It is NOT intended for users.

Reasoning for this upgrade and core of the upgrade

This upgrade is a minor one, adding support for two more detector sub-systems. The new API supports this modification. You need to alter the table DetectorConfigurations and add two columns. The API is always forward compatible in that regard, so it is completely safe to alter the table first and deploy the API later.

ALTER TABLE `DetectorConfigurations` ADD dBSMD TINYINT;
ALTER TABLE `DetectorConfigurations` ADD dESMD TINYINT;
UPDATE `DetectorConfigurations` SET dBSMD=0;
UPDATE `DetectorConfigurations` SET dESMD=0;

And deploy the API V01.280 or later. You are done.
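If you want to double-check the alteration before (or after) deploying, a quick look at the table layout suffices. Below is a hedged sketch; the connection string and credentials are placeholders for your own setup:

    # Verify that the two new columns exist (sketch only).
    use DBI;

    my $dbh = DBI->connect("DBI:mysql:database=FileCatalog;host=localhost",
                           "FC_admin", "password", { RaiseError => 1 });
    foreach my $col (@{ $dbh->selectcol_arrayref(
        "SHOW COLUMNS FROM DetectorConfigurations LIKE 'd%SMD'") }) {
        print "found column: $col\n";   # expect dBSMD and dESMD
    }
    $dbh->disconnect();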

HPSS services

HPSS Performance study

Introduction

HPSS is software that manages petabytes of data on disk and robotic tape libraries. In this section, we discuss our observations on the efficiency of accessing files in STAR, as a snapshot of the 2006 situation. It is clear that IO optimization has several components, among which:
  • Access pattern optimization (request ordering, ...)
  • Optimization based on tape drive and technology capabilities
  • Miscellaneous technology considerations
    (cards, interface, firmware and driver, RAID, disks, ...)
  • HPSS disk cache optimizations
  • COS and/or PVR optimization
However, several trends have already been the object of past research; we will point to some of those and compare them to our situation rather than debating the obvious. We will try to keep a focus on measurements in our environment.

Tape drive and technology capabilities

A starting point is to discuss the capabilities of the technologies involved, their maximum performance, and their limitations. In STAR, two technologies remain as of 2006/10:
  • the 9940B drives
  • the LTO-3 drives

Access pattern optimization (request ordering, ...)

A first simple and immediate consideration is to minimize tape mount and dismount operations, which cause latencies and therefore performance drops. Since we use the DataCarousel for most restore operations, let's summarize its features.

The DataCarousel

The DataCarousel (DC) is an HPSS front end whose main purpose is to coordinate requests from many uncorrelated clients. Its main assumption is that all requests are asynchronous, that is, you make a request from one client and it is satisfied “later” (as soon as possible). In other words, the DC aggregates all requests from all clients (many users could be considered as separate clients) and re-orders them according to policies, possibly aggregating multiple requests for the same source into one request to the mass storage. The DC system itself is composed of a light client program (script), a plug-and-play policy-based server architecture component (script) and a permanent process (compiled code) interfacing with the mass storage using HPSS API calls (this component is known as the “Oakridge Batch” although its current code content has little to do with the original idea from the Oak Ridge National Laboratory). Client and server interact via a database component isolating client and server completely from each other (but sharing the same API, a Perl module).

Policies may throttle the amount of data by group (quota, bandwidth percentage per user, etc., i.e. request-queue fairshare) but may also perform tape access optimization such as grouping requests by tape ID (for equivalent share, all requests from the same tape are grouped together regardless of the time at which each request was made or its position in the request queue). The policy could be anything one can come up with based on historical usage or on the current pending requests and their characteristics (this could include file type, user, class of service, file size, ...). The DC then submits bundles of requests to the daemon component; each bundle of N files is known as a “job”. The DC submits K of those jobs before stopping and observing the mass storage behavior: if the jobs go through, more are submitted; otherwise, the server either stops or proceeds with a recovery procedure and consistency checks (as it will assume that no reaction and no unit of work being performed is a sign of MSS failure). In other words, the DC is also error resilient and recovers from intrinsic HPSS failures (being monitored). Whenever the files are moved from tape to cache in the MSS, a callback to the DC server is made and a captive account connection is initiated to pull the file out of the mass storage cache to more permanent storage.

Optimizations

While the policy is clearly a source of optimization (as far as the user is concerned), from a DataCarousel “post policy” perspective, N*K files are being requested at minimum at every point in time. In reality, more jobs are submitted, and the consumption of this “overflow” of jobs is used to monitor whether the MSS is alive. The N*K files represent a total number of files which should match the number of threads allowed by the daemon. The current settings are K=50 and N=15 (i.e. at least 750 files in flight), with an overflow allowed up to 25. The daemon itself has the possibility to treat requests simultaneously according to a “depth”; those calls to HPSS are however only advisory. The depth is set to 30 for the DST COS and 20 for the Raw COS. The deeper the request queue, the more files are requested simultaneously, but this also means that the daemon has to start more threads, as previously noted. Those parameters have been shown to influence the performance to some extent (within 10%) with, however, a large impact on response time: the larger the request stack, the “less instantaneous” the response from a user's perspective (since the request queue length is longer).

The daemon has the ability to organize X requests into a list sorted by tape ID and number of requests per tape. There are a few strategies that can alter the performance. We chose to enable “start with the tape with the largest number of files requested”. In addition, and since our queue depth is rather small compared to the ideal number of files (K) per job, we order the files requested by the user by tape ID. Both optimizations are in place and lead to a 20% improvement under realistic usage (bulk restore, Xrootd, other user activities).
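As a rough sketch of the “largest tape first” strategy (illustrative only; the request fields and the submit_job() helper are hypothetical names, not the actual DataCarousel internals):

    # Hedged sketch: group pending requests by tape, serve big tapes first.
    # @requests is assumed to hold hash references with a 'tapeid' key.
    my %bytape;
    foreach my $req (@requests) {
        push(@{ $bytape{ $req->{tapeid} } }, $req);
    }
    # Tapes with the most pending files are served first; all requests for
    # a given tape are bundled regardless of when they were submitted.
    foreach my $tape (sort { scalar(@{$bytape{$b}}) <=> scalar(@{$bytape{$a}}) }
                      keys %bytape) {
        submit_job(@{ $bytape{$tape} });   # hypothetical bundling call
    }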

Remaining optimizations

  • Optimization based on tape ID would need to be better quantified (graph, average restore rate) for several classes of files and usage. TBD.

  • The tape ID program is a first implementation returning partial information. In particular, MSS failures are not currently handled, leading to setting the tape ID to -1 (since there is currently no way to recognize whether it is an error, a file missing from HPSS, or a file present in the MSS MetaData server but located on a bad tape). Work in progress.

  • The queue depth parameters should be studied and adjusted according to the K and N values. However, this would need to respect the machine / hardware capabilities. The beefier the machine, the better, but this is likely fine tuning. It needs to be done with great care, as the hardware is also shared by multiple experiments. Ideally, the compiled daemon should auto-adjust to the DC API settings (and respect command line parameters for queue depth). TBD.

  • Currently, the daemon threads used for handling the HPSS API calls and those handling the callbacks share the same pool. This diminishes the number of threads available for communication with the Mass Storage and therefore causes performance fluctuations (callback threads could get “stuck” or come in “waves”; we observed a cosine-like behavior perhaps related to this issue). TBD.


Optimizations based on drive and technology capabilities

File size effects on IO performance

In this paper (CERN/IT, 2005), the author measured the IO performance as a function of file size and of the number of files requested per tape. The figure of relevance is added here for illustration.
[Figure: HPSS IO efficiency per file size and per number of files per cartridge]
This graph was extracted for an optimal 30 MB/s capable drive (9940B-like). Both file size and number of files per cartridge were evaluated. The conclusions are immediate and confirm the advertised behavior observed by all HPSS deployments (see references below). Small file size is detrimental to HPSS IO performance, and the critical size highly depends on the tape technology.

In STAR, we use the 9940B drives (read only as of 2006) and the LTO-3 drives (read and write; all new files go to LTO-3). The findings would not be altered, but we have little margin of flexibility with the "old" tape drives.

Below, we show the average file size per file type in STAR as a 2006 snapshot.

Average (bytes) Average (MB) File Type
943240627 899 MC_fzd
666227650 635 MC_reco_geant
561162588 535 emb_reco_event
487783881 465 online_daq
334945320 319 daq_reco_laser
326388157 311 MC_reco_dst
310350118 295 emb_reco_dst
298583617 284 daq_reco_event
246230932 234 daq_reco_dst
241519002 230 MC_reco_event
162678332 155 MC_reco_root_save
93111610 88 daq_reco_MuDst
52090140 49 MC_reco_MuDst
17495114 16 MC_reco_minimc
14982825 14 daq_reco_emcEvent
14812257 14 emb_reco_geant
12115661 11 scaler
884333 0 daq_reco_hist

Note that the average size for an event file is 284 MB, while for a MicroDST the average size is 88 MB, i.e. a ratio of ~3. The number of files per cartridge is at best 1.2, with peaks at 10 or more. This is mostly due to a request profile dominated by Xrootd "random" access patterns and a few user requests. According to the previous study, the IO efficiency should be around 8% for the 88 MB file class and perhaps reach 20-25% for the 284 MB average file class. Considering that we have not studied the drive access pattern beyond a simple scaling (i.e. we ignore to first order the fact that we have many drives at our disposal), we should see the efficiency go from 8% to 20%, an improvement of x2.5-3, should the file size argument stand.

In order to observe the IO performance when small or big files are requested, we requested event files through the DataCarousel and produced the two graphs below for dates ranging between 2006/10/26 and 2006/10/29. The first graph represents the IO "before" the massive submission of event-file dominated requests, the second the IO "after". The graphs are preliminary (work in progress).
[Figure: DataCarousel IO performance, 2006/10/27 (before)]
[Figure: DataCarousel IO performance, 2006/10/29 (after)]
We observe an average transfer rate saturating at best at 15 MB/sec for MuDST-dominated requests and an average close to 50 MB/sec for event files. The ratio is ~3, which remains consistent with the CERN/IT results on HPSS IO efficiency per file size and per cartridge, and with our initial rough estimate.

Note: It is interesting to note that a significant mix of very small files (below 12 MB average) would bring the performance to a sub-1% efficiency. The net result for 9 drives (as we have in STAR) would be an aggregate performance no better than 3 MB/sec for a 9940B x 9 drive configuration. We observe periods with such poor performance. The second observation is that even with MuDST-dominated files only, we would not be able to exceed 70% of the speed of one drive, so at best 21 MB/sec (this corresponds to our current "best hour"). The results are coherent to first order.

MSS failures and cascading effects

A poor MSS IO efficiency is one thing, but stability is another. Under poor performance conditions, it is critical to minimize failures. We have already stated that the DC is error resilient. However, during failure periods the request queue accumulates requests, and whenever the mass storage comes back, all requests are suddenly released, opening the flood gates of IO ... which turn out to be not so much a flood as a drip. As a consequence, user requests or bulk transfers would not suffer much, but modes requiring immediate response (such as Xrootd) would be largely impacted. In fact, should the downtime be long enough, it is likely that all requests made while the errors started will fail, and the subsequently accumulated requests will cause further delays and spurious Xrootd failures (Xrootd will time out if the DC has not satisfied its request within 3600 seconds, i.e. 1 hour per file). The following graph shows an error sequence:
[Figure: HPSS error types, 2006-09, week 36]
While there are errors at the early stage of this graph, reminiscent of previous failures, the problems during the focused time period start around 9 AM with a MetaData lookup failure (cannot get lock after 5 retries). Subsequently, files immediately start failing to be restored for more than an hour, and this continues up to around 10:30 to 11:00, at which point more errors occur from the Mass Storage system (a mix of MetaData lookup failures and massive authentication failures). The authentication failures are related to the DCE component failure, causing periodic problems. In our case, we immediately see the light blue band continuing up to 15:00 (3 PM), followed by yet another massive metadata failure. All of those cascading failures would, for a period of no less than 6 hours, affect users requesting files from Xrootd which needed to be restored from the MSS.

The relative proportion of all failures for that day is displayed below:
[Figure: HPSS error types, relative proportions, 2006-09]
Only one error in this graph (a DataCarousel connection failure) can be fixed from a STAR standpoint; all other occurrences are a facility issue to resolve.

Miscellaneous technology considerations

All considerations in this section are beyond our control and are a matter of facility-side work and optimization.

HPSS disk cache optimizations

This section seems rather academic considering the improvement perspectives of the previous sections.

COS and PVR optimizations

In this section, we will discuss optimizing based on file size, perhaps isolating by PVR or COS. This will be possible in future runs but would lead to a massive repackaging of files and data from the past years.


Appendix

Further reading:



HSI


This is a highlight of the HSI features. Please visit the HSI Home Page for more information.

HSI is a friendly interface for users of the High Performance Storage System (HPSS). It is intended to provide a familiar Unix-style environment for working within the HPSS environment, while automatically taking advantage of the power of HPSS (e.g. for high speed parallel file transfers) without requiring any special user interaction, where possible.

HSI requires one of two authentication methods (see the HSI User Guide for more information):
  • Kerberos (the preferred method)
  • DCE keytab (using a keytab file generated for you by the HPSS system administrators)
HSI's features include:
  • Familiar Unix-style command interface, with commands such as "LS", "CD", etc.
  • Interactive, batch, or "one-liner" execution modes
  • Ability to interactively pipe data into or out of HPSS, using filters such as "TAR"
  • A recursive option is available for most commands, including the ability to copy an entire directory tree to or from HPSS with a single simple command
  • Conditional put and get operations, including ability to update based on file timestamps
  • Automatically uses HPSS parallel I/O features for file transfer operations
  • Multi-threaded I/O within a single process space
  • Command aliases and abbreviations
  • 10 working directories
  • Ability to read command input from a file, and write log or command output to a file.
  • Non-DCE version runs on most major Unix-based platforms
  • Non-DCE version provides the ability to connect to multiple HPSS systems and perform 3rd-party copies between the systems, using a "virtual drive" path notation.

HTAR

To use htar within the HPSS environment, users are required to have valid Kerberos credentials.

The following is the man page describing how to use htar.

   
                     NAME
                          htar - HPSS tar utility
     
   
                     PURPOSE
                          Manipulates HPSS-resident tar-format archives.


                     SYNOPSIS
                          htar  -{c|t|x|X}  -f Archive [-?]  [-B] [-E]  [-L  inputlist] [-h]  [-m] [-o]
                                 [-d  debuglevel] [-p] [-v]  [-V] [-w]
                                 [-I  {IndexFile | .suffix}] [-Y  [Archive COS ID][:Index File COS ID]]
                                 [-S  Bufsize] [-T  Max Threads] [Filespec | Directory ...]

                     DESCRIPTION
                          htar  is a utility which manipulates HPSS-resident archives
                          by writing files to,  or retrieving files from the High
                          Performance Storage System (HPSS).  Files written to HPSS
                          are in the POSIX 1003.1 "tar" format, and may be retrieved
                          from HPSS, or read by native tar programs.

                          For those unfamiliar with HPSS, an introduction can be found
                          on the web at
                                 http://www.sdsc.edu/hpss

                          The local files used by the htar command are represented by
                          the Filespec parameter. If the Filespec parameter refers to
                          a directory, then that directory, and, recursively, all
                          files and directories within it, are referenced as well.

                          Unlike the standard Unix "tar" command, there is no default
                          archive device; the "-f Archive" flag is required.

                     Archive and Member files
                          Throughout the htar documentation, the term "archive file"
                          is used to refer to the tar-format  file, which is named by
                          the "-f filename" command line option. The term "member
                          file" is used to refer to individual files contained within
                          the archive file.

                     WHY USE HTAR
                          htar has been optimized for creation of archive files
                          directly in HPSS, without having to go through the
                          intermediate step of first creating the archive file on
                          local disk storage, and then copying the archive file to
                          HPSS via some other process such as ftp or hsi. The program
                          uses multiple threads and a sophisticated buffering scheme
                          in order to package member files into in-memory buffers,
                          while making use of the high-speed network striping
                          capabilities of HPSS.

                          In most cases, it will be significantly faster to use htar
                          to create a tar file in HPSS than to either create a local
                          tar file and then copy it to HPSS, or to use tar piped into
                          ftp (or hsi) to create the tar file directly in HPSS.

                          In addition, htar creates a separate index file, (see next
                          section) which contains the names and locations of all of
                          the  member files in the archive (tar) file.  Individual
                          files and directories in the archive can be randomly
                          retrieved without having to read through the archive file.
                          Because the index file is usually smaller than the archive
                          file, it is possible that the index file may reside in HPSS
                          disk cache  even though the archive file has been moved
                          offline to tape; since htar uses the index file for listing
                          operations, it may be possible to list the contents of the
                          archive file without having to incur the time delays of
                          reading the archive file back onto disk cache from tape.

                          It is also possible to create an index file for a tar file
                          that was not originally created by htar.

                     HTAR Index File
                          As part of the process of creating an archive file on HPSS,
                          htar also creates an index file, which is a directory of the
                          files contained in the archive. The Index File includes the
                          position of member files within the archive, so that files
                          and/or directories can be randomly retrieved from the
                          archive without having to read through it sequentially.  The
                          index file is usually significantly smaller in size than the
                          archive file, and may often reside in HPSS disk cache even
                          though the archive file resides on tape. All htar operations
                          make use of an index file.

                          It is also possible to create an index file for an archive
                          file that was not created by htar, by using the "Build
                          Index" [-X] function (see below).

                          By default, the index filename is created by adding ".idx"
                          as a suffix to the Archive name specified by the -f
                          parameter.  A different suffix or index filename may be
                          specified by the "-I " option, as described below.

                          By default, the Index File is assumed to reside in the same
                          directory as the Archive File.  This can be changed by
                          specifying a relative or absolute pathname via the -I
                          option. The Index file's relative pathname is relative to
                          the Archive File directory unless an absolute pathname is
                          specified.

                     HTAR Consistency File
                          HTAR writes an extra file as the last member file of each
                          Archive, with a name similar to:

                                  /tmp/HTAR_CF_CHK_64474_982644481

                          This file is used to verify the consistency of the Archive
                          File and the Index File.  Unless the file is explicitly
                          specified, HTAR does not extract this file from the Archive
                          when the -x action is selected.  The file is listed,
                          however, when the -t action is selected.

                     Tar File Restrictions
                          When specifying path names that are greater than 100
                          characters for a file (POSIX 1003.1 USTAR) format, remember
                          that the path name is composed of a prefix buffer, a /
                          (slash), and a name buffer.

                          The prefix buffer can be a maximum of 155 bytes and the name
                          buffer can hold a maximum of 100 bytes. Since some
                          implementations of TAR require the prefix and name buffers
                          to terminate with a null ('\0') character, htar enforces the
                          restriction that the effective prefix buffer length is 154
                          characters (+ trailing zero byte), and the name buffer
                          length is 99 bytes (+ trailing zero byte). If the path name
                          cannot be split into these two parts by a slash, it cannot
                          be archived. This limitation is due to the structure of the
                          tar archive headers, and must be maintained for compliance
                          with standards and backwards compatibility. In addition, the
                          length of a destination for a hard or symbolic link ( the
                          'link name') cannot exceed 100 bytes (99 characters + zero-
                          byte terminator).

                     HPSS Default Directories
                          The default directory for the Archive file is the HPSS home
                          directory for the DCE user.  An absolute or relative HPSS
                          path can optionally be specified for either the Archive file
                          or the Index file. By default, the Index file is created in
                          the same HPSS directory as the Archive file.

                     Use of Absolute Pathnames
                          Although htar does not restrict the use of absolute
                          pathnames (pathnames that begin with a leading "/") when the
                          archive is created, it will remove the leading / when files
                          are extracted from the archive.  All extracted files use
                          pathnames that are relative to the current working
                          directory.

                     HTAR USAGE
                          Two groups of flags exist for the htar command; "action"
                          flags and "optional" flags. Action flags specify the
                          operation to be performed by the htar command, and are
                          specified by one of the following:

                          -c, -t, -x, -X

                          One action flag must be selected in order for the htar
                          command to perform any useful function.

                     File specification (Filespec)
                          A file specification has one of the following forms:

                                  WildcardPath
                                     or
                                  Pathname
                                     or
                                  Filename

                          WildcardPath is a path specification that includes standard
                          filename pattern-matching characters, as specified for the
                          shell that is being used to invoke htar.  The pattern-
                          matching characters are expanded by the shell and passed to
                          htar as command line arguments.

                     Action Flags
                          Action flags defined for htar are as follows:

                          -c   Creates a new HPSS-resident archive, and writes the
                               local files specified by one or more File parameters
                               into the archive. Warning: any pre-existing archive file
                               will be overwritten without prompting. This behavior
                               mimics that of the AIX tar utility.

                          -t   Lists the files in the order in which they appear in
                               the HPSS- resident archive.   Listable output is
                               written to standard output; all other output is written
                               to standard error.

                          -x   Extracts the files specified by one or more File
                               parameters from the HPSS-resident archive. If the File
                               parameter refers to a directory, the htar command
                               recursively extracts that directory and all of its
                               subdirectories from the archive.

                               If the File parameter is not specified, htar extracts
                               all of the files from the archive. If an archive
                               contains  multiple copies of the same file, the last
                               copy extracted overwrites  all previously extracted
                               copies. If the file being extracted does not already
                               exist on the system, it is created. If you have the
                               proper permissions, then htar command restores all
                               files and directories with the same owner and group IDs
                               as they have on the HPSS tar file. If you  do not have
                               the proper permissions, then files and directories are
                               restored with your owner and group IDs.

                          -X   builds a new index file by reading the entire tar file.
                               This operation is used either to reconstruct an index
                               for tar files whose Index File is unavailable (e.g.,
                               accidentally deleted), or for tar files that were not
                               originally created by htar.

                     Options
                          -?   Displays htar's verbose help

                          -B   Displays block numbers as part of the listing (-t
                               option). This is normally used only for debugging.

                          -d debuglevel
                               Sets debug level (0 - N) for htar. 0 disables debug, 1
                               - n enable progressively higher levels of debug output.
                               5 is the highest level; anything > 5 is silently mapped
                               to 5.  0 is the default debug level.

                          -E   If present, specifies that a local file should be used
                               for the file specified by the "-f Archive" option.  If
                               not specified, then the archive file will reside in
                               HPSS.

                          -f Archive
                               Uses Archive as the name of archive to be read or
                               written. Note: This is a required parameter for htar,
                               unlike the standard tar utility, which uses a built-in
                               default name.

                               If the Archive variable specified is - (minus sign),
                               the tar command writes to standard output or reads from
                               standard input. If you write to standard output, the -I
                               option is mandatory, in order to specify an Index File,
                               which is copied to HPSS if the Archive file is
                               successfully written to standard output. [Note: this
                               behavior is deferred - reading from or writing to pipes
                               is not supported in the initial version of htar].

                          -h   Forces the htar command to follow symbolic links as if
                               they were normal files or directories. Normally, the
                               tar command does not follow symbolic links.

                          -I index_name
                               Specifies the index file name or suffix.  If the first
                               character of the index_name is a period, then
                               index_name is appended to the Archive name, e.g. "-f
                               the_htar -I .xndx" would create an index file called
                               "the_htar.xndx".  If the first character is not a
                               period, then index_name is treated as a relative
                               pathname for the index file (relative to the Archive
                               file directory) if the pathname does not start with
                               "/", or an absolute pathname otherwise.

                               The default directory for the Index file is the same as
                               for the Archive file.  If a relative Index file
                               pathname is specified, then it is appended to the
                               directory path for the Archive file.  For example, if
                               the Archive file resides in HPSS in the directory
                               "projects/prj/files.tar", then an Index file
                               specification of "-I projects/prj/files.old.idx" would
                               fail, because htar would look for the file in the
                               directory "projects/prj/projects/prj".  The correct
                               specification in this case is "-I files.old.idx".

                          -L InputList
                               Writes the files and directories listed in the
                               "InputList" file to the archive. Directories named in
                               the InputList file are not treated recursively. For
                               directory names contained in the InputList file, the
                               tar command writes only the directory entry to the
                               archive, not the files and subdirectories rooted in the
                               directory.  Note that "home directory" notation ("~")
                               is not expanded for pathnames contained in the
                               InputList file, nor are wildcard characters, such as
                               "*" and "?".

                          -m   Uses the time of extraction as the modification time.
                               The default is to preserve the modification time of the
                               files. Note that the modification time of directories
                               is not guaranteed to be preserved, since the operating
                               system may change the timestamp as the directory
                               contents are changed by extracting other files and/or
                               directories.  htar will explicitly set the timestamp on
                               directories that it extracts from the Archive, but not
                               on intermediate directories that are created during the
                               process of extracting files.

                          -o   Provides backwards compatibility with older versions
                               (non-AIX) of the tar command. When this flag is used
                               for reading, it causes the extracted file to take on
                               the User and Group ID (UID and GID) of the user running
                               the program, rather than those on the archive.  This is
                               the default behavior for the ordinary user. If htar is
                               being run as root, use of this option causes files to
                               be owned by root rather than the original user.

                          -p   Says to restore fields to their original modes,
                               ignoring the present umask. The setuid, setgid, and
                               sticky bit permissions are also restored to the user
                               with root user authority.

                          -S bufsize
                               Specifies the buffer size to use when reading or
                               writing the HPSS tar file.  The buffer size can be
                               specified as a value, or as kilobytes by appending any
                               of  "k","K","kb", or "KB" to the value.  It can also be
                               specified as megabytes by appending any of  "m" or "M"
                               or "mb" or "MB" to the value, for example, 23mb.

                          -T max_threads
                               Specifies the maximum number of threads to use when
                               copying local member files to the Archive file.  The
                               default is defined when htar is built; the release
                               value is 20.  The maximum number of threads actually
                               used is dependent upon the local file sizes, and the
                               size of the I/O buffers.  A good approximation is
                               usually

                                  buffer size/average file size

                               If the -v or -V option is specified, then the maximum
                               number of local file threads  used while writing the
                               Archive file to HPSS is displayed when the transfer is
                               complete.

                          -V   "Slightly verbose" mode. If selected, file transfer
                               progress will be displayed in interactive mode. This
                               option should normally not be selected if verbose (-v)
                               mode is enabled, as the outputs for the two different
                               options are generated by separate threads, and may be
                               intermixed on the output.

                          -v   "Verbose" mode. For each file processed, displays a
                               one-character operation flag, and lists the name of
                               each file. The flag values displayed are:
                                   "a"  - file was added to the archive
                                   "x"  - file was extracted from the archive
                                   "i"  - index file entry was created (Build Index
                               operation)

                          -w   Displays the action to be taken, followed by the file
                               name, and then waits for user confirmation. If the
                               response is affirmative, the action is performed. If
                               the response is not affirmative, the file is ignored.

                          -Y auto | [Archive CosID][:IndexCosID]
                               Specifies the HPSS Class of Service ID to use when
                               creating a new Archive and/or Index file. If the
                               keyword auto is specified, then the HPSS hints
                               mechanism is used to select the archive COS, based upon
                               the file size.  If -Y cosID  is specified, then cosID
                               is the numeric COS ID to be used for the Archive File.

                               If -Y :IndexCosID is specified, then IndexCosID is the
                               numeric COS ID to be  used for the Index File.  If both
                               COS IDs are specified, the entire parameter must be
                               specified as a single string with no embedded spaces,
                               e.g. "-Y 40:30".

                     HTAR Memory Restrictions
                          When writing to an HPSS archive, the htar command uses a
                          temporary file (normally in /tmp) and maintains in memory a
                          table of files; you receive an error message if htar cannot
                          create the temporary file, or if there is not enough memory
                          available to hold the internal tables.

                     HTAR Environment
                          HTAR should be compiled and run within a non-DCE HPSS environment.

                     Miscellaneous Notes:
                          1. The maximum size of a single Member file within the
                          Archive is approximately 8 GB, due to restrictions in the
                          format of the tar header.  HTAR does not impose any
                          restriction on the total size of the Archive File when it is
                          written to HPSS; however, space quotas or other system
                          restrictions may limit the size of the Archive File when it
                          is written to a local file (-E option).

                          2.  HTAR will optionally write to a local file; however, it
                          will not write to any file type except "regular files".  In
                          particular, it is not suitable for writing to magnetic tape.
                          To write to a magnetic tape device, use the "tar" or "cpio"
                          utility.

                     Exit Status
                          This command returns the following exit values:

                          0       Successful completion.

                          >0      An error occurred.

                     Examples
                          1.   To write the file1 and file2 files to a new archive
                               called "files.tar" in the current HPSS home directory,
                               enter:

                                      htar -cf files.tar file1 file2

                          2.   To extract all files from the project1/src directory in
                               the Archive file called proj1.tar, and use the time of
                               extraction as the modification time,  enter:

                                     htar -xm -f proj1.tar project1/src

                          3.   To display the names of the files in the out.tar
                               archive file within the HPSS home directory, enter:

                                     htar -tvf out.tar

                     Related Information
                          For file archivers: the cat command, dd command, pax
                          command.  For HPSS file transfer programs: pftp, nft, hsi

                          File Systems Overview for System Management in AIX Version 4
                          System Management Guide: Operating System and Devices
                          explains file system types, management, structure, and
                          maintenance.

                          Directory Overview in AIX Version 4 Files Reference explains
                          working with directories and path names.

                          Files Overview in AIX Version 4 System User's Guide:
                          Operating System and Devices provides information on working
                          with files.

                          HPSS web site at http://www.sdsc.edu/hpss

                     Bugs and Limitations:
                          - There is no way to specify relative Index file pathnames
                          that are not rooted in the Archive file directory without
                          specifying an absolute path.

                          - The initial implementation of HTAR does not provide the
                          ability to append, update or remove files.  These features,
                          and others, are planned enhancements for future versions.

Home directories and other areas backups

Home directories

If you accidentally erase a file in your home directory at the RCF, you can restore it yourself from two weeks' worth of backups kept as snapshots. The way it works is that, as days pass, live backups are made on the file system itself, hence preserving your files in place.

For example, suppose your username is 123, your home directory is /star/u/123, and you erased the file /star/u/123/somedir/importantfile.txt and now realize that was a mistake. Don't panic. This is not the end of the world, as snapshot backups exist.

Simply look under /star/u/.snapshot

The directory names are ordered by the date and time of backup. Pick a date when the file existed; under it is a copy of your home directory from that day. From there you can restore the file, i.e.,

% cp /star/u/.snapshot/20yy-mm-dd_hhxx-mmxx.Daily_Backups_STAR-FS05/123/somedir/importantfile.txt 
/star/u/123/somedir/importantfile.txt

See also starsofi #7363.

AFS areas

Each doc_protected/ AFS area also has a .backup volume which keeps recently deleted files from that directory until a real AFS-based backup is made (then the content is deleted and you will need to ask the RCF to restore your files). Finding it is tricky though, because there is one such directory per volume. The best approach is to search backward (up the directory tree) for that directory. For example, let's suppose you are working in /afs/rhic.bnl.gov/star/doc_protected/www/bulkcorr/. If you search backward for a .backup directory, you will find one at /afs/rhic.bnl.gov/star/doc_protected/www/bulkcorr/../.backup/ and this is where the files for this AFS volume go upon deletion.

Other areas

Other areas are typically not backed up.

 

Hypernews

Most Hypernews forums will have to be retired; please consult the list of mailing lists at this link to be sure you need HN at all.
While our Web server is down, many computing-related discussions are happening on Mattermost Chat (later, these will be Mail based by popular demand). Please log in there using the 'BNL login' option (providing a facility-wide unified login) and use your RACF/SDCC Kerberos credentials to get in. If you are a STAR user, you will automatically be moved to the "STAR Team".

Please read the Hypernews in STAR section before registering a new account (you may otherwise miss a few STAR specificities and constraints).

General Information

HyperNews is a cross between the hypermedia of the WWW and Usenet News. Readers can browse postings written by other people and reply to those messages. A forum (also called a base article) holds a tree of these messages, displayed as an indented outline that shows how the messages are related (i.e. all replies to a message are listed under it and indented).

Users can become members of HyperNews or subscribe to a forum in order to get Email whenever a message is posted, so they don't have to check if anything new has been added. A recipient can then send a reply email back to HyperNews, rather than finding a browser to write a reply, and HyperNews then places the message in the appropriate forum.

Hypernews in STAR

In STAR, there are a few specificities with Hypernews as listed below. 

  • Your Hypernews account should match your BNL/RCF account by name. This account must be part of the STAR group. For example, if you have a RCF STAR account named 'abc', you should create an Hypernews account named 'abc'.  Any other account will be removed automatically. Note that if you have any other RCF unix account but not a STAR account, the result will be the same (you will not be able to register to STAR's Hypernews). This is done so automation of account approval can be achieved while complying with the DOE requirement mentioned in Getting a computer account in STAR. If you are a STAR user in good standing, the automation especially allows for immediate use of your account without further approval process.
  • You should NOT use the same password for your Hypernews account as for your RCF account. Hypernews has a weak authentication method and, while physical access to the machine is needed to crack it, keeping this password different from your interactive login password(s) is important. In general, Web-based passwords should not be the same as interactive account passwords.
  • Hypernews does not accept Email attachments. This includes Emails containing a mix of text and HTML - they will be rejected by the system. Please be aware that whenever you send "formatted" Emails (bold characters, font changes etc...), your Email client does nothing else than send the content in two parts: one part plain text, the second an attached HTML version. Hence, Hypernews will NOT process formatting (but will accept your Email anyhow).
  • Hypernews postings DO NOT need to be done from the Web interface (this is true for ANY Hypernews system); you can send an Email directly to the list address. However, postings must have a subject: subject-less postings will be rejected. Also, we have a spam filter in place and it is noteworthy that, to date, we have had no accidental rejections of valid Emails.
  • As per 2012/06, all STAR Hypernews fora were made protected. In other words, and in addition to your Hypernews personal account, you MUST use the 'protected' password to access the Web interface.

Startup links

Here are a few startup links and tips, including where to start for a new Hypernews account.

  • If you DO NOT have a STAR account, consult Getting a computer account in STAR first, BUT you will STILL need the additional information below:
    • You will need the famous “protected” area password. If you do not understand what this means, you are probably not a STAR collaborator ... Otherwise, you can get this information from your PAC, PWGC, OPS manager, council rep, etc ...
    • For your RCF user account name, you will need to choose a User ID other than “protected”. Hopefully, this will be the case.
  • After you get a RCF Unix account
    • create an Hypernews account  (as indicated in the Hypernews in STAR section) starting from here.
    • IMPORTANT NOTE: please wait at least one hour after you get confirmation from the RCF before creating your Hypernews account, as there is a delay in propagating account information to the Hypernews system.
  • To connect to the Web based Hypernews interface, please login to the system first. This will allow for your session to be authenticated properly and postings to be identified as you. As per 2006, any anonymous posting will be rejected.
  • You can then proceed to either
    • The forum list, with all Hypernews fora displayed in descending order of 'last posted'
    • You can Edit your membership to change your personal information (this is a typical link which DOES require you to login first)
    • Use the Hypernews Search engine to search for / locate a particular message. This is slow and painful (we have too many messages) but it is the only way to search the huge 10 years' worth of Email archive.
  • Note again that as soon as you have located your forum of interest and its address, you can send Emails directly to that list forum and/or answer a previous post by using your mail client's 'Reply'.

 

Tips related to message delivery to Hypernews

If you have problems sending EMail to Hypernews, please understand and verify the following before asking for help:

  • Hypernews will silently discard Emails detected as spam. This is good news for our Hypernews subscribers, but be aware that spam filtering is a tricky business and some legitimate Email may be rejected unintentionally.
    • The first and foremost reason for rejection is the use of internet service provider (ISP) Email servers to send Emails to Hypernews. Several ISPs are blacklisted as they do not protect their service against anonymous Emails; using such an ISP will have the unfortunate consequence of getting your Email rejected. Be sure to use your lab or university as provider, or a trusted ISP.
    • The other reason is font encoding - DO NOT use special font encodings when sending Email - Korean (EUC-KR) or Chinese (GB2312, GB18030, Big5, ...) especially get a high mark from the spam filter and get you close to the rejection rating threshold. A few unfortunate words here and there and ...
  • Hypernews in STAR DOES NOT accept attachments: your Email will be silently rejected if any appears.
    • Send instead a note of where your document resides for consulting.
    • DO NOT send messages as HTML - an HTML Email actually sends plain text and HTML as an attachment ... and your post will be rejected. Typically, your mail client gives you the possibility to send plain text based on domain matching. Hypernews is covered by the www.star.bnl.gov domain.
      • Mac users using the Mac OS X Mail client, please consult "How to Send a Message in Plain Text" (also explaining why MIME may be dangerous). Alternatively, you may want to use Mozilla/Thunderbird as a client.
      • To set this up in Thunderbird, proceed as follows:
        Select the _Tools_ menu
           Select _Options_; a window opens. Select the tab [Composition] -then-> [General]
            Click on <send option> in the newly opened panel, then select the [Plain text domain] tab
              Click [add] and add star.bnl.gov
              Click OK
    • Sending Email from BNL's Exchange server will result in MIME attachments and hence cause a rejection of your posting. Two possible solutions offer themselves:
      • Use the Hypernews Web interface to send messages (after making sure you are logged in, click on the bottom reference to go to the message and [Add message])
      • Use a tool like Thunderbird with the SMTP outgoing server set to use the RCF server. Instructions are available here.
         
  • The following restrictions apply:
    • Use only one Hypernews forum in the To: field (and do not use CC: to another HN forum) - HN will not know where to post if you use multiple fora and the result will be unpredictable (depending on the syntax used in the To: field and on the mailer, the post will end up in one of the specified fora or be discarded entirely)
    • You will NOT be able to forward a post from one forum to another - HN will know and send the message again to the original forum. This is because the information HN keeps for archiving your posts and threading them is part of the message header and not based on where you send the message (header includes Newsgroups, X-HN-Forum, X-HN-Re and X-HN-Loop). Your options could be to strip the header or cut-and-paste the original message into a new one.
    • Multiple recipients in the Email "To:" field will prevent your Email from being posted. Strictly speaking, this is a shortcoming in the parsing of the header as defined in RFC2822 (the RFC allows for a list; the STAR HN implementation disallows mass posting).
  • One frequent source of issues, unrelated to any of the above (and very STAR infrastructure specific):
    • Always use address book entries using the alias of the form list [at] www.star.bnl.gov and NOT the node specific address (connery, orion, etc...). Especially, older users should remove from their address book any address not specifying the alias.
  • If you need to test sending Email to the system, please do not spam an existing active list - instead, use our test fora: startest or testp.
    • Remember that Hypernews is a centralized system: if your Email passes and is delivered to the test forum, it should be delivered to any other list
    • Both fora are nearly identical - testp is nowadays used for testing new code and features, so for a casual Email check you may prefer startest.
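
As an illustration, a minimal plain-text delivery test from a node with a standard command line mail client might look like the line below (the subject is mandatory; the address follows the list [at] www.star.bnl.gov convention described above):

% echo "delivery test, please ignore" | mail -s "test posting" startest@www.star.bnl.gov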

 

Installing the STAR software stack

The pages here are under construction. They aim to help remote sites install the STAR software stack.

You should first read Setting up your computing environment before going through the documents provided herein, as we refer to this page very often. Please pay particular attention to the list of environment variables defined by the group login script and their meanings in STAR. Be aware of the assumptions as to the software locations (all will be referred to by the environment variables listed there) as well as the need to use a custom (provided) set of .cshrc and .login files (you may have to modify them if you install the STAR software locally). Setting up your computing environment is however NOT written as a software installation instruction and should not be read as such.

Please follow the instructions in the order they appear below

  1. Check first the availability of the CERN libraries as this may be a show stopper. If there are no CERN libraries for your OS version and/or the available libraries are not validated for your OS, you will NOT be able to get the STAR software working on your site.
  2. Your FIRST STEP is to install the Group login scripts. Although not all variables will be defined, the login should be successful after this step.
  3. The next step is then to install Additional software components
    However, your OS should also have a few base system wide RPMs installed.
    Lists are available on the OS Upgrade page, as well as specific issues with some OSes. Read it carefully.
  4. Then, install the ROOT library: Building ROOT in STAR
  5. Finally, you are ready for the STAR library installation: STAR codes

Sparse notes are also in Post installation and special instructions for administrators at OS Upgrade.

 

Group login scripts

Installing

The STAR general group login scripts are necessary to define the STAR environment. They reside in $CVSROOT within the group/ sub-tree. Template files supporting users' .cshrc and .login also exist within this tree, in the sub-directory group/templates. To install properly on a local cluster, there are two possibilities:

  • if you have access to AFS, you should simply
        % mkdir  /usr/local/star # this is only an example
        % cd /usr/local/star     # this directory needs to be readable by a STAR group
        % cvs checkout group     # this assumes CVSROOT is defined 
    This will bring a copy of all you need locally in /usr/local/star/group
  • If you do not have access to AFS from your remote site, get a copy of the entire BNL $GROUP_DIR tree and unpack it in a common place (like /usr/local/star above). A copy resides in the AFS tree mentioned in Additional software components.

Note that wherever you install the login scripts, they need to be readable by all STAR members (you can do this by granting read access to a Unix group all STAR users will belong to, or by making the scripts readable by all users).

Also, as soon as you get a local copy of the group/templates/ files, EDIT BOTH the cshrc and login files and change, at the top, the definition of GROUP_DIR so it matches your site's GROUP script location (/usr/local/star/group in our example).

To enable a user to use the STAR environment, simply copy the template cshrc and login scripts as indicated in Setting up your computing environment.
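
For a (hypothetical) user account xyz, and assuming the templates keep their cshrc / login names, this boils down to something like:

% cp /usr/local/star/group/templates/cshrc ~xyz/.cshrc
% cp /usr/local/star/group/templates/login ~xyz/.login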

Special scripts

Part of our login is optional and the scripts mentioned here are NOT part of our CVS repository but, if they exist, will be executed.

  • site_pre_setup.csh - this script, if it exists in $GROUP_DIR, will be executed before the STAR standard login. Its purpose is to define variables indicating non-standard locations for your packages. For those variables which may be redefined, please consult Setting up your computing environment for all the variables (in blue) which may be redefined prior to login.
  • site_post_setup.csh - this script, if it exists in $GROUP_DIR, will be executed after the STAR standard login. Its purpose is to define local variables not related to STAR's environment. Such variables may be for example the definition of a proxy (http_proxy, ftp_proxy, https_proxy), an NNTP server or a default WWW home directory (WWW_HOME). Do not try to redefine STAR login's defined variables using this script (see the sketch below).
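
As an illustration, a minimal site_post_setup.csh could look like the following (all values are site-specific examples, not STAR defaults):

# $GROUP_DIR/site_post_setup.csh -- executed after the STAR standard login
# define site-local variables only; do NOT redefine STAR login variables
setenv http_proxy  "http://proxy.example.edu:3128/"
setenv https_proxy "http://proxy.example.edu:3128/"
setenv WWW_HOME    "http://www.star.bnl.gov/"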

Testing this phase

Testing this phase is as simple as creating a test account and verifying that the login succeeds. Whenever you start with a blank site, the login MUST succeed and lead to a viable environment ($PATH especially should be minimally correct). At this stage, a typical login would look something like

Setting up WWW_HOME  = http://www.star.bnl.gov/

         ----- STAR Group Login from /usr/local/star/group/ -----

Setting up STAR_ROOT = /usr/local/star
Setting up STAR_PATH = /usr/local/star/packages
Setting up OPTSTAR   = /usr/local/star/opt/star
WARNING : XOPTSTAR points to /dev/null (no AFS area for it)
Setting up STAF      = /usr/local/star/packages/StAF/pro
Setting up STAF_LIB  = /usr/local/star/packages/StAF/pro/.cos46_gcc346/lib
Setting up STAF_BIN  = /usr/local/star/packages/StAF/pro/.cos46_gcc346/bin
Setting up STAR      = /usr/local/star/packages/pro
Setting up STAR_LIB  = /usr/local/star/packages/pro/.cos46_gcc346/lib
Setting up STAR_BIN  = /usr/local/star/packages/pro/.cos46_gcc346/bin
Setting up STAR_PAMS = /usr/local/star/packages/pro/pams
Setting up STAR_DATA = /usr/local/star/data
Setting up CVSROOT   = /usr/local/star/packages/repository
Setting up ROOT_LEVEL= 5.12.00
Setting up SCRATCH   = /tmp/jeromel
CERNLIB version pro has been initiated with CERN_ROOT=/cernlib/pro
STAR setup on star.phys.pusan.ac.kr by Tue Mar 12 06:43:47 KST 2002  has been completed
LD_LIBRARY_PATH = .cos46_gcc346/lib:/usr/local/star/ROOT/5.12.00/.cos46_gcc346/rootdeb/lib:ROOT:/usr/lib/qt-3.3/lib

 

Suggestions

STAR group

You may want to create a rhstar group on your local cluster matching GID 31012. This will make AFS integration easier as the group names in AFS will then translate to rhstar (it will however not grant you any special access, obviously, since AFS is Kerberos authentication based and not Unix UID based).
To do this, and after checking that /etc/group does not contain any mapping for gid 31012, you could (Linux):

% groupadd -g 31012 rhstar
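
Before running groupadd, a quick way to perform that /etc/group check (the command should print nothing if the GID is free):

% grep ':31012:' /etc/group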

Test account

It may be practical, for testing the STAR environment, to create a test account on your local cluster. The starlib account is the account used in STAR for software installation. You may want to create such an account as follows (Linux):

% useradd -d /home/starlib -g rhstar -s /bin/tcsh  starlib

 This will allow for easier integration. Any account name will do (but testing is important and we will have a section on this later).

 

 

Additional software components

Scope & audience

As described in Setting up your computing environment, OPTSTAR is the environment variable pointing to an area which supplements the operating system installation of libraries and programs. This area is fundamental to the STAR software installation as it contains needed libraries, approved software component versions, shared files, configuration and so on.

The following path should contain all software components as sources for you to install a fresh copy on your cluster:
    /afs/rhic.bnl.gov/star/common

Note that this path should allow anyuser read access, so there is no need for an AFS token. The notes below are sparse and ONLY indicate special instructions you need to follow, if any. In the absence of special instructions, the "standard" instructions are to be followed. None of the explanations below are aimed at regular users; they target system administrators or software infrastructure personnel.

System wide RPMs

Some RPMs from your OS distribution may be found at BNL under the directory /afs/rhic.bnl.gov/rcfsl/X.Y/*/ where X.Y are the major and minor versions of your Scientific Linux release respectively. You should have a look and install what applies. If you do not have AFS, you should log in to the RCF and transfer whatever is appropriate.

In other words, we may have re-packaged some packages and/or created additional ones for compatibility purposes. An example of this for SL5.3 is flex32libs-2.5.4a-41.fc6.i386.rpm, located in /afs/rhic.bnl.gov/rcfsl/5.3/rcf-custom/, which provides the 32 bits compatibility package for flex on a kernel with dual 32/64 bits support.
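
For example, picking that RPM up from AFS and installing it (as root) could look like:

% cd /afs/rhic.bnl.gov/rcfsl/5.3/rcf-custom
% rpm -ivh flex32libs-2.5.4a-41.fc6.i386.rpm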

STAR Specific

The directory tree /afs/rhic.bnl.gov/star/common contains packages installed on our farm in addition to the default distribution software packages coming with the operating system. At BNL, all packages referred to here are installed in the AFS tree

	/opt/star -> /afs/rhic.bnl.gov/@sys/opt/star/

Be aware of the intent explained in Setting up your computing environment as per the difference between $XOPTSTAR and OPTSTAR.

OPTSTAR will either

  • at BNL or at a remote site: be used to indicate and access the local software, BUT it may be supported through a soft-link to the same AFS area as shown above, where @sys expands to the operating system of interest (see Setting up your computing environment as well for a support matrix)
  • at a remote site: point to a LOCAL (that is, non-networked) installation of the software components. This space could be anywhere on your local cluster but obviously will have to be shared and visible from all nodes in your cluster.

XOPTSTAR

The use of $XOPTSTAR emerged in 2003 to provide better support for software installation at remote institutions. Many packages hard-code path information in their configuration (like the infamous .la files) and, when these were installed in $OPTSTAR, remote sites had problems loading libraries for path reasons. Hence, and unless specified otherwise below, $XOPTSTAR will be used preferably at BNL for installing the software, so that remote access to (or a copy of) the AFS repository will be maximally transparent.

In 2005, we added an additional tree level reflecting the possibility of multiple compilers and a possible mismatch between fs sysname setups and operating system versions. Hence, you may see paths like OPTSTAR=/opt/star/sl44_gcc346, but this structure is a detail: if the additional layer does not exist for your site, later logins will nonetheless succeed. This additional level is defined by the STAR login environment variable $STAR_HOST_SYS. In the next section, we explain how to set this up from a "blank" site (i.e. a site without the STAR environment and software installed).

On remote sites where you decide to install the software components locally, you should use $OPTSTAR in the configure or make statements.

Basic starting point

From a blank node on a remote site, be sure to have $OPTSTAR defined. You can do this by hand, for example like this

% setenv OPTSTAR /usr/local

or

% mkdir -p /opt/star
% setenv OPTSTAR /opt/star

These are two possibilities; the second, being the default location of the software layer, will be automatically recognized by the STAR group login scripts. From this point, a few pre-requisites are

  • you have to have a system with "a" compiler - we support gcc but also icc on Linux
  • you should have the STAR group login scripts at hand (it could be from AFS). The STAR login scripts will NOT redefine $OPTSTAR if already defined.

Execute the STAR login. This will define $STAR_HOST_SYS appropriately. Then

% cd $OPTSTAR
% mkdir $STAR_HOST_SYS
% stardev
 

the definition of $OPTSTAR will change to the version dependent structure, adding $STAR_HOST_SYS to the path definition (the simple presence of that layer makes the login script redefine it).
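
For instance, using the sl44_gcc346 example value of $STAR_HOST_SYS quoted earlier (yours will differ):

% echo $OPTSTAR
/opt/star
% mkdir $OPTSTAR/$STAR_HOST_SYS
% stardev
% echo $OPTSTAR
/opt/star/sl44_gcc346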

 

Changing platform or compiler

32 bits versus 64 bits

If you want to support native 64 bits on a 64 bits kernel, do not forget to pass/force -m64 -fPIC to the compiler and -m64 to the linker. If however you want to build cross-platform (64 bit/32 bit kernel compatible) executables and libraries, you will on the contrary need to force -m32 (and use -fPIC). Even if you build the packages from a 32 bit kernel node, be aware that many applications and packages save a setup including compilation flags (which will have to use -m32 if you want a cross-platform package). There are many places below where I do not specify this.

Often, using CFLAGS="-m32 -fPIC" CXXFLAGS="-m32 -fPIC" LDFLAGS="-m32" will do the trick to force 32 bits mode (similarly for -m64). You need to use such options for libraries and packages even if you assemble them on a 32 bits kernel node, as otherwise a package may later build extensions which are not compatible with cross-platform support.

Other GCC versions

As in the 32 bits versus 64 bits case, adding something like CC=`which gcc` and CXX=`which g++` to either the configure or make command will often do the trick. If not, you will need to modify the Makefile accordingly. You may also define the environment variable CC for consistency.

Summary

If you have a 64 bits kernel and intend to compile both 32 bits and 64 bits, you should define the environment variables as shown below. The variables will make configure (and some Makefiles) pick the proper flags and make your life much easier - follow the specific instructions for the packages noted in those instructions for specific tricks. Note as well that even if you have a 32 bits kernel only, you are encouraged to use the -m32 compilation option (this will make further integration with dual 32/64 bits support smoother, as some of the package configurations include compiler paths and options).

32 bits

% setenv CFLAGS   "-m32 -fPIC"
% setenv CXXFLAGS "-m32 -fPIC"
% setenv FFLAGS   "-m32 -fPIC"
% setenv FCFLAGS  "-m32 -fPIC"
% setenv LDFLAGS  "-m32"
% setenv CC  `which gcc`     # only if you use a different compiler than the system default
% setenv CXX `which g++`     # only if you use a different compiler than the system default

and/or pass to Makefile and/or configure the arguments CFLAGS="-m32 -fPIC" CXXFLAGS="-m32 -fPIC" LDFLAGS="-m32" CC=`which gcc` CXX=`which g++` (it will not hurt to use them in addition to the environment variables)

64 bits

% setenv CFLAGS   "-m64 -fPIC"
% setenv CXXFLAGS "-m64 -fPIC"
% setenv FFLAGS   "-m64 -fPIC"
% setenv FCFLAGS  "-m64 -fPIC"
% setenv LDFLAGS  "-m64"
% setenv CC  `which gcc`     # only if you use a different compiler than the system default
% setenv CXX `which g++`     # only if you use a different compiler than the system default

and/or pass to Makefile and/or configure the arguments CFLAGS="-m64 -fPIC" CXXFLAGS="-m64 -fPIC" LDFLAGS="-m64" CC=`which gcc` CXX=`which g++` (it will not hurt to use them in addition to the environment variables)

 

Software repository directory - starting a build

In the instructions below, greyed instructions are historical instructions and/or package versions which no longer reflect the currently official STAR supported platform. However, if you try to install the STAR software under an older OS, refer carefully to those instructions and package versions.

perl

The STAR environment and login scripts heavily rely on perl for string manipulation, compilation management and a bunch of utility scripts. Assembling it from the start is essential. You may rely on your system-wide installed perl version BUT, if so, note that the minimum version indicated below IS required.

In our software repository path, you will find a perl/ sub-directory containing all packages and modules.

The packages and minimal versions are below
		perl-5.6.1.tar.gz   -- Moved to 5.8.0 starting from RH8
		perl-5.8.0.tar.gz   -- Solaris and True64, upgraded 2003
		perl-5.8.4.tar.gz   -- 2004, Scientific Linux
		perl-5.8.9.tar.gz   -- SL5+
		perl-5.10.1.tar.gz  -- SL6+

When building perl

  • Use all default arguments BUT when you are asked for compilation / linker args, add -m32 or -m64 depending on the platform support you are building. Those questions are (example for the 32 bits version):
    • Any additional cc flags? []  -fPIC -m32
    • Any additional ld flags (NOT including libraries)? [] -m32
    • Any special flags to pass to cc -c to compile shared library modules? []  -fPIC -m32
    • Any special flags to pass to cc to create a dynamically loaded library? [-shared -O2] -shared -O2 -m32
    • If you build 32 bits support on a 64 bit node, you may also answer "no" below, but the default answer SHOULD appear as no if you properly passed -m32 as indicated above.
      Try to use maximal 64-bit support, if available? [y] n  <--- you probably did not pass -m32
      Try to use maximal 64-bit support, if available? [n]    <--- just press return, all is fine
  • when asked for the default prefix for the installation, give the value of $OPTSTAR as answer (or a base path starting with the value of $OPTSTAR wherever appropriate). Questions include
    • Installation prefix to use? (~name ok) [/usr/local]
  • If the build warns you at first that the directory does not exist, proceed - to questions like "Use that name anyway?" answer Yes
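
If you would rather not answer the prompts interactively, perl's Configure can be fed the same answers on the command line; a sketch for the 32 bits case under the assumptions above (this is not the procedure used historically, so double check the resulting configuration):

% sh Configure -des -Dprefix=$OPTSTAR -Accflags='-fPIC -m32' -Aldflags='-m32'
% make && make test && make install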


After installing perl itself, you will need to install the STAR required module.

The modules are installed using a bundle script (install_perlmods, located in /afs/rhic.bnl.gov/star/common/bin/). It needs some work to get it generalized, but the idea is that it contains the dependencies and installation order. To install, you can do the following (we assume install_perlmods is in the path for simplicity and clarity):
 

  1. First choose a work place where you will unpack the needed modules. Let's assume this is /home/xxx/myworkplace
  2. Check things out by running install_perlmods with argument 0 as follows
    % install_perlmods 0 /home/xxx/myworkplace
    It will tell you the list of modules you need to unpack. If they are already unpacked and /home/xxx/myworkplace contains all needed package directories, skip to step 4.
  3. You can unpack manually OR use the command
    % install_perlmods 1 /home/xxx/myworkplace
    to do this automatically. Note that you could have skipped step 2 and done that from the start (if confident enough).
  4. The steps above should have created a file named /home/xxx/myworkplace/perlm-install-XXX.csh where XXX is the OS you are working on. Note that the same install directory may therefore be used for ALL platforms on your cluster. However, versioning is not (yet) supported.
    Execute this script after checking its content. It will run (hopefully smoothly) the perl Makefile.PL and make / make install commands. Note that you could have also used
    % install_perlmods 2 /home/xxx/myworkplace
    and skip steps 2 & 3. In this mode, it unpacks and proceeds with compilation. Do this only if you have absolute blind faith in the process (I don't, and I wrote those scripts ;-) ).

Very old note [this used to happen with older perl versions]: if, when typing make, you get the following message

make: *** No rule to make target `<command-line>', needed by `miniperlmain.o'.  Stop.

then you have bumped into an old gcc/perl build issue (tending to come back periodically depending on the message formats of gcc) and can resolve this by using any available perl version and running the commands:

% make depend
% perl -i~ -nle 'print unless /<(built-in|command.line)>/' makefile x2p/makefile

This will suppress from the makefile the offending lines and will get you back on your feet.
 

After you install perl, and if your setup is local (in /opt/star), you may want to do the following

% cd /opt/star
% ln -s $STAR_HOST_SYS/* .
%
% # prepare for later directories packages will create
% ln -s $STAR_HOST_SYS/share .
% ln -s $STAR_HOST_SYS/include .
% ln -s $STAR_HOST_SYS/info .
% ln -s $STAR_HOST_SYS/etc .
% ln -s $STAR_HOST_SYS/libexec .
% ln -s $STAR_HOST_SYS/qt .
% ln -s $STAR_HOST_SYS/jed .
%

While some of those directories will not yet exist, this will create a base set of directories (without the additional compiler / OS version) supporting future upgrades via the "default" set of directories. In other words, any future upgrade of compilers leading to a different $STAR_HOST_SYS will still lead to a functional environment as far as compatibility exists. Whenever compatibility is broken, you will of course need to re-create a new $STAR_HOST_SYS tree.
At this stage, you should install as many of the libraries in $OPTSTAR as possible and re-address the perl modules later, as some depend on installed libraries necessary for the STAR environment to be functional.

 

Others/ [PLEASE READ, SOME PACKAGE MAY HAVE EXCEPTION NOTES]

        Needed on other platforms (but present on Linux). Unless specified
        otherwise, the packages were built with the default values.
                make-3.80
                tar-1.13
                flex-2.5.4   
                xpm-3.4k
                libpng-1.0.9

                mysql-3.23.43 on Solaris
                mysql-3.23.55 starting from True64 days (should be tried as
                              an upgraded version of the API)
                              BEWARE mysql-4.0.17 was tried and is flawed.
                              We also use native distribution MySQL
                mysql-4.1.22  *** IMPORTANT *** Actually this was an upgrade 
                              on SL4.4 (not necessary but the default 4.1.20 
                              has some bugs) 

                <gcc-2.95.2>
                dejagnu-1.4.1	 
                gdb-5.2
                texinfo-4.3
                emacs-20.7 

                findutils-4.1
                fileutils-4.1
                cvs-1.11       -- perl is needed before hand as it folds
                               it in generated scripts
                grep-2.5.1a    Started on Solaris 5.9 in 2005 as ROOT would complain 
                               about too old version of egrep 


This may be needed if not installed on your system. It is part of a needed
autoconf/automake deployment.
                m4-1.4.1		
                autoconf-2.53  
                automake-1.6.3
		
Linux only
                valgrind-2.2.0
valgrind-3.2.3 (was for SL 4.4 until 2009)
                valgrind-3.4.1 SL4.4

General/

The installed packages/sources for diverse software layers. The order of installation was

                ImageMagick-5.4.3-9   On RedHat 8+, not needed for SL/RHE but see below
                ImageMagick-6.5.3-10  Used on SL5 as default revision is "old" (6.2.8) - TBC
                slang-1.4.5           On RedHat 8+, ATTENTION: not needed for SL/RHE, install RPM
                lynx2-8-2
                lynx2-8-5             Starting from SL/RHE
                xv-3.10a-STAR         Note the post-fix STAR (includes patch and 32/64 bits support Makefile)
                nedit-5.2-src         ATTENTION: No need on SL/RHE (installed by default)
            [+] pythia5
                pythia6
                text2c
                icalc
                dejagnu-1.4.1         Optional / Dropped from SL3.0.5
                gdb-5.1.91            For RH lower versions - Not RedHat 8+
                gdb-6.2 (patched)     Done for SL3 only (do not install on others)
                gsl-1.13              Started from SL5 and back ported to SL4
                gsl-1.16              Update for SL6
                chtext
                jed-0.99-16
                jed-0.99.18           Used from SL5+
                jed-0.99.19           Used in SL6/gcc 4.8.2 (no change in instructions)
                qt-x11-free-3.1.0
                qt-x11-free-3.3.1     Starting with SL/RHE
            [+] qt-x11-opensource-4.4.3             Deployed from i386_sl4 and i386_sl305 (after dropping SL3.0.2), SL5
                qt-everywhere-opensource-src-4.8.5  Deployed from SL6 onward
                qt-everywhere-opensource-src-4.8.7  Deployed on SL6/gcc 4.8.2 (latest 4.8.x release)
                doxygen-1.3.5
                doxygen-1.3.7         Starting with SL/RHE
                doxygen-1.5.9         Use this for SL5+ - this package has a dependence on qt. Installed native on SL6
                Python 2.7.1          Started from SL4.4 and onward, provides pyROOT support
                Python 2.7.5          Started from SL6 onward, provides pyROOT support
                pyparsing V1.5.5      SL5 Note: "python setup.py install" to install
                pyparsing V1.5.7      SL6 Note: "python setup.py install" to install
                setuptools 0.6c11     SL5 Note: sh the .egg file to install
                setuptools 0.9.8      SL6 Note: "python setup.py install" to install
                MySQL-python-1.2.3    MySQL 14.x client libs compatible
                virtualenv 1.9        SL6 Note: "python setup.py install" to install
                Cython-0.24           SL6 Note: "python setup.py build ; python setup.py install"
                pyflakes / pygments   {TODO}
                libxml2               Was used only for RH8.0, installed as part of SL later
            [+] libtool-1.5.8         This was used for OS not having libtool. Use latest version.
                libtool-2.4           Needed under SL5 64 bits kernel (32 bits code will not assemble otherwise). This was re-packaged with a patch.
                Coin-3.1.1            Coin 3D and related packages
                Coin-3.1.3            ... was used for SL6/gcc 4.8.2 + patch (use the package named Coin-3.1.3-star)
                simage-1.7a
                SmallChange-1.0.0a
                SoQt-1.5.0a
                astyle_1.15.3         Started from SL3/RHE upon user request
                astyle_1.19           SL4.4 and above
                astyle_1.23           SL5 and above
                astyle_2.03           SL6 and above
                unixODBC-2.2.6        (depends on Qt) Was experimental, Linux only for now
                unixODBC-2.3.0        SL5+, needed if you intend to use Data Management tools
                MyODBC-3.51.06        Was experimental on Linux at first, ignore this version
                MyODBC-3.51.12        Version for SL4.4 (needed for mysql 4.1 and above)
                mysql-connector-odbc-3.51.12  <-- Experimental package change - new name starting from 51.12. BEWARE.
                mysql-connector-odbc-5.x      SL5+. As above, only if you intend to use Data Management tools
                boost                 Experimental and introduced in 2010 but not used then
                boost_1_54_0          SL6+, needed
                log4cxx 0.9.7         This should be general, initial version
                log4cxx 0.10.0        Started at SL5 - this is now from Apache
                apr-1.3.5             and depends on the Apache Portable Runtime (apr) package
                apr-util-1.3.7        which needs to be installed BEFORE log4cxx and in the order
                expat-1.95.7          showed
                valkyrie-1.4.0        Added to SL3 as a GUI companion to valgrind (requires Qt3). Not installed in SL5 for now (Qt4 only) so ignore
                fastjet-2.4.4         Started from STAR Library version SL11e, essentially for analysis
                fastjet-3.0.6         SL6 onward
                unuran-1.8.1          Requested and installed from SL6+
                LHAPDF-6.1.6          Added after SL6.4, gcc 4.8.2

In case you have problems
                emacs-24.3            Installed under SL6 as the default version had font troubles
                vim-7.4               Update made under SL6.4, please prefer RPM if possible

Not necessary (installed anyway)
                chksum
                pine4.64              Added at SL4.4 as removed from base install at BNL

Retired
                xemacs-21.5.15        << Linux only -- This was temporary and removed

Other directories are
                WorkStation/          contains packages such as OpenAFS or OpenOffice (Linux)
                WebServer/            mostly perl modules needed for our WebServer
                Linux/                Linux specific utilities (do not fit in General/) or packages tested under Linux only

Some notes about packages: most of them are pretty straightforward to install (./configure ; make ; make install, changing the base path /usr/local to $OPTSTAR). With configure, this is done using either

        ./configure --prefix=$OPTSTAR
        ./configure --prefix=$XOPTSTAR

Specific notes follow and include packages which are NOT yet official but tried out.

- Beware that the Msql-Mysql-modules perl module requires a hack I have not quite understood yet how to make automatic (the advertised --blabla options do not seem to work) on platforms supporting the client in OPTSTAR:

        INC          = ... -I$(XOPTSTAR)/include/mysql ...
        H_FILES      = $(XOPTSTAR)/include/mysql/mysql.h
        OTHERLDFLAGS = -L$(XOPTSTAR)/lib/mysql
        LDLOADLIBS   = -lmysqlclient -lm -lz

- GD-2+: do NOT select support for animated GIF. This will fail on standard SL distributions (the default gd lib has no support for that).


ImageMagick

Really easy to install (the usual configure / make / make install). However, the PerlMagick part should be installed separately (the usual perl module way, i.e. cd to the directory, perl Makefile.PL, then make / make install). I used the distribution's module; therefore, that perl module is not in perl/Installed/ like the other perl modules. The copy of PerlMagick to /bin/ by default will fail, so you may want to additionally do

% make install-info
% make install-data-html

(these install targets exist or not depending on the version).
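
A condensed sketch of the whole sequence described above (the PerlMagick/ sub-directory name is as shipped in the ImageMagick sources):

% ./configure --prefix=$OPTSTAR
% make && make install
% cd PerlMagick
% perl Makefile.PL
% make && make install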
 

lynx

- lynx2-8-2 / lynx2-8-5 
  Note: First, I tried lynx2-8-4 and the make file / configure
        is a real disaster. For 2-8-2/2-8-5, follow the notes 
        below

  General :
  %  ./configure --prefix=$XOPTSTAR {--with-screen=slang}

  Do not forget to
  % make install-help
  % make install-doc

 caveat 1 -- Linux (lynx 2-8-2 only, fixed at 2-8-5)

  $OPTSTAR/lib/lynx.cfg was modified as follow
96,97c96,97
< #HELPFILE:http://www.crl.com/~subir/lynx/lynx_help/lynx_help_main.html
< HELPFILE:file://localhost/opt/star/lib/lynx_help/lynx_help_main.html
---
> HELPFILE:http://www.crl.com/~subir/lynx/lynx_help/lynx_help_main.html
> #HELPFILE:file://localhost/PATH_TO/lynx_help/lynx_help_main.html

   For using curses (needed under Linux; otherwise the screen looks funny),
   one has to do a few manipulations by hand, i.e.
   . start with ./configure --prefix=$XOPTSTAR --with-screen=slang
   . edit the makefile and add -DUSE_SLANG to SITE_DEFS
   . change CPPFLAGS from /usr/local/slang to $OPTSTAR/include [when slang is local]
     Version 2-8-5 has this issue fixed.
   . Change LIBS -lslang to -L$OPTSTAR/lib -lslang
   . You are ready now
   There is probably an easier way but, as usual, I gave up after close
   to 15 minutes of reading, as much struggle, and a complete flop at the end ..

 caveat 2 -- Solaris/True64 :
   We did not build with slang but native (slang screws colors up)

 

text2c, chksum, chtext, icalc

Those packages can be assembled simply by using the following command:

% make clean && make install PREFIX=$OPTSTAR

To build 32 bits versions of the executables under a 64 bits kernel, use

  • text2c:             % make CC=`which gcc` CFLAGS="-lfl -m32"
  • icalc:                % make CC=`which gcc` CFLAGS="-lm -m32"
  • chksum:           % make CC=`which gcc` CFLAGS="-m32 -trigraphs"
  • chtext:             % make CC=`which gcc` CFLAGS="-lfl -m32"

 

xv-3.10a

This package is distributed already patched and, in principle, only a few 'make' commands should suffice. Note:

  • xv is licensed, so its usage has to remain strictly for your users' amusement only. If you use this package for doing any work, you are violating the law. Please read the license agreement at http://www.trilon.com/xv/pricing.html

Normal build

Now, you should be ready to build the main program (I am not sure why some dependencies fail on some platforms and did not bother to fix that).

% cd tiff/
% make clean && make
% cd ../jpeg
% make clean && make
% cd ..
% rm -f *.o  && make
% make -f Makefile.gcc64 install BINDIR=$OPTSTAR/bin

For 32 bits compilation under a 64 bits kernel

% cd tiff/
% make clean && make CC=`which gcc` COPTS="-O -m32"
% cd ../jpeg
% make clean && make CC=`which gcc` CFLAGS="-O -I. -m32" LDFLAGS="-m32"
% cd ..
% rm -f *.o   && make -f Makefile.gcc32
% make -f Makefile.gcc32 install BINDIR=$OPTSTAR/bin

Makefile.gcc32 and Makefile.gcc64 are both provided for convenience.

Building from scratch (good luck)

However, if you need to re-generate the makefile (may be needed for new architectures), use

% xmkmf 

Then, the patch is as follows

% sed "s|/usr/local|$OPTSTAR|" Makefile > Makefile.new
% mv Makefile.new Makefile

and in xv.h, line 119 becomes

# if !defined(__NetBSD__) && ! defined(__USE_BSD) 

After xmkmf, you will need to

% make depend

before typing make. This may generate some warnings. Ignore them.

However, I had to fix diverse caveats depending on situations ...

Caveat 1 - no tiff library found

Go into the tiff/ directory and do

% cd tiff
% make -f Makefile.gcc
% cd ..

to generate the mkg3states program (which itself creates the g3states.h file), as this did not work out of the box.

Caveat 2 - tiff and gcc 4.3.2 in tiff/

With gcc 4.3.2, I created an additional .h file named local_types.h and forced the definition of a few of the u_* types using define statements (I know, it is bad). The content of that file is as follows

#ifndef _LOCAL_TYPES_
#define _LOCAL_TYPES_

#if !defined(u_long)
# define u_long unsigned long
#endif
#if !defined(u_char)
# define u_char unsigned char
#endif
#if !defined(u_short)
# define u_short unsigned short
#endif
#if !defined(u_int)
# define u_int unsigned int
#endif

#endif

and it needs to be included in tiff/tif_fax3.h and tiff/tiffiop.h .

Caveat 3 -- no jpeg library?

In case you have a warning about jpeg such as "No rule to make target `libjpeg.a'", do the following as well:

% cd jpeg
% ./configure
% make
% cd ..

 

Nedit

There is no install target provided. I did

% make linux
% cp source/nc source/nedit $OPTSTAR/bin/
% cp doc/nc.man $OPTSTAR/man/man1/nc.1
% cp doc/nedit.man $OPTSTAR/man/man1/nedit.1

Other targets

% make dec
% make solaris

If you need to build for another compiler or another platform, you may want to copy one of the provided makefiles and modify it to create a new target. For example, if you have a 64 bits kernel but want to build a 32 bits nedit (for consistency or otherwise), you could do this:

% cp makefiles/Makefile.linux makefiles/Makefile.linux32

then edit it and add -m32 to both CFLAGS and LIBS. This will add a target "platform" linux32 for a make linux32 command (I tested this and it worked fine). The STAR provided package adds (just in case) both a linux64 and a linux32 reshaped makefile to ensure an easy install for all kernels (the gcc compiler should be recent enough to accept the -m flag).

 

Pythia libraries

The unpacking is "raw". So, go into a working directory where the .tar.gz files are, and do the following (for Linux)

% test -d Pythia && rm -fr Pythia ; mkdir Pythia && cd Pythia && tar -xzf ../pythia5.tar.gz 
% ./makePythia.linux 
% mv libPythia.so $OPTSTAR/lib/ 
% cd .. 
% 
% test -d Pythia6 && rm -fr Pythia6 ; mkdir Pythia6 && cd Pythia6 && tar -xzf ../pythia6.tar.gz 
% test -e main.c && rm -f main.c 
% ./makePythia6.linux 
% mv libPythia6.so $OPTSTAR/lib 
% 

Substitute linux with solaris for Solaris platform. On Solaris, Pythia6 requires gcc to build/link.

On SL5, 64 bits note

Depending on whether you compile a native 64 bit library support or a cross-platform 32/64, you will need to handle it differently.

For a 64 bits platform, I had to edit makePythia.linux and add -fPIC to the options for the .so as well as for the binary built from main.c. I did not provide a patched package, mainly because v5 is not really needed in STAR. Pythia6 caveat: on SL5, 64 bits, use the makePythia6.linuxx8664 file. You will need to chmod +x it first as it was not executable in my version.

On a 64 bit platform, to actually build a cross-platform version, I had instead to use the normal build, making sure to add -m32 to the compilation and linker options and -fPIC to the compilation options.

 

True64

Pythia:
% chmod +x ./makePythia.alpha && ./makePythia.alpha

Pythia6:
% chmod +x ./makePythia6.alpha && ./makePythia6.alpha

The following script was used to split source code which was too big

 #!/usr/bin/env perl
 # Split a large Fortran file into pieces, cutting only at 'subroutine'
 # boundaries once at least 500 lines have accumulated. Pieces are
 # written as 0.<file>, 1.<file>, 2.<file>, ...
 $filin = $ARGV[0];
 open(FI,$filin);
 $idx = 0;
 $i   = -1;    # -1 so that the very first line opens piece 0
 while( defined($line = <FI>) ){
    chomp($line); $i++;

    # time to start a new piece at the next subroutine
    if ($i >= 500 && $line =~ /subroutine/){
	$i = 0;
	$idx++;
    }

    # open the next output piece; the subroutine line starts the new file
    if ($i == 0){
	close(FO);
	open(FO,">$idx.$filin");
	print "Opening $idx.$filin\n";
    }
    print FO "$line\n";
 }
 close(FO);
 close(FI);
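
Assuming the script is saved as split.pl, usage would be as below (the Fortran file name is purely illustrative); the resulting pieces can then be compiled one by one:

% perl split.pl pythia6.f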

 

Qt 4

Starts the same as Qt3, i.e. assuming that SRC=/afs/rhic.bnl.gov/star/common/General/Sources/ and that $x and $y stand for the major and minor versions of Qt. There are multiple flavors of the package name (it was called qt-x11-free*, then qt-x11-opensource*, and with more recent packages qt-everywhere-opensource-src*). For the sake of instructions, I provide a generic example with the most recent naming (please adapt as your case requires). WHEREVER is a location of your choice (not the final target directory).

% cd $WHEREVER
% tar -xzf $SRC/qt-everywhere-opensource-src-4.$x.$y.tar.gz
% cd qt-everywhere-opensource-src-4.$x.$y
% ./configure --prefix=$XOPTSTAR/qt4.$x -qt-sql-mysql -no-exceptions -no-glib -no-rpath 

To build a 32/64 bits version on a 64 bits OS, or to force a 32 bits executable (shared mode) on a 32 bits OS, use a configure target like one of the below

% ./configure  -platform linux-g++-32 -mysql_config $OPTSTAR/bin/mysql_config [...] 
% ./configure  -platform linux-g++-64 [...]

Note that the above assumes you have a proper $OPTSTAR/bin/mysql_config. On a mixed 64/32 bits node, the default /usr/bin/mysql_config will return the linked libraries via the /usr/lib64/mysql path and not /usr/lib/mysql; hence, Qt's make will fail to find the dependencies necessary to link with -m32. The trick we used was to copy mysql_config and replace lib64 by lib, and voila!
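
One way to implement that trick (a sketch; inspect the copied script afterward, as the exact paths may differ on your node):

% cp /usr/bin/mysql_config $OPTSTAR/bin/mysql_config
% sed -i 's|/usr/lib64/mysql|/usr/lib/mysql|g' $OPTSTAR/bin/mysql_config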


Compiling

  % make
  % make install

  % cd $OPTSTAR
  % ln -s qt4.$x ./qt4

For compiling with a different compiler, note that the variables referenced in this section will be respected by configure. You HAVE TO do this, as the project files and other configuration files from Qt will include information on the compiler (inconsistencies may arise otherwise).

Misc notes

  • If you use the same directory tree for compiling the 64 bits and the 32 bits versions, please note that 'make clean' will not do the proper job. You will need the more systematic % find . -name '*.o' -exec rm -f {} \;  command before running ./configure again.
  • We added mysql support in Qt4, and Qt can now be compiled in a separate directory and installed properly (at last!). If the mysql support gives you trouble on a 64 bit OS when attempting to build a 32 bit image, be sure you have used the -mysql_config $OPTSTAR/bin/mysql_config option as indicated above, as otherwise the default mysql_config will be picked from /usr/bin and that version will refer to the 64 bits libraries (the link will then fail).
  • For SL44, we created a qt3 distribution as then, both 3 and 4 existed. Otherwise, the ./qt4 link as indicated above is sufficient for SL5 and above.
  • On some systems (SL3.0.2 for sure), I also used
    • -no-openssl   as there were include problems with ssl.h and krb5.h
    • -qt-libtiff   as the default system included header did not agree with Qt code
    • -platform linux-icc  could be used for icc based compiler

 

Qt 3

Horribly packaged; the easiest is to unpack in $OPTSTAR, cd to qt-x11-free-3.X.X (where X.X stands for the current sub-version deployed on our nodes), run the configure script, make the package, then make clean. Then, link

  % cd $OPTSTAR && ln -s qt-x11-free-3.X.X qt

Later releases can be built that way by changing the soft-link, without removing the preceding version entirely. Before building, do the following if you had a previous version of Qt installed (this is not necessary if you install the package for the first time). Please close windows after compilation to ensure STAR path sanity.

  % cd $OPTSTAR/qt
  % setenv QTDIR `pwd`
  % setenv LD_LIBRARY_PATH `pwd`/lib:$LD_LIBRARY_PATH
  % setenv PATH `pwd`/bin:$PATH

To configure the package, then use one of:

  • Linux gcc: ./configure --prefix=`pwd` -no-xft -thread
  • Linux icc:  ./configure --prefix=`pwd` -no-xft -thread -platform linux-icc
  • True64 :   ./configure --prefix=`pwd` -no-xft -thread
  • Solaris:     ./configure --prefix=`pwd` -no-xft

If threading is enabled, the regular version is built first, then the threaded version (so far, they have different names and no soft links).

You may also want to edit  $QTDIR/mkspecs/default/qmake.conf and replace the line

QMAKE_RPATH		= -Wl,-rpath,

by

QMAKE_RPATH		= 

By doing so, you disable the rpath shared library loading and rely on LD_LIBRARY_PATH only for loading your Qt related libraries. This has the advantage that you may copy the Qt3 libraries along with your project and isolate them onto a specific machine without needing to see the original installation directory.

 

unixODBC

% ./configure --prefix=$XOPTSTAR [CC=icc CXX=icc]
% make clean       # in case you are re-using the same directory for multiple platform 
% make 
% make install

Use the environment variables noted in this section and all will go well.

Note on versions earlier than 2.3.0 (including 2.2.14 previously suggested)

The problem described below DOES NOT exist if you use a 32 bits kernel OS; it is specific to 64 bits kernels with 32 bits support.

For a 32 bits compilation under a 64 bits kernel, please use % cp -f $OPTSTAR/bin/libtool .  after the ./configure and before the make (see this section for an explanation of why). unixODBC version 2.3.0 does not have this problem.



MyODBC

Older version

Came with sources and one could compile "easily" (and register manually).

- MyODBC
  Linux:  % ./configure --prefix=$XOPTSTAR --with-unixODBC=$XOPTSTAR [CC=icc CXX=icc]
  Others: % ./configure --prefix=$XOPTSTAR --with-unixODBC=$XOPTSTAR --with-mysql-libs=$XOPTSTAR/lib/mysql \
            --with-mysql-includes=$XOPTSTAR/include/mysql --with-mysql-path=$XOPTSTAR

Note : Because of an unknown issue, I had to use --disable-gui on True64
as it would complain about not finding the X include ... GUI is
not important for ODBC client anyway but whenever time allows ...

Deploy instructions at
http://www.mysql.com/products/myodbc/faq_toc.html

Version 5.x of the connector

Get the proper package, currently named mysql-connector-odbc-5.x.y-linux-glibc2.3-x86-32bit or mysql-connector-odbc-5.x.y-linux-glibc2.3-x86-64bit; the packages are available from the MySQL Web site. The install will need to be manual, i.e.

% cp -p bin/myodbc-installer $OPTSTAR/bin/
% cp -p lib/*  $OPTSTAR/lib/
% rehash

To register the driver, use the following commands

% myodbc-installer -d -a -n "MySQL ODBC 5.1 Driver" -t "DRIVER=$OPTSTAR/lib/libmyodbc5.so;SETUP=$OPTSTAR/lib/libmyodbc3S.so"
% myodbc-installer -d -a -n "MySQL" -t "DRIVER=$OPTSTAR/lib/libmyodbc5.so;SETUP=$OPTSTAR/lib/libmyodbc3S.so"

this will add a few lines in $OPTSTAR/etc/odbcinst.ini . Note that myodbc-installer -d -l does not seem to list what you installed (but the proper lines will be added to the configuration).

 

doxygen

Installation would benefit from some smoothing; note the space between --prefix and OPTSTAR (doxygen uses a non-standard option set for configure).

Use one of

% ./configure --prefix $OPTSTAR                       # for general compilation
% ./configure --platform linux-32 --prefix $OPTSTAR   # Linux, gcc 32 bits - this option was added in the STAR package
% ./configure --platform linux-64 --prefix $OPTSTAR   # Linux, gcc 64 bits - this option was fixed in the STAR package

then

% make
% make install

as usual but also

% make docs

which will fail due to a missing eps2pdf program. It will however create the HTML files, which you will need to copy somewhere.

% cp -r html $WhereverTheyShouldGo

and as example

% cp -r html /afs/rhic.bnl.gov/star/doc/www/comp/sofi/doxygen


Note: The linux-32 and linux-64 platforms were packaged in the archive provided for STAR (linux-32 does not exist in the original doxygen distribution, while linux-64 is not consistent with the -m64 compilation option).

 

Additional Graphics libraries

Starting from SL5, we also deployed the following: coin, simage, SmallChange, SoQt. These need to be installed before Qt4 but after doxygen. All options needed to install those packages are specified below. Please substitute -m32 by -m64 for native 64 bits support. After the configure, the usual make and make install are expected.

The problem described below DOES NOT exist if you use a 32 bits kernel OS; it is specific to 64 bits kernels with 32 bits support.

For the 32 bits version compilation under a 64 bits kernel, and for ALL sub-packages below, please be sure you have the STAR version of libtool installed and use the command
% cp -f $OPTSTAR/bin/libtool .
after the ./configure to replace the generated local libtool script. This will correct a problem which would otherwise occur at link time (see the libtool help for more information).

 

Coin:

% ./configure --enable-debug --disable-dependency-tracking --enable-optimization=yes \
--prefix=$XOPTSTAR CFLAGS="-m32 -fPIC -fpermissive" CXXFLAGS="-m32 -fPIC -fpermissive" LDFLAGS="-m32 -L/usr/lib" \
--x-libraries=/usr/lib

or, for or the 64 bits version

% ./configure --enable-debug --disable-dependency-tracking --enable-optimization=yes \
--prefix=$XOPTSTAR CFLAGS="-m64 -fPIC -fpermissive" CXXFLAGS="-m64 -fPIC -fpermissive" LDFLAGS="-m64" 

 

simage (needs Qt installed and QTDIR defined prior):

% ./configure --prefix=$XOPTSTAR --enable-threadsafe --enable-debug --disable-dependency-tracking \ 
--enable-optimization=yes --enable-qimage CFLAGS="-m32 -fPIC" CXXFLAGS="-m32 -fPIC" \ 
LDFLAGS="-m32" FFLAGS="-m32 -fPIC" --x-libraries=/usr/lib

or, for the 64 bits version

% ./configure --prefix=$XOPTSTAR --enable-threadsafe --enable-debug --disable-dependency-tracking \ 
--enable-optimization=yes --enable-qimage CFLAGS="-m64 -fPIC" CXXFLAGS="-m64 -fPIC" \ 
LDFLAGS="-m64" FFLAGS="-m64 -fPIC" 

SmallChange:

% ./configure --prefix=$XOPTSTAR --enable-threadsafe --enable-debug --disable-dependency-tracking \
--enable-optimization=yes CFLAGS="-m32 -fPIC -fpermissive" CXXFLAGS="-m32 -fPIC -fpermissive" \
LDFLAGS="-m32" FFLAGS="-m32 -fPIC"

or, for the 64 bits version

% ./configure --prefix=$XOPTSTAR --enable-threadsafe --enable-debug --disable-dependency-tracking \
--enable-optimization=yes CFLAGS="-m64 -fPIC -fpermissive" CXXFLAGS="-m64 -fPIC -fpermissive" \
LDFLAGS="-m64" FFLAGS="-m64 -fPIC"

SoQt:

./configure --prefix=$XOPTSTAR --enable-threadsafe --enable-debug --disable-dependency-tracking \ 
--enable-optimization=yes --with-qt=true --with-coin CFLAGS="-m32 -fPIC  -fpermissive" CXXFLAGS="-m32 -fPIC  -fpermissive" \
LDFLAGS="-m32" FFLAGS="-m32 -fPIC"

or, for the 64 bits version

./configure --prefix=$XOPTSTAR --enable-threadsafe --enable-debug --disable-dependency-tracking \ 
--enable-optimization=yes --with-qt=true --with-coin CFLAGS="-m64 -fPIC  -fpermissive" CXXFLAGS="-m64 -fPIC -fpermissive" \
LDFLAGS="-m64" FFLAGS="-m64 -fPIC"

 

 

flex

Flex is usually not needed, but some OSes ship a pre-GNU flex that is not adequate, so I would recommend deploying flex-2.5.4 anyway (the latest version since Linux 2001). Do not install it under Linux if you already have flex on your system as an RPM.

Attention: Under SL5 64 bits, be sure you have flex32libs-2.5.4a-41.fc6 installed as documented in Scientific Linux 5.3 from 4.4. Linkage of 32 bits executables would otherwise dramatically fail.

 

- Xpm (Solaris)
  % xmkmf
  % make Makefiles
  % make includes
  % make 
  I ran the install commands by hand, changing the path (cut and paste). I had to:


  % cd lib
  % installbsd -c -m 0644 libXpm.so $OPTSTAR/lib
  % installbsd -c -m 0644 libXpm.a $OPTSTAR/lib
  % cd ..
  % cd sxpm/
  % installbsd -c sxpm $OPTSTAR/bin
  % cd ../cxpm/
  % installbsd -c cxpm $OPTSTAR/bin
  %
  
  On Solaris, the .a was not there; I had to
  % cd lib && ar -q libXpm.a *.o && cp libXpm.a $OPTSTAR/lib
  % cd ..

  Additionally needed
  % if ( ! -e $OPTSTAR/include) mkdir $OPTSTAR/include 
  % cp lib/xpm.h $OPTSTAR/include/

  

- libpng 
  ** Solaris **
  % cat scripts/makefile.solaris | sed "s/-Wall //" > scripts/makefile.solaris2
  % cat scripts/makefile.solaris2 | sed "s/gcc/cc/" > scripts/makefile.solaris3
  % cat scripts/makefile.solaris3 | sed "s/-O3/-O/" > scripts/makefile.solaris2
  % cat scripts/makefile.solaris2 | sed "s/-fPIC/-KPIC/" > scripts/makefile.solaris3
  % 
  % make -f scripts/makefile.solaris3

  will eventually fail with an error related to libucb. No worry, this can be sorted
  out (http://www.unixguide.net/sun/solaris2faq.shtml) by including
  /usr/ucblib in the -L list
  % cc -o pngtest -I/usr/local/include -O pngtest.o -L. -R. -L/usr/local/lib \
    -L/usr/ucblib -R/usr/local/lib -lpng -lz -lm
  % make -f scripts/makefile.solaris3 install prefix=$OPTSTAR


  ** True64 **
  Copy the makefile but, most likely, a change like
ZLIBINC = $(OPTSTAR)/include
ZLIBLIB = $(OPTSTAR)/lib

  in the makefile is needed.
 
  pngconf.h and png.h are needed for installation, plus either the .a or the .a + .so

cp pngconf.h png.h $OPTSTAR/include/
cp libpng.* $OPTSTAR/lib



- mysql client (Solaris)
 % ./configure --prefix=$XOPTSTAR --without-server {--enable-thread-safe-client}
 (very smooth)
 The latter option is needed to create the libmysqlclient_r library needed by some
 applications. While this .so is built by default with early versions of MySQL,
 version 4.1+ requires the configure option explicitly.


- dejagnu-1.4.1	[Solaris specific]
The install program was not found, so I did:
% cd doc/ && cp ./runtest.1 $OPTSTAR/man/man1/runtest.1
% chmod 644 $OPTSTAR/man/man1/runtest.1

Jed

The basic principle is as usual

% ./configure --prefix=$OPTSTAR
% make
% make xjed
% make install

However, on some platforms (this was not seen as a problem on SL/RHEL), you may need to apply the following tweak before typing make: edit the configure script and add $OPTSTAR (possibly /opt/star) to it as follows.

JD_Search_Dirs="$JD_Search_Dirs \
                $includedir,$libdir \
                /opt/star/include,/opt/star/lib \
                /usr/local/include,/usr/local/lib \
                /usr/include,/usr/lib \
                /usr/include/slang,/usr/lib \
                /usr/include/slang,/usr/lib/slang" 

32 / 64 bit issue?

The problem described below DOES NOT exist if you use a 32-bit kernel OS; it is specific to 64-bit kernels with 32-bit support.

The variables described here will make configure pick up the right compiler and compiler options. On our initial system, a 32-bit compilation under the 64-bit kernel produced a Makefile that tried to do something along the lines of -L/usr/X11R6/lib64 -lX11 but did not find the X11 libs (since that path is not adequate for 32-bit). To correct this problem, edit src/Makefile and replace XLIBDIR = -L/usr/lib64 with XLIBDIR = -L/usr/lib . You MUST have the 32-bit compatibility libraries installed on your 64-bit kernel for this to work.
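
If you prefer a one-liner for that src/Makefile edit, something like the following should work (a sketch; check the exact spacing of the XLIBDIR line in your src/Makefile first):

% sed -i 's|XLIBDIR = -L/usr/lib64|XLIBDIR = -L/usr/lib|' src/Makefile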

AIX

I had to apply a few hacks on AIX (well, who wants to run on AIX in the first place, right?? but since AIX does not come with emacs etc., jed is great), as follows:

  • make a copy of unistd.h and comment the sleep() prototype
  • modify file.c to include the local version (replace <> by "")
  • modify main.c to include sys/io.h (and not io.h) and comment out direct.h

Voila (works like a charm, don't ask).

 
emacs

Version 24.3

In the options below, I recommend with-x-toolkit=motif, as the default GTK build will lead to many warnings depending on the user's X11 server version and supported features. Motif may give an "old look and feel" but will work. However, you may have a local fix for GTK (by installing all the required modules and dependencies) and not need to go to the Motif UI.

% ./configure --with-x-toolkit=motif --prefix=$OPTSTAR

For the 32-bit version supporting 64/32 bits, use the line below:

% ./configure --with-crt-dir=/usr/lib --with-x-toolkit=motif --prefix=$OPTSTAR CFLAGS="-m32 -fPIC" CXXFLAGS="-m32 -fPIC" LDFLAGS="-m32"

Then the usual 'make' and 'make install'.

Below are old instructions you should ignore

- emacs
  Was repacked with the leim package (instead of keeping both separately),
  in addition to having a patch in src/s/sol2.h for Solaris, as follows:
 #define HAVE_VFORK 1
 #endif
 
+/* Newer versions of Solaris have bcopy etc. as functions, with
+   prototypes in strings.h.  They lose if the defines from usg5-4.h
+   are visible, which happens when X headers are included.  */
+#ifdef HAVE_BCOPY
+#undef bcopy
+#undef bzero
+#undef bcmp
+#ifndef NOT_C_CODE
+#include <strings.h>
+#endif
+#endif
+

  Nothing to do differently here. This is just a note to keep track
  of changes found from community mailing lists.

  % ./configure --prefix=$OPTSTAR --without-gcc
  

- Xemacs (Solaris)
  % ./configure --without-gcc --prefix=$OPTSTAR
  Another solution, forcing Xpm:
  % ./configure --without-gcc --prefix=$OPTSTAR --with-xpm --site-prefixes=$OPTSTAR

  Possible code problem :
  /* #include <X11/xpm.h> */
  #include <xpm.h> 

- gcc-2.95 On Solaris was used as a base compiler
  % ./configure --prefix=$OPTSTAR
  % make bootstrap

  o Additional gcc on Linux
  Had to do it in multiple passes (you do not need to do the first pass
  elsewhere; this is just because we started without a valid node).

  A gcc version < 2.95.2 had to be used. I used a 6.1 node to assemble
  it and install it in a specific AFS tree (cross version):
  % cd /opt/star && ln -s /afs/rhic/i386_linux24/opt/star/alt .
  Move to the gcc source directory
  % ./configure --prefix=/opt/star/alt
  % make bootstrap
  % make install
  The install may fail in AFS land. Edit gcc/Makefile and remove the "p" option
  from the tar options TAROUTOPTS (a one-liner sketch follows).
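
  If you prefer a one-liner for that edit (a hypothetical sketch; the exact
  TAROUTOPTS value varies between gcc versions, so verify the result by hand):
  % perl -pi -e 's/p// if /^TAROUTOPTS/' gcc/Makefile   # drop the "p" flag from the TAROUTOPTS value
  % grep TAROUTOPTS gcc/Makefile                        # check the edit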
  
  For it to work under 7.2, go to a 7.2 node and
  % cp /opt/star/alt/include/g++-3/streambuf.h /opt/star/alt/include/g++-3/streambuf.h-init
  % cp -f /usr/include/g++-3/streambuf.h /opt/star/alt/include/g++-3/streambuf.h
  ... don't ask ...


  o On Solaris, no problems
  % ./configure --prefix=/opt/star/alt
  etc ...

- Compress-Zlib-1.12 --> zlib-1.1.4
  If installed in $OPTSTAR,
  % setenv ZLIB_LIB $OPTSTAR/lib
  % setenv ZLIB_INCLUDE $OPTSTAR/include
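
  Compress::Zlib is a Perl module; after setting those variables, the build
  presumably follows the usual MakeMaker sequence (a sketch, not verified
  against this exact version):
  % perl Makefile.PL PREFIX=$OPTSTAR
  % make
  % make test
  % make install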
  

- findutil
  Needed a patch in lib/fnmatch.h for Tru64,
  as follows:
  + addition of defined(__GNUC__) on line 49
  + do a chmod +rw lib/fnmatch.h  first

#if !defined (_POSIX_C_SOURCE) || _POSIX_C_SOURCE < 2 || defined (_GNU_SOURCE) || defined(__GNUC__)






* CLHEP1.8                      *** Experimental only ***
printVersion.cc needs a correction, #include <string> changed to <string.h>,
for Tru64, which is a bit strict in terms of compilation.

On Solaris, 2 caveats
o gcc was used (it claims that CC is used, but does not have the includes)
o the install failed, running a "kdir" command instead of mkdir, so do a
% make install MKDIR='mkdir -p'

Using icc was not tried, and this package was later removed.
- mysqlcc
  % ./configure --prefix=$OPTSTAR --with-mysql-include=$OPTSTAR/include/mysql --with-mysql-lib=$OPTSTAR/lib/mysql
  The executable does not install itself, so one needs to
  % cp -f mysqlcc $OPTSTAR/bin/

 

libtool

First, please note that the package distributed for STAR contains a patch for support of the 32 / 64 bits environment. If you intend to download from the original site, please apply the patch below as indicated. If you do not use our distributed package and attempt to assemble a 32 bits library under a 64 bits kernel, we found cases where the default libtool will fail.

Why replace libtool? Sometimes "a" version of libtool is shipped along with the software packages indicated in this help. However, those versions do not consider the 32-bit / 64-bit mix, and their use often leads to the wrong linkage (the typical problem is a 32-bit executable or shared library linked against the 64-bit stdc++, creating a clash).

This problem does not exist when you assemble 64-bit code under a 64-bit kernel, or 32-bit code under a 32-bit kernel.
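
A quick, generic way to check whether an already-built library or executable ended up with the intended word size (not specific to libtool; "libexample" and "myprog" are placeholder names):

% file $XOPTSTAR/lib/libexample.so    # reports e.g. "ELF 32-bit LSB shared object" or "ELF 64-bit ..."
% ldd myprog                          # shows which (32-bit or 64-bit) libraries a dynamic executable resolves to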

In all cases, to compile and assemble, use a command line like the below:

% ./configure --prefix=$XOPTSTAR CFLAGS="-m32 -fPIC" CXXFLAGS="-m32 -fPIC" \
FFLAGS="-m32 -fPIC" FCFLAGS="-m32 -fPIC" LDFLAGS="-m32"                        # 32 bits version
% ./configure --prefix=$XOPTSTAR CFLAGS="-m64 -fPIC" CXXFLAGS="-m64 -fPIC" \
FFLAGS="-m64 -fPIC" FCFLAGS="-m64 -fPIC" LDFLAGS="-m64"                        # 64 bits version
% make
% make install

Patches

libtool 2.4

The file ./libltdl/config/ltmain.sh needs the following patch

< 
< 	# JL patch 2010 -->
< 	if [ -z "$m32test" ]; then
< 	    #echo "Defining m32test"
< 	    m32test=$($ECHO "${LTCFLAGS}" | $GREP m32)
<    fi	
< 	if [ "$m32test" != "" ] ; then
< 	  dependency_libs=`$ECHO " $dependency_libs" | $SED 's% \([^ $]*\).ltframework% -framework \1%g' | $SED 's|lib64|lib|g'`
< 	else
< 	  dependency_libs=`$ECHO " $dependency_libs" | $SED 's% \([^ $]*\).ltframework% -framework \1%g'`
< 	fi
< 	# <-- end JL patch
< 
---
> 	dependency_libs=`$ECHO " $dependency_libs" | $SED 's% \([^ $]*\).ltframework% -framework \1%g'`

 

 

gdb (patch)

In gdb/linux-nat.c, comment out the following block:

         /*
         fprintf_filtered (gdb_stdout,
                           "Detaching after fork from child process %d.\n",
                           child_pid);
         */

and go (no, I will not explain).

 

astyle

Version 2.03

% cd astyle_2.03/src
% make -f ../build/gcc/Makefile CXX="g++ -m32 -fPIC"   # for the 64-bit version, use the same command
                                                       # with CXX="g++ -m64 -fPIC"
% cp bin/astyle $XOPTSTAR/bin/
% test -d $XOPTSTAR/share/doc/astyle || mkdir -p $XOPTSTAR/share/doc/astyle
% cp ../doc/*.* $XOPTSTAR/share/doc/astyle

The target

% make -f ../build/gcc/Makefile clean

also works fine and is needed between versions.


Version 1.23

The directory structure changed, but it is easier to make the package, so use instead:

% cd astyle_1.23/src/
% make -f ../buildgcc/Makefile  CXX="$CXX $CFLAGS"
% cp ../bin/astyle $OPTSTAR/bin/
% cd .. 

Note that the condensed command above assumes you have defined the environment variables as described in this section. Between OSes (32 / 64 bits) you may need to % rm -f obj/* as the make system will not recognize the change between kernels (you may alternatively make -f ../buildgcc/Makefile clean, but a rm will be faster :-) ).

Documentation

A crummy man page was added (I will make it better later if really important). It was generated as follows and is provided for convenience in the packages for STAR (do not overwrite it, because I will not tell you what to do to make the file a good pod):

% cd doc/
% lynx -dump astyle.html >astyle.pod 

[... some massage beyond the scope of this help - use what I provided ...]

% pod2man astyle.pod >astyle.man 
% cp astyle.man $OPTSTAR/man/man1/astyle.1 

 

Versions < 1.23

Find where the code really unpacks. There is no configure script for this package.

% cd astyle_1.15.3    # or, depending on the version:
% cd astyle/src
% make
% cp astyle $OPTSTAR/bin/

Version 1.15.3

The package comes as a zip archive. Be aware that unpacking extracts files into the current directory, so the package was remade for convenience. Although written in C++, this executable will perform as expected under an icc environment. On SL4 with gcc 3.4.3, add -fpermissive to the Makefile CPPFLAGS (see the sketch below).
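
The -fpermissive addition amounts to a Makefile edit of this form (the flag list shown is hypothetical; keep whatever flags your Makefile already has and simply append -fpermissive):

CPPFLAGS = -O2 -Wall -fpermissive   # hypothetical original flags, with -fpermissive appended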

 

valgrind

MUST be installed using $XOPTSTAR, because there is an explicit reference to the install path; copying to a local /opt/star would therefore not work. For icc, use the regular command, as this is a self-contained program without C++ dependencies and can be copied from the gcc to the icc directory. The command is

% ./configure --prefix=$XOPTSTAR  

Note: valgrind versions >= 3.4 may ignore additional compiler options (but will respect the CC and CXX variables), as valgrind will assemble both the 32-bit and the 64-bit versions on a dual-architecture platform. You can force a 32-bit-only build by adding the configure option --enable-only32bit.
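
A minimal sketch of such a 32-bit-only build, using the configure option just mentioned:

% ./configure --prefix=$XOPTSTAR --enable-only32bit
% make
% make install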

Caveats for earlier revisions below:

Version 2.2

A few hacks were made to the package, in a go-and-learn fashion as problems appeared:
 

coregrind/vg_include.h
123c123
< #define VG_N_RWLOCKS 5000
---
> #define VG_N_RWLOCKS 500
coregrind/vg_libpthread.vs
195a196
> __pthread_clock_gettime; __pthread_clock_settime;

to solve problems encountered with large programs and pthread.

 

APR

The problem described below DOES NOT exist if you use a 32-bit kernel OS; it is specific to 64-bit kernels with 32-bit support.

For a 32-bit compilation under a 64-bit kernel, please run % cp -f $OPTSTAR/bin/libtool . after the ./configure and before the make (see the libtool section above for an explanation of why), for both the apr and expat packages.

apr is an (almost) straightforward installation:

% ./configure --prefix=$OPTSTAR

apr-util needs one more argument, i.e.

% ./configure --prefix=$OPTSTAR --with-apr=$OPTSTAR

The configure script will respect the environment variables described
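
Putting it together under a 64-bit kernel with 32-bit support, the whole sequence might look like the sketch below (the directory names are hypothetical, and the libtool copy is only needed in the 32-bit-under-64-bit case described above):

% cd apr-x.y.z
% ./configure --prefix=$OPTSTAR
% cp -f $OPTSTAR/bin/libtool .
% make && make install
% cd ../apr-util-x.y.z
% ./configure --prefix=$OPTSTAR --with-apr=$OPTSTAR
% cp -f $OPTSTAR/bin/libtool .
% make && make install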