Software and Computing
TPC Voltage issue and discussion #2
1-189, at 15:00 (GMT), duration : 01:00
This meeting is the second of its kind. These meetings are being held to discuss and converge on the TPC voltage issue.
Several action items and discussions were started since the first meeting (see especially this thread).
TPC Voltage issue and discussion #1
1-189, at 21:30 (GMT), duration : 01:00
The meeting was set to discuss ways to converge on the TPC voltage issue. This is the first of a series of meetings with the various experts.
Run 11 preparation meeting #8
1-189, EVO, at 19:00 (GMT), duration : 01:00
Minutes
Attendees: Wayne B., Leve H., Jérôme L., Gene V.B.
ShiftLog [Leve]:
- New server:
- Ready:
- App copied to new server (Wayne)
- Successfully passed stability tests (pounding on it)
- All WARs deployed
- Scripts ready to flush cache and re-copy WARs
- To do:
- Will re-check links to and from RunLog
- Watcher script not on yet
- Verify Offline QA Shift Report connections
Online infrastructure [Wayne]:
- New web server:
- Links on homepage checked
- Some issues with DB access and privilege from current node name (dean2)
- New jpgraph version installed in same location as the old one
- Disk exports are not in place, so content isn't being updated currently
- Will do another sync over the weekend or early next week, in addition to the final one when we swap servers
- Ganglia to be installed
- User accounts to be added
- New web server backup:
- Configuration kept the same as main new web server
- ...but content isn't sync-ed
- Password rotation completed
- mysql-devel installed on online linux nodes (RT 2068)
- Two online disk failures (on onl09 (non-critical) RT 2065 and on onlldap (critical) RT 2067)
- RAID still providing for the file systems in the interim
- ITD network backups failing (RT 2069), but resolution from ITD may have just come through
- Machine replacements:
- emcsc ready (either needs to be deployed by tomorrow or the ITD block needs to be postponed)
- emcspin ready (Monday)
- starutilities awaiting C-AD software
FastOffline [Jerome]:
- Many files coming into the system, but all are tests, pulsers, etc., so not being processed
- Some missing detector setup, beam energy, etc. information: could mean misconfiguration, or potentially not properly propagated data
QA [Gene]:
- OnlinePlots switched back to newer EVP server
- Switching steps still undocumented
- Pre-combining of FastOffline files still needs to be tested and configured to be the default
Run 11 preparation meeting #6
1-189, EVO, at 19:00 (GMT), duration : 01:00
Minutes
Attendees: Kefeng, Wayne B., Leve H., Gene V.B., Dmitry A.
ShiftLog [Leve]:
- Manual changes have been reviewed (by Jerome) and committed
- Re-deployed
Databases [Dmitry]:
- cdev enabled (by C-AD), so parameter propagation has begun
- Some data is still empty (e.g. beam species) and may not be filled for several weeks (when beams actually start)
- Might be possible to enter some default values for now
- Numbers are meaningless, so not useful for FastOffline testing
- Testing new hardware nodes now (2 are available)
- Configuration of new nodes in discussion
Online Infrastructure [Wayne]:
- STAR login environment needs to be able to handle SL5.5 (which was installed on some nodes)
- Will follow up with Jerome
- Online linux pool is losing roughly 1 node a day (rebooting)
- Hit a few with network scans, but response was OK, so no clues
- Spin request (for Pibero):
- User has access to the nodes and is waiting for directory structures and mount points
- Expect to complete within a few days
- Condor installation for rterm-like access to online pool from gateway
- Installed on the gateway
- Components for the pool nodes are still to be done
- Slowness on evp
- Only occurred yesterday (Dec. 16)
- Nothing obvious, but suspicious of AFS, and coincided with mock data transfer challenge from counting house to HPSS
- Paths forward discussed:
- Remove outside AFS dependence using an online repository
- Remove AFS dependence using local codes on evp
- Write a new tool to better identify AFS issues (i.e. more proactive than 'fs checkservers')
- Problem doesn't seem to persist, not a priority for now (no action)
- Yury G. requested networking support for the east FPD/FMS rack
- Just extends the starp network geographically
- Needs to be a fiber connection for proper grounding
- Webserver replacement
- Two redhat 5 machines given the ITD thumbs up
- Shared filespaces in progress
- No services started yet
- Request for an account to test tomcat (Leve)
QA [Gene]:
- Demo of the Offline QA with reference histogram comparison
- Missing some features for flexibly defining reference histogram sets (will work on over the holidays)
- Minor suggestion made for improving the "waiting..." display
Run 11 preparation meeting #7
1-189, EVO, at 19:00 (GMT), duration : 01:00
Minutes
Attendees: Leve H., Dmitry A., Jerome L., Wayne B., Gene V.B.
Databases [Dmitry]:
- All DB collectors running now
- Only TPC voltages absent due to no data yet
- Slow Controls Archiver not yet running to avoid collecting value-less data
- With Run start imminent, this needs to get going
- Dmitry will follow up with Yury G.
- Potential to benefit from an unused EPICS feature to read multiple channels in one request (we have been reading one channel at a time across the board), which could cut down the overhead in obtaining data
- Unknown why this wasn't previously used, so testing with caution to learn (perhaps the multiple channel data comes in a burst which could fail for some reason)
- New shift sign-up release is imminent (possibly today)
- Only minor features getting final adjustments (features have been [node:19902 "previously presented"])
- RunLog working fine: new runs appearing, but all being marked bad
- Jerome notes that bad runs won't come through FastOffline, which has been turned on already
- New DB nodes
- Tested and ready for use
- New configuration (relevant for FastOffline use) not yet in place (Jerome and Dmitry will work this out)
- Online DB plots working and logs show usage
- Only unavailable quantities are those collected from the Slow Controls Archiver
ShiftLog [Leve]:
- New manual is printed out and in place at the counting house
- ShiftLeader desktop computer has all the necessary icons, correctly linked and tested, including making a ShiftLog entry
- New web server not critical (stable operation on old web server)
Online infrastructure [Wayne & Jerome]:
- OS upgrade
- One critical: a replacement node for usage in monitoring chilled water (old machine is Windows 2000) is awaiting ITD processing.
- Follow-up with ITD on Monday if nothing transpires by then
- Other machines in the queue ("bond", "beatrice", "l3disp")
- Online web server replacement
- Currently replicating old machine on new one: file copying and version checking
- ShiftLog request to continue with same Tomcat version (a known quantity) instead of upgrading in hopes of improved stability
- Spin resource request involves storage to be delivered on new server
- Uncertainty in status of UPS services for computers at the experiment
- Testing today before the Run starts
- Network request for east FMS/FPD racks completed
- Online linux pool has been stable for the past ~3 weeks (after a couple weeks of apparently random reboots)
FastOffline [Jerome]:
- Ready, running, and waiting for data
QA [Gene]:
- OnlinePlots still running on older evp machine (but stable)
- Will look into moving it back over to the new node
- Need Jeff L. to flip some switches (and document it)
- Offline QA
- Finished implementing tools to update just specific histograms in reference set
- QA Shift set of histograms will be flushed next week and started anew
- Histograms only go into the set if given a description, and only stay in the set if a reference is set
We are expecting some collider operations imminently, so data will likely flow through the entire system over the next week.
TPC field cage short dates
I present this in the context of trying to understand what may play a role in a timeline study of the h-/h+ issue, which is believed to have begun sometime between 2004 and 2008 (between Run 4 and
Software and Computing Phone Meeting
1-189, EVO, at 17:00 (GMT), duration : 01:00
Time | Talk | Presenter |
---|---|---|
12:00 | FMS simulation, open request and readiness ( 00:20 ) 2 files | Pibero Djawotho (TAMU) |
12:20 | Production status ( 00:15 ) 0 files | Lidia Didenko |
12:35 | AOB ( 00:20 ) 0 files | All (All) |
h+/h- in Run 10 and beyond
EVO, at 20:00 (GMT), duration : 01:00
Issue 2043
In discussion of RT ticket 2043 and applying a patch to the mapping at the MuDst level instead of re-producing the data...
Software and Computing phone meeting
1-189, EVO, at 17:00 (GMT), duration : 01:00
Time | Talk | Presenter |
---|---|---|
12:00 | Multi-site Data transfer project - status ( 00:20 ) 1 file | Michal Zerola (NPI / ASCR) |
12:20 | Year 2011 geometry status and production intents for Y11 ( 00:15 ) 0 files | Jason Webb (BNL) |
12:35 | Production status and stats ( 00:15 ) 0 files | Jerome Lauret (BNL) |
12:50 | AOB ( 00:10 ) 0 files | All (All) |
Global tracks with negative flags
Many global tracks (approximately half) are given a negative flag. As I had not really understood this before, I made some brief investigations...
Run 11 preparation meeting #5
1-189, at 19:00 (GMT), duration : 01:00
Attendees: Leve H., Dmitry A., Jérôme L., Wayne B., Gene V.B., Jeff L.
Minutes:
Databases [Dmitry]:
- Conditions collectors started (all except beam information related ones)
- Everything running appears to be OK and status can be checked from web
- Beam info needs cdev, provided by C-AD once their operations get going for the Run (also includes magnetic field setting; expect data around the time we turn on the STAR magnet)
- Absence of this data means RunLog does not finalize runs and information is not migrated to Offline, but this is expected to be OK for now for anyone doing online-only tests (requested to post a note to rts-hn about this)
- No migration issues expected, but TPC anode voltage recording practices need to be ironed out in the context of the Run 11 data (discuss with Maxim)
- Online DBs flushed
- As expected, a few failed links and missing web pages were noted (by users too) and fixed
- Once cdev is running, will be ready for turn on / testing of FastOffline
ShiftLog [Leve]:
- As expected, excluded from and not affected by DB flush
- Manual updated for instructions on changing run status (may receive further review & changes from Jerome)
Online nodes [Wayne]:
- AFS has been stable on the newer evp machine since last Friday
- Jeff's been using the machine (and AFS) without incident
- Will likely switch OnlinePlots back to this machine (from evp2, the older machine) next week
- Documentation requested for switching between these machines
- Replacement of dean at the stage of OS installation on new hardware
- Timeline for replacement should be well before Run starts, as disk space on the new machine has been requested for online use
- Trigger commissioning is a high priority part of the request for online node resources (Wayne will contact Pibero directly)
- Nothing to report on OS upgrades
Online QA - jevp demo [Jeff]:
- Code
- Subsystems have an xxxBuilder class (e.g. tpcBuilder, eemcBuilder, etc.) which inherits from a jevp base class for plots
- Plot objects can have multiple histograms added to them, as well as other graphical objects like lines and circles (as long as they are of some "component" type?)
- Configuration
- Editor allows hierarchical arrangement of plots into tabs and sub-tabs, under larger Sets (e.g. Shift, 2009, ESMD)
- Features drag & drop for re-arrangement
- Tabs can have collective properties for all plots underneath (such as setting all to have the same maximum)
- Plots can have individual properties (such as log y scale)
- Running
- One server starts up several builders as separate processes
- Data is sent to the builders from a file or an event pool
- Presenter runs separately to display plots
- Similar basic look & feel to current OnlinePlots presenter
- Clicking on a plot brings up a reference side-by-side
- Drag & drop to add new reference, or select a different reference from panel of choices
- Suggested: tag references with run from which they're made, and timestamp (already includes a comment)
- Deleting references is simple, but not so simple as to be done accidentally
- Mild concerns about reference bloat and organization as things evolve at the experiment (including changes in collision species/energies)
- PDFs generated and uploaded to RunLog DB for each Set
- Include table of contents
- Concerns of space this will consume on the onldb server and redundancy of plots
- Run 10 PDFs used more than half the available disk space, and it is reasonable to expect a doubling (or more) of the required space for Run 11
- Wayne & Dmitry will assess the current storage on onldb [post-meeting statement that space may be fine; follow up next week]
- Requirements still need to be more accurately defined
- Will depend on what gets axed/kept from review by subsystems & trigger board
- Would be helped by real data, but that's a bit late
Next week: demo of Offline QA for AutoQA with reference histograms [Gene]
MySQL trigger for oversubscription protection
How-to enable total oversubscription check for Shift Signup (mysql trigger) :
delimiter |
CREATE TRIGGER stop_oversubscription_handler BEFORE INSERT ON Shifts
FOR EACH ROW BEGIN
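
The trigger body is cut off above and is left as is. As a rough illustration of the general idea only (a BEFORE INSERT trigger that refuses a sign-up once a slot is full), here is a minimal sketch using Python's built-in sqlite3 module. It uses SQLite rather than MySQL syntax, and the `Slots` table, the `capacity` column, the trigger name, and the helper functions are all hypothetical; the real Shift Signup schema and trigger logic are not shown in this listing.

```python
import sqlite3

# Hypothetical schema: the real Shift Signup tables are not shown in the source.
# SQLite syntax is used here for a self-contained demo; MySQL syntax differs.
SCHEMA = """
CREATE TABLE Slots  (slot_id INTEGER PRIMARY KEY, capacity INTEGER NOT NULL);
CREATE TABLE Shifts (person TEXT NOT NULL, slot_id INTEGER NOT NULL);

-- Reject an insert when the slot already holds its full capacity of sign-ups.
CREATE TRIGGER stop_oversubscription BEFORE INSERT ON Shifts
FOR EACH ROW
WHEN (SELECT COUNT(*) FROM Shifts WHERE slot_id = NEW.slot_id)
     >= (SELECT capacity FROM Slots WHERE slot_id = NEW.slot_id)
BEGIN
    SELECT RAISE(ABORT, 'slot is fully subscribed');
END;
"""


def make_db():
    """Create an in-memory database with the hypothetical schema and trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def sign_up(conn, person, slot_id):
    """Return True if the sign-up was stored, False if the trigger rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Shifts (person, slot_id) VALUES (?, ?)",
                (person, slot_id),
            )
        return True
    except sqlite3.DatabaseError:
        return False


if __name__ == "__main__":
    db = make_db()
    db.execute("INSERT INTO Slots (slot_id, capacity) VALUES (1, 2)")
    for name in ("first", "second", "third"):
        print(name, sign_up(db, name, 1))
```

Putting the check in the database rather than in the GUI means that two people submitting at nearly the same time are still held to the same limit, which is presumably the motivation for a trigger here.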
Software and Computing phone meeting
1-189, EVO, at 17:00 (GMT), duration : 01:00
Time | Talk | Presenter |
---|---|---|
12:00 | Summary of the h+/h- meeting ( 00:20 ) 0 files | Grant Webb (UKY) |
12:20 | Production status ( 00:15 ) 0 files | Lidia Didenko (BNL) |
12:35 | AOB ( 00:20 ) 0 files | All (All) |
Review of Past Issues and Current Understanding
Reference
Talk time : 15:20, Duration : 00:20
h+/h- in Run 10 and beyond
EVO, at 20:00 (GMT), duration : 01:00
Time | Talk | Presenter |
---|---|---|
15:00 | Description of Problem ( 00:20 ) 0 files | |
15:20 | Review of Past Issues and Current Understanding ( 00:20 ) 1 file | |
15:40 | Plan of Action Development ( 00:20 ) 0 files | |
16:00 | Task Assignments ( 00:20 ) 0 files | |
16:20 | AOB ( 00:20 ) 0 files | |
Run 11 preparation meeting #4
1-189, at 19:00 (GMT), duration : 01:00
Minutes:
Attendees: Dmitry A., Leve H., Matt A., Wayne B., Jeff L., Jérôme L., G. Van Buren
Databases [Dmitry]:
- Online backup now using the new NAS system
- Daily backup of all three ports with a retention time of 7 days
- Currently have 2+ TB of space, which is probably more than enough for even 14 days retention
- Potential problem with permissions due to NFS mount and different user IDs on different systems, but not a problem presently
- Email alerts of problems from NAS go to Wayne & Dmitry
- Flush of online DBs not yet done
- Reasoning is that STAR is still in testing, and this can add up to a significant amount of data
- ...but we're not sure which test data people will want to keep associated to Run 11
- Decision made to go forward with the flush and not continue waiting for testing to get further along
- NB: ShiftLog (and some other) DBs and tables are skipped in the flush; ShiftLog is already recording for Run 11
- Shift Signup GUI has been re-written
- Demo shown
- Some details of new features still need implementation (working with Jérôme)
- All old features are in place; could replace old codes at any time (pending bug checks)
- Deployment schedule not fixed by any deadlines
- Isolated nodes for FastOffline in Run 11
- Not in place, but Dmitry will write up a config file for this
ShiftLog [Leve]:
- Nothing until new web server
- No progress on new webserver
Online nodes [Wayne & Matt]:
- OS upgrades:
- FTPC and PMDsc done/replaced
- To be done: Bond, EMCsc (coordinate with users), STARUtilities (coordinate with C-AD), EMC01, Beatrice, L3display (Jeff notes the need for QT4 on this node for display programs)
- FUSE now working on all linux pool nodes
- gcc standardized on all linux pool nodes
- Recent rise in instability of the linux pool nodes: several have halted and/or needed rebooting in the past two weeks
- Previous solution of disabling USB controller not helpful for this (that solution is still in effect)
- No obvious environmental changes, but an environmental cause seems likely given the pattern
- These nodes are ~5.5 years old (hard to believe they would show age problems within a couple weeks of each other)
- Similar nodes are in use for offline DBs and not showing problems (located in BCF)
- Newer EVP machine experiencing AFS issues
- Access given to John McCarthy to help diagnose
- OnlinePlots will be switched to use old EVP machine for the time being (Gene & Jeff will arrange)
Software and Computing phone meeting
1-189, EVO, at 17:00 (GMT), duration : 01:00
Time | Talk | Presenter |
---|---|---|
12:00 | Review of ticket 2036 (embedding timestamp) ( 00:20 ) 0 files | All (All) |
12:20 | Data production issues and projections ( 00:20 ) 0 files | Lidia Didenko (BNL) |
12:40 | AOB ( 00:20 ) 0 files | All (All) |
Run 11 preparation meeting #3
1-189, at 19:00 (GMT), duration : 01:00
Attendees: Wayne B., Leve H., Gene V.B., Matt A.
Leve:
- ShiftLog exercised (successfully tried loading large 4+MB images)
- Awaiting new online web server
- No current action items regarding changing shift status after last week's meeting
Matt:
- Windows 2000 upgrades to XP; involves replacing some computers:
- ftpctemp, emcsc, pmdsc replacements exist, but need to be configured before the switch
- Timescale: mid-December
- OS upgrades on emc01, beatrice, l3display
- l3display has GUI issues (independent of the old large display screen issue brought up a couple weeks ago by Wayne); hope that the OS update to SL53 will resolve the GUI problems
- Timescale: next week (make sure systems are running again before the long weekend next week, perhaps time beatrice update with any plans by EMC people to be away)
- It would be of interest to know whether the l3display can support the STAR environment (e.g. AGS, gcc, etc.)
Wayne:
- Online linux pool gcc issue: an artifact of a known problem not corrected in Matt's installation script.
- A few nodes already fixed, one node known to need fixing, and a few more need to be checked (low priority)
- FUSE also stopped working on a few nodes during some of these re-installs (due to some package dependencies)
- Ganglia metric to list number of users logged in now turned on for STAR gateways; will also add metric to the online linux pool machines
Gene:
- OnlinePlots running stably on newer EVP server
- FMS group expressed interest in changing some OnlinePlots; they were directed to the codes
- Contacts for components of online QA [node:19794 "posted"]
AOB:
- Next meeting in two weeks (holiday next week)