onldb.starp.bnl.gov:3501 - contains 'RunLog', 'Shift Sign-up' and 'Online' databases
onldb.starp.bnl.gov:3502 - contains 'Conditions_<subsysname>' databases (online daemons)
onldb.starp.bnl.gov:3503 - contains 'RunLog_daq' database (RTS system)
db01.star.bnl.gov:3316/trigger - special buffer database used for FileCatalog migration
onldb2.starp.bnl.gov:3501 (slave of onldb.starp.bnl.gov:3501)
onldb2.starp.bnl.gov:3502 (slave of onldb.starp.bnl.gov:3502)
onldb2.starp.bnl.gov:3503 (slave of onldb.starp.bnl.gov:3503)
This page contains the basic steps only. Please see the subpages for details on Run preparations!
1. DB: make sure that the databases at ports 3501, 3502 and 3503 are running. It is also useful to check that replication is on and running at onldb2.starp, onl10.starp and onl11.starp.
2. COLLECTORS AND RUNLOG DAEMON: onldb.starp contains the "old" versions of the metadata collectors and the RunLogDB daemon. Collector daemons need to be recompiled and started before the "migration" step. Verify with the "caGet <subsystem>.list" command that all EPICS variables are being transmitted and received without problems. Make sure no channels produce "cannot be connected", "timeout" or "cannot contact IOC" warnings. If they do, please contact the Slow Controls expert *before* enabling such a service. Also, please keep in mind that the RunLogDB daemon will process runs only if all collectors are started and collect meaningful data.
3. FASTOFFLINE: To allow FastOffline processing, please enable the cron record which runs the migrateDaqFileTags.new.pl script. Inspect that script and make sure that the $minRun variable points to a recently taken run; otherwise the script will consume extra resources on the online db.
4. MONITORING: As soon as the collector daemons are started, the database monitoring scripts should be enabled. Please see the crontabs under the 'stardb' and 'staronl' accounts for details. It is recommended to verify that the nfs-exported directory on dean is write-accessible.
Typical crontab for 'stardb' account would be like:
*/3 * * * * /home/stardb/check_senders.sh > /dev/null
*/3 * * * * /home/stardb/check_cdev_beam.sh > /dev/null
*/5 * * * * /home/stardb/check_rich_scaler_log.sh > /dev/null
*/5 * * * * /home/stardb/check_daemon_logs.sh > /dev/null
*/15 * * * * /home/stardb/check_missing_sc_data.sh > /dev/null
*/2 * * * * /home/stardb/check_stale_caget.sh > /dev/null
(don't forget to set email address to your own!)
Typical crontab for 'staronl' account would look like:
*/10 * * * * /home/staronl/check_update_db.sh > /dev/null
*/10 * * * * /home/staronl/check_qa_migration.sh > /dev/null
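The replication check from step 1 above can be scripted. The following is a minimal sketch, not one of the existing monitoring scripts: the function name replication_ok is hypothetical, and it assumes a MySQL version whose "SHOW SLAVE STATUS\G" output contains the Slave_IO_Running and Slave_SQL_Running fields. It only parses text from stdin, so it makes no assumptions about credentials.

```sh
# Hypothetical helper: reads the output of "SHOW SLAVE STATUS\G" on stdin
# and succeeds only if both replication threads report "Yes".
replication_ok() {
    awk -F': *' '
        /^ *Slave_IO_Running:/  { gsub(/[ \t\r]+$/, "", $2); io  = $2 }
        /^ *Slave_SQL_Running:/ { gsub(/[ \t\r]+$/, "", $2); sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}
```

Assuming the account's mysql client configuration already supplies the login, it could be used for each slave port as: mysql -h onldb2.starp.bnl.gov -P 3501 -e 'SHOW SLAVE STATUS\G' | replication_ok || echo "replication problem on 3501"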
1. MQ: make sure that qpid service is running. This service processes MQ requests for "new" collectors and various signals (like "physics on").
2. DB: make sure that mysql database server at port 3606 is running. This database stores data for mq-based collectors ("new").
3. SERVICE DAEMONS: make sure that mq2memcached (generic service), mq2memcached-rt (signals processing) and mq2db (storage) services are running.
4. COLLECTORS: grab configuration files from cvs, and start cdev2mq and ds2mq collectors. Same common sense rule applies: please check that CDEV and EPICS do serve data on those channels first. Also, collectors may be started at onl10.starp.bnl.gov if onl11.starp is busy with something (unexpected IO stress tests, user analysis jobs, L0 monitoring scripts, etc).
1. MIGRATION: check the crontab for the 'stardb' user. Make sure that the "old" and "new" collector daemons are really running before moving further. Verify that the migration macros experience no problems by trying a simple migration script. If it breaks, saying that a library is not found or something similar, find the latest stable (old) version of the STAR lib and set it in the .cshrc config file. If the tests succeed, enable the cron jobs for all macros, and verify that the logs contain meaningful output (no errors, warnings etc).
1. PLOTS: Check the dbPlots configuration; re-create it as a copy with an incremented Run number if necessary. Subsystem experts tend to check those plots often, so it is better to have dbPlots and the mq collectors up and running a little earlier than the rest of the services.
2. MONITORING:
3. RUNLOG - now RunLog browser should display recent runs.
1. TRIGGER COUNTS: check the crontab for root; it should have the following records:
40 5 * * * /root/online_db/cron/fillDaqFileTag.sh
0,10,15,20,25,30,35,40,45,50,55 * * * * /root/online_db/sum_insTrgCnt >> /root/online_db/trgCnt.log
The first script copies the daqFileTag table from the online db to the local 'trigger' database. The second script calculates trigger counts for FileCatalog (Lidia). Please make sure that both migration and trigger counting work before you enable them in the crontab. There is no monitoring to enable for this service.
1. ONLINE BACKUPS: make sure that mysql-zrm is taking backups from onl10.starp.bnl.gov for all three ports. It should take raw backups daily and weekly, and logical backups once per month or so. It is generally recommended to periodically store weekly / monthly backups to HPSS, for long-term archival using /star/data07/dbbackup directory as temporary buffer space.
1. DB MERGE: Online databases from onldb.starp (all three ports) and onl11.starp (port 3606) should be merged into one. Make sure you keep mysql privilege tables from onldb.starp:3501. Do not overwrite it with 3502 or 3503 data. Add privileges allowing read-only access to mq_collector_<bla> tables from onl11.starp:3606 db.
2. DB ARCHIVE PART ONE: copy the merged database to dbbak.starp.bnl.gov, and start it with an incremented port number. Compress it with myisampack, if needed. Don't forget to add the 'read-only' option to the mysql config. It is generally recommended to put an extra copy on the NAS archive, for a fast restore if the primary drive crashes.
3. DB ARCHIVE PART TWO: archive merged database, and split resulting .tgz file into chunks of ~4-5 GB each. Ship those chunks to HPSS for long-term archival using /star/data07/dbbackup as temporary(!) buffer storage space.
4. STOP MIGRATION macros at onl13.starp.bnl.gov - there is no need to run that during summer shutdown period.
5. STOP trigger count calculations at db03.star.bnl.gov for the reason above.
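The archive-and-split step (DB ARCHIVE PART TWO above) can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the function name archive_and_split and the ".part-" suffix are hypothetical, and the transfer to HPSS is left out entirely.

```sh
# Hypothetical helper: archive_and_split SRC_DIR OUT_PREFIX CHUNK_SIZE
# Creates OUT_PREFIX.tgz from SRC_DIR, splits it into OUT_PREFIX.tgz.part-aa,
# .part-ab, ... of CHUNK_SIZE bytes each, then removes the unsplit archive.
# The body runs in a subshell so its variables do not leak into the caller.
archive_and_split() (
    src_dir=$1
    out_prefix=$2
    chunk_size=$3   # e.g. 4G with GNU split; plain byte counts work everywhere
    tar -czf "${out_prefix}.tgz" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" &&
    split -b "$chunk_size" "${out_prefix}.tgz" "${out_prefix}.tgz.part-" &&
    rm -f "${out_prefix}.tgz"
)
```

The chunks can be reassembled later with: cat OUT_PREFIX.tgz.part-* | tar -xzf - (the shell glob returns the suffixes in the order split created them). Note that the buffer area needs room for the archive plus its chunks while the function runs.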
As of Feb 18th 2011, previously existing content of this page is removed.
If you need to know how to access RHIC or STAR data available through CDEV interface, please read official CDEV documentation here : http://www.cadops.bnl.gov/Controls/doc/usingCdev/remoteAccessCdevData.html
Documentation for CDEV access codes used in Online Data Collector system will be available soon in appropriate section of STAR database documentation.
-D.A.
TBC
1. New RunLog browser:
2. Retire Previous RunLog browser:
3. Update /admin/navigator.php immediately after /RunLog/ rotation! New run range is required.
Migration macros reside on stardb@onllinux6.starp.bnl.gov .
$> cd dbcron/macros-new/
(you should see no StRoot/StDbLib here, please don't check it out from CVS either - we will use precompiled libs)
First, one should check that Load Balancer config env. variable is NOT set :
$> printenv|grep DB
DB_SERVER_LOCAL_CONFIG=
(if it says =/afs/... .xml, then it should be reset to "" in .cshrc and .login scripts)
Second, let's check that we use stable libraries (newest) :
$> printenv | grep STAR
...
STAR_LEVEL=SL08e
STAR_VERSION=SL08e
...
(SL08e is valid for 2009, NOTE: no DEV here, we don't want to be affected by changed or broken DEV libraries)
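The two environment checks above can be combined into one pre-flight test. This is only a sketch: the function name migration_env_ok is hypothetical, it is written in sh rather than the account's tcsh (exported environment variables are visible to both), and treating any STAR_LEVEL containing "dev" as unsafe is an assumption based on the note above.

```sh
# Hypothetical helper: succeeds when the Load Balancer config variable is
# unset or empty AND STAR_LEVEL is set to something other than a DEV library.
migration_env_ok() {
    if [ -n "${DB_SERVER_LOCAL_CONFIG:-}" ]; then
        echo "DB_SERVER_LOCAL_CONFIG is set: $DB_SERVER_LOCAL_CONFIG" >&2
        return 1
    fi
    case "${STAR_LEVEL:-}" in
        "")
            echo "STAR_LEVEL is not set" >&2
            return 1 ;;
        *[Dd][Ee][Vv]*)
            echo "STAR_LEVEL points to a DEV library: $STAR_LEVEL" >&2
            return 1 ;;
    esac
    return 0
}
```

A wrapper around the migration macros could call migration_env_ok first and refuse to start root4star if it fails.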
OK, initial settings look good, let's try to load Fill_Magnet.C macro (easiest to see if it's working or not) :
$> root4star -b -q Fill_Magnet.C
You should see some harsh words from Load Balancer, that's exactly what we need - LB should be disabled for our macros to work. Also, there should not be any segmentation violations. Initial macro run will take some time to process all runs known to date (see RunLog browser for run numbers).
Let's check if we see the entries in database:
$> mysql -h robinson.star.bnl.gov -e "use RunLog_onl; select count(*) from starMagOnl where entryTime > '2009-01-01 00:00:00' " ;
(entryTime should be set to current date)
+----------+
| count(*) |
+----------+
| 1589 |
+----------+
Now, check the run numbers and magnet current with :
$> mysql -h robinson.star.bnl.gov -e "use RunLog_onl; select * from starMagOnl where entryTime > '2009-01-01 00:00:00' order by entryTime desc limit 5" ;
+--------+---------------------+--------+-----------+---------------------+--------+---------+----------+----------+-----------+------------+------------------+
| dataID | entryTime | nodeID | elementID | beginTime | flavor | numRows | schemaID | deactive | runNumber | time | current |
+--------+---------------------+--------+-----------+---------------------+--------+---------+----------+----------+-----------+------------+------------------+
| 66868 | 2009-02-16 10:08:13 | 10 | 0 | 2009-02-15 20:08:00 | ofl | 1 | 1 | 0 | 10046008 | 1234743486 | -4511.1000980000 |
| 66867 | 2009-02-16 10:08:13 | 10 | 0 | 2009-02-15 20:06:26 | ofl | 1 | 1 | 0 | 10046007 | 1234743486 | -4511.1000980000 |
| 66866 | 2009-02-16 10:08:12 | 10 | 0 | 2009-02-15 20:02:42 | ofl | 1 | 1 | 0 | 10046006 | 1234743486 | -4511.1000980000 |
| 66865 | 2009-02-16 10:08:12 | 10 | 0 | 2009-02-15 20:01:39 | ofl | 1 | 1 | 0 | 10046005 | 1234743486 | -4511.1000980000 |
| 66864 | 2009-02-16 10:08:12 | 10 | 0 | 2009-02-15 19:58:20 | ofl | 1 | 1 | 0 | 10046004 | 1234743486 | -4511.1000980000 |
+--------+---------------------+--------+-----------+---------------------+--------+---------+----------+----------+-----------+------------+------------------+
If you see that, you are OK to start cron jobs (see "crontab -l") !
Online db enclave includes :
primary databases: onldb.starp.bnl.gov, ports : 3501|3502|3503
repl.slaves/hot backup: onldb2.starp.bnl.gov, ports : 3501|3502|3503
read-only online slaves: mq01.starp.bnl.gov, mq02.starp.bnl.gov, ports : 3501|3502|3503
trigger database: db01.star.bnl.gov, port 3316, database: trigger
Monitoring:
http://online.star.bnl.gov/Mon/
(scroll down to see online databases. db01 is monitored, it is in offline slave group)
Tasks:
1. Slow Control data collector daemons
$> ssh stardb@onldb.starp.bnl.gov;
$> cd /online/production/database/Run_11/dbSenders;
./bin/ - contains scripts for start/stop daemons
./online/Conditions/ - contains source code for daemons (e.g. ./online/Conditions/run is RunLogDb)
See crontab for monitoring scripts (protected by lockfiles)
Monitoring page :
http://online.star.bnl.gov/admin/daemons/
2. Online to Online migration
3. RunLog fix script
$> ssh root@db01.star.bnl.gov;
$> cd online_db;
sum_insTrgCnt is the binary to perform various activities per recorded run, and it is run as cron script (see crontab -l).
4. Trigger data migration
$> ssh root@db01.star.bnl.gov
/root/online_db/cron/fillDaqFileTag.sh <- cron script to perform copy from online trigger database to db01
BACKUP FOR TRIGGER CODE:
1. alpha.star.bnl.gov:/root/backups/db01.star.bnl.gov/root
2. bogart.star.bnl.gov:/root/backups/db01.star.bnl.gov/root
5. Online to Offline migration
$> ssh stardb@onl13.starp.bnl.gov;
$> cd dbcron/macros-new; ls;
Fill*.C macros are the online->offline migration macros. There is no need for a local or modified copy of the DB API; all macros use the regular STAR libraries (see tcsh init scripts for details).
Macros are cron jobs. See cron for details (crontab -l). Macros are lockfile-protected to avoid overlap/pileup of cron jobs.
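For readers unfamiliar with lockfile protection, here is a minimal sketch of the idea. It is not the code used by the actual cron jobs: the function name run_locked and the exit code 75 are hypothetical choices, and a directory created with mkdir (an atomic operation) stands in for the lockfile.

```sh
# Hypothetical helper: run_locked LOCKDIR COMMAND [ARGS...]
# Runs COMMAND only if LOCKDIR can be created; a second invocation that
# starts while the first is still running finds the directory and skips.
# The body runs in a subshell, so "exit" leaves only the function.
run_locked() (
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "lock $lockdir is held, skipping: $*" >&2
        exit 75
    fi
    status=0
    "$@" || status=$?
    rmdir "$lockdir"
    exit "$status"
)
```

A wrapper script called from cron could then run, for example, run_locked /tmp/Fill_Magnet.lock root4star -b -q Fill_Magnet.C (the lock path is illustrative). One limitation of this sketch: if the job is killed before rmdir runs, the stale lock has to be removed by hand.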
Monitoring :
http://online.star.bnl.gov/admin/status/
New Online API proposal: Message-Queue-based data exchange for STAR Online domain;
Primary idea is to replace current DB-centric STAR Online system with industrial-strength Message Queueing service. Online databases will, then, take a proper data storage role, leaving information exchange to MQ server. STAR, as an experiment in-progress, is still growing every year, so standard information exchange protocol is required for all involved parties to enable efficient cross-communications.
It is proposed to leave EPICS system as it is now for Slow Controls part of Online domain, and allow easy data export from EPICS to MQ via specialized epics2mq services. Further, data will be stored to MySQL (or some other storage engine) via mq2db service(s). Clients could retrieve archived detector conditions either via direct MySQL access as it is now, or through properly formatted request to db2mq service.
[introduction-talk] [implementation-talk]
To see what service is connected to our MQ server, one should use qpid-stat. Example:
$> qpid-stat -c -S cproc -I localhost:5672
Connections
  client-addr           cproc            cpid   auth        connected       idle           msgIn  msgOut
  ========================================================================================================
  127.0.0.1:54484       db2mq-service    9729   anonymous   2d 1h 44m 1s    2d 1h 39m 52s  29     0
  127.0.0.1:56594       epics2mq-servic  31245  anonymous   5d 22h 39m 51s  4m 30s         5.15k  0
  127.0.0.1:58283       epics2mq-servic  30965  anonymous   5d 22h 45m 50s  30s            5.16k  0
  127.0.0.1:58281       epics2mq-servic  30813  anonymous   5d 22h 49m 18s  4m 0s          5.16k  0
  127.0.0.1:55579       epics2mq-servic  28919  anonymous   5d 23h 56m 25s  1m 10s         5.20k  0
  130.199.60.101:34822  epics2mq-servic  19668  anonymous   2d 1h 34m 36s   10s            17.9k  0
  127.0.0.1:43400       mq2db-service    28586  anonymous   6d 0h 2m 38s    10s            25.7k  0
  127.0.0.1:38496       qpid-stat        28995  guest@QPID  0s              0s             108    0
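To get a quick per-service summary from such a listing, the output can be piped through a small filter. This is a sketch only: the function name count_mq_clients is hypothetical, and it assumes the layout shown above, where the first column is an IPv4 address with a port and the second column is the client process name.

```sh
# Hypothetical helper: reads "qpid-stat -c" output on stdin and prints the
# number of connections per client process name, one "COUNT NAME" per line.
# Header and separator lines are skipped because their first field is not
# an IPv4 address followed by a port.
count_mq_clients() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+$/ { n[$2]++ }
         END { for (p in n) print n[p], p }' | sort -k2
}
```

For example: qpid-stat -c -S cproc -I localhost:5672 | count_mq_clients. A missing mq2db-service or db2mq-service line in the summary would be a hint that the corresponding daemon is not connected.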
MQ routing allows selected messages to be forwarded to remote MQ servers.
$> qpid-route -v route add onl10.starp.bnl.gov:5672 onl11.starp.bnl.gov:5672 amq.topic gov.bnl.star.#
$> qpid-route -v route add onl10.starp.bnl.gov:5672 onl11.starp.bnl.gov:5672 amq.direct gov.bnl.star.#
$> qpid-route -v route add onl10.starp.bnl.gov:5672 onl11.starp.bnl.gov:5672 qpid.management console.event.#
/etc/init.d/orbited needs to be corrected, because the --daemon option does not work on RHEL5 (orbited does not listen on the desired port). Here is what is needed:
Edit /etc/init.d/orbited and :
1. add
ORBITED="nohup /usr/bin/orbited > /dev/null 2>&1 &"
to the very beginning of the script, just below "lockfile=<bla>" line
2. modify "start" subroutine to use $ORBITED variable instead of --daemon switch. It should look like this :
daemon --check $prog $ORBITED
Enjoy your *working* "/sbin/service orbited start" command! Functionality can be verified with lsof -i :[your desired port] (e.g. ":9000") - it should display "orbited".
How-to enable total oversubscription check for Shift Signup (mysql trigger) :
delimiter |
CREATE TRIGGER stop_oversubscription_handler BEFORE INSERT ON Shifts
FOR EACH ROW BEGIN
    SET @insert_failed := "";
    -- shifts required from this institution vs. shifts it already holds
    SET @shifts_required := (SELECT shifts_required FROM ShiftAdmin WHERE institution_id = NEW.institution_id);
    SET @shifts_exist := (SELECT COUNT(*) FROM Shifts WHERE institution_id = NEW.institution_id);
    -- reject the new row once the institution would reach 115% of its quota
    IF ( (@shifts_exist+1) >= (@shifts_required * 1.15)) THEN
        SET @insert_failed := "oversubscription protection error";
        -- null the row's fields; this presumably makes the insert fail on NOT NULL columns
        SET NEW.beginTime := null;
        SET NEW.endTime := null;
        SET NEW.week := null;
        SET NEW.shiftNumber := null;
        SET NEW.shiftTypeID := null;
        SET NEW.duplicate := null;
    END IF;
END;
|
delimiter ;
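To see where the 115% threshold of this trigger actually cuts off, the condition can be reproduced outside the database. The sketch below is an illustration only: the function name would_oversubscribe is hypothetical, and the comparison is rewritten in integer arithmetic ((exist + 1) * 100 >= required * 115), which is equivalent to the trigger's (exist + 1) >= required * 1.15 for whole-number inputs.

```sh
# Hypothetical helper: would_oversubscribe SHIFTS_EXIST SHIFTS_REQUIRED
# Succeeds (exit 0) when the trigger condition above would reject one more shift.
would_oversubscribe() {
    [ $(( ($1 + 1) * 100 )) -ge $(( $2 * 115 )) ]
}
```

For an institution with shifts_required = 20 the threshold is 23: with 21 existing shifts the next insert passes (22 < 23), while with 22 existing shifts it is rejected (23 >= 23). Because the comparison is ">=", the institution tops out just below 115% of its quota.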
Run/Year | NODE | Port |
Run 1 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3400 |
Run 2 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3401 |
Run 3 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3402 |
Run 4 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3403 |
Run 5 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3404 |
Run 6 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3405 |
Run 7 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3406 |
Run 8 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3407 |
Run 9 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3408 |
Run 10 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3409 |
Run 11 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3410 |
Run 12 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3411 |
Run 13 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3412 |
Run 14 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3413 |
Run 15 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3414 |
Run 16 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3415 |
Run 17 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3416 |
Run 18 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3417 |
Run 19 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3418 |
Run 20 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3419 |
Run 21 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3420 |
Run 22 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3421 |
Run 23 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3422 |
Run 24 | dbbak.starp.bnl.gov / db04.star.bnl.gov | 3423 |
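The archive ports in the table above follow a simple pattern: port = 3399 + run number. A small sketch, assuming the pattern continues for future runs (the function name archive_port is hypothetical):

```sh
# Hypothetical helper: prints the archive port for a given run number,
# following the pattern of the table above (Run 1 -> 3400, Run 24 -> 3423).
archive_port() {
    echo $(( 3399 + $1 ))
}
```

For example, mysql -h dbbak.starp.bnl.gov -P "$(archive_port 11)" would connect to the Run 11 archive on port 3410.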
DATABASE | NODE | Port |
[MASTER] Run Log, Conditions_rts, Shift Signup, Shift Log | onldb.starp.bnl.gov | 3501 |
[MASTER] Conditions | onldb.starp.bnl.gov | 3502 |
[MASTER] Daq Tag Tables | onldb.starp.bnl.gov | 3503 |
[SLAVE] Run Log, Conditions_rts, Shift Signup, Shift Log | onldb2.starp.bnl.gov onldb3.starp.bnl.gov onldb4.starp.bnl.gov mq01.starp.bnl.gov mq02.starp.bnl.gov | 3501 |
[SLAVE] Conditions | onldb2.starp.bnl.gov onldb3.starp.bnl.gov onldb4.starp.bnl.gov mq01.starp.bnl.gov mq02.starp.bnl.gov | 3502 |
[SLAVE] Daq Tag Tables | onldb2.starp.bnl.gov onldb3.starp.bnl.gov onldb4.starp.bnl.gov mq01.starp.bnl.gov mq02.starp.bnl.gov | 3503 |
[MASTER] MQ Conditions DB | mq01.starp.bnl.gov | 3606 |
[SLAVE] MQ Conditions DB | mq02.starp.bnl.gov onldb2.starp.bnl.gov onldb3.starp.bnl.gov onldb4.starp.bnl.gov | 3606 |
RTS Database (MongoDB cluster) | mongodev01.starp.bnl.gov mongodev02.starp.bnl.gov mongodev03.starp.bnl.gov | 27017 |
1. Freeze databases (especially ShiftSignup on 3501) by creating a new db instance on the next sequential port from the archival series and copy all dbs from all three ports to this port.
2. Move the previous run to dbbak.starp
3. Send email announcing the creation of this port, so web pages can be changed.
4. Tar/zip the directories and ship them to HPSS.
1. Clear out the dbs on the "current" ports. NOTE: do not clear out overhead tables (e.g., Nodes, NodeRelations, blahIDs etc.). There is a script on onldb /online/production/databases/createTables which does this. Make sure you read the README.
UPDATE: a more advanced script to flush the db ports is attached to this page (chdb.sh). The existing "createPort350X.sql" files do not care about overhead tables!
UPDATE.v2: RunLog.runTypes & RunLog.detectorTypes should not be cleared either // D.A.
UPDATE.V3: RunLog.destinationTypes should not be cleared // D.A.
Confirm all firewall issues are resolved from both the IP TABLES on the local host and from an institutional networking perspective. This should only really need to be addressed with a new node, but it is good to confirm prior to advertising availability.
Update the above tables.
2. Verify that the Online/Offline Detector ID list matches the RunLog list (table in RunLog db) :
http://www.star.bnl.gov/cgi-bin/protected/cvsweb.cgi/StRoot/RTS/include/rtsSystems.h
ONLINE TO OFFLINE MIGRATION
online.star.bnl.gov/admin/status.php
If all entries are red - chances are we are not running - or there is a
gap between runs, e.g., a beam dump.
In this case please check the last run and/or time with the RunLog or
Shift Log to confirm that the last run was migrated (This was most
likely the case last night).
If one entry is red - please be sure that the latest value is _recent_,
as some dbs are filled by hand once a year.
So we have a problem if ....
One entry is red, the other values are green, and the last value of the
red entry was recent.
All values are red and you know we have been taking data for more than
.75 hours. This is a rough estimate of time, but keep in mind that migration of
each db happens at different time intervals, so entries won't turn red
all at once, nor will they turn green all at once. In fact, RICH scalers
only get moved once an hour, so it will not be uncommon to see this red
for a while after we just start taking data.
code is on stardb@onl13.starp.bnl.gov (formerly was on onllinux6.starp):
~/stardb/dbcron/macros
Below is the output of the crontab. To start the processes, uncomment the cron entries in the stardb crontab.
Times of transfers are as follows:
1,41 * * * * TpcGas
0,15,30,45 * * * * Clock
5,20,35,50 * * * * RDO
10,40 * * * * FTPCGAS
15 * * * * FTPCGASOUT
0,30 * * * * Trigger
25 * * * * TriggerPS
10,40 * * * * BeamInfo
15 * * * * Magnet
3,23,43 * * * * MagFactor
45 * * * * RichScalers
8,24,44 * * * * L0Trigger
6,18,32,48 * * * * FTPCVOLTAGE
10,35,50 * * * * FTPCTemps
Check the log file in ../log to make sure the crons are moving data.
Backups of migration scripts and crontab are located here :
1. alpha.star.bnl.gov:/root/backups/onl13.starp.bnl.gov/dbuser
2. bogart.star.bnl.gov:/root/backups/onl13.starp.bnl.gov/dbuser
Here is my summary of caget's performance studies done yesterday+today :

1. Right now, "normal" (sequential mode) caget from CaTools package takes 0.25 sec to fetch 400 channels, and, according to callgrind, it could be made even faster if I optimize various printf calls (40% speedup possible, see callgrind tree dump) :
http://www.star.bnl.gov/~dmitry/tmp/caget_sequential.png
[Valgrind memcheck reports 910kb RAM used, no memory leaks]

2. At the same time, "bulk" (parallel mode) caget from EzcaScan package takes 13 seconds to fetch same 400 channels. Here is a callgrind tree again:
http://www.star.bnl.gov/~dmitry/tmp/caget_parallel.png
[Valgrind memcheck reports 970kb RAM used, no memory leaks]

For "parallel" caget, most of the time is spent on Ezca_getTypeCount, and Ezca_pvlist_search. I tried all possible command-line options available for this caget, with same result. This makes me believe that caget from EzcaScan package is even less optimized in terms of performance. It could be better optimized in terms of network usage, though (otherwise those guys won't even mention "improvement over regular caget" in their docs).

Another thing is that current sequential caget is *possibly* using same "bulk" mode internally (that "ca_array_get" function is seen for both cagets).

Oh, if this matters, for this test I used EPICS base 3.14.8 + latest version of EzcaScan package recompiled with no/max optimizations in gcc.