USP
Updated on Fri, 2005-11-25 22:16. Originally created by suaide on 2005-10-10 17:22.
This is a copy of the web page that contains a log of the Sao Paulo grid activities. For the full documentation, please go to
http://stars.if.usp.br:8080/~suaide/grid/
Installation
In order to be fully integrated into the STAR GRID you need to have the following items installed and running (the items are listed in the same order I installed them in the cluster). There is other software to install before full integration, but this is the current status of the integration.
Installing the batch system (SGE)
We decided to install SGE because it is the same system used at PDSF (so it is scheduler compatible) and it is free. The SGE web site is here. You can download the latest version from their website.
Instructions to install SGE
- Download from the SGE web site
- gunzip and untar the file
- cd to the directory
- In the batch system server (in our case, STAR1)
- Create the SGE_ROOT directory. In our case, mkdir /home/sge-root. This directory HAS to be available in all the exec nodes
- copy the entire content of the installation directory to the SGE_ROOT directory
- add the lines below to your /etc/services file
sge_qmaster 19000/tcp
sge_qmaster 19000/udp
sge_execd 19001/tcp
sge_execd 19001/udp
- cd to the SGE_ROOT directory
- Type ./install_qmaster
- follow the instructions on the screen. In our case, the answers to the questions were:
- Do you want to install Grid Engine under an user id other than >root< (y/n) >> n
- $SGE_ROOT = /home/sge-root
- Enter cell name >> star
- Do you want to select another qmaster spool directory (y/n) [n] >> n
- verify and set the file permissions of your distribution (y/n) [y] >> y
- Are all hosts of your cluster in a single DNS domain (y/n) [y] >> y
- Please choose a spooling method (berkeleydb|classic) [berkeleydb] >> classic
- You can change at any time the group id range in your cluster configuration. Please enter a range >> 20000-21000
- The pathname of the spool directory of the execution hosts. Default: [/home/sge-root/star/spool] >> [ENTER]
- Please enter an email address in the form >user@foo.com<. Default: [none] >> [PUT YOUR EMAIL]
- Do you want to change the configuration parameters (y/n) [n] >> n
- We can install the startup script that will start
qmaster/scheduler at machine boot (y/n) [y] >> y
- Adding Grid Engine hosts. Do you want to use a file which contains the list of hosts (y/n) [n] >> n
- Host(s): star1 star2 star3 star4 ...... (ADD ALL HOSTS THAT WILL BE CONTROLLED BY THE BATCH SYSTEM)
- Do you want to add your shadow host(s) now? (y/n) [y] >> n
- Scheduler Tuning. Default configuration is [1] >> 1
- Proceed with the default answers until the end of the script
- You have installed the master system. To make sure the system starts at boot time, type
ln -s /etc/init.d/sgemaster /etc/rc3.d/S95sgemaster
ln -s /etc/init.d/sgemaster /etc/rc5.d/S95sgemaster
- Install the execution nodes (including the server, if it will be an exec node). This needs to be done on ALL exec nodes
- add the lines below to your /etc/services file
sge_qmaster 19000/tcp
sge_qmaster 19000/udp
sge_execd 19001/tcp
sge_execd 19001/udp
- cd to your SGE_ROOT directory
- type ./install_execd
- Answer the question about the SGE_ROOT directory location
- Please enter cell name which you used for the qmaster. >> star
- Do you want to configure a local spool directory for this host (y/n) [n] >> n
- We can install the startup script that will start execd at machine boot (y/n) [y] >> y
- Do you want to add a default queue instance for this host (y/n) [y] >> n (WE WILL CREATE A QUEUE LATER)
- follow the default instructions until the end
- You have now installed the execution daemon. To start it at boot time, type
ln -s /etc/init.d/sgeexecd /etc/rc3.d/S96sgeexecd
ln -s /etc/init.d/sgeexecd /etc/rc5.d/S96sgeexecd
- Install a default queue in your batch system
- type qmon
It opens a GUI window where you can configure the whole batch system.
- Click on the QUEUE CONTROL button
- It opens another screen with the queues you have in your system
- Click on ADD
- Fill in the fields. See the file sge-admin.pdf for instructions. It is very simple. A quick sanity check of the new install is sketched below.
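As a minimal sanity check (a sketch; it assumes the qmaster and exec daemons are running and that you created a queue in qmon), you can list the hosts and push a trivial job through the system:
export SGE_ROOT=/home/sge-root
export SGE_CELL=star
qconf -sql                               # list the queues you created
qhost                                    # every exec node should show up here
echo "/bin/hostname" | qsub -N gridtest  # submit a trivial job from stdin
qstat                                    # the job should go from 'qw' to 'r' and finish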
Installing GANGLIA
Additional information from the STAR web site
You can download the ganglia packages from their web site. You need to install the following packages:
- gmond - the monitoring system. Should be installed in ALL machines in the cluster
- gmetad - the gathering information system. Should be installed in the machine that will collect the data (in our case, STAR1)
- web front end (the ganglia-web package). This is nice to have but not essential. It creates a web page, like this one, with all the information about your cluster. You should have a web server running on the collector machine (STAR1) for this to work
- rrdtool - this is the package that creates the plots on the web page. Necessary only if you have the web front end.
- In each machine in the cluster
- Install the gmond package (change the name to match the version you are installing)
rpm -ivh ganglia-gmond-3.0.1-1.i386.rpm
- edit the /etc/gmond.conf file. The only change I made in this file was
cluster {
name = "STAR"
}
- Type
ln -s /etc/init.d/gmond /etc/rc5.d/S97gmond
ln -s /etc/init.d/gmond /etc/rc3.d/S97gmond
/etc/init.d/gmond stop
/etc/init.d/gmond start
- In the collector machine (STAR1)
- Install the gmetad, web and rrdtool packages (change the names to match the versions you are installing)
rpm -ivh ganglia-gmetad-3.0.1-1.i386.rpm
rpm -ivh ganglia-web-3.0.1-1.noarch.rpm
rpm -ivh rrdtool-1.0.28-1.i386.rpm
- edit the /etc/gmetad.conf file. The only change I made in this file was
data_source "STAR" 10 star1:8649 star2:8649 star3:8649 star4:8649 star5:8649
- Type
ln -s /etc/init.d/gmetad /etc/rc5.d/S98gmetad
ln -s /etc/init.d/gmetad /etc/rc3.d/S98gmetad
/etc/init.d/gmetad stop
/etc/init.d/gmetad start
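To verify that everything is wired up (a sketch; 8649 is gmond's default port, the same one used in the data_source line above, and 8651 is gmetad's default XML port), both daemons dump their state as XML when you connect to them:
telnet star1 8649    # gmond: one <HOST> block per machine in the cluster
telnet star1 8651    # gmetad: the aggregated XML for the "STAR" data source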
MonaLISA
Additional information from the STAR web site
To install MonaLisa on your system you need to download the files from their web site. After you gunzip and untar the file you need to perform the following steps:
- Create a monalisa user in your master computer and its home directory
- cd to the monalisa installation dir
- type ./install.sh
- Answer the following questions:
- Please specify an account for the MonALISA service [monalisa]: [ENTER]
- Where do you want MonaLisa installed ? [/home/monalisa/MonaLisa] : [ENTER]
- Path to the java home []: [enter the path name for your java distribution]
- Please specify the farm name [star1]: [star]
- Answer the next questions as you wish
- Make sure that MonaLisa will run after reboot by typing:
ln -s /etc/init.d/MLD /etc/rc5.d/S80MLD
ln -s /etc/init.d/MLD /etc/rc3.d/S80MLD
- You need to edit the following files in the directory /home/monalisa/MonaLisa/Services
- ml.properties
MonaLisa.ContactName=your name
MonaLisa.ContactEmail=xxx@yyyy.yyy
MonaLisa.LAT=-23.25
MonaLisa.LONG=-47.19
lia.Monitor.group=OSG, star (Note that we are part of both the OSG and STAR groups)
lia.Monitor.useIPaddress=xxx.xxx.xxx.xxx (your IP)
lia.Monitor.MIN_BIND_PORT=9000
lia.Monitor.MAX_BIND_PORT=9010
- Need to tell MonaLisa that I am using SGE as the batch system. For this, edit the Service/CMD/site_env file and add
SGE_LOCATION=/home/sge-root
export SGE_LOCATION
SGE_ROOT=/home/sge-root
export SGE_ROOT
To start the MonaLisa service just type /etc/init.d/MLD start
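A quick way to confirm that the service actually came up (a sketch):
ps -fu monalisa    # the MonaLisa java process should be running under the monalisa user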
Requesting a GRID certificate
By the way, you will have to request (for Grid usage) a user certificate. For instructions, click on the link http://www.star.bnl.gov/STAR/comp/Grid/Infrastructure/#CERT
A grid installation will require a "host" certificate. Jerome told me he never asked for one, really...
The certificate arrived three days after I requested it (with some help from Jerome). I then followed the instructions that came with the email to validate and export the certificate.
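In case it is useful, the usual way to export a certificate from the browser into the format Globus expects is via a PKCS#12 file (a sketch; usercert.p12 is a hypothetical file name, and the instructions that come with the email are the authoritative ones):
openssl pkcs12 -in usercert.p12 -clcerts -nokeys -out ~/.globus/usercert.pem
openssl pkcs12 -in usercert.p12 -nocerts -out ~/.globus/userkey.pem
chmod 400 ~/.globus/userkey.pem
grid-cert-info -subject    # check that the certificate reads back correctly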
Installing OSG
I think this is the last step to be fully GRID integrated. I have not used the certificate I got up to now. Let's see. To install the OSG package I followed the instructions in the following web page:
http://osg.ivdgl.org/twiki/bin/view/Documentation/OsgCEInstallGuide
The basic steps were
- Make sure pacman is installed. For this I had to update python to a version above 2.3. Pacman is a package management system. It can be downloaded from here
- create a directory at /home/grid. This is where I installed the grid stuff. This directory needs to be visible on all the cluster machines
- I typed
export VDT_LOCATION=/home/grid
cd $VDT_LOCATION
pacman -get OSG:ce
I just followed the log and answered the questions.
After this installation was done I typed source setup.sh to complete the installation. No messages on the screen...
Because our batch system is SGE, we need to install extra packages, as stated in the OSG documentation page. I typed:
pacman -get http://www.cs.wisc.edu/vdt/vdt_136_cache:Globus-SGE-Setup
These extra packages were installed in a few seconds.
I just followed the instructions in the OSG installation guide and everything went fine. One important thing is related to the firewall setup. If you have a firewall running with MASQUERADE, in which your private network is not accessible from the outside world, and your gatekeeper is not the firewall machine, remember to open the necessary ports (above 1024) and redirect ports 2119, 2811 and 2812 to your gatekeeper machine. The command depends on your firewall program. If using iptables, just add the following rules to your filter tables:
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 2119 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 2119 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 2135 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 2135 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 2136 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 2136 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 2811 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 2811 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 2812 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 2812 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 2912 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 2912 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 7512 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 7512 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 8443 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 8443 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 19000 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 19000 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 19001 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 19001 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p udp -d $GLOBALIP --dport 20000:65000 -j DNAT --to $STAR1
$filter -t nat -A PREROUTING -p tcp -d $GLOBALIP --dport 20000:65000 -j DNAT --to $STAR1
where $GLOBALIP is the external IP of your firewall and $STAR1 is the IP of the machine running the GRID stuff.
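Since every port is forwarded for both protocols, the same rules can be generated with a compact loop (a sketch; it assumes $GLOBALIP and $STAR1 are set as above and that $filter is just iptables):
for port in 2119 2135 2136 2811 2812 2912 7512 8443 19000 19001 20000:65000; do
  for proto in tcp udp; do
    iptables -t nat -A PREROUTING -p $proto -d $GLOBALIP --dport $port -j DNAT --to $STAR1
  done
done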
I also had to modify the files /home/grid/setup.csh and setup.sh to fix the HOSTNAME and port range. I added, in each file:
setup.csh
setenv GLOBUS_TCP_PORT_RANGE "60000 65000"
setenv GLOBUS_HOSTNAME "stars.if.usp.br"
setup.sh
export GLOBUS_TCP_PORT_RANGE="60000 65000"
export GLOBUS_HOSTNAME="stars.if.usp.br"
This assures that the port range opened in the firewall corresponds to the one used by the GRID environment. Also, because I run the firewall in masquerade mode, I had to set the hostname explicitly; otherwise it would pick up the machine name, and I do not want that to happen.
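A quick check that the overrides are being picked up (a sketch):
source /home/grid/setup.sh
echo $GLOBUS_HOSTNAME $GLOBUS_TCP_PORT_RANGE    # expect: stars.if.usp.br 60000 65000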
GridCat and making things work...
It is very interesting to add your grid node to GridCat. It is a map, just like MonaLisa, but it performs periodic tests on your gatekeeper, making it easier to find problems (and, if you got to this point, there should be a few of them).
To add your gatekeeper to GridCat, go to http://osg.ivdgl.org/twiki/bin/view/Integration/GridCat
You will have to fill out a form, following the instructions at the following link:
http://osg.ivdgl.org/twiki/bin/view/Documentation/OsgCEInstallGuide#OSG_Registration
If everything goes right, when your application is approved you will show up in the GridCat map, located at http://osg-cat.grid.iu.edu:8080
Well, this is where the debugging starts. Every 2-3 hours GridCat tests the gatekeepers and assigns a status light to each one, based on the test results. The tests are basically:
- Authentication test
- Hello world test
- Batch submission (depends on your batch system)
- submit a job
- query the status of the job
- cancel the job
- file transfer (gridFtp)
How to turn authentication and hello world to green?
This is the easiest... Need to map the following certificates to your grid map (/etc/grid-security/grid-mapfile)"/DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) 693100" XXXXThe username 'XXXX' is the local username in your cluster... After this certificates were added to my mapfile the first two tests turned green
"/DC=org/DC=doegrids/OU=People/CN=Bockjoo Kim 740786" XXXX
How to turn the batch system test green
It seems that SGE is not the preferred batch system in the GRID... Too bad, because it is really nice and SIMPLE. Because of this, the OSG interface to SGE does not work right... I hope the bugs are fixed in the next release, but just to keep a log of what I did (with a lot of help) in case they forget to fix it :)
- mis-ci-functions
- This file, located at $VDT_LOCATION/MIS-CI/etc/misci/, is responsible for checking your system basically every 10 minutes and extracting information about your cluster. It uses the batch system to grab the information. Of course, it does not work with SGE. Replace the file with version 0.2.7, located here. Please check if your version is newer than this one before replacing...
- sge.pm
- This file is located at $VDT_LOCATION/globus/lib/perl/Globus/GRAM/JobManager/
- Please check the following
- In the BEGIN section
- if $SGE_ROOT, $SGE_CELL and the commands ($qsub, $qstat, etc) are defined properly
- In the submit section
- Locate the line
- $ENV{"SGE_ROOT"} = $SGE_ROOT;
- add the line
- $ENV{"SGE_CELL"} = $SGE_CELL;
- The same in the pool section
- In the clear section
- locate the line system("$qdel $job_id >/dev/null 2>/dev/null");
- replace it with the following
- $ENV{"SGE_ROOT"}
= $SGE_ROOT;
$ENV{"SGE_CELL"} = $SGE_CELL;
$job_id =~ /(.*)\|(.*)\|(.*)/;
$job_id = $1;
system("$qdel $job_id");
Making gridFTP work
This was the most difficult part, because of my firewall configuration, and thanks Google for making research on the web easier...
First, please check that the services are listed in your /etc/services file:
globus-gatekeeper 2119/tcp # Added by the VDT
gsiftp 2811/tcp # Added by the VDT
gsiftp 2811/udp # Added by the VDT
gsiftp2 2812/tcp # Added by the VDT
gsiftp2 2812/udp # Added by the VDT
If not, add them...
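A one-line check (a sketch):
grep -E 'globus-gatekeeper|gsiftp' /etc/services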
I started testing file transfer between gatekeepers by logging into another gatekeeper, getting my proxy (grid-proxy-init) and doing a file transfer with the command:
globus-url-copy -dbg file:///star/u/suaide/gram_job_mgr_13594.log gsiftp://stars.if.usp.br/home/star/c
The -dbg flag means debug is turned on... Everything goes fine until it starts transferring the data (STOR /home/star/c). It hangs and times out. Researching on the web, I found a bug report at
http://bugzilla.globus.org/globus/show_bug.cgi?id=1127
And a quote in the bottom of the page:
" ... The wuftp based gridftp server is not supported behind a firewall. The problem is in reporting the external IP address in the PASV response. You can see this by using the -dbg flag to globus-url-copy. You will see the the PASV response specifies your internal IP address.
The server should, however, work for clients using PORT. ..."
which means I am doomed... Researching the web some more, I found some solutions, and what I did was:
- replace the file /etc/xinetd.d/gsiftp with this one
service gsiftp
{
socket_type = stream
protocol = tcp
wait = no
user = root
instances = UNLIMITED
cps = 400 10
server = /auto/home/grid/vdt/sbin/vdt-run-gsiftp2.sh
disable = no
}
- restarted xinetd
- modified the file /home/grid/globus/etc/gridftp.conf to
# Configuration file for the new (3.9.5) GridFTP server
inetd 1
log_level ERROR,WARN,INFO,ALL
log_single /auto/home/grid/globus/var/log/gridftp.log
hostname "XXX.XXX.XXX.XXX"
where XXX.XXX.XXX.XXX is the IP of the gateway to the outside world
Now all tests are green and I am happy and tired!!! There are still a few issues left, basically in the cluster information query (number of CPUs, batch queues, etc.) that are related to mis-ci-functions (I think), and I will have a look later.
Another important thing: if you plan to have a cluster running jobs from outside and making file transfers with gsiftp, it is necessary that the directory /etc/grid-security be available on all machines in the cluster, even if they are not gatekeepers. Also, the grid setup should be executed on all the nodes (/home/grid/setup.csh). If not, when a job starts running on one of the nodes and attempts to transfer a file with globus-url-copy, it will fail. The solution I used was to keep the grid-security directory in /home/grid and make symbolic links on all the nodes, as sketched below.
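Something like this (a sketch; in our cluster /home/grid is visible on every node):
# on the gatekeeper, once:
mv /etc/grid-security /home/grid/grid-security
ln -s /home/grid/grid-security /etc/grid-security
# on every other node:
ln -s /home/grid/grid-security /etc/grid-security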