Tools

A few tools and middleware are deployed on our nodes for monitoring, sanity, or integrity checking purposes. We describe here the basic setup of some of them, in the hope it will ease installation on your end.

Ganglia monitoring system

Introduction

Ganglia is a resource information gathering tool and, more generally, a monitoring system. Ganglia was designed to be scalable, but there are a few tips and tricks you may want to know and/or follow for a smooth deployment on your cluster. This page is meant to provide such help. Ganglia packages are available from http://ganglia.sourceforge.net/.

Note first that the BNL install of this tool is based on version 3.5.0 and so are the instructions below. You will likely find a copy of the package, in the form of a gzipped TAR archive, in the /afs/rhic.bnl.gov/star/common/ tree (search for -name 'ganglia*'), where miscellaneous packages are placed for setting up our environment.

Components - basics

First of all, grab the package from SourceForge. The package comes in three parts
- A monitor core component, which includes both the client (gmond) and the collector (gmetad).
- rrdtools - this needs to be installed ONLY on the node on which you will deploy the Ganglia collector gmetad.
  Note that gmetad will be built only if a version of the rrdtools is installed on the system. There is NO need to start multiple collectors: whether a gmetad has to be deployed or not is purely a logical structure decision. For example, grouping all of your compute nodes together (one gmetad running on a chosen node) could be one choice; grouping all of your generic servers together (db and other services/servers, including your Web server) could be another. If this information is to be useful in the end, you have to create that logical separation right away. In particular, the compute nodes should be grouped together, and this group should NOT include processors and/or nodes where no user jobs will run. If your farm is logically separated into sub-clusters, or has resources allocated to different groups, I would suggest creating as many categories as necessary to reproduce this logical sub-division of CPU resources.
- Ganglia Web front-end - this is ONLY needed on the web front-end. Usually, there would be one such deployment per cluster (but you may also deploy on several nodes for redundancy).
Each package is very straightforward and, unless otherwise specified, uses the usual software build/install procedure
```
% ./configure
% make
% make install
```
You may install the packages wherever you see fit for your cluster, provided you take note of a few consequences of changing the default install locations.

Some installation details

Collector / gmetad & rrdtools

I would first install the Ganglia collector(s). How to decide where a collector needs to be deployed was discussed above. At BNL, we have a few: one is running on the web server where the web front-end is deployed, creating a "server" category (see below), and one is running on a dedicated node collecting information about the Analysis and Reconstruction farm. To install gmetad, do the following
- Install rrdtools first on all nodes which will do the data collection.
  In the current offline STAR deployment, this should be no more than our Web server.
  Note from the 3.5.0: unlike the previous version of ganglia, version 3.x+ adds a dependency on libart.
  The installation can be confusing as default locations are assumed; the recommendation is to do the following:
```
% ./configure --prefix=/usr --disable-tcl
#   --prefix=/usr  : libart installs in /usr/local by default but ganglia
#                    will look into /usr, so the build is a pain otherwise
#   --disable-tcl  : we do not care about tcl
```
- Install the core monitor package. Note that the gmetad program is installed during the 'make install' process.
- Create a round-robin database (rrd) file or rrds file.
  I used a pseudo-device (a loop-mounted file) for the RRD database. This greatly improves Ganglia's efficiency, as the IO is done at the OS level to a single file, which in turn keeps the machine from slowing down under intensive IO (as it otherwise may).
  - The file was created using
```
% dd if=/dev/zero of=/home/var/ganglia/rrds.img bs=1024 count=262144
```
    (count is proportional to the size you need ; this is for a 256 MB file which is plenty)
```
% mke2fs -F /home/var/ganglia/rrds.img
% mkdir /var/lib/ganglia/rrds
% chown -R nobody /var/lib/ganglia/rrds
```
  - The mount in fstab looks like
    
    /home/var/ganglia/rrds.img /var/lib/ganglia/rrds auto loop 0 0
- Install the start up files. On Linux, it goes something like this
```
% cp ./gmetad/gmetad.init /etc/rc.d/init.d/gmetad
% chkconfig --add  gmetad
% chkconfig --list gmetad
gmetad   0:off   1:off  2:on  3:on  4:on  5:on  6:off
```
- Copy the default configuration file
```
% cp gmetad.conf /etc/
```
- Adjust your gmetad.conf according to your needs. Little needs to be done, so don't be too inventive at the start and leave most of the values at their defaults. A few things you MUST change or check
  - trusted_hosts should be 127.0.0.1 if you install the collector on a node also running the web front-end. Note that because this is configurable, you do not necessarily need to have gmetad running on the Web server; it can be a totally dedicated node for monitoring purposes only. In that case, trusted_hosts has to be set to the IP of the host which will connect to the gmetad service to pull the information out and display it on the web pages. trusted_hosts is also used to allow other gmetad instances to get the information (chain or proxy-like structure). This allows for a tree structure and aggregation of information.
  - rrd_rootdir. According to our example, this should be /var/lib/ganglia/rrds
  - data_source: it has to have a name your CPUs will be grouped under. At BNL, we have
    data_source "STAR_RCF_Servers" 60 localhost
    data_source "STAR CAS Linux Cluster" 60 ganglia.rcf.bnl.gov:8651
    There can be as many data_source entries as "groups" to be displayed on your page. The number (60 here) is the polling interval in seconds, and the sources are all of the format $IP{:$PORT} (the port is optional). The localhost syntax tells gmetad to grab the information from a local service, i.e. a gmond process running locally. The second syntax is a reference to a remote Ganglia collector. In this case, the name "STAR CAS Linux Cluster" has little importance and will be ignored; that connection alone could in fact lead to displaying several groups by itself.
  - gridname: a name you want to give to your groups. At BNL, we set this to "STAR Computing Resources" for the total aggregate of information.
    To see how this configuration leads to an actual structure, consult our BNL Ganglia pages. It will become more obvious.
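The gmetad.conf settings discussed above can be put together as in the following minimal sketch. It only prints a configuration to standard output; nothing is written to /etc. The function names data_source_line and gmetad_conf are hypothetical helpers, and the directive names (gridname, trusted_hosts, rrd_rootdir, data_source) are assumed to match the stock gmetad.conf shipped with your version, so check them against your default file.

```shell
#!/bin/sh
# Sketch only: print a minimal gmetad.conf with the BNL example values.
# Directive names are assumed from the stock gmetad.conf; verify locally.

# data_source_line NAME INTERVAL HOST[:PORT]...
# Prints one data_source directive; extra hosts act as redundant sources.
data_source_line() {
  name=$1
  interval=$2
  shift 2
  printf 'data_source "%s" %s %s\n' "$name" "$interval" "$*"
}

# gmetad_conf: assemble the settings discussed above on stdout.
gmetad_conf() {
  printf 'gridname "%s"\n' "STAR Computing Resources"
  printf 'trusted_hosts %s\n' "127.0.0.1"
  printf 'rrd_rootdir "%s"\n' "/var/lib/ganglia/rrds"
  data_source_line "STAR_RCF_Servers" 60 localhost
  data_source_line "STAR CAS Linux Cluster" 60 ganglia.rcf.bnl.gov:8651
}

gmetad_conf
```

Redirect the output to a scratch file and compare it with the default gmetad.conf before replacing anything.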

Web front-end

Now, deploy the web front-end on your web server. Just unpack the package and move the entire tree to a place where it can be accessed on your web server. For example
```
% mv web /var/www/html/Ganglia-3.0.5
% cd /var/www/html/
% rm -f Ganglia && ln -s Ganglia-3.0.5 ./Ganglia
```
You should have this in place, but it won't show anything until you get gmetad started and collecting. Before that, some adjustments need to be made. Note that for an update of the Web front-end, only the web directory needs installing (so don't get overzealous replacing everything for a minor version correction).
In the web front-end directory, you will find a file named conf.php. If you have installed the RRD tools somewhere other than the default location, you will need to modify it. The value should be something like this for a /usr/local installation
define("RRDTOOL","/usr/local/bin/rrdtool");
You may have to make a few modifications to your PHP installation in the case of a large cluster. This is done in /etc/php.ini. Two settings need adjustment
memory_limit = 16M
post_max_size = 16M
You should do this only if the pages do not display or give an error about memory. The default value is 8 MB of memory, which is small for a cluster of 400+ nodes.
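Before editing php.ini, it may help to check what the two settings currently are. The sketch below is one way to do that; php_ini_mb and needs_raise are hypothetical helper names, PHP_INI is a hypothetical environment variable for pointing at another file, and the parsing assumes values written as a plain number of megabytes (for example 8M), which is not the only form PHP accepts.

```shell
#!/bin/sh
# Sketch only: report php.ini limits that are unset or below a minimum.
# Assumes values of the form "<n>M"; other notations are not handled.

# php_ini_mb FILE KEY: print the value of KEY in megabytes (last match wins).
php_ini_mb() {
  sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\([0-9][0-9]*\)M.*/\1/p" "$1" | tail -n 1
}

# needs_raise FILE KEY MIN_MB: succeed if KEY is missing or below MIN_MB.
needs_raise() {
  value=$(php_ini_mb "$1" "$2")
  [ -z "$value" ] || [ "$value" -lt "$3" ]
}

# Check the two settings named above against the 16M suggested for large clusters.
ini=${PHP_INI:-/etc/php.ini}
if [ -r "$ini" ]; then
  for key in memory_limit post_max_size; do
    if needs_raise "$ini" "$key" 16; then
      echo "$key in $ini is unset or below 16M"
    fi
  done
fi
```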
You can now start gmetad
```
% /etc/rc.d/init.d/gmetad start
```
Obviously, it won't be exciting until you actually deploy the monitoring daemon gmond.

Client / gmond

You are now ready to deploy gmond (and start it) on all the nodes in your cluster as defined by your logical groupings. You have to install the monitor core component but NOT the rrdtools. When installed, do the following
- Usual install
```
% ./configure
% make
% make install
```
- Copy the startup file and add the service (Linux example)
```
% cp ./gmond/gmond.init /etc/rc.d/init.d/gmond
% chkconfig --add gmond
% chkconfig --list gmond
gmond 0:off 	1:off  2:on  3:on  4:on  5:on  6:off
```
- Copy the configuration file
```
% cp ./gmond/gmond.conf /etc/
```
  Note that what I did was to create one gmond.conf per group (one common configuration per group is what you need; they should differ across groups) and then pull the appropriate configuration file onto all the nodes. A few things to check in this configuration file
  - The block named cluster defines some parameters associated with the logical grouping you will be using
    The field name MUST be unique to each group of nodes. For example
    name = "STAR_RCF_Servers"
    
    The field owner seems to have no effect. Do not worry too much about it although we DO set it at BNL as
    owner = "RHIC/STAR"
  - The next blocks of importance are udp_send_channel and udp_recv_channel
    mcast_join is VERY important. mcast_join tells Ganglia which multicast address to use for sending the information on the network. At BNL, I suggested the use of a non-ambiguous yet unique set of numbers to differentiate research groups or sub-groups within the department (so that we would not send information to each other). While this dates from the old times of loose firewall/routing settings, the rule is good to keep, and the convention I proposed to those using Ganglia was 239.2.$SUBNET.00.
    For the 88 STAR subnet, this would equate to
    mcast_join = "239.2.88.00"
    
    In udp_recv_channel, also use
    bind = "239.2.88.00"
    
    In those two blocks, ttl is another important parameter and should NOT be greater than 1 if your collectors and all your nodes are on the same network and/or behind the same router. The TTL is decreased at each router boundary. If you increase this value, you will create useless network traffic (and may flood the network if routers/switches do not drop the multicast packets).
    ttl = "1"
- In principle, you are done and ready to start
```
% /etc/rc.d/init.d/gmond start
```
Immediately after starting gmond, and provided you do have gmetad running, you should see the node appear on your web page. Note that all gmond daemons collect information from each other (via multicast). The one running on your Web server (if you have one) is no different. This explains our data_source entry pointing to localhost.
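Since each group needs its own gmond.conf, the channel blocks can be generated from the subnet number rather than edited by hand. The following is a minimal sketch under a few assumptions: mcast_addr and gmond_channels are hypothetical helper names, port 8649 is taken from the examples on this page, ttl is placed only in udp_send_channel as I recall the stock gmond.conf layout, and the address is written with a final .0 (the same address as the .00 form used above).

```shell
#!/bin/sh
# Sketch only: print the multicast channel blocks for one subnet.
# Value quoting follows the examples on this page; verify against your gmond.conf.

# mcast_addr SUBNET: multicast group following the 239.2.$SUBNET.0 convention.
mcast_addr() {
  printf '239.2.%d.0\n' "$1"
}

# gmond_channels SUBNET: print udp_send_channel and udp_recv_channel blocks.
gmond_channels() {
  addr=$(mcast_addr "$1")
  cat <<EOF
udp_send_channel {
  mcast_join = "$addr"
  port = 8649
  ttl = "1"
}

udp_recv_channel {
  mcast_join = "$addr"
  port = 8649
  bind = "$addr"
}
EOF
}

# Example: the 88 STAR subnet.
gmond_channels 88
```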

Other Notes

Note again the syntax for data_source: you may specify several (redundant) gmond $IP:$PORT references for a given group.
After you have gmond installed, you should be able to telnet to the gmond port (% telnet localhost 8649) and get a dump of the metrics. If the dump has a header and preamble but no metrics, multicasting is likely not functional. You can then try unicast by putting the following lines in gmond.conf:
```
....

udp_send_channel {
  host = localhost
  port = 8649
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  port = 8649    
}
```
Then try to telnet again. If this time you get the metrics, you can be fairly certain multicasting is blocked on your node (revert to the multicast setup and use the next tip).
If you are running a host firewall such as iptables, you will need to add lines like the ones below to your configuration to make sure the multicast traffic passes through
-A INPUT -p udp -m udp -d 239.2.88.0 --dport 8649 -j ACCEPT
-A INPUT -p igmp -j ACCEPT
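The telnet check described above can also be scripted, which is handy when testing many nodes. This is a minimal sketch: count_elements and dump_status are hypothetical helper names, and it assumes the dump is XML with HOST and METRIC elements, as gmond produces on its port. It reads the dump from standard input, so how you fetch it (for example with a netcat-like tool, if one is installed) is up to you.

```shell
#!/bin/sh
# Sketch only: decide whether a gmond XML dump actually contains metrics.

# count_elements TAG: count opening <TAG ...> elements in XML read from stdin.
count_elements() {
  awk -v tag="<$1 " '{ n += gsub(tag, "&") } END { print n + 0 }'
}

# dump_status: classify a gmond dump read from stdin.
dump_status() {
  xml=$(cat)
  hosts=$(printf '%s\n' "$xml" | count_elements HOST)
  metrics=$(printf '%s\n' "$xml" | count_elements METRIC)
  if [ "$metrics" -gt 0 ]; then
    echo "ok: $hosts host(s), $metrics metric(s)"
  else
    echo "empty: header only, multicast is probably not getting through"
  fi
}
```

For example, save the output of the telnet session to a file and run dump_status < that file.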
gmetric: you may add new (customized) metrics using the gmetric program. This can be useful to monitor extra features not covered by gmond. When you add or remove a metric, you need to do so consistently on all nodes. Removing a metric (or stopping it) on a few nodes DOES NOT make the graph associated with it disappear. In fact, to have it gone from your web front-end you will have to go into the rrd database and physically remove the files pertaining to that metric, even if you have removed every instance of gmetric. You will also need to do that on ALL nodes holding a copy of the data in their own rrd database. Should a metric appear even for a minute, a plot will appear and this manual removal will be needed.
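The manual removal can be sketched as follows. It assumes the usual layout in which each metric is stored as a file named after the metric with an .rrd extension somewhere under rrd_rootdir (per host and per summary directory); metric_rrds and purge_metric are hypothetical helper names. By default the sketch only lists what it would remove.

```shell
#!/bin/sh
# Sketch only: find (and optionally delete) the rrd files of one metric.
# Assumes files named <metric>.rrd under the rrd root; dry run by default.

# metric_rrds RRD_ROOT METRIC: list every rrd file for METRIC under RRD_ROOT.
metric_rrds() {
  find "$1" -type f -name "$2.rrd"
}

# purge_metric RRD_ROOT METRIC [--delete]: list the files, delete only if asked.
purge_metric() {
  metric_rrds "$1" "$2" | while IFS= read -r f; do
    if [ "${3:-}" = "--delete" ]; then
      rm -f -- "$f" && echo "removed $f"
    else
      echo "would remove $f"
    fi
  done
}
```

A call such as purge_metric /var/lib/ganglia/rrds my_metric (my_metric being a placeholder name) shows the candidates; add --delete once the list looks right, and repeat on every node holding a copy of the rrd data.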
Note that if you change the polling interval or any vital time bin information in gmond/gmetad, you will lose ALL historical data unless you take special care while doing so. At BNL, we changed the polling interval from 20 seconds to one minute to reduce network traffic and cope with CISCO's deficient treatment of multicast requests. We then used the ganglia-rrd-modify.pl script to modify the historical data without losing it.
```
% ganglia-rrd-modify.pl -v -H 180 -r /path/to/rrds/dir
```
Note that the value 180 is 2-3 times the update interval. You have to do this operation in the following order
- Start from the "slave" or mirror server first
  - Adjust the polling value in the configuration file (data_source)
  - stop gmetad
  - execute the convert script
  - restart gmetad
- Only when all slaves or mirrors of a collector are done shall you go up in the tree of gmetad dependencies
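The per-collector steps above can be written down as a small plan generator, which prints the commands in order instead of running them. This is a sketch only: heartbeat_for and conversion_plan are hypothetical helper names, the default factor of 3 is simply the upper end of the "2-3 times" rule, and the init script path is the one used earlier on this page.

```shell
#!/bin/sh
# Sketch only: print the ordered conversion commands for one collector.

# heartbeat_for INTERVAL [FACTOR]: heartbeat in seconds, FACTOR (default 3)
# times the polling interval.
heartbeat_for() {
  echo $(( $1 * ${2:-3} ))
}

# conversion_plan RRD_DIR INTERVAL: print the steps for one gmetad node.
conversion_plan() {
  rrd_dir=$1
  interval=$2
  echo "# first set the data_source polling interval to $interval in gmetad.conf"
  echo "/etc/rc.d/init.d/gmetad stop"
  echo "ganglia-rrd-modify.pl -v -H $(heartbeat_for "$interval") -r $rrd_dir"
  echo "/etc/rc.d/init.d/gmetad start"
}

# Example: one-minute polling with the rrd location used on this page.
conversion_plan /var/lib/ganglia/rrds 60
```

Run the printed commands on the mirror or slave collectors first, then move up the gmetad tree.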

Content of ganglia-rrd-modify.pl

#!/usr/bin/perl -w
#
# Simple script to read ganglia's rrds and modify some of their configuration values.
#  - Uses the tune and resize commands.
#
# Written by:  Jason A. Smith <smithj4 {at} bnl.gov>
#

# Modules to use:
use RRDs;  # Round Robin Database perl module (shared version) - from rrdtool package.
use RRDp;  # Round Robin Database perl module (piped version) - from rrdtool package.
use Cwd;
use Data::Dumper;
use File::Basename;
use Getopt::Long;
use strict;

# Get the process name from the script file:
my $process_name = basename $0;

# Define some useful variables:
my $rrd_dir = '/var/lib/ganglia/rrds';
my $heartbeat = 0;
my $grow = 0;
my $shrink = 0;
my $verbose = 0;
my $debug = 0;
my $num_files = 0;

# Get the command line options:
&print_usage if $#ARGV < 0;
$Getopt::Long::ignorecase = 0;  # Need this because I have two short options, same letter, different case.
GetOptions('r|rrds=s'      => \$rrd_dir,
           'H|heartbeat=i' => \$heartbeat,
	   'g|grow=i'      => \$grow,
	   's|shrink=i'    => \$shrink,
	   'v|verbose'     => \$verbose,
           'd|debug'       => \$debug,
           'h|help'        => \&print_usage,
          ) or &print_usage;

# Recursively loop over ganglia's rrd directory, reading all directory and rrd files:
my $start = time;
chdir("/tmp");  # Let the rrdtool child process work in /tmp.
my $pid = RRDp::start "/usr/bin/rrdtool";
chdir("$rrd_dir") or die "$process_name: Error: Directory doesn't exist: $rrd_dir";
&process_dir($rrd_dir);
my $time = time - $start;  $time = 1 if $time == 0;
my ($usertime, $systemtime, $realtime) =  ($RRDp::user, $RRDp::sys, $RRDp::real);
my $status = RRDp::end;

# Print final stats:
warn sprintf "\n$process_name: Processed %d rrd files in %d seconds (%.1f f/s)\n\n", $num_files, $time, $num_files/$time;

# Exit:
exit $status;

# Function to read all directory entries, testing them for files & directories and processing them accordingly:
sub process_dir {
  my ($dir) = @_;
  
  my $cwd = getcwd;
  warn "$process_name: Reading directory: $cwd ....\n";
  foreach my $entry (glob("*")) {
    if (-d $entry) {
      chdir($entry) or next;  # Skip directories we cannot enter (avoids re-reading the current one).
      &process_dir($entry);
      chdir("..");
    } elsif (-f $entry) {
      &process_rrd("$cwd/$entry");
    }
  }
}

# Function to process a given rrd file:
sub process_rrd {
  my ($file) = @_;
  
  # Who owns the file (if resizing the file then I have to move the file & change the ownership back):
  my ($uid, $gid) = (stat($file))[4,5];
  
  # Read the rrd header and other useful information:
  warn "$process_name: Reading rrd file: $file\n" if $debug;
  my $info = RRDs::info($file);
  my $error = RRDs::error;
  warn "ERROR while reading $file: $error" if $error;
  print "$file: ", Data::Dumper->Dump([$info], ['Info']) if $debug;
  my $num_rra = 0;  # Maximum index number of the RRAs.
  foreach my $key (keys %$info) {
    if ($key =~ /rra\[(\d+)\]/) {
      $num_rra = $1 if $1 > $num_rra;
    }
  }
  my ($start, $step, $names, $data) = RRDs::fetch($file, 'AVERAGE');
  $error = RRDs::error;
  warn "ERROR while reading $file: $error" if $error;
  if ($debug) {
    print "Start:       ", scalar localtime($start), " ($start)\n";
    print "Step size:   $step seconds\n";
    print "DS names:    ", join (", ", @$names)."\n";
    print "Data points: ", $#$data + 1, "\n";
  }
  
  # Set the heartbeat if asked to:
  if ($heartbeat) {
    foreach my $n (@$names) {
      warn "$process_name: Updating heartbeat for DS: $file:$n to $heartbeat\n" if $verbose;
      RRDs::tune($file, "--heartbeat=$n:$heartbeat");
      $error = RRDs::error;
      warn "ERROR while trying to tune $file:$n - $error" if $error;
    }
  }
  
  # Resize the rrds if asked to (have to use the pipe module - resize doesn't exist in the shared version):
  if ($grow or $shrink) {
    my $action = $grow ? 'GROW' : 'SHRINK';
    my $amount = $grow ? $grow : $shrink;
    foreach my $n (0..$num_rra) {
      my $cmd = "resize \"$file\" $n $action $amount";
      warn sprintf "$process_name: %sING: %s:RRA[%d] by %d rows.\n", $action, $file, $n, $amount if $verbose;
      RRDp::cmd($cmd);
      my $answer = RRDp::read;  # Returns nothing.
      warn "$process_name: Renaming: /tmp/resize.rrd --> $file\n" if $verbose;
      system('mv', '/tmp/resize.rrd', $file);  # List form: safe for paths containing spaces.
      chown $uid, $gid, $file if $uid or $gid;
    }
  }
  $num_files++;
}

# Print usage function:
sub print_usage {
  print STDERR <<EndOfUsage;

Usage: $process_name [-Options]

 Options:
  -r|--rrds dir		Location of rrds to read (Default: $rrd_dir).
  -H|--heartbeat #	Set heartbeat interval to # (Default: unchanged).
  -g|--grow #		Add # rows to all RRAs in rrds (Default: unchanged).
  -s|--shrink #		Remove # rows from all RRAs in rrds (Default: unchanged).
  -v|--verbose		Enable verbose mode (explicitly print all actions).
  -d|--debug		Enable debug mode (more detailed messages written).
  -h|--help		Print this help message.

Simple script to read ganglia's rrds and modify some of their configuration
values.

EndOfUsage
  
  exit 0;
}

#
# End file.
#