Troubleshooting gsiftp at STAR-BNL

An overview of STAR troubleshooting with the official involvement of the OSG Troubleshooting Team in late 2006 and early 2007 history can be found here:
https://twiki.grid.iu.edu/twiki/bin/view/Troubleshooting/TroubleshootingStar

As of mid March, the biggest open issue is intermittently failing file transfers when multiple simultaneous connections are attempted from a single client node to the STAR-BNL gatekeeper. 

This article will initially summarize the tests and analysis conducted during the period from ~March 23 to March 29, though not in chronological order.  Updates on testing at later dates will be added sequentially at the bottom.

A typical test scenario goes like this:  Log on to a pdsf worker node, such as pc2607.nersc.gov and exceute a test script such as mytest.perl, which calls myguc.perl (actually, myguc.pl, but I had to change the file extensions in order for drupal to handle these properly.  Btw, both of these were originally written by Eric Hjort).  Sample versions of these scripts are attached.  These start multiple transfers simultaneously (or very nearly so).

In a series of such tests, we've gathered a variety of information to try to understand what's going on.

In one test, in which 2 of 17 transfers failed, we had a tcpdump on the client node (attached as PDSF-tcpdump.pcap), the gridftp-auth.log and gridftp.log files from the gatekeeper (both attached) and I acquired the BNL firewall logs covering the test period (relevant portions (filtered by me) are attached as BNL_firewall.txt).

In a separate test, a tcpdump was acquired on the server side (attached as BNL-tcpdump.pcap).  In this test, 7 of 17 transfers failed.

Both tcpdumps are in a format that Wireshark (aka Ethereal) can handle, and the Statistics -> Converstations List -> TCP is a good thing to look at  early on (it also makes very useful filtering quite easy if you know how to use it).

Missing from all this is any info from the RCF firewall log, which I requested, but I got no response.  (The RCF firewall is inside the BNL firewall.)  From putting the pieces together as best I can without this, I doubt this firewall is interfering, but it would be good to verify if possible.)

What follows is my interpretation of what goes on in a successful transfer:
    A.  The client establishes a connection to port 2811 on the server (the "control" connection).
    B.  Using this connection, the user is authenticated and the details of the requested transfer are exchanged.
    C.  A second port (within the GLOBUS_TCP_PORT_RANGE, if defined) is opened on the server.
    D.  The client (which is itself using a new port as well) connects to the new port on the server (the "transfer" connection) and the file is transfered.
    E.  The transfer connection is closed at the end of the transfer.
    F.  The control connection is closed.

The failing connections seem to be breaking at step B, which I'll explain momentarily.  But first, I'd like to point out that if this is correct, then GLOBUS_TCP_PORT_RANGE and GLOBUS_TCP_SOURCE_RANGE and state files are all irrelevant, because the problem is occurring before those are ever consulted to open the transfer connection.  I point this out because the leading suspect to this point has been a suspected bug in the BNL Cisco firewalls that was thought to improperly handle new connections if source and destination ports were reused too quickly.

So, what is the evidence that the connection is failing at such an early point?  I'm afraid the only way to really follow along at this point to dig into the tcpdumps youself, or you'll just have to take my interpretation on most of this.

1.  The gridftp-auth.log clearly shows successful authentications for the successful transfers and no mention of authentication for the failed transfers.
2.  From the tcpdumps, three conversation "types" are evident -- successful control connections, corresponding transfer connections, and failed control connections.  There are no remaining packets that would constitute a failed transfer connection.
3.  The failed control connections are obviously failed, in that they quickly degnerate into duplicate ACKs from the server and retransmissions from the client, which both sides are seeing.  I interpret this to mean that any intermediate firewalls aren't interferring at the end of the connection either, but I suppose it doesn't mean they haven't plucked a single packet out of the stream somewhere in the middle.)

Here's what I've noticed from the packets in the tcpdumps when looking at the failed connection.  From the client side, it looks like the fifth packet in the converation never reaches the server, since the server keeps repeating its previous ACK (SEQ=81).  From the server side, things are more peculiar.  There is a second [SYN,ACK] sent from the server AFTER the TCP connection is "open" (the server has already received an [ACK] from the client side in response to the first [SYN,ACK]).  This is strange enough, but looking back at the client tcpdump, it doesn't show any second [SYN,ACK] arriving at the client! 

Why is this second [SYN,ACK] packet coming from the server, and why is it not received on the client side (apparently)?

At this point, I'm stumped.  I haven't painstakingly pieced together all the SEQ and ACK numbers from the packets to see if that reveals anything, but it probably best to leave that until we have simultaneous client and server dumps, at which point the correspondences should be easier to ferret out.  [note:  simultaneous dumps from a  test run were added on April 2 (tcpdump_BNL_1of12.pcap and tcpdump_PDSF_1of12.pcap).  See the updates section below.]

One more thing just for the record:  the client does produce an error message for the failed transfers, but it doesn't shed any more light on the matter:
error: globus_ftp_client: the server responded with an error
421 Idle Timeout: closing control connection.


Additional tests were also done, such as:

Iptables was disabled on the client side -- failures still occurred.
Iptables was disabled on the server side -- failures still occurred.

Similar tests have been lauched by Eric and me from PDSF clients connecting to the STAR-WSU and NERSC-PDSF gatekeepers instead of STAR-BNL.  There were no unexplained failures at sites other than STAR-BNL, which seems to squarely put the blame somewhere at BNL.

Updates on March 30:

The network interfaces of client and server show no additional errors or dropped packets occuring during failed transfers (checked from the output of ifconfig on both sides).

Increased the control_preauth_timeout on the server to 120, then 300 seconds (default is 30).  Failures occured with all settings.

Ran a test with GLOBUS_TCP_XIO_DEBUG set on the client side.  The resulting output of a failed transfer (with standard error and standard out intermixed) is attached as "g-u-c.TCP_XIO_DEBUG". 

Bumped up the server logging level to ALL (from the default "ERROR,WARN,INFO").  A test with 2 failures out of 12 is recorded in gridftp.log.ALL and gridftp-auth.log.ALL.  (The gridftp.log.ALL only shows activity for the 10 successful transfers, so it probably isn't useful.)  It appears that [17169] and [17190] in the gridftp-auth.log.ALL file represent failed transfers, but no clues as to the problem -- it just drops out at the point where the user is authenticated, so there's nothing new to be learned here as far as I can tell.  However, I do wonder about the fact that these two failing connections do seem to be the LAST two connections to be initiated on the server side, though they were the first and ninth connections in the order started from the client.  Looking at a small set of past results, the failed connections are very often the last ones to reach the server, regardless of the order started on the client, but this isn't 100%, and perhaps should be verified with a larger sample set.

Updates on April 2:

I've added simultaneous tcpdumps from the server and client side ("tcpdump_BNL_1of12.pcap" and "tcpdump_PDSF_1of12.pcap").  These are from a test with one failed connection out of 12.  Nothing new jumps out at me from these, with the same peculiar packets as described above.

I ran more than 10 tests using stargrid01 (inside the BNL and RCF firewalls) as the client host requesting 40 transfers each time, and saw no failures.  This seems strong evidence that the problem is somewhere in the network equipment, but where?  I have initiated a request for assistance from BNL ITD in obtaining the RCF firewall logs as well as any general assistance they can offer.

Updates on April 16:

In the past couple of weeks, a couple of things were tried, based on a brief conversation with a BNL ITD network admin:

1.  On the server (stargrid02.rcf.bnl.gov), I tried disabling TCP Windows Scaling ("sysctl -w net.ipv4.tcp_window_scaling=0") -- no improvement
2.  On the server (stargrid02.rcf.bnl.gov), I tried disabling MTU discovery ("sysctl -w net.ipv4.ip_no_pmtu_disc=1") -- no improvement

In response to a previous client log file with the TCP_XIO debug enabled, Charles Bacon contributed this via email: 

>Thanks for the -dbg+TCP logs!  I posted them in a new ticket at http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=5190
>The response from the GridFTP team, posted there, is:
>"""
>What this report shows me is that the client (globus-url-copy) successfully forms a TCP control channel connection with the server.  It then successfully reads the 220 Banner message from the server.  The client then attempts to authenicate with the server.  It sends the AUTH GSSAPI command and posts a read for a response.  It is this second read that times out.

>From what i see here the both sides believe the TCP connection is formed successfully, enough so that at least 1 message is sent from the server to the client (220 banner) and possibly 1 from the client to the server (AUTH GSSAPI, since we dont have server logs we cannot confirm the server actually received it).

>I think the next step should be looking at the gssapi authentication logs on the gridftp server side to see what commands were actually received and what replies sent.  I think debugging at the TCP layer may be premature and may be introducing some red herrings.

>To get the desired logs sent the env
>export GLOBUS_XIO_GSSAPI_FTP_DEBUG=255,filename
>"""
>So, is it possible to get this set in the env of the server you're using, trigger the problem, then send the resulting gridftp.log?


I have done that and a sample log file (including log_level ALL) is attached as "gridftp-auth.xio_gssapi_ftp_debug.log"  This log file covers a sample test of 11 transfers in which 1 failed.

Updates on April 20:

Here is the Globus Bugzilla ticket on this matter:  http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=5190
They have suggested better debug logging parameters to log each transfer process in separate files and requested new logs with the xio_gssapi_ftp_debug enabled, which I will do, but currently have urgent non-grid matters to work on.

Also, we have been given the go ahead to do some testing with ATLAS gatekeepers in different parts of the BNL network structure, which may help isolate the problem, so this is also on the pending to-do list which should get started no later than Monday, April 23.

AJ Temprosa, from BNL's ITD networking group has been assigned my open ticket.  On April 17 we ran a test with some logging at the network level, which he was going to look over, but I have not heard anything back from him, despite an email inquiry on April 19.

Updates on April 23:

Labelled gridftp logs with xio_gssapi_ftp_debug and the tcp_xio_debug enabled are attached as gridftp-auth.xio_gssapi_ftp_debug.labelled.log and gridftp-auth.xio_tcp_debug.labelled.log.  (By "labelled", I mean that each entry is tagged with the PID of the gsiftp server instance, so the intermixed messages can be easily sorted out.)  Ten of twenty-one transfers failed in this test.

I now have authorization and access to run tests with several additional gsiftp servers in different locations on the BNL network.  In a simple model of the situation, there are two firewalls involved -- the BNL firewall, and within that, the RACF firewall.  Two of the new test servers are in a location similar to stargrid02, which is inside both firewalls.  One new server is between the two firewalls, and one is outside both firewalls.  In ad hoc preliminary tests, the same sort of failures occurred with all the servers EXCEPT the one located outside both firewalls.  I've fed this back to the BNL ITD network engineer assigned to our open ticket and am still waiting for any response from him.

Updates on May 10:

[Long ago,] Eric Hjort did some testing with 1 second delays between successive connections and found no failures.  In recent limited testing with shorter delays, it appears that there is a threshhold at about 0.1 sec.  With delays longer than 0.1 sec, I've not seen any failures of this sort.

I installed the OSG-0.6.0 client package on presley.star.bnl.gov, which is between the RACF and BNL firewalls.  It also experiences failures when connecting to stargrid02 (inside the RACF firewall).

We've made additional tests with different server and client systems and collected additional firewall logs and tcpdumps.  For instance, using the g-u-c client on stargrid01.rcf.bnl.gov (inside both the RACF and BNL perimeter firewalls) and a gsiftp server on netmon.usatlas.bnl.gov (outside both firewalls) we see failures that appear to be the same.  I have attached firewall logs from both the RACF firewall ("RACF_fw_logs.txt") and the BNL firewall ("BNL_perimeter_fw_logs.txt") for a test with 4 failures out of 50 transfers (using a small 2.5 KB file).  Neither log shows anything out of the ordinary, with each expected connection showing up as permitted.  Tcpdumps from the client and server are also attached ("stargrid01-client.pcap" and "netmon-server.pcap" respectively).  They show a similar behaviour as in the previous dumps from NERSC and stargrid02, in which the failed connections appear to break immediately, with the client's first ACK packet somehow not quite being "understood" by the server.

RACF and ITD networking personnel have looked into this a bit.  To make a long story short, their best guess is "kernel bug, probably a race condition".  This is a highly speculative guess, with no hard evidence.  The fact that the problem has only been noticed when crossing firewalls at BNL casts doubt on this.  For instance, using a client on a NERSC host connecting to netmon, I've seen no failures, and I need to make this clear to them.  Based on tests with other clients (eg. presley.star.bnl.gov) and servers (eg. rftpexp01.rhic.bnl.gov), there is additional evidence that the problem only occurs when crossing firewalls at BNL, but I would like to quantify this, rather than relying on ad hoc testing by hand, with the hope of removing any significant possibility of statistical flukes in the test results so far.

Updates on May 25:

In testing this week, I have focused on eliminating a couple of suspects.  First, I replaced gsiftpd with a telnetd on stargrid03.rcf.bnl.gov.  The telnetd was setup to run under xinetd using port 2811 -- thus very similar to a stock gsiftp service (and conveniently punched through the various firewalls).  Testing this with connections from PDSF quickly turned up the same sort of broken connections as with gsiftp.  This seems to exonerate the globus/VDT/OSG software stack, though it doesn't rule out the possiblity of a bug in a shared library that is used by the gsiftp server.

Next, I tried to eliminate xinetd.  To do this, I tried some http connections to a known web server without any problems.  I then setup an sshd on port 2811 on stargrid03.  In manual testing with an scp command, I found no failures. I've scripted a test on pdsfgrid1 to run every 5 minutes that makes 30 scp connections to stargrid03 at BNL.  The results so far are tantalizing -- there don't seem to be any failures of the sort we're looking for...  If this holds up, then xinetd becomes suspect #1.  There are also some interesting bug fixes in xinetd's history that seem suspiciously close to the symptoms we're experiencing, but I can't find much detail to corroborate with our situation.  See for instance:

https://rhn.redhat.com/errata/RHBA-2005-208.html , https://rhn.redhat.com/errata/RHBA-2005-546.html and http://www.xinetd.org/#changes

Here is a sample problem description:
"Under some circumstances, xinetd confused the file descriptor used for
logging with the descriptors to be passed to the new server. This resulted
in the server process being unable communicate with the client. This
updated package contains a backported fix from upstream. "

(NB - These Redhat errata have been applied to stagrid02 and stargrid03, but there are prior examples of errata updates that failed to fix the problem they claimed :-( ) 

Updates on May 30:

SOLUTION(?)

By building xinetd from the latest source (v 2.3.14, released Oct. 24, 2005) and replacing the executable from the stock Red Hat RPM on stargrid02 (with prior testing on stargrid03), the connection problems disappeared.  (minor note:  I built it with the libwrap and loadavg options compiled in, as I think Red Hat does.)

For the record, here is some version information for the servers used in various testing to date:

stargrid02 and stargrid03 are identical as far as relevant software versions:
Linux stargrid02.rcf.bnl.gov 2.4.21-47.ELsmp #1 SMP Wed Jul 5 20:38:41 EDT 2006 i686 i686 i386 GNU/Linux
Red Hat Enterprise Linux WS release 3 (Taroon Update 8)
xinetd-2.3.12-6.3E.2 (the most recent update from Red Hat for this package for RHEL 3.  Confusingly enough, the CHANGELOG for this package indicates it is version 2:2.3.***13***-6.3E.2 (not 2.3.***12***))
Replacing this with xinetd-2.3.14 built from source has apparently fixed the problem on this node.

rftpexp01.rhic.bnl.gov (between the RACF and BNL firewalls):
Linux rftpexp01.rhic.bnl.gov 2.4.21-47.0.1.ELsmp #1 SMP Fri Oct 13 17:56:20 EDT 2006 i686 i686 i386 GNU/Linux
Red Hat Enterprise Linux WS release 3 (Taroon Update 8)
xinetd-2.3.12-6.3E.2 

netmon.usatlas.bnl.gov (outside the firewalls at BNL):
Linux netmon.usatlas.bnl.gov 2.6.9-42.0.8.ELsmp #1 SMP Tue Jan 23 13:01:26 EST 2007 i686 i686 i386 GNU/Linux
Red Hat Enterprise Linux WS release 4 (Nahant Update 4)
xinetd-2.3.13-4.4E.1 (the most recent update from Red Hat for this package in RHEL 4.)

Miscellaneous wrap-up:

As a confirmation of the xinetd conclusion, we ran some additional tests with a server at Wayne State with xinetd-2.3.12-6.3E (latest errata for RHEL 3.0.4). When crossing BNL firewalls (from stargrid01.rcf.bnl.gov for instance), we did indeed find connection failures. Wayne State then upgraded to RedHat's xinetd-2.3.12-6.3E.2 (latest errata for any version of RHEL 3) and the problems persisted. Upon building xinetd 2.3.14 from source, the connection failures disappeared. With two successful "fixes" in this manner, I filed a RedHat Bugzilla ticket:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243315

Open questions and remaining tests:

A time-consuming test of the firewalls would be a "tabletop" setup with two isolated nodes and a firewall to place between them at will.  In the absence of more detailed network analysis, that would seem to be the most definitive possible demonstration that the firewall is playing a role, though a negative result wouldn't be absolutely conclusive, since it may be configuration dependent as well.

Whatever YOU, the kind reader suggests! :-)