Implementation of an age2xml parser using the Parse::RecDescent module (Perl)

A few preliminary notes on transforming our existing geometry files to a new format.
 
In an ideal world we would have 100% backwards compatibility between the old starsim-based MC and the new starvmc application. To achieve this we need to be able to transform the existing geometry files into a new language using an automated tool. One approach to developing this tool is to define a grammar which a parser can apply to the files. There are many choices of parsers and parsing strategies: bison/flex (C/C++), pyparsing (Python), Parse::RecDescent (Perl), .... After examining some of the options I am opting to use the Parse::RecDescent Perl module because (1) it appears to be the simplest solution with the best documentation and (2) I've been using Python too much lately anyway. For the truly interested (or moderately insane), a detailed summary of the module can be found on the CPAN website.
 
I will assume for the moment that a top-down parser can be implemented for Fortran, or more precisely for the subset of Mortran which we actually use in our geometry definitions [1]. I have googled 'fortran grammar' and come up with several hits that say that parsing Fortran is difficult. The difficulty arises for parsers which use 1-character look-ahead to determine what they are looking at. For example, when a compiler encounters "DOI=1", it can be looking at one of a couple of very different constructs, illustrated below:
DOI  = 1.10             ! compiler sees DOI=1.10
DO i = 1, 10            ! compiler sees DOI=1,10
   A(i)=2*i*i-i 
ENDDO
The issue arises when one uses an LL(1) parser. LL(k) parsers (i.e. k characters of look-ahead with arbitrary k) do not encounter this problem. The pattern which we are matching with Parse::RecDescent,
doloop: /DO/i variable '=' number ',' number ',' number
      | /DO/i variable '=' number ',' number
will be distinguishable from assignments because we explicitly look ahead for the comma:
assignment: variable '=' expression
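To make the comma look-ahead concrete, here is a minimal sketch in plain Python (standard library only, not Parse::RecDescent). The VARIABLE and NUMBER patterns and the classify function are simplified, hypothetical stand-ins for the 'variable' and 'number' rules above, and the '!' comment stripping is an assumption borrowed from the Fortran snippet. The only point is the ordering: try the doloop alternative first, and fall back to assignment when the comma never shows up.

```python
import re

# Hypothetical, simplified token patterns (assumptions, not the real grammar).
VARIABLE = r"[A-Za-z][A-Za-z0-9_]*"
NUMBER = r"[0-9]+(?:\.[0-9]*)?"

# doloop: DO variable '=' number ',' number [',' number]
DOLOOP = re.compile(
    rf"DO\s*(?P<var>{VARIABLE})\s*=\s*(?P<start>{NUMBER})\s*,\s*(?P<stop>{NUMBER})"
    rf"(?:\s*,\s*(?P<step>{NUMBER}))?\s*$",
    re.IGNORECASE,
)

# assignment: variable '=' expression
ASSIGNMENT = re.compile(rf"(?P<var>{VARIABLE})\s*=\s*(?P<expr>.+?)\s*$")


def classify(statement):
    """Try the more specific alternative (doloop) first, then fall back."""
    text = statement.split("!")[0].strip()  # assumption: '!' starts a comment
    match = DOLOOP.match(text)
    if match:
        return ("doloop", match.group("var"), match.group("start"),
                match.group("stop"), match.group("step"))
    match = ASSIGNMENT.match(text)
    if match:
        return ("assignment", match.group("var"), match.group("expr"))
    return ("unknown", text)
```

With this ordering, "DOI  = 1.10" falls through to the assignment alternative because no comma follows the first number, while "DO i = 1, 10" and "DOI=1,10" are both taken as loops.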
The other issue which I have run across deals with how parsers and compilers handle numeric labels and FORMAT statements, which of course can appear many lines below (or above?) the I/O statement being parsed. For our geometries this is a non-issue, since numeric labels are highly discouraged by the system and I/O statements generally take the form of:
PrinN var1, var2, ..., varn;
   ( FORMAT DESCRIPTOR )
although scattered WRITE (*,*) and PRINT * statements are also found.
 
Forging on. 
 
A very simplistic parser was written to handle a fairly trivial geometry block using this system. The code is written below. It takes as input the geometry specified beneath the __DATA__ statement in the perl script. The parser grammar is implemented in the $grammar variable. The grammar consists of 'productions' which take the form:
rule_name: rule_1 rule_2 ... rule_N
{
   # Actions to perform on match.
}
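The rule-plus-action idea can be illustrated without the Perl module. The following is a rough Python sketch, not the grammar used here: the PRODUCTIONS table, the emit helper, the two input statement forms ("MATERIAL name", "CREATE name") and the upper-casing of names are all assumptions made for illustration.

```python
import re
from xml.sax.saxutils import quoteattr


def emit(tag, **attrs):
    """Format one empty XML element from keyword attributes."""
    body = " ".join(f"{key}={quoteattr(value)}" for key, value in attrs.items())
    return f"<{tag} {body} />"


# Each production pairs a rule name and a pattern with an action that runs on a match.
# The statement syntax matched here is hypothetical.
PRODUCTIONS = [
    ("material", re.compile(r"MATERIAL\s+(\w+)\s*$", re.IGNORECASE),
     lambda m: emit("Material", name=m.group(1).upper())),
    ("create", re.compile(r"CREATE\s+(\w+)\s*$", re.IGNORECASE),
     lambda m: emit("Create", block=m.group(1).upper())),
]


def parse_line(line):
    """Return (rule_name, action_result) for the first production that matches."""
    for rule_name, pattern, action in PRODUCTIONS:
        match = pattern.match(line.strip())
        if match:
            return rule_name, action(match)
    return None
```

A real grammar would let productions refer to each other; this flat table only shows the match-then-act step.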
The rules may depend on previously defined rules and may be combined using logical constructs to build up more complicated rule sets. In the code example below, I implement a fairly simple parser to parse the fairly simple (and completely fake) geometry code listed under the __DATA__ element. The point of the exercise is to demonstrate that this is a viable method for implementing a parser for the AgSTAR portion of the language. (There are existing parsers available for Fortran, though I haven't yet found a recursive descent grammar for Fortran using google.)
Running the parser results in output which looks like:
    <Material name="AIR" />
    <Material dens="0.000123" isvol="0" name="AIR" />
    <Medium ifield="2" stemax="0.01" name="STANDARD" />
    <Attribute serial="1" ltyp="1" seen="1" colo="4" for="NAME" />
    <Shape npdiv="6" nz="3" dphi="360" phi1="0" type="PGON" />
    <Create block="BLCK" />
    <Position block="BLCK" />
    <Position in="" block="BLCK" />
<Block comment="And a short list of comments" name="NAME">
The output is largely what I intended.  However, the BLOCK declaration shows up at the end, not the beginning.  This is because the BLOCK is recognized as an entire unit, i.e. everything from BLOCK ... ENDBLOCK, and we print out matches as they occur.  In a production parser, we would need to accumulate these matches into data structures, then perform the output once a program unit resolves.
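One way to do the accumulation, sketched in Python with the standard xml.etree.ElementTree module rather than in the Perl parser itself: the BlockBuilder class and its method names are hypothetical, and stand in for whatever the grammar actions would call. Statements are buffered as they match and only attached and serialized once the enclosing block has resolved.

```python
import xml.etree.ElementTree as ET


class BlockBuilder:
    """Buffer statement matches until the enclosing BLOCK ... ENDBLOCK resolves."""

    def __init__(self):
        self.children = []  # pending (tag, attributes) pairs, in match order

    def on_statement(self, tag, **attrs):
        # Called from a statement-level action: remember the match, print nothing.
        self.children.append((tag, attrs))

    def on_block_end(self, name, comment):
        # Called when the whole block has matched: build the parent element
        # first, then attach the buffered statements beneath it.
        block = ET.Element("Block", name=name, comment=comment)
        for tag, attrs in self.children:
            ET.SubElement(block, tag, attrs)
        self.children = []
        return block


builder = BlockBuilder()
builder.on_statement("Material", name="AIR")
builder.on_statement("Create", block="BLCK")
block = builder.on_block_end("NAME", "And a short list of comments")
xml_text = ET.tostring(block, encoding="unicode")
```

Here xml_text opens with the Block element and closes with its end tag, with the Material and Create elements nested inside in the order they were matched.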
 
A few notes on the AgSTAR language definition:
 
The AgSTAR language is a modified version of Mortran. 
 
1) It uses line continuation in place of line termination --
o A ',' or '_' at the end of a line denotes line continuation.
o A ',' present in a language comment at the end of a line is sufficient to continue onto the next line.
o Semicolons (';') are ignored, at least in some constructs such as CREATE and POSITION.
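The continuation rules above suggest a small pre-processing pass that joins physical lines into logical statements before the grammar sees them. Below is a minimal Python sketch under a few assumptions of mine: the join_continuations name is hypothetical, a trailing '_' is dropped while a trailing ',' is kept, indentation is discarded, and semicolons are left alone. Because the test looks only at the last character of the raw line, a ',' at the end of a comment also continues the line, as noted above.

```python
def join_continuations(lines):
    """Join physical lines ending in ',' or '_' with the line that follows."""
    logical, parts = [], []
    for raw in lines:
        line = raw.strip()
        if line.endswith("_"):
            # Assumption: '_' is purely a continuation marker, so drop it.
            parts.append(line[:-1].rstrip())
            continue
        parts.append(line)
        if not line.endswith(","):
            # No continuation marker: the logical statement is complete.
            logical.append(" ".join(parts))
            parts = []
    if parts:
        # Input ended while a continuation was still pending.
        logical.append(" ".join(parts))
    return logical


# Made-up input, for illustration only.
example = [
    "SHAPE PGON phi1=0 dphi=360,",
    "      nz=3 npdiv=6",
    "POSITION BLCK _",
    "      x=0 y=0",
]
joined = join_continuations(example)
```

For this input, joined holds two logical statements: the SHAPE line with its attributes and the POSITION line with its arguments.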
 
2) In general, authors have avoided using C-like constructions allowed in Mortran such as IF (expr) { ... } ELSE { ... }. So we can match IF/THEN/ELSE IF/ELSE/END IF in a straightforward manner.
 
3) The ENDFILL side of the FILL/.../ENDFILL construct is not required. This complicates parsing out fill blocks. We should update geometries to make sure that all fill blocks are closed properly before trying to translate. Alternatively, we may need to pre-parse the code and add the missing ENDFILL statements.
 
 

Footnote:

[1] Our geometries use Pavel Nevski's modifications of the Mortran standard (demonstrating once more that the phrase "Mortran standard" is at least as oxymoronic as the phrase "Fortran standard"). Semicolons are not required at the end of lines, FORMAT statements and line numbers are not present, and I have not seen a "goto" statement anywhere but in the RICH geometry. Equivalence statements are largely absent (except in Victor's TPC geometry), and common blocks are swallowed into CMZ-like directives. If you've read this far you know what CMZ is, or are crazy and capable enough to google the CERN website to find out.