Computing - Perl and Gedcom

Okay, so here's where we get to the useful stuff. This page discusses some basic techniques for using Perl to process Gedcom files. I think it should be clear how powerful Perl is in this area.

Parsing Gedcom Records

Using Perl's robust support for pattern matching, it is rather easy to split a Gedcom record into its parts. The following regular expression alone does that:

/^(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

Let's break down that expression to see what it does:

^ matches the start of the record.
(\d+) matches the digits of the level number.
\s+ matches one or more spaces.
(@\S+@)? matches the label (if specified). A Gedcom label is surrounded by @ signs. The question mark indicates that the item can occur zero or one times. \S matches non-spaces.
\s* matches zero or more spaces.
(\S+) matches the Gedcom tag, such as INDI.
\s+ matches one or more spaces.
(.*) matches everything else in the record.

The parentheses around each element indicate that the matched text is placed in predefined variables $1, $2, etc.

Reading Gedcom Files

The following Perl code reads the Gedcom records from standard input:

   @lines = <STDIN>;     # Read all lines into array @lines
   foreach (@lines)      # Each line is placed into predefined scalar $_
   {
      # Get rid of CR/LF chars
      chomp;             # Get rid of end of line character
      s/[\cM\cJ]//;      # Get rid of other CR or LF characters

      # Parse record
      /^(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;
      $level = $1;       # Save the level number
      $label = $2;       # Save the label (if specified)
      $tag   = uc($3);   # Uppercase the tag
      $text  = $4;       # Save everything else

      # Process record
      ...
   }

In this code fragment, you can substitute the ellipses with the code to process the record.

Put loop into subroutine

To make this more flexible, the above loop can be specified in a subroutine. The processing to be done on each record can be passed in as a block:

sub loopGedcom (&\@)
{
   my ($proc, $lins) = @_;
   local $level, $label, $tag, $text;
   foreach (@$lins)
   {
      # Get rid of CR/LF chars
      chomp;
      s/[\cM\cJ]//;

      # Parse record
      /^(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;
      $level = $1;
      $label = $2;
      $tag   = uc($3);
      $text  = $4;

      # Process record
      &$proc;
   }
}

This subroutine takes two parameters:

  1. A block of code that processes each record.
  2. An array containing the records of the file.

Here's an example of how to call the above subroutine:

# Read input file
@lines = <STDIN>;

# Loop through all records
loopGedcom
{
   # Do we have a NAME record?
   if ($tag eq 'NAME')
   {
      # Process NAME record
      ...
   }
}
@lines;

Put subroutine into module

Since this subroutine can be used by any Perl program that processes Gedcom files, it should be placed in a Perl module. Modules are files with extension .pm, that are automatically included in a program by coding Perl statement use.

A module can contain any Perl code and ends with the number 1:

# Module "gedcom.pm"

sub loopGedcom (&\@)
{
   my ($proc, $lins) = @_;
   local $level, $label, $tag, $text;
   foreach (@$lins)
   {
      # Get rid of CR/LF chars
      chomp;
      s/[\cM\cJ]//;

      # Parse record
      /^(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;
      $level = $1;
      $label = $2;
      $tag   = uc($3);
      $text  = $4;

      # Process record
      &$proc;
   }
}

1

Sample Program

Now, let's do something useful. When publishing genealogy data on the internet, it is useful to be able to list all surnames in a Gedcom file.

The following program reads a Gedcom file from standard input and writes out the unique surnames to standard output.

#!/usr/bin/perl

########################################################################
# surnames                                                             #
#                                                                      #
# Print a sorted list of surnames in the input gedcom file.            #
#                                                                      #
# Reads from STDIN and writes to STDOUT.                               #
#----------------------------------------------------------------------#
# (c) Copyright 1999 Hans Boldt                                        #
#                                                                      #
# This program may be freely distributed and modified provided that    #
# the original copyright be kept and that the source code must be      #
# available.                                                           #
########################################################################

use gedcom;                 # Contains subroutine 'loopGedcom'

# Read input file
@lines = <STDIN>>;

# Loop through GEDCOM data and extract surnames
loopGedcom
{
   # Do we have a name?
   if ($tag eq 'NAME'
   &&  $text =~ m"/(.*)/")
   {
      if ($1 and $1 ne '')
      {
         $list{$1} ++;
      }
   }
}
@lines;

# Print out list of surnames
foreach (sort keys %list)   # Each item of list is placed into $_
{
   print "$_\n";
}

exit;

(To call this program on your computer, check the documentation specific your version of Perl.)