Comma-separated lists in LaTeX

For some time I’ve wanted a way to process comma-separated lists in LaTeX. I didn’t find what I wanted online, so I wrote my own.

Coming from a functional programming background, I think about list processing in terms of inductive definitions. Inductive definitions about lists have two cases:

An inductive approach describes how to break down processing the list into the repeated steps for each element. Of course we’ll sometimes want a different base case – perhaps for a certain problem we’ll consider a one-element or even a two-element list as the base case. But zero is a good, simple starting point.

In addition to specifying the basic and inductive steps, most list processing situations also require considering associativity, or how the elements group with each other. Basic subtraction is a good example of an operation where left- and right-association give different results:

Left ((100-30)-10)-2 = 58
Right 100-(30-(10-2)) = 78

So how will this look in LaTeX? First we need some auxiliary macros for examples of the basic and inductive operations:

    \def\baseZero{NIL}
    \def\inductiveOp#1#2{$\langle$ #1 - #2 $\rangle$}

If we use these macros in the list processors we envision:

  \foreachR{\baseZero}{\inductiveOp}{abc,defg,hijk,lmnop}
  \\
  \foreachL{\baseZero}{\inductiveOp}{abc,defg,hijk,lmnop}

then we’d want the output to be:

  < abc - < defg - < hijk - < lmnop - NIL > > > >
  < < < < NIL - abc > - defg > - hijk > - lmnop >

LaTeX’s macro facility is good at specifying what happens to the next few input elements: underline the first and italize the second; output the next four as they are but put commas in between them; make the next thing purple. It’s less good at reaching into unceratin positions of sequences of arbitrary length – but this is exactly what we’re trying to do. We can’t know how many items are in a comma-separated list, or how long the items are. We don’t even know for certain if there’s a comma – the list could hold only a single element – or even any elements in the list at all. Macros defined by LaTeX’s \newcommand rearrange a fixed number of arguments in a pre-specified way, but \newcommand alone will not be enough for our list processors. We will need TeX’s macro builder \def.

Many LaTeX users will already know \def as a terse alternative to \newcommand. Rather than

  \newcommand{\myCmd}[3]{...}

many will prefer

  \def\myCmd#1#2#3{...}

But \def is not simply a shorthand for \newcommand – it can define macros that harvest patterns in the source text, and take as their arguments the source delimited by these patterns. Consider this simple example:

  \def\myCmd#1<#2>{First #1 then #2.}

Note the angle brackets around the second parameter declaration. This declaration means that \myCmd will set its first argument to be all of the source up to the first occurrence of the less-than sign, and its second argument to be all of the source in between the first less-than character and the next greater-than character. After the macro’s body is expanded, input continues after that greater-than character. Of course, the left- and right-angle characters are arbitrary; we can use any input tokens here, even the backslash-prefixed words that we normally use as macro names. For our loop macros, we will use the comma as a delimiter after the first list element, and a backslash-prefixed word as a marker for the end of the list.

Macros defined by \def with literal tokens in the parameter specifications are unconditional: expanding one of these macros, if LaTeX fails to find the required literals, then the translation fails. Through \def there is no mechanism for defining alternate expansion to be searched for matching input. For conditional expansions based on the present input tokens, we use \@ifnextchar. This token is a simple if-then-else test:

  \@ifnextchar C{A A A ... A}{B B B ... B}

When expanding this macro, LaTeX looks ahead in the input stream. If the next input token is C, then the first block will be used, else the second block will be used.

We can now move on to our first prototype of a list processor. For this first attempt, we’ll implement right-association with the base case of a list of length 0. The following pseudocode restates our algorithm in a way which is closer to the TeX implementation:

  1. If the list is empty, return the base case.
  2. Add an extra comma to the end of the list.
  3. If the list is empty, return the base case.
  4. Divide the list into the part before the first comma, and the part after the first comma, and expand to the inductive case form, applying as arguments:
    1. The first part of the divided list
    2. A recursive call to Step 3 using the second part of the divided list.

The initial test and addition of the comma may seem odd, but is required to make the problem more uniform: the main loop is greatly simplified when we can assume that every list item is followed by a comma, instead of all but the last one. However we cannot add the comma before a test for the null list; otherwise we could have converted the empty list into a list with a single empty element.

We now look at how each step of our algorithm is converted into TeX. For Step 1, we have

  \def\foreach#1#2#3{%
    \@nullTest@foreach{#1}{#2}#3\@end@foreach}
  \def\@nullTest@foreach#1#2{%
    \@ifnextchar\@end@foreach%
      {#1\@swallow}%

The first macro converts a list given as a curly-bracket-enclosed arguments into a list of tokens which is in the input stream, but now followed by our internal marker. While the first macro pulled three arguments, the second macro pulls only two. This means that the list and end-of-list marker remain in the input stream, and so \@ifnextchar can be applied to them. If our end-of-list marker is the very next item, then we know that the list was in fact empty, and return the base case form. The \@swallow macro is a simple helper

  \def\@swallow#1{}

since \@ifnextchar only examines the next input token, and does not remove it from the stream.

Of course the definition above of \@nullTest@foreach is incomplete; the rest of it corresponds to our implementation of Step 2:

  \def\@nullTest@foreach#1#2{%
    \@ifnextchar\@end@foreach{#1\@swallow}%
      {\@prime@foreach{#1}{#2}}}
  \def\@prime@foreach#1#2#3\@end@foreach{%
    \@test@foreach{#1}{#2}#3,\@end@foreach}

Note that when we expect several arguments before a literal, it is the last one into which possibly many tokens will be stored. We can also note one detail about the matching of literals: mached literals are removed from the input stream, even literals matched after all of the structure associated with a macro argument. So we must replace the end-of-list marker whenever we match it.

Step 3 translates into a macro very similar to that for Step 1:

  \def\@test@foreach#1#2{%
    \@ifnextchar\@end@foreach%
       {#1\@swallow}%
       {\@more@foreach{#1}{#2}}}

We have one final definition for our recursive step:

  \def\@more@foreach#1#2#3,#4\@end@foreach{%
    #2{#3}{\@test@foreach{#1}{#2}#4\@end@foreach}}

Having matched the end-of-list marker, we again must replace it. Moreover, the comma is also removed from the input; we do not need to issue any explicit instruction to clear it.

The left-recursive version of this set of macros differs only in the last macro. Rather than applying the inductive form directly, we make a recursive call to the macro itself, and apply the form in the new argument.

  \def\foreachLzero#1#2#3{%
    \@nullTest@foreachLzero{#1}{#2}#3\@end@foreachLzero}
  \def\@nullTest@foreachLzero#1#2{%
    \@ifnextchar\@end@foreachLzero%
      {#1\@swallow}%
      {\@prime@foreachLzero{#1}{#2}}}
  \def\@prime@foreachLzero#1#2#3\@end@foreachLzero{%
    \@test@foreachLzero{#1}{#2}#3,\@end@foreachLzero}
  \def\@test@foreachLzero#1#2{%
    \@ifnextchar\@end@foreachLzero%
      {#1\@swallow}%
      {\@more@foreachLzero{#1}{#2}}}
  \def\@more@foreachLzero#1#2#3,#4\@end@foreachLzero{%
    \@test@foreachLzero{#2{#1}{#3}}{#2}#4\@end@foreachLzero}

In the second part, we’ll look at macros for larger base cases, along with some simplifications to the library.

Written Sunday, August 23rd, 2009. Back to the main page, or onward to similar pages. Trackback.