Tool to Separate Spanish Text into Syllables (Castellano)

As with the rest of my articles, I have also translated this one into English: although the tool I share below works specifically for Spanish, the method I explain here can be used with any language.

A much-praised feature of LaTeX is that, unlike groff, which processes text line by line, it reads the entire paragraph before breaking it into lines.  If you look closely, you will notice how LaTeX automatically shrinks the space between words to fit paragraphs.  In groff, the same or better results can be achieved by manually adjusting the spacing with the .ss request but, personally, I don't like compressing the text.  A more honest way to avoid the annoying gaps in justified text is to optimize the hyphenation.  In this article I will tell you how to achieve a result that is, in some cases, even more homogeneous than the one observed in the same document processed with LaTeX (I also have my complete novels transcribed in LaTeX, so I speak with knowledge of the facts).

To achieve hyphenation, groff uses a simplified version of the algorithm that TeX uses, which must be fed with a list of patterns that tell it where to hyphenate words.  Since each language has different hyphenation rules, each language needs its own patterns.  Groff also borrows from LaTeX the files containing those patterns.  The groff distribution already includes those for English, German, French, Swedish and Czech; to get the files for other languages you have to copy them yourself from the LaTeX distribution.  As explained in the groff documentation, you have to load them with the following requests:

.hla es		\" First, set the hyphenation language!
.hpf hyphen.es	\" Then you source hyphenation LaTeX patterns

This is the ‘official’ way, so to speak, of achieving automatic hyphenation with groff.  I did it this way for years, until I recently found a better one.  There is a second way to tell groff how to hyphenate: the .hw request, which allows you to add words on a single line or on multiple lines anywhere in the document:

.hla es			\" Set the hyphenation language
.hw ca-sa te-lé-fo-no
.hw re-tri-bu-ción

Groff's documentation presents this option as a resource to define exceptions, but nothing prevents us from automating the task by generating a file including all the words of the document to be edited, which can be perfectly achieved with some tr(1), grep(1), and sed(1) commands.  Once I wrote such a script and saved its output to a file (hyphen.tr), I commented out the conventional method and used only the mentioned file:

.hla es			\" Set the hyphenation language
.\".hpf hyphen.es	\" Comment out the other method
.so hyphen.tr		\" File containing .hw entries

Obviously, once you let groff know how to hyphenate from the first to the last word of your document, the hyphenation is as good as it can be.  Many gaps left by the other method disappear.
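
The word-collecting half of such a script might look like the sketch below.  It is only a minimal illustration, not the original script: list_words is a made-up name, it assumes that ASCII whitespace, punctuation and digits are the only separators, and it deletes ¿ and ¡ explicitly so that the result does not depend on the locale.  The words it prints still have to be hyphenated and prefixed with .hw before groff can use them.

```shell
#!/bin/sh
# list_words (hypothetical helper): read text on stdin and print every
# distinct word on its own line.

list_words() {
	# Delete the inverted marks first; each one is matched as a literal
	# byte sequence, so this step behaves the same in any locale.
	sed -e 's/¿//g' -e 's/¡//g' |
	# Turn every run of whitespace, punctuation and digits into one newline.
	tr -s '[:space:][:punct:][:digit:]' '\n' |
	# Drop the empty line left by leading separators, then deduplicate.
	grep -v '^$' |
	sort -u
}

# Usage: list_words < document.txt
```

Case is left untouched here, so "Ayer" and "ayer" would be listed as two different words.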

Mandatory Preliminary Step

But to get the best from our new method, there is a mandatory preliminary step: disabling the automatic English hyphenation, which on most systems is enabled by default.  To accomplish this, we first need to create a personal tmac directory (perhaps you already have one) and copy into it the system-wide groff configuration file, troffrc, from the directory where your system installed the groff files.

$ export GROFF_TMAC_PATH=~/.local/share/groff/site-tmac
$ mkdir -p "$GROFF_TMAC_PATH"
$ # On some systems the prefix is /usr/local/share instead of /usr/share
$ cp /usr/share/groff/current/tmac/troffrc \
	"$GROFF_TMAC_PATH"/

To make the previous configuration permanent:

$ echo 'export GROFF_TMAC_PATH=~/.local/share/groff/site-tmac' \
	>> ~/.bashrc

Then open your copy of troffrc with your favorite text editor and comment out the English hyphenation requests:

$ vi $GROFF_TMAC_PATH/troffrc
.\" Set the hyphenation language to 'us'.
.\".do hla us
.
.\" Load hyphenation patterns and exceptions.
.\".do hpf hyphen.us
.\".do hpfa hyphenex.us
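
The same edit can be scripted instead of done by hand.  What follows is a minimal sketch, assuming that your troffrc contains the three requests spelled exactly as shown above; comment_out_us_hyphenation is a made-up name, and the function only filters its input, so nothing is overwritten until you decide to replace the file.

```shell
#!/bin/sh
# comment_out_us_hyphenation (hypothetical helper): read a troffrc on
# stdin and print it with the English hyphenation requests commented out.

comment_out_us_hyphenation() {
	# Prefix the matching lines with .\" (the roff comment marker):
	#   .do hla us
	#   .do hpf hyphen.us
	#   .do hpfa hyphenex.us
	sed -e '/^\.do hla us/s/^/.\\"/' \
	    -e '/^\.do hpfa\{0,1\} hyphen/s/^/.\\"/'
}

# Usage:
#   comment_out_us_hyphenation < "$GROFF_TMAC_PATH/troffrc" > troffrc.new &&
#       mv troffrc.new "$GROFF_TMAC_PATH/troffrc"
```

Lines that are already commented out do not match the patterns, so running the filter twice does not add a second marker.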

The Tool Written in C

At least for Spanish, writing a script that correctly separates words into syllables wasn't an easy task.  I first had to study how to implement the Spanish rules (taking care of diphthongs, hiatuses, etc.) in an economical way.  The code wasn't elegant, but it served me as a model to write an application in C:

Download (hyphen-es.c)

To make it a general-purpose tool, I designed it so that, by default, it prints all the text as it reads it, only separating the syllables of each word with hyphens:

$ echo Ayer pasé por tu casa... - J. Corona. | hyphen-es
A-yer pa-sé por tu ca-sa ... - J. Co-ro-na.

With the -l (lowercase L) option, it generates a list of one word per line in lowercase, ignoring monosyllables:

$ echo Ayer pasé por tu casa... - J. Corona. | hyphen-es -l
a-yer
pa-sé
ca-sa
co-ro-na

If we intend to use it with a document already edited with roff, it's convenient to strip all the roff markup and leave only the text before piping it to hyphen-es.  The following is an example of a simple shell script which does exactly that (you may need to adapt it to your needs):

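#!/bin/sh
# roff2txt.sh: print the plain text of a roff document

# Create the temporary file safely instead of guessing a free name
tmp=$(mktemp "/tmp/$(basename "$1" .tr)_XXXXXX") || exit 1

# Inline the files loaded with .so, except the hyphenation word list
# (the pattern covers both hyphen.tr and hyphen_es.tr)
while IFS= read -r line; do
	if printf '%s\n' "$line" | grep -v 'hyphen.*\.tr' | grep -q '^\.so .*\.tr'
	then
		cat "$(printf '%s\n' "$line" | awk '{ print $2 }')" >> "$tmp"
	else
		printf '%s\n' "$line" >> "$tmp"
	fi
done < "$1"

# Strip font escapes (\fI, \fP, \fR), the \& escape, and request lines
sed -e 's/\\f[IPR]//g' -e 's/\\&//g' -e 's/^\..*$//' "$tmp"

rm -f "$tmp"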

Add a little Unix pipeline magic to the above and you get a file that is ready to use with our groff document (you can add a command like this one to your Makefile):

$ roff2txt.sh ${doc}.tr | hyphen-es -l | \
	sort | uniq | sed 's/^/.hw /' > hyphen_es.tr

Finally, all that remains is to load the file from your groff document:

.so hyphen_es.tr

