Compiling and running Snowball

When you download Snowball, it already has in place a make file that you can call to build it. But in any case, Snowball has a very simple structure, comprising the traditional tokeniser, syntax analyser and code generator modules, with two extra modules for space management and an internal merge sort, and a small driver module, all sharing a common header file. If you put these sources into a directory p/, you can compile Snowball at once (Linux or Unix) with
gcc -O -o Snowball p/*.c
Snowball can then be called up with the following syntax,
F1 [-o[utput] F2]
[-s[yntax]]
[-w[idechars]]
[-j[ava]] [-n[ame] C]
[-ep[refix] S1] [-vp[refix] S2]
[-i[nclude] D]
For example,
./Snowball danish/stem.sbl -o q/danish
./Snowball danish/stem.sbl -syntax
./Snowball danish/stem.sbl -output q/danish -ep danish_
The first argument, F1, is the name of the Snowball file to be compiled. If
the -java option is absent, Snowball produces two outputs, an ANSI C module in
F2.c and a corresponding header file in F2.h. If the -java option is present,
Java output is produced in F2.java.
The -widechars, -eprefix and -vprefix options belong with ANSI C generation; the -name option with Java generation.

ANSI C generation

In the absence of the -eprefix and -vprefix options, the list of declared externals in the Snowball program, for example,
externals ( stem_1 stem_2 moderate )
gives rise to a header file containing,
extern struct SN_env * create_env(void);
extern void close_env(struct SN_env * z);
extern int moderate(struct SN_env * z);
extern int stem_2(struct SN_env * z);
extern int stem_1(struct SN_env * z);
If -eprefix is used, its string, S1, is prefixed to each external
name, for example
-eprefix Khotanese_
would give rise to the header file,
extern struct SN_env * Khotanese_create_env(void);
extern void Khotanese_close_env(struct SN_env * z);
extern int Khotanese_moderate(struct SN_env * z);
extern int Khotanese_stem_2(struct SN_env * z);
extern int Khotanese_stem_1(struct SN_env * z);
If -vprefix is used, all Snowball strings, integers and booleans give
rise to a #define line in the header file. For example
-eprefix Khotanese_ -vprefix Khotanese_variable
would give rise to the header file,
extern struct SN_env * Khotanese_create_env(void);
extern void Khotanese_close_env(struct SN_env * z);
#define Khotanese_variable_ch (S[0])
#define Khotanese_variable_Y_found (B[0])
#define Khotanese_variable_p2 (I[1])
#define Khotanese_variable_p1 (I[0])
extern int Khotanese_stem(struct SN_env * z);
The -widechars option affects interpretation of Snowball hex and
decimal strings, as in
stringdef m hex 'H1 H2 ...'
stringdef m decimal 'D1 D2 ...'
where H1, H2 ... are hex numbers and D1, D2 ... are decimal
numbers. Without the -widechars option it is an error for these numbers
to exceed 255. With the -widechars option it is only an error if they
exceed 65535. So by default one-byte characters are assumed, but
-widechars makes the assumption that characters are two bytes. Note
that (a) the output from Snowball is the same in both cases, and (b)
the -java option automatically sets the -widechars option.
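The range rule above can be sketched in C as follows. This is only an illustration of the stated limits, not the compiler's own code; the function name char_number_ok is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of Snowball: returns true when a character
   number taken from a hex or decimal stringdef lies within the permitted
   range. The limits 255 and 65535 are the ones stated in the text. */
bool char_number_ok(long value, bool widechars)
{
    long limit = widechars ? 65535L : 255L;  /* two-byte vs one-byte characters */
    return value >= 0 && value <= limit;
}
```

So a value of 256 would be rejected by default but accepted once -widechars (or -java, which sets it) is in force.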
Within the API header file api.h, symbol is given a typedef of
unsigned char,
typedef unsigned char symbol;
- and a sequence of characters representing a word to be stemmed is then
held in a symbol array. To switch to a 16 bit representation of characters,
just replace char by short here:
typedef unsigned short symbol;
Java generation

The -java option automatically sets the -widechars option.

To run Java, download the tarball at , which will unpack into an appropriate directory structure.

Other options

If -syntax is used the other options are ignored, and the syntax tree of the Snowball program is directed to stdout. This can be a handy way of checking that you have got the bracketing right in the program you have written.

Any number of -include options may be present, for example,
./Snowball testfile -output test -ep danish_ \
-include /home/martin/Snowball/codesets \
-include extras
Each -include is followed by a directory name. With a chain of
directories D1, D2 ... Dn, a Snowball get directive,
get 'F'
causes F to be searched for in the successive locations,
F
D1/F
D2/F
...
Dn/F
- that is, the current directory, followed in turn by directories D1 to
Dn.
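The search order can be sketched in C as follows. This is a minimal illustration of the order described above, not the code Snowball itself uses; the function get_search_locations and the constant MAX_PATH_LEN are hypothetical names introduced for the example.

```c
#include <stdio.h>

#define MAX_PATH_LEN 256   /* assumed buffer size, chosen for this example */

/* Hypothetical helper, not part of Snowball: fills out[] with the locations
   in which the file named in a get directive would be looked for, in order:
   F itself, then D1/F, D2/F ... Dn/F. Returns the number of entries written,
   which is never more than max_out. */
int get_search_locations(const char *f, const char *const dirs[], int n_dirs,
                         char out[][MAX_PATH_LEN], int max_out)
{
    int count = 0;
    if (count < max_out) {
        /* the current directory comes first */
        snprintf(out[count++], MAX_PATH_LEN, "%s", f);
    }
    for (int i = 0; i < n_dirs && count < max_out; i++) {
        /* then each -include directory, in the order given */
        snprintf(out[count++], MAX_PATH_LEN, "%s/%s", dirs[i], f);
    }
    return count;
}
```

With the two -include directories of the example command line, a get 'F' would give three candidate locations: F, /home/martin/Snowball/codesets/F and extras/F.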
The Snowball API

To access Snowball from C, include the header api.h, and any headers generated from the Snowball scripts you wish to use. api.h declares
struct SN_env { ... };
extern void SN_set_current(struct SN_env * z, int size, char * s);
Continuing the previous example, you set up an environment to call the
resources of the Khotanese module with
struct SN_env * z;
z = Khotanese_create_env();
Snowball has the concept of a ‘current string’.
This can be set up by,
SN_set_current(z, i, b);
This defines the current string as the i bytes of data starting at
address b. The externals can then be called,
Khotanese_moderate(z);
...
Khotanese_stem_1(z);
They give a 1 or 0 result, corresponding to the t or f result of
the Snowball routine.
And later,
Khotanese_close_env(z);
to release the space allocated for z back to the system. You can do this for a
number of Snowball modules at the same time: you will need a separate
struct SN_env * z; for each module.
The current string is given by the z->l bytes of data starting at z->p.
The string is not zero-terminated, but you can zero terminate it yourself with
z->p[z->l] = 0;
(There is always room for this last zero byte.) For example,
SN_set_current(z, strlen(s), s);
Khotanese_stem_1(z);
z->p[z->l] = 0;
printf("Khotanese-1 stems '%s' to '%s'\n", s, z->p);
The values of the other variables can be accessed via the #define
settings that result from the -vprefix option, although this should not
usually be necessary:
printf("p1 is %d\n", z->Khotanese_variable_p1);
The stemming scripts on this Web site use Snowball very simply.
-vprefix is left unset, and -eprefix is set to the name of the
script (usually the language the script is for). All the programs are
tested through a common driver program.
Debugging

In the rare event that your Snowball script does not run perfectly the first time:

Remember that the option -syntax prints out the syntax tree.

A question mark can be included in Snowball as a command, and it will cause the current string to be sent to stdout, with square brackets marking the slice and a vertical bar the position of c. Curly brackets mark the end-limits of the string, which may be less than the whole string because of the action of setlimit.

At present there is no way of reporting the value of an integer or boolean.

If desperate, you can put debugging lines into the generated C program. This is not so hard, since running comments show the correspondence with the Snowball source.

Compiler bugs

There must be a few compiler bugs in such a young language. If you hit one, try to capture it in a small script before notifying us.

Known problems in Snowball

The main one is that it is possible to ‘pull the rug from under your own feet’ in constructions like this:
[ do something ]
do something else
( C1 delete C2 ) or ( C3 )
Suppose C1 gives t, the delete removes the slice established on the first
line, and C2 gives f, so C3 is done with c set back to the value it had
before C1 was obeyed - but this old value does not take account of the byte shift
caused by the delete. This problem was foreseen from the beginning when designing
Snowball, and recognised as a minor issue because it is an unnatural thing to want to
do. (C3 should not be an alternative to something which has deletion as an
occasional side-effect.) It may be addressed in the future.
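The effect can be imitated outside Snowball with a small C sketch. It is a simplified model written for illustration, not generated code: the functions delete_slice and cursor_seen_by_c3 are hypothetical, and the string is held as an ordinary zero-terminated buffer.

```c
#include <string.h>

/* Hypothetical helper: removes the bytes in [bra, ket) from the
   zero-terminated buffer s, shifting the rest of the string left. */
void delete_slice(char *s, int bra, int ket)
{
    memmove(s + bra, s + ket, strlen(s + ket) + 1);
}

/* Models "( C1 delete C2 ) or ( C3 )" for the case where C1 gives t and
   C2 gives f. s is the current string, [bra, ket) the slice established
   beforehand, and c the cursor on entry. Returns the cursor that C3 sees. */
int cursor_seen_by_c3(char *s, int bra, int ket, int c)
{
    int saved_c = c;             /* the value c had before C1 was obeyed */
    delete_slice(s, bra, ket);   /* C1 gave t, so the delete is carried out */
    /* C2 gives f, so the first alternative fails and c is set back */
    return saved_c;              /* takes no account of the ket - bra bytes removed */
}
```

For instance, with the string "abcdef", the slice covering "ab" and c at offset 4 (the letter e), the string becomes "cdef" while C3 still starts at offset 4, which is now the end of the string and no longer the letter e.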