Random thoughts shooting out of volatile mind
Tutorial: Creating Aspell Dictionaries
I recently created the aspell-kn dictionary, and thus became an author of at least one upstream project ;). In this tutorial I'm going to summarize the steps involved in creating an aspell dictionary for a particular language (in my case, Kannada).



The first step is to get the latest source of aspell-lang from CVS. Yes, aspell still uses CVS, but we need the source only to get a map file for our language, and believe me, checking it out won't take much time even on a slow connection (like mine). Use the following command to check out aspell-lang:

cvs -z3 -d:pserver:anonymous@cvs.savannah.gnu.org:/sources/aspell co aspell-lang
The above command will create an aspell-lang directory in the directory where you executed it. Now change to the aspell-lang/maps directory, which contains the language map files, and look for the map file for your language. If your language uses a Unicode character set, the map file name should be u- followed by the script code and .txt; in my case the file was u-knda.txt. Open the file in any editor and check whether you need to add anything to what is already there. This file maps the 128-character Unicode block for the language to the 8-bit range 128-255. Normally the existing content is sufficient, but for Indian scripts we also need to map two Unicode control characters, ZWNJ (U+200C) and ZWJ (U+200D). This is how u-knda.txt looks after adding these two control characters:

0x11 = U+200C
0x12 = U+200D
0x80..0xFF = U+0C80..U+0CFF
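To make the mapping concrete, here is a small shell sketch of how I read those three lines: two fixed codes for the control characters, plus one range where each 8-bit code keeps the same offset into the Kannada block. The function name map_code is made up for illustration and is not part of aspell or mkchardata.

```shell
#!/bin/sh
# Hypothetical helper (not part of aspell): print the Unicode code point that
# the three map lines above assign to one 8-bit code.
map_code() {
    code=$(( $1 ))                         # accepts decimal or 0x-prefixed hex
    if [ "$code" -eq 17 ]; then            # 0x11
        echo "U+200C"                      # ZWNJ
    elif [ "$code" -eq 18 ]; then          # 0x12
        echo "U+200D"                      # ZWJ
    elif [ "$code" -ge 128 ] && [ "$code" -le 255 ]; then
        # 0x80..0xFF = U+0C80..U+0CFF: same offset from the start of each range
        printf 'U+%04X\n' $(( 0x0C80 + code - 0x80 ))
    else
        echo "not covered by these three lines" >&2
        return 1
    fi
}

map_code 0x95   # prints U+0C95 (KANNADA LETTER KA)
map_code 0x11   # prints U+200C
```

The sketch only mirrors the three lines shown; it says nothing about codes that the map file leaves out.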
The next step is to generate the cmap and cset files for the language. The aspell-lang directory contains a Perl script, mkchardata; use it to generate the .cmap and .cset files that match your map file. In my case I used the following command to generate u-knda.cmap and u-knda.cset:
perl mkchardata maps/u-knda.txt
Now create a working directory for the new dictionary, named aspell- followed by your language code (in my case aspell-kn), and copy the generated .cmap and .cset files into it. These two files are not mandatory for every dictionary, but I think they are required if your language uses a Unicode character set. Also create a misc directory inside the working directory and copy the map file (u-knda.txt in my case) into it.
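The layout described above can be sketched as a small shell function. The function name and its arguments are my own, and I am assuming you already know which directory mkchardata wrote the .cmap and .cset files to, so that directory is passed in rather than guessed.

```shell
#!/bin/sh
# Hypothetical helper: lay out a dictionary working directory.
#   $1  directory holding the generated .cmap/.cset files (assumption: check
#       where mkchardata wrote them on your system)
#   $2  directory holding the map file (aspell-lang/maps in the text)
#   $3  map name without extension, e.g. u-knda
#   $4  working directory to create, e.g. aspell-kn
setup_workdir() {
    gen_dir=$1
    maps_dir=$2
    map=$3
    work=$4
    mkdir -p "$work/misc" || return 1
    # The character data files go in the top level of the working directory
    cp "$gen_dir/$map.cmap" "$gen_dir/$map.cset" "$work/" || return 1
    # The map file itself goes under misc/
    cp "$maps_dir/$map.txt" "$work/misc/"
}
```

For Kannada this might be called as setup_workdir aspell-lang aspell-lang/maps u-knda aspell-kn, with the first argument adjusted to wherever the generated files actually are.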
Next we need to create the info file, which contains information about the dictionary. Below is the info file I used for aspell-kn:

As you can see, all the fields are self-explanatory. The one thing I would like to mention here is the data-file field: if your language needs cset and cmap files, use this field to give their names.
Next you need to create a Copying file describing the terms of use for the dictionary. If you are using a license other than a standard one, place the license text in a COPYING file. For GPL licenses this file will be created during the processing phase.
The next important file is the .dat file for your language. Its format is described in the aspell manual under the chapter "Adding Support For Other Languages" [1]. Below is the kn.dat which I created:


Now we have all the required files. Copy the proc file from aspell-lang (checked out from CVS) to your working directory and run it as shown below:
./proc create
This will create the configure script and Makefile.pre. Finally, copy the word list file to the working directory and name it after your language code with a .wl extension. The word list file contains a list of words separated by newlines. Now run the following commands:
./configure
make
This will convert the wl file to a cwl file (a word list compressed with the prezip tool) and also create an rws file. The dictionary is now ready. To publish it to the world, run make dist, which will create a tarball for distribution. Note that the tarball doesn't include the .wl file; it contains only the cwl file.
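Since the word list is just one word per line, it can be tidied with standard tools before running configure and make. The sketch below is my own addition, not something the aspell tooling asks for: it strips carriage returns, drops blank lines and removes duplicate words. The function name and the file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn a raw word file into a one-word-per-line list.
#   $1  raw input file
#   $2  output word list (e.g. kn.wl)
make_wordlist() {
    tr -d '\r' < "$1" |            # drop carriage returns from DOS-style files
        sed '/^[[:space:]]*$/d' |  # drop empty or whitespace-only lines
        LC_ALL=C sort -u > "$2"    # byte-order sort, removing duplicate words
}
```

For example, make_wordlist raw-words.txt kn.wl would write the cleaned list to kn.wl. Sorting in the C locale keeps the result the same regardless of the system locale.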

I created aspell-kn by referring to the tutorial on the Indlinux wiki [2]. The only reason for rewriting the entire tutorial is that I felt the one on the Indlinux wiki is not suitable for a beginner (being one myself while creating aspell-kn, I had to struggle a lot to get everything together).

The source for aspell-kn is available at [3] and the source for the Kannada word list is available at [4]. aspell-kn can be downloaded from [5] and will soon be packaged for Debian :)

[1] http://aspell.net/man-html/Adding-Support-For-Other-Languages.html#Adding-Support-For-Other-Languages
[2] http://indlinux.org/wiki/index.php/IndicSpellchecker
[3] https://gitorious.org/indic-projects/aspell-kn
[4] https://gitorious.org/indic-projects/wordlist
[5] http://sanchaya.net/downloads/aspell-kn/
Posted by: copyninja on Friday, 6 May 2011
