]:: NATools ::[

This package is not yet ready for non-technical users.

For installation instructions, see INSTALL.

For usage information: 
  visit http://natools.sf.net
  or    http://linguateca.di.uminho.pt/natools
  API information is also available from this site.

For a demo (Web interface to query and browse parallel corpora):
  visit http://eremita.di.uminho.pt/natools/nat

For the Perl API:
  visit http://linguateca.di.uminho.pt/natools/

For the C API:
  visit http://linguateca.di.uminho.pt/natools/html/index.html


____NATools - Getting Started

This page describes an easy way to get started with NATools. It is not the
only way to use the tools; for full information, read the documentation of
each tool.

=File formats

NATools understands two file formats for corpora: TMX and a NATools-specific
format. TMX is a standard; its specification is available from LISA.

The NATools-specific format uses two files, one for each language. Each
translation unit ends with a line containing just a dollar sign ($). A
translation unit may span more than one line.

Here is a simple example, showing the two files one after the other:

 I saw a cat .
 $
 The cat was 
 fat .
 $
            

 Eu vi um
 gato .
 $
 O gato era gordo .
 $
            

Note that both files must contain the same number of translation units, and
that the texts should already be tokenized.
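
Because both files must contain the same number of translation units, it can
help to count them before running the alignment. The following is a minimal
sketch, not part of NATools: the function names count_units and
same_unit_count are hypothetical, and tolerating whitespace around the dollar
sign is an assumption rather than documented behaviour.

```shell
# Count translation units in a NATools-format file, i.e. lines that
# contain only a dollar sign. Whitespace around the "$" is tolerated
# (an assumption, not documented NATools behaviour).
count_units() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function usable under "set -e".
    grep -c '^[[:space:]]*\$[[:space:]]*$' "$1" || true
}

# Report the unit counts of two files; succeed only if they match.
same_unit_count() {
    n1=$(count_units "$1")
    n2=$(count_units "$2")
    echo "$1: $n1 units, $2: $n2 units"
    [ "$n1" -eq "$n2" ]
}
```

With these functions loaded in the shell, same_unit_count lang1 lang2 prints
both counts and returns a non-zero status when they differ.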

=Bootstrapping from a TMX file

If you have a TMX file (with just two languages) you can bootstrap the NATools
alignment process using the nat-create script:

 [foo@bar]$  nat-create -tmx file.tmx
      

The script will ask you for a name for the corpus. Supply a name without
spaces. The script will create a directory with that name, where it stores the
files for the encoded corpus, the encoded lexicon and the probabilistic
translation dictionaries.
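
Since this method expects a TMX file with just two languages, it may be worth
listing the languages the file declares before running nat-create. The sketch
below is not part of NATools: tmx_languages is a hypothetical helper, it
relies on plain text matching rather than an XML parser, and it assumes that
each tuv element carries a double-quoted xml:lang attribute, as in TMX 1.4.
It also uses grep -o, which is available in GNU and BSD grep.

```shell
# List the distinct language codes found on <tuv> elements of a TMX file,
# one per line, upper-cased so that "en" and "EN" count as the same code.
# Assumes double-quoted xml:lang attributes (TMX 1.4 style).
tmx_languages() {
    grep -o '<tuv[^>]*xml:lang="[^"]*"' "$1" \
        | sed 's/.*xml:lang="\([^"]*\)"/\1/' \
        | tr '[:lower:]' '[:upper:]' \
        | sort -u
}
```

If tmx_languages file.tmx prints more than two lines, the file holds more than
two languages and does not fit the bootstrapping method described here.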

=Bootstrapping from a pair of NATools files

To use this method, you need a pair of files aligned at sentence level, in the
format specified above. In the command examples that follow, we will call
these files lang1 and lang2.

You can align them directly using the built-in language identification process:

 [foo@bar]$  nat-create lang1 lang2
      

You can also specify the languages explicitly, either to save time or because
the language identification process does not guess the languages correctly.
To do so, use:

 [foo@bar]$  nat-create -langs=PT..EN lang1 lang2
      

where the -langs switch specifies the languages involved, in the same order as
the supplied files (so lang1 should be in Portuguese, and lang2 should be in
English).

Both methods will ask you for a corpus name. Supply a name without spaces. The
script will create a directory with that name, where it stores the files for
the encoded corpus, the encoded lexicon and the probabilistic translation
dictionaries.

=Creating a textual Probabilistic Translation Dictionary file

In some cases it is useful to look at the Probabilistic Translation Dictionary
(PTD) extracted from the parallel corpus without using the NATools server. For
this, we can dump the PTD to a text file (in Perl Data::Dumper format, which
is readable by both humans and programs).

Use the nat-dumpDicts command for that. First, change the current directory to
the directory created by the corpus encoding process, and then execute:

 [foo@bar]$  nat-dumpDicts source.lex source-target.bin target.lex \
               target-source.bin > dict.txt
      

The file dict.txt will be created with the PTD.

=Using nat-server

If you have read the installation instructions, you know that the CGIs rely on
a server running on your machine. Other tools use this server as well, which
makes their access to the corpus faster.

The server needs a configuration file, which has a simple format. Lines
starting with a hash sign (#) are treated as comments and ignored. Every other
line should contain the absolute path to a directory created by the nat-create
command (or nat-shell). For instance, if you ran nat-create in the directory
/corpora/parallel and named the corpus EuroParl, you should add the following
line to your configuration file:

 /corpora/parallel/EuroParl
      

The server will then configure each corpus based on the nat.cnf configuration
file present in each of those corpus directories.
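
The configuration rules above can be checked before starting the server. The
following is a minimal sketch, not part of NATools: check_server_config is a
hypothetical helper, and skipping blank lines is an assumption, since the
format description only mentions comments and paths.

```shell
# Check a server configuration file: every line that is not a comment
# must be an absolute path to a directory containing a nat.cnf file.
# Blank lines are skipped (an assumption, not documented behaviour).
check_server_config() {
    status=0
    # "|| [ -n "$line" ]" also handles a last line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            '#'*|'') continue ;;    # comment or blank line
            /*) ;;                  # absolute path: check it below
            *) echo "not an absolute path: $line"
               status=1
               continue ;;
        esac
        if [ ! -f "$line/nat.cnf" ]; then
            echo "no nat.cnf found in: $line"
            status=1
        fi
    done < "$1"
    return "$status"
}
```

Running check_server_config on the configuration file prints one message per
problematic line and returns a non-zero status if any line fails the check.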

To start the server, use:

 [foo@bar]$  nat-server /path/to/the/config/file.cfg