This is an implementation of a conditional random field (CRF) sequence tagger in C++. Development is ongoing, and only a limited number of possible features for training are currently implemented. However, the CRF will train a model and can use it to tag new sentences.
Currently, the underlying CRF training code as well as part-of-speech, chunking, and named-entity tagging layers are implemented.
If you're interested in how CRFs work, there is a brief introduction in `src/include/crf/tagger.h`. There are also good tutorials on the web, such as Sutton and McCallum (2012).
The code has an external dependency on the excellent libLBFGS library by Naoaki Okazaki. Otherwise, it only uses the standard C++ library, and does not require C++11. It should compile with any reasonably modern version of g++ or clang++.
- Clone the repository or download the code
- `mkdir ext`
- Build libLBFGS and install it in the `ext` directory, so it looks like this: `ext/lbfgs/{include,lib,share}`
- Compile with `make`. Binaries will be placed in the `bin` directory
- You may need to set your `DYLD_LIBRARY_PATH` or `LD_LIBRARY_PATH` for the code to run: `export DYLD_LIBRARY_PATH=/path/to/ext/lbfgs/lib`
- Formats are controlled by the "--ifmt" and "--ofmt" command line options.
- The Chunker and NER taggers currently only read the CoNLL 2000 and CoNLL 2003 data formats. The POS tagger can read flexible formats
- All the taggers can produce output in flexible formats
- There is a mini printf-style language for specifying input and output formats. The format specifies how each word in the sentence should be formatted along with its accompanying tags. Each word in the sentence is printed in the same way.
- Formats look like the following (`+` means "at least one"): `"<sent_pre>(<format><sep>)+<word_sep><sent_pos>"`
  - `<sent_pre>` is a string printed before each sentence
  - `<format>` is one of the format strings
  - `<sep>` is a one character (only) separator between format items (escapes like `\n` are allowed)
  - `<word_sep>` is a one character (only) separator between each word block (escapes like `\n` are allowed)
  - `<sent_pos>` is a string printed at the end of each sentence
- Available format strings are:
  - `%w` for the word
  - `%p` for the part of speech tag
  - `%c` for the chunk tag
  - `%e` for the entity tag
- Note that you should only print out format strings that are actually present in the input or produced by the tagger.
- For example, to produce output from the chunker in the CoNLL 2000 evaluation format:
--ofmt "%w %p %c %e\n\n\n"
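To make the grammar above concrete, here is a minimal sketch of how a format string could be split into its parts. It is not the repository's parser: `ParsedFormat`, `decode_escapes`, and `parse_format` are hypothetical names, and the escape handling (only `\n`, `\t`, and a backslash-quoted literal character) is an assumption. The sketch reads the grammar literally, so every `<format>` is followed by exactly one `<sep>` character, and `<sent_pre>` cannot contain `%`. It is written without C++11 features.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for the pieces of a format string
// (illustrative only; not a type from this repository).
struct ParsedFormat {
  std::string sent_pre;                        // printed before each sentence
  std::vector<std::pair<char, char> > items;   // (format letter, separator)
  char word_sep;                               // separator between word blocks
  std::string sent_pos;                        // printed after each sentence
};

// Turn backslash escapes into single characters. Assumption: only \n and \t
// are special; any other escaped character stands for itself.
static std::string decode_escapes(const std::string &raw) {
  std::string out;
  for (std::size_t i = 0; i < raw.size(); ++i) {
    if (raw[i] == '\\' && i + 1 < raw.size()) {
      ++i;
      switch (raw[i]) {
        case 'n': out += '\n'; break;
        case 't': out += '\t'; break;
        default:  out += raw[i]; break;
      }
    } else {
      out += raw[i];
    }
  }
  return out;
}

// Split "<sent_pre>(<format><sep>)+<word_sep><sent_pos>" into its parts.
ParsedFormat parse_format(const std::string &raw) {
  const std::string s = decode_escapes(raw);
  const std::string known = "wpce";  // %w, %p, %c, %e
  ParsedFormat fmt;
  std::size_t i = 0;

  // Everything before the first '%' is the sentence prefix.
  while (i < s.size() && s[i] != '%') fmt.sent_pre += s[i++];

  // One or more "<format><sep>" pairs: '%', a letter, one separator char.
  while (i < s.size() && s[i] == '%') {
    if (i + 2 >= s.size())
      throw std::invalid_argument("format item needs a letter and a separator");
    if (known.find(s[i + 1]) == std::string::npos)
      throw std::invalid_argument("unknown format letter");
    fmt.items.push_back(std::make_pair(s[i + 1], s[i + 2]));
    i += 3;
  }
  if (fmt.items.empty())
    throw std::invalid_argument("at least one format item is required");

  // Exactly one word separator, then the rest is the sentence suffix.
  if (i >= s.size())
    throw std::invalid_argument("missing word separator");
  fmt.word_sep = s[i++];
  fmt.sent_pos = s.substr(i);
  return fmt;
}
```

Under this reading, the example format string above splits into four items (`%w`, `%p`, `%c` each followed by a space, and `%e` followed by a newline), a newline as `<word_sep>`, and a single newline as `<sent_pos>`.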
- `bin/train_pos` will train a model for POS tagging.
- `bin/pos` will take a model produced by `train_pos` and use it to tag sentences
- The software reads pipe-formatted input by default as described by command line options. Check the `--help` flag for more info. Custom formats can also be used.
- `bin/train_chunk` will train a model for chunking.
- `bin/chunk` will take a model produced by `train_chunk` and use it to tag sentences
- The software will currently only read data in the CoNLL 2000 chunking shared task format. This data is available at http://www.cnts.ua.ac.be/conll2000/chunking/.
- `bin/train_ner` will train a model for NER tagging.
- `bin/ner` will take a model produced by `train_ner` and use it to tag sentences
- Run `bin/ner --help` for a description of program options. The software will currently only read the CoNLL 2003 NER shared task formatted input (see http://www.cnts.ua.ac.be/conll2003/ner/ for more information).
This code is licensed for academic (non-commercial) use. Contact me for licensing terms if you wish to use any or all of this code for any non-academic purpose.