GSoC 2018: Extending lttoolbox to have the power of HFST

This Google Summer of Code 2018 project with Apertium extended lttoolbox with support for morphographemics and lexical weights, so that its transducers can handle more complex translation tasks. The work primarily focused on lttoolbox and apertium-core.

Project Title

Extend lttoolbox to have the power of HFST

Work Done

Morphographemics

Implemented Two-Level Morphology (TWOL) rules within lttoolbox to handle morphographemic changes. This allows lttoolbox to handle more complex morphological transformations that were previously only possible with HFST.

Weights

att_compiler: Support for weights to lttoolbox binary format

Made the changes needed for a minimal implementation of weight-based analyses in the att_compiler. This allows morphological analyses to be weighted, enabling better disambiguation and more accurate translations.
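To make the input side concrete, here is a minimal sketch of reading weighted AT&T-style transducer text. It is not the att_compiler code. It assumes the usual AT&T text layout (arc lines with source, target, input symbol, output symbol and an optional weight; final-state lines with a state and an optional weight), and the names `Arc`, `parse_att` and `SAMPLE` are hypothetical. Treating a missing weight as 0.0 and adding weights along a path are also assumptions of this sketch.

```python
from typing import Dict, List, NamedTuple, Tuple


class Arc(NamedTuple):
    source: int
    target: int
    upper: str
    lower: str
    weight: float


def parse_att(text: str) -> Tuple[List[Arc], Dict[int, float]]:
    """Parse AT&T-style transducer text into arcs and final states.

    Arc lines have 4 or 5 tab-separated fields (the 5th is the weight).
    Final-state lines have 1 or 2 fields (the 2nd is the weight).
    A missing weight is assumed to be 0.0.
    """
    arcs: List[Arc] = []
    finals: Dict[int, float] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) >= 4:
            weight = float(fields[4]) if len(fields) > 4 else 0.0
            arcs.append(
                Arc(int(fields[0]), int(fields[1]), fields[2], fields[3], weight)
            )
        else:
            weight = float(fields[1]) if len(fields) > 1 else 0.0
            finals[int(fields[0])] = weight
    return arcs, finals


# A toy transducer spelling "cat"; the second arc carries no weight.
SAMPLE = "0\t1\tc\tc\t0.5\n1\t2\ta\ta\n2\t3\tt\tt\t1.5\n3\t2.0\n"

arcs, finals = parse_att(SAMPLE)
# Assumed convention: the weight of a path is the sum of its arc weights
# plus the weight of the final state it ends in.
path_weight = sum(arc.weight for arc in arcs) + finals[3]
```

In this toy input the only path passes through all three arcs, so `path_weight` is simply the sum of every weight in the file.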

lt-proc: Implement option to output n-best paths

Using the same option names as hfst-proc, we added options in lt-proc to output n-best paths using the weight values. This includes:

-W, --show-weights: Print final analysis weights (if any)
-N, --analyses: Output no more than N analyses (if the transducer is weighted, the N best analyses)
-L, --weight-classes: Output no more than N best weight classes
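The difference between limiting by analyses and limiting by weight classes can be sketched as follows. This is only an illustration of the selection logic, not lt-proc's implementation: the function names and the sample analyses are hypothetical, and it assumes that a "weight class" is the group of analyses sharing the same weight and that lower weights are better.

```python
from typing import List, Tuple

Analysis = Tuple[str, float]  # (analysis string, weight); lower is better


def n_best(analyses: List[Analysis], n: int) -> List[Analysis]:
    """Keep at most n analyses, lowest weight first."""
    return sorted(analyses, key=lambda a: a[1])[:n]


def n_best_weight_classes(analyses: List[Analysis], n: int) -> List[Analysis]:
    """Keep every analysis whose weight is among the n lowest distinct weights."""
    allowed = set(sorted({weight for _, weight in analyses})[:n])
    ranked = sorted(analyses, key=lambda a: a[1])
    return [a for a in ranked if a[1] in allowed]


# Hypothetical analyses of one surface form, in arbitrary order.
ANALYSES = [
    ("wind<vblex><inf>", 2.0),
    ("wind<n><sg>", 1.0),
    ("wind<vblex><imp>", 3.0),
    ("wind<vblex><pres>", 2.0),
]

top_two = n_best(ANALYSES, 2)
top_two_classes = n_best_weight_classes(ANALYSES, 2)
```

With these inputs, `top_two` holds two analyses, while `top_two_classes` holds three, because both analyses with weight 2.0 belong to the second-best class.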

Allow weights on entries in lttoolbox XML

Modified the DTD and parser to allow weights on entries in lttoolbox XML. This enables dictionary maintainers to specify weights directly in their dictionary files.
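As a rough illustration of what weighted entries might look like from a maintainer's point of view, the sketch below reads a simplified dictionary fragment with Python's standard XML parser. It assumes the weight is stored in a `w` attribute on each `<e>` element and that an entry without one defaults to 0.0; check the DTD for the actual attribute name and default. The fragment omits symbol tags for brevity, and `entry_weights` is a hypothetical helper, not part of lttoolbox.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Simplified, hypothetical bilingual dictionary fragment.
# The `w` attribute name is an assumption of this sketch.
DIX_SNIPPET = """
<section id="main" type="standard">
  <e w="1.0"><p><l>bank</l><r>banco</r></p></e>
  <e w="2.5"><p><l>bank</l><r>orilla</r></p></e>
  <e><p><l>house</l><r>casa</r></p></e>
</section>
"""


def entry_weights(xml_text: str, default: float = 0.0) -> List[Tuple[str, str, float]]:
    """Return (left, right, weight) for every <e> entry in the fragment."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    for entry in root.iter("e"):
        left = entry.findtext("p/l", default="")
        right = entry.findtext("p/r", default="")
        # Entries without a weight attribute fall back to the assumed default.
        weight = float(entry.get("w", default))
        entries.append((left, right, weight))
    return entries


entries = entry_weights(DIX_SNIPPET)
# Lowest weight first: the preferred translation of "bank" comes out on top.
bank_translations = sorted(
    (e for e in entries if e[0] == "bank"), key=lambda e: e[2]
)
```

Sorting by weight puts the entry marked 1.0 ahead of the one marked 2.5, which matches the lowest-weight-first ordering described for weighted output.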

Challenges

Overall, it was a wonderful and satisfying experience. I learned a great deal and had a great time coding for Apertium.

Along the way, though, a lot of unexpected challenges popped up that were very hard to get over. Debugging such a large C++ codebase while modifying three repositories simultaneously was a huge pain. I was stuck on one debugging task for a long time during the GSoC period. Had it not been for the help from my mentors and the other mentors at Apertium, I don't think I could have fixed that bug. The display of my laptop also broke during the second phase, so for almost 2-3 weeks I didn't have a stable system of my own set up for development and had a really hard time contributing.

Fortunately, all these issues got fixed, and I was able to make a proper implementation of weights within the desired period.

Work to be done

Now that we can weight our morphological analysers, generators and bilingual dictionaries, here are some problems that can be solved:

Having zero-context rules in your .lrx files. Now you can just put the weights directly in your bilingual dictionary. Analyses will be output according to lowest weight first, so you can mark your default translation as "1.0" and then all others as >1.0.
Improving POS-tagging accuracy by ordering analyses by probability. This way if your CG doesn't mop up all the ambiguity, you will get the best remaining analysis.
Dealing with non-standard forms. Instead of having to use LR/RL direction restrictions, you can just give non-standard forms a high weight and ask lt-proc to generate only the surface form with the lowest weight.

Acknowledgements

Thanks a lot to all my fellow apertiumers for making my journey with Apertium so wonderful so far. I was very fortunate to get this opportunity to work with this wonderful organisation. All the mentors are very helpful, and this project wouldn't have been possible without their constant help and guidance. There is always someone hanging out on IRC to help you. Thank you, fellas.

I'll definitely keep on contributing to Apertium after my GSoC.

Mentors: Francis Tyers and Tommi Pirinen

Links: