This article introduces the Translation with Memory (TwM) application in censhare Web from a developer's point of view. It describes the underlying architecture, in particular how the Babelfish service on the back-end side works and integrates with the asset application on the front-end side. The article also introduces the Okapi library, on which the service is built, and describes how to configure Okapi pipelines.

Audience

  • Support
  • IT
  • Professional Services
  • Pre-sales
  • Developers

How does it work

The Translation with Memory application, which was introduced with censhare Web, consists of three main modules:

  • Client application on front-end side; connects to the
  • TwM asset application which provides the view model and the domain model; connects to the
  • Translation service (“Babelfish” service)

Translatable contents are extracted from the asset, pre-translated, and made available to the end user on the asset page’s translation tab. After translation, i.e. on “Save” or “Save and Close”, the translated contents are written back to the asset. The translation memory is updated on the fly when the user confirms a translated segment.

How is it structured

The following graphic illustrates the overall architecture:
Architecture Overview

On the front-end side, i.e. censhare Web, the user interacts with the application through four widgets (Segments, Terminology, Matches, and Statistics), which are located on the translation tab of the asset page. This tab is visible for ICML and XML assets that are variants with a language different from that of their parent asset. The widgets are connected to the client application, which operates on the view model and calls the interface methods of the TwM asset application.

On the back-end side, the Babelfish service is the single point of interaction for the asset application to retrieve and modify contents from the translation memory and the terminology database. This service also takes care of running the Okapi pipelines (see next section) and delivers the results back to the asset application. The actual pipeline is file-based, and a temporary zip-file with the generated XLIFF is part of the domain model. The zip-file contains a specific directory structure and a manifest file, which contains details about the pipeline and its configuration. During back-conversion, this manifest file is the input to the pipeline. The temporary zip-file is required to regenerate the translated target contents, so make sure it doesn’t get lost or destroyed by Vogons.

The domain model of the asset application primarily consists of the XLIFF 2.0 object that is generated by the Okapi pipeline. This object is kept in memory and updated as the user makes changes. It is also used to read TM matches and glossary entries. The view model is generated from this object and consists of the contents that are displayed to the end user (segments with their status, terminology, TM matches).
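For orientation, a minimal XLIFF 2.0 document of the kind the pipeline produces looks like this (element names follow the XLIFF 2.0 specification; the ids, languages, and contents are illustrative):

```xml
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0"
       srcLang="en" trgLang="de">
  <file id="f1">
    <unit id="u1">
      <segment id="s1" state="translated">
        <source>Save and close</source>
        <target>Speichern und schließen</target>
      </segment>
    </unit>
  </file>
</xliff>
```

Each `<segment>` carries a state (e.g. `initial`, `translated`, `final`), which is what the Segments widget surfaces to the user.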

How is it configured

The widget settings, which provide a range of options for configuring the application, are stored in the user preferences under the key com_censhare_Translation. The pipelines are configured in the service's config.xml, the filters in the “okf_” files, and the segmentation rules in an SRX file stored under modules/xliff2/LanguageTool_SRX.srx.
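To illustrate the SRX format (this is a minimal example, not the contents of the shipped LanguageTool_SRX.srx), a rule file that breaks segments after sentence-ending punctuation, while skipping common abbreviations, could look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="yes"/>
  <body>
    <languagerules>
      <languagerule languagerulename="Default">
        <!-- exception: do not break after "e.g." or "i.e." -->
        <rule break="no">
          <beforebreak>\b(e\.g|i\.e)\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <!-- break after ., ! or ? followed by whitespace -->
        <rule break="yes">
          <beforebreak>[.!?]</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern=".*" languagerulename="Default"/>
    </maprules>
  </body>
</srx>
```

Rules are evaluated in order, so exceptions (break="no") must precede the general break rules they override.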

As a special censhare feature, certain contents can be protected from modification by the translator (e.g. numbers). These contents are identified by means of one or more regular expressions stored in a special resource asset (key: censhare:configuration.translation).
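The protection idea can be sketched in a few lines (a conceptual illustration only; the function names and the example pattern are hypothetical, not the actual censhare implementation): each configured regular expression is applied to the segment, and every match is masked with a placeholder before the text reaches the translator, then restored afterwards.

```python
import re

# Hypothetical example pattern: protect numbers (incl. decimals) from editing.
PROTECT_PATTERNS = [re.compile(r"\d+(?:[.,]\d+)?")]

def protect(text):
    """Replace every protected match with an indexed placeholder.

    Returns the masked text and a mapping used to restore the
    original contents after translation.
    """
    placeholders = {}
    counter = 0
    for pattern in PROTECT_PATTERNS:
        def repl(match):
            nonlocal counter
            key = f"{{{{{counter}}}}}"  # e.g. "{{0}}"
            placeholders[key] = match.group(0)
            counter += 1
            return key
        text = pattern.sub(repl, text)
    return text, placeholders

def restore(text, placeholders):
    """Put the protected contents back into the translated text."""
    for key, original in placeholders.items():
        text = text.replace(key, original)
    return text

masked, mapping = protect("Order 42 items for 3.99 EUR")
# masked == "Order {{0}} items for {{1}} EUR"
restored = restore("Bestellen Sie {{0}} Artikel für {{1}} EUR", mapping)
# restored == "Bestellen Sie 42 Artikel für 3.99 EUR"
```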

Okapi pipelines

General use

The Translation with Memory application is largely based on an open-source localization framework called “Okapi”. The Okapi framework provides a pipeline mechanism, which can be used to extract translatable contents from a range of formats, convert them to XLIFF, pre-translate (“leverage”) this text against various TMs, insert terminology, recreate the original format from the translated XLIFF, and much, much more.

Each pipeline consists of a number of pipeline steps, which are executed in sequence. Each step can have additional configuration parameters. In addition, the filters that are used during the initial extraction step can also be configured. This is done in the filter configuration files (file names starting with “okf_”) in the Babelfish service directory.

In order to test and create specific pipelines, the stand-alone application Rainbow (which is part of the Okapi framework) is useful. It can be used to manually construct Okapi pipelines with their associated configurations and filters, and apply these pipelines to specific files.

censhare specific use

In the censhare system, the Okapi pipelines are configured in XML format in the service configuration for the Babelfish service. At present, we have two pairs of pipelines for converting from/to the two formats that are currently supported for translation: ICML2XLIFF/XLIFF2ICML and XML2XLIFF/XLIFF2XML.

To maintain the flexibility of working with the Okapi pipelines, we have created two special censhare components, which are implemented in these classes: CenshareTMConnector and CenshareTerminologyConnector. These components (both of which are part of the project censhare-ExternalPlugins) are used by the pipeline steps LeveragingStep and TerminologyLeveragingStep, respectively. They provide a read-only connection to the TM and term base to automatically leverage contents during the pipeline execution.
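The leveraging idea itself can be sketched independently of Okapi (a conceptual illustration, not the CenshareTMConnector implementation; the TM contents and the similarity measure are stand-ins): each extracted segment is looked up in the TM, and the best match above a similarity threshold is offered as a pre-translation.

```python
from difflib import SequenceMatcher

# Toy translation memory: source segment -> target segment.
TM = {
    "Save and close": "Speichern und schließen",
    "Open the asset page": "Öffnen Sie die Asset-Seite",
}

def leverage(segment, threshold=0.8):
    """Return the best TM match above the threshold, or None.

    The score here is a simple character-based similarity; real TM
    engines use more sophisticated token-level fuzzy matching.
    """
    best = None
    best_score = 0.0
    for source, target in TM.items():
        score = SequenceMatcher(None, segment, source).ratio()
        if score > best_score:
            best, best_score = (source, target, score), score
    return best if best_score >= threshold else None

exact = leverage("Save and close")            # 100% match
fuzzy = leverage("Save & close")              # fuzzy match above threshold
miss = leverage("Completely different text")  # returns None
```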

Troubleshooting – common mistakes/errors and how to solve them

When working with the translation application, a few common scenarios can lead to problems. So, before investigating issues further, the following should be checked:

  • The inverted index service must be running; if it is not, there will be an error in the service log. To fix this, start the service in the admin client.
  • Updates of the inverted index fail due to corrupted contents in the translation table (each segment must be valid XML). In this case, there will be an XML parsing exception in the error log. To fix this, check the offending segment in the translation table and make sure it is valid XML. Alternatively, you can clear the contents of the translation table and start from scratch with an empty TM (not recommended for production systems).
  • CDB broken; the inverted index might also be affected by a corrupted CDB. If the index is not updating and newly added segments are not matched against the TM on refresh, check the server log for errors relating to the CDB. To fix this, delete the CDB and restart the server.
  • Asset structure not set up correctly; if the translation tab does not appear, it might be that the asset structure is wrong. To fix this, make sure the asset page is opened on the target language asset and that this asset is indeed a variant asset.
  • TwM must be activated on the system asset; if the translation tab does appear, but it does not show the four widgets of the TwM application, then the application might not be activated. To fix this, go to the System asset and activate the TwM application.
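The well-formedness check for translation table segments mentioned above can be scripted; a minimal sketch (the example segments are illustrative, not real table contents):

```python
import xml.etree.ElementTree as ET

def is_valid_xml(segment):
    """Return True if the segment parses as well-formed XML."""
    try:
        ET.fromstring(segment)
        return True
    except ET.ParseError:
        return False

segments = [
    "<seg>Save and close</seg>",        # well-formed
    "<seg>Broken &nbsp; entity</seg>",  # undefined entity -> parse error
]
invalid = [s for s in segments if not is_valid_xml(s)]
```

Running such a check over all rows of the translation table quickly isolates the offending segment that blocks the index update.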