A corrupted translation memory in censhare contains malformed segments. These cause errors when you export segments or update the Translation Memory Index. The translation memory can also contain duplicated segments. censhare provides a server action that detects both situations. Learn how to check and repair the translation memory.


Context

The server action to check the integrity of the translation memory is provided in the censhare Admin Client. This server action reports the segment IDs of entries in the database that are corrupted or duplicated. Reconciliation of the database must be done with an external tool, for example, a database management tool.

New censhare installations with version 2018.2.1 or higher are normally not affected. Since then, censhare prevents:

  • The import and export of segments (TMX) if they have a malformed XML structure. 

  • Duplicate segment pairs in the translation memory.

  • User mistakes, for example, removing a closing tag.

Prerequisites

  • To reconcile corrupted segments, you need XML know-how to edit the segments and verify that the segments are well-formed.

  • To edit or delete segment pairs in the database, you need SQL know-how.

  • You also need a database management tool and permission to access the database. 

  • You are familiar with the concept of translation memory in censhare. 

Introduction

Translation with memory in censhare uses the translation memory to propose translation suggestions to users. To store the translation memory, censhare uses a relational database. To search and read the translation memory, censhare uses the censhare database (cdb). For this purpose, the cdb contains the so-called inverted index for translation memory.

When the translation memory changes, censhare writes the changes into the relational database. The cdb must then be updated so that it contains the same translation memory as the relational database. If any segments are malformed, the update process fails. The export of segments (TMX) fails as well. The Translation with memory integrity server action in the censhare Admin Client reports the malformed segments. The report indicates which segments in the relational database must be repaired.

In versions before censhare 2018.2.1, censhare can create duplicate segments. Translators then see several segments with the same content in the Memory widget. The Translation with memory integrity server action reports duplicated segments. The report indicates which duplicate segments in the relational database must be deleted. Because these segments are duplicates, deleting them causes no data loss.

Note: In the following sections of this article, the inverted index for translation memory in the cdb is referred to as cdb. The relational database is referred to as the database.    

Note: We recommend that you run the Translation with memory integrity server action when you update a censhare system to version 2018.2.1 or higher from an earlier version.

Steps    

  • Execute the Translation with memory integrity server action.

  • Note the segment IDs of malformed segments.

  • Note the segment IDs of duplicated segments.

  • Access the database with a management tool.

  • For each malformed segment ID, repair the two segments.

  • For each duplicate segment ID, delete the two segments from the database.

  • Check the database again with the Translation with memory integrity server action.

When to check translation memory integrity

Check the integrity of the translation memory in the database in the following cases:    

  • The TMX export fails.

  • The manual or automatic update of the inverted index for the cdb fails. 

  • A translator cannot open a text asset and receives this error message: "Error while trying to run pipeline line icml2xliff on asset with ID XXXX". XXXX is the ID of the asset the translator tries to open.    

The error message of the failed server action can give you a hint. Here is the part of an example of an error message:

The element type "it" must be terminated by the matching end-tag "</it>".

The message says that the XML of at least one segment in the translation table is not well-formed. There is an "<it>" opening tag that is not followed by a closing "</it>" tag.

Repair the translation memory

Execute the report

The Translation with memory integrity server action creates a report. First, the report lists the IDs of segment pairs where one or both segments of the pair contain malformed XML. You find these IDs in the Malformed segments row of the report.

Second, the report lists the IDs of segment pairs with duplicated content. You find these IDs in the Duplicated segments row of the report.

Repair the translation table

To reconcile the database, you need to access the database itself. Typically, a database management tool is used to repair the database. To see a malformed segment pair, open an SQL query window in the tool and enter the command (tested in pgAdmin 4):

select * from translation
         where segment_id='SEGMENT_ID';
SQL

The SEGMENT_ID stands for an ID that is listed in the translation memory integrity report.

The segment itself is stored in the segment column of the database. Copy the content of the segment entry into an XML editor and check it. The XML must be well-formed. If needed, repair the XML. Copy the repaired XML back into the segment entry and check the other segment with the same segment ID.

Repeat this process for all segment IDs that are listed in the report and commit your changes in the database. Execute the Translation with memory integrity server action again to confirm that no malformed segments remain. After the integrity check, update the inverted index of the cdb with the Rebuild Translation Memory Index server action.

Note: The manual update is needed because the manual edits of the database do not create an event that triggers the automatic server action for the update.
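The manual check described above can also be scripted when the report lists many IDs. The following is a minimal sketch, not part of censhare: it uses an in-memory SQLite table as a stand-in for the real database and assumes only the translation table with the segment_id and segment columns named in this article. The function name and the sample data are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def malformed_segment_ids(conn):
    """Return the sorted segment IDs that have at least one row whose XML does not parse."""
    bad_ids = set()
    rows = conn.execute("SELECT segment_id, segment FROM translation")
    for segment_id, segment in rows:
        try:
            ET.fromstring(segment)
        except ET.ParseError:
            bad_ids.add(segment_id)
    return sorted(bad_ids)


# Stand-in data (assumption): two segment pairs, one of them with a missing </it>.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE translation (segment_id TEXT, segment TEXT)")
conn.executemany(
    "INSERT INTO translation VALUES (?, ?)",
    [
        ("1001", "<root><it>Alice</it> was tired.</root>"),
        ("1001", "<root><it>Alice</it> war matt.</root>"),
        ("1002", "<root><it>Down the rabbit hole.</root>"),
        ("1002", "<root><it>Hinab in den Kaninchenbau.</it></root>"),
    ],
)
conn.commit()

print(malformed_segment_ids(conn))  # ['1002']
```

Against a production system, you would replace the SQLite connection with a connection to your own database. The integrity report of the server action remains the authoritative list.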

In the second step, delete all duplicate segments from the database.

Note: All segment pair IDs in the report are duplicates.    

Note: Be sure that you have a current backup of your database before you delete any records!

Delete a duplicated segment pair with the following SQL statement (tested in pgAdmin 4):    

delete from translation
            where segment_id='SEGMENT_ID';
SQL

Note: Deleting data directly in a database table is risky. Only experienced database administrators should do this!
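To reduce the risk of deleting more than intended, the delete can be wrapped in a transaction that is committed only if the expected number of rows was removed. The sketch below illustrates the idea with an in-memory SQLite table as a stand-in. The function name, the sample IDs, and the expectation of exactly two rows per segment ID are assumptions derived from this article, and the `?` placeholder syntax is specific to SQLite.

```python
import sqlite3


def delete_segment_pair(conn, segment_id, expected_rows=2):
    """Delete the rows of one segment ID; roll back unless exactly expected_rows were removed."""
    cur = conn.execute("DELETE FROM translation WHERE segment_id = ?", (segment_id,))
    if cur.rowcount != expected_rows:
        conn.rollback()  # unexpected row count: leave the table untouched
        return False
    conn.commit()
    return True


# Stand-in data (assumption): ID 2001 has two rows, ID 2002 unexpectedly has three.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE translation (segment_id TEXT, segment TEXT)")
conn.executemany(
    "INSERT INTO translation VALUES (?, ?)",
    [("2001", "<root>a</root>"), ("2001", "<root>b</root>"),
     ("2002", "<root>c</root>"), ("2002", "<root>d</root>"), ("2002", "<root>e</root>")],
)
conn.commit()

print(delete_segment_pair(conn, "2001"))  # True
print(delete_segment_pair(conn, "2002"))  # False
```

The second call returns False and leaves the three rows in place, so an administrator can inspect them before deciding what to do.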

How to prevent malformed and duplicate segments    

Once you have cleaned your system, there should be no further problems with malformed and duplicated segments. Since censhare 2018.2.1, the system prevents duplicated segments from being inserted into the translation table of the database.

The TMX import prevents the import of malformed segments and protects the translation memory. You only need to repair the translation memory once.

Introduction to the translation memory

The translation memory in the database    

In the database, the translation table contains all segments of the translation memory. When a translator confirms a segment in the Segments widget, censhare writes two entries into the translation table. The first entry contains the source language segment in the segment field. The second entry contains the target language segment. Both entries share the same segment ID. You find the ID in the segment_id column in the table. If you search via SQL for a certain segment ID, you always receive two entries.
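The rule that every segment ID has exactly two entries can be verified with a grouped query. The following sketch runs such a query against an in-memory SQLite stand-in. Only the translation table and the segment_id column come from this article; the sample data is hypothetical.

```python
import sqlite3

# List every segment ID whose number of rows differs from the expected two.
QUERY = """
SELECT segment_id, COUNT(*) AS entries
FROM translation
GROUP BY segment_id
HAVING COUNT(*) <> 2
ORDER BY segment_id
"""

# Stand-in data (assumption): ID 3001 is a complete pair, ID 3002 lacks its second entry.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE translation (segment_id TEXT, segment TEXT)")
conn.executemany(
    "INSERT INTO translation VALUES (?, ?)",
    [("3001", "<root>source</root>"), ("3001", "<root>target</root>"),
     ("3002", "<root>source only</root>")],
)
conn.commit()

incomplete = conn.execute(QUERY).fetchall()
print(incomplete)  # [('3002', 1)]
```

An empty result means that all segment IDs follow the two-entry rule. Any other count points to an inconsistency that is outside the scope of this article.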

The segment column stores the content of the segment in XML. For example:

<?xml version="1.0" encoding="UTF-8"?>
<root><it>Alice was beginning to get very </it><it>tired</it>
of sitting by her sister on the bank, and of having nothing to do.</root>
XML

The XML of the segment must be well-formed. For example, for each opening XML tag like "<it>" there has to be a closing XML tag like "</it>". 

If the XML structure is not well-formed, the segment content is malformed. For example:

<?xml version="1.0" encoding="UTF-8"?>
<root><it>Alice was beginning to get very </it><it>tired of sitting
by her sister on the bank, and of having nothing to do.</root>
XML

The closing "</it>" tag for the "<it>" tag before "tired" is missing. 
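The difference between the two examples can be checked with any XML parser. The sketch below uses the parser from the Python standard library. The function name is hypothetical, and the wording of the parser's error message differs from the message that the censhare server reports.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def well_formedness_error(segment_xml: str) -> Optional[str]:
    """Return None if the XML parses, otherwise the parser's error message."""
    try:
        ET.fromstring(segment_xml.encode("utf-8"))
    except ET.ParseError as err:
        return str(err)
    return None


well_formed = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<root><it>Alice was beginning to get very </it><it>tired</it>\n"
    "of sitting by her sister on the bank, and of having nothing to do.</root>"
)
malformed = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<root><it>Alice was beginning to get very </it><it>tired of sitting\n"
    "by her sister on the bank, and of having nothing to do.</root>"
)

print(well_formedness_error(well_formed))  # None
print(well_formedness_error(malformed))    # message with the line and column of the problem
```

The reported position is where the parser meets the unexpected "</root>" end tag, which helps to locate the unclosed "<it>" element.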

The inverted index in the cdb

censhare uses the inverted index in the translation editor to look up matching segments in the translation memory in the database. These matches are presented in the Memory widget for the current segment in the Segments widget.

If the database and the cdb are not synchronized, the Memory widget can show inconsistent results for several reasons:

  • A segment pair is stored in the database and the cdb is not updated.

  • The cdb contains segments that no longer exist in the database.

  • A segment pair is deleted from the database and the cdb is not updated.
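The cases in this list reduce to two set differences between the segment IDs in the database and the segment IDs in the inverted index. The sketch below is purely conceptual: this article does not describe a way to list the IDs held in the cdb, so both ID lists are hypothetical inputs, as are the function and key names.

```python
def compare_ids(database_ids, index_ids):
    """Return the IDs that exist in only one of the two stores."""
    database_ids, index_ids = set(database_ids), set(index_ids)
    return {
        # Stored in the database, but the cdb was not updated.
        "missing_in_cdb": sorted(database_ids - index_ids),
        # Deleted from the database, but still present in the cdb.
        "stale_in_cdb": sorted(index_ids - database_ids),
    }


result = compare_ids(database_ids=["10", "11", "12"], index_ids=["11", "12", "13"])
print(result)  # {'missing_in_cdb': ['10'], 'stale_in_cdb': ['13']}
```

Both lists are empty when the database and the cdb are synchronized.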

If a user confirms a segment, censhare writes the segment pair into the database and creates an event. This event tells the Update Translation Memory Index (automatic) server action to update the cdb. The server action then updates the inverted index. The same happens if the user edits or deletes a segment pair in the Memory widget.

The Update Translation Memory Index (on start-up) server action updates the cdb when the server starts. The Rebuild Translation Memory Index server action starts the update process manually. Manual updates are usually not needed. In special cases, it is necessary to update the cdb after changes in the database, for example, if entries are edited directly in the translation table.

The update process fails if the segment column of a processed entry contains malformed XML.

The TMX format is used to exchange translation memory between different translation systems. Import a TMX file to add segments to the translation memory. Export a TMX file, for example, to share your segments with a translation agency.

If the database contains segments with malformed XML, the export of a TMX file fails. You can still import a TMX file that has well-formed XML. But you cannot update the inverted index of the cdb after that.

You cannot import a TMX file that contains malformed XML. For more information on the import/export, see Related articles.

Duplicated segments in the database

When a user confirms a segment, censhare writes the corresponding segment pair into the database. Before censhare 2018.2.1, the system did not check if a pair with the same content already exists in the database. censhare just created two new entries in the database. If a text frequently changes and is translated, many segment pairs with the same content can exist in the database.

Duplicate entries make it difficult for a translator to see other suggested matches in the Memory widget. For instance, if there are duplicated segments with a match of 95 percent, the translator has to scroll in the widget to see another segment pair with a 93 percent match. Because the translator cannot see at a glance whether segments have the same content, they have to look at each segment pair individually.
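To illustrate what "segment pairs with the same content" means at the table level, the following sketch groups segment IDs by the content of their two entries. It does not show how the Translation with memory integrity server action works internally. It uses an in-memory SQLite stand-in with only the segment_id and segment columns named in this article, and it ignores which entry is the source and which is the target because the article does not name a language column.

```python
import sqlite3
from collections import defaultdict


def duplicate_pairs(conn):
    """Return groups of segment IDs whose entries have identical content."""
    contents = defaultdict(list)
    rows = conn.execute("SELECT segment_id, segment FROM translation ORDER BY segment_id")
    for segment_id, segment in rows:
        contents[segment_id].append(segment)

    by_content = defaultdict(list)
    for segment_id, segments in contents.items():
        by_content[tuple(sorted(segments))].append(segment_id)

    return [ids for ids in by_content.values() if len(ids) > 1]


# Stand-in data (assumption): pairs 4001 and 4003 have the same content.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE translation (segment_id TEXT, segment TEXT)")
conn.executemany(
    "INSERT INTO translation VALUES (?, ?)",
    [("4001", "<root>Hello</root>"), ("4001", "<root>Hallo</root>"),
     ("4002", "<root>Goodbye</root>"), ("4002", "<root>Tschuess</root>"),
     ("4003", "<root>Hello</root>"), ("4003", "<root>Hallo</root>")],
)
conn.commit()

print(duplicate_pairs(conn))  # [['4001', '4003']]
```

In each group, all IDs refer to the same content, so only one pair per group carries information that the others do not.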

Steps

Execute integrity check    

  1. Open the censhare Admin Client.

  2. Click the Server actions icon and select Translation with memory integrity.

  3. Note all segment IDs in the Duplicated segments row.

  4. Note all segment IDs in the Malformed segments row. 

Edit segments

  1. Create a backup of your database if you have not yet created one.

  2. Connect to the database through your database management tool.

  3. Go to the translation table in the database.

  4. Search for each malformed segment ID.

  5. Check whether one or both of the two table entries of the segment pair are malformed. You find the segment content in the segment column.

  6. Edit the malformed content in the segment field and verify that the XML is well-formed.

  7. Commit your database changes.    
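If you script the edit instead of typing it into the management tool, you can verify the repaired XML before it is written. The sketch below shows one way to do this with an in-memory SQLite stand-in. The function name is hypothetical, and the row is identified by its old content because this article does not name a row key for the translation table.

```python
import sqlite3
import xml.etree.ElementTree as ET


def replace_segment(conn, segment_id, old_xml, new_xml):
    """Overwrite one entry of a segment pair, but only if new_xml is well-formed."""
    ET.fromstring(new_xml)  # raises ET.ParseError if new_xml is not well-formed
    cur = conn.execute(
        "UPDATE translation SET segment = ? WHERE segment_id = ? AND segment = ?",
        (new_xml, segment_id, old_xml),
    )
    if cur.rowcount != 1:
        conn.rollback()  # zero or several rows matched: change nothing
        raise LookupError("expected exactly one matching row")
    conn.commit()


# Stand-in data (assumption): the first entry of pair 5001 lacks a closing </it>.
broken = "<root><it>tired of sitting.</root>"
repaired = "<root><it>tired</it> of sitting.</root>"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE translation (segment_id TEXT, segment TEXT)")
conn.executemany(
    "INSERT INTO translation VALUES (?, ?)",
    [("5001", broken), ("5001", "<root><it>muede</it> vom Sitzen.</root>")],
)
conn.commit()

replace_segment(conn, "5001", broken, repaired)
print(conn.execute("SELECT COUNT(*) FROM translation WHERE segment = ?", (repaired,)).fetchone()[0])  # 1
```

Because the parse check runs before the UPDATE statement, a repair attempt that is still malformed never reaches the table.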

Delete segments

  1. Create a backup of your database if you have not yet created one.

  2. Connect to the database through your database management tool.

  3. Go to the translation table in the database.

  4. Delete all rows from the translation table where the segment_id matches an ID in the Duplicated segments row of the Translation with memory integrity report.

  5. Commit your database changes.

Update the inverted index

  1. Update the inverted index for translation memory of the cdb with Rebuild Translation Memory Index.    

Configuration

Integrity check for translation memory (server action)

The Translation with memory integrity server action does not need any configuration. The server action is available by default.

Update translation memory

The server actions to update the inverted index are located in the Configuration/Modules/Translation folder of the censhare Admin Client:    

  • Rebuild Translation Memory Index

  • Update Translation Memory Index (automatic)

  • Update Translation Memory Index (on start-up)

Rebuild Translation Memory Index (server action)

By default, the server action is not active:

  1. Open the Rebuild Translation Memory Index entry.

  2. Check Enabled in the General section and click OK. 

  3. Update the server configuration.

  4. Synchronize the configuration with remote servers, if applicable.    

Update Translation Memory Index (automatic) (server action)

The server action is activated by default. No configuration is required.    

Update Translation Memory Index (on start-up) (server action)

The server action is activated by default. No configuration is required.

Result

You have checked if the translation memory in the database has malformed or duplicate segments. You repaired the malformed segments and removed the duplicate segments in the translation table of the database. Finally, you manually updated the inverted index for translation memory in the cdb.