censhare supports date/time and number variables for optimized segment reuse. Define rules to recognize and format variables in each language. 


Context

Variables are part of the Translation with memory application. In a text to be translated, variables are elements that are not translated but may require different formatting. For example, dates have different formats in different languages, although the actual date does not change. The Translation with memory application can handle these elements in a defined way. The rules to detect variables and process their content depend on the individual use case. Therefore, censhare comes with one predefined rule that just matches numbers.

The definition of rules is done in the "Translation" service within the censhare Admin-Client.

Prerequisites

The Translation with memory application needs to be set up and running.

A deep understanding of regular expressions is needed. They define which text is recognized as a variable.

A deep understanding of the patterns in Java is needed: how they are used to parse and format dates and numbers.

Introduction

Translation with memory automatically splits the text to be translated into segments. Each segment is then translated and stored. Text parts like date, time or numbers do not need to be translated. They just need to be copied from the source to the target segment. The concept of variables is used to represent this content in the source and the target segment.

Variables represent specific content in a segment like a date, time or numeric values, so-called numbers. Numbers can be any numeric value. There is no differentiation if this is an integer or a floating-point figure. censhare handles them all as text strings. The same is true with dates and time entries. As everything is a text string, no calculation is possible with variables.

However, depending on the syntactic rules, a variable does not always appear at the same position in a segment. For example, in language "A", a date appears in the middle of a sentence, while in language "B" it appears at the end. Therefore, the variable has to be positioned manually by the translator.

Sometimes, it is necessary to change the format of the variable content. For instance, there is the "23.11.2005" date. In the translated text, the month should appear first and the separator should be a "/": "11/23/2005". Using variables, this can be done automatically.

To handle content like dates, times, or numbers, you have to define rules. For instance, they describe how to find a date in the segment and how to determine the different parts of the date like day, month and year. Besides that, rules describe how to format the variable content for the target language. Rules also allow us to describe different formats for different languages. For example, depending on the target language, "23.11.2005" can be transformed to "11/23/2005" or "23-11-2005".

censhare allows you to define different rules for different date, time and number formats for different languages. Besides that, censhare provides you with a common rule that applies to all digits in the text if no other rule applies.

Processing of text for variables

With output formatting

As censhare processes a source segment, it looks for variables inside a segment. The detection of variables for date, time or numbers is done by regular expressions (RegEx). censhare checks for each segment if there is any content that matches one of the RegEx rules for the source language. For each RegEx rule, there is also a format defined. The pattern of the format determines how the content of the variable will be analyzed.

The processing pipeline for variables in the Translation with memory application: 1) A variable is detected within a source segment using a RegEx rule. 2) The content of the variable is analyzed using the pattern format for the source language. 3) The content of the variable is formatted using the pattern format for the target language.

For instance, censhare detects the string "02.12.2015" and identifies it as a variable. The associated format pattern is

dd.MM.yyyy

The "dd" stands for a double-digit day, the "MM" for a double-digit month, and the "yyyy" for a four-digit year. The pattern letters are case-sensitive: an uppercase "M" stands for the month, while a lowercase "m" stands for the minute. After applying the format, censhare retrieves the following content information: dd = "02", MM = "12" and yyyy = "2015". For the target language, there is also a format defined. This is used to output the content for the target language. For instance, the format for the target language is

MM/dd/yyyy

The output of the variable then looks like this: "12/02/2015". In the end, the target variable is inserted into the target segment at the place that corresponds to the place of the source variable in the source segment. The translator can then position the variable in the target segment.
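
The three steps above can be sketched in plain Java. This is a minimal illustration, not censhare's implementation: the class name, the method name, and the RegEx for the source rule are assumptions made for this example. It uses "MM" for the month, because in Java's SimpleDateFormat a lowercase "mm" denotes minutes.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; all names in this class are hypothetical.
final class DateVariableSketch {
    // Assumed RegEx of the source rule: matches dates such as "02.12.2015".
    static final Pattern SOURCE_REGEX =
            Pattern.compile("(0?[1-9]|[12][0-9]|3[01])\\.(0?[1-9]|1[012])\\.\\d{4}");
    // Format patterns of the source rule and the target rule.
    static final String SOURCE_FORMAT = "dd.MM.yyyy";
    static final String TARGET_FORMAT = "MM/dd/yyyy";

    /** Returns the first date of the segment in the target format, or null if none is found. */
    static String reformatFirstDate(String segment) {
        // Step 1: detect the variable with the RegEx.
        Matcher matcher = SOURCE_REGEX.matcher(segment);
        if (!matcher.find()) {
            return null;
        }
        try {
            // Step 2: analyze the content with the source format pattern.
            SimpleDateFormat source = new SimpleDateFormat(SOURCE_FORMAT, Locale.ROOT);
            source.setLenient(false);
            Date parsed = source.parse(matcher.group());
            // Step 3: format the content with the target format pattern.
            return new SimpleDateFormat(TARGET_FORMAT, Locale.ROOT).format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a valid date: " + matcher.group(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reformatFirstDate("Der Termin ist am 02.12.2015."));
    }
}
```

The sketch only covers detection and formatting. Placing the variable in the target segment remains the translator's task, as described above.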

Without output formatting

The easiest use case is that the format of the source variable remains unchanged. In this case, you do not need a source and target format pattern. Therefore, the respective transformation steps in the processing pipeline shown above are not needed. The text string found by the RegEx is just printed in the target language as it was detected in the source language.

The processing pipeline for variables in Translation with memory with no formatting: 1) A variable is detected within a source segment using a RegEx rule. 2) The source variable is copied to the target variable and inserted as it is into the segment for the target language.

For example, the RegEx

\d{1,3}(\,\d{3})*(\.\d\d?)

finds the text string "20,000.55". This string is also the result for the variable in the target segment.

Defining rule sets for variables

censhare only creates variables for date, time or number formats for which a rule exists. Therefore, you have to set up a rule set for each format you want to process. Create a rule for each language in which you want to process a certain date, time or number format. These rules are then combined into one rule set. Each rule contains a RegEx to detect a variable and the respective format pattern. If no formatting is needed for a variable, you can define rules that just contain a RegEx.

Each rule for a language can be used in both translation directions as either source or target language. If you have set up two rules for German and English, they apply to both directions: translating from English to German and from German to English.

Note: Within one rule set, there can be only one rule for a certain language.

Note: There can be more than one rule set for date/time or number definitions.

Select rule sets

As there can be more than one rule set, there might also be more than one rule set that matches a certain string in a segment. This means that each of those rule sets contains a rule with a RegEx that matches this string.

This can only happen if the respective rule sets have the same format type, that is, date/time or number.

censhare then chooses the rule set whose RegEx produces the longest match, that is, the match with the most characters in the text being analyzed.

For instance, there is the text "xyz 12-03-2018 zyx". There are two different rule sets of the date type. The English rule of rule set 1 is:

(0?[1-9]|[12][0-9]|3[01])[\-](0?[1-9]|1[012])[\-]\d{4}

The English rule of rule set 2 is:

(0?[1-9]|[12][0-9]|3[01])[\-](0?[1-9]|1[012])

Rule set 1 produces the following match: "12-03-2018".

Rule set 2 produces the following match: "12-03".

The match for rule set 1 is longer than that for rule set 2. Therefore, rule set 1 is chosen for processing.
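
The selection by longest match can be illustrated with a short Java sketch. It is an assumption about how such a selection could be written, not censhare's actual code; the class and method names are hypothetical. The sample text uses a four-digit year so that the "\d{4}" part of rule set 1 can match. How censhare resolves ties is not documented here; the sketch simply keeps the first rule set.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; all names in this class are hypothetical.
final class LongestMatchSketch {
    /** The two English rules from the example, keyed by rule set name. */
    static Map<String, String> exampleRuleSets() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("Rule set 1", "(0?[1-9]|[12][0-9]|3[01])[\\-](0?[1-9]|1[012])[\\-]\\d{4}");
        rules.put("Rule set 2", "(0?[1-9]|[12][0-9]|3[01])[\\-](0?[1-9]|1[012])");
        return rules;
    }

    /** Returns the name of the rule set whose RegEx yields the longest first match, or null. */
    static String chooseRuleSet(Map<String, String> regexByRuleSet, String text) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, String> entry : regexByRuleSet.entrySet()) {
            Matcher matcher = Pattern.compile(entry.getValue()).matcher(text);
            // Keep the rule set only if its match is strictly longer than the best so far.
            if (matcher.find() && matcher.group().length() > bestLength) {
                best = entry.getKey();
                bestLength = matcher.group().length();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(chooseRuleSet(exampleRuleSets(), "xyz 12-03-2018 zyx"));
    }
}
```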

Regular expressions to find variables

As already said, censhare uses a regular expression to identify a variable within the text of a source segment. This means that the RegEx has to describe exactly every combination of characters the desired variable can consist of.

For instance, the RegEx

(0?[1-9]|[12][0-9]|3[01])[\/](0?[1-9]|1[012])[\/]\d{4}

searches for a date that has the following format:

dd/MM/yyyy

This RegEx will find "1/1/2017" or "08/12/2010" but not "32/13/2017". The latter is not a valid date. It does not match the RegEx, which checks that the "day" value is not greater than 31 and the "month" value is not greater than 12.

The RegEx above will identify each matching pattern in a text, no matter where it is placed within the text. So, both texts in the following example will produce a match: "The book was published on 11/10/2017" and "The book has the registration MAT11/10/2017". The latter is not a date, but it matches the RegEx above. To prevent this, you can adapt the RegEx like this:

" (0?[1-9]|[12][0-9]|3[01])[\/](0?[1-9]|1[012])[\/]\d{4}"

The quotation marks are used to show the blank space at the beginning. The RegEx now checks for this blank space. Now, only strings with a leading blank space will match.
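
The effect of the leading blank space can be checked with a few lines of Java. This is a minimal sketch for testing the two RegEx variants from this section; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; all names in this class are hypothetical.
final class LeadingBlankSketch {
    // RegEx without and with the leading blank space.
    static final Pattern PLAIN = Pattern.compile(
            "(0?[1-9]|[12][0-9]|3[01])[\\/](0?[1-9]|1[012])[\\/]\\d{4}");
    static final Pattern WITH_BLANK = Pattern.compile(
            " (0?[1-9]|[12][0-9]|3[01])[\\/](0?[1-9]|1[012])[\\/]\\d{4}");

    /** Returns true if the pattern is found anywhere in the text. */
    static boolean found(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        String date = "The book was published on 11/10/2017";
        String registration = "The book has the registration MAT11/10/2017";
        System.out.println(found(PLAIN, date) + " " + found(PLAIN, registration));
        System.out.println(found(WITH_BLANK, date) + " " + found(WITH_BLANK, registration));
    }
}
```

One consequence of this design is that a date at the very beginning of a segment has no leading blank space and is therefore not matched by the adapted RegEx.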

Therefore, before you define a RegEx, analyze what kind of date, time or number formats occur in your texts. Then create the RegEx definitions that exactly meet these use cases.

censhare uses a standard implementation for RegEx. Therefore, plenty of literature is available if you need more information on how to build a RegEx.

Patterns to determine the content of a variable

When censhare identifies a variable, it is just a sequence of characters. In the next step, it must be determined if it is a date/time or number. As already said, censhare uses format patterns to identify the different parts of a date, time or number. Therefore, a format pattern has to match the character strings that the RegEx finds.

Here is another example for numbers. You have the RegEx

\d{1,3}(\,\d{3})*(\.\d\d?)

for this value type. It will match a number like this: "20,000.55". A standard format pattern for this is:

###,###.00

Each "#" represents an optional digit: if the position is not occupied, nothing is shown, so leading zeros do not appear. Each "0" in the pattern represents a mandatory digit that is always shown. The number will be identified as "20000" and the two digits "55" after the decimal point. This information will be used in the output of the variable in the target language.

Patterns to format output in target segments

After determining the content of a variable, censhare looks for the rule within the rule set to be applied in the target language. The format pattern in this rule will be used to create the output of the variable in the target segment.

As censhare also determines the content of a variable, it can now format the output in a different way. The possibilities depend on which content pieces the source format pattern has detected. They also depend on the format pattern characters that can be used, like "#" or "0" for numbers and "d", "M", or "y" for a date. For a full overview, see the sub-sections for date and time or numbers in the sections about creating rules.

For the number example "20,000.55", we detected the two parts "20000" and "55". The respective rule for the target language within the rule set has the format pattern:

##0.00

There is no grouping separator as in the number format pattern for the source variable "###,###.00". The result for the output variable is "20000.55". For completeness, here is the RegEx that is defined in the rule for the target language:

\d+(\.\d\d?)
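
The number example can be reproduced with Java's DecimalFormat. This is a sketch under assumptions: the class and method names are hypothetical, and the decimal and grouping separators are fixed to "." and "," through Locale.ROOT so that the result does not depend on the system locale. How censhare selects the separators internally is not covered here.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

// Illustrative sketch only; all names in this class are hypothetical.
final class NumberVariableSketch {
    /** Parses the variable with the source pattern and formats it with the target pattern. */
    static String reformat(String variable, String sourcePattern, String targetPattern) {
        // Assumption: "." as decimal separator and "," as grouping separator.
        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.ROOT);
        DecimalFormat source = new DecimalFormat(sourcePattern, symbols);
        // Parse into a BigDecimal so that no binary floating-point rounding is involved.
        source.setParseBigDecimal(true);
        try {
            Number parsed = source.parse(variable);
            return new DecimalFormat(targetPattern, symbols).format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + variable, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reformat("20,000.55", "###,###.00", "##0.00"));
    }
}
```

Because each rule can act as source or target, the same call with the patterns swapped turns "20000.55" back into "20,000.55".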

It is not only possible to change the format but also to rearrange the content pieces detected in a variable. In the date example above, the string is split into the values dd = "02", MM = "12", and yyyy = "2015". Now, assume that the month should be shown at the beginning of the date. The format pattern looks like this:

MM/dd/yyyy

The result is "12/02/2015". censhare takes the content for "MM", "dd", and "yyyy", outputs it in the given order and applies the separator "/". The RegEx for this format pattern is as follows:

(0?[1-9]|1[012])[\/](0?[1-9]|[12][0-9]|3[01])[\/]\d{4}

Note: There is no calculation of values when a format pattern is applied to the content of a variable!

For instance, a text segment contains a number with an associated currency like "€205.50". There is a rule for German with

RegEx = "€\d+(\.\d\d?)"
Format = "'€'##0.00"

and the rule for English with

RegEx = "\$\d+(\.\d\d?)"
Format = "'$'##0.00"

In the English RegEx, the "$" is escaped with a backslash because an unescaped "$" stands for the end of the line in a RegEx. In the format patterns, the single quotes enclose the currency sign as literal text.

The German RegEx detects the variable "€205.50" in the source segment. The part before the "." is "205" and the part after it is "50". Now, the format pattern for the English target language is applied: "$205.50". No conversion is possible, neither by exchange rate from Euro to Dollar nor between units like Inch and Centimeter or Pound and Kilo! censhare only applies the format pattern defined for the target language in the rule set and outputs the formatted number.
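
The currency example can be traced in Java as follows. This is a hedged sketch: the class and method names are hypothetical, the currency signs are written as quoted literals in the DecimalFormat patterns, and the "$" in an English RegEx would have to be escaped as "\$". The Euro sign is written as the Unicode escape \u20AC to keep the source file independent of its encoding.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; all names in this class are hypothetical.
final class CurrencyVariableSketch {
    // German rule: RegEx and format pattern with a literal Euro sign (\u20AC).
    static final Pattern GERMAN_REGEX = Pattern.compile("\u20AC\\d+(\\.\\d\\d?)");
    static final String GERMAN_FORMAT = "'\u20AC'##0.00";
    // English rule: format pattern with a literal dollar sign.
    static final String ENGLISH_FORMAT = "'$'##0.00";

    /** Reformats the first Euro amount of the segment with the English pattern, or returns null. */
    static String reformat(String segment) {
        Matcher matcher = GERMAN_REGEX.matcher(segment);
        if (!matcher.find()) {
            return null;
        }
        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.ROOT);
        DecimalFormat source = new DecimalFormat(GERMAN_FORMAT, symbols);
        source.setParseBigDecimal(true);
        try {
            // Only the sign in front of the digits changes; the value stays 205.50.
            Number parsed = source.parse(matcher.group());
            return new DecimalFormat(ENGLISH_FORMAT, symbols).format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + matcher.group(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reformat("Der Preis ist \u20AC205.50."));
    }
}
```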

Add rule sets

Rule sets for variables are defined in the censhare Admin-Client. Go to the "Configuration/Services/Translation" folder and double-click the "Configuration" entry. A configuration dialog opens.

Click the Plus button at the bottom of the dialog to create a "Group". A Group contains one rule set. censhare displays the section for the first group with the Type selector and an entry for the first rule. The Type determines which kind of rule set you will define: Date, Number or As-is. Date is for date and time formats. Number is for numeric values. For more information, see the "With output formatting" section.

In the third case, As-is, no format is defined. The string found by the RegEx is just printed in the target segment as it was found. For more information, see the "Without output formatting" section.

Next, define a rule for each language that you want to cover in the rule set. If you need another entry for a rule, click the + button at the bottom of the "Items" frame. For each rule entry, you have to define a RegEx and Format pattern. If you have chosen As-is as Type, there is only one field for the RegEx.

Note: censhare does not validate the RegEx and Format entries. Make sure that you enter valid expressions and formats here!

When you are finished, click OK at the top. Next, you have to transfer the changed configuration from the censhare Admin-Client to your censhare Servers. For more information, see the "Update the server configuration" section.

Note: In some cases, invalid entries in the fields may make the Translation with memory editor unusable. You can then no longer open assets that are a source or target text asset for translation. In this case, you receive the error message "Unable to open translation application. Please, make sure the Babelfish service and Translation service is available." To solve this issue, deactivate your rules one by one until you find the one that causes the error. For more information, see the "Deactivate rules or rule sets" section.

Rules for date and time

When you define rules for the date type, there is a set of characters available to describe the pattern format. See the table below for a list of all characters. The second table shows some format pattern examples and the date/time output that each pattern produces.

For more information on the used pattern letters, see also https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.

Used date and time patterns

| Letter | Description | Representation | Examples |
|---|---|---|---|
| G | Era designator | Text | AD |
| y | Year | Year | 1996; 96 |
| Y | Week year | Year | 2009; 09 |
| M | Month in year | Month | July; Jul; 07 |
| w | Week in year | Number | 27 |
| W | Week in month | Number | 2 |
| D | Day in year | Number | 189 |
| d | Day in month | Number | 10 |
| F | Day of week in month | Number | 2 |
| E | Day name in week | Text | Tuesday; Tue |
| u | Day number of week (1 = Monday, ..., 7 = Sunday) | Number | 1 |
| a | Am/pm marker | Text | PM |
| H | Hour in day (0-23) | Number | 0 |
| k | Hour in day (1-24) | Number | 24 |
| K | Hour in am/pm (0-11) | Number | 0 |
| h | Hour in am/pm (1-12) | Number | 12 |
| m | Minute in hour | Number | 30 |
| s | Second in minute | Number | 55 |
| S | Millisecond | Number | 978 |
| z | Time zone | General time zone | Pacific Standard Time; PST; GMT-08:00 |
| Z | Time zone | RFC 822 time zone | -0800 |
| X | Time zone | ISO 8601 time zone | -08; -0800; -08:00 |

Examples for the date and time patterns

| Pattern | Result |
|---|---|
| yyyy.MM.dd G 'at' HH:mm:ss z | 2001.07.04 AD at 12:08:56 PDT |
| EEE, MMM d, ''yy | Wed, Jul 4, '01 |
| h:mm a | 12:08 PM |
| hh 'o''clock' a, zzzz | 12 o'clock PM, Pacific Daylight Time |
| K:mm a, z | 0:08 PM, PDT |
| yyyy.MMMMM.dd GGG hh:mm aaa | 2001.July.04 AD 12:08 PM |
| EEE, d MMM yyyy HH:mm:ss Z | Wed, 4 Jul 2001 12:08:56 -0700 |
| yyMMddHHmmssZ | 010704120856-0700 |
| yyyy-MM-dd'T'HH:mm:ss.SSSZ | 2001-07-04T12:08:56.235-0700 |
| yyyy-MM-dd'T'HH:mm:ss.SSSXXX | 2001-07-04T12:08:56.235-07:00 |
| YYYY-'W'ww-u | 2001-W27-3 |

Rules for numbers

When you define rules for the number type, there is a set of characters available to describe the pattern format. See the table below for a list of all characters that you can use. The second table shows some format pattern examples and the numeric output that each pattern produces.

For more information on the used pattern letters, see also https://docs.oracle.com/javase/8/docs/api/java/text/DecimalFormat.html.

Used number patterns

| Symbol | Meaning |
|---|---|
| 0 | Digit |
| # | Digit, zero shows as absent |
| . | Decimal separator or monetary decimal separator |
| - | Minus sign |
| , | Grouping separator |
| E | Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix. |
| ; | Separates positive and negative subpatterns |

Examples for number patterns

| Pattern | Description | Result |
|---|---|---|
| #.00 | Any number of digits before the decimal point and 2 after | 7667.27; 444.20 |
| 000.## | At least 3 digits before the decimal point and up to 2 after | -001.5; 17777.65; 000.65 |
| '$'00.#### | '$' in front of the number | $15.5674; $444415.4; $00.4 |
| #,###,### | Grouping numbers with , | 1,556,789; 1,333,556,789; 5 |
| % | Use of % for percentage | %15 |
| ## | Integer number | 156 |
| .##### | 5 or fewer digits in the decimal part | 1890.567; 1890.56778 |
| 'JCG'000.# | String 'JCG' in front of the number | JCG015.6; JCG198347.6 |

Fallback rule set

Besides the rule sets for your use cases, you can add a rule set that covers all other numbers and marks them as variables. This special rule set can be created if it is not yet available in your system. If you have not created any rule set or no rule within a rule set applies, this fallback rule is applied. It detects numbers or date/time entries, marks them as variables and copies these variables without further changes to the target segment.

Create the following rule set:

  1. Create a new Group entry.

  2. Select "As-is" as Type.

  3. Select "*" for the Locale definition. This applies to any language that the source text asset has been assigned to.

  4. Enter a RegEx, for example:

    (?<!(\<v i=")|(?:dataRef(?:End|Start)="d)|(pc id="[0-9]{0,10})|(sc id="[0-9]{0,10})|(startRef="[0-9]{0,10}))([0-9]+)(?!")

    If you have been provided with another RegEx, you can use that RegEx instead. If the RegEx does not detect a special use case that you have, you can adapt it.

The RegEx example above also prevents numbers within certain tags of the XML source of a text from being interpreted as number variables. These numbers are not output, but they are present in the text segment. The tags are "<v i='....'>", "<dataRefEnd='...'>", "<dataRefStart='...'>", "<pc id='...'>", "<sc id='...'>", "<startRef='...'>".
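
To see how the lookbehind and lookahead parts of the fallback RegEx behave, you can run it against a small sample outside of censhare. The following Java sketch is only a test aid; the class name, the method name, and the sample XML fragment are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; all names in this class are hypothetical.
final class FallbackRegexSketch {
    // The fallback RegEx from the example above, escaped for a Java string literal.
    static final Pattern FALLBACK = Pattern.compile(
            "(?<!(\\<v i=\")|(?:dataRef(?:End|Start)=\"d)|(pc id=\"[0-9]{0,10})"
            + "|(sc id=\"[0-9]{0,10})|(startRef=\"[0-9]{0,10}))([0-9]+)(?!\")");

    /** Returns all digit sequences that the fallback RegEx would mark as variables. */
    static List<String> findNumbers(String xmlText) {
        List<String> found = new ArrayList<>();
        Matcher matcher = FALLBACK.matcher(xmlText);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample: the id "123" sits inside a tag, "500" and "7" are visible text.
        System.out.println(findNumbers("Pack <pc id=\"123\">size</pc> 500 g, page 7"));
    }
}
```

In this sample, the digits of the "pc id" attribute are skipped, while the numbers in the visible text are found.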

Deactivate rules or rule sets

If you do not want to use a rule within a rule set anymore, delete the rule. Click the trash icon next to the RegEx field.

If you do not want to use a rule set anymore, delete it. Click the trash icon on the right side of the Items frame of the Group entry with the rule set in question.

If you do not want to work with date, time or number variables anymore, delete all rule sets that you have created. This includes the generic rule set provided by censhare, which just detects any digits if no other rule set fits.

Note: It is strongly recommended to save a backup of all rule sets that you want to deactivate/delete. censhare does not provide a way to restore them after they have been deleted.

Update the server configuration

  1. Use the "Update server configuration" button at the top of the censhare Admin-Client to transfer your changes to the Master server.

  2. If there is more than one server, use the "Synchronize remote servers" button on the top to transfer the changed configuration to all other censhare Servers.

Check the configuration in censhare Web

censhare does not provide tools to check whether your rule sets work as you expect. It is recommended to test your RegEx definitions before you create the rules. Publicly available tools can be used for that purpose.

Besides that, you can use a simple ICML or Article asset structure to implement your test use cases. For more information on how to create the asset structure, see the "Related topic" section below.

Add some example use cases of numbers and date/time combinations to your test text asset. Then use this asset as source asset for the translation. Next, open the target text asset. Go to the Translation tab. It shows you the segments with the variables in the source and target segments side by side. Check if your regular expressions are working as expected and the correct formatting of the variables is applied in the target segments.

Configuration of variables in versions previous to censhare 2018.2

In versions previous to censhare 2018.2, the configuration of variables was done within the "Translation" asset (asset type "Module/Configuration", resource key "censhare:configuration.translation"). This asset has been removed as of censhare 2018.2 because it is no longer needed for the configuration.

Result

You have set up rules for certain date/time and number formats in the censhare Admin-Client and transferred them to the censhare Server(s). The rule sets cover all the use cases where you need special processing for date/time and number formats. You have tested your RegEx. The test text assets for the different languages show that all your rule sets work as expected for the different languages.