The PIM connector exports data from and imports data into censhare PIM. For this purpose, the connector uses a multi-stage process in censhare and an external ETL (Extract, Transform, Load) tool. This article shows how to customize the process.


Target groups

  • Administrators

Context

Product information management (PIM) is an application that runs in censhare Web. The PIM connector handles the data export and import using third-party ETL software.

Prerequisites

You should be familiar with the key concepts of the PIM connector. To configure export/import schemas, you also need a working knowledge of the third-party ETL software connected to censhare PIM.

Introduction

The censhare PIM connector is a bidirectional interface that collects and processes data records in order to export them from or import them into censhare PIM. Both processes (export and import) are configurable in the censhare Admin-Client. The PIM connector requires third-party ETL (Extract, Transform, Load) software that connects via a REST API. In the export direction, the ETL software retrieves the raw data from censhare to create an export file, for example, an Excel file. In the import direction, the ETL software passes the processed data to censhare. The import can also update existing data records.

The process stages of the PIM connector can be customized independently. The following table shows the configurable stages and available default configurations in the censhare standard setup.

Item or stage

Available default configurations

Remarks

Asset type

Product

The connector can retrieve data from any asset type. The asset type must be selected in the connector configuration in the censhare Admin-Client. Make sure that the asset type matches the data schema. The default configuration provided by censhare can only be applied to Product assets and includes Product items assigned to a product.

Data schema

Product schema,
Product item schema,
Person schema

The data schema stores a flat JSON representation of the asset features to be exported or imported. The selected schema must match the asset type. For example, if you select Person assets as the data type, you must select the Person schema. The JSON files are stored in a Schema asset. You can create custom schemas if you want to add asset features to or remove them from the export/import. See the section JSON schemas below for an example file.

JSON to XML

n/a

This stage is optional. If you want to use an XSLT template to manipulate the data in the processor pipeline, you first have to create an XML representation of the JSON data. To do this, add the stage "JSON to XML" in the processor pipeline of your connector configuration. This stage works with any input (asset type and data schema) and requires no configuration.

XSLT transformation

CS-Pentaho-Export-to-File,
CS-Pentaho-Import-from-File-to-XML

These XSLT templates pass the JSON or XML data from censhare to Pentaho (export pipeline) or fetch JSON or XML data from Pentaho into censhare (import pipeline). The output data structure must conform to the specifications of your ETL tool. See the section XSL transformations below for an example file.

ETL export transformation

Pentaho-Export-Excel-products-and-items-on-separate-sheets,
Pentaho-Export-Excel-products-format-1

The export transformations are specific to each ETL tool and export format. They are executed on the third-party ETL software. For each connector configuration, you can create multiple export schemas. For example: Pentaho/Excel, Pentaho/CSV, Apache NiFi/Excel, Apache NiFi/CSV. See the section ETL transformations for more information.

ETL import transformation

Pentaho-Import-Excel-products-and-items-on-separate-sheets,
Pentaho-Import-Excel-products-and-items-on-one-sheet,
Pentaho-Import-Excel-products-format-1

The import transformations are specific to each ETL tool and import format. They are executed on the third-party ETL software. For each connector configuration, you can create multiple import schemas. For example: Pentaho/Excel, Pentaho/CSV, Apache NiFi/Excel, Apache NiFi/CSV. See the section ETL transformations for more information.

JSON schemas

When you export data from censhare, the connector first creates a JSON representation of the asset properties. When you import data into censhare, the ETL processor creates a JSON representation of the imported data sets. The JSON schema provides the data model for the import and update of assets. censhare maps the JSON data from the ETL processor to this JSON schema. The schemas are stored in a Module / Schema asset. The root object defines the asset type that is processed. The properties contain the mapping between the keys of the import/export data and the censhare feature IDs:

"$ref": "#/definitions/person",
 "definitions": {
  "person": {
   "cs:asset.type": "person.",
   "cs:$no_new_masterdata": true,
   "title": "person",
   "type": "object",
   "properties": {
    "id": {
     "cs:feature.key": "censhare:asset.id",
     "type": "integer" },
JSON

The schemas are written in standard JSON Schema format. The data bindings use censhare-proprietary extensions. For more information, see http://json-schema.org. In the schemas, asset properties with multiple values are defined as arrays. Child properties of an asset feature are defined as objects inside the parent property.
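For illustration, a multi-value property and a nested child property could look like the following sketch. The property names (keywords, manufacturer) and their feature keys are hypothetical and only demonstrate the structure; use the feature IDs defined in your censhare configuration:

"keywords": {
 "cs:feature.key": "censhare:keyword",
 "type": "array",
 "items": { "type": "string" }
},
"manufacturer": {
 "cs:feature.key": "censhare:product.manufacturer",
 "type": "object",
 "properties": {
  "name": {
   "cs:feature.key": "censhare:product.manufacturer.name",
   "type": "string"
  }
 }
},
JSON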

In the censhare standard configuration, these three preconfigured schemas are ready to use with the respective assets and asset structures:

  • Person schema: This schema works with Person assets and provides a data model for all person-related properties, like name, address and contact data.

  • Product schema: This schema works with Product assets. It provides a data model for all product features like technical specifications, order and EAN numbers. It also covers marketing-related data like product texts and images. The product schema includes product items and therefore references the Product item schema, which can also be used as a stand-alone schema (see the next point and the sketch after this list).

  • Product item schema: This schema works as a standalone with Product item assets. It is part of the Product schema, which includes products and their product items. Its data model is similar to the Product schema.
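In standard JSON schema notation, such a reference is expressed with the $ref keyword. The following sketch shows how a product definition could embed its product items as an array. The property and definition names are illustrative and not necessarily those used in the default schemas:

"productItems": {
 "type": "array",
 "items": { "$ref": "#/definitions/productItem" }
},
JSON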

Besides the asset data that are specific to each asset type (product data, person data, etc.), the JSON schema also retrieves generic data, for example, the asset ID, name, description and localized names. For a sample, search for the default schemas provided in the censhare standard configuration. The resource keys are as follows: censhare:schema.productItem, censhare:schema.product, censhare:schema.person.
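For example, the asset ID and the asset name could be mapped as in the following sketch. The feature keys censhare:asset.id and censhare:asset.name are taken from the examples in this article; the surrounding schema structure is omitted:

"id": {
 "cs:feature.key": "censhare:asset.id",
 "type": "integer"
},
"title": {
 "cs:feature.key": "censhare:asset.name",
 "type": "string"
},
JSON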

To customize the schemas, create resource replace variants from the default schemas. For example:

  • Add properties that are not included or remove properties that are not needed for your business case.

  • Create custom schemas for other asset types or asset structures. Default schemas are available for products (with and without product items) and persons.

  • Customize the mapping of properties. Each property (for example: "title") is mapped to an internal feature key (for example: "censhare:asset.name"), as shown in the sketch after this list.

  • Customize the unique identifiers that are used to check if an imported data record already exists in censhare. Unique identifiers determine whether an existing asset is updated or a new one created. For more information see the following section.
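For example, a resource replace variant could add a property that the default Product schema does not contain. Both the property name brand and its feature key in the following sketch are hypothetical; replace them with the feature IDs defined in your censhare configuration:

"brand": {
 "cs:feature.key": "censhare:product.brand",
 "type": "string"
},
JSON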

Customize unique identifiers

The import pipeline of the PIM connector can determine when to create a new asset and when to update an existing one. For this purpose, unique identifiers are used. If these match in both the imported data and the existing asset data, censhare overwrites the existing data. No new asset is created, in order to avoid duplicates. You can define an array of multiple identifiers. These are checked in the given order.

By default, only the Product schema uses unique identifiers. The following code snippet shows how these are implemented in the JSON schema:

"$ref": "#/definitions/product",
 "definitions": {"product": {
  "cs:asset.type": "product.",
  "cs:$no_new_masterdata": true,
  "cs:$selection_keys": [
   ["id"],
   ["uuid"],
   ["externalId"],
   ["gtin"],
   ["ean"],
   ["orderNumber"]
  ],
JSON

The code snippet defines the cs:$selection_keys for the asset type product. The array inside cs:$selection_keys contains an ordered list of keys that are checked during import. The keys ["id"], ["uuid"], ["externalId"], ["gtin"], ["ean"], ["orderNumber"] represent the respective properties. If they match the data of an existing asset, no new asset is created; the imported data record overwrites the data of the existing asset.
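For illustration, assume the import delivers the following hypothetical data record that contains no id, uuid or externalId. The connector then presumably matches on the next available key, gtin: if a product asset with the same GTIN exists, it is updated instead of duplicated; otherwise, a new asset is created:

{
 "gtin": "04012345678901",
 "name": "Cordless screwdriver X200"
}
JSON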

To customize these properties, you can do the following:

  • Change the order in which these properties are checked. For example: ["gtin"], ["ean"], ["id"], ["uuid"], ["externalId"], ["orderNumber"].

  • Remove properties, if they are not needed. For example: ["id"], ["gtin"], ["ean"], ["orderNumber"].

  • Add new properties that you want to use as unique identifiers. For example: ["id"], ["uuid"], ["externalId"], ["gtin"], ["ean"], ["articleNumber"], ["orderNumber"]. Note: You must add the property definition and the mapping for new keys, if your JSON schema does not contain them. For example:

    "articleNumber": {
     "cs:feature.key": "censhare:product.article-number",
     "type": "string"
    },
     JSON

XSL transformations

To provide an exchange format that can be forwarded to the third-party ETL software configured with your PIM connector, you can use Java or XSLT templates. If XML is required as the exchange format, the connector can be configured to transform the JSON data representation into XML first. This stage can be added in your custom configuration in the censhare Admin-Client. The XML can then be transformed into the required exchange format.

The censhare standard configuration provides transformations for the export to and import from Pentaho. If you are using Pentaho PDI CE as your ETL tool, these transformations work out of the box. If you are using other ETL tools, for example, Apache NiFi or Talend Open Studio, please ask our professional services for assistance.

The XSLT transformation enhances and maps the raw XML data from censhare so that they can be parsed in Pentaho (or any other ETL solution). For a single ETL export, it is possible to configure different output formats, for example, Excel on a single sheet, Excel on several sheets, or CSV. Each output format requires an Export / Import asset with the respective schemas. The XSLT transformation in this stage checks for available output formats and creates a dialog where users can select the desired one.

ETL transformations

Custom ETL export and import transformations are configured in the ETL software. These are third-party solutions and not part of censhare. Apart from the default configurations, censhare does not provide support for customizations. If you want to write custom export/import transformations, you should be familiar with the ETL software you are using.

Create custom Pentaho transformations

Pentaho's ETL engine is called Kettle. When you execute a data export or import, censhare passes the raw XML data to Pentaho, along with the information on which transformation Pentaho must execute. The Kettle transformations that are executed on the Pentaho server are stored in censhare in assets of the types Module / Configuration / Pentaho export and Module / Configuration / Pentaho import. The master file of these assets is a ZIP archive that contains the following files:

File

Extension

Description

Kettle transformation

.ktr

The file contains an XML representation of the output format and metadata. If you want to parse hierarchical data like products and product items, you need a transformation step for each hierarchy level. These steps are then aggregated in a main transformation (see below). Otherwise, only one transformation is required.

Kettle job

.kjb

This optional file contains the main transformation that combines different Kettle transformations (see previous step).

Test data

.xml

This optional file contains dummy data to simulate the raw XML output from censhare.

Configuration

To create custom transformations, use the transformations provided with the censhare standard configuration as a starting point. These are, for example:

  • Pentaho-Export-Excel-products-and-items-on-one-sheet

  • Pentaho-Export-Excel-products-and-items-on-separate-sheets

The parameters and default values are as follows:

Parameter

Default value

Remarks

CSSVAR_SRC_FILE_NAME

${Internal.Transformation.Filename.Directory}/test-data.xml

Pentaho built-in variable to reference the transformation execution directory

CSSVAR_REST_URL_PREFIX

Provided by censhare

This parameter contains the REST URL that locates the target directory. The export file is stored in this directory: CSSVAR_REST_URL_PREFIX/File_NAME

Text file output

Y(es)

This option must be enabled (do not change).

Output to Servlet

Y(es)

This option must be enabled (do not change).

To test and run your custom Pentaho transformation, create a ZIP archive from the Kettle transformation file(s), the optional Kettle job file and, if used, the test data file, which must be named "test-data.xml". Next, create a new asset in censhare with the asset type Module / Configuration / Pentaho export or, respectively, Module / Configuration / Pentaho import. These two asset types have been introduced specifically for the Pentaho ETL solution. Upload the ZIP archive as the master file of this asset. When you execute an export or import in censhare, the connector checks all available Pentaho transformations and creates a drop-down list in the dialog where the user can select one of them. Your custom transformation should appear in this dialog.

Result

When executing a Pentaho export, users can select custom transformations.