Indexing

How to index data with Xtended Search?

Xtended Search is a Process add-on that aims to provide an advanced tool for searching in heterogeneous data.

Why “search”:

The tool is built on an indexer (Lucene by default at present). In Xtended Search, everything is in the index: the goal is to achieve excellent performance by relying only on the index, fully decoupled from the source data.
All the data displayed by the add-on is assumed to come from the index, on the premise that reading the index is more efficient than accessing the source data.

Why “advanced”:

You will see that it is possible to build search interfaces based on full-text and/or advanced filtering with search criteria.

Why “heterogeneous data”:

Xtended Search indexes data via connectors.
A number of connectors ship with the product by default.
You can also create your own indexing connectors.
Any type of data can therefore be indexed, as long as a connector is available.

Configuration

The tool is configured entirely in XML.

You can use one or more XML files, but each file must follow the same standard: UTF-8 encoding without BOM (shown as “ANSI as UTF-8” in some editors).

XML files must be stored in a specific folder:

Before Process11: JBoss\server\default\deploy\vdoc.ear\vdoc.war\WEB-INF\storage\custom\configuration\axvdocsearch\
Since Process11: custom\configuration\axvdocsearch\
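
Since each configuration file must be UTF-8 without BOM, it can help to check a file before dropping it into this folder. The following is a minimal sketch in plain Java; the class BomCheck and its methods are hypothetical helpers and are not part of Xtended Search. It only detects the three-byte UTF-8 byte order mark (EF BB BF) at the start of a file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Xtended Search): detects a UTF-8 BOM. */
final class BomCheck {

    private BomCheck() {}

    /** Returns true when the given bytes start with the UTF-8 byte order mark EF BB BF. */
    static boolean startsWithUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    /** Reads the first three bytes of the file and reports whether they form a UTF-8 BOM. */
    static boolean fileStartsWithUtf8Bom(Path xmlFile) throws IOException {
        try (InputStream in = Files.newInputStream(xmlFile)) {
            return startsWithUtf8Bom(in.readNBytes(3));
        }
    }
}
```

A file for which fileStartsWithUtf8Bom returns true should be re-saved without BOM before it is used as a configuration file.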

Default structure of an XML file

The overall structure is fixed; each XML file in the folder has the following form:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <indexes>
        <!-- Declaration of indexes -->
    </indexes>

    <searches>
        <!-- Declaration of search interfaces -->
    </searches>

    <jobs>
        <!-- Declaration of jobs -->
    </jobs>
</root>
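
As a quick sanity check of this structure, the sketch below parses a configuration string with the standard JDK DOM parser and lists the element children of the root node. The class ConfigStructureCheck is a hypothetical helper, not something shipped with the product. It only checks that the three sections are present; whether Xtended Search itself tolerates a missing section or a different order is not covered here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not part of Xtended Search): inspects the top-level layout. */
final class ConfigStructureCheck {

    private ConfigStructureCheck() {}

    /** Returns the names of the element children of <root>, in document order. */
    static List<String> topLevelSections(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Configuration is not well-formed XML", e);
        }
        Element root = doc.getDocumentElement();
        if (!"root".equals(root.getTagName())) {
            throw new IllegalArgumentException("Expected <root>, found <" + root.getTagName() + ">");
        }
        List<String> names = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    /** True when the indexes, searches and jobs sections are all present under <root>. */
    static boolean hasExpectedSections(String xml) {
        return topLevelSections(xml).containsAll(List.of("indexes", "searches", "jobs"));
    }
}
```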

Creating an index

Under the indexes node, you can place as many index nodes as necessary.

Here is the basic framework you can use when you want to create a new index:

<index name="myIndexUniqueName" 
	label="LG_myIndexUnique" 
	controller="com.axemble.axvdocsearch.core.controllers.implementation.LuceneController" 
	extension="INDEXING CLASS"
	indexStorePath="vdocsearch\myIndexUniqueName" 
	updateOnIndexedDocuments="true" 
	locales="fr" >
	
	<parameters>
        <!-- Input parameters for my indexing class -->
		<parameter key="PARAM1" value="VALUE1" />
		<parameter key="PARAM2" value="VALUE2" />
	</parameters>
	
    <!-- Tags to index for my data source -->
	<customtag name="TAG1" type="text" />
	<customtag name="TAG2" type="date" />
	<customtag name="TAG3" type="number" />
	<customtag name="TAG4" type="boolean" />
	<customtag name="TAG5" type="file" collection="true" />
</index>

Some remarks about this XML:

The indexing class depends on your data source. Keep in mind that each class requires its own input parameters.
Custom tags are the attributes of your index. They are what advanced searches are built on, so they are essential: everything used for filtering and everything displayed in the result views must be declared as a customtag.

The native types of customtags

We support the following native types:

text
number
date
boolean
file

The “collection” attribute refines the configuration by specifying that the tag stores several elements.

Explanation of the notion of system custom tags

System custom tags are basic attributes that are always indexed, even if you do not declare any customtag.

The indexing connector therefore always supplies this information, regardless of the configuration.

Name	Description
ID	ID of the indexed document
REFERENCE	Reference of the indexed document
TITLE	Title of the indexed document
DESCRIPTION	Description of the indexed document
CREATIONDATE	Creation date of the indexed document
HYPERLINK	Hypertext link for WEB access to the indexed document
URI	URI of the indexed document
LOCALE	Locale of the indexed document

The ID, REFERENCE and CREATIONDATE tags are mandatory for the tool to work properly.

If an indexing connector does not return data for these tags, the document is not indexed.
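
The rule above can be pictured with a small sketch. It is an illustration only: the class MandatoryTags is hypothetical, the tag values are modelled as a plain map, and treating a blank value as missing is an assumption. The actual check performed inside Xtended Search is not documented here.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical illustration (not the product's code) of the mandatory system tags rule. */
final class MandatoryTags {

    /** System tags that must carry a value for a document to be indexed. */
    static final List<String> MANDATORY = List.of("ID", "REFERENCE", "CREATIONDATE");

    private MandatoryTags() {}

    /** Returns the mandatory tags that are absent or blank in the given tag values. */
    static List<String> missing(Map<String, ?> tagValues) {
        return MANDATORY.stream()
                .filter(tag -> {
                    Object value = tagValues.get(tag);
                    // Assumption: a blank string counts as "no data returned".
                    return value == null || value.toString().isBlank();
                })
                .collect(Collectors.toList());
    }

    /** A document is considered indexable when no mandatory tag is missing. */
    static boolean isIndexable(Map<String, ?> tagValues) {
        return missing(tagValues).isEmpty();
    }
}
```

When writing your own connector, a check of this kind makes it easier to see why a document did not reach the index.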

Note on full-text search

No configuration is needed for the full-text search.

Xtended Search performs the full-text search on all the exploitable indexed custom tags.

How to use one or more indexes?

Once your indexes are defined, you can define searches. A search is the graphical interface of a search you want to offer to users.

You can define zero, one or several search interfaces, each based on one or several indexes.

The indexes then have to be populated by running an indexing:

Either in the standard way, by creating an indexing agent that browses all the indexes
Or by defining “jobs”.

For more information, see the chapter on jobs.

Order of the XML configuration

To summarize, here are the steps to follow for the XML configuration of Xtended Search:

Definition of an index: what data do we need to index?
Association of this index to a job (optional)
Execution of the indexing job
Definition of one or more searches that exploit one or more indexes
Integration of the search screen in a module of the Process suite

Creation of indexing agents and job logic

Indexing agents

Indexing with Xtended Search is not done automatically. It is triggered by indexing agents or by code.

Indexing modes

Mode	Description	Class
Full indexing	Full indexing: add, modify, delete.	Agent class: com.axemble.axvdocsearch.core.jobs.IndexJob
Incremental indexing	Indexes what has changed since the last indexing. If the connector does not support incremental indexing, a full indexing is run.	Agent class: com.axemble.axvdocsearch.core.jobs.IndexLastJob
Unitary indexing	Triggers the indexing of one or more documents from SDK code.	IndexingHelper (see below)

Which agents for which indexes?

By default, if you create an agent (using the classes presented above), Xtended Search will try to browse all the indexes of all your configuration XML files.

This approach works, but quickly becomes unwieldy if you have many indexes.

Jobs let you refine the indexing by creating agents dedicated to one or more indexes.

Consider the following case: my XML declares 3 indexes. For the first 2, the data source is updated at night and, because the volume of data is large, indexing takes quite a long time. My 3rd index, on the other hand, is very small, but its source data is constantly updated.

If I schedule my agent every hour (which the 3rd index requires), I will needlessly reindex the first 2 indexes, which will slow down my Process instance.

So in this case, I define a dedicated job for the 3rd index.

In my XML, I declare another job node under the “jobs” node:

<jobs>
    <!-- Don't delete this job -->
    <job name="default">
    </job>
    
    <job name="talend" agent="AgentVDocTestTalendForAXVDOCSEARCH01" >
        <index name="TestTalendForAXVDOCSEARCH01" />
    </job>
</jobs>

I give my job a name and specify:

The index or indexes covered by the job.
The name of the associated Process agent (“agent” attribute on the “job”)

Then, I create a new agent in Process (full or incremental indexing; see above) with the matching system name, here “AgentVDocTestTalendForAXVDOCSEARCH01”.

Your new agent will only take into account the indexes declared in the job.
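
The relation between agents, jobs and indexes described above can be modelled as follows. This is only a sketch of the documented behaviour, not the product's implementation: the class JobResolver is hypothetical, and it assumes that an agent not bound to any job browses every index. Whether the product excludes indexes already assigned to a dedicated job from that default run is not stated here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model (not the product's code) of how jobs narrow an agent's scope. */
final class JobResolver {

    private final List<String> allIndexes;
    private final Map<String, List<String>> indexesByAgent = new LinkedHashMap<>();

    /** @param allIndexes names of every index declared across the XML files */
    JobResolver(List<String> allIndexes) {
        this.allIndexes = List.copyOf(allIndexes);
    }

    /** Registers a job: the "agent" attribute and the names of its index child nodes. */
    JobResolver job(String agent, String... indexNames) {
        indexesByAgent.computeIfAbsent(agent, key -> new ArrayList<>()).addAll(List.of(indexNames));
        return this;
    }

    /** Indexes handled by the agent: those of its job, or (assumed) all indexes otherwise. */
    List<String> indexesFor(String agent) {
        List<String> dedicated = indexesByAgent.get(agent);
        return dedicated == null ? allIndexes : List.copyOf(dedicated);
    }
}
```

With the “talend” job from the example, indexesFor("AgentVDocTestTalendForAXVDOCSEARCH01") yields only TestTalendForAXVDOCSEARCH01, so that agent can be scheduled hourly without touching the large indexes.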

Jobs thus let you organize your indexing by theme, by indexing type, and so on.

Note also that a job declared in one XML file can point to an index declared in another file.

You can therefore create an XML file dedicated to your job organization, and doing so is recommended.

Performing a unitary indexing

Full and incremental indexing are both asynchronous.

If you want to update your index in real time, you can use the unitary indexing API, which lets you pass only the documents to be indexed (new or to be updated).

Use case:

In a process, between two steps (on submit), the current document is re-indexed so that it is up to date in the searches.

Indexing: creation, update

Method to call: IndexingHelper.unitIndexing(...)

Parameters:

System name of the index
Collection of objects to index (the type of the object depends on the extension used for indexing)
Whether to run the processing in a thread. If yes, the call is non-blocking and the processing runs in parallel.
(Optional) The customtags to process or to skip
(Optional) How to interpret that list of customtags: true to process only these tags, false to process everything but these tags (as in the use cases below)
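
The effect of the last two parameters can be sketched as a simple set operation. The class TagSelection below is a hypothetical illustration, not the code of IndexingHelper. The meaning of the boolean (true for “only these tags”, false for “all but these tags”) is inferred from the use cases in this chapter, and treating a null or empty list as “no restriction” is an assumption.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical illustration (not the product's code) of the tag filter semantics. */
final class TagSelection {

    private TagSelection() {}

    /**
     * @param indexTags all custom tags declared on the index
     * @param tags      tags passed to the call, or null for no restriction
     * @param onlyThese true: process only {@code tags}; false: process everything except {@code tags}
     * @return the tags that would be processed, in declaration order
     */
    static Set<String> tagsToProcess(Collection<String> indexTags,
                                     Collection<String> tags,
                                     boolean onlyThese) {
        Set<String> result = new LinkedHashSet<>(indexTags);
        if (tags == null || tags.isEmpty()) {
            return result; // assumption: no list means every tag is processed
        }
        if (onlyThese) {
            result.retainAll(tags);
        } else {
            result.removeAll(tags);
        }
        return result;
    }
}
```

Restricting the call to the tags that actually changed keeps a re-indexing between two steps as light as possible.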

Use cases

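Re-indexing a process document (all tags):

// iWorkflowInstance: the workflow instance to re-index; indexName: system name of the index
Collection<Object> cObject = new ArrayList<>();
cObject.add(iWorkflowInstance);
// Threaded (non-blocking) call, no tag list: all custom tags are processed
IndexingHelper.unitIndexing(indexName, cObject, true, null, false);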

Re-indexing a process document (only some tags):

Collection<Object> cObject = new ArrayList<>();
cObject.add(iWorkflowInstance);
// Tags to process
Collection<String> cTag = new ArrayList<>();
cTag.add("TAG1");
cTag.add("TAG2");
cTag.add("TAG3");
// Last argument true: only the tags listed in cTag are processed
IndexingHelper.unitIndexing(indexName, cObject, true, cTag, true);

Re-indexing a process document (all but some tags):

Collection<Object> cObject = new ArrayList<>();
cObject.add(iWorkflowInstance);
// Tags to skip
Collection<String> cTag = new ArrayList<>();
cTag.add("TAG1");
cTag.add("TAG2");
cTag.add("TAG3");
// Last argument false: every tag except those listed in cTag is processed
IndexingHelper.unitIndexing(indexName, cObject, true, cTag, false);

Deleting documents from the index

Since Xtended Search 1.5.0

Method to call: IndexingHelper.unitRemoving(...)

Parameters:

System name of the index
Collection of object IDs to delete

// IDs are passed as strings, matching the Collection<String> declaration
Collection<String> cObject = new ArrayList<>();
cObject.add("15");
IndexingHelper.unitRemoving(indexName, cObject);
