Chapter 1. Indexing Data Grid caches

1.1. Configuring Data Grid to index caches

Enable indexing in your cache configuration and specify which entities Data Grid should include when creating indexes.

You should always configure Data Grid to index caches when using queries. Indexing provides a significant performance boost to your queries, allowing you to get faster insights into your data.

Procedure

Enable indexing in your cache configuration.
```
<distributed-cache>
  <indexing>
    
  </indexing>
</distributed-cache>
```
Tip
Adding an indexing element to your configuration enables indexing without the need to include the enabled=true attribute.
For remote caches adding this element also implicitly configures encoding as ProtoStream.

Specify the entities to index with the indexed-entity element.

<distributed-cache>
  <indexing>
    <indexed-entities>
      <indexed-entity>...</indexed-entity>
    </indexed-entities>
  </indexing>
</distributed-cache>

Protobuf messages

Specify the message declared in the schema as the value of the indexed-entity element, for example:

<distributed-cache>
  <indexing>
    <indexed-entities>
      <indexed-entity>book_sample.Book</indexed-entity>
    </indexed-entities>
  </indexing>
</distributed-cache>

This configuration indexes the Book message in a schema with the book_sample package name.

package book_sample;

/* @Indexed */
message Book {

    /* @Text(projectable = true) */
    optional string title = 1;

    /* @Text(projectable = true) */
    optional string description = 2;

    // no native Date type available in Protobuf
    optional int32 publicationYear = 3;

    repeated Author authors = 4;
}

message Author {
    optional string name = 1;
    optional string surname = 2;
}

Java objects

Specify the fully qualified name (FQN) of each class that includes the @Indexed annotation.

XML

<distributed-cache>
  <indexing>
    <indexed-entities>
      <indexed-entity>org.infinispan.sample.Car</indexed-entity>
      <indexed-entity>org.infinispan.sample.Truck</indexed-entity>
    </indexed-entities>
  </indexing>
</distributed-cache>

ConfigurationBuilder

import org.infinispan.configuration.cache.*;

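// Enable indexing, store the index on the file system, and register the indexed class.
ConfigurationBuilder config = new ConfigurationBuilder();
config.indexing().enable()
      .storage(IndexStorage.FILESYSTEM)
      .path("/some/folder")
      .addIndexedEntity(Book.class);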

1.1.1. Index configuration

Data Grid configuration controls how indexes are stored and constructed.

Index storage

You can configure how Data Grid stores indexes:

On the host file system, which is the default and persists indexes between restarts.
In JVM heap memory, which means that indexes do not survive restarts.
You should store indexes in JVM heap memory only for small datasets.

File system

<distributed-cache>
  <indexing storage="filesystem" path="${java.io.tmpdir}/baseDir">
    <!-- Indexing configuration goes here. -->
  </indexing>
</distributed-cache>

JVM heap memory

<distributed-cache>
  <indexing storage="local-heap">
    <!-- Additional indexing configuration goes here. -->
  </indexing>
</distributed-cache>

Index path

Specifies a filesystem path for the index when storage is 'filesystem'. The value can be a relative or absolute path. Relative paths are created relative to the configured global persistent location, or to the current working directory when global state is disabled.

By default, the cache name is used as a relative path for index path.

Important

When setting a custom value, ensure that there are no conflicts between caches using the same indexed entities.
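
The path rules above can be sketched in plain Java. This is only an illustration of the described behavior, not Data Grid code: the `IndexPathSketch` class, its `resolve` method, and the example directories are hypothetical names chosen for this sketch.

```java
import java.nio.file.Path;

// Hypothetical helper, not part of the Data Grid API: it only mirrors the
// path rules described in the text.
final class IndexPathSketch {

    /**
     * @param configuredPath     value of the path attribute, or null if unset
     * @param cacheName          name of the indexed cache
     * @param persistentLocation global persistent location, or null when
     *                           global state is disabled
     */
    static Path resolve(String configuredPath, String cacheName, Path persistentLocation) {
        // With no explicit path, the cache name is used as a relative path.
        Path path = Path.of(configuredPath != null ? configuredPath : cacheName);
        if (path.isAbsolute()) {
            return path; // absolute paths are used as they are
        }
        // Relative paths hang off the global persistent location, or off the
        // current working directory when global state is disabled.
        Path base = persistentLocation != null
                ? persistentLocation
                : Path.of("").toAbsolutePath();
        return base.resolve(path);
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "books", Path.of("/opt/datagrid/data")));
        System.out.println(resolve("custom/idx", "books", null));
    }
}
```

Two caches that resolve to the same directory would share index files, which is the conflict the note above warns about.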

Index startup mode

When Data Grid starts caches, it can perform operations to ensure that the index is consistent with the data in the cache. By default, it:

Checks the existing index file format.
- If the index is incompatible or corrupt, Data Grid deletes it and automatically reindexes the cache.
Automatically clears (purges) or reindexes the cache.
- If data is volatile and the index is persistent, Data Grid purges the index when it starts.
- If data is persistent and the index is volatile, Data Grid reindexes the cache when it starts.

Note

The purge operation is performed synchronously because it is usually very fast. The cache becomes available only when the purge completes.

The reindex operation is performed asynchronously because it can take a long time to complete, depending on the size of the cache. An indexed query that runs during the reindex can return partial results. You can always check whether a reindex is in progress through the query statistics.

You can also manually configure Data Grid to do one of the following:

Purge the index when the cache starts.
Reindex the cache when it starts.
Perform no indexing operation when the cache starts.

Note

If a manual configuration can lead to inconsistencies, Data Grid logs a message when the cache starts.

Clear the index when the cache starts

<distributed-cache>
  <indexing storage="filesystem" startup-mode="purge">
    <!-- Additional indexing configuration goes here. -->
  </indexing>
</distributed-cache>

Rebuild the index when the cache starts

<distributed-cache>
  <indexing storage="local-heap" startup-mode="reindex">
    <!-- Additional indexing configuration goes here. -->
  </indexing>
</distributed-cache>

1.1.1.1. Automatic strategy and shared cache stores

When both of the following conditions apply:

A shared storage is configured
The indexes are persistent on file system

the AUTO startup mode applies the REINDEX strategy. This ensures that no updates are missed if an indexed node crashes and recovers.

As a result, AUTO can penalize users of a shared cache store that never evicts entries, because it triggers more reindex operations than necessary. If all of the following conditions apply:

A shared storage is configured
The indexes are persistent on file system
Automatic evictions are not possible: cache memory max-size and max-count are not used
Manual eviction using the embedded API cache.eviction() is not used

use NONE in place of AUTO as the index startup strategy.
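
The automatic startup rules described in this section can be summarized as a small decision function. The sketch below is a plain-Java paraphrase of the text, not the Data Grid implementation: the `AutoStartupModeSketch` class, its `Action` enum, and the parameter names are hypothetical, and the final `NONE` branch is an assumption inferred from the fact that the text names no operation for the remaining cases.

```java
// Hypothetical model of the rules described in the text; the enum and method
// names are illustrative and are not the Data Grid API.
final class AutoStartupModeSketch {

    enum Action { PURGE, REINDEX, NONE }

    /**
     * @param dataPersistent  true if cache data survives a restart
     * @param indexPersistent true if indexes are stored on the file system
     * @param sharedStore     true if a shared cache store is configured
     *                        (which implies dataPersistent)
     */
    static Action resolveAuto(boolean dataPersistent, boolean indexPersistent, boolean sharedStore) {
        if (!dataPersistent && indexPersistent) {
            return Action.PURGE;   // the index would outlive the volatile data
        }
        if (dataPersistent && !indexPersistent) {
            return Action.REINDEX; // data is back after a restart, the index is not
        }
        if (sharedStore && indexPersistent) {
            return Action.REINDEX; // updates may have been missed during crash and recovery
        }
        return Action.NONE;        // assumption: nothing to reconcile in the remaining cases
    }

    public static void main(String[] args) {
        System.out.println(resolveAuto(true, true, true));
    }
}
```

The third branch is the one that the recommendation above sidesteps by choosing NONE explicitly when no eviction can take place.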

Indexing mode

indexing-mode controls how cache operations are propagated to the indexes.

auto: Data Grid immediately applies any changes to the cache to the indexes. This is the default mode.
manual: Data Grid updates indexes only when the reindex operation is explicitly invoked. Configure manual mode, for example, when you want to perform batch updates to the indexes.

Set the indexing-mode to manual:

<distributed-cache>
  <indexing indexing-mode="manual">
    <!-- Additional indexing configuration goes here. -->
  </indexing>
</distributed-cache>
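
To make the difference between the two indexing modes concrete, here is a toy model in plain Java. It is a sketch under simplifying assumptions: one map stands in for the cache and another for the index, and the `IndexingModeSketch` class and its methods are hypothetical names, not Data Grid API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a map stands in for the cache and another for the index.
// None of these names come from the Data Grid API.
final class IndexingModeSketch {

    enum Mode { AUTO, MANUAL }

    private final Mode mode;
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> index = new HashMap<>();

    IndexingModeSketch(Mode mode) {
        this.mode = mode;
    }

    void put(String key, String value) {
        cache.put(key, value);
        if (mode == Mode.AUTO) {
            index.put(key, value); // auto: the index follows every cache write
        }
        // manual: the index is left untouched until reindex() is called
    }

    void reindex() {
        index.clear();
        index.putAll(cache); // rebuild the index from the current cache content
    }

    int indexedEntries() {
        return index.size();
    }
}
```

In this model, a cache in manual mode answers indexed queries from stale data until `reindex()` runs, which is the trade-off to accept when batching index updates.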

Use Java Entities

If the cache is ProtoStream-encoded and the indexes are initialized from a Data Grid server instance, the indexed entities must be indexed Protobuf messages defined in a Protobuf schema. You can change this behavior and force Data Grid to define the indexes on the indexed Java entities that are locally accessible from the server JVM. This is useful when you want to run embedded queries from a server task against a Protobuf-encoded cache.

<distributed-cache>
  <indexing use-java-embedded-entities="true">
    <!-- Additional indexing configuration goes here. -->
  </indexing>
</distributed-cache>

Index reader

The index reader is an internal component that provides access to the indexes to perform queries. As the index content changes, Data Grid needs to refresh the reader so that search results are up to date. You can configure the refresh interval for the index reader. By default Data Grid reads the index before each query if the index changed since the last refresh.

<distributed-cache>
  <indexing storage="filesystem" path="${java.io.tmpdir}/baseDir">
    <!-- Sets an interval of one second for the index reader. -->
    <index-reader refresh-interval="1s"/>
    <!-- Additional indexing configuration goes here. -->
  </indexing>
</distributed-cache>
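
The effect of a refresh interval on query freshness can be illustrated with a toy reader. This is a hedged sketch, not Data Grid internals: the `IndexReaderSketch` class, its version counters, and the explicit `nowMillis` argument are assumptions made so the behavior is easy to follow. An interval of 0 models the default described above, where the reader refreshes before each query if the index changed.

```java
// Toy model of a refresh interval; the class and its fields are hypothetical
// and do not reflect Data Grid internals.
final class IndexReaderSketch {

    private final long refreshIntervalMillis;
    private long lastRefreshMillis;
    private long indexVersion;   // bumped on every index change
    private long visibleVersion; // version the reader currently sees

    IndexReaderSketch(long refreshIntervalMillis) {
        this.refreshIntervalMillis = refreshIntervalMillis;
    }

    void indexChanged() {
        indexVersion++;
    }

    /** Returns the index version that a query started at nowMillis observes. */
    long query(long nowMillis) {
        boolean changed = indexVersion != visibleVersion;
        boolean intervalElapsed = nowMillis - lastRefreshMillis >= refreshIntervalMillis;
        if (changed && intervalElapsed) {
            visibleVersion = indexVersion; // refresh the reader
            lastRefreshMillis = nowMillis;
        }
        return visibleVersion;
    }
}
```

With a one-second interval, a query that runs half a second after a write still sees the older version in this model. A longer interval saves refresh work at the cost of staler results.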

Index writer

The index writer is an internal component that constructs an index composed of one or more segments (sub-indexes) that can be merged over time to improve performance. Fewer segments usually means less overhead during a query because index reader operations need to take into account all segments.

Data Grid uses Apache Lucene internally and indexes entries in two tiers: memory and storage. New entries go to the memory index first and then, when a flush happens, to the configured index storage. Periodic commit operations occur that create segments from the previously flushed data and make all the index changes permanent.

Note

The index-writer configuration is optional. The defaults should work for most cases and custom configurations should only be used to tune performance.

<distributed-cache>
  <indexing storage="filesystem" path="${java.io.tmpdir}/baseDir">
    <index-writer commit-interval="2s"
                  low-level-trace="false"
                  max-buffered-entries="32"
                  queue-count="1"
                  queue-size="10000"
                  ram-buffer-size="400"
                  thread-pool-size="2">
      <index-merge calibrate-by-deletes="true"
                   factor="3"
                   max-entries="2000"
                   min-size="10"
                   max-size="20"/>
    </index-writer>
    <!-- Additional indexing configuration goes here. -->
  </indexing>
</distributed-cache>

Table 1.1. Index writer configuration attributes
Attribute	Description
`commit-interval`	Amount of time, in milliseconds, after which index changes that are buffered in memory are flushed to the index storage and a commit is performed. Because this operation is costly, avoid small values. The default is 1000 ms (1 second).
`max-buffered-entries`	Maximum number of entries that can be buffered in-memory before they are flushed to the index storage. Large values result in faster indexing but use more memory. When used in combination with the `ram-buffer-size` attribute, a flush occurs for whichever event happens first.
`ram-buffer-size`	Maximum amount of memory that can be used for buffering added entries and deletions before they are flushed to the index storage. Large values result in faster indexing but use more memory. For faster indexing performance you should set this attribute instead of `max-buffered-entries`. When used in combination with the `max-buffered-entries` attribute, a flush occurs for whichever event happens first.
`thread-pool-size`	This configuration is ignored since Infinispan 15.0. The indexing engine now uses the Infinispan thread pools.
`queue-count`	Default 4. Number of internal queues to use for each indexed type. Each queue holds a batch of modifications that is applied to the index and queues are processed in parallel. Increasing the number of queues will lead to an increase of indexing throughput, but only if the bottleneck is CPU.
`queue-size`	Default 4000. Maximum number of elements each queue can hold. Increasing the `queue-size` value increases the amount of memory that is used during indexing operations. Setting a value that is too small can lead to `CacheBackpressureFullException` or `RejectedExecutionExceptionOperationSubmitter` since index operation requests are never blocked. To solve the issue, increase the `queue-size` or set the `queue-count` to 1.
`low-level-trace`	Enables low-level trace information for indexing operations. Enabling this attribute substantially degrades performance. You should use this low-level tracing only as a last resort for troubleshooting.
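
The interaction between `max-buffered-entries` and `ram-buffer-size` in Table 1.1, where a flush occurs for whichever limit is reached first, can be sketched as follows. This is a toy model in plain Java, not Data Grid or Lucene code: the `WriteBufferSketch` class, the byte-based accounting, and the per-entry size argument are illustrative assumptions.

```java
// Toy model of the two flush thresholds; names and units are illustrative
// assumptions, not Data Grid or Lucene internals.
final class WriteBufferSketch {

    private final int maxBufferedEntries;
    private final long ramBufferBytes;
    private int bufferedEntries;
    private long bufferedBytes;
    private int flushCount;

    WriteBufferSketch(int maxBufferedEntries, long ramBufferBytes) {
        this.maxBufferedEntries = maxBufferedEntries;
        this.ramBufferBytes = ramBufferBytes;
    }

    /** Buffers one entry and flushes as soon as either threshold is reached. */
    void add(long entrySizeBytes) {
        bufferedEntries++;
        bufferedBytes += entrySizeBytes;
        if (bufferedEntries >= maxBufferedEntries || bufferedBytes >= ramBufferBytes) {
            flush();
        }
    }

    private void flush() {
        // A real writer would move the buffered entries to index storage here.
        bufferedEntries = 0;
        bufferedBytes = 0;
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }
}
```

In this model, many small entries trip the entry-count limit while a single large entry trips the memory limit, so a low value for either attribute can dominate flush frequency.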

To configure how Data Grid merges index segments, you use the index-merge sub-element.

Table 1.2. Index merge configuration attributes
Attribute	Description
`max-entries`	Maximum number of entries that an index segment can have before merging. Segments with more than this number of entries are not merged. Smaller values perform better on frequently changing indexes, larger values provide better search performance if the index does not change often.
`factor`	Number of segments that are merged at once. With smaller values, merging happens more often, which uses more resources, but the total number of segments will be lower on average, increasing search performance. Larger values (greater than 10) are best for heavy writing scenarios.
`min-size`	Minimum target size of segments, in MB, for background merges. Segments smaller than this size are merged more aggressively. Setting a value that is too large might result in expensive merge operations, even though they are less frequent.
`max-size`	Maximum size of segments, in MB, for background merges. Segments larger than this size are never merged in the background. Setting this to a lower value helps reduce memory requirements and avoids some merging operations at the cost of optimal search speed. This attribute is ignored when forcefully merging an index and `max-forced-size` applies instead.
`max-forced-size`	Maximum size of segments, in MB, for forced merges. This attribute overrides the `max-size` attribute during forced merges. Set this to the same value as `max-size` or lower. However, setting the value too low degrades search performance because documents are deleted.
`calibrate-by-deletes`	Whether the number of deleted entries in an index should be taken into account when counting the entries in the segment. Setting `false` will lead to more frequent merges caused by `max-entries`, but will more aggressively merge segments with many deleted documents, improving query performance.

Index sharding

When you have a large amount of data, you can configure Data Grid to split index data into multiple indexes called shards. Enabling data distribution among shards improves performance. By default, sharding is disabled.

Use the shards attribute to configure the number of indexes. The number of shards must be greater than 1.

<distributed-cache>
  <indexing>
    <index-sharding shards="6" />
  </indexing>
</distributed-cache>
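
The text does not say how Data Grid assigns entries to shards. Purely as an illustration of what splitting index data across a fixed number of shards can look like, the sketch below routes each key by its hash code. Hash-based routing, the `ShardRoutingSketch` class, and the `shardFor` method are all assumptions of this example, not documented Data Grid behavior.

```java
// Illustrative only: hash-based routing is an assumption of this sketch and
// is not taken from the Data Grid documentation.
final class ShardRoutingSketch {

    /** Returns a shard number in the range [0, shards) for the given key. */
    static int shardFor(Object key, int shards) {
        if (shards <= 1) {
            // Mirrors the rule that the number of shards must be greater than 1.
            throw new IllegalArgumentException("shards must be greater than 1");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), shards);
    }

    public static void main(String[] args) {
        System.out.println(shardFor("book-1", 6));
    }
}
```

Whatever routing is used, the same key must always map to the same shard so that updates and deletes reach the index that holds the entry.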

1.2. Data Grid native indexing annotations

When you enable indexing in caches, you configure Data Grid to create indexes. You also need to provide Data Grid with a structured representation of the entities in your caches so it can actually index them.

1.2.1. Overview of the Data Grid indexing annotations

@Indexed: Indicates entities, or Protobuf message types, that Data Grid indexes.

To indicate the fields that Data Grid indexes, use the indexing annotations. You can use these annotations the same way for both embedded and remote queries.

@Basic: Supports any type of field. Use the @Basic annotation for numbers and short strings that don’t require any transformation or processing.
@Decimal: Use this annotation for fields that represent decimal values.
@Keyword: Use this annotation for fields that are strings and intended for exact matching. Keyword fields are not analyzed or tokenized during indexing.
@Text: Use this annotation for fields that contain textual data and are intended for full-text search capabilities. You can use the analyzer to process the text and to generate individual tokens.
@Embedded: Use this annotation to mark a field as an embedded object within the parent entity. The NESTED structure preserves the original object relationship structure, while the FLATTENED structure turns the leaf fields into multivalued fields of the parent entity. The default structure used by @Embedded is NESTED.

Fields embedded with the NESTED structure can be used in nested object joins.

Each of the annotations supports a set of attributes that you can use to further describe how the entity is indexed.

Table 1.3. Data Grid annotations and supported attributes
Annotation	Supported attributes
@Basic	searchable, sortable, projectable, aggregable, indexNullAs
@Decimal	searchable, sortable, projectable, aggregable, indexNullAs, decimalScale
@Keyword	searchable, sortable, projectable, aggregable, indexNullAs, normalizer, norms
@Text	searchable, projectable, norms, analyzer, searchAnalyzer
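
The practical difference between @Keyword and @Text fields comes down to what ends up in the index. The sketch below imitates it in plain Java: the keyword variant keeps the whole value as one term, while the text variant splits and lowercases it. This is a deliberately naive stand-in for an analyzer, and the `FieldAnalysisSketch` class and its methods are hypothetical; real analyzers apply more rules than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Naive illustration of analyzed versus non-analyzed fields. The names are
// hypothetical and the tokenizer is far simpler than a real analyzer.
final class FieldAnalysisSketch {

    /** Keyword-style field: the value is indexed as a single, unmodified term. */
    static List<String> keywordTerms(String value) {
        return List.of(value);
    }

    /** Text-style field: the value is split into lowercase tokens. */
    static List<String> textTerms(String value) {
        List<String> terms = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String token : value.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(keywordTerms("Data Grid: The Guide"));
        System.out.println(textTerms("Data Grid: The Guide"));
    }
}
```

In this model, an exact-match query for the full title only finds the keyword-style field, while a search for a single word only finds the text-style field.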

Using Data Grid annotations

You can provide Data Grid with indexing annotations in two ways:

Annotate your Java classes or fields directly using the Data Grid annotations.
You then generate or update your Protobuf schema, .proto files, before uploading them to Data Grid Server.
Annotate Protobuf schema directly with @Indexed and @Basic, @Keyword or @Text.
You then upload your Protobuf schema to Data Grid Server.
For example, the following schema uses the @Text annotation:
```
/**
  * @Text(projectable = true)
  */
required string street = 1;
```

1.3. Rebuilding indexes

Rebuilding an index reconstructs it from the data stored in the cache. You should rebuild indexes when you change things like the definitions of indexed types or analyzers. Likewise, you can rebuild indexes after you delete them for whatever reason.

Important

Rebuilding indexes can take a long time to complete because the process takes place for all data in the grid. While the rebuild operation is in progress, queries might also return fewer results.

Procedure

Rebuild indexes in one of the following ways:

Call the reindexCache() method to programmatically rebuild an index from a Hot Rod Java client:
```
remoteCacheManager.administration().reindexCache("MyCache");
```
Tip
For remote caches you can also rebuild indexes from Data Grid Console.
Call the indexer.run() method to rebuild indexes for embedded caches as follows:
```
Indexer indexer = Search.getIndexer(cache);
CompletionStage<Void> future = indexer.run();
```
- Check the status of the reindexing operation with the reindexing attribute of the index statistics.

1.4. Updating index schema

The update index schema operation lets you add schema changes with a minimal downtime. Instead of removing previously indexed data and recreating the index schema, Data Grid adds new fields to the existing schema. Updating index schema is much faster than rebuilding the index but you can update schema only when your changes do not affect fields that were already indexed.

Important

You can update index schema only when your changes do not affect previously indexed fields. When you change index field definitions or when you delete fields, you must rebuild the index.
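
The rule for choosing between a schema update and a full rebuild can be expressed as a comparison of the old and new field definitions. The following plain-Java sketch is an illustration of that rule only: the `SchemaChangeSketch` class, the `Operation` enum, and the representation of a schema as a map from field name to a definition string are hypothetical and do not correspond to a Data Grid API.

```java
import java.util.Map;

// Hypothetical helper that paraphrases the rule in the text. A schema is
// modeled as a map from field name to a non-null definition string.
final class SchemaChangeSketch {

    enum Operation { NONE, UPDATE_SCHEMA, REBUILD }

    static Operation required(Map<String, String> oldFields, Map<String, String> newFields) {
        for (Map.Entry<String, String> old : oldFields.entrySet()) {
            String newDefinition = newFields.get(old.getKey());
            if (!old.getValue().equals(newDefinition)) {
                // A previously indexed field was deleted or redefined.
                return Operation.REBUILD;
            }
        }
        // All old fields are unchanged, so any size difference means additions.
        return newFields.size() > oldFields.size()
                ? Operation.UPDATE_SCHEMA
                : Operation.NONE;
    }

    public static void main(String[] args) {
        Map<String, String> before = Map.of("title", "@Text");
        Map<String, String> after = Map.of("title", "@Text", "publicationYear", "@Basic");
        System.out.println(required(before, after));
    }
}
```

Adding a field while leaving the existing ones untouched falls under the schema update case. Any change to an existing field falls back to a rebuild.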

Procedure

Update index schema for a given cache:
- Call the updateIndexSchema() method to programmatically update the index schema from a Hot Rod Java client:
  remoteCacheManager.administration().updateIndexSchema("MyCache");
  Tip
  For remote caches, you can update index schema from the Data Grid Console or using the REST API.

Additional resources

Rebuilding indexes

1.5. Non-indexed queries

Data Grid recommends indexing caches for the best query performance. However, you can also query caches that are not indexed.

For embedded caches, you can perform non-indexed queries on Plain Old Java Objects (POJOs).
For remote caches, you must use ProtoStream encoding with the application/x-protostream media type to perform non-indexed queries.
