16.19. Text Extractors

ModeShape can store all kinds of content, and ModeShape makes it easy to perform full-text searches on that content. To support searching, ModeShape extracts the text from the various properties on each node. The way it does this for most property types (e.g., STRING, LONG, DATE, PATH, NAME, etc.) is to read and use the literal values. But BINARY properties are another story: there's no way to indexes the binary content directly. Instead, ModeShape has a small pluggable framework for extracting useful text from the binary content, based upon the MIME type of the content itself.

The process works like this: when a BINARY property needs to be indexed for search, ModeShape determines the MIME type of the content, determines if there is a text extractor capable of handling that MIME type, and if so it passes the content to the text extractor and gets back a string of text, and it indexes that text.

The Data Services VDB text extractor operates only upon Data Services virtual database (i.e., ".vdb") files and extracts the virtual database's logical name, description, and version, plus the logical name, description, source name, source translator name, and JNDI name for each of the virtual database's models.

Text extraction can be an intensive process, so it is not enabled by default. But enabling the text extractors in ModeShape's configuration is actually pretty easy. When using a configuration file, add a "<mode:textExtractors>" fragment under the "<configuration>" root element. Within the "<mode:textExtractors>" element place one or more "<mode:textExtractor>" fragments specifying at least the extractor's name and fully-qualified Java class.

For example, here is the fragment that defines the Data Services text extractor.

<mode:textExtractors>
    <mode:textExtractor jcr:name="VDB Text Extractors">
      <mode:description>Extract text from Data Services VDB files</mode:description>        
      <mode:classname>org.modeshape.extractor.teiid.TeiidVdbTextExtractor</mode:classname>
    </mode:textExtractor>

It's also possible to define your own text extractors by implementing the TextExtractor interface:

@ThreadSafe
public interface TextExtractor {

    /**
     * Determine if this extractor is capable of processing content with the supplied MIME type.
     * 
     * @param mimeType the MIME type; never null
     * @return true if this extractor can process content with the supplied MIME type, or false otherwise.
     */
    boolean supportsMimeType( String mimeType );

    /**
     * Sequence the data found in the supplied stream, placing the output information into the supplied map.
     * 
     * ModeShape's SequencingService determines the sequencers that should be executed by monitoring the changes to one or more
     * workspaces that it is monitoring. Changes in those workspaces are aggregated and used to determine which sequencers should
     * be called. If the sequencer implements this interface, then this method is called with the property that is to be sequenced
     * along with the interface used to register the output. The framework takes care of all the rest.
     * 
     * 
     * @param stream the stream with the data to be sequenced; never null
     * @param output the output from the sequencing operation; never null
     * @param context the context for the sequencing operation; never null
     * @throws IOException if there is a problem reading the stream
     */
    void extractFrom( InputStream stream,
                      TextExtractorOutput output,
                      TextExtractorContext context ) throws IOException;

}

As mentioned above, the "supportsMimeType" method will be called first, and only if your implementation returns true for a given MIME type will the "extractFrom" method be called. The supplied TextExtractorContext object provides information about the text being processed, while the TextExtractorOutput is a simple interface that your extractor uses to record one or more strings containing the extracted text.

If you need text extraction in sequencers or connectors, you can always get a TextExtractor instance from the ExecutionContext. That TextExtractor implementation is actually a composite of all of the text extractors defined in the configuration.

Of course, you can always use a different TextExtractor by creating a subcontext and supplying your implementation:

TextExtractor myExtractor = ...
ExecutionContext contextWithMyExtractor = context.with(myExtractor);

Este conteúdo não está disponível no idioma selecionado.

16.19. Text Extractors

Aprender

Experimente, compre e venda

Comunidades

Sobre a documentação da Red Hat

Tornando o open source mais inclusivo

Sobre a Red Hat

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links