
Chapter 3. Defining machine learning features


As part of the Feature Store workflow, machine learning (ML) engineers or data scientists are responsible for identifying data sources and defining features of interest.

3.1. Setting up your working environment

You must set up your Red Hat OpenShift AI working environment so that you can use features in your machine learning workflow.

Prerequisites

  • You have access to the OpenShift AI data science project in which your cluster administrator has set up the Feature Store instance.

Procedure

  1. From the OpenShift AI dashboard, click Data science projects.
  2. Click the name of the project in which your cluster administrator has set up the Feature Store instance.
  3. In the data science project in which the cluster administrator set up Feature Store, create a workbench, as described in Creating a workbench.
  4. To open the IDE (for example, JupyterLab), in a new window, click the open icon next to the workbench.
  5. Add a feature_store.yaml file to your notebook environment. For example, upload a local file or clone a Git repo that contains the file, as described in Uploading an existing notebook file to JupyterLab from a Git repository by using the CLI.
  6. Open a new Python notebook.
  7. In a cell, run the following command to install the feast CLI:

    ! pip install feast

Verification

  1. Run the following command to list the available features:

    ! feast features list

    The output shows each feature with its feature view and data type, similar to the following:

    Feature             Feature View        Data Type
    credit_card_due     credit_history      Int64
    mortgage_due        credit_history      Int64
    student_loan_due    credit_history      Int64
    vehicle_loan_due    credit_history      Int64
    city                zipcode_features    String
    state               zipcode_features    String
    location_type       zipcode_features    String
  2. Optionally, run the following commands to list the registered feast projects, feature views, and entities.

    ! feast projects list
    
    ! feast feature-views list
    
    ! feast entities list

3.2. About feature definitions

A machine learning feature is a measurable property or field within a data set that a machine learning model can analyze to learn patterns and make decisions. In Feature Store, you define a feature by defining the name and data type of a field.

A feature definition is a schema that includes the field name and data type, as shown in the following example:

from feast import Field
from feast.types import Int64

credit_card_amount_due = Field(
    name="credit_card_amount_due",
    dtype=Int64
)

For a list of supported data types for fields in Feature Store, see the feast.types module in the Feast documentation.

In addition to field name and data type, a feature definition can include additional metadata, specified as descriptions of features, as shown in the following example:

from feast import Field
from feast.types import Int64


credit_card_amount_due = Field(
    name="credit_card_amount_due",
    dtype=Int64,
    description="Credit card amount due for user",
    tags={"team": "loan_department"},
)

3.3. Specifying the data source for features

As an ML engineer or a data scientist, you must specify the data source for the features that you want to define.

The data source differs depending on whether you are using an offline store, for batch data and training data sets, or an online store, for model inference. For file-based sources, you can use a Parquet-formatted or a Delta-formatted file. The file can be local or in object storage, such as Amazon Simple Storage Service (S3).

For offline stores, specify a batch data source. You can specify a data warehouse, such as BigQuery, Snowflake, or Redshift, or a data lake, such as Amazon S3 or Google Cloud Storage (GCS). You can use Feature Store to ingest and query data across both types of data sources.

For online stores, specify a database backend, such as Redis, GCP Datastore, or DynamoDB.
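The store backends are configured in the feature_store.yaml file for the project. The following is a minimal sketch that pairs a file-based offline store with a Redis online store; the project name, registry path, and connection string are illustrative values, not defaults:

```yaml
project: my_feature_project        # illustrative project name
provider: local
registry: data/registry.db         # illustrative registry path
offline_store:
  type: file                       # file-based batch sources, such as Parquet
online_store:
  type: redis
  connection_string: "localhost:6379"   # illustrative Redis address
```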

Prerequisites

  • You know the location of the data source for your ML workflow.

Procedure

  1. In the editor of your choice, create a new Python file.
  2. At the beginning of the file, specify the data source for the features that you want to define within the file.

    For example, use the following code to specify the data source as a Parquet-formatted file:

    from feast import FileSource
    from feast.data_format import ParquetFormat
    
    parquet_file_source = FileSource(
        file_format=ParquetFormat(),
        path="file:///feast/customer.parquet",
    )
  3. Save the file.

3.4. About organizing features by using entities

Within a feature view, you can group features that share a conceptual link or relationship together to define an entity. You can think of an entity as a primary key that you can use to fetch features. Typically, an entity maps to the domain of your use case. For example, a fraud detection use case could have customers and transactions as its entities, and group related features that correspond to those customers and transactions.

A feature does not have to be associated with an entity. For example, a feature of a customer entity could be the number of transactions they have made in an average month, while a feature that is not tied to a specific entity could be the total number of transactions made by all users in the last month.

from feast import Entity

customer = Entity(name='dob_ssn', join_keys=['dob_ssn'])

The entity name uniquely identifies the entity. The join key identifies the physical primary key on which feature values are joined together for feature retrieval.
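Conceptually, the join key works like a dictionary key: given a join key value, you can fetch the feature values stored for that entity. The following plain-Python sketch (not the Feast API) illustrates the idea, using values from Table 3.1:

```python
# Plain-Python sketch (not the Feast API): the entity join key acts as a
# primary key that maps each entity row to its feature values.
feature_rows = {
    "19530219_5179": {"credit_card_due": 833, "bankruptcies": 0},
    "19500806_6783": {"credit_card_due": 1297, "bankruptcies": 0},
}

def fetch_features(join_key_value, feature_names):
    """Look up the requested features for one entity by its join key."""
    row = feature_rows[join_key_value]
    return {name: row[name] for name in feature_names}

print(fetch_features("19530219_5179", ["credit_card_due"]))
# → {'credit_card_due': 833}
```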

The following table shows example data with a single entity column (dob_ssn) and two feature columns (credit_card_due and bankruptcies).

Table 3.1. Example data showing an entity and features

row  timestamp          dob_ssn        credit_card_due  bankruptcies
1    5/22/2025 0:00:00  19530219_5179  833              0
2    5/22/2025 0:00:00  19500806_6783  1297             0
3    5/22/2025 0:00:00  19690214_3370  3912             1
4    5/22/2025 0:00:00  19570513_7405  8840             0

3.5. Creating feature views

You define features within a feature view. A feature view is an object that represents a logical group of time-series feature data in a data source. Feature views indicate to Feature Store where to find your feature values, for example, in a Parquet file or a BigQuery table.

By using feature views, you define the existing feature data in a consistent way for both an offline environment, when you train your models, and an online environment, when you want to serve features to models in production.

Feature Store uses feature views during the following tasks:

  • Generating training datasets by querying the data source of feature views to find historical feature values. A single training data set can consist of features from multiple feature views.
  • Loading feature values into an online or offline store. Feature views determine the storage schema in the online or offline store. Feature values can be loaded from batch sources or from stream sources.
  • Retrieving features from the online or offline store. Feature views provide the schema definition for looking up features from the online or offline store.
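The first of these tasks, generating training data sets, is a point-in-time lookup: for each entity and timestamp, Feature Store finds the most recent feature values at or before that timestamp, within the TTL of the feature view. The following plain-Python sketch (not the Feast API; the data is illustrative) shows the idea:

```python
from datetime import datetime, timedelta

# Illustrative historical feature values for one entity, oldest first.
history = [
    (datetime(2025, 3, 1), {"credit_card_due": 700}),
    (datetime(2025, 5, 22), {"credit_card_due": 833}),
]

def point_in_time_lookup(history, as_of, ttl=timedelta(days=90)):
    """Return the latest feature values at or before `as_of`, within the TTL."""
    candidates = [(ts, vals) for ts, vals in history
                  if ts <= as_of and as_of - ts <= ttl]
    if not candidates:
        return None
    # Pick the entry with the most recent timestamp.
    return max(candidates, key=lambda pair: pair[0])[1]

print(point_in_time_lookup(history, datetime(2025, 6, 1)))
# → {'credit_card_due': 833}
```

Note that an older value that has fallen outside the TTL window is never returned, which is why limiting the TTL bounds how far back historical retrieval looks.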

When you create a feature project, the feature_repo subfolder includes a Python file that contains example feature definitions (for example, example_features.py).

To define new features, you can edit the code in the example file or add a new file to the feature repository.

Note: Feature views only work with timestamped data. If your data does not contain timestamps, insert dummy timestamps. The following example shows how to create a table with dummy timestamps for PostgreSQL-based data:

CREATE TABLE employee_metadata (
  employee_id INT PRIMARY KEY,
  department TEXT,
  dummy_event_timestamp TIMESTAMP DEFAULT '2024-01-01'
);
INSERT INTO employee_metadata (employee_id, department)
VALUES (1, 'Advanced'), (2, 'New');
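Because the INSERT statement omits the timestamp column, the column default supplies the dummy value for every row. The following sketch exercises equivalent SQL against an in-memory SQLite database to show the effect (the documented example targets PostgreSQL, but this subset of SQL is portable):

```python
import sqlite3

# Create the table with a defaulted dummy timestamp and insert rows
# without specifying the timestamp column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_metadata (
  employee_id INT PRIMARY KEY,
  department TEXT,
  dummy_event_timestamp TIMESTAMP DEFAULT '2024-01-01'
);
INSERT INTO employee_metadata (employee_id, department)
VALUES (1, 'Advanced'), (2, 'New');
""")

rows = conn.execute(
    "SELECT employee_id, department, dummy_event_timestamp FROM employee_metadata"
).fetchall()
print(rows)
# → [(1, 'Advanced', '2024-01-01'), (2, 'New', '2024-01-01')]
```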

Prerequisites

  • You know what data is relevant to your use case.
  • You have identified attributes in your data that you want to use as features in your ML models.

Procedure

  1. In your IDE, such as JupyterLab, open the feature_repo/example_features.py file that contains example feature definitions or create a new Python (.py) file in the feature_repo directory.
  2. Create a feature view that is relevant to your use case based on the structure shown in the following example:

    from datetime import timedelta

    from feast import Entity, FeatureView, Field, FileSource
    from feast.data_format import ParquetFormat
    from feast.types import Int64

    # Entity with the dob_ssn join key, as described in Section 3.4.
    dob_ssn = Entity(name="dob_ssn", join_keys=["dob_ssn"])

    credit_history_source = FileSource(     # 1
        name="Credit history",
        path="data/credit_history.parquet",
        file_format=ParquetFormat(),
        timestamp_field="event_timestamp",
        created_timestamp_column="created_timestamp",
    )

    credit_history = FeatureView(           # 2
        name="credit_history",
        entities=[dob_ssn],                 # 3
        ttl=timedelta(days=90),             # 4
        schema=[                            # 5
            Field(name="credit_card_due", dtype=Int64),
            Field(name="mortgage_due", dtype=Int64),
            Field(name="student_loan_due", dtype=Int64),
            Field(name="vehicle_loan_due", dtype=Int64),
            Field(name="hard_pulls", dtype=Int64),
            Field(name="missed_payments_2y", dtype=Int64),
            Field(name="missed_payments_1y", dtype=Int64),
            Field(name="missed_payments_6m", dtype=Int64),
            Field(name="bankruptcies", dtype=Int64),
        ],
        source=credit_history_source,       # 6
        tags={"origin": "internet"},        # 7
    )
    1
    A data source that provides time-stamped tabular data. A feature view must always have a data source for the generation of training datasets and when materializing feature values into the online store. Possible data sources are batch data sources from data warehouses (BigQuery, Snowflake, Redshift), data lakes (S3, GCS), or stream sources. Users can push features from data sources into Feature Store, and make the features available for training or batch scoring ("offline"), for realtime feature serving ("online"), or both.
    2
    A name that identifies the feature view in the project. Within a feature view, feature names must be unique.
    3
    Zero or more entities. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view. If the features are not related to a specific object, the feature view might not have entities.
    4
    (Optional) Time-to-live (TTL) to limit how far back to look when Feature Store generates historical datasets.
    5
    One or more feature definitions.
    6
    A reference to the data source.
    7
    (Optional) You can add metadata, such as tags that enable filtering of features when you view them in the UI, list them by using a CLI command, or query the registry directly.
  3. Save the file.