Chapter 3. Defining machine learning features
As part of the Feature Store workflow, machine learning (ML) engineers or data scientists are responsible for identifying data sources and defining features of interest.
3.1. Setting up your working environment
You must set up your Red Hat OpenShift AI working environment so that you can use features in your machine learning workflow.
Prerequisites
- You have access to the OpenShift AI data science project in which your cluster administrator has set up the Feature Store instance.
Procedure
- From the OpenShift AI dashboard, click Data science projects.
- Click the name of the project in which your cluster administrator has set up the Feature Store instance.
- In the data science project in which the cluster administrator set up Feature Store, create a workbench, as described in Creating a workbench.
- To open the IDE (for example, JupyterLab) in a new window, click the open icon next to the workbench.
- Add a feature_store.yaml file to your notebook environment. For example, upload a local file or clone a Git repo that contains the file, as described in Uploading an existing notebook file to JupyterLab from a Git repository by using the CLI.
- Open a new Python notebook.
- In a cell, run the following command to install the feast CLI:
  ! pip install feast
Verification
Run the following command to list the available features:
! feast features list
The output should show a list of features, with the feature view and data type of each feature.
Optionally, run the following commands to list the registered Feast projects, feature views, and entities:
! feast projects list
! feast feature-views list
! feast entities list
3.2. About feature definitions
A machine learning feature is a measurable property or field within a data set that a machine learning model can analyze to learn patterns and make decisions. In Feature Store, you define a feature by defining the name and data type of a field.
A feature definition is a schema that includes the field name and data type, as shown in the following example:
For a list of supported data types for fields in Feature Store, see the feast.types module in the Feast documentation.
In addition to field name and data type, a feature definition can include additional metadata, specified as descriptions of features, as shown in the following example:
3.3. Specifying the data source for features
As an ML engineer or a data scientist, you must specify the data source for the features that you want to define.
The data source differs depending on whether you are using an offline store, for batch data and training data sets, or an online store, for model inference. Optionally, you can use a Parquet or a Delta-formatted file as the data source. You can specify a local file or a file in storage, such as Amazon Simple Storage Service (S3).
For offline stores, specify a batch data source. You can specify a data warehouse, such as BigQuery, Snowflake, or Redshift, or a data lake, such as Amazon S3 or Google Cloud Platform (GCP). You can use Feature Store to ingest and query data across both types of data sources.
For online stores, specify a database backend, such as Redis, GCP Datastore, or DynamoDB.
Prerequisites
- You know the location of the data source for your ML workflow.
Procedure
- In the editor of your choice, create a new Python file.
- At the beginning of the file, specify the data source for the features that you want to define within the file. For example, you can specify the data source as a Parquet-formatted file.
- Save the file.
3.4. About organizing features by using entities
Within a feature view, you can group features that share a conceptual link or relationship to define an entity. You can think of an entity as a primary key that you can use to fetch features. Typically, an entity maps to the domain of your use case. For example, a fraud detection use case could have customers and transactions as its entities, with groups of related features that correspond to these customers and transactions.
A feature does not have to be associated with an entity. For example, a feature of a customer entity could be the number of transactions that the customer makes in an average month, while a feature that is not observed on a specific entity could be the total number of transactions made by all users in the last month.
The following code defines an entity for a customer:
from feast import Entity
customer = Entity(name='dob_ssn', join_keys=['dob_ssn'])
The entity name uniquely identifies the entity. The join key identifies the physical primary key on which feature values are joined together for feature retrieval.
The following table shows example data with a single entity column (dob_ssn) and two feature columns (credit_card_due and bankruptcies).
row | timestamp | dob_ssn | credit_card_due | bankruptcies |
---|---|---|---|---|
1 | 5/22/2025 0:00:00 | 19530219_5179 | 833 | 0 |
2 | 5/22/2025 0:00:00 | 19500806_6783 | 1297 | 0 |
3 | 5/22/2025 0:00:00 | 19690214_3370 | 3912 | 1 |
4 | 5/22/2025 0:00:00 | 19570513_7405 | 8840 | 0 |
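To illustrate the idea of an entity as a primary key for fetching features, the following sketch indexes the rows of the table above by the dob_ssn join key and looks up feature values for one customer. It uses plain Python rather than the Feast API, and the `fetch_features` helper is a hypothetical name introduced for illustration:

```python
# Rows from the example table, without the timestamp column.
rows = [
    {"dob_ssn": "19530219_5179", "credit_card_due": 833, "bankruptcies": 0},
    {"dob_ssn": "19500806_6783", "credit_card_due": 1297, "bankruptcies": 0},
    {"dob_ssn": "19690214_3370", "credit_card_due": 3912, "bankruptcies": 1},
    {"dob_ssn": "19570513_7405", "credit_card_due": 8840, "bankruptcies": 0},
]

# Index the rows by the join key, as a primary key index would.
features_by_key = {row["dob_ssn"]: row for row in rows}


def fetch_features(dob_ssn, feature_names):
    """Return the requested feature values for one entity key."""
    row = features_by_key[dob_ssn]
    return {name: row[name] for name in feature_names}


print(fetch_features("19690214_3370", ["credit_card_due", "bankruptcies"]))
# prints {'credit_card_due': 3912, 'bankruptcies': 1}
```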
3.5. Creating feature views
You define features within a feature view. A feature view is an object that represents a logical group of time-series feature data in a data source. Feature views indicate to Feature Store where to find your feature values, for example, in a Parquet file or a BigQuery table.
By using feature views, you define the existing feature data in a consistent way for both an offline environment, when you train your models, and an online environment, when you want to serve features to models in production.
Feature Store uses feature views during the following tasks:
- Generating training datasets by querying the data source of feature views to find historical feature values. A single training data set can consist of features from multiple feature views.
- Loading feature values into an online or offline store. Feature views determine the storage schema in the online or offline store. Feature values can be loaded from batch sources or from stream sources.
- Retrieving features from the online or offline store. Feature views provide the schema definition for looking up features from the online or offline store.
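As a conceptual illustration of finding historical feature values, the following plain-Python sketch returns the most recent feature row for an entity as of a given point in time, ignoring rows that are older than a lookback window (the time-to-live, or TTL, of a feature view). It is a simplified model, not the Feast implementation: the `latest_within_ttl` helper, the sample values, and the inclusive window boundaries are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def latest_within_ttl(rows, entity_key, as_of, ttl):
    """Return the newest row for entity_key with as_of - ttl <= timestamp <= as_of."""
    candidates = [
        row
        for row in rows
        if row["dob_ssn"] == entity_key and as_of - ttl <= row["timestamp"] <= as_of
    ]
    # None signals that no feature value is recent enough.
    return max(candidates, key=lambda row: row["timestamp"], default=None)


# Illustrative rows for one customer; the earlier value is invented for the sketch.
rows = [
    {"dob_ssn": "19530219_5179", "timestamp": datetime(2025, 5, 20), "credit_card_due": 700},
    {"dob_ssn": "19530219_5179", "timestamp": datetime(2025, 5, 22), "credit_card_due": 833},
]

row = latest_within_ttl(rows, "19530219_5179", datetime(2025, 5, 23), timedelta(days=2))
print(row["credit_card_due"])  # prints 833
```

A larger TTL lets older rows qualify, while a row that falls outside the window is treated as missing for that point in time.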
When you create a feature project, the feature_repo subfolder includes a Python file that includes example feature definitions (for example, example_features.py).
To define new features, you can edit the code in the example file or add a new file to the feature repository.
Note: Feature views only work with timestamped data. If your data does not contain timestamps, insert dummy timestamps. The following example shows how to create a table with dummy timestamps for PostgreSQL-based data:
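A minimal sketch in Python that builds such a statement, assuming that copying the table with an added `CURRENT_TIMESTAMP` column is acceptable for your data. The helper name, table names, and column name are hypothetical, and the sketch only produces the SQL text; run the statement with the PostgreSQL client of your choice:

```python
def dummy_timestamp_table_sql(source_table, target_table, column="event_timestamp"):
    """Build a PostgreSQL statement that copies a table and adds a timestamp column."""
    # Basic guard so that only simple identifiers are interpolated into the SQL.
    for identifier in (source_table, target_table, column):
        if not identifier.isidentifier():
            raise ValueError(f"unsafe SQL identifier: {identifier!r}")
    return (
        f"CREATE TABLE {target_table} AS\n"
        f"SELECT *, CURRENT_TIMESTAMP AS {column}\n"
        f"FROM {source_table};"
    )


# Hypothetical table names.
print(dummy_timestamp_table_sql("credit_history", "credit_history_ts"))
```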
Prerequisites
- You know what data is relevant to your use case.
- You have identified attributes in your data that you want to use as features in your ML models.
Procedure
- In your IDE, such as JupyterLab, open the feature_repo/example_features.py file that contains example feature definitions, or create a new Python (.py) file in the feature_repo directory.
- Create a feature view that is relevant to your use case. A feature view definition has the following elements:
  1. A data source that provides time-stamped tabular data. A feature view must always have a data source for the generation of training datasets and when materializing feature values into the online store. Possible data sources are batch data sources from data warehouses (BigQuery, Snowflake, Redshift), data lakes (S3, GCS), or stream sources. Users can push features from data sources into Feature Store, and make the features available for training or batch scoring ("offline"), for real-time feature serving ("online"), or both.
  2. A name that identifies the feature view in the project. Within a feature view, feature names must be unique.
  3. Zero or more entities. Feature views generally contain features that are properties of a specific object, in which case that object is defined as an entity and included in the feature view. If the features are not related to a specific object, the feature view might not have entities.
  4. (Optional) Time-to-live (TTL) to limit how far back to look when Feature Store generates historical datasets.
  5. One or more feature definitions.
  6. A reference to the data source.
  7. (Optional) Metadata, such as tags that enable filtering of features when you view them in the UI, list them by using a CLI command, or query the registry directly.
- Save the file.