3.6. S3 select operations (Technology Preview)

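As a developer, you can use the S3 select API with high-level analytic applications such as Spark-SQL to improve latency and throughput. For example, for a CSV S3 object that holds several gigabytes of data, you can extract a single column, filtered by another column, with the following query:

Example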

select customerid from s3Object where age>30 and age<65;

Currently, the S3 object must retrieve data from the Ceph OSD through the Ceph Object Gateway before the data is filtered and extracted. Performance improves when the object is large and the query is more specific.
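
To make the effect of the example query concrete, the following minimal sketch applies the same filter to a small CSV string on the client side. It only illustrates what the query selects; it does not call the Ceph Object Gateway. The sample data, the `name` column, and the `select_customerid` function are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample data; customerid and age mirror the columns in the query.
SAMPLE_CSV = """customerid,name,age
c001,Ann,29
c002,Bob,31
c003,Cid,64
c004,Dee,65
"""


def select_customerid(csv_text, low=30, high=65):
    """Return customerid for every row where low < age < high."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["customerid"] for row in reader if low < int(row["age"]) < high]


if __name__ == "__main__":
    # Equivalent in spirit to:
    #   select customerid from s3Object where age>30 and age<65;
    print(select_customerid(SAMPLE_CSV))
```

With S3 select, this filtering happens on the gateway side, so only the matching column values travel back to the client.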

3.6.1. Prerequisites

A running Red Hat Ceph Storage cluster.
A RESTful client.
An S3 user created with user access.

3.6.2. S3 select content from an object

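The select object content API filters the content of an object through the structured query language (SQL). In the request, you must specify the data serialization format of the object, which is comma-separated values (CSV), to retrieve the specified content. The Amazon Web Services (AWS) command-line interface (CLI) select object content operation uses the CSV format to parse the object data into records and returns only the records specified in the query.

Note

You must specify the data serialization format for the response. You must have the s3:GetObject permission for this operation.

Syntax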

POST /BUCKET/KEY?select&select-type=2 HTTP/1.1\r\n

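Example

POST /testbucket/sample1csv?select&select-type=2 HTTP/1.1\r\n

Request entities

Bucket

Description: The bucket to select object content from.
Type: String
Required: Yes

Key

Description: The object key.
Length constraints: Minimum length of 1.
Type: String
Required: Yes

SelectObjectContentRequest

Description: Root-level tag for the select object content request parameters.
Type: String
Required: Yes

Expression

Description: The expression that is used to query the object.
Type: String
Required: Yes

ExpressionType

Description: The type of the provided expression, for example SQL.
Type: String
Valid values: SQL
Required: Yes

InputSerialization

Description: Describes the format of the data in the object that is being queried.
Type: String
Required: Yes

OutputSerialization

Description: The format of the returned data, which uses a comma separator and a newline.
Type: String
Required: Yes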
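
The request entities above are sent as an XML body. The following sketch shows one way to assemble such a body with the Python standard library. The element nesting follows the shape of the AWS SelectObjectContent request as commonly documented; whether the Ceph Object Gateway accepts every element shown here is an assumption, and `build_select_request` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def build_select_request(expression):
    """Return a SelectObjectContentRequest XML body for a CSV object."""
    root = ET.Element("SelectObjectContentRequest")
    ET.SubElement(root, "Expression").text = expression
    ET.SubElement(root, "ExpressionType").text = "SQL"

    # InputSerialization: how the stored object is encoded.
    input_ser = ET.SubElement(root, "InputSerialization")
    ET.SubElement(input_ser, "CompressionType").text = "NONE"
    csv_in = ET.SubElement(input_ser, "CSV")
    ET.SubElement(csv_in, "FileHeaderInfo").text = "USE"
    ET.SubElement(csv_in, "FieldDelimiter").text = ","

    # OutputSerialization: how the returned records are encoded.
    output_ser = ET.SubElement(root, "OutputSerialization")
    ET.SubElement(output_ser, "CSV")

    # ElementTree escapes < and > in the SQL expression automatically.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_select_request("select count(0) from stdin where int(_1)<10;"))
```

Building the body with an XML library rather than string concatenation avoids malformed requests when the expression contains comparison operators.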

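Response entities

If the action is successful, the service sends back an HTTP 200 response. The service returns the data in XML format.

Payload

Description: Root-level tag for the payload parameters.
Type: String
Required: Yes

Records

Description: The records event.
Type: Base64-encoded binary data object
Required: No

Stats

Description: The stats event.
Type: Long
Required: No

The Ceph Object Gateway supports the following response:

Example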

{:event-type,records} {:content-type,application/octet-stream} {:message-type,event}

Syntax

aws --endpoint-url http://localhost:80 s3api select-object-content \
 --bucket BUCKET_NAME \
 --expression-type 'SQL' \
 --input-serialization \
 '{"CSV": {"FieldDelimiter": "," , "QuoteCharacter": "\"" , "RecordDelimiter" : "\n" , "QuoteEscapeCharacter" : "\\" , "FileHeaderInfo": "USE" }, "CompressionType": "NONE"}' \
 --output-serialization '{"CSV": {}}' \
 --key OBJECT_NAME \
 --expression "select count(0) from stdin where int(_1)<10;" output.csv

Example

aws --endpoint-url http://localhost:80 s3api select-object-content \
 --bucket testbucket \
 --expression-type 'SQL' \
 --input-serialization \
 '{"CSV": {"FieldDelimiter": "," , "QuoteCharacter": "\"" , "RecordDelimiter" : "\n" , "QuoteEscapeCharacter" : "\\" , "FileHeaderInfo": "USE" }, "CompressionType": "NONE"}' \
 --output-serialization '{"CSV": {}}' \
 --key testobject \
 --expression "select count(0) from stdin where int(_1)<10;" output.csv
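
The same request can be issued from a program. The sketch below assumes a boto3-style S3 client whose `select_object_content` method takes the same parameters as the CLI command and returns an event stream under `Payload`. To keep the example runnable without a cluster, it uses a stand-in `FakeClient`; the class, the `run_select` helper, and the canned result are assumptions made for illustration.

```python
def run_select(client, bucket, key, expression):
    """Run an S3 select query and return the concatenated record bytes."""
    response = client.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression=expression,
        ExpressionType="SQL",
        InputSerialization={
            "CSV": {
                "FieldDelimiter": ",",
                "QuoteCharacter": '"',
                "RecordDelimiter": "\n",
                "QuoteEscapeCharacter": "\\",
                "FileHeaderInfo": "USE",
            },
            "CompressionType": "NONE",
        },
        OutputSerialization={"CSV": {}},
    )
    chunks = []
    for event in response["Payload"]:
        # Only Records events carry query output; other events are skipped.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks)


class FakeClient:
    """Stand-in for an S3 client so the sketch runs without a cluster."""

    def select_object_content(self, **kwargs):
        self.kwargs = kwargs  # kept so a caller can inspect what was sent
        return {"Payload": [{"Records": {"Payload": b"7\n"}}, {"Stats": {}}, {"End": {}}]}


if __name__ == "__main__":
    # With boto3 installed, a real client could be created with
    #   boto3.client("s3", endpoint_url="http://localhost:80")
    query = "select count(0) from stdin where int(_1)<10;"
    print(run_select(FakeClient(), "testbucket", "testobject", query))
```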

Supported features

Currently, only part of the AWS s3 select command is supported:

Feature	Details	Description	Example
Arithmetic operators	^ * % / + - ( )		`select (int(_1)+int(_2))*int(_9) from stdin;`
Arithmetic operators	% modulo		`select count(*) from stdin where cast(_1 as int)%2 == 0;`
Arithmetic operators	^ power-of		`select cast(2^10 as int) from stdin;`
Compare operators	> < >= <= == !=		`select _1,_2 from stdin where (int(_1)+int(_3))>int(_5);`
Logical operator	AND OR NOT		`select count(*) from stdin where not (int(_1)>123 and int(_5)<200);`
Logical operator	is null	Returns true/false for the null indication in the expression.
Logical operator and NULL	is not null	Returns true/false for the null indication in the expression.
Logical operator and NULL	unknown state	Review the null handling and observe the results of logical operations with NULL. The query returns `0`.	`select count(*) from stdin where null and (3>2);`
Arithmetic operator with NULL	unknown state	Review the null handling and observe the results of binary operations with NULL. The query returns `0`.	`select count(*) from stdin where (null+1) and (3>2);`
Compare with NULL	unknown state	Review the null handling and observe the results of compare operations with NULL. The query returns `0`.	`select count(*) from stdin where (null*1.5) != 3;`
Missing column	unknown state		`select count(*) from stdin where _1 is null;`
Projection column	Similar to if, then, or else.		`select case when (1+1==(2+1)*3) then 'case_1' when ((4*3)==(12)) then 'case_2' else 'case_else' end, age*2 from stdin;`
Logical operator		`coalesce` returns the first non-null argument.	`select coalesce(nullif(5,5),nullif(1,1.0),age+12) from stdin;`
Logical operator		`nullif` returns null if both arguments are equal, otherwise the first argument: `nullif(1,1)=NULL nullif(null,1)=NULL nullif(2,1)=2`.	`select nullif(cast(_1 as int),cast(_2 as int)) from stdin;`
Logical operator		`{expression} in ( .. {expression} ..)`	`select count(*) from s3object where 'ben' in (trim(_5),substring(_1,char_length(_1)-3,3),last_name);`
Logical operator		`{expression} between {expression} and {expression}`	`select count(*) from stdin where substring(_3,char_length(_3),1) between "x" and trim(_1) and substring(_3,char_length(_3)-1,1) == ":";`
Logical operator		`{expression} like {match-pattern}`	`select count(*) from stdin where first_name like '%de_'; select count(*) from stdin where _1 like "%a[r-s]";`
Casting operator			`select cast(123 as int)%2 from stdin;`
Casting operator			`select cast(123.456 as float)%2 from stdin;`
Casting operator			`select cast('ABC0-9' as string),cast(substr('ab12cd',3,2) as int)*4 from stdin;`
Casting operator			`select cast(substring('publish on 2007-01-01',12,10) as timestamp) from stdin;`
Non-AWS casting operator			`select int(_1),int( 1.2 + 3.4) from stdin;`
Non-AWS casting operator			`select float(1.2) from stdin;`
Non-AWS casting operator			`select timestamp('1999:10:10-12:23:44') from stdin;`
Aggregation function	sum		`select sum(int(_1)) from stdin;`
Aggregation function	avg		`select avg(cast(_1 as float) + cast(_2 as int)) from stdin;`
Aggregation function	min		`select min(cast(_1 as float) + cast(_2 as int)) from stdin;`
Aggregation function	max		`select max(float(_1)),min(int(_5)) from stdin;`
Aggregation function	count		`select count(*) from stdin where (int(_1)+int(_3))>int(_5);`
Timestamp functions	extract		`select count(*) from stdin where extract('year',timestamp(_2)) > 1950 and extract('year',timestamp(_1)) < 1960;`
Timestamp functions	dateadd		`select count(0) from stdin where datediff('year',timestamp(_1),dateadd('day',366,timestamp(_1))) == 1;`
Timestamp functions	datediff		`select count(0) from stdin where datediff('month',timestamp(_1),timestamp(_2)) == 2;`
Timestamp functions	utcnow		`select count(0) from stdin where datediff('hours',utcnow(),dateadd('day',1,utcnow())) == 24;`
String functions	substring		`select count(0) from stdin where int(substring(_1,1,4))>1950 and int(substring(_1,1,4))<1960;`
String functions	trim		`select trim(' foobar ') from stdin;`
String functions	trim		`select trim(trailing from ' foobar ') from stdin;`
String functions	trim		`select trim(leading from ' foobar ') from stdin;`
String functions	trim		`select trim(both '12' from '1112211foobar22211122') from stdin;`
String functions	lower or upper		`select upper(_1),lower(_2) from stdin;`
String functions	char_length, character_length		`select count(*) from stdin where char_length(_3)==3;`
Complex queries			`select sum(cast(_1 as int)),max(cast(_3 as int)), substring('abcdefghijklm', (2-1)*3+sum(cast(_1 as int))/sum(cast(_1 as int))+1, (count() + count(0))/count(0)) from stdin;`
Alias support			`select int(_1) as a1, int(_2) as a2 , (a1+a2) as a3 from stdin where a3>100 and a3<300;`

Additional resources

See Amazon's S3 Select Object Content API for more details.

3.6.3. S3 supported select functions

S3 select supports the following functions:

Timestamp

timestamp(string)

Description: Converts a string to the basic timestamp type.
Supported: Currently, it converts yyyy:mm:dd hh:mi:ss.

extract(date-part,timestamp)

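Description: Returns an integer according to the date-part extracted from the input timestamp.
Supported: date-part: year, month, week, day.

dateadd(date-part,integer,timestamp)

Description: Returns a timestamp calculated from the input timestamp and the date-part.
Supported: date-part: year, month, day.

datediff(date-part,timestamp,timestamp)

Description: Returns an integer, the calculated difference between the two timestamps according to the date-part.
Supported: date-part: year, month, day, hours.

utcnow()

Description: Returns the timestamp of the current time.

Aggregation

count()

Description: Returns an integer based on the number of rows that match a condition, if there is one.

sum(expression)

Description: Returns the sum of the expression over each row that matches a condition, if there is one.

avg(expression)

Description: Returns the average of the expression over each row that matches a condition, if there is one.

max(expression)

Description: Returns the maximal result for all expressions that match a condition, if there is one.

min(expression)

Description: Returns the minimal result for all expressions that match a condition, if there is one.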
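
As an illustration of how the aggregation functions relate to the optional condition, the following sketch computes the same five aggregates over a list of integers in plain Python. It is not the S3 select engine; the `aggregate` helper is hypothetical, and returning `None` for an empty match set is an assumption of this sketch.

```python
from statistics import mean


def aggregate(values, condition=lambda v: True):
    """Compute count, sum, avg, max, and min over the values that match."""
    matched = [v for v in values if condition(v)]
    return {
        "count": len(matched),
        "sum": sum(matched),
        # Assumption: aggregates over no rows are reported as None here.
        "avg": mean(matched) if matched else None,
        "max": max(matched, default=None),
        "min": min(matched, default=None),
    }


if __name__ == "__main__":
    # Roughly: select count(0), sum(int(_1)), ... from stdin where int(_1)<15;
    print(aggregate([5, 12, 7, 20], lambda v: v < 15))
```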

String

substring(string,from,to)

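Description: Returns a string extracted from the input string according to the from and to inputs.

Char_length

Description: Returns the number of characters in a string. Character_length does the same.

Trim

Description: Trims leading or trailing characters from the target string. The default is a blank character.

Upper/lower

Description: Converts characters into uppercase or lowercase.

NULL

The NULL value is missing or unknown, which means that NULL cannot produce a value in any arithmetic operation. The same applies to arithmetic comparisons: any comparison with NULL is NULL, that is, unknown.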

Table 3.4. The NULL use case
A is NULL	Result (NULL=UNKNOWN)
Not A	`NULL`
A or False	`NULL`
A or True	`True`
A or A	`NULL`
A and False	`False`
A and True	`NULL`
A and A	`NULL`
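
The rows of Table 3.4 follow standard three-valued logic. The sketch below reproduces them in Python, using `None` to stand for NULL. The function names `not3`, `or3`, and `and3` are hypothetical and are not part of S3 select.

```python
def not3(a):
    """NOT with NULL (None) treated as unknown."""
    return None if a is None else (not a)


def or3(a, b):
    """OR: True wins over unknown; otherwise unknown propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def and3(a, b):
    """AND: False wins over unknown; otherwise unknown propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


if __name__ == "__main__":
    A = None  # A is NULL
    for label, value in [
        ("Not A", not3(A)),
        ("A or False", or3(A, False)),
        ("A or True", or3(A, True)),
        ("A and False", and3(A, False)),
        ("A and True", and3(A, True)),
    ]:
        print(label, "=>", "NULL" if value is None else value)
```

Because a `where` clause keeps only the rows for which the condition is true, a condition that evaluates to NULL filters the row out, which is consistent with the NULL example queries in the supported features table returning `0`.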

Additional resources

See Amazon's S3 Select Object Content API for more details.

3.6.4. S3 alias programming construct

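The alias programming construct is an essential part of the s3 select language because it makes programming much easier for objects that contain many columns and for complex queries. When a statement with an alias construct is parsed, the alias is replaced with a reference to the right projection column, and on query execution the reference is evaluated like any other expression. An alias maintains a result cache: if an alias is used more than once, the same expression is not evaluated again, and the same result is returned from the cache. Currently, Red Hat supports the column alias.

Example

select int(_1) as a1, int(_2) as a2 , (a1+a2) as a3 from s3object where a3>100 and a3<300;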
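
The result cache described above can be pictured with a small memoization sketch. This is not the engine's implementation; it only models the stated behavior, where each aliased expression is evaluated at most once per row. The `AliasRow` class, the `ALIASES` mapping, and `select_row` are hypothetical names.

```python
class AliasRow:
    """Evaluate aliased expressions for one row, caching each result."""

    def __init__(self, row, aliases):
        self.row = row
        self.aliases = aliases  # alias name -> function(AliasRow) -> value
        self.cache = {}
        self.evaluations = 0  # counts real evaluations, not cache hits

    def get(self, name):
        if name not in self.cache:
            self.evaluations += 1
            self.cache[name] = self.aliases[name](self)
        return self.cache[name]


# Mirrors: int(_1) as a1, int(_2) as a2, (a1+a2) as a3
ALIASES = {
    "a1": lambda r: int(r.row[0]),
    "a2": lambda r: int(r.row[1]),
    "a3": lambda r: r.get("a1") + r.get("a2"),
}


def select_row(row):
    """Apply 'where a3>100 and a3<300' and project a1, a2, a3."""
    r = AliasRow(row, ALIASES)
    if 100 < r.get("a3") < 300:
        return (r.get("a1"), r.get("a2"), r.get("a3")), r.evaluations
    return None, r.evaluations


if __name__ == "__main__":
    print(select_row(["60", "70"]))
```

Although `a3` appears in the condition and again in the projection, and `a1` and `a2` are each referenced twice, the sketch evaluates each of the three expressions only once per row.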

3.6.5. S3 CSV parsing explained

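You can define the CSV definitions with input serialization, which uses these default values:

Use {\n} for the row delimiter.
Use {"} for the quote character.
Use {\} for the escape character.

The csv-header-info is parsed from the first row of the input object, which contains the schema. Currently, output serialization and compression type are not supported. The S3 select engine has a CSV parser that parses S3 objects as follows:

Each row ends with a row delimiter.
The field separator separates adjacent columns.
Successive field separators define a NULL column.
The quote character overrides the field separator; that is, a field separator between quotes is treated as an ordinary character.
The escape character disables any special character except the row delimiter.

The following are examples of CSV parsing rules:

Table 3.5. CSV parsing
Feature	Description	Input (Tokens)
`NULL`	Successive field delimiters.	`,,1,,2, ==> {null}{null}{1}{null}{2}{null}`
`QUOTE`	The quote character overrides the field delimiter.	`11,22,"a,b,c,d",last ==> {11}{22}{"a,b,c,d"}{last}`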
`Escape`	The escape character overrides the meta-character.	
`row delimiter`	There is no closing quote; the row delimiter ends the row.	`11,22,a="str,44,55,66 ==> {11}{22}{a="str,44,55,66}`
`csv header info`	FileHeaderInfo tag	The USE value means that each token on the first line is a column name. The IGNORE value means that the first line is skipped.
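
The parsing rules and the examples in Table 3.5 can be approximated with a short tokenizer. The sketch below is not the S3 select CSV parser; it is a minimal model of the listed rules. Keeping the quote characters in the token follows the table, while dropping the escape character from the output is an assumption of this sketch. `parse_row` and `parse_csv` are hypothetical names.

```python
def parse_row(line, sep=",", quote='"', escape="\\"):
    """Split one CSV row into tokens; empty fields become None (NULL)."""
    fields, current = [], []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == escape and i + 1 < len(line):
            # Assumption: the escaped character is kept, the escape is dropped.
            current.append(line[i + 1])
            i += 2
            continue
        if ch == quote:
            in_quotes = not in_quotes
            current.append(ch)  # quotes stay in the token, as in Table 3.5
        elif ch == sep and not in_quotes:
            fields.append("".join(current) or None)
            current = []
        else:
            current.append(ch)
        i += 1
    # The row delimiter closes the last field, even inside an open quote.
    fields.append("".join(current) or None)
    return fields


def parse_csv(text, row_delimiter="\n"):
    """Split text into rows first, so an escape never hides a row delimiter."""
    return [parse_row(row) for row in text.split(row_delimiter) if row]


if __name__ == "__main__":
    for sample in [",,1,,2,", '11,22,"a,b,c,d",last', '11,22,a="str,44,55,66']:
        print(sample, "==>", parse_row(sample))
```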

Additional resources

See Amazon's S3 Select Object Content API for more details.
