3.5. S3 select operations

Developers can run S3 select to accelerate throughput. Users can run S3 select queries directly, without a mediator.

S3 select supports three workflows, one for each object format it can query: CSV, Apache Parquet (Parquet), and JSON.

A CSV file stores tabular data in plain text format. Each line of the file is a data record.
Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides highly efficient data compression and encoding schemes, with enhanced performance for handling complex data in bulk. Because Parquet lets the S3 select engine skip columns and chunks, it reduces IOPS dramatically compared with the CSV and JSON formats.
JSON is a structured text format. The S3 select engine uses a JSON reader to run SQL statements over JSON input, which makes it possible to scan highly nested and complex JSON data.

For example, for a CSV, Parquet, or JSON S3 object that holds several gigabytes of data, a user can extract a single column, filtered by another column, with the following query:

Example

select customerid from s3Object where age>30 and age<65;

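Currently, the data of the S3 object must be retrieved from the Ceph OSDs through the Ceph Object Gateway before it is filtered and extracted. Performance improves when the object is large and the query is specific. The Parquet format can be processed more efficiently than CSV.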
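To make the semantics of the example query concrete, the following minimal sketch applies the same projection and filter to a small in-memory CSV in plain Python. The column names `customerid` and `age` come from the query; the sample rows and the helper name `select_customerid` are invented for illustration. It runs entirely on the client and does not contact a Ceph Object Gateway.

```python
import csv
import io

# Hypothetical sample data; a real object would be stored in a bucket.
SAMPLE_CSV = """customerid,age
c001,25
c002,31
c003,64
c004,65
"""


def select_customerid(csv_text, low=30, high=65):
    """Mimic: select customerid from s3Object where age>30 and age<65;"""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keep one column (customerid) for rows whose age is strictly inside the range.
    return [row["customerid"] for row in reader if low < int(row["age"]) < high]


print(select_customerid(SAMPLE_CSV))
```

With S3 select the same filtering happens on the gateway side, so only the matching `customerid` values travel back to the client.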

Prerequisites

A running Red Hat Ceph Storage cluster.
A RESTful client.
An S3 user created with user access.

3.5.1. S3 select content from an object

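The select object content API filters the content of an object through the structured query language (SQL). For an example of what the inventory object should contain, see the Metadata collected by inventory section of the AWS Systems Manager User Guide. The inventory content affects the type of queries that you should run against that inventory. Many SQL statements can provide essential information, but S3 select is an SQL-like utility, so some operators, such as group-by and join, are not supported.

For CSV only, you must specify the data serialization format, that is, how the comma-separated values of the object are delimited, to retrieve the specified content. Parquet is a binary format, so it has no delimiters. The select object content operation of the Amazon Web Services (AWS) command-line interface (CLI) uses the CSV or Parquet format to parse the object data into records, and returns only the records specified in the query.

You must specify the data serialization format for the response. This operation requires the s3:GetObject permission.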

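Note

The InputSerialization element describes the format of the data in the object that is being queried. Objects can be in CSV or Parquet format.
The OutputSerialization element is part of the AWS-CLI user client and describes how the output data is formatted. Ceph implements the server side of this AWS-CLI operation and therefore returns output according to OutputSerialization, which currently supports CSV only.
The format of the InputSerialization does not need to match the format of the OutputSerialization. For example, you can specify Parquet in the InputSerialization and CSV in the OutputSerialization.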
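The independence of the two elements can be sketched as a pair of small helpers that build the dictionaries passed to a client call such as boto3's `select_object_content`. The helper names `input_serialization` and `output_serialization` are hypothetical. The `Type` key for JSON is a field of the AWS API's JSON input shape; that a given gateway version accepts or needs it is an assumption.

```python
def input_serialization(fmt, field_delimiter=",", record_delimiter="\n",
                        header_info="USE"):
    """Return an InputSerialization dict for a CSV, Parquet, or JSON object."""
    fmt = fmt.upper()
    if fmt == "CSV":
        # Only CSV needs delimiters to be described.
        body = {"CSV": {"FieldDelimiter": field_delimiter,
                        "RecordDelimiter": record_delimiter,
                        "FileHeaderInfo": header_info}}
    elif fmt == "PARQUET":
        # Parquet is binary, so there is nothing to delimit.
        body = {"Parquet": {}}
    elif fmt == "JSON":
        # Assumption: "Type" taken from the AWS API shape for JSON input.
        body = {"JSON": {"Type": "DOCUMENT"}}
    else:
        raise ValueError(f"unsupported input format: {fmt}")
    body["CompressionType"] = "NONE"
    return body


def output_serialization():
    """CSV is the only output format described in this section."""
    return {"CSV": {}}


# Parquet in, CSV out: the two formats do not have to match.
print(input_serialization("parquet"), output_serialization())
```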

Syntax

POST /BUCKET/KEY?select&select-type=2 HTTP/1.1\r\n

Example

POST /testbucket/sample1csv?select&select-type=2 HTTP/1.1\r\n
POST /testbucket/sample1parquet?select&select-type=2 HTTP/1.1\r\n


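Table 3.4. Request entities
Element	Description	Type	Required
`Bucket`	The bucket to select object content from.	String	Yes
`Key`	The object key. The minimum length is 1.	String	Yes
`SelectObjectContentRequest`	Root-level tag for the select object content request parameters.	String	Yes
`Expression`	The expression that is used to query the object.	String	Yes
`ExpressionType`	The type of the provided expression, for example `SQL`. Valid values: `SQL`	String	Yes
`InputSerialization`	Describes the format of the data in the object that is being queried.	String	Yes
`OutputSerialization`	The format of the returned data, with comma separators and newlines.	String	Yes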
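As a rough illustration of how the entities in the table nest, the following sketch builds a request body with Python's standard `xml.etree.ElementTree`. The element layout follows the AWS SelectObjectContent request syntax; the function name `build_select_request` is hypothetical, the XML namespace attribute is omitted for brevity, and a real request would also need to be signed and sent by an S3 client.

```python
import xml.etree.ElementTree as ET


def build_select_request(expression, input_format="CSV"):
    """Return a SelectObjectContentRequest XML body as a string."""
    root = ET.Element("SelectObjectContentRequest")
    ET.SubElement(root, "Expression").text = expression
    ET.SubElement(root, "ExpressionType").text = "SQL"

    # InputSerialization describes the stored object (CSV, Parquet, or JSON).
    inp = ET.SubElement(root, "InputSerialization")
    ET.SubElement(inp, "CompressionType").text = "NONE"
    ET.SubElement(inp, input_format)

    # OutputSerialization describes the returned records (CSV only here).
    out = ET.SubElement(root, "OutputSerialization")
    ET.SubElement(out, "CSV")

    return ET.tostring(root, encoding="unicode")


print(build_select_request("select count(0) from s3object where int(_1)<10;"))
```

ElementTree escapes the `<` in the expression automatically, so the SQL text can be passed in unmodified.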

Response entities

If the action is successful, the service sends back an HTTP 200 response. The service returns the data in XML format.


Element	Description	Type	Required
`Payload`	Root-level tag for the payload parameters.	String	Yes
`Records`	The records event.	Base64-encoded binary data object	No
`Stats`	The stats event.	Long	No
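A large result can arrive as several `Records` events, so a client usually joins the payload chunks before decoding them. The sketch below assumes events shaped like the ones boto3 yields from `select_object_content` (dictionaries keyed by `Records`, `Stats`, or `End`); the function name `collect_records` and the fake events are invented for illustration.

```python
def collect_records(events):
    """Join the payload of every Records event and decode once at the end."""
    chunks = []
    stats = None
    for event in events:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
        elif "Stats" in event:
            stats = event["Stats"]
    # Decoding after the join is safe even if a multi-byte character
    # was split across two chunks.
    return b"".join(chunks).decode("utf-8"), stats


# Fake events: the two-byte UTF-8 encoding of "é" is split across chunks.
fake_events = [
    {"Records": {"Payload": b"1,caf\xc3"}},
    {"Records": {"Payload": b"\xa9\n2,tea\n"}},
    {"Stats": {"Details": {"BytesScanned": 20, "BytesProcessed": 20,
                           "BytesReturned": 14}}},
    {"End": {}},
]

text, stats = collect_records(fake_events)
print(text, stats)
```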

The Ceph Object Gateway supports the following response type:

Example

{:event-type,records} {:content-type,application/octet-stream} {:message-type,event}
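The three header pairs in this example travel inside a binary event-stream message. The following sketch encodes and decodes one such message, assuming the framing that AWS documents for select object content responses: a 12-byte prelude (total length, headers length, prelude CRC32), string headers, the payload, and a trailing message CRC32. That the Ceph Object Gateway emits exactly this framing is an assumption, and both function names are hypothetical. The encoder exists only to produce input for the decoder.

```python
import struct
import zlib


def encode_message(headers, payload):
    """Build one event-stream message (used here only to create test input)."""
    header_bytes = b""
    for name, value in headers.items():
        n, v = name.encode(), value.encode()
        # name length (1 byte), name, value type 7 = string,
        # value length (2 bytes), value
        header_bytes += struct.pack("!B", len(n)) + n
        header_bytes += struct.pack("!BH", 7, len(v)) + v
    total_len = 12 + len(header_bytes) + len(payload) + 4
    prelude = struct.pack("!II", total_len, len(header_bytes))
    prelude += struct.pack("!I", zlib.crc32(prelude))
    body = prelude + header_bytes + payload
    return body + struct.pack("!I", zlib.crc32(body))


def decode_message(data):
    """Return (headers, payload) for one message, checking both CRCs."""
    total_len, headers_len, prelude_crc = struct.unpack("!III", data[:12])
    if zlib.crc32(data[:8]) != prelude_crc:
        raise ValueError("prelude CRC mismatch")
    (message_crc,) = struct.unpack("!I", data[total_len - 4:total_len])
    if zlib.crc32(data[:total_len - 4]) != message_crc:
        raise ValueError("message CRC mismatch")

    headers, pos, end = {}, 12, 12 + headers_len
    while pos < end:
        name_len = data[pos]
        name = data[pos + 1:pos + 1 + name_len].decode()
        pos += 1 + name_len
        value_type, value_len = struct.unpack("!BH", data[pos:pos + 3])
        if value_type != 7:
            raise ValueError("only string headers are handled in this sketch")
        headers[name] = data[pos + 3:pos + 3 + value_len].decode()
        pos += 3 + value_len
    return headers, data[end:total_len - 4]


msg = encode_message({":event-type": "Records",
                      ":content-type": "application/octet-stream",
                      ":message-type": "event"}, b"7\n")
headers, payload = decode_message(msg)
print(headers, payload)
```

The round trip only shows that the encoder and decoder agree with each other; in practice an SDK such as boto3 performs this parsing for you.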

Syntax (for CSV)

aws --endpoint-url http://localhost:80 s3api select-object-content \
 --bucket BUCKET_NAME \
 --expression-type 'SQL' \
 --input-serialization \
 '{"CSV": {"FieldDelimiter": "," , "QuoteCharacter": "\"" , "RecordDelimiter" : "\n" , "QuoteEscapeCharacter" : "\\" , "FileHeaderInfo": "USE" }, "CompressionType": "NONE"}' \
 --output-serialization '{"CSV": {}}' \
 --key OBJECT_NAME.csv \
 --expression "select count(0) from s3object where int(_1)<10;" output.csv

Example (for CSV)

aws --endpoint-url http://localhost:80 s3api select-object-content \
 --bucket testbucket \
 --expression-type 'SQL' \
 --input-serialization \
 '{"CSV": {"FieldDelimiter": "," , "QuoteCharacter": "\"" , "RecordDelimiter" : "\n" , "QuoteEscapeCharacter" : "\\" , "FileHeaderInfo": "USE" }, "CompressionType": "NONE"}' \
 --output-serialization '{"CSV": {}}' \
 --key testobject.csv \
 --expression "select count(0) from s3object where int(_1)<10;" output.csv

Syntax (for Parquet)

aws --endpoint-url http://localhost:80 s3api select-object-content \
 --bucket BUCKET_NAME \
 --expression-type 'SQL' \
 --input-serialization \
 '{"Parquet": {}, "CompressionType": "NONE"}' \
 --output-serialization '{"CSV": {}}' \
 --key OBJECT_NAME.parquet \
 --expression "select count(0) from s3object where int(_1)<10;" output.csv

Example (for Parquet)

aws --endpoint-url http://localhost:80 s3api select-object-content \
 --bucket testbucket \
 --expression-type 'SQL' \
 --input-serialization \
 '{"Parquet": {}, "CompressionType": "NONE"}' \
 --output-serialization '{"CSV": {}}' \
 --key testobject.parquet \
 --expression "select count(0) from s3object where int(_1)<10;" output.csv

Syntax (for JSON)

aws --endpoint-url http://localhost:80 s3api select-object-content \
 --bucket BUCKET_NAME \
 --expression-type 'SQL' \
 --input-serialization \
 '{"JSON": {}, "CompressionType": "NONE"}' \
 --output-serialization '{"CSV": {}}' \
 --key OBJECT_NAME.json \
 --expression "select count(0) from s3object where int(_1)<10;" output.csv

Example (for JSON)

aws --endpoint-url http://localhost:80 s3api select-object-content \
 --bucket testbucket \
 --expression-type 'SQL' \
 --input-serialization \
 '{"JSON": {}, "CompressionType": "NONE"}' \
 --output-serialization '{"CSV": {}}' \
 --key testobject.json \
 --expression "select count(0) from s3object where int(_1)<10;" output.csv

Example (for BOTO3)

import pprint
import boto3
from botocore.exceptions import ClientError

# Placeholder connection settings; replace them with the values for your cluster.
endpoint = "http://localhost:80"
access_key = "ACCESS_KEY"
secret_key = "SECRET_KEY"
region_name = "us-east-1"


def run_s3select(bucket, key, query, column_delim=",", row_delim="\n", quot_char='"', esc_char='\\', csv_header_info="NONE", progress=False):
    """Run an S3 select query against a CSV object and return the records as text."""

    s3 = boto3.client('s3',
        endpoint_url=endpoint,
        aws_access_key_id=access_key,
        region_name=region_name,
        aws_secret_access_key=secret_key)

    try:
        r = s3.select_object_content(
            Bucket=bucket,
            Key=key,
            ExpressionType='SQL',
            InputSerialization={"CSV": {"RecordDelimiter": row_delim, "FieldDelimiter": column_delim, "QuoteEscapeCharacter": esc_char, "QuoteCharacter": quot_char, "FileHeaderInfo": csv_header_info}, "CompressionType": "NONE"},
            OutputSerialization={"CSV": {}},
            Expression=query,
            RequestProgress={"Enabled": progress})

    except ClientError as c:
        # Return the error text instead of raising, as the caller expects a string.
        return str(c)

    result = ""
    for event in r['Payload']:
        if 'Records' in event:
            # Append every chunk; a large result arrives as several Records events.
            result += event['Records']['Payload'].decode('utf-8')
        if 'Progress' in event:
            print("progress")
            pprint.pprint(event['Progress'], width=1)
        if 'Stats' in event:
            print("Stats")
            pprint.pprint(event['Stats'], width=1)
        if 'End' in event:
            print("End")
            pprint.pprint(event['End'], width=1)

    return result


print(run_s3select(
    "my_bucket",
    "my_csv_object",
    "select int(_1) as a1, int(_2) as a2 , (a1+a2) as a3 from s3object where a3>100 and a3<300;"))

Supported features

Currently, only part of the AWS s3 select command is supported:


Feature	Details	Description	Example
Arithmetic operators	^ * % / + - ( )	N/A	`select (int(_1)+int(_2))*int(_9) from s3object;`
Arithmetic operators	% modulo	N/A	`select count(*) from s3object where cast(_1 as int)%2 = 0;`
Arithmetic operators	^ power-of	N/A	`select cast(2^10 as int) from s3object;`
Comparison operators	> < >= <= == !=	N/A	`select _1,_2 from s3object where (int(_1)+int(_3))>int(_5);`
Logical operators	AND OR NOT	N/A	`select count(*) from s3object where not (int(_1)>123 and int(_5)<200);`
Logical operators	is null	Returns true or false for the null indication in the expression.	N/A
Logical operators and NULL	is not null	Returns true or false for the null indication in the expression.	N/A
Logical operators and NULL	Unknown state	Review null handling and observe the results of logical operations with NULL. The query returns `0`.	`select count(*) from s3object where null and (3>2);`
Arithmetic operators with NULL	Unknown state	Review null handling and observe the results of binary operations with NULL. The query returns `0`.	`select count(*) from s3object where (null+1) and (3>2);`
Comparison with NULL	Unknown state	Review null handling and observe the results of comparison operations with NULL. The query returns `0`.	`select count(*) from s3object where (null*1.5) != 3;`
Missing column	Unknown state	N/A	`select count(*) from s3object where _1 is null;`
Projection column	Similar to if, then, or else.	N/A	`select case when (1+1==(2+1)*3) then 'case_1' when ((4*3)==(12)) then 'case_2' else 'case_else' end, age*2 from s3object;`
Projection column	Similar to switch/case default.	N/A	`select case cast(_1 as int) + 1 when 2 then "a" when 3 then "b" else "c" end from s3object;`
Logical operators	N/A	`coalesce` returns the first non-null argument.	`select coalesce(nullif(5,5),nullif(1,1.0),age+12) from s3object;`
Logical operators	N/A	`nullif` returns null if both arguments are equal; otherwise it returns the first argument: `nullif(1,1)=NULL`, `nullif(null,1)=NULL`, `nullif(2,1)=2`.	`select nullif(cast(_1 as int),cast(_2 as int)) from s3object;`
Logical operators	N/A	`{expression} in ( .. {expression} ..)`	`select count(*) from s3object where 'ben' in (trim(_5),substring(_1,char_length(_1)-3,3),last_name);`
Logical operators	N/A	`{expression} between {expression} and {expression}`	`select _1 from s3object where cast(_1 as int) between 800 and 900;` `select count(*) from stdin where substring(_3,char_length(_3),1) between "x" and trim(_1) and substring(_3,char_length(_3)-1,1) = ":";`
Logical operators	N/A	`{expression} like {match-pattern}`	`select count(*) from s3object where first_name like '%de_';` `select count(*) from s3object where _1 like "%a[r-s]";`
Casting operators	N/A	N/A	`select cast(123 as int)%2 from s3object;`
Casting operators	N/A	N/A	`select cast(123.456 as float)%2 from s3object;`
Casting operators	N/A	N/A	`select cast('ABC0-9' as string),cast(substr('ab12cd',3,2) as int)*4 from s3object;`
Casting operators	N/A	N/A	`select cast(substring('publish on 2007-01-01',12,10) as timestamp) from s3object;`
Non-AWS casting operators	N/A	N/A	`select int(_1),int( 1.2 + 3.4) from s3object;`
Non-AWS casting operators	N/A	N/A	`select float(1.2) from s3object;`
Non-AWS casting operators	N/A	N/A	`select to_timestamp('1999-10-10T12:23:44Z') from s3object;`
Aggregation functions	sum	N/A	`select sum(int(_1)) from s3object;`
Aggregation functions	avg	N/A	`select avg(cast(_1 as float) + cast(_2 as int)) from s3object;`
Aggregation functions	min	N/A	`select min(cast(_1 as float) + cast(_2 as int)) from s3object;`
Aggregation functions	max	N/A	`select max(float(_1)),min(int(_5)) from s3object;`
Aggregation functions	count	N/A	`select count(*) from s3object where (int(_1)+int(_3))>int(_5);`
Timestamp functions	extract	N/A	`select count(*) from s3object where extract(year from to_timestamp(_2)) > 1950 and extract(year from to_timestamp(_1)) < 1960;`
Timestamp functions	date_add	N/A	`select count(0) from s3object where date_diff(year,to_timestamp(_1),date_add(day,366,to_timestamp(_1))) = 1;`
Timestamp functions	date_diff	N/A	`select count(0) from s3object where date_diff(month,to_timestamp(_1),to_timestamp(_2)) = 2;`
Timestamp functions	utcnow	N/A	`select count(0) from s3object where date_diff(hour,utcnow(),date_add(day,1,utcnow())) = 24;`
Timestamp functions	to_string	N/A	`select to_string( to_timestamp("2009-09-17T17:56:06.234567Z"), "yyyyMMdd-H:m:s") from s3object;`
String functions	substring	N/A	`select count(0) from s3object where int(substring(_1,1,4))>1950 and int(substring(_1,1,4))<1960;`
String functions	substring	A negative number after `from` is valid and is treated as the first position.	`select substring("123456789" from -4) from s3object;`
String functions	substring	An out-of-bounds number after `from 0 for` is valid and is treated as (first,last).	`select substring("123456789" from 0 for 100) from s3object;`
String functions	trim	N/A	`select trim(' foobar ') from s3object;`
String functions	trim	N/A	`select trim(trailing from ' foobar ') from s3object;`
String functions	trim	N/A	`select trim(leading from ' foobar ') from s3object;`
String functions	trim	N/A	`select trim(both '12' from '1112211foobar22211122') from s3object;`
String functions	lower or upper	N/A	`select lower('ABcD12#$e') from s3object;`
String functions	char_length, character_length	N/A	`select count(*) from s3object where char_length(_3)=3;`
Complex queries	N/A	N/A	`select sum(cast(_1 as int)),max(cast(_3 as int)), substring('abcdefghijklm', (2-1)*3+sum(cast(_1 as int))/sum(cast(_1 as int))+1, (count(*) + count(0))/count(0)) from s3object;`
Alias support	N/A	N/A	`select int(_1) as a1, int(_2) as a2 , (a1+a2) as a3 from s3object where a3>100 and a3<300;`
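The NULL-related rows in the table follow SQL's three-valued logic. The sketch below models that behavior in plain Python, with `None` standing in for NULL, to show why the "unknown state" queries return `0` and how `nullif` and `coalesce` combine. All function names are hypothetical, and the code illustrates the semantics described in the table rather than the S3 select engine's implementation.

```python
def nullif(a, b):
    """Return None (NULL) when both arguments are equal, else the first one."""
    if a is not None and a == b:
        return None
    return a


def coalesce(*args):
    """Return the first argument that is not None, or None if all are."""
    for arg in args:
        if arg is not None:
            return arg
    return None


def and3(a, b):
    """Three-valued AND: False wins, then unknown (None), then True."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def count_where(rows, predicate):
    """count(*) keeps a row only when the predicate is definitely True."""
    return sum(1 for row in rows if predicate(row) is True)


rows = [{"age": 30}, {"age": 41}]

# select count(*) from s3object where null and (3>2);
print(count_where(rows, lambda row: and3(None, 3 > 2)))

# select coalesce(nullif(5,5),nullif(1,1.0),age+12) from s3object;
print([coalesce(nullif(5, 5), nullif(1, 1.0), row["age"] + 12) for row in rows])
```

Because `null and (3>2)` evaluates to unknown rather than true, no row passes the filter and the count is 0.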
