ISSUE:
How can I use HCatalog to discover which files are associated with a partition in a table so that the files can be read directly from HDFS?
How do I place files in HDFS and then add them as a new partition to an existing table?
SOLUTION:
This document describes how to use HCatalog to discover which files are associated with a particular partition in a table so that those files can be read directly from HDFS, and how to place files in HDFS and then add them as a new partition to an existing table.
If you installed the code as tarballs, you will need to know the following before starting:
- hadoop_home: the location where Hadoop is installed on your client machine. For example, if you did an install under /home/hadoop/hdp10/hadoop, then this will be your hadoop_home value.
- hcat_home: the location where the HCatalog client is installed on your client machine. For example, if you did an install under /home/hadoop/hdp10/hcatalog, then this will be your hcat_home value.
- table_name: the name of the table you wish to read from or write to.
- templeton_host: the hostname of the machine running Templeton, the web services API for HCatalog. This is only necessary if you are doing these calls via Templeton.
- user_name: the name of the user to run these commands as. This is only necessary if you are doing these calls via Templeton.
Throughout the document, commands are detailed both for operations done on the command line and for those done via web services. For command-line operations, if the client was installed from a tarball (rather than an RPM), it is assumed that your environment has the variable HADOOP_HOME set to hadoop_home and that hcat_home/bin is in your shell's PATH environment variable.
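For example, with the tarball locations used in the examples above (substitute your own installation paths), you might set:
export HADOOP_HOME=/home/hadoop/hdp10/hadoop
export PATH=$PATH:/home/hadoop/hdp10/hcatalog/bin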
Reading
Step 1: Determine the schema of the table (Optional)
Command line:
hcat -e "describe <table_name>;"
This will return text that looks like:
OK
id      bigint
user    string
my_p    string
my_q    string
Values on the left are column names, values on the right are data types.
Web services:
URL: GET to <templeton_host>/templeton/v1/ddl/database/<db-name>/table/<table-name>
Accept: application/json
Content-Type: application/json
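Example curl command (the hostname, database, and table name here are placeholders, in keeping with the examples used later in this document):
curl -HContent-type:application/json 'http://www.myserver.com/templeton/v1/ddl/database/default/table/test_table'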
Example JSON Response:
{ "columns": [ { "name": "id", "type": "bigint" }, { "name": "user", "type": "string" }, { "name": "my_p", "type": "string" }, { "name": "my_q", "type": "string" } ], "database": "default", "table": "my_table" }
Step 2: Get a list of all partitions of the table (Optional)
Command line:
hcat -e "show partitions <table_name>;"
This will return text that looks like:
OK
ds=20110924
ds=20110925
Each line represents one partition, with the partition key to the left of the equal sign and the value for that partition to the right. If there are multiple partition keys, they will be comma-separated.
Web services:
URL: GET to <templeton_host>/templeton/v1/ddl/database/<db-name>/table/<table-name>/partition
Accept: application/json
Content-Type: application/json
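Example curl command (again with placeholder hostname, database, and table name):
curl -HContent-type:application/json 'http://www.myserver.com/templeton/v1/ddl/database/default/table/test_table/partition'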
Example JSON response:
{ "partitions": [ { "values": [ { "columnName": "ds", "columnValue": "20110924" }, ], "name": "ds='20110924'" }, { "values": [ { "columnName": "ds", "columnValue": "20110925" }, ], "name": "ds='20110925'" }, ], "database": "default", "table": "my_table" }
Step 3: Find location information for the partition you wish to read
Once you know the partition values for the partition you wish to read, you can find its location information. In the following statements, part_col is the name of the partition column and part_value is the value of that column for the partition you are reading.
Command line:
hcat -e "show table extended like <table_name> partition(<part_col>=<part_value>);"
This will return text that looks like:
OK
tableName:studentparttab30k
owner:hortonal
location:hdfs://hrt9n03.cc1.ygridcore.net:9000/user/hcat/tests/data/studentparttab30k/studentparttab30k.20110924
inputformat:org.apache.hadoop.mapred.TextInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
columns:struct columns { string name, i32 age, double gpa}
partitioned:true
partitionColumns:struct partition_columns { string ds}
totalNumberFiles:1
totalFileSize:219190
maxFileSize:219190
minFileSize:219190
lastAccessTime:1328599687969
lastUpdateTime:1325806298324
You need the line starting with location:. This indicates the file or directory name where the partition data is stored. Depending on how the partition was created, this may be either a file or a directory.
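If you are scripting this, you could, for example, filter the output for just the location line (the table and partition values here are taken from the example output above):
hcat -e "show table extended like studentparttab30k partition(ds='20110924');" | grep '^location:'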
Web services:
URL: GET to <templeton_host>/templeton/v1/ddl/database/<db-name>/table/<table-name>/partition/<partition-name>
Accept: application/json
Content-Type: application/json
<partition-name> will be the name as given in the list of partitions obtained in Step 2. For example, for a table partitioned by the field ds where you wish to get the partition where ds=20110924, the name will be ds='20110924'. Note that the quotation marks will need to be escaped in your URL.
Example curl command:
curl -HContent-type:application/json 'http://www.myserver.com/templeton/v1/ddl/database/default/table/test_table/partition/ds=%2720110924%27'
Example JSON Response:
{ "minFileSize": 184, "totalNumberFiles": 1, "location": "hdfs://localhost:9000/user/hive/warehouse/my_table/my_p=XYZ/my_q=ABC", "lastUpdateTime": 1329980827336, "lastAccessTime": 1329980816220, "columns": [ { "name": "id", "type": "bigint" }, { "name": "user", "type": "string" }, ], "partitionColumns": [ { "name": "ds", "type": "string" }, ], "maxFileSize": 184, "inputformat": "org.apache.hadoop.hive.ql.io.RCFileInputFormat" "partitioned": true, "owner": "you", "totalFileSize": 184, "outputformat": "org.apache.hadoop.hive.ql.io.RCFileOutputFormat", "database": "default", "table": "my_table", "partition": "ds=’20110924’" }
Step 4: Read the file
This can be done via hadoop fs, the Hadoop Java API, or webhdfs.
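For example, with hadoop fs you could list and copy the partition data using the location returned in Step 3 (the path below is taken from the example response above; substitute your own):
hadoop fs -ls hdfs://localhost:9000/user/hive/warehouse/my_table/my_p=XYZ/my_q=ABC
hadoop fs -copyToLocal hdfs://localhost:9000/user/hive/warehouse/my_table/my_p=XYZ/my_q=ABC /tmp/my_partition_data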
Writing
Step 1: Load the file you wish to have as a partition into HDFS
This can be done via hadoop fs, the Hadoop Java API, or webhdfs.
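For example, with hadoop fs (the directory and data file names below are illustrative; the directory matches the location used in the curl example later in this section, and mydata.txt stands in for your data file):
hadoop fs -mkdir hdfs://nn.acme.com/user/data/mytable/20110924
hadoop fs -put mydata.txt hdfs://nn.acme.com/user/data/mytable/20110924/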
Step 2: Add the partition to the table
In the following statements, part_col is the name of the partition column, part_value is the value of that column for the partition you are adding, and file_location is where you loaded the file in the previous step.
Command line:
hcat -e "alter table <table_name> add partition (<part_col>='<part_value>') location '<file_location>';"
Web services:
URL: PUT to <templeton_host>/templeton/v1/ddl/database/<db-name>/table/<table-name>/partition/<partition-name>
Accept: application/json
Content-Type: application/json
<partition-name> will be the name of the partition. This should match Hive's partition naming scheme of key='value'. For example, for a table partitioned by the field ds where you wish to add the partition where ds=20110924, the name will be ds='20110924'. Note that the quotation marks will need to be escaped in your URL.
The location is passed in a JSON document:
{ "location": "<location>" }
Example curl command:
curl -X PUT -HContent-type:application/json -d '{"location": "hdfs://nn.acme.com/user/data/mytable/20110924"}' 'http://www.myserver.com/templeton/v1/ddl/database/default/table/test_table/partition/ds=%2720110924%27'
Example JSON Response:
{ "return-code" : “OK” }
Full JSON response document:
{ "partition": "ds='20110924'", "table": "mytable", "database": "default" }