Hive
- Data Warehousing Solution built on top of Hadoop
- Provides SQL-like query language named HiveQL
- Minimal learning curve for people with SQL expertise
- Data analysts are the target audience
- Early Hive development work started at Facebook in 2007
- Today Hive is an Apache project under Hadoop http://hive.apache.org
Hive Provides
- Ability to bring structure to various data formats
- Simple interface for ad hoc querying, analyzing, and summarizing large amounts of data
- Access to files on various data stores such as HDFS and HBase
Hive
- Hive does NOT provide low latency or real-time queries
- Even querying small amounts of data may take minutes
- Designed for scalability and ease-of-use rather than low latency responses
Translates HiveQL statements into a set of MapReduce Jobs which are then executed on a Hadoop Cluster
Hive Metastore
To support features like schemas and data partitioning, Hive keeps its metadata in a relational database
- Packaged with Derby, a lightweight embedded SQL DB
- The default Derby-based metastore is good for evaluation and testing
- The schema is not shared between users, as each user has their own instance of embedded Derby
- Stored in the metastore_db directory, which resides in the directory that Hive was started from
- Can easily switch to another SQL installation, such as MySQL
Hive Architecture
Hive Interface Options
- Command Line Interface (CLI)
Will be used exclusively in these slides
- Hive Web Interface
https://cwiki.apache.org/confluence/display/Hive/HiveWebInterface
Hive Concepts
Re-used from Relational Databases
- Database: a set of tables, used for name conflict resolution
- Table: a set of rows that have the same schema (same columns)
- Row: a single record; a set of columns
- Column: provides a value and type for a single field
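A minimal HiveQL sketch of these concepts (the database, table, and column names below are illustrative, not from the slides):
CREATE DATABASE IF NOT EXISTS blog; <---Database: a namespace that keeps table names from clashing--->
USE blog;
CREATE TABLE visits (user STRING, url STRING, time BIGINT); <---Table: rows sharing the same three columns--->
SELECT user, url FROM visits LIMIT 5; <---Each returned row is a single record; each column holds one typed value--->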
Installation Prerequisites
- Java 6, just like Hadoop
- Hadoop 0.20.x+, no surprise here
- Set the $HADOOP_HOME environment variable; was done as part of the HDFS installation
- Set $HIVE_HOME and add hive to the PATH
export HIVE_HOME=$CDH_HOME/hive-0.8.1-cdh4.0.0
export PATH=$PATH:$HIVE_HOME/bin
- Hive will store its tables on HDFS and those locations need to be bootstrapped
$ hdfs dfs -mkdir /tmp
$ hdfs dfs -mkdir /user/hive/warehouse
$ hdfs dfs -chmod g+w /tmp
$ hdfs dfs -chmod g+w /user/hive/warehouse
- Similar to other Hadoop projects, Hive's configuration is in $HIVE_HOME/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:10040</value> <---Specify the location of the ResourceManager so Hive knows where to execute MapReduce Jobs; by default Hive utilizes LocalJobRunner--->
  </property>
</configuration>
Run Hive
HDFS and YARN need to be up and running
$ hive
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201207312052_1402761030.txt
hive> <---Hive’s Interactive Command Line Interface (CLI)--->
Create a Table
Let’s create a table to store data from $PLAY_AREA/data/user-posts.txt
$ cd $PLAY_AREA
$ hive <---Launch Hive Command Line Interface (CLI)--->
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201208022144_2014345460.txt <---Location of the session’s log file--->
hive> !cat data/user-posts.txt; <---Local commands can be executed within the CLI; place a command between ! and ;--->
user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394 <---Values are separated by ','; each row represents a record: the first value is the user name, the second is the post content, and the third is a timestamp--->
hive>
hive> CREATE TABLE posts (user STRING, post STRING, time BIGINT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ',' <---Line 1 creates a table with 3 columns; lines 2 and 3 describe how the underlying file should be parsed; line 4 describes how to store the data--->
> STORED AS TEXTFILE; <---Statements must end with a semicolon and can span multiple rows--->
OK
Time taken: 10.606 seconds
hive> show tables; <---Display all of the tables--->
OK
posts <---Results are displayed between "OK" and "Time taken"--->
Time taken: 0.221 seconds
hive> describe posts; <---Display schema for posts table--->
OK
user string
post string
time bigint
Time taken: 0.212 seconds
Loading Data
Several options to start using data in Hive
- Load data from HDFS location
hive> LOAD DATA INPATH '/training/hive/user-posts.txt'
> OVERWRITE INTO TABLE posts;
The file is moved from the provided HDFS location to /user/hive/warehouse/ (or the configured location)
- Load data from the local file system
hive> LOAD DATA LOCAL INPATH 'data/user-posts.txt'
> OVERWRITE INTO TABLE posts;
The file is copied from the provided local location to /user/hive/warehouse/ (or the configured location)
Re-Use Existing HDFS Location
hive> CREATE EXTERNAL TABLE posts
> (user STRING, post STRING, time BIGINT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> LOCATION '/training/hive/'; <---Hive will read all the files under the /training/hive directory as data for the posts table--->
OK
Time taken: 0.077 seconds
hive>
Schema Violations
What would happen if we try to insert data that does not comply with the pre-defined schema?
hive> !cat data/user-posts-inconsistentFormat.txt;
user1,Funny Story,1343182026191
user2,Cool Deal,2012-01-05
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394
hive> describe posts;
OK
user string
post string <---The third column 'time' is of type bigint; Hive will not be able to convert the '2012-01-05' value--->
time bigint
Time taken: 0.289 seconds
hive> LOAD DATA LOCAL INPATH
> 'data/user-posts-inconsistentFormat.txt'
> OVERWRITE INTO TABLE posts;
OK
Time taken: 0.612 seconds
hive> select * from posts;
OK
user1 Funny Story 1343182026191
user2 Cool Deal NULL <---NULL is set for any value that violates the pre-defined schema--->
user4 Interesting Post 1343182154633
user5 Yet Another Blog 13431839394
Time taken: 0.136 seconds
hive>
Partitions
- To increase performance Hive has the capability to partition data
- The values of partitioned column divide a table into segments
- Entire partitions can be ignored at query time
- Similar to relational databases’ indexes but not as granular
- Partitions have to be properly created by users
When inserting data, a partition must be specified
- At query time, whenever appropriate, Hive will automatically filter out partitions
Creating Partitioned Table
hive> CREATE TABLE posts (user STRING, post STRING, time BIGINT)
> PARTITIONED BY(country STRING) <---Partition the table based on the value of the country column--->
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;
OK
Time taken: 0.116 seconds
hive> describe posts;
OK
user string
post string <---There is no difference in schema between "partition" columns and "data" columns--->
time bigint
country string
Time taken: 0.111 seconds
hive> show partitions posts;
OK
Time taken: 0.102 seconds
hive>
Load Data Into Partitioned Table
hive> LOAD DATA LOCAL INPATH 'data/user-posts-US.txt'
> OVERWRITE INTO TABLE posts;
FAILED: Error in semantic analysis: Need to specify partition
columns because the destination table is partitioned <---Since the posts table was defined to be partitioned any insert statement must specify the partition--->
hive> LOAD DATA LOCAL INPATH 'data/user-posts-US.txt'
> OVERWRITE INTO TABLE posts PARTITION(country='US');
OK
Time taken: 0.225 seconds
hive> LOAD DATA LOCAL INPATH 'data/user-posts-AUSTRALIA.txt'
> OVERWRITE INTO TABLE posts PARTITION(country='AUSTRALIA'); <---Each file is loaded into a separate partition; data is separated by country--->
OK
Time taken: 0.236 seconds
hive>
Partitioned Table
Partitions are physically stored under separate directories
hive> show partitions posts;
OK
country=AUSTRALIA <---There is a directory for each partition value--->
country=US
Time taken: 0.095 seconds
hive> exit;
$ hdfs dfs -ls -R /user/hive/warehouse/posts
/user/hive/warehouse/posts/country=AUSTRALIA <---There is a directory for each partition value--->
/user/hive/warehouse/posts/country=AUSTRALIA/user-posts-AUSTRALIA.txt
/user/hive/warehouse/posts/country=US
/user/hive/warehouse/posts/country=US/user-posts-US.txt
Querying Partitioned Table
- There is no difference in syntax
- When partitioned column is specified in the where clause entire directories/partitions could be ignored
hive> select * from posts where country='US' limit 10; <---Only the "country=US" partition will be queried; the "country=AUSTRALIA" partition will be ignored--->
OK
user1 Funny Story 1343182026191 US
user2 Cool Deal 1343182133839 US
user2 Great Interesting Note 13431821339485 US
user4 Interesting Post 1343182154633 US
user1 Humor is good 1343182039586 US
user2 Hi I am user #2 1343182133839 US
Time taken: 0.197 seconds
Bucketing
- Mechanism to query and examine random samples of data
- Break data into a set of buckets based on a hash function of a "bucket column"
- Capability to execute queries on a subset of random data
- Hive doesn't automatically enforce bucketing
The user is required to specify the number of buckets by setting the number of reducers
hive> set mapred.reduce.tasks = 256;
OR <---Either manually set the number of reducers to be the number of buckets, or use 'hive.enforce.bucketing' which will set it on your behalf--->
hive> set hive.enforce.bucketing = true;
Create and Use Table with Buckets
hive> CREATE TABLE post_count (user STRING, count INT)
> CLUSTERED BY (user) INTO 5 BUCKETS; <---Declare table with 5 buckets for user column--->
OK
Time taken: 0.076 seconds
hive> set hive.enforce.bucketing = true; <---The number of reducers will get set to 5--->
hive> insert overwrite table post_count
> select user, count(post) from posts group by user;
Total MapReduce jobs = 2 <---Insert data into the post_count bucketed table; the number of posts is counted for each user--->
Launching Job 1 out of 2
...
Launching Job 2 out of 2
...
OK
Time taken: 42.304 seconds
hive> exit;
$ hdfs dfs -ls -R /user/hive/warehouse/post_count/
/user/hive/warehouse/post_count/000000_0 <---A file per bucket is created; now only a sub-set of buckets can be sampled--->
/user/hive/warehouse/post_count/000001_0
/user/hive/warehouse/post_count/000002_0
/user/hive/warehouse/post_count/000003_0
/user/hive/warehouse/post_count/000004_0
Random Sample of Bucketed Table
hive> select * from post_count TABLESAMPLE(BUCKET 1 OUT OF 2); <---Sample approximately 1 for every 2 buckets--->
OK
user5 1
user1 2
Time taken: 11.758 seconds
hive>
Load Data Into a Table
hive> LOAD DATA LOCAL INPATH 'data/user-posts.txt'
> OVERWRITE INTO TABLE posts;
Copying data from file:/home/hadoop/Training/play_area/data/user-posts.txt
Copying file: file:/home/hadoop/Training/play_area/data/user-posts.txt <---Existing records in the posts table are deleted; data in user-posts.txt is loaded into Hive's posts table--->
Loading data to table default.posts
Deleted /user/hive/warehouse/posts
OK
Time taken: 5.818 seconds
hive>
$ hdfs dfs -cat /user/hive/warehouse/posts/user-posts.txt <---Under the covers Hive stores its tables in /user/hive/warehouse (unless configured differently)--->
user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394
Query Data
hive> select count (1) from posts; <---Count number of records in posts table--->
Total MapReduce jobs = 1
Launching Job 1 out of 1 <---Transformed HiveQL into 1 MapReduce Job--->
Starting Job = job_1343957512459_0004, Tracking URL =
http://localhost:8088/proxy/application_1343957512459_0004/
Kill Command = hadoop job -Dmapred.job.tracker=localhost:10040 -kill
job_1343957512459_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2012-08-02 22:37:24,962 Stage-1 map = 0%, reduce = 0%
2012-08-02 22:37:30,497 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.87 sec
2012-08-02 22:37:31,577 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.87 sec
2012-08-02 22:37:32,664 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.64 sec
MapReduce Total cumulative CPU time: 2 seconds 640 msec
Ended Job = job_1343957512459_0004
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Accumulative CPU: 2.64 sec HDFS Read: 0 HDFS Write: 0
SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 640 msec
OK
4 <---Result is 4 records--->
Time taken: 14.204 seconds
hive> select * from posts where user="user2";
...
...
OK
user2 Cool Deal 1343182133839
Time taken: 12.184 seconds
hive> select * from posts where time<=1343182133839 limit 2; <---Usually there are too many results to display, so one can use the limit clause to bound the output--->
...
...
OK
user1 Funny Story 1343182026191
user2 Cool Deal 1343182133839
Time taken: 12.003 seconds
hive>
Drop the Table
hive> DROP TABLE posts; <---Remove the table; use with caution--->
OK
Time taken: 2.182 seconds
hive> exit;
$ hdfs dfs -ls /user/hive/warehouse/
$ <---If Hive was managing the underlying files then they will be removed--->
Joins
- Joins in Hive are trivial
- Supports left, right, and full outer joins
- Can join multiple tables
- Default Join is Inner Join
- Rows are joined where the keys match
- Rows that do not have matches are not included in the result
Simple Inner Join
Let’s say we have 2 tables: posts and likes
hive> select * from posts limit 10;
OK
user1 Funny story 1343182026191
user2 Cool story 1343182133839
user4 Interesting Post 1343182154633
user5 Yet Another Blog 1343183939434 <---We want to join these 2 data-sets and produce a single table that contains user, post and count of likes--->
Time taken: 0.108 seconds
hive> select * from likes limit 10;
OK
user1 12 1343182026191
user2 7 1343182139394
user3 0 1343182154633 <---We want to join these 2 data-sets and produce a single table that contains user, post and count of likes--->
user4 50 1343182147364
Time taken: 0.103 seconds
hive> CREATE TABLE posts_likes (user STRING, post STRING, likes_count INT);
OK
Time taken: 0.06 seconds
Simple Inner Join
hive> INSERT OVERWRITE TABLE posts_likes
> SELECT p.user, p.post, l.count
> FROM posts p JOIN likes l ON (p.user = l.user); <---Two tables are joined based on user column; 3 columns are selected and stored in posts_likes table--->
OK
Time taken: 17.901 seconds
hive> select * from posts_likes limit 10;
OK
user1 Funny story 12
user2 Cool story 7
user4 Interesting Post 50
Time taken: 0.082 seconds
hive>
Outer Join
Rows which will not join with the ‘other’ table are still included in the result
Left Outer
Rows from the first table are included whether they have a match or not. Columns from the unmatched (second) table are set to null.
The following query demonstrates LEFT OUTER JOIN between CUSTOMER and ORDER tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
Right Outer
The opposite of Left Outer Join: Rows from the second table are included no matter what. Columns from the unmatched (first) table are set to null.
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
Full Outer
Rows from both sides are included. For unmatched rows the columns from the ‘other’ table are set to null.
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
Functions
This chapter explains the built-in functions available in Hive. These functions are quite similar to standard SQL functions.
Built-In Functions
Hive supports the following built-in functions:
Return Type | Signature | Description |
---|---|---|
BIGINT | round(double a) | It returns the rounded BIGINT value of the double. |
BIGINT | floor(double a) | It returns the maximum BIGINT value that is equal or less than the double. |
BIGINT | ceil(double a) | It returns the minimum BIGINT value that is equal or greater than the double. |
double | rand(), rand(int seed) | It returns a random number that changes from row to row. |
string | concat(string A, string B,...) | It returns the string resulting from concatenating B after A. |
string | substr(string A, int start) | It returns the substring of A starting from start position till the end of string A. |
string | substr(string A, int start, int length) | It returns the substring of A starting from start position with the given length. |
string | upper(string A) | It returns the string resulting from converting all characters of A to upper case. |
string | ucase(string A) | Same as above. |
string | lower(string A) | It returns the string resulting from converting all characters of A to lower case. |
string | lcase(string A) | Same as above. |
string | trim(string A) | It returns the string resulting from trimming spaces from both ends of A. |
string | ltrim(string A) | It returns the string resulting from trimming spaces from the beginning (left hand side) of A. |
string | rtrim(string A) | It returns the string resulting from trimming spaces from the end (right hand side) of A. |
string | regexp_replace(string A, string B, string C) | It returns the string resulting from replacing all substrings in A that match the Java regular expression B with C. |
int | size(Map<K,V>) | It returns the number of elements in the map type. |
int | size(Array<T>) | It returns the number of elements in the array type. |
value of <type> | cast(<expr> as <type>) | It converts the results of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. A NULL is returned if the conversion does not succeed. |
string | from_unixtime(int unixtime) | convert the number of seconds from Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00" |
string | to_date(string timestamp) | It returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" |
int | year(string date) | It returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970 |
int | month(string date) | It returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11 |
int | day(string date) | It returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1 |
string | get_json_object(string json_string, string path) | It extracts json object from a json string based on json path specified, and returns json string of the extracted json object. It returns NULL if the input json string is invalid. |
Example
The following queries demonstrate some built-in functions:
round() function
hive> SELECT round(2.6) from temp;
On successful execution of query, you get to see the following response:
3.0
floor() function
hive> SELECT floor(2.6) from temp;
On successful execution of the query, you get to see the following response:
2.0
ceil() function
hive> SELECT ceil(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0
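The string and date functions from the table work the same way. The following is a hedged sketch that reuses the same temp placeholder table; the literal values were chosen only for illustration.
concat() function
hive> SELECT concat('Hello, ', 'Hive') from temp;
On successful execution of the query, you get to see the following response:
Hello, Hive
to_date() function
hive> SELECT to_date('1970-01-01 00:00:00') from temp;
On successful execution of the query, you get to see the following response:
1970-01-01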
Aggregate Functions
Hive supports the following built-in aggregate functions. The usage of these functions is the same as that of SQL aggregate functions.
Return Type | Signature | Description |
---|---|---|
BIGINT | count(*), count(expr) | count(*) returns the total number of retrieved rows; count(expr) returns the number of rows for which expr is not NULL. |
DOUBLE | sum(col), sum(DISTINCT col) | It returns the sum of the elements in the group or the sum of the distinct values of the column in the group. |
DOUBLE | avg(col), avg(DISTINCT col) | It returns the average of the elements in the group or the average of the distinct values of the column in the group. |
DOUBLE | min(col) | It returns the minimum value of the column in the group. |
DOUBLE | max(col) | It returns the maximum value of the column in the group. |
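As a hedged illustration (the posts table from the earlier slides is assumed; actual results depend on the data loaded), the aggregates are used just like in SQL:
hive> SELECT count(*), min(time), max(time) FROM posts;
hive> SELECT user, count(post) FROM posts GROUP BY user;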
View and Indexes
This chapter describes how to create and manage views. Views are generated based on user requirements; you can save any result set as a view. The usage of a view in Hive is the same as that of a view in SQL; it is a standard RDBMS concept. Views in Hive are read-only: they can be queried like tables, but they cannot be the target of INSERT or LOAD statements.
Creating a View
You can create a view at the time of executing a SELECT statement. The syntax is as follows:
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...
Example
Let us take an example for view. Assume employee table as given below, with the fields Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details who earn a salary of more than Rs 30000. We store the result in a view named emp_30000.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:
hive> CREATE VIEW emp_30000 AS
SELECT * FROM employee
WHERE salary>30000;
Dropping a View
Use the following syntax to drop a view:
DROP VIEW view_name
The following query drops a view named as emp_30000:
hive> DROP VIEW emp_30000;
Creating an Index
An index is a pointer on a particular column of a table. Its syntax is as follows:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Example
Let us take an example for index. Use the same employee table that we have used earlier with the fields Id, Name, Salary, Designation, and Dept. Create an index named index_salary on the salary column of the employee table.
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
It is a pointer to the salary column. If the column data is modified, the index must be rebuilt (ALTER INDEX index_salary ON employee REBUILD) to reflect the changes.
Dropping an Index
The following syntax is used to drop an index:
DROP INDEX <index_name> ON <table_name>
The following query drops an index named index_salary:
hive> DROP INDEX index_salary ON employee;
Select-Where
The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data whose metadata is kept in the Metastore. This chapter explains how to use the SELECT statement with a WHERE clause.
The SELECT statement is used to retrieve data from a table. The WHERE clause acts as a filter: it restricts the result to rows for which the condition, built from the built-in operators and functions, evaluates to true.
Syntax
Given below is the syntax of the SELECT query:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
Example
Let us take an example for SELECT…WHERE clause. Assume we have the employee table as given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details who earn a salary of more than Rs 30000.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:
hive> SELECT * FROM employee WHERE salary>30000;
On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+
JDBC Program
The JDBC program to apply where clause for the given example is as follows.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveQLWhere {
private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
public static void main(String[] args) throws SQLException, ClassNotFoundException {
// Register driver and create driver instance
Class.forName(driverName);
// get connection
Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");
// create statement
Statement stmt = con.createStatement();
// execute statement
ResultSet res = stmt.executeQuery("SELECT * FROM employee WHERE salary>30000");
System.out.println("Result:");
System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");
while (res.next()) {
System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " " + res.getString(4) + " " + res.getString(5));
}
con.close();
}
}
Save the program in a file named HiveQLWhere.java. Use the following commands to compile and execute this program.
$ javac HiveQLWhere.java
$ java HiveQLWhere
Output:
ID Name Salary Designation Dept
1201 Gopal 45000 Technical manager TP
1202 Manisha 45000 Proofreader PR
1203 Masthanvali 40000 Technical writer TP
1204 Krian 40000 Hr Admin HR
Select-Group By
This chapter explains the details of the GROUP BY clause in a SELECT statement. The GROUP BY clause is used to group all the records in a result set using a particular column. It is used to query a group of records.
Syntax
The syntax of GROUP BY clause is as follows:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
Example
Let us take an example of SELECT…GROUP BY clause. Assume employee table as given below, with Id, Name, Salary, Designation, and Dept fields. Generate a query to retrieve the number of employees in each department.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 45000 | Proofreader | PR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario.
hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;
On successful execution of the query, you get to see the following response:
+------+--------------+
| Dept | Count(*) |
+------+--------------+
|Admin | 1 |
|PR | 2 |
|TP | 2 |
+------+--------------+
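The HAVING clause shown in the syntax above filters the groups themselves. The following is a hedged sketch on the same employee table, keeping only departments with more than one employee:
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept HAVING count(*) > 1;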
JDBC Program
Given below is the JDBC program to apply the Group By clause for the given example.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveQLGroupBy {
private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
public static void main(String[] args) throws SQLException, ClassNotFoundException {
// Register driver and create driver instance
Class.forName(driverName);
// get connection
Connection con = DriverManager.
getConnection("jdbc:hive://localhost:10000/userdb", "", "");
// create statement
Statement stmt = con.createStatement();
// execute statement
ResultSet res = stmt.executeQuery("SELECT Dept, count(*) " + "FROM employee GROUP BY Dept");
System.out.println(" Dept \t count(*)");
while (res.next()) {
System.out.println(res.getString(1) + " " + res.getInt(2));
}
con.close();
}
}
Save the program in a file named HiveQLGroupBy.java. Use the following commands to compile and execute this program.
$ javac HiveQLGroupBy.java
$ java HiveQLGroupBy
Output:
Dept Count(*)
Admin 1
PR 2
TP 2
Hive Interview Questions
There are two types: managed tables and external tables. In a managed table both the data and the schema are under the control of Hive, but in an external table only the schema is under the control of Hive.
No, Hive does not provide insert and update at the row level, so it is not suitable for OLTP systems.
ALTER TABLE table_name RENAME TO new_name
Using the REPLACE COLUMNS option
ALTER TABLE table_name REPLACE COLUMNS ……
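A hedged, complete example of the REPLACE COLUMNS form (the posts table and column list are illustrative); REPLACE COLUMNS drops every column not named in the new list:
hive> ALTER TABLE posts REPLACE COLUMNS (user STRING, post STRING);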
It is a relational database storing the metadata of Hive tables, partitions, Hive databases, etc.
Depending on the nature of data the user has, the inbuilt SerDe may not satisfy the format of the data, so users need to write their own Java code to satisfy their data format requirements.
Hive is a tool in the Hadoop ecosystem which provides an interface to organize and query data in a database-like fashion and write SQL-like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax.
hdfs://namenode_server/user/hive/warehouse
- Local mode
- Distributed mode
- Pseudo-distributed mode
Yes. The TIMESTAMP data type stores dates in the java.sql.Timestamp format.
There are three collection data types in Hive.
- ARRAY
- MAP
- STRUCT
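A hedged sketch of a table declaring all three collection types (the table name, column names, and delimiters are illustrative):
hive> CREATE TABLE users (name STRING,
    >   emails ARRAY<STRING>,
    >   properties MAP<STRING, STRING>,
    >   address STRUCT<street:STRING, city:STRING>)
    > ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY ','
    >   COLLECTION ITEMS TERMINATED BY '|'
    >   MAP KEYS TERMINATED BY ':';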
Yes, using the ! mark just before the command.
For example !pwd at hive prompt will list the current directory.
A Hive variable is a variable created in the Hive environment that can be referenced by Hive scripts. It is used to pass values to Hive queries when the query starts executing.
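A hedged example of defining and referencing such a variable (the variable name and value are illustrative):
$ hive --hivevar min_time=1343182133839
hive> SELECT * FROM posts WHERE time >= ${hivevar:min_time};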
Using the source command.
Example −
hive> source /path/to/file/file_with_query.hql
It is a file containing a list of commands that need to run when the Hive CLI starts, for example setting strict mode to true.
The default record delimiter is \n
and the field delimiters are \001, \002, and \003
The schema is validated with the data when reading the data and is not enforced when writing data (schema on read).
SHOW DATABASES LIKE 'p.*'
With the use command you fix the database on which all the subsequent hive queries will run.
There is no way you can delete the DBPROPERTY.
It sets the MapReduce jobs to strict mode, by which queries on partitioned tables cannot run without a WHERE clause on the partition column. This prevents very large jobs running for a long time.
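A hedged illustration using the partitioned posts table from the earlier slides:
hive> set hive.mapred.mode=strict;
hive> select * from posts; <---rejected in strict mode: no partition filter--->
hive> select * from posts where country='US'; <---allowed: the WHERE clause prunes partitions--->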
This can be done with following query
SHOW PARTITIONS table_name PARTITION(partitioned_column='partition_value')
org.apache.hadoop.mapred.TextInputFormat
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
When we issue the command DROP TABLE IF EXISTS table_name,
Hive does not throw an error even if the table being dropped does not exist in the first place (without IF EXISTS it would).
The data stays in the old location. It has to be moved manually.
ALTER TABLE table_name
CHANGE COLUMN new_col new_col INT
AFTER x_col
(Hive positions columns with FIRST or AFTER col_name; both the old and new column names must be given, and there is no BEFORE.)
No. It only reduces the number of files, which becomes easier for the NameNode to manage.
By using the ENABLE OFFLINE clause with the ALTER TABLE statement.
By omitting the LOCAL clause in the LOAD DATA statement.
The new incoming files are just added to the target directory, and existing files with the same names are simply overwritten. Other files whose names do not match any of the incoming files will continue to exist.
If you add the OVERWRITE clause then all the existing data in the directory will be deleted before new data is written.
It creates partitions on the table employees with partition values coming from the columns in the SELECT clause. It is called a dynamic partition insert.
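A hedged sketch of such an insert (the staged_employees source table and column layouts are illustrative); the two properties usually have to be enabled first:
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT OVERWRITE TABLE employees PARTITION (country)
    > SELECT name, salary, country FROM staged_employees; <---the last selected column feeds the partition values--->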
A table-generating function is a function which takes a single column as argument and expands it into multiple columns or rows, for example explode().
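A hedged one-liner showing explode() turning an array literal into rows (the posts table is used only to provide a FROM clause):
hive> SELECT explode(array(1, 2, 3)) AS n FROM posts LIMIT 3;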
If we set the property hive.exec.mode.local.auto to true then Hive will avoid launching MapReduce jobs on the cluster and fetch the results of small queries locally.
The LIKE operator behaves the same way as the regular SQL LIKE operator used in select queries. Example −
street_name LIKE '%Chi'
But the RLIKE operator uses the more advanced regular expressions which are available in Java.
Example − street_name RLIKE '.*(Chi|Oho).*', which will select any value that has either Chi or Oho in it.
No, as this kind of join cannot be implemented in MapReduce.
In a join query, the smallest table should be taken in the first position and the largest table in the last position.
It controls how the map output is distributed among the reducers. It is useful in the case of streaming data.
SELECT cast(price AS FLOAT)
Hive will return NULL.
No. The name of a view must be unique when compared to all other tables and views present in the same database.
No. A view cannot be the target of an INSERT or LOAD statement.
Indexes occupy space and there is a processing cost in arranging the values of the column on which the index is created.
SHOW INDEX ON table_name
This will list all the indexes created on any of the columns in the table table_name.
The values in a column are hashed into a number of buckets defined by the user. It is a way to avoid too many partitions or nested partitions while still ensuring optimized query performance.
It is a query hint that tells Hive which table of a join to stream through the reducers (instead of buffering it in memory). It is a query optimization technique.
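A hedged example on the posts/likes join used earlier; the hint names the table to stream:
hive> SELECT /*+ STREAMTABLE(p) */ p.user, p.post, l.count
    > FROM posts p JOIN likes l ON (p.user = l.user);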
Yes. A partition can be archived. The advantage is that it decreases the number of files stored in the NameNode, and the archived file can still be queried using Hive. The disadvantage is that it will cause less efficient queries and does not offer any space savings.
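A hedged sketch of archiving and un-archiving a partition of the posts table from the earlier slides (assumes hive.archive.enabled is turned on):
hive> set hive.archive.enabled=true;
hive> ALTER TABLE posts ARCHIVE PARTITION (country='US');
hive> ALTER TABLE posts UNARCHIVE PARTITION (country='US');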
It is a UDF which is created using a Java program to serve some specific need not covered by the existing functions in Hive. It can detect the type of its input arguments programmatically and provide an appropriate response.
The local inpath should contain a file and not a directory. The ${env:HOME} is a valid variable available in the Hive environment.
The TBLPROPERTIES clause is used to add the creator name while creating a table.
The TBLPROPERTIES is added like −
TBLPROPERTIES('creator' = 'Joan')