21 Apr 2016

Hive Tutorial

Hive

  • Data Warehousing Solution built on top of Hadoop
  • Provides SQL-like query language named HiveQL
    • Minimal learning curve for people with SQL expertise
    • Data analysts are target audience
  • Early Hive development work started at Facebook in 2007
  • Today Hive is an Apache project under Hadoop http://hive.apache.org

Hive Provides

  • Ability to bring structure to various data formats
  • Simple interface for ad hoc querying, analyzing, and summarizing large amounts of data
  • Access to files on various data stores such as HDFS and HBase

Hive

  • Hive does NOT provide low latency or real-time queries
  • Even querying small amounts of data may take minutes
  • Designed for scalability and ease-of-use rather than low latency responses
  • Translates HiveQL statements into a set of MapReduce jobs which are then executed on a Hadoop cluster

Hive Metastore

To support features like schemas and data partitioning, Hive keeps its metadata in a relational database
  • Packaged with Derby, a lightweight embedded SQL DB
    • The default Derby-based metastore is good for evaluation and testing
    • The schema is not shared between users, as each user has their own instance of embedded Derby
    • It is stored in a metastore_db directory, which resides in the directory that Hive was started from
  • Can easily switch to another SQL installation, such as MySQL (see the sketch below)
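
Switching the metastore to MySQL mostly amounts to pointing Hive at a JDBC connection in hive-site.xml. The snippet below is a minimal sketch of that switch; the host name, database name, and credentials are placeholders, not values from these slides:

    <configuration>
      <!-- Placeholder host, database name, and credentials; adjust for your environment -->
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://dbhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hiveuser</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hivepassword</value>
      </property>
    </configuration>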

Hive Architecture


Hive Interface Options

  • Command Line Interface (CLI)
    The CLI is used exclusively in these slides
  • Hive Web Interface
    https://cwiki.apache.org/confluence/display/Hive/HiveWebInterface

Hive Concepts

Re-used from relational databases (see the sketch below):
  • Database: a set of tables, used to resolve naming conflicts
  • Table: a set of rows that share the same schema (the same columns)
  • Row: a single record; a set of columns
  • Column: a single typed value within a row
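
A minimal HiveQL sketch of these concepts; the database, table, and column names here are illustrative and not part of these slides:

hive> CREATE DATABASE sales_db;                                  <---Database: a namespace that prevents table-name conflicts--->
hive> USE sales_db;
hive> CREATE TABLE orders (id INT, item STRING, price DOUBLE);   <---Table: rows that share the same columns--->
hive> SELECT id, item FROM orders LIMIT 5;                       <---Each returned row is a record; id and item are typed columns--->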

Installation Prerequisites

  • Java 6 (just like Hadoop)
  • Hadoop 0.20.x+ (no surprise here)
  • Set the $HADOOP_HOME environment variable (done as part of the HDFS installation)
  • Set $HIVE_HOME and add hive to the PATH
    export HIVE_HOME=$CDH_HOME/hive-0.8.1-cdh4.0.0
    export PATH=$PATH:$HIVE_HOME/bin
  • Hive will store its tables on HDFS; those locations need to be bootstrapped:
    $ hdfs dfs -mkdir /tmp
    $ hdfs dfs -mkdir /user/hive/warehouse
    $ hdfs dfs -chmod g+w /tmp
    $ hdfs dfs -chmod g+w /user/hive/warehouse
    
  • Similar to other Hadoop projects, Hive's configuration is in $HIVE_HOME/conf/hive-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:10040</value>   <---Specify the location of the ResourceManager so Hive knows where to execute MapReduce jobs; by default Hive utilizes LocalJobRunner--->
      </property>
    </configuration>
    

Run Hive

HDFS and YARN need to be up and running

$ hive
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201207312052_1402761030.txt
hive>     <---Hive's interactive Command Line Interface (CLI)--->

Create a Table

Let’s create a table to store data from $PLAY_AREA/data/user-posts.txt
$ cd $PLAY_AREA
$ hive           <---Launch Hive Command Line Interface (CLI)--->
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201208022144_2014345460.txt  <---Location of the session’s log file--->
hive> !cat data/user-posts.txt;     <---Local commands can be executed within the CLI; place the command between ! and ;--->

user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394    <---Values are separated by ','; each row represents a record; the first value is the user name, the second is the post content, and the third is a timestamp--->

hive>
hive>  CREATE  TABLE  posts  (user  STRING,  post  STRING,  time  BIGINT)
>  ROW  FORMAT  DELIMITED
>  FIELDS  TERMINATED  BY  ','     <---Line 1 creates a table with 3 columns; lines 2 and 3 describe how the underlying file should be parsed; line 4 specifies how the data is stored--->
>  STORED  AS  TEXTFILE;      <---Statements must end with a semicolon and can span multiple rows--->

OK
Time  taken:  10.606  seconds
hive>  show  tables;      <---Display all of the tables--->
OK
posts          <---The result is displayed between "OK" and "Time taken"--->
Time  taken:  0.221  seconds
hive>  describe  posts;      <---Display schema for posts table--->
OK
user   string
post   string
time   bigint
Time  taken:  0.212  seconds

Loading Data

Several options to start using data in Hive
  • Load data from HDFS location
  • hive> LOAD DATA INPATH '/training/hive/user-posts.txt'
    > OVERWRITE INTO TABLE posts;
    
    File is moved from the provided HDFS location to /user/hive/warehouse/ (or the configured warehouse location)

  • Load data from a local file system
  • hive> LOAD DATA LOCAL INPATH 'data/user-posts.txt'
    > OVERWRITE INTO TABLE posts;
    
    File is copied from the provided location to /user/hive/warehouse/(or configured location)

  • Utilize an existing location on HDFS: just point to the existing location when creating the table (see the next section)

Re-Use Existing HDFS Location

hive>  CREATE  EXTERNAL  TABLE  posts
>  (user  STRING,  post  STRING,  time  BIGINT)
>  ROW  FORMAT  DELIMITED
>  FIELDS  TERMINATED  BY  ','
>  STORED  AS  TEXTFILE
>  LOCATION  '/training/hive/';    <---Hive will load all the files under the /training/hive directory into the posts table--->
OK
Time  taken:  0.077  seconds
hive>

Schema Violations

What would happen if we try to insert data that does not comply with the pre-defined schema?
hive>  !cat  data/user-posts-inconsistentFormat.txt;
user1,Funny  Story,1343182026191
user2,Cool  Deal,2012-01-05
user4,Interesting  Post,1343182154633
user5,Yet  Another  Blog,13431839394
hive>  describe  posts;
OK
user   string
post   string
time   bigint       <---The third column 'time' is of type bigint; Hive will not be able to convert the value '2012-01-05'--->
Time  taken:  0.289  seconds
hive> LOAD DATA LOCAL INPATH
> 'data/user-posts-inconsistentFormat.txt'
> OVERWRITE INTO TABLE posts;
OK
Time taken: 0.612 seconds
hive> select * from posts;
OK
user1 Funny Story        1343182026191
user2 Cool Deal          NULL    <---NULL is set for any value that violates the pre-defined schema--->
user4 Interesting Post   1343182154633
user5 Yet Another Blog   13431839394
Time taken: 0.136 seconds
hive>


Partitions

  • To increase performance, Hive can partition data
    • The values of a partitioned column divide a table into segments
    • Entire partitions can be ignored at query time
    • Similar to relational databases' indexes, but not as granular
  • Partitions have to be properly created by users
    When inserting data, a partition must be specified
  • At query time, whenever appropriate, Hive will automatically filter out partitions

Creating Partitioned Table

hive>  CREATE  TABLE  posts  (user  STRING,  post  STRING,  time  BIGINT)
>  PARTITIONED  BY(country  STRING)    <---Partition table based on the value of a country.--->
>  ROW  FORMAT  DELIMITED
>  FIELDS  TERMINATED  BY  ','
>  STORED  AS  TEXTFILE;
OK
Time  taken:  0.116  seconds
hive>  describe  posts;
OK
user      string
post      string
time      bigint
country   string  <---There is no difference in schema between "partition" columns and "data" columns--->
Time  taken:  0.111  seconds
hive>  show  partitions  posts;
OK
Time  taken:  0.102  seconds
hive>

Load Data Into Partitioned Table

hive>  LOAD  DATA  LOCAL  INPATH  'data/user-posts-US.txt'
>  OVERWRITE  INTO  TABLE  posts;
FAILED:  Error  in  semantic  analysis:  Need  to  specify  partition
columns  because  the  destination  table  is  partitioned  <---Since the posts table was defined to be partitioned any insert statement must specify the partition--->
hive>  LOAD  DATA  LOCAL  INPATH  'data/user-posts-US.txt'
>  OVERWRITE  INTO  TABLE  posts  PARTITION(country='US');
OK
Time  taken:  0.225  seconds
hive>  LOAD  DATA  LOCAL  INPATH  'data/user-posts-AUSTRALIA.txt'
>  OVERWRITE  INTO  TABLE  posts  PARTITION(country='AUSTRALIA'); <---Each file is loaded into separate partition;data is separated by country--->
OK
Time  taken:  0.236  seconds
hive>

Partitioned Table

Partitions are physically stored under separate directories
hive> show partitions posts;
OK
country=AUSTRALIA     <---There  is  a  directory  for each  partition  value--->
country=US
Time taken: 0.095 seconds
hive> exit;
$ hdfs dfs -ls -R /user/hive/warehouse/posts
/user/hive/warehouse/posts/country=AUSTRALIA <---There  is  a  directory  for each  partition  value--->
/user/hive/warehouse/posts/country=AUSTRALIA/user-posts-AUSTRALIA.txt
/user/hive/warehouse/posts/country=US
/user/hive/warehouse/posts/country=US/user-posts-US.txt

Querying Partitioned Table

  • There is no difference in syntax
  • When a partition column is specified in the WHERE clause, entire directories/partitions can be skipped
hive>  select  *  from  posts  where  country='US'  limit  10;    <---Only the "country=US" partition will be queried; the "country=AUSTRALIA" partition will be ignored--->
OK
user1  Funny  Story  1343182026191     US
user2  Cool   Deal  1343182133839     US
user2  Great  Interesting  Note   13431821339485 US
user4  Interesting  Post   1343182154633   US
user1  Humor is good    1343182039586   US
user2  Hi I am user #2    1343182133839   US
Time  taken:  0.197  seconds

Bucketing

  • Mechanism to query and examine random samples of data
  • Break data into a set of buckets based on a hash function of a "bucket column"
    Capability to execute queries on a sub-set of random data
  • Hive doesn't automatically enforce bucketing
    The user is required to set the number of reducers to match the number of buckets
hive> set mapred.reduce.tasks = 256;
OR    <---Either manually set the # of reducers to be the number of buckets or use 'hive.enforce.bucketing', which will set it on your behalf--->
hive> set hive.enforce.bucketing = true;

Create and Use Table with Buckets

hive>  CREATE  TABLE  post_count  (user  STRING,  count  INT)
>  CLUSTERED  BY  (user)  INTO  5  BUCKETS;    <---Declare table with 5 buckets for user column--->

OK
Time  taken:  0.076  seconds
hive>  set  hive.enforce.bucketing  =  true;  <---The # of reducers will get set to 5--->
hive>  insert  overwrite  table  post_count
>  select  user,  count(post)  from  posts  group  by  user;
Total  MapReduce  jobs  =  2 <---Insert data into post_count bucketed table; number of posts are counted up for each user--->

Launching  Job  1  out  of  2
...
Launching  Job  2  out  of  2
...
OK
Time  taken:  42.304  seconds
hive>  exit;
$  hdfs  dfs  -ls  -R  /user/hive/warehouse/post_count/
/user/hive/warehouse/post_count/000000_0 <---A file per bucket is created; now only a sub-set of buckets can be sampled--->
/user/hive/warehouse/post_count/000001_0
/user/hive/warehouse/post_count/000002_0
/user/hive/warehouse/post_count/000003_0
/user/hive/warehouse/post_count/000004_0

Random Sample of Bucketed Table

hive>  select  *  from  post_count  TABLESAMPLE(BUCKET  1  OUT  OF  2); <---Sample approximately 1 for every 2 buckets--->
OK
user5   1
user1   2
Time  taken:  11.758  seconds
hive>

Load Data Into a Table

hive> LOAD DATA LOCAL INPATH 'data/user-posts.txt'
> OVERWRITE INTO TABLE posts;
Copying data from file:/home/hadoop/Training/play_area/data/user-posts.txt
Copying file: file:/home/hadoop/Training/play_area/data/user-posts.txt <---Existing records in the posts table are deleted; data in user-posts.txt is loaded into Hive's posts table--->
Loading data to table default.posts
Deleted /user/hive/warehouse/posts
OK
Time taken: 5.818 seconds
hive>

$  hdfs  dfs  -cat  /user/hive/warehouse/posts/user-posts.txt  <---Under the covers Hive stores its tables in /user/hive/warehouse (unless configured differently)--->

user1,Funny  Story,1343182026191
user2,Cool  Deal,1343182133839
user4,Interesting  Post,1343182154633
user5,Yet  Another  Blog,13431839394

Query Data

hive>  select  count  (1)  from  posts;    <---Count  number  of  records  in  posts  table--->
Total  MapReduce  jobs  =  1
Launching  Job  1  out  of  1     <---Transformed  HiveQL  into  1  MapReduce  Job--->

Starting  Job  =  job_1343957512459_0004,  Tracking  URL  =
http://localhost:8088/proxy/application_1343957512459_0004/
Kill  Command  =  hadoop  job    -Dmapred.job.tracker=localhost:10040  -kill
job_1343957512459_0004
Hadoop  job  information  for  Stage-1:  number  of  mappers:  1;  number  of  reducers:  1
2012-08-02  22:37:24,962  Stage-1  map  =  0%,    reduce  =  0%
2012-08-02  22:37:30,497  Stage-1  map  =  100%,    reduce  =  0%,  Cumulative  CPU  0.87  sec
2012-08-02  22:37:31,577  Stage-1  map  =  100%,    reduce  =  0%,  Cumulative  CPU  0.87  sec
2012-08-02  22:37:32,664  Stage-1  map  =  100%,    reduce  =  100%,  Cumulative  CPU  2.64  sec
MapReduce  Total  cumulative  CPU  time:  2  seconds  640  msec
Ended  Job  =  job_1343957512459_0004
MapReduce  Jobs  Launched:
Job  0:  Map:  1    Reduce:  1      Accumulative  CPU:  2.64  sec      HDFS  Read:  0  HDFS  Write:  0
SUCCESS
Total  MapReduce  CPU  Time  Spent:  2  seconds  640  msec
OK
4         <---Result is 4 records--->
Time  taken:  14.204  seconds
hive> select * from posts where user="user2";
...
...
OK
user2 Cool Deal   1343182133839
Time taken: 12.184 seconds

hive> select * from posts where time<=1343182133839 limit 2; <---Usually there are too many results to display, so one can use the LIMIT clause to bound the output--->
...
...
OK
user1 Funny Story  1343182026191
user2 Cool Deal  1343182133839
Time taken: 12.003 seconds
hive>

Drop the Table

hive>  DROP  TABLE  posts;   <---Remove the table; use with caution--->
OK
Time  taken:  2.182  seconds
hive>  exit;
$  hdfs  dfs  -ls  /user/hive/warehouse/
$      <---If Hive was managing the underlying files, they will be removed--->

Joins

  • Joins in Hive are trivial
  • Supports outer joins: left, right, and full
  • Can join multiple tables (see the sketch after this list)
  • Default Join is Inner Join
  • Rows are joined where the keys match
  • Rows that do not have matches are not included in the result
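
Multi-table joins use the same syntax. The sketch below assumes a hypothetical comments table (with user and comment columns) that is not part of these slides, alongside the posts and likes tables used in the next sections:

hive> SELECT p.user, p.post, l.count, c.comment
    > FROM posts p
    > JOIN likes l ON (p.user = l.user)
    > JOIN comments c ON (p.user = c.user);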

Simple Inner Join

Let’s say we have 2 tables: posts and likes
hive> select * from posts limit 10;
OK
user1 Funny story       1343182026191
user2 Cool story        1343182133839
user4 Interesting Post  1343182154633
user5 Yet Another Blog  1343183939434 <---We want to join these 2 data-sets and produce a single table that contains user, post and count of likes--->

Time taken: 0.108 seconds
hive> select * from likes limit 10;
OK
user1  12  1343182026191
user2  7  1343182139394
user3  0  1343182154633   <---We want to join these 2 data-sets and produce a single table that contains user, post and count of likes--->
user4  50  1343182147364
Time taken: 0.103 seconds
hive> CREATE TABLE posts_likes (user STRING, post STRING, likes_count INT);
OK
Time taken: 0.06 seconds

Simple Inner Join

hive>  INSERT  OVERWRITE  TABLE  posts_likes
>  SELECT  p.user,  p.post,  l.count
>  FROM  posts  p  JOIN  likes  l  ON  (p.user  =  l.user); <---Two tables are joined based on user column; 3 columns are selected and stored in posts_likes table--->

OK
Time  taken:  17.901  seconds
hive>  select  *  from  posts_likes  limit  10;
OK
user1 Funny story   12
user2 Cool story   7
user4  Interesting  Post   50
Time  taken:  0.082  seconds
hive>

Outer Join

Rows which will not join with the ‘other’ table are still included in the result
Left Outer
Rows from the first table are included whether they have a match or not; columns from the unmatched (second) table are set to NULL.
Left  Outer Joins
The following query demonstrates LEFT OUTER JOIN between CUSTOMER and ORDER tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE 
FROM CUSTOMERS c 
LEFT OUTER JOIN ORDERS o 
ON (c.ID = o.CUSTOMER_ID);

Right Outer
The opposite of Left Outer Join: Rows from the second table are included no matter what. Columns from the unmatched (first) table are set to null.
Right Outer Joins
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE 
FROM CUSTOMERS c 
RIGHT OUTER JOIN ORDERS o 
ON (c.ID = o.CUSTOMER_ID);

Full Outer
Rows from both sides are included. For unmatched rows the columns from the ‘other’ table are set to null.
Full Outer Joins
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE 
FROM CUSTOMERS c 
FULL OUTER JOIN ORDERS o 
ON (c.ID = o.CUSTOMER_ID);

Functions

This chapter explains the built-in functions available in Hive. These functions look and behave much like their SQL counterparts.

Built-In Functions

Hive supports the following built-in functions:
Return type, signature, and description:
  • BIGINT round(double a): Returns the rounded BIGINT value of the double.
  • BIGINT floor(double a): Returns the maximum BIGINT value that is equal to or less than the double.
  • BIGINT ceil(double a): Returns the minimum BIGINT value that is equal to or greater than the double.
  • double rand(), rand(int seed): Returns a random number that changes from row to row.
  • string concat(string A, string B, ...): Returns the string resulting from concatenating B after A.
  • string substr(string A, int start): Returns the substring of A starting from the start position until the end of string A.
  • string substr(string A, int start, int length): Returns the substring of A starting from the start position with the given length.
  • string upper(string A): Returns the string resulting from converting all characters of A to upper case.
  • string ucase(string A): Same as above.
  • string lower(string A): Returns the string resulting from converting all characters of A to lower case.
  • string lcase(string A): Same as above.
  • string trim(string A): Returns the string resulting from trimming spaces from both ends of A.
  • string ltrim(string A): Returns the string resulting from trimming spaces from the beginning (left-hand side) of A.
  • string rtrim(string A): Returns the string resulting from trimming spaces from the end (right-hand side) of A.
  • string regexp_replace(string A, string B, string C): Returns the string resulting from replacing all substrings of A that match the Java regular expression B with C.
  • int size(Map<K,V>): Returns the number of elements in the map type.
  • int size(Array<T>): Returns the number of elements in the array type.
  • value of <type> cast(<expr> as <type>): Converts the result of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed.
  • string from_unixtime(int unixtime): Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
  • string to_date(string timestamp): Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
  • int year(string date): Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
  • int month(string date): Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
  • int day(string date): Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
  • string get_json_object(string json_string, string path): Extracts a JSON object from a JSON string based on the specified JSON path and returns the JSON string of the extracted object. Returns NULL if the input JSON string is invalid.

Example

The following queries demonstrate some built-in functions:

round() function

hive> SELECT round(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0

floor() function

hive> SELECT floor(2.6) from temp;
On successful execution of the query, you get to see the following response:
2.0

ceil() function

hive> SELECT ceil(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0
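
String, date, and cast functions

A few more sketches covering the string, date, and cast functions listed above; these assume the same single-row temp table used in the previous examples:
hive> SELECT concat('Hello, ', 'Hive'), upper('hive'), trim('  padded  ') from temp;
On successful execution of the query, you get to see the following response:
Hello, Hive   HIVE   padded
hive> SELECT to_date('1970-01-01 00:00:00'), cast('1' as BIGINT) from temp;
On successful execution of the query, you get to see the following response:
1970-01-01   1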

Aggregate Functions

Hive supports the following built-in aggregate functions. The usage of these functions is the same as for SQL aggregate functions.
Return type, signature, and description:
  • BIGINT count(*), count(expr): count(*) returns the total number of retrieved rows, including rows containing NULL values; count(expr) returns the number of rows for which the supplied expression is non-NULL.
  • DOUBLE sum(col), sum(DISTINCT col): Returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
  • DOUBLE avg(col), avg(DISTINCT col): Returns the average of the elements in the group or the average of the distinct values of the column in the group.
  • DOUBLE min(col): Returns the minimum value of the column in the group.
  • DOUBLE max(col): Returns the maximum value of the column in the group.
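
Example

A quick sketch against the posts table created earlier in this tutorial, counting posts and finding the most recent timestamp per user:
hive> SELECT user, count(post), max(time) FROM posts GROUP BY user;
Note that every selected column that is not wrapped in an aggregate function must appear in the GROUP BY clause.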

View and Indexes

This chapter describes how to create and manage views. Views are generated based on user requirements, and you can save any result set as a view. The usage of a view in Hive is the same as that of a view in SQL; it is a standard RDBMS concept. A view can be queried like any table (although, as noted later, it cannot be the target of an INSERT or LOAD statement).

Creating a View

You can create a view at the time of executing a SELECT statement. The syntax is as follows:
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...

Example

Let us take an example for view. Assume employee table as given below, with the fields Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details who earn a salary of more than Rs 30000. We store the result in a view named emp_30000.
+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:
hive> CREATE VIEW emp_30000 AS
SELECT * FROM employee
WHERE salary>30000;
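
Once created, the view is queried exactly like a table:
hive> SELECT * FROM emp_30000;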

Dropping a View

Use the following syntax to drop a view:
DROP VIEW view_name
The following query drops a view named as emp_30000:
hive> DROP VIEW emp_30000;

Creating an Index

An index is a pointer on a particular column of a table; creating one lets Hive look up rows by that column more efficiently. Its syntax is as follows:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]

Example

Let us take an example for index. Use the same employee table that we have used earlier with the fields Id, Name, Salary, Designation, and Dept. Create an index named index_salary on the salary column of the employee table.
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
It is a pointer to the salary column. If the column is modified, the changes are stored using an index value.
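
If an index is created WITH DEFERRED REBUILD (see the syntax above), it stays empty until it is rebuilt explicitly; a sketch of that variant:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
hive> ALTER INDEX index_salary ON employee REBUILD;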

Dropping an Index

The following syntax is used to drop an index:
DROP INDEX <index_name> ON <table_name>
The following query drops an index named index_salary:
hive> DROP INDEX index_salary ON employee;

Select-Where

The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a Metastore. This chapter explains how to use the SELECT statement with WHERE clause.
The SELECT statement is used to retrieve data from a table. The WHERE clause works like a condition: it filters the data and returns only the rows that satisfy it. The condition is an expression built from the built-in operators and functions.

Syntax

Given below is the syntax of the SELECT query:
SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference 
[WHERE where_condition] 
[GROUP BY col_list] 
[HAVING having_condition] 
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]] 
[LIMIT number];
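
The CLUSTER BY, DISTRIBUTE BY, and SORT BY clauses shown in the syntax above are not demonstrated elsewhere in this tutorial; a minimal sketch against the posts table created earlier:
hive> SELECT user, time FROM posts
DISTRIBUTE BY user
SORT BY user, time DESC;
DISTRIBUTE BY routes all rows for a given user to the same reducer and SORT BY orders the rows within each reducer; CLUSTER BY user is shorthand for distributing and sorting by the same column.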

Example

Let us take an example for SELECT…WHERE clause. Assume we have the employee table as given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details who earn a salary of more than Rs 30000.
+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
|1205  | Kranthi      | 30000       | Op Admin          | Admin  | 
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:
hive> SELECT * FROM employee WHERE salary>30000;
On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
+------+--------------+-------------+-------------------+--------+

JDBC Program

The JDBC program to apply the WHERE clause for the given example is as follows.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLWhere {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
   
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
   
      // Register driver and create driver instance
      Class.forName(driverName);
      
      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      
      // create statement
      Statement stmt = con.createStatement();
      
      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee WHERE salary>30000");
      
      System.out.println("Result:");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");
      
      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " " + res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}
Save the program in a file named HiveQLWhere.java. Use the following commands to compile and execute this program.
$ javac HiveQLWhere.java
$ java HiveQLWhere

Output:

ID       Name           Salary      Designation          Dept
1201     Gopal          45000       Technical manager    TP
1202     Manisha        45000       Proofreader          PR
1203     Masthanvali    40000       Technical writer     TP
1204     Krian          40000       Hr Admin             HR

Select-Group By

This chapter explains the details of the GROUP BY clause in a SELECT statement. The GROUP BY clause groups all the records in a result set by one or more columns and is typically used together with aggregate functions to query summary information about each group.

Syntax

The syntax of GROUP BY clause is as follows:
SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference 
[WHERE where_condition] 
[GROUP BY col_list] 
[HAVING having_condition] 
[ORDER BY col_list] 
[LIMIT number];

Example

Let us take an example of SELECT…GROUP BY clause. Assume employee table as given below, with Id, Name, Salary, Designation, and Dept fields. Generate a query to retrieve the number of employees in each department.
+------+--------------+-------------+-------------------+--------+ 
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+ 
|1201  | Gopal        | 45000       | Technical manager | TP     | 
|1202  | Manisha      | 45000       | Proofreader       | PR     | 
|1203  | Masthanvali  | 40000       | Technical writer  | TP     | 
|1204  | Krian        | 45000       | Proofreader       | PR     | 
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario.
hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;
On successful execution of the query, you get to see the following response:
+------+--------------+ 
| Dept | Count(*)     | 
+------+--------------+ 
|Admin |    1         | 
|PR    |    2         | 
|TP    |    2         | 
+------+--------------+
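
The HAVING clause from the syntax above filters groups after aggregation. For example, to keep only the departments with more than one employee (a sketch against the same employee table):
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept HAVING count(*) > 1;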

JDBC Program

Given below is the JDBC program to apply the Group By clause for the given example.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLGroupBy {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
   
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
   
      // Register driver and create driver instance
      Class.forName(driverName);
      
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      
      // create statement
      Statement stmt = con.createStatement();
      
      // execute statement
      ResultSet res = stmt.executeQuery("SELECT Dept, count(*) FROM employee GROUP BY Dept");
      System.out.println(" Dept \t count(*)");
      
      while (res.next()) {
         System.out.println(res.getString(1) + " " + res.getInt(2)); 
      }
      con.close();
   }
}
Save the program in a file named HiveQLGroupBy.java. Use the following commands to compile and execute this program.
$ javac HiveQLGroupBy.java
$ java HiveQLGroupBy

Output:

 Dept     Count(*)
 Admin       1
 PR          2
 TP          2

Hive Interview Questions


There are two types: managed tables and external tables. In a managed table both the data and the schema are under the control of Hive, but in an external table only the schema is under the control of Hive.
No, Hive does not provide insert and update at the row level, so it is not suitable for OLTP systems.
ALTER TABLE table_name RENAME TO new_name
Using the REPLACE COLUMNS option:
ALTER TABLE table_name REPLACE COLUMNS ……
It is a relational database storing the metadata of Hive tables, partitions, Hive databases, etc.
Depending on the nature of the data the user has, the inbuilt SerDe may not satisfy the format of the data, so users need to write their own Java code to satisfy their data format requirements.
Hive is a tool in the Hadoop ecosystem which provides an interface to organize and query data in a database-like fashion and write SQL-like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax.
hdfs://namenode_server/user/hive/warehouse
  • Local mode
  • Distributed mode
  • Pseudodistributed mode
Yes. The TIMESTAMP data type stores the date in java.sql.Timestamp format.
There are three collection data types in Hive.
  • ARRAY
  • MAP
  • STRUCT
Yes, using the ! mark just before the command.
For example, !pwd at the hive prompt will list the current directory.
A Hive variable is a variable created in the Hive environment that can be referenced by Hive scripts. It is used to pass values to Hive queries when the query starts executing.
Using the source command.
Example −
hive> source /path/to/file/file_with_query.hql
It is a file containing a list of commands that need to run when the Hive CLI starts, for example setting strict mode to true.
The default record delimiter is − \n
And the field delimiters are − \001,\002,\003
The schema is validated against the data when reading it, and is not enforced when writing data.
SHOW DATABASES LIKE 'p.*'
With the USE command you fix the database on which all subsequent Hive queries will run.
There is no way you can delete the DBPROPERTY.
It sets the MapReduce jobs to strict mode, in which queries on partitioned tables cannot run without a WHERE clause on the partition column. This prevents a very large job from running for a long time.
This can be done with following query
SHOW PARTITIONS table_name PARTITION(partitioned_column='partition_value')
org.apache.hadoop.mapred.TextInputFormat
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
When we issue the command DROP TABLE IF EXISTS table_name,
Hive does not throw an error even if the table being dropped does not exist in the first place.
The data stays in the old location. It has to be moved manually.
ALTER TABLE table_name
CHANGE COLUMN new_col new_col INT
AFTER x_col
No. It only reduces the number of files, which makes it easier for the NameNode to manage.
By using the ENABLE OFFLINE clause with the ALTER TABLE statement.
By omitting the LOCAL clause in the LOAD DATA statement.
The new incoming files are just added to the target directory and the existing files are simply overwritten. Other files whose name does not match any of the incoming files will continue to exist.
If you add the OVERWRITE clause then all the existing data in the directory will be deleted before new data is written.
It creates partition on table employees with partition values coming from the columns in the select clause. It is called Dynamic partition insert.
A table-generating function is a function which takes a single column as an argument and expands it into multiple columns or rows, for example explode().
If we set the property hive.exec.mode.local.auto to true, then Hive will run small jobs in local mode instead of submitting them to the cluster when fetching query results.
The LIKE operator behaves the same way as the regular SQL operator used in select queries. Example −
street_name like '%Chi'
But the RLIKE operator uses more advanced regular expressions which are available in Java.
Example − street_name RLIKE '.*(Chi|Oho).*', which will select any word which has either chi or oho in it.
No, as this kind of join cannot be implemented in MapReduce.
In a join query, the smallest table should be placed in the first position and the largest table in the last position.
It controls how the map output is distributed among the reducers. It is useful in the case of streaming data.
Select cast(price as FLOAT)
Hive will return NULL
No. The name of a view must be unique when compared to all other tables and views present in the same database.
No. A view cannot be the target of an INSERT or LOAD statement.
Indexes occupy space and there is a processing cost in arranging the values of the column on which the index is created.
SHOW INDEX ON table_name
This will list all the indexes created on any of the columns in the table table_name.
The values in a column are hashed into a number of buckets which is defined by the user. It is a way to avoid too many partitions or nested partitions while still ensuring optimized query output.
It is a query hint to stream a table into memory before running the query. It is a query optimization technique.
Yes, a partition can be archived. The advantage is that it decreases the number of files stored in the NameNode, and the archived file can still be queried using Hive. The disadvantage is that it causes less efficient queries and does not offer any space savings.
It is a UDF which is created using a Java program to serve some specific need not covered by the existing functions in Hive. It can detect the type of its input argument programmatically and provide an appropriate response.
The local inpath should contain a file and not a directory. The $env:HOME is a valid variable available in the hive environment.
The TBLPROPERTIES clause is used to add the creator name while creating a table.
The TBLPROPERTIES is added like −
TBLPROPERTIES('creator'='Joan')

