21 Apr 2016

Hive Tutorial

Hive

  • Data Warehousing Solution built on top of Hadoop
  • Provides SQL-like query language named HiveQL
    • Minimal learning curve for people with SQL expertise
    • Data analysts are target audience
  • Early Hive development work started at Facebook in 2007
  • Today Hive is an Apache project under Hadoop http://hive.apache.org

Hive Provides

  • Ability to bring structure to various data formats
  • Simple interface for ad hoc querying, analyzing, and summarizing large amounts of data
  • Access to files on various data stores such as HDFS and HBase

Hive

  • Hive does NOT provide low latency or real-time queries
  • Even querying small amounts of data may take minutes
  • Designed for scalability and ease-of-use rather than low latency responses
  • Translates HiveQL statements into a set of MapReduce jobs which are then executed on a Hadoop cluster

Hive Metastore

To support features like schemas and data partitioning, Hive keeps its metadata in a relational database
  • Packaged with Derby, a lightweight embedded SQL DB
    • The default Derby-based metastore is good for evaluation and testing
    • The schema is not shared between users, as each user has their own instance of embedded Derby
    • It is stored in a metastore_db directory, which resides in the directory that Hive was started from
  • Can easily switch to another SQL installation, such as MySQL (see the sketch below)
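
Switching the metastore to MySQL mostly amounts to pointing Hive at a JDBC connection in hive-site.xml. The snippet below is a minimal sketch of that switch; the host name, database name, and credentials are placeholders, not values from these slides:

    <configuration>
      <!-- Placeholder host, database name, and credentials; adjust for your environment -->
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://dbhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hiveuser</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hivepassword</value>
      </property>
    </configuration>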

Hive Architecture


Hive Interface Options

  • Command Line Interface (CLI)
    The CLI is used exclusively in these slides
  • Hive Web Interface
    https://cwiki.apache.org/confluence/display/Hive/HiveWebInterface

Hive Concepts

Re-used from relational databases (see the sketch below):
  • Database: a set of tables, used to resolve naming conflicts
  • Table: a set of rows that share the same schema (the same columns)
  • Row: a single record; a set of columns
  • Column: a single typed value within a row
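
A minimal HiveQL sketch of these concepts; the database, table, and column names here are illustrative and not part of these slides:

hive> CREATE DATABASE sales_db;                                  <---Database: a namespace that prevents table-name conflicts--->
hive> USE sales_db;
hive> CREATE TABLE orders (id INT, item STRING, price DOUBLE);   <---Table: rows that share the same columns--->
hive> SELECT id, item FROM orders LIMIT 5;                       <---Each returned row is a record; id and item are typed columns--->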

Installation Prerequisites

  • Java 6 (just like Hadoop)
  • Hadoop 0.20.x+ (no surprise here)
  • Set the $HADOOP_HOME environment variable (done as part of the HDFS installation)
  • Set $HIVE_HOME and add hive to the PATH
    export HIVE_HOME=$CDH_HOME/hive-0.8.1-cdh4.0.0
    export PATH=$PATH:$HIVE_HOME/bin
  • Hive will store its tables on HDFS; those locations need to be bootstrapped:
    $ hdfs dfs -mkdir /tmp
    $ hdfs dfs -mkdir /user/hive/warehouse
    $ hdfs dfs -chmod g+w /tmp
    $ hdfs dfs -chmod g+w /user/hive/warehouse
    
  • Similar to other Hadoop projects, Hive's configuration is in $HIVE_HOME/conf/hive-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:10040</value>   <---Specify the location of the ResourceManager so Hive knows where to execute MapReduce jobs; by default Hive utilizes LocalJobRunner--->
      </property>
    </configuration>
    

Run Hive

HDFS and YARN need to be up and running

$ hive
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201207312052_1402761030.txt
hive>     <---Hive's interactive Command Line Interface (CLI)--->

Create a Table

Let’s create a table to store data from $PLAY_AREA/data/user-posts.txt
$ cd $PLAY_AREA
$ hive           <---Launch Hive Command Line Interface (CLI)--->
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201208022144_2014345460.txt  <---Location of the session’s log file--->
hive> !cat data/user-posts.txt;     <---Local commands can be executed within the CLI; place the command between ! and ;--->

user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394    <---Values are separated by ','; each row represents a record; the first value is the user name, the second is the post content, and the third is a timestamp--->

hive>
hive>  CREATE  TABLE  posts  (user  STRING,  post  STRING,  time  BIGINT)
>  ROW  FORMAT  DELIMITED
>  FIELDS  TERMINATED  BY  ','     <---Line 1 creates a table with 3 columns; lines 2 and 3 describe how the underlying file should be parsed; line 4 specifies how the data is stored--->
>  STORED  AS  TEXTFILE;      <---Statements must end with a semicolon and can span multiple rows--->

OK
Time  taken:  10.606  seconds
hive>  show  tables;      <---Display all of the tables--->
OK
posts          <---The result is displayed between "OK" and "Time taken"--->
Time  taken:  0.221  seconds
hive>  describe  posts;      <---Display schema for posts table--->
OK
user   string
post   string
time   bigint
Time  taken:  0.212  seconds

Loading Data

Several options to start using data in Hive
  • Load data from HDFS location
  • hive> LOAD DATA INPATH '/training/hive/user-posts.txt'
    > OVERWRITE INTO TABLE posts;
    
    File is moved from the provided HDFS location to /user/hive/warehouse/ (or the configured warehouse location)

  • Load data from a local file system
  • hive> LOAD DATA LOCAL INPATH 'data/user-posts.txt'
    > OVERWRITE INTO TABLE posts;
    
    File is copied from the provided location to /user/hive/warehouse/(or configured location)

  • Utilize an existing location on HDFS: just point to the existing location when creating the table (see the next section)

Re-Use Existing HDFS Location

hive>  CREATE  EXTERNAL  TABLE  posts
>  (user  STRING,  post  STRING,  time  BIGINT)
>  ROW  FORMAT  DELIMITED
>  FIELDS  TERMINATED  BY  ','
>  STORED  AS  TEXTFILE
>  LOCATION  '/training/hive/';    <---Hive will load all the files under the /training/hive directory into the posts table--->
OK
Time  taken:  0.077  seconds
hive>

Schema Violations

What would happen if we try to insert data that does not comply with the pre-defined schema?
hive>  !cat  data/user-posts-inconsistentFormat.txt;
user1,Funny  Story,1343182026191
user2,Cool  Deal,2012-01-05
user4,Interesting  Post,1343182154633
user5,Yet  Another  Blog,13431839394
hive>  describe  posts;
OK
user   string
post   string
time   bigint       <---The third column 'time' is of type bigint; Hive will not be able to convert the value '2012-01-05'--->
Time  taken:  0.289  seconds
hive> LOAD DATA LOCAL INPATH
> 'data/user-posts-inconsistentFormat.txt'
> OVERWRITE INTO TABLE posts;
OK
Time taken: 0.612 seconds
hive> select * from posts;
OK
user1 Funny Story        1343182026191
user2 Cool Deal          NULL    <---NULL is set for any value that violates the pre-defined schema--->
user4 Interesting Post   1343182154633
user5 Yet Another Blog   13431839394
Time taken: 0.136 seconds
hive>


Partitions

  • To increase performance, Hive can partition data
    • The values of a partitioned column divide a table into segments
    • Entire partitions can be ignored at query time
    • Similar to relational databases' indexes, but not as granular
  • Partitions have to be properly created by users
    When inserting data, a partition must be specified
  • At query time, whenever appropriate, Hive will automatically filter out partitions

Creating Partitioned Table

hive>  CREATE  TABLE  posts  (user  STRING,  post  STRING,  time  BIGINT)
>  PARTITIONED  BY(country  STRING)    <---Partition table based on the value of a country.--->
>  ROW  FORMAT  DELIMITED
>  FIELDS  TERMINATED  BY  ','
>  STORED  AS  TEXTFILE;
OK
Time  taken:  0.116  seconds
hive>  describe  posts;
OK
user      string
post      string
time      bigint
country   string  <---There is no difference in schema between "partition" columns and "data" columns--->
Time  taken:  0.111  seconds
hive>  show  partitions  posts;
OK
Time  taken:  0.102  seconds
hive>

Load Data Into Partitioned Table

hive>  LOAD  DATA  LOCAL  INPATH  'data/user-posts-US.txt'
>  OVERWRITE  INTO  TABLE  posts;
FAILED:  Error  in  semantic  analysis:  Need  to  specify  partition
columns  because  the  destination  table  is  partitioned  <---Since the posts table was defined to be partitioned any insert statement must specify the partition--->
hive>  LOAD  DATA  LOCAL  INPATH  'data/user-posts-US.txt'
>  OVERWRITE  INTO  TABLE  posts  PARTITION(country='US');
OK
Time  taken:  0.225  seconds
hive>  LOAD  DATA  LOCAL  INPATH  'data/user-posts-AUSTRALIA.txt'
>  OVERWRITE  INTO  TABLE  posts  PARTITION(country='AUSTRALIA'); <---Each file is loaded into separate partition;data is separated by country--->
OK
Time  taken:  0.236  seconds
hive>

Partitioned Table

Partitions are physically stored under separate directories
hive> show partitions posts;
OK
country=AUSTRALIA     <---There  is  a  directory  for each  partition  value--->
country=US
Time taken: 0.095 seconds
hive> exit;
$ hdfs dfs -ls -R /user/hive/warehouse/posts
/user/hive/warehouse/posts/country=AUSTRALIA <---There  is  a  directory  for each  partition  value--->
/user/hive/warehouse/posts/country=AUSTRALIA/user-posts-AUSTRALIA.txt
/user/hive/warehouse/posts/country=US
/user/hive/warehouse/posts/country=US/user-posts-US.txt

Querying Partitioned Table

  • There is no difference in syntax
  • When a partition column is specified in the WHERE clause, entire directories/partitions can be skipped
hive>  select  *  from  posts  where  country='US'  limit  10;    <---Only the "country=US" partition will be queried; the "country=AUSTRALIA" partition will be ignored--->
OK
user1  Funny  Story  1343182026191     US
user2  Cool   Deal  1343182133839     US
user2  Great  Interesting  Note   13431821339485 US
user4  Interesting  Post   1343182154633   US
user1  Humor is good    1343182039586   US
user2  Hi I am user #2    1343182133839   US
Time  taken:  0.197  seconds

Bucketing

  • Mechanism to query and examine random samples of data
  • Break data into a set of buckets based on a hash function of a "bucket column"
    Capability to execute queries on a sub-set of random data
  • Hive doesn't automatically enforce bucketing
    The user is required to set the number of reducers to match the number of buckets
hive> set mapred.reduce.tasks = 256;
OR    <---Either manually set the # of reducers to be the number of buckets or use 'hive.enforce.bucketing', which will set it on your behalf--->
hive> set hive.enforce.bucketing = true;

Create and Use Table with Buckets

hive>  CREATE  TABLE  post_count  (user  STRING,  count  INT)
>  CLUSTERED  BY  (user)  INTO  5  BUCKETS;    <---Declare table with 5 buckets for user column--->

OK
Time  taken:  0.076  seconds
hive>  set  hive.enforce.bucketing  =  true;  <---The # of reducers will get set to 5--->
hive>  insert  overwrite  table  post_count
>  select  user,  count(post)  from  posts  group  by  user;
Total  MapReduce  jobs  =  2 <---Insert data into post_count bucketed table; number of posts are counted up for each user--->

Launching  Job  1  out  of  2
...
Launching  Job  2  out  of  2
...
OK
Time  taken:  42.304  seconds
hive>  exit;
$  hdfs  dfs  -ls  -R  /user/hive/warehouse/post_count/
/user/hive/warehouse/post_count/000000_0 <---A file per bucket is created; now only a sub-set of buckets can be sampled--->
/user/hive/warehouse/post_count/000001_0
/user/hive/warehouse/post_count/000002_0
/user/hive/warehouse/post_count/000003_0
/user/hive/warehouse/post_count/000004_0

Random Sample of Bucketed Table

hive>  select  *  from  post_count  TABLESAMPLE(BUCKET  1  OUT  OF  2); <---Sample approximately 1 for every 2 buckets--->
OK
user5   1
user1   2
Time  taken:  11.758  seconds
hive>

Load Data Into a Table

hive> LOAD DATA LOCAL INPATH 'data/user-posts.txt'
> OVERWRITE INTO TABLE posts;
Copying data from file:/home/hadoop/Training/play_area/data/user-posts.txt
Copying file: file:/home/hadoop/Training/play_area/data/user-posts.txt <---Existing records in the posts table are deleted; data in user-posts.txt is loaded into Hive's posts table--->
Loading data to table default.posts
Deleted /user/hive/warehouse/posts
OK
Time taken: 5.818 seconds
hive>

$  hdfs  dfs  -cat  /user/hive/warehouse/posts/user-posts.txt  <---Under the covers Hive stores its tables in /user/hive/warehouse (unless configured differently)--->

user1,Funny  Story,1343182026191
user2,Cool  Deal,1343182133839
user4,Interesting  Post,1343182154633
user5,Yet  Another  Blog,13431839394

Query Data

hive>  select  count  (1)  from  posts;    <---Count  number  of  records  in  posts  table--->
Total  MapReduce  jobs  =  1
Launching  Job  1  out  of  1     <---Transformed  HiveQL  into  1  MapReduce  Job--->

Starting  Job  =  job_1343957512459_0004,  Tracking  URL  =
http://localhost:8088/proxy/application_1343957512459_0004/
Kill  Command  =  hadoop  job    -Dmapred.job.tracker=localhost:10040  -kill
job_1343957512459_0004
Hadoop  job  information  for  Stage-1:  number  of  mappers:  1;  number  of  reducers:  1
2012-08-02  22:37:24,962  Stage-1  map  =  0%,    reduce  =  0%
2012-08-02  22:37:30,497  Stage-1  map  =  100%,    reduce  =  0%,  Cumulative  CPU  0.87  sec
2012-08-02  22:37:31,577  Stage-1  map  =  100%,    reduce  =  0%,  Cumulative  CPU  0.87  sec
2012-08-02  22:37:32,664  Stage-1  map  =  100%,    reduce  =  100%,  Cumulative  CPU  2.64  sec
MapReduce  Total  cumulative  CPU  time:  2  seconds  640  msec
Ended  Job  =  job_1343957512459_0004
MapReduce  Jobs  Launched:
Job  0:  Map:  1    Reduce:  1      Accumulative  CPU:  2.64  sec      HDFS  Read:  0  HDFS  Write:  0
SUCCESS
Total  MapReduce  CPU  Time  Spent:  2  seconds  640  msec
OK
4         <---Result is 4 records--->
Time  taken:  14.204  seconds
hive> select * from posts where user="user2";
...
...
OK
user2 Cool Deal   1343182133839
Time taken: 12.184 seconds

hive> select * from posts where time<=1343182133839 limit 2; <---Usually there are too many results to display, so one can use the LIMIT clause to bound the output--->
...
...
OK
user1 Funny Story  1343182026191
user2 Cool Deal  1343182133839
Time taken: 12.003 seconds
hive>

Drop the Table

hive>  DROP  TABLE  posts;   <---Remove the table; use with caution--->
OK
Time  taken:  2.182  seconds
hive>  exit;
$  hdfs  dfs  -ls  /user/hive/warehouse/
$      <---If Hive was managing the underlying files, they will be removed--->

Joins

  • Joins in Hive are trivial
  • Supports outer joins: left, right, and full
  • Can join multiple tables (see the sketch after this list)
  • Default Join is Inner Join
  • Rows are joined where the keys match
  • Rows that do not have matches are not included in the result
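
Multi-table joins use the same syntax. The sketch below assumes a hypothetical comments table (with user and comment columns) that is not part of these slides, alongside the posts and likes tables used in the next sections:

hive> SELECT p.user, p.post, l.count, c.comment
    > FROM posts p
    > JOIN likes l ON (p.user = l.user)
    > JOIN comments c ON (p.user = c.user);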

Simple Inner Join

Let’s say we have 2 tables: posts and likes
hive> select * from posts limit 10;
OK
user1 Funny story       1343182026191
user2 Cool story        1343182133839
user4 Interesting Post  1343182154633
user5 Yet Another Blog  1343183939434 <---We want to join these 2 data-sets and produce a single table that contains user, post and count of likes--->

Time taken: 0.108 seconds
hive> select * from likes limit 10;
OK
user1  12  1343182026191
user2  7  1343182139394
user3  0  1343182154633   <---We want to join these 2 data-sets and produce a single table that contains user, post and count of likes--->
user4  50  1343182147364
Time taken: 0.103 seconds
hive> CREATE TABLE posts_likes (user STRING, post STRING, likes_count INT);
OK
Time taken: 0.06 seconds

Simple Inner Join

hive>  INSERT  OVERWRITE  TABLE  posts_likes
>  SELECT  p.user,  p.post,  l.count
>  FROM  posts  p  JOIN  likes  l  ON  (p.user  =  l.user); <---Two tables are joined based on user column; 3 columns are selected and stored in posts_likes table--->

OK
Time  taken:  17.901  seconds
hive>  select  *  from  posts_likes  limit  10;
OK
user1 Funny story   12
user2 Cool story   7
user4  Interesting  Post   50
Time  taken:  0.082  seconds
hive>

Outer Join

Rows which will not join with the ‘other’ table are still included in the result
Left Outer
Rows from the first table are included whether they have a match or not; columns from the unmatched (second) table are set to NULL.
Left  Outer Joins
The following query demonstrates LEFT OUTER JOIN between CUSTOMER and ORDER tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE 
FROM CUSTOMERS c 
LEFT OUTER JOIN ORDERS o 
ON (c.ID = o.CUSTOMER_ID);

Right Outer
The opposite of Left Outer Join: Rows from the second table are included no matter what. Columns from the unmatched (first) table are set to null.
Right Outer Joins
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE 
FROM CUSTOMERS c 
RIGHT OUTER JOIN ORDERS o 
ON (c.ID = o.CUSTOMER_ID);

Full Outer
Rows from both sides are included. For unmatched rows the columns from the ‘other’ table are set to null.
Full Outer Joins
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE 
FROM CUSTOMERS c 
FULL OUTER JOIN ORDERS o 
ON (c.ID = o.CUSTOMER_ID);

Functions

This chapter explains the built-in functions available in Hive. These functions look and behave much like their SQL counterparts.

Built-In Functions

Hive supports the following built-in functions:
Return type, signature, and description:
  • BIGINT round(double a): Returns the rounded BIGINT value of the double.
  • BIGINT floor(double a): Returns the maximum BIGINT value that is equal to or less than the double.
  • BIGINT ceil(double a): Returns the minimum BIGINT value that is equal to or greater than the double.
  • double rand(), rand(int seed): Returns a random number that changes from row to row.
  • string concat(string A, string B, ...): Returns the string resulting from concatenating B after A.
  • string substr(string A, int start): Returns the substring of A starting from the start position until the end of string A.
  • string substr(string A, int start, int length): Returns the substring of A starting from the start position with the given length.
  • string upper(string A): Returns the string resulting from converting all characters of A to upper case.
  • string ucase(string A): Same as above.
  • string lower(string A): Returns the string resulting from converting all characters of A to lower case.
  • string lcase(string A): Same as above.
  • string trim(string A): Returns the string resulting from trimming spaces from both ends of A.
  • string ltrim(string A): Returns the string resulting from trimming spaces from the beginning (left-hand side) of A.
  • string rtrim(string A): Returns the string resulting from trimming spaces from the end (right-hand side) of A.
  • string regexp_replace(string A, string B, string C): Returns the string resulting from replacing all substrings of A that match the Java regular expression B with C.
  • int size(Map<K,V>): Returns the number of elements in the map type.
  • int size(Array<T>): Returns the number of elements in the array type.
  • value of <type> cast(<expr> as <type>): Converts the result of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed.
  • string from_unixtime(int unixtime): Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
  • string to_date(string timestamp): Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
  • int year(string date): Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
  • int month(string date): Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
  • int day(string date): Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
  • string get_json_object(string json_string, string path): Extracts a JSON object from a JSON string based on the specified JSON path and returns the JSON string of the extracted object. Returns NULL if the input JSON string is invalid.

Example

The following queries demonstrate some built-in functions:

round() function

hive> SELECT round(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0

floor() function

hive> SELECT floor(2.6) from temp;
On successful execution of the query, you get to see the following response:
2.0

ceil() function

hive> SELECT ceil(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0
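
String, date, and cast functions

A few more sketches covering the string, date, and cast functions listed above; these assume the same single-row temp table used in the previous examples:
hive> SELECT concat('Hello, ', 'Hive'), upper('hive'), trim('  padded  ') from temp;
On successful execution of the query, you get to see the following response:
Hello, Hive   HIVE   padded
hive> SELECT to_date('1970-01-01 00:00:00'), cast('1' as BIGINT) from temp;
On successful execution of the query, you get to see the following response:
1970-01-01   1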

Aggregate Functions

Hive supports the following built-in aggregate functions. The usage of these functions is the same as for SQL aggregate functions.
Return type, signature, and description:
  • BIGINT count(*), count(expr): count(*) returns the total number of retrieved rows, including rows containing NULL values; count(expr) returns the number of rows for which the supplied expression is non-NULL.
  • DOUBLE sum(col), sum(DISTINCT col): Returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
  • DOUBLE avg(col), avg(DISTINCT col): Returns the average of the elements in the group or the average of the distinct values of the column in the group.
  • DOUBLE min(col): Returns the minimum value of the column in the group.
  • DOUBLE max(col): Returns the maximum value of the column in the group.
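
Example

A quick sketch against the posts table created earlier in this tutorial, counting posts and finding the most recent timestamp per user:
hive> SELECT user, count(post), max(time) FROM posts GROUP BY user;
Note that every selected column that is not wrapped in an aggregate function must appear in the GROUP BY clause.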

View and Indexes

This chapter describes how to create and manage views. Views are generated based on user requirements, and you can save any result set as a view. The usage of a view in Hive is the same as that of a view in SQL; it is a standard RDBMS concept. A view can be queried like any table (although, as noted later, it cannot be the target of an INSERT or LOAD statement).

Creating a View

You can create a view at the time of executing a SELECT statement. The syntax is as follows:
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...

Example

Let us take an example for view. Assume employee table as given below, with the fields Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details who earn a salary of more than Rs 30000. We store the result in a view named emp_30000.
+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:
hive> CREATE VIEW emp_30000 AS
SELECT * FROM employee
WHERE salary>30000;
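
Once created, the view is queried exactly like a table:
hive> SELECT * FROM emp_30000;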

Dropping a View

Use the following syntax to drop a view:
DROP VIEW view_name
The following query drops a view named as emp_30000:
hive> DROP VIEW emp_30000;

Creating an Index

An index is a pointer on a particular column of a table; creating one lets Hive look up rows by that column more efficiently. Its syntax is as follows:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]

Example

Let us take an example for index. Use the same employee table that we have used earlier with the fields Id, Name, Salary, Designation, and Dept. Create an index named index_salary on the salary column of the employee table.
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
It is a pointer to the salary column. If the column is modified, the changes are stored using an index value.
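
If an index is created WITH DEFERRED REBUILD (see the syntax above), it stays empty until it is rebuilt explicitly; a sketch of that variant:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
hive> ALTER INDEX index_salary ON employee REBUILD;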

Dropping an Index

The following syntax is used to drop an index:
DROP INDEX <index_name> ON <table_name>
The following query drops an index named index_salary:
hive> DROP INDEX index_salary ON employee;

Select-Where

The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a Metastore. This chapter explains how to use the SELECT statement with WHERE clause.
The SELECT statement is used to retrieve data from a table. The WHERE clause works like a condition: it filters the data and returns only the rows that satisfy it. The condition is an expression built from the built-in operators and functions.

Syntax

Given below is the syntax of the SELECT query:
SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference 
[WHERE where_condition] 
[GROUP BY col_list] 
[HAVING having_condition] 
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]] 
[LIMIT number];
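
The CLUSTER BY, DISTRIBUTE BY, and SORT BY clauses shown in the syntax above are not demonstrated elsewhere in this tutorial; a minimal sketch against the posts table created earlier:
hive> SELECT user, time FROM posts
DISTRIBUTE BY user
SORT BY user, time DESC;
DISTRIBUTE BY routes all rows for a given user to the same reducer and SORT BY orders the rows within each reducer; CLUSTER BY user is shorthand for distributing and sorting by the same column.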

Example

Let us take an example for SELECT…WHERE clause. Assume we have the employee table as given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee details who earn a salary of more than Rs 30000.
+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
|1205  | Kranthi      | 30000       | Op Admin          | Admin  | 
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:
hive> SELECT * FROM employee WHERE salary>30000;
On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
+------+--------------+-------------+-------------------+--------+

JDBC Program

The JDBC program to apply the WHERE clause for the given example is as follows.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLWhere {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
   
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
   
      // Register driver and create driver instance
      Class.forName(driverName);
      
      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      
      // create statement
      Statement stmt = con.createStatement();
      
      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee WHERE salary>30000");
      
      System.out.println("Result:");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");
      
      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " " + res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}
Save the program in a file named HiveQLWhere.java. Use the following commands to compile and execute this program.
$ javac HiveQLWhere.java
$ java HiveQLWhere

Output:

ID       Name           Salary      Designation          Dept
1201     Gopal          45000       Technical manager    TP
1202     Manisha        45000       Proofreader          PR
1203     Masthanvali    40000       Technical writer     TP
1204     Krian          40000       Hr Admin             HR

Select-Group By

This chapter explains the details of the GROUP BY clause in a SELECT statement. The GROUP BY clause groups all the records in a result set by one or more columns and is typically used together with aggregate functions to query summary information about each group.

Syntax

The syntax of GROUP BY clause is as follows:
SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference 
[WHERE where_condition] 
[GROUP BY col_list] 
[HAVING having_condition] 
[ORDER BY col_list] 
[LIMIT number];

Example

Let us take an example of SELECT…GROUP BY clause. Assume employee table as given below, with Id, Name, Salary, Designation, and Dept fields. Generate a query to retrieve the number of employees in each department.
+------+--------------+-------------+-------------------+--------+ 
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+ 
|1201  | Gopal        | 45000       | Technical manager | TP     | 
|1202  | Manisha      | 45000       | Proofreader       | PR     | 
|1203  | Masthanvali  | 40000       | Technical writer  | TP     | 
|1204  | Krian        | 45000       | Proofreader       | PR     | 
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario.
hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;
On successful execution of the query, you get to see the following response:
+------+--------------+ 
| Dept | Count(*)     | 
+------+--------------+ 
|Admin |    1         | 
|PR    |    2         | 
|TP    |    2         | 
+------+--------------+
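
The HAVING clause from the syntax above filters groups after aggregation. For example, to keep only the departments with more than one employee (a sketch against the same employee table):
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept HAVING count(*) > 1;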

JDBC Program

Given below is the JDBC program to apply the Group By clause for the given example.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveQLGroupBy {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
   
   public static void main(String[] args) throws SQLException, ClassNotFoundException {
   
      // Register driver and create driver instance
      Class.forName(driverName);
      
      // get connection
      Connection con = DriverManager.
      getConnection("jdbc:hive://localhost:10000/userdb", "", "");
      
      // create statement
      Statement stmt = con.createStatement();
      
      // execute statement
      ResultSet res = stmt.executeQuery("SELECT Dept, count(*) FROM employee GROUP BY Dept");
      System.out.println(" Dept \t count(*)");
      
      while (res.next()) {
         System.out.println(res.getString(1) + " " + res.getInt(2)); 
      }
      con.close();
   }
}
Save the program in a file named HiveQLGroupBy.java. Use the following commands to compile and execute this program.
$ javac HiveQLGroupBy.java
$ java HiveQLGroupBy

Output:

 Dept     Count(*)
 Admin       1
 PR          2
 TP          2

Hive Interview Questions


There are two types: managed tables and external tables. In a managed table both the data and the schema are under the control of Hive, but in an external table only the schema is under the control of Hive.
No, Hive does not provide insert and update at the row level, so it is not suitable for OLTP systems.
ALTER TABLE table_name RENAME TO new_name
Using the REPLACE COLUMNS option:
ALTER TABLE table_name REPLACE COLUMNS ……
It is a relational database storing the metadata of Hive tables, partitions, Hive databases, etc.
Depending on the nature of the data the user has, the inbuilt SerDe may not satisfy the format of the data, so users need to write their own Java code to satisfy their data format requirements.
Hive is a tool in the Hadoop ecosystem which provides an interface to organize and query data in a database-like fashion and write SQL-like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax.
hdfs://namenode_server/user/hive/warehouse
  • Local mode
  • Distributed mode
  • Pseudodistributed mode
Yes. The TIMESTAMP data type stores the date in java.sql.Timestamp format.
There are three collection data types in Hive.
  • ARRAY
  • MAP
  • STRUCT
Yes, using the ! mark just before the command.
For example, !pwd at the hive prompt will list the current directory.
A Hive variable is a variable created in the Hive environment that can be referenced by Hive scripts. It is used to pass values to Hive queries when the query starts executing.
Using the source command.
Example −
hive> source /path/to/file/file_with_query.hql
It is a file containing a list of commands that need to run when the Hive CLI starts, for example setting strict mode to true.
The default record delimiter is − \n
And the field delimiters are − \001,\002,\003
The schema is validated against the data when reading it, and is not enforced when writing data.
SHOW DATABASES LIKE 'p.*'
With the USE command you fix the database on which all subsequent Hive queries will run.
There is no way you can delete the DBPROPERTY.
It sets the MapReduce jobs to strict mode, in which queries on partitioned tables cannot run without a WHERE clause on the partition column. This prevents a very large job from running for a long time.
This can be done with following query
SHOW PARTITIONS table_name PARTITION(partitioned_column='partition_value')
org.apache.hadoop.mapred.TextInputFormat
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
When we issue the command DROP TABLE IF EXISTS table_name,
Hive does not throw an error even if the table being dropped does not exist in the first place.
The data stays in the old location. It has to be moved manually.
ALTER TABLE table_name
CHANGE COLUMN new_col new_col INT
AFTER x_col
No. It only reduces the number of files, which makes it easier for the NameNode to manage.
By using the ENABLE OFFLINE clause with the ALTER TABLE statement.
By omitting the LOCAL clause in the LOAD DATA statement.
The new incoming files are just added to the target directory and the existing files are simply overwritten. Other files whose name does not match any of the incoming files will continue to exist.
If you add the OVERWRITE clause then all the existing data in the directory will be deleted before new data is written.
It creates partition on table employees with partition values coming from the columns in the select clause. It is called Dynamic partition insert.
A table-generating function is a function which takes a single column as an argument and expands it into multiple columns or rows, for example explode().
If we set the property hive.exec.mode.local.auto to true, then Hive will run small jobs in local mode instead of submitting them to the cluster when fetching query results.
The LIKE operator behaves the same way as the regular SQL operator used in select queries. Example −
street_name like '%Chi'
But the RLIKE operator uses more advanced regular expressions which are available in Java.
Example − street_name RLIKE '.*(Chi|Oho).*', which will select any word which has either chi or oho in it.
No, as this kind of join cannot be implemented in MapReduce.
In a join query, the smallest table should be placed in the first position and the largest table in the last position.
It controls how the map output is distributed among the reducers. It is useful in the case of streaming data.
Select cast(price as FLOAT)
Hive will return NULL
No. The name of a view must be unique when compared to all other tables and views present in the same database.
No. A view cannot be the target of an INSERT or LOAD statement.
Indexes occupy space and there is a processing cost in arranging the values of the column on which the index is created.
SHOW INDEX ON table_name
This will list all the indexes created on any of the columns in the table table_name.
The values in a column are hashed into a number of buckets which is defined by the user. It is a way to avoid too many partitions or nested partitions while still ensuring optimized query output.
It is a query hint to stream a table into memory before running the query. It is a query optimization technique.
Yes, a partition can be archived. The advantage is that it decreases the number of files stored in the NameNode, and the archived file can still be queried using Hive. The disadvantage is that it causes less efficient queries and does not offer any space savings.
It is a UDF which is created using a Java program to serve some specific need not covered by the existing functions in Hive. It can detect the type of its input argument programmatically and provide an appropriate response.
The local inpath should contain a file and not a directory. The $env:HOME is a valid variable available in the hive environment.
The TBLPROPERTIES clause is used to add the creator name while creating a table.
The TBLPROPERTIES is added like −
TBLPROPERTIES('creator'='Joan')

