Working with Hive and Impala Tutorial

Welcome to the fifth lesson, ‘Working with Hive and Impala’, which is part of the ‘Big Data Hadoop and Spark Developer Certification course’ offered by Simplilearn. This lesson focuses on working with Hive and Impala.

Let’s look at the objectives of this lesson in the next section.

Objectives

By the end of this lesson, you will be able to:

  • Explain Hive and Impala

  • Explain Hive Metastore

  • Use Impala SQL and HiveQL DDL to create tables

  • Create Hive tables and manage tables using Hue or HCatalog

  • Load data into Hive and Impala tables using HDFS and Sqoop

Let's start this Hive tutorial with the process of managing data in Hive and Impala.

Managing Data with Hive and Impala

A table is simply an HDFS directory containing zero or more files.

Default path: /user/hive/warehouse/<table_name>

A diagram of managing data with Hive and Impala is given below.

(Diagram: managing data with Hive and Impala)

Tables support many formats for data storage and retrieval. The metadata created for a table is stored in the Hive Metastore and is contained in an RDBMS such as MySQL.

Hive and Impala work with the same data: tables in HDFS, metadata in the Metastore.

Let us understand what Hive Metastore is.

What is Hive Metastore?

The Metastore is a Hive component that stores the system catalog containing metadata about Hive tables, columns, and partitions. Usually, this metadata is stored in a traditional RDBMS.

However, by default, Apache Hive uses the Derby database to store metadata. Any Java Database Connectivity (JDBC)-compliant database, such as MySQL, can be used to host the Hive Metastore.

A number of primary attributes must be configured for the Hive Metastore. Some of them are listed below; a configuration sketch follows the list.

  • Connection URL

  • Connection driver

  • Connection user ID

  • Connection password
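These attributes correspond to properties in Hive’s hive-site.xml configuration file. A minimal sketch, assuming a MySQL Metastore database on a host called dbhost (the host, database name, and credentials are illustrative):

<!-- hive-site.xml: Metastore connection properties (illustrative values) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>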

Use of Metastore in Hive and Impala

The Hive and Impala servers use the Metastore to get the table structure and data location. As shown in the diagram, a query is first sent to the Hive or Impala server, which then contacts the Metastore to get the table structure and data location.

Once the server receives the required information, it queries the actual data in the table stored in HDFS.

Now that we have looked at the use of the Metastore in Hive and Impala, we will discuss how and where the data gets stored in a Hive table.

Data Warehouse Directory Structure

By default, all table data gets stored in /user/hive/warehouse.

Each table is a directory within the default location, containing one or more files. The Hive customers table is shown below.

(Diagram: the Hive customers table in the warehouse directory)

As shown in the diagram, the customers table stores its data in a subdirectory named “customers” created within the default HDFS warehouse directory: /user/hive/warehouse/customers.
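For example, you could list the files that make up the table with the following command (assuming the customers table exists in the default warehouse location):

hdfs dfs -ls /user/hive/warehouse/customers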

Defining Databases and Tables

In this topic, you will learn how to define, create, and delete a database along with the steps involved in creating a table.

Databases and tables are created and managed using the Data Definition Language (DDL) of HiveQL or Impala SQL, both of which are very similar to standard SQL DDL. You will find minor differences between HiveQL and Impala SQL DDL operations.

Now that you have understood the way to define a database, let’s analyze how to create a database.

Creating a Database in Hive

To create a new database, we need to write:

CREATE DATABASE databasename

In the example given below, we are creating a simplilearn database. This adds the database definition to the Hive Metastore and also creates a storage directory in HDFS at the default location.

For example:

CREATE DATABASE simplilearn;

To avoid an error in case the database already exists, you need to write:

CREATE DATABASE IF NOT EXISTS databasename

Deleting a Database

Removing a database is very similar to creating a database: you need to replace CREATE with DROP, as shown below.

DROP DATABASE databasename;

To avoid an error in case the database does not exist, you can type:

DROP DATABASE IF EXISTS databasename;

To remove a database that already contains tables, append the keyword CASCADE to the query. This forcefully removes the database and its tables.

DROP DATABASE simplilearn CASCADE;

Note: This might remove data in HDFS.

Now let’s understand the syntax of CREATE TABLE in detail and look at how to create a table in Hive.

Hive - Create Table

The first line of the code is:

CREATE TABLE tablename (colnames DATA TYPES)

Specify a name for the table and a list of column names and data types.

The second line of the code is:

ROW FORMAT DELIMITED

This states that fields in each file in the table’s directory are delimited by some character. The default delimiter is Control-A, but you may specify an alternate delimiter.

The third line of the code is:

FIELDS TERMINATED BY char

This specifies the character that terminates each field of data.

For example, tab-delimited data would require you to specify FIELDS TERMINATED BY '\t'.

With the last line of code:

STORED AS {TEXTFILE | SEQUENCEFILE}

You may declare the file format; STORED AS TEXTFILE is the default and does not need to be specified.

The full syntax to create a new table is as given below:

CREATE TABLE tablename (colnames DATA TYPES)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY char
STORED AS {TEXTFILE | SEQUENCEFILE};

This syntax creates a subdirectory in the database’s warehouse directory in HDFS. For the default database, the table will be stored in the default HDFS location.

For a named database, the syntax will create a directory named databasename.db inside the default HDFS location and save the table in that directory.

Let’s see an example of how to create a table named simplilearn.

Table Creation - Example

The following example shows how to create a new table named simplilearn.


Data will be stored as text with four comma-separated fields per line.
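A minimal sketch of such a statement (the column names are illustrative, not from the original screenshot):

CREATE TABLE simplilearn (
    id INT,
    fname STRING,
    lname STRING,
    city STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;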

Data Types in Hive

In this topic, you will learn about the data types in Hive, and understand the process of changing the table data location and creating an external table.

There are three data types in Hive: primitive, complex, and user-defined. Let’s discuss each data type.

Primitive data types:

Primitive data types include integers such as TINYINT, SMALLINT, INT, and BIGINT.

They also include BOOLEAN, floating-point types such as FLOAT and DOUBLE, and STRING.

Complex data types:

Complex types include Structs, Maps, and Arrays.

User-defined data types:

User-defined types include structures with attributes of any type.
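To make the complex types concrete, here is a sketch of a table declaration that mixes primitive and complex types (the schema is hypothetical):

CREATE TABLE employees (
    name         STRING,
    salary       FLOAT,
    subordinates ARRAY<STRING>,
    deductions   MAP<STRING, FLOAT>,
    address      STRUCT<street:STRING, city:STRING, zip:INT>
);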

Let’s understand the steps to change table data location.

Changing Table Data Location

By default, table data is stored in the default warehouse location: /user/hive/warehouse.

Use LOCATION to specify the directory in HDFS where you want your data to reside.

In the example given below, we are creating the simplilearn table and, by using LOCATION, specifying where the simplilearn data will reside.

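A sketch of the syntax (the column names are illustrative; the path is the one used later in this lesson):

CREATE TABLE simplilearn (id INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/simplilearn/data';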

External Tables

By default, tables are internal or managed: when such a table is dropped, its data is also deleted from its location in HDFS.

  • You need to use the EXTERNAL keyword to create an external table.

  • Dropping the external table removes only its metadata and not the data.

The syntax for creating an external table is given below.
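A minimal sketch (the column names and path are illustrative):

CREATE EXTERNAL TABLE simplilearn (id INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/simplilearn/data';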

In the next topic, you will learn about the process of validating data and loading it into Hive and Impala tables from HDFS and an RDBMS.

Validation of Data

In Impala and Hive, a schema is applied to the data as it is read from its stored location, rather than being enforced when the data is loaded. Impala and Hive are therefore described as “schema on read.”

Unlike an RDBMS, they do not validate data on insert. Files are simply moved into place, which makes loading data into tables very fast. Errors in the file format will be discovered only when queries are performed.

In the next section of this Hive tutorial, let us understand how data can be loaded in the Hive Table.

Loading of Data

Data can be loaded into a Hive table in the following two ways:

  • Move the data directly from an HDFS file into the Hive table’s directory using the syntax:

hdfs dfs -mv /simplilearn/data /user/hive/warehouse/simplilearn/

Alternatively,

  • Load data by following the Hive syntax:

LOAD DATA INPATH '/simplilearn/data'

OVERWRITE INTO TABLE simplilearn;

The OVERWRITE keyword removes all files within the table’s directory, then moves the new files into the directory.

In the next topic, you will learn how to load data into a Hive table from an RDBMS.

Loading Data from RDBMS

Sqoop has built-in support for importing data into Hive and Impala.

Using the Hive import option in Sqoop, you can create a table in the Hive Metastore and also import data from the RDBMS into the table’s directory in HDFS. For loading data from an RDBMS using Sqoop, the following syntax can be used.

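A sketch of such a command, with placeholder connection details (the host, database, credentials, and table name are illustrative):

sqoop import \
  --connect jdbc:mysql://dbhost/database \
  --username dbuser \
  --password dbpass \
  --table simplilearn \
  --hive-import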

Note: --hive-import creates a table accessible in both Hive and Impala.

In the next section, we will look into the HCatalog and its uses.

HCatalog and Its Uses

In this topic, you will learn about HCatalog, Impala on a cluster, and updating the Impala metadata cache.

HCatalog is a Hive sub-project that provides access to the Hive Metastore. Its uses are the following:

  • It allows users to define tables using HiveQL DDL syntax.

  • It is accessible via the command line and a REST API.

  • It allows tables created through HCatalog to be accessed from Hive, Impala, MapReduce, Pig, and so on.
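For example, a table could be defined through the HCatalog command-line tool (a sketch; the table definition is illustrative):

hcat -e "CREATE TABLE simplilearn (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"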

Impala on Cluster

In Impala, along with the NameNode, there are two other Impala daemons on the master node: the Catalog daemon, which relays metadata changes to all the Impala daemons in a cluster, and the StateStore daemon, which provides a lookup service for Impala daemons and also periodically checks their status.

The structure of Impala on a cluster is as shown below.

(Diagram: the structure of Impala on a cluster)

Note: Each worker node in the cluster runs an Impala daemon, which is collocated with the HDFS slave daemon, the DataNode.

Let’s understand the concept of loading data into Impala Metadata cache.

Loading Data into Impala Metadata Cache

When any new table is added to the Metastore, you need to execute the INVALIDATE METADATA statement. This marks the entire cache as stale, and the metadata cache is reloaded as required.

Similarly, if the table schema is modified or new data is added to a table, you need to execute: REFRESH tablename

This reloads the metadata for that one table immediately and reloads the HDFS block locations for only the new files.

If data in a table is extensively altered, for example by HDFS balancing, you need to execute: INVALIDATE METADATA tablename

This marks the metadata for a single table as stale. When the metadata is next required, all HDFS block locations are retrieved.
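Put together, the cache-management statements look like this (the table name simplilearn is illustrative):

-- a table was created outside Impala: mark the whole cache stale
INVALIDATE METADATA;

-- new data files were appended to an existing table
REFRESH simplilearn;

-- a table's data files were extensively changed (e.g., by HDFS balancing)
INVALIDATE METADATA simplilearn;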

Summary

Let us summarize what we have covered in this Hive tutorial.

  • Each table maps to a directory in HDFS.

  • The Hive Metastore stores metadata (data about the data) in an RDBMS.

  • Tables are created and managed using the Impala SQL or HiveQL DDL.

  • Sqoop provides support for importing data into Hive and Impala from an RDBMS.

  • HCatalog provides access to the Hive Metastore from tools outside Hive or Impala (for example, Pig or MapReduce).

Conclusion

This brings us to the end of the lesson on Working with Hive and Impala. In the next lesson, you will learn about the Types of Data Formats.
