
Thursday, August 22, 2019

Loading Data into Hadoop HDFS


Moving Data into Hadoop:

Load Scenarios:

Once we understand the importance of putting big data into Hadoop, an early
question concerns the data life cycle and data movement.
How do you load data into the cluster? How do you automate the flow of such a huge
amount of data? What are the scenarios for loading data into Hadoop?

We will look at four different load scenarios:

1. Data at rest
2. Data in motion
3. Data from a web server or a database log
4. Data from a data warehouse


What is Data at rest?

This is data that already sits in a file in some directory.
It is at rest, meaning that no further updates are planned for it,
so it can be transferred as is. The transfer can be accomplished using
standard HDFS shell commands such as cp, copyFromLocal, or put, or
using the BigInsights web console.
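
As a minimal sketch of that approach (assuming the hdfs client is on the PATH and the
target HDFS directory already exists; the file paths here are only placeholders), the
same shell command can be driven from Python with the standard subprocess module:

import subprocess

def put_into_hdfs(local_path, hdfs_dir):
    """Copy a local file that is at rest into HDFS with the standard shell command."""
    # 'hdfs dfs -put <localsrc> <dst>' transfers the file as is into the cluster.
    subprocess.run(["hdfs", "dfs", "-put", local_path, hdfs_dir], check=True)

if __name__ == "__main__":
    # Placeholder paths -- substitute your own file and HDFS directory.
    put_into_hdfs("/data/archive/sales_2019.csv", "/user/hadoop/landing/")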


What about when data is in motion?

First of all, what is meant by data in motion?
This is data that is continuously being updated. New data might be added regularly to the
data sources, data might be appended to a file, or
discrete logs might need to be merged into one log.
You need the capability to merge the files before copying them into Hadoop.
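
As a hedged sketch of that merge-then-copy step (the log file pattern, merged file name,
and HDFS target directory are all hypothetical), the discrete log files can be concatenated
locally and the result pushed into HDFS in one piece:

import glob
import shutil
import subprocess

def merge_and_load(log_pattern, merged_path, hdfs_dir):
    """Merge several discrete log files into one file, then copy the result into HDFS."""
    # Concatenate every log file matching the pattern into a single merged log.
    with open(merged_path, "wb") as merged:
        for log_file in sorted(glob.glob(log_pattern)):
            with open(log_file, "rb") as part:
                shutil.copyfileobj(part, merged)
    # Copy the merged log into HDFS with the standard shell command.
    subprocess.run(["hdfs", "dfs", "-put", merged_path, hdfs_dir], check=True)

if __name__ == "__main__":
    # Hypothetical paths for illustration only.
    merge_and_load("/var/log/app/access-*.log", "/tmp/access-merged.log", "/user/hadoop/logs/")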


What about data from a web server or a data warehouse?

Another scenario is data from a web server, such as WebSphere Application
Server or an Apache web server, or data in database server logs or application
logs. When moving data from a data warehouse, or any RDBMS for that matter,
we could export the data and then use Hadoop commands to import it.
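
As a rough illustration of that export-then-import pattern (using Python's standard
sqlite3 module as a stand-in for whatever warehouse or RDBMS you actually have, with a
hypothetical table and hypothetical paths), the data is first dumped to a flat file and
then loaded with a Hadoop command:

import csv
import sqlite3
import subprocess

def export_and_import(db_path, table, export_path, hdfs_dir):
    """Export a table to a CSV file, then import that file into HDFS."""
    # Step 1: export the table to a delimited flat file.
    with sqlite3.connect(db_path) as conn, open(export_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerows(conn.execute("SELECT * FROM " + table))
    # Step 2: import the exported file into HDFS.
    subprocess.run(["hdfs", "dfs", "-put", export_path, hdfs_dir], check=True)

if __name__ == "__main__":
    # Hypothetical database, table, and paths for illustration only.
    export_and_import("warehouse.db", "orders", "/tmp/orders.csv", "/user/hadoop/warehouse/")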

If you are working with a Netezza system, then you can use the Jaql Netezza module to
both read from and write to Netezza tables. Data can also be moved using BigSQL Load.
We also have Flume.


What is Flume?


Flume is a three-tiered distributed service, built from logical nodes, for collecting and
optionally processing data. The first tier, the agent tier, has Flume agents
installed at the sources of the data. These agents send their data to the second
tier, the collector tier. The collectors aggregate the data and in turn forward it to the
final storage tier, such as HDFS. Each logical node has a source and a sink.
The source tells Flume where to collect data from, and the sink specifies where the data
is to be sent. Interceptors (sometimes called decorators
or annotators) can optionally be configured to perform some simple processing on the
data as it passes through. Flume also uses the concept of a physical node.
A physical node corresponds to a single Java process, running as a single JVM on one
machine in the cluster. Here the concepts of physical machine and physical node are usually
synonymous, but sometimes a physical node can host multiple logical nodes.
We will see that Flume is a great tool for collecting data from a web server or from
database logs. An alternative approach here would be
to use Java Management Extensions (JMX) commands.
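
To make the source and sink ideas concrete, here is a hedged sketch of an agent
configuration in the property-file syntax used by newer Flume (Flume NG) releases, where
each agent is wired from a source, a channel, and a sink; the agent name, log path, and
HDFS path are all hypothetical:

# Hypothetical Flume agent: tail a web server log and land the events in HDFS.
agent1.sources  = r1
agent1.channels = c1
agent1.sinks    = k1

# Source: read new lines as they are appended to the access log.
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /var/log/httpd/access_log
agent1.sources.r1.channels = c1

# Channel: buffer events in memory between the source and the sink.
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 10000

# Sink: write the collected events to the final storage tier, HDFS.
agent1.sinks.k1.type = hdfs
agent1.sinks.k1.channel = c1
agent1.sinks.k1.hdfs.path = /user/hadoop/weblogs
agent1.sinks.k1.hdfs.fileType = DataStream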

Source: Quora
