Flow

Objective of the demo

The goal of this data pipeline flow is to demonstrate a typical (but simplified) ETL flow in Hadoop using Falcon and Atlas.

Introduction

As part of this flow, we ingest data files that are copied to the landing zone on a gateway server and then process them automatically at a regular interval using Falcon. When the workflow runs, the files are ingested, stored, and transformed, and the transformed data is exported with Sqoop out of the cluster into a MySQL database.
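
A minimal sketch of the kind of Falcon process entity that drives such a schedule is shown here; the process name, cluster name, workflow path, and frequency are assumptions for illustration, not the repository's actual definition:

    <process name="etl-transform-process" xmlns="uri:falcon:process:0.1">
        <clusters>
            <cluster name="primaryCluster">
                <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            </cluster>
        </clusters>
        <parallel>1</parallel>
        <order>FIFO</order>
        <frequency>hours(1)</frequency>
        <workflow engine="oozie" path="/apps/data-pipeline/workflow"/>
        <retry policy="periodic" delay="minutes(10)" attempts="3"/>
    </process>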

Once the data is processed, the Hive processing lineage is available in Apache Atlas (optional).

This flow involves the following processes and steps:

  • Gateway Server

    • runs a Flume agent with a spooling-directory source that transfers the data files to HDFS (a minimal agent configuration is sketched after this list)
  • Master Server

    • runs the Falcon job that performs the following steps (a HiveQL outline also follows the list)
      • Clear the temp tables
      • Make a copy of the incoming data for backup purposes
      • Insert the raw data into a temp table
      • Convert the XML data into JSON and insert it into another table
      • Apply the aggregation / transformation process to the JSON table
      • Insert the processed data (from the temp table) into the history table
      • Export the data out of Hive to a MySQL table
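
A minimal sketch of the spooling-directory Flume agent on the gateway server; the agent name, spool directory, and HDFS target path are assumptions, not the repository's actual configuration:

    # Spooling-directory source -> memory channel -> HDFS sink (names and paths are hypothetical)
    gateway.sources  = spool-src
    gateway.channels = mem-ch
    gateway.sinks    = hdfs-sink

    gateway.sources.spool-src.type     = spooldir
    gateway.sources.spool-src.spoolDir = /landing_zone/input
    gateway.sources.spool-src.channels = mem-ch

    gateway.channels.mem-ch.type     = memory
    gateway.channels.mem-ch.capacity = 10000

    gateway.sinks.hdfs-sink.type          = hdfs
    gateway.sinks.hdfs-sink.channel       = mem-ch
    gateway.sinks.hdfs-sink.hdfs.path     = /data/pipeline/input
    gateway.sinks.hdfs-sink.hdfs.fileType = DataStream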

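The Falcon-driven steps listed above map naturally onto a small HiveQL script followed by a Sqoop export. A hypothetical outline is shown below; every table, column, and path name is a placeholder, and xml_to_json stands in for whatever XML-parsing UDF or SerDe the actual job uses:

    -- clear the temp tables
    TRUNCATE TABLE raw_xml_tmp;
    TRUNCATE TABLE events_json_tmp;

    -- load the raw XML data into a temp table
    LOAD DATA INPATH '/data/pipeline/input' INTO TABLE raw_xml_tmp;

    -- convert the XML records to JSON and insert them into another table
    INSERT INTO TABLE events_json_tmp
    SELECT xml_to_json(xml_record) FROM raw_xml_tmp;

    -- apply the aggregation / transformation to the JSON table
    INSERT OVERWRITE TABLE events_summary
    SELECT event_type, COUNT(*) AS event_count
    FROM events_json_tmp
    GROUP BY event_type;

    -- keep a full copy in the history table
    INSERT INTO TABLE events_history
    SELECT * FROM events_json_tmp;

The backup copy of the incoming files would typically be a plain HDFS copy (done before the load, since LOAD DATA INPATH moves the files), and the final export out of Hive into MySQL would be a Sqoop invocation along these lines (host, database, credentials, and warehouse path again being placeholders):

    hdfs dfs -cp /data/pipeline/input /data/pipeline/backup
    sqoop export \
        --connect jdbc:mysql://mysql-host/pipeline_db \
        --username demo --password-file /user/admin/.mysql.password \
        --table events_summary \
        --export-dir /apps/hive/warehouse/events_summary \
        --input-fields-terminated-by '\001'
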
Architecture

[Diagram] Data Pipeline Flow 1 - Architecture

Setting up the project

Prerequisite - Download and install the Hortonworks Sandbox

IMPORTANT NOTE:

This demo assumes that the automated process runs as the user admin.
If you plan to use another user, make sure that file and directory permissions are adjusted accordingly (see the example below).
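
For example, if the flow should run as a different user (here a hypothetical falcon_user), the landing zone on the gateway server and the HDFS staging directories would need to be handed over to that user along these lines (paths and groups are placeholders):

    # local landing zone watched by the Flume spooling-directory source
    sudo chown -R falcon_user:hadoop /landing_zone

    # HDFS staging and output directories used by the workflow
    hdfs dfs -chown -R falcon_user:hdfs /data/pipeline
    hdfs dfs -chmod -R 755 /data/pipeline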

                                                                                                                                              
