The goal of this data pipeline flow is to demonstrate a typical (but simplified) ETL flow in Hadoop using Falcon and Atlas.
As part of this flow, we ingest data files that are copied to the landing zone on a gateway server, and then process them automatically at a regular interval using Falcon. When the workflow runs, the files are ingested, stored, and transformed, and the transformed data is exported out of the cluster into a MySQL database using Sqoop.
Once the data is processed, the Hive processing lineage is available in Apache Atlas (optional).
This flow involves the following processes and steps:
- Gateway Server
  - has the Flume agent running with a Spooling Directory configuration to transfer the data files to HDFS (see the Flume configuration sketch after this list)
- Master Server
  - has the Falcon job running to do the following (see the Hive and Sqoop sketches after this list):
    - Clear the temp tables
    - Make a copy of the incoming data for backup purposes
    - Insert the raw data into a temp table
    - Convert the XML data into JSON and insert it into another table
    - Apply the aggregation / transformation process to the JSON table
    - Insert the data from the processed (temp) table into the history table
    - Export the data out of Hive to a MySQL table
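The Flume agent on the gateway server is described by a properties file. Below is a minimal sketch of a Spooling Directory source feeding an HDFS sink; the agent name (a1), the spool directory, and the HDFS target path are placeholder assumptions for this walkthrough, not the exact values shipped with the demo.

```
# Minimal spooling-directory -> HDFS agent (names and paths are placeholders)
a1.sources  = spool-src
a1.channels = mem-ch
a1.sinks    = hdfs-sink

# Spooling Directory source: watches the landing zone for newly dropped files
a1.sources.spool-src.type     = spooldir
a1.sources.spool-src.spoolDir = /tmp/landing_zone
a1.sources.spool-src.channels = mem-ch

# In-memory channel buffering events between the source and the sink
a1.channels.mem-ch.type     = memory
a1.channels.mem-ch.capacity = 10000

# HDFS sink: lands the ingested records under a dated raw-data directory
a1.sinks.hdfs-sink.type                   = hdfs
a1.sinks.hdfs-sink.channel                = mem-ch
a1.sinks.hdfs-sink.hdfs.path              = /user/admin/data/raw/%Y-%m-%d
a1.sinks.hdfs-sink.hdfs.fileType          = DataStream
a1.sinks.hdfs-sink.hdfs.rollCount         = 0
a1.sinks.hdfs-sink.hdfs.rollSize          = 0
a1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
```

Such an agent would be started with `flume-ng agent --conf /etc/flume/conf --conf-file <file above> --name a1`.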
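To make the Falcon-driven processing more concrete, the HiveQL below sketches the clear / load / convert / aggregate / archive steps. The table names, the raw-data path, and the xml_to_json function are assumptions made for this sketch; the demo supplies its own scripts, and the XML-to-JSON conversion in particular depends on the UDF or SerDe bundled with it.

```
-- 1. Clear the temp tables from the previous run
TRUNCATE TABLE raw_temp;
TRUNCATE TABLE events_json;

-- 2. Insert the newly ingested raw XML records into the temp table
LOAD DATA INPATH '/user/admin/data/raw' INTO TABLE raw_temp;

-- 3. Convert the XML data into JSON and insert it into another table
--    (xml_to_json stands in for whatever UDF/SerDe the demo uses)
INSERT OVERWRITE TABLE events_json
SELECT xml_to_json(xml_record) AS json_record FROM raw_temp;

-- 4. Apply the aggregation / transformation to the JSON table
INSERT OVERWRITE TABLE events_agg
SELECT get_json_object(json_record, '$.id') AS id,
       count(*)                             AS event_count
FROM events_json
GROUP BY get_json_object(json_record, '$.id');

-- 5. Append the processed data to the history table
INSERT INTO TABLE events_history
SELECT * FROM events_agg;
```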
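The final step, pushing the transformed data out of Hive into MySQL, is typically a `sqoop export`. The JDBC URL, credentials, table name, and export directory below are placeholders; the export directory should point at the HDFS location backing the Hive table being exported, and '\001' is Hive's default field delimiter.

```
sqoop export \
  --connect jdbc:mysql://mysql-host:3306/demo_db \
  --username demo_user -P \
  --table events_agg \
  --export-dir /apps/hive/warehouse/events_agg \
  --input-fields-terminated-by '\001'
```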
Prerequisite - Download and install the Hortonworks Sandbox
IMPORTANT NOTE:
This demo assumes that the automated processes run as the user admin.
If you plan to use a different user, make sure that file and directory permissions are adjusted accordingly (see the example below).
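For example, if the jobs are to run as a different user (etl_user is a placeholder name here), the landing zone on the gateway server and the staging directories in HDFS would need to be made writable for that user, along these lines:

```
# Local landing zone on the gateway server (path is a placeholder)
chown -R etl_user:hadoop /tmp/landing_zone

# Raw / staging data directories in HDFS
hdfs dfs -chown -R etl_user:hdfs /user/etl_user/data
hdfs dfs -chmod -R 775 /user/etl_user/data
```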