Big Data (Hadoop)

About Big Data (Hadoop)

Big Data refers to information assets whose volume, velocity and variety require particular technologies and analytical methods to generate value, and which generally exceed the capabilities of a single machine, calling for parallel processing.

Hadoop is an open-source, Java-based framework for storing and processing big data. Data is stored on inexpensive commodity servers that run as clusters, and Hadoop's distributed file system enables concurrent processing and fault tolerance. It provides massive storage for any kind of data, enormous processing power and the ability to handle a virtually limitless number of concurrent tasks or jobs.
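The "concurrent processing" described above is usually illustrated with MapReduce's classic word-count job. Below is a minimal sketch of that idea in plain Python, simulated in-process rather than on a cluster; the `mapper` and `reducer` functions and the sample `docs` input are illustrative, not part of any Hadoop API. With Hadoop Streaming, scripts with this shape would read from stdin and run in parallel across many nodes.

```python
# Minimal sketch of Hadoop's MapReduce model, simulated in plain Python
# (no cluster required). On a real cluster, many mapper instances run in
# parallel over different blocks of the input, and the framework sorts
# map output by key before handing it to the reducers.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word.
    Sorting here mimics Hadoop's shuffle-and-sort step, which guarantees
    that all pairs sharing a key reach the same reducer, grouped together."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    counts = dict(reducer(mapper(docs)))
    print(counts)
    # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

The same two functions, unchanged in spirit, scale from this three-line input to terabytes, because each phase only ever looks at one record (or one key group) at a time.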

The main difference between the two is that Big Data is treated as an asset, which can be valuable, whereas Hadoop is the program that brings out the value from that asset. Big Data is raw and unsorted, whereas Hadoop is designed to manage and handle such complicated, sophisticated data.

Who Uses

Zillow, Redfin and Trulia use Hadoop and big data to democratize real estate data for consumers through customer analysis. More than 350 large companies use Hadoop in their stack, including:

  • Netflix
  • Uber
  • Twitter
  • Spotify
  • Shopify
  • Airbnb

SCOPE

Big Data is a fast-growing field with exciting opportunities for professionals in all industries and across the globe. With demand for skilled big data professionals continuing to rise, now is a great time to enter the job market. Big Data is influencing the IT industry as few technologies or trends have before; it can help companies improve their decision-making and compete on another level.

Eligibility

While there are no strict requirements for learning Big Data Hadoop, basic knowledge in the following areas will make it easier to grasp the course:

  • Computer programming skills
  • SQL knowledge
  • Linux

NOTE: People who hold a bachelor's or master's degree in science, mathematics, engineering, finance, economics or statistics can grasp Big Data Hadoop with ease.

  • Experienced Faculty
  • Certification
  • Placement Assistance

JOB OPPORTUNITIES

A few popular Big Data job titles are listed below:

  • Hadoop / Big Data Developer
  • Hadoop Administrator
  • Data Engineer
  • Big Data Analyst
  • Machine Learning Engineer
  • Software Development Engineer
  • Big Data Engineer
  • Big Data Consultant

Course Syllabus

  • Introduction to Big Data
  • Characteristics
  • The Why, How and What of Big Data
  • Existing OLTP, ETL, DWH and OLAP systems
  • Introduction to Hadoop Ecosystem
  • Architecture-HDFS
  • Sharding, Distribution and Replication factor (SDR)
  • Daemons
  • MapReduce (MRv1) and YARN
  • Hadoop v1 and v2
  • Hadoop Data federation
  • Prerequisite for Installation
  • Single-node, pseudo-distributed and multi-node clusters
  • Virtual machines using Linux (Ubuntu/CentOS)
  • Installation of Hadoop in the cloud (Azure/AWS)
  • Installation of Java, SSH and Eclipse
  • Installation and configuration of Hadoop, HDFS daemons and YARN daemons
  • High Availability (Active and Standby)
  • Automatic and manual failover
  • Hadoop fs shell commands
  • Writing data to HDFS
  • Reading data from HDFS
  • Rack awareness policy and Replica placement Strategy
  • Failure Handling
  • Namenode
  • Datanode
  • Block-Safe mode
  • Rebalancing and load optimization
  • Troubleshooting and error rectification
  • Hadoop fs shell commands: Unix and Java basics
  • Assessment 1
  • Introduction to MapReduce
  • Architecture of MapReduce
  • Executing MapReduce in YARN
  • Application Master, Resource Manager and Node Manager
  • Input format, input split and key-value pairs
  • Classes and methods of the MapReduce paradigm
  • Mapper
  • Reducer
  • Partitioner
  • Custom and Default partition
  • Shuffle and Sort
  • Combiner and Scheduler
  • Application Master/Manager
  • Container and Node Manager
  • MapReduce hands-on
  • Word count program / log analytics
  • Hadoop streaming in R/Python
  • Data processing transformations
  • Map-only jobs and uber jobs
  • Inverted index and searches
  • MR Programs 2
  • Structured and Unstructured Data handling
  • Optimizing using Combiner
  • Partitioner
  • Single and multiple column
  • Inverted Index
  • XML: semi-structured data
  • Map-side joins
  • Reduce-side joins
  • Introduction to Hive Data warehouse
  • Installation of Hive and the metastore database
  • Configuring the metastore to use MySQL
  • Hive QL commands
  • Manipulation and analytical functions in Hive
  • Managed and external tables
  • Partitioning and Bucketing
  • Complex data types and Unstructured data
  • Advanced HQL commands
  • UDF and UDAF
  • Integration with HBase
  • SerDe / regular expressions
  • Introduction to Pig
  • Installation; bags and collections
  • Commands and Scripts
  • Pig UDF
  • JSON to AVRO file conversion
  • Converting compressed Parquet files to uncompressed
  • AVRO schema and data file
  • ORC file
  • Assessment 2
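The syllabus block above includes the partitioner and the shuffle-and-sort step, which decide how map output is routed to reducers. A minimal sketch of that routing in plain Python follows; the function names `partition` and `shuffle` are illustrative, and CRC32 stands in for Hadoop's key hash (Python's built-in `hash` is randomized per run for strings, so it is not a stable choice).

```python
# Sketch of how a MapReduce partitioner routes map output to reducers.
# Hadoop's default HashPartitioner sends each key to hash(key) mod
# numReducers, so every value for a given key lands on the same reducer.
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Stable stand-in for the default hash partitioner."""
    return zlib.crc32(key.encode()) % num_reducers

def shuffle(pairs, num_reducers):
    """Group (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

if __name__ == "__main__":
    pairs = [("apple", 1), ("banana", 1), ("apple", 2), ("cherry", 1)]
    for i, bucket in enumerate(shuffle(pairs, 3)):
        print(f"reducer {i}: {bucket}")
```

A custom partitioner in Hadoop is the same idea with a different `partition` body, for example routing by the first letter of the key to control which reducer writes which output file.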

  • Introduction to NOSQL
  • ACID / CAP / BASE
  • Key-value stores
  • MapReduce
  • Column family stores
  • HBase
  • Document stores
  • MongoDB
  • Graph DB
  • Neo4j
  • Introduction to HBASE and installation
  • The HBase Data Model
  • The HBase Shell
  • HBase Architecture
  • Schema Design
  • The HBase API
  • HBase Configuration and Tuning
  • Ingesting data from RDBMS
  • Introduction to Sqoop and installation
  • Importing and exporting data from and to RDBMS
  • Bulk loading, incremental load, split-by and conditional queries
  • Sqoop validation and jobs
  • Ingest streaming data
  • Flume Architecture
  • Agent, source, sink and channel
  • Ingesting log files
  • Collecting data from Twitter for sentiment analysis
  • Assessment 3
  • Integrate With ETL
  • Talend Big Data edition: big data components
  • Big data Analytics
  • Dimensional modelling
  • Data Visualization
  • Tableau: Hive and Spark SQL connectors
  • Spark Core and components
  • Spark shell
  • Creating RDDs from HDFS/local files
  • Creating new RDDs; transformations on RDDs
  • Lineage graph (DAG)
  • Actions on RDD
  • RDD persist and cache; lazy evaluation of RDDs
  • Hands on and core concepts of map() transformation
  • Hands on and core concepts of filter() transformation
  • Hands on and core concepts of flatMap() transformation
  • Comparing map and flatMap transformations
  • Hands on and core concepts of reduce() action
  • Hands on and core concepts of fold() action
  • Hands on and core concepts of aggregate() action
  • Basics of Accumulator
  • Hands on and core concepts of collect() action
  • Hands on and core concepts of take() action
  • Apache Spark Execution Model
  • How Spark executes a program
  • Concepts of RDD partitioning
  • RDD data shuffling and performance issues
  • DataFrames and Datasets
  • Spark SQL
  • PySpark
  • Spark jobs
  • Building Scala programs using SBT/Maven
  • spark-submit and Spark applications
  • Kafka: publisher/subscriber model
  • Consumers and producers
  • HUE
  • Monitoring and scheduling
  • Zeppelin
  • Oozie: workflow and coordinator
  • Distribution Installation on cloud or Sandbox
  • Cloudera (Cloudera Manager)
  • Hortonworks (Ambari server)
  • MapR (MCS)
  • Introduction to data science: machine learning, statistical analysis and sentiment analysis
  • Multi-node cluster setup: high availability, Hadoop data federation, commissioning and decommissioning, automatic and manual failover, ZooKeeper failover controller
  • Use cases, case studies and proofs of concept; working on different distributions
  • CCA Spark and Hadoop Developer Exam (CCA175)
  • CCP Data Engineer (DE575)
  • HDPCD Certification
  • HDP Certified Apache Spark Developer
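The Spark portion of the syllabus covers the core RDD transformations and actions (map, filter, flatMap, reduce, fold, take, collect). Their semantics have close plain-Python analogues, sketched below on a local list without a Spark installation; the variable names are illustrative, and note that in real Spark these operations run lazily and in parallel across partitions.

```python
# Plain-Python analogues of the Spark RDD operations listed in the syllabus.
# This only mirrors their semantics locally; Spark evaluates transformations
# lazily and distributes the work across cluster partitions.
from functools import reduce

data = [1, 2, 3, 4, 5]

# map(): one output element per input element
squares = list(map(lambda x: x * x, data))            # [1, 4, 9, 16, 25]

# filter(): keep only elements matching a predicate
evens = list(filter(lambda x: x % 2 == 0, data))      # [2, 4]

# flatMap(): map then flatten - zero or more outputs per input,
# unlike map(), which always emits exactly one
lines = ["a b", "c"]
words = [w for line in lines for w in line.split()]   # ['a', 'b', 'c']

# reduce(): combine elements pairwise into a single value
total = reduce(lambda a, b: a + b, data)              # 15

# fold(): like reduce but with an initial "zero" value
# (in Spark the zero is applied per partition, so it should be neutral)
folded = reduce(lambda a, b: a + b, data, 10)         # 25

# take(n) / collect(): in Spark these pull results back to the driver;
# locally they are just slicing and materializing the list
first_three = data[:3]                                # [1, 2, 3]
```

Rewriting each line against a real `SparkContext` (e.g. `sc.parallelize(data).map(...)`) is a useful first exercise once a Spark shell is available.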