This document discusses using Sqoop to transfer data between relational databases and Hadoop. It begins by providing context on big data and Hadoop, then introduces Sqoop as a tool for efficiently importing and exporting large amounts of structured data between databases and Hadoop. The document explains that Sqoop allows importing data from databases into HDFS for analysis and exporting summarized data back to databases. It also outlines how Sqoop works, including its pluggable connector mechanism and support for scheduled jobs.
Scaling up with Hadoop and Banyan at ITRIX-2015, College of Engineering, Guindy - Rohit Kulkarni
The document provides an overview of LatentView Analytics, data processing frameworks, and MapReduce. It introduces LatentView Analytics, describing its services, partners, and experience. It then discusses distributed and parallel processing frameworks such as Hadoop, Spark, and Storm, and gives a brief history of Hadoop, describing its key developments from 1999 to the present day in addressing challenges of indexing, crawling, and distributed processing. Finally, it explains the MapReduce process and provides a simple example to illustrate the mapping and reducing functions.
This document provides an overview of the Sqoop tool, which is used to transfer data between Hadoop and relational database servers. Sqoop can import data from databases into HDFS and export data from HDFS to databases. The document describes how Sqoop works, provides installation instructions, and outlines various Sqoop commands for import, export, jobs, code generation, and interacting with databases.
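To make the job and code-generation commands mentioned above concrete, here is a minimal sketch of what they typically look like. The database name (sales), host (dbhost), user, table, job name, and paths are hypothetical; the commands and flags are standard Sqoop 1 usage.

    # Save a reusable import definition (handy for scheduling via cron or Oozie);
    # this one is an incremental import keyed on an ever-increasing order_id column.
    sqoop job --create daily_orders_import -- import \
      --connect jdbc:mysql://dbhost/sales \
      --username sqoop_user -P \
      --table orders \
      --incremental append --check-column order_id --last-value 0 \
      --target-dir /data/orders

    # Inspect and run the saved job; Sqoop remembers the last imported value between runs.
    sqoop job --list
    sqoop job --show daily_orders_import
    sqoop job --exec daily_orders_import

    # Generate the Java class Sqoop uses to represent one record of the table.
    sqoop codegen --connect jdbc:mysql://dbhost/sales --username sqoop_user -P --table orders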
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis.
This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and show how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition) - Uwe Printz
Talk held at the IT-Stammtisch Darmstadt on 08.11.2013
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN - DataWorks Summit
DeathStar runs HBase on YARN to provide easy, dynamic, multi-tenant HBase clusters. It allows different applications to run HBase in separate application-specific clusters on shared HDFS and YARN infrastructure. This provides strict isolation between applications and enables dynamic scaling of clusters as needed. Key benefits include improved cluster utilization, easier capacity planning and configuration, and the ability to start new clusters on demand without lengthy provisioning times.
Hoodie, an open-source incremental processing framework, is summarized. Key points:
- Hoodie provides upsert and incremental processing capabilities on top of a Hadoop data lake to enable near real-time queries while avoiding costly full scans.
- It introduces primitives like upsert and incremental pull to apply mutations and consume only changed data.
- Hoodie stores data on HDFS and provides different views like read optimized, real-time, and log views to balance query performance and data latency for analytical workloads.
- The framework is open source and built on Spark, providing horizontal scalability and leveraging existing Hadoop SQL query engines like Hive and Presto.
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of - Charles Givre
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool which can help address this. Drill works with many different forms of “self describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike HIVE or other SQL on Hadoop tools, Drill is not a wrapper for Map-Reduce and can scale to clusters of up to 10k nodes.
This document summarizes Hortonworks' Hadoop distribution called Hortonworks Data Platform (HDP). It discusses how HDP provides a comprehensive data management platform built around Apache Hadoop and YARN. HDP includes tools for storage, processing, security, operations and accessing data through batch, interactive and real-time methods. The document also outlines new capabilities in HDP 2.2 like improved engines for SQL, Spark and streaming and expanded deployment options.
Building a Business on Hadoop, HBase, and Open Source Distributed Computing - Bradford Stephens
This is a talk on a fundamental approach to thinking about scalability, and how Hadoop, HBase, and Lucene are enabling companies to process amazing amounts of data. It's also about how Social Media is making the traditional RDBMS irrelevant.
The document discusses the new version of Apache Sqoop (Sqoop 2), which aims to address challenges with the previous version. Sqoop 2 features a client-server architecture for easier installation and management, a REST API for improved integration with tools like Oozie, and enhanced security. It is designed to make data transfer between Hadoop and external systems simpler, more extensible, and more secure.
Cisco Connect Toronto 2015: Big Data - Sean McKeown, Cisco Canada
The document provides an overview of big data concepts and architectures. It discusses key topics like Hadoop, HDFS, MapReduce, NoSQL databases, and MPP relational databases. It also covers network design considerations for big data, common traffic patterns in Hadoop, and how to optimize performance through techniques like data locality and quality of service policies.
Introduction to the Hadoop Ecosystem (FrOSCon Edition) - Uwe Printz
Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
Big data and Hadoop are introduced as ways to handle the increasing volume, variety, and velocity of data. Hadoop evolved as a solution to process large amounts of unstructured and semi-structured data across distributed systems in a cost-effective way using commodity hardware. It provides scalable and parallel processing via MapReduce and HDFS distributed file system that stores data across clusters and provides redundancy and failover. Key Hadoop projects include HDFS, MapReduce, HBase, Hive, Pig and Zookeeper.
The document provides an overview of the Apache Hadoop ecosystem. It describes Hadoop as a distributed, scalable storage and computation system based on Google's architecture. The ecosystem includes many related projects that interact, such as YARN, HDFS, Impala, Avro, Crunch, and HBase. These projects innovate independently but work together, with Hadoop serving as a flexible data platform at the core.
This document summarizes Syncsort's high performance data integration solutions for Hadoop contexts. Syncsort has over 40 years of experience innovating performance solutions. Their DMExpress product provides high-speed connectivity to Hadoop and accelerates ETL workflows. It uses partitioning and parallelization to load data into HDFS 6x faster than native methods. DMExpress also enhances usability with a graphical interface and accelerates MapReduce jobs by replacing sort functions. Customers report TCO reductions of 50-75% and ROI within 12 months by using DMExpress to optimize their Hadoop deployments.
Apache HBase - Introduction & Use Cases - Data Con LA
HBase is an open source, distributed, column-oriented database modeled after Google's BigTable. It sits atop Hadoop, using HDFS for storage. HBase scales horizontally and supports fast random reads and writes. It is well-suited for large tables and high throughput access. Facebook uses HBase extensively for messaging and other applications due to its high write throughput and low latency reads. Other users include Flurry and Yahoo.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses challenges in big data by providing reliability, scalability, and fault tolerance. Hadoop allows distributed processing of large datasets across clusters using MapReduce and can scale from single servers to thousands of machines, each offering local computation and storage. It is widely used for applications such as log analysis, data warehousing, and web indexing.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses a programming model called MapReduce where developers write mapping and reducing functions that are automatically parallelized and executed on a large cluster. Hadoop also includes HDFS, a distributed file system that stores data across nodes providing high bandwidth. Major companies like Yahoo, Google and IBM use Hadoop to process large amounts of data from users and applications.
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv - larsgeorge
This talk shows the complexity of building a data pipeline in Hadoop, starting with the technology aspects and then correlating them to the skill sets of current Hadoop adopters.
Apache Sqoop: A Data Transfer Tool for Hadoop - Cloudera, Inc.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases. This slide deck aims at familiarizing the user with Sqoop and how to effectively use it in real deployments.
From Oracle to Hadoop with Sqoop and other tools - Guy Harrison
This document discusses tools for transferring data between relational databases and Hadoop, focusing on Apache Sqoop. It describes how Sqoop was optimized for Oracle imports and exports, reducing database load by up to 99% and improving performance by 5-20x. It also outlines the goals of Sqoop 2 to improve usability, security, and extensibility through a REST API and by separating responsibilities.
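For a sense of what a tuned Oracle transfer of the kind described above looks like, here is a minimal, hypothetical sketch: the host, service name, table, and paths are invented, and the --direct flag is assumed to hand the transfer to Sqoop's high-performance Oracle path where that specialized connector is available.

    # Hypothetical Oracle import using the direct (high-performance) path.
    sqoop import \
      --connect jdbc:oracle:thin:@//orahost:1521/ORCL \
      --username sqoop_user -P \
      --table ORDERS \
      --direct \
      --num-mappers 8 \
      --target-dir /data/orders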
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case - David Lauzon
A high-level use case description of one department of a hospital, and a comparison of two solutions: 1) a big data solution using Cloudera Impala; and 2) a traditional RDBMS solution using Oracle DB.
This document provides an introduction and overview of Apache Hadoop. It discusses how Hadoop provides the ability to store and analyze large datasets in the petabyte range across clusters of commodity hardware. It compares Hadoop to other systems like relational databases and HPC and describes how Hadoop uses MapReduce to process data in parallel. The document outlines how companies are using Hadoop for applications like log analysis, machine learning, and powering new data-driven business features and products.
The document discusses and compares MapReduce and relational database management systems (RDBMS) for large-scale data processing. It describes several hybrid approaches that attempt to combine the scalability of MapReduce with the query optimization and efficiency of parallel RDBMS. HadoopDB is highlighted as a system that uses Hadoop for communication and data distribution across nodes running PostgreSQL for query execution. Performance evaluations show hybrid systems can outperform pure MapReduce but may still lag specialized parallel databases.
Splice Machine is a SQL relational database management system built on Hadoop. It aims to provide the scalability, flexibility and cost-effectiveness of Hadoop with the transactional consistency, SQL support and real-time capabilities of a traditional RDBMS. Key features include ANSI SQL support, horizontal scaling on commodity hardware, distributed transactions using multi-version concurrency control, and massively parallel query processing by pushing computations down to individual HBase regions. It combines Apache Derby for SQL parsing and processing with HBase/HDFS for storage and distribution. This allows it to elastically scale out while supporting rich SQL, transactions, analytics and real-time updates on large datasets.
In this Introduction to Apache Sqoop the following topics are covered:
1. Why Sqoop
2. What is Sqoop
3. How Sqoop Works
4. Importing and Exporting Data using Sqoop
5. Data Import in Hive and HBase with Sqoop (see the sketch after this list)
6. Sqoop and NoSQL data stores, e.g., MongoDB
7. Resources
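The Hive and HBase imports mentioned in item 5 can be sketched as follows. This is a minimal sketch assuming a hypothetical MySQL database sales with a customers table and a pre-created HBase table and column family; all names are illustrative only.

    # Import a relational table directly into Hive (Sqoop creates the Hive table if needed).
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username sqoop_user -P \
      --table customers \
      --hive-import --hive-table customers

    # Import the same table into HBase, keyed on the source primary key.
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username sqoop_user -P \
      --table customers \
      --hbase-table customers \
      --column-family info \
      --hbase-row-key customer_id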
This document discusses connecting Hadoop and Oracle databases. It introduces the author Tanel Poder and his expertise in databases and big data. It then covers tools like Sqoop that can be used to load data between Hadoop and Oracle databases. It also discusses using query offloading to query Hadoop data directly from Oracle as if it were in an Oracle database.
The document discusses when to use Hadoop instead of a relational database management system (RDBMS) for advanced analytics. It provides examples of when queries like count distinct, cursors, and alter table statements become problematic in an RDBMS. It contrasts analyzing simple, transactional data like invoices versus complex, evolving data like customers or website visitors. Hadoop is better suited for problems involving complex objects, self-joins on large datasets, and matching large datasets. The document encourages structuring data in HDFS in a flexible way that fits the problem and use cases like simple counts on complex objects, self-self-self joins, and matching problems.
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... - Cloudera, Inc.
For self-service BI and exploratory analytic workloads, the cloud can provide a number of key benefits, but the move to the cloud isn’t all-or-nothing. Gartner predicts nearly 80 percent of businesses will adopt a hybrid strategy. Learn how a modern analytic database can power your business-critical workloads across multi-cloud and hybrid environments, while maintaining data portability. We'll also discuss how to best leverage the increased agility cloud provides, while maintaining peak performance.
This document provides an overview of big data and Hadoop. It discusses why Hadoop is useful for extremely large datasets that are difficult to manage in relational databases. It then summarizes what Hadoop is, including its core components like HDFS, MapReduce, HBase, Pig, Hive, Chukwa, and ZooKeeper. The document also outlines Hadoop's design principles and provides examples of how some of its components like MapReduce and Hive work.
This document provides an overview of resource aware scheduling in Apache Storm. It discusses the challenges of scheduling Storm topologies at Yahoo scale, including increasing heterogeneous clusters, low cluster utilization, and unbalanced resource usage. It then introduces the Resource Aware Scheduler (RAS) built for Storm, which allows fine-grained resource control and isolation for topologies through APIs and cgroups. Key features of RAS include pluggable scheduling strategies, per user resource guarantees, and topology priorities. Experimental results from Yahoo Storm clusters show significant improvements to throughput and resource utilization with RAS. The talk concludes with future work on improved scheduling strategies and real-time resource monitoring.
Storm: distributed and fault-tolerant realtime computation - nathanmarz
Storm is a distributed real-time computation system that provides guaranteed message processing, horizontal scalability, and fault tolerance. It allows users to define data processing topologies and submit them to a Storm cluster for distributed execution. Spouts emit streams of tuples that are processed by bolts. Storm tracks processing to ensure reliability and replays failed tasks. It provides tools for deployment, monitoring, and optimization of real-time data processing.
This document discusses how to use Storm and Hadoop together to enable real-time and batch processing of large datasets. It describes using Hadoop to precompute batch views of data, and Storm to incrementally update real-time views as new data streams in. This allows for low-latency queries by combining precomputed batch views with real-time views that compensate for recent data not yet absorbed into the batch views.
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on YARN for Yahoo (although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on YARN for Yahoo.
Apache Storm 0.9 basic training - Verisign - Michael Noll
Apache Storm 0.9 basic training (130 slides) covering:
1. Introducing Storm: history, Storm adoption in the industry, why Storm
2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
3. Operating Storm: architecture, hardware specs, deploying, monitoring
4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps, performance and scalability tuning
5. Playing with Storm using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/
Many thanks to the Twitter Engineering team (the creators of Storm) and the Apache Storm open source community!
Hadoop Summit Europe 2014: Apache Storm Architecture - P. Taylor Goetz
Storm is an open-source distributed real-time computation system. It uses a distributed messaging system to reliably process streams of data. The core abstractions in Storm are spouts, which are sources of streams, and bolts, which are basic processing elements. Spouts and bolts are organized into topologies which represent the flow of data. Storm provides fault tolerance through message acknowledgments, guaranteeing at-least-once processing. Trident is a high-level abstraction built on Storm that adds exactly-once semantics and supports operations like aggregations, joins, and state management through its micro-batch oriented, stream-based API.
The document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes how Hadoop addresses the growing volume, variety and velocity of big data through its core components: HDFS for storage, and MapReduce for distributed processing. Key features of Hadoop include scalability, flexibility, reliability and economic viability for large-scale data analytics.
This document discusses cloud and big data technologies. It provides an overview of Hadoop and its ecosystem, which includes components like HDFS, MapReduce, HBase, Zookeeper, Pig and Hive. It also describes how data is stored in HDFS and HBase, and how MapReduce can be used for parallel processing across large datasets. Finally, it gives examples of using MapReduce to implement algorithms for word counting, building inverted indexes and performing joins.
This document discusses big data and the Apache Hadoop framework. It defines big data as large, complex datasets that are difficult to process using traditional tools. Hadoop is an open-source framework for distributed storage and processing of big data across commodity hardware. It has two main components - the Hadoop Distributed File System (HDFS) for storage, and MapReduce for processing. HDFS stores data across clusters of machines with redundancy, while MapReduce splits tasks across processors and handles shuffling and sorting of data. Hadoop allows cost-effective processing of large, diverse datasets and has become a standard for big data.
Open source stack of big data techs - openSUSE Asia - Muhammad Rifqi
This document summarizes the key technologies in the open source stack for big data. It discusses Hadoop, the leading open source framework for distributed storage and processing of large data sets. Components of Hadoop include HDFS for distributed file storage and MapReduce for distributed computations. Other related technologies are also summarized like Hive for data warehousing, Pig for data flows, Sqoop for data transfer between Hadoop and databases, and approaches like Lambda architecture for batch and real-time processing. The document provides a high-level overview of implementing big data solutions using open source Hadoop technologies.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It was developed based on Google papers describing Google File System (GFS) for reliable distributed data storage and MapReduce for distributed parallel processing. Hadoop uses HDFS for storage and MapReduce for processing in a scalable, fault-tolerant manner on commodity hardware. It has a growing ecosystem of projects like Pig, Hive, HBase, Zookeeper, Spark and others that provide additional capabilities for SQL queries, real-time processing, coordination services and more. Major vendors that provide Hadoop distributions include Hortonworks and Cloudera.
M. Florence Dayana - Hadoop Foundation for Analytics.pptx - Dr. Florence Dayana
Hadoop Foundation for Analytics
History of Hadoop
Features of Hadoop
Key Advantages of Hadoop
Why Hadoop
Versions of Hadoop
Eco Projects
Essential of Hadoop ecosystem
RDBMS versus Hadoop
Key Aspects of Hadoop
Components of Hadoop
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
The Apache Hadoop software library is essentially a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model. Hadoop can scale up from single servers to thousands of machines, each offering local computation and storage.
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
Hadoop is an open-source software framework that allows for the distributed processing of large data sets across clusters of computers. It reliably stores and processes gobs of information across many commodity computers. Key components of Hadoop include the HDFS distributed file system for high-bandwidth storage, and MapReduce for parallel data processing. Hadoop can deliver data and run large-scale jobs reliably in spite of system changes or failures by detecting and compensating for hardware problems in the cluster.
Big data refers to datasets that are too large to be managed by traditional database tools. It is characterized by volume, velocity, and variety. Hadoop is an open-source software framework that allows distributed processing of large datasets across clusters of computers. It works by distributing storage across nodes as blocks and distributing computation via a MapReduce programming paradigm where nodes process data in parallel. Common uses of big data include analyzing social media, sensor data, and using machine learning on large datasets.
Enough talking about big data and Hadoop; let's see how Hadoop works in action.
We will locate a real dataset, ingest it into our cluster, connect it to a database, apply some queries and data transformations to it, save our result, and show it via a BI tool.
The initiation of Apache Hive began in 2007 at Facebook due to its data growth.
Facebook's existing ETL system began to fail over the following few years as more people joined Facebook.
In August 2008, Facebook decided to move to a more scalable open-source Hadoop environment: Hive.
Facebook, Netflix, and Amazon now support the Apache Hive SQL dialect, known as HiveQL.
Hadoop is an open source software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce as a programming model and HDFS for storage. Hadoop supports various big data applications like HBase for distributed column storage, Hive for data warehousing and querying, Pig and Jaql for data flow languages, and Hadoop ecosystem projects for tasks like system monitoring and machine learning.
This document provides information about Hadoop and its components. It discusses the history of Hadoop and how it has evolved over time. It describes key Hadoop components including HDFS, MapReduce, YARN, and HBase. HDFS is the distributed file system of Hadoop that stores and manages large datasets across clusters. MapReduce is a programming model used for processing large datasets in parallel. YARN is the cluster resource manager that allocates resources to applications. HBase is the Hadoop database that provides real-time random data access.
This document summarizes a workshop on social impact and web3 technologies. It introduces web3, its potential for social impact such as financial inclusion and crowdfunding. It then outlines a crowdfunding workshop demo that will use Metamask, Ganache and Remix for a smart contract, and frontend code. The workshop will also include an introduction to these tools and a Q&A session.
The annual report summarizes Warung Pintar's achievements and impact over the past year. Some key points:
- Warung Pintar helped over 2,000 warung owners across more than 1,150 warungs in Greater Jakarta and Banyuwangi.
- On average, Warung Pintar partners saw 21% monthly growth in their businesses, and more than half of the partners earned above the average income for their regions.
- More than 70% of partners allocated additional income to savings and child education. Warung Pintar achieved a 148% social return on investment, meaning every investment led to a 148% socio-economic impact.
Scrum is a framework for addressing complex problems and delivering valuable products. Rather than repeating the same processes and expecting different results, Scrum allows people to work creatively and productively through an adaptive approach to complex challenges.
The document describes the Design Sprint methodology, which aims to build and test prototypes in just five days to quickly validate hypotheses about customer needs and preferences. It involves six phases: Understand to create a shared knowledge base; Define success metrics and principles; Sketch a range of ideas individually; Decide on a direction to prototype; Prototype the concept; and Validate the prototype with users. Various methods are provided for each phase, such as affinity mapping, card sorting, crazy 8's, dot voting, and usability studies. The timeline allocates one day for each phase with the goal of compressing months of work into a single week to shortcut debate and quickly learn.
This document discusses achieving product-market fit and scale for Warung Pintar, a franchise of small stores in Indonesia. It introduces Sofian Hadiwijaya, the co-founder, and outlines goals such as opening the first stall in November 2017, joining an A team, developing an MVP, and having 1000 stores by 2018. The business model will involve franchising, semi-franchising, and subscriptions. Growth will be data-driven and rely on execution power. Technology, hardware, and the business model will undergo iterations to find product-market fit and scale the company.
This document provides an introduction and background about Sofian Hadiwijaya, the co-founder of Warung Pintar. It outlines his previous work experience including positions at Holcim Indonesia, Binus Center, and Harita Panca Utama. It also lists the startups he has co-founded including iBenerin.com, Crazy Hackerz, Inetku, Utees.me, and Pinjam.co.id. The document discusses some of the challenges he faced and lessons learned including the importance of a growth mindset and finding great mentors.
This document outlines a pathway for becoming a data scientist and IoT professional. It discusses how the next 10 years will see a shift to an AI-first world where computing is universally available through various surfaces like homes, workplaces, cars, and mobile devices. This computing will be more natural, intuitive and intelligent. It also introduces the concept of data science and mentions an analytics framework as part of the pathway to becoming a data scientist.
This document discusses the author's experience with Python over the past 10 years, from their first articles on Python to projects involving data science, machine learning, IoT, and more. The author notes Python's simplicity, flexibility, and ability to boost iteration speed. They also discuss how Python has become the most popular programming language and examine some of the top open source Python projects for AI and machine learning like TensorFlow, Keras, and PyTorch. The document concludes by suggesting that in the next 10 years, computing will become more intelligent and interactions more natural through advances in AI.
This document provides advice for building startups from Sofian Hadiwijaya, an entrepreneur and tech evangelist. It recommends owning your story as an entrepreneur, not shortchanging your learning, and finding a work environment that can sustain your growth. The document includes Sofian's contact information and background as the co-founder of several startups in Indonesia.
This 3-paragraph document provides an overview of how big data and digital marketing can be used together. It introduces Sofian Hadiwijaya as the author and expert on this topic. The document then defines data and big data, discusses how tracking tools and analytics frameworks are used, and examines problems that big data can help address, such as generating ads, adjusting bids, and reporting. Finally, it outlines an architecture for applying big data to digital marketing and discusses keyword expansion techniques.
Sofian Hadiwijaya is an experienced data expert who has worked in various industries including cement, mining, education, e-commerce, fintech, logistics, and transportation. He actively participates in technology communities and has won both national and international Intel Software Innovator contests in Internet of Things and Artificial Intelligence. Sofian inspires thousands of developers in Indonesia through workshops with dicoding and developer mengajar. He is currently the Co-Founder of Warung Pintar and VP of Business Intelligence at GO-JEK.
Sofian Hadiwijaya discusses how to build a data-driven company in three steps: first, define key performance metrics; second, implement a data warehouse to store organizational data; third, leverage analytics, business intelligence, and data science to provide insights from the stored data to guide decision making.
This document summarizes the components of a serverless web application architecture. It describes how Amazon S3 is used to host static web resources, Amazon Cognito provides user management and authentication, and Amazon DynamoDB provides data persistence. The backend is built using AWS Lambda and API Gateway to create a RESTful API that the client-side JavaScript code can call to send and receive data without managing servers.
The document defines key terms related to startups, including that a startup is an organization formed to search for a repeatable and scalable business model. Founders are individuals who create, execute, and invest in ideas to turn them into startups. Other terms defined are business model canvas, lean startup, pivot, agile development, accelerator, access to capital, and differences between startups and firms.
This document discusses how IoT and AI can benefit the retail industry in Indonesia. It notes that online transactions currently only account for 1.4% of the total retail market value. It defines IoT as a network of internet-connected things that can collect and exchange data, and AI as the ability of computers to think and learn. The document suggests IoT can be used for real-time store monitoring and AI can analyze optimal support measures based on conditions. A Walmart executive is quoted saying they are using machine learning to enhance the shopping experience between online and offline. It poses that IoT and AI could help retailers know customers better and make retail great again.
This document discusses growth strategies for startups, including acquisition, activation, engagement, referrals, measurement, and experiments/A/B testing. It provides examples from companies like Airbnb, Uber, Twitter, Facebook, and Dropbox on experiments that drove user behavior change. One tactic discussed, used by Uber to gain ground on Lyft, involved hiring freelancers to take Lyft rides and recruit the drivers to Uber.
This document discusses the tech industry in a global era and how it has evolved, particularly in Indonesia. It notes that Sofian Hadiwijaya is VP of Business Intelligence at GOJEK, a $1.1 billion logistics and transportation company without a fleet, and was previously involved in several other tech startups. It also briefly mentions the industrial revolution and evolution of marketing as technological context before shifting focus to the growing tech industry in Indonesia.
The document discusses how data is evolving and some of its applications. It introduces big data and explains the differences between statistics and machine learning. It also briefly mentions analytics, fintech, and computer vision as fields that utilize data. The author is Sofian Hadiwijaya, a co-founder of Pinjam.co.id and tech advisor who is interested in discussing opportunities with data.
The document is a presentation on deep learning with CNN (convolutional neural networks). It introduces the speaker and provides an overview of machine learning and deep learning concepts. It then dives into how CNNs work by using a simplified example to detect images of X's and O's. It explains the key steps of CNNs including filtering/feature extraction using small pixel patches and neural network layers that learn increasingly complex features from the input data.
Big data is having a significant impact on businesses by enabling new insights and opportunities. It allows companies in the fintech sector to better understand customer behavior and identify new opportunities through analyzing large amounts of customer data. Speakers discussed how big data affects business and its role in fintech.
3. “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…”
- Dan Ariely -
4. Most startups in Indonesia use an RDBMS.
But when we talk about big data, everyone talks about Hadoop.
5. “In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.”
- Grace Hopper -
7. An RDBMS focuses on relational, structured data in databases, while Hadoop processes unstructured data in parallel on large clusters of inexpensive servers. Hadoop’s parallelism delivers fast, reliable results at low cost.
9. SQOOP - What is Sqoop?
• Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
• Sqoop imports data from external structured datastores into HDFS or related systems like Hive and HBase.
• Sqoop can also export data from Hadoop to external structured datastores such as relational databases and enterprise data warehouses.
• Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
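A minimal import sketch, assuming a hypothetical MySQL database sales on a host dbhost with an orders table (all names are illustrative only):

    # Pull the orders table into HDFS as delimited text files.
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username sqoop_user -P \
      --table orders \
      --target-dir /user/hadoop/orders \
      --num-mappers 4

Each map task writes its own part file under the target directory, which Hive or MapReduce jobs can then read directly.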
10. SQOOP - Why Sqoop?
• As more organizations deploy Hadoop to analyse vast streams of information, they may find they need to transfer large amounts of data between Hadoop and their existing databases, data warehouses, and other data sources.
• Loading bulk data into Hadoop from production systems, or accessing it from map-reduce applications running on a large cluster, is a challenging task, since transferring data using scripts is inefficient and time-consuming.
11. SQOOP - Hadoop and Sqoop
• Hadoop is great for storing massive data in terms of volume using HDFS.
• It provides a scalable processing environment for structured and unstructured data.
• But it is batch-oriented and thus not suitable for low-latency interactive query operations.
• Sqoop is basically an ETL tool used to copy data between HDFS and SQL databases:
• Import SQL data to HDFS for archival or analysis
• Export HDFS data to SQL (e.g. summarized data used in a DW fact table)
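A sketch of the export direction mentioned above, assuming a hypothetical warehouse database dw with a pre-created daily_sales_fact table and a summarized HDFS directory produced by Hive; Hive's default field delimiter (\001) is assumed for the input files, and all names are illustrative.

    # Push summarized results from HDFS back into a relational fact table.
    sqoop export \
      --connect jdbc:mysql://dbhost/dw \
      --username sqoop_user -P \
      --table daily_sales_fact \
      --export-dir /user/hive/warehouse/daily_sales_summary \
      --input-fields-terminated-by '\001'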
12. SQOOP - What Sqoop Does
Designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases, Apache Sqoop:
• Allows data imports from external datastores and enterprise data warehouses into Hadoop
• Parallelizes data transfer for fast performance and optimal system utilization
• Copies data quickly from external systems to Hadoop
• Makes data analysis more efficient
• Mitigates excessive loads on external systems.
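The parallel transfer mentioned above is controlled by the number of map tasks and the column used to split the key range. A hedged example with invented Oracle connection details and table names:

    # Split the source table into 8 ranges on TXN_ID and copy them with 8 parallel mappers.
    sqoop import \
      --connect jdbc:oracle:thin:@//orahost:1521/ORCL \
      --username sqoop_user -P \
      --table TRANSACTIONS \
      --split-by TXN_ID \
      --num-mappers 8 \
      --target-dir /data/transactions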
13. SQOOP - How Sqoop Works
• Sqoop provides a pluggable connector mechanism for optimal connectivity to external systems.
• The Sqoop extension API provides a convenient framework for building new connectors, which can be dropped into Sqoop installations to provide connectivity to various systems.
• Sqoop itself comes bundled with various connectors that can be used for popular database and data warehousing systems.
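To make the connector mechanism concrete, here is a hedged sketch: with the bundled connectors, Sqoop picks the appropriate one from the JDBC connect string, and a generic JDBC fallback can be forced by naming a driver class explicitly. Hosts, databases, and the DB2 example below are hypothetical.

    # Bundled connectors are selected automatically from the connect string.
    sqoop list-databases --connect jdbc:mysql://dbhost --username sqoop_user -P
    sqoop list-tables --connect jdbc:postgresql://dbhost/inventory --username sqoop_user -P

    # For a database without a specialized connector, fall back to the generic
    # JDBC connector by specifying the driver class on the command line.
    sqoop import \
      --connect jdbc:db2://dbhost:50000/SAMPLE \
      --driver com.ibm.db2.jcc.DB2Driver \
      --username sqoop_user -P \
      --table STAFF \
      --target-dir /data/staff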