A summarized version of a presentation on Big Data architecture, covering everything from the Big Data concept to Hadoop and tools like Hive, Pig, and Cassandra
A Study Review of Common Big Data Architecture for Small-Medium Enterprise - Ridwan Fadjar
This document summarizes a study review of common big data architectures for small to medium enterprises. It finds that such architectures typically include three main components: 1) an enterprise design framework like TOGAF for planning and architecture, 2) core infrastructure including data sources, messaging queues, data lakes, ETL processes, data warehouses, and visualization tools, and 3) operational aspects like data mining and security/compliance practices running on top of the infrastructure. The study concludes that open source tools can help SMEs establish affordable big data solutions to gain competitive advantages from data-driven insights.
Druid is a high-performance, column-oriented, distributed data store that is widely used at Oath for big data analysis. Druid uses a JSON schema as its query language, making it difficult for new users unfamiliar with the schema to start querying Druid quickly. The JSON schema is designed to work with Druid's data ingestion methods, so it can provide high-performance features such as data aggregations in JSON, but many users are unable to utilize such features because they are not familiar with the specifics of how to optimize Druid queries. However, most new Druid users at Yahoo are already very familiar with SQL, and the queries they want to write for Druid can be converted to concise SQL.
We found that our data analysts wanted an easy way to issue ad-hoc Druid queries and view the results in a BI tool in a way that's presentable to nontechnical stakeholders. In order to achieve this, we had to bridge the gap between Druid, SQL, and our BI tools such as Apache Superset. In this talk, we will explore different ways to query a Druid datasource in SQL and discuss which methods were most appropriate for our use cases. We will also discuss our open source contributions so others can utilize our work. GURUGANESH KOTTA, Software Dev Eng, Oath and JUNXIAN WU, Software Engineer, Oath Inc.
Big data architecture on cloud computing infrastructure - datastack
This document provides an overview of using OpenStack and Sahara to implement a big data architecture on cloud infrastructure. It discusses:
- The characteristics and service models of cloud computing
- An introduction to OpenStack, why it is used, and some of its key statistics
- What Sahara is and its role in provisioning and managing Hadoop, Spark, and Storm clusters on OpenStack
- Sahara's architecture, how it integrates with OpenStack, and examples of how it can be used to quickly provision data processing clusters and execute analytic jobs on cloud infrastructure.
The document provides information about Hadoop, its core components, and MapReduce programming model. It defines Hadoop as an open source software framework used for distributed storage and processing of large datasets. It describes the main Hadoop components like HDFS, NameNode, DataNode, JobTracker and Secondary NameNode. It also explains MapReduce as a programming model used for distributed processing of big data across clusters.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... - DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers. It addresses problems like massive data storage needs and scalable processing of large datasets. Hadoop uses the Hadoop Distributed File System (HDFS) for storage and MapReduce as its processing engine. HDFS stores data reliably across commodity hardware and MapReduce provides a programming model for distributed computing of large datasets.
Improving Organizational Knowledge with Natural Language Processing Enriched ... - DataWorks Summit
This document summarizes a presentation on improving organizational knowledge with natural language processing and enriched data pipelines. The system discussed ingests unstructured text data from various sources using Apache Kafka and Apache NiFi/MiNiFi. The data is then processed by Apache OpenNLP microservices to extract entities, sentences, tokens and perform sentiment analysis. The extracted structured data is stored in a database and Elasticsearch for visualization in Apache Superset dashboards. The system is designed to be scalable, extensible and repeatable using infrastructure as code deployed on Amazon Web Services.
Introduction to Hadoop Ecosystem was presented to Lansing Java User Group on 2/17/2015 by Vijay Mandava and Lan Jiang. The demo was built on top of HDP 2.2 and AWS cloud.
A presentation on big data, from the workshop "The Era of Big Data: Why and How?" at the 22nd Computer Society of Iran Conference (csicc2017.ir)
Vahid Amiri
vahidamiry.ir
datastack.ir
When learning Apache Spark, where should a person begin? What are the key fundamentals when learning Apache Spark? Resilient Distributed Datasets, Spark Drivers and Context, Transformations, Actions.
Rainer Schmidt, AIT Austrian Institute of Technology, presented Scalable Preservation Workflows from SCAPE at the 5-days ‘Digital Preservation Advanced Practitioner Training’ event (http://bit.ly/1fYCvMO), hosted by DPC, in Glasgow on 15-19 July 2013.
The presentation gives an introduction to the SCAPE Platform, it presents scenarios from SCAPE Testbeds and it finally describes how to create scalable workflows and execute them on the SCAPE Platform.
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at... - Data Con LA
It isn't easy to drink from the technology firehose of today's Internet economy. At Connexity, we have gone from home-grown MapReduce frameworks and custom in-house search-engines to extensive use of Apache Hadoop, Hive, Pig, Cassandra, Solr and other technologies to power our business. This talk will explore some of the evolutionary steps that we've made and what lessons you might draw from our 15+ years of experience of swimming with the Internet sharks.
This document discusses Hadoop, an open source framework for distributed storage and processing of large datasets across clusters of computers. It consists of HDFS for distributed storage and MapReduce for distributed processing of large datasets in parallel. The document also discusses several LinkedIn projects that use Hadoop like Project Takeout, Azkaban for workflow management, and Dr. Elephant for analyzing Hadoop jobs and problems. It encourages contributions to Apache Hadoop and related open source projects developed at LinkedIn.
High Performance and Scalable Geospatial Analytics on Cloud with Open Source - DataWorks Summit
During the rise and innovation of “big data,” the geospatial analytics landscape has grown and evolved. We are beyond just analyzing static maps. Geospatial data is streaming from devices, sensors, infrastructure systems, or social media, and our applications and use cases must dynamically scale to meet the increased demands.
Cloud can provide cost-effective storage and that ephemeral resource-burst needed for fast processing and low latency, all to monetize the immediate value of fresh geospatial data. Geospatial analytics require optimized spatial data types and algorithms to distill data to knowledge. Such processing, especially with strict latency requirements, has always been a challenge.
We propose an open source big data stack for geospatial analytics on Cloud based on Apache NiFi, Apache Spark and LocationTech GeoMesa. GeoMesa is a geospatial framework deployed in a modern big data platform that provides a scalable and low latency solution for indexing volumes of historical data and generating live views and streaming geospatial analytics. CONSTANTIN STANCA, Solutions Engineer, Hortonworks and JAMES HUGHES, Mathematician, CCRi
Presto is an open source distributed SQL query engine that allows interactive analysis of data across multiple data stores. At Facebook, Presto is used for ad-hoc queries of their Hadoop data warehouse, which processes trillions of rows and scans petabytes of data daily. Presto's low latency also makes it suitable for powering analytics in user-facing products. New features of Presto include improved SQL support, performance optimizations, and connectors to additional data sources like Redis and MongoDB.
Lambda-less Stream Processing @Scale in LinkedIn
The document discusses challenges with stream processing including data accuracy and reprocessing. It proposes a "lambda-less" approach using windowed computations and handling late and out-of-order events to produce eventually correct results. Samza is used in LinkedIn's implementation to store streaming data locally using RocksDB for processing within configurable windows. The approach avoids code duplication compared to traditional lambda architectures while still supporting reprocessing through resetting offsets. Challenges remain in merging online and reprocessed results at large scale.
Using Visualization to Succeed with Big Data - Pactera_US
The document summarizes a webinar on big data visualization. It discusses drivers for the big data visualization market and new tools emerging. It then profiles several major vendors that offer big data visualization solutions, including Microsoft, QlikView, TIBCO, Tableau, Platfora, Datameer, Splunk, Jaspersoft, and Alpine Data. It concludes with an overview of how Pactera can help clients build advanced analytics solutions.
Acquisition of Seismic, Hydroacoustic, and Infrasonic Data with Apache NiFi a... - DataWorks Summit
Hadoop Distributed File System (HDFS) based architectures allow faster ingestion and processing of larger quantities of time series data than presently possible in current seismic, hydroacoustic, and infrasonic (SHI) analysis platforms. We have developed a data acquisition and signal analysis system using Hadoop, Accumulo, and NiFi. The data model allows individual waveform samples and their associated metadata to be stored in Accumulo. This is a significant departure from traditional storage practices, where continuous waveform segments are stored with their associated metadata as a single entity. Our design allows for rapid table scans of large data archives within Accumulo for locating, retrieving, and analyzing specific waveform segments directly. The scalability of Hadoop permits the system to accommodate the ingestion and analysis of new data as a sensor network grows. Our system is currently acquiring data from over 200 SHI sensors. Peak ingest rates are approaching 500k entries per second, while preserving constant sub-second access times to any range of entries. The average load produced by the data ingest process is consuming less than 10 percent of available system resources. CHARLES HOUCHIN, Computer Scientist, Air Force Technical Applications Center (AFTAC) and JOHN HIGHCOCK, Systems Architect, Hortonworks
This document provides an overview of SK Telecom's use of big data analytics and Spark. Some key points:
- SKT collects around 250 TB of data per day which is stored and analyzed using a Hadoop cluster of over 1400 nodes.
- Spark is used for both batch and real-time processing due to its performance benefits over other frameworks. Two main use cases are described: real-time network analytics and a network enterprise data warehouse (DW) built on Spark SQL.
- The network DW consolidates data from over 130 legacy databases to enable thorough analysis of the entire network. Spark SQL, dynamic resource allocation in YARN, and integration with BI tools help meet requirements for timely processing and quick query response.
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins at Google and Yahoo
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Facebook and Amazon use Hadoop to process massive amounts of data
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins from Google and Apache projects
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Yahoo, Facebook, and Amazon use Hadoop for applications like log processing, searches, and advertisement targeting
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses a master-slave architecture with the NameNode as master and DataNodes as slaves. The NameNode manages file system metadata and the DataNodes store data blocks. Hadoop also includes a MapReduce engine where the JobTracker splits jobs into tasks that are processed by TaskTrackers on each node. Hadoop saw early adoption from companies handling big data like Yahoo!, Facebook and Amazon and is now widely used for applications like advertisement targeting, search, and security analytics.
Hadoop is an open source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses Google's MapReduce programming model and Google File System for reliability. The Hadoop architecture includes a distributed file system (HDFS) that stores data across clusters and a job scheduling and resource management framework (YARN) that allows distributed processing of large datasets in parallel. Key components include the NameNode, DataNodes, ResourceManager and NodeManagers. Hadoop provides reliability through replication of data blocks and automatic recovery from failures.
This document provides information about Hadoop and its components. It discusses the history of Hadoop and how it has evolved over time. It describes key Hadoop components including HDFS, MapReduce, YARN, and HBase. HDFS is the distributed file system of Hadoop that stores and manages large datasets across clusters. MapReduce is a programming model used for processing large datasets in parallel. YARN is the cluster resource manager that allocates resources to applications. HBase is the Hadoop database that provides real-time random data access.
The document provides an overview of big data and Hadoop, discussing what big data is, current trends and challenges, approaches to solving big data problems including distributed computing, NoSQL, and Hadoop, and introduces HDFS and the MapReduce framework in Hadoop for distributed storage and processing of large datasets.
M. Florence Dayana - Hadoop Foundation for Analytics.pptx - Dr. Florence Dayana
Hadoop Foundation for Analytics
History of Hadoop
Features of Hadoop
Key Advantages of Hadoop
Why Hadoop
Versions of Hadoop
Eco Projects
Essentials of the Hadoop ecosystem
RDBMS versus Hadoop
Key Aspects of Hadoop
Components of Hadoop
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses problems with traditional systems like data growth, network/server failures, and high costs by allowing data to be stored in a distributed manner and processed in parallel. Hadoop has two main components - the Hadoop Distributed File System (HDFS) which provides high-throughput access to application data across servers, and the MapReduce programming model which processes large amounts of data in parallel by splitting work into map and reduce tasks.
The document provides an overview of big data and Hadoop fundamentals. It discusses what big data is, the characteristics of big data, and how it differs from traditional data processing approaches. It then describes the key components of Hadoop including HDFS for distributed storage, MapReduce for distributed processing, and YARN for resource management. HDFS architecture and features are explained in more detail. MapReduce tasks, stages, and an example word count job are also covered. The document concludes with a discussion of Hive, including its use as a data warehouse infrastructure on Hadoop and its query language HiveQL.
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem - Zohar Elkayam
A session from BGOUG that I presented in June 2016.
Big data is one of the biggest buzzwords in today's market. Terms like Hadoop, HDFS, YARN, Sqoop, and unstructured data have been scaring DBAs since 2010 - but where does the DBA team really fit in?
In this session, we will discuss everything database administrators and database developers need to know about big data. We will demystify the Hadoop ecosystem and explore its different components. We will learn how HDFS and MapReduce are changing the data world, and where traditional databases fit into the grand scheme of things. We will also talk about why DBAs are the perfect candidates to transition into Big Data and Hadoop professionals and experts.
Rapid Cluster Computing with Apache Spark 2016 - Zohar Elkayam
This is the presentation I used for my Oracle Week 2016 session about Apache Spark.
In the agenda:
- The Big Data problem and possible solutions
- Basic Spark Core
- Working with RDDs
- Working with Spark Cluster and Parallel programming
- Spark modules: Spark SQL and Spark Streaming
- Performance and Troubleshooting
- Hadoop Distributed File System (HDFS) is a distributed file system that stores large datasets across clusters of computers. It divides files into blocks and stores the blocks across nodes, replicating them for fault tolerance.
- HDFS is designed for distributed storage and processing of very large datasets. It allows applications to work with data in parallel on large clusters of commodity hardware.
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
This document provides an introduction to Hadoop and big data concepts. It discusses what big data is, the four V's of big data (volume, velocity, variety, and veracity), different data types (structured, semi-structured, unstructured), how data is generated, and the Apache Hadoop framework. It also covers core Hadoop components like HDFS, YARN, and MapReduce, common Hadoop users, the difference between Hadoop and RDBMS systems, Hadoop cluster modes, the Hadoop ecosystem, HDFS daemons and architecture, and basic Hadoop commands.
This slide deck was shared by Mr. Minh Tran, KMS's Software Architect, at the "Java - Trends and Career Opportunities" seminar of the Information Technology Center of HCMC University of Science.
This document discusses building big data solutions using Microsoft's HDInsight platform. It provides an overview of big data and Hadoop concepts like MapReduce, HDFS, Hive and Pig. It also describes HDInsight and how it can be used to run Hadoop clusters on Azure. The document concludes by discussing some challenges with Hadoop and the broader ecosystem of technologies for big data beyond just Hadoop.
2. Juan Pablo Paz Grau, PhD, PMP
Systems Engineer
Specialist in Information Systems Management
PhD in Software Engineering
Certified in ITIL Foundation, PMP
Currently, I work at LG CNS Colombia
LG CNS Colombia is the IT partner of the SIRCI operation
The SIRCI Operation = Transmilenio Operation
Transmilenio is the world-renowned reference for BRT systems
The biggest public traffic system operation in Colombia
3. Presentation Agenda
1. What is Big Data?
2. Large Dataset Management Techniques
3. Hadoop Cluster Architecture
4. Closing the Loop: Real Time Cluster Architecture
5. The Development Process for Big Data Systems
6. Showcase of Big Data Tools for Public Traffic Systems
5. What is Big Data?
Information displayed to final users
Data generated to provide the information displayed to final users
…
6. What is Big Data?
• Organizations produce lots of data while they operate their Information Systems
• Log files
• Access log files
• Debug log files
• Temporal, transient data
• Transactional data
• Usually, this data is stored temporarily, only for debugging or incident analysis purposes
• With the increasing capacity to store data, this data is being reviewed and considered a valuable source of information
7. Large Dataset Management Techniques
Very small intro to Hadoop: What is Hadoop?
• Cheap, reliable storage of big datasets on commodity hardware
• A framework to parallelize the processing and analysis of large datasets
8. Large Dataset Management Techniques
Very small intro to Hadoop: Hadoop Distributed File System (HDFS)
• A file is split into data blocks
• File metadata and block locations are stored in the name node
• Data blocks are physically stored in data nodes
• Block B: if Data Node 0 fails, there is another copy in the same rack at Data Node 1; if the rack fails, there is still another copy in another rack at Data Node 2
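To make the HDFS flow above concrete, here is a minimal sketch of writing and reading a file through Hadoop's Java FileSystem API. It is an illustration only: the NameNode host ("master-node"), port, and file path are hypothetical placeholders, not values from the SIRCI cluster.

// Minimal sketch: write and read a file in HDFS via the FileSystem API.
// Host, port, and paths below are hypothetical placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "master-node" stands in for the NameNode host
    FileSystem fs = FileSystem.get(URI.create("hdfs://master-node:8020"), conf);

    Path file = new Path("/data/raw/sample.txt");
    // The client writes a stream; HDFS splits it into blocks and replicates them
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("BUS-42,Trip,12.5\n");
    }

    // On read, block locations are resolved through the NameNode metadata
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}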
9. Large Dataset Management Techniques
Very small intro to Hadoop: MapReduce
• Map: select the data that matches a given criterion (Status = Trip). The map function returns a set of {Key, Value} pairs
• Shuffle: collect and sort the mapped pairs
• Reduce: apply a reduce function (sum distance) for each key
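A minimal Hadoop MapReduce job in Java mirroring the example above: the mapper emits {vehicleId, distance} pairs for records whose status is Trip, the framework shuffles and sorts them by key, and the reducer sums the distances per key. The CSV input layout (vehicleId,status,distance) and all names are hypothetical illustrations, not the actual SIRCI data model.

// Minimal sketch: sum trip distance per vehicle with Hadoop MapReduce.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TripDistance {
  // Map: select records with Status = Trip and emit {vehicleId, distance}
  public static class TripMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split(",");  // vehicleId,status,distance
      if (f.length == 3 && "Trip".equals(f[1])) {
        ctx.write(new Text(f[0]), new DoubleWritable(Double.parseDouble(f[2])));
      }
    }
  }

  // Reduce: the shuffle has already grouped and sorted the pairs; sum per key
  public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
        throws IOException, InterruptedException {
      double total = 0;
      for (DoubleWritable v : values) total += v.get();
      ctx.write(key, new DoubleWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "trip distance");
    job.setJarByClass(TripDistance.class);
    job.setMapperClass(TripMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}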
10. Large Dataset Management Techniques
Very small intro to Hadoop: The Hadoop ecosystem
• Currently, there is a plethora of tools for working with Big Data on top of Hadoop.
• The selection of tools and frameworks will vary depending on the implementation of the cluster.
11. Hadoop Cluster Architecture
The Lambda Architecture
(Layer stack: Application → Data Access → Batch | Speed → Data)
• Data layer: a data model and a set of data stored following that model. The data model should be designed for the targeted subsystem.
• Batch layer: the computation layer that processes data to turn facts into views for querying the underlying stored data.
• Speed layer: a real-time computation layer that compensates for the latency of the batch layer.
• Data Access layer: the engines, tools, and drivers that expose views to applications and manage queries.
• Application layer: the front-end application or applications that present information to users of the Big Data system.
12. Hadoop Cluster Architecture
Data Serialization
(Diagram: multiple source systems feed through data serialization pipelines into the data lake, which stores the raw data.)
13. Hadoop Cluster Architecture
Data Access: Hive, the Hadoop Data Warehouse
• Built on top of Hadoop
• Eases the tasks of managing data in Hadoop
• Manages files and schemas as tables
• Internal tables: files managed by Hive
• External tables: files located outside of Hive but which can be analyzed with Hive
• Provides a SQL-like language (HiveQL) to query data stored in files
• Translates HiveQL language requests into MapReduce jobs
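As a hedged illustration of the points above, this sketch creates an external table and runs a HiveQL aggregation from Java through the HiveServer2 JDBC driver (hive-jdbc must be on the classpath). The host, table, and column names are hypothetical placeholders.

// Minimal sketch: define an external table and query it with HiveQL over JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 JDBC endpoint; "master-node" is a placeholder host
    String url = "jdbc:hive2://master-node:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {
      // External table: the files stay where they are; Hive only attaches a schema
      stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS trips "
          + "(vehicle_id STRING, status STRING, distance DOUBLE) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
          + "LOCATION '/data/raw/trips'");
      // Hive translates this HiveQL request into MapReduce jobs
      try (ResultSet rs = stmt.executeQuery(
          "SELECT vehicle_id, SUM(distance) FROM trips "
          + "WHERE status = 'Trip' GROUP BY vehicle_id")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
      }
    }
  }
}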
14. Hadoop Cluster Architecture
Data Access: Pig, a Data Processing Language
(Pipeline: Load → Transform → Dump)
• Built on top of Hadoop
• Eases the tasks of data processing and analysis
• Capable of working with any type of data source
• Provides a scripting language (Pig Latin) to process and transform data
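A small sketch of the Load → Transform → Dump pipeline, driving Pig Latin from Java with the PigServer API. The input path, field names, and output path are hypothetical placeholders.

// Minimal sketch: run a Pig Latin pipeline from Java via PigServer.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigTransform {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    // Load: attach a schema to raw CSV records
    pig.registerQuery("raw = LOAD '/data/trips.csv' USING PigStorage(',') "
        + "AS (vehicleId:chararray, status:chararray, distance:double);");
    // Transform: keep trips only, then sum distance per vehicle
    pig.registerQuery("trips = FILTER raw BY status == 'Trip';");
    pig.registerQuery("grouped = GROUP trips BY vehicleId;");
    pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(trips.distance);");
    // Dump: store the result back into HDFS (translated into MapReduce jobs)
    pig.store("totals", "/data/trip_totals");
  }
}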
15. Hadoop Cluster Architecture
Hive / Pig Comparison
Hive
• Works with structured data
• Can index data
• HiveQL, a SQL-like access language
• Turns the HiveQL input into MapReduce jobs
Pig
• Works with structured/unstructured data
• Cannot index data
• Pig Latin, a scripting language
• Turns the Pig Latin input into MapReduce jobs
16. Closing the Loop: Real Time Cluster Architecture
Why?
1. Hadoop is intended to store historical data, not changing data (write once, read many times)
2. Batch processing of data usually takes a long time to produce summarized output data
3. The capability to process Big Data in real time is also desirable in the Lambda architecture
4. There is a need for a solution that copes with the gap between the data already in the Hadoop cluster and new data being generated
(Timeline: data available in Hadoop → new data being created → new data stored in Hadoop; the interval in between is the data gap.)
17. Closing the Loop: Real Time Cluster Architecture
Cassandra: Accessing the Cluster (CQL Driver)
1. Access used to be through a Thrift client; now it is through a CQL client
2. CQL (the Cassandra Query Language) is a very small subset of SQL
3. The driver is not JDBC-like!
Cassandra: Data Model
1. Row oriented, instead of column oriented
2. Each row is identified by a key
3. Each key accesses a collection of columns
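To illustrate the CQL points above, here is a minimal sketch using the DataStax Java driver: note that access goes through session.execute() with CQL strings rather than a JDBC connection. The contact point, keyspace, and table names are hypothetical placeholders.

// Minimal sketch: create a table, insert a row, and query it with CQL.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
  public static void main(String[] args) {
    // "worker-node-1" stands in for one of the Cassandra nodes
    try (Cluster cluster = Cluster.builder().addContactPoint("worker-node-1").build();
         Session session = cluster.connect()) {
      session.execute("CREATE KEYSPACE IF NOT EXISTS transit WITH replication = "
          + "{'class': 'SimpleStrategy', 'replication_factor': 2}");
      // Each row is identified by its key (vehicle_id) and holds a collection of columns
      session.execute("CREATE TABLE IF NOT EXISTS transit.trip_totals "
          + "(vehicle_id text PRIMARY KEY, total_distance double)");
      session.execute("INSERT INTO transit.trip_totals (vehicle_id, total_distance) "
          + "VALUES ('BUS-42', 12.5)");
      // CQL looks like SQL but is only a small subset of it
      ResultSet rs = session.execute(
          "SELECT vehicle_id, total_distance FROM transit.trip_totals");
      for (Row row : rs) {
        System.out.println(row.getString("vehicle_id") + " " + row.getDouble("total_distance"));
      }
    }
  }
}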
18. The Development Process for Big Data Systems
Development Process: System Implementation
Hadoop Cluster Architecture
Master Node
• Resource Manager
• Name Node
• Hive Server
• Sqoop
• Apache Tomcat
• MySQL Server
Worker Node ×4, each running:
• Data Node
• Node Manager
• Cassandra Node
19. Now we have the cluster services up and running, and data is flowing into our Big Data repository. What's next?
Showcase of Big Data Tools for Public Traffic Systems