Real time fraud detection at
1+M scale on hadoop stack
Ishan Chhabra
Nitin Aggarwal
Rocketfuel Inc
Agenda
• Rocketfuel & Advertising auction process
• Various kinds of frauds
• Problem statement
• Helios: Architecture
• Implementation in Hadoop Ecosystem
• Details about HDFS spout and datacube
• Key takeaways
Rocketfuel Inc
• AdTech firm that enables marketers using AI & Big Data
• Scores 120+ Billion Ad auctions in a day
• Handles 1-2 Million TPS at peak traffic
Auction Process
(4b) Notification
(5) Record impression
Exchange - Rocketfuel discrepancy
(4b) Notification
(5) Record impression
count(4b) != count(5)
Rocketfuel - Advertiser discrepancy
(5) Record impression
count(5) != count(6)
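The two discrepancy checks above compare event counts between adjacent steps of the auction. A minimal sketch of that comparison, assuming the counts are already aggregated per step; the function name and the sample numbers are illustrative and do not come from the talk:

```python
def discrepancy_rate(upstream_count: int, downstream_count: int) -> float:
    """Fraction of upstream events with no matching downstream event.

    Illustrative helper (not from the talk): upstream could be count(4b)
    and downstream count(5), or count(5) and count(6).
    """
    if upstream_count <= 0:
        return 0.0
    return (upstream_count - downstream_count) / upstream_count


# Hypothetical numbers: 1,000 notifications but only 900 recorded impressions.
rate = discrepancy_rate(1000, 900)
print(f"{rate:.1%}")  # 10.0%
```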
Common causes
• Fraud
– Bot networks and malware
– Hidden ad slots
• Human error
– Ad JavaScript site- or browser-specific issues
– Bugs in Ad JavaScript
– 3rd-party JavaScript interactions in Ad or site
Need for real time
• Micro-patterns that change frequently
• Latency has a big business impact; delays in reacting lead to
loss of money
• Discrepancies often arise from breakages and sudden,
unexpected changes
Goal: Significantly reduce money loss from both ends by
reacting to these micropatterns in near real time
Data flow
[Diagram: Bidding Sites (four elements labeled x2) and Analytics Site]
Data flow
Bids & Notifications
(batched and delayed)
Impressions
(near real time)
Bidding Site
Analytics Site
Problem statement
• 3 streams with various delays (2 from HDFS, 1 from Kafka)
• Join and aggregate
• Filter among 2^n feature combinations to identify the top
culprits (OLAP cube)
• Feedback into bidding
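The "2^n feature combinations" figure comes from treating every subset of the n features as a candidate grouping. A small sketch of that enumeration, with hypothetical feature names chosen only for illustration:

```python
from itertools import combinations


def feature_combinations(features):
    """Yield every subset of the feature names, from () up to all of them."""
    for size in range(len(features) + 1):
        for combo in combinations(features, size):
            yield combo


# Hypothetical feature names, used only for illustration.
features = ["site", "browser", "exchange"]
combos = list(feature_combinations(features))
print(len(combos))  # 8, i.e. 2**3
```

The count doubles with each added feature, which is one reason to pre-aggregate into a cube rather than recompute groupings on demand.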
Lambda architecture
Logs
Storm & HBase on
YARN (Slider)
Serving Infra
(Bidders and Ad-servers)
Near real-time pipeline
Batch pipeline
Helios: Abstraction for real time learning
• Real time processing of data streams from sources like Kafka
and HDFS, with efficient join
• Process joined event views to generate different analytics,
using HBase and MapReduce
• OLAP support
• Join with dimensional data; different use-cases
Logs
Storm Cluster
(Slider and YARN)
HBase Cluster
(Slider and YARN)
Serving Infra
(Bidders and Ad-servers)
Helios architecture
OLAP
Metrics
Step 1a: Ingesting events from Kafka
Logs
Storm Cluster
(Slider and YARN)
Serving Infra
(Bidders and Ad-servers)
Processing Kafka events in real-time
• Relies on log streams written to Kafka by Scribe
• Kafka topic with 200+ partitions
• Data produced and written via Scribe from more than 3K
nodes
• Using upstream Kafka spout to read data
– Spout granularity is at record-level
– Uses Zookeeper extensively for book-keeping
Processing Kafka events in real-time
• Topology Statistics:
– Running on YARN as an application, so easily scalable
•Container: Memory: 2700m
– Running with 25 workers (5 executors/worker)
– Supervisor JVM opts:
•-Xms512m -Xmx512m -XX:PermSize=64m -XX:MaxPermSize=64m
– Worker JVM opts:
•-Xmx1800m -Xms1800m
– Processing nearly 100K events per second
Step 1b: Ingesting events from HDFS
Logs
Storm Cluster
(Slider and YARN)
Serving Infra
(Bidders and Ad-servers)
Processing HDFS events in real-time
• Relies on log streams written to HDFS by Scribe
• WAN limitations introduce high compression needs
• Logs are shipped with DistCp rather than Kafka
• Uses an in-house Storm spout to read streams from HDFS
Processing Bid-logs in real-time
Storm Topology Statistics:
• Running on YARN as an application via slider (easily scalable)
–Container: Memory: 2700m
• Currently running with 350 workers (~10 executors/worker).
• Supervisor JVM opts:
–-Xms512m -Xmx512m -XX:PermSize=64m -XX:MaxPermSize=64m
• Worker JVM opts:
–-Xmx1800m -Xms1800m
• Processing nearly 1.5-2.0 million events per second (~ 100+B
events per day)
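A quick sanity check of the figures above, as a sketch: it converts the stated per-second rate into a daily volume and a rough per-executor load. It assumes the peak rate is sustained all day and that every worker has exactly 10 executors, which the slide only gives as approximate; the helper names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_day(events_per_second: float) -> float:
    """Daily volume if the given rate were sustained for 24 hours."""
    return events_per_second * SECONDS_PER_DAY


def events_per_executor(events_per_second: float, workers: int,
                        executors_per_worker: int) -> float:
    """Average events per second handled by one executor."""
    return events_per_second / (workers * executors_per_worker)


low, high = 1.5e6, 2.0e6
print(events_per_day(low) / 1e9)                  # 129.6 (billion per day)
print(events_per_day(high) / 1e9)                 # 172.8
print(round(events_per_executor(low, 350, 10)))   # 429
print(round(events_per_executor(high, 350, 10)))  # 571
```

Both daily totals are consistent with the "100+B events per day" figure, given that traffic is not at peak all day.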
HDFS Spout Architecture
• Master-slave architecture
• Spout granularity is at file level, with record-level offset
bookkeeping.
• Uses ZooKeeper extensively for bookkeeping
–Curator and its recipes make life a lot easier.
• Heavily influenced by the Kafka spout
HDFS Spout Architecture
[Diagram: Spout Leader and Spout Workers, with bookkeeping nodes un-assigned, locked, checkpoint, done, offset, and offset-lock]
HDFS Spout Architecture
• Assignment Manager (AM):
– Elected via a leader election algorithm
– Polls HDFS periodically to identify new files, based on
timestamp and partitioned paths
– Publishes files to be processed as work tasks in ZooKeeper (ZK)
– Manages time and path offsets, for cleaning up done nodes
– Creates periodic done-markers on HDFS
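One way to picture the timestamp and partitioned-path polling is a helper that lists which hourly directories to check since the stored offset. This is only a sketch: the `/logs/bids/YYYY/MM/DD/HH` layout, the base path, and the function name are assumptions for illustration, not the layout used in the talk.

```python
from datetime import datetime, timedelta


def partition_paths(base, last_offset, now):
    """List hourly partition directories from the offset's hour through now's hour."""
    hour = last_offset.replace(minute=0, second=0, microsecond=0)
    end = now.replace(minute=0, second=0, microsecond=0)
    paths = []
    while hour <= end:
        # Hypothetical layout: <base>/YYYY/MM/DD/HH
        paths.append(f"{base}/{hour:%Y/%m/%d/%H}")
        hour += timedelta(hours=1)
    return paths


paths = partition_paths(
    "/logs/bids",
    last_offset=datetime(2016, 6, 28, 22, 15),
    now=datetime(2016, 6, 29, 1, 5),
)
for p in paths:
    print(p)
# /logs/bids/2016/06/28/22
# /logs/bids/2016/06/28/23
# /logs/bids/2016/06/29/00
# /logs/bids/2016/06/29/01
```

The offset's own hour is included because files may still be arriving in it; a real poller would also filter files within each directory by modification time.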
HDFS Spout Architecture
• Worker (W):
– Selects a work task from those available in ZK when done
with its current work, using ephemeral node locking
– Checkpoints file progress as a record offset in ZK, so that
work is not repeated after a failure
– Creates a done node in ZK after processing the file
HDFS Spout Architecture
Bookkeeping node hierarchy:
• Pluggable backend: the current implementation uses ZK
• Work Life Cycle
– unassigned - file added here by AM
– locked - created by worker on selecting work
– checkpoint - timely checkpointing here
– processed - created by worker on completion
• Offset Management
– offset - stores path, time offset of HDFS
– offset-lock - ephemeral lock for offset update
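The work life cycle above can be sketched as a small state machine. The class below is an in-memory stand-in for the ZK-backed bookkeeping, written only to illustrate the transitions; the class, method, and file names are hypothetical, and real ephemeral nodes are simulated by an explicit release call.

```python
class Bookkeeper:
    """In-memory stand-in for the ZK-backed bookkeeping (illustration only)."""

    def __init__(self):
        self.unassigned = set()   # files published by the AM
        self.locked = {}          # file -> id of the worker holding the lock
        self.checkpoint = {}      # file -> last committed record offset
        self.processed = set()    # files fully processed

    def publish(self, path):
        """AM adds a new file unless it is already in progress or done."""
        if path not in self.locked and path not in self.processed:
            self.unassigned.add(path)

    def acquire(self, worker):
        """Worker locks an available file; returns (path, resume_offset) or None."""
        if not self.unassigned:
            return None
        path = min(self.unassigned)  # deterministic choice for the example
        self.unassigned.remove(path)
        self.locked[path] = worker
        return path, self.checkpoint.get(path, 0)

    def save_checkpoint(self, path, offset):
        self.checkpoint[path] = offset

    def complete(self, path):
        del self.locked[path]
        self.checkpoint.pop(path, None)
        self.processed.add(path)

    def release_dead_worker(self, worker):
        """Simulates ephemeral locks vanishing: the work becomes available again."""
        for path in [p for p, w in self.locked.items() if w == worker]:
            del self.locked[path]
            self.unassigned.add(path)


bk = Bookkeeper()
bk.publish("/logs/bids/part-0001")      # hypothetical file name
path, offset = bk.acquire("worker-a")   # offset is 0 on first pickup
bk.save_checkpoint(path, 500)
bk.release_dead_worker("worker-a")      # worker-a crashes mid-file
path, offset = bk.acquire("worker-b")   # worker-b resumes at offset 500
bk.complete(path)
```

Keeping the checkpoint separate from the lock is what lets a second worker resume from the saved record offset instead of re-reading the file from the start.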
HDFS Spout Architecture
• Spout Failures
– Slaves - work is made available again by the master
– Master - one of the slaves becomes master via leader
election and gives up its slave duties
• Spouts contend for work assignment via ZK ephemeral
nodes
• Leverages the partitioned data directories and done-marker
based model used in the organization
Comparison with official HDFS spout
STORM-1199
• Uses HDFS for bookkeeping
• Moves or renames source files
• All-slave architecture; all spouts
contend for failed work
• Does not leverage partitioned data
• Kerberos support
In-house Implementation
● Uses ZK for book-keeping.
● No changes to source files
● Master-Slave architecture with
leader election
● Leverages partitioned data and
done-markers.
● No Kerberos support.
Step 2: Join via HBase
Logs
Storm Cluster
(Slider and YARN)
HBase Cluster
(Slider and YARN)
Donemarkers
HBase for joining streams of data
• Use request-id as key, to join different streams
• Different Column Qualifiers for different event streams
• HBase Cluster configuration
–Running on YARN as a service via Slider
–Region servers: 40 instances, with 4G memory each
–Optimized for writes, with a large MemStore
–Tuned compactions to avoid unnecessary merging of files,
as they expire quickly (low retention)
•Date-based compactions are available in HBase 2.0.
• Write throughput: 1M+ TPS
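The join works because every stream writes into the same row, keyed by request-id, under its own column qualifier, so arrival order does not matter. A minimal in-memory sketch of that idea; the dictionary stands in for the HBase table, and the qualifier names and sample events are assumptions for illustration:

```python
from collections import defaultdict

# In-memory stand-in for an HBase table: row key -> {column qualifier: value}.
table = defaultdict(dict)


def put(request_id, stream, event):
    """Write one event under its stream's qualifier; arrival order is irrelevant."""
    table[request_id][stream] = event


def joined_view(request_id, streams=("bid", "notification", "impression")):
    """Return the row, with None for streams that have not arrived."""
    row = table.get(request_id, {})
    return {s: row.get(s) for s in streams}


put("req-1", "impression", {"ts": 1003})     # fast stream arrives first
put("req-1", "bid", {"ts": 1000})
put("req-1", "notification", {"ts": 1001})
put("req-2", "bid", {"ts": 2000})
put("req-2", "notification", {"ts": 2001})   # no impression: a discrepancy candidate

print(joined_view("req-2"))
# {'bid': {'ts': 2000}, 'notification': {'ts': 2001}, 'impression': None}
```

A missing qualifier in an otherwise populated row is exactly the signal the later steps aggregate on.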
Observations from running Storm at scale
• ZeroMQ more stable than Netty in version 0.9.x
– Many Netty Optimizations available in 0.10.x
• Local-shuffle mode helpful for large data volumes
• Need to tune heartbeat intervals
– (task|worker|supervisor).heartbeat.frequency.secs
– Pacemaker: Available in 1.0
• Need to tune code sync interval
– Distributed Cache: Available in 1.0
Step 3: Scan joined view and populate OLAP
OLAP
Metrics
Donemarkers
Event
Streams
Start MR Job
OLAP with multi-dimensional data
• Developed a MapReduce-backed workflow
– Cron triggered hourly jobs based on donemarkers
– Scan data from HBase using snapshots
– Semantics for hour boundaries
– Event metric reporting
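A donemarker-driven trigger can be sketched as a check that every input stream has marked an hour as done before the hourly job runs for it. This assumes one donemarker per stream per hour, which the slides do not spell out; the function name, stream names, and hour format are hypothetical.

```python
def closed_hours(donemarkers, required_streams, already_processed=()):
    """Hours for which every required stream has a donemarker and no job has run.

    donemarkers: dict mapping stream name -> set of hour strings marked done.
    """
    ready = set.intersection(
        *(set(donemarkers.get(s, ())) for s in required_streams)
    )
    return sorted(ready - set(already_processed))


markers = {
    "bids": {"2016-06-28T10", "2016-06-28T11"},
    "notifications": {"2016-06-28T10", "2016-06-28T11"},
    "impressions": {"2016-06-28T10"},
}
streams = ["bids", "notifications", "impressions"]
print(closed_hours(markers, streams))
# ['2016-06-28T10']
```

Hour 11 stays open until the slowest stream publishes its marker, which is how late-arriving logs delay the close of an hour.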
OLAP with multi-dimensional data
• Modular API for processing records
– Pluggable architecture for different use-cases
– OLAP implemented as a first-class use-case
• Use datacube library (Urban Airship) for generating OLAP
data.
– Configurable metric reporting.
OLAP with multi-dimensional data
Datacube for OLAP
• Library was developed at Urban Airship.
• About the API
– Need to define dimensions and rollups for the cube
– IO library for writing measures for cube
– Pluggable Databases: HBase, In-memory Map
– ID Service: Optimization for encoding values via ID substitution
– Support for bulk-loading and backfilling
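To make "dimensions and rollups" concrete, here is a toy counter cube. It is not the datacube library's API (that library is Java); the class, its methods, and the sample data are invented purely to illustrate how one write fans out to one cell per rollup.

```python
from collections import Counter


class TinyCube:
    """Toy counter cube: each rollup is a tuple of dimension names to group by."""

    def __init__(self, dimensions, rollups):
        for rollup in rollups:
            unknown = set(rollup) - set(dimensions)
            if unknown:
                raise ValueError(f"unknown dimensions: {sorted(unknown)}")
        self.rollups = [tuple(r) for r in rollups]
        self.cells = Counter()

    def write(self, event, amount=1):
        """Add `amount` to one cell per configured rollup."""
        for rollup in self.rollups:
            key = (rollup, tuple(event[d] for d in rollup))
            self.cells[key] += amount

    def read(self, **coords):
        """Read the cell of the rollup whose dimensions match the coordinates."""
        rollup = next(r for r in self.rollups if set(r) == set(coords))
        return self.cells[(rollup, tuple(coords[d] for d in rollup))]


cube = TinyCube(
    dimensions=["site", "browser"],
    rollups=[("site",), ("browser",), ("site", "browser")],
)
cube.write({"site": "a.example", "browser": "chrome"})
cube.write({"site": "a.example", "browser": "firefox"})
cube.write({"site": "b.example", "browser": "chrome"})
print(cube.read(site="a.example"))                    # 2
print(cube.read(browser="chrome"))                    # 2
print(cube.read(site="a.example", browser="chrome"))  # 1
```

Paying the fan-out cost at write time is what makes reads a single key lookup.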
OLAP with multi-dimensional data
New features (forked)
• Reverse lookups for scans
• New InputFormat for MR Jobs
• Prefix hashes (data and lookups) for load distribution.
• Optimized DB performance by using Async HBase library for efficient
reads/writes
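One common reading of "prefix hashes for load distribution" is key salting: a short hash-derived prefix is prepended to each row key so that adjacent keys land in different regions. The sketch below assumes that interpretation; the bucket count, the CRC32 choice, and the function names are illustrative and not taken from the fork.

```python
import zlib


def salted_key(key: bytes, buckets: int = 16) -> bytes:
    """Prepend a one-byte bucket id derived from a checksum of the key."""
    if not 1 <= buckets <= 256:
        raise ValueError("buckets must fit in one byte")
    bucket = zlib.crc32(key) % buckets
    return bytes([bucket]) + key


def unsalted_key(salted: bytes) -> bytes:
    """Strip the one-byte prefix to recover the original key."""
    return salted[1:]


# Sequential keys, which would otherwise sort next to each other.
keys = [f"site-{i:05d}".encode() for i in range(1000)]
used_buckets = {salted_key(k)[0] for k in keys}
print(len(used_buckets))  # number of distinct buckets the keys spread over
```

Because the prefix is a pure function of the key, point lookups stay cheap; range scans, however, must fan out across all buckets.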
MR Job statistics
• Use HBase Snapshots
• MR job runs every hour (Run time: 5-15mins)
• Hour is closed with a delay of 30-60 minutes (on average), accounting for
log rotation and shipping (Scribe) latencies.
Step 4: Scan OLAP cube for top feature vectors
OLAP
Metrics
Donemarkers
Start MR Job
Feature
Vectors
OLAP with multi-dimensional data
Serialize OLAP View
• Customizable MapReduce Job scans OLAP data (backed by HBase),
writes to HDFS.
• Different Jobs can use this easily accessible data from HDFS for
processing, and upload computed feedback stats to sources like MySQL
MR Job Statistics
• MR job runs every hour (Runtime: 2-5mins)
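Picking the "top feature vectors" from the serialized OLAP view can be sketched as ranking cells by discrepancy rate. The minimum-traffic threshold, the data layout, and all sample values below are assumptions for illustration; the talk does not describe the actual ranking criteria.

```python
import heapq


def top_culprits(cells, k=2, min_upstream=100):
    """Rank feature vectors by discrepancy rate.

    cells: dict mapping feature vector -> (upstream_count, downstream_count).
    Vectors with fewer than min_upstream events are skipped as too noisy.
    """
    scored = []
    for vector, (up, down) in cells.items():
        if up >= min_upstream:
            scored.append(((up - down) / up, vector))
    return [(vector, rate) for rate, vector in heapq.nlargest(k, scored)]


cells = {
    ("site=a.example", "browser=chrome"): (1000, 950),
    ("site=b.example", "browser=chrome"): (2000, 1200),
    ("site=c.example", "browser=firefox"): (50, 0),   # too little traffic to judge
    ("site=d.example", "browser=safari"): (500, 400),
}
for vector, rate in top_culprits(cells):
    print(vector, f"{rate:.0%}")
# ('site=b.example', 'browser=chrome') 40%
# ('site=d.example', 'browser=safari') 20%
```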
DevOps Automation
• Monitoring Service
• Topology submission service
Key Takeaways
• Hadoop ecosystem offers a productive stack for high velocity
real time learning problems
• YARN allows one to easily experiment with and tweak vertical
to horizontal scalability ratios
THANKS!
ANY QUESTIONS?
Reach us at
ichhabra@rocketfuel.com
naggarwal@rocketfuel.com

More Related Content

What's hot

A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
Spark Summit
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
DataWorks Summit/Hadoop Summit
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Data Con LA
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
DataWorks Summit/Hadoop Summit
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
DataWorks Summit
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
gregchanan
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
DataWorks Summit
 
High-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in HadoopHigh-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in Hadoop
DataWorks Summit/Hadoop Summit
 
Securing Spark Applications
Securing Spark ApplicationsSecuring Spark Applications
Securing Spark Applications
DataWorks Summit/Hadoop Summit
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
Data Con LA
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
DataWorks Summit/Hadoop Summit
 
Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and Future
DataWorks Summit
 
Spark Technology Center IBM
Spark Technology Center IBMSpark Technology Center IBM
Spark Technology Center IBM
DataWorks Summit/Hadoop Summit
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the Cloud
DataWorks Summit
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
DataWorks Summit
 
LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data
DataWorks Summit/Hadoop Summit
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
DataWorks Summit/Hadoop Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
DataWorks Summit/Hadoop Summit
 
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleManaging Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
DataWorks Summit/Hadoop Summit
 
Securing Data in Hadoop at Uber
Securing Data in Hadoop at UberSecuring Data in Hadoop at Uber
Securing Data in Hadoop at Uber
DataWorks Summit
 

What's hot (20)

A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
 
High-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in HadoopHigh-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in Hadoop
 
Securing Spark Applications
Securing Spark ApplicationsSecuring Spark Applications
Securing Spark Applications
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Apache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and Future
 
Spark Technology Center IBM
Spark Technology Center IBMSpark Technology Center IBM
Spark Technology Center IBM
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the Cloud
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
 
LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleManaging Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
 
Securing Data in Hadoop at Uber
Securing Data in Hadoop at UberSecuring Data in Hadoop at Uber
Securing Data in Hadoop at Uber
 

Viewers also liked

Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
DataWorks Summit
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
Gwen (Chen) Shapira
 
Big Data Application Architectures - Fraud Detection
Big Data Application Architectures - Fraud DetectionBig Data Application Architectures - Fraud Detection
Big Data Application Architectures - Fraud Detection
DataWorks Summit/Hadoop Summit
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
Michael Noll
 
Lambda Architecture 2.0 Convergence between Real-Time Analytics, Context-awar...
Lambda Architecture 2.0 Convergence between Real-Time Analytics, Context-awar...Lambda Architecture 2.0 Convergence between Real-Time Analytics, Context-awar...
Lambda Architecture 2.0 Convergence between Real-Time Analytics, Context-awar...
Sabri Skhiri
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
hadooparchbook
 
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...
Spark Summit
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
DataWorks Summit
 
Automating OpenStack clouds and beyond w/ StackStorm
Automating OpenStack clouds and beyond w/ StackStormAutomating OpenStack clouds and beyond w/ StackStorm
Automating OpenStack clouds and beyond w/ StackStorm
OpenStack_Online
 
Introduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleIntroduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & Example
Dung Ngua
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
StampedeCon
 
Jaws - Data Warehouse with Spark SQL by Ema Orhian
Jaws - Data Warehouse with Spark SQL by Ema OrhianJaws - Data Warehouse with Spark SQL by Ema Orhian
Jaws - Data Warehouse with Spark SQL by Ema Orhian
Spark Summit
 
Storm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-DataStorm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-Data
DataWorks Summit
 
CES - C Space Storytelling Session - Programmatic TV Advertising
CES - C Space Storytelling Session - Programmatic TV AdvertisingCES - C Space Storytelling Session - Programmatic TV Advertising
CES - C Space Storytelling Session - Programmatic TV Advertising
Rocket Fuel Inc.
 
Hado“OPS” or Had “oops”
Hado“OPS” or Had “oops”Hado“OPS” or Had “oops”
Hado“OPS” or Had “oops”
Rocket Fuel Inc.
 
Traffic Quality Webinar
Traffic Quality WebinarTraffic Quality Webinar
Traffic Quality Webinar
Rocket Fuel Inc.
 
How did you know this Ad will be relevant for me?!
How did you know this Ad will be relevant for me?!How did you know this Ad will be relevant for me?!
How did you know this Ad will be relevant for me?!
Rocket Fuel Inc.
 
Rocket fuel cross device and ptv 12-9-15 sharedv2
Rocket fuel cross device and ptv   12-9-15 sharedv2Rocket fuel cross device and ptv   12-9-15 sharedv2
Rocket fuel cross device and ptv 12-9-15 sharedv2
Rocket Fuel Inc.
 
Guide to Programmatic Marketing Webinar Deck
Guide to Programmatic Marketing Webinar DeckGuide to Programmatic Marketing Webinar Deck
Guide to Programmatic Marketing Webinar Deck
Rocket Fuel Inc.
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Tugdual Grall
 

Viewers also liked (20)

Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Big Data Application Architectures - Fraud Detection
Big Data Application Architectures - Fraud DetectionBig Data Application Architectures - Fraud Detection
Big Data Application Architectures - Fraud Detection
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Lambda Architecture 2.0 Convergence between Real-Time Analytics, Context-awar...
Lambda Architecture 2.0 Convergence between Real-Time Analytics, Context-awar...Lambda Architecture 2.0 Convergence between Real-Time Analytics, Context-awar...
Lambda Architecture 2.0 Convergence between Real-Time Analytics, Context-awar...
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Automating OpenStack clouds and beyond w/ StackStorm
Automating OpenStack clouds and beyond w/ StackStormAutomating OpenStack clouds and beyond w/ StackStorm
Automating OpenStack clouds and beyond w/ StackStorm
 
Introduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleIntroduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & Example
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
 
Jaws - Data Warehouse with Spark SQL by Ema Orhian
Jaws - Data Warehouse with Spark SQL by Ema OrhianJaws - Data Warehouse with Spark SQL by Ema Orhian
Jaws - Data Warehouse with Spark SQL by Ema Orhian
 
Storm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-DataStorm-on-YARN: Convergence of Low-Latency and Big-Data
Storm-on-YARN: Convergence of Low-Latency and Big-Data
 
CES - C Space Storytelling Session - Programmatic TV Advertising
CES - C Space Storytelling Session - Programmatic TV AdvertisingCES - C Space Storytelling Session - Programmatic TV Advertising
CES - C Space Storytelling Session - Programmatic TV Advertising
 
Hado“OPS” or Had “oops”
Hado“OPS” or Had “oops”Hado“OPS” or Had “oops”
Hado“OPS” or Had “oops”
 
Traffic Quality Webinar
Traffic Quality WebinarTraffic Quality Webinar
Traffic Quality Webinar
 
How did you know this Ad will be relevant for me?!
How did you know this Ad will be relevant for me?!How did you know this Ad will be relevant for me?!
How did you know this Ad will be relevant for me?!
 
Rocket fuel cross device and ptv 12-9-15 sharedv2
Rocket fuel cross device and ptv   12-9-15 sharedv2Rocket fuel cross device and ptv   12-9-15 sharedv2
Rocket fuel cross device and ptv 12-9-15 sharedv2
 
Guide to Programmatic Marketing Webinar Deck
Guide to Programmatic Marketing Webinar DeckGuide to Programmatic Marketing Webinar Deck
Guide to Programmatic Marketing Webinar Deck
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
 

Similar to Real time fraud detection at 1+M scale on hadoop stack

Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
ssuserd3a367
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
chariorienit
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
VMware Tanzu
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
gluent.
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
Federico Cargnelutti
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
pbelko82
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
HBaseCon
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
Paolo latella
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
StampedeCon
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
ssusere05ec21
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
Data Con LA
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Gyula Fóra
 
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
xPatterns ... beyond Hadoop (Spark, Shark, Mesos, Tachyon)
Claudiu Barbura
 
Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1
Sandeep Kunkunuru
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
markgrover
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
MapR Technologies
 
Hadoop and friends
Hadoop and friendsHadoop and friends
Hadoop and friends
Chandan Rajah
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 


Real time fraud detection at 1+M scale on hadoop stack

  • 1. Real time fraud detection at 1+M scale on hadoop stack Ishan Chhabra Nitin Aggarwal Rocketfuel Inc
  • 2. Agenda • Rocketfuel & Advertising auction process • Various kinds of frauds • Problem statement • Helios: Architecture • Implementation in Hadoop Ecosystem • Details about HDFS spout and datacube • Key takeaways
  • 3. Rocketfuel Inc • AdTech firm that enables marketers using AI & Big Data • Scores 120+ Billion Ad auctions in a day • Handles 1-2 Million TPS at peak traffic
  • 5. Exchange - Rocketfuel discrepancy (4b) Notification(5) Record impression count(4b) != count(5)
  • 6. Rocketfuel - Advertiser discrepancy (5) Record impression count(5) != count(6)
  • 7. Common causes • Fraud – Bot networks and malware – Hidden ad slots • Human error – Site- or browser-specific issues in Ad JavaScript – Bugs in Ad JavaScript – 3rd-party JavaScript interactions in Ad or site
  • 8. Need for real time • Micro-patterns that change frequently • Latency has a big business impact; delays in reacting lead to loss of money • Discrepancies often arise from breakages and sudden, unexpected changes
  • 9. Goal: Significantly reduce money loss from both ends by reacting to these micropatterns in near real time
  • 11. Data flow • Bids & Notifications (batched and delayed) • Impressions (near real time) • Bidding Site • Analytics Site
  • 12. Problem statement • 3 streams with various delays (2 from HDFS, 1 from Kafka) • Join and aggregate • Filter among 2^n feature combinations to identify the top culprits (OLAP cube) • Feedback into bidding
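The "2^n feature combinations" in slide 12 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the feature names, the sample events, and the `min_events` threshold are all assumptions made for illustration. Each joined event contributes to every subset of its features, and the cells with the highest share of notifications lacking a recorded impression are reported as candidate culprits.

```python
from collections import Counter
from itertools import combinations

# Hypothetical feature names; the slides do not list the real ones.
FEATURES = ("site", "browser", "exchange")

# Hypothetical joined events: "impression" says whether a notification (4b)
# was followed by a recorded impression (5).
EVENTS = [
    {"site": "a.com", "browser": "chrome", "exchange": "x1", "impression": True},
    {"site": "a.com", "browser": "chrome", "exchange": "x1", "impression": True},
    {"site": "b.com", "browser": "chrome", "exchange": "x1", "impression": False},
    {"site": "b.com", "browser": "firefox", "exchange": "x1", "impression": False},
    {"site": "b.com", "browser": "firefox", "exchange": "x2", "impression": True},
    {"site": "a.com", "browser": "firefox", "exchange": "x2", "impression": True},
]


def feature_subsets(features):
    """Yield all 2^n subsets of the feature names, including the empty one."""
    for size in range(len(features) + 1):
        yield from combinations(features, size)


def build_cube(events, features):
    """Count notifications and missing impressions for every cube cell."""
    notified = Counter()
    missing = Counter()
    for event in events:
        for subset in feature_subsets(features):
            cell = tuple((name, event[name]) for name in subset)
            notified[cell] += 1
            if not event["impression"]:
                missing[cell] += 1
    return notified, missing


def top_culprits(notified, missing, min_events=2, limit=3):
    """Rank cells by discrepancy rate, ignoring cells with too few events."""
    rows = [
        (missing[cell] / total, total, cell)
        for cell, total in notified.items()
        if total >= min_events
    ]
    rows.sort(key=lambda row: (-row[0], -row[1], row[2]))
    return rows[:limit]


if __name__ == "__main__":
    notified, missing = build_cube(EVENTS, FEATURES)
    for rate, total, cell in top_culprits(notified, missing):
        print(f"{rate:.2f} over {total} events: {dict(cell)}")
```

Because every event touches 2^n cells, the write amplification grows quickly with the number of features, which is one reason a cube library with explicit rollup definitions is attractive at this scale.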
  • 13. Lambda architecture Logs Storm & HBase on YARN (Slider) Serving Infra (Bidders and Ad-servers) Near real-time pipeline Batch pipeline
  • 14. Helios: Abstraction for real time learning • Real time processing of data streams from sources like Kafka and HDFS, with efficient join • Process joined event views to generate different analytics, using HBase and MapReduce • OLAP support • Join with dimensional data; different use-cases
  • 15. Logs Storm Cluster (Slider and YARN) HBase Cluster (Slider and YARN) Serving Infra (Bidders and Ad-servers) Helios architecture OLAP Metrics
  • 16. Step 1a: Ingesting events from Kafka Logs Storm Cluster (Slider and YARN) Serving Infra (Bidders and Ad-servers)
  • 17. Processing Kafka events in real-time • Relies on log streams written to Kafka by scribe • Kafka topic with 200+ partitions • Data produced and written via scribe from more than 3K nodes • Using upstream Kafka spout to read data – Spout granularity is at record level – Uses ZooKeeper extensively for book-keeping
  • 18. Processing Kafka events in real-time • Topology Statistics: – Running on YARN as an application, so easily scalable (Container memory: 2700m) – Running with 25 workers (5 executors/worker) – Supervisor JVM opts: -Xms512m -Xmx512m -XX:PermSize=64m -XX:MaxPermSize=64m – Worker JVM opts: -Xmx1800m -Xms1800m – Processing nearly 100K events per second
  • 19. Step 1b: Ingesting events from HDFS Logs Storm Cluster (Slider and YARN) Serving Infra (Bidders and Ad-servers)
  • 20. Processing HDFS events in real-time • Relies on log streams written to HDFS by scribe • WAN limitations introduce high compression needs • DistCp, rather than Kafka • Using in-house Storm spout to read streams from HDFS
  • 21. Processing Bid-logs in real-time • Storm Topology Statistics: – Running on YARN as an application via Slider (easily scalable); Container memory: 2700m – Currently running with 350 workers (~10 executors/worker) – Supervisor JVM opts: -Xms512m -Xmx512m -XX:PermSize=64m -XX:MaxPermSize=64m – Worker JVM opts: -Xmx1800m -Xms1800m – Processing roughly 1.5-2.0 million events per second (~100+B events per day)
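As a quick sanity check on the figures in slides 3 and 21, the following Python sketch converts between per-second and per-day rates. It assumes, as a reading of the slides rather than a stated fact, that the per-second figures are peak rates and that the `2700m` container size and the `1800m` worker heap are expressed in the same unit.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def events_per_day(events_per_second: int) -> int:
    """Daily volume if the given rate were sustained for a full day."""
    return events_per_second * SECONDS_PER_DAY


def average_rate(events_in_a_day: int) -> float:
    """Average events per second implied by a daily total."""
    return events_in_a_day / SECONDS_PER_DAY


# Peak rates quoted in slide 21, sustained for a whole day.
low = events_per_day(1_500_000)
high = events_per_day(2_000_000)

# Average rate implied by the "120+ billion auctions in a day" of slide 3.
avg = average_rate(120_000_000_000)

# Memory left in each container after the worker heap (assumed same unit).
headroom = 2700 - 1800

if __name__ == "__main__":
    print(f"sustained peak: {low:,} to {high:,} events/day")
    print(f"average implied by 120B/day: {avg:,.0f} events/s")
    print(f"container headroom beyond heap: {headroom}m")
```

A sustained 1.5-2.0 million events per second would give well over 100 billion events a day, so the "100+B events per day" figure is consistent with an average rate somewhat below the quoted peak.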
  • 22. HDFS Spout Architecture • Master-slave architecture • Spout granularity is at file level, with record-level offset bookkeeping • Uses ZooKeeper extensively for book-keeping – Curator and its recipes make life a lot easier • Heavily influenced by the Kafka spout
  • 23. HDFS Spout Architecture • Spout Leader • Spout Workers • un-assigned • locked • checkpoint • done • offset • offset-lock
  • 24. HDFS Spout Architecture • Assignment Manager (AM): – Elected via a leader-election algorithm – Polls HDFS periodically to identify new files, based on timestamp and partitioned paths – Publishes files to be processed as work tasks in ZooKeeper (ZK) – Manages time and path offsets, for cleaning up done nodes – Creates periodic done-markers on HDFS
  • 25. HDFS Spout Architecture • Worker (W): – Once done with its current work, selects a work task from those available in ZK, locking it with an ephemeral node – Checkpoints file progress by saving the record offset in ZK – Creates a done node in ZK after processing the file
  • 26. HDFS Spout Architecture Bookkeeping node hierarchy: • Pluggable backend: current implementation uses ZK • Work life cycle – unassigned - file added here by AM – locked - created by worker on selecting work – checkpoint - periodic checkpointing here – processed - created by worker on completion • Offset management – offset - stores path, time offset of HDFS – offset-lock - ephemeral lock for offset update
  • 27. HDFS Spout Architecture • Spout failures – Slaves - work is made available again by the master – Master - one of the slaves becomes master via leader election and gives up its slave duties • Spouts contend for work assignment via ZK ephemeral nodes • Leverages the organization's model of partitioned data directories and done-markers
  • 28. Comparison with official HDFS spout • Storm-1199: – Uses HDFS for book-keeping – Moves or renames source files – All-slave architecture; all spouts contend for failed work – Does not leverage partitioned data – Kerberos support • In-house implementation: – Uses ZK for book-keeping – No changes to source files – Master-slave architecture with leader election – Leverages partitioned data and done-markers – No Kerberos support
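The work life cycle and failure handling of slides 24 to 27 can be sketched with an in-memory stand-in for the bookkeeping backend. This is an illustration under assumptions, not the in-house spout: the class and method names are hypothetical, the real implementation keeps this state in ZooKeeper nodes, and `worker_died` only simulates what the loss of an ephemeral lock would trigger.

```python
class InMemoryBookkeeper:
    """Hypothetical stand-in for the ZK-backed bookkeeping node hierarchy."""

    def __init__(self):
        self.unassigned = []    # files published by the assignment manager
        self.locked = {}        # file path -> worker currently holding it
        self.checkpoint = {}    # file path -> last saved record offset
        self.processed = set()  # files that were read to completion

    def publish(self, path):
        """Assignment manager adds a newly discovered file as a work task."""
        known = path in self.processed or path in self.locked
        if not known and path not in self.unassigned:
            self.unassigned.append(path)

    def claim(self, worker):
        """Worker takes the next task; returns (path, resume_offset) or None."""
        if not self.unassigned:
            return None
        path = self.unassigned.pop(0)
        self.locked[path] = worker
        return path, self.checkpoint.get(path, 0)

    def save_offset(self, worker, path, offset):
        """Worker records how far into the file it has read."""
        self._require_owner(worker, path)
        self.checkpoint[path] = offset

    def complete(self, worker, path):
        """Worker marks the file as processed and releases its lock."""
        self._require_owner(worker, path)
        del self.locked[path]
        self.checkpoint.pop(path, None)
        self.processed.add(path)

    def worker_died(self, worker):
        """Simulate lost ephemeral locks: the work becomes available again."""
        for path, owner in list(self.locked.items()):
            if owner == worker:
                del self.locked[path]
                self.unassigned.append(path)

    def _require_owner(self, worker, path):
        if self.locked.get(path) != worker:
            raise RuntimeError(f"{worker} does not hold the lock on {path}")


if __name__ == "__main__":
    keeper = InMemoryBookkeeper()
    keeper.publish("/logs/hour=00/part-0")
    path, offset = keeper.claim("worker-1")
    keeper.save_offset("worker-1", path, 500)
    keeper.worker_died("worker-1")
    print(keeper.claim("worker-2"))  # resumes from the saved offset
```

Keeping the checkpoint separate from the lock is what lets a replacement worker resume from the last saved record offset instead of re-reading the whole file.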
  • 29. Step 2: Join via HBase Logs Storm Cluster (Slider and YARN) HBase Cluster (Slider and YARN) Donemarkers
  • 30. HBase for joining streams of data • Use request-id as key, to join different streams • Different column qualifiers for different event streams • HBase cluster configuration – Running on YARN as a service via Slider – Region-servers: 40 instances, with 4G memory each – Optimized for writes, with large MemStore – Tuned compactions, to avoid unnecessary merging of files, as they expire quickly (low retention); date-based compactions are available in HBase 2.0 • Write throughput: 1M+ TPS
  • 31. Observations from running Storm at scale • ZeroMQ is more stable than Netty in version 0.9.x – Many Netty optimizations are available in 0.10.x • Local-shuffle mode is helpful for large data volumes • Need to tune heartbeat intervals – (task|worker|supervisor).heartbeat.frequency.secs – Pacemaker: available in 1.0 • Need to tune code sync interval – Distributed Cache: available in 1.0
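The join described in slide 30 relies on every stream writing into the same row, keyed by request-id, under its own column qualifier. The Python sketch below mimics that layout with a dictionary of dictionaries. It is only an illustration: the class, the stream names used as qualifiers, and the sample request-ids are assumptions, and no HBase client is involved.

```python
from collections import defaultdict


class JoinTable:
    """In-memory stand-in for a table keyed by request-id (hypothetical)."""

    def __init__(self):
        # row key (request-id) -> {column qualifier (stream): payload}
        self.rows = defaultdict(dict)

    def put(self, request_id, stream, payload):
        """Write one event; streams may arrive in any order and at any time."""
        self.rows[request_id][stream] = payload

    def joined_view(self, required=("bid", "notification")):
        """Yield rows that contain at least the required streams."""
        for request_id, columns in sorted(self.rows.items()):
            if all(stream in columns for stream in required):
                yield request_id, dict(columns)

    def missing_impressions(self):
        """Request-ids that were notified but never recorded an impression."""
        return sorted(
            request_id
            for request_id, columns in self.rows.items()
            if "notification" in columns and "impression" not in columns
        )


if __name__ == "__main__":
    table = JoinTable()
    # The near-real-time impression can land before the delayed bid log.
    table.put("r1", "impression", {"ts": 3})
    table.put("r1", "bid", {"price": 1.2})
    table.put("r1", "notification", {"ts": 2})
    table.put("r2", "bid", {"price": 0.8})
    table.put("r2", "notification", {"ts": 5})
    table.put("r3", "bid", {"price": 0.5})
    print(table.missing_impressions())
```

Because each stream only ever writes its own column, the writers need no coordination with one another, which suits a write-optimized store fed by streams with different delays.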
  • 32. Step 3: Scan joined view and populate OLAP OLAP Metrics Donemarkers Event Streams Start MR Job
  • 33. OLAP with multi-dimensional data • Developed a MapReduce-backed workflow – Cron-triggered hourly jobs based on donemarkers – Scan data from HBase using snapshots – Semantics for hour boundaries – Event metric reporting
  • 34. OLAP with multi-dimensional data • Modular API for processing records – Pluggable architecture for different use-cases – OLAP implemented as a first-class use-case • Use datacube library (Urban Airship) for generating OLAP data. – Configurable metric reporting.
  • 35. OLAP with multi-dimensional data Datacube for OLAP • Library was developed at Urban Airship. • About the API – Need to define dimensions and rollups for the cube – IO library for writing measures for cube – Pluggable Databases: HBase, In-memory Map – ID Service: Optimization for encoding values via ID substitution – Support for bulk-loading and backfilling
  • 36. OLAP with multi-dimensional data • New features (forked) – Reverse lookups for scans – New InputFormat for MR jobs – Prefix hashes (data and lookups) for load distribution – Optimized DB performance by using the Async HBase library for efficient reads/writes • MR job statistics – Uses HBase snapshots – MR job runs every hour (run time: 5-15 mins) – Hour is closed with delays of 30-60 minutes (on average), considering log rotation and shipping (scribe) latencies
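Slide 36 mentions prefix hashes for load distribution without describing the scheme. One common way to do this is to salt each row key with a short hash-derived prefix, so that keys sharing a leading value do not all land in the same region. The Python sketch below shows that general technique only; the bucket count, the use of SHA-256, and all function names are assumptions and are not taken from the forked datacube library.

```python
import hashlib


def salt_for(row_key: bytes, buckets: int = 16) -> int:
    """Derive a stable bucket number from the row key itself."""
    if not 1 <= buckets <= 256:
        raise ValueError("buckets must fit in a single prefix byte")
    return hashlib.sha256(row_key).digest()[0] % buckets


def salted_key(row_key: bytes, buckets: int = 16) -> bytes:
    """Prepend the one-byte bucket so writes spread across key ranges."""
    return bytes([salt_for(row_key, buckets)]) + row_key


def unsalted_key(stored_key: bytes) -> bytes:
    """Recover the logical row key by dropping the prefix byte."""
    return stored_key[1:]


def scan_prefixes(logical_prefix: bytes, buckets: int = 16) -> list:
    """A scan by logical prefix has to fan out to every bucket."""
    return [bytes([bucket]) + logical_prefix for bucket in range(buckets)]


if __name__ == "__main__":
    keys = [f"2016-06-28T10:{minute:02d}".encode() for minute in range(60)]
    used = {salted_key(key)[0] for key in keys}
    print(f"{len(keys)} sequential keys spread over {len(used)} buckets")
```

The trade-off is visible in `scan_prefixes`: point reads stay cheap because the prefix is recomputed from the key, but a range scan must issue one scan per bucket.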
  • 37. Step 4: Scan OLAP cube for top feature vectors OLAP Metrics Donemarkers Start MR Job Feature Vectors
  • 38. OLAP with multi-dimensional data • Serialize OLAP view – Customizable MapReduce job scans OLAP data (backed by HBase) and writes to HDFS – Different jobs can use this easily accessible data from HDFS for processing, and upload computed feedback stats to sources like MySQL • MR job statistics – MR job runs every hour (run time: 2-5 mins)
  • 39. DevOps Automation • Monitoring Service • Topology submission service
  • 40. Key Takeaways • Hadoop ecosystem offers a productive stack for high velocity real time learning problems • YARN allows one to easily experiment with and tweak vertical to horizontal scalability ratios
  • 41. THANKS! ANY QUESTIONS? Reach us at ichhabra@rocketfuel.com naggarwal@rocketfuel.com