Big Data Platform at Pinterest

129 likes•22,112 views

This document discusses Pinterest's data architecture and use of Pinball for workflow management. Pinterest processes 3 petabytes of data daily from their 60 billion pins and 1 billion boards across a 2000 node Hadoop cluster. They use Kafka, Secor and Singer for ingesting event data. Pinball is used for workflow management to handle their scale of hundreds of workflows, thousands of jobs and 500+ jobs in some workflows. Pinball provides simple abstractions, extensibility, reliability, debuggability and horizontal scalability for workflow execution.

What's hot

Introduction to Amazon Aurora

Amazon Web Services

This document provides an introduction to Amazon Aurora, AWS's managed relational database service. It discusses how Aurora was built to provide the speed and availability of commercial databases at the simplicity and cost-effectiveness of open source databases. The document outlines key Aurora features like automatic scaling, continuous backups, replication across Availability Zones, and integration with other AWS services. Customer case studies show how Aurora provides better performance at lower costs than alternative database options. The document also covers migration options and how Aurora offers a simpler, more cost-effective database solution than on-premises or self-managed options.

Intro to Delta Lake

Databricks

Delta Lake brings reliability, performance, and security to data lakes. It provides ACID transactions, schema enforcement, and unified handling of batch and streaming data to make data lakes more reliable. Delta Lake also features lightning fast query performance through its optimized Delta Engine. It enables security and compliance at scale through access controls and versioning of data. Delta Lake further offers an open approach and avoids vendor lock-in by using open formats like Parquet that can integrate with various ecosystems.

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Databricks

This document summarizes the growth and development of the Spark project. It notes that Spark has grown significantly over the past year in terms of contributors, companies involved, and lines of code. Spark is now one of the most active projects within the Apache Hadoop ecosystem. The document outlines major new additions to Spark including Spark SQL for structured data, MLlib for machine learning algorithms, and Java 8 APIs. It discusses the vision for Spark as a unified platform and standard library for big data applications.

Considerations for Data Access in the Lakehouse

Databricks

Organizations are increasingly exploring lakehouse architectures with Databricks to combine the best of data lakes and data warehouses. Databricks SQL Analytics introduces new innovation on the “house” to deliver data warehousing performance with the flexibility of data lakes. The lakehouse supports a diverse set of use cases and workloads that require distinct considerations for data access. On the lake side, tables with sensitive data require fine-grained access controls that are enforced across the raw data and derivative data products via feature engineering or transformations. On the house side, tables can require fine-grained data access such as row-level segmentation for data sharing, and additional transformations using analytics engineering tools. On the consumption side, there are additional considerations for managing access from popular BI tools such as Tableau, Power BI or Looker. The product team at Immuta, a Databricks partner, will share their experience building data access governance solutions for lakehouse architectures across different data lake and warehouse platforms to show how to set up data access for common scenarios for Databricks teams new to SQL Analytics.

Apache Flink, AWS Kinesis, Analytics

Araf Karsh Hamid

Apache Kafka - Martin Podval

Martin Podval

Apache Kafka is a distributed messaging system that allows for publishing and subscribing to streams of records, known as topics, in a fault-tolerant and scalable way. It is used for building real-time data pipelines and streaming apps. Producers write data to topics which are committed to disks across partitions and replicated for fault tolerance. Consumers read data from topics in a decoupled manner based on offsets. Kafka can process streaming data in real-time and at large volumes with low latency and high throughput.

My first 90 days with ClickHouse.pdf

Alkin Tezuysal

Alkin Tezuysal discusses his first 90 days working at ChistaDATA Inc. as EVP of Global Services. He has experience working with databases like MySQL, Oracle, and ClickHouse. ChistaDATA focuses on providing ClickHouse infrastructure operations through managed services, support, and consulting. ClickHouse is an open source columnar database that uses a shared-nothing architecture for high performance analytics workloads.

High-speed Database Throughput Using Apache Arrow Flight SQL

ScyllaDB

Introduction to DataFusion An Embeddable Query Engine Written in Rust

Andrew Lamb

Building Robust ETL Pipelines with Apache Spark

Databricks

Stable and robust ETL pipelines are a critical component of the data infrastructure of modern enterprises. ETL pipelines ingest data from a variety of sources and must handle incorrect, incomplete or inconsistent records and produce curated, consistent data for consumption by downstream applications. In this talk, we’ll take a deep dive into the technical details of how Apache Spark “reads” data and discuss how Spark 2.2’s flexible APIs, support for a wide variety of data sources, state-of-the-art Tungsten execution engine, and ability to provide diagnostic feedback to users make it a robust framework for building end-to-end ETL pipelines.

Learn to Use Databricks for Data Science

Databricks

Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale, all on one unified platform.

Data Warehouse or Data Lake, Which Do I Choose?

DATAVERSITY

Today’s data-driven companies have a choice to make: where do we store our data? As the move to the cloud continues to be a driving factor, the choice becomes either the data warehouse (Snowflake et al) or the data lake (AWS S3 et al). There are pros and cons for each approach. While the data warehouse gives you strong data management with analytics, it doesn’t do well with semi-structured and unstructured data, has tightly coupled storage and compute, and can bring expensive vendor lock-in. On the other hand, data lakes allow you to store all kinds of data and are extremely affordable, but they’re only meant for storage and by themselves provide no direct value to an organization. Enter the Open Data Lakehouse, the next evolution of the data stack that gives you the openness and flexibility of the data lake with the key aspects of the data warehouse like management and transaction support. In this webinar, you’ll hear from Ali LeClerc who will discuss the data landscape and why many companies are moving to an open data lakehouse. Ali will share more perspective on how you should think about what fits best based on your use case and workloads, and how some real world customers are using Presto, a SQL query engine, to bring analytics to the data lakehouse.

Using Spark Streaming and NiFi for the next generation of ETL in the enterprise

DataWorks Summit

On paper, combining Apache NiFi, Kafka, and Spark Streaming provides a compelling architecture option for building your next generation ETL data pipeline in near real time. What does it look like to deploy and operationalize this in an enterprise production environment? The newer Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with elegant code samples, but is that the whole story? This session will cover the Royal Bank of Canada’s (RBC) journey of moving away from traditional ETL batch processing with Teradata towards using the Hadoop ecosystem for ingesting data. One of the first systems to leverage this new approach was the Event Standardization Service (ESS). This service provides a centralized “client event” ingestion point for the bank’s internal systems through either a web service or a daily text file batch feed. ESS allows downstream reporting applications and end users to query these centralized events. We discuss the drivers and expected benefits of changing the existing event processing. In presenting the integrated solution, we will explore the key components of using NiFi, Kafka, and Spark, then share the good, the bad, and the ugly when trying to adopt these technologies into the enterprise. This session is targeted toward architects and other senior IT staff looking to continue their adoption of open source technology and modernize ingest/ETL processing. Attendees will take away lessons learned and experience in deploying these technologies to make their journey easier. Speakers: Darryl Sutton, T4G, Principal Consultant; Kenneth Poon, RBC, Director, Data Engineering

How We Optimize Spark SQL Jobs With parallel and sync IO

Databricks

Although NVMe has become more and more popular in recent years, a large number of HDDs are still widely used in super-large-scale big data clusters. In an EB-level data platform, IO cost (including decompression and decoding) contributes a large proportion of Spark jobs’ cost. In other words, IO operations are worth optimizing. At ByteDance, we made a series of IO optimizations to improve performance, including parallel read and asynchronous shuffle. First, we implemented file-level parallel read to improve performance when there are a lot of small files. Second, we designed row-group-level parallel read to accelerate queries in big-file scenarios. Third, we implemented asynchronous spill to improve job performance. In addition, we designed Parquet column families, which split a table into a few column families, with different column families stored in different Parquet files. Different column families can be read in parallel, so read performance is much higher than with the existing approach. In our practice, end-to-end performance improved by 5% to 30%. In this talk, I will illustrate how we implement these features and how they accelerate Apache Spark jobs.

Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...

Databricks

The convergence of big data technology towards the traditional database domain has become an industry trend. At present, open source big data processing engines, such as Apache Spark, Apache Hadoop, Apache Flink, etc., already support SQL interfaces, and SQL usage largely occupies a dominant position. Companies use the above open source software to build their own ETL frameworks and OLAP technology. However, OLTP technology is still a strong point of traditional databases. One of the main reasons is the support of ACID by traditional databases.

DW Migration Webinar-March 2022.pptx

Databricks

The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.

Python tools to deploy your machine learning models faster

Jeff Hale

Emr spark tuning demystified

Omid Vahdaty

EMR Spark tuning involves configuring Spark and YARN parameters like executor memory and cores to optimize performance. The default Spark configurations depend on the deployment method (Thrift, Zeppelin etc). YARN is used for resource management in cluster mode, and allocates resources to containers based on minimum and maximum thresholds. When tuning, factors like available cluster resources, executor instances and cores should be considered to avoid overcommitting resources.

Microsoft Azure Databricks

Sascha Dittmann

Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform optimized for Azure. Designed in collaboration with the founders of Apache Spark, Azure Databricks combines the best of Databricks and Azure to help customers accelerate innovation with one-click set up, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. As an Azure service, customers automatically benefit from the native integration with other Azure services such as Power BI, SQL Data Warehouse, and Cosmos DB, as well as from enterprise-grade Azure security, including Active Directory integration, compliance, and enterprise-grade SLAs.

Introduction to Stream Processing

Guido Schmutz

More and more data sources today provide a constant data stream, from Internet of Things devices to Social Media streams. It is one thing to collect these events at the velocity at which they arrive, without losing a single message. An Event Hub and a data flow engine can help here. It’s another thing to do some (complex) analytics on the data. There is always the option to first store them in a data sink of choice, such as a data lake implemented with HDFS/object store, or in a database such as a NoSQL or even an RDBMS, if the volume of events is not too high. Storing a high-volume event stream is feasible and not such a challenge anymore. But doing so adds to the end-to-end latency, and it’s a matter of minutes or hours until you can present some results of your analytics. If you need to react fast, you simply can't afford to first store the data and do the analysis later. You have to be able to include part of your analytics directly on the data stream. This is called Stream Processing or Stream Analytics. In this talk I will present the important concepts a Stream Processing solution should support, and then dive into some of the most popular frameworks available on the market and how they compare.

Similar to Big Data Platform at Pinterest

50 Billion pins and counting: Using Hadoop to build data driven Products

DataWorks Summit

Pinterest uses Hadoop and tools they developed on top of it like Pinball and Pinalytics to harness data from their 50 billion pins and 1 billion boards. Pinball is Pinterest's workflow manager that provides simple abstractions and scales horizontally to process their 3 petabytes of data daily across thousands of jobs. Pinalytics is Pinterest's scalable data analytics engine that allows flexible querying and visualization of metrics data stored in HBase.

Pinterest hadoop summit_talk

Krishna Gade

Webinar - DreamObjects/Ceph Case Study

Ceph Community

This document summarizes DreamObjects, an object storage platform powered by Ceph. It discusses the hardware used in storage and support nodes, including Intel and AMD processors, RAM, disks, and networking components. The document also provides details on Ceph configuration including replication, CRUSH mapping, OSD configuration, and application tuning. Monitoring tools discussed include Chef, pdsh, Sensu, collectd, graphite, logstash, Jenkins and future plans.

AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017

Monal Daxini

Over 100 million subscribers from over 190 countries enjoy the Netflix service. This leads to over a trillion events, amounting to 3 PB, flowing through the Keystone infrastructure to help improve customer experience and glean business insights. The self-serve Keystone stream processing service processes these messages in near real-time with at-least once semantics in the cloud. This enables the users to focus on extracting insights, and not worry about building out scalable infrastructure. I’ll share the details about this platform, and our experience building it.

Openstack India May Meetup

Deepak Garg

Openstack is open source software that allows users to create an Infrastructure as a Service (IaaS) cloud by pooling physical compute, storage, and network resources. It provides on-demand, scalable computing and storage through components like Nova (compute), Swift (object storage), Glance (images), Keystone (identity), and Quantum (networking). The presentation covers the architecture and components of Openstack, how it works from a user perspective, its history and motivation, partners, open development model, and the Openstack community in India.

Serverless SQL

Torsten Steinbach

Serverless SQL provides a serverless analytics platform that allows users to analyze data stored in object storage without having to manage infrastructure. Key features include seamless elasticity, pay-per-query consumption, and the ability to analyze data directly in object storage without having to move it. The platform includes serverless storage, data ingest, data transformation, analytics, and automation capabilities. It aims to create a sharing economy for analytics by allowing various users like developers, data engineers, and analysts flexible access to data and analytics.

Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)

Sascha Wenninger

Michael stack -the state of apache h base

hdhappy001

The document provides an overview of Apache HBase, an open source, distributed, scalable, big data non-relational database. It discusses that HBase is modeled after Google's Bigtable and built on Hadoop for storage. It also summarizes that HBase is used by many large companies for applications such as messaging, real-time analytics, and search indexing. The project is led by an active community of committers and sees steady improvements and new features with each monthly release.

AWS Big Data Demystified #1: Big data architecture lessons learned

Omid Vahdaty

AWS Big Data Demystified #1: Big data architecture lessons learned. A quick overview of big data technologies that were selected and disregarded in our company. The video: https://youtu.be/l5KmaZNQxaU (don't forget to subscribe to the YouTube channel). The website: https://amazon-aws-big-data-demystified.ninja/ The meetup: https://www.meetup.com/AWS-Big-Data-Demystified/ The Facebook group: https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/

Facebook Presto presentation

Cyanny LIANG

Presto is an interactive SQL query engine for big data that was originally developed at Facebook in 2012 and open sourced in 2013. It is 10x faster than Hive for interactive queries on large datasets. Presto is highly extensible, supports pluggable backends, ANSI SQL, and complex queries. It uses an in-memory parallel processing architecture with pipelined task execution, data locality, caching, JIT compilation, and SQL optimizations to achieve high performance on large datasets.

AWS (Hadoop) Meetup 30.04.09

Chris Purrington

Netflix Open Source Meetup Season 4 Episode 2

aspyker

In this episode, we will take a close look at two different approaches to high-throughput/low-latency data stores developed by Netflix. The first, EVCache, is a battle-tested distributed memcached-backed data store, optimized for the cloud. You will also hear about the road ahead for EVCache as it evolves into an L1/L2 cache over RAM and SSDs. The second, Dynomite, is a framework to make any non-distributed data store distributed. Netflix's first implementation of Dynomite is based on Redis. Come learn about the products' features and hear from Thomson and Reuters, Diego Pacheco from Ilegra and other third party speakers, internal and external to Netflix, on how these products fit in their stack and roadmap.

Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014

cdmaxime

Maxime Dumas gives a presentation on Cloudera Impala, which provides fast SQL query capability for Apache Hadoop. Impala allows for interactive queries on Hadoop data in seconds rather than minutes by using a native MPP query engine instead of MapReduce. It offers benefits like SQL support, improved performance of 3-4x up to 90x faster than MapReduce, and flexibility to query existing Hadoop data without needing to migrate or duplicate it. The latest release of Impala 2.0 includes new features like window functions, subqueries, and spilling joins and aggregations to disk when memory is exhausted.

Low Latency Polyglot Model Scoring using Apache Apex

Apache Apex

This document discusses challenges in building low-latency machine learning applications and how Apache Apex can help address them. It introduces Apache Apex as a distributed streaming engine and describes how it allows embedding models from frameworks like R, Python, and H2O through custom operators. It provides various data and model scoring patterns in Apex, like dynamic resource allocation, checkpointing, and exactly-once processing, to meet SLAs. The document also demonstrates techniques like canary deployment, dormant models, and model ensembles through logical overlays on the Apex DAG.

Sql Start! 2020 - SQL Server Lift & Shift su Azure

Marco Obinu

Low latency high throughput streaming using Apache Apex and Apache Kudu

DataWorks Summit

True streaming is fast becoming a necessity for many business use cases. On the other hand, data set sizes and volumes are also growing exponentially, compounding the complexity of data processing pipelines. There is a need for true low latency streaming coupled with very high throughput data processing. Apache Apex as a low latency and high throughput data processing framework and Apache Kudu as a high throughput store form a nice combination which solves this pattern very efficiently. This session will walk through a use case which involves writing a high throughput stream using Apache Kafka, Apache Apex and Apache Kudu. The session will start with a general overview of Apache Apex and the capabilities of Apex that form the foundation for a low latency and high throughput engine, with Apache Kafka being an example input source of streams. Subsequently we walk through Kudu integration with Apex by walking through various patterns like end-to-end exactly once, selective column writes and timestamp propagations for out-of-band data. The session will also cover additional patterns that this integration will cover for enterprise-level data processing pipelines. The session will conclude with some metrics for latency and throughput numbers for the use case that is presented. Speaker: Ananth Gundabattula, Senior Architect, Commonwealth Bank of Australia

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Databricks

In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.

YARN: a resource manager for analytic platform

Tsuyoshi OZAWA

The document discusses YARN, a resource manager for Apache Hadoop. It provides an overview of YARN and its key features: (1) managing resources in a cluster, (2) managing application history logs, and (3) a service registry mechanism. It then discusses how distributed processing frameworks like Tez and Spark work on YARN, focusing on their directed acyclic graph (DAG) models and techniques for improving performance on YARN like container reuse.

Best of re:Invent

Amazon Web Services

The document summarizes announcements from AWS re:Invent 2016 related to compute, storage, artificial intelligence, serverless computing, databases, migration tools, and developer tools. Key announcements included new EC2 instance types, cost reductions, Elastic GPUs, AWS Batch for batch processing, Aurora PostgreSQL, Athena for analytics on S3 data, VMware on AWS, AWS X-Ray for tracing distributed applications, and expanded machine learning capabilities through services like Polly, Lex, and Rekognition as well as support for MXNet as an AI framework.

Modern MySQL Monitoring and Dashboards.

Mydbops

More from Qubole

Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...

Big Data Platform at Pinterest
