Sanglin Lee and Joep Rottinghuis of Twitter at HBaseConEast2016: http://www.meetup.com/HBase-NYC/events/233024937/
API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. It allows hosting multiple API versions and stages, generating SDKs, adding authentication, throttling requests, and caching responses to improve performance and reduce latency. API Gateway supports building and deploying REST and WebSocket APIs. Pricing is based on the number of API calls and amount of data transferred out. Optional dedicated caching tiers are also available.
This document discusses how Spring Boot and Kafka can form the basis of a new enterprise application platform focused on continuous delivery, event-driven architectures, and streaming data. It provides examples of companies that have successfully adopted this approach, such as Netflix transitioning to Spring Boot and a banking brand building a new core banking system using Spring Streams and Kafka. The document advocates an "event-first" and microservices-oriented mindset enabled by a streaming data platform and suggests that Spring Boot, Kafka, and related technologies provide a turnkey solution for implementing this new application development approach at large enterprises.
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve it is hiding the complexity of underlying data structures and physical data storage from users. The de-facto standard has been the Hive table format addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg. Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS. Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg. You will learn: • The issues that arise when using the Hive table format at scale, and why we need a new table format • How a straightforward, elegant change in table format structure has enormous positive effects • The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it • The resulting benefits of this architectural design
The presentation at CloudNativeCon Europe 2017, about how to design logging platform in container and microservices based computing platforms.
Kafka Streams is a client library for building distributed applications that process streaming data stored in Apache Kafka. It provides a high-level streams DSL that allows developers to express streaming applications as set of processing steps. Alternatively, developers can use the lower-level processor API to implement custom business logic. Kafka Streams handles tasks like fault-tolerance, scalability and state management. It represents data as streams for unbounded data or tables for bounded state. Common operations include transformations, aggregations, joins and table operations.
Memory leaks are not always simple or easy to find. Heap dumps from production systems are often gigantic (4+ gigs) with millions of objects in memory. Simple spot checking with traditional tools is woefully inadequate in these situations, especially with real data. Leaks can be entire object graphs with enormous amounts of noise. This session will show you how to build custom tools using the Apache NetBeans Profiler/Heapwalker APIs. Using these APIs, you can read and analyze Java heaps programmatically to ask really hard questions. This gives you the power to analyze complex object graphs with tens of thousands of objects in seconds.
This document provides an overview of Hive and its performance capabilities. It discusses Hive's SQL interface for querying large datasets stored in Hadoop, its architecture which compiles SQL queries into MapReduce jobs, and its support for SQL semantics and datatypes. The document also covers techniques for optimizing Hive performance, including data abstractions like partitions, buckets and skews. It describes different join strategies in Hive like shuffle joins, broadcast joins and sort-merge bucket joins and how they are implemented in MapReduce. The overall presentation aims to explain how Hive provides scalable SQL processing for big data.
This document introduces infrastructure as code (IaC) using Terraform and provides examples of deploying infrastructure on AWS including: - A single EC2 instance - A single web server - A cluster of web servers using an Auto Scaling Group - Adding a load balancer using an Elastic Load Balancer It also discusses Terraform concepts and syntax like variables, resources, outputs, and interpolation. The target audience is people who deploy infrastructure on AWS or other clouds.
This document discusses Pinot, Uber's real-time analytics platform. It provides an overview of Pinot's architecture and data ingestion process, describes a case study on modeling trip data in Pinot, and benchmarks Pinot's performance on ingesting large volumes of data and answering queries in real-time.
Implementing End-To-End Tracing With Roman Kolesnev and Antony Stubbs | Current 2022 Can you answer how a given event came to be? Is it an aggregation, a combination of multiple events with different sources? What are its origins? Given the growing complexity of event streaming architectures - stateful processing, joins, fan-outs, multi-cluster flows - it is increasingly important to be able to accurately answer those questions, understand data flows and capture data provenance. This talk will walk through how to use and extend OpenTelemetry Java agent auto instrumentation to achieve full end-to-end traceability in Kafka event streaming architectures involving multi-cluster deployments, the Connect platform, stateful KStream applications and ksqlDB workloads. We will cover: - Distributed Tracing concepts - context propagation and the OpenTelemetry implementation stack; - Java agent auto instrumentation, problems faced when instrumenting service platforms (Connect and ksqlDB), stateful applications (KStreams and ksqlDB) and how auto instrumentation can be extended using loadable extensions to solve those problems; - Demo of an end-to-end tracing implementation and a highlight of the interesting use cases it enables.
This document provides a summary of improvements made to Hive's performance through the use of Apache Tez and other optimizations. Some key points include: - Hive was improved to use Apache Tez as its execution engine instead of MapReduce, reducing latency for interactive queries and improving throughput for batch queries. - Statistics collection was optimized to gather column-level statistics from ORC file footers, speeding up statistics gathering. - The cost-based optimizer Optiq was added to Hive, allowing it to choose better execution plans. - Vectorized query processing, broadcast joins, dynamic partitioning, and other optimizations improved individual query performance by over 100x in some cases.
What is Apache Pulsar? How does it differ from other event streaming technologies available? StreamNative Developer Advocate Tim Spann will walk you through the features and architecture of this increasingly popular event streaming system, along with best practices for streaming and storing your data.
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and Nifi. There are also many new tools that are built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for. This talk will discuss the details of the ORC and Parquet formats and what the relevant tradeoffs are. In particular, it will discuss how to format your data and the options to use to maximize your read performance. In particular, we’ll discuss when and how to use ORC’s schema evolution, bloom filters, and predicate push down. It will also show you how to use the tools to translate ORC files into human-readable formats, such as JSON, and display the rich metadata from the file including the type in the file and min, max, and count for each column.
Kafka is an open source messaging system that can handle massive streams of data in real-time. It is fast, scalable, durable, and fault-tolerant. Kafka is commonly used for stream processing, website activity tracking, metrics collection, and log aggregation. It supports high throughput, reliable delivery, and horizontal scalability. Some examples of real-time use cases for Kafka include website monitoring, network monitoring, fraud detection, and IoT applications.
In this talk by Sean Glover, Principal Engineer at Lightbend, we will review how the Strimzi Kafka Operator, a supported technology in Lightbend Platform, makes many operational tasks in Kafka easy, such as the initial deployment and updates of a Kafka and ZooKeeper cluster. See the blog post containing the YouTube video here: https://www.lightbend.com/blog/running-kafka-on-kubernetes-with-strimzi-for-real-time-streaming-applications
This document summarizes the new YARN Timeline Service version 2, which was developed to address scalability, reliability, and usability challenges in version 1. Key highlights of version 2 include a distributed collector architecture for scalable and fault-tolerant writing of timeline data, an entity data model with first-class configuration and metrics support, and metrics aggregation capabilities. It stores data in HBase for scalability and provides a richer REST API for querying. Milestone goals include integration with more frameworks and production readiness.