Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. By following a few best practices, you can take advantage of Amazon Redshift’s columnar technology and parallel processing capabilities to minimize I/O and deliver high throughput and query performance. This webinar will cover techniques to load data efficiently, design optimal schemas, and use work load management.
Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to migrate from existing data warehouses, optimize schemas, and load data efficiently
• Learn best practices for managing workload, tuning your queries, and using Amazon Redshift's interleaved sorting features
Who Should Attend:
• Data Warehouse Developers, Big Data Architects, BI Managers, and Data Engineers
Top 10 Mistakes When Migrating From Oracle to PostgreSQLJim Mlodgenski
As more and more people are moving to PostgreSQL from Oracle, a pattern of mistakes is emerging. They can be caused by the tools being used or just not understanding how PostgreSQL is different than Oracle. In this talk we will discuss the top mistakes people generally make when moving to PostgreSQL from Oracle and what the correct course of action.
Amazon Redshift is a fast, powerful, fully managed data warehouse service that allows for petabyte-scale data warehousing at very low costs. It uses columnar storage and data compression techniques to dramatically reduce I/O and allow for very fast query performance. Redshift automatically provisions clusters on optimized hardware and scales easily from terabytes to petabytes of data with no downtime. It integrates with popular BI tools and simplifies tasks like provisioning, administration, backup/restore and scaling the data warehouse.
The document discusses compaction in RocksDB, an embedded key-value storage engine. It describes the two compaction styles in RocksDB: level style compaction and universal style compaction. Level style compaction stores data in multiple levels and performs compactions by merging files from lower to higher levels. Universal style compaction keeps all files in level 0 and performs compactions by merging adjacent files in time order. The document provides details on the compaction process and configuration options for both styles.
Building large scale transactional data lake using apache hudiBill Liu
Data is a critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business critical data pipelines at low latency and high efficiency, and helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what is APache Hudi and its architectural design, and then deep dive to improving data operations by providing features such as data versioning, time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
This document discusses 5 common mistakes when writing Spark applications:
1) Improperly sizing executors by not considering cores, memory, and overhead. The optimal configuration depends on the workload and cluster resources.
2) Applications failing due to shuffle blocks exceeding 2GB size limit. Increasing the number of partitions helps address this.
3) Jobs running slowly due to data skew in joins and shuffles. Techniques like salting keys can help address skew.
4) Not properly managing the DAG to avoid shuffles and bring work to the data. Using ReduceByKey over GroupByKey and TreeReduce over Reduce when possible.
5) Classpath conflicts arising from mismatched library versions, which can be addressed using sh
Parquet performance tuning: the missing guideRyan Blue
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
The document provides an overview of PostgreSQL performance tuning. It discusses caching, query processing internals, and optimization of storage and memory usage. Specific topics covered include the PostgreSQL configuration parameters for tuning shared buffers, work memory, and free space map settings.
(1) Amazon Redshift is a fully managed data warehousing service in the cloud that makes it simple and cost-effective to analyze large amounts of data across petabytes of structured and semi-structured data. (2) It provides fast query performance by using massively parallel processing and columnar storage techniques. (3) Customers like NTT Docomo, Nasdaq, and Amazon have been able to analyze petabytes of data faster and at a lower cost using Amazon Redshift compared to their previous on-premises solutions.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Amazon Web Services
PostgreSQL is an open source database growing in popularity because of its rich features, vibrant community, and compatibility with commercial databases. Learn about ways to run PostgreSQL on AWS including self-managed, and the managed database services from AWS: Amazon Relational Database Service (Amazon RDS) and the Amazon Aurora PostgreSQL-compatible Edition. This talk covers key Amazon RDS for PostgreSQL functionality, availability, and management. We also review general guidelines for common user operations and activities such as migration, tuning, and monitoring for their RDS for PostgreSQL instances.
This document discusses techniques for improving latency in HBase. It analyzes the write and read paths, identifying sources of latency such as networking, HDFS flushes, garbage collection, and machine failures. For writes, it finds that single puts can achieve millisecond latency while streaming puts can hide latency spikes. For reads, it notes cache hits are sub-millisecond while cache misses and seeks add latency. GC pauses of 25-100ms are common, and failures hurt locality and require cache rebuilding. The document outlines ongoing work to reduce GC, use off-heap memory, improve compactions and caching to further optimize for low latency.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the code to execute your data transformations and data loading processes.
Level: Intermediate
Speakers:
Ryan Malecky - Solutions Architect, EdTech, AWS
Rajakumar Sampathkumar - Sr. Technical Account Manager, AWS
Performance Optimizations in Apache ImpalaCloudera, Inc.
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Hive or SPARK. Impala is written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
This document summarizes a presentation on Amazon Redshift. Redshift is a fully managed data warehouse service that makes it easy to analyze large amounts of data for less than $1,000 per terabyte per year. The presentation covers how to get started with Redshift, best practices for table design and data loading, using Redshift for analytics, and upgrading and scaling a Redshift data warehouse over time.
Optimizing spark jobs through a true understanding of spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
Amazon Redshift is a fully managed data warehouse service that makes it fast, simple and cost effective to analyze data using SQL and existing business intelligence tools. The document provides an overview of Amazon Redshift and its benefits including speed, low cost, security, scalability and ease of use. It also provides examples of how various companies use Redshift for big data analytics including analyzing social media firehoses, mobile usage and real-time IoT streaming data.
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also walk through techniques for optimizing performance and, you’ll hear from a specific customer and their use case to take advantage of fast performance on enormous datasets leveraging economies of scale on the AWS platform.
Speakers:
Ian Meyers, AWS Solutions Architect
Toby Moore, Chief Technology Officer, Space Ape
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
The document describes a presentation on Amazon Athena, a serverless interactive query service that allows users to analyze data directly from Amazon S3 using standard SQL. The presentation will introduce Athena and demonstrate how it can be used to query data in S3 without having to load it into a database first. It will also discuss how Athena uses Presto and the Glue Data Catalog under the hood and show some customer use cases for log analysis, ETL workflows, and analytics reporting using Athena with other AWS services.
From: DataWorks Summit 2017 - Munich - 20170406
HBase hast established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, overing advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging and so on. This talk is based on the research for the an upcoming second release of the speakers HBase book, correlated with the practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
This document discusses performance tuning in Redshift. It covers factors to consider like database design, execution queues, and query tips. For database design, it recommends usage of sort keys and distribution keys, as well as column compression. It also discusses query diagnosis using EXPLAIN and provides tips on common performance issues like unoptimized queries, disk I/O, memory limits, and table fragmentation.
This document discusses Amazon Redshift, a fully managed data warehousing service. It provides petabyte-scale data warehousing capabilities with performance up to 3x faster and 80% lower cost than traditional data warehousing solutions. The document outlines use cases, architecture details, pricing and total cost of ownership, security features, integration options and best practices. It also shares customer examples and an ecosystem of partners building solutions on Amazon Redshift.
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also walk through techniques for optimizing performance and, you’ll hear from a specific customer and their use case to take advantage of fast performance on enormous datasets leveraging economies of scale on the AWS platform.
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar SeriesAmazon Web Services
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. By following a few best practices, you can take advantage of Amazon Redshift’s columnar technology and parallel processing capabilities to minimize I/O and deliver high throughput and query performance. This webinar will cover techniques to load data efficiently, design optimal schemas, and tune query and database performance.
Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to migrate from existing data warehouses, optimize schemas, and load data efficiently
• Learn best practices for managing workload, tuning your queries, and using Amazon Redshift's interleaved sorting features
Learn best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve your data warehouse performance.
The document discusses various techniques for performance tuning at different levels in Amazon Redshift including table design, querying, infrastructure configuration, and data management. Table design recommendations include choosing an appropriate distribution style, distkey, and sortkey to optimize data distribution and joins. Query tuning tips include avoiding SELECT *, using EXPLAIN to view query plans, managing query queues, and replacing cross joins. Infrastructure recommendations cover node types, disk choices, and workload management queues. Data-level techniques include enabling compression, updating statistics, cleaning data before loading, and removing non-ASCII characters.
AWS re:Invent re:Cap 행사에서 발표된 강연 자료입니다. 아마존 웹서비스의 김일호 솔루션스 아키텍트가 발표한 내용입니다.
내용 요약: Hadoop과 Elastic MapReduce, Redshift, Kinesis, Data Pipeline, S3 등 다양한 서비스들을 활용하는 데이터 분석의 모범사례 및 아키텍처 설계 패턴에 대해 말씀드리고, re:Invent에서 새로 추가된 Amazon EC2 컴퓨팅 최적화 인스턴스 C4와 새로 발표된 Amazon EBS 볼륨 확장 및 성능 향상에 대해 함께 살펴볼 예정입니다.
Building prediction models with Amazon Redshift and Amazon Machine Learning -...Amazon Web Services
Mining data with Redshift, using this data to build a prediction model with Amazon ML, performing batch predictions & real-time predictions (with a Java app).
In dynamic cloud environments, many organizations have a need to implement a unified threat management solution that enhances visibility across their workloads. Learn how REAN Cloud adopted Sophos Unified Threat Management (UTM) for increased simplicity, visibility, and security of their AWS workloads. Sophos is an Advanced Technology Partner in the AWS Partner Network that provides a reliable, unified security solution capable of scaling to meet the agility and speed of the AWS Cloud. Join the upcoming webinar to hear Sri Vasireddy from REAN Cloud, Bryan Nairn from Sophos, and Nick Matthews from AWS discuss security innovations on the AWS Cloud. Join us to learn: • Why Sophos end user REAN Cloud trusts Sophos UTM for simplicity, visibility and security. • How easy it can be to protect your AWS workloads, with a proven and scalable solution designed for the AWS Cloud. • AWS security innovations, including support across multiple Availability Zones and UTM Auto Scaling.
Who should attend: Security Managers, Security Engineers, Security Architects, IT System Administrators, System Administrators, IT Administrators, IT Managers, DevOps, Architects, IT Architects, IT Security Engineers, Business Decision Makers
The Getting Started on AWS deck serves to introduce Amazon users and prospective customers to the Amazon VPC, EC2 and the concepts and components that are necessary building Fault Tolerant & High Available environments on AWS. It also serves to introduce services like Direct Connect, Router53 (Amazon DNS Service) and one of our new additions, the Amazon
Application Load Balancer (ALB). After perusing this deck, users should have a better understanding of what these services are and their propose benefits.
AWS Enterprise Summit Netherlands - Starting Your Journey in the CloudAmazon Web Services
This document discusses best practices for getting started with AWS. It recommends laying out foundations such as creating an account structure, setting up billing and access controls using IAM, and choosing initial use cases that are well-scoped such as development and testing. It also emphasizes building on AWS security features and designing systems to take advantage of AWS strengths like auto-scaling and cost-effective storage.
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...Amazon Web Services
This document provides an overview of big data architectural patterns and best practices on AWS. It discusses challenges of big data including volume, velocity, and variety of data. It then describes how to simplify big data processing using a decoupled "data bus" approach with AWS services for collecting, storing, processing, analyzing and consuming data. The document provides recommendations on which AWS technologies to use for different data types, access patterns, and analysis workloads based on considerations like cost, performance and manageability. Finally, it discusses reference architectures and design patterns for building scalable big data systems on AWS.
An overview of the Amazon ElastiCache managed service, with examples of how it can be used to increase performance, lower costs and augment other database services and databases to make things faster, easier and less expensive.
Rackspace provides a comprehensive set of tooling and expertise on AWS that further unlocks your ability to secure your environment efficiently and cost effectively. The dynamic environment of data, applications, and infrastructure can pose challenges for businesses trying to manage security while following compliance regulations. To mitigate these challenges, businesses need a scalable security solution to ensure their data is safe, secure, and stable. In this webinar, Brad Schulteis, Jarret Raim and Todd Gleason will discuss the topic of security control requirements on AWS through the lens of three common compliance scenarios: HIPAA, PCI-DSS, and generalized security compliance based on the NIST Risk Management Framework. Watch our webinar to learn how Rackspace combines AWS and security expertise with tools like AWS CloudFormation, AWS CodeCommit and AWS CodeDeploy to help customers meet their security and compliance needs.
Join us to learn:
• Best practices for securely operating workloads on the AWS Cloud
• Architecting a secure environment for dynamic workloads
• How to incorporate Security by Design principles to address compliance needs across 3 use cases: HIPAA, PCI-DSS and generalized security compliance based on the NIST Risk Management Framework
Who should attend: Directors and Managers of Security, IT Administers, IT Architects, and IT Security Engineers
AWS Enterprise Summit Netherlands - Enterprise Applications on AWSAmazon Web Services
Aviko, a large European food company, implemented SAP software in AWS to help overcome technical barriers between their operations in the Netherlands and new facilities in China. By running SAP in the AWS China region, they achieved reliable connectivity, high performance, and flexibility needed to launch operations in China within 3 months. This avoided latency issues and unstable connections compared to routing data through their local datacenter. AWS provided a cost-effective alternative to building their own datacenter in China and eased management burdens through automated provisioning and support from AWS partners.
Learn about the Amazon made to a service-oriented architecture over a decade ago and an introduction to AWS CodeCommit, AWS CodePipeline, and AWS CodeDeploy, three new services born out of Amazon's internal DevOps experience.
AWS Enterprise Summit Netherlands - Cost Optimisation at ScaleAmazon Web Services
The document discusses cost optimization strategies for using AWS at scale. It begins with an overview of total cost of ownership (TCO) analysis and how "at scale" applies. Key principles for architecting cloud infrastructure for cost like right sizing instances and using reserved instances are covered. The document then analyzes how measuring and monitoring cloud resources can help improve costs over time. Specific AWS services and features that can reduce storage, compute, and other infrastructure costs are also outlined.
AWS Enterprise Summit Netherlands - Creating a Landing ZoneAmazon Web Services
This document discusses creating a landing zone in AWS for migrating applications from an on-premise data center environment. It covers setting up account structure with separate accounts for production, non-production and centralized services. It also discusses establishing network connectivity with VPC design, identity and access management using IAM, and using AWS Service Catalog for self-service provisioning by cloud consumers. The overall goal is to discuss best practices for creating a secure and governed landing zone in AWS to migrate and operate applications.
Database performance is a key factor for keeping your app running fast. As the amount of data and request throughput grow, it becomes harder to keep it nimble and performant. In this webinar, we will discuss using Redis, a popular NoSQL key-value store, to run your data fast at scale. Redis can be used as a cache in front of a database or as a fast in-memory data store. We will also cover how you can use ElastiCache for Redis to easily set up a large multi-terabyte Redis environment and operate it with zero management.
Learning Objectives:
• Understand fast data
• Understand the capabilities of Amazon Elasticache for Redis to cache Relational and NoSQL databases
• Use Amazon Elasticache for Redis as a primary data store
Who Should Attend:
• Developers, Data Architects
This document provides an overview of Amazon Redshift, including its history, architecture, concepts, terminology, storage subsystem, and query lifecycle. It discusses how Redshift uses a massively parallel processing (MPP) architecture with columnar storage to improve query performance and reduce storage requirements through data compression. Key concepts explained include slices, sorting, data distribution styles, and how data is stored across disks and persisted to blocks at the physical level.
This document provides an overview of Amazon Redshift, including its history, architecture, concepts, terminology, storage subsystem, and query lifecycle. It discusses how Redshift uses a massively parallel processing (MPP) architecture with columnar storage to improve query performance and reduce I/O. Key concepts explained include slices, columnar storage, compression encodings, sorting, and data distribution styles.
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this webinar, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance.
Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to design schemas and load data efficiently
• Learn best practices for workload management, distribution and sort keys, and optimizing queries
Get a look under the covers: Learn tuning best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve your delivery of queries and improve overall database performance. This session explains how to migrate from existing data warehouses, create an optimized schema, efficiently load data, use workload management, tune your queries, and use Amazon Redshift's interleaved sorting features.
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use work load management.
This spring, the data warehouse team at Ancestry, flawlessly migrated and validated nearly half a trillion records from Actian Matrix to Amazon Redshift. During this session, the Ancestry team will describe how they orchestrated the entire migration in less than four months, the technical challenges they faced and overcame along the way, as well as share tips and tricks to break through common pitfalls of data warehouse migrations. They will also highlight how they tuned and optimized the Amazon Redshift environment, adopted Redshift Spectrum, and how they leverage their collaboration with Amazon to deliver a powerful customer experience.
Data Warehousing in the Era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
by Tony Gibbs, Data Warehouse Specialist SA, AWS
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use work load management.
AWS SSA Webinar 20 - Getting Started with Data Warehouses on AWSCobus Bernard
In this session, we will take you through setting up an Amazon Redshift cluster and at the ways you can populate it with data. We will start by using AWS DMS to replicate the data as-is as well as doing some ETL on it. This will be followed by AWS Glue where you can do more advanced ETL operations. Lastly, we will look at how you can use Amazon Kinesis Firehose to stream event directly to the Redshift cluster.
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use work load management.
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Amazon Web Services
Get a look under the hood: Understand how to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve your delivery of queries and improve overall database performance. You’ll also hear about how the University of Technology Sydney (UTS) are using Redshift. The University of Technology Sydney will describe how utilizing Amazon Redshift enabled agility in dealing with Data Quality, a capacity to scale when required, and optimizing development processes through rapid provisioning of Data Warehouse environments.
Speaker: Ganesh Raja, Solutions Architect, Amazon Web Services with Susan Gibson, Manager, Data and Business Intelligence, UTS
Level: 300
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftAmazon Web Services
by Darin Briskman, Technical Evangelist, AWS
You can gain substantially more business insights and save costs by migrating your existing data warehouse to Amazon Redshift. This session will cover the key benefits of migrating to Amazon Redshift, migration strategies, and tools and resources that can help you in the process. We’ll learn about AWS Database Migration Service and AWS Schema Migration Tool, which were recently enhanced to import data from six common data warehouse platforms. Level: 200
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
You can gain substantially more business insights and save costs by migrating your existing data warehouse to Amazon Redshift. This session will cover the key benefits of migrating to Amazon Redshift, migration strategies, and tools and resources that can help you in the process. We’ll learn about AWS Database Migration Service and AWS Schema Migration Tool, which were recently enhanced to import data from six common data warehouse platforms.
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
You can gain substantially more business insights and save costs by migrating your existing data warehouse to Amazon Redshift. This session will cover the key benefits of migrating to Amazon Redshift, migration strategies, and tools and resources that can help you in the process.
Best Practices for Migrating your Data Warehouse to Amazon Redshift Amazon Web Services
This document provides best practices for migrating a data warehouse to Amazon Redshift. It discusses why companies migrate to Redshift due to its scalability, performance and cost advantages. Example migration stories are provided from companies that achieved significant improvements after migrating large datasets from Oracle, Greenplum and SQL on Hadoop to Redshift. The document also outlines the Redshift cluster architecture, data loading best practices including file splitting and column encoding, schema design considerations and available migration tools.
Amazon Redshift is a data warehouse service that runs on AWS. It has a leader node that coordinates queries and compute nodes that store and process the data in parallel. The compute nodes can use either HDD storage optimized for large datasets or SSD storage optimized for fast queries. Data is stored in columns and compressed to reduce I/O. Queries are optimized using statistics on the data distribution, sort keys and other metadata. The EXPLAIN command and STL tables provide visibility into query plans and performance.
ABD304-R-Best Practices for Data Warehousing with Amazon Redshift & SpectrumAmazon Web Services
This document provides an overview and best practices for using Amazon Redshift and Redshift Spectrum for data warehousing. It covers the history and development of Redshift, key concepts like columnar storage, compression, sorting and distribution styles. It provides examples and recommendations for table design, workload management, and query optimization techniques.
AWS June 2016 Webinar Series - Amazon Redshift or Big Data AnalyticsAmazon Web Services
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. By following a few best practices, you can take advantage of Amazon Redshift’s columnar technology and parallel processing capabilities to minimize I/O and deliver high throughput and query performance. This webinar will cover techniques to load data efficiently, design optimal schemas, and tune query and database performance.
Learning Objectives:
Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
Learn how to migrate from existing data warehouses, optimize schemas, and load data efficiently
Learn best practices for managing workload, tuning your queries, and using Amazon Redshift's interleaved sorting features
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Michael Rys
The document discusses best practices and performance tuning for U-SQL in Azure Data Lake. It provides an overview of U-SQL query execution, including the job scheduler, query compilation process, and vertex execution model. The document also covers techniques for analyzing and optimizing U-SQL job performance, including analyzing the critical path, using heat maps, optimizing AU usage, addressing data skew, and query tuning techniques like data loading tips, partitioning, predicate pushing and column pruning.
This document provides an overview of DFSMS Basics and Data Set Fundamentals. It discusses the structure of data sets on direct access storage devices (DASD) including volumes, tracks, cylinders, the volume table of contents (VTOC), catalogs, and data set names. It also summarizes the different types of data set organizations including non-VSAM (direct, sequential, partitioned) and VSAM (KSDS, ESDS, RRDS, linear) as well as newer technologies like PDSEs, HFS, and zFS. The document concludes with discussions on common data set uses, limitations, extended format features, and defining data set attributes in JCL or with IDCAMS.
Similar to Data Warehousing with Amazon Redshift (20)
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
Il Forecasting è un processo importante per tantissime aziende e viene utilizzato in vari ambiti per cercare di prevedere in modo accurato la crescita e distribuzione di un prodotto, l’utilizzo delle risorse necessarie nelle linee produttive, presentazioni finanziarie e tanto altro. Amazon utilizza delle tecniche avanzate di forecasting, in parte questi servizi sono stati messi a disposizione di tutti i clienti AWS.
In questa sessione illustreremo come pre-processare i dati che contengono una componente temporale e successivamente utilizzare un algoritmo che a partire dal tipo di dato analizzato produce un forecasting accurato.
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
La varietà e la quantità di dati che si crea ogni giorno accelera sempre più velocemente e rappresenta una opportunità irripetibile per innovare e creare nuove startup.
Tuttavia gestire grandi quantità di dati può apparire complesso: creare cluster Big Data su larga scala sembra essere un investimento accessibile solo ad aziende consolidate. Ma l’elasticità del Cloud e, in particolare, i servizi Serverless ci permettono di rompere questi limiti.
Vediamo quindi come è possibile sviluppare applicazioni Big Data rapidamente, senza preoccuparci dell’infrastruttura, ma dedicando tutte le risorse allo sviluppo delle nostre le nostre idee per creare prodotti innovativi.
Ora puoi utilizzare Amazon Elastic Kubernetes Service (EKS) per eseguire pod Kubernetes su AWS Fargate, il motore di elaborazione serverless creato per container su AWS. Questo rende più semplice che mai costruire ed eseguire le tue applicazioni Kubernetes nel cloud AWS.In questa sessione presenteremo le caratteristiche principali del servizio e come distribuire la tua applicazione in pochi passaggi
Vent'anni fa Amazon ha attraversato una trasformazione radicale con l'obiettivo di aumentare il ritmo dell'innovazione. In questo periodo abbiamo imparato come cambiare il nostro approccio allo sviluppo delle applicazioni ci ha permesso di aumentare notevolmente l'agilità, la velocità di rilascio e, in definitiva, ci ha consentito di creare applicazioni più affidabili e scalabili. In questa sessione illustreremo come definiamo le applicazioni moderne e come la creazione di app moderne influisce non solo sull'architettura dell'applicazione, ma sulla struttura organizzativa, sulle pipeline di rilascio dello sviluppo e persino sul modello operativo. Descriveremo anche approcci comuni alla modernizzazione, compreso l'approccio utilizzato dalla stessa Amazon.com.
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
L’utilizzo dei container è in continua crescita.
Se correttamente disegnate, le applicazioni basate su Container sono molto spesso stateless e flessibili.
I servizi AWS ECS, EKS e Kubernetes su EC2 possono sfruttare le istanze Spot, portando ad un risparmio medio del 70% rispetto alle istanze On Demand. In questa sessione scopriremo insieme quali sono le caratteristiche delle istanze Spot e come possono essere utilizzate facilmente su AWS. Impareremo inoltre come Spreaker sfrutta le istanze spot per eseguire applicazioni di diverso tipo, in produzione, ad una frazione del costo on-demand!
In recent months, many customers have been asking us the question – how to monetise Open APIs, simplify Fintech integrations and accelerate adoption of various Open Banking business models. Therefore, AWS and FinConecta would like to invite you to Open Finance marketplace presentation on October 20th.
Event Agenda :
Open banking so far (short recap)
• PSD2, OB UK, OB Australia, OB LATAM, OB Israel
Intro to Open Finance marketplace
• Scope
• Features
• Tech overview and Demo
The role of the Cloud
The Future of APIs
• Complying with regulation
• Monetizing data / APIs
• Business models
• Time to market
One platform for all: a Strategic approach
Q&A
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
Per creare valore e costruire una propria offerta differenziante e riconoscibile, le startup di successo sanno come combinare tecnologie consolidate con componenti innovativi creati ad hoc.
AWS fornisce servizi pronti all'utilizzo e, allo stesso tempo, permette di personalizzare e creare gli elementi differenzianti della propria offerta.
Concentrandoci sulle tecnologie di Machine Learning, vedremo come selezionare i servizi di intelligenza artificiale offerti da AWS e, anche attraverso una demo, come costruire modelli di Machine Learning personalizzati utilizzando SageMaker Studio.
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
Con l'approccio tradizionale al mondo IT per molti anni è stato difficile implementare tecniche di DevOps, che finora spesso hanno previsto attività manuali portando di tanto in tanto a dei downtime degli applicativi interrompendo l'operatività dell'utente. Con l'avvento del cloud, le tecniche di DevOps sono ormai a portata di tutti a basso costo per qualsiasi genere di workload, garantendo maggiore affidabilità del sistema e risultando in dei significativi miglioramenti della business continuity.
AWS mette a disposizione AWS OpsWork come strumento di Configuration Management che mira ad automatizzare e semplificare la gestione e i deployment delle istanze EC2 per mezzo di workload Chef e Puppet.
Scopri come sfruttare AWS OpsWork a garanzia e affidabilità del tuo applicativo installato su Instanze EC2.
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
Vuoi conoscere le opzioni per eseguire Microsoft Active Directory su AWS? Quando si spostano carichi di lavoro Microsoft in AWS, è importante considerare come distribuire Microsoft Active Directory per supportare la gestione, l'autenticazione e l'autorizzazione dei criteri di gruppo. In questa sessione, discuteremo le opzioni per la distribuzione di Microsoft Active Directory su AWS, incluso AWS Directory Service per Microsoft Active Directory e la distribuzione di Active Directory su Windows su Amazon Elastic Compute Cloud (Amazon EC2). Trattiamo argomenti quali l'integrazione del tuo ambiente Microsoft Active Directory locale nel cloud e l'utilizzo di applicazioni SaaS, come Office 365, con AWS Single Sign-On.
Dal riconoscimento facciale al riconoscimento di frodi o difetti di fabbricazione, l'analisi di immagini e video che sfruttano tecniche di intelligenza artificiale, si stanno evolvendo e raffinando a ritmi elevati. In questo webinar esploreremo le possibilità messe a disposizione dai servizi AWS per applicare lo stato dell'arte delle tecniche di computer vision a scenari reali.
Amazon Web Services e VMware organizzano un evento virtuale gratuito il prossimo mercoledì 14 Ottobre dalle 12:00 alle 13:00 dedicato a VMware Cloud ™ on AWS, il servizio on demand che consente di eseguire applicazioni in ambienti cloud basati su VMware vSphere® e di accedere ad una vasta gamma di servizi AWS, sfruttando a pieno le potenzialità del cloud AWS e tutelando gli investimenti VMware esistenti.
Molte organizzazioni sfruttano i vantaggi del cloud migrando i propri carichi di lavoro Oracle e assicurandosi notevoli vantaggi in termini di agilità ed efficienza dei costi.
La migrazione di questi carichi di lavoro, può creare complessità durante la modernizzazione e il refactoring delle applicazioni e a questo si possono aggiungere rischi di prestazione che possono essere introdotti quando si spostano le applicazioni dai data center locali.
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
Molte aziende oggi, costruiscono applicazioni con funzionalità di tipo ledger ad esempio per verificare lo storico di accrediti o addebiti nelle transazioni bancarie o ancora per tenere traccia del flusso supply chain dei propri prodotti.
Alla base di queste soluzioni ci sono i database ledger che permettono di avere un log delle transazioni trasparente, immutabile e crittograficamente verificabile, ma sono strumenti complessi e onerosi da gestire.
Amazon QLDB elimina la necessità di costruire sistemi personalizzati e complessi fornendo un database ledger serverless completamente gestito.
In questa sessione scopriremo come realizzare un'applicazione serverless completa che utilizzi le funzionalità di QLDB.
Con l’ascesa delle architetture di microservizi e delle ricche applicazioni mobili e Web, le API sono più importanti che mai per offrire agli utenti finali una user experience eccezionale. In questa sessione impareremo come affrontare le moderne sfide di progettazione delle API con GraphQL, un linguaggio di query API open source utilizzato da Facebook, Amazon e altro e come utilizzare AWS AppSync, un servizio GraphQL serverless gestito su AWS. Approfondiremo diversi scenari, comprendendo come AppSync può aiutare a risolvere questi casi d’uso creando API moderne con funzionalità di aggiornamento dati in tempo reale e offline.
Inoltre, impareremo come Sky Italia utilizza AWS AppSync per fornire aggiornamenti sportivi in tempo reale agli utenti del proprio portale web.
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
Molte organizzazioni sfruttano i vantaggi del cloud migrando i propri carichi di lavoro Oracle e assicurandosi notevoli vantaggi in termini di agilità ed efficienza dei costi.
La migrazione di questi carichi di lavoro, può creare complessità durante la modernizzazione e il refactoring delle applicazioni e a questo si possono aggiungere rischi di prestazione che possono essere introdotti quando si spostano le applicazioni dai data center locali.
In queste slide, gli esperti AWS e VMware presentano semplici e pratici accorgimenti per facilitare e semplificare la migrazione dei carichi di lavoro Oracle accelerando la trasformazione verso il cloud, approfondiranno l’architettura e dimostreranno come sfruttare a pieno le potenzialità di VMware Cloud ™ on AWS.
1) The document discusses building a minimum viable product (MVP) using Amazon Web Services (AWS).
2) It provides an example of an MVP for an omni-channel messenger platform that was built from 2017 to connect ecommerce stores to customers via web chat, Facebook Messenger, WhatsApp, and other channels.
3) The founder discusses how they started with an MVP in 2017 with 200 ecommerce stores in Hong Kong and Taiwan, and have since expanded to over 5000 clients across Southeast Asia using AWS for scaling.
This document discusses pitch decks and fundraising materials. It explains that venture capitalists will typically spend only 3 minutes and 44 seconds reviewing a pitch deck. Therefore, the deck needs to tell a compelling story to grab their attention. It also provides tips on tailoring different types of decks for different purposes, such as creating a concise 1-2 page teaser, a presentation deck for pitching in-person, and a more detailed read-only or fundraising deck. The document stresses the importance of including key information like the problem, solution, product, traction, market size, plans, team, and ask.
This document discusses building serverless web applications using AWS services like API Gateway, Lambda, DynamoDB, S3 and Amplify. It provides an overview of each service and how they can work together to create a scalable, secure and cost-effective serverless application stack without having to manage servers or infrastructure. Key services covered include API Gateway for hosting APIs, Lambda for backend logic, DynamoDB for database needs, S3 for static content, and Amplify for frontend hosting and continuous deployment.
This document provides tips for fundraising from startup founders Roland Yau and Sze Lok Chan. It discusses generating competition to create urgency for investors, fundraising in parallel rather than sequentially, having a clear fundraising narrative focused on what you do and why it's compelling, and prioritizing relationships with people over firms. It also notes how the pandemic has changed fundraising, with examples of deals done virtually during this time. The tips emphasize being fully prepared before fundraising and cultivating connections with investors in advance.
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
This document discusses Amazon's machine learning services for building conversational interfaces and extracting insights from unstructured text and audio. It describes Amazon Lex for creating chatbots, Amazon Comprehend for natural language processing tasks like entity extraction and sentiment analysis, and how they can be used together for applications like intelligent call centers and content analysis. Pre-trained APIs simplify adding machine learning to apps without requiring ML expertise.
Amazon Elastic Container Service (Amazon ECS) è un servizio di gestione dei container altamente scalabile, che semplifica la gestione dei contenitori Docker attraverso un layer di orchestrazione per il controllo del deployment e del relativo lifecycle. In questa sessione presenteremo le principali caratteristiche del servizio, le architetture di riferimento per i differenti carichi di lavoro e i semplici passi necessari per poter velocemente migrare uno o più dei tuo container.
Keynote : AI & Future Of Offensive SecurityPriyanka Aash
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrated industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptxFwdays
I will share my personal experience of full-time development on wasm Blazor
What difficulties our team faced: life hacks with Blazor app routing, whether it is necessary to write JavaScript, which technology stack and architectural patterns we chose
What conclusions we made and what mistakes we committed
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceQuentin Reul
The democratization of Generative AI is ushering in a new era of innovation for enterprises. Discover how you can harness this powerful technology to deliver unparalleled customer value and securing a formidable competitive advantage in today's competitive market. In this session, you will learn how to:
- Identify high-impact customer needs with precision
- Harness the power of large language models to address specific customer needs effectively
- Implement AI responsibly to build trust and foster strong customer relationships
Whether you're at the early stages of your AI journey or looking to optimize existing initiatives, this session will provide you with actionable insights and strategies needed to leverage AI as a powerful catalyst for customer-driven enterprise success.
Retrieval Augmented Generation Evaluation with RagasZilliz
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
How UiPath Discovery Suite supports identification of Agentic Process Automat...DianaGray10
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
It's your unstructured data: How to get your GenAI app to production (and spe...Zilliz
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists to scale their app to 1000s to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment and never having to share private corporate data with anyone - not even Aparavi.
Generative AI technology is a fascinating field that focuses on creating comp...Nohoax Kanont
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
The History of Embeddings & Multimodal EmbeddingsZilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
2. Deep Dive Agenda
• Amazon Redshift history and development
• Cluster architecture
• Concepts and terminology
• Node components
• Storage deep dive
• Design considerations
• Parallelism deep dive
• Open Q&A
14. Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
• Accessing dt with row storage:
– Need to read everything
– Unnecessary I/O
aid loc dt
CREATE TABLE loft_deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
15. Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
• Accessing dt with columnar storage:
– Only scan blocks for relevant
column
aid loc dt
CREATE TABLE loft_deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
16. Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
• Columns grow and shrink independently
• Effective compression ratios due to like
data
• Reduces storage requirements
• Reduces I/O
aid loc dt
CREATE TABLE loft_deep_dive (
aid INT ENCODE LZO
,loc CHAR(3) ENCODE BYTEDICT
,dt DATE ENCODE RUNLENGTH
);
17. Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
aid loc dt
CREATE TABLE loft_deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
• In-memory block metadata
• Contains per-block MIN and MAX value
• Effectively prunes blocks which cannot
contain data for a given query
• Eliminates unnecessary I/O
18. SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2013'
MIN: 01-JUNE-2013
MAX: 20-JUNE-2013
MIN: 08-JUNE-2013
MAX: 30-JUNE-2013
MIN: 12-JUNE-2013
MAX: 20-JUNE-2013
MIN: 02-JUNE-2013
MAX: 25-JUNE-2013
Unsorted Table
MIN: 01-JUNE-2013
MAX: 06-JUNE-2013
MIN: 07-JUNE-2013
MAX: 12-JUNE-2013
MIN: 13-JUNE-2013
MAX: 18-JUNE-2013
MIN: 19-JUNE-2013
MAX: 24-JUNE-2013
Sorted By Date
Zone Maps
19. Terminology and Concepts: Data Sorting
• Goals:
• Physically order rows of table data based on certain column(s)
• Optimize effectiveness of zone maps
• Enable MERGE JOIN operations
• Impact:
• Enables rrscans to prune blocks by leveraging zone maps
• Overall reduction in block I/O
• Achieved with the table property SORTKEY defined over one or more
columns
• Optimal SORTKEY is dependent on:
• Query patterns
• Data profile
• Business requirements
20. Terminology and Concepts: Slices
A slice can be thought of like a “virtual compute node”
• Unit of data partitioning
• Parallel query processing
Facts about slices:
• Each compute node has either 2, 16, or 32 slices
• Table rows are distributed to slices
• A slice processes only its own data
21. Data Distribution
• Distribution style is a table property which dictates how that table’s data is
distributed throughout the cluster:
• KEY: Value is hashed, same value goes to same location (slice)
• ALL: Full table data goes to first slice of every node
• EVEN: Round robin
• Goals:
• Distribute data evenly for parallel processing
• Minimize data movement during query processing
KEY
ALL
Node 1
Slice
1
Slice
2
Node 2
Slice
3
Slice
4
Node 1
Slice
1
Slice
2
Node 2
Slice
3
Slice
4
Node 1
Slice
1
Slice
2
Node 2
Slice
3
Slice
4
EVEN
22. Data Distribution: Example
CREATE TABLE loft_deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE (EVEN|KEY|ALL);
CN1
Slice 0 Slice 1
CN2
Slice 2 Slice 3
Table: loft_deep_dive
User
Columns
System
Columns
aid loc dt ins del row
23. Data Distribution: EVEN Example
CREATE TABLE loft_deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE EVEN;
CN1
Slice 0 Slice 1
CN2
Slice 2 Slice 3
INSERT INTO loft_deep_dive VALUES
(1, 'SFO', '2016-09-01'),
(2, 'JFK', '2016-09-14'),
(3, 'SFO', '2017-04-01'),
(4, 'JFK', '2017-05-14');
Table: loft_deep_dive
User Columns System Columns
aid loc dt ins del row
Table: loft_deep_dive
User Columns System Columns
aid loc dt ins del row
Table: loft_deep_dive
User Columns System Columns
aid loc dt ins del row
Table: loft_deep_dive
User Columns System Columns
aid loc dt ins del row
Rows: 0 Rows: 0 Rows: 0 Rows: 0
(3 User Columns + 3 System Columns) x (4 slices) = 24 Blocks (24MB)
Rows: 1 Rows: 1 Rows: 1 Rows: 1
24. Data Distribution: KEY Example #1
CREATE TABLE loft_deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE KEY DISTKEY (loc);
CN1
Slice 0 Slice 1
CN2
Slice 2 Slice 3
INSERT INTO loft_deep_dive VALUES
(1, 'SFO', '2016-09-01'),
(2, 'JFK', '2016-09-14'),
(3, 'SFO', '2017-04-01'),
(4, 'JFK', '2017-05-14');
Table: loft_deep_dive
User Columns System Columns
aid loc dt ins del row
Rows: 2 Rows: 0 Rows: 0
(3 User Columns + 3 System Columns) x (2 slices) = 12 Blocks (12MB)
Rows: 0Rows: 1
Table: loft_deep_dive
User Columns System Columns
aid loc dt ins del row
Rows: 2Rows: 0Rows: 1
25. Data Distribution: KEY Example #2
CREATE TABLE loft_deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE KEY DISTKEY (aid);
CN1
Slice 0 Slice 1
CN2
Slice 2 Slice 3
INSERT INTO loft_deep_dive VALUES
(1, 'SFO', '2016-09-01'),
(2, 'JFK', '2016-09-14'),
(3, 'SFO', '2017-04-01'),
(4, 'JFK', '2017-05-14');
Table: loft_deep_dive
User Columns System Columns
aid loc dt ins del row
Table: loft_deep_dive
User Columns System Columns
aid loc dt ins del row
Table: loft_deep_dive
User Columns System Columns
aid loc dt ins del row
Table: loft_deep_dive
User Columns System Columns
aid loc dt ins del row
Rows: 0 Rows: 0 Rows: 0 Rows: 0
(3 User Columns + 3 System Columns) x (4 slices) = 24 Blocks (24MB)
Rows: 1 Rows: 1 Rows: 1 Rows: 1
26. Data Distribution: ALL Example
CREATE TABLE loft_deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE ALL;
CN1
Slice 0 Slice 1
CN2
Slice 2 Slice 3
INSERT INTO loft_deep_dive VALUES
(1, 'SFO', '2016-09-01'),
(2, 'JFK', '2016-09-14'),
(3, 'SFO', '2017-04-01'),
(4, 'JFK', '2017-05-14');
Rows: 0 Rows: 0
(3 User Columns + 3 System Columns) x (2 slice) = 12 Blocks (12MB)
Table: loft_deep_dive
User Columns System Columns
aid loc dt ins del row
Rows: 0Rows: 1Rows: 2Rows: 4Rows: 3
Table: loft_deep_dive
User Columns System Columns
aid loc dt ins del row
Rows: 0Rows: 1Rows: 2Rows: 4Rows: 3
27. Terminology and Concepts: Data Distribution
KEY
• The key creates an even distribution of data
• Joins are performed between large fact/dimension tables
• Optimizing merge joins and group by
ALL
• Small and medium size dimension tables (< 2-3M)
EVEN
• When key cannot produce an even distribution
29. Storage Deep Dive: Disks
Amazon Redshift utilizes locally attached storage
devices
• Compute nodes have 2.5-3x the advertised storage
capacity
1, 3, 8, or 24 disks depending on node type
Each disk is split into two partitions
• Local data storage, accessed by local CN
• Mirrored data, accessed by remote CN
Partitions are raw devices
• Local storage devices are ephemeral in nature
• Tolerant to multiple disk failures on a single node
30. Storage Deep Dive: Blocks
Column data is persisted to 1MB immutable blocks
Each block contains in-memory metadata:
• Zone Maps (MIN/MAX value)
• Location of previous/next block
• Blocks are individually compressed with 1 of 10 encodings
A full block contains between 16 and 8.4 million values
31. Storage Deep Dive: Columns
Column: Logical structure accessible via SQL
Physical structure is a doubly linked list of blocks
These blockchains exist on each slice for each column
All sorted & unsorted blockchains compose a column
Column properties include:
• Distribution Key
• Sort Key
• Compression Encoding
Columns shrink and grow independently, 1 block at a time
Three system columns per table-per slice for MVCC
32. Block Properties: Design Considerations
• Small writes:
• Batch processing system, optimized for processing massive
amounts of data
• 1MB size + immutable blocks means that we clone blocks on write
so as not to introduce fragmentation
• Small write (~1-10 rows) has similar cost to a larger write (~100 K
rows)
• UPDATE and DELETE:
• Immutable blocks means that we only logically delete rows on
UPDATE or DELETE
• Must VACUUM or DEEP COPY to remove ghost rows from table
33. Column Properties: Design Considerations
• Compression:
• COPY automatically analyzes and compresses data when loading into empty
tables
• ANALYZE COMPRESSION checks existing tables and proposes optimal
compression algorithms for each column
• Changing column encoding requires a table rebuild
• DISTKEY and SORTKEY significantly influence performance (orders of magnitude)
• Distribution Keys:
• A poor DISTKEY can introduce data skew and an unbalanced workload
• A query completes only as fast as the slowest slice completes
• Sort Keys:
• A sortkey is only effective as the data profile allows it to be
• Selectivity needs to be considered
35. Storage Deep Dive: Slices
Each compute node has either 2, 16, or 32 slices
A slice can be thought of like a “virtual compute node”
• Unit of data partitioning
• Parallel query processing
Facts about slices:
• Table rows are distributed to slices
• A slice processes only its own data
• Within a compute node all slices read from and write to all
disks
38. Query Execution Terminology
Step: An individual operation needed during query execution.
Steps need to be combined to allow compute nodes to perform a
join. Examples: scan, sort, hash, aggr
Segment: A combination of several steps that can be done by a
single process. The smallest compilation unit executable by a
slice. Segments within a stream run in parallel.
Stream: A collection of combined segments which output to the
next stream or SQL client.
40. Query Lifecycle
client
JDBC ODBC
Leader Node
Parser
Query Planner
Code Generator
Final Computations
Generate code
for all segments
of one stream
Explain Plans
Compute Node
Receive Compiled
Code
Run the Compiled
Code
Return results to
Leader
Compute Node
Receive Compiled
Code
Run the Compiled Code
Return results to
Leader
Return results to client
Segments in a stream
are executed
concurrently. Each
step in a segment is
executed serially.
41. Query Execution Deep Dive: Leader Node
1.The leader node receives the query and parses the SQL.
2.The parser produces a logical representation of the original query.
3.This query tree is input into the query optimizer (volt).
4.Volt rewrites the query to maximize its efficiency. Sometimes a single query will
be rewritten as several dependent statements in the background.
5.The rewritten query is sent to the planner which generates >= 1 query plans for
the execution with the best estimated performance.
6.The query plan is sent to the execution engine, where it’s translated into steps,
segments, and streams.
7.This translated plan is sent to the code generator, which generates a C++
function for each segment.
8.This generated C++ is compiled with gcc to a .o file and distributed to the
compute nodes.
42. Query Execution Deep Dive: Compute Nodes
• Slices execute query segments in parallel
• Executable segments are created for one stream at a
time in sequence
• When the compute nodes are done, they return the
query results to the leader node for final processing
• Leader node merges data into a single result set and
addresses any needed sorting or aggregation
• Leader node then returns the results to the client
45. Parallelism considerations with Redshift slices
DS2.8XL Compute Node
Ingestion Throughput:
• Each slice’s query processors can load one file at a time:
• Streaming decompression
• Parse
• Distribute
• Write
Realizing only partial node usage as 6.25% of slices are active
0 2 4 6 8 10 12 141 3 5 7 9 11 13 15
46. Design considerations for Redshift slices
Use at least as many input files
as there are slices in the cluster
With 16 input files, all slices are
working so you maximize
throughput
COPY continues to scale linearly
as you add nodes 16 Input Files
DS2.8XL Compute Node
0 2 4 6 8 10 12 141 3 5 7 9 11 13 15