Apache Arrow is a new standard for in-memory columnar data processing. It is a complement to Apache Parquet and Apache ORC. In this deck we review key design goals and how Arrow works in detail.
Cosco: An Efficient Facebook-Scale Shuffle Service - Databricks
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
The columnar roadmap: Apache Parquet and Apache Arrow - DataWorks Summit
The Hadoop ecosystem has standardized on columnar formats—Apache Parquet for on-disk storage and Apache Arrow for in-memory. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters to the storage layer, reducing time spent in IO reading from disk, as well as CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable as data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can become as fast as native internal performance.
In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions, as well as several future improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and will be usable in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate pushdowns, which will greatly simplify data access optimizations across the board.
Speaker
Julien Le Dem, Principal Engineer, WeWork
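To make the projection and predicate pushdown concrete, here is a minimal PyArrow sketch; the file name and column names are made up:

```python
import pyarrow.parquet as pq

# Read only the columns we need (projection pushdown) and let row-group
# statistics and dictionary pages skip data that cannot match the
# predicate (predicate pushdown). File and columns are hypothetical.
table = pq.read_table(
    "events.parquet",
    columns=["user_id", "ts", "event_type"],
    filters=[("event_type", "=", "click")],
)
print(table.num_rows)
```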
Data Build Tool (DBT) is an open source technology to set up your data lake using best practices from software engineering. This SQL-first technology is a great marriage between Databricks and Delta, allowing you to maintain high-quality data and documentation during the entire data lake life-cycle. In this talk I’ll give an introduction to DBT and show how we can leverage Databricks to do the actual heavy lifting. Next, I’ll present how DBT supports Delta to enable upserting using SQL. Finally, we show how we integrate DBT+Databricks into the Azure cloud and emit the pipeline metrics to Azure Monitor to make sure that you have observability over your pipeline.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... - Databricks
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL, spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
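As a rough illustration of the kind of tuning such a talk covers, a hedged PySpark sketch follows; the config values are placeholders, not recommendations:

```python
from pyspark.sql import SparkSession

# Two knobs that commonly come up when tuning Spark SQL:
# shuffle parallelism and the broadcast-join threshold.
spark = (
    SparkSession.builder
    .appName("spark-sql-tuning-sketch")
    .config("spark.sql.shuffle.partitions", "400")
    .config("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))
    .getOrCreate()
)

spark.read.parquet("/data/events").createOrReplaceTempView("events")  # hypothetical path
# explain(True) prints the parsed, analyzed, optimized and physical plans,
# which is where tuning work usually starts.
spark.sql("SELECT event_type, count(*) FROM events GROUP BY event_type").explain(True)
```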
This document provides a summary of improvements made to Hive's performance through the use of Apache Tez and other optimizations. Some key points include:
- Hive was improved to use Apache Tez as its execution engine instead of MapReduce, reducing latency for interactive queries and improving throughput for batch queries.
- Statistics collection was optimized to gather column-level statistics from ORC file footers, speeding up statistics gathering.
- The cost-based optimizer Optiq was added to Hive, allowing it to choose better execution plans.
- Vectorized query processing, broadcast joins, dynamic partitioning, and other optimizations improved individual query performance by over 100x in some cases.
Apache Iceberg - A Table Format for Huge Analytic Datasets - Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
A Thorough Comparison of Delta Lake, Iceberg and Hudi - Databricks
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Along with the Hive Metastore, these table formats are trying to solve long-standing problems of the traditional data lake with their declared features like ACID transactions, schema evolution, upsert, time travel, and incremental consumption.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud - Noritaka Sekiyama
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
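For flavor, a hypothetical PySpark snippet in the style of the S3A tuning the document refers to; the bucket name and values are made up, so check your Hadoop version's S3A documentation rather than copying them:

```python
from pyspark.sql import SparkSession

# Illustrative S3A settings of the kind the guide discusses.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.connection.maximum", "100")    # parallel connections to S3
    .config("spark.hadoop.fs.s3a.multipart.size", "134217728")  # 128 MB multipart upload parts
    .config("spark.hadoop.fs.s3a.fast.upload", "true")          # buffered, incremental uploads
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/logs/")  # hypothetical bucket
print(df.count())
```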
This document discusses Delta Change Data Feed (CDF), which allows capturing changes made to Delta tables. It describes how CDF works by storing change events like inserts, updates and deletes. It also outlines how CDF can be used to improve ETL pipelines, unify batch and streaming workflows, and meet regulatory needs. The document provides examples of enabling CDF, querying change data and storing the change events. It concludes by offering a demo of CDF in Jupyter notebooks.
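A minimal sketch of the public Delta Lake change data feed API, assuming a Delta-enabled SparkSession and a hypothetical orders table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is on the classpath

# Enable the change data feed on an existing (hypothetical) table.
spark.sql("ALTER TABLE orders SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read the change events recorded since table version 2.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 2)
    .table("orders")
)
# _change_type is one of: insert, delete, update_preimage, update_postimage.
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()
```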
Parquet is a column-oriented storage format for Hadoop that supports efficient compression and encoding techniques. It uses a row group structure to store data in columns in a compressed and encoded column chunk format. The schema and metadata are stored in the file footer to allow for efficient reads and scans of selected columns. The format is designed to be extensible through pluggable components for schema conversion, record materialization, and encodings.
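Because the schema and statistics live in the footer, file structure can be inspected without scanning any data pages; a small PyArrow sketch (the file name is hypothetical):

```python
import pyarrow.parquet as pq

# The footer holds the schema plus per-row-group, per-column metadata.
pf = pq.ParquetFile("data.parquet")  # hypothetical file
meta = pf.metadata
print(meta.num_row_groups, meta.num_rows)

col = meta.row_group(0).column(0)   # first column chunk of the first row group
print(col.path_in_schema, col.compression, col.statistics)
```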
Large Scale Lakehouse Implementation Using Structured Streaming - Databricks
Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decisions, adjust to the market, meet the needs of their customers, and run effective supply chain operations.
Come hear how Asurion used Delta, Structured Streaming, AutoLoader and SQL Analytics to improve production data latency from day-minus-one to near real time. Asurion’s technical team will share battle-tested tips and tricks you only get with certain scale. Asurion’s data lake executes 4000+ streaming jobs and hosts over 4000 tables in its production data lake on AWS.
Parallelization of Structured Streaming Jobs Using Delta Lake - Databricks
We’ll tackle the problem of running streaming jobs from another perspective using Databricks Delta Lake, while examining some of the current issues that we faced at Tubi while running regular structured streaming. We’ll also give a quick overview of why we transitioned from Parquet data files to Delta and the problems it solved for us in running our streaming jobs.
Apache Spark in Depth: Core Concepts, Architecture & Internals - Anton Kirillov
Slides cover core Apache Spark concepts such as RDDs, DAGs, the execution workflow, forming stages of tasks, and the shuffle implementation, and also describe the architecture and main components of the Spark Driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo which contains Spark application examples and a dockerized Hadoop environment to experiment with.
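A toy PySpark sketch of those concepts: the reduceByKey below introduces a shuffle boundary, splitting the DAG into two stages (data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-dag-sketch").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])
# map is a narrow transformation; reduceByKey needs a shuffle, so the
# DAG is cut into two stages at that boundary.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
print(counts.collect())                 # the action that triggers execution
print(counts.toDebugString().decode())  # lineage, with the shuffle visible
```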
Efficient Data Storage for Analytics with Apache Parquet 2.0 - Cloudera, Inc.
Apache Parquet is an open-source columnar storage format for efficient data storage and analytics. It provides efficient compression and encoding techniques that enable fast scans and queries of large datasets. Parquet 2.0 improves on these efficiencies through enhancements like delta encoding, binary packing designed for CPU efficiency, and predicate pushdown using statistics. Benchmark results show Parquet provides much better compression and query performance than row-oriented formats on big data workloads. The project is developed as an open-source community with contributions from many organizations.
Alkin Tezuysal discusses his first 90 days working at ChistaDATA Inc. as EVP of Global Services. He has experience working with databases like MySQL, Oracle, and ClickHouse. ChistaDATA focuses on providing ClickHouse infrastructure operations through managed services, support, and consulting. ClickHouse is an open source columnar database that uses a shared-nothing architecture for high performance analytics workloads.
Parquet performance tuning: the missing guide - Ryan Blue
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
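As an illustrative (not prescriptive) sketch, these PyArrow writer settings correspond to the levers the talk discusses; all values here are arbitrary:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"session_id": [1, 1, 2], "url": ["/a", "/b", "/a"]})

# Smaller row groups give finer-grained skipping; keeping columns
# dictionary-encoded enables dictionary filtering; Brotli is one of the
# newer codecs the talk mentions.
pq.write_table(
    table,
    "clicks.parquet",
    row_group_size=128 * 1024,
    use_dictionary=True,
    compression="brotli",
)
```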
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc... - Databricks
Building a curated data lake on real-time data is an emerging data warehouse pattern with Delta. In the real world, however, we often find ourselves facing dynamically changing schemas, which pose a big challenge to incorporate without downtime.
A fairy tale about orphans, forests, kings and forking open source software projects, with particular reference to sqlline and Apache Hive.
From a talk I gave at the Apache Hive contributors' meetup in Santa Clara on April 22nd, 2015.
Options for Data Prep - A Survey of the Current Market - Dremio Corporation
Data comes in many shapes and sizes, and every company struggles to find ways to transform, validate, and enrich data for multiple purposes. The problem has been around as long as data, and the market has an overwhelming number of options. In this presentation we look at the problem and key options from vendors in the market today. Dremio is a new approach that eliminates the need for standalone data prep tools.
Data Science Languages and Industry Analytics - Wes McKinney
September 19, 2015 talk at Berkeley Institute for Data Science. On how comparatively poor JSON / structured data tools pose a challenge for the data science languages (Python, R, Julia, etc.).
This document discusses how Apache Calcite makes it easier to write database management systems (DBMS) by decomposing them into modular components like a query parser, catalog, algorithms, and storage engines. It presents Calcite as a framework that allows these components to be mixed and matched, with a core relational algebra and rule-based optimization. Calcite powers systems like Apache Hive, Drill, Phoenix, and Kylin by translating SQL and other queries to relational algebra and optimizing queries using over 100 rules before executing them using configurable engines and data sources.
Enterprise data is moving into Hadoop, but some data has to stay in operational systems. Apache Calcite (the technology behind Hive’s new cost-based optimizer, formerly known as Optiq) is a query-optimization and data federation technology that allows you to combine data in Hadoop with data in NoSQL systems such as MongoDB and Splunk, and access it all via SQL.
Hyde shows how to quickly build a SQL interface to a NoSQL system using Calcite. He shows how to add rules and operators to Calcite to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.
Don’t optimize my queries, optimize my data! - Julian Hyde
The document discusses strategies for optimizing data through materialized views and how data systems can learn to optimize themselves. It proposes an algorithm that uses sketches and information theory to profile data cardinalities and recommend materialized views. The algorithm aims to defeat the combinatorial search space by only considering combinations with "surprising" cardinalities. This profiling provides the cost and benefit information needed to optimize data structures. The document also discusses using query logs and statistics to infer relationships between tables and design summary tables through lattices.
HUG_Ireland_Apache_Arrow_Tomer_Shiran - John Mulhall
A presentation by Tomer Shiran, CEO of Dremio made to Hadoop User Group (HUG) Ireland on "Hadoop Summit Night" on April 12th, 2016. This presentation covers Apache Arrow in detail.
Strata NY 2016: The future of column-oriented data processing with Arrow and ... - Julien Le Dem
In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, like RDMA, SSDs, and nonvolatile memory.
Hadoop makes it relatively easy to store petabytes of data. However, storing data is not enough; columnar layouts for storage and in-memory execution allow the analysis of large amounts of data very quickly and efficiently. It provides the ability for multiple applications to share a common data representation and perform operations at full CPU throughput using SIMD and Vectorization. For interoperability, row based encodings (CSV, Thrift, Avro) combined with general purpose compression algorithms (GZip, LZO, Snappy) are common but inefficient. As discussed extensively in the database literature, a columnar layout with statistics and sorting provides vertical and horizontal partitioning, thus keeping IO to a minimum. Additionally a number of key big data technologies have or will soon have in-memory columnar capabilities. This includes Kudu, Ibis and Drill. Sharing a common in-memory columnar representation allows interoperability without the usual cost of serialization.
Understanding modern CPU architecture is critical to maximizing processing throughput. We’ll discuss the advantages of columnar layouts in Parquet and Arrow for in-memory processing and data encodings used for storage (dictionary, bit-packing, prefix coding). We’ll dissect and explain the design choices that enable us to achieve all three goals of interoperability, space and query efficiency. In addition, we’ll provide an overview of what’s coming in Parquet and Arrow in the next year.
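A tiny PyArrow illustration of the dictionary encoding mentioned above: repeated values are stored once, and each row keeps only a small integer index (which also bit-packs well):

```python
import pyarrow as pa

# Repeated values are stored once in a dictionary; rows hold indices.
arr = pa.array(["us", "us", "fr", "us", "fr"])
dict_arr = arr.dictionary_encode()
print(dict_arr.dictionary)  # unique values: ["us", "fr"]
print(dict_arr.indices)     # per-row indices: [0, 0, 1, 0, 1]
```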
Data Eng Conf NY Nov 2016 Parquet Arrow - Julien Le Dem
- Arrow and Parquet are open source projects focused on column-oriented data formats for efficient in-memory (Arrow) and on-disk (Parquet) analytics.
- They allow for interoperability across systems by eliminating the overhead of data serialization and enabling common data representations.
- Column-oriented formats improve performance by reducing storage needs, enabling projection of only needed columns, and better utilization of CPU/memory through cache locality and vectorized processing.
The lightning talks covered various Netflix OSS projects including S3mper, PigPen, STAASH, Dynomite, Aegisthus, Suro, Zeno, Lipstick on GCE, AnsWerS, and IBM. 41 projects were discussed and the need for a cohesive Netflix OSS platform was highlighted. Matt Bookman then gave a presentation on running Lipstick and Hadoop on Google Cloud Platform using Google Compute Engine and Cloud Storage. He demonstrated running Pig jobs on Compute Engine and discussed design considerations for cloud-based Hadoop deployments. Finally, Peter Sankauskas from @Answers4AWS discussed initial ideas around CloudFormation for Asgard and deploying various Netflix OSS projects.
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014 - Puppet
This document discusses using Hiera to provide delegated configuration through multiple data sources. It begins with an introduction to Hiera and its uses. It then discusses using multiple backends like YAML and PostgreSQL to store hierarchical data. The document proposes a designed solution to delegate access to certain Hiera keys by filtering and importing data from external sources into a separate database. This database would act as a secondary Hiera backend. The solution is intended to allow certain users to manage configuration parameters for a subset of servers in a secure manner.
An Incomplete Data Tools Landscape for Hackers in 2015 - Wes McKinney
Wes McKinney gives an overview of the current data analysis tools landscape in Python and R. He discusses essential Python packages like NumPy, pandas, and scikit-learn. For R, he covers packages in the "Hadley stack" like dplyr and ggplot2. IPython/Jupyter notebooks are also mentioned as a platform for interactive data analysis across languages. The talk aims to highlight trends, opportunities, and challenges in the open source data science tool ecosystem.
This document discusses the future of column-oriented data processing with Apache Arrow and Apache Parquet. Arrow provides an open standard for in-memory columnar data, while Parquet provides an open standard for on-disk columnar data storage. Together they provide interoperability across systems and high performance by avoiding data copying and format conversions. The document outlines the goals and benefits of Arrow and Parquet, how they improve CPU and I/O efficiency, and examples of performance gains from integrating systems like Spark and Pandas with Arrow.
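For example, a minimal Pandas-to-Arrow round trip (data made up) shows the boundary crossing without a bespoke converter:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})
table = pa.Table.from_pandas(df)  # Pandas -> Arrow columnar table
roundtrip = table.to_pandas()     # Arrow -> Pandas
assert roundtrip.equals(df)
```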
Using AWS, Terraform, and Ansible to Automate Splunk at Scale - Data Works MD
The DreamPort Splunk Project; How We Use AWS, Terraform, and Ansible to Automate Everything About a Splunk Cluster
At DreamPort, we use cloud platforms, infrastructure-as-code tooling, configuration tools, automation software, and container technologies to very quickly design, develop, and prototype projects. This particular talk focuses on the tools used to deploy and configure a Splunk cluster for a particular project we recently ran. We will cover the deployment, configuration, and orchestration of a large 16-node Splunk cluster using tools that are a core set of DreamPort's cloud infrastructure toolbox: AWS, Terraform, Ansible, and Docker.
It is recommended that attendees have a general understanding of AWS, Linux, Splunk, and Docker, and know about automation tools such as Terraform and Ansible.
Attendees will learn how to use AWS, Terraform, Ansible, and Docker to deploy a large Splunk cluster, and how to use Ansible to orchestrate and manage it.
-------------------------------------------------
Bill Cawthra is a Principal Cloud Infrastructure Architect for CyberPoint, managing project-related cloud systems and platforms. He works primarily on the AWS platform, using various automation tools to rapidly deploy and manage infrastructure. Bill has over 18 years of experience in computers and technology, working in a range of fields, including construction, DoD, health care, and social media.
1) Columnar formats like Parquet, Kudu and Arrow provide more efficient data storage and querying by organizing data by column rather than row.
2) Parquet provides an immutable columnar format well-suited for storage, while Kudu allows for mutable updates but is optimized for scans. Arrow provides an in-memory columnar format focused on CPU efficiency.
3) By establishing common in-memory and on-disk columnar standards, Arrow and Parquet enable more efficient data sharing and querying across systems without serialization overhead.
In-Ceph-tion: Deploying a Ceph cluster on DreamCompute - Patrick McGarry
This document discusses deploying a Ceph cluster on DreamCompute, an OpenStack-powered cloud computing service from DreamHost. It begins with an overview of Ceph's scalability and uses for object, block, and file storage. The document then discusses DreamCompute's open source infrastructure and deploying Ceph using tools like Juju. It provides details on configuring the Ceph cluster by deploying MONs, OSDs, the RGW gateway, and MDS. It concludes by discussing next steps like geo-replication and erasure coding, and opportunities to get involved with the Ceph community.
Apache Arrow -- Cross-language development platform for in-memory data - Wes McKinney
Wes McKinney is the creator of Python's pandas project and a primary developer of Apache Arrow, Apache Parquet, and other open-source projects. Apache Arrow is an open-source cross-language development platform for in-memory analytics that aims to improve data science tools. It provides a shared standard for memory interoperability and computation across languages through its columnar memory format and libraries. Apache Arrow has growing adoption in data science systems and is working to expand language support and computational capabilities.
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines - Timothy Spann
https://www.aicamp.ai/event/eventdetails/W2024022214
Tags: apache nifi, llm, generative ai, gen ai, ml, dl, machine learning, apache kafka, apache flink, postgresql, python
AI Meetup (NYC): GenAI, LLMs, ML and Data
Feb 22, 05:30 PM EST
Welcome to the monthly in-person AI meetup in New York City, in collaboration with Microsoft. Join us for deep-dive tech talks on AI, GenAI, LLMs and machine learning, plus food/drink and networking with speakers and fellow developers.
Agenda:
* 5:30pm~6:00pm: Checkin, Food/drink and networking
* 6:00pm~6:10pm: Welcome/community update
* 6:10pm~8:30pm: Tech talks
* 8:30pm: Q&A, Open discussion
Tech Talk: Searching and Reasoning Over Multimedia Data with Vector Databases and LMMs
Speaker: Zain Hasan (Weaviate)
Abstract: In this talk, Zain Hasan will discuss how we can use open-source multimodal embedding models in conjunction with large generative multimodal models that can see, hear, read, and feel data(!) to perform cross-modal search (searching audio with images, videos with text, etc.) and multimodal retrieval augmented generation (MM-RAG) at the billion-object scale with the help of open source vector databases. He will also demonstrate, with live code demos, how being able to perform this cross-modal retrieval in real time enables users to use LLMs that can reason over their enterprise multimodal data. This talk will revolve around how we can scale the usage of multimodal embedding and generative models in production.
Tech Talk: Codeless Generative AI Pipelines
Speaker: Timothy Spann (Cloudera)
Abstract: Join us for an insightful talk on leveraging the power of real-time streaming tools, specifically Apache NiFi, to revolutionize GenAI data engineering. In this session, we’ll explore how the integration of Apache NiFi can automate the entire process of prompt building, making it a seamless and efficient task.
Speakers/Topics:
Stay tuned as we are updating speakers and schedules. If you have a keen interest in speaking to our community, we invite you to submit topics for consideration.
Sponsors:
We are actively seeking sponsors to support our community, whether by offering venue space, providing food/drink, or cash sponsorship. Sponsors will have the chance to speak at the meetups, receive prominent recognition, and gain exposure to our extensive membership base of 20,000+ local or 300K+ developers worldwide.
Venue:
Microsoft NYC - Times Square, 11 Times Square, New York, NY 10036
Room Name: Central Park West 6501
Community on Slack/Discord
- Event chat: chat and connect with speakers and attendees
- Sharing blogs, events, job openings, projects collaborations
Join Slack (search and join the #newyork channel) | Join Discord
Stackato presentation done at the Nordic Perl Workshop 2012 in Stockholm, Sweden
More information available at: https://logiclab.jira.com/wiki/display/OPEN/Stackato
Slides of a talk given to the Seattle Chapter of the Cloud Security Alliance. Looks briefly at architectures, sources of log data, behavioral signatures in the data, and issues and observations around using Big Data products for security.
This document summarizes a New York Redis Meetup event. It introduces Aleksandr Yampolskiy and Danny Gershman, who will discuss Redis, a key-value store that can be used for caching, publishing/subscribing, and as a data store. Redis allows for fast, in-memory storage of data structures like strings, hashes, lists, sets and sorted sets. The document provides an overview of Redis' capabilities and common uses, such as caching, real-time analytics, and AOP caching. It also notes that Cinchcast is hiring for backend architect and frontend engineer roles.
What is Micro Frontends and Why Use it.pdf - lead93317
🚀 Let's Deep Dive into Why Micro Frontends is the Future of Frontend Architecture 🚀
In today's fast-paced tech landscape, agility, scalability, and maintainability are more crucial than ever. Traditional monolithic frontend architectures often struggle to keep up with these demands. Enter Micro Frontends: a revolutionary approach that's transforming the way we build web applications.
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools - Benjamin Bischoff
In the rapidly evolving landscape of software development and testing, it is tempting to chase the latest tools and technologies. However, some of the most effective solutions have been in existence for decades. In this talk, we’ll delve into the enduring value of these timeless testing tools.
We’ll explore how established tools like Selenium, GNU Make, Maven, and Bash remain vital in today’s software development and testing toolkit even though they have been around for a long time (some were even invented before I was born). I’ll share examples of how these tools have addressed our testing and automation challenges, showcasing their adaptability, versatility, and reliability in various scenarios. I aim to demonstrate that sometimes, the “old” ways can indeed be the best ways.
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024) - Andre Hora
When creating test cases, ideally, developers should test both the expected and unexpected behaviors of the program to catch more bugs and avoid regressions. However, the literature has provided evidence that developers are more likely to test expected behaviors than unexpected ones. In this paper, we propose PathSpotter, a tool to automatically identify tested paths and support the detection of missing tests. Based on PathSpotter, we provide an approach to guide us in detecting missing tests. To evaluate it, we submitted pull requests with test improvements to open-source projects. As a result, 6 out of 8 pull requests were accepted and merged in relevant systems, including CPython, Pylint, and Jupyter Client. These pull requests created/updated 32 tests and added 80 novel assertions covering untested cases. This indicates that our test improvement solution is well received by open-source projects.
Literals - A Machine Independent Feature - 21h16charis
Introduction to Literals, a machine independent feature. The presentation is based on the prescribed textbook for System Software and Compiler Design, Computer Science and Engineering: System Software by Leland L. Beck and D. Manjula.
Bring Strategic Portfolio Management to Monday.com using OnePlan - Webinar 18... - OnePlan Solutions
Unlock the full potential of your projects with OnePlan’s seamless integration with monday.com. Join us to discover how OnePlan enhances monday.com by aligning your portfolio of projects with your organization’s strategic goals, optimizing resource allocation, and streamlining performance tracking. Learn how this powerful combination can drive efficiency, cost savings, and strategic success within your organization.
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha... - David D. Scott
Tube Magic Software is your ultimate tool for creating stunning video content with ease. Designed with both beginners and professionals in mind, it offers a user-friendly interface packed with powerful features. From seamless editing to eye-catching effects, Tube Magic helps you bring your creative vision to life. Elevate your videos and captivate your audience effortlessly. Join our community of content creators and experience the magic today!
The code is written and the tests pass. I just have to commit this last round of changes to my branch. Wait, why does that say committed to main? Did I commit all those changes to main? Arghh! I can’t redo all of this!
Committing changes to the wrong branch, forgetting files, misspelling the commit message, and needing to undo commits are some of the “advanced” features of Git that we normal people run into way too often and need help with. The fixes are often easy – once you know what they are. But in the heat of the moment, with the deadline (or Friday afternoon) approaching, it isn’t always easy to figure out what magic spell to cast to get Git to do what you need.
We’ll spend some time looking at typical Git situations people get themselves into, and then we’ll demonstrate how to get out of them. This isn’t about Git internals or a Git master’s class – this is real-world Git when things aren’t going right. And there will be plenty of time for questions, so bring your “best” Git nightmare scenarios so we can figure out how to recover.
Monitoring the Execution of 14K Tests: Methods Tend to Have One Path that Is ... - Andre Hora
The literature has provided evidence that developers are likely to test some behaviors of the program and avoid other ones. Despite this observation, we still lack empirical evidence from real-world systems. In this paper, we propose to automatically identify the tested paths of a method as a way to detect the method’s behaviors. Then, we provide an empirical study to assess the tested paths quantitatively. We monitor the execution of 14,177 tests from 25 real-world Python systems and assess 11,425 tested paths from 2,357 methods. Overall, our empirical study shows that one tested path is prevalent and receives most of the calls, while others are significantly less executed. We find that the most frequently executed tested path of a method has 4x more calls than the second one. Based on these findings, we discuss practical implications for practitioners and researchers and future research directions.
Test Polarity: Detecting Positive and Negative Tests (FSE 2024) - Andre Hora
Positive tests (aka, happy path tests) cover the expected behavior of the program, while negative tests (aka, unhappy path tests) check the unexpected behavior. Ideally, test suites should have both positive and negative tests to better protect against regressions. In practice, unfortunately, we cannot easily identify whether a test is positive or negative. A better understanding of whether a test suite is more positive or negative is fundamental to assessing the overall test suite capability in testing expected and unexpected behaviors. In this paper, we propose test polarity, an automated approach to detect positive and negative tests. Our approach runs/monitors the test suite and collects runtime data about the application execution to classify the test methods as positive or negative. In a first evaluation, test polarity correctly classified 117 tests as positive or negative. Finally, we provide a preliminary empirical study to analyze the test polarity of 2,054 test methods from 12 real-world test suites of the Python Standard Library. We find that most of the analyzed test methods are negative (88%) and a minority is positive (12%). However, there is a large variation per project: while some libraries have an equivalent number of positive and negative tests, others have mostly negative ones.
1. Apache Arrow
Columnar In-Memory Analytics
2. Dremio [NOT TODAY’S TOPIC]
Jacques Nadeau - Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
Tomer Shiran - Founder & CEO
• VP Product, MapR; Microsoft; IBM Research
• Apache Drill Founder
• Carnegie Mellon, Technion
Julien Le Dem - Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
• Founded in June 2015
• Led by experts in Big Data and open source (Apache Parquet, Drill, Pig, Calcite and more)
• Backed by top Silicon Valley VCs
• Currently in stealth
3. Introducing Apache Arrow
• New open source project under the Apache Software Foundation
– Top-level project (directly!)
• Introduces new era of Columnar In-Memory Analytics
1. 10-100x speedup & concurrency for most workloads
2. Common data layer enables companies to choose best-of-breed systems
3. Users can utilize any programming language
4. Works with relational and complex data as-is; no ETL required
• 13 major open source Big Data projects are already on board
– A significant % of the world’s data will be processed through Arrow!
4. Arrow Turbo-Charges Big Data Execution Engines
[Diagram: execution engines such as Impala, each running on Apache Arrow]
5. Performance Advantage of Columnar In-Memory
[Diagram: SELECT * FROM clickstream WHERE session_id = 1331246351, comparing a traditional memory buffer with an Arrow memory buffer on an Intel CPU]
• Arrow leverages the data parallelism (SIMD) in modern Intel CPUs
• Arrow optimizes CPU prefetching and caching
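A toy PyArrow version of the slide's query, over made-up data: the predicate runs as one vectorized kernel over a contiguous column, which is what SIMD-friendly layouts enable:

```python
import pyarrow as pa
import pyarrow.compute as pc

# Made-up data standing in for the slide's clickstream table.
clickstream = pa.table({
    "session_id": pa.array([1331246351, 7, 1331246351, 42], type=pa.int64()),
    "url": ["/home", "/buy", "/cart", "/home"],
})

# One vectorized equality kernel over a contiguous column, then a filter.
mask = pc.equal(clickstream["session_id"], 1331246351)
print(clickstream.filter(mask))
```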
6. Evolution Towards Heterogeneous Data Infrastructure
RDBMS → Hadoop MapReduce → today’s mix of specialized systems:
• Databases: Cassandra, Elasticsearch, HBase, Kudu, MongoDB, Parquet, Phoenix
• Execution Engines: Drill, Ibis, Impala, MapReduce, Pandas, Spark, Storm
• Phase 1 – Common Scheduler: YARN, Mesos, Kubernetes
• Phase 2 – Common Data/Memory: Arrow
7. Advantages of a Common Data Layer
Today:
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
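A sketch of what 'no overhead for cross-system communication' looks like in practice with Arrow's IPC stream format (schema and data made up): the bytes on the wire use the same columnar layout as memory, so the reader needs no row-by-row decode:

```python
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"id": [1, 2, 3], "v": [0.1, 0.2, 0.3]})

# Write the batch to Arrow's IPC stream format and read it back.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

received = pa.ipc.open_stream(sink.getvalue()).read_all()
print(received.num_rows)
```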
8. Who’s Behind Apache Arrow?
• The creators and lead developers of 13 major open source Big Data projects
– Employees of Cloudera, Databricks, Datastax, Dremio, Hortonworks, MapR, Salesforce, Twitter
• Jacques Nadeau is the PMC Chair (aka VP Apache Arrow)
– Co-founder & CTO of Dremio
Projects: Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm
9. Current Status
• C, C++, Python and Java implementations currently underway
• Will be adopted by Drill, Ibis, Impala, Kudu, Parquet and Spark by EOY
• Additional languages (e.g., R, JavaScript) and projects also expected to adopt Arrow by EOY
10. Questions?
Jacques Nadeau
Dremio Founder & CTO
VP Apache Arrow
Julien Le Dem
Dremio Architect
VP Apache Parquet
12. PMC Members/Committers
Jacques Nadeau (PMC Chair)
Todd Lipcon
Ted Dunning
Michael Stack
P. Taylor Goetz
Reynold Xin
Julian Hyde
Julien Le Dem
James Taylor
Jake Luciani
Parth Chandra
Alex Levenson
Marcel Kornacker
Steven Phillips
Hanifi Gunes
Jason Altekruse
Abdel Hakim Deneche
Wes McKinney
Karthik Ramasamy
David Alves
Seshadri Mahalingam
Ippokratis Pandis
Editor's Notes
This is changing the world! Emphasize that.
Trying to turbo-charge all the major technologies that people use today.
Explain that columnar on disk existed for several years, this is columnar in memory
Is this only CPU and cache, or also main memory? BOTH, EVERYTHING. That’s what’s amazing here.
Very technical explanation – simplify it. One blue vs 4 blues
Maybe improve the slide – from common scheduling to common data in memory
Don’t say it will come in the coming months and years. Years is too far in the future. Everyone has the need today.
We’re not offloading the work for them, they are going to do the work.
Relationships – good point
Call this a platform?