Apache Arrow is a new standard for in-memory columnar data processing. It is a complement to Apache Parquet and Apache ORC. In this deck we review key design goals and how Arrow works in detail.
Cosco: An Efficient Facebook-Scale Shuffle Service - Databricks
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
The columnar roadmap: Apache Parquet and Apache Arrow - DataWorks Summit
The Hadoop ecosystem has standardized on columnar formats—Apache Parquet for on-disk storage and Apache Arrow for in-memory. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters to the storage layer, reducing time spent in IO reading from disk, as well as CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable as data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can become as fast as native internal performance.
In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions, as well as several future improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and will be usable in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate pushdowns, which will greatly simplify data access optimizations across the board.
Speaker
Julien Le Dem, Principal Engineer, WeWork
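To make the projection and predicate pushdown concrete, here is a minimal PyArrow sketch; the file name and column names are made up:

```python
import pyarrow.parquet as pq

# Read only the columns we need (projection pushdown) and let row-group
# statistics and dictionary pages skip data that cannot match the
# predicate (predicate pushdown). File and columns are hypothetical.
table = pq.read_table(
    "events.parquet",
    columns=["user_id", "ts", "event_type"],
    filters=[("event_type", "=", "click")],
)
print(table.num_rows)
```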
Data Build Tool (DBT) is an open source technology to set up your data lake using best practices from software engineering. This SQL-first technology is a great marriage between Databricks and Delta, allowing you to maintain high-quality data and documentation during the entire data lake life-cycle. In this talk I’ll give an introduction to DBT and show how we can leverage Databricks to do the actual heavy lifting. Next, I’ll present how DBT supports Delta to enable upserting using SQL. Finally, we show how we integrate DBT+Databricks into the Azure cloud and emit the pipeline metrics to Azure Monitor to make sure that you have observability over your pipeline.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... - Databricks
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL, spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
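As a rough illustration of the kind of tuning such a talk covers, a hedged PySpark sketch follows; the config values are placeholders, not recommendations:

```python
from pyspark.sql import SparkSession

# Two knobs that commonly come up when tuning Spark SQL:
# shuffle parallelism and the broadcast-join threshold.
spark = (
    SparkSession.builder
    .appName("spark-sql-tuning-sketch")
    .config("spark.sql.shuffle.partitions", "400")
    .config("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))
    .getOrCreate()
)

spark.read.parquet("/data/events").createOrReplaceTempView("events")  # hypothetical path
# explain(True) prints the parsed, analyzed, optimized and physical plans,
# which is where tuning work usually starts.
spark.sql("SELECT event_type, count(*) FROM events GROUP BY event_type").explain(True)
```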
This document provides a summary of improvements made to Hive's performance through the use of Apache Tez and other optimizations. Some key points include:
- Hive was improved to use Apache Tez as its execution engine instead of MapReduce, reducing latency for interactive queries and improving throughput for batch queries.
- Statistics collection was optimized to gather column-level statistics from ORC file footers, speeding up statistics gathering.
- The cost-based optimizer Optiq was added to Hive, allowing it to choose better execution plans.
- Vectorized query processing, broadcast joins, dynamic partitioning, and other optimizations improved individual query performance by over 100x in some cases.
Apache Iceberg - A Table Format for Huge Analytic Datasets - Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
A Thorough Comparison of Delta Lake, Iceberg and Hudi - Databricks
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Along with the Hive Metastore, these table formats are trying to solve long-standing problems of the traditional data lake with their declared features like ACID transactions, schema evolution, upsert, time travel, and incremental consumption.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud - Noritaka Sekiyama
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
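For flavor, a hypothetical PySpark snippet in the style of the S3A tuning the document refers to; the bucket name and values are made up, so check your Hadoop version's S3A documentation rather than copying them:

```python
from pyspark.sql import SparkSession

# Illustrative S3A settings of the kind the guide discusses.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.connection.maximum", "100")    # parallel connections to S3
    .config("spark.hadoop.fs.s3a.multipart.size", "134217728")  # 128 MB multipart upload parts
    .config("spark.hadoop.fs.s3a.fast.upload", "true")          # buffered, incremental uploads
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/logs/")  # hypothetical bucket
print(df.count())
```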
This document discusses Delta Change Data Feed (CDF), which allows capturing changes made to Delta tables. It describes how CDF works by storing change events like inserts, updates and deletes. It also outlines how CDF can be used to improve ETL pipelines, unify batch and streaming workflows, and meet regulatory needs. The document provides examples of enabling CDF, querying change data and storing the change events. It concludes by offering a demo of CDF in Jupyter notebooks.
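A minimal sketch of the public Delta Lake change data feed API, assuming a Delta-enabled SparkSession and a hypothetical orders table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is on the classpath

# Enable the change data feed on an existing (hypothetical) table.
spark.sql("ALTER TABLE orders SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read the change events recorded since table version 2.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 2)
    .table("orders")
)
# _change_type is one of: insert, delete, update_preimage, update_postimage.
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()
```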
Parquet is a column-oriented storage format for Hadoop that supports efficient compression and encoding techniques. It uses a row group structure to store data in columns in a compressed and encoded column chunk format. The schema and metadata are stored in the file footer to allow for efficient reads and scans of selected columns. The format is designed to be extensible through pluggable components for schema conversion, record materialization, and encodings.
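Because the schema and statistics live in the footer, file structure can be inspected without scanning any data pages; a small PyArrow sketch (the file name is hypothetical):

```python
import pyarrow.parquet as pq

# The footer holds the schema plus per-row-group, per-column metadata.
pf = pq.ParquetFile("data.parquet")  # hypothetical file
meta = pf.metadata
print(meta.num_row_groups, meta.num_rows)

col = meta.row_group(0).column(0)   # first column chunk of the first row group
print(col.path_in_schema, col.compression, col.statistics)
```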
Large Scale Lakehouse Implementation Using Structured Streaming - Databricks
Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decisions, adjust to the market, meet the needs of their customers, and run effective supply chain operations.
Come hear how Asurion used Delta, Structured Streaming, AutoLoader and SQL Analytics to improve production data latency from day-minus-one to near real time. Asurion’s technical team will share battle-tested tips and tricks you only get with certain scale. Asurion’s data lake executes 4000+ streaming jobs and hosts over 4000 tables in its production data lake on AWS.
Parallelization of Structured Streaming Jobs Using Delta Lake - Databricks
We’ll tackle the problem of running streaming jobs from another perspective using Databricks Delta Lake, while examining some of the current issues that we faced at Tubi while running regular structured streaming. We’ll also give a quick overview of why we transitioned from Parquet data files to Delta and the problems it solved for us in running our streaming jobs.
Apache Spark in Depth: Core Concepts, Architecture & Internals - Anton Kirillov
Slides cover core Apache Spark concepts such as RDDs, DAGs, the execution workflow, forming stages of tasks, and the shuffle implementation, and also describe the architecture and main components of the Spark Driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo which contains Spark application examples and a dockerized Hadoop environment to experiment with.
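A toy PySpark sketch of those concepts: the reduceByKey below introduces a shuffle boundary, splitting the DAG into two stages (data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-dag-sketch").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])
# map is a narrow transformation; reduceByKey needs a shuffle, so the
# DAG is cut into two stages at that boundary.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
print(counts.collect())                 # the action that triggers execution
print(counts.toDebugString().decode())  # lineage, with the shuffle visible
```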
Efficient Data Storage for Analytics with Apache Parquet 2.0 - Cloudera, Inc.
Apache Parquet is an open-source columnar storage format for efficient data storage and analytics. It provides efficient compression and encoding techniques that enable fast scans and queries of large datasets. Parquet 2.0 improves on these efficiencies through enhancements like delta encoding, binary packing designed for CPU efficiency, and predicate pushdown using statistics. Benchmark results show Parquet provides much better compression and query performance than row-oriented formats on big data workloads. The project is developed as an open-source community with contributions from many organizations.
Alkin Tezuysal discusses his first 90 days working at ChistaDATA Inc. as EVP of Global Services. He has experience working with databases like MySQL, Oracle, and ClickHouse. ChistaDATA focuses on providing ClickHouse infrastructure operations through managed services, support, and consulting. ClickHouse is an open source columnar database that uses a shared-nothing architecture for high performance analytics workloads.
Parquet performance tuning: the missing guide - Ryan Blue
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
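As an illustrative (not prescriptive) sketch, these PyArrow writer settings correspond to the levers the talk discusses; all values here are arbitrary:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"session_id": [1, 1, 2], "url": ["/a", "/b", "/a"]})

# Smaller row groups give finer-grained skipping; keeping columns
# dictionary-encoded enables dictionary filtering; Brotli is one of the
# newer codecs the talk mentions.
pq.write_table(
    table,
    "clicks.parquet",
    row_group_size=128 * 1024,
    use_dictionary=True,
    compression="brotli",
)
```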
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc... - Databricks
Building a curated data lake on real-time data is an emerging data warehouse pattern with Delta. In the real world, however, we often find ourselves facing dynamically changing schemas, which pose a big challenge to incorporate without downtime.
A fairy tale about orphans, forests, kings and forking open source software projects, with particular reference to sqlline and Apache Hive.
From a talk I gave at the Apache Hive contributors' meetup in Santa Clara on April 22nd, 2015.
Options for Data Prep - A Survey of the Current Market - Dremio Corporation
Data comes in many shapes and sizes, and every company struggles to find ways to transform, validate, and enrich data for multiple purposes. The problem has been around as long as data, and the market has an overwhelming number of options. In this presentation we look at the problem and key options from vendors in the market today. Dremio is a new approach that eliminates the need for standalone data prep tools.
Data Science Languages and Industry Analytics - Wes McKinney
September 19, 2015 talk at Berkeley Institute for Data Science. On how comparatively poor JSON / structured data tools pose a challenge for the data science languages (Python, R, Julia, etc.).
This document discusses how Apache Calcite makes it easier to write database management systems (DBMS) by decomposing them into modular components like a query parser, catalog, algorithms, and storage engines. It presents Calcite as a framework that allows these components to be mixed and matched, with a core relational algebra and rule-based optimization. Calcite powers systems like Apache Hive, Drill, Phoenix, and Kylin by translating SQL and other queries to relational algebra and optimizing queries using over 100 rules before executing them using configurable engines and data sources.
Enterprise data is moving into Hadoop, but some data has to stay in operational systems. Apache Calcite (the technology behind Hive’s new cost-based optimizer, formerly known as Optiq) is a query-optimization and data federation technology that allows you to combine data in Hadoop with data in NoSQL systems such as MongoDB and Splunk, and access it all via SQL.
Hyde shows how to quickly build a SQL interface to a NoSQL system using Calcite. He shows how to add rules and operators to Calcite to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.
Don’t optimize my queries, optimize my data! - Julian Hyde
The document discusses strategies for optimizing data through materialized views and how data systems can learn to optimize themselves. It proposes an algorithm that uses sketches and information theory to profile data cardinalities and recommend materialized views. The algorithm aims to defeat the combinatorial search space by only considering combinations with "surprising" cardinalities. This profiling provides the cost and benefit information needed to optimize data structures. The document also discusses using query logs and statistics to infer relationships between tables and design summary tables through lattices.
HUG_Ireland_Apache_Arrow_Tomer_Shiran - John Mulhall
A presentation by Tomer Shiran, CEO of Dremio made to Hadoop User Group (HUG) Ireland on "Hadoop Summit Night" on April 12th, 2016. This presentation covers Apache Arrow in detail.
Strata NY 2016: The future of column-oriented data processing with Arrow and ... - Julien Le Dem
In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, like RDMA, SSDs, and nonvolatile memory.
Hadoop makes it relatively easy to store petabytes of data. However, storing data is not enough; columnar layouts for storage and in-memory execution allow the analysis of large amounts of data very quickly and efficiently. It provides the ability for multiple applications to share a common data representation and perform operations at full CPU throughput using SIMD and Vectorization. For interoperability, row based encodings (CSV, Thrift, Avro) combined with general purpose compression algorithms (GZip, LZO, Snappy) are common but inefficient. As discussed extensively in the database literature, a columnar layout with statistics and sorting provides vertical and horizontal partitioning, thus keeping IO to a minimum. Additionally a number of key big data technologies have or will soon have in-memory columnar capabilities. This includes Kudu, Ibis and Drill. Sharing a common in-memory columnar representation allows interoperability without the usual cost of serialization.
Understanding modern CPU architecture is critical to maximizing processing throughput. We’ll discuss the advantages of columnar layouts in Parquet and Arrow for in-memory processing and data encodings used for storage (dictionary, bit-packing, prefix coding). We’ll dissect and explain the design choices that enable us to achieve all three goals of interoperability, space and query efficiency. In addition, we’ll provide an overview of what’s coming in Parquet and Arrow in the next year.
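A tiny PyArrow illustration of the dictionary encoding mentioned above: repeated values are stored once, and each row keeps only a small integer index (which also bit-packs well):

```python
import pyarrow as pa

# Repeated values are stored once in a dictionary; rows hold indices.
arr = pa.array(["us", "us", "fr", "us", "fr"])
dict_arr = arr.dictionary_encode()
print(dict_arr.dictionary)  # unique values: ["us", "fr"]
print(dict_arr.indices)     # per-row indices: [0, 0, 1, 0, 1]
```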
Data Eng Conf NY Nov 2016 Parquet Arrow - Julien Le Dem
- Arrow and Parquet are open source projects focused on column-oriented data formats for efficient in-memory (Arrow) and on-disk (Parquet) analytics.
- They allow for interoperability across systems by eliminating the overhead of data serialization and enabling common data representations.
- Column-oriented formats improve performance by reducing storage needs, enabling projection of only needed columns, and better utilization of CPU/memory through cache locality and vectorized processing.
The lightning talks covered various Netflix OSS projects including S3mper, PigPen, STAASH, Dynomite, Aegisthus, Suro, Zeno, Lipstick on GCE, AnsWerS, and IBM. 41 projects were discussed and the need for a cohesive Netflix OSS platform was highlighted. Matt Bookman then gave a presentation on running Lipstick and Hadoop on Google Cloud Platform using Google Compute Engine and Cloud Storage. He demonstrated running Pig jobs on Compute Engine and discussed design considerations for cloud-based Hadoop deployments. Finally, Peter Sankauskas from @Answers4AWS discussed initial ideas around CloudFormation for Asgard and deploying various Netflix OSS projects.
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014 - Puppet
This document discusses using Hiera to provide delegated configuration through multiple data sources. It begins with an introduction to Hiera and its uses. It then discusses using multiple backends like YAML and PostgreSQL to store hierarchical data. The document proposes a designed solution to delegate access to certain Hiera keys by filtering and importing data from external sources into a separate database. This database would act as a secondary Hiera backend. The solution is intended to allow certain users to manage configuration parameters for a subset of servers in a secure manner.
An Incomplete Data Tools Landscape for Hackers in 2015 - Wes McKinney
Wes McKinney gives an overview of the current data analysis tools landscape in Python and R. He discusses essential Python packages like NumPy, pandas, and scikit-learn. For R, he covers packages in the "Hadley stack" like dplyr and ggplot2. IPython/Jupyter notebooks are also mentioned as a platform for interactive data analysis across languages. The talk aims to highlight trends, opportunities, and challenges in the open source data science tool ecosystem.
This document discusses the future of column-oriented data processing with Apache Arrow and Apache Parquet. Arrow provides an open standard for in-memory columnar data, while Parquet provides an open standard for on-disk columnar data storage. Together they provide interoperability across systems and high performance by avoiding data copying and format conversions. The document outlines the goals and benefits of Arrow and Parquet, how they improve CPU and I/O efficiency, and examples of performance gains from integrating systems like Spark and Pandas with Arrow.
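For example, a minimal Pandas-to-Arrow round trip (data made up) shows the boundary crossing without a bespoke converter:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})
table = pa.Table.from_pandas(df)  # Pandas -> Arrow columnar table
roundtrip = table.to_pandas()     # Arrow -> Pandas
assert roundtrip.equals(df)
```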
Using AWS, Terraform, and Ansible to Automate Splunk at Scale - Data Works MD
The DreamPort Splunk Project; How We Use AWS, Terraform, and Ansible to Automate Everything About a Splunk Cluster
At DreamPort, we use cloud platforms, infrastructure-as-code tooling, configuration tools, automation software, and container technologies to very quickly design, develop, and prototype projects. This particular talk focuses on the tools used to deploy and configure a Splunk cluster for a particular project we recently ran. We will cover the deployment, configuration, and orchestration of a large 16-node Splunk cluster using tools that are a core set of DreamPort's cloud infrastructure toolbox: AWS, Terraform, Ansible, and Docker.
It is recommended that attendees have a general understanding of AWS, Linux, Splunk, and Docker, and know about automation tools such as Terraform and Ansible.
Attendees will learn how to use AWS, Terraform, Ansible, and Docker to deploy a large Splunk cluster, and how to use Ansible to orchestrate and manage it.
-------------------------------------------------
Bill Cawthra is a Principal Cloud Infrastructure Architect for CyberPoint, managing project-related cloud systems and platforms. He works primarily on the AWS platform, using various automation tools to rapidly deploy and manage infrastructure. Bill has over 18 years of experience in computers and technology, working in a range of fields, including construction, DoD, health care, and social media.
1) Columnar formats like Parquet, Kudu and Arrow provide more efficient data storage and querying by organizing data by column rather than row.
2) Parquet provides an immutable columnar format well-suited for storage, while Kudu allows for mutable updates but is optimized for scans. Arrow provides an in-memory columnar format focused on CPU efficiency.
3) By establishing common in-memory and on-disk columnar standards, Arrow and Parquet enable more efficient data sharing and querying across systems without serialization overhead.
In-Ceph-tion: Deploying a Ceph cluster on DreamCompute - Patrick McGarry
This document discusses deploying a Ceph cluster on DreamCompute, an OpenStack-powered cloud computing service from DreamHost. It begins with an overview of Ceph's scalability and uses for object, block, and file storage. The document then discusses DreamCompute's open source infrastructure and deploying Ceph using tools like Juju. It provides details on configuring the Ceph cluster by deploying MONs, OSDs, the RGW gateway, and MDS. It concludes by discussing next steps like geo-replication and erasure coding, and opportunities to get involved with the Ceph community.
Apache Arrow -- Cross-language development platform for in-memory data - Wes McKinney
Wes McKinney is the creator of Python's pandas project and a primary developer of Apache Arrow, Apache Parquet, and other open-source projects. Apache Arrow is an open-source cross-language development platform for in-memory analytics that aims to improve data science tools. It provides a shared standard for memory interoperability and computation across languages through its columnar memory format and libraries. Apache Arrow has growing adoption in data science systems and is working to expand language support and computational capabilities.
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines - Timothy Spann
https://www.aicamp.ai/event/eventdetails/W2024022214
Tags: apache nifi, llm, generative ai, gen ai, ml, dl, machine learning, apache kafka, apache flink, postgresql, python
AI Meetup (NYC): GenAI, LLMs, ML and Data
Feb 22, 05:30 PM EST
Welcome to the monthly in-person AI meetup in New York City, in collaboration with Microsoft. Join us for deep-dive tech talks on AI, GenAI, LLMs and machine learning, plus food/drink and networking with speakers and fellow developers.
Agenda:
* 5:30pm~6:00pm: Checkin, Food/drink and networking
* 6:00pm~6:10pm: Welcome/community update
* 6:10pm~8:30pm: Tech talks
* 8:30pm: Q&A, Open discussion
Tech Talk: Searching and Reasoning Over Multimedia Data with Vector Databases and LMMs
Speaker: Zain Hasan (Weaviate)
Abstract: In this talk, Zain Hasan will discuss how we can use open-source multimodal embedding models in conjunction with large generative multimodal models that can see, hear, read, and feel data(!) to perform cross-modal search (searching audio with images, videos with text, etc.) and multimodal retrieval augmented generation (MM-RAG) at the billion-object scale with the help of open source vector databases. He will also demonstrate, with live code demos, how being able to perform this cross-modal retrieval in real time enables users to use LLMs that can reason over their enterprise multimodal data. This talk will revolve around how we can scale the usage of multimodal embedding and generative models in production.
Tech Talk: Codeless Generative AI Pipelines
Speaker: Timothy Spann (Cloudera)
Abstract: Join us for an insightful talk on leveraging the power of real-time streaming tools, specifically Apache NiFi, to revolutionize GenAI data engineering. In this session, we’ll explore how the integration of Apache NiFi can automate the entire process of prompt building, making it a seamless and efficient task.
Speakers/Topics:
Stay tuned as we are updating speakers and schedules. If you have a keen interest in speaking to our community, we invite you to submit topics for consideration.
Sponsors:
We are actively seeking sponsors to support our community, whether by offering venue space, providing food/drink, or cash sponsorship. Sponsors will have the chance to speak at the meetups, receive prominent recognition, and gain exposure to our extensive membership base of 20,000+ local or 300K+ developers worldwide.
Venue:
Microsoft NYC - Times Square, 11 Times Square, New York, NY 10036
Room Name: Central Park West 6501
Community on Slack/Discord
- Event chat: chat and connect with speakers and attendees
- Sharing blogs, events, job openings, projects collaborations
Join Slack (search and join the #newyork channel) | Join Discord
Stackato presentation done at the Nordic Perl Workshop 2012 in Stockholm, Sweden
More information available at: https://logiclab.jira.com/wiki/display/OPEN/Stackato
Slides of a talk given to the Seattle Chapter of the Cloud Security Alliance. Looks briefly at architectures, sources of log data, behavioral signatures in the data, and issues and observations around using Big Data products for security.
This document summarizes a New York Redis Meetup event. It introduces Aleksandr Yampolskiy and Danny Gershman, who will discuss Redis, a key-value store that can be used for caching, publishing/subscribing, and as a data store. Redis allows for fast, in-memory storage of data structures like strings, hashes, lists, sets and sorted sets. The document provides an overview of Redis' capabilities and common uses, such as caching, real-time analytics, and AOP caching. It also notes that Cinchcast is hiring for backend architect and frontend engineer roles.
What is Micro Frontends and Why Use it.pdf - lead93317
🚀 Let's Deep Dive into Why Micro Frontends is the Future of Frontend Architecture 🚀
In today's fast-paced tech landscape, agility, scalability, and maintainability are more crucial than ever. Traditional monolithic frontend architectures often struggle to keep up with these demands. Enter Micro Frontends: a revolutionary approach that's transforming the way we build web applications.
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools - Benjamin Bischoff
In the rapidly evolving landscape of software development and testing, it is tempting to chase the latest tools and technologies. However, some of the most effective solutions have been in existence for decades. In this talk, we’ll delve into the enduring value of these timeless testing tools.
We’ll explore how established tools like Selenium, GNU Make, Maven, and Bash remain vital in today’s software development and testing toolkit even though they have been around for a long time (some were even invented before I was born). I’ll share examples of how these tools have addressed our testing and automation challenges, showcasing their adaptability, versatility, and reliability in various scenarios. I aim to demonstrate that sometimes, the “old” ways can indeed be the best ways.
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024) - Andre Hora
When creating test cases, ideally, developers should test both the expected and unexpected behaviors of the program to catch more bugs and avoid regressions. However, the literature has provided evidence that developers are more likely to test expected behaviors than unexpected ones. In this paper, we propose PathSpotter, a tool to automatically identify tested paths and support the detection of missing tests. Based on PathSpotter, we provide an approach to guide us in detecting missing tests. To evaluate it, we submitted pull requests with test improvements to open-source projects. As a result, 6 out of 8 pull requests were accepted and merged in relevant systems, including CPython, Pylint, and Jupyter Client. These pull requests created/updated 32 tests and added 80 novel assertions covering untested cases. This indicates that our test improvement solution is well received by open-source projects.
Literals - A Machine Independent Feature - 21h16charis
Introduction to Literals, a machine independent feature. The presentation is based on the prescribed textbook for System Software and Compiler Design, Computer Science and Engineering: System Software by Leland L. Beck and D. Manjula.
Bring Strategic Portfolio Management to Monday.com using OnePlan - Webinar 18... - OnePlan Solutions
Unlock the full potential of your projects with OnePlan’s seamless integration with monday.com. Join us to discover how OnePlan enhances monday.com by aligning your portfolio of projects with your organization’s strategic goals, optimizing resource allocation, and streamlining performance tracking. Learn how this powerful combination can drive efficiency, cost savings, and strategic success within your organization.
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha... - David D. Scott
Tube Magic Software is your ultimate tool for creating stunning video content with ease. Designed with both beginners and professionals in mind, it offers a user-friendly interface packed with powerful features. From seamless editing to eye-catching effects, Tube Magic helps you bring your creative vision to life. Elevate your videos and captivate your audience effortlessly. Join our community of content creators and experience the magic today!
The code is written and the tests pass. I just have to commit this last round of changes to my branch. Wait, why does that say committed to main? Did I commit all those changes to main? Arghh! I can’t redo all of this!
Committing changes to the wrong branch, forgetting files, misspelling the commit message, and needing to undo commits are some of the “advanced” features of Git that we normal people run into way too often and need help with. The fixes are often easy – once you know what they are. But in the heat of the moment, with the deadline (or Friday afternoon) approaching, it isn’t always easy to figure out what magic spell to cast to get Git to do what you need.
We’ll spend some time looking at typical Git situations people get themselves into, and then we’ll demonstrate how to get out of them. This isn’t about Git internals or a Git master’s class – this is real-world Git when things aren’t going right. And there will be plenty of time for questions, so bring your “best” Git nightmare scenarios so we can figure out how to recover.
Monitoring the Execution of 14K Tests: Methods Tend to Have One Path that Is ... - Andre Hora
The literature has provided evidence that developers are likely to test some behaviors of the program and avoid other ones. Despite this observation, we still lack empirical evidence from real-world systems. In this paper, we propose to automatically identify the tested paths of a method as a way to detect the method’s behaviors. Then, we provide an empirical study to assess the tested paths quantitatively. We monitor the execution of 14,177 tests from 25 real-world Python systems and assess 11,425 tested paths from 2,357 methods. Overall, our empirical study shows that one tested path is prevalent and receives most of the calls, while others are significantly less executed. We find that the most frequently executed tested path of a method has 4x more calls than the second one. Based on these findings, we discuss practical implications for practitioners and researchers and future research directions.
Test Polarity: Detecting Positive and Negative Tests (FSE 2024) - Andre Hora
Positive tests (aka, happy path tests) cover the expected behavior of the program, while negative tests (aka, unhappy path tests) check the unexpected behavior. Ideally, test suites should have both positive and negative tests to better protect against regressions. In practice, unfortunately, we cannot easily identify whether a test is positive or negative. A better understanding of whether a test suite is more positive or negative is fundamental to assessing the overall test suite capability in testing expected and unexpected behaviors. In this paper, we propose test polarity, an automated approach to detect positive and negative tests. Our approach runs/monitors the test suite and collects runtime data about the application execution to classify the test methods as positive or negative. In a first evaluation, test polarity correctly classified 117 tests as positive or negative. Finally, we provide a preliminary empirical study to analyze the test polarity of 2,054 test methods from 12 real-world test suites of the Python Standard Library. We find that most of the analyzed test methods are negative (88%) and a minority is positive (12%). However, there is a large variation per project: while some libraries have an equivalent number of positive and negative tests, others have mostly negative ones.
1. Apache Arrow
Columnar In-Memory Analytics
2. Dremio [NOT TODAY’S TOPIC]
Jacques Nadeau - Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
Tomer Shiran - Founder & CEO
• VP Product, MapR; Microsoft; IBM Research
• Apache Drill Founder
• Carnegie Mellon, Technion
Julien Le Dem - Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
• Founded in June 2015
• Led by experts in Big Data and open source (Apache Parquet, Drill, Pig, Calcite and more)
• Backed by top Silicon Valley VCs
• Currently in stealth
3. Introducing Apache Arrow
• New open source project under the Apache Software Foundation
– Top-level project (directly!)
• Introduces new era of Columnar In-Memory Analytics
1. 10-100x speedup & concurrency for most workloads
2. Common data layer enables companies to choose best-of-breed systems
3. Users can utilize any programming language
4. Works with relational and complex data as-is; no ETL required
• 13 major open source Big Data projects are already on board
– A significant % of the world’s data will be processed through Arrow!
4. Arrow Turbo-Charges Big Data Execution Engines
[Diagram: execution engines such as Impala, each running on Apache Arrow]
5. Performance Advantage of Columnar In-Memory
[Diagram: SELECT * FROM clickstream WHERE session_id = 1331246351, comparing a traditional memory buffer with an Arrow memory buffer on an Intel CPU]
• Arrow leverages the data parallelism (SIMD) in modern Intel CPUs
• Arrow optimizes CPU prefetching and caching
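A toy PyArrow version of the slide's query, over made-up data: the predicate runs as one vectorized kernel over a contiguous column, which is what SIMD-friendly layouts enable:

```python
import pyarrow as pa
import pyarrow.compute as pc

# Made-up data standing in for the slide's clickstream table.
clickstream = pa.table({
    "session_id": pa.array([1331246351, 7, 1331246351, 42], type=pa.int64()),
    "url": ["/home", "/buy", "/cart", "/home"],
})

# One vectorized equality kernel over a contiguous column, then a filter.
mask = pc.equal(clickstream["session_id"], 1331246351)
print(clickstream.filter(mask))
```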
6. Evolution Towards Heterogeneous Data Infrastructure
RDBMS → Hadoop MapReduce → today’s mix of specialized systems:
• Databases: Cassandra, Elasticsearch, HBase, Kudu, MongoDB, Parquet, Phoenix
• Execution Engines: Drill, Ibis, Impala, MapReduce, Pandas, Spark, Storm
• Phase 1 – Common Scheduler: YARN, Mesos, Kubernetes
• Phase 2 – Common Data/Memory: Arrow
7. Advantages of a Common Data Layer
Today:
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
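A sketch of what 'no overhead for cross-system communication' looks like in practice with Arrow's IPC stream format (schema and data made up): the bytes on the wire use the same columnar layout as memory, so the reader needs no row-by-row decode:

```python
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"id": [1, 2, 3], "v": [0.1, 0.2, 0.3]})

# Write the batch to Arrow's IPC stream format and read it back.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

received = pa.ipc.open_stream(sink.getvalue()).read_all()
print(received.num_rows)
```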
8. Who’s Behind Apache Arrow?
• The creators and lead developers of 13 major open source Big Data projects
– Employees of Cloudera, Databricks, Datastax, Dremio, Hortonworks, MapR, Salesforce, Twitter
• Jacques Nadeau is the PMC Chair (aka VP Apache Arrow)
– Co-founder & CTO of Dremio
Projects: Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm
9. Current Status
• C, C++, Python and Java implementations currently underway
• Will be adopted by Drill, Ibis, Impala, Kudu, Parquet and Spark by EOY
• Additional languages (e.g., R, JavaScript) and projects also expected to adopt Arrow by EOY
10. Questions?
Jacques Nadeau
Dremio Founder & CTO
VP Apache Arrow
Julien Le Dem
Dremio Architect
VP Apache Parquet
12. PMC Members/Committers
Jacques Nadeau (PMC Chair)
Todd Lipcon
Ted Dunning
Michael Stack
P. Taylor Goetz
Reynold Xin
Julian Hyde
Julien Le Dem
James Taylor
Jake Luciani
Parth Chandra
Alex Levenson
Marcel Kornacker
Steven Phillips
Hanifi Gunes
Jason Altekruse
Abdel Hakim Deneche
Wes McKinney
Karthik Ramasamy
David Alves
Seshadri Mahalingam
Ippokratis Pandis
Editor's Notes
This is changing the world! Emphasize that.
Trying to turbo-charge all the major technologies that people use today.
Explain that columnar on disk existed for several years, this is columnar in memory
Is this only CPU and cache, or also main memory? BOTH, EVERYTHING. That’s what’s amazing here.
Very technical explanation – simplify it. One blue vs 4 blues
Maybe improve the slide – from common scheduling to common data in memory
Don’t say it will come in the coming months and years. Years is too far in the future. Everyone has the need today.
We’re not offloading the work for them, they are going to do the work.
Relationships – good point
Call this a platform?