Owen O’Malley
September 2019
© 2019 Cloudera, Inc. All rights reserved. 2
• First committer added to Hadoop
 Working at Yahoo in 2006
 Original VP when Hadoop became a TLP
• Committer & PMC member on
 Ambari
 Hadoop
 Hive
• Mentor for: Giraph, Kafka, Knox, Kylin, Iceberg, Metron, Ranger, Reef, & Tez
• Starting a project as GitHub repositories is very easy
Start developing immediately!
Allows building community before entering Apache
• But… getting lawyers to sign a code grant takes time
Need legal sign-offs from each contributor's company
Apache requires code grants for large chunks of code
• When Hortonworks & Vertica developed a C++ ORC reader
Always intended to move to Apache
Only two companies
Still took a couple months to get the code grant signed
• Get IP agreements before committing code!
• Open source is a vibrant ecosystem
Projects fill niches in that ecosystem
Creates choices for users and developers
• Your project is competing with many others
Apache doesn’t pick winners and losers
Fighting for attention
• Advertise your project!
• Make building a good project website a priority!
Take down the old site
• Give conference talks
Tell users about new features
• Write blogs
Use cases
Experience in production
• Apache is of two minds with respect to employers
Apache doesn’t care who your employer is.
Projects should encourage a diverse set of employers
• Your karma is yours, not your employer’s
Expected to keep your “hats” separate
• Avoid group-think
More voices and viewpoints are very very good
Happy users make your project grow
• Don’t assume all smart people work at your company
Innovation happens everywhere
• Separate the company’s goals from the project’s
Don’t shoot down proposals because
• They compete with your proprietary products
• They would create work for your proprietary products
• Don’t promise features in upcoming versions
• In-person meetings exclude remote people
• Even video meetings are hard for different time zones
• Holding roadmap meetings is particularly problematic
Need to ensure full access to the community
Bring discussion back to the email list before finalizing
• When writing to the lists, use “I” instead of “we”
You are presenting your opinion, not a group’s
• Make your project website welcoming & helpful
• Many projects make source and binary release artifacts
• Binary artifacts are hard to review and get right
Make reproducible builds
Licensing for binary artifacts is the transitive closure
Watch Docker file artifact versions
• Far better to make only source release artifacts
Can make convenience binary artifacts after release vote
Even better is to make downstream binary artifacts
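One rough way to check the "reproducible builds" point is to build the same release candidate twice, ideally on different machines, and compare digests of the resulting artifacts. The sketch below assumes that setup; the helper names `sha512_of` and `builds_match` and the file names are hypothetical, and the stand-in files only imitate real build outputs.

```python
import hashlib
import tempfile
from pathlib import Path


def sha512_of(path):
    """Stream a file through SHA-512 and return the hex digest."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(first, second):
    """True when two independently built artifacts are byte-identical."""
    return sha512_of(first) == sha512_of(second)


# Stand-in files play the role of builds of the same release candidate.
with tempfile.TemporaryDirectory() as tmp:
    build_a = Path(tmp, "build-a.jar")
    build_b = Path(tmp, "build-b.jar")
    build_c = Path(tmp, "build-c.jar")
    build_a.write_bytes(b"same bytes")
    build_b.write_bytes(b"same bytes")
    # A build that embeds something like a timestamp no longer matches.
    build_c.write_bytes(b"same bytes plus an embedded timestamp")
    same = builds_match(build_a, build_b)
    different = builds_match(build_a, build_c)

print(same, different)
```

A mismatch does not say what differs, only that a reviewer cannot confirm the binary came from the voted source without further digging.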
• Some projects require a lot of patches to make committer
• Much worse if project has a large patch queue
Really hard if you don’t know or work with a committer
3.6k uncommitted patches on Hadoop’s Jiras since 2006
• A committer shortage makes the patch queue worse
• Most important for becoming a committer should be:
Good technical taste
Knowing their own limits
• Make sure that your project doesn’t use another trademark
Often comes down to a judgement call about risks
Changing early is much better than later
• Ensure people don’t abuse your trademark
Very hard if it is a user/non-project member
Project members need to fix their company’s behavior
• Board has removed PMC members
• Hold training classes for engineers and marketing
• If employees work on open source projects:
• Make sure the engineers' objectives reflect this, and measure them
Code contributions
Documentation contributions
Code reviews – include other companies’ patches
Conference presentations
• Make the managers’ goals also reflect the community time
• Developing off-line breaks the community
Cuts community off from participation
Forward motion stops & release train stalls
• Yahoo developed Hadoop Security privately
0.18, 0.19, 0.20 were ~3-4 months
0.21 was 12 months, Facebook & LinkedIn forked
0.20.203 was 24 months from 0.20
1.0 was 8 more months
• Your project should have source with permissive licenses
E.g., Apache, BSD, MIT
• Can have binary dependency on weak-copyleft licenses
E.g., Eclipse, Mozilla, Creative Commons Attribution
• Can only use Category X in very specific cases
Must be optional, a build tool, or system provided
• Includes recursive dependencies!
• Can sneak up on you
Updated aircompressor dependency from 0.10 to 0.15
Started using the new zstd codec
Previously excluded slice dependency was now required
Slice depends on jol-core, which is GPL.
• Fortunately, the use of slice was for one method
Worked with aircompressor to get a fix
• Copying code from Stack Overflow has the same problem!
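Because the license check covers recursive dependencies, it helps to walk the whole dependency graph rather than only the direct dependencies. The following is a minimal sketch under stated assumptions: the function `find_category_x`, the root name `my-project`, and the license identifiers are all hypothetical placeholders, and the graph only mirrors the aircompressor, slice, and jol-core chain described above. A real project would take the graph and license data from its build tool.

```python
from collections import deque

# Hypothetical set of license identifiers treated as Category X.
CATEGORY_X = {"GPL-2.0", "GPL-3.0", "AGPL-3.0"}


def find_category_x(root, deps, licenses):
    """Return {dependency: path from root} for every Category X
    dependency reachable from root, including indirect ones."""
    flagged = {}
    seen = {root}
    queue = deque([[root]])
    while queue:
        path = queue.popleft()
        for child in deps.get(path[-1], []):
            if child in seen:
                continue
            seen.add(child)
            child_path = path + [child]
            if licenses.get(child) in CATEGORY_X:
                flagged[child] = child_path
            queue.append(child_path)
    return flagged


# Illustrative graph: edges and license strings are placeholders.
deps = {
    "my-project": ["aircompressor"],
    "aircompressor": ["slice"],
    "slice": ["jol-core"],
}
licenses = {
    "my-project": "Apache-2.0",
    "aircompressor": "Apache-2.0",
    "slice": "Apache-2.0",
    "jol-core": "GPL-2.0",
}

flagged = find_category_x("my-project", deps, licenses)
for name, path in flagged.items():
    print(name, "via", " -> ".join(path))
```

Recording the path, not just the offending name, shows which direct dependency pulled the problem in, which is the place to ask for a fix or add an exclusion.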
• Build your community
Give talks
Make the website
Make the project easy to build
Make the community friendly – look at Beam talk
• Do your work in the open
• Train your employees in Apache & open source
Hortonworks training -
Owen O’Malley

Learning Rust with Advent of Code 2023 - PrincetonLearning Rust with Advent of Code 2023 - Princeton
Learning Rust with Advent of Code 2023 - Princeton
Understanding Automated Testing Tools for Web Applications.pdf
Understanding Automated Testing Tools for Web Applications.pdfUnderstanding Automated Testing Tools for Web Applications.pdf
Understanding Automated Testing Tools for Web Applications.pdf
The two flavors of Python 3.13 - PyHEP 2024
The two flavors of Python 3.13 - PyHEP 2024The two flavors of Python 3.13 - PyHEP 2024
The two flavors of Python 3.13 - PyHEP 2024
Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywh...
Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywh...Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywh...
Alluxio Webinar | What’s new in Alluxio Enterprise AI 3.2: Leverage GPU Anywh...
UW Cert degree offer diploma
UW Cert degree offer diploma UW Cert degree offer diploma
UW Cert degree offer diploma
Waze vs. Google Maps vs. Apple Maps, Who Else.pdf
Waze vs. Google Maps vs. Apple Maps, Who Else.pdfWaze vs. Google Maps vs. Apple Maps, Who Else.pdf
Waze vs. Google Maps vs. Apple Maps, Who Else.pdf
Crowd Strike\Windows Update Issue: Overview and Current Status
Crowd Strike\Windows Update Issue: Overview and Current StatusCrowd Strike\Windows Update Issue: Overview and Current Status
Crowd Strike\Windows Update Issue: Overview and Current Status
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
iBirds Services - Comprehensive Salesforce CRM and Software Development Solut...
iBirds Services - Comprehensive Salesforce CRM and Software Development Solut...iBirds Services - Comprehensive Salesforce CRM and Software Development Solut...
iBirds Services - Comprehensive Salesforce CRM and Software Development Solut...
AI-driven Automation_ Transforming DevOps Practices.docx
AI-driven Automation_ Transforming DevOps Practices.docxAI-driven Automation_ Transforming DevOps Practices.docx
AI-driven Automation_ Transforming DevOps Practices.docx
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing ToolsOld Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
New York University degree Cert offer diploma Transcripta
New York University degree Cert offer diploma Transcripta New York University degree Cert offer diploma Transcripta
New York University degree Cert offer diploma Transcripta
Applitools Autonomous 2.0 Sneak Peek.pdf
Applitools Autonomous 2.0 Sneak Peek.pdfApplitools Autonomous 2.0 Sneak Peek.pdf
Applitools Autonomous 2.0 Sneak Peek.pdf
How to Secure Your Kubernetes Software Supply Chain at Scale
How to Secure Your Kubernetes Software Supply Chain at ScaleHow to Secure Your Kubernetes Software Supply Chain at Scale
How to Secure Your Kubernetes Software Supply Chain at Scale
CrushFTP PC Software - WhizNews
CrushFTP PC Software - WhizNewsCrushFTP PC Software - WhizNews
CrushFTP PC Software - WhizNews
Top 10 ERP Companies in UAE Banibro IT Solutions.pdf
Top 10 ERP Companies in UAE Banibro IT Solutions.pdfTop 10 ERP Companies in UAE Banibro IT Solutions.pdf
Top 10 ERP Companies in UAE Banibro IT Solutions.pdf
Fixing Git Catastrophes - Nebraska.Code()
Fixing Git Catastrophes - Nebraska.Code()Fixing Git Catastrophes - Nebraska.Code()
Fixing Git Catastrophes - Nebraska.Code()
Unlocking the Future of Artificial Intelligence
Unlocking the Future of Artificial IntelligenceUnlocking the Future of Artificial Intelligence
Unlocking the Future of Artificial Intelligence

Running An Apache Project: 10 Traps and How to Avoid Them

  • 1. RUNNING AN APACHE PROJECT: 10 TRAPS AND HOW TO AVOID THEM
    Owen O’Malley, September 2019, @owen_omalley
  • 2. WHO AM I?
    © 2019 Cloudera, Inc. All rights reserved.
    • First committer added to Hadoop
      Working at Yahoo in 2006
      Original VP when Hadoop became a TLP
    • Committer & PMC member on
      Ambari
      Hadoop
      Hive
      ORC
    • Mentor for: Giraph, Kafka, Knox, Kylin, Iceberg, Metron, Ranger, Reef, & Tez
  • 4. MISTAKE 1. STARTING ON GITHUB WITHOUT AN IP AGREEMENT
    • Starting a project as GitHub repositories is very easy
      Start developing immediately!
      Allows building community before entering Apache
    • But… getting lawyers to sign a code grant takes time
      Need legal sign-offs from each contributor’s company
      Apache requires code grants for large chunks of code
  • 5. MISTAKE 1. STARTING ON GITHUB WITHOUT AN IP AGREEMENT (CONT.)
    • When Hortonworks & Vertica developed a C++ ORC reader
      Always intended to move to Apache
      Only two companies
      Still took a couple of months to get the code grant signed
    • Get IP agreements before committing code!
  • 6. MISTAKE 2. KEEPING YOUR PROJECT SECRET
    • Open source is a vibrant ecosystem
      Projects fill niches in that ecosystem
      Creates choices for users and developers
    • Your project is competing with many others
      Apache doesn’t pick winners and losers
      Fighting for attention
  • 7. MISTAKE 2. KEEPING YOUR PROJECT SECRET (CONT.)
    • Advertise your project!
    • Make building a good project website a priority!
      Take down the old site
    • Give conference talks
      Tell users about new features
    • Write blogs
      Use cases
      Experience in production
  • 8. MISTAKE 3. NOT FOSTERING DIVERSITY
    • Apache is of two minds with respect to employers
      Apache doesn’t care who your employer is.
      Projects should encourage a diverse set of employers
    • Your karma is yours, not your employer’s
      Expected to keep your “hats” separate
    • Avoid group-think
      More voices and viewpoints are very very good
      Happy users make your project grow
  • 9. MISTAKE 3. NOT FOSTERING DIVERSITY (CONT.)
    • Don’t assume all smart people work at your company
      Innovation happens everywhere
    • Separate the company’s goals from the project’s
      Don’t shoot down proposals because
        • They compete with your proprietary products
        • They would create work for your proprietary products
    • Don’t promise features in upcoming versions
  • 10. MISTAKE 4. HOLDING FACE TO FACE DEVELOPER MEETINGS
    • Excludes remote people
    • Even video meetings are hard for different time zones
    • Holding roadmap meetings is particularly problematic
      Need to ensure full access to the community
      Bring discussion back to the email list before finalizing
    • When writing to the lists, use “I” instead of “we”
      You are presenting your opinion, not a group’s
    • Make your project website welcoming & helpful
  • 11. MISTAKE 5. INCLUDING BINARY RELEASE ARTIFACTS
    • Many projects make source and binary release artifacts
    • Binary artifacts are hard to review and get right
      Make reproducible builds
      Licensing for binary artifacts is the transitive closure
      Watch Docker file artifact versions
    • Far better to make only source release artifacts
      Can make convenience binary artifacts after release vote
      Even better is to make downstream binary artifacts
  • 13. MISTAKE 6. HOLDING A HIGH BAR FOR COMMITTER AND PMC MEMBERS
    • Some projects require a lot of patches to make committer
    • Much worse if the project has a large patch queue
      Really hard if you don’t know or work with a committer
      3.6k uncommitted patches on Hadoop’s Jiras since 2006
    • A committer shortage makes the patch queue worse
    • The most important criteria for becoming a committer should be:
      Good technical taste
      Knowing their own limits
  • 14. MISTAKE 7. IGNORING TRADEMARKS
    • Make sure that your project doesn’t use another trademark
      Often comes down to a judgement call about risks
      Changing early is much better than later
    • Ensure people don’t abuse your trademark
      Very hard if it is a user/non-project member
      Project members need to fix their company’s behavior
        • Board has removed PMC members
    • Hold training classes for engineers and marketing
  • 15. MISTAKE 8. NOT REWARDING OPEN SOURCE WORK
    • If employees work on open source projects:
    • Set and measure the engineers’ objectives to reflect this
      Code contributions
      Documentation contributions
      Code reviews (include other companies’ patches)
      Conference presentations
    • Make the managers’ goals also reflect the community time
  • 16. MISTAKE 9. STEALTH DEVELOPMENT
    • Developing off-line breaks the community
      Cuts community off from participation
      Forward motion stops & release train stalls
    • Yahoo developed Hadoop Security privately
      0.18, 0.19, 0.20 were ~3-4 months
      0.21 was 12 months, Facebook & LinkedIn forked
      0.20.203 was 24 months from 0.20
      1.0 was 8 more months
  • 17. MISTAKE 10. LICENSING PROBLEMS
    • Your project should have source with permissive licenses
      E.g. Apache, BSD, MIT
    • Can have binary dependency on weak-copyleft licenses
      E.g. Eclipse, Mozilla, Creative Commons Attribution
    • Can only use Category X in very specific cases
      E.g. GPL, LGPL, JSON, CC-BY-A
      Must be optional, a build tool, or system-provided
    • Includes recursive dependencies!
  • 18. MISTAKE 10. LICENSING PROBLEMS (CONT.)
    • Can sneak up on you
      Updated aircompressor dependency from 0.10 to 0.15
      Started using the new zstd codec
      Previously excluded slice dependency was now required
      Slice depends on jol-core, which is GPL.
    • Fortunately, the use of slice was for one method
      Worked with aircompressor to get a fix
    • Copying code from Stack Overflow has the same problem!
  • 19. SUMMARY
    • Build your community
      Give talks
      Make the website
      Make the project easy to build
      Make the community friendly – look at Beam talk
    • Do your work in the open
    • Train your employees in Apache & open source
      Hortonworks training -
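
Mistake 5 recommends reproducible builds so that binary artifacts can be reviewed. One simple way to check this, sketched here under assumptions, is to build the same source twice (ideally on different machines) and compare SHA-256 digests of the results. The function names and the idea of comparing exactly two files are illustrative choices, not something the slides prescribe.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(first_artifact, second_artifact):
    """True when two independently built artifacts are byte-identical."""
    return sha256_of(first_artifact) == sha256_of(second_artifact)
```

A reviewer could call `builds_match("build-a/project.jar", "build-b/project.jar")` (hypothetical paths) before voting on a release. A mismatch usually points at embedded timestamps, unordered archive entries, or floating dependency versions.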
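
Mistake 10 notes that license rules include recursive dependencies, and the aircompressor story shows how a Category X license can arrive several hops away. The sketch below walks a dependency graph and reports the path to any such dependency. The graph, the license identifiers, and the licenses assigned to aircompressor and slice are assumptions made for illustration; only the license groupings and the fact that jol-core is GPL come from the slides.

```python
from collections import deque

# License groupings as listed on the slides; the short identifiers are assumptions.
PERMISSIVE = {"Apache-2.0", "BSD", "MIT"}
WEAK_COPYLEFT = {"EPL", "MPL", "CC-BY"}
CATEGORY_X = {"GPL", "LGPL", "JSON"}

# Hypothetical dependency graph modeled on the aircompressor anecdote.
DEPENDENCIES = {
    "my-project": ["aircompressor"],
    "aircompressor": ["slice"],
    "slice": ["jol-core"],
    "jol-core": [],
}
LICENSES = {
    "my-project": "Apache-2.0",
    "aircompressor": "Apache-2.0",  # assumed for illustration
    "slice": "Apache-2.0",          # assumed for illustration
    "jol-core": "GPL",              # stated on the slide
}


def classify(license_id):
    """Map a license identifier to the category used on the slides."""
    if license_id in PERMISSIVE:
        return "permissive"
    if license_id in WEAK_COPYLEFT:
        return "weak-copyleft"
    if license_id in CATEGORY_X:
        return "category-x"
    return "unknown"


def find_category_x(root, dependencies, licenses):
    """Breadth-first walk from root; return (path, license) per Category X dependency."""
    problems = []
    seen = {root}
    queue = deque([[root]])
    while queue:
        path = queue.popleft()
        for dep in dependencies.get(path[-1], []):
            if dep in seen:
                continue
            seen.add(dep)
            dep_path = path + [dep]
            if classify(licenses.get(dep)) == "category-x":
                problems.append((dep_path, licenses[dep]))
            queue.append(dep_path)
    return problems


if __name__ == "__main__":
    for path, license_id in find_category_x("my-project", DEPENDENCIES, LICENSES):
        print(" -> ".join(path), "is", license_id)
```

Reporting the full path rather than just the offending name shows which direct dependency to exclude or fix. In practice the graph and license data would come from the build tool's dependency report rather than hand-written dictionaries.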