Monsanto uses geospatial data and analytics to improve sustainable agriculture. They process vast amounts of spatial data on Hadoop to generate prescription maps that optimize seeding rates. Their previous SQL-based system could only handle a small fraction of the data and took over 30 days to process. Monsanto's new Hadoop/HBase architecture loads the entire US dataset in 18 hours, representing significant cost savings over the SQL approach. This foundational system provides agronomic insights to farmers and supports Monsanto's vision of doubling yields by 2030 through information-driven farming.
Hive was initially developed by Facebook to manage large amounts of data stored in HDFS. It uses a SQL-like query language called HiveQL to analyze structured and semi-structured data. Hive compiles HiveQL queries into MapReduce jobs that are executed on a Hadoop cluster. It provides mechanisms for partitioning, bucketing, and sorting data to optimize query performance.
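As a rough illustration of the HiveQL-plus-partitioning workflow described above, here is a hedged sketch using the PyHive client (the summary does not name a client library; the host, table, and column names below are assumptions) to query a date-partitioned table so that only one partition is scanned:

```python
# Minimal sketch, assuming a HiveServer2 endpoint and the PyHive client library.
# The database, table, and column names are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, database="default")
cursor = conn.cursor()

# Partition pruning: filtering on the partition column (event_date) lets Hive
# read only the matching partition directories instead of the whole table.
cursor.execute(
    """
    SELECT user_id, COUNT(*) AS events
    FROM web_logs
    WHERE event_date = '2015-01-01'
    GROUP BY user_id
    """
)
for user_id, events in cursor.fetchall():
    print(user_id, events)

cursor.close()
conn.close()
```

Because the filter is on the partition column, Hive can skip every other partition directory before compiling the query into the underlying MapReduce job.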
The document discusses enabling diverse workload scheduling in YARN. It covers several topics including node labeling, resource preemption, reservation systems, pluggable scheduler behavior, and Docker container support in YARN. The presenters are Wangda Tan and Craig Welch from Hortonworks, who have experience with big data systems such as Hadoop, YARN, and OpenMPI. They aim to discuss how these features help different types of workloads, such as batch, interactive, and real-time jobs, run together smoothly in YARN.
The document discusses security features in Hortonworks Data Platform (HDP) and Pivotal HD. It covers authentication with Kerberos, authorization and auditing using Apache Ranger, perimeter security with Apache Knox, and data encryption at rest and in transit. Various security flows are illustrated including typical access to Hive through Beeline and adding authorization, firewall routing, and encryption. Installation and configuration of Ranger and Knox are also outlined.
This presentation breaks down Aerospike's key-value data access. It covers structured vs. unstructured data, database hierarchy and definitions, as well as data patterns.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses problems posed by large and complex datasets that cannot be processed by traditional systems. Hadoop uses HDFS for storage and MapReduce for distributed processing of data in parallel. Hadoop clusters can scale to thousands of nodes and petabytes of data, providing low-cost and fault-tolerant solutions for big data problems faced by internet companies and other large organizations.
This document provides an overview of big data and Hadoop. It discusses why Hadoop is useful for extremely large datasets that are difficult to manage in relational databases. It then summarizes what Hadoop is, including its core components like HDFS, MapReduce, HBase, Pig, Hive, Chukwa, and ZooKeeper. The document also outlines Hadoop's design principles and provides examples of how some of its components like MapReduce and Hive work.
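Since the summary mentions examples of how MapReduce works, here is a generic Hadoop Streaming word-count sketch in Python (an illustration of the programming model only, not code from the slides; the jar name and paths are placeholders):

```python
#!/usr/bin/env python
# wordcount.py - generic Hadoop Streaming word-count sketch (illustrative only).
# Example invocation (jar name and paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#       -input /data/in -output /data/out \
#       -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
import sys

def run_mapper():
    # Emit one (word, 1) pair per token; the framework sorts pairs by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def run_reducer():
    # Input arrives grouped by key; sum the counts for each distinct word.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    run_mapper() if sys.argv[1:] == ["map"] else run_reducer()
```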
These are slides from a lecture given at the UC Berkeley School of Information for the Analyzing Big Data with Twitter class. A video of the talk can be found at http://blogs.ischool.berkeley.edu/i290-abdt-s12/2012/08/31/video-lecture-posted-intro-to-hadoop/
Want to get ramped up on how to use Amazon's big data web services and launch your first big data application on AWS? Join us on our journey as we build a big data application in real-time using Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. We review architecture design patterns for big data solutions on AWS, and give you access to a take-home lab so that you can rebuild and customize the application yourself.
This document provides a summary of Sivareddy's profile as a SAP HANA consultant. He has over 3 years of experience working with HANA, including modeling row and column store tables, creating views, hierarchies and calculations. He also has experience with SAP BOBI reporting tools and extracting data from other databases into HANA using methods like SLT, DS and SDA. The profile highlights 4 projects where he implemented HANA and created reports, dashboards and analyses for clients in various industries.
Coursera Machine Learning (by Andrew Ng) - Lecture Notes (SANG WON PARK)
Rather than explaining with formulas alone, the course uses actual code and sample data to show in detail how the results of each equation are applied.
Weeks 1 through 4 covered material I had already understood from Professor Sung Kim's "Deep Learning for Everyone," so they went fairly easily; for the remaining weeks my fundamentals were lacking, so I had to study while consulting quite a few other references.
I had tried to learn machine learning from various books and courses, but none suited me as well as this one. The hands-on material using Octave code, in particular, should remain useful at any time later on. (A small Python sketch of the weeks 1-2 gradient-descent exercise follows the week-by-week outline below.)
Week1
Linear Regression with One Variable
Linear Algebra - review
Week2
Linear Regression with Multiple Variables
Octave [incomplete]
Week3
Logistic Regression
Regularization
Week4
Neural Networks - Representation
Week5
Neural Networks - Learning
Week6
Advice for applying machine learning techniques
Machine Learning System Design
Week7
Support Vector Machines
Week8
Unsupervised Learning (Clustering)
Dimensionality Reduction
Week9
Anomaly Detection
Recommender Systems
Week10
Large Scale Machine Learning
Week11
Application Example - Photo OCR
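Below is a rough Python/numpy analogue of the weeks 1-2 exercises (an illustrative sketch of batch gradient descent for univariate linear regression, not the course's own Octave code):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x (weeks 1-2 material)."""
    m = len(y)
    X = np.column_stack([np.ones(m), x])    # add the bias column
    theta = np.zeros(2)
    for _ in range(iters):
        error = X @ theta - y               # h(x) - y for every training example
        theta -= alpha / m * (X.T @ error)  # simultaneous update of theta0, theta1
    return theta

# Tiny synthetic example: y roughly equals 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
print(gradient_descent(x, y))  # approximately [1.1, 2.0]
```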
Here we describe a federated learning based traffic flow prediction system. Federated learning addresses data security while also enabling collaborative learning: model parameters are shared, not the data itself.
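As a loose sketch of the "share model parameters, not data" idea, here is a minimal federated-averaging loop in numpy (my own simplification with a plain linear model, not the described system's code):

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    # Each client trains on its private traffic data; only the weights leave the client.
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, clients):
    # The server aggregates client weights, weighted by local sample counts (FedAvg).
    updates = [(local_update(global_w, X, y), len(y)) for X, y in clients]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

rng = np.random.default_rng(0)
true_w = np.array([0.5, -0.2, 1.0])
clients = []
for _ in range(3):  # three "road sensors", each holding private data
    X = rng.normal(size=(50, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

w = np.zeros(3)
for _ in range(20):
    w = federated_round(w, clients)
print(w)  # approaches true_w without any raw data being pooled
```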
Snowflake concepts and hands-on expertise to help get you started on implementing data warehouses using Snowflake, along with the information and skills that will help you master Snowflake essentials.
Manufacturers have an abundance of data, whether from connected sensors, plant systems, manufacturing systems, claims systems, or external data from industry and government. They face increasing challenges, from continually improving product quality and reducing warranty and recall costs to efficiently leveraging their supply chain. For example, giving the manufacturer a complete view of product and customer information means integrating manufacturing and plant-floor data with as-built product configurations and sensor data from customer use, so that warranty claim information can be analyzed efficiently to reduce detection-to-correction time, detect fraud, and even get ahead of issues; this requires a capable enterprise data hub that integrates large volumes of both structured and unstructured information. Learn how an enterprise data hub built on Hadoop provides the tools to support analysis at every level of the manufacturing organization.
High Performance Data Lake with Apache Hudi and Alluxio at T3Go (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
Trevor Zhang & Vino Yang (T3Go)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
This document provides a summary of Lecture 10 on Bayesian decision theory and Naive Bayes machine learning algorithms. It begins with a recap of Lecture 9 on using probability to classify patterns into categories. It then discusses how to apply these probabilistic concepts to both nominal and continuous variables. A medical example is presented to illustrate Bayesian classification. The document concludes by explaining the Naive Bayes algorithm for classification tasks and providing a worked example of how it is trained and makes predictions.
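As a generic illustration of the train-then-predict flow summarized above (not the lecture's medical example), a Gaussian Naive Bayes classifier on continuous features could be written with scikit-learn as:

```python
# Hedged sketch: Gaussian Naive Bayes on a toy continuous-feature dataset,
# mirroring the train-then-predict flow described in the lecture summary.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB()            # assumes each feature is Gaussian within each class
model.fit(X_train, y_train)     # training: estimate per-class means, variances, and priors

pred = model.predict(X_test)    # prediction: pick the class with the highest posterior
print("accuracy:", accuracy_score(y_test, pred))
print("class priors:", model.class_prior_)
```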
Introduction to Big Data & Hadoop Architecture - Module 1 (Rohit Agrawal)
Learning Objectives - In this module, you will understand what Big Data is, the limitations of existing solutions for the Big Data problem, how Hadoop solves the Big Data problem, the common Hadoop ecosystem components, Hadoop architecture, HDFS and the MapReduce framework, and the anatomy of file writes and reads.
This document discusses Apache Ranger, an open source framework for centralized security administration across Hadoop ecosystems. It provides a presentation on securing Hadoop with Ranger, including an overview of current Hadoop security, how Ranger addresses this with centralized policy management and plugins for Hadoop components like HDFS, Hive and HBase. The document outlines Ranger's architecture and components like the policy administration server, user sync server and plugins, demonstrating how Ranger implements authorization for different Hadoop tools and integrates with their native permissions systems.
Talks about best practices and patterns on how to design an efficient cube in Kylin. Covers concepts like mandatory dimension, hierarchy dimension, derived dimension, incremental build, aggregation group etc.
HGrid: A Data Model for Large Geospatial Data Sets in HBase (Dan Han)
This document summarizes research on geospatial data modeling and query performance in HBase. It describes two data models that were tested: a regular grid index and a quadtree-based index. For the grid index, objects are stored under row and column keys derived from grid cells. For the quadtree, objects are stored under Z-value row keys together with object IDs. The document analyzes the trade-offs of each approach and presents experiments comparing their query performance. It concludes with lessons learned on data organization and query processing, and directions for future work.
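A hedged sketch of the two row-key styles mentioned (my own illustration, not the paper's code; the cell size, bit depth, and key layout are arbitrary):

```python
# Illustrative only: a regular-grid row key and a Z-order (Morton) row key for HBase.
def grid_key(lat, lon, cell_deg=0.1):
    # Regular grid: the row key encodes the cell's row/column indices.
    row = int((lat + 90) / cell_deg)
    col = int((lon + 180) / cell_deg)
    return f"{row:05d}-{col:05d}"

def z_order_key(lat, lon, bits=16):
    # Quadtree-style key: interleave the bits of the two normalized coordinates,
    # so nearby points share long key prefixes (one HBase row per Z-value).
    x = int((lon + 180) / 360 * ((1 << bits) - 1))
    y = int((lat + 90) / 180 * ((1 << bits) - 1))
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return f"{z:0{2 * bits // 4}x}"

print(grid_key(43.65, -79.38))     # e.g. '01336-01006'
print(z_order_key(43.65, -79.38))  # hex Z-value used as the row-key prefix
```

A prefix scan over such keys then doubles as a coarse spatial filter, since all keys sharing a prefix fall inside the same grid cell or quadtree tile.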
GeoJinni is a spatial data management system built on Hadoop that provides built-in support for spatial data types, indexes, and operations. It includes a high-level spatial query language called Pigeon and uses spatial indexes like grid files and R-trees to efficiently process spatial queries and operations on large datasets distributed across clusters. GeoJinni allows users to analyze their spatial data efficiently by loading datasets into its system and expressing queries in a simple language.
Sept 17 2013 - THUG - HBase: a Technical Introduction (Adam Muise)
HBase Technical Introduction. This deck includes a description of memory design, write path, read path, some operational tidbits, SQL on HBase (Phoenix and Hive), as well as HOYA (HBase on YARN).
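For readers new to HBase's client-facing data model, a minimal put/get/scan sketch using the Python happybase library (an assumption on my part; the deck covers server internals rather than this client, and the host, table, and column names are hypothetical) shows the operations that the write and read paths serve:

```python
# Minimal sketch, assuming an HBase Thrift gateway and the happybase client library.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("sensor_readings")

# Write path: a put is appended to the WAL and buffered in the MemStore.
table.put(b"device42#20150101", {b"d:temp": b"21.5", b"d:humidity": b"0.43"})

# Read path: a get merges MemStore and HFile contents for that row.
print(table.row(b"device42#20150101"))

# Scans stream rows in key order, here restricted by a row-key prefix.
for key, data in table.scan(row_prefix=b"device42#"):
    print(key, data)

connection.close()
```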
This document discusses using HBase for geo-based content processing at NAVTEQ. It outlines the problems with their previous system, such as ineffective scaling and high Oracle licensing costs. Their solution was to implement HBase on Hadoop for horizontal scalability and flexible rules-based processing. Some challenges included unstable early versions of HBase, database design issues, and interfacing batch systems with real-time systems. Cloudera support helped address many technical issues and provide best practices for operating Hadoop and HBase at scale.
Computation of spatial data on Hadoop Cluster (Abhishek Sagar)
This document provides an overview of distributed computation on spatial data using Hadoop. It discusses GIS background, including common spatial data types and operations. It then introduces Hadoop and MapReduce, describing the HDFS architecture, the MapReduce programming model, and how Hadoop compares to distributed databases. The document outlines experiments on Hadoop, including speedup from spatial joins and their optimization, building R-tree indexes, nearest-neighbor queries, and graph computations. It concludes by discussing results and outlining future work.
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned (DataWorks Summit)
Scientific data services are a critical aspect of the NASA Center for Climate Simulation (NCCS) mission. Hadoop, via MapReduce, provides an approach to high-performance analytics that is proving useful for data-intensive problems in climate research. It offers an analysis paradigm that uses clusters of computers and combines distributed storage of large data sets with parallel computation. The NCCS is particularly interested in the potential of Hadoop to speed up basic operations common to a wide range of analyses. In order to evaluate this potential, we prototyped a series of canonical MapReduce operations over a test suite of observational and climate simulation datasets. The initial focus was on averaging operations over arbitrary spatial and temporal extents within Modern-Era Retrospective Analysis for Research and Applications (MERRA) data. After preliminary results suggested that this approach improves efficiencies within data-intensive analytic workflows, we invested in building a cyberinfrastructure resource for developing a new generation of climate data analysis capabilities using Hadoop. This resource is focused on reducing the time spent in the preparation of reanalysis data used in data-model intercomparison, a long-sought goal of the climate community. This paper summarizes the related use cases and lessons learned.
Here's the second version of our big data landscape. Thoughts, questions, comments? We'd love to hear your feedback in the comments section here: http://wp.me/p2dLS7-6A
Evolution of Big Data at Intel - Crawl, Walk and Run Approach (DataWorks Summit)
Intel's big data journey began in 2011 with an evaluation of Hadoop. Since then, Intel has expanded its use of Hadoop and Cloudera across multiple environments. Intel's 3-year roadmap focuses on evolving its Hadoop platform to support more advanced analytics, real-time capabilities, and integrating with traditional BI tools. Key strategies include designing for scalability, following an iterative approach to understand data, and leveraging open source technologies.
Pointerest is a web application that allows users to find points of interest (POIs) for any city. It uses a Python crawler to gather POI data from the web which is stored in a PostgreSQL database. The backend is built with PHP and locations are visualized on interactive maps using the LeafletJS framework. Key features include a multithreaded crawler, stored search results, highlighted motorways and night view in maps. The user interface allows users to search for a city, view results on a map and list, and access about and contact pages. It provides an open source tool for users to easily discover popular locations when visiting or moving to a new city.
Monsanto Automates R&D Decisions with Big Data (Cloudera, Inc.)
Monsanto uses Cloudera's Enterprise Data Hub to automate data-driven research and development decisions. This has reduced Monsanto's time to market for new products from 5-10 years down to months. The system provides scientists a single view of all R&D data, improving collaboration. It also automates data-driven decisions that previously slowed development.
Presentation for Sydney Open Source Developers Conference 2008 covering the range of open source geospatial projects available to the modern programmer!
Bringing Geospatial Business Intelligence to the Enterprise (mkarren)
KOREM provides geospatial business intelligence (location intelligence) solutions to help organizations understand how location impacts business operations. They offer consulting, data management, software integration, and training services. Their solutions integrate spatial data and analysis with existing business intelligence tools to provide strategic, operational, and analytic insights. KOREM works across industries with both public and private sector customers to develop customized geospatial business intelligence applications.
The document discusses two types of data scientists: those in "the lab" who focus on question-driven and interactive analytics on fixed data, and those in "the factory" who focus on metric-driven and automated analytics on fluid data. It then describes tools for each type, including Apache Spark and related tools for data science in the factory, and Cloudera Impala and related tools for investigative analytics in the lab. The speaker concludes by thanking the audience.
Promoting Geospatial Education in Europe (Karl Donert)
Karl Donert presented on promoting geospatial education in Europe. EUROGEO aims to advance geography through events, publications, and lobbying. Its initiatives include the Digital Earth platform, geospatial education tools, and training programs. There remains a need to establish common geospatial qualifications, support education projects, and create engaging education to address the mismatch between workforce needs and skills. Recommendations include prioritizing education, establishing a think tank of industry and education leaders, and raising awareness of geospatial careers.
MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware... (nishimurashoji)
MD-HBase is a scalable multi-dimensional data infrastructure that uses a multi-dimensional index built on top of an ordered key-value store like HBase. It represents subspaces in the multi-dimensional index using the longest common prefix of keys to preserve boundary information. This allows for efficient multi-dimensional range and k nearest neighbor queries by pruning subspaces without scanning the entire key space. Evaluation shows MD-HBase performs range queries 10-100 times faster than other technologies and k nearest neighbor queries in 1.5 seconds for k <= 100. It also scales linearly during data insertion without significant overhead.
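A rough simplification of the longest-common-prefix idea (not MD-HBase code): when row keys are Z-values, the common prefix of a query box's corner keys names the smallest enclosing subspace, so a range scan can be restricted to keys under that prefix.

```python
# Simplified illustration: subspaces as Z-value key prefixes (not MD-HBase code).
def interleave(x, y, bits=8):
    # Morton/Z-order interleaving of two small integers into a binary key string.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return format(z, f"0{2 * bits}b")

def common_prefix(keys):
    # The longest common prefix of the extreme keys names the smallest enclosing subspace.
    first, last = min(keys), max(keys)
    i = 0
    while i < len(first) and first[i] == last[i]:
        i += 1
    return first[:i]

# Query rectangle: x in [12, 15], y in [8, 11] on an 8-bit grid.
corners = [interleave(x, y) for x in (12, 15) for y in (8, 11)]
prefix = common_prefix(corners)
print("scan only row keys starting with:", prefix or "<root>")
```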
NUS-ISS PCP for FullStack Software Developers (NUS-ISS)
Presented by Ms Gloria Ng, Chief, Startups & SMEs Practice, NUS-ISS at NUS-ISS Briefing Session for Employers on Professional Conversion Programme on 9 Dec 2016.
A Producer’s Perspective: Agriculture and Nitrogen Deposition in Rocky Mounta... (LPE Learning Center)
Proceedings Available at: http://www.extension.org/67641
The efforts related to Colorado's Rocky Mountain National Park are voluntary, yet there are nitrogen reduction targets, or milestones, established over five year increments out to the year 2032. If a milestone is not met, mandatory controls could follow. How can the proactive emissions reduction efforts being taken by livestock and crop producers today be recognized or credited should mandatory controls be required at some future date? For example, could an agriculture certainty framework (used more for water quality protection/nutrient runoff) be used to validate actions being taken today for air quality purposes? How might an ag certainty program work and what partners should be at the table? Are there other approaches that states are using or researching that Colorado should consider?
Hadoop World 2011: Advanced HBase Schema Design (Cloudera, Inc.)
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
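One common pattern for the write-heavy case described above is key salting, sketched below in Python (a generic illustration, not necessarily a schema from the talk; the bucket count and key layout are arbitrary):

```python
# Generic salted-row-key sketch for write-heavy HBase tables (illustrative only).
import hashlib

N_BUCKETS = 16  # roughly match the number of regions

def salted_key(entity_id: str, timestamp_ms: int) -> bytes:
    # Hashing the entity into a salt prefix spreads writes across regions even
    # when many entities write at once with ever-increasing timestamps.
    salt = int(hashlib.md5(entity_id.encode()).hexdigest(), 16) % N_BUCKETS
    # Reversing the timestamp keeps an entity's newest cell first in scan order.
    reverse_ts = 10**13 - timestamp_ms
    return f"{salt:02d}|{entity_id}|{reverse_ts:013d}".encode()

def entity_scan_prefix(entity_id: str) -> bytes:
    # Reads for one entity recompute the same salt, so they remain a single prefix scan.
    salt = int(hashlib.md5(entity_id.encode()).hexdigest(), 16) % N_BUCKETS
    return f"{salt:02d}|{entity_id}|".encode()

print(salted_key("meter-8421", 1420070400000))
print(entity_scan_prefix("meter-8421"))
```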
1. Teledyne CARIS is developing new data-centric workflows and products to improve efficiency in processing large, multi-sensor survey datasets. This includes increased automation, variable resolution surfaces that integrate data of different densities, and support for open standards like S-100.
2. Organizations are moving to data-centric approaches to more quickly process bigger survey data, better utilize resources, and transition to being marine data providers. This supports producing new customized products and services for broader user bases.
3. CARIS is working on automating processing and creating variable resolution surfaces from multiple sensor data to help organizations efficiently manage higher survey volumes and backlogs from autonomous and crowd-sourced sources.
This document provides an overview of big data concepts and technologies. It discusses the growth of data, characteristics of big data including volume, variety and velocity. Popular big data technologies like Hadoop, MapReduce, HDFS, Pig and Hive are explained. NoSQL databases like Cassandra, HBase and MongoDB are introduced. The document also covers massively parallel processing databases and column-oriented databases like Vertica. Overall, the document aims to give the reader a high-level understanding of the big data landscape and popular associated technologies.
MapReduce Best Practices and Lessons Learned Applied to Enterprise Datasets -... (StampedeCon)
This document summarizes Monsanto's experiences using Hadoop for big data analytics in agriculture. Hadoop allows Monsanto to store and analyze large volumes of genomic, yield, and sensor data to increase crop yields. Lessons learned include starting small with a focus on business problems, using Hadoop for ETL pipelines and long-term storage, and addressing security, backup, and data management challenges as Hadoop and its ecosystem continue to evolve.
This document provides an overview of big data, including:
- Defining big data as large datasets that can reveal patterns when analyzed computationally.
- Describing the 3 Vs of big data - volume, velocity, and variety. It discusses how big data comes from many sources and is characterized by its large size and fast generation.
- Introducing Hadoop as an open-source software framework for distributed storage and processing of big data across clusters of commodity servers. Key Hadoop components HDFS and MapReduce are outlined.
Introduction to DDS: Context, Information Model, Security, and Applications (Gerardo Pardo-Castellote)
Introduction to the Data-Distribution Service (DDS): Context and Applications.
This 50 minute presentation summarizes the main features of DDS including the information model, the type system, and security as well as how typical applications use DDS.
It was presented at the Canadian Government Information Day in Ottawa on September 2018.
There is also a video of this presentation at https://www.youtube.com/watch?v=6iICap5G7rw.
This document discusses big data management for OSS/BSS applications. It defines big data and describes the Hadoop framework for distributed processing of large, complex data sets. The document outlines using a big data solution with Hadoop to provide data warehousing, reporting, and revenue assurance across usage, provisioning, billing, and network data for telecom applications. Key benefits include a scalable, low-cost solution for insights, monitoring, and reconciling various systems and records.
Big data analytics and machine intelligence v5.0 (Amr Kamel Deklel)
Why big data
What is big data
When big data is big data
Big data information system layers
Hadoop ecosystem
What is machine learning
Why machine learning with big data
This document discusses big data and provides information on related topics like Hadoop. It defines big data as extremely large and complex data that cannot be processed by traditional data management tools. It notes that machine-generated data is a major source of big data, and that there are three types of data sources: machine data, organizational data, and data from people. Hadoop is an open-source framework for distributed storage and processing of big data using the HDFS file system and MapReduce programming model. Hadoop has been expanded with projects like YARN and HBase. Organizations can gain benefits from big data such as more efficient operations, higher sales, and improved safety and customer satisfaction.
Big ideas for using data by Brett Whelan, University of Sydney (Amanda Woods)
Brett Whelan presents on using data in precision agriculture. The development of precision agriculture has increased the volume and sources of data available. Using data can optimize production efficiency, quality, minimize business risk and environmental impact through data-driven decisions. Key components of using data include data generation, storage in data dormitories such as the cloud, and prescriptive agriculture using probabilistic models. Real-time adaptable decisions will involve integrating diverse data sources to improve sub-paddock management while optimizing whole business profitability and sustainability.
The document discusses the Stinger Initiative from Hortonworks to improve the performance and capabilities of interactive queries in Hive. The initiative takes a two-pronged approach, focusing on improvements to the query engine and the introduction of a new optimized column store file format called ORCFile. A new Tez execution engine is also introduced to avoid bottlenecks in MapReduce and enable lower latency queries. The goal is to extend Hive's ability to handle interactive queries with response times measured in seconds rather than minutes.
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million (DataWorks Summit)
A Fortune 100 company recently introduced Hadoop into their data warehouse environment and ETL workflow to save $30 Million. This session examines the specific use case to illustrate the design considerations, as well as the economics behind ETL offload with Hadoop. Additional information about how the Hadoop platform was leveraged to support extended analytics will also be referenced.
Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars (Joel Saltz)
This document discusses challenges related to exascale computing, including analyzing large spatial-temporal datasets from various sensors and simulations. It proposes a "sensor data mini-app" to address common patterns in integrating and analyzing correlated data from multiple sources. Key transformations like segmentation, feature extraction, and classification are discussed. Supporting exascale applications will require efficient mapping of data and computation, hierarchical task scheduling, and interoperability across systems. Representing and querying complex scientific data models across storage hierarchies is also addressed.
The document discusses spatiotemporal data management challenges faced by organizations like DEFRA and IBM's architectural approach. Specifically, it addresses the need for consolidated, high quality spatiotemporal data access across stakeholders. It also examines technical constraints encountered in large projects and potential next steps for integrated spatiotemporal enterprises.
Experian Marketing Services processes large amounts of customer data and needed a scalable solution to store and query the data. They developed a custom ETL solution using HBase for storage and an extractor to optimize queries. The solution ingests data from multiple sources into HBase tables in near real-time, generates aggregate tables, and allows queries to retrieve results from optimized tables based on metadata. Testing showed the solution could ingest over 1 million records per minute into HBase and return most queries within seconds.
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second... (Cloudera, Inc.)
Explorys has been using HBase and Hadoop since HBase 0.20, and will walk through lessons learned over years of usage from their first HBase implementation through a series of upgrades and changes, including impacts to schema design, data loading, data indexing, data access and analytics, and operational processes.
Real Time Business Platform by Ivan Novick from Pivotal (VMware Tanzu Korea)
This document discusses Pivotal's real time business platform for maximizing the value of data investments. It recommends identifying business problems with high ROI potential, then focusing data solutions on high-speed ingestion, consolidation, real-time queries, and analytics to drive real-time insights. The platform combines Gemfire for fast transactions with Greenplum for analytics. Use cases discussed include predictive maintenance, fraud detection, and recommendation engines. The platform provides a complete solution from data capture and analytics to application integration.
The document discusses IBM Spectrum Scale, a software-defined storage solution from IBM. It provides:
1) A family of software-defined storage products including IBM Spectrum Control, IBM Spectrum Protect, IBM Spectrum Archive, IBM Spectrum Virtualize, IBM Spectrum Accelerate, and IBM Spectrum Scale.
2) IBM Spectrum Scale allows storing data everywhere and running applications anywhere. It provides highly scalable, high-performance storage for files, objects, and analytics workloads.
3) The document provides an overview of the IBM Spectrum Scale product and its capabilities for optimizing storage costs, improving data protection, enabling global collaboration, and ensuring data availability, integrity and security.
ANDRITZ, a global manufacturing company formed by acquisitions, with over 50 offices and a virtual IT department, decided that a cloud-first strategy for server backups was the only solution for a disparate and dispersed environment. Brian Bagwell, IT Director of North America and Trey Brown, IT Manager, discussed the company’s challenges to gain more visibility of their data and a cloud-based disaster recovery solution.
With Druva, they discuss:
* Managing complexities of multi-site server recovery requirements being maintained by a virtual IT staff
* Best practices for server backup and data retention with centralized control
* Immediate benefits realized by ANDRITZ such as server restores in seconds, data privacy, and cost savings
To hear the recording, please visit: http://pages2.druva.com/Rethink-Server-Backup-and-Regain-Control-On-Demand.html?utm_source=Social&utm_medium=slideshare
Girish Juneja - Intel Big Data & Cloud Summit 2013 (IntelAPAC)
This document discusses big data trends such as the growth of networked sensors, connected devices, and smartphone users. It then summarizes Intel's investments in big data technologies, including their software, processors, networking, storage and memory products. The document promotes Intel's Distribution for Apache Hadoop software and how it provides security, performance optimizations and support for workloads like data mining, graph analytics and full text search. Real-world customer examples are provided that demonstrate gains in performance, cost savings and new analytics capabilities.
Similar to Building a geospatial processing pipeline using Hadoop and HBase and how Monsanto is using it to help farmers increase their yield
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques followed by light introduction to DL and short discussion what is current state-of-the-art. Several python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with Python's scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and will walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
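In the spirit of the scikit-learn labs mentioned above (a generic sketch, not the workshop's actual notebook), a minimal train-and-evaluate loop looks like this:

```python
# Minimal train/evaluate sketch with scikit-learn (generic; not the workshop's lab code).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=2000)  # a simple supervised baseline
model.fit(X_train, y_train)                # train on the held-in split

print(classification_report(y_test, model.predict(X_test)))  # evaluate on held-out data
```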
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS predominates is the specific durability requirement of HBase's write-ahead log (WAL), which HDFS is known to guarantee correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi, we read various open-data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time-series data sources. We can immediately query our data with Apache Zeppelin against Phoenix tables, as well as through Hive external tables mapped to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
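Besides the Spring Boot example, a thin Python client can query the same Phoenix tables through the Phoenix Query Server; the sketch below assumes the phoenixdb driver and a hypothetical endpoint, table, and columns, not anything specified in the talk.

```python
# Hedged sketch: querying a Phoenix table over the Phoenix Query Server with phoenixdb.
# The endpoint, table, and column names are assumptions.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver.example.com:8765/", autocommit=True)
cursor = conn.cursor()

# Phoenix exposes the HBase-backed table through standard SQL.
cursor.execute(
    "SELECT dc_dist, text_general_code, dispatch_date "
    "FROM philly_crime WHERE dispatch_date = ? LIMIT 10",
    ["2019-05-01"],
)
for row in cursor.fetchall():
    print(row)

conn.close()
```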
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be trivial to design applications that make the most of it, nor the simplest system to operate. Because it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and because its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions currently in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber (DataWorks Summit)
Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix (DataWorks Summit)
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi (DataWorks Summit)
This document discusses using Apache NiFi to build a high-speed cyber security data pipeline. It outlines the challenges of ingesting, transforming, and routing large volumes of security data from various sources to stakeholders like security operations centers, data scientists, and executives. It proposes using NiFi as a centralized data gateway to ingest data from multiple sources using a single entry point, transform the data according to destination needs, and reliably deliver the data while avoiding issues like network traffic and data duplication. The document provides an example NiFi flow and discusses metrics from processing over 20 billion events through 100+ production flows and 1000+ transformations.
Supporting Apache HBase: Troubleshooting and Supportability Improvements (DataWorks Summit)
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine (DataWorks Summit)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... (DataWorks Summit)
Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.
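The "few lines of code" pattern described above looks roughly like the following with the MLflow tracking API (a generic sketch; the model, parameters, and metric are placeholders, not taken from the talk):

```python
# Generic MLflow tracking sketch (placeholder model and values).
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

with mlflow.start_run():  # everything below is logged to this run automatically
    params = {"n_estimators": 100, "max_depth": 4}
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    mlflow.log_params(params)                                # parameters
    mlflow.log_metric("accuracy",
                      accuracy_score(y_test, model.predict(X_test)))  # metrics
    mlflow.sklearn.log_model(model, "model")                 # deployable model artifact
```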
Extending Twitter's Data Platform to Google Cloud (DataWorks Summit)
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery, and management, along with various tools and libraries to help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process, the challenges we faced supporting data analytics at Twitter scale on the cloud, and our current solution. Extending Twitter's data platform to the cloud was a complex task, which we examine in depth in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi (DataWorks Summit)
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. One challenge companies face is securing data across hybrid environments while having an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both on-premises and in cloud environments. We will go into the details of the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and to de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep dive into Ranger's integration with AWS S3, AWS Redshift, and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... (DataWorks Summit)
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing the possible ways a retail store of the near future could operate: identifying various storefront situations by attaching a deep learning system to a camera stream, such as item stock on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to the entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark (DataWorks Summit)
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
This PDF delves into the aspects of information security from a forensic perspective, focusing on privacy leaks. It provides insights into the methods and tools used in forensic investigations to uncover and mitigate privacy breaches in mobile and cloud environments.
"Making .NET Application Even Faster", Sergey Teplyakov.pptxFwdays
In this talk we're going to explore the performance improvement lifecycle, starting with setting performance goals, using profilers to figure out the bottlenecks, making a fix, and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their applications fast and understanding how things work under the hood.
Keynote: Presentation on SASE Technology (Priyanka Aash)
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an... (Zilliz)
Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack effective curation strategies and tools. This talk dives into these data challenges and explores the methods of automating data curation.
UiPath Community Day Amsterdam: Code, Collaborate, Connect (UiPathCommunity)
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
Keynote: AI & Future Of Offensive Security – Priyanka Aash
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
Discovery Series - Zero to Hero - Task Mining Session 1 – DianaGray10
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
It's your unstructured data: How to get your GenAI app to production (and spe... – Zilliz
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while safeguarding data privacy and PII. Our Service for RAG helps AI developers and data scientists scale their app to thousands or millions of users using corporate unstructured data. Aparavi's AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do: building a great AI application with useful and relevant context. All within your environment, and without ever having to share private corporate data with anyone, not even Aparavi.
Increase Quality with User Access Policies - July 2024 – Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptxFwdays
I will share my personal experience of full-time development on wasm Blazor:
The difficulties our team faced: life hacks with Blazor app routing, whether it is necessary to write JavaScript, and which technology stack and architectural patterns we chose
The conclusions we reached and the mistakes we made
The Challenge of Interpretability in Generative AI Models.pdf – Sara Kroft
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
Top 12 AI Technology Trends For 2024.pdf – Marrie Morris
Technology has become an irreplaceable component of our daily lives. The role of AI in technology revolutionizes our lives for the betterment of the future. In this article, we will learn about the top 12 AI technology trends for 2024.
Building a geospatial processing pipeline using Hadoop and HBase and how Monsanto is using it to help farmers increase their yield
1. Geospatial Processing @ Monsanto
Hadoop Summit 2013
Robert Grailer, Big Data Engineer
Erich Hochmuth, Data & Analytics Architecture Lead
2. Our Vision: Sustainable Agriculture
A Strong Vision That Guides All We Do
• Producing More – We are committed to increasing yields to meet the growing demand for food, fiber & fuel
• Conserving More – We are committed to reducing the amount of land, water and energy needed to grow our crops
• Improving Lives – We are committed to improving lives around the world
3. Doubling Yields by 2030 – Farming in the Future Will Be Increasingly Information-Driven
(Slide graphic callouts: advanced equipment; average corn yield – 300 bu/ac; automated weather stations; field sensors providing information; advanced imagery technology)
4. Integrated Farming Systems – FieldScripts℠ for 2014
(Slide graphic: Planting Prescription 2012, DKC63-84 Brand – target seeding rate in ksds/ac by area: 38.00 (24.75 ac), 37.00 (22.63 ac), 35.00 (16.60 ac), 34.00 (8.23 ac), 33.00 (6.00 ac), 32.00 (2.82 ac))
• FieldScripts℠ will deliver, by field, a corn hybrid recommendation utilizing variable rate seeding by FieldScripts management zones to increase yield potential and reduce risk
• The science of FieldScripts is based on proprietary algorithms that combine data from the FieldScripts Testing Network and Monsanto-generated hybrid response to plant population research
Precision Planting
5. 2012 Field Trials Indicate 5-10 bu/ac Average Yield Gain
Treatment comparisons, yield in bu/ac:
• IL Irrigated, Back 80 – Static|34000: 196; FieldScripts (35000): 233
• Central IL Dry Land, 47-50 – Static|34000: 139; FieldScripts (33000): 145
• MS Irrigated, 21 – Static|34000: 166; FieldScripts (34700): 181
In the United States alone:
• Corn acres planted in 2013 – 96M
• Price of corn per bushel – $6.93*
• Advantage of 5–10 bu/ac
*Price reflects CBOT price of corn 1/9/2013
6. Integrated Farming Systems℠ Combine Advanced Seed Genetics, On-farm Agronomic Practices, Software and Hardware Innovations to Drive Yield
• DATABASE BACKBONE – Expansive product-by-environment testing makes on-farm prescriptions possible
• VARIABLE-RATE FERTILITY – Variable rate N, P & K; "apps" aligned with yield management zones
• PRECISION SEEDING – Planter hardware systems enabling variable rate seeding & row spacing of multiple hybrids in a field by yield management zone
• FERTILITY & DISEASE MANAGEMENT – "Apps" for in-season custom application of supplemental late nitrogen and fungicides
• YIELD MONITOR – Advances in yield monitoring to deliver higher resolution data
• BREEDING – Significant increases in data points collected per year to increase the annual rate of genetic gain
7. Use Case
High Level Data Flow: public data + Monsanto data + grower data → standardize & link → algorithms
• Load thousands of files containing spatial data
• Support a diverse range of data types – tabular, vector, raster
• Join & link data spatially
• Generate a dense grid covering the entire US – 120 billion polygons
• Generate a set of derived attributes – think moving average (a small sketch follows below)
• Make data available for other data products such as FieldScripts
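To make the derived-attribute step concrete, here is a minimal Java sketch of a moving average over a dense grid of cell values, the kind of neighborhood attribute the bullet alludes to. It is illustrative only: the grid contents, dimensions and window radius are assumptions, not Monsanto's actual parameters, and in the real pipeline this kind of computation runs as a MapReduce job over raster splits rather than in memory.

/** Minimal sketch: a moving-average "derived attribute" over a dense grid of cell values. */
public class MovingAverage {

    /** Each output cell holds the mean of its (2r+1) x (2r+1) neighborhood, clipped at the grid edges. */
    static float[][] movingAverage(float[][] grid, int r) {
        int rows = grid.length, cols = grid[0].length;
        float[][] out = new float[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double sum = 0;
                int n = 0;
                for (int di = -r; di <= r; di++) {
                    for (int dj = -r; dj <= r; dj++) {
                        int ii = i + di, jj = j + dj;
                        if (ii >= 0 && ii < rows && jj >= 0 && jj < cols) {
                            sum += grid[ii][jj];
                            n++;
                        }
                    }
                }
                out[i][j] = (float) (sum / n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy 3 x 3 elevation grid; the real grid in this pipeline holds billions of cells.
        float[][] elevation = { {10f, 12f, 11f}, {13f, 15f, 14f}, {12f, 13f, 16f} };
        System.out.println(movingAverage(elevation, 1)[1][1]); // mean of all nine cells
    }
}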
8. Version 1 Architecture
• In-RDBMS spatial
• PL/SQL
• Multiple patches to the DB engine
• Just 8% of the data!! – 35+ days to process
• TBs in indexes
• Tradeoffs – compressed vs. uncompressed, performance vs. storage, read vs. write performance
• Options/recommendations – limit use of in-DB spatial functionality, or buy more RDBMS
(Slide charts: Data Processing Time in days, broken out by soil, elevation, spatial index and processing; Data Volumes in TBs for raw data, uncompressed, compressed and spatial index)
9. Version 2 Architecture
• Combination of MapReduce & HBase
• Leverage existing Hadoop cluster
• MapReduce – parallelize everything! Bulk HBase loads
• HBase – spatial data model, custom spatial engine
10. Data Ingestion
• Bulk load 1,000s of files into HDFS
• Standardize data – common usable format
• Storage vs. compute
• Raster format is easily splittable
• Hadoop Streaming integrated with GDAL (a hedged sketch follows below)
• Streaming API lessons learned – lack of documentation; counters to track task progress; jobs run as the mapred user; HDFS access outside of MR
(Slide chart: Data Ingestion Time in hours – RDBMS vs. Hadoop)
(Slide data flow: NFS holding raster images, vector shape files, zip files and text data → unzip, convert to raster, re-project → HDFS → Hadoop Streaming → raster file results)
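The deck says ingestion ran GDAL under Hadoop Streaming; as a rough Java analogue of that pattern (not the original code), the sketch below gives each map task one raster file path per input line, shells out to gdalwarp for re-projection, and bumps a counter so progress shows up in the job UI. The output naming, the target projection, and the assumption that gdalwarp is on the task's PATH are illustrative.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Sketch of a per-file ingest mapper: each input line is one raster file path
 *  (e.g. staged on a shared NFS mount), re-projected with GDAL and reported via counters.
 *  Paths and gdalwarp options are illustrative assumptions, not the production job. */
public class RasterIngestMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    enum IngestCounters { FILES_PROCESSED, FILES_FAILED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String inputPath = value.toString().trim();
        String outputPath = inputPath + ".reprojected.tif";   // hypothetical naming scheme

        // Re-project to a common reference system with GDAL (assumes gdalwarp is on the PATH).
        ProcessBuilder pb = new ProcessBuilder(
                "gdalwarp", "-t_srs", "EPSG:4326", inputPath, outputPath);
        pb.redirectErrorStream(true);
        int rc = pb.start().waitFor();

        if (rc == 0) {
            // Counters appear in the job UI, which is how task progress can be tracked.
            context.getCounter(IngestCounters.FILES_PROCESSED).increment(1);
            context.write(new Text(outputPath), NullWritable.get());
        } else {
            context.getCounter(IngestCounters.FILES_FAILED).increment(1);
        }
    }
}

Pairing a mapper like this with NLineInputFormat, so each task receives a small batch of paths, is a common way to spread thousands of input files across the cluster.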
11. Data Processing
• Process raster data – dense matrix
• Generic InputFormat & RecordReader for raster data
• HFiles easily transportable between clusters
• Challenges tuning jobs – IO sort factor, split/task size
(Slide data flow: HDFS raster files → generate derived attributes → generate HFiles → bulk load into a pre-split HBase table; see the bulk-load sketch below)
(Slide chart: Data Processing Time in days – RDBMS vs. Hadoop)
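The "generate HFiles" step follows the standard HBase bulk-load pattern: a job emits row-key/Put pairs, HFileOutputFormat2 sorts them into one HFile per region of the pre-split table, and the HFiles are then loaded in place. The sketch below shows that wiring with a recent HBase client API (the 2013 system predates HFileOutputFormat2); the table name, paths and the toy "cellId,elevation" input format are assumptions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Sketch of an HFile bulk-load job against a pre-split spatial table.
 *  Table name, paths and the "cellId,value" input format are illustrative. */
public class GenerateHFilesJob {

    /** Hypothetical mapper: turns "cellId,elevation" lines into Puts keyed by cell id. */
    static class CellToPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            byte[] row = Bytes.toBytes(Long.parseLong(parts[0]));
            Put put = new Put(row);
            put.addColumn(Bytes.toBytes("A"), Bytes.toBytes("d"),
                          Bytes.toBytes(Float.parseFloat(parts[1])));
            ctx.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "generate-spatial-hfiles");
        job.setJarByClass(GenerateHFilesJob.class);
        job.setMapperClass(CellToPutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path("/data/derived-attributes"));
        FileOutputFormat.setOutputPath(job, new Path("/data/hfiles"));

        boolean ok;
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("spatial"));
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("spatial"))) {
            // Configures the partitioner, sort reducer and HFile output to match the table's regions,
            // which is why pre-splitting the table yields evenly sized HFiles.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            ok = job.waitForCompletion(true);
        }
        // The HFiles under /data/hfiles are then moved into the table with the
        // completebulkload tool (LoadIncrementalHFiles).
        System.exit(ok ? 0 : 1);
    }
}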
13. Geospatial in HBase
Need:
– Dense data set
– Complex computations
– Scalable & cost efficient
– Bulk analytics & random reads
HBase:
– GeoHash is the most notable example
• Best suited for sparse data
– Precision of reads
– Alphanumeric key
HBase considerations:
– Key overhead
– Scan vs. Get performance
– Reduce reading unnecessary data
(Slide graphics: Example Field; Complex Data Interactions)
14. Global Coordinate System
(Slide graphic: the global longitude/latitude plane, spanning -180° to 180° longitude and -90° to 90° latitude)
16. Reference System Continued
(Slide graphic: the longitude/latitude plane divided into a numbered grid of cells; cells are numbered row by row – 1, 2, 3 … 20 across the first row, 21, 22, 23 … across the second, down to 381 … 400 in the last – spanning -180° to 180° longitude and -90° to 90° latitude. A sketch of this numbering follows below.)
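The numbering the figure illustrates is a simple row-major mapping from longitude/latitude to a cell id. The sketch below reproduces the 20-column, 400-cell toy grid drawn on the slide; the production grid uses the same idea at a far finer resolution, so the cell sizes here are purely illustrative.

/** Sketch of the row-major cell numbering from the slide: cells are numbered
 *  left to right, then top to bottom, across the longitude/latitude plane.
 *  The 18 x 9 degree cell size reproduces the 20 x 20 toy grid on the slide;
 *  the real grid is far finer. */
public class GridReference {

    static final double LON_CELL_DEG = 18.0;   // 360 / 18 = 20 columns
    static final double LAT_CELL_DEG = 9.0;    // 180 / 9  = 20 rows
    static final long COLS = Math.round(360.0 / LON_CELL_DEG);

    /** 1-based cell id for a point, numbering rows from the top (+90 latitude) downward. */
    static long cellId(double lon, double lat) {
        long col = (long) Math.floor((lon + 180.0) / LON_CELL_DEG);  // 0 .. 19
        long row = (long) Math.floor((90.0 - lat) / LAT_CELL_DEG);   // 0 .. 19
        return row * COLS + col + 1;
    }

    public static void main(String[] args) {
        System.out.println(cellId(-179.9, 89.9));  // north-west corner -> cell 1
        System.out.println(cellId(179.9, -89.9));  // south-east corner -> cell 400
        System.out.println(cellId(-90.4, 38.7));   // roughly St. Louis -> cell 105
    }
}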
17. HBase Schema Take 1
Spatial Table
• Key: cell_id (long)
• Column Family: A
– Column: Data Holder
• elevation
• slope: float
• aspect: float
• Each spatial dataset is a separate table
• All attributes for a layer that are read together are stored together – attributes packed into a single column as an Avro object
• 1 row per record
• 120 billion rows total!
• 1,000s of Get requests per field (see the read sketch below)
• TBs of key overhead – roughly 56% of the data
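As a concrete illustration of the "one row per cell, thousands of Gets per field" access pattern described above, here is a hedged sketch using the HBase client API. The column family follows the slide ("A" with a single data-holder column), but the qualifier byte, the table name, the cell-id list and the skipped Avro decoding are assumptions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/** Sketch of the "take 1" access pattern: one HBase row per grid cell, so a
 *  field lookup issues thousands of Gets. Names and qualifiers are illustrative. */
public class Take1FieldReader {

    static final byte[] FAMILY = Bytes.toBytes("A");
    static final byte[] DATA_HOLDER = Bytes.toBytes("d");   // assumed qualifier

    /** Fetch the packed attribute blob (an Avro record on the slide) for each cell. */
    static List<byte[]> readField(Table layerTable, long[] cellIds) throws IOException {
        List<Get> gets = new ArrayList<>();
        for (long cellId : cellIds) {
            gets.add(new Get(Bytes.toBytes(cellId)).addColumn(FAMILY, DATA_HOLDER));
        }
        List<byte[]> blobs = new ArrayList<>();
        for (Result r : layerTable.get(gets)) {          // batched multi-get
            blobs.add(r.getValue(FAMILY, DATA_HOLDER));  // decode with Avro in the real system
        }
        return blobs;
    }

    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("elevation"))) {
            long[] cellIds = {1001L, 1002L, 1003L};      // thousands of these per field in practice
            System.out.println(readField(table, cellIds).size());
        }
    }
}

Because every one of the 120 billion rows carries its own full key for only a handful of float values, the key bytes dominate the storage, which is the roughly 56% overhead called out on the slide.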
18. Reference System Storage Format
• Data grouped into 100 x 100 super cells
• A super cell of 100 x 100 cells is a single row in HBase
• At most 4 disk reads are required to read all data for one layer for a 150-acre field
• Given a bounding box, the super cells and attributed grid cells containing the desired data can easily be computed (see the addressing sketch below)
• A generic geospatial data service, when given a set of layers, will read each layer in parallel
• Overhead of key data reduced from 56% to below 0.1%
(Slide graphic: super grid cells vs. attributed grid cells)
Spatial Table
• Key: super_cell_id (long)
• Column Family: A
– Column: Data Holder
• elevation: array<float> [values]
• slope: array<float> [values]
• aspect: array<float> [values]
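A little arithmetic makes the super-cell layout concrete: a 100 x 100 block of grid cells becomes one HBase row, each attribute is stored as a packed float array, and a cell's position inside its super cell is just an index. The 100 x 100 factor comes from the slide; the global grid width and the exact key encoding below are assumptions for illustration.

import org.apache.hadoop.hbase.util.Bytes;

/** Sketch of the super-cell addressing from the slide: 100 x 100 grid cells per
 *  HBase row, each attribute stored as a packed float[10000]. The global grid
 *  width and the key encoding are illustrative assumptions. */
public class SuperCellAddressing {

    static final int SUPER = 100;                 // 100 x 100 cells per super cell (from the slide)
    static final long GRID_COLS = 2_000_000L;     // assumed width of the global cell grid

    /** Row key of the super cell that contains grid cell (col, row). */
    static byte[] superCellKey(long col, long row) {
        long superCol = col / SUPER;
        long superRow = row / SUPER;
        long superCellId = superRow * (GRID_COLS / SUPER) + superCol;
        return Bytes.toBytes(superCellId);        // one 8-byte key per 10,000 cells
    }

    /** Index of grid cell (col, row) inside its super cell's packed float[] arrays. */
    static int offsetInSuperCell(long col, long row) {
        return (int) ((row % SUPER) * SUPER + (col % SUPER));
    }

    public static void main(String[] args) {
        long col = 123_456L, row = 654_321L;
        System.out.println(Bytes.toLong(superCellKey(col, row)));
        System.out.println(offsetInSuperCell(col, row));  // position in elevation[], slope[], aspect[]
    }
}

One 8-byte row key now amortizes over 10,000 cells instead of one, which is how the key overhead falls from roughly 56% of the data to below 0.1%.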
19. Results
• Significant cost savings in required hardware
• 120 billion unique polygons in total
• 1.5 trillion data points
• Dense grid of the entire U.S.
• Foundational architecture for other spatial data sets
• Fully unit tested implementation
RDBMS: 4 states only; 30+ days to load; 8 months of development
Hadoop: entire U.S.; 18-hour load time; 3 months of development; 100% scalable; cloud ready
(Slide chart: Total Data Processing Time in days – RDBMS with 8% of the data vs. Hadoop with the full data set)
http://psipunk.com/page/18/
With big agricultural farms getting smaller due to a fast-growing population, we need compact and efficient farming tools to balance structured agriculture with nature and ensure a healthy ecosystem around us. Offering a solution, the "Agria" by Julia Kaisinger, Katharina Unger and Stefan Riegbauer is an autonomous farm robot for sowing and plant protection on small farms. Featuring infrared and UV light to control bugs, fungi and pests, the modular machine examines the soil and plants regularly to allow specific treatment. Placing seeds and fertilizer in the right place and proportion, the Agria works with an intelligent network of fields and machines, supplied by a local station, which can be controlled through a computer or smartphone, so you can store and share data with experts for better analysis.
Agriculture is going through a transition via the adoption of breakthrough technologies in seed genetics, farm equipment hardware and software, and farm practices, akin to the advances in computer technology that ushered in the modern information technology era. Growers are getting increasingly swamped by information, much of it needing further thoughtful analysis leading to the extraction and integration of actionable information; Monsanto is gearing up to do exactly that. Anyone interested in developing improved agronomic practices or information apps that contribute to increasing yield or improving life on the farm should get in touch with us (leave contact information at the Monsanto booth).
General data flow
Split and task sizes were a challenge because of the number of files to be processed and the metadata needed to process each task. Data was generated only for the United States, so only 15% of all super cells covering the world were used. The HBase table was pre-split so the bulk load produced evenly sized HFiles (a hedged sketch of pre-splitting follows below).
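The note about pre-splitting can be illustrated with a short sketch that creates the spatial table with evenly spaced split points over its long row-key space, so the bulk-load job writes one similarly sized HFile per region. It uses the current HBase admin API, and the table name, region count and key range are assumptions rather than the original configuration.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

/** Sketch: pre-split the spatial table so bulk loads write evenly sized HFiles.
 *  Region count, table name and key range are illustrative assumptions. */
public class PreSplitSpatialTable {
    public static void main(String[] args) throws Exception {
        int regions = 200;                       // assumed number of regions
        long maxSuperCellId = 120_000_000L;      // assumed upper bound of the key space

        // Evenly spaced split keys across the long key space.
        byte[][] splits = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
            splits[i - 1] = Bytes.toBytes(maxSuperCellId / regions * i);
        }

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("spatial"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("A"))
                    .build(),
                splits);
        }
    }
}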