Griffin is a data quality platform built by eBay on Hadoop and Spark to provide a unified process for detecting data quality issues in both real-time and batch data across multiple systems. It defines common data quality dimensions and metrics, calculates measurement values and quality scores, stores the results, and generates trending reports. Griffin provides a centralized data quality service for eBay and has been deployed to process over 1.2PB of data and 800M records daily across 100+ metrics. It is open source and contributions are welcome.
Running cost-effective big data workloads with Azure Synapse and Azure Data Lake Storage - Michael Rys
The presentation discusses how to migrate expensive open source big data workloads to Azure and leverage the latest compute and storage innovations in Azure Synapse with Azure Data Lake Storage to build powerful and cost-effective analytics solutions. It shows how you can bring your .NET expertise to bear with .NET for Apache Spark, and how the shared metadata experience in Synapse makes it easy to create a table in Spark and query it from T-SQL.
The document discusses different types of big data including unstructured, semi-structured, and structured data. It provides examples of each type such as audio, video, and images for unstructured data. JSON, XML, and sensor data are given as examples for semi-structured data. The document also discusses the challenges of processing big data due to its variety, velocity, and volume.
The document discusses Big Data on Azure and provides an overview of HDInsight, Microsoft's Apache Hadoop-based data platform on Azure. It describes HDInsight cluster types for Hadoop, HBase, Storm and Spark and how clusters can be automatically provisioned on Azure. Example applications and demos of Storm, HBase, Hive and Spark are also presented. The document highlights key aspects of using HDInsight including storage integration and tools for interactive analysis.
This document discusses leveraging Hadoop within the existing data warehouse environment of the Department of Immigration and Border Protection (DIBP) in Australia. It provides an overview of DIBP's business and why Hadoop was adopted, describes the existing EDW environment, and discusses the technical implementation of Hadoop. It also outlines next steps such as consolidating the departmental EDW and advanced analytics on Hadoop, and concludes by taking questions.
JethroData meetup: index-based SQL on Hadoop - Oct 2014 - Eli Singer
JethroData: an index-based SQL-on-Hadoop engine.
An architecture comparison of MPP / full-scan SQL engines such as Impala and Hive with index-based access such as Jethro.
SQL and NoSQL NYC meetup Oct 20 2014
Boaz Raufman
Realtime Analytical Query Processing and Predictive Model Building on High Di... - Spark Summit
Spark SQL and MLlib are optimized for running feature extraction and machine learning algorithms on row-based columnar datasets through full scans, but they do not provide constructs for column indexing or time series analysis. For document datasets with timestamps, where features are represented as a variable number of columns per document and use cases demand searching over columns and time to retrieve documents and generate learning models in real time, a close integration between Spark and Lucene was needed. We introduced LuceneDAO at Spark Summit Europe 2016 to build distributed Lucene shards from a data frame, but time series attributes were not part of the data model. In this talk we present our extension to LuceneDAO that maintains timestamps in the document-term view for search and allows time filters. Lucene shards maintain the time-aware document-term view for search and a vector space representation for machine learning pipelines. We use Spark as our distributed query processing engine, where each query is represented as a boolean combination over terms with filters on time. LuceneDAO loads the shards to Spark executors and powers sub-second distributed document retrieval for the queries.
Our synchronous API uses Spark-as-a-Service to power analytical queries, while our asynchronous API uses Kafka, Spark Streaming and HBase to power time series prediction algorithms. In this talk we will demonstrate LuceneDAO write and read performance on millions of documents with 1M+ terms and configurable timestamp aggregate columns, and the latency of the APIs on a suite of queries generated from terms. Key takeaways will be a thorough understanding of how to make Lucene-powered time-aware search a first-class citizen in Spark to build interactive analytical query processing and time series prediction algorithms.
Big data is driving transformative changes in traditional data warehousing. Traditional ETL processes and highly structured data schemas are being replaced with schema flexibility to handle all types of data from diverse sources. This allows for real-time experimentation and analysis beyond just operational reporting. Microsoft is applying lessons from its own big data journey to help customers by providing a comprehensive set of Apache big data tools in Azure along with intelligence and analytics services to gain insights from diverse data sources.
The document discusses Ido Friedman and his background working with various data technologies. It then discusses the concept of a data lake and how it serves as a single store for raw and transformed data used for reporting, analytics, and machine learning. The rest of the document discusses how traditional tools like SQL have changed with the rise of Hadoop and cloud storage. It provides examples of performance and cost differences between running data workloads on Hadoop clusters versus cloud-based data processing services like BigQuery and Dataproc. The document concludes that a large data lake is now possible in the cloud and discusses various deployment options to consider.
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
This document discusses big data concepts like volume, velocity, and variety of data. It introduces NoSQL databases as an alternative to relational databases for big data, one that does not require upfront data cleansing or schema definition. Hadoop is presented as a framework for distributed storage and processing of large datasets across clusters of commodity hardware. Key Hadoop components like HDFS, MapReduce, Hive, Pig and YARN are described at a high level. The document also discusses using Azure services like Azure Storage, HDInsight and Stream Analytics with Hadoop.
Data Con LA 2020
Description
Data warehouses are not enough. Data lakes are the backbone of a modern data environment. Data Lakes are best built leveraging unique services of the cloud provider to reduce operations complexity. This session will explain why everyone's talking about data lakes, break down the best services in Azure to build a Data Lake, and walk through code for querying and loading with Azure Databricks and Event Hubs for Kafka. Attendees will leave the session with a firm grasp of why we build data lakes and how Azure Databricks fits in for ETL and querying.
Speaker
Dustin Vannoy, Dustin Vannoy Consulting, Principal Data Engineer
These slides provide highlights of my book HDInsight Essentials. Book link is here: http://www.packtpub.com/establish-a-big-data-solution-using-hdinsight/book
Here I talk about examples and use cases for Big Data & Big Data Analytics and how we accomplished massive-scale sentiment, campaign and marketing analytics for Razorfish using a collection of database, Big Data and analytics technologies.
This presentation covers an "Introduction to Big Data" for enterprises. It includes the challenges and benefits of Big Data, along with a transition plan based on a few case studies.
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data... - Dipti Borkar
Born at Facebook, Presto is an open source, high performance, distributed SQL query engine. With the disaggregation of storage and compute, Presto was created to simplify querying of all data lakes - cloud data lakes like S3 and on-premise data lakes like HDFS. Presto's high performance and flexibility have made it a very popular choice for interactive query workloads on large Hadoop-based clusters as well as AWS S3, Google Cloud Storage and Azure blob store. Today it has grown to support many users and use cases including ad hoc query, data lakehouse analytics, and federated querying. In this session, we will give an overview of Presto including its architecture and how it works, the problems it solves, and the most common use cases. We'll also share the latest innovations in the project as well as the future roadmap.
Build a simple data lake on AWS using a combination of services, including AWS Glue Data Catalog, AWS Glue Crawlers, AWS Glue Jobs, AWS Glue Studio, Amazon Athena, Amazon Relational Database Service (Amazon RDS), and Amazon S3.
Link to the blog post and video: https://garystafford.medium.com/building-a-simple-data-lake-on-aws-df21ca092e32
This document discusses architecting a data lake. It begins by introducing the speaker and topic. It then defines a data lake as a repository that stores enterprise data in its raw format including structured, semi-structured, and unstructured data. The document outlines some key aspects to consider when architecting a data lake such as design, security, data movement, processing, and discovery. It provides an example design and discusses solutions from vendors like AWS, Azure, and GCP. Finally, it includes an example implementation using Azure services for an IoT project that predicts parts failures in trucks.
Data Science Languages and Industry Analytics - Wes McKinney
September 19, 2015 talk at Berkeley Institute for Data Science. On how comparatively poor JSON / structured data tools pose a challenge for the data science languages (Python, R, Julia, etc.).
Apache Arrow is a new standard for in-memory columnar data processing. It is a complement to Apache Parquet and Apache ORC. In this deck we review key design goals and how Arrow works in detail.
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Parquet - Dremio Corporation
Essentially every successful analytical DBMS in the market today makes use of column-oriented data structures. In the Hadoop ecosystem, Apache Parquet (and Apache ORC) provide similar advantages in terms of processing and storage efficiency. Apache Arrow is the in-memory counterpart to these formats and has been embraced by over a dozen open source projects as the de facto standard for in-memory processing. In this session the PMC Chair for Apache Arrow and the PMC Chair for Apache Parquet discuss the future of column-oriented processing.
A fairy tale about orphans, forests, kings and forking open source software projects, with particular reference to sqlline and Apache Hive.
From a talk I gave at the Apache Hive contributors' meetup in Santa Clara on April 22nd, 2015.
This document discusses how Apache Calcite makes it easier to write database management systems (DBMS) by decomposing them into modular components like a query parser, catalog, algorithms, and storage engines. It presents Calcite as a framework that allows these components to be mixed and matched, with a core relational algebra and rule-based optimization. Calcite powers systems like Apache Hive, Drill, Phoenix, and Kylin by translating SQL and other queries to relational algebra and optimizing queries using over 100 rules before executing them using configurable engines and data sources.
Building a data lake is a daunting task. The promise of a virtual data lake is to provide the advantages of a data lake without consolidating all data into a single repository. With Apache Arrow and Dremio, companies can, for the first time, build virtual data lakes that provide full access to data no matter where it is stored and no matter what size it is.
This document discusses Apache Arrow, an open source cross-language development platform for in-memory analytics. It provides an overview of Arrow's goals of being cross-language compatible, optimized for modern CPUs, and enabling interoperability between systems. Key components include core C++/Java libraries, integrations with projects like Pandas and Spark, and common message patterns for sharing data. The document also describes how Arrow is implemented in practice in systems like Dremio's Sabot query engine.
Enterprise data is moving into Hadoop, but some data has to stay in operational systems. Apache Calcite (the technology behind Hive’s new cost-based optimizer, formerly known as Optiq) is a query-optimization and data federation technology that allows you to combine data in Hadoop with data in NoSQL systems such as MongoDB and Splunk, and access it all via SQL.
Hyde shows how to quickly build a SQL interface to a NoSQL system using Calcite. He shows how to add rules and operators to Calcite to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.
Don’t optimize my queries, optimize my data! - Julian Hyde
The document discusses strategies for optimizing data through materialized views and how data systems can learn to optimize themselves. It proposes an algorithm that uses sketches and information theory to profile data cardinalities and recommend materialized views. The algorithm aims to defeat the combinatorial search space by only considering combinations with "surprising" cardinalities. This profiling provides the cost and benefit information needed to optimize data structures. The document also discusses using query logs and statistics to infer relationships between tables and design summary tables through lattices.
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle... - Rittman Analytics
Most DBAs are aware something interesting is going on with big data and the Hadoop product ecosystem that underpins it, but aren't so clear about what each component in the stack does, what problem each part solves and why those problems couldn't be solved using the old approach. We'll look at where it's all going with the advent of Spark and machine learning, what's happening with ETL, metadata and analytics on this platform ... and why IaaS and data-warehousing-as-a-service will have such a big impact, sooner than you think.
This document summarizes the history and evolution of data warehousing and analytics architectures. It discusses how data warehouses emerged in the 1970s and were further developed in the late 1980s and 1990s. It then covers how big data and Hadoop have changed architectures, providing more scalability and lower costs. Finally, it outlines components of modern analytics architectures, including Hadoop, data warehouses, analytics engines, and visualization tools that integrate these technologies.
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
Are you confused by Big Data? Get in touch with this new "black gold" and familiarize yourself with undiscovered insights through our complimentary introductory lesson on Big Data and Hadoop!
This document discusses building big data solutions using Microsoft's HDInsight platform. It provides an overview of big data and Hadoop concepts like MapReduce, HDFS, Hive and Pig. It also describes HDInsight and how it can be used to run Hadoop clusters on Azure. The document concludes by discussing some challenges with Hadoop and the broader ecosystem of technologies for big data beyond just Hadoop.
Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s Ne... - Rittman Analytics
Mark Rittman presented at Big Data World in London in March 2017 on data integration and data warehousing for cloud, big data, and IoT. He discussed the history of data warehousing and how it has evolved from traditional RDBMS implementations to embrace big data technologies like Hadoop. He described how cloud data warehouse offerings from Google BigQuery and Amazon Redshift combine the scalability of big data with the structure of data warehousing. Rittman also covered new approaches to ETL using data pipelines, schema discovery using machine learning, emerging open-source BI tools, and his current work in these areas.
Hadoop meets Agile! - An Agile Big Data Model - Uwe Printz
The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
A talk given by Ted Dunning in February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Big Data Strategy for the Relational World - Andrew Brust
1) Andrew Brust is the CEO of Blue Badge Insights and a big data expert who writes for ZDNet and GigaOM Research.
2) The document discusses trends in databases including the growth of NoSQL databases like MongoDB and Cassandra and Hadoop technologies.
3) It also covers topics like SQL convergence with Hadoop, in-memory databases, and recommends that organizations look at how widely database products are deployed before adopting them to avoid being locked into niche products.
This document discusses big data analytics platforms and techniques. It describes various open-source projects like Hadoop, Spark, and Mahout that can perform analytics on large datasets. It also discusses commercial analytics platforms from vendors like SAS, Alpine, and Revolution Analytics. Spark is highlighted as gaining rapid adoption for its speed and expanding machine learning capabilities. Key questions are raised about which open-source projects and commercial offerings will emerge as leaders in their categories.
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop - Caserta
In our most recent Big Data Warehousing Meetup, we learned about the transition from Big Data 1.0, Hadoop 1.x and its nascent technologies, to Hadoop 2.x with YARN, which enables distributed ETL, SQL and analytics solutions. Caserta Concepts Chief Architect Elliott Cordo and an Actian engineer covered the complete data value chain of an enterprise-ready platform, including data connectivity, collection, preparation, optimization and analytics with end-user access.
For more information on our services or upcoming events, please visit our website at http://www.casertaconcepts.com/.
Hadoop and the Data Warehouse: Point/Counterpoint - Inside Analysis
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
The document discusses Seagate's plans to integrate hard disk drives (HDDs) with flash storage, systems, services, and consumer devices to deliver unique hybrid solutions for customers. It notes Seagate's annual revenue, employees, manufacturing plants, and design centers. It also discusses Seagate exploring the use of big data analytics and Hadoop across various potential use cases and outlines Seagate's high-level plans for Hadoop implementation.
The way we store and manage data is changing. In the old days, there were only a handful of file formats and databases. Now there are countless databases and numerous file formats. The methods by which we access the data have also increased in number. As R users, we often access and analyze data in highly inefficient ways. Big Data tech has solved some of those problems.
This presentation will take attendees on a quick tour of the various relevant Big Data technologies. I’ll explain how these technologies fit together to form a stack for various data analysis uses cases. We’ll talk about what these technologies mean for the future of analyzing data with R.
Even if you work with “small data” this presentation will still be of interest because some Big Data tech has a small data use case.
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform - Hortonworks
Find out how Hortonworks and IBM help you address these challenges and optimize your existing EDW environment.
https://hortonworks.com/webinar/modernize-existing-edw-ibm-big-sql-hortonworks-data-platform/
Microsoft's Big Play for Big Data - Visual Studio Live! NY 2012 - Andrew Brust
This document discusses Microsoft's efforts to make big data technologies like Hadoop more accessible through its products. It describes Hadoop, MapReduce, HDFS, and other big data concepts. It then outlines Microsoft's project to create a Hadoop distribution that runs on Windows Server and Windows Azure, including building an ODBC driver to allow tools like Excel to query Hadoop. This will help bring big data to more business users and integrate it with Microsoft's existing BI technologies.
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to be complementary to them. In this presentation you will hear what is Big Data and Data Lake and what are the most popular technologies used in Big Data world. We will also speak about Hadoop and Spark, and how they integrate with traditional systems and their benefits.
Apache Tajo - An open source big data warehouse - hadoopsphere
Apache Tajo is an open source distributed data warehouse system that allows for low-latency queries and long-running batch queries on various data sources like HDFS, S3, and HBase. It features ANSI SQL compliance, support for common file formats like CSV and JSON, and Java/Python UDF support. The presentation discusses recent Tajo releases, including new features in version 0.10, and outlines future plans.
Self-Service BI for big data applications using Apache Drill (Big Data Amsterdam) - Dataconomy Media
Modern big data applications such as social, mobile, web and IoT deal with a larger number of users and larger amount of data than the traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing and can include complex types such as JSON and Parquet. In this demo we will show how Apache Drill can be used to provide low latency queries natively on rapidly evolving multi-structured datasets at scale.
1. BI on Big Data
Tomer Shiran - @tshiran
Co-Founder & CEO, Dremio
Strata + Hadoop World London, June 3, 2016
What are your options?
2. 2 BI on Hadoop: What are your Options
Dremio Company Background
Jacques Nadeau, Founder & CTO
• Recognized SQL & NoSQL expert
• Founder of Apache Arrow & Drill
• Previously Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
Tomer Shiran, Founder & CEO
• Previously MapR (VP Product & employee #5); Microsoft; IBM Research
• Carnegie Mellon, Technion
Julien Le Dem, Architect
• Founder of Apache Parquet
• Apache Pig PMC Member
• Previously Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
• Stealth data analytics startup
• Founded in 2015
• Backed by top Silicon Valley VCs
• Led by experts in Big Data and open source including the creators of Apache Arrow & Apache Parquet
3. 3 BI on Hadoop: What are your Options
Recent changes to the BI landscape
Good ol’ Days
• Only a few databases (e.g. Oracle, Teradata, SQL Server)
• A few BI tools (MicroStrategy, Cognos)
• Everything worked with everything
• Things were easy!
Modern Reality
• Larger scale, less control and less structure
• Lots of databases!
• Data Lake, not database
• HDFS: It’s a file system, folks!
• NoSQL: Let’s put the schema in the application
• It can feel like the wild west!
4. 4 BI on Hadoop: What are your Options
Major Approaches to BI on Big Data?
ETL to RDBMS
o “Make the new world look like the old world!”
o Load a transformed set of data into relational database
Monolithic (all-in-one) solutions
o Use BI tools that connect directly to Big Data
SQL-on-Big-Data
o Connect BI tools to a query engine sitting on top of Big Data
o Three main sub-categories: Native SQL, Batch SQL, OLAP Cubes
5. 5 BI on Hadoop: What are your Options
So how do we bring BI to Big Data?
[Diagram: three approaches side by side - ETL to Data Warehouse (Big Data → ETL tool → RDBMS → BI options), SQL-on-Big-Data (Big Data → SQL engine → BI options), and Monolithic All-in-one Solutions (monolithic tool with built-in BI on Big Data)]
6. 6 BI on Hadoop: What are your Options
ETL to RDBMS: Introduction
• ETL (Extract, Transform, and Load) a subset of the data into a relational database
o Oracle, PostgreSQL, Teradata, Redshift, Vertica, …
• Connect any desired BI tool to the RDBMS
o Tableau, Qlik, …
• Two options:
o Commercial tools (Informatica, Talend, Pentaho, …)
o Custom development, scripts, etc.
[Diagram: Big Data → ETL tool → RDBMS → BI options]
7. 7 BI on Hadoop: What are your Options
ETL to RDBMS: Example
• Load web server logs from HDFS into RDBMS
• ETL software: Pentaho Data Integration (aka ‘Kettle’)
• RDBMS: MySQL
Steps: connect ETL to RDBMS → add and configure input/output → connect input and output → create and fill RDBMS table → connect BI tool to RDBMS
[Bar chart: monthly totals, April through July]
Source: http://wiki.pentaho.com/display/BAD/Extracting+Data+from+HDFS+to+Load+an+RDBMS
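In practice, pipelines like this are often replaced by a short hand-coded script (a point the next slide's cons list picks up). Below is a minimal sketch of the same load, assuming the web logs have already been copied out of HDFS (e.g. with hdfs dfs -get) and that a local MySQL instance and the mysql-connector-python package are available; table, column, and credential names are illustrative, not from the talk.

```python
# Hand-coded sketch of the pipeline above: parse Apache-style access logs
# and load them into MySQL. Assumes the log file was already copied out of
# HDFS and mysql-connector-python is installed; all names are illustrative.
import re
import mysql.connector

LOG = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)')

def parse(line):
    m = LOG.match(line)
    if not m:
        return None  # skip malformed lines rather than failing the load
    ip, ts, method, path, status, size = m.groups()
    return (ip, ts, method, path, int(status), 0 if size == "-" else int(size))

conn = mysql.connector.connect(host="localhost", user="etl",
                               password="secret", database="weblogs")
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS access_log (
                 ip VARCHAR(45), ts VARCHAR(32), method VARCHAR(8),
                 path VARCHAR(2048), status SMALLINT, bytes BIGINT)""")

with open("access.log") as f:
    rows = [r for r in (parse(line) for line in f) if r]
cur.executemany("INSERT INTO access_log VALUES (%s, %s, %s, %s, %s, %s)", rows)
conn.commit()
conn.close()
```

A script like this is easy to start with, but as the next slide notes, it has to be re-run (and re-debugged) every time schemas or freshness requirements change.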
8. 8 BI on Hadoop: What are your Options
ETL to RDBMS: Pros and Cons
Pros
• Relational databases and their BI integrations are very mature
• Use your favorite tools
o Tableau, Excel, R, …
Cons
• Traditional ETL tools don’t work well with modern data
o Changing schemas, complex or semi-structured data, …
o Hand-coded scripts are a common substitute
• Data freshness
o How often do you replicate/synchronize?
• Data resolution
o Can’t store all the raw data in the RDBMS (due to scalability and/or cost)
o Need to sample, aggregate or time-constrain the data
…and really, who wants to ETL?
9. 9 BI on Hadoop: What are your Options
Monolithic (or All-in-One) Solutions: Introduction
• Single piece of software on top of Big Data
• Performs both data visualization (BI) and execution
• Utilizes sampling or manual pre-aggregation to reduce the data volume that the user is interacting with
• Examples:
o Datameer
o Platfora
o Zoomdata
[Diagram: monolithic system with built-in BI on top of Big Data]
10. 10 BI on Hadoop: What are your Options
Platfora Architecture Overview
• Constructs aggregates that are
loaded into an external database
o Aggregates provide fast
visualizations
o Aggregations must be created
before consumption
MapReduce/Spark
HDFS
Hadoop Cluster
Hadoop
Proprietary DB
Aggregates
Platfora Cluster
11. 11 BI on Hadoop: What are your Options
Datameer Architecture Overview
• Users interact with samples of the data in an Excel-like interface
• Finished designs use the whole dataset
• Query router determines execution engine based on data size
[Diagram: Datameer nodes beside the Hadoop cluster; a query router with sampling dispatches to single-node custom execution, Tez, or MapReduce over HDFS]
12. 12 BI on Hadoop: What are your Options
Zoomdata Architecture Overview
• Queries on historical (i.e. non-streaming) data are split into many sampling queries
• This sampling provides a view of the data that converges toward an accurate picture
o But adds load on the data source…
• Can handle streaming data sources
[Diagram: Zoomdata server with incremental sampling and a Spark-based cache over multiple data clusters (HDFS / MongoDB), plus a stream processing engine for streaming data sources]
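To make the incremental-sampling idea concrete, here is a toy sketch: a metric is estimated from progressively larger random samples, so early answers stream back quickly and converge toward the full-scan result. This is purely illustrative, not Zoomdata's actual implementation.

```python
# Toy illustration of incremental sampling: estimate a metric from
# progressively larger random samples so early results arrive fast and
# converge toward the full-scan answer. Illustrative only.
import random

data = [random.gauss(100, 15) for _ in range(1_000_000)]  # stand-in dataset

def incremental_mean(rows, batches=10):
    shuffled = random.sample(rows, len(rows))  # one random pass, split into batches
    seen, total = 0, 0.0
    batch = len(rows) // batches
    for i in range(batches):
        chunk = shuffled[i * batch:(i + 1) * batch]
        total += sum(chunk)
        seen += len(chunk)
        yield total / seen  # running estimate after each sampling query

for step, estimate in enumerate(incremental_mean(data), 1):
    print(f"after batch {step:2d}: mean ~ {estimate:.2f}")
```

Each batch refines the estimate, which is why the slide says the view "converges toward an accurate picture" while the per-query load on the source stays small.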
13. 13 BI on Hadoop: What are your Options
Monolithic Solutions: Pros and Cons
Pros
• Only one tool to learn and operate
• Easier than building and maintaining an ETL-to-RDBMS pipeline
• Integrated data preparation in some solutions
Cons
• Can’t analyze the raw data
o Rely on aggregation or sampling before primary analysis
• Can’t use your existing BI or analytics tools (Tableau, Qlik, R, …)
• Can’t run arbitrary SQL queries
14. 14 BI on Hadoop: What are your Options
SQL-on-Big-Data: Introduction
• SQL queries against Big Data
o Hadoop
o NoSQL (MongoDB, HBase, ...)
o Cloud Storage (S3, Azure Data Lake, GCS, …)
• Use your existing BI tools
o Leverage standard ODBC/JDBC drivers
[Diagram: BI tools (Tableau, Qlik, R, …) → SQL engine → Hadoop & NoSQL]
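The ODBC path here is the same one BI tools use under the hood. A minimal sketch of it in code, assuming an ODBC DSN named "bigdata" has been configured for whichever engine's driver you installed; the DSN name, table, and query are illustrative.

```python
# Minimal sketch of the ODBC route BI tools take, assuming a DSN named
# "bigdata" is configured for your SQL engine's ODBC driver; the DSN name
# and query are illustrative.
import pyodbc

conn = pyodbc.connect("DSN=bigdata", autocommit=True)
cur = conn.cursor()
# The engine presents files/collections as SQL tables, whatever the store.
for row in cur.execute("SELECT region, COUNT(*) AS cnt "
                       "FROM logs GROUP BY region ORDER BY cnt DESC"):
    print(row.region, row.cnt)
conn.close()
```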
15. 15 BI on Hadoop: What are your Options
SQL-on-Big-Data: Introduction
Three major design philosophies:
• Native SQL
• Batch & Data Science SQL
• OLAP Cubes on Hadoop
16. 16 BI on Hadoop: What are your Options
Native SQL
• Apache Drill
o Queries Hadoop, RDBMS, Files, NoSQL, Cloud (S3)
o Based on Apache Arrow
o Columnar in-memory execution
• Apache Impala (incubating)
o Utilizes the Hive metastore
o Focused on data in HDFS
• Presto
o Queries Hadoop, RDBMS, Files, NoSQL, Cloud (S3)
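For a sense of how lightweight the client side is, here is a hedged sketch of hitting one of these engines (Presto) programmatically, assuming a coordinator on localhost:8080 and the PyHive client library; catalog, schema, and table names are illustrative.

```python
# Sketch of querying Presto directly, assuming a coordinator on
# localhost:8080 and PyHive installed (pip install 'pyhive[presto]');
# catalog, schema, and table names are illustrative.
from pyhive import presto

conn = presto.connect(host="localhost", port=8080,
                      catalog="hive", schema="web")
cur = conn.cursor()
cur.execute("SELECT status, COUNT(*) FROM access_log GROUP BY status")
for status, cnt in cur.fetchall():
    print(status, cnt)
```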
17. 17 BI on Hadoop: What are your Options
Native SQL: Pros and Cons
Pros
• Highest performance for Big Data workloads
• Connect to Hadoop and also NoSQL systems
• Make Hadoop “look like a database”
Cons
• Queries may still be too slow for interactive analysis on many TB/PB
• Can’t defeat physics
18. 18 BI on Hadoop: What are your Options
Batch & Data Science SQL
• Hive
o Enables SQL queries to be translated to MapReduce/Tez
o Most commonly used for batch processing and ETL workloads
• Spark SQL
o Provides a way to deliver SQL queries in Spark programs (Scala/Java/Python)
o Excellent interleaving with data science work
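The interleaving point is easiest to see in code. A small sketch, assuming a Spark installation and JSON event data at an illustrative HDFS path; column names are made up for the example.

```python
# Sketch of Spark SQL interleaved with MLlib in one program: run SQL over
# raw files, then feed the result to a clustering model. The HDFS path and
# column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("sql-plus-ml").getOrCreate()

spark.read.json("hdfs:///data/events").createOrReplaceTempView("events")
features = spark.sql("""
    SELECT user_id,
           COUNT(*)         AS n_events,
           AVG(duration_ms) AS avg_duration
    FROM events
    GROUP BY user_id
""")

# Hand the SQL result straight to a machine learning pipeline.
vecs = VectorAssembler(inputCols=["n_events", "avg_duration"],
                       outputCol="features").transform(features)
model = KMeans(k=5, featuresCol="features").fit(vecs)
model.transform(vecs).select("user_id", "prediction").show()
```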
19. 19 BI on Hadoop: What are your Options
Batch & Data Science SQL: Pros and Cons
Pros
o Potentially simpler deployment (no daemons)
• New YARN job (MapReduce/Spark) for each query
o Check-pointing support enables very long-running queries
• Days to weeks (ETL work)
o Works well in tandem with machine learning (Spark)
Cons
o Latency prohibitive for interactive analytics
• Tableau, Qlik Sense, …
o Slower than native SQL engines
20. 20 BI on Hadoop: What are your Options
OLAP Cubes on Hadoop
• Kylin
o Hadoop-only
o Stores OLAP cubes in HBase
o Queries fail if not satisfied by cubes
o Open source
• AtScale
o Hadoop-only
o Leverages an external SQL engine (Hive, Impala, SparkSQL)
o Collaborative cube creation
o Closed source
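As a concrete taste of the cube workflow, here is a hedged sketch of querying Kylin over its REST API, assuming a server on localhost:7070 with Kylin's demo defaults; the credentials, project, and SQL are illustrative.

```python
# Sketch of querying a Kylin cube over its REST API, assuming a server on
# localhost:7070; credentials, project, and SQL are illustrative
# (ADMIN/KYLIN are Kylin's demo defaults).
import requests

resp = requests.post(
    "http://localhost:7070/kylin/api/query",
    auth=("ADMIN", "KYLIN"),
    json={
        "sql": "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
        "project": "learn_kylin",
    },
)
resp.raise_for_status()
# Answers come from pre-built cuboids, so they return quickly; SQL that no
# cube satisfies fails rather than falling back to a raw scan.
for row in resp.json()["results"]:
    print(row)
```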
21. 21 BI on Hadoop: What are your Options
OLAP Cubes on Hadoop: Pros and Cons
Pros
o Fast queries on pre-aggregated data
o Can use SQL and MDX tools
Cons
o Explicit cube definition/modeling phase
• Not “self-service”
• Frequent updates required due to dependency on business logic
o Aggregate creation and maintenance can take a long time (and a lot of space)
o User connects to and interacts with the cube
• Can’t interact with the raw data
22. 22 BI on Hadoop: What are your Options
SQL-on-Big-Data: Solution Comparison
Native SQL: technologies Drill, Impala, Presto; connectivity SQL and NoSQL; primary use case interactive; query capability raw data; deployment model new daemons collocated with existing services.
Batch & DS SQL: technologies Hive, Spark SQL; connectivity SQL and NoSQL; primary use case ETL or data-science focused; query capability raw data; deployment model a new MapReduce and/or Spark job for each query.
OLAP Cubes: technologies Kylin, AtScale; connectivity Hadoop-only; primary use case constrained interactive; query capability aggregated data; deployment model varies.
23. 23 BI on Hadoop: What are your Options
SQL-on-Big-Data: General Pros and Cons
Pros
• Continue using your favorite BI tools and SQL-based clients
o Tableau, Qlik, Power BI, Excel, R, SAS, …
• Technical analysts can write custom SQL queries
Cons
• Another layer in your data stack
• May need to pre-aggregate the data depending on your scale
• Need a separate data preparation tool (or custom scripts)
24. 24 BI on Hadoop: What are your Options
Deciding what is right for you?
25. 25 BI on Hadoop: What are your Options
BI on Big data: Heuristic
[Decision flowchart; key questions and the options they lead to]
• Is your working data relatively small & static? Yes: ETL to RDBMS.
• Do you already have a favorite BI tool? No: Monolithic/All-in-one Solutions. Is an external cluster okay, and does your schema change frequently? Platfora or Zoomdata. Do you like the Excel metaphor and not need to write SQL? Datameer.
• Do you have very predictable analysis needs? Yes: OLAP Cubes on Hadoop.
• Are you focused on interactive BI, or do you need to query NoSQL? Native SQL.
• Do you want to combine ML with SQL? Yes: SparkSQL. No: Hive.
26. 26 BI on Hadoop: What are your Options
Q&A
Tomer Shiran
tshiran@dremio.com
@tshiran
Reach out to learn what we’re up to at Dremio
(or to join the private beta…)