NoSQL and SQL databases can work together to handle real-time big data needs. Apache Drill is an open source tool that allows interactive analysis of big data using standard SQL queries across NoSQL, Hadoop, and relational data sources. It provides low-latency queries, full ANSI SQL support, and the flexibility to handle rapidly evolving schemas and data across different systems. By enabling analysis of all data through a common SQL interface, it helps tackle the challenge of combining operational and decision-support systems over large, diverse datasets.
Big Data Day LA 2016 / Big Data Track - Rapid Analytics @ Netflix LA (Updated ...) (Data Con LA)
This talk explores how Netflix equips its engineers with the freedom to find and introduce the right software for the job - even if it isn't used anywhere else in-house. Examples include how Netflix has enabled analysts to fluidly switch between MPP RDBMS and an auto-scaling Presto cluster, how Spark + NoSQL stores are used when deploying data sets to internal web apps, and how data scientists are enabled to work in the ML framework of their choosing and deploy models as a service.
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli (Spark Summit)
In the race to invent multi-million dollar business opportunities with exclusive insights, data scientists and engineers are hampered by a multitude of challenges just to make one use case a reality – the need to ingest data from multiple sources, apply real-time analytics, build machine learning algorithms, and intermix different data processing models, all while navigating around their legacy data infrastructure that is just not up to the task. This need has created the demand for Virtual Analytics, where the complexities of disparate data and technology silos have been abstracted away, coupled with a powerful range of analytics and processing horsepower, all in one unified data platform. This talk describes how Databricks is powering this revolutionary new trend with Apache Spark.
Netflix Data Engineering @ Uber Engineering Meetup (Blake Irvine)
People, Platform, Projects: these slides overview how Netflix works with Big Data. I share how our teams are organized, the roles we typically have on the teams, an overview of our Big Data Platform, and two example projects.
Multiplatform Spark solution for Graph datasources by Javier Dominguez (Big Data Spain)
This document summarizes a presentation given by Javier Dominguez at Big Data Spain about Stratio's multiplatform solution for graph data sources. It discusses graph use cases, different data stores like Spark, GraphX, GraphFrames and Neo4j. It demonstrates the machine learning life cycle using a massive dataset from Freebase, running queries and algorithms. It shows notebooks and a business example of clustering bank data using Jaccard distance and connected components. The presentation concludes with future directions like a semantic search engine and applying more machine learning algorithms.
Introduction to basic data analytics tools (Nascenia IT)
This document introduces basic data analytics tools. It discusses the data analytics pipeline of collecting, refining, storing, analyzing, and presenting data. It describes tools for each step including Requests and BeautifulSoup for data acquisition, Pandas and SQLAlchemy for data processing and storage, R and RStudio for data analysis, and Plotly and Matplotlib for data visualization. Apache Superset is highlighted as a tool for data visualization and exploration. Challenges of data analytics like data quality, privacy, and scaling are also outlined.
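As a toy illustration of that collect, refine, store, analyze, present pipeline, here is a minimal stdlib-only Python sketch. The CSV data and city fields are invented for the example; the real pipeline described above uses Requests/BeautifulSoup, Pandas, and SQLAlchemy at these steps:

```python
import csv
import io
import sqlite3

# Hypothetical raw data standing in for the collection step
# (the talk's real stack uses Requests/BeautifulSoup here).
raw = "city,temp\nLA,21\nNY,\nSF,17\nLA,23\n"

# Refine: parse the CSV and drop rows with missing values.
rows = [r for r in csv.DictReader(io.StringIO(raw)) if r["temp"]]

# Store: load the cleaned rows into a local SQLite table
# (standing in for Pandas/SQLAlchemy).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (city TEXT, temp REAL)")
db.executemany("INSERT INTO readings VALUES (?, ?)",
               [(r["city"], float(r["temp"])) for r in rows])

# Analyze: a simple aggregate per city.
avg_by_city = dict(db.execute(
    "SELECT city, AVG(temp) FROM readings GROUP BY city"))

# Present: print the results, highest average first.
for city, avg in sorted(avg_by_city.items(), key=lambda kv: -kv[1]):
    print(f"{city}: {avg:.1f}")
```

Each stage hands a cleaner artifact to the next, which is the point of treating analytics as a pipeline rather than a single script.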
At Netflix, we've spent a lot of time thinking about how we can make our analytics group move quickly. Netflix's Data Engineering & Analytics organization embraces the company's culture of "Freedom & Responsibility".
How does a company with a $40 billion market cap and $6 billion in annual revenue keep its data teams moving with the agility of a tiny company?
How do hundreds of data engineers and scientists make the best decisions for their projects independently, without the analytics environment devolving into chaos?
We'll talk about how Netflix equips its business intelligence and data engineers with:
the freedom to leverage cloud-based data tools - Spark, Presto, Redshift, Tableau and others - in ways that solve our most difficult data problems
the freedom to find and introduce the right software for the job - even if it isn't used anywhere else in-house
the freedom to create and drop new tables in production without approval
the freedom to choose when a question is a one-off, and when a question is asked often enough to require a self-service tool
the freedom to retire analytics and data processes whose value doesn't justify their support costs
Speaker Bios
Monisha Kanoth is a Senior Data Architect at Netflix, and was one of the founding members of the current streaming Content Analytics team. She previously worked as a big data lead at Convertro (acquired by AOL) and as a data warehouse lead at MySpace.
Jason Flittner is a Senior Business Intelligence Engineer at Netflix, focusing on data transformation, analysis, and visualization as part of the Content Data Engineering & Analytics team. He previously led the EC2 Business Intelligence team at Amazon Web Services and was a business intelligence engineer with Cisco.
Chris Stephens is a Senior Data Engineer at Netflix. He previously served as the CTO at Deep 6 Analytics, a machine learning & content analytics company in Los Angeles, and on the data warehouse teams at the FOX Audience Network and Anheuser-Busch.
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent) (Albert Wong)
Building a data platform doesn’t have to be like entering a portal to Stranger Things.
Join us for Tableau in the Cloud: A Netflix Original, where Albert Wong, Netflix's analytics expert, will show you how to simplify your data stack to deliver self-service analytics at scale.
Albert will discuss the details of connecting to big data, finding datasets, and discovering critical insights from visualizations. He will also share how Netflix is developing and growing their analytics ecosystem with Tableau, and how they prioritize sustaining their data culture of freedom and responsibility.
Mastering Your Customer Data on Apache Spark by Elliott Cordo (Spark Summit)
This document discusses how Caserta Concepts used Apache Spark to help a customer master their customer data by cleaning, standardizing, matching, and linking over 6 million customer records and hundreds of millions of data points. Traditional customer data integration approaches were prohibitively expensive and slow for this volume of data. Spark enabled the data to be processed 10x faster by parallelizing data cleansing and transformation. GraphX was also used to model the data as a graph and identify linked customer records, reducing survivorship processing from 2 hours to under 5 minutes.
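The matching-and-linking step can be sketched at small scale in plain Python: Jaccard similarity builds edges between likely-duplicate records, and a union-find pass extracts the connected components, mirroring what GraphX computes in a distributed fashion. The records and the 0.4 threshold below are hypothetical, not Caserta's actual rules:

```python
from itertools import combinations

# Hypothetical customer records; names and addresses are illustrative only.
records = {
    1: "jon smith 123 main st",
    2: "john smith 123 main street",
    3: "mary jones 9 oak ave",
    4: "m jones 9 oak avenue",
}

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

# Build edges between record pairs whose similarity clears a threshold.
edges = [(i, j) for i, j in combinations(records, 2)
         if jaccard(records[i], records[j]) >= 0.4]

# Union-find to extract connected components (linked customer identities).
parent = {i: i for i in records}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x
for i, j in edges:
    parent[find(i)] = find(j)

clusters = {}
for i in records:
    clusters.setdefault(find(i), []).append(i)
print(sorted(clusters.values()))  # each list is one linked customer
```

At scale, the pairwise comparison step is the expensive part, which is why Spark's parallelism and blocking strategies matter for hundreds of millions of data points.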
Building a Data Science as a Service Platform in Azure with Databricks (Databricks)
Machine learning in the enterprise is rarely delivered by a single team. In order to enable Machine Learning across an organisation you need to target a variety of different skills, processes, technologies, and maturities. To do this is incredibly hard and requires a composite of different techniques to deliver a single platform which empowers all users to build and deploy machine learning models.
In this session we discuss how Azure and Databricks enable a Data Science as a Service platform. We look at how a DSaaS platform empowers users of all abilities to build and deploy models, enabling organisations to realise a return on investment earlier.
State of Play. Data Science on Hadoop in 2015 by Sean Owen at Big Data Spain ... (Big Data Spain)
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: more data beats algorithm improvements, scale overcomes noise and sample-size effects, and formerly manual tasks can be brute-forced.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
This document discusses Saxo Bank's plans to implement a data governance solution called the Data Workbench. The Data Workbench will consist of a Data Catalogue and Data Quality Solution to provide transparency into Saxo's data ecosystem and improve data quality. The Data Catalogue will be built using LinkedIn's open source DataHub tool, which provides a metadata search and UI. The Data Quality Solution will use Great Expectations to define and monitor data quality rules. The document discusses why a decentralized, domain-driven approach is needed rather than a centralized solution, and how the Data Workbench aims to establish governance while staying lean and iterative.
How Apache Spark Changed the Way We Hire People with Tomasz Magdanski (Databricks)
As big data technology matures, you’d think there would be more talent available to hire. Although the number of people interested and engaged in the big data world has dramatically increased, demand for these skills still far outstrips supply.
Companies like Google or Facebook have access to the best talent — thousands of engineers with PhDs from the best schools, which is why they are able to innovate. How can a company close the skills gap while innovating and creating product advantage?
This talk highlights how the right technology can allow you to compete without having an army of PhDs at your disposal. At iPass, we’ve created an environment where our engineers can be empowered to create value without getting bogged down by big data and Ops challenges. As a result, we have been able to more easily recruit internal engineers to our big data team, leveraging their current expertise, while bringing them up to speed on big data projects much faster. Join this talk to learn how you can do the same for your organization.
The document discusses data engineering and compares different data stores. It motivates data engineering to gain insights from data and build data infrastructures. It describes the data engineering ecosystem and various data stores like relational databases, key-value stores, and graph stores. It then compares Amazon Redshift, a cloud data warehouse, to NoSQL databases Cassandra and HBase. Redshift is optimized for analytics with SQL and columnar storage while Cassandra and HBase are better for scalability with eventual consistency. The best data store depends on an organization's architecture, use cases, and tradeoffs between consistency, availability and performance.
Lambda Architecture and open source technology stack for real-time big data (Trieu Nguyen)
The document discusses the Lambda Architecture, which is an approach for building data systems to handle large volumes of real-time streaming data. It proposes using three main design principles: handling human errors by making the system fault-tolerant, storing raw immutable data, and enabling recomputation of results from the raw data. The document then provides two case studies of applying Lambda Architecture principles to analyze mobile app usage data and process high-volume web logs in real-time. It concludes with lessons learned, such as studying Lambda concepts, collecting any available data, and turning data into useful insights.
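The query-time merge at the heart of the Lambda Architecture can be sketched as follows. The in-memory dictionaries are stand-ins for the batch, speed, and serving layers of a real deployment (Hadoop, Storm, and a serving database in typical stacks):

```python
# Toy sketch of the Lambda Architecture's query-time merge.

# Immutable master data: raw events are only ever appended, so any
# batch view can be rebuilt from scratch after a human error.
master_events = [("page_a", 1), ("page_b", 1), ("page_a", 1)]

def recompute_batch_view(events):
    """Batch layer: recompute the view from all raw data."""
    view = {}
    for key, n in events:
        view[key] = view.get(key, 0) + n
    return view

batch_view = recompute_batch_view(master_events)

# Speed layer: incremental counts for events not yet absorbed by batch.
realtime_view = {"page_a": 2, "page_c": 1}

def query(key):
    """Serving layer: merge batch and real-time views at query time."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("page_a"))  # 2 from batch + 2 from speed layer = 4
```

The fault-tolerance claim in the talk follows directly from this shape: because the master data is immutable, a buggy batch view is fixed by recomputation rather than by patching state.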
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La... (Spark Summit)
HP ships millions of PCs, printers, and other devices every year to customers in all market segments. More customers are seeking services delivered with our products, creating new opportunities for HP to build services from the data collected from our devices. Every device we ship is an IoT endpoint with a powerful CPU that can capture rich data. Insights from this data are used internally to improve our products and focus on customer needs.
In this presentation, John will focus on HP’s journey to enabling Big Data analytics from within a large enterprise environment. He will review the challenges and how HP decided on AWS, Apache Spark and Databricks as the foundation for their entry into Big Data Analytics. John will also review how HP uses Spark to build analytic services from the data they generate from their devices.
Building the Artificially Intelligent Enterprise (Databricks)
Mike Ferguson is Managing Director of Intelligent Business Strategies Limited and specializes in business intelligence/analytics and data management. He discusses building the artificially intelligent enterprise and transitioning to a self-learning enterprise. Some key challenges discussed include the siloed and fractured nature of current data and analytics efforts, with many tools and scripts in use without integration. He advocates sorting out the data foundation, implementing DataOps and MLOps, creating a data and analytics marketplace, and integrating analytics into business processes to drive value from AI.
- The document discusses data infrastructure at an online video syndication platform that handles 2-3 million streams per day.
- It describes different tools for data storage, analytics, and real-time processing including Hadoop, Spark, MongoDB, Logstash, Elasticsearch, and Storm.
- It also discusses best practices for data collection, formatting, analysis, and using data to detect issues through a case study on investigating a rapid decline in video streams.
H2O World - Survey of Available Machine Learning Frameworks - Brendan Herger (Sri Ambati)
H2O World 2015 - Brendan Herger of Capital One
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
The document provides an overview of data engineering concepts for data scientists. It discusses the CAP theorem, which states that a distributed system cannot simultaneously provide consistency, availability, and partition tolerance. It describes various data store types and architectures that provide different balances of these properties, such as leader-follower systems that prioritize availability and consistency over partition tolerance. The document also summarizes reference architectures like Lambda and Kappa and discusses the concept of a data lake.
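A toy leader-follower pair makes the consistency trade-off concrete: reads served by an asynchronously replicated follower stay available but may be stale until the next replication round. This is an illustrative sketch, not any particular database's replication protocol:

```python
# Minimal leader-follower replication sketch (illustrative only).

class Leader:
    def __init__(self):
        self.data = {}
        self.log = []          # replication log shipped to followers
    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Follower:
    def __init__(self):
        self.data = {}
        self.applied = 0       # position reached in the leader's log
    def replicate(self, leader):
        for key, value in leader.log[self.applied:]:
            self.data[key] = value
        self.applied = len(leader.log)

leader, follower = Leader(), Follower()
leader.write("x", 1)
follower.replicate(leader)      # follower catches up
leader.write("x", 2)            # new write not yet replicated

stale_read = follower.data["x"]     # 1: follower lags the leader
fresh_read = leader.data["x"]       # 2: leader has the latest value
follower.replicate(leader)
eventual_read = follower.data["x"]  # 2: eventual consistency
```

Routing all reads to the leader restores consistency at the cost of its availability and capacity, which is exactly the kind of balance the reference architectures above are choosing between.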
A talk given by Ted Dunning in February 2013 on Apache Drill, an open-source, community-driven project to provide easy, dependable, fast, and flexible ad hoc query capabilities.
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to be complementary to them. In this presentation you will hear what is Big Data and Data Lake and what are the most popular technologies used in Big Data world. We will also speak about Hadoop and Spark, and how they integrate with traditional systems and their benefits.
Big Data Developers Moscow Meetup 1 - SQL on Hadoop (bddmoscow)
This document summarizes a meetup about Big Data and SQL on Hadoop. The meetup included discussions on what Hadoop is, why SQL on Hadoop is useful, what Hive is, and introduced IBM's BigInsights software for running SQL on Hadoop with improved performance over other solutions. Key topics included HDFS file storage, MapReduce processing, Hive tables and metadata storage, and how BigInsights provides a massively parallel SQL engine instead of relying on MapReduce.
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
Apache Drill is an open source SQL query engine for big data that provides highly flexible and high performance querying of data stored in Hadoop and NoSQL systems. It allows for ad-hoc queries on schema-less data without requiring upfront modeling or ETL. Drill uses a distributed, columnar data store and late binding to optimize query execution across systems. The project is actively developed with the goal of releasing version 1.0 in late 2014.
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence... (Perficient, Inc.)
This document discusses big data tools and trends that enable real-time business intelligence from machine logs. It provides an overview of Perficient, a leading IT consulting firm, and introduces the speakers Eric Roch and Ben Hahn. It then covers topics like what constitutes big data, how machine data is a source of big data, and how tools like Hadoop, Storm, Elasticsearch can be used to extract insights from machine data in real-time through open source solutions and functional programming approaches like MapReduce. It also demonstrates a sample data analytics workflow using these tools.
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian, from Oracle Op... (Alex Gorbachev)
Modern big data solutions often incorporate Hadoop as one of the components and require the integration of Hadoop with other components including Oracle Database. This presentation explains how Hadoop integrates with Oracle products focusing specifically on the Oracle Database products. Explore various methods and tools available to move data between Oracle Database and Hadoop, how to transparently access data in Hadoop from Oracle Database, and review how other products, such as Oracle Business Intelligence Enterprise Edition and Oracle Data Integrator integrate with Hadoop.
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ... (Pentaho)
This document discusses approaches to implementing Hadoop, NoSQL, and analytical databases. It describes:
1) The current landscape of big data databases including Hadoop, NoSQL, and analytical databases that are often used together but come from different vendors with different interfaces.
2) Common uses of transactional databases, Hadoop, NoSQL databases, and analytical databases.
3) The complexity of current implementation approaches that involve multiple coding steps across various tools.
4) How Pentaho provides a unified platform and visual tools to reduce the time and effort needed for implementation by eliminating disjointed steps and enabling non-coders to develop workflows and analytics for big data.
Swiss Big Data User Group - Introduction to Apache Drill (MapR Technologies)
This document provides an introduction and overview of Apache Drill, an open source distributed SQL query engine designed for interactive analysis of large-scale datasets. It describes Drill's architecture as being inspired by Google's Dremel, with support for standard SQL queries, pluggable data sources, and schema flexibility. Drill distributes query execution across multiple nodes to maximize data locality and parallelism. Key features highlighted include full ANSI SQL support, support for nested data, optional schemas, and extensibility points.
This document provides a summary of Oracle OpenWorld 2014 discussions on database cloud, in-memory database, native JSON support, big data, and Internet of Things (IoT) technologies. Key points include:
- Database Cloud on Oracle offers pay-as-you-go pricing and self-service provisioning similar to on-premise databases.
- Oracle Database 12c includes an in-memory option that can provide up to 100x faster analytics queries and 2-4x faster transaction processing.
- Native JSON support in 12c allows storing and querying JSON documents within the database.
- Big data technologies like Oracle Big Data SQL and Oracle Big Data Discovery help analyze large and diverse data sets from sources like
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
Apache Drill is an open source engine for interactive analysis of large-scale datasets. It provides low-latency queries using standard SQL and supports nested and hierarchical data. Drill is inspired by Google's Dremel system and provides an alternative to traditional batch processing systems like MapReduce for interactive analysis of big data.
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
Hadoop meets Agile! - An Agile Big Data Model (Uwe Printz)
The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
The document discusses the role and responsibilities of a data architect. It notes the high demand and salaries for data architects, which can exceed $200,000 at companies like Microsoft. It also outlines key technical skills for the role, including strong data modeling abilities, knowledge of databases, ETL tools, analytics dashboards, and programming languages like SQL, Python, and R. Business skills such as communication and presenting complex concepts are also important.
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle... (Rittman Analytics)
Most DBAs are aware something interesting is going on with big data and the Hadoop product ecosystem that underpins it, but aren't so clear about what each component in the stack does, what problem each part solves and why those problems couldn't be solved using the old approach. We'll look at where it's all going with the advent of Spark and machine learning, what's happening with ETL, metadata and analytics on this platform ... why IaaS and datawarehousing-as-a-service will have such a big impact, sooner than you think
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit... (Ilkay Altintas, Ph.D.)
Scientific workflows are used by many scientific communities to capture, automate and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, application-specific purpose and programmability, leading to provenance-aware archival and publication of results. This talk summarizes the varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures, and presents a methodology for workflow-driven science based on these maturing requirements.
Similar to NoSQL and SQL - Open Analytics Summit
This document summarizes cybersecurity policy issues before Congress from 2012-2014 following the Snowden leaks. It discusses key pillars debated in 2012 like critical infrastructure protection and information sharing between government and private sector. In 2013, an executive order focused on voluntary best practices and increased information sharing. The document outlines various cybersecurity bills introduced but not passed. It predicts lame duck issues in the Senate and changes in congressional committee leadership going forward. It also summarizes lessons from a crisis response exercise showing focus on critical infrastructure protection and developing cybersecurity job skills.
This document discusses how cyber intelligence can be used to combat advanced cyber adversaries. It notes that traditional computer network defense is no longer sufficient due to state-sponsored groups, hacktivists, and crime rings. Cyber intelligence involves fusing open source data, reports, and internal attack data to provide organizations threat profiles, attack timelines, and malware intelligence. This intelligence can be combined with network defense to give a broader view of adversaries and better arm organizations against advanced threats.
CDM… Where do you start? (OA Cyber Summit) (Open Analytics)
The document discusses ForeScout's network access control solution. It provides visibility into networked devices and endpoints, including those that are and aren't corporate assets. It can control access based on compliance levels, perform continuous monitoring, and share information. The solution offers user and device authentication, posture assessment, policy-based enforcement across networks and infrastructure, and integration with existing enterprise tools through an open platform. It allows network access control to be implemented gradually over time through a staged approach.
An Immigrant’s view of Cyberspace (OA Cyber Summit) (Open Analytics)
This document discusses different perspectives on cyberspace. It notes that cyberspace is constantly dynamic, pervades everything, and can be seen as another reality. The document separates cyberspace into geographic and persona layers and invites questions and comments on viewing cyberspace.
MOLOCH: Search for Full Packet Capture (OA Cyber Summit) (Open Analytics)
Moloch is an open source packet capture system built using Elasticsearch for storage and indexing and a Node.js web interface for searching. It consists of a capture process that extracts session profile information from packets and writes it to Elasticsearch, allowing the packet data and metadata to be queried and browsed through a web GUI or APIs. It is designed for scalability, supporting clustering across multiple nodes to handle large packet volumes.
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare... (Open Analytics)
The document summarizes website traffic data from the Council on Foreign Relations (CFR) website following the Boston Marathon bombings in April 2013. It found a significant surge in traffic on April 19th, with over 100,000 additional visits, focused on the page about Chechen terrorism. The traffic came from new sources like news sites and social media, and more visits from mobile devices and countries like the Netherlands and Australia. This showed that CFR was seen as an authoritative source for information on the suspected Chechen connection to the bombings.
Using Real-Time Data to Drive Optimization & Personalization (Open Analytics)
This document discusses using real-time data and machine learning techniques to optimize, segment, and personalize digital experiences. It provides examples of optimization, segmentation, and personalization. It also describes building a platform that uses various technologies like Couchbase, Spring, and MongoDB to power a real-time engine that chooses offers for customers based on their data and business rules. This platform delivers personalized experiences and content to clients to increase conversions over time as it continuously learns from customer interactions and offer history.
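The rules-plus-data offer selection described above can be sketched in a few lines of Python. The offers, rule fields, and customer attributes here are hypothetical stand-ins, not the platform's actual schema:

```python
# Hypothetical offer catalogue with simple eligibility rules.
offers = [
    {"id": "upgrade10", "min_tenure": 12, "segment": "mobile", "value": 10},
    {"id": "welcome5",  "min_tenure": 0,  "segment": "any",    "value": 5},
    {"id": "loyalty20", "min_tenure": 24, "segment": "any",    "value": 20},
]

def eligible(offer, customer):
    """Business rules: tenure threshold and segment match."""
    return (customer["tenure_months"] >= offer["min_tenure"]
            and offer["segment"] in ("any", customer["segment"]))

def choose_offer(customer):
    """Pick the highest-value offer the customer qualifies for."""
    candidates = [o for o in offers if eligible(o, customer)]
    return max(candidates, key=lambda o: o["value"])["id"] if candidates else None

print(choose_offer({"tenure_months": 14, "segment": "mobile"}))  # upgrade10
```

A production engine layers learning on top of this skeleton: the `value` field becomes a score updated from interaction and offer history, which is how the platform improves conversions over time.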
The document discusses an upcoming tech summit hosted by Bois Capital, an investment bank focusing on the technology sector. Bois Capital's managing partners have extensive experience in the telecom big data analytics sector. The summit will provide an overview of the telco analytics market and applications across various stakeholders. Recent M&A transactions in the space are also analyzed, with revenue multiples typically between 3-5x for companies under $100m in revenue. The document concludes with a case study of Bois Capital advising a Swiss mobile analytics company in its sale to Gemalto.
The document discusses how businesses can compete in the digital economy. It covers topics like using big data and analytics to gain insights, delivering superior customer experiences, and the need to act on data insights. It provides examples of how various industries like healthcare, retail, automotive and insurance can leverage digital technologies and data to improve operations and customer value. The key message is that competing in the digital world involves using data and technology to improve quality of service while maintaining operational simplicity and price competitiveness.
Piwik: An Analytics Alternative (Chicago Summit) (Open Analytics)
The document discusses Piwik, an open-source web analytics platform. It provides an alternative to Google Analytics that gives users more control and independence over their behavioral data. The summary describes how Piwik is freely available, can be hosted anywhere, has a simple interface, provides real-time reporting, and is highly customizable. It also notes that an initial Piwik installation takes around 10-20 minutes.
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An... (Open Analytics)
This document discusses using social media, cloud computing, machine learning, open source, and big data analytics to analyze Twitter data. It describes how to collect tweets using the Twitter API, classify tweets in real-time using machine learning models on AWS, store classified tweets in MongoDB on AWS, and present results. Cost estimates for real-time classification of 1 million tweets per day are provided. Use cases described include tracking food poisoning reports and disease occurrence. Future directions discussed include developing turnkey services and linking to additional open data sources.
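A drastically simplified stand-in for the real-time classification step might look like the following keyword scorer. This is an illustrative sketch, not the trained ML model the document describes, and the symptom terms and tweets are invented:

```python
# Hypothetical symptom terms for the food-poisoning use case.
FOOD_POISONING_TERMS = {"food poisoning", "stomach ache", "vomiting", "nausea"}

def classify(tweet):
    """Label a tweet as relevant if it mentions any tracked symptom."""
    text = tweet.lower()
    return "relevant" if any(t in text for t in FOOD_POISONING_TERMS) else "other"

# Stand-in for a stream arriving from the Twitter API.
stream = [
    "Terrible food poisoning after dinner last night",
    "Great game tonight!",
    "So much nausea, never eating there again",
]
labels = [classify(t) for t in stream]
print(labels)
```

In the described pipeline, the classifier runs on AWS, the labelled tweets land in MongoDB, and a trained model replaces the keyword lookup; the flow of stream in, label, store out is the same.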
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged... (Open Analytics)
This document discusses how a hospital system used big data analytics to reduce staff turnover rates and the associated costs of replacing employees. It provides data on turnover rates and replacement costs for nurses and non-nurses from 2009 to 2012. For nurses, the turnover rate went from 22.91% in 2009 to 24.01% in 2012, and the estimated replacement cost was over $14 million. For non-nurses, the turnover rate went from 21.49% to 24.53% over the same period, with a replacement cost of over $13 million in 2012. The total estimated cost of turnover for 2012 was over $27 million. The document also outlines best practices for using big data, including clearly defining objectives
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago... (Open Analytics)
The document discusses evolutions in media, marketing, and retail. It notes that media content and operations are going digital, enabling individual distribution and programmatic selling. Marketing is becoming more integrated to enable demand discovery, touchpoint messaging, and product lifecycle relationship management. Retail operations are also going digital, enabling location-based messaging, offers, and product services both online and in physical stores. Data sources are expanding to include more location data, purchase behavior data, and data from sensors. Integration is becoming the cornerstone of real-time analytics across industries.
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)Open Analytics
1. The document discusses characterizing risk in supply chains, with a focus on human trafficking in agriculture. It identifies key challenges in understanding risk, including lack of information about lower tier suppliers.
2. Using data analysis, a methodology was established to characterize risk from supplier survey responses at the country, state, and item level. This allows identification of high risk vendors and areas for mitigation.
3. Opportunities exist to enhance risk characterization by incorporating additional publicly available and paid data sources, and monitoring industry news and social media.
From Insight to Impact (Chicago Summit - Keynote)Open Analytics
This document discusses five critical pillars for the success of analytics and data science projects: 1) align with corporate strategy, 2) ignite stakeholder engagement, 3) sharpen team focus, 4) drive change management, and 5) recruit key talent. It provides guidance on each pillar, such as prioritizing analytics opportunities by their impact and horizon, understanding stakeholder incentives, avoiding "zombie" projects, enabling experiments to drive change, and pre-screening talent for technical skills and culture fit. Following these pillars can help organizations improve analytics project success rates and better compete through data-driven insights.
This document discusses how EasyBib uses data analytics to help students improve their research skills and citation quality. It analyzes student paper bibliographies and source usage over time to identify top sources and credibility trends. EasyBib developed features to warn students about source credibility, analyze citation quality, and provide analytics on source usage. This data-driven approach helped shift top sources from places like Wikipedia to more credible sources like The New York Times and CDC. The document discusses expanding these data analytics efforts through tools like Cloudant Search to further help students find better sources and evaluate source credibility in real-time.
The document discusses enabling information discovery by unifying search and data management. It provides a brief history of search engines and databases from the 1960s to present. It then proposes that search and databases could be unified by using a schema-agnostic, hierarchical data model with a universal index that can index structured and unstructured data alike. Examples of potential use cases are given, such as creating a 360-degree customer view or enabling fraud prevention. The presentation concludes by suggesting future areas could include more semantic technologies and graph traversal capabilities.
The caprate presentation_july2013_open analytics dc meetupOpen Analytics
This document discusses capitalization rates and how they relate to property income and investment returns. It mentions paying X amount for a property with Y income and how that translates into the return on the investment money. The document also notes there will be a demonstration related to capitalization rates.
Verifeed open analytics_3min deck_071713_finalOpen Analytics
The document discusses Verifeed, a company that analyzes social media conversations to provide insights for enterprises. It highlights large potential markets, example use cases showing benefits, plans to grow revenue through an initial product launch and expansion. Key points include:
- Verifeed's platform allows customers to filter social data to identify relevant information, engage customers, and make better business decisions.
- There is significant potential demand totaling billions from industries like consumer goods, sports, financial services, and more.
- Early pilots showed benefits like increased engagement for sports and identifying customer attitudes for a dog food brand.
- The company plans an initial product launch, adding customers, and expanding its capabilities and markets over time.
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
This PDF delves into the aspects of information security from a forensic perspective, focusing on privacy leaks. It provides insights into the methods and tools used in forensic investigations to uncover and mitigate privacy breaches in mobile and cloud environments.
Top 12 AI Technology Trends For 2024.pdfMarrie Morris
Technology has become an irreplaceable component of our daily lives. The role of AI in technology revolutionizes our lives for the betterment of the future. In this article, we will learn about the top 12 AI technology trends for 2024.
Keynote : Presentation on SASE TechnologyPriyanka Aash
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
Keynote : AI & Future Of Offensive SecurityPriyanka Aash
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
The History of Embeddings & Multimodal EmbeddingsZilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
How UiPath Discovery Suite supports identification of Agentic Process Automat...DianaGray10
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
Choosing the Best Outlook OST to PST Converter: Key Features and Considerationswebbyacad software
When looking for a good software utility to convert Outlook OST files to PST format, it is important to find one that is easy to use and has useful features. WebbyAcad OST to PST Converter Tool is a great choice because it is simple to use for anyone, whether you are tech-savvy or not. It can smoothly change your files to PST while keeping all your data safe and secure. Plus, it can handle large amounts of data and convert multiple files at once, which can save you a lot of time. It even comes with 24*7 technical support assistance and a free trial, so you can try it out before making a decision. Whether you need to recover, move, or back up your data, Webbyacad OST to PST Converter is a reliable option that gives you all the support you need to manage your Outlook data effectively.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
1. NoSQL and SQL Work Side-by-Side
to Tackle Real-time Big Data Needs
Allen Day
MapR Technologies
2. Me
• Allen Day
– Principal Data Scientist @ MapR
– Human Genomics / Bioinformatics
(PhD, UCLA School of Medicine)
• @allenday
• allenday@allenday.com
• aday@maprtech.com
3. You
• I’m assuming that the typical attendee:
– is a software developer
– is interested and familiar with open source
– is familiar with Hadoop, relational DBs
– has heard of or has used some NoSQL technology
4. Big Data Workloads
• Offline
– ETL
– Model creation & clustering & indexing
– Web Crawling
– Batch reporting
• Online
– Lightweight OLTP
– Classification & anomaly detection
– Stream processing
– Interactive reporting
5. What is NoSQL? Why use it?
• Traditional storage (relational DBs) is unable to
accommodate the increasing number and variety of
observations
– Culprits: sensors, event logs, electronic payments
• Solution: stay responsive by relaxing ACID storage
requirements
– Denormalize (#)
– Loosen schema (variety), loosen consistency
• This is the essence of NoSQL
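The trade-off above can be made concrete: a NoSQL-style store accepts denormalized records that each carry their own fields, where a fixed relational table would reject them. A minimal Python sketch (the field names are invented for illustration):

```python
# Denormalized, schema-flexible records: each "document" carries its own
# fields, including nested ones, instead of joining normalized tables.
events = [
    {"sensor": "s1", "temp_c": 21.5},                            # plain reading
    {"sensor": "s2", "temp_c": 19.0, "humidity": 0.43},          # extra field
    {"sensor": "s3", "payment": {"amount": 9.99, "ccy": "USD"}}, # nested field
]

# A relational table needs one fixed column set up front; a loose-schema
# store simply absorbs the union of whatever fields show up.
all_fields = set()
for doc in events:
    all_fields.update(doc)

print(sorted(all_fields))  # ['humidity', 'payment', 'sensor', 'temp_c']
```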
6. NoSQL Impact on Business Processes
• Traditional business intelligence (BI) tech stack
assumes relational DB storage
– Company decisions depend on this (reports, charts)
• NoSQL collected data aren’t in relational DB
– Data volume/variety is still increasing
– Tech and methods are still in flux
• Decoupled data storage and decision support
systems
– BI can’t access freshest, largest data sets
– Very high opportunity cost to business
7. Ideal Solution Features
• Scalable & Reliable
– Distributed replicated storage
– Distributed parallel processing
• BI application support
– Ad-hoc, interactive queries
– Real-time responsiveness
• Flexible
– Handles rapid storage and schema evolution
– Handles new analytics methods and functions
(Stack diagram: Hadoop FS; Map/Reduce and YARN; a SQL interface; extensible for NoSQL and advanced analytics)
8. From Ideals to Possibilities
• Migrate NoSQL data/processing to SQL
– High cost to marshal NoSQL data to SQL storage
– SQL systems lack advanced analytics capabilities
• Migrate SQL data to NoSQL
– Breaks compatibility for BI-dependent functions, e.g.
financial reporting
– Limited support for relational operations (joins)
• high latency
– NoSQL tech is still in flux (continuity)
• Other Approaches?
– Yes. First let’s consider a SQL/NoSQL use case
10. Example Problem: Marketing Campaign
• Jane is an analyst at an
e-commerce company
• How does she figure
out good targeting
segments for the next
marketing campaign?
• She has some ideas…
…and lots of data
(Diagram: Jane's data sources are user profiles, transaction information, and access logs)
11. Traditional System Solution 1: RDBMS
• ETL the data from
MongoDB and Hadoop
into the RDBMS
– MongoDB data must be
flattened, schematized,
filtered and aggregated
– Hadoop data must be
filtered and aggregated
• Query the data using
any SQL-based tool
(Diagram: user profiles, access logs, and transaction information flowing into the RDBMS)
12. Traditional System Solution 2: Hadoop
• ETL the data from
Oracle and MongoDB
into Hadoop
– MongoDB data must be
flattened and
schematized
• Work with the
MapReduce team to
write custom code to
generate the desired
analyses
(Diagram: user profiles, access logs, and transaction information flowing into Hadoop)
13. Traditional System Solution 3: Hive
• ETL the data from
Oracle and MongoDB
into Hadoop
– MongoDB data must be
flattened and
schematized
• But HiveQL queries are
slow and BI tool
support is limited
– Marshaling/Coding
(Diagram: user profiles, access logs, and transaction information flowing into Hive)
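All three traditional routes require flattening and schematizing the MongoDB documents first. A small Python sketch (with invented records) of why that step is lossy: a repeated field forces row duplication, which can silently distort downstream aggregates:

```python
# Flattening one nested document with a repeated field ("children")
# into fixed relational rows duplicates the parent values.
user = {"name": "Homer", "followers": 100,
        "children": [{"name": "Bart"}, {"name": "Lisa"}]}

rows = [
    (user["name"], user["followers"], child["name"])
    for child in user["children"]
]
print(rows)
# One source record became two rows; summing "followers" now
# double-counts unless every query compensates for the duplication.
```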
14. What Would Google Do?
         Distributed FS   NoSQL      Interactive analysis   Batch processing
Google:  GFS              BigTable   Dremel                 MapReduce
Hadoop:  HDFS             HBase      ???                    Hadoop MapReduce
Build Apache Drill to provide a true open source
solution to interactive analysis of Big Data
15. Apache Drill Overview
• Interactive analysis of Big Data using standard
SQL
• Fast
– Low latency queries
– Complement native interfaces and
MapReduce/Hive/Pig
• Open
– Community driven open source project
– Under Apache Software Foundation
• Modern
– Standard ANSI SQL:2003 (select/into)
– Nested data support
– Schema is optional
– Supports RDBMS, Hadoop and NoSQL
(Latency spectrum: interactive queries and reporting by data analysts, 100 ms to 20 min, served by Apache Drill; data mining, modeling, and large ETL, 20 min to 20 hr, served by MapReduce, Hive, and Pig)
16. How Does It Work?
(Architecture diagram: client tools such as Tableau, MicroStrategy, and Crystal Reports connect through the Drill ODBC driver or a Drill client to a coordinating Drillbit; its SQL query parser and query planner distribute work to executor Drillbits.)
SELECT * FROM
  oracle.transactions,
  mongo.users,
  hdfs.events
LIMIT 1
17. How Does It Work?
• Drillbits run on each node, designed to
maximize data locality
• Processing is done outside MapReduce
paradigm (but possibly within YARN)
• Queries can be fed to any Drillbit
• Coordination, query planning, optimization,
scheduling, and execution are distributed
SELECT * FROM
  oracle.transactions,
  mongo.users,
  hdfs.events
LIMIT 1
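The schema-qualified names in the query above (oracle.transactions, mongo.users, hdfs.events) are what let one SQL statement span storage systems. As a rough stand-in, SQLite's ATTACH gives the same schema.table addressing within a single engine. This is only a simulation of the query shape, not Drill itself, and the table contents are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach two extra in-memory databases to mimic separate data sources.
conn.execute("ATTACH ':memory:' AS oracle_db")
conn.execute("ATTACH ':memory:' AS mongo_db")

conn.execute("CREATE TABLE oracle_db.transactions (user_id INT, amount REAL)")
conn.execute("CREATE TABLE mongo_db.users (user_id INT, name TEXT)")
conn.execute("INSERT INTO oracle_db.transactions VALUES (1, 9.99), (2, 20.0)")
conn.execute("INSERT INTO mongo_db.users VALUES (1, 'Jane'), (2, 'Bob')")

# One query addressing two 'sources' as schema.table, the shape Drill
# exposes across Oracle, MongoDB, and HDFS.
rows = conn.execute("""
    SELECT u.name, t.amount
    FROM mongo_db.users u
    JOIN oracle_db.transactions t ON t.user_id = u.user_id
    ORDER BY u.name
""").fetchall()
print(rows)  # [('Bob', 20.0), ('Jane', 9.99)]
```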
18. Apache Drill: Key Features
• Full ANSI SQL:2003 support
– Use any SQL-based tool
• Nested data support
– Flattening is error-prone and often impossible
• Schema-less data source support
– Schema can change rapidly and may be record-specific
• Extensible
– DSLs, UDFs
– Custom operators (e.g. k-means clustering)
– Well-documented data source & file format APIs
19. How Does Impala Fit In?
Impala Strengths
• Beta currently available
• Easy install and setup on top of
Cloudera
• Faster than Hive on some queries
• SQL-like query language
Questions
• Open Source ‘Lite’
• Lacks RDBMS support
• Lacks NoSQL support beyond
HBase
• Early row materialization
increases footprint and reduces
performance
• Limited file format support
• Query results must fit in memory!
• Rigid schema is required
• No support for nested data
• SQL-like (not SQL)
Many important features are “coming soon”.
Architectural foundation is constrained. No community development.
20. Drill Status: Alpha Available July
• Heavy active development by multiple organizations
– Contributors from Oracle, IBM Netezza, Informatica, Clustrix, Pentaho
• Available
– Logical plan syntax and interpreter
– Reference interpreter
• In progress
– SQL interpreter
– Storage engine implementations for Accumulo, Cassandra, HBase and
various file formats
• Significant community momentum
– Over 200 people on the Drill mailing list
– Over 200 members of the Bay Area Drill User Group
– Drill meetups across the US and Europe
• Beta: Q3
21. Why Apache Drill Will Be Successful
Resources
• Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho
Community
• Development done in the open
• Active contributors from multiple companies
• Rapidly growing
Architecture
• Full SQL
• New data support
• Extensible APIs
• Full columnar execution
• Beyond Hadoop
Bottom Line: Apache Drill enables NoSQL and SQL
Work Side-by-Side to Tackle Real-time Big Data Needs
22. Me
• Allen Day
– Principal Data Scientist @ MapR
• @allenday
• allenday@allenday.com
• aday@maprtech.com
25. Full SQL (ANSI SQL:2003)
• Drill supports SQL (ANSI SQL:2003 standard)
– Correlated subqueries, analytic functions, …
– SQL-like is not enough
• Use any SQL-based tool with Apache Drill
– Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, …
– Standard ODBC and JDBC drivers
(Diagram: Tableau, MicroStrategy, Excel, and SAP Crystal Reports connect through the Drill ODBC driver to a Drillbit, whose SQL query parser and query planner dispatch queries to the remaining Drillbits / Drill workers.)
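The correlated subqueries called out above are exactly the kind of construct that "SQL-like" dialects often drop. A quick illustration of the construct itself, run here on SQLite as a generic ANSI-ish engine (the table and values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("east", 30.0),
                  ("west", 5.0), ("west", 25.0)])

# Correlated subquery: the inner SELECT references the outer row's region,
# so it is re-evaluated per candidate row.
rows = conn.execute("""
    SELECT region, amount
    FROM sales s
    WHERE amount > (SELECT AVG(amount) FROM sales
                    WHERE region = s.region)
    ORDER BY region
""").fetchall()
print(rows)  # each region's rows that exceed that region's own average
```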
26. Nested Data
• Nested data is becoming prevalent
– JSON, BSON, XML, Protocol Buffers, Avro, etc.
– The data source may or may not be aware
• MongoDB supports nested data natively
• A single HBase value could be a JSON document
(compound nested type)
– Google Dremel’s innovation was efficient columnar
storage and querying of nested data
• Flattening nested data is error-prone and often
impossible
– Think about repeated and optional fields at every
level…
• Apache Drill supports nested data
– Extensions to ANSI SQL:2003
Avro IDL:
enum Gender {
  MALE, FEMALE
}
record User {
  string name;
  Gender gender;
  long followers;
}
JSON:
{
  "name": "Homer",
  "gender": "Male",
  "followers": 100,
  "children": [
    {"name": "Bart"},
    {"name": "Lisa"}
  ]
}
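The warning about repeated and optional fields at every level can be made concrete: flattening a record with two independent repeated fields yields a spurious cross-product. A toy Python example with invented fields:

```python
from itertools import product

# Two independent repeated fields in one record...
doc = {"name": "u1", "emails": ["a@x", "b@x"], "phones": ["111", "222"]}

# ...flatten to len(emails) * len(phones) rows, pairing values that were
# never related to each other in the source document.
rows = [(doc["name"], e, p)
        for e, p in product(doc["emails"], doc["phones"])]
print(len(rows))  # 4 rows from 1 record
```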
27. Schema is Optional
• Many data sources do not have rigid schemas
– Schemas change rapidly
– Each record may have a different schema, may be sparse/wide
• Apache Drill supports querying against unknown schemas
– Query any HBase, Cassandra or MongoDB table
• User can define the schema or let the system discover it
automatically
– System of record may already have schema information
– No need to manage schema evolution
Row Key            | CF contents               | CF anchor
"com.cnn.www"      | contents:html = "<html>…" | anchor:my.look.ca = "CNN.com"; anchor:cnnsi.com = "CNN"
"com.foxnews.www"  | contents:html = "<html>…" | anchor:en.wikipedia.org = "Fox News"
…                  | …                         | …
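Automatic schema discovery of the kind described can be sketched as deriving the column set from the data itself and NULL-padding sparse records. A minimal Python illustration using HBase-style records (the values are invented):

```python
records = [
    {"row_key": "com.cnn.www", "contents:html": "<html>...",
     "anchor:my.look.ca": "CNN.com"},
    {"row_key": "com.foxnews.www", "contents:html": "<html>...",
     "anchor:en.wikipedia.org": "Fox News"},
]

# Discover the schema: the union of every column seen in any record.
columns = sorted({col for rec in records for col in rec})

# Project each sparse record onto the discovered schema, padding
# missing cells with None (SQL NULL).
table = [tuple(rec.get(col) for col in columns) for rec in records]
print(columns)
print(table[0])
```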
28. Flexible and Extensible Architecture
• Apache Drill is designed for extensibility
• Well-documented APIs and interfaces
• Data sources and file formats
– Implement a custom scanner to support a new source/format
• Query languages
– SQL:2003 is the primary language
– Implement a custom Parser to support a Domain Specific Language
– UDFs
• Optimizers
– Drill will have a cost-based optimizer
– Clear surrounding APIs support easy optimizer exploration
• Operators
– Custom operators can be implemented (e.g. k-Means clustering)
– Operator push-down to data source (RDBMS)
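The slide names k-means clustering as an example of a custom operator. As a sketch of the analytic such an operator would wrap (this is just the algorithm in plain Python, not Drill's operator API):

```python
def kmeans_1d(values, centers, iters=10):
    """Toy 1-D k-means: the kind of analytic one might plug into a
    query engine as a custom operator."""
    for _ in range(iters):
        # Assign each value to its nearest current center.
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Recompute each center as the mean of its cluster
        # (keeping the old center if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

print(kmeans_1d([1.0, 2.0, 10.0, 11.0], centers=[0.0, 5.0]))  # [1.5, 10.5]
```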
Editor's Notes
Emphasize previous experience in my applied domain BFX, difficulty of processing queries effectively (stratified experiments of high-dimensional genomic data).
I’m assuming that the typical attendee of this talk is a software developer who is familiar with and interested in open source technologies, is already familiar with Hadoop and relational databases, and has heard of or has some hands-on experience with NoSQL technologies.
Note correspondences between offline operation and its online counterpart
Call detail records, as we’ve been hearing about in the news around PRISM recently
Hive: compile to MR; Aster: external tables in MPP; Oracle/MySQL: export MR results to RDBMS. Drill, Impala, CitusDB: real-time
Emphasize previous experience in my applied domain BFX, difficulty of processing queries effectively (stratified experiments of high-dimensional genomic data).