http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
How Apache Spark fits into the Big Data landscape – Paco Nathan
How Apache Spark fits into the Big Data landscape http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/217858832/
2014-12-02 in Herndon, VA and sponsored by Raytheon, Tetra Concepts, and MetiStream
Big Data Analytics - Best of the Worst: Anti-patterns & Antidotes – Krishna Sankar
This document discusses best practices for big data analytics. It emphasizes the importance of data curation to ensure semantic consistency and quality across diverse data sources. It warns against simply accumulating large amounts of ungoverned data ("data swamps") without relevant analytics or business applications. Instead, it advocates taking a full stack approach by building incremental decision models and data products to demonstrate value from the beginning. The document also stresses the need for data management layers, appropriate computing frameworks, and real-time and batch analytics capabilities to enable flexible exploration and insights.
Databricks Meetup @ Los Angeles Apache Spark User Group – Paco Nathan
This document summarizes a presentation on Apache Spark and Spark Streaming. It provides an overview of Spark, describing it as an in-memory cluster computing framework. It then discusses Spark Streaming, explaining that it runs streaming computations as small batch jobs to provide low latency processing. Several use cases for Spark Streaming are presented, including from companies like Stratio, Pearson, Ooyala, and Sharethrough. The presentation concludes with a demonstration of Python Spark Streaming code.
Augury and Omens Aside, Part 1: The Business Case for Apache Mesos – Paco Nathan
The document discusses the business case for Apache Mesos and provides three key points:
1. Mesos enables orders of magnitude in cost savings over prior solutions by facilitating paradigm shifts at multiple levels of the technology stack for cluster computing.
2. Recent news includes the release of Mesos 0.19 and the announcement of the inaugural MesosCon conference.
3. Mesos addresses challenges of running mixed workloads on commodity hardware and scheduling services, which can provide more efficient utilization of computing resources than prior solutions.
Data Science with Spark - Training at SparkSummit (East) – Krishna Sankar
Slideset of the training we gave at the Spark Summit East.
Blog : https://doubleclix.wordpress.com/2015/03/25/data-science-with-spark-on-the-databricks-cloud-training-at-sparksummit-east/
The video is posted on YouTube: https://www.youtube.com/watch?v=oTOgaMZkBKQ
Graph analytics can be used to analyze a social graph constructed from email messages on the Spark user mailing list. Key metrics like PageRank, in-degrees, and strongly connected components can be computed using the GraphX API in Spark. For example, PageRank was computed on the 4Q2014 email graph, identifying the top contributors to the mailing list.
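As a rough illustration of those metrics (the talk itself uses Spark's GraphX API in Scala; this hedged sketch uses NetworkX on an invented toy reply graph instead):

```python
# Toy version of the metrics named above (PageRank, in-degree, strongly
# connected components) using NetworkX; the talk itself uses Spark GraphX.
# The senders and reply edges below are invented placeholders.
import networkx as nx

# Directed graph: an edge u -> v means "u replied to v" on the list
G = nx.DiGraph([
    ("alice", "bob"), ("carol", "bob"), ("bob", "alice"),
    ("dave", "alice"), ("alice", "carol"), ("erin", "bob"),
])

ranks = nx.pagerank(G, alpha=0.85)                 # influence scores
in_degrees = dict(G.in_degree())                   # replies received
sccs = list(nx.strongly_connected_components(G))   # mutual-reply groups

for name in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{name}: pagerank={ranks[name]:.3f} in_degree={in_degrees[name]}")
print("strongly connected components:", sccs)
```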
PyData 2015 Keynote: "A Systems View of Machine Learning" – Joshua Bloom
Despite the growing abundance of powerful tools, building and deploying machine-learning frameworks into production continues to be a major challenge, in both science and industry. I'll present some particular pain points and cautions for practitioners as well as recent work addressing some of the nagging issues. I advocate for a systems view, which, when expanded beyond the algorithms and codes to the organizational ecosystem, places some interesting constraints on the teams tasked with development and stewardship of ML products.
About: Dr. Joshua Bloom is an astronomy professor at the University of California, Berkeley where he teaches high-energy astrophysics and Python for data scientists. He has published over 250 refereed articles largely on time-domain transient events and telescope/insight automation. His book on gamma-ray bursts, a technical introduction for physical scientists, was published recently by Princeton University Press. He is also co-founder and CTO of wise.io, a startup based in Berkeley. Josh has been awarded the Pierce Prize from the American Astronomical Society; he is also a former Sloan Fellow, Junior Fellow at the Harvard Society, and Hertz Foundation Fellow. He holds a PhD from Caltech and degrees from Harvard and Cambridge University.
Big Graph Analytics on Neo4j with Apache Spark – Kenny Bastani
In this talk I will introduce you to a Docker container that provides you an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You'll learn how to analyze big data graphs that are exported from Neo4j and consequently updated from the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
Database technologies have evolved to be able to store big data, but are largely inflexible. For complex graph data models stored in a relational database there may be tedious transformations and shuffling around of data to perform large scale analysis.
Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.
Slides from Matt Dowle's presentation at H2O Open Tour: NYC
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Scalable Data Science and Deep Learning with H2O – odsc
The era of Big Data has passed, and the era of sensory overload – that is, the proliferation of sensor data – is upon us. The challenge today is how to create the next generation of business and consumer applications that transform how we interact with sensors themselves. Applications need to learn from every user interaction and data point and predict what can happen next. The future depends on Machine Learning, as much as it depends on the data itself, to change the way we interact with these systems.
In this talk, we explain H2O’s scalable distributed in-memory math architecture and its design principles. The platform was built alongside (and on top of) both Hadoop and Spark clusters and includes interfaces for R, Python, Scala, Java, JavaScript and JSON, along with its interactive graphical Flow interface that make it easier for non-engineers to stitch together complete analytic workflows. We outline the implementation of distributed machine learning algorithms such as Elastic Net, Random Forest, Gradient Boosting and Deep Learning. We will present a broad range of use cases and live demos that include world-record deep learning models, anomaly detection tools and approaches for Kaggle data science competitions. We also demonstrate the applicability of H2O in enterprise environments for real-world customer production use cases. By the end of this presentation, you will know how to create your own machine learning workflows on your data using R, Python (iPython Notebooks) or the Flow GUI.
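As a minimal, hedged sketch of the Python workflow described above (assuming a local `h2o` installation; the CSV path and column layout are placeholders, not part of the original talk):

```python
# Hedged sketch of an H2O workflow in Python; the data path and column
# roles are placeholders.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()  # start or attach to a local H2O cluster

frame = h2o.import_file("data/train.csv")      # placeholder path
train, valid = frame.split_frame(ratios=[0.8])

model = H2OGradientBoostingEstimator(ntrees=50, max_depth=5)
model.train(x=frame.columns[:-1],              # assume last column is the target
            y=frame.columns[-1],
            training_frame=train,
            validation_frame=valid)

print(model.model_performance(valid))
```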
GraphX: Graph analytics for insights about developer communities – Paco Nathan
The document provides an overview of Graph Analytics in Spark. It discusses Spark components and key distinctions from MapReduce. It also covers GraphX terminology and examples of composing node and edge RDDs into a graph. The document provides examples of simple traversals and routing problems on graphs. It discusses using GraphX for topic modeling with LDA and provides further reading resources on GraphX, algebraic graph theory, and graph analysis tools and frameworks.
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser... – Andy Petrella
Distributed Data Science…
* A genomics use case
* Spark Notebook
* Interactive Distributed Data Science
Distributed Data Science… Pipeline
* Pipeline: productizing Data Science
* Demo of Distributed Pipeline (ADAM, Akka, Cassandra, Parquet, Spark)
* Why Micro Services?
* Painful points:
* Data science is Discontiguous
* Context Lost in Translation
* Solution: Data Fellas’ Agile Data Science Toolkit
Dmitry will show the audience how to get started with MXNet and build Deep Learning models to classify images, sound, and text.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
This document contains summaries of "folk knowledge" or common sayings related to data science. Some key points include:
- Machine learning requires both data and some prior knowledge or assumptions in order to generalize beyond the training data.
- Overfitting can take many forms like high bias, high variance, or sampling bias.
- Intuition fails in high dimensions according to Bellman's curse of dimensionality.
- Feature engineering is key, and more data often beats a cleverer algorithm.
This document provides an overview of Amundsen, an open source data discovery and metadata platform developed by Lyft. It begins with an introduction to the challenges of data discovery and outlines Amundsen's architecture, which uses a graph database and search engine to provide metadata about data resources. It describes how Amundsen impacts users at Lyft by reducing time spent searching for data, and covers the project's community and future roadmap.
Talk I gave at StratHadoop in Barcelona on November 21, 2014.
In this talk I discuss the experience we gained with real-time analysis on high-volume event data streams.
Apache Spark and the Emerging Technology Landscape for Big Data – Paco Nathan
The document discusses Apache Spark and its role in big data and emerging technologies for big data. It provides background on MapReduce and the emergence of specialized systems. It then discusses how Spark provides a unified engine for batch processing, iterative jobs, SQL queries, streaming, and more. It can simplify programming by using a functional approach. The document also discusses Spark's architecture and performance advantages over other frameworks.
The document discusses how data science may reinvent learning and education. It begins with background on the author's experience in data teams and teaching. It then questions what an "Uber for education" may look like and discusses definitions of learning, education, and schools. The author argues interactive notebooks like Project Jupyter and flipped classrooms can improve learning at scale compared to traditional lectures or MOOCs. Content toolchains combining Jupyter, Thebe, Atlas and Docker are proposed for authoring and sharing computational narratives and code-as-media.
Microservices, Containers, and Machine Learning – Paco Nathan
Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.
QCon São Paulo: Real-Time Analytics with Spark Streaming – Paco Nathan
The document provides an overview of real-time analytics using Spark Streaming. It discusses Spark Streaming's micro-batch approach of treating streaming data as a series of small batch jobs. This allows for low-latency analysis while integrating streaming and batch processing. The document also covers Spark Streaming's fault tolerance mechanisms and provides several examples of companies like Pearson, Guavus, and Sharethrough using Spark Streaming for real-time analytics in production environments.
The document provides an overview of Apache Spark, including its history and key capabilities. It discusses how Spark was developed in 2009 at UC Berkeley and later open sourced, and how it has since become a major open-source project for big data. The document summarizes that Spark provides in-memory performance for ETL, storage, exploration, analytics and more on Hadoop clusters, and supports machine learning, graph analysis, and SQL queries.
Microservices, containers, and machine learning – Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets; then we run machine learning on a Spark cluster to surface insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data — based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
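To make the idea concrete, here is a toy TextRank-style sketch (not the PyTextRank API itself): rank words by PageRank over a co-occurrence graph, using NetworkX; the text and window size are arbitrary choices for the example.

```python
# Toy TextRank-style keyword ranking: PageRank over a word co-occurrence
# graph. This sketches the underlying idea, not the PyTextRank API.
import itertools
import networkx as nx

text = ("graph algorithms produce ranked keyphrases from texts and "
        "graph centrality ranks words by links to neighboring words")
words = text.split()

G = nx.Graph()
WINDOW = 3  # arbitrary co-occurrence window for this sketch
for i in range(len(words)):
    for u, v in itertools.combinations(words[i:i + WINDOW], 2):
        if u != v:
            G.add_edge(u, v)

scores = nx.pagerank(G)
print("top keywords:", sorted(scores, key=scores.get, reverse=True)[:5])
```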
This document discusses Spark, an open-source cluster computing framework. It provides a brief history of Spark, describing how it generalized MapReduce to support more types of applications. Spark allows for batch, interactive, and real-time processing within a single framework using Resilient Distributed Datasets (RDDs) and a logical plan represented as a directed acyclic graph (DAG). The document also discusses how Spark can be used for applications like machine learning via MLlib, graph processing with GraphX, and streaming data with Spark Streaming.
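A minimal PySpark sketch of that model: transformations declare the DAG lazily, and an action triggers execution (the inline data is a placeholder for a real input source).

```python
# Minimal RDD sketch: each transformation extends the DAG lazily, and
# the final action triggers scheduling and execution.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

lines = sc.parallelize(["spark builds a dag", "rdds are resilient", "a dag is lazy"])
counts = (lines.flatMap(lambda s: s.split())      # transformation: no work yet
               .map(lambda w: (w, 1))             # transformation
               .reduceByKey(lambda a, b: a + b))  # still lazy

print(counts.collect())  # action: the DAG runs here
sc.stop()
```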
OSCON 2014: Data Workflows for Machine Learning – Paco Nathan
This document provides examples of different frameworks that can be used for machine learning data workflows, including KNIME, Python, Julia, Summingbird, Scalding, and Cascalog. It describes features of each framework such as KNIME's large number of integrations and visual workflow editing, Python's broad ecosystem, Julia's performance and parallelism support, Summingbird's ability to switch between Storm and Scalding backends, and Scalding's implementation of the Scala collections API over Cascading for compact workflow code. The document aims to familiarize readers with options for building machine learning data workflows.
Future of data science as a profession – Jose Quesada
How can you thrive in a future where machine learning has been popular for a few years already?
In this talk, I will give you actionable advice from my experience training serious data scientists at our retreat center in Berlin. You are going to face these pointy, hard questions:
- What is the promise of machine learning? Has it happened yet?
- Is it easy to take advantage of machine learning, now that most algorithms are nicely packaged in APIs and libraries?
- How much time should I spend getting good at machine learning? Am I good enough now?
- Are data scientists going to be replaced by algorithms? Are we all?
- Is it easy to hire talent in machine learning after the explosion of MOOCs?
Big data & data science challenges and opportunitiesJose Quesada
This document discusses big data and data science challenges and opportunities. It provides background on the author, Jose Quesada, and outlines five key challenges companies face: 1) obtaining data from end users, 2) creating a data-driven culture, 3) finding data talent, 4) breaking down data silos within companies, and 5) addressing hype around big data. The document then provides three opportunities for companies: 1) measuring their data maturity, 2) identifying the value they want from data, and 3) finding stakeholders within the company who would benefit most from increased data use. Throughout, the author advocates starting small with available data rather than waiting for "big data" to extract business value.
Bruce Kasanoff - "Bring out the talent in other people"
Bruce Kasanoff is a speaker and ghostwriter whose credo is to be generous, expert, trustworthy, clear, open-minded, adaptable, persistent and present. He speaks and writes about how to bring out talent in other people and has spoken at many corporate and university events and for professional associations. His website kasanoff.com provides more information on topics like employee engagement, customer engagement, motivation, work-life balance, and training.
How Apache Spark fits into the Big Data landscape – Paco Nathan
Boulder/Denver Spark Meetup, 2014-10-02 @ Datalogix
http://www.meetup.com/Boulder-Denver-Spark-Meetup/events/207581832/
Apache Spark is intended as a general purpose engine that supports combinations of Batch, Streaming, SQL, ML, Graph, etc., for apps written in Scala, Java, Python, Clojure, R, etc.
This talk provides an introduction to Spark — how it provides so much better performance, and why — and then explores how Spark fits into the Big Data landscape — e.g., other systems with which Spark pairs nicely — and why Spark is needed for the work ahead.
CO Data Science - Workshop 1: Probability Distributions – Jared Polivka
As the first session in this four part series, the discussion will be aimed at getting everyone on the same page for later sessions.
We will look at mathematical notation, probability, expectation, variance, and end this session with common probability distributions and use cases.
Slides created by and workshop taught by:
Josh Bernard, Associate Data Science Instructor at Galvanize
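For a flavor of the session topics, a small sketch comparing the analytic expectation and variance of some common distributions against sample estimates (NumPy and SciPy assumed installed):

```python
# Sketch of the session topics: expectation and variance of common
# distributions, checked against sample estimates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

for name, dist in [("normal", stats.norm(loc=0, scale=1)),
                   ("poisson", stats.poisson(mu=4)),
                   ("binomial", stats.binom(n=10, p=0.3))]:
    sample = dist.rvs(size=10_000, random_state=rng)
    print(f"{name}: E[X]={dist.mean():.2f} (sample {sample.mean():.2f}) "
          f"Var[X]={dist.var():.2f} (sample {sample.var():.2f})")
```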
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming – Paco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
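A minimal D-Stream sketch of that micro-batch model, assuming a socket source on a placeholder host/port (feed it with something like `nc -lk 9999`):

```python
# Minimal D-Stream sketch: the same word-count logic as a batch job,
# applied to 1-second micro-batches read from a socket.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-sketch")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda s: s.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # emit each micro-batch's counts

ssc.start()
ssc.awaitTermination()
```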
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Rental Cars and Industrialized Learning to Rank with Sean Downes – Databricks
Data can be viewed as the exhaust of online activity. With the rise of cloud-based data platforms, barriers to data storage and transfer have crumbled. The demand for creative applications and learning from those datasets has accelerated. Rapid acceleration can quickly accrue disorder, and disorderly data design can turn the deepest data lake into an impenetrable swamp.
In this talk, I will discuss the evolution of the data science workflow at Expedia with a special emphasis on Learning to Rank problems. From the heroic early days of ad-hoc Spark exploration to our first production sort model on the cloud, we will explore the process of industrializing the workflow. Layered over our story, I will share some best practices and suggestions on how to keep your data productive, or even pull your organization out of the data swamp.
This document provides an overview of a Hadoop session that will cover:
1. An introduction to big data including the history and evolution of Hadoop and how it addresses challenges with traditional databases.
2. The Hadoop architecture and ecosystem including components like HDFS, MapReduce, HBase and how they address issues with scalability, flexibility and cost compared to traditional databases.
3. Hands-on analysis of a soccer dataset using Hadoop to perform tasks like data classification, prediction and player analysis.
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015 – Big Data Spain
This document discusses trends in data science in 2016, including how data science is moving into new use cases such as medicine, politics, government, and neuroscience. It also covers trends in hardware, generalized libraries, leveraging workflows, and frameworks that could enable a big leap ahead. The document discusses learning trends like MOOCs, inverted classrooms, collaborative learning, and how O'Reilly Media is embracing Jupyter notebooks. It also covers measuring distance between learners and subject communities, and the importance of both people and automation working together.
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
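As a compressed, hedged sketch of those steps (gather, prepare, explore, model, validate), kept self-contained by using a dataset bundled with scikit-learn:

```python
# Compressed sketch of the data science steps listed above.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_iris(as_frame=True)          # gather
df: pd.DataFrame = data.frame            # prepare
print(df.describe())                     # explore

X_train, X_test, y_train, y_test = train_test_split(
    df[data.feature_names], df["target"], test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)  # model
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))       # validate
```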
The document discusses big data and provides examples of how it can be collected and analyzed. It describes a master's thesis that collected 74,000 Dutch news articles over 2 months to analyze rare content. It also describes a bachelor's thesis that automated the coding of tweets to determine the tone politicians used when referring to opponents. The document outlines the typical process of collecting, storing, and analyzing big data and describes the infrastructure used in the workshop to collect Twitter tweets, news articles, and web snapshots.
Making the Most of In-Memory: More than Speed – Inside Analysis
The document discusses how in-memory platforms are more than just speed - they are designed to efficiently exploit RAM and are optimized for analytics workloads that involve complex "crunching" of data. It explains that analytics workloads are CPU-intensive and benefit from techniques like parallelization across CPU cores. Additionally, the document notes that declining RAM prices and interest in advanced analytics are driving more adoption of in-memory platforms for both large and small data use cases.
Big Data Analysis: Deciphering the haystack – Srinath Perera
A primary outcome of Big Data is to derive useful and actionable insights from large or challenging data collections. The goal is to run the transformations from data, to information, to knowledge, and finally to insights. This includes calculating simple analytics like mean, max, and median, deriving an overall understanding of the data by building models, and finally deriving predictions from the data. In some cases we can afford to wait while data is collected and processed, while in other cases we need to know the outputs right away. MapReduce has been the de facto standard for data processing, and we will start our discussion from there. However, that is only one side of the problem. There are other technologies like Apache Spark and Apache Drill gaining ground, and also real-time processing technologies like stream processing and complex event processing. Finally, there is a lot of work on porting decision technologies like machine learning into the big data landscape. This talk discusses big data processing in general and looks at each of those different technologies, comparing and contrasting them.
Data Science Institutes: Kelly Technologies is a Data Science training institute in Hyderabad, providing Data Science training by real-time faculty.
Big Data made easy in the era of the Cloud – Demi Ben-Ari
Talking about the ease of use and handling of Big Data technologies in the Cloud, using Google Cloud Platform and Amazon Web Services and all of the tools around them.
Showing the problems and how we can solve them with simple tools.
Advanced Analytics and Machine Learning with Data Virtualization – Denodo
Watch here: https://bit.ly/3719Bi7
Advanced data science techniques, like machine learning, have proven to be an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala put advanced techniques at the fingertips of the data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- About the success McCormick has had as a result of seasoning the Machine Learning and Blockchain Landscape with data virtualization
The document outlines an agenda for a conference on Apache Spark and data science, including sessions on Spark's capabilities and direction, using DataFrames in PySpark, linear regression, text analysis, classification, clustering, and recommendation engines using Spark MLlib. Breakout sessions are scheduled between many of the technical sessions to allow for hands-on work and discussion.
The document discusses several computer science topics including data science, artificial intelligence, and cloud computing. It notes that data science has grown in popularity from 2012-2017 due to an ability to better process large volumes of data using statistics, specialized hardware, and contributions from companies. Artificial intelligence aims to develop machines that can think and learn like humans, and this field has accelerated in recent years with improved data processing and hardware.
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data – Cloudera, Inc.
This document discusses how Cloudera Enterprise Data Hub (EDH) can be used for advanced analytics. EDH allows users to perform diverse concurrent analytics on large datasets without moving the data. It includes tools for machine learning, graph analytics, search, and statistical analysis. EDH protects data through security features and system change tracking. The document argues that EDH is the only platform that can support all these analytics capabilities in a single, integrated system. It provides several examples of how advanced analytics on EDH have helped organizations like the government address important problems.
In AI, it's all about the data. But it's hard to get the data, and to get *good* data with provenance. This talk shows how blockchains can help, with real-world examples including:
- a data exchange for self-driving car data (with Toyota Research and others)
- pooling designs for 3d printing fraud detection (with Innogy and others)
- and AI DAOs: AIs that can accumulate wealth
This was given as an invited talk at Consensus 2017, May 22 in NYC.
Measure All the Things! - Austin Data Day 2014 – gdusbabek
The document discusses the importance of metrics and measurement in business. It notes that as data generation increases exponentially, metadata and time-series data provide valuable insights. Different types of metrics are described like gauges, counters, timers and histograms. Methods for collecting metrics are outlined including instrumenting software and systems. The key is to measure the right things that matter to the business and help inform decisions.
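A toy sketch of those metric types follows; a real system would use an instrumentation library, but the shapes are the same:

```python
# Toy versions of the metric types mentioned above: counter, gauge,
# timer, histogram.
import time
from collections import Counter

counts = Counter()   # counter: monotonically increasing tallies
gauges = {}          # gauge: a point-in-time reading, overwritten each update
latencies = []       # raw timer samples, later summarized as a histogram

def handle_request():
    start = time.monotonic()
    time.sleep(0.01)                    # stand-in for real work
    counts["requests"] += 1             # counter increment
    gauges["queue_depth"] = 3           # gauge update (placeholder reading)
    latencies.append(time.monotonic() - start)  # timer sample

for _ in range(20):
    handle_request()

latencies.sort()
p95 = latencies[int(0.95 * (len(latencies) - 1))]  # crude histogram quantile
print(counts["requests"], gauges["queue_depth"], f"p95={p95 * 1000:.1f} ms")
```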
This document summarizes a presentation on big data trends and open data. It introduces the speaker, Jongwook Woo, and his experience in big data. It then covers topics including what is big data, Hadoop and Spark frameworks, using open data for analysis, and examples of analyzing Twitter data on AlphaGo and government airline and crime data sets.
Similar to GalvanizeU Seattle: Eleven Almost-Truisms About Data (20)
Human in the loop: a design pattern for managing teams working with ML – Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
Human-in-the-loop: a design pattern for managing teams that leverage ML – Paco Nathan
Strata Singapore 2017 session talk 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
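A minimal sketch of that active-learning loop, with the human expert simulated by the held-back labels (scikit-learn assumed; the batch size and number of rounds are arbitrary):

```python
# Minimal active-learning loop: the model runs automatically, while the
# least-confident cases are routed to an "expert" whose labels feed the
# next training round. The oracle is simulated via y; in a real HITL
# pipeline it would be a person.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
labeled = np.arange(20)      # small seed set of labeled examples
pool = np.arange(20, 500)    # unlabeled pool

for round_num in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    confidence = model.predict_proba(X[pool]).max(axis=1)
    exceptions = pool[np.argsort(confidence)[:10]]   # least-confident cases
    labeled = np.concatenate([labeled, exceptions])  # expert labels folded in
    pool = np.setdiff1d(pool, exceptions)
    print(f"round {round_num}: labeled={len(labeled)} "
          f"mean confidence={confidence.mean():.3f}")
```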
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time:
* In what ways do the humans involved learn from the machines?
* In particular, we’ll examine use cases at O’Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Human-in-a-loop: a design pattern for managing teams which leverage ML – Paco Nathan
Human-in-a-loop: a design pattern for managing teams which leverage ML
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Humans in a loop: Jupyter notebooks as a front-end for AI – Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
Humans in the loop: AI in open source and industry – Paco Nathan
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
O'Reilly Media has experimented with different uses of Jupyter notebooks in their publications and learning platforms. Their latest approach embeds notebooks with video narratives in online "Oriole" tutorials, allowing authors to create interactive, computable content. This new medium blends code, data, text, and video into narrated learning experiences that run in isolated Docker containers for higher engagement. Some best practices for using notebooks in teaching include focusing on concise concepts, chunking content, and alternating between text, code, and outputs to keep explanations clear and linear.
2. Set and Setting:
Almost a Dozen Almost-Truisms about Data …
to consider when embarking on a journey
into Data Science
There are a number of preconceptions about
working with data at scale, where the realities
beg to differ
We’ll crank this number up to eleven – even
though the actual number is of course much
larger; that’s perhaps for another day
3. Almost a Dozen Almost-Truisms about Data …
to consider when embarking on a journey
into Data Science
Let’s discuss some less-intuitive directions,
along with likely consequences and corollaries
This is not intended to prove a set of points,
rather to provide a set of launching points
Set and Setting:
5. The rates of data being stored and analyzed
jumped quite dramatically in the late 1990s
to early 2000s … partly because storage
became incredibly cheap … partly because
internetworked machines suddenly started
producing much more machine data
Fifteen years later, the rates jump again, this
time by orders of magnitude … Because IoT
It’s almost like this thing has a pulse?
#01: Because Rates
6. In other words, to paraphrase von Schelling,
experience precedes analysis
Typically, we’re swimming in data, and we tend
to respond by struggling to understand its
structure and dynamics
That, in contrast to the myth that our analysis
drives data collection
#01: Because Rates
7. Four independent teams were working toward horizontal
scale-out of workflows based on commodity hardware
This effort prepared the way for huge Internet successes during
the 1997 holiday season…
AMZN, EBAY, Inktomi (YHOO Search), then GOOG
MapReduce on clusters of commodity hardware and the
Apache Hadoop open source stack emerged from this context
#01: Because Rates – 1997 Q3 Inflection Point
Amazon
“Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay
“The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search)
“Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtu.be/E91oEn1bnXM
Google
“Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtu.be/qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
#01: Because Rates – 1997 Q3 Inflection Point
9. [Architecture diagram, circa 2001: Web Apps (middleware, servlets) handle customer transactions against an RDBMS; SQL queries return result sets; logs of event history feed aggregation and dashboards; DW/ETL plus algorithmic modeling produce recommenders and classifiers whose models flow back into the web apps, serving Product, Engineering, UX, and stakeholder customers]
#01: Because Rates – Circa 2001, post e-commerce success
10. [The same architecture diagram, with the algorithmic modeling, recommenders, and classifiers loop annotated as “data products”]
#01: Because Rates – Circa 2001, post e-commerce success
11. Primary sources for the notion:
Cleveland, W. S., “Data Science: an Action Plan for Expanding the Technical Areas of the Field of Statistics,” International Statistical Review (2001), 69, 21-26.
http://cm.bell-labs.com/stat/doc/datascience.ps
Breiman, L., “Statistical modeling: the two cultures,” Statistical Science (2001), 16:199-231.
http://projecteuclid.org/euclid.ss/1009213726
…also good to mention John Tukey
#01: Because Rates – Whither Data Science?
12. Rashomon, the 1950 Japanese period drama
by Akira Kurosawa, symbolizes a long-standing
tension in Statistics, one which Mark Twain
described ever so succinctly…
wikipedia.org/wiki/Rashomon:
“The film is known for a plot device
which involves various characters
providing alternative, self-serving
and contradictory versions of the
same incident.”
#01: Because Rates – A Sea Change
13. Because IoT! (exabytes/day per sensor)
bits.blogs.nytimes.com/2013/06/19/g-e-makes-the-machine-and-then-uses-sensors-to-listen-to-it/
#01: Because Rates – A Sea Change, Redux
17. Businesses want to join the 21c.,
and level up to streaming analytics
“I saw what you did … in batch,”
now performed a zillion times faster
#02: Batch Defenestration – Infrastructure, Remodeled
[Chart: Contributors per Month to Spark, 2011–2015, rising from near zero to roughly 100 per month]
Most active project at Apache,
More than 500 known production deployments
21. Trending interests:
• electric cars
• organic farm-to-table cuisine
• permaculture
• sustainable urbanism
#03: Circa 1904
22. Speaking of batch windows…
The last century or two of statistics
represent an extremely huge mess
Let’s start the clock over, then move
forward into a more real-time near-future
#03: Circa 1904
23. #03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
Probability got going, formally, in the 16th c. –
although interesting mathematical estimations
trace back to classical times
Arabs in the 9th c. used frequency analysis –
later rediscovered by Europeans during the
early Italian Renaissance
Statistics followed, originally more about what
we might call demographics – through 18th c.
24. Laplace, Gauss, et al., bridged prob & stats in the
late 18th c. using distributions (what we studied
in Stats 101) to infer the probability of errors
in estimates
Much of the 19th/20th c. work was about using
goodness of fit tests, etc., justifying some distribution
• generally speaking, these require samples
• which, in turn, implies batch windows
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
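For instance, a classical goodness-of-fit check needs the whole sample in hand before it can run; a small sketch with SciPy:

```python
# A classical goodness-of-fit check (Kolmogorov-Smirnov) needs the full
# sample in hand before it can run: the batch window, in miniature.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=1_000)  # the collected batch

stat, p_value = stats.kstest(sample, "norm")  # test against standard normal
print(f"KS statistic={stat:.4f} p-value={p_value:.3f}")
```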
25. While 19th/20th c. stats work focused on defensibility,
21st c. work, w.r.t. Big Data apps, focuses more
on predictability – plus there’s a shift in how we make
estimates…
BTW, doesn’t it seem weird to crunch through piles
of data in large batch jobs, at large expense, when
the results ultimately get used to approximate features?
Why not perform that in-stream?
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
26. A fascinating, relatively new area pioneered by
relatively few people – e.g., Philippe Flajolet
Provides approximation with error bounds using
far fewer resources (RAM, CPU, etc.)
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
27.
| algorithm | use case | example |
| --- | --- | --- |
| Bloom Filter | set membership | code |
| MinHash | set similarity | code |
| HyperLogLog | set cardinality | code |
| Count-Min Sketch | frequency summaries | code |
| DSQ | streaming quantiles | code |
| SkipList | ordered sequence search | code |
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
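As a minimal sketch of the first row in the table above, a toy Bloom filter: k hash functions set k bits, so membership tests may yield false positives but never false negatives.

```python
# Toy Bloom filter for the "set membership" row above.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0  # a plain integer used as the bit array

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
bf.add("spark")
print("spark" in bf, "hadoop" in bf)  # True, (almost certainly) False
```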
28. E.g., ±4% could buy you two orders of magnitude
reduction in the required memory footprint for
an analytics app
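A back-of-the-envelope for that trade-off, using HyperLogLog's standard error of roughly 1.04/√m for m registers (the byte-per-register sizing is a simplification):

```python
# Back-of-the-envelope for the trade-off above, using HyperLogLog's
# standard error of about 1.04 / sqrt(m) for m registers.
import math

target_error = 0.04                        # the ±4% in the example
m = math.ceil((1.04 / target_error) ** 2)  # ≈ 676 registers
m = 2 ** math.ceil(math.log2(m))           # round up to a power of two: 1024

# At roughly a byte per register, that is ~1 KB of state, regardless of
# how many distinct items flow past; an exact set of millions of keys
# would need megabytes, hence the orders-of-magnitude reduction.
print(f"registers: {m}, memory: ~{m} bytes")
```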
OSS projects such as Algebird and BlinkDB
provide for this newer approach to the math of
approximations at scale
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
30. IMO, many notions of “API” are illusions
Arguably, reductionist shell games
And that imposes limitations on how we
work, and even how we think…
#04: Your API is an Illusion
32. On the other hand, Physics
does well to teach modeling –
I like to hire physicists to work
on Data teams…
They tend to get the interdisciplinary aspects:
got the math background, coding experience,
generally good at systems engineering, etc.
Not saying we must all rush out to get Physics
degrees – there’s something to be learned there,
vital for the work and priorities ahead
#04: Your API is an Illusion – The Interzone
33. “The impact of computing extends far beyond
science… affecting all aspects of our lives.
To flourish in today's world, everyone needs
computational thinking.” – Jeannette Wing, CMU
Computing now ranks alongside the proverbial
Reading, Writing, and Arithmetic…
Center for Computational Thinking @ CMU
http://www.cs.cmu.edu/~CompThink/
Exploring Computational Thinking @ Google
https://www.google.com/edu/computational-thinking/
#04: Your API is an Illusion – Antidote: Computational Thinking
35. Even so, do we really need to
write code for WordCount
10^N times?
#05: Code Inceptionism
36. Inceptionism: Going Deeper into
Neural Networks
Alexander Mordvintsev,
Christopher Olah, Mike Tyka
Google (2015-06-17)
googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
Artificial Neural Networks have spurred remarkable recent
progress in image classification and speech recognition. But
even though these are very useful tools based on well-known
mathematical methods, we actually understand surprisingly
little of why certain models work and others don’t. So let’s
take a look at some simple techniques for peeking inside
these networks.
#05: Code Inceptionism
37. Imagine data mining GitHub commit
histories of popular open source projects,
then applying genetic programming to
evolve patches for other OSS projects...
In other words, brilliant:
Sidebar: Claire Le Goues, automating software repair
Claire Le Goues
cmu.edu
GenProg: A Generic Method for Automatic
Software Repair
Claire Le Goues, ThanhVu Nguyen,
Stephanie Forrest, Westley Weimer
IEEE TSE (2012)
www.cs.cmu.edu/~clegoues/docs/legoues-tse-genprog12.pdf
We describe the algorithm and report experimental
results of its success on 16 programs totaling 1.25M
lines of C code and 120K lines of module code,
spanning eight classes of defects, in 357 seconds,
on average. We analyze the generated repairs
qualitatively and quantitatively to demonstrate
that the process efficiently produces evolved
programs that repair the defect, are not fragile
input memorizations, and do not lead to serious
degradation in functionality.
#05: Code Inceptionism
39. Are databases going extinct?
Distributed file systems that can be accessed
as column stores are generally quite useful
There’s an old saying in Computer Science:
it’s difficult to distinguish a really good file
system from a database, and vice versa
#06: Database Extinction?
40. Original definitions for what became relational
databases had less to do with dedicated SQL
products, and more in common with something like
Spark SQL:
A relational model of data for
large shared data banks
Edgar Codd
Communications of the ACM (1970)
dl.acm.org/citation.cfm?id=362685
#06: Database Extinction?
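For instance, a minimal Spark SQL session treats a DataFrame as a Codd-style shared data bank queried relationally – assuming a local pyspark install, with made-up table and column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-banks").getOrCreate()

    # an illustrative "shared data bank"
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 36)], ["name", "age"])
    df.createOrReplaceTempView("people")

    # queried relationally, no dedicated SQL product required
    spark.sql("SELECT name FROM people WHERE age > 35").show()
    spark.stop()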
41. #06: Database Extinction?
[diagram: the Tungsten stack – Python, SQL, R, and Streaming
atop the DataFrame API and advanced analytics, with Tungsten
physical execution underneath: CPU-efficient data structures,
keeping data close to the CPU cache]
43. Consider: matrices, pivot tables, etc.
Our thinking about data representation
is often quite two-dimensional…
#07: “N Dims good, 2 Dims baa-d”
44. • many real-world problems are often
represented as graphs
• graphs can generally be converted into sparse
matrices (bridge to linear algebra)
• eigenvectors find the stable points in
a system defined by matrices – which
may be more efficient to compute
• beyond simpler graphs, complex data
may require work with tensors
#07: “N Dims good, 2 Dims baa-d”
45. Suppose we have a graph as shown below:
[diagram: a graph on vertices u, v, w, x]
We call x a vertex (sometimes called a node)
An edge (sometimes called an arc) is any line
connecting two vertices
#07: “N Dims good, 2 Dims baa-d”
46. We can represent this kind of graph as an
adjacency matrix:
• label the rows and columns based
on the vertices
• entries get a 1 if an edge connects the
corresponding vertices, or 0 otherwise
   u  v  w  x
u  0  1  0  1
v  1  0  1  1
w  0  1  0  1
x  1  1  1  0
#07: “N Dims good, 2 Dims baa-d”
47. An adjacency matrix always has certain
properties:
• it is symmetric, i.e., A = AT
• it has real eigenvalues
Therefore algebraic graph theory bridges
between linear algebra and graph theory
#07: “N Dims good, 2 Dims baa-d”
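A quick numpy check of those properties on the u, v, w, x graph above (an illustrative sketch, not part of the original deck):

    import numpy as np

    # the adjacency matrix for the u, v, w, x graph
    A = np.array([
        [0, 1, 0, 1],   # u
        [1, 0, 1, 1],   # v
        [0, 1, 0, 1],   # w
        [1, 1, 1, 0],   # x
    ])

    print(np.array_equal(A, A.T))  # True: A is symmetric
    print(np.linalg.eigvalsh(A))   # real eigenvalues, since A = A^T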
48. Tensors are a good way to handle time-series,
geo-spatially distributed, linked data
with lots of N-dimensional attributes
In other words, potentially a general case
for handling much of the data that we’re
likely to encounter
#07: “N Dims good, 2 Dims baa-d”
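For instance (a toy sketch with made-up dimensions), a 3-way numpy tensor can index time × location × attribute, and unfolding it bridges back to linear algebra:

    import numpy as np

    # a toy 3-way tensor: (time step, location, attribute)
    T = np.zeros((24, 100, 5))   # 24 hours x 100 sites x 5 measurements
    T[8, 42, 0] = 17.3           # site 42's first attribute at hour 8

    # unfolding (matricizing) the tensor yields an ordinary matrix
    M = T.reshape(24, -1)        # 24 x 500, one row per time step
    print(M.shape)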
49. Although tensor factorization is considered
problematic, it may provide more general-case
solutions:
The Tensor Renaissance in Data Science
Anima Anandkumar @UC Irvine
radar.oreilly.com/2015/05/the-tensor-renaissance-in-data-science.html
Spacey Random Walks and
Higher Order Markov Chains
David Gleich @Purdue
slideshare.net/dgleich/spacey-random-walks-and-higher-order-markov-chains
#07: “N Dims good, 2 Dims baa-d”
51. There is Science … and there is Data
Data Science is largely about interdisciplinary
teams, largely about crossing boundaries
(organizational, cognitive) that might otherwise
preclude arriving at crucial insights –
In other words, about learning
It’s also about the repeatability and predictive
aspects of science, where workflows combine
people + automation
NB: may conflict with large portions of academia
which tend to decontextualize subjects
#08: Science … and Data
52. The Science in Data Science tends to rely on
the phenomenology and modeling of complex
systems (did we already mention Physics?)
Speaking of science and predictions, two
important works to include:
• Charles Sanders Peirce – one of the
most prolific scientists in the US, and also
one of its fiercest critics (abduction,
etc.)
• Karl Popper – who articulated some
of the inherent risks of mixing “science”,
“history”, and politics
#08: Science … and Data
53. For excellent examples of Science and Data
together, see CodeNeuro, particularly for
use of notebooks:
#08: Science … and Data
55. Learning Curves are forever –
the part you need to manage
more carefully than just about
anything else, especially within
a social context
In some sense, this is the essence of
Data Science: How well do you
learn?
Much of the risk in managing
a Data Science team is about
budgeting for learning curves
#09: Learning Curves are Forever
56. In contrast, IT has a long history of
practicing a flavor of engineering
“conservatism”: highly structured
process, strictly codified practices
People learn a few things well, then
avoid having to struggle with learning
many new things perpetually…
That leads to enormous teams and
low ROI, among other badness
#09: Learning Curves are Forever
57. Throw Your Life a Curve
Whitney Johnson
blogs.hbr.org/johnson/2012/09/throw-your-life-a-curve.html
Aggressively Pro-Active Learning:
• deconstruction of the cognitive bias One Size Fits All
• “makes a compelling case for personal disruption”
• “plan your career around learning curves”
• hire people who learn/re-learn efficiently
#09: Learning Curves are Forever
58. #09: Learning Curves are Forever
Education is more than just lessons, exams,
certifications, instructor evaluations, etc., …
though some tools would try to reduce it
to that level
What’s even more interesting is to leverage
ML to understand the “distance” between
the learner and the subject material
60. Speaking as a former alt bookstore owner…
Sadly, we don’t use books quite as much
these days:
• above ~35: buy it on Kindle
• below ~35: watch it on YouTube
#10: Books, not so much, sadly…
61. From a publisher’s perspective, consider
some of the risks:
• fewer people buy the titles
• search engines surface oh-so-much noise
• increasingly, it’s more difficult for experts
to take time to author good content and
keep it updated
#10: Books, not so much, sadly…
[chart: Contributors per Month to Spark, 2011–2015 –
most active project at Apache, with more than 500 known
production deployments]
62. However, it’s unlikely that Kindle, etc.,
represent the be-all and end-all of publishing…
Here’s an idea: your next “book” or
“video” should be able to compute
something useful
#10: Books, not so much, sadly…
63. Interactive notebooks: Sharing the code
Helen Shen
Nature (2014-11-05)
nature.com/news/interactive-notebooks-sharing-the-code-1.16261
#10: Books, not so much – Repeatable Science
64. Embracing Jupyter Notebooks at O'Reilly
Andrew Odewahn, 2015-05-07
https://beta.oreilly.com/ideas/jupyter-at-oreilly
“O'Reilly Media is using our Atlas platform to
make Jupyter Notebooks a first class authoring
environment for our publishing program.”
Jupyter, Thebe, Docker, etc.
#10: Books, not so much – Something Borrowed, Something New
67. MOOCs have become popular, some are
quite useful … even so, these tend to have
a very low completion rate
Don’t hold your breath waiting for MOOCs
to replace other modes of education
Learning generally requires a social context:
for reinforcement, peer insights/modeling,
and frankly some people really feel a need
to be given permission to learn
#11: A MOOCish Edumacation?
68. One problem with university study is that
disciplines tend to decontextualize
GalvanizeU is a rare opportunity in that way:
accredited, with contextualized hands-on
experience
#11: A MOOCish Edumacation?
69. A significant improvement may be
found in the notion of “flipped”
or inverted classrooms
For a good example, see:
Caltech Offers Online Course with
Live Lectures in Machine Learning
Yaser Abu-Mostafa (2012-03-30)
http://www.caltech.edu/news/caltech-offers-online-course-live-lectures-machine-learning-4248
#11: A MOOCish Edumacation?
70. So a good bit of advice about learning and
Data Science … is to invert your classrooms,
recontextualize, cross the boundaries to do
things that matter, and leverage the hands-on
social aspects of learning
Like here at GalvanizeU
Summary…
74. After we’ve cleaned up the data, formulated workflows
in terms of monoids, used graph representations, and
parallelized with a wealth of linear algebra, much of
the heavy lifting that remains on the clusters is in
optimization
For example, deep learning @Google
uses many layers of neural nets trained
with gradient descent optimization
Taming Latency Variability and Scaling Deep Learning
Jeff Dean @Google (2013)
youtu.be/S9twUcX1Zp0
Vector Quantization:
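For reference, the core of that optimization fits in a few lines – a toy stochastic gradient descent fitting a line with numpy (data, learning rate, and epoch count are illustrative):

    import numpy as np

    # toy data drawn around y = 3.0 * x + 0.5, with noise
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 1000)
    y = 3.0 * x + 0.5 + rng.normal(0, 0.1, 1000)

    w, b, lr = 0.0, 0.0, 0.1
    for _ in range(20):                       # epochs
        for i in rng.permutation(len(x)):     # one sample per step
            err = (w * x[i] + b) - y[i]
            w -= lr * err * x[i]              # grad of squared error w.r.t. w
            b -= lr * err                     # ... and w.r.t. b

    print(w, b)  # approximately 3.0 and 0.5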
75. One advantage of quantum algorithms is
to run large gradient descent problems in
constant time… As high-ROI apps get reworked
to leverage lots of ML on large clusters,
SGD comes to represent the datacenter cost
basis – notably the part that scales…
Want to slash costs exponentially?
Plug in quantum for a game-changer,
maybe
Fast quantum algorithm for
numerical gradient estimation
Stephen P. Jordan
Phys. Rev. Lett. 95, 050501 (2005)
arxiv.org/abs/quant-ph/0405146
dwavesys.com
Vector Quantization:
76. Proposal: let’s drop clusters of quantum
devices into lunar polar craters, so we
can handle massive vector quantization
workloads
• micro-kelvin environs
• near perpetual sunlight
for energy sources
• park routers at L4
• approx. $15B to finance,
i.e., ~6 days DoD budget
Vector Quantization:
77. We’ll just put this here…
a couple o’ Googly projects in progress:
qCraft: Quantum Physics In Minecraft
plus.google.com/u/1/+QuantumAILab/posts/grMbaaDGChH
Vector Quantization:
“We’re going back to the Moon. For good.”
lunar.xprize.org