JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in media, and in particular some of our use cases for active learning, such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
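The referral pattern described above (run automated until an edge case, then defer to a human and keep the human's answer as a new training example) can be sketched in a few lines of plain Python. This is an illustrative sketch only; the function names and the confidence threshold are assumptions, not part of the `nbtransom` API.

```python
# Minimal sketch of the human-in-the-loop pattern: the model handles
# confident predictions automatically and refers edge cases back to a
# human expert. Names and the threshold are illustrative, not nbtransom.

def active_learning_pass(items, predict, ask_expert, threshold=0.8):
    """Route each item to the model or to a human, by confidence."""
    auto, referred = [], []
    for item in items:
        label, confidence = predict(item)
        if confidence >= threshold:
            auto.append((item, label))
        else:
            # Edge case: defer the judgement to a human expert,
            # and keep the answer as a new training example.
            referred.append((item, ask_expert(item)))
    return auto, referred

# Toy usage: a fake classifier that is only confident about short strings.
predict = lambda s: ("short", 0.9) if len(s) < 5 else ("long", 0.5)
ask_expert = lambda s: "long"   # stand-in for a human judgement
auto, referred = active_learning_pass(["ab", "abcdef"], predict, ask_expert)
```

In the real pipelines described above, the referred examples would be written into a shared notebook for expert review rather than answered by a lambda.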
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive: in other words, machines and people become collaborators on shared documents.
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data - Cloudera, Inc.
This document discusses how Cloudera Enterprise Data Hub (EDH) can be used for advanced analytics. EDH allows users to perform diverse concurrent analytics on large datasets without moving the data. It includes tools for machine learning, graph analytics, search, and statistical analysis. EDH protects data through security features and system change tracking. The document argues that EDH is the only platform that can support all these analytics capabilities in a single, integrated system. It provides several examples of how advanced analytics on EDH have helped organizations, such as government agencies, address important problems.
This document outlines a data science enablement roadmap created by the Advanced Center of Excellence at Modern Renaissance Corporation. The roadmap consists of 1 introductory course and 3 advanced courses that can earn a student a master's level certificate in data science. The introductory course provides a broad overview of topics like algorithms, statistics, machine learning, and big data platforms. The advanced courses focus on specific skills like machine learning with R, modern data platforms using Hadoop, and advanced big data analytics techniques. The goal is to give students a versatile, practical skill set for a career in data science or big data engineering.
I presented these slides as a keynote at the Enterprise Intelligence Workshop at KDD2016 in San Francisco.
In these slides, I describe our work towards developing a Maslow's Hierarchy for Human in the Loop Data Analytics!
Big Data and Data Intensive Computing: Use Cases - Jongwook Woo
This invited talk was hosted by the LG Data Mining Lab at the LG R&D center, Woomyun-dong, Seoul, Korea. It introduces emerging Hadoop ecosystem projects (Giraph, Spark, Shark, Flume) and Big Data use cases in Korea and the US, and illustrates the importance of training.
Big Data and Data Intensive Computing on Networks - Jongwook Woo
Big Data on Networks with Hadoop and its ecosystems (Giraph, Flume, ...) at the Korea Institute of Science and Technology Information. Illustrates some possible approaches on networks.
The document discusses Jongwook Woo and his background working with big data. It provides details on Woo's experience as a professor focusing on big data research and education partnerships. It also outlines some of the topics Woo covers in his presentations including introductions to big data, artificial intelligence, and the relationship between AI and big data. Key technologies like Hadoop, Spark, and neural networks are mentioned.
Crowdsourced Data Processing: Industry and Academic Perspectives - Aditya Parameswaran
This document provides a tutorial on crowdsourced data processing from both academic and industry perspectives. The tutorial is divided into three parts. Part 0 provides a background on crowdsourcing and surveys Parts 1 and 2. Part 1 surveys crowdsourced data processing algorithms from academia, discussing unit operations, cost models, error models, and examples like filtering and sorting. Part 2 surveys crowdsourced data processing in industry, finding that many large companies use internal platforms at large scale for tasks like categorization and content moderation, and that academic research is not yet widely used in industry.
How To Interview a Data Scientist
Daniel Tunkelang
Presented at the O'Reilly Strata 2013 Conference
Video: https://www.youtube.com/watch?v=gUTuESHKbXI
Interviewing data scientists is hard. The tech press sporadically publishes “best” interview questions that are cringe-worthy.
At LinkedIn, we put a heavy emphasis on the ability to think through the problems we work on. For example, if someone claims expertise in machine learning, we ask them to apply it to one of our recommendation problems. And, when we test coding and algorithmic problem solving, we do it with real problems that we’ve faced in the course of our day jobs. In general, we try as hard as possible to make the interview process representative of actual work.
In this session, I’ll offer general principles and concrete examples of how to interview data scientists. I’ll also touch on the challenges of sourcing and closing top candidates.
Python's Role in the Future of Data Analysis - Peter Wang
Why is "big data" a challenge, and what roles do high-level languages like Python have to play in this space?
The video of this talk is at: https://vimeo.com/79826022
This presentation was prepared by one of our renowned tutors, "Suraj".
If you are interested in learning more about Big Data, Hadoop, or Data Science, then join our free introduction class on 14 Jan at 11 AM GMT. To register your interest, email us at info@uplatz.com
Introduction to Big Data and AI for Business Analytics and Prediction - Jongwook Woo
This document provides an introduction to big data and artificial intelligence presented by Jongwook Woo. It discusses Woo's background and experience, provides an overview of big data including issues with traditional data handling approaches and the need for scalable solutions like Hadoop. It also covers machine learning and deep learning techniques for predictive analysis using big data, and provides examples applying these techniques to COVID-19 data and financial fraud detection.
MIT Deep Learning Basics: Introduction and Overview by Lex Fridman - Peerasak C.
Watch video: https://youtu.be/O5xeyoRL95U
An introductory lecture for MIT course 6.S094 on the basics of deep learning including a few key ideas, subfields, and the big picture of why neural networks have inspired and energized an entire new generation of researchers. For more lecture videos on deep learning, reinforcement learning (RL), artificial intelligence (AI & AGI), and podcast conversations, visit our website or follow TensorFlow code tutorials on our GitHub repo.
INFO:
Website: https://deeplearning.mit.edu
Presentation - "Dealing with uncertainty in fintech using AI" by Evgeny Savin.
Evgeny has been a Senior Data Scientist at smava for the last 3.5 years. His work has focused on the design and implementation of the ML engine that helps customers find the best loans from one of smava's partner banks. Previously, Evgeny worked as a data scientist at HERE and PayPal.
Introduction to Machine Learning - WeCloudData
In this talk, WeCloudData introduces the lifecycle of machine learning and its tools/ecosystems. For more detail about WeCloudData's machine learning course please visit: https://weclouddata.com/data-science/
Introduction to Deep Learning and AI at Scale for Managers - DataWorks Summit
Deep Learning and the new wave of AI are inevitably coming to your business area. If you are a manager trying to make sense of all the buzzwords, this session is for you. We will show you what Deep Learning is, in a way that helps you understand how it works and how you can apply it. We then expand the scope and apply deep learning and AI techniques in the Big Data context. You will learn about things that don't work out so well, and the risks and challenges in both applying and developing deep learning and AI technologies. We conclude with practical guidance on how to add these exciting deep learning and AI capabilities to your next project.
Outline:
- The path to Deep Learning
- From machine learning to Deep Learning
- But how does it work?
- Deep Learning architectures
- Deep Learning applications
- Deep Learning at scale
- Running AI at scale
- Deep learning at Scale using Spark
- The trouble with AI
- Application challenges
- Development challenges
- How to start your first Deep Learning project
I had been meaning to post this talk for some time now, but kept forgetting. I was invited to give the keynote speech for the Microstrategy World 2008 conference. The talk was very well received, so here it is.
Human-in-the-loop: a design pattern for managing teams which leverage ML by P... - Big Data Spain
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc.
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Big Data Spain Conference
16th -17th November - Kinépolis Madrid
Data Science - An emerging Stream of Science with its Spreading Reach & Impact - Dr. Sunil Kr. Pandey
This is my presentation on the topic "Data Science - An emerging Stream of Science with its Spreading Reach & Impact". I have compiled and collected statistics and data from different sources. It may be useful for students and anyone interested in this field of study.
Dashboards are Dumb Data - Why Smart Analytics Will Kill Your KPIs - Luciano Pesci, PhD
Organizations of every size have access to data dashboard technology, yet none of the solutions have delivered on their hype. Right now, across the world, executives and analysts are staring at a dashboard and thinking the same thing: "so what?"
The failure of dashboards to deliver meaningful insights is inherent in their simplicity: they only show surface level information, and not the relationships between data points that really drive the fate of your organization.
But all is not lost! By combining the right mix of technology and human expertise in business strategy, research and data mining you can embrace the smart analytics movement, and start accessing insights that grow your company and your competitive position.
You can watch the accompanying webinar here: https://youtu.be/RdOcPxv9wLs
Many companies are starting or expanding their use of data mining and machine learning. This presentation covers seven practical ideas for encouraging advanced analytics in your organization.
Slide deck used to foster discussion with museum colleagues about the current trends, ideas, aspirations and challenges of digital strategy and implementation. Includes a short list of concerns and (exciting or even daunting) future trends. Nothing comprehensive here, just some jumping off points for discussion and debate.
This document discusses the importance of data fluency skills in the 21st century. It defines key terms like data science, machine learning, data literacy, and statistical literacy. While these fields require extensive training, the document argues that domain expertise combined with basic data analysis skills can solve many problems. These basic skills include understanding data structures, using programming to interact with data, and exploratory data analysis through visualization. The data analysis process involves defining problems, collecting and preparing data, visualization and modeling, and communicating results. RStudio is presented as a tool that can support the entire data analysis process within a single integrated development environment.
Scaling SlideShare to the World - An Asian Perspective - Amit Ranjan
This document discusses lessons learned from scaling SlideShare, a presentation sharing platform, globally. Some key points include: focusing on organizational culture and values over specific ideas or markets; taking an agile, iterative approach to product development; prioritizing speed of development over initial scalability; focusing on widespread distribution before deep engagement; using metrics and data to inform product decisions; understanding cultural differences between East and West; outsourcing non-core complexities; managing risk; and focusing on users over competitors. The document provides several examples and insights from SlideShare's experience expanding internationally.
A Guide to AI for Smarter Nonprofits - Dr. Cori Faklaris, UNC Charlotte
Working with data is a challenge for many organizations. Nonprofits in particular may need to collect and analyze sensitive, incomplete, and/or biased historical data about people. In this talk, Dr. Cori Faklaris of UNC Charlotte provides an overview of current AI capabilities and weaknesses to consider when integrating current AI technologies into the data workflow. The talk is organized around three takeaways: (1) For better or sometimes worse, AI provides you with “infinite interns.” (2) Give people permission & guardrails to learn what works with these “interns” and what doesn’t. (3) Create a roadmap for adding in more AI to assist nonprofit work, along with strategies for bias mitigation.
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
The panel discussion at Future Perfect 2012 focused on digital preservation by design. The panelists represented several national archives and discussed the need for (1) common standards and frameworks to guide digital preservation efforts, (2) improved tools and cost models, and (3) greater collaboration across organizations through information sharing and an international preservation body. The discussion emphasized taking a purposeful, long-term approach to digital preservation planning and ensuring access to preserved materials.
This document provides a summary of popular machine learning resources from 2018, including the top 10 blogs and talks from the Open Data Science community. It highlights how machine learning grew significantly in 2018, drawing more investment and attention across industries. Practitioners paid increased attention to the societal impacts of their work and focused on technical approaches to address bias, lack of transparency, and other issues. The resources provided are meant to help readers further their understanding of machine learning and shape how these tools are applied.
This document discusses best practices for content delivery platforms to support artificial intelligence projects. It recommends that platforms (1) accept that they do not have all the data needed and should integrate third-party sources, (2) provide consistent tagging of content, (3) offer a lightweight programmatic interface, (4) embrace allowing large amounts of content to be taken offline for analysis, and (5) enable complex filtering and selection of data. The document also suggests platforms could consider offering preprocessed datasets or AI tools as new products.
Recommender systems support the decision making processes of customers with personalized suggestions. These widely used systems influence the daily life of almost everyone across domains like ecommerce, social media, and entertainment. However, the efficient generation of relevant recommendations in large-scale systems is a very complex task. In order to provide personalization, engines and algorithms need to capture users’ varying tastes and find mostly nonlinear dependencies between them and a multitude of items. Enormous data sparsity and ambitious real-time requirements further complicate this challenge. At the same time, deep learning has been proven to solve complex tasks like object or speech recognition where traditional machine learning failed or showed mediocre performance.
Join Marcel Kurovski to explore a use case for vehicle recommendations at mobile.de, Germany’s biggest online vehicle market. Marcel shares a novel regularization technique for the optimization criterion and evaluates it against various baselines. To achieve high scalability, he combines this method with strategies for efficient candidate generation based on user and item embeddings—providing a holistic solution for candidate generation and ranking.
The proposed approach outperforms collaborative filtering and hybrid collaborative-content-based filtering by 73% and 143% for MAP@5. It also scales well to millions of items and users, returning recommendations in tens of milliseconds.
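For context, the MAP@5 metric quoted above (mean average precision at a cutoff of 5) can be computed as below. This is a common formulation of the metric, sketched here for illustration; it is not code from the talk.

```python
def average_precision_at_k(recommended, relevant, k=5):
    """AP@k: precision at each hit position, averaged over
    min(|relevant|, k) -- one common formulation of the metric."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)   # precision at this position
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def map_at_k(all_recs, all_relevant, k=5):
    """MAP@k: mean of per-user AP@k."""
    aps = [average_precision_at_k(recs, rel, k)
           for recs, rel in zip(all_recs, all_relevant)]
    return sum(aps) / len(aps)

# Toy example: one user with a hit at rank 1, one with a hit at rank 2.
print(map_at_k([["a", "b"], ["x", "y"]], [{"a"}, {"y"}]))  # → 0.75
```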
Event: O'Reilly Artificial Intelligence Conference, New York, 18.04.2019
Speaker: Marcel Kurovski, inovex GmbH
More tech talks: inovex.de/vortraege
More tech articles: inovex.de/blog
20240104 HICSS Panel on AI and Legal Ethical 20240103 v7.pptx - ISSIP
20240103 HICSS Panel
Ethical and legal implications raised by Generative AI and Augmented Reality in the workplace.
Souren Paul - https://www.linkedin.com/in/souren-paul-a3bbaa5/
Event: https://kmeducationhub.de/hawaii-international-conference-on-system-sciences-hicss/
Patrick Boily is the Director of the Centre for Quantitative Analysis and Decision Support (CQADS) at Carleton University. CQADS offers data analysis workshops to help participants learn core data science concepts. The 2017 winter workshop series covers important introductory topics like data preparation, visualization, classification, clustering, and exploring data with R. While the workshops only provide an introduction, their goal is to give participants a good overview of the data science field and spark further learning and collaboration. Boily hopes to see more emphasis placed on the ethics of data science and issues like privacy, fairness, and societal impact.
This document discusses the evolution of knowledge workers and knowledge management. Knowledge management 1.0 focused too heavily on rigid processes, tools and centralized control. However, knowledge management 2.0 focuses more on people, encourages collaboration, shares information freely and allows knowledge work to occur anywhere. For knowledge workers to thrive, organizations need a culture shift where information is openly shared, risk-taking is celebrated and knowledge work is not confined within strict boundaries.
This document provides an overview of data science including its importance, what data scientists do, how the field has emerged, and how to become a data scientist. It notes that by 2018 the US could face shortages of people with data analytics skills. It then discusses how LinkedIn's early growth in 2006 exemplifies the data science process of framing questions, collecting and processing data, exploring patterns, and communicating results. Finally, it outlines the tools used in data science like SQL, analytics software, and machine learning and discusses getting started in the field through education, curiosity, and ongoing learning with mentorship support.
This document provides an overview of the Python ecosystem for data science. It describes how tools in the ecosystem can be used to support various data science tasks like reporting, data processing, scientific computing, machine learning modeling, and application development. The document outlines common workflows for small, medium and big data use cases. It also reviews popular Python tools, identifies strengths in the current ecosystem, and discusses some gaps from a practitioner's perspective.
The document summarizes a tutorial on Opentech AI given by Jim Spohrer and Daniel Pakkala, discussing trends in lowering the cost of AI technologies, benchmarks for measuring AI progress, and types of cognitive systems ranging from tools to mediators. It also provides an outline for Daniel Pakkala's presentation on the Opentech AI architecture, ecosystem, and roadmap, discussing frameworks for understanding intelligence evolution and the need for an architecture framework for AI systems.
- The document discusses navigating the Python ecosystem for data science. It outlines the various areas data science teams deal with like reporting, data processing, machine learning modeling, and application development.
- It describes the different tools and libraries in the Python ecosystem that support these areas, including machine learning, cluster computing, and scientific computing.
- The talk aims to help understand what the Python data science ecosystem offers and common gaps, so people don't reinvent solutions or get stuck looking for answers. It covers how tools fit into the machine learning workflow and work together.
Similar to Humans in a loop: Jupyter notebooks as a front-end for AI
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
O'Reilly Media has experimented with different uses of Jupyter notebooks in their publications and learning platforms. Their latest approach embeds notebooks with video narratives in online "Oriole" tutorials, allowing authors to create interactive, computable content. This new medium blends code, data, text, and video into narrated learning experiences that run in isolated Docker containers for higher engagement. Some best practices for using notebooks in teaching include focusing on concise concepts, chunking content, and alternating between text, code, and outputs to keep explanations clear and linear.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf), a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches, specifically deep learning used for custom search and recommendations, by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
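To make the graph-algorithm idea concrete, here is a toy TextRank-style ranking in plain Python: build a co-occurrence graph over the token stream, then apply the PageRank recurrence to rank the nodes. This illustrates the underlying idea only; it is not the PyTextRank API, and the window size and damping factor are the usual illustrative defaults.

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iters=30):
    """Tiny TextRank-style sketch: co-occurrence graph + PageRank.
    Illustration of the idea, not the PyTextRank implementation."""
    # Link each token to the tokens within the sliding window.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if w != words[j]:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    # PageRank recurrence over the undirected graph.
    rank = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        rank = {w: (1 - damping) + damping * sum(
                    rank[u] / len(neighbors[u]) for u in neighbors[w])
                for w in neighbors}
    return sorted(rank, key=rank.get, reverse=True)

tokens = "graph algorithm ranks keyphrases graph ranks texts".split()
print(textrank_keywords(tokens)[:3])
```

Real implementations rank lemmas filtered by part of speech and then merge adjacent high-ranking words into keyphrases; this sketch stops at ranking raw tokens.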
Use of standards and related issues in predictive analytics - Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
The document discusses how data science may reinvent learning and education. It begins with background on the author's experience in data teams and teaching. It then questions what an "Uber for education" may look like and discusses definitions of learning, education, and schools. The author argues interactive notebooks like Project Jupyter and flipped classrooms can improve learning at scale compared to traditional lectures or MOOCs. Content toolchains combining Jupyter, Thebe, Atlas and Docker are proposed for authoring and sharing computational narratives and code-as-media.
Jupyter for Education: Beyond Gutenberg and Erasmus - Paco Nathan
O'Reilly Learning is focusing on evolving learning experiences using Jupyter notebooks. Jupyter notebooks allow combining code, outputs, and explanations in a single document. O'Reilly is using Jupyter notebooks as a new authoring environment and is exploring features like computational narratives, code as a medium for teaching, and interactive online learning environments. The goal is to provide a better learning architecture and content workflow that leverages the capabilities of Jupyter notebooks.
GalvanizeU Seattle: Eleven Almost-Truisms About Data - Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries, especially for those who are just now beginning to study the technologies, the processes, and the people involved.
Microservices, containers, and machine learning - Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. A natural language processing service in Python (based on NLTK, TextBlob, WordNet, etc.) gets containerized and used to crawl and parse the email archives. These produce JSON data sets; then we run machine learning on a Spark cluster to find insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
GraphX: Graph analytics for insights about developer communities - Paco Nathan
The document provides an overview of Graph Analytics in Spark. It discusses Spark components and key distinctions from MapReduce. It also covers GraphX terminology and examples of composing node and edge RDDs into a graph. The document provides examples of simple traversals and routing problems on graphs. It discusses using GraphX for topic modeling with LDA and provides further reading resources on GraphX, algebraic graph theory, and graph analysis tools and frameworks.
Graph analytics can be used to analyze a social graph constructed from email messages on the Spark user mailing list. Key metrics like PageRank, in-degrees, and strongly connected components can be computed using the GraphX API in Spark. For example, PageRank was computed on the 4Q2014 email graph, identifying the top contributors to the mailing list.
Apache Spark and the Emerging Technology Landscape for Big Data - Paco Nathan
The document discusses Apache Spark and its role in big data and emerging technologies for big data. It provides background on MapReduce and the emergence of specialized systems. It then discusses how Spark provides a unified engine for batch processing, iterative jobs, SQL queries, streaming, and more. It can simplify programming by using a functional approach. The document also discusses Spark's architecture and performance advantages over other frameworks.
QCon São Paulo: Real-Time Analytics with Spark Streaming - Paco Nathan
The document provides an overview of real-time analytics using Spark Streaming. It discusses Spark Streaming's micro-batch approach of treating streaming data as a series of small batch jobs. This allows for low-latency analysis while integrating streaming and batch processing. The document also covers Spark Streaming's fault tolerance mechanisms and provides several examples of companies like Pearson, Guavus, and Sharethrough using Spark Streaming for real-time analytics in production environments.
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More - Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
A New Year in Data Science: ML Unpaused - Paco Nathan
This document summarizes Paco Nathan's presentation at Data Day Texas in 2015. Some key points:
- Paco Nathan discussed observations and trends from the past year in machine learning, data science, big data, and open source technologies.
- He argued that the definitions of data science and statistics are flawed and ignore important areas like development, visualization, and modeling real-world business problems.
- The presentation covered topics like functional programming approaches, streaming approximations, and the importance of an interdisciplinary approach combining computer science, statistics, and other fields like physics.
- Paco Nathan advocated for newer probabilistic techniques for analyzing large datasets that provide approximations using less resources compared to traditional batch processing approaches.
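One concrete instance of such streaming approximations is a count-min sketch, which estimates item frequencies in bounded memory at the cost of one-sided overestimation. The class below is an illustrative sketch under assumed parameters, not code from the presentation.

```python
import hashlib

class CountMinSketch:
    """Tiny count-min sketch: frequency estimates in fixed memory.
    Collisions can only inflate counts, never deflate them.
    Width/depth here are illustrative, not tuned values."""

    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _positions(self, item):
        # One hashed column per row; the row index salts the hash.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, col in self._positions(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # True count <= estimate: take the least-inflated cell.
        return min(self.table[row][col] for row, col in self._positions(item))

cms = CountMinSketch()
for word in ["spark", "spark", "hadoop", "spark"]:
    cms.add(word)
print(cms.estimate("spark"))   # at least 3 (count-min never underestimates)
```

The trade-off is exactly the one argued above: a few kilobytes of counters replace a full per-item hash table, with a bounded and quantifiable overestimation error.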
Microservices, Containers, and Machine Learning - Paco Nathan
Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.
Databricks Meetup @ Los Angeles Apache Spark User Group - Paco Nathan
This document summarizes a presentation on Apache Spark and Spark Streaming. It provides an overview of Spark, describing it as an in-memory cluster computing framework. It then discusses Spark Streaming, explaining that it runs streaming computations as small batch jobs to provide low latency processing. Several use cases for Spark Streaming are presented, including from companies like Stratio, Pearson, Ooyala, and Sharethrough. The presentation concludes with a demonstration of Python Spark Streaming code.
How Apache Spark fits into the Big Data landscape - Paco Nathan
How Apache Spark fits into the Big Data landscape http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/217858832/
2014-12-02 in Herndon, VA and sponsored by Raytheon, Tetra Concepts, and MetiStream
The document provides an overview of Apache Spark, including its history and key capabilities. It discusses how Spark was developed in 2009 at UC Berkeley and later open sourced, and how it has since become a major open-source project for big data. The document summarizes that Spark provides in-memory performance for ETL, storage, exploration, analytics and more on Hadoop clusters, and supports machine learning, graph analysis, and SQL queries.
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
This PDF delves into the aspects of information security from a forensic perspective, focusing on privacy leaks. It provides insights into the methods and tools used in forensic investigations to uncover and mitigate privacy breaches in mobile and cloud environments.
Choosing the Best Outlook OST to PST Converter: Key Features and Considerationswebbyacad software
When looking for a good software utility to convert Outlook OST files to PST format, it is important to find one that is easy to use and has useful features. WebbyAcad OST to PST Converter Tool is a great choice because it is simple to use for anyone, whether you are tech-savvy or not. It can smoothly change your files to PST while keeping all your data safe and secure. Plus, it can handle large amounts of data and convert multiple files at once, which can save you a lot of time. It even comes with 24*7 technical support assistance and a free trial, so you can try it out before making a decision. Whether you need to recover, move, or back up your data, Webbyacad OST to PST Converter is a reliable option that gives you all the support you need to manage your Outlook data effectively.
Generative AI technology is a fascinating field that focuses on creating comp...Nohoax Kanont
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
It's your unstructured data: How to get your GenAI app to production (and spe...Zilliz
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists to scale their app to 1000s to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment and never having to share private corporate data with anyone - not even Aparavi.
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Snarky Security
How wonderful it is that in our modern age, every bit of our biological data can be digitized, stored, and potentially pilfered by cyber thieves! Isn't it just splendid to think that while scientists are busy pushing the boundaries of biotechnology, hackers could be plotting the next big bio-data heist? This delightful scenario is brought to you by the ever-expanding digital landscape of biology and biotechnology, where the integration of computer science, engineering, and data science transforms our understanding and manipulation of biological systems.
While the fusion of technology and biology offers immense benefits, it also necessitates a careful consideration of the ethical, security, and associated social implications. But let's be honest, in the grand scheme of things, what's a little risk compared to potential scientific achievements? After all, progress in biotechnology waits for no one, and we're just along for the ride in this thrilling, slightly terrifying, adventure.
So, as we continue to navigate this complex landscape, let's not forget the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. After all, what could possibly go wrong?
-------------------------
This document provides a comprehensive analysis of the security implications biological data use. The analysis explores various aspects of biological data security, including the vulnerabilities associated with data access, the potential for misuse by state and non-state actors, and the implications for national and transnational security. Key aspects considered include the impact of technological advancements on data security, the role of international policies in data governance, and the strategies for mitigating risks associated with unauthorized data access.
This view offers valuable insights for security professionals, policymakers, and industry leaders across various sectors, highlighting the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. The analysis serves as a crucial resource for understanding the complex dynamics at the intersection of biotechnology and security, providing actionable recommendations to enhance biosecurity in an digital and interconnected world.
The evolving landscape of biology and biotechnology, significantly influenced by advancements in computer science, engineering, and data science, is reshaping our understanding and manipulation of biological systems. The integration of these disciplines has led to the development of fields such as computational biology and synthetic biology, which utilize computational power and engineering principles to solve complex biological problems and innovate new biotechnological applications. This interdisciplinary approach has not only accelerated research and development but also introduced new capabilities such as gene editing and biomanufact
Demystifying Neural Networks And Building Cybersecurity ApplicationsPriyanka Aash
In today's rapidly evolving technological landscape, Artificial Neural Networks (ANNs) have emerged as a cornerstone of artificial intelligence, revolutionizing various fields including cybersecurity. Inspired by the intricacies of the human brain, ANNs have a rich history and a complex structure that enables them to learn and make decisions. This blog aims to unravel the mysteries of neural networks, explore their mathematical foundations, and demonstrate their practical applications, particularly in building robust malware detection systems using Convolutional Neural Networks (CNNs).
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrated industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
How UiPath Discovery Suite supports identification of Agentic Process Automat...DianaGray10
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
Self-Healing Test Automation Framework - HealeniumKnoldus Inc.
Revolutionize your test automation with Healenium's self-healing framework. Automate test maintenance, reduce flakes, and increase efficiency. Learn how to build a robust test automation foundation. Discover the power of self-healing tests. Transform your testing experience.
"Making .NET Application Even Faster", Sergey Teplyakov.pptxFwdays
In this talk we're going to explore performance improvement lifecycle, starting with setting the performance goals, using profilers to figure out the bottle necks, making a fix and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their application fast and understanding how things work under the hood.
Redefining Cybersecurity with AI CapabilitiesPriyanka Aash
In this comprehensive overview of Cisco's latest innovations in cybersecurity, the focus is squarely on resilience and adaptation in the face of evolving threats. The discussion covers the imperative of tackling Mal information, the increasing sophistication of insider attacks, and the expanding attack surfaces in a hybrid work environment. Emphasizing a shift towards integrated platforms over fragmented tools, Cisco introduces its Security Cloud, designed to provide end-to-end visibility and robust protection across user interactions, cloud environments, and breaches. AI emerges as a pivotal tool, from enhancing user experiences to predicting and defending against cyber threats. The blog underscores Cisco's commitment to simplifying security stacks while ensuring efficacy and economic feasibility, making a compelling case for their platform approach in safeguarding digital landscapes.
History and Introduction for Generative AI ( GenAI )
Humans in a loop: Jupyter notebooks as a front-end for AI
1. Humans in a loop: Jupyter notebooks as a front-end for AI
Paco Nathan @pacoid
Dir, Learning Group @ O’Reilly Media
JupyterCon 2017-08-24
2. Framing
Imagine having a mostly-automated system where people and machines collaborate together…
May sound a bit Sci-Fi – though arguably commonplace. One challenge is whether we can advance beyond rote tasks.
Instead of simply running code libraries, can machines make difficult decisions, exercise judgement in complex situations?
Can we build systems in which people who aren’t AI experts can “teach” machines to perform complex work based on examples, not code?
3. Research questions
▪ How do we personalize learning experiences, across ebooks, videos, conferences, computable content, live online courses, case studies, expert AMAs, etc.?
▪ How do we help experts — by definition, really busy people — share knowledge with their peers in industry?
▪ How do we manage the role of editors at human scale, while technology and delivery media evolve rapidly?
▪ How do we help organizations learn and transform continuously?
7. Machine learning
supervised ML:
▪ take a dataset where each element has a label
▪ train models on a portion of the data to predict the labels, test on the remainder
▪ deep learning is a popular example, though only if you have lots of labeled training data available
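The supervised recipe above — train on a labeled portion, test on the remainder — can be sketched in a few lines. This is a minimal toy example with made-up data and a trivial 1-nearest-neighbor "model", not any production pipeline:

```python
# Toy supervised learning: each element has a label; train on part of
# the data, test on the rest. Values are illustrative stand-ins.
labeled = [
    (1.0, "low"), (1.2, "low"), (0.8, "low"),
    (5.0, "high"), (5.3, "high"), (4.7, "high"),
]

train, test = labeled[:4], labeled[4:]

def predict(x, train_set):
    # 1-NN: return the label of the closest training example
    nearest = min(train_set, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

correct = sum(predict(x, train) == label for x, label in test)
accuracy = correct / len(test)
```

The held-out `test` slice plays the role of the "remainder" used to check whether the model generalizes beyond its training examples.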
8. Machine learning
unsupervised ML:
▪ run lots of unlabeled data through an algorithm to detect “structure” or embedding
▪ for example, clustering algorithms such as K-means
▪ unsupervised approaches for AI are an open research question
9. Active learning
special case of semi-supervised ML:
▪ send difficult calls / edge cases to experts; let algorithms handle routine decisions
▪ works well in use cases which have lots of inexpensive, unlabeled data
▪ e.g., abundance of content to be classified, where the cost of labeling is the expense
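The core routing rule — algorithms handle routine decisions, edge cases go to experts — can be sketched as a confidence threshold. The item names, scores, and cutoff below are all assumed for illustration:

```python
# Sketch of active-learning routing: confident calls run automated,
# low-confidence edge cases land in a human expert queue.
CONFIDENCE_THRESHOLD = 0.8  # an assumed cutoff, tuned per use case

def route(item, score):
    """Return which side of the loop handles this item."""
    return "machine" if score >= CONFIDENCE_THRESHOLD else "human_expert"

scored_items = [("doc-1", 0.95), ("doc-2", 0.55), ("doc-3", 0.88)]
decisions = {item: route(item, s) for item, s in scored_items}
# doc-2 lands in the expert queue; the rest run automated
```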
10. Active learning
Real-World Active Learning: Applications and Strategies for Human-in-the-Loop Machine Learning
radar.oreilly.com/2015/02/human-in-the-loop-machine-learning.html
Ted Cuzzillo
O’Reilly Media, 2015-02-05
Develop a policy for how human experts select exemplars:
▪ bias toward labels most likely to influence the classifier
▪ bias toward ensemble disagreement
▪ bias toward denser regions of training data
11. Active learning
Data preparation in the age of deep learning
oreilly.com/ideas/data-preparation-in-the-age-of-deep-learning
Luke Biewald
CrowdFlower
O’Reilly Data Show, 2017-05-04
send human workers cases where machine learning algorithms signal uncertainty (low probability scores)
…or when your ensemble of ML algorithms signals disagreement
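The "ensemble disagreement" signal mentioned above can be sketched directly: when the models in an ensemble don't converge on one label, defer the item to a person. The vote labels here are illustrative:

```python
from collections import Counter

def needs_review(votes, min_agreement=1.0):
    # True when the ensemble's top label wins less than min_agreement
    # of the votes, i.e. the models disagree
    top_count = Counter(votes).most_common(1)[0][1]
    return top_count / len(votes) < min_agreement

unanimous = needs_review(["apple_ios", "apple_ios", "apple_ios"])
split = needs_review(["apple_ios", "cisco_ios", "apple_ios"])
```

Lowering `min_agreement` trades expert workload against label quality — the same knob as the "low probability scores" heuristic, applied to an ensemble instead of a single model.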
12. Design pattern: Human-in-the-loop
Building a business that combines human experts and data science
oreilly.com/ideas/building-a-business-that-combines-human-experts-and-data-science-2
Eric Colson
StitchFix
O’Reilly Data Show, 2016-01-28
“what machines can’t do are things around cognition, things that have to do with ambient information, or appreciation of aesthetics, or even the ability to relate to another human”
13. Design pattern: Human-in-the-loop
Strategies for integrating people and machine learning in online systems
safaribooksonline.com/library/view/oreilly-artificial-intelligence/9781491976289/video311857.html
Jason Laska
Clara Labs
The AI Conf, 2017-06-29
how to create a two-sided marketplace where machines and people compete on a spectrum of relative expertise and capabilities
14. Design pattern: Human-in-the-loop
Building human-assisted AI applications
oreilly.com/ideas/building-human-assisted-ai-applications
Adam Marcus
B12
O’Reilly Data Show, 2016-08-25
Orchestra: a platform for building human-assisted AI applications, e.g., to create business websites
https://github.com/b12io/orchestra
example: http://www.coloradopicked.com/
15. Design pattern: Flash teams
Expert Crowdsourcing with Flash Teams
hci.stanford.edu/publications/2014/flashteams/flashteams-uist2014.pdf
Daniela Retelny, et al.
Stanford HCI
“A flash team is a linked set of modular tasks that draw upon paid experts from the crowd, often three to six at a time, on demand”
http://stanfordhci.github.io/flash-teams/
16. Weak supervision / Data programming
Creating large training data sets quickly
oreilly.com/ideas/creating-large-training-data-sets-quickly
Alex Ratner
Stanford
O’Reilly Data Show, 2017-06-08
Snorkel: “weak supervision” and “data programming” as another instance of human-in-the-loop
github.com/HazyResearch/snorkel
conferences.oreilly.com/strata/strata-ny/public/schedule/detail/61849
17. Reinforcement learning
Reinforcement learning explained
oreilly.com/ideas/reinforcement-learning-explained
Junling Hu
AI Frontiers
O’Reilly Radar, 2016-12-08
learning to act based on long-term payoffs, often as rewards/punishments of an actor within a simulation
19. AI in Media
▪ content which can be represented as text can be parsed by NLP, then manipulated by available AI tooling
▪ labeled images get really interesting
▪ text or images within a context have inherent structure
▪ representation of that kind of structure is rare in the Media vertical – so far
21. Which parts do people or machines do best?
team goal: maintain structural correspondence between the layers
big win for AI: inferences across the graph
human scale: primary structure, control points, testability
machine generated: data products, ~80% of the graph
22. Ontology
▪ provides context which Deep Learning lacks
▪ aka, “knowledge graph” – a computable thesaurus
▪ maps the semantics of business relationships
▪ S/V/O: “nouns”, some “verbs”, a few “adjectives”
▪ difficult work, a relatively expensive investment, potentially high ROI
▪ conversational interfaces (e.g., Google Assistant) improve UX by importing ontologies
26. Disambiguating contexts
Suppose someone publishes a book which uses the term `IOS`: are they talking about an operating system for an Apple iPhone, or about an operating system for a Cisco router?
We handle lots of content about both. Disambiguating those contexts is important for good UX in personalized learning. In other words, how do machines help people distinguish that content within search?
Potentially a good case for deep learning, except for the lack of labeled data at scale.
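A naive way to see the `IOS` ambiguity as a classification problem is to score a passage against context vocabularies. The word lists below are illustrative stand-ins, not the production ontology:

```python
# Bag-of-words sketch for disambiguating `IOS` mentions by context.
CONTEXTS = {
    "apple_ios": {"iphone", "ipad", "swift", "xcode", "app"},
    "cisco_ios": {"router", "switch", "bgp", "vlan", "cli"},
}

def disambiguate(passage):
    words = set(passage.lower().split())
    scores = {label: len(words & vocab) for label, vocab in CONTEXTS.items()}
    best = max(scores, key=scores.get)
    # zero evidence either way is exactly the edge case an active
    # learning pipeline would defer to a human expert
    return best if scores[best] > 0 else None
```

A real pipeline would use learned models rather than hand-listed vocabularies, but the shape of the decision — and the `None` escape hatch for ambiguous cases — is the same.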
27. Disambiguation in content discovery
Consider searching for the term `react` on Google. The first page of results:
▪ acting coaches
▪ video games
▪ student engagement
▪ children’s charities
▪ UI web components
▪ surveys
28. Active learning through Jupyter
Jupyter notebooks are used to manage ML pipelines for disambiguation, where machines and people collaborate:
▪ notebooks as one part configuration file, one part data sample, one part structured log, one part data visualization tool
▪ ML based on examples, not feature engineering, model parameters, etc.
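This "one part config, one part log" reading works because a notebook on disk is plain JSON (the nbformat v4 schema: a dict with `metadata` and a `cells` list), so the standard library is enough to treat one as structured data. The notebook contents below are made up for illustration:

```python
import json

# An .ipynb file is JSON (nbformat v4): metadata plus a "cells" list.
# This in-memory notebook is a made-up stand-in for one on disk.
nb_json = json.dumps({
    "nbformat": 4, "nbformat_minor": 2,
    "metadata": {"pipeline": {"phrase": "IOS", "threshold": 0.8}},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": "## training examples"},
        {"cell_type": "code", "metadata": {}, "source": "examples = [...]",
         "execution_count": None, "outputs": []},
    ],
})

nb = json.loads(nb_json)
config = nb["metadata"]["pipeline"]          # notebook as configuration file
code_cells = [c for c in nb["cells"] if c["cell_type"] == "code"]
```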
29. Active learning through Jupyter
1. Experts use notebooks to provide examples of book chapters, video segments, etc., for each key phrase that has overlapping contexts
2. Machines build ensemble ML models based on those examples, updating notebooks with model evaluation
3. Machines attempt to annotate labels for millions of pieces of content, e.g., `AlphaGo`, `Golang`, versus a mundane use of the verb `go`
4. Disambiguation can run mostly automated, in parallel at scale – through integration with Apache Spark
5. In cases where ensembles disagree, ML pipelines defer to human experts who make judgement calls, providing further examples
6. New examples go into training ML pipelines to build better models
7. Rinse, lather, repeat
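The steps above can be caricatured in a few lines. Both "classifiers" here are deliberately crude keyword rules, and the simulated expert always answers "tech" — every name and rule is hypothetical, meant only to show the agree/defer control flow:

```python
# Ensemble agreement runs automated (steps 3-4); disagreement defers
# to a human whose answer becomes a new training example (steps 5-6).
def classifier_a(text):
    return "tech" if "go" in text and "lang" in text else "other"

def classifier_b(text):
    return "tech" if "golang" in text or "alphago" in text else "other"

def run_pipeline(items, expert):
    auto, deferred = {}, {}
    for item in items:
        a, b = classifier_a(item), classifier_b(item)
        if a == b:
            auto[item] = a                 # ensemble agrees: automated label
        else:
            deferred[item] = expert(item)  # disagreement: human judgement
    return auto, deferred

items = ["alphago beats lee sedol", "go to the store", "golang concurrency"]
auto, deferred = run_pipeline(items, expert=lambda item: "tech")
```

In the real pipeline the deferred items and the expert's answers live in shared notebooks, and step 7 retrains the ensemble on the accumulated examples.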
31. Active learning through Jupyter
▪ Jupyter notebooks allow human experts to access the internals of a mostly automated ML pipeline, rapidly
▪ Stated another way, both the machines and the people become collaborators on shared documents
▪ Anticipates upcoming collaborative document features in JupyterLab
32. Active learning through Jupyter
Open source nbtransom package:
▪ https://github.com/ceteri/nbtransom
▪ https://pypi.python.org/pypi/nbtransom
▪ based on use of nbformat and pandas
▪ notebook as both Py data store and analytics
▪ custom “pretty-print” helps with use of Git for diffs, commits, etc.
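To show the flavor of a machine "writing back" into a shared notebook, here is a stdlib-only sketch that appends a log cell to an nbformat-v4 structure and serializes it deterministically. This is not the nbtransom API — just the kind of bookkeeping it wraps, with an invented log message:

```python
import json

# Append a cell holding (made-up) model-evaluation results to a
# minimal nbformat-v4 notebook structure.
def append_log_cell(nb, text):
    nb["cells"].append({
        "cell_type": "markdown", "metadata": {}, "source": text,
    })
    return nb

nb = {"nbformat": 4, "nbformat_minor": 2, "metadata": {}, "cells": []}
nb = append_log_cell(nb, "ensemble accuracy: 0.91 (round 3)")

# stable, indented JSON diffs cleanly under Git, in the spirit of the
# custom "pretty-print" mentioned above
serialized = json.dumps(nb, indent=1, sort_keys=True)
```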
33. Nuances
▪ No Free Lunch theorem: it’s better to err on the side of fewer false positives / more false negatives for this use case
▪ Bias toward exemplars most likely to influence the classifier
▪ Potentially, the “experts” may be Customer Service staff who review edge cases within search results or recommended content – as an integral part of our UI – then re-train the ML pipelines through examples
35. Human-in-the-loop as management strategy
personal op-ed: the “game” isn’t to replace people – instead it’s about leveraging AI to augment staff, so organizations can retain people with valuable domain expertise, making their contributions and expertise even more vital
37. Acknowledgements
Many thanks to people who’ve helped with this work:
▪ Taylor Martin
▪ Eszti Schoell
▪ Fernando Perez
▪ Andrew Odewahn
▪ Scott Murray
▪ John Allwine
▪ Pascal Honscher
38. Strata Data
NY, Sep 25-28
SG, Dec 4-7
SJ, Mar 5-8, 2018
UK, May 21-24, 2018
The AI Conf
SF, Sep 17-20
NY, Apr 29-May 2, 2018
OSCON (returns to Portland!)
PDX, Jul 16-19, 2018
39. Get Started with NLP in Python
Just Enough Math
Building Data Science Teams
Hylbert-Speys
How Do You Learn?
updates, reviews, conference summaries…
liber118.com/pxn/
@pacoid