The document provides an overview of buffer overflow vulnerabilities including:
1) It explains how buffer overflows work by writing more data to a buffer than it was allocated to hold, overwriting adjacent memory.
2) An example is given of a function that is vulnerable to buffer overflow by copying user input into a fixed-size buffer without checks.
3) It shows how, by passing too much input data, the buffer can be overflowed and the return address on the stack overwritten to point to the injected data instead of the intended return location.
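As a minimal illustration of that unchecked-copy mistake (my own sketch, not code from the document, and in Python via ctypes rather than the C the summary implies):

import ctypes

buf = ctypes.create_string_buffer(8)   # fixed-size 8-byte buffer
payload = b"A" * 64                    # "user input" far larger than the buffer

# unchecked copy, like strcpy() in C: writes 64 bytes into an 8-byte buffer,
# corrupting adjacent memory -- expect a crash or undefined behavior
ctypes.memmove(buf, payload, len(payload))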
Trading volume mapping R in recent environment (Nagi Teramo)
This document discusses visualizing trading volume data using R. It includes:
1. Loading and manipulating trading volume data from 2012 using data.table and dplyr packages to summarize the monthly total trading amount by country.
2. Creating an interactive choropleth map visualization of the monthly trading data over time using the rMaps package. Custom HTML and JavaScript are used to animate the map updating each month.
3. Providing the full codes in a GitHub repository for others to reproduce the analysis and visualization.
The Terror-Free Guide to Introducing Functional Scala at Work (Jorge Vásquez)
Too often, our applications are dominated by boilerplate that's not fun to write or test, and that makes our business logic complicated. In object-oriented programming, classes and interfaces help us with abstraction to reduce boilerplate. But, in functional programming, we use type classes.
Historically, type classes in functional programming have been very complex and confusing, partially because they import ideas from Haskell that don't make sense in Scala, and partially because of their esoteric origins in category theory.
In this presentation, Jorge Vásquez presents a new library called ZIO Prelude, which offers a distinctly Scala take on Functional Abstractions, and you will learn how you can eliminate common types of boilerplate by using it.
Come see how you can improve your happiness and productivity with a new take on what it means to do functional programming in Scala!
This talk is about using Hive in practice. We will go through some of the specific use cases for which Hive is currently being used at Last.fm, highlighting its strengths and weaknesses along the way.
This document summarizes a presentation given by Diane Mueller from ActiveState and Dr. Mike Müller from Python Academy. It compares MATLAB and Python capabilities for scientific computing. Python has many libraries like NumPy, SciPy, IPython and matplotlib that provide similar functionality to MATLAB. Together these are often called "Pylab". The presentation provides an overview of Python, NumPy arrays, visualization with matplotlib, and integrating Python with other languages.
This is part of an introductory course on Big Data Tools for Artificial Intelligence. These slides introduce students to the new in-memory cluster computing framework, Spark.
HyperLogLog in Hive - How to count sheep efficiently? (bzamecnik)
This document discusses using HyperLogLog (HLL) in Hive to efficiently estimate the number of unique elements or cardinality in big datasets. It describes how HLL provides fast approximate counting using probabilistic data structures. It covers implementing HLL as user-defined functions in Hive, comparing different open source implementations, and examples of using HLL to estimate unique visitors per day and in a rolling window.
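For intuition, here is a minimal HyperLogLog sketch in Python (my own illustration, not one of the Hive UDF implementations the document compares; it omits the small- and large-range bias corrections that real implementations apply):

import hashlib

P = 12                                 # precision: 2^12 = 4096 registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)       # bias-correction constant for large M

def new_sketch():
    return [0] * M

def add(registers, value):
    h = int(hashlib.sha1(str(value).encode()).hexdigest()[:16], 16)  # 64-bit hash
    idx = h & (M - 1)                  # low P bits choose a register
    rest = h >> P                      # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
    registers[idx] = max(registers[idx], rank)

def estimate(registers):
    # normalized harmonic mean of 2^register over all registers
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)

sk = new_sketch()
for i in range(100_000):
    add(sk, i)
print(round(estimate(sk)))             # close to 100000 (~1.6% typical error)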
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith... (Data Con LA)
"At OpenX we not only use the tools in big data ecosystems to solve our business problems, but also explore the cutting edge algorithms for practical uses. HyperLogLog is one of the algorithm that we use intensively in our internal system. It has really low computation cost and can easily plug into map-reduce framework (hadoop or spark). Some of the applications that worth to highlight are:
* high cardinality test
* distinct count of unique users over time
* Visualize hyperloglog for fraud detection"
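The map-reduce friendliness mentioned above comes from mergeability: two register arrays built over disjoint data partitions combine losslessly by elementwise max, so a rolling-window distinct count is just a merge of daily sketches. A sketch, continuing the toy implementation style above:

def merge(a, b):
    # the max of two HLL register arrays is exactly the sketch of the union
    return [max(x, y) for x, y in zip(a, b)]

# rolling 7-day unique users = merge of the last 7 daily sketches,
# then a single estimate() over the merged registers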
This document describes the Pig system, which is a high-level data flow system built on top of MapReduce. Pig provides a language called Pig Latin for analyzing large datasets. Pig Latin programs are compiled into MapReduce jobs. The compilation process involves several steps: (1) parsing and type checking the Pig Latin code, (2) logical optimization, (3) converting the logical plan into physical operators like GROUP and JOIN, (4) mapping the physical operators to MapReduce stages, and (5) optimizing the MapReduce plan. This allows users to write data analysis programs more declaratively without coding MapReduce jobs directly.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It implements Google's MapReduce programming model and the Hadoop Distributed File System (HDFS) for reliable data storage. Key components include a JobTracker that coordinates jobs, TaskTrackers that run tasks on worker nodes, and a NameNode that manages the HDFS namespace and DataNodes that store application data. The framework provides fault tolerance, parallelization, and scalability.
Build a Big Data solution using DB2 for z/OS (Jane Man)
The document discusses building a Big Data solution using IBM DB2 for z/OS and IBM BigInsights. It provides an overview of new functions in DB2 11 that allow DB2 applications to access and analyze data stored in Hadoop. Specifically, it describes the JAQL_SUBMIT and HDFS_READ functions that enable submitting analytic jobs to BigInsights from DB2 and reading the results back into DB2. Examples are provided that show an integrated workflow of submitting a JAQL query to BigInsights from DB2, reading the results into a DB2 table, and querying the results. Potential use cases for integrating DB2 and BigInsights are also outlined.
The document discusses Hadoop and Spark frameworks for big data analytics. It describes that Hadoop consists of HDFS for distributed storage and MapReduce for distributed processing. Spark is faster than MapReduce for iterative algorithms and interactive queries since it keeps data in-memory. While MapReduce is best for one-pass batch jobs, Spark performs better for iterative jobs that require multiple passes over datasets.
Virtualization and Open Virtualization Format (OVF) (rajsandhu1989)
This document discusses virtualization and its role as the backbone of cloud computing. It defines virtualization as the creation of virtual versions of hardware platforms, operating systems, storage devices and network resources. The document outlines different types of virtualization including hardware/server virtualization, storage virtualization, network virtualization, and desktop virtualization. It describes how server virtualization works using hypervisors to divide physical servers into multiple virtual machines. The benefits of virtualization discussed include resource sharing, load balancing, easier backup and recovery, and scalability.
Forrester predicts that CIOs who are late to the Hadoop game will finally make the platform a priority in 2015. Hadoop has evolved into a must-know technology and has opened up better career, salary and job opportunities for many professionals.
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard) (npinto)
This document provides an introduction and overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It outlines what Hadoop is, how its core components MapReduce and HDFS work, advantages like scalability and fault tolerance, disadvantages like complexity, and resources for getting started with Hadoop installations and programming.
Hadoop is a software framework that allows for distributed processing of large data sets across clusters of computers. It includes MapReduce for distributed computing, HDFS for storage, and runs efficiently on large clusters by distributing data and processing across nodes. Example applications include log analysis, machine learning, and sorting 1TB of data in under a minute. It is fault-tolerant, scalable, and designed for processing vast amounts of data in a reliable and cost-effective manner.
Cloud Deployments with Apache Hadoop and Apache HBase (DATAVERSITY)
The document discusses cloud deployments of Apache Hadoop and Apache HBase. It begins by introducing the speaker and their background with Cloudera and various Apache projects, then provides an overview of Cloudera and what they do. The majority of the document covers Apache Hadoop and Apache HBase: what they are, and how they are open source and horizontally scalable. It also discusses deploying a Hadoop and HBase cluster on Amazon EC2 using Apache Whirr to provision the machines. Real-world examples of using these technologies include building a web index for a search engine.
Social Data Analytics using IBM Big Data Technologies (Nicolas Morales)
Distilling Insights from Social Media Using Big Data Technologies
Have you ever wondered what your customers are saying about you in Social media, and the impact it might be having on your business? This session will focus on how BigInsights and Big Data technologies can be used to glean useful and actionable insights from social media data.
You'll see how data can be ingested and prepped, and how to do text analytics on social data in real time. Using Hadoop, we'll show you how you can store and analyze your large volume of historical social media data and reference data. This talk and demo will provide an introduction to text analytics and how it is used within the IBM Big Data platform for a social media solution.
This document discusses machine learning algorithms and their suitability for parallelization using MapReduce. It begins by introducing machine learning and different types of algorithms, including supervised, unsupervised, and reinforcement learning. It then discusses how several common machine learning algorithms can be expressed using the MapReduce framework. Single-pass algorithms like language modeling and naive Bayes classification are well-suited since they involve extracting statistics from each data point. Iterative algorithms can also be parallelized by chaining multiple MapReduce jobs, though the map tasks may need access to global parameters. Overall, the document analyzes how different machine learning algorithms map to the data processing patterns of MapReduce.
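To make the single-pass point concrete, here is a toy sketch (mine, not the document's) of naive Bayes training expressed in the map/reduce shape: the mapper extracts (label, word) statistics from each document independently, and the reducer only needs addition.

from collections import defaultdict

docs = [("spam", "cheap pills now"), ("ham", "meeting at noon"), ("spam", "cheap meeting")]

def mapper(label, text):
    # each document is processed independently: ideal for a single map pass
    for word in text.split():
        yield (label, word), 1

def reducer(pairs):
    # the reduce side is purely additive, so it parallelizes trivially
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

counts = reducer(kv for label, text in docs for kv in mapper(label, text))
# counts[("spam", "cheap")] == 2, etc.; class-conditional probabilities
# then follow by normalizing the counts per label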
Big data relates to large and complex datasets that are difficult to analyze using traditional methods. It refers to datasets that are too large to be handled by typical database management tools. Big data can help organizations make better evidence-based decisions by analyzing structured and unstructured data from a variety of sources using specialized analytical techniques.
Boris Lublinsky and Alexey Yakubovich give us an overview of using Oozie. This presentation was given on December 13th, 2012 at the Nokia offices in Chicago, IL.
View the HD video of this talk here: http://vimeo.com/chug/oozie-overview
The document discusses scheduling Hadoop pipelines using various Apache projects. It provides an example of a marketing profit and loss (PnL) pipeline that processes booking, marketing spend, and web log data. It describes scheduling the example jobs using cron-style scheduling and the problems with time-based scheduling. It then introduces Apache Oozie and Apache Falcon for more robust workflow scheduling based on dataset availability. It provides examples of using Oozie coordinators and workflows and Falcon feeds and processes to schedule the example PnL pipeline based on when input data is available rather than fixed time schedules.
This deck presents best practices for getting good performance out of Apache Hive. It covers getting data into Hive, using the ORC file format, getting good layout into partitions and files based on query patterns, execution using Tez and YARN queues, memory configuration, and debugging common query performance issues. It also describes Hive bucketing and reading Hive EXPLAIN query plans.
The document compares the query execution plans produced by Apache Hive and PostgreSQL. It shows that Hive's old-style execution plans are overly verbose and difficult to understand, providing many low-level details across multiple stages. In contrast, PostgreSQL's plans are more concise and readable, showing the logical query plan in a top-down manner with actual table names and fewer lines of text. The document advocates for Hive to adopt a simpler execution plan format similar to PostgreSQL's.
The document contains screenshots and descriptions of the setup and configuration of a Hadoop cluster. It includes images showing the cluster with different numbers of live and dead nodes, replication settings across nodes, and outputs of commands like fsck and job execution information. The screenshots demonstrate how to view cluster health metrics, manage nodes, and run MapReduce jobs on the Hadoop cluster.
*Disclaimer: this is just my imaginary example of a Comms Plan for the Puma work, not the actual strategy created by Droga5 for Puma. I had nothing to do with that plan and am just a fan of their work.
What is Comms Planning? is a presentation that provides a clear answer of the role of the Comms Planner within an Advertising Agency. I use the example of the Puma Social campaign to prove the point.
This document provides an overview of using R, Hadoop, and Rhadoop for scalable analytics. It begins with introductions to basic R concepts like data types, vectors, lists, and data frames. It then covers Hadoop basics like MapReduce. Next, it discusses libraries for data manipulation in R like reshape2 and plyr. Finally, it focuses on Rhadoop projects like RMR for implementing MapReduce in R and considerations for using RMR effectively.
Storage and computation are getting cheaper AND easily accessible on demand in the cloud. We now collect and store some really large data sets, e.g. user activity logs, genome sequencing data, sensor data, etc. Hadoop and the ecosystem of projects built around it present simple and easy-to-use tools for storing and analyzing such large data collections on commodity hardware.
Topics Covered
* The Hadoop architecture.
* Thinking in MapReduce.
* Run some sample MapReduce jobs using Hadoop Streaming (a sketch follows this list).
* Introduce Pig Latin, an easy-to-use data processing language.
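As a rough illustration of the Hadoop Streaming topic above (my own Python sketch, not code from the talk; the file name and paths are invented), a word-count job is just a script that reads stdin and writes tab-separated key-value pairs to stdout:

#!/usr/bin/env python
# wordcount_streaming.py -- run the same file as mapper or reducer, e.g.:
#   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
#     -mapper 'wordcount_streaming.py map' -reducer 'wordcount_streaming.py reduce' \
#     -file wordcount_streaming.py
import sys

def do_map():
    # emit "word<TAB>1" for every token seen
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

def do_reduce():
    # Hadoop sorts mapper output by key, so equal words arrive consecutively
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(current + "\t" + str(total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

if __name__ == "__main__":
    do_map() if sys.argv[1] == "map" else do_reduce()

The same pipeline can be tested without a cluster: cat input.txt | python wordcount_streaming.py map | sort | python wordcount_streaming.py reduce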
Speaker Profile: Mahesh Reddy is an entrepreneur, chasing dreams. He works on large-scale crawl and extraction of structured data from the web. He is a graduate from IIT Kanpur (2000-05) and previously worked at Yahoo! Labs as Research Engineer/Tech Lead on Search and Advertising products.
Spark is an open-source cluster computing framework. It started as a project in 2009 at UC Berkeley and was open sourced in 2010. It has over 300 contributors from 50+ organizations. Spark uses Resilient Distributed Datasets (RDDs) that allow in-memory cluster computing across clusters. RDDs provide a programming model for distributed datasets that can be created from external storage or by transforming existing RDDs. RDDs support operations like map, filter, reduce to perform distributed computations lazily.
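A small PySpark sketch of that model (illustrative only, not from the summarized talk): transformations build a lazy lineage, and the action at the end triggers the distributed computation.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")
rdd = sc.parallelize(range(1, 101))          # an RDD from an in-memory collection
result = (rdd.map(lambda x: x * x)           # transformation: recorded, not run
             .filter(lambda x: x % 2 == 0)   # transformation: still lazy
             .reduce(lambda a, b: a + b))    # action: triggers the actual job
print(result)
sc.stop()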
This tutorial demonstrates how to create and work with geological databases in Surpac. Key steps covered include:
1. Creating DTMs from strings, spot heights, and a combination of breaklines and spot heights.
2. Setting up a new Surpac database by defining mandatory collar and survey tables, as well as optional tables for assays and geology.
3. Importing data, viewing tables, displaying and manipulating drillholes, creating sections, compositing, extracting data using domains, and displaying histograms.
The tutorial provides a comprehensive introduction to building and utilizing geological databases in Surpac for tasks such as resource estimation and feasibility studies.
GALE: Geometric active learning for Search-Based Software Engineering (CS, NcState)
Multi-objective evolutionary algorithms (MOEAs) help software engineers find novel solutions to complex problems. When automatic tools explore too many options, they are slow to use and hard to comprehend. GALE is a near-linear time MOEA that builds a piecewise approximation to the surface of best solutions along the Pareto frontier. For each piece, GALE mutates solutions towards the better end. In numerous case studies, GALE finds comparable solutions to standard methods (NSGA-II, SPEA2) using far fewer evaluations (e.g. 20 evaluations, not 1,000). GALE is recommended when a model is expensive to evaluate, or when some audience needs to browse and understand how an MOEA has made its conclusions.
Cloud Computing course presentation, Tarbiat Modares University
By: Sina Ebrahimi, Mohammadreza Noei
Advisor: Sadegh Dorri Nogoorani, PhD.
Presentation Date: 1397/03/07
Video Link in Aparat: https://www.aparat.com/v/N5VbK
Video Link on TMU Cloud: http://cloud.modares.ac.ir/public.php?service=files&t=9ecb8d2dd08df6f990a3eb63f42011f7
This presentation's pptx file (some animations may be lost in slideshare): http://cloud.modares.ac.ir/public.php?service=files&t=f62282dbd205abaa66de2512d9fdfc83
Training in Analytics, R and Social Media Analytics (Ajay Ohri)
This document provides an overview of basics of analysis, analytics, and R. It discusses why analysis is important, key concepts like central tendency, variance, and frequency analysis. It also covers exploratory data analysis, common analytics software, using R for tasks like importing data, data manipulation, visualization and more. Examples and demos are provided for many common R functions and techniques.
Pig Latin is a data flow language and execution framework for parallel computation. It allows users to express data analysis programs intuitively as a series of steps. Pig runs these steps on Hadoop for scalable processing. Key features include a simple declarative language, support for nested data types, user defined functions, and a debugging environment. The document provides an overview of Pig Latin concepts like loading and transforming data, filtering, joining, and outputting results. It also compares Pig Latin to MapReduce and SQL, highlighting Pig's advantages for iterative data analysis tasks on large datasets.
Spark DataFrames provide a more optimized way to work with structured data compared to RDDs. DataFrames allow skipping unnecessary data partitions when querying, such as only reading data partitions that match certain criteria like date ranges. DataFrames also integrate better with storage formats like Parquet, which stores data in a columnar format and allows skipping unrelated columns during queries to improve performance. The code examples demonstrate loading a CSV file into a DataFrame, finding and removing duplicate records, and counting duplicate records by key to identify potential duplicates.
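A hedged sketch of the workflow that summary describes (the file and column names are my own invention):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# count records per key to surface potential duplicates
(df.groupBy("event_id")
   .count()
   .filter(F.col("count") > 1)
   .show())

# keep one row per key, then store in a columnar format so later queries
# can skip unrelated columns
df.dropDuplicates(["event_id"]).write.parquet("events.parquet")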
The document provides an introduction to Hadoop, including an overview of its core components HDFS and MapReduce, and motivates their use by explaining the need to process large amounts of data in parallel across clusters of computers in a fault-tolerant and scalable manner. It also presents sample code walkthroughs and discusses the Hadoop ecosystem of related projects like Pig, HBase, Hive and Zookeeper.
This document provides an introduction to Hadoop, including its motivation and key components. It discusses the scale of cloud computing that Hadoop addresses, and describes the core Hadoop technologies - the Hadoop Distributed File System (HDFS) and MapReduce framework. It also briefly introduces the Hadoop ecosystem, including other related projects like Pig, HBase, Hive and ZooKeeper. Sample code is walked through to illustrate MapReduce programming. Key aspects of HDFS like fault tolerance, scalability and data reliability are summarized.
This document discusses using Ruby for big data analysis. It provides an overview of Hadoop and Spark ecosystems for distributed storage and processing. It then demonstrates analyzing Twitter data with a Ruby script using the Hadoop streaming API and Spark. The Ruby script extracts date fields from tweets and counts tweets by day, illustrating how Ruby can leverage Hadoop and Spark for distributed data analysis.
R is an open-source statistical programming language that can be used for data analysis and visualization. The document provided an introduction to R including how to install R, create variables, import and assemble data, perform basic statistical analyses like t-tests and linear regression, and create plots and graphs. Key functions and concepts introduced included using c() to combine values into vectors, reading in data from CSV files, using lm() for linear regression, and the basic plot() function.
OLAP Basics and Fundamentals (Bharat Kalia)
The document discusses online analytical processing (OLAP) and the need for OLAP capabilities beyond basic data analysis. It describes how OLAP uses multidimensional data models and pre-computed aggregates to provide fast and interactive analysis of data across multiple dimensions. Different approaches for implementing OLAP like ROLAP, MOLAP, and hybrid systems are covered.
This document provides an overview of Pig Latin, a data flow language used for analyzing large datasets. Pig Latin scripts are compiled into MapReduce programs that can run on Hadoop. The key points covered include:
- Pig Latin allows expressing data transformations like filtering, joining, grouping in a declarative way similar to SQL. This is compiled into MapReduce jobs.
- It features a rich data model including tuples, bags and nested data to represent complex data structures from files.
- User defined functions (UDFs) allow custom processing like extracting terms from documents or checking for spam.
- The language provides commands like LOAD, FOREACH, FILTER, JOIN to load, transform and analyze data in parallel across
What is Distributed Computing, Why we use Apache Spark (Andy Petrella)
In this talk we introduce the notion of distributed computing and then tackle Spark's advantages.
The Spark core content is very tiny because the whole explanation has been done live using a Spark Notebook (https://github.com/andypetrella/spark-notebook/blob/geek/conf/notebooks/Geek.snb).
This talk has been given together by @xtordoir and myself at the University of Liège, Belgium.
This document provides an introduction to Apache Spark, including its history and key concepts. It discusses how Spark was developed at UC Berkeley in response to big data processing needs and how it builds upon earlier systems like Google's MapReduce. The document then covers Spark's core abstractions like RDDs and DataFrames/Datasets and common transformations and actions. It also provides an overview of Spark SQL and how to deploy Spark applications on a cluster.
This document introduces the theory behind geological database processes and provides detailed examples using the geological database modelling functions in Surpac. By working through this tutorial you will gain skills in the creation, use, and modification of geological databases.
This document provides an introduction to Apache Spark, including its architecture and programming model. Spark is a cluster computing framework that provides fast, in-memory processing of large datasets across multiple cores and nodes. It improves upon Hadoop MapReduce by allowing iterative algorithms and interactive querying of datasets through its use of resilient distributed datasets (RDDs) that can be cached in memory. RDDs act as immutable distributed collections that can be manipulated using transformations and actions to implement parallel operations.
This document summarizes a presentation about data science at OLX. It discusses OLX's moderation and recommender systems. For moderation, it describes OLX's machine learning models that automatically moderate listings for issues like duplicates, spam, and illegal/NSFW content. Moderators review flagged content. For recommendations, it discusses collaborative filtering and item embeddings to suggest relevant listings to users. It also outlines OLX's team structure, goal setting process, and expectations for data scientists, which include a focus on modeling, evaluation and some production work.
Whylogs is an open source tool for data monitoring that automatically creates statistical summaries called profiles of datasets. It helps with data monitoring by generating these profiles which can be compared over time to detect changes visually or programmatically. This allows issues like schema changes or bugs in data pipelines to be identified. The profiles have properties like being descriptive, lightweight and mergeable, which enables monitoring across distributed systems by allowing profile data to be logically merged. Whylogs thus provides a step towards observability of data systems.
The document outlines the plan and syllabus for a Data Engineering Zoomcamp hosted by DataTalks.Club. It introduces the four instructors for the course - Ankush Khanna, Sejal Vaidya, Victoria Perez Mola, and Alexey Grigorev. The 10-week course will cover topics like data ingestion, data warehousing with BigQuery, analytics engineering with dbt, batch processing with Spark, streaming with Kafka, and a culminating 3-week student project. Pre-requisites include experience with Python, SQL, and the command line. Course materials will be pre-recorded videos and there will be weekly live office hours for support. Students can earn a certificate and compete on a
This document discusses Zalando's use of AI to improve size and fit recommendations for customers. It outlines several challenges including varying size conventions, limited fit data for new items, and sparse customer purchase histories. It then describes Zalando's approaches to address these, including algorithms that use item images to predict sizes for new items lacking data (SizeNet) and models that learn from customers' past purchases and feedback to provide personalized size recommendations. The goal is to help customers find the right fit on their first purchase to reduce returns and improve the shopping experience.
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova (Alexey Grigorev)
Computer vision techniques like facial recognition and image captioning can help automate metadata generation for media companies. Facial recognition can identify people in photos to assist editors and improve searchability, while image captioning can propose captions. A case study of applying these techniques to photos from English Premier League football games achieved 99% accuracy for facial recognition and precision of 78.7% for image captioning. Combining the two allows generating customized captions that include names identified through facial recognition. Challenges remain when the automatic caption does not match details in the image.
This document discusses several paradoxes that can arise in data science. It begins by discussing modelling and simulations that can be used when data is unavailable. It then outlines Simpson's Paradox, where a trend seen in groups disappears or reverses when the groups are combined. Next, it discusses the accuracy paradox, where a metric stops being useful once it becomes the target. It also discusses the learnability-Gödel paradox related to the limitations of mathematics according to Gödel's incompleteness theorems. Finally, it discusses the law of unintended consequences as it relates to data science.
An algorithm is considered fair if its results and performance are independent of sensitive variables like gender, ethnicity, etc. Fairness can be introduced at different stages of model development, such as in data collection, preparation, and model selection. Techniques for identifying and mitigating bias include causal reasoning, explainability, fairness metrics, and counterfactuals. Counterfactual fairness evaluates predictions across different protected attribute values while holding other variables constant. Explainability helps ensure models make decisions for the right reasons. Overall fairness aims to achieve equal outcomes or opportunities across groups.
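As a toy rendering of the counterfactual idea (the names and encoding are hypothetical, not from the document): flip only the protected attribute and measure how often the model's prediction changes.

def counterfactual_flip_rate(model, X, protected="gender"):
    # assumes a pandas DataFrame with a binary 0/1 protected column
    X_cf = X.copy()
    X_cf[protected] = 1 - X_cf[protected]   # the counterfactual individuals
    changed = model.predict(X) != model.predict(X_cf)
    return changed.mean()                   # 0.0 = counterfactually fair on this data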
This document discusses MLOps at OLX, including:
- The main areas of data science work at OLX like search, recommendations, fraud detection, and content moderation.
- How OLX uses teams structured by both feature areas and roles to collaborate on projects.
- A maturity model for MLOps with levels from no MLOps to fully automated processes.
- How OLX has improved from siloed work to cross-functional teams and adding more automation to model creation, release, and application integration over time.
Introduction to Transformers for NLP - Olga Petrova (Alexey Grigorev)
Olga Petrova gives an introduction to transformers for natural language processing (NLP). She begins with an overview of representing words using tokenization, word embeddings, and one-hot encodings. Recurrent neural networks (RNNs) are discussed as they are important for modeling sequential data like text, but they struggle with long-term dependencies. Attention mechanisms were developed to address this by allowing the model to focus on relevant parts of the input. Transformers use self-attention and have achieved state-of-the-art results in many NLP tasks. Bidirectional Encoder Representations from Transformers (BERT) provides contextualized word embeddings trained on large corpora.
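To ground the attention mechanism mentioned above, here is a minimal single-head self-attention in NumPy (my sketch; real transformers add multiple heads, masking, and per-layer learned projections):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # context-mixed representations

d = 8
X = np.random.randn(5, d)                             # 5 token embeddings
out = self_attention(X, *(np.random.randn(d, d) for _ in range(3)))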
This document discusses the use of machine learning in online marketplaces. It outlines how machine learning is used for recommendations, search, trust and safety, seller experience, and pricing/monetization. Specific applications mentioned include collaborative and content-based recommendation systems, ranking models for search, automated content moderation, image quality assessment, dynamic pricing, and promoting listings. The document provides examples of algorithms like counting, collaborative filtering, learning to rank, and neural networks that power these machine learning applications in online marketplaces.
ML Zoomcamp 2.1 - Car Price Prediction Project (Alexey Grigorev)
This document outlines the plan for a car price prediction project. The plan includes preparing and exploring the data, using linear regression to predict prices, understanding how linear regression works, evaluating the model's accuracy, engineering features, regularizing the model, and implementing the model. The source code for the project is available at a GitHub link provided. The next step mentioned is exploratory data analysis of the data.
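The mathematical core of such a project fits in a few lines (a sketch under my own assumptions, not the course's actual code): ordinary least squares via the normal equation.

import numpy as np

def train_linear_regression(X, y):
    X = np.column_stack([np.ones(len(X)), X])   # prepend a bias column
    w = np.linalg.solve(X.T @ X, X.T @ y)       # solve (X^T X) w = X^T y
    return w[0], w[1:]                          # intercept, feature weights

X = np.random.rand(100, 3)                      # stand-in feature matrix
y = X @ [2.0, -1.0, 0.5] + 3.0                  # synthetic "prices"
bias, coef = train_linear_regression(X, y)      # recovers ~3.0 and ~[2, -1, 0.5]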
Increase Quality with User Access Policies - July 2024 (Peter Caitens)
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
Retrieval Augmented Generation Evaluation with Ragas (Zilliz)
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
The History of Embeddings & Multimodal Embeddings (Zilliz)
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
UiPath Community Day Amsterdam: Code, Collaborate, Connect (UiPathCommunity)
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
Generative AI technology is a fascinating field that focuses on creating comp... (Nohoax Kanont)
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations (webbyacad software)
When looking for a good software utility to convert Outlook OST files to PST format, it is important to find one that is easy to use and has useful features. WebbyAcad OST to PST Converter Tool is a great choice because it is simple to use for anyone, whether you are tech-savvy or not. It can smoothly change your files to PST while keeping all your data safe and secure. Plus, it can handle large amounts of data and convert multiple files at once, which can save you a lot of time. It even comes with 24*7 technical support assistance and a free trial, so you can try it out before making a decision. Whether you need to recover, move, or back up your data, Webbyacad OST to PST Converter is a reliable option that gives you all the support you need to manage your Outlook data effectively.
Keynote : AI & Future Of Offensive Security (Priyanka Aash)
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
2. 2
Hadoop: In this Presentation
1. Introduction
2. Origins
3. MapReduce
4. Hadoop as MapReduce Implementation
5. Data Warehouse on Hadoop
6. Hadoop and Data Warehousing
7. Conclusions
3. 3
Why?
• Lots of data
• How to deal with it?
• Hadoop to rescue!
• When to use?
• When not to use?
• Curiosity
4. 4
MapReduce: Origins
• Functional Programming
• Higher-order functions that operate on lists
• map
• apply to each element of the list
• reduce = fold = accumulate
• aggregate a list and produce one value of output
• No side effects
5. 5
MapReduce: Origins
• (define (+1 el) (+ el 1))
• (map +1 (list 1 2 3)) ⇒ (list 2 3 4)
• (reduce + 0 (list 2 3 4)) ⇒ 9
• (reduce + 0 (map +1 (list 1 2 3))) ⇒ 9
6. 6
MapReduce: Origins
• These functions do not have side effects
• And can be parallelized easily
• Can split the input data into chunks:
• (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4)
• Apply map to each chunk separately, and then combine (reduce) them together
7. 7
MapReduce: Origins
• Mapping separately:
• (define res1 (reduce + 0 (map +1 (list 1 2))))
• (reduce + res1 (map +1 (list 3 4)))
• This is the same as (reduce + 0 (map +1 (list 1 2 3 4)))
• Note that for reduce the function must be associative
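The same argument in Python, for readers less used to Lisp (my rendering, not a slide from the deck):

from functools import reduce
from operator import add

inc = lambda x: x + 1

res1 = reduce(add, map(inc, [1, 2]), 0)        # reduce the first chunk
total = reduce(add, map(inc, [3, 4]), res1)    # fold the second chunk on top
assert total == reduce(add, map(inc, [1, 2, 3, 4]), 0) == 14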
8. 8
MapReduce
• A map function
• takes a key-value pair (in_key, in_val)
• produces zero or more key-value pairs: intermediate results
• intermediate results are grouped by key
• A reduce function
• for each group in the intermediate results
• aggregates and produces the final output
9. 9
MapReduce Stages
Each MapReduce job is executed in 3 stages:
• map stage: apply map to each key-value pair
• group stage: group together the intermediate results by key
• reduce stage: apply reduce to each group
12. 12
MapReduce Example
def map(String input_key, String doc):
    for each word w in doc:
        EmitIntermediate(w, 1)

def reduce(String output_key, Iterator output_vals):
    int res = 0
    for each v in output_vals:
        res += v
    Emit(res)
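The pseudocode above translates almost line-for-line into runnable Python, with the grouping stage made explicit (my rendering, not from the deck):

from collections import defaultdict

def map_fn(input_key, doc):
    for word in doc.split():
        yield word, 1                          # EmitIntermediate(w, 1)

def reduce_fn(output_key, output_vals):
    return output_key, sum(output_vals)        # Emit(res)

docs = {"d1": "a rose is a rose", "d2": "a b"}

groups = defaultdict(list)                     # group stage: intermediate results by key
for key, doc in docs.items():
    for word, one in map_fn(key, doc):
        groups[word].append(one)

counts = dict(reduce_fn(w, vals) for w, vals in groups.items())
# {'a': 3, 'rose': 2, 'is': 1, 'b': 1}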
13. 13
MapReduce Example
• map stage: output (w, 1) for each word w
• group stage: collect the pairs for each w into (w, [1, 1, ..., 1])
• reduce stage: for each w, calculate how many ones there are
16. 16
“
Hadoop
... is a framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to
deliver high-availability, the library itself is designed to detect and handle
failures at the application layer, so delivering a highly-available service on
top of a cluster of computers, each of which may be prone to failures.
17. 17
Hadoop
• Open Source implementation of MapReduce
• "Hadoop":
• HDFS
• Hadoop MapReduce
• HBase
• Hive
• ... many others
18. 18
Hadoop Cluster: Terminology
• Name Node: orchestrates the process
• Workers: nodes that do the computation
• Mappers do the map phase
• Reducers do the reduce phase
28. 28
Advantages
• Simple, especially for programmers who know FP
• Fault tolerant
• No schema, can process any data
• Flexible
• Cheap and runs on commodity hardware
29. 29
Disadvantages
• No declarative high-level language like SQL
• Performance issues:
• Map and Reduce are blocking
• Name Node: single point of failure
• It's young
34. 34
Cheetah
• Virtual views consist of columns that can be queried
• Everything inside is entirely denormalized
• Append-only design and slowly changing dimensions
• Proprietary
35. 35
Hive
• A data warehousing solution built by Facebook
• For Big data analysis:
• in 2010 (4 years ago!), 30+ PB
• Has its own data model
• HiveQL: a declarative SQL-like language for ad-hoc querying
36. 36
HiveQL
Tables
STATUS_UPDATES(userid int, status string, ds string)
PROFILES(userid int, school string, gender int)

LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')
37. 37
HiveQL
FROM
  (SELECT a.status, b.school, b.gender
   FROM status_updates a JOIN profiles b
   ON (a.userid = b.userid AND a.ds='2009-03-20')) subq1

INSERT OVERWRITE TABLE gender_summary
PARTITION (ds='2009-03-20')
SELECT subq1.gender, COUNT(1)
GROUP BY subq1.gender

INSERT OVERWRITE TABLE school_summary
PARTITION (ds='2009-03-20')
SELECT subq1.school, COUNT(1)
GROUP BY subq1.school
39. 39
HiveQL
REDUCE subq2.school, subq2.meme, subq2.cnt
USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
  FROM (
    MAP b.school, a.status
    USING 'meme-extractor.py' AS (school, meme)
    FROM status_updates a JOIN profiles b
    ON (a.userid = b.userid)
  ) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc
) subq2
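The deck does not include 'meme-extractor.py' itself, but Hive's MAP/REDUCE ... USING contract is simple: rows arrive on the script's stdin as tab-separated text, and tab-separated output lines map onto the AS (...) columns. A hypothetical sketch in Python:

import sys

def extract_memes(status):
    # stand-in for real meme extraction: treat hashtag-like tokens as memes
    return [t for t in status.split() if t.startswith("#")]

for line in sys.stdin:
    school, status = line.rstrip("\n").split("\t", 1)
    for meme in extract_memes(status):
        print(school + "\t" + meme)            # maps onto AS (school, meme)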
41. 41
Hadoop + Data Warehouse
• Hadoop and Data Warehouses can co-exist
• DW: OLAP, BI, transactional data
• Hadoop: Raw, unstructured data
42. 42
ETL
• Extract: load to HDFS, parse, prepare
• Run some analysis
• Transform: clean data and transform to some structured format
• with MapReduce
• Load: extract from HDFS, load to DW
43. 43
ETL: examples
• Text processing
• Call center records analysis
• extract sentiment
• link to profile
• which customers are more important to keep?
• Image processing
44. 44
Active Storage
• Don't delete the data after processing
• Hadoop storage is cheap: it can store anything
• Run more analysis when needed
• Like: extract new keywords/features from the old dataset
45. 45
Active Storage - 2
• Up to 80% of data is dormant (or cold)
• Hadoop storage can be way cheaper than high-cost data management
solutions
• Move this data to Hadoop
• When needed quickly analyze there or move back to DW
49. 49
Analytical Sandbox
• What are we looking for in this data?
• No structure - hard to know
• Run ad-hoc Hive queries to see what's there
50. 50
Conclusions
• Hadoop is becoming more and more popular
• Many companies plan to adopt it
• Best used with existing DW solutions
• as an ETL
• as Active Storage
• as Analytical Sandbox
51. 51
References
1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20. [pdf]
2. "MapReduce vs Data Warehouse". Webpage, [link]. Accessed 15/12/2013.
3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010. [pdf]
4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). [pdf] (by Cloudera and Teradata)
5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB Endowment 2.2 (2009): 1626-1629. [pdf]
6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the VLDB Endowment 3.1-2 (2010): 1459-1468. [pdf]
52. 52
References
7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage [link]. Accessed 15/12/2013.
8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). [pdf]
9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." [pdf]
10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of
the ACM 51.1 (2008): 107-113. [pdf]
11. "What is Hadoop?" Webpage [link]. Accessed 15/12/2013.
12. Apache Hadoop project home page, url: [link].
13. Apache HBase home page, [link].
14. Apache Mahout home page, [link].
15. "How Hadoop Cuts Big Data Costs" [link]. Accessed 05/01/2014.
16. "The Impact of Data Temperature on the Data Warehouse." whitepaper by Terradata (2012). [pdf]
17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical
workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933. [pdf]