K-means clustering is an unsupervised machine learning algorithm that groups unlabeled data points into a specified number of clusters (k) based on their similarity. It works by randomly assigning data points to k clusters and then iteratively updating cluster centroids and reassigning points until cluster membership stabilizes. K-means clustering aims to minimize intra-cluster variation while maximizing inter-cluster variation. There are various applications and variants of the basic k-means algorithm.
2. What is Clustering?
• Grouping of records, observations, or cases into classes of similar objects.
• A cluster is a collection of records,
– Similar to one another
– Dissimilar to records in other clusters
5. Difference between Clustering and Classification
• There is no target variable for clustering.
• Clustering does not try to classify or predict the values of a target variable.
• Instead, clustering algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters,
– Where the similarity of the records within the cluster is maximized, and
– Similarity to records outside this cluster is minimized.
6. Goal of Clustering
• Identification of groups of records such that similarity within a group is very high while the similarity to records in other groups is very low.
– group data points that are close (or similar) to each other
– identify such groupings (or clusters) in an unsupervised manner
• Unsupervised: no information is provided to the algorithm on which data points belong to which clusters
• In other words,
– Clustering algorithms seek to construct clusters of records such that the between-cluster variation (BCV) is large compared to the within-cluster variation (WCV)
7. Goal of Clustering
• Within-cluster variation (intra-cluster distance): the sum of distances between objects in the same cluster, to be minimized.
• Between-cluster variation (inter-cluster distance): the distances between different clusters, to be maximized.
• A good clustering makes the between-cluster variation (BCV) large compared to the within-cluster variation (WCV).
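As a concrete illustration, here is a minimal sketch of one common way to quantify the two variations for a two-cluster partition: WCV as the sum of squared distances of points to their own centroid, and BCV as the distance between the two centroids. This particular formalization (squared Euclidean distances, centroid-based BCV) is an assumption for illustration, not the only possible choice:

```python
import numpy as np

def wcv(cluster: np.ndarray) -> float:
    """Within-cluster variation: sum of squared distances to the centroid."""
    centroid = cluster.mean(axis=0)
    return float(((cluster - centroid) ** 2).sum())

def bcv(cluster_a: np.ndarray, cluster_b: np.ndarray) -> float:
    """Between-cluster variation: distance between the two centroids."""
    return float(np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0)))

# Two toy 2-D clusters (hypothetical data for illustration)
c1 = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]])
c2 = np.array([[8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])
print(wcv(c1) + wcv(c2), bcv(c1, c2))  # small WCV, large BCV => good clustering
```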
8. Clustering
• Clustering techniques apply when there is no class to be predicted
• As we've seen, clusters can be:
– disjoint vs. overlapping
– deterministic vs. probabilistic
– flat vs. hierarchical
• k-means Algorithm
– k-means clusters are disjoint, deterministic, and flat
9. Issues Related to Clustering
• How to measure similarity
– Euclidian Distance
– City-block Distance
– Minkowski Distance
• How to recode categorical variables?
• How to standardize or normalize numerical
variables?
– Min-Max Normalization
– Z-score standardization: Z = (X − μ) / σ
• How many clusters we expect to uncover?
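A minimal numpy sketch of the two normalization options named above; operating column-wise on a feature matrix is an assumption made here for illustration:

```python
import numpy as np

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Rescale each column to the [0, 1] range."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def z_score_standardize(X: np.ndarray) -> np.ndarray:
    """Center each column at 0 with unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

Normalizing matters here because Euclidean distance is otherwise dominated by features with large numeric ranges.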
10. Type of Clustering
• Partitional clustering: Partitional algorithms
determine all clusters at once. They include:
– K-Means Clustering
– Fuzzy c-means clustering
– QT clustering
• Hierarchical Clustering:
– Agglomerative ("bottom-up"): Agglomerative
algorithms begin with each element as a
separate cluster and merge them into successively
larger clusters.
– Divisive ("top-down"): Divisive algorithms begin
with the whole set and proceed to divide it into
successively smaller clusters.
12. k-Means Clustering
• Input: n objects (or points) and a number k
• Algorithm
1) Randomly assign K records to be the initial
cluster center locations
2) Assign each object to the group that has the
closest centroid
3) When all objects have been assigned,
recalculate the positions of the K centroids
4) Repeat steps 2 to 3 until convergence or
termination
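A minimal, self-contained Python sketch of these four steps, assuming numeric data in a numpy array and Euclidean distance; initializing from k sampled data points is one common reading of step 1:

```python
import numpy as np

def k_means(X: np.ndarray, k: int, max_iters: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k records as the initial cluster centers
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each object to the group with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the positions of the k centroids
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: repeat until convergence (centroids stop moving)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

For the pizza example on the next slide, X would be the matrix of delivery coordinates and the returned centroids are candidate outlet locations.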
14. Termination Conditions
• The algorithm terminates when the centroids no longer change.
• Alternatively, it terminates when the SSE (sum of squared errors) falls below some small threshold value:

$$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$

where $p \in C_i$ ranges over the data points in cluster $i$ and $m_i$ is the centroid of cluster $i$.
15. Example 1:
• Let's suppose the following points are the delivery locations for pizza
22. Example 2:
• Suppose that we have eight data points in two-dimensional space (the coordinates used in the centroid calculations below):
a(1,3), b(3,3), c(4,3), d(5,3), e(1,2), f(4,2), g(1,1), h(2,1)
• And suppose that we are interested in uncovering k = 2 clusters.
23. • Take the initial cluster centers to be m1 = g = (1, 1) and m2 = h = (2, 1), and compute each point's Euclidean distance to both centroids:

$$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
25. Point   Distance from m1   Distance from m2   Cluster membership
    a       2.00               2.24               C1
    b       2.83               2.24               C2
    c       3.61               2.83               C2
    d       4.47               3.61               C2
    e       1.00               1.41               C1
    f       3.16               2.24               C2
    g       0.00               1.00               C1
    h       1.00               0.00               C2
SSE = 2.00² + 2.24² + 2.83² + 3.61² + 1.00² + 2.24² + 0² + 0² ≈ 36
d(m1, m2) = 1
26. Centroid of cluster 1 (a, e, g) is [(1+1+1)/3, (3+2+1)/3] = (1, 2)
Centroid of cluster 2 (b, c, d, f, h) is [(3+4+5+4+2)/5, (3+3+3+2+1)/5] = (3.6, 2.4)
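The first iteration above can be reproduced in a few lines of Python. This is a hedged sketch using the coordinates recovered from the slides' centroid calculations, not code from the original deck:

```python
import numpy as np

points = {"a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
          "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (2, 1)}
X = np.array(list(points.values()), dtype=float)

m1, m2 = np.array([1.0, 1.0]), np.array([2.0, 1.0])  # initial centroids g and h
d1 = np.linalg.norm(X - m1, axis=1)   # distance of each point from m1
d2 = np.linalg.norm(X - m2, axis=1)   # distance of each point from m2
labels = np.where(d1 <= d2, 1, 2)     # assign each point to the closer centroid

sse = np.sum(np.minimum(d1, d2) ** 2)
print(f"SSE = {sse:.2f}")                        # ~36, matching the slide
print("new m1 =", X[labels == 1].mean(axis=0))   # (1, 2)
print("new m2 =", X[labels == 2].mean(axis=0))   # (3.6, 2.4)
```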
27. • Recompute each point's distance to the updated centroids m1 = (1, 2) and m2 = (3.6, 2.4) and reassign:
29. Point   Distance from m1   Distance from m2   Cluster membership
    a       1.00               2.67               C1
    b       2.24               0.85               C2
    c       3.61               0.72               C2
    d       4.12               1.52               C2
    e       0.00               2.63               C1
    f       3.00               0.57               C2
    g       1.00               2.95               C1
    h       1.41               2.13               C1
SSE = 1.00² + 0.85² + 0.72² + 1.52² + 0² + 0.57² + 1.00² + 1.41² ≈ 7.88
d(m1, m2) = 2.63
m1 = (1, 2), m2 = (3.6, 2.4)
30. Centroid of cluster 1 (a, e, g, h) is [(1+1+1+2)/4, (3+2+1+1)/4] = (1.25, 1.75)
Centroid of cluster 2 (b, c, d, f) is [(3+4+5+4)/4, (3+3+3+2)/4] = (4, 2.75)
31. • Repeat once more with the updated centroids m1 = (1.25, 1.75) and m2 = (4, 2.75):
33. Point   Distance from m1   Distance from m2   Cluster membership
    a       1.27               3.01               C1
    b       2.15               1.03               C2
    c       3.02               0.25               C2
    d       3.95               1.03               C2
    e       0.35               3.09               C1
    f       2.76               0.75               C2
    g       0.79               3.47               C1
    h       1.06               2.66               C1
m1 = (1.25, 1.75), m2 = (4, 2.75)
SSE = 1.27² + 1.03² + 0.25² + 1.03² + 0.35² + 0.75² + 0.79² + 1.06² ≈ 6.25
d(m1, m2) = 2.93
Cluster memberships are unchanged from the previous iteration, so the recomputed centroids do not move and the algorithm terminates.
37. How to decide k?
• Unless the analyst has prior knowledge of the number of underlying clusters:
– clustering solutions for each value of k are compared, and
– the value of k resulting in the smallest SSE is selected
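One common way to operationalize this comparison is to run k-means for a range of k values and inspect how the SSE drops (the "elbow" heuristic, which is an addition here, not from the slides). A minimal sketch assuming scikit-learn is available, whose KMeans exposes the final SSE as inertia_:

```python
import numpy as np
from sklearn.cluster import KMeans

# The eight points from Example 2
X = np.array([[1, 3], [3, 3], [4, 3], [5, 3],
              [1, 2], [4, 2], [1, 1], [2, 1]], dtype=float)

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))  # SSE keeps dropping as k grows; look for the elbow
```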
38. Summary
• The K-means algorithm is a simple yet popular method for clustering analysis
• Low complexity: O(nkt), where t = #iterations
• Its performance is determined by initialisation and an appropriate distance measure
• There are several variants of K-means to overcome its weaknesses
– K-Medoids: resistance to noise and/or outliers (data that do not comply with the general behaviour or model of the data)
– K-Modes: extension to categorical data clustering analysis
– CLARA: extension to deal with large data sets
– Gaussian Mixture models (EM algorithm): handling uncertainty of clusters
40. Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram)
• One approach: recursive application of a partitional clustering algorithm.
animal
├── vertebrate: fish, reptile, amphib., mammal
└── invertebrate: worm, insect, crustacean
41. Hierarchical clustering and dendrograms
• A hierarchical clustering on a set of objects D is a set of nested partitions of D. It is represented by a binary tree such that:
– The root node is a cluster that contains all data points
– Each (parent) node is a cluster made of two subclusters (children)
– Each leaf node represents one data point (a singleton, i.e., a cluster with only one item)
• A hierarchical clustering scheme is also called a taxonomy. In data clustering the binary tree is called a dendrogram.
42. Dendrogram: Hierarchical Clustering
• A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
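As an illustration, SciPy can build and draw such a dendrogram directly; a minimal sketch assuming SciPy and matplotlib are installed (the random data is hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).random((10, 2))  # 10 random 2-D points (illustrative)
Z = linkage(X, method="average")              # agglomerative merge history

dendrogram(Z)                  # each leaf is one data point; merges shown as joins
plt.axhline(y=0.5, ls="--")    # cutting at this level yields the clusters
plt.show()
```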
44. Hierarchical clustering
• There are two styles of hierarchical clustering algorithms to build a tree from the input set S:
– Agglomerative (bottom-up):
• Begins with singletons (sets with 1 element)
• Merges them until S is achieved as the root
• At each step, the two closest clusters are aggregated into a new combined cluster
• In this way, the number of clusters in the data set is reduced at each step
• Eventually, all records/elements are combined into a single huge cluster
• It is the most common approach.
– Divisive (top-down):
• All records are combined into one big cluster
• The most dissimilar records are then split off recursively, partitioning S until singleton sets are reached
• Does not require the number of clusters k in advance
45. Two types of hierarchical clustering algorithms:
• Agglomerative: "bottom-up"
• Divisive: "top-down"
46. Hierarchical Agglomerative Clustering (HAC) Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
Among the current clusters, determine the two clusters, ci and cj, that are most similar.
Replace ci and cj with a single cluster ci ∪ cj.
• Assumes a similarity function for determining the similarity of two instances.
• Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.
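A minimal sketch of HAC in practice, assuming SciPy is available: linkage builds the full merge hierarchy and fcluster cuts the dendrogram, as described on the earlier dendrogram slide. The use of the Example 2 points is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 3], [3, 3], [4, 3], [5, 3],
              [1, 2], [4, 2], [1, 1], [2, 1]], dtype=float)  # Example 2 points

Z = linkage(X, method="single")                   # merge closest clusters first
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)
```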
47. Classification of AHC
We can distinguish AHC algorithms according to the type of distance measure used. There are two approaches:
Graph methods:
• Single link method
• Complete link method
• Group average method (UPGMA)
• Weighted group average method (WPGMA)
Geometric methods:
• Ward's method
• Centroid method
• Median method
48. Lance–Williams Algorithm
Definition (Lance–Williams formula):
In AHC algorithms, the Lance–Williams formula [Lance and Williams, 1967] is a recurrence equation used to calculate the dissimilarity between a cluster $C_k$ and a cluster formed by merging two other clusters $C_l \cup C_{l'}$:

$$d(C_k, C_l \cup C_{l'}) = \alpha_l \, d(C_k, C_l) + \alpha_{l'} \, d(C_k, C_{l'}) + \beta \, d(C_l, C_{l'}) + \gamma \, \lvert d(C_k, C_l) - d(C_k, C_{l'}) \rvert$$

where $\alpha_l$, $\alpha_{l'}$, $\beta$, $\gamma$ are real numbers.
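A small sketch of the recurrence as a Python function, with the standard coefficient choices for single link (α = 1/2, β = 0, γ = −1/2, which reduces to the minimum of the two distances) and complete link (γ = +1/2, the maximum). The function name and structure are illustrative assumptions:

```python
def lance_williams(d_kl: float, d_kl2: float, d_ll2: float,
                   alpha_l: float, alpha_l2: float,
                   beta: float, gamma: float) -> float:
    """Dissimilarity between cluster C_k and the merged cluster C_l ∪ C_l'."""
    return (alpha_l * d_kl + alpha_l2 * d_kl2
            + beta * d_ll2 + gamma * abs(d_kl - d_kl2))

# Standard coefficient settings (illustrative):
single_link   = dict(alpha_l=0.5, alpha_l2=0.5, beta=0.0, gamma=-0.5)  # min(d_kl, d_kl')
complete_link = dict(alpha_l=0.5, alpha_l2=0.5, beta=0.0, gamma=+0.5)  # max(d_kl, d_kl')

print(lance_williams(2.0, 5.0, 3.0, **single_link))    # 2.0
print(lance_williams(2.0, 5.0, 3.0, **complete_link))  # 5.0
```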
50. Single link method
• Also known as the nearest neighbor method,
since it employs the nearest neighbor to
measure the dissimilarity between two
clusters
51. Cluster distance measure
• Single link
– Distance between closest elements in clusters
• Complete link
– Distance between farthest elements in clusters
• Centroids
– Distance between the centroids (means) of the two clusters
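These three cluster-distance measures are easy to state in code; a minimal numpy/SciPy sketch (the function names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def single_link(A: np.ndarray, B: np.ndarray) -> float:
    """Distance between the closest elements of the two clusters."""
    return float(cdist(A, B).min())

def complete_link(A: np.ndarray, B: np.ndarray) -> float:
    """Distance between the farthest elements of the two clusters."""
    return float(cdist(A, B).max())

def centroid_link(A: np.ndarray, B: np.ndarray) -> float:
    """Distance between the centroids (means) of the two clusters."""
    return float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))
```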
84. Summary: Hierarchical Clustering
• For a dataset consisting of n points:
– O(n²) space; it requires storing the distance matrix
– O(n³) time complexity in most cases (agglomerative clustering)
• Advantages
– Dendrograms are great for visualization
– Provides hierarchical relations between clusters
• Disadvantages
– Not easy to define levels for clusters
– Can never undo what was done previously
– Sensitive to cluster distance measures and noise/outliers
– Experiments showed that other clustering techniques outperform hierarchical clustering
• There are several variants to overcome its weaknesses
– BIRCH: scalable to large data sets
– ROCK: clustering categorical data
– CHAMELEON: hierarchical clustering using dynamic modelling