Project’s Primary Goals
1. To analyse past sales data and generate insights into which mobile phone features drive sales.
2. To use these insights to plan inventory efficiently for the next 6 months.
Data Description
3. The dataset consists of sales and product-related features.
4. It contains descriptions of the top 5 most popular mobile brands.
5. It has 418 rows (instances) and 16 columns (features).
Strategies Deployed for Modelling
6. Check for missing values in the dataset and treat them with suitable methods.
7. Detect outliers and treat them with appropriate techniques.
8. Check for multicollinearity among the variables and treat highly correlated variables with suitable methods.
9. Build a linear regression model to predict mobile phone sales.
10. Report the model's metrics.
11. Identify the significant variables, then rebuild and report the model using only those variables.
12. Based on the final model's outcomes, determine the features driving mobile phone sales.
13. List recommendations to help with inventory planning for the next 6 months.
Author: Anthony Mok
Date: 16 Nov 2023
Email: xxiaohao@yahoo.com
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6 Months
1. Using Insight-informed Data to Plan Inventory in Next 6 Months
An Application of Linear Regression Modelling
Author: Anthony Mok
Date: 16 Nov 2023
Email: xxiaohao@yahoo.com
2. Agenda
1. Modern-day Data Analytics
2. Linear Regression Analysis
3. Code-less, Less-code Analytical Tools
4. Project’s Primary Goals
5. Description of Dataset
6. Modelling Strategies
7. Findings, Conclusions & Recommendations
3. Modern-day Data Analytics
Both traditional and modern-day data analytics deal with extracting insights from information, but they differ significantly in their methods and capabilities. The key differences are:
Feature        | Traditional Data Analytics | Modern-Day Data Analytics
Data type      | Mostly structured          | Diverse (structured, semi-structured, unstructured)
Technology     | On-premises                | Cloud-based
Processing     | Batch-oriented             | Real-time or near-real-time
Analysis       | Descriptive, diagnostic    | Predictive, prescriptive
Accessibility  | Limited to data analysts   | Aims for data democratisation
4. Linear Regression Analysis
Linear regression is a statistical method that models the relationship between a dependent variable and one or more independent variables using a straight line.
It is used to understand trends, make predictions, and test hypotheses.
This analysis is suitable when the data exhibits a linear relationship and assumptions such as normality and constant variance hold.
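In symbols, a linear regression with p independent variables fits

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

where the normality and constant-variance assumptions above refer to the error term ε.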
5. Code-less & Less-code Data Analytics
Code-less Applications
Offer drag-and-drop interfaces, pre-built connectors, and automated workflows, making data analysis accessible to everyone, even without technical expertise.
Less-code Applications
Requiring some coding knowledge, less-code platforms provide pre-written code snippets, wizards, and visual tools to streamline complex tasks.
6. KNIME - Less-code Data Analytics
• KNIME is a less-code data analytics platform
• Build visual workflows with pre-built nodes for data preparation, analysis, and visualisation
• No coding required, but Python integration empowers customisation
• Unique Selling Points: open-source, free, and powerful. Handles diverse data, builds predictive models, and deploys insights
• This project is carried out using KNIME
7. Project’s Primary Goals
• To analyse past sales data and generate insights into which features of mobile phones drive sales
• To use these insights to plan inventory efficiently for the next 6 months
8. Dataset
Data Description
• Dataset consists of sales and product-related features
• Dataset contains descriptions of the top 5 most popular mobile brands
• Dataset consists of 418 rows and 16 columns
Data Dictionary
• A sample data dictionary* is given below:
* More details are found in the project report, which is not released at the request of the Social Enterprise
9. Strategies for Modelling
• Check for, and treat with suitable methods, missing values in the dataset
• Observe for, and take suitable steps to treat, outliers
• Check for multicollinearity amongst variables and take suitable steps to treat highly correlated variables
• Build a Linear Regression Model to predict the sales of mobile phones
• Report on the metrics of the models
• Identify the significant variables, then rebuild and report on the model using only these variables
• Based on the final model outcomes, determine the features driving mobile phone sales
• List the recommendations to help in inventory planning for the next 6 months
10. Check for Missing Values in the Dataset
• A KNIME workflow was created
• ‘CSV Reader’ and ‘Data Explorer’ nodes were dragged and dropped onto the KNIME platform to ingest and explore the variables and data in the dataset
• Using the interactive view of the ‘Data Explorer’ node, 16 numeric variables were discovered
• The properties of the variables were expanded to explore their missing values; the only missing values found were in Rows 7, 18 & 397 of the ‘display_size’ variable
11. Treat Missing Values in Dataset
• Since ‘display_size’ is treated as a categorical variable, its mode was used to replace the missing values
• To do this, the data in the ‘display_size’ column was converted from numbers to strings using the ‘Number To String’ node
• The ‘Missing Value’ node was used to replace the three missing values with their ‘Most Frequent Value’, that is, the mode of 6.5
• Finally, the ‘String To Number’ node was deployed to return this column to its original data format for modelling purposes
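The same mode imputation can be sketched in a few lines of plain Python (toy data standing in for ‘display_size’; the project itself does this with KNIME’s ‘Missing Value’ node rather than code):

```python
from collections import Counter

# Toy column standing in for 'display_size'; None marks missing entries
display_size = [6.5, 6.5, None, 6.1, None, 6.5, 6.1]

# Mode of the observed values, mirroring the 'Missing Value' node
# configured to 'Most Frequent Value'
observed = [v for v in display_size if v is not None]
mode_value = Counter(observed).most_common(1)[0][0]

# Replace each missing entry with the mode
imputed = [v if v is not None else mode_value for v in display_size]
```

On this toy column the mode is 6.5, matching the value the deck reports for the real dataset.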
12. Observe and Treat Outliers
The histogram for ‘ratings’ was constructed to study its distribution:
The distribution of ‘ratings’ is slightly left skewed. The median for ‘ratings’, the middlemost value when the ratings are ordered from smallest to largest, is 4.3, while the mean, the average of all ratings, is 4.339. The difference of only 0.039 between the mean and median places them tightly together; when the middle value closely resembles the average, the distribution of ‘ratings’ is close to symmetric. About 50% of the ‘ratings’ fall within the interquartile range, between 4.3 and 4.4, while about 25% of the ‘ratings’ are higher than Quartile 3, between 4.4 and 4.5, and about 25% are lower than Quartile 1, between 4.2 and 4.3
13. Observe and Treat Outliers
The box plot for ‘ratings’ was constructed to study its outliers:
Through the box plot, a total of 6 outliers were found. One value (in Row 49 of the dataset) is above the upper whisker boundary, and five values (in Rows 158, 259, 286, 320 and 408 of the dataset) are below the lower whisker boundary of the box plot. Relating these six outliers to real-life circumstances, the decision was not to treat them, since it is realistic to observe ratings of 4.6 (Row 49) and 3.0 (Row 320) on a 5-point customer rating scale. These rows are therefore kept to enhance the analysis
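The whisker boundaries behind such a box plot are conventionally Q1 − 1.5·IQR and Q3 + 1.5·IQR; a minimal Python sketch on a toy ratings sample (not the project’s 418-row data):

```python
import statistics

# Toy ratings on a 5-point scale; the real dataset has 418 rows
ratings = [4.2, 4.3, 4.3, 4.3, 4.4, 4.4, 4.5, 4.6, 3.0]

# statistics.quantiles with n=4 returns Q1, median, Q3
q1, _, q3 = statistics.quantiles(ratings, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the whisker boundaries are flagged as outliers
outliers = [r for r in ratings if r < lower or r > upper]
```

Whether flagged points are then treated or kept is a judgment call; as the slide notes, plausible real-world ratings were kept.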
14. Check for Multicollinearity Amongst Variables
The ‘Linear Correlation’ node was engaged to observe the correlation coefficients between all the numerical variables. After sorting the ‘Correlation Value’ in descending order in the ‘View’ function of the ‘Linear Correlation’ node, the correlation value between the variables ‘num_of_ratings’ and ‘sales’ was found to be 0.9418, or 94.18%. This suggests that these two variables are highly correlated.
Multicollinearity reduces the precision of the estimated coefficients, since they shift wildly with slight changes in other independent variables. In such situations, the p-values are unable to identify which independent variables are statistically significant. To strengthen the statistical power of the regression model, the multicollinearity between these variables needs to be removed. Typically, variables whose correlation values are >0.70 are deemed highly correlated and need to be treated
15. Treating for Multicollinearity Amongst Variables
The following steps were taken to achieve this outcome:
• Observe the correlation values and identify the highly correlated quantitative (numerical) variables, that is, those with correlation value >0.7
• Shift this variable to the ‘Exclude’ box of the ‘Configure’ function of the ‘Linear Correlation’ node
• Using the remaining variables, re-execute the ‘Linear Correlation’ node
• Observe the correlation values of the remaining variables after re-executing the node
• Identify the next highly correlated variables
• Repeat this process until all the variables have correlation values of <0.7
• This process was not repeated, as no other highly correlated quantitative (numerical) variables were found after treating the multicollinearity of ‘num_of_ratings’ and ‘sales’
16. Build the Linear Regression Model By:
1. The ‘Partitioning’ node was configured to split the dataset into training and testing sets in the ratio of 7:3
2. The ‘Linear Regression Learner’ was created with these configurations, with ‘sales’ as the ‘Target’
3. Two sets of ‘Regression Predictor’ and ‘Numeric Scorer’ nodes were created; one to ingest the training dataset and the other to churn the data from the testing dataset
17. Evaluate the Linear Regression Model
After feeding the training and testing datasets from the ‘Partitioning’ node into the learner and predictors, their numeric scorers produced the following metrics:
Training Dataset Numeric Scorer | Testing Dataset Numeric Scorer
The model has performed well on both the training and testing datasets. The R-squared is around 0.882 on the training dataset and 0.928 on the testing dataset. These are high R-squared values; the higher these values are, the better the model fits the data and the closer the predictions approximate the real data points. It is a clear indication that a good model has been created, able to explain up to 88% of the variance in the sales of mobile phones. The Mean Absolute Error indicates that the model is able to predict the sales of mobile phones within a mean error of 9.4 units of SGD on the testing dataset
18. Identify Significant Variables
The p-value measures the significance of observational data. There are 11 variables whose p-values are more than 0.05, starting with ‘battery_capacity’ at 0.799. Typically, a p-value that is less than or equal to 0.05 is statistically significant, which helps to determine that the observed relationship did not arise by chance
19. Rebuild Model with Significant Variables Only By:
• Shifting the variable with the highest p-value (>0.05) to the ‘Exclude’ box of the ‘Configure’ function of the ‘Linear Regression Learner’
• Using the remaining variables, re-executing the node
• Observing the changes in the p-values through the ‘Coefficients and Statistics’ function of the node
• Identifying the next variable with the highest p-value
• Continuing to iterate the process until all p-values of the remaining variables are ≤ 0.05
These are the six variables with p-value ≤ 0.05 that are retained to rebuild the model, since they are statistically significant:
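The backward-elimination loop described above can be sketched generically (the p-value lookup here is a fixed stand-in table with made-up variable names echoing the deck; in the project, fresh p-values come from re-executing the ‘Linear Regression Learner’ and reading its ‘Coefficients and Statistics’ view, which is why the callback receives the current variable set):

```python
def backward_eliminate(variables, pvalue_of, alpha=0.05):
    """Drop the variable with the largest p-value until every
    remaining variable is significant at the alpha level."""
    kept = list(variables)
    while kept:
        # In practice this is where the model would be refit
        pvals = {v: pvalue_of(kept, v) for v in kept}
        worst = max(pvals, key=pvals.get)
        if pvals[worst] <= alpha:
            break
        kept.remove(worst)
    return kept

# Stand-in p-value table (a real run refits after every removal)
P = {"battery_capacity": 0.799, "ram": 0.30,
     "discount_percent": 0.001, "display_size": 0.02}
kept = backward_eliminate(P, lambda current, v: P[v])
```

On this toy table, ‘battery_capacity’ and ‘ram’ are eliminated in turn, and the loop stops once every remaining variable is at or below 0.05.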
20. Evaluate the Rebuilt Linear Regression Model
After the model was rebuilt, the scorers for the training and testing datasets show the following information:
Training Dataset Numeric Scorer | Testing Dataset Numeric Scorer
This model continues to perform well on both the training and testing datasets. The R-squared is around 0.875 on the training dataset and 0.924 on the testing dataset, which are 0.007 and 0.004 lower than the original model. Nevertheless, these are high R-squared values; the higher they are, the better the model fits the data and the closer the predictions approximate the real data points. It is a clear indication that a good rebuilt model has been created, able to explain up to 88% of the variance in the sales of mobile phones. The Mean Absolute Error indicates that the model is able to predict the sales of mobile phones within a mean error of 9 units of SGD on the testing dataset
21. Findings & Conclusions*
Key Features Driving Mobile Phone Sales
• ‘discount_percent’ is the only coefficient with a comparatively large positive impact on ‘sales’: an increase of one unit of ‘discount_percent’ will increase ‘sales’ by 0.46 unit of SGD
• ‘display_size’ has the most negative impact on ‘sales’: an increase of one unit of ‘display_size’ would decrease ‘sales’ by around 1 unit of SGD
• In ranking order, ‘num_of_ratings’, ‘model’, ‘processor’ and ‘num_rear_camera’ have similar negative effects on ‘sales’: a unit increase in these would reduce ‘sales’ by around 0.38 unit of SGD
* More details are found in the project report, which is not released at the request of the Social Enterprise
22. Recommendations*
Recommendations On Inventory Planning
1. Look at including higher discounts
2. Stock mobile phones with smaller display sizes and fewer rear cameras
3. Narrow the range of models to stock
4. Keep phones whose processors are encoded at a lower value
* More details are found in the project report, which is not released at the request of the Social Enterprise