Are you still building data pipelines with Java and Python? Are you curious about the current buzz in the Big Data community surrounding Scala as a data processing environment? In this talk I'll discuss how Spotify migrated its music recommendations pipeline from Python to Scala. I'll dive into the language-specific features that make Scala an ideal candidate for big data processing, and highlight the rich set of tools and APIs we take advantage of to process music recommendations for our 50 million active users, including Scalding, Breeze, Kafka, Spark, Parquet, Driven and Zeppelin.
How Spotify uses large-scale machine learning running on top of Hadoop to power music discovery. From the NYC Predictive Analytics meetup: http://www.meetup.com/NYC-Predictive-Analytics/events/129778152/
From Idea to Execution: Spotify's Discover Weekly, by Chris Johnson
Discover Weekly is a personalized mixtape of 30 songs, curated and delivered to Spotify's 75M active users every Monday. It received high acclaim in the press and reached 1B streams within its first 10 weeks. In this slide deck we dive into the narrative of how Discover Weekly came to be, highlighting technical challenges, data-driven development, and the machine learning models used to power our recommendations engine.
These are the slides of my talk at the 2019 Netflix Workshop on Personalization, Recommendation and Search (PRS). This talk is based on previous talks on research we are doing at Spotify, but here I focus on the work we do on personalizing Spotify Home, with respect to success, intent & diversity. The link to the workshop is https://prs2019.splashthat.com/. This is research from various people at Spotify, and has been published at RecSys 2018, CIKM 2018 and WWW (The Web Conference) 2019.
Machine Learning and Big Data for Music Discovery at Spotify, by Ching-Wei Chen
Spotify is the world’s largest on-demand music streaming company, with over 100 million active users who generate around 2TB of interaction data every day. With over 30 million songs to choose from, discovery and personalization play an essential role in helping users discover the best music for them. In this talk, given at the newly opened Galvanize space in NYC in March 2017, we’ll explain how Spotify uses Latent Space Models and Deep Learning to power features such as Discover Weekly and Release Radar.
Presented at the Machine Learning class at Chalmers, Gothenburg.
http://www.cse.chalmers.se/research/lab/courses.php?coid=9
The talk tries to connect their theoretical machine learning class with industry examples.
This document provides an overview of Scala data pipelines at Spotify. It discusses:
- The speaker's background and Spotify's scale with over 75 million active users.
- Spotify's music recommendation systems including Discover Weekly and personalized radio.
- How Scala and frameworks like Scalding, Spark, and Crunch are used to build data pipelines for tasks like joins, aggregations, and machine learning algorithms.
- Techniques for optimizing pipelines including distributed caching, bloom filters, and Parquet for efficient storage and querying of large datasets.
- The speaker's success in migrating over 300 jobs from Python to Scala and growing the team of engineers building Scala pipelines at Spotify.
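One of the optimizations listed above, Bloom filters, fits in a few lines. This is a minimal Python illustration rather than the Scalding/Guava implementation the deck would actually use, and the track IDs are made up:

```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter: constant-space membership test with false positives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k bit positions from salted MD5 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for track in ["track:1", "track:2", "track:3"]:
    bf.add(track)

print(bf.might_contain("track:2"))   # True
print(bf.might_contain("track:99"))  # almost certainly False
```

Before an expensive join, a pipeline can ship a small filter like this to every worker and drop records whose keys definitely do not appear on the other side.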
Music Personalization: Real-time Platforms, by Esh Vckay
1. The document discusses music personalization techniques at Spotify, including understanding users and music content, using collaborative filtering and latent vector models to make recommendations, and building real-time recommendation systems using Apache Storm.
2. It describes how Spotify uses machine learning techniques like matrix factorization and word2vec to generate latent vectors for users, songs, artists and playlists to measure similarity and make personalized recommendations at scale for its 75 million users.
3. The key challenges are processing huge amounts of data from 1 billion playlists and 1TB of logs daily to provide recommendations for each new user within 3 seconds and in real-time as listening behaviors change.
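The latent-vector step described above reduces to measuring the angle between vectors. Here is a small Python sketch with invented 4-dimensional vectors (real models use far more dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two latent vectors: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical latent vectors for one user and two candidate songs.
user = [0.9, 0.1, 0.4, 0.0]
song_a = [0.8, 0.2, 0.5, 0.1]
song_b = [0.0, 0.9, 0.0, 0.8]

# The song whose vector points the same way as the user's gets ranked first.
print(cosine_similarity(user, song_a) > cosine_similarity(user, song_b))  # True
```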
Spotify Discover Weekly: The machine learning behind your music recommendations, by Sophia Ciocca
In this presentation, I give an overview of the machine learning algorithms behind Spotify’s extraordinarily popular Discover Weekly playlist. I provide a brief introduction to what the playlist is, explain how music recommendation engines have evolved over time, then break down the three main algorithm types powering Spotify’s recommendations: (1) collaborative filtering, (2) Natural Language Processing (NLP), and (3) Raw audio analysis.
Video of the presentation can be found here: https://www.youtube.com/watch?v=PUtYNjInopA
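Collaborative filtering, the first of the three algorithm types above, can be illustrated with a toy co-occurrence counter over playlists (song IDs and playlists are invented for the example):

```python
from collections import Counter
from itertools import combinations

def co_occurrence(playlists):
    """Count how often each pair of songs appears in the same playlist."""
    pairs = Counter()
    for pl in playlists:
        for a, b in combinations(sorted(set(pl)), 2):
            pairs[(a, b)] += 1
    return pairs

def similar_to(song, pairs, k=2):
    """Rank other songs by how often they co-occur with the seed song."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == song:
            scores[b] += n
        if b == song:
            scores[a] += n
    return [s for s, _ in scores.most_common(k)]

playlists = [["s1", "s2", "s3"], ["s1", "s2"], ["s2", "s3"], ["s4", "s5"]]
print(similar_to("s1", co_occurrence(playlists)))  # ['s2', 's3']
```

Real systems factorize the giant user-item matrix instead of counting pairs directly, but the underlying signal is the same: songs that keep showing up together are probably similar.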
Interactive Recommender Systems with Netflix and Spotify, by Chris Johnson
Interactive recommender systems enable the user to steer the received recommendations in the desired direction through explicit interaction with the system. In the larger ecosystem of recommender systems used on a website, they are positioned between a lean-back recommendation experience and an active search for a specific piece of content. Beyond this aspect, we will discuss several parts that are especially important for interactive recommender systems, including the following: the design of the user interface and its tight integration with the algorithm in the back-end; the computational efficiency of the recommender algorithm; and choosing the right balance between exploiting the user's feedback to provide relevant recommendations, and enabling the user to explore the catalog and steer the recommendations in the desired direction.
In particular, we will explore the field of interactive video and music recommendations and their application at Netflix and Spotify. We outline some of the user-experiences built, and discuss the approaches followed to tackle the various aspects of interactive recommendations. We present our insights from user studies and A/B tests.
The tutorial targets researchers and practitioners in the field of recommender systems, and will give the participants a unique opportunity to learn about the various aspects of interactive recommender systems in the video and music domain. The tutorial assumes familiarity with the common methods of recommender systems.
Slides from a talk at a meetup organized by SF Scala at Spotify's San Francisco office. The slides present details of playlist recommendations at Spotify and how Spotify uses Scalding to develop robust and reliable pipelines to generate these recommendations.
Meetup details: http://www.meetup.com/SF-Scala/events/224430674/
by Harald Steck (Netflix Inc., US), Roelof van Zwol (Netflix Inc., US) and Chris Johnson (Spotify Inc., US)
Slides of the tutorial on interactive recommender systems at the 2015 conference on Recommender Systems (RecSys).
DATE: Wednesday, Sept 16, 2015, 11:00-12:30
The Evolution of Hadoop at Spotify - Through Failures and Pain, by Rafał Wojdyła
The quickest way to learn and evolve infrastructure is by encountering obstacles and being forced to overcome limitations that keep you inches away from project goals. At Spotify, we've encountered many of these obstacles and frustrations as we grew our Hadoop cluster from a few machines in an office closet aggregating played-song events for financial reports to our current 900-node cluster, which plays a large role in many features that you see in our application today.
Two members of Spotify’s Hadoop ‘squad’ will weave in war stories, failures, frustrations and lessons learned to describe the Hadoop/Big Data architecture at Spotify and talk about how that architecture has evolved.
We’ll talk about how and why we use a number of tools, including Apache Falcon and Apache Bigtop to test changes; Apache Crunch, Scalding and Hive w/ Tez to build features and provide analytics; and Snakebite and Luigi, two in-house tools created to overcome common frustrations.
Approximate nearest neighbor methods and vector models – NYC ML meetup, by Erik Bernhardsson
Nearest neighbors refers to something that is conceptually very simple. For a set of points in some space (possibly many dimensions), we want to find the closest k neighbors quickly.
This presentation covers a library called Annoy, built by me, that helps you do (approximate) nearest neighbor queries in high-dimensional spaces. We'll go through vector models, how to measure similarity, and why nearest neighbor queries are useful.
1) At Spotify, big data is used to answer important questions from various stakeholders like how many times songs have been streamed, most popular artists, and streaming numbers for marketing purposes.
2) Data infrastructure at Spotify includes a large Hadoop cluster with over 6 petabytes of data used to generate insights from user activity logs and improve the product.
3) Answering tricky questions requires techniques like A/B testing and analyzing streaming patterns to determine viral songs or artist reactions to new releases. Data-driven decisions are made to personalize the user experience.
Today, I had the big honor to give the opening keynote at the 8th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2020), being held virtually. HCOMP is the home of the human computation and crowdsourcing community working on frameworks, methods and systems that bring together people and machine intelligence to achieve better results. I decided to totally revamp a previous talk to focus on so-called "human in the loop" approaches, showing how we incorporate humans in the loop to personalise at scale, with some of the research at Spotify. Sharing the slides for general interest.
Spotify uses both push and pull paradigms to match artists and fans in a personal and relevant way. The push paradigm is exemplified by Home, which surfaces personalized playlists using an algorithm called BaRT. BaRT is a multi-armed bandit algorithm that explores and exploits to select playlists based on a reward function. Research shows personalizing the reward function for each user and playlist type improves results. Search represents the pull paradigm, where users search for specific music. Understanding user intent and mindset helps improve search satisfaction. Both paradigms aim to reduce effort and increase success based on offline and online evaluation. Voice interactions may represent a hybrid paradigm.
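BaRT itself is more sophisticated, but the explore/exploit trade-off it manages can be illustrated with a basic epsilon-greedy bandit. The reward estimates below are invented for the example:

```python
import random

def epsilon_greedy(estimates, epsilon=0.1, rng=random):
    """Pick an arm: explore uniformly with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))          # explore a random arm
    return max(range(len(estimates)), key=lambda i: estimates[i])  # exploit

# Hypothetical reward estimates for three home-screen shelves.
estimates = [0.12, 0.34, 0.05]
rng = random.Random(42)
picks = [epsilon_greedy(estimates, 0.1, rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # mostly the best-estimated arm
```

Exploitation serves what currently looks best; the small exploration rate keeps gathering reward data for the other shelves so the estimates can improve.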
Spotify Machine Learning Solution for Music Discovery, by Karthik Murugesan
This document discusses machine learning and big data approaches for music discovery at Spotify. It describes Spotify's large catalog of 30 million songs and 2 billion playlists. It then discusses Spotify's use of collaborative filtering, latent factor models, and neural networks on audio data to generate song vectors and make personalized recommendations. Some challenges discussed are the scale of data, cold starts for new users and music, and learning from audio content.
Discover how the world of big data is evolving and becoming faster, more reliable and better organized, powering many of the cooler new features that you see in the client today!
This document summarizes an approach for scaling implicit matrix factorization to large datasets using Apache Spark. It discusses three attempts at implementing alternating least squares for collaborative filtering in Spark. The first two attempts shuffle data across nodes on each iteration. The third attempt partitions and caches the user/item vectors, then builds mappings to join local blocks of data and update vectors within each partition, avoiding shuffles between iterations for more efficient distributed computation.
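The blocked Spark implementation is beyond a short example, but the alternating update at the heart of ALS can be shown with a single latent factor, where each least-squares solve collapses to a scalar closed form. This is plain Python on toy data; real systems use higher rank, implicit-feedback weighting, and larger regularization:

```python
def als_rank1(ratings, n_users, n_items, iters=20, reg=1e-6):
    """Alternating least squares with one latent factor, so each solve is a scalar.
    ratings: list of (user, item, value) triples; reg keeps denominators nonzero."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iters):
        # Fix item factors, solve each user factor in closed form.
        for i in range(n_users):
            num = sum(r * v[j] for uu, j, r in ratings if uu == i)
            den = sum(v[j] ** 2 for uu, j, r in ratings if uu == i) + reg
            u[i] = num / den
        # Fix user factors, solve each item factor.
        for j in range(n_items):
            num = sum(r * u[i] for i, jj, r in ratings if jj == j)
            den = sum(u[i] ** 2 for i, jj, r in ratings if jj == j) + reg
            v[j] = num / den
    return u, v

# Toy play counts that are exactly rank one: value = a[i] * b[j].
ratings = [(i, j, a * b)
           for i, a in enumerate([1.0, 2.0, 3.0])
           for j, b in enumerate([2.0, 4.0])]
u, v = als_rank1(ratings, 3, 2)
err = max(abs(u[i] * v[j] - r) for i, j, r in ratings)
print(err < 0.01)  # True: the factors reconstruct the toy matrix
```

The point of the third attempt described above is that, in a distributed setting, these per-user and per-item solves can run inside cached partitions without reshuffling the ratings on every iteration.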
Proud to have graduated from the University of Amsterdam MSc Information Studies program in Data Science, this was my final MSc thesis presentation at my defense last July. The thesis addresses the contemporary problem of popularity prediction, which is widely discussed but not yet solved. In summary, the work optimizes feature engineering, applies several regression models, and answers the question of what makes a post popular according to its content (e.g. action, scene, people, animals), taking advantage of both computer vision techniques and text processing. My MSc thesis was marked unanimously with 8/10, one of the highest grades in the UvA Informatics department, and was submitted for publication in MMM2018, the 24th International Conference on Multimedia Modeling (Bangkok, Thailand).
Approximate Nearest Neighbors and Vector Models by Erik Bernhardsson (Hakka Labs)
full talk video: https://www.hakkalabs.co/articles/approximate-nearest-neighbors-vector-models
Vector models are being used in a lot of different fields: natural language processing, recommender systems, computer vision, and other things. They are fast and convenient and are often state of the art in terms of accuracy. One of the challenges with vector models is that as the number of dimensions increase, finding similar items gets challenging. Erik developed a library called "Annoy" that uses a forest of random tree to do fast approximate nearest neighbor queries in high dimensional spaces. We will cover some specific applications of vector models with and how Annoy works.
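Annoy's core trick, recursively splitting the space with random hyperplanes, can be sketched as a single tree in Python. The real library builds a forest of such trees in C++ and searches several leaves, so treat this as a conceptual sketch with made-up data:

```python
import random

def build_tree(points, ids, rng, max_leaf=2):
    """One Annoy-style tree: recursively split ids with random hyperplanes."""
    if len(ids) <= max_leaf:
        return ids  # leaf: a small bucket of candidate ids
    dim = len(points[ids[0]])
    normal = [rng.gauss(0, 1) for _ in range(dim)]
    proj = {i: sum(n * x for n, x in zip(normal, points[i])) for i in ids}
    median = sorted(proj.values())[len(ids) // 2]
    left = [i for i in ids if proj[i] < median]
    right = [i for i in ids if proj[i] >= median]
    if not left or not right:
        return ids  # degenerate split (tied projections): stop early
    return (normal, median,
            build_tree(points, left, rng, max_leaf),
            build_tree(points, right, rng, max_leaf))

def query(node, q):
    """Walk down to the leaf whose half-spaces contain q; return its bucket."""
    while isinstance(node, tuple):
        normal, median, left, right = node
        side = sum(n * x for n, x in zip(normal, q))
        node = left if side < median else right
    return node

# Eight hypothetical 3-d item vectors; a query inspects only one small leaf.
rng = random.Random(7)
points = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(8)]
tree = build_tree(points, list(range(8)), rng)
candidates = query(tree, points[0])
print(candidates)  # a short candidate list, not all 8 points
```

Because a single tree can cut off a true neighbor, the library queries many trees and unions their leaves, trading a little accuracy for sublinear search time.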
This document provides an overview of a course on algorithms and data structures. It outlines the course topics that will be covered over 15 weeks of lectures. These include data types, arrays, matrices, pointers, linked lists, stacks, queues, trees, graphs, sorting, and searching algorithms. Evaluation will be based on assignments, quizzes, projects, sessionals, and a final exam. The goal is for students to understand different algorithm techniques, apply suitable data structures to problems, and gain experience with classical algorithm problems.
A Unified Music Recommender System Using Listening Habits and Semantics of Tags (datasciencekorea)
The document describes a unified music recommendation system that combines users' listening habits and semantics of tags. It proposes generating three types of user profiles: listening habits-based, tag-based, and a hybrid approach. A tag and emotion ontology are used to preprocess tags and assign weights. A music recommendation algorithm finds similar users and calculates item scores. An evaluation of the approaches found the hybrid method achieved the best precision and recall based on F-measure, outperforming listening habits only or tag-based recommendations. Statistical analysis confirmed the hybrid approach performed significantly better.
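The F-measure used to compare the profile types combines precision and recall. A minimal sketch, with invented song IDs standing in for the evaluation data:

```python
def f_measure(recommended, relevant):
    """Precision, recall and F1 for one user's recommendation list."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended)
    recall = hits / len(relevant)
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical results: hybrid profile vs. listening-habits-only profile.
relevant = ["s1", "s2", "s3", "s4"]
print(f_measure(["s1", "s2", "s3", "x"], relevant))  # hybrid does better
print(f_measure(["s1", "x", "y", "z"], relevant))    # habits-only lags
```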
Apache Hadoop has emerged as the storage and processing platform of choice for Big Data. In this tutorial, I will give an overview of Apache Hadoop and its ecosystem, with specific use cases. I will explain the MapReduce programming framework in detail, and outline how it interacts with Hadoop Distributed File System (HDFS). While Hadoop is written in Java, MapReduce applications can be written using a variety of languages using a framework called Hadoop Streaming. I will give several examples of MapReduce applications using Hadoop Streaming.
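The classic Hadoop Streaming example is word count, where the mapper and reducer are ordinary scripts reading lines. The sketch below simulates the map-sort-reduce cycle locally; in a real job the two functions would read stdin and be wired up via the streaming jar:

```python
import itertools

def mapper(line):
    """Map: emit (word, 1) for every word on an input line (stdin in a real job)."""
    for word in line.strip().lower().split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce: sum the counts that the shuffle grouped under one key."""
    return (word, sum(counts))

# Simulate the streaming shuffle locally: map, sort by key, group, reduce.
lines = ["to be or not to be", "to stream or not to stream"]
mapped = sorted(kv for line in lines for kv in mapper(line))
result = dict(reducer(key, (c for _, c in group))
              for key, group in itertools.groupby(mapped, key=lambda kv: kv[0]))
print(result["to"], result["be"], result["stream"])  # 4 2 2
```

The sort-then-group step stands in for Hadoop's shuffle phase, which guarantees that all values for one key arrive at the same reducer in order.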
Lecture 9: Dimensionality Reduction, Singular Value Decomposition (SVD), Principal Component Analysis (PCA). (ppt,pdf)
Appendices A, B from the book “Introduction to Data Mining” by Tan, Steinbach, Kumar.
1) The document presents an approach called Multidimensional Annotation Scaling (MAS) for aggregating complex annotations from multiple annotators.
2) MAS models annotation tasks as distance matrices calculated using task-specific distance functions, rather than modeling the annotations directly.
3) It then applies a Bayesian hierarchical model called multidimensional scaling to learn annotator reliabilities and item difficulties from the distance matrices in order to aggregate the annotations.
4) Experiments on tasks with diverse complex label types like sequences, rankings and translations show MAS outperforms baselines and adapts to different tasks without retraining.
Dimensionality reduction techniques like principal component analysis (PCA) and singular value decomposition (SVD) are important for analyzing high-dimensional data by finding patterns in the data and expressing the data in a lower-dimensional space. PCA and SVD decompose a data matrix into orthogonal principal components/singular vectors that capture the maximum variance in the data, allowing the data to be represented in fewer dimensions without losing much information. Dimensionality reduction is useful for visualization, removing noise, discovering hidden correlations, and more efficiently storing and processing the data.
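The first principal component can be found with plain power iteration on the covariance matrix. A small self-contained sketch on toy 2-D data (real workloads would run a full SVD instead):

```python
import math

def top_principal_component(data, iters=100):
    """Power iteration on the covariance matrix: the direction of max variance."""
    dim = len(data[0])
    means = [sum(row[d] for row in data) / len(data) for d in range(dim)]
    centered = [[x - m for x, m in zip(row, means)] for row in data]
    # Covariance matrix C = X^T X / n over the mean-centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / len(data) for b in range(dim)]
           for a in range(dim)]
    v = [1.0] * dim
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # repeated C·v converges to the top eigenvector
    return v

# Points scattered along the line y = x: the component is near (0.707, 0.707).
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.0]]
pc = top_principal_component(data)
print(pc)
```

Projecting each point onto this single direction keeps most of the variance, which is exactly the compression PCA promises.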
A More Scalable Way of Making Recommendations with MLlib (Xiangrui Meng, Dat...), Spark Summit
This document summarizes the implementation of Alternating Least Squares (ALS) in MLlib to make recommendations at scale. It discusses how MLlib reduces communication cost through a block-to-block approach and compressed storage formats. It also describes optimizations like avoiding garbage collection through specialized code. The ALS algorithm is tested on real-world datasets including Amazon reviews and Spotify music data involving billions of ratings.
Lecture 7: Data-Intensive Computing for Text Analysis (Fall 2011), by Matthew Lease
This document provides an overview and agenda for a lecture on graph processing using MapReduce. It discusses representing graphs as adjacency matrices or lists, and gives examples of single source shortest path and PageRank algorithms. Graph processing in MapReduce typically involves computations at each node and propagating those computations across the graph. Key challenges include representing graph structure suitably for MapReduce and traversing the graph in a distributed manner through multiple iterations.
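The per-iteration computation that MapReduce distributes, each node pushing rank shares along its out-edges, looks like this on one machine (toy four-node graph, invented for the example):

```python
def pagerank(links, iters=50, damping=0.85):
    """Iterative PageRank over an adjacency list {node: [out-neighbors]}."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Each node starts with the teleport share, then receives edge shares.
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            for m in outs:
                new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

# A hub that everyone links to should accumulate the highest rank.
links = {"a": ["hub"], "b": ["hub"], "c": ["hub", "a"], "hub": ["a", "b", "c"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # hub
```

In the MapReduce formulation, the inner loop becomes the map phase (emit shares keyed by destination) and the summation becomes the reduce phase, with one full job per iteration.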
The document summarizes the Count-Min Sketch streaming algorithm. It uses a two-dimensional array and d independent hash functions to estimate item frequencies in a data stream using sublinear space. It works by incrementing the appropriate counters in each row when an item arrives. The estimated frequency of an item is the minimum value across the rows. Analysis shows that for an array width w proportional to 1/ε, the estimate will be within an additive error of ε times the total frequency with high probability.
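The algorithm described above fits in a short class; the width, depth, and hash choices below are illustrative:

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; estimates overcount by at most eps * total (w.h.p.)."""

    def __init__(self, width=272, depth=3):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        # One independent-ish hash per row, derived from salted MD5 digests.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions only ever add, so the minimum across rows is the best bound.
        return min(row[col] for row, col in zip(self.rows, self._cols(item)))

cms = CountMinSketch()
for track, plays in [("a", 100), ("b", 42), ("c", 7)]:
    cms.add(track, plays)
print(cms.estimate("b"))  # at least 42; usually exactly 42
```

Because every update touches only `depth` counters, the sketch handles an unbounded stream in fixed memory, which is the whole point for play-count style workloads.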
When building mobile applications, performance is crucial. If we compare React for web development with React Native, the latter is much more restrictive in the sense that performance mistakes sometimes have further-reaching consequences, even taking into account Donald Knuth's golden quote, "premature optimization is the root of all evil". This talk provides a deeper dive into principles and practices for making sure your app behaves as fast as possible. We will look at common mistakes in component design that lead to performance degradation, how to make lists more performant, and how to make animations more fluid. I will also introduce some tools that might help you discover bottlenecks in your application.
Sparkling Random Ferns, by P Dendek and M Fedoryszak (Spark Summit)
The document describes the Random Ferns algorithm for classification and its implementation as a Spark package. Random Ferns is an ensemble algorithm that builds multiple randomized decision trees called "ferns". It was implemented in Spark to make it easily accessible and scalable for big data. The package was published on spark-packages.org and deployed to the Maven Central Repository to simplify discovery and usage. An evaluation on benchmark datasets showed it achieved high accuracy while training models efficiently in a scalable manner on Spark.
A Production Quality Sketching Library for the Analysis of Big Data (Databricks)
In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources to generate exact results, or don’t parallelize well.
This document discusses statistical computing for big data using distributed computing frameworks like MapReduce and Hadoop. It introduces MapReduce concepts like mappers, reducers, and Hadoop components including HDFS and YARN. Statistical challenges with big data are described, like scalability, dimensionality, and heterogeneity. The document discusses approaches for computing statistics on large datasets in parallel, including the Bag of Little Bootstraps method which breaks data into partitions to allow bootstrapping computations to run independently on clusters. Examples of computing means and counts in parallel using MapReduce are also provided.
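The parallel mean/count example mentioned above works because each partition can be reduced to a tiny summary that is cheap to combine. A local simulation of the map and reduce steps:

```python
def partial_stats(partition):
    """Map step: each partition collapses to a small (sum, count) summary."""
    return (sum(partition), len(partition))

def combine(stats):
    """Reduce step: merge the partial summaries into a global mean."""
    total = sum(s for s, _ in stats)
    count = sum(c for _, c in stats)
    return total / count

data = list(range(1, 101))  # pretend this is sharded across workers
partitions = [data[i:i + 25] for i in range(0, 100, 25)]
print(combine([partial_stats(p) for p in partitions]))  # 50.5
```

The same pattern underlies the Bag of Little Bootstraps: each partition runs its resampling independently and only small summary statistics cross the network.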
This document summarizes some of the key topics and presentations from the Recsys 2018 conference. It discusses the growing popularity of deep learning and reinforcement learning in recommender systems. It provides an overview of Netflix's use of reinforcement learning for artwork recommendations. It also summarizes several papers presented at the conference, including ones on calibrated recommendations, reciprocal recommenders, the Recsys challenge on playlist continuation, and evaluating metrics for top-N recommendations. Finally, it discusses some mixed methods approaches and tutorials presented at the conference.
Scalable Recommendation Algorithms with LSH, by Maruf Aytekin
- Scalable recommendation algorithm based on Locality Sensitive Hashing (LSH) and Collaborative Filtering.
- Distributed implementation of LSH with Apache Spark.
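The hyperplane-LSH idea behind the algorithm can be sketched without Spark: similar taste vectors land on the same side of most random hyperplanes, so their bit signatures mostly agree (the vectors below are invented):

```python
import random

def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane does the vector fall on?"""
    return tuple(int(sum(h * x for h, x in zip(plane, vec)) >= 0)
                 for plane in hyperplanes)

rng = random.Random(3)
hyperplanes = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]

# Hypothetical 4-d taste vectors: u1 and u2 are similar, u3 points elsewhere.
u1 = [0.9, 0.1, 0.8, 0.1]
u2 = [0.8, 0.2, 0.9, 0.0]
u3 = [-0.7, 0.9, -0.8, 0.6]

def agreement(a, b):
    """Number of signature bits two vectors share."""
    return sum(x == y for x, y in zip(signature(a, hyperplanes),
                                      signature(b, hyperplanes)))

print(agreement(u1, u2) > agreement(u1, u3))  # similar users share more bits
```

Bucketing users by signature prefix means candidate neighbors are found by hash lookup instead of comparing every pair, which is what makes the approach scale.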
Case studies highlighting our successful projects and satisfied clients.
Key benefits of partnering with iBirds Services for your CRM and software development needs.
Whether you are a small business or a large enterprise, our proven strategies and cutting-edge technologies ensure your business stays ahead of the competition. Explore our services and learn how iBirds can transform your business operations with scalable and efficient solutions.
1. January 6, 2015
Scala Data Pipelines for
Music Recommendations
Chris Johnson
@MrChrisJohnson
2. Who am I??
•Chris Johnson
– Machine Learning guy from NYC
– Focused on music recommendations
– Formerly a PhD student at UT Austin
3. Spotify in Numbers 3
•Started in 2006, now available in 58 markets
•50+ million active users, 15 million paying subscribers
•30+ million songs, 20,000 new songs added per day
•1.5 billion playlists
•1 TB user data logged per day
•900 node Hadoop cluster
•10,000+ Hadoop jobs run every day
5. How can we find good recommendations? 5
•Manual Curation
•Manually Tag Attributes
•Audio Content
•News, Blogs, Text analysis
•Collaborative Filtering
9. The Genre Toplist Problem 9
•Assume we have access to daily log data for all plays on Spotify.
•Goal: Calculate the top 1k artists for each genre based on total daily plays
{"User": "userA", "Date": "2015-01-10", "Artist": "Beyonce", "Track": "Halo", "Genres": ["Pop", "R&B", "Soul"]}
{"User": "userB", "Date": "2015-01-10", "Artist": "Led Zeppelin", "Track": "Achilles Last Stand", "Genres": ["Rock", "Blues Rock", "Hard Rock"]}
……….
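The toplist computation maps naturally onto Scala's collection API. A minimal in-memory sketch (the `Play` case class and sample data are illustrative, not the production Scalding job):

```scala
// Hypothetical record type; field names follow the JSON log example above.
case class Play(user: String, date: String, artist: String, track: String, genres: List[String])

val plays = List(
  Play("userA", "2015-01-10", "Beyonce", "Halo", List("Pop", "R&B", "Soul")),
  Play("userB", "2015-01-10", "Led Zeppelin", "Achilles Last Stand", List("Rock", "Blues Rock", "Hard Rock")),
  Play("userC", "2015-01-10", "Beyonce", "XO", List("Pop", "R&B"))
)

// Emit one (genre, artist) pair per genre tag, count plays, then rank within each genre.
val topPerGenre: Map[String, List[(String, Long)]] =
  plays
    .flatMap(p => p.genres.map(g => (g, p.artist)))
    .groupBy(identity)
    .map { case ((genre, artist), hits) => (genre, artist, hits.size.toLong) }
    .groupBy(_._1)
    .view
    .mapValues(_.toList.map(t => (t._2, t._3)).sortBy(-_._2).take(1000))
    .toMap
```

The same shape (flatMap, group, count, top-k per key) is what the Scalding job expresses over HDFS data instead of a `List`.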
11. 11
Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing advantages of Scala to your MapReduce jobs.
— Twitter
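The "tight integration with Scala" in that quote means a Scalding pipeline reads like ordinary collection code. A plain-Scala sketch of the correspondence (the Scalding calls appear only in comments, since running them needs a Hadoop setup; `Row` and the sample data are illustrative):

```scala
// A Scalding TypedPipe pipeline has the same shape as collection code, e.g.:
//   TypedPipe.from(TextLine("plays.tsv"))
//     .map(parse)
//     .groupBy(_.artist)
//     .size
// Below, the same transformation runs on a List instead of a TypedPipe.
case class Row(artist: String)

val rows = List(Row("Beyonce"), Row("Beyonce"), Row("Led Zeppelin"))

val playsPerArtist: Map[String, Int] =
  rows.groupBy(_.artist).view.mapValues(_.size).toMap
```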
16. sortWithTake doesn’t fully sort 16
•Uses PriorityQueueMonoid from Algebird library
•What is a Monoid??
-Definition: A set S and a binary operation • : S × S → S such that
1. Associativity: for all a, b, and c in S, the equation (a • b) • c = a • (b • c) holds
2. Identity element: there exists an element e in S such that for every element a in S, the equations e • a = a • e = a hold
•Example: The natural numbers N under the addition operation.
(1 + 2) + 3 = 1 + (2 + 3)
0 + 1 = 1 + 0 = 1
class PriorityQueueMonoid[K](max: Int)(implicit ord: Ordering[K]) extends Monoid[PriorityQueue[K]]
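The two laws can be checked in code with a bounded top-N list standing in for Algebird's `PriorityQueueMonoid` (the `Monoid` trait and `TopNMonoid` below are simplified stand-ins, assuming input lists are kept sorted):

```scala
// Minimal Monoid, mirroring Algebird's shape.
trait Monoid[T] { def zero: T; def plus(a: T, b: T): T }

// Keeps only the `max` smallest elements, like a bounded priority queue.
class TopNMonoid(max: Int) extends Monoid[List[Int]] {
  def zero: List[Int] = Nil
  def plus(a: List[Int], b: List[Int]): List[Int] = (a ++ b).sorted.take(max)
}

val m = new TopNMonoid(3)
val x = List(1, 5); val y = List(4); val z = List(2, 9)

// The two monoid laws from the definition above:
val assocHolds = m.plus(m.plus(x, y), z) == m.plus(x, m.plus(y, z))
val identityHolds = m.plus(m.zero, x) == x && m.plus(x, m.zero) == x
```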
18. sortWithTake 18
•Uses PriorityQueueMonoid from Algebird
•Ok, great observation… but what’s the point of all this!??
-All monoid aggregations and reduces can begin on the Mapper side and finish on the Reducer side, since the order doesn't matter!
-Scalding implicitly takes care of Mapper-side combining with a custom combiner
-Reduces network traffic to reducers
class PriorityQueueMonoid[K](max: Int)(implicit ord: Ordering[K]) extends Monoid[PriorityQueue[K]]
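Why order-independence enables Mapper-side combining can be shown with a plain-Scala simulation (the partitions and the `merge` helper are illustrative):

```scala
// Each "mapper" pre-aggregates its own partition; the "reducer" only merges
// small partial results. Because counting is a monoid, the answer is the same
// as aggregating everything in one place.
val partition1 = List("a", "b", "a", "c")
val partition2 = List("b", "b", "c", "a")

def count(xs: List[String]): Map[String, Int] =
  xs.groupBy(identity).view.mapValues(_.size).toMap

// The monoid "plus" for Map[String, Int]: pointwise addition of counts.
def merge(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] =
  (m1.keySet ++ m2.keySet).map(k => k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0))).toMap

def topN(n: Int)(counts: Map[String, Int]): List[(String, Int)] =
  counts.toList.sortBy { case (k, v) => (-v, k) }.take(n)

val combined = merge(count(partition1), count(partition2)) // mapper-side combine, then merge
val global = count(partition1 ++ partition2)               // single global aggregation
```

Only the small partial maps cross the network, never the raw entries.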
20. How do we store track metadata? 20
•Lots of metadata associated with tracks (100+ columns!)
-artist, album, record label, genres, audio features, …
•Options:
1. Store each track as one long row with many columns
-Sending lots of data over network when you only need 1 or 2 columns
2. Store each column as a separate data source
-Jobs require costly joins, especially when requiring many columns
•Can we do better?
21. Apache Parquet to the rescue! 21
•Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
•Efficiently read a subset of columns without scanning the entire dataset
•Row group: a logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.
•Column chunk: a chunk of the data for a particular column. Column chunks live in a particular row group and are guaranteed to be contiguous in the file.
•Predicate push-down: define predicates (<, >, <=, …) to filter out column chunks or even full row groups, evaluated at the Hadoop InputFormat layer before Avro conversion
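A toy sketch of the row-versus-column layout and of statistics-based predicate push-down (the field names and the 700000 ms threshold are made up; real Parquet adds encodings, row groups, and per-chunk min/max statistics):

```scala
// Row-oriented layout: one object per track, all columns together.
case class TrackRow(id: String, artist: String, durationMs: Int, genre: String)

val rows = List(
  TrackRow("t1", "Beyonce", 261000, "Pop"),
  TrackRow("t2", "Led Zeppelin", 625000, "Rock"),
  TrackRow("t3", "Beyonce", 193000, "Pop")
)

// Columnar layout: one array per column, so reading `genre` touches no other data.
val ids = rows.map(_.id).toArray
val genres = rows.map(_.genre).toArray
val durations = rows.map(_.durationMs).toArray

// "Predicate push-down": decide from a chunk's max statistic alone that no row
// inside it can match, and skip the chunk without reading it.
val chunkMax = durations.max
val predicate = (d: Int) => d > 700000
val chunkCanBeSkipped = !predicate(chunkMax)
```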
23. Driven - job visualization and performance analytics 23
24. Luigi - data plumbing since 2012 24
•Workflow management framework developed by Spotify
•Python Luigi configuration takes care of dependency resolution, job scheduling, fault tolerance, etc.
•Support for Hive queries, MapReduce jobs, Python snippets, Scalding, Crunch, Spark, and more!
•Like Oozie, but without all of the messy XML
https://github.com/spotify/luigi
27. So… back to music recommendations! 27
•Manual Curation
•Manually Tag Attributes
•Audio Content
•News, Blogs, Text analysis
•Collaborative Filtering
28. Collaborative Filtering
28
Hey,
I like tracks P, Q, R, S!
Well,
I like tracks Q, R, S, T!
Then you should check out
track P!
Nice! Btw try track T!
Image via Erik Bernhardsson
29. Implicit Matrix Factorization 29
[Figure: binary streams matrix (Users × Songs), approximated by latent factor matrices X and Y]
•Aggregate all (user, track) streams into a large matrix
•Goal: Approximate the binary preference matrix by the inner product of 2 smaller matrices by minimizing the weighted RMSE (root mean squared error), using a function of total plays as the weight
•Why?: Once learned, the top recommendations for a user are the top inner products between their latent factor vector in X and the track latent factor vectors in Y.
•Legend (symbols reconstructed from the slide): x_u = user latent factor vector, y_i = item latent factor vector, b_u = bias for user, b_i = bias for item, λ = regularization parameter, p_ui = 1 if user streamed track else 0
•Objective (weighted least squares, with confidence weights c_ui derived from play counts): minimize Σ_{u,i} c_ui (p_ui − x_uᵀ y_i − b_u − b_i)² + λ (Σ_u ‖x_u‖² + Σ_i ‖y_i‖²)
30. Alternating Least Squares 30
[Figure: the same binary streams matrix and legend as the previous slide]
•Aggregate all (user, track) streams into a large matrix
•Goal: Approximate the binary preference matrix by the inner product of 2 smaller matrices by minimizing the weighted RMSE (root mean squared error), using a function of total plays as the weight
•Why?: Once learned, the top recommendations for a user are the top inner products between their latent factor vector in X and the track latent factor vectors in Y.
•Fix tracks
31. Alternating Least Squares 31
•Fix tracks
•Solve for users
33. Alternating Least Squares 33
•Fix users
•Solve for tracks
35. Alternating Least Squares 35
•Fix users
•Solve for tracks
•Repeat until convergence…
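The alternation can be sketched end-to-end in plain Scala for k = 2, without Breeze or MLlib (unweighted and without bias terms, so the per-user solve reduces to a 2 × 2 ridge system; all values are toy data, not the production setup):

```scala
val P = Array(
  Array(1.0, 0.0, 1.0),
  Array(0.0, 1.0, 1.0)
) // binary preference matrix: 2 users x 3 tracks
val k = 2       // latent dimension
val lambda = 0.1
val Y = Array(Array(0.1, 0.2), Array(0.3, 0.1), Array(0.2, 0.3)) // 3 tracks x k

// Cramer's rule for the 2 x 2 normal equations (k = 2 keeps the solver tiny).
def solve2(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
  val det = a(0)(0) * a(1)(1) - a(0)(1) * a(1)(0)
  Array((b(0) * a(1)(1) - b(1) * a(0)(1)) / det,
        (a(0)(0) * b(1) - a(1)(0) * b(0)) / det)
}

// "Fix tracks, solve for users": each x_u solves (YtY + lambda*I) x_u = Yt p_u.
def solveUsers(ys: Array[Array[Double]]): Array[Array[Double]] = {
  val ytY = Array.tabulate(k, k)((a, b) =>
    ys.map(y => y(a) * y(b)).sum + (if (a == b) lambda else 0.0))
  P.map { pu =>
    val rhs = Array.tabulate(k)(a => ys.indices.map(t => ys(t)(a) * pu(t)).sum)
    solve2(ytY, rhs)
  }
}

val X = solveUsers(Y) // the symmetric step solves for tracks given X; alternate until convergence
```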
36. Matrix Factorization with MapReduce
36
[Figure: the map step partitions all log entries into a K × L grid of tasks, where task (x, y) handles entries with u % K = x and i % L = y; user vectors are sharded by u % K and item vectors by i % L; the reduce step combines the blocks' contributions into new user vectors]
Figure via Erik Bernhardsson
37. Matrix Factorization with MapReduce
37
One map task:
•Map input: tuples (u, i, count) where u % K = x and i % L = y
•Distributed cache: all user vectors where u % K = x
•Distributed cache: all item vectors where i % L = y
•The mapper emits contributions; the reducer combines them into a new vector
Figure via Erik Bernhardsson
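The u % K / i % L sharding from the figure is easy to demonstrate (K, L, and the log entries are toy values):

```scala
// Shard users into K buckets and items into L buckets, so each map task
// sees exactly one (user bucket, item bucket) block of log entries.
val K = 2; val L = 3
val logEntries = List((0, 0, 5), (1, 2, 3), (2, 4, 1), (3, 5, 2)) // (u, i, count)

val blocks: Map[(Int, Int), List[(Int, Int, Int)]] =
  logEntries.groupBy { case (u, i, _) => (u % K, i % L) }

// Every entry lands in exactly one of the K * L blocks.
val total = blocks.values.map(_.size).sum
```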
38. Apache Spark 38
•Fast and general purpose cluster computing system
•Provides high-level APIs in Java, Scala, and Python
•Takes advantage of in-memory caching to reduce the I/O bottleneck of Hadoop MapReduce
•MLlib: scalable Machine Learning library packaged with Spark
-Collaborative Filtering and Matrix Factorization
-Classification and Regression
-Clustering
-Optimization Primitives
•Spark Streaming: real-time, scalable, fault-tolerant stream processing
•Spark SQL: allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark
39. Matrix Factorization with Spark
39
[Figure: streams matrix, user vectors, and item vectors partitioned across workers 1–6]
•Partition the streams matrix into user (row) and item (column) blocks, partition, and cache
-Unlike with the MapReduce implementation, ratings are never shuffled across the network!
•For each iteration:
1. Compute YtY over item vectors and broadcast it
2. For each item vector, send a copy to each user rating partition that requires it (potentially all partitions)
3. Each partition aggregates intermediate terms and solves for optimal user vectors
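Step 1 is cheap to broadcast because YtY is only k × k no matter how many item vectors exist, as a plain-Scala sketch shows (toy vectors, k = 2):

```scala
val k = 2
val itemVectors = List(
  Array(0.1, 0.2), Array(0.3, 0.1), Array(0.2, 0.3), Array(0.4, 0.0)
)

// Gram matrix YtY: sum of outer products of item vectors. Its size depends
// only on k, not on the number of items, so every worker can hold a copy.
val ytY: Array[Array[Double]] =
  Array.tabulate(k, k)((a, b) => itemVectors.map(y => y(a) * y(b)).sum)

val sizeIndependentOfItems = ytY.length == k && ytY.forall(_.length == k)
```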
49. What should I be worried about? 49
•Multiple “right” ways to do the same thing
•Implicits can make code difficult to navigate
•Learning curve can be tough
•Avoid flattening before a join
•Be aware that Scala default collections are immutable (though mutable versions are also available)
•Use monoid reduces and aggregations where possible and avoid folds
•Be patient with the compiler
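The immutable-by-default point in practice (stdlib only):

```scala
// Scala's default collections are immutable: operations return new collections.
val xs = List(1, 2, 3)
val ys = 0 :: xs // builds a new list; xs is untouched

// Mutable versions live in scala.collection.mutable and must be asked for explicitly.
import scala.collection.mutable
val buf = mutable.ArrayBuffer(1, 2, 3)
buf += 0 // mutates in place

val defaultUnchanged = xs == List(1, 2, 3) && ys == List(0, 1, 2, 3)
```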