From the course: Artificial Intelligence for Cybersecurity

Choosing the right ML approach

- [Instructor] Having seen the security use cases and the use of AI to address them, it's time to look under the hood. I will start with a few commonly interchanged and often confused terms. If you read through the literature on AI, you will come across these terms very often. Now, I don't want you to walk away with the impression that the discipline of learning is the entire field of AI. In fact, it is only one of the capabilities exhibited by AI systems. Machine learning, on the other hand, is a type of learning that uses statistical techniques and modeling to perform a task without being explicitly programmed for it. So that leaves us with deep learning. Deep learning is a type of machine learning that layers many learning algorithms on top of one another. It tries to mimic the way the neural networks in our brains function.

When you build an AI-based security solution, you have many machine learning algorithms at your disposal. The algorithm you end up choosing depends primarily on two macro factors: first, the type of training data available to you, and second, the type of security problem you are trying to solve.

If you study your organization's vulnerability database, log files, packet traces, and user access records, you will discover that you have two types of data samples. The first is data whose characteristics you fully understand. In other words, you already know that the data at hand is indicative of either genuine or suspicious behavior. For example, a website that you know is fraudulent and is likely being used in a phishing attack, or a program trace that is a clear sign of malware execution. In fact, you know the data so well that you can label it with tags such as good or bad. This type of data is known as labeled data.

In the second type of data, you don't know beforehand whether the data at hand represents good or bad behavior. For example, when you look at the login dates and times of your employees over the past month, you just don't know which ones are suspicious and which are not.
In other words, you can't put a good or bad label on it. This type of data is known as unlabeled data.

Training a machine learning model using labeled data, in other words, using our prior knowledge of the relationship between the data and the desired outcome, is known as supervised learning. On the other hand, when such labels don't exist and we use the machine learning model to discover new and interesting patterns within the data, that process is known as unsupervised learning.

The second factor that determines your choice of algorithm is the type of security problem you aim to solve. Although machine learning has been applied commercially to a variety of problems across industries, in the field of security it is commonly applied to four types of problems today. Of course, that may change in the future. First, you want to predict a future security event based on the information you have about past events. Second, you want to categorize your data into known categories, such as normal versus malicious. Third, you want to find interesting and unusual patterns in your data that you couldn't have found yourself. And last, you want to generate adversarial synthetic data that is indistinguishable from the real data.

By clearly articulating the type of machine learning problem you have, combined with the type of data at hand, you come up with a subset of algorithms that you can start experimenting with. Here is a visual that summarizes the mapping of data type to various use cases.
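
To make the difference between supervised and unsupervised learning concrete, here is a minimal Python sketch. It is only an illustration, not a method taken from the course. All of the data is made up. The URL features (dot count and URL length), the nearest-centroid classifier, and the median-absolute-deviation outlier check are assumptions chosen because they fit in a few lines of standard-library code.

```python
from statistics import mean, median

# Hypothetical LABELED data: (dots in URL, URL length) tagged "good" or "bad".
labeled_urls = [
    ((1, 18), "good"),
    ((2, 22), "good"),
    ((1, 25), "good"),
    ((5, 74), "bad"),
    ((6, 81), "bad"),
    ((4, 66), "bad"),
]

def train_centroids(samples):
    """Supervised step: average the feature vectors that share a label."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def classify(centroids, features):
    """Return the label whose centroid is nearest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: distance(centroids[label]))

# Hypothetical UNLABELED data: employee login hours (24-hour clock).
# Nobody has told us which of these logins are suspicious.
login_hours = [9, 8, 10, 9, 11, 8, 9, 3, 10, 9]

def flag_outliers(values, threshold=3.0):
    """Unsupervised step: flag values far from the median.

    Distance is measured in multiples of the median absolute deviation (MAD).
    The threshold of 3.0 is an arbitrary choice for this sketch.
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread in the data, so nothing stands out
    return [v for v in values if abs(v - center) / mad > threshold]

centroids = train_centroids(labeled_urls)
print(classify(centroids, (5, 70)))   # label predicted from prior knowledge
print(flag_outliers(login_hours))     # pattern discovered without any labels
```

The first half can only work because every training sample already carries a good or bad tag. The second half never sees a tag and simply reports the logins that look different from the rest, which an analyst would then need to investigate.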
