TransferLab Training: Practical Anomaly Detection - Module 1: Introduction to Anomaly Detection

In this video, we’ll dive into anomaly detection and its real-world applications. Together, we will explore the challenges and the different types of anomalies. Next, we’ll discuss the contamination framework and conclude by introducing evaluation metrics tailored for anomaly detection, addressing the class imbalance problem along the way.

Accessing the resources of this course:
🔗 GitHub Repository: complete code and examples on our GitHub https://github.com/aai-institute/tfl-training-practical-anomaly-detection
🌐 Website: digests of the latest research on the TransferLab’s website: https://transferlab.ai/
🎓 Full Course & Certification (For FREE): Enroll in our full video course available on our learning platform.
Complete the course at your own pace and earn a certificate for free to enhance your portfolio: https://lms.appliedai-institute.de/

00:10 What is an Anomaly?
02:55 Practical Relevance of Anomaly Detection (A.D.)
06:05 Relevance of Unsupervised Machine Learning in A.D.
07:32 The Contamination Framework
13:03 Evaluation Metrics for A.D. Systems
18:02 A.D. Using Distance Metrics
21:27 Exercise: Using Distance Metrics for A.D.
22:21 Solution: Using Distance Metrics for A.D.
30:01 Taxonomy of A.D. Approaches

The appliedAI Institute for Europe gGmbH is supported by the KI-Stiftung Heilbronn gGmbH.

Okay, let’s try to describe informally what we mean by an anomaly. This is a good place to pause the video and think for yourself: how would you describe what an anomaly actually is? If you tried to do that, you might have run into problems, because at first glance it seems that there are no real features by which we can describe an anomaly, and that it heavily depends on the application scenario whether something is anomalous or not. This is of course true, but maybe looking at the picture also helped you come up with a description. If I asked you what the potential anomalies in this picture are, I guess everyone would directly name the red fish as the most likely instance of an anomaly, because the red fish is very different from all the other fish, by color and by size. Looking a bit deeper into the image, we can also identify other candidates. For instance, if you look to the very right, you see that the blue fish there seem to be strangely deformed, making up another potential anomaly, although they do not seem as anomalous as the big red fish. This is really the essence of what we want to describe with an anomaly.

Several researchers have tried to phrase this in their own words, and we are going to go through a couple of these definitions. One of the very early ones is by Grubbs (1969), saying that an outlying observation, or outlier, is one that appears to deviate markedly from all the other members in the sample in which it occurs. Another one, which is very often cited, is the one by Hawkins, saying an outlier is an observation that deviates so much from the other observations as to arouse suspicion that it was generated by a different mechanism. And finally, a slightly newer one by Chandola and others, saying that anomalies are patterns in data that do not conform to a well-defined notion of normal behavior.

So now that we know what we’re looking
for, we can see that anomaly detection seems to be a very hard task. As an example, imagine that you’re a guard on patrol at a very crowded beach and you try to identify thieves. Looking at the mass of people, you know that it’s pretty hard to identify them, because you don’t know what to look for in the first place. Thieves might even try to look like normal guests on the beach and try to deceive you. These are exactly the problems that we need to tackle when we want to apply anomaly detection in practical scenarios.

Let’s talk about the practical relevance of anomaly detection. Anomaly detection has some quite direct applications in numerous industries, and I want to present three of them here: predictive maintenance, fraud detection, and intrusion detection.

The first is predictive maintenance. In this case, we want to determine the condition of in-service equipment in order to optimize the maintenance cycles. This is pretty intuitive, because too frequent inspections will cause unnecessary costs and downtimes, while too infrequent inspections can lead to failures or even breaking of the equipment. Anomaly detection can be very helpful here, because deviations from normal sensor readings might indicate wear in the system and that we want to run another maintenance cycle.

Fraud detection is another very important use case. Here we want to identify fraudulent transactions, for instance on credit cards. Of course, this is meant to prevent criminal activities and to avoid financial or other damages for the involved parties. Fraudulent transactions can be identified by unusual destinations, unusual amounts, or an unusual network topology if we talk about several transactions. Let me tell a personal example of how this can happen. A couple of years ago I was in India for a conference, and of course I was using my credit card there a couple of times, and suddenly it got blocked. After a couple of phone calls I found out that my
credit card company found it very unusual that I was making transactions in India. This was actually the furthest I had traveled by that time, and so the credit card company identified these transactions as anomalous and blocked the card. Of course, back then I was not too happy with this situation, but considering that a truly fraudulent transaction could cause high financial damage to me, I’m of course very happy that the credit card company is rather conservative here.

Finally, let’s have a look at intrusion detection, an equally important use case. Here we want to detect attacks against the network and protect the nodes against unauthorized access. Again, this can be tackled with anomaly detection, because intruders often leave unusual footprints, such as protocols, ports, the number of packets, the IP, or the duration of the connection. As an example, you might want to shield your ERP system against unauthorized access by intruders. One potential harm could be that an intruder tries to inject fake invoices into your system in order to get paid for services that he did not provide. This might seem like a very strange idea, but believe me, that actually happened to rather large German companies, so you really want to be shielded against that.

So why is anomaly detection different from other fields of machine learning? First of all, it’s very hard to identify anomalies, and therefore we often don’t have labels in our data set. Even if we have labels available, anomalies are by their very nature rarely represented in data sets, which means that we have to work with high class imbalances. Finally, we might not even want to restrict our system to anomalies that we have encountered in the past. For instance, consider the scenario where the anomalies are generated by an adversary, as in the credit card fraud case. The adversary might change his behavior over time, and we want our systems to cope with that.

So that means that the information that
is available heavily influences the available techniques. Questions that you should ask yourself when you want to install an anomaly detection system are: Is the distribution of the nominal data known? Is there clean data available for training, meaning data without anomalies? Do we have labeled data for evaluation? How large is the proportion of anomalies in our data? And how much noise is in the data?

So far we have defined what an anomaly is on an informal level and we have seen possible applications of anomaly detection. But in order to really develop methods that are able to identify anomalies, we need to come up with a formal description of what an anomaly is. In this course we are going to work with the contamination framework, which is a very popular framework for defining the anomaly detection task. Here we assume that the data comes from a mixture of two distributions: a distribution F0, which generates the nominal data, and a distribution F1, which generates the anomalies. Our data set then comes from the mixture model with relative frequency p of the anomalies, so we can write the overall distribution as D = (1 - p) F0 + p F1. The task of anomaly detection is then to estimate whether a given sample, which comes from this distribution, is actually anomalous or not. Note that we don’t have any labels available, because we’re just drawing all the points from the same distribution D.

Looking at this setup, I guess it’s pretty apparent that we will need additional assumptions in order to be able to identify anomalies reliably. Typically one imposes the following assumptions. First, we assume that the anomalies are few, meaning that the relative frequency p is way smaller than one half. Second, we want them to be outlying, meaning that the distributions F0 and F1 do not overlap too much. And finally, we want them to be sparse, meaning that F1 is less clustered than F0.

Okay, let’s try to understand why we need these three assumptions. First of all, I think few is
pretty intuitive. Imagine that we really have an anomaly frequency of 50%. Then our data might look something like this: just two blobs of data that are not really distinguishable by any features. In these cases it will be very hard to tell which of these two blobs is actually the anomalous one, so this is a case that we definitely do not want to allow.

Secondly, we have the outlying property. Again we can look at the extreme case, where F0 is just the same distribution as F1. In this case we really have no chance of distinguishing anything, because each of the two distributions generates every point equally likely. And since we know that the relative frequency of outliers will probably be very low, from a Bayesian perspective it’s probably best to always predict nominal, which of course is a case that we also want to exclude.

Finally, why do we want to have them sparse? Here we can consider nominal distributions that themselves cluster into smaller clusters. We get into a problem if our anomalous data builds such a new cluster: if our nominal data looks something like this, and I come and add my anomalous data, which just looks like another cluster in this distribution, then again we have no chance of identifying it. These are the reasons why we need these three assumptions.

The natural question now is: is this the silver bullet to tackle all anomaly detection scenarios? I’ll directly give you a clear no to this. Why is that? First of all, maybe the situation is not as bad as we described it in the contamination framework. Maybe we have data that we know is clean, meaning that it does not contain any anomalies. We can of course use this for training and make the problem a bit more benign. But it can also be less benign than the contamination framework. A very important case is where we don’t really have a well-defined distribution for the anomalies. This is for instance the case in an adversarial scenario. Our adversary might try to make his moves as similar to the nominal ones as possible, and then of course we will see some distribution shift over time; the adversary might even adapt to the systems that we have in place. So we will not be able to assume that the anomaly distribution is fixed over time. And finally, we will usually not be so fortunate as to find all three assumptions fully fulfilled. Often the three assumptions of few, sparse, and different are fulfilled to certain degrees, and the degree to which they are fulfilled heavily influences which kind of techniques we want to use. In extreme cases, some of them might even be false, as we just explained with the adversarial scenario.

Anomaly detection does not only demand special techniques for training but also for evaluation. This can easily be seen if we consider the case that we only have 1% of anomalies in our data. In this case, deploying an anomaly detection system that always predicts nominal will already give us 99% accuracy, which shows that in such cases accuracy might not be a good metric to evaluate the system. Better metrics are precision, recall, and the F1 score. In order to understand them, let’s revisit the confusion matrix. Given a test set of n samples, we can apply our anomaly detection system and compare the resulting labels with the true labels. We then obtain four possibilities: we have predicted nominal and actually have a nominal point, which we call a true negative; we have predicted nominal but actually have an anomaly, which we call a false negative; we have predicted an anomaly but actually have a nominal point, which we call a false positive; and we have predicted an anomaly and actually have an anomaly, which is a true positive. The measures that we’ve talked about all use these numbers and combine them in different ways
to highlight different aspects of the system. First of all, the precision is defined as the true positives divided by the sum of true positives and false positives, which estimates the probability that an observation is anomalous given that the system predicted it to be. On the other hand, we have the recall, which is defined as the true positives divided by the sum of true positives and false negatives, which estimates the probability that an observation will be detected as anomalous given that it really is. And finally the F1 score, with a slightly more complicated formula: it’s two times the precision times the recall, divided by the sum of precision and recall, which is the harmonic mean of precision and recall. It helps to combine these two metrics into a single one.

In order to evaluate the measures that we’ve just talked about, we assume that our system gives us a hard decision, meaning a one if the point is predicted to be anomalous and a zero if it’s predicted to be nominal. However, most anomaly detection systems actually provide an anomaly score, where higher values mean more anomalous, and it’s up to us to decide which degree of anomalousness we want to detect. That means that for every possible threshold we choose, we get different values for precision, recall, and F1, and then we can try to find the optimal threshold. One way of doing that is the precision-recall curve, which plots the recall for a given threshold against the precision for that threshold, and we do that for all the scores that we got in our data set. You see an example of a precision-recall curve on the right side. The optimal value would be in the upper right, where precision and recall are both one. But as you can also see, as the threshold gets lower we usually decrease the precision but increase the recall, and it’s up to us to find the optimal point here.

Another important curve is the so-called receiver operating characteristic, or ROC curve,
which plots the true positive rate for a given threshold against the false positive rate for that threshold, where the true positive rate is defined as the number of true positives divided by the sum of true positives and false negatives, and the false positive rate is the number of false positives divided by the sum of false positives and true negatives. An example of a ROC curve is in the left picture here. In the ROC curve, the optimal point is in the upper left corner, and again we can try to identify the threshold that leads us closest to this point.

Finally, let me also mention that it’s possible to assess the overall quality of the anomaly detection system using the ROC curve and the PR curve without needing to choose a particular threshold. We do that by computing the area under the curves. Remember that the optimal point for the ROC curve was the upper left corner, and the optimal point for the PR curve was the upper right corner. In both cases a perfect classifier would produce an area of one, and the lower the number is, in general the worse the classifier.

All right, I think by now we’re ready to have a look at our first simple anomaly detection technique. In order to develop it, let’s have a look at what would happen if we already knew the distribution of the nominal data, say it’s given by a density function. In this case we could simply take the negative log of the density function, which is also known as the surprise, and measure how surprised we would be if we drew a sample like that from the given nominal distribution. However, this is a case that we often cannot expect to have. Usually we don’t have an idea how the nominal data is distributed, and of course estimating the distribution of data is a hard task on its own. Slightly simpler is estimating the covariance and the mean of a distribution. In this case we can take the following distance, which is called the Mahalanobis distance, in order to evaluate whether a point is
anomalous or not. The Mahalanobis distance takes the following shape: you take the difference of your observation and the mean that you’ve estimated, then you take the quadratic form with the inverse of the covariance matrix, and take the square root of that. This implicitly measures distances according to a unimodal distribution centered around the mean, so we really measure the distance in terms of the standard deviations that the distribution has in the space. That of course means it has only restricted applicability: only if we have really good reasons to believe that our data distribution is unimodal and centered around the mean. If this is not the case, there is a simple extension where we can fit a mixture model, which estimates the means and covariances for the different clusters in our data. We can then take the minimal Mahalanobis distance to any of these clusters.

Let’s try to understand the Mahalanobis distance a bit deeper. For that, I claim that the Mahalanobis distance is actually equivalent to the surprise of a Gaussian. How does that come to be? Assume we have a Gaussian which has the same covariance matrix and the same mean as we have used in the Mahalanobis distance. If we want to compute the surprise, we have to take the negative log of the density function, which we do here. We plug in the formula of the Gaussian and arrive at this expression. Now it’s just applying the log and collecting all the constant terms into this constant C, and we end up with something which already looks quite similar to the Mahalanobis distance. The only difference is that we got rid of C and the multiplicative constant, and we applied the square root to the expression. But all of these are monotonic transformations, meaning that they won’t change the relative ranking of the outliers in our data. So we can really say that this will give us the same result as computing the surprise of a Gaussian.

So in order to also get
a practical grip on how to apply this, let’s have an exercise. In this exercise we are going to work through notebook number one. If you have already opened the notebook, you will see that we start with a very simple exercise. Here we have prepared a synthetic data set, basically consisting of nominal data, which is the one Gaussian blob that you see here, and anomalous data, which is slightly off from the nominal data but also in the shape of a Gaussian blob. We want to apply the Mahalanobis distance, as we’ve seen it before, in this exercise. I hope that the individual tasks are self-explanatory, and I would ask you to go through them on your own; we will meet up again to discuss the solution.

All right, I hope you were all able to solve the exercises, so let’s see how the solution was intended. As I said, we want to apply the Mahalanobis distance to our data set, meaning that we need to estimate the covariance matrix and the mean. We do this in the following cell with a very simple procedure, where we make our lives a bit easier by not fitting a full covariance matrix but a diagonal covariance matrix, meaning that we assume that the dimensions are actually independent in the data set, which is true, since we know how the data was generated. Looking at our result, we can see that the mean is pretty close to zero and we have standard deviations of 1.3 and 1.7 in the two directions. If you compare this with the original distribution of the nominal data, you will see that we are slightly off. The reason is that we actually fitted the covariance matrix and the mean on the full data set, containing nominal data and anomalous data. We did that because we want to assume that we were not able to identify the anomalies a priori, which is a very common setup, something that we have to live with in many situations, and we want to see how this affects our fitting procedure.

So let us compare our fit against the true
nominal distribution. Here you see the mean is located directly at zero, and we have standard deviations of 1 and 1.5 in the two components. Again we see that the contamination led to our fitted mean being slightly off: it is shifted slightly up the diagonal, where the anomalous data resides, and the standard deviations are also slightly too high for our data. We can definitely say that this is an effect of the anomalies in our data.

Nevertheless, with our fitting procedure we can compute Mahalanobis distances and then compute the ROC curve and the PR curve. We see two very beautiful curves here; both are close to being optimal. Remember that for the ROC curve the optimum would be the upper left corner, while for the PR curve, the closer we get to the opposite corner, the closer we are to the optimum.

So now it’s about choosing the right threshold for the data set. Since we are only working with a two-dimensional data set, this can be done visually, or we can use a visual inspection to aid our search. At this point we could also go through all the possible thresholds and compute the metrics precision, recall, F1, and so on, but for the sake of intuition, let’s try to do this by hand and visually. This is our initial guess: we started with a threshold of three. The red outline shows the decision threshold, so everything inside this circle will be detected as nominal and everything outside will be detected as anomalous. This gives a rather good separation between nominal points and anomalous points, with a precision of 0.6, a recall of 0.9, and an overall F1 score of 0.72. Again you can see that the precision tends to be relatively low, which is not uncommon in anomaly detection scenarios: because of the high class imbalance, as soon as we classify a few nominal points as anomalous, this will heavily
hit the precision, and that’s the reason why we have rather low values for precision. Often the recall is even more important: since anomalies can have high costs associated with them, you might want to detect all of them and can live with a few false positives. But this heavily depends on the cost matrix that we’ve discussed previously.

By playing around with the value, we can also see how precision and recall are affected by the threshold. Let’s choose a very low threshold and compute the visualization again. Here I think you can see very well that we have a terrible precision but a perfect recall; most of the nominal data is now classified as anomalous, so this is definitely too restrictive. Let’s see what happens if we go in the other direction. Now you see we have a perfect precision: everything that is classified as anomalous is actually anomalous. And the recall seems not to be as strongly affected as the precision was in the other direction. Nevertheless, as I said, we are actually working towards detecting all the anomalies, and we are pretty far off here, so this is really not an optimal value. If you played around with this visualization a bit, you probably also came to the conclusion that something around 3.2, maybe a little less, maybe a little more, should be rather good for our system.

What we can do now is take this threshold, apply it to a new test data set, and see what happens. Maybe you were shocked when you saw the values, because in this case the evaluation is way worse than on our original training data. This is due to a little trick that we used: we now assume that we are in a case where we cannot assume that the anomaly distribution is constant over time. So here we have actually moved the anomaly distribution a little bit closer to the nominal one, here at the
(3, 3). If we do a quick comparison with the original anomaly distribution, you see that its mean was at position (5, 5), so we moved it a little bit closer: the adversary tried to make his observations look a bit more like nominal data. We can see that this was rather successful, because he managed that about half of all the anomalous points remain undetected. This is to sensitize you that the scenario where the anomaly distribution is not well defined is a rather realistic one, and that there’s a certain danger in using labeled data to assess possible thresholds, because we don’t want to learn a specific anomaly distribution; we want to be shielded against all possible anomalies that might occur. With these insights, let’s continue with the course.

I guess you have already noticed by now that anomaly detection as a whole does not seem to have a common foundation. Rather, it seems to be a loosely related collection of approaches and methods, and in the following we want to try to categorize them in some sort of taxonomy, because for the actual application the structure of the data and the problem at hand is of extreme importance in choosing the right method. It’s even the case that sometimes hand-designed rules work better than fancy algorithms, so a serious statistical analysis should be performed before using heavier machinery. We will now walk through multiple philosophies that encompass many of our anomaly detection methods.

The first philosophy is that of distance-based methods. Here we say that a point is an outlier if it has few close neighbors. Methods that employ this philosophy are k-nearest-neighbor-based methods, clustering in combination with the Mahalanobis distance, the local outlier factor, or the matrix profile for time series.

The next philosophy is the probabilistic one. Here we say most of the data is normal, and we can fit a probabilistic model of normality; then a point is
most likely anomalous if it has low probability under the fitted model. Examples of such methods are kernel density estimation, Gaussian mixture models, extreme value theory, GAN-based anomaly detection methods, or time series forecasting with probabilistic models. Looking at the first two philosophies, we can already see the blurry lines in the taxonomy: for instance, the Gaussian mixture model could be considered a distance-based method, especially when it’s used in combination with the Mahalanobis distance, but equally well a probabilistic method.

Subspace-based methods follow the philosophy that the space where the data lives can be partitioned into normal regions and abnormal regions. This partition might happen in lower or even higher dimensional subspaces, and common examples of such an approach are isolation trees, which we will also see in this workshop.

Another very popular approach are so-called reconstruction-based methods. The philosophy here is to first learn to compress the data, usually by mapping it into a lower dimensional space, and then reconstruct the original input from this compression. Points with a large reconstruction error are then considered anomalous. Typical examples are autoencoders or their probabilistic variant, variational autoencoders, associative memory models like Hopfield networks, or principal component analysis. Again we see the blurry line: PCA can equally well be counted as a probabilistic method, and the same holds for the variational autoencoder.

Finally, we also have supervised methods. Here we don’t use the unsupervised techniques that we’ve seen so far, but typical supervised classifiers with some additional tricks to deal with the data imbalance. There are of course many more methods, for instance information-theoretic methods, domain-specific methods, or even ensembles of different methods. There’s no way that we can cover all of these in this workshop. Instead, we will demonstrate a few techniques that are rather prototypical for the different philosophies and dive into two
more specialized topics, namely time series analysis and extreme value theory. The techniques that we present are chosen so that they are useful in a variety of situations.
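
The exercise discussed in the transcript can be sketched in a few lines of Python, assuming only NumPy. This is a minimal illustration and not the notebook's actual solution: the sample size, the contamination rate p = 0.03, the anomaly mean (5, 5), the anomaly spread of 0.5, the random seed, and all function names are assumptions made for this example. Only the nominal standard deviations (1 and 1.5), the diagonal covariance fitted on the contaminated data, and the threshold of 3 are taken from the talk.

```python
import numpy as np


def make_contaminated_data(n=2000, p=0.03, anomaly_mean=(5.0, 5.0), seed=0):
    """Draw n points from the mixture (1 - p) * F0 + p * F1 (all values assumed)."""
    rng = np.random.default_rng(seed)
    is_anomaly = rng.random(n) < p  # hidden labels, used only for evaluation
    nominal = rng.normal(loc=0.0, scale=(1.0, 1.5), size=(n, 2))     # F0
    anomalous = rng.normal(loc=anomaly_mean, scale=0.5, size=(n, 2))  # F1
    X = np.where(is_anomaly[:, None], anomalous, nominal)
    return X, is_anomaly


def fit_diagonal_gaussian(X):
    """Estimate mean and per-dimension standard deviation (diagonal covariance)."""
    return X.mean(axis=0), X.std(axis=0)


def mahalanobis_diagonal(X, mean, std):
    """Mahalanobis distance for a diagonal covariance matrix."""
    return np.sqrt((((X - mean) / std) ** 2).sum(axis=1))


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 from boolean arrays (True means anomaly)."""
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Fit on the full, contaminated sample, as in the exercise.
X, y = make_contaminated_data()
mean, std = fit_diagonal_gaussian(X)
scores = mahalanobis_diagonal(X, mean, std)

# Turn the anomaly score into a hard decision with the threshold from the talk.
precision, recall, f1 = precision_recall_f1(y, scores > 3.0)
print("fitted mean:", mean, "fitted std:", std)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because the mean and standard deviations are fitted on the contaminated sample, the standard deviations come out larger than the true nominal values, which mirrors the effect described in the solution. Changing the threshold in `scores > 3.0` shows how precision and recall move in opposite directions.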
