Most real-world datasets include some anomalous values, generally termed 'outliers'. These observations deviate substantially from the general trend, so it is important to isolate them to improve the quality of the original data and reduce their adverse impact on analysis. In practice, nearly all experimental data samples are likely to be contaminated by outliers, which reduce the efficiency and reliability of statistical methods. Outliers should be analyzed to see whether their unusual behavior can be explained: sometimes they are "bad" values resulting from unusual but explainable events, and their cause is not always random chance. A study should therefore be made before an outlier is discarded.
Detection of Statistical Outliers
Statistical outliers are more common in distributions that do not follow the normal distribution; a long-tailed distribution, for example, produces outliers more often than a normal one.
The simplest method of identifying whether an extreme value is an outlier is by using the interquartile range. The IQR tells us how spread out the middle half of our data set is.
The interquartile range, or IQR, is determined by subtracting the first quartile from the third quartile.
We start with the IQR and multiply it by 1.5. We then subtract this number from the first quartile and add it to the third quartile; these two numbers form our inner fences. For the outer fences, we multiply the IQR by 3, subtract the result from the first quartile, and add it to the third quartile; these two numbers form our outer fences.
Outliers can now be detected by determining where the observation lies in reference to the inner and outer fences. If a single observation is more extreme than either of our outer fences, then it is an outlier, and more particularly referred to as a strong outlier. If our data value is between corresponding inner and outer fences, then this value is a suspected outlier or a weak outlier.
Suppose that we have calculated the first and third quartiles of our data and found them to be 40 and 50, respectively. The interquartile range is IQR = 50 – 40 = 10, so 1.5 x IQR = 15. This means that the inner fences are at 40 – 15 = 25 and 50 + 15 = 65, that is, 1.5 x IQR less than the first quartile and 1.5 x IQR more than the third quartile.
We now calculate 3 x IQR, that is, 3 x 10 = 30. The outer fences lie 3 x IQR beyond the first and third quartiles, so they are at 40 – 30 = 10 and 50 + 30 = 80.
Any data values less than 10 or greater than 80 are considered outliers. Any data values between 10 and 25 or between 65 and 80 are suspected outliers.
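The fence calculations above can be sketched in a few lines of Python. This is a minimal illustration that takes the quartiles as given, using the values 40 and 50 from the worked example (the function name iqr_fences is our own, chosen for illustration):

```python
def iqr_fences(q1, q3):
    """Compute the inner (1.5 x IQR) and outer (3 x IQR) fences
    from the first and third quartiles."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

# Worked example: Q1 = 40, Q3 = 50
inner, outer = iqr_fences(40, 50)
print(inner)  # (25.0, 65.0)
print(outer)  # (10.0, 80.0)
```

A value outside the outer fences would be flagged as a strong outlier; one between corresponding inner and outer fences as a suspected (weak) outlier.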
Reasons for Identifying Outliers
The presence of outliers may indicate errors in measurement or the occurrence of an unexpected and previously unknown phenomenon. It is extremely important to check for outliers in every statistical analysis because many descriptive statistics are sensitive to them; the mean, the standard deviation, and the correlation coefficient in paired data are just a few examples. Because these statistics become distorted, analysts can be misled into drawing incorrect conclusions.
Certain statistical estimators are robust and can cope with outliers, while others cannot. A typical example is the median, the most resistant statistic, with a breakdown point of 50%: as long as no more than half the data are contaminated or missing, the median will not deviate by an arbitrarily large or small amount.
In practice, outliers can cause severe damage to data-driven businesses. For example, outliers in the transactional data of retailers or distributors could lead to incorrect demand forecasts, producing a mismatch of demand and supply as the business ends up either understocking or overstocking its inventory. Other adverse outcomes include inaccurate budget planning, sub-optimal resource deployment, poor vendor selection, and loss-making pricing models.
Engineering firms and manufacturers can also be adversely affected by outliers. Errors in measurements taken from sensors (e.g. thermometers, barometers) during product quality checks could result in unexpected product failures, incorrect determination of warranty periods, or costly product redesigns.
The adverse effects of outliers can even influence the lives of citizens when data collected by governments contains them. Biased samples in government surveys, containing observations that would be considered outliers relative to the entire population, could justify the formulation of policies that damage society. It is therefore imperative to devise methods of dealing with outliers in statistical analysis.
Treatment of Outliers
Outliers can be treated with the following categories of action:
Transformation of Data: Transforming the data is one way to soften the impact of outliers, since the most commonly used transformations, square roots and logarithms, affect larger values to a much greater extent than smaller ones. Transformations may not always fit the theory of the model, as they can affect its interpretation: transforming a variable does more than make a distribution less skewed; it changes the relationship between the variables in the model.
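The compressing effect of a logarithmic transformation can be seen directly. In this small sketch (the values are invented for illustration), the gap between the largest and second-largest raw value is 99,000, but after a base-10 log transform it shrinks to 2:

```python
import math

values = [10, 100, 1000, 100000]  # right-skewed, with one extreme value
logged = [math.log10(v) for v in values]

# Large values are compressed far more than small ones, pulling the
# extreme observation back toward the rest of the data.
print(logged)
```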
Deletion of Values: When outliers are genuine errors that cannot be corrected, or lie so far outside the range of the data that they distort statistical inferences, they should be deleted. When in doubt, we can report model results both with and without outliers to see how much they change. Data transformation and deletion are important tools, but they should not be viewed as a cure-all for the distributional problems associated with outliers; they should be an informed choice, not a routine task. In some cases, removing an outlier can itself induce incorrect inferences about the data, and replacing the observation with a measure of central tendency (mean, median, or mode, depending on the situation) may be preferable.
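Replacement by a measure of central tendency can be sketched by combining the fence rule from earlier with the sample median. The function name and the sample values here are our own, chosen for illustration; the quartiles (40 and 50) match the earlier worked example:

```python
from statistics import median

def replace_outliers_with_median(data, q1, q3, k=1.5):
    """Replace values outside the k x IQR fences with the sample
    median (k=1.5 gives the inner fences from the worked example)."""
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    m = median(data)
    return [m if (x < lo or x > hi) else x for x in data]

data = [41, 44, 46, 48, 95]  # 95 lies far above the upper fence of 65
print(replace_outliers_with_median(data, 40, 50))  # [41, 44, 46, 48, 46]
```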
Accommodation of Values: One very effective strategy is to use methods that are robust in the presence of outliers. Nonparametric statistical methods fit into this category and deserve wider application to continuous or interval data: simulation studies indicate that when outliers are not a problem, their power to detect significant differences is only slightly smaller than that of the corresponding parametric methods. Various forms of robust regression models and computer-intensive approaches also deserve consideration.
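One common robust estimator in this spirit is the median absolute deviation (MAD), a resistant alternative to the standard deviation that, like the median, has a 50% breakdown point. A minimal sketch with invented sample values:

```python
from statistics import median, stdev

def mad(data):
    """Median absolute deviation: the median distance of the
    observations from their own median."""
    m = median(data)
    return median(abs(x - m) for x in data)

clean = [10, 11, 12, 13, 14]
dirty = [10, 11, 12, 13, 1000]  # one gross outlier

print(stdev(clean), stdev(dirty))  # the standard deviation explodes
print(mad(clean), mad(dirty))      # the MAD is unchanged
```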