blog/outliers-detection-in-r/ #63

utterances-bot · 2021-01-06T10:50:56Z

utterances-bot
Jan 6, 2021

Outliers detection in R - Stats and R

Learn how to detect outliers in R thanks to descriptive statistics and via the Hampel filter, the Grubbs, the Dixon and the Rosner tests for outliers

https://statsandr.com/blog/outliers-detection-in-r/

AntoineSoetewey · 2021-01-06T10:50:57Z

AntoineSoetewey
Jan 6, 2021
Maintainer

"Comment written by Felix Kluxen on August 17, 2020 09:27:12:

Dear Antoine,

thank you for this helpful post.

Just my two cents: I think it sometimes makes sense to formally distinguish two classes of outliers: extreme values and mistakes.
Extreme values are statistically and philosophically more interesting, because they are possible but unlikely responses -- such as in your height example.
Hawkins considers outliers as values that deviate so much from other observations one might suppose a different underlying sampling mechanism - which is another interesting take on this.

Cheers,
Felix

Hawkins, D. M., 1980. Identification of outliers. Chapman and Hall, London ; New York."

0 replies

AntoineSoetewey · 2021-01-06T10:51:42Z

AntoineSoetewey
Jan 6, 2021
Maintainer

Comment written by Antoine Soetewey on August 17, 2020 10:32:36:

Dear Felix,

Thanks for your comment, the article has been updated accordingly (see first and fourth paragraph of the introduction). Feel free to let me know if there is any inconsistency.

Regards,
Antoine

0 replies

AntoineSoetewey · 2021-01-06T10:52:00Z

AntoineSoetewey
Jan 6, 2021
Maintainer

Comment written by Felix Kluxen on August 17, 2020 11:30:30:

Excellent! The elephant in the room with statistically identified outliers (here values that are probably not mistakes) is obviously that you cannot solve the issue of what researchers should do with the information, as you write. This really depends on the research question (e.g., subsets, responder/non-responder, etc.) and usually involves a surprising amount of reflection on the researcher's side... or the willingness to think the model assumptions through. If a statistical test result relies on a single influential value, this should caution the researcher against making overambitious claims.

Cheers,
Felix

0 replies

AntoineSoetewey · 2021-01-06T10:52:24Z

AntoineSoetewey
Jan 6, 2021
Maintainer

Comment written by Antoine Soetewey on August 17, 2020 12:15:18:

You're totally right, outliers require thoughtful reflection and caution for many statistical analyses!

0 replies

DLAtatem · 2021-01-06T16:14:08Z

DLAtatem
Jan 6, 2021

Dear Antoine
This is very helpful indeed. I just found a key to detecting outliers formally for my project, thanks to this write up
Many thanks
Duncan

0 replies

AntoineSoetewey · 2021-01-06T16:16:38Z

AntoineSoetewey
Jan 6, 2021
Maintainer

Glad you find it useful!

0 replies

DLAtatem · 2021-01-06T22:25:59Z

DLAtatem
Jan 6, 2021

Hi Antoine
It has been. Actually, I am looking for more on winsorizing outliers in R, i.e., replacing them rather than deleting them. Any guidance would be very helpful.
Kind regards

0 replies

AntoineSoetewey · 2021-01-07T13:21:21Z

AntoineSoetewey
Jan 7, 2021
Maintainer

Comment written by vijayarajamanickam on December 03, 2020 12:26:17:

Dear Antoine,

I tried to detect outliers using this script:

out <- boxplot.stats(dat$hwy)$out
out_ind <- which(dat$hwy %in% c(out))
out_ind

It works well in most cases, but in some cases it returns integer(0).
Could you please help me with this?

Many thanks
vijay

0 replies

AntoineSoetewey · 2021-01-07T13:21:49Z

AntoineSoetewey
Jan 7, 2021
Maintainer

Comment written by Antoine Soetewey on December 03, 2020 18:00:30:

Dear,

When you have the result:
integer(0)

it simply means that there is no outlier according to this method.

If you run boxplot(dat$hwy), you will see that there are no potential outliers as defined by this method.
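To make the integer(0) case concrete, here is a minimal sketch on two made-up vectors (x_clean and x_out are illustrative names and values, not data from the post). It shows the same script returning integer(0) when nothing lies beyond the whiskers, and a position when something does:

```r
# Made-up values without any extreme observation (illustrative only)
x_clean <- c(10, 12, 11, 13, 12, 11, 14)
out_clean <- boxplot.stats(x_clean)$out      # values beyond the whiskers
ind_clean <- which(x_clean %in% out_clean)   # positions of those values
ind_clean                                    # integer(0): nothing was flagged
length(ind_clean) == 0                       # a safe way to test for "no outlier"

# Same values plus one clearly extreme observation
x_out <- c(x_clean, 40)
ind_out <- which(x_out %in% boxplot.stats(x_out)$out)
ind_out                                      # 8: the position of the value 40
```

Testing length(out_ind) == 0 is usually more robust in scripts than comparing the result to integer(0) directly.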

Hope this helps.

Regards,
Antoine

0 replies

AntoineSoetewey · 2021-01-08T09:17:26Z

AntoineSoetewey
Jan 8, 2021
Maintainer

If you do not want to simply remove outliers, you can indeed use "Winsorization" which is a technique to replace extreme data values with less extreme values.

See for instance the Winsorize() function from the {DescTools} package in R, or this article.
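As a rough illustration of the idea, here is a base R sketch that does not rely on any package. The helper winsorize() is a hypothetical name written for this example, and the 5th/95th percentile limits are an assumption; other limits may suit your data better:

```r
# Hypothetical helper (not from a package): clamp values outside the chosen
# percentiles to those percentiles instead of removing them
winsorize <- function(x, probs = c(0.05, 0.95)) {
  limits <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

# Made-up data: 50 plausible values plus one extreme value
set.seed(42)
x <- c(rnorm(50, mean = 25, sd = 5), 80)
x_wins <- winsorize(x)

range(x)                     # the maximum is the extreme value 80
range(x_wins)                # extremes are pulled back to the percentile limits
length(x_wins) == length(x)  # no observation is dropped
```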

Hope this helps.

Regards,
Antoine

0 replies

DLAtatem · 2021-01-08T14:16:52Z

DLAtatem
Jan 8, 2021

Antoine
Many thanks. This is helpful

regards
duncan

0 replies

faheemja · 2021-07-05T05:00:52Z

faheemja
Jul 5, 2021 — with giscus

Dear Antoine
This is very helpful indeed. I just found a key to detecting outliers formally for my project,
I have one simple question: in my project I need not only to detect these outliers but also to replace them with normal values. Can you pinpoint a method that can be used to replace these outliers with normal values? I am using time series data.
Many thanks
Regards
Faheem Jan

1 reply

AntoineSoetewey Jul 6, 2021
Maintainer

Hello,

If deleting them is not an option, the easiest is to treat them as missing values and replace them by:

- the mean or median for quantitative variables
- the mode for qualitative variables

See the impute() function from the {Hmisc} package.

If this is not enough, see other imputing techniques here, and here in case of regressions.
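A minimal base R sketch of this idea follows, on a made-up series (the values are illustrative). It first sets the values flagged by the boxplot rule to NA, then fills them either with the median or, since the question concerns time series, by linear interpolation between neighbours. The choice of the boxplot rule and of approx() is an assumption for the example, not the only option:

```r
# Made-up series with one extreme value at position 5 (illustrative only)
x <- c(21, 23, 22, 24, 95, 23, 22, 21, 24, 23)

# Step 1: turn the flagged values into missing values
out <- boxplot.stats(x)$out
x_na <- replace(x, x %in% out, NA)

# Step 2a: replace missing values by the median of the remaining values
x_median <- replace(x_na, is.na(x_na), median(x_na, na.rm = TRUE))

# Step 2b: for ordered data, interpolate linearly between the neighbours
idx <- seq_along(x_na)
x_interp <- approx(idx, x_na, xout = idx)$y

x_median[5]  # 23, the median of the other nine values
x_interp[5]  # 23.5, halfway between 24 and 23
```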

Hope this helps.

Regards,
Antoine

AyaAlkhatib · 2021-09-09T13:28:06Z

AyaAlkhatib
Sep 9, 2021 — with giscus

Dear Antoine,

Thank you very much. This is very helpful indeed.

Regards,
Aya

1 reply

AntoineSoetewey Sep 9, 2021
Maintainer

Glad you find it useful!

saeedraeisi · 2022-02-18T23:02:54Z

saeedraeisi
Feb 18, 2022 — with giscus

Hi,
I have a big dataset (400,000 observations) with 6 numeric columns that contain a lot of extreme values which I cannot treat as outliers. On the other hand, without cleaning, it is not possible to analyze them. What should I do?

6 replies

saeedraeisi Feb 19, 2022 — with giscus

Thank you for your consideration.
Actually, I have some categorical variables (like department, disease, ...) and several date-times related to a hospital (admission date-time, doctor note date-time, discharge date-time, ...). I just subtracted the date-times from each other to create some numeric variables.
I'm not sure, but in my opinion there are two ways to analyze my dataset: (1) finding patterns based on department and disease in different periods, and (2) comparing hospitalization across departments.
I would appreciate your help in this regard.

AntoineSoetewey Feb 19, 2022
Maintainer

You could start by doing some plots and descriptive statistics of your quantitative variables by group (i.e., by department, by disease, etc.). If you need to go further, you could consider t-tests and ANOVA (depending on the number of groups), or possibly multiple linear regressions if you want to analyze the relationships with several variables.
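As a starting point, descriptive statistics and a plot by group could look like the sketch below. The data frame stays and its columns department and los_hours (length of stay in hours) are invented for the example; replace them with your own variables:

```r
# Simulated data standing in for the real hospital dataset (hypothetical names)
set.seed(123)
stays <- data.frame(
  department = rep(c("cardiology", "oncology", "surgery"), each = 50),
  los_hours  = c(rexp(50, rate = 1 / 48),
                 rexp(50, rate = 1 / 96),
                 rexp(50, rate = 1 / 72))
)

# Robust descriptive statistics per department (median and IQR are less
# sensitive to extreme values than mean and standard deviation)
by_dept <- t(sapply(
  split(stays$los_hours, stays$department),
  function(x) c(n = length(x), median = median(x), iqr = IQR(x))
))
by_dept

# One boxplot per department to compare the distributions visually
boxplot(los_hours ~ department, data = stays)
```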

If you need further help, feel free to contact me.

FarzadRaeisi111 Feb 22, 2022 — with giscus

Thanks for your informative guide,
How can I create discrete time periods based on the hour from a date-time in the format "%d/%m/%Y %H:%M:%S" (e.g., morning: from 6 to 10 am) to produce informative plots?

AntoineSoetewey Feb 22, 2022
Maintainer

Hello @FarzadRaeisi111,

You can use the hour() function from the {lubridate} package.

Here is an example of code with two different times (one in the morning, one in the afternoon):

# load package
library(lubridate)

# convert character strings to date-times
date <- as.POSIXct(c("21/02/2022 08:12:37", "22/02/2022 17:10:40"),
  format = "%d/%m/%Y %H:%M:%S"
)

# derive time period from the hour (hour() returns 0-23);
# hours 6 to 10 inclusive are "morning", everything else is "afternoon"
dat <- data.frame(date,
  time_period = ifelse(hour(date) >= 6 & hour(date) <= 10,
    "morning",
    "afternoon"
  )
)

# see result
View(dat)
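If more than two periods are needed, one possible extension is cut(), sketched here in base R only. The four periods and their boundaries are assumptions chosen for illustration, and the third date-time is made up; with {lubridate} loaded, hour(date) could replace the format() call:

```r
# Example date-times (the third one is added for illustration)
date <- as.POSIXct(
  c("21/02/2022 08:12:37", "22/02/2022 17:10:40", "23/02/2022 23:45:00"),
  format = "%d/%m/%Y %H:%M:%S", tz = "UTC"
)

# Extract the hour (0-23) as an integer
hr <- as.integer(format(date, "%H"))

# Assumed periods: [0,6) night, [6,12) morning, [12,18) afternoon, [18,24) evening
time_period <- cut(hr,
  breaks = c(0, 6, 12, 18, 24),
  labels = c("night", "morning", "afternoon", "evening"),
  right = FALSE
)

data.frame(date, hr, time_period)
```

The result is a factor with ordered levels, which keeps the periods in chronological order in plots and tables.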

Hope this helps.

Regards,
Antoine

FarzadRaeisi111 Mar 1, 2022 — with giscus

Thank you again, it works.

BroVic · 2022-05-04T13:21:13Z

BroVic
May 4, 2022 — with giscus

Thanks for this excellent post, Antoine. I just wanted to remind you of the range() function...

2 replies

AntoineSoetewey May 6, 2022
Maintainer

Thanks @BroVic for pointing it out!

I have added it in this section.

Regards,
Antoine

BroVic May 7, 2022

Thanks for the credit 😎

RobWiederstein · 2022-07-21T00:42:46Z

RobWiederstein
Jul 21, 2022 — with giscus

Very comprehensive and super helpful! Many thanks!

1 reply

AntoineSoetewey Jul 21, 2022
Maintainer

Thanks, glad it was helpful for you Rob!

SLGiHub · 2022-12-01T11:52:29Z

SLGiHub
Dec 1, 2022 — with giscus

Hi there,
Thank you for this post, it’s really useful to see all the methods presented together in an applied way.

I have a dataset that is seasonal, i.e., waves. I’ve fitted regression models with harmonic terms (sine and cosine). However, do you know if any of these methods would work with seasonal data before model fitting?

I thought the Hampel filter could work?

Thanks in advance,

SL

1 reply

AntoineSoetewey Dec 1, 2022
Maintainer

Hello @SLGiHub,

I am afraid the Hampel filter will not capture outliers in an optimal way if you have seasonal data.

The Hampel filter will spot an outlier only if it is higher than an upper threshold or lower than a lower threshold. But these two thresholds are computed based on the entire dataset.

Suppose this:

The red dot is clearly an outlier. However, the Hampel filter will not detect it because it is not lower than the lower threshold.

I suggest removing the seasonality from the data first (i.e., deseasonalizing it), then you can use the Hampel filter.

Alternatively, you could fit your data, and if your fit is good, then compute the residuals (i.e., difference between fitted values and your data). And finally apply the outlier detection method of your choice on these residuals.
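Here is a minimal sketch of the residual-based approach on simulated data. Everything in it is an assumption made for illustration: the sine wave with a known period of 50, the injected outlier at position 100, the hypothetical helper hampel_outliers(), and the use of the default mad() scaling with a threshold of 3:

```r
# Simulated seasonal series with one injected outlier near the middle of the wave
set.seed(1)
t <- 1:200
y <- 10 * sin(2 * pi * t / 50) + rnorm(200, sd = 0.5)
y[100] <- y[100] + 8   # still inside the overall range of the wave

# Hypothetical helper: positions outside median +/- k * MAD
hampel_outliers <- function(x, k = 3) {
  med <- median(x)
  s <- mad(x)
  unname(which(x < med - k * s | x > med + k * s))
}

# On the raw series the thresholds are driven by the seasonal amplitude
raw_flagged <- hampel_outliers(y)

# Fit the harmonic terms, then apply the same rule to the residuals
fit <- lm(y ~ sin(2 * pi * t / 50) + cos(2 * pi * t / 50))
res_flagged <- hampel_outliers(residuals(fit))

raw_flagged           # the injected point is not detected on the raw data
100 %in% res_flagged  # it is detected once the seasonality is removed
```

A few ordinary points may also be flagged on the residuals, since roughly 0.3% of normally distributed values fall outside 3 scaled MADs.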

Hope this helps.

Regards,
Antoine

JothamIT · 2023-04-30T18:40:53Z

JothamIT
Apr 30, 2023 — with giscus

Very informative.
I'll take the histogram and summary() route.

1 reply

AntoineSoetewey Apr 30, 2023
Maintainer

Simple but very efficient!

zhakota · 2024-02-27T16:24:50Z

zhakota
Feb 27, 2024 — with giscus

Hello!
Thanks for the detailed review of the topic with outliers. Can you clarify the point with the normal distribution for tests?
From the text of the article: "Note that the 3 tests are appropriate only when the data (without any outliers) are approximately normally distributed."
The Shapiro-Wilk test result indicates a non-normal distribution:
data: dat$hwy
W = 0.95885, p-value = 2.999e-06

And the density plot has two peaks.

How do you explain the application of the Grubbs test in this case?

4 replies

AntoineSoetewey Feb 28, 2024
Maintainer

Dear @zhakota,

You are right; the Grubbs test should not be applied to dat$hwy. My goal was to illustrate the three statistical tests (Grubbs, Dixon and Rosner tests) on the same dataset as the one used for the other methods (for the sake of simplicity and to avoid mixing several datasets).

It is however misleading to use these tests on non-normal data, as you noted. For this reason, I adapted the post by:

1. simulating new data (based on a normal distribution), and then
2. applying the Grubbs, Dixon and Rosner tests on these simulated data.

Let me know if you see any other inconsistencies.

Now, I would like to address another point you raised. As mentioned in the post, it is recommended to check normality visually, with a QQ-plot for instance. Although it can also be checked with a formal test for normality (such as the Shapiro-Wilk test as you used), the presence of one or more outliers may cause the normality test to reject normality when it is in fact a reasonable assumption for applying one of the 3 outlier tests (Grubbs, Dixon or Rosner).
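This effect can be reproduced with a small simulation. The sketch below uses made-up data (100 values drawn from a standard normal distribution, plus one value of 8 added by hand) and compares the Shapiro-Wilk p-value with and without the added point:

```r
# 100 values simulated from a normal distribution
set.seed(2024)
x <- rnorm(100)

# Same data with a single extreme value added by hand
x_out <- c(x, 8)

# Shapiro-Wilk p-values without and with the extreme value
p_clean <- shapiro.test(x)$p.value
p_out <- shapiro.test(x_out)$p.value
c(clean = p_clean, with_outlier = p_out)

# The QQ-plot shows what the test cannot: one isolated point far from the line,
# while the remaining points follow it
qqnorm(x_out)
qqline(x_out)
```

The test rejects normality for x_out even though every point except one was drawn from a normal distribution, which is why the visual check is more informative here.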

Hope this helps.

Regards,
Antoine

zhakota Feb 28, 2024 — with giscus

Thanks a lot, Antoine!

A couple more questions, if I may.

What do you think about this approach? Is this acceptable?

Run a test for normality (like the Shapiro-Wilk test) before running Grubbs’ test. If you find your data set isn’t normally distributed, try removing the potential outlier from the data set and running the normality test again. If your data still isn’t normal, don’t run this test.
https://www.statisticshowto.com/grubbs-test/

Is combining boxplot and violin to skip the histogram stage a rational option?

# assuming dat is the mpg dataset used in the post
library(ggplot2)
dat <- ggplot2::mpg

ggplot(dat) +
  aes(x = "", y = hwy) +
  geom_violin() +
  geom_boxplot(fill = "lightblue", outlier.color = "red", width = 0.1) +
  theme_minimal()

AntoineSoetewey Feb 29, 2024
Maintainer

Dear @zhakota,

Suppose the following scenarios:

1. Your data are normally distributed.
2. Your data are not normally distributed due to some outliers.
3. Your data are not normally distributed, not because of some outliers, but due to many points not following a normal distribution.

Now suppose that you want to perform the Grubbs test and you follow the approach you mentioned (using the Shapiro-Wilk (SW) test instead of a visual inspection).

Here is the process you would need to do for each scenario:

1. Do the SW test, conclude that it is normally distributed, do the Grubbs test.
2. Do the SW test, conclude that it is not normally distributed, check the QQ-plot (or boxplot/histogram), remove potential outlier(s), do the Grubbs test.
3. Do the SW test, conclude that it is not normally distributed, check the QQ-plot (or boxplot/histogram), conclude that the source of the non-normality is not due to some potential outliers but due to the shape of the distribution, do not do the Grubbs test.

As you can see in scenarios 2. and 3., given that the SW test will not tell you whether non-normality is due to the shape of the distribution or due to outliers, nor which point is a potential outlier, you will need to check it with another method, for instance visually with a QQ-plot, boxplot or histogram. Therefore, given that you will need to do a visual inspection as soon as your data are not normally distributed, I don’t really see the point of performing the SW test beforehand. Do you?

And if you are in scenario 1., the visual inspection will tell you that your data follow a normal distribution as well as a SW test. So again there is no need to check normality with a normality test.

One last remark: the Grubbs test requires that data are approximately normally distributed. With large samples, the SW test may tell you that data are not normal (since power increases with sample size) although in fact there is only a small deviation from normality. So if you choose to do the Grubbs test only based on the result of the SW test, you may decide not to perform the Grubbs test although you could have done it.

To sum up: unless you really want to make sure that your data are normally distributed based on the SW test rather than based on a visual inspection (which I recall, is not strictly necessary), I don’t see the point of performing a SW test.

Regarding the combination of boxplot and violin plot: yes, I find it a useful alternative to a boxplot and a histogram. However, I rarely use it in practice because non-statisticians tend not to understand the violin plot easily, and even those who are used to seeing boxplots tend to be confused when they see a violin plot combined with a boxplot. I realized that I was spending more time explaining how to interpret this plot than simply doing a boxplot and a histogram separately (because these two plots are seen much more often in statistics classes than the boxplot/violin plot combination).

Hope this helps. Feel free to share your opinions on these matters, I’d be happy to hear them.

Regards,
Antoine

zhakota Feb 29, 2024 — with giscus

Thanks a lot, Antoine, for the detailed explanations!

Marlenildo · 2024-05-10T00:58:23Z

Marlenildo
May 10, 2024 — with giscus

Hi. My name is Marlenildo and I just found your blog. I'm excited to read all the posts, and to learn more about statistics and blogging too. Regarding detecting outliers, consider using the identify_outliers() function from the rstatix package. It finds all possible outliers in the dataset at once!

2 replies

Marlenildo May 10, 2024 — with giscus

In addition to potential outliers, it finds extreme outliers!
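For readers who want to see the distinction without installing the package, here is a base R sketch of the usual boxplot convention: values beyond 1.5 × IQR from the quartiles are potential outliers, and values beyond 3 × IQR are extreme. That identify_outliers() follows exactly these cut-offs is stated here as an assumption to be checked against the rstatix documentation, and the data are made up:

```r
# Made-up values: 1 to 10, plus a moderate and a very large value
x <- c(1:10, 20, 50)

# Quartiles and interquartile range
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Potential outliers: beyond 1.5 * IQR from the quartiles
is_outlier <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr

# Extreme outliers: beyond 3 * IQR from the quartiles
is_extreme <- x < q[1] - 3 * iqr | x > q[2] + 3 * iqr

data.frame(x, is_outlier, is_extreme)[is_outlier, ]
```

Here 20 is flagged as a potential outlier only, while 50 is flagged as both a potential and an extreme outlier.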

AntoineSoetewey May 21, 2024
Maintainer

Thank you for pointing out this function @Marlenildo! I have added it to the post.

SugarRayLua · 2025-02-17T03:29:05Z

SugarRayLua
Feb 17, 2025 — with giscus

@AntoineSoetewey,

Another simple way I found to remove outliers using dplyr and the "lares" package on a variable with outliers:

var_no_outliers <- if_else(outlier_turkey(var_with_outliers, 1.5) == T, NA, var_with_outliers)

Fyi,

Have a good week :-)

1 reply

AntoineSoetewey Feb 17, 2025
Maintainer

Thank you very much for your input @SugarRayLua, I tried the function and it works perfectly!

As a side note, I'm curious to know why the function is called outlier_turkey although it uses Tukey's fences :)

SugarRayLua · 2025-02-17T09:08:21Z

SugarRayLua
Feb 17, 2025

You're welcome, @AntoineSoetewey -- I was wondering that myself :-) I'll ask the question on his github.

P.S.-- any good websites to point me to (or perhaps something you might add to your chisquare tutorial) on using "permutation" tests for chisquare tests of independence when expected cell counts are low, and you don't want to use Fisher's exact test? I ask because I've recently been using the "compareGroups" package which (as a novice) I've found very helpful for the analysis I'm currently doing. However, the package developer for some reason decided not to have the package use Fisher's exact test and to use a "permutation" test instead but then needs the user to enter two additional parameters Chisq.test.B = integer number of permutations [default = 2000] and Chisq.test.seed = integer [with no default]. Not much on the web about how to do that and wasn't mentioned in your blog on chi square tests of independence.
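For context on what the two parameters control, base R offers a comparable option: chisq.test() can compute a simulated (Monte Carlo) p-value with B replicates, and the seed only makes that simulation reproducible. The sketch below uses a made-up 2 × 2 table with low expected counts; whether compareGroups relies on this same function internally is an assumption, not something confirmed here:

```r
# Made-up 2 x 2 table with low expected cell counts (illustrative only)
tab <- matrix(c(3, 1, 1, 4),
  nrow = 2,
  dimnames = list(group = c("A", "B"), outcome = c("yes", "no"))
)

# Simulated p-value based on B = 2000 random tables with the same margins;
# the seed fixes the random draws so the result can be reproduced
set.seed(123)
res1 <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)

# Same seed, same p-value
set.seed(123)
res2 <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)

res1$p.value
identical(res1$p.value, res2$p.value)
```

A larger B gives a more stable p-value at the cost of computing time; without a fixed seed, the p-value changes slightly from run to run.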

Have a good rest of your week.

1 reply

SugarRayLua Feb 17, 2025

@AntoineSoetewey,

I filed an issue on the lares package site regarding the function using turkey vs. tukey. I suspect others might get confused and not type the function syntax correctly.

Please disregard my second question: I went onto the compareGroups website and someone had already filed a related issue, and the developer said that the package actually does use Fisher's exact by default if low expected cells unless the user asks it to do permutations-- it just wasn't clear from the package manual or the output that it was using Fisher's exact test. :-)

SugarRayLua · 2025-02-17T18:23:18Z

SugarRayLua
Feb 17, 2025

@AntoineSoetewey,
The developer fixed the turkey vs. tukey issue:
laresbernardo/lares#58 (comment)
:-)

1 reply

AntoineSoetewey Feb 17, 2025
Maintainer

Thanks for letting me know!
