I am having some trouble with leading and trailing white space in a data.frame.
For example, I look at a specific row
in a data.frame
based on a certain condition:
> myDummy[myDummy$country == c("Austria"),c(1,2,3:7,19)]
[1] codeHelper country dummyLI dummyLMI dummyUMI
[6] dummyHInonOECD dummyHIOECD dummyOECD
<0 rows> (or 0-length row.names)
I was wondering why I didn't get the expected output since the country Austria obviously existed in my data.frame
. After looking through my code history and trying to figure out what went wrong I tried:
> myDummy[myDummy$country == c("Austria "),c(1,2,3:7,19)]
codeHelper country dummyLI dummyLMI dummyUMI dummyHInonOECD dummyHIOECD
18 AUT Austria 0 0 0 0 1
dummyOECD
18 1
All I have changed in the command is an additional white space after Austria.
Further annoying problems obviously arise. For example, when I like to merge two frames based on the country column. One data.frame
uses "Austria "
while the other frame has "Austria"
. The matching doesn't work.
Is there a nice way to 'show' the white space on my screen so that I am aware of the problem? And can I remove the leading and trailing white space in R?
So far I used to write a simple Perl script which removes the whites pace, but it would be nice if I can somehow do it inside R.
sub()
uses the Perl
notation as well. Sorry about that. I am going to try to use the function. But for my first question i don't have a solution yet.
As of R 3.2.0 a new function was introduced for removing leading/trailing white spaces:
trimws()
See: Remove Leading/Trailing Whitespace
Probably the best way is to handle the trailing white spaces when you read your data file. If you use read.csv
or read.table
you can set the parameterstrip.white=TRUE
.
If you want to clean strings afterwards you could use one of these functions:
# Returns string without leading white space
trim.leading <- function (x) sub("^\\s+", "", x)
# Returns string without trailing white space
trim.trailing <- function (x) sub("\\s+$", "", x)
# Returns string without leading or trailing white space
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
To use one of these functions on myDummy$country
:
myDummy$country <- trim(myDummy$country)
To 'show' the white space you could use:
paste(myDummy$country)
which will show you the strings surrounded by quotation marks (") making white spaces easier to spot.
str_trim
in the stringr
package.
To manipulate the white space, use str_trim() in the stringr package. The package has manual dated Feb 15, 2013 and is in CRAN. The function can also handle string vectors.
install.packages("stringr", dependencies=TRUE)
require(stringr)
example(str_trim)
d4$clean2<-str_trim(d4$V2)
(Credit goes to commenter: R. Cotton)
trimws()
was unable to remove.
A simple function to remove leading and trailing whitespace:
trim <- function( x ) {
gsub("(^[[:space:]]+|[[:space:]]+$)", "", x)
}
Usage:
> text = " foo bar baz 3 "
> trim(text)
[1] "foo bar baz 3"
Ad 1) To see white spaces you could directly call print.data.frame
with modified arguments:
print(head(iris), quote=TRUE)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 "5.1" "3.5" "1.4" "0.2" "setosa"
# 2 "4.9" "3.0" "1.4" "0.2" "setosa"
# 3 "4.7" "3.2" "1.3" "0.2" "setosa"
# 4 "4.6" "3.1" "1.5" "0.2" "setosa"
# 5 "5.0" "3.6" "1.4" "0.2" "setosa"
# 6 "5.4" "3.9" "1.7" "0.4" "setosa"
See also ?print.data.frame
for other options.
Use grep or grepl to find observations with white spaces and sub to get rid of them.
names<-c("Ganga Din\t", "Shyam Lal", "Bulbul ")
grep("[[:space:]]+$", names)
[1] 1 3
grepl("[[:space:]]+$", names)
[1] TRUE FALSE TRUE
sub("[[:space:]]+$", "", names)
[1] "Ganga Din" "Shyam Lal" "Bulbul"
"^\\s+|\\s+$"
gsub
instead of sub
with hadley's regexp. With sub
it will strip trailing whitespace only if there is no leading whitespace...
Removing leading and trailing blanks might be achieved through the trim() function from the gdata package as well:
require(gdata)
example(trim)
Usage example:
> trim(" Remove leading and trailing blanks ")
[1] "Remove leading and trailing blanks"
I'd prefer to add the answer as comment to user56's, but I am yet unable so writing as an independent answer.
Another option is to use the stri_trim
function from the stringi
package which defaults to removing leading and trailing whitespace:
> x <- c(" leading space","trailing space ")
> stri_trim(x)
[1] "leading space" "trailing space"
For only removing leading whitespace, use stri_trim_left
. For only removing trailing whitespace, use stri_trim_right
. When you want to remove other leading or trailing characters, you have to specify that with pattern =
.
See also ?stri_trim
for more info.
Another related problem occurs if you have multiple spaces in between inputs:
> a <- " a string with lots of starting, inter mediate and trailing whitespace "
You can then easily split this string into "real" tokens using a regular expression to the split
argument:
> strsplit(a, split=" +")
[[1]]
[1] "" "a" "string" "with" "lots"
[6] "of" "starting," "inter" "mediate" "and"
[11] "trailing" "whitespace"
Note that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.
I created a trim.strings ()
function to trim leading and/or trailing whitespace as:
# Arguments: x - character vector
# side - side(s) on which to remove whitespace
# default : "both"
# possible values: c("both", "leading", "trailing")
trim.strings <- function(x, side = "both") {
if (is.na(match(side, c("both", "leading", "trailing")))) {
side <- "both"
}
if (side == "leading") {
sub("^\\s+", "", x)
} else {
if (side == "trailing") {
sub("\\s+$", "", x)
} else gsub("^\\s+|\\s+$", "", x)
}
}
For illustration,
a <- c(" ABC123 456 ", " ABC123DEF ")
# returns string without leading and trailing whitespace
trim.strings(a)
# [1] "ABC123 456" "ABC123DEF"
# returns string without leading whitespace
trim.strings(a, side = "leading")
# [1] "ABC123 456 " "ABC123DEF "
# returns string without trailing whitespace
trim.strings(a, side = "trailing")
# [1] " ABC123 456" " ABC123DEF"
The best method is trimws().
The following code will apply this function to the entire dataframe.
mydataframe<- data.frame(lapply(mydataframe, trimws),stringsAsFactors = FALSE)
df[] <- lapply(df, trimws)
to be more compact. But it will in both cases coerce columns to character. df[sapply(df,is.character)] <- lapply(df[sapply(df,is.character)], trimws)
to be safe.
Use dplyr/tidyverse mutate_all
with str_trim
to trim the entire data frame:
myDummy %>%
mutate_all(str_trim)
library(tidyverse)
set.seed(335)
df <- mtcars %>%
rownames_to_column("car") %>%
mutate(car = ifelse(runif(nrow(mtcars)) > 0.4, car, paste0(car, " "))) %>%
select(car, mpg)
print(head(df), quote = T)
#> car mpg
#> 1 "Mazda RX4 " "21.0"
#> 2 "Mazda RX4 Wag" "21.0"
#> 3 "Datsun 710 " "22.8"
#> 4 "Hornet 4 Drive " "21.4"
#> 5 "Hornet Sportabout " "18.7"
#> 6 "Valiant " "18.1"
df_trim <- df %>%
mutate_all(str_trim)
print(head(df_trim), quote = T)
#> car mpg
#> 1 "Mazda RX4" "21"
#> 2 "Mazda RX4 Wag" "21"
#> 3 "Datsun 710" "22.8"
#> 4 "Hornet 4 Drive" "21.4"
#> 5 "Hornet Sportabout" "18.7"
#> 6 "Valiant" "18.1"
Created on 2021-05-07 by the reprex package (v0.3.0)
I tried trim(). It works well with white spaces as well as the '\n'.
x = '\n Harden, J.\n '
trim(x)
myDummy[myDummy$country == "Austria "] <- "Austria"
After this, you'll need to force R not to recognize "Austria "
as a level. Let's pretend you also have "USA"
and "Spain"
as levels:
myDummy$country = factor(myDummy$country, levels=c("Austria", "USA", "Spain"))
It is a little less intimidating than the highest voted response, but it should still work.
For whom it may be of interest, here a simple benchmarking. This is of course not capturing all weird cases, but so far we are still lacking the example where str_trim
removes whitespace and trimws
doesn't (see Richard Telford's comment to this answer). Doesn't seem to matter - the gsub option seems to be fastest :)
x <- c(" lead", "trail ", " both ", " both and middle ", " _special")
gsub_trim <- function (x) gsub("^\\s+|\\s+$", "", x)
res <- microbenchmark::microbenchmark(
gsub = gsub_trim(x),
trimws = trimws(x),
str_trim = stringr::str_trim(x),
times = 10^5
)
res
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> gsub 20.201 22.788 31.43943 24.654 28.4115 5303.741 1e+05 a
#> trimws 38.204 41.980 61.92218 44.420 51.1810 40363.860 1e+05 b
#> str_trim 88.672 92.347 116.59186 94.542 105.2800 13618.673 1e+05 c
ggplot2::autoplot(res)
https://i.imgur.com/vqKlsms.png
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#>
#> locale:
#> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> stringr_1.4.0
Success story sharing
\n
being in the covered character class.trimws("SELECT\n blah\n FROM foo;")
still contains newlines.trimws
regex is fast enough.stringr::str_trim
(based onstringi
) is also interesting in that it uses a completely independent internationalized string library. You'd think whitespace would be immune from problems with internationalization, but I wonder. I've never seen a comparison of results of native vsstringr
/stringi
or any benchmarks.