unique() for more than one variable

r unique

I have the following data frame in R:

> str(df)
'data.frame':   545227 obs. of  15 variables:
 $ ykod : int  93 93 93 93 93 93 93 93 93 93 ...
 $ yad  : Factor w/ 42 levels "BAKUGAN","BARBIE",..: 30 30 30 30 30 30 30 30 30 30 ...
 $ per  : Factor w/ 3 levels "2 AYLIK","3 AYLIK",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ donem: int  201101 201101 201101 201101 201101 201101 201101 201101 201101 201101 ...
 $ sayi : int  201101 201101 201101 201101 201101 201101 201101 201101 201101 201101 ...
 $ mkod : int  4 5 9 11 12 18 20 22 25 26 ...
 $ mad  : Factor w/ 10464 levels "   Defne Market          ",..: 405 8075 9710 10145 9297 7973 2542 3892 2759 5769 ...
 $ mtip : Factor w/ 29 levels "Abone Bürosu                                      ",..: 2 20 20 2 2 2 2 2 2 2 ...
 $ kanal: Factor w/ 2 levels "OB","SS": 2 2 2 2 2 2 2 2 2 2 ...
 $ bkod : int  110565 110565 110565 110565 110565 110565 110565 110565 110565 110565 ...
 $ bad  : Factor w/ 212 levels "4. Levent","500 Evler",..: 167 167 167 167 167 167 167 167 167 167 ...
 $ bolge: Factor w/ 12 levels "Adana Şehiriçi",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ sevk : int  2 3 3 3 2 2 2 6 2 2 ...
 $ iade : int  2 1 0 2 0 2 1 0 0 2 ...
 $ satis: int  0 2 3 1 2 0 1 6 2 0 ...

I want to list unique (like SQL's DISTINCT) values for selected multiple variables. For example, unique(yad) gives me the names of each 42 elements, but I need to extract two columns (yad and per together, with all unique combinations):

yad           per
---           ---
BARBIE        AYLIK
BAKUGAN       2 AYLIK
MICKEY MOUSE  2 AYLIK
TINKERBELL    3 AYLIK
...           ...

How can I achieve this?

Josh O'Brien

How about using unique() itself?

df <- data.frame(yad = c("BARBIE", "BARBIE", "BAKUGAN", "BAKUGAN"),
                 per = c("AYLIK",  "AYLIK",  "2 AYLIK", "2 AYLIK"),
                 hmm = 1:4)

df
#       yad     per hmm
# 1  BARBIE   AYLIK   1
# 2  BARBIE   AYLIK   2
# 3 BAKUGAN 2 AYLIK   3
# 4 BAKUGAN 2 AYLIK   4

unique(df[c("yad", "per")])
#       yad     per
# 1  BARBIE   AYLIK
# 3 BAKUGAN 2 AYLIK

+1 Would also recommend normalizing strings (tolower,gsub out special characters, etc).

How to do it if df is a matrix? Shall I transform it to data.frame, or is there a function to do it?

Actually I have found unique.matrix() that has done the work, thanks anyway

What if you want to keep all the other variables (to know which row you have select, or to use this row (maybe the first))? I.e. there is a base R equivalent for dplyr::distinct(.data, ..., .keep_all = TRUE)?

I don't know dplyr::distinct(), but if you want to keep the whole row conatining the first occurrence of a combination,have a look at duplicated(). Here, you might do: df[!duplicated(df[1:2]),].

rafa.pereira

This is an addition to Josh's answer.

You can also keep the values of other variables while filtering out duplicated rows in data.table

Example:

library(data.table)

#create data table
dt <- data.table(
  V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)],
  V3=c(1),
  V4=c(2) )

> dt
# V1 V2 V3 V4
# A  B  1  2
# A  C  1  2
# A  D  1  2
# A  B  1  2
# B  A  1  2
# C  D  1  2
# C  D  1  2
# E  F  1  2
# G  G  1  2
# A  B  1  2

# set the key to all columns
setkey(dt)

# Get Unique lines in the data table
unique( dt[list(V1, V2), nomatch = 0] ) 

# V1 V2 V3 V4
# A  B  1  2
# A  C  1  2
# A  D  1  2
# B  A  1  2
# C  D  1  2
# E  F  1  2
# G  G  1  2

Alert: If there are different combinations of values in the other variables, then your result will be

unique combination of V1 and V2

strange, the unique operation works but the result dt has all other columns set to NA. Do you know why?

Thank you for spotting that. This operation makes a merge and so it can generate some NA values. The solution would be to replace allow.cartesian=TRUE with nomatch = 0, what would ignore NA values in the results. I've updated the answer. Thanks

micahkimel

unique based on any columns and keep all other columns using dplyr.

df <- df %>%
distinct(col1, col2, .keep_all = TRUE)

Hong Ooi

There are a few ways to get all unique combinations of a set of factors.

with(df, interaction(yad, per, drop=TRUE))   # gives labels
with(df, yad:per)                            # ditto

aggregate(numeric(nrow(df)), df[c("yad", "per")], length)    # gives a data frame

stevec

This dplyr method works nicely when piping.

For selected columns:

library(dplyr)
iris %>% 
  select(Sepal.Width, Species) %>% 
  t %>% c %>% unique

 [1] "3.5"        "setosa"     "3.0"        "3.2"        "3.1"       
 [6] "3.6"        "3.9"        "3.4"        "2.9"        "3.7"       
[11] "4.0"        "4.4"        "3.8"        "3.3"        "4.1"       
[16] "4.2"        "2.3"        "versicolor" "2.8"        "2.4"       
[21] "2.7"        "2.0"        "2.2"        "2.5"        "2.6"       
[26] "virginica"

Or for the whole dataframe:

iris %>% t %>% c %>% unique 

 [1] "5.1"        "3.5"        "1.4"        "0.2"        "setosa"     "4.9"       
 [7] "3.0"        "4.7"        "3.2"        "1.3"        "4.6"        "3.1"       
[13] "1.5"        "5.0"        "3.6"        "5.4"        "3.9"        "1.7"       
[19] "0.4"        "3.4"        "0.3"        "4.4"        "2.9"        "0.1"       
[25] "3.7"        "4.8"        "1.6"        "4.3"        "1.1"        "5.8"       
[31] "4.0"        "1.2"        "5.7"        "3.8"        "1.0"        "3.3"       
[37] "0.5"        "1.9"        "5.2"        "4.1"        "5.5"        "4.2"       
[43] "4.5"        "2.3"        "0.6"        "5.3"        "7.0"        "versicolor"
[49] "6.4"        "6.9"        "6.5"        "2.8"        "6.3"        "2.4"       
[55] "6.6"        "2.7"        "2.0"        "5.9"        "6.0"        "2.2"       
[61] "6.1"        "5.6"        "6.7"        "6.2"        "2.5"        "1.8"       
[67] "6.8"        "2.6"        "virginica"  "7.1"        "2.1"        "7.6"       
[73] "7.3"        "7.2"        "7.7"        "7.4"        "7.9"

Bob

This is an old question with many solutions.

Yet, in case you need unique observations based on a selection of columns while also keeping all other columns in the dataframe, you can do it in a clean way using base R as follows:

df$dupe <- duplicated(df[c("X", "Y") ])
df<- subset(df, dupe == FALSE)

The alternative is using 'distinct' as proposed by @micahkimel

unique() for more than one variable

Follow WeChat

Want to stay one step ahead of the latest teleworks?

相似问题

Platform

Support

Contact US