Working with a data frame similar to this:
set.seed(100)
df <- data.frame(cat = c(rep("aaa", 5), rep("bbb", 5), rep("ccc", 5)), val = runif(15))
df <- df[order(df$cat, df$val), ]
df
cat val
1 aaa 0.05638315
2 aaa 0.25767250
3 aaa 0.30776611
4 aaa 0.46854928
5 aaa 0.55232243
6 bbb 0.17026205
7 bbb 0.37032054
8 bbb 0.48377074
9 bbb 0.54655860
10 bbb 0.81240262
11 ccc 0.28035384
12 ccc 0.39848790
13 ccc 0.62499648
14 ccc 0.76255108
15 ccc 0.88216552
I am trying to add a column with numbering within each group. Doing it this way obviously isn't using the powers of R:
df$num <- 1
for (i in 2:(length(df[,1]))) {
  if (df[i,"cat"]==df[(i-1),"cat"]) {
    df[i,"num"]<-df[i-1,"num"]+1
  }
}
df
cat val num
1 aaa 0.05638315 1
2 aaa 0.25767250 2
3 aaa 0.30776611 3
4 aaa 0.46854928 4
5 aaa 0.55232243 5
6 bbb 0.17026205 1
7 bbb 0.37032054 2
8 bbb 0.48377074 3
9 bbb 0.54655860 4
10 bbb 0.81240262 5
11 ccc 0.28035384 1
12 ccc 0.39848790 2
13 ccc 0.62499648 3
14 ccc 0.76255108 4
15 ccc 0.88216552 5
What would be a good way to do this?
Use ave, ddply, dplyr or data.table:
df$num <- ave(df$val, df$cat, FUN = seq_along)
or:
library(plyr)
ddply(df, .(cat), mutate, id = seq_along(val))
or:
library(dplyr)
df %>% group_by(cat) %>% mutate(id = row_number())
or (the most memory efficient, as it assigns by reference within DT):
library(data.table)
DT <- data.table(df)
DT[, id := seq_len(.N), by = cat]
# or, equivalently, with rowid():
DT[, id := rowid(cat)]
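As a rough cross-check of the base R route, the sketch below rebuilds the example data (using rep(..., each = 5) as a shorthand; the names num_dbl and num_int are only illustrative). It suggests that ave() keeps the type of its first argument, so passing an integer vector instead of df$val should yield integer counters:

```r
set.seed(100)
df <- data.frame(cat = rep(c("aaa", "bbb", "ccc"), each = 5), val = runif(15))
df <- df[order(df$cat, df$val), ]

# ave() writes the per-group results back into a copy of its first argument,
# so a double input (val) gives double counters ...
num_dbl <- ave(df$val, df$cat, FUN = seq_along)

# ... while an integer input gives integer counters.
num_int <- ave(seq_len(nrow(df)), df$cat, FUN = seq_along)

typeof(num_dbl)
typeof(num_int)
```

Both vectors hold the same numbers; only the storage type differs, which can matter when the counter is later used in joins or compared with identical().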
To make this r-faq question more complete, here is a base R alternative with sequence and rle:
df$num <- sequence(rle(df$cat)$lengths)
which gives the intended result:
> df
   cat        val num
4  aaa 0.05638315   1
2  aaa 0.25767250   2
1  aaa 0.30776611   3
5  aaa 0.46854928   4
3  aaa 0.55232243   5
10 bbb 0.17026205   1
8  bbb 0.37032054   2
6  bbb 0.48377074   3
9  bbb 0.54655860   4
7  bbb 0.81240262   5
13 ccc 0.28035384   1
14 ccc 0.39848790   2
11 ccc 0.62499648   3
15 ccc 0.76255108   4
12 ccc 0.88216552   5
If df$cat is a factor variable, you need to wrap it in as.character first:
df$num <- sequence(rle(as.character(df$cat))$lengths)
Note that this approach requires the cat column to be sorted, since rle counts runs of consecutive equal values.
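How much the sequence/rle approach depends on row order can be seen in a small sketch. The vectors cat_sorted and cat_mixed are made up for illustration, and ave() is used only as a point of comparison:

```r
cat_sorted <- c("aaa", "aaa", "bbb", "bbb")
cat_mixed  <- c("aaa", "bbb", "aaa", "bbb")

# rle() counts runs of consecutive equal values, so contiguous groups work:
num_sorted <- sequence(rle(cat_sorted)$lengths)   # 1 2 1 2

# With interleaved groups every run has length 1 and the counter never advances:
num_mixed <- sequence(rle(cat_mixed)$lengths)     # 1 1 1 1

# ave() groups by value rather than by run, so row order does not matter:
num_ave <- ave(seq_along(cat_mixed), cat_mixed, FUN = seq_along)   # 1 1 2 2
```

If the data cannot be sorted first, a grouping-based approach such as ave() is probably the safer choice.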
Here is a small improvement that also sorts 'val' within each group:
# 1. Data set
set.seed(100)
df <- data.frame(
  cat = c(rep("aaa", 5), rep("ccc", 5), rep("bbb", 5)),
  val = runif(15))
# 2. 'dplyr' approach
library(dplyr)
df %>%
  arrange(cat, val) %>%
  group_by(cat) %>%
  mutate(id = row_number())
Another dplyr possibility could be:
df %>%
  group_by(cat) %>%
  mutate(num = 1:n())
cat val num
<fct> <dbl> <int>
1 aaa 0.0564 1
2 aaa 0.258 2
3 aaa 0.308 3
4 aaa 0.469 4
5 aaa 0.552 5
6 bbb 0.170 1
7 bbb 0.370 2
8 bbb 0.484 3
9 bbb 0.547 4
10 bbb 0.812 5
11 ccc 0.280 1
12 ccc 0.398 2
13 ccc 0.625 3
14 ccc 0.763 4
15 ccc 0.882 5
Instead of 1:n(), using seq_len(n()) is safer in the event that, somewhere in your sequence of operations, n() might return 0: 1:0 gives you a length-two vector while seq_len(0) gives a length-zero vector, thus avoiding a length mismatch error with mutate().
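The difference between the two idioms is easy to check in plain R. In this minimal sketch, n stands in for the value n() would return for an empty group:

```r
n <- 0L

colon_seq <- 1:n          # counts down: 1 0
safe_seq  <- seq_len(n)   # integer(0)

length(colon_seq)   # 2
length(safe_seq)    # 0
```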
I would like to add a data.table variant using the rank() function. It provides the additional possibility of changing the ordering, which makes it a bit more flexible than the seq_len() solution, and it is pretty similar to the row_number functions in an RDBMS.
# Variant with ascending ordering
library(data.table)
dt <- data.table(df)
dt[, .(val, num = rank(val)), by = list(cat)][order(cat, num), ]
cat val num
1: aaa 0.05638315 1
2: aaa 0.25767250 2
3: aaa 0.30776611 3
4: aaa 0.46854928 4
5: aaa 0.55232243 5
6: bbb 0.17026205 1
7: bbb 0.37032054 2
8: bbb 0.48377074 3
9: bbb 0.54655860 4
10: bbb 0.81240262 5
11: ccc 0.28035384 1
12: ccc 0.39848790 2
13: ccc 0.62499648 3
14: ccc 0.76255108 4
15: ccc 0.88216552 5
# Variant with descending ordering
# (desc() comes from dplyr, hence the explicit namespace)
dt[, .(val, num = rank(dplyr::desc(val))), by = list(cat)][order(cat, num), ]
Edit on 2021-04-16 to make the switch between descending and ascending order more fail-safe
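One caveat worth checking with rank(): its default ties.method is "average", so tied values receive fractional ranks. Below is a minimal base R sketch of getting row_number-like integers in either direction. The vector val is made up for illustration, and negating the values is just one assumed way to reverse the order:

```r
val <- c(0.3, 0.1, 0.3, 0.2)

# Default: ties share the average of their positions.
r_avg <- rank(val)                             # 3.5 1.0 3.5 2.0

# ties.method = "first" breaks ties by order of appearance,
# which mimics ROW_NUMBER() in SQL.
r_asc <- rank(val, ties.method = "first")      # 3 1 4 2

# Descending order: rank the negated values.
r_desc <- rank(-val, ties.method = "first")    # 1 4 2 3
```

The same ties.method argument can be passed inside the grouped data.table call if the values within a group are not guaranteed to be distinct.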
Here is an option using a for loop over groups rather than over rows (as the OP did):
for (i in unique(df$cat)) df$num[df$cat == i] <- seq_len(sum(df$cat == i))
Using the rowid() function in data.table:
> set.seed(100)
> df <- data.frame(cat = c(rep("aaa", 5), rep("bbb", 5), rep("ccc", 5)), val = runif(15))
> df <- df[order(df$cat, df$val), ]
> df$num <- data.table::rowid(df$cat)
> df
cat val num
4 aaa 0.05638315 1
2 aaa 0.25767250 2
1 aaa 0.30776611 3
5 aaa 0.46854928 4
3 aaa 0.55232243 5
10 bbb 0.17026205 1
8 bbb 0.37032054 2
6 bbb 0.48377074 3
9 bbb 0.54655860 4
7 bbb 0.81240262 5
13 ccc 0.28035384 1
14 ccc 0.39848790 2
11 ccc 0.62499648 3
15 ccc 0.76255108 4
12 ccc 0.88216552 5
Very simple, tidy solutions.
Row number for entire data.frame
library(tidyverse)
iris %>%
  mutate(row_num = seq_along(Sepal.Length)) %>%
  head
Sepal.Length Sepal.Width Petal.Length Petal.Width Species row_num
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3.0 1.4 0.2 setosa 2
3 4.7 3.2 1.3 0.2 setosa 3
.. ... ... ... ... ...... ...
148 6.5 3.0 5.2 2.0 virginica 148
149 6.2 3.4 5.4 2.3 virginica 149
150 5.9 3.0 5.1 1.8 virginica 150
Row number by group in data.frame
iris %>%
  group_by(Species) %>%
  mutate(num_in_group=seq_along(Species)) %>%
  as.data.frame
Sepal.Length Sepal.Width Petal.Length Petal.Width Species num_in_group
1 5.1 3.5 1.4 0.2 setosa 1
2 4.9 3.0 1.4 0.2 setosa 2
3 4.7 3.2 1.3 0.2 setosa 3
.. ... ... ... ... ...... ..
48 4.6 3.2 1.4 0.2 setosa 48
49 5.3 3.7 1.5 0.2 setosa 49
50 5.0 3.3 1.4 0.2 setosa 50
51 7.0 3.2 4.7 1.4 versicolor 1
52 6.4 3.2 4.5 1.5 versicolor 2
53 6.9 3.1 4.9 1.5 versicolor 3
.. ... ... ... ... ...... ..
98 6.2 2.9 4.3 1.3 versicolor 48
99 5.1 2.5 3.0 1.1 versicolor 49
100 5.7 2.8 4.1 1.3 versicolor 50
101 6.3 3.3 6.0 2.5 virginica 1
102 5.8 2.7 5.1 1.9 virginica 2
103 7.1 3.0 5.9 2.1 virginica 3
.. ... ... ... ... ...... ..
148 6.5 3.0 5.2 2.0 virginica 48
149 6.2 3.4 5.4 2.3 virginica 49
150 5.9 3.0 5.1 1.8 virginica 50
Another base R solution is to split the data frame by cat, then use lapply to add a column numbered from 1 to nrow(x) to each piece. The last step is to bind the pieces back into a single data frame with do.call, that is:
df_split <- split(df, df$cat)
df_lapply <- lapply(df_split, function(x) {
  x$num <- seq_len(nrow(x))
  return(x)
})
df <- do.call(rbind, df_lapply)
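A side effect of the do.call(rbind, ...) step is that the rows come back grouped by cat, typically with the group name prefixed to the row names. If the original row order matters, unsplit() may be a better fit. The sketch below uses a made-up interleaved data frame, and the names pieces, bound and restored are only illustrative:

```r
set.seed(100)
df <- data.frame(cat = rep(c("aaa", "bbb", "ccc"), times = 5), val = runif(15))

# Number the rows within each piece, as in the answer above.
pieces <- lapply(split(df, df$cat), function(x) {
  x$num <- seq_len(nrow(x))
  x
})

# rbind() stacks the pieces group by group ...
bound <- do.call(rbind, pieces)

# ... while unsplit() puts every row back in its original position.
restored <- unsplit(pieces, df$cat)

head(restored)
```

With unsplit() the counter reflects the order in which rows appear within each group in the original data, so sort beforehand if the numbering should follow val.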
ave gives a float instead of an int here. Alternatively, you could change df$val to seq_len(nrow(df)). I just ran into this over here: stackoverflow.com/questions/42796857/…
The data.table solution seems to be quicker than using frank: library(microbenchmark); microbenchmark(a = DT[, .(val, num = frank(val)), by = list(cat)], b = DT[, .(val, id = seq_len(.N)), by = list(cat)], times = 1000L)
The dplyr solution is good. But if, like me, you kept getting weird errors when trying this approach, make sure that you are not getting conflicts between plyr and dplyr, as explained in this post. It can be avoided by explicitly calling dplyr::mutate(...)
Another data.table method is setDT(df)[, id := rleid(val), by = .(cat)]
How can the library(plyr) and library(dplyr) answers be changed to rank the val column in descending order?