
Call apply-like function on each row of dataframe with multiple arguments from each row

I have a dataframe with multiple columns. For each row in the dataframe, I want to call a function on the row, and the input of the function is using multiple columns from that row. For example, let's say I have this data and this testFunc which accepts two args:

> df <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
> df
  x y z
1 1 3 5
2 2 4 6
> testFunc <- function(a, b) a + b

Let's say I want to apply this testFunc to columns x and z. So, for row 1 I want 1+5, and for row 2 I want 2 + 6. Is there a way to do this without writing a for loop, maybe with the apply function family?

I tried this:

> df[,c('x','z')]
  x z
1 1 5
2 2 6
> lapply(df[,c('x','z')], testFunc)
Error in a + b : 'b' is missing

But I got an error. Any ideas?

EDIT: the actual function I want to call is not a simple sum, but it is power.t.test. I used a+b just for example purposes. The end goal is to be able to do something like this (written in pseudocode):

df = data.frame(
    delta=c(delta_values), 
    power=c(power_values), 
    sig.level=c(sig.level_values)
)

lapply(df, power.t.test(delta_from_each_row_of_df, 
                        power_from_each_row_of_df, 
                        sig.level_from_each_row_of_df
))

where the result is a vector of outputs for power.t.test for each row of df.

See also stackoverflow.com/a/24728107/946850 for the dplyr way.

Matt Tenenbaum

You can apply apply to a subset of the original data.

 dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
 apply(dat[,c('x','z')], 1, function(x) sum(x) )

or if your function is just sum use the vectorized version:

rowSums(dat[,c('x','z')])
[1] 6 8

If you want to use testFunc

 testFunc <- function(a, b) a + b
 apply(dat[,c('x','z')], 1, function(x) testFunc(x[1],x[2]))

EDIT To access columns by name and not index you can do something like this:

 testFunc <- function(a, b) a + b
 apply(dat[,c('x','z')], 1, function(y) testFunc(y['z'],y['x']))

thanks @agstudy, that worked! do you know if there is any way to specify the args by name instead of by index? so, for testFunc, something like apply(dat[,c('x','z')], 1, [pseudocode] testFunc(a=x, b=y))? the reason is that I am calling power.t.test in this manner, and I would love to be able to reference the delta, power, sig.level params by name instead of sticking them into an array with pre-specified positions and then referencing those position, for the reason of being more robust. in any case thanks so much!
sorry about previous comment, hit enter before finished typing :) deleted it and posted full version.
Don't use apply on big data.frames: it will copy the entire object (to convert it to a matrix). This will also cause problems if you have columns of different classes within the data.frame.
user2087984

A data.frame is a list, so ...

For vectorized functions do.call is usually a good bet. But the names of arguments come into play. Here your testFunc is called with args x and z in place of a and b. The ... allows irrelevant args to be passed without causing an error:

do.call( function(x,z,...) testFunc(x,z), df )

For non-vectorized functions, mapply will work, but you need to match the ordering of the args or explicitly name them:

mapply(testFunc, df$x, df$z)

Sometimes apply will work - as when all args are of the same type so coercing the data.frame to a matrix does not cause problems by changing data types. Your example was of this sort.

If your function is to be called within another function into which the arguments are all passed, there is a much slicker method than these. Study the first lines of the body of lm() if you want to go that route.
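
As a minimal sketch of the vectorized versus non-vectorized distinction above: the function bigger below is hypothetical (not from the question) and is deliberately non-vectorized, because if() only accepts a single condition. Wrapping it with Vectorize(), which is built on mapply, lets it take whole columns, after which the do.call pattern applies as well.

```r
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))

# Hypothetical non-vectorized function: if() needs a single TRUE/FALSE.
bigger <- function(a, b) if (a > b) a else b

# Vectorize() returns a wrapper that calls mapply() over a and b.
bigger_v <- Vectorize(bigger)
res_cols <- bigger_v(df$x, df$z)

# The data frame is passed as a named argument list: x and z are matched
# by name, and the unused column y is absorbed by `...`.
res_docall <- do.call(function(x, z, ...) bigger_v(x, z), df)
```

Both calls should give one value per row; the do.call form only works because the wrapper's argument names match the column names.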


+10 if I could. Welcome to SO. great answer - it might be worth mentioning Vectorize as a wrapper to mapply to vectorize functions
wow, that is slick. The original function I used was not vectorized (a custom extension on top of power.t.test), but I think I will vectorize it and use do.call(...). Thanks!
Just reiterating the note that this answer already says that apply(df, 1, function(row) ...) can be bad because apply converts the df into a matrix!!!! This can be bad and result in lots of hair pulling. The alternatives to apply are much needed!
Thank you so much for differentiating between Vectorized/non-vectorized, this is absolutely the answer I was looking for
CHP

Use mapply

> df <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
> df
  x y z
1 1 3 5
2 2 4 6
> mapply(function(x,y) x+y, df$x, df$z)
[1] 6 8

> cbind(df,f = mapply(function(x,y) x+y, df$x, df$z) )
  x y z f
1 1 3 5 6
2 2 4 6 8

Community

New answer with dplyr package

If the function that you want to apply is vectorized, then you could use the mutate function from the dplyr package:

> library(dplyr)
> myf <- function(tens, ones) { 10 * tens + ones }
> x <- data.frame(hundreds = 7:9, tens = 1:3, ones = 4:6)
> mutate(x, value = myf(tens, ones))
  hundreds tens ones value
1        7    1    4    14
2        8    2    5    25
3        9    3    6    36

Old answer with plyr package

In my humble opinion, the tool best suited to the task is mdply from the plyr package.

Example:

> library(plyr)
> x <- data.frame(tens = 1:3, ones = 4:6)
> mdply(x, function(tens, ones) { 10 * tens + ones })
  tens ones V1
1    1    4 14
2    2    5 25
3    3    6 36

Unfortunately, as Bertjan Broeksema pointed out, this approach fails if you don't use all the columns of the data frame in the mdply call. For example,

> library(plyr)
> x <- data.frame(hundreds = 7:9, tens = 1:3, ones = 4:6)
> mdply(x, function(tens, ones) { 10 * tens + ones })
Error in (function (tens, ones)  : unused argument (hundreds = 7)

It's nice when you only have a small number of columns. I tried to do something like: mdply(df, function(col1, col3) {}) and mdply bails out, complaining col2 is unused. Now, if you have tens or even hundreds of columns, this approach is not very attractive.
@BertjanBroeksema to modify a lot of columns, you can use dplyr::mutate_each. For example: iris %>% mutate_each(funs(half = . / 2),-Species).
Couldn't you just pass an ellipsis (...), or the hundreds argument, into the function and simply not use it? That should fix that error.
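
The ellipsis idea from the comment above can be checked in base R without plyr. This is a sketch of the analogous pattern using do.call and mapply, not a test of mdply itself; the function name f is an assumption made for illustration.

```r
x <- data.frame(hundreds = 7:9, tens = 1:3, ones = 4:6)

# `...` swallows any column that is not named explicitly (here: hundreds).
f <- function(tens, ones, ...) 10 * tens + ones

# Each column becomes a named argument to mapply(), which then calls f
# once per row with hundreds =, tens = and ones = filled in.
value <- do.call(mapply, c(list(FUN = f), x))
```

Without the `...` in f, the same call stops with an "unused argument" error, which mirrors the mdply failure shown in the answer.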
rsoren

Others have correctly pointed out that mapply is made for this purpose, but (for the sake of completeness) a conceptually simpler method is just to use a for loop.

for (row in seq_len(nrow(df))) { 
    df$newvar[row] <- testFunc(df$x[row], df$z[row]) 
}

You're right. To use mapply effectively, I think you have to understand that it's just a "for" loop behind the scenes, especially if you come from a procedural programming background such as C++ or C#.
Ricardo Saporta

Many functions are already vectorized, so there is no need for any iteration (neither for loops nor *apply functions). Your testFunc is one such example. You can simply call:

  testFunc(df[, "x"], df[, "z"])

In general, I would recommend trying such vectorization approaches first and see if they get you your intended results.

Alternatively, if you need to pass multiple arguments to a function which is not vectorized, mapply might be what you are looking for:

  mapply(power.t.test, df[, "x"], df[, "z"])

oh, sweet. Do you know if there is a way to specify arguments by name in mapply? i.e. something like [pseudocode] mapply(power.t.test, delta=df[,'delta'], power=df[,'power'], ...) ?
Yep, it is exactly as you have it! ;)
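
A minimal sketch of the named-argument form confirmed above. The parameter grid params and its values are hypothetical, chosen only for illustration; SIMPLIFY = FALSE is used so that each power.t.test result stays an intact object in a list.

```r
# Hypothetical parameter grid, one row per scenario.
params <- data.frame(
  delta     = c(1, 1, 2),
  power     = c(0.90, 0.85, 0.75),
  sig.level = c(0.05, 0.05, 0.01)
)

# Named arguments are forwarded by name to power.t.test for each row;
# n is left at its default of NULL, so it is the quantity solved for.
results <- mapply(
  power.t.test,
  delta     = params$delta,
  power     = params$power,
  sig.level = params$sig.level,
  SIMPLIFY  = FALSE
)

# Pull the per-group sample size out of each result.
n_per_group <- sapply(results, function(r) r$n)
```

Because the arguments are matched by name, the order of the columns in params does not matter.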
Uwe

Here is an alternate approach. It is more intuitive.

One key aspect that I feel some of the answers did not take into account, and which I point out for posterity, is that apply() lets you do row calculations easily, but only for matrix-like data in which every column has the same type (for example, all numeric).

Operations on columns are still possible for data frames:

as.data.frame(lapply(df, myFunctionForColumn))

To operate on rows, we make the transpose first.

tdf <- as.data.frame(t(df))
as.data.frame(lapply(tdf, myFunctionForRow))

The downside is that I believe R will make a copy of your data table, which could be a memory issue. (This is truly sad, because it would be programmatically simple for tdf to just be an iterator over the original df, thus saving memory, but R does not allow pointer or iterator referencing.)

Also, a related question is how to operate on each individual cell in a data frame.

newdf <- as.data.frame(lapply(df, function(x) sapply(x, myFunctionForEachCell)))

Another downside is that the column name will be lost.
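
If the transpose is a concern, one alternative is to index the rows directly, which keeps each column's type. This is a sketch of a different technique, not the approach described above; the id column and describe_row are hypothetical names used only for illustration.

```r
df <- data.frame(id = c("a", "b"), x = c(1, 2), z = c(5, 6),
                 stringsAsFactors = FALSE)

# Hypothetical row function: receives a one-row data frame, so id is
# still character and x, z are still numeric.
describe_row <- function(row) paste0(row$id, ":", row$x + row$z)

# vapply() checks that every call returns exactly one string.
out <- vapply(seq_len(nrow(df)),
              function(i) describe_row(df[i, ]),
              character(1))
```

Subsetting one row at a time is not fast on large data frames, so this mainly helps when mixed column types rule out apply().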
Pete M

data.table has a really intuitive way of doing this as well:

library(data.table)

sample_fxn = function(x,y,z){
    return((x+y)*z)
}

df = data.table(A = 1:5,B=seq(2,10,2),C = 6:10)
> df
   A  B  C
1: 1  2  6
2: 2  4  7
3: 3  6  8
4: 4  8  9
5: 5 10 10

The := operator can be called within brackets to add a new column using a function

df[,new_column := sample_fxn(A,B,C)]
> df
   A  B  C new_column
1: 1  2  6         18
2: 2  4  7         42
3: 3  6  8         72
4: 4  8  9        108
5: 5 10 10        150

It's also easy to accept constants as arguments as well using this method:

df[,new_column2 := sample_fxn(A,B,2)]

> df
   A  B  C new_column new_column2
1: 1  2  6         18           6
2: 2  4  7         42          12
3: 3  6  8         72          18
4: 4  8  9        108          24
5: 5 10 10        150          30

It should be noted that sample_fxn must be vectorized for this approach to work. If it is not, then you can use base::Vectorize like this: df[,new_column := Vectorize(sample_fxn)(A,B,C)]
thelatemail

@user20877984's answer is excellent. Since they summed it up far better than my previous answer, here is my (possibly still shoddy) attempt at an application of the concept:

Using do.call in a basic fashion:

powvalues <- list(power=0.9,delta=2)
do.call(power.t.test,powvalues)

Working on a full data set:

# get the example data
df <- data.frame(delta=c(1,1,2,2), power=c(.90,.85,.75,.45))

#> df
#  delta power
#1     1  0.90
#2     1  0.85
#3     2  0.75
#4     2  0.45

lapply the power.t.test function to each of the rows of specified values:

result <- lapply(
  split(df,1:nrow(df)),
  function(x) do.call(power.t.test,x)
)

> str(result)
List of 4
 $ 1:List of 8
  ..$ n          : num 22
  ..$ delta      : num 1
  ..$ sd         : num 1
  ..$ sig.level  : num 0.05
  ..$ power      : num 0.9
  ..$ alternative: chr "two.sided"
  ..$ note       : chr "n is number in *each* group"
  ..$ method     : chr "Two-sample t test power calculation"
  ..- attr(*, "class")= chr "power.htest"
 $ 2:List of 8
  ..$ n          : num 19
  ..$ delta      : num 1
  ..$ sd         : num 1
  ..$ sig.level  : num 0.05
  ..$ power      : num 0.85
... ...

Haha convoluted perhaps? ;) why are you using t() and applying over 2, why not just apply over 1?
liborm

I came here looking for the tidyverse function name, which I knew existed. Adding this for (my) future reference and for tidyverse enthusiasts: purrrlyr::invoke_rows (purrr::invoke_rows in older versions).

With connection to standard stats methods as in the original question, the broom package would probably help.


John Mark

If the columns of a data.frame have different types, apply() has a problem. A subtlety of row iteration is that apply(a.data.frame, 1, ...) implicitly converts every value to character when the columns have different types, e.g. a factor column and a numeric column. Here's an example, using a factor in one column to modify a numeric column:

mean.height = list(BOY=69.5, GIRL=64.0)

subjects = data.frame(gender = factor(c("BOY", "GIRL", "GIRL", "BOY"))
         , height = c(71.0, 59.3, 62.1, 62.1))

apply(subjects, 1, function(x) x[2] - mean.height[[x[1]]])

The subtraction fails because the columns are converted to character types.
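
A quick way to see the coercion for yourself, using the same subjects data: apply() converts a data frame with as.matrix() first, and a matrix can hold only one type. The variable names m and row_classes are just for this sketch.

```r
subjects <- data.frame(gender = factor(c("BOY", "GIRL", "GIRL", "BOY")),
                       height = c(71.0, 59.3, 62.1, 62.1))

# The conversion apply() performs on a data frame before iterating.
m <- as.matrix(subjects)
typeof(m)   # "character"

# Every row handed to the function is a character vector,
# so the height arrives as a string such as "71.0".
row_classes <- apply(subjects, 1, function(x) class(x))
```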

One fix is to back-convert the second column to a number:

apply(subjects, 1, function(x) as.numeric(x[2]) - mean.height[[x[1]]])

But the conversions can be avoided by keeping the columns separate and using mapply():

mapply(function(x,y) y - mean.height[[x]], subjects$gender, subjects$height)

mapply() is needed because [[ ]] does not accept a vector argument. So the column iteration could be done before the subtraction by passing a vector to [], by a bit more ugly code:

subjects$height - unlist(mean.height[subjects$gender])

Zach S.

A really nice function for this is adply from plyr, especially if you want to append the result to the original dataframe. This function and its cousin ddply have saved me a lot of headaches and lines of code!

df_appended <- adply(df, 1, mutate, sum=x+z)

Alternatively, you can call the function you desire.

df_appended <- adply(df, 1, mutate, sum=testFunc(x,z))

Can adply() deal with functions that return lists or data frames? E.g., what if testFunc() returns a list? Would unnest() be used to mutate it into additional columns of your df_appended?