
Create an empty data.frame

I'm trying to initialize a data.frame without any rows. Basically, I want to specify the data types for each column and name them, but not have any rows created as a result.

The best I've been able to do so far is something like:

df <- data.frame(Date=as.Date("01/01/2000", format="%m/%d/%Y"), 
                 File="", User="", stringsAsFactors=FALSE)
df <- df[-1,]

Which creates a data.frame with a single row containing all of the data types and column names I wanted, but also creates a useless row which then needs to be removed.

Is there a better way to do this?


digEmAll

Just initialize it with empty vectors:

df <- data.frame(Date=as.Date(character()),
                 File=character(), 
                 User=character(), 
                 stringsAsFactors=FALSE) 

Here's another example with different column types:

df <- data.frame(Doubles=double(),
                 Ints=integer(),
                 Factors=factor(),
                 Logicals=logical(),
                 Characters=character(),
                 stringsAsFactors=FALSE)

> str(df)
'data.frame':   0 obs. of  5 variables:
 $ Doubles   : num 
 $ Ints      : int 
 $ Factors   : Factor w/ 0 levels: 
 $ Logicals  : logi 
 $ Characters: chr 

N.B. :

Initializing a data.frame with an empty column of the wrong type does not prevent further additions of rows having columns of different types.
This method is just a bit safer in the sense that you'll have the correct column types from the beginning, hence if your code relies on some column type checking, it will work even with a data.frame with zero rows.
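
As a small illustration of the note above, here is a minimal sketch (the values such as `a.txt` and `alice` are invented for the example). Binding a one-row data.frame with matching column names fills the empty frame as expected, while a row whose `Date` is a plain string is accepted too, and the column ends up as character:

```r
# Empty, typed data.frame as in the answer above
df <- data.frame(Date = as.Date(character()),
                 File = character(),
                 User = character(),
                 stringsAsFactors = FALSE)

# Append a row by binding a one-row data.frame with matching column names
# (values are invented for illustration)
new_row <- data.frame(Date = as.Date("2000-01-01"),
                      File = "a.txt",
                      User = "alice",
                      stringsAsFactors = FALSE)
filled <- rbind(df, new_row)
str(filled)

# A row whose Date column is a plain string is accepted as well:
# the empty Date column does not enforce its type
odd_row <- data.frame(Date = "not a date", File = "b.txt", User = "bob",
                      stringsAsFactors = FALSE)
mixed <- rbind(df, odd_row)
class(mixed$Date)
```

This also shows one way to append rows without errors: bind a data.frame rather than a bare vector.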


Would it be the same if I initialize all fields with NULL?
@yosukesabai: no, if you initialize a column with NULL the column won't be added :)
@yosukesabai: data.frame's have typed columns, so yes, if you want to initialize a data.frame you must decide the type of the columns...
@user4050: the question was about creating an empty data.frame, i.e. one where the number of rows is zero... maybe you want to create a data.frame full of NAs... in that case you can use e.g. data.frame(Doubles=rep(as.double(NA),numberOfRow), Ints=rep(as.integer(NA),numberOfRow))
How do you append to such a data frame without triggering the "data has 0 rows" error?
MichaelChirico

If you already have an existing data frame, say df, that has the columns you want, then you can create an empty data frame by removing all the rows:

empty_df = df[FALSE,]

Notice that df still contains the data, but empty_df doesn't.

I found this question looking for how to create a new instance with empty rows, so I think it might be helpful for some people.
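
To see that the column types survive, here is a short sketch using the built-in `iris` dataset as a stand-in for `df` (that choice is an assumption made for the example, any data.frame would do):

```r
# `iris` ships with R (datasets package); it stands in for your own df
empty_iris <- iris[FALSE, ]   # iris[0, ] selects the same (empty) set of rows

nrow(empty_iris)              # no rows are kept
sapply(empty_iris, class)     # column classes match those of iris
levels(empty_iris$Species)    # factor levels are carried over too
```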


Wonderful idea. Keep none of the rows, but ALL the columns. Whoever downvoted missed something.
Nice solution, however I found that I get a data frame with 0 rows. In order to keep the size of the data frame the same, I suggest new_df = df[NA,]. This also allows to store any previous column into the new data frame. For example to obtain the "Date" column from original df (while keeping rest NA): new_df$Date <- df$Date.
@Katya, if you do df[NA,] this will affect the index as well (which is unlikely to be what you want), I would instead use df[TRUE,] = NA; however notice that this will overwrite the original. You will need to copy the dataframe first copy_df = data.frame(df) and then copy_df[TRUE,] = NA
@Katya, or you can also easily add empty rows to the empty_df with empty_df[0:nrow(df),] <- NA.
@Katya, you put backquotes (`) around what you would like to mark as code, and there is other formatting such as italics using * and bold using **. You probably want to read all the Markdown syntax of SO. Most of it only makes sense for answers, though.
MERose

You can do it without specifying column types

df = data.frame(matrix(vector(), 0, 3,
                       dimnames=list(c(), c("Date", "File", "User"))),
                stringsAsFactors=FALSE)

In that case, the column types default as logical per vector(), but are then overridden with the types of the elements added to df. Try str(df), df[1,1]<-'x'
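
The suggestion above can be spelled out as a small sketch. The helper variables `before` and `after` are introduced here only to record the column classes:

```r
df <- data.frame(matrix(vector(), 0, 3,
                        dimnames = list(c(), c("Date", "File", "User"))),
                 stringsAsFactors = FALSE)
before <- sapply(df, class)   # every column starts out as logical
before

df[1, 1] <- "x"               # assigning a string adds a first row ...
after <- sapply(df, class)    # ... and Date becomes character, not a Date
after
```

So the column names are fixed up front, but the types are whatever the first assignments make them.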
Rentrop

You could use read.table with an empty string for the input text as follows:

colClasses = c("Date", "character", "character")
col.names = c("Date", "File", "User")

df <- read.table(text = "",
                 colClasses = colClasses,
                 col.names = col.names)

Alternatively, specify the column names as a string:

df <- read.csv(text="Date,File,User", colClasses = colClasses)

Thanks to Richard Scriven for the improvement


Or even read.table(text = "", ...) so you don't need to explicitly open a connection.
Snazzy. Probably the most extensible/automatable way of doing this for many potential columns.
The read.csv approach also works with readr::read_csv, as in read_csv("Date,File,User\n", col_types = "Dcc"). This way you can directly create an empty tibble of the required structure.
dpel

Just declare

table = data.frame()

When you rbind the first row, it will create the columns.
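
A minimal sketch of this behaviour (the name `tbl` is used instead of `table` to avoid masking base R's `table()` function, and the row values are invented):

```r
tbl <- data.frame()   # zero rows, zero columns
dim(tbl)

# The first rbind supplies both the column names and the column types
tbl <- rbind(tbl, data.frame(id = 1L, name = "first",
                             stringsAsFactors = FALSE))
tbl <- rbind(tbl, data.frame(id = 2L, name = "second",
                             stringsAsFactors = FALSE))
str(tbl)
```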


Doesn't really meet the OP's requirements of "I want to specify the data types for each column and name them". If the next step is an rbind this would work well, if not...
Anyway, thanks for this simple solution. I wanted also to initialize a data.frame with specific columns since I thought rbind can only be used if the columns corresponds between the two data.frame. This seems not to be the case. I was surprised that I can so simply initialize a data.frame when using rbind. Thanks.
The best proposed solution here. For me, using the proposed way, worked perfectly with rbind().
Thomas

The most efficient way to do this is to use structure to create a list that has the class "data.frame":

structure(list(Date = as.Date(character()), File = character(), User = character()), 
          class = "data.frame")
# [1] Date File User
# <0 rows> (or 0-length row.names)

To put this into perspective compared to the presently accepted answer, here's a simple benchmark:

s <- function() structure(list(Date = as.Date(character()), 
                               File = character(), 
                               User = character()), 
                          class = "data.frame")
d <- function() data.frame(Date = as.Date(character()),
                           File = character(), 
                           User = character(), 
                           stringsAsFactors = FALSE) 
library("microbenchmark")
microbenchmark(s(), d())
# Unit: microseconds
#  expr     min       lq     mean   median      uq      max neval
#   s()  58.503  66.5860  90.7682  82.1735 101.803  469.560   100
#   d() 370.644 382.5755 523.3397 420.1025 604.654 1565.711   100
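
Since structure() performs none of the checks that data.frame() does, it can be worth confirming that the result looks as intended. The sketch below builds both versions and compares them. Passing `row.names = integer(0)` explicitly is an addition made here for the example, not part of the answer above:

```r
s <- structure(list(Date = as.Date(character()),
                    File = character(),
                    User = character()),
               class = "data.frame",
               row.names = integer(0))   # added so the attribute is explicit
d <- data.frame(Date = as.Date(character()),
                File = character(),
                User = character(),
                stringsAsFactors = FALSE)

dim(s)
identical(sapply(s, class), sapply(d, class))   # same column classes
```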

data.table usually contains a .internal.selfref attribute, which cannot be faked without calling the data.table functions. Are you sure you are not relying on undocumented behavior here?
@AdamRyczkowski I think you're confusing the base "data.frame" class and the add-on "data.table" class from the data.table package.
Yes. Definitely. My bad. Ignore my last comment. I came across this thread when searching for the data.table and assumed that Google did find what I wanted and everything here is data.table-related.
@PatrickT There's no checking that what your code is doing makes any sense. data.frame() provides checks on naming, rownames, etc.
Community

If you are looking for brevity:

read.csv(text="col1,col2")

so you don't need to specify the column names separately. You get the default column type logical until you fill the data frame.
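
A quick check of that default, as a minimal sketch:

```r
df <- read.csv(text = "col1,col2")

dim(df)             # zero rows, two columns
sapply(df, class)   # both columns are logical until data is added
```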


read.csv parses the text argument so you get the column names. It is more compact than read.table(text="", col.names = c("col1", "col2"))
I get : Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 0, 2
This doesn't meet OP's requirements, "I want to specify the data types for each column", though it could probably be modified to do so.
Very late to the party, but readr can do it: read_csv2("a;b;c;d;e\n", col_types = "icdDT"). The \n is needed so that the input is recognized as a string rather than a file (or use c("a;b;c;d;e", "")). As a bonus, column names won't be modified (e.g. col-1 or names with spaces).
Shrikant Prabhu

I created an empty data frame using the following code:

df = data.frame(id = numeric(0), jobs = numeric(0))

and tried to bind a row to populate it, as follows:

newrow = c(3, 4)
df <- rbind(df, newrow)

but this gave incorrect column names:

  X3 X4
1  3  4

The solution is to make newrow a data.frame as well:

newrow = data.frame(id=3, jobs=4)
df <- rbind(df, newrow)

This now gives the correct data frame, with the expected column names:

  id jobs
1  3    4

DSides

To create an empty data frame, pass in the number of rows and columns needed into the following function:

create_empty_table <- function(num_rows, num_cols) {
    frame <- data.frame(matrix(NA, nrow = num_rows, ncol = num_cols))
    return(frame)
}

To create an empty frame while specifying the class of each column, simply pass a vector of the desired data types into the following function:

create_empty_table <- function(num_rows, num_cols, type_vec) {
  # Start from an all-NA (logical) frame of the requested size
  frame <- data.frame(matrix(NA, nrow = num_rows, ncol = num_cols))
  # Coerce each column to the class named in type_vec
  for (i in seq_len(ncol(frame))) {
    if (type_vec[i] == 'numeric') {frame[, i] <- as.numeric(frame[, i])}
    if (type_vec[i] == 'character') {frame[, i] <- as.character(frame[, i])}
    if (type_vec[i] == 'logical') {frame[, i] <- as.logical(frame[, i])}
    if (type_vec[i] == 'factor') {frame[, i] <- as.factor(frame[, i])}
  }
  return(frame)
}

Use as follows:

df <- create_empty_table(3, 3, c('character','logical','numeric'))

Which gives:

   X1  X2 X3
1 <NA> NA NA
2 <NA> NA NA
3 <NA> NA NA

To confirm your choices, run the following:

lapply(df, class)

#output
$X1
[1] "character"

$X2
[1] "logical"

$X3
[1] "numeric"

This doesn't meet OP's requirements, "I want to specify the data types for each column"
Gregor Thomas

If you want to create an empty data.frame with dynamic names (colnames in a variable), this can help:

names <- c("v","u","w")
df <- data.frame()
for (k in names) df[[k]]<-as.numeric()

You can change the types as well if you need to, like so:

names <- c("u", "v")
df <- data.frame()
df[[names[1]]] <- as.numeric()
df[[names[2]]] <- as.character()

Anonymous

If you don't mind not specifying data types explicitly, you can do it this way:

headers<-c("Date","File","User")
df <- as.data.frame(matrix(,ncol=3,nrow=0))
names(df)<-headers

#then bind incoming data frame with col types to set data types
df<-rbind(df, new_df)

Rushabh Patel

Using data.table, we can specify the data type of each column:

library(data.table)
data <- data.table(a=numeric(), b=numeric(), c=numeric())

MichaelChirico

If you want to declare such a data.frame with many columns, it'll probably be a pain to type all the column classes out by hand. Especially if you can make use of rep, this approach is easy and fast (about 15% faster than the other solution that can be generalized like this):

If your desired column classes are in a vector colClasses, you can do the following:

library(data.table)
setnames(setDF(lapply(colClasses, function(x) eval(call(x)))), col.names)

lapply will result in a list of desired length, each element of which is simply an empty typed vector like numeric() or integer().

setDF converts this list by reference to a data.frame.

setnames adds the desired names by reference.
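
The `eval(call(x))` step may be easier to follow in isolation. The sketch below uses only base R: `as.data.frame()` and `setNames()` stand in for `setDF()` and `setnames()`, so it does not modify anything by reference and is not the method benchmarked here. The class and column names are made up for the example:

```r
colClasses <- c("numeric", "character", "factor")   # example classes
col.names  <- c("value", "label", "group")          # hypothetical names

# call("numeric") builds the unevaluated call numeric(); eval() runs it,
# giving a zero-length vector of that type
cols <- lapply(colClasses, function(x) eval(call(x)))
str(cols)

# Base-R stand-in for setDF()/setnames(): name the list, then convert it
df <- as.data.frame(setNames(cols, col.names), stringsAsFactors = FALSE)
sapply(df, class)
```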

Speed comparison:

classes <- c("character", "numeric", "factor",
             "integer", "logical","raw", "complex")

NN <- 300
colClasses <- sample(classes, NN, replace = TRUE)
col.names <- paste0("V", 1:NN)

setDF(lapply(colClasses, function(x) eval(call(x))))

library(microbenchmark)
microbenchmark(times = 1000,
               read = read.table(text = "", colClasses = colClasses,
                                 col.names = col.names),
               DT = setnames(setDF(lapply(colClasses, function(x)
                 eval(call(x)))), col.names))
# Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval cld
#  read 2.598226 2.707445 3.247340 2.747835 2.800134 22.46545  1000   b
#    DT 2.257448 2.357754 2.895453 2.401408 2.453778 17.20883  1000  a 

It's also faster than using structure in a similar way:

microbenchmark(times = 1000,
               DT = setnames(setDF(lapply(colClasses, function(x)
                 eval(call(x)))), col.names),
               struct = eval(parse(text=paste0(
                 "structure(list(", 
                 paste(paste0(col.names, "=", 
                              colClasses, "()"), collapse = ","),
                 "), class = \"data.frame\")"))))
#Unit: milliseconds
#   expr      min       lq     mean   median       uq       max neval cld
#     DT 2.068121 2.167180 2.821868 2.211214 2.268569 143.70901  1000  a 
# struct 2.613944 2.723053 3.177748 2.767746 2.831422  21.44862  1000   b

toto_tico

If you already have a data frame, you can extract its metadata (column names and types), e.g. if you are tracking down a bug that is only triggered by certain inputs and need an empty dummy data frame:

columns_and_types <- sapply(df, class)

# prints the column names, e.g. c("col1", "col2")
dput(as.character(names(columns_and_types)))

# prints the column types, e.g. c("integer", "factor")
dput(as.character(as.vector(columns_and_types)))

And then use the read.table to create the empty dataframe

read.table(text = "",
   colClasses = c('integer', 'factor'),
   col.names = c('col1', 'col2'))
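
The two steps can also be chained without copying the dput output by hand. This is a sketch that uses the built-in `iris` dataset as a stand-in for your data frame and assumes every column has a single class, so that `sapply()` returns a character vector:

```r
src <- iris                                # stand-in for your existing data frame
columns_and_types <- sapply(src, class)    # named character vector

empty <- read.table(text = "",
                    colClasses = unname(columns_and_types),
                    col.names = names(columns_and_types))
str(empty)
```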

stevec

I keep this function handy for whenever I need it, and change the column names and classes to suit the use case:

make_df <- function() {
  data.frame(name = character(),
             profile = character(),
             sector = character(),
             type = character(),
             year_range = character(),
             link = character(),
             stringsAsFactors = FALSE)
}

make_df()
[1] name       profile    sector     type       year_range link      
<0 rows> (or 0-length row.names)

jpmarindiaz

Say your column names are dynamic, you can create an empty row-named matrix and transform it to a data frame.

nms <- sample(LETTERS, sample(1:10, 1))
as.data.frame(t(matrix(nrow=length(nms),ncol=0,dimnames=list(nms))))

This doesn't meet OP's requirements, "I want to specify the data types for each column"
d8aninja

This question didn't specifically address my concerns (outlined here) but in case anyone wants to do this with a parameterized number of columns and no coercion:

> require(dplyr)
> dbNames <- c('a','b','c','d')
> emptyTableOut <- 
    data.frame(
        character(), 
        matrix(integer(), ncol = 3, nrow = 0), stringsAsFactors = FALSE
    ) %>% 
    setNames(nm = c(dbNames))
> glimpse(emptyTableOut)
Observations: 0
Variables: 4
$ a <chr> 
$ b <int> 
$ c <int> 
$ d <int>

As divibisan states on the linked question,

...the reason [coercion] occurs [when cbinding matrices and their constituent types] is that a matrix can only have a single data type. When you cbind 2 matrices, the result is still a matrix and so the variables are all coerced into a single type before converting to a data.frame
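
The coercion described in the quote can be reproduced with a couple of lines. This is a minimal sketch with invented values, using non-empty inputs so the effect is visible:

```r
ids  <- c("a", "b")
nums <- matrix(1:4, ncol = 2)

m <- cbind(ids, nums)   # a matrix holds one type, so the integers become strings
typeof(m)

# Passing the pieces to data.frame() directly lets each column keep its own type
df <- data.frame(ids, nums, stringsAsFactors = FALSE)
sapply(df, class)
```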