
Forcing garbage collection to run in R with the gc() command

Periodically I program sloppily. OK, I program sloppily all the time, but sometimes that catches up with me in the form of out-of-memory errors. When I start exercising a little discipline, deleting objects with the rm() command, things get better. I see mixed messages online about whether I should explicitly call gc() after deleting large data objects: some say that R will run gc() itself before it returns a memory error, while others say that manually forcing garbage collection is a good idea.

Should I run gc() after deleting large objects in order to ensure maximum memory availability?


Community

"Probably." I do it too, and often even in a loop as in

# run the collector several times in a row; later passes can sometimes
# reclaim what earlier passes freed up
cleanMem <- function(n = 10) { for (i in 1:n) gc() }

Yet that does not, in my experience, restore memory to a pristine state.

So what I usually do is to keep the tasks at hand in script files and execute those using the 'r' frontend (on Unix, and from the 'littler' package). Rscript is an alternative on that other OS.

That workflow happens to agree with the questions workflow-for-statistical-analysis-and-report-writing and tricks-to-manage-the-available-memory-in-an-r-session, which we covered here before.
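The point of that workflow is that each task runs in a fresh R process, so all memory is returned to the operating system when the process exits, with no gc() bookkeeping needed. A minimal sketch of such a task script (the file and column names here are invented, and the script assumes the non-grouping columns are numeric):

#!/usr/bin/env Rscript
# analysis.R -- one self-contained task; everything allocated here dies
# with the process, so memory goes back to the OS when the script ends
dat <- read.csv("big_file.csv")                     # hypothetical input
res <- aggregate(. ~ group, data = dat, FUN = mean) # summarise per group
write.csv(res, "summary.csv", row.names = FALSE)

You would then run it from a shell as r analysis.R (with littler) or Rscript analysis.R.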


@DirkEddelbuettel - Why does it help to run gc() repeatedly?
Is there any platform supported by R that does not ship by default with Rscript? I'd think it's an alternative on any OS, not just "that" one.
Rscript is everywhere; littler and its r are not. That's the context; I prefer the latter (but the former just fixed an important boo-boo).
Richie Cotton

From the help page on gc:

A call of 'gc' causes a garbage collection to take place. This will also take place automatically without user intervention, and the primary purpose of calling 'gc' is for the report on memory usage. However, it can be useful to call 'gc' after a large object has been removed, as this may prompt R to return memory to the operating system.

So it can be useful to do, but mostly you shouldn't have to. My personal opinion is that it is code of last resort - you shouldn't be littering your code with gc() statements as a matter of course, but if your machine keeps falling over, and you've tried everything else, then it might be helpful.

By everything else, I mean things like the following (sketched in code after the list):

- Writing functions rather than raw scripts, so variables go out of scope.
- Emptying your workspace if you go from one problem to another unrelated one.
- Discarding data/variables that you aren't interested in. (I frequently receive spreadsheets with dozens of uninteresting columns.)
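A rough sketch of those habits, with invented file, object, and column names:

# 1. Wrap work in a function so temporaries go out of scope when it returns.
summarise_big_file <- function(path) {
  big <- read.csv(path)          # 'big' exists only inside this call
  colMeans(big[sapply(big, is.numeric)])
}

# 2. Empty the workspace before starting an unrelated problem.
rm(list = ls())

# 3. Keep only the columns you actually need.
dat <- read.csv("spreadsheet.csv")
dat <- dat[, c("id", "value")]   # drop the dozens of uninteresting columns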


On my computer gc() releases some of the memory, but it's not perfect. If I load a large object, do something with it, delete it, and call gc(), I don't get back the same amount of free memory I had at the beginning. The more things I do, the more memory I'm unable to recover. In the end, after many operations with big objects, I can run out of memory. I'm on Windows 10 x64 with 16 GB of RAM.
IRTFM

Supposedly R uses only RAM. That's just not true on a Mac (and I suspect it's not true on Windows either). If it runs out of RAM, it will start using virtual memory. Sometimes, but not always, processes will 'recognize' that they need to run gc() and free up memory. When they do not, you can see it in Activity Monitor.app: all the RAM is occupied and disk access has jumped up. I find that when I am doing large Cox regression runs, I can avoid spilling over into virtual memory (with its slow disk access) by preceding the calls with gc(), as in gc(); cph(...)
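A small sketch of that pattern, using cph() from the rms package (which also attaches survival) and the ovarian data set from survival; any sufficiently large fit would do:

library(rms)   # provides cph(); attaches survival for Surv() and 'ovarian'

gc()           # compact the heap first, so the big fit is less likely
               # to push the process into (slow) virtual memory
fit <- cph(Surv(futime, fustat) ~ age + rx, data = ovarian)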


I can confirm R doesn't use the pagefile on Windows, and sometimes it would be very useful if it did.
Tommy

A bit late to the party, but:

Explicitly calling gc() will free some memory "now"... so if other processes need the memory, it might be a good idea. For example, before calling system() or similar. Or perhaps when you're "done" with the script and R will sit idle for a while until the next job arrives; again, so that other processes get more memory.
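For instance, something along these lines (the sizes and the external command are invented):

big <- matrix(0, 1e4, 1e4)   # roughly 800 MB of doubles
## ... work with 'big' ...
rm(big)
gc()                         # may prompt R to hand the freed pages back to the OS

# now an external process can use that memory
system("sort -o sorted.txt huge_input.txt")   # hypothetical external job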

If you just want your script to run faster, it won't matter, since R will call gc() itself if it needs to. It might even be slower, since you may be forcing collections that the normal GC cycle would never have needed.

...but if you want to measure time, for instance, it is typically a good idea to do a GC before running your test. This is what system.time does by default.
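Indeed, system.time() has a gcFirst argument that defaults to TRUE, so a collection is run before the expression is timed:

x <- runif(1e7)

system.time(sort(x))                   # gcFirst = TRUE by default, so
                                       # earlier garbage can't distort the timing

system.time(sort(x), gcFirst = FALSE)  # a GC pause triggered by *previous*
                                       # allocations may now be charged to sort()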

UPDATE: As @DWin points out, R (like C#, Java, etc.) doesn't always know when memory is low and the GC needs to run. So you could sometimes need to run a GC as a work-around for deficiencies in the memory system.


hadley

No. If there is not enough memory available for an operation, R will run gc() automatically.


This doesn't always happen automatically, in my experience. If you work regularly with large data, run gc() regularly or restart your R session.
Please provide evidence for your statement.
> V1 <- vector(length=208000000)
> sapply(1:20, function(x) {V2 <- vector(length=52000000); 0} )
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
There were 32 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In vector(length = 5.2e+07) :
  Reached total allocation of 1023Mb: see help(memory.size)
... 32 times
> V1 <- vector(length=208000000)
> sapply(1:20, function(x) {V2 <- vector(length=52000000); rm(V2); gc(); 0} )
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
This is a fixed memory scenario. If you rm() and gc(), you're fine. You can increase your memory limit to avoid this problem, but it's really annoying to have R eating swap when there's perfectly good RAM sitting around.
Does R "know" that memory pages of other processes have to be swapped out to provide the necessary space for R? In my experience it sometimes helps to call gc() after an intermediate step so that other processes can use the RAM again, making the computer far more responsive than before the call to gc().
Shane

"Maybe." I don't really have a definitive answer. But the help file suggests that there are really only two reasons to call gc():

- You want a report of memory usage.
- After removing a large object, "it may prompt R to return memory to the operating system."

Since repeated calls can slow down a large simulation, I have tended to call it only after removing something large. In other words, I don't think it makes sense to systematically call it all the time unless you have good reason to.
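Incidentally, the memory-usage report is simply what gc() returns when called directly; the numbers below are illustrative, not from a real session:

> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 310042 16.6     592000 31.7   415001 22.2
Vcells 591412  4.6    8388608 64.0  1821001 13.9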