I need to stop crashing my computer with R: a blog about gc()

Packages we will use:

library(pryr)

I am working with a large dataset in R at the moment (it’s got event data with lots of text), and my computer fan is working overtime.

I’m also beginning to realise that many of my coding problems are actually memory problems in disguise.

So recently, I have had to learn a lot about the R functions that help me understand what is happening with my memory use.

A favourite of mine is gc(), which stands for garbage collection (garbage collector?)


When I run gc(), it outputs the following table:

gc()
          used (Mb)  gc trigger   (Mb)  max used   (Mb)
Ncells  646297 34.6     1234340   66.0   1019471   54.5
Vcells 1902221 14.6   213407768 1628.2 255567715 1949.9

Ncells
Memory used by R’s internal bookkeeping: environments, expressions, symbols.
Usually stable and not a worry~

Vcells
Memory used by your data: vectors, data frames, strings, lists. So when I think “R ran out of memory,” it almost always means Vcells are too high~
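A quick way to see this for myself (a minimal sketch, assuming nothing else heavy is running in the session): allocate a big vector and watch the Vcells row jump while Ncells barely moves.

gc()              # note the Vcells "used" figure
x <- rnorm(1e7)   # ten million doubles, about 80 MB
gc()              # Vcells jumps by roughly that much; Ncells barely moves
rm(x)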

If I look at the columns:

  • used: memory currently in use after garbage collection
  • gc trigger: threshold at which R will automatically run GC next
  • max used: peak memory usage since the session started
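Handily, gc() also returns this table as a matrix, so I can pull the numbers out in code. A minimal sketch (the column positions below assume the default six-column layout):

g <- gc()
g["Vcells", 2]   # current data memory in Mb
g["Vcells", 6]   # peak data memory in Mb this session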

As an aside, I can also see from the table that my session at one point used around 2 GB of memory, even though it now uses only ~15 MB.
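And if I want to measure the peak of one specific step rather than the whole session, gc() has a documented reset argument that zeroes the “max used” statistics back to current usage:

gc(reset = TRUE)  # reset "max used" back to current values
# ... run one memory-hungry step here ...
gc()              # "max used" now reflects just that step's peak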


At its core, gc() has two roles:

  1. Trigger garbage collection
  2. Report memory usage

The official R documentation is explicit that reporting, not memory recovery, is the primary reason we should call gc():

“the primary purpose of calling gc is for the report on memory usage”

The documentation also says it can be useful to call gc() after a large object has been removed, as this may prompt R to return memory to the operating system.
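A minimal sketch of that advice (the object name and size are just for illustration):

big <- numeric(1e8)  # ~800 MB of doubles, purely for demonstration
rm(big)              # drop the only reference to it
gc()                 # may prompt R to hand the freed memory back to the OS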

I can also turn on garbage collection logging with gcinfo():

gcinfo(TRUE)

This prints a log every time a garbage collection runs:

Garbage collection 80 = 53+6+21 (level 0) ...
74.8 Mbytes of cons cells used (57%)
58.8 Mbytes of vectors used (14%)

I typed this into ChatGPT, and this is what the AI overlord told me was in this output:

1. Garbage collection 80

  • This is the 80th garbage collection since the R session started.
  • GC runs automatically when memory pressure crosses a trigger threshold.
  • A high number here usually reflects:
    • long sessions
    • repeated allocation and copying
    • large or complex objects being created and discarded

On its own, “80” is not a problem; it is contextual.


2. = 53+6+21

This is a breakdown of GC events by type, accumulated so far:

  • 53: minor (level-0) collections
    → clean up recently allocated objects only
  • 6: level-1 collections
    → more aggressive; scan more of the heap
  • 21: level-2 collections
    → full, expensive sweeps of memory

The sum equals 80.

Interpretation:

  • Most collections are cheap and local (good)
  • But 21 full GCs indicate some sustained memory pressure over time

3. (level 0)

This refers to the current GC event that just ran:

  • Level 0 = minor collection
  • Triggered by short-term allocation pressure
  • Typically fast

This is not a warning. It means R handled it without escalating.


4. 74.8 Mbytes of cons cells used (57%)

  • Cons cells (Ncells) = internal R objects:
    • environments
    • symbols
    • expressions
  • 74.8 MB is currently in use
  • This represents 57% of the current GC trigger threshold

Interpretation:

  • Ncells usage is moderate
  • Well below the trigger
  • Not your bottleneck

5. 58.8 Mbytes of vectors used (14%)

  • Vector cells (Vcells) = your actual data:
    • vectors, data frames, strings
  • 58.8 MB currently in use
  • Only 14% of the trigger threshold

Interpretation:

  • Data memory pressure is low
  • R is very far from running out of vector space
  • This GC was likely triggered by allocation churn, not dataset size
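Once I have seen enough of the log, I can switch it back off. Helpfully, gcinfo() returns the previous setting, so it is easy to restore:

old <- gcinfo(TRUE)  # start logging; returns the previous setting
# ... run the memory-hungry code being watched ...
gcinfo(old)          # put things back the way they were (usually FALSE)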

rm() for ReMoving objects

rm(my_unnecessarily_big_df)
gc()

To make sure there isn’t a ton of memory leaking here and there, we can use rm() to remove the object reference, and gc() then helps clean up the now-unreachable memory.
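Here is a small before-and-after sketch of that pattern (the object name and size are made up for illustration):

before <- gc()["Vcells", 2]            # Mb of data memory in use now
big_tmp <- data.frame(x = rnorm(5e6))  # roughly 40 MB of doubles
rm(big_tmp)
after <- gc()["Vcells", 2]             # should land back near `before`
c(before = before, after = after)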

From a stackoverflow comment:

gc does not delete any variables that you are still using- it only frees up the memory for ones that you no longer have access to (whether removed using rm() or, say, created in a function that has since returned). Running gc() will never make you lose variables.

object.size()

object.size(my_suspiciously_big_df)
print(object.size(another_suspiciously_big_df), units = "MB")  # human-readable MB


ls() + sapply() — Crude but Effective Audits

sapply(ls(), function(x) object.size(get(x)))

This reveals which objects dominate memory.
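A slightly friendlier version of the same audit (assuming it runs at the top level) converts the sizes to plain numbers so they print cleanly, and sorts the biggest offenders first:

sizes <- sapply(ls(), function(x) as.numeric(object.size(get(x))))
sort(sizes, decreasing = TRUE) / 1024^2  # sizes in MB, largest first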

pryr::mem_used() (Optional, Cleaner Output)

pryr::mem_used()
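pryr also has mem_change(), which reports how much total memory use changed while an expression ran:

pryr::mem_change(x <- rnorm(1e6))  # roughly +8 MB for a million doubles
pryr::mem_change(rm(x))            # and roughly -8 MB back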

Thank you for reading along with me to understand some of the memory diagnostics we can use in R. Hopefully they can help our poor computers avoid booting up the fan and suffering from overheating~

And at the end of the day, the R documentation stresses that restarting the session is better hygiene than relying on gc() or rm()!
