
Does Haskell require a garbage collector?

I'm curious as to why Haskell implementations use a GC.

I can't think of a case where GC would be necessary in a pure language. Is it just an optimization to reduce copying, or is it actually necessary?

I'm looking for example code that would leak if a GC wasn't present.

You might find this series enlightening; it covers how garbage is generated (and subsequently collected): blog.ezyang.com/2011/04/the-haskell-heap
There are references everywhere in pure languages! Just not mutable references.
@pelotom References to immutable data or immutable references?
Both. The fact that the data referred to is immutable follows from the fact that all the references are immutable, all the way down.
You will certainly be interested in the halting problem: applying this reasoning to memory allocation helps explain why deallocation can't be statically predicted in the general case. However, there are some programs for which deallocation can be predicted, just as there are some programs that can be known to terminate without actually running them.

rp123

As others have already pointed out, Haskell requires automatic, dynamic memory management: automatic memory management is necessary because manual memory management is unsafe; dynamic memory management is necessary because for some programs, the lifetime of an object can only be determined at runtime.

For example, consider the following program:

main = loop (Just [1..1000]) where
  loop :: Maybe [Int] -> IO ()
  loop obj = do
    print obj
    resp <- getLine
    if resp == "clear"
     then loop Nothing
     else loop obj

In this program, the list [1..1000] must be kept in memory until the user types "clear"; the lifetime of this list can therefore only be determined at runtime, and this is why dynamic memory management is necessary.

So in this sense, automatic dynamic memory management is necessary, and in practice this means: yes, Haskell requires a garbage collector, since garbage collection is the highest-performance technique for automatic dynamic memory management.

However...

Although a garbage collector is necessary, we might try to find some special cases where the compiler can use a cheaper memory management scheme than garbage collection. For instance, given

f :: Integer -> Integer
f x = let x2 = x*x in x2*x2

we might hope for the compiler to detect that x2 can safely be deallocated when f returns (rather than waiting for the garbage collector to deallocate it). Essentially, we are asking the compiler to perform escape analysis, converting heap allocations into stack allocations wherever possible.

This is not too unreasonable to ask for: the jhc Haskell compiler does this, although GHC does not. Simon Marlow says that GHC's generational garbage collector makes escape analysis mostly unnecessary.

jhc actually uses a sophisticated form of escape analysis known as region inference. Consider

f :: Integer -> (Integer, Integer)
f x = let x2 = x * x in (x2, x2+1)

g :: Integer -> Integer
g x = case f x of (y, z) -> y + z

In this case, a simplistic escape analysis would conclude that x2 escapes from f (because it is returned in the tuple), and hence x2 must be allocated on the garbage-collected heap. Region inference, on the other hand, is able to detect that x2 can be deallocated when g returns; the idea here is that x2 should be allocated in g's region rather than f's region.

Beyond Haskell

While region inference is helpful in certain cases as discussed above, it appears to be difficult to reconcile effectively with lazy evaluation (see Edward Kmett's and Simon Peyton Jones' comments). For instance, consider

f :: Integer -> Integer
f n = product [1..n]

One might be tempted to allocate the list [1..n] on the stack and deallocate it after f returns, but this would be catastrophic: it would change f from using O(1) memory (under garbage collection) to O(n) memory.
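The O(1) figure depends on each cons cell becoming garbage as soon as product moves past it, which a garbage collector can exploit but a region held until f returns cannot. For contrast, here is a hedged variant (the function g is my own illustration, not from the original answer) where the whole list genuinely is live across two consumers, so even a GC must retain all n cells:

g :: Integer -> (Integer, Integer)
g n = let xs = [1..n]           -- xs is shared by both consumers...
      in (product xs, sum xs)   -- ...so all n cells must stay live at once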

Extensive work was done in the 1990s and early 2000s on region inference for the strict functional language ML. Mads Tofte, Lars Birkedal, Martin Elsman, and Niels Hallenberg have written a quite readable retrospective on their work on region inference, much of which they integrated into the MLKit compiler. They experimented with purely region-based memory management (i.e. no garbage collector) as well as hybrid region-based/garbage-collected memory management, and reported that their test programs ran "between 10 times faster and 4 times slower" than pure garbage-collected versions.


Does Haskell require sharing? If not, in your first example, you can pass a copy of the list (resp. Nothing) to the recursive call of loop and deallocate the old one - no unknown lifetime. Of course nobody wants a non-sharing implementation of Haskell, because it's horribly slow for large data structures.
I really like this answer, although my only confusion is with the first example. Obviously if the user never typed "clear" then it could use infinite memory (without a GC), but that's not exactly a leak as the memory is still being tracked.
C++11 has a wonderful implementation of smart pointers. Basically it employs reference counting. I guess Haskell could ditch garbage collection in favour of something similar, and therefore become deterministic.
@ChrisNash - Doesn't work. Smart pointers use reference counting under the hood. Reference counting cannot deal with data structures with cycles. Haskell can generate data structures with cycles.
I'm not sure I agree with the dynamic memory allocation part of this answer. The fact that the program doesn't know when the user will stop looping shouldn't by itself make the allocation dynamic; what matters is whether the compiler knows when something goes out of scope, and in Haskell's case that is formally defined by the language itself. The memory may still be dynamic, though, because the list expressions and values are generated at runtime within the language.
augustss

Let's take a trivial example. Given this

f (x, y)

you need to allocate the pair (x, y) somewhere before calling f. When can you deallocate that pair? You have no idea. It cannot be deallocated when f returns, because f might have put the pair in a data structure (e.g., f p = [p]), so the lifetime of the pair might have to be longer than the call to f. Now, say the pair was put in a list: can whoever takes the list apart deallocate the pair? No, because the pair might be shared (e.g., let p = (x, y) in (f p, p)). So it's really difficult to tell when the pair can be deallocated.
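Spelled out as code (a hedged sketch; the concrete types and the name g are mine):

f :: (Int, Int) -> [(Int, Int)]
f p = [p]                 -- f stores its argument in a data structure,
                          -- so the pair outlives the call to f

g :: Int -> Int -> ([(Int, Int)], (Int, Int))
g x y = let p = (x, y)
        in (f p, p)       -- p is shared: it is reachable through both
                          -- components of the result, so neither consumer
                          -- may deallocate it on its own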

The same holds for almost all allocations in Haskell. That said, it's possible to have an analysis (region analysis) that gives an upper bound on the lifetime. This works reasonably well in strict languages, but less so in lazy ones (lazy languages do far more mutation in the implementation than strict languages do).

So I'd like to turn the question around: why do you think Haskell does not need a GC? How would you suggest memory allocation be done?


sigfpe

Your intuition that this has something to do with purity has some truth to it.

Haskell is considered pure partly because side effects of functions are accounted for in the type signature. So if a function has the side effect of printing something, there must be an IO somewhere in its return type.

But there is a function that is used implicitly everywhere in Haskell and whose type signature doesn't account for what is, in some sense, a side effect: namely, the function that copies some data and gives you two versions back. Under the hood this can work either literally, by duplicating the data in memory, or 'virtually', by increasing a debt that needs to be paid back later.

It's possible to design languages with even more restrictive type systems (purely "linear" ones) that disallow the copy function. From the point of view of a programmer in such a language, Haskell looks a little impure.
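Since GHC 9.0, the experimental LinearTypes extension gives a taste of this. A minimal sketch, assuming a recent GHC (exact error behaviour may vary by version):

{-# LANGUAGE LinearTypes #-}

-- Accepted: the linear argument is consumed exactly once.
swap :: (a, b) %1 -> (b, a)
swap (x, y) = (y, x)

-- Rejected by the type checker: 'x' is used twice, so the implicit
-- "copy" function cannot be written at the linear type.
-- dup :: a %1 -> (a, a)
-- dup x = (x, x)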

In fact, Clean, a relative of Haskell, has linear (more strictly: unique) types, and that can give some idea of what it would be like to disallow copying. But Clean still allows copying for "non-unique" types.

There is lots of research in this area, and if you Google enough you'll find examples of pure linear code that requires no garbage collection. You'll find all kinds of type systems that can signal to the compiler what memory might be used, allowing the compiler to eliminate some of the GC.

There is a sense in which quantum algorithms are also purely linear. Every operation is reversible and so no data can be created, copied, or destroyed. (They're also linear in the usual mathematical sense.)

It's also interesting to compare with Forth (or other stack based languages) which have explicit DUP operations that make clear when duplication is happening.

Another (more abstract) way of thinking about this is to note that Haskell is built up from the simply typed lambda calculus, which is based on the theory of cartesian closed categories, and that such categories come equipped with a diagonal function diag :: X -> (X, X). A language based on another class of category might have no such thing.

But in general, purely linear programming is too difficult to be useful, so we settle for GC.


Since I wrote this answer the Rust programming language has increased in popularity by quite a bit. So it's worth mentioning that Rust uses a linear-ish type system for controlling access to memory and it's worth having a look at if you want to see the ideas I mentioned used in practice.
ehird

The standard implementation techniques applied to Haskell actually require a GC more than those of most other languages do, since they never mutate previous values, instead creating new, modified values based on the previous ones. Since this means the program is constantly allocating and using more memory, a large number of the values will be discarded as time goes on.

This is why GHC programs tend to have such high total allocation figures (from gigabytes to terabytes): they're constantly allocating memory, and it's only thanks to the efficient GC that they reclaim it before running out.
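You can observe this yourself with GHC's runtime statistics. A minimal sketch (the file name and the program are my own illustration):

-- Alloc.hs: compiled without -O, each cons cell of the list is
-- allocated on the heap, consumed by sum, and immediately becomes
-- garbage for the GC to reclaim.
main :: IO ()
main = print (sum [1 .. 10000000 :: Int])

Compiling with ghc -rtsopts Alloc.hs and running ./Alloc +RTS -s prints, among other things, the total "bytes allocated in the heap" alongside the much smaller maximum residency; the gap between the two is the garbage the collector reclaimed along the way. (With -O2, GHC may fuse the list away entirely, which is the other side of the same coin: the best allocation is the one that never happens.)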


"they never mutate previous values": you can check haskell.org/haskellwiki/HaskellImplementorsWorkshop/2011/Takano, it is about an experimental GHC extension that reuses memory.
Stephen C

If a language (any language) allows you to allocate objects dynamically, then there are three practical ways to deal with the management of memory:

1. The language can only allow you to allocate memory on the stack, or at startup. But these restrictions severely limit the kinds of computations that a program can perform. (In practice. In theory, you can emulate dynamic data structures in (say) Fortran by representing them in a big array. It is HORRIBLE ... and not relevant to this discussion.)
2. The language can provide an explicit free or dispose mechanism. But this relies on the programmer to get it right. Any mistake in the storage management can result in a memory leak ... or worse.
3. The language (or more strictly, the language implementation) can provide an automatic storage manager for the dynamically allocated storage; i.e. some form of garbage collector.

The only other option is to never reclaim dynamically allocated storage. This is not a practical solution, except for small programs performing small computations.

Applying this to Haskell: the language doesn't have the limitation of 1, and there is no manual deallocation operation as per 2. Therefore, in order to be usable for non-trivial things, a Haskell implementation needs to include a garbage collector.

I can't think of a case where GC would be necessary in a pure language.

Presumably you mean a pure functional language.

The answer is that a GC is required under the hood to reclaim the heap objects that the language MUST create. For example:

A pure function needs to create heap objects because in some cases it has to return them. That means that they can't be allocated on the stack.

The fact that there can be cycles (resulting from a let rec for example) means that a reference counting approach won't work for heap objects.

Then there are function closures ... which also can't be allocated on the stack, because they have a lifetime that is (typically) independent of the stack frame in which they were created.
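Hedged sketches of the last two points (the names are mine):

ones :: [Int]
ones = 1 : ones        -- a cycle from a recursive binding: the cons cell
                       -- refers to itself, so a reference count could
                       -- never reach zero, while a tracing GC collects
                       -- it once it becomes unreachable

adder :: Int -> (Int -> Int)
adder n = \m -> n + m  -- a closure: 'n' must outlive adder's stack frame,
                       -- because the returned function still captures it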

I'm looking for example code that would leak if a GC wasn't present.

Just about any example that involved closures or graph-shaped data structures would leak under those conditions.
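For instance, here is a small, hedged example of that kind of program: every trip around the loop allocates a fresh structure and a closure over it, then drops both, and pure code has no way to say "free this".

import Control.Monad (forever)

main :: IO ()
main = forever $ do
  line <- getLine
  let xs = map (* 2) [1 .. length line]  -- fresh heap cells each iteration
  print (sum xs)                         -- xs is garbage immediately after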


Why do you think your list of options is exhaustive? ARC in Objective-C, region inference in MLKit and DDC, compile-time garbage collection in Mercury - none of them fits into this list.
@DeeMon - they all fit into one of those categories. If you think they don't, it is because you are drawing the category boundaries too tightly. When I say "some form of garbage collection", I mean any mechanism in which storage is reclaimed automatically.
C++11 uses smart pointers. Basically it employs reference counting. It is deterministic and automatic. I would love to see an implementation of Haskell use this method.
@ChrisNash - 1) It wouldn't work: reference-count-based reclamation leaks data if there are cycles ... unless you can rely on application code to break the cycles. 2) It is well known (to people who study these things) that reference counting performs poorly compared with a modern (real) garbage collector.
@DeeMon - besides, see Reinerp's answer on why region inference wouldn't be practical with Haskell.
bdonlan

A garbage collector is never necessary, provided you have sufficient memory. However, in reality we don't have infinite memory, and so we need some method to reclaim memory that is no longer needed. In impure languages like C, you can explicitly state that you're done with some memory and free it - but freeing is a mutating operation (the memory you just freed is no longer safe to read), so you can't use this approach in a pure language. So your options are to statically analyze where the memory can be freed (probably impossible in the general case), to leak memory like a sieve (works great until you run out), or to use a GC.
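To see why an explicit free cannot be given a pure interface, consider this hypothetical (the function free below does not and cannot exist in Haskell; it is shown only to make the problem concrete):

-- free :: a -> ()   -- hypothetical; no such function can be pure
--
-- bad xs = let _ = free xs   -- claims xs's memory is reclaimed...
--          in sum xs         -- ...yet pure code may still use xs,
--                            -- since nothing visibly changed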


This answers why a GC is unnecessary in general but I am more interested in Haskell in particular.
If a GC is theoretically unnecessary in general, then it trivially follows that it's theoretically unnecessary for Haskell, too.
@ehird I meant to say necessary, I think my spell checker flipped the meaning.
ehird's comment still holds :-)
dev1223

GC is "must have" in pure FP languages. Why? Operations alloc and free are impure! And the second reason is, that immutable recursive data structures needs GC for existence because backlinking creates abstruse and unmaintainable structures for human mind. Of course, backlinking is blessing, because copying of structures which uses it is very cheap.

Anyway, if you don't believe me, just try to implement an FP language and you will see that I'm right.

EDIT: I forgot: laziness is HELL without GC. Don't believe me? Just try it without GC in, for example, C++. You will see ... things.


"Laziness is HELL without GC" - can you explain why? I'm thinking about Rust, where afaik there's a lot of laziness and definitely no garbage collector
gfour

Haskell is a non-strict programming language, but most implementations use call-by-need (laziness) to implement non-strictness. In call-by-need you only evaluate an expression when it is reached at runtime, using the machinery of "thunks" (suspended expressions that wait to be evaluated and then overwrite themselves with their value, which stays visible so it can be reused when needed).

So, if you implement your language lazily using thunks, you have deferred all reasoning about object lifetimes until the last possible moment, which is runtime. Since you now know nothing about lifetimes statically, the only thing you can reasonably do is garbage collect...
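A minimal model of such a thunk, assuming nothing about GHC's actual representation (the names here are mine):

import Data.IORef

-- Either a suspended computation or the value it was overwritten with.
data ThunkState a = Unevaluated (() -> a) | Evaluated a

newtype Thunk a = Thunk (IORef (ThunkState a))

suspend :: (() -> a) -> IO (Thunk a)
suspend f = Thunk <$> newIORef (Unevaluated f)

-- Evaluates at most once; afterwards the cell holds the value, and
-- everything the suspended computation captured becomes garbage.
force :: Thunk a -> IO a
force (Thunk ref) = do
  st <- readIORef ref
  case st of
    Evaluated v   -> pure v
    Unevaluated f -> do
      let v = f ()
      writeIORef ref (Evaluated v)
      pure v

The mutation hidden in force is exactly why the free variables of a thunk have lifetimes that only become known at runtime: until the thunk is forced (or discarded), everything it captured must be kept alive.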


In some cases static analysis can insert into those thunks code which frees some data after the thunk is evaluated. Deallocation happens at runtime, but it's not GC. This is similar to the idea of reference-counting smart pointers in C++: reasoning about object lifetimes happens at runtime there too, but no GC is used.