Opened 3 years ago

Last modified 3 years ago

#12900 new feature request

Common up identical info tables

Reported by: dobenour Owned by:
Priority: normal Milestone:
Component: Compiler Version: 8.0.1
Keywords: Cc: simonpj
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Consider

data Bool' = No | Yes

It is clear that this is identical to Bool at runtime, and so GHC should be able to reuse Bool's code. The same holds for all other pairs of isomorphic data types.

I think that a massive space savings could be achieved this way.

Change History (5)

comment:1 Changed 3 years ago by osa1

I've been thinking about this idea for a while, I think it's a great idea. I'd be willing to work on that if @simonpj or others can comment on how possible this is. I agree that we could save massive amount of space.

I think what we can do is, we can give same info tables same names, so if we have

data X    = A | B
data Bool = True | False

Both of these get the same info table hs_s_s_info (s is for "singleton"). Similarly,

data Either a b = Left a | Right b

gets info table hs_p_p_info ('p' is for "pointer") etc. Then when linking we ignore multiple definitions of same info table symbols (not sure if this is possible easily).

comment:2 Changed 3 years ago by osa1

We discussed this on the IRC channel and @ezyang pointed out two things:

  • Assuming this is only for merging same info tables, we can go even further by reordering fields so that data T1 = T1 Int# Bool and {{{data T2 = T2 Bool Int#}}} would get same info tables. This obviously needs a lot more work but saves more space.
  • This breaks -h which works even in non-profiled mode.

comment:3 Changed 3 years ago by simonpj

Why do you say "massive" amount of space? Do you have data? My guess would be that, except in pathological situation (e.g. thousands of identical data types) the saving would be vanishingly small.

Also I think that info tables contain strings that identify the type for debuggers etc; ghci has such a debugger built in. You'd lose this entirely.

So I'm a skeptical that the gain justifies the pain (implementation complexity, loss of functionality). But data might convince!

If you are looking for code space saving, then I think a more profitable place might be commoning up bits of Cmm code. E.g. consider

f x y = case x of True -> False; False -> y

We push a return address, evaluate x; when we come back we take a conditional jump based on the tag and, in the True case we just return a fixed constructor False. Lots of code just returns False! If all that code was commoned up, we might get a big space saving. Maybe there are some sequences that are SO common we can put them in a big library, always available in the RTS, so all libraries could share.

We'd need automation to find these common sequences and see which ones happened a lot.

I remember a PhD student trying this years ago (20 yrs?) with some success.

Simon

comment:4 Changed 3 years ago by dobenour

simonpj: I remember reading that one of the reasons GHC-compiled binaries tend to be large is that each data type has its own info table and entry code. I suspect that most data types have one of a few common forms.

comment:5 Changed 3 years ago by rwbarton

Maybe start by measuring the total code size used by constructors (say in the ghc library itself). Slightly tricky because the starts of info tables aren't marked by symbols, I think.

I'd be surprised if they account for more than about 10% of the total code size, myself.

Note: See TracTickets for help on using tickets.