Opened 19 months ago

Last modified 11 months ago

#14854 new bug

The size of FastString table is suboptimal for large codebases

Reported by: niteria Owned by: watashi
Priority: normal Milestone:
Component: Compiler Version:
Keywords: Cc: simonmar, sgraf
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D5211
Wiki Page:

Description

Context: I was profiling a no-op rebuild (build once, do incremental build with nothing changed) of a large code base. It takes 35.829s, and according to the profile about 20% of the time is spent in mkFastStringByteString.

I run it with -dfaststring-stats +RTS -s and got:

FastString stats:
    size:            4091
    entries:         829959
    longest chain:   268
    has z-encoding:  0%
  50,011,906,640 bytes allocated in the heap
   7,936,277,664 bytes copied during GC
   3,106,526,336 bytes maximum residency (8 sample(s))
     123,108,224 bytes maximum slop
           10169 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       117 colls,    37 par   32.390s   2.293s     0.0196s    0.0911s
  Gen  1         8 colls,     3 par   19.722s   2.266s     0.2832s    0.6936s

  Parallel GC work balance: 60.62% (serial 0%, perfect 100%)

  TASKS: 93 (1 bound, 92 peak workers (92 total), using -N30)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.009s  (  0.009s elapsed)
  MUT     time  277.370s  ( 31.261s elapsed)
  GC      time   52.113s  (  4.558s elapsed)
  EXIT    time    0.005s  (  0.001s elapsed)
  Total   time  329.497s  ( 35.829s elapsed)

  Alloc rate    180,307,295 bytes per MUT second

  Productivity  84.2% of total user, 87.3% of total elapsed	

That's pretty bad, fitting 800k elements in 4k buckets is hard.

To confirm this indeed could be improved I changed the size to 2^20 (1048576) and also removed division in hashStr function (it come up on perf profile), like this:

hashStr  :: Ptr Word8 -> Int -> Int
 -- use the Addr to produce a hash value between 0 & m (inclusive)
hashStr (Ptr a#) (I# len#) = loop 0# 0#
   where
    loop h n | isTrue# (n ==# len#) = I# h
             | otherwise  = loop h2 (n +# 1#)
          where !c = ord# (indexCharOffAddr# a# n)
                -- !h2 = (c +# (h *# 128#)) `remInt#`
                --       hASH_TBL_SIZE#
                -- NOTE: below is multiplication by 127 (a prime) and division by 2^20, it's by no means the best hashing function
                !h2 = (c +# ((h `uncheckedIShiftL#` 7#) -# h)) `andI#` hASH_TBL_SIZE_MASK#

Here's what I ended up with:

FastString stats:
    size:            1048576
    entries:         829959
    longest chain:   8
    has z-encoding:  0%
  51,012,218,048 bytes allocated in the heap
   8,948,377,768 bytes copied during GC
   3,543,797,248 bytes maximum residency (8 sample(s))
     139,428,352 bytes maximum slop
           10502 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       115 colls,    39 par   32.261s   2.677s     0.0233s    0.2225s
  Gen  1         8 colls,     3 par   21.786s   2.329s     0.2911s    0.5737s

  Parallel GC work balance: 48.75% (serial 0%, perfect 100%)

  TASKS: 93 (1 bound, 92 peak workers (92 total), using -N30)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.008s  (  0.008s elapsed)
  MUT     time   65.508s  ( 23.791s elapsed)
  GC      time   54.047s  (  5.006s elapsed)
  EXIT    time    0.007s  (  0.001s elapsed)
  Total   time  119.570s  ( 28.806s elapsed)

  Alloc rate    778,712,134 bytes per MUT second

  Productivity  54.8% of total user, 82.6% of total elapsed

The realtime improvement is nice (-20%), but even nicer is the "serial" time improvement, from 329.497sto 119.570s (2.75x speedup).

This is on modified 8.0.2, but the related code looks about the same on HEAD.

Change History (11)

comment:1 Changed 19 months ago by niteria

I should note that the vast majority of the calls comes from getFS :: BinHandle -> IO FastString, which is used when reading interface files.

comment:2 Changed 19 months ago by sgraf

Cc: sgraf added

comment:3 Changed 19 months ago by simonpj

I don't grok the details, but this looks good to me.

Shouldn't the size adapt somehow? What if there are more faststrings in some later, bigger program? Won't this just happen again? At least should we not have a warning for an over-populated table?

Whatever happens, please ensure there's a note pointing to this ticket, to explain the change.

comment:4 Changed 19 months ago by niteria

Shouldn't the size adapt somehow? What if there are more faststrings in some later, bigger program? Won't this just happen again?

Yes, probably, but the flexibility may come at a cost. I don't have a solution yet, that's why this is a ticket and not a phabricator patch. I also don't expect to be able to focus on this in the near future.

comment:5 Changed 19 months ago by simonmar

3,106,526,336 bytes maximum residency (8 sample(s))

:-o

What's the residency overhead for the larger hash table? I suppose this hash table should auto-grow and rehash itself, unless there's not much overhead for the larger table.

One stopgap approach might be to have a command-line flag to set the size. It would have to be a static flag though (yuck).

comment:6 Changed 19 months ago by niteria

If we decided to depend on https://hackage.haskell.org/package/hashtables, the implementation looks solid.

comment:7 Changed 12 months ago by watashi

Differential Rev(s): Phab:D5211

comment:8 Changed 12 months ago by simonpj

What's the actual status on this?

comment:9 Changed 11 months ago by simonmar

@simonpj, there's a diff up for review: Phab:D5211

comment:10 Changed 11 months ago by watashi

Owner: set to watashi

comment:11 Changed 11 months ago by Ben Gamari <ben@…>

In 5126764b/ghc:

Rewrite FastString table in concurrent hashtable

Summary:
Reimplement global FastString table using a concurrent hashatable with
fixed size segments and dynamically growing buckets instead of fixed size
buckets.

This addresses the problem that `mkFastString` was not linear when the
total number of entries was large.

Test Plan:
./validate

```
inplace/bin/ghc-stage2 --interactive -dfaststring-stats < /dev/null
GHCi, version 8.7.20181005: http://www.haskell.org/ghc/  :? for help
Prelude> Leaving GHCi.
FastString stats:
    segments:          256
    buckets:           16384
    entries:           7117
    largest segment:   64
    smallest segment:  64
    longest bucket:    5
    has z-encoding:    0%
```

Also comapre the two implementation using

{P187}

The new implementation is on a par with the old version with different
conbination of parameters and perform better when the number of
FastString's are large.

{P188}

Reviewers: simonmar, bgamari, niteria

Reviewed By: simonmar, bgamari

Subscribers: rwbarton, carter

GHC Trac Issues: #14854

Differential Revision: https://phabricator.haskell.org/D5211
Note: See TracTickets for help on using tickets.