Opened 5 years ago

Closed 3 years ago

Last modified 2 years ago

#9577 closed bug (fixed)

String literals are wasting space

Reported by: xnyhps Owned by: xnyhps
Priority: low Milestone: 8.2.1
Component: Compiler (NCG) Version: 7.8.2
Keywords: strings Cc: simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case: codeGen/should_run/T9577
Blocked By: Blocking:
Related Tickets: #12937, #12965 Differential Rev(s): Phab:D1290
Wiki Page:

Description

For D199 I looked into how string literals are compiled down by GHC.

On 64-bit OS X, a simple string "AAA" turns into assembly:

.const
.align 3
.align 0
c38E_str:
	.byte	65
	.byte	65
	.byte	65
	.byte	0

(And also something that invokes unpackCString#, but that isn't relevant here.)

(MkCore.mkStringExprFS -> CmmUtils.mkByteStringCLit -> compiler/nativeGen/X86/Ppr.pprSectionHeader.)

Note how this:

  • Is 8 byte aligned.
  • Is a .const section.

I can't find any reason why string literals would need to be 8-byte aligned on OS X. There might be a small benefit in performance to read data starting 8-byte aligned, but I doubt doing that for string literals would be a meaningful difference. Assembly from both clang and gcc does not align string literals.

The trivial program:

main :: IO ()
main = return ()

has almost 5kB of wasted space of padding between all strings the Prelude brings in, built with GHC HEAD.

The fact that it is a .const section, instead of .cstring (https://developer.apple.com/library/mac/documentation/DeveloperTools/Reference/Assembler/040-Assembler_Directives/asm_directives.html#//apple_ref/doc/uid/TP30000823-TPXREF127) means duplicate strings aren't shared by the assembler. GHC floats out string literals to the top-level and uses CSE to eliminate duplicates, but that only works in a single modules. Strings shared between different modules end up as duplicate strings in an executable.

The same program as above also has ~4kB of wasted space due to duplicate Prelude strings ("base" occurs 16 times!). Compared to the total binary size (4MB after stripping), removing this redundant data wouldn't be a big improvement (0.2%), but I still think it can be a worthwile optimization.

I think this can be solved quite easily by creating a new section header for literal strings, which is unaligned and of type .cstring.

Change History (26)

comment:1 Changed 5 years ago by xnyhps

Owner: set to xnyhps

comment:2 Changed 5 years ago by simonpj

You sound as if you know what you are talking about. I say go for it! But please make sure that whatever you do works on Windows and Linux as well as Mac.

Thanks!

Simon

comment:3 Changed 5 years ago by xnyhps

Yeah, I have Windows and Linux available to test on. Other architectures (PPC/ARM/...) not as easily, so I'll keep the generated code for those unchanged.

Gathering up some discussion from GCC about alignment of string literals:

	* config/i386/i386.c (ix86_constant_alignment): Decrease alignment
	of long string literals from 32 bytes to sizeof (void *) when !-Os
	and to 1 with -Os.

The main argument in favor of alignment seems to be: code often memcpys string literals into buffers. By doing that with aligned addresses (apparently) SSE instructions can be used. This is irrelevant for GHC, because the strings are only parsed into [Char]s, never copied.

comment:4 in reply to:  3 ; Changed 5 years ago by dfeuer

Replying to xnyhps:

The main argument in favor of alignment seems to be: code often memcpys string literals into buffers. By doing that with aligned addresses (apparently) SSE instructions can be used. This is irrelevant for GHC, because the strings are only parsed into [Char]s, never copied.

Will that always be the case if a string literal represents something like Text or ByteString? If so, will that continue to hold in the future? Might a future optimization fuse putStr with the conversion to do a copy? It may be that these concerns are baseless, but it might make sense to consider what alternative optimizations yours could preclude.

You mention that there are a lot of string literals in the Prelude. I would bet that the vast majority of those are error messages. Might it be possible to specifically target exceptional strings that should never be anywhere speed-critical, and pack them all together? Putting them all together, ideally starting or ending on a page boundary, would (hopefully) mean that they wouldn't even need to be swapped in unless an error occurred.

comment:5 in reply to:  3 Changed 5 years ago by tibbe

Replying to xnyhps:

The main argument in favor of alignment seems to be: code often memcpys string literals into buffers. By doing that with aligned addresses (apparently) SSE instructions can be used. This is irrelevant for GHC, because the strings are only parsed into [Char]s, never copied.

We avoid going via [Char] when creating ByteString literals already today (using RULES).

comment:6 in reply to:  4 ; Changed 5 years ago by xnyhps

Replying to dfeuer:

Replying to xnyhps:

The main argument in favor of alignment seems to be: code often memcpys string literals into buffers. By doing that with aligned addresses (apparently) SSE instructions can be used. This is irrelevant for GHC, because the strings are only parsed into [Char]s, never copied.

Will that always be the case if a string literal represents something like Text or ByteString? If so, will that continue to hold in the future? Might a future optimization fuse putStr with the conversion to do a copy? It may be that these concerns are baseless, but it might make sense to consider what alternative optimizations yours could preclude.

For the record, this is the rewrite rule used by ByteString:

This just wraps the Addr# directly, no copying here. However, append does call memcpy twice. I don't think GHC has the kind of optimizations that can turn a memcpy call into SIMD instructions directly, but maybe memcpy is more efficient when called with aligned buffers. I'll try to test this.

And these are the rewrite rules for text:

A loop similar to unpackCString#, so alignment won't matter much.

It is true that we might find optimizations later that benefit from aligned strings. But unaligning them now doesn't preclude that. Literals only exist within a single module, so any optimization has control over both the literal and the code that uses it. (Strings can be exported, but Addr#s can't.) Even if someone would try to mix object files generated by different versions of GHC it wouldn't be a problem.

You mention that there are a lot of string literals in the Prelude. I would bet that the vast majority of those are error messages. Might it be possible to specifically target exceptional strings that should never be anywhere speed-critical, and pack them all together? Putting them all together, ideally starting or ending on a page boundary, would (hopefully) mean that they wouldn't even need to be swapped in unless an error occurred.

I'm not familiar enough with assembly or executable file formats to say whether this is possible, but I'll keep it in mind.

comment:7 in reply to:  6 Changed 5 years ago by tibbe

Replying to xnyhps:

For the record, this is the rewrite rule used by ByteString:

This just wraps the Addr# directly, no copying here. However, append does call memcpy twice. I don't think GHC has the kind of optimizations that can turn a memcpy call into SIMD instructions directly, but maybe memcpy is more efficient when called with aligned buffers. I'll try to test this.

The memcpy implementation (used by e.g. copyByteArray#) does unroll memcpys of statically known size and alignment, if aligned to a word, so I definitely think we should try to align our data that way. In 7.10 (if Phab:D166 goes in) we'll do even better and use a REP-MOVSB instruction on Ivy bridge and newer. REP-MOVSB is almost as fast as an unrolled AVX loop.

The memcpy unrolling is implemented in source:compiler/nativeGen/X86/CodeGen.hs.

comment:8 Changed 5 years ago by simonmar

Any memcpy of a string will not be a statically known size. Furthermore, for small strings I think the space we save by not aligning them outweighs any benefit from aligned memcpy.

comment:9 in reply to:  8 Changed 5 years ago by tibbe

Replying to simonmar:

Any memcpy of a string will not be a statically known size. Furthermore, for small strings I think the space we save by not aligning them outweighs any benefit from aligned memcpy.

Why not? With OverloadedLists we now desugar list literals (which are really no different than string literals, conceptually) to fromListN, so code that creates e.g. a ByteString from a list literal can be implemented with a memcpy that has its argument statically known.

comment:10 Changed 5 years ago by xnyhps

I found #5218, which sounds like a nice solution for bytestrings because it gives the option to do different things with string and bytestring literals (part 2 of comment 3). I can look at this at the same time.

Alternatively, using the same heuristic as GCC would also be an option: align only ≥32 byte strings. (genCCall has a maximum length it will unroll, but I can't tell what that practically is on x86.)

comment:11 in reply to:  10 Changed 5 years ago by rwbarton

If I have a server today that serves a large literal ByteString in response to a particular request, presumably that string buffer will get copied out of at some point—maybe even by the kernel. That memcpy probably won't have statically known size or alignment but it probably will determine at runtime that the size is large and alignment is suitable for doing a more efficient copy.

Replying to xnyhps:

Alternatively, using the same heuristic as GCC would also be an option: align only ≥32 byte strings. (genCCall has a maximum length it will unroll, but I can't tell what that practically is on x86.)

I think we should just do this (at least for now).

comment:12 in reply to:  4 ; Changed 5 years ago by rwbarton

Replying to dfeuer:

You mention that there are a lot of string literals in the Prelude. I would bet that the vast majority of those are error messages. Might it be possible to specifically target exceptional strings that should never be anywhere speed-critical, and pack them all together? Putting them all together, ideally starting or ending on a page boundary, would (hopefully) mean that they wouldn't even need to be swapped in unless an error occurred.

I think this is easy to do as far as the linker side of things is concerned (just put the exceptional strings in their own section); the only bit that might be tricky is identifying which string literals should be considered exceptional and plumbing that information through the compiler.

comment:13 Changed 5 years ago by xnyhps

Documenting what I found about merging duplicate string constants in GCC/clang:

Turning on -fmerge-constants with GCC changes the generated assembly:

-       .section        .rodata
+       .section        .rodata.str1.1,"aMS",@progbits,1

As far as I can tell, this stands for:

  • a sets SHF_ALLOC
  • M sets SHF_MERGE
  • S sets SHF_STRINGS
  • @progbits sets SHT_PROGBITS
  • The 1 indicates the size of each entry (character)

comment:14 in reply to:  12 Changed 5 years ago by dfeuer

Replying to rwbarton:

I think this is easy to do as far as the linker side of things is concerned (just put the exceptional strings in their own section); the only bit that might be tricky is identifying which string literals should be considered exceptional and plumbing that information through the compiler.

I think identifying them shouldn't be too hard (in enough cases to be useful)—they're not exported and they appear only as arguments to error or possibly other exception-raising functions. I say possibly because I imagine there may be situations where exceptions that carry strings may need to be handled quickly. I imagine, however, that in those cases the strings themselves may not be inspected. Another thought is that there's probably an open source or public domain block compression algorithm that can fit in a few hundred bytes of slow assembly to access a compressed error string region. Such a thing should of course be optional if we decide to do it.

comment:15 Changed 5 years ago by carter

One question I've got about this whole thread is this: What would this improve? I'm trying to understand how this will impact application performance or binary sizes (though i understand the latter is defnitely neglible)

comment:16 in reply to:  15 Changed 5 years ago by dfeuer

Replying to carter:

One question I've got about this whole thread is this: What would this improve? I'm trying to understand how this will impact application performance or binary sizes (though I understand the latter is definitely neglible)

Very good question. As someone who knows very little about these issues, it seems to me that there are probably approximately three related issues that will affect performance:

  1. Aligning strings to word boundaries seems to be good for all sorts of reasons (comparison, copying, searching, etc.).
  1. Arranging for code that (exclusively) uses a string to be likely to bring the beginning of the string in on its cache line should often be good. This kind of thing goes beyond strings, of course, and I have no idea what GHC does about it in general.
  1. Dragging an error message fragment in on a cache line should always be bad.

comment:17 Changed 5 years ago by carter

these are great hypotheses, but the question is "do they make a measurable difference on any code perf?" Are there ways we can measure this without having to do the whole change?

comment:18 Changed 5 years ago by jstolarek

Are there ways we can measure this without having to do the whole change?

Was there ever any optimisation which impact we could measure without implementing it? (I'm asking seriously.)

comment:19 in reply to:  15 Changed 5 years ago by xnyhps

Replying to carter:

One question I've got about this whole thread is this: What would this improve? I'm trying to understand how this will impact application performance or binary sizes (though i understand the latter is defnitely neglible)

I think the performance loss for [Char]-strings is negligible: sharing means they are going to be unpacked just once, that's not worthy optimizing performance for. After unpacking the literal is never used again.

ByteStrings may have noticeable impact. Prepending a ByteString literal onto another ByteString involves a memcpy of the literal string. There might be code out there that does it millions of times.

comment:20 Changed 3 years ago by thomie

Differential Rev(s): Phab:D1290
Milestone: 8.2.1
Status: newpatch
Test Case: codeGen/should_run/T9577

comment:21 Changed 3 years ago by Ben Gamari <ben@…>

In b7e88ee/ghc:

Reduce the size of string literals in binaries.

Removed the alignment for strings and mark then as cstring sections in
the generated asm so the linker can merge duplicate sections.

Reviewers: rwbarton, trofi, austin, trommler, simonmar, hvr, bgamari

Reviewed By: hvr, bgamari

Subscribers: simonpj, hvr, thomie

Differential Revision: https://phabricator.haskell.org/D1290

GHC Trac Issues: #9577

comment:22 Changed 3 years ago by bgamari

Resolution: fixed
Status: patchclosed

comment:23 Changed 3 years ago by Ben Gamari <ben@…>

In 55361b38/ghc:

nativeGen: Fix string merging on Windows

D1290 places string constants in the `.rodata.str` section with `aMS`
section flags so that the linker can merge them. However, it seems that
ld doesn't understand these flags. It appears that `gcc
-fmerge-constants` uses the `dr` flags on Windows. Make GHC do the same.

Test Plan: Validate on Windows

Reviewers: xnyhps, austin, Phyx

Reviewed By: Phyx

Subscribers: thomie, trommler

Differential Revision: https://phabricator.haskell.org/D2797

GHC Trac Issues: #9577

comment:24 Changed 3 years ago by bgamari

comment:25 Changed 3 years ago by Ben Gamari <ben@…>

In 4026b452/ghc:

Fix string merging with -split-sections

The added flags for string literal merging ended up printed in the
middle of the section name when -split-sections was enabled. Break it up
to put the flags after the name.

Test Plan: validate with SplitSections=YES

Reviewers: austin, bgamari

Reviewed By: bgamari

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2865

GHC Trac Issues: #9577

comment:26 Changed 2 years ago by bgamari

Keywords: strings added
Note: See TracTickets for help on using tickets.