Opened 8 years ago

Last modified 2 years ago

#5218 patch feature request

Add unpackCStringLen# to create Strings from string literals

Reported by: tibbe Owned by: thoughtpolice
Priority: normal Milestone:
Component: Compiler Version: 7.0.3
Keywords: strings Cc: johan.tibell@…, dons, dcoutts, pho@…, lykahb@…, reiner.pope@…, alexey.skladnoy@…, wren@…, patrick@…, hackage.haskell.org@…, tkn.akio@…, osa1, winter
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #5877 #10064 #11312, #9719 Differential Rev(s): Phab:D2443
Wiki Page:

Description

GHC insert calls to unpackCString# to convert string literals to Strings. Libraries like bytestring use rewrite rules to match on this call to optimize code like pack (unpackCString# s).

If GHC would instead use a version of unpackCString#, say unpackCStringLen#, that includes the (statically known) length, creating ByteStrings from literals could be a constant time operation instead of a linear time operation.

Another use case, which motivated this ticket, is appending string literals to builders (e.g. using Data.Binary.Builder.fromByteString). For small strings the most efficient way to append a string to the builder is to copy the statically allocated string directly into the builder's output buffer. If the string length was known statically, we could do this efficiently using memcpy or even using a small unrolled loop.

Change History (95)

comment:1 Changed 8 years ago by tibbe

Cc: johan.tibell@… added

comment:2 Changed 8 years ago by simonmar

Milestone: 7.4.1
Priority: normalhigh

We really should do this.

comment:3 Changed 8 years ago by simonmar

Simon and I discussed this just now and found two further problems in addition to the extra runtime length operation described above.

1. The RULES in ByteString for optimising literal strings currently do not work

e.g. try

{-# LANGUAGE OverloadedStrings #-}
module Foo where
import Data.ByteString.Char8
foo = "abc" :: ByteString

the generated code does the translation to/from String at runtime. Presumably this broke at some point, possibly due to things being inlined at the wrong time and/or let-floating.

2. The RULES in ByteString never worked for non-ASCII strings

The RULE in Data.ByteString.Char8 looks like

{-# RULES
"ByteString pack/packAddress" forall s .
   pack (unpackCString# s) = inlinePerformIO (B.unsafePackAddress s)
 #-}

But non-ASCII strings are UTF-8-encoded by GHC and wrapped in unpackCStringUtf8, not unpackCString, so the RULE won't apply. Example:

foo = "\0\0" :: ByteString

This will generate something like

foo = pack (unpackCStringUtf8 "\0\0"#)

We have two solutions.

1. Use quasi-quoting to declare ByteString literals.

You would write something like

foo :: ByteString
foo = [bytes| ffa3c9db77 ]

and the code for bytes would expand thus

  [bytes| ffa3c9db77 ]
===>
  inlinePerformIO (unsafePackAddress "..."#)

ie a call to unsafePackAddress passing a primitive literal string (HsStringPrim, aka "..."#). Note that primitive literal strings are not UTF-8-encoded by GHC, they are just 8-bit stripped (this happens in the lexer, of all places). No use of rules at all.

This would be more convenient for encoding static data into the program through ByteString literals.

2. Fix the RULES for ByteString literals.

Here's one way we could fix all the problems described above. It's a bit tricky and requires changes in various places.

First, somewhere in base or ghc-prim we have

newtype String8 = String8 String

string8 :: String -> String8
{-# INLINE [0] string8 #-}
string8 = String8

unpackCString :: Addr# -> String
unpackCStringUtf8 :: Addr# -> String

unpackCString8Len :: Addr# -> Int# -> String8

Next, GHC internally has RULEs that convert

  string8 (unpackCStringUtf8 <utf8stringlit#>) 
    ===> unpackCString8Len <stringlit#> <len#>

  string8 (unpackCString <asciistringlit#>)
    ===> unpackCString8Len <asciistringlit#> <len#>

where <stringlit#> is made by decoding <utf8stringlit#> and stripping out bits 8 and above from each character.

Then in Data.ByteString.Char8 we have:

module Data.ByteString.Char8 where

packAddressLen :: Addr# -> Int# -> ByteString

pack8 :: String8 -> ByteString

instance IsString ByteString where
  fromString = pack . string8

{-# RULES "pack8/unpackCString8" 
  pack8 (unpackCString8 addr len) = packAddressLen addr len
 #-}

comment:4 Changed 8 years ago by simonmar

Cc: dons dcoutts added

comment:5 Changed 8 years ago by tibbe

If solution two works I'm for it. It feels a bit complex but perhaps there's no better option. I'm against solution 1 as it would make most Haskell programs Template Haskell programs, as byte string literals are quite common, and that feels a bit too much to solve this problem.

If we could have proper byte string literals (ala Python's b"...") that would be even better but I guess that would be a too invasive change.

comment:6 Changed 8 years ago by simonpj

What's wrong with Template Haskell? The solution is so simple, it seems a shame not to use it, no?

comment:7 Changed 8 years ago by simonmar

I think we want to do both - the two solutions are complementary.

For large ByteString literals, such as when you've serialised and gzipped a data structure for unpacking at runtime, the quasiquotation syntax makes perfect sense.

However, some people want to write ByteString literals using string syntax, and to not have to use TH just for one of these small literals (TH is seen as a heavyweight dependency if you haven't already bought in).

comment:8 Changed 8 years ago by duncan

Solution 2 looks good to me.

As tibbe says, solution 1 would also be useful in other use cases. Hex, octal or bit string literals.

And if we ever switch Text to use UTF8 then unpackCStringUtf8Len# would be useful there too.

comment:9 Changed 8 years ago by PHO

Cc: pho@… added

comment:10 Changed 8 years ago by boris

Cc: lykahb@… added

comment:11 Changed 8 years ago by igloo

Milestone: 7.4.17.6.1

punting

comment:12 Changed 8 years ago by PHO

I want both solutions too, especially the solution 1.

What concerns me is that there seems no means of creating primitive byte-array literals with TH. That is, the Lit type currently only has a constructor StringPrimL String which represents an Addr# literal encoded in UTF-8, thus unsafePackAddressLen 3 "\NUL\NUL\NUL"# works but unsafePackAddressLen 3 $(litE $ StringPrimL "\NUL\NUL\NUL") doesn't. So we probably need to make a change to the type of StringPrimL:

data Lit = CharL Char
         | StringL String
         | ...
         | StringPrimL [Word8] -- Raw, non-encoded "..."# literal.

comment:13 Changed 8 years ago by reinerp

Cc: reiner.pope@… added

comment:14 Changed 8 years ago by reinerp

See also #5877 which fixed the Template Haskell point above

comment:15 Changed 8 years ago by simonpj

difficulty: Unknown
Owner: set to igloo

Assigning to Ian to investigate/propose

comment:16 Changed 7 years ago by Khudyakov

Cc: alexey.skladnoy@… added

comment:17 Changed 7 years ago by igloo

Milestone: 7.6.17.8.1

Punting

comment:18 Changed 7 years ago by WrenThornton

Cc: wren@… added

comment:19 Changed 7 years ago by parcs

Cc: patrick@… added

comment:21 Changed 6 years ago by liyang

Cc: hackage.haskell.org@… added

comment:22 Changed 6 years ago by igloo

Owner: igloo deleted

comment:23 Changed 6 years ago by thoughtpolice

Owner: set to thoughtpolice

comment:24 Changed 6 years ago by akio

Cc: tkn.akio@… added

comment:25 Changed 5 years ago by thoughtpolice

Milestone: 7.8.37.10.1

Bumping priority down (these tickets haven't been closely followed or fixed in 7.4), and moving out to 7.10 and out of 7.8.3.

comment:26 Changed 5 years ago by thoughtpolice

Priority: highnormal

Actually dropping priority. :)

comment:27 Changed 5 years ago by thomie

Type of failure: None/UnknownRuntime performance bug

comment:28 Changed 5 years ago by thoughtpolice

Milestone: 7.10.17.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:29 Changed 5 years ago by nomeata

I just put two 15000-element unboxed vector of Word16 in a module, using fromList [...] and compile time is annoying, and the runtime overhear is (possibly unjustified) worrying me. Having a way to have a literal form of such an array (which then hopefully is put into the program as-is) would be nice.

comment:30 Changed 5 years ago by simonpj

I'm all for it. Does anyone feel motivated to actually work on this ticket?

Simon

comment:31 Changed 5 years ago by hvr

comment:32 Changed 4 years ago by bgamari

nomeata, I'm not sure I see how your problem would be treated by the proposal 2 described in comment:3. Perhaps I am missing something or are you thinking of another approach entirely?

comment:33 Changed 4 years ago by nomeata

Well, any way to include a literal bytestring would help in avoiding the runtime cost. If the setup is only for Text.ByteString, there might be some dark coercing magic required to make that a vector while still pointing at the same Addr#, but it should work.

I currently have a template-haskell based solution in [my repository](https://github.com/entropia/tip-toi-reveng/blob/master/src/BakedVector.hs) which allows me to write [this code](https://github.com/entropia/tip-toi-reveng/blob/master/src/KnownCodes.hs), which works fine for me.

comment:34 Changed 4 years ago by thoughtpolice

Milestone: 7.12.18.0.1

Milestone renamed

comment:35 Changed 4 years ago by thomie

Milestone: 8.0.1

comment:36 Changed 3 years ago by jscholl

How about instead of adding a new type String#, as suggested in #10064, or adding a function unpackCStringLen#, we add the ability to query the size of the payload of an Addr# at compile-time? We could provide a function which, given an Addr# constant, turns this into a (# Int#, Addr# #) pair without introducing a new type, thus keeping the overall changes low and the design flexible.

How do we get the length at compile-time? We use a special builtin rewrite-rule, which writes the length to the appropiate places. For example:

{-# INLINE[0] viewCString# #-}
viewCString# :: Addr# -> (# Int#, Addr# #)
viewCString# addr# = (# -1#, addr# #)

{-# RULES "viewCString#" forall addr . viewCString# addr = (# <length of addrs pointee>#, addr #) #-}

Library code could then use viewCString# to try to determine the length at compile-time, and, if optimizations are enabled, the call gets rewritten to the correct result. Otherwise, viewCString# inlines in phase 0, the resulting -1 is seen by the library code and the code is simplified to determine the length at runtime, like it does today.

Why does viewCString# return the Addr# again? If it does not, the Addr# given to viewCString# will be used multiple times, thus, GHC will bind it in some let, complicating the design of the rule. If the function returns it, the library can continue to use the returned Addr#, and GHC will less likely share it.

One could go even further and extend viewCString# to handle two additional cases: Converting the encoding at compile-time as well as determining the number of characters of the string. So viewCString# would become:

viewCString# :: Int# -> Addr# -> (# Int#, Addr#, Int#, Int# #)

The first input determines the requested mode of operation (only count bytes and characters, convert to utf16le/be, utf32le/be). The first output Int# determines the performed operation, it should always either be the input Int# or some "No operation performed" code. The other two Int# results are the number of bytes and number of characters, and the Addr# contains the potentially converted literal. Of course, an interface passing around magic Int#s is not the nicest, but I think, this is quite low-level code and only a few libraries like text and bytestring will have to deal with it.

comment:37 Changed 3 years ago by simonpj

Sounds plausible, although I have not re-read the whole thread. Best to think of viewCString# as a primop, I think.

Check with the bytestring folk to be sure it'll meet their goals.

comment:38 Changed 3 years ago by simonmar

Hmm, this seems a bit dodgy. The RULE is not semantics-preserving (deliberately, I know), so if you use viewCString# you get a non-deterministic result that depends on whether optimisation is turned on and whether the RULE fired. I suppose the idea is that you hide the non-determinism behind a layer of library code, but still, I'm not keen on having non-deterministic primitives.

What's wrong with the approach in #10064 to have string literals of type (# Int#, Addr# #)?

comment:39 Changed 3 years ago by Mathnerd314

I have yet another proposal, namely to add low-level construction methods to IsString:

class IsString a where
  fromString :: String -> a
  -- as before

  fromString_asc# :: Addr# -> Int# -> a
  fromString_asc# a _ = fromString (unpackCString# a)

  fromString_utf8# :: Addr# -> Int# -> a
  fromString_utf8# a _ = fromString (unpackCStringUtf8# a)

  fromString_raw# :: Addr# -> Int# -> a
  fromString_raw# a i = fromString (map chr (addrToWord8List a i)

The String instance would have fromString = id, while the ByteString instance would do zero-copy construction with the length information. ASCII literals would use _asc, UTF-8 literals would use _utf8, and primitive string literals would use _raw. Then "x"# :: ByteString would typecheck and work as expected.

The only tricky part is preserving "x"# :: Addr# (which would be expanded to fromString_raw# "x"# 1# :: Addr#). It doesn't seem like instance IsString Addr# would be suitable.

comment:40 Changed 3 years ago by carter

It does seem like a String# style design might be the most tasteful / clean, from my fuzzy understanding

comment:41 Changed 3 years ago by jscholl

I tried implementing a String# type which would carry the length as an Int# at its beginning and two functions to extract the length and address of the string literal. However, it quickly got a little bit out of hand:

  • unpackCString# etc. had to be adopted, breaking backwards compatibility. To avoid this, I tried to create wrapper functions unpackCStringLit#, which would extract the address and call the original unpackCString# function.
  • I could not solve the question how to adopt the rewrite rules dealing with strings without duplicating them for the Addr# and String# versions. I could also not figure out when unpackCStringLit# should inline to avoid the overhead of the new address computation.
  • It took a while to find all (most?) of the places (library, some hardcoded types in Ids, a place in the type checker, generating record selector errors) where types where wired in, especially for exceptions like absentError and recSelError.
  • Implementing a new String# also asked the question whether "foo"## should be the corresponding literal for it. However, adding it from the parser to the backend seemed quite complex, so I tried a different approach.

Instead of creating a new type String#, I rewrote unpackCStringLit# to have the type Addr# -> Int# -> [Char]. It would then just throw its second argument away and inline in some phase. However, it still meant duplicating rewrite rules, which seemed not like an idea solution.

My next idea was to push the length information into an ignored argument to a function giving us the address: cStringLitAddr# :: Addr# -> Int# -> Addr#. This could just be passed as an argument to unpackCString#, thus I was quite confident that it would remain backwards compatible and no extra rewrite rules were needed to maintain the current behavior (but extra rules to use the length information, e.g. to construct bytestrings, but this seems like an acceptable cost).

However, I did not anticipate the let/app invariant, thus my original design of unpackCString# (cStringLitAddr# "foo"# 3#) caused lint to warn me. After reading up about the invariant, I decided that cStringLitAddr#, applied to two literals, should be okay for speculation, as it did not have side effects nor could fail or anything. However, while now the generated core was accepted, it was useless, as it would not match the rewrite rules written by a user. Their rules would be translated to something like case cStringLitAddr# addr len of { tmp -> unpackCString# tmp } .

Thus, I decided to generate matching core and removed my fix to make cStringLitAddr# okay for speculation. In the current version, it is possible to create a bytestring in O(1) with rewrite rules. However, I have broken the general list fusion (or at least the built-in rules match_eq_string and match_append_lit), as the case statement gets in the way between foldr and build, causing them to not be optimized out (but maybe this is generally a missed opportunity, if I have foo (case something of { tmp -> bar tmp }) , maybe it should be possible to rewrite foo (bar x) = baz x anyway, leading to case something of { tmp -> baz tmp } , iff something is safe to evaluate with regards to time, space and exceptions (this is okay-for-speculation, right?)).

So right now I am stuck. Maybe it is okay to break backwards compatibility and just change the types of unpackCString# etc. to include an additional (ignored) Int# argument, pushing some #ifs to everyone using unpackCString# (I think this is basically text, bytestring and ghc itself) for the next few years. However, unpackCString# is called at some additional places, namely when constructing modules for Typeable. Right now the types only carry the Addr# to call it, but would then also need the length information (or there would be the risk that something rewrites it and gets a bogus length, if one just passes 0# as length information). On the other hand, maybe it would be a good thing to actually pass the length along to unpackCString#, making it mandatory, as this would avoid the need to null-terminate the strings, allowing '\NUL' characters to be encoded with one byte instead of two (which may be of interest for bytestring). On the other hand, I could imagine this breaking stuff if strings are no longer null-terminated in subtle ways...

comment:42 Changed 3 years ago by simonmar

Doesn't it work to have string literals be of type (# Int#, Addr# #) (see comment:38)? The unboxed tuple will be passed as two arguments, but in Core you can name the unboxed tuple and pass it around as a single entity.

comment:43 Changed 3 years ago by jscholl

Wouldn't that generate something like unpackCString# (# 3#, "abc"# #) for "abc"? And if I just typed "abc" in ghci, wouldn't it fail as ghci can not handle unboxed tuples? Beside from that, it should work, but then I could also just try to change unpackCString# to have type Addr# -> Int# -> [Char], avoiding the tuple. This would of course break backwards compatibility, as would an unboxed tuple, and I wanted to see if I could avoid that.

comment:44 Changed 3 years ago by simonmar

The Unarise pass removes unboxed tuple binders and arguments, so it would be passed as two separate arguments.

However, unarise is currently part of SimplStg, so GHCi doesn't get to take advantage of it, I don't know how hard that is to change. Maybe @osa1 or @simonpj would know.

Backwards compatibility is a separate issue, we can always introduce new APIs instead of breaking old ones, I doubt there are serious problems here.

comment:45 Changed 3 years ago by simonmar

Cc: osa1 added

comment:46 Changed 3 years ago by osa1

I didn't follow the whole thread but yeah, if you want to compile string literals to unboxed tuples ((# len, ptr #)) it'd be a problem for GHCi as the bytecode interpreter/compiler can't handle unboxed tuple returns. (it has some special cases for handling primops that return state tokens in an unboxed tuple)

We had briefly discussed doing bytecode generation after unarise in the mailing list (https://mail.haskell.org/pipermail/ghc-devs/2016-July/012502.html see Simon's response). I don't understand native calling conventions good enough to comment, but maybe we can add one more special case in the bytecode compiler to handle (# len, ptr #) returns, like the special cases for returning (# RealWorld#, a #) etc. (search for isUnboxedTupleCon in ByteCodeGen.hs)

I'm willing to work on extending unboxed tuple support in the bytecode interpreter, but I don't have any concrete proposals right now -- I need to study native calling conventions for returning tuples first.

comment:47 Changed 3 years ago by jscholl

After switching to (# Int#, Addr# #), I have a sort-of working implementation. The builtin rules seem to work, rewriting string literals from user-code works and has access to the length, but ghci wouldn't even start up. Maybe I can take a look this evening into it, but if I understand unboxed tuples correctly, it should be possible to just flatten them if they appear in an argument position, right? E.g. turn foo (# a, b, c #) into foo a b c .

comment:48 Changed 3 years ago by osa1

E.g. turn foo (# a, b, c #) into foo a b c

Yes. That's the easy part (see RepType.repType), the problem is returning tuples, it needs support from the code generator.

comment:49 Changed 3 years ago by jscholl

It seems to work. I added a pass before generating byte-code which transforms the unboxed tuples and now ghci works again.

I am validating right now, but stage2 was working, so this should pass, too. So to let you review my changes, I have to ask: Should I include changes made to submodules or just include the changes to GHC?

comment:50 Changed 3 years ago by simonmar

We'll need to think about how we want to do this for GHCi. I presume we can't do unarise on Core because it can't generate type-correct Core, right @osa1? It seems like a lot of trouble to have a whole pass just for strings in GHCi, especially since it overlaps with what unarise does.

Perhaps it could be handled in the byte code generator directly?

Anyway, feel free to put your patch up on Phabricator for review. If it depends on submodule changes, I believe the right way to handle those is to fork the submodule on github and make the .gitmodule file point to your fork, at least while the patch is in review, and then when pushing, revert the .gitmodules changes and push to the upstream submodules first.

comment:51 Changed 3 years ago by osa1

I believe pre-unboxed-sums unarise pass could be made type-safe (I actually had a patch that moved unarise pass to Core and I don't recall any type safety issues), but now that we have sums moving unarise to Core means adding unsafeCoerce#s everywhere. One of my initial unboxed sums patches was doing that, but no one liked it (SPJ reviewed it on Phab) because of unsafeCoerce#s in Core.

So yeah unarise can't be made type safe without unsafeCoerce#s at the moment, it needs to stay as a STG pass.

EDIT: Shall we open another ticket for supporting unboxed tuples and sums in all generality in the interpreter? I'd be willing to work on that.

Last edited 3 years ago by osa1 (previous) (diff)

comment:52 Changed 3 years ago by jscholl

Differential Rev(s): D2443
Status: newpatch

I uploaded a diff which includes my changes. I tried to integrate unrolling the unboxed tuples in the bytecode generator now, which seems to work, but may miss some cases.

There may be some points which could be done in a better way, so feel free to point them out and I will change them.

comment:53 Changed 3 years ago by jscholl

Differential Rev(s): D2443Phab:D2443

comment:54 Changed 3 years ago by simonpj

I've lost context on this. The trac ticket is very long, and I'm no longer sure

  • What, exactly, is the problem we are trying to solve?
  • What's the overview of the proposed solution?
  • How the solution works in detail

I wonder whether you might write a wiki page on Literal Strings, to describe how literal strings are handled, at least in your proposal?

Are there any tests that check that the problem is indeed solved?

I notice that the Phab patch changes .gitmodules to point to your repos, which presumably can't be right for a commit to GHC master.

Sorry to be dense. Thanks.

Simon

comment:55 Changed 3 years ago by simonmar

@simonpj the .gitmodules changes are because the diff on phabricator includes changes to submodules, and to avoid having to push to the upstream for those submodules it's necessary to fork the repos and point the .gitmodules file at your fork for the purposes of uploading the patch to phabricator. The idea is that before pushing the final patch, you revert the .gitmodules changes and push the necessary submodule changes to the relevant upstream repos.

I've been encouraging @bgamari and @austin to document this (slightly complex) workflow on the wiki, it really needs to go here: https://ghc.haskell.org/trac/ghc/wiki/WorkingConventions/Git/Submodules

comment:56 Changed 2 years ago by winter

Here's yet another proposal, I think we should compile string literals into byteArray using UTF8 encoding. I don't know if this is feasible, but this will make a universal byteArray based bytes library possible.

comment:57 Changed 2 years ago by bgamari

Regarding comment:56,

I don't think we would want to give every string literal the overhead of a full heap object. Even just adding a length as is proposed in this ticket is adding a significant binary size overhead for small strings, which occur quite often. The additional few words that a ByteArray closure would bring is far too expensive to stomach.

comment:58 Changed 2 years ago by winter

If i remember correctly, a ByteArray# would have an extra header and a length field, which in turn bring a 2 words overhead, one word more compareing to the (# int#, addr# #) solution. But i would argue this overhead can bring a nice solution to ghc's long lasting literal problem, for example, vector package and text package can provide some TH to directly save some byteArray# literal using hexadecimal notation, this save many extra copying during runtime.

Finally, by nature of binary loading, this ByteArray# is pinned, so we can still get its address without trouble.

comment:59 Changed 2 years ago by bgamari

If i remember correctly, a ByteArray# would have an extra header and a length field, which in turn bring a 2 words overhead, one word more compareing to the (# int#, addr# #) solution.

Right, this would incur another word of overhead. However, on the majority of machines this is 8-bytes which is quite significant. Looking at GHC itself, just over a third of all string literals are 8 characters or less (6000 our of 17500). For these literals adding another word would increase the fractional overhead from >50% to >67%.

I have spoken with GHC users targeting mobile platforms who already suffer from our code size; it's hard to justify such an increase without a very good reason.

But i would argue this overhead can bring a nice solution to ghc's long lasting literal problem, for example, vector package and text package can provide some TH to directly save some byteArray# literal using hexadecimal notation, this save many extra copying during runtime.

What is stopping these libraries from providing this mechanism currently using Addr# and primitive strings directly?

In general primitive strings are, as the name would suggest, primitive. I'm not sure forcing a heap object representation here is necessary nor prudent.

comment:60 Changed 2 years ago by winter

What is stopping these libraries from providing this mechanism currently using Addr# and primitive strings directly?

The problem is that there's no way to cast Addr# into ByteArray# without copy, while unboxed vector(not storable) and text both want ByteArray#.

In general primitive strings are, as the name would suggest, primitive. I'm not sure forcing a heap object representation here is necessary nor prudent.

I disagree. If we give string literal a proper compact representation, not only we can save unnecessary copying during runtime, we can save code size in other ways.

Consider if string literal now are ByteArray#s, we can use rules to simplify a UTF8 text type like forall a. fromString (GHC.unpackCString# a) = UTF8 a, that means we can directly use constructor here instead of several calls. The same applys for unbox vectors using unboxed string literal and hexdecimal notation.

Last edited 2 years ago by winter (previous) (diff)

comment:61 in reply to:  60 Changed 2 years ago by bgamari

Replying to winter:

What is stopping these libraries from providing this mechanism currently using Addr# and primitive strings directly?

The problem is that there's no way to cast Addr# into ByteArray# without copy, while unboxed vector(not storable) and text both want ByteArray#.

Fair enough, but why not just poke a hole in the ByteArray# abstraction in that case? Namely, provide a unsafeMkByteArray# :: Addr# -> Int -> ByteArray#.

In general primitive strings are, as the name would suggest, primitive. I'm not sure forcing a heap object representation here is necessary nor prudent.

I disagree. If we give string literal a proper compact representation, not only we can save unnecessary copying during runtime, we can save code size in other ways.

Consider if string literal now are ByteArray#s, we can use rules to simplify a UTF8 text type like forall a. fromString (GHC.unpackCString# a) = UTF8 a, that means we can directly use constructor here instead of several calls. The same applys for unbox vectors using unboxed string literal and hexdecimal notation.

I agree that these are all great simplifications, but I don't see why changing the representation of primitive strings is necessary to get there.

comment:62 Changed 2 years ago by bgamari

Fair enough, but why not just poke a hole in the ByteArray# abstraction in that case? Namely, provide a unsafeMkByteArray# :: Addr# -> Int -> ByteArray#.

Although, thinking about this for a tad longer it's not necessarily clear that this is a direction which we'd want to go. Afterall, it would require the introduction of another closure type since the current representation requires that the bytes directly follow the closure header. I suspect this would introduce quite a bit of complexity in the RTS.

comment:63 Changed 2 years ago by winter

Fair enough, but why not just poke a hole in the ByteArray# abstraction in that case? Namely, provide a unsafeMkByteArray# :: Addr# -> Int -> ByteArray#.

Although, thinking about this for a tad longer it's not necessarily clear that this is a direction which we'd want to go. Afterall, it would require the introduction of another closure type since the current representation requires that the bytes directly follow the closure header. I suspect this would introduce quite a bit of complexity in the RTS.

Yes, that is a more expensive design IMO. I read about the notes on unpointed heap object(maybe somewhere in infoTable.h), it says that they shouldn't be entered. But i fail to understand how can we provide unsafeMkByteArray# in an obvious way.

comment:64 Changed 2 years ago by bgamari

But i fail to understand how can we provide unsafeMkByteArray# in an obvious way.

This would essentially amount to making ByteArray# a sum internally. You would simply add a closure type which would carry a pointer to the payload instead of the VLA carried by StgArrBytes currently.

comment:65 Changed 2 years ago by winter

This would essentially amount to making ByteArray# a sum internally. You would simply add a closure type which would carry a pointer to the payload instead of the VLA carried by StgArrBytes currently.

I'm not an expert on this, so maybe this sounds ridiculous: If we add a new closure type for ByteArray#, could it be used in runtime also? Since it seems more memory efficient.

Last edited 2 years ago by winter (previous) (diff)

comment:66 Changed 2 years ago by bgamari

If we add a new closure type for ByteArray#, could it be used in runtime also?

I'm not sure I follow the suggestion here. What do you mean by "used in the runtime"?

comment:67 in reply to:  66 Changed 2 years ago by winter

Replying to bgamari:

If we add a new closure type for ByteArray#, could it be used in runtime also?

I'm not sure I follow the suggestion here. What do you mean by "used in the runtime"?

I mean can we change ByteArray# 's representation into just length and content (removing the header)? I think that's not possible due to GC expect to see an uniform heap object representation, but i'm not sure there's no other way to make it work.

Last edited 2 years ago by winter (previous) (diff)

comment:68 Changed 2 years ago by rwbarton

It seems to me we want a way to write a ByteArray# literal, without forcing every string literal to be compiled to a ByteArray#.

comment:69 Changed 2 years ago by bgamari

I mean can we change ByteArray# 's representation into just length and content (removing the header)?

I don't believe this is possible; as you point out, the garbage collecting needs the info table pointer.

It seems to me we want a way to write a ByteArray# literal, without forcing every string literal to be compiled to a ByteArray#.

I agree; I can see the value in wanting a ByteArray# literal, but making all primitive strings ByteArray#s seems like a step too far.

comment:70 Changed 2 years ago by winter

It looks like we have to make a space-time trade off here, if ByteArray# 's overhead is too large, adding a Int# will also cost a lot. Suppose GHC is smart enough to float all string constant out, then copying once in runtime is acceptable. Otherwise i would still want to switch space for time.

Here i also propose another solution: We still keep primitive string literal's type as Addr#, but encode the literal's byte length with UTF8 rules, that is using one byte for length less than 0x7F, two bytes for length less than 0x7FF...and so on. Then put these UTF8 encoded length header bytes in front of real bytes content.

Now all we have to do left is to add a new unpack function unpackGHCString# to decode these header bytes first(we can reuse UTF8 code!!!), then use memcpy or whatever you want to do with length info.

EDIT: In fact any variable length encoding will work here, and UTF8 is not the most efficient one. But it's a convenient one. Most of the time, string literal tend to be small.

Last edited 2 years ago by winter (previous) (diff)

comment:71 Changed 2 years ago by winter

Cc: winter added

comment:72 Changed 2 years ago by bgamari

comment:73 Changed 2 years ago by bgamari

Here is where we stand on this:

This bug seeks to address the fact that we currently have few good ways of encoding literal strings verbatim (e.g. as raw, unchanged bytes) in object code. This is because we insist on encoding primitive strings as null-terminated modified UTF-8. This means that things like bytestring and text have a rather complicated and inefficient handling of these literals. This inefficiency stems from two reasons,

  • One needs to look for and correctly handle the U+0000 codepoints (encoded as 0xc0 0x80) in the primitive string
  • It's impossible to know what the length of the string is without walking it

The solution here is to rework our desugaring of primitive strings such that,

"hello"#

Will be desugared as,

let ptrToHelloString :: Addr#
    ptrToHelloString = ...
in (# 5#, ptrToHelloString #)

This means that we can encode the string contents in plain UTF-8 without a NULL terminator. The type of unpackCString# then becomes,

unpackCString# :: (# Int#, Addr# #) -> String

and the implementation gets a tiny bit simpler (since it simply decodes a fixed number of bytes, instead of looking for a NULL). Consequently, libraries can then provide rules matching on unpackCString# applications, replacing them with what is essentially memcpy.

This is for the most part a simple change, with the exception being GHCi support due to the need for unboxed tuples. jscholl started implementing this nearly a year ago but stalled. I recently rebased his work (Phab:D2443) and addressed several of the issues that came up in review. Unfortunately currently GHCi segmentation faults, which will take some to work out.

Note that are two related problems that this does not address,

  • pure ASCII literals (which might be used to, for instance, encode a binary representation of a static Array)
  • ByteArray# literals, as requested in ticket:11312
Last edited 2 years ago by bgamari (previous) (diff)

comment:74 Changed 2 years ago by jscholl

Thinking about the problem again I decided to try to add ByteArray# literals to GHC. The idea is the following:

  • Use "foo"## as syntax for ByteArray#s. This is in essence my try for a String# type.
  • Provide
    unpackStringLit# :: ByteArray# -> [Char]
    {-# INLINE[1] unpackStringLit# #-}
    unpackStringLit# ba# = 
      unpackCStringWithLen# (byteArrayContents# ba#) (sizeofByteArray# ba#)
    
  • Compile "foo" as unpackStringLit# "foo"##
  • Let rewrites fire in phase 2.
  • In phase 1, inline unpackStringLit# and let rules rewrite it to unpackCStringWithLen# "foo"# 3#
  • Thus most ByteArray#s should get eliminated and binary size should stay more or less the same.
  • If someone rewrites something like ByteString.pack (unpackStringLit# lit), the literal is not eliminated and emitted to the binary. Thus a ByteString literal can increase binary size. However, I think this is what we want because we save making a copy of the data.
  • The downside is that turning optimization off causes the compiler to create a ByteArray# for every string literal instead of a c-string. GHCi will also allocate ByteArray#s instead of string literals directly.

I currently implemented the new literal type, extended the parser, changed the desugaring, added the needed rules, taught GHCi to handle ByteArray# literals, and generated cmm Code for them. I still have to look at all parts involved for fusion and other string rules to work, check how the change affects bootstrapping with an older compiler, take a look at template haskell, and whether there are any typeable/generic things involved.

I don't know how everyone feels about adding another literal type (especially because there are now two similar types, Addr# and ByteArray#, but if we want to keep binary sizes down, we need some form of Addr#, and it seems like having ByteArray# is beneficial too). Or whether it is reasonable to provide syntax for ByteArray#s (letting the compiler generate them would be enough for this ticket). But right now implementing it like this feels a lot better than my previous approach using unboxed tuples.

And I am sorry for not saying or doing anything such a long time.

comment:75 Changed 2 years ago by bgamari

jscholl, I was speaking to phadej yesterday and he also sounded interested in looking at this problem. Regardless of what happens, this is a change to the user-facing interface provided by the compiler and consequently falls under the GHC proposal process. Do you think you could write a proposal explaining your approach?

And I am sorry for not saying or doing anything such a long time.

Quite alright!

Last edited 2 years ago by bgamari (previous) (diff)

comment:76 Changed 2 years ago by phadej

jscholl, I'd like to participate on the proposal, bgamari mentioned above.

I really think we (= community) should design this properly, as you mentioned Addr# and ByteArray# are confusing. Also Unicode/encodings add some complexity. The reward, is that we can close many tickets (at least #11312 with "single" change.

I still try to wrap my head around internals, and my friends' weading is in 20 hours, so I'll return to this only next week

comment:77 Changed 2 years ago by alexbiehl

Not only string literals could benefit from ByteArray# literals: Integer/BigNats could too! Currently big integer literals are desugared into a list of Int chunks. With ByteArray# literals we could represent them as a constant directly!

Whats the status of this ticket?

Last edited 2 years ago by alexbiehl (previous) (diff)

comment:78 Changed 2 years ago by bgamari

comment:79 Changed 2 years ago by bgamari

Keywords: strings added

comment:80 Changed 2 years ago by winter

I don't think it's a good idea to change the string literal desguaring: there're too many libraries rely on the old Addr# desugaring behaviour, changing it will break them. And null terminate modified UTF-8 strings are useful: for example a filepath literal. It's also more compact, which is why we choose it in the first place.

So no matter what, i will ask that new desugaring literals should be added together with a new syntax, e.g. the primitive version has already differ in syntax: ## vs #. I would propose adding a new syntax ""..."".

comment:81 in reply to:  74 Changed 2 years ago by winter

Replying to jscholl:

Thinking about the problem again I decided to try to add ByteArray# literals to GHC. The idea is the following:

  • Use "foo"## as syntax for ByteArray#s. This is in essence my try for a String# type.
  • Provide
    unpackStringLit# :: ByteArray# -> [Char]
    {-# INLINE[1] unpackStringLit# #-}
    unpackStringLit# ba# = 
      unpackCStringWithLen# (byteArrayContents# ba#) (sizeofByteArray# ba#)
    
  • Compile "foo" as unpackStringLit# "foo"##
  • Let rewrites fire in phase 2.
  • In phase 1, inline unpackStringLit# and let rules rewrite it to unpackCStringWithLen# "foo"# 3#
  • Thus most ByteArray#s should get eliminated and binary size should stay more or less the same.
  • If someone rewrites something like ByteString.pack (unpackStringLit# lit), the literal is not eliminated and emitted to the binary. Thus a ByteString literal can increase binary size. However, I think this is what we want because we save making a copy of the data.
  • The downside is that turning optimization off causes the compiler to create a ByteArray# for every string literal instead of a c-string. GHCi will also allocate ByteArray#s instead of string literals directly.

The problem is, old Addr# unpacking differ from ByteArray# unpacking in that they are not the same encoding: they don't agree on how \NUL char get encoded, (at least I'm expecting ByteArray# is standard UTF-8 encoded). So you can't cast them with rewrite rules like that: you have to mention the encoding pitfall.

comment:82 Changed 2 years ago by winter

After thinking about this for a while, i think a better solution for literal problem is to overhaul the OverloaddedStrings extension, because the problem is created by it: without it we can't even directly use literal to get the Addr# at all.

I think i will make a GHC proposal finally, but let me sketch a little bit on my idea:

  1. Currently when OverloaddedStrings is enabled, we consider a string literal polymorphric by translating them to fromString ..., where fromString is a method from IsString type class.
  1. This makes desugaring literal into String the first step in during literal compiling, and at this very step we choose to use unpackCString# addr# to desugar the literal.
  1. Now we have a problem with the fixed desugaring scheme, it's not flexible enough to give arise a ByteArray# based representation, no matter what rewrite-rules are applied afterwards.
  1. So i proposal to solve the problem directly at this language extension level, besides IsString, i propose to add following typeclass:
class IsPlainAddr a where
    fromPlainAddr :: Addr# -> a

class IsAsciiByteArray a where
    fromAsciiByteArray :: ByteArray# -> a

class IsU8ByteArray a where
    fromU8ByteArray :: ByteArray# -> a

-- maybe someone want utf-16 desugaring? we can add later
  1. Together with IsString, these typeclasses are special when OverloaddedStrings is enabled, we will try to find an instance of the type which we are overloading: If we have a "Foo" :: Foo, we will try to find a instance for Foo among these classes, the priority of those instances can have an arbitrary order as long as we document it clearly.
  1. Once an instance is found, we do desugaring depending on the instance spec, and directly inject fromXXX "xxx"# into code, if the sourcecode codepoint can't be encoded with the instance spec, we issue a compile waring.
  1. If we failed to find an instance from above, we issue an compile error.
  1. Now a library author can choose to implement a type class which suit his/her need.
  1. This solution can also be extended to handle OverloadedLists, e.g. we can add following typeclass for desugaring list literals:
class IsIntList a where
    fromIntList :: ByteArray# -> a

class IsWordList a where
    fromWordList :: ByteArray# -> a

class IsInt8List a where
    fromInt8List :: ByteArray# -> a

...
  1. When OverloaddedLists is enabled, we will try to find an instance of these special classes, and transform the list into ByteArray# according to the instance spec, if there're overflowing we issue warnings.

If later, people ask for new format of literal desugaring, we add new typeclasses and done, old code continue to work, and new code will got a compile error on old compilers.

BTW. I think this is what we called "解铃还需系铃人" in chinese ; )

Last edited 2 years ago by winter (previous) (diff)

comment:83 in reply to:  74 Changed 2 years ago by phadej

Replying to jscholl:

I have been trying to write something similar down in https://github.com/ghc-proposals/ghc-proposals/compare/master...phadej:bytearray-literals

It has a different syntax, which I think makes more sense than double-hash.

Ben have given some comments already (they are inline in proposal text, so I remember them). He is worried about ByteArray#s code size, and I'm proposing similar "let's rewrite them to Addr# when possible" approach.

comment:84 Changed 2 years ago by hsyl20

I have just sketched a different proposal here: https://ghc.haskell.org/trac/ghc/wiki/StaticData

IMO it is more generic and the different parts can be implemented independently.

comment:85 in reply to:  84 Changed 2 years ago by phadej

Replying to hsyl20:

I have just sketched a different proposal here: https://ghc.haskell.org/trac/ghc/wiki/StaticData

IMO it is more generic and the different parts can be implemented independently.

Note: Not all GHC have TemplateHaskell, especially stage 1 GHC.

comment:86 Changed 2 years ago by simonpj

I'm struggling to grok this ticket, especially: what is the problem we are trying to solve?. I'm also concerned about making things too complicated.

jscholl in comment:74 sounds right on target to me. Here's my thinking, written out. Let's see if we agree at least about the "Goals" and "Core" part.

Goals

I believe that one goal is

  • The ability to put a block of binary data in the program code, without heavy encoding.

Is that a goal? Can we focus solely on that for a while?

Core

To meet that goal, in Core, we need

  • A primitive data type B# whose values are simply blobs of binary data.
  • Some operations over this type; e.g. lenB :: B# -> Int, or unpackString :: B# -> [Char] or whatever.
  • Literal values (in Core) for B# values.

B# plays the role of the (# Int#, Addr# #) representation mentioned above (comment:38 ff), but without being so concrete.

I'm only using "B#" as a placeholder; we need a proper name for it! So what is it, precisely?

  • B# could be a completely new primitive type.
  • Or B# could be ByteArray#. That would have the major advantage of not adding a new type, and for sure we'd need to be able to turn it into a ByteArray#. So I like that, and it's what jscholl suggests in comment:74.
  • But B# can't be Addr# (which is a memory address)! Also look at #11312, which is highly relevant because it has the same conclusion. In #11312, I call this new type String#, but that's too character-oriented. I think we should focus on binary data. But adopting B# would fix the ghastly problems in #11312.

Haskell

If we had this new primitive type, we'd soon want literals for it in Haskell source code.

  • I suppose we could have a new literal syntax (about whose details I am intensely relaxed). After all, the literals of a language should be expressible I suppose.
  • But we could say you could only get it via a TH quasiquote e.g. [bytes| fec923ac |]? Is that so terrible?

Note that everything in comment:84 belongs in this section. By the time we get to Core all this typeclass stuff has gone away.

Other goals

I don't have clarity on how bytestring would want to convert a ByteArray# to a ByteString. That ought to be a constant time operation.

comment:87 Changed 2 years ago by winter

I'm struggling to grok this ticket, especially: what is the problem we are trying to solve?. I'm also concerned about making things too complicated.

The ability to put a block of binary data in the program code, without heavy encoding.

My understanding of this problem is that we want a way to tell ghc how to compile literals. The compiler's interpretation of "...", "..."#, [...] are currently fixed , and we have to do runtime conversion in order to get the interpretation we want.

The difference between compiler's and user's interpretation may be address vs heap object, or utf8 vs modified utf8 vs utf16, these difference will decide how codegen emit literal code, (I think there's no need for new closure type really, we already have them all).

So we have to find a way to tell compiler what kind of interpretation we want, it can be via syntax, e.g. we create different syntax for different interpretation. Or we can use some type magic, e.g. the one i suggest. Each solution has its own pro and cons, that's why we are still proposing.

In short words, i believe this is a problem OverloadedString/OverloadedList are intended to solve, but solved it wrong.

Last edited 2 years ago by winter (previous) (diff)

comment:88 Changed 2 years ago by simonpj

My understanding of this problem is that we want a way to tell ghc how to compile literals.

That's the bit under "Haskell" in comment:86. But we still need to compile them into something. But what? That's the bit under "Core" in comment:86. Once we have the Core bit settled we can return to the Haskell bit.

comment:89 Changed 2 years ago by hsyl20

@simonpj

That's mostly what I propose in https://ghc.haskell.org/trac/ghc/wiki/StaticData

Your B# is my StaticData#.

comment:90 Changed 2 years ago by winter

@simonpj let say we fixed B# to ByteArray# , people still want to ask compiler turn literal into ByteArray# in different ways, maybe ASCII with overflow warning , maybe standard utf8. The point is we can't fix how compiler compile literal into B# or whatever primitive we choose.I fail to see how above "Core" section mention that. And I really don't think the problem laid in a universal B# selection: It's the ability to choose between Addr# or ByteArray#' or (# Int#, Addr# #), etc. that matters.

Last edited 2 years ago by winter (previous) (diff)

comment:91 Changed 2 years ago by bgamari

I believe that one goal is

The ability to put a block of binary data in the program code, without heavy encoding.

Is that a goal? Can we focus solely on that for a while?

That is indeed a goal. Another equally-important (in my opinion) goal is to be able to support constant-time length given the literals we write today. This is important for being able to efficiently implement things like bytestring Builder in terms of memcpy.

Or B# could be ByteArray#. That would have the major advantage of not adding a new type, and for sure we'd need to be able to turn it into a ByteArray#. So I like that, and it's what jscholl suggests in comment:74.

One issue with this is that we would gain a word of overhead for each string literal (assuming we use B# to implement Haskell's String). I took some measurements a while back and found that short strings are quite common, so this may be a hard cost to accept.

I don't have clarity on how bytestring would want to convert a ByteArray# to a ByteString. That ought to be a constant time operation.

I may be missing something but I don't believe this should pose any particular trouble. Afterall, a ByteString is essentially a ForeignPtr and some length/offset information. byteArrayContents# allows us to get the Addr# from a (pinned) ByteArray#, from which we can construct a Ptr and in turn a ForeignPtr. Since we are talking about static data byteArrayContents# should be safe.

Last edited 2 years ago by bgamari (previous) (diff)

comment:92 Changed 2 years ago by simonpj

One issue with this is that we would gain a word of overhead for each string literal

These are statically allocated literals. I have a hard time believing that a one-word increase in binary size for each string literal will make a noticeable difference to GHC's binary sizes.

If we are agreed, let's do it!

comment:93 Changed 2 years ago by simonpj

It's the ability to choose between Addr# or ByteArray#' or (# Int#, Addr# #), etc. that matters

That's a separate issue. It is indeed what OverloadedStrings is for. Perhaps we can have a new ticket for it. But we can't talk meaningfully about it until we settle the first issue: what are the primitive types and operations over them. There I think we have a preliminary consensus, with B# = ByteArray#.

Moreover, I think jscholl has a preliminary implementation (comment:74). jscholl: would you like to write up your design as a wiki page? Can you share it?

comment:94 Changed 2 years ago by winter

That's a separate issue. It is indeed what OverloadedStrings is for. Perhaps we can have a new ticket for it. But we can't talk meaningfully about it until we settle the first issue: what are the primitive types and operations over them. There I think we have a preliminary consensus, with B# = ByteArray#.

My intention is to add a way to let user choose which primitive types he/she wants. But if there's a must to fix a primitive type, ByteArray# may be better than Addr# since it can be cast into Addr# not vice-versa(at the cost of extra info-header and length words).

comment:95 Changed 2 years ago by bgamari

These are statically allocated literals. I have a hard time believing that a one-word increase in binary size for each string literal will make a noticeable difference to GHC's binary sizes.

Okay, I ran a quick experiment, compiling GHC with -ddump-simpl and grepping for primitive strings in the result. The result is merely ~30000 string literals. As I said earlier, a good fraction of these are short, so the relative change in space needed by string literals is rather large. However, compared to the rest of the program, it's quite small: each literal blowing up by a word would be around 250kB.

Note: See TracTickets for help on using tickets.