Ticket #1120 (new enhancement)

Opened 6 years ago

Last modified 5 years ago

UTF8-Bytestring Pango interface.

Reported by: guest Owned by: axel
Priority: normal Milestone:
Component: Pango bindings Version: 0.9.12
Keywords: Cc: jeanphilippe.bernardy@…

Description

The Pango interface currently jumps through hoops to provide a String-based interface for Pango functions. If (as in Yi) the user code uses UTF-8 internally, this is harmful for both performance and ease of use: offsets have to be corrected twice, in opposite directions.

It would be useful (and perhaps not too difficult) to provide an interface directly based on (UTF8) bytestrings.
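To illustrate the double offset correction mentioned above, here is a minimal sketch using only the base library. The function names (utf8Width, charToByteOffset, byteToCharOffset) are made up for this example and are not part of Gtk2Hs or Pango. It assumes offsets fall on character boundaries.

```haskell
import Data.Char (ord)

-- Number of bytes a single Char occupies when encoded as UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Character offset -> byte offset of the same position in the UTF-8
-- encoding of the string (hypothetical helper, for illustration only).
charToByteOffset :: String -> Int -> Int
charToByteOffset s n = sum (map utf8Width (take n s))

-- Byte offset -> character offset: count the characters whose encoding
-- ends at or before the given byte offset. Assumes the byte offset lies
-- on a character boundary.
byteToCharOffset :: String -> Int -> Int
byteToCharOffset s b =
  length (takeWhile (<= b) (drop 1 (scanl (+) 0 (map utf8Width s))))
```

A String-based binding would apply something like byteToCharOffset to every offset coming back from the C library, and a client that stores UTF-8 would then have to apply the equivalent of charToByteOffset to get back to where it started. Both directions walk the string, so the round trip costs time linear in the offset.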

Change History

  Changed 6 years ago by axel

I am surprised by this, actually. I thought you would internally keep Unicode strings rather than raw byte sequences. I thought it would be prudent in every application to avoid mixing multi-byte characters with composite characters. Using Unicode strings gets rid of the first obstacle and only leaves one difficulty to deal with. So Yi does both at once? Or does it have its own abstractions?

In principle, we can provide an interface to Pango using byte arrays but this would be a lot of work for few users and for those who don't need it, it would be confusing. If you want this interface for pure performance reasons, then I think I would politely decline this request unless you have evidence that conversion back and forth is really a bottleneck.

  Changed 6 years ago by guest

  • type changed from defect to enhancement

A reason for using UTF-8 internally is that I checked the GTK documentation, and it said it returned UTF-8 offsets. I did not imagine you'd adapt it in gtk2hs. :) It turns out the Yi code is not (much) more complex anyway: position in the buffer is used abstractly almost everywhere. Also, since we used bytestrings for performance reasons anyway, it was rather natural to encode Unicode as UTF-8 in them. An obvious additional benefit is saved memory in the usual ASCII case. I was guessing this is a common use case; maybe I'm wrong.

Also, using PangoLayout is quite CPU-intensive. I can't tell the share of the Haskell layer though.

If I end up implementing this, would you accept patches?

  Changed 6 years ago by guest

  • cc jeanphilippe.bernardy@… added

  Changed 6 years ago by duncan

Can I suggest that a better approach is to use an abstract type like ByteString but that represents a sequence of Unicode Chars rather than bytes. That's a much nicer interface than a ByteString which is assumed to be valid UTF8.

A student of mine is implementing a Unicode text type with an external API, internal representation, and performance that are all very similar to those of bytestrings. The work should be completed during this summer.

Then Gtk2Hs could have a nice interface and decent performance for large chunks of text.

follow-up: ↓ 6   Changed 6 years ago by axel

Duncan,

it sounds as if you are suggesting an additional interface that re-implements every existing Pango function using a PackedString such that the UTF8 <-> Unicode conversion is still done by our Pango binding. If you'd actually use ByteString where every element presumably occupies 8 bits, how would you store Unicode characters in it?

What JPB suggests is an interface using raw UTF8 strings that avoids the overhead of our Pango interface. I guess he would use some packed representation such as ByteString, but I'm not sure.

I don't have a problem adding both interfaces. However, to ensure that "normal" people are not confused, could we have these additional APIs in Pango.PackedString.* and Pango.ByteString.* or similar?

in reply to: ↑ 5   Changed 6 years ago by duncan

Replying to axel:

Duncan, it sounds as if you are suggesting an additional interface that re-implements every existing Pango function using a PackedString such that the UTF8 <-> Unicode conversion is still done by our Pango binding. If you'd actually use ByteString where every element presumably occupies 8 bits, how would you store Unicode characters in it?

Yes, but using a Haskell type that is specifically designed to represent Unicode text. Internally it'd almost certainly use UTF-8 and provide fast conversion to UTF-8 encoded memory buffers.

It should not be necessary to have two full implementations of all functions since one set should be easy to implement in terms of the other.
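As a rough sketch of that idea (not the text type referred to above, whose API is not described in this ticket), the following uses a newtype over ByteString and defines the String and text variants of an operation on top of one byte-level primitive. Every name here (Utf8Text, fromChars, toUtf8Bytes, byteLengthPrim and so on) is hypothetical, and byteLengthPrim merely stands in for a real Pango call.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical abstract text type. A real module would not export the
-- constructor, so every value is built by fromChars and holds UTF-8.
newtype Utf8Text = Utf8Text B.ByteString
  deriving (Eq, Show)

-- Encode one Char as UTF-8 bytes. Surrogate code points are not
-- handled specially in this sketch.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading-byte payload: the bits above position s.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits starting at position s.
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- Build the text type from a list of Chars.
fromChars :: String -> Utf8Text
fromChars = Utf8Text . B.pack . concatMap encodeChar

-- Unwrap to the underlying bytes without copying, ready to hand to a
-- byte-oriented binding.
toUtf8Bytes :: Utf8Text -> B.ByteString
toUtf8Bytes (Utf8Text bs) = bs

-- Stand-in for a byte-level primitive (assumption: a real binding
-- would call into Pango here).
byteLengthPrim :: B.ByteString -> Int
byteLengthPrim = B.length

-- The other two variants are thin wrappers around the primitive.
byteLengthText :: Utf8Text -> Int
byteLengthText = byteLengthPrim . toUtf8Bytes

byteLengthString :: String -> Int
byteLengthString = byteLengthText . fromChars
```

With this layering only the byte-level function needs a real implementation; the String version pays for the encoding step, while the text version is just an unwrap.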

What JPB suggests is an interface using raw UTF8 strings that avoids the overhead of our Pango interface. I guess he would use some packed representation such as ByteString, but I'm not sure.

Right, and I'm suggesting something similar but using a type that is designed for the purpose rather than (ab)using ByteString to represent Unicode text.

I don't have a problem adding both interfaces. However, to ensure that "normal" people are not confused, could we have these additional APIs in Pango.PackedString.* and Pango.ByteString.* or similar?

That would make sense.

  Changed 5 years ago by pgavin

  • milestone set to 0.11.0

  Changed 5 years ago by axel

  • milestone 0.11.0 deleted

I don't think this is high priority. If it really becomes a major performance issue, we might re-address this, but I think I'd rather put this off in favour of other features.
