Re: Unicode roadmap?

On 6/28/06, Julian ‘Julik’ Tarkhanov [email protected] wrote:

On 28-jun-2006, at 20:36, Austin Z. wrote:

Except that @top is guaranteed to not have an encoding – at least it
damned well better not – and @top.bytes is redundant in this case. I
see no reason to access #bytes unless I know I’m dealing with a
multibyte String.
You never know if you are, that’s the problem. And no, it’s NOT
redundant. You should just get used to the fact that all strings
might become multibyte.

How can you continue to be so wrong? All strings will not become
multibyte. Matz seems pretty committed to the m17n String, which means
that you’re not going to get a Unicode String. This is good.

When you’re not getting a String that is limited to Unicode, you don’t
need a separate ByteArray. This is also good.

Worse, why would “Not PNG.” be treated as Unicode under your scheme
but “\x89PNG\x0d\x0a\x1a\x0a” not be? I don’t think you’re thinking
this through.

@top[0, 8] is sufficient when you can guarantee that sizeof(char) ==
sizeof(byte).
You can NEVER guarantee that. N e v e r. More languages and more
people use multibyte characters by default than all ASCII users
combined.

Again, you are wrong. Horribly so. I can guarantee that sizeof(char)
== sizeof(byte) if String#encoding is a single-byte encoding or is “raw”
(or “binary”, whichever Matz uses).

It seems a pity, but you still approach multibyte strings as
something “special”.

It seems very sad, but you still aren’t willing to comprehend what I’m
saying.

On “raw” strings, this is always the case.
The only way to distinguish “raw” strings from multibyte strings is to
subclass (which sucks for you as a byte user and for me as a strings
user).

Incorrect. I do not need to have:

UnicodeString
BinaryString
USASCIIString
ISO88591String

Never have. Never will.

What you’re not understanding – and at this point, I am really
thinking that it’s willful – is that I don’t consider multibyte strings
“special.” I consider all encodings special. But I also don’t think I
need full classes to support them. (I know for a fact that I don’t.)
What’s special is the encoding, not the string. Any string – including
a UTF-32 string – is merely a sequence of bytes. The encoding tells
me how large my “characters” are in terms of bytes. The encoding can
tell me more than that, too. This means that an encoding is simply a
lens through which that sequence of bytes gains meaning.

Therefore, I can do:

s = b"Wh\xc3\xa4t f\xc3\xb6\xc3\xb6l\xc3\xafshn\xc3\xabss."
s.encoding = :utf8
s # "Whät föölïshnëss."

Gee. No subclass involved.

A substring of a “binary” (unencoded) string is simply the bytes
involved.
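
In the (equally hypothetical) b"" notation above, that means bytes and
characters coincide:

sig = b"\x89PNG\x0d\x0a\x1a\x0a"  # hypothetical b"" literal, no encoding set
sig[0, 4]  # => "\x89PNG": four characters == four bytes on binary data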

We’re not talking rocket science here. We’re talking being smart,
instead of being lemmings who apparently want Ruby to be more like Java.

On all strings, @top[0, 8] would return the appropriate number of
characters – not the number of bytes. It just so happens on binary
strings that the number of characters and bytes is exactly the same.
This is a very leaky abstraction - you can never expect what you will
get. What’s the problem with having bytes as an accessor?

What’s the need, if I know that what I’m testing against is going to
be dealt with bytewise? You’re expecting programmers to be stupid. I’m
expecting them to be smarter than that. Uninformed, perhaps, but not
stupid.

(And I would know in this case because the ultimate API that calls this
will have been given image data.)

What I’m arguing is that while the pragma may work for the
less-common encodings, both binary (non-)encoding and Unicode
(probably UTF-8) are going to be common enough that specific literal
constructors are probably a very good idea.
Python proved that to be wrong - both the subclassing part and the
literals part.

Python proved squat. Especially since you continue to think that I’m
talking about subclassing. Which I’m not and never have been.

The fact that you have to designate Unicode strings with literals is a
bad decision and I can only suspect that it has to do with compiler
intolerance, and the need to do preprocessing.

Have to nothing. You’re simply not willing to understand anything that
doesn’t bow to the god of Unicode. This has nothing to do with your
stupid assumptions, here. This has everything to do with being smarter
than you’re apparently wanting Ruby to be.

The special literals are convenience items only. Syntax sugar. The real
magic is in the assignment of encodings. And those are always special,
whether you want to pretend such or not.

I’m through with trying to argue with you and a few others who aren’t
listening and suggesting the same backwards stuff over and over again
without considering that you might be wrong. Contrary to what you might
believe, I have looked at a lot of this stuff and have really reached
the conclusion that Unicode-only strings and separate class hierarchies
are a waste of everyone’s time and energy.

Argue for first-class Unicode support. But you should do so within the
framework which Matz has said he prefers (m17n String and no separate
byte array). Think about API changes that can make this valuable. I
think that Matz has settled on the basic data structure, though, and
it’s a fight you probably won’t win with him. Since, as he pointed out
to Charles Nutter, he’s in the percentage of humanity which needs to
deal with non-Unicode more than it needs to deal with Unicode.

-austin

Austin Z. wrote:

Again, you are wrong. Horribly so. I can guarantee that sizeof(char)
== sizeof(byte) if String#encoding is a single-byte encoding or is “raw”
(or “binary”, whichever Matz uses).

I think his point is that for any arbitrary string you cannot guarantee
that it is in a single-byte encoding, and that your code should be written:

raise "Not PNG." unless (SINGLE_BYTE_ENCODINGS.include?(@top.encoding) &&
  @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a")

Of course, if you can guarantee that @top is indeed single-byte
encoded BEFORE hitting this line, then the encoding test is not needed
(and I think you assume that).

But in the general case, it just seems easier to write:

raise "Not PNG." unless @top.byte(0, 8) == "\x89PNG\x0d\x0a\x1a\x0a"

which will work in all cases, without any effort to ensure the
precondition that the encoding is a single byte encoding.

– Jim W.

Byte arrays – memory blocks, whatever – do have their uses, although
mainly not for string ops. I know I’ve used memory blocks a lot, for
image processing or other exotic tasks. But never, far as I can tell,
for strings. However, byte-level ops can be useful on strings. I can
see two uses:

One is for 1-byte encodings. If you know that char == byte, byte-level
ops will speed up processing of the strings, since no second-guessing
has to be done.

Another is because sometimes you have to rip multi-byte chars open and
look at their entrails. Say I want to decompose a hangul syllable into
its primary letters. Unless there is a function provided for that – fat
chance, considering the lack of interest in Unicode from the BDFL –
I’ll have to do my own cooking at the byte-level.
Example:
irb(main):001:0> "한글".length
==> 2
irb(main):002:0> "ひらがな".lengthB
==> 12
[assuming utf-8 here of course.]

I don’t really care about a memory block [using this term instead of
bytearray so that I don’t get classified in any camp :)], but if
Strings go encodings-aware [hooray], we’ll need both types of
operations…

However, I think it is a bit psychotic to base the foundations of an
important feature of the language on the whims and needs[?] of a
small percentage of the user base. Unicode is an international,
working standard, whereas this m17n thing has little to show so far,
both in terms of production and acceptance [who uses m17n outside a
few agitated fellows inside Japan?].

Besides, while some variants of sinograms, aka kanji, and other exotic
chars, may not be in the Unicode project yet [including the first
sinogram of my wife’s given name, which is not to be found in any
dictionary listing less than 50,000 sinograms; yeah, blame my
father-in-law…], what’s in there for CJKV covers day-to-day needs of
most people. Seriously, how many times have you seen transcripts of
bone inscriptions on web sites or e-docs? Or arcane kanji pulled out
of the Morohashi? Or chu nom chars? Or Jurchen script? Sure, some
people do work with this stuff. I studied this stuff, and probably
would have liked a way to input/display them. But how many? And how
many use^H^H^H know of Ruby? Let’s not lose focus on who’s using
what…

my 0.02€


Didier

On Jun 28, 2006, at 4:25 PM, Jim W. wrote:

raise "Not PNG." unless (SINGLE_BYTE_ENCODINGS.include?(@top.encoding) &&
  @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a")

Of course, if you can guarantee that @top is indeed single-byte
encoded BEFORE hitting this line, then the encoding test is not needed
(and I think you assume that).

Does it even make sense to talk about ‘encodings’ in the context of
binary data?

I suppose you could extend the concept of encoding to capture some
sort of MIME-type characterization of the data, but isn’t that a bit
beyond what this thread has been talking about?

I like Austin’s idea of an encoding as a ‘lens’ with respect to the
raw data.

It is getting pretty hard to follow this entire discussion in the
absence of some concrete examples of the imagined APIs, as well as some
sort of taxonomy of use cases with which to evaluate the APIs. For
example:

  • create a copy of a text file when the text encoding is unknown
  • transmit a copy of a text file across a TCP/IP connection when the
    encoding is unknown
  • analyze binary data and guess at its text encoding
  • convert PNG image to a GIF image, in memory, to/from disk
  • input n characters from the keyboard/stdin/tty
  • count the number of words, lines, and characters in a file
    with an explicit encoding
    with an implicit encoding associated with a given locale
    with an implicit encoding associated with the process/thread

and so on.

Gary W.

I’ll give a little ground on a few points. Perhaps I had a dream that
adjusted my perspectives a bit.

  • String == ByteArray is reasonable if String is considered to be a
    “ByteChunker”. The byte-chunking logic holds true for both binary
    string and encoded string models; what’s parametric about it is the
    size of the chunks. While I still believe that a general-purpose,
    high-performance ByteArray would be useful (perhaps preferable) for
    many operations, I will concede that ByteArray + ChunkSizer
    (encoding) := m17n String. I won’t say I’m sold on the ByteChunker
    pattern, but I think it will be easier to accept and discuss m17n
    Strings from a ByteChunker perspective. It also may be fair to say
    that m17n String provides a “view” into the underlying byte array,
    which could in the raw case be a wholly-transparent view.
  • If the intent is to provide a String that supports all encodings
    universally, I will concede that the m17n String is probably the
    only way to do it. As far as I know, there’s no one character
    encoding, code page, or character set that can encompass all other
    encodings without fail. Unicode does, despite what detractors may
    say, make a truly gallant attempt to achieve that impossible goal,
    and it deserves the 90+% of humanity that use Unicode or
    Unicode-encodable character sets exclusively. But if at the end of
    the day Ruby really needs a kitchen-sink approach to character
    encoding, Unicode will not fit that requirement.

So, a short glossary:

String == ByteChunker
chunk == character == n bytes in a specific order

The first item above brings out a few discussion points:

  1. String provides an interface for managing a collection of
    ByteChunks. The sizing and nature of this “chunking” is primarily
    based on character encoding logic. I’ll refer to String and
    ByteChunker interchangeably from here on out.
  2. Indexed operations act upon chunks, not bytes. It may be the case
    that for some encodings, sizeof(chunk) == sizeof(byte). No
    assumptions should be made about chunk size.
  3. Altering String semantics from “byte ops always” to “chunk ops
    always” also implies that chunked operations should not be generally
    purposed toward byte-level operations, since there is no explicit
    guarantee you’ll work with byte-sized (har har) chunks.
  4. Therefore it should be mandatory and acceptable under the supposed
    ByteChunker contract to provide a minimal set of explicitly
    byte-sized operations, since the purpose of chunking is to provide a
    way of consuming and digesting bytes. It would not be useful or
    recommended to completely hide those raw bytes under any
    circumstances, since byte-level operations will always be valid on a
    ByteChunker in the absence of a more specific ByteArray type.
  5. Byte-sized operations should be STRONGLY ENCOURAGED for byte-level
    work over chunk-sized operations due to the changing size and nature
    of chunks. This would mean that [0...5] should never be used instead
    of byte(0...5) for retrieving the first five bytes in a ByteChunker.
  6. Methods on other classes whose purpose is to manipulate character
    data (chunks) logically should never be assumed to work with
    byte-sized chunks only (regex and friends).

A common theme here is that ByteChunker does have a set of logical
semantics, and m17n Strings as planned appear to be ByteChunkers. This
seems like a reasonable abstraction to me, though it does expose
implementation details many of us would prefer to keep hidden (namely
that we’re chunking bytes, when a consumer shouldn’t need to know what
chunks are composed of). If we can reasonably attempt to define the
ByteChunker semantics, we can see where the holes are.

I could do without a separate ByteArray if the m17n String provided
explicit byte-sized operations. The dual-purposing of String ops for
both bytes and chunks is very worrisome, since it’s bound to happen that
chunk operations get incorrectly used for byte operations when
sizeof(chunk) != sizeof(byte). For byte-sensitive cases, rather than
saying “I know that my String is in encoding X, in which all chunks are
byte-sized,” it would be far safer (and better encapsulation of String
state) to say “I know I need to work with bytes all the time.” The
byte-sized operations then follow. (Granted, there’s no way to force
people ahem Austin to use the “byte-safe” methods, but it feels like a
really good best practice to me.)
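
To make the distinction concrete, here is a minimal sketch of the
byte-safe split argued for above; every method shown (encoding=,
lengthB, byte) is hypothetical, collected from names floated in this
thread rather than from any shipping API:

s = "Wh\xc3\xa4t"     # five bytes of UTF-8 data
s.encoding = :utf8    # hypothetical: tag the bytes with their "lens"
s.length              # => 4 chunks (characters: W, h, ä, t)
s.lengthB             # => 5 raw bytes
s[0, 3]               # => "Whä" -- chunk-sized indexing
s.byte(0, 3)          # => "Wh\xc3" -- byte-sized, splits the ä in half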

A caveat to all this is that ByteChunker semantics are inherently more
complex than CharacterSequence semantics, and so the proposed m17n
String is a more complicated solution than using a single internal
encoding. I’m also not convinced that ByteChunker’s semantics are
simpler than separate CharacterSequence and ByteArray semantics, though
they may be more “Ruby.” ByteChunker itself is a more useful
general-purpose entity than CharacterSequence or ByteArray alone.

complexity(ByteChunker) > (complexity(ByteArray) or complexity(CharacterSequence))
complexity(ByteArray and CharacterSequence) maybe > complexity(ByteChunker)
generality(ByteChunker) > (generality(ByteArray) or generality(CharacterSequence))

I think it’s still a valid question whether there’s not a happy medium
somewhere that would make life easiest for the folks using Unicode.
Rubyists are fond of saying that Ruby makes easy problems easy and hard
problems possible. I would argue that Unicode support should be the
“easy problem” that’s easy and that support for incompatible
encodings – worldwide – should be the “hard problem” that’s possible.
Any plans for m17n that make Unicode harder to work with in Ruby than
in comparable languages could prove fatal.

Charles O Nutter wrote:

The dual-purposing of String ops for both bytes
and chunks is very worrisome since it’s bound to happen
that chunk operations get incorrectly used for byte
operations when sizeof(chunk) != sizeof(byte).

I also have this concern.

Here’s a radical idea. Perhaps it is time to deprecate the str[*arg]
operation in favor of str.char(*arg) and str.byte(*arg) like operations,
making it explicit which operation is to be used. It will break a lot
of code, but then, changing the semantics of [] from byte-oriented to
character-oriented operations will probably silently break a lot of
code as well. All things considered, I would prefer noisy breaking.
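
Applied to the PNG example from earlier, the intent becomes explicit
(str.byte and str.char as proposed here are hypothetical methods; @top
holds image data, and greeting stands in for any ordinary text String):

# Hypothetical methods from the proposal above; neither exists today.
raise "Not PNG." unless @top.byte(0, 8) == "\x89PNG\x0d\x0a\x1a\x0a"
greeting.char(0, 5)   # explicitly character-oriented access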

Like I said, it’s a radical idea.

– Jim W.

On 6/28/06, [email protected] [email protected] wrote:

I suppose you could extend the concept of encoding to capture some

  • create a copy of a text file when the text encoding is unknown
  • transmit a copy of a text file across a TCP/IP connection when the
    encoding is unknown

If you don’t know the encoding, you must use binary (unencoded) data.
That will be unchanged … unless we have a ByteArray.

without ByteArray, without pragma:
File.open(a, "r") { |b| File.open(c, "w") { |d| d.write b.read } }

without ByteArray, with pragma:
File.open(a, "r", encoding: "binary") { |b|
  File.open(c, "w", encoding: "binary") { |d|
    d.write b.read
  }
}

with ByteArray:
File.open(a, "r") { |b|
  File.open(c, "w") { |d|
    d.write_bytes b.read_bytes
  }
}

  • analyze binary data and guess at its text encoding

This is a “hard” problem and not easily demonstrable; there are
expensive programs out there that do this. The problem is that encodings
like ISO-8859-1 and ISO-8859-5 mean different things, and you’d have to
do serious textual analysis to determine which you’re looking at. On the
other hand, simple analysis (e.g., determining ISO-8859-* but not which
one, as opposed to UTF-8) may be possible.
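
The UTF-8 half of that simple analysis is easy to sketch, since
well-formed UTF-8 has a rigid byte structure. A rough, purely
illustrative heuristic (the pattern deliberately simplifies the real
UTF-8 grammar, so treat it as a sketch, not a validator):

# Rough sketch: not strict validation (overlongs and surrogates slip
# through), but enough to separate likely UTF-8 from Latin-ish bytes.
UTF8ISH = /\A(?: [\x00-\x7f]               # ASCII
               | [\xc2-\xdf][\x80-\xbf]    # 2-byte sequences
               | [\xe0-\xef][\x80-\xbf]{2} # 3-byte sequences
               | [\xf0-\xf4][\x80-\xbf]{3} # 4-byte sequences
             )*\z/xn

def guess_encoding(data)
  data =~ UTF8ISH ? "UTF-8, probably" : "ISO-8859-*, no telling which"
end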

  • convert PNG image to a GIF image, in memory, to/from disk

Use RMagick. :wink:

  • input n characters from the keyboard/stdin/tty

I’m not sure, to be honest. This is unlikely to be cross-platform.

-austin

On 6/28/06, Jim W. [email protected] wrote:

of code, but then, changing the semantics of [] from byte oriented to
character oriented operations will probably silently break a lot of
code as well. All things considered, I would prefer noisy breaking.

Like I said, it’s a radical idea.

I can’t say that I like it; it’s clean to say str[0...5], but I
can’t say it’s a bad idea either.

Just radical.

-austin

On 6/28/06, Charles O Nutter [email protected] wrote:

I’ll give a little ground on a few points. Perhaps I had a dream that
adjusted my perspectives a bit.

  • String == ByteArray is reasonable if String is considered to be a
    “ByteChunker”. […]

This is essentially how I view Strings. What makes a String special is
NOT the fact that it’s a String. It’s the encoding associated with the
String. This is an important distinction.

[…] It also may be fair to say that m17n String provides a “view”
into the underlying byte array, which could in the raw case be a
wholly-transparent view.

Precisely. This is why I’ve been describing encodings as a lens.

  • If the intent is to provide a String that supports all encodings
    universally, […] it deserves the 90+% of humanity that use Unicode
    or Unicode-encodable character sets exclusively. But if at the end of
    the day Ruby really needs a kitchen-sink approach to character
    encoding, Unicode will not fit that requirement.

And I believe this to be the case. But I also believe that Ruby’s
support for Unicode needs to be first-rate. Where I am getting most
frustrated is that few people have understood that – and even fewer
have understood that first-rate support for Unicode isn’t incompatible
with m17n String.

[…]

  2. Indexed operations act upon chunks, not bytes. It may be the case
    that for some encodings, sizeof(chunk) == sizeof(byte). No
    assumptions should be made about chunk size.

Chunk size is variable based on the encoding. Specifically:

sizeof(char) == sizeof(chunk)

  3. Altering String semantics from “byte ops always” to “chunk ops
    always” also implies that chunked operations should not be
    generally purposed toward byte-level operations, since there is no
    explicit guarantee you’ll work with byte-sized (har har) chunks

Correct. And in this case, a #bytes accessor may make it possible to
perform byte-level operations explicitly. I will grant that much.

[…]

  5. Byte-sized operations should be STRONGLY ENCOURAGED for byte-level
    work over chunk-sized operations due to the changing size and
    nature of chunks. This would mean that [0...5] should never be used
    instead of byte(0...5) for retrieving the first five bytes in a
    ByteChunker

For some data, though, this may be irrelevant (the PNG example I gave
earlier).

  6. Methods on other classes whose purpose is to manipulate character
    data (chunks) logically should never be assumed to work with
    byte-sized chunks only (regex and friends)

I believe that this is fair.

a really good best practice to me).
And that may be the case. But I also think that it’s unnecessary to use
byte-safe methods even if they’re available. I can guarantee that JPEG
data will not be Unicode per se. Certain data (EXIF, for example)
inside of the JPEG could theoretically be Unicode, but it will always be
stored in what is clearly a binary data area and converted afterwards.

A caveat to all this is that ByteChunker semantics are inherently more
complex than CharacterSequence semantics, and so the proposed m17n
String is a more complicated solution than using a single internal
encoding. I’m also not convinced that ByteChunker’s semantics are
simpler than separate CharacterSequence and ByteArray semantics, though
they may be more “Ruby.” ByteChunker itself is a more useful
general-purpose entity than CharacterSequence or ByteArray
alone.

Also note that separating CharacterSequence really only helps if
there’s a single encoding to deal with. Otherwise, the CharacterSequence
has to interpret using ByteChunker-like facilities anyway.

[Edit: (C == complexity; G == generality)]

C(ByteChunker) > (C(ByteArray) or C(CharacterSequence))
C(ByteArray and CharacterSequence) maybe > C(ByteChunker)
G(ByteChunker) > (G(ByteArray) or G(CharacterSequence))

Um. As far as implementation complexity is concerned, I would agree with
your statements here. However, I firmly believe that the use complexity
of a ByteChunker is lower than that of ByteArray and CharacterSequence.

I think it’s still a valid question whether there’s not a happy medium
somewhere that would make life easiest for the folks using unicode.
Rubyists are fond of saying that Ruby makes easy problems easy and
hard problems possible. I would argue that unicode support should be
the “easy problem” that’s easy and that support for incompatible
encodings – worldwide – should be the “hard problem” that’s possible.
Any plans for m17n that make unicode harder to work with in Ruby than
in comparable languages could prove fatal.

Right. The problem is, adding a ByteArray makes a currently easy thing
harder, namely byte manipulation and acquisition. The added complexity
of that is not worth Unicode, IMO. But I do not see this as either/or.
To be perfectly clear, I don’t care if the ByteChunker is harder for
matz or someone else to implement if the API available for Unicode- and
ByteArray-semantics is at least as expressive and as powerful as what we
have today.

-austin

Hi,

In message “Re: Unicode roadmap?”
on Thu, 29 Jun 2006 02:33:36 +0900, “Austin Z.”
[email protected] writes:

|I’m not sure I like the encoding pragma, personally, since it’s at the
|file level. Consider this:
|
| raise "Not PNG." unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"
|
|If I understand the encoding pragma correctly, both the “Not PNG” and
|the matching string will be treated as Unicode, and the test string is
|not valid Unicode.
|
|Better, from my perspective:
|
| raise u"Not PNG." unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"
|
|That way, I mark the strings for which I want Unicode format. The
|encoding pragma makes it hard to do mixed content files.

I’d rather see r"\x89PNG\x0d\x0a\x1a\x0a" (or b"..."), since I expect
binary strings less often. It also removes unnecessary Unicode
expectation from users.
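
In concrete terms (hypothetical syntax; no shipping Ruby has these
literals), only the binary data would be marked, and ordinary literals
would follow the file’s pragma:

# r"" marks the raw/binary literal; "Not PNG." takes the pragma encoding.
raise "Not PNG." unless @top[0, 8] == r"\x89PNG\x0d\x0a\x1a\x0a"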

						matz.

On Jun 28, 2006, at 2:51 PM, Austin Z. wrote:

And I believe this to be the case. But I also believe that Ruby’s
support for Unicode needs to be first-rate. Where I am getting most
frustrated is that few people have understood that – and even fewer
have understood that first-rate support for Unicode isn’t incompatible
with m17n String.

I think people understand what you want. But those of us who’ve done
a lot of i18n work know how hard it is to get things right; for
example, the single hardest piece of writing an efficient XML parser
is dealing with the character input/output. Those of us who write
search engines and have sweated the language-sensitive tokenization
details are also paranoid about these problems. We also know that it
is possible to get things right, if you adopt the limitation that
characters are Unicode characters. Matz is making a strong claim:
that he can write a class that will get Unicode right and also handle
arbitrary other character sets and encodings, and serve as a byte
buffer (it’s a floor wax and a dessert topping!) and do this all
with acceptable correctness and efficiency. This has not previously
been done that I know of. If he can pull it off, that’s super. It’s
not unreasonable to worry, though.

I would offer one piece of advice for the m17n implementation: have a
unicode/non-unicode mode bit, and in the case that it’s Unicode, pick
one encoding and stick to it (probably UTF-8, because that’s
friendlier to C programmers). The reason that this is a good idea is
that if you know the encoding, then for certain performance-critical
tasks (e.g. regexp) you can do sleazy low-level optimizations that
run on the encoding rather than on the chunked chars.

Yes, you’d have to do conversion of all the 8859 and JIS and Big5 and
so on going in and out, but if the volume is big enough that you
care, there’ll be disks involved, and you can transcode way faster
than I/O speeds, so the conversion cost will probably not be observable.
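
That edge-transcoding pattern is already expressible with Iconv, a real
Ruby 1.8 standard library (the filenames here are invented for
illustration):

require 'iconv'

# Transcode at the I/O boundary; keep the hot path UTF-8-only.
utf8 = Iconv.conv('UTF-8', 'ISO-8859-1', File.read('latin1-in.txt'))
# ... fast UTF-8-specific processing (regexps, tokenizing) goes here ...
File.open('latin1-out.txt', 'w') do |f|
  f.write Iconv.conv('ISO-8859-1', 'UTF-8', utf8)
end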

Among other things, I want to be able to process XML in Ruby really
really fast, and in XML you know that it’s all Unicode characters;
so it would be nice to leave the door open for low-level Unicode-
specific optimizations.

-Tim

Austin Z. wrote:

Better, from my perspective:

raise u"Not PNG." unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"

It would make more sense if it worked exactly like regexes:

$KCODE = 'u'
raise "Not PNG." unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"n

or

$KCODE = 'n'
raise "Not PNG."u unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"

Can I read the specs for Ruby 2 somewhere? It would be better than
speculating about how the m17n strings might be implemented. I took a
look at the 1.9 docs on ruby-doc.org, but there is no ‘encoding’
accessor. It’s all the same methods as the docs for Ruby 1.8.4,
although there are a bunch of methods which are not available in my
install of 1.8.4: ["iseuc", "issjis", "isutf8", "kconv", "new", "scn",
"toeuc", "tojis", "tosjis", "toutf16", "toutf8"]

Oh, here’s another thought. How is this supposed to behave?

str.encoding = :sjis
str.split(//u)

Daniel

On 6/28/06, Yukihiro M. [email protected] wrote:

|That way, I mark the strings for which I want Unicode format. The
|encoding pragma makes it hard to do mixed content files.
I’d rather see r"\x89PNG\x0d\x0a\x1a\x0a" (or b"..."), since I expect
binary strings less often. It also removes unnecessary Unicode
expectation from users.

As I indicated in a later post, that’s also acceptable.

-austin

Hi,

In message “Re: Unicode roadmap?”
on Thu, 29 Jun 2006 08:36:12 +0900, Tim B. [email protected]
writes:

|Matz is making a strong claim:
|that he can write a class that will get Unicode right and also handle
|arbitrary other character sets and encodings, and serve as a byte
|buffer (it’s a floor wax and a dessert topping!) and do this all
|with acceptable correctness and efficiency. This has not previously
|been done that I know of. If he can pull it off, that’s super. It’s
|not unreasonable to worry, though.

Have you ever heard of a regular expression engine (one of the hardest
parts to implement in text processing) that handles more than 30
different encodings without conversion, and runs faster than PCRE?

If you have, you might be able to believe in the existence of something
that is a floor wax and a dessert topping at the same time.

If you haven’t, I tell you that it is named Oniguruma, the regular
expression engine that comes with Ruby 1.9.

						matz.

On 6/29/06, Yukihiro M. [email protected] wrote:

If you have, you might be able to believe in the existence of something
that is a floor wax and a dessert topping at the same time.

But then you risk that people would lick the floor, which some may
find unacceptable :wink:

Michal

FWIW, Oniguruma is the regex engine used by SubEthaEdit – via the
OgreKit [Oniguruma RegEx Kit for Cocoa]. I am not too sure about its
being faster than PCRE – tests in SEE and BBEdit don’t show anything
conclusive. One of my pet peeves with OgreKit/SEE is that it treats
the full text as one line by default, making ^…$ useless [and in the
version of SEE I use, ^ doesn’t work…].

OTOH, the good thing about OgreKit/SEE is that \w+ on 한글日本語dodo
will catch the whole yahzoo, whereas in PCRE/BBEdit only dodo will get
caught. Then again, \p{L} works in PCRE, which helps refine what one
wants to call a word, as Mr. Bray showed.


Didier

On Jun 28, 2006, at 8:20 PM, Yukihiro M. wrote:

Have you ever heard of a regular expression engine (one of the hardest
parts to implement in text processing) that handles more than 30
different encodings without conversion, and runs faster than PCRE?

If you haven’t, I tell you that it is named Oniguruma, the regular
expression engine that comes with Ruby 1.9.

I’d heard of it, but I hadn’t tried it until now. Previously I have
done quantitative measurement of the performance of Perl vs. Java
regex engines (conclusion: Java is faster but Perl is safer; see
“ongoing by Tim Bray · Regex Update”).

I thought I would compare Oniguruma, so I downloaded it and compiled
it and ran some tests and looked at the documentation
(http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt and
http://www.geocities.jp/kosako3/oniguruma/doc/API.txt, or is there
something better?)

Oniguruma is very clever; support for multiple different regex
syntaxes? Wow.

The documentation needs a little work; the example files such as
simple.c do not correspond very well (e.g., ONIG_OPTION_DEFAULT).

But I think I must be missing something, because I can’t run my
test. It’s a fast approximate word counter for large volumes of
XML. Here is how the regular expression is built in Perl:

my $stag  = "<[^/?][^>]*>";
my $etag  = "</[^>]*>";
my $empty = "<[^>]*/>";

my $alnum =
  "\p{L}|" .
  "\p{N}|" .
  "[\x{4e00}-\x{9fa5}]|" .
  "\x{3007}|" .
  "[\x{3021}-\x{3029}]";
my $wordChars =
  "\p{L}|" .
  "\p{N}|" .
  "[-._:']|" .
  "\x{2019}|" .
  "[\x{4e00}-\x{9fa5}]|" .
  "\x{3007}|" .
  "[\x{3021}-\x{3029}]";
my $word = "(($alnum)(($wordChars)*($alnum))?)";

my $regex = "($stag)|($etag)|($empty)|$word";

full regex: (<[^/?][^>]*>)|(</[^>]*>)|(<[^>]*/>)|((\p{L}|\p{N}|
[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}])((\p{L}|\p{N}|
[-._:']|\x{2019}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}])*
(\p{L}|\p{N}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}]))?)

I have a very specific idea of what I mean by “word”. \w is nice but
it’s not what I mean.

As far as I can tell, \p{L} and so on don’t work, so I can’t do this
in Oniguruma. Error message: “ERROR: invalid character property name
{L}”. So a bit more work is required to support Unicode? (Supporting
the properties from Chapter 4 is very important.) Or am I misreading
the documentation? I did it in C because simple.c was there;
would it make a difference if I did it from Ruby 1.9?
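
For comparison, driving the same engine from Ruby 1.9 instead of C
would make the test a one-liner; a sketch, assuming a build whose
bundled Oniguruma has Unicode properties enabled:

# Does the regexp engine accept Unicode property classes at all?
word = /(?:\p{L}|\p{N})+/u
puts "\\p{L} works" if "abc123" =~ word
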

-Tim