Help needed for a new release of text-hyphen

luislavena · July 15, 2011, 6:46am

I’ve had folks asking me for a release of text-hyphen that works with
Ruby 1.9, and while I’ve got something that passes the tests that I’ve
created and added for MRI 1.9, it loses compatibility with Ruby
1.8.7 (and does so loudly in the tests) and JRuby (in either 1.8 or
1.9 mode, it appears). I need some help to get the last bits ready,
because I’m not ready to drop Ruby 1.8 entirely (at least one more
version).

You can find the source on GitHub:
GitHub - halostatue/text-hyphen: Text::Hyphen will hyphenate words using modified versions of TeX hyphenation patterns.
You will need hoe [1] as a development dependency to assist with this if
you want to use the Rakefile; otherwise, you can run the test files in
test/ directly.
Only one of the tests fails, but new tests along the same lines would
probably fail too.

I have tested against most Ruby environments, and it only succeeds
against MRI 1.9.2; even JRuby in 1.9 mode fails in the same way as
JRuby 1.8.

This issue is blocking the next release of text-hyphen. If you can
help, I need it, as I don’t have time to investigate and fix it myself
(I’ve got another project that’s taking all of my time).

After this release, this project will probably be put into maintenance
mode (the hyphenation files, aside from an update to UTF-8 encoding
where they weren’t already such, have not been updated since the
original release) and I will look at implementing a new version that
works only under Ruby 1.9 (probably under a new name) that will use
the same basic engine but can read .tex hyphenation files from the
texhyphen project rather than depending on the hand-converted
hyphenation files I have, which will also simplify the licensing of
this successor project.

-a
[1] No, I won’t remove it as it helps with release management.

austin · July 15, 2011, 7:46am

On Jul 15, 2011, at 12:45 AM, Austin Z. wrote:

I’ve had folks asking me for a release of text-hyphen that works with
Ruby 1.9, and while I’ve got something that passes the tests that I’ve
created and added for MRI 1.9, it loses compatibility with Ruby
1.8.7 (and does so loudly in the tests) and JRuby (in either 1.8 or
1.9 mode, it appears). I need some help to get the last bits ready,
because I’m not ready to drop Ruby 1.8 entirely (at least one more
version).

Hi Austin,

Running with the debugger on for 1.8.7 brings up this discrepancy:

The “letters” array for 1.8.7 is this:
["d", "a", "m", "p", "f", "s", "c", "h", "i", "f", "f", "f", "a", "h",
"r", "t", "s", "k", "a", "p", "i", "t", "\303", "\244", "n", "s", "m",
"\303", "\274", "t", "z", "e", "n", "h", "a", "l", "t", "e", "r",
"h", "e", "r", "s", "t", "e", "l", "l", "e", "r"]

Now, "\303", "\244" is the UTF-8 encoding of a-umlaut (ä). In your
1.8 German hyphenation file, you encode the ä with the Latin-1
encoding \344.

Your input text is UTF-8, but the library searches for the Latin-1
encoding. Changing the input to \344 for ä and \374 for ü made the
test pass for me on 1.8.7.

Michael E.
http://carboni.ca/
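
To make the byte mismatch described in this message concrete, here is a minimal sketch. It assumes Ruby 1.9 or later (where strings carry an encoding), and the helper name octal_bytes is made up for this illustration; it is not part of Text::Hyphen.

```ruby
# Illustration only: requires Ruby 1.9+ for String#encode.
# octal_bytes is a hypothetical helper, not part of Text::Hyphen.

# Render each byte of a string as a backslash-octal escape, the same
# notation Ruby 1.8 uses when inspecting non-ASCII bytes.
def octal_bytes(str)
  str.bytes.map { |b| format("\\%03o", b) }
end

a_umlaut = "\u00E4" # a-umlaut, stored as UTF-8
u_umlaut = "\u00FC" # u-umlaut, stored as UTF-8

# UTF-8 needs two bytes for each of these characters ...
puts octal_bytes(a_umlaut).join(" ")
puts octal_bytes(u_umlaut).join(" ")

# ... while Latin-1 (ISO-8859-1) needs only one, so a pattern file
# stored as Latin-1 never matches UTF-8 input byte for byte.
puts octal_bytes(a_umlaut.encode("ISO-8859-1")).join(" ")
puts octal_bytes(u_umlaut.encode("ISO-8859-1")).join(" ")
```

Under Ruby 1.8 there is no per-string encoding, so the letters array above is simply the raw bytes of the UTF-8 input.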

austin · July 15, 2011, 10:38am

Is this the error I should see for JRuby?

https://gist.github.com/anonymous/1084324

gistfile1.txt

~/projects/text-hyphen $ jruby --1.9 -S rake test
rake/rdoctask is deprecated.  Use rdoc/task instead (in RDoc 2.4.2+)
Couldn't read /Users/headius/.rubyforge/user-config.yml. Run `rubyforge setup`.
/Users/headius/projects/jruby/bin/jruby -w -Ilib:bin:test:. -e 'require "rubygems"; require "test/unit"; require "test/test_bugs.rb"; require "test/test_text_hyphen.rb"' -- 
Loaded suite -e
Started
...F.
Finished in 0.862 seconds.

  1) Failure:

If so…yes, it could be something simple, but there’s obviously a bug
here. Perhaps I could bother you to formally file a bug at
http://bugs.jruby.org, so we can track it off-list?

Charlie

austin · July 15, 2011, 11:39am

On Jul 15, 2011, at 4:38 AM, Charles Oliver N. wrote:

Is this the error I should see for JRuby?

gist:1084324 · GitHub

If so…yes, it could be something simple, but there’s obviously a bug
here. Perhaps I could bother you to formally file a bug at
http://bugs.jruby.org, so we can track it off-list?

Charlie

That’s the same error I saw, and I fixed it by using a Latin-1 input
case instead of a UTF-8 one.

Michael E.
http://carboni.ca/

austin · July 15, 2011, 2:56pm

On Fri, Jul 15, 2011 at 4:38 AM, Charles Oliver N. wrote:

Is this the error I should see for JRuby?

gist:1084324 · GitHub

If so…yes, it could be something simple, but there’s obviously a bug
here. Perhaps I could bother you to formally file a bug at
http://bugs.jruby.org, so we can track it off-list?

Yes. But does jruby fake out mvm in this case? Because while Rake is
being run with 1.9, I’m not sure that the tests are:

~/projects/text-hyphen $ jruby --1.9 -S rake test
rake/rdoctask is deprecated.  Use rdoc/task instead (in RDoc 2.4.2+)
Couldn't read /Users/headius/.rubyforge/user-config.yml. Run `rubyforge setup`.
/Users/headius/projects/jruby/bin/jruby -w -Ilib:bin:test:. -e 'require "rubygems"; require "test/unit"; require "test/test_bugs.rb"; require "test/test_text_hyphen.rb"' --

The tests claim to be running “jruby -w …” and not “jruby --1.9 -w
…”. It doesn’t matter because of https://gist.github.com/1084614

I’ve filed JRUBY-5927 about this; if my interpretation of what’s
happening with “jruby --1.9 -S rake test” is correct, I can file a
separate enhancement request about that (it’s a problem, but not a bug
per se). I think Michael E. is correct about the other case.

-a

austin · July 15, 2011, 2:57pm

On Fri, Jul 15, 2011 at 1:46 AM, Michael E. wrote:

The “letters” array for 1.8.7 is this:
["d", "a", "m", "p", "f", "s", "c", "h", "i", "f", "f", "f", "a", "h", "r", "t",
"s", "k", "a", "p", "i", "t", "\303", "\244", "n", "s", "m", "\303", "\274", "t",
"z", "e", "n", "h", "a", "l", "t", "e", "r", "h", "e", "r", "s", "t", "e", "l",
"l", "e", "r"]

Now, "\303", "\244" is the UTF-8 encoding of a-umlaut (ä). In your 1.8 German
hyphenation file, you encode the ä with the Latin-1 encoding \344.

Your input text is UTF-8, but the library searches for the Latin-1 encoding.
Changing the input to \344 for ä and \374 for ü made the test pass for me on
1.8.7.

I think you’re right. Now to figure out how to fix it properly in this
case.

-a

austin · July 15, 2011, 2:18pm

Running with the debugger on for 1.8.7 brings up this discrepancy:

austin · July 15, 2011, 3:07pm

On Fri, Jul 15, 2011 at 8:18 AM, Kaspar S. wrote:

hyphenation file, you encode the ä in it with the latin-1 encoding \344.
file to utf8, Austin.

Fixing the 1.8 version in the general case (any input, any language file
encoding) will be hard… and useless, since you would program towards a use
case that should go extinct.

I’m not so much looking for the general case, but this specific case,
since it’s a bug about a word that you filed four years ago (yes, the
one you linked)

Text::Hyphen under Ruby 1.8 has always said you need to match the
encoding of the input to the encoding of the hyphenation file (and
that’ll still be true under Ruby 1.9, but at least there it’ll be a
consistent UTF-8 encoding for all hyphenation files). I just forgot
that for this particular test.

More than one solution offers itself

a) convert the file test_bugs.rb back to latin1 (-> bad, will break soon
again)

Doing that would cause Ruby 1.9 to fail. If I’m willing to split the
test into 1.8 and 1.9 versions (and use load) for the specific failing
bug, then I can make this work for this release.

b) digging back through the old version history (I am sure you have it ;)) -
trying to see if [1] was specifically about German umlauts or if it was just
the German and the size of the word that tripped the bug. If it was one of
the latter - then remove those damn umlauts from the word (ä → ae, ü → ue)
and use the new test expectations that derive from that. This would make the
file ASCII again, and less sensitive to editor conversion.

It was the umlauts, and (ahem) you filed the bug with the umlauts.

c) The solution you say you don’t want: Dropping 1.8 support from newer
gems. Since bundler & rvm this is increasingly simple to manage - I’ll just
limit my old projects to use an old version of text-hyphen.

Considering the impossible (aka: very laborious and quite not to the point)
nature of the bug in 1.8, I would choose c) or (if must be) b).

I’m trying to get out one more 1.8 release (this one), and then
Text::Hyphen (or its successor) will happily be 1.9 only. This is a
“final 1.8” release and then I’m going to bump the major version if I
keep the project name (which is a good one) and put “ruby >= 1.9.2” in
the gemspec. This is the transitional release only.

austin · July 15, 2011, 10:11pm

It was the umlauts

Man … Ruby 1.9.x hates umlauts.

hugs his 1.8.7 install

austin · July 15, 2011, 3:38pm

On Fri, Jul 15, 2011 at 9:06 AM, Austin Z. wrote:

Thanks everyone for the comments received. I’ve taken the approach
that I mentioned in my last message in response to Kaspar. You can see
the latest test code (where I have two data files; one latin1 and one
UTF-8). I’ll be preparing a release this weekend.

Sadly, JRuby in 1.9 mode won’t work because of an apparent bug in
JRuby itself, and “jruby --1.9 -S rake test” only looks like it works
because the test actually runs JRuby again in 1.8 mode. A bug has been
filed for the former case, but an improvement has not yet been filed
for the latter case.

-a
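
As a rough sketch of the two-data-file approach mentioned in this message: the test can pick its data file by checking whether the running Ruby has encoding-aware strings. The file names and the helper bug_data_file below are hypothetical, chosen for illustration; the actual layout in the text-hyphen repository may differ.

```ruby
# Hypothetical sketch: the file names and helper are assumptions, not
# the layout actually used in the text-hyphen test suite.

# Ruby 1.9+ strings respond to #encoding; Ruby 1.8 strings do not.
def bug_data_file(encoding_aware = "".respond_to?(:encoding))
  if encoding_aware
    "test/data/bug_utf8.rb"   # UTF-8 input for UTF-8 hyphenation files
  else
    "test/data/bug_latin1.rb" # Latin-1 input for the 1.8 hyphenation files
  end
end

# In the test file, the chosen data would then be pulled in with load:
#   load bug_data_file
puts bug_data_file
```

Keeping the two inputs in separate files means each file can be saved in a single encoding, so an editor is less likely to convert one of them by accident.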

austin · July 16, 2011, 12:41am

On Fri, Jul 15, 2011 at 8:37 AM, Austin Z. wrote:

Sadly, JRuby in 1.9 mode won’t work because of an apparent bug in
JRuby itself, and “jruby --1.9 -S rake test” only looks like it works
because the test actually runs JRuby again in 1.8 mode. A bug has been
filed for the former case, but an improvement has not yet been filed
for the latter case.

Ok, I see your bugs. We’ll have a look into it.

FWIW, you can specify JRUBY_OPTS=--1.9 and it will pass through to the
child JRuby instances too. But I agree, we need a dotfile or similar
to force it.

Charlie

austin · July 16, 2011, 4:09pm

On 2011-07-15, at 18:40, Charles Oliver N. wrote:

child JRuby instances too. But I agree, we need a dotfile or similar
to force it.

I think it’s a little more subtle than that, as I noted in my last
comment on the --1.9 improvement request. When JRuby starts with --1.9
(whether through an arg, an opt, or a dotfile), it should essentially
do:

ENV["JRUBY_OPTS"] = "--1.9"

Of course, it should be a bit smarter than that, preserving other
values, but this way you get the same expected behaviour that you get
when MRI spawns another instance of MRI based on
RbConfig::CONFIG["ruby_install_name"].

-a « from my iPad
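
The "a bit smarter, preserving other values" idea from the message above could look something like the following. This is only a sketch of the suggestion in plain Ruby, not how JRuby implements or would implement it; merge_jruby_opts is a made-up helper name.

```ruby
# Hypothetical helper illustrating the suggestion; not JRuby's own code.

# Add a flag to an existing JRUBY_OPTS-style string without dropping
# options that are already there and without adding the flag twice.
def merge_jruby_opts(existing, flag = "--1.9")
  opts = existing.to_s.split
  opts << flag unless opts.include?(flag)
  opts.join(" ")
end

# A parent process would then export the merged value so that any child
# interpreter it spawns starts in the same mode:
#   ENV["JRUBY_OPTS"] = merge_jruby_opts(ENV["JRUBY_OPTS"])
puts merge_jruby_opts("--debug")
```

Splitting on whitespace is a simplification: it would mishandle option values that themselves contain spaces.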