Forum: Ruby Difficult RegEx substitutions (how to treat names in bibliography properly with .downcase!)

Honza Hejzl (welblaud)
on 2014-07-24 13:34
I wonder whether Ruby offers a way to solve a problem typical of many
publishing houses… Many editors and authors like to write surnames in
capitals when they compile a bibliography.

Typically:

ČERMÁK, F., BLATNÁ, R. (eds.) Manuál lexikografie. Jinočany: H&H, 1995.
DINGOVÁ, N., DVOŘÁKOVÁ, M., FALTÍNOVÁ, R., OKROUHLÍKOVÁ, L., SERVUSOVÁ,
J., ET AL.: Slovník znakového jazyka terminologie z období těhotenství,
porodu a péče o novorozence [CD-ROM]. Praha: Tamtam, 2005.

DÉGERANDO, J. M. De l’éducation des sourds-muets de naissance. Paris:
Chez Méquignon l’ainé père, 1827. [online]. [cit. 2012-02-14]. Dostupné
na: http://books.google.cz/

ECKHARDT, A., FRIED, V., HOFFMAN, M., HOKSZA, D., MATÝSKOVÁ, M.
Softwarový projekt Znak. Praha, 2005. Ročníková práce. Praha. MFF UK.

As you can see, the names include accented characters. I would be happy
if I could turn these names into the form Abbb (the first letter stays
capitalized, the remainder becomes lowercase).

So far I have come up with this horrible thing. It finds what I need,
but I have a problem with the substitution. As you can see, the problem
arises when the first captured char == the second. That means MOTLIK
works perfectly, but ÉPÉE becomes ÉpÉe. I have no idea whether I should
use a different RegEx or whether this can be solved in the replacement
part:

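par = ARGV[0]

# Read the whole input file as UTF-8; File.read closes the file itself.
reg = File.read(par, encoding: Encoding::UTF_8)

overkill = "A-ZÁÄÉËĚÍÓÖÔÚŮÜÝČĎŇŘŠŤŽĹĽ"
# The lookbehind keeps the initial capital out of the match, so the block
# only receives the rest of the surname plus the trailing comma.
reg.gsub!(/(?<=([#{overkill}]{1}))([#{overkill}]{1,15})([^.a-z\s\)-]),/) { |s|
  # downcase, not downcase!: the bang version returns nil when nothing changes.
  # This is the step that goes wrong for ÉPÉE.
  s.downcase
}

puts reg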
David Unric (dunric)
on 2014-07-24 22:20
Hi,

I am not going to comment on the somewhat hairy regex patterns in the
example.

Concerning the case issue, according to the documentation
String#downcase and String#upcase are effective with ASCII characters
only. You would need to look for an external module with more advanced
Unicode support, such as http://rubygems.org/gems/unicode.
With it, the above example would result in
> Unicode::downcase 'ÉPÉE'
=> "épée"
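
A note for readers on a newer Ruby: since Ruby 2.4 the built-in String#downcase applies full Unicode case mapping, so the gem may no longer be needed for this particular case. The following is only a minimal sketch, assuming Ruby 2.4 or later. It uses the :ascii option to reproduce the older ASCII-only behaviour, which is where the ÉpÉe in the question comes from.

```ruby
# Assumes Ruby 2.4+, where String#downcase also handles non-ASCII letters.
name = "ÉPÉE"

# :ascii limits the mapping to A-Z, like String#downcase in older Rubies.
ascii_only = name.downcase(:ascii)

# Without options, accented capitals are lowercased as well.
full = name.downcase

puts ascii_only
puts full
```

With the ASCII-only mapping, only P and E are lowercased, so the accented capitals survive regardless of their position in the name.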

For locale-based Unicode collation, another module such as ffi-icu (a
wrapper around the ICU library) would be required.
Honza Hejzl (welblaud)
on 2014-07-25 07:29
David, thank you a lot!
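
Putting the pieces of this thread together, the whole substitution could look roughly like the sketch below. It rests on assumptions that are not from the thread: Ruby 2.4 or later (so that String#downcase lowercases accented letters without a gem), the hypothetical helper name lowercase_surnames, and the Unicode property \p{Lu} in place of the hand-written list of capitals. Unlike the original pattern it has no upper limit on the surname length.

```ruby
# Hypothetical helper; the name and the \p{Lu} pattern are assumptions.
# Requires Ruby 2.4+ so that downcase also lowercases accented letters.
def lowercase_surnames(text)
  # (?<=\p{Lu})  the initial capital stays outside the match
  # \p{Lu}{2,}   the rest of an all-caps surname (at least two more letters)
  # (?=,)        only surnames directly followed by a comma
  text.gsub(/(?<=\p{Lu})\p{Lu}{2,}(?=,)/) { |rest| rest.downcase }
end

sample = "ČERMÁK, F., BLATNÁ, R. (eds.) Manuál lexikografie. Jinočany: H&H, 1995."
puts lowercase_surnames(sample)
```

Initials such as "F.", and strings like "H&H" or "ET AL.", are left alone because they never contain an initial capital followed by two more capitals and a comma. On an older Ruby, the block body would need Unicode::downcase from the gem mentioned above instead of the built-in downcase.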