Difficult RegEx substitutions (how to treat names in bibliography properly with .downcase!)


#1

I wonder whether Ruby offers a way to solve a typical problem of many
publishing houses: many editors and authors like to write surnames in
capitals when they compile a bibliography.

Typically:

ČERMÁK, F., BLATNÁ, R. (eds.) Manuál lexikografie. Jinočany: H&H, 1995.
DINGOVÁ, N., DVOŘÁKOVÁ, M., FALTÍNOVÁ, R., OKROUHLÍKOVÁ, L., SERVUSOVÁ,
J., ET AL.: Slovník znakového jazyka terminologie z období těhotenství,
porodu a péče o novorozence [CD-ROM]. Praha: Tamtam, 2005.

DÉGERANDO, J. M. De l’éducation des sourds-muets de naissance. Paris:
Chez Méquignon l’ainé père, 1827. [online]. [cit. 2012-02-14]. Dostupné
na: http://books.google.cz/

ECKHARDT, A., FRIED, V., HOFFMAN, M., HOKSZA, D., MATÝSKOVÁ, M.
Softwarový projekt Znak. Praha, 2005. Ročníková práce. Praha. MFF UK.

As you may have noticed, the names include accented characters. I would
be happy if I could convert these names to the form Abbb (the first
letter stays capitalized, the remainder becomes lowercase).

So far I have come up with this horrible thing. It properly finds what
I need, but I have a problem with the substitution. As you can see, the
problem appears when the first character found equals the second. That
is, MOTLIK works perfectly, but ÉPÉE becomes ÉpÉe. I have no idea
whether I should use another RegEx or whether this can be solved in the
replacement part:

par = ARGV[0]

text = open(par)
reg = text.read.force_encoding(Encoding::UTF_8)

overkill = "A-ZÁÄÉËĚÍÓÖÔÚŮÜÝČĎŇŘŠŤŽĹĽ"
reg.gsub!(/(?<=([#{overkill}]{1}))([#{overkill}]{1,15})([^.a-z\s)-]),/) { |s|
  if $1 == $2; #???
    #???
    $2.downcase! + ',' #???
  else
    #???
  end
}

puts reg
text.close()


#2

Hi,

I am not going to comment on the somewhat hairy regex patterns in the
example.

Concerning the case issue: according to the documentation,
String#downcase and String#upcase affect ASCII characters only (this
applies to Ruby versions before 2.4). There you would need an external
module with more advanced Unicode support, such as
http://rubygems.org/gems/unicode.
The above example would then result in

Unicode::downcase 'ÉPÉE'
=> "épée"

For locale-based Unicode collation, another module such as ffi-icu (a
wrapper around the ICU library) would be required.
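
For readers on a newer Ruby, here is a minimal sketch of the whole
substitution without an external gem. It assumes Ruby 2.4 or later,
where String#capitalize and String#downcase apply Unicode case mapping.
The helper name capitalize_surnames is hypothetical and does not come
from this thread, and the pattern is a simplified stand-in for the
original regex: it only treats runs of two or more uppercase letters
that are directly followed by a comma as surnames.

```ruby
# Assumption: Ruby >= 2.4, so case mapping covers non-ASCII letters.

# Hypothetical helper (not from the thread): converts all-caps surnames
# that are directly followed by a comma into the "Abbb" form.
def capitalize_surnames(text)
  # (?<!\p{L})  the run must not start in the middle of a word
  # \p{Lu}{2,}  two or more uppercase letters, accented or not
  # (?=,)       a comma must follow, but it is not consumed
  text.gsub(/(?<!\p{L})\p{Lu}{2,}(?=,)/) { |surname| surname.capitalize }
end

line = "ČERMÁK, F., BLATNÁ, R. (eds.) Manuál lexikografie. Jinočany: H&H, 1995."
puts capitalize_surnames(line)
# => Čermák, F., Blatná, R. (eds.) Manuál lexikografie. Jinočany: H&H, 1995.

# ASCII-only case mapping reproduces the symptom from the question:
puts "ÉPÉE".downcase(:ascii)
# => ÉpÉe
```

Initials such as "F." and strings such as "ET AL.:" stay untouched
because no comma follows the uppercase run directly. The last line
suggests that the ÉpÉe result comes from ASCII-only downcasing rather
than from the first and second characters being equal.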


#3

David, thank you a lot!