Encoding problem?

Hi,

Trying to learn Ruby, I am writing a script to migrate from pyblosxom
to WordPress. As you may know, pyblosxom stores all entries and comments
in text files under a directory hierarchy. My idea is to read all those
files (the subdirectories store the categories) and insert them into the
MySQL database that WordPress uses.
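
A minimal sketch of that reading step, in case it helps to see the idea. The helper name `collect_entries` and the `txt` extension are assumptions made for illustration, not something the blog software or my script dictates. It only uses `Dir.glob` and `Pathname` from the standard library, and takes the category from the subdirectory path relative to the root:

```ruby
require 'pathname'

# Hypothetical helper: walks the entry tree under `root` and returns one
# hash per entry file. The category is the subdirectory path relative to
# `root` (nil for files that sit directly in the root directory).
# Assumption: entries are plain files ending in ".txt".
def collect_entries(root, extension = 'txt')
  root_path = Pathname.new(root)
  pattern = File.join(root, '**', "*.#{extension}")

  Dir.glob(pattern).sort.map do |file|
    relative_dir = Pathname.new(file).relative_path_from(root_path).dirname.to_s
    category = relative_dir == '.' ? nil : relative_dir
    { path: file, category: category }
  end
end
```

Each hash can then be turned into one row for the posts table, with the category looked up or created separately.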

So far, I have been able to read all the posts and comments, but I am
having some problems inserting them into mysql (BTW, I am using the mysql
module). The problem, I guess, is some sort of encoding issue with the
text.
Basically I have two problems:

  • Accented characters. For example, an accented vowel like “í” is not
    inserted properly into the mysql table and shows up as weird
    characters. I guess that a function that substitutes every one
    of these characters with its HTML entity (i.e. &iacute;) would work, but
    there must be a more appropriate way to do it, right? Does it have
    anything to do with the encoding?

  • Also, WordPress seems to interpret \n characters (I guess). For
    example, a post like the following:


This is an example of an <img
src="image.jpg"> image.


would turn into:


This is an example of an <img<br />
src="image.jpg"> image.


interpreting the \n character right after <img and inserting a br tag,
which breaks the HTML. I thought that if I deleted all the \n
characters it would be fine, but some posts contain pre tags where the
\n characters are required.
Any ideas on this?
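
One possible workaround, as a rough sketch rather than a proper HTML parser: join only the line breaks that fall inside a tag (between "<" and ">") and leave every other newline alone, so the text inside pre blocks keeps its line breaks. The helper name `join_newlines_inside_tags` is made up for illustration, and the regular expression assumes that attribute values never contain "<" or ">" and that literal angle brackets in the text are escaped:

```ruby
# Hypothetical helper: replaces line breaks that occur inside an HTML tag
# with a single space, so the tag ends up on one line. Newlines outside
# tags, including the text between <pre> and </pre>, are left untouched.
# Assumption: no "<" or ">" inside attribute values or unescaped in text.
def join_newlines_inside_tags(html)
  html.gsub(/<[^<>]*>/) do |tag|
    tag.gsub(/\s*\n\s*/, ' ')
  end
end

post = "This is an example of an <img\nsrc=\"image.jpg\"> image."
puts join_newlines_inside_tags(post)
# prints: This is an example of an <img src="image.jpg"> image.
```

Running each post body through a helper like this before inserting it should leave nothing inside a tag for the line-break conversion to act on.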

Anyway, thanks in advance! :)

James B. wrote:

It may be that you need to tell MySQL to use a particular character set

http://dev.mysql.com/tech-resources/articles/4.1/unicode.html

might have some info.

Umm, but I would like to do it from within Ruby. I mean, all the new
posts in the database (inserted using the web form) work OK, so I
guess the fix has to be made programmatically in my script, right?
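
To make that concrete, here is a hedged sketch of what doing it in the script could look like. It rests on two assumptions that I have not verified: that the entry files are stored as ISO-8859-1, and that the WordPress tables expect UTF-8. The helper names `to_utf8` and `use_utf8_connection` are made up for illustration. The first transcodes the raw file bytes with `String#encode`. The second sends the MySQL statement `SET NAMES 'utf8'` through whatever connection object is passed in, which only needs to respond to `query`, as the connection object of the mysql module does:

```ruby
# Assumption: the source files are ISO-8859-1 (Latin-1). Change this
# constant if the files turn out to use another encoding.
SOURCE_ENCODING = Encoding::ISO_8859_1

# Hypothetical helper: treats the raw bytes read from a file as
# SOURCE_ENCODING and returns the same text as a UTF-8 string.
def to_utf8(raw_bytes, source_encoding = SOURCE_ENCODING)
  raw_bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# Hypothetical helper: tells the server that this connection sends and
# expects UTF-8. `db` is any object that responds to `query`.
def use_utf8_connection(db)
  db.query("SET NAMES 'utf8'")
  db
end
```

With both in place, the script would call `use_utf8_connection` once after connecting and pass every title and body through `to_utf8` before building the INSERT statement.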

Thanks
