Dump and import MySQL table w/ accents


#1

I am switching to a composite primary key (string and user ID) from the
Rails
conventional auto-incrementing integer primary ID. The table is large
(2.5
million records) and I’d rather not discard the contents. The
composite_primary_key gem doesn’t appear to support altering the table
with a
migration to do its magic, only creating a table from scratch. So I
dumped
the table with mysqldump, ran the migration (table looks good), and am
trying
to repopulate the table. It has accented characters and is complaining
about
duplicates, apparently around words with and without accents, e.g.,
‘jose’ and
‘josé’. I’ve been deleting one by hand from the dump, but it is tedious
and
very slow. Emacs crawls when dealing with very large files with very
long
lines.

I just don’t understand why the accents are causing problems. The
string
column is utf8_general_ci collation, just like other fields in the
database
with strings with accents. What do I need to specify so it will import
the
dump? Is there a problem with strings with accents in composite
indexes?

The table is created with a Rails migration, but everything else is pure
MySQL
utilities.

TIA,
Jeffrey


#2

Ensure your database.yml file has a line like:

encoding: utf8

On May 26, 10:50 am, “Jeffrey L. Taylor” removed_email_address@domain.invalid


#3

Quoting BenH removed_email_address@domain.invalid:

Ensure your database.yml file has a line like:

encoding: utf8

Thank you. It has for over a year. The data dumped was UTF-8, but
something
is preventing it from being re-imported.

Sigh,
Jeffrey


#4

Quoting Jeffrey L. Taylor removed_email_address@domain.invalid:

very slow. Emacs crawls when dealing with very large files with very long
lines.

I just don’t understand why the accents are causing problems. The string
column is utf8_general_ci collation, just like other fields in the database
with strings with accents. What do I need to specify so it will import the
dump? Is there a problem with strings with accents in composite indexes?

The answer is mixed and bits of it don’t make sense. If I work with the
MySQL
client (mysql), I can create records with primary keys identical, except
for
an accent. Doing the same thing in the Rails console throws an
DuplicateKey
exception. Using mysqldump to dump the contents of a table, changing
the
table so the primary key is a integer-string composite and trying to
repopulate the table from the mysqldump output also barfs on a duplicate
key
exception.

It is non-optimal, but I can live for the moment with a table with two
non-unique indexes.

Jeffrey


#5

Jeffrey L. Taylor wrote:

The string column is utf8_general_ci collation, …

As far as I understand, utf8_general_ci is case-insensitive. Maybe, you
want to give utf8_bin a shot.

mysql> create table t(i int, v varchar(10) collate utf8_general_ci,
constraint t_pk primary key(i, v));
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t select 1, ‘jose’ union all select 1, ‘josé’;
ERROR 1062 (23000): Duplicate entry ‘1-josé’ for key 1
mysql> drop table t;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t(i int, v varchar(10) collate utf8_bin, constraint
t_pk primary key(i, v));
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t select 1, ‘jose’ union all select 1, ‘josé’;
Query OK, 2 rows affected (0.01 sec)
Records: 2 Duplicates: 0 Warnings: 0

mysql> select * from t;
±–±------+
| i | v |
±–±------+
| 1 | jose |
| 1 | josé |
±–±------+
2 rows in set (0.00 sec)


#6

Wisccal W. wrote:

Jeffrey L. Taylor wrote:

As far as I understand, utf8_general_ci is case-insensitive.

I meant to say “accent-insensitive”…


#7

Quoting Wisccal W. removed_email_address@domain.invalid:

Wisccal W. wrote:

Jeffrey L. Taylor wrote:

As far as I understand, utf8_general_ci is case-insensitive.

I meant to say “accent-insensitive”…

Thank you. I didn’t think MySQL would messup that badly. I’ll try this
out
this week. I just checked and my development and production machines
have
different collation, which explains the inconsistent results I was
seeing in
my tests.

Jeffrey