It is very important to use the same character set across all tables and across all databases.
The default character set of each database can be listed using this query...
mysql> select * from information_schema.SCHEMATA;
+--------------+--------------------+----------------------------+------------------------+----------+
| CATALOG_NAME | SCHEMA_NAME | DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME | SQL_PATH |
+--------------+--------------------+----------------------------+------------------------+----------+
| def | information_schema | utf8 | utf8_general_ci | NULL |
| def | emailplatform | latin1 | latin1_swedish_ci | NULL |
| def | mysql | latin1 | latin1_swedish_ci | NULL |
| def | performance_schema | utf8 | utf8_general_ci | NULL |
| def | test | latin1 | latin1_swedish_ci | NULL |
| def | test1 | latin1 | latin1_swedish_ci | NULL |
+--------------+--------------------+----------------------------+------------------------+----------+
6 rows in set (0.01 sec)
If some of the tables in a database do not match the other tables, we need to alter those tables. For example, if one or two tables in the test database are using utf8 encoding, then the best choice is to drop and recreate those tables as latin1.
There are two points to note here...
1) The table in question should not be part of foreign key relations.
2) The table should not actually contain unicode characters, because once we convert it to latin1 there will be no way to store them.
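Point 2 can be illustrated in Python (the sample strings are my own): latin1 covers only Unicode code points 0-255, so anything beyond that range simply cannot be stored once the table is converted.

```python
# latin1 (ISO-8859-1) maps one byte per character, code points 0-255 only.
text = "café"                    # the accented e is U+00E9, inside latin1
print(text.encode("latin1"))     # b'caf\xe9' - round-trips fine

price = "price: €5"              # the euro sign is U+20AC, outside latin1
try:
    price.encode("latin1")
except UnicodeEncodeError as exc:
    print("not representable in latin1:", exc)
```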
The easiest way to check whether the tables in a given database use different character sets is the mysqldump command, as shown here...
root@ip-10-86-106-75:/home/ubuntu# mysqldump test --no-data | grep ENGINE
) ENGINE=TokuDB DEFAULT CHARSET=latin1;
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
) ENGINE=TokuDB DEFAULT CHARSET=latin1;
This means a few tables use the utf8 character set while the others are latin1.
You will have issues when joining a latin1 table with a utf8 table: the query will not use indexes if the joined columns have different character sets.
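The underlying byte-level mismatch is easy to see in Python (sample string mine): the same accented text produces different bytes in latin1 and utf8, so an index built over one encoding cannot be probed with values from the other without converting every row first.

```python
s = "café"
print(s.encode("latin1"))   # b'caf\xe9'     - accent is one byte
print(s.encode("utf-8"))    # b'caf\xc3\xa9' - accent is two bytes
# Identical strings, different bytes: MySQL must convert one side of the
# join before comparing, which is why the index is skipped.
```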
_____
The following query, run against a copy of the "information_schema" tables, shows that there is one table in the test database with a utf8 collation. All other tables in the test database are latin1. Therefore I need to change that single utf8 table to latin1.
# create a new table in test database
create table test.tables select * from information_schema.tables;
# check for table_collations
mysql> select table_schema, table_collation, count(*) as cnt from test.tables group by table_schema, table_collation;
+--------------------+-------------------+-----+
| table_schema | table_collation | cnt |
+--------------------+-------------------+-----+
| emailplatform | latin1_swedish_ci | 64 |
| information_schema | utf8_general_ci | 45 |
| mysql | latin1_swedish_ci | 1 |
| mysql | utf8_bin | 8 |
| mysql | utf8_general_ci | 15 |
| performance_schema | utf8_general_ci | 17 |
| test | latin1_swedish_ci | 18 |
| test | utf8_general_ci | 1 |
| test1 | latin1_swedish_ci | 1 |
+--------------------+-------------------+-----+
9 rows in set (0.00 sec)
To find the name of the table I use this query:
mysql> select table_name from tables where table_schema = 'test' and table_collation like 'utf8%';
+------------+
| table_name |
+------------+
| cdr_master |
+------------+
1 row in set (0.01 sec)
And here is the count:
mysql> select count(*) from test.cdr_master;
+----------+
| count(*) |
+----------+
| 186166 |
+----------+
1 row in set (0.04 sec)
This is a relatively small table, so we can quickly change the character set.
But first we need to check that no other table is linked to it through foreign key relations.
mysql> select * from information_schema.KEY_COLUMN_USAGE where TABLE_SCHEMA='test' AND TABLE_NAME = 'cdr_master' LIMIT 1\G
Empty set (0.00 sec)
mysql> select * from information_schema.KEY_COLUMN_USAGE where REFERENCED_TABLE_SCHEMA='test' AND REFERENCED_TABLE_NAME = 'cdr_master' LIMIT 1\G
Empty set (0.06 sec)
_____
To change the default character set, we need to drop and recreate the table.
First, take a backup of the table (this writes the schema to /tmp/cdr_master.sql and the data to /tmp/cdr_master.txt):
# mysqldump --tab=/tmp/ test cdr_master
Change the character set:
# sed -i.bak 's/CHARSET=utf8/CHARSET=latin1/' /tmp/cdr_master.sql
Recreate the new table after dropping the old one:
# mysql test < /tmp/cdr_master.sql
Restore the data:
mysql> load data infile '/tmp/cdr_master.txt' into table test.cdr_master;
Query OK, 186166 rows affected (25.06 sec)
Records: 186166 Deleted: 0 Skipped: 0 Warnings: 0
Now query information_schema again to confirm that all the tables in the test database have the same table_collation.
mysql> select table_collation, count(*) as cnt from information_schema.tables where table_schema = 'test' group by table_collation;
+-------------------+-----+
| table_collation | cnt |
+-------------------+-----+
| latin1_swedish_ci | 19 |
+-------------------+-----+
1 row in set (0.00 sec)
_____
In order to check whether the utf8 encoded table really contains any unicode characters, you need to take a backup of the table.
mysqldump test tbl_name > todel.txt
Then run this Python script. If the script processes all the lines without raising an error, the data is compatible with latin1.
import codecs
f = codecs.open("/tmp/todel.txt", "r", "utf-8")
for line in f:
    line.encode('latin1')
You have unicode data in your table if you get a unicode error like this...
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 31-35: ordinal not in range(256)
In that case, you cannot continue with the task of table re-creation unless you decide to ignore or replace those unicode characters.
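If you do decide to ignore the unicode characters, Python's encode error handlers show what that loss looks like (sample string mine): errors='replace' turns each offending character into a question mark, while errors='ignore' drops it entirely.

```python
s = "price: €5"
print(s.encode("latin1", errors="replace"))  # b'price: ?5'
print(s.encode("latin1", errors="ignore"))   # b'price: 5'
```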
_____
If you want to check which lines contain unicode characters, you need another type of dump (the --skip-extended-insert option generates one INSERT line per record in the table):
# mysqldump test todel --skip-extended-insert > todel.txt
And the following Python code will display all the records where unicode characters are used.
import codecs
f = codecs.open("/tmp/todel.txt", "r", "utf-8")
for line in f:
    try:
        line.encode('latin1')
    except UnicodeEncodeError:
        print(line)
_____
If recreating the entire table is not an option, then simply change the single column to latin1.
mysql> alter table test.cdr_master modify column call_uuid varchar(50) charset latin1;
This will make the join very fast when the column "call_uuid" is used in the join condition. The other table's "call_uuid" column must also be latin1.
_____
If you use EXPLAIN EXTENDED, you can see what MySQL is doing internally.
mysql> explain extended select * from a inner join b on a.column = b.column;
mysql> show warnings;
The warning will show that MySQL converts b.column from utf8 to latin1 in order to match it against a.column. This internal conversion prevents MySQL from using an index on the column.
Labels: mysql tips, python, unicode