Markdown

[MySQL] Charset & Collation

https://hackmd.io/@kberCpYVRlesIp9TNhFY5A/HJvgsr9M8

前陣子被問了一個問題,大概查了一下整理成筆記紀錄一下

utf8mb4_0900_ai_ci 是不是可以無痛直接轉成 utf8mb4_general_ci ?

結論:不建議,部份搜尋結果可能會因此改變。
此外 utf8mb4_general_ci 已經是比較舊的版本了, 有一些已知的問題, 原則上 utf8mb4_0900_ai_ci 比較完備,8.0 版的預設也是使用此 collation,
一些特性如同字母不同標音,同字母大小寫,空白字元的判定等皆有差異,
詳見以下…
5.5.3+: charset=utf8mb4, collation=utf8mb4_unicode_ci
5.7.7+: charset=utf8mb4, collation=utf8mb4_unicode_520_ci
8.0.0+: charset=utf8mb4, collation=utf8mb4_0900_ai_ci

10.3.1 Table 10.1 Collation Suffix Meanings

utf8mb4_0900_ai_ci
意義解釋:
{CHARSET}{UCA VERESION}{LOCALE}{SUFFIX}{SUFFIX}
  • utf8mb4 表示字元集
  • 0900 表示使用 UCA 9.0.0 版本
  • ai 表示同一字母的不同標音不影響排序_
  • ci 表示同一字母的大小寫不影響排序


關於 SUFFIX…
SuffixMeaning
_aiAccent-insensitive
_asAccent-sensitive
_ciCase-insensitive
_csCase-sensitive
_ksKana-sensitive
_binBinary

關於 UCA …

allkeys-9.0.0.txt

有的沒的實驗:

關於 Accent、Case Sensitive/Insensitive mysql> SELECT 'abad' = 'abád' collate utf8mb4_0900_as_cs; +---------------------------------------------+ | 'abad' = 'abád' collate utf8mb4_0900_as_cs | +---------------------------------------------+ | 0 | +---------------------------------------------+ 1 row in set (0.00 sec) mysql> SELECT 'abad' = 'abád' collate utf8mb4_0900_ai_ci; +---------------------------------------------+ | 'abad' = 'abád' collate utf8mb4_0900_ai_ci | +---------------------------------------------+ | 1 | +---------------------------------------------+ 1 row in set (0.00 sec) mysql> SELECT 'a' = 'A' collate utf8mb4_0900_as_cs; +--------------------------------------+ | 'a' = 'A' collate utf8mb4_0900_as_cs | +--------------------------------------+ | 0 | +--------------------------------------+ 1 row in set (0.00 sec) mysql> SELECT 'a' = 'A' collate utf8mb4_0900_ai_ci; +--------------------------------------+ | 'a' = 'A' collate utf8mb4_0900_ai_ci | +--------------------------------------+ | 1 | +--------------------------------------+ 1 row in set (0.00 sec) mysql> SELECT 'a' = 'á' collate utf8mb4_general_ci; +---------------------------------------+ | 'a' = 'á' collate utf8mb4_general_ci | +---------------------------------------+ | 1 | +---------------------------------------+ 1 row in set (0.00 sec) mysql> mysql> SELECT 'A' = 'á' collate utf8mb4_general_ci; +---------------------------------------+ | 'A' = 'á' collate utf8mb4_general_ci | +---------------------------------------+ | 1 | +---------------------------------------+ 1 row in set (0.00 sec) mysql> SELECT 'A' = 'a' collate utf8mb4_general_ci; +--------------------------------------+ | 'A' = 'a' collate utf8mb4_general_ci | +--------------------------------------+ | 1 | +--------------------------------------+ 1 row in set (0.00 sec) 關於 PAD_ATTRIBUTE: mysql> SELECT 'a ' = 'a' collate utf8mb4_general_ci; +------------------------------------------+ | 'a ' = 'a' collate utf8mb4_general_ci | +------------------------------------------+ | 1 | +------------------------------------------+ 1 row in set (0.00 sec) mysql> SELECT ' a' = 'a' collate utf8mb4_general_ci; +-----------------------------------------+ | ' a' = 'a' collate utf8mb4_general_ci | +-----------------------------------------+ | 0 | +-----------------------------------------+ 1 row in set (0.00 sec) 關於排序: mysql> SELECT 'M' = 'm' collate utf8mb4_general_ci; +--------------------------------------+ | 'M' = 'm' collate utf8mb4_general_ci | +--------------------------------------+ | 1 | +--------------------------------------+ 1 row in set (0.00 sec) STRCMP(str1, str2) 比較兩個字符串,如果這兩個字符串相等返回0,如果第一個參數是根據當前的排序小於第二個參數順序返回-1,否則返回1。 mysql> SELECT STRCMP('M', 'm') collate utf8mb4_general_ci; +---------------------------------------------+ | STRCMP('M', 'm') collate utf8mb4_general_ci | +---------------------------------------------+ | -1 | +---------------------------------------------+ 1 row in set (0.12 sec) mysql> SELECT STRCMP('M', 'M') collate utf8mb4_general_ci; +---------------------------------------------+ | STRCMP('M', 'M') collate utf8mb4_general_ci | +---------------------------------------------+ | 0 | +---------------------------------------------+ 1 row in set (0.00 sec) mysql> SELECT STRCMP('m', 'M') collate utf8mb4_general_ci; +---------------------------------------------+ | STRCMP('m', 'M') collate utf8mb4_general_ci | +---------------------------------------------+ | 1 | +---------------------------------------------+ 1 row in set (0.01 sec)

Trailing Space Handling in Comparisons

MySQL collations have a pad attribute, which has a value of PAD SPACE or NO PAD:
Most MySQL collations have a pad attribute of PAD SPACE.
The Unicode collations based on UCA 9.0.0 and higher have a pad attribute of NO PAD; see Section 10.10.1, “Unicode Character Sets”.
  • For nonbinary strings (CHAR, VARCHAR, and TEXT values), the pad attribute determines how trailing spaces are treated in comparisons:
  • For PAD SPACE collations, trailing spaces are insignificant in comparisons; strings are compared without regard to any trailing spaces.
  • NO PAD collations treat spaces at the end of strings like any other character.
The differing behaviors can be demonstrated using the two utf8mb4 binary collations, one of which is PAD SPACE, the other of which is NO PAD. The example also shows how to use the INFORMATION_SCHEMA COLLATIONS table to determine the pad attribute for collations.
mysql> SELECT COLLATION_NAME, PAD_ATTRIBUTE FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE 'utf8mb4%bin'; +------------------+---------------+ | COLLATION_NAME | PAD_ATTRIBUTE | +------------------+---------------+ | utf8mb4_bin | PAD SPACE | | utf8mb4_0900_bin | NO PAD | +------------------+---------------+ mysql> SET NAMES utf8mb4 COLLATE utf8mb4_bin; mysql> SELECT 'a ' = 'a'; +------------+ | 'a ' = 'a' | +------------+ | 1 | +------------+ mysql> SET NAMES utf8mb4 COLLATE utf8mb4_0900_bin; mysql> SELECT 'a ' = 'a'; +------------+ | 'a ' = 'a' | +------------+ | 0 | +------------+ mysql> SELECT 'a ' = 'a' collate utf8mb4_general_ci; +------------------------------------------+ | 'a ' = 'a' collate utf8mb4_general_ci | +------------------------------------------+ | 1 | +------------------------------------------+ 1 row in set (0.00 sec) mysql> SELECT 'a ' = 'a' collate utf8mb4_0900_ai_ci; +------------------------------------------+ | 'a ' = 'a' collate utf8mb4_0900_ai_ci | +------------------------------------------+ | 0 | +------------------------------------------+ 1 row in set (0.00 sec)
mysql> SELECT COLLATION_NAME, PAD_ATTRIBUTE -> FROM INFORMATION_SCHEMA.COLLATIONS -> WHERE COLLATION_NAME LIKE 'utf8mb4%'; +----------------------------+---------------+ | COLLATION_NAME | PAD_ATTRIBUTE | +----------------------------+---------------+ | utf8mb4_general_ci | PAD SPACE | | utf8mb4_bin | PAD SPACE | | utf8mb4_unicode_ci | PAD SPACE | | utf8mb4_icelandic_ci | PAD SPACE | | utf8mb4_latvian_ci | PAD SPACE | | utf8mb4_romanian_ci | PAD SPACE | | utf8mb4_slovenian_ci | PAD SPACE | | utf8mb4_polish_ci | PAD SPACE | | utf8mb4_estonian_ci | PAD SPACE | | utf8mb4_spanish_ci | PAD SPACE | | utf8mb4_swedish_ci | PAD SPACE | | utf8mb4_turkish_ci | PAD SPACE | | utf8mb4_czech_ci | PAD SPACE | | utf8mb4_danish_ci | PAD SPACE | | utf8mb4_lithuanian_ci | PAD SPACE | | utf8mb4_slovak_ci | PAD SPACE | | utf8mb4_spanish2_ci | PAD SPACE | | utf8mb4_roman_ci | PAD SPACE | | utf8mb4_persian_ci | PAD SPACE | | utf8mb4_esperanto_ci | PAD SPACE | | utf8mb4_hungarian_ci | PAD SPACE | | utf8mb4_sinhala_ci | PAD SPACE | | utf8mb4_german2_ci | PAD SPACE | | utf8mb4_croatian_ci | PAD SPACE | | utf8mb4_unicode_520_ci | PAD SPACE | | utf8mb4_vietnamese_ci | PAD SPACE | | utf8mb4_0900_ai_ci | NO PAD | | utf8mb4_de_pb_0900_ai_ci | NO PAD | | utf8mb4_is_0900_ai_ci | NO PAD | | utf8mb4_lv_0900_ai_ci | NO PAD | | utf8mb4_ro_0900_ai_ci | NO PAD | | utf8mb4_sl_0900_ai_ci | NO PAD | | utf8mb4_pl_0900_ai_ci | NO PAD | | utf8mb4_et_0900_ai_ci | NO PAD | | utf8mb4_es_0900_ai_ci | NO PAD | | utf8mb4_sv_0900_ai_ci | NO PAD | | utf8mb4_tr_0900_ai_ci | NO PAD | | utf8mb4_cs_0900_ai_ci | NO PAD | | utf8mb4_da_0900_ai_ci | NO PAD | | utf8mb4_lt_0900_ai_ci | NO PAD | | utf8mb4_sk_0900_ai_ci | NO PAD | | utf8mb4_es_trad_0900_ai_ci | NO PAD | | utf8mb4_la_0900_ai_ci | NO PAD | | utf8mb4_eo_0900_ai_ci | NO PAD | | utf8mb4_hu_0900_ai_ci | NO PAD | | utf8mb4_hr_0900_ai_ci | NO PAD | | utf8mb4_vi_0900_ai_ci | NO PAD | | utf8mb4_0900_as_cs | NO PAD | | utf8mb4_de_pb_0900_as_cs | NO PAD | | utf8mb4_is_0900_as_cs | NO PAD | | utf8mb4_lv_0900_as_cs | NO PAD | | utf8mb4_ro_0900_as_cs | NO PAD | | utf8mb4_sl_0900_as_cs | NO PAD | | utf8mb4_pl_0900_as_cs | NO PAD | | utf8mb4_et_0900_as_cs | NO PAD | | utf8mb4_es_0900_as_cs | NO PAD | | utf8mb4_sv_0900_as_cs | NO PAD | | utf8mb4_tr_0900_as_cs | NO PAD | | utf8mb4_cs_0900_as_cs | NO PAD | | utf8mb4_da_0900_as_cs | NO PAD | | utf8mb4_lt_0900_as_cs | NO PAD | | utf8mb4_sk_0900_as_cs | NO PAD | | utf8mb4_es_trad_0900_as_cs | NO PAD | | utf8mb4_la_0900_as_cs | NO PAD | | utf8mb4_eo_0900_as_cs | NO PAD | | utf8mb4_hu_0900_as_cs | NO PAD | | utf8mb4_hr_0900_as_cs | NO PAD | | utf8mb4_vi_0900_as_cs | NO PAD | | utf8mb4_ja_0900_as_cs | NO PAD | | utf8mb4_ja_0900_as_cs_ks | NO PAD | | utf8mb4_0900_as_ci | NO PAD | | utf8mb4_ru_0900_ai_ci | NO PAD | | utf8mb4_ru_0900_as_cs | NO PAD | | utf8mb4_zh_0900_as_cs | NO PAD | | utf8mb4_0900_bin | NO PAD | +----------------------------+---------------+ 75 rows in set (0.01 sec)

補充:

字元集的修改步驟:

模擬 latin1 -> gbk
1.匯出資料表結構
mysqldump -uroot -p -h127.0.0.1 --default-charcter-set=gbk -d databasename > createtab.sql // --default-charcter-set=gbk 代表設置以哪種字元集連接 // -d 代表只會出結構,不匯出資料
2.手動修改 createtab.sql 中表結構定義中的字元集為新的字元集
3.確定紀錄不會再更新異動,會出所有紀錄
mysqldump -uroot -p -h 127.0.0.1 --quick --no-create-info --extended-insert --default-character-set=latin1 databasename> data.sql // --quick,-q 該選項用於轉儲大的資料表。它強制mysqldump從伺服器一次一行地檢索資料表中的行而不是檢索所有行並在輸出前將它緩存到內存中。 // --no-create-info,-t 不寫重新建立每個轉儲資料表的CREATE TABLE語句。 // --extended-insert,-e 使用包括幾個VALUES列資料表的多行INSERT語法。這樣使轉儲檔案更小,重載檔案時可以加速插入。 // --default-character-set=charset 使用charsetas預設字元編碼。參見5.10.1節,「數據和排序用字元編碼」。如果沒有指定,mysqldump使用utf8。
  1. 打開 data.sql 將 set names latin1 修改成 set names gbk
  2. 使用新的字元集建立新的資料庫
create database databasename default charset gbk;
6.建立表,執行 createtab.sql
mysql -uroot -p databasename < createtab.sql
7.導入資料,執行 data.sql
mysql -uroot -p databasename < data.sql
註: 選擇目標字元集的時候,最好是原始字元集的超集合,或者確定比原始字元集的字形檔案更大,若反之則不支援的會變亂碼。

utf8mb4_general_ciutf8mb4_0900_ai_ciutf8_general_ci
accentaiaiai?
casecicici
padspacenospace
charsetutf8mb4utf8mb4utf8mb3

join 欄位時比較目標若 Collation 不相同,則無法比較。

Next Illuminate\Database\QueryException: SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8_unicode_ci,IMPLICIT) for operation '=' (SQL: select `TEST`.`TEST` from `TEST` inner join `TEST` on `TEST`.`TEST` = `TEST`.`code` where `TEST`.`TEST` < 2887057409 order by `TEST`.`ip` desc limit 1) in /var/www/html/waca/vendor/laravel/framework/src/Illuminate/Database/Connection.php:651
Stack trace:


mysql> SELECT @@character_set_database, @@collation_database;                                                                                                                                               +--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| utf8mb4                  | utf8mb4_0900_ai_ci   |
+--------------------------+----------------------+
1 row in set (0.00 sec)

留言