The Ultimate Guide to MySQL Charsets and Collations for Beginners

Have you ever encountered strange characters in your database, or noticed incorrect sorting of data? Chances are, the culprit is related to character sets and collations. These often-overlooked aspects of MySQL are crucial for handling text data correctly, especially in today’s globalized world.

MySQL, a widely used relational database management system, lets you control character sets and collations at every level, from the whole server down to a single column. In this blog, we’ll explore why these settings matter, how they work, and how to apply them in practice.

What Are Character Sets and Collations?

Character Set

A character set is a collection of symbols and their numeric representations. It defines the characters that are supported (e.g., letters, numbers, symbols) and how they are stored. Think of it as a dictionary that maps characters to numeric codes.
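As a small illustration of the "dictionary" idea, the query below shows the numeric codes MySQL stores for the same character under two character sets. It is only a sketch: the hex literal X'C3A9' is the UTF-8 encoding of é, used here so the result does not depend on the client's connection settings.

```sql
-- The character é (U+00E9) stored under two different character sets.
-- _utf8mb4 X'C3A9' tells MySQL to treat the bytes C3 A9 as a utf8mb4 string.
SELECT
  HEX(CONVERT(_utf8mb4 X'C3A9' USING latin1))  AS latin1_bytes,   -- E9   (1 byte)
  HEX(CONVERT(_utf8mb4 X'C3A9' USING utf8mb4)) AS utf8mb4_bytes;  -- C3A9 (2 bytes)
```

Same character, different stored bytes: that mapping is what the character set defines.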

Collation

A collation is a set of rules for comparing characters within a character set. It determines sorting and equality comparisons, addressing aspects like case sensitivity and handling of accents.
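To see a collation change the result of a sort, one option is to order the same four letters under two collations. This is a sketch that assumes MySQL 8.0 (for utf8mb4_0900_ai_ci); the derived table named letters is made up for the example.

```sql
-- Binary collation: strings are ordered by code point, so uppercase sorts first.
SELECT w
FROM (
  SELECT _utf8mb4'b' AS w UNION ALL
  SELECT _utf8mb4'A' UNION ALL
  SELECT _utf8mb4'a' UNION ALL
  SELECT _utf8mb4'B'
) AS letters
ORDER BY w COLLATE utf8mb4_bin;          -- A, B, a, b

-- Case-insensitive collation: 'a' and 'A' are equal, so both sort before 'b' and 'B'.
SELECT w
FROM (
  SELECT _utf8mb4'b' AS w UNION ALL
  SELECT _utf8mb4'A' UNION ALL
  SELECT _utf8mb4'a' UNION ALL
  SELECT _utf8mb4'B'
) AS letters
ORDER BY w COLLATE utf8mb4_0900_ai_ci;
```

The relative order of 'a' and 'A' in the second result is not guaranteed, because the collation treats them as equal.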

Unicode Collation Algorithm (UCA)

The Unicode Collation Algorithm (UCA) is a standard for comparing Unicode strings. Modern MySQL collations, such as the utf8mb4_0900 family, are based on it, which gives consistent sorting and comparison of Unicode strings across different languages.

MySQL can:

Store strings using a variety of character sets.
Compare strings using a variety of collations.
Mix strings with different character sets or collations in the same server, the same database, or even the same table.
Let you specify the character set and collation at any level (server, database, table, or column).

To use these features effectively, you must know what character sets and collations are available, how to change the defaults, and how they affect the behavior of string operators and functions.

While it’s not strictly necessary to define the charset and collation explicitly when creating a database or table in MySQL, doing so can be important for ensuring that your database behaves as expected, especially when dealing with multilingual data or specific sorting requirements.

Why Should We Care?

Ignoring charset and collation settings can lead to problems such as:

Strange characters appearing in text data.
Incorrect sorting of multilingual data.
Challenges with text searches and comparisons.

Proper configuration ensures accurate representation, sorting, and comparison of text data across different languages and systems.
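The "strange characters" problem usually comes from bytes written under one character set being read back under another. A minimal sketch of that effect, using the UTF-8 bytes for é as a raw binary literal:

```sql
-- X'C3A9' is a binary string: two raw bytes with no character set attached.
-- CONVERT ... USING decides which character set those bytes are read as.
SELECT
  CONVERT(X'C3A9' USING utf8mb4) AS read_as_utf8mb4,  -- é
  CONVERT(X'C3A9' USING latin1)  AS read_as_latin1;   -- Ã©
```

The stored bytes never changed; only the interpretation did, which is why mismatched settings between client, connection, and column tend to produce this kind of garbling.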

How MySQL Chooses Character Sets and Collations

MySQL chooses the database character set and collation in the following manner:

If both CHARACTER SET charset_name and COLLATE collation_name are specified, the specified values are used.
If CHARACTER SET charset_name is specified without COLLATE, the character set and its default collation are used. You can check default collations using the SHOW CHARACTER SET statement or by querying the INFORMATION_SCHEMA.CHARACTER_SETS table.
If COLLATE collation_name is specified without CHARACTER SET, the character set associated with collation_name is used.
If neither is specified, the server character set and collation are used.
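The four rules above can be checked on a test server with a sketch like the following. The database names are invented for the example, and the last row of the result depends on your server defaults.

```sql
-- 1. Both given: used as specified.
CREATE DATABASE charset_demo_both CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- 2. Character set only: its default collation (latin1_swedish_ci for latin1).
CREATE DATABASE charset_demo_cs_only CHARACTER SET latin1;
-- 3. Collation only: the character set the collation belongs to (utf8mb4).
CREATE DATABASE charset_demo_coll_only COLLATE utf8mb4_bin;
-- 4. Neither: the server character set and collation.
CREATE DATABASE charset_demo_neither;

SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME IN ('charset_demo_both', 'charset_demo_cs_only',
                      'charset_demo_coll_only', 'charset_demo_neither');

-- Clean up the demo databases.
DROP DATABASE charset_demo_both;
DROP DATABASE charset_demo_cs_only;
DROP DATABASE charset_demo_coll_only;
DROP DATABASE charset_demo_neither;
```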

If you’re unsure which character sets are supported by your MySQL server, this command will give you a complete list.

mysql> SHOW CHARACTER SET;

Collations are always tied to a character set. You can’t use a latin1 collation with utf8mb4 data, for example. The SHOW COLLATION command lets you explore the options available for a particular character set. The following command lists all the collations available for utf8mb4:

mysql> SHOW COLLATION WHERE Charset = 'utf8mb4';
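The same information is available from INFORMATION_SCHEMA, which is handy when you want to filter or join. A short sketch, with latin1 and utf8mb4 chosen only as examples:

```sql
-- Default collation and maximum bytes per character for two character sets.
SELECT CHARACTER_SET_NAME, DEFAULT_COLLATE_NAME, MAXLEN
FROM INFORMATION_SCHEMA.CHARACTER_SETS
WHERE CHARACTER_SET_NAME IN ('latin1', 'utf8mb4');

-- Collations of one character set, with a flag marking the default.
SELECT COLLATION_NAME, IS_DEFAULT
FROM INFORMATION_SCHEMA.COLLATIONS
WHERE CHARACTER_SET_NAME = 'utf8mb4'
ORDER BY COLLATION_NAME;
```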

Default Charset and Collation

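In MySQL 8.0 and later, the server defaults are as follows (MySQL 5.7 and earlier defaulted to latin1 with latin1_swedish_ci):

Default Charset: utf8mb4 (supports a wide range of characters, including emojis).
Default Collation: utf8mb4_0900_ai_ci (case-insensitive and accent-insensitive, based on UCA 9.0.0).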

Understanding Collation Names

Collation names follow a specific pattern: charset_number_attributes. Let’s break down utf8mb4_0900_ai_ci:

utf8mb4: Supports the full Unicode range, requiring up to 4 bytes per character.
0900: Based on version 9.0.0 of the Unicode Collation Algorithm.
ai: Accent insensitive (e.g., é is treated as e).
ci: Case insensitive (e.g., A is treated as a).
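The effect of the ai/as and ci/cs suffixes can be seen by running the same comparison under sibling collations. This sketch assumes MySQL 8.0; é is written as the hex literal X'C3A9' so the query does not depend on the client encoding.

```sql
SELECT
  -- Accent handling: e versus é
  _utf8mb4'e' = _utf8mb4 X'C3A9' COLLATE utf8mb4_0900_ai_ci AS accent_insensitive,  -- 1
  _utf8mb4'e' = _utf8mb4 X'C3A9' COLLATE utf8mb4_0900_as_ci AS accent_sensitive,    -- 0
  -- Case handling: e versus E
  _utf8mb4'e' = _utf8mb4'E' COLLATE utf8mb4_0900_as_ci AS case_insensitive,         -- 1
  _utf8mb4'e' = _utf8mb4'E' COLLATE utf8mb4_0900_as_cs AS case_sensitive;           -- 0
```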

Checking Defaults

To check server-level defaults:

SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'collation_server';

To check database-level defaults:

USE db_name;
SELECT @@character_set_database, @@collation_database;

OR

SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'your_database_name';

To check table and column-level defaults:

SHOW TABLE STATUS LIKE 'your_table_name';
SHOW FULL COLUMNS FROM your_table_name;
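When a schema has many tables, checking them one at a time gets tedious. One possible audit, assuming utf8mb4 is the character set you expect everywhere, is to list the string columns that use anything else, and to check the connection settings as well:

```sql
-- String columns in one schema that are not utf8mb4.
-- 'your_database_name' is a placeholder.
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'your_database_name'
  AND CHARACTER_SET_NAME IS NOT NULL      -- skip non-string columns
  AND CHARACTER_SET_NAME <> 'utf8mb4'
ORDER BY TABLE_NAME, ORDINAL_POSITION;

-- Character sets in effect for the current connection.
SHOW SESSION VARIABLES LIKE 'character_set_%';
SHOW SESSION VARIABLES LIKE 'collation_connection';
```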

Setting Charset and Collation

You can specify charset and collation at different levels:

Database Level

CREATE DATABASE database_name CHARACTER SET charset_name COLLATE collation_name;
Example:
CREATE DATABASE my_database CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

Table Level

Example:
CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(100),
department VARCHAR(50)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
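A single column can also override the table default at creation time. In this sketch the products table and its columns are hypothetical: sku is compared byte for byte, while title inherits the table's case-insensitive collation.

```sql
CREATE TABLE products (
  id INT PRIMARY KEY,
  -- Column-level override: exact, case-sensitive matching for codes.
  sku VARCHAR(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin,
  -- No override: uses the table default declared below.
  title VARCHAR(200)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
```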

Modifying Charset and Collation

Changing charset or collation requires careful planning to avoid data loss or corruption.

Key Considerations

Data Compatibility: Ensure existing data can be represented in the new charset.
Data Integrity: Some characters may not translate correctly, causing issues.
Performance Impact: Conversions are resource-intensive, especially for large tables.
Index Rebuilding: Indexes on string columns are rebuilt as part of the conversion, which adds to the run time on large tables.
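To get a rough idea of how heavy a conversion will be, you could start by sizing the tables involved. A sketch, with 'your_database_name' as a placeholder; note that TABLE_ROWS is only an estimate for InnoDB tables.

```sql
-- Largest tables first, with their current collation.
SELECT TABLE_NAME,
       TABLE_ROWS,
       ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024, 1) AS size_mb,
       TABLE_COLLATION
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'your_database_name'
  AND TABLE_TYPE = 'BASE TABLE'
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;
```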

Modifying Settings

Database Level

ALTER DATABASE database_name CHARACTER SET new_charset COLLATE new_collation;
Example:
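ALTER DATABASE your_database CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

Note that ALTER DATABASE changes only the defaults used for tables created afterwards. Existing tables and columns keep their current character set and collation until you convert them.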
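Tables that already exist still have to be converted one by one. One possible approach is to let MySQL generate the statements for you to review and run afterwards; 'your_database' and the target collation below are placeholders.

```sql
-- Produces one ALTER TABLE statement per base table; nothing is changed yet.
SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME,
              '` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;')
       AS convert_statement
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'your_database'
  AND TABLE_TYPE = 'BASE TABLE';
```

Reviewing the generated list before running it gives you a chance to skip tables that should keep a different collation.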

Table Level

ALTER TABLE table_name CONVERT TO CHARACTER SET new_charset COLLATE new_collation;
Example:
ALTER TABLE your_table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

Column Level

ALTER TABLE table_name MODIFY column_name column_type CHARACTER SET new_charset COLLATE new_collation;
Example:
ALTER TABLE your_table_name MODIFY your_column_name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

ALTER TABLE ... MODIFY replaces the whole column definition, so always restate the data type and length (e.g., VARCHAR(100)) together with any other attributes the column should keep, such as NOT NULL or DEFAULT. Attributes you leave out are not carried forward. This matters most when switching to a multi-byte character set like utf8mb4, where the declared length determines how many characters still fit.
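A cautious way to do this is to look up the current definition first and then repeat everything you want to keep. In this sketch the NOT NULL and DEFAULT '' attributes are assumed for illustration; use whatever SHOW CREATE TABLE reports for your column.

```sql
-- Step 1: inspect the existing column definition.
SHOW CREATE TABLE your_table_name;

-- Step 2: restate the full definition with the new character set and collation.
ALTER TABLE your_table_name
  MODIFY your_column_name VARCHAR(100)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci
    NOT NULL DEFAULT '';
```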

Conversion Errors

When converting character sets, two main problems can occur: data truncation (less common with modern character sets) and, more importantly, conversion errors due to unsupported characters.

The most common issue you’ll encounter is ERROR 1366 (HY000): Incorrect string value: '...' for column '...' at row .... This error is a safeguard that MySQL raises when strict SQL mode is enabled, which is the default in recent versions. It arises when you attempt to convert data from a character set that supports a wider range of characters (like utf8mb4) to one that supports a narrower range (like latin1). The error message tells you that MySQL has encountered characters in your data that cannot be represented in the target character set.

Example:

Source Charset: utf8mb4 (supports all Unicode characters)
Target Charset: latin1 (supports only Western European characters)

Error Due to Larger Byte Length per Character

This occurs when converting from a character set with a larger maximum byte length per character (like utf8mb4, which uses up to 4 bytes) to one with a smaller maximum (like latin1, which uses 1 byte). If your data contains characters that need more bytes than the target character set allows and have no equivalent there, the conversion fails. Byte length alone is not the deciding factor: é takes 2 bytes in utf8mb4 yet converts cleanly to latin1, where it is the single byte 0xE9.

Example: Converting a string containing the musical symbol ‘𝄞’ (which requires 4 bytes in utf8mb4) to latin1 would trigger an error, because latin1 has no single-byte code for it.

Error Due to Unsupported Characters

Unsupported characters are simply characters that don’t exist in the target character set.

Example: Converting a string containing Chinese characters like ‘你好’ (which are not part of the latin1 character set) to latin1 would result in an error.

This error is crucial because it prevents silent data loss. Without this check, your data would be silently corrupted during the conversion, potentially leading to significant problems later on.
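The failure can be reproduced on a scratch table. This is a sketch under two assumptions: strict SQL mode is active, and the table name conv_demo is free to use. The hex literal is the UTF-8 encoding of 你好.

```sql
-- Confirm that a strict mode (e.g. STRICT_TRANS_TABLES) is enabled.
SELECT @@sql_mode;

CREATE TABLE conv_demo (txt VARCHAR(20)) DEFAULT CHARSET=utf8mb4;
INSERT INTO conv_demo VALUES (_utf8mb4 X'E4BDA0E5A5BD');  -- 你好

-- Expected to fail with ERROR 1366, because latin1 has no code for these characters.
ALTER TABLE conv_demo CONVERT TO CHARACTER SET latin1;

-- Clean up.
DROP TABLE conv_demo;
```

Because the ALTER TABLE is rejected, the row in conv_demo should still hold the original text afterwards.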

How to Handle Conversion Errors (and Avoid Data Loss)

Avoid converting to a less capable character set: The best approach is to design your database from the start using utf8mb4 (or another appropriate Unicode character set) to avoid the need for conversion later. This is the most reliable way to prevent data loss.
Inspect your data before converting (if conversion is absolutely necessary): If you must convert to a less capable character set, you must inspect your data to identify any characters that would be lost.

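SELECT * FROM table_name WHERE LENGTH(text_column) != CHAR_LENGTH(text_column);

This query returns every row in which text_column contains at least one multi-byte character, so it gives you the candidates to review before you attempt the conversion. Not every match is a problem: a character such as é is multi-byte in utf8mb4 but still exists in latin1.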
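A narrower check is to round-trip each value through the target character set and keep only the rows that come back changed. This is a sketch, assuming the column is stored as utf8mb4 and the target is latin1; table_name and text_column are placeholders. In a SELECT, characters that cannot be converted are replaced with '?' and reported as warnings rather than errors.

```sql
-- Rows whose text would not survive a conversion to latin1.
-- HEX() compares the raw bytes, which avoids any collation side effects.
SELECT *
FROM table_name
WHERE HEX(text_column)
   <> HEX(CONVERT(CONVERT(text_column USING latin1) USING utf8mb4));
```

Rows containing only characters such as é pass the round trip unchanged and are not listed, so what remains is the data that would actually be lost.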

When to Explicitly Define Charset and Collation

While MySQL provides defaults, it’s often best practice to explicitly define character sets and collations, especially when:

Working with multilingual data.
Requiring specific sorting behavior (e.g., case-sensitive).
Ensuring consistency across different environments.

In conclusion, mastering MySQL character sets and collations is crucial for building robust and globally accessible applications. By consistently using utf8mb4 with appropriate collations like utf8mb4_0900_ai_ci, you ensure accurate data storage and retrieval, seamless multilingual support, and efficient text comparisons. Taking the time to properly configure these settings from the outset, and carefully considering the implications of modifications, will save you from potential data corruption, unexpected sorting behavior, and performance bottlenecks down the line.
