Unicode Normalization

Unicode normalization is the process of converting Unicode characters into a standardized form so that equivalent sequences of characters are represented identically. While this serves legitimate purposes in text processing, it introduces security risks when an application validates input before normalization occurs. An attacker can use visually similar or functionally equivalent Unicode characters to bypass security filters, smuggle payloads past input validation, or create confusion in identity-related operations.

How It Works

Unicode defines four normalization forms: NFC, NFD, NFKC, and NFKD. The compatibility forms (NFKC and NFKD) are particularly relevant to security because they fold compatibility variants, such as fullwidth, ligature, and superscript characters, into their plain equivalents. For example, the fullwidth character ＜ (U+FF1C) normalizes to < (U+003C) under NFKC. If a web application's filter checks for angle brackets but normalization happens after that check, an attacker can bypass the XSS filter by submitting the fullwidth variant.
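The ordering problem can be illustrated with a minimal Python sketch. The `naive_filter` function and the payload are hypothetical and exist only for illustration; `unicodedata.normalize` is the standard library call. The sketch compares validating before normalization with validating after it.

```python
import unicodedata


def naive_filter(value: str) -> bool:
    """Hypothetical blocklist check: accept input with no ASCII angle brackets."""
    return "<" not in value and ">" not in value


# Payload written with FULLWIDTH LESS-THAN (U+FF1C) and GREATER-THAN (U+FF1E).
payload = "\uFF1Cscript\uFF1Ealert(1)\uFF1C/script\uFF1E"

# Vulnerable order: the filter sees only fullwidth characters and accepts them,
# then a later NFKC step turns them into real angle brackets.
passes_filter = naive_filter(payload)
normalized = unicodedata.normalize("NFKC", payload)

# Safer order: normalize first, then validate the normalized value.
passes_after_normalization = naive_filter(normalized)

print(passes_filter)               # True: the raw payload slips through
print(normalized)                  # <script>alert(1)</script>
print(passes_after_normalization)  # False: the same filter now rejects it
```

In a real application the filter would be one of several defenses alongside context-aware output encoding; the sketch only shows why the order of the two steps matters.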

This class of attack extends beyond simple character substitution. Different Unicode representations of the same visual character can bypass blocklists, confuse string comparison logic, and create inconsistencies between what a security filter sees and what the application ultimately processes. Directory traversal sequences, SQL keywords, and script tags can all be written with alternative Unicode characters that survive initial validation, only to be normalized into their dangerous forms later in the processing pipeline.
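As a hedged illustration of the directory traversal case, the sketch below uses a hypothetical `is_safe_path` helper that normalizes with NFKC before looking for `..` segments. The example path is an assumption for demonstration, built from FULLWIDTH FULL STOP (U+FF0E) and FULLWIDTH SOLIDUS (U+FF0F).

```python
import unicodedata


def is_safe_path(user_path: str) -> bool:
    """Hypothetical check: normalize first, then reject any '..' path segment."""
    normalized = unicodedata.normalize("NFKC", user_path)
    segments = normalized.replace("\\", "/").split("/")
    return ".." not in segments


# "../etc/passwd" written with fullwidth full stops and solidi.
disguised = "\uFF0E\uFF0E\uFF0Fetc\uFF0Fpasswd"

# A check on the raw string finds no ASCII ".." and would let it through.
raw_check_passes = ".." not in disguised

# After NFKC the string is an ordinary traversal sequence.
after_nfkc = unicodedata.normalize("NFKC", disguised)

print(raw_check_passes)         # True
print(after_nfkc)               # ../etc/passwd
print(is_safe_path(disguised))  # False
```

This is a sketch rather than a complete path validator. A production check would typically also resolve the final path and confirm it stays inside an allowed base directory.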

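Account-related attacks also exploit Unicode normalization. An attacker might register a username containing compatibility characters that normalize to the same form as an existing account, for example the fullwidth ａ (U+FF41) in place of the Latin a (U+0061), potentially enabling account takeover or impersonation when some parts of the system compare the raw name and others compare the normalized one. Homoglyphs (visually identical characters from different scripts) are a related but distinct problem: the Cyrillic а (U+0430) looks identical to the Latin a (U+0061), but it is a different code point and no normalization form maps one to the other, so identity systems need a separate confusable-character check to catch it.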
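The difference between the two cases can be sketched as follows. Both helpers are hypothetical: `canonical_username` assumes a policy of NFKC normalization followed by case folding (a simplification of full caseless matching), and `mixes_latin_and_cyrillic` is a crude mixed-script heuristic based on Unicode character names, not a full confusables check.

```python
import unicodedata


def canonical_username(name: str) -> str:
    """Hypothetical canonical key: NFKC-normalize, then case-fold."""
    return unicodedata.normalize("NFKC", name).casefold()


def mixes_latin_and_cyrillic(name: str) -> bool:
    """Crude heuristic: flag names whose letters come from both scripts."""
    scripts = set()
    for ch in unicodedata.normalize("NFKC", name):
        if ch.isalpha():
            char_name = unicodedata.name(ch, "")
            if char_name.startswith("LATIN"):
                scripts.add("LATIN")
            elif char_name.startswith("CYRILLIC"):
                scripts.add("CYRILLIC")
    return len(scripts) > 1


existing = canonical_username("admin")
fullwidth = canonical_username("\uFF41dmin")  # fullwidth a (U+FF41) + "dmin"
cyrillic = canonical_username("\u0430dmin")   # Cyrillic a (U+0430) + "dmin"

print(fullwidth == existing)                     # True: collides after NFKC
print(cyrillic == existing)                      # False: normalization does not help
print(mixes_latin_and_cyrillic("\u0430dmin"))    # True: caught by the script check
```

Enforcing uniqueness on the canonical key catches the fullwidth variant at registration time, while the homoglyph needs the separate script check (or a fuller confusables mechanism) because it survives normalization unchanged.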

Why It Matters

Unicode normalization issues are subtle and frequently missed during both development and testing. They affect any application that processes international text, which today means virtually every web application. Security assessments should verify that normalization occurs before validation, that homoglyph attacks are mitigated in identity systems, and that the application handles unexpected Unicode sequences without creating filter bypasses.

