C uchar.h Header

Unicode character handling and multibyte conversions

🌐 What is uchar.h?

The uchar.h header, added in C11, declares the Unicode character types char16_t and char32_t and the functions that convert between multibyte characters and UTF-16 or UTF-32 code units, enabling international text processing in C programs.


#include <uchar.h>
#include <stdio.h>

int main() {
    char16_t utf16_char = u'A';
    char32_t utf32_char = U'🌟';
    printf("UTF-16: %u, UTF-32: %lu\n", (unsigned)utf16_char, (unsigned long)utf32_char);
    return 0;
}

Output:

UTF-16: 65, UTF-32: 127775

Key uchar.h Features

🔤

Unicode Types

16-bit and 32-bit Unicode character types

char16_t utf16_char = u'A';
char32_t utf32_char = U'🌟';
🔄

Conversion Functions

Convert between multibyte and Unicode

mbrtoc16(), c16rtomb()
mbrtoc32(), c32rtomb()
🌍

International Support

Handle text in multiple languages

// Supports emoji, symbols
char32_t emoji = U'😀';
📝

String Literals

Unicode string literal prefixes

char16_t str16[] = u"Hello";
char32_t str32[] = U"World";

🔹 Unicode Character Types

C11 introduced the Unicode character types char16_t and char32_t for handling international text. Declared in uchar.h, they hold UTF-16 and UTF-32 code units respectively (C11 guarantees those encodings only when the implementation defines __STDC_UTF_16__ and __STDC_UTF_32__; C23 requires them). char16_t is at least 16 bits wide: a code point in the Basic Multilingual Plane fits in one code unit, while a code point above U+FFFF, such as most emoji, needs a surrogate pair of two code units. char32_t is at least 32 bits wide and holds any Unicode code point directly. For example, char32_t euro = U'€'; stores the Euro sign as its code point, U+20AC. Because the types and literal prefixes are standardized, the same code behaves consistently across platforms, which matters for software that must support many languages and emoji.

#include <uchar.h>
#include <stdio.h>

int main() {
    // 16-bit Unicode characters (UTF-16)
    char16_t utf16_a = u'A';
    char16_t utf16_euro = u'€';
    
    // 32-bit Unicode characters (UTF-32)
    char32_t utf32_a = U'A';
    char32_t utf32_star = U'⭐';
    char32_t utf32_emoji = U'😊';
    
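    // Cast before printing: the underlying integer types are implementation-defined
    printf("UTF-16 'A': %u\n", (unsigned)utf16_a);
    printf("UTF-16 '€': %u\n", (unsigned)utf16_euro);
    
    printf("UTF-32 'A': %lu\n", (unsigned long)utf32_a);
    printf("UTF-32 '⭐': %lu\n", (unsigned long)utf32_star);
    printf("UTF-32 '😊': %lu\n", (unsigned long)utf32_emoji);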
    
    // Character sizes
    printf("\nSizes:\n");
    printf("char16_t: %zu bytes\n", sizeof(char16_t));
    printf("char32_t: %zu bytes\n", sizeof(char32_t));
    
    return 0;
}

Output:

UTF-16 'A': 65
UTF-16 '€': 8364
UTF-32 'A': 65
UTF-32 '⭐': 11088
UTF-32 '😊': 128522

Sizes:
char16_t: 2 bytes
char32_t: 4 bytes

🔹 Unicode String Literals

Unicode string literals use prefixes to select the encoding of the string. The u8 prefix creates a UTF-8 string (identical to ASCII for ASCII-only text), the u prefix produces a char16_t string (UTF-16), and the U prefix produces a char32_t string (UTF-32). For example, const char *utf8 = u8"Hello 世界"; creates a UTF-8 encoded string containing both English and Chinese characters. UTF-8 is space-efficient for ASCII-heavy text. UTF-16 stores most characters in common use in a single 16-bit code unit but needs two for code points above U+FFFF. UTF-32 uses one code unit per code point, so indexing by code point takes constant time, at the cost of four bytes per code point. The right choice depends on the languages your application must handle and on its memory and performance constraints.

#include <uchar.h>
#include <stdio.h>

int main() {
    // UTF-16 string literal
    char16_t utf16_str[] = u"Hello World! 🌍";
    
    // UTF-32 string literal
    char32_t utf32_str[] = U"Unicode: αβγ 中文 🚀";
    
    // Print lengths in code units, excluding the terminating null
    printf("UTF-16 string length: %zu code units\n", 
           sizeof(utf16_str)/sizeof(char16_t) - 1);
    printf("UTF-32 string length: %zu code units\n", 
           sizeof(utf32_str)/sizeof(char32_t) - 1);
    
    // Print individual code points from the UTF-32 string
    printf("First few UTF-32 code points:\n");
    for (int i = 0; i < 8 && utf32_str[i] != U'\0'; i++) {
        printf("  [%d]: %lu\n", i, (unsigned long)utf32_str[i]);
    }
    
    return 0;
}

Output:

UTF-16 string length: 15 code units
UTF-32 string length: 17 code units
First few UTF-32 code points:
  [0]: 85
  [1]: 110
  [2]: 105
  [3]: 99
  [4]: 111
  [5]: 100
  [6]: 101
  [7]: 58

🔹 Multibyte Conversion

Multibyte conversion functions transform text between the current locale's multibyte encoding and wider character types. The older functions mbstowcs(), wcstombs(), and mbtowc(), declared in stdlib.h, convert to and from wchar_t: for example, wchar_t wstr[100]; mbstowcs(wstr, "Hello", 100); converts a multibyte string to wide characters. uchar.h adds restartable counterparts that target the Unicode types: mbrtoc16() and mbrtoc32() decode one multibyte character into a char16_t or char32_t, and c16rtomb() and c32rtomb() encode one back. Each takes an mbstate_t that carries the conversion state between calls. All four return (size_t)-1 on an encoding error, and the mbrtoc functions also return (size_t)-2 for incomplete input and (size_t)-3 when they store output without consuming input. Because size_t is unsigned, these are very large values, so each must be checked for explicitly. The multibyte encoding is set by the current locale (LC_CTYPE), so non-ASCII input generally requires a setlocale() call first.

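#include <uchar.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>

int main(void) {
    // Convert multibyte to UTF-32 (the default "C" locale handles ASCII input)
    char mb_str[] = "Hello";
    char32_t utf32_char = 0;
    mbstate_t in_state = {0};
    
    size_t result = mbrtoc32(&utf32_char, mb_str, strlen(mb_str), &in_state);
    
    // (size_t)-1, -2 and -3 are special return values, so "result > 0" is not a valid check
    if (result == 0 || result >= (size_t)-3) {
        printf("mbrtoc32 did not decode a character\n");
        return 1;
    }
    printf("First character as UTF-32: %lu\n", (unsigned long)utf32_char);
    printf("Bytes consumed: %zu\n", result);
    
    // Convert UTF-32 back to multibyte, using a fresh conversion state
    char mb_buffer[MB_LEN_MAX + 1];   // room for any character plus the terminator
    mbstate_t out_state = {0};
    size_t mb_len = c32rtomb(mb_buffer, utf32_char, &out_state);
    
    if (mb_len == (size_t)-1) {
        printf("c32rtomb could not encode the character\n");
        return 1;
    }
    mb_buffer[mb_len] = '\0';
    printf("Back to multibyte: '%s'\n", mb_buffer);
    printf("Multibyte length: %zu\n", mb_len);
    
    return 0;
}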

Output:

First character as UTF-32: 72
Bytes consumed: 1
Back to multibyte: 'H'
Multibyte length: 1

🧠 Test Your Knowledge

What prefix is used for UTF-32 string literals?