C uchar.h Header
Unicode character handling and multibyte conversions
What is uchar.h?
The uchar.h header, added in C11, declares the Unicode character types char16_t and char32_t together with four restartable functions that convert between the locale's multibyte encoding and UTF-16 or UTF-32, enabling international text processing in C programs.
#include <uchar.h>
#include <stdio.h>
int main() {
char16_t utf16_char = u'A';
char32_t utf32_char = U'😊';
printf("UTF-16: %lu, UTF-32: %lu\n", (unsigned long)utf16_char, (unsigned long)utf32_char);
return 0;
}
Output:
Key uchar.h Features
Unicode Types
16-bit and 32-bit Unicode character types
char16_t utf16_char = u'A';
char32_t utf32_char = U'😊';
Conversion Functions
Convert between multibyte and Unicode
mbrtoc16(), c16rtomb()
mbrtoc32(), c32rtomb()
International Support
Handle text in multiple languages
// Supports emoji, symbols
char32_t emoji = U'😊';
String Literals
Unicode string literal prefixes
char16_t str16[] = u"Hello";
char32_t str32[] = U"World";
Unicode Character Types
C11 introduced the Unicode character types char16_t and char32_t for handling international text in multilingual applications. Both are declared in uchar.h: char16_t is an unsigned integer type at least 16 bits wide that holds one UTF-16 code unit, and char32_t is an unsigned integer type at least 32 bits wide that holds one UTF-32 code unit. UTF-16 stores every code point in the Basic Multilingual Plane in a single code unit and every other code point (most emoji, for instance) as a surrogate pair of two units, while a single char32_t can represent any Unicode code point directly. For example, char32_t euro = U'€'; stores the Euro sign as its code point, U+20AC. C11 and C17 guarantee these encodings only when the implementation defines __STDC_UTF_16__ and __STDC_UTF_32__, as GCC and Clang do; C23 makes them mandatory. With that guarantee, the types give consistent Unicode handling across platforms, which matters for internationalized software that supports many languages and emoji.
#include <uchar.h>
#include <stdio.h>
int main() {
// 16-bit Unicode characters (UTF-16)
char16_t utf16_a = u'A';
char16_t utf16_euro = u'€';
// 32-bit Unicode characters (UTF-32)
char32_t utf32_a = U'A';
char32_t utf32_star = U'⭐';
char32_t utf32_emoji = U'😊';
printf("UTF-16 'A': %lu\n", (unsigned long)utf16_a);
printf("UTF-16 '€': %lu\n", (unsigned long)utf16_euro);
printf("UTF-32 'A': %lu\n", (unsigned long)utf32_a);
printf("UTF-32 '⭐': %lu\n", (unsigned long)utf32_star);
printf("UTF-32 '😊': %lu\n", (unsigned long)utf32_emoji);
// Character sizes
printf("\nSizes:\n");
printf("char16_t: %zu bytes\n", sizeof(char16_t));
printf("char32_t: %zu bytes\n", sizeof(char32_t));
return 0;
}
Output:
UTF-16 'A': 65
UTF-16 '€': 8364
UTF-32 'A': 65
UTF-32 '⭐': 11088
UTF-32 '😊': 128522
Sizes:
char16_t: 2 bytes
char32_t: 4 bytes
Unicode String Literals
Unicode string literals in C use a prefix to select the encoding. The u8 prefix creates a UTF-8 string (byte-for-byte identical to ASCII for ASCII text), the u prefix creates a char16_t string, and the U prefix creates a char32_t string. The prefixes belong to the C11 language itself, so uchar.h is needed only for the char16_t and char32_t type names. For example, const char *utf8 = u8"Hello 世界"; creates a UTF-8 encoded string containing both English and Chinese characters (in C23 the elements of a u8 literal have type char8_t, an unsigned char, so the pointer type should be adjusted there). UTF-8 is the most compact for ASCII-heavy text, UTF-16 uses two bytes for most characters in everyday use and four for those outside the Basic Multilingual Plane, and UTF-32 uses four bytes for every code point, which allows constant-time indexing by code point. Choosing the right encoding depends on your application's language requirements and performance characteristics.
#include <uchar.h>
#include <stdio.h>
int main() {
// UTF-16 string literal
char16_t utf16_str[] = u"Hello World! 😊";
// UTF-32 string literal
char32_t utf32_str[] = U"Unicode: αβγ 中文 😊";
// Print string lengths in code units (a character outside the BMP takes two UTF-16 units)
printf("UTF-16 string length: %zu characters\n",
sizeof(utf16_str)/sizeof(char16_t) - 1);
printf("UTF-32 string length: %zu characters\n",
sizeof(utf32_str)/sizeof(char32_t) - 1);
// Print individual characters from UTF-32 string
printf("First few UTF-32 characters:\n");
for (int i = 0; i < 8 && utf32_str[i] != U'\0'; i++) {
printf(" [%d]: %lu\n", i, (unsigned long)utf32_str[i]);
}
return 0;
}
Output:
UTF-32 string length: 16 characters
First few UTF-32 characters:
[0]: 85
[1]: 110
[2]: 105
[3]: 99
[4]: 111
[5]: 100
[6]: 101
[7]: 58
Multibyte Conversion
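Multibyte conversion functions translate between the locale's multibyte encoding and fixed-size character types. uchar.h provides four restartable functions for this: mbrtoc16() and mbrtoc32() decode the next multibyte character into a char16_t or char32_t, while c16rtomb() and c32rtomb() encode one back into a multibyte sequence. Each takes an mbstate_t that carries the conversion state between calls, so a character split across two input buffers can be resumed. The older wchar_t counterparts are declared in stdlib.h rather than uchar.h: mbstowcs() converts a multibyte string to a wide character string, wcstombs() performs the reverse conversion, and mbtowc() converts a single multibyte character. For example, wchar_t wstr[100]; mbstowcs(wstr, "Hello", 100); converts a multibyte string to wide characters for processing. The multibyte encoding is whatever the LC_CTYPE category of the current locale selects, so call setlocale() first when the input contains non-ASCII characters. These conversions are essential when working with international text in C programs, because they let non-ASCII characters pass through legacy char-based string functions and standard I/O.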
#include <uchar.h>
#include <stdio.h>
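#include <string.h>
#include <limits.h>
int main(void) {
// Convert multibyte to UTF-32
char mb_str[] = "Hello";
char32_t utf32_char = 0;
mbstate_t state = {0};
size_t result = mbrtoc32(&utf32_char, mb_str, strlen(mb_str), &state);
// (size_t)-1 and (size_t)-2 report an invalid or incomplete sequence; 0 means a null character
if (result != 0 && result != (size_t)-1 && result != (size_t)-2) {
printf("First character as UTF-32: %lu\n", (unsigned long)utf32_char);
printf("Bytes consumed: %zu\n", result);
}
// Convert UTF-32 back to multibyte
char mb_buffer[MB_LEN_MAX + 1]; // longest possible character plus the terminator
mbstate_t out_state = {0}; // separate state for the encoding direction
size_t mb_len = c32rtomb(mb_buffer, utf32_char, &out_state);
// c32rtomb returns (size_t)-1 if the character cannot be encoded
if (mb_len != (size_t)-1) {
mb_buffer[mb_len] = '\0';
printf("Back to multibyte: '%s'\n", mb_buffer);
printf("Multibyte length: %zu\n", mb_len);
}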
return 0;
}
Output:
First character as UTF-32: 72
Bytes consumed: 1
Back to multibyte: 'H'
Multibyte length: 1