Is it the gcc compiler that is responsible for storing (in the executable) the UTF-8 characters of a C language char array?
I'm on an Ubuntu system and I wrote this simple program:
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main( void ){
    char utf8_arr[] = "写一个名字列表: \n\n";
    write(1,utf8_arr,sizeof(utf8_arr));

    char utf8_buff[1024];
    ssize_t r;
    while ( (r = read(0,utf8_buff,sizeof(utf8_buff))) > 0 ){
        write(1,utf8_buff,r);
    }
    return 0;
}
My questions:
1) Who controls the character encoding (the way the actual characters are stored in memory) for C language strings like the one in my program? Is it the gcc compiler (which in turn gets its own character encoding settings from somewhere)?
2) Is it 100% correct to use a UTF-8 char string just for storing, writing and reading, the way my program does?
3) What about sizeof? Is it fine to use it like this?
Answers
- The bytes stored for a C string literal are decided by the compiler, based on two things: the encoding it assumes for the source file and the execution character set it translates string literals into. For a literal like "写一个名字列表: \n\n", GCC reads the source bytes using its input character set and emits the literal into the executable using its execution character set. On a typical Ubuntu setup both are UTF-8, so a source file saved as UTF-8 ends up as the same UTF-8 bytes in the binary. If needed, you can change them with the GCC options -finput-charset= and -fexec-charset=.
- Storing UTF-8 text in a char array, as in char utf8_arr[] = "写一个名字列表: \n\n";, is fine as long as your terminal also uses UTF-8, which is the case on most modern systems. UTF-8 is a variable-width encoding that can represent every Unicode code point, and read and write only move bytes without interpreting them, so your program passes the text through unchanged. Keep in mind that one character may occupy several bytes (each Chinese character here takes three), so a single read can end in the middle of a character. That is harmless when you copy the bytes straight to the output, as you do, but it matters if you process the buffer character by character.
- sizeof returns the size of its operand in bytes. sizeof(utf8_buff) is the capacity of the buffer, so it is the right limit to pass to read, and r, the return value of read, is the right count to pass to write, because it is the number of bytes actually read into utf8_buff. sizeof(utf8_arr) is different: it is the size of the whole array, which includes the terminating null byte that the string literal adds, so write(1,utf8_arr,sizeof(utf8_arr)) also sends a '\0' byte to the output. To write only the text, pass sizeof(utf8_arr) - 1 or strlen(utf8_arr). Both give the length in bytes, not in characters.
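The points above can be checked with a small sketch. It assumes the source file is saved as UTF-8 and compiled with GCC's default character sets. The helper names utf8_codepoints, dump_bytes and report_prompt are made up for this illustration and are not part of any library, and utf8_codepoints assumes its input is valid UTF-8.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Count code points in a NUL-terminated string, assuming it holds valid
   UTF-8: every byte that is not a continuation byte (10xxxxxx) starts
   a new code point. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Print each byte of s in hex, to show what the compiler stored. */
void dump_bytes(const char *s)
{
    for (; *s != '\0'; s++)
        printf("%02X ", (unsigned char)*s);
    putchar('\n');
}

/* Hypothetical helper: report the sizes discussed above for the prompt. */
void report_prompt(void)
{
    char utf8_arr[] = "写一个名字列表: \n\n";

    printf("sizeof      = %zu (bytes, including the terminating '\\0')\n",
           sizeof(utf8_arr));
    printf("strlen      = %zu (bytes, excluding the terminating '\\0')\n",
           strlen(utf8_arr));
    printf("code points = %zu\n", utf8_codepoints(utf8_arr));
    dump_bytes(utf8_arr);
}
```

Calling report_prompt() from main shows a sizeof value exactly one larger than strlen, and a code point count smaller than both, because each Chinese character is stored as three bytes. The hex dump lets you confirm that the bytes in the executable are the UTF-8 bytes from the source file.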