
ghz 8 months ago ⋅ 80 views

Is it the gcc compiler that is responsible for storing (in the executable) the UTF-8 characters from a C char array?

I'm on an Ubuntu system and I wrote this simple program:

#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    char utf8_arr[] = "写一个名字列表: \n\n";

    write(1, utf8_arr, sizeof(utf8_arr));

    char utf8_buff[1024];
    ssize_t r;

    while ((r = read(0, utf8_buff, sizeof(utf8_buff))) > 0) {
        write(1, utf8_buff, r);
    }

    return 0;
}

My questions:

1) Who controls the character encoding (the way the actual characters are stored in memory) for C strings like the one in my program? Is it the gcc compiler (which in turn gets its own character encoding settings from somewhere)?

2) Is it 100% correct to use a UTF-8 char string just for storing, writing and reading, the way my program does?

3) What about sizeof? Is it fine to use it like this?

Answers

  1. The bytes stored for a C string literal are determined by the encoding of the source file together with the compiler's character-set settings. GCC reads the source using its input character set and stores narrow string literals in the executable using its execution character set; both are UTF-8 in typical setups. So if your source file is saved as UTF-8 (the usual case on Ubuntu), "写一个名字列表: \n\n" ends up in the executable as UTF-8 bytes. You can change either setting with the -finput-charset and -fexec-charset options if needed.

  2. Using a UTF-8 string such as char utf8_arr[] = "写一个名字列表: \n\n"; this way is generally fine. read and write operate on raw bytes and do not interpret the encoding, so your program passes UTF-8 text through unchanged. Whether it displays correctly depends on the terminal, which must be set to UTF-8 (the case on most modern systems). UTF-8 is a variable-width encoding that can represent every Unicode character, which makes it suitable for multilingual text. One caveat: a single read may end in the middle of a multi-byte character. That is harmless when you only echo the bytes, as your loop does, but it matters if you start processing the buffer character by character.

  3. sizeof(utf8_buff) is fine: it is the capacity of the buffer in bytes, which is what read needs, and r then holds the number of bytes actually read, so write(1,utf8_buff,r) is correct. sizeof(utf8_arr) needs more care. sizeof returns the size of the array in bytes, not the length of the string, and for an array initialized from a string literal that size includes the terminating null byte. Your first write therefore sends one extra '\0' byte to the output. A terminal usually ignores it, but it ends up in the data if the output is redirected to a file or a pipe. Use sizeof(utf8_arr) - 1 or strlen(utf8_arr) instead. Either way you get a count of bytes, not of characters, which is the unit write expects.