Forum Topic - can not get the utf8?: (10 Items)

Xiaolong Zhang

04/22/2010 7:08 AM

post52365

Hi all!
  When I play a Chinese song and get the info from nowplaying, it is not in UTF-8 format. How can I get it in UTF-8?
 The info is:
 ffffffc3
 ffffff94
 ffffffc3
 ffffff99
 ffffffc2
 ffffffbc
 ffffffc3
 ffffffbb
 54
 59
 45
 52


thanks!

Dan Cardamore(deleted)

04/22/2010 8:02 AM

post52369

Re: can not get the utf8?

Hi Xiaolong,

It's likely that you will need to write a charconvert DLL for MM.

In the MM configuration guide, see the section titled "Creating an external
DLL to provide character encoding routines".

Dan


On 10-04-22 7:08 AM, "Xiaolong Zhang" <community-noreply@qnx.com> wrote:

> Hi all!
>   I play the chinese song,when I get the info from nowplaying ,it is not utf8
> format.How can I get the format of utf8 .  the info is :
>  ffffffc3
>  ffffff94
>  ffffffc3
>  ffffff99
>  ffffffc2
>  ffffffbc
>  ffffffc3
>  ffffffbb
>  54
>  59
>  45
>  52
> 
> 
> thanks!

-- 
Dan Cardamore   <dcardamore@qnx.com>
QNX Multimedia  http://community.qnx.com/sf/projects/multimedia

Wojtek Lerch

04/22/2010 10:21 AM

post52385

Re: can not get the utf8?

> Hi all!
>   I play the chinese song,when I get the info from nowplaying ,it is not utf8 
> format.How can I get the format of utf8 .  the info is :

This looks like a correct UTF-8 encoding of the string "ÔÙ¼ûTYER", which indeed doesn't look meaningful and contains
no Chinese characters.  If I could have a look at your MP3 file, I could tell you more about what that string looks
like in the original ID3 tag and how it was turned into the UTF-8 string you're seeing and why.

What is the string that you were expecting to see?

>  ffffffc3
>  ffffff94

UTF8 "\xC3\x94" is U00D4 "LATIN CAPITAL LETTER O WITH CIRCUMFLEX"

>  ffffffc3
>  ffffff99

UTF8 "\xC3\x99" is U00D9 "LATIN CAPITAL LETTER U WITH GRAVE"


>  ffffffc2
>  ffffffbc

UTF8 "\xC2\xBC" is U00BC "VULGAR FRACTION ONE QUARTER"


>  ffffffc3
>  ffffffbb

UTF8 "\xC3\xBB" is U00FB "LATIN SMALL LETTER U WITH CIRCUMFLEX"

>  54

 ASCII 'T'

>  59

 ASCII 'Y'

>  45

ASCII 'E'

>  52

ASCII 'R'
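
The byte-by-byte reading above can be reproduced mechanically. The following is a minimal sketch of a UTF-8 decoder, not code from the thread: the function names utf8_decode_one() and utf8_decode() are hypothetical, and for brevity the sketch does not reject overlong forms or surrogate code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s (n bytes available).
   Stores the code point in *cp and returns the number of bytes used,
   or 0 if the sequence is malformed or truncated.
   Sketch only: overlong forms and surrogates are not rejected. */
size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t value;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    /* The lead byte tells us the sequence length and carries the top bits. */
    if (s[0] < 0x80)                { len = 1; value = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; value = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; value = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; value = s[0] & 0x07; }
    else
        return 0;

    if (len > n)
        return 0;

    /* Each continuation byte has the form 10xxxxxx and adds six bits. */
    for (i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        value = (value << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *cp = value;
    return len;
}

/* Decode a whole buffer into out (capacity cap code points).
   Returns the number of code points, or (size_t)-1 on error. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out, size_t cap)
{
    size_t count = 0;

    assert(s != NULL && out != NULL);
    while (n > 0) {
        size_t used;
        if (count == cap)
            return (size_t)-1;
        used = utf8_decode_one(s, n, &out[count]);
        if (used == 0)
            return (size_t)-1;
        s += used;
        n -= used;
        ++count;
    }
    return count;
}
```

Fed the twelve bytes from the original post, utf8_decode() yields the eight code points listed above: 0xD4, 0xD9, 0xBC, 0xFB, then the ASCII letters T, Y, E, R.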

Xiaolong Zhang

04/23/2010 10:32 AM

post52528

Re: can not get the utf8?

Hi Wojtek!
     Thanks for your reply. U00D4 is the format that I wanted to get.
I want to know how to convert between U00D4 and
>  ffffffc3
>  ffffff94



thanks!

Wojtek Lerch

04/23/2010 1:29 PM

post52573

Re: can not get the utf8?

> Hi wojtek!
>      Thanks for your reply.The U00D4 is the format that I wanted get.

Ah, that's not UTF-8.  The U00D4 numbers are what Unicode calls "code points" -- integer values in the range 0 to
0x10FFFF, which fit in 21 bits.  In a C program they're typically stored in arrays of 32-bit integers -- which is
roughly equivalent to UTF-32.  In ISO C, this corresponds to "wide characters" and the wchar_t typedef.

> I want to known that how to  convert between U00D4 and
> >  ffffffc3
> >  ffffff94

This is UTF-8.  It encodes a code point as a sequence of between one and four bytes.  In ISO C, this corresponds to a
"multibyte character".

Under Neutrino, all the standard ISO multibyte/wide character conversion functions work with UTF-8 and UTF-32 by
default.  Have a look at the documentation of such functions as wctomb() and mbtowc().  For more details about UTF-32
and UTF-8, take a look at the Unicode Consortium's Web page:

http://www.unicode.org/faq//utf_bom.html
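
To make the relationship between a code point and its UTF-8 bytes concrete, here is a minimal, locale-independent sketch of the encoding direction. It is an illustration of the bit layout only, not the Neutrino library implementation; the name utf8_encode() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out (at least 4 bytes).
   Returns the number of bytes written, or 0 if cp is not encodable
   (a surrogate, or a value above 0x10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, utf8_encode(0xD4, buf) writes the two bytes C3 94, which is exactly the pair shown as ffffffc3 ffffff94 in the question.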

Xiaolong Zhang

04/26/2010 6:36 AM

post52736

Re: can not get the utf8?

Hi Wojtek!
Thanks for your reply.
I need the UTF-8 format as follows: %E5%86%8D   %E8%A7%81

but you mean the UTF-8 format is:
  ffffffc3
  ffffff94
  ffffffc3
  ffffff99
  ffffffc2
  ffffffbc
  ffffffc3
  ffffffbb
I want to know the difference between these two UTF-8 formats. Is the second format UTF-16 or UTF-32? Also, when I
convert the second format with the function mbtowc(), the result looks like GBK encoding, such as U00D4.
Thanks!

Eric Fausett

04/26/2010 7:10 AM

post52739

Re: can not get the utf8?

Xiaolong,

You might try some of these references:
http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/Utf8
http://en.wikipedia.org/wiki/UTF-32/UCS-4

Cheers,

Eric

Wojtek Lerch

04/26/2010 11:03 AM

post52783

Re: can not get the utf8?

Xiaolong,

Perhaps I misunderstood what you meant by your examples.  My assumption was that each line (such as "ffffff94") in your
original example represented a single byte of the UTF-8 string.  In other words, I assumed that the original example
was produced by C code similar to this:

  char *utf8string = ...;

  for ( i = 0; utf8string[i] != '\0'; ++i )
    printf( "  %x\n", utf8string[i] );  /* a signed char >= 0x80 is sign-extended, e.g. ffffffc3 */

And now I'll assume that the new example represents each byte with a three-character sequence, such as obtained by C 
code similar to this:

  for ( i = 0; utf8string[i] != '\0'; ++i )
    printf( "%%%02X\n", (unsigned char) utf8string[i] );

If my assumption were correct, the only difference between %E5%86%8D and

  ffffffe5
  ffffff86
  ffffff8d

would be how the same UTF-8 string is converted to text for the purpose of posting it here.  In other words, both would
refer to the same three-byte UTF-8 sequence that represents the Unicode character U518D.  Other ways of representing
the same byte sequence would be as snippets of C code such as

  char utf8string[] = "\xE5\x86\x8D";

or

  char utf8string[] = {  0xffffffe5, 0xffffff86, 0xffffff8d };

Obviously, I am misunderstanding something, and you will have to explain what *you* mean by those examples before I can 
answer your question.
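
The two print styles discussed in this post can be compared side by side. The sketch below is an illustration under stated assumptions, not code from the thread: the helper names format_sign_extended() and percent_encode() are hypothetical, int is assumed to be 32 bits wide, and signed char is used explicitly because the signedness of plain char varies between platforms.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format one byte the way "%x" shows it when the byte lives in a signed
   char: the value is widened to int first, so bytes >= 0x80 come out
   sign-extended. Assumes a 32-bit int (eight hex digits for such bytes). */
int format_sign_extended(signed char byte, char *out, size_t cap)
{
    assert(out != NULL);
    return snprintf(out, cap, "%x", (unsigned int)(int)byte);
}

/* Percent-encode every byte of a NUL-terminated string as %XX.
   Going through unsigned char keeps each byte in the range 0x00..0xFF.
   Returns 0 on success, -1 if out is too small. */
int percent_encode(const char *s, char *out, size_t cap)
{
    size_t i, len;

    assert(s != NULL && out != NULL);
    len = strlen(s);
    if (cap < 3 * len + 1)
        return -1;
    for (i = 0; i < len; ++i)
        snprintf(out + 3 * i, 4, "%%%02X", (unsigned int)(unsigned char)s[i]);
    out[3 * len] = '\0';
    return 0;
}
```

With these helpers the byte 0xC3 formats as ffffffc3 in the first style, and the string "\xE5\x86\x8D" formats as %E5%86%8D in the second. Both are renderings of bytes, so neither changes which UTF-8 sequence is stored.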

Xiaolong Zhang

04/27/2010 8:30 AM

post52870

Re: can not get the utf8?

Hi Wojtek!
Thanks for your reply!
I mean that
E5868D   E8A781 should correspond to

  ffffffc3
  ffffff94
  ffffffc3
  ffffff99

  ffffffc2
  ffffffbc
  ffffffc3
  ffffffbb
They both mean "再见" in Chinese. I don't know how to convert between them. The difference is not caused by how C
formats the output.
thanks!

Wojtek Lerch

04/27/2010 10:57 AM

post52907

Re: can not get the utf8?

Hi Xiaolong,

The Chinese characters "再" and "见" are Unicode codepoints U+518D and U+89C1, respectively.  Their UTF-8 encodings are 
three bytes long and can be written as C strings "\xE5\x86\x8D" and "\xE8\xA7\x81".  You can read more about those two 
characters on the unicode.org Web site:

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=518D&useutf8=false
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=89C1&useutf8=false

To translate between Unicode codepoints and UTF-8 in C code, you need to store the Unicode values in a wchar_t and the UTF-8 strings in an array of char and use the mbtowc() and wctomb() functions.  The attached program is an example of how to do that.
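
The attached utf.c is not reproduced in this thread. As a rough sketch of the same idea, assuming a host where a UTF-8 locale has to be selected explicitly, the standard functions can be wrapped as follows. The helper names and the locale names tried in select_utf8_locale() are assumptions about the host system; under Neutrino the conversion functions are described above as using UTF-8 by default.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Try to select a UTF-8 character type locale.
   The names below are guesses for common systems, not Neutrino specifics.
   Returns 1 on success, 0 if none of them is available. */
int select_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; ++i)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Code point -> multibyte (UTF-8 in a UTF-8 locale).
   out must hold at least MB_LEN_MAX bytes.
   Returns the number of bytes written, or -1 if cp cannot be converted. */
int codepoint_to_multibyte(wchar_t cp, char *out)
{
    assert(out != NULL);
    return wctomb(out, cp);
}

/* Multibyte (UTF-8 in a UTF-8 locale) -> code point.
   Returns the number of bytes consumed, 0 for a NUL character,
   or -1 if the first n bytes do not form a valid character. */
int multibyte_to_codepoint(const char *in, size_t n, wchar_t *cp)
{
    assert(in != NULL && cp != NULL);
    return mbtowc(cp, in, n);
}
```

After a successful select_utf8_locale(), codepoint_to_multibyte((wchar_t)0x518D, buf) is expected to write the three bytes E5 86 8D, and multibyte_to_codepoint("\xE8\xA7\x81", 3, &cp) is expected to store 0x89C1 in cp.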

Now I have to say that I have no idea how you came up with these:

>   ffffffc3
>   ffffff94
>   ffffffc3
>   ffffff99
> 
>   ffffffc2
>   ffffffbc
>   ffffffc3
>   ffffffbb

They look like the UTF-8 encoding of the string "ÔÙ¼û", not "再见".  But perhaps I misunderstand something -- maybe
if you showed me the C code that produced the above output, it would be easier for me to help.
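
One possible explanation, offered here as an assumption that the thread does not confirm: the code points of "ÔÙ¼û" are all below 0x100, so the string may be the result of treating four raw bytes as ISO-8859-1 and converting them to UTF-8. The sketch below undoes that step; the function name utf8_to_latin1() is hypothetical. The bytes it recovers, D4 D9 BC FB, appear to match the GBK (GB2312) encoding of "再见", which would fit a tag stored in GBK but read as ISO-8859-1.

```c
#include <assert.h>
#include <stddef.h>

/* Convert UTF-8 text whose code points are all below 0x100 back to one
   byte per character (ISO-8859-1). Returns the number of bytes written,
   or (size_t)-1 if the input is malformed, contains a code point of
   0x100 or above, or does not fit in out. */
size_t utf8_to_latin1(const unsigned char *s, size_t n,
                      unsigned char *out, size_t cap)
{
    size_t i = 0, count = 0;

    assert(s != NULL && out != NULL);
    while (i < n) {
        unsigned char b;

        if (s[i] < 0x80) {
            /* ASCII passes through unchanged. */
            b = s[i];
            i += 1;
        } else if ((s[i] == 0xC2 || s[i] == 0xC3) &&
                   i + 1 < n && (s[i + 1] & 0xC0) == 0x80) {
            /* C2 xx covers 0x80..0xBF, C3 xx covers 0xC0..0xFF. */
            b = (unsigned char)(((s[i] & 0x03) << 6) | (s[i + 1] & 0x3F));
            i += 2;
        } else {
            return (size_t)-1;
        }

        if (count == cap)
            return (size_t)-1;
        out[count++] = b;
    }
    return count;
}
```

Turning the recovered GBK bytes into real UTF-8 would still need a GBK-to-Unicode conversion table, which this sketch does not provide; that is presumably the job of the character conversion DLL suggested earlier in the thread.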

Attachment:

utf.c 853 bytes
