package example.encoding;

import java.io.UnsupportedEncodingException;

/**
 * Shows how the characters of a String change when its bytes are
 * decoded again with different encodings.
 */
public class GetCharTest
{

    /**
     * The main method.
     *
     * @param args the arguments (unused)
     */
    public static void main(String[] args)
    {
        String content = "中文";
        String defaultEncoding = System.getProperty("file.encoding");
        String defaultLanguage = System.getProperty("user.language");
        System.out.println("System default encoding --- " + defaultEncoding);
        System.out.println("System default language --- " + defaultLanguage);

        GetCharTest tester = new GetCharTest();
        tester.getCharWithDefaultEncoding(content);
        tester.getCharWithGivenEncoding(content, "ISO-8859-1");
        tester.getCharWithGivenEncoding(content, "GBK");
        tester.getCharWithGivenEncoding(content, "UTF-8");
    }

    /**
     * Prints the chars of the string as it is, without any re-encoding.
     *
     * @param content the content
     */
    public void getCharWithDefaultEncoding(String content)
    {
        System.out.println("\nGet characters with default encoding\n");
        printCharArray(content);
    }

    /**
     * Encodes the string with the platform default encoding, decodes the
     * resulting bytes with the given encoding, and prints the chars.
     *
     * @param content the content
     * @param encoding the encoding used to decode the bytes
     */
    public void getCharWithGivenEncoding(String content, String encoding)
    {
        System.out.println("\nGet characters with given encoding : " + encoding
                + "\n");

        try
        {
            // Deliberately uses the no-argument getBytes(): the bytes depend
            // on the platform default encoding, which is what this test shows.
            String encodedString = new String(content.getBytes(), encoding);
            printCharArray(encodedString);

        } catch (UnsupportedEncodingException e)
        {
            e.printStackTrace();
        }
    }

    /**
     * Prints each char with its value narrowed to byte and to short,
     * plus the Unicode block it belongs to.
     *
     * @param inStr the input string
     */
    public void printCharArray(String inStr)
    {
        char[] charArray = inStr.toCharArray();

        for (int i = 0; i < charArray.length; i++)
        {
            // Narrowing casts: byte keeps the low 8 bits, short the low 16.
            byte b = (byte) charArray[i];
            short s = (short) charArray[i];
            // Negative values are sign-extended to int, hence FFFFFF87 etc.
            String hexB = Integer.toHexString(b).toUpperCase();
            String hexS = Integer.toHexString(s).toUpperCase();
            StringBuilder sb = new StringBuilder();

            // print char
            sb.append("char[");
            sb.append(i);
            sb.append("]='");
            sb.append(charArray[i]);
            sb.append("'\t");

            // byte value
            sb.append("byte=");
            sb.append(b);
            sb.append(" \\u");
            sb.append(hexB);
            sb.append('\t');

            // short value
            sb.append("short=");
            sb.append(s);
            sb.append(" \\u");
            sb.append(hexS);
            sb.append('\t');

            // Unicode Block
            sb.append(Character.UnicodeBlock.of(charArray[i]));

            System.out.println(sb.toString());
        }
        System.out.println("\nCharacters length: " + charArray.length);
    }

}
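The byte and short columns printed by printCharArray come from narrowing casts, and the long hex values such as FFFFFF87 come from sign extension inside Integer.toHexString. A minimal sketch of that effect follows; the class NarrowingCastDemo and the helper lowByteHex are hypothetical names that are not part of the original test.

```java
class NarrowingCastDemo {
    // Hypothetical helper: hex of the low 8 bits only, without sign extension.
    static String lowByteHex(char c) {
        return String.format("%02X", c & 0xFF);
    }

    public static void main(String[] args) {
        char c = '\u6587';                                   // the second character of the test string
        byte b = (byte) c;                                   // keeps only the low 8 bits (0x87)
        System.out.println(b);                               // -121
        System.out.println(Integer.toHexString(b));          // ffffff87 (byte widened to int with sign)
        System.out.println(lowByteHex(c));                   // 87
        System.out.println(String.format("U+%04X", (int) c)); // U+6587
    }
}
```

Masking with 0xFF before formatting is the usual way to print a byte as two hex digits.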
[1] On a Chinese-language platform, the test produces the following output:
System default encoding --- GBK
System default language --- zh
Get characters with default encoding
char[0]='中' byte=45 \u2D short=20013 \u4E2D CJK_UNIFIED_IDEOGRAPHS
char[1]='文' byte=-121 \uFFFFFF87 short=25991 \u6587 CJK_UNIFIED_IDEOGRAPHS
Characters length: 2
Get characters with given encoding : ISO-8859-1
char[0]='?' byte=-42 \uFFFFFFD6 short=214 \uD6 LATIN_1_SUPPLEMENT
char[1]='?' byte=-48 \uFFFFFFD0 short=208 \uD0 LATIN_1_SUPPLEMENT
char[2]='?' byte=-50 \uFFFFFFCE short=206 \uCE LATIN_1_SUPPLEMENT
char[3]='?' byte=-60 \uFFFFFFC4 short=196 \uC4 LATIN_1_SUPPLEMENT
Characters length: 4
Get characters with given encoding : GBK
char[0]='中' byte=45 \u2D short=20013 \u4E2D CJK_UNIFIED_IDEOGRAPHS
char[1]='文' byte=-121 \uFFFFFF87 short=25991 \u6587 CJK_UNIFIED_IDEOGRAPHS
Characters length: 2
Get characters with given encoding : UTF-8
char[0]='?' byte=-3 \uFFFFFFFD short=-3 \uFFFFFFFD SPECIALS
char[1]='?' byte=-3 \uFFFFFFFD short=-3 \uFFFFFFFD SPECIALS
char[2]='?' byte=-3 \uFFFFFFFD short=-3 \uFFFFFFFD SPECIALS
char[3]='?' byte=-3 \uFFFFFFFD short=-3 \uFFFFFFFD SPECIALS
Characters length: 4
[2] On an English-language platform, the test produces the following output:
System default encoding --- Cp1252
System default language --- en
Get characters with default encoding
char[0]='?' byte=45 \u2D short=20013 \u4E2D CJK_UNIFIED_IDEOGRAPHS
char[1]='?' byte=-121 \uFFFFFF87 short=25991 \u6587 CJK_UNIFIED_IDEOGRAPHS
Characters length: 2
Get characters with given encoding : ISO-8859-1
char[0]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[1]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
Characters length: 2
Get characters with given encoding : GBK
char[0]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[1]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
Characters length: 2
Get characters with given encoding : UTF-8
char[0]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[1]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
Characters length: 2
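[Conclusion]
Unlike getBytes(encoding), toCharArray() returns the string's "natural characters", that is, its UTF-16 char values rather than encoded bytes. The number and content of those characters, however, are determined by the encodings that were used to build the string. Here is how the string is actually handled inside the test: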
String encodedString = new String(content.getBytes(), encoding);
char[] charArray = inStr.toCharArray();
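As you can see, the system first encodes the original string with the default encoding to obtain a byte array, then decodes that byte array with the newly specified encoding to obtain a new string, and finally converts the new string into its char array.
On the Chinese platform the default charset is GBK, so what does content.getBytes() return? These 4 bytes: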
byte[0] = -42 hex string = ffffffd6
byte[1] = -48 hex string = ffffffd0
byte[2] = -50 hex string = ffffffce
byte[3] = -60 hex string = ffffffc4
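To inspect such byte arrays without depending on the platform default, the charset can be passed to getBytes explicitly. The following is only a sketch: the class GetBytesDemo and the helper toHex are hypothetical names, and the GBK line is guarded because GBK is not one of the charsets every Java runtime is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    // Hypothetical helper: formats bytes as two-digit hex, masking off the sign.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String content = "\u4E2D\u6587"; // the same two characters as in the test

        // Platform dependent: this is the call the test relies on.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + toHex(content.getBytes()));

        // Platform independent: the charset is named explicitly.
        System.out.println("UTF-8:    " + toHex(content.getBytes(StandardCharsets.UTF_8)));    // E4 B8 AD E6 96 87
        System.out.println("UTF-16BE: " + toHex(content.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D 65 87

        if (Charset.isSupported("GBK")) {
            System.out.println("GBK:      " + toHex(content.getBytes(Charset.forName("GBK"))));
        }
    }
}
```

On a runtime that supports GBK, the last line should show the same four bytes as listed above (D6 D0 CE C4).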
If the new encoding is GBK, each of these characters is represented by 2 bytes, so decoding gives:
char[0]='中' --- byte[0] + byte[1]
char[1]='文' --- byte[2] + byte[3]
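If the new encoding is ISO-8859-1, every character is represented by exactly 1 byte, so bytes that should have been parsed in pairs are parsed one at a time, and each byte is only half of a Chinese character. Each half still decodes to a valid Latin-1 character (U+00D6, U+00D0, U+00CE, U+00C4, all in the LATIN_1_SUPPLEMENT block, as the output above shows). They appear as "?" most likely because the GBK console cannot represent these characters. The result: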
char[0]='?' ---- byte[0]
char[1]='?' ---- byte[1]
char[2]='?' ---- byte[2]
char[3]='?' ---- byte[3]
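The ISO-8859-1 case can be reproduced on any platform by starting from the four bytes directly. This is a sketch with a hypothetical class name (Latin1Demo); the byte values are the ones listed earlier for the GBK platform. It also shows that ISO-8859-1 decoding is lossless, since encoding the garbled string again returns the original bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1Demo {
    // The four bytes reported for content.getBytes() on the GBK platform.
    static final byte[] GBK_BYTES = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = decodeLatin1(GBK_BYTES);
        System.out.println(garbled.length()); // 4: one char per byte

        for (char ch : garbled.toCharArray()) {
            // Prints U+00D6, U+00D0, U+00CE, U+00C4
            System.out.println(String.format("U+%04X", (int) ch));
        }

        // Every byte maps to exactly one char, so the round trip is lossless.
        byte[] back = garbled.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(back, GBK_BYTES)); // true
    }
}
```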
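If the new encoding is UTF-8, decoding fails. UTF-8 is a variable-length encoding (these two Chinese characters would take 3 bytes each), and the 4 GBK bytes do not form valid UTF-8 sequences. Each byte is therefore replaced by the replacement character U+FFFD (in the SPECIALS block), which is displayed as "?":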
char[0]='?' ---- byte[0]
char[1]='?' ---- byte[1]
char[2]='?' ---- byte[2]
char[3]='?' ---- byte[3]
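The UTF-8 case can be sketched the same way. The String constructor silently substitutes U+FFFD for malformed input, while a CharsetDecoder configured with CodingErrorAction.REPORT throws instead, which makes the problem visible. The class and method names below (Utf8MalformedDemo, decodeLenient, decodeStrict) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8MalformedDemo {
    // The four bytes reported for content.getBytes() on the GBK platform.
    static final byte[] GBK_BYTES = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    // Same behavior as new String(bytes, "UTF-8"): malformed input becomes U+FFFD.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict variant: malformed input raises an exception instead of being replaced.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        String lenient = decodeLenient(GBK_BYTES);
        System.out.println(lenient.length()); // 4
        for (char ch : lenient.toCharArray()) {
            System.out.println(String.format("U+%04X", (int) ch)); // U+FFFD each time
        }

        try {
            decodeStrict(GBK_BYTES);
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}
```

The strict decoder is a reasonable choice wherever silently corrupted text would be worse than a failure.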
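On the English platform the default encoding is Cp1252, which cannot represent Chinese characters, so content.getBytes() replaces each character with the byte for "?" (0x3F) and the original content is lost at that point. That is why, on the English platform, specifying GBK or UTF-8 gives exactly the same result as ISO-8859-1.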
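The loss on the English platform happens during encoding, not decoding. The sketch below uses ISO-8859-1 as a stand-in for Cp1252 (an assumption made only because ISO-8859-1 is guaranteed to be available on every Java runtime; neither charset can encode Chinese characters). The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyEncodeDemo {
    // Encodes with a charset that may not cover the text; unmappable chars become '?'.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Checks up front whether the charset can represent the whole string.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String content = "\u4E2D\u6587";
        Charset western = StandardCharsets.ISO_8859_1; // stand-in for Cp1252

        System.out.println(canEncode(content, western)); // false

        byte[] bytes = encode(content, western);
        System.out.println(bytes.length);                // 2
        System.out.println(bytes[0] + " " + bytes[1]);   // 63 63

        // The bytes are plain ASCII question marks, so any ASCII-compatible
        // decoding gives the same two-character string.
        System.out.println(new String(bytes, StandardCharsets.UTF_8)); // ??
    }
}
```

Checking canEncode before encoding is one way to detect this kind of loss early.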
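Remember:
This example shows once again how dangerous String's no-argument getBytes() method is. When we use new String(str.getBytes(), encoding) to re-encode and decode a string, we must know exactly what the length and content of the byte array returned by str.getBytes() are, because Java does not adapt the byte array to the new encoding when decoding it. It simply parses that byte array according to the new encoding.
So, as in the example above, the same 4 original bytes are parsed two at a time by one encoding, one at a time by another, and as invalid multi-byte sequences by a third. Whether the result is correct depends entirely on whether the decoding matches the encoding that produced the bytes.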
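A safer pattern, sketched here under the assumption that the encoding of the bytes is known, is to name the charset on both the encoding and the decoding side. The class ExplicitCharsetDemo and the helper transcode are hypothetical names used for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Converts bytes from one encoding to another by decoding with the
    // charset that produced them and encoding with the target charset.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        String original = "\u4E2D\u6587";

        // Encode and decode with the same, explicitly named charset.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String restored = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(original.equals(restored)); // true

        // Changing the encoding of bytes: 6 UTF-8 bytes become 4 UTF-16BE bytes.
        byte[] utf16 = transcode(utf8, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE);
        System.out.println(utf8.length + " -> " + utf16.length); // 6 -> 4
    }
}
```

Because both charsets are explicit, the result is the same on a Chinese and an English platform.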
posted on 2010-02-22 17:18
Paul Lin