[Java Basics Series] Encoding and Garbled Text (02): The getBytes([encoding]) Method of String

package example.encoding;

import java.io.UnsupportedEncodingException;

/**
 * The Class GetBytesTest.
 */
public class GetBytesTest {

    /**
     * The main method.
     *
     * @param args the arguments
     */
    public static void main(String[] args) {
        String content = "中文";
        String defaultEncoding = System.getProperty("file.encoding");
        String defaultLanguage = System.getProperty("user.language");
        System.out.println("System default encoding --- " + defaultEncoding);
        System.out.println("System default language --- " + defaultLanguage);

        GetBytesTest tester = new GetBytesTest();

        byte[] defaultBytes = tester.getBytesWithDefaultEncoding(content);
        tester.printBytes(defaultBytes);

        byte[] iso8859Bytes = tester.getBytesWithGivenEncoding(content, "ISO-8859-1");
        tester.printBytes(iso8859Bytes);

        byte[] gbkBytes = tester.getBytesWithGivenEncoding(content, "GBK");
        tester.printBytes(gbkBytes);

        byte[] utfBytes = tester.getBytesWithGivenEncoding(content, "UTF-8");
        tester.printBytes(utfBytes);
    }

    /**
     * Gets the bytes with default encoding.
     *
     * @param content the content
     * @return the bytes with default encoding
     */
    public byte[] getBytesWithDefaultEncoding(String content) {
        System.out.println("\nEncode with default encoding\n");
        // No argument: the platform default charset decides the result.
        return content.getBytes();
    }

    /**
     * Gets the bytes with given encoding.
     *
     * @param content the content
     * @param encoding the encoding
     * @return the bytes with given encoding, or an empty array if the
     *         encoding name is not supported
     */
    public byte[] getBytesWithGivenEncoding(String content, String encoding) {
        System.out.println("\nEncode with given encoding : " + encoding + "\n");
        try {
            return content.getBytes(encoding);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
            // Return an empty array instead of null so printBytes() does not fail.
            return new byte[0];
        }
    }

    /**
     * Prints the bytes.
     *
     * @param bytes the bytes
     */
    public void printBytes(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            System.out.print(" byte[" + i + "] = " + bytes[i]);
            // The byte is widened to int with sign extension, so negative
            // bytes print with a leading "ffffff" (for example ffffffd6).
            System.out.println(" hex string = " + Integer.toHexString(bytes[i]));
        }
    }
}

[1] On a Chinese-language platform, the test results are as follows:

System default encoding --- GBK
System default language --- zh

Encode with default encoding

byte[0] = -42 hex string = ffffffd6
byte[1] = -48 hex string = ffffffd0
byte[2] = -50 hex string = ffffffce
byte[3] = -60 hex string = ffffffc4

Encode with given encoding : ISO-8859-1

byte[0] = 63 hex string = 3f
byte[1] = 63 hex string = 3f

Encode with given encoding : GBK

byte[0] = -42 hex string = ffffffd6
byte[1] = -48 hex string = ffffffd0
byte[2] = -50 hex string = ffffffce
byte[3] = -60 hex string = ffffffc4

Encode with given encoding : UTF-8

byte[0] = -28 hex string = ffffffe4
byte[1] = -72 hex string = ffffffb8
byte[2] = -83 hex string = ffffffad
byte[3] = -26 hex string = ffffffe6
byte[4] = -106 hex string = ffffff96
byte[5] = -121 hex string = ffffff87

[2] On an English-language platform, the test results are as follows:

System default encoding --- Cp1252
System default language --- en

Encode with default encoding

byte[0] = 63 hex string = 3f
byte[1] = 63 hex string = 3f

Encode with given encoding : ISO-8859-1

byte[0] = 63 hex string = 3f
byte[1] = 63 hex string = 3f

Encode with given encoding : GBK

byte[0] = -42 hex string = ffffffd6
byte[1] = -48 hex string = ffffffd0
byte[2] = -50 hex string = ffffffce
byte[3] = -60 hex string = ffffffc4

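Encode with given encoding : UTF-8

byte[0] = -28 hex string = ffffffe4
byte[1] = -72 hex string = ffffffb8
byte[2] = -83 hex string = ffffffad
byte[3] = -26 hex string = ffffffe6
byte[4] = -106 hex string = ffffff96
byte[5] = -121 hex string = ffffff87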

getBytes() and getBytes(encoding) encode a string into a byte array, using either the system default charset or the charset you specify.
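The two variants can be compared side by side. The following is a minimal sketch, not part of the original test program: the class name GetBytesOverloads and its helper methods are made up for illustration, and it assumes Java 7 or later for StandardCharsets. Passing a Charset object instead of a charset name also avoids the checked UnsupportedEncodingException. The string is written with Unicode escapes so that the example does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetBytesOverloads {

    // Encodes with an explicitly chosen charset; the result is the same on every platform.
    static byte[] encodeUtf8(String content) {
        return content.getBytes(StandardCharsets.UTF_8);
    }

    // Encodes with whatever charset this JVM treats as its default.
    static byte[] encodeDefault(String content) {
        return content.getBytes(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        String content = "\u4e2d\u6587"; // the two characters used in the article
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Default length:  " + encodeDefault(content).length);
        System.out.println("UTF-8 length:    " + encodeUtf8(content).length);
    }
}
```

Only the UTF-8 length is predictable here (6 bytes); the default length depends on the machine that runs the code.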

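On a Chinese-language platform the default charset is GBK. Calling getBytes() or getBytes("GBK") there follows the GBK rules and represents each Chinese character with 2 bytes. That is why the GBK encoding of "中文" is -42 -48 -50 -60: -42 and -48 (0xD6 0xD0) represent "中", while -50 and -60 (0xCE 0xC4) represent "文".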
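The hex column in the test output shows values such as ffffffd6 because Integer.toHexString receives the byte widened to a signed int. A small sketch of an alternative formatter follows; the class name HexDump and the method toHex are hypothetical, and the main method assumes the running JDK ships the GBK charset, which it checks before using it.

```java
import java.nio.charset.Charset;

class HexDump {

    // Formats each byte as two hex digits, masking with 0xFF to drop the sign extension.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // GBK is an optional charset, so check for it before asking for it.
        if (Charset.isSupported("GBK")) {
            byte[] gbk = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));
            System.out.println(toHex(gbk)); // d6 d0 ce c4
        }
    }
}
```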

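On a Chinese-language platform, if the specified charset is UTF-8, the UTF-8 rules apply instead. Common Chinese characters such as these take 3 bytes each, so the two characters of "中文" are encoded as two groups: -28 -72 -83 and -26 -106 -121. Each group of 3 bytes represents one Chinese character.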
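UTF-8 is a variable-width encoding, and 3 bytes is the width for characters in the range that covers common Chinese characters. The sketch below measures the width for a few code points; the class Utf8Width and the method utf8Length are names invented for this illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8Width {

    // Returns how many bytes UTF-8 needs for a single Unicode code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length('A'));     // 1 (ASCII)
        System.out.println(utf8Length(0x00E9));  // 2 (Latin small letter e with acute)
        System.out.println(utf8Length(0x4E2D));  // 3 (the character "zhong" from the article)
        System.out.println(utf8Length(0x1F600)); // 4 (a code point outside the Basic Multilingual Plane)
    }
}
```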

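On a Chinese-language platform, if the specified charset is ISO-8859-1, the result is wrong. ISO-8859-1 is a single-byte charset that covers only 256 characters and has no code for any Chinese character. getBytes("ISO-8859-1") does not keep half of each character's bytes. Instead, the encoder finds no mapping for the character and substitutes its replacement byte, 63, which is "?". Each Chinese character therefore becomes a single 63, and the original information is lost.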
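Because getBytes substitutes "?" silently, it can help to detect unmappable characters before encoding, or to make the encoder fail loudly. The following sketch uses the standard java.nio.charset API; the class StrictEncoding and its two methods are hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictEncoding {

    // Returns true only if every character of the text has a mapping in the charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Encodes the text, but throws instead of silently substituting '?'.
    static byte[] encodeStrict(String text, Charset charset) throws CharacterCodingException {
        ByteBuffer buffer = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String content = "\u4e2d\u6587";
        System.out.println(canEncode(content, StandardCharsets.ISO_8859_1)); // false
        System.out.println(canEncode(content, StandardCharsets.UTF_8));      // true
        try {
            encodeStrict(content, StandardCharsets.ISO_8859_1);
        } catch (CharacterCodingException e) {
            System.out.println("Cannot encode: " + e);
        }
    }
}
```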

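On an English-language platform the default charset is Cp1252 (similar to ISO-8859-1). Encoding with GBK or UTF-8 there still produces the correct byte arrays (4 bytes for GBK, 6 bytes for UTF-8). The reason is that the JVM stores strings internally as Unicode, and getBytes(encoding) makes the JVM convert from Unicode to the specified encoding. For GBK the JVM still produces 4 bytes, and for UTF-8 it still produces 6 bytes. For ISO-8859-1, and for the Cp1252 default, the conversion is impossible because the charset has no code for these characters, so each one is replaced by "?" and the result is wrong.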
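The Unicode form that the JVM exposes can be inspected directly: a String is a sequence of UTF-16 code units, and every getBytes call starts from those same units regardless of the platform. This is an illustrative sketch; the class InternalUnicode and the method codeUnits are invented names.

```java
import java.nio.charset.StandardCharsets;

class InternalUnicode {

    // Lists the UTF-16 code units of the string as four-digit hex values.
    static String codeUnits(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String content = "\u4e2d\u6587";
        System.out.println(codeUnits(content));                                    // 4e2d 6587
        System.out.println(content.getBytes(StandardCharsets.UTF_16BE).length);    // 4
        System.out.println(content.getBytes(StandardCharsets.UTF_8).length);       // 6
        System.out.println(content.getBytes(StandardCharsets.ISO_8859_1).length);  // 2 (both are '?')
    }
}
```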

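On the same platform, the same Chinese character yields completely different byte arrays under different encodings. These byte arrays may be correct (as long as the charset supports Chinese) or completely wrong (if the charset does not support Chinese).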

Remember:

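Do not use or overuse String.getBytes(encoding) casually, and avoid the no-argument getBytes() wherever you can. The no-argument method is platform-dependent, so when the platform cannot be known in advance it may well produce different results. If you must encode a string to bytes, make sure the encoding you pass is the same one that was used when the string was originally input.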
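One way to check that the encoding on both sides matches is a round trip: encode, decode with the same charset, and compare with the original. The sketch below is an assumption-labelled illustration; the class RoundTrip and the method survives are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {

    // Encodes and decodes with the same charset and reports whether the text is unchanged.
    static boolean survives(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        String decoded = new String(encoded, charset);
        return decoded.equals(text);
    }

    public static void main(String[] args) {
        String content = "\u4e2d\u6587";
        System.out.println(survives(content, StandardCharsets.UTF_8));      // true
        System.out.println(survives(content, StandardCharsets.ISO_8859_1)); // false, both characters became '?'

        // Decoding with a different charset than the one used to encode gives garbled text.
        String garbled = new String(content.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(content)); // false
    }
}
```

A failed round trip does not say which side is wrong, only that the charset cannot represent the text or that the two sides disagree.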


Posted on 2010-02-22 16:53 by Paul Lin. Category: J2SE

Feedback:

# re: [Java Basics Series] Encoding and Garbled Text (02): The getBytes([encoding]) Method of String

2010-02-25 19:12 | PhoenixLi

A very insightful explanation, thank you!
