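Most text editors can detect a file's encoding automatically when they open it. How do they do that? I have never implemented a text editor, but a reasonable guess is that the editor keeps a default set of candidate encodings and tries to decode the file with each one in turn; if decoding succeeds, that encoding is taken to be the file's encoding. Some encodings are a special case: they put marker bytes at the start of the file (a byte order mark), which makes detection fast. Those are not considered here. The core problem is therefore how to decide whether a given encoding can decode a file. Since Java 1.4 this can be solved with Charset from NIO (java.nio.charset):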
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

/**
 * Tests whether the input byte stream can be decoded with the given charset.
 */
public static boolean canDecode(InputStream input, Charset charset) throws IOException {
    ReadableByteChannel channel = Channels.newChannel(input);
    CharsetDecoder decoder = charset.newDecoder(); // reports malformed input by default
    ByteBuffer byteBuffer = ByteBuffer.allocate(2048);
    CharBuffer charBuffer = CharBuffer.allocate(1024);
    boolean endOfInput = false;
    while (!endOfInput) {
        int n = channel.read(byteBuffer);
        byteBuffer.flip(); // flip so it can be drained
        endOfInput = (n == -1);
        CoderResult coderResult = decoder.decode(byteBuffer, charBuffer, endOfInput);
        charBuffer.clear(); // the decoded characters are discarded; only the result matters
        // OVERFLOW means charBuffer filled up, so keep decoding the same bytes
        while (coderResult.isOverflow()) {
            coderResult = decoder.decode(byteBuffer, charBuffer, endOfInput);
            charBuffer.clear();
        }
        if (coderResult.isError()) {
            return false;
        }
        byteBuffer.compact(); // compact so it can be refilled
    }
    // Flush any state the decoder still holds after the last decode call
    CoderResult coderResult;
    while ((coderResult = decoder.flush(charBuffer)).isOverflow()) {
        charBuffer.clear();
    }
    return !coderResult.isError();
}
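To follow canDecode you need to be familiar with Buffer and Channel operations and with how decoding proceeds. canDecode only decides whether the input can be decoded. The decode method below also writes the decoded content to a character output stream (a Writer), so it is somewhat more complex.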
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

/**
 * Decodes the byte input stream with the given charset and writes the result to
 * the character output stream. Returns false if a decoding error occurred and
 * true otherwise. Invalid byte sequences in the input are skipped.
 */
public static boolean decode(InputStream input, Writer output, Charset charset) throws IOException {
    ReadableByteChannel channel = Channels.newChannel(input);
    CharsetDecoder decoder = charset.newDecoder();
    ByteBuffer byteBuffer = ByteBuffer.allocate(2048);
    CharBuffer charBuffer = CharBuffer.allocate(1024);
    boolean endOfInput = false;
    boolean error = false;
    while (!endOfInput) {
        int n = channel.read(byteBuffer);
        byteBuffer.flip(); // flip so it can be drained
        endOfInput = (n == -1);
        CoderResult coderResult = decoder.decode(byteBuffer, charBuffer, endOfInput);
        error = drainCharBuffer(error, byteBuffer, charBuffer, coderResult, output);
        // OVERFLOW and error results both leave bytes to decode, so repeat until UNDERFLOW
        while (coderResult != CoderResult.UNDERFLOW) {
            coderResult = decoder.decode(byteBuffer, charBuffer, endOfInput);
            error = drainCharBuffer(error, byteBuffer, charBuffer, coderResult, output);
        }
        byteBuffer.compact(); // compact so it can be refilled
    }
    // Flush any state the decoder still holds, draining charBuffer whenever it fills up
    CoderResult coderResult;
    while ((coderResult = decoder.flush(charBuffer)) != CoderResult.UNDERFLOW) {
        error = drainCharBuffer(error, byteBuffer, charBuffer, coderResult, output);
    }
    error = drainCharBuffer(error, byteBuffer, charBuffer, coderResult, output);
    output.flush();
    return !error;
}

/**
 * Writes the contents of charBuffer to output and, if coderResult is an error,
 * records it and skips the offending bytes in byteBuffer.
 */
private static boolean drainCharBuffer(boolean error, ByteBuffer byteBuffer,
        CharBuffer charBuffer, CoderResult coderResult, Writer output) throws IOException {
    // write charBuffer to output
    charBuffer.flip();
    if (charBuffer.hasRemaining()) {
        output.write(charBuffer.toString());
    }
    charBuffer.clear();
    if (coderResult.isError()) {
        error = true;
        byteBuffer.position(byteBuffer.position() + coderResult.length()); // ignore invalid byte sequence
    }
    return error;
}
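Note that byteBuffer must not be smaller than the longest byte sequence of a single character. A UTF-8 character can occupy up to 4 bytes, so with a byteBuffer of size 3 the decoder may keep returning CoderResult.UNDERFLOW while the buffer has no room left to take in more data, and the loop never terminates.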
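The stuck state described here can be reproduced in isolation. The following is only a minimal sketch: the class name SmallBufferDemo and the method getsStuck are made up for illustration, and the input is a single 4-byte UTF-8 character (U+1F600) chosen as an assumption to trigger the case.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original code.
class SmallBufferDemo {
    /** Returns true if a 3-byte buffer gets stuck on a 4-byte UTF-8 character. */
    static boolean getsStuck() {
        // U+1F600 encodes to 4 bytes in UTF-8: F0 9F 98 80.
        byte[] fourBytes = "\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(3);   // deliberately too small
        CharBuffer out = CharBuffer.allocate(16);

        in.put(fourBytes, 0, 3);                  // fill the buffer completely
        in.flip();                                // switch to draining
        CoderResult result = decoder.decode(in, out, false);
        // The decoder wants a 4th byte: it reports UNDERFLOW and consumes nothing.
        boolean nothingConsumed = result.isUnderflow() && in.remaining() == 3;

        in.compact();                             // keeps the 3 pending bytes
        // No free space is left, so a channel read could not add the missing byte.
        boolean noRoomToRefill = !in.hasRemaining();
        return nothingConsumed && noRoomToRefill;
    }

    public static void main(String[] args) {
        System.out.println(getsStuck());
    }
}
```

When both conditions hold, a read loop like the ones above would spin forever: the channel cannot add any bytes to a full buffer, so end of input is never observed. A buffer of a few kilobytes, as in the 2048-byte allocation used earlier, avoids the problem.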
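Also note that the program can produce a wrong result, for example:
String s = "abc中国";
byte[] utf8Bytes = s.getBytes(Charset.forName("utf-8"));
byte[] gbkBytes = s.getBytes(Charset.forName("gbk"));
CharArrayWriter writer = new CharArrayWriter();
System.out.println(decode(new ByteArrayInputStream(utf8Bytes), writer, Charset.forName("utf-8")));
System.out.println(writer.toString());
writer = new CharArrayWriter();
System.out.println(decode(new ByteArrayInputStream(utf8Bytes), writer, Charset.forName("gbk")));
System.out.println(writer.toString());
Output:
true
abc中国
true
abc涓 浗
As you can see, a byte stream encoded as UTF-8 can still be decoded as GBK, but the decoded text is wrong. This is a coincidence: the two Chinese characters take 6 bytes in UTF-8, which happen to pair up into three two-byte GBK sequences. Change the string to "中国人" (9 bytes in UTF-8, an odd count) and GBK can no longer decode it, because the last byte is left without a partner.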
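Because a wrong charset can occasionally decode without error, the order in which candidates are tried matters. The sketch below illustrates the candidate-set idea from the opening paragraph on in-memory data. The class CharsetGuess and the methods decodesCleanly and guess are hypothetical names introduced for this example, and it assumes the running JDK provides the GBK charset. It uses the convenience method CharsetDecoder.decode(ByteBuffer), which throws CharacterCodingException on invalid input, instead of the streaming loop shown earlier.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the original code.
class CharsetGuess {
    /** Strict in-memory check: true only if the whole array decodes without error. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns the first candidate that decodes the data, or null if none does. */
    static Charset guess(byte[] data, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            if (decodesCleanly(data, candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumed to be available
        // "\u4e2d\u56fd\u4eba" is the three-character string from the text, as escapes.
        byte[] utf8Bytes = "\u4e2d\u56fd\u4eba".getBytes(StandardCharsets.UTF_8);
        // Stricter encodings go first in the candidate list.
        System.out.println(guess(utf8Bytes, Arrays.asList(StandardCharsets.UTF_8, gbk)));
    }
}
```

Putting stricter encodings such as UTF-8 ahead of more permissive ones reduces false positives of the kind shown above. A single-byte charset such as ISO-8859-1 accepts every byte sequence, so if it is a candidate at all it belongs at the end of the list.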