unicode_utf8互轉(zhuǎn)函數(shù) [轉(zhuǎn)]

from : http://gudai.cnblogs.com/articles/294580.html
2005-12-10。從下午12點(diǎn)奮斗到晚上9點(diǎn)。
2005-12-11。根據(jù)嘮叨的回帖，更新了轉(zhuǎn)換的算法。對于那篇unicode編碼的faq.我總算理解了整個轉(zhuǎn)換的過程。理解了這個公式。

參考文章http://tech.163.com/05/0516/10/1JS9KEGA00091589.html
UTF編碼

　　UTF-8就是以8位為單元對UCS進(jìn)行編碼。從UCS-2到UTF-8的編碼方式如下：

UCS-2編碼(16進(jìn)制) UTF-8 字節(jié)流(二進(jìn)制)

0000 - 007F 0xxxxxxx

0080 - 07FF 110xxxxx 10xxxxxx

0800 - FFFF 1110xxxx 10xxxxxx 10xxxxxx

　　例如“漢”字的Unicode編碼是6C49。6C49在0800-FFFF之間，所以肯定要用3字節(jié)模板了：1110xxxx 10xxxxxx 10xxxxxx。將6C49寫成二進(jìn)制是：0110 110001 001001，用這個比特流依次代替模板中的x，得到：11100110 10110001 10001001，即E6 B1 89。

終于將unicode和utf8互轉(zhuǎn)搞定。

如果utf-8編碼的字符ch是3個字節(jié)。xx yy zz
將xx和1F AND 操作得到 a
將yy和7F AND 操作得到 b
將zz和7F AND 操作得到 c

(64a+b)*64+c = ch(unicode編碼)

echo.php沒什么。就是幾個函數(shù)。

<?php
require_once("echo.php");

$data = "大鬧西奴xx x8890.-_奴";
echo(urlencode($data));echo("<br/>");
//寫入unicode文件
$ucs2data = utf8ToUnicode($data,"little");
$endian = chr(0xFE).chr(0xFF);
$endian = chr(0xFF).chr(0xFE);
$rt = file_put_contents ( "ucs2.txt", $endian.$ucs2data);
                //19:32,utf8toUnicode函數(shù)ok.
                //20:09。發(fā)現(xiàn)little endian 和big endian問題。并解決。
                //big endian 方式存入的unicode字符串，ue和editplus均不能
                //識別。只有notepad正常識別。

$rt = file_put_contents ( "usc2ys_data.txt", $ucs2_ysdata);
//寫入utf8文件
$utf8data = unicodeToUtf8($ucs2data); // 20:52. 將字串轉(zhuǎn)回utf8碼ok.
$rt = file_put_contents ( "utf8.txt", $utf8data);
echo(urlencode($utf8data));echo("<br/>");

$esc = utf8Escape($data);
echot($esc);
$esc = phpEscape($data);
echot($esc);
$unesc = phpUnescape($esc);
echot($unesc);

/**
* 此函數(shù)將utf8編碼字串轉(zhuǎn)為unicode編碼字符串
* 參數(shù) str ,utf8編碼的字符串。
* 參數(shù) order,存放數(shù)據(jù)格式，是big endian還是little endian，默認(rèn)的unicode存放次序是little.
* 如："大"的unicode碼是 5927。little方式存放即為：27 59 。big方式則順序不變：59 27.
* little 存放格式文件的開頭均需有FF FE。big 存放方式的文件開頭為 FE FF。否則。將會產(chǎn)生嚴(yán)重混亂。
* 本函數(shù)只轉(zhuǎn)換字符，不負(fù)責(zé)增加頭部。
* iconv轉(zhuǎn)換過來的字符串是 big endian存放的。
* 返回 ucs2string , 轉(zhuǎn)換過的字符串。
* 感謝嘮叨（xuzuning）
*/
function utf8ToUnicode($str,$order="little")
{
  $ucs2string ="";
    $n=strlen($str);
    for ($i=0;$i<$n ;$i++ ) {
  $v = $str[$i];
  $ord = ord($v);
  if( $ord<=0x7F){ // 0xxxxxxx
    if ($order=="little") {
       $ucs2string .= $v.chr(0);
   }
   else {
       $ucs2string .= chr(0).$v;
   }
  }
  elseif ($ord<0xE0 && ord($str[$i+1])>0x80) { //110xxxxx 10xxxxxx
   $a = (ord($str[$i]) & 0x3F )<<6;
   $b = ord($str[$i+1]) & 0x3F ;
   $ucsCode = dechex($a+$b);   //echot($ucsCode);
   $h = intval(substr($ucsCode,0,2),16);
   $l = intval(substr($ucsCode,2,2),16);
   if ($order=="little") {
       $ucs2string   .= chr($l).chr($h);
   }
   else {
        $ucs2string   .= chr($h).chr($l);
   }
   $i++;
  }elseif ($ord<0xF0 && ord($str[$i+1])>0x80 && ord($str[$i+2])>0x80) { //1110xxxx 10xxxxxx 10xxxxxx
     $a = (ord($str[$i]) & 0x1F)<<12;
   $b = (ord($str[$i+1]) & 0x3F )<<6;
   $c = ord($str[$i+2]) & 0x3F ;
   $ucsCode = dechex($a+$b+$c);   //echot($ucsCode);
   $h = intval(substr($ucsCode,0,2),16);
   $l = intval(substr($ucsCode,2,2),16);
   if ($order=="little") {
       $ucs2string   .= chr($l).chr($h);
   }
   else {
        $ucs2string   .= chr($h).chr($l);
   }
   $i +=2;
  }
    }
return $ucs2string;
} // end func

/*
* 此函數(shù)將unicode編碼字串轉(zhuǎn)為utf8編碼字符串
* 參數(shù) str ,unicode編碼的字符串。
* 參數(shù) order ,unicode字串的存放次序，為big endian還是little endian.
* 返回 utf8string , 轉(zhuǎn)換過的字符串。
*
*/
function unicodeToUtf8($str,$order="little")
{
$utf8string ="";
    $n=strlen($str);
    for ($i=0;$i<$n ;$i++ ) {
  if ($order=="little") {
      $val = dechex(ord($str[$i+1])).dechex(ord($str[$i]));
  }
  else {
   $val = dechex(ord($str[$i])).dechex(ord($str[$i+1]));
  }
  $val = intval($val,16); //由于上次的.連接，導(dǎo)致$val變?yōu)樽址@里得轉(zhuǎn)回來。
  $i++; //兩個字節(jié)表示一個unicode字符。
  $c = "";
  if($val < 0x7F){        // 0000-007F
   $c .= chr($val);
  }elseif($val < 0x800) { // 0080-0800
   $c .= chr(0xC0 | ($val / 64));
   $c .= chr(0x80 | ($val % 64));
  }else{                // 0800-FFFF
   $c .= chr(0xE0 | (($val / 64) / 64));
   $c .= chr(0x80 | (($val / 64) % 64));
   $c .= chr(0x80 | ($val % 64));
   //echot($c);
  }
  $utf8string .= $c;
    }
return $utf8string;
} // end func

/*
* 將utf8編碼的字符串編碼為unicode 碼型，等同escape
* 之所以只接受utf8碼，因?yàn)橹挥衭tf8碼和unicode之間有公式轉(zhuǎn)換，其他的編碼都得查碼表來轉(zhuǎn)換。
* 不知道查找utf8碼的正則是否完全正確。迷茫ing
* 雖然調(diào)用utf2ucs對每個字符進(jìn)行碼值計(jì)算。效率過低。然而，代碼清晰，要是把那個計(jì)算過程嵌入。
* 代碼就不太容易閱讀了。
*/
function utf8Escape($str) {
preg_match_all("/[\xC0-\xE0].|[\xE0-\xF0]..|[\x01-\x7f]+/",$str,$r);
//prt($r);
$ar = $r[0];
foreach($ar as $k=>$v) {
$ord = ord($v[0]);
    if( $ord<=0x7F)
      $ar[$k] = rawurlencode($v);
    elseif ($ord<0xE0) { //雙字節(jié)utf8碼
      $ar[$k] = "%u".utf2ucs($v);
    }
elseif ($ord<0xF0) { //三字節(jié)utf8碼
      $ar[$k] = "%u".utf2ucs($v);
}
}//foreach
return join("",$ar);
}

/**
*
* 把utf8編碼字符轉(zhuǎn)為ucs-2編碼
* 參數(shù) utf8編碼的字符。
* 返回該字符的unicode碼值。知道了碼值，你就可以使用chr將字符弄出來了。
*
* 原理：unicode轉(zhuǎn)為utf-8碼的算法是。頭部固定位或。
該過程的逆向算法就是這個函數(shù)了，頭部固定位反位與。
*/

/*
* 用處：此函數(shù)用來逆轉(zhuǎn)javascript的escape函數(shù)編碼后的字符。
* 關(guān)鍵的正則查找我不知道有沒有問題.
* 參數(shù)：javascript編碼過的字符串。
* 如：unicodeToUtf8("%u5927")= 大
* 2005-12-10
*
*/
function phpUnescape($escstr){
preg_match_all("/%u[0-9A-Za-z]{4}|%.{2}|[0-9a-zA-Z.+-_]+/",$escstr,$matches); //prt($matches);
$ar = &$matches[0];
$c = "";
foreach($ar as $val){
if (substr($val,0,1)!="%") { //如果是字母數(shù)字+-_.的ascii碼
     $c .=$val;
}
elseif (substr($val,1,1)!="u") { //如果是非字母數(shù)字+-_.的ascii碼
  $x = hexdec(substr($val,1,2));
     $c .=chr($x);
}
else { //如果是大于0xFF的碼
  $val = intval(substr($val,2),16);
  if($val < 0x7F){        // 0000-007F
   $c .= chr($val);
  }elseif($val < 0x800) { // 0080-0800
   $c .= chr(0xC0 | ($val / 64));
   $c .= chr(0x80 | ($val % 64));
  }else{                // 0800-FFFF
   $c .= chr(0xE0 | (($val / 64) / 64));
   $c .= chr(0x80 | (($val / 64) % 64));
   $c .= chr(0x80 | ($val % 64));
  }
}
}
return $c;
}

/*
* 等同escape
* 來自網(wǎng)上。本文件其他幾個函數(shù)都參考了這個函數(shù)里面的關(guān)鍵算法。
*/
function phpEscape($str,$encode="") {
if ($encode=="" && !(function_exists("mb_detect_encoding"))) {
      echo "error You must enter the string's encoding or extend the php for mb_string";
   return ;
}
elseif($encode=="") {
   echo "Use mb_string function to detect the string's encoding <br/>";
      $encode = mb_detect_encoding($str);
}
preg_match_all("/[\xC0-\xE0].|[\xE0-\xF0]..|[\x01-\x7f]+/",$str,$r);
//prt($r);
$ar = $r[0];
foreach($ar as $k=>$v) {
$ord = ord($v[0]);
    if( $ord<=0x7F)
      $ar[$k] = rawurlencode($v);
    elseif ($ord<0xE0) {
      $ar[$k] = "%u".bin2hex(iconv($encode,"UCS-2",$v));
    }
elseif ($ord<0xF0) {
      $ar[$k] = "%u".bin2hex(iconv($encode,"UCS-2",$v));
}
}//foreach
return join("",$ar);
}

Posted on 2005-12-10 20:54 古代閱讀(698) 評論(2) 編輯收藏網(wǎng)摘所屬分類: wap 、php

Feedback

#1樓 [樓主] 回復(fù) 引用 查看

2006-09-12 22:24 by 古代的專欄

上文的phpescape函數(shù)是錯的，補(bǔ)上正確的

/*
* 等同escape ，只能轉(zhuǎn)gbk編碼的漢字并在gb2312的網(wǎng)頁中被unescape
* gbk漢字：第一：0x81-0xFE 第二：0x40-0xFE
*/
function phpEscape( $str )
{
$chars = preg_split('//', $str, -1 ) ; //pr($chars);
$n = count( $chars )-1 ;
for($i=1;$i<$n ;$i++ ) {
$char = $chars[$i];
$ord = ord( $char ); //echo $ord . " -- ".chr($ord)." -- $i<br>";
if( $ord<=0x7F){
$ar .= rawurlencode( $char );
}else{
$ar .= "%u".bin2hex( iconv( 'gbk' ,"UCS-2",$chars[$i].$chars[$i+1] ) );
$i++;
}
}//foreach
return $ar;
}

posted on 2009-02-25 11:19 小馬歌閱讀(1015) 評論(0) 編輯收藏所屬分類: php

新用戶注冊刷新評論列表


只有注冊用戶登錄后才能發(fā)表評論。




網(wǎng)站導(dǎo)航: 博客園 IT新聞 Chat2DB C++博客博問管理
相關(guān)文章: PHP.INI安全配置 php中防止SQL注入的方法 php讀取網(wǎng)絡(luò)文件 curl, fsockopen ,file_get_contents 幾個方法的效率對比 thrift rpc 使用常見問題解答和經(jīng)驗(yàn) Graphviz "gif/png" not recognized問題解決方案 nginx 502 Bad Gateway 錯誤解決辦法 php中include和require的區(qū)別安裝Memcached和Memcached PHP擴(kuò)展【轉(zhuǎn)】 Linux中的php-fpm作為服務(wù)安裝 xsplit A PHP extension for Chinese segmentation using MMSEG algorithm[zz]

My Links

Blog Stats

留言簿(26)

隨筆分類

文章分類

文章檔案

博客連接

搜索

最新評論

閱讀排行榜

評論排行榜

[php]完善的escape/unescape/unicode_utf8互轉(zhuǎn)函數(shù) [轉(zhuǎn)]

Feedback

#1樓 [樓主] 回復(fù) 引用 查看

UCS-2編碼(16進(jìn)制)	UTF-8 字節(jié)流(二進(jìn)制)
0000 - 007F	0xxxxxxx
0080 - 07FF	110xxxxx 10xxxxxx
0800 - FFFF	1110xxxx 10xxxxxx 10xxxxxx

My Links

Blog Stats

留言簿(26)

隨筆分類

文章分類

文章檔案

博客連接

搜索

最新評論

閱讀排行榜

評論排行榜

[php]完善的escape/unescape/unicode_utf8互轉(zhuǎn)函數(shù) [轉(zhuǎn)]

Feedback

#1樓 [樓主] 回復(fù) 引用 查看

#1樓 [樓主] 回復(fù) 引用查看