無為

Through non-action, anything can be done; through non-action, one reaches the greatest depth!

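When working with documents in Java, many people run into the same problem: how do you get at the content of Word, Excel, PDF and similar files? I looked into it, and here I summarize several ways to extract text from Word and PDF files.

1. Using jacob

jacob is really a bridge: middleware that connects Java to COM or Win32 functions. jacob cannot extract Word, Excel or other files by itself; it needs a native DLL. You do not have to write that DLL yourself, though, because jacob's author ships one along with the library.

Download the jacob jar and DLL:

http://danadler.com/jacob/

Once jacob is downloaded and put in place (the DLL on the PATH, the jar on the classpath), you can write your own extraction program. Here is a simple example: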
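Before running the example, it can help to confirm that both installation steps took effect. The following is a minimal sketch using only the JDK; the helper class `JacobSetupCheck` is hypothetical and not part of jacob, and the file name `jacob.dll` is an assumption (the DLL in your download may carry a version or platform suffix).

```java
import java.io.File;

/** Hypothetical helper, not part of jacob: checks the two installation steps. */
final class JacobSetupCheck {

    /** Returns true if the named class can be located on the classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, JacobSetupCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns true if fileName exists in any directory listed in pathValue. */
    static boolean isOnPath(String fileName, String pathValue) {
        if (pathValue == null) {
            return false;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (!dir.isEmpty() && new File(dir, fileName).isFile()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "jacob.dll" is an assumed file name; adjust it to match your download
        System.out.println("jacob jar on classpath: " + isOnClasspath("com.jacob.com.Dispatch"));
        System.out.println("jacob.dll on PATH: " + isOnPath("jacob.dll", System.getenv("PATH")));
    }
}
```

If either line prints false, the extraction program will most likely fail at startup, so it is worth fixing the setup first.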

import com.jacob.com.*;
import com.jacob.activeX.*;
/**
 * Title: pdf extraction
 * Description: email:chris@matrix.org.cn
 * Copyright: Matrix Copyright (c) 2003
 * Company: Matrix.org.cn
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */
public class FileExtracter {
  public static void main(String[] args) {
    // Start Word through COM (needs Windows with Microsoft Word installed)
    ActiveXComponent component = new ActiveXComponent("Word.Application");
    String inFile = "c:\\test.doc";   // document to read
    String tpFile = "c:\\temp.htm";   // HTML copy written by Word
    try {
      // Keep the Word window hidden
      component.setProperty("Visible", new Variant(false));
      // The Documents collection of the running Word instance
      Dispatch wordacc = component.getProperty("Documents").toDispatch();
      // Documents.Open(FileName, ConfirmConversions = false, ReadOnly = true)
      Dispatch wordfile = Dispatch.invoke(wordacc, "Open", Dispatch.Method,
          new Object[]{inFile, new Variant(false), new Variant(true)},
          new int[1]).toDispatch();
      // SaveAs with file format 8 (wdFormatHTML) writes the document as HTML
      Dispatch.invoke(wordfile, "SaveAs", Dispatch.Method,
          new Object[]{tpFile, new Variant(8)}, new int[1]);
      // Close the document without saving changes
      Dispatch.call(wordfile, "Close", new Variant(false));
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      // Always shut Word down, even when the conversion failed
      component.invoke("Quit", new Variant[]{});
    }
  }
}
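The jacob example stops once Word has saved the document as HTML (c:\temp.htm); getting plain text still requires removing the markup. The post does not show that step, so the following is only a rough sketch under stated assumptions: the class `HtmlTextSketch` is hypothetical, the regex-based stripping is a simplification rather than a real HTML parser, and the charset of the saved file is an assumption you must supply.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: crude tag stripping for the HTML file saved by Word. */
final class HtmlTextSketch {

    /** Removes script/style blocks, comments and tags, then collapses whitespace. */
    static String stripTags(String html) {
        String s = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        s = s.replaceAll("(?s)<!--.*?-->", " ");   // also drops Word's conditional comments
        s = s.replaceAll("(?s)<[^>]+>", " ");      // remaining tags
        // Decode a few common entities; "&amp;" goes last to avoid double decoding
        s = s.replace("&nbsp;", " ").replace("&lt;", "<")
             .replace("&gt;", ">").replace("&amp;", "&");
        return s.replaceAll("\\s+", " ").trim();
    }

    /** Reads the file at path using the given charset and returns its plain text. */
    static String readText(String path, String charsetName) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        return stripTags(new String(bytes, Charset.forName(charsetName)));
    }
}
```

A call such as `HtmlTextSketch.readText("c:\\temp.htm", "UTF-8")` would return the text, provided the charset argument matches how Word actually encoded the file.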

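2. Using Apache POI to extract Word and Excel

POI is an Apache project. Even with POI you may find the work tedious, but that does not matter: a simpler wrapper interface is available.

Download the wrapped POI package:

http://jakarta.apache.org/poi/

After downloading, put it on your classpath. Here is an example of how to use it: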

import java.io.*;
import org.textmining.text.extraction.WordExtractor;
/**
 * Title: word extraction
 * Description: email:chris@matrix.org.cn
 * Copyright: Matrix Copyright (c) 2003
 * Company: Matrix.org.cn
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */

// Despite its name, this class extracts text from a Word file
public class PdfExtractor {
  public PdfExtractor() {
  }
  public static void main(String args[]) throws Exception
  {
    FileInputStream in = new FileInputStream("c:\\a.doc");
    try {
      // The wrapper returns the whole document text as one string
      WordExtractor extractor = new WordExtractor();
      String str = extractor.extractText(in);
      System.out.println("the result length is " + str.length());
      System.out.println("the result is " + str);
    } finally {
      // Release the file handle even if extraction fails
      in.close();
    }
  }
}

3. pdfbox, for extracting PDF files

pdfbox does not yet handle Chinese well, however. First download pdfbox:

http://www.pdfbox.org/

Here is an example of extracting a PDF file with pdfbox:

import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdfparser.PDFParser;
import java.io.*;
import org.pdfbox.util.PDFTextStripper;
/**
 * Title: pdf extraction
 * Description: email:chris@matrix.org.cn
 * Copyright: Matrix Copyright (c) 2003
 * Company: Matrix.org.cn
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */

public class PdfExtracter {

  public PdfExtracter() {
  }

  public String GetTextFromPdf(String filename) throws Exception
  {
    PDDocument pdfdocument = null;
    FileInputStream is = new FileInputStream(filename);
    try {
      // Parse the PDF file into a document object
      PDFParser parser = new PDFParser(is);
      parser.parse();
      pdfdocument = parser.getPDDocument();
      // Write the extracted text into an in-memory buffer
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      OutputStreamWriter writer = new OutputStreamWriter(out);
      PDFTextStripper stripper = new PDFTextStripper();
      stripper.writeText(pdfdocument.getDocument(), writer);
      writer.close();
      byte[] contents = out.toByteArray();

      // Both the writer and this String use the platform default charset
      String ts = new String(contents);
      System.out.println("the string length is " + contents.length + "\n");
      return ts;
    } finally {
      // Release the document and the input stream
      if (pdfdocument != null) {
        pdfdocument.close();
      }
      is.close();
    }
  }

  public static void main(String args[])
  {
    PdfExtracter pf = new PdfExtracter();
    try {
      String ts = pf.GetTextFromPdf("c:\\a.pdf");
      System.out.println(ts);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

}
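One detail of the pdfbox example matters for non-ASCII text: it encodes the extracted text with an OutputStreamWriter and decodes it with new String(bytes), both relying on the platform default charset. The sketch below uses only the JDK to show what that round trip does to Chinese characters under different charsets, and how a StringWriter avoids the byte round trip altogether. The class `CharsetRoundTrip` is hypothetical, and this addresses only the Java-side encoding step, not pdfbox's own handling of Chinese fonts.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: what the writer-to-bytes-to-String round trip does to text. */
final class CharsetRoundTrip {

    /** Encodes text to bytes and decodes it again, using the same explicit charset. */
    static String viaBytes(String text, Charset cs) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, cs);
        writer.write(text);
        writer.close();
        return new String(out.toByteArray(), cs);
    }

    /** Collects text in a StringWriter, so no byte encoding is involved at all. */
    static String viaStringWriter(String text) throws IOException {
        StringWriter writer = new StringWriter();
        writer.write(text);
        writer.close();
        return writer.toString();
    }

    public static void main(String[] args) throws IOException {
        String chinese = "\u4e2d\u6587"; // the two characters of the word "Chinese"
        System.out.println(chinese.equals(viaBytes(chinese, StandardCharsets.UTF_8)));
        System.out.println(chinese.equals(viaBytes(chinese, StandardCharsets.US_ASCII)));
        System.out.println(chinese.equals(viaStringWriter(chinese)));
    }
}
```

Since writeText in the example accepts a Writer, passing a StringWriter (or an OutputStreamWriter with an explicit charset) is one way to keep the result independent of the machine's default encoding.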

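4. Extracting Chinese PDF files with xpdf

xpdf is an open-source project; we can call its native program to extract text from Chinese PDF files.

Download the xpdf package:

http://www.foolabs.com/xpdf/

You also need to download the Chinese language support package and install it as its readme describes. After that you can write the Java program that calls the native tool.

Here is an example of how to call it: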

import java.io.*;
/**
 * Title: pdf extraction
 * Description: email:chris@matrix.org.cn
 * Copyright: Matrix Copyright (c) 2003
 * Company: Matrix.org.cn
 * @author chris
 * @version 1.0,who use this example pls remain the declare
 */

public class PdfWin {
  public PdfWin() {
  }
  public static void main(String args[]) throws Exception
  {
    String PATH_TO_XPDF = "C:\\Program Files\\xpdf\\pdftotext.exe";
    String filename = "c:\\a.pdf";
    // "-enc UTF-8" selects the output encoding, "-q" suppresses messages,
    // and the trailing "-" sends the text to standard output
    String[] cmd = new String[] { PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-" };
    Process p = Runtime.getRuntime().exec(cmd);
    BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
    InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
    StringWriter out = new StringWriter();
    char[] buf = new char[10000];
    int len;
    while ((len = reader.read(buf)) >= 0) {
      // Accumulate every chunk; buf alone only ever holds the latest one
      out.write(buf, 0, len);
      System.out.println("the length is " + len);
    }
    reader.close();
    // Wait for pdftotext to exit before using the result
    p.waitFor();
    String ts = out.toString();
    System.out.println("the str is " + ts);
  }
}
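When calling pdftotext this way, the complete text is only obtained if every chunk read from the process is accumulated, since the buffer is overwritten on each read. The following variant is a sketch using ProcessBuilder that also checks the exit code; the class and method names are hypothetical, and the command-line flags are the ones used in the example above.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper around the pdftotext command line used above. */
final class PdfToTextSketch {

    /** Builds the command: UTF-8 output, quiet mode, text written to stdout ("-"). */
    static List<String> buildCommand(String exe, String pdf) {
        return Arrays.asList(exe, "-enc", "UTF-8", "-q", pdf, "-");
    }

    /** Drains the reader, appending every chunk, whatever the buffer size. */
    static String readAll(Reader reader, int bufferSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[bufferSize];
        int len;
        while ((len = reader.read(buf)) >= 0) {
            sb.append(buf, 0, len);
        }
        return sb.toString();
    }

    /** Runs pdftotext and returns its output; throws if the tool reports failure. */
    static String extract(String exe, String pdf) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(exe, pdf));
        // Pass the tool's error output through instead of leaving it unread
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();
        String text;
        try (Reader r = new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8)) {
            text = readAll(r, 8192);
        }
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with code " + exit);
        }
        return text;
    }
}
```

A call such as `PdfToTextSketch.extract("C:\\Program Files\\xpdf\\pdftotext.exe", "c:\\a.pdf")` would return the whole text as one string.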

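Every article bearing this mark is an original work by this blog's author, Caoer (草兒). If you index, bookmark or repost it, please credit the source and the original author. Thank you very much.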

Posted on 2006-06-11 12:58 by 草兒. Category: Java編程經驗談
