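Lucene is an open-source full-text search toolkit from the Apache Software Foundation. It began as a subproject of the Jakarta project and is now a top-level Apache project. It is not a complete full-text search engine but a framework for building one: it provides a complete query engine and indexing engine, plus part of a text-analysis engine (for two Western languages, English and German). Lucene aims to give developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it. Its API is simple yet powerful, covering both full-text indexing and searching. In the Java world Lucene is a mature, free, open-source tool, and for years it has been the most popular free Java information-retrieval library. An information-retrieval library is related to a search engine, but the two should not be confused.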
This post shows how to implement full-text search over doc, docx, pdf, and txt documents with Lucene.
Two classes are involved:
LuceneCreateIndex, which creates the index:
package com.yhd.test.poi;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class LuceneCreateIndex {

    public static void main(String[] args) throws IOException {
        // Directory with the documents to index, and directory where the index is stored
        String dataDirectory = "D:\\Studying\\poi\\test\\dataDirectory";
        String indexDirectory = "D:\\Studying\\poi\\test\\indexDirectory";

        File[] files = new File(dataDirectory).listFiles();
        if (files == null) {
            throw new IOException("Cannot list directory: " + dataDirectory);
        }

        Directory directory = new SimpleFSDirectory(new File(indexDirectory));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        // create = true: build a new index, replacing any existing one
        IndexWriter indexWriter = new IndexWriter(directory, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        try {
            for (int i = 0; i < files.length; i++) {
                if (!files[i].isFile()) {
                    continue; // skip subdirectories
                }
                System.out.println("File #" + (i + 1) + " ----------------");
                System.out.println("Full path: " + files[i].toString());
                String fileName = files[i].getName();
                String fileType = fileName.substring(fileName.lastIndexOf(".") + 1).toLowerCase();
                System.out.println("File name: " + fileName);
                System.out.println("File type: " + fileType);

                Document doc = new Document();
                InputStream in = new FileInputStream(files[i]);
                try {
                    if (fileType.equals("doc")) {
                        // Word 97-2003: extract text with POI's HWPF extractor
                        WordExtractor wordExtractor = new WordExtractor(in);
                        doc.add(new Field("contents", wordExtractor.getText(),
                                Field.Store.YES, Field.Index.ANALYZED));
                        wordExtractor.close();
                    } else if (fileType.equals("docx")) {
                        // Word 2007+: extract text with POI's XWPF extractor
                        XWPFWordExtractor xwpfWordExtractor = new XWPFWordExtractor(new XWPFDocument(in));
                        doc.add(new Field("contents", xwpfWordExtractor.getText(),
                                Field.Store.YES, Field.Index.ANALYZED));
                        xwpfWordExtractor.close();
                    } else if (fileType.equals("pdf")) {
                        // PDF: parse and strip text with PDFBox 1.8.x
                        PDFParser parser = new PDFParser(in);
                        parser.parse();
                        PDDocument pdDocument = parser.getPDDocument();
                        PDFTextStripper stripper = new PDFTextStripper();
                        doc.add(new Field("contents", stripper.getText(pdDocument),
                                Field.Store.NO, Field.Index.ANALYZED));
                        pdDocument.close();
                    } else if (fileType.equals("txt")) {
                        // Plain text: read line by line (platform default charset)
                        BufferedReader br = new BufferedReader(new InputStreamReader(in));
                        StringBuilder text = new StringBuilder();
                        String line;
                        while ((line = br.readLine()) != null) {
                            // keep the line break so words on adjacent lines do not merge
                            text.append(line).append('\n');
                        }
                        doc.add(new Field("contents", text.toString(),
                                Field.Store.NO, Field.Index.ANALYZED));
                    } else {
                        // Unsupported file type: skip it
                        System.out.println();
                        continue;
                    }
                } finally {
                    in.close(); // always release the file handle, even when skipping
                }
                System.out.println("Note: created an index entry for file \"" + fileName + "\"");

                // Store the file name and indexing date verbatim (not tokenized)
                doc.add(new Field("filename", files[i].getName(),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.add(new Field("indexDate",
                        DateTools.dateToString(new Date(), DateTools.Resolution.DAY),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                indexWriter.addDocument(doc);
                System.out.println();
            }
            System.out.println("numDocs=" + indexWriter.numDocs());
        } finally {
            indexWriter.close();
        }
    }
}
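The file-type check and the txt branch above can also be pulled out into small helpers. The following is only a sketch using the plain JDK: the class name `FileTextHelper` and the methods `extensionOf` and `readText` are hypothetical names made up for illustration, and reading the text as UTF-8 is an assumption (the original code relies on the platform default charset).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Lucene, POI, or PDFBox. */
final class FileTextHelper {

    private FileTextHelper() {}

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Reads a whole text file with the given charset; line breaks are preserved. */
    static String readText(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file, then read it back (UTF-8 is an assumption)
        Path tmp = Files.createTempFile("sample", ".TXT");
        try {
            Files.write(tmp, "first line\nsecond line".getBytes(StandardCharsets.UTF_8));
            System.out.println(extensionOf(tmp.getFileName().toString()));
            System.out.println(readText(tmp, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Naming the charset explicitly makes the indexed text independent of the machine the indexer runs on. If the txt files were saved in another encoding, pass that charset instead.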
LuceneSearch, which runs the search:
package com.yhd.test.poi;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

public class LuceneSearch {

    public static void main(String[] args) throws IOException, ParseException {
        // Open the index written by LuceneCreateIndex
        String indexDirectory = "D:\\Studying\\poi\\test\\indexDirectory";
        Directory directory = new SimpleFSDirectory(new File(indexDirectory));
        IndexSearcher indexSearch = new IndexSearcher(directory);
        try {
            // Query the "contents" field with the same analyzer used at index time
            QueryParser queryParser = new QueryParser(Version.LUCENE_30, "contents",
                    new StandardAnalyzer(Version.LUCENE_30));
            Query query = queryParser.parse("百度"); // the search term ("Baidu")
            TopDocs hits = indexSearch.search(query, 10); // return at most 10 hits
            System.out.println("Found " + hits.totalHits + " matching document(s)");
            for (int i = 0; i < hits.scoreDocs.length; i++) {
                ScoreDoc sdoc = hits.scoreDocs[i];
                Document doc = indexSearch.doc(sdoc.doc);
                // "filename" was stored at index time, so it can be read back here
                System.out.println(doc.get("filename"));
            }
        } finally {
            indexSearch.close();
        }
    }
}
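The detailed explanation is in the code comments, so I won't repeat it here. The required JAR files are as follows:
Download the POI classes from the Apache POI website and the PDF classes from the Apache PDFBox website. This post uses PDFBox 1.8.13; the 2.0 API differs noticeably from the 1.x API.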
[Figure: overall project structure]
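First run the class:
LuceneCreateIndex
It reads the files under the dataDirectory directory, that is:
D:\Studying\poi\test\dataDirectory
and builds an index, which is saved under the indexDirectory directory, that is:
D:\Studying\poi\test\indexDirectory
Then run:
LuceneSearch
to query the index and see the results.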
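To make the "index first, then search" workflow concrete, here is a toy inverted index written with the plain JDK. It is only a sketch of the general idea, not how Lucene stores or scores anything internally. The class name `TinyInvertedIndex` and its methods are hypothetical, and the tokenization (splitting on anything that is not a letter or digit) is a simplifying assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index for illustration only; Lucene's real index is far more elaborate. */
final class TinyInvertedIndex {

    // term -> names of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Indexing step: split the text into lower-cased terms and record each one. */
    void add(String docName, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    /** Search step: look the term up directly instead of scanning every document. */
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("a.txt", "Lucene builds an index");
        index.add("b.txt", "Search the index with a query");
        System.out.println(index.search("index"));   // [a.txt, b.txt]
        System.out.println(index.search("lucene"));  // [a.txt]
        System.out.println(index.search("missing")); // []
    }
}
```

The split into two steps mirrors the two classes in this post: the expensive work of reading and tokenizing documents happens once at index time, and each query afterwards is a cheap lookup.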