LUCENE (Part 2)

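Analyzers

All of these analyzers give broadly similar results on English text, because English words are already separated by spaces. Chinese has no such delimiters, so segmenting it needs dedicated rules.

StandardAnalyzer:

Single-character tokenization: Chinese text is split one character at a time. For example, "我爱中国" ("I love China")
produces "我", "爱", "中", "国".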
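
To make the per-character behaviour concrete, here is a minimal pure-Java sketch that splits a string into one token per character. It only imitates the effect described above for Chinese text and is not Lucene's implementation; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SingleCharTokenizerSketch {

    // Emit one token per Unicode code point, skipping whitespace.
    // This mimics the "one character, one token" result for Chinese text only.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    public static void main(String[] args) {
        // prints [我, 爱, 中, 国]
        System.out.println(tokenize("我爱中国"));
    }
}
```

Because every character becomes its own term, a search for a two-character word can only match by combining single-character hits, which is why the dictionary-based analyzers below usually give better results for Chinese.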

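SmartChineseAnalyzer

Handles Chinese well but is hard to extend: custom extension dictionaries, stop-word dictionaries and synonym dictionaries are awkward to plug in.

IKAnalyzer
- Add the jar IK-Analyzer-1.0-SNAPSHOT.jar
- Add the configuration file, the extension dictionary and the stop-word dictionary to the project's classpath (⚠️ do not edit the extension dictionary carelessly: it must stay UTF-8 encoded, though adding entries by hand is fine)
  
  (placing them under src is enough)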
```
hotword.dic
IKAnalyzer.cfg.xml
stopword.dic
```
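
For reference, a configuration file that wires the two dictionaries together typically looks like the fragment below. This is a sketch rather than the author's actual file: the keys ext_dict and ext_stopwords are the ones used by the commonly distributed IK Analyzer, and the file names are taken from the listing above, so check them against the jar you are using.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Extension dictionary: one extra word per line, UTF-8 encoded -->
    <entry key="ext_dict">hotword.dic;</entry>
    <!-- Stop-word dictionary: one word per line, UTF-8 encoded -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
```
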
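Using a custom analyzer

Only the IndexWriter's config needs to change:

    Directory directory = FSDirectory.open(new File("D:\\temp\\index").toPath());
    // Pass the analyzer to the config; a no-argument IndexWriterConfig() falls back to StandardAnalyzer
    IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
    // Create the IndexWriter
    IndexWriter indexWriter = new IndexWriter(directory, config);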

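Testing an analyzer's output

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.junit.Test;
    // Package name as in the usual IK Analyzer distribution; adjust if your jar differs
    import org.wltea.analyzer.lucene.IKAnalyzer;

    @Test
    public void testTokenStream() throws Exception {
        // 1. Create the analyzer under test (swap in StandardAnalyzer to compare)
        // Analyzer analyzer = new StandardAnalyzer();
        Analyzer analyzer = new IKAnalyzer();
        // 2. Get a TokenStream from the analyzer's tokenStream method
        TokenStream tokenStream = analyzer.tokenStream("", "全文检索是将整本书java、整篇文章中的任意内容信息查找出来的检索，java。它可以根据需要获得全文中有关章、节、段、句、词等信息，计算机程序通过扫描文章中的每一个词，对每一个词建立一个索引，指明该词在文章中出现的次数和位置，当用户查询时根据建立的索引查找，类似于通过字典的检索字表查字的过程。 ");
        // 3. Register a CharTermAttribute; it always reflects the current token
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // 4. reset() must be called before the first incrementToken()
        tokenStream.reset();
        // 5. Walk the stream and print each token
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        // 6. Signal the end of the stream, then release its resources
        tokenStream.end();
        tokenStream.close();
    }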

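Maintaining the index

About Field

Field attributes

Analyzed: whether the field's content is tokenized. This only makes sense for fields whose content will be searched.

Indexed: whether the tokens produced by analysis, or the whole field value, are written to the index. Only indexed fields can be searched.

For example, product names and product descriptions are analyzed and then indexed; order numbers and ID-card numbers are not analyzed but are still indexed, because all of them will be used as query conditions.

Stored: whether the field value is stored in the document. Only stored fields can be read back from a Document.

For example, product names and order numbers: any field you later need to read from a Document must be stored.

Choose the Field class according to the data type: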

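| Field class | Data type | Analyzed | Indexed | Stored | Notes |
|---|---|---|---|---|---|
| StringField(FieldName, FieldValue, Store.YES) | String | N | Y | Y or N | Builds a string field that is not analyzed; the whole string is indexed as a single term (for example an order number or a person's name). Store.YES or Store.NO decides whether the value is stored in the document. |
| LongPoint(String name, long... point) | long | N | Y | N | LongPoint, IntPoint and similar classes index numeric values so they can be searched. The value is indexed as a numeric point rather than passed through the analyzer, and it is not stored; add a StoredField as well if you need to read the value back. |
| StoredField(FieldName, FieldValue) | Overloaded, supports several types | N | N | Y | Builds fields of various types that are neither analyzed nor indexed, only stored in the document. |
| TextField(FieldName, FieldValue, Store.NO) or TextField(FieldName, reader) | String or stream (Reader) | Y | Y | Y or N | When given a Reader, Lucene assumes the content is large and does not store it. |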

Adding to the index

    // Create an IndexWriter
    IndexWriter indexWriter = new IndexWriter(FSDirectory.open(new File("/Users/index_sth").toPath()),
            new IndexWriterConfig(new IKAnalyzer()));
    // Create a Document
    Document document = new Document();
    // Add fields to the Document
    document.add(new TextField("name", "今天很开心", Field.Store.YES));
    document.add(new TextField("content", "内容是今天很开心", Field.Store.NO));
    // Stored only: can be read back from a hit, but cannot be searched
    document.add(new StoredField("path", "/U/c"));
    // Add the document (this writes it to the index)
    indexWriter.addDocument(document);
    // Close the IndexWriter, which also commits the change
    indexWriter.close();

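Deleting from the index

    // indexWriter is assumed to be a field initialized elsewhere, for example in a @Before method
    @Test
    public void deleteIndex() throws Exception {
        // Delete every document in the index
        indexWriter.deleteAll();
        // Close the writer
        indexWriter.close();
    }

    @Test
    public void deleteIndexByQuery() throws Exception {
        // Delete by condition: every document whose "name" field contains the term "apache"
        indexWriter.deleteDocuments(new Term("name", "apache"));
        // Close the writer
        indexWriter.close();
    }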

Updating the index

    @Test
    public void updateDocument() throws Exception {
        // Updating a document is really a delete followed by an add
        Document document = new Document();
        document.add(new TextField("name1", "我的天1", Field.Store.YES));
        document.add(new TextField("name2", "我的天2", Field.Store.YES));
        document.add(new TextField("name3", "我的天3", Field.Store.YES));
        // Delete every document matching name:spring, then add the new document
        indexWriter.updateDocument(new Term("name", "spring"), document);
        // Close the writer
        indexWriter.close();
    }
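
The delete-then-add behaviour has a few consequences that are easy to miss: every document matching the term is removed, exactly one document is added, and the new document does not have to contain the term that was matched. The toy model below illustrates this in plain Java. It is only a sketch of the semantics, with made-up class and method names, and does not use Lucene.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UpdateSketch {

    // Each toy "document" maps a field name to a single term value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
    }

    // Remove every document whose field holds exactly the given term.
    void deleteDocuments(String field, String term) {
        docs.removeIf(d -> term.equals(d.get(field)));
    }

    // Update = delete everything matching the term, then add the replacement.
    void updateDocument(String field, String term, Map<String, String> replacement) {
        deleteDocuments(field, term);
        addDocument(replacement);
    }

    int size() {
        return docs.size();
    }

    boolean contains(String field, String term) {
        return docs.stream().anyMatch(d -> term.equals(d.get(field)));
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        Map<String, String> spring = new HashMap<>();
        spring.put("name", "spring");
        index.addDocument(spring);
        index.addDocument(spring);            // two documents match name:spring

        Map<String, String> replacement = new HashMap<>();
        replacement.put("name1", "我的天1");   // no "name" field at all
        index.updateDocument("name", "spring", replacement);

        System.out.println(index.size());                     // prints 1
        System.out.println(index.contains("name", "spring")); // prints false
    }
}
```

In the example above the replacement document only has the fields name1, name2 and name3, so after the update a search for name:spring finds nothing.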

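Querying

TermQuery

TermQuery searches by a single term. It does not run the query text through an analyzer, so it is best used against fields that are not analyzed, such as an order number or a category ID.
You specify the field to search and the term to look for.

See Getting Started with LUCENE (Part 1) for details.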
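
Why TermQuery fits un-analyzed fields can be shown with a small pure-Java model. The sketch below assumes a toy analyzer that lower-cases and splits on whitespace, and a term lookup that compares the query string verbatim with the indexed terms; the names and the sample order number are hypothetical, and none of this is Lucene code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class TermLookupSketch {

    // Toy "analyzed" field: lower-case the text and split it on whitespace.
    static Set<String> analyze(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    // Toy "un-analyzed" field: the whole value is the only indexed term.
    static Set<String> keyword(String text) {
        return Collections.singleton(text);
    }

    // A term lookup compares the query term verbatim, with no analysis step.
    static boolean termMatches(Set<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    public static void main(String[] args) {
        Set<String> analyzedName = analyze("Apache Lucene");
        Set<String> orderNumber = keyword("ORDER-2020-0001");

        System.out.println(termMatches(analyzedName, "Apache Lucene"));   // prints false
        System.out.println(termMatches(analyzedName, "lucene"));          // prints true
        System.out.println(termMatches(orderNumber, "ORDER-2020-0001"));  // prints true
    }
}
```

The first lookup fails because the index holds the tokens "apache" and "lucene", not the original string, while the un-analyzed field matches the full value exactly.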

RangeQuery

    // indexSearcher and indexReader are assumed to be fields initialized elsewhere
    @Test
    public void testRangeQuery() throws Exception {
        // Create a Query matching documents whose "size" lies in [0, 10000], both bounds inclusive
        Query query = LongPoint.newRangeQuery("size", 0L, 10000L);
        printResult(query);
    }

    private void printResult(Query query) throws Exception {
        // Run the query and keep the top 10 hits
        TopDocs topDocs = indexSearcher.search(query, 10);
        System.out.println("Total hits: " + topDocs.totalHits);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc doc : scoreDocs) {
            // Get the document id
            int docId = doc.doc;
            // Load the document by id
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
            // null unless "size" was also added as a StoredField
            System.out.println(document.get("size"));
            //System.out.println(document.get("content"));
            System.out.println("-----------------");
        }
        // Closes the shared reader, so call printResult only once per test
        indexReader.close();
    }
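
LongPoint.newRangeQuery treats both bounds as inclusive, so 0 and 10000 themselves match in the query above. For exclusive bounds the usual approach is to shift each bound by one with Math.addExact before building the query. The pure-Java sketch below illustrates that arithmetic only; the class and method names are made up and no Lucene classes are involved.

```java
public class RangeBoundsSketch {

    // Inclusive range check: both lower and upper themselves count as matches.
    static boolean inRange(long value, long lower, long upper) {
        return value >= lower && value <= upper;
    }

    // Turn exclusive bounds into the equivalent inclusive bounds.
    // Math.addExact throws ArithmeticException instead of silently overflowing.
    static long[] exclusiveToInclusive(long lowerExclusive, long upperExclusive) {
        return new long[] {
            Math.addExact(lowerExclusive, 1),
            Math.addExact(upperExclusive, -1)
        };
    }

    public static void main(String[] args) {
        System.out.println(inRange(10000L, 0L, 10000L)); // prints true
        long[] bounds = exclusiveToInclusive(0L, 10000L);
        System.out.println(bounds[0] + ".." + bounds[1]); // prints 1..9999
    }
}
```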

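QueryParser

QueryParser analyzes the query text first and then searches with the resulting terms.

Add the jar lucene-queryparser-7.4.0.jar

    // QueryParser here is org.apache.lucene.queryparser.classic.QueryParser
    @Test
    public void testQueryParser() throws Exception {
        // Create a QueryParser with two arguments:
        // argument 1 is the default search field, argument 2 is the analyzer
        QueryParser queryParser = new QueryParser("name", new IKAnalyzer());
        // Use the QueryParser to build a Query from the analyzed text
        Query query = queryParser.parse("lucene是一个Java开发的全文检索工具包");
        // Run the query
        printResult(query);
    }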

Bxan

http://anbingxu666.com/2020/01/04/LUCENE%E5%85%A5%E9%97%A8%EF%BC%88%E4%BA%8C%EF%BC%89/