@vincentwyshan
Created September 6, 2013 03:59
ES ik analyzer 1.2.0
index:
  analysis:
    analyzer:
      ik:
        alias: [ik_analyzer]
        type: org.elasticsearch.index.analysis.IkAnalyzerProvider
        use_smart: true
      test:
        tokenizer: 'ik'
        filter: ["synonym"]
        use_smart: true
      ik_smart:
        tokenizer: 'ik'
        filter: ["synonym"]
        use_smart: true
      ik_notsmart:
        tokenizer: 'ik'
        filter: ["synonym"]
        use_smart: false
    filter:
      synonym:
        type: synonym
        synonyms_path: "analysis/synonym.txt"
Test results:
http://127.0.0.1:9200/apple/_analyze?text=%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD&analyzer=ik_notsmart
{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}
http://127.0.0.1:9200/apple/_analyze?text=%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD&analyzer=ik_smart
{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}
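For reference, the two request URLs above differ only in the `analyzer` parameter. The following is a minimal Python sketch of how they could be constructed and how the token strings could be pulled out of a response for side-by-side comparison. The helper names `build_analyze_url` and `token_strings` are hypothetical, and the host, index name, and query-string form of `_analyze` are taken from the URLs above, not from any client library.

```python
from urllib.parse import urlencode


def build_analyze_url(base_url, index, text, analyzer):
    """Build an _analyze URL in the same query-string form used above.

    Hypothetical helper: it only formats the URL and sends no request.
    """
    # urlencode percent-encodes the UTF-8 bytes of the Chinese text.
    query = urlencode({"text": text, "analyzer": analyzer})
    return f"{base_url}/{index}/_analyze?{query}"


def token_strings(response):
    """Return the "token" values from a parsed _analyze response dict."""
    return [t["token"] for t in response["tokens"]]


if __name__ == "__main__":
    text = "中华人民共和国"
    for analyzer in ("ik_notsmart", "ik_smart"):
        print(build_analyze_url("http://127.0.0.1:9200", "apple", text, analyzer))
```

Passing each parsed JSON response to `token_strings` gives two plain lists that can be compared directly; with the results shown above, both lists contain the single token "中华人民共和国".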
@vincentwyshan

As I understand it, with use_smart: true the segmentation result should contain more than just the single token "中华人民共和国".

Output from an earlier version of the ik analyzer:

{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "中华人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "中华",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 3
    },
    {
      "token": "华人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 4
    },
    {
      "token": "人民共和国",
      "start_offset": 2,
      "end_offset": 7,
      "type": "word",
      "position": 5
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 6
    },
    {
      "token": "共和国",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 7
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 8
    }
  ]
}
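In the older output above, the tokens are overlapping sub-spans of the input, and each `start_offset`/`end_offset` pair appears to be a character range into the original text. The sketch below checks that reading against the listed tokens. The helper `offsets_match` is hypothetical, and the check assumes the offsets count characters (true for this input, since every character is in the Basic Multilingual Plane).

```python
def offsets_match(text, tokens):
    """Return True if every token equals text[start_offset:end_offset]."""
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"]
        for t in tokens
    )


# (token, start_offset, end_offset) copied from the older ik output above.
OLD_TOKENS = [
    {"token": tok, "start_offset": start, "end_offset": end}
    for tok, start, end in [
        ("中华人民共和国", 0, 7),
        ("中华人民", 0, 4),
        ("中华", 0, 2),
        ("华人", 1, 3),
        ("人民共和国", 2, 7),
        ("人民", 2, 4),
        ("共和国", 4, 7),
        ("共和", 4, 6),
    ]
]

if __name__ == "__main__":
    print(offsets_match("中华人民共和国", OLD_TOKENS))
```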
