7.8. Tokenizers¶
7.8.1. Summary¶
Groonga has a tokenizer module that tokenizes text. Tokenizers are used, for example, when indexing text for full-text search and when processing a search query.
The tokenizer is an important module for full-text search. You can change the trade-off between precision and recall by changing the tokenizer.
Normally, TokenBigram is a suitable tokenizer. If you don't know much about tokenizers, it's recommended that you choose TokenBigram.
You can try a tokenizer with tokenize and table_tokenize. Here is an example that tries the TokenBigram tokenizer with tokenize:
Execution example:
tokenize TokenBigram "Hello World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "He"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "el"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lo"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "o "
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": " W"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "Wo"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "or"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "rl"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "ld"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "d"
# }
# ]
# ]
7.8.2. What is “tokenize”?¶
“tokenize” is the process that extracts zero or more tokens from a text. There are several “tokenize” methods.
For example, Hello World is tokenized to the following tokens by the bigram tokenize method:
- He
- el
- ll
- lo
- o_ (_ means a white-space)
- _W (_ means a white-space)
- Wo
- or
- rl
- ld
In the above example, 10 tokens are extracted from the one text Hello World.
For example, Hello World is tokenized to the following tokens by the white-space-separate tokenize method:
- Hello
- World
In the above example, 2 tokens are extracted from the one text Hello World.
A token is used as a search key. You can find indexed documents only by tokens that are extracted by the tokenize method used for indexing. For example, you can find Hello World by ll with the bigram tokenize method, but you can't find Hello World by ll with the white-space-separate tokenize method, because the white-space-separate tokenize method doesn't extract an ll token. It just extracts the Hello and World tokens.
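The two tokenize methods described above can be sketched in Python (an illustration only, not Groonga's implementation; `bigram_tokenize` and `whitespace_tokenize` are hypothetical names):

```python
def bigram_tokenize(text):
    # Two adjacent characters per token: a sliding window over the text.
    # (Groonga also emits a trailing one-character token such as "d".)
    return [text[i:i + 2] for i in range(len(text) - 1)]

def whitespace_tokenize(text):
    # Each white-space-separated word becomes one token.
    return text.split()

print(bigram_tokenize("Hello World"))
# ['He', 'el', 'll', 'lo', 'o ', ' W', 'Wo', 'or', 'rl', 'ld']
print(whitespace_tokenize("Hello World"))
# ['Hello', 'World']

# "ll" is a usable search key only for the bigram method:
print("ll" in bigram_tokenize("Hello World"))      # True
print("ll" in whitespace_tokenize("Hello World"))  # False
```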
In general, a tokenize method that generates small tokens increases recall but decreases precision. A tokenize method that generates large tokens increases precision but decreases recall.
For example, we can find both Hello World and A or B by or with the bigram tokenize method. Hello World is noise for people who want to search for the logical “or”. It means that precision is decreased, but recall is increased.
We can find only A or B by or with the white-space-separate tokenize method, because World is tokenized to the single token World with the white-space-separate tokenize method. It means that precision is increased for people who want to search for the logical “or”, but recall is decreased because Hello World, which contains or, isn't found.
7.8.3. Built-in tokenizers¶
Here is a list of built-in tokenizers:
- TokenBigram
- TokenBigramSplitSymbol
- TokenBigramSplitSymbolAlpha
- TokenBigramSplitSymbolAlphaDigit
- TokenBigramIgnoreBlank
- TokenBigramIgnoreBlankSplitSymbol
- TokenBigramIgnoreBlankSplitSymbolAlpha
- TokenBigramIgnoreBlankSplitSymbolAlphaDigit
- TokenUnigram
- TokenTrigram
- TokenDelimit
- TokenDelimitNull
- TokenMecab
- TokenRegexp
7.8.3.1. TokenBigram¶
TokenBigram is a bigram-based tokenizer. It's recommended to use
this tokenizer for most cases.
The bigram tokenize method tokenizes a text into tokens of two adjacent
characters. For example, Hello is tokenized to the following tokens:
- He
- el
- ll
- lo
The bigram tokenize method is good for recall because you can find all texts by a query consisting of two or more characters.
In general, you can't find all texts by a query consisting of one
character because no one-character token exists. But you can find
all texts by a query consisting of one character in Groonga, because
Groonga finds tokens that start with the query by predictive search. For
example, Groonga can find the ll and lo tokens by the l query.
The bigram tokenize method isn't good for precision because you can find
texts that include the query inside a word. For example, you can find world
by or. This is more of a problem for ASCII-only languages than for
non-ASCII languages. TokenBigram has a solution for this problem,
described below.
TokenBigram behaves differently depending on whether it works with a
Normalizer.
If no normalizer is used, TokenBigram uses the pure bigram tokenize
method (all tokens except the last token have two characters):
Execution example:
tokenize TokenBigram "Hello World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "He"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "el"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lo"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "o "
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": " W"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "Wo"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "or"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "rl"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "ld"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "d"
# }
# ]
# ]
If a normalizer is used, TokenBigram uses a white-space-separate-like
tokenize method for ASCII characters and the bigram tokenize method for
non-ASCII characters.
You may be confused by this combined behavior, but it's reasonable for most use cases such as English text (only ASCII characters) and Japanese text (ASCII and non-ASCII characters are mixed).
Most languages that consist of only ASCII characters use white-space as the word separator. The white-space-separate tokenize method is suitable for this case.
Languages that consist of non-ASCII characters don't use white-space as the word separator. The bigram tokenize method is suitable for this case.
The mixed tokenize method is suitable for the mixed-language case.
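The mixed behavior can be sketched in Python (a simplification: real TokenBigram also normalizes text and splits ASCII runs on character-type changes; `combined_tokenize` is a hypothetical name):

```python
import re

def combined_tokenize(text):
    tokens = []
    # Split the text into runs of ASCII alphanumerics and runs of
    # non-ASCII characters.
    for run in re.findall(r"[A-Za-z0-9]+|[^\x00-\x7F]+", text):
        if run[0].isascii():
            # White-space-separate-like: the whole run is one token.
            tokens.append(run.lower())
        else:
            # Bigram: two adjacent characters per token, plus the
            # trailing one-character token.
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
            tokens.append(run[-1])
    return tokens

print(combined_tokenize("Hello World"))  # ['hello', 'world']
print(combined_tokenize("日本語の勉強"))
# ['日本', '本語', '語の', 'の勉', '勉強', '強']
```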
If you want to use the bigram tokenize method for ASCII characters, see
the TokenBigramSplitXXX type tokenizers such as
TokenBigramSplitSymbolAlpha.
Let's confirm TokenBigram behavior by example.
TokenBigram uses one or more white-spaces as the token delimiter for
ASCII characters:
Execution example:
tokenize TokenBigram "Hello World" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "hello"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "world"
# }
# ]
# ]
TokenBigram also uses a change of character type as a token delimiter for
ASCII characters. The character type is one of the following:
- Alphabet
- Digit
- Symbol (such as (, ) and !)
- Hiragana
- Katakana
- Kanji
- Others
The following example shows two token delimiters:
- between 100 (digits) and cents (alphabets)
- between cents (alphabets) and !!! (symbols)
Execution example:
tokenize TokenBigram "100cents!!!" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "100"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "cents"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "!!!"
# }
# ]
# ]
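The character-type delimiter can be sketched with itertools.groupby (a simplification that models only the alphabet, digit and symbol types from the list above; `split_by_type_change` is a hypothetical name):

```python
import itertools

def char_type(c):
    # Only three of the character types listed above are modeled here.
    if c.isalpha():
        return "alphabet"
    if c.isdigit():
        return "digit"
    return "symbol"

def split_by_type_change(text):
    # A new token starts whenever the character type changes.
    return ["".join(run) for _, run in itertools.groupby(text, key=char_type)]

print(split_by_type_change("100cents!!!"))  # ['100', 'cents', '!!!']
```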
Here is an example where TokenBigram uses the bigram tokenize method
for non-ASCII characters:
Execution example:
tokenize TokenBigram "日本語の勉強" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "日本"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "本語"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "語の"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "の勉"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "勉強"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "強"
# }
# ]
# ]
7.8.3.2. TokenBigramSplitSymbol¶
TokenBigramSplitSymbol is similar to TokenBigram. The
difference between them is symbol handling. TokenBigramSplitSymbol
tokenizes symbols by bigram tokenize method:
Execution example:
tokenize TokenBigramSplitSymbol "100cents!!!" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "100"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "cents"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
7.8.3.3. TokenBigramSplitSymbolAlpha¶
TokenBigramSplitSymbolAlpha is similar to TokenBigram. The
difference between them is symbol and alphabet
handling. TokenBigramSplitSymbolAlpha tokenizes symbols and
alphabets by bigram tokenize method:
Execution example:
tokenize TokenBigramSplitSymbolAlpha "100cents!!!" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "100"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "ce"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "en"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "nt"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "ts"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "s!"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
7.8.3.4. TokenBigramSplitSymbolAlphaDigit¶
TokenBigramSplitSymbolAlphaDigit is similar to
TokenBigram. The difference between them is symbol, alphabet
and digit handling. TokenBigramSplitSymbolAlphaDigit tokenizes
symbols, alphabets and digits by bigram tokenize method. It means that
all characters are tokenized by bigram tokenize method:
Execution example:
tokenize TokenBigramSplitSymbolAlphaDigit "100cents!!!" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "10"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "00"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "0c"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "ce"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "en"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "nt"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "ts"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "s!"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
7.8.3.5. TokenBigramIgnoreBlank¶
TokenBigramIgnoreBlank is similar to TokenBigram. The
difference between them is blank handling. TokenBigramIgnoreBlank
ignores white-spaces in runs of symbols and non-ASCII characters.
You can see the difference between them with the 日 本 語 ! ! ! text because it
has symbols and non-ASCII characters.
Here is a result by TokenBigram:
Execution example:
tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "日"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "本"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "語"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
Here is a result by TokenBigramIgnoreBlank:
Execution example:
tokenize TokenBigramIgnoreBlank "日 本 語 ! ! !" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "日本"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "本語"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "語"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "!!!"
# }
# ]
# ]
7.8.3.6. TokenBigramIgnoreBlankSplitSymbol¶
TokenBigramIgnoreBlankSplitSymbol is similar to
TokenBigram. The differences between them are the following:
- Blank handling
- Symbol handling
TokenBigramIgnoreBlankSplitSymbol ignores white-spaces in
runs of symbols and non-ASCII characters.
TokenBigramIgnoreBlankSplitSymbol tokenizes symbols by the bigram
tokenize method.
You can see the difference between them with the 日 本 語 ! ! ! text because it
has symbols and non-ASCII characters.
Here is a result by TokenBigram:
Execution example:
tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "日"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "本"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "語"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
Here is a result by TokenBigramIgnoreBlankSplitSymbol:
Execution example:
tokenize TokenBigramIgnoreBlankSplitSymbol "日 本 語 ! ! !" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "日本"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "本語"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "語!"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
7.8.3.7. TokenBigramIgnoreBlankSplitSymbolAlpha¶
TokenBigramIgnoreBlankSplitSymbolAlpha is similar to
TokenBigram. The differences between them are the following:
- Blank handling
- Symbol and alphabet handling
TokenBigramIgnoreBlankSplitSymbolAlpha ignores white-spaces in
runs of symbols and non-ASCII characters.
TokenBigramIgnoreBlankSplitSymbolAlpha tokenizes symbols and
alphabets by the bigram tokenize method.
You can see the difference between them with the Hello 日 本 語 ! ! ! text
because it has symbols, alphabets and non-ASCII characters with white-spaces.
Here is a result by TokenBigram:
Execution example:
tokenize TokenBigram "Hello 日 本 語 ! ! !" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "hello"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "日"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "本"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "語"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
Here is a result by TokenBigramIgnoreBlankSplitSymbolAlpha:
Execution example:
tokenize TokenBigramIgnoreBlankSplitSymbolAlpha "Hello 日 本 語 ! ! !" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "he"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "el"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lo"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "o日"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "日本"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "本語"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "語!"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
7.8.3.8. TokenBigramIgnoreBlankSplitSymbolAlphaDigit¶
TokenBigramIgnoreBlankSplitSymbolAlphaDigit is similar to
TokenBigram. The differences between them are the following:
- Blank handling
- Symbol, alphabet and digit handling
TokenBigramIgnoreBlankSplitSymbolAlphaDigit ignores white-spaces
in runs of symbols and non-ASCII characters.
TokenBigramIgnoreBlankSplitSymbolAlphaDigit tokenizes symbols,
alphabets and digits by the bigram tokenize method. It means that
all characters are tokenized by the bigram tokenize method.
You can see the difference between them with the Hello 日 本 語 ! ! ! 777 text
because it has symbols, alphabets, digits and non-ASCII characters
with white-spaces.
Here is a result by TokenBigram:
Execution example:
tokenize TokenBigram "Hello 日 本 語 ! ! ! 777" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "hello"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "日"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "本"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "語"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "!"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "777"
# }
# ]
# ]
Here is a result by TokenBigramIgnoreBlankSplitSymbolAlphaDigit:
Execution example:
tokenize TokenBigramIgnoreBlankSplitSymbolAlphaDigit "Hello 日 本 語 ! ! ! 777" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "he"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "el"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lo"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "o日"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "日本"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "本語"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "語!"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "!!"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "!7"
# },
# {
# "position": 11,
# "force_prefix": false,
# "value": "77"
# },
# {
# "position": 12,
# "force_prefix": false,
# "value": "77"
# },
# {
# "position": 13,
# "force_prefix": false,
# "value": "7"
# }
# ]
# ]
7.8.3.9. TokenUnigram¶
TokenUnigram is similar to TokenBigram. The difference
between them is the token unit. TokenBigram uses 2 characters per
token; TokenUnigram uses 1 character per token.
Execution example:
tokenize TokenUnigram "100cents!!!" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "100"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "cents"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "!!!"
# }
# ]
# ]
7.8.3.10. TokenTrigram¶
TokenTrigram is similar to TokenBigram. The difference
between them is the token unit. TokenBigram uses 2 characters per
token; TokenTrigram uses 3 characters per token.
Execution example:
tokenize TokenTrigram "10000cents!!!!!" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "10000"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "cents"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "!!!!!"
# }
# ]
# ]
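TokenUnigram, TokenBigram and TokenTrigram differ only in the token unit n. For non-ASCII text, the n-gram method can be sketched as follows (`ngram_tokenize` is a hypothetical name; the ASCII grouping applied with a normalizer is not modeled):

```python
def ngram_tokenize(text, n):
    # One token per position; trailing tokens get shorter than n.
    return [text[i:i + n] for i in range(len(text))]

print(ngram_tokenize("日本語", 1))  # ['日', '本', '語']
print(ngram_tokenize("日本語", 2))  # ['日本', '本語', '語']
print(ngram_tokenize("日本語", 3))  # ['日本語', '本語', '語']
```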
7.8.3.11. TokenDelimit¶
TokenDelimit extracts tokens by splitting the text on one or more space
characters (U+0020). For example, Hello World is tokenized to
Hello and World.
TokenDelimit is suitable for tag text. You can extract groonga,
full-text-search and http as tags from groonga
full-text-search http.
Here is an example of TokenDelimit:
Execution example:
tokenize TokenDelimit "Groonga full-text-search HTTP" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "groonga"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "full-text-search"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "http"
# }
# ]
# ]
TokenDelimit can also take options.
TokenDelimit has the delimiter option and the pattern option.
The delimiter option splits tokens by the specified characters.
For example, Hello,World is tokenized to Hello and World
with the delimiter option as below.
Execution example:
tokenize 'TokenDelimit("delimiter", ",")' "Hello,World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Hello",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "World",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
The delimiter option can also specify multiple delimiters.
For example, Hello, World is tokenized to Hello and World.
, and a white-space are the delimiters in the example below.
Execution example:
tokenize 'TokenDelimit("delimiter", ",", "delimiter", " ")' "Hello, World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Hello",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "World",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
The pattern option splits tokens by a regular expression. You can
exclude needless spaces with the pattern option.
For example, This is a pen. This is an apple. is tokenized to This is a pen. and
This is an apple. with the pattern option as below.
Normally, when This is a pen. This is an apple. is split by .,
a needless space is included at the beginning of This is an apple..
You can exclude the needless spaces with a pattern option as in the example below.
Execution example:
tokenize 'TokenDelimit("pattern", "\\.\\s*")' "This is a pen. This is an apple."
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "This is a pen.",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "This is an apple.",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
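The effect of the pattern option above can be approximated with Python's re module (an approximation: a lookbehind split keeps the . inside each token, matching the Groonga output, while the trailing spaces are dropped; `sentence_tokenize` is a hypothetical name):

```python
import re

def sentence_tokenize(text):
    # Split on runs of white-space that follow a period; the period
    # itself stays inside the token.
    return [t for t in re.split(r"(?<=\.)\s+", text) if t]

print(sentence_tokenize("This is a pen. This is an apple."))
# ['This is a pen.', 'This is an apple.']
```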
You can extract tokens under complex conditions with the pattern option.
For example, これはペンですか!?リンゴですか?「リンゴです。」 is tokenized to これはペンですか, リンゴですか and 「リンゴです。」 with the pattern option as below.
Execution example:
tokenize 'TokenDelimit("pattern", "([。!?]+(?![)」])|[\\r\\n]+)\\s*")' "これはペンですか!?リンゴですか?「リンゴです。」"
# [
# [
# 0,
# 1545179416.22277,
# 0.0002887248992919922
# ],
# [
# {
# "value": "これはペンですか",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "リンゴですか",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "「リンゴです。」",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
\\s* at the end of the above regular expression matches 0 or more spaces after a delimiter.
[。!?]+ matches 1 or more of 。, ! or ?.
For example, [。!?]+ matches !? of これはペンですか!?.
(?![)」]) is a negative lookahead.
It matches only if the next character is not ) or 」.
A negative lookahead applies to the regular expression just before it,
so it is interpreted together as [。!?]+(?![)」]).
[。!?]+(?![)」]) matches only if ) or 」 does not come after 。, ! or ?.
In other words, [。!?]+(?![)」]) matches 。 of これはペンですか。, but it doesn't match 。 of 「リンゴです。」,
because 」 comes after 。.
[\\r\\n]+ matches 1 or more newline characters.
In conclusion, ([。!?]+(?![)」])|[\\r\\n]+)\\s* uses 。, !, ? and newline characters as delimiters. However, 。, ! and ? are not delimiters if ) or 」 comes after them.
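The explanation above can be checked with Python's re module, which supports the same negative-lookahead construct (a sketch; Groonga uses its own regular expression engine, and the character class here includes both full-width and half-width !, ? and )):

```python
import re

# 。, ! or ? acts as a delimiter only when not followed by ) or 」;
# runs of newlines are delimiters too; \s* eats spaces after a delimiter.
pattern = r"(?:[。!?!?]+(?![)」)])|[\r\n]+)\s*"

print(re.split(pattern, "これはペンですか!?リンゴですか?「リンゴです。」"))
# ['これはペンですか', 'リンゴですか', '「リンゴです。」']
```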
7.8.3.12. TokenDelimitNull¶
TokenDelimitNull is similar to TokenDelimit. The
difference between them is the separator character. TokenDelimit
uses a space character (U+0020) but TokenDelimitNull uses a NUL
character (U+0000).
TokenDelimitNull is also suitable for tag text.
Here is an example of TokenDelimitNull:
Execution example:
tokenize TokenDelimitNull "Groonga\u0000full-text-search\u0000HTTP" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "groongau0000full-text-searchu0000http"
# }
# ]
# ]
7.8.3.13. TokenMecab¶
TokenMecab is a tokenizer based on MeCab, a part-of-speech and
morphological analyzer.
MeCab doesn't depend on Japanese. You can use MeCab for other languages by creating a dictionary for those languages. You can use NAIST Japanese Dictionary for Japanese.
You need to install an additional package to use TokenMecab. For details on how to install an additional package, see how to install for each OS.
TokenMecab is good for precision rather than recall. You can find both
東京都 and 京都 texts by the 京都 query with
TokenBigram, but 東京都 isn't expected. You can find only the
京都 text by the 京都 query with TokenMecab.
If you want to support neologisms, you need to keep updating your MeCab dictionary, which has a maintenance cost. (TokenBigram doesn't require dictionary maintenance because TokenBigram doesn't use a dictionary.) mecab-ipadic-NEologd, a neologism dictionary for MeCab, may help you.
Here is an example of TokenMecab. 東京都 is tokenized to 東京
and 都. These tokens don't include 京都:
Execution example:
tokenize TokenMecab "東京都"
# [
# [
# 0,
# 1545812631.661493,
# 0.0002415180206298828
# ],
# [
# {
# "value": "東京",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "都",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
TokenMecab can also take options.
TokenMecab has the target_class option, include_class option,
include_reading option, include_form option and use_reading option.
The target_class option extracts only tokens of the specified part-of-speech.
For example, you can extract only nouns as below.
Execution example:
tokenize 'TokenMecab("target_class", "名詞")' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1545810238.195525,
# 0.0003066062927246094
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "名前",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "山田",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "さん",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "はず",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
The target_class option can also specify subclasses, and exclude or add a
specific part-of-speech using + or -.
So you can also extract nouns while excluding non-independent words and
person-name suffixes as below.
In this way you can exclude noise tokens.
Execution example:
tokenize 'TokenMecab("target_class", "-名詞/非自立", "target_class", "-名詞/接尾/人名", "target_class", "名詞")' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1545810363.771334,
# 0.0003197193145751953
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "名前",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "山田",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
7.8.3.14. TokenRegexp¶
New in version 5.0.1.
Caution
This tokenizer is experimental. Specification may be changed.
Caution
This tokenizer can be used only with UTF-8. You can’t use this tokenizer with EUC-JP, Shift_JIS and so on.
TokenRegexp is a tokenizer for supporting regular expression
search by index.
In general, regular expression search is evaluated as a sequential search. But the following cases can be evaluated as an index search:
- Literal only case such as hello
- The beginning of text and literal case such as \A/home/alice
- The end of text and literal case such as \.txt\z
In most cases, an index search is faster than a sequential search.
TokenRegexp is based on the bigram tokenize method. TokenRegexp
adds the beginning-of-text mark (U+FFEF) at the beginning of the text
and the end-of-text mark (U+FFF0) at the end of the text when you
index text:
Execution example:
tokenize TokenRegexp "/home/alice/test.txt" NormalizerAuto --mode ADD
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": ""
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "/h"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ho"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "om"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "me"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "e/"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "/a"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "al"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "li"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "ic"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "ce"
# },
# {
# "position": 11,
# "force_prefix": false,
# "value": "e/"
# },
# {
# "position": 12,
# "force_prefix": false,
# "value": "/t"
# },
# {
# "position": 13,
# "force_prefix": false,
# "value": "te"
# },
# {
# "position": 14,
# "force_prefix": false,
# "value": "es"
# },
# {
# "position": 15,
# "force_prefix": false,
# "value": "st"
# },
# {
# "position": 16,
# "force_prefix": false,
# "value": "t."
# },
# {
# "position": 17,
# "force_prefix": false,
# "value": ".t"
# },
# {
# "position": 18,
# "force_prefix": false,
# "value": "tx"
# },
# {
# "position": 19,
# "force_prefix": false,
# "value": "xt"
# },
# {
# "position": 20,
# "force_prefix": false,
# "value": "t"
# },
# {
# "position": 21,
# "force_prefix": false,
# "value": ""
# }
# ]
# ]
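The index-time behavior above can be sketched in Python (a sketch under assumptions: `regexp_index_tokenize`, `BEGIN_MARK` and `END_MARK` are hypothetical names, the marks are emitted as standalone tokens as in the execution example, and normalization is not modeled):

```python
BEGIN_MARK = "\uFFEF"  # beginning-of-text mark
END_MARK = "\uFFF0"    # end-of-text mark

def regexp_index_tokenize(text):
    # Sliding bigrams over the text plus a trailing one-character token,
    # with the marks as standalone tokens at both ends.
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)] + [text[-1]]
    return [BEGIN_MARK] + bigrams + [END_MARK]

tokens = regexp_index_tokenize("/home/alice/test.txt")
print(tokens[1:5])  # ['/h', 'ho', 'om', 'me']
print(len(tokens))  # 22 tokens, matching positions 0-21 above
```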

