Apply stemming for search
We currently use the default (general purpose) analyzer on publication titles and abstracts. I think that if we apply stemming, the search function would provide more relevant results.
WebDSL already fully supports this. Here are some examples from a WebDSL test app I created:
User input query:
configure
will be normalized to:
configur
and will match:
CONFIGURE
CONFIGURATIONAL
CONFIGURATION
CONFIGURATIONS
CONFIGURED
CONFIGURATIVE
CONFIGURES
CONFIGURING
Or user input:
linguistic
will be normalized to:
linguist
and will match:
LINGUISTICALLY
LINGUISTS
LINGUISTIC
LINGUISTICAL
LINGUISTICS
LINGUIST
etc.
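To illustrate why one normalized query can match so many surface forms, here is a minimal sketch in Python. It is not the Snowball algorithm and not WebDSL code: `toy_stem`, its suffix list, and the extra term `CONFIG` are made up for illustration. The point is only that the query and the indexed terms go through the same normalization, so they meet on the same root.

```python
# Toy suffix stripper for illustration only; NOT the Snowball/Porter algorithm.
# The suffix list is an assumption chosen to cover the 'configure' examples.
SUFFIXES = ["ational", "ations", "ation", "ative", "ing", "ed", "es", "e", "s"]


def toy_stem(word, min_stem=3):
    """Lowercase the word and strip the first (longest) matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:  # ordered longest first
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def build_index(terms):
    """Map each stem to the set of original terms that produce it."""
    index = {}
    for term in terms:
        index.setdefault(toy_stem(term), set()).add(term)
    return index


def search(index, query):
    """Stem the query the same way as the indexed terms, then look it up."""
    return sorted(index.get(toy_stem(query), set()))


TERMS = [
    "CONFIGURE", "CONFIGURATIONAL", "CONFIGURATION", "CONFIGURATIONS",
    "CONFIGURED", "CONFIGURATIVE", "CONFIGURES", "CONFIGURING",
    "CONFIG",  # hypothetical extra term that should not match
]

index = build_index(TERMS)
matches = search(index, "configure")
print(toy_stem("configure"), matches)
```

With this toy rule set, all eight CONFIGUR* forms share the root `configur`, while `CONFIG` keeps its own entry and is not returned.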
This will require a partial reindex (Publication, including its descendants):
sudo sh webdsl reindex Publication
Similarly, we might provide a search function for authors that performs phonetic matching.
Submitted by Elmer van Chastelet on 27 July 2012 at 17:06
Issue Log
Added in r741
Stemming has its downsides: it sometimes stems too aggressively.
For example, anime, animation, and animal will all stem to the same root.
A protected word list can be used that overrides the stemmer for the listed words. Problem: where to get a good protwords.txt from?
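A rough sketch of how such a protected word list could work, in Python rather than in the actual Solr/WebDSL filter chain. Everything here is an assumption for illustration: `naive_stem` is a deliberately crude stand-in for an aggressive stemmer, and the file format (one word per line, lines starting with `#` ignored) is assumed.

```python
def load_protected_words(lines):
    """Parse protwords-style content: one word per line.

    Assumption: blank lines and lines starting with '#' are ignored.
    """
    words = set()
    for line in lines:
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


def protect(stem, protected):
    """Wrap a stemmer so that protected words bypass it unchanged."""
    def stem_with_protection(word):
        word = word.lower()
        return word if word in protected else stem(word)
    return stem_with_protection


def naive_stem(word):
    # Hypothetical stand-in for an over-aggressive stemmer:
    # it simply truncates every word to its first four letters.
    return word[:4]


protwords = load_protected_words(["# keep these apart", "anime", "", "animal"])
stem = protect(naive_stem, protwords)

for w in ["anime", "animation", "animal"]:
    print(w, "->", stem(w))
```

Without protection all three words collapse to `anim`; with the list, `anime` and `animal` stay intact and only `animation` is stemmed. The open question of where to get a good list remains.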
It is probably also better to use the stemmed field as an additional search field rather than as the only field searched.
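The idea of keeping an exact field next to a stemmed one can be sketched as follows. This is plain Python, not WebDSL search configuration; the field names, the boost values, the sample titles, and the crude `naive_stem` stand-in are all hypothetical.

```python
def naive_stem(word):
    # Hypothetical stand-in for an aggressive stemmer (truncate to 4 letters).
    return word.lower()[:4]


def index_documents(titles):
    """Index each title in two fields: exact tokens and stemmed tokens."""
    docs = []
    for title in titles:
        tokens = title.lower().split()
        docs.append({
            "title": title,
            "exact": set(tokens),
            "stemmed": {naive_stem(t) for t in tokens},
        })
    return docs


def search(docs, query, exact_boost=2.0, stemmed_boost=1.0):
    """Score both fields; exact hits outrank stem-only hits."""
    q = query.lower()
    results = []
    for doc in docs:
        score = 0.0
        if q in doc["exact"]:
            score += exact_boost
        if naive_stem(q) in doc["stemmed"]:
            score += stemmed_boost
        if score > 0:
            results.append((score, doc["title"]))
    # Highest score first; ties are broken alphabetically by title.
    results.sort(key=lambda r: (-r[0], r[1]))
    return [title for _, title in results]


docs = index_documents([
    "Anime recommendation",
    "Animation pipelines",
    "Animal behaviour models",
])
ranked = search(docs, "animal")
print(ranked)
```

The stemmed field still brings in the related titles, but the title containing the literal query term is ranked first, so over-stemming hurts ordering less.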
We can also go for a less aggressive stemmer, like KStem or a plural-only stemmer (we now use Snowball). See the Solr stemmers that are available in WebDSL.
See also some info on Google’s approach and here
KStem actually does a decent job on some of the words that the Snowball Porter implementation stems too aggressively.
For the query ‘iron’, Snowball matches forms of both ‘iron’ and ‘ironic’:
IRONIZE IRONNESS IRONIZES IRONICAL IRONER IRONIC IRONIZING IRONNESSES IRONIZED IRONS IRON IRONERS IRONING IRONICALLY IRONED IRONINGS IRONE IRONES
KStem only matches forms of ‘iron’:
IRONIZE IRONNESS IRONIZES IRONER IRONIZING IRONIZED IRON IRONERS IRONED IRONES
The same holds for ‘animal’, which Snowball also expands to forms of ‘animate’:
snowball:
ANIMALIZE ANIMATE ANIMATOR ANIMATEDLY ANIMES ANIMATENESS ANIMATELY ANIMATION ANIMAL ANIMALIZATION ANIMATENESSES ANIMALIZATIONS ANIMISMS ANIMALITIES ANIMATING ANIMALIZES ANIMATORS ANIMALIZED ANIMALIZING ANIMISM ANIMALISMS ANIMALITY ANIMALS ANIMATES ANIMALISM ANIMATIONS ANIMALLY ANIMATED ANIME
kstem:
ANIMALIZE ANIMAL ANIMALIC ANIMALIZATION ANIMALIZATIONS ANIMALITIES ANIMALIZES ANIMALIZED ANIMALIZING ANIMALITY ANIMALS ANIMALLY