We currently use the default (general purpose) analyzer on publication titles and abstracts. I think the search function would return more relevant results if we applied stemming.

WebDSL already fully supports this. Here are some examples from a webdsl test app I created:

User input query: configure

will be normalized to: configur

and will match:
CONFIGURE
CONFIGURATIONAL
CONFIGURATION
CONFIGURATIONS
CONFIGURED
CONFIGURATIVE
CONFIGURES
CONFIGURING

Or user input: linguistic

will be normalized to: linguist

and will match:
LINGUISTICALLY
LINGUISTS
LINGUISTIC
LINGUISTICAL
LINGUISTICS
LINGUIST

etc.
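
To illustrate why both the query and the indexed words end up on the same root, here is a minimal sketch of the idea. It is a toy suffix stripper, not the Snowball stemmer WebDSL uses; the suffix list and the names `toy_stem`, `build_index` and `search` are assumptions made up for this example.

```python
# Toy suffix stripper: NOT Snowball, just enough rules to cover the
# "configure" family from the example above.
SUFFIXES = sorted(
    ["ational", "ations", "ation", "ative", "ing", "ed", "es", "e", "s"],
    key=len,
    reverse=True,  # try the longest suffix first
)


def toy_stem(word, min_stem=3):
    """Lowercase the word and strip the longest known suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def build_index(words):
    """Map each stem to the set of original words that reduce to it."""
    index = {}
    for word in words:
        index.setdefault(toy_stem(word), set()).add(word)
    return index


def search(index, query):
    """Normalize the query with the same stemmer used at index time."""
    return index.get(toy_stem(query), set())


WORDS = [
    "CONFIGURE", "CONFIGURATIONAL", "CONFIGURATION", "CONFIGURATIONS",
    "CONFIGURED", "CONFIGURATIVE", "CONFIGURES", "CONFIGURING",
]
INDEX = build_index(WORDS)
```

With this index, `search(INDEX, "configure")` returns all eight words, because the query and every indexed word are reduced to `configur` before they are compared.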

This will require a partial reindex (Publication, including its descendants): sudo sh webdsl reindex Publication

Similarly, we might provide a search function for authors that performs phonetic matching.
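
As a rough sketch of what phonetic matching on author names could look like, the code below implements American Soundex. Soundex is only one possible algorithm and is chosen here purely for illustration; the issue does not say which encoder would be used, and the function name `soundex` is an assumption of this example.

```python
# American Soundex: names that sound alike get the same 4-character code.
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the Soundex code of a name, or "" if it has no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (encoded + "000")[:4]  # pad or truncate to 4 characters


def same_sound(a, b):
    """True if two names share a Soundex code."""
    return soundex(a) == soundex(b)
```

For example, `same_sound("Meyer", "Meier")` and `same_sound("Smith", "Smyth")` are both true, so a query for one spelling of an author name would also find the other.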

Submitted by Elmer van Chastelet on 27 July 2012 at 17:06

On 27 July 2012 at 17:06 Elmer van Chastelet tagged improvement

On 27 July 2012 at 17:06 Elmer van Chastelet tagged feature

On 27 July 2012 at 17:06 Elmer van Chastelet tagged search

On 27 July 2012 at 17:07 Elmer van Chastelet tagged @elmer

On 16 August 2012 at 17:02 Elmer van Chastelet closed this issue.

On 16 August 2012 at 17:02 Elmer van Chastelet commented:

Added in r741


On 31 August 2012 at 16:11 Elmer van Chastelet reopened this issue.

On 31 August 2012 at 16:31 Elmer van Chastelet commented:

Stemming has its downsides. It sometimes stems too aggressively:

anime, animation and animal all stem to the same root.

Nice demo app here

A protected word list can be used that overrides the stemmer for the listed words. Problem: where to get a good protwords.txt from?
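
A minimal sketch of how such an override works, assuming a protwords-style file with one word per line and `#` comment lines. The names `load_protected_words`, `stem_unless_protected` and `crude_stem` are hypothetical, and `crude_stem` is a deliberately over-aggressive stand-in (plain truncation), not a real stemmer.

```python
def load_protected_words(lines):
    """Parse protwords-style lines: one word per line, '#' starts a comment."""
    words = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            words.add(line.lower())
    return words


def crude_stem(word):
    """Stand-in for an over-aggressive stemmer: keep the first 4 letters."""
    return word[:4]


def stem_unless_protected(word, protected, stemmer=crude_stem):
    """Return protected words unchanged; stem everything else."""
    lowered = word.lower()
    if lowered in protected:
        return lowered
    return stemmer(lowered)


PROTECTED = load_protected_words([
    "# words the stemmer must leave alone",
    "anime",
])
```

Here `stem_unless_protected("anime", PROTECTED)` stays `anime`, while `animation` and `animal` are still reduced to the shared root `anim`, so a search for anime no longer pulls in the other two.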

And it’s probably better to use stemming as an additional search field instead of as the single field for searching.
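
The two-field idea could look roughly like the sketch below: every token is indexed both as-is and stemmed, and an exact hit scores higher than a stem-only hit. The scoring weights, the function names and the truncating `crude_stem` stand-in are all assumptions for illustration, not how WebDSL or Lucene score documents.

```python
def crude_stem(word):
    """Stand-in for a real stemmer: keep the first 4 letters."""
    return word.lower()[:4]


def index_documents(titles):
    """Build two inverted indexes: exact tokens and stemmed tokens."""
    exact, stemmed = {}, {}
    for doc_id, title in enumerate(titles):
        for token in title.lower().split():
            exact.setdefault(token, set()).add(doc_id)
            stemmed.setdefault(crude_stem(token), set()).add(doc_id)
    return exact, stemmed


def search(query, exact, stemmed, exact_boost=2.0):
    """Rank documents; exact-field hits get an extra boost."""
    query = query.lower()
    scores = {}
    for doc_id in stemmed.get(crude_stem(query), ()):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0
    for doc_id in exact.get(query, ()):
        scores[doc_id] = scores.get(doc_id, 0.0) + exact_boost
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


TITLES = ["Anime studies", "Animal behaviour", "Animation pipelines"]
EXACT, STEMMED = index_documents(TITLES)
```

Searching for `anime` with `search("anime", EXACT, STEMMED)` still returns all three titles, but the one that literally contains "anime" comes first. The over-stemmed matches are pushed down instead of being mixed in with the exact ones.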

We can also go for a less aggressive stemmer, like KStem or a plural-only stemmer (we now use Snowball). SOLR stemmers which are available in webdsl
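
To give an idea of how little a plural-only stemmer touches, here is a small sketch. The three rules are an assumption picked for illustration (they follow the general shape of simple "S-stemmer" rules) and are not taken from any particular Solr filter; the name `plural_stem` is hypothetical.

```python
def plural_stem(word):
    """Strip common English plural endings and leave everything else alone."""
    word = word.lower()
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"  # animalities -> animality
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]  # animes -> anime
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]  # animals -> animal
    return word
```

With these rules `animals` becomes `animal` and `animes` becomes `anime`, while `animation`, `ironic` and `ironness` are left unchanged, so unrelated words are never merged onto one root.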

See also some info on Google’s approach and here


On 13 September 2012 at 11:32 Elmer van Chastelet commented:

KStem actually does a decent job on some of the words that the Snowball Porter implementation stems too aggressively.

snowball matches forms of ‘iron’ and ‘ironic’ on ‘iron’:

 IRONIZE
 IRONNESS
 IRONIZES
 IRONICAL
 IRONER
 IRONIC
 IRONIZING
 IRONNESSES
 IRONIZED
 IRONS
 IRON
 IRONERS
 IRONING
 IRONICALLY
 IRONED
 IRONINGS
 IRONE
 IRONES

kstem only matches forms of ‘iron’:

 IRONIZE
 IRONNESS
 IRONIZES
 IRONER
 IRONIZING
 IRONIZED
 IRON
 IRONERS
 IRONED
 IRONES

Same for animal, which snowball expands to forms of animate:

snowball:

 ANIMALIZE
 ANIMATE
 ANIMATOR
 ANIMATEDLY
 ANIMES
 ANIMATENESS
 ANIMATELY
 ANIMATION
 ANIMAL
 ANIMALIZATION
 ANIMATENESSES
 ANIMALIZATIONS
 ANIMISMS
 ANIMALITIES
 ANIMATING
 ANIMALIZES
 ANIMATORS
 ANIMALIZED
 ANIMALIZING
 ANIMISM
 ANIMALISMS
 ANIMALITY
 ANIMALS
 ANIMATES
 ANIMALISM
 ANIMATIONS
 ANIMALLY
 ANIMATED
 ANIME

kstem:

 ANIMALIZE
 ANIMAL
 ANIMALIC
 ANIMALIZATION
 ANIMALIZATIONS
 ANIMALITIES
 ANIMALIZES
 ANIMALIZED
 ANIMALIZING
 ANIMALITY
 ANIMALS
 ANIMALLY 

On 4 January 2013 at 22:25 Eelco Visser removed tag improvement
