SharePoint Online Search Word Breaking
Recently I was asked whether search in SharePoint Online returns results for words that are separated by hyphens, dashes, or other characters. For example, will entering the search phrase “committal” return both “noncommittal” and “non-committal”?
The answer is that it depends on the specific search scenario and configuration. This is an under-documented topic, so I thought I'd try to give you some good context here.
- SharePoint Online search “verticals” (Result Sources) determine the source of the search and the characteristics of the query logic. We can also create Custom Result Sources.
- Out-of-the-box Result Sources are mapped to Managed Properties, which are in essence “columns” of search data. As with Result Sources, there are out-of-the-box Managed Properties, and we can also create our own.
- At query time, the Search Query component tokenizes the stream of text retrieved from the Managed Properties, splitting it into individual words (tokens). This processing includes word breaking, stemming, query spellchecking, and the native thesaurus capabilities.
- This tokenization only takes place if the Complete Matching setting on the specific Managed Property being searched is turned off.
- Since we cannot directly modify the out-of-the-box Managed Properties, we are bound to whatever those properties have set for Complete Matching, which is the first factor in whether Word Breaking is applied.
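To make the effect of Complete Matching on word breaking more concrete, here is a minimal Python sketch. It is a toy model, not SharePoint's actual implementation: the breaker character set (whitespace, hyphen, en dash, em dash) and the function names `tokenize` and `matches` are assumptions made purely for illustration. It also does not model stemming, spellchecking, or the thesaurus, so it says nothing about how an unhyphenated value such as “noncommittal” is handled.

```python
import re

# Assumed breaker set, for illustration only: whitespace, hyphen, en dash, em dash.
# This is NOT the real SharePoint inventory of word-breaking characters.
BREAKERS = re.compile(r"[\s\-\u2013\u2014]+")


def tokenize(text):
    """Split text into lowercase tokens at the assumed breaker characters."""
    return [token for token in BREAKERS.split(text.lower()) if token]


def matches(query, value, complete_matching):
    """Return True if the query hits the property value in this toy model.

    complete_matching=True:  the whole value must equal the query (no word breaking).
    complete_matching=False: the value is tokenized and any single token may match.
    """
    normalized_query = query.lower().strip()
    if complete_matching:
        return value.lower().strip() == normalized_query
    return normalized_query in tokenize(value)


if __name__ == "__main__":
    print(tokenize("non-committal"))
    print(matches("committal", "non-committal", complete_matching=False))
    print(matches("committal", "non-committal", complete_matching=True))
```

In this model, a property with Complete Matching turned off lets “committal” hit a value of “non-committal”, because the hyphen splits the value into two tokens. With Complete Matching turned on, the same query misses, because the value is compared as a whole.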
Here’s the inventory of the special characters used for tokenization in the context of using SharePoint Document Libraries as the Search Result source:
There are other possible root content source types as well (File Share, Exchange, and Open Search), and they have their own idiosyncrasies regarding word breaking. For example, the ampersand works as a word breaker in the context of a File Share search.
So, in conclusion, yes: a search of out-of-the-box SharePoint Document Libraries for “committal” will return both “noncommittal” and “non-committal”, with higher relevance given to an item with both portions of the string present.
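If you want to check this behavior on your own tenant, one option is the SharePoint Search REST endpoint (`/_api/search/query`). The sketch below only builds the request URL; authentication and sending the request are left out. The site URL `https://contoso.sharepoint.com` and the helper name `build_search_url` are hypothetical placeholders, and the selected properties (`Title`, `Path`) are just a reasonable starting point.

```python
from urllib.parse import quote


def build_search_url(site_url, query_text, row_limit=10):
    """Build a SharePoint Search REST URL for a simple keyword query.

    The query text and property list are wrapped in single quotes,
    as the REST search endpoint expects, and then percent-encoded.
    """
    querytext = quote(f"'{query_text}'")
    select_properties = quote("'Title,Path'")
    return (
        f"{site_url.rstrip('/')}/_api/search/query"
        f"?querytext={querytext}"
        f"&selectproperties={select_properties}"
        f"&rowlimit={row_limit}"
    )


if __name__ == "__main__":
    # Hypothetical tenant URL; replace with your own site.
    print(build_search_url("https://contoso.sharepoint.com", "committal"))
```

Running the same query against test documents titled “noncommittal” and “non-committal” and comparing the returned rows is a quick way to confirm how word breaking behaves for a given Result Source.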