Warning: this post contains descriptions of search algorithms by someone who isn’t entirely sure how search algorithms work.
Intellectual activity consists mainly of various kinds of search. – Alan Turing, Intelligent Machinery.
Quite suddenly, it seems, the MacGuffin search page is up and running on the staging site, and looking rather handsome. What’s more, it’s working. You can run a multi-tag search, such as “#10minutelisten #manchester”, and get a result that is just that: a 10-minute story set in Manchester.
I was pretty thrilled, I must admit, to tag a gaggle of stories with #jimsfavouritestories, then search for the tag and find them obediently waiting for me on the results page: the first of many personal reading lists on the platform, I hope.
So now we’re fine-tuning the search algorithms. I say ‘we’ – Russell at fffunction has been tightening the sprockets of Elasticsearch, the search engine we’re using, while here in Manchester, Zach and I drew fanciful diagrams on a whiteboard.
Browsing the Elasticsearch documentation left me feeling like a dog who’d wandered into a lecture theatre, but one thing I have begun to realise is that there are actually two sides to getting search right. On the one hand, there’s the number-crunching bit – the coding and hard maths. On the other, we need to predict user behaviour, and configure our search algorithms to give the user good results even when they search badly, or when the reading community doesn’t tag content adequately.
Here’s how it’s supposed to work. Ideally, if you were looking for a story about a kitten on MacGuffin, you’d simply search for the ‘kitten’ tag. If a story about a kitten existed on the platform, it’d be returned in the results, because either the author or another MacGuffin user would have already tagged it ‘kitten’.
But what if there’s a kitten story on MacGuffin that hasn’t been tagged ‘kitten’, even though it’s called ‘Kitten Story’, is described by the author as ‘A story about a kitten’, and the text includes the word ‘kitten’ over a hundred times? As a kitten fan, you’ll want to know about it. It’ll be right up your street. That’s why search terms will be run as a boolean query across all four fields that make up a content entry:
Tag – tags that have been given to the content by the author, by a reader, or automatically generated (as in the case of audio length)
Title – the title given to the content by the author
Description – the description given to the content by the author
Text – the text content itself
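The “boolean query across all four fields” idea can be sketched as an Elasticsearch request body. This is a minimal illustration, not MacGuffin’s actual query – the field names (`tags`, `title`, `description`, `text`) are my assumptions about the mapping:

```python
def four_field_query(term):
    # A "should" clause means a document matching the term in ANY of the
    # four fields is returned, and each matching clause adds to its score.
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"tags": term}},
                    {"match": {"title": term}},
                    {"match": {"description": term}},
                    {"match": {"text": term}},
                ]
            }
        }
    }

body = four_field_query("kitten")
```

A story tagged ‘kitten’ *and* titled ‘Kitten Story’ would match two clauses and score higher than one matching only in the text.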
The results of the search across each of the four fields contribute to an aggregate score, and that score determines the rank of the content in the search results. We can adjust the weighting for matches in each field to game the results. For example, we might decide that story descriptions are more important than titles, so in our search for ‘kitten’, a story with ‘kitten’ in the description field would rank higher than one with ‘kitten’ only in the title field.
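Elasticsearch expresses per-field weighting with its `field^boost` syntax. Here’s a rough sketch of how the description-beats-title example might look – the boost numbers are invented for illustration, not our real settings:

```python
def weighted_query(term, weights):
    # weights: field name -> boost factor. Elasticsearch multiplies a
    # field's contribution to the score by its boost.
    fields = [f"{field}^{boost}" for field, boost in weights.items()]
    return {"query": {"multi_match": {"query": term, "fields": fields}}}

# Hypothetical weights: description outranks title.
body = weighted_query("kitten", {"description": 3, "title": 2})
```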
//A SHORT DIGRESSION//
Down-weighting the title might sound counter-intuitive, but when you think about it, the titles of literary works often include words that have nothing to do with the content.
On the other hand, the author must add a 140-character description to the content when uploading it to MacGuffin, which ought to at least attempt to describe it (if the author follows our guidance), e.g. “This Depression-era novel is about tenant farmers driven from their Oklahoma home by the dustbowl drought.”
But hold your horses. Because someone might recommend a story to their friend, who instead of searching for the terms ‘depression’, ‘farmer’, ‘oklahoma’ or ‘dustbowl’, searches for what they can remember of the title: “Angry Grapes”.
That’s when you need a boolean search of the title field, as a safety net.
And this is sort of what I mean about understanding user-behaviour.
//END OF DIGRESSION//
Things Get Complicated
MacGuffin is actually built to run two kinds of searches, though: ‘tag’ and ‘free text’. You can see the difference (in terms of the user interface) in this short video clip.
Start typing into the search bar, and auto-complete will suggest tags that already exist on the database, and are attached to some content. If you click a suggestion, it goes blue – ‘tagified’ as I insist on calling it. Hit the search icon now, and MacGuffin runs a tag search.
If you enter a term that doesn’t match an existing tag, the text stays white. Now when you hit the search icon, MacGuffin will run a free text search. If you enter a string of two or more words separated by spaces (like ‘The Sun Also Rises’), that will automatically be a free text search.
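The routing logic described above might be guessed at like this – a single term matching an existing tag becomes a tag search, anything else falls through to free text. The function and its rules are my reading of the behaviour, not MacGuffin’s actual code:

```python
def choose_search_type(query, known_tags):
    # Multi-word strings ('The Sun Also Rises') are always free text;
    # a single word only becomes a tag search if the tag already exists.
    term = query.strip().lower()
    if " " not in term and term in known_tags:
        return "tag"
    return "free_text"
```

So `choose_search_type("kitten", {"kitten"})` routes to a tag search, while `choose_search_type("The Sun Also Rises", {"kitten"})` stays free text.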
In a tag search, the results will prioritise matches found in tag fields.
In a free text search, the results will favour matches found in description and title fields.
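In other words, the two search types could share one query shape and differ only in their weighting profiles – tag searches boosting the tag field, free text boosting description and title. The profile numbers below are placeholders, not our tuned values:

```python
# Hypothetical boost profiles for the two search types.
PROFILES = {
    "tag":       {"tags": 5, "title": 2, "description": 2, "text": 1},
    "free_text": {"tags": 1, "title": 3, "description": 4, "text": 1},
}

def search_body(term, search_type):
    profile = PROFILES[search_type]
    fields = [f"{field}^{boost}" for field, boost in profile.items()]
    return {"query": {"multi_match": {"query": term, "fields": fields}}}
```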
As we continue to work on the search result weightings, we’re going to assume that as users become more experienced, they’ll learn the difference between the two types of search – that when they do a free text search, they mean to do a free text search, and vice versa.
Testing the Theory: Tag Search
After playing around with different weightings for a while, and imagining various user-behaviour scenarios, we drew up a hierarchy to decide which results should beat others in a tag search – a kind of tag search ‘top trumps’. Then, via the CMS, we uploaded carefully crafted dummy content that includes every possible permutation of the search term appearing in the four fields of a content item (see table below). The search term was abacus.
In case you’re wondering about the naming regime: ‘abacus’ needs to appear in some of the titles (but not others) in order to test that search field, and we needed to add another word as a constant, to make all these dummy entries easier to pull up at once on the CMS. Hence ‘Zach’.
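With four fields that each either contain the term or don’t, there are 2⁴ = 16 permutations – which is where the sixteen dummy entries come from. A sketch of how that test set could be generated (field names and the exact text of each entry are my guesses):

```python
from itertools import product

FIELDS = ["tags", "title", "description", "text"]

def dummy_entries(term="abacus", constant="Zach"):
    # One entry per present/absent combination of the term across the
    # four fields; the constant word makes them easy to find in the CMS.
    entries = []
    for combo in product([True, False], repeat=len(FIELDS)):
        entry = {
            field: (f"{term} {constant}" if present else constant)
            for field, present in zip(FIELDS, combo)
        }
        entries.append(entry)
    return entries
```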
The idea is, this ranking gives us an order to aim for, as we tweak the algorithm. We make an adjustment, then run the search. Are the stories returned in order from 1-16? No? Then make another adjustment, and run the search again. Eventually, we hope to get them lined up in order.
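The tweak-and-rerun loop needs a quick way to see whether the latest adjustment got the order right. Something as simple as this comparison would do (the `id` key on each hit is an assumption about the result shape):

```python
def first_misrank(hits, expected_ids):
    # hits: results in descending score order; expected_ids: the target
    # order 1..16 from our hierarchy. Returns the index of the first
    # out-of-place result, or -1 if the ranking is already correct.
    actual = [hit["id"] for hit in hits]
    for i, (got, want) in enumerate(zip(actual, expected_ids)):
        if got != want:
            return i
    return -1
```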
Here’s where we’re at right now, the 16th January:
NB: Russell sensibly cautions that this could be a very large rabbit hole to fall down… Tiny changes in the algorithm can cause a domino effect across the boolean search of a set of documents. And we could fiddle around for ages getting the results to return in the correct order, only to be forced to make further changes in the light of real user-test data. It’s a fair point.
With that in mind, I’ll post an update (‘Search Part 2’), when we’ve done some more user testing. Until next time, remember: all great search tools begin life in turquoise.