RankBrain in Semantic Search
Posted by Dylan Yates on September 30, 2017
How RankBrain just answered a question you’re not going to think of until tomorrow:
Google wants you to trust the list of websites and content it returns to you in its Search Engine Results Page. It wants you to trust in the accuracy of its results, so you feel like they’re relevant to your query. Delivering high-quality, high-confidence search results has been at the core of Google’s ethos since it was established. At no point since has this goal been more of a priority to Google than right now.
In the years since Google launched we’ve seen a surfeit of updates and changes. One of the biggest of these, a major contributing factor to Google’s success in establishing high quality search results, was launched in 2013. It was called Hummingbird.
Who is responsible for Hummingbird?
Diego Federici is one of the key figures behind Hummingbird, Google’s ground-breaking semantic search algorithm released at the end of 2013. You can read more about Diego’s Hummingbird algorithm in the US patent he filed here.
If spending an evening poring over copies of US patents isn’t your cup of tea though (and quite frankly, if it is, you need a long hard look in the mirror!), I’ll outline a few of Diego’s points below:
Here is a summary of the patent. The phrases we are mainly interested in are picked out and explained after it:
Data identifying entities and transition probabilities between entities is stored in a computer readable medium. Each transition probability represents a strength of a relationship between a pair of entities as they are related in search history data. In some implementations, an increase in popularity for a query is identified and a different query is identified as temporally related to the query. Scoring data for documents responsive to the different query is modified to favor newer documents.
Let’s try and translate these into more understandable takeaways:
‘Entities and transition probabilities between entities’
An entity is a thing with independent existence, for example a person or a place. The transition probabilities could be defined as the likelihood of each next step in a user’s search journey, from query to query or from document to document (probably based on historical data).
‘A strength of a relationship between a pair of entities’
‘American Elections’ and ‘Donald Trump’ – two separate entities – but each with a relationship to the other. Google may infer the two are connected if a user searches for ‘American Elections’ and then follows up with a search for ‘Trump’. Google can learn about the relationship between the two entities from how users search for, and consume information about, those entities within a similar timeframe.
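To make the idea of transition probabilities a little more concrete, here is a minimal Python sketch. It is not taken from the patent or from anything Google has published: the session data, the entity names beyond the two used above, and the function name transition_probabilities are all made up for illustration. It simply counts how often one entity is searched for straight after another, and turns those counts into probabilities.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next entity | current entity) from ordered search sessions.

    Hypothetical helper for illustration only; not Google's method.
    """
    pair_counts = defaultdict(Counter)
    for session in sessions:
        # Count each consecutive pair (current -> next) within a session.
        for current, nxt in zip(session, session[1:]):
            pair_counts[current][nxt] += 1

    probabilities = {}
    for current, followers in pair_counts.items():
        total = sum(followers.values())
        probabilities[current] = {
            nxt: count / total for nxt, count in followers.items()
        }
    return probabilities


# Made-up search sessions: each is the ordered list of entities one user searched for.
sessions = [
    ["American Elections", "Donald Trump"],
    ["American Elections", "Donald Trump", "White House"],
    ["American Elections", "Hillary Clinton"],
    ["Donald Trump", "White House"],
]

probs = transition_probabilities(sessions)
print(probs["American Elections"])
```

In this toy data, two of the three searches that follow ‘American Elections’ are for ‘Donald Trump’, so that pair ends up with the strongest relationship. A real system would presumably work with vastly more data and many more signals.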
‘increase in popularity for a query’ – and – ‘modified to favor newer documents’
This can be summarised in an example by Google’s ‘Query Deserves Freshness’ rating. Google concedes that a query is affected by the context in which it’s placed. The context in which a query is placed will have a stake in determining the type of information a searcher is looking for. Some types of queries are more likely to be subject to this than others; ‘News’ queries are a good example, as are location-based queries.
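As a rough illustration of ‘modified to favor newer documents’, the Python sketch below boosts recent documents only when a query looks like it is spiking in popularity. Everything in it is an assumption made for the example: the Document fields, the is_trending and rescore helpers, the spike ratio, the seven-day half-life and the URLs are all hypothetical, and none of it reflects how Google actually scores freshness.

```python
from dataclasses import dataclass


@dataclass
class Document:
    url: str
    base_score: float  # relevance score before any freshness adjustment
    age_days: float    # days since the document was published


def is_trending(recent_count, baseline_count, spike_ratio=2.0):
    """Treat a query as trending when recent volume is well above its baseline."""
    if baseline_count <= 0:
        return recent_count > 0
    return recent_count / baseline_count >= spike_ratio


def rescore(documents, trending, half_life_days=7.0, boost=1.0):
    """Return documents sorted by score, favouring newer ones when trending."""
    scored = []
    for doc in documents:
        score = doc.base_score
        if trending:
            # Exponential decay: a brand-new document gets the full boost,
            # one that is half_life_days old gets half of it, and so on.
            freshness = 0.5 ** (doc.age_days / half_life_days)
            score += boost * freshness
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


# Made-up documents for the example.
docs = [
    Document("example.com/election-history", base_score=1.2, age_days=900),
    Document("example.com/election-results-live", base_score=0.9, age_days=0.5),
]

normal_order = [d.url for d in rescore(docs, trending=False)]
trending_order = [d.url for d in rescore(docs, trending=is_trending(5000, 1000))]
print(normal_order[0])
print(trending_order[0])
```

On a quiet day the older, more authoritative page ranks first. When the query volume jumps to five times its baseline, the fresh page overtakes it, which is the behaviour the patent summary describes.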
All of the above means what exactly?
- Google wants to understand and categorise entities, in the same way it indexes URLs. It probably keeps entities in an index, which is likely reflected in features such as the Knowledge Graph.
- Google wants to understand the relationship between entities, to help it tell homonyms apart.
- Google might be using different signals according to the type of query. For example, the history of a document might be a more important signal for some queries than others.
Semantic understanding of words is therefore influenced by their relationship to each other, and the context in which they’re placed.
Semantics play a huge part in how Google understands queries and, subsequently, in how it ranks URLs for them.
So, where does RankBrain come into the equation?
I’ll start by saying this: RankBrain doesn’t (directly) determine Google search rankings and you cannot optimise for RankBrain.
What is RankBrain?
Google’s research group (known as Google Brain) has developed models that simulate user behavior in the SERPs (and possibly other situations) in order to help with, among other things, evaluating the success of its results.
Traditionally, Google has used humans to evaluate the algorithms in testing environments before releasing them into the wild. This was confirmed by John Mueller as recently as 19th September 2017 in a Google Hangout (at the 6:13 minutes mark) – https://youtu.be/Bgy8bQebnbc?t=6m31s
With RankBrain, Google is trying to replicate this human tester model in the form of Artificial Intelligence, presumably allowing a faster and more efficient evaluation of Google Search’s performance. This is what Eric Enge (of The Art of SEO and Stone Temple Consulting) had to say:
“More recently, in 2015, Google introduced RankBrain. Despite its name, RankBrain doesn’t determine Google search rankings. Rather, it serves as part of Google’s overall search algorithm to better understand a user’s search query so it can surface the most relevant results.
Machine learning will play a major role in continuing improvements in semantic search.”
How Does RankBrain work?
RankBrain is tasked with helping Google understand new queries. These never-before-seen queries are estimated to make up approximately 15% of the searches Google receives on a daily basis.
Now it’s true – there is a lot of ambiguity surrounding RankBrain, what it does, and how it works, even from Google itself. So I’ll try and distil this description as much as possible, with the glaring caveat that this is at best an educated guess at how RankBrain works, and not by any means a categorical definition.
Firstly, as alluded to earlier, RankBrain is working separately to the core algorithm, rather than directly in tandem. This at least, is my understanding.
RankBrain exists in a ‘simulated’ world, which involves:
- Generating unknown words or search phrases (though whether these are entered by humans or generated by computer is still unclear)
- Processing the unknown query with a set of search results
- Evaluating whether or not those results are accurate and trustworthy
RankBrain processes these unique, long-tail queries by comparing them to similar words that already exist in its database. If the test proves successful, great: whatever vector was responsible for successfully answering said query can be integrated into the live version of the algorithm. If not, well, back to the drawing board.
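Google hasn’t published how this comparison works, so the Python sketch below is only a guess at the general idea: represent words as vectors, average them to get a vector for the whole query, and find the known query that sits closest to the new one using cosine similarity. The three-number word vectors are hand-made toy values, and WORD_VECTORS, query_vector and most_similar_known_query are hypothetical names invented for this example.

```python
import math

# Toy, hand-made word vectors; real systems learn these from very large corpora.
WORD_VECTORS = {
    "cheap":      [0.9, 0.1, 0.0],
    "affordable": [0.8, 0.2, 0.0],
    "flights":    [0.1, 0.9, 0.1],
    "airfare":    [0.2, 0.8, 0.1],
    "recipes":    [0.0, 0.1, 0.9],
}


def query_vector(query):
    """Average the vectors of the words in a query that we have vectors for."""
    vectors = [WORD_VECTORS[w] for w in query.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_known_query(new_query, known_queries):
    """Return the known query whose vector is closest to the new query's."""
    target = query_vector(new_query)
    if target is None:
        return None
    best, best_score = None, -1.0
    for known in known_queries:
        vec = query_vector(known)
        if vec is None:
            continue
        score = cosine_similarity(target, vec)
        if score > best_score:
            best, best_score = known, score
    return best


known = ["cheap flights", "cheap recipes"]
match = most_similar_known_query("affordable airfare", known)
print(match)
```

Here the never-before-seen query ‘affordable airfare’ shares no words with ‘cheap flights’, yet it lands on it as the closest known query, because the toy vectors place ‘affordable’ near ‘cheap’ and ‘airfare’ near ‘flights’. Results that already work for the known query could then be tested against the new one.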
Implementing machine learning in this way is meant to improve the processing of complex queries. Queries that haven’t even been asked yet.
When the RankBrain project was initiated, these searches were processed by engineers teaching the computer how to deal with unfamiliar terms. Is this still the case? We don’t know for certain, although we can assume the long-term goal is for RankBrain to process this task autonomously, if it doesn’t already.
A machine learning environment that is constantly learning and improving through iteration, all on its own, is a very interesting idea. More exciting than that: RankBrain is inventing solutions for queries you haven’t even thought to ask yet. Is this the first example of AI being used to predict the future? Maybe not quite, but it’s a fascinating thought, isn’t it?