Probabilistic relevance model

From Wikipedia, the free encyclopedia

The probabilistic relevance model[1][2] was devised by Stephen E. Robertson and Karen Spärck Jones as a framework for probabilistic models to come. It is an information retrieval formalism used to derive the ranking functions by which search engines rank matching documents according to their relevance to a given search query.

It is a theoretical model estimating the probability that a document dj is relevant to a query q. The model assumes that this probability of relevance depends on the query and document representations. Furthermore, it assumes that there is a portion of all documents that is preferred by the user as the answer set for query q. Such an ideal answer set is called R and should maximize the overall probability of relevance to that user. The prediction is that documents in this set R are relevant to the query, while documents not present in the set are non-relevant.
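To make the ranking principle concrete, here is a minimal Python sketch of the Robertson–Spärck Jones relevance weight that this framework leads to. The corpus, document-frequency table, and function names are invented for the example; this is an illustrative sketch under the usual 0.5 smoothing, not the authors' implementation.

```python
import math

def rsj_weight(n_t, N, r_t=0, R=0):
    """Robertson-Sparck Jones relevance weight for one term.

    n_t: number of documents containing the term, N: total documents,
    r_t: relevant documents containing the term, R: total relevant
    documents known so far. The 0.5 terms are the usual smoothing,
    which also handles the case of no relevance information (r_t = R = 0).
    """
    return math.log(((r_t + 0.5) * (N - n_t - R + r_t + 0.5))
                    / ((n_t - r_t + 0.5) * (R - r_t + 0.5)))

def score(query, doc, df, N):
    # Rank a document by summing the weights of query terms it contains.
    return sum(rsj_weight(df[t], N) for t in query if t in doc)
```

With no relevance information, the weight reduces to the IDF-like quantity log((N - n_t + 0.5) / (n_t + 0.5)), so rare query terms contribute more to the score than common ones.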

YouTube Encyclopedic

  • Unit 21 04 Probabilistic Models.mp4
  • INFORMATION RETRIEVAL TECHNIQUES
  • Decision Tree Tutorial in 7 minutes with Decision Tree Analysis & Decision Tree Example (Basic)

Transcription

What we want then is a probabilistic model P over a word sequence. We can write that sequence as word 1, word 2, word 3, all the way up to word n, and we can abbreviate it as the sequence of words 1 through n, using the colon. The next step is to factor this and write these individual variables in terms of conditional probabilities. So this probability is equal to the product over all i of the probability of word i given all the preceding words, that is, from word 1 up to word i - 1. This is just the definition of conditional probability: the joint probability of a set of variables can be factored as the conditional probability of one variable given all the others, and then we can do that recursively until we've got all the variables accounted for.

We can make the Markov assumption, and that's the assumption that the effect of one variable on another is local. That is, if we're looking at the nth word, the words that are relevant to it are the ones that occurred recently, not the ones that occurred a long time ago. What the Markov assumption means is that the probability of word i, given all the words from word 1 up to word i - 1, can be assumed to be equal, or approximately equal, to the probability of the word given only the words from i - k up to i - 1. Instead of going all the way back to word 1, we only go back k steps. For an order-1 Markov model, where k equals 1, this is the probability of word i given only word i - 1.

The next thing we want in our model is called the stationarity assumption. What that says is that the probability distribution of each variable is the same: the probability distribution over the first word is the same as the probability distribution over the nth word.
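The factorization and Markov assumption described above can be written compactly. Writing w_{1:n} for the sequence of words 1 through n, the chain rule and the order-k Markov approximation are:

```latex
P(w_{1:n}) = \prod_{i=1}^{n} P(w_i \mid w_{1:i-1})
           \approx \prod_{i=1}^{n} P(w_i \mid w_{i-k:i-1})
```

and the stationarity assumption says P(w_i = w | w_{i-1} = v) is the same for every position i.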
Another way to look at that is that if I keep saying sentences, the words that show up in my sentence depend on what the surrounding words are, but they don't depend on whether I'm on the first sentence or the second sentence or the third sentence. The stationarity assumption we can write as: the probability of a word given the previous word is the same for all positions. For all values of i and j, the probability of word i given the previous word is the same as the probability of word j given the previous word. That gives us all the formalism we need to talk about these word sequence models, that is, probabilistic word sequence models.

In practice there are many tricks. One thing we talked about before, when we were doing spam filtering and so on, is the necessity of smoothing. That is, if we're going to learn these probabilities from counts, we go out into the world, we observe some data, and we figure out how often word i occurs given that word i - 1 was the previous word. We're going to find that a lot of these counts are zero or some small number, and the estimates are not going to be good. Therefore we need some type of smoothing, like the Laplace smoothing that we talked about, and there are many other smoothing techniques for coming up with good estimates.

Another thing we can do is augment these models to deal not just with words but with other data as well. We saw that in the spam-filtering model also. There you might want to think about who the sender is, what the time of day is, and so on: auxiliary fields like the header fields of the email messages, as well as the words in the message. That's true for other applications as well; you may want to go beyond the words and consider variables that have to do with the context of the words. We may also want to have other properties of words. The great thing about dealing with an individual word like "dog" is that it's observable in the world.
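The count-based estimation with Laplace smoothing that the transcript describes can be sketched as follows. The toy corpus and function names are invented for the example; assume tokens come from a simple whitespace tokenizer.

```python
from collections import Counter

def train_bigram(tokens, vocab, k=1.0):
    """Return a function estimating P(word | prev) from bigram counts,
    with add-k (Laplace) smoothing so unseen bigrams get nonzero mass."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def prob(word, prev):
        # count(prev, word) + k, normalized over count(prev) + k * |V|
        return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * len(vocab))

    return prob
```

With k = 1 this is exactly add-one smoothing: a bigram never observed in the data still receives a small probability, at the cost of shaving some mass off the observed ones.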
We see this spoken or written text, we can figure out what it means, and we can start making counts about it and start estimating probabilities. But we also might want to know that, say, "dog" is being used as a noun, and that's not immediately observable in the world, though it is inferable. It's a hidden variable, and we may want to try to recover these hidden variables, like parts of speech. We may also want to go to bigger sequences than just individual words: rather than treat "New York City" as three separate words, we may want a model that allows us to think of it as a single phrase. Or we may want to go smaller than that and look at a model that deals with individual letters rather than words. These are all variations, and the type of model you choose depends on the application, but they all follow from this idea of a probabilistic model over sequences.

Related models

There are some limitations to this framework that need to be addressed by further development:

  • There is no accurate estimate for the first run probabilities
  • Index terms are not weighted
  • Terms are assumed mutually independent

To address these and other concerns, other models have been developed from the probabilistic relevance framework, among them the Binary Independence Model from the same author. The best-known derivative of this framework is the Okapi (BM25) weighting scheme, along with BM25F, a modification thereof.
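For concreteness, here is a minimal sketch of the Okapi BM25 weighting scheme mentioned above. The parameter defaults k1 = 1.2 and b = 0.75 are common choices rather than values prescribed by the source, and the corpus in the usage example is invented.

```python
import math

def bm25_score(query, doc, df, N, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query.

    doc: list of tokens, df: term -> document frequency, N: corpus size,
    avgdl: average document length. k1 controls term-frequency
    saturation; b controls document-length normalization.
    """
    score = 0.0
    dl = len(doc)
    for t in query:
        tf = doc.count(t)
        if tf == 0:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score
```

Unlike the basic model, term frequency now contributes, but with diminishing returns (saturation via k1), and long documents are penalized relative to the average length (via b), which addresses the "index terms are not weighted" limitation listed above.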

References

  1. ^ Robertson, S. E.; Jones, K. Spärck (May 1976). "Relevance weighting of search terms". Journal of the American Society for Information Science. 27 (3): 129–146. doi:10.1002/asi.4630270302.
  2. ^ Robertson, Stephen; Zaragoza, Hugo (2009). "The Probabilistic Relevance Framework: BM25 and Beyond". Foundations and Trends in Information Retrieval. 3 (4): 333–389. CiteSeerX 10.1.1.156.5282. doi:10.1561/1500000019.
This page was last edited on 26 June 2021, at 19:00.
This page is based on a Wikipedia article. Text is available under the CC BY-SA 3.0 Unported License. Non-text media are available under their specified licenses. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc. WIKI 2 is an independent company with no affiliation with the Wikimedia Foundation.