Binary Independence Model

The Binary Independence Model (BIM)[1][2] in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions so that estimating the similarity between a document and a query becomes feasible.

YouTube Encyclopedic

  • Unit 21 04 Probabilistic Models.mp4
  • WDM 11: Boolean Retrieval Model
  • (ML 13.8) Conditional independence in graphical models - basic examples (part 1)
Transcription

What we want then is a probabilistic model P over a word sequence, and we can write that sequence word 1, word 2, word 3, all the way up to word n, and we can use an abbreviation for that and write the sequence of words 1 through n using the colon. The next step is to say we can factor this and write these individual variables in terms of conditional probabilities. So this probability is equal to the product over all i of the probability of word i given all the preceding words, that is, from word 1 up to word i - 1. This is just the definition of conditional probability: the joint probability of a set of variables can be factored as the conditional probability of one variable given all the others, and then we can recursively do that until we've got all the variables accounted for.

We can make the Markov assumption, and that's the assumption that the effect of one variable on another will be local. That is, if we're looking at the nth word, the words that are relevant to that are the ones that have occurred recently and not the ones that occurred a long time ago. What the Markov assumption means is that the probability of word i, given all the words from 1 up to word i - 1, can be assumed to be equal, or approximately equal, to the probability of the word given only the words from i - k up to i - 1. Instead of going all the way back to number 1, we only go back k steps. For an order-1 Markov model, that is for k equals one, this would be equal to the probability of word i given only word i - 1.

The next thing we want in our model is called the stationarity assumption. What that says is that the probability distribution over each variable is going to be the same. So the probability distribution over the first word is going to be the same as the probability distribution over the nth word. Another way to look at that is that if I keep saying sentences, the words that show up in my sentence depend on what the surrounding words in the sentence are, but they don't depend on whether I'm on the first sentence or the second sentence or the third sentence. The stationarity assumption we can write as: the probability of a word given the previous word is the same for all variables. For all values of i and j, the probability of word i given the previous word is the same as the probability of word j given the previous word. That gives us all the formalism we need to talk about these probabilistic word sequence models.

In practice there are many tricks. One thing we talked about before, when we were doing spam filtering and so on, is the necessity of smoothing. That is, if we're going to learn these probabilities from counts, we go out into the world, we observe some data, we figure out how often word i occurs given that word i - 1 was the previous word, and we're going to find that a lot of these counts are going to be zero or some small number, and the estimates are not going to be good. Therefore we need some type of smoothing, like the Laplace smoothing that we talked about, and there are many other techniques for doing smoothing to come up with good estimates. Another thing we can do is augment these models to deal not just with words but with other data as well. We saw that in the spam-filtering model also.
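As a concrete illustration of the ideas described above (the chain rule, the order-1 Markov assumption, and Laplace smoothing), here is a minimal sketch of a bigram language model. The corpus, function names, and vocabulary handling are invented for this example and are not part of the lecture.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams from a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]          # sentence boundary markers
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(word, prev, unigrams, bigrams, vocab_size):
    """P(word | prev) with Laplace (add-one) smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sequence_prob(tokens, unigrams, bigrams, vocab_size):
    """Chain rule under the order-1 Markov assumption:
    P(w_1..w_n) is approximated by the product of P(w_i | w_{i-1})."""
    padded = ["<s>"] + tokens + ["</s>"]
    p = 1.0
    for prev, word in zip(padded, padded[1:]):
        p *= bigram_prob(word, prev, unigrams, bigrams, vocab_size)
    return p

# Toy corpus, invented for illustration only.
corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
uni, bi = train_bigram_model(corpus)
V = len(set(w for s in corpus for w in s)) + 2        # include <s> and </s>
print(sequence_prob(["the", "dog", "sleeps"], uni, bi, V))
```

Without the add-one term, the unseen bigram ("dog", "sleeps") would drive the whole product to zero, which is exactly the estimation problem smoothing is meant to fix.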
So there you might want to think about who the sender is, what the time of day is, and so on: these auxiliary fields, like the header fields of the email messages, as well as the words in the message, and that's true for other applications as well. You may want to go beyond the words and consider variables that have to do with the context of the words. We may also want to have other properties of words. The great thing about dealing with an individual word like "dog" is that it's observable in the world. We see this spoken or written text, and we can figure out what it means, and we can start making counts about it and start estimating probabilities. But we also might want to know that, say, "dog" is being used as a noun, and that's not immediately observable in the world, but it is inferable. It's a hidden variable, and we may want to try to recover these hidden variables like parts of speech. We may also want to go to bigger sequences than just individual words, so rather than treat "New York City" as three separate words, we may want a model that allows us to think of it as a single phrase. Or we may want to go smaller than that and look at a model that deals with individual letters rather than with words. So these are all variations, and the type of model you choose depends on the application, but they all follow from this idea of a probabilistic model over sequences.
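One of the variations mentioned above, treating a multiword expression such as "New York City" as a single unit, can be sketched as a simple preprocessing step before counting. The phrase list and function name below are invented for this illustration; real systems typically learn such phrases from data.

```python
def merge_phrases(tokens, phrases):
    """Greedily merge known multiword phrases into single tokens."""
    out, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                out.append("_".join(phrase))   # treat the phrase as one token
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = ["she", "flew", "to", "New", "York", "City", "yesterday"]
print(merge_phrases(tokens, [("New", "York", "City")]))
# ['she', 'flew', 'to', 'New_York_City', 'yesterday']
```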

Definitions

The Binary Independence Assumption is that documents are binary vectors. That is, only the presence or absence of terms in documents is recorded. Terms are independently distributed in the set of relevant documents, and they are also independently distributed in the set of irrelevant documents. The representation is an ordered set of Boolean variables. That is, the representation of a document or query is a vector with one Boolean element for each term under consideration. More specifically, a document $d$ is represented by a vector $\mathbf{x} = (x_1, \ldots, x_m)$ where $x_t = 1$ if term $t$ is present in the document and $x_t = 0$ if it is not. Many documents can have the same vector representation with this simplification. Queries are represented in a similar way.

"Independence" signifies that terms in the document are considered independently from each other and no association between terms is modeled. This assumption is very limiting, but it has been shown that it gives good enough results for many situations. This independence is the "naive" assumption of a Naive Bayes classifier, where properties that imply each other are nonetheless treated as independent for the sake of simplicity. This assumption allows the representation to be treated as an instance of a vector space model by considering each term as a value of 0 or 1 along a dimension orthogonal to the dimensions used for the other terms.
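The binary (term-incidence) representation can be sketched in a few lines. The vocabulary, document text, and function name below are invented for this illustration.

```python
def binary_vector(text, vocabulary):
    """Binary term-incidence vector: 1 if the term occurs in the text, else 0."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]

vocabulary = ["information", "retrieval", "model", "binary"]   # toy vocabulary
doc = "The Binary Independence Model is a retrieval model"
print(binary_vector(doc, vocabulary))                          # [0, 1, 1, 1]
```

Note that term frequency is discarded: a document mentioning "model" once and a document mentioning it ten times map to the same vector, which is the simplification the assumption refers to.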

The probability that a document is relevant derives from the probability of relevance of the term vector $\mathbf{x}$ of that document. Using Bayes' rule we get:

$$P(R \mid \mathbf{x}, q) = \frac{P(\mathbf{x} \mid R, q)\, P(R \mid q)}{P(\mathbf{x} \mid q)}$$

$$P(NR \mid \mathbf{x}, q) = \frac{P(\mathbf{x} \mid NR, q)\, P(NR \mid q)}{P(\mathbf{x} \mid q)}$$

where $P(\mathbf{x} \mid R, q)$ and $P(\mathbf{x} \mid NR, q)$ are the probabilities of retrieving a relevant or nonrelevant document, respectively, given that that document's representation is $\mathbf{x}$. The exact probabilities cannot be known beforehand, so estimates from statistics about the collection of documents must be used.

$P(R \mid q)$ and $P(NR \mid q)$ indicate the prior probability of retrieving a relevant or nonrelevant document, respectively, for a query $q$. If, for instance, we knew the percentage of relevant documents in the collection, then we could use it to estimate these probabilities. Since a document is either relevant or nonrelevant to a query, we have that:

$$P(R \mid \mathbf{x}, q) + P(NR \mid \mathbf{x}, q) = 1$$
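The following is a minimal sketch of how these quantities combine under the independence assumption: the likelihood of a binary document vector is taken as a product of per-term probabilities, and Bayes' rule plus the normalization constraint gives the posterior. All per-term probabilities and priors below are invented toy numbers, not estimates from a real collection.

```python
import math

# Toy per-term probabilities (assumed values, for illustration only):
# p[t] = P(x_t = 1 | R, q), u[t] = P(x_t = 1 | NR, q)
p = {"binary": 0.8, "independence": 0.7, "model": 0.6}
u = {"binary": 0.3, "independence": 0.2, "model": 0.5}
prior_R, prior_NR = 0.01, 0.99                     # assumed P(R|q), P(NR|q)

def likelihood(x, probs):
    """P(x | class, q) under term independence: product over the terms."""
    return math.prod(probs[t] if x[t] else 1 - probs[t] for t in probs)

x = {"binary": 1, "independence": 1, "model": 0}   # binary document representation
num_R = likelihood(x, p) * prior_R
num_NR = likelihood(x, u) * prior_NR
posterior_R = num_R / (num_R + num_NR)             # P(R|x,q); the two posteriors sum to 1
print(posterior_R)
```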

Query Terms Weighting

Given a binary query and the dot product as the similarity function between a document and a query, the problem is to assign weights to the terms in the query such that the retrieval effectiveness will be high. Let $p_i$ and $u_i$ be the probability that a relevant document and an irrelevant document, respectively, contains the $i$th term. Yu and Salton,[1] who first introduced the BIM, propose that the weight of the $i$th term is an increasing function of $Y_i = \frac{p_i (1 - u_i)}{(1 - p_i)\, u_i}$. Thus, if $Y_i$ is higher than $Y_j$, the weight of term $i$ will be higher than that of term $j$. Yu and Salton[1] showed that such a weight assignment to query terms yields better retrieval effectiveness than if query terms are equally weighted. Robertson and Spärck Jones[2] later showed that if the $i$th term is assigned the weight of $\log Y_i$, then optimal retrieval effectiveness is obtained under the Binary Independence Assumption.
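A minimal sketch of this weighting scheme follows: query terms receive the log-odds weight $\log Y_i$ and documents are ranked by the dot product with their binary vectors. The per-term probabilities are invented for illustration; a real system would estimate $p_i$ and $u_i$ from relevance judgments or collection statistics, usually with smoothing to avoid zeros.

```python
import math

def log_odds_weight(p_i, u_i):
    """Query term weight log Y_i = log( p_i(1-u_i) / ((1-p_i)u_i) )."""
    return math.log((p_i * (1 - u_i)) / ((1 - p_i) * u_i))

def score(doc_terms, query_weights):
    """Dot product of the binary document vector with the query term weights."""
    return sum(w for term, w in query_weights.items() if term in doc_terms)

# Invented per-term (p_i, u_i) pairs for the query terms.
stats = {"binary": (0.8, 0.3), "independence": (0.7, 0.2), "model": (0.6, 0.5)}
weights = {t: log_odds_weight(p, u) for t, (p, u) in stats.items()}

docs = {"d1": {"binary", "independence", "model"}, "d2": {"model", "vector"}}
for name, terms in sorted(docs.items()):
    print(name, round(score(terms, weights), 3))
```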

The Binary Independence Model was introduced by Yu and Salton.[1] The name Binary Independence Model was coined by Robertson and Spärck Jones,[2] who used the log-odds probability of the probabilistic relevance model, $\log \frac{P(R \mid \mathbf{x}, q)}{P(NR \mid \mathbf{x}, q)}$, to derive the $\log Y_i$ weight; the log-odds probability is shown to be rank equivalent to the probability of relevance (i.e., $P(R \mid \mathbf{x}, q)$) by Luk,[3] obeying the probability ranking principle.[4]

Further reading

  • Christopher D. Manning; Prabhakar Raghavan; Hinrich Schütze (2008), Introduction to Information Retrieval, Cambridge University Press
  • Stefan Büttcher; Charles L. A. Clarke; Gordon V. Cormack (2010), Information Retrieval: Implementing and Evaluating Search Engines, MIT Press

References

  1. Yu, C. T.; Salton, G. (1976). "Precision Weighting – An Effective Automatic Indexing Method" (PDF). Journal of the ACM. 23: 76. doi:10.1145/321921.321930. hdl:1813/7313.
  2. Robertson, S. E.; Spärck Jones, K. (1976). "Relevance weighting of search terms". Journal of the American Society for Information Science. 27 (3): 129. doi:10.1002/asi.4630270302.
  3. Luk, R. W. P. (2022). "Why is information retrieval a scientific discipline?". Foundations of Science. 27 (2): 427–453. doi:10.1007/s10699-020-09685-x. hdl:10397/94873.
  4. Robertson, S. E. (1977). "The Probability Ranking Principle in IR". Journal of Documentation. 33 (4): 294–304. doi:10.1108/eb026647.