From Wikipedia, the free encyclopedia

Evgeniy Gabrilovich
Born: Minsk, Belarus
Nationality: Israeli
Alma mater: Technion – Israel Institute of Technology
Known for: IDN homograph attack, explicit semantic analysis

Scientific career
Fields: Computational linguistics, information retrieval
Institutions: Google Research, Yahoo! Research

Evgeniy Gabrilovich is a research director at Facebook Reality Labs, where he conducts research on neuromotor interfaces. He was previously a Principal Scientist and Director at Google, specializing in information retrieval, machine learning, and computational linguistics. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE)[1] and an ACM Fellow.[2] In 2010, he received the Karen Spärck Jones Award from the British Computer Society Information Retrieval Specialist Group.[3]

YouTube Encyclopedic

  • 2011 Frontiers of Engineering: Advancing Natural Language Understanding

Transcription

>> Up next, we have Evgeniy Gabrilovich, from Yahoo Research. He's a Senior Research Scientist there and he's going to be talking about using Collaboratively Generated Content. >> GABRILOVICH: Thank you, Amaf, for the introduction. I would also like to start by thanking the symposium organizers for inviting me to present this talk. I would like to devote this talk to describing some steps in making computers understand human language. I would also call it natural language, to distinguish it from unnatural languages such as computer programming languages. And I would start with the realization that we humans have a huge amount of world knowledge which allows us to understand text or understand each other in a conversation. Computers usually lack this kind of knowledge, and I'll talk about where we can take this knowledge from and how to let computers use it. I'll also talk about a paradigm shift that happened about 10 years ago. Up until about 10 years ago, the main way to teach computers about the world was to use small-scale data sets carefully crafted by professionals. One example would be the WordNet Electronic Dictionary which was crafted by a team of lexicographers at Princeton for about 20 years or so. And then, around 2001, something happened. Tools had been developed, such as the wiki approach, which allowed essentially millions or tens of millions of people around the globe to put their knowledge online. And the most prominent example would of course be Wikipedia. So let's juxtapose the amount of information that was available until recently and that is available today. And then we'll see how to use this knowledge and those resources constructed collaboratively by multiple people in order to make computers smarter, or make them understand text. I'll also briefly mention some future research directions where we don't only look at the current snapshots of Wikipedia but we also look at the behavioral information of all the editors, the humans who contributed to Wikipedia. We look at all the revisions [INDISTINCT] to it, every single minute change of information in Wikipedia, and we'll see how we can get semantic clues from the way people add information to the--to those documents in this process. So let's see what really differentiates us from computers. Now, I understand that I'm the last speaker for the day and I separate you from dinner, so let's take dinner as an example, because in a way everybody is thinking about it. So when a computer thinks about dinner, it doesn't really think. It sees six letters, D-I-N-N-E-R, and that's about it. Let's see what humans think about dinner when they are faced with it--well, those actually are my kids. So the six-month-old doesn't think much. For him, dinner is just milk--pretty much the single supporting concept. The four-year-old thinks much more, so for him, dinner immediately triggers notions of washing his hands, what dishes he can have for dinner, of course the apple juice, and getting to sleep. After that, he is--actually, also thinking about reading a book before sleeping. And all those concepts that are [INDISTINCT] by a mention of dinner--it is those concepts that help us humans understand each other, and it is exactly this kind of world knowledge that computers do not possess. So let's see how we can teach them to use this kind of knowledge. 
In order to motivate the rest of my talk, I'll take one particular resource of knowledge, one particular task and I'll show how this extra knowledge becomes indispensable in solving this task. So my example for the research would be the Open Directory Project, you can actually see the dmoz.org. It is arguably the largest catalog of URL's or websites on the net. It has about one million categories and five million URL's catalog in those categories by about hundred thousands of volunteers around the globe. These are the volunteers. The reason thereafter was constructed was editors in charge of categories would take a website they believe to be prominent enough to be catalog in the directory. And they would associate it manually with some notes of the directory. In this case, I took a website of a Mining Company and editor in charge associated this site with Science Technology Mining and Business Mining Drilling Categories. And this--if you considered directories [INDISTINCT], so there is a notion of Science, underneath there's Technology and Mining. Now let's take a particular task in text processing and my task will be Task Categorization. Our input is a piece of text and the output is instead of categories or labels, we would like assign to this text. So given a new stream, say from Reuters, we might want to distinguish between legal and business and medical articles and then the medical professional will might want to distinguish between Dental Medicine and say Ophthalmology or we can do it at final resolution. Now, I'm--I took this particular excerpt from a real document which comes from a real and often used collection for Text Categorizations Research Reuters to identify by 578 which is the number of documents in this collection. Every document come in and out of Reuters has been cataloged by professional editors in Reuters and that's how we get Gold Standard Data. We can train computer algorithms to predict those labels. In this case, we have a document talking about Mergers and positions between a group of companies and their establishing a joint venture in Highland Valley. And the Reuter's editor who is very knowledgeable about this domain got this document with a category cover. Now as we learned from the previous talk, Annie showed us that the documents often represented is a vector of space and the dimensions of this space are usually vector of words so the Paradigm to describe documents in this way is called the Bag of Words. Now the trouble of the notions corporate that were mentioned is a documents, there is no word saying copper. So any classifier did only look at the label of the Bag of Words would fail to recognize the category of copper correctly. However, what we can also do before it should be solve into text categorization problem would met the document on the relevant notes once they open directly and then the magic would happen. So as it mentioned all those companies would trigger their association of the document with relevant notes of the open directory and those notes will become additional features or introduce properties of this document. The way it works is these documents had been written about--in about 1980's. The main content had been separate companies back then they have merged into a single documental company but there is a prominent website describing this new merged company. Dimension of those words into text helps us associate this text with external knowledge which is [INDISTINCT] alone. 
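To make the feature-augmentation idea described above concrete, here is a minimal Python sketch of a bag of words enriched with directory concepts. The concept_index mapping and the category names in it are invented stand-ins for the Open Directory feature generator described in the talk, not real data.

    from collections import Counter

    # Hypothetical concept index: a stand-in for a feature generator built from
    # the Open Directory Project or Wikipedia (invented entries, not real data).
    concept_index = {
        "highland valley": ["Science/Technology/Mining", "Business/Mining_and_Drilling"],
        "merger": ["Business/Mergers_and_Acquisitions"],
    }

    def bag_of_words(text):
        """Plain bag-of-words representation: token -> count."""
        return Counter(text.lower().split())

    def bag_of_words_plus_concepts(text):
        """Augment the bag of words with concepts triggered by phrases in the text."""
        features = bag_of_words(text)
        lowered = text.lower()
        for phrase, concepts in concept_index.items():
            if phrase in lowered:
                for concept in concepts:
                    # Concept features live in the same feature space as the words,
                    # so any standard text classifier can consume them unchanged.
                    features[f"CONCEPT:{concept}"] += 1
        return features

    doc = "The companies announced a merger and a joint venture in Highland Valley."
    print(bag_of_words_plus_concepts(doc))

With this augmentation, a classifier can associate the document with mining-related categories even though no mining-related word appears in the text itself.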
So the notion of mining and drilling will become additional properties of the document. But the same way, we'll note that Highland Valley is actually a prominent copper mine in Canada. Now when we add all this knowledge into the Bag of Words, we have the Bag of Words plus concept, it is in risked feature space we are actually able--to easily understand that document is about copper. So let's talk about the sources of knowledge where we can take this knowledge from. Again, we'll have before and after and when we think about extra knowledge, arguably, the first thing that comes to mind is Encyclopedia, in which case, the first thing that comes to mind is Encyclopedia Britannica. It has been in print from over 200 years and it has a repeatable--a respectable number of 65 something articles. The problem is it sees us to build with a respectable number when we look at Wikipedia, which only existed for about 10 years but it has 3.5 millions articles in English alone and about 18.4 million articles in over 200 languages. The WordNet Electronic Dictionary which I already mentioned has been developed by a group of lexicographers in Princeton. It has been developed since 1985. It has 150 thousand entries, very nice, except it's dwarfed by the Collaboratively Generated Dictionary, dictionary which has 2.5 millions entries again in English alone. Arguably, the largest contender here would be Cyc, it's the brain child of Doug Lenat and his colleagues said Cycorp. Back in 1994, they tried to manually catalog all human knowledge and today, they have about 4.6 million essentials catalog. Looks nice, except the Yahoo answers by now has one billion question answered pairs. Flickr has over four billion images many of them tagged with human generated concepts, so--notion, so attributes. This kind of knowledge has not been available before. I would argue that it gives us a qualitative change there in the access of information with computers can have to. We just need to differentiate between computer readable information. We call--all of this is obviously computer readable, it's not sufficient. We need to make this knowledge computer usable. We need to make this knowledge to let computers represent the meaning of language and then reason on top of it. I'll use several sample applications for the duration of this talk. This is my first application. We want to quantify the relatedness between individual words or pairs of text. So a sample problem here would be--we have a pair of words, getting nouns and we can ask how related they are or can we use longer phases, we can ask how related Augeon Stables and golden apples of the Hesperides, any idea how also might be related? No one? Hmm? Those actually happen to be laborers of Hercules. And we can also ask it in multiple languages, for example in Chinese, we can ask how Mutianyu and Simatai are related, which happen to be--no? Which happen to be prominent segments of the Great Wall of China. In all of those cases, there is exactly zero overlap between the words on the left and words on the right. So the Bag of Words approach will not help us because there is no overlap in the Bag of Words, not a single word is shared. We need access to some more general information towards--about those--about those words--those text. If we have this technology, it would be tremendously beneficial in a number of applications, so for example, information retrieval, the most common example of which is websites. 
You have a query, you have documents and you need to judge relatedness or relevance to those documents to the query. Word-sense disambiguation is about having polysemous words, words having multiple senses. Words like these appear in a text, you need to judge which sense this word appears in this context. We can also do error corrections this way. According to my recent [INDISTINCT] they've been using dictation software a lot and it frequently misunderstands me with homophones. Where it has completely different meaning, we [INDISTINCT] differently but sound similarly. I had to correct those manually but if we use Semantic Relatedness Technology, we can really judge that site is much more relevant to word than this kind of sight. So let's see how to use Wikipedia for judging Semantic Relatedness of Words. We will start with the words that use it structure of Wikipedia that is identities of articles and the plurality of links between those articles. And then we'll talk about using the entire content of Wikipedia, the entire text of those 3.5 million articles. There are also some words you try to merge structure and content but as it happens their major improvement comes from using there content of Wikipedia and the structure on top of it has very, very little [INDISTINCT] so we primarily focus on the structure alone and then the content alone. The first approach in using Wikipedia for judging semantic relatedness was interviewed by Strube and Ponzetto back in 2006. Again, our task is worth two words and we want to judge the relatedness. Step number one is to make those words too relevant to Wikipedia articles which is done as a level of comparison between the work and the titles of those articles. Seventy-two articles will now judge relatedness of reduced the problem of computer relatedness between words to computer relatedness between articles. When we are about to associate words to articles, well, a single work would correspondent to a single article up to more than one. If each work response to multiple articles, we can do some reversing disambiguation or joined disambiguation, if there [INDISTINCT] articles which have--say, common designations, such as chess, well, choose those, otherwise, we'll just choose the first article--the first sense for each word. And then we'll compute their relatedness. Now, we have two articles, we need to compute the relatedness of similarity. We can even do so at the level of that words is a two articles. Now some words might be short--some might not be short. So what the authors propose is to use the same system of categories in the Wikipedia to have some generalization structure. Wikipedia actually has a very rich system of categories and each article is supposed to belong to just one category and many articles belong to many more. So we can use this kind of the structure to just relatedness of articles and actually many approaches have been developed to do so. One approach would be simple edge counting of the shortest paths, another approach would judge the information content of the lowest common answers to all the two notes. They're multiple approaches to be this--do this with hierarchy. The main the conclusion here is it uses hierarchies through [INDISTINCT] as oppose to merely using the two Bags of Words. In the previous work, we only used category links and subsequently it may have not written in 2008 proposed and approach that uses entire richness of names in Wikipedia. 
Now with articles who have many links betweens them, even conditional Encyclopedias, like Encyclopedia Britannica have many links between article. Wikipedia takes as much further because it's much easier to create article in a HTML-like environment. And if we have articles that incoming and outgoing of the articles, we can have too much, so it's one basic incoming articles and so--incoming links, one basic ongoing links. If we use analogy within bibliographic consultation domain, incoming links correspond to bibliographic coupling. We can say the two articles are related if they're co-cited by another article. Outgoing links correspond to what they call co-citation. Two articles might be related if they co-cite another article. Now this time, this thing needs to be done in [INDISTINCT] because it is the identity of the co-cited article, it can reveal a lot of information or irrelevant information as we shall seen this example. So suppose I have two articles, one extremely general, like an article about science and one extremely specific, such as atmospheric thermodynamics. Now the 52 articles co-cited the article about science gives us very little information because it's extremely general notion. However, if we know the two articles co-cite an extremely specific notion, such as that of atmospheric thermodynamics, we can [INDISTINCT] that those are probably much more tightly related. So let's now see how to use the actual content of the Wikipedia and not only the links. It has been a long standing dream of artificial intelligence to use Encyclopedia to empower computers in different kinds of tasks not necessarily text understanding. There is one problem with this approach. In order to understand Encyclopedia, we need Natural Language Understanding. Now, Natural Language Understanding is hard, so we need encyclopedic knowledge to actually understand the language in which Encyclopedia has written and where the [INDISTINCT] cycle. Now here's how we break the cycle. So we want to get rid of understanding at least at this level which a proposal said we will--you will structure in content of Wikipedia to develop a new kind of semantics, a new way to represent a meaning of texts and will use this representation directly to understand subsequent texts and to apply the empower knowledge to Wikipedia for solving new problems. There is one problem here because we settled by the actual language understanding rule, not be truly understanding language will be doing what is called statistical language process and we'll be counting words, we'll be counting concepts and counting categories. We'll be doing remarkably well in a variety of tasks, such as judging semantic relatedness, sync of web search and but again, we are no there yet. We cannot truly understand the full power of natural language. So here's how we develop a representation of language meaning using Wikipedia. We start with the realization that every Wikipedia article represents a concept, so an article about leopards represents a concept of leopards. Wikipedia could then be viewed as a huge anthology--huge collection of concept and those will be dimensions of meaning for it represented the meaning of texts. Again, it has about 3.5 million concepts in English alone. The semantics offer a single word would be a vector of associations of this word with multiple Wikipedia concepts. There will be some concepts to which this word is irrelevant and the strengths of association to those is exactly zero. 
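The intuition that a shared link to a very specific article tells us more than a shared link to a very general one can be sketched as follows. This is a simplified illustration, not the published measure; the link sets, incoming-link counts, and the IDF-style specificity weighting are assumptions made for the example.

    import math

    # Toy link structure: article -> set of articles it links to (invented data).
    outgoing = {
        "Leopard": {"Felidae", "Africa", "Science"},
        "Jaguar": {"Felidae", "South America", "Science"},
        "Economics": {"Science", "Money"},
    }

    # Invented incoming-link counts stand in for how general an article is:
    # "Science" is linked from everywhere, "Felidae" only from a few pages.
    incoming_count = {"Science": 200000, "Felidae": 150, "Africa": 60000,
                      "South America": 50000, "Money": 30000}
    total_articles = 3_500_000  # order of magnitude quoted in the talk

    def specificity(article):
        """Rarely linked-to articles are specific and therefore informative,
        in the spirit of an inverse-document-frequency weight."""
        return math.log(total_articles / (1 + incoming_count.get(article, 0)))

    def co_citation_relatedness(a, b):
        """Weight each co-cited article by its specificity instead of counting it as 1."""
        shared = outgoing[a] & outgoing[b]
        if not shared:
            return 0.0
        score = sum(specificity(art) for art in shared)
        norm = sum(specificity(art) for art in outgoing[a] | outgoing[b])
        return score / norm

    print(co_citation_relatedness("Leopard", "Jaguar"))     # shares the specific "Felidae"
    print(co_citation_relatedness("Leopard", "Economics"))  # shares only the general "Science"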
But there will be concepts to which its strength of association is much more prominent. Again, we represent the meaning of a word in the space of Wikipedia concepts. Those concepts have been explicitly defined and manipulated by humans. In the first talk tomorrow, Simon [INDISTINCT] will talk about a different approach. He'll talk about how to actually teach computers to learn this representation. It will not be based on concepts that humans defined but based on a statistical approach to learning the concepts. So in an approach we called Explicit Semantic Analysis--explicit comes from the use of explicitly defined concepts, explicitly defined by humans--we represent the meanings of texts in the space of Wikipedia concepts. However, the first step would be, for each article, to define the concept of this article as a vector of words. And the way we do it, we identify all the [INDISTINCT] or the content words in this article. We compute some weight or some prominence of the word in this article, and each article is represented as a sequence of words and weights. And now we can do it for all the concepts in Wikipedia. Now we invert the [INDISTINCT]: instead of representing concepts as vectors of words, we represent words as vectors of concepts. So for example, the word "cat" appears here, here and here, so all these concepts contribute to the representation of the word "cat" in this case. Now that the representation of a word is a vector of concepts, if we want to judge the relatedness of two words--now, for two words, we have concepts for the first word and concepts for the second word, both vectors in a high-dimensional space--again, remember the previous talk?--and in order to judge how similar two vectors in this space are, we can use the cosine. Obviously, computing the cosine will only use the identical dimensions and we'll essentially ignore all the non-identical dimensions. Okay? So let's see some experimental results. Previous approaches that use the WordNet Electronic Dictionary or the Roget's Thesaurus achieved a performance of about 0.3 to 0.5, and I have to explain how this performance is quantified. So the main way to quantify the relatedness of words would be to take a long list of pairs of words and ask human judges to judge how related each pair is, say on a scale of zero to ten; obviously, if we want to achieve some consistency, we need to ask more than one person. Usually people ask, say, a dozen judges and then average their scores. So for a list of pairs of words, we have human judgments, we have computer [INDISTINCT] judgments, and we can compute the correlation between them. So approaches that are based on small-scale resources, such as WordNet or Roget's Thesaurus, show 0.3 to 0.5 correlation with human judgments. As we start using Wikipedia, we get higher and higher numbers: using categories alone allows us to go to 0.49, using all the links in Wikipedia allows us to go up to 0.69. If we use the entire content of Wikipedia, that is the content of the actual articles rather than the links alone, we get to 0.75, and in the following slides I'll show you how to go even further by using temporal information about how concepts evolve over time. So again, we have so far used the content of Wikipedia in the sense of looking at what is written in Wikipedia. Now, let's see how these concepts change over, say, human history. 
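As a minimal sketch of the explicit semantic analysis representation and the cosine comparison just described: the concept names and weights below are invented toy values standing in for a real TF-IDF index over Wikipedia articles.

    import math

    # Hypothetical concept-to-word weights, as if produced by TF-IDF over
    # Wikipedia article text (the real index covers millions of concepts).
    concept_word_weights = {
        "Cat": {"cat": 0.9, "feline": 0.7, "pet": 0.4},
        "Leopard": {"leopard": 0.9, "feline": 0.6, "africa": 0.3},
        "Mouse (computing)": {"mouse": 0.8, "computer": 0.7},
    }

    def esa_vector(word):
        """Explicit Semantic Analysis: a word becomes a vector of its association
        strengths with Wikipedia concepts (an inverted index over the weights)."""
        return {concept: weights[word]
                for concept, weights in concept_word_weights.items()
                if word in weights}

    def cosine(u, v):
        """Cosine similarity of two sparse vectors stored as dicts."""
        dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
        norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    print(cosine(esa_vector("cat"), esa_vector("feline")))    # related via the shared "Cat" concept
    print(cosine(esa_vector("cat"), esa_vector("computer")))  # no shared concepts -> 0.0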
So let's take 150 years worth of New York Times articles and for example let’s take two notions of--the notion of peace and war and let's see how they are correlated overtime. As you can see, the actual patterns are obviously different but there is a huge correlation between those notions. So we can infer from this very graph that the notions of war and peace are probably related. And just to show you what several spikes here mean. Any idea what this spike between 1850 and 1870 is about? That's the American Civil War, this one. The other one, this one, World War II and this is the Vietnam War, right? Any idea what this peak of blue, peak of peace around 1905? That's actually the Treaty of Portsmouth that ended the Russian-Japanese War in 1905. So if you look at those graphs, you infer that in the temporal dimension, those notions are correlated. Let’s augment the statistical representation with the current content of Wikipedia with this temporal dimension. So what we have done so far, given a word, we'll represent it as a matter of Wikipedia concepts. Now, we augment with the temporal dimension. We compute how this each concept is changed over the 150 years worth of New York Times articles. So each concept now comes with a time series. Given two words whose relatedness we want to compute, we have two search representations and we need to define some measure of distance. Now, we started with a bag of words and we looked for identical words, if there are no shared words, too bad. We generalize two concepts of categories to get some high level notion of what similar or identical might mean. Now, we go in other level. Even though the two concepts on the sides might be different, their time series might be pretty similar and this is a correlation between or alignment between those time series would let us know how related those concepts are actually and we can use a plurality of approaches such as, say, dynamic time warp or DTW, which is common in speech processing, to actually quantify this similarity. So [INDISTINCT] we talked about primarily relatedness of individual words of very short text, in which case, represented them in this [INDISTINCT] of concepts was just fine. Let's talk about processing longer text. Think about web search. We have very long documents. It actually does make much sense to get rid of all those words in the documents because words are extremely informative. In text categorization, which was already mentioned, we have a text and we need to classify with a respective [INDISTINCT] labels or categories. Again, it does make sense to get rid of all the words, so instead of representing the text in this [INDISTINCT] of concepts alone, we will augment a bag of words so--a bag of words plus concepts and let me show you brief example. So suppose here is my text, Wal-Mart supply chain goes real-time, I won't specify the task for now, here is my text, let's fetch some external knowledge to enrich our understanding of this text. So we can obviously [INDISTINCT] or break this text into features and go get the bag of words, essentially all the words in this text. But, before we continue any process, we consult a feature generator. We call it a feature generator because it generates additional features or properties of the text. It is powered by Wikipedia but it could also be powered by any other language or knowledge resource. And it constructs a collection of relevant concepts, in this case, concepts from Wikipedia. So here are some concepts it fetches. 
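A toy version of this temporal idea: if two concepts rise and fall together in a news archive, treat that as evidence that they are related. The yearly counts below are invented, and plain Pearson correlation stands in for the dynamic-time-warping alignment mentioned in the talk.

    import math

    # Hypothetical yearly mention counts of two concepts in a news archive
    # (invented numbers standing in for 150 years of New York Times data).
    war_series   = [120, 300, 900, 150, 80, 700, 650, 90]
    peace_series = [100, 280, 850, 200, 90, 720, 600, 110]

    def pearson(xs, ys):
        """Plain Pearson correlation of two equally long time series."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy) if sx and sy else 0.0

    # Highly correlated usage patterns over time suggest the concepts are related,
    # even though "war" and "peace" share no words in a bag-of-words comparison.
    print(pearson(war_series, peace_series))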
We suddenly know that Sam Walton is the founder of Wal-Mart. We learn that Sears, Target and Albertsons are relevant because they are prominent competitors of Wal-Mart. Hypermarket is a more general notion of what Wal-Mart is. Actually, I like RFID best here because it's clearly the most--a very relevant notion. RFID is a tracking technology that Wal-Mart pioneered to track its supplies. We wouldn't know that RFID is related to this paragraph, which clearly talks about the Wal-Mart supply chain, unless we consult a huge external repository of knowledge. Now we have a bag of words plus concepts, and with this augmented representation we can learn better classification functions, and we can learn how to better retrieve documents because we have better ways to judge relatedness. And in about 2008 [INDISTINCT] independently [INDISTINCT] invented a cross-lingual extension to explicit semantic analysis. So if we know how to represent a text in this space of Wikipedia concepts--the authors noted that Wikipedia has a lot of links not only within the same-language Wikipedia but also between different language versions of Wikipedia. There are links from articles in one language to the articles on the same notion in other languages. So if we train this model and we represent the text in this space of English concepts, we can use the cross-language links to easily represent the same text in the space of Russian concepts. Some links don't exist yet, some articles don't exist yet, but I emphasize it here because Wikipedia grows as we speak, and if the article doesn't exist today, it might well be there tomorrow or the day after. So, let me briefly summarize my talk. I started with the realization that exogenous knowledge or external knowledge that is not explicitly present in the text is critical for computers to understand language, because we, humans, possess this kind of knowledge through our world experience. And then [INDISTINCT] the advent of collaboratively generated knowledge qualitatively changed the amount of information computers can have access to. This knowledge wasn't designed in the first place to be used by computers. Wikipedia was launched to allow many people around the globe to have access to knowledge. It just happened, incidentally, that we can use this information to make computers smarter, in quotes, and make them solve tasks they couldn't solve before. This concept-based approach allows us to address, at least partially, the two main problems in natural language processing: namely synonymy, the property of the same notion being expressed by multiple words. If we have two texts that talk about the same notion but use drastically different words, we suddenly are able to realize that those concepts are related. And we can also address polysemy, the property of words having more than one sense. If we have a text in which a polysemous word appears, a word that has multiple senses, we can use this technology to figure out exactly which sense it appears in in this particular context. So in this talk I only mentioned Wikipedia and the Open Directory; there are many more resources like these, and the future work would be not necessarily working with the current snapshot of those [INDISTINCT] but also looking at the entire history of changes. So Wikipedia tracks every single change to every article, however small or big it is, even at the level of a single-character change. 
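The cross-lingual extension just described amounts to re-labeling an ESA concept vector through Wikipedia's interlanguage links. Below is a hypothetical illustration; the link table and weights are invented, and concepts without a link are simply dropped, mirroring the incompleteness mentioned in the talk.

    # Hypothetical interlanguage-link table: English concept -> Russian concept.
    # In reality these links come from Wikipedia itself, and some are missing.
    en_to_ru = {
        "Wal-Mart": "Walmart (ru)",
        "Hypermarket": "Гипермаркет",
        "Radio-frequency identification": "RFID (ru)",
    }

    def map_concept_vector(en_vector, link_table):
        """Carry an English ESA vector over to the Russian concept space.
        Concepts without a cross-language link are dropped."""
        ru_vector = {}
        for concept, weight in en_vector.items():
            target = link_table.get(concept)
            if target is not None:
                ru_vector[target] = ru_vector.get(target, 0.0) + weight
        return ru_vector

    en_vector = {"Wal-Mart": 0.8, "Hypermarket": 0.5, "Sam Walton": 0.4}
    print(map_concept_vector(en_vector, en_to_ru))  # "Sam Walton" has no link here and is dropped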
Across Wikimedia projects [INDISTINCT] Wikipedia and the likes there are over one billion changes [INDISTINCT] we can look at how people change information. We can actually gain Symantec clues from the behavioral information of people. And the brief idea here would be if I know that a particular word has been introduced early in the different life and it [INDISTINCT] longer, this word is probably prominent in the document--for the concept describing the document. I can do simple counting in the last version of this document or it look at the entire history of things. I think I'll stop at this moment. Thank you very much for your attention. I'll be happy to take any questions you might have. >> Questions? >> Thank you. [INDISTINCT] so in--earlier, in your presentation and also in the previous presentation there were human based correlation values and scores argument, how much of that work is done in foreign languages as well and how much is English really shaping all the natural language processing work that's going on in the future because of that? >> GABRILOVICH: And so with [INDISTINCT] of the worldwide web, it's probably--the problem was elevated to a certain degree because more content was available to researchers in multiple languages. Definitely you are right. There is a huge amount of work done in English but the future of availability of material is probably much easier to process languages such as Chinese or French or Russian, which is spoken by many people, as opposed to, say, a much less privileged languages which is spoken by only a small group of people. Some of it can be solved by availability of information online but probably not everything. >> Maybe as a--as a quick follow-up. So, within English, we're going to have certain associations between words and what not but those--I wonder how much we're losing by focusing on that, just those association that come with that thing which as opposed... >> GABRILOVICH: I'm sorry. Can you please explain? >> I wonder what other association we might be missing if we--by not using correlation scores that were generated by none-English speaking people. Does that--does that make sense? >> GABRILOVICH: I'm afraid I'm still missing the point of your question. Are you talking about quantifying relatedness in languages other than English and how much we miss by not being privileged to have so many resources? >> Yeah. >> GABRILOVICH: I'm afraid I cannot site many relevant works. I know--yes, say, that [INDISTINCT] symantic analysis have been redone for German and they got comparative numbers of performance similar to English but German is a pretty commonly spoken language with lots of material. I would expect the numbers to be lower, I just cannot say by how much. >> SAMPLE: Hi. I'm John Sample from the Naval Research Lab. I have a question about how the meanings of words change over time. Do you have enough training data to do that kind of analysis to look at words over what a hundred years or 200 years to see how they're changing and how, you know, the meanings are changing? >> GABRILOVICH: I'm sure they--meanings are changed. I am afraid they haven't worked specifically on this topic. I think the purpose of Google has Google books which dates back I think at least 500 years would be a tremendous asset in judging those things. There have been some works that tried to see how word usage is changed. 
So, there was [INDISTINCT] I think published Nature a few years ago which studied how irregular verbs in English become regular and they tried to call it between this phenomenon and the frequency of the words so the more frequent the verb is, the more likely it is to become regular over time. This published only [INDISTINCT] research known to date. >> CORNELL: Hod Lipson Cornell here. I have a question about, you know, all the information you showed has to do with that taking text style of the Web. What about audio or spoken text, video--there’s probably just as much information in those channels. Is that also being mind in the same way? Is that as reliable as written text? >> GABRILOVICH: I would assume it has a lot of information. It's probably even more difficult to mind those because probably the approach to do so would be the first transcribed audio to text, and then the text minding the venue suffers some imperfectness because of the noisy inference of the audio and probably the video if you'll try to transcribe all the captions of video. And I'm not certain about the current where it tries to [INDISTINCT] since they're reported throughout as spoken [INDISTINCT] a video material to learn about the world I would assume is possible. I would assume it's interesting to do it on the video because you can have close from the image and correlate [INDISTINCT] what is spoken. And again, I'm not aware of the work that tried to learn about from word from those sources. >> SUDIMARA: Sid Sudimara in Colorado State University. My question is on how you handle bad data. Because Wikipedia, it doesn’t have the same level of rigorous period evaluation as most other content does. So, how do you take for veracity? >> GABRILOVICH: So, it's a good question, I would probably address in two ways. First, Wikipedia is not as bad as one could think. There was study in Nature I think 1995 comparing the quality of Wikipedia and finding it to be on par with that of Britannica. Actually, Wikipedia has a lot of editors who oversee the development of individual articles, and then there are many people who simply monitor the stream of changes to Wikipedia and just see if those changes make sense or it's vandalism or an attempt to promote some commercial entity. So, the quality is not as bad. One thing that we alleviate this problem in this setting is that we take a text, we generate relevant concepts for Wikipedia. Those concepts are not user phrasing. We do not show them directly to the user. They are used as an intermediate representation of the meaning of text to achieve some additional goal. So what we usually do, we first generate relevant concepts, and then we decimate them. Actually, generating concepts automatically is very easy. We can easily generate millions, if not more, of relevant concepts. It is the subsequent selection, that is in professional language called feature selection, we decimate those concepts to only leave the truly relevant ones, and this is the step that allows us to get rid of a lot of noise and it only retain informative concepts, and those concepts do improve the final [INDISTINCT] whatever it has to do. >> SARPESH: Rajo Sarpesh for MIT. So, a 12-year-old child has much less knowledge than Wikipedia for sure. And I'd be willing to bet, has better semantic understanding than anything a computer could do. And that’s because knowledge is not an understanding. 
And the reason I think a 12-year-old child can do that is it’s got feedback loops between all four parts of its brain that are processing images, emotions, verbal knowledge, world structure all at the same time, and like you put up, you know, with the kid in the beginning of your talk having the idea of dinner. It’s those feedback loops between all those things that are creating all these associations that give it understanding, even though it has very little knowledge. Which is why Einstein said, "Imagination is more important than knowledge." So, I think you have to simultaneously model image processing, emotional processing, verbal understanding and world structure to get true meaning. And it seems to me like all these work is all left brain. You're completely ignoring the right analogue, feeling, emotional imaging, graphical imaginative part of our brain which actually gives meaning to tying the left and the right brain together. Which is why a kid of 12 years old can do what a computer can. So, why--now, I'm completely ignorant of this whole field so maybe I'm missing something. But it seems to me like this is too brute force. There has to be more insight into how the brain actually works and processes language to get understanding, not correlations. >> GABRILOVICH: I think you're right. I think the main reason we're not there yet is because, well, it’s truly difficult and the 50-plus research nature language processing didn’t afford that the ability to do what you describe. I think humans differ from computers in a very key property. Humans can learn from single example. You show a child a single example of what the kid is, she can reliably understand what kids are from this point. Computers don’t work this way. Computers need to be taught multiple times, that’s why we need multiple examples, multiple label examples for each task. I believe the multiple modalities you mentioned, obviously a very important--is very important to join audio signal and metoric signal and language signal, we're not--we're just not there yet. It’s not because the approach I advocate is the right one, it's what we know how to do today. We would [INDISTINCT] to be where you describe, we’re there yet. >> Any other question? >> HUANG: Hi. Mado Huang, IBM. Are there any parallels between your semantic recognition methods and those spoken today with the other fields like pattern recognition or feature recognition and physical parts and that sort of thing? >> GABRILOVICH: So, there is definitely one parallelism in that most of the task you mentioned probably would benefit from injection of external knowledge. This knowledge can come from bodies of text or they can come from traditional sources such as observing lots of audio samples or lots of video samples. So this level I'm sure there is definitely a lot of commonality. Multiple techniques which I mentioned today, like the idea of generating features, we use generation of features for text. There is also work on generating text--features for images or generating features in multiple other domains. The idea of post-selecting automatically generate feature selection is also used in multiple domains, especially in [INDISTINCT] when you need to retain only the most informative features with respect to some labels. So there are definitely commonalities here as well. >> WARNER: Hi. Andy Warner from Purdue. 
Since the effectiveness of these semantic classifiers are judged against human judgments, so I'm wondering what attention is given to how you select the population of humans. For example, in the previous talk, we heard that the book or price committee may see things differently than other groups or some reference to sort of people at the graduate student level perhaps judging things different than other population groups. So, you know, doing this type of work, what kind of thought is given into where the human sample comes from? >> GABRILOVICH: All right. It's definitely a very important question. So when people publish papers like this, they usually say, "We recruited such and such number of people who happened to be representatives of the general population or happen to be graduate students of a medical university or happen to be people we found in the library. Definitely, it affects the results, so we need to let the reader know who the people are." One common, if not dominant way to get judgments today through the Amazon mechanical [INDISTINCT]. Amazon has this nice platform of outsourcing various human level tasks, like you have a sequence of images and you want to label them with respect to what is shown in the image. They’ve been--I don’t know the exact numbers, but definitely thousands and probably tens of thousands of people around the globe, many of them in the US. You can ask those to do the judgments for you. Yes, it will be noisy, yes, it will be sub-optimal and yes, and yet those will be very cheap. So, it's a definitely a way to give them. One way to get some control of the qualities to write, it runs so-called prequalification tests. You ask those people some questions, the answers to which you know ahead of time. If they pass the test with some score you said ahead of time, you let them do some subsequent test. This way you can hope to get a reasonably representative sample of the population. >> Okay. Let's thank the speaker.

Career

In 2002, Gabrilovich and fellow researcher Alex Gontmakher published a research paper documenting the possibility of an IDN homograph attack. In 2005, Gabrilovich earned his Ph.D. in Computer Science from the Technion – Israel Institute of Technology. In his Ph.D. thesis, he developed a methodology for using large-scale repositories of world knowledge, such as Wikipedia, to improve text representations.

Publications

  • "Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis", The 20th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1606–1611, Evgeniy Gabrilovich and Shaul Markovitch, Hyderabad, India, January 2007
  • "Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization", Evgeniy Gabrilovich and Shaul Markovitch, Journal of Machine Learning Research 8 (Oct), pp. 2297–2345, 2007
  • "Robust Classification of Rare Queries Using Web Knowledge", The 30th Annual International ACM SIGIR Conference, Amsterdam, the Netherlands, July 2007
  • The Homograph Attack Archived 2019-11-04 at the Wayback Machine, Evgeniy Gabrilovich and Alex Gontmakher, Communications of the ACM, 45(2):128, February 2002

References

  1. ^ Gabrilovich, Evgeniy. "Homepage of Evgeniy Gabrilovich". Retrieved 7 December 2021.
  2. ^ "ACM Names 71 Fellows for Computing Advances That Are Driving Innovation". HPCWire. Tabor Communications, Inc. 19 January 2022. Retrieved 25 March 2022.
  3. ^ "Evgeniy Gabrilovich Honored with Prestigious Karen Spärck Jones Award". Retrieved 7 August 2012.

