k-independent hashing

From Wikipedia, the free encyclopedia

In computer science, a family of hash functions is said to be k-independent, k-wise independent or k-universal[1] if selecting a function at random from the family guarantees that the hash codes of any designated k keys are independent random variables (see precise mathematical definitions below). Such families allow good average case performance in randomized algorithms or data structures, even if the input data is chosen by an adversary. The trade-offs between the degree of independence and the efficiency of evaluating the hash function are well studied, and many k-independent families have been proposed.

Background

The goal of hashing is usually to map keys from some large domain (universe) $U$ into a smaller range, such as $m$ bins (labelled $[m] = \{0, 1, \dots, m-1\}$). In the analysis of randomized algorithms and data structures, it is often desirable for the hash codes of various keys to "behave randomly". For instance, if the hash code of each key were an independent random choice in $[m]$, the number of keys per bin could be analyzed using the Chernoff bound. A deterministic hash function cannot offer any such guarantee in an adversarial setting, as the adversary may choose the keys to be precisely the preimage of a bin. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.

The solution to these problems is to pick a function randomly from a large family of hash functions. The randomness in choosing the hash function can be used to guarantee some desired random behavior of the hash codes of any keys of interest. The first definition along these lines was universal hashing, which guarantees a low collision probability for any two designated keys. The concept of $k$-independent hashing, introduced by Wegman and Carter in 1981,[2] strengthens the guarantees of random behavior to families of $k$ designated keys, and adds a guarantee on the uniform distribution of hash codes.

Definitions

The strictest definition, introduced by Wegman and Carter[2] under the name "strongly universal hash family", is the following. A family of hash functions $\mathcal{H} = \{h : U \to [m]\}$ is $k$-independent if for any $k$ distinct keys $(x_1, \dots, x_k) \in U^k$ and any $k$ hash codes (not necessarily distinct) $(y_1, \dots, y_k) \in [m]^k$, we have:

    $\Pr_{h \in \mathcal{H}}[h(x_1) = y_1 \wedge \dots \wedge h(x_k) = y_k] = m^{-k}$

This definition is equivalent to the following two conditions:

  1. for any fixed $x \in U$, as $h$ is drawn randomly from $\mathcal{H}$, $h(x)$ is uniformly distributed in $[m]$.
  2. for any fixed, distinct keys $x_1, \dots, x_k \in U$, as $h$ is drawn randomly from $\mathcal{H}$, the values $h(x_1), \dots, h(x_k)$ are independent random variables.
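
For example, in the pairwise case ($k = 2$) the two conditions combine to give exactly the probability required by the definition: independence lets the joint probability split into a product, and uniformity supplies each factor,

    $\Pr_{h \in \mathcal{H}}[h(x_1) = y_1 \wedge h(x_2) = y_2] = \Pr_h[h(x_1) = y_1] \cdot \Pr_h[h(x_2) = y_2] = \tfrac{1}{m} \cdot \tfrac{1}{m} = m^{-2}.$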

Often it is inconvenient to achieve the perfect joint probability of $m^{-k}$ due to rounding issues. Following Siegel,[3] one may define a $(\mu, k)$-independent family to satisfy:

    for all distinct $(x_1, \dots, x_k) \in U^k$ and all $(y_1, \dots, y_k) \in [m]^k$, $\Pr_{h \in \mathcal{H}}[h(x_1) = y_1 \wedge \dots \wedge h(x_k) = y_k] \le \mu / m^k$.

Observe that, even if $\mu$ is close to 1, the $h(x_i)$ are no longer independent random variables, which is often a problem in the analysis of randomized algorithms. Therefore, a more common alternative to dealing with rounding issues is to prove that the hash family is close in statistical distance to a $k$-independent family, which allows black-box use of the independence properties.
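
As a concrete illustration of the definition, the sketch below exhaustively checks pairwise ($k = 2$) independence of the affine family $h_{a,b}(x) = (ax + b) \bmod p$ over a tiny prime field, with $m = p$ so that no rounding issue arises. The prime, the two keys, and the variable names are chosen only for this example.

    from itertools import product
    from collections import Counter

    # Exhaustive check of pairwise independence for the affine family
    # h_{a,b}(x) = (a*x + b) mod p, drawn uniformly over all (a, b) in Z_p x Z_p.
    # Here m = p = 5, so hash codes are exactly the bins 0..4.
    p = m = 5
    x1, x2 = 1, 3                                   # two fixed, distinct keys (arbitrary)

    counts = Counter()
    for a, b in product(range(p), repeat=2):        # "draw h at random" from the family
        counts[((a * x1 + b) % p, (a * x2 + b) % p)] += 1

    # 2-independence: every pair of hash codes (y1, y2) must occur with
    # probability 1/m^2, i.e. exactly p^2 / m^2 = 1 time out of the p^2 draws.
    assert len(counts) == m * m
    assert all(c == (p * p) // (m * m) for c in counts.values())
    print("pairwise independent:", dict(counts))

Note that this family is 2-independent but not 3-independent: once the hash codes of two distinct keys are fixed, $a$ and $b$ are determined, and with them the hash code of every other key.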

Techniques

Polynomials with random coefficients

The original technique for constructing k-independent hash functions, given by Carter and Wegman, was to select a large prime number p, choose k random numbers modulo p, and use these numbers as the coefficients of a polynomial of degree k − 1 whose values modulo p are used as the value of the hash function. All polynomials of the given degree modulo p are equally likely, and any polynomial is uniquely determined by any k-tuple of argument-value pairs with distinct arguments, from which it follows that any k-tuple of distinct arguments is equally likely to be mapped to any k-tuple of hash values.[2]

In general the polynomial can be evaluated in any finite field. Besides the fields modulo a prime, a popular choice is the field of size $2^{64}$, which supports fast finite-field arithmetic on modern computers. This was the approach taken by Daniel Lemire and Owen Kaser for CLHash.[4]
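
A minimal sketch of the polynomial construction over a prime field, assuming non-negative integer keys smaller than the prime $p$; the helper name make_poly_hash and the choice of the Mersenne prime $2^{61} - 1$ are illustrative, not from the cited sources. Reducing the polynomial value modulo the table size m makes the distribution only approximately uniform (the rounding issue noted under Definitions), with bias on the order of m/p.

    import random

    def make_poly_hash(k, p, m):
        """Sample one function from a k-independent family: a degree-(k-1)
        polynomial with k random coefficients mod p, reduced to m bins.
        Sketch only; p must be a prime larger than any key."""
        coeffs = [random.randrange(p) for _ in range(k)]

        def h(key):
            # Horner's rule: c[k-1]*key^(k-1) + ... + c[1]*key + c[0], all mod p
            acc = 0
            for c in reversed(coeffs):
                acc = (acc * key + c) % p
            return acc % m

        return h

    # Usage: one 3-independent hash function into 1024 bins
    h = make_poly_hash(k=3, p=(1 << 61) - 1, m=1024)
    print(h(42), h(43))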

Tabulation hashing

Tabulation hashing is a technique for mapping keys to hash values by partitioning each key into bytes, using each byte as the index into a table of random numbers (with a different table for each byte position), and combining the results of these table lookups by a bitwise exclusive or operation. Thus, it requires more randomness in its initialization than the polynomial method, but avoids possibly-slow multiplication operations. It is 3-independent but not 4-independent.[5] Variations of tabulation hashing can achieve higher degrees of independence by performing table lookups based on overlapping combinations of bits from the input key, or by applying simple tabulation hashing iteratively.[6][7]
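
A minimal sketch of simple tabulation hashing, assuming 32-bit keys split into four bytes; the table width, byte count, and helper names are illustrative choices for the example.

    import random

    def make_tabulation_hash(key_bytes=4, bits=32):
        """Simple tabulation hashing: one table of 256 random words per byte
        position; the hash is the XOR of the looked-up entries. Sketch only;
        the key is treated as key_bytes little-endian bytes."""
        tables = [[random.getrandbits(bits) for _ in range(256)]
                  for _ in range(key_bytes)]

        def h(key):
            out = 0
            for i in range(key_bytes):
                byte = (key >> (8 * i)) & 0xFF   # extract the i-th byte of the key
                out ^= tables[i][byte]           # XOR of per-position table lookups
            return out

        return h

    h = make_tabulation_hash()
    print(h(0xDEADBEEF))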

Independence needed by different types of collision resolution

The notion of k-independence can be used to differentiate between collision resolution schemes in hash tables, according to the level of independence required to guarantee constant expected time per operation.

For instance, hash chaining takes constant expected time even with a 2-independent family of hash functions, because the expected time to perform a search for a given key is bounded by the expected number of collisions that key is involved in. By linearity of expectation, this expected number equals the sum, over all other keys in the hash table, of the probability that the given key and the other key collide. Because the terms of this sum only involve probabilistic events involving two keys, 2-independence is sufficient to ensure that this sum has the same value that it would for a truly random hash function.[2]
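
As an illustration (not code from the cited source), the sketch below drives a chained hash table with one draw from the 2-independent affine family $((ak + b) \bmod p) \bmod m$; the Mersenne prime and the class and method names are assumptions made for the example.

    import random

    P = (1 << 61) - 1  # a prime assumed larger than any key (illustrative choice)

    class ChainedTable:
        """Hash table with chaining, using a 2-independent affine hash. Sketch only."""
        def __init__(self, m=64):
            self.m = m
            self.a = random.randrange(1, P)
            self.b = random.randrange(P)
            self.bins = [[] for _ in range(m)]

        def _h(self, key):
            return ((self.a * key + self.b) % P) % self.m

        def insert(self, key, value):
            chain = self.bins[self._h(key)]
            for i, (k, _) in enumerate(chain):
                if k == key:
                    chain[i] = (key, value)   # overwrite an existing key
                    return
            chain.append((key, value))

        def search(self, key):
            for k, v in self.bins[self._h(key)]:
                if k == key:
                    return v
            raise KeyError(key)

    t = ChainedTable()
    t.insert(10, "x")
    print(t.search(10))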

Double hashing is another method of hashing that requires a low degree of independence. It is a form of open addressing that uses two hash functions: one to determine the start of a probe sequence, and the other to determine the step size between positions in the probe sequence. As long as both of these are 2-independent, this method gives constant expected time per operation.[8]
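
A sketch of the probe sequence only, built from two independently drawn functions of the same 2-independent affine family; forcing the step to be odd, so that it is coprime to the power-of-two table size, is an illustrative choice and not part of the cited analysis.

    import random

    P = (1 << 61) - 1  # a prime assumed larger than any key (illustrative choice)

    def pairwise_hash(m):
        """One draw from the 2-independent affine family ((a*k + b) mod p) mod m."""
        a, b = random.randrange(1, P), random.randrange(P)
        return lambda key: ((a * key + b) % P) % m

    def make_double_hash_probe(m):
        """Double hashing: h1 gives the starting slot, h2 gives the step size."""
        h1 = pairwise_hash(m)
        h2 = pairwise_hash(m)

        def probe(key, i):
            step = h2(key) | 1            # odd step, coprime to a power-of-two m
            return (h1(key) + i * step) % m

        return probe

    probe = make_double_hash_probe(m=64)
    print([probe(12345, i) for i in range(5)])   # first five probe positions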

On the other hand, linear probing, a simpler form of open addressing in which the step size is always one, can be guaranteed to work in constant expected time per operation with a 5-independent hash function,[9] and there exist 4-independent hash functions for which it takes logarithmic time per operation.[10]

For cuckoo hashing the required k-independence is not known as of 2021. In 2009 it was shown[11] that $O(\log n)$-independence suffices, and at least 6-independence is needed. Another approach is to use tabulation hashing, which is not 6-independent, but was shown in 2012[12] to have other properties sufficient for cuckoo hashing. A third approach from 2014[13] is to slightly modify the cuckoo hash table with a so-called stash, which makes it possible to use nothing more than 2-independent hash functions.

Other applications

Daniel Kane, Jelani Nelson, and David Woodruff improved the Flajolet–Martin algorithm for the distinct elements problem in 2010.[14] To give an approximation to the correct answer, their algorithm requires a hash function with a sufficiently high degree of independence.

The Count sketch algorithm for dimensionality reduction requires two hash functions, one 2-independent and one 4-independent.

The Karloff–Zwick algorithm for the MAX-3SAT problem can be implemented with 3-independent random variables.

The MinHash algorithm can be implemented using an $O(\log \tfrac{1}{\varepsilon})$-independent hash function, where $\varepsilon$ is the approximation error, as was proven by Piotr Indyk in 1999.[15]

References

  1. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4.
  2. Wegman, Mark N.; Carter, J. Lawrence (1981). "New hash functions and their use in authentication and set equality". Journal of Computer and System Sciences. 22 (3): 265–279. doi:10.1016/0022-0000(81)90033-7. Conference version in FOCS'79.
  3. Siegel, Alan (2004). "On universal classes of extremely random constant-time hash functions and their time-space tradeoff". SIAM Journal on Computing. 33 (3): 505–543. doi:10.1137/S0097539701386216. Conference version in FOCS'89.
  4. Lemire, Daniel; Kaser, Owen (2016). "Faster 64-bit universal hashing using carry-less multiplications". Journal of Cryptographic Engineering. 6 (3): 171–185.
  5. Pătraşcu, Mihai; Thorup, Mikkel (2012). "The power of simple tabulation hashing". Journal of the ACM. 59 (3): Art. 14. arXiv:1011.5200. doi:10.1145/2220357.2220361. MR 2946218.
  6. Siegel, Alan (2004). "On universal classes of extremely random constant-time hash functions". SIAM Journal on Computing. 33 (3): 505–543. doi:10.1137/S0097539701386216. MR 2066640. S2CID 1742179.
  7. Thorup, M. (2013). "Simple tabulation, fast expanders, double tabulation, and high independence". Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2013). pp. 90–99. arXiv:1311.3121. doi:10.1109/FOCS.2013.18. ISBN 978-0-7695-5135-7. MR 3246210.
  8. Bradford, Phillip G.; Katehakis, Michael N. (2007). "A probabilistic study on combinatorial expanders and hashing". SIAM Journal on Computing. 37 (1): 83–111. doi:10.1137/S009753970444630X. MR 2306284.
  9. Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009). "Linear probing with constant independence". SIAM Journal on Computing. 39 (3): 1107–1120. arXiv:cs/0612055. doi:10.1137/070702278. MR 2538852.
  10. Pătraşcu, Mihai; Thorup, Mikkel (2010). "On the k-independence required by linear probing and minwise independence". Automata, Languages and Programming, 37th International Colloquium (ICALP 2010), Proceedings, Part I. Lecture Notes in Computer Science, vol. 6198. Springer. pp. 715–726. arXiv:1302.5127. doi:10.1007/978-3-642-14165-2_60. ISBN 978-3-642-14164-5.
  11. Cohen, Jeffrey S.; Kane, Daniel M. (2009). "Bounds on the independence required for cuckoo hashing". ACM Transactions on Algorithms.
  12. Pǎtraşcu, Mihai; Thorup, Mikkel (2012). "The power of simple tabulation hashing". Journal of the ACM. 59 (3): 1–50.
  13. Aumüller, Martin; Dietzfelbinger, Martin; Woelfel, Philipp (2014). "Explicit and efficient hash families suffice for cuckoo hashing with a stash". Algorithmica. 70 (3): 428–456.
  14. Kane, Daniel M.; Nelson, Jelani; Woodruff, David P. (2010). "An optimal algorithm for the distinct elements problem". Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2010).
  15. Indyk, Piotr (2001). "A small approximately min-wise independent family of hash functions". Journal of Algorithms. 38 (1): 84–90.
