To install click the Add extension button. That's it.

The source code for the WIKI 2 extension is being checked by specialists of the Mozilla Foundation, Google, and Apple. You could also do it yourself at any point in time.

4,5
Kelly Slayton
Congratulations on this excellent venture… what a great idea!
Alexander Grigorievskiy
I use WIKI 2 every day and almost forgot how the original Wikipedia looks like.
Live Statistics
English Articles
Improved in 24 Hours
Added in 24 Hours
What we do. Every page goes through several hundred of perfecting techniques; in live mode. Quite the same Wikipedia. Just better.
.
Leo
Newton
Brights
Milds

Burrows–Wheeler transform

From Wikipedia, the free encyclopedia

The Burrows–Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation.

YouTube Encyclopedic

  • 1/5
    Views:
    16 133
    11 535
    377
    9 754
    493
  • Burrows-Wheeler Transform (Ep 4, Compressor Head) Google
  • Burrows Wheeler Transformation (BWT)
  • Fundamentals of Stringology IV: The Burrows-Wheeler Transform
  • 5. Library Complexity and Short Read Alignment (Mapping)
  • Burrows-Wheeler Transform (Java)

Transcription

COLT MCANLIS: Inspirtation's a funny thing. I mean, there you are idly going about your day, eating a banana, and boom, some amazing revelation hits you. How fitting is it that all of the technological advances we've had over the past 40 years have been because of these sort of crazy, amazing inspirational thoughts. Take Burrows-Wheeler Transform, for example. It's a very common compression algorithm for Linux and the web. But even the authors themselves will admit that they don't fully understand how they came up with the algorithm itself. Huh, that's the trick with inspiration, I suppose. You get greatness from nothingness. But it doesn't really help you if you want to implement the algorithm itself. But fear not, young programmer. I'm here to help. My name is Colt McAnlis, and this is "Compressor Head." [MUSIC PLAYING] One of funny things about information theory is that whole "theory" part. See, there's some interesting points about the theory that don't always work out right. Take entropy, for example, which generally measures the information content of a data set given in bits. But one of the issues with entropy is that it fails to take into account the order of the symbols. See, regardless of how I shuffle this data set of 0 through 9 here, it will always have an entropy of 4. But if you've been watching the rest of our shows, you know that order does, in fact, matter. Take delta coding, for example. If we have this variation of our data set, we can see that the delta-encoded version really doesn't provide us with any lower entropy. However, if we sort this data set, giving us the nice linear number range from 0 to 9, the delta-encoded version produces a much lower entropy than the source stream. Now, in an ideal world, we'd be able to just apply this type of sort to all of our data and end up with a perfectly compressible stream. That's not actually how it works. See, pure sorts are generally one directional. That is, once you sort it, you can't get it back in its original order without a whole ton of extra data to tell you where everything goes. So we can't purely sort it, but we can get close. See, the Burrows-Wheeler transform, or BWT, will shuffle around the data to cluster together groups of the same symbol as much as it possibly can. The benefit of this is that once the symbols have been clustered together, they effectively have an ordering, which can make it more compressible for other algorithms that can take advantage of it. Now, remember, this isn't a perfect sorting algorithm. It's more of a rough clustering algorithm. Well, technically, it's a lexographical permutation. You know what? Anyhow, the point is that BWT has one amazing characteristic. We can encode to and decode from BWT without having to add any significant additional data to our stream. Let's take a look. Go! MAGNUS HYTTSTEN: Whoa! COLT MCANLIS: We want him talking? The Transform part of Burrows-Wheeler Transform starts by creating a table of shifted permutations for your input stream. So let's take the word "banana" as an example and write that in the first row of our table. Now, on each row under that, let's do a rotational shift of that word to the right one character. That is, all of the letters in our string shift to the right, and the front-most character gets pushed onto the back. We do this for each symbol in our stream, rotating one character to the right, so we end up with a table whose number of rows equals the length of the stream. The next step is to sort the rows of this table lexographically, giving us a sorted table form. Now, this is where some of the dark magic happens. You can see that each row is effectively a permutation of the word "banana," but also, each column is a permutation as well. The first column, of course, is the sorted permutation of our word. But really, what we want is the last column, in general. This N-N-B-A-A-A has some interesting characteristics. Notice it's got better symbol clustering than any of the other columns, especially our input stream and, thus, is what we output as our encode. Now, before you run off, there's one other piece of data that we need to grab as well. Notice that in our sorted table, the input string actually sits at index 4. We'll need this row index during the decode phase of the Burrows-Wheeler Transform, because it will allow us to move from our more compressible permutation back to the source string. Now, you may ask yourself, why do we choose this last column? I mean, the one right next to it seems just as good, seems to have very close clustering. Why not that one? Well, this is more BWT magic. See, this last column has an interesting feature. If we have only this, N-N-B-A-A-A, we can recover the rest of our sorted table entirely. The other columns don't possess the same characteristic, which is highly important when we're trying to invert our transform. Banana. MAGNUS HYTTSTEN: That's great! COLT MCANLIS: All right. The remarkable thing about Burrows-Wheeler Transform is not that it generates a more compressible output-- any ordinary sort would do that-- but that this particular transform is reversible with minimal data overhead. And to help us demonstrate that is my good friend Magnus, who, turns out, is training for the World Sorting Championship. MAGNUS HYTTSTEN: Yeah! Sorting is my thing. Yeah. [LAUGHS] COLT MCANLIS: Yeah. This guy right here, he's going to take it all. All right. So at the start of our decode step, we're given the string N-N-B-A-A-A and the row index of 4. If you recall, N-N-B-A-A-A represents the last column of our table. So Magnus, can you put an N-N-B-A-A-A up on the board for us? MAGNUS HYTTSTEN: Like that? COLT MCANLIS: Just like that. MAGNUS HYTTSTEN: Great! COLT MCANLIS: DO it. Now, when he's doing this, I want you to watch his technique. Watch the way his hand flows through the process. MAGNUS HYTTSTEN: Yeah! Finished. COLT MCANLIS: Fantastic. Yeah. Young kids, you're afraid of this guy. Now, once we have this data in our table, the next step is to sort it, which is where Magnus comes in for us. All right, Magus, you ready? MAGNUS HYTTSTEN: Absolutely. COLT MCANLIS: We're going to do a sort. Nice warm up. Don't pull anything. MAGNUS HYTTSTEN: OK. COLT MCANLIS: All right. When you're ready, go. Excellent, good job. Great form, nice. Remember, clench those abs. Yeah, good, stay with it. Fantastic. MAGNUS HYTTSTEN: All right! Ha ha. COLT MCANLIS: This kid's good. All right. Now that we've got a sorted table, the next step is to prepend the input string again. So Magnus, another N-N-B-A-A-A, please. MAGNUS HYTTSTEN: Over there? COLT MCANLIS: Right there. MAGNUS HYTTSTEN: Great, man. COLT MCANLIS: This guy kind gets it, right? Good. Nice. Nice. MAGNUS HYTTSTEN: Ah! COLT MCANLIS: Now-- all right. Now that it's there, we've got to do another sort. MAGNUS HYTTSTEN: A sort? COLT MCANLIS: Another sort. MAGNUS HYTTSTEN: OK. COLT MCANLIS: All right. Now, I'm going to time this one. MAGNUS HYTTSTEN: OK. COLT MCANLIS: This is going to be a good warm up round? MAGNUS HYTTSTEN: Yeah. COLT MCANLIS: All right. Ready? When you're ready, three, two, one, go. MAGNUS HYTTSTEN: OK. Let's go. COLT MCANLIS: Now, remember, you've got to tighten your abs during the sort. You're going to be judged on speed and accuracy here. MAGNUS HYTTSTEN: Boom! COLT MCANLIS: Fantastic. Look at that. 11 seconds. MAGNUS HYTTSTEN: Yeah! COLT MCANLIS: That's a personal best for this guy. MAGNUS HYTTSTEN: Wow! COLT MCANLIS: You're already sorting faster than Michael Phelps. MAGNUS HYTTSTEN: That's great to know. COLT MCANLIS: I agree. So this is how the BWT decode works. We append our input string across the rows of our input table, then sort, and then continue on until we've recovered our original matrix. So Magnus, ready for the big one? MAGNUS HYTTSTEN: Yeah. Yeah, yeah! COLT MCANLIS: All right. So here's what we're going to do. OK? We're going to do this four more times-- MAGNUS HYTTSTEN: Four? COLT MCANLIS: --so we've got six letters in each row. MAGNUS HYTTSTEN: Wow! COLT MCANLIS: You think you can do it? MAGNUS HYTTSTEN: Yeah! COLT MCANLIS: Are you ready to do it? MAGNUS HYTTSTEN: I'm ready, ready, ready! COLT MCANLIS: This is for the World Championship! MAGNUS HYTTSTEN: Yeah, man! Let's do it! COLT MCANLIS: Ready? MAGNUS HYTTSTEN: Yeah! COLT MCANLIS: Set, go! All right. Good. Great. Good job. Good job. N-N-B-A-A-A. We're going to get you so many endorsements. You're almost there, Magnus. Finish strong! Finish strong! Ah! Oh-oh-oh. Man. Great job, kid. All right, all right, all right. Calm down. Why don't you hit the showers, and I'll walk the people through the rest of the algorithm, all right? Good job today. Good job. I don't think he's got a shot at that one. Anyhow, thanks to Magnus, we've got a recreated permutation table alongside a nifty row index that we output during the encoding phase. Since this matrix over there is identical to the post-sorted one from the encoder, the fourth row actually contains the original input string, "BANANA." See? Pretty easy stuff. Oh, and to be clear, kids, training to be a world champion at sorting requires a lot of practice and dedication. So eat your vegetables and listen to your parents. MALE SPEAKER: Take one. COLT MCANLIS: How do you make your hair do that? MALE SPEAKER: Lots and lots of gel. COLT MCANLIS: Really? I'm gel-ous, and not just because it's a pun. Nice. It'll take him a second. In 1994, Mike Burrows and David Wheeler were set up to be the title fight at the second annual UFC championship in Las Vegas, Nevada. But before the fight, while playing a few rounds of pai gow, the two ended up creating the Burrows-Wheeler Transform algorithm. MIKE BURROWS: That isn't really true. COLT MCANLIS: No, no. No, it is not. Um-- ha. You know, one of the problems about working at Google is you can't really throw a stone around here without hitting the inventor of some amazing algorithm. Ladies and gentlemen, I want to introduce you to the co-inventor of the Burrows-Wheeler Transform, Mr. Mike Burrows. Mike, it's fantastic to see you. Maybe if you have a second, can you sit down and talk to us a little bit about your algorithm? MIKE BURROWS: Certainly. I'd like to put you straight about that title fight. COLT MCANLIS: OK. So I guess at the beginning, let's start there. MIKE BURROWS: Well, it really started with David Wheeler. He was on the faculty of the University of Cambridge, but on sabbatical at Bell Labs, and working on compression. And he came up with the algorithm there. I only learned about it some years later when I became his graduate student. COLT MCANLIS: In the world of data compression, Burrows-Wheeler Transform just stands alone against everything else. There's nothing really like it. How did you all come up with it? MIKE BURROWS: Well, I asked David what was going through his mind when he came up with it originally. And he didn't know. He could not explain to me how he came up with that particular algorithm. I got the impression that he was playing around with sorting the contexts of characters that were to be encoded and using that as a good predictor. And then he must have realized that it was possible to invert that sort under certain conditions. COLT MCANLIS: Did you and Mike kind of co-create the article together? Was it more of his idea, and he handed it off to you? How did that relationship work? MIKE BURROWS: What happened was that he never bothered writing it up. I learned about it from him as a graduate student. And the years ticked by after that. And he never wrote it up. At the time when I was a student, I thought that it was just one of those things that grad students learn. And then eventually, I realized, no, this was really special, and the world ought to know about it. When he did it, he did not think of it as a practical algorithm. He thought of it as an algorithm to use to calibrate other algorithms. It wasn't practical for him, because he didn't have a really fast implementation. And computers at the time were significantly slower. And so it didn't really go fast enough. So I decided that what I should do is write it up myself. And in order to do that, I had to do something. And so I worked with him to make it go fast. And we came up with ways to implement it efficiently. COLT MCANLIS: So I've got to ask you, in the 20 years since this algorithm's invention, what has been the most amazing thing you've seen it applied for? I mean, it's been everywhere, from the Linux operating system to a protocol on the internet. What's impressed you so much about its use so far? MIKE BURROWS: Well, I suppose the most unexpected thing has been its use in DNA sequencing. It puts together the fragments of DNA that have been sequenced independently very efficiently into a combined whole. But there have been many other advances around it, particularly in ways to do the sorting efficiently, also variance of the transform, one that sorts only limited-length contexts. It turns out that Michael Schindler of the Technical University of Vienna discovered that if you sort the context up to any finite length k, it's still invertible as a transform. Unfortunately, the inversion takes twice as long, so it's not popular. But it still works. David Scott and Yossi [? Azah ?] managed to find a bijective variant of the transform, where the size of the transform string is exactly the same as the size of the input string. COLT MCANLIS: So you don't need the row index at the end? MIKE BURROWS: Yes, that's correct. It's a bit more involved and a bit slower to do the compression, but it still works. And that was another surprise for me. COLT MCANLIS: Give me a little bit of background here. Where did you actually publish the paper originally for this? MIKE BURROWS: There's a funny story about that. We first of all sent the paper to the Annual Data Compression Conference. But they rejected it. And there were no comments about why they rejected it. So I wrote to them and asked why they had rejected it. And they told me that it was their policy not to explain why they rejected papers. So we just published it as a technical report. The next year, people at the same conference actually asked me to submit the paper again so that they could publish it. And I said, no. And I wasn't really going to explain why, because it was my policy not to explain that sort of thing. So the algorithm actually became more well known when someone else who saw the technical report published a popular article about it in "Dr. Dobb's Journal." So that was the first real publication where people got to know about it. COLT MCANLIS: Wow. So just completely bypassed the academic route, and go for mainstream media. That's how we get our algorithms out nowadays, right? MIKE BURROWS: That's the way it worked. COLT MCANIS: Listen, Mike, thank you so much for coming on and talking to us a bit about this algorithm. We're really honored to have you here. MIKE BURROWS: It's a pleasure to be here. I felt, given the state of the other "Compressor Head" episodes, that I ought to come along personally and fix things. COLT MCANLIS: Huh. OK. Well, thank you for that. You sure did. MIKE BURROWS: Thank you. COLT MCANLIS: OK. Oh, it's on me. Oh. I was just texting my boss. So it's apparent that BWT doesn't actually compress the data, it just transforms it. To practically use BWT, we need some way to apply another transform that's going to yield a stream which has lower entropy and, thus, is more compressible. Good old delta compression doesn't really help us that much. Encoding our post-BWT stream of N-N-B-A-A-A yields us the version of the string, this whole thing, which has an entropy of 1.77, while the source entropy itself is only 1.45. We haven't actually improved anything. No, for this type of data, we need a different type of transform which is better suited for the type of symbol clustering that BWT produces. We call this move-to-front encoding. Effectively, start by creating a buffer that contains all unique symbols that your stream could have and list them in a sorted order. For our purposes, we'll just use the English alphabet. The move-to-front algorithm is pretty simple. For each symbol we read, we write its output buffer position and then move it to the front of our buffer. For example, if we input the letter N, that sits at position 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14. So we write that to our output stream. The next step is how the algorithm gets its name. Once we've written its out position, we then move that symbol to the front of the buffer. Time to read in the next symbol, which is also N. Since that's already at the front of our stream, we write a 1 to the output stream. This is the trick of the transform. The idea is that since BWT clusters symbols together, there's a high probability that subsequent copies of that symbol will appear in the stream after we've encountered the first one. So we end up outputting more 1's as a result. Let's take a look at the rest of the coding. Let's read B onto the stream. And as such, we output the index 3, because that's where it sits in the buffer. And move it to the front. We next read A and output the index of 3 again, since here it sits once more. Just fix that. Now, the next two A's that we read from the stream won't adjust any part of the buffer since A already sits in the first position. As such, we simply admit 1's to the output stream. Now, after encoding, the output stream here is 14, 1, 3, 3 1, 1, whose entropy is 1.45, which is identical to our source input stream. Eh, pretty good. From here, you can simply throw the output of move-to-front to any statistical compressor, like Huffman or arithmetic compression. In fact, bzip2, the popular Linux compression application, couples Burrows-Wheeler Transform plus move-to-front plus arithmetic compression for its exact algorithm. You're looking at the inner workings of science here, folks. [GIBBERISH] MIKE BURROWS: That's an interesting thing to say. COLT MCANLIS: Like, that was just really cool to hear. MIKE BURROWS: Yeah. Anyway-- COLT MCANLIS: All right, all right. Yeah, sorry. We could do this all day. What's amazing about Burrows-Wheeler Transform is that it stands alone as a compression algorithm. LZ, Huffman, arithmetic-- they all have tons of variance and have been worked on by tons of different people. In the last 20 years, Burrows-Wheeler Transform has gotten much less attention. But that doesn't mean it's any less useful. In recent competitions to compress human DNA sequences, BWT-based compressors were all in the top 10 finalists. And as far as the compression world goes, there's still plenty left to improve upon here. But that's a topic for a different show. My name is Colt McAnlis. Thanks for watching. [MUSIC PLAYING]

Contents

Description

The Burrows–Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto, California.[1] It is based on a previously unpublished transformation discovered by Wheeler in 1983.

When a character string is transformed by the BWT, the transformation permutes the order of the characters. If the original string had several substrings that occurred often, then the transformed string will have several places where a single character is repeated multiple times in a row.

For example:

Input SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT[2]

The output is easier to compress because it has many repeated characters. In this example the transformed string, there are a total of eight runs of identical characters: XX, II, XX, SS, PP, .., II, and III, which together make 17 out of the 44 characters.

Example

The transform is done by sorting all rotations of the text into lexicographic order, by which we mean that the 8 rotations (as the input is 8 characters long) appear in the second column in a different order, in that the 8 rows have been sorted into lexicographical order. We then take as output the last column and the number k = 7 of the row that the non rotated row ends up in. For example, the text "^BANANA|" is transformed into "BNN^AA|A" through these steps (the red | character indicates the 'EOF' pointer):

Transformation
Input All
rotations
Sorted into
lexical order
Taking
last column
Output
last column
^BANANA|
^BANANA|
|^BANANA
A|^BANAN
NA|^BANA
ANA|^BAN
NANA|^BA
ANANA|^B
BANANA|^
ANANA|^B
ANA|^BAN
A|^BANAN
BANANA|^
NANA|^BA
NA|^BANA
^BANANA|
|^BANANA
ANANA|^B
ANA|^BAN
A|^BANAN
BANANA|^
NANA|^BA
NA|^BANA
^BANANA|
|^BANANA
BNN^AA|A

The following pseudocode gives a simple (though inefficient) way to calculate the BWT and its inverse. It assumes that the input string s contains a special character 'EOF' which is the last character and occurs nowhere else in the text.

function BWT (string s)
   create a table, rows are all possible rotations of s
   sort rows alphabetically
   return (last column of the table)
function inverseBWT (string s)
   create empty table
   repeat length(s) times
       // first insert creates first column
       insert s as a column of table before first column of the table
       sort rows of the table alphabetically
   return (row that ends with the 'EOF' character)

Explanation

To understand why this creates more-easily-compressible data, consider transforming a long English text frequently containing the word "the". Sorting the rotations of this text will group rotations starting with "he " together, and the last character of that rotation (which is also the character before the "he ") will usually be "t", so the result of the transform would contain a number of "t" characters along with the perhaps less-common exceptions (such as if it contains "Brahe ") mixed in. So it can be seen that the success of this transform depends upon one value having a high probability of occurring before a sequence, so that in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).

The remarkable thing about the BWT is not that it generates a more easily encoded output—an ordinary sort would do that—but that it is reversible, allowing the original document to be re-generated from the last column data.

The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters alphabetically to get the first column. Then, the first and last columns (of each row) together give you all pairs of successive characters in the document, where pairs are taken cyclically so that the last and first character form a pair. Sorting the list of pairs gives the first and second columns. Continuing in this manner, you can reconstruct the entire list. Then, the row with the "end of file" character at the end is the original text. Reversing the example above is done like this:

Inverse transformation
Input
BNN^AA|A
Add 1 Sort 1 Add 2 Sort 2
B
N
N
^
A
A
|
A
A
A
A
B
N
N
^
|
BA
NA
NA
^B
AN
AN
|^
A|
AN
AN
A|
BA
NA
NA
^B
|^
Add 3 Sort 3 Add 4 Sort 4
BAN
NAN
NA|
^BA
ANA
ANA
|^B
A|^
ANA
ANA
A|^
BAN
NAN
NA|
^BA
|^B
BANA
NANA
NA|^
^BAN
ANAN
ANA|
|^BA
A|^B
ANAN
ANA|
A|^B
BANA
NANA
NA|^
^BAN
|^BA
Add 5 Sort 5 Add 6 Sort 6
BANAN
NANA|
NA|^B
^BANA
ANANA
ANA|^
|^BAN
A|^BA
ANANA
ANA|^
A|^BA
BANAN
NANA|
NA|^B
^BANA
|^BAN
BANANA
NANA|^
NA|^BA
^BANAN
ANANA|
ANA|^B
|^BANA
A|^BAN
ANANA|
ANA|^B
A|^BAN
BANANA
NANA|^
NA|^BA
^BANAN
|^BANA
Add 7 Sort 7 Add 8 Sort 8
BANANA|
NANA|^B
NA|^BAN
^BANANA
ANANA|^
ANA|^BA
|^BANAN
A|^BANA
ANANA|^
ANA|^BA
A|^BANA
BANANA|
NANA|^B
NA|^BAN
^BANANA
|^BANAN
BANANA|^
NANA|^BA
NA|^BANA
^BANANA|
ANANA|^B
ANA|^BAN
|^BANANA
A|^BANAN
ANANA|^B
ANA|^BAN
A|^BANAN
BANANA|^
NANA|^BA
NA|^BANA
^BANANA|
|^BANANA
Output
^BANANA|

Optimization

A number of optimizations can make these algorithms run more efficiently without changing the output. There is no need to represent the table in either the encoder or decoder. In the encoder, each row of the table can be represented by a single pointer into the strings, and the sort performed using the indices. Some care must be taken to ensure that the sort does not exhibit bad worst-case behavior: Standard library sort functions are unlikely to be appropriate. In the decoder, there is also no need to store the table, and in fact no sort is needed at all. In time proportional to the alphabet size and string length, the decoded string may be generated one character at a time from right to left. A "character" in the algorithm can be a byte, or a bit, or any other convenient size.

One may also make the observation that mathematically, the encoded string can be computed as a simple modification of the suffix array, and suffix arrays can be computed with linear time and memory. The BWT can be defined with regards to the suffix array SA of text T as (1-based indexing):

[3]

There is no need to have an actual 'EOF' character. Instead, a pointer can be used that remembers where in a string the 'EOF' would be if it existed. In this approach, the output of the BWT must include both the transformed string, and the final value of the pointer. That means the BWT does expand its input slightly. The inverse transform then shrinks it back down to the original size: it is given a string and a pointer, and returns just a string.

A complete description of the algorithms can be found in Burrows and Wheeler's paper[4][citation needed], or in a number of online sources.

Bijective variant

When a bijective variant of the Burrows–Wheeler transform is performed on "^BANANA", you get ANNBAA^ without the need for a special character for the end of the string. A special character forces one to increase character space by one, or to have a separate field with a numerical value for an offset. Either of these features makes data compression more difficult. When dealing with short files, the savings are great percentage-wise.

The bijective transform is done by sorting all rotations of the Lyndon words. In comparing two strings of unequal length, one can compare the infinite periodic repetitions of each of these in lexicographic order and take the last column of the base-rotated Lyndon word. For example, the text "^BANANA|" is transformed into "ANNBAA^|" through these steps (the red | character indicates the EOF pointer) in the original string. The EOF character is unneeded in the bijective transform, so it is dropped during the transform and re-added to its proper place in the file.

The string is broken into Lyndon words so the words in the sequence are decreasing using the comparison method above. "^BANANA" becomes (^) (B) (AN) (AN) (A), but Lyndon words are combined into (^) (B) (ANAN) (A).

Bijective transformation
Input All
rotations
Sorted alphabetically
by first letter
Last column
of rotated Lyndon word
Output
^BANANA|
^^^^^^^^ (^)
BBBBBBBB (B)
ANANANAN... (ANAN)
NANANANA... (NANA)
ANANANAN... (ANAN)
NANANANA... (NANA)
AAAAAAAA... (A)
AAAAAAAA... (A)
ANANANAN... (ANAN)
ANANANAN... (ANAN)
BBBBBBBB... (B)
NANANANA... (NANA)
NANANANA... (NANA)
^^^^^^^^... (^)
AAAAAAAA... (A)
ANANANAN... (ANAN)
ANANANAN... (ANAN)
BBBBBBBB... (B)
NANANANA... (NANA)
NANANANA... (NANA)
^^^^^^^^... (^)
ANNBAA^|
Inverse bijective transform
Input
ANNBAA^
Add 1 Sort 1 Add 2 Sort 2
A
N
N
B
A
A
^
A
A
A
B
N
N
^
AA
NA
NA
BB
AN
AN
^^
AA
AN
AN
BB
NA
NA
^^
Add 3 Sort 3 Add 4 Sort 4
AAA
NAN
NAN
BBB
ANA
ANA
^^^
AAA
ANA
ANA
BBB
NAN
NAN
^^^
AAAA
NANA
NANA
BBBB
ANAN
ANAN
^^^^
AAAA
ANAN
ANAN
BBBB
NANA
NANA
^^^^
Output
^BANANA

The above may be viewed as four cycles:

^ = (^)(^)... = ^^^^^...
B = (B)(B)... = BBBB...
ANAN = (ANAN)(ANAN)... = ANANANAN...
A = (A)(A).. = AAAAA..

or 5 cycles WHERE ANAN broken into 2:

AN = (AN) (AN) ... = ANANANAN
AN = (AN) (AN) ... = ANANANAN

If a cycle is N character it will be repeated N times:

(^)
(B)
(ANAN)
(A)

or

(^)
(B)
(AN)
(AN)
(A)

to get the

^BANANA

Since any rotation of the input string will lead to the same transformed string, the BWT cannot be inverted without adding an EOF marker to the input or, augmenting the output with information such as an index, making it possible to identify the input string from all its rotations.

There is a bijective version of the transform, by which the transformed string uniquely identifies the original. In this version, every string has a unique inverse of the same length.[5][6]

The fastest versions are linear in time and space.

The bijective transform is computed by factoring the input into a non-increasing sequence of Lyndon words; such a factorization exists in the Chen–Fox–Lyndon theorem,[7] and may be found in linear time.[8] The algorithm sorts the rotations of all the words; as in the Burrows–Wheeler transform, this produces a sorted sequence of n strings. The transformed string is then obtained by picking the final character of each string in this sorted list.

For example, applying the bijective transform gives:

Input SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Lyndon words SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output STEYDST.E.IXXIIXXSMPPXS.B..EE..SUSFXDIOIIIIT

The bijective transform includes eight runs of identical characters. These runs are, in order: XX, II, XX, PP, .., EE, .., and IIII.

In total, 18 characters are used in these runs.

Dynamic Burrows–Wheeler transform

When a text is edited, its Burrows–Wheeler transform will change. Salson et al.[9] propose an algorithm that deduces the Burrows–Wheeler transform of an edited text from that of the original text, doing a limited number of local reorderings in the original Burrows–Wheeler transform, which can be faster than constructing the Burrows–Wheeler transform of the edited text directly.

Sample implementation

This Python implementation sacrifices speed for simplicity: the program is short, but takes more than the linear time that would be desired in a practical implementation.

Using the STX/ETX control codes to mark the start and end of the text, and using s[i:] + s[:i] to construct the ith rotation of s, the forward transform takes the last character of each of the sorted rows:

def bwt(s):
    """Apply Burrows-Wheeler transform to input string."""
    assert "\002" not in s and "\003" not in s, "Input string cannot contain STX and ETX characters"
    s = "\002" + s + "\003"  # Add start and end of text marker
    table = sorted(s[i:] + s[:i] for i in range(len(s)))  # Table of rotations of string
    last_column = [row[-1:] for row in table]  # Last characters of each row
    return "".join(last_column)  # Convert list of characters into string

The inverse transform repeatedly inserts r as the left column of the table and sorts the table. After the whole table is built, it returns the row that ends with ETX, minus the STX and ETX.

def ibwt(r):
    """Apply inverse Burrows-Wheeler transform."""
    table = [""] * len(r)  # Make empty table
    for i in range(len(r)):
        table = sorted(r[i] + table[i] for i in range(len(r)))  # Add a column of r
    s = [row for row in table if row.endswith("\003")][0]  # Find the correct row (ending in ETX)
    return s.rstrip("\003").strip("\002")  # Get rid of start and end markers

Here is another, more efficient method for the inverse transform. Although more complex, it increases the speed greatly when decoding lengthy strings.

def ibwt(r, *args):
    """Inverse Burrows-Wheeler transform. args is the original index \
if it was not indicated by an ETX character."""

    firstCol = "".join(sorted(r))
    count = [0]*256
    byteStart = [-1]*256
    output = [""] * len(r)
    shortcut = [None]*len(r)
    #Generates shortcut lists
    for i in range(len(r)):
        shortcutIndex = ord(r[i])
        shortcut[i] = count[shortcutIndex]
        count[shortcutIndex] += 1
        shortcutIndex = ord(firstCol[i])
        if byteStart[shortcutIndex] == -1:
            byteStart[shortcutIndex] = i

    localIndex = (r.index("\003") if not args else args[0])
    for i in range(len(r)):
        #takes the next index indicated by the transformation vector
        nextByte = r[localIndex]
        output [len(r)-i-1] = nextByte
        shortcutIndex = ord(nextByte)
        #assigns localIndex to the next index in the transformation vector
        localIndex = byteStart[shortcutIndex] + shortcut[localIndex]
    return "".join(output).rstrip("\003").strip("\002")

Here is a third one, more efficient and very simple method. It increases the speed greatly when decoding lengthy strings. Although it needs an origin index list generated by bwt.

def bwt(s):
    """Apply Burrows-Wheeler transform to input string. Not indicated by a unique byte but use index list"""
    # Table of rotations of string
    table = [s[i:] + s[:i] for i in range(len(s))]
    # Sorted table
    table_sorted = table[:]
    table_sorted.sort()
    # Get index list of ((every string in sorted table)'s next string in unsorted table)'s index in sorted table
    indexlist = []
    for t in table_sorted:
        index1 = table.index(t)
        index1 = index1+1 if index1 < len(s)-1 else 0
        index2 = table_sorted.index(table[index1])
        indexlist.append(index2)
    # Join last characters of each row into string
    r = ''.join([row[-1] for row in table_sorted])
    return r, indexlist

def ibwt(r,indexlist):
    """Inverse Burrows-Wheeler transform. Not indicated by a unique byte but use index list"""
    s = ''
    x = indexlist[0]
    for _ in r:
        s = s + r[x]
        x = indexlist[x]
    return s

BWT in bioinformatics

The advent of next-generation sequencing (NGS) techniques at the end of the 2000s decade has led to another application of the Burrows–Wheeler transformation. In NGS, DNA is fragmented into small pieces, of which the first few bases are sequenced, yielding several millions of "reads", each 30 to 500 base pairs ("DNA characters") long. In many experiments, e.g., in ChIP-Seq, the task is now to align these reads to a reference genome, i.e., to the known, nearly complete sequence of the organism in question (which may be up to several billion base pairs long). A number of alignment programs, specialized for this task, were published, which initially relied on hashing (e.g., Eland, SOAP,[10] or Maq[11]). In an effort to reduce the memory requirement for sequence alignment, several alignment programs were developed (Bowtie,[12] BWA,[13] and SOAP2[14]) that use the Burrows–Wheeler transform.

References

  1. ^ Burrows, Michael; Wheeler, David J. (1994), A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation 
  2. ^ https://github.com/adrien-mogenet/scala-bwt/blob/master/src/main/scala/me/algos/bwt/BurrowsWheelerCodec.scala
  3. ^ Simpson, Jared T.; Durbin, Richard (2010-06-15). "Efficient construction of an assembly string graph using the FM-index". Bioinformatics. 26 (12): i367–i373. doi:10.1093/bioinformatics/btq217. ISSN 1367-4803. PMC 2881401Freely accessible. PMID 20529929. 
  4. ^ Kutylowski, Miroslaw; Pacholski, Leszek (1999-08-18). Mathematical Foundations of Computer Science 1999: 24th International Symposium, MFCS'99 Szklarska Poreba, Poland, September 6-10, 1999 Proceedings. Springer Science & Business Media. ISBN 9783540664086. 
  5. ^ Gil, J.; Scott, D. A. (2009), A bijective string sorting transform (PDF) 
  6. ^ Kufleitner, Manfred (2009), "On bijective variants of the Burrows-Wheeler transform", in Holub, Jan; Žďárek, Jan, Prague Stringology Conference, pp. 65–69, arXiv:0908.0239Freely accessible .
  7. ^ *Lothaire, M. (1997), Combinatorics on words, Encyclopedia of Mathematics and Its Applications, 17, Perrin, D.; Reutenauer, C.; Berstel, J.; Pin, J. E.; Pirillo, G.; Foata, D.; Sakarovitch, J.; Simon, I.; Schützenberger, M. P.; Choffrut, C.; Cori, R.; Lyndon, Roger; Rota, Gian-Carlo. Foreword by Roger Lyndon (2nd ed.), Cambridge University Press, p. 67, ISBN 0-521-59924-5, Zbl 0874.20040 
  8. ^ Duval, Jean-Pierre (1983), "Factorizing words over an ordered alphabet", Journal of Algorithms, 4 (4): 363–381, doi:10.1016/0196-6774(83)90017-2, ISSN 0196-6774, Zbl 0532.68061 .
  9. ^ Salson M, Lecroq T, Léonard M, Mouchard L (2009). "A Four-Stage Algorithm for Updating a Burrows–Wheeler Transform". Theoretical Computer Science. 410 (43): 4350–4359. doi:10.1016/j.tcs.2009.07.016. 
  10. ^ Li R; et al. (2008). "SOAP: short oligonucleotide alignment program". Bioinformatics. 24 (5): 713–714. doi:10.1093/bioinformatics/btn025. PMID 18227114. 
  11. ^ Li H, Ruan J, Durbin R (2008-08-19). "Mapping short DNA sequencing reads and calling variants using mapping quality scores". Genome Research. 18 (11): 1851–1858. doi:10.1101/gr.078212.108. PMC 2577856Freely accessible. PMID 18714091. 
  12. ^ Langmead B, Trapnell C, Pop M, Salzberg SL (2009). "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome". Genome Biology. 10 (3): R25. doi:10.1186/gb-2009-10-3-r25. PMC 2690996Freely accessible. PMID 19261174. 
  13. ^ Li H, Durbin R (2009). "Fast and accurate short read alignment with Burrows–Wheeler Transform". Bioinformatics. 25 (14): 1754–1760. doi:10.1093/bioinformatics/btp324. PMC 2705234Freely accessible. PMID 19451168. 
  14. ^ Li R; et al. (2009). "SOAP2: an improved ultrafast tool for short read alignment". Bioinformatics. 25 (15): 1966–1967. doi:10.1093/bioinformatics/btp336. PMID 19497933. 

External links

This page was last edited on 26 January 2018, at 11:10.
Basis of this page is in Wikipedia. Text is available under the CC BY-SA 3.0 Unported License. Non-text media are available under their specified licenses. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc. WIKI 2 is an independent company and has no affiliation with Wikimedia Foundation.