Log-structured file system

A log-structured filesystem is a file system in which data and metadata are written sequentially to a circular buffer, called a log. The design was first proposed in 1988 by John K. Ousterhout and Fred Douglis and first implemented in 1992 by Ousterhout and Mendel Rosenblum for the Unix-like Sprite distributed operating system.^[1]

YouTube Encyclopedic

1/3
Views:
7 466
1 726
1 053

Transcription

That brings me to another background technology that I have to explain to you and which is called log structured file system. The idea here is that when I make a change to file y meaning I either append to the file or make some modifications to it. What I'm going to do is rather than writing the file as is, I'm going to write the change that I made to the file as a log record. So, I have a log record that says, what are the changes I made to this file x. Similarly, I have a log record of all the changes I made to this file y. And this is being done in a data structure which I'll call log segment. And I'll keep this log segment data structure in memory, of course, to make it fast in terms of the file system operation. So with this log segment data structure, what I can do is, buffer the changes to multiple files in one contiguous log segment data structure. So this log segment data structure, I can write it out as a file, and when I write it out, I'm not writing a single file, but I'm actually writing a log segment which contains all the changes made to multiple files. And because the log segment is contiguous, I can write it sequentially on the disk and sequential writes are good in the disk subsystem. And what we want to do is, we want to gather these changes to files that are happening in my system in the log segment in memory, and every once in a while, flush the log segment to disk, once the log segment fills up to a certain extent, or periodically. And the reason, of course, is the fact that if it is in memory, you have to worry about reliability of your file system, if, in fact, the node crashes. And therefore, what we want to do is, we want to either write out these log segments periodically or when a lot of file activity is happening and the log segment fills up very rapidly. After it passes of threshold, then you write it out to the desk. So in other words, we use a space metric or a time metric to figure out when to flush the changes from the log segment into the disk. And this solves the small write problem because if y happens to be a small file. No problem, because we are not writing y as-is on to the disk. But what we are writing is this log segment that contains changes that have been made to y in addition to changes that have been made to a number of other files. And therefore, this log segment is going to be a big file. And therefore, we can use the RAID technology to stripe the log segments across multiple disks. And give the benefit of the parallel IO that's possible with the RAID technology. So this log structured file system solves the small write problem. And in log structured file system, there are only logs. No data files. You'll never write any data files. All the things that you're writing are these append only logs to the disk. And when you have a read of a file, the read of a file, if it has to go to the disk and fetch that file, then the file system has to reconstruct the file from the logs that it has stored on the disk. Of course, once it comes into the memory of the server, then in the file cache the file is going to remain as a file. But, if at any point, the server has to fetch the file from the disk, it's actually fetching the log segments. And then reconstructing the file from the log segments. That's important. Which means that, in a log structured file system, there could be latency associated with reading a file for the first time from the disk. Of course, once it is read from the disk and reconstructed, it is in memory. In the file cache of the server, then everything is fine. But the first time, you have to read it from the disk, it's going to take some time because you have to read all these log segments and reconstruct it and that's where parallel RAID technology can be very helpful, because you're aggregating all the bandwidth that's available for reading the log segments from multiple disks at the same time. And the other thing that you have to worry about, when you have a log structured file system, is that these logs represent changes that have been made to the files. So, for instance, I may have written a particular block of y and that may be the change sitting here. Next time, what I'm doing is perhaps I'm writing the same block of the file. In which case, the first strike that I did, that is invalid. I have got a new write of that same block. So, you see that over time, the logs are going to have lots of holes created by overwriting the same block of a particular file. So in a log structured file system, one of the things that has to happen is that the logs have to be cleaned periodically to ensure that the disk is not cluttered with wasted logs that have empty holes in them because of old writes to parts of a file that are no longer relevant. Because those parts of the file have been rewritten, overwritten by subsequent writes to the same file. So logs, as I've introduced you, is similar to the disks that you've seen in the DSM system with the multiple writer protocol that we talked about in a previous lecture. You may have also heard the term, journalling file system, there is a difference between log structured file system, and journalling file system. Journalling file systems has both log files as well as data files, and what a journalling file system does, is it applies the log files to the data files and discards the log files. The goal is similar in a journaling file system, and the goal is to solve the small write problem, but in a journaling file system, the logs are there only for a short duration of time before the logs are committed to the data files themselves. Whereas in a log structured file system, you don't have data files at all, all that you have are log files and reads have to deconstruct the data from the log files.

Rationale

Conventional file systems tend to lay out files with great care for spatial locality and make in-place changes to their data structures in order to perform well on optical and magnetic disks, which tend to seek relatively slowly.

The design of log-structured file systems is based on the hypothesis that this will no longer be effective because ever-increasing memory sizes on modern computers would lead to I/O becoming write-heavy because reads would be almost always satisfied from memory cache. A log-structured file system thus treats its storage as a circular log and writes sequentially to the head of the log.

This has several important side effects:

Write throughput on optical and magnetic disks is improved because they can be batched into large sequential runs and costly seeks are kept to a minimum.
- The structure is naturally suited to media with append-only zones or pages such as flash storages and shingled magnetic recording HDDs^[2]^[3]
Writes create multiple, chronologically-advancing versions of both file data and meta-data. Some implementations make these old file versions nameable and accessible, a feature sometimes called time-travel or snapshotting. This is very similar to a versioning file system.
Recovery from crashes is simpler. Upon its next mount, the file system does not need to walk all its data structures to fix any inconsistencies, but can reconstruct its state from the last consistent point in the log.

Log-structured file systems, however, must reclaim free space from the tail of the log to prevent the file system from becoming full when the head of the log wraps around to meet it. The tail can release space and move forward by skipping over data for which newer versions exist farther ahead in the log. If there are no newer versions, then the data is moved and appended to the head.

To reduce the overhead incurred by this garbage collection, most implementations avoid purely circular logs and divide up their storage into segments. The head of the log simply advances into non-adjacent segments which are already free. If space is needed, the least-full segments are reclaimed first. This decreases the I/O load (and decreases the write amplification) of the garbage collector, but becomes increasingly ineffective as the file system fills up and nears capacity.

Disadvantages

The design rationale for log-structured file systems assumes that most reads will be optimized away by ever-enlarging memory caches. This assumption does not always hold:

On magnetic media—where seeks are relatively expensive—the log structure may actually make reads much slower, since it fragments files that conventional file systems normally keep contiguous with in-place writes.
On flash memory—where seek times are usually negligible—the log structure may not confer a worthwhile performance gain because write fragmentation has much less of an impact on write throughput. Another issue is stacking one log on top of another log, which is not a very good idea as it forces multiple erases with unaligned access.^[4] However many flash based devices cannot rewrite part of a block, and they must first perform a (slow) erase cycle of each block before being able to re-write, so by putting all the writes in one block, this can help performance as opposed to writes scattered into various blocks, each one of which must be copied into a buffer, erased, and written back, which is a clear advantage for so-called "raw" flash memory where flash translation layer is bypassed.^{[citation needed]}

References

^ John K. Ousterhout, Mendel Rosenblum. (1991), The Design and Implementation of a Log-Structured File System (PDF), University of California, Berkeley
^ Magic Pocket Hardware Engineering Teams. "Extending Magic Pocket Innovation with the first petabyte scale SMR drive deployment". dropbox.tech.
^ Reid, Colin; Bernstein, Phil (1 January 2010). "Implementing an Append-Only Interface for Semiconductor Storage" (PDF). IEEE Data Eng. Bull. 33: 14-20.
^ Swaminathan Sundararaman, Jingpei Yang (2014), Don't stack your Log on my Log (PDF), SanDisk Corporation

From Wikipedia, the free encyclopedia

YouTube Encyclopedic

Transcription

Rationale

Disadvantages

See also

References

Further reading