* fallocate support for bitmap-based files
@ 2007-06-29 20:01 Andrew Morton
From: Andrew Morton @ 2007-06-29 20:01 UTC
To: Theodore Ts'o, Andreas Dilger
Cc: Mike Waychison, Sreenivasa Busam, linux-ext4@vger.kernel.org
Guys, Mike and Sreenivasa at Google are looking into implementing
fallocate() on ext2. Of course, any such implementation could and should
also be portable to ext3 and ext4 bitmapped files.
I believe that Sreenivasa will mainly be doing the implementation work.
The basic plan is as follows:
- Create (with tune2fs and mke2fs) a hidden file using one of the
reserved inode numbers. That file will be sized to have one bit for each
block in the partition. Let's call this the "unwritten block file".
The unwritten block file will be initialised with all-zeroes.
- at fallocate()-time, allocate the blocks to the user's file (in some
yet-to-be-determined fashion) and, for each one which is uninitialised,
set its bit in the unwritten block file. The set bit means "this block
is uninitialised and needs to be zeroed out on read".
- truncate() would need to clear out set-bits in the unwritten blocks file.
- When the fs comes to read a block from disk, it will need to consult
the unwritten blocks file to see if that block should be zeroed by the
CPU.
- When the unwritten-block is written to, its bit in the unwritten blocks
file gets zeroed.
- An obvious efficiency concern: if a user file has no unwritten blocks
in it, we don't need to consult the unwritten blocks file.
Need to work out how to do this. An obvious solution would be to have
a number-of-unwritten-blocks counter in the inode. But do we have space
for that?
(I expect Google and others would prefer that the on-disk format be
compatible with legacy ext2!)
- One concern is the following scenario:
- Mount fs with "new" kernel, fallocate() some blocks to a file.
- Now, mount the fs under "old" kernel (which doesn't understand the
unwritten blocks file).
- This kernel will be able to read uninitialised data from that
fallocated-to file, which is a security concern.
- Now, the "old" kernel writes some data to a fallocated block. But
this kernel doesn't know that it needs to clear that block's flag in
the unwritten blocks file!
- Now mount that fs under the "new" kernel and try to read that file.
The flag for the block is set, so this kernel will still zero out the
data on a read, thus corrupting the user's data.
So how to fix this? Perhaps with a per-inode flag indicating "this
inode has unwritten blocks". But to fix this problem, we'd require that
the "old" kernel clear out that flag.
Can anyone propose a solution to this?
Ah, I can! Use the compatibility flags in such a way as to prevent the
"old" kernel from mounting this filesystem at all. To mount this fs
under an "old" kernel the user will need to run some tool which will
- read the unwritten blocks file
- for each set-bit in the unwritten blocks file, zero out the
corresponding block
- zero out the unwritten blocks file
- rewrite the superblock to indicate that this fs may now be mounted
by an "old" kernel.
Sound sane?
- I'm assuming that there are more reserved inodes available, and that
the changes to tune2fs and mke2fs will be basically a copy-n-paste job
from the `tune2fs -j' code. Correct?
- I haven't thought about what fsck changes would be needed.
Presumably quite a few. For example, fsck should check that set-bits
in the unwritten blocks file do not correspond to freed blocks. If they
do, that should be fixed up.
And fsck can check each inode's number-of-unwritten-blocks counter
against the unwritten blocks file (if we implement the per-inode
number-of-unwritten-blocks counter).
What else should fsck do?
- I haven't thought about the implications of porting this into ext3/4.
Probably the commit to the unwritten blocks file will need to be atomic
with the commit to the user's file's metadata, so the unwritten-blocks
file will effectively need to be in journalled-data mode.
Or, more likely, we access the unwritten blocks file via the blockdev
pagecache (ie: use bmap, like the journal file) and then we're just
talking direct to the disk's blocks and it becomes just more fs metadata.
- I guess resize2fs will need to be taught about the unwritten blocks
file: to shrink and grow it appropriately.
That's all I can think of for now - I probably missed something.
Suggestions and thoughts are sought, please.
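As a rough illustration of the plan above, here is a minimal user-space sketch of the bitmap semantics: one bit per physical block, set at fallocate() time, cleared on write, and consulted on read. The class and function names are hypothetical, and the "disk" is just a Python dict; nothing here reflects actual ext2 code.

```python
class UnwrittenBlockMap:
    """One bit per physical block; a set bit means 'allocated, never written'."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        # The file is sized at one bit per block and starts out all-zeroes.
        self.bits = bytearray((total_blocks + 7) // 8)

    def _check(self, block):
        if not 0 <= block < self.total_blocks:
            raise IndexError(block)

    def mark_unwritten(self, block):
        """fallocate() path: the block is allocated but holds stale data."""
        self._check(block)
        self.bits[block >> 3] |= 1 << (block & 7)

    def clear(self, block):
        """Write or truncate path: the block no longer needs zeroing on read."""
        self._check(block)
        self.bits[block >> 3] &= ~(1 << (block & 7)) & 0xFF

    def is_unwritten(self, block):
        self._check(block)
        return bool(self.bits[block >> 3] & (1 << (block & 7)))


def read_block(disk, umap, block, block_size):
    """Return zeroes for an unwritten block instead of stale on-disk data."""
    if umap.is_unwritten(block):
        return bytes(block_size)
    return disk[block]


def write_block(disk, umap, block, data):
    """Store the data, then clear the block's bit so later reads see it."""
    disk[block] = data
    umap.clear(block)
```

The "old kernel" hazard described above corresponds to calling `disk[block] = data` directly without `umap.clear(block)`: a later `read_block()` would still return zeroes.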
* Re: fallocate support for bitmap-based files
From: Dave Kleikamp @ 2007-06-29 20:36 UTC
To: Andrew Morton
Cc: Theodore Ts'o, Andreas Dilger, Mike Waychison, Sreenivasa Busam, linux-ext4@vger.kernel.org

On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:
> Guys, Mike and Sreenivasa at Google are looking into implementing
> fallocate() on ext2. Of course, any such implementation could and should
> also be portable to ext3 and ext4 bitmapped files.
>
> I believe that Sreenivasa will mainly be doing the implementation work.
>
>
> The basic plan is as follows:
>
> - Create (with tune2fs and mke2fs) a hidden file using one of the
>   reserved inode numbers. That file will be sized to have one bit for each
>   block in the partition. Let's call this the "unwritten block file".
>
>   The unwritten block file will be initialised with all-zeroes.
>
> - at fallocate()-time, allocate the blocks to the user's file (in some
>   yet-to-be-determined fashion) and, for each one which is uninitialised,
>   set its bit in the unwritten block file. The set bit means "this block
>   is uninitialised and needs to be zeroed out on read".
>
> - truncate() would need to clear out set-bits in the unwritten blocks file.

By truncating the blocks file at the correct byte offset, only needing
to zero some bits of the last byte of the file.

> - When the fs comes to read a block from disk, it will need to consult
>   the unwritten blocks file to see if that block should be zeroed by the
>   CPU.
>
> - When the unwritten-block is written to, its bit in the unwritten blocks
>   file gets zeroed.
>
> - An obvious efficiency concern: if a user file has no unwritten blocks
>   in it, we don't need to consult the unwritten blocks file.
>
>   Need to work out how to do this. An obvious solution would be to have
>   a number-of-unwritten-blocks counter in the inode. But do we have space
>   for that?

Would it be too expensive to test the blocks-file page each time a bit
is cleared to see if it is all-zero, and then free the page, making it a
hole? This test would stop if it finds any non-zero word, so it may not
be too bad. (This could further be done on a block basis if the block
size is less than a page.)

>   (I expect Google and others would prefer that the on-disk format be
>   compatible with legacy ext2!)
>
> - One concern is the following scenario:
>
>   - Mount fs with "new" kernel, fallocate() some blocks to a file.
>
>   - Now, mount the fs under "old" kernel (which doesn't understand the
>     unwritten blocks file).
>
>   - This kernel will be able to read uninitialised data from that
>     fallocated-to file, which is a security concern.
>
>   - Now, the "old" kernel writes some data to a fallocated block. But
>     this kernel doesn't know that it needs to clear that block's flag in
>     the unwritten blocks file!
>
>   - Now mount that fs under the "new" kernel and try to read that file.
>     The flag for the block is set, so this kernel will still zero out the
>     data on a read, thus corrupting the user's data.
>
>   So how to fix this? Perhaps with a per-inode flag indicating "this
>   inode has unwritten blocks". But to fix this problem, we'd require that
>   the "old" kernel clear out that flag.
>
>   Can anyone propose a solution to this?
>
>   Ah, I can! Use the compatibility flags in such a way as to prevent the
>   "old" kernel from mounting this filesystem at all. To mount this fs
>   under an "old" kernel the user will need to run some tool which will
>
>   - read the unwritten blocks file
>
>   - for each set-bit in the unwritten blocks file, zero out the
>     corresponding block
>
>   - zero out the unwritten blocks file
>
>   - rewrite the superblock to indicate that this fs may now be mounted
>     by an "old" kernel.
>
>   Sound sane?

Yeah. I think it would have to be done under a compatibility flag. Is
going back to an older kernel really that important? I think it's more
important to make sure it can't be mounted by an older kernel if bad
things can happen, and they can.

Shaggy
--
David Kleikamp
IBM Linux Technology Center
* Re: fallocate support for bitmap-based files
From: Mike Waychison @ 2007-06-29 20:52 UTC
To: Dave Kleikamp
Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger, Sreenivasa Busam, linux-ext4@vger.kernel.org

Dave Kleikamp wrote:
> On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:
>
>>Guys, Mike and Sreenivasa at Google are looking into implementing
>>fallocate() on ext2. Of course, any such implementation could and should
>>also be portable to ext3 and ext4 bitmapped files.
>>
>>I believe that Sreenivasa will mainly be doing the implementation work.
>>
>>
>>The basic plan is as follows:
>>
>>- Create (with tune2fs and mke2fs) a hidden file using one of the
>>  reserved inode numbers. That file will be sized to have one bit for each
>>  block in the partition. Let's call this the "unwritten block file".
>>
>>  The unwritten block file will be initialised with all-zeroes.
>>
>>- at fallocate()-time, allocate the blocks to the user's file (in some
>>  yet-to-be-determined fashion) and, for each one which is uninitialised,
>>  set its bit in the unwritten block file. The set bit means "this block
>>  is uninitialised and needs to be zeroed out on read".
>>
>>- truncate() would need to clear out set-bits in the unwritten blocks file.
>
>
> By truncating the blocks file at the correct byte offset, only needing
> to zero some bits of the last byte of the file.

We were thinking the unwritten blocks file would be indexed by physical
block number of the block device. There wouldn't be a logical to
physical relationship for the blocks, so we wouldn't be able to get away
with truncating the blocks file itself.

>
>
>>- When the fs comes to read a block from disk, it will need to consult
>>  the unwritten blocks file to see if that block should be zeroed by the
>>  CPU.
>>
>>- When the unwritten-block is written to, its bit in the unwritten blocks
>>  file gets zeroed.
>>
>>- An obvious efficiency concern: if a user file has no unwritten blocks
>>  in it, we don't need to consult the unwritten blocks file.
>>
>>  Need to work out how to do this. An obvious solution would be to have
>>  a number-of-unwritten-blocks counter in the inode. But do we have space
>>  for that?
>
>
> Would it be too expensive to test the blocks-file page each time a bit
> is cleared to see if it is all-zero, and then free the page, making it a
> hole? This test would stop if it finds any non-zero word, so it may not
> be too bad. (This could further be done on a block basis if the block
> size is less than a page.)

When clearing the bits, we'd likely see a large stream of writes to the
unwritten blocks, which could result in an O(n^2) pass of rescanning the
page over and over. Maybe a per-block header in the unwritten block
file, with a count that could be cheaply tested? Ie: the unwritten block
file is composed of blocks that each have a small header that contains
a count -- when the count hits zero, we could punch a hole in the file.

>
>
>>  (I expect Google and others would prefer that the on-disk format be
>>  compatible with legacy ext2!)
>>
>>- One concern is the following scenario:
>>
>>  - Mount fs with "new" kernel, fallocate() some blocks to a file.
>>
>>  - Now, mount the fs under "old" kernel (which doesn't understand the
>>    unwritten blocks file).
>>
>>  - This kernel will be able to read uninitialised data from that
>>    fallocated-to file, which is a security concern.
>>
>>  - Now, the "old" kernel writes some data to a fallocated block. But
>>    this kernel doesn't know that it needs to clear that block's flag in
>>    the unwritten blocks file!
>>
>>  - Now mount that fs under the "new" kernel and try to read that file.
>>    The flag for the block is set, so this kernel will still zero out the
>>    data on a read, thus corrupting the user's data.
>>
>>  So how to fix this? Perhaps with a per-inode flag indicating "this
>>  inode has unwritten blocks". But to fix this problem, we'd require that
>>  the "old" kernel clear out that flag.
>>
>>  Can anyone propose a solution to this?
>>
>>  Ah, I can! Use the compatibility flags in such a way as to prevent the
>>  "old" kernel from mounting this filesystem at all. To mount this fs
>>  under an "old" kernel the user will need to run some tool which will
>>
>>  - read the unwritten blocks file
>>
>>  - for each set-bit in the unwritten blocks file, zero out the
>>    corresponding block
>>
>>  - zero out the unwritten blocks file
>>
>>  - rewrite the superblock to indicate that this fs may now be mounted
>>    by an "old" kernel.
>>
>>  Sound sane?
>
>
> Yeah. I think it would have to be done under a compatibility flag. Is
> going back to an older kernel really that important? I think it's more
> important to make sure it can't be mounted by an older kernel if bad
> things can happen, and they can.
>

Ya, I too was originally thinking of a compat flag to keep the old
kernel from mounting the filesystem. We'd arrange our bootup scripts to
check for compatibility and call out to tune2fs (or some other tool) to
down convert (by simply writing out zero blocks for each bit set and
clearing the bitmap).

Mike Waychison
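The down-conversion step described above (zero every block whose bit is set, clear the bitmap, then allow old kernels to mount) might look roughly like the following sketch. It is a user-space model only: `disk` is a dict of block contents, `superblock` is a dict holding a set of feature names, and the feature name `unwritten_bitmap` is invented for illustration.

```python
def down_convert(bitmap, disk, block_size, superblock):
    """Zero each block whose bit is set, clear the bitmap, drop the flag.

    Returns the number of blocks that were zeroed.
    """
    zeroed = 0
    for byte_index, byte in enumerate(bitmap):
        if byte == 0:
            continue  # no unwritten blocks among these eight
        for bit in range(8):
            if byte & (1 << bit):
                disk[byte_index * 8 + bit] = bytes(block_size)
                zeroed += 1

    # Only after the data blocks are zeroed is the bitmap itself cleared.
    bitmap[:] = bytes(len(bitmap))

    # Last step: mark the filesystem as mountable by an "old" kernel again.
    superblock["incompat"].discard("unwritten_bitmap")
    return zeroed
```

The ordering is a design choice of this sketch: because the flag is dropped last, an interrupted run leaves the filesystem still marked incompatible, and the tool can simply be run again.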
* Re: fallocate support for bitmap-based files
From: Dave Kleikamp @ 2007-06-29 21:24 UTC
To: Mike Waychison
Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger, Sreenivasa Busam, linux-ext4@vger.kernel.org

On Fri, 2007-06-29 at 16:52 -0400, Mike Waychison wrote:
> Dave Kleikamp wrote:
> >
> > By truncating the blocks file at the correct byte offset, only needing
> > to zero some bits of the last byte of the file.
>
> We were thinking the unwritten blocks file would be indexed by physical
> block number of the block device. There wouldn't be a logical to
> physical relationship for the blocks, so we wouldn't be able to get away
> with truncating the blocks file itself.

I misunderstood. I was thinking about a block-file per regular file
(that had preallocated blocks). Ignore that comment.

> >>- When the fs comes to read a block from disk, it will need to consult
> >>  the unwritten blocks file to see if that block should be zeroed by the
> >>  CPU.
> >>
> >>- When the unwritten-block is written to, its bit in the unwritten blocks
> >>  file gets zeroed.
> >>
> >>- An obvious efficiency concern: if a user file has no unwritten blocks
> >>  in it, we don't need to consult the unwritten blocks file.
> >>
> >>  Need to work out how to do this. An obvious solution would be to have
> >>  a number-of-unwritten-blocks counter in the inode. But do we have space
> >>  for that?
> >
> >
> > Would it be too expensive to test the blocks-file page each time a bit
> > is cleared to see if it is all-zero, and then free the page, making it a
> > hole? This test would stop if it finds any non-zero word, so it may not
> > be too bad. (This could further be done on a block basis if the block
> > size is less than a page.)
>
> When clearing the bits, we'd likely see a large stream of writes to the
> unwritten blocks, which could result in an O(n^2) pass of rescanning the
> page over and over.

If you start checking for zero at the bit that was just zeroed, you'd
likely find a non-zero bit right away, so you wouldn't be looking at too
much of the page in the typical case.

> Maybe a per-block header in the unwritten block
> file, with a count that could be cheaply tested? Ie: the
> unwritten block file is composed of blocks that each have a small header
> that contains a count -- when the count hits zero, we could punch a hole
> in the file.

Having the data be just a bitmap seems more elegant to me. It would be
nice to avoid keeping a count in the bitmap page if possible.

--
David Kleikamp
IBM Linux Technology Center
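To make the "start checking at the bit that was just zeroed" idea concrete, here is a small sketch that models a bitmap page as a list of integer words. The function name and the (result, words examined) return value are illustrative assumptions, chosen so the cost of each check is visible.

```python
def page_all_zero(words, start):
    """Test whether every word in the page is zero.

    Scanning begins at index `start` (the word holding the bit that was
    just cleared), wraps around the page, and stops at the first
    non-zero word. Returns (all_zero, words_examined).
    """
    n = len(words)
    for step in range(n):
        if words[(start + step) % n] != 0:
            return False, step + 1
    return True, n
```

With bits being cleared in roughly ascending order, the words behind the current position are already zero and the next word ahead is usually still non-zero, so a scan from `start` tends to stop after a word or two, whereas a scan from index 0 re-reads the whole zeroed prefix each time. A full scan is paid only once, when the page really is empty and can be turned into a hole.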
* Re: fallocate support for bitmap-based files
From: Theodore Tso @ 2007-06-29 20:55 UTC
To: Andrew Morton
Cc: Andreas Dilger, Mike Waychison, Sreenivasa Busam, linux-ext4@vger.kernel.org

On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote:
>
> Guys, Mike and Sreenivasa at Google are looking into implementing
> fallocate() on ext2. Of course, any such implementation could and should
> also be portable to ext3 and ext4 bitmapped files.

What's the eventual goal of this work? Would it be for mainline use,
or just something that would be used internally at Google? I'm not
particularly enthused about supporting two ways of doing fallocate();
one for ext4 and one for bitmap-based files in ext2/3/4. Is the
benefit really worth it?

What I would suggest, which would make this much easier, is to make this
an incompatible extension (which, as you point out, is needed for
security reasons anyway) and then steal the high bit from the block
number field to indicate whether or not the block has been
initialized. That way you don't end up having to seek to a potentially
distant part of the disk to check out the bitmap. Also, you don't
have to worry about how to recover if the "block initialized bitmap"
inode gets smashed.

The downside is that it reduces the maximum size of the filesystem
supported by ext2 by a factor of two. But, there are at least two
patch series floating about that promise to allow filesystem block
sizes > PAGE_SIZE, which would allow you to recover the maximum
size supported by the filesystem.

Furthermore, I suspect (especially after listening to a very fascinating
Usenix Invited Talk by Jeffrey Dean, a fellow from Google, two weeks
ago) that for many of Google's workloads, using a filesystem blocksize
of 16K or 32K might not be a bad thing in any case.

It would be a lot simpler....

- Ted
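The high-bit suggestion and its size trade-off can be sketched as follows, assuming 32-bit block numbers as in ext2's block pointers. The constant and function names are hypothetical, and the size calculation looks only at how many blocks a pointer can address, ignoring any other limits.

```python
UNINIT_FLAG = 1 << 31           # hypothetical: top bit of a 32-bit block number
BLOCK_MASK = UNINIT_FLAG - 1    # remaining 31 bits hold the real block number


def encode(block, uninitialized):
    """Pack a physical block number and its 'uninitialized' flag into 32 bits."""
    if not 0 <= block <= BLOCK_MASK:
        raise ValueError("block number does not fit in 31 bits")
    return block | (UNINIT_FLAG if uninitialized else 0)


def decode(pointer):
    """Return (block_number, uninitialized) from a packed 32-bit pointer."""
    return pointer & BLOCK_MASK, bool(pointer & UNINIT_FLAG)


def max_fs_bytes(block_size, pointer_bits):
    """Largest filesystem addressable with `pointer_bits` bits of block number."""
    return (1 << pointer_bits) * block_size


TIB = 1 << 40
print(max_fs_bytes(4096, 32) // TIB)   # all 32 bits, 4 KiB blocks
print(max_fs_bytes(4096, 31) // TIB)   # one bit stolen: half the size
print(max_fs_bytes(8192, 31) // TIB)   # doubling the block size recovers it
```

Because the flag travels with the block pointer, the read path learns whether to return zeroes from metadata it has already fetched, which is the "no seek to a distant bitmap" advantage mentioned above.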
* Re: fallocate support for bitmap-based files
From: Andrew Morton @ 2007-06-29 21:38 UTC
To: Theodore Tso
Cc: Andreas Dilger, Mike Waychison, Sreenivasa Busam, linux-ext4@vger.kernel.org

On Fri, 29 Jun 2007 16:55:25 -0400
Theodore Tso <tytso@mit.edu> wrote:

> On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote:
> >
> > Guys, Mike and Sreenivasa at Google are looking into implementing
> > fallocate() on ext2. Of course, any such implementation could and should
> > also be portable to ext3 and ext4 bitmapped files.
>
> What's the eventual goal of this work? Would it be for mainline use,
> or just something that would be used internally at Google?

Mainline, preferably.

> I'm not
> particularly enthused about supporting two ways of doing fallocate();
> one for ext4 and one for bitmap-based files in ext2/3/4. Is the
> benefit really worth it?

umm, it's worth it if you don't want to wear the overhead of journalling,
and/or if you don't want to wait on the, err, rather slow progress of ext4.

> What I would suggest, which would make this much easier, is to make this
> an incompatible extension (which, as you point out, is needed for
> security reasons anyway) and then steal the high bit from the block
> number field to indicate whether or not the block has been
> initialized. That way you don't end up having to seek to a potentially
> distant part of the disk to check out the bitmap. Also, you don't
> have to worry about how to recover if the "block initialized bitmap"
> inode gets smashed.
>
> The downside is that it reduces the maximum size of the filesystem
> supported by ext2 by a factor of two. But, there are at least two
> patch series floating about that promise to allow filesystem block
> sizes > PAGE_SIZE, which would allow you to recover the maximum
> size supported by the filesystem.
>
> Furthermore, I suspect (especially after listening to a very fascinating
> Usenix Invited Talk by Jeffrey Dean, a fellow from Google, two weeks
> ago) that for many of Google's workloads, using a filesystem blocksize
> of 16K or 32K might not be a bad thing in any case.
>
> It would be a lot simpler....
>

Hadn't thought of that.

Also, it's unclear to me why Google is going this way rather than using
(perhaps suitably-tweaked) ext2 reservations code.

Because the stock ext2 block allocator sucks big-time.
* Re: fallocate support for bitmap-based files
From: Mike Waychison @ 2007-06-29 22:07 UTC
To: Andrew Morton
Cc: Theodore Tso, Andreas Dilger, Sreenivasa Busam, linux-ext4@vger.kernel.org

Andrew Morton wrote:
> On Fri, 29 Jun 2007 16:55:25 -0400
> Theodore Tso <tytso@mit.edu> wrote:
>
>
>>On Fri, Jun 29, 2007 at 01:01:20PM -0700, Andrew Morton wrote:
>>
>>>Guys, Mike and Sreenivasa at Google are looking into implementing
>>>fallocate() on ext2. Of course, any such implementation could and should
>>>also be portable to ext3 and ext4 bitmapped files.
>>
>>What's the eventual goal of this work? Would it be for mainline use,
>>or just something that would be used internally at Google?
>
>
> Mainline, preferably.
>
>
>> I'm not
>>particularly enthused about supporting two ways of doing fallocate();
>>one for ext4 and one for bitmap-based files in ext2/3/4. Is the
>>benefit really worth it?
>
>
> umm, it's worth it if you don't want to wear the overhead of journalling,
> and/or if you don't want to wait on the, err, rather slow progress of ext4.
>
>
>>What I would suggest, which would make this much easier, is to make this
>>an incompatible extension (which, as you point out, is needed for
>>security reasons anyway) and then steal the high bit from the block
>>number field to indicate whether or not the block has been
>>initialized. That way you don't end up having to seek to a potentially
>>distant part of the disk to check out the bitmap. Also, you don't
>>have to worry about how to recover if the "block initialized bitmap"
>>inode gets smashed.
>>
>>The downside is that it reduces the maximum size of the filesystem
>>supported by ext2 by a factor of two. But, there are at least two
>>patch series floating about that promise to allow filesystem block
>>sizes > PAGE_SIZE, which would allow you to recover the maximum
>>size supported by the filesystem.
>>
>>Furthermore, I suspect (especially after listening to a very fascinating
>>Usenix Invited Talk by Jeffrey Dean, a fellow from Google, two weeks
>>ago) that for many of Google's workloads, using a filesystem blocksize
>>of 16K or 32K might not be a bad thing in any case.
>>
>>It would be a lot simpler....
>>
>
>
> Hadn't thought of that.
>
> Also, it's unclear to me why Google is going this way rather than using
> (perhaps suitably-tweaked) ext2 reservations code.
>
> Because the stock ext2 block allocator sucks big-time.

The primary reason this is a problem is that our writers into these
files aren't necessarily coming from the same hosts in the cluster, so
their arrival times aren't sequential. It ends up looking to the kernel
like a random write workload, which in turn ends up causing odd
fragmentation patterns that aren't very deterministic. That data is
often eventually streamed off the disk though, which is when the
fragmentation hurts.

Currently, our clustered filesystem supports pre-allocation of the
target chunks of files, but this is implemented by effectively writing
zeroes to files, which in turn causes pagecache churn and a double
write-out of the blocks. Recently, we've changed the code to minimize
this pagecache churn and double write-out by performing an ftruncate to
extend files, but then we'll be back to square one in terms of
fragmentation for the random writes.

Relying on (a tweaked) reservations code is also somewhat limiting at
this stage given that reservations are lost on close(fd). Unless we
change the lifetime of the reservations (maybe for the lifetime of the
in-core inode?), crank up the reservation sizes and deal with the
overcommit issues, I can't think of any better way at this time to deal
with the problem.

Mike Waychison
* Re: fallocate support for bitmap-based files
From: Valerie Henson @ 2007-07-04 23:11 UTC
To: Mike Waychison
Cc: Andrew Morton, Theodore Tso, Andreas Dilger, Sreenivasa Busam, linux-ext4@vger.kernel.org, Mingming Cao

On Fri, Jun 29, 2007 at 06:07:25PM -0400, Mike Waychison wrote:
>
> Relying on (a tweaked) reservations code is also somewhat limiting at
> this stage given that reservations are lost on close(fd). Unless we
> change the lifetime of the reservations (maybe for the lifetime of the
> in-core inode?), crank up the reservation sizes and deal with the
> overcommit issues, I can't think of any better way at this time to deal
> with the problem.

While I never ever intended the ext3-to-ext2 reservations port to be
used :), I think you can make some fairly minor tweaks to it and get
something that works for your use case. Move the reservation drop to
iput() and turn up your inode cache size, or store it in a tree when
the inode is closed and go look for it again when it's reopened.
Changing the reservation size seems fairly easy. I'm not sure how the
overcommit issues affect your use case; any data you can share on
that?

In any case, storing the reservation data on-disk seems like not such
a great idea. It adds complexity, disk traffic, and a new set of
checks for fsck. I wouldn't want to incur that cost unless absolutely
necessary.

-VAL
* Re: fallocate support for bitmap-based files
From: Mike Waychison @ 2007-07-06 21:15 UTC
To: Valerie Henson
Cc: Andrew Morton, Theodore Tso, Andreas Dilger, Sreenivasa Busam, linux-ext4@vger.kernel.org, Mingming Cao

Valerie Henson wrote:
> On Fri, Jun 29, 2007 at 06:07:25PM -0400, Mike Waychison wrote:
>> Relying on (a tweaked) reservations code is also somewhat limiting at
>> this stage given that reservations are lost on close(fd). Unless we
>> change the lifetime of the reservations (maybe for the lifetime of the
>> in-core inode?), crank up the reservation sizes and deal with the
>> overcommit issues, I can't think of any better way at this time to deal
>> with the problem.
>
> While I never ever intended the ext3-to-ext2 reservations port to be
> used :), I think you can make some fairly minor tweaks to it and get
> something that works for your use case. Move the reservation drop to
> iput() and turn up your inode cache size, or store it in a tree when
> the inode is closed and go look for it again when it's reopened.
> Changing the reservation size seems fairly easy. I'm not sure how the
> overcommit issues affect your use case; any data you can share on
> that?

The overcommit is speculation on my part. GFS uses a lot of files on
the disks and we like to keep the disks near full, especially in large
GFS configurations. If we end up with a lot of reservations in-core,
associated with the inode cache, we begin to rely on memory pressure for
getting the reserved blocks back. That memory pressure may not exist
(leading to ENOSPC), unless the code is adapted to cull reservations in
that case.

>
> In any case, storing the reservation data on-disk seems like not such
> a great idea. It adds complexity, disk traffic, and a new set of
> checks for fsck. I wouldn't want to incur that cost unless absolutely
> necessary.

Ya, I wouldn't want the reservations on disk either unless it came in as
an explicit pre-allocation request (and was accounted for in statfs
info). I'm treating that as a completely different beast at the moment.

Mike Waychison
* Re: fallocate support for bitmap-based files
From: Andreas Dilger @ 2007-06-29 21:46 UTC
To: Theodore Tso
Cc: Andrew Morton, Mike Waychison, Sreenivasa Busam, linux-ext4@vger.kernel.org

On Jun 29, 2007 16:55 -0400, Theodore Tso wrote:
> What's the eventual goal of this work? Would it be for mainline use,
> or just something that would be used internally at Google? I'm not
> particularly enthused about supporting two ways of doing fallocate();
> one for ext4 and one for bitmap-based files in ext2/3/4. Is the
> benefit really worth it?
>
> What I would suggest, which would make this much easier, is to make this
> an incompatible extension (which, as you point out, is needed for
> security reasons anyway) and then steal the high bit from the block
> number field to indicate whether or not the block has been
> initialized. That way you don't end up having to seek to a potentially
> distant part of the disk to check out the bitmap. Also, you don't
> have to worry about how to recover if the "block initialized bitmap"
> inode gets smashed.
>
> The downside is that it reduces the maximum size of the filesystem
> supported by ext2 by a factor of two. But, there are at least two
> patch series floating about that promise to allow filesystem block
> sizes > PAGE_SIZE, which would allow you to recover the maximum
> size supported by the filesystem.

I don't think ext2 is safe for > 8TB filesystems anyways, so this
isn't a huge loss.

The other possibility, assuming Google likes ext2 because they
don't care about e2fsck, is to patch ext4 to not use any
journaling (i.e. make all of the ext4_journal*() wrappers be
no-ops). That way they would get extents, mballoc and other speedups.

That said, what is the reason for not using ext3? Presumably performance
(which is greatly improved in ext4) or is there something else?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: fallocate support for bitmap-based files
2007-06-29 21:46 ` Andreas Dilger
@ 2007-06-29 22:26 ` Mike Waychison
2007-06-30 5:14 ` Andreas Dilger
0 siblings, 1 reply; 19+ messages in thread
From: Mike Waychison @ 2007-06-29 22:26 UTC (permalink / raw)
To: Andreas Dilger
Cc: Theodore Tso, Andrew Morton, Sreenivasa Busam, linux-ext4@vger.kernel.org

Andreas Dilger wrote:
> On Jun 29, 2007 16:55 -0400, Theodore Tso wrote:
>
>>What's the eventual goal of this work? Would it be for mainline use,
>>or just something that would be used internally at Google? I'm not
>>particularly enthused about supporting two ways of doing fallocate();
>>one for ext4 and one for bitmap-based files in ext2/3/4. Is the
>>benefit really worth it?
>>
>>What I would suggest, which would make much easier, is to make this be
>>an incompatible extensions (which you as you point out is needed for
>>security reasons anyway) and then steal the high bit from the block
>>number field to indicate whether or not the block has been initialized
>>or not. That way you don't end up having to seek to a potentially
>>distant part of the disk to check out the bitmap. Also, you don't
>>have to worry about how to recover if the "block initialized bitmap"
>>inode gets smashed.
>>
>>The downside is that it reduces the maximum size of the filesystem
>>supported by ext2 by a factor of two. But, there are at least two
>>patch series floating about that promise to allow filesystem block
>>sizes > than PAGE_SIZE which would allow you to recover the maximum
>>size supported by the filesystem.
>
>
> I don't think ext2 is safe for > 8TB filesystems anyways, so this
> isn't a huge loss.

This is in reference to the idea of overloading the high bit, and not
related to the >PAGE_SIZE blocks, correct?

>
> The other possibility is, assuming Google likes ext2 because they
> don't care about e2fsck, is to patch ext4 to not use any
> journaling (i.e. make all of the ext4_journal*() wrappers be
> no-ops). That way they would get extents, mballoc and other speedups.
>

We do care about the e2fsck problem, though the cost/benefit of e2fsck
times/memory problems vs the overhead of journalling doesn't weigh in
journalling's favour for a lot of our per-spindle-latency bound
applications. These apps manage to get pretty good disk locality
guarantees and the journal overheads can induce undesired head movement.

ext4 does look very promising, though I'm not certain it's ready for our
consumption.

What are people's thoughts on providing an ext3 non-journal mode? We could
benefit from several of the additions to ext3 that aren't available in
ext2, and disabling journalling there sounds much more feasible for us
than trying to backport each ext3 component to ext2.

Mike Waychison

> That said, what is the reason for not using ext3? Presumably performance
> (which is greatly improved in ext4) or is there something else?
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
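A minimal sketch of the "steal the high bit from the block number" idea quoted above, assuming 32-bit on-disk block numbers. The macro and function names are hypothetical and are not taken from the ext2 sources; the sketch only illustrates how an "uninitialised" flag could share a word with the physical block number, which is also why the usable block-number range is halved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; nothing here comes from the ext2 code. */
#define BLK_UNINIT_FLAG UINT32_C(0x80000000) /* high bit of the block pointer */
#define BLK_NUMBER_MASK UINT32_C(0x7fffffff) /* remaining 31 bits: block number */

/* fallocate() path: record the block as allocated but not yet initialised. */
static inline uint32_t blk_mark_uninit(uint32_t blk)
{
    assert((blk & BLK_UNINIT_FLAG) == 0); /* block number must fit in 31 bits */
    return blk | BLK_UNINIT_FLAG;
}

/* Read path: if this returns true, hand back zeroes instead of disk contents. */
static inline bool blk_is_uninit(uint32_t ptr)
{
    return (ptr & BLK_UNINIT_FLAG) != 0;
}

/* Physical block number, with the flag stripped. */
static inline uint32_t blk_physical(uint32_t ptr)
{
    return ptr & BLK_NUMBER_MASK;
}

/* Write path: once real data has been written, drop the flag. */
static inline uint32_t blk_mark_written(uint32_t ptr)
{
    return ptr & BLK_NUMBER_MASK;
}
```

Because the flag travels with the block pointer itself, the read path needs no extra lookup in a separate bitmap file, which is the seek-avoidance argument made in the quoted text.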
* Re: fallocate support for bitmap-based files
2007-06-29 22:26 ` Mike Waychison
@ 2007-06-30 5:14 ` Andreas Dilger
2007-06-30 14:31 ` Mingming Cao
0 siblings, 1 reply; 19+ messages in thread
From: Andreas Dilger @ 2007-06-30 5:14 UTC (permalink / raw)
To: Mike Waychison
Cc: Theodore Tso, Andrew Morton, Sreenivasa Busam, linux-ext4@vger.kernel.org

On Jun 29, 2007 18:26 -0400, Mike Waychison wrote:
> Andreas Dilger wrote:
> >I don't think ext2 is safe for > 8TB filesystems anyways, so this
> >isn't a huge loss.
>
> This is reference to the idea of overloading the high-bit and not
> related to the >PAGE_SIZE blocks correct?

Correct - just that the high-bit use wouldn't unduly impact the
already-existing 8TB limit of ext2.

The other thing to note is that Val Henson already ported the ext3
reservation code to ext2, so this is a pretty straightforward option
for you and also doesn't affect the on-disk format.

> >The other possibility is, assuming Google likes ext2 because they
> >don't care about e2fsck, is to patch ext4 to not use any
> >journaling (i.e. make all of the ext4_journal*() wrappers be
> >no-ops). That way they would get extents, mballoc and other speedups.
>
> We do care about the e2fsck problem, though the cost/benefit of e2fsck
> times/memory problems vs the overhead of journalling doesn't weigh in
> journalling's favour for a lot of our per-spindle-latency bound
> applications. These apps manage to get pretty good disk locality
> guarantees and the journal overheads can induce undesired head movement.

You could push the journal to a separate spindle, but that may not be
practical.

> ext4 does look very promising, though I'm not certain it's ready for our
> consumption.

FYI, the extents code (the most complex part of ext4) has been running
for a couple of years on many PB of storage at CFS, so it is by no means
new and untried code. There are definitely less well-tested changes in
ext4, but they are mostly straightforward.

I'm not saying you should jump right into ext4, but it isn't as far away
as you might think.

> What are people's thoughts on providing ext3 non-journal mode? We could
> benefit from several of the additions to ext3 that aren't available in
> ext2 and disabling journalling there sounds much more feasible for us
> instead of trying to backport each ext3 component to ext2.

This is something we've talked about for a long time, and I'd be happy
to have this possibility. This would also allow you to take similar
advantage of extents, the improved allocator and other features.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: fallocate support for bitmap-based files
2007-06-30 5:14 ` Andreas Dilger
@ 2007-06-30 14:31 ` Mingming Cao
0 siblings, 0 replies; 19+ messages in thread
From: Mingming Cao @ 2007-06-30 14:31 UTC (permalink / raw)
To: Andreas Dilger
Cc: Mike Waychison, Theodore Tso, Andrew Morton, Sreenivasa Busam, linux-ext4@vger.kernel.org

On Sat, 2007-06-30 at 01:14 -0400, Andreas Dilger wrote:
> On Jun 29, 2007 18:26 -0400, Mike Waychison wrote:
> > Andreas Dilger wrote:
> > >I don't think ext2 is safe for > 8TB filesystems anyways, so this
> > >isn't a huge loss.
> >
> > This is reference to the idea of overloading the high-bit and not
> > related to the >PAGE_SIZE blocks correct?
>
> Correct - just that the high-bit use wouldn't unduly impact the
> already-existing 8TB limit of ext2.
>

The 8TB limit on mainline ext2 was simply caused by kernel block
variable type bugs. The bug fixes were ported back from ext3 to ext2
when reservation+simple-multiple-balloc were backported from ext3 to
ext2. I believe ext2 in the -mm tree is able to address 16TB on the
kernel side.

I'm not sure whether there is remaining work to be done in e2fsck to
handle 16TB ext2, but I assume it's not a huge amount of work.

Mingming
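The 8TB and 16TB figures in this exchange follow from simple arithmetic. A small sketch, assuming 4 KiB filesystem blocks (an assumption; the thread does not state the block size) and a hypothetical helper name: with all 32 bits of the on-disk block number usable the limit is 16 TiB, and with only 31 usable bits (a signed block variable, or one bit stolen as a flag) it is 8 TiB.

```c
#include <assert.h>
#include <stdint.h>

/* Largest addressable filesystem size in bytes, given how many bits of the
 * block number are usable and the block size. Hypothetical helper. */
static uint64_t max_fs_bytes(unsigned block_bits, uint64_t block_size)
{
    assert(block_bits <= 32); /* ext2/3 block numbers are 32 bits on disk */
    return (UINT64_C(1) << block_bits) * block_size;
}
```

Under those assumptions, max_fs_bytes(32, 4096) is 2^44 bytes (16 TiB) and max_fs_bytes(31, 4096) is 2^43 bytes (8 TiB).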
* Re: fallocate support for bitmap-based files 2007-06-29 20:01 fallocate support for bitmap-based files Andrew Morton 2007-06-29 20:36 ` Dave Kleikamp 2007-06-29 20:55 ` Theodore Tso @ 2007-06-30 14:13 ` Mingming Cao 2007-06-30 17:29 ` Andreas Dilger 2007-07-02 17:44 ` Badari Pulavarty 2 siblings, 2 replies; 19+ messages in thread From: Mingming Cao @ 2007-06-30 14:13 UTC (permalink / raw) To: Andrew Morton Cc: Theodore Ts'o, Andreas Dilger, Mike Waychison, Sreenivasa Busam, linux-ext4@vger.kernel.org On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote: > Guys, Mike and Sreenivasa at google are looking into implementing > fallocate() on ext2. Of course, any such implementation could and should > also be portable to ext3 and ext4 bitmapped files. > > I believe that Sreenivasa will mainly be doing the implementation work. > > > The basic plan is as follows: > > - Create (with tune2fs and mke2fs) a hidden file using one of the > reserved inode numbers. That file will be sized to have one bit for each > block in the partition. Let's call this the "unwritten block file". > > The unwritten block file will be initialised with all-zeroes > > - at fallocate()-time, allocate the blocks to the user's file (in some > yet-to-be-determined fashion) and, for each one which is uninitialised, > set its bit in the unwritten block file. The set bit means "this block > is uninitialised and needs to be zeroed out on read". > > - truncate() would need to clear out set-bits in the unwritten blocks file. > > - When the fs comes to read a block from disk, it will need to consult > the unwritten blocks file to see if that block should be zeroed by the > CPU. > > - When the unwritten-block is written to, its bit in the unwritten blocks > file gets zeroed. > > - An obvious efficiency concern: if a user file has no unwritten blocks > in it, we don't need to consult the unwritten blocks file. > > Need to work out how to do this. 
An obvious solution would be to have > a number-of-unwritten-blocks counter in the inode. But do we have space > for that? > > (I expect google and others would prefer that the on-disk format be > compatible with legacy ext2!) > > - One concern is the following scenario: > > - Mount fs with "new" kernel, fallocate() some blocks to a file. > > - Now, mount the fs under "old" kernel (which doesn't understand the > unwritten blocks file). > > - This kernel will be able to read uninitialised data from that > fallocated-to file, which is a security concern. > > - Now, the "old" kernel writes some data to a fallocated block. But > this kernel doesn't know that it needs to clear that block's flag in > the unwritten blocks file! > > - Now mount that fs under the "new" kernel and try to read that file. > The flag for the block is set, so this kernel will still zero out the > data on a read, thus corrupting the user's data > > So how to fix this? Perhaps with a per-inode flag indicating "this > inode has unwritten blocks". But to fix this problem, we'd require that > the "old" kernel clear out that flag. > > Can anyone propose a solution to this? > > Ah, I can! Use the compatibility flags in such a way as to prevent the > "old" kernel from mounting this filesystem at all. To mount this fs > under an "old" kernel the user will need to run some tool which will > > - read the unwritten blocks file > > - for each set-bit in the unwritten blocks file, zero out the > corresponding block > > - zero out the unwritten blocks file > > - rewrite the superblock to indicate that this fs may now be mounted > by an "old" kernel. > > Sound sane? > > - I'm assuming that there are more reserved inodes available, and that > the changes to tune2fs and mke2fs will be basically a copy-n-paste job > from the `tune2fs -j' code. Correct? > > - I haven't thought about what fsck changes would be needed. > > Presumably quite a few. 
> For example, fsck should check that set-bits
> in the unwritten blocks file do not correspond to freed blocks. If they
> do, that should be fixed up.
>
> And fsck can check each inode's number-of-unwritten-blocks counters
> against the unwritten blocks file (if we implement the per-inode
> number-of-unwritten-blocks counter)
>
> What else should fsck do?
>
> - I haven't thought about the implications of porting this into ext3/4.
> Probably the commit to the unwritten blocks file will need to be atomic
> with the commit to the user's file's metadata, so the unwritten-blocks
> file will effectively need to be in journalled-data mode.
>
> Or, more likely, we access the unwritten blocks file via the blockdev
> pagecache (ie: use bmap, like the journal file) and then we're just
> talking direct to the disk's blocks and it becomes just more fs metadata.
>
> - I guess resize2fs will need to be taught about the unwritten blocks
> file: to shrink and grow it appropriately.
>
>
> That's all I can think of for now - I probably missed something.
>
> Suggestions and thought are sought, please.
>
>

Another approach we have been thinking about is using a backing inode
(one per inode with preallocation) to store the preallocated blocks.
When the user asks for preallocation on the base inode, ext2/3 creates a
temporary backing inode and (pre)allocates the corresponding blocks in
the backing inode.

When a write to the base inode finds that block allocation is needed,
then before doing the real fs block allocation it checks whether the
file has a backing inode that stores preallocated blocks for the same
logical blocks. If so, it transfers the preallocated blocks from the
backing inode to the base inode.

We need to link the two inodes in some way, maybe by storing the backing
inode number in an EA on the base inode, and by flagging the base inode
as having a backing inode from which to get preallocated blocks.

Since this doesn't change the block mapping of the original file until
writeout, it doesn't require an incompat feature to protect the
preallocated contents from being read by an "old" kernel. Some work
needs to be done in e2fsck to understand the backing inode.

Mingming
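For comparison with the backing-inode idea, the unwritten-block bitmap scheme quoted in this message can be sketched as a userspace toy. This is only an illustration under stated assumptions: the "disk" is a byte array, the bitmap lives in memory rather than in a hidden inode, and every name (struct unwritten_map, uw_set, read_block and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the hidden "unwritten block file": one bit per fs block. */
struct unwritten_map {
    unsigned char *bits; /* at least (nblocks + 7) / 8 bytes, initially zero */
    uint32_t nblocks;
};

static bool uw_test(const struct unwritten_map *m, uint32_t blk)
{
    assert(blk < m->nblocks);
    return (m->bits[blk / 8] >> (blk % 8)) & 1u;
}

/* fallocate(): the block is allocated but still holds stale data. */
static void uw_set(struct unwritten_map *m, uint32_t blk)
{
    assert(blk < m->nblocks);
    m->bits[blk / 8] |= (unsigned char)(1u << (blk % 8));
}

/* First write or truncate(): the block no longer needs zeroing on read. */
static void uw_clear(struct unwritten_map *m, uint32_t blk)
{
    assert(blk < m->nblocks);
    m->bits[blk / 8] &= (unsigned char)~(1u << (blk % 8));
}

/* Read path: flagged blocks come back as zeroes, never as stale disk data. */
static void read_block(const struct unwritten_map *m, const unsigned char *disk,
                       uint32_t blk, unsigned char *out, size_t block_size)
{
    if (uw_test(m, blk))
        memset(out, 0, block_size);
    else
        memcpy(out, disk + (size_t)blk * block_size, block_size);
}

/* Write path: store the data, then clear the block's "unwritten" bit. */
static void write_block(struct unwritten_map *m, unsigned char *disk,
                        uint32_t blk, const unsigned char *in, size_t block_size)
{
    memcpy(disk + (size_t)blk * block_size, in, block_size);
    uw_clear(m, blk);
}
```

The old-kernel hazard described in the quoted text corresponds to a writer that performs the memcpy() in write_block() but never calls uw_clear(), so a later read_block() would wrongly return zeroes.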
* Re: fallocate support for bitmap-based files 2007-06-30 14:13 ` Mingming Cao @ 2007-06-30 17:29 ` Andreas Dilger 2007-07-02 14:44 ` Mingming Cao 2007-07-02 17:44 ` Badari Pulavarty 1 sibling, 1 reply; 19+ messages in thread From: Andreas Dilger @ 2007-06-30 17:29 UTC (permalink / raw) To: Mingming Cao Cc: Andrew Morton, Theodore Ts'o, Mike Waychison, Sreenivasa Busam, linux-ext4@vger.kernel.org On Jun 30, 2007 10:13 -0400, Mingming Cao wrote: > Another approach we have been thinking is using a backing > inode(per-inode-with-preallocation) to store the preallocated blocks. > When user asked for preallocation on the base inode, ext2/3 create a > temporary backing inode, and it's (pre)allocate the corresponding > blocks in the backing inode. > > When writes to the base inode, and realize we need to block allocation > on, before doing the fs real block allocation, it will check if the file > has a backing inode stores some preallocated blocks for the same logical > blocks. If so, it will transfer the preallocated blocks from backing > inode to the base inode. > > We need to link the two inodes in some way, maybe store the backing > inode number via EA in the base inode, and flag the base inode that it > has a backing inode to get preallocated blocks. > > Since it doesn't change the block mapping on the original file until > writeout, so it doesn't require a incompat feature to protect the > preallocated contents to be read in "old" kernel. There some work need > to be done in e2fsck to understand the backing inode. I don't know if you realize, but this is half-way to supporting snapshots within the filesystem. If there are any serious efforts in the direction of snapshots, you should start by looking at ext3cow, which does that already. I haven't looked at that code yet, but I worked on a snapshotting ext2 many years ago and it was implemented nearly as you describe (though backward, moving blocks from the "real" file to the shadow inode). 

The OTHER thing that is important for snapshots, and something I wish
someone could work on, is "whiteout" support for extents: allowing an
extent in a file to explicitly encode a hole, so as to "hide" contents
in a backing/snapshot inode that were truncated away. This is quite easy
to implement now (it even makes the filesystem more robust) but will be
considerably harder to do later. It is described in my email
"[RFC] extent whiteouts" (that nobody ever commented on).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
* Re: fallocate support for bitmap-based files
2007-06-30 17:29 ` Andreas Dilger
@ 2007-07-02 14:44 ` Mingming Cao
0 siblings, 0 replies; 19+ messages in thread
From: Mingming Cao @ 2007-07-02 14:44 UTC (permalink / raw)
To: Andreas Dilger
Cc: Andrew Morton, Theodore Ts'o, Mike Waychison, Sreenivasa Busam, linux-ext4@vger.kernel.org

On Sat, 2007-06-30 at 13:29 -0400, Andreas Dilger wrote:
> On Jun 30, 2007 10:13 -0400, Mingming Cao wrote:
> > Another approach we have been thinking is using a backing
> > inode(per-inode-with-preallocation) to store the preallocated blocks.
> > When user asked for preallocation on the base inode, ext2/3 create a
> > temporary backing inode, and it's (pre)allocate the corresponding
> > blocks in the backing inode.
> >
> > When writes to the base inode, and realize we need to block allocation
> > on, before doing the fs real block allocation, it will check if the file
> > has a backing inode stores some preallocated blocks for the same logical
> > blocks. If so, it will transfer the preallocated blocks from backing
> > inode to the base inode.
> >
> > We need to link the two inodes in some way, maybe store the backing
> > inode number via EA in the base inode, and flag the base inode that it
> > has a backing inode to get preallocated blocks.
> >
> > Since it doesn't change the block mapping on the original file until
> > writeout, so it doesn't require a incompat feature to protect the
> > preallocated contents to be read in "old" kernel. There some work need
> > to be done in e2fsck to understand the backing inode.
>
> I don't know if you realize, but this is half-way to supporting
> snapshots within the filesystem.

From your description it seems similar, but I'm not sure it's half-way
there yet.

Just to clarify: what's stored in the backing inode (in the
preallocation case) is just metablocks, not data blocks. The transfer
(from the backing inode to the base inode) does not involve any data
block migration.

Another comment: if we are seriously looking at supporting preallocation
for ext2 upstream, I'd like to choose a solution suitable for ext3 as
well. Taking a bit from the block number to flag preallocated blocks
means reducing the ext2/3 fs limit to 8TB, which is probably not a big
deal for ext2, but not so good for ext3.

Mingming
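The point that the transfer from the backing inode moves block mappings and does not migrate data can be illustrated with a toy model. This is a sketch under assumptions, not the proposed implementation: each inode is reduced to a small logical-to-physical map in which 0 means a hole, and the names (struct toy_inode, claim_preallocated) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NBLOCKS 16 /* logical blocks per toy inode (hypothetical) */

/* Toy inode: logical block -> physical block; 0 means a hole. */
struct toy_inode {
    uint32_t map[TOY_NBLOCKS];
};

/*
 * Write path, before normal allocation: if the backing inode holds a
 * preallocated block for this logical block, move the mapping to the base
 * inode. Only the pointer moves; no data is copied. Returns the physical
 * block, or 0 if nothing was preallocated and the caller should allocate
 * in the usual way.
 */
static uint32_t claim_preallocated(struct toy_inode *base,
                                   struct toy_inode *backing, uint32_t lblk)
{
    uint32_t pblk;

    assert(lblk < TOY_NBLOCKS);
    pblk = backing->map[lblk];
    if (pblk == 0)
        return 0;
    backing->map[lblk] = 0; /* the backing inode gives the block up */
    base->map[lblk] = pblk; /* the base inode now owns it */
    return pblk;
}
```

Until a block is claimed, the base inode still shows a hole at that offset, which is why a kernel that knows nothing about the backing inode would read zeroes there rather than stale data.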
* Re: fallocate support for bitmap-based files 2007-06-30 14:13 ` Mingming Cao 2007-06-30 17:29 ` Andreas Dilger @ 2007-07-02 17:44 ` Badari Pulavarty 2007-07-06 21:33 ` Mike Waychison 1 sibling, 1 reply; 19+ messages in thread From: Badari Pulavarty @ 2007-07-02 17:44 UTC (permalink / raw) To: cmm Cc: Andrew Morton, Theodore Ts'o, Andreas Dilger, Mike Waychison, Sreenivasa Busam, linux-ext4@vger.kernel.org On Sat, 2007-06-30 at 10:13 -0400, Mingming Cao wrote: > On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote: > > Guys, Mike and Sreenivasa at google are looking into implementing > > fallocate() on ext2. Of course, any such implementation could and should > > also be portable to ext3 and ext4 bitmapped files. > > > > I believe that Sreenivasa will mainly be doing the implementation work. > > > > > > The basic plan is as follows: > > > > - Create (with tune2fs and mke2fs) a hidden file using one of the > > reserved inode numbers. That file will be sized to have one bit for each > > block in the partition. Let's call this the "unwritten block file". > > > > The unwritten block file will be initialised with all-zeroes > > > > - at fallocate()-time, allocate the blocks to the user's file (in some > > yet-to-be-determined fashion) and, for each one which is uninitialised, > > set its bit in the unwritten block file. The set bit means "this block > > is uninitialised and needs to be zeroed out on read". > > > > - truncate() would need to clear out set-bits in the unwritten blocks file. > > > > - When the fs comes to read a block from disk, it will need to consult > > the unwritten blocks file to see if that block should be zeroed by the > > CPU. > > > > - When the unwritten-block is written to, its bit in the unwritten blocks > > file gets zeroed. > > > > - An obvious efficiency concern: if a user file has no unwritten blocks > > in it, we don't need to consult the unwritten blocks file. > > > > Need to work out how to do this. 
An obvious solution would be to have > > a number-of-unwritten-blocks counter in the inode. But do we have space > > for that? > > > > (I expect google and others would prefer that the on-disk format be > > compatible with legacy ext2!) > > > > - One concern is the following scenario: > > > > - Mount fs with "new" kernel, fallocate() some blocks to a file. > > > > - Now, mount the fs under "old" kernel (which doesn't understand the > > unwritten blocks file). > > > > - This kernel will be able to read uninitialised data from that > > fallocated-to file, which is a security concern. > > > > - Now, the "old" kernel writes some data to a fallocated block. But > > this kernel doesn't know that it needs to clear that block's flag in > > the unwritten blocks file! > > > > - Now mount that fs under the "new" kernel and try to read that file. > > The flag for the block is set, so this kernel will still zero out the > > data on a read, thus corrupting the user's data > > > > So how to fix this? Perhaps with a per-inode flag indicating "this > > inode has unwritten blocks". But to fix this problem, we'd require that > > the "old" kernel clear out that flag. > > > > Can anyone propose a solution to this? > > > > Ah, I can! Use the compatibility flags in such a way as to prevent the > > "old" kernel from mounting this filesystem at all. To mount this fs > > under an "old" kernel the user will need to run some tool which will > > > > - read the unwritten blocks file > > > > - for each set-bit in the unwritten blocks file, zero out the > > corresponding block > > > > - zero out the unwritten blocks file > > > > - rewrite the superblock to indicate that this fs may now be mounted > > by an "old" kernel. > > > > Sound sane? > > > > - I'm assuming that there are more reserved inodes available, and that > > the changes to tune2fs and mke2fs will be basically a copy-n-paste job > > from the `tune2fs -j' code. Correct? 
> > > > - I haven't thought about what fsck changes would be needed. > > > > Presumably quite a few. For example, fsck should check that set-bits > > in the unwriten blobks file do not correspond to freed blocks. If they > > do, that should be fixed up. > > > > And fsck can check each inodes number-of-unwritten-blocks counters > > against the unwritten blocks file (if we implement the per-inode > > number-of-unwritten-blocks counter) > > > > What else should fsck do? > > > > - I haven't thought about the implications of porting this into ext3/4. > > Probably the commit to the unwritten blocks file will need to be atomic > > with the commit to the user's file's metadata, so the unwritten-blocks > > file will effectively need to be in journalled-data mode. > > > > Or, more likely, we access the unwritten blocks file via the blockdev > > pagecache (ie: use bmap, like the journal file) and then we're just > > talking direct to the disk's blocks and it becomes just more fs metadata. > > > > - I guess resize2fs will need to be taught about the unwritten blocks > > file: to shrink and grow it appropriately. > > > > > > That's all I can think of for now - I probably missed something. > > > > Suggestions and thought are sought, please. > > > > > > Another approach we have been thinking is using a backing > inode(per-inode-with-preallocation) to store the preallocated blocks. > When user asked for preallocation on the base inode, ext2/3 create a > temporary backing inode, and it's (pre)allocate the > corresponding blocks in the backing inode. > > When writes to the base inode, and realize we need to block allocation > on, before doing the fs real block allocation, it will check if the file > has a backing inode stores some preallocated blocks for the same logical > blocks. If so, it will transfer the preallocated blocks from backing > inode to the base inode. 
>
> We need to link the two inodes in some way, maybe store the backing
> inode number via EA in the base inode, and flag the base inode that it
> has a backing inode to get preallocated blocks.
>
> Since it doesn't change the block mapping on the original file until
> writeout, so it doesn't require a incompat feature to protect the
> preallocated contents to be read in "old" kernel. There some work need
> to be done in e2fsck to understand the backing inode.
>

Small detail: we need to set the size of the backing inode to zero, so
that if we ever boot an older kernel, it will not be able to read the
contents of that inode. (Of course, this also means that fsck would
remove that inode if we ran it.)

Thanks,
Badari
* Re: fallocate support for bitmap-based files 2007-07-02 17:44 ` Badari Pulavarty @ 2007-07-06 21:33 ` Mike Waychison 2007-07-07 2:05 ` Badari Pulavarty 0 siblings, 1 reply; 19+ messages in thread From: Mike Waychison @ 2007-07-06 21:33 UTC (permalink / raw) To: Badari Pulavarty Cc: cmm, Andrew Morton, Theodore Ts'o, Andreas Dilger, Sreenivasa Busam, linux-ext4@vger.kernel.org Badari Pulavarty wrote: > On Sat, 2007-06-30 at 10:13 -0400, Mingming Cao wrote: >> On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote: >>> Guys, Mike and Sreenivasa at google are looking into implementing >>> fallocate() on ext2. Of course, any such implementation could and should >>> also be portable to ext3 and ext4 bitmapped files. >>> >>> I believe that Sreenivasa will mainly be doing the implementation work. >>> >>> >>> The basic plan is as follows: >>> >>> - Create (with tune2fs and mke2fs) a hidden file using one of the >>> reserved inode numbers. That file will be sized to have one bit for each >>> block in the partition. Let's call this the "unwritten block file". >>> >>> The unwritten block file will be initialised with all-zeroes >>> >>> - at fallocate()-time, allocate the blocks to the user's file (in some >>> yet-to-be-determined fashion) and, for each one which is uninitialised, >>> set its bit in the unwritten block file. The set bit means "this block >>> is uninitialised and needs to be zeroed out on read". >>> >>> - truncate() would need to clear out set-bits in the unwritten blocks file. >>> >>> - When the fs comes to read a block from disk, it will need to consult >>> the unwritten blocks file to see if that block should be zeroed by the >>> CPU. >>> >>> - When the unwritten-block is written to, its bit in the unwritten blocks >>> file gets zeroed. >>> >>> - An obvious efficiency concern: if a user file has no unwritten blocks >>> in it, we don't need to consult the unwritten blocks file. >>> >>> Need to work out how to do this. 
An obvious solution would be to have >>> a number-of-unwritten-blocks counter in the inode. But do we have space >>> for that? >>> >>> (I expect google and others would prefer that the on-disk format be >>> compatible with legacy ext2!) >>> >>> - One concern is the following scenario: >>> >>> - Mount fs with "new" kernel, fallocate() some blocks to a file. >>> >>> - Now, mount the fs under "old" kernel (which doesn't understand the >>> unwritten blocks file). >>> >>> - This kernel will be able to read uninitialised data from that >>> fallocated-to file, which is a security concern. >>> >>> - Now, the "old" kernel writes some data to a fallocated block. But >>> this kernel doesn't know that it needs to clear that block's flag in >>> the unwritten blocks file! >>> >>> - Now mount that fs under the "new" kernel and try to read that file. >>> The flag for the block is set, so this kernel will still zero out the >>> data on a read, thus corrupting the user's data >>> >>> So how to fix this? Perhaps with a per-inode flag indicating "this >>> inode has unwritten blocks". But to fix this problem, we'd require that >>> the "old" kernel clear out that flag. >>> >>> Can anyone propose a solution to this? >>> >>> Ah, I can! Use the compatibility flags in such a way as to prevent the >>> "old" kernel from mounting this filesystem at all. To mount this fs >>> under an "old" kernel the user will need to run some tool which will >>> >>> - read the unwritten blocks file >>> >>> - for each set-bit in the unwritten blocks file, zero out the >>> corresponding block >>> >>> - zero out the unwritten blocks file >>> >>> - rewrite the superblock to indicate that this fs may now be mounted >>> by an "old" kernel. >>> >>> Sound sane? >>> >>> - I'm assuming that there are more reserved inodes available, and that >>> the changes to tune2fs and mke2fs will be basically a copy-n-paste job >>> from the `tune2fs -j' code. Correct? 
>>> >>> - I haven't thought about what fsck changes would be needed. >>> >>> Presumably quite a few. For example, fsck should check that set-bits >>> in the unwriten blobks file do not correspond to freed blocks. If they >>> do, that should be fixed up. >>> >>> And fsck can check each inodes number-of-unwritten-blocks counters >>> against the unwritten blocks file (if we implement the per-inode >>> number-of-unwritten-blocks counter) >>> >>> What else should fsck do? >>> >>> - I haven't thought about the implications of porting this into ext3/4. >>> Probably the commit to the unwritten blocks file will need to be atomic >>> with the commit to the user's file's metadata, so the unwritten-blocks >>> file will effectively need to be in journalled-data mode. >>> >>> Or, more likely, we access the unwritten blocks file via the blockdev >>> pagecache (ie: use bmap, like the journal file) and then we're just >>> talking direct to the disk's blocks and it becomes just more fs metadata. >>> >>> - I guess resize2fs will need to be taught about the unwritten blocks >>> file: to shrink and grow it appropriately. >>> >>> >>> That's all I can think of for now - I probably missed something. >>> >>> Suggestions and thought are sought, please. >>> >>> >> Another approach we have been thinking is using a backing >> inode(per-inode-with-preallocation) to store the preallocated blocks. >> When user asked for preallocation on the base inode, ext2/3 create a >> temporary backing inode, and it's (pre)allocate the >> corresponding blocks in the backing inode. >> >> When writes to the base inode, and realize we need to block allocation >> on, before doing the fs real block allocation, it will check if the file >> has a backing inode stores some preallocated blocks for the same logical >> blocks. If so, it will transfer the preallocated blocks from backing >> inode to the base inode. 
>>
>> We need to link the two inodes in some way, maybe store the backing
>> inode number via EA in the base inode, and flag the base inode that it
>> has a backing inode to get preallocated blocks.
>>
>> Since it doesn't change the block mapping on the original file until
>> writeout, so it doesn't require a incompat feature to protect the
>> preallocated contents to be read in "old" kernel. There some work need
>> to be done in e2fsck to understand the backing inode.
>>
>
> Small detail - we need to mark size of the backing inode to zero --
> so that if we ever boot on older kernel, we will not be able to read
> the contents of that inode. (Of course, this also means that fsck
> would remove that inode if we run fscheck).
>

One downside of moving this data over to a backing inode is that we lose
the benefit of making a large preallocation followed by a series of
random writes that still results in in-order data on disk. I presume
we'd be scanning the backing inode for free data blocks?

Unless, of course, we make the backing inode an effective 'negative' of
the holes in the actual inode: each hole introduced in the actual inode
would have actual storage in its backing inode at the same logical block
offsets.

Another problem I can think of with this approach is that we'd have
difficulty reclaiming the metadata indirect blocks from the backing
inode efficiently. So if a user went and pre-allocated, say, 1GB of disk
space for a file, we'd end up with the ~0.1% metadata overhead doubled
until the i_blocks for the backing inode hits zero (meaning all
pre-allocated blocks were dirtied and the backing inode can be freed).
This may not be an issue in the real world.

Mike Waychison
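The roughly 0.1% metadata figure above can be checked with a short calculation. The sketch below counts the indirect blocks that an ext2-style block map (12 direct pointers, then single, double and triple indirect trees) needs for a given number of data blocks. The function names are hypothetical, and the 1 GiB example assumes 4 KiB blocks with 4-byte block pointers, i.e. 1024 pointers per indirect block.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t ceil_div(uint64_t a, uint64_t b)
{
    return (a + b - 1) / b;
}

/* Indirect (metadata) blocks needed to map data_blocks data blocks when each
 * indirect block holds p pointers and the inode has 12 direct pointers. */
static uint64_t indirect_blocks(uint64_t data_blocks, uint64_t p)
{
    uint64_t meta = 0, left, dbl, leaves;

    assert(p > 0);
    if (data_blocks <= 12)
        return 0;
    left = data_blocks - 12;

    meta += 1;                         /* the single-indirect block */
    if (left <= p)
        return meta;
    left -= p;

    dbl = left < p * p ? left : p * p; /* data blocks under the double tree */
    meta += 1 + ceil_div(dbl, p);      /* top block plus one leaf per p blocks */
    left -= dbl;
    if (left == 0)
        return meta;

    assert(left <= p * p * p);         /* must fit under the triple tree */
    leaves = ceil_div(left, p);
    meta += 1 + ceil_div(leaves, p) + leaves;
    return meta;
}
```

For a 1 GiB file under those assumptions, indirect_blocks(262144, 1024) gives 257 metadata blocks against 262144 data blocks, about 0.098%, so keeping a second block map in a backing inode would at worst double that to about 0.2%.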
* Re: fallocate support for bitmap-based files
2007-07-06 21:33 ` Mike Waychison
@ 2007-07-07 2:05 ` Badari Pulavarty
0 siblings, 0 replies; 19+ messages in thread
From: Badari Pulavarty @ 2007-07-07 2:05 UTC (permalink / raw)
To: Mike Waychison
Cc: cmm, Andrew Morton, Theodore Ts'o, Andreas Dilger, Sreenivasa Busam, linux-ext4@vger.kernel.org

On Fri, 2007-07-06 at 14:33 -0700, Mike Waychison wrote:
> Badari Pulavarty wrote:
> > On Sat, 2007-06-30 at 10:13 -0400, Mingming Cao wrote:
> >> On Fri, 2007-06-29 at 13:01 -0700, Andrew Morton wrote:
> >>> Guys, Mike and Sreenivasa at google are looking into implementing
> >>> fallocate() on ext2. Of course, any such implementation could and should
> >>> also be portable to ext3 and ext4 bitmapped files.
> >>>
> >>> I believe that Sreenivasa will mainly be doing the implementation work.
> >>>
> >>>
> >>> The basic plan is as follows:
> >>>
> >>> - Create (with tune2fs and mke2fs) a hidden file using one of the
> >>> reserved inode numbers. That file will be sized to have one bit for each
> >>> block in the partition. Let's call this the "unwritten block file".
> >>>
> >>> The unwritten block file will be initialised with all-zeroes
> >>>
> >>> - at fallocate()-time, allocate the blocks to the user's file (in some
> >>> yet-to-be-determined fashion) and, for each one which is uninitialised,
> >>> set its bit in the unwritten block file. The set bit means "this block
> >>> is uninitialised and needs to be zeroed out on read".
> >>>
> >>> - truncate() would need to clear out set-bits in the unwritten blocks file.
> >>>
> >>> - When the fs comes to read a block from disk, it will need to consult
> >>> the unwritten blocks file to see if that block should be zeroed by the
> >>> CPU.
> >>>
> >>> - When the unwritten-block is written to, its bit in the unwritten blocks
> >>> file gets zeroed.
> >>>
> >>> - An obvious efficiency concern: if a user file has no unwritten blocks
> >>> in it, we don't need to consult the unwritten blocks file.
> >>>
> >>> Need to work out how to do this. An obvious solution would be to have
> >>> a number-of-unwritten-blocks counter in the inode. But do we have space
> >>> for that?
> >>>
> >>> (I expect google and others would prefer that the on-disk format be
> >>> compatible with legacy ext2!)
> >>>
> >>> - One concern is the following scenario:
> >>>
> >>> - Mount fs with "new" kernel, fallocate() some blocks to a file.
> >>>
> >>> - Now, mount the fs under "old" kernel (which doesn't understand the
> >>> unwritten blocks file).
> >>>
> >>> - This kernel will be able to read uninitialised data from that
> >>> fallocated-to file, which is a security concern.
> >>>
> >>> - Now, the "old" kernel writes some data to a fallocated block. But
> >>> this kernel doesn't know that it needs to clear that block's flag in
> >>> the unwritten blocks file!
> >>>
> >>> - Now mount that fs under the "new" kernel and try to read that file.
> >>> The flag for the block is set, so this kernel will still zero out the
> >>> data on a read, thus corrupting the user's data
> >>>
> >>> So how to fix this? Perhaps with a per-inode flag indicating "this
> >>> inode has unwritten blocks". But to fix this problem, we'd require that
> >>> the "old" kernel clear out that flag.
> >>>
> >>> Can anyone propose a solution to this?
> >>>
> >>> Ah, I can! Use the compatibility flags in such a way as to prevent the
> >>> "old" kernel from mounting this filesystem at all. To mount this fs
> >>> under an "old" kernel the user will need to run some tool which will
> >>>
> >>> - read the unwritten blocks file
> >>>
> >>> - for each set-bit in the unwritten blocks file, zero out the
> >>> corresponding block
> >>>
> >>> - zero out the unwritten blocks file
> >>>
> >>> - rewrite the superblock to indicate that this fs may now be mounted
> >>> by an "old" kernel.
> >>>
> >>> Sound sane?
> >>>
> >>> - I'm assuming that there are more reserved inodes available, and that
> >>> the changes to tune2fs and mke2fs will be basically a copy-n-paste job
> >>> from the `tune2fs -j' code. Correct?
> >>>
> >>> - I haven't thought about what fsck changes would be needed.
> >>>
> >>> Presumably quite a few. For example, fsck should check that set-bits
> >>> in the unwriten blobks file do not correspond to freed blocks. If they
> >>> do, that should be fixed up.
> >>>
> >>> And fsck can check each inodes number-of-unwritten-blocks counters
> >>> against the unwritten blocks file (if we implement the per-inode
> >>> number-of-unwritten-blocks counter)
> >>>
> >>> What else should fsck do?
> >>>
> >>> - I haven't thought about the implications of porting this into ext3/4.
> >>> Probably the commit to the unwritten blocks file will need to be atomic
> >>> with the commit to the user's file's metadata, so the unwritten-blocks
> >>> file will effectively need to be in journalled-data mode.
> >>>
> >>> Or, more likely, we access the unwritten blocks file via the blockdev
> >>> pagecache (ie: use bmap, like the journal file) and then we're just
> >>> talking direct to the disk's blocks and it becomes just more fs metadata.
> >>>
> >>> - I guess resize2fs will need to be taught about the unwritten blocks
> >>> file: to shrink and grow it appropriately.
> >>>
> >>>
> >>> That's all I can think of for now - I probably missed something.
> >>>
> >>> Suggestions and thought are sought, please.
> >>>
> >>>
> >> Another approach we have been thinking is using a backing
> >> inode(per-inode-with-preallocation) to store the preallocated blocks.
> >> When user asked for preallocation on the base inode, ext2/3 create a
> >> temporary backing inode, and it's (pre)allocate the
> >> corresponding blocks in the backing inode.
> >>
> >> When writes to the base inode, and realize we need to block allocation
> >> on, before doing the fs real block allocation, it will check if the file
> >> has a backing inode stores some preallocated blocks for the same logical
> >> blocks. If so, it will transfer the preallocated blocks from backing
> >> inode to the base inode.
> >>
> >> We need to link the two inodes in some way, maybe store the backing
> >> inode number via EA in the base inode, and flag the base inode that it
> >> has a backing inode to get preallocated blocks.
> >>
> >> Since it doesn't change the block mapping on the original file until
> >> writeout, so it doesn't require a incompat feature to protect the
> >> preallocated contents to be read in "old" kernel. There some work need
> >> to be done in e2fsck to understand the backing inode.
> >>
> >
> > Small detail - we need to mark size of the backing inode to zero --
> > so that if we ever boot on older kernel, we will not be able to read
> > the contents of that inode. (Ofcourse, this also means that fsck
> > would remove that inode if we run fscheck).
> >
>
> One downside of moving this data over to a backing inode is that we lose
> the benefit of making large pre allocations following by a series of
> random writes that result in in-ordered data on disk. I presume we'd be
> scanning the backing inode for free data blocks?
> Unless of course if we make the backing inode be an effective 'negative'
> of the holes in the actual inode. Each hole introduced in the actual
> inode would have it's backing inode have actual storage at the same
> logical block offsets.
>

What we considered at the time was to allocate the backing inode at the
time of the preallocate call. Then, when we need a block in the real
inode, we grab the corresponding block from the backing inode. So once we
preallocate, the sequential on-disk layout is preserved even if we fill
the real inode through random writes.

> Another problem I can think of with this approach is that we'd have
> difficult reclaiming the metadata indirect blocks from the backing inode
> efficiently. So if a user went and pre-allocated say 1GB of disk space
> for a file, we'd end up with the ~%0.1 metadata overhead doubled until
> we see the i_blocks for the backing inode hit zero (meaning all
> pre-allocated blocks were dirtied and backing inode can be freed). May
> not be an issue in the real world..

I am not sure it's a problem worth solving for real-world use cases.

Thanks,
Badari

^ permalink raw reply [flat|nested] 19+ messages in thread
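The backing-inode scheme discussed above can be illustrated with a toy model. This is only a sketch under stated assumptions: each inode is reduced to a logical-to-physical block map, and the names Inode, preallocate and get_block_for_write are hypothetical, not ext2/ext3 functions. It shows why random writes still end up with a sequential physical layout when the backing inode holds blocks at the same logical offsets.

```python
class Inode:
    """Toy inode: just a logical -> physical block map plus an optional
    link to a backing inode (hypothetical model, not the on-disk format)."""
    def __init__(self):
        self.blocks = {}
        self.backing = None

def preallocate(base, first_physical, start, count):
    """Reserve `count` contiguous physical blocks in a backing inode,
    keyed by the logical offsets they will eventually serve."""
    backing = Inode()
    for i in range(count):
        backing.blocks[start + i] = first_physical + i
    base.backing = backing
    return backing

def get_block_for_write(base, logical, allocate):
    """Return the physical block for a write to `logical`, transferring it
    from the backing inode when one was preallocated for that offset."""
    if logical in base.blocks:
        return base.blocks[logical]          # already mapped
    backing = base.backing
    if backing is not None and logical in backing.blocks:
        physical = backing.blocks.pop(logical)
        if not backing.blocks:
            base.backing = None              # all blocks consumed: backing inode can go
    else:
        physical = allocate()                # fall back to normal allocation
    base.blocks[logical] = physical
    return physical

# Usage: preallocate 8 blocks, then write them in random order.
free_blocks = iter(range(5000, 6000))        # stand-in for the regular allocator
base = Inode()
preallocate(base, first_physical=1000, start=0, count=8)
for logical in [5, 2, 7, 0, 3, 1, 6, 4]:
    get_block_for_write(base, logical, lambda: next(free_blocks))
outside = get_block_for_write(base, 20, lambda: next(free_blocks))
```

In this model the backing inode disappears once its last block has been transferred, which corresponds to the point where i_blocks for the backing inode reaches zero.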
end of thread, other threads:[~2007-07-07 2:03 UTC | newest]

Thread overview: 19+ messages
2007-06-29 20:01 fallocate support for bitmap-based files Andrew Morton
2007-06-29 20:36 ` Dave Kleikamp
2007-06-29 20:52 ` Mike Waychison
2007-06-29 21:24 ` Dave Kleikamp
2007-06-29 20:55 ` Theodore Tso
2007-06-29 21:38 ` Andrew Morton
2007-06-29 22:07 ` Mike Waychison
2007-07-04 23:11 ` Valerie Henson
2007-07-06 21:15 ` Mike Waychison
2007-06-29 21:46 ` Andreas Dilger
2007-06-29 22:26 ` Mike Waychison
2007-06-30 5:14 ` Andreas Dilger
2007-06-30 14:31 ` Mingming Cao
2007-06-30 14:13 ` Mingming Cao
2007-06-30 17:29 ` Andreas Dilger
2007-07-02 14:44 ` Mingming Cao
2007-07-02 17:44 ` Badari Pulavarty
2007-07-06 21:33 ` Mike Waychison
2007-07-07 2:05 ` Badari Pulavarty