* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Alexey Zaytsev @ 2008-04-19 9:44 UTC
To: Theodore Tso; +Cc: linux-ext4, linux-fsdevel, Rik van Riel

On Sat, Apr 19, 2008 at 5:29 AM, Theodore Tso <tytso@mit.edu> wrote:
> On Fri, Apr 18, 2008 at 06:20:14PM +0400, Alexey Zaytsev wrote:
> > Hello, guys.
> >
> > It seems like the Linux Foundation was not able to find a mentor for
> > my project. If somebody is willing to mentor this project through the
> > Google Summer of Code, please contact Rik and me now, as little
> > time is left.
> >
> > A link to the application:
> > http://rom.etherboot.org/share/xl0/gsoc2008/application-linux-foundation.txt
>
> Hi Alexey,
>
> I really don't think your project is likely to be successful given the
> 3 month timeframe of a GSoC. At least not without a mentor spending
> vast amounts of time educating you about how things work within ext2
> and e2fsck. Even given some broad hints about problems that you need
> to address, you still have not addressed how you will solve
> fundamental race conditions resulting from trying to read the multiple
> blocks scattered all over the disk which comprise allocation bitmap
> blocks while allocations might be taking place, for example.
>
> Your approach of monitoring writes to the buffer cache for metadata
> writes is completely busted; suppose the kernel modifies block #12345
> in the filesystem; how do you know what that means? Could that be an
> indirect block? If so, to which inode does it belong?

Sorry, I still don't understand where the problem is.

If it is a block containing a metadata object fsck has already read,
then we already know what kind of object it is (there must be a way
to quickly find all cached objects derived from a given block), and
can update the cached version. And if fsck has not yet read the
block, it can just be ignored, no matter what kind of data it
contains. If it contains metadata and fsck is interested in it, it
will read it sooner or later anyway. If it contains file data, why
should fsck even care?

And you are wrong if you think this problem never occurred to me. This is
in fact what motivated the design, and it is no coincidence that it is not
affected (well, at least I think it is not affected).

But you are probably right, this project may not be doable in just three
months. The changes on the kernel side probably are, but there is a
huge amount of e2fsck work.

> If all you are
> doing is monitoring metadata blocks, you would have no idea! The fact
> that it apparently didn't even occur to you that this might be a
> show-stopping problem scares the heck out of me. It leads me to
> believe that this project is very likely to fail, and/or will require
> vast amounts of time from the mentor. Unfortunately, the former is
> something that I just don't have this summer.
>
> Regards,
>
> - Ted
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Theodore Tso @ 2008-04-19 18:56 UTC
To: Alexey Zaytsev; +Cc: linux-ext4, linux-fsdevel, Rik van Riel

On Sat, Apr 19, 2008 at 01:44:51PM +0400, Alexey Zaytsev wrote:
> If it is a block containing a metadata object fsck has already read,
> then we already know what kind of object it is (there must be a way
> to quickly find all cached objects derived from a given block), and
> can update the cached version. And if fsck has not yet read the
> block, it can just be ignored, no matter what kind of data it
> contains. If it contains metadata and fsck is interested in it, it
> will read it sooner or later anyway. If it contains file data, why
> should fsck even care?

The problem is that e2fsck makes calculations on the filesystem data
read out from the disk and stores the results in a highly compressed
format. So it doesn't remember that block #12345 was an indirect block
for inode #123, and that it contained data block numbers 17, 42, and 45.
Instead it just marks blocks #12345, #17, #42, and #45 as in use, and
then moves on.

If you are going to store all of the cached objects then you will need
to effectively store *all* of the filesystem metadata in memory at the
same time. For a large filesystem, you won't have enough *room* in
memory to store all of the cached objects. That's one of the reasons why
e2fsck has a lot of very clever design so that summary information can
be stored in a very compressed form in memory, so that things can be
fast (by avoiding re-reading objects from disk) as well as not requiring
vast amounts of memory.

Even if you *do* store all of the cached objects, it still takes time
to examine all of the objects, and in the meantime more changes will
have come rolling in, and you will either need to add a huge amount of
dependency tracking to figure out what internal data structures need to
be updated based on the changes in some of the cached objects --- or you
will end up restarting the e2fsck checking process from scratch.

In either case, there is still the issue of knowing exactly whether a
particular read happened before or after some change in the filesystem.
This race condition is a really hard one to deal with, especially on a
multi-CPU system with the filesystem checker running in userspace.

> But you are probably right, this project may be not doable in just three
> months. The changes on the kernel side probably are, but there is a
> huge e2fsck work.

Yes, that is the concern. And without implementing the user-space
side, you'll never be sure whether you completely got the kernel side
changes right!

Regards,

- Ted
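[Editor's sketch] The "compressed summary" point above can be illustrated with a toy model. This is not e2fsck's actual data structure; the class and function names below are hypothetical, and the only thing taken from the message is the example of indirect block #12345 listing data blocks 17, 42, and 45. The sketch shows why a one-bit-per-block summary cannot answer "what kind of block was this, and which inode owned it?" once the scan has moved on.

```python
# Illustrative only: a toy "blocks in use" summary, not e2fsck's real structures.

class BlockBitmap:
    """One bit per filesystem block; records only 'in use' or 'free'."""

    def __init__(self, block_count):
        self.bits = bytearray((block_count + 7) // 8)

    def mark(self, block):
        self.bits[block // 8] |= 1 << (block % 8)

    def test(self, block):
        return bool(self.bits[block // 8] & (1 << (block % 8)))


def scan_indirect_block(bitmap, indirect_block, data_blocks):
    """Record that an indirect block and the blocks it lists are in use.

    The owning inode and the block's role are deliberately not stored,
    which is what keeps the summary small.
    """
    bitmap.mark(indirect_block)
    for block in data_blocks:
        bitmap.mark(block)


bitmap = BlockBitmap(block_count=20000)
# Inode #123 owns indirect block #12345, which lists data blocks 17, 42 and 45.
scan_indirect_block(bitmap, 12345, [17, 42, 45])
```

After the call, `bitmap.test(12345)` says the block is in use, but nothing in the bitmap says it was an indirect block of inode #123. A later write to block #12345 therefore cannot be mapped back to the objects it affects without keeping far more state than one bit per block.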
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Eric Sandeen @ 2008-04-19 19:07 UTC
To: Theodore Tso; +Cc: Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

Theodore Tso wrote:
> On Sat, Apr 19, 2008 at 01:44:51PM +0400, Alexey Zaytsev wrote:
>> If it is a block containing a metadata object fsck has already read,
>> then we already know what kind of object it is (there must be a way
>> to quickly find all cached objects derived from a given block), and
>> can update the cached version. And if fsck has not yet read the
>> block, it can just be ignored, no matter what kind of data it
>> contains. If it contains metadata and fsck is interested in it, it
>> will read it sooner or later anyway. If it contains file data, why
>> should fsck even care?

It seems to me that what the proposed project really does, in essence,
is a read-only check of a filesystem snapshot. It's just that the
snapshot is proposed to be constructed in a complex and non-generic (and
maybe impossible) way.

If you really just want to verify a snapshot of the fs at a point in
time, surely there are simpler ways. If the device is on lvm, there's
already a script floating around to do it in automated fashion. (I'd
pondered the idea of introducing META_WRITE (to go with META_READ) and
maybe lvm could do a "metadata-only" snapshot to be lighter weight?)

-Eric
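[Editor's sketch] The snapshot-based check described above might look roughly like the following. This is not the script Eric refers to; it is a minimal sketch that only builds the three command lines (create an LVM snapshot, run a forced read-only e2fsck on it, remove the snapshot). The volume group `vg0`, logical volume `home`, the snapshot name, and the copy-on-write size are all hypothetical placeholders, and nothing is executed.

```python
def snapshot_check_commands(vg, lv, snap_name="fsck-snap", cow_size="1G"):
    """Return the argv lists for a snapshot-based read-only check.

    vg, lv, snap_name and cow_size are illustrative values chosen by the
    caller; this function only builds the commands and runs nothing.
    """
    origin = f"/dev/{vg}/{lv}"
    snapshot = f"/dev/{vg}/{snap_name}"
    return [
        # Create a copy-on-write snapshot of the origin volume.
        ["lvcreate", "--snapshot", "--size", cow_size, "--name", snap_name, origin],
        # -f forces a full check, -n opens read-only and answers "no" to all prompts.
        ["e2fsck", "-f", "-n", snapshot],
        # Drop the snapshot once the check has finished.
        ["lvremove", "--force", snapshot],
    ]


for argv in snapshot_check_commands("vg0", "home"):
    print(" ".join(argv))
```

A real script would run these with root privileges, check the exit status of e2fsck, and make sure the snapshot is removed even if the check fails. The copy-on-write size has to cover all writes to the origin volume for the duration of the check.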
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Theodore Tso @ 2008-04-19 22:04 UTC
To: Eric Sandeen; +Cc: Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

On Sat, Apr 19, 2008 at 02:07:34PM -0500, Eric Sandeen wrote:
>
> It seems to me that what the proposed project really does, in essence,
> is a read-only check of a filesystem snapshot. It's just that the
> snapshot is proposed to be constructed in a complex and non-generic (and
> maybe impossible) way.

That's not a bad way of thinking about it; except that the snapshot is
being maintained in userspace, without any discussion of some kind of
filesystem-level freeze (which would be hard because the freeze, in
the best case, would take as long as e2image -r would take --- which
is roughly the time required for e2fsck's pass1, which is in general
approximately 70% of the e2fsck run-time.)

> If you really just want to verify a snapshot of the fs at a point in
> time, surely there are simpler ways. If the device is on lvm, there's
> already a script floating around to do it in automated fashion. (I'd
> pondered the idea of introducing META_WRITE (to go with META_READ) and
> maybe lvm could do a "metadata-only" snapshot to be lighter weight?)

That would be great, although I think the major issue is not
necessarily the performance problems of using an LVM snapshot on a
very busy filesystem (although I could imagine for some users this
might be an issue), but rather filesystem devices that aren't
using LVM at all. (I've heard some complaints that LVM imposes a
performance penalty even if you aren't using a snapshot; has anyone
done any benchmarks of a filesystem with and without LVM to see
whether or not there really is a significant performance penalty?
Whether or not there really is one, the perception is definitely out
there that there is.)

If we could do a lightweight snapshot that didn't require an LVM, that
would be really great. But that's probably not an ext4 project, and
I'm not sure that it would be considered politically correct in the
LKML community.

- Ted
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Eric Sandeen @ 2008-04-20 1:24 UTC
To: Theodore Tso; +Cc: Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

Theodore Tso wrote:
> On Sat, Apr 19, 2008 at 02:07:34PM -0500, Eric Sandeen wrote:

>> If you really just want to verify a snapshot of the fs at a point in
>> time, surely there are simpler ways. If the device is on lvm, there's
>> already a script floating around to do it in automated fashion. (I'd
>> pondered the idea of introducing META_WRITE (to go with META_READ) and
>> maybe lvm could do a "metadata-only" snapshot to be lighter weight?)
>
> That would be great, although I think the major issue is not
> necessarily the performance problems of using an LVM snapshot on a
> very busy filesystem

Well, backing space for the snapshot could be an issue too. Basically,
if you're only using it for this purpose, why COW all the post-snapshot
data if you just don't care...

> (although I could imagine for some users this
> might be an issue), but rather for filesystem devices that aren't
> using LVM at all. (I've heard some complaints that LVM imposes a
> performance penalty even if you aren't using a snapshot; has anyone
> done any benchmarks of a filesystem with and without LVM to see
> whether or not there really is a significant performance penalty;
> whether or not there really is one, the perception is definitely out
> there that it does.)

I've heard from someone who did some testing about a minor penalty, but
I can't point to any published test so I guess that's just more hearsay.
It's intuitive that putting lvm on top of a block device might not be
absolutely, 100% free, though.... Adds to stack, too.

> If we could do a lightweight snapshot that didn't require an LVM, that
> would be really great. But that's probably not an ext4 project, and
> I'm not sure the it would be considered politically correct in the
> LKML community.

Yep; my original reply wished for something about non-lvm
snapshots but... while yes, it'd be nice for this purpose, ponies for
everyone would be nice too... :) But I didn't mention it because... how
do you do a generic non-lvm snapshot of, say, /dev/sda3 without some
sort of volume manager...? If there's some clever idea that could be
implemented cleanly, I'd be all ears. :)

-Eric
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Andi Kleen @ 2008-04-20 23:30 UTC
To: Theodore Tso
Cc: Eric Sandeen, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

Theodore Tso <tytso@mit.edu> writes:

> That would be great, although I think the major issue is not
> necessarily the performance problems of using an LVM snapshot on a
> very busy filesystem (although I could imagine for some users this
> might be an issue), but rather for filesystem devices that aren't
> using LVM at all. (I've heard some complaints that LVM imposes a
> performance penalty even if you aren't using a snapshot; has anyone

It always disables barriers if you don't apply a so far unmerged patch
that enables them in some special circumstances (only a single
backing device).

Not having barriers sometimes makes your workloads faster (and less
safe) and in other cases slower.

-Andi
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Jamie Lokier @ 2008-04-20 23:42 UTC
To: Andi Kleen
Cc: Theodore Tso, Eric Sandeen, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

Andi Kleen wrote:
> [LVM] always disables barriers if you don't apply a so far unmerged
> patch that enables them in some special circumstances (only single
> backing device)

(I continue to be surprised at the un-safety of Linux fsync)

> Not having barriers sometimes makes your workloads faster (and less
> safe) and in other cases slower.

I'm curious, how does it make them slower? Merely not issuing barrier
calls seems like it will always be the same speed or faster.

Thanks,
-- Jamie
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Andi Kleen @ 2008-04-21 8:01 UTC
To: Andi Kleen, Theodore Tso, Eric Sandeen, Alexey Zaytsev, linux-ext4, linux-fsdevel

On Mon, Apr 21, 2008 at 12:42:42AM +0100, Jamie Lokier wrote:
> Andi Kleen wrote:
> > [LVM] always disables barriers if you don't apply a so far unmerged
> > patch that enables them in some special circumstances (only single
> > backing device)
>
> (I continue to be surprised at the un-safety of Linux fsync)

Note that barrier-less does not necessarily always mean unsafe fsync,
it just often means that.

Also, surprisingly, a lot more syncs or write cache off tend to lower
the MTBF of your disk significantly, so "unsafer" fsync might actually
be more safe for your un-backed-up data.

> > Not having barriers sometimes makes your workloads faster (and less
> > safe) and in other cases slower.
>
> I'm curious, how does it make them slower? Merely not issuing barrier
> calls seems like it will always be the same speed or faster.

Some setups detect the no-barrier case and switch to full sync +
wait (or write cache off), which, depending on whether the disk
supports NCQ, can be slower.

-Andi
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Jamie Lokier @ 2008-04-21 11:51 UTC
To: Andi Kleen
Cc: Theodore Tso, Eric Sandeen, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

Andi Kleen wrote:
> On Mon, Apr 21, 2008 at 12:42:42AM +0100, Jamie Lokier wrote:
> > Andi Kleen wrote:
> > > [LVM] always disables barriers if you don't apply a so far unmerged
> > > patch that enables them in some special circumstances (only single
> > > backing device)
> >
> > (I continue to be surprised at the un-safety of Linux fsync)
>
> Note barrier less does not necessarily always mean unsafe fsync,
> it just often means that.
>
> Also surprisingly lot more syncs or write cache off tend to lower the MTBF
> of your disk significantly, so "unsafer" fsync might actually be more safe
> for your unbackuped data.

That's really interesting, thanks. Do you have something to cite
about syncs reducing the MTBF?

(I'm really glad I added barriers instead of write cache off to my
2.4.26-based disk-using devices now ;-) )

> > > Not having barriers sometimes makes your workloads faster (and less
> > > safe) and in other cases slower.
> >
> > I'm curious, how does it make them slower? Merely not issuing barrier
> > calls seems like it will always be the same speed or faster.
>
> Some setups detect the no barrier case and switch to full sync +
> wait (or write cache off) which depending on the disk supporting NCQ
> can be slower.

But issuing full syncs is implemented as barrier calls in the block
request layers, isn't it? The filesystem isn't given a facility to
request that the block device do full syncs or disable the write cache.

So when a blockdev doesn't offer barriers to the filesystem, it means
the driver doesn't support full syncs or cache disabling either, since
if it did, the request layer would expose them to the fs as barriers.

What am I missing from this picture? Do you mean that manual setup
(such as by a DBA) tends to disable the write cache?

Thanks,
-- Jamie
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Ricardo M. Correia @ 2008-04-21 17:29 UTC
To: Andi Kleen
Cc: Theodore Tso, Eric Sandeen, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

(sorry if this is a duplicate, my previous email was rejected)

Hi Andi,

On Mon, 2008-04-21 at 10:01 +0200, Andi Kleen wrote:
> > (I continue to be surprised at the un-safety of Linux fsync)
>
> Note barrier less does not necessarily always mean unsafe fsync,
> it just often means that.

Am I correct that the Linux fsync(), when used (from userspace)
directly on file descriptors associated with block devices, doesn't
actually flush the disk write cache and wait for the data to reach the
disk before returning?

Is there a reason why this isn't being done other than performance?

I would imagine that the only reason a process is using fsync() is
because it is worried about data loss, and therefore is perfectly
willing to lose some performance if necessary.

Regards,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Andi Kleen @ 2008-04-21 17:40 UTC
To: Ricardo M. Correia
Cc: Theodore Tso, Eric Sandeen, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

> Am I correct that the Linux fsync(), when used (from userspace)
> directly on file descriptors associated with block devices doesn't
> actually flush the disk write cache and wait for the data to reach the
> disk before returning?

Not quite. It depends. Sometimes it does this and sometimes it doesn't,
depending on the disk and the controller and the file system and the
kernel version and the distribution default.

For details search the archives of linux-kernel/linux-fsdevel. This has
been discussed many times.

> Is there a reason why this isn't being done other than performance?

One reason against it is that in many (but not all) setups to guarantee
reaching the platter you have to disable the write cache, and at least
for consumer level hard disks disk vendors generally do not recommend
doing this because it significantly lowers the MTBF of the disk.

-Andi
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Ricardo M. Correia @ 2008-04-21 18:27 UTC
To: Andi Kleen
Cc: Theodore Tso, Eric Sandeen, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

On Mon, 2008-04-21 at 19:40 +0200, Andi Kleen wrote:
> > Is there a reason why this isn't being done other than performance?
>
> One reason against it is that in many (but not all) setups to guarantee
> reaching the platter you have to disable the write cache, and at least
> for consumer level hard disks disk vendors generally do not recommend
> doing this because it significantly lowers the MTBF of the disk.

I understand that, but if the disk/storage doesn't support flushing the
cache, I would expect fsync() to return EIO or ENOTSUP. I wouldn't
expect it to ignore my request and risk losing data without my
knowledge.

I know fsync() also flushes dirty buffers, but IMHO even if it flushes
the buffers it'd be better to return an error if a full sync wasn't
being done rather than returning success and misleading the application.

Anyway, sorry if this has been discussed before, I should take a look
at the archives.

Thanks,
Ricardo

--
Ricardo Manuel Correia
Lustre Engineering
Sun Microsystems, Inc.
Portugal
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Jamie Lokier @ 2008-04-22 14:48 UTC
To: Andi Kleen
Cc: Ricardo M. Correia, Theodore Tso, Eric Sandeen, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

Andi Kleen wrote:
> > Is there a reason why this isn't being done other than performance?
>
> One reason against it is that in many (but not all) setups to guarantee
> reaching the platter you have to disable the write cache, and at least
> for consumer level hard disks disk vendors generally do not recommend
> doing this because it significantly lowers the MTBF of the disk.

I think the MTBF argument is a bit spurious, because guaranteeing that
data reaches the platter is possible with all modern disks, given the
appropriate kernel changes, and does not require the write cache to be
disabled.

TBH, I think the reason is that it's simply never been implemented.
There are other strategies for mitigating data loss, after all, and
filesystem structure is not at risk; barriers are fine for that.

Right now, you have the choice of 'disable write cache' or 'fsync
flushes sometimes but not always, depending on lots of factors'. The
option 'fsync flushes always, write cache enabled' isn't implemented,
though most hardware supports it.

Btw, on Darwin (Mac OS X) it _is_ because of performance that fsync()
doesn't issue a flush to platter. It has an fcntl(F_FULLFSYNC) which is
documented to do the latter.

-- Jamie
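[Editor's sketch] The difference between a plain fsync() and a full flush request can be sketched from userspace as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the macOS fcntl command is spelled `F_FULLFSYNC` in `<fcntl.h>` and in Python's `fcntl` module, and on Linux the helper simply falls back to `os.fsync()`, whose behaviour with respect to the drive's write cache depends on the factors discussed in this thread.

```python
import fcntl
import os
import tempfile


def flush_to_stable_storage(fd):
    """Flush fd as far as the platform allows and report what was used.

    Always calls fsync(); where the fcntl module exposes F_FULLFSYNC
    (macOS), additionally asks the drive to flush its write cache.
    """
    os.fsync(fd)
    full_sync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_sync is not None:
        try:
            fcntl.fcntl(fd, full_sync)
            return "fsync+F_FULLFSYNC"
        except OSError:
            # The filesystem or device may not support the request.
            pass
    return "fsync"


def write_durably(path, data):
    """Write data to path and flush it before returning (hypothetical helper)."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # push Python's userspace buffer to the kernel first
        return flush_to_stable_storage(f.fileno())


with tempfile.TemporaryDirectory() as tmp:
    method = write_durably(os.path.join(tmp, "journal.dat"), b"committed\n")
    print(method)
```

Note that a successful return from `os.fsync()` does not by itself tell the caller whether the data reached the platter, which is exactly the concern raised earlier in the thread.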
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Ric Wheeler @ 2008-04-21 18:15 UTC
To: Andi Kleen
Cc: Theodore Tso, Eric Sandeen, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

Andi Kleen wrote:
> On Mon, Apr 21, 2008 at 12:42:42AM +0100, Jamie Lokier wrote:
>> Andi Kleen wrote:
>>> [LVM] always disables barriers if you don't apply a so far unmerged
>>> patch that enables them in some special circumstances (only single
>>> backing device)
>> (I continue to be surprised at the un-safety of Linux fsync)
>
> Note barrier less does not necessarily always mean unsafe fsync,
> it just often means that.
>
> Also surprisingly lot more syncs or write cache off tend to lower the MTBF
> of your disk significantly, so "unsafer" fsync might actually be more safe
> for your unbackuped data.
>

Hi Andi,

Where did you get this data?

I have never heard that using more barrier operations lowers the
reliability or the MTBF of a drive and I look at a fairly huge
population when doing this ;-)

ric
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Eric Sandeen @ 2008-04-21 18:25 UTC
To: ric
Cc: Andi Kleen, Theodore Tso, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

Ric Wheeler wrote:
>
> Andi Kleen wrote:
>> On Mon, Apr 21, 2008 at 12:42:42AM +0100, Jamie Lokier wrote:
>>> Andi Kleen wrote:
>>>> [LVM] always disables barriers if you don't apply a so far unmerged
>>>> patch that enables them in some special circumstances (only single
>>>> backing device)
>>> (I continue to be surprised at the un-safety of Linux fsync)
>> Note barrier less does not necessarily always mean unsafe fsync,
>> it just often means that.
>>
>> Also surprisingly lot more syncs or write cache off tend to lower the MTBF
>> of your disk significantly, so "unsafer" fsync might actually be more safe
>> for your unbackuped data.
>>
>
> Hi Andi,
>
> Where did you get this data?
>
> I have never heard that using more barrier operations lowers the reliability or
> the MTBF of a drive and I look at a fairly huge population when doing this ;-)

Ric, what about the other part - turning write cache off? I've also
heard it suggested that this might hurt drive lifespan, and it sorta
makes sense, I assume it keeps the head working harder...

-Eric
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Ric Wheeler @ 2008-04-21 18:44 UTC
To: Eric Sandeen
Cc: Andi Kleen, Theodore Tso, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

Eric Sandeen wrote:
> Ric Wheeler wrote:
>> Andi Kleen wrote:
>>> On Mon, Apr 21, 2008 at 12:42:42AM +0100, Jamie Lokier wrote:
>>>> Andi Kleen wrote:
>>>>> [LVM] always disables barriers if you don't apply a so far unmerged
>>>>> patch that enables them in some special circumstances (only single
>>>>> backing device)
>>>> (I continue to be surprised at the un-safety of Linux fsync)
>>> Note barrier less does not necessarily always mean unsafe fsync,
>>> it just often means that.
>>>
>>> Also surprisingly lot more syncs or write cache off tend to lower the MTBF
>>> of your disk significantly, so "unsafer" fsync might actually be more safe
>>> for your unbackuped data.
>>>
>> Hi Andi,
>>
>> Where did you get this data?
>>
>> I have never heard that using more barrier operations lowers the reliability or
>> the MTBF of a drive and I look at a fairly huge population when doing this ;-)
>
> Ric, what about the other part - turning write cache off? I've also
> heard it suggested that this might hurt drive lifespan, and it sorta
> makes sense, I assume it keeps the head working harder...
>
> -Eric

Turning the drive write cache off is the default case for most RAID
products (including our mid and high end arrays).

I have not seen an issue with drives wearing out with either setting
(cache disabled or enabled with barriers).

The theory does make some sense, but does not map into my experience ;-)

ric
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Matthew Wilcox @ 2008-04-21 18:58 UTC
To: Ric Wheeler
Cc: Eric Sandeen, Andi Kleen, Theodore Tso, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

On Mon, Apr 21, 2008 at 02:44:45PM -0400, Ric Wheeler wrote:
> Turning the drive write cache off is the default case for most RAID
> products (including our mid and high end arrays).
>
> I have not seen an issue with drives wearing out with either setting (cache
> disabled or enabled with barriers).
>
> The theory does make some sense, but does not map into my experience ;-)

To be fair though, the gigabytes of NVRAM on the array perform the job
that the drive's cache would do on a lower-end system.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker)
From: Ric Wheeler @ 2008-04-21 19:11 UTC
To: Matthew Wilcox
Cc: Eric Sandeen, Andi Kleen, Theodore Tso, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel

Matthew Wilcox wrote:
> On Mon, Apr 21, 2008 at 02:44:45PM -0400, Ric Wheeler wrote:
>> Turning the drive write cache off is the default case for most RAID
>> products (including our mid and high end arrays).
>>
>> I have not seen an issue with drives wearing out with either setting (cache
>> disabled or enabled with barriers).
>>
>> The theory does make some sense, but does not map into my experience ;-)
>
> To be fair though, the gigabytes of NVRAM on the array perform the job
> that the drive's cache would do on a lower-end system.

The population I deal with personally is a huge number of 1U Centera
nodes, each of which has 4 high capacity ATA or S-ATA drives (no NVRAM).
We run with barriers (and write cache) enabled and I have not seen
anything that leads me to think that this is an issue.

One way to think about this is that even with barriers, relatively few
operations actually turn into cache flushes (fsyncs, journal syncs,
unmounts?).

Another thing to keep in mind is that drives are constantly writing and
moving heads - disabling write cache or doing a flush just adds an
incremental number of writes/head movements.

Using barriers or disabling write cache matters only when you are doing
a write intensive load; read intensive loads are not impacted (and
random, cache miss reads will move the heads often).

I just don't see it being an issue for any normal user (laptop user or
desktop user) since the write workload most people have is a small
fraction of what we run into in production data centers.

Running your drives in a moderate way will probably help them last
longer, but I am just not convinced that the write cache/barrier load
makes much of a difference...

ric
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker) 2008-04-19 19:07 ` Eric Sandeen 2008-04-19 22:04 ` Theodore Tso @ 2008-04-21 0:27 ` Alexey Zaytsev 2008-04-21 9:45 ` Andi Kleen 2008-04-22 16:54 ` Peter Teoh 2 siblings, 1 reply; 30+ messages in thread From: Alexey Zaytsev @ 2008-04-21 0:27 UTC (permalink / raw) To: Eric Sandeen; +Cc: Theodore Tso, linux-ext4, linux-fsdevel, Rik van Riel On Sat, Apr 19, 2008 at 11:07 PM, Eric Sandeen <sandeen@redhat.com> wrote: > Theodore Tso wrote: > > On Sat, Apr 19, 2008 at 01:44:51PM +0400, Alexey Zaytsev wrote: > >> If it is a block containing a metadata object fsck has already read, > >> than we already know what kind of object it is (there must be a way > >> to quickly find all cached objects derived from a given block), and > >> can update the cached version. And if fsck has not yet read the > >> block, it can just be ignored, no matter what kind of data it > >> contains. If it contains metadata and fsck is intrested in it, it > >> will read it sooner or later anyway. If it contains file data, why > >> should fsck even care? > > It seems to me that what the proposed project really does, in essence, > is a read-only check of a filesystem snapshot. It's just that the > snapshot is proposed to be constructed in a complex and non-generic (and > maybe impossible) way. Maybe complex and non-generic, but also quite efficient. Only the metadata actually used is cached, and everything is done in userspace. > > If you really just want to verify a snapshot of the fs at a point in > time, surely there are simpler ways. If the device is on lvm, there's > already a script floating around to do it in automated fasion. (I'd > pondered the idea of introducing META_WRITE (to go with META_READ) and > maybe lvm could do a "metadata-only" snapshot to be lighter weight?) How do you tell data from metadata on this level? > > -Eric > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker) 2008-04-21 0:27 ` Alexey Zaytsev @ 2008-04-21 9:45 ` Andi Kleen 0 siblings, 0 replies; 30+ messages in thread From: Andi Kleen @ 2008-04-21 9:45 UTC (permalink / raw) To: Alexey Zaytsev Cc: Eric Sandeen, Theodore Tso, linux-ext4, linux-fsdevel, Rik van Riel "Alexey Zaytsev" <alexey.zaytsev@gmail.com> writes: > > How do you tell data from metadata on this level? You could always change the file system to pass down hints through the block layer. -Andi ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker) 2008-04-19 19:07 ` Eric Sandeen 2008-04-19 22:04 ` Theodore Tso 2008-04-21 0:27 ` Alexey Zaytsev @ 2008-04-22 16:54 ` Peter Teoh 2008-04-22 17:02 ` Eric Sandeen [not found] ` <480E4950.1090300@oracle.com> 2 siblings, 2 replies; 30+ messages in thread From: Peter Teoh @ 2008-04-22 16:54 UTC (permalink / raw) To: Eric Sandeen Cc: Theodore Tso, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel On Sun, Apr 20, 2008 at 3:07 AM, Eric Sandeen <sandeen@redhat.com> wrote: > Theodore Tso wrote: > > On Sat, Apr 19, 2008 at 01:44:51PM +0400, Alexey Zaytsev wrote: > >> If it is a block containing a metadata object fsck has already read, > >> than we already know what kind of object it is (there must be a way > >> to quickly find all cached objects derived from a given block), and > >> can update the cached version. And if fsck has not yet read the > >> block, it can just be ignored, no matter what kind of data it > >> contains. If it contains metadata and fsck is intrested in it, it > >> will read it sooner or later anyway. If it contains file data, why > >> should fsck even care? > > It seems to me that what the proposed project really does, in essence, > is a read-only check of a filesystem snapshot. It's just that the > snapshot is proposed to be constructed in a complex and non-generic (and > maybe impossible) way. > > If you really just want to verify a snapshot of the fs at a point in > time, surely there are simpler ways. If the device is on lvm, there's > already a script floating around to do it in automated fasion. (I'd > pondered the idea of introducing META_WRITE (to go with META_READ) and > maybe lvm could do a "metadata-only" snapshot to be lighter weight?) > Can I know where is this script? Or if u cannot locate it, does it have any resemblance to all the stuff mentioned below?. 
Apologies for dragging the discussion back to this part again (and pardon my superficial knowledge of filesystems; I am just brainstorming and eager to learn :-)). I think the idea of an "online checker" can be developed further, taking into consideration all that has been said in this thread, by morphing it into a "semi-online" checker. Real online checking is not feasible: what has been fscked can be immediately invalidated by a subsequent corrupted write. So running fsck on a read-only snapshot is the best we can achieve, marking the fsck results with a timestamp, so that any write after this timestamp may invalidate the earlier fsck results. This idea has an equivalent in the Oracle database world, the "online datafile backup" feature, where all transactions go to memory + journal logs (a physical file itself) and the datafile is frozen for writing, enabling it to be physically copied.

a. First, the integrity of the filesystem must be treated as a WHOLE, and therefore all WRITES must somehow be frozen at THE SAME TIME; after that point in time, all writes go to memory only, so the permanent storage is read-only. This, I guess, is the read-only snapshot part, correct?

b. The infinite combinations of race conditions that could otherwise occur should not happen here, because the entire filesystem's integrity is now maintained as a whole.

c. The only difficulty I can see is updates to the journal logs: can this part of the online updates just go to memory temporarily while the frozen image is being fscked?

d. When ALL of the fsck is done, everything in memory gets resynced with the filesystem. During this short resync period, all writing should be completely frozen, with no writes to disk or memory, as race conditions may arise. After syncing, all reads and writes go directly to the disk. The complexity of the cache interaction is beyond my understanding.
Some of this is a rephrasing or adaptation of what I have read in this thread, so is my understanding correct? Thank you for sharing. -- Regards, Peter Teoh ^ permalink raw reply [flat|nested] 30+ messages in thread
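Steps a. through d. above can be sketched as a toy block device with an in-memory overlay. This is only an illustration of the freeze / redirect / resync idea as Peter describes it, not how LVM snapshots or any kernel code works; the class `FrozenDevice` and all of its method names are hypothetical.

```python
class FrozenDevice:
    """Toy block device: while frozen, writes land in an in-memory overlay."""

    def __init__(self, blocks):
        self.base = dict(blocks)   # block number -> bytes, the "on-disk" image
        self.overlay = {}          # writes accepted while the image is frozen
        self.frozen = False

    def freeze(self):
        # Step a: from now on the base image must not change.
        self.frozen = True

    def write(self, blkno, data):
        if self.frozen:
            self.overlay[blkno] = data   # redirected to memory only
        else:
            self.base[blkno] = data

    def read(self, blkno):
        # Normal readers see the newest data, overlay first.
        if blkno in self.overlay:
            return self.overlay[blkno]
        return self.base[blkno]

    def read_frozen(self, blkno):
        # The checker only ever sees the frozen image (step b).
        return self.base[blkno]

    def thaw(self):
        # Step d: resync the overlay into the base image, then resume
        # normal writes. A real system would have to block writers here.
        self.base.update(self.overlay)
        self.overlay.clear()
        self.frozen = False
```

The sketch makes the cost of step a. visible: every write issued during the check has to stay in memory until `thaw()`, so a long check on a busy filesystem needs an overlay of unbounded size.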
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker) 2008-04-22 16:54 ` Peter Teoh @ 2008-04-22 17:02 ` Eric Sandeen 2008-04-22 23:37 ` Andreas Dilger [not found] ` <480E4950.1090300@oracle.com> 1 sibling, 1 reply; 30+ messages in thread From: Eric Sandeen @ 2008-04-22 17:02 UTC (permalink / raw) To: Peter Teoh Cc: Theodore Tso, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel Peter Teoh wrote: > On Sun, Apr 20, 2008 at 3:07 AM, Eric Sandeen <sandeen@redhat.com> wrote: >> Theodore Tso wrote: >> > On Sat, Apr 19, 2008 at 01:44:51PM +0400, Alexey Zaytsev wrote: >> >> If it is a block containing a metadata object fsck has already read, >> >> than we already know what kind of object it is (there must be a way >> >> to quickly find all cached objects derived from a given block), and >> >> can update the cached version. And if fsck has not yet read the >> >> block, it can just be ignored, no matter what kind of data it >> >> contains. If it contains metadata and fsck is intrested in it, it >> >> will read it sooner or later anyway. If it contains file data, why >> >> should fsck even care? >> >> It seems to me that what the proposed project really does, in essence, >> is a read-only check of a filesystem snapshot. It's just that the >> snapshot is proposed to be constructed in a complex and non-generic (and >> maybe impossible) way. >> >> If you really just want to verify a snapshot of the fs at a point in >> time, surely there are simpler ways. If the device is on lvm, there's >> already a script floating around to do it in automated fasion. (I'd >> pondered the idea of introducing META_WRITE (to go with META_READ) and >> maybe lvm could do a "metadata-only" snapshot to be lighter weight?) >> > > Can I know where is this script? Or if u cannot locate it, does it > have any resemblance to all the stuff mentioned below?. 
Google for "lvcheck" and find it buried in a thread "forced fsck (again?)" on the ext3-users list - I'm not sure if it has an upstream home anywhere yet... -Eric ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker) 2008-04-22 17:02 ` Eric Sandeen @ 2008-04-22 23:37 ` Andreas Dilger 2008-04-23 0:52 ` Eric Sandeen 0 siblings, 1 reply; 30+ messages in thread From: Andreas Dilger @ 2008-04-22 23:37 UTC (permalink / raw) To: Eric Sandeen Cc: Peter Teoh, Theodore Tso, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel On Apr 22, 2008 12:02 -0500, Eric Sandeen wrote: > Peter Teoh wrote: > > On Sun, Apr 20, 2008 at 3:07 AM, Eric Sandeen <sandeen@redhat.com> wrote: > >> If you really just want to verify a snapshot of the fs at a point in > >> time, surely there are simpler ways. If the device is on lvm, there's > >> already a script floating around to do it in automated fasion. (I'd > >> pondered the idea of introducing META_WRITE (to go with META_READ) and > >> maybe lvm could do a "metadata-only" snapshot to be lighter weight?) > > > > Can I know where is this script? Or if u cannot locate it, does it > > have any resemblance to all the stuff mentioned below?. > > Google for "lvcheck" and find it buried in a thread "forced fsck > (again?)" on the ext3-users list - I'm not sure if it has an upstream > home anywhere yet... We thought the best place to put it would be in the lvm2 utilities, since it is tied to LVM snapshots (and not really a particular filesystem). Eric, any chance you could pass the script over to the LVM folks at RH? AFAIK, they are the official LVM/DM maintainers still (adopted as part of Sistina and GFS). Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker) 2008-04-22 23:37 ` Andreas Dilger @ 2008-04-23 0:52 ` Eric Sandeen 0 siblings, 0 replies; 30+ messages in thread From: Eric Sandeen @ 2008-04-23 0:52 UTC (permalink / raw) To: Andreas Dilger Cc: Peter Teoh, Theodore Tso, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel Andreas Dilger wrote: > On Apr 22, 2008 12:02 -0500, Eric Sandeen wrote: >> Peter Teoh wrote: >>> On Sun, Apr 20, 2008 at 3:07 AM, Eric Sandeen <sandeen@redhat.com> wrote: >>>> If you really just want to verify a snapshot of the fs at a point in >>>> time, surely there are simpler ways. If the device is on lvm, there's >>>> already a script floating around to do it in automated fasion. (I'd >>>> pondered the idea of introducing META_WRITE (to go with META_READ) and >>>> maybe lvm could do a "metadata-only" snapshot to be lighter weight?) >>> Can I know where is this script? Or if u cannot locate it, does it >>> have any resemblance to all the stuff mentioned below?. >> Google for "lvcheck" and find it buried in a thread "forced fsck >> (again?)" on the ext3-users list - I'm not sure if it has an upstream >> home anywhere yet... > > We thought the best place to put it would be in the lvm2 utilities, since > it is tied to LVM snapshots (and not really a particular filesystem). > > Eric, any chance you could pass the script over to the LVM folks at RH? > AFAIK, they are the official LVM/DM maintainers still (adopted as part of > Sistina and GFS). Sure, I'll see who I can bug :) -Eric ^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <480E4950.1090300@oracle.com>]
[parent not found: <804dabb00804221633g1f61029dh7b27737134fc0b7a@mail.gmail.com>]
[parent not found: <480E7954.9090408@oracle.com>]
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker) [not found] ` <480E7954.9090408@oracle.com> @ 2008-04-23 1:02 ` Peter Teoh 0 siblings, 0 replies; 30+ messages in thread From: Peter Teoh @ 2008-04-23 1:02 UTC (permalink / raw) To: Sunil Mushran; +Cc: linux-fsdevel On Wed, Apr 23, 2008 at 7:48 AM, Sunil Mushran <Sunil.Mushran@oracle.com> wrote: > Peter Teoh wrote: > > > I understood that, and that was what I said "memory + journal log > > file" except that perhaps a better term for journal log will be redo > > log. Correct? > > > redolog would be the appropriate term. But the point still is that oracle > does not stop writing to the db file when it is in hot backup mode. That > would be sheer madness as the user may not stop the backup mode for say > a day or maybe a week. It keeps writing to the dbfiles but at the same time > generates more redo to ensure it can handle fractured blocks when one is > restoring from those backups. > Ah....now I understood your point: few days of backing up (due to large size of dbfiles) + fractured blocks....key words I have learned from you.....must thank you for your sharing.... > Ok, now I must admit that online fsck is extremely difficult - given that I no longer have an option of fs-snapshot for readonly at a point in time, which is my only way of doing it so far. Thank you for the discussion! :-). -- Regards, Peter Teoh ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker) 2008-04-19 18:56 ` Theodore Tso 2008-04-19 19:07 ` Eric Sandeen @ 2008-04-20 23:37 ` Andi Kleen 2008-04-21 2:33 ` Theodore Tso 2008-04-21 0:23 ` Alexey Zaytsev 2 siblings, 1 reply; 30+ messages in thread From: Andi Kleen @ 2008-04-20 23:37 UTC (permalink / raw) To: Theodore Tso; +Cc: Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel Theodore Tso <tytso@mit.edu> writes: > > If you are going to store all of the cached objects then you will need > to effectively store *all* of the filesystem metatdata in memory at > the same time. Are you sure about all data? I think he would just need some lookup table from metadata block numbers to inode numbers and then when a hit occurs on a block in the table somehow invalidate all data related to that inode and restart that part. And the same thing for bitmap blocks. That lookup table should be much smaller than the full metadata. Anyways my favourite fsck wish list feature would be a way to record the changes a read-only fsck would want to do and then some quick way to apply them to a writable version of the file system without doing a full rescan. Then you could regularly do a background check and if it finds something wrong just remount and apply the changes quickly. Or perhaps just tell the kernel which objects is suspicious and should be EIOed. -Andi ^ permalink raw reply [flat|nested] 30+ messages in thread
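The lookup table Andi suggests could look roughly like the sketch below. It is a minimal illustration under stated assumptions: `BlockOwnerTable` and its methods are hypothetical names, nothing here comes from e2fsck, and each block maps to a set of inodes rather than a single owner because a damaged filesystem can have one block claimed by several inodes.

```python
from collections import defaultdict


class BlockOwnerTable:
    """Maps metadata block numbers to the inodes whose checks depended on them."""

    def __init__(self):
        self._owners = defaultdict(set)   # block number -> set of inode numbers

    def record(self, blkno, ino):
        # Called when the checker reads a metadata block on behalf of an inode.
        self._owners[blkno].add(ino)

    def on_block_written(self, blkno):
        """Return the inodes whose cached results must be redone.

        An empty set means the checker never read this block, so the
        write can be ignored.
        """
        return set(self._owners.get(blkno, ()))
```

For example, after `record(12345, 123)`, a write notification for block 12345 yields `{123}`, and the checker would restart its work on inode 123 only.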
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker) 2008-04-20 23:37 ` Andi Kleen @ 2008-04-21 2:33 ` Theodore Tso 2008-04-21 14:43 ` Andi Kleen 0 siblings, 1 reply; 30+ messages in thread From: Theodore Tso @ 2008-04-21 2:33 UTC (permalink / raw) To: Andi Kleen; +Cc: Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel On Mon, Apr 21, 2008 at 01:37:37AM +0200, Andi Kleen wrote: > Are you sure about all data? I think he would just need some lookup table from > metadata block numbers to inode numbers and then when a hit occurs on a block > in the table somehow invalidate all data related to that inode > and restart that part. And the same thing for bitmap blocks. That lookup > table should be much smaller than the full metadata. Yeah, unfortunately it's close to all of the metadata. Consider that e2fsck also has to deal with changes in the directory, and there can be multiple hard links in a directory, so it's not just a simple lookup table. You could try to condense the directory into a list of inode numbers and the number of times they were counted in a directory, but then any time the directory changed, you'd have to rescan the *entire* directory. Also, consider that the lookup table might not be enough, if the filesystem is actually corrupted, and there are multiple blocks claimed by an inode. How you "invalidate all data" in that case becomes less obvious. It would be possible to condense the metadata somewhat by omitting unused inodes, and storing the indirect blocks as extents. But there would still be a huge amount of metadata that would have to be stored in memory. 
If you're willing to completely rewrite e2fsck (which the on-line resize would need anyway, because the updated data could invalidate the previously done work at any point anywhere in the e2fsck processing), maybe the extra cached data structures won't be completely additive on top of the other intermediate data kept by e2fsck, but this once again points out that it would be insane for a student to try to do this in 3 months. > Anyways my favourite fsck wish list feature would be a way to record the > changes a read-only fsck would want to do and then some quick way > to apply them to a writable version of the file system without > doing a full rescan. Then you could regularly do a background check > and if it finds something wrong just remount and apply the changes > quickly. This is a read-only fsck while the filesystem is changing out from underneath it, and the hope is that you can take the instructions gathered from the read-only fsck (presumably run on a snapshot) and then apply them to a filesystem that has since been modified after the snapshot was taken. Even if it has been remounted read-only at this point, this gets really dicey. Consider that with certain types of corruption, if the filesystem continues to get modified, the corruption can get worse. > Or perhaps just tell the kernel which objects is suspicious and > should be EIOed. Yeah; you could do that, as long as it's not a guarantee that all of the objects which were suspicious were found. It would also be possible to isolate the objects, perhaps with some potential inode and block leakage that would get fixed at the next off-line fsck. Still, it would be a lot of work. Let me know if someone is willing to pay for this, and I could probably work with someone like Val to execute this. But otherwise, it probably falls in the "we'd all like a pony" sort of wishlist..... - Ted ^ permalink raw reply [flat|nested] 30+ messages in thread
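Ted's remark about storing the indirect blocks as extents can be illustrated with a small run-length sketch. This is an assumption-laden toy, not e2fsck code: `blocks_to_extents` is a hypothetical helper that collapses the block list of one file into (start, length) runs while preserving logical order.

```python
def blocks_to_extents(blocks):
    """Collapse a list of block numbers into (start, length) runs.

    Only physically consecutive blocks that are also adjacent in the
    file's logical order are merged, so the original list can be rebuilt.
    """
    extents = []
    for blk in blocks:
        if extents and extents[-1][0] + extents[-1][1] == blk:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)   # extend the current run
        else:
            extents.append((blk, 1))            # start a new run
    return extents


# Six block numbers shrink to three extents.
print(blocks_to_extents([100, 101, 102, 200, 201, 50]))
```

The saving depends entirely on how contiguous the files are. A badly fragmented file produces one extent per block, which is no smaller than the raw list, so the worst case still looks like "a huge amount of metadata".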
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker) 2008-04-21 2:33 ` Theodore Tso @ 2008-04-21 14:43 ` Andi Kleen 0 siblings, 0 replies; 30+ messages in thread From: Andi Kleen @ 2008-04-21 14:43 UTC (permalink / raw) To: Theodore Tso Cc: Andi Kleen, Alexey Zaytsev, linux-ext4, linux-fsdevel, Rik van Riel > snapshot was taken. Even if it has been remounted read-only at this > point, this gets really dicey. Consider that with certain types of > corruption, if the filesystem continues to get modified, the > corruption can get worse. I see, but perhaps you could do that for at least some common types of corruption and only give up in the extreme cases? Mind you, I don't have a good feeling for what the common and uncommon types are. > > > Or perhaps just tell the kernel which objects is suspicious and > > should be EIOed. > > Yeah; you could do that, as long as it's not a guarantee that all of > the objects which were suspicious were found. It would also be Ok, to do the 100% job you would probably need metadata checksums, always validated on initial read. -Andi ^ permalink raw reply [flat|nested] 30+ messages in thread
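The "checksum, validate on initial read" idea Andi mentions can be sketched as follows. This is a toy under stated assumptions: the layout (payload followed by a 4-byte little-endian CRC32) and the names `seal` and `read_validated` are hypothetical and are not claimed to match any ext2/3 on-disk format, which had no metadata checksums at the time of this thread. CRC32 from Python's `zlib` is used only because it is readily available.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte CRC32 of the payload (hypothetical block layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def read_validated(raw: bytes) -> bytes:
    """Return the payload, or raise an I/O error if the checksum does not match."""
    payload, stored = raw[:-4], int.from_bytes(raw[-4:], "little")
    if zlib.crc32(payload) != stored:
        # Comparable to failing the access with EIO instead of trusting bad metadata.
        raise IOError("metadata checksum mismatch")
    return payload
```

A checksum of this kind catches blocks damaged after they were written. It does not catch metadata that was already logically inconsistent when it was checksummed, so it complements a checker rather than replacing one.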
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker) 2008-04-19 18:56 ` Theodore Tso 2008-04-19 19:07 ` Eric Sandeen 2008-04-20 23:37 ` Andi Kleen @ 2008-04-21 0:23 ` Alexey Zaytsev 2008-04-21 12:53 ` Theodore Tso 2 siblings, 1 reply; 30+ messages in thread From: Alexey Zaytsev @ 2008-04-21 0:23 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-ext4, linux-fsdevel, Rik van Riel On Sat, Apr 19, 2008 at 10:56 PM, Theodore Tso <tytso@mit.edu> wrote: > On Sat, Apr 19, 2008 at 01:44:51PM +0400, Alexey Zaytsev wrote: > > If it is a block containing a metadata object fsck has already read, > > than we already know what kind of object it is (there must be a way > > to quickly find all cached objects derived from a given block), and > > can update the cached version. And if fsck has not yet read the > > block, it can just be ignored, no matter what kind of data it > > contains. If it contains metadata and fsck is intrested in it, it > > will read it sooner or later anyway. If it contains file data, why > > should fsck even care? > > The problem is that e2fsck makes calculations on the filesystem data > read out from the disk and stores that in a highly compressed format. > So it doesn't remember that block #12345 was an indirect block for > inode #123, and that it contained data block numbers 17, 42, and 45. > Instead it just marks blocks #12345, #17, #42, and #45 as in use, and > then moves on. > > If you are going to store all of the cached objects then you will need > to effectively store *all* of the filesystem metatdata in memory at > the same time. For a large filesystem, you won't have enough *room* > in memory store all of the cached objects. That's one of the reasons > why e2fsck has a lot of very clever design so that summary information > can be stored in a very compressed form in memory so that things can > be fast (by avoid re-reading objects from disk) as well as not > requiring vast amounts of memory. > Yes, I agree on this problem. 
Do you have any estimates on how much RAM the current e2fsck uses in some test cases? I hope my approach will not add much to this. The only big thing I see is the data needed to associate each inode/dir entry with the parent block. Probably one radix tree to enumerate the blocks and a pointer added to the ext2_inode and ext2_dir_entry structures to form a linked list of objects belonging to the same block. Still no idea how much RAM the whole thing would consume. > Even if you *do* store all of the cached objects, it still takes time > to examine all of the objects and in the mean time, more changes will > have come rolling in, and you will either need to add a huge amount of > dependency to figure out what internal data structures need to be > updated based on the changes in some of the cached objects --- or you > will end up restarting the e2fsck checking process from scratch. > Not really. In my application I propose some changes to the fsck pass order to avoid the need to rerun it. And I don't get what dependency you are talking about. The only one I see is between the directory entries and the directory inode. Should not be hard to solve. (Or do I miss something? Could you give more examples maybe?) > In either case, there is still the issue of knowing exactly whether a > particular read happened before or after some change in the > filesystem. This race condition is a really hard one to deal with, > especially on a multiple CPU system and the filesystem checker is > running in userspace. I don't see why should fsck care about this. The notification is always sent after the write happened, so fsck should just re-read the data. No problem if it already read the (half-)updated version just before the notification. Btw, how about an even simpler method: just watch the journal commits (changes to jbd needed). This way we can get all actual metadata updates, without being flooded by the file data updates. 
> > > But you are probably right, this project may be not doable in just three > > months. The changes on the kernel side probably are, but there is a > > huge e2fsck work. > > Yes, that is the concern. And without implementing the user-space > side, you'll never besure whether you completely got the kernel side > changes right! > > Regards, > > - Ted > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Mentor for a GSoC application wanted (Online ext2/3 filesystem checker) 2008-04-21 0:23 ` Alexey Zaytsev @ 2008-04-21 12:53 ` Theodore Tso 0 siblings, 0 replies; 30+ messages in thread From: Theodore Tso @ 2008-04-21 12:53 UTC (permalink / raw) To: Alexey Zaytsev; +Cc: linux-ext4, linux-fsdevel, Rik van Riel On Mon, Apr 21, 2008 at 04:23:42AM +0400, Alexey Zaytsev wrote: > Not really. In my application I propose some changes to the fsck pass > order to avoid the need to rerun it. And I don't get what dependency you > are talking about. The only one I see is between the directory entries and > the directory inode. Should not be hard to solve. > (Or do I miss something? Could you give more examples maybe?) And *this* is why I ultimately decided I didn't have the time to mentor you. There are large numbers of other dependencies. For example, between the direct and indirect blocks in the inode, and the block allocation bitmaps. (Note that e2fsck keeps up to 3 different block bitmaps and 6 different inode bitmaps.) You need to know which inodes are directories and which inodes are regular files. E2fsck currently keeps these bitmaps so we don't have to cache the entire 128 byte inode for all inodes. (Instead, we cache a single bit for every single inode. There's a ***reason*** for all of these bitmaps.) You also need to know which blocks are being used to store extended attributes, which may potentially be shared across multiple inodes. That's just *three* additional dependencies, and there are many more. If you can't think of them, how much time would it take for me as mentor to explain all of this to you? > > In either case, there is still the issue of knowing exactly whether a > > particular read happened before or after some change in the > > filesystem. This race condition is a really hard one to deal with, > > especially on a multiple CPU system and the filesystem checker is > > running in userspace. > > I don't see why should fsck care about this. 
The notification is always sent > after the write happened, so fsck should just re-read the data. No problem > if it already read the (half-)updated version just before the notification. Keep in mind that when a file gets deleted, a *large* number of metadata blocks will potentially get updated. So while e2fsck is handling these reads, a bunch more can start coming in from other filesystem transactions, and since the kernel doesn't know what userspace has already cached, it will have to send them again... and again... In fact if the filesystem is being very quickly updated, the notifications could easily overrun whatever buffers have been set up to transfer this information from the kernel to userspace. Worse yet, unless you also send down transaction boundaries, the userspace won't know when the filesystem has reached a "stable state" which would be internally consistent. There are ways that this could be solved, but at the end of the day, the $1,000,000 question is why not just do a kernel-side snapshot? Then you don't have to completely rewrite e2fsck --- and given that you've claimed the e2fsck code is "hard to understand", it seems especially audacious that you would have thought you could do this in 3 months. If you really don't want to use LVM, you could have proposed a snapshot solution which didn't involve devicemapper. It's not clear it would have entered mainline, but at least there would have been some non-zero chance that you would complete the project successfully. Regards, - Ted ^ permalink raw reply [flat|nested] 30+ messages in thread
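The buffer overrun Ted describes can be made concrete with a small sketch of a bounded notification queue. Everything here is hypothetical (`NotificationQueue`, `post`, `drain`); it does not model any real kernel interface. It only shows that once the queue fills, the checker can learn that events were lost but not which blocks changed.

```python
from collections import deque


class NotificationQueue:
    """Bounded queue of changed-block events between a producer and a checker."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.overflowed = False

    def post(self, blkno):
        """Producer side: queue an event, or record an overflow if the queue is full."""
        if len(self.events) >= self.capacity:
            self.overflowed = True   # the event itself is dropped
            return False
        self.events.append(blkno)
        return True

    def drain(self):
        """Checker side: return (events, lost).

        lost=True means at least one notification was dropped, so no
        cached result can be trusted any more.
        """
        events, lost = list(self.events), self.overflowed
        self.events.clear()
        self.overflowed = False
        return events, lost
```

With a capacity of 2, posting blocks 1, 2 and 3 before the checker drains the queue yields `([1, 2], True)`. The only safe reaction to `lost=True` is to restart the check from scratch, which is the outcome Ted warns about on a busy filesystem.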