* Ext4 and the "30 second window of death"
@ 2009-03-29 10:24 Alberto Gonzalez
  2009-03-31 12:25 ` Theodore Tso
  0 siblings, 1 reply; 59+ messages in thread
From: Alberto Gonzalez @ 2009-03-29 10:24 UTC
To: Linux Kernel Mailing List

Hi,

Reading this discussion about the fsync performance problems, the reliability of delayed allocation, etc. left me a bit confused, so as a normal user I would like to ask a clear question with an example so I can get a clear answer and understand the implications of all this.

- Let's say I'm a writer and I like to take my laptop to a cafe every day to write there for a few hours.
- As such, I want to get good battery life, so I'm fine with my data being written to disk, say, every 30 seconds instead of waking up the disk immediately when I save the document I'm working on.
- I use Ext4 as my filesystem (default in next Fedora release).
- Let's say I've been working on my book for the last 14 months and I've written about 400 pages in an ODF file.
- My usual workflow is that every time I finish a paragraph, say every 2-3 minutes, I hit Ctrl+S to save the changes.
- So one day, while I'm working on the book, the following happens: I finish a paragraph and hit Ctrl+S to save it. 5 seconds later the system freezes for some reason.

Let's suppose that in that 5-second window between pressing Ctrl+S and the crash the data has not been written to disk (which happens every 30 seconds). So as a result I:

A - Lose that last paragraph
B - Lose the whole book

If it's 'A', then that's ok, as expected. Bad luck. But if it's 'B', then I think that's totally unexpected by any user, and totally unacceptable too. Sure I want good performance and good battery life, but not at such a cost. (Yes, you can argue I should have a recent backup at home, and you'd be right, but that doesn't change things fundamentally).

As far as I understand, with Ext3 (defaults), the behavior was A. Will this change to B with Ext4 and all "modern" filesystems (XFS, Btrfs,...)?

Thanks for any answer.

* Re: Ext4 and the "30 second window of death"
  2009-03-29 10:24 Ext4 and the "30 second window of death" Alberto Gonzalez
@ 2009-03-31 12:25 ` Theodore Tso
  2009-03-31 12:52 ` Alberto Gonzalez
  ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Theodore Tso @ 2009-03-31 12:25 UTC
To: Alberto Gonzalez
Cc: Linux Kernel Mailing List

On Sun, Mar 29, 2009 at 12:24:21PM +0200, Alberto Gonzalez wrote:
> Hi,
>
> - I use Ext4 as my filesystem (default in next Fedora release).

Fedora will have the patches so that when applications do replace-via-truncate (a bad idea; these applications are buggy and will sometimes lose data even with ext3) or replace-via-rename without the fsync(), the blocks are forced out to disk with the commit.

> - Let's say I've been working on my book for the last 14 months and I've
> written about 400 pages on an ODF file.

OpenOffice, being a portable application that has to work on other operating systems and filesystems (for example, Solaris's UFS), does do open/write/fsync/close/rename. So you're safe if you're using OpenOffice (and emacs, and vim).

The replace-via-truncate and replace-via-rename workarounds are there for the benefit of KDE and GNOME, which in some configurations apparently will replace hundreds of dot files when the desktop is started up, for no reason that I can understand. (Not such a great idea for SSD write endurance!) Some people apparently spend hours making sure that their windows are positioned exactly the way they want them when their desktop starts up, and if the system crashes while their desktop is starting up, they could lose their window positions, which apparently made a whole bunch of users cranky.

In practice, most of the editors that I'm familiar with have been around for a while, have needed to make sure that cases such as yours wouldn't result in data loss, and so are pretty good about using fsync() so that users' files wouldn't be lost, no matter what filesystem or operating system is being used. The problem has been mostly with newer applications, especially the newer desktop ones, which have been written to assume that they only have to work safely on Linux and ext3. The replace-via-truncate and replace-via-rename workarounds provide this safety for ext4.

Best regards,

- Ted

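For readers following along, the write-to-a-temporary-file, fsync, rename pattern described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not what OpenOffice, emacs, or vim actually do internally: the function name `atomic_save` and the `.tmp-save-` prefix are invented for the example, and the final directory fsync assumes a POSIX system such as Linux.

```python
import os
import tempfile


def atomic_save(path: str, data: bytes) -> None:
    """Replace `path` with `data` so that a crash leaves the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new contents to a temporary file in the same directory,
    # so the final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-save-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new data to stable storage first
        os.replace(tmp_path, path)  # then atomically rename over the old file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself is durable (POSIX assumption).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The point of the ordering is that the old file is never touched until the new contents are already on disk, so the worst case after a crash is the previous version of the document.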
* Re: Ext4 and the "30 second window of death"
  2009-03-31 12:25 ` Theodore Tso
@ 2009-03-31 12:52 ` Alberto Gonzalez
  2009-03-31 13:45 ` Theodore Tso
  2009-04-03 7:13 ` Bojan Smojver
  2009-04-05 17:27 ` Ed Tomlinson
  2 siblings, 1 reply; 59+ messages in thread
From: Alberto Gonzalez @ 2009-03-31 12:52 UTC
To: Theodore Tso
Cc: Linux Kernel Mailing List

On Tuesday 31 March 2009 14:25:40 Theodore Tso wrote:
> On Sun, Mar 29, 2009 at 12:24:21PM +0200, Alberto Gonzalez wrote:
> > Hi,
> >
> > - I use Ext4 as my filesystem (default in next Fedora release).
>
> Fedora will have the patches so that applications that do
> replace-via-truncate (a bad idea, these applications are buggy, and
> will lose data sometimes even with ext3), or replace-via-rename
> without the fsync(), will force the blocks out to disk with the
> commit.
>
> > - Let's say I've been working on my book for the last 14 months and I've
> > written about 400 pages on an ODF file.
>
> OpenOffice, being a portable application that has to work on other
> operating systems and filesystems (for example, Solaris's UFS),
> does do open/write/fsync/close/rename. So you're safe if you're using
> OpenOffice (and emacs, and vim).

Ah, good to know, that's quite a relief for normal users like me who were getting lost with this discussion. But one other doubt:

You've proposed that in laptop mode, fsync's should be held until the next write cycle (say every 30 seconds) so that the disk is not spun up unnecessarily, wasting battery and shortening its lifespan too. I absolutely agree with this, and as a trade-off I'm ok with losing my last paragraph even if I did hit Ctrl+S to save it a few seconds before a crash. But again, with Ext4 will I just lose that last paragraph or the whole book in this case?

Thanks,
Alberto.

> Best regards,
>
> - Ted

* Re: Ext4 and the "30 second window of death"
  2009-03-31 12:52 ` Alberto Gonzalez
@ 2009-03-31 13:45 ` Theodore Tso
  2009-03-31 14:45 ` Alberto Gonzalez
  2009-03-31 22:02 ` Alberto Gonzalez
  0 siblings, 2 replies; 59+ messages in thread
From: Theodore Tso @ 2009-03-31 13:45 UTC
To: Alberto Gonzalez
Cc: Linux Kernel Mailing List

On Tue, Mar 31, 2009 at 02:52:05PM +0200, Alberto Gonzalez wrote:
>
> You've proposed that in laptop mode, fsync's should be held until next write
> cycle (say every 30 seconds) so that the disk is not spun up unnecessarily,
> wasting battery and shortening its lifespan too. I absolutely agree with
> this, and as a trade-off I'm ok with losing my last paragraph even if I did hit
> Ctrl+S to save it a few seconds before a crash. But again, with Ext4 will I
> just lose that last paragraph or the whole book in this case?

Laptop mode is already set up such that the moment the disk spins up, any pending writes are immediately flushed to disk --- the idea being that if the disk is spinning, we might as well take advantage of it to get everything pushed out to disk. As long as we actually keep a linked list of those fsync's which were "held up", and we make sure all of the delayed allocation blocks are also allocated before we push them out, the right thing will happen. If we just ignore the fsync's, then we might not allocate the delayed allocation blocks. So basically, we need to be careful about how we implement this addition to laptop_mode.

Jeff Garzik has also pointed out that there are additional concerns for databases which may have issued multiple fsync()'s while the disk has been spun down, where we wouldn't want to mix writes between fsync()'s. This basically boils down to how much protection we want to give for the case where the system crashes while the disk blocks are being pushed out to disk. (Which isn't that farfetched; consider the case where the laptop is very low on battery, and runs out when the disk is woken up and crashes before all of the writes could be processed.)

So there are some things that would be tricky in terms of implementing this perfectly, and maybe we would disable the fsync suppression machinery if the battery level is getting critical --- and then do either a clean shutdown or a suspend-to-disk (although here too there had better be enough juice in the battery to write all of memory to your swap partition).

The bottom line is that it *can* be implemented safely, but there are some things that we would need to pay attention to in order to make sure it *was* safe.

- Ted

* Re: Ext4 and the "30 second window of death"
  2009-03-31 13:45 ` Theodore Tso
@ 2009-03-31 14:45 ` Alberto Gonzalez
  2009-04-01 0:04 ` Theodore Tso
  2009-03-31 22:02 ` Alberto Gonzalez
  1 sibling, 1 reply; 59+ messages in thread
From: Alberto Gonzalez @ 2009-03-31 14:45 UTC
To: Theodore Tso
Cc: Linux Kernel Mailing List

On Tuesday 31 March 2009 15:45:47 Theodore Tso wrote:
> On Tue, Mar 31, 2009 at 02:52:05PM +0200, Alberto Gonzalez wrote:
> > You've proposed that in laptop mode, fsync's should be held until next
> > write cycle (say every 30 seconds) so that the disk is not spun up
> > unnecessarily, wasting battery and shortening its lifespan too. I
> > absolutely agree with this, and as a trade-off I'm ok with losing my last
> > paragraph even if I did hit Ctrl+S to save it a few seconds before a
> > crash. But again, with Ext4 will I just lose that last paragraph or the
> > whole book in this case?
>
> Laptop mode is already set up such that the moment the disk spins up,
> any pending writes are immediately flushed to disk --- the idea being
> that if the disk is spinning, we might as well take advantage of it to
> get everything pushed out to disk. As long as we actually keep a
> linked list of those fsync's which were "held up", and we make sure
> all of the delayed allocation blocks are also allocated before we push
> them out, the right thing will happen. If we just ignore the fsync's,
> then we might not allocate the delayed allocation blocks. So
> basically, we need to be careful about how we implement this addition
> to laptop_mode.
>
> Jeff Garzik has also pointed out that there are additional concerns
> for databases which may have issued multiple fsync()'s while the disk
> has been spun down, where we wouldn't want to mix writes between
> fsync()'s. This basically boils down to how much protection do we
> want to give for the case where the system crashes while the disk
> blocks are being pushed out to disk. (Which isn't that farfetched;
> consider the case where the laptop is very low on battery, and runs
> out when the disk is woken up and crashes before all of the writes
> could be processed.)
>
> So there are some things that would be tricky in terms of implementing
> this perfectly, and maybe we would disable the fsync suppression
> machinery if the battery level is getting critical --- and then do
> either a clean shutdown or a suspend-to-disk (although here too there
> had better be enough juice in the battery to write all of memory to
> your swap partition).
>
> The bottom line is that it *can* be implemented safely, but there are
> some things that we would need to pay attention to in order to make
> sure it *was* safe.
>
> - Ted

I see. Thanks for the explanation. Right now, laptop-mode (if you use laptop-mode-tools) is disabled when the battery reaches a critical level.

Anyway, regardless of corner cases, I think that what we "normal" users want is to have the choice between:

A - Writing data to disk immediately and losing no work at all, but getting worse performance/battery life/HDD lifespan (this is what happens when an application uses fsync, right?).

Or

B - Delaying writes for X seconds (30, 60, 120,...) and getting better performance/battery life/HDD lifespan, but risking the loss of X seconds of work.

What is not acceptable is having to choose between A and:

C - Delaying writes for X seconds and getting better performance/battery life/HDD lifespan, but risking the loss of _all_ your work (instead of just the last X seconds).

The problem I guess is that right now application writers targeting Ext4 must choose between using fsync and giving users the 'A' behaviour or not using fsync and giving them the 'C' behaviour. But what most users would like is 'B', I'm afraid (at least, it's what I want, I might be an exception).

Regards,
Alberto.

* Re: Ext4 and the "30 second window of death"
  2009-03-31 14:45 ` Alberto Gonzalez
@ 2009-04-01 0:04 ` Theodore Tso
  2009-04-01 1:14 ` Alberto Gonzalez
  0 siblings, 1 reply; 59+ messages in thread
From: Theodore Tso @ 2009-04-01 0:04 UTC
To: Alberto Gonzalez
Cc: Linux Kernel Mailing List

On Tue, Mar 31, 2009 at 04:45:28PM +0200, Alberto Gonzalez wrote:
>
> A - Writing data to disk immediately and losing no work at all, but getting worse
> performance/battery life/HDD lifespan (this is what happens when an
> application uses fsync, right?).

People are stressing over the battery usage of spinning up the disk when you write a file, but in practice, if you're writing an OpenOffice file, you're probably only going to be typing ^S every 45 seconds? Every couple of minutes? So the fsync() caused by OpenOffice saving out your 300 page Magnum Opus really isn't going to make that big of a difference to your battery life --- whether it happens right away when you hit ^S, or whether it happens some 30 or 120 seconds later, isn't really a big deal.

The problem comes when you have lots of applications open on the desktop, and for some reason they all decide they need to be writing a huge number of files every few seconds. That seems to be the concern behind wanting to batch writes so that the disk spins up less often, in order to save power. So for example, if every time you get an instant message via AIM or IRC, your Pidgin client wants to write the message to a log file, should Pidgin try to fsync() that write?

Right now, if Pidgin doesn't call fsync(), with ext3, in practice your IM will be written to disk after 5 seconds. With ext4, your IM might not get written to disk for around 30 seconds. Since Pidgin isn't replacing the log file, but rather appending to it, it's not a case of losing the previous work, but simply of not getting the latest IMs pushed to stable storage as quickly.

Quite frankly, the people who are complaining that "fsync() will burn too much power" are really protesting way too much. How often, really, should applications be replacing files? Apparently KDE replaces hundreds of files in some configurations at desktop startup, but most people seem to agree this is a bug. Firefox wants to replace a large number of files (and in practice writes 2.5 megabytes of data) each time you click on a link. (This is not great for SSD write endurance; after browsing 400 links, you've written about a gigabyte to your SSD.) But let's be realistic here; if you're browsing the web, the power used by the web browser running Flash animations, not to mention the power cost of the WiFi, is probably at least as much as, if not more than, the cost of spinning up the disk.

At least when I'm running on batteries, I keep the number of applications down to a minimum, and regardless of whether we are batching I/O's using laptop mode or not, it's *always* going to save more power to not do file I/O at all than to do file I/O with some kind of batching scheme. So the folks who are saying that they can't afford to fsync() every single file for power reasons really are making an excuse; the reality is that if they were really worried about power consumption, they would be going out of their way to avoid file writes unless they're really necessary. It's one thing if a user wants to save their OpenOffice document; when the user wants to save it, they should save it, and it should go to disk pretty fast --- how much work the user is willing to risk should be based on how often the user manually types ^S, or how the user configures their application to do periodic auto-saves --- whether that's once a minute, or every 3 minutes, or every 5 minutes, or every 10 minutes.

But if there's some application which is replacing hundreds of files a minute, then that's the real problem, whether they use fsync() or not.

Now, while I think the whole "we can't use fsync() for power reasons" argument is an excuse, it's also true that we're not going to be able to change all applications at the drop of a hat, and it may in fact be impossible to fix all applications, perhaps for years to come. It is for that reason that ext4 has the replace-via-truncate and replace-via-rename workarounds. These currently start I/O as soon as the file is closed (if it had been previously truncated), or renamed (if it overwrites a target file). From a power perspective, it would have been better to wait until the next commit boundary to initiate the I/O (although doing it right away is better from an I/O smoothing perspective and to reduce fsync latencies). But again, if the application is replacing a huge number of files on a frequent basis, that's what's going to suck the most power; batching to allow the disk to spin down might save a little, but fundamentally the application is doing something that's going to be a massive power drain anyway.

> The problem I guess is that right now application writers targeting
> Ext4 must choose between using fsync and giving users the 'A'
> behaviour or not using fsync and giving them the 'C' behaviour. But
> what most users would like is 'B', I'm afraid (at least, it's what I
> want, I might be an exception).

So no, application programmers don't have to choose; if they do things the broken (old) way, assuming ext3 semantics, users won't lose existing files, thanks to the workaround patches. Those applications will be unsafe for many other filesystems and operating systems, but maybe those application writers don't care. Unfortunately, I confused a lot of people by telling people they should use fsync(), instead of saying, "that's OK, ext4 will take care of it for you", because I care about application portability. But I implemented the application workarounds *first* because I knew that it would take a long time for people to fix their applications. Users will be protected either way.

If applications use fsync(), they really won't be using much in the way of extra power, really! If they are replacing hundreds of files in a very short time interval, and doing that all the time, then that's going to burn power no matter what the filesystem tries to do.

Regards,

- Ted

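To make the replace-via-truncate hazard discussed above concrete, here is a small Python sketch that shows the moment at which the old contents are already gone but the new ones have not been written yet. It only observes the file size in the middle of a save; it does not simulate a crash or any ext4 behaviour, and the function name `size_during_truncating_save` is invented for this illustration.

```python
import os
import tempfile


def size_during_truncating_save(path: str, new_data: bytes) -> int:
    """Save by truncating in place and report the file size seen mid-save."""
    with open(path, "wb") as f:  # opening with "wb" truncates the file to 0 bytes
        # A crash at this point would leave, at best, an empty file:
        # the old contents are gone and the new ones are not written yet.
        size_mid_save = os.path.getsize(path)
        f.write(new_data)
    return size_mid_save


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "settings.conf")
        with open(path, "wb") as f:
            f.write(b"old settings")
        print(size_during_truncating_save(path, b"new settings"))  # prints 0
```

Writing to a temporary file and renaming it over the target avoids this window, because the old file stays intact until the new one is complete.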
* Re: Ext4 and the "30 second window of death"
  2009-04-01 0:04 ` Theodore Tso
@ 2009-04-01 1:14 ` Alberto Gonzalez
  0 siblings, 0 replies; 59+ messages in thread
From: Alberto Gonzalez @ 2009-04-01 1:14 UTC
To: Theodore Tso
Cc: Linux Kernel Mailing List

Ted,

I agree with all you've said and now I really think we're making way too much fuss about a quite simple issue (we, stupid users).

On Wednesday 01 April 2009 02:04:47 Theodore Tso wrote:
> Quite frankly, the people who are complaining about "fsync() will burn
> too much power" are really protesting way too much.

Yes, I guess you're right. Filesystem behaviour is not going to make that much difference; it's user and application behaviour that will determine battery life (plus hardware capabilities, obviously).

> Firefox wants to replace a large number of files (and in practice
> writes 2.5 megabytes of data) each time you click on a link. (This is
> not great for SSD write endurance; after browsing 400 links, you've
> written about a gigabyte to your SSD.)

Agreed. In fact I always thought that the ext3+fsync problem with Firefox was mostly a myth. The fact is that Firefox 3 has some rather unrealistic settings that cause an insane amount of I/O (disk, but also network I/O). I was using an old computer with a very slow 40 GB, 5400 rpm IDE hard drive at the time Firefox 3 came out and had some problems. After going through all the options and choosing reasonable settings the problems went away forever (but then I use Firefox reasonably, not with a couple hundred tabs open at the same time - no filesystem can fix that).

> But let's be realistic here; if
> you're browsing the web, the power used by running flash animations by
> the web browser, not to mention the power costs of the WiFi is
> probably at least as much if not more than the cost of spinning up the
> disk.

Since I just tested this the other day, I'll post the numbers: With Flash enabled, Konqueror visiting 3 pages, one of them with one small Flash ad, my battery lasted for 184 minutes (an average of 8.5 watts out of my 26 Wh battery). Without Flash, 205 minutes, an average of 7.6 watts (this is on an HP Mini netbook).

Anyway, I agree with all the below too. Thanks again for the detailed explanation.

Regards,
Alberto.

>
> At least when I'm running on batteries, I keep the number of
> applications down to a minimum, and regardless of whether we are
> batching I/O's using laptop mode or not, it's *always* going to save
> more power to not do file I/O at all than to do file I/O with some
> kind of batching scheme. So the folks who are saying that they can't
> afford to fsync() every single file for power reasons really are
> making an excuse; the reality is that if they were really worried
> about power consumption, they would be going out of their way to avoid
> file writes unless it's really necessary. It's one thing if a user
> wants to save their OpenOffice document; when the user wants to save
> it, they should save it, and it should go to disk pretty fast --- how
> much work the user is willing to risk should be based on how often the
> user manually types ^S, or how the user configures their application
> to do periodic auto-saves --- whether that's once a minute, or every 3
> minutes, or every 5 minutes, or every 10 minutes.
>
> But if there's some application which is replacing hundreds of files a
> minute, then that's the real problem, whether they use fsync() or not.
>
> Now, while I think the whole "we can't use fsync() for power reasons"
> argument is an excuse, it's also true that we're not going to be able to
> change all applications at the drop of a hat, and it may in fact be
> impossible to fix all applications, perhaps for years to come. It is
> for that reason that ext4 has the replace-via-truncate and
> replace-via-rename workarounds. These currently start I/O as soon as
> the file is closed (if it had been previously truncated), or renamed
> (if it overwrites a target file). From a power perspective, it would
> have been better to wait until the next commit boundary to initiate
> the I/O (although doing it right away is better from an I/O smoothing
> perspective and to reduce fsync latencies). But again, if the
> application is replacing a huge number of files on a frequent basis,
> that's what's going to suck the most amount of power; batching to
> allow the disk to spin down might save a little, but fundamentally the
> application is doing something that's going to be a massive power
> drain anyway.
>
> > The problem I guess is that right now application writers targeting
> > Ext4 must choose between using fsync and giving users the 'A'
> > behaviour or not using fsync and giving them the 'C' behaviour. But
> > what most users would like is 'B', I'm afraid (at least, it's what I
> > want, I might be an exception).
>
> So no, application programmers don't have to choose; if they do things
> the broken (old) way, assuming ext3 semantics, users won't lose
> existing files, thanks to the workaround patches. Those applications
> will be unsafe for many other filesystems and operating systems, but
> maybe those application writers don't care. Unfortunately, I confused
> a lot of people by telling people they should use fsync(), instead of
> saying, "that's OK, ext4 will take care of it for you", because I care
> about application portability. But I implemented the application
> workarounds *first* because I knew that it would take a long time for
> people to fix their applications. Users will be protected either way.
>
> If applications use fsync(), they really won't be using much in the
> way of extra power, really! If they are replacing hundreds of files
> in a very short time interval, and doing that all the time, then that's
> going to burn power no matter what the filesystem tries to do.
>
> Regards,
>
> - Ted

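The back-of-the-envelope figures in this message can be checked with a few lines of Python. This is only a sanity check of the arithmetic, assuming the 26 Wh battery was drained completely in each run and that the 2.5 MB per click figure is in decimal megabytes; the helper names are made up for the example.

```python
def average_power_watts(battery_wh: float, runtime_minutes: float) -> float:
    """Average power draw if the whole battery is used over the given runtime."""
    return battery_wh / (runtime_minutes / 60.0)


def total_written_mb(mb_per_click: float, clicks: int) -> float:
    """Total data written after a number of link clicks."""
    return mb_per_click * clicks


if __name__ == "__main__":
    with_flash = average_power_watts(26, 184)
    without_flash = average_power_watts(26, 205)
    print(f"with Flash:    {with_flash:.1f} W")                  # 8.5 W
    print(f"without Flash: {without_flash:.1f} W")               # 7.6 W
    print(f"difference:    {with_flash - without_flash:.1f} W")  # 0.9 W
    print(f"400 clicks:    {total_written_mb(2.5, 400):.0f} MB") # 1000 MB
```

So the reported averages are consistent with the runtimes, and 400 clicks at 2.5 MB each come to roughly one gigabyte of writes.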
* Re: Ext4 and the "30 second window of death"
  2009-03-31 13:45 ` Theodore Tso
  2009-03-31 14:45 ` Alberto Gonzalez
@ 2009-03-31 22:02 ` Alberto Gonzalez
  2009-03-31 23:22 ` Andreas T.Auer
  1 sibling, 1 reply; 59+ messages in thread
From: Alberto Gonzalez @ 2009-03-31 22:02 UTC
To: Theodore Tso
Cc: Linux Kernel Mailing List

On Tuesday 31 March 2009 15:45:47 Theodore Tso wrote:
> On Tue, Mar 31, 2009 at 02:52:05PM +0200, Alberto Gonzalez wrote:
> > You've proposed that in laptop mode, fsync's should be held until next
> > write cycle (say every 30 seconds) so that the disk is not spun up
> > unnecessarily, wasting battery and shortening its lifespan too. I
> > absolutely agree with this, and as a trade-off I'm ok with losing my last
> > paragraph even if I did hit Ctrl+S to save it a few seconds before a
> > crash. But again, with Ext4 will I just lose that last paragraph or the
> > whole book in this case?
>
> Laptop mode is already set up such that the moment the disk spins up,
> any pending writes are immediately flushed to disk --- the idea being
> that if the disk is spinning, we might as well take advantage of it to
> get everything pushed out to disk. As long as we actually keep a
> linked list of those fsync's which were "held up", and we make sure
> all of the delayed allocation blocks are also allocated before we push
> them out, the right thing will happen. If we just ignore the fsync's,
> then we might not allocate the delayed allocation blocks. So
> basically, we need to be careful about how we implement this addition
> to laptop_mode.

In fact, thinking about it, this option would be the ideal one for desktops and especially laptops (servers running databases are a different thing). What we need is that _no_ application uses fsync. The decision as to when the data should be written to disk should be left to the filesystem. And then the user can choose how often they want this to happen (every 5, 15, 30, 60... seconds). So if Ext4 could have a "nofsync" mount option that would disable fsync from applications (i.e., it wouldn't honor an fsync call), that would be wonderful. But then of course we have to make sure that if the kernel crashes (or there's a power-off, etc.), we will just lose the new data that hasn't been written to disk, but the old data will still be there. So maybe this could be achieved by mounting the filesystem with nofsync, nodelalloc?

> The bottom line is that it *can* be implemented safely, but there are
> some things that we would need to pay attention to in order to make
> sure it *was* safe.

If you could do this, many of us would be willing to buy you a beer :)

>
> - Ted

And of course, thanks for your patience with this issue. And sorry for all you're having to take from us uninformed but somewhat worried users (I run Ext4 now, but added the nodelalloc option when all this started).

Alberto.

* Re: Ext4 and the "30 second window of death"
  2009-03-31 22:02 ` Alberto Gonzalez
@ 2009-03-31 23:22 ` Andreas T.Auer
  2009-04-01 1:25 ` Alberto Gonzalez
  2009-04-01 1:50 ` Theodore Tso
  0 siblings, 2 replies; 59+ messages in thread
From: Andreas T.Auer @ 2009-03-31 23:22 UTC
To: Alberto Gonzalez
Cc: Theodore Tso, Linux Kernel Mailing List

On 01.04.2009 00:02 Alberto Gonzalez wrote:
> In fact, thinking about it, this option would be the ideal one for desktops
> and especially laptops (servers running databases are a different thing). What
> we need is that _no_ application uses fsync. The decision as to when the data
> should be written to disk should be left to the filesystem. And then the user
> can choose how often they want this to happen (every 5, 15, 30, 60...
> seconds). So if Ext4 could have a "nofsync" mount option that would disable
> fsync from applications (i.e., it wouldn't honor an fsync call), that would be
> wonderful. But then of course we have to make sure that if the kernel crashes
> (or there's a power-off, etc.), we will just lose the new data that hasn't
> been written to disk, but the old data will still be there. So maybe this
> could be achieved by mounting the filesystem with nofsync, nodelalloc?
>

You are always thinking about the few seconds/minutes of work you're going to lose, but there are different situations, too.

E.g. your POP3 client receives a very important mail, saves it to disk, uses fsync to make sure it is out and tells the server to delete it. If you delay the fsync, you will have a long window in which the mail can get lost instead of a minimal one. Or are there any POP3 clients which can synchronize the mail polling with a spinning disk?

There are tasks that are not very important and should not spin up the disk, and there are tasks that had better do so. It is the preference of the user which tasks should or should not spin up the disk, but the application developer has to decide globally whether or not to use fsync(), and the filesystem can't even distinguish the tasks at all, except by whether it receives fsyncs or not.

So fine-tuning the system to the ideal disk-writing policy is really problematic, especially given a lot of different people turning knobs:
- different filesystem developers using different methods and default behaviors, which can be changed by distros and sys admins.
- different applications trying to use or not use fsync() and other methods to get the best policies for any kind of fs. Or the developers are incompetent enough to expect features from the filesystem which are not always given, whether trained by ext3 data=ordered or trained by reiserfs or just lacking any better fs knowledge.
- different users having different preferences on how important which data is, but usually they cannot change the fsync policy of the applications.

Andreas

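The ordering in the POP3 example above (store the mail, make it durable, only then delete it on the server) can be sketched in Python as follows. This is an illustrative sketch only: `InMemoryMailServer` is a hypothetical stand-in rather than a real POP3 client API, the `.eml` naming is invented, and the directory fsync assumes a POSIX system such as Linux.

```python
import os
import tempfile


class InMemoryMailServer:
    """Hypothetical stand-in for a POP3 server; not a real client library."""

    def __init__(self, messages):
        self.messages = dict(messages)

    def retrieve(self, msg_id):
        return self.messages[msg_id]

    def delete(self, msg_id):
        del self.messages[msg_id]


def fetch_durably(server, msg_id, maildir):
    """Download one message and delete it remotely only once it is on disk."""
    data = server.retrieve(msg_id)
    path = os.path.join(maildir, f"{msg_id}.eml")
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # wait until the message body is on stable storage
    dir_fd = os.open(maildir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)      # also persist the new directory entry (POSIX assumption)
    finally:
        os.close(dir_fd)
    server.delete(msg_id)     # only now is it reasonable to drop the server copy
    return path
```

If the fsync were deferred for 30 seconds or silently ignored, the `server.delete()` call would run while the only copy of the mail might still live in the page cache, which is the window the message above is worried about.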
* Re: Ext4 and the "30 second window of death" 2009-03-31 23:22 ` Andreas T.Auer @ 2009-04-01 1:25 ` Alberto Gonzalez 2009-04-01 1:50 ` Theodore Tso 1 sibling, 0 replies; 59+ messages in thread From: Alberto Gonzalez @ 2009-04-01 1:25 UTC (permalink / raw) To: Andreas T.Auer; +Cc: Theodore Tso, Linux Kernel Mailing List On Wednesday 01 April 2009 01:22:19 Andreas T.Auer wrote: > On 01.04.2009 00:02 Alberto Gonzalez wrote: > > In fact, thinking about it, this option would be the ideal one for > > desktops and especially laptops (servers running databases are a > > different thing). What we need is that _no_ application uses fsync. The > > decision as to when the data should be written to disk should be left to > > the filesystem. And then the user can choose how often they want this to > > happen (every 5, 15, 30, 60... seconds). So if Ext4 could have a > > "nofsync" mount option that would disable fsync from applications (i.e, > > it wouldn't honor an fsync call), that would be wonderful. But then of > > course we have to make sure that if the kernel crashes (or there's a > > power-off, etc..), we will just lose the new data that hasn't been > > written to disk, but the old data will still be there. So maybe this > > could be achieved with mounting the filesystem with nofsync, nodelalloc? > > You are always thinking about the few seconds/minutes of work you gonna > lose, but there are different situations, too. > > E.g. your POP3 client receives a very important mail, saves it to disk, > uses fsync to make sure it is out and tells the server to delete it. If > you are gonna delay the fsync, you will have a long window in which the > mail can get lost instead of a minimum window. Or are there any POP3 > clients, which can synchronize the mail-polling with a spinning a disk? Yes, I guess this is a clear example of data that needs to be written to disk straight away. 
> > There are tasks that are not very important, that should not spin up the > disk and there are tasks, that might better do so. It is the preference > of the user, which tasks should or should not spin up the disk, but the > application developer has to decide globally, whether or not to use > fsync() and the filesystem can't even distinguish the tasks at all, > except that it receives fsyncs or not. > > So fine-tuning the system to the ideal disk-writing policy is really > problematic, especially given a lot of different people turning knobs: > - different filesystem developers using different methods and default > behaviors, which can be changed by distros and sys admins. > - different applications trying to use or not use fsync() and other > methods to get the best policies for any kind of fs. Or the developers > are incompetent enough to expect features from the filesystem which are > not always given, whether trained by ext3 data=ordered or trained by > reiserfs or just bare of any better fs knowledge. > - different users having different preferences on what data is how > important, but usually they can not change the fsync-policy of the > applications. Yes, I agree with all the above. There's no magic recipe for any filesystem, and honestly, I've never had problems with reiserfs in the past or ext3 later on. I don't know why I got scared with all this "ext4 will give you zero-length files on every crash unless all applications start to fsync like crazy and kill your hard drive in a year's time" thing. Filesystem developers must have a bit of knowledge about how this works to not do something too stupid. > Andreas Alberto. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-03-31 23:22 ` Andreas T.Auer 2009-04-01 1:25 ` Alberto Gonzalez @ 2009-04-01 1:50 ` Theodore Tso 2009-04-01 5:20 ` Sitsofe Wheeler 2009-04-01 8:51 ` Andreas T.Auer 1 sibling, 2 replies; 59+ messages in thread From: Theodore Tso @ 2009-04-01 1:50 UTC (permalink / raw) To: Andreas T.Auer; +Cc: Alberto Gonzalez, Linux Kernel Mailing List On Wed, Apr 01, 2009 at 01:22:19AM +0200, Andreas T.Auer wrote: > You are always thinking about the few seconds/minutes of work you gonna > lose, but there are different situations, too. > > E.g. your POP3 client receives a very important mail, saves it to disk, > uses fsync to make sure it is out and tells the server to delete it. If > you are gonna delay the fsync, you will have a long window in which the > mail can get lost instead of a minimum window. Or are there any POP3 > clients, which can synchronize the mail-polling with a spinning a disk? True, but consider --- this is a laptop we're talking about, right? What if the laptop hard drive crashes after you accidentally drop your laptop? Even if you're using an SSD, what if someone steals your laptop? Your first mistake was using POP3. :-) Personally, what I do is create a local *copy* of my IMAP mailbox, and I delete messages on the local copy of the mail spool --- and then periodically I run a program called "mbsync" (http://isync.sourceforge.net) to propagate deletes back to the IMAP server, and download new mail to my local Maildir copy of my mail spool. But still, you're right. In some cases, you really want "fsync()" to mean "fsync()". I'm not sure how often such applications _should_ be running on a laptop which is prone to being dropped and/or stolen. This would have to be something that a user chooses to do on their system, and they would have to take into account whether they are running some workloads that really can't tolerate data loss or not. 
If all they are doing is browsing the web, and the issue is firefox's desire to constantly write to their home directory, the user should be able to say, "you know, my battery life is more important than making sure that every last web page I visit is saved away in some file --- Firefox's 'Awesome Bar' really isn't worth that much to me." Of course, there is the question whether most users will be able to understand the risks of doing things like using POP3 and fetchmail as described in your scenario above. And that's a valid question --- so it's worth asking whether suppressing fsync()'s really saves enough power to be worth it, as opposed to say, fixing applications that are write-happy, or choosing not to use applications which are write-happy when you are running on battery. - Ted ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-01 1:50 ` Theodore Tso @ 2009-04-01 5:20 ` Sitsofe Wheeler 2009-04-01 15:12 ` Matthew Garrett 2009-04-01 8:51 ` Andreas T.Auer 1 sibling, 1 reply; 59+ messages in thread From: Sitsofe Wheeler @ 2009-04-01 5:20 UTC (permalink / raw) To: Theodore Tso, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Tue, Mar 31, 2009 at 09:50:10PM -0400, Theodore Tso wrote: > > But still, you're right. In some cases, you really want "fsync()" to > mean "fsync()". I'm not sure how often such applications _should_ be Hmm. This is starting to sound a lot like the OSX fsync ( http://developer.apple.com/documentation/Darwin/Reference/Manpages/man2/fsync.2.html ) where there is effectively a "fsync harder" syscall (F_FULLFSYNC fcntl11). > If all they are doing is browsing the web, and the issue is firefox's > desire to constantly write to their home directory, the user should be > able to say, "you know, my battery life is more important that making > sure that every last web page I visit is saved away in some file --- > Firefox's 'Awesome Bar' really isn't worth that much to me." The "Awesome(bar) Firefox 3 fsync Problem" isn't that you are missing a day's worth of browsing. The issue is that the sqlite database might become corrupt and lose _all history_ if fsync lies/doesn't happen and a crash occurs ( https://bugzilla.mozilla.org/show_bug.cgi?id=435712#c10). With Firefox 2 there was a file swap happening so an fsync wasn't vital. Just out of curiosity, when laptop mode is happening is there a guarantee that writes to other files won't be reordered to before the fsync? -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 59+ messages in thread
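The "fsync harder" idea mentioned above can be illustrated with a short Python sketch. It rests on assumptions: `sync_harder` is a hypothetical helper name, and `fcntl.F_FULLFSYNC` is expected to be defined only on macOS builds of Python, so on other systems the code simply falls back to `os.fsync()`. It shows the shape of the call, not how any particular application does it.

```python
import fcntl
import os
import tempfile


def sync_harder(fd):
    """Flush fd to stable storage as hard as the platform allows.

    Returns the name of the mechanism used (for illustration only).
    """
    # F_FULLFSYNC is only exposed where the OS provides it (macOS).
    full = getattr(fcntl, "F_FULLFSYNC", None)
    if full is not None:
        try:
            fcntl.fcntl(fd, full)
            return "F_FULLFSYNC"
        except OSError:
            pass  # some filesystems reject it; fall back to fsync
    os.fsync(fd)
    return "fsync"


with tempfile.NamedTemporaryFile() as f:
    f.write(b"history entry\n")
    f.flush()  # move data from Python's buffer into the kernel first
    mechanism = sync_harder(f.fileno())
    print("flushed with", mechanism)
```

On Linux this prints that plain `fsync` was used; whether that reaches the platters still depends on the filesystem and drive cache settings.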
* Re: Ext4 and the "30 second window of death" 2009-04-01 5:20 ` Sitsofe Wheeler @ 2009-04-01 15:12 ` Matthew Garrett 2009-04-01 17:35 ` Theodore Tso 0 siblings, 1 reply; 59+ messages in thread From: Matthew Garrett @ 2009-04-01 15:12 UTC (permalink / raw) To: Sitsofe Wheeler Cc: Theodore Tso, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Wed, Apr 01, 2009 at 06:20:50AM +0100, Sitsofe Wheeler wrote: > Just out of curiosity, when laptop mode is happening is there a > guarantee that writes to other files won't be reordered to before the > fsync? laptop-mode does two things - tweak the dirty page semantics slightly (not in an interestingly relevant way) and call sys_sync() a few seconds after something hits disk rather than cache. In contrast to Ted's suggestion that laptop-mode reduces data integrity, it actually enhances it by opportunistically ensuring that data hits disk. It's the lengthening of the commit intervals that usually accompanies it that increases the risk of data loss. -- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
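To see which writeback settings are in effect on a given machine, something like the following Python sketch could be used. It assumes the Linux sysctl files `/proc/sys/vm/laptop_mode`, `dirty_writeback_centisecs` and `dirty_expire_centisecs`; `read_vm_knobs` is a hypothetical helper name, and on systems without these files it just reports `None`. The filesystem commit interval itself is a mount option and is not covered here.

```python
import os

# Writeback-related sysctls that laptop-mode setups commonly adjust.
KNOBS = ("laptop_mode", "dirty_writeback_centisecs", "dirty_expire_centisecs")


def read_vm_knobs(base="/proc/sys/vm"):
    """Return {knob name: int value, or None if it cannot be read}."""
    values = {}
    for name in KNOBS:
        try:
            with open(os.path.join(base, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            values[name] = None  # not Linux, or file unreadable
    return values


knobs = read_vm_knobs()
for name, value in knobs.items():
    print(name, "=", value)
```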
* Re: Ext4 and the "30 second window of death" 2009-04-01 15:12 ` Matthew Garrett @ 2009-04-01 17:35 ` Theodore Tso 2009-04-01 17:43 ` Matthew Garrett 2009-04-02 11:37 ` Ext4 and the "30 second window of death" Sitsofe Wheeler 0 siblings, 2 replies; 59+ messages in thread From: Theodore Tso @ 2009-04-01 17:35 UTC (permalink / raw) To: Matthew Garrett Cc: Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Wed, Apr 01, 2009 at 04:12:21PM +0100, Matthew Garrett wrote: > On Wed, Apr 01, 2009 at 06:20:50AM +0100, Sitsofe Wheeler wrote: > > > Just out of curiosity, when laptop mode is happening is there a > > guarantee that writes to other files won't be reordered to before the > > fsync? > > laptop-mode does two things - tweak the dirty page semantics slightly > (not in an interestingly relevant way) and call sys_sync() a few seconds > after something hits disk rather than cache. In contrast to Ted's > suggestion that laptop-mode reduces data integrity, it actually enhances > it by opportunistically ensuring that data hits disk. It's the > lengthening of the commit intervals that usually accompanies it that > increases the risk of data loss. It *can* reduce data integrity; it really depends on how it's tuned and what scenario you're talking about. To the extent that it uses sys_sync(), it could help in some cases as well, since filesystems that do delayed allocation will wake up when the commit interval fires, and then force out all writes to the disk, yes. But before the commit interval, there is an increased risk of data loss --- which the user requested. The other subtlety comes if we add fsync() suppression to laptop mode --- which is something that Bart Samwel is very interested in doing and I talked to him at FOSDEM about this. 
As Jeff Garzik recently pointed out, however, if we let the system reorder writes across fsync() boundaries, or if we combine two writes to the same block separated by an fsync(), and the system crashes in the middle of pushing all of these blocks out to the disk, we can end up trashing the consistency guarantees of a database such as mysql or postgres. It's a good point, but it only applies if we add fsync() suppression to laptop mode --- which we haven't done yet. - Ted ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-01 17:35 ` Theodore Tso @ 2009-04-01 17:43 ` Matthew Garrett 2009-04-01 21:21 ` Ray Lee ` (2 more replies) 2009-04-02 11:37 ` Ext4 and the "30 second window of death" Sitsofe Wheeler 1 sibling, 3 replies; 59+ messages in thread From: Matthew Garrett @ 2009-04-01 17:43 UTC (permalink / raw) To: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Wed, Apr 01, 2009 at 01:35:21PM -0400, Theodore Tso wrote: > On Wed, Apr 01, 2009 at 04:12:21PM +0100, Matthew Garrett wrote: > > On Wed, Apr 01, 2009 at 06:20:50AM +0100, Sitsofe Wheeler wrote: > > > > > Just out of curiosity, when laptop mode is happening is there a > > > guarantee that writes to other files won't be reordered to before the > > > fsync? > > > > laptop-mode does two things - tweak the dirty page semantics slightly > > (not in an interestingly relevant way) and call sys_sync() a few seconds > > after something hits disk rather than cache. In contrast to Ted's > > suggestion that laptop-mode reduces data integrity, it actually enhances > > it by opportunistically ensuring that data hits disk. It's the > > lengthening of the commit intervals that usually accompanies it that > > increases the risk of data loss. > > It *can* reduce data integrity; it really depends on how it's tuned > and what scenario you're talking about. To the extent that it uses > sys_sync(), it could help in some cases as well, since filesystems > that do delayed allocation will wake up when the commit interval > fires, and then force out all writes to the disk, yes. But before the > commit interval, there is an increased risk of data loss --- which the > user requested. Not from laptop-mode. Let's separate the functionality from the typical use case. > The other subtlety comes if we add fsync() suppression to laptop mode > --- which is something that Bart Samwel is very interested in doing > and I talked to him at FOSDEM about this. 
As Jeff Garzik recently > pointed out, however, if we let the system reorder writes across > fsync() boundaries, or if we combine two writes to the same block > separated by an fsync(), and the system crashes in the middle of > pushing all of these blocks out to the disk, we can end up trashing > the consistency guarantees of a database such as mysql or postgres. > It's a good point, but it only applies if we add fsync() suppression > to laptop mode --- which we haven't done yet. I've got absolutely no idea why anyone would want fsync() to stop meaning "Put my data on the disk please". laptop-mode isn't intended to reduce data integrity - it's intended to batch disk write-outs such that there's a lower risk of needing to perform further write-outs in future. It makes sense for applications which really desperately want information on disk to fsync() (for instance, saving a file in OpenOffice). laptop-mode is something that makes sense as a default behaviour under a lot of circumstances. Adding fsync() suppression means it's utterly impossible to use it in that way. An additional mode would be perfectly reasonable, as long as it's made clear that it's really a request for data to be discarded at some point. The current mode isn't. -- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-01 17:43 ` Matthew Garrett @ 2009-04-01 21:21 ` Ray Lee 2009-04-01 21:26 ` Matthew Garrett 2009-04-02 11:25 ` Sitsofe Wheeler 2009-04-02 18:22 ` david 2009-04-06 21:32 ` supporting laptops fs-semantic changes (was Re: Ext4 and the "30 second window of death") Linda Walsh 2 siblings, 2 replies; 59+ messages in thread From: Ray Lee @ 2009-04-01 21:21 UTC (permalink / raw) To: Matthew Garrett Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Wed, Apr 1, 2009 at 10:43 AM, Matthew Garrett <mjg59@srcf.ucam.org> wrote: > I've got absolutely no idea why anyone would want fsync() to stop > meaning "Put my data on the disk please". Some guy named Andrew used to run a kernel with 'return 0' at the top of fsync and fdatasync: http://lkml.org/lkml/2007/4/27/88 It's that the latency penalty of apps using *sync() on common hardware sucks. That's all, and finding a way to fix that would make this entire thread go away, I think. ^ permalink raw reply [flat|nested] 59+ messages in thread
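The latency cost being complained about is easy to measure roughly. The Python sketch below is only an illustration: `time_writes` is a hypothetical helper, the record count of 50 is arbitrary, and the numbers depend entirely on the hardware and filesystem (a temporary directory on tmpfs will show almost no difference).

```python
import os
import tempfile
import time


def time_writes(path, count, use_fsync):
    """Return the seconds taken to append `count` small records to `path`."""
    start = time.perf_counter()
    with open(path, "ab") as f:
        for i in range(count):
            f.write(b"record %d\n" % i)
            if use_fsync:
                f.flush()             # Python buffer -> kernel
                os.fsync(f.fileno())  # kernel -> storage device
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as d:
    buffered = time_writes(os.path.join(d, "buffered.log"), 50, use_fsync=False)
    synced = time_writes(os.path.join(d, "synced.log"), 50, use_fsync=True)
    print(f"buffered: {buffered:.4f}s  fsync per record: {synced:.4f}s")
```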
* Re: Ext4 and the "30 second window of death" 2009-04-01 21:21 ` Ray Lee @ 2009-04-01 21:26 ` Matthew Garrett 2009-04-02 11:25 ` Sitsofe Wheeler 1 sibling, 0 replies; 59+ messages in thread From: Matthew Garrett @ 2009-04-01 21:26 UTC (permalink / raw) To: Ray Lee Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Wed, Apr 01, 2009 at 02:21:43PM -0700, Ray Lee wrote: > On Wed, Apr 1, 2009 at 10:43 AM, Matthew Garrett <mjg59@srcf.ucam.org> wrote: > > I've got absolutely no idea why anyone would want fsync() to stop > > meaning "Put my data on the disk please". > > Some guy named Andrew used to run a kernel with 'return 0' at the top > of fsync and fdatasync: http://lkml.org/lkml/2007/4/27/88 > > It's that the latency penalty of apps using *sync() on common hardware > sucks. That's all, and finding a way to fix that would make this > entire thread go away, I think. And also disk spinups. -- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-01 21:21 ` Ray Lee 2009-04-01 21:26 ` Matthew Garrett @ 2009-04-02 11:25 ` Sitsofe Wheeler 1 sibling, 0 replies; 59+ messages in thread From: Sitsofe Wheeler @ 2009-04-02 11:25 UTC (permalink / raw) To: Ray Lee Cc: Matthew Garrett, Theodore Tso, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Wed, Apr 01, 2009 at 02:21:43PM -0700, Ray Lee wrote: > > Some guy named Andrew used to run a kernel with 'return 0' at the top > of fsync and fdatasync: http://lkml.org/lkml/2007/4/27/88 (Quoting out of context from Andrew's mail) "hm, fsync. Aside: why the heck do applications think that their data is so important that they need to fsync it all the time." So the advice/complaint is that apps shouldn't fsync unless absolutely necessary because syncing will always be slow? -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-01 17:43 ` Matthew Garrett 2009-04-01 21:21 ` Ray Lee @ 2009-04-02 18:22 ` david 2009-04-02 18:29 ` Matthew Garrett 2009-04-02 18:34 ` Nick Piggin 2009-04-06 21:32 ` supporting laptops fs-semantic changes (was Re: Ext4 and the "30 second window of death") Linda Walsh 2 siblings, 2 replies; 59+ messages in thread From: david @ 2009-04-02 18:22 UTC (permalink / raw) To: Matthew Garrett Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Wed, 1 Apr 2009, Matthew Garrett wrote: >> The other subtlety comes if we add fsync() suppression to laptop mode >> --- which is something that Bart Samwel is very interested in doing >> and I talked to him at FOSDEM about this. As Jeff Garzik recently >> pointed out, however, if we let the system reorder writes across >> fsync() boundaries, or if we combine two writes to the same block >> separated by an fsync(), and the system crashes in the middle of >> pushing all of these blocks out to the disk, we can end up trashing >> the consistency guarantees of a database such as mysql or postgres. >> It's a good point, but it only applies if we add fsync() suppression >> to laptop mode --- which we haven't done yet. > > I've got absolutely no idea why anyone would want fsync() to stop > meaning "Put my data on the disk please". laptop-mode isn't intended to > reduce data integrity - it's intended to batch disk write-outs such that > there's a lower risk of needing to perform further write-outs in future. > It makes sense for applications which really desperately want > information on disk to fsync() (for instance, saving a file in > OpenOffice). > > laptop-mode is something that makes sense as a default behaviour under a > lot of circumstances. Adding fsync() suppression means it's utterly > impossible to use it in that way. 
An additional mode would be perfectly > reasonable, as long as it's made clear that it's really a request for > data to be discarded at some point. The current mode isn't. This issue seems pretty straightforward to me. The apps do fsync (and similar) to the degree that they think their data is important (potentially with config options if they acknowledge that their data isn't _always_ that important). The system allows the admin to override the application and say "I'm willing to lose up to X seconds of data for other benefits". If this can work cleanly (with the ordering issue that was identified, which may involve having multiple versions of the metadata cached), it seems like a very clean interface. David Lang ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 18:22 ` david @ 2009-04-02 18:29 ` Matthew Garrett 2009-04-02 18:44 ` david 2009-04-02 18:34 ` Nick Piggin 1 sibling, 1 reply; 59+ messages in thread From: Matthew Garrett @ 2009-04-02 18:29 UTC (permalink / raw) To: david Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Thu, Apr 02, 2009 at 11:22:48AM -0700, david@lang.hm wrote: > On Wed, 1 Apr 2009, Matthew Garrett wrote: > >laptop-mode is something that makes sense as a default behaviour under a > >lot of circumstances. Adding fsync() suppression means it's utterly > >impossible to use it in that way. An additional mode would be perfectly > >reasonable, as long as it's made clear that it's really a request for > >data to be discarded at some point. The current mode isn't. > > this issue seems pretty straightforward to me > > the apps do fsync (and similar) to the degree that they think their data > is important (potentially with config options if they acknowlege that > their data isn't _always_ that important) > > the system allows the admin to override the application and say "I'm > willing to loose up to X seconds of data for other benifits" > > if this can work cleanly (with the ordering issue that was identified, > which may involve having multiple versions of the metadata cached) it > seems like a very clean interface. It does, but it's a different interface to the current one with a different aim and a different set of tradeoffs. The current behaviour of laptop-mode is that fsync() results in things hitting disk. The only configurability of laptop-mode is how long it then waits to flush out everything else as well. The solution to "fsync() causes disk spinups" isn't "ignore fsync()". It's "ensure that applications only use fsync() when they really need it", which requires us to also be able to say "fsync() should not be required to ensure that events occur in order". 
-- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 18:29 ` Matthew Garrett @ 2009-04-02 18:44 ` david 2009-04-02 20:07 ` Ray Lee ` (2 more replies) 0 siblings, 3 replies; 59+ messages in thread From: david @ 2009-04-02 18:44 UTC (permalink / raw) To: Matthew Garrett Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Thu, 2 Apr 2009, Matthew Garrett wrote: > On Thu, Apr 02, 2009 at 11:22:48AM -0700, david@lang.hm wrote: >> On Wed, 1 Apr 2009, Matthew Garrett wrote: >>> laptop-mode is something that makes sense as a default behaviour under a >>> lot of circumstances. Adding fsync() suppression means it's utterly >>> impossible to use it in that way. An additional mode would be perfectly >>> reasonable, as long as it's made clear that it's really a request for >>> data to be discarded at some point. The current mode isn't. >> >> this issue seems pretty straightforward to me >> >> the apps do fsync (and similar) to the degree that they think their data >> is important (potentially with config options if they acknowlege that >> their data isn't _always_ that important) >> >> the system allows the admin to override the application and say "I'm >> willing to loose up to X seconds of data for other benifits" >> >> if this can work cleanly (with the ordering issue that was identified, >> which may involve having multiple versions of the metadata cached) it >> seems like a very clean interface. > > It does, but it's a different interface to the current one with a > different aim and a different set of tradeoffs. The current behaviour of > laptop-mode is that fsync() results in things hitting disk. The only > configurability of laptop-mode is how long it then waits to flush out > everything else as well. > > The solution to "fsync() causes disk spinups" isn't "ignore fsync()". 
> It's "ensure that applications only use fsync() when they really need > it", which requires us to also be able to say "fsync() should not be > required to ensure that events occur in order". Ignore the issue of order on the local disk for the moment. What should an application do to make sure its data isn't lost? Let's not talk about a database here; let's talk about something simpler, like a POP3 mail client (even though I strongly favor IMAP ;-). It wants to have the message saved before it deletes it from the server. How should it try to do this? The only portable method is to fsync the file after it's written and before sending the delete to the server. So your mail client _should_ issue fsync calls. However, some (many, most??) users would probably be willing to lose a little e-mail to gain a significant increase in battery life on their laptops. Today they have no choice (other than picking a mail client that doesn't try to protect its local data). With the proposed addition to laptop mode (delaying fsync until the disk is awake), the user (or more precisely the admin) gains the ability to define this trade-off rather than depending on the application developers all doing this right. Without this, we end up in a situation like the powertop wakeups: it only takes one 'buggy' application to destroy your power management and performance. But in this case, the application that is 'buggy' from a power management point of view may be entirely correct from a data safety point of view. David Lang ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 18:44 ` david @ 2009-04-02 20:07 ` Ray Lee 2009-04-02 20:59 ` Andreas T.Auer 2009-04-02 22:36 ` Bron Gondwana 2009-04-02 23:46 ` Matthew Garrett 2 siblings, 1 reply; 59+ messages in thread From: Ray Lee @ 2009-04-02 20:07 UTC (permalink / raw) To: david Cc: Matthew Garrett, Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Thu, Apr 2, 2009 at 11:44 AM, <david@lang.hm> wrote: > let's not talk a database here, let's talk something simpler, like a POP3 > mail client (even though I strongly favor IMAP ;-) > > it wants to have the message saved before it deletes it from the server. > > how should it try to do this? > > the only portable method is to fsync the file after it's written and before > sending the delete to the server. > > so your mail client _should_ issue fsync calls. That's just not the case. Every POP fetcher I've seen offers an option to leave seen messages on the server for some period measured in days. Setting it to one day means that the data will eventually get flushed by the time the message is deleted. So, no, the mail client does not have to issue fsync()s at all. (If dirty data can hang around unwritten for 24 hours, I'd argue that's a misfeature of the filesystem or kernel.) Alternately, a client could fetch once every half hour at which point the cost of an fsync is amortized over all the fetched messages. ^ permalink raw reply [flat|nested] 59+ messages in thread
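The batching idea (one fsync amortized over all fetched messages, and only then deleting from the server) could look roughly like this in Python. This is a sketch under stated assumptions: `fetch_batch` is a hypothetical function, a plain list stands in for the POP3 mailbox, and clearing the list stands in for the server-side delete commands.

```python
import os
import tempfile


def fetch_batch(server_messages, spool_path):
    """Append every pending message to the spool, sync once, then delete.

    `server_messages` is a plain list standing in for a POP3 mailbox.
    """
    with open(spool_path, "ab") as spool:
        for msg in server_messages:
            spool.write(msg + b"\n")
        spool.flush()
        os.fsync(spool.fileno())  # a single sync covers the whole batch
    delivered = len(server_messages)
    # Only now is it safe to drop the server copies.
    server_messages.clear()
    return delivered


with tempfile.TemporaryDirectory() as d:
    spool = os.path.join(d, "inbox.mbox")
    server = [b"mail one", b"mail two", b"mail three"]
    count = fetch_batch(server, spool)
    with open(spool, "rb") as f:
        stored = f.read()
    print(count, "messages stored,", len(server), "left on server")
```

The ordering is the point: nothing is removed from the server until the one fsync has returned, so the disk is woken once per fetch rather than once per message.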
* Re: Ext4 and the "30 second window of death" 2009-04-02 20:07 ` Ray Lee @ 2009-04-02 20:59 ` Andreas T.Auer 2009-04-02 23:38 ` Theodore Tso 0 siblings, 1 reply; 59+ messages in thread From: Andreas T.Auer @ 2009-04-02 20:59 UTC (permalink / raw) To: Ray Lee Cc: david, Matthew Garrett, Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On 02.04.2009 22:07 Ray Lee wrote: > On Thu, Apr 2, 2009 at 11:44 AM, <david@lang.hm> wrote: >> let's not talk a database here, let's talk something simpler, like a POP3 >> mail client (even though I strongly favor IMAP ;-) >> >> it wants to have the message saved before it deletes it from the server. >> >> how should it try to do this? >> >> the only portable method is to fsync the file after it's written and before >> sending the delete to the server. >> >> so your mail client _should_ issue fsync calls. > > That's just not the case. Every POP fetcher I've seen offers an option > to leave seen messages on the server for some period measured in days. > Setting it to one day means that the data will eventually get flushed > by the time the message is deleted. Yes, but a lot of users (and I assume >90% of POP3 users) don't use this option. > So, no, the mail client does not have to issue fsync()s at all. Except when operating in immediate-delete mode. > Alternately, a client could fetch once every half hour at which point > the cost of an fsync is amortized over all the fetched messages. Again this is forcing a policy on how users should configure their clients. And don't forget: POP3 was just an example. There can be a lot of other applications as well. E.g. what about an application for the reception of SMS or other mobile text messages? This is pushed to the client, not polled as with POP3 AFAIK. Andreas ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 20:59 ` Andreas T.Auer @ 2009-04-02 23:38 ` Theodore Tso 2009-04-03 0:00 ` Matthew Garrett ` (2 more replies) 0 siblings, 3 replies; 59+ messages in thread From: Theodore Tso @ 2009-04-02 23:38 UTC (permalink / raw) To: Andreas T.Auer Cc: Ray Lee, david, Matthew Garrett, Sitsofe Wheeler, Alberto Gonzalez, Linux Kernel Mailing List On Thu, Apr 02, 2009 at 10:59:39PM +0200, Andreas T.Auer wrote: > > Yes, but a lot of users (and I assume >90% of POP3 users) don't use this > option. > Sometimes, the filesystem isn't the best place to solve all problems. What's been frustrating about this whole controversy is this implicit assumption that users and applications should never change, and the filesystem should magically accommodate and Do The Right Thing. If you're *never* going to want to risk ever losing mail, then fine, fsync() it to disk before you send the POP3 DELE command. If you don't like the performance delay, or the battery consumption implications, tough. I'm fresh out of magic pixie dust. If the application is smarter about not deleting the messages from the POP spool, then you can afford not to fsync(). But (oh, horrors!) it might involve making the application smarter, and playing a synchronization game between the local POP spool and IMAP. It's more efficient to do this with IMAP, but there are POP clients that do this. If you are a mail client developer, and the user says, "I want the advantages of IMAP, but I refuse to switch to an ISP that provides IMAP; you must give me *all* the advantages of IMAP even though I'm using POP3", you'd probably tell the user, "Yes, and do you want a pony, too?" The problem is, this is what the application programmers are telling the filesystem developers. 
They refuse to change their programs; and the features they want are sometimes mutually contradictory, or at least result in an overconstrained problem --- and then they throw the whole mess at the filesystem developers' feet and say, "you fix it!" I'm not saying the filesystems are blameless, but give us a little slack, guys; we NEED some help from the application developers here. - Ted ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 23:38 ` Theodore Tso @ 2009-04-03 0:00 ` Matthew Garrett 2009-04-03 7:33 ` Pavel Machek 2009-04-03 8:14 ` Andreas T.Auer 2 siblings, 0 replies; 59+ messages in thread From: Matthew Garrett @ 2009-04-03 0:00 UTC (permalink / raw) To: Theodore Tso, Andreas T.Auer, Ray Lee, david, Sitsofe Wheeler, Alberto Gonzalez, Linux Kernel Mailing List On Thu, Apr 02, 2009 at 07:38:06PM -0400, Theodore Tso wrote: > What's been frustrating about this whole controversy is this implicit > assumptions that users and applications should never change, and the > filesystem should magically accomodate and Do The Right Thing. This is the attitude that I have a significant problem with. Filesystems exist to serve applications. Without applications, there's no reason to have a filesystem. If a filesystem doesn't provide the behaviour that applications want then that filesystem has no reason to exist. The aim isn't to produce a platonically ideal filesystem. The aim is to produce a filesystem that behaves well given the applications that use it. Disagreeing with the behaviour of applications is a perfectly sensible thing to do. However, it's something that should be done at the *start* of a filesystem development cycle. Getting agreement from a broad section of application developers means that you get to write a filesystem that embodies a different set of assumptions and everyone wins. Writing a filesystem and then bitching about application behaviour after it's been merged to mainline is just pathological. > The problem is, this is what the application programmers are telling > the filesystem developers. They refuse to change their programs; and > the features they want are sometimes mutually contradictory, or at > least result in a overconstrained problem --- and then they throw the > whole mess at the filesystem developers' feet and say, "you fix it!" Which application developers did you speak to? 
Because, frankly, the majority of the ones I know felt that ext3 embodied the pony that they'd always dreamed of as a five year old. Stephen gave them that pony almost a decade ago and now you're trying to take it to the glue factory. I remember almost crying at that bit in Animal Farm, so I'm really not surprised that you're getting pushback here. > I'm not saying the filesystems are blameless, but give us a little > slack, guys; we NEED some help from the application developers here. Then having a discussion with application developers over the expectations they can have would be a good first step. Just pointing at POSIX isn't good enough - POSIX allows a bunch of behaviours sufficiently pathological that a filesystem implementing them would be less useful than /dev/null. We need to have a worthwhile conversation about what guarantees Linux will provide above and beyond POSIX. The filesystem summit next week isn't going to be that conversation. Perhaps something to try at Plumbers? -- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 23:38 ` Theodore Tso 2009-04-03 0:00 ` Matthew Garrett @ 2009-04-03 7:33 ` Pavel Machek 2009-04-03 8:14 ` Andreas T.Auer 2 siblings, 0 replies; 59+ messages in thread From: Pavel Machek @ 2009-04-03 7:33 UTC (permalink / raw) To: Theodore Tso, Andreas T.Auer, Ray Lee, david, Matthew Garrett, Sitsofe Wheeler, Alberto Gonzalez, Linux Kernel Mailing List > If you are a mail client developer, and the user says, "I want the > advantages of IMAP, but I refuse to switch to an ISP that provides > IMAP; you must give me *all* the advantages IMAP even though I'm using > POP3", you'd probably tell the user, "Yes, and do you want a pony, > too?" Somebody wants a pony? > The problem is, this is what the application programmers are telling > the filesystem developers. They refuse to change their programs; and > the features they want are sometimes mutually contradictory, or at > least result in a overconstrained problem --- and then they throw the > whole mess at the filesystem developers' feet and say, "you fix it!" > > I'm not saying the filesystems are blameless, but give us a little > slack, guys; we NEED some help from the application developers here. From what I have seen on the gtk lists, application developers are willing to change their code. _But_ we should make sure that it does not regress. fsync() is a regression: it spins the disk up too much and is slow on ext3. (They may be willing to do that, but I believe that's a very bad idea). And yes, I hope your "let's add fsync() everywhere, then break the fsync with eat-my-data-^W-laptop-mode" plan does not happen. (Please acknowledge that it is a stupid idea...) If you give them fbarrier() or replace() or something that is nop or nearly so on ext3 data=ordered and fixes ext4/btrfs, they'll happily use it. But we do not have such a thing now, and we should not really be asking them to regress on existing setups.
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 59+ messages in thread
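For readers following along: the fbarrier() and replace() calls mentioned above are proposals, not existing interfaces. What applications can do today is the write-to-temporary, fsync(), rename() sequence that the thread keeps referring to. Below is a minimal sketch of that sequence in Python, assuming a POSIX system; the function name atomic_save and the durable flag are made up for illustration. With durable=False the fsync() calls are skipped, which is the "ordering only" behaviour applications relied on under ext3 data=ordered and which is not guaranteed in general.

```python
import os
import tempfile


def atomic_save(path, data, durable=True):
    """Replace `path` with `data` via a temporary file and rename.

    Hypothetical helper for illustration. After a crash the file should
    hold either the old or the new contents, never a truncated mix,
    provided the filesystem orders the data before the rename or
    durable=True is used.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            if durable:
                # Force the new contents to disk before the rename.
                # This is the call that spins up a sleeping disk.
                os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    if durable:
        # Also flush the directory entry so the rename itself survives.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Calling atomic_save("book.odt", contents) on every Ctrl+S gives the "lose the last paragraph, not the book" outcome from the original question, at the cost of one disk spin-up per save when durable=True.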
* Re: Ext4 and the "30 second window of death" 2009-04-02 23:38 ` Theodore Tso 2009-04-03 0:00 ` Matthew Garrett 2009-04-03 7:33 ` Pavel Machek @ 2009-04-03 8:14 ` Andreas T.Auer 2 siblings, 0 replies; 59+ messages in thread From: Andreas T.Auer @ 2009-04-03 8:14 UTC (permalink / raw) To: Theodore Tso, Andreas T.Auer, Ray Lee, david, Matthew Garrett, Sitsofe Wheeler, Alberto Gonzalez, Linux Kernel Mailing List On 03.04.2009 01:38 Theodore Tso wrote: > On Thu, Apr 02, 2009 at 10:59:39PM +0200, Andreas T.Auer wrote: >> Yes, but a lot of users (and I assume >90% of POP3 users) don't use this >> option. >> > > Sometimes, the filesystem isn't the best place to solve all problems. Surely you cannot solve all problems in the filesystem. Especially the delay-spin-up vs. keep-all-important-recent-data problem simply can't be done by the filesystem. It can't be done by the application either, because it is the decision of the user, which data are important enough to do a spin-up. But it's not possible to tell the filesystem, which applications should spin-up at fsync(). And even within applications there are differences between the love-mail from the girl you met recently and the love-mail from that "russian girl", which isn't a girl, but just a bunch of fraudsters. > What's been frustrating about this whole controversy is this implicit > assumptions that users and applications should never change, and the > filesystem should magically accomodate and Do The Right Thing. It's not that they should never change, it's that you can't expect them to change. There are just a few filesystems in the kernel and you need some level of competence to maintain the code or contribute to it. But you have no such filter in the application world, which is much much bigger than the controlled area of the kernel. The application can be crappy and would still have its users as long there is no better alternative for a special task. 
Even after the project is orphaned it still can be used by the users. I had such a tool to get the log data out of my PBX. It was orphaned long before and it had no alternative. > If you're *never* going want to risk ever losing mail, then fine, > fsync() it to disk before you send the POP3 DELETE command. The *user* wants his data safe, but the *application* has to decide whether or not to fsync(). Well, in the case of a POP3 client, an fsync() before a DELETE should be common sense. > The problem is, this is what the application programmers are telling > the filesystem developers. They refuse to change their programs; and > the features they want are sometimes mutually contradictory, or at > least result in a overconstrained problem --- and then they throw the > whole mess at the filesystem developers' feet and say, "you fix it!" I think the users are complaining more than the application developers. If the application developers were complaining on behalf of their own software, they would probably be smart enough to change their code to use some new function calls (like barrier() or whatever). But the problem is the non-complaining developers who simply don't have a clue about all this. > I'm not saying the filesystems are blameless, but give us a little > slack, guys; we NEED some help from the application developers here. You have to find a _reasonable_ default integrity/performance trade-off for those applications that are not aware of the filesystem levels. "I just write out the data to disk with fprintf()." For laptop-mode a global reasonable default doesn't seem to exist, so a "perfect system" would be able to tell users which applications triggered a spin-up, and would give them a way to suppress or fine-tune the spin-up for the applications they choose. The distros could pre-configure it to some reasonable defaults for each application. Andreas ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 18:44 ` david 2009-04-02 20:07 ` Ray Lee @ 2009-04-02 22:36 ` Bron Gondwana 2009-04-02 23:46 ` Matthew Garrett 2 siblings, 0 replies; 59+ messages in thread From: Bron Gondwana @ 2009-04-02 22:36 UTC (permalink / raw) To: david Cc: Matthew Garrett, Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Thu, Apr 02, 2009 at 11:44:20AM -0700, david@lang.hm wrote: > let's not talk a database here, let's talk something simpler, like a POP3 > mail client (even though I strongly favor IMAP ;-) > > it wants to have the message saved before it deletes it from the server. > > how should it try to do this? > > the only portable method is to fsync the file after it's written and > before sending the delete to the server. > > so your mail client _should_ issue fsync calls. > > however, some (many, most??) users would probably be willing to loose a > little e-mail to gain a significant increase in battery life on their > laptops. Obviously it should do a spamminess test. If the sender is in your addressbook/whitelist then fsync it, otherwise if it looks spammy, don't bother. Seriously, there's no way of telling which emails are the really important job offer/flight confirmation/invitation from that really cute girl you met that one time... ... lots of data is like that. It's usually not important except when it really, really is - and the average user doesn't want to be babysitting every single decision about importance. "Your email program wants to spin up the disk to store a message, confirm or deny" Bron. ^ permalink raw reply [flat|nested] 59+ messages in thread
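The POP3 ordering quoted above (save the message, fsync() it, only then send the delete) can be sketched as follows. This is an illustration under stated assumptions, not code from any real mail client: the function name fetch_then_delete is hypothetical, and `server` is assumed to behave like Python's poplib.POP3, whose retr() returns a (response, lines, octets) tuple and whose dele() marks a message for deletion.

```python
import os


def fetch_then_delete(server, msg_num, dest_path):
    """Store one message durably, then ask the server to delete it.

    Hypothetical helper. `server` is assumed to offer retr() and dele()
    with the same shape as poplib.POP3.
    """
    _response, lines, _octets = server.retr(msg_num)

    with open(dest_path, "wb") as f:
        f.write(b"\r\n".join(lines) + b"\r\n")
        f.flush()
        # The message must be on disk before the server copy goes away.
        os.fsync(f.fileno())

    # Flush the directory entry too, so the new file name survives a crash.
    dir_fd = os.open(os.path.dirname(os.path.abspath(dest_path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    # Only now is it safe to drop the server's copy.
    server.dele(msg_num)
```

Whether that fsync() is worth a disk spin-up for every message is exactly the user-level decision the thread is arguing about; the sketch shows only the conservative ordering.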
* Re: Ext4 and the "30 second window of death" 2009-04-02 18:44 ` david 2009-04-02 20:07 ` Ray Lee 2009-04-02 22:36 ` Bron Gondwana @ 2009-04-02 23:46 ` Matthew Garrett 2009-04-03 0:55 ` david 2 siblings, 1 reply; 59+ messages in thread From: Matthew Garrett @ 2009-04-02 23:46 UTC (permalink / raw) To: david Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Thu, Apr 02, 2009 at 11:44:20AM -0700, david@lang.hm wrote: > On Thu, 2 Apr 2009, Matthew Garrett wrote: > >The solution to "fsync() causes disk spinups" isn't "ignore fsync()". > >It's "ensure that applications only use fsync() when they really need > >it", which requires us to also be able to say "fsync() should not be > >required to ensure that events occur in order". > > ignore the issue of order on the local disk for the moment. > > what should an application do to make sure it's data isn't lost? fsync(). > however, some (many, most??) users would probably be willing to loose a > little e-mail to gain a significant increase in battery life on their > laptops. Then they shouldn't use a mail client that fsync()s. > today they have no choice (other than picking a mail client that doesn't > try to protect it's local data) > > with the proposed addition to laptop mode (delaying fsync until the disk > is awake), the user (or more precisely the admin) gains the ability to > define this trade-off rather than depending on the application developers > all doing this right. No. Ignoring fsync() makes it difficult for an application to inappropriately spin up a disk - but it also makes it *impossible* for an application to save data that it genuinely needs to. Doing this in kernel means that you have no granularity. You ignore the inappropriate fsync()s, but you also ignore the ones that are vitally important. I've no objection to the kernel supporting this functionality, but it should be /proc/sys/vm/fuck-my-data-harder rather than /proc/sys/vm/laptop-mode. 
Power management is a tradeoff. Sometimes providing correct functionality costs more than providing incorrect functionality. In general we strive to carry on providing applications the behaviour they expect even if it costs us more power - the alternative leads to users disabling power management functionality because they can't trust it. Throwing data away isn't an acceptable tradeoff for an extra three minutes of battery life for most users. -- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 23:46 ` Matthew Garrett @ 2009-04-03 0:55 ` david 2009-04-03 1:06 ` Matthew Garrett 0 siblings, 1 reply; 59+ messages in thread From: david @ 2009-04-03 0:55 UTC (permalink / raw) To: Matthew Garrett Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Fri, 3 Apr 2009, Matthew Garrett wrote: > On Thu, Apr 02, 2009 at 11:44:20AM -0700, david@lang.hm wrote: >> On Thu, 2 Apr 2009, Matthew Garrett wrote: >>> The solution to "fsync() causes disk spinups" isn't "ignore fsync()". >>> It's "ensure that applications only use fsync() when they really need >>> it", which requires us to also be able to say "fsync() should not be >>> required to ensure that events occur in order". >> >> ignore the issue of order on the local disk for the moment. >> >> what should an application do to make sure it's data isn't lost? > > fsync(). > >> however, some (many, most??) users would probably be willing to loose a >> little e-mail to gain a significant increase in battery life on their >> laptops. > > Then they shouldn't use a mail client that fsync()s. so they need to use one mail client when they want to have good battery life and a different one when they are plugged in to power? >> today they have no choice (other than picking a mail client that doesn't >> try to protect it's local data) >> >> with the proposed addition to laptop mode (delaying fsync until the disk >> is awake), the user (or more precisely the admin) gains the ability to >> define this trade-off rather than depending on the application developers >> all doing this right. > > No. Ignoring fsync() makes it difficult for an application to > inappropriately spin up a disk - but it also makes it *impossible* for > an application to save data that it genuinely needs to. Doing this in > kernel means that you have no granularity. 
You ignore the inappropriate > fsync()s, but you also ignore the ones that are vitally important. I've > no objection to the kernel supporting this functionality, but it should > be /proc/sys/vm/fuck-my-data-harder rather than > /proc/sys/vm/laptop-mode. > > Power management is a tradeoff. Sometimes providing correct > functionality costs more than providing incorrect functionality. In > general we strive to carry on providing applications the behaviour they > expect even if it costs us more power - the alternative leads to users > disabling power management functionality because they can't trust it. > Throwing data away isn't an acceptable tradeoff for an extra three > minutes of battery life for most users. I would agree with you if it was three minutes of battery life, but what if it's an extra hour? (easily possible if the fsyncs make the difference between the drive running all the time and waking up every 5 min for a few seconds) David Lang ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-03 0:55 ` david @ 2009-04-03 1:06 ` Matthew Garrett 2009-04-03 1:16 ` david 0 siblings, 1 reply; 59+ messages in thread From: Matthew Garrett @ 2009-04-03 1:06 UTC (permalink / raw) To: david Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Thu, Apr 02, 2009 at 05:55:11PM -0700, david@lang.hm wrote: > On Fri, 3 Apr 2009, Matthew Garrett wrote: > >Then they shouldn't use a mail client that fsync()s. > > so they need to use one mail client when they want to have good battery > life and a different one when they are plugged in to power? They need to make a decision about whether they care about their mailbox being precisely in sync with their server or not, and either use a client that adapts appropriately or choose a client that behaves appropriately. It's certainly not the kernel's business. > >No. Ignoring fsync() makes it difficult for an application to > >inappropriately spin up a disk - but it also makes it *impossible* for > >an application to save data that it genuinely needs to. Doing this in > >kernel means that you have no granularity. You ignore the inappropriate > >fsync()s, but you also ignore the ones that are vitally important. I've > >no objection to the kernel supporting this functionality, but it should > >be /proc/sys/vm/fuck-my-data-harder rather than > >/proc/sys/vm/laptop-mode. > > > >Power management is a tradeoff. Sometimes providing correct > >functionality costs more than providing incorrect functionality. In > >general we strive to carry on providing applications the behaviour they > >expect even if it costs us more power - the alternative leads to users > >disabling power management functionality because they can't trust it. > >Throwing data away isn't an acceptable tradeoff for an extra three > >minutes of battery life for most users. > > I would agree with you if it was three minutes of battery life, but what > if it's an extra hour? 
(easily possible if the fsyncs make the difference > between the drive running all the time and waking up every 5 min for a few > seconds) If you can demonstrate a real world use case where the hard drive (typically well under a watt of power consumption on modern systems) spindown policy will be affected sufficiently pathologically by a mail client that you lose an hour of battery life, then I'd rethink this. But mostly I'd conclude that this was an example of an inappropriate spindown policy. -- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-03 1:06 ` Matthew Garrett @ 2009-04-03 1:16 ` david 2009-04-03 1:19 ` Matthew Garrett 0 siblings, 1 reply; 59+ messages in thread From: david @ 2009-04-03 1:16 UTC (permalink / raw) To: Matthew Garrett Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Fri, 3 Apr 2009, Matthew Garrett wrote: > On Thu, Apr 02, 2009 at 05:55:11PM -0700, david@lang.hm wrote: >> On Fri, 3 Apr 2009, Matthew Garrett wrote: >>> Then they shouldn't use a mail client that fsync()s. >> >> so they need to use one mail client when they want to have good battery >> life and a different one when they are plugged in to power? > > They need to make a decision about whether they care about their mailbox > being precisely in sync with their server or not, and either use a > client that adapts appropriately or choose a client that behaves > appropriately. It's certainly not the kernel's business. the kernel is not deciding this, the kernel would be implementing the user's choice >>> No. Ignoring fsync() makes it difficult for an application to >>> inappropriately spin up a disk - but it also makes it *impossible* for >>> an application to save data that it genuinely needs to. Doing this in >>> kernel means that you have no granularity. You ignore the inappropriate >>> fsync()s, but you also ignore the ones that are vitally important. I've >>> no objection to the kernel supporting this functionality, but it should >>> be /proc/sys/vm/fuck-my-data-harder rather than >>> /proc/sys/vm/laptop-mode. >>> >>> Power management is a tradeoff. Sometimes providing correct >>> functionality costs more than providing incorrect functionality. In >>> general we strive to carry on providing applications the behaviour they >>> expect even if it costs us more power - the alternative leads to users >>> disabling power management functionality because they can't trust it. 
>>> Throwing data away isn't an acceptable tradeoff for an extra three >>> minutes of battery life for most users. >> >> I would agree with you if it was three minutes of battery life, but what >> if it's an extra hour? (easily possible if the fsyncs make the difference >> between the drive running all the time and waking up every 5 min for a few >> seconds) > > If you can demonstrate a real world use case where the hard drive > (typically well under a watt of power consumption on modern systems) > spindown policy will be affected sufficiently pathologically by a mail > client that you lose an hour of battery life, then I'd rethink this. But > mostly I'd conclude that this was an example of an inappropriate > spindown policy. remember that the mail client was an example. you want another example, think of anything that uses sqlite (like the firefox history stuff, although that was weakened drasticly due to the ext3 problems). David Lang ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-03 1:16 ` david @ 2009-04-03 1:19 ` Matthew Garrett 2009-04-03 1:24 ` david 0 siblings, 1 reply; 59+ messages in thread From: Matthew Garrett @ 2009-04-03 1:19 UTC (permalink / raw) To: david Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Thu, Apr 02, 2009 at 06:16:20PM -0700, david@lang.hm wrote: > On Fri, 3 Apr 2009, Matthew Garrett wrote: > > >On Thu, Apr 02, 2009 at 05:55:11PM -0700, david@lang.hm wrote: > >>On Fri, 3 Apr 2009, Matthew Garrett wrote: > >>>Then they shouldn't use a mail client that fsync()s. > >> > >>so they need to use one mail client when they want to have good battery > >>life and a different one when they are plugged in to power? > > > >They need to make a decision about whether they care about their mailbox > >being precisely in sync with their server or not, and either use a > >client that adapts appropriately or choose a client that behaves > >appropriately. It's certainly not the kernel's business. > > the kernel is not deciding this, the kernel would be implementing the > user's choice No it wouldn't. The kernel would be implementing an administrator's choice about whether fsync() is important or not. That's something that would affect the mail client, but it's hardly a decision based on the mail client. Sucks to be that user if they do anything involving mysql. > >If you can demonstrate a real world use case where the hard drive > >(typically well under a watt of power consumption on modern systems) > >spindown policy will be affected sufficiently pathologically by a mail > >client that you lose an hour of battery life, then I'd rethink this. But > >mostly I'd conclude that this was an example of an inappropriate > >spindown policy. > > remember that the mail client was an example.
> > you want another example, think of anything that uses sqlite (like the > firefox history stuff, although that was weakened drasticly due to the > ext3 problems). Benchmarks please. -- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-03 1:19 ` Matthew Garrett @ 2009-04-03 1:24 ` david 2009-04-03 1:36 ` Matthew Garrett 0 siblings, 1 reply; 59+ messages in thread From: david @ 2009-04-03 1:24 UTC (permalink / raw) To: Matthew Garrett Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Fri, 3 Apr 2009, Matthew Garrett wrote: > On Thu, Apr 02, 2009 at 06:16:20PM -0700, david@lang.hm wrote: >> On Fri, 3 Apr 2009, Matthew Garrett wrote: >> >>> On Thu, Apr 02, 2009 at 05:55:11PM -0700, david@lang.hm wrote: >>>> On Fri, 3 Apr 2009, Matthew Garrett wrote: >>>>> Then they shouldn't use a mail client that fsync()s. >>>> >>>> so they need to use one mail client when they want to have good battery >>>> life and a different one when they are plugged in to power? >>> >>> They need to make a decision about whether they care about their mailbox >>> being precisely in sync with their server or not, and either use a >>> client that adapts appropriately or choose a client that behaves >>> appropriately. It's certainly not the kernel's business. >> >> the kernel is not deciding this, the kernel would be implementing the >> user's choice > > No it wouldn't. The kernel would be implementing an adminstrator's > choice about whether fsync() is important or not. That's something that > would affect the mail client, but it's hardly a decision based on the > mail client. Sucks to be that user if they do anything involving mysql. in the case of laptops, in 99+% of the cases the user and the administrator are the same person. 
in the other cases that's something the user should take up with the administrator, because the administrator can do a lot of things to the system that will affect the safety of their data (including loading a kernel that turns fsync into a noop, but more likely involving enabling or disabling write caches on disks) >>> If you can demonstrate a real world use case where the hard drive >>> (typically well under a watt of power consumption on modern systems) >>> spindown policy will be affected sufficiently pathologically by a mail >>> client that you lose an hour of battery life, then I'd rethink this. But >>> mostly I'd conclude that this was an example of an inappropriate >>> spindown policy. >> >> remember that the mail client was an example. >> >> you want another example, think of anything that uses sqlite (like the >> firefox history stuff, although that was weakened drasticly due to the >> ext3 problems). > > Benchmarks please. if spinning down a drive saves so little power that it wouldn't make a significant difference to battery life to leave it on, why does anyone bother to spin the drive down? David Lang ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-03 1:24 ` david @ 2009-04-03 1:36 ` Matthew Garrett 2009-04-03 3:08 ` david 2009-04-03 4:54 ` Theodore Tso 0 siblings, 2 replies; 59+ messages in thread From: Matthew Garrett @ 2009-04-03 1:36 UTC (permalink / raw) To: david Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Thu, Apr 02, 2009 at 06:24:28PM -0700, david@lang.hm wrote: > On Fri, 3 Apr 2009, Matthew Garrett wrote: > >No it wouldn't. The kernel would be implementing an adminstrator's > >choice about whether fsync() is important or not. That's something that > >would affect the mail client, but it's hardly a decision based on the > >mail client. Sucks to be that user if they do anything involving mysql. > > in the case of laptops, in 99+% of the cases the user and the > administrator are the same person. in the other cases that's something the > user should take up with the administrator, because the administrator can > do a lot of things to the system that will affect the safety of their data > (including loading a kernel that turns fsync into a noop, but more likely > involving enabling or disabling write caches on disks) Well, yes, the administrator could hate the user. They could achieve the same effect by just LD_PRELOADING something that stubbed out fsync() and inserted random data into every other write(). We generally trust that admins won't do that. > >Benchmarks please. > > if spinning down a drive saves so little power that it wouldn't make a > significant difference to battery lift to leave it on, why does anyone > bother to spin the drive down? There are various circumstances in which it's beneficial. The difference between an optimal algorithm for typical use and an optimal algorithm for typical use where there's an fsync() every 5 minutes isn't actually that great. -- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-03 1:36 ` Matthew Garrett @ 2009-04-03 3:08 ` david 2009-04-03 13:42 ` Matthew Garrett 2009-04-03 4:54 ` Theodore Tso 1 sibling, 1 reply; 59+ messages in thread From: david @ 2009-04-03 3:08 UTC (permalink / raw) To: Matthew Garrett Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Fri, 3 Apr 2009, Matthew Garrett wrote: > On Thu, Apr 02, 2009 at 06:24:28PM -0700, david@lang.hm wrote: >> On Fri, 3 Apr 2009, Matthew Garrett wrote: >>> No it wouldn't. The kernel would be implementing an adminstrator's >>> choice about whether fsync() is important or not. That's something that >>> would affect the mail client, but it's hardly a decision based on the >>> mail client. Sucks to be that user if they do anything involving mysql. >> >> in the case of laptops, in 99+% of the cases the user and the >> administrator are the same person. in the other cases that's something the >> user should take up with the administrator, because the administrator can >> do a lot of things to the system that will affect the safety of their data >> (including loading a kernel that turns fsync into a noop, but more likely >> involving enabling or disabling write caches on disks) > > Well, yes, the administrator could hate the user. They could achieve the > same affect by just LD_PRELOADING something that stubbed out fsync() and > inserted random data into every other write(). We generally trust that > admins won't do that. then trust the admins to make a reasonable decision for or with the user on this as well. >>> Benchmarks please. >> >> if spinning down a drive saves so little power that it wouldn't make a >> significant difference to battery lift to leave it on, why does anyone >> bother to spin the drive down? > > There's various circumstances in which it's beneficial. 
The difference > between an optimal algorithm for typical use and an optimal algorithm > for typical use where there's an fsync() every 5 minutes isn't actually > that great. mixing some sub-threads a bit to combine thoughts you object to calling something like this 'laptop mode' Ted's statements about laptop mode indicate that he believes that it delays writes for a configurable time rather than accelerating writes. what would you think of something like the following at the block device level? an option called something like "delay_writes" delays writes (including fsync) up to a configurable number of seconds. if an fsync or barrier is issued the block driver figures out what pages would be written by that fsync/barrier, puts them in its queue (but doesn't start the write), puts a barrier in its queue following the pages and marks the pages COW. if the timeout expires (or the drive spins up for other reasons) and the pages have not been modified, they get written and released by the block driver (which should take them out of COW mode). if the pages get written to prior to the write taking place, COW kicks in and new pages are allocated for the changes. since the device driver already has those pages queued the filesystem just ends up with the copied pages and continues operation. when the drive finally gets spun up, the queued pages get written prior to anything else (preserving order in case of a crash) doing this could cost memory (as there may be multiple copies of something queued), so it may be worth having some trigger that if more than X pages are queued by the block driver, it should go ahead and spin up the drive to write them. thoughts? David Lang ^ permalink raw reply [flat|nested] 59+ messages in thread
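To make the "delay_writes" proposal above easier to follow, here is a toy user-space model of it in Python. It is only a sketch of the idea as described in the message, not kernel code and not an existing interface: the class DelayedWriteQueue, its method names and the max_queued_pages threshold are all invented for illustration. The timeout is not modelled; calling spin_up() by hand stands in for "the timeout expires or the drive spins up for other reasons". Copy-on-write falls out of Python's immutable bytes objects: a later write() rebinds the cached page, while the queued snapshot keeps the old contents.

```python
BARRIER = object()  # marker separating one fsync's pages from the next


class DelayedWriteQueue:
    """Toy model of the proposed delay_writes behaviour (hypothetical)."""

    def __init__(self, max_queued_pages=4):
        self.cache = {}    # page number -> current contents (page cache)
        self.dirty = set() # pages modified since the last fsync()
        self.queue = []    # snapshots and BARRIER markers, in order
        self.disk = {}     # what has actually reached the platter
        self.max_queued_pages = max_queued_pages
        self.spinups = 0

    def write(self, page, data):
        # Rebinding the entry leaves any queued snapshot untouched,
        # which plays the role of copy-on-write here.
        self.cache[page] = data
        self.dirty.add(page)

    def queued_pages(self):
        return sum(1 for entry in self.queue if entry is not BARRIER)

    def fsync(self):
        # Queue the pages this fsync covers, followed by a barrier,
        # but do not start the write.
        for page in sorted(self.dirty):
            self.queue.append((page, self.cache[page]))
        self.queue.append(BARRIER)
        self.dirty.clear()
        # Memory-pressure trigger: too many queued pages forces a spin-up.
        if self.queued_pages() > self.max_queued_pages:
            self.spin_up()

    def spin_up(self):
        # Write queued snapshots in order, before anything else.
        self.spinups += 1
        for entry in self.queue:
            if entry is not BARRIER:
                page, data = entry
                self.disk[page] = data
        self.queue.clear()
```

In this model two fsync()s of the same page while the disk sleeps leave two snapshots in the queue, which is the memory cost the proposal mentions; nothing reaches `disk` until spin_up() runs, so a crash before that point loses every "synced" write, which is the objection raised elsewhere in the thread.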
* Re: Ext4 and the "30 second window of death" 2009-04-03 3:08 ` david @ 2009-04-03 13:42 ` Matthew Garrett 0 siblings, 0 replies; 59+ messages in thread From: Matthew Garrett @ 2009-04-03 13:42 UTC (permalink / raw) To: david Cc: Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Thu, Apr 02, 2009 at 08:08:36PM -0700, david@lang.hm wrote: > On Fri, 3 Apr 2009, Matthew Garrett wrote: > >Well, yes, the administrator could hate the user. They could achieve the > >same affect by just LD_PRELOADING something that stubbed out fsync() and > >inserted random data into every other write(). We generally trust that > >admins won't do that. > > then trust the admins to make a reasonable decision for or with the user > on this as well. What a reasonable decision is here depends on what software the user is running. There simply isn't a reasonable default other than to allow fsync() to work. Changing requires auditing every single piece of code the user may run. > >There's various circumstances in which it's beneficial. The difference > >between an optimal algorithm for typical use and an optimal algorithm > >for typical use where there's an fsync() every 5 minutes isn't actually > >that great. > > mixing some sub-threads a bit to combine thoughts > > you object to calling something like this 'laptop mode' > > Ted's statements about laptop mode indicate that he believes that it > delays writes for a configurable time rather than accelerating writes. As I said, the code is pretty easy to read. (snip) > thoughts? I've certainly got no objection to the addition of a mode that changes the behaviour of fsync() - personally I think it would be an error for almost anyone to use it, but that's really up to the individual situation. But it would have a different goal to the existing laptop-mode and so should have a different name in order to avoid confusion. 
-- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-03 1:36 ` Matthew Garrett 2009-04-03 3:08 ` david @ 2009-04-03 4:54 ` Theodore Tso 2009-04-03 11:09 ` Sitsofe Wheeler ` (2 more replies) 1 sibling, 3 replies; 59+ messages in thread From: Theodore Tso @ 2009-04-03 4:54 UTC (permalink / raw) To: Matthew Garrett Cc: david, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Fri, Apr 03, 2009 at 02:36:03AM +0100, Matthew Garrett wrote: > > if spinning down a drive saves so little power that it wouldn't make a > > significant difference to battery lift to leave it on, why does anyone > > bother to spin the drive down? > > There's various circumstances in which it's beneficial. The difference > between an optimal algorithm for typical use and an optimal algorithm > for typical use where there's an fsync() every 5 minutes isn't actually > that great. More to the point, if an application is insane enough to push 2.5 megabytes to disk every single time you click on a web page (this is excluding the cache; I had my firefox cache pointed at /tmp when I did this measurement), *and* you are running the WiFi for the browser, *and* the browser is running flash applications, etc., whether you defer the writes or not, you're going to be burning a lot of power. Fundamentally, if an application needs to be writing hundreds of files or hundreds of kilobytes or more of data all the time, there's something wrong with the application. If some KDE application needs to rewrite hundreds of files at desktop startup, when the user hasn't even changed any configuration options yet (this is at desktop **startup**, mind you, where this was reported), then you're going to be burning a lot of power. Anything we do at the filesystem level is really going to be at the margins. The annoying thing is the application programmers aren't willing to fix their d*mn applications, and instead heap all of the blame on the filesystem.
I will be the first to admit that filesystem designers have to do their part, and once I realized how bad and sloppy people had gotten with fsync(), and needlessly rewriting files, I implemented the ext4 workaround patches *first*. I only started talking about how application programmers might make changes to obey the established standards and work with other filesystems after I had put my own house in order. These are system-wide problems we are talking about, that will require system-wide solutions. I can provide workarounds for existing application behaviours, but claiming that applications can never change, and we must always accommodate the way applications are currently working and are designed is going to be a losing strategy for us all. - Ted ^ permalink raw reply [flat|nested] 59+ messages in thread
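[Editorial illustration, not part of the original message.] For readers following along, here is a minimal sketch of the save sequence that is usually meant when applications are asked to "obey the established standards": write the new contents to a temporary file, fsync() it, then rename() it over the old file. The assumption that this is the sequence meant here is mine. The function name `atomic_save` and the `.tmp-` prefix are hypothetical, and Python is used only for brevity; nothing here is taken from any application named in the thread.

```python
import os
import tempfile


def atomic_save(path, data):
    """Replace `path` with `data` so that a crash leaves the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem),
    # otherwise the final rename cannot be atomic.
    # Note: mkstemp creates the file with mode 0600; a real editor would
    # probably copy the permissions of the file it replaces.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to put the data on disk
        os.replace(tmp_path, path)   # rename() over the old file
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Also fsync the directory so the rename itself is recorded.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this sequence a crash during the save should leave either the previous document or the new one, never an empty file. The price is one fsync() per save, which is exactly the disk wake-up that the power-management side of this thread is worried about.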
* Re: Ext4 and the "30 second window of death" 2009-04-03 4:54 ` Theodore Tso @ 2009-04-03 11:09 ` Sitsofe Wheeler 2009-04-03 13:07 ` Alberto Gonzalez 2009-04-03 13:45 ` Matthew Garrett 2 siblings, 0 replies; 59+ messages in thread From: Sitsofe Wheeler @ 2009-04-03 11:09 UTC (permalink / raw) To: Theodore Tso, Matthew Garrett, david, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Fri, Apr 03, 2009 at 12:54:14AM -0400, Theodore Tso wrote: > On Fri, Apr 03, 2009 at 02:36:03AM +0100, Matthew Garrett wrote: > > > > There's various circumstances in which it's beneficial. The difference > > between an optimal algorithm for typical use and an optimal algorithm > > for typical use where there's an fsync() every 5 minutes isn't actually > > that great. > > More to the point, if an application is insane enough to push 2.5 > megabytes to disk every single time you click on a web page (this is > excluding the cache; I had my firefox cache pointed at /tmp when I did I no longer know what is being debated here. Is it one or more of the following: a) Laptop mode (as it stands today). b) Laptop mode with fsync-nop. c) Apps that should be using fsync. d) Apps that should not be using fsync. e) Apps writing to the disk too frequently. f) Apps writing too many files to the disk. g) Userland constraining kernel changes. h) Increasing battery life. i) "Acceptable" chance of new data loss after a crash. j) "Acceptable" chance of data corruption after a crash. k) Support for a new filesystem barrier() syscall to indicate the order that data has to be written. Note some of the above points are in conflict with each other... > The annoying thing is the applications programmers aren't willing to > fix their d*mn applications, and instead heap all of the blame on the > filesystem. I will be the first to admit that filesystem designers Isn't this the problem that other systems that place a high value on backwards compatibility face, and that the Linux kernel was not supposed to? 
If some piece of userland depends on every last bit of behaviour (whether it was intended/promised or not) then the only way anything can be changed is with massive effort expended on shims... -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-03 4:54 ` Theodore Tso 2009-04-03 11:09 ` Sitsofe Wheeler @ 2009-04-03 13:07 ` Alberto Gonzalez 2009-04-03 13:45 ` Matthew Garrett 2 siblings, 0 replies; 59+ messages in thread From: Alberto Gonzalez @ 2009-04-03 13:07 UTC (permalink / raw) To: Theodore Tso Cc: Matthew Garrett, david, Sitsofe Wheeler, Andreas T.Auer, Linux Kernel Mailing List On Friday 03 April 2009 06:54:14 Theodore Tso wrote: > On Fri, Apr 03, 2009 at 02:36:03AM +0100, Matthew Garrett wrote: > > > if spinning down a drive saves so little power that it wouldn't make a > > > significant difference to battery lift to leave it on, why does anyone > > > bother to spin the drive down? > > > > There's various circumstances in which it's beneficial. The difference > > between an optimal algorithm for typical use and an optimal algorithm > > for typical use where there's an fsync() every 5 minutes isn't actually > > that great. > > More to the point, if an application is insane enough to push 2.5 > megabytes to disk every single time you click on a web page (this is > excluding the cache; I had my firefox cache pointed at /tmp when I did > this measurement), *and* you are running the WiFi for the browser, > *and* the browser is running flash applications, etc., whether you > defer the writes or not, you're going to be burning a lot of power. > Fundamentally, if an application needs to be writing hundreds of files > or hundreds of kilibytes or more of data all the time, there's > something wrong with the application. I really have to agree. Looking at this thread (that unfortunately I started) it seems that if Linux is going to improve its power consumption at all it depends on the filesystem. 
Firefox has some unrealistic settings that stress the hard drive and the network, then some people open a couple hundred tabs at the same time, and then even the most simple flash animation proved to increase power by 0.9 watts on my atom processor that has a 2.5 watt TDP, and there are many other problems to solve first. Linux is still trying to catch up with Windows when it comes to battery life. It's still clearly behind in "normal" setups (I know, you can tweak Linux to use little power, but a default install of a mainstream distro will use clearly more power than Windows while providing similar functionality). And then Windows can use up to twice more power than OS X [1]. So clearly there is a lot of room for improvement when it comes to power usage in Linux. But honestly, if we all start blaming the filesystem for it, I don't think we're going to find the real problems. Besides, with SSDs getting better and cheaper, I'm sure that from 2010 on, most (if not all) laptops are going to be shipping with an SSD by default. And all the spin-up/spin-down problem will go away by itself. And yes, SSDs have proven to save some battery, but in the most real world tests I've seen it's by about 5%, so I guess that even with the most powersaving filesystem for a mechanical HD we could just save about 3% - 4% battery. Not too bad, but still far from the 40% needed. So for all having performance problems with ext3 + fsync, let's see if ext4 works for them. For those worried about battery life, let's at least start looking elsewhere before we want to optimize the filesystem to the last milliwatt. And as I feel guilty myself for contributing to this, I'd beg for us all to leave a bit of Slack (as Ted said) to filesystem developers. It's been a hard week for them already. Regards, Alberto. 1 - http://www.anandtech.com/mac/showdoc.aspx?i=3435&p=13 - http://www.anandtech.com/mobile/showdoc.aspx?i=3540&p=10 ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-03 4:54 ` Theodore Tso 2009-04-03 11:09 ` Sitsofe Wheeler 2009-04-03 13:07 ` Alberto Gonzalez @ 2009-04-03 13:45 ` Matthew Garrett 2 siblings, 0 replies; 59+ messages in thread From: Matthew Garrett @ 2009-04-03 13:45 UTC (permalink / raw) To: Theodore Tso, david, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Fri, Apr 03, 2009 at 12:54:14AM -0400, Theodore Tso wrote: > More to the point, if an application is insane enough to push 2.5 > megabytes to disk every single time you click on a web page (this is > excluding the cache; I had my firefox cache pointed at /tmp when I did > this measurement), *and* you are running the WiFi for the browser, > *and* the browser is running flash applications, etc., whether you > defer the writes or not, you're going to be burning a lot of power. > Fundamentally, if an application needs to be writing hundreds of files > or hundreds of kilibytes or more of data all the time, there's > something wrong with the application. Yes. If applications are fsync()ing too often then the obvious fix is to deal with those applications, and that's something we've been successful with in other fields of power management. > If some KDE applications needs to rewrite hundreds of files at desktop > startup, when the user hasn't even changed any configuration options > yet (this is that desktop **startup**, mind you, where this was > reported), then you're going to burning a lot of power. Anything we > do at the filesystem level is really going to be at the margins. Not really. Desktop startup is a one-off cost and has no significant impact on your overall power budget. There's little worthwhile optimisation there from a power management point of view. > The annoying thing is the applications programmers aren't willing to > fix their d*mn applications, and instead heap all of the blame on the > filesystem. 
I will be the first to admit that filesystem designers > have to do their part, and once I realized how bad and sloppy people > had gotten with fsync(), and needlessly rewriting files, I implemented > the ext4 workaround patches *first*. I only started talking about how > application programmers might make changes to obey the established > standards and work with other filesystems after I had put my own house > in order. These are system-wide problems we are talking about, that > will require system-wide solutions. I can provide workarounds for > existing application behaviours, but claiming that applications can > never change, and we must always accomodate the way applications are > currently working and are designed is going to be a losing strategy > for us all. From a power management perspective, anything that requires applications to call fsync() more frequently is a bad thing. So filesystems that reorder metadata operations are a bad thing. The fix isn't to add fsync() to applications, the fix is to ensure that filesystems don't force applications to do so. -- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 18:22 ` david 2009-04-02 18:29 ` Matthew Garrett @ 2009-04-02 18:34 ` Nick Piggin 2009-04-02 18:38 ` Matthew Garrett 2009-04-02 21:47 ` david 1 sibling, 2 replies; 59+ messages in thread From: Nick Piggin @ 2009-04-02 18:34 UTC (permalink / raw) To: david Cc: Matthew Garrett, Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Friday 03 April 2009 05:22:48 david@lang.hm wrote: > On Wed, 1 Apr 2009, Matthew Garrett wrote: > > >> The other subtlety comes if we add fsync() suppression to laptop mode > >> --- which is something that Bart Samwel is very interested in doing > >> and I talked to him at FOSDEM about this. As Jeff Garzik recently > >> pointed out, however, if we let the system reorder writes across > >> fsync() boundaries, or if we combine two writes to the same block > >> separated by an fsync(), and the system crashes in the middle of > >> pushing all of these blocks out to the disk, we can end up trashing > >> the consistency guarantees of a database such as mysql or postgres. > >> It's a good point, but it only applies if we add fsync() suppression > >> to laptop mode --- which we haven't done yet. > > > > I've got absolutely no idea why anyone would want fsync() to stop > > meaning "Put my data on the disk please". laptop-mode isn't intended to > > reduce data integrity - it's intended to batch disk write-outs such that > > there's a lower risk of needing to perform further write-outs in future. > > It makes sense for applications which really desperately want > > information on disk to fsync() (for instance, saving a file in > > OpenOffice). > > > > laptop-mode is something that makes sense as a default behaviour under a > > lot of circumstances. Adding fsync() suppression means it's utterly > > impossible to use it in that way. 
An additional mode would be perfectly > > reasonable, as long as it's made clear that it's really a request for > > data to be discarded at some point. The current mode isn't. > > this issue seems pretty straightforward to me > > the apps do fsync (and similar) to the degree that they think their data > is important (potentially with config options if they acknowlege that > their data isn't _always_ that important) > > the system allows the admin to override the application and say "I'm > willing to loose up to X seconds of data for other benifits" > > if this can work cleanly (with the ordering issue that was identified, > which may involve having multiple versions of the metadata cached) it > seems like a very clean interface. It isn't just about the ordering of writes to a filesystem. A database program commits a transaction and then tells the client that it is safe. The client then goes and does <something> in response to that, which may or may not involve more writes to the filesystem. Shouldn't applications have a mode to avoid spinning up the disk if it is so important? ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 18:34 ` Nick Piggin @ 2009-04-02 18:38 ` Matthew Garrett 2009-04-02 18:56 ` Nick Piggin 2009-04-02 21:47 ` david 1 sibling, 1 reply; 59+ messages in thread From: Matthew Garrett @ 2009-04-02 18:38 UTC (permalink / raw) To: Nick Piggin Cc: david, Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Fri, Apr 03, 2009 at 05:34:59AM +1100, Nick Piggin wrote: > Shouldn't applications have a mode to avoid spinning up the disk if it is > so important? They do. It's called "Don't use fsync() unless your data needs to be on disk". I'm not sure why you'd ever want an application to be in anything but this mode. -- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 18:38 ` Matthew Garrett @ 2009-04-02 18:56 ` Nick Piggin 2009-04-02 23:47 ` Matthew Garrett 2009-04-03 2:22 ` Ric Wheeler 0 siblings, 2 replies; 59+ messages in thread From: Nick Piggin @ 2009-04-02 18:56 UTC (permalink / raw) To: Matthew Garrett Cc: david, Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Friday 03 April 2009 05:38:34 Matthew Garrett wrote: > On Fri, Apr 03, 2009 at 05:34:59AM +1100, Nick Piggin wrote: > > > Shouldn't applications have a mode to avoid spinning up the disk if it is > > so important? > > They do. It's called "Don't use fsync() unless your data needs to be on > disk". I'm not sure why you'd ever want an application to be in anything > but this mode. > Well you might decide you are willing to sacrifice timely storage of logs, or reducing backups in your editor or something. But obviously the kernel can't decide which of those fsyncs is safe to omit (or turn into a barrier) while staying within the advertised semantics of the app. Application obviously can. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 18:56 ` Nick Piggin @ 2009-04-02 23:47 ` Matthew Garrett 2009-04-03 0:59 ` david 2009-04-03 2:22 ` Ric Wheeler 1 sibling, 1 reply; 59+ messages in thread From: Matthew Garrett @ 2009-04-02 23:47 UTC (permalink / raw) To: Nick Piggin Cc: david, Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Fri, Apr 03, 2009 at 05:56:40AM +1100, Nick Piggin wrote: > On Friday 03 April 2009 05:38:34 Matthew Garrett wrote: > > On Fri, Apr 03, 2009 at 05:34:59AM +1100, Nick Piggin wrote: > > > > > Shouldn't applications have a mode to avoid spinning up the disk if it is > > > so important? > > > > They do. It's called "Don't use fsync() unless your data needs to be on > > disk". I'm not sure why you'd ever want an application to be in anything > > but this mode. > > > > Well you might decide you are willing to sacrifice timely storage of > logs, or reducing backups in your editor or something. But obviously > the kernel can't decide which of those fsyncs is safe to omit (or > turn into a barrier) while staying within the advertised semantics of > the app. Application obviously can. I'd argue that if the user cares enough that they want it fsync()ed on ext3 then they probably also want it fsync()ed if they're on battery. But yes, if anything is going to make a distinction between grades of "Must be saved" then it has to be the application - the kernel certainly doesn't have that information. -- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 23:47 ` Matthew Garrett @ 2009-04-03 0:59 ` david 2009-04-03 1:09 ` Matthew Garrett 0 siblings, 1 reply; 59+ messages in thread From: david @ 2009-04-03 0:59 UTC (permalink / raw) To: Matthew Garrett Cc: Nick Piggin, Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Fri, 3 Apr 2009, Matthew Garrett wrote: > On Fri, Apr 03, 2009 at 05:56:40AM +1100, Nick Piggin wrote: >> On Friday 03 April 2009 05:38:34 Matthew Garrett wrote: >>> On Fri, Apr 03, 2009 at 05:34:59AM +1100, Nick Piggin wrote: >>> >>>> Shouldn't applications have a mode to avoid spinning up the disk if it is >>>> so important? >>> >>> They do. It's called "Don't use fsync() unless your data needs to be on >>> disk". I'm not sure why you'd ever want an application to be in anything >>> but this mode. >>> >> >> Well you might decide you are willing to sacrifice timely storage of >> logs, or reducing backups in your editor or something. But obviously >> the kernel can't decide which of those fsyncs is safe to omit (or >> turn into a barrier) while staying within the advertised semantics of >> the app. Application obviously can. > > I'd argue that if the user cares enough that they want it fsync()ed on > ext3 then they probably also want it fsync()ed if they're on battery. > But yes, if anything is going to make a distinction between grades of > "Must be saved" then it has to be the application - the kernel certainly > doesn't have that information. but is it the user who's deciding today or the application developer? I agree that the kernel has no way of saying 'this fsync is important, that one can be ignored' but I don't think anyone is suggesting that (everyone who has mentioned it in a proposal has done so saying 'this obviously is too complicated to try to do' however, there is one thing about laptop mode that I need clarification on. is laptop mode A. 
"write everything now, don't delay writes" in the hope that the drive will be idle enough later to spin down or B. "delay all writes until later, then when the drive wakes up do all pending writes at that time" so that the drive can go to sleep in the meantime? I've heard things in these threads that would indicate both behaviors. David Lang ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-03 0:59 ` david @ 2009-04-03 1:09 ` Matthew Garrett 2009-04-03 1:17 ` david 0 siblings, 1 reply; 59+ messages in thread From: Matthew Garrett @ 2009-04-03 1:09 UTC (permalink / raw) To: david Cc: Nick Piggin, Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Thu, Apr 02, 2009 at 05:59:53PM -0700, david@lang.hm wrote: > is laptop mode > > A. "write everything now, don't delay writes" in the hope that the drive > will be idle enough later to spin down laptop-mode doesn't delay writes. Ever. > or > > B. "delay all writes until later, then when the drive wakes up do all > pending writes at that time" so that the drive can go to sleep in the > meantime? Yes. > I've heard things in these threads that would indicate both behaviors. The code's pretty trivial. The only real functional differences laptop-mode brings are to write out all dirty pages (rather than just writing down to the watermark) and to call sys_sync() a few seconds after the last thing that hit disk rather than being satisfied from cache. It's entirely a mechanism to opportunistically take advantage of the disk being spun up. -- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-03 1:09 ` Matthew Garrett @ 2009-04-03 1:17 ` david 2009-04-03 1:22 ` Matthew Garrett 0 siblings, 1 reply; 59+ messages in thread From: david @ 2009-04-03 1:17 UTC (permalink / raw) To: Matthew Garrett Cc: Nick Piggin, Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Fri, 3 Apr 2009, Matthew Garrett wrote: > On Thu, Apr 02, 2009 at 05:59:53PM -0700, david@lang.hm wrote: > >> is laptop mode >> >> A. "write everything now, don't delay writes" in the hope that the drive >> will be idle enough later to spin down > > laptop-mode doesn't delay writes. Ever. > >> or >> >> B. "delay all writes until later, then when the drive wakes up do all >> pending writes at that time" so that the drive can go to sleep in the >> meantime? > > Yes. you just contradicted yourself in these two statements. David Lang >> I've heard things in these threads that would indicate both behaviors. > > The code's pretty trivial. The only real functional differences > laptop-mode brings are to write out all dirty pages (rather than just > writing down to the watermark) and to call sys_sync() a few seconds > after the last thing that hit disk rather than being satisfied from > cache. It's entirely a mechanism to opportunistically take advantage of > the disk being spun up. > > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-03 1:17 ` david @ 2009-04-03 1:22 ` Matthew Garrett 0 siblings, 0 replies; 59+ messages in thread From: Matthew Garrett @ 2009-04-03 1:22 UTC (permalink / raw) To: david Cc: Nick Piggin, Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Thu, Apr 02, 2009 at 06:17:12PM -0700, david@lang.hm wrote: > On Fri, 3 Apr 2009, Matthew Garrett wrote: > > >On Thu, Apr 02, 2009 at 05:59:53PM -0700, david@lang.hm wrote: > > > >>is laptop mode > >> > >>A. "write everything now, don't delay writes" in the hope that the drive > >>will be idle enough later to spin down > > > >laptop-mode doesn't delay writes. Ever. > > > >>or > >> > >>B. "delay all writes until later, then when the drive wakes up do all > >>pending writes at that time" so that the drive can go to sleep in the > >>meantime? > > > >Yes. > > you just contridicted yourself in these two statements. That's because I'm horribly drunk and managed to confuse the order of your statements. My substantive point stands - the laptop-mode code doesn't delay writes, and it's pretty easy for anyone to prove this to themselves. Neither of your options actually describe its behaviour. -- Matthew Garrett | mjg59@srcf.ucam.org ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death" 2009-04-02 18:56 ` Nick Piggin 2009-04-02 23:47 ` Matthew Garrett @ 2009-04-03 2:22 ` Ric Wheeler 1 sibling, 0 replies; 59+ messages in thread From: Ric Wheeler @ 2009-04-03 2:22 UTC (permalink / raw) To: Nick Piggin Cc: Matthew Garrett, david, Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List Nick Piggin wrote: > On Friday 03 April 2009 05:38:34 Matthew Garrett wrote: > >> On Fri, Apr 03, 2009 at 05:34:59AM +1100, Nick Piggin wrote: >> >> >>> Shouldn't applications have a mode to avoid spinning up the disk if it is >>> so important? >>> >> They do. It's called "Don't use fsync() unless your data needs to be on >> disk". I'm not sure why you'd ever want an application to be in anything >> but this mode. >> >> > > Well you might decide you are willing to sacrifice timely storage of > logs, or reducing backups in your editor or something. But obviously > the kernel can't decide which of those fsyncs is safe to omit (or > turn into a barrier) while staying within the advertised semantics of > the app. Application obviously can. > > One thing that you can do at the application level is to try and batch up your fsync() requests - running one fsync (especially on the most recently written file) can take down the earlier files with it. Clearly, this does require some application level complexity, but you get the same strong fsync() semantics that you are used to and can run almost at non-fsync speeds if the batch size is large enough. Your application should not acknowledge it has safely stored any of the files locally until it has done an fsync on that particular file. This technique would work great for an application like rsync, tar, etc. 
For a mail client, you would see a benefit only when you were pulling down batches of messages which clearly is a common case if you are still reading this thread :-) The fs_mark program I wrote plays around with the various ways to do this if someone is interested in playing around a bit, Ric ^ permalink raw reply [flat|nested] 59+ messages in thread
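[Editorial illustration, not part of the original message.] The batching idea described above can be sketched roughly as follows: write every file in the batch first, then fsync() each one, and only acknowledge a file once its own fsync() has returned. This is not code from fs_mark; `store_batch` and its arguments are hypothetical names chosen for this example. Syncing the most recently written file first follows the remark above that one fsync() may carry earlier writes with it, but how much that helps depends on the filesystem.

```python
import os


def store_batch(directory, items):
    """Write all (name, data) pairs, then fsync each file.

    Returns the names that may be acknowledged as safely stored.
    """
    fds = []
    durable = []
    try:
        # Phase 1: write everything without forcing anything to disk.
        for name, data in items:
            fd = os.open(os.path.join(directory, name),
                         os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            fds.append((name, fd))
            view = memoryview(data)
            while view:                      # os.write may write only part
                written = os.write(fd, view)
                view = view[written:]
        # Phase 2: fsync every file, most recently written first.
        for name, fd in reversed(fds):
            os.fsync(fd)
            durable.append(name)             # acknowledge only after fsync
    finally:
        for _, fd in fds:
            os.close(fd)
    # New directory entries need the directory itself to be synced too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return durable
```

A mail client pulling down a batch of messages could call this once per batch and delete the messages from the server only for the names it gets back. The sketch keeps one descriptor open per file, so very large batches would have to be split to stay under the process's descriptor limit.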
* Re: Ext4 and the "30 second window of death" 2009-04-02 18:34 ` Nick Piggin 2009-04-02 18:38 ` Matthew Garrett @ 2009-04-02 21:47 ` david 1 sibling, 0 replies; 59+ messages in thread From: david @ 2009-04-02 21:47 UTC (permalink / raw) To: Nick Piggin Cc: Matthew Garrett, Theodore Tso, Sitsofe Wheeler, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List On Fri, 3 Apr 2009, Nick Piggin wrote: > On Friday 03 April 2009 05:22:48 david@lang.hm wrote: >> On Wed, 1 Apr 2009, Matthew Garrett wrote: >> >>>> The other subtlety comes if we add fsync() suppression to laptop mode >>>> --- which is something that Bart Samwel is very interested in doing >>>> and I talked to him at FOSDEM about this. As Jeff Garzik recently >>>> pointed out, however, if we let the system reorder writes across >>>> fsync() boundaries, or if we combine two writes to the same block >>>> separated by an fsync(), and the system crashes in the middle of >>>> pushing all of these blocks out to the disk, we can end up trashing >>>> the consistency guarantees of a database such as mysql or postgres. >>>> It's a good point, but it only applies if we add fsync() suppression >>>> to laptop mode --- which we haven't done yet. >>> >>> I've got absolutely no idea why anyone would want fsync() to stop >>> meaning "Put my data on the disk please". laptop-mode isn't intended to >>> reduce data integrity - it's intended to batch disk write-outs such that >>> there's a lower risk of needing to perform further write-outs in future. >>> It makes sense for applications which really desperately want >>> information on disk to fsync() (for instance, saving a file in >>> OpenOffice). >>> >>> laptop-mode is something that makes sense as a default behaviour under a >>> lot of circumstances. Adding fsync() suppression means it's utterly >>> impossible to use it in that way. 
An additional mode would be perfectly >>> reasonable, as long as it's made clear that it's really a request for >>> data to be discarded at some point. The current mode isn't. >> >> this issue seems pretty straightforward to me >> >> the apps do fsync (and similar) to the degree that they think their data >> is important (potentially with config options if they acknowlege that >> their data isn't _always_ that important) >> >> the system allows the admin to override the application and say "I'm >> willing to loose up to X seconds of data for other benifits" >> >> if this can work cleanly (with the ordering issue that was identified, >> which may involve having multiple versions of the metadata cached) it >> seems like a very clean interface. > > It isn't just about ordering of writes a a filesystem. A database program > commits a transaction and then tells the client that it is safe. Client > then goes and does <something> in response to that, which may or may not > involve more writes to the filesystem. > > Shouldn't applications have a mode to avoid spinning up the disk if it is > so important? why should every application have to have a "I'm mobile" config option? what about a user that's only mobile sometimes and wants full protection the rest of the time? how can they easily switch every application between 'keep the data as safe as you can' and 'save battery' modes? will you have to restart all the apps when you unplug power to switch their modes? allowing the user to tell the system to override the applications when the user wants to is _much_ easier. David Lang ^ permalink raw reply [flat|nested] 59+ messages in thread
* supporting laptops fs-semantic changes (was Re: Ext4 and the "30 second window of death") 2009-04-01 17:43 ` Matthew Garrett 2009-04-01 21:21 ` Ray Lee 2009-04-02 18:22 ` david @ 2009-04-06 21:32 ` Linda Walsh 2 siblings, 0 replies; 59+ messages in thread From: Linda Walsh @ 2009-04-06 21:32 UTC (permalink / raw) To: Linux Kernel Mailing List; +Cc: Matthew Garrett Matthew Garrett wrote: >> The other subtlety comes if we add fsync() suppression to laptop mode ----- Perhaps this has already been suggested, but rather than adding all these semantics to the core file-system / kernel routines, wouldn't it be preferable to allow some 'layering' of a pseudo, memory-based file-system, OVER some 'real' file system (OR), definable set of files (under a subdir...or same device...or whatever). The semantics of when the virtual-fs would sync to the physical-fs/files controlled via mount options. Physical disk writes would be controlled by selectively ignoring or honoring various "sync" events (time expired, sync, fsync). This could allow file-systems with different 'needs' (DB, or otherwise) to be treated differently. The advantage of another layer, is you could define _how much_ buffering you wanted to allocate to a filesystem (or file-set). Maybe it's tolerable losing a audio-recording of a talk, so large buff + don't sync 'cept when full is fine. Sensitive filesystems(or sets) (i.e. db's), could be set with buffers to hold largest 'single-writes', but sync/fsyncs are what they are. An optimization could provide for read/writes through the user-mem controlled buffered 'fs', to do direct I/O rather than into normal file-buffs where possible, since presumably all accesses to a file would go through the layer or not. Wouldn't require application changing, and wouldn't require changing well defined, lower-level kernel-filesystem operations. Just a thought. Linda ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Ext4 and the "30 second window of death"
  2009-04-01 17:35 ` Theodore Tso
  2009-04-01 17:43 ` Matthew Garrett
@ 2009-04-02 11:37 ` Sitsofe Wheeler
  1 sibling, 0 replies; 59+ messages in thread
From: Sitsofe Wheeler @ 2009-04-02 11:37 UTC
To: Theodore Tso, Matthew Garrett, Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List

On Wed, Apr 01, 2009 at 01:35:21PM -0400, Theodore Tso wrote:
> On Wed, Apr 01, 2009 at 04:12:21PM +0100, Matthew Garrett wrote:
> > On Wed, Apr 01, 2009 at 06:20:50AM +0100, Sitsofe Wheeler wrote:
> > >
> > > Just out of curiosity, when laptop mode is happening is there a
> > > guarantee that writes to other files won't be reordered to before the
> > > fsync?
> >
> > laptop-mode does two things - tweak the dirty page semantics slightly
> > (not in an interestingly relevant way) and call sys_sync() a few seconds
> > after something hits disk rather than cache. In contrast to Ted's
> > suggestion that laptop-mode reduces data integrity, it actually enhances
> > it by opportunistically ensuring that data hits disk. It's the
> > lengthening of the commit intervals that usually accompanies it that
> > increases the risk of data loss.
>
> It *can* reduce data integrity; it really depends on how it's tuned
> and what scenario you're talking about. To the extent that it uses
> sys_sync(), it could help in some cases as well, since filesystems
> that do delayed allocation will wake up when the commit interval
> fires, and then force out all writes to the disk, yes. But before the
> commit interval, there is an increased risk of data loss --- which the
> user requested.

That's fair enough and always seemed to be part of the bargain (let the disk spin down for longer but risk losing 30 seconds of non-synced recent data in a crash). The result shouldn't be corruption though.

> The other subtlety comes if we add fsync() suppression to laptop mode
> --- which is something that Bart Samwel is very interested in doing
> and I talked to him at FOSDEM about this. As Jeff Garzik recently
> pointed out, however, if we let the system reorder writes across
> fsync() boundaries, or if we combine two writes to the same block
> separated by an fsync(), and the system crashes in the middle of
> pushing all of these blocks out to the disk, we can end up trashing
> the consistency guarantees of a database such as mysql or postgres.
> It's a good point, but it only applies if we add fsync() suppression
> to laptop mode --- which we haven't done yet.

eek. If this goes in it needs to come with scary warnings so a distro doesn't enable it by default (think of all those sqlite databases that are springing up). I know my system is crummy; all of this only matters if the system crashes uncontrollably (which it shouldn't do), and I don't do things that would make it safer (like mounting with sync) because I like the speed, but there's a risk limit. I don't want my chances of corruption (as opposed to "just" loss of non recent data) to be too high...

--
Sitsofe | http://sucs.org/~sits/
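The reordering hazard quoted above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of how any filesystem or database actually behaves: `crash_states` and the write labels "log" and "data" are hypothetical names, each write is treated as atomic, and an honored fsync() is modeled purely as an ordering barrier between the writes before it and the writes after it.

```python
from itertools import permutations, product


def crash_states(ops, honor_fsync):
    """Return every set of writes that could be on disk when a crash hits.

    ops is a list of write labels with the string "fsync" marking a barrier.
    When honor_fsync is False the barriers are ignored, which models fsync()
    suppression: all writes may then reach the disk in any order.
    """
    # Group the writes into epochs separated by the honored barriers.
    epochs = [[]]
    for op in ops:
        if op == "fsync":
            if honor_fsync:
                epochs.append([])
        else:
            epochs[-1].append(op)

    states = set()
    # Writes inside one epoch may be reordered freely; epochs stay in order.
    for choice in product(*(permutations(epoch) for epoch in epochs)):
        order = [write for epoch in choice for write in epoch]
        # A crash can interrupt write-back after any prefix of that order.
        for cut in range(len(order) + 1):
            states.add(frozenset(order[:cut]))
    return states
```

For `ops = ["log", "fsync", "data"]`, the honored case never yields a state holding "data" without "log", while the suppressed case does. That "data without its log record" state is the kind of outcome a database relying on fsync() ordering is not prepared to recover from.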
* Re: Ext4 and the "30 second window of death"
  2009-04-01  1:50 ` Theodore Tso
  2009-04-01  5:20 ` Sitsofe Wheeler
@ 2009-04-01  8:51 ` Andreas T.Auer
  1 sibling, 0 replies; 59+ messages in thread
From: Andreas T.Auer @ 2009-04-01 8:51 UTC
To: Theodore Tso; +Cc: Andreas T.Auer, Alberto Gonzalez, Linux Kernel Mailing List

On 01.04.2009 03:50 Theodore Tso wrote:
> On Wed, Apr 01, 2009 at 01:22:19AM +0200, Andreas T.Auer wrote:
>
>> E.g. your POP3 client receives a very important mail, saves it to disk,
>> uses fsync to make sure it is out and tells the server to delete it. If
>> you are gonna delay the fsync, you will have a long window in which the
>> mail can get lost instead of a minimum window. Or are there any POP3
>> clients, which can synchronize the mail-polling with a spinning a disk?
>>
>
> True, but consider --- this is a laptop we're talking about, right?
> What if the laptop hard drive crashes after you accidentally drop your
> laptop. Even if you're using an SSD, what if someone steals your
> laptop.

Well, there is always a worst case, but I have had quite a lot of system crashes with unstable versions without dropping the laptop once.

> Your first mistake was using POP3. :-)
>

I agree. :-) I am using IMAP, but a lot of people have only their POP3 account on their only laptop.

> If all they are doing is browsing the web, and the issue is firefox's
> desire to constantly write to their home directory, the user should be
> able to say, "you know, my battery life is more important that making
> sure that every last web page I visit is saved away in some file ---
> Firefox's 'Awesome Bar' really isn't worth that much to me."
>

AFAIK FF in particular doesn't use fsync that often anymore by default. And the user has to know the now-hidden config entry toolkit.storage.synchronous to raise the fsync level. But there are surely enough applications that use fsync too much, and enough applications that don't use it often enough.

> Of course, there is the question whether most users will be able to
> understand the risks of doing things like using POP3 and fetchmail as
> described in your scenario above. And that's a valid question --- so
> it's worth asking whether suppressing fsync()'s really saves enough
> power to be worth it, as opposed to say, fixing applications that are
> write-happy, or choosing not to use applications which are write-happy
> when you are running on battery.
>
>

Surely a lot of users don't understand all the risks or downsides of any write-out policy. But there are users who do understand. For those it would be fine if they could define the policies for fsync and non-fsyncs on a per-application basis (with a global default). E.g.: the POP3 client should write synchronously with fsync, but can wait for two minutes for non-fsynced data. Firefox should have these values and OpenOffice those values, etc. But I guess the implementation effort is too high.

Andreas
* Re: Ext4 and the "30 second window of death"
  2009-03-31 12:25 ` Theodore Tso
  2009-03-31 12:52 ` Alberto Gonzalez
@ 2009-04-03  7:13 ` Bojan Smojver
  2009-04-05  4:07 ` Bojan Smojver
  2009-04-05 17:27 ` Ed Tomlinson
  2 siblings, 1 reply; 59+ messages in thread
From: Bojan Smojver @ 2009-04-03 7:13 UTC
To: linux-kernel

Theodore Tso <tytso <at> mit.edu> writes:

> The replace-via-truncate and replace-via-rename workarounds are there
> for the benefit of KDE, and GNOME, which in some configurations
> apparently will replace hundreds of dot files when the desktop is
> started up, for no reason that I can understand.

Maybe it would be useful if we had an IN_SYNC event in inotify (meaning all buffers of a closed file have been synced to disk, either implicitly or by fsync() - not important). Then we could have these apps do something like this on configuration change:

1. Backup by link("foo","foo~"), unless we are watching "foo" for IN_SYNC event.
2. Open "foo" and read it.
3. Create "foo.new" and put new stuff in it.
4. Close "foo.new".
5. Rename "foo.new" into "foo".
6. Put a watch on "foo" for IN_SYNC, unless we already have one.

In the regular loop of the app:

1. When the event IN_SYNC turns up for "foo", remove "foo~".
2. Remove the watch.

No fsync() in sight, all atomic and no chance of losing data. If things go haywire, we shall have a fully committed "foo~" on startup, which we then just rename into the most likely broken "foo" and continue. If we don't have "foo~", it must mean "foo" is OK.

Something like this may even work for rsync (slightly different flow of events, probably watching from another thread).

When throwing stones, please limit yourself to less than 5kg specimens... :-)

--
Bojan
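The backup-and-watch flow proposed above might look roughly like the following. This is a sketch under loud assumptions: inotify has no IN_SYNC event (the message only proposes one), so the hypothetical method `on_synced()` stands in for "the IN_SYNC event arrived", and the class name `WatchedConfig` and the flag `watching` are made up for illustration.

```python
import os


class WatchedConfig:
    """Sketch of the proposed flow for one configuration file."""

    def __init__(self, path):
        self.path = path
        self.backup = path + "~"
        # Stands in for "we hold a watch on the file for the proposed
        # IN_SYNC event"; no such inotify event exists today.
        self.watching = False

    def recover(self):
        """Run at startup: a leftover backup means the last update was
        never confirmed as synced, so fall back to the backup."""
        if os.path.exists(self.backup):
            os.replace(self.backup, self.path)
        self.watching = False

    def update(self, data):
        """Replace the file contents without calling fsync()."""
        # Step 1: hard-link a backup unless a sync is already pending.
        if not self.watching and os.path.exists(self.path):
            os.link(self.path, self.backup)
        # Steps 3-5: write the new contents next to the file, then rename.
        new_path = self.path + ".new"
        with open(new_path, "wb") as f:
            f.write(data)
        os.replace(new_path, self.path)
        # Step 6: remember that we are waiting for the sync notification.
        self.watching = True

    def on_synced(self):
        """Hypothetical handler for the proposed IN_SYNC event."""
        if os.path.exists(self.backup):
            os.remove(self.backup)
        self.watching = False
```

An application would call `recover()` once at startup and `update()` on each configuration change. Note that the sketch assumes `recover()` has run before the first `update()`, since `os.link` fails if a stale "foo~" is still present.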
* Re: Ext4 and the "30 second window of death"
  2009-04-03  7:13 ` Bojan Smojver
@ 2009-04-05  4:07 ` Bojan Smojver
  2009-04-05  4:51 ` Bojan Smojver
  2009-04-05  5:41 ` Bojan Smojver
  0 siblings, 2 replies; 59+ messages in thread
From: Bojan Smojver @ 2009-04-05 4:07 UTC
To: linux-kernel

Bojan Smojver <bojan <at> rexursive.com> writes:

> Maybe it would be useful if we had IN_SYNC event in inotify

Or, maybe we can just (ab)use aio_fsync() for all this. This could be useful for renaming of configuration files, less so for rsync (although it could be done there too, I guess; rsync would just have to wait for synchronisation at the end of the run).

It would work like this:

1. Open "foo" and read it.
2. Open mktemp()-ed "foo.XXXXXX".
3. Write into the temp file.
4. Call aio_fsync().

Then, in the signal handler or the thread created on completion we'd have:

1. Rename the fully synced temp file into "foo".

If we made aio_fsync() wait in laptop mode for the regular commit interval, instead of writing to disk right away (because it is an async interface after all, so nobody expects it to finish immediately), we could preserve the normal fsync() in laptop mode to mean write to disk now. DBs and similar stuff would then get what they needed too, without complications.

For machines that are not laptops, with a constantly spinning disk and a decent file system (such as ext4 :-), this should not be a problem performance-wise. And the program asking for aio_fsync() could still continue without blocking, therefore being fully interactive.

PS. Disclaimer: I never used this call in any of my programs, so I'm just guessing that it works the way I understood the docs.

--
Bojan
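The "sync in the background, rename on completion" flow described above can be approximated as follows. Python's standard library does not expose aio_fsync(), so this sketch swaps in a worker thread that calls the ordinary blocking `os.fsync()` and then renames; it shows the shape of the idea, not the POSIX AIO interface itself. The function name `replace_when_synced` is hypothetical.

```python
import os
import tempfile
import threading


def replace_when_synced(path, data):
    """Write data to a temp file, then sync and rename it in the background.

    Returns the worker thread so the caller may join() it, for example
    before exiting. The caller is not blocked while the sync is pending.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Steps 2-3: create a uniquely named temp file next to the target
    # (mkstemp plays the role of mktemp()-ed "foo.XXXXXX") and fill it.
    fd, tmp_path = tempfile.mkstemp(
        prefix=os.path.basename(path) + ".", dir=directory
    )
    offset = 0
    while offset < len(data):
        offset += os.write(fd, data[offset:])

    def sync_then_rename():
        # Stand-in for aio_fsync() plus its completion notification:
        # block here, in the worker, until the data is on stable storage.
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
        # Only a fully synced temp file is renamed over the target.
        os.replace(tmp_path, path)

    worker = threading.Thread(target=sync_then_rename)
    worker.start()
    return worker
```

If the sync fails, the rename is skipped and the old "foo" stays in place; the temp file is then left behind for the caller to clean up. Note also that `mkstemp` creates the file with mode 0600, so a real program would need to restore the original permissions.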
* Re: Ext4 and the "30 second window of death"
  2009-04-05  4:07 ` Bojan Smojver
@ 2009-04-05  4:51 ` Bojan Smojver
  2009-04-05  5:41 ` Bojan Smojver
  1 sibling, 0 replies; 59+ messages in thread
From: Bojan Smojver @ 2009-04-05 4:51 UTC
To: linux-kernel

Bojan Smojver <bojan <at> rexursive.com> writes:

> 1. Rename the fully synced temp file into "foo".

Forgot to mention... At which point the current config kept in memory would be dumped, if the reference count of temp files associated with it reached zero.

--
Bojan
* Re: Ext4 and the "30 second window of death"
  2009-04-05  4:07 ` Bojan Smojver
  2009-04-05  4:51 ` Bojan Smojver
@ 2009-04-05  5:41 ` Bojan Smojver
  1 sibling, 0 replies; 59+ messages in thread
From: Bojan Smojver @ 2009-04-05 5:41 UTC
To: linux-kernel

Bojan Smojver <bojan <at> rexursive.com> writes:

> 1. Open "foo" and read it.

Of course, this step would be skipped if we still had the config in memory.

--
Bojan
* Re: Ext4 and the "30 second window of death"
  2009-03-31 12:25 ` Theodore Tso
  2009-03-31 12:52 ` Alberto Gonzalez
  2009-04-03  7:13 ` Bojan Smojver
@ 2009-04-05 17:27 ` Ed Tomlinson
  2 siblings, 0 replies; 59+ messages in thread
From: Ed Tomlinson @ 2009-04-05 17:27 UTC
To: Theodore Tso; +Cc: Alberto Gonzalez, Linux Kernel Mailing List

On Tuesday 31 March 2009 08:25:40 Theodore Tso wrote:
> On Sun, Mar 29, 2009 at 12:24:21PM +0200, Alberto Gonzalez wrote:
> > Hi,
> >
> > - I use Ext4 as my filesystem (default in next Fedora release).
>
> Fedora will have the patches so that applications that do
> replace-via-truncate (a bad idea, these applications are buggy, and
> will lose data sometimes even with ext3), or replace-via-rename
> without the fsync(), will force the blocks out to disk with the
> commit.
>
> > - Let's say I've been working on my book for the last 14 months and I've
> > written about 400 pages on an ODF file.
>
> Openoffice, being a portable application, that has to work on other
> operating systems and filesystems (for example, like Solaris's UFS),
> does do open/write/close/fsync/rename. So you're safe if you're using
> OpenOffice (and emacs, and vim).
>
> The replace-via-truncate and replace-via-rename workarounds are there
> for the benefit of KDE, and GNOME, which in some configurations
> apparently will replace hundreds of dot files when the desktop is
> started up, for no reason that I can understand. (Not such a great
> idea for SSD write endurance!) Some people apparently spend hours
> making sure that their windows are exactly positioned the way they
> want it when their desktop starts up, and if the system crashes while
> their desktop is starting up, they could lose their window
> positions, which apparently made a whole bunch of users cranky. In
> practice, most of the editors that I'm familiar with have been around
> for a while, have needed to make sure that cases such as yours
> wouldn't result in data loss, and so are pretty good about using
> fsync() so that users' files wouldn't be lost, no matter what the
> filesystem or operating system being used.

It's more than losing window positions. I've been using ext4 with kde 4.2.1 along with some experimental modules (drm for xorg for r600 support, btrfs) and a few patches. As expected this has caused a few crashes. I have had kde lose desktop setup info (e.g. it forgot it was using xrender accel). I have also had kmail lose all its configuration, which is a pita to rebuild. Note that these crashes occur long after kde has been started...

> The problem has been mostly with newer applications, especially the
> newer desktop ones, which have been written to assume that they only
> have to work safely on Linux and ext3. The replace-via-truncate and
> replace-via-rename workarounds provide this safety for ext4.

When there are patches out to improve this (bad) behavior I would love to try them.

TIA
Ed Tomlinson
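The replace-via-rename pattern with an fsync(), which the quoted text credits to OpenOffice, emacs and vim, can be sketched as below. This is a minimal illustration, not what any of those programs literally do: the function name `atomic_replace` and the ".tmp" suffix are assumptions, the sync is issued before the file is closed, and the final directory sync is an extra step that the thread does not discuss.

```python
import os


def atomic_replace(path, data):
    """Replace path with data so that a crash leaves either the complete
    old contents or the complete new contents, never a truncated file."""
    tmp_path = path + ".tmp"  # hypothetical temp name next to the target

    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to put the data on disk

    # The new contents are durable, so it is now safe to swap the names.
    os.replace(tmp_path, path)

    # Extra step beyond what the message describes: sync the directory so
    # the rename itself is recorded on disk.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Replace-via-truncate, by contrast, opens the existing file with truncation and rewrites it in place, so a crash between the truncate and the write-back can leave an empty file with no old copy to fall back on.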
end of thread, other threads:[~2009-04-06 22:01 UTC | newest]

Thread overview: 59+ messages
2009-03-29 10:24 Ext4 and the "30 second window of death" Alberto Gonzalez
2009-03-31 12:25 ` Theodore Tso
2009-03-31 12:52 ` Alberto Gonzalez
2009-03-31 13:45 ` Theodore Tso
2009-03-31 14:45 ` Alberto Gonzalez
2009-04-01  0:04 ` Theodore Tso
2009-04-01  1:14 ` Alberto Gonzalez
2009-03-31 22:02 ` Alberto Gonzalez
2009-03-31 23:22 ` Andreas T.Auer
2009-04-01  1:25 ` Alberto Gonzalez
2009-04-01  1:50 ` Theodore Tso
2009-04-01  5:20 ` Sitsofe Wheeler
2009-04-01 15:12 ` Matthew Garrett
2009-04-01 17:35 ` Theodore Tso
2009-04-01 17:43 ` Matthew Garrett
2009-04-01 21:21 ` Ray Lee
2009-04-01 21:26 ` Matthew Garrett
2009-04-02 11:25 ` Sitsofe Wheeler
2009-04-02 18:22 ` david
2009-04-02 18:29 ` Matthew Garrett
2009-04-02 18:44 ` david
2009-04-02 20:07 ` Ray Lee
2009-04-02 20:59 ` Andreas T.Auer
2009-04-02 23:38 ` Theodore Tso
2009-04-03  0:00 ` Matthew Garrett
2009-04-03  7:33 ` Pavel Machek
2009-04-03  8:14 ` Andreas T.Auer
2009-04-02 22:36 ` Bron Gondwana
2009-04-02 23:46 ` Matthew Garrett
2009-04-03  0:55 ` david
2009-04-03  1:06 ` Matthew Garrett
2009-04-03  1:16 ` david
2009-04-03  1:19 ` Matthew Garrett
2009-04-03  1:24 ` david
2009-04-03  1:36 ` Matthew Garrett
2009-04-03  3:08 ` david
2009-04-03 13:42 ` Matthew Garrett
2009-04-03  4:54 ` Theodore Tso
2009-04-03 11:09 ` Sitsofe Wheeler
2009-04-03 13:07 ` Alberto Gonzalez
2009-04-03 13:45 ` Matthew Garrett
2009-04-02 18:34 ` Nick Piggin
2009-04-02 18:38 ` Matthew Garrett
2009-04-02 18:56 ` Nick Piggin
2009-04-02 23:47 ` Matthew Garrett
2009-04-03  0:59 ` david
2009-04-03  1:09 ` Matthew Garrett
2009-04-03  1:17 ` david
2009-04-03  1:22 ` Matthew Garrett
2009-04-03  2:22 ` Ric Wheeler
2009-04-02 21:47 ` david
2009-04-06 21:32 ` supporting laptops fs-semantic changes (was Re: Ext4 and the "30 second window of death") Linda Walsh
2009-04-02 11:37 ` Ext4 and the "30 second window of death" Sitsofe Wheeler
2009-04-01  8:51 ` Andreas T.Auer
2009-04-03  7:13 ` Bojan Smojver
2009-04-05  4:07 ` Bojan Smojver
2009-04-05  4:51 ` Bojan Smojver
2009-04-05  5:41 ` Bojan Smojver
2009-04-05 17:27 ` Ed Tomlinson