* Plans to evaluate the reliability and integrity of ext4 against power failures.
@ 2009-06-30 23:27 Shaozhi Ye
2009-07-01 0:58 ` Eric Sandeen
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Shaozhi Ye @ 2009-06-30 23:27 UTC (permalink / raw)
To: linux-fsdevel, linux-ext4; +Cc: fs-team
[-- Attachment #1: Type: text/plain, Size: 404 bytes --]
Hi there,
We are planning to evaluate the reliability and integrity of ext4
against power failures and will post the results when it is done.
Please find the attached design document and let me know if you have
any suggestions for features to test or existing benchmark tools that
would serve our purpose.
Thank you!
--
Shaozhi Ye
Software Engineer Intern from UC Davis
Cluster Management Team
http://who/yeshao
[-- Attachment #2: powerloss.pdf --]
[-- Type: application/pdf, Size: 52060 bytes --]

* Re: Plans to evaluate the reliability and integrity of ext4 against power failures.
From: Eric Sandeen @ 2009-07-01  0:58 UTC (permalink / raw)
To: Shaozhi Ye; +Cc: linux-fsdevel, linux-ext4, fs-team

Shaozhi Ye wrote:
> Hi there,
>
> We are planning to evaluate the reliability and integrity of ext4
> against power failures and will post the results when it is done.
> Please find the attached design document and let me know if you have
> any suggestions for features to test or existing benchmark tools that
> would serve our purpose.
>
> Thank you!

I'll be very interested to see the results.  One thing you will need to
look out for (people who know me knew I would say this ;) is volatile
write caches on the storage, and the filesystem barrier implementation.

To characterize the test, you'll want to be explicit about your storage.
Are they local disks?  A RAID controller?  If the disks are external to
the server, is power lost to the disks?  Do they have write caching
enabled?  If so, do write barriers from the filesystem pass through to
the storage?  All of this is highly relevant to how a power loss will
affect the filesystem.

I'm also curious whether the tool itself will be available under an
open source license?  Other filesystems would benefit from this testing
as well.

Thanks,
-Eric

* Re: Plans to evaluate the reliability and integrity of ext4 against power failures.
From: Michael Rubin @ 2009-07-01 17:39 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Shaozhi Ye, linux-fsdevel, linux-ext4

On Tue, Jun 30, 2009 at 5:58 PM, Eric Sandeen <sandeen@redhat.com> wrote:
> Shaozhi Ye wrote:
>
> I'll be very interested to see the results.  One thing you will need to
> look out for (people who know me knew I would say this ;) is volatile
> write caches on the storage, and the filesystem barrier implementation.

I am interested in the numbers too.  As to the other dimensions, we plan
on investigating this as much as we can given our environment.

> To characterize the test, you'll want to be explicit about your storage.
> Are they local disks?  A RAID controller?  If the disks are external to
> the server, is power lost to the disks?  Do they have write caching
> enabled?  If so, do write barriers from the filesystem pass through to
> the storage?  All of this is highly relevant to how a power loss will
> affect the filesystem.

Yup.  We plan on detailing these qualities and comparing some of them in
the write-up when yeshao's work is done.

> I'm also curious whether the tool itself will be available under an
> open source license?  Other filesystems would benefit from this testing
> as well.

The end goal is to have a test suite for power loss that we can share
with the community.  Realize we may send out numbers before sending out
the tests.

mrubin

* Re: Plans to evaluate the reliability and integrity of ext4 against power failures.
From: Chris Worley @ 2009-07-01 18:07 UTC (permalink / raw)
To: Shaozhi Ye; +Cc: linux-fsdevel, linux-ext4, fs-team

On Tue, Jun 30, 2009 at 5:27 PM, Shaozhi Ye <yeshao@google.com> wrote:
> Hi there,
>
> We are planning to evaluate the reliability and integrity of ext4
> against power failures and will post the results when it is done.
> Please find the attached design document and let me know if you have
> any suggestions for features to test or existing benchmark tools that
> would serve our purpose.

This looks like a very valuable project.  What I do not yet understand
is how certain problems that very much need to be tested will be
tested.  From your pdf:

    "Data loss: The client thinks the server has A while the server
     does not."

I've been wondering how you test to assure that data committed to the
disk is really committed?  Ext4 receives the callback saying the data
is committed; you need to somehow log what it thinks has been
committed, then cut power without any time passing from the moment the
callback was invoked, and then check the disk upon reboot to assure the
data made it, by comparing it with the log.

I just don't see a method to test this, but it is so critically
important.

Chris

> Thank you!
>
> --
> Shaozhi Ye
> Software Engineer Intern from UC Davis
> Cluster Management Team
> http://who/yeshao

* Re: Plans to evaluate the reliability and integrity of ext4 against power failures.
From: Michael Rubin @ 2009-07-01 18:31 UTC (permalink / raw)
To: Chris Worley; +Cc: Shaozhi Ye, linux-fsdevel, linux-ext4

On Wed, Jul 1, 2009 at 11:07 AM, Chris Worley <worleys@gmail.com> wrote:
> This looks like a very valuable project.  What I do not yet understand
> is how certain problems that very much need to be tested will be
> tested.  From your pdf:
>
>     "Data loss: The client thinks the server has A while the server
>      does not."
>
> I've been wondering how you test to assure that data committed to the
> disk is really committed?

What we are trying to capture is what the user perceives and can expect
in our environment.  This is not an attempt to know the moment the OS
can guarantee the data is stored persistently; I am not sure that is
feasible to do with write-caching drives today.

This experiment's goal, as of now, is not to know the exact moment in
time "when the data is committed".  It has two goals: first, to assure
ourselves that there is no strange corner case making ext4 behave worse
or unexpectedly compared to ext2 in the rare event of a power failure;
and second, to set expectations for our users on the recoverability of
data after such an event.

For now we are employing a client/server model for network-exported
storage in this test.  In that context the application does not have
many ways to know when the data is committed.  I know of O_DIRECT,
fsync, etc.  Given these current-day interfaces, what can the network
client apps expect?

After we have results we will try to figure out whether we need to
develop new interfaces or methods to improve the situation, and
hopefully start sending patches.

> I just don't see a method to test this, but it is so critically
> important.

I agree.

mrubin
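
As a concrete illustration of the interfaces mentioned above, here is a
minimal C sketch (not from the thread; the file paths and data are
arbitrary) contrasting an explicit fdatasync() after write() with
opening the file O_DSYNC so that every write() blocks until the data is
handed to the device.  As discussed earlier in the thread, whether the
data actually survives a power cut still depends on the drive's write
cache and barrier configuration underneath.

    /* durability.c - write()+fdatasync() vs. O_DSYNC (illustrative paths) */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "client payload\n";

        /* Variant 1: buffered write, then explicitly flush the file data. */
        int fd = open("/srv/export/variant1.dat",
                      O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("variant1 write");
            exit(1);
        }
        if (fdatasync(fd) != 0) {   /* only after this may the server ack */
            perror("fdatasync");
            exit(1);
        }
        close(fd);

        /* Variant 2: O_DSYNC makes each write() return only after the data
         * (plus the metadata needed to read it back) is sent to the device. */
        fd = open("/srv/export/variant2.dat",
                  O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0644);
        if (fd < 0 || write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("variant2 write");
            exit(1);
        }
        close(fd);
        return 0;
    }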

* Re: Plans to evaluate the reliability and integrity of ext4 against power failures.
From: Ric Wheeler @ 2009-07-01 18:44 UTC (permalink / raw)
To: Michael Rubin; +Cc: Chris Worley, Shaozhi Ye, linux-fsdevel, linux-ext4

On 07/01/2009 02:31 PM, Michael Rubin wrote:
> On Wed, Jul 1, 2009 at 11:07 AM, Chris Worley <worleys@gmail.com> wrote:
>> I've been wondering how you test to assure that data committed to the
>> disk is really committed?
>
> What we are trying to capture is what the user perceives and can expect
> in our environment.  This is not an attempt to know the moment the OS
> can guarantee the data is stored persistently; I am not sure that is
> feasible to do with write-caching drives today.
>
> This experiment's goal, as of now, is not to know the exact moment in
> time "when the data is committed".  It has two goals: first, to assure
> ourselves that there is no strange corner case making ext4 behave worse
> or unexpectedly compared to ext2 in the rare event of a power failure;
> and second, to set expectations for our users on the recoverability of
> data after such an event.

The key is not to ack the client's request until you have made a best
effort at moving the data to persistent storage locally.

Today, I think that the best practice is to either disable the write
cache on the drive or have properly configured write barrier support,
and to fsync() any file before sending the ack back over the wire to
the client.  Note that disabling the write cache is required if you use
some MD/DM constructs that might not honor barrier requests.

Doing this consistently has been shown to significantly reduce data
loss due to power failure.

> For now we are employing a client/server model for network-exported
> storage in this test.  In that context the application does not have
> many ways to know when the data is committed.  I know of O_DIRECT,
> fsync, etc.  Given these current-day interfaces, what can the network
> client apps expect?

Isn't this really just proper design of the server component?

> After we have results we will try to figure out whether we need to
> develop new interfaces or methods to improve the situation, and
> hopefully start sending patches.
>
>> I just don't see a method to test this, but it is so critically
>> important.
>
> I agree.
>
> mrubin

One way to test this with reasonable, commodity hardware would be
something like the following:

(1) Get an automated power-kill setup to control your server.

(2) Configure the server with your regular storage stack plus one local
    device with its write cache disabled (this could be a normal S-ATA
    drive with the write cache turned off).

(3) On receipt of each client request, record with O_DIRECT writes to
    the non-caching device both the receipt of the request and the
    sending of the ack back to the client.  Using a really low-latency
    device for the recording skews the accuracy of this technique much
    less, of course :-)

(4) On the client, record locally its requests and received acks.

(5) At random times, drop power to the server.

Verification would be to replay the client log of received acks and
validate that the server (after recovery) still has the data that it
acked over the network.

Wouldn't this suffice to raise the bar to a large degree?

Thanks!

Ric
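
As an illustration of step (3) above, here is a minimal C sketch of
appending fixed-size, sector-aligned event records to the non-caching
log device with O_DIRECT.  The device path, record layout, and function
names are assumptions made for the example, not part of the proposal;
O_DIRECT requires the buffer, transfer size, and file offset to be
aligned to the device sector size.

    /* powerlog.c - sketch of O_DIRECT event logging on a non-caching device */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define SECTOR 512              /* one record per 512-byte sector */

    struct log_rec {                /* padded to a full sector for O_DIRECT */
        uint64_t seq;               /* client request sequence number       */
        uint64_t event;             /* 0 = request received, 1 = ack sent   */
        uint64_t timestamp_ns;
        char     pad[SECTOR - 3 * sizeof(uint64_t)];
    };

    static int   log_fd = -1;
    static off_t log_off;

    int powerlog_open(const char *dev)   /* e.g. a disk with write cache off */
    {
        log_fd = open(dev, O_WRONLY | O_DIRECT);
        return log_fd < 0 ? -1 : 0;
    }

    int powerlog_append(uint64_t seq, uint64_t event, uint64_t ts)
    {
        struct log_rec *rec;

        if (posix_memalign((void **)&rec, SECTOR, sizeof(*rec)))
            return -1;
        memset(rec, 0, sizeof(*rec));
        rec->seq = seq;
        rec->event = event;
        rec->timestamp_ns = ts;

        /* O_DIRECT bypasses the page cache; with the drive's write cache
         * disabled, a successful pwrite() means the record is on stable
         * media before we move on.                                        */
        ssize_t n = pwrite(log_fd, rec, sizeof(*rec), log_off);
        free(rec);
        if (n != (ssize_t)sizeof(*rec))
            return -1;
        log_off += sizeof(*rec);
        return 0;
    }

The server would log event 0 when a request arrives, write and fsync()
the real data file, log event 1, and only then send the ack back to the
client; the client-side log from step (4) is the other half of the
comparison after recovery.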

* Re: Plans to evaluate the reliability and integrity of ext4 against power failures.
From: Jeff Moyer @ 2009-07-01 19:58 UTC (permalink / raw)
To: Ric Wheeler
Cc: Michael Rubin, Chris Worley, Shaozhi Ye, linux-fsdevel, linux-ext4

Ric Wheeler <rwheeler@redhat.com> writes:

> One way to test this with reasonable, commodity hardware would be
> something like the following:
>
> (1) Get an automated power-kill setup to control your server.
>
> (2) Configure the server with your regular storage stack plus one local
>     device with its write cache disabled (this could be a normal S-ATA
>     drive with the write cache turned off).
>
> (3) On receipt of each client request, record with O_DIRECT writes to
>     the non-caching device both the receipt of the request and the
>     sending of the ack back to the client.
>
> (4) On the client, record locally its requests and received acks.
>
> (5) At random times, drop power to the server.
>
> Verification would be to replay the client log of received acks and
> validate that the server (after recovery) still has the data that it
> acked over the network.
>
> Wouldn't this suffice to raise the bar to a large degree?

I already have a test app that does something along these lines.  See:

    http://people.redhat.com/jmoyer/dainto-0.99.3.tar.gz

I initially wrote it to try to simulate I/O that leads to torn pages.
I later reworked it to ensure that data that was acknowledged to the
server was actually available after power loss.

The basic idea is that you have a client and a server.  The client
writes blocks to a device (not a file system) in ascending order.  As
blocks complete, the block number is sent to the server.  Each block
contains the generation number (which indicates the pass number), the
block number, and a CRC of the block contents.  When the generation
wraps, the client calls out to the generation script provided.  In my
case, this was a script that would delay for a random number of seconds
and then power-cycle the client using an iLO fencing agent.

When the client powers back up, you run the client code in check mode,
which pulls its configuration from the server and then proceeds to
ensure that blocks that were acknowledged as written were, in fact,
intact on disk.

Now, I don't remember in what state I left the code.  Feel free to pick
it up and send me questions if you have any.  I'll be on vacation, of
course, but I'll answer when I get back. ;-)

Cheers,
Jeff
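
For readers who want a picture of the block format Jeff describes, here
is a small C sketch in the same spirit; the field names, sizes, and use
of zlib's crc32() are guesses for illustration, not the actual on-disk
format of the tool above (link with -lz).

    /* blockfmt.c - illustrative generation/block-number/CRC block layout */
    #include <stdint.h>
    #include <string.h>
    #include <zlib.h>

    #define BLOCK_SIZE 4096

    struct test_block {
        uint64_t generation;        /* pass number over the device */
        uint64_t block_nr;          /* written in ascending order  */
        uint32_t crc;               /* CRC of payload[]            */
        uint8_t  payload[BLOCK_SIZE - 20];
    };

    void block_fill(struct test_block *b, uint64_t gen, uint64_t nr)
    {
        b->generation = gen;
        b->block_nr   = nr;
        memset(b->payload, (int)((gen + nr) & 0xff), sizeof(b->payload));
        b->crc = (uint32_t)crc32(0L, b->payload, sizeof(b->payload));
    }

    /* Check mode: any block whose number was acknowledged to the server
     * must verify exactly after the power cycle.                         */
    int block_verify(const struct test_block *b, uint64_t gen, uint64_t nr)
    {
        return b->generation == gen &&
               b->block_nr   == nr &&
               b->crc == (uint32_t)crc32(0L, b->payload, sizeof(b->payload));
    }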

* Re: Plans to evaluate the reliability and integrity of ext4 against power failures.
From: Jamie Lokier @ 2009-07-02  2:12 UTC (permalink / raw)
To: Ric Wheeler
Cc: Michael Rubin, Chris Worley, Shaozhi Ye, linux-fsdevel, linux-ext4

Ric Wheeler wrote:
> One way to test this with reasonable, commodity hardware would be
> something like the following:
>
> (1) Get an automated power-kill setup to control your server

etc.  Good plan.

Another way to test the entire software stack, but not the physical
disks, is to run the entire test in VMs and simulate hard disk write
caching and power failure inside the VM.  KVM would be a great
candidate for that, as it runs VMs as ordinary processes and the disk
I/O emulation is quite easy to modify.

As most issues are probably software issues (kernel, filesystems, apps
not calling fsync, or assuming barrierless O_DIRECT/O_DSYNC is
sufficient, network fileserver protocols, etc.), it's surely worth a
look.

It could be much faster than the physical version too; in other words,
it gives more complete testing of the software stack for the available
resources.

With the ability to "fork" a running VM's state by snapshotting it and
continuing, it would even be possible to simulate power-failure cache
loss at many points in the middle of a stress test, with the stress
test continuing to run - no full reboot needed at every point.  That
way, deliberate trace points could be placed in the software stack at
places where a power-failure cache loss seems likely to cause a
problem.

-- Jamie
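
The crudest form of this idea can be sketched in a few lines of C: run
the guest as an ordinary child process and SIGKILL it at a random
moment.  To the guest filesystem that looks like a sudden crash
(anything its kernel had not yet written out is lost), although truly
simulating a volatile disk write cache would require modifying the disk
emulation as suggested above.  The qemu binary name, options, and image
path below are illustrative assumptions.

    /* vmkill.c - crude "power failure" for a KVM guest: start it, wait a
     * random interval, then SIGKILL the whole VM process.                */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        srand((unsigned)time(NULL));

        pid_t pid = fork();
        if (pid == 0) {
            /* The guest image is assumed to start the stress workload itself. */
            execlp("qemu-system-x86_64", "qemu-system-x86_64",
                   "-enable-kvm", "-m", "1024",
                   "-hda", "guest.img", (char *)NULL);
            perror("execlp");
            _exit(127);
        }
        if (pid < 0) {
            perror("fork");
            return 1;
        }

        sleep(30 + rand() % 300);   /* let the workload run for a while */
        kill(pid, SIGKILL);         /* "pull the plug" on the guest     */
        waitpid(pid, NULL, 0);

        /* Next step (not shown): fsck or mount guest.img read-only and
         * check which acknowledged writes actually survived.           */
        return 0;
    }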

* Re: Plans to evaluate the reliability and integrity of ext4 against power failures.
From: Ric Wheeler @ 2009-07-02 11:21 UTC (permalink / raw)
To: Jamie Lokier
Cc: Michael Rubin, Chris Worley, Shaozhi Ye, linux-fsdevel, linux-ext4

On 07/01/2009 10:12 PM, Jamie Lokier wrote:
> Ric Wheeler wrote:
>> One way to test this with reasonable, commodity hardware would be
>> something like the following:
>>
>> (1) Get an automated power-kill setup to control your server
>
> etc.  Good plan.
>
> Another way to test the entire software stack, but not the physical
> disks, is to run the entire test in VMs and simulate hard disk write
> caching and power failure inside the VM.  KVM would be a great
> candidate for that, as it runs VMs as ordinary processes and the disk
> I/O emulation is quite easy to modify.

Certainly, that could be useful for testing some levels of the stack.

Historically, the biggest issues that I have run across have centered
on the volatile write cache on the storage targets.  Not only can it
lose data that has been acked all the way back to the host, it can also
potentially reorder that data in challenging ways that will make file
system recovery difficult....

> As most issues are probably software issues (kernel, filesystems, apps
> not calling fsync, or assuming barrierless O_DIRECT/O_DSYNC is
> sufficient, network fileserver protocols, etc.), it's surely worth a
> look.
>
> It could be much faster than the physical version too; in other words,
> it gives more complete testing of the software stack for the available
> resources.
>
> With the ability to "fork" a running VM's state by snapshotting it and
> continuing, it would even be possible to simulate power-failure cache
> loss at many points in the middle of a stress test, with the stress
> test continuing to run - no full reboot needed at every point.  That
> way, deliberate trace points could be placed in the software stack at
> places where a power-failure cache loss seems likely to cause a
> problem.
>
> -- Jamie

I do agree that this testing would also be very useful, especially
since you can do it in almost any environment.

Regards,

Ric

* Re: Plans to evaluate the reliability and integrity of ext4 against power failures.
From: Theodore Tso @ 2009-07-01 20:59 UTC (permalink / raw)
To: Shaozhi Ye; +Cc: linux-fsdevel, linux-ext4, fs-team

I've got to ask --- what does "testing on elephants" mean?

I'm reminded of the (in)famous sign from the San Diego Zoo:

    http://www.flickr.com/photos/matusiak/3391292342/

:-)

                                                - Ted

* Re: Plans to evaluate the reliability and integrity of ext4 against power failures.
From: Michael Rubin @ 2009-07-02  1:04 UTC (permalink / raw)
To: Theodore Tso; +Cc: Shaozhi Ye, linux-fsdevel, linux-ext4

On Wed, Jul 1, 2009 at 1:59 PM, Theodore Tso <tytso@mit.edu> wrote:
> I've got to ask --- what does "testing on elephants" mean?

An elephant is a large, plant-eating land mammal with a prehensile
trunk and long, curvy tusks.  Testing elephants is hard.

mrubin

* Re: Plans to evaluate the reliability and integrity of ext4 against power failures.
From: Andreas Dilger @ 2009-07-01 23:37 UTC (permalink / raw)
To: Shaozhi Ye; +Cc: linux-fsdevel, linux-ext4, fs-team

On Jun 30, 2009 16:27 -0700, Shaozhi Ye wrote:
> We are planning to evaluate the reliability and integrity of ext4
> against power failures and will post the results when it is done.
> Please find the attached design document and let me know if you have
> any suggestions for features to test or existing benchmark tools that
> would serve our purpose.

What might be interesting is to enhance fsx to work on multiple nodes.
It would need an additional "sync" operation that flushes the cache to
disk, so that there is a limited amount of rollback after a server
reset.  The client would log locally the file operations being done on
the server, and after the server is restarted the client would verify
that the data in the file is consistent at least up to the most recent
sync.

Another very interesting filesystem test is "racer.sh" (a cleaned-up
version is in the Lustre test suite), which does random filesystem
operations (create, write, rename, link, unlink, symlink, mkdir,
rmdir).  Currently the operations are completely random, but if there
were a client logging the operations it should be possible to track the
state on the client.  What would be necessary is for the server to
export the current transaction id (tid) and for the client to record
this with every operation.  At server recovery time the last committed
tid is in the journal superblock, and the client can then verify that
its record of the filesystem state matches the actual state (after
journal rollback).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
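
A small C sketch of the client-side bookkeeping Andreas describes might
look like the following; the structure, field names, and functions are
invented for illustration and do not correspond to an existing fsx or
racer interface.  Each operation is logged with the server transaction
id (tid) current when it was issued, and after recovery only entries at
or below the last committed tid recorded in the journal superblock are
required to be present.

    /* oplog.c - illustrative client-side operation log keyed by server tid */
    #include <stdint.h>
    #include <stdio.h>

    enum op_type { OP_CREATE, OP_WRITE, OP_RENAME, OP_LINK,
                   OP_UNLINK, OP_SYMLINK, OP_MKDIR, OP_RMDIR, OP_SYNC };

    struct op_entry {
        uint64_t     tid;            /* server tid when the op was issued */
        enum op_type op;
        char         path[256];
        uint64_t     offset, length; /* used for OP_WRITE                 */
    };

    /* Entries at or below the last committed tid must be reflected in the
     * recovered filesystem; later ones may have been rolled back by
     * journal recovery and are only checked as "allowed but not required". */
    static int entry_must_survive(const struct op_entry *e,
                                  uint64_t last_committed_tid)
    {
        return e->tid <= last_committed_tid;
    }

    void summarize_log(const struct op_entry *log, size_t n,
                       uint64_t last_committed_tid)
    {
        size_t required = 0;
        for (size_t i = 0; i < n; i++)
            if (entry_must_survive(&log[i], last_committed_tid))
                required++;    /* replay/compare these against the real fs */
        printf("%zu of %zu logged operations must be present after recovery\n",
               required, n);
    }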

end of thread, other threads: [~2009-07-02 11:23 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-30 23:27 Plans to evaluate the reliability and integrity of ext4 against power failures Shaozhi Ye
2009-07-01  0:58 ` Eric Sandeen
2009-07-01 17:39   ` Michael Rubin
2009-07-01 18:07 ` Chris Worley
2009-07-01 18:31   ` Michael Rubin
2009-07-01 18:44     ` Ric Wheeler
2009-07-01 19:58       ` Jeff Moyer
2009-07-02  2:12       ` Jamie Lokier
2009-07-02 11:21         ` Ric Wheeler
2009-07-01 20:59 ` Theodore Tso
2009-07-02  1:04   ` Michael Rubin
2009-07-01 23:37 ` Andreas Dilger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).