* CrashMonkey: A Framework to Systematically Test File-System Crash Consistency
From: Vijay Chidambaram @ 2017-08-14 16:32 UTC
To: linux-ext4, linux-xfs, linux-fsdevel, linux-btrfs; +Cc: vijay, Ashlie Martinez

Hi,

I'm Vijay Chidambaram, an Assistant Professor at the University of
Texas at Austin. My research group is developing CrashMonkey, a
file-system agnostic framework to test file-system crash consistency
on power failures. We are developing CrashMonkey publicly at Github
[1]. This is very much a work-in-progress, so we welcome feedback.

CrashMonkey works by recording all the IO from running a given
workload, then *constructing* possible crash states (while honoring
FUA and FLUSH flags). A crash state is the state of storage after an
abrupt power failure or crash. For each crash state, CrashMonkey runs
the filesystem-provided fsck on top of the state, and checks if the
file-system recovers correctly. Once the file system mounts correctly,
we can run further tests to check data consistency. The work was
presented at HotStorage 17. The workshop paper is available at [2] and
the slides at [3].

Our plan was to post on the mailing lists after reproducing an
existing bug. We are not there yet, but I saw some posts where others
were considering building something similar, so I thought I would post
about our work.

[1] https://github.com/utsaslab/crashmonkey
[2] http://www.cs.utexas.edu/~vijay/papers/hotstorage17-crashmonkey.pdf
[3] http://www.cs.utexas.edu/~vijay/papers/hotstorage17-crashmonkey-slides.pdf

Thanks,
Vijay Chidambaram
http://www.cs.utexas.edu/~vijay/

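As a rough illustration of the record-then-construct idea in the message
above, the following Python sketch groups a recorded IO log into epochs
closed by FLUSH requests and builds one crash state from it. This is a
simplified model, not CrashMonkey's code: the `Op` record and the
functions `split_epochs` and `crash_state` are hypothetical names, the
"disk" is just a dictionary from block number to contents, and FUA
handling is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Op:
    """One recorded request (hypothetical record format)."""
    kind: str          # "write" or "flush"
    block: int = 0
    data: bytes = b""


def split_epochs(log: List[Op]) -> List[List[Op]]:
    """Group writes into epochs; every FLUSH closes the current epoch."""
    epochs: List[List[Op]] = [[]]
    for op in log:
        if op.kind == "flush":
            epochs.append([])
        else:
            epochs[-1].append(op)
    return epochs


def crash_state(epochs: List[List[Op]], durable_epochs: int,
                partial: List[Op]) -> Dict[int, bytes]:
    """Build a disk image holding `durable_epochs` complete epochs plus
    the chosen `partial` writes from the epoch that was in flight."""
    disk: Dict[int, bytes] = {}
    for epoch in epochs[:durable_epochs]:
        for op in epoch:
            disk[op.block] = op.data
    for op in partial:
        disk[op.block] = op.data
    return disk


log = [Op("write", 0, b"super-v1"), Op("write", 7, b"data-A"), Op("flush"),
       Op("write", 7, b"data-B"), Op("write", 0, b"super-v2")]
epochs = split_epochs(log)

# Crash after the flush, with only the second epoch's superblock update
# persisted: new superblock, old data block.
state = crash_state(epochs, 1, [epochs[1][1]])
print(state)
```

In this model everything before the last completed flush is treated as
durable, and only the writes of the in-flight epoch are optional. A real
tool would then run fsck and mount on each constructed image.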
* Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency
From: Amir Goldstein @ 2017-08-15 17:13 UTC
To: Vijay Chidambaram
Cc: Ext4, linux-xfs, linux-fsdevel, linux-btrfs, vijay, Ashlie Martinez, Josef Bacik, Christoph Hellwig

On Mon, Aug 14, 2017 at 6:32 PM, Vijay Chidambaram <vvijay03@gmail.com> wrote:
> Hi,
>
> I'm Vijay Chidambaram, an Assistant Professor at the University of
> Texas at Austin. My research group is developing CrashMonkey, a
> file-system agnostic framework to test file-system crash consistency
> on power failures. We are developing CrashMonkey publicly at Github
> [1]. This is very much a work-in-progress, so we welcome feedback.
>
> CrashMonkey works by recording all the IO from running a given
> workload, then *constructing* possible crash states (while honoring
> FUA and FLUSH flags). A crash state is the state of storage after an
> abrupt power failure or crash. For each crash state, CrashMonkey runs
> the filesystem-provided fsck on top of the state, and checks if the
> file-system recovers correctly. Once the file system mounts correctly,
> we can run further tests to check data consistency. The work was
> presented at HotStorage 17. The workshop paper is available at [2] and
> the slides at [3].
>
> Our plan was to post on the mailing lists after reproducing an
> existing bug. We are not there yet, but I saw some posts where others
> were considering building something similar, so I thought I would post
> about our work.
>
> [1] https://github.com/utsaslab/crashmonkey
> [2] http://www.cs.utexas.edu/~vijay/papers/hotstorage17-crashmonkey.pdf
> [3] http://www.cs.utexas.edu/~vijay/papers/hotstorage17-crashmonkey-slides.pdf
>

Hi Vijay,

Thanks a lot for sharing your work.
Crash consistency is high on my TODO list.

Did you happen to run into this work by Josef Bacik?
https://www.redhat.com/archives/dm-devel/2014-November/msg00083.html

I wonder if the disk_wrapper driver could be made into something
more standard, like the suggested dm-power-fail target, so that it
may be proposed for upstream?

Thanks,
Amir.

* Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency
From: Josef Bacik @ 2017-08-15 17:33 UTC
To: Vijay Chidambaram
Cc: linux-ext4, linux-xfs, linux-fsdevel, linux-btrfs, vijay, Ashlie Martinez

On Mon, Aug 14, 2017 at 11:32:02AM -0500, Vijay Chidambaram wrote:
> Hi,
>
> I'm Vijay Chidambaram, an Assistant Professor at the University of
> Texas at Austin. My research group is developing CrashMonkey, a
> file-system agnostic framework to test file-system crash consistency
> on power failures. We are developing CrashMonkey publicly at Github
> [1]. This is very much a work-in-progress, so we welcome feedback.
>
> CrashMonkey works by recording all the IO from running a given
> workload, then *constructing* possible crash states (while honoring
> FUA and FLUSH flags). A crash state is the state of storage after an
> abrupt power failure or crash. For each crash state, CrashMonkey runs
> the filesystem-provided fsck on top of the state, and checks if the
> file-system recovers correctly. Once the file system mounts correctly,
> we can run further tests to check data consistency. The work was
> presented at HotStorage 17. The workshop paper is available at [2] and
> the slides at [3].
>
> Our plan was to post on the mailing lists after reproducing an
> existing bug. We are not there yet, but I saw some posts where others
> were considering building something similar, so I thought I would post
> about our work.
>
> [1] https://github.com/utsaslab/crashmonkey
> [2] http://www.cs.utexas.edu/~vijay/papers/hotstorage17-crashmonkey.pdf
> [3] http://www.cs.utexas.edu/~vijay/papers/hotstorage17-crashmonkey-slides.pdf
>

I did this same work 3 years ago

https://github.com/torvalds/linux/blob/master/Documentation/device-mapper/log-writes.txt
https://github.com/josefbacik/log-writes

I have xfstests patches I need to get upstreamed at some point: one runs
fsstress and then replays the logs and verifies, and another makes fsx
store state so we can verify fsync() is doing the right thing. We run this
on our major releases on xfs, ext4, and btrfs to make sure everything is
working right internally at Facebook. You'll notice a bunch of commits
recently because we thought we found an xfs replay problem (we didn't).
This stuff is actively used, and I'd welcome contributions to it if you
have anything to add. One thing I haven't done yet and have on my list is
to randomly replay writes between flush/fua, but it hasn't been a pressing
priority yet. Thanks,

Josef

* Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency
From: Vijay Chidambaram @ 2017-08-15 18:01 UTC
To: Josef Bacik
Cc: linux-ext4, linux-xfs, linux-fsdevel, linux-btrfs, Ashlie Martinez

Hi Josef and Amir,

Thank you for the replies! We were aware that Josef had proposed
something like this a few years ago [1], but didn't know it was being
currently used inside Facebook. Glad to hear it!

@Josef: Thanks for the link! I think CrashMonkey does what you have in
log-writes, but our goal is to go a bit further:

- We want to test replaying subsets of writes between flush/fua.
Indeed, this is one of the major focus points of the work. Given W1 W2
W3 Flush, CrashMonkey will generate states, (W1), (W1 W3), (W1 W2),
etc. We believe many interesting bugs lie in this space. The problem
is that there are a large number of possible crash states, so we are
working on techniques to find "interesting" crash states. For now, our
plan is to focus on write requests tagged with the META flag.

- We want to aid the users in testing data consistency after a crash.
The plan is that after each crash state, after running fsck, if the
file system mounts, we allow the user to run a number of custom tests.
To help the user figure out what data should be present in the crash
state, we plan to provide functionality that informs the user at which
point the crash occurred (similar to the "mark" functionality in
log-writes, but instead of indicating a single point in the stream, it
would provide a snapshot of fs state)

@Amir: Given that Josef's code is already in the kernel, do you think
changing CrashMonkey code would be useful? We are always happy to
provide something for upstream, but we want to be sure how much work
would be involved.

[1] https://lwn.net/Articles/637079/

Thanks,
Vijay

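To make the "W1 W2 W3 Flush" example above concrete, here is a small
Python sketch that enumerates which writes of an un-flushed epoch could
have reached the disk, and then keeps the candidates that drop a
metadata write. It is only a model of the idea: `Write`, `crash_subsets`
and `drops_metadata` are hypothetical names, the `meta` field is a
stand-in for the META request flag, and "drops a META write" is an
assumed example of an "interesting" filter, not CrashMonkey's actual
heuristic.

```python
from itertools import combinations
from typing import Iterator, List, NamedTuple, Tuple


class Write(NamedTuple):
    name: str
    meta: bool = False  # hypothetical stand-in for a request's META flag


def crash_subsets(epoch: List[Write]) -> Iterator[Tuple[Write, ...]]:
    """Yield every subset of the epoch (empty and full set included),
    keeping the writes in submission order."""
    for size in range(len(epoch) + 1):
        yield from combinations(epoch, size)


def drops_metadata(epoch: List[Write], subset: Tuple[Write, ...]) -> bool:
    """True if at least one META write of the epoch is missing."""
    return any(w.meta and w not in subset for w in epoch)


epoch = [Write("W1", meta=True), Write("W2"), Write("W3")]
states = list(crash_subsets(epoch))
interesting = [s for s in states if drops_metadata(epoch, s)]

for s in states:
    print("(" + " ".join(w.name for w in s) + ")")
print(len(states), "states,", len(interesting), "drop a META write")
```

An epoch of n writes yields 2^n subsets, which is why exhaustive testing
stops being practical for real workloads and some form of filtering or
sampling is needed.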
* Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency
From: Amir Goldstein @ 2017-08-15 20:32 UTC
To: Vijay Chidambaram
Cc: Josef Bacik, Ext4, linux-xfs, linux-fsdevel, linux-btrfs, Ashlie Martinez

On Tue, Aug 15, 2017 at 8:01 PM, Vijay Chidambaram <vvijay03@gmail.com> wrote:
> Hi Josef and Amir,
> ...
>
> @Amir: Given that Josef's code is already in the kernel, do you think
> changing CrashMonkey code would be useful? We are always happy to
> provide something for upstream, but we want to be sure how much work
> would be involved.
>

Simply put, people (myself included) are more likely to use CrashMonkey
if it uses the upstream kernel and/or if it brings valuable functionality
to filesystem testing beyond what log-writes already does.
I have not studied either tool yet, so I cannot determine whether that
is the case.

Cheers,
Amir.

* Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency
From: Vijay Chidambaram @ 2017-08-16 1:44 UTC
To: Amir Goldstein
Cc: Josef Bacik, Ext4, linux-xfs, linux-fsdevel, linux-btrfs, Ashlie Martinez

Hi Amir,

I neglected to mention this earlier: CrashMonkey does not require
recompiling the kernel (it is a stand-alone kernel module), and has
been tested with the kernel 4.4. It should work with future kernel
versions as long as there are no changes to the bio structure.

As it is, I believe CrashMonkey is compatible with the current kernel.
It certainly provides functionality beyond log-writes (the ability to
replay a subset of writes between FLUSH/FUA), and we intend to add
more functionality in the future.

Right now, CrashMonkey does not do random sampling among possible
crash states -- it will simply test a given number of unique states.
Thus, right now I don't think it is very effective in finding
crash-consistency bugs. But the entire infrastructure to profile a
workload, construct crash states, and test them with fsck is present.

I'd be grateful if you could try it and give us feedback on what would
make testing easier/more useful for you. As I mentioned before, this is
a work-in-progress, so we are happy to incorporate feedback.

Thanks,
Vijay Chidambaram,
http://www.cs.utexas.edu/~vijay/

* Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency
From: Josef Bacik @ 2017-08-16 13:06 UTC
To: Vijay Chidambaram
Cc: Amir Goldstein, Josef Bacik, Ext4, linux-xfs, linux-fsdevel, linux-btrfs, Ashlie Martinez, kernel-team

On Tue, Aug 15, 2017 at 08:44:16PM -0500, Vijay Chidambaram wrote:
> Hi Amir,
>
> I neglected to mention this earlier: CrashMonkey does not require
> recompiling the kernel (it is a stand-alone kernel module), and has
> been tested with the kernel 4.4. It should work with future kernel
> versions as long as there are no changes to the bio structure.
>
> As it is, I believe CrashMonkey is compatible with the current kernel.
> It certainly provides functionality beyond log-writes (the ability to
> replay a subset of writes between FLUSH/FUA), and we intend to add
> more functionality in the future.
>
> Right now, CrashMonkey does not do random sampling among possible
> crash states -- it will simply test a given number of unique states.
> Thus, right now I don't think it is very effective in finding
> crash-consistency bugs. But the entire infrastructure to profile a
> workload, construct crash states, and test them with fsck is present.
>
> I'd be grateful if you could try it and give us feedback on what make
> testing easier/more useful for you. As I mentioned before, this is a
> work-in-progress, so we are happy to incorporate feedback.
>

Sorry I was travelling yesterday so I couldn't give this my full attention.
Everything you guys do is already accomplished with dm-log-writes. If you look
at the example scripts I've provided

https://github.com/josefbacik/log-writes/blob/master/replay-individual-faster.sh
https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh

The first initiates the replay, and points at the second script to run after
each entry is replayed. The whole point of this stuff was to make it as
flexible as possible. The way we use it is to replay, create a snapshot of the
replay, mount, unmount, fsck, delete the snapshot and carry on to the next
position in the log.

There is nothing keeping us from generating random crash points; this has been
something on my list of things to do forever. All that would be required would
be to hold the entries between flush/fua events in memory, and then replay them
in whatever order you deemed fit. That's the only functionality missing from my
replay-log stuff that CrashMonkey has.

The other part of this is getting user space applications to do more thorough
checking of the consistency they expect, which I implemented here

https://github.com/josefbacik/fstests/commit/70d41e17164b2afc9a3f2ae532f084bf64cb4a07

fsx will randomly do operations to a file, and every time it calls fsync() it
saves its state and marks the log. Then we can go back and replay the log to
the mark and md5sum the file to make sure it matches the saved state. This
infrastructure was meant to be as simple as possible so the possibilities for
crash consistency testing were endless. One of the next areas where we plan to
use this at Facebook is application consistency, so we can replay the fs and
verify the application works in whatever state the fs is at any given point.

I looked at your code and you are logging entries at submit time, not completion
time. The reason I do those crazy acrobatics is because we have had bugs in
previous kernels where we were not waiting for io completion of important
metadata before writing out the super block, so logging only at completion
allows us to catch that class of problems.

The other thing CrashMonkey is missing is DISCARD support. We fuck up discard
support constantly, and being able to replay discards to make sure we're not
discarding important data is very important.

I'm not trying to shit on your project, obviously it's a good idea, that's why I
did it years ago ;). The community is going to use what is easiest to use, and
modprobe dm-log-writes is a lot easier than compiling and insmod'ing an out of
tree driver. Also your driver won't work on upstream kernels because of the way
the bio flags were changed recently, which is why we prefer using upstream
solutions.

If you guys want to get this stuff used then it would be better at this point to
build on top of what we already have. Just off the top of my head we need

1) Random replay support for replay-log. This is probably a day or two worth of
work for a student.

2) Documentation, because right now I'm the only one who knows how this works.

3) My patches need to actually be pushed into upstream fstests. This would be
the largest win because then all the fs developers would be running the tests
by default.

4) Multi-device support. One thing that would be good to have and is a dream of
mine is to connect multiple devices to one log, so we can do things like verify
mdraid or btrfs RAID consistency. We could do super evil things like
only replay one device, or replay alternating writes on each device. This would
be a larger project but would be super helpful.

Thanks,

Josef

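The mark-and-verify scheme described above (save state at each fsync(),
mark the log, later replay to the mark and compare checksums) can be
modelled in a few lines of Python. This is a toy, in-memory sketch under
loud assumptions: the "file" and the "device" are collapsed into one
byte array, the log is a plain list, and `run_workload` and
`replay_to_mark` are hypothetical names rather than anything from fsx or
the replay-log tool.

```python
import hashlib
from typing import Dict, List, Tuple


def apply_write(image: bytearray, offset: int, data: bytes) -> None:
    """Write `data` at `offset`, zero-extending the image if needed."""
    end = offset + len(data)
    if end > len(image):
        image.extend(b"\0" * (end - len(image)))
    image[offset:end] = data


def run_workload(ops: list) -> Tuple[List[tuple], Dict[str, str]]:
    """Run the ops, logging every write; on "fsync", save the md5 of
    the current contents and append a named mark to the log."""
    image = bytearray()
    log: List[tuple] = []
    saved: Dict[str, str] = {}
    for op in ops:
        if op == "fsync":
            mark = f"fsync-{len(saved)}"
            saved[mark] = hashlib.md5(image).hexdigest()
            log.append(("mark", mark))
        else:
            offset, data = op
            apply_write(image, offset, data)
            log.append(("write", offset, data))
    return log, saved


def replay_to_mark(log: List[tuple], mark: str) -> bytearray:
    """Replay logged writes onto an empty image, stopping at `mark`."""
    image = bytearray()
    for entry in log:
        if entry[0] == "mark":
            if entry[1] == mark:
                return image
        else:
            apply_write(image, entry[1], entry[2])
    raise KeyError(mark)


ops = [(0, b"hello"), "fsync", (3, b"LOWORLD"), "fsync", (0, b"J")]
log, saved = run_workload(ops)
for mark, digest in saved.items():
    replayed = replay_to_mark(log, mark)
    assert hashlib.md5(replayed).hexdigest() == digest, mark
print("verified marks:", list(saved))
```

In the real setup the log is a block-level log and the checksum is taken
on the file after mounting the replayed device, so a mismatch would point
at the filesystem's fsync() path rather than at the replay itself.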
* Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency
From: Vijay Chidambaram @ 2017-08-16 19:06 UTC
To: Josef Bacik
Cc: Amir Goldstein, Ext4, linux-xfs, linux-fsdevel, linux-btrfs, Ashlie Martinez, kernel-team

Hi Josef,

Thank you for the detailed reply -- I think it provides several
pointers for our future work. It sounds like we have a similar vision
for where we want this to go, though we may disagree about how to
implement this :) This is exciting!

I agree that we should be building off existing work if it is a good
option. We might end up using log-writes, but for now we see several
problems:

- The log-writes code is not documented well. As you have mentioned,
at this point, only you know how it works, and we are not seeing much
adoption of log-writes by other developers either.

- I don't think our requirements exactly match what log-writes
provides. For example, at some point we want to introduce checkpoints
so that we can correlate a crash state with file-system state at the
time of crash. We also want to add functionality to guide creation of
random crash states (see below). This might require changing
log-writes significantly. I don't know if that would be a good idea.

Regarding random crashes, there is a lot of complexity there that
log-writes couldn't handle without significant changes. For example,
just randomly generating crash states and testing each state is
unlikely to catch bugs. We need a more nuanced way of doing this. We
plan to add a lot of functionality to CrashMonkey to (a) let the user
guide crash-state generation (b) focus on "interesting" states (by
re-ordering or dropping metadata). All of this will likely require
adding more sophistication to the kernel module. I don't think we want
to take log-writes and add a lot of extra functionality.

Regarding logging writes, I think there is a difference in approach
between log-writes and CrashMonkey. We don't really care about the
completion order since the device may anyway re-order the writes after
that point. Thus, the set of crash states generated by CrashMonkey is
bound only by FUA and FLUSH flags. It sounds as if log-writes focuses
on a more restricted set of crash states.

CrashMonkey works with the 4.4 kernel, and we will try and keep up
with changes to the kernel that break CrashMonkey. CrashMonkey is
useless without the user-space component, so users will need to
compile some code anyway. I do not believe it will matter much whether
it is in-tree or not, as long as it compiles with the latest kernel.

Regarding discard, multi-device support, and application-level crash
consistency, this is on our road-map too! Our current priority is to
build enough scaffolding to reproduce a known crash-consistency bug
(such as the delayed allocation bug of ext4), and then go on and try
to find new bugs in newer file systems like btrfs.

Adding CrashMonkey into the kernel is not a priority at this point (I
don't think CrashMonkey is useful enough at this point to do so). When
CrashMonkey becomes useful enough to do so, we will perhaps add the
device_wrapper as a DM target to enable adoption.

Our hope currently is that developers like Amir will try out
CrashMonkey in its current form, which will guide us as to what
functionality to add to CrashMonkey to find bugs more effectively.

Thanks,
Vijay

* Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency 2017-08-16 19:06 ` Vijay Chidambaram @ 2017-08-16 20:27 ` Amir Goldstein 2017-08-16 20:36 ` Vijay Chidambaram 0 siblings, 1 reply; 10+ messages in thread From: Amir Goldstein @ 2017-08-16 20:27 UTC (permalink / raw) To: Vijay Chidambaram Cc: Josef Bacik, Ext4, linux-xfs, linux-fsdevel, Linux Btrfs, Ashlie Martinez, kernel-team

On Wed, Aug 16, 2017 at 10:06 PM, Vijay Chidambaram <vvijay03@gmail.com> wrote:
> Hi Josef,
>
> Thank you for the detailed reply -- I think it provides several
> pointers for our future work. It sounds like we have a similar vision
> for where we want this to go, though we may disagree about how to
> implement this :) This is exciting!
>
> I agree that we should be building off existing work if it is a good
> option. We might end up using log-writes, but for now we see several
> problems:
>
> - The log-writes code is not documented well. As you have mentioned,
> at this point, only you know how it works, and we are not seeing a lot
> of adoption by other developers of log-writes as well.
>
> - I don't think our requirements exactly match what log-writes
> provides. For example, at some point we want to introduce checkpoints
> so that we can co-relate a crash state with file-system state at the
> time of crash. We also want to add functionality to guide creation of
> random crash states (see below). This might require changing
> log-writes significantly. I don't know if that would be a good idea.
>
> Regarding random crashes, there is a lot of complexity there that
> log-writes couldn't handle without significant changes. For example,
> just randomly generating crash states and testing each state is
> unlikely to catch bugs. We need a more nuanced way of doing this. We
> plan to add a lot of functionality to CrashMonkey to (a) let the user
> guide crash-state generation (b) focus on "interesting" states (by
> re-ordering or dropping metadata). All of this will likely require
> adding more sophistication to the kernel module. I don't think we want
> to take log-writes and add a lot of extra functionality.
>
> Regarding logging writes, I think there is a difference in approach
> between log-writes and CrashMonkey. We don't really care about the
> completion order since the device may anyway re-order the writes after
> that point. Thus, the set of crash states generated by CrashMonkey is
> bound only by FUA and FLUSH flags. It sounds as if log-writes focuses
> on a more restricted set of crash states.
>
> CrashMonkey works with the 4.4 kernel, and we will try and keep up
> with changes to the kernel that breaks CrashMonkey. CrashMonkey is
> useless without the user-space component, so users will be needing to
> compile some code anyway. I do not believe it will matter much whether
> it is in-tree or not, as long as it compiles with the latest kernel.
>
> Regarding discard, multi-device support, and application-level crash
> consistency, this is on our road-map too! Our current priority is to
> build enough scaffolding to reproduce a known crash-consistency bug
> (such as the delayed allocation bug of ext4), and then go on and try
> to find new bugs in newer file systems like btrfs.
>
> Adding CrashMonkey into the kernel is not a priority at this point (I
> don't think CrashMonkey is useful enough at this point to do so). When
> CrashMonkey becomes useful enough to do so, we will perhaps add the
> device_wrapper as a DM target to enable adoption.
>
> Our hope currently is that developers like Ari will try out
> CrashMonkey in its current form, which will guide us as to what
> functionality to add to CrashMonkey to find bugs more effectively.
>

Vijay,

I can only speak for myself, but I think I represent other filesystem
developers with this response:

- Often with competing projects, the end result is best when project
  members cooperate to combine the best of both projects.
- Some of your project goals (e.g. user guided crash states) sound very
  intriguing.
- IMO you are severely underestimating the benefits of mainlined kernel
  code for other developers. If you find that the dm-log-writes target
  is lacking functionality, it would be MUCH better if you worked to
  improve it. Better still, make sure that your userspace tools can also
  work with the reduced functionality in the mainline kernel.
- If you choose to complete your academic research before crossing over
  to the existing code base, that is a reasonable choice for you to make,
  but the reasonable choice for me to make is to try Josef's tools from
  his repo (even if not documented), and *only* if they don't meet my
  needs would I make the extra effort to try out CrashMonkey.
- AFAIK the state of filesystem crash consistency testing tools is not
  so bright (except maybe at Facebook ;) ), so my priority is to get
  *some* automated testing tools in motion.

In any case, I'm glad this discussion started and I hope it will expedite
the adoption of crash testing tools.
I wish you all the best with your project.

Amir.

^ permalink raw reply [flat|nested] 10+ messages in thread
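The claim quoted above, that the set of crash states is bound only by FUA and FLUSH flags rather than by completion order, can be illustrated with a small model. This is only a sketch under stated assumptions, not CrashMonkey's or log-writes' actual code: the `Write` record, its `flush` and `fua` fields, and both function names are hypothetical, and the model assumes each write completes before the next one is submitted. A write carrying `flush` makes every earlier write durable before it; a write carrying `fua` is durable as soon as it completes; any other write submitted since the last flush may or may not have reached the media at crash time.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Write:
    """Hypothetical record of one logged block write (not a real kernel struct)."""
    block: int
    data: str
    flush: bool = False  # assumed: all earlier writes are durable before this one
    fua: bool = False    # assumed: this write is durable once it completes


def split_epochs(log):
    """Split the log into epochs; a write carrying FLUSH starts a new epoch."""
    epochs, current = [], []
    for w in log:
        if w.flush and current:
            epochs.append(current)
            current = []
        current.append(w)
    if current:
        epochs.append(current)
    return epochs


def crash_states(log):
    """Return the distinct disk images (block -> data) a crash could leave behind."""
    durable = {}   # contents guaranteed on disk by earlier, completed epochs
    seen = set()
    states = []
    for epoch in split_epochs(log):
        # The crash may happen after any number of this epoch's writes were submitted.
        for crash_point in range(len(epoch) + 1):
            submitted = epoch[:crash_point]
            # Non-FUA writes are optional: each may or may not have been persisted.
            optional = [i for i, w in enumerate(submitted) if not w.fua]
            for k in range(len(optional) + 1):
                for chosen in combinations(optional, k):
                    state = dict(durable)
                    for i, w in enumerate(submitted):
                        if w.fua or i in chosen:
                            state[w.block] = w.data
                    key = tuple(sorted(state.items()))
                    if key not in seen:
                        seen.add(key)
                        states.append(state)
        # Once the next flush is issued, the whole epoch is durable.
        for w in epoch:
            durable[w.block] = w.data
    return states


if __name__ == "__main__":
    log = [Write(0, "a"), Write(1, "b"), Write(2, "c", flush=True)]
    for state in crash_states(log):
        print(state)
```

In this model the third write can never appear on disk without the first two, because its flush forces them out first, while the first two may be persisted in any combination. The enumeration is exponential in the number of unflushed writes, which is one way to see why purely random or exhaustive generation is impractical and why guided selection of "interesting" states is attractive.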
* Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency 2017-08-16 20:27 ` Amir Goldstein @ 2017-08-16 20:36 ` Vijay Chidambaram 0 siblings, 0 replies; 10+ messages in thread From: Vijay Chidambaram @ 2017-08-16 20:36 UTC (permalink / raw) To: Amir Goldstein Cc: Josef Bacik, Ext4, linux-xfs, linux-fsdevel, Linux Btrfs, Ashlie Martinez, kernel-team

Amir,

That's a fair response. I certainly did not mean to add more work on
your end :) Using dm-log-writes for now is a reasonable approach.

Like I mentioned before, I think there is further work involved in
getting CrashMonkey to a useful point (where it finds at least known
bugs). Once this is done, I'd be happy to rework the device_wrapper as
a DM target (or perhaps as a modification of log-writes) for upstream.
I'm not sure how feasible it would be to keep functionality in-kernel
simple, but we will try our best.

We will keep this goal in mind as we continue development, so that we
don't make any decisions that will prevent us from going the DM target
route later.

Thanks,
Vijay

^ permalink raw reply [flat|nested] 10+ messages in thread
Thread overview: 10+ messages
2017-08-14 16:32 CrashMonkey: A Framework to Systematically Test File-System Crash Consistency Vijay Chidambaram
2017-08-15 17:13 ` Amir Goldstein
2017-08-15 17:33 ` Josef Bacik
2017-08-15 18:01 ` Vijay Chidambaram
2017-08-15 20:32 ` Amir Goldstein
2017-08-16 1:44 ` Vijay Chidambaram
2017-08-16 13:06 ` Josef Bacik
2017-08-16 19:06 ` Vijay Chidambaram
2017-08-16 20:27 ` Amir Goldstein
2017-08-16 20:36 ` Vijay Chidambaram