From mboxrd@z Thu Jan 1 00:00:00 1970
From: Josef Bacik
Subject: Re: [LSF/MM TOPIC] Working towards better power fail testing
Date: Tue, 13 Jan 2015 12:17:22 -0500
Message-ID: <54B55322.5030002@fb.com>
References: <5486221D.6000006@fb.com> <87r3uy3931.fsf@openvz.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Cc:
To: Dmitry Monakhov ,
Return-path:
Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:48830 "EHLO
 mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1752665AbbAMRR2 (ORCPT ); Tue, 13 Jan 2015 12:17:28 -0500
In-Reply-To: <87r3uy3931.fsf@openvz.org>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On 01/13/2015 12:05 PM, Dmitry Monakhov wrote:
> Josef Bacik writes:
>
>> Hello,
>>
>> We have been doing pretty well at populating xfstests with loads of
>> tests to catch regressions and validate we're all working properly. One
>> thing that has been lacking is a good way to verify file system
>> integrity after a power fail. This is a core part of what file systems
>> are supposed to provide but it is probably the least tested aspect. We
>> have dm-flakey tests in xfstests to test fsync correctness, but these
>> tests do not catch the random horrible things that can go wrong. We are
>> still finding horrible scary things that go wrong in Btrfs because it is
>> simply hard to reproduce and test for.
>>
>> I have been working on an idea to do this better, some may have seen my
>> dm-power-fail attempt, and I've got a new incarnation of the idea thanks
>> to discussions with Zach Brown. Obviously there will be a lot changing
>> in this area in the time between now and March but it would be good to
>> have everybody in the room talking about what they would need to build a
>> good and deterministic test to make sure we're always giving a
>> consistent file system and to make sure our fsync() handling is working
>> properly.
>> Thanks,
> I've submitted generic/019 long time ago. Test is fine and helps to
> uncover several bugs, but it is not ideal because currently power failure
> simulation (via fail_make_request) is not completely atomic
> So I would like to attend to discussion how we can implement power
> failure simulation completely atomic.
>

Yeah, I did the first dm-flakey tests and extended them some. These are
good baselines, but I've hit a few bugs recently in btrfs that would have
required us to crash at exactly the right spot to hit, which is what I
want to try and build for: something we can run through all the possible
crash scenarios to make sure we're always leaving a consistent fs.

> BTW I also would like to share hw-flush utility (which our QA team use for
> power-fail/SSD-cache testing) and harness for it.
>

That would be super cool, the more testing we can have around making sure
we're waiting for stuff properly and flushing caches properly the better.
Thanks,

Josef