From mboxrd@z Thu Jan  1 00:00:00 1970
From: Josef Bacik <jbacik@fb.com>
Subject: Re: [LSF/MM TOPIC] Working towards better power fail testing
Date: Tue, 13 Jan 2015 12:17:22 -0500
Message-ID: <54B55322.5030002@fb.com>
References: <5486221D.6000006@fb.com> <87r3uy3931.fsf@openvz.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Cc: <linux-fsdevel@vger.kernel.org>
To: Dmitry Monakhov <dmonlist@gmail.com>,
	<lsf-pc@lists.linux-foundation.org>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:48830 "EHLO
	mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1752665AbbAMRR2 (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Tue, 13 Jan 2015 12:17:28 -0500
In-Reply-To: <87r3uy3931.fsf@openvz.org>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On 01/13/2015 12:05 PM, Dmitry Monakhov wrote:
> Josef Bacik <jbacik@fb.com> writes:
>
>> Hello,
>>
>> We have been doing pretty well at populating xfstests with loads of
>> tests to catch regressions and validate we're all working properly.  One
>> thing that has been lacking is a good way to verify file system
>> integrity after a power fail.  This is a core part of what file systems
>> are supposed to provide but it is probably the least tested aspect.  We
>> have dm-flakey tests in xfstests to test fsync correctness, but these
>> tests do not catch the random horrible things that can go wrong.  We are
>> still finding horrible scary things that go wrong in Btrfs because they
>> are simply hard to reproduce and test for.
>>
>> I have been working on an idea to do this better, some may have seen my
>> dm-power-fail attempt, and I've got a new incarnation of the idea thanks
>> to discussions with Zach Brown.  Obviously there will be a lot changing
>> in this area in the time between now and March but it would be good to
>> have everybody in the room talking about what they would need to build a
>> good and deterministic test to make sure we're always giving a
>> consistent file system and to make sure our fsync() handling is working
>> properly.  Thanks,
> I submitted generic/019 a long time ago. The test works and has helped
> uncover several bugs, but it is not ideal because the current power failure
> simulation (via fail_make_request) is not completely atomic.
> So I would like to join the discussion on how we can make the power
> failure simulation completely atomic.
>

Yeah, I did the first dm-flakey tests and extended them some.  These are 
good baselines, but I've recently hit a few bugs in btrfs that would only 
reproduce if we crashed at exactly the right spot, and that is what I 
want to build for: something that can run through all the possible crash 
scenarios to make sure we're always leaving a consistent fs.
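
To make "run through all the possible crash scenarios" a bit more concrete, 
here is a toy sketch in Python.  It is not the dm-power-fail code or any 
existing tool; the block-dict "disk", the commit record in block 0, and every 
function name are assumptions made up purely for illustration.  The idea it 
shows is: log the writes in the order they were issued, then pretend to crash 
after each one and check whatever landed on disk.

```python
# Toy model, all names hypothetical: a "disk" is a dict of
# block number -> contents, and the log is the ordered list of
# block writes the file system issued.

def replay_prefix(log, n):
    """Return the disk state if only the first n logged writes landed."""
    disk = {}
    for block, data in log[:n]:
        disk[block] = data
    return disk

def is_consistent(disk):
    """Stand-in for fsck: the commit record in block 0 may only name
    blocks that actually reached the disk."""
    commit = disk.get(0)
    if commit is None:
        return True  # nothing committed yet, an empty fs is fine
    return all(block in disk for block in commit)

def find_bad_crash_points(log):
    """Crash after every write in turn; return the points that fail."""
    return [n for n in range(len(log) + 1)
            if not is_consistent(replay_prefix(log, n))]

# Correct ordering: data blocks first, commit record last.
good_log = [(1, b"data"), (2, b"data"), (0, (1, 2))]
# Buggy ordering: the commit record is written before block 2.
bad_log = [(1, b"data"), (0, (1, 2)), (2, b"data")]

print(find_bad_crash_points(good_log))
print(find_bad_crash_points(bad_log))
```

With the buggy ordering, the only failing crash point is the one right after 
the commit record lands and before block 2 does, which is the kind of 
"exactly the right spot" window that is hard to hit by crashing at random.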

> BTW I would also like to share the hw-flush utility (which our QA team uses
> for power-fail/SSD-cache testing) and a harness for it.
>

That would be super cool.  The more testing we have around making sure 
we're waiting for stuff properly and flushing caches properly, the 
better.  Thanks,

Josef