From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ric Wheeler <ric@emc.com>
Subject: Re: XFS corruption during power-blackout
Date: Fri, 01 Jul 2005 09:57:48 -0400
Message-ID: <42C54BDC.6000206@emc.com>
References: <20050629001847.GB850@frodo> <200506290453.HAA14576@raad.intranet> <556815.441dd7d1ebc32b4a80e049e0ddca5d18e872c6e8a722b2aefa7525e9504533049d801014.ANY@taniwha.stupidest.org> <42C4FC14.7070402@slaphack.com> <20050701092412.GD2243@suse.de> <20050701131950.GA15180@ime.usp.br>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1;
	format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: linux-kernel@vger.kernel.org, Brett Russ <russb@emc.com>,
	linux-fsdevel@vger.kernel.org
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mailhub.lss.emc.com ([168.159.2.31]:57260 "EHLO
	mailhub.lss.emc.com") by vger.kernel.org with ESMTP id S263348AbVGAN7A
	(ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
	Fri, 1 Jul 2005 09:59:00 -0400
To: =?ISO-8859-1?Q?Rog=E9rio_Brito?= <rbrito@ime.usp.br>
In-Reply-To: <20050701131950.GA15180@ime.usp.br>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

Rogério Brito wrote:

>On Jul 01 2005, Jens Axboe wrote:
>
>>On Fri, Jul 01 2005, David Masover wrote:
>>
>>>Not always possible.  Some disks lie and leave caching on anyway.
>>>
>>And the same (and others) disks will not honor a flush anyways.
>>Moral of that story - avoid bad hardware.
>>
>
>But how does the end-user know what hardware is "good hardware"? Which
>vendors don't lie (or, at least, lie less than others) regarding HDs?
>
>
>Thanks, Rogério Brito.
>
The only real way is to test the drive (and retest whenever you get a
new version of the firmware) along with the whole fsync -> write
barrier code path.

We use a bus analyzer to make sure that when you fsync() a file, you
will see a cache flush command coming across the bus. Of course, that is
the easy step ;-)
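
For reference, a minimal sketch of the application side of that code
path, in Python. The helper name write_durably is made up for
illustration. Note that os.fsync() only asks the kernel to push the data
to stable storage; whether a cache flush command actually reaches the
drive depends on the kernel, the file system and the drive, which is
exactly what the bus analyzer is there to check.

```python
import os

def write_durably(path: str, data: bytes) -> None:
    """Write data to path and return only after fsync() has returned."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to flush this file's data to stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)
```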

The second step is to test your system across power failures.  We have a
"wbtest" program that we have used to catch bugs. The basic idea is to
write a file to a disk with the cache turned off, write the same file to
the disk with the write barrier (and working cache flush command) and
then randomly drop power to the box.  It is important to really drop
power to the whole box since a "reset button" push often does not drop
power to the drives and will give you false passes.
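
A rough sketch of this kind of power-fail test, in Python. This is not
the wbtest program itself, which is not shown in this mail, and it is
simpler than the two-file comparison described above: it writes
checksummed, sequence-numbered records, calls fsync() after each one,
and reports the sequence number only once fsync() has returned. The
record size, the record layout, the function names and the report
callback are all assumptions made for illustration; report is expected
to deliver the number somewhere that survives the power drop (another
host, a serial console).

```python
import os
import struct
import zlib

RECORD_SIZE = 4096               # assumed: one record per 4 KiB block
HEADER = struct.Struct("<QI")    # sequence number, CRC32 of the payload

def make_record(seq: int) -> bytes:
    """Build a self-describing record whose payload depends on seq."""
    body_len = RECORD_SIZE - HEADER.size
    payload = (struct.pack("<Q", seq) * (body_len // 8)).ljust(body_len, b"\0")
    return HEADER.pack(seq, zlib.crc32(payload)) + payload

def run_writer(path, count, report):
    """Write count records in order; fsync after each, then report it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for seq in range(count):
            if os.pwrite(fd, make_record(seq), seq * RECORD_SIZE) != RECORD_SIZE:
                raise OSError("short write")
            os.fsync(fd)
            # Only now may the record be counted as acknowledged.
            report(seq)
    finally:
        os.close(fd)

def count_valid_records(path):
    """Return how many leading records are intact and in sequence."""
    valid = 0
    with open(path, "rb") as f:
        while True:
            record = f.read(RECORD_SIZE)
            if len(record) < RECORD_SIZE:
                break
            seq, crc = HEADER.unpack_from(record)
            if seq != valid or crc != zlib.crc32(record[HEADER.size:]):
                break
            valid += 1
    return valid

def drive_passed(path, last_reported_seq):
    """True if every record acknowledged before the power drop is on disk."""
    return count_valid_records(path) > last_reported_seq
```

To use it, run run_writer on the box under test, drop power at a random
moment, and after reboot call drive_passed with the last sequence number
that was received off-box. A False result means an acknowledged write
did not make it to the platter.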

Our wbtest used to be good at finding holes in the write barrier code
using 2.4 kernels and PATA drives, but we have had no luck yet in
catching known bugs with this test on 2.6 with S-ATA drives.

Ideas on how to get a more effective test are welcome - it is a very
small window that you need to hit to catch a misbehaving drive (i.e.,
your write cache flush command has returned, you drop power, and on
reboot you validate that the platter contains that last IO correctly).
If you had enough NVRAM in a test system, you might be able to
substitute an NVRAM-backed file system for the write-cache-disabled
drive and get closer to catching the window.

The alternative is to either run with the write cache disabled (again,
you will need to validate that the drive really disabled the cache) or
to buy a mid-range or better storage array that provides a non-volatile
(battery backed) write cache.

