Date: Sun, 19 Oct 2008 21:30:25 +0200
From: Jens Axboe
To: Avi Kivity
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier,
 qemu-devel@nongnu.org, Ryan Harper
Subject: Re: [Qemu-devel] [RFC] Disk integrity in QEMU
Message-ID: <20081019193024.GX19428@kernel.dk>
In-Reply-To: <48FB865B.60906@redhat.com>
References: <48EF1D55.7060307@redhat.com> <48F0E83E.2000907@redhat.com>
 <48F10DFD.40505@codemonkey.ws> <48F1CD76.2000203@redhat.com>
 <20081017132040.GK19428@kernel.dk> <48FAF751.8010806@redhat.com>
 <20081019181026.GU19428@kernel.dk> <48FB7B7A.4050008@redhat.com>
 <20081019183642.GV19428@kernel.dk> <48FB865B.60906@redhat.com>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org

On Sun, Oct 19 2008, Avi Kivity wrote:
> Jens Axboe wrote:
> > On Sun, Oct 19 2008, Avi Kivity wrote:
> >
> >> Jens Axboe wrote:
> >>
> >>>> Sounds like a bug. Shouldn't Linux disable the write cache unless the
> >>>> user explicitly enables it, if NCQ is available? NCQ should provide
> >>>> acceptable throughput even without the write cache.
> >>>>
> >>> How can it be a bug?
> >>>
> >> If it puts my data at risk, it's a bug. I can understand it for IDE,
> >> but not for SATA with NCQ.
> >>
> > Then YOU turn it off. Other people would consider the lousy performance
> > to be the bigger problem. See policy :-)
> >
> If I get lousy performance, I can turn on the write cache and ignore the
> risk of data loss. If I lose my data, I can't turn off the write cache
> and get my data back.
>
> (it seems I can't turn off the write cache even without losing my data:
>
> [avi@firebolt ~]$ sudo sdparm --set=WCE=0 /dev/sd[ab]
>     /dev/sda: ATA       WDC WD3200YS-01P  21.0
> change_mode_page: failed setting page: Caching (SBC)
>     /dev/sdb: ATA       WDC WD3200YS-01P  21.0
> change_mode_page: failed setting page: Caching (SBC)

Use hdparm, it's an ATA drive even if Linux currently uses the scsi
layer for it. Or use sysfs, there's a "cache_type" attribute in the
scsi disk sysfs directory.
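Something like the following ought to do it (untested here, and the
0:0:0:0 address is just a placeholder for whatever your disk shows up
as under /sys/class/scsi_disk):

  # ATA pass-through: disable the write cache (-W1 turns it back on)
  $ sudo hdparm -W0 /dev/sda

  # or via the scsi disk sysfs attribute; it accepts the strings
  # "write through" and "write back"
  $ cat /sys/class/scsi_disk/0:0:0:0/cache_type
  write back
  $ echo "write through" | sudo tee /sys/class/scsi_disk/0:0:0:0/cache_type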
> >>> Changing the cache policy of a drive would be a
> >>> policy decision in the kernel,
> >>>
> >> If you don't want this in the kernel, then the system as a whole should
> >> default to being safe. Though in this case I think it is worthwhile to
> >> do this in the kernel.
> >>
> > Doesn't matter how you turn this, it's still a policy decision. Leave it
> > to the user. It's not exactly a new turn of events, commodity drives
> > have shipped with write caching on forever. What if the drive has a
> > battery backing?
>
> If the drive has a battery backup, I'd argue it should report it as a
> write-through cache.

I'm not a drive manufacturer though. You could argue that, but it could
influence other decision making. FWIW, we've discussed this very issue
for YEARS, reiterating the debate here isn't likely to change much...

> > What if the user has a UPS?
> >
> They should enable the write-back cache if they trust the UPS. Or maybe
> the system should do that automatically if it's aware of the UPS.
>
> "Policy" doesn't mean you shouldn't choose good defaults.

Changing the hardware settings for this kind of behaviour IS most
certainly policy.

> >>> that is never the right thing to do.
> >>> There's no such thing as 'acceptable throughput',
> >>>
> >> I meant that performance is not completely destroyed.
> >>
> > How do you know it's not destroyed? Depending on your workload, it may
> > very well be dropping your throughput by orders of magnitude.
> >
> I guess this is the crux. According to my understanding, you shouldn't
> see such a horrible drop, unless the application does synchronous writes
> explicitly, in which case it is probably worried about data safety.

Then you need to adjust your understanding, because you definitely will
see a big drop in performance.

> >> How can you even compare data safety to some percent of performance?
> >>
> > I'm not, what I'm saying is that different people will have different
> > opinions on what is most important. Do note that the window of
> > corruption is really small and requires power loss to trigger. So for
> > most desktop users, the tradeoff is actually sane.
> >
> I agree that the window is very small, and that by eliminating software
> failures we get rid of the major source of data loss. What I don't know
> is what the performance tradeoff looks like (and I can't measure since
> my drives won't let me turn off the cache for some reason).
>
> >>> Additionally, write back caching is perfectly safe, if used
> >>> with a barrier enabled file system in Linux.
> >>>
> >> Not all Linux filesystems are barrier enabled, AFAIK. Further, barriers
> >> don't help with O_DIRECT (right?).
> >>
> > O_DIRECT should just use FUA writes, they are safe with write back
> > caching. I'm actually testing such a change just to gauge the
> > performance impact.
> >
> You mean, this is not in mainline yet?

It isn't.

> So, with this, plus barrier support for metadata and O_SYNC writes, the
> write-back cache should be safe?

Yes, and fsync() as well, provided the fs does a flush there too.
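Concretely, that's the kind of setup I mean (device name and mount
point below are made up, adjust to taste):

  # barriers: on by default with XFS these days, ext3 wants it spelled
  # out as a mount option
  $ sudo mount -o barrier=1 /dev/sdb1 /mnt/test

  # direct IO writes with an explicit flush of the file at the end;
  # conv=fsync makes dd call fsync() before it exits
  $ dd if=/dev/zero of=/mnt/test/foo bs=4k count=10000 oflag=direct conv=fsync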
> Some googling shows that Windows XP introduced FUA for O_DIRECT and
> metadata writes as well.

There's a lot of other background information to understand to gauge
the impact of using eg FUA for O_DIRECT in Linux as well. MS basically
wrote the FUA for ATA proposal, and the original usage pattern (as far
as I remember) was indeed meta data. Hence it also imposes a priority
boost in most (all?) drive firmwares, since it's deemed important. So
just using FUA vs non-FUA is likely to impact performance of other
workloads in fairly unknown ways. FUA on non-queuing drives will also
likely suck for performance, since you're basically going to be blowing
a drive rev for each IO. And that hurts.

> >> I shouldn't need a disk array to run a database.
> >>
> > You are free to turn off write back caching!
> >
> What about the users who aren't on qemu-devel?

It may be news to you, but it has been debated on lkml in the past as
well. Not even that long ago, and I'd be surprised if lwn didn't run
some article on it as well.

I agree it's important information, but realize that until just
recently most people didn't really consider it a likely scenario in
practice... I wrote and committed the original barrier implementation
in Linux in 2001, and just this year XFS made it a default mount
option. After the recent debacle on this on lkml, ext4 made it the
default as well. So let me turn it around a bit - if this issue really
did hit lots of people out there in real life, don't you think there
would have been more noise about it and we would have made this the
default years ago? So while we both agree it's a risk, it's not a
huuuge risk...

> However, with your FUA change, they should be safe.

Yes, that would make O_DIRECT safe always. Except when it falls back to
buffered IO, whoops...

> >> Most desktop workloads use writeback cache, so write performance is not
> >> critical.
> >>
> > Ehm, how do you reach that conclusion based on that statement?
> >
> Any write latency is buffered by the kernel. Write speed is main memory
> speed. Disk speed only bubbles up when memory is tight.

That's a nice theory; in practice it's completely wrong. You end up
waiting on writes for LOTS of other reasons!

> >> However I'd hate to see my data destroyed by a power failure, and
> >> today's large caches can hold a bunch of data.
> >>
> > Then you use barriers or turn write back caching off, simple as that.
> >
> I will (if I figure out how) but there may be one or two users who
> haven't read the scsi spec yet.

A newish hdparm should work, or the sysfs attribute. hdparm will
pass-through the real ata command to do this, while the sysfs approach
(and sdparm) requires MODE_SENSE and MODE_SELECT transformation of that
page.

> Or more correctly, I am revising my opinion of the write back cache
> since even when it is enabled, it is completely optional. Instead of
> disabling the write back cache we should use FUA and barriers, and
> since you are going to be working on FUA, it looks like this will be
> resolved soon without performance/correctness compromises.

Let's see how the testing goes :-) Possibly just enabling FUA O_DIRECT
with barriers, that'll likely be a good default.

-- 
Jens Axboe