From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rusty Russell
Subject: Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Date: Thu, 6 May 2010 15:35:19 +0930
Message-ID: <201005061535.20619.rusty@rustcorp.com.au>
References: <20100218222220.GA14847@redhat.com> <201005051428.41735.rusty@rustcorp.com.au> <20100505160343.264fd015@notabene.brown>
Mime-Version: 1.0
Content-Type: Text/Plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Jamie Lokier, Jens Axboe, tytso@mit.edu, kvm@vger.kernel.org, "Michael S. Tsirkin", qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org, hch@lst.de
To: Neil Brown
Return-path:
Received: from ozlabs.org ([203.10.76.45]:44097 "EHLO ozlabs.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750880Ab0EFGF0 convert rfc822-to-8bit (ORCPT ); Thu, 6 May 2010 02:05:26 -0400
In-Reply-To: <20100505160343.264fd015@notabene.brown>
Sender: kvm-owner@vger.kernel.org
List-ID:

On Wed, 5 May 2010 03:33:43 pm Neil Brown wrote:
> On Wed, 5 May 2010 14:28:41 +0930
> Rusty Russell wrote:
>
> > On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote:
> > > Jens Axboe wrote:
> > > > On Tue, May 04 2010, Rusty Russell wrote:
> > > > > ISTR someone mentioning a desire for such an API years ago, so CC'ing the
> > > > > usual I/O suspects...
> > > >
> > > > It would be nice to have a fuller API for this, but the reality is
> > > > that only the flush approach is really workable. Even just strict
> > > > ordering of requests could only be supported on SCSI, and even there the
> > > > kernel still lacks proper guarantees on error handling to prevent
> > > > reordering there.
> > >
> > > There's a few I/O scheduling differences that might be useful:
> > >
> > > 1. The I/O scheduler could freely move WRITEs before a FLUSH but not
> > >    before a BARRIER. That might be useful for time-critical WRITEs,
> > >    and those issued by high I/O priority.
> >
> > This is only because no one actually wants flushes or barriers, though
> > I/O people seem to only offer that. We really want " must
> > occur before ". That offers maximum choice to the I/O subsystem
> > and potentially to smart (virtual?) disks.
> >
> > > 2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is
> > >    only for data belonging to a particular file (e.g. fdatasync with
> > >    no file size change, even on btrfs if O_DIRECT was used for the
> > >    writes being committed). That would entail tagging FLUSHes and
> > >    WRITEs with a fs-specific identifier (such as inode number), opaque
> > >    to the scheduler which only checks equality.
> >
> > This is closer. In userspace I'd be happy with a "all prior writes to this
> > struct file before all future writes". Even if the original guarantees were
> > stronger (ie. inode basis). We currently implement transactions using 4
> > fsync/msync pairs.
> >
> >	write_recovery_data(fd);
> >	fsync(fd);
> >	msync(mmap);
> >	write_recovery_header(fd);
> >	fsync(fd);
> >	msync(mmap);
> >	overwrite_with_new_data(fd);
> >	fsync(fd);
> >	msync(mmap);
> >	remove_recovery_header(fd);
> >	fsync(fd);
> >	msync(mmap);
>
> Seems over-zealous.
> If the recovery_header held a strong checksum of the recovery_data you would
> not need the first fsync, and as long as you have two places to write recovery
> data, you don't need the 3rd and 4th syncs.
> Just:
>	write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
>	fsync / msync
>	overwrite_with_new_data()
>
> To recover you choose the most recent log_space and replay the content.
> That may be a redundant operation, but that is no loss.

I think you missed a checksum for the new data? Otherwise we can't tell if
the new data is completely written. But yes, I will steal this scheme for
TDB2, thanks!

In practice, it's the first sync which is glacial, the rest are pretty cheap.
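
Neil's single-sync scheme can be sketched roughly as follows. This is an illustrative sketch only, not TDB's actual code: `struct log_header`, `log_write()`, `log_read()` and the FNV-1a checksum are all hypothetical names and choices made for the example (Neil asks for a *strong* checksum; FNV-1a is used here only to keep the sketch short). Only `pread()`, `pwrite()` and `fsync()` are real POSIX calls.

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical on-disk record: this header, followed by the payload bytes. */
struct log_header {
    uint64_t seq;     /* higher value = more recent record */
    uint64_t target;  /* file offset the payload will later overwrite */
    uint32_t len;     /* payload length in bytes */
    uint32_t csum;    /* checksum over seq, target, len and payload */
};

/* FNV-1a step; a stand-in for the strong checksum the scheme really wants. */
static uint32_t fnv1a(uint32_t h, const void *buf, size_t n)
{
    const unsigned char *p = buf;
    while (n--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

static uint32_t record_csum(const struct log_header *h, const void *payload)
{
    uint32_t c = 2166136261u;
    c = fnv1a(c, &h->seq, sizeof h->seq);
    c = fnv1a(c, &h->target, sizeof h->target);
    c = fnv1a(c, &h->len, sizeof h->len);
    return fnv1a(c, payload, h->len);
}

/* Write header and payload to an unused log slot, then sync once.
 * Returns 0 on success, -1 on error (short writes are treated as errors). */
int log_write(int fd, off_t slot, uint64_t seq, uint64_t target,
              const void *payload, uint32_t len)
{
    struct log_header h = { .seq = seq, .target = target, .len = len };
    h.csum = record_csum(&h, payload);
    if (pwrite(fd, &h, sizeof h, slot) != (ssize_t)sizeof h)
        return -1;
    if (pwrite(fd, payload, len, slot + (off_t)sizeof h) != (ssize_t)len)
        return -1;
    return fsync(fd);
}

/* Recovery side: returns 1 if the slot holds a complete record (header and
 * payload are filled in), 0 if it is empty or torn, -1 on I/O error. */
int log_read(int fd, off_t slot, struct log_header *h,
             void *payload, uint32_t cap)
{
    ssize_t r = pread(fd, h, sizeof *h, slot);
    if (r < 0)
        return -1;
    if ((size_t)r != sizeof *h || h->len > cap)
        return 0;
    r = pread(fd, payload, h->len, slot + (off_t)sizeof *h);
    if (r < 0)
        return -1;
    if ((uint32_t)r != h->len)
        return 0;
    return record_csum(h, payload) == h->csum;
}
```

On recovery one would call `log_read()` on each log slot, pick the valid record with the highest `seq`, and replay it by writing the payload at `target`. The sketch does not cover the point raised in the reply above, namely a checksum over the new data itself so that a completed overwrite can be told apart from a partial one.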

> Also cannot see the point of msync if you have already performed an fsync,
> and if there is a point, I would expect you to call msync before
> fsync... Maybe there is some subtlety there that I am not aware of.

I assume it's this from the msync man page:

       msync() flushes changes made to the in-core copy of a file that was
       mapped into memory using mmap(2) back to disk. Without use of this
       call there is no guarantee that changes are written back before
       munmap(2) is called.

> > It's an implementation detail; barrier has less flexibility because it has
> > less information about what is required. I'm saying I want to give you as
> > much information as I can, even if you don't use it yet.
>
> Only we know that approach doesn't work.
> People will learn that they don't need to give the extra information to still
> achieve the same result - just like they did with ext3 and fsync.
> Then when we improve the implementation to only provide the guarantees that
> you asked for, people will complain that they are getting empty files that
> they didn't expect.

I think that's an oversimplification: IIUC that occurred to people *not*
using fsync(). They weren't using it because it was too slow. Providing
a primitive which is as fast or faster and more specific doesn't have the
same magnitude of social issues.

And we can't write userspace interfaces for idiots only.

> The abstraction I would like to see is a simple 'barrier' that contains no
> data and has a filesystem-wide effect.

I think you lack ambition ;)

Thinking about the single-file use case (eg. kvm guest or tdb), isn't that
suboptimal for md? Since you have to hand your barrier to every device
whereas a file-wide primitive may theoretically only go to some.

Cheers,
Rusty.
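
On the msync/fsync ordering question raised in this message, a minimal sketch of the "msync first" order Neil says he would expect might look like this. `sync_mapped()` is a hypothetical helper name; `msync()` with `MS_SYNC` and `fsync()` are the real POSIX calls. Whether both calls are needed on a given kernel and filesystem is exactly the open question in the thread, so this only illustrates the ordering and settles nothing.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: flush changes made through a shared mapping,
 * then sync the file descriptor. `map` must be the page-aligned address
 * returned by mmap() and `len` the length of the mapped region. */
int sync_mapped(int fd, void *map, size_t len)
{
    /* MS_SYNC blocks until the mapped pages have been written back. */
    if (msync(map, len, MS_SYNC) != 0)
        return -1;
    /* Then flush anything written through the descriptor itself. */
    return fsync(fd);
}
```

In the four-step transaction quoted above, each `fsync(fd); msync(mmap);` pair would become a single `sync_mapped()` call under this assumption.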