From: Tejun Heo
Subject: Re: Questions on block drivers, REQ_FLUSH and REQ_FUA
Date: Wed, 25 May 2011 18:43:14 +0200
References: <8E109506CAF2B94185C68859@nimrod.local> <20110524223220.GA379@redhat.com> <20110525085950.GC10146@htj.dyndns.org>
Cc: Vivek Goyal, linux-fsdevel@vger.kernel.org
To: Alex Bligh

Hello, Alex.

On Wed, May 25, 2011 at 5:54 PM, Alex Bligh wrote:
> a) If I do not complete a write command, I may avoid writing it to disk
>  indefinitely (despite completing subsequently received FLUSH
>  commands). The only flushes to disk that I am obliged to flush
>  are those that I've actually told the block layer that I have done.

Yes, driver doesn't have any ordering responsibility w.r.t. FLUSH for
writes which it hasn't declared finished yet.

> b) If I receive a flush command, and prior to completing that flush
>  command, I receive subsequent write commands, I may execute
>  (and, if I like, write, to disk) write commands received AFTER that
>  flush command. I presume if the subsequent write commands write to
>  blocks that I am meant to be flushing, I can just forget about
>  the blocks I am meant to be flushing (because they would be
>  overwritten) provided *something* overwritten what was there before.

The first half is correct. The latter half may be correct if there's
no intervening write but _please_ don't do that.
If there's something to be optimized there, it should be done in upper
layers. It's playing with fire.

> If my understanding is correct, then for future readers of the archive
> (perhaps I should put this list in Documentation/ ?) the semantics are
> something like:
>
> 1. Block drivers may handle requests received in any order, and may
>  issue completions in any order, subject only to the rules below.
>
> 2. If a read covering a given block X is received after one or more writes
>  for that block, then irrespective of the order in which the read
>  and write(s) are handled/completed, the read shall return the
>  value written by the immediately preceding write to that block.
>
>  Therefore whilst the following is legal...
>
>       Driver sends                        Driver replies
>
>       WRITE BLOCK 1 = X
>                                           WRITE BLOCK 1 COMPLETED
>       .... time passes ...
>       READ BLOCK 1
>       WRITE BLOCK 1 = Y
>                                           WRITE BLOCK 1 COMPLETED
>                                           READ BLOCK 1 COMPLETED
>
>  ...the read from block 1 should return X and not Y, even if it was
>  handled by the driver after the write.

This is usually synchronized in the upper layer and AFAIK filesystems
don't issue overlapping reads and writes simultaneously (right?), and
in the above case I don't think READ BLOCK 1 returning Y would be
illegal. There are no ordering constraints between them anyway and the
block layer would happily reorder the second write in front of the
read.

> 3.
>  If a flush request is received, then before completing it (and,
>  in the case of a make_request_function driver) before initiating
>  any attached write, the driver MUST have written to non-volatile
>  storage any writes which were COMPLETED prior to the reception
>  of the flush. This does not affect any writes received, but
>  not completed, prior to the flush, nor does it prevent a block driver
>  from completing subsequently issued writes before completion of the
>  flush. IE the flush does not act as a barrier, it merely ensures that
>  on completion of the flush non-volatile storage contains either the
>  blocks written to prior to the flush or blocks written to in commands
>  issued subsequent to the flush, but completed prior to it.
>
> 4. Requests marked FUA should be written to non-volatile storage prior
>  to completion, but impose no restrictions on ordering.

Hmm... For bio drivers, REQ_FLUSH and REQ_FUA are best explained
together. The following are legal combinations.

* No write data, REQ_FLUSH - doesn't have any ordering constraint
  other than the inherent FLUSH requirement (previously completed
  WRITEs should be on the media on FLUSH completion).

* Write data, REQ_FLUSH - FLUSH must be completed before write data
  is issued. ie. write data must not be written to the media before
  all previous writes are on the media.

* Write data, REQ_FUA - Write should be completed before FLUSH is
  issued - ie. the write data should be on platter along with
  previously completed writes on bio completion.

* Write data, REQ_FLUSH | REQ_FUA - Write data must not be written
  to the media before all previous writes are on the media && the
  write data must be on the media on bio completion. This is usually
  sequenced as FLUSH write FLUSH.
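
The four combinations above can be sketched as a small sequencing
function. This is only an illustration, not kernel code: it assumes a
device with no native FUA support, so FUA is emulated with a trailing
FLUSH as described above. The names sequence(), REQ_FLUSH and REQ_FUA
below are hypothetical Python stand-ins, and the bit values are made
up for the example.

```python
# Illustrative bit values only; these are NOT the kernel's definitions.
REQ_FLUSH = 1 << 0
REQ_FUA = 1 << 1


def sequence(flags, has_data):
    """Return the ordered steps for one bio on a device without native FUA.

    Assumption: FUA is emulated by issuing a FLUSH after the write.
    """
    if (flags & REQ_FUA) and not has_data:
        # Not one of the four combinations listed in the mail.
        raise ValueError("REQ_FUA without write data is not a listed combination")

    steps = []
    if flags & REQ_FLUSH:
        # Previously completed writes must be on the media first.
        steps.append("FLUSH")
    if has_data:
        steps.append("WRITE")
    if flags & REQ_FUA:
        # The write data itself must be on the media at bio completion.
        steps.append("FLUSH")
    return steps


if __name__ == "__main__":
    print(sequence(REQ_FLUSH, has_data=False))
    print(sequence(REQ_FLUSH, has_data=True))
    print(sequence(REQ_FUA, has_data=True))
    print(sequence(REQ_FLUSH | REQ_FUA, has_data=True))
```

A device that honours FUA natively would not need the trailing FLUSH;
the sketch deliberately ignores that case.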

Request based drivers only see REQ_FLUSH w/o write data. The only rule
they have to follow is that all writes they completed prior to
receiving FLUSH must be on the media on completion of FLUSH. Being
smart about it might not be a good idea.

Thanks.

-- 
tejun
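
The flush rule discussed in this mail (only writes completed before
the FLUSH was received must be on the media when the FLUSH completes)
can be modelled with a toy write-back cache. This is a hedged sketch:
the ToyDisk class and all of its method names are hypothetical, the
flush is treated as instantaneous, and pending writes are keyed by
block number so a second pending write to the same block simply
replaces the first.

```python
class ToyDisk:
    """Toy model of a device with a volatile write-back cache."""

    def __init__(self):
        self.media = {}    # non-volatile contents, survives power loss
        self.cache = {}    # completed writes not yet on the media
        self.pending = {}  # writes received but not yet completed

    def receive_write(self, block, value):
        # The driver has the write but has not declared it finished.
        self.pending[block] = value

    def complete_write(self, block):
        # Completion is reported while the data sits in the volatile cache.
        self.cache[block] = self.pending.pop(block)

    def flush(self):
        # Only writes completed before the flush must reach the media;
        # pending writes carry no obligation.
        self.media.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        # Everything volatile is gone.
        self.cache.clear()
        self.pending.clear()


if __name__ == "__main__":
    disk = ToyDisk()
    disk.receive_write(1, "X")
    disk.complete_write(1)      # completed before the flush
    disk.receive_write(2, "Y")  # received, never completed
    disk.flush()
    disk.power_loss()
    print(disk.media)
```

In this model block 1 survives the power loss and block 2 does not,
which matches the answer to question a) above.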