From: Dan Williams
Subject: Re: Write atomicity guarantees
Date: Thu, 24 Apr 2014 11:23:22 -0700
References: <20140424173909.GB5886@linux.intel.com> <53595209.50906@fb.com>
In-Reply-To: <53595209.50906@fb.com>
To: Chris Mason
Cc: Matthew Wilcox, "Martin K. Petersen", "Theodore Ts'o", Dave Chinner, linux-fsdevel

On Thu, Apr 24, 2014 at 11:03 AM, Chris Mason wrote:
> On 04/24/2014 01:39 PM, Matthew Wilcox wrote:
>>
>> NVMe allows the drive to tell the host what atomicity guarantees it
>> provides for a write command. At the moment, I don't think Linux has
>> a way for the driver to pass that information up to the filesystem.
>>
>> The value that is most interesting to report is Atomic Write Unit Power
>> Fail ("if you send a write no larger than this, the drive guarantees to
>> write all of it or none of it"), minimum value 1 sector. [1]
>>
>> There's a proposal before the NVMe workgroup to add a boundary size/offset
>> to modify AWUPF ("except if you cross this boundary, then AWUPF is not
>> guaranteed"). Think RAID stripe crossing.
>>
>> So, three questions. Is there somewhere already to pass boundary
>> information up to the filesystem? Can filesystems make use of a larger
>> atomic write unit than a single sector? And, if the device is internally
>> a RAID device, is knowing the boundary size/offset useful?
>>
>> [1] There is also Atomic Write Unit Normal ("if you send two writes,
>> neither of which is larger than this, subsequent reads will get either
>> one or the other, not a mixture of both"), which I don't think we care
>> about because the page cache prevents us from sending two writes which
>> overlap with each other.
>
> I think we really need the atomics to be vectored. Send N writes which as a
> unit are not larger than X, but which may span anywhere on device. An array
> with writeback cache, or a log structured squirrel in the FTL should be able
> to provide this pretty easily?
>
> The immediate use case is mysql (16K writes) on a fragmented filesystem.
> The FS needs to be able to collect a single atomic write made up of N 4K
> sectors.

How big does N need to be before it starts to be generally useful?
Here it seems we're talking on the order of tens of writes, but for
the upper bound Dave said that N could be in the hundreds of thousands
[1].

--
Dan

[1]: http://marc.info/?l=linux-fsdevel&m=139262740324307&w=2
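
To make the two checks discussed in this thread concrete, here is a minimal
sketch in Python. It is not kernel code and not an existing interface: the
function names, the parameter names, and the exact boundary semantics are
assumptions for illustration (the boundary proposal was still before the NVMe
workgroup at the time). All quantities are in sectors.

```python
def write_is_atomic(offset, length, awupf, boundary_size=0, boundary_offset=0):
    """Return True if one write of `length` sectors starting at `offset`
    would fall under a power-fail atomicity guarantee.

    Assumed semantics (hypothetical, for illustration only):
      * the write must be no larger than `awupf` sectors;
      * if `boundary_size` is non-zero, boundaries sit at
        boundary_offset + k * boundary_size and the write must not cross one.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if length > awupf:
        return False
    if boundary_size == 0:
        # No boundary reported: only the size limit applies.
        return True
    # Index of the boundary-delimited region holding the first and last sector.
    first_region = (offset - boundary_offset) // boundary_size
    last_region = (offset + length - 1 - boundary_offset) // boundary_size
    return first_region == last_region


def vector_fits(extents, max_total):
    """Return True if a vectored write fits the limit X from the proposal.

    `extents` is a list of (offset, length) pairs that may sit anywhere on
    the device; only their combined length is compared against `max_total`.
    """
    return sum(length for _offset, length in extents) <= max_total
```

For the mysql case above, a 16K write on a fragmented filesystem with 4K
sectors would be four scattered one-sector extents, so
vector_fits([(100, 1), (7, 1), (52, 1), (9000, 1)], max_total=4) holds, while
a single write of sectors 6..9 with an 8-sector boundary does not pass
write_is_atomic() because it crosses sector 8.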