From mboxrd@z Thu Jan  1 00:00:00 1970
From: Chris Mason
Subject: Re: Write atomicity guarantees
Date: Thu, 24 Apr 2014 14:03:53 -0400
Message-ID: <53595209.50906@fb.com>
References: <20140424173909.GB5886@linux.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Cc:
To: Matthew Wilcox , "Martin K. Petersen" , "Theodore Ts'o" , Dave Chinner
Return-path:
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:30644 "EHLO
	mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1757450AbaDXSDU (ORCPT );
	Thu, 24 Apr 2014 14:03:20 -0400
In-Reply-To: <20140424173909.GB5886@linux.intel.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On 04/24/2014 01:39 PM, Matthew Wilcox wrote:
>
> NVMe allows the drive to tell the host what atomicity guarantees it
> provides for a write command. At the moment, I don't think Linux has
> a way for the driver to pass that information up to the filesystem.
>
> The value that is most interesting to report is Atomic Write Unit Power
> Fail ("if you send a write no larger than this, the drive guarantees to
> write all of it or none of it"), minimum value 1 sector. [1]
>
> There's a proposal before the NVMe workgroup to add a boundary size/offset
> to modify AWUPF ("except if you cross this boundary, then AWUPF is not
> guaranteed"). Think RAID stripe crossing.
>
> So, three questions. Is there somewhere already to pass boundary
> information up to the filesystem? Can filesystems make use of a larger
> atomic write unit than a single sector? And, if the device is internally
> a RAID device, is knowing the boundary size/offset useful?
>
>
> [1] There is also Atomic Write Unit Normal ("if you send two writes,
> neither of which is larger than this, subsequent reads will get either
> one or the other, not a mixture of both"), which I don't think we care
> about because the page cache prevents us from sending two writes which
> overlap with each other.

I think we really need the atomics to be vectored. Send N writes which
as a unit are not larger than X, but which may span anywhere on device.
An array with writeback cache, or a log structured squirrel in the FTL
should be able to provide this pretty easily?

The immediate use case is mysql (16K writes) on a fragmented filesystem.
The FS needs to be able to collect a single atomic write made up of N 4K
sectors.

-chris
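
To make the two constraints in this thread concrete, here is a rough sketch (not an existing kernel or NVMe interface) of the checks a filesystem might run before submitting a write as atomic. Every name in it is hypothetical: `atomic_unit_bytes` stands in for the limit X on a vectored write, `boundary_bytes` and `boundary_offset` stand in for the proposed AWUPF boundary, and the 4K block size is taken from the mysql example.

```python
from typing import List, Tuple

# Assumed filesystem block size, matching the 4K sectors in the example.
SECTOR = 4096

# Hypothetical extent description: (byte offset on device, length in bytes).
Extent = Tuple[int, int]


def fits_vectored_atomic(extents: List[Extent], atomic_unit_bytes: int) -> bool:
    """Return True if the extents, taken as one unit, stay within the limit.

    The extents may sit anywhere on the device; only their combined size
    is compared against the (hypothetical) vectored atomic limit X.
    """
    if not extents:
        return False
    for offset, length in extents:
        # Each piece must be a whole number of aligned blocks.
        if length <= 0 or offset % SECTOR or length % SECTOR:
            return False
    # Overlapping pieces would make "all or nothing" ambiguous, so reject them.
    ordered = sorted(extents)
    for (off_a, len_a), (off_b, _) in zip(ordered, ordered[1:]):
        if off_a + len_a > off_b:
            return False
    return sum(length for _, length in extents) <= atomic_unit_bytes


def crosses_boundary(offset: int, length: int,
                     boundary_bytes: int, boundary_offset: int = 0) -> bool:
    """Return True if one contiguous write spans a boundary (e.g. a RAID stripe)."""
    first = (offset - boundary_offset) // boundary_bytes
    last = (offset + length - 1 - boundary_offset) // boundary_bytes
    return first != last


# A 16K database page scattered across four 4K blocks of a fragmented file.
page = [(0, 4096), (40960, 4096), (1048576, 4096), (8192, 4096)]
print(fits_vectored_atomic(page, atomic_unit_bytes=16384))
print(crosses_boundary(61440, 8192, boundary_bytes=65536))
```

The first check only sums lengths, which is the point of the vectored proposal: fragmentation does not matter as long as the unit as a whole is no larger than X. The second check applies to a single contiguous write under the boundary proposal from the original mail.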