From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chris Mason
Subject: Re: Write atomicity guarantees
Date: Thu, 24 Apr 2014 14:50:23 -0400
Message-ID: <53595CEF.3020603@fb.com>
References: <20140424173909.GB5886@linux.intel.com> <53595209.50906@fb.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Matthew Wilcox, "Martin K. Petersen", "Theodore Ts'o", Dave Chinner, linux-fsdevel
To: Dan Williams
Return-path:
Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:64058 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932411AbaDXStu (ORCPT ); Thu, 24 Apr 2014 14:49:50 -0400
In-Reply-To:
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On 04/24/2014 02:23 PM, Dan Williams wrote:
> On Thu, Apr 24, 2014 at 11:03 AM, Chris Mason wrote:
>> On 04/24/2014 01:39 PM, Matthew Wilcox wrote:
>>>
>>>
>>> NVMe allows the drive to tell the host what atomicity guarantees it
>>> provides for a write command. At the moment, I don't think Linux has
>>> a way for the driver to pass that information up to the filesystem.
>>>
>>> The value that is most interesting to report is Atomic Write Unit Power
>>> Fail ("if you send a write no larger than this, the drive guarantees to
>>> write all of it or none of it"), minimum value 1 sector. [1]
>>>
>>> There's a proposal before the NVMe workgroup to add a boundary size/offset
>>> to modify AWUPF ("except if you cross this boundary, then AWUPF is not
>>> guaranteed"). Think RAID stripe crossing.
>>>
>>> So, three questions. Is there somewhere already to pass boundary
>>> information up to the filesystem? Can filesystems make use of a larger
>>> atomic write unit than a single sector? And, if the device is internally
>>> a RAID device, is knowing the boundary size/offset useful?
>>>
>>>
>>> [1] There is also Atomic Write Unit Normal ("if you send two writes,
>>> neither of which is larger than this, subsequent reads will get either
>>> one or the other, not a mixture of both"), which I don't think we care
>>> about because the page cache prevents us from sending two writes which
>>> overlap with each other.
>>
>>
>> I think we really need the atomics to be vectored. Send N writes which as a
>> unit are not larger than X, but which may span anywhere on device. An array
>> with writeback cache, or a log structured squirrel in the FTL should be able
>> to provide this pretty easily?
>>
>> The immediate use case is MySQL (16K writes) on a fragmented filesystem.
>> The FS needs to be able to collect a single atomic write made up of N 4K
>> sectors.
>
> How big does N need to be before it starts to be generally useful?
> Here it seems we're talking on the order of tens of writes, but for
> the upper bound Dave said that N could be in the hundreds of thousands

Right, if you ask the filesystem guys, we'll want to dump the entire
contents of RAM down to the storage in atomic fashion.

I do agree with Dave here, bigger is definitely better. 16K and up are
useful, depending on which workload you're targeting. The fusion devices
can do 1MB.

-chris
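
For illustration only, here is a rough Python sketch of the two checks discussed in this thread: whether a single write stays inside an AWUPF-style limit without crossing a reported boundary, and whether a vectored set of N extents fits under a total size X. The function names, parameters, and the exact boundary semantics (a window repeating every boundary_size sectors, starting at boundary_offset) are assumptions made for the example. They are not taken from the NVMe specification or from any Linux interface.

```python
def write_is_atomic(offset, length, awupf, boundary_size=0, boundary_offset=0):
    """Return True if one write falls within the assumed power-fail guarantee.

    All values are in sectors. boundary_size == 0 is taken to mean that the
    device reports no boundary (hypothetical convention for this sketch).
    """
    if length <= 0 or length > awupf:
        return False
    if boundary_size == 0:
        return True
    # Index of the boundary-aligned window holding the first and last sector.
    first = (offset - boundary_offset) // boundary_size
    last = (offset + length - 1 - boundary_offset) // boundary_size
    # The guarantee is assumed to hold only if both land in the same window.
    return first == last


def vector_fits(extents, max_total):
    """Vectored case: N (offset, length) extents anywhere on the device.

    The set is assumed to be atomic as a unit when the combined length
    does not exceed max_total sectors.
    """
    return sum(length for _, length in extents) <= max_total
```

With 4K sectors, the MySQL case above would be something like vector_fits([(0, 1), (100, 1), (5000, 1), (9, 1)], 4): four scattered 4K sectors making up one 16K write.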