From: Dave Chinner
Subject: Re: Write atomicity guarantees
Date: Fri, 25 Apr 2014 05:27:09 +1000
Message-ID: <20140424192709.GV18672@dastard>
References: <20140424173909.GB5886@linux.intel.com> <53595209.50906@fb.com> <53595CEF.3020603@fb.com>
In-Reply-To: <53595CEF.3020603@fb.com>
To: Chris Mason
Cc: Dan Williams, Matthew Wilcox, "Martin K. Petersen", Theodore Ts'o, linux-fsdevel

On Thu, Apr 24, 2014 at 02:50:23PM -0400, Chris Mason wrote:
>
>
> On 04/24/2014 02:23 PM, Dan Williams wrote:
> >On Thu, Apr 24, 2014 at 11:03 AM, Chris Mason wrote:
> >>On 04/24/2014 01:39 PM, Matthew Wilcox wrote:
> >>>
> >>>
> >>>NVMe allows the drive to tell the host what atomicity guarantees it
> >>>provides for a write command. At the moment, I don't think Linux has
> >>>a way for the driver to pass that information up to the filesystem.
> >>>
> >>>The value that is most interesting to report is Atomic Write Unit Power
> >>>Fail ("if you send a write no larger than this, the drive guarantees to
> >>>write all of it or none of it"), minimum value 1 sector. [1]
> >>>
> >>>There's a proposal before the NVMe workgroup to add a boundary size/offset
> >>>to modify AWUPF ("except if you cross this boundary, then AWUPF is not
> >>>guaranteed"). Think RAID stripe crossing.
> >>>
> >>>So, three questions. Is there somewhere already to pass boundary
> >>>information up to the filesystem? Can filesystems make use of a larger
> >>>atomic write unit than a single sector? And, if the device is internally
> >>>a RAID device, is knowing the boundary size/offset useful?
> >>>
> >>>
> >>>[1] There is also Atomic Write Unit Normal ("if you send two writes,
> >>>neither of which is larger than this, subsequent reads will get either
> >>>one or the other, not a mixture of both"), which I don't think we care
> >>>about because the page cache prevents us from sending two writes which
> >>>overlap with each other.
> >>
> >>
> >>I think we really need the atomics to be vectored. Send N writes which as a
> >>unit are not larger than X, but which may span anywhere on device. An array
> >>with writeback cache, or a log structured squirrel in the FTL should be able
> >>to provide this pretty easily?
> >>
> >>The immediate use case is mysql (16K writes) on a fragmented filesystem.
> >>The FS needs to be able to collect a single atomic write made up of N 4K
> >>sectors.
> >
> >How big does N need to be before it starts to be generally useful?
> >Here it seems we're talking on the order to tens of writes, but for
> >the upper bound Dave said that N could be in the hundreds of thousands
>
> Right, if you ask the filesystem guys, we'll want to dump the entire
> contents of ram down to the storage in atomic fashion. I do agree
> with Dave here, bigger is definitely better.

Right, bigger is better, but what about minimum requirements? The
minimum requirement I need for converting XFS is around 4MB of
discontiguous single sector IOs for the worst case event. That covers
the largest *single* atomic transaction log reservation we currently
make on XFS at 64k block sizes.

> 16K and up are useful, depending on which workload you're targeting.
> The fusion devices can do 1MB.

User data workloads, yes. The moment we start thinking about atomic
filesystem metadata updates, the requirements go way, way up....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
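
For anyone trying to pin down the two checks discussed in this thread,
here is a rough sketch in C. It is only an illustration under stated
assumptions: struct atomic_limits, struct atomic_seg, write_is_atomic()
and vectored_fits() are hypothetical names, not existing kernel or NVMe
interfaces, and the boundary model (boundaries repeating every
boundary_bytes starting at boundary_offset) is one guess at what the
proposal could look like. The first function tests the AWUPF rule for a
single write, the second tests the "N writes which as a unit are not
larger than X" rule for a vectored write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limits a driver might report; not an existing kernel struct. */
struct atomic_limits {
	uint64_t awupf_bytes;     /* largest write that is all-or-nothing */
	uint64_t boundary_bytes;  /* spacing between boundaries, 0 = none */
	uint64_t boundary_offset; /* byte address of the first boundary */
};

/* One piece of a (hypothetical) vectored atomic write. */
struct atomic_seg {
	uint64_t off;
	uint64_t len;
};

/*
 * True if the write [off, off + len) is no larger than AWUPF and does
 * not cross a boundary. Overflow of off + len is not handled here.
 */
static bool write_is_atomic(const struct atomic_limits *lim,
			    uint64_t off, uint64_t len)
{
	uint64_t shifted;

	if (len == 0 || len > lim->awupf_bytes)
		return false;
	if (lim->boundary_bytes == 0)
		return true;

	/* Shift so that boundaries fall on multiples of boundary_bytes. */
	shifted = off + lim->boundary_bytes -
		  lim->boundary_offset % lim->boundary_bytes;
	return shifted / lim->boundary_bytes ==
	       (shifted + len - 1) / lim->boundary_bytes;
}

/*
 * True if n segments, taken as one unit, total no more than max_total
 * bytes. Segments may sit anywhere on the device.
 */
static bool vectored_fits(const struct atomic_seg *segs, size_t n,
			  uint64_t max_total)
{
	uint64_t total = 0;
	size_t i;

	assert(segs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/* Compare against the remaining budget to avoid overflow. */
		if (segs[i].len > max_total - total)
			return false;
		total += segs[i].len;
	}
	return n > 0;
}
```

Under this sketch, the mysql case above is four scattered 4K segments
against a 16K limit, which fits. The XFS worst case is a different
order of magnitude: assuming 512-byte sectors, 4MB of single sector IOs
is 8192 segments, which would not fit a 1MB limit.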