From: Matthew Wilcox
Subject: Re: [PATCH 1/2] block: Add support for atomic writes
Date: Wed, 13 Nov 2013 16:35:54 -0500
Message-ID: <20131113213554.GL6900@linux.intel.com>
References: <20131101212704.10239.73920@localhost.localdomain> <20131101212854.10239.19830@localhost.localdomain> <20131107135220.3802.91392@localhost.localdomain> <20131112151151.GI6900@linux.intel.com> <20131113204438.3802.80855@localhost.localdomain>
In-Reply-To: <20131113204438.3802.80855@localhost.localdomain>
To: Chris Mason
Cc: Jeff Moyer, Linux FS Devel, Jens Axboe
Sender: linux-fsdevel-owner@vger.kernel.org

On Wed, Nov 13, 2013 at 03:44:38PM -0500, Chris Mason wrote:
> Quoting Matthew Wilcox (2013-11-12 10:11:51)
> > On Thu, Nov 07, 2013 at 08:52:20AM -0500, Chris Mason wrote:
> > > Unfortunately, it's hard to say. I think the fusionio cards are the
> > > only shipping devices that support this, but I've definitely heard that
> > > others plan to support it as well. mariadb/percona already support the
> > > atomics via fusionio specific ioctls, and turning that into a real
> > > O_ATOMIC is a priority so other hardware can just hop on the train.
> > >
> > > This feature in general is pretty natural for the log structured squirrels
> > > they stuff inside flash, so I'd expect everyone to support it. Matthew,
> > > how do you feel about all of this?
> >
> > NVMe doesn't have support for this functionality. I know what stories I've
> > heard from our internal device teams about what they can and can't support
> > in the way of this kind of thing, but I obviously can't repeat them here!
>
> There are some atomics in the NVMe spec though, correct? The minimum
> needed for database use is only ~16-64K.

Yes, NVMe has limited atomic support. It has two fields:

  Atomic Write Unit Normal (AWUN): This field indicates the atomic write
  size for the controller during normal operation. This field is specified
  in logical blocks and is a 0’s based value. If a write is submitted
  of this size or less, the host is guaranteed that the write is atomic
  to the NVM with respect to other read or write operations. If a write
  is submitted that is greater than this size, there is no guarantee
  of atomicity. A value of FFFFh indicates all commands are atomic as
  this is the largest command size. It is recommended that implementations
  support a minimum of 128KB (appropriately scaled based on LBA size).

  Atomic Write Unit Power Fail (AWUPF): This field indicates the atomic
  write size for the controller during a power fail condition. This
  field is specified in logical blocks and is a 0’s based value.
  If a write is submitted of this size or less, the host is guaranteed
  that the write is atomic to the NVM with respect to other read or
  write operations. If a write is submitted that is greater than this
  size, there is no guarantee of atomicity.

Basically just exposing what is assumed to be true for SCSI and generally
assumed to be lies for ATA drives :-)

> > I took a look at the SCSI Block Command spec. If I understand it
> > correctly, SCSI would implement this with the WRITE USING TOKEN command.
> > I don't see why it couldn't implement this API, though it seems like
> > SCSI would prefer a separate setup step before the write comes in. I'm
> > not sure that's a reasonable request to make of the application (nor
> > am I sure I understand SBC correctly).
>
> What kind of setup would we have to do? We have all the IO in hand, so
> it can be organized in just about any way needed.

Someone who understands SCSI better than I do assures me this is NOT the
proposal that allows SCSI devices to do scattered writes. Apparently that
proposal is still in progress. This appears to be true; from the t10
NEW list:

12-087r6  SBC-4 Gathered reads, optionally atomic  Rob Elliott, Ashish Batwara, Walt Hubis  Missing
12-086r6  SBC-4 SPC-5 Scattered writes, optionally atomic  Rob Elliott, Ashish Batwara, Walt Hubis  Missing

> Grin, almost Btrfs already does this...COW means that btrfs needs to
> update metadata to point to new locations. To avoid an ugly
> flush-all-the-io-every-commit mess, we track pending writes and update
> the metadata when the write is fully on media.
>
> We're missing a firm line that makes sure all the metadata updates for a
> single write happen in the same transaction, but that part isn't hard.
>
> We're missing good performance in database workloads, which is a
> slightly bigger trick.

Yeah ... if only you could find a database company to ... oh, wait :-)
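
For anyone working out what the AWUN and AWUPF values quoted from the NVMe
spec in this message mean in bytes: both are 0's based counts of logical
blocks, so the usable size is (value + 1) * LBA size. The sketch below is
only an illustration of that arithmetic. The function names are made up
for this example and are not kernel or NVMe driver APIs, and it assumes
the fields are 16 bits wide, as the FFFFh maximum in the spec text suggests.

```python
def atomic_write_bytes(unit_zero_based: int, lba_size: int) -> int:
    """Convert a 0's based AWUN/AWUPF value (in logical blocks) to bytes.

    Hypothetical helper, not a real driver API. Assumes a 16-bit field.
    """
    if not 0 <= unit_zero_based <= 0xFFFF:
        raise ValueError("AWUN/AWUPF are assumed to be 16-bit fields")
    if lba_size <= 0:
        raise ValueError("LBA size must be positive")
    # 0's based: a raw value of 0 means one logical block.
    return (unit_zero_based + 1) * lba_size


def fits_atomic_unit(write_len: int, unit_zero_based: int, lba_size: int) -> bool:
    """Return True if a write of write_len bytes is no larger than the unit."""
    return 0 < write_len <= atomic_write_bytes(unit_zero_based, lba_size)


if __name__ == "__main__":
    # AWUN = 255 with 512-byte blocks is the recommended 128KB minimum.
    print(atomic_write_bytes(255, 512))
    # A 16K database page against a made-up AWUPF of 31 (32 blocks of 512 bytes).
    print(fits_atomic_unit(16 * 1024, 31, 512))
    # A 64K write against the same made-up AWUPF.
    print(fits_atomic_unit(64 * 1024, 31, 512))
```

The AWUPF of 31 above is an invented value for illustration, not a figure
from any device. The ~16-64K range is the database requirement Chris
mentions, so the power-fail unit is the one that matters when checking it.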