From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chris Mason
Subject: Re: [PATCH 1/2] block: Add support for atomic writes
Date: Wed, 13 Nov 2013 15:44:38 -0500
Message-ID: <20131113204438.3802.80855@localhost.localdomain>
References: <20131101212704.10239.73920@localhost.localdomain>
 <20131101212854.10239.19830@localhost.localdomain>
 <20131107135220.3802.91392@localhost.localdomain>
 <20131112151151.GI6900@linux.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8BIT
Cc: Jeff Moyer, Linux FS Devel, Jens Axboe
To: Matthew Wilcox
Return-path:
Received: from dkim2.fusionio.com ([66.114.96.54]:59605 "EHLO
 dkim2.fusionio.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with
 ESMTP id S1751000Ab3KMUoo convert rfc822-to-8bit (ORCPT );
 Wed, 13 Nov 2013 15:44:44 -0500
Received: from mx1.fusionio.com (unknown [10.101.1.160]) by
 dkim2.fusionio.com (Postfix) with ESMTP id 4A09D9A0370 for ;
 Wed, 13 Nov 2013 13:44:44 -0700 (MST)
In-Reply-To: <20131112151151.GI6900@linux.intel.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Quoting Matthew Wilcox (2013-11-12 10:11:51)
> On Thu, Nov 07, 2013 at 08:52:20AM -0500, Chris Mason wrote:
> > Unfortunately, it's hard to say. I think the fusionio cards are the
> > only shipping devices that support this, but I've definitely heard that
> > others plan to support it as well. mariadb/percona already support the
> > atomics via fusionio specific ioctls, and turning that into a real
> > O_ATOMIC is a priority so other hardware can just hop on the train.
> >
> > This feature in general is pretty natural for the log structured squirrels
> > they stuff inside flash, so I'd expect everyone to support it. Matthew,
> > how do you feel about all of this?
>
> NVMe doesn't have support for this functionality. I know what stories I've
> heard from our internal device teams about what they can and can't support
> in the way of this kind of thing, but I obviously can't repeat them here!

There are some atomics in the NVMe spec though, correct? The minimum
needed for database use is only ~16-64K.

>
> I took a look at the SCSI Block Command spec. If I understand it
> correctly, SCSI would implement this with the WRITE USING TOKEN command.
> I don't see why it couldn't implement this API, though it seems like
> SCSI would prefer a separate setup step before the write comes in. I'm
> not sure that's a reasonable request to make of the application (nor
> am I sure I understand SBC correctly).

What kind of setup would we have to do? We have all the IO in hand, so
it can be organized in just about any way needed.

>
> I like the API, but I'm a little confused not to see a patch saying "Oh,
> and here's how we implemented it in btrfs without any hardware support"
> ;-) It seems to me that the concept is just as good a match for an
> advanced filesystem that supports snapshots as it is for the FTL inside
> a drive.

Grin, almost. Btrfs already does this... COW means that btrfs needs to
update metadata to point to new locations. To avoid an ugly
flush-all-the-io-every-commit mess, we track pending writes and update
the metadata when the write is fully on media.

We're missing a firm line that makes sure all the metadata updates for a
single write happen in the same transaction, but that part isn't hard.
We're missing good performance in database workloads, which is a
slightly bigger trick.

-chris
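
For readers following the thread, the COW scheme described above (track
pending writes, point metadata at the new locations only once the data
is fully on media, and do all pointer updates for one write together)
can be sketched as a toy in-memory model. This is only an illustration
under loud assumptions: CowFile, PendingWrite, complete_io and commit
are hypothetical names, the "media" is a Python dict, and none of this
is btrfs code or the API proposed in the patch.

```python
class CowFile:
    """Toy copy-on-write file (hypothetical, not btrfs code)."""

    def __init__(self):
        self.media = {}      # location -> bytes; stands in for the device
        self.metadata = {}   # logical block -> location; the committed view
        self._next_loc = 0

    def _alloc(self):
        # COW never overwrites in place: every write gets a fresh location.
        loc = self._next_loc
        self._next_loc += 1
        return loc

    def begin_atomic_write(self, blocks):
        """Plan new locations for all blocks; nothing is visible yet."""
        plan = {blk: (self._alloc(), data) for blk, data in blocks.items()}
        return PendingWrite(self, plan)

    def read(self, blk):
        loc = self.metadata.get(blk)
        return None if loc is None else self.media[loc]


class PendingWrite:
    """One tracked multi-block write that must become visible all at once."""

    def __init__(self, cow_file, plan):
        self.file = cow_file
        self.plan = plan        # logical block -> (new location, data)
        self.on_media = set()   # blocks whose data IO has completed

    def complete_io(self, blk):
        """Simulate the device finishing the data write for one block."""
        loc, data = self.plan[blk]
        self.file.media[loc] = data
        self.on_media.add(blk)

    def commit(self):
        """Update all pointers together, only if every block is on media."""
        if self.on_media != set(self.plan):
            return False  # data still in flight; old contents stay visible
        # A single update stands in for "the same transaction".
        self.file.metadata.update(
            {blk: loc for blk, (loc, _) in self.plan.items()}
        )
        return True
```

In this model a reader sees either all of the old blocks or all of the
new ones, never a mix, because the old locations stay referenced until
commit() succeeds. The "firm line" mentioned above corresponds to the
single metadata update inside commit().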