From: Matthew Wilcox <willy@linux.intel.com>
To: Chris Mason <chris.mason@fusionio.com>
Cc: Jeff Moyer <jmoyer@redhat.com>,
Linux FS Devel <linux-fsdevel@vger.kernel.org>,
Jens Axboe <axboe@kernel.dk>
Subject: Re: [PATCH 1/2] block: Add support for atomic writes
Date: Wed, 13 Nov 2013 16:35:54 -0500
Message-ID: <20131113213554.GL6900@linux.intel.com>
In-Reply-To: <20131113204438.3802.80855@localhost.localdomain>

On Wed, Nov 13, 2013 at 03:44:38PM -0500, Chris Mason wrote:
> Quoting Matthew Wilcox (2013-11-12 10:11:51)
> > On Thu, Nov 07, 2013 at 08:52:20AM -0500, Chris Mason wrote:
> > > Unfortunately, it's hard to say. I think the fusionio cards are the
> > > only shipping devices that support this, but I've definitely heard that
> > > others plan to support it as well. mariadb/percona already support the
> > > atomics via fusionio specific ioctls, and turning that into a real
> > > O_ATOMIC is a priority so other hardware can just hop on the train.
> > >
> > > This feature in general is pretty natural for the log structured squirrels
> > > they stuff inside flash, so I'd expect everyone to support it. Matthew,
> > > how do you feel about all of this?
> >
> > NVMe doesn't have support for this functionality. I know what stories I've
> > heard from our internal device teams about what they can and can't support
> > in the way of this kind of thing, but I obviously can't repeat them here!
>
> There are some atomics in the NVMe spec though, correct? The minimum
> needed for database use is only ~16-64K.

Yes, NVMe has limited atomic support. It has two fields:

  Atomic Write Unit Normal (AWUN): This field indicates the atomic write
  size for the controller during normal operation. This field is specified
  in logical blocks and is a 0’s based value. If a write is submitted
  of this size or less, the host is guaranteed that the write is atomic
  to the NVM with respect to other read or write operations. If a write
  is submitted that is greater than this size, there is no guarantee
  of atomicity.

  A value of FFFFh indicates all commands are atomic as this is the
  largest command size. It is recommended that implementations support
  a minimum of 128KB (appropriately scaled based on LBA size).

  Atomic Write Unit Power Fail (AWUPF): This field indicates the atomic
  write size for the controller during a power fail condition. This
  field is specified in logical blocks and is a 0’s based value. If a
  write is submitted of this size or less, the host is guaranteed that
  the write is atomic to the NVM with respect to other read or write
  operations. If a write is submitted that is greater than this size,
  there is no guarantee of atomicity.

Basically just exposing what is assumed to be true for SCSI and generally
assumed to be lies for ATA drives :-)
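
To make the units concrete: both fields count logical blocks and are 0's
based, so a raw value of N means N + 1 blocks. A minimal sketch of that
conversion follows, in Python purely for illustration. The function names
are hypothetical (they come from neither the NVMe spec nor any driver), and
the check ignores any alignment or other constraints a real device may
impose; it only compares sizes.

```python
def atomic_write_bytes(field_value: int, lba_size: int) -> int:
    """Convert a raw AWUN or AWUPF value to a size in bytes.

    field_value: the raw 16-bit, 0's based field (0 means one logical block).
    lba_size:    logical block size in bytes (assumed known, e.g. 512 or 4096).
    """
    if not 0 <= field_value <= 0xFFFF:
        raise ValueError("AWUN/AWUPF are 16-bit fields")
    if lba_size <= 0:
        raise ValueError("lba_size must be positive")
    # 0's based: add one to get the number of logical blocks.
    return (field_value + 1) * lba_size


def fits_atomic_unit(write_bytes: int, field_value: int, lba_size: int) -> bool:
    """Return True if a single write of write_bytes is no larger than the
    atomic write unit described by field_value (size comparison only)."""
    return write_bytes <= atomic_write_bytes(field_value, lba_size)
```

Under these assumptions, the recommended 128KB minimum corresponds to a raw
value of 255 with 512-byte logical blocks or 31 with 4096-byte logical
blocks, and a 16K database page on 4096-byte blocks needs a raw value of at
least 3.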
> > I took a look at the SCSI Block Command spec. If I understand it
> > correctly, SCSI would implement this with the WRITE USING TOKEN command.
> > I don't see why it couldn't implement this API, though it seems like
> > SCSI would prefer a separate setup step before the write comes in. I'm
> > not sure that's a reasonable request to make of the application (nor
> > am I sure I understand SBC correctly).
>
> What kind of setup would we have to do? We have all the IO in hand, so
> it can be organized in just about any way needed.

Someone who understands SCSI better than I do assures me this is NOT the
proposal that allows SCSI devices to do scattered writes. Apparently that
proposal is still in progress. This appears to be true; from the t10
NEW list:

  12-087r6  SBC-4 Gathered reads, optionally atomic
            Rob Elliott, Ashish Batwara, Walt Hubis
            Missing
  12-086r6  SBC-4 SPC-5 Scattered writes, optionally atomic
            Rob Elliott, Ashish Batwara, Walt Hubis
            Missing

> Grin, almost Btrfs already does this...COW means that btrfs needs to
> update metadata to point to new locations. To avoid an ugly
> flush-all-the-io-every-commit mess, we track pending writes and update
> the metadata when the write is fully on media.
>
> We're missing a firm line that makes sure all the metadata updates for a
> single write happen in the same transaction, but that part isn't hard.
>
> We're missing good performance in database workloads, which is a
> slightly bigger trick.

Yeah ... if only you could find a database company to ... oh, wait :-)