From: Dave Chinner <david@fromorbit.com>
To: Andy Rudoff <andy@rudoff.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
	lsf-pc@lists.linux-foundation.org,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	jmoyer@redhat.com, Chris Mason <clm@fb.com>,
	Jens Axboe <axboe@kernel.dk>,
	Bryan E Veal <bryan.e.veal@intel.com>,
	Annie Foong <annie.foong@intel.com>
Subject: Re: [LSF/MM TOPIC] atomic block device
Date: Mon, 17 Feb 2014 19:56:27 +1100	[thread overview]
Message-ID: <20140217085627.GA13647@dastard> (raw)
In-Reply-To: <CABBL8E+r+Uao9aJsezy16K_JXQgVuoD7ArepB46WTS=zruHL4g@mail.gmail.com>

On Sat, Feb 15, 2014 at 10:47:12AM -0700, Andy Rudoff wrote:
> On Sat, Feb 15, 2014 at 8:04 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > In response to Dave's call [1] and highlighting Jeff's attend request
> > [2] I'd like to stoke a discussion on an emulation layer for atomic
> > block commands.  Specifically, SNIA has laid out their position on the
> > command set an atomic block device may support (NVM Programming Model
> > [3]) and it is a good conversation piece for this effort.  The goal
> > would be to review the proposed operations, identify the capabilities
> > that would be readily useful to filesystems / existing use cases, and
> > tear down a straw man implementation proposal.
> >
> ...
> 
> > The argument for not doing this as a
> > device-mapper target or stacked block device driver is to ease
> > provisioning and make the emulation transparent.  On the other hand,
> > the argument for doing this as a virtual block device is that the
> > "failed to parse device metadata" is a known failure scenario for
> > dm/md, but not sd for example.
> >
> 
> Hi Dan,
> 
> Like Jeff, I'm a member of the NVMP workgroup and I'd like to chime in here
> with a couple of observations.  I think the most interesting cases where
> atomics provide a benefit are cases where storage is RAIDed across multiple
> devices.  Part of the argument for atomic writes on SSDs is that databases
> and file systems can save bandwidth and complexity by avoiding
> write-ahead-logging.  But even if every SSD supported it, the majority of
> production databases span across devices for either capacity, performance,
> or, most likely, high availability reasons.  So in my opinion, that very
> much supports the idea of doing atomics at a layer where it applies to SW
> RAIDed storage (as I believe Dave and others are suggesting).
> 
> On the other side of the coin, I remember Dave talking about this during
> our NVM discussion at LSF last year and I got the impression the size and
> number of writes he'd need supported before he could really stop using his
> journaling code was potentially large.  Dave: perhaps you can re-state the
> number of writes and their total size that would have to be supported by
> block level atomics in order for them to be worth using by XFS?

Hi Andy - the numbers I gave last year were at the upper end of the
number of iovecs we can dump into an atomic checkpoint in the XFS
log at a time. Because that is typically bounded by the log size, and
the log can be up to 2GB in size, this tends to max out at somewhere
around 150,000-200,000 individual iovecs and/or roughly 100MB of
metadata.

Yeah, it's a lot, but keep in mind that a workload running 250,000
file creates per second on XFS is retiring somewhere around 300,000
individual transactions per second, each of which will typically
contain 10-20 dirty regions.  If we were to issue each transaction
as its own atomic write at commit time, we'd need to sustain
somewhere on the order of 3-6 _million IOPS_ just to maintain that
transaction rate.
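
To spell out that arithmetic (just a back-of-the-envelope sketch in C
using the round numbers above, not measured figures):

	unsigned long trans_per_sec = 300000;	/* transactions retired/sec */
	unsigned long regions_lo    = 10;	/* dirty regions per transaction */
	unsigned long regions_hi    = 20;

	/*
	 * Each dirty region is a separate vector, i.e. a separate IO the
	 * device has to complete, so one atomic write per transaction means:
	 */
	unsigned long iops_lo = trans_per_sec * regions_lo;	/* ~3 million */
	unsigned long iops_hi = trans_per_sec * regions_hi;	/* ~6 million */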

That would also introduce unacceptable IO latency, as we can't modify
metadata while it is under IO, especially as a large number of these
regions are redirtied repeatedly during ongoing operations (e.g.
directory data and index blocks). Hence, to avoid this problem with
atomic writes, we still need asynchronous transactions and
in-memory aggregation of changes.  IOWs, checkpoints are the unit
of atomic write we need to support in XFS.
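
To illustrate the shape of that unit of atomic write (hypothetical
structures for the sake of the example, not the real XFS log code):
transactions commit their changes in memory, and a checkpoint gathers
the latest copy of every dirty region into one vector list that would
then have to go to disk as a single atomic write:

	/* Illustrative only - not actual XFS structures. */
	struct dirty_region {
		unsigned long long	daddr;		/* disk address of the change */
		unsigned int		len;		/* length of the change in bytes */
		void			*data;		/* latest in-memory copy */
	};

	struct checkpoint {
		struct dirty_region	*regions;	/* the vectors of one atomic write */
		unsigned int		nr_regions;	/* thousands to hundreds of thousands */
		unsigned long long	total_bytes;	/* up to ~100MB at the top end */
	};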

We can limit the size of checkpoints in XFS without too much
trouble, either by amount of data or number of iovecs, but that
comes at a performance cost. To maintain current levels of
performance we need a decent amount of in-memory change aggregation,
and hence we are going to need - at minimum - thousands of vectors
in each atomic write. I'd prefer tens of thousands to hundreds of
thousands of vectors, because that's our typical unit of "atomic
write" at current performance levels, but several thousand vectors
and tens of MB is sufficient to start with....
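
For illustration, capping a checkpoint against limits a device might
advertise would look something like the helper below (hypothetical
names, reusing the sketch above). The smaller the advertised limits,
the earlier we have to close a checkpoint and the less in-memory
aggregation we get - which is exactly where the performance cost
comes from:

	/* Hypothetical limits an atomic-write capable device could advertise. */
	struct atomic_limits {
		unsigned int		max_vecs;	/* e.g. several thousand */
		unsigned long long	max_bytes;	/* e.g. tens of MB */
	};

	/* Would adding this region still fit into a single atomic write? */
	static int region_fits(const struct checkpoint *cp,
			       const struct dirty_region *r,
			       const struct atomic_limits *lim)
	{
		return cp->nr_regions + 1 <= lim->max_vecs &&
		       cp->total_bytes + r->len <= lim->max_bytes;
	}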

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
