Re: [PATCH] xfsprogs: Issue smaller discards at mkfs

From: Dave Chinner <david@fromorbit.com>
To: Keith Busch <keith.busch@intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
	Eric Sandeen <sandeen@sandeen.net>,
	linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfsprogs: Issue smaller discards at mkfs
Date: Fri, 27 Oct 2017 09:24:50 +1100
Message-ID: <20171026222450.GD3666@dastard>
In-Reply-To: <20171026212414.GA30535@localhost.localdomain>

On Thu, Oct 26, 2017 at 03:24:15PM -0600, Keith Busch wrote:
> On Thu, Oct 26, 2017 at 12:59:23PM -0700, Darrick J. Wong wrote:
> > 
> > Sure, but now you have to go fix mke2fs and everything /else/ that
> > issues BLKDISCARD (or FALLOC_FL_PUNCH) on a large file / device, and
> > until you fix every program to work around this weird thing in the
> > kernel there'll still be someone somewhere with this timeout problem...
> 
> e2progs already splits large discards in a loop. ;)
> 
> > ...so I started digging into what the kernel does with a BLKDISCARD
> > request, which is to say that I looked at blkdev_issue_discard.  That
> > function uses blk_*_plug() to wrap __blkdev_issue_discard, which in turn
> > splits the request into a chain of UINT_MAX-sized struct bios.
> > 
> > 128G's worth of 4G ios == 32 chained bios.
> > 
> > 2T worth of 4G ios == 512 chained bios.
> > 
> > So now I'm wondering, is the problem more that the first bio in the
> > chain times out because the last one hasn't finished yet, so the whole
> > thing gets aborted because we chained too much work together?
> 
> You're sort of on the right track. The timeouts are set on an individual
> request in the chain rather than one timeout for the entire chain.
> 
> All the bios in the chain get turned into 'struct request' and sent
> to the low-level driver. The driver calls blk_mq_start_request before
> sending to hardware. That starts the timer on _that_ request,
> independent of the other requests in the chain.
> 
> NVMe supports very large queues. A 4TB discard becomes 1024 individual
> requests started at nearly the same time. The last ones in the queue are
> the ones that risk timeout.

And that's just broken when it comes to requests that might take
several seconds to run.  This is a problem the kernel needs to fix -
it's not something we should be working around in userspace.
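
To make the quoted arithmetic concrete, here is a rough model of the
per-request timers described above. It is only a sketch: the 4 GiB
request size follows the figures in the quoted text, while the 50 ms
service time, the 30 s timeout, and the assumption that the device
completes discards strictly one after another are illustrative
guesses, not measurements of any real drive or kernel.

```python
GIB = 2**30
TIB = 2**40


def chained_requests(discard_bytes, per_request_bytes=4 * GIB):
    """Number of ~4 GiB requests one large discard is split into."""
    return -(-discard_bytes // per_request_bytes)  # ceiling division


def timed_out(n_requests, service_ms, timeout_ms):
    """Indices of requests whose own timer expires before they complete.

    Assumed model: every request is started (and its timer armed) at
    t = 0, and the device finishes them serially, service_ms apiece.
    """
    return [i for i in range(n_requests)
            if (i + 1) * service_ms > timeout_ms]


if __name__ == "__main__":
    n = chained_requests(4 * TIB)
    late = timed_out(n, service_ms=50, timeout_ms=30_000)
    print(f"{n} requests, {len(late)} exceed the timeout, "
          f"first at index {late[0]}")
```

Under those assumed numbers the early requests finish comfortably and
only the tail of the queue expires, which is the behaviour described
in the quoted text.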

I can't wait to see how badly running fstrim on one of those devices
screws them up....

> When we're doing read/write, latencies at the same depth are well within
> tolerance, and high queue depths are good for throughput. When doing
> discard, though, tail latencies fall outside the timeout tolerance at
> the same queue depth.

Yup, because most SSDs have really shit discard implementations -
nobody who "reviews" SSDs looks at the performance aspect of discard
and so it doesn't get publicly compared against other drives like
read/write IO performance does. IOWs, discard doesn't sell devices,
so it never gets fixed or optimised.

Hardware quirks should be dealt with by the kernel, not userspace.
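
For reference, the userspace workaround being argued about (issuing
one large discard as a loop of smaller ones, as the quoted text says
the ext tools already do) looks roughly like the sketch below. It is
not the xfsprogs patch. The helper names and the 2 GiB step are
hypothetical; BLKDISCARD and its two-u64 {start, length} argument come
from <linux/fs.h>.

```python
import fcntl
import struct

BLKDISCARD = 0x1277  # _IO(0x12, 119) in <linux/fs.h>


def discard_chunks(start, length, step):
    """Yield (offset, nbytes) pairs covering [start, start + length)."""
    if step <= 0:
        raise ValueError("step must be positive")
    end = start + length
    offset = start
    while offset < end:
        nbytes = min(step, end - offset)
        yield offset, nbytes
        offset += nbytes


def discard_in_chunks(fd, start, length, step=2 * 2**30):
    """Issue one BLKDISCARD per chunk instead of one for the whole range.

    fd must be an open, writable block device; the step size is an
    arbitrary illustrative value, not a recommendation.
    """
    for offset, nbytes in discard_chunks(start, length, step):
        # The ioctl takes a uint64_t range[2] = {start, length}.
        fcntl.ioctl(fd, BLKDISCARD, struct.pack("QQ", offset, nbytes))
```

Each ioctl returns before the next is issued, so only one chunk's
worth of requests is in flight at a time. Every other program that
discards large ranges would need the same loop, which is the
objection raised above.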

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
