Distributed Replicated Block Device (DRBD) development
 help / color / mirror / Atom feed
From: Tejun Heo <tj@kernel.org>
To: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	drbd-dev@lists.linbit.com, Christoph Hellwig <hch@lst.de>,
	Philipp Reisner <philipp.reisner@linbit.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [Drbd-dev] FLUSH/FUA documentation & code discrepancy
Date: Mon, 10 Sep 2012 15:54:42 -0700	[thread overview]
Message-ID: <20120910225442.GE7677@google.com> (raw)
In-Reply-To: <20120907084221.GD7028@soda.linbit>

Hello, Lars.

On Fri, Sep 07, 2012 at 10:42:21AM +0200, Lars Ellenberg wrote:
> We have a kernel thread that is receiving data blocks,
> and some "boundary" information (in the sense that between such
> boundaries, we have a reorder domain, where requests may reorder freely,
> but no requests may be reordered across such boundaries).

What purpose does this boundary serve?  Why is it necessary?  Which
driver is this?

> This same thread submits the assembled bios.
> 
> With the old, stronger, BIO_RW_BARRIER implementation,
> if it was supported, we could just submit the first bio of a reorder
> domain (plus some special cases) with that flag,
> and could keep receiving -> assembling -> submitting.

Yes, but the actual request processing would continue to stall as
block layer would have been draining requests continuously.

> Now, we assumed that with FLUSH/FUA, we can do the same.
> And we could, as long as it is supported through the whole stack.
> 
> But if it is not supported at some level in the stack, we must first drain.
> 
> And since it is all "transparent", we just cannot determine
> if the whole stack does or does not support it.
> 
> So we have to drain always.

The driver was hitching on BARRIER for draining.  As that's gone now,
if you want the same behavior, the driver would need to drain itself.

> We did not realize that. 
> In certain cases, where we submitted in the right order, and even
> indicated what we thought would amount to at least a "soft barrier"
> (reorder boundary) for the elevator, we ended up with data corruption
> because the elevator never sees these indicators, and reorders.
> 
> Fine, our mistake/misunderstanding of the drain requirement.
> That's fixed now, we do always drain
> (unless specifically configured not to, where the admin takes the blame
> if that does not work on his stack).
> 
> 
> To always drain is also a performance hit, as we would rather keep
> receiving data and assembling bios and submitting them.

Is the performance hit measureable?  Block BARRIER support had some
optimizations but it still had to constantly drain all the same.

> We can possibly work around that by introducing an additional submitter thread,
> or at least our own list where we queue assembled bios until the lower
> level device queue drains.
> 
> But we'd rather have the elevator see the FLUSH/FUA,
> and treat them as at least a soft barrier/reorder boundary.
> 
> I may be wrong here, but all the necessary bits for this seem to be in
> place already, if the information would even reach the elevator in one
> way or other, and not be completely stripped away early.
> 
> What would you rather see, the elevator recognizing reorder boundaries?
> Or additional higher level queueing and extra thread/work queue/whatever?
> 
> Both are fine with me, I'm just asking for an opinion.

First of all, using FLUSH/FUA for such purpose is an error-prone
abuse.  You're trying to exploit an implementation detail which may
change at any time.  I think what you want is to be able to specify
REQ_SOFTBARRIER on bio submission, which shouldn't be too hard but I'm
still lost why this is necessary.  Can you please explain it a bit
more?

Thanks.

-- 
tejun

  reply	other threads:[~2012-09-10 22:54 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-09-04 12:32 [Drbd-dev] FLUSH/FUA documentation & code discrepancy Philipp Reisner
2012-09-04 22:46 ` Tejun Heo
2012-09-05  8:44   ` Philipp Reisner
2012-09-05  8:49     ` Tejun Heo
2012-09-05 10:07       ` Lars Ellenberg
2012-09-06 21:29         ` Tejun Heo
2012-09-07  8:42           ` Lars Ellenberg
2012-09-10 22:54             ` Tejun Heo [this message]
2012-09-10 23:06               ` Tejun Heo
2012-09-10 23:12                 ` Kent Overstreet
2012-09-10 23:31                 ` Kent Overstreet
2012-09-11  5:58                   ` NeilBrown
2012-09-11  8:25                     ` Lars Ellenberg
2012-09-11 14:41                       ` Vivek Goyal
2012-09-12 18:58                       ` Tejun Heo
2012-09-12 23:12                         ` Joseph Glanville
2012-09-12 23:20                           ` Tejun Heo
2012-09-12 23:53                             ` Joseph Glanville
2012-09-13  0:17                               ` Joseph Glanville
2012-09-13  3:10                                 ` Joseph Glanville
2012-09-13 19:25                                   ` Tejun Heo
2012-09-11 14:34                 ` Vivek Goyal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120910225442.GE7677@google.com \
    --to=tj@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=drbd-dev@lists.linbit.com \
    --cc=hch@lst.de \
    --cc=lars.ellenberg@linbit.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=philipp.reisner@linbit.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox