public inbox for linux-kernel@vger.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: Michel Lespinasse <walken@google.com>
Cc: Alex Shi <alex.shi@intel.com>, Ingo Molnar <mingo@kernel.org>,
	David Howells <dhowells@redhat.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Thomas Gleixner <tglx@linutronix.de>,
	Yuanhan Liu <yuanhan.liu@linux.intel.com>,
	Rik van Riel <riel@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
Date: Wed, 13 Mar 2013 14:23:34 +1100	[thread overview]
Message-ID: <20130313032334.GU21651@dastard> (raw)
In-Reply-To: <CANN689Hu4S6hQaZ0tytNuToCpwtdJmyisyqyrDrNK+buJXTsKA@mail.gmail.com>

On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
> Hi Dave,
> 
> On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> >> - since all readers are woken at once, you might see burst of direct
> >> IO operations followed by bursts of truncate operations, instead of
> >> having them interleaved in smaller groups as before.
> >
> > And this will result in the same application IO pattern giving
> > vastly different results. e.g. a read IO promoted ahead of a
> > truncate will now return data instead of -1 (beyond EOF).
> 
> I will reply to this part first, as I think this gives a more concrete
> anchor to the discussion.
> 
> The crucial point is that the truncate system call hasn't completed
> yet - that thread is still queued in the rwsem.
> 
> You still have the guarantee that the truncate will complete
> eventually (even if there is a never-ending stream of IOs), and that
> all IO system calls that start after the truncate system call
> completes will see the file as truncated.

Sure, but the problem is not about when the syscall completes - the
problem is that you are changing where in the pipeline of IO the
truncate is initially executed.  i.e. ordering is not defined by
when an operation completes, but by the order in which the queue is
processed after a blocking operation completes. That is not when the
syscall completes, but when the filesystem drops the exclusive lock.

From the perspective of a single-threaded userspace application, or
of multithreaded apps with their own userspace locking, truncates
complete when either the syscall completes or some time later when
the application drops its lock. Either way, we're looking at
completion time serialisation, and in that case what XFS or the
kernel does simply doesn't matter.

However, if we are talking about userspace applications that use
lockless IO algorithms, or kernel-side applications (like knfsd)
that are exposed directly to the filesystem locking semantics, then
serialisation behaviour is actually defined by the filesystem's
submission-side locking behaviour. There is no external
serialisation of IO completion, so serialisation and ordering are
defined (for XFS) solely by the rwsem behaviour.

> You don't have guarantees about which system call will "appear to run
> before the other" if the execution times of the two system calls
> overlap each other, but you actually never had such a guarantee from a
> correctness perspective (i.e. you could never guarantee which of the
> two would get queued on the rwsem ahead of the other).

Sure, but as long as the submission side ordering is deterministic,
it doesn't matter.

> > Ok, so I can see where your latency figure comes from, but it's
> > still a change of ordering in that W2 is no longer a barrier to the
> > reads queued after it.
> 
> My claim is that it's not a visible change from a correctness
> perspective

I am not arguing that it is incorrect. I'm arguing that the change
of ordering semantics breaks assumptions a lot of code makes about
how these locks work.

> > similar to this:
> >
> >         W1R1W2R2W3R3.....WnRm
> >
> > By my reading of the algorithm you are proposing, after W1 is
> > released, we end up with the queue being treated like this:
> >
> >         R1R2R3....RmW2W3....Wn
> >
> > Right?
> 
> Yes, if what you are representing is the state of the queue at a given
> point of time (which implies R1..Rm must be distinct threads, not just
> the same thread doing repeated requests).

Yup, that's very typical.

> As new requests come in over time, one important point to remember is
> that each writer will see at most one batch of readers wake up ahead
> of it (though this batch may include readers that were originally
> queued both before and after it).

And that's *exactly* the problem with the changes you are proposing
- rwsems will no longer provide strongly ordered write vs read
barrier semantics.

> I find the name 'barrier' actually confusing when used to describe
> synchronous operations.  To me a barrier is usually between
> asynchronous operations, and it is well defined which operations
> are ahead or behind of the barrier (either because they were
> queued by the same thread, or they were queued by different
> threads which may have synchronized together some other way).

When you have hundreds or thousands of threads doing IO to the one
file, it doesn't matter if the IO is issued synchronously or
asynchronously by the threads - you simply have a huge amount of
data IO concurrency and very, very deep pipelines.

Inserting a metadata modification (truncate, preallocation,
holepunch, etc) into that pipeline currently causes all new
submissions to queue behind the metadata modification, waits for
everything submitted before the metadata modification to complete
and then runs the metadata modification. Once it completes, it then
allows everything queued up to the next metadata modification to
run....

That matches your definition of a barrier pretty well, I think.

> But in this case, we have two
> synchronous operations with overlapping execution times - they don't
> have a way to know which one is 'supposed to' be ahead of the other.
> The two threads can't observe their relative ordering since they are
> blocked during the operation...

We don't care about the ordering between multiple concurrent
metadata modifications - what matters is whether the ongoing data IO
around them is ordered correctly.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 30+ messages
2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
2013-03-06 23:21 ` [PATCH 01/12] rwsem: make the waiter type an enumeration rather than a bitmask Michel Lespinasse
2013-03-13 21:33   ` Rik van Riel
2013-03-06 23:21 ` [PATCH 02/12] rwsem: shorter spinlocked section in rwsem_down_failed_common() Michel Lespinasse
2013-03-06 23:21 ` [PATCH 03/12] rwsem: move rwsem_down_failed_common code into rwsem_down_{read,write}_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 04/12] rwsem: simplify rwsem_down_read_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 05/12] rwsem: simplify rwsem_down_write_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 06/12] rwsem: more agressive lock stealing in rwsem_down_write_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 07/12] rwsem: use cmpxchg for trying to steal write lock Michel Lespinasse
2013-03-06 23:21 ` [PATCH 08/12] rwsem: avoid taking wait_lock in rwsem_down_write_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 09/12] rwsem: skip initial trylock " Michel Lespinasse
2013-03-06 23:21 ` [PATCH 10/12] rwsem-spinlock: wake all readers when first waiter is a reader Michel Lespinasse
2013-03-06 23:21 ` [PATCH 11/12] rwsem: " Michel Lespinasse
2013-03-09  0:32   ` Dave Chinner
2013-03-09  1:20     ` Michel Lespinasse
2013-03-11  0:16       ` Dave Chinner
2013-03-11  5:17         ` Michel Lespinasse
2013-03-12  2:36           ` Dave Chinner
2013-03-12  6:43             ` Michel Lespinasse
2013-03-13  3:23               ` Dave Chinner [this message]
2013-03-13 11:03                 ` Michel Lespinasse
2013-03-14  2:00                 ` Peter Hurley
2013-03-19  1:17                   ` Dave Chinner
2013-03-19 23:48                     ` Michel Lespinasse
2013-03-11  7:50         ` Ingo Molnar
2013-03-11 20:36   ` Peter Hurley
2013-03-14  7:03     ` Michel Lespinasse
2013-03-14 11:39       ` Peter Hurley
2013-03-14 15:20         ` Michel Lespinasse
2013-03-06 23:21 ` [PATCH 12/12] x86 rwsem: avoid taking slow path when stealing write lock Michel Lespinasse
