From: Peter Hurley <peter@hurleysoftware.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Michel Lespinasse <walken@google.com>,
Alex Shi <alex.shi@intel.com>, Ingo Molnar <mingo@kernel.org>,
David Howells <dhowells@redhat.com>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Thomas Gleixner <tglx@linutronix.de>,
Yuanhan Liu <yuanhan.liu@linux.intel.com>,
Rik van Riel <riel@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
Date: Wed, 13 Mar 2013 22:00:51 -0400
Message-ID: <1363226451.25976.170.camel@thor.lan>
In-Reply-To: <20130313032334.GU21651@dastard>

On Wed, 2013-03-13 at 14:23 +1100, Dave Chinner wrote:
> On Mon, Mar 11, 2013 at 11:43:34PM -0700, Michel Lespinasse wrote:
> > Hi Dave,
> >
> > On Mon, Mar 11, 2013 at 7:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> > >> - since all readers are woken at once, you might see bursts of direct
> > >> IO operations followed by bursts of truncate operations, instead of
> > >> having them interleaved in smaller groups as before.
> > >
> > > And this will result in the same application IO pattern giving
> > > vastly different results. e.g. a read IO promoted ahead of a
> > > truncate will now return data instead of -1 (beyond EOF).
> >
> > I will reply to this part first, as I think this gives a more concrete
> > anchor to the discussion.
> >
> > The crucial point is that the truncate system call hasn't completed
> > yet - that thread is still queued in the rwsem.
> >
> > You still have the guarantee that the truncate will complete
> > eventually (even if there is a never-ending stream of IOs), and that
> > all IO system calls that start after the truncate system call
> > completes will see the file as truncated.
>
> Sure, but the problem is not about when the syscall completes - the
> problem is that you are changing where in the pipeline of IO the
> truncate is initially executed. i.e. ordering is not defined by
> when an operation completes, but by the order in which the queue is
> processed after a blocking operation completes. That is not when the
> syscall completes, but when the filesystem drops the exclusive lock.
>
> From a single threaded userspace application perspective or
> multithreaded apps with their own userspace locking, truncates
> complete when either the syscall completes or some time after when
> the application drops its lock. Either way, we're looking at
> completion time serialisation, and in that case what XFS or the
> kernel does simply doesn't matter.
>
> However, if we are talking about userspace applications that use
> lockless IO algorithms or kernel side applications (like knfsd) that
> are exposed directly to the filesystem locking semantics,
> serialisation behaviour is actually defined by filesystem's
> submission side locking behaviour. There is no external
> serialisation of IO completion, so serialisation and ordering is
> defined (for XFS) solely by the rwsem behaviour.
>
> > You don't have guarantees about which system call will "appear to run
> > before the other" if the execution times of the two system calls
> > overlap each other, but you actually never had such a guarantee from a
> > correctness perspective (i.e. you could never guarantee which of the
> > two would get queued on the rwsem ahead of the other).
>
> Sure, but as long as the submission side ordering is deterministic,
> it doesn't matter.
>
> > > Ok, so I can see where your latency figure comes from, but it's
> > > still a change of ordering in that W2 is no longer a barrier to the
> > > reads queued after it.
> >
> > My claim is that it's not a visible change from a correctness
> > perspective
>
> I am not arguing that it is incorrect. I'm arguing that the change
> of ordering semantics breaks assumptions a lot of code makes about
> how these locks work.
>
> > > similar to this:
> > >
> > > W1R1W2R2W3R3.....WnRm
> > >
> > > By my reading of the algorithm you are proposing, after W1 is
> > > released, we end up with the queue being treated like this:
> > >
> > > R1R2R3....RmW2W3....Wn
> > >
> > > Right?
> >
> > Yes, if what you are representing is the state of the queue at a given
> > point of time (which implies R1..Rm must be distinct threads, not just
> > the same thread doing repeated requests).
>
> Yup, that's very typical.
>
> > As new requests come in over time, one important point to remember is
> > that each writer will see at most one batch of readers wake up ahead
> > of it (though this batch may include readers that were originally
> > queued both before and after it).
>
> And that's *exactly* the problem with the changes you are proposing
> - rwsems will no longer provide strongly ordered write vs read
> barrier semantics.
>
> > I find the name 'barrier' actually confusing when used to describe
> > synchronous operations. To me a barrier is usually between
> > asynchronous operations, and it is well defined which operations
> > are ahead or behind of the barrier (either because they were
> > queued by the same thread, or they were queued by different
> > threads which may have synchronized together some other way).
>
> When you have hundreds or thousands of threads doing IO to the one
> file, it doesn't matter if the IO is issued synchronously or
> asynchronously by the threads - you simply have a huge amount of
> data IO concurrency and very, very deep pipelines.
>
> Inserting a metadata modification (truncate, preallocation,
> holepunch, etc) into that pipeline currently causes all new
> submissions to queue behind the metadata modification, waits for
> everything submitted before the metadata modification to complete
> and then runs the metadata modification. Once it completes, it then
> allows everything queued up to the next metadata modification to
> run....
>
> That matches your definition of a barrier pretty well, I think.
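The barrier behaviour Dave describes -- each metadata modification holding back everything queued after it -- corresponds to strict FIFO wake-up, where only the contiguous run of readers at the head of the queue is woken together. The same kind of toy model (Python, illustrative only) makes the contrast visible:

```python
# Toy model of strict FIFO wake-up: only readers adjacent at the
# head of the queue share the lock, so every writer acts as a
# barrier for everything queued behind it.

def fifo_wake_order(queue):
    """Return lock-grant order under strict FIFO semantics."""
    order, q = [], list(queue)
    while q:
        if q[0][0] == 'R':
            batch = []
            while q and q[0][0] == 'R':  # only a contiguous reader run
                batch.append(q.pop(0))
            order.append(batch)
        else:
            order.append([q.pop(0)])     # writer runs alone, in place
    return order

print(fifo_wake_order(['W1', 'R1', 'W2', 'R2', 'W3', 'R3']))
# [['W1'], ['R1'], ['W2'], ['R2'], ['W3'], ['R3']]
```

With an interleaved queue, nothing is reordered: every waiter runs in arrival order, which is the "pipeline" behaviour described above.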
>
> > But in this case, we have two
> > synchronous operations with overlapping execution times - they don't
> > have a way to know which one is 'supposed to' be ahead of the other.
> > The two threads can't observe their relative ordering since they are
> > blocked during the operation...
>
> We don't care about the ordering between multiple concurrent
> metadata modifications - what matters is whether the ongoing data IO
> around them is ordered correctly.

Dave,

The point that Michel is making is that there never was any ordering
guarantee by rwsem. It's an illusion.

The reason is simple: to even get to the lock, the calling context has
to be able to sleep. So every submission that you believe is ordered
is, by its very nature, __not ordered__, even when issued by kernel
code.

Why? Because any thread on its way to claim the lock (reader or
writer) can be preempted in favor of some other task, thus delaying
the submission of whatever i/o you believed to be ordered.

So just to reiterate: there is no 'queue' and no 'barrier'. The
guarantees that rwsem makes are:

1. Multiple readers can own the lock.
2. Only a single writer can own the lock.
3. Readers will not starve writers.
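Those three guarantees can be sketched in miniature (Python threading, illustrative only -- the kernel's rwsem uses a lock-free fast path and looks nothing like this). Here guarantee 3 is implemented by making new readers wait whenever a writer is queued; the kernel achieves the same end differently, but the observable contract is just the three points above:

```python
import threading

class RWSem:
    """Minimal sketch of a reader/writer semaphore providing:
    1. multiple readers may own the lock,
    2. only a single writer may own the lock,
    3. readers do not starve writers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False
        self._waiting_writers = 0

    def down_read(self):
        with self._cond:
            # Guarantee 3: a queued writer blocks *new* readers.
            while self._writer or self._waiting_writers:
                self._cond.wait()
            self._readers += 1          # Guarantee 1: shared ownership

    def up_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def down_write(self):
        with self._cond:
            self._waiting_writers += 1
            while self._writer or self._readers:
                self._cond.wait()
            self._waiting_writers -= 1
            self._writer = True         # Guarantee 2: exclusive owner

    def up_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

Note what the sketch does *not* promise: nothing about the order in which contending threads reach down_read()/down_write() in the first place, which is exactly the point above.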

> On Tue, 2013-03-12 at 13:36 +1100, Dave Chinner wrote:
> > On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> > - since all readers are woken at once, you might see bursts of direct
> > IO operations followed by bursts of truncate operations, instead of
> > having them interleaved in smaller groups as before.
>
> And this will result in the same application IO pattern giving
> vastly different results. e.g. a read IO promoted ahead of a
> truncate will now return data instead of -1 (beyond EOF).

The 'same application IO pattern' will give you 'vastly different
results' with the __current__ implementation on any given
machine/day/time/run. The reason is that you cannot control which cpu
submits which IO, or which cpu is delayed because it's busy handling
some network interrupt, etc.

Where lock policy can have a significant impact is on performance. But
predicting that impact is difficult -- it's better just to measure.

It's not my intention to convince you (or anyone else) that there
should only be One True Rwsem, because I don't believe that. But I
didn't want the impression to persist that rwsem does anything more
than implement a fair reader/writer semaphore.

Regards,
Peter Hurley