public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Michel Lespinasse <walken@google.com>
Cc: Alex Shi <alex.shi@intel.com>, Ingo Molnar <mingo@kernel.org>,
	David Howells <dhowells@redhat.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Thomas Gleixner <tglx@linutronix.de>,
	Yuanhan Liu <yuanhan.liu@linux.intel.com>,
	Rik van Riel <riel@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 11/12] rwsem: wake all readers when first waiter is a reader
Date: Tue, 12 Mar 2013 13:36:58 +1100	[thread overview]
Message-ID: <20130312023658.GH21651@dastard> (raw)
In-Reply-To: <CANN689FF+fjphwNqRz_ixpo4hW-rkXq6fbmJf7xDacBbP48BXw@mail.gmail.com>

On Sun, Mar 10, 2013 at 10:17:42PM -0700, Michel Lespinasse wrote:
> On Sun, Mar 10, 2013 at 5:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Mar 08, 2013 at 05:20:34PM -0800, Michel Lespinasse wrote:
> >> On Fri, Mar 8, 2013 at 4:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > Isn't this a significant change of semantics for the rwsem? i.e.
> >> > that read lock requests that come after a write lock request now
> >> > jump ahead of the write lock request? i.e.the write lock request is
> >> > no longer a barrier in the queue?
> >>
> >> Yes, I am allowing readers to skip ahead of writers in the queue (but
> >> only if they can run with another reader that was already ahead).
> >>
> >> I don't see that this is a change of observable semantics for correct
> >> programs. If a reader and a writer both block on the rwsem, how do you
> >> known for sure which one got queued first ? rwsem API doesn't give you
> >> any easy way to know whether a thread is currently queued on the rwsem
> >> (it could also be descheduled before it gets onto the rwsem queue).
> >
> > There are algorithms that rely on write locks to act as read-side
> > barriers to prevent write side starvation. i.e. if you keep queuing
> > up read locks, the writer never gets the lock, thereby starving the
> > writer.
> 
> What I propose actually isn't as bad as a reader preference rwlock
> like rwlock_t. When a reader makes it to the head of the queue, all
> readers gets woken. At that point there are only writers left in the
> queue, and new readers will get queued after them and won't be able to
> skip over them (since they have no earlier reader to join up with).
> So, I am not throwing reader/writer fairness out the window here.

OK, but....

> > My point isn't that XFS gets the serialisation wrong - my point is
> > that the change of queuing order will change the userspace visible
> > behaviour of the filesystem *significantly*.
> >
> > For example: - direct IO only takes read locks, while truncate takes
> > a write lock.  Right now we can drop a truncate, preallocation or
> > hole punch operation into a stream of concurrent direct IO and have
> > it execute almost immediately - the barrier mechanism in the rwsem
> > ensures that it will be executed ahead of any future IO that is
> > issued by the application. With your proposed change, that operation
> > won't take place until all current *and all future* IOs stop and the
> > read lock is no longer held by anyone.
> 
> You actually wouldn't get such starvation with my proposal.
> 
> What you might see would be:
> 
> - Assuming you have up to N concurrent reads in progress, a writer
> might see up to 2N-1 readers proceed in front of it. This limit is
> reached if you first had N-1 readers grabbing the rwsem with an empty
> queue, then another writer W1 got queued,

So something like

	RRRRRRRRRRRRRW1

> then a reader RN (note that
> it will get queued after W1 and not run concurrently with the existing
> N-1 readers), then our writer W2 gets queued.

So:

	RRRRRRRRRRRRRW1rrrrrrrrrrrrW2
>
> As our N-1 readers
> complete their IO operations, they might get queued again after W2,

So:
	W1rrrrrrrrrrrrW2RRRRRRRRRRRRR

> and then skip ahead of W2 when RN reaches the front of the queue. So,
> W2 is still guaranteed to run eventually, but with a worst case
> latency that is ~2x worse than before

So, when W1 is released, the queue is treated as though it was:

	rrrrrrrrrrrrRRRRRRRRRRRRRW2

yes? Ok, so I can see where your latency figure comes from, but it's
still a change of ordering in that W2 is no longer a barrier to the
reads queued after it.

And if we extend that to WN, we have an interleaved queue
similar to this:

	W1R1W2R2W3R3.....WnRm

By my reading of the algorithm you are proposing, after W1 is
released, we end up with the queue being treated like this:

	R1R2R3....RmW2W3....Wn

Right?

If so, that is definitely a change of lock ordering semantics -
writes do not have barrier properties anymore.

> - since all readers are woken at once, you might see burst of direct
> IO operations followed by bursts of truncate operations, instead of
> having them interleaved in smaller groups as before.

And this will result in the same application IO pattern giving
vastly different results. e.g. a read IO promoted ahead of a
truncate will now return data instead of -1 (beyond EOF).

> I'm not sure if these characteristics are good enough for the XFS use
> case. It seems tougher than many rwsem use cases, since the benefits
> of increased reader parallelism are not as clear here (i.e. the IO

Well defined, predictable, deterministc ordering of operations takes
far more precedence over parallelism when it comes to filesystem
behaviour. The rwsems already have enough parallelism to allow us to
drive more than 2 million 4k IOPS to a single file via
multi-threaded direct IO(*), so we aren't limited by parallelism and
concurrency in rwsems.

> So if
> this explanation still didn't made you comfortable with the proposal,
> I will rework it to avoid the queue reordering.

Not really, but I'm still not sure I fully understand the different
queuing/queue jumping semantics fully....

Cheers,

Dave.
> 
> -- 
> Michel "Walken" Lespinasse
> A program is never fully debugged until the last user dies.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> -- 
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.
> 
> 

-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2013-03-12  2:37 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-03-06 23:21 [PATCH 00/12] rwsem fast-path write lock stealing Michel Lespinasse
2013-03-06 23:21 ` [PATCH 01/12] rwsem: make the waiter type an enumeration rather than a bitmask Michel Lespinasse
2013-03-13 21:33   ` Rik van Riel
2013-03-06 23:21 ` [PATCH 02/12] rwsem: shorter spinlocked section in rwsem_down_failed_common() Michel Lespinasse
2013-03-06 23:21 ` [PATCH 03/12] rwsem: move rwsem_down_failed_common code into rwsem_down_{read,write}_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 04/12] rwsem: simplify rwsem_down_read_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 05/12] rwsem: simplify rwsem_down_write_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 06/12] rwsem: more agressive lock stealing in rwsem_down_write_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 07/12] rwsem: use cmpxchg for trying to steal write lock Michel Lespinasse
2013-03-06 23:21 ` [PATCH 08/12] rwsem: avoid taking wait_lock in rwsem_down_write_failed Michel Lespinasse
2013-03-06 23:21 ` [PATCH 09/12] rwsem: skip initial trylock " Michel Lespinasse
2013-03-06 23:21 ` [PATCH 10/12] rwsem-spinlock: wake all readers when first waiter is a reader Michel Lespinasse
2013-03-06 23:21 ` [PATCH 11/12] rwsem: " Michel Lespinasse
2013-03-09  0:32   ` Dave Chinner
2013-03-09  1:20     ` Michel Lespinasse
2013-03-11  0:16       ` Dave Chinner
2013-03-11  5:17         ` Michel Lespinasse
2013-03-12  2:36           ` Dave Chinner [this message]
2013-03-12  6:43             ` Michel Lespinasse
2013-03-13  3:23               ` Dave Chinner
2013-03-13 11:03                 ` Michel Lespinasse
2013-03-14  2:00                 ` Peter Hurley
2013-03-19  1:17                   ` Dave Chinner
2013-03-19 23:48                     ` Michel Lespinasse
2013-03-11  7:50         ` Ingo Molnar
2013-03-11 20:36   ` Peter Hurley
2013-03-14  7:03     ` Michel Lespinasse
2013-03-14 11:39       ` Peter Hurley
2013-03-14 15:20         ` Michel Lespinasse
2013-03-06 23:21 ` [PATCH 12/12] x86 rwsem: avoid taking slow path when stealing write lock Michel Lespinasse

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130312023658.GH21651@dastard \
    --to=david@fromorbit.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=alex.shi@intel.com \
    --cc=dhowells@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=riel@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=walken@google.com \
    --cc=yuanhan.liu@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox