* rwsem: down_read_unfair() proposal
@ 2010-05-05 3:20 Michel Lespinasse
2010-05-05 10:03 ` David Howells
2010-05-05 10:06 ` David Howells
From: Michel Lespinasse @ 2010-05-05 3:20 UTC
To: David Howells, Andrew Morton, Linux-MM; +Cc: Ying Han, LKML
Hi,
I am looking at ways to solve the following problem:
Some cluster monitoring software we use at Google periodically accesses
files such as /proc/<pid>/exe or /proc/<pid>/maps, which requires
acquiring that pid's mmap_sem for read. Sometimes when the machines get
loaded enough, this acquisition can take a long time - typically this
happens when thread A acquires mmap_sem for read in do_page_fault and
gets blocked (either trying to allocate a page or trying to read from
disk); then thread B tries to acquire mmap_sem for write and gets queued
behind A; then the monitoring software tries to read /proc/<pid>/maps
and gets queued behind B due to the rwsem's fair queuing behavior.
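To make the ordering concrete, here is a hypothetical interleaving (thread
names and timing are invented for illustration):

/*
 * t0: thread A   down_read(&mm->mmap_sem)    granted; A then sleeps on a
 *                                            page allocation or disk read
 * t1: thread B   down_write(&mm->mmap_sem)   queued behind A
 * t2: monitor    down_read(&mm->mmap_sem)    queued behind B by fairness,
 *                                            even though the semaphore is
 *                                            only read-held at this point
 *
 * With down_read_unfair() at t2, the monitor would be granted the lock
 * immediately, concurrently with A.
 */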
We have been using patches that address these issues by allowing the
/proc/<pid>/exe and /proc/<pid>/maps code paths to acquire the mmap_sem
for reading in an unfair way, thus allowing the monitoring software to
bypass thread B and acquire mmap_sem concurrently with thread A.
This was easy to implement with the generic rwsem, and looks like it's
doable with the x86 rwsem implementation as well in a way that would only
involve changes to the rwsem spinlock-protected slow paths in lib/rwsem.c.
We are still working on that code, but I thought we should ask first how
the developer community feels about the general idea.
For reference, here is one patch we have (against 2.6.33) using
down_read_unfair() in this way (but with no x86-specific rwsem
implementation yet):
Author: Ying Han <yinghan@google.com>
Introduce down_read_unfair()
With down_read_unfair(), a reader does not wait for the lock even when
writers are queued ahead of it. Apply it to the maps and exe files in
procfs, which monitoring programs read frequently.
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 58324c2..d51bc55 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1367,7 +1367,7 @@ struct file *get_mm_exe_file(struct mm_struct *mm)
/* We need mmap_sem to protect against races with removal of
* VM_EXECUTABLE vmas */
- down_read(&mm->mmap_sem);
+ down_read_unfair(&mm->mmap_sem);
exe_file = mm->exe_file;
if (exe_file)
get_file(exe_file);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f277c4a..118e0cd 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -119,7 +119,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
mm = mm_for_maps(priv->task);
if (!mm)
return NULL;
- down_read(&mm->mmap_sem);
+ down_read_unfair(&mm->mmap_sem);
tail_vma = get_gate_vma(priv->task);
priv->tail_vma = tail_vma;
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 5d9fd64..2ab484b 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -193,7 +193,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
priv->task = NULL;
return NULL;
}
- down_read(&mm->mmap_sem);
+ down_read_unfair(&mm->mmap_sem);
/* start from the Nth VMA */
for (p = rb_first(&mm->mm_rb); p; p = rb_next(p))
diff --git a/include/linux/rwsem-spinlock.h b/include/linux/rwsem-spinlock.h
index bdfcc25..48199db 100644
--- a/include/linux/rwsem-spinlock.h
+++ b/include/linux/rwsem-spinlock.h
@@ -60,7 +60,8 @@ do { \
__init_rwsem((sem), #sem, &__key); \
} while (0)
-extern void __down_read(struct rw_semaphore *sem);
+extern void __down_read_fair(struct rw_semaphore *sem);
+extern void __down_read_unfair(struct rw_semaphore *sem);
extern int __down_read_trylock(struct rw_semaphore *sem);
extern void __down_write(struct rw_semaphore *sem);
extern void __down_write_nested(struct rw_semaphore *sem, int subclass);
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index efd348f..65d666e 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -20,6 +20,7 @@ struct rw_semaphore;
#include <linux/rwsem-spinlock.h> /* use a generic implementation */
#else
#include <asm/rwsem.h> /* use an arch-specific implementation */
+#error "Missing down_read_unfair support."
#endif
/*
@@ -28,6 +29,11 @@ struct rw_semaphore;
extern void down_read(struct rw_semaphore *sem);
/*
+ * lock for reading - skip waiting writers
+ */
+extern void down_read_unfair(struct rw_semaphore *sem);
+
+/*
* trylock for reading -- returns 1 if successful, 0 if contention
*/
extern int down_read_trylock(struct rw_semaphore *sem);
diff --git a/kernel/rwsem.c b/kernel/rwsem.c
index cae050b..24578c5 100644
--- a/kernel/rwsem.c
+++ b/kernel/rwsem.c
@@ -21,12 +21,25 @@ void __sched down_read(struct rw_semaphore *sem)
might_sleep();
rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
- LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
+ LOCK_CONTENDED(sem, __down_read_trylock, __down_read_fair);
}
EXPORT_SYMBOL(down_read);
/*
+ * lock for reading - skip waiting writers
+ */
+void __sched down_read_unfair(struct rw_semaphore *sem)
+{
+ might_sleep();
+ rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
+
+ LOCK_CONTENDED(sem, __down_read_trylock, __down_read_unfair);
+}
+
+EXPORT_SYMBOL(down_read_unfair);
+
+/*
* trylock for reading -- returns 1 if successful, 0 if contention
*/
int down_read_trylock(struct rw_semaphore *sem)
@@ -112,7 +125,7 @@ void down_read_nested(struct rw_semaphore *sem, int subclass)
might_sleep();
rwsem_acquire_read(&sem->dep_map, subclass, 0, _RET_IP_);
- LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
+ LOCK_CONTENDED(sem, __down_read_trylock, __down_read_fair);
}
EXPORT_SYMBOL(down_read_nested);
@@ -121,7 +134,7 @@ void down_read_non_owner(struct rw_semaphore *sem)
{
might_sleep();
- __down_read(sem);
+ __down_read_fair(sem);
}
EXPORT_SYMBOL(down_read_non_owner);
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
index ccf95bf..8c44c08 100644
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -139,14 +139,14 @@ __rwsem_wake_one_writer(struct rw_semaphore *sem)
/*
* get a read lock on the semaphore
*/
-void __sched __down_read(struct rw_semaphore *sem)
+void __sched __down_read(struct rw_semaphore *sem, int unfair)
{
struct rwsem_waiter waiter;
struct task_struct *tsk;
spin_lock_irq(&sem->wait_lock);
- if (sem->activity >= 0 && list_empty(&sem->wait_list)) {
+ if (sem->activity >= 0 && (unfair || list_empty(&sem->wait_list))) {
/* granted */
sem->activity++;
spin_unlock_irq(&sem->wait_lock);
@@ -161,7 +161,11 @@ void __sched __down_read(struct rw_semaphore *sem)
waiter.flags = RWSEM_WAITING_FOR_READ;
get_task_struct(tsk);
- list_add_tail(&waiter.list, &sem->wait_list);
+ if (unfair) {
+ list_add(&waiter.list, &sem->wait_list);
+ } else {
+ list_add_tail(&waiter.list, &sem->wait_list);
+ }
/* we don't need to touch the semaphore struct anymore */
spin_unlock_irq(&sem->wait_lock);
@@ -180,6 +184,22 @@ void __sched __down_read(struct rw_semaphore *sem)
}
/*
+ * wrapper for fair __down_read
+ */
+void __sched __down_read_fair(struct rw_semaphore *sem)
+{
+ __down_read(sem, 0);
+}
+
+/*
+ * wrapper for unfair __down_read
+ */
+void __sched __down_read_unfair(struct rw_semaphore *sem)
+{
+ __down_read(sem, 1);
+}
+
+/*
* trylock for reading -- returns 1 if successful, 0 if contention
*/
int __down_read_trylock(struct rw_semaphore *sem)
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: rwsem: down_read_unfair() proposal
2010-05-05 3:20 rwsem: down_read_unfair() proposal Michel Lespinasse
@ 2010-05-05 10:03 ` David Howells
2010-05-05 10:36 ` Michel Lespinasse
2010-05-05 10:06 ` David Howells
From: David Howells @ 2010-05-05 10:03 UTC
To: Michel Lespinasse; +Cc: dhowells, Andrew Morton, Linux-MM, Ying Han, LKML
If the system is as heavily loaded as you say, how do you prevent writer
starvation? Or do things just grind along until sufficient threads are queued
waiting for a write lock?
David
* Re: rwsem: down_read_unfair() proposal
2010-05-05 10:03 ` David Howells
@ 2010-05-05 10:36 ` Michel Lespinasse
2010-05-06 23:26 ` Mike Waychison
From: Michel Lespinasse @ 2010-05-05 10:36 UTC
To: David Howells; +Cc: Andrew Morton, Linux-MM, Ying Han, LKML
On Wed, May 05, 2010 at 11:03:40AM +0100, David Howells wrote:
> If the system is as heavily loaded as you say, how do you prevent
> writer starvation? Or do things just grind along until sufficient
> threads are queued waiting for a write lock?
Reader/Writer fairness is not disabled in the general case - it only is
for a few specific readers such as /proc/<pid>/maps. In particular, the
do_page_fault path, which holds a read lock on mmap_sem for potentially long
(~disk latency) periods of time, still uses a fair down_read() call.
In comparison, the /proc/<pid>/maps path which we made unfair does not
normally hold the mmap_sem for very long (it does not end up hitting disk);
so it's been working out well for us in practice.
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: rwsem: down_read_unfair() proposal
2010-05-05 10:36 ` Michel Lespinasse
@ 2010-05-06 23:26 ` Mike Waychison
From: Mike Waychison @ 2010-05-06 23:26 UTC
To: Michel Lespinasse; +Cc: David Howells, Andrew Morton, Linux-MM, Ying Han, LKML
Michel Lespinasse wrote:
> On Wed, May 05, 2010 at 11:03:40AM +0100, David Howells wrote:
>> If the system is as heavily loaded as you say, how do you prevent
>> writer starvation? Or do things just grind along until sufficient
>> threads are queued waiting for a write lock?
>
> Reader/Writer fairness is not disabled in the general case - it only is
> for a few specific readers such as /proc/<pid>/maps. In particular, the
> do_page_fault path, which holds a read lock on mmap_sem for potentially long
> (~disk latency) periods of times, still uses a fair down_read() call.
> In comparison, the /proc/<pid>/maps path which we made unfair does not
> normally hold the mmap_sem for very long (it does not end up hitting disk);
> so it's been working out well for us in practice.
>
FWIW, these sorts of block-ups are usually really pronounced on machines
with harddrives that take _forever_ to respond to SMART commands (which
are done via PIO, and which can serialize many drives when they are
hidden behind a port multiplier). We've seen cases where hard faults
can take unusually long on an otherwise non-busy machine (~10 seconds?).
The other case where we have problems with mmap_sem from a cluster monitoring
perspective occurs when we get blocked up behind a task that is having
problems dying from oom. We have a variety of hacks used internally to
cover these cases, though I think we (David and I?) figured that it'd
make more sense to fix the dependencies on down_read(&current->mmap_sem)
in the do_exit() path. For instance, it really makes no sense to
coredump when we are being oom killed (and thus we should be able to
skip the mmap_sem dependency there..).
Mike Waychison
* Re: rwsem: down_read_unfair() proposal
2010-05-05 3:20 rwsem: down_read_unfair() proposal Michel Lespinasse
2010-05-05 10:03 ` David Howells
@ 2010-05-05 10:06 ` David Howells
2010-05-05 10:48 ` Michel Lespinasse
2010-05-05 11:09 ` David Howells
From: David Howells @ 2010-05-05 10:06 UTC
To: Michel Lespinasse; +Cc: dhowells, Andrew Morton, Linux-MM, Ying Han, LKML
Michel Lespinasse <walken@google.com> wrote:
> and looks like it's doable with the x86 rwsem implementation as well in a
> way that would only involve changes to the rwsem spinlock-protected slow
> paths in lib/rwsem.c .
It's not as easy as it seems. Once an XADD-based rwsem is contended, you
cannot necessarily tell without looking at the queue whether the rwsem is
currently write-locked or read-locked.
David
* Re: rwsem: down_read_unfair() proposal
2010-05-05 10:06 ` David Howells
@ 2010-05-05 10:48 ` Michel Lespinasse
2010-05-05 11:09 ` David Howells
From: Michel Lespinasse @ 2010-05-05 10:48 UTC
To: David Howells; +Cc: Andrew Morton, Linux-MM, Ying Han, LKML
On Wed, May 05, 2010 at 11:06:44AM +0100, David Howells wrote:
> Michel Lespinasse <walken@google.com> wrote:
>
> > and looks like it's doable with the x86 rwsem implementation as well in a
> > way that would only involve changes to the rwsem spinlock-protected slow
> > paths in lib/rwsem.c .
>
> It's not as easy as it seems. Once an XADD-based rwsem is contended, you
> cannot necessarily tell without looking at the queue whether the rwsem is
> currently write-locked or read-locked.
I only said it was doable :) Not done with the implementation yet, but I can
describe the general idea if that helps. The high part of the rwsem is
decremented by two for each thread holding or trying to acquire a write lock;
additionally the high part of the rwsem is decremented by one for the first
thread getting queued. Since queuing is done under a spinlock, it is easy
to decrement only for the first blocked thread there. In down_read_unfair(),
the rwsem value is compared with RWSEM_WAITING_BIAS (-1 << 16 on 32-bit,
-1 << 32 on 64-bit); if it is smaller, the rwsem might be write-owned and we
have to block; otherwise it only has waiters, which we can decide to ignore.
This is the idea in a nutshell.
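A userspace model of that check, as a sketch (the macro names and values
below are mine, chosen to model the proposed scheme; the real kernel slow
paths in lib/rwsem.c will differ in detail):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* 32-bit layout from the scheme above: low 16 bits count active readers;
 * the high part takes -2 per writer (holding or waiting) plus -1, once,
 * for the first queued waiter. */
#define RWSEM_ACTIVE_BIAS	1L
#define RWSEM_WAITING_BIAS	(-(1L << 16))	/* first queued waiter */
#define RWSEM_WRITER_BIAS	(-(2L << 16))	/* per writer, in this scheme */

static atomic_long count;	/* zero: semaphore is free */

/* Unfair read attempt: optimistically take a reader slot, then inspect the
 * high part.  count >= RWSEM_WAITING_BIAS means the high part is 0 or -1,
 * i.e. at worst "waiters but no writer", which an unfair reader ignores. */
static bool down_read_unfair_trylock(void)
{
	long c = atomic_fetch_add(&count, RWSEM_ACTIVE_BIAS) + RWSEM_ACTIVE_BIAS;
	if (c >= RWSEM_WAITING_BIAS)
		return true;			/* granted */
	atomic_fetch_sub(&count, RWSEM_ACTIVE_BIAS);
	return false;				/* a writer may own it: queue */
}

int main(void)
{
	printf("free:          %d\n", down_read_unfair_trylock());	/* 1 */
	atomic_fetch_add(&count, RWSEM_WAITING_BIAS);	/* a task queues */
	printf("waiters only:  %d\n", down_read_unfair_trylock());	/* 1 */
	atomic_fetch_add(&count, RWSEM_WRITER_BIAS);	/* a writer arrives */
	printf("writer queued: %d\n", down_read_unfair_trylock());	/* 0 */
	return 0;
}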
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: rwsem: down_read_unfair() proposal
2010-05-05 10:06 ` David Howells
2010-05-05 10:48 ` Michel Lespinasse
@ 2010-05-05 11:09 ` David Howells
From: David Howells @ 2010-05-05 11:09 UTC
To: Michel Lespinasse; +Cc: dhowells, Andrew Morton, Linux-MM, Ying Han, LKML
Michel Lespinasse <walken@google.com> wrote:
> I only said it was doable :) Not done with the implementation yet, but I can
> describe the general idea if that helps. The high part of the rwsem is
> decremented by two for each thread holding or trying to acquire a write
> lock;
That would mean you're reducing the capacity of the upper counter by one
bit, since the high part must remain negative if we're to be able to check
it for non-zeroness by checking the sign flag. That means a maximum of
2^14-1 writers queued on a 32-bit box (16383), but we can have more threads
than that (up to ~32767).
Currently, we can have a maximum of 32767 writers+readers queued as we only
decrement the upper counter by 1 each time.
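Spelling out those limits (my arithmetic, assuming the proposed 32-bit
layout):

#include <stdio.h>

int main(void)
{
	/* the 16-bit high half must stay negative once contended, so the
	 * usable range is -32768..-1 */
	int high_range = 32768;

	/* proposed: -1 for the first queued waiter, then -2 per writer */
	printf("max queued writers (proposed): %d\n", (high_range - 1) / 2);

	/* current: -1 per queued task, reader or writer alike */
	printf("max queued tasks (current):    %d\n", high_range - 1);
	return 0;
}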
On a 64-bit box, the limitations go away for all practical purposes.
David