* [ANNOUNCE] 3.2.9-rt17
@ 2012-03-07 21:49 Thomas Gleixner
2012-03-08 18:23 ` Steven Rostedt
0 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2012-03-07 21:49 UTC (permalink / raw)
To: LKML; +Cc: linux-rt-users
Dear RT Folks,
I'm pleased to announce the 3.2.9-rt17 release.
Changes vs. 3.2.9-rt16:
* Cherry-picked a genirq fix scheduled for 3.2.10
* Add missing preemption checks for softirqd wakeups
* Implement cpu_chill() and use it in dcache and networking
RT suffers from trylock and other busywait loops when the lock
holder / updater is preempted. This is basically the same problem
as we experienced with seqlocks, and especially their open coded
variants, though it's way harder to solve.
The trylock loops are usually implemented to deal with reverse
lock ordering. On !RT this only needs to loop when one of the
locks is held on another cpu. On RT the lock holder can be
preempted which in turn puts the preempting task into an eternal
retry loop.
I tried to implement spin_trydeadlock() - thanks Peter for the
brilliant function name - which basically boosts the lock holder
w/o deadlocking, but it turned out to become a quite horrible mess
close to the infamous multiple reader boosting code.
Before my brain deadlocked on trylocks I took the easy way out and
replaced the cpu_relax() calls in those retry loops with
cpu_chill() calls.
cpu_chill() defaults to cpu_relax() for !RT. On RT it simply puts
the task to sleep for a tick, so the preempted lock holder/updater
can make progress.
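To illustrate the pattern only (generic, made-up lock names; the real call sites are in the patch below):
   retry:
	spin_lock(&child->lock);
	/* Reverse lock order: parent->lock may be held by a preempted task */
	if (!spin_trylock(&parent->lock)) {
		spin_unlock(&child->lock);
		cpu_chill();	/* cpu_relax() on !RT, msleep(1) on RT */
		goto retry;
	}
	/* Both locks held, do the work */
	spin_unlock(&parent->lock);
	spin_unlock(&child->lock);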
I think that's reasonable as the affected code paths are not RT
critical and unlikely to be hit. fs operations have no RT
guarantees at all, so it might affect random fs scanners which get
blocked on an ongoing rename or delete operation. I don't think
that's a real issue. Feel free to yell if you find out that it
hurts, but be aware that I might ask _you_ to twist _your_ brain
around implementing spin_trydeadlock().
The incremental patch against 3.2.9-rt16 can be found here:
http://www.kernel.org/pub/linux/kernel/projects/rt/3.2/incr/patch-3.2.9-rt16-rt17.patch.xz
and is appended below.
The RT patch against 3.2.9 can be found here:
http://www.kernel.org/pub/linux/kernel/projects/rt/3.2/patch-3.2.9-rt17.patch.xz
The split quilt queue is available at:
http://www.kernel.org/pub/linux/kernel/projects/rt/3.2/patches-3.2.9-rt17.tar.xz
Enjoy,
tglx
---
Index: linux-3.2/fs/autofs4/autofs_i.h
===================================================================
--- linux-3.2.orig/fs/autofs4/autofs_i.h
+++ linux-3.2/fs/autofs4/autofs_i.h
@@ -34,6 +34,7 @@
#include <linux/sched.h>
#include <linux/mount.h>
#include <linux/namei.h>
+#include <linux/delay.h>
#include <asm/current.h>
#include <asm/uaccess.h>
Index: linux-3.2/fs/autofs4/expire.c
===================================================================
--- linux-3.2.orig/fs/autofs4/expire.c
+++ linux-3.2/fs/autofs4/expire.c
@@ -170,7 +170,7 @@ again:
parent = p->d_parent;
if (!seq_spin_trylock(&parent->d_lock)) {
seq_spin_unlock(&p->d_lock);
- cpu_relax();
+ cpu_chill();
goto relock;
}
seq_spin_unlock(&p->d_lock);
Index: linux-3.2/fs/dcache.c
===================================================================
--- linux-3.2.orig/fs/dcache.c
+++ linux-3.2/fs/dcache.c
@@ -37,6 +37,7 @@
#include <linux/rculist_bl.h>
#include <linux/prefetch.h>
#include <linux/ratelimit.h>
+#include <linux/delay.h>
#include "internal.h"
/*
@@ -410,7 +411,7 @@ static inline struct dentry *dentry_kill
if (inode && !spin_trylock(&inode->i_lock)) {
relock:
seq_spin_unlock(&dentry->d_lock);
- cpu_relax();
+ cpu_chill();
return dentry; /* try again with same dentry */
}
if (IS_ROOT(dentry))
@@ -796,7 +797,7 @@ relock:
if (!seq_spin_trylock(&dentry->d_lock)) {
spin_unlock(&dcache_lru_lock);
- cpu_relax();
+ cpu_chill();
goto relock;
}
@@ -1974,7 +1975,7 @@ again:
if (dentry->d_count == 1) {
if (inode && !spin_trylock(&inode->i_lock)) {
seq_spin_unlock(&dentry->d_lock);
- cpu_relax();
+ cpu_chill();
goto again;
}
dentry->d_flags &= ~DCACHE_CANT_MOUNT;
Index: linux-3.2/fs/namespace.c
===================================================================
--- linux-3.2.orig/fs/namespace.c
+++ linux-3.2/fs/namespace.c
@@ -31,6 +31,7 @@
#include <linux/idr.h>
#include <linux/fs_struct.h>
#include <linux/fsnotify.h>
+#include <linux/delay.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
#include "pnode.h"
@@ -346,7 +347,7 @@ int mnt_want_write(struct vfsmount *mnt)
*/
while (mnt->mnt_flags & MNT_WRITE_HOLD) {
preempt_enable();
- cpu_relax();
+ cpu_chill();
preempt_disable();
}
/*
Index: linux-3.2/include/linux/preempt.h
===================================================================
--- linux-3.2.orig/include/linux/preempt.h
+++ linux-3.2/include/linux/preempt.h
@@ -56,8 +56,10 @@ do { \
#ifndef CONFIG_PREEMPT_RT_BASE
# define preempt_enable_no_resched() __preempt_enable_no_resched()
+# define preempt_check_resched_rt() do { } while (0)
#else
# define preempt_enable_no_resched() preempt_enable()
+# define preempt_check_resched_rt() preempt_check_resched()
#endif
#define preempt_enable() \
Index: linux-3.2/kernel/irq/manage.c
===================================================================
--- linux-3.2.orig/kernel/irq/manage.c
+++ linux-3.2/kernel/irq/manage.c
@@ -995,6 +995,11 @@ __setup_irq(unsigned int irq, struct irq
/* add new interrupt at end of irq queue */
do {
+ /*
+ * Or all existing action->thread_mask bits,
+ * so we can find the next zero bit for this
+ * new action.
+ */
thread_mask |= old->thread_mask;
old_ptr = &old->next;
old = *old_ptr;
@@ -1003,14 +1008,41 @@ __setup_irq(unsigned int irq, struct irq
}
/*
- * Setup the thread mask for this irqaction. Unlikely to have
- * 32 resp 64 irqs sharing one line, but who knows.
+ * Setup the thread mask for this irqaction for ONESHOT. For
+ * !ONESHOT irqs the thread mask is 0 so we can avoid a
+ * conditional in irq_wake_thread().
*/
- if (new->flags & IRQF_ONESHOT && thread_mask == ~0UL) {
- ret = -EBUSY;
- goto out_mask;
+ if (new->flags & IRQF_ONESHOT) {
+ /*
+ * Unlikely to have 32 resp 64 irqs sharing one line,
+ * but who knows.
+ */
+ if (thread_mask == ~0UL) {
+ ret = -EBUSY;
+ goto out_mask;
+ }
+ /*
+ * The thread_mask for the action is or'ed to
+ * desc->thread_active to indicate that the
+ * IRQF_ONESHOT thread handler has been woken, but not
+ * yet finished. The bit is cleared when a thread
+ * completes. When all threads of a shared interrupt
+ * line have completed desc->threads_active becomes
+ * zero and the interrupt line is unmasked. See
+ * handle.c:irq_wake_thread() for further information.
+ *
+ * If no thread is woken by primary (hard irq context)
+ * interrupt handlers, then desc->threads_active is
+ * also checked for zero to unmask the irq line in the
+ * affected hard irq flow handlers
+ * (handle_[fasteoi|level]_irq).
+ *
+ * The new action gets the first zero bit of
+ * thread_mask assigned. See the loop above which or's
+ * all existing action->thread_mask bits.
+ */
+ new->thread_mask = 1 << ffz(thread_mask);
}
- new->thread_mask = 1 << ffz(thread_mask);
if (!shared) {
init_waitqueue_head(&desc->wait_for_threads);
Index: linux-3.2/localversion-rt
===================================================================
--- linux-3.2.orig/localversion-rt
+++ linux-3.2/localversion-rt
@@ -1 +1 @@
--rt16
+-rt17
Index: linux-3.2/net/core/dev.c
===================================================================
--- linux-3.2.orig/net/core/dev.c
+++ linux-3.2/net/core/dev.c
@@ -1779,6 +1779,7 @@ static inline void __netif_reschedule(st
sd->output_queue_tailp = &q->next_sched;
raise_softirq_irqoff(NET_TX_SOFTIRQ);
local_irq_restore(flags);
+ preempt_check_resched_rt();
}
void __netif_schedule(struct Qdisc *q)
@@ -1800,6 +1801,7 @@ void dev_kfree_skb_irq(struct sk_buff *s
sd->completion_queue = skb;
raise_softirq_irqoff(NET_TX_SOFTIRQ);
local_irq_restore(flags);
+ preempt_check_resched_rt();
}
}
EXPORT_SYMBOL(dev_kfree_skb_irq);
@@ -2969,6 +2971,7 @@ enqueue:
rps_unlock(sd);
local_irq_restore(flags);
+ preempt_check_resched_rt();
atomic_long_inc(&skb->dev->rx_dropped);
kfree_skb(skb);
@@ -3789,6 +3792,7 @@ static void net_rps_action_and_irq_enabl
} else
#endif
local_irq_enable();
+ preempt_check_resched_rt();
}
static int process_backlog(struct napi_struct *napi, int quota)
@@ -3861,6 +3865,7 @@ void __napi_schedule(struct napi_struct
local_irq_save(flags);
____napi_schedule(&__get_cpu_var(softnet_data), n);
local_irq_restore(flags);
+ preempt_check_resched_rt();
}
EXPORT_SYMBOL(__napi_schedule);
@@ -6401,6 +6406,7 @@ static int dev_cpu_callback(struct notif
raise_softirq_irqoff(NET_TX_SOFTIRQ);
local_irq_enable();
+ preempt_check_resched_rt();
/* Process offline CPU's input_pkt_queue */
while ((skb = __skb_dequeue(&oldsd->process_queue))) {
Index: linux-3.2/block/blk-iopoll.c
===================================================================
--- linux-3.2.orig/block/blk-iopoll.c
+++ linux-3.2/block/blk-iopoll.c
@@ -38,6 +38,7 @@ void blk_iopoll_sched(struct blk_iopoll
list_add_tail(&iop->list, &__get_cpu_var(blk_cpu_iopoll));
__raise_softirq_irqoff(BLOCK_IOPOLL_SOFTIRQ);
local_irq_restore(flags);
+ preempt_check_resched_rt();
}
EXPORT_SYMBOL(blk_iopoll_sched);
@@ -135,6 +136,7 @@ static void blk_iopoll_softirq(struct so
__raise_softirq_irqoff(BLOCK_IOPOLL_SOFTIRQ);
local_irq_enable();
+ preempt_check_resched_rt();
}
/**
@@ -204,6 +206,7 @@ static int __cpuinit blk_iopoll_cpu_noti
&__get_cpu_var(blk_cpu_iopoll));
__raise_softirq_irqoff(BLOCK_IOPOLL_SOFTIRQ);
local_irq_enable();
+ preempt_check_resched_rt();
}
return NOTIFY_OK;
Index: linux-3.2/block/blk-softirq.c
===================================================================
--- linux-3.2.orig/block/blk-softirq.c
+++ linux-3.2/block/blk-softirq.c
@@ -50,6 +50,7 @@ static void trigger_softirq(void *data)
raise_softirq_irqoff(BLOCK_SOFTIRQ);
local_irq_restore(flags);
+ preempt_check_resched_rt();
}
/*
@@ -92,6 +93,7 @@ static int __cpuinit blk_cpu_notify(stru
&__get_cpu_var(blk_cpu_done));
raise_softirq_irqoff(BLOCK_SOFTIRQ);
local_irq_enable();
+ preempt_check_resched_rt();
}
return NOTIFY_OK;
@@ -150,6 +152,7 @@ do_local:
goto do_local;
local_irq_restore(flags);
+ preempt_check_resched_rt();
}
/**
Index: linux-3.2/include/linux/delay.h
===================================================================
--- linux-3.2.orig/include/linux/delay.h
+++ linux-3.2/include/linux/delay.h
@@ -52,4 +52,10 @@ static inline void ssleep(unsigned int s
msleep(seconds * 1000);
}
+#ifdef CONFIG_PREEMPT_RT_FULL
+# define cpu_chill() msleep(1)
+#else
+# define cpu_chill() cpu_relax()
+#endif
+
#endif /* defined(_LINUX_DELAY_H) */
Index: linux-3.2/net/packet/af_packet.c
===================================================================
--- linux-3.2.orig/net/packet/af_packet.c
+++ linux-3.2/net/packet/af_packet.c
@@ -89,6 +89,7 @@
#include <linux/virtio_net.h>
#include <linux/errqueue.h>
#include <linux/net_tstamp.h>
+#include <linux/delay.h>
#ifdef CONFIG_INET
#include <net/inet_common.h>
@@ -673,7 +674,7 @@ static void prb_retire_rx_blk_timer_expi
if (BLOCK_NUM_PKTS(pbd)) {
while (atomic_read(&pkc->blk_fill_in_prog)) {
/* Waiting for skb_copy_bits to finish... */
- cpu_relax();
+ cpu_chill();
}
}
@@ -928,7 +929,7 @@ static void prb_retire_current_block(str
if (!(status & TP_STATUS_BLK_TMO)) {
while (atomic_read(&pkc->blk_fill_in_prog)) {
/* Waiting for skb_copy_bits to finish... */
- cpu_relax();
+ cpu_chill();
}
}
prb_close_block(pkc, pbd, po, status);
Index: linux-3.2/net/rds/ib_rdma.c
===================================================================
--- linux-3.2.orig/net/rds/ib_rdma.c
+++ linux-3.2/net/rds/ib_rdma.c
@@ -34,6 +34,7 @@
#include <linux/slab.h>
#include <linux/rculist.h>
#include <linux/llist.h>
+#include <linux/delay.h>
#include "rds.h"
#include "ib.h"
@@ -286,7 +287,7 @@ static inline void wait_clean_list_grace
for_each_online_cpu(cpu) {
flag = &per_cpu(clean_list_grace, cpu);
while (test_bit(CLEAN_LIST_BUSY_BIT, flag))
- cpu_relax();
+ cpu_chill();
}
}
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-07 21:49 [ANNOUNCE] 3.2.9-rt17 Thomas Gleixner
@ 2012-03-08 18:23 ` Steven Rostedt
2012-03-08 18:28 ` Peter Zijlstra
0 siblings, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2012-03-08 18:23 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: LKML, linux-rt-users, Peter Zijlstra
On Wed, 2012-03-07 at 22:49 +0100, Thomas Gleixner wrote:
> Dear RT Folks,
>
> I'm pleased to announce the 3.2.9-rt17 release.
>
> Changes vs. 3.2.9-rt16:
>
> * Cherry-picked a genirq fix scheduled for 3.2.10
>
> * Add missing preemption checks for softirqd wakeups
>
> * Implement cpu_chill() and use it in dcache and networking
>
> RT suffers from trylock and other busywait loops when the lock
> holder / updater is preempted. This is basically the same problem
> as we experienced with seqlocks, and especially their open coded
> variants, though it's way harder to solve.
>
> The trylock loops are usually implemented to deal with reverse
> lock ordering. On !RT this only needs to loop when one of the
> locks is held on another cpu. On RT the lock holder can be
> preempted which in turn puts the preempting task into an eternal
> retry loop.
>
> I tried to implement spin_trydeadlock() - thanks Peter for the
> brilliant function name - which basically boosts the lock holder
> w/o deadlocking, but it turned out to become a quite horrible mess
> close to the infamous multiple reader boosting code.
Thanks for the reference :-p
>
> Before my brain deadlocked on trylocks I took the easy way out and
> replaced the cpu_relax() calls in those retry loops with
> cpu_chill() calls.
>
> cpu_chill() defaults to cpu_relax() for !RT. On RT it simply puts
> the task to sleep for a tick, so the preempted lock holder/updater
> can make progress.
>
> I think that's reasonable as the affected code paths are not RT
> critical and unlikely to be hit. fs operations have no RT
> guarantees at all, so it might affect random fs scanners which get
> blocked on an ongoing rename or delete operation. I don't think
> that's a real issue. Feel free to yell if you find out that it
> hurts, but be aware that I might ask _you_ to twist _your_ brain
> around implementing spin_trydeadlock().
>
So basically what you tried to do was just set the owner of the lock to
have the priority of the task that wants the lock, until it releases it?
But by doing it without having this task sleep?
I'm guessing the difficulty came with the "waiter"? That's because the
waiter is on the stack of the process that is waiting. If the waiter
isn't waiting, then you can't create the waiter structure, as the stack
isn't available to use.
Did you try it so that it blocks only if the owner is not blocked, and
if the owner blocks, it wakes up the one waiting on it in a
spin_trydeadlock()? But this probably has issues if the owner is blocked
on the lock the task has; you would have to do the sleep anyway.
What if you moved the "trydeadlock" into the cpu_chill()? Thus you can
have:
cpu_chill(&parent->d_lock);
on !RT it's still just a cpu_relax(), but on RT it will boost the owner
of the lock, and will block only as long as the owner chain is running.
The difference between blocking in cpu_chill() and doing the blocking at
the spin_trydeadlock() is that here we release a lock, which should let
something make forward progress. Blocking before the release of the lock
probably doesn't help. Unless you had spin_trydeadlock() do the release
of the lock too?
It would be interesting to see what was tried.
-- Steve
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 18:23 ` Steven Rostedt
@ 2012-03-08 18:28 ` Peter Zijlstra
2012-03-08 18:42 ` Steven Rostedt
0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2012-03-08 18:28 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 13:23 -0500, Steven Rostedt wrote:
> So basically what you tried to do was just set the owner of the lock to
> have the priority of the task that wants the lock, until it releases it?
> But by doing it without having this task sleep?
No, by having it sleep ;-)
So you do the full PI sleeping lock thing, except you return fail if you
lose the acquisition race on wakeup and you mark this waiter as
'special'.
Then on every rt_mutex block you have to do a deadlock analysis on the
PI blocking chain (preferably shared with PI boost traversal of said
chain), during that scan you collect all special tagged waiters.
If you find a deadlock, wake all these special waiters and have them
return -EDEADLK.
I guess you could also do the full spin_deadlock() and do away with the
try part and purely rely on the deadlock detection.
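Roughly, in pseudo-C (every name below is made up, nothing like this exists in the tree):
	/* Sketch only, the helper named here is hypothetical. */
	int spin_deadlock(spinlock_t *lock)
	{
		/*
		 * Block like a normal PI sleeping spinlock, but mark our
		 * rt_mutex waiter as 'special'.  The deadlock analysis run
		 * on every rt_mutex block walks the PI chain, collects the
		 * special waiters and, if it finds a cycle, wakes them up
		 * so they can bail out instead of staying blocked.
		 */
		if (rt_spin_lock_special(lock) == -EDEADLK)	/* hypothetical */
			return -EDEADLK;
		return 0;
	}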
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 18:28 ` Peter Zijlstra
@ 2012-03-08 18:42 ` Steven Rostedt
2012-03-08 19:39 ` Peter Zijlstra
2012-03-08 19:48 ` Peter Zijlstra
0 siblings, 2 replies; 28+ messages in thread
From: Steven Rostedt @ 2012-03-08 18:42 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 19:28 +0100, Peter Zijlstra wrote:
> On Thu, 2012-03-08 at 13:23 -0500, Steven Rostedt wrote:
> > So basically what you tried to do was just set the owner of the lock to
> > have the priority of the task that wants the lock, until it releases it?
> > But by doing it without having this task sleep?
>
> No, by having it sleep ;-)
>
That was the second part of my email.
> So you do the full PI sleeping lock thing, except you return fail if you
> lose the acquisition race on wakeup and you mark this waiter as
> 'special'.
>
> Then on every rt_mutex block you have to do a deadlock analysis on the
> PI blocking chain (preferably shared with PI boost traversal of said
> chain), during that scan you collect all special tagged waiters.
>
> If you find a deadlock, wake all these special waiters and have them
> return -EDEADLK.
>
> I guess you could also do the full spin_deadlock() and do away with the
> try part and purely rely on the deadlock detection.
But do you release the lock first? For example, we have:
@@ -410,7 +411,7 @@ static inline struct dentry *dentry_kill
if (inode && !spin_trylock(&inode->i_lock)) {
relock:
seq_spin_unlock(&dentry->d_lock);
- cpu_relax();
+ cpu_chill();
return dentry; /* try again with same dentry */
}
By doing the test at the trylock, we can easily hit the deadlock,
because we still hold dentry->d_lock. But by moving the block to the
cpu_chill(), then we are less likely to hit the deadlock.
Perhaps call it cpu_chill_on_lock().
-- Steve
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 18:42 ` Steven Rostedt
@ 2012-03-08 19:39 ` Peter Zijlstra
2012-03-08 20:10 ` Steven Rostedt
2012-03-08 19:48 ` Peter Zijlstra
1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2012-03-08 19:39 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 13:42 -0500, Steven Rostedt wrote:
> On Thu, 2012-03-08 at 19:28 +0100, Peter Zijlstra wrote:
> > On Thu, 2012-03-08 at 13:23 -0500, Steven Rostedt wrote:
> > > So basically what you tried to do was just set the owner of the lock to
> > > have the priority of the task that wants the lock, until it releases it?
> > > But by doing it without having this task sleep?
> >
> > No, by having it sleep ;-)
> >
>
> That was the second part of my email.
All I saw there was some rambling about cpu_chill() blocking, which
doesn't make any sense..
> > So you do the full PI sleeping lock thing, except you return fail if you
> > lose the acquisition race on wakeup and you mark this waiter as
> > 'special'.
> >
> > Then on every rt_mutex block you have to do a deadlock analysis on the
> > PI blocking chain (preferably shared with PI boost traversal of said
> > chain), during that scan you collect all special tagged waiters.
> >
> > If you find a deadlock, wake all these special waiters and have them
> > return -EDEADLK.
> >
> > I guess you could also do the full spin_deadlock() and do away with the
> > try part and purely rely on the deadlock detection.
>
> But do you release the lock first?
Not sure I get you. Release what lock before what?
> For example, we have:
>
> @@ -410,7 +411,7 @@ static inline struct dentry *dentry_kill
> if (inode && !spin_trylock(&inode->i_lock)) {
> relock:
> seq_spin_unlock(&dentry->d_lock);
> - cpu_relax();
> + cpu_chill();
> return dentry; /* try again with same dentry */
> }
>
> By doing the test at the trylock, we can easily hit the deadlock,
> because we still hold dentry->d_lock. But by moving the block to the
> cpu_chill(), then we are less likely to hit the deadlock.
Actually hitting the deadlock isn't a problem, and doing it in the place
of the trylock has the distinct advantage that you can actually get the
lock and continue like you want.
You cannot grab the inode->i_lock without holding ->d_lock because
->d_lock serializes dentry->d_inode. So in that case you have to add
more atomics and uglify the code for no gain whatsoever.
Furthermore, by pulling the waiting out, your success rate doesn't
improve at all.
So it's a lose-lose proposition.
> Perhaps call it, cpu_chill_on_lock().
Nah, that doesn't make any sense.
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 18:42 ` Steven Rostedt
2012-03-08 19:39 ` Peter Zijlstra
@ 2012-03-08 19:48 ` Peter Zijlstra
2012-03-08 20:01 ` Steven Rostedt
1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2012-03-08 19:48 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 13:42 -0500, Steven Rostedt wrote:
> cpu_chill_on_lock().
I think what you were hinting at is:
spin_lock(&lock);
spin_unlock(&lock);
which is the rt_mutex implementation of spin_unlock_wait(). No new
function needed for that.
Trouble is that you cannot use inode->i_lock as explained..
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 19:48 ` Peter Zijlstra
@ 2012-03-08 20:01 ` Steven Rostedt
2012-03-08 20:08 ` Peter Zijlstra
0 siblings, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2012-03-08 20:01 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 20:48 +0100, Peter Zijlstra wrote:
> On Thu, 2012-03-08 at 13:42 -0500, Steven Rostedt wrote:
> > cpu_chill_on_lock().
>
> I think what you were hinting at is:
>
> spin_lock(&lock);
> spin_unlock(&lock);
Exactly.
>
> which is the rt_mutex implementation of spin_unlock_wait(). No new
> function needed for that.
Yes but...
>
> Trouble is that you cannot use inode->i_lock as explained..
Right!
What I was hinting at was that the spin_lock() can boost, but it could
also fail if there was a deadlock detected, and it wouldn't cause the
system to hang.
-- Steve
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 20:01 ` Steven Rostedt
@ 2012-03-08 20:08 ` Peter Zijlstra
0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2012-03-08 20:08 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 15:01 -0500, Steven Rostedt wrote:
>
> What I was hinting at was that the spin_lock() can boost, but it could
> also fail if there was a deadlock detected, and it wouldn't cause the
> system to hang.
Right, that's spin_deadlock() ;-) And you can in fact write
spin_trydeadlock() as:
int spin_trydeadlock(spinlock_t *lock)
{
	return spin_deadlock(lock) != -EDEADLK;
}
The advantage of introducing spin_trydeadlock() is that you can merge it
upstream (modulo naming) and the !RT code won't change at all.
IIRC tglx said there were more sites than just the dcache that did this
inverse lock order trylock game.
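As an illustration only, the dentry_kill() hunk quoted earlier would keep its shape and just swap the trylock (spin_trydeadlock() is still hypothetical):
	if (inode && !spin_trydeadlock(&inode->i_lock)) {
relock:
		seq_spin_unlock(&dentry->d_lock);
		cpu_chill();
		return dentry;	/* try again with same dentry */
	}
On !RT spin_trydeadlock() would simply be spin_trylock(), so the mainline code would not change.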
Seems we're agreeing more than disagreeing.. anyway, if you care to give
it a go.. worst that can happen is that you'll get as big a mess as tglx
got :-)
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 19:39 ` Peter Zijlstra
@ 2012-03-08 20:10 ` Steven Rostedt
2012-03-08 20:26 ` Peter Zijlstra
2012-03-09 0:20 ` Thomas Gleixner
0 siblings, 2 replies; 28+ messages in thread
From: Steven Rostedt @ 2012-03-08 20:10 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 20:39 +0100, Peter Zijlstra wrote:
> > For example, we have:
> >
> > @@ -410,7 +411,7 @@ static inline struct dentry *dentry_kill
> > if (inode && !spin_trylock(&inode->i_lock)) {
> > relock:
> > seq_spin_unlock(&dentry->d_lock);
> > - cpu_relax();
> > + cpu_chill();
> > return dentry; /* try again with same dentry */
> > }
> >
> > By doing the test at the trylock, we can easily hit the deadlock,
> > because we still hold dentry->d_lock. But by moving the block to the
> > cpu_chill(), then we are less likely to hit the deadlock.
>
> Actually hitting the deadlock isn't a problem, and doing it in the place
> of the trylock has the distinct advantage that you can actually get the
> lock and continue like you want.
By doing a spin_trydeadlock() while still holding the d_lock, if the
holder of the i_lock was blocked on that d_lock then it would detect the
failure, and release the lock and continue the loop. This doesn't solve
anything. Just because we released the lock, we are still preempting the
holder of the d_lock, and if we are higher in priority, we will never
let the owner run.
That's why I recommended doing it after releasing the lock. Of course
the d_put() is so screwed up because it's not just two locks involved,
it's a reverse chain, where this probably won't help.
But just sleeping a tick sounds like a heuristic that may someday fail.
-- Steve
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 20:10 ` Steven Rostedt
@ 2012-03-08 20:26 ` Peter Zijlstra
2012-03-08 21:08 ` Steven Rostedt
2012-03-09 0:20 ` Thomas Gleixner
1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2012-03-08 20:26 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 15:10 -0500, Steven Rostedt wrote:
>
> By doing a spin_trydeadlock() while still holding the d_lock, if the
> holder of the i_lock was blocked on that d_lock then it would detect the
> failure, and release the lock and continue the loop. This doesn't solve
> anything. Just because we released the lock, we are still preempting the
> holder of the d_lock
->i_lock, right?
> , and if we are higher in priority, we will never let the owner run.
So, suppose:
task-A                          task-B

lock ->i_lock
                                lock ->d_lock
lock ->d_lock <blocks>
                                trylock ->i_lock
In this case B's trylock will insta-fail (with -EDEADLK) and we unlock
->d_lock in the existing retry logic. That dropping of ->d_lock will
then wake A, but since B is higher prio A we don't actually run A and
B's retry loop will re-acquire ->d_lock.
Crap.. there's also the fact that A doesn't get (or stay) boosted.
I can only think of ugly things to do, like on the deadlock scan, force
assign the first waiter of the inverted lock to owner (in this case the
deadlock is on ->d_lock so assign A), so that the moment we release
->d_lock our re-acquisition fails because it's already owned by A, at
that point B will block and boost A.
There's just the little detail of having two owners for a little while..
> But just sleeping a tick sounds like a heuristic that may someday fail.
Oh absolutely agreed, ideally someone would find a way to implement the
-EDEADLK stuff for rt-mutex in a way that doesn't make tglx sad and
actually works.
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 20:26 ` Peter Zijlstra
@ 2012-03-08 21:08 ` Steven Rostedt
2012-03-08 21:20 ` Peter Zijlstra
0 siblings, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2012-03-08 21:08 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 21:26 +0100, Peter Zijlstra wrote:
> On Thu, 2012-03-08 at 15:10 -0500, Steven Rostedt wrote:
> >
> > By doing a spin_trydeadlock() while still holding the d_lock, if the
> > holder of the i_lock was blocked on that d_lock then it would detect the
> > failure, and release the lock and continue the loop. This doesn't solve
> > anything. Just because we released the lock, we are still preempting the
> > holder of the d_lock
>
> ->i_lock, right?
The dvorak key layout has the 'i' and 'd' right next to each other (the
'g' and 'h' key respectively of the qwerty layout). It makes it easy to
get confused :-)
>
> > , and if we are higher in priority, we will never let the owner run.
>
> So, suppose:
>
> task-A                        task-B
>
> lock ->i_lock
>                               lock ->d_lock
> lock ->d_lock <blocks>
>                               trylock ->i_lock
>
> In this case B's trylock will insta-fail (with -EDEADLK) and we unlock
> ->d_lock in the existing retry logic. That dropping of ->d_lock will
> then wake A, but since B is higher prio A we don't actually run A and
> B's retry loop will re-acquire ->d_lock.
>
> Crap.. there's also the fact that A doesn't get (or stays) boosted.
Yep. I'm surprised that virtual machines don't have the same issue. Let's
say you are running a two-CPU virtual machine that gets pinned to a
single CPU for some reason. Then unless the one vCPU gets preempted
after it releases the i_lock and before it grabs it again, I can see a
virtual machine going in the same loop.
Maybe it does, but eventually the vCPU gets preempted in the right place
and things move forward again. No one notices because people just expect
virtual machines to have long latencies.
Hmm, I bet a -rt kernel would probably run better than a normal kernel
on virtual machines, as spinlocks probably hurt virtual machines more
than mutexes do.
>
> I can only think of ugly things to do, like on the deadlock scan, force
> assign the first waiter of the inverted lock to owner (in this case the
> deadlock is on ->d_lock so assign A), so that the moment we release
> ->d_lock our re-acquisition fails because its already owned by A, at
> that point B will block and boost A.
Hmm, perhaps we need a way to attach a priority to a lock. Maybe we just
need a way to set a priority of a lock with.. "A task of priority X
needs the lock, set the owner to at least X while it holds the lock",
where it doesn't care about the high priority task, it just cares about
the lock. That is, give locks a priority too (like priority ceiling). On
doing spin_trylock_rt() (no need for deadlock detection) if it fails,
gives a lock the priority of the task trying to take it. The lock will
be given a temporary priority for the duration it is held. The owner of
the lock will get that priority unless it's already higher in priority.
When the lock is released, both the owner and the lock lose the
priority.
Note, spin_trylock_rt() continues to run even on failure.
Have cpu_chill() do a "sched_yield()" (the good kind, to put the current
FIFO task behind another FIFO task of the same priority). Then the owner
of the lock will get to run.
The sched_yield in cpu_chill() would be needed if the owner of the lock
is blocked on the lock the high priority task has. After the high
priority task releases its lock, and calls cpu_chill(), the
sched_yield() allows the owner of the lock to run if it happens to be
blocked on the lock the high prio task held. As the cpu_chill() will be
called after that lock is released.
This shouldn't be too hard to implement, as the boosting by the lock
priority lasts only as long as the lock is held. It would not require
implementing any kind of proxy waiter. If a higher prio task comes along
and wants the lock, it will just up the lock's priority. The lock
priority still only lasts as long as the lock is held. If the lock isn't
owned, the task asking for it will simply get the lock.
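A sketch of the proposed call site pattern (spin_trylock_rt() and the lock priority are purely hypothetical, as are the lock names):
   retry:
	spin_lock(&lock_a);
	if (!spin_trylock_rt(&lock_b)) {
		/*
		 * The failed trylock gave lock_b, and therefore its owner,
		 * our priority until the owner releases it.
		 */
		spin_unlock(&lock_a);
		cpu_chill();	/* sched_yield() flavour: lets the boosted
				 * owner run if it sits on this CPU */
		goto retry;
	}
	/* Both locks held */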
I'm really starting to like this idea, and unless you can poke holes
in it, I'll go ahead and start implementing it. The worst that can
happen is that the owner of the lock may get a high prio and the
original task that wanted the lock loses its priority for some reason;
that won't affect the owner of the lock. But as it's only a temporary
priority, the effect of the boost won't last long (it is lost on release
of the lock).
If the task that tried to get the lock gets its priority boosted, as the
task is just doing a loop anyway (never blocked as the spin_trylock_rt()
never blocks), even if it preempts the owner of the lock (now that it
has a higher priority), on the next grab of the spin_trylock_rt() it
will boost the lock priority again, as well as the owner of the lock.
Thoughts?
-- Steve
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 21:08 ` Steven Rostedt
@ 2012-03-08 21:20 ` Peter Zijlstra
2012-03-08 21:25 ` Steven Rostedt
0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2012-03-08 21:20 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 16:08 -0500, Steven Rostedt wrote:
> Hmm, perhaps we need a way to attach a priority to a lock. Maybe we just
> need a way to set a priority of a lock with.. "A task of priority X
> needs the lock, set the owner to at least X while it holds the lock",
> where it doesn't care about the high priority task, it just cares about
> the lock. That is, give locks a priority too (like priority ceiling). On
> doing spin_trylock_rt() (no need for deadlock detection) if it fails,
> gives a lock the priority of the task trying to take it. The lock will
> be given a temporary priority for the duration it is held. The owner of
> the lock will get that priority unless its already higher in priority.
> When the lock is released, both the owner and the lock lose the
> priority.
>
> Note, spin_trylock_rt() continues to run even on failure.
>
> Have cpu_chill() do a "sched_yield()" (the good kind, to put the current
> FIFO task behind another FIFO task of the same priority). Then the owner
> of the lock will get to run.
>
> The sched_yield in cpu_chill() would be needed if the owner of the lock
> is blocked on the lock the high priority task has. After the high
> priority task releases its lock, and calls cpu_chill(), the
> sched_yield() allows the owner of the lock to run if it happens to be
> blocked on the lock the high prio task held. As the cpu_chill() will be
> called after that lock is released.
Now put the thing on 2 cpus and both tasks can endlessly chase each
other's tail, no? The yield will be useless there..
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 21:20 ` Peter Zijlstra
@ 2012-03-08 21:25 ` Steven Rostedt
2012-03-08 21:28 ` Peter Zijlstra
2012-03-09 0:33 ` Thomas Gleixner
0 siblings, 2 replies; 28+ messages in thread
From: Steven Rostedt @ 2012-03-08 21:25 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 22:20 +0100, Peter Zijlstra wrote:
> Now put the thing on 2 cpus and both tasks can endlessly chase each
> other's tail, no?
How would this be different than what mainline does? When the lock is
released, it will wake up the other task.
> The yield will be useless there..
If there's no task with the same priority, it should be a nop. I don't
see anything wrong with it.
-- Steve
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 21:25 ` Steven Rostedt
@ 2012-03-08 21:28 ` Peter Zijlstra
2012-03-08 21:36 ` Steven Rostedt
2012-03-09 0:33 ` Thomas Gleixner
1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2012-03-08 21:28 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 16:25 -0500, Steven Rostedt wrote:
>
> How would this be different than what mainline does? When the lock is
> released, it will wake up the other task.
mainline has ticket locks, the rt-mutex stuff has equal priority lock
stealing, waking up the blocked task will take so long our running loop
will have re-acquired ->d_lock again before it even gets to trying.
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 21:28 ` Peter Zijlstra
@ 2012-03-08 21:36 ` Steven Rostedt
2012-03-08 21:37 ` Peter Zijlstra
0 siblings, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2012-03-08 21:36 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 22:28 +0100, Peter Zijlstra wrote:
> On Thu, 2012-03-08 at 16:25 -0500, Steven Rostedt wrote:
> >
> > How would this be different than what mainline does? When the lock is
> > released, it will wake up the other task.
>
> mainline has ticket locks, the rt-mutex stuff has equal priority lock
> stealing, waking up the blocked task will take so long our running loop
> will have re-acquired ->d_lock again before it even gets to trying.
And we have adaptive mutexes.
So we wake up the task (now with the higher priority), by the time it
wakes up, the original task retook the lock. But because of adaptive
mutexes, as this task takes the lock it notices that the owner is still
running, and it will spin and not sleep.
Now when the original task releases the lock again, the other task can
take it just like it does on mainline.
-- Steve
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 21:36 ` Steven Rostedt
@ 2012-03-08 21:37 ` Peter Zijlstra
2012-03-08 21:44 ` Steven Rostedt
0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2012-03-08 21:37 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 16:36 -0500, Steven Rostedt wrote:
> On Thu, 2012-03-08 at 22:28 +0100, Peter Zijlstra wrote:
> > On Thu, 2012-03-08 at 16:25 -0500, Steven Rostedt wrote:
> > >
> > > How would this be different than what mainline does? When the lock is
> > > released, it will wake up the other task.
> >
> > mainline has ticket locks, the rt-mutex stuff has equal priority lock
> > stealing, waking up the blocked task will take so long our running loop
> > will have re-acquired ->d_lock again before it even gets to trying.
>
> And we have adaptive mutexes.
>
> So we wake up the task (now with the higher priority), by the time it
> wakes up, the original task retook the lock. But because of adaptive
> mutexes, as this task takes the lock it notices that the owner is still
> running, and it will spin and not sleep.
>
> Now when the original task releases the lock again, the other task can
> take it just like it does on mainline.
Now interleave it with a third task of even higher priority that puts
the spinner to sleep.
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 21:37 ` Peter Zijlstra
@ 2012-03-08 21:44 ` Steven Rostedt
2012-03-08 21:54 ` Peter Zijlstra
0 siblings, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2012-03-08 21:44 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 22:37 +0100, Peter Zijlstra wrote:
> > Now when the original task releases the lock again, the other task can
> > take it just like it does on mainline.
>
> Now interleave it with a third task of even higher priority that puts
> the spinner to sleep.
So? It will eventually have to allow the task to run. Adding a "third
higher priority" task can cause problems in any other part of the -rt
kernel.
We don't need to worry about priority inversion. If the higher task
blocks on the original task, it will boost its priority (even if it does
the adaptive spin) which will again boost the task that it preempted.
Now we may need to add a sched_yield() in the adaptive spin to let the
other task run.
-- Steve
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 21:44 ` Steven Rostedt
@ 2012-03-08 21:54 ` Peter Zijlstra
2012-03-08 22:13 ` Steven Rostedt
0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2012-03-08 21:54 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 16:44 -0500, Steven Rostedt wrote:
> On Thu, 2012-03-08 at 22:37 +0100, Peter Zijlstra wrote:
>
> > > Now when the original task releases the lock again, the other task can
> > > take it just like it does on mainline.
> >
> > Now interleave it with a third task of even higher priority that puts
> > the spinner to sleep.
>
> So? It will eventually have to allow the task to run. Adding a "third
> higher priority" task can cause problems in any other part of the -rt
> kernel.
>
> We don't need to worry about priority inversion. If the higher task
> blocks on the original task, it will boost its priority (even if it does
> the adaptive spin) which will again boost the task that it preempted.
>
> Now we may need to add a sched_yield() in the adaptive spin to let the
> other task run.
That's not what I mean,..
task-A (cpu0)                   task-B (cpu1)           task-C (cpu1)

                                lock ->d_lock
lock ->i_lock
lock ->d_lock
                                                        <-------------- preempts B
                                trylock ->i_lock
While this is perfectly normal, the result is that A stops spinning and
goes to sleep. Now B continues and loops ad infinitum because it keeps
getting ->d_lock before A because it's cache hot on cpu1 and waking A
takes a while etc..
No progress guarantee -> fail.
Test-and-set spinlocks have unbounded latency and we've hit pure
starvation cases in mainline. In fact it was so bad mainline had to grow
ticket locks to cope -- we don't want to rely on anything like this in
RT.
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 21:54 ` Peter Zijlstra
@ 2012-03-08 22:13 ` Steven Rostedt
2012-03-08 22:20 ` Peter Zijlstra
0 siblings, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2012-03-08 22:13 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 22:54 +0100, Peter Zijlstra wrote:
> On Thu, 2012-03-08 at 16:44 -0500, Steven Rostedt wrote:
> > On Thu, 2012-03-08 at 22:37 +0100, Peter Zijlstra wrote:
> >
> > > > Now when the original task releases the lock again, the other task can
> > > > take it just like it does on mainline.
> > >
> > > Now interleave it with a third task of even higher priority that puts
> > > the spinner to sleep.
> >
> > So? It will eventually have to allow the task to run. Adding a "third
> > higher priority" task can cause problems in any other part of the -rt
> > kernel.
> >
> > We don't need to worry about priority inversion. If the higher task
> > blocks on the original task, it will boost its priority (even if it does
> > the adaptive spin) which will again boost the task that it preempted.
> >
> > Now we may need to add a sched_yield() in the adaptive spin to let the
> > other task run.
>
> That's not what I mean,..
Actually this is what I thought you meant :-)
>
> task-A (cpu0)                 task-B (cpu1)           task-C (cpu1)
>
>                               lock ->d_lock
> lock ->i_lock
> lock ->d_lock
>                                                       <-------------- preempts B
>                               trylock ->i_lock
>
>
> While this is perfectly normal, the result is that A stops spinning and
> goes to sleep. Now B continues and loops ad infinitum because it keeps
> getting ->d_lock before A because its cache hot on cpu1 and waking A
> takes a while etc..
I'm confused? As A isn't doing a loop. B is doing the loop because it's
trying to grab the locks in reverse order and can't take the i_lock.
Your example above would have A go to sleep when it tries to take
d_lock.
>
> No progress guarantee -> fail.
I still don't see the permanent blocking. Task-A is blocked on d_lock,
which is owned by task-B, which in turn is preempted by task-C. This
happens all the time in -rt. What is the issue?
C will eventually go away and the other two will run again. If C doesn't
go away, that's the general problem with migrate_disable(), but that is
out of scope for the issue we are dealing with here.
>
> Test-and-set spinlocks have unbounded latency and we've hit pure
> starvation cases in mainline. In fact it was so bad mainline had to grow
> ticket locks to cope -- we don't want to rely on anything like this in
> RT.
It was an issue on all spinlocks. The solution I'm giving is to fix the
trylock() when taking locks in reverse order. Most of these locations
aren't even in critical paths (hopefully none of them are).
I'm not changing normal spinlocks or the way RT turns them into mutexes;
I'm changing the spin_trylock() that today can become a live lock when a
high priority task preempts a task on the same CPU that holds the lock
it needs.
I still don't see an issue with my proposed solution.
-- Steve
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 22:13 ` Steven Rostedt
@ 2012-03-08 22:20 ` Peter Zijlstra
2012-03-08 22:27 ` Steven Rostedt
2012-03-09 4:17 ` Steven Rostedt
0 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2012-03-08 22:20 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 17:13 -0500, Steven Rostedt wrote:
> > task-A (cpu0)               task-B (cpu1)           task-C (cpu1)
> >
> >                             lock ->d_lock
> > lock ->i_lock
> > lock ->d_lock
> >                                                     <-------------- preempts B
> >                             trylock ->i_lock
> >
> >
> > While this is perfectly normal, the result is that A stops spinning and
> > goes to sleep. Now B continues and loops ad infinitum because it keeps
> > getting ->d_lock before A because its cache hot on cpu1 and waking A
> > takes a while etc..
>
> I'm confused? As A isn't doing a loop. B is doing the loop because it's
> trying to grab the locks in reverse order and can't take the i_lock.
> Your example above would have A go to sleep when it tries to take
> d_lock.
Right, but what guarantees that A will ever get ->d_lock when B releases
it before B again acquires it?
B is in a very tight:
1:
lock ->d_lock
trylock ->i_lock
unlock ->d_lock
goto 1
loop, while A is doing:
1:
trylock ->d_lock
goto 1
and with rt-mutex having the equal priority lock stealing this reverts
to a plain test-and-set lock. There's only a tiny window in which A can
actually get the lock and that is hampered by B's cpu owning the
cacheline in exclusive mode.
I simply cannot see guaranteed progress here.
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 22:20 ` Peter Zijlstra
@ 2012-03-08 22:27 ` Steven Rostedt
2012-03-09 4:17 ` Steven Rostedt
1 sibling, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2012-03-08 22:27 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 23:20 +0100, Peter Zijlstra wrote:
> and with rt-mutex having the equal priority lock stealing this reverts
> to a plain test-and-set lock. There's only a tiny window in which A can
> actually get the lock and that is hampered by B's cpu owning the
> cacheline in exclusive mode.
>
> I simply cannot see guaranteed progress here.
This can be easily fixed by not having the lateral steal when a lock has
a priority.
-- Steve
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 20:10 ` Steven Rostedt
2012-03-08 20:26 ` Peter Zijlstra
@ 2012-03-09 0:20 ` Thomas Gleixner
2012-03-09 2:50 ` Steven Rostedt
1 sibling, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2012-03-09 0:20 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Peter Zijlstra, LKML, linux-rt-users
On Thu, 8 Mar 2012, Steven Rostedt wrote:
> On Thu, 2012-03-08 at 20:39 +0100, Peter Zijlstra wrote:
>
> > > For example, we have:
> > >
> > > @@ -410,7 +411,7 @@ static inline struct dentry *dentry_kill
> > > if (inode && !spin_trylock(&inode->i_lock)) {
> > > relock:
> > > seq_spin_unlock(&dentry->d_lock);
> > > - cpu_relax();
> > > + cpu_chill();
> > > return dentry; /* try again with same dentry */
> > > }
> > >
> > > By doing the test at the trylock, we can easily hit the deadlock,
> > > because we still hold dentry->d_lock. But by moving the block to the
> > > cpu_chill(), then we are less likely to hit the deadlock.
> >
> > Actually hitting the deadlock isn't a problem, and doing it in the place
> > of the trylock has the distinct advantage that you can actually get the
> > lock and continue like you want.
>
> By doing a spin_trydeadlock() while still holding the d_lock, if the
> holder of the i_lock was blocked on that d_lock then it would detect the
> failure, and release the lock and continue the loop. This doesn't solve
> anything. Just because we released the lock, we are still preempting the
> holder of the d_lock, and if we are higher in priority, we will never
> let the owner run.
>
> That's why I recommended doing it after releasing the lock. Of course
> the d_put() is so screwed up because it's not just two locks involved,
> it's a reverse chain, where this probably wont help.
>
> But just sleeping a tick sounds like a heuristic that may someday fail.
And it's the only way to deal with it sanely simply because after
dropping d_lock you CANNOT dereference inode->i_lock anymore. Care to
read the fricking locking rules of dcache and friends?
So you need to replace the trylock by try_deadlock() and boost the
lock owner w/o blocking on the lock. That means that you need some
other means to tell the lock owner that you are waiting for it w/o
actually waiting. After you return from try_deadlock() you drop d_lock
and wait for some magic event which is associated to the unlock of
i_lock. There is no other way to deal with reverse lock ordering,
period.
Now go and imagine the mess you have to create to make this work under
all circumstances. My reference to the multiple reader boosting was
not for fun. Though I consider that try_deadlock() idea to be in the
same category of insanity as the original attempts of implementing
requeue_pi with ownerless but boosted rtmutexes ....
Thanks,
tglx
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 21:25 ` Steven Rostedt
2012-03-08 21:28 ` Peter Zijlstra
@ 2012-03-09 0:33 ` Thomas Gleixner
2012-03-09 3:08 ` Steven Rostedt
1 sibling, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2012-03-09 0:33 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Peter Zijlstra, LKML, linux-rt-users
On Thu, 8 Mar 2012, Steven Rostedt wrote:
> On Thu, 2012-03-08 at 22:20 +0100, Peter Zijlstra wrote:
>
> > Now put the thing on 2 cpus and both tasks can endlessly chase each
> > other's tail, no?
>
> How would this be different than what mainline does? When the lock is
> released, it will wake up the other task.
Nonsense. That code is not causing any headache in mainline simply
because the lock holder cannot be preempted. So the contention case
runs on different cpus. On RT the failure case is when the trylocker
preempts the lock holder, which cannot be moved to a different cpu due
to the implicit migrate disable. Aside from that, cpu_relax() and ticket
locks are there for a reason. They allow the other cpu to make
progress instead of allowing the trylocking cpu to monopolize the
cache line forever.
The only case where mainline can fail is when a high prio task does a
mutex_trylock() loop and the mutex owner and the trylocker are pinned
on the same core. Though I have not yet found code like that, I
have not looked too hard either :)
It's a simple RT problem, which has been there forever, but obviously
nobody stress tested such code paths on UP systems, or if someone
did he was not able to trigger it. On SMP this was not a big issue
when task migration was almost always enabled. Due to the implicit
migrate disable within spinlocked regions we just made it more likely
to happen.
Thanks,
tglx
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-09 0:20 ` Thomas Gleixner
@ 2012-03-09 2:50 ` Steven Rostedt
2012-03-09 10:23 ` Thomas Gleixner
0 siblings, 1 reply; 28+ messages in thread
From: Steven Rostedt @ 2012-03-09 2:50 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: Peter Zijlstra, LKML, linux-rt-users
On Fri, 2012-03-09 at 01:20 +0100, Thomas Gleixner wrote:
> On Thu, 8 Mar 2012, Steven Rostedt wrote:
>
> > On Thu, 2012-03-08 at 20:39 +0100, Peter Zijlstra wrote:
> >
> > > > For example, we have:
> > > >
> > > > @@ -410,7 +411,7 @@ static inline struct dentry *dentry_kill
> > > > if (inode && !spin_trylock(&inode->i_lock)) {
> > > > relock:
> > > > seq_spin_unlock(&dentry->d_lock);
> > > > - cpu_relax();
> > > > + cpu_chill();
> > > > return dentry; /* try again with same dentry */
> > > > }
> > > >
> > > > By doing the test at the trylock, we can easily hit the deadlock,
> > > > because we still hold dentry->d_lock. But by moving the block to the
> > > > cpu_chill(), then we are less likely to hit the deadlock.
> > >
> > > Actually hitting the deadlock isn't a problem, and doing it in the place
> > > of the trylock has the distinct advantage that you can actually get the
> > > lock and continue like you want.
> >
> > By doing a spin_trydeadlock() while still holding the d_lock, if the
> > holder of the i_lock was blocked on that d_lock then it would detect the
> > failure, and release the lock and continue the loop. This doesn't solve
> > anything. Just because we released the lock, we are still preempting the
> > holder of the d_lock, and if we are higher in priority, we will never
> > let the owner run.
> >
> > That's why I recommended doing it after releasing the lock. Of course
> > the d_put() is so screwed up because it's not just two locks involved,
> > it's a reverse chain, where this probably wont help.
> >
> > But just sleeping a tick sounds like a heuristic that may someday fail.
>
> And it's the only way to deal with it sanely simply because after
> dropping d_lock you CANNOT dereference inode->i_lock anymore. Care to
> read the fricking locking rules of dcache and friends ?
Right, and the solution I proposed does not need to.
>
> So you need to replace the trylock by try_deadlock() and boost the
> lock owner w/o blocking on the lock. That means that you need some
> other means to tell the lock owner that you are waiting for it w/o
> actually waiting.
Actually, you don't need to tell the lock owner you are waiting for it,
you only need to make the lock itself have a priority, and the owner
will inherit that priority as long as it holds the lock.
> After you return from try_deadlock() you drop d_lock
> and wait for some magic event which is associated to the unlock of
> i_lock. There is no other way to deal with reverse lock ordering,
> period.
No magic event: the owner of the lock will inherit the priority you just
gave it. All the task needs to do next is a sched_yield(), and if the
other task is on the same CPU, it will have the same priority and run.
The boosted priority of the lock and the owner will only last as long as
the owner has the lock. As soon as the owner drops it, the priority is
reset for both the lock and the owner.
>
> Now go and imagine the mess you have to create to make this work under
> all circumstances. My reference to the multiple reader boosting was
> not for fun. Though I consider that try_deadlock() idea to be in the
> same category of insanity as the original attempts of implementing
> requeue_pi with ownerless but boosted rtmutexes ....
And the solution I proposed has nothing to do with the mess needed to do
the multi priority boost, as that had a waiter on it. This isn't waiting;
it simulates more closely what happens in non-rt. That is, the owner of
the lock will have a chance to run.
-- Steve
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-09 0:33 ` Thomas Gleixner
@ 2012-03-09 3:08 ` Steven Rostedt
0 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2012-03-09 3:08 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: Peter Zijlstra, LKML, linux-rt-users
On Fri, 2012-03-09 at 01:33 +0100, Thomas Gleixner wrote:
> On Thu, 8 Mar 2012, Steven Rostedt wrote:
>
> > On Thu, 2012-03-08 at 22:20 +0100, Peter Zijlstra wrote:
> >
> > > Now put the thing on 2 cpus and both tasks can endlessly chase each
> > > other's tail, no?
> >
> > How would this be different than what mainline does? When the lock is
> > released, it will wake up the other task.
>
> Nonsense. That code is not causing any headache in mainline simply
> because the lock holder cannot be preempted. So the contention case
> runs on different cpus. On RT the failure case is when the trylocker
> preempts the lock holder, which cannot be moved to a different cpu due
> to the implicit migrate disable. Aside of that cpu_relax() and ticket
> locks are there for a reason. They allow the other cpu to make
> progress instead of allowing the trylocking cpu to monopolize the
> cache line forever.
I understand that, I was replying to the "both tasks can endlessly chase
each other's tail". And I gave responses to that. The current solution
of having the task sleep for a tick still doesn't solve the issue of the
owner being preempted by a higher priority task.
At least the solution I proposed wouldn't cause priority inversion,
whereas the current solution can.
task-a (cpu 0)            task-b (cpu 1)        task-c (cpu 1)
--------------            --------------        --------------
retry:                    lock(y);
  lock(x);
                             <<------------- preempt task-b
  if (!try_lock(y)) {
          unlock(x);
          sleep(1);
          goto retry;
  }
Now task-a can be the highest priority task in the system, and task-b
the lowest, but task-c is higher than task-b and lower than task-a. If c
is a CPU hog, then task-a will never get out of this loop.
With the lock inheritance, b will get to run over c.
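Roughly, "the lock itself has a priority" could look like the sketch
below. The wrapper struct and the boost helper are made up purely for
illustration; rt_mutex_owner() and rt_mutex_setprio() are the existing
helpers such a scheme would presumably build on:

/* Sketch only: an rt_mutex carrying an invented per-lock priority. */
struct prio_rt_mutex {
        struct rt_mutex lock;
        int             prio;   /* init to MAX_PRIO, i.e. "no boost" */
};

/*
 * Invented helper, called when a trylock fails: record the trylocker's
 * priority on the lock and boost the current owner to it.  Both would
 * be reset in the unlock slow path when the owner drops the lock.
 */
static void prio_rt_mutex_boost_owner(struct prio_rt_mutex *m, int prio)
{
        struct task_struct *owner = rt_mutex_owner(&m->lock);

        /* lower value means higher priority */
        if (owner && prio < m->prio) {
                m->prio = prio;
                rt_mutex_setprio(owner, prio);
        }
}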
>
> The only case where mainline can fail is when a high prio task does a
> mutex_trylock() loop and the mutex owner and the trylocker are pinned
> on the same core. Though I have not yet found code like that, I have
> not looked too hard either :)
>
> It's a simple RT problem, which has been there forever, but obviously
> nobody stress tested such code paths on UP systems, or if someone
> did he was not able to trigger it. On SMP this was not a big issue
> when task migration was almost always enabled. Due to the implicit
> migrate disable within spinlocked regions we just made it more likely
> to happen.
Right, and I mentioned that the migrate disable causes new issues.
I'm trying to come up with a solution that doesn't "wait for some magic
event which is associated to the unlock of i_lock", because that's what
we are doing right now: the magic event is the sleep, in the hope that
the lock gets released in the meantime. And a solution that isn't as
nasty as the multiple reader boosting.
Maybe the lock inheritance isn't the best solution, but I believe it's
better than the sleep and hope solution that's there now. And I don't
think it would be that complex to implement.
-- Steve
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-08 22:20 ` Peter Zijlstra
2012-03-08 22:27 ` Steven Rostedt
@ 2012-03-09 4:17 ` Steven Rostedt
1 sibling, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2012-03-09 4:17 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Thomas Gleixner, LKML, linux-rt-users
On Thu, 2012-03-08 at 23:20 +0100, Peter Zijlstra wrote:
> Right, but what guarantees that A will ever get ->d_lock when B releases
> it before B again acquires it?
>
> B is in a very tight:
>
> 1:
> lock ->d_lock
> trylock ->i_lock
> unlock ->d_lock
> goto 1
>
> loop, while A is doing:
>
> 1:
> trylock ->d_lock
> goto 1
>
> and with rt-mutex having the equal priority lock stealing this reverts
> to a plain test-and-set lock. There's only a tiny window in which A can
> actually get the lock and that is hampered by B's cpu owning the
> cacheline in exclusive mode.
>
> I simply cannot see guaranteed progress here.
I'm wondering if this is really even an issue. Yes, with mainline before
ticket locks, a tight spin could be faster and grab the lock. But those
were real test-and-set locks. Here we really don't have a tight spin.
d_lock has a waiter and it will force the unlock into the slow path
which is (removing optional debug code):
        raw_spin_lock(&lock->wait_lock);

        if (!rt_mutex_has_waiters(lock)) {
                /* false */
        }

        wakeup_next_waiter(lock);

        raw_spin_unlock(&lock->wait_lock);

        rt_mutex_adjust_prio(current);
Now, wakeup_next_waiter(lock) does:
        raw_spin_lock_irqsave(&current->pi_lock, flags);

        waiter = rt_mutex_top_waiter(lock);

        plist_del(&waiter->pi_list_entry, &current->pi_waiters);

        rt_mutex_set_owner(lock, NULL);  <<-- here's where the race starts.
                                              The other task can break the loop.

        raw_spin_unlock_irqrestore(&current->pi_lock, flags);

        /* the unlock acts as a write memory barrier */
        rt_mutex_wake_waiter(waiter);
Now the task goes through the entire effort of waking up the spinning
task. It needs to grab the spinlocks for that task's runqueue, which is
the runqueue of the CPU the spinning task is on.
And it's not done. It then does:
        raw_spin_unlock(&lock->wait_lock);
        rt_mutex_adjust_prio(current);
Which again grabs the task->pi_lock.
Now it's out of the unlock. But we said we would have it also call
sched_yield(), which adds a bit more overhead here. But after that, it
may grab the lock again, keeping the other task from getting it.
OK, let's see what the other task is doing. Well, it's in the adaptive
spin loop, which is:
        rcu_read_lock();
        for (;;) {
                if (owner != rt_mutex_owner(lock))
                        break;          <<-- This is where it jumps out.
                barrier();
                if (!owner->on_cpu) {
                        res = 1;
                        break;
                }
                cpu_relax();
        }
        rcu_read_unlock();
Now, the worst case is we just missed the test. Then it did the barrier
(a compiler barrier only), the check whether the owner is on the cpu and
a cpu_relax(), which may slow it down a little, but the owner is working
through its own sequence too.
Then the rcu_read_unlock(), which does more compiler barriers and
updates the task structure's rcu_read_lock_nesting counter.
But once this is done, it does...
        raw_spin_lock(&lock->wait_lock);
And once we are here, the game is over. The first task had to do
everything stated after the owner was set to NULL, and then loop around
and get to the ->d_lock again. The wait_lock is a ticket spinlock, thus
once the second task gets here, the first task won't be able to get in.
And the second task will be able to get the lock.
But caches are getting bigger, and CPUs are getting faster compared to
memory. It may still be a case where forward progress is prevented; I
don't know.
My last response about not doing lateral steals on locks with priority
was faulty, as the lock in question isn't the one with the priority. But
this could be solved by making the unlock special: a spin_unlock_rt() or
something that tells the lock not to allow a lateral steal when handing
the lock to the new owner.
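As a sketch of what that could look like (spin_unlock_no_steal() is an
invented name), on B's side of the loop from the earlier example:

again:
        spin_lock(&d_lock);
        if (!spin_trylock(&i_lock)) {
                /*
                 * Invented unlock variant: hand d_lock to the top
                 * waiter (A) and disallow the equal-priority lateral
                 * steal for this release, so A is guaranteed to get
                 * the lock next.
                 */
                spin_unlock_no_steal(&d_lock);
                goto again;
        }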
This is only needed if it really is a race, and the first task can do
the wakeup and call to schedule() all before the second task can grab
that lock.
-- Steve
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-09 2:50 ` Steven Rostedt
@ 2012-03-09 10:23 ` Thomas Gleixner
2012-03-09 12:51 ` Steven Rostedt
0 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2012-03-09 10:23 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Peter Zijlstra, LKML, linux-rt-users
On Thu, 8 Mar 2012, Steven Rostedt wrote:
> On Fri, 2012-03-09 at 01:20 +0100, Thomas Gleixner wrote:
> > So you need to replace the trylock by try_deadlock() and boost the
> > lock owner w/o blocking on the lock. That means that you need some
> > other means to tell the lock owner that you are waiting for it w/o
> > actually waiting.
>
> Actually, you don't need to tell the lock owner you are waiting for it,
> you only need to make the lock itself have a priority, and the owner
> will inherit that priority as long as it holds the lock.
Hmm, that would actually work, though I'm not too enthusiastic about
sched_yield().
Thanks,
tglx
* Re: [ANNOUNCE] 3.2.9-rt17
2012-03-09 10:23 ` Thomas Gleixner
@ 2012-03-09 12:51 ` Steven Rostedt
0 siblings, 0 replies; 28+ messages in thread
From: Steven Rostedt @ 2012-03-09 12:51 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: Peter Zijlstra, LKML, linux-rt-users
On Fri, 2012-03-09 at 11:23 +0100, Thomas Gleixner wrote:
> > Actually, you don't need to tell the lock owner you are waiting for it,
> > you only need to make the lock itself have a priority, and the owner
> > will inherit that priority as long as it holds the lock.
>
> Hmm, that would actually work, though I'm not too enthusiastic about
> sched_yield().
I don't like sched_yield() either, but this is exactly what it was made
for: to give a process with the same priority the CPU.
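As a reference for that semantic, a minimal userspace illustration (not
from this thread): under SCHED_FIFO, sched_yield() puts the caller
behind any runnable task of the same priority, which is exactly the
hand-over the boosted owner needs.

#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param sp = { .sched_priority = 50 };

        /* Becoming SCHED_FIFO needs CAP_SYS_NICE / root. */
        if (sched_setscheduler(0, SCHED_FIFO, &sp))
                perror("sched_setscheduler");

        /*
         * Under SCHED_FIFO this moves the caller to the tail of the
         * run list for its priority, so another runnable task at the
         * same priority runs first.
         */
        sched_yield();
        return 0;
}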
-- Steve
Thread overview: 28+ messages
2012-03-07 21:49 [ANNOUNCE] 3.2.9-rt17 Thomas Gleixner
2012-03-08 18:23 ` Steven Rostedt
2012-03-08 18:28 ` Peter Zijlstra
2012-03-08 18:42 ` Steven Rostedt
2012-03-08 19:39 ` Peter Zijlstra
2012-03-08 20:10 ` Steven Rostedt
2012-03-08 20:26 ` Peter Zijlstra
2012-03-08 21:08 ` Steven Rostedt
2012-03-08 21:20 ` Peter Zijlstra
2012-03-08 21:25 ` Steven Rostedt
2012-03-08 21:28 ` Peter Zijlstra
2012-03-08 21:36 ` Steven Rostedt
2012-03-08 21:37 ` Peter Zijlstra
2012-03-08 21:44 ` Steven Rostedt
2012-03-08 21:54 ` Peter Zijlstra
2012-03-08 22:13 ` Steven Rostedt
2012-03-08 22:20 ` Peter Zijlstra
2012-03-08 22:27 ` Steven Rostedt
2012-03-09 4:17 ` Steven Rostedt
2012-03-09 0:33 ` Thomas Gleixner
2012-03-09 3:08 ` Steven Rostedt
2012-03-09 0:20 ` Thomas Gleixner
2012-03-09 2:50 ` Steven Rostedt
2012-03-09 10:23 ` Thomas Gleixner
2012-03-09 12:51 ` Steven Rostedt
2012-03-08 19:48 ` Peter Zijlstra
2012-03-08 20:01 ` Steven Rostedt
2012-03-08 20:08 ` Peter Zijlstra