* [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
@ 2015-09-04  1:19 Steven Rostedt
  2015-09-04  1:19 ` [RFC][PATCH RT 1/3] locking: Add spin_try_or_boost_lock() infrastructure Steven Rostedt
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Steven Rostedt @ 2015-09-04  1:19 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, Sebastian Andrzej Siewior,
	John Kacur, Paul Gortmaker, Peter Zijlstra, Clark Williams,
	Arnaldo Carvalho de Melo, Ingo Molnar
(Frozen shark deflectors on)
Currently, the solution to prevent the trylock livelock (described below)
is to change cpu_relax() to cpu_chill(), which is simply an msleep(1) that
hopes the owner of the lock runs and releases the lock, and that nobody
else takes that lock by the time the task wakes up from the msleep.
When I showed this at LinuxCon/LinuxPlumbers, I was able to see the entire
audience wince at once. It was a beautiful sight, and I wish I had taken a
picture. Interestingly enough, everyone winced to their left side.
This patch set is an attempt to solve the issue in a more deterministic
manner, by introducing a new primitive called spin_try_or_boost_lock(), which
tries to take the lock and, if it fails, boosts the owner of the lock to the
caller's priority if needed.
Because we do not want to add an extra field to the lock itself, or have the
trylock caller block in any way, the solution is to give the owner a temporary
priority boost that is lost as soon as it releases any spinlock (that was
converted to an rtmutex). This new boosted priority is saved in the task
struct, and cleared on releasing any spinlock. Sure, it may release a
different spinlock than the one it was boosted by, but it will still make
progress, and will be boosted again by the trylock spinner. Each time
the owner gets boosted it will move forward until it can release the wanted
lock.
The cpu_chill() will now become a sched_yield(), which will allow the newly
boosted task to run ahead of the spinner (it is now of the same priority).
When the task releases a lock, it loses the boosted priority, and the trylock
spinner can try again.
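The usage pattern ends up looking roughly like this (a simplified sketch that
mirrors the dcache conversion in patch 2; the variable names are just for
illustration):

	again:
		spin_lock(&dentry->d_lock);
		/* Need inode->i_lock too, but that is the reverse lock order */
		if (!spin_try_or_boost_lock(&inode->i_lock)) {
			/* Failed: the owner of i_lock is now boosted to our prio */
			spin_unlock(&dentry->d_lock);
			cpu_chill();	/* sched_yield(): let the boosted owner run */
			goto again;
		}
		/* Both locks held, in reverse order, without the msleep() hack */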
Now, there are a few locations where cpu_chill() is not used for a trylock,
but instead the code spins on a bit or some status that will be updated by
another task. As this other task may also be blocked by the spinner, it needs
to be
handled as well. As we do not know who the updating task is, there's still
no way to boost it. Maybe in the future we can come up with another API
that can handle this. For now, we will still use the msleep(), but instead
of using cpu_chill(), another primitive called cpu_rest() is created :-)
The cpu_rest() (which I think is a more descriptive name for an msleep(1))
acts the same as the current cpu_chill(). Hopefully we can remove that too.
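For example, the status-bit spinners converted in patch 3 end up looking like
this (sketch of what the net/rds conversion does):

	while (test_bit(CLEAN_LIST_BUSY_BIT, flag))
		cpu_rest();	/* msleep(1): the updater is unknown, so it cannot be boosted */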
There are a lot of trylocks in the kernel, and I'm sure there are more around
that need to be converted to this method. I think this is an elegant solution,
but others may feel differently. As I think a msleep() hail mary is extremely
non-deterministic, it's a blemish for a kernel that prides itself on adding
determinism.
I tested this with a module that forces the race. If you want it, I can
supply that as well. I did some basic testing, but I only recently got this
working, so there may be bugs. This is an RFC to see if it is worthwhile
to implement.
[ Note, this still needs to be tested against non PREEMPT_RT configs ]
Thanks!
-- Steve
Steven Rostedt (Red Hat) (3):
      locking: Add spin_try_or_boost_lock() infrastructure
      locking: Convert trylock spinners over to spin_try_or_boost_lock()
      rt: Make cpu_chill() into yield() and add new cpu_rest() as msleep(1)
---
 block/blk-ioc.c             |   4 ++--
 fs/autofs4/expire.c         |   2 +-
 fs/dcache.c                 |   6 +++---
 fs/namespace.c              |   2 +-
 include/linux/delay.h       |  13 +++++++++++++
 include/linux/init_task.h   |   8 ++++++++
 include/linux/rtmutex.h     |   1 +
 include/linux/sched.h       |  27 +++++++++++++++++++++++++
 include/linux/spinlock.h    |   5 +++++
 include/linux/spinlock_rt.h |  13 +++++++++++++
 kernel/locking/rtmutex.c    | 121 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------
 kernel/sched/core.c         |  14 +++++++++++++
 kernel/time/hrtimer.c       |   4 ++--
 kernel/workqueue.c          |   2 +-
 net/packet/af_packet.c      |   4 ++--
 net/rds/ib_rdma.c           |   2 +-
 16 files changed, 204 insertions(+), 24 deletions(-)
^ permalink raw reply	[flat|nested] 19+ messages in thread
* [RFC][PATCH RT 1/3] locking: Add spin_try_or_boost_lock() infrastructure
  2015-09-04  1:19 [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack Steven Rostedt
@ 2015-09-04  1:19 ` Steven Rostedt
  2015-09-04  1:48   ` Steven Rostedt
  2015-09-04  1:19 ` [RFC][PATCH RT 2/3] locking: Convert trylock spinners over to spin_try_or_boost_lock() Steven Rostedt
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2015-09-04  1:19 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, Sebastian Andrzej Siewior,
	John Kacur, Paul Gortmaker, Peter Zijlstra, Clark Williams,
	Arnaldo Carvalho de Melo, Ingo Molnar
[-- Attachment #1: 0001-locking-Add-spin_try_or_boost_lock-infrastructure.patch --]
[-- Type: text/plain, Size: 14973 bytes --]
From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
In mainline Linux, there are cases where locks need to be taken in reverse
order. In non-PREEMPT_RT code, spinlocks cannot be preempted, thus if a task
taking locks in reverse order fails to get the second lock, it simply releases
the first lock (the one that can cause a deadlock) and tries again. The owner
of the second lock must be running on another CPU, and as long as it is not
blocked on the lock held by the original task, progress will be made.
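In code, that mainline pattern looks roughly like this (sketch; the lock names
are only for illustration):

	retry:
		spin_lock(&B);
		if (!spin_trylock(&A)) {
			spin_unlock(&B);
			cpu_relax();
			goto retry;
		}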
But things change when these locks become preemptible locks.
Simply releasing the first lock when failing to get the second lock won't
always work if the locks can be preempted. That's because the task could
have preempted the owner, and it can prevent the owner from making forward
progress. Just letting go of the first lock and trying again won't change the
state of things, as the owner will still be blocked by being preempted by
the original task. This is especially true when the original task is a
real-time task running at a FIFO priority.
	<Task 1>
	Grab lock A
	<preempt to higher priority Task 2>
	Grab lock B
	try to grab lock A
	release lock B
	Grab lock B
	try to grab lock A
	<wash, rinse, repeat>  (sorry Fred)
As Task 1 will never run as long as Task 2 is spinning, no forward progress
is made, and this becomes a livelock (as opposed to a deadlock).
The solution here is to add a new type of priority. All tasks are given a
new field in the task_struct called "trylock_prio" (only when PREEMPT_RT is
configured). This priority is initialized to MAX_PRIO. If a task needs to
grab locks in reverse order, instead of simply using spin_trylock(), it will
use the new primitive spin_try_or_boost_lock(). If the task does not get the
lock, and the lock has an owner, then that owner gets its "trylock_prio"
field set to the prio of the caller of spin_try_or_boost_lock(), if the
caller's prio is higher than the one the field currently holds. The priority
inheritance code is executed in case
that task is currently blocked on something else that may be preempted by
the original task. But unlike the normal priority inheritance code, the
original task is not part of the chain, as that could be detected as a false
deadlock. Instead, each step checks not only the priority of the task and
its highest waiter, but also this new "trylock_prio" field of the task. This
allows the task to be boosted to that priority as well.
Whenever a task releases a spinlock converted to an rtmutex, if its
"trylock_prio" is set to something other than MAX_PRIO, then it enters the
slow unlock path, resets the "trylock_prio" to MAX_PRIO and runs the
priority inheritance chain, where it loses its special priority.
It is possible that the owner releases a lock other than the one that
boosted its priority with the spin_try_or_boost_lock() call. This is fine,
because the owner would still make forward progress. If it loses its
priority, and the original task runs, it will loop again, and boost it
again, until eventually the owner releases the lock, letting the original
task grab it in the reverse order.
Note, as FIFO tasks won't let the owner run if the current task is spinning,
it is still required that the original task calls cpu_chill() after calling
spin_try_or_boost_lock(). But this lets us change cpu_chill() from being a
simple msleep(1) (hoping that the owner will release the lock in that time
frame, and nobody else will take it), to a sched_yield(), which allows the
owner to run ahead of the spinner and make that forward progress.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/init_task.h   |   8 +++
 include/linux/rtmutex.h     |   1 +
 include/linux/sched.h       |  27 ++++++++++
 include/linux/spinlock.h    |   5 ++
 include/linux/spinlock_rt.h |  13 +++++
 kernel/locking/rtmutex.c    | 121 ++++++++++++++++++++++++++++++++++++++++----
 6 files changed, 164 insertions(+), 11 deletions(-)
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 4a77d39ff7dd..eb39f31837dd 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -189,6 +189,13 @@ extern struct task_group root_task_group;
 # define INIT_KASAN(tsk)
 #endif
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+# define INIT_TRYLOCK_PRIO(tsk)						\
+	.trylock_prio	= MAX_PRIO,
+#else
+# define INIT_TRYLOCK_PRIO(tsk)
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -204,6 +211,7 @@ extern struct task_group root_task_group;
 	.normal_prio	= MAX_PRIO-20,					\
 	.policy		= SCHED_NORMAL,					\
 	.cpus_allowed	= CPU_MASK_ALL,					\
+	INIT_TRYLOCK_PRIO(tsk)						\
 	.nr_cpus_allowed= NR_CPUS,					\
 	.mm		= NULL,						\
 	.active_mm	= &init_mm,					\
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index d5a04ea47a13..4ab8d4bddfbb 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -111,6 +111,7 @@ extern int rt_mutex_timed_lock(struct rt_mutex *lock,
 			       struct hrtimer_sleeper *timeout);
 
 extern int rt_mutex_trylock(struct rt_mutex *lock);
+extern int rt_mutex_try_or_boost_lock(struct rt_mutex *lock);
 
 extern void rt_mutex_unlock(struct rt_mutex *lock);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 956658437527..e6a39360a182 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1357,6 +1357,33 @@ struct task_struct {
 
 	int prio, static_prio, normal_prio;
 	unsigned int rt_priority;
+#ifdef CONFIG_PREEMPT_RT_FULL
+	/*
+	 * There are cases where spinlocks are taken in reverse
+	 * order via a spin_trylock(). This is fine for true
+	 * spinlocks, but in PREEMPT_RT, they become mutexes.
+	 * If a real-time high priority task preempts the owner
+	 * of a lock it is trying to grab in reverse order with
+	 * a trylock, it can go into a livelock, as it spins
+	 * waiting for the owner to release the lock. But as it
+	 * has preempted that owner, and prevented it from continuing,
+	 * the lock is never released, and the high priority task
+	 * spins forever.
+	 *
+	 * For these cases, a spin_try_or_boost_lock() must be used
+	 * on PREEMPT_RT. This will boost the owner of the lock to the
+	 * priority of the caller of spin_try_or_boost_lock()
+	 * via this trylock_prio field (if the owner has a lower
+	 * priority than the caller). This priority is always reset
+	 * whenever any spinlock converted rtmutex is released.
+	 * This guarantees forward progress of the trylock spin.
+	 *
+	 * Callers of spin_try_or_boost_lock() must also call cpu_chill()
+	 * which on PREEMPT_RT is a sched_yield() that allows the
+	 * newly risen priority task to run ahead of the trylock spinner.
+	 */
+	int trylock_prio;
+#endif
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 28f4366fd495..b48bebec4302 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -330,6 +330,11 @@ static inline int spin_trylock(spinlock_t *lock)
 	return raw_spin_trylock(&lock->rlock);
 }
 
+static inline int spin_try_or_boost_lock(spinlock_t *lock)
+{
+	return raw_spin_trylock(&lock->rlock);
+}
+
 #define spin_lock_nested(lock, subclass)			\
 do {								\
 	raw_spin_lock_nested(spinlock_check(lock), subclass);	\
diff --git a/include/linux/spinlock_rt.h b/include/linux/spinlock_rt.h
index f757096b230c..084816738d3e 100644
--- a/include/linux/spinlock_rt.h
+++ b/include/linux/spinlock_rt.h
@@ -26,6 +26,7 @@ extern void __lockfunc rt_spin_unlock_wait(spinlock_t *lock);
 extern int __lockfunc rt_spin_trylock_irqsave(spinlock_t *lock, unsigned long *flags);
 extern int __lockfunc rt_spin_trylock_bh(spinlock_t *lock);
 extern int __lockfunc rt_spin_trylock(spinlock_t *lock);
+extern int __lockfunc rt_spin_try_or_boost_lock(spinlock_t *lock);
 extern int atomic_dec_and_spin_lock(atomic_t *atomic, spinlock_t *lock);
 
 /*
@@ -53,6 +54,8 @@ extern int __lockfunc __rt_spin_trylock(struct rt_mutex *lock);
 
 #define spin_do_trylock(lock)		__cond_lock(lock, rt_spin_trylock(lock))
 
+#define spin_do_try_or_boost_lock(lock) __cond_lock(lock, rt_spin_try_or_boost_lock(lock))
+
 #define spin_trylock(lock)			\
 ({						\
 	int __locked;				\
@@ -63,6 +66,16 @@ extern int __lockfunc __rt_spin_trylock(struct rt_mutex *lock);
 	__locked;				\
 })
 
+#define spin_try_or_boost_lock(lock)			\
+({							\
+	int __locked;					\
+	migrate_disable();				\
+	__locked = spin_do_try_or_boost_lock(lock);	\
+	if (!__locked)					\
+		migrate_enable();			\
+	__locked;					\
+})
+
 #ifdef CONFIG_LOCKDEP
 # define spin_lock_nested(lock, subclass)		\
 	do {						\
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 2822aceb8dfb..2830c17dc3e4 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -253,6 +253,65 @@ rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
 	RB_CLEAR_NODE(&waiter->pi_tree_entry);
 }
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+static inline int task_normal_prio(struct task_struct *task)
+{
+	return min(task->normal_prio, task->trylock_prio);
+}
+
+static inline int task_top_waiter_prio(struct task_struct *task)
+{
+	return min3(task_top_pi_waiter(task)->prio,
+		    task->normal_prio, task->trylock_prio);
+}
+
+static inline void clear_trylock_prio(struct task_struct *task)
+{
+	task->trylock_prio = MAX_PRIO;
+}
+
+static inline bool current_boosted(void)
+{
+	return current->trylock_prio < MAX_PRIO;
+}
+
+static void __rt_mutex_adjust_prio(struct task_struct *task);
+
+/* Must call rt_mutex_adjust_prio_chain() if an owner is returned */
+static inline struct task_struct *trylock_boost_owner(struct rt_mutex *lock)
+{
+	struct task_struct *owner;
+
+	owner = rt_mutex_owner(lock);
+	if (!owner)
+		return NULL;
+
+	/* Will be released by rt_mutex_adjust_prio_chain() */
+	get_task_struct(owner);
+
+	raw_spin_lock_irq(&owner->pi_lock);
+	if (owner->trylock_prio > current->prio) {
+		owner->trylock_prio = current->prio;
+		__rt_mutex_adjust_prio(owner);
+	}
+	raw_spin_unlock_irq(&owner->pi_lock);
+
+	return owner;
+}
+#else
+static inline int task_normal_prio(struct task_struct *task)
+{
+	return task->normal_prio;
+}
+static inline int task_top_waiter_prio(struct task_struct *task)
+{
+	return min(task_top_pi_waiter(task)->prio,
+		   task->normal_prio);
+}
+# define current_boosted()	0
+# define clear_trylock_prio(tsk)	do {} while (0)
+#endif
+
 /*
  * Calculate task priority from the waiter tree priority
  *
@@ -262,10 +321,9 @@ rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
 int rt_mutex_getprio(struct task_struct *task)
 {
 	if (likely(!task_has_pi_waiters(task)))
-		return task->normal_prio;
+		return task_normal_prio(task);
 
-	return min(task_top_pi_waiter(task)->prio,
-		   task->normal_prio);
+	return task_top_waiter_prio(task);
 }
 
 struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
@@ -925,7 +983,7 @@ static inline void rt_spin_lock_fastlock(struct rt_mutex *lock,
 static inline void rt_spin_lock_fastunlock(struct rt_mutex *lock,
 					   void  (*slowfn)(struct rt_mutex *lock))
 {
-	if (likely(rt_mutex_cmpxchg(lock, current, NULL)))
+	if (likely(rt_mutex_cmpxchg(lock, current, NULL) && !current_boosted()))
 		rt_mutex_deadlock_account_unlock(current);
 	else
 		slowfn(lock);
@@ -1074,6 +1132,8 @@ static void  noinline __sched rt_spin_lock_slowunlock(struct rt_mutex *lock)
 	if (!rt_mutex_has_waiters(lock)) {
 		lock->owner = NULL;
 		raw_spin_unlock(&lock->wait_lock);
+		if (current_boosted())
+			goto adjust;
 		return;
 	}
 
@@ -1081,6 +1141,9 @@ static void  noinline __sched rt_spin_lock_slowunlock(struct rt_mutex *lock)
 
 	raw_spin_unlock(&lock->wait_lock);
 
+adjust:
+	clear_trylock_prio(current);
+
 	/* Undo pi boosting.when necessary */
 	rt_mutex_adjust_prio(current);
 }
@@ -1148,6 +1211,16 @@ int __lockfunc rt_spin_trylock(spinlock_t *lock)
 }
 EXPORT_SYMBOL(rt_spin_trylock);
 
+int __lockfunc rt_spin_try_or_boost_lock(spinlock_t *lock)
+{
+	int ret = rt_mutex_try_or_boost_lock(&lock->lock);
+
+	if (ret)
+		spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+	return ret;
+}
+EXPORT_SYMBOL(rt_spin_try_or_boost_lock);
+
 int __lockfunc rt_spin_trylock_bh(spinlock_t *lock)
 {
 	int ret;
@@ -1717,26 +1790,34 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 /*
  * Slow path try-lock function:
  */
-static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
+static inline int rt_mutex_slowtrylock(struct rt_mutex *lock, bool boost)
 {
+	struct task_struct *owner;
 	int ret;
 
 	/*
 	 * If the lock already has an owner we fail to get the lock.
 	 * This can be done without taking the @lock->wait_lock as
 	 * it is only being read, and this is a trylock anyway.
+	 *
+	 * Only do the short cut if we do not need to boost the task
+	 * if we fail to get the lock.
 	 */
-	if (rt_mutex_owner(lock))
+	if (!boost && rt_mutex_owner(lock))
 		return 0;
 
 	/*
-	 * The mutex has currently no owner. Lock the wait lock and
-	 * try to acquire the lock.
+	 * The mutex has currently no owner or we need to boost the task
+	 * if we fail to grab the lock. Lock the wait lock and try to
+	 * acquire the lock.
 	 */
 	raw_spin_lock(&lock->wait_lock);
 
 	ret = try_to_take_rt_mutex(lock, current, NULL);
 
+	if (!ret && boost)
+		owner = trylock_boost_owner(lock);
+
 	/*
 	 * try_to_take_rt_mutex() sets the lock waiters bit
 	 * unconditionally. Clean this up.
@@ -1745,6 +1826,10 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
 
 	raw_spin_unlock(&lock->wait_lock);
 
+	if (!ret && boost && owner)
+		rt_mutex_adjust_prio_chain(owner, RT_MUTEX_MIN_CHAINWALK,
+					   lock, NULL, NULL, NULL);
+
 	return ret;
 }
 
@@ -1852,13 +1937,14 @@ rt_mutex_timed_fastlock(struct rt_mutex *lock, int state,
 
 static inline int
 rt_mutex_fasttrylock(struct rt_mutex *lock,
-		     int (*slowfn)(struct rt_mutex *lock))
+		     int (*slowfn)(struct rt_mutex *lock, bool boost),
+		     bool boost)
 {
 	if (likely(rt_mutex_cmpxchg(lock, NULL, current))) {
 		rt_mutex_deadlock_account_lock(lock, current);
 		return 1;
 	}
-	return slowfn(lock);
+	return slowfn(lock, boost);
 }
 
 static inline void
@@ -1969,11 +2055,24 @@ EXPORT_SYMBOL_GPL(rt_mutex_timed_lock);
  */
 int __sched rt_mutex_trylock(struct rt_mutex *lock)
 {
-	return rt_mutex_fasttrylock(lock, rt_mutex_slowtrylock);
+	return rt_mutex_fasttrylock(lock, rt_mutex_slowtrylock, false);
 }
 EXPORT_SYMBOL_GPL(rt_mutex_trylock);
 
 /**
+ * rt_mutex_try_or_boost_lock - try to lock a rt_mutex or boost the owner
+ *
+ * @lock:	the rt_mutex to be locked
+ *
+ * Returns 1 on success and 0 on contention
+ */
+int __sched rt_mutex_try_or_boost_lock(struct rt_mutex *lock)
+{
+	return rt_mutex_fasttrylock(lock, rt_mutex_slowtrylock, true);
+}
+EXPORT_SYMBOL_GPL(rt_mutex_try_or_boost_lock);
+
+/**
  * rt_mutex_unlock - unlock a rt_mutex
  *
  * @lock: the rt_mutex to be unlocked
-- 
2.4.6
^ permalink raw reply related	[flat|nested] 19+ messages in thread
* [RFC][PATCH RT 2/3] locking: Convert trylock spinners over to spin_try_or_boost_lock()
  2015-09-04  1:19 [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack Steven Rostedt
  2015-09-04  1:19 ` [RFC][PATCH RT 1/3] locking: Add spin_try_or_boost_lock() infrastructure Steven Rostedt
@ 2015-09-04  1:19 ` Steven Rostedt
  2015-09-04  1:19 ` [RFC][PATCH RT 3/3] rt: Make cpu_chill() into yield() and add new cpu_rest() as msleep(1) Steven Rostedt
  2015-09-05 10:30 ` [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack Thomas Gleixner
  3 siblings, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2015-09-04  1:19 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, Sebastian Andrzej Siewior,
	John Kacur, Paul Gortmaker, Peter Zijlstra, Clark Williams,
	Arnaldo Carvalho de Melo, Ingo Molnar
[-- Attachment #1: 0002-locking-Convert-trylock-spinners-over-to-spin_try_or.patch --]
[-- Type: text/plain, Size: 2984 bytes --]
From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
When trying to take locks in reverse order, it is possible on PREEMPT_RT that
the running task has preempted the owner and never lets it run, creating a
livelock. This is because spinlocks in PREEMPT_RT can be preempted.
Currently, this is solved by calling cpu_chill(), which on PREEMPT_RT is
converted into a msleep(1), and we just hope that the owner will have time
to release the lock, and that nobody else will take it by the time the task
wakes up.
By converting these to spin_try_or_boost_lock(), which will boost the owner,
the cpu_chill() can be converted into a sched_yield(), which will allow the
owner to make immediate progress even if it was preempted by a higher
priority task.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 block/blk-ioc.c     | 4 ++--
 fs/autofs4/expire.c | 2 +-
 fs/dcache.c         | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 28f467e636cc..de5eccdc8abb 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -105,7 +105,7 @@ static void ioc_release_fn(struct work_struct *work)
 						struct io_cq, ioc_node);
 		struct request_queue *q = icq->q;
 
-		if (spin_trylock(q->queue_lock)) {
+		if (spin_try_or_boost_lock(q->queue_lock)) {
 			ioc_destroy_icq(icq);
 			spin_unlock(q->queue_lock);
 		} else {
@@ -183,7 +183,7 @@ retry:
 	hlist_for_each_entry(icq, &ioc->icq_list, ioc_node) {
 		if (icq->flags & ICQ_EXITED)
 			continue;
-		if (spin_trylock(icq->q->queue_lock)) {
+		if (spin_try_or_boost_lock(icq->q->queue_lock)) {
 			ioc_exit_icq(icq);
 			spin_unlock(icq->q->queue_lock);
 		} else {
diff --git a/fs/autofs4/expire.c b/fs/autofs4/expire.c
index d487fa27add5..025bfc71dc6c 100644
--- a/fs/autofs4/expire.c
+++ b/fs/autofs4/expire.c
@@ -148,7 +148,7 @@ again:
 			}
 
 			parent = p->d_parent;
-			if (!spin_trylock(&parent->d_lock)) {
+			if (!spin_try_or_boost_lock(&parent->d_lock)) {
 				spin_unlock(&p->d_lock);
 				cpu_chill();
 				goto relock;
diff --git a/fs/dcache.c b/fs/dcache.c
index c1dad92434d5..6b5643ecdf37 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -573,12 +573,12 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 	struct inode *inode = dentry->d_inode;
 	struct dentry *parent = NULL;
 
-	if (inode && unlikely(!spin_trylock(&inode->i_lock)))
+	if (inode && unlikely(!spin_try_or_boost_lock(&inode->i_lock)))
 		goto failed;
 
 	if (!IS_ROOT(dentry)) {
 		parent = dentry->d_parent;
-		if (unlikely(!spin_trylock(&parent->d_lock))) {
+		if (unlikely(!spin_try_or_boost_lock(&parent->d_lock))) {
 			if (inode)
 				spin_unlock(&inode->i_lock);
 			goto failed;
@@ -2394,7 +2394,7 @@ again:
 	inode = dentry->d_inode;
 	isdir = S_ISDIR(inode->i_mode);
 	if (dentry->d_lockref.count == 1) {
-		if (!spin_trylock(&inode->i_lock)) {
+		if (!spin_try_or_boost_lock(&inode->i_lock)) {
 			spin_unlock(&dentry->d_lock);
 			cpu_chill();
 			goto again;
-- 
2.4.6
^ permalink raw reply related	[flat|nested] 19+ messages in thread
* [RFC][PATCH RT 3/3] rt: Make cpu_chill() into yield() and add new cpu_rest() as msleep(1)
  2015-09-04  1:19 [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack Steven Rostedt
  2015-09-04  1:19 ` [RFC][PATCH RT 1/3] locking: Add spin_try_or_boost_lock() infrastructure Steven Rostedt
  2015-09-04  1:19 ` [RFC][PATCH RT 2/3] locking: Convert trylock spinners over to spin_try_or_boost_lock() Steven Rostedt
@ 2015-09-04  1:19 ` Steven Rostedt
  2015-09-05 10:30 ` [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack Thomas Gleixner
  3 siblings, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2015-09-04  1:19 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, Sebastian Andrzej Siewior,
	John Kacur, Paul Gortmaker, Peter Zijlstra, Clark Williams,
	Arnaldo Carvalho de Melo, Ingo Molnar
[-- Attachment #1: 0003-rt-Make-cpu_chill-into-yield-and-add-new-cpu_rest-as.patch --]
[-- Type: text/plain, Size: 5009 bytes --]
From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
Turn the cpu_chill() into a sched_yield(), which is only used by those calling
spin_try_or_boost_lock(), as it will allow the owner of the lock, which had its
priority boosted (if needed), to run even when SCHED_FIFO tasks are spinning.
As there are still locations that use cpu_chill() to spin on status events
like bits and whatnot (not locks), where the updater is not known and cannot
be boosted, add a new cpu_rest() that takes over the msleep(1) action.
Hopefully, we can get rid of the cpu_rest() and find a way to know which tasks
need priority boosting, and perhaps make another API that will allow boosting
of the updater, so the current task can continue to spin instead of sleeping.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 fs/namespace.c         |  2 +-
 include/linux/delay.h  | 13 +++++++++++++
 kernel/sched/core.c    | 14 ++++++++++++++
 kernel/time/hrtimer.c  |  4 ++--
 kernel/workqueue.c     |  2 +-
 net/packet/af_packet.c |  4 ++--
 net/rds/ib_rdma.c      |  2 +-
 7 files changed, 34 insertions(+), 7 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 28937028f3a5..24769de44041 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -356,7 +356,7 @@ int __mnt_want_write(struct vfsmount *m)
 	smp_mb();
 	while (ACCESS_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) {
 		preempt_enable();
-		cpu_chill();
+		cpu_rest();
 		preempt_disable();
 	}
 	/*
diff --git a/include/linux/delay.h b/include/linux/delay.h
index 37caab306336..b11e40388387 100644
--- a/include/linux/delay.h
+++ b/include/linux/delay.h
@@ -53,9 +53,22 @@ static inline void ssleep(unsigned int seconds)
 }
 
 #ifdef CONFIG_PREEMPT_RT_FULL
+/*
+ * Use cpu_chill() after a spin_try_or_boost_lock() which will boost the owner
+ * of the lock to the caller's priority (if needed), and cpu_chill() will
+ * act like a sched_yield() allowing the owner to proceed.
+ */
 extern void cpu_chill(void);
+/*
+ * Use cpu_rest() if there's no way to find out which task you are waiting
+ * for (like when spinning on a status variable or bit). This is equivalent to
+ * a msleep(1) and you can hope that the status will change by the time
+ * you wake up.
+ */
+extern void cpu_rest(void);
 #else
 # define cpu_chill()	cpu_relax()
+# define cpu_rest()	cpu_relax()
 #endif
 
 #endif /* defined(_LINUX_DELAY_H) */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 392aad4ec3d6..3e3efc980136 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4473,6 +4473,20 @@ SYSCALL_DEFINE0(sched_yield)
 	return 0;
 }
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+/*
+ * Used after a spin_try_or_boost_lock(), which should boost the owner
+ * of the lock to the priority of the current task (if needed), and
+ * this will yield the current task to the owner if the owner is on
+ * current's CPU.
+ */
+void cpu_chill(void)
+{
+	sys_sched_yield();
+}
+EXPORT_SYMBOL(cpu_chill);
+#endif
+
 int __sched _cond_resched(void)
 {
 	if (should_resched()) {
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 2c6be169bdc7..7dfeb55be5c1 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1881,7 +1881,7 @@ SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
 /*
  * Sleep for 1 ms in hope whoever holds what we want will let it go.
  */
-void cpu_chill(void)
+void cpu_rest(void)
 {
 	struct timespec tu = {
 		.tv_nsec = NSEC_PER_MSEC,
@@ -1894,7 +1894,7 @@ void cpu_chill(void)
 	if (!freeze_flag)
 		current->flags &= ~PF_NOFREEZE;
 }
-EXPORT_SYMBOL(cpu_chill);
+EXPORT_SYMBOL(cpu_rest);
 #endif
 
 /*
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 21daecdfd86d..32b4d73349dd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1260,7 +1260,7 @@ fail:
 	local_unlock_irqrestore(pendingb_lock, *flags);
 	if (work_is_canceling(work))
 		return -ENOENT;
-	cpu_chill();
+	cpu_rest();
 	return -EAGAIN;
 }
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ef1eb20504a7..e906044802c8 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -699,7 +699,7 @@ static void prb_retire_rx_blk_timer_expired(unsigned long data)
 	if (BLOCK_NUM_PKTS(pbd)) {
 		while (atomic_read(&pkc->blk_fill_in_prog)) {
 			/* Waiting for skb_copy_bits to finish... */
-			cpu_chill();
+			cpu_rest();
 		}
 	}
 
@@ -961,7 +961,7 @@ static void prb_retire_current_block(struct tpacket_kbdq_core *pkc,
 		if (!(status & TP_STATUS_BLK_TMO)) {
 			while (atomic_read(&pkc->blk_fill_in_prog)) {
 				/* Waiting for skb_copy_bits to finish... */
-				cpu_chill();
+				cpu_rest();
 			}
 		}
 		prb_close_block(pkc, pbd, po, status);
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index c8faaf36423a..31fe3b8b4cde 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -287,7 +287,7 @@ static inline void wait_clean_list_grace(void)
 	for_each_online_cpu(cpu) {
 		flag = &per_cpu(clean_list_grace, cpu);
 		while (test_bit(CLEAN_LIST_BUSY_BIT, flag))
-			cpu_chill();
+			cpu_rest();
 	}
 }
 
-- 
2.4.6
^ permalink raw reply related	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 1/3] locking: Add spin_try_or_boost_lock() infrastructure
  2015-09-04  1:19 ` [RFC][PATCH RT 1/3] locking: Add spin_try_or_boost_lock() infrastructure Steven Rostedt
@ 2015-09-04  1:48   ` Steven Rostedt
  0 siblings, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2015-09-04  1:48 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, Sebastian Andrzej Siewior,
	John Kacur, Paul Gortmaker, Peter Zijlstra, Clark Williams,
	Arnaldo Carvalho de Melo, Ingo Molnar
On Thu, 03 Sep 2015 21:19:01 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
>					\
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 2822aceb8dfb..2830c17dc3e4 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -253,6 +253,65 @@ rt_mutex_dequeue_pi(struct task_struct *task, struct rt_mutex_waiter *waiter)
>  	RB_CLEAR_NODE(&waiter->pi_tree_entry);
>  }
>  
> +#ifdef CONFIG_PREEMPT_RT_FULL
> +static inline int task_normal_prio(struct task_struct *task)
> +{
> +	return min(task->normal_prio, task->trylock_prio);
> +}
> +
> +static inline int task_top_waiter_prio(struct task_struct *task)
> +{
> +	return min3(task_top_pi_waiter(task)->prio,
> +		    task->normal_prio, task->trylock_prio);
> +}
> +
> +static inline void clear_trylock_prio(struct task_struct *task)
> +{
> +	task->trylock_prio = MAX_PRIO;
> +}
> +
> +static inline bool current_boosted(void)
> +{
> +	return current->trylock_prio < MAX_PRIO;
> +}
> +
> +static void __rt_mutex_adjust_prio(struct task_struct *task);
> +
> +/* Must call rt_mutex_adjust_prio_chain() if an owner is returned */
> +static inline struct task_struct *trylock_boost_owner(struct rt_mutex *lock)
> +{
> +	struct task_struct *owner;
> +
> +	owner = rt_mutex_owner(lock);
> +	if (!owner)
> +		return NULL;
> +
> +	/* Will be released by rt_mutex_adjust_prio_chain() */
> +	get_task_struct(owner);
> +
> +	raw_spin_lock_irq(&owner->pi_lock);
> +	if (owner->trylock_prio > current->prio) {
> +		owner->trylock_prio = current->prio;
> +		__rt_mutex_adjust_prio(owner);
> +	}
> +	raw_spin_unlock_irq(&owner->pi_lock);
> +
> +	return owner;
> +}
> +#else
I forgot to add:
static inline struct task_struct *trylock_boost_owner(struct rt_mutex *lock)
{
	return NULL;
}
See, I didn't test the non PREEMPT_RT cases ;-)
-- Steve
> +static inline int task_normal_prio(struct task_struct *task)
> +{
> +	return task->normal_prio;
> +}
> +static inline int task_top_waiter_prio(struct task_struct *task)
> +{
> +	return min(task_top_pi_waiter(task)->prio,
> +		   task->normal_prio);
> +}
> +# define current_boosted()	0
> +# define clear_trylock_prio(tsk)	do {} while (0)
> +#endif
> +
> @@ -1717,26 +1790,34 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
>  /*
>   * Slow path try-lock function:
>   */
> -static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
> +static inline int rt_mutex_slowtrylock(struct rt_mutex *lock, bool boost)
>  {
> +	struct task_struct *owner;
>  	int ret;
>  
>  	/*
>  	 * If the lock already has an owner we fail to get the lock.
>  	 * This can be done without taking the @lock->wait_lock as
>  	 * it is only being read, and this is a trylock anyway.
> +	 *
> +	 * Only do the short cut if we do not need to boost the task
> +	 * if we fail to get the lock.
>  	 */
> -	if (rt_mutex_owner(lock))
> +	if (!boost && rt_mutex_owner(lock))
>  		return 0;
>  
>  	/*
> -	 * The mutex has currently no owner. Lock the wait lock and
> -	 * try to acquire the lock.
> +	 * The mutex has currently no owner or we need to boost the task
> +	 * if we fail to grab the lock. Lock the wait lock and try to
> +	 * acquire the lock.
>  	 */
>  	raw_spin_lock(&lock->wait_lock);
>  
>  	ret = try_to_take_rt_mutex(lock, current, NULL);
>  
> +	if (!ret && boost)
> +		owner = trylock_boost_owner(lock);
> +
>  	/*
>  	 * try_to_take_rt_mutex() sets the lock waiters bit
>  	 * unconditionally. Clean this up.
> @@ -1745,6 +1826,10 @@ static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
>  
>  	raw_spin_unlock(&lock->wait_lock);
>  
> +	if (!ret && boost && owner)
> +		rt_mutex_adjust_prio_chain(owner, RT_MUTEX_MIN_CHAINWALK,
> +					   lock, NULL, NULL, NULL);
> +
>  	return ret;
>  }
>  
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-04  1:19 [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack Steven Rostedt
                   ` (2 preceding siblings ...)
  2015-09-04  1:19 ` [RFC][PATCH RT 3/3] rt: Make cpu_chill() into yield() and add new cpu_rest() as msleep(1) Steven Rostedt
@ 2015-09-05 10:30 ` Thomas Gleixner
  2015-09-05 12:04   ` Ingo Molnar
  2015-09-05 12:18   ` Steven Rostedt
  3 siblings, 2 replies; 19+ messages in thread
From: Thomas Gleixner @ 2015-09-05 10:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo,
	Ingo Molnar
On Thu, 3 Sep 2015, Steven Rostedt wrote:
> There are a lot of trylocks in the kernel, and I'm sure there's more around
> that need to be converted to this method.
The only ones we need to convert are those which do an actual trylock
loop. The others, which simply bail out if the trylock fails, are
completely irrelevant.
> I think this is an elegant solution but others may feel
> differently. As I think a msleep() hail mary is extremely non
> deterministic, it's a blemish for a kernel that prides itself on
> adding determinism.
I agree that the msleep hack is horrible. Though I do not agree that
this solution is elegant. It's clever.
I was looking into that a few days ago and did not come up with
something sensible, but your patch and reading up on your well-done
explanation made me take another look.
So the problem we need to solve is:
retry:
	lock(B);
	if (!try_lock(A)) {
		unlock(B);
		cpu_relax();
		goto retry;
	}
So instead of doing that proposed magic boost, we can do something
more straight forward:
retry:
	lock(B);
	if (!try_lock(A)) {
		lock_and_drop(A, B);
		unlock(A);
		goto retry;
	}
lock_and_drop() queues the task as a waiter on A, drops B and then
does the PI adjustment on A. 
Thoughts?
Thanks,
	tglx
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-05 10:30 ` [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack Thomas Gleixner
@ 2015-09-05 12:04   ` Ingo Molnar
  2015-09-05 12:26     ` Steven Rostedt
  2015-09-07  8:35     ` Thomas Gleixner
  2015-09-05 12:18   ` Steven Rostedt
  1 sibling, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2015-09-05 12:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo
* Thomas Gleixner <tglx@linutronix.de> wrote:
> So the problem we need to solve is:
> 
> retry:
> 	lock(B);
> 	if (!try_lock(A)) {
> 		unlock(B);
> 		cpu_relax();
> 		goto retry;
> 	}
> 
> So instead of doing that proposed magic boost, we can do something
> more straight forward:
> 
> retry:
> 	lock(B);
> 	if (!try_lock(A)) {
> 		lock_and_drop(A, B);
> 		unlock(A);
> 		goto retry;
> 	}
> 
> lock_and_drop() queues the task as a waiter on A, drops B and then
> does the PI adjustment on A. 
> 
> Thoughts?
So why not do:
	lock(B);
	if (!trylock(A)) {
		unlock(B);
		lock(A);
		lock(B);
	}
?
Or, if this can be done, why didn't we do:
	lock(A);
	lock(B);
to begin with?
i.e. I'm not sure the problem is properly specified.
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-05 10:30 ` [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack Thomas Gleixner
  2015-09-05 12:04   ` Ingo Molnar
@ 2015-09-05 12:18   ` Steven Rostedt
  2015-09-05 12:27     ` Steven Rostedt
  2015-09-05 12:50     ` Steven Rostedt
  1 sibling, 2 replies; 19+ messages in thread
From: Steven Rostedt @ 2015-09-05 12:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo,
	Ingo Molnar
On Sat, 5 Sep 2015 12:30:59 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:
> So the problem we need to solve is:
> 
> retry:
> 	lock(B);
> 	if (!try_lock(A)) {
> 		unlock(B);
> 		cpu_relax();
> 		goto retry;
> 	}
> 
> So instead of doing that proposed magic boost, we can do something
> more straight forward:
> 
> retry:
> 	lock(B);
> 	if (!try_lock(A)) {
> 		lock_and_drop(A, B);
> 		unlock(A);
> 		goto retry;
> 	}
> 
> lock_and_drop() queues the task as a waiter on A, drops B and then
> does the PI adjustment on A. 
That was my original solution, and I believe I added patches to do
exactly that to the networking code in the past. I remember writing
that helper function such that on non PREEMPT_RT it was a nop.
I even had that solution in my slides at LinuxCon/LinuxPlumbers ;-)
But then I talk about dcache.c. Take a look at that file, and the
complexity of that. Is it safe to take the inode and dcache parent
locks after you unlock the other locks?
-- Steve
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-05 12:04   ` Ingo Molnar
@ 2015-09-05 12:26     ` Steven Rostedt
  2015-09-07  8:35     ` Thomas Gleixner
  1 sibling, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2015-09-05 12:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo
On Sat, 5 Sep 2015 14:04:57 +0200
Ingo Molnar <mingo@kernel.org> wrote:
 
> So why not do:
> 
> 	lock(B);
> 	if (!trylock(A)) {
> 		unlock(B);
> 		lock(A);
> 		lock(B);
> 	}
> 
> ?
> 
> Or, if this can be done, why didn't we do:
> 
> 	lock(A);
> 	lock(B);
> 
> to begin with?
> 
> i.e. I'm not sure the problem is properly specified.
Yeah, this is actually the solution I came up with before. I misread
what Thomas wrote. His is slightly different.
-- Steve
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-05 12:18   ` Steven Rostedt
@ 2015-09-05 12:27     ` Steven Rostedt
  2015-09-05 12:50     ` Steven Rostedt
  1 sibling, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2015-09-05 12:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo,
	Ingo Molnar
On Sat, 5 Sep 2015 08:18:36 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> I even had that solution in my slides at LinuxCon/LinuxPlumbers ;-)
Here's my slides:
http://rostedt.homelinux.com/private/linux-con-rt-into-mainline-2015.pdf
Slide 19 is where I start talking about it.
Slide 21 shows Ingo's solution.
Slide 22 shows the complex issue of dcache.c.
Would your solution work with that code?
-- Steve
> 
> 
> But then I talk about dcache.c. Take a look at that file, and the
> complexity of that. Is it safe to take the inode and dcache parent
> locks after you unlock the other locks?
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-05 12:18   ` Steven Rostedt
  2015-09-05 12:27     ` Steven Rostedt
@ 2015-09-05 12:50     ` Steven Rostedt
  2015-09-07  9:14       ` Thomas Gleixner
  1 sibling, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2015-09-05 12:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo,
	Ingo Molnar
On Sat, 5 Sep 2015 08:18:36 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Sat, 5 Sep 2015 12:30:59 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > So instead of doing that proposed magic boost, we can do something
> > more straight forward:
> > 
> > retry:
> > 	lock(B);
> > 	if (!try_lock(A)) {
> > 		lock_and_drop(A, B);
> > 		unlock(A);
> > 		goto retry;
> > 	}
> > 
> > lock_and_drop() queues the task as a waiter on A, drops B and then
> > does the PI adjustment on A. 
> 
> That was my original solution, and I believe I added patches to do
> exactly that to the networking code in the past. I remember writing
> that helper function such that on non PREEMPT_RT it was a nop.
Just to point out again that I misread what you wrote. That's what I
get for responding to email 10 minutes after I get out of bed ;-)
You need to be careful about adding the waiter on A. If the owner of A
is blocked on B, the pi inheritance may detect that as a deadlock.
-- Steve
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-05 12:04   ` Ingo Molnar
  2015-09-05 12:26     ` Steven Rostedt
@ 2015-09-07  8:35     ` Thomas Gleixner
  2015-09-07 10:10       ` Thomas Gleixner
                         ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Thomas Gleixner @ 2015-09-07  8:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo
On Sat, 5 Sep 2015, Ingo Molnar wrote:
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > So the problem we need to solve is:
> > 
> > retry:
> > 	lock(B);
> > 	if (!try_lock(A)) {
> > 		unlock(B);
> > 		cpu_relax();
> > 		goto retry;
> > 	}
> > 
> > So instead of doing that proposed magic boost, we can do something
> > more straight forward:
> > 
> > retry:
> > 	lock(B);
> > 	if (!try_lock(A)) {
> > 		lock_and_drop(A, B);
> > 		unlock(A);
> > 		goto retry;
> > 	}
> > 
> > lock_and_drop() queues the task as a waiter on A, drops B and then
> > does the PI adjustment on A. 
> > 
> > Thoughts?
> 
> So why not do:
> 
> 	lock(B);
> 	if (!trylock(A)) {
> 		unlock(B);
> 		lock(A);
> 		lock(B);
> 	}
> 
> ?
> 
> Or, if this can be done, why didn't we do:
> 
> 	lock(A);
> 	lock(B);
> 
> to begin with?
> 
> i.e. I'm not sure the problem is properly specified.
Right. I omitted some essential information.
       lock(y->lock);
       x = y->x;
       if (!try_lock(x->lock))
		....
Once we drop x->lock, y->x can change. That's why the retry is there.
Now on RT the trylock loop can obviously lead to a live lock if the
try locker preempted the holder of x->lock.
What Steve is trying to do is to boost the holder of x->lock (task A)
without actually queueing the task (task B) on the lock wait queue of
x->lock. To get out of the try-lock loop he calls sched_yield() from
task B.
While this works by some definition of works, I really do not like the
semantical obscurity of this approach.
1) The boosting is not related to anything.
   If the priority of taskB changes then nothing changes the boosting
   of taskA.
2) The boosting stops
3) sched_yield() makes me shudder
   CPU0			CPU1	
   taskA
     lock(x->lock)
   preemption
   taskC
			taskB
			  lock(y->lock);
			  x = y->x;
			  if (!try_lock(x->lock)) {
			    unlock(y->lock);
			    boost(taskA);
			    sched_yield();  <- returns immediately
			    
   So, if taskC has higher priority than taskB and therefore than
   taskA, taskB will do the lock/trylock/unlock/boost dance in
   circles.
   We can make that worse. If taskB's code looks like this:
			  lock(y->lock);
			  x = y->x;
			  if (!try_lock(x->lock)) {
			    unlock(y->lock);
			    boost(taskA);
			    sched_yield();
			    return -EAGAIN;
  and at the callsite it decides to do something completely different
  than retrying then taskA stays boosted.
So we already have two scenarios where this clearly violates the PI
rules, and I really have no interest in debugging leaked RT
priorities.
I agree with Steve that the main case where we have this horrible
msleep() right now - dcache - is complex, but we should rather sit down,
analyze it properly, and come up with semantically well-defined
solutions.
Thanks,
	tglx
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-05 12:50     ` Steven Rostedt
@ 2015-09-07  9:14       ` Thomas Gleixner
  0 siblings, 0 replies; 19+ messages in thread
From: Thomas Gleixner @ 2015-09-07  9:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo,
	Ingo Molnar
On Sat, 5 Sep 2015, Steven Rostedt wrote:
> On Sat, 5 Sep 2015 08:18:36 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Sat, 5 Sep 2015 12:30:59 +0200 (CEST)
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> > > 
> > > So instead of doing that proposed magic boost, we can do something
> > > more straight forward:
> > > 
> > > retry:
> > > 	lock(B);
> > > 	if (!try_lock(A)) {
> > > 		lock_and_drop(A, B);
> > > 		unlock(A);
> > > 		goto retry;
> > > 	}
> > > 
> > > lock_and_drop() queues the task as a waiter on A, drops B and then
> > > does the PI adjustment on A. 
> > 
> > That was my original solution, and I believe I added patches to do
> > exactly that to the networking code in the past. I remember writing
> > that helper function such that on non PREEMPT_RT it was a nop.
> 
> Just to point out again that I misread what you wrote. That's what I
> get for responding to email 10 minutes after I get out of bed ;-)
> 
> 
> You need to be careful about adding the waiter on A. If the owner of A
> is blocked on B, the pi inheritance may detect that as a deadlock.
That's exactly why I wrote:
> > > lock_and_drop() queues the task as a waiter on A, drops B and then
> > > does the PI adjustment on A. 
because the PI adjustment would detect the deadlock.
Now the real question is whether we need that complexity at all. We
need to analyze the life time rules. If we have something like this:
     lock(y->lock);
     x = y->x;
     ...
and x cannot go away under us, then we can simply do
     lock(y->lock);
     x = y->x;
     if (!trylock(x->lock)) {
     	unlock(y->lock);
	lock(x->lock);
	unlock(x->lock);
	goto retry;
     }
Thanks,
	tglx
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-07  8:35     ` Thomas Gleixner
@ 2015-09-07 10:10       ` Thomas Gleixner
  2015-09-08  7:31       ` Ingo Molnar
  2015-09-08 16:59       ` Steven Rostedt
  2 siblings, 0 replies; 19+ messages in thread
From: Thomas Gleixner @ 2015-09-07 10:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo
On Mon, 7 Sep 2015, Thomas Gleixner wrote:
> 1) The boosting is not related to anything.
> 
>    If the priority of taskB changes then nothing changes the boosting
>    of taskA.
> 
> 2) The boosting stops
Scratch that #2. Hit send too fast. #1 and #3 still hold.
 
Thanks,
	tglx
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-07  8:35     ` Thomas Gleixner
  2015-09-07 10:10       ` Thomas Gleixner
@ 2015-09-08  7:31       ` Ingo Molnar
  2015-09-08  8:09         ` Thomas Gleixner
  2015-09-08 16:59       ` Steven Rostedt
  2 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2015-09-08  7:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo
* Thomas Gleixner <tglx@linutronix.de> wrote:
> 3) sched_yield() makes me shudder
> 
>    CPU0			CPU1	
> 
>    taskA
>      lock(x->lock)
> 
>    preemption
>    taskC
> 			taskB
> 			  lock(y->lock);
> 			  x = y->x;
> 			  if (!try_lock(x->lock)) {
> 			    unlock(y->lock);
> 			    boost(taskA);
> 			    sched_yield();  <- returns immediately
So I'm still struggling with properly parsing the usecase.
If y->x might become invalid the moment we drop y->lock, what makes the 'taskA' 
use (after we've dropped y->lock) safe? Shouldn't we at least also have a 
task_get(taskA)/task_put(taskA) reference count, to make sure the boosted task 
stays around?
And if we are into getting reference counts, why not solve it at a higher level 
and get a reference count to 'x' to make sure it's safe to use? Then we could do:
        lock(y->lock);
retry:
	x = y->x;
        if (!trylock(x->lock)) {
		get_ref(x->count)
                unlock(y->lock);
                lock(x->lock);
                lock(y->lock);
		put_ref(x->count);
		if (y->x != x) { /* Retry if 'x' got dropped meanwhile */
			unlock(x->lock);
			goto retry;
		}
        }
Or so.
Note how much safer this sequence is, and still just as fast in the common case 
(which I suppose is the main motivation within dcache.c?).
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-08  7:31       ` Ingo Molnar
@ 2015-09-08  8:09         ` Thomas Gleixner
  2015-09-14  9:50           ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2015-09-08  8:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo
On Tue, 8 Sep 2015, Ingo Molnar wrote:
> 
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > 3) sched_yield() makes me shudder
> > 
> >    CPU0			CPU1	
> > 
> >    taskA
> >      lock(x->lock)
> > 
> >    preemption
> >    taskC
> > 			taskB
> > 			  lock(y->lock);
> > 			  x = y->x;
> > 			  if (!try_lock(x->lock)) {
> > 			    unlock(y->lock);
> > 			    boost(taskA);
> > 			    sched_yield();  <- returns immediately
> 
> So I'm still struggling with properly parsing the usecase.
> 
> If y->x might become invalid the moment we drop y->lock, what makes
> the 'taskA' use (after we've dropped y->lock) safe? Shouldn't we at
> least also have a task_get(taskA)/task_put(taskA) reference count,
> to make sure the boosted task stays around?
Steven's trylock_and_boost() function makes sure that taskA cannot go
away while doing the boosting. It's a bug in my pseudo code, but that
does not make the issue above go away.
 
> And if we are into getting reference counts, why not solve it at a
> higher level and get a reference count to 'x' to make sure it's safe
> to use? Then we could do:
>
>         lock(y->lock);
> retry:
> 	x = y->x;
>         if (!trylock(x->lock)) {
> 		get_ref(x->count)
>                 unlock(y->lock);
>                 lock(x->lock);
>                 lock(y->lock);
> 		put_ref(x->count);
> 		if (y->x != x) { /* Retry if 'x' got dropped meanwhile */
> 			unlock(x->lock);
> 			goto retry;
> 		}
>         }
> 
> Or so.
In the case of dcache::dentry_kill() we probably do not have to take
refcounts, and it might actually be counterproductive to do so. y->x,
i.e. dentry->parent, cannot vanish under us, if I understand the
lifetime rules correctly.
Aside from that, yes, I was thinking about a similar scheme for
that. I need some more time to grok all the rules there :)
Thanks,
	tglx
 
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-07  8:35     ` Thomas Gleixner
  2015-09-07 10:10       ` Thomas Gleixner
  2015-09-08  7:31       ` Ingo Molnar
@ 2015-09-08 16:59       ` Steven Rostedt
  2015-09-08 19:35         ` Steven Rostedt
  2 siblings, 1 reply; 19+ messages in thread
From: Steven Rostedt @ 2015-09-08 16:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo
On Mon, 7 Sep 2015 10:35:25 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:
> > i.e. I'm not sure the problem is properly specified.
> 
> Right. I omitted some essential information.
> 
>        lock(y->lock);
>        x = y->x;
>        if (!try_lock(x->lock))
> 		....
> 
> Once we drop x->lock, y->x can change. That's why the retry is there.
You mean once we drop y->lock, y->x can change.
> 
> Now on RT the trylock loop can obviously lead to a live lock if the
> try locker preempted the holder of x->lock.
> 
> What Steve is trying to do is to boost the holder of x->lock (task A)
> without actually queueing the task (task B) on the lock wait queue of
> x->lock. To get out of the try-lock loop he calls sched_yield() from
> task B.
> 
> While this works by some definition of works, I really do not like the
> semantical obscurity of this approach.
> 
> 1) The boosting is not related to anything.
> 
>    If the priority of taskB changes then nothing changes the boosting
>    of taskA.
> 
> 2) The boosting stops
> 
> 3) sched_yield() makes me shudder
> 
>    CPU0			CPU1	
> 
>    taskA
>      lock(x->lock)
> 
>    preemption
>    taskC
> 			taskB
> 			  lock(y->lock);
> 			  x = y->x;
> 			  if (!try_lock(x->lock)) {
> 			    unlock(y->lock);
> 			    boost(taskA);
> 			    sched_yield();  <- returns immediately
> 			    
>    So, if taskC has higher priority than taskB and therefore than
>    taskA, taskB will do the lock/trylock/unlock/boost dance in
>    circles.
Yeah, I was aware of this scenario; it just shows the nastiness of a
spinning trylock. Even with the current situation, the task is going to
constantly spin until taskA can run. The difference is that it will let
other tasks run during that 1 ms sleep.
The current code works, but as you said, by some definition of works ;-)
> 
>    We can make that worse. If taskB's code looks like this:
> 
> 			  lock(y->lock);
> 			  x = y->x;
> 			  if (!try_lock(x->lock)) {
> 			    unlock(y->lock);
> 			    boost(taskA);
> 			    sched_yield();
> 			    return -EAGAIN;
> 
>   and at the callsite it decides to do something completely different
>   than retrying, then taskA stays boosted.
But only till it releases the lock (or any lock). That is, it's a
bounded boost, and not a leak. A leak would have no end.
> 
> So we have already two scenarios where this clearly violates the PI
> rules and I really do not have any interest to debug leaked RT
> priorities.
It's not a leak; it's bounded by the time the owner holds that lock. As
the only locks with this characteristic are spinlocks converted to
rtmutexes, those hold times should be short.
> 
> I agree with Steve that the main case where we have this horrible
> msleep() right now - dcache - is complex, but we should rather sit
> down, analyze it properly and come up with semantically well-defined
> solutions.
I'm happy to have this discussion! That's what RFC patches are for ;-)
I would also be happy if we can come up with a better solution than this
proposal.
-- Steve
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-08 16:59       ` Steven Rostedt
@ 2015-09-08 19:35         ` Steven Rostedt
  0 siblings, 0 replies; 19+ messages in thread
From: Steven Rostedt @ 2015-09-08 19:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo
On Tue, 8 Sep 2015 12:59:34 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 7 Sep 2015 10:35:25 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> > 			    
> >    So, if taskC has higher priority than taskB and therefore than
> >    taskA, taskB will do the lock/trylock/unlock/boost dance in
> >    circles.
> 
> Yeah, I was aware of this scenario; it just shows the nastiness of a
> spinning trylock. Even with the current situation, the task is going to
> constantly spin until taskA can run. The difference is that it will let
> other tasks run during that 1 ms sleep.
> The current code works, but as you said, by some definition of works ;-)
>
I thought about this a little more, but did not do any coding yet. I
just wanted to bounce some ideas off you.
First, I still want to keep the spin_try_or_boost_lock() logic, but
maybe call it spin_trylock_or_boost(), as we are not really boosting the
lock, but the owner.
It would still keep the requirement that if you don't get the lock, you
must call cpu_chill(), along the lines of the pattern sketched below.
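(Sketch only, reusing the y/x example from earlier in the thread;
spin_trylock_or_boost() is just the proposed name here, not an existing
primitive.)

	retry:
		spin_lock(&y->lock);
		x = y->x;
		if (!spin_trylock_or_boost(&x->lock)) {
			/* Failed: the owner of x->lock got boosted to our
			 * priority. Drop y->lock and cpu_chill() so the
			 * owner can make progress, then retry. */
			spin_unlock(&y->lock);
			cpu_chill();
			goto retry;
		}
		/* Both y->lock and x->lock are held here. */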
But to get rid of the three issues you have with my current patch, I
have some ideas.
First, let me restate the three issues you raised, to make sure that we
are in sync.
 Issue #1) The task owning the requested lock is on another CPU and is
  preempted there by an even higher priority task. Thus the trylock
  spinner hogs its own CPU, preventing further progress of other tasks
  that want to run on that CPU.
 Issue #2) There could be a temporary leak of priority if the caller of
  spin_trylock_or_boost() returns after the cpu_chill() with -EAGAIN and
  does something different than retrying.
 Issue #3) The boosting is not related to anything. If the trylock
  spinner gets boosted (or deboosted) itself, there's no trail back to
  the owner of the lock to adjust its boost accordingly.
To solve the above, we need to keep some state between the failed call
to spin_trylock_or_boost() and cpu_chill() (which is why there would be
a requirement to call cpu_chill() on a failed attempt to acquire the
lock).
We could add a static waiter to the task_struct, such that a failed
spin_trylock_or_boost() would add that waiter to the owner's pi list and
to the lock's waiter list. This waiter will need a flag stating that its
task is not actually blocked, so that a loop in the PI chain through it
is not treated as a deadlock.
Now, if the trylock spinner either blocks on another lock or fails
another spin_trylock_or_boost(), it would unhook this waiter and do all
the work required to unhook it (deboost the owner, etc.). That is, a
task can only use one such waiter at a time, and therefore can only be
boosting one task at a time. The last call to spin_lock() or
spin_trylock_or_boost() always wins (gets to use the waiter).
When the owner of the lock releases the lock, if this waiter is the top
waiter, it will have to grab the spinner's pi_lock, set the waiter's
lock pointer to NULL, and wake the spinner up.
Then, when the trylock spinner gets around to the cpu_chill(), that code
will check whether the static waiter in its task_struct is in use. It
will grab its own pi_lock and check whether the waiter's lock pointer is
NULL; if not, it will go to sleep.
This solves:
 Issue #1) If the lock is still held, the cpu_chill() will sleep, not
  yield.
 Issue #2) The trylock spinner does not get past the cpu_chill() while
  the owner of the lock from the last failed spin_trylock_or_boost()
  still holds it, so the owner will not have leaked that priority.
 Issue #3) Using a real waiter allows the boost to be adjusted when the
  spinner's own priority changes. The priority chain walk will need to
  be modified to take this into account.
This would be more complex than what I proposed, but it would also solve
the issues that you brought up. The fact that a task can only block with
one waiter (it gives up its last trylock boost if it needs to block on
something else that needs a waiter, or if it tries another
spin_trylock_or_boost() and fails to get the lock) will keep the
complexity down. It also prevents a task from being in a PI chain twice.
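Roughly, and with purely illustrative names (trylock_waiter,
trylock_lock, trylock_waiter_enqueue()/trylock_waiter_release() are made
up for this sketch, not existing kernel API), the idea would look
something like:

	/* In task_struct (illustrative):
	 *   struct rt_mutex_waiter trylock_waiter;  - the one static waiter
	 *   struct rt_mutex       *trylock_lock;    - lock we boosted for, or NULL
	 */

	int spin_trylock_or_boost(spinlock_t *lock)
	{
		if (spin_trylock(lock))
			return 1;

		/* Only one pending trylock boost per task: drop any
		 * previous one (deboosting as needed) before installing
		 * the new one. */
		trylock_waiter_release(current);

		/* Queue the static waiter on the lock and on the owner's
		 * pi list, flagged so the deadlock detector ignores it,
		 * then walk the PI chain to boost the owner. */
		trylock_waiter_enqueue(current, lock);
		return 0;
	}

	void cpu_chill(void)
	{
		raw_spin_lock_irq(&current->pi_lock);
		if (current->trylock_lock) {
			/* The owner has not released the lock yet: sleep.
			 * The owner's unlock path clears trylock_lock and
			 * wakes us when it drops the lock. */
			set_current_state(TASK_UNINTERRUPTIBLE);
			raw_spin_unlock_irq(&current->pi_lock);
			schedule();
			return;
		}
		raw_spin_unlock_irq(&current->pi_lock);
	}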
Thoughts?
-- Steve
^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack
  2015-09-08  8:09         ` Thomas Gleixner
@ 2015-09-14  9:50           ` Ingo Molnar
  0 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2015-09-14  9:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, linux-kernel, linux-rt-users, Carsten Emde,
	Sebastian Andrzej Siewior, John Kacur, Paul Gortmaker,
	Peter Zijlstra, Clark Williams, Arnaldo Carvalho de Melo
* Thomas Gleixner <tglx@linutronix.de> wrote:
> > And if we are into getting reference counts, why not solve it at a higher 
> > level and get a reference count to 'x' to make sure it's safe to use? Then we 
> > could do:
> >
> >         lock(y->lock);
> > retry:
> >         x = y->x;
> >         if (!trylock(x->lock)) {
> >                 get_ref(x->count)
> >                 unlock(y->lock);
> >                 lock(x->lock);
> >                 lock(y->lock);
> >                 put_ref(x->count);
> >                 if (y->x != x) { /* Retry if 'x' got dropped meanwhile */
> >                         unlock(x->lock);
> >                         goto retry;
> >                 }
> >          }
> > 
> > Or so.
> 
> In the case of dcache::dentry_kill() we probably do not have to take refcounts,
> and it might actually be counterproductive to do so. y->x, i.e. dentry->parent,
> cannot vanish under us, if I understand the lifetime rules correctly.
Ok, that's even better.
> Aside from that, yes, I was thinking about a similar scheme for that. I need some
> more time to grok all the rules there :)
Ok, great! :-)
I really don't think we need a new locking primitive - and with something like the 
above we could improve the code upstream as well and make it scale better in some 
scenarios, right?
Thanks,
	Ingo
^ permalink raw reply	[flat|nested] 19+ messages in thread
end of thread, other threads:[~2015-09-14  9:50 UTC | newest]
Thread overview: 19+ messages
2015-09-04  1:19 [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack Steven Rostedt
2015-09-04  1:19 ` [RFC][PATCH RT 1/3] locking: Add spin_try_or_boost_lock() infrastructure Steven Rostedt
2015-09-04  1:48   ` Steven Rostedt
2015-09-04  1:19 ` [RFC][PATCH RT 2/3] locking: Convert trylock spinners over to spin_try_or_boost_lock() Steven Rostedt
2015-09-04  1:19 ` [RFC][PATCH RT 3/3] rt: Make cpu_chill() into yield() and add new cpu_rest() as msleep(1) Steven Rostedt
2015-09-05 10:30 ` [RFC][PATCH RT 0/3] RT: Fix trylock deadlock without msleep() hack Thomas Gleixner
2015-09-05 12:04   ` Ingo Molnar
2015-09-05 12:26     ` Steven Rostedt
2015-09-07  8:35     ` Thomas Gleixner
2015-09-07 10:10       ` Thomas Gleixner
2015-09-08  7:31       ` Ingo Molnar
2015-09-08  8:09         ` Thomas Gleixner
2015-09-14  9:50           ` Ingo Molnar
2015-09-08 16:59       ` Steven Rostedt
2015-09-08 19:35         ` Steven Rostedt
2015-09-05 12:18   ` Steven Rostedt
2015-09-05 12:27     ` Steven Rostedt
2015-09-05 12:50     ` Steven Rostedt
2015-09-07  9:14       ` Thomas Gleixner