Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface

linux-rt-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Esben Nielsen" <nielsen.esben@googlemail.com>
To: "Gregory Haskins" <ghaskins@novell.com>
Cc: mingo@elte.hu, paulmck@linux.vnet.ibm.com, peterz@infradead.org,
	tglx@linutronix.de, rostedt@goodmis.org,
	linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org,
	gregory.haskins@gmail.com, David.Holmes@sun.com,
	jkacur@gmail.com
Subject: Re: [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface
Date: Fri, 22 Aug 2008 14:55:30 +0200	[thread overview]
Message-ID: <59929f7a0808220555r5a0b972cl5db047f74d7cabec@mail.gmail.com> (raw)
In-Reply-To: <20080815202823.668.26199.stgit@dev.haskins.net>

Disclaimer: I am no longer actively involved and I must admit I might
have lost out on much of
what have been going on since I contributed to the PI system 2 years
ago. But I allow myself to comment
anyway.

On Fri, Aug 15, 2008 at 10:28 PM, Gregory Haskins <ghaskins@novell.com> wrote:
> The kernel currently addresses priority-inversion through priority-
> inheritence.  However, all of the priority-inheritence logic is
> integrated into the Real-Time Mutex infrastructure.  This causes a few
> problems:
>
>  1) This tightly coupled relationship makes it difficult to extend to
>    other areas of the kernel (for instance, pi-aware wait-queues may
>    be desirable).
>  2) Enhancing the rtmutex infrastructure becomes challenging because
>    there is no seperation between the locking code, and the pi-code.
>
> This patch aims to rectify these shortcomings by designing a stand-alone
> pi framework which can then be used to replace the rtmutex-specific
> version.  The goal of this framework is to provide similar functionality
> to the existing subsystem, but with sole focus on PI and the
> relationships between objects that can boost priority, and the objects
> that get boosted.

This is really a good idea. When I had time (2 years ago) to actively
work on these problem
I also came to the conclusion that PI should be more general than just
the rtmutex. Preemptive RCU
was the example which drove it.

But I do disagree that general objects should get boosted: The end
targets are always tasks. The objects might
be boosted as intermediate steps, but priority end the only applies to tasks.

I also have a few comments to the actual design:

> ....
> +
> +Multiple sinks per Node:
> +
> +We allow multiple sinks to be associated with a node.  This is a slight departure from the previous implementation which had the notion of only a single sink (i.e. "task->pi_blocked_on").  The reason why we added the ability to add more than one sink was not to change the default chaining model (I.e. multiple boost targets), but rather to add a flexible notification mechanism that is peripheral to the chain, which are informally called "leaf sinks".
> +
> +Leaf-sinks are boostable objects that do not perpetuate a chain per se.  Rather, they act as endpoints to a priority boosting.  Ultimately, every chain ends with a leaf-sink, which presumably will act on the new priority information.  However, there may be any number of leaf-sinks along a chain as well.  Each one will act on its localized priority in its own implementation specific way.  For instance, a task_struct pi-leaf may change the priority of the task and reschedule it if necessary.  Whereas an rwlock leaf-sink may boost a list of reader-owners.

This is bad from a RT point of view: You have a hard time determininig
the number of sinks per node. An rw-lock could have an arbitrary
number of readers (is supposed to really). Therefore
you have no chance of knowing how long the boost/deboost operation
will take. And you also know for how long the boosted tasks stay
boosted. If there can be an arbitrary number of
such tasks you can no longer be deterministic.

> ...
> +
> +#define MAX_PI_DEPENDENCIES 5


WHAT??? There is a finite lock depth defined. I know we did that
originally but it wasn't hardcoded (as far as I remember) and
it was certainly not as low as 5.

Remember: PI is used by the user space futeces as well!

> ....
> +/*
> + * _pi_node_update - update the chain
> + *
> + * We loop through up to MAX_PI_DEPENDENCIES times looking for stale entries
> + * that need to propagate up the chain.  This is a step-wise process where we
> + * have to be careful about locking and preemption.  By trying MAX_PI_DEPs
> + * times, we guarantee that this update routine is an effective barrier...
> + * all modifications made prior to the call to this barrier will have completed.
> + *
> + * Deadlock avoidance: This node may participate in a chain of nodes which
> + * form a graph of arbitrary structure.  While the graph should technically
> + * never close on itself barring any bugs, we still want to protect against
> + * a theoretical ABBA deadlock (if for nothing else, to prevent lockdep
> + * from detecting this potential).  To do this, we employ a dual-locking
> + * scheme where we can carefully control the order.  That is: node->lock
> + * protects most of the node's internal state, but it will never be held
> + * across a chain update.  sinkref->lock, on the other hand, can be held
> + * across a boost/deboost, and also guarantees proper execution order. Also
> + * note that no locks are held across an sink->update.
> + */
> +static int
> +_pi_node_update(struct pi_sink *sink, unsigned int flags)
> +{
> +       struct pi_node    *node = node_of(sink);
> +       struct pi_sinkref *sinkref;
> +       unsigned long      iflags;
> +       int                count = 0;
> +       int                i;
> +       int                pprio;
> +       struct updater     updaters[MAX_PI_DEPENDENCIES];
> +
> +       spin_lock_irqsave(&node->lock, iflags);
> +
> +       pprio = node->prio;
> +
> +       if (!plist_head_empty(&node->srcs))
> +               node->prio = plist_first(&node->srcs)->prio;
> +       else
> +               node->prio = MAX_PRIO;
> +
> +       list_for_each_entry(sinkref, &node->sinks, list) {
> +               /*
> +                * If the priority is changing, or if this is a
> +                * BOOST/DEBOOST, we consider this sink "stale"
> +                */
> +               if (pprio != node->prio
> +                   || sinkref->state != pi_state_boosted) {
> +                       struct updater *iter = &updaters[count++];

What prevents count from overrun?

> +
> +                       BUG_ON(!atomic_read(&sinkref->sink->refs));
> +                       _pi_sink_get(sinkref);
> +
> +                       iter->update  = 1;
> +                       iter->sinkref = sinkref;
> +                       iter->sink     = sinkref->sink;
> +               }
> +       }
> +
> +       spin_unlock(&node->lock);
> +
> +       for (i = 0; i < count; ++i) {
> +               struct updater    *iter = &updaters[i];
> +               unsigned int       lflags = PI_FLAG_DEFER_UPDATE;
> +               struct pi_sink    *sink;
> +
> +               sinkref = iter->sinkref;
> +               sink = iter->sink;
> +
> +               spin_lock(&sinkref->lock);
> +
> +               switch (sinkref->state) {
> +               case pi_state_boost:
> +                       sinkref->state = pi_state_boosted;
> +                       /* Fall through */
> +               case pi_state_boosted:
> +                       sink->ops->boost(sink, &sinkref->src, lflags);
> +                       break;
> +               case pi_state_deboost:
> +                       sink->ops->deboost(sink, &sinkref->src, lflags);
> +                       sinkref->state = pi_state_free;
> +
> +                       /*
> +                        * drop the ref that we took when the sinkref
> +                        * was allocated.  We still hold a ref from
> +                        * above.
> +                        */
> +                       _pi_sink_put_all(node, sinkref);
> +                       break;
> +               case pi_state_free:
> +                       iter->update = 0;
> +                       break;
> +               default:
> +                       panic("illegal sinkref type: %d", sinkref->state);
> +               }
> +
> +               spin_unlock(&sinkref->lock);
> +
> +               /*
> +                * We will drop the sinkref reference while still holding the
> +                * preempt/irqs off so that the memory is returned synchronously
> +                * to the system.
> +                */
> +               _pi_sink_put_local(node, sinkref);
> +       }
> +
> +       local_irq_restore(iflags);

Yack! You keep interrupts off while doing the chain. I think my main
contribution to the PI system 2 years ago was to do this preemptively.
I.e. there was points in the loop where interrupts and preemption
where turned on.

Remember: It goes into user space again. An evil user could craft an
application with a very long lock depth and keep higher priority real
time tasks from running for an arbitrary long time (if
no limit on the lock depth is set, which is bad because it will be too
low in some cases.)

But as I said I have had no time to watch what has actually been going
on in the kernel for the last 2 years roughly. The said defects might
have creeped in by other contributers already :-(

Esben

next prev parent reply	other threads:[~2008-08-22 12:55 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-08-01 21:16 [PATCH RT RFC 0/7] Priority Inheritance enhancements Gregory Haskins
2008-08-01 21:16 ` [PATCH RT RFC 1/7] add generalized priority-inheritance interface Gregory Haskins
2008-08-01 21:16 ` [PATCH RT RFC 2/7] sched: add the basic PI infrastructure to the task_struct Gregory Haskins
2008-08-01 21:17 ` [PATCH RT RFC 3/7] rtmutex: formally initialize the rt_mutex_waiters Gregory Haskins
2008-08-01 21:17 ` [PATCH RT RFC 4/7] RT: wrap the rt_rwlock "add reader" logic Gregory Haskins
2008-08-01 21:17 ` [PATCH RT RFC 5/7] rtmutex: use runtime init for rtmutexes Gregory Haskins
2008-08-01 21:17 ` [PATCH RT RFC 6/7] rtmutex: convert rtmutexes to fully use the PI library Gregory Haskins
2008-08-01 21:17 ` [PATCH RT RFC 7/7] rtmutex: pi-boost locks as late as possible Gregory Haskins
2008-08-04 13:21   ` Gregory Haskins
2008-08-05  3:01     ` Gregory Haskins
2008-08-15 12:08 ` [PATCH RT RFC v2 0/8] Priority Inheritance enhancements Gregory Haskins
2008-08-15 12:08   ` [PATCH RT RFC v2 1/8] add generalized priority-inheritance interface Gregory Haskins
2008-08-15 13:16     ` [PATCH RT RFC v3] " Gregory Haskins
2008-08-15 12:08   ` [PATCH RT RFC v2 2/8] sched: add the basic PI infrastructure to the task_struct Gregory Haskins
2008-08-15 12:08   ` [PATCH RT RFC v2 3/8] sched: rework task reference counting to work with the pi infrastructure Gregory Haskins
2008-08-15 12:08   ` [PATCH RT RFC v2 4/8] rtmutex: formally initialize the rt_mutex_waiters Gregory Haskins
2008-08-15 12:08   ` [PATCH RT RFC v2 5/8] RT: wrap the rt_rwlock "add reader" logic Gregory Haskins
2008-08-15 12:08   ` [PATCH RT RFC v2 6/8] rtmutex: use runtime init for rtmutexes Gregory Haskins
2008-08-15 12:08   ` [PATCH RT RFC v2 7/8] rtmutex: convert rtmutexes to fully use the PI library Gregory Haskins
2008-08-15 12:08   ` [PATCH RT RFC v2 8/8] rtmutex: pi-boost locks as late as possible Gregory Haskins
2008-08-15 20:28 ` [PATCH RT RFC v4 0/8] Priority Inheritance enhancements Gregory Haskins
2008-08-15 20:28   ` [PATCH RT RFC v4 1/8] add generalized priority-inheritance interface Gregory Haskins
2008-08-15 20:32     ` Gregory Haskins
2008-08-16 15:32     ` AW: " Matthias Behr
2008-08-19  8:34       ` Gregory Haskins
2008-08-16 19:56     ` Peter Zijlstra
2008-08-19  8:40       ` Gregory Haskins
2008-08-22 12:55     ` Esben Nielsen [this message]
2008-08-22 13:15       ` Gregory Haskins
2008-08-22 16:08         ` Gregory Haskins
2008-08-22 13:17       ` Steven Rostedt
2008-08-15 20:28   ` [PATCH RT RFC v4 2/8] sched: add the basic PI infrastructure to the task_struct Gregory Haskins
2008-08-15 20:28   ` [PATCH RT RFC v4 3/8] sched: rework task reference counting to work with the pi infrastructure Gregory Haskins
2008-08-15 20:28   ` [PATCH RT RFC v4 4/8] rtmutex: formally initialize the rt_mutex_waiters Gregory Haskins
2008-08-15 20:28   ` [PATCH RT RFC v4 5/8] RT: wrap the rt_rwlock "add reader" logic Gregory Haskins
2008-08-15 20:28   ` [PATCH RT RFC v4 6/8] rtmutex: use runtime init for rtmutexes Gregory Haskins
2008-08-15 20:28   ` [PATCH RT RFC v4 7/8] rtmutex: convert rtmutexes to fully use the PI library Gregory Haskins
2008-08-15 20:29   ` [PATCH RT RFC v4 8/8] rtmutex: pi-boost locks as late as possible Gregory Haskins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=59929f7a0808220555r5a0b972cl5db047f74d7cabec@mail.gmail.com \
    --to=nielsen.esben@googlemail.com \
    --cc=David.Holmes@sun.com \
    --cc=ghaskins@novell.com \
    --cc=gregory.haskins@gmail.com \
    --cc=jkacur@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rt-users@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).