Re: [RFC] Transactional CGroup task attachment

Linux Container Development
 help / color / mirror / Atom feed

From: Balbir Singh <balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
To: Paul Menage <menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: "Linux Containers"
	<containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>,
	"\"Daisuke Miyakawa (?? ??)\""
	<dmiyakawa-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	"Peter Zijlstra"
	<a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw@public.gmane.org>,
	"Pavel Emelianov" <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Subject: Re: [RFC] Transactional CGroup task attachment
Date: Fri, 11 Jul 2008 21:42:39 +0530	[thread overview]
Message-ID: <48778677.3040308@linux.vnet.ibm.com> (raw)
In-Reply-To: <6599ad830807092346j1fdb2ef9l2ca2b52e2e68096a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Paul Menage wrote:
> This is an initial design for a transactional task attachment
> framework for cgroups. There are probably some potential simplications
> that I've missed, particularly in the area of locking. Comments
> appreciated.
> 
> The Problem
> ===========
> 
> Currently cgroups task movements are done in three phases
> 
> 1) all subsystems in the destination cgroup get the chance to veto the
> movement (via their can_attach()) callback
> 2) task->cgroups is updated (while holding task->alloc_lock)
> 3) all subsystems get an attach() callback to let them perform any
> housekeeping updates
> 
> The problems with this include:
> 
> - There's no way to ensure that the result of can_attach() remains
> valid up until the attach() callback, unless any invalidating
> operations call cgroup_lock() to synchronize with the change. This is
> fine for something like cpusets, where invalidating operations are
> rare slow events like the user removing all cpus from a cpuset, or cpu
> hotplug triggering removal of a cpuset's last cpu. It's not so good
> for the virtual address space controller where the can_attach() check
> might be that the res_counter has enough space, and an invalidating
> operation might be another task in the cgroup allocating another page
> of virtual address space.
> 

Precisely!

> - It doesn't handle the case of the proposed "cgroup.procs" file which
> can move multiple threads into a cgroup in one operation; the
> can_attach() and attach() calls should be able to atomically allow all
> or none of the threads to move.
> 
> - it can create races around the time of the movement regarding to
> which cgroup a resource charge/uncharge should be assigned (e.g.
> between steps 2 and 3 new resource usage will be charged to the
> destination cgroup, but step 3 might involve migrating a charge equal
> to the task's resource usage from the old cgroup to the new, resulting
> in over/under-charges.
> 
> 
> Conceptual solution
> ===================
> 
> In ideal terms, a solution for this problem would meet the following
> requirements:
> 
> - support movement of an arbitrary set of threads between an arbitrary
> set of cgroups
> - allow arbitrarily complex locking from the subsystems involved so
> that they can synchronize against concurrent charges, etc
> - allow rollback at any point in the process
> 
> But in practice that would probably be way more complex than we'd want
> in the kernel. We don't want to encourage excessively-complex locking
> from subsystems, and we don't need to support arbitrary task
> movements.
> 
> 
> (Hopefully!) Practical solution
> =============
> 
> So here's a more practical solution, which hopefully catches the
> important parts of the requirements without being quite so complex.
> 
> The restrictions are:
> 
> - only supporting movement to one destination cgroup (in the same
> hierarchy, of course); if an entire process is being moved, then
> potentially its threads could be coming from different source cgroups
> - a subsystem may optionally fail such an attach if it can't handle
> the synchronization this would entail.
> 
> - supporting moving either one thread, one entire thread group or (for
> the future) "all threads". This supports the existing "tasks" file,
> the proposed "procs" file and also allows scope for things like adding
> a subsystem to an existing hierarchy.
> 
> - locking/checking performed in two phases - one to support sleeping
> locks, and one to support spinlocks. This is to support both
> subsystems that use mutexes to protect their data, and subsystems that
> use spinlocks
> 
> - no locks allowed to be shared between multiple subsystems during the
> transaction, with the single exception of the mmap_sem of the
> thread/process being moved. This is because multiple subsystems use
> the mmap_sem for synchronization, and are quite likely to be mounted
> in the same hierarchy. The alternative would be to introduce a
> down_read_unfair() operation that would skip ahead of waiting writers,
> to safely allow a single thread to recursively lock mm->mmap_sem.
> 

This would ideally be recursive mutexes, Linus does not like recursive mutexes.
Adding an unfair variant would mean that we need to support a more generic
locking class.

> First we define the state for the transaction:
> 
> struct cgroup_attach_state {
> 
>   // The thread or process being moved, NULL if moving (potentially) all threads
>   struct task_struct *task;
>   enum {
>     CGROUP_ATTACH_THREAD,
>     CGROUP_ATTACH_PROCESS,
>     CGROUP_ATTACH_ALL
>   } mode;
> 
>   // The destination cgroup
>   struct cgroup *dest;
> 
>   // The source cgroup for "task" (child threads *may* have different
> groups; subsystem must handle this if it needs to)
>   struct cgroup *src;
> 
>   // Private state for the attach operation per-subsys. Subsystems are
> completely responsible for managing this
>   void *subsys_state[CGROUP_SUBSYS_STATE];
> 
>   // "Recursive lock count" for task->mm->mmap_sem (needed if we don't
> introduce down_read_unfair())
>   int mmap_sem_lock_count;
> };
> 
> New cgroup subsystem callbacks (all optional):
> 
> -----
> 
> int prepare_attach_sleep(struct cgroup_attach_state *state);
> 

Is _sleep really required to be specfied? The function name sounds as if the
callback processor will sleep.

> Called during the first preparation phase for each subsystem. The
> subsystem may perform any sleeping behaviour, including waiting for
> mutexes and doing sleeping memory allocations, but may not disable
> interrupts or take any spinlocks. Return a -ve error on failure or 0
> on success. If it returns failure, then no further callbacks will be
> made for this attach; if it returns success then exactly one of
> abort_attach_sleep() or commit_attach() is guaranteed to be called in
> the future
> 
> No two subsystems may take the same lock as part of their
> prepare_attach_sleep() callback. A special case is made for mmap_sem:
> if this callback needs to down_read(&state->task->mmap_sem) it should
> only do so if state->mmap_sem_lock_count++ == 0. (A helper function
> will be provided for this). The callback should not
> write_lock(&state->task->mmap_sem).
> 
> Called with group_mutex (which prevents any other task movement
> between cgroups) held plus any mutexes/semaphores taken by earlier
> subsystems's callbacks.
> 

This sounds almost like the BKL for cgroups :) I see where you are going with
this, but I am afraid the implementation and rules sound complex. It will be
hard to verify that two subsystems are not going to take the same lock. I would
rather prefer to do the following

1. Prepare_attach() the subsystem does it's or fails
2. If someone failed, send out failed notifications to successfull callbacks
3. On receiving a failed notification (due to a different cgroup failure),
clients undo their operation (done in prepare_attach())
4. If all was successful, move the task and call attached() after the task is
attached.

[snip]

> 
> So, thoughts? Is this just way to complex? Have I missed something
> that means this approach can't work?
> 

I read through the rest of it. The sleep/nosleep might make sense (to help the
task acquire the type of lock it wants to acquire), but isn't sleep a generic
case for nosleep as well?

Can we manage with steps I've listed above?

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

next prev parent reply	other threads:[~2008-07-11 16:12 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-07-10  6:46 [RFC] Transactional CGroup task attachment Paul Menage
     [not found] ` <6599ad830807092346j1fdb2ef9l2ca2b52e2e68096a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-07-10 11:23   ` Paul Jackson
2008-07-11  0:20   ` KAMEZAWA Hiroyuki
     [not found]     ` <20080711092058.754ad72e.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2008-07-14  6:28       ` Daisuke Nishimura
     [not found]         ` <20080714152822.c795f321.nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
2008-07-14  7:54           ` KAMEZAWA Hiroyuki
     [not found]             ` <20080714165444.65f1ac8e.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2008-07-14 12:36               ` Daisuke Nishimura
2008-07-14 19:16           ` Paul Menage
2008-07-11  2:14   ` Li Zefan
     [not found]     ` <4876C209.2010605-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2008-07-11  3:31       ` Paul Menage
2008-07-11 14:27   ` Serge E. Hallyn
     [not found]     ` <20080711142739.GA25338-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-07-11 15:23       ` Paul Menage
     [not found]         ` <6599ad830807110823l3835fba1sae3df01eeb3fa5da-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-07-11 15:34           ` Serge E. Hallyn
     [not found]             ` <20080711153421.GA31344-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-07-11 15:36               ` Paul Menage
2008-07-11 16:12   ` Balbir Singh [this message]
     [not found]     ` <48778677.3040308-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2008-07-11 20:41       ` Paul Menage
2008-07-12  0:03   ` Matt Helsley
     [not found]     ` <1215821034.5456.357.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2008-07-12  0:18       ` Paul Menage
     [not found]         ` <6599ad830807111718ye005349yaa31e729c233474f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-07-12  0:48           ` Matt Helsley
     [not found]             ` <1215823694.5456.382.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2008-07-12  1:49               ` Paul Menage
     [not found]                 ` <6599ad830807111849q7aaf8226tba15193fa23f9531-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-07-12  2:03                   ` Matt Helsley
     [not found]                     ` <1215828216.5456.409.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2008-07-12  2:09                       ` Paul Menage
     [not found]                         ` <6599ad830807111909j4d777735ree2ee17dfcdda853-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-07-14 10:56                           ` Peter Zijlstra
2008-07-14 10:50       ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=48778677.3040308@linux.vnet.ibm.com \
    --to=balbir-23vcf4htsmix0ybbhkvfkdbpr1lh4cv8@public.gmane.org \
    --cc=a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw@public.gmane.org \
    --cc=containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org \
    --cc=dmiyakawa-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org \
    --cc=menage-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org \
    --cc=xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox