Linux Container Development
 help / color / mirror / Atom feed
* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
From: Oren Laadan @ 2009-04-14  5:46 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: xemul-bzQdu9zFT3WakBO8gow8eQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-X9Un+BFzKDI
In-Reply-To: <20090410023207.GA27788-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>


Some meta comments about this patch set:

* Patches 1-9 are cleanups, unrelated to checkpoint/restart. They
deserve a separate thread.

* You barely take locks or reference counts to objects that you
later refer to. What if something really bad happens ?

* (contd) If you don't take locks, then you at the very least need
to rely on the container remaining frozen for the duration of the
operation (during checkpoint).

* (contd) Still with locks and references, during restart you can't
even freeze the container, so need extra logic to prevent bad things
(e.g. OOM killer, signal or ptrace from parent container etc).

* What is the rationale behind doing the freeze/unfreeze from within
sys_checkpoint/sys_restart ?   Instead of you let userspace do it
(and only verify in kernel) you gain, again, flexibility. For example,
you want to also snapshot the filesystem, then userspace will do
something like:  freeze container -> snapshot filesystem -> checkpoint
-> thaw container.

* A plethora of "FIXME" comments ...

Alexey Dobriyan wrote:
> This is to show how we see C/R and to provoke discussion on number of
> important issues (mounts, ...).

Quoting your patch:
---
This is one big FIXME:
	What to do with overmounted files?
	What to do with mounts at all, who should restore them?

just restore something to not oops on task exit
---

> 
> This is small part of long-awaited to be cleanuped code.
> 
> It's able to restore busyloop on i386 and x86_64 and restore i386
> busyloop on x86_64. It wasn't tested much more than that.

Oh .. I really wish you'd sent a x86_64 patch our way, too ;)

> 
> I'm currently starting formal testsuite, otherwise it's whack-a-mole
> game and formal TODO list (a huge one).
> 

So I'm still struggling to see the major different in the approaches
that would justify throwing away our hard worked, reviewed, tested
and functional code, and take this - similar in design, largely
incomplete and unreviewed code.

Best,

Oren.

^ permalink raw reply

* Re: [PATCH 10/30] cr: core stuff
From: Oren Laadan @ 2009-04-14  5:22 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: xemul-bzQdu9zFT3WakBO8gow8eQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-X9Un+BFzKDI
In-Reply-To: <20090410023539.GK27788-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>


Alexey Dobriyan wrote:
> * add struct file_operations::checkpoint
> 
>   The point of hook is to serialize enough information to allow restoration
>   of an opened file.
> 
>   The idea (good one!) is that the code which supplies struct file_operations
>   know better what to do with file.

Actually, credit is due to Dave Hansen (or Christoph Hellwig, or both?).

> 
>   Hook gets C/R context (a cookie more or less) on which dump code can
>   cr_write() and small restrictions on what to write: globally unique object id
>   and correct object length to allow jumping through objects.
> 
>   For usual files on on-disk filesystem add generic_file_checkpoint()
> 
>   Add ext3 opened regular files and directories for start.
> 
>   No ->checkpoint, checkpointing is aborted -- deny by default.
> 
> FIXME: unlinked, but opened files aren't supported yet.
> 
> * C/R image design
> 
>   The thing should be flexible -- kernel internals changes every day, so we can't
>   really afford a format with much enforced structure.
> 
>   Image consists of header, object images and terminator.
> 
>   Image header consists of immutable part and mutable part (for future).
> 
>   Immutable header part is magic and image version: "LinuxC/R" + __le32
> 
>   Image version determines everything including image header's mutable part.
>   Image version is going to be bumped at earliest opportunity following changes
>   in kernel internals.
> 
>   So far image header mutable part consists of arch of the kernel which dumped
>   the image (i386, x86_64, ...) and kernel version as found in utsname.
> 
>   Kernel version as string is for distributions. Distro can support C/R for
>   their own kernels, but can't realistically be expected to bump image version --
>   this will conflict with mainline kernels having used same version. We also don't
>   want requests for private parts of image version space.

So far so good, like in our patch-set.

You also need to address differences in configuration (kernel could
have been recompiled) and runtime environment (boot params, etc).

We deferred this issue to a later time.

> 
>   Distro expected to keep image version alone and on restart(2) check utsname
>   version and compare it against previously release kernel versions and based
>   on that turn on compatibility code.

Are you suggesting that conversion of a checkpoint image from an older
version to a newer version be done in the kernel ?

It may work for a few versions, and then you'll get a spaghetti of
#ifdef's in the code, together with a plethora of legacy code.

It is much better/easier to handle checkpoint image transformations
in user space. The kernel will only understand its "current" version
(for some definition of version).

> 
>   Object image is very flexible, the only required parts are a) object type (u32)
>   and b) object total length (u32, [knocks wood]) which must be at the beginning
>   of an image. The rest is not generic C/R code problem.
> 
>   Object images follow one another without holes. Holes are in theory possible but
>   unneeded.
> 

When would you need holes ?

>   Image ends with terminator object. This is mostly to be sure, that, yes, image
>   wasn't truncated for some reason.
> 
> 
> * Objects subject to C/R
> 
>   The idea is to not be very smart but directly dump core kernel data structures
>   related to processes. This includes in this patch:
> 
> 	struct task_struct
> 	struct mm_struct
> 	VMAs
> 	dirty pages
> 	struct file
> 
>   Relations between objects (task_struct has pointer to mm_struct) are fullfilled
>   by dumping pointed to object first, keeping it's position in dumpfile and saving
>   position in a image of pointe? object:

Unless you use the physical position to actually lseek to there to
re-read the data, there is no reason to use the actual position. In
fact it is easier to debug when the shared object identifier is a
simple counter.

If you do use it to lseek, then it's a poor decision -- sounds fragile:
what if we change the file (legitimately) adding data in the middle -
the whole concept breaks.

> 
> 	struct cr_image_task_struct {
> 		cr_pos_t	cr_pos_mm;
> 			...
> 	};
> 
>   Code so far tries hard to dump objects in certain order so there won't be any loops.
>   This property of process that dumpfile can in theory be O_APPEND, will likely be
>   sacrifised (read: child can ptrace parent)

The ability to streamline the checkpoint image IMHO is invaluable.
It's the unix way (TM) of doing things; it makes the process pipe-able.

You can do many nice things when the checkpoint can be streamed: you
can compress, sign, encrypt etc on the fly without taking additional
diskspace. You can transfer over the network (e.g. for migration),
or store remotely without explicit file system support. You can easily
transform the stream from one c/r version to another etc.

This should be a design principle. In my experience I never hit a wall
that forced me to "sacrifice" this decision.

>   sacrifised (read: child can ptrace parent)

Hmmm... if all tasks are created in user space, then this specific
becomes a no-brainer !

> 
> * add struct vm_operations_struct::checkpoint
> 
>   just like with files, code that creates special VMAs should know what to do with them
>   used.
> 
>   just like with files, deny checkpointing by default
> 
>   So far used to install vDSO to same place.

VDSO can be a troublemaker; in recent kernels its location in the MM
can be randomized. It is not necessarily immutable - it can reflect
ynamic kernel data. It may contain different code on newer versions,
so must be compared or worked around during restart etc.

> 
> * add checkpoint(2)
> 
>   Done by determining which tasks are subject to checkpointing, freezeing them,
>   collecting pointers to necessary kernel internals (task_struct, mm_struct, ...),
>   doing that checking supported/unsupported status and aborting if necessary,
>   actual dumping, unfreezeing/killing set of tasks.
> 
>   Also in-checkpoint refcount is maintained to abort on possible invisible changes.
>   Now it works:
> 
> 	For every collected object (mm_struct) keep numbers of references from
> 	other collected objects. It should match object's own refcount.
> 	If there is a mismatch, something is likely pinning object, which means
> 	there is "leak" to outside which means checkpoint(2) can't realistically and
> 	without consequences proceed.
> 
> 	This is in some sense independent check. It's designed to protect from internals
> 	change when C/R code was forgotten to be updated.
> 
>   Userpsace supplies pid of root task and opened file descriptor of future dump file.
>   Kernel reports 0/-E as usual.
> 
>   Runtime tracking of "checkpointable" property is explicitly not done.
>   This introduces overhead even if checkpoint(2) is not done as shown by proponents.
>   Instead any check is done at checkpoint(2) time and -E is returned if something is
>   suspicious or known to be unsupported.
> 
>   FIXME: more checks especially in cr_check_task_struct().
> 
> * add restart(2)
> 
>   Recreate tasks and evething dumped by checkpoint(2) as if nothing happened.
> 
>   The focus is on correct recreating, checking every possibility that target kernel
>   can be on different arch (i386 => x86_64) and target kernel can be very different
>   from source kernel by mistake (i386 => x86_64 COMPAT=n) kernel.
> 
>   restart(2) is done first by creating kernel thread and that demoting it to usual
>   process by adding mm_struct, VMAs, et al. This saves time against method when
>   userspace does fork(2)+restart(2) -- forked mm_struct will be thrown out anyway
>   or at least everything will be unmapped in any case.

Do have figures to support your claims about "saves time" ?

The *largest* component of the restart time, as you probably know,
is the time it takes to restore the memory address space (pages, pages)
of the tasks.

If you do show that this optimization is worth our attention, then it
takes < 10 lines to change current mktree.c to use CLONE_VM ... voila.

I'm interested in hearing more convincing arguments in favor of kernel
creations of restarting tasks (see my other post about it).

> 
>   Restoration is done in current context except CPU registers at last stage.
>   This is because "creation is done by current" is in many, many places,
>    e.g. mmap(2) code.
> 
>   It's expected that filesystem state will be the same. Kernel can't do anything
>   about it expect probably virtual filesystems. If a file is not there anymore,
>   it's not kernel fault, -E will be returned, restart aborted.
> 
>   FIXME: errors aren't propagated correctly out of kernel thread context

Heh .. I guess they always propagate correctly out of regular task
context ;)

Oren.

^ permalink raw reply

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
From: Oren Laadan @ 2009-04-14  4:26 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: xemul-bzQdu9zFT3WakBO8gow8eQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen,
	hch-wEGCiKHe2LqWVfeAwA7xHQ, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-X9Un+BFzKDI
In-Reply-To: <20090413091423.GA19236-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>



Alexey Dobriyan wrote:
> On Thu, Apr 09, 2009 at 10:07:11PM -0700, Dave Hansen wrote:
>> I'm curious how you see these fitting in with the work that we've been
>> doing with Oren.  Do you mean to just start a discussion or are you
>> really proposing these as an alternative to what Oren has been posting?
> 
> Yes, this is posted as alternative.
> 
> Some design decisions are seen as incorrect from here like:

A definition of "design" would help; I find most of your comments
below either vague, cryptic, or technical nits...

> * not rejecting checkpoint with possible "leaks" from container

...like this, for example.
Anything in the current design makes it impossible ?

Anything prohibiting your from adding this feature to the current
patch-set ?

> * not having CAP_SYS_ADMIN on restart(2)

Surely you have read already on the containers mailing list that
for the *time being* we attempt to get as far as possible without
requiring root privileges, to identify security hot-spots.

And surely you have read there the observation that for the general
case root privileges will probably be inevitable.

And surely you don't seriously think that adding this check changes
the "design"...

> * having small (TASK_COMM_LEN) and bigger (objref[1]) image format
>   misdesigns.

Eh ?

> * doing fork(2)+restart(2) per restarted task and whole orchestration
>   done from userspace/future init task.

Why is it "incorrect" ?
What makes it "better" to do it in the kernel ?
Only because you say so is not convincing.

(also see my other post in this matter).

> * not seeing bigger picture (note, this is not equivalent to supporting
>   everything at once, nobody is asking for everything at once) wrt shared
>   objects and format and code changes because of that (note again, image
>   format will change, but it's easy to design high level structure which
>   won't change)

Why don't you describe the bigger picture so that the rest of can
finally see it, too ?!
(what a waste to have spent all this effort in vain...)

> * checking of unsupported features done at wrong place and wrong time
>   and runtime overhead because of that on CR=y kernels.

Eh ?   Did you follow the code recently ?

> 
> There are also low-level things, but it's cumulative effect.
> 
> [1] Do I inderstand correctly that cookie for shared object is an
> address on kernel stack? This is obviously unreliable, if yes :-)

Ah... I see... you didn't look at it that hard, not even read the
documentation with the code.

> 
> 	int objref;
> 		...
> 	/* adding 'file' to the hash will keep a reference to it */
> 	new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
> 					^^^^^^^

That said, there are more similarities than differences between your
suggested template and the current patchset. With your expertise you
can contribute tremendously if you decide to work together.

Oren.

^ permalink raw reply

* Re: [PATCH] Make tst_ipcshm_multi automatable
From: Oren Laadan @ 2009-04-14  3:44 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg
In-Reply-To: <1239650808-22254-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>


thanks ... added.

Dan Smith wrote:
> Add a little to tst_ipcshm_multi to make it automatically validate the
> results and return a pass/fail status indication for automated runs.
> 
> Since Oren said he applied my previous patch to his repository, I'm
> sending this as a delta from the last one I sent[1].  Since the public
> user-cr is not updated yet, you'll need to apply that first to try this.
> 
> 1: https://lists.linux-foundation.org/pipermail/containers/2009-April/016827.html
> 
> Cc: orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org
> Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org
> Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> ---
>  tst_ipcshm_multi.c |   86 ++++++++++++++++++++++++++++++++++++++++++----------
>  1 files changed, 70 insertions(+), 16 deletions(-)
> 
> diff --git a/tst_ipcshm_multi.c b/tst_ipcshm_multi.c
> index 1ef31f7..1bc0cd2 100644
> --- a/tst_ipcshm_multi.c
> +++ b/tst_ipcshm_multi.c
> @@ -4,6 +4,8 @@
>  #include <errno.h>
>  #include <sys/ipc.h>
>  #include <sys/shm.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
>  
>  #include <linux/sched.h>
>  #include <sched.h>
> @@ -11,6 +13,7 @@
>  #define OUTFILE  "/tmp/cr-test.out"
>  #define SEG_SIZE (20 * 4096)
>  #define DELAY 20
> +#define COUNT_MAX 15
>  
>  int attach(unsigned char **seg, int num)
>  {
> @@ -50,14 +53,45 @@ int validate(unsigned char *seg, int num)
>  	return 0;
>  }
>  
> -void track(unsigned char *seg, int num)
> +int track_incr(unsigned char *seg, int num)
>  {
>  	int i;
> +	int last = seg[0];
>  
>  	for (i = 0; i < 20; i++) {
> +		if (seg[0] == COUNT_MAX)
> +			break;
> +
> +		if (abs(last - (int)seg[0]) > 1) {
> +			printf("[CHILD %i] Expected +/-%i (got %i) %i\n",
> +			       num, last, seg[0], abs(last - seg[0]));
> +			return 1;
> +		}
> +
> +		last = seg[0] + 1;
> +
>  		printf("[CHILD %i] Seg[0]: %i\n", num, seg[0]);
>  		sleep(1);
>  	}
> +
> +	return !(seg[0] == COUNT_MAX);
> +}
> +
> +int track_const(unsigned char *seg, int num, int val)
> +{
> +	int i;
> +
> +	for (i = 0; i < 20; i++) {
> +		if (seg[0] != val) {
> +			printf("[CHILD %i] Expected %i not %i\n",
> +			       num, val, seg[0]);
> +			return 1;
> +		}
> +		printf("[CHILD %i] Seg[0]: %i\n", num, seg[0]);
> +		sleep(1);
> +	}
> +
> +	return 0;
>  }
>  
>  /*
> @@ -81,9 +115,7 @@ int child1(void)
>  
>  	sleep(DELAY - 1); /* Wait until after the checkpoint */
>  
> -	track(seg, num);
> -
> -	return 0;
> +	return track_incr(seg, num);
>  }
>  
>  /*
> @@ -106,12 +138,10 @@ int child2(void)
>  	if (validate(seg, num))
>  		return -1;
>  
> -	track(seg, num);
> -
> -	return 0;
> +	return track_incr(seg, num);
>  }
>  
> -int child4(void);
> +int child4(int constval);
>  
>  /*
>   * Detach from the parent's IPC namespace and verify that:
> @@ -123,14 +153,20 @@ int child3(void)
>  {
>  	unsigned char *seg;
>  	int num = 3;
> +	int cpid;
> +	int ret;
> +	int status;
>  
>  	if (unshare(CLONE_NEWIPC) != 0) {
>  		printf("[CHILD %i] unshare(CLONE_NEWIPC): %m", num);
>  		return -1;
>  	}
>  
> -	if (fork() == 0)
> -		return child4();
> +	cpid = fork();
> +	if (cpid < 0)
> +		return 1;
> +	else if (cpid == 0)
> +		return child4(123);
>  
>  	printf("[CHILD %i] Running (new IPC NS)\n", num);
>  
> @@ -153,16 +189,22 @@ int child3(void)
>  
>  	sleep(DELAY); /* Wait until after checkpoint, then attach */
>  
> -	track(seg, num);
> -  
> -	return 0;
> +	ret = track_const(seg, num, 123);
> +
> +	printf("[CHILD %i] Waiting for child %i\n", num, cpid);
> +	wait(&status);
> +
> +	if (ret == 0)
> +		return WEXITSTATUS(status);
> +	else
> +		return ret;
>  }
>  
>  /*
>   * This child is forked from child3 under the new IPC namespace.
>   * Verify that post-restart, we do not see the changing seg[0]
>   */
> -int child4(void)
> +int child4(int constval)
>  {
>  	unsigned char *seg;
>  	int num = 4;
> @@ -174,7 +216,7 @@ int child4(void)
>  	if (attach(&seg, num))
>  		return -1;
>  
> -	track(seg, num);
> +	return track_const(seg, num, constval);
>  
>  	return 0;
>  }
> @@ -250,11 +292,23 @@ int main(int argc, char *argv[])
>  	sleep(DELAY);
>  	printf("[MSTER] Woke\n");
>  
> -	for (i = 0; i < 15; i++) {
> +	for (i = 0; i <= COUNT_MAX; i++) {
>  		seg[0] = i;
>  		sleep(1);
>  	}
>  
> +	for (i = 0; i < 3; i++) {
> +		int status;
> +
> +		printf("[MSTER] Waiting on child %i\n", i+1);
> +		wait(&status);
> +		if (WEXITSTATUS(status)) {
> +			printf("[MSTER] child exited with %i\n",
> +			       WEXITSTATUS(status));
> +			return WEXITSTATUS(status);
> +		}
> +	}
> +
>  	if (shmdt(seg) < 0)
>  		perror("shmdt");
>  

^ permalink raw reply

* Creating tasks on restart: userspace vs kernel
From: Oren Laadan @ 2009-04-14  3:43 UTC (permalink / raw)
  To: containers, Alexey Dobriyan
  Cc: Dave Hansen, Serge E. Hallyn, Andrew Morton, Linus Torvalds,
	Linux-Kernel, Ingo Molnar


For checkpoint/restart (c/r) we need a method to (re)create the tasks
tree during restart. There are basically two approaches: in userspace
(zap approach) or in the kernel (openvz approach).

Once tasks have been created both approaches are similar in that all
restarting tasks end up calling the equivalent of "do_restart()" in
the kernel to perform the gory details of restoring its state.

In terms of performance, both approaches are similar, and both can
optimize to avoid duplicating resources unnecessarily during the
clone (e.g. mm, etc) knowing that they will be reconstructed soon
after.

So the question is what's better - user-space or kernel ?

Too bad that Alexey chose to ignore what's been discussed in
linux-containers mailing list in his recent post.  Here is my take on
cons/pros.

Task creation in the kernel
---------------------------
* how: the user program calls sys_restart() which, for each task to
  restore, creates a kernel thread which is demoted to a regular
  process manually.

* pro: a single task that calls sys_restart()
* pro: restarting tasks are in full control of kernel at all times

* con: arch-dependent, harder to port across architectures
* con: can only restart a full container

Task creation in user space
---------------------------
* how: the user programs calls fork/clone to recreate a suitable
  task tree in userspace, and each task calls sys_restart() to restore
  its state; some kernel glue is necessary to synchronize restarting
  tasks when in the kernel.

* pro: allows important flexibility during restart (see <1>)
* pro: code leverages existing well-understood syscalls (fork, clone)
* pro: allows restart of a only subtree (see <2>)

* con: requires a way to creates tasks with specific pid (see <3>)

<1> Flexibility:

In the spirit of madvise() that lets tasks advise the kernel because
they know better, there should be cradvise() for checkpoint/restart
purposes. During checkpoint it can tell the kernel "don't save this
piece of memory, it's scratch", or "ignore this file-descriptor" etc.
During restart, it will can tell the kernel "use this file-descriptor"
or "use this network namespace" (instead of trying to restore).

Offering cradvise() capability during restart is especially important
in cases where the kernel (inevitably) won't know how to restore a
resource (e.g. think special devices), when the application wants to
override (e.g. think of a c/r aware server that would like to change
the port on which it is listening), or when it's that much simpler to
do it in userspace (e.g. think setting up network namespaces).

Another important example is distributed checkpoint, where the
restarting tasks could (re)create all their network connections in
user space, before invoking sys_restart() and tell the kernel, via
cradvise(), to use the newly created sockets.

The need for this sort of flexibility has been stressed multiple times
and by multiple stake-holders interested in checkpoint/restart.

<2> Restarting a subtree:

The primary c/r effort is directed towards providing c/r functionality
for containers.

Wouldn't it be nice if, while doing so and at minimal added effort, we
also gain a method to checkpoint and restart an arbitrary subtree of
tasks, which isn't necessarily an entire container ?

Sure, it will be more constrained (e.g. resulting pid in restart won't
match the original pids), and won't work for all applications. But it
will still be a useful tool for many use cases, like batch cpu jobs,
some servers, vnc sessions (if you want graphics) etc. Imagine you run
'octave' for a week and must reboot now - 'octave' wouldn't care if
you checkpointed it and then restart with a different pid !

<3> Clone with pid:

To restart processes from userspace, there needs to be a way to
request a specific pid--in the current pid_ns--for the child process
(clearly, if it isn't in use).

Why is it a disadvantage ?  to Linus, a syscall clone_with_pid()
"sounds like a _wonderful_ attack vector against badly written
user-land software...".  Actually, getting a specific pid is possible
without this syscall.  But the point is that it's undesirable to have
this functionality unrestricted.

So one option is to require root privileges. Another option is to
restrict such action in pid_ns created by the same user. Even more so,
restrict to only containers that are being restarted.

---

Either way we go, it should be fairly easy to switch from one method
to the other, should we need to.

All in all, there isn't a strong reason in favor of kernel method.

In contrast, it's at least as simple in userspace (reusing existing
syscalls). More importantly, the flexibility that we gain with restart
of tasks in userspace, no cost incurred (in terms of implementation or
runtime overhead).

Oren.

^ permalink raw reply

* Network Namespace-1000 networks with Overlap Addresses
From: Krishna Vamsi-B22174 @ 2009-04-14  2:40 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

 
 
Hi,
 
I am a newbie  to this list.  Here is my use case , we have  Loadable
Kernel Module which applies security to
the packets arriving from 1000 networks with overlap addresses. There
are 3 different  user space process which handles 
control traffic  from these 1000 networks .   
 
Please let me know
 
1)How to create a Network Namespace Object ?
2)How to delete a Network Namespace Object ?
3)Can these 3  user space process see all the Network Namespace objects
created in the kernel ?
 If so, how can they access these objects?
4)How to group 2-3 interfaces under a particular Network Namespace ?
 
Is there any patch available to achieve the above use case ?
 
 
Regards
    Vamsi

^ permalink raw reply

* RE: /dev/tun PPP interfaces under network namespace
From: Elwin Stelzer Eliazer @ 2009-04-14  1:20 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Posting this again, since I did not get any response.

 

Any pointers?

 

Thanks,

Elwin.

 

 

From: Elwin Stelzer Eliazer [mailto:stelzere-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org] 
Sent: Thursday, April 09, 2009 5:04 PM
To: 'containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org'
Subject: /dev/tun PPP interfaces under network namespace

 

Hi,

 

Can I have /dev/tun based PPP interfaces under network namespace, associated
with a container?

What config type will this be? Veth? Or vppp?

Is this available in 2.6.29?

 

Regards,

Elwin.

^ permalink raw reply

* RE: netfilter/iptables under namespace
From: Elwin Stelzer Eliazer @ 2009-04-14  1:15 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

I am posting this again, since I did not get response.

 

Some of the older status (Mar 28, 2008) on LXC/namespace indicate
netfilter+lxc is partially completed, based on the link below:

http://lxc.sourceforge.net/network/status.php

 

In 2.6.29 is it fully supported for IPv4?

If not is there any workaround for now?

 

Regards,

Elwin.

 

 

 

From: Elwin Stelzer Eliazer [mailto:stelzere-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org] 
Sent: Thursday, April 09, 2009 4:59 PM
To: 'containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org'
Subject: netfilter/iptables under namespace

 

Hi,

 

Are netfilter/iptables properly isolated under namespace in 2.6.29?

 

Regards,

Elwin.

^ permalink raw reply

* [RFC][PATCH] devcg: cache the last matched whitelist item
From: Li Zefan @ 2009-04-14  1:01 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers

While I was doing testing by open/close files like this:
	for (i = 0; i < LOOP; i++) {
		fd = open("/dev/null");
		close(fd);
	}
It got a bit slower when devcg is used, so I made this patch
to speed it up.

But walking through whitelist in devcgroup_inode_permission()
doesn't seem to be a hot-path, if so this patch probably doesn't
worth it, and it may not work well under real workload.

Just post it to see if someone thinks it's useful.

Signed-off-by: Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---

based on mainline.

---
 device_cgroup.c |   53 ++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 5fda7df..39ddf2b 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -41,6 +41,7 @@ struct dev_whitelist_item {
 struct dev_cgroup {
 	struct cgroup_subsys_state css;
 	struct list_head whitelist;
+	dev_whitelist_item *cache;
 };
 
 static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s)
@@ -153,6 +154,8 @@ static void dev_whitelist_rm(struct dev_cgroup *dev_cgroup,
 remove:
 		walk->access &= ~wh->access;
 		if (!walk->access) {
+			if (walk == dev_cgroup->cache)
+				dev_cgroup->cache = NULL;
 			list_del_rcu(&walk->list);
 			call_rcu(&walk->rcu, whitelist_item_free);
 		}
@@ -473,6 +476,26 @@ struct cgroup_subsys devices_subsys = {
 	.subsys_id = devices_subsys_id,
 };
 
+int __devcgroup_inode_permission(struct inode *inode, int mask,
+				 struct dev_whitelist_item *wh)
+{
+	if (wh->type & DEV_ALL)
+		return 1;
+	if (wh->type & DEV_BLOCK && !S_ISBLK(inode->i_mode))
+		return 0;
+	if (wh->type & DEV_CHAR && !S_ISCHR(inode->i_mode))
+		return 0;
+	if (wh->major != ~0 && wh->major != imajor(inode))
+		return 0;
+	if (wh->minor != ~0 && wh->minor != iminor(inode))
+		return 0;
+	if ((mask & MAY_WRITE) && !(wh->access & ACC_WRITE))
+		return 0;
+	if ((mask & MAY_READ) && !(wh->access & ACC_READ))
+		return 0;
+	return 1;
+}
+
 int devcgroup_inode_permission(struct inode *inode, int mask)
 {
 	struct dev_cgroup *dev_cgroup;
@@ -488,24 +511,20 @@ int devcgroup_inode_permission(struct inode *inode, int mask)
 
 	dev_cgroup = task_devcgroup(current);
 
+	if (dev_cgroup->cache) {
+		wh = dev_cgroup->cache;
+		if (__devcgroup_inode_permission(inode, mask, wh)) {
+			rcu_read_unlock();
+			return 0;
+		}
+	}
+
 	list_for_each_entry_rcu(wh, &dev_cgroup->whitelist, list) {
-		if (wh->type & DEV_ALL)
-			goto acc_check;
-		if ((wh->type & DEV_BLOCK) && !S_ISBLK(inode->i_mode))
-			continue;
-		if ((wh->type & DEV_CHAR) && !S_ISCHR(inode->i_mode))
-			continue;
-		if (wh->major != ~0 && wh->major != imajor(inode))
-			continue;
-		if (wh->minor != ~0 && wh->minor != iminor(inode))
-			continue;
-acc_check:
-		if ((mask & MAY_WRITE) && !(wh->access & ACC_WRITE))
-			continue;
-		if ((mask & MAY_READ) && !(wh->access & ACC_READ))
-			continue;
-		rcu_read_unlock();
-		return 0;
+		if (__devcgroup_inode_permission(inode, mask, wh)) {
+			dev_cgroup->cache = wh;
+			rcu_read_unlock();
+			return 0;
+		}
 	}
 
 	rcu_read_unlock();

^ permalink raw reply related

* Re: [PATCH] Memory usage limit notification addition to memcg
From: Dan Malek @ 2009-04-14  0:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20090413165404.b6660aea.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>


On Apr 13, 2009, at 4:54 PM, Andrew Morton wrote:

> I agree.  But it would be a mighty mess if we were to turn around in
> two years time and add a second centi-percent interface.

OK, I'll do it.  I'm just going to create a couple of simple functions
within this file to serve my requirements for the string conversion.

Thanks.

	-- Dan

^ permalink raw reply

* Re: [PATCH] Memory usage limit notification addition to memcg
From: Andrew Morton @ 2009-04-13 23:54 UTC (permalink / raw)
  To: Dan Malek
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <298A09AD-495E-4671-84A1-B831826424A8-L1vi/lXTdtvkgf6YlCu6wwC/G2K4zDHf@public.gmane.org>

On Mon, 13 Apr 2009 16:45:17 -0700
Dan Malek <dan-L1vi/lXTdtvkgf6YlCu6wwC/G2K4zDHf@public.gmane.org> wrote:

> On Apr 13, 2009, at 4:08 PM, Andrew Morton wrote:
> 
> > We've run into problems in the past where a percentage number is too
> > coarse on large-memory systems.
> >
> > Proabably that won't be an issue here, but I invite you to convince us
> > of this ;)
> 
> The challenge here is that the absolute limit of the memcg can
> be dynamically changed, so I wanted to avoid a couple of problems.
> One is just a system configuration error where someone forgets
> to modify both.  For example, if you start with the memcg limit of 100M,
> and the notification limit to 80M, then come back and change the memcg
> limit to 90M (or worse, < 80M) you now have a clearly incorrect
> configuration.   Another problem is the operation isn't atomic, at some
> point during the changes, even if you remember to do it correctly, you
> will have the two values not representing what you really want.  It
> could trigger an erroneous notification, or simply OOM kill before you
> get the configuration correct.
> 
> If an integer number turns out to not be sufficient, we could change  
> this
> to some fixed point representation and adjust the arithmetic in the  
> tests.
> I believe the integer number will be fine, even in large memory systems.
> This is just a notification model, if we want something more fine  
> grained
> I believe it would need different semantics.

I agree.  But it would be a mighty mess if we were to turn around in
two years time and add a second centi-percent interface.  So we should
give this careful thought now and really convince ourselves that we
will never ever ever want sub-1% resolution.

^ permalink raw reply

* Re: [PATCH] Memory usage limit notification addition to memcg
From: Dan Malek @ 2009-04-13 23:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20090413160814.1c335d1d.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>


OK, I'll rewrite and resubmit the patch with suggested updates.
Comments below...

On Apr 13, 2009, at 4:08 PM, Andrew Morton wrote:

> We've run into problems in the past where a percentage number is too
> coarse on large-memory systems.
>
> Proabably that won't be an issue here, but I invite you to convince us
> of this ;)

The challenge here is that the absolute limit of the memcg can
be dynamically changed, so I wanted to avoid a couple of problems.
One is just a system configuration error where someone forgets
to modify both.  For example, if you start with the memcg limit of 100M,
and the notification limit to 80M, then come back and change the memcg
limit to 90M (or worse, < 80M) you now have a clearly incorrect
configuration.   Another problem is the operation isn't atomic, at some
point during the changes, even if you remember to do it correctly, you
will have the two values not representing what you really want.  It
could trigger an erroneous notification, or simply OOM kill before you
get the configuration correct.

If an integer number turns out to not be sufficient, we could change  
this
to some fixed point representation and adjust the arithmetic in the  
tests.
I believe the integer number will be fine, even in large memory systems.
This is just a notification model, if we want something more fine  
grained
I believe it would need different semantics.

> Does it support select()/poll()/eventfd()/etc?

No.  Unfortunately this is a cgroup implementation limitation.
My TODO list includes updating cgroups to allow this, using
this notification as an example.

> Stylistic trick: here, please add

Will do, including some others to get rid of ifdefs.

> ifdefs-in-c make kernel developers sad.

I know.  I'll make them go away.

I'll fix it up and resubmit shortly.

Thanks.

	-- Dan

^ permalink raw reply

* Re: [PATCH] Make tst_ipcshm_multi automatable
From: Serge E. Hallyn @ 2009-04-13 23:15 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-qjLDD68F18O7TbgM5vRIOg
In-Reply-To: <1239650808-22254-1-git-send-email-danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Quoting Dan Smith (danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> Add a little to tst_ipcshm_multi to make it automatically validate the
> results and return a pass/fail status indication for automated runs.
> 
> Since Oren said he applied my previous patch to his repository, I'm
> sending this as a delta from the last one I sent[1].  Since the public
> user-cr is not updated yet, you'll need to apply that first to try this.
> 
> 1: https://lists.linux-foundation.org/pipermail/containers/2009-April/016827.html
> 
> Cc: orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org
> Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org
> Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Thanks, Dan, added this to the cr_tests dir on sf.net/projects/lxc.

-serge

^ permalink raw reply

* Re: [PATCH] Memory usage limit notification addition to memcg
From: Andrew Morton @ 2009-04-13 23:08 UTC (permalink / raw)
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA,
	dan-L1vi/lXTdtvkgf6YlCu6wwC/G2K4zDHf,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1239660512-25468-1-git-send-email-dan-L1vi/lXTdtvkgf6YlCu6wwC/G2K4zDHf@public.gmane.org>


Please cc containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org on this sort of thing
so that all the right people get to see it.

On Mon, 13 Apr 2009 18:08:32 -0400
Dan Malek <dan-L1vi/lXTdtvkgf6YlCu6wwC/G2K4zDHf@public.gmane.org> wrote:

> This patch updates the Memory Controller cgroup to add
> a configurable memory usage limit notification.  The feature
> was presented at the April 2009 Embedded Linux Conference.
> 
> Signed-off-by: Dan Malek <dan-L1vi/lXTdtvkgf6YlCu6wwC/G2K4zDHf@public.gmane.org>
> ---
>  Documentation/cgroups/mem_notify.txt |  129 +++++++++++++++++++++
>  include/linux/memcontrol.h           |    7 +
>  init/Kconfig                         |    9 ++
>  mm/memcontrol.c                      |  207 ++++++++++++++++++++++++++++++++++
>  4 files changed, 352 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/cgroups/mem_notify.txt
> 
> diff --git a/Documentation/cgroups/mem_notify.txt b/Documentation/cgroups/mem_notify.txt
> new file mode 100644
> index 0000000..72d5c26
> --- /dev/null
> +++ b/Documentation/cgroups/mem_notify.txt
> @@ -0,0 +1,129 @@
> +
> +Memory Limit Notificiation
> +
> +Attempts have been made in the past to provide a mechanism for
> +the notification to processes (task, an address space) when memory
> +usage is approaching a high limit.  The intention is that it gives
> +the application an opportunity to release some memory and continue
> +operation rather than be OOM killed.  The CE Linux Forum requested
> +a more comtemporary implementation, and this is the result.
> +
> +The memory limit notification is a configurable extension to the
> +existing Memory Resource Controller.  Please read memory.txt in this
> +directory to understand its operation before continuing here.
> +
> +1. Operation
> +
> +When a kernel is configured with CGROUP_MEM_NOTIFY, three additional
> +files will appear in the memory resource controller:
> +
> +	memory.notify_limit_percent

We've run into problems in the past where a percentage number is too
coarse on large-memory systems.

Proabably that won't be an issue here, but I invite you to convince us
of this ;)


> +	memory.notify_limit_usage
> +	memory.notify_limit_lowait
> +
> +The notification is based upon reaching a percentage of the memory
> +resource controller limit (memory.limit_in_bytes).  When the controller
> +group is created, the percentage is set to 100.  Any integer percentage
> +may be set by writing to memory.notify_limit_percent, such as:
> +
> +	echo 80 > memory.notify_limit_percent
> +
> +The current integer usage percentage may be read at any time from
> +the memory.notify_limit_usage file.
> +
> +The memory.notify_limit_lowait is a blocking read file.  The read will
> +block until one of four conditions occurs:
> +
> +    - The usage reaches or exceeds the memory.notify_limit_percent
> +    - The memory.notify_limit_lowait file is written with any value (debug)
> +    - A thread is moved to another controller group
> +    - The cgroup is destroyed or forced empty (memory.force_empty)

Does it support select()/poll()/eventfd()/etc?

> +
> +1.1 Example Usage
> +
> +An application must be designed to properly take advantage of this
> +memory limit notification feature.  It is a powerful management component
> +of some operating systems and embedded devices that must provide
> +highly available and reliable computing services.  The application works
> +in conjunction with information provided by the operating system to
> +control limited resource usage.  Since many programmers still think
> +memory is infinite and never check the return value from malloc(), it
> +may come as a surprise that such mechanisms have been utilized long ago.
> +
> +A typical application will be multithreaded, with one thread either
> +polling or waiting for the notification event.  When the event occurs,
> +the thread will take whatever action is appropriate within the application
> +design.  This could be actually running a garbage collection algorithm
> +or to simply signal other processing threads they must do something to
> +reduce their memory usage.  The notification thread will then be required
> +to poll the actual usage until the low limit of its choosing is met,
> +at which time the reclaim of memory can stop and the notification thread
> +will wait for the next event.
> +
> +Internally, the application only needs to fopen("memory.notify_limit_usage" ..)
> +and fopen("memory.notify_limit_lowait" ...), then either poll the former
> +file or block read on the latter file using fread() or fscanf() as desired.
> +
> +2. Configuration
> +
> +Follow the instructions in memory.txt for the configuration and usage of
> +the Memory Resource Controller cgroup.  Once this is created and tasks
> +assigned, use the memory limit notification as described here.
> +
> +The only action that is needed outside of the application waiting or polling
> +is to set the memory.notify_limit_percent.  To set a notification to occur
> +when memory usage of the cgroup reaches or exceeds 80 percent can be
> +simply done:
> +
> +	echo 80 > memory.notify_limit_percent
> +
> +This value may be read or changed at any time.  Writing a lower value once
> +the Memory Resource Controller is in operation may trigger immediate
> +notification if the usage is above the new limit.
> +
> +3. Debug and Testing
> +
> +The design of cgroups makes it easier to perform some debugging or
> +monitoring tasks without modification to the application.  For example,
> +a write of any value to memory.notify_limit_lowait will wake up all
> +threads waiting for notifications regardless of current memory usage.
> +
> +Collecting performance data about the cgroup is also simplified, as
> +no application modifications are necessary.  A separate task can be
> +created that will open and monitor any necessary files of the cgroup
> +(such as current limits, usage and usage percentages and even when
> +notification occurs).  This task can also operate outside of the cgroup,
> +so its memory usage is not charged to the cgroup.
> +
> +4. Design
> +
> +The memory limit notification is a configurable extension to the
> +existing Memory Resource Controller, which operates as described to
> +track and manage the memory of the Control Group.  The Memory Resource
> +Controller will still continue to reclaim memory under pressure
> +of the limits, and may OOM kill tasks within the cgroup according to
> +the OOM Killer configuration.
> +
> +The memory notification limit was chosen as a percentage of the
> +memory in use so the cgroup paramaters may continue to be dynamically
> +modified without the need to modify the notificaton parameters.
> +Otherwise, the notification limit would have to also be computed
> +and modified on any Memory Resource Controller operating parameter change.
> +
> +The cgroup file semantics are not well suited for this type of notificaton
> +mechanism.  While applications may choose to simply poll the current
> +usage at their convenience, it was also desired to have a notification
> +event that would trigger when the usage attained the limit.  The
> +blocking read() was chosen, as it is the only current useful method.
> +This presented the problems of "out of band" notification, when you want
> +to return some exceptional status other than reaching the notification
> +limit.  In the cases listed above, the read() on the memory.notify_limit_lowait
> +file will not block and return "0" for the percentage.  When this occurs,
> +the thread must determine if the task has moved to a new cgroup or if
> +the cgroup has been destroyed.  Due to the usage model of this cgroup,
> +neither is likely to happen during normal operation of a product.
> +
> +Dan Malek <dan-L1vi/lXTdtvkgf6YlCu6wwC/G2K4zDHf@public.gmane.org>
> +Embedded Alley Solutions, Inc.
> +10 March 2009
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 18146c9..031e5d1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -117,6 +117,13 @@ static inline bool mem_cgroup_disabled(void)
>  
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
>  
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +extern void test_and_wakeup_notify(struct mem_cgroup *mcg,
> +				unsigned long long newlimit);
> +extern unsigned long compute_usage_percent(unsigned long long usage,
> +				unsigned long long limit);
> +#endif

Stylistic trick: here, please add

#else
static inline void test_and_wakeup_notify(struct mem_cgroup *mcg,
				unsigned long long newlimit)
{
}

static inline unsigned long compute_usage_percent(unsigned long long usage,
				unsigned long long limit)
{
	return 0;		/* ? */
}
#endif

and then remove the ifdefs from the .c files.

>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct mem_cgroup;
>  
> diff --git a/init/Kconfig b/init/Kconfig
> index f2f9b53..97138da 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -588,6 +588,15 @@ config CGROUP_MEM_RES_CTLR
>  	  This config option also selects MM_OWNER config option, which
>  	  could in turn add some fork/exit overhead.
>  
> +config CGROUP_MEM_NOTIFY
> +	bool "Memory Usage Limit Notification"
> +	depends on CGROUP_MEM_RES_CTLR
> +	help
> +	  Provides a memory notification when usage reaches a preset limit.
> +	  It is an extenstion to the memory resource controller, since it
> +	  uses the memory usage accounting of the cgroup to test against
> +	  the notification limit.  (See Documentation/cgroups/mem_notify.txt)
> +
>  config CGROUP_MEM_RES_CTLR_SWAP
>  	bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
>  	depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2fc6d6c..d6367ed 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6,6 +6,10 @@
>   * Copyright 2007 OpenVZ SWsoft Inc
>   * Author: Pavel Emelianov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
>   *
> + * Memory Limit Notification update
> + * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
> + * Author: Dan Malek <dan-L1vi/lXTdtvkgf6YlCu6wwC/G2K4zDHf@public.gmane.org>
> + *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License as published by
>   * the Free Software Foundation; either version 2 of the License, or
> @@ -180,6 +184,11 @@ struct mem_cgroup {
>  	 * statistics. This must be placed at the end of memcg.
>  	 */
>  	struct mem_cgroup_stat stat;
> +
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +	unsigned long notify_limit_percent;
> +	wait_queue_head_t notify_limit_wait;
> +#endif
>  };
>  
>  enum charge_type {
> @@ -934,6 +943,21 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  
>  	VM_BUG_ON(mem_cgroup_is_obsolete(mem));
>  
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +	/* We check on the way in so we don't have to duplicate code
> +	 * in both the normal and error exit path.
> +	 */
> +	if (likely(mem->res.limit != (unsigned long long)LLONG_MAX)) {
> +		unsigned long usage_pct;
> +
> +		usage_pct = compute_usage_percent(mem->res.usage + PAGE_SIZE,
> +								mem->res.limit);
> +		if ((usage_pct >= mem->notify_limit_percent) &&
> +		    waitqueue_active(&mem->notify_limit_wait))
> +			wake_up(&mem->notify_limit_wait);
> +	}
> +#endif

It would be nicer to pull this out into a separate function, I expect.

>  	while (1) {
>  		int ret;
>  		bool noswap = false;
> @@ -1663,6 +1687,13 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  	int children = mem_cgroup_count_children(memcg);
>  	u64 curusage, oldusage;
>  
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +	/* Test and notify ahead of the necessity to free pages, as
> +	 * applications giving up pages may help this reclaim procedure.
> +	 */
> +	test_and_wakeup_notify(memcg, val);
> +#endif

ifdefs-in-c make kernel developers sad.

>  	/*
>  	 * For keeping hierarchical_reclaim simple, how long we should retry
>  	 * is depends on callers. We set our retry-count to be function
> @@ -2215,6 +2246,147 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
>  	return 0;
>  }
>  
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +#define CGROUP_LOCAL_BUFFER_SIZE 64 /* Would be nice if this was in cgroup.h */
> +
> +/* The resource counters are defined as long long, but few processors
> + * handle 64-bit divisor in hardware, and the software to do it isn't
> + * present in the kernel.  It would be nice if the resource counters were
> + * platform specific configurable typedefs, but for now we'll just divide
> + * down the byte counters by the page size to get 32-bit arithmetic.
> + * With a 4K page size, this will work up to about 16384G resource limit.
> + */
> +unsigned long compute_usage_percent(unsigned long long usage,
> +					unsigned long long limit)

This is a poor choice of identifier for a global symbol.  Please prefix
it with some string which identifies its subsystem.


> +{
> +	unsigned long lim;
> +	unsigned long long usage_pct;
> +
> +	usage_pct = (usage / PAGE_SIZE) * 100;
> +	lim = (unsigned long)(limit / PAGE_SIZE);
> +
> +	do_div(usage_pct, lim);
> +
> +	return (unsigned long)usage_pct;
> +}
> +
> +void test_and_wakeup_notify(struct mem_cgroup *mcg, unsigned long long newlimit)

Ditto.

> +{
> +	unsigned long usage_pct;
> +
> +	/* Check to see if the new limit should cause notification.
> +	*/
> +	usage_pct = compute_usage_percent(mcg->res.usage, newlimit);
> +
> +	if ((usage_pct >= mcg->notify_limit_percent) &&
> +	    waitqueue_active(&mcg->notify_limit_wait))
> +		wake_up(&mcg->notify_limit_wait);
> +}

Also, identifiers such as "newlimit" are a bit unclear because they
don't communicate the units, and (less seriously) they don't
communicate what quantity they are measuring.  Something like
memory_newlimit_bytes would be clearer, although a bit silly.

I _assume_ these things are all operating in units of bytes.  Perhaps
it was pages?  That's my point...

> +static ssize_t notify_limit_read(struct cgroup *cgrp, struct cftype *cft,
> +			       struct file *file,
> +			       char __user *buf, size_t nbytes,
> +			       loff_t *ppos)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	char tmp[CGROUP_LOCAL_BUFFER_SIZE];
> +	int len;
> +
> +	len = sprintf(tmp, "%lu\n", memcg->notify_limit_percent);

The reader has to run around the tree to find out if there's a buffer
overflow here.  scnprintf() would set minds at ease.


> +	return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
> +}
> +
> +static int notify_limit_write(struct cgroup *cgrp, struct cftype *cft,
> +			    const char *buffer)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	unsigned long val;
> +	char *endptr;
> +
> +	val = simple_strtoul(buffer, &endptr, 0);

strict_strtoul() please.  It's stricter.

> +	if (val > 100)
> +		return -EINVAL;
> +
> +	memcg->notify_limit_percent = val;
> +
> +	/* Check to see if the new percentage limit should cause notification.
> +	*/
> +	test_and_wakeup_notify(memcg, memcg->res.limit);
> +
> +	return 0;
> +}
> +
> +static ssize_t notify_limit_usage_read(struct cgroup *cgrp, struct cftype *cft,
> +			       struct file *file,
> +			       char __user *buf, size_t nbytes,
> +			       loff_t *ppos)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +	char tmp[CGROUP_LOCAL_BUFFER_SIZE];
> +	unsigned long usage_pct;
> +	int len;
> +
> +	usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit);
> +
> +	len = sprintf(tmp, "%lu\n", usage_pct);

scnprintf()

> +	return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
> +}
> +
> +static ssize_t notify_limit_lowait(struct cgroup *cgrp, struct cftype *cft,

Perhaps this function would benefit from a nice comment explaining its
design.  It's the core thing.

Than again, the .txt file has good coverage.

> +			       struct file *file,
> +			       char __user *buf, size_t nbytes,
> +			       loff_t *ppos)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +	char tmp[CGROUP_LOCAL_BUFFER_SIZE];
> +	unsigned long usage_pct;
> +	int len;
> +	DEFINE_WAIT(notify_lowait);
> +
> +	/* A memory resource usage of zero is a special case that
> +	 * causes us not to sleep.  It normally happens when the
> +	 * cgroup is about to be destroyed, and we don't want someone
> +	 * trying to sleep on a queue that is about to go away.  This
> +	 * condition can also be forced as part of testing.
> +	 */
> +	usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit);
> +	if (likely(mem->res.usage != 0)) {
> +
> +		prepare_to_wait(&mem->notify_limit_wait, &notify_lowait,
> +							TASK_INTERRUPTIBLE);
> +
> +		if (usage_pct < mem->notify_limit_percent) {
> +			schedule();
> +
> +			/* Compute percentage we have now and return it.
> +			*/
> +			usage_pct = compute_usage_percent(mem->res.usage,
> +							mem->res.limit);
> +		}
> +		finish_wait(&mem->notify_limit_wait, &notify_lowait);
> +	}
> +
> +	len = sprintf(tmp, "%lu\n", usage_pct);

scnprintf()

> +	return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
> +}

What is the behavior of this read when someone sends the process a
signal?  Seems that it will return early, giving userspace a number
which it didn't expect to see.  I guess that's OK.

> +/* This is used to wake up all threads that may be hanging
> + * out waiting for a low memory condition prior to that happening.
> + * Useful for triggering the event to assist with debug of applications.
> + */
> +static int notify_limit_wake_em_up(struct cgroup *cgrp, unsigned int event)
> +{
> +	struct mem_cgroup *mem;
> +
> +	mem = mem_cgroup_from_cont(cgrp);
> +	wake_up(&mem->notify_limit_wait);
> +	return 0;
> +}
> +#endif /* CONFIG_CGROUP_MEM_NOTIFY */
> +
>  
>  static struct cftype mem_cgroup_files[] = {
>  	{
> @@ -2258,6 +2430,22 @@ static struct cftype mem_cgroup_files[] = {
>  		.read_u64 = mem_cgroup_swappiness_read,
>  		.write_u64 = mem_cgroup_swappiness_write,
>  	},
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +	{
> +		.name = "notify_limit_percent",
> +		.write_string = notify_limit_write,
> +		.read = notify_limit_read,
> +	},
> +	{
> +		.name = "notify_limit_usage",
> +		.read = notify_limit_usage_read,
> +	},
> +	{
> +		.name = "notify_limit_lowait",
> +		.trigger = notify_limit_wake_em_up,
> +		.read = notify_limit_lowait,
> +	},
> +#endif
>  };
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -2461,6 +2649,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	mem->last_scanned_child = 0;
>  	spin_lock_init(&mem->reclaim_param_lock);
>  
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +	init_waitqueue_head(&mem->notify_limit_wait);
> +	mem->notify_limit_percent = 100;
> +#endif
> +
>  	if (parent)
>  		mem->swappiness = get_swappiness(parent);
>  	atomic_set(&mem->refcnt, 1);
> @@ -2504,6 +2697,20 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
>  				struct cgroup *old_cont,
>  				struct task_struct *p)
>  {
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +	/* We wake up all notification threads any time a migration takes
> +	 * place.  They will have to check to see if a move is needed to
> +	 * a new cgroup file to wait for notification.
> +	 * This isn't so much a task move as it is an attach.  A thread not
> +	 * a child of an existing task won't have a valid parent, which
> +	 * is necessary to test because it won't have a valid mem_cgroup
> +	 * either.  Which further means it won't have a proper wait queue
> +	 * and we can't do a wakeup.
> +	 */
> +	if (old_cont->parent != NULL)
> +		notify_limit_wake_em_up(old_cont, 0);
> +#endif
> +

^ permalink raw reply

* Containers syslog support?
From: Chris R. Jones @ 2009-04-13 21:53 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg

Hello again,

Another question on containers.  This time, for syslog.  Is there any
containers support to isolate syslog entries for different containers?
That is, is there any way I can run two different syslogd processes in
two different containers, in such a way that each syslogd process only
sees and logs events generated by processes in it's own container?

Are syslog messages covered under one of the other namespaces (pids, utsname, sysv ipc, network, users), or is there a seperate namespace for them.

Thanks,
Chris

^ permalink raw reply

* Re: [PATCH 10/30] cr: core stuff
From: Serge E. Hallyn @ 2009-04-13 21:47 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: xemul-bzQdu9zFT3WakBO8gow8eQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-X9Un+BFzKDI
In-Reply-To: <20090410023539.GK27788-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>

Quoting Alexey Dobriyan (adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org):

Hi Alexey,

as far as I can see, the main differences between this patch and the
equivalent in Oren's tree are:

1. kernel auto-selects container init to freeze
2. kernel freezes tasks
3. no objhash taking references
4. no hbuf
5. always require CAP_SYS_ADMIN

Are there other differences which you would consider meaningful?  Which
do you consider the most important?

Also, since Dave introduced the fops->checkpoint(), we (or at least I)
have been struck by the ugly assymetry with checkpoint() being in fops,
and restart() not.  Do you have an idea for fixing that?

thanks,
-serge

^ permalink raw reply

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
From: Ingo Molnar @ 2009-04-13 19:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: xemul-bzQdu9zFT3WakBO8gow8eQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan
In-Reply-To: <alpine.LFD.2.00.0904131137520.26713-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>


* Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:

> On Mon, 13 Apr 2009, Alexey Dobriyan wrote:
> > 
> > Well, in OpenVZ everything is in kernel/cpt/ and prefixed with 
> > "cpt_" and "rst_".
> 
> So?
> 
> We're not merging OpenVZ code _either_.
> 
> > And I think "cr_" is super nice prefix: it's short, it's C-like, 
> > it reminds about restart part
> 
> It does no such thing. THAT'S THE POINT. "cr" means _nothing_ to 
> anybody else than some CR-specific people, and those people don't 
> even need it!
> 
> Look around you. We try to use nicer names. We spell out 
> "cpufreq", we don't call it "cf".

I've done an analysis about what a good kernel-global naming scheme 
for checkpoint/restore might be (see further below).

Here's the current namespace (from a stupid sed+grep script combo 
done on my mbox - so it combines all the submitted patches):

cr_align
cr_arch
cr_arch_check_image_header
cr_arch_check_image_task_struct
cr_arch_check_mm_struct
cr_arch_check_task_struct
cr_arch_dump_mm_struct
cr_arch_dump_task_struct
cr_arch_len_mm_struct
cr_arch_len_task_struct
cr_arch_restore_mm_struct
cr_arch_restore_task_struct
cr_arg_end
cr_arg_start
cr_attach_get_file
cr_bits
cr_blocked
cr_brk
cr_can_checkpoint_file
cr_cap_bset
cr_cap_effective
cr_cap_inheritable
cr_cap_permitted
cr_c_cc
cr_c_cflag
cr_check_*
cr_check_cred
cr_check_file
cr_check_file:
cr_check_image_header
cr_check_mm
cr_check_mm_struct
cr_check_pid_ns
cr_check_sighand
cr_check_sighand_struct
cr_check_signal
cr_check_signal_struct
cr_check_sock
cr_check_task_struct
cr_check_tty
cr_check_user_struct
cr_check_vma
cr_c_iflag
cr_c_ispeed
cr_c_lflag
cr_c_line
cr_c_oflag
cr_collect
cr_collect_all_cred
cr_collect_all_file
cr_collect_all_files_struct
cr_collect_all_fs_struct
cr_collect_all_group_info
cr_collect_all_ipc_ns
cr_collect_all_mm_struct
cr_collect_all_mnt_ns
cr_collect_all_net_ns
cr_collect_all_nsproxy
cr_collect_all_pid
cr_collect_all_pid_ns
cr_collect_all_sighand_struct
cr_collect_all_signal_struct
cr_collect_all_sock
cr_collect_all_task_struct
cr_collect_all_tty_struct
cr_collect_all_user_ns
cr_collect_all_user_struct
cr_collect_all_uts_ns
cr_collect_cred
cr_collect_file
cr_collect_files_struct
cr_collect_fs_struct
cr_collect_group_info
cr_collect_ipc_ns
cr_collect_mm
cr_collect_mm_struct
cr_collect_mnt_ns
cr_collect_net_ns
cr_collect_nsproxy
cr_collect_object
cr_collect_pid
cr_collect_pid_ns
cr_collect_sighand
cr_collect_sighand_struct
cr_collect_signal
cr_collect_signal_struct
cr_collect_sock
cr_collect_tasks
cr_collect_task_struct
cr_collect_tty
cr_collect_user_ns
cr_collect_user_struct
cr_collect_uts_ns
cr_comm
cr_context
cr_context_create
cr_context_destroy
cr_context_obj_type
cr_context_task_struct
cr_core
cr_c_ospeed
cr_cred
cr_cs
cr_ct
cr_ctrl_status
cr_ctx
cr_ctx_count
cr_ctx_put
cr_data
cr_debug
cr_def_flags
cr_domainname
cr_dr0
cr_dr1
cr_dr2
cr_dr3
cr_dr6
cr_dr7
cr_driver_flags
cr_driver_subtype
cr_driver_type
cr_ds
cr_dump
cr_dump_all_cred
cr_dump_all_file
cr_dump_all_files_struct
cr_dump_all_fs_struct
cr_dump_all_group_info
cr_dump_all_ipc_ns
cr_dump_all_mm_struct
cr_dump_all_mnt_ns
cr_dump_all_net_ns
cr_dump_all_nsproxy
cr_dump_all_pid
cr_dump_all_pid_ns
cr_dump_all_sighand_struct
cr_dump_all_signal_struct
cr_dump_all_sock
cr_dump_all_task_struct
cr_dump_all_tty
cr_dump_all_user_ns
cr_dump_all_user_struct
cr_dump_all_uts_ns
cr_dump_anonvma
cr_dump_cred
cr_dump_fd
cr_dump_file
cr_dump_files_struct
cr_dump_fs_struct
cr_dump_group_info
cr_dump_header
cr_dump_image_header
cr_dump_ipc_ns
cr_dump_mm_struct
cr_dump_mnt_ns
cr_dump_net_ns
cr_dump_nsproxy
cr_dump_pid
cr_dump_pid_ns
cr_dump_ptr
cr_dump_sighand_struct
cr_dump_signal_struct
cr_dump_sigset
cr_dump_sock
cr_dump_task_struct
cr_dump_task_struct_x86_32
cr_dump_task_struct_x86_64
cr_dump_terminator
cr_dump_tty
cr_dump_user_ns
cr_dump_user_struct
cr_dump_uts_ns
cr_dump_vma
cr_dump_vma_pages
cr_dump_xstate
cr_eax
cr_ebp
cr_ebx
cr_ecx
cr_edi
cr_edx
cr_eflags
cr_egid
cr_eip
cr_enabled
cr_end_code
cr_end_data
cr_env_end
cr_env_start
cr_es
cr_esi
cr_esp
cr_euid
cr_explain_file
cr_fd
cr_fd_flags
cr_f_flags
cr_file
cr_file_checkpoint
cr_file_checkpointable
cr_file_get_func
cr_files_struct
cr_file_supported
cr_fill_name
cr_find_obj_by_pos
cr_find_obj_by_ptr
cr_flags
cr_flow_stopped
cr_f_owner
cr_f_owner_euid
cr_f_owner_pid_type
cr_f_owner_signum
cr_f_owner_uid
cr_f_pos
cr_freeze_tasks
cr_fs
cr_fs_checkpointable
cr_fsgid
cr_fsindex
cr_fsuid
cr_gid
cr_gs
cr_gsindex
cr_hbuf_get
cr_hbuf_put
cr_hdr
cr_hdr_cpu
cr_hdr_fd
cr_hdr_fd_data
cr_hdr_fd_ent
cr_hdr_files
cr_hdr_head
cr_hdr_head_arch
cr_hdr_mm
cr_hdr_mm_context
cr_hdr_pgarr
cr_hdr_pids
cr_hdr_tail
cr_hdr_tree
cr_hdr_vma
cr_header
cr_hw_stopped
cr_imag
cr_image_arch_x86_32
cr_image_arch_x86_64
cr_image_cred
cr_image_fd
cr_image_file
cr_image_files_struct
cr_image_fs_struct
cr_image_group_info
cr_image_header
cr_image_header_arch
cr_image_ipc_ns
cr_image_magic
cr_image_mm_struct
cr_image_mnt_ns
cr_image_net_ns
cr_image_nsproxy
cr_image_pid
cr_image_pid_ns
cr_image_sighand_struct
cr_image_signal_struct
cr_image_sigset
cr_image_sock
cr_image_task_st
cr_image_task_struct
cr_image_tty
cr_image_user_ns
cr_image_user_struct
cr_image_uts_ns
cr_image_version
cr_image_vma
cr_image_vma_content
cr_image_vma_vdso
cr_i_mode
cr_index
cr_ipc_ns
cr_kill_tasks
cr_kread
cr_last_pid
cr_len
cr_len_arch
cr_len_xstate
cr_level
cr_link_index
cr_low_latency
cr_machine
cr_mm_struct
cr_mnt_ns
cr_msg_ctlmax
cr_msg_ctlmnb
cr_msg_ctlmni
cr_name
cr_name_len
cr_net_ns
cr_ngroups
cr_nodename
cr_nr
cr_nr_pages
cr_nsproxy
cr_obj
cr_obj_add_ptr
cr_obj_add_ref
cr_object
cr_object_create
cr_object_destroy
cr_object_header
cr_orig_eax
cr_orig_rax
cr_packet
cr_page_size
cr_pgarr
cr_pid_ns
cr_pid_type
cr_pos_cred
cr_pos_f_cred
cr_pos_file
cr_pos_files
cr_pos_f_owner_pid
cr_pos_fs
cr_pos_group_info
cr_pos_ipc_ns
cr_pos_mm
cr_pos_mm_struct
cr_pos_mnt_ns
cr_pos_net_ns
cr_pos_nsproxy
cr_pos_parent
cr_pos_pgrp
cr_pos_pid
cr_pos_pid_ns
cr_pos_pids
cr_pos_real_cred
cr_pos_real_parent
cr_pos_session
cr_pos_sighand
cr_pos_signal
cr_pos_sk_net
cr_pos_t
cr_pos_user
cr_pos_user_ns
cr_pos_uts_ns
cr_pos_vm_file
cr_pread
cr_prepare_image
cr_pwd
cr_pwd_len
cr_quan@163
cr_r10
cr_r11
cr_r12
cr_r13
cr_r14
cr_r15
cr_r8
cr_r9
cr_rax
cr_rbp
cr_rbx
cr_rcx
cr_rdi
cr_rdx
cr_read_buffer
cr_read_buf_type
cr_read_cpu
cr_read_cpu_fpu
cr_read_fd_data
cr_read_files
cr_read_head_arch
cr_read_mm
cr_read_mm_context
cr_read_obj_type
cr_read_open_fname
cr_read_string
cr_read_task_struct
cr_read_thread
cr_real_blocked
cr_release
cr_restart
cr_restore_all_fd
cr_restore_all_vma
cr_restore_cred
cr_restore_fd
cr_restore_file
cr_restore_files_struct
cr_restore_fs_struct
cr_restore_group_info
cr_restore_ipc_ns
cr_restore_mm_struct
cr_restore_mnt_ns
cr_restore_net_ns
cr_restore_nsproxy
cr_restore_pid
cr_restore_pid_ns
cr_restore_ptr
cr_restore_sighand_struct
cr_restore_signal_struct
cr_restore_sigset
cr_restore_task_struct
cr_restore_task_struct_x86_32
cr_restore_task_struct_x86_64
cr_restore_uts_ns
cr_restore_vma
cr_restore_vma_content
cr_restore_vma_vdso
cr_restore_xstate
cr_rflags
cr_rip
cr_rlim
cr_rlim_cur
cr_rlim_max
cr_root
cr_root_len
cr_rsi
cr_rsp
cr_s390_mm
cr_s390_regs
cr_sa
cr_sa_flags
cr_sa_handler
cr_sa_mask
cr_sa_restorer
cr_save_cpu_regs
cr_saved_auxv
cr_saved_sigmask
cr_scan_fds
cr_securebits
cr_sem_ctls
cr_sgid
cr_shm_ctlall
cr_shm_ctlmax
cr_shm_ctlmni
cr_sighand
cr_signal
cr_signature
cr_signum
cr_sk_family
cr_ss
cr_start_addr
cr_start_brk
cr_start_code
cr_start_data
cr_start_stack
cr_stopped
cr_suid
cr_sysctl_max_dgram_qlen
cr_sysctl_somaxconn
cr_sysname
cr_task_struct
cr_task_struct_arch
cr_termios
cr_thaw_tasks
cr_tls_array
cr_tsk_arch
cr_type
cr_uid
cr_umask
cr_unx
cr_uts_ns
cr_uts_release
cr_version
cr_vm_end
cr_vm_flags
cr_vm_page_prot
cr_vm_pgoff
cr_vm_start
cr_winsize
cr_write
cr_write_buffer
cr_write_cpu
cr_write_error
cr_write_fd_data
cr_write_files
cr_write_head
cr_write_head_arch
cr_write_mm_context
cr_write_obj
cr_write_thread
cr_write_vma
cr_ws_col
cr_ws_row
cr_ws_xpixel
cr_ws_ypixel
cr_xstate

The cr_ prefix for internal symbols and fields should be dropped. 

For global symbols, the following naming scheme looks pretty 
intuitive, unique and self-descriptive to me:

  state_checkpoint_XYZ()
  state_restore_XYZ()
  state_collect_XYZ()
  state_dump_XYZ()
  state_image_XYZ()
  ...

Seeing a state_*() symbol anywhere in the kernel will tell me 'ah, 
this is kernel object state checkpoint/restore related 
functionality', at a glance.

[ And equally importantly, when seeing a state_*() symbol i can
  skip it easily, as an uninteresting-to-my-current-activity item. ]

I had a quick look at various existing uses of "\<state_.*" symbols 
in the kernel, and no widely used variant looked particularly 
confusing to me, so i think this would be a good choice IMHO.

	Ingo

^ permalink raw reply

* [PATCH] Make tst_ipcshm_multi automatable
From: Dan Smith @ 2009-04-13 19:26 UTC (permalink / raw)
  To: containers-qjLDD68F18O7TbgM5vRIOg

Add a little to tst_ipcshm_multi to make it automatically validate the
results and return a pass/fail status indication for automated runs.

Since Oren said he applied my previous patch to his repository, I'm
sending this as a delta from the last one I sent[1].  Since the public
user-cr is not updated yet, you'll need to apply that first to try this.

1: https://lists.linux-foundation.org/pipermail/containers/2009-April/016827.html

Cc: orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org
Cc: serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org
Signed-off-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 tst_ipcshm_multi.c |   86 ++++++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 70 insertions(+), 16 deletions(-)

diff --git a/tst_ipcshm_multi.c b/tst_ipcshm_multi.c
index 1ef31f7..1bc0cd2 100644
--- a/tst_ipcshm_multi.c
+++ b/tst_ipcshm_multi.c
@@ -4,6 +4,8 @@
 #include <errno.h>
 #include <sys/ipc.h>
 #include <sys/shm.h>
+#include <sys/types.h>
+#include <sys/wait.h>
 
 #include <linux/sched.h>
 #include <sched.h>
@@ -11,6 +13,7 @@
 #define OUTFILE  "/tmp/cr-test.out"
 #define SEG_SIZE (20 * 4096)
 #define DELAY 20
+#define COUNT_MAX 15
 
 int attach(unsigned char **seg, int num)
 {
@@ -50,14 +53,45 @@ int validate(unsigned char *seg, int num)
 	return 0;
 }
 
-void track(unsigned char *seg, int num)
+int track_incr(unsigned char *seg, int num)
 {
 	int i;
+	int last = seg[0];
 
 	for (i = 0; i < 20; i++) {
+		if (seg[0] == COUNT_MAX)
+			break;
+
+		if (abs(last - (int)seg[0]) > 1) {
+			printf("[CHILD %i] Expected +/-%i (got %i) %i\n",
+			       num, last, seg[0], abs(last - seg[0]));
+			return 1;
+		}
+
+		last = seg[0] + 1;
+
 		printf("[CHILD %i] Seg[0]: %i\n", num, seg[0]);
 		sleep(1);
 	}
+
+	return !(seg[0] == COUNT_MAX);
+}
+
+int track_const(unsigned char *seg, int num, int val)
+{
+	int i;
+
+	for (i = 0; i < 20; i++) {
+		if (seg[0] != val) {
+			printf("[CHILD %i] Expected %i not %i\n",
+			       num, val, seg[0]);
+			return 1;
+		}
+		printf("[CHILD %i] Seg[0]: %i\n", num, seg[0]);
+		sleep(1);
+	}
+
+	return 0;
 }
 
 /*
@@ -81,9 +115,7 @@ int child1(void)
 
 	sleep(DELAY - 1); /* Wait until after the checkpoint */
 
-	track(seg, num);
-
-	return 0;
+	return track_incr(seg, num);
 }
 
 /*
@@ -106,12 +138,10 @@ int child2(void)
 	if (validate(seg, num))
 		return -1;
 
-	track(seg, num);
-
-	return 0;
+	return track_incr(seg, num);
 }
 
-int child4(void);
+int child4(int constval);
 
 /*
  * Detach from the parent's IPC namespace and verify that:
@@ -123,14 +153,20 @@ int child3(void)
 {
 	unsigned char *seg;
 	int num = 3;
+	int cpid;
+	int ret;
+	int status;
 
 	if (unshare(CLONE_NEWIPC) != 0) {
 		printf("[CHILD %i] unshare(CLONE_NEWIPC): %m", num);
 		return -1;
 	}
 
-	if (fork() == 0)
-		return child4();
+	cpid = fork();
+	if (cpid < 0)
+		return 1;
+	else if (cpid == 0)
+		return child4(123);
 
 	printf("[CHILD %i] Running (new IPC NS)\n", num);
 
@@ -153,16 +189,22 @@ int child3(void)
 
 	sleep(DELAY); /* Wait until after checkpoint, then attach */
 
-	track(seg, num);
-  
-	return 0;
+	ret = track_const(seg, num, 123);
+
+	printf("[CHILD %i] Waiting for child %i\n", num, cpid);
+	wait(&status);
+
+	if (ret == 0)
+		return WEXITSTATUS(status);
+	else
+		return ret;
 }
 
 /*
  * This child is forked from child3 under the new IPC namespace.
  * Verify that post-restart, we do not see the changing seg[0]
  */
-int child4(void)
+int child4(int constval)
 {
 	unsigned char *seg;
 	int num = 4;
@@ -174,7 +216,7 @@ int child4(void)
 	if (attach(&seg, num))
 		return -1;
 
-	track(seg, num);
+	return track_const(seg, num, constval);
 
 	return 0;
 }
@@ -250,11 +292,23 @@ int main(int argc, char *argv[])
 	sleep(DELAY);
 	printf("[MSTER] Woke\n");
 
-	for (i = 0; i < 15; i++) {
+	for (i = 0; i <= COUNT_MAX; i++) {
 		seg[0] = i;
 		sleep(1);
 	}
 
+	for (i = 0; i < 3; i++) {
+		int status;
+
+		printf("[MSTER] Waiting on child %i\n", i+1);
+		wait(&status);
+		if (WEXITSTATUS(status)) {
+			printf("[MSTER] child exited with %i\n",
+			       WEXITSTATUS(status));
+			return WEXITSTATUS(status);
+		}
+	}
+
 	if (shmdt(seg) < 0)
 		perror("shmdt");
 
-- 
1.6.0.3

^ permalink raw reply related

* Re: How to enable SYSENTER/SYSEXIT in glibc-2.7?
From: Linus Torvalds @ 2009-04-13 18:45 UTC (permalink / raw)
  To: Guofu Xiang
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	xemul-bzQdu9zFT3WakBO8gow8eQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	mingo-X9Un+BFzKDI
In-Reply-To: <d15b2c7b0904130210j7ff25665h539546d239db5697-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>



On Mon, 13 Apr 2009, Guofu Xiang wrote:
> 
>      (gdb) disas __kernel_vsyscall
>      Dump of assembler code for function __kernel_vsyscall:
>      0xb7f7a400 <__kernel_vsyscall+0>:       int    $0x80
>      0xb7f7a402 <__kernel_vsyscall+2>:       ret
>      End of assembler dump.

(gdb) disassemble __kernel_vsyscall
Dump of assembler code for function __kernel_vsyscall:
0xb8046414 <__kernel_vsyscall+0>:	push   %ecx
0xb8046415 <__kernel_vsyscall+1>:	push   %edx
0xb8046416 <__kernel_vsyscall+2>:	push   %ebp
0xb8046417 <__kernel_vsyscall+3>:	mov    %esp,%ebp
0xb8046419 <__kernel_vsyscall+5>:	sysenter 
0xb804641b <__kernel_vsyscall+7>:	nop    
0xb804641c <__kernel_vsyscall+8>:	nop    
0xb804641d <__kernel_vsyscall+9>:	nop    
0xb804641e <__kernel_vsyscall+10>:	nop    
0xb804641f <__kernel_vsyscall+11>:	nop    
0xb8046420 <__kernel_vsyscall+12>:	nop    
0xb8046421 <__kernel_vsyscall+13>:	nop    
0xb8046422 <__kernel_vsyscall+14>:	jmp    0xb8046417 <__kernel_vsyscall+3>
0xb8046424 <__kernel_vsyscall+16>:	pop    %ebp
0xb8046425 <__kernel_vsyscall+17>:	pop    %edx
0xb8046426 <__kernel_vsyscall+18>:	pop    %ecx
0xb8046427 <__kernel_vsyscall+19>:	ret    

>      The system call is implemented by int $80h, rather than SYSENTER. What
> is the reason? How to set SYSENTER as the default implementation of system
> call? Thank you for your response!

What CPU do you have?  We only enable sysenter on CPU's that support them 
well.

		Linus

^ permalink raw reply

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
From: Linus Torvalds @ 2009-04-13 18:39 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: xemul-bzQdu9zFT3WakBO8gow8eQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-X9Un+BFzKDI
In-Reply-To: <20090413073925.GB7085-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>



On Mon, 13 Apr 2009, Alexey Dobriyan wrote:
> 
> Well, in OpenVZ everything is in kernel/cpt/ and prefixed with "cpt_"
> and "rst_".

So?

We're not merging OpenVZ code _either_.

> And I think "cr_" is super nice prefix: it's short, it's C-like,
> it reminds about restart part

It does no such thing. THAT'S THE POINT. "cr" means _nothing_ to anybody 
else than some CR-specific people, and those people don't even need it!

Look around you. We try to use nicer names. We spell out "cpufreq", we 
don't call it "cf".

		Linus

^ permalink raw reply

* Re: [PATCH 14/30] cr: x86_64 support
From: Alexey Dobriyan @ 2009-04-13 18:26 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: xemul-bzQdu9zFT3WakBO8gow8eQ, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	mingo-X9Un+BFzKDI, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
In-Reply-To: <20090410023649.GO27788-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>

On Fri, Apr 10, 2009 at 06:36:49AM +0400, Alexey Dobriyan wrote:
> Now x86 matrix of migration is:
> 
> 	task/kernel		kernel
> 	------------------------------
> 	i386/i386	=>	i386
> 	i386/i386	=>	x86_64
> 	i386/x86_64	=>	i386
> 	i386/x86_64	=>	x86_64
> 	x86_64/x86_64	=>	x86_64

Hmm, x86_64 kernel i386 program => i386 kernel case doesn't work, should
be better in next iteration.

^ permalink raw reply

* Re: [PATCH 00/30] C/R OpenVZ/Virtuozzo style
From: Dave Hansen @ 2009-04-13 18:07 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: xemul-bzQdu9zFT3WakBO8gow8eQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mingo-X9Un+BFzKDI
In-Reply-To: <20090413091423.GA19236-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>

On Mon, 2009-04-13 at 13:14 +0400, Alexey Dobriyan wrote:
> On Thu, Apr 09, 2009 at 10:07:11PM -0700, Dave Hansen wrote:
> > I'm curious how you see these fitting in with the work that we've been
> > doing with Oren.  Do you mean to just start a discussion or are you
> > really proposing these as an alternative to what Oren has been posting?
> 
> Yes, this is posted as alternative.

I'm sure that you can understand that we've been working on Oren's
patches for a bit and don't want to jump ship on a whim. :)

Do you think we can change Oren's patch sufficiently to meet your needs
here?  Do you think we can change your patch sufficiently to meet Oren's
needs?

> Some design decisions are seen as incorrect from here like:
> * not rejecting checkpoint with possible "leaks" from container

Could you elaborate on this a bit?  This sounds really important, but
I'm having difficulty seeing how your patch addresses this or how Oren's
doesn't.

> * not having CAP_SYS_ADMIN on restart(2)
> * having small (TASK_COMM_LEN) and bigger (objref[1]) image format
>   misdesigns.

The format is certainly still in a huge amount of flux.  I'd be really
happy to see patches fixing these or clarifying them.  I'm planning on
going and looking at your patches in detail right now.  Would you mind
doing a more detailed review of Oren's so that we could use your
expertise to close some of these gaps in the format?

> * doing fork(2)+restart(2) per restarted task and whole orchestration
>   done from userspace/future init task.

Yeah, we've certainly argued plenty about this one amongst ourselves.
I'm personally open to changing this especially if you feel strongly
about it.

> * not seeing bigger picture (note, this is not equivalent to supporting
>   everything at once, nobody is asking for everything at once) wrt shared
>   objects and format and code changes because of that (note again, image
>   format will change, but it's easy to design high level structure which
>   won't change)
> * checking of unsupported features done at wrong place and wrong time
>   and runtime overhead because of that on CR=y kernels.

Yeah, I'm especially worried with respect to the shared objects right
now.  I'm pestering Oren to replace a big chunk of that stuff with other
garbage that I've dreamed up.  I'll go take a look at how you do this in
your patches...  Perhaps it will work even better.

-- Dave

^ permalink raw reply

* Re: [RFC v2][PATCH 01/10] Infrastructure for work postponed to the end of checkpoint/restart
From: Serge E. Hallyn @ 2009-04-13 18:04 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Dave Hansen
In-Reply-To: <49E35F25.5070501-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> 
> 
> Serge E. Hallyn wrote:
> > Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> >> --- a/checkpoint/Makefile
> >> +++ b/checkpoint/Makefile
> >> @@ -2,8 +2,8 @@
> >>  # Makefile for linux checkpoint/restart.
> >>  #
> >>
> >> -obj-$(CONFIG_CHECKPOINT) += sys.o objhash.o \
> >> +obj-$(CONFIG_CHECKPOINT) += sys.o objhash.o deferqueue.o \
> >>  		checkpoint.o restart.o \
> >>  		ckpt_task.o rstr_task.o \
> >>  		ckpt_mem.o rstr_mem.o \
> >> -		ckpt_file.o rstr_file.o
> >> +		ckpt_file.o rstr_file.o \
> > 
> > ?
> > 
> >> +int cr_deferqueue_add(struct cr_ctx *ctx, cr_deferqueue_func_t function,
> >> +		     unsigned int flags, void *data, int size)
> >> +{
> >> +	struct cr_deferqueue *wq;
> >> +
> >> +	wq = kmalloc(sizeof(wq) + size, GFP_KERNEL);
> >> +	if (!wq)
> >> +		return -ENOMEM;
> >> +
> >> +	wq->function = function;
> >> +	wq->flags = flags;
> >> +	memcpy(wq->data, data, size);
> >> +
> >> +	cr_debug("adding work %p function %p\n", wq, wq->function);
> >> +	list_add_tail(&ctx->deferqueue, &wq->list);
> >> +	return 0;
> >> +}
> > 
> > Shouldn't the deferqueue be protected by a spinlock here?
> 
> Not until we implement concurrent checkpoint/restart. At the moment
> it's one task at a time the can access it.

That's too bad.  I think this woudl be better done as a single
simple patch addin ga new generic deferqueue mechanism for all
to use, with a per-queue spinlock protecting both _add and
_run

-serge

^ permalink raw reply

* Re: [RFC v2][PATCH 01/10] Infrastructure for work postponed to the end of checkpoint/restart
From: Oren Laadan @ 2009-04-13 15:49 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Dave Hansen
In-Reply-To: <20090413153504.GA15846-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>



Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
>> --- a/checkpoint/Makefile
>> +++ b/checkpoint/Makefile
>> @@ -2,8 +2,8 @@
>>  # Makefile for linux checkpoint/restart.
>>  #
>>
>> -obj-$(CONFIG_CHECKPOINT) += sys.o objhash.o \
>> +obj-$(CONFIG_CHECKPOINT) += sys.o objhash.o deferqueue.o \
>>  		checkpoint.o restart.o \
>>  		ckpt_task.o rstr_task.o \
>>  		ckpt_mem.o rstr_mem.o \
>> -		ckpt_file.o rstr_file.o
>> +		ckpt_file.o rstr_file.o \
> 
> ?
> 
>> +int cr_deferqueue_add(struct cr_ctx *ctx, cr_deferqueue_func_t function,
>> +		     unsigned int flags, void *data, int size)
>> +{
>> +	struct cr_deferqueue *wq;
>> +
>> +	wq = kmalloc(sizeof(wq) + size, GFP_KERNEL);
>> +	if (!wq)
>> +		return -ENOMEM;
>> +
>> +	wq->function = function;
>> +	wq->flags = flags;
>> +	memcpy(wq->data, data, size);
>> +
>> +	cr_debug("adding work %p function %p\n", wq, wq->function);
>> +	list_add_tail(&ctx->deferqueue, &wq->list);
>> +	return 0;
>> +}
> 
> Shouldn't the deferqueue be protected by a spinlock here?

Not until we implement concurrent checkpoint/restart. At the moment
it's one task at a time the can access it.

Oren.

^ permalink raw reply

* Re: [RFC v2][PATCH 01/10] Infrastructure for work postponed to the end of checkpoint/restart
From: Serge E. Hallyn @ 2009-04-13 15:35 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Dave Hansen
In-Reply-To: <1239107503-21941-2-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> --- a/checkpoint/Makefile
> +++ b/checkpoint/Makefile
> @@ -2,8 +2,8 @@
>  # Makefile for linux checkpoint/restart.
>  #
> 
> -obj-$(CONFIG_CHECKPOINT) += sys.o objhash.o \
> +obj-$(CONFIG_CHECKPOINT) += sys.o objhash.o deferqueue.o \
>  		checkpoint.o restart.o \
>  		ckpt_task.o rstr_task.o \
>  		ckpt_mem.o rstr_mem.o \
> -		ckpt_file.o rstr_file.o
> +		ckpt_file.o rstr_file.o \

?

> +int cr_deferqueue_add(struct cr_ctx *ctx, cr_deferqueue_func_t function,
> +		     unsigned int flags, void *data, int size)
> +{
> +	struct cr_deferqueue *wq;
> +
> +	wq = kmalloc(sizeof(wq) + size, GFP_KERNEL);
> +	if (!wq)
> +		return -ENOMEM;
> +
> +	wq->function = function;
> +	wq->flags = flags;
> +	memcpy(wq->data, data, size);
> +
> +	cr_debug("adding work %p function %p\n", wq, wq->function);
> +	list_add_tail(&ctx->deferqueue, &wq->list);
> +	return 0;
> +}

Shouldn't the deferqueue be protected by a spinlock here?

-serge

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox