* [RFC][patch 1/2] new system call, unshare
From: Janak Desai @ 2005-05-10 13:08 UTC (permalink / raw)
To: linux-fsdevel
Patch Summary:
This patch implements a new system call, unshare. unshare allows a
process to dissociate parts of process context that were initially
being shared using the clone() system call.
The patch consists of two parts:
[1/2] Implements the system call handler function sys_unshare.
[2/2] Implements system call setup for different architectures.
For now, I am only posting part 1/2 to request comments on the justification,
approach and implementation.
Patch Justification:
The unshare system call is needed to implement, using PAM,
per-security_context and/or per-user namespaces that provide
polyinstantiated directories. Using unshare and bind mounts, a
PAM module can create a private namespace with the appropriate
directories (based on the user's security context) bind mounted on
public directories such as /tmp, thus providing an instance of
/tmp that is based on the user's security context. Without the
unshare system call, namespace separation can only be achieved
by clone, which would require porting and maintaining all commands,
such as login and su, that establish a user session.
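As an illustration of the use case above, here is a minimal user-space sketch of what a session-setup helper might do. It is not part of the patch. It assumes a C library that exposes an unshare() wrapper in <sched.h>, and the names instance_path() and enter_private_dir() as well as the "/tmp-inst" layout are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>      /* unshare(), CLONE_NEWNS */
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>  /* mount(), MS_BIND */

/*
 * Build the per-context instance directory, e.g. "/tmp-inst/user_u".
 * The directory layout is an assumption made for this sketch.
 * Returns 0 on success, -1 if the context is empty or the path
 * does not fit in buf.
 */
int instance_path(char *buf, size_t len, const char *base, const char *context)
{
    int n;

    assert(buf != NULL && base != NULL && context != NULL);
    if (strlen(context) == 0)
        return -1;
    n = snprintf(buf, len, "%s/%s", base, context);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Detach the calling process from the shared mount namespace, then
 * bind mount the instance directory over the public one.
 * Returns 0 on success, -1 with errno set on failure. Both calls
 * need sufficient privilege (CAP_SYS_ADMIN).
 */
int enter_private_dir(const char *instance, const char *public_dir)
{
    if (unshare(CLONE_NEWNS) != 0)
        return -1;
    return mount(instance, public_dir, NULL, MS_BIND, NULL);
}
```

A caller would first compute the path with instance_path(buf, sizeof buf, "/tmp-inst", context) and then call enter_private_dir(buf, "/tmp") before starting the user's session.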
Overall Approach:
The overall approach follows the clone system call and its permission
enforcement. However, instead of clone's "what do we leave shared?"
logic, the logic here is based on "what do we unshare that was
previously being shared?". Unlike clone, which operates on a newly
allocated and not-yet-schedulable task structure, unshare has to work
on the current process, so additional task_lock()s are taken to avoid
race conditions. Before unsharing any part
of the context, a check is made to ensure that this part of the
context is being shared in the first place. If the context is not
being shared to begin with, the system call returns success. If
the context is being shared, the system call makes a private copy
of that context and updates the appropriate pointers of the
current task structure to point to this new private copy. If allocation
and setup of the private copy fails, the system call
restores the current task structure to continue using the shared
context.
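The check, copy and fall-back sequence described above can be modelled in user space. The sketch below is not kernel code: struct ctx, struct task, unshare_ctx() and failing_alloc() are hypothetical stand-ins, and it simplifies the patch by switching the pointer only after the copy has succeeded instead of restoring it afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical reference-counted context shared between tasks. */
struct ctx {
    int users;  /* number of tasks pointing at this context */
    int value;  /* payload standing in for the real state */
};

/* Hypothetical task that points at a context. */
struct task {
    struct ctx *ctx;
};

/*
 * Give t a private copy of its context if the context is shared.
 * Returns 0 on success, including the "not shared" case, and
 * -ENOMEM if the copy cannot be allocated; in that case t->ctx
 * still points at the shared context. The allocator is passed in
 * so that the failure path can be exercised.
 */
int unshare_ctx(struct task *t, void *(*alloc)(size_t))
{
    struct ctx *old = t->ctx;
    struct ctx *copy;

    assert(old != NULL);
    if (old->users <= 1)
        return 0;           /* not shared, nothing to do */

    copy = alloc(sizeof(*copy));
    if (copy == NULL)
        return -ENOMEM;     /* task keeps using the shared context */

    copy->users = 1;
    copy->value = old->value;
    t->ctx = copy;          /* switch to the private copy */
    old->users--;           /* drop this task's reference */
    return 0;
}

/* Allocator that always fails, used to exercise the error path. */
void *failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}
```

Calling unshare_ctx(&t, malloc) on a task whose context has two users leaves the task with its own copy and the original with one user; calling it with failing_alloc leaves both untouched.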
Currently, the system call only allows "unsharing" of namespace,
signal handlers and virtual memory, because those three were deemed
useful on this mailing list in the past.
Testing:
The patch has been unit tested on a uniprocessor i386 Fedora Core 3
system.
Signed-off-by: Janak Desai
-------------------------------------------------------------------
diff -Naurp linux-2.6.11.8/kernel/fork.c linux-2.6.11.8-p1/kernel/fork.c
--- linux-2.6.11.8/kernel/fork.c 2005-04-30 01:23:45.000000000 +0000
+++ linux-2.6.11.8-p1/kernel/fork.c 2005-05-09 19:03:52.000000000 +0000
@@ -56,6 +56,17 @@ int nr_threads; /* The idle threads do
int max_threads; /* tunable limit on nr_threads */
+/*
+ * copy_mm gets called from clone or unshare system calls. When called
+ * from clone, mm_struct may be shared depending on the clone flags
+ * argument, however, when called from the unshare system call, a private
+ * copy of mm_struct is made.
+ */
+enum mm_copy_share {
+ MAY_SHARE,
+ UNSHARE,
+};
+
DEFINE_PER_CPU(unsigned long, process_counts) = 0;
__cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */
@@ -421,16 +432,26 @@ void mm_release(struct task_struct *tsk,
}
}
-static int copy_mm(unsigned long clone_flags, struct task_struct * tsk)
+static int copy_mm(unsigned long clone_flags, struct task_struct * tsk,
+ enum mm_copy_share copy_share_action)
{
struct mm_struct * mm, *oldmm;
int retval;
- tsk->min_flt = tsk->maj_flt = 0;
- tsk->nvcsw = tsk->nivcsw = 0;
+ /*
+ * If the process memory is being duplicated as part of the
+ * unshare system call, we are working with the current process
+ * and not a newly allocated task structure, and should not
+ * zero out fault info, context switch counts, mm and active_mm
+ * fields.
+ */
+ if (copy_share_action == MAY_SHARE) {
+ tsk->min_flt = tsk->maj_flt = 0;
+ tsk->nvcsw = tsk->nivcsw = 0;
- tsk->mm = NULL;
- tsk->active_mm = NULL;
+ tsk->mm = NULL;
+ tsk->active_mm = NULL;
+ }
/*
* Are we cloning a kernel thread?
@@ -917,7 +938,7 @@ static task_t *copy_process(unsigned lon
goto bad_fork_cleanup_fs;
if ((retval = copy_signal(clone_flags, p)))
goto bad_fork_cleanup_sighand;
- if ((retval = copy_mm(clone_flags, p)))
+ if ((retval = copy_mm(clone_flags, p, MAY_SHARE)))
goto bad_fork_cleanup_signal;
if ((retval = copy_keys(clone_flags, p)))
goto bad_fork_cleanup_mm;
@@ -1222,3 +1243,172 @@ void __init proc_caches_init(void)
sizeof(struct mm_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
}
+
+/*
+ * unshare_mm is called from the unshare system call handler function to
+ * make a private copy of the mm_struct structure. It calls copy_mm with
+ * CLONE_VM flag cleared, to ensure that a private copy of mm_struct is made,
+ * and with mm_copy_share enum set to UNSHARE, to ensure that copy_mm
+ * does not clear fault info, context switch counts, mm and active_mm
+ * fields of the task_struct.
+ */
+static int unshare_mm(unsigned long unshare_flags, struct task_struct *tsk)
+{
+ int retval = 0;
+ struct mm_struct *mm = tsk->mm;
+
+ /*
+ * If the virtual memory is being shared, make a private
+ * copy and disassociate the process from the shared virtual
+ * memory.
+ */
+ if (atomic_read(&mm->mm_users) > 1) {
+ retval = copy_mm((unshare_flags & ~CLONE_VM), tsk, UNSHARE);
+
+ /*
+ * If copy_mm was successful, decrement the number of users
+ * on the original, shared, mm_struct.
+ */
+ if (!retval)
+ atomic_dec(&mm->mm_users);
+ }
+ return retval;
+}
+
+/*
+ * unshare_sighand is called from the unshare system call handler function to
+ * make a private copy of the sighand_struct structure. It calls copy_sighand
+ * with CLONE_SIGHAND cleared to ensure that a new signal handler structure
+ * is cloned from the current shared one.
+ */
+static int unshare_sighand(unsigned long unshare_flags, struct task_struct *tsk)
+{
+ int retval = 0;
+ struct sighand_struct *sighand = tsk->sighand;
+
+ /*
+ * If the signal handlers are being shared, make a private
+ * copy and disassociate the process from the shared signal
+ * handlers.
+ */
+ if (atomic_read(&sighand->count) > 1) {
+ retval = copy_sighand((unshare_flags & ~CLONE_SIGHAND), tsk);
+
+ /*
+ * If copy_sighand was successful, decrement the use count
+ * on the original, shared, sighand_struct.
+ */
+ if (!retval)
+ atomic_dec(&sighand->count);
+ }
+ return retval;
+}
+
+/*
+ * unshare_namespace is called from the unshare system call handler
+ * function to make a private copy of the current shared namespace. It
+ * calls copy_namespace with CLONE_NEWNS set to ensure that a new
+ * namespace is cloned from the current namespace.
+ */
+static int unshare_namespace(struct task_struct *tsk)
+{
+ int retval = 0;
+ struct namespace *namespace = tsk->namespace;
+
+ /*
+ * If the namespace is being shared, make a private copy
+ * and disassociate the process from the shared namespace.
+ */
+ if (atomic_read(&namespace->count) > 1) {
+ retval = copy_namespace(CLONE_NEWNS, tsk);
+
+ /*
+ * If copy_namespace was successful, decrement the use count
+ * on the original, shared, namespace struct.
+ */
+ if (!retval)
+ atomic_dec(&namespace->count);
+ }
+ return retval;
+}
+
+/*
+ * unshare allows a process to 'unshare' part of the process
+ * context which was originally shared using clone.
+ */
+asmlinkage long sys_unshare(unsigned long unshare_flags)
+{
+ struct task_struct *tsk = current;
+ int retval = 0;
+ struct namespace *namespace;
+ struct mm_struct *mm;
+ struct sighand_struct *sighand;
+
+ if (!(unshare_flags & (CLONE_NEWNS | CLONE_VM | CLONE_SIGHAND)))
+ goto bad_unshare_invalid_val;
+
+ /*
+ * Shared signal handlers imply shared VM, so if CLONE_SIGHAND is
+ * set, CLONE_VM must also be set in the system call argument.
+ */
+ if ((unshare_flags & CLONE_SIGHAND) && !(unshare_flags & CLONE_VM))
+ goto bad_unshare_invalid_val;
+
+ task_lock(tsk);
+ namespace = tsk->namespace;
+ mm = tsk->mm;
+ sighand = tsk->sighand;
+
+ if (unshare_flags & CLONE_VM) {
+ retval = unshare_mm(unshare_flags, tsk);
+ if (retval)
+ goto unshare_unlock_task;
+ else if (unshare_flags & CLONE_SIGHAND) {
+ retval = unshare_sighand(unshare_flags, tsk);
+ if (retval)
+ goto bad_unshare_cleanup_mm;
+ }
+ }
+
+ if (unshare_flags & CLONE_NEWNS) {
+ retval = unshare_namespace(tsk);
+ if (retval)
+ goto bad_unshare_cleanup_sighand;
+ }
+
+unshare_unlock_task:
+ task_unlock(tsk);
+
+unshare_out:
+ return retval;
+
+bad_unshare_cleanup_sighand:
+ /*
+ * If signal handlers were unshared (private copy was made),
+ * clean them up (delete the private copy) and restore
+ * the task to point to the old, shared, value.
+ */
+ if (unshare_flags & CLONE_SIGHAND) {
+ exit_sighand(tsk);
+ tsk->sighand = sighand;
+ atomic_inc(&sighand->count);
+ }
+
+bad_unshare_cleanup_mm:
+ /*
+ * If mm struct was unshared (private copy was made),
+ * clean it up (delete the private copy) and restore
+ * the task to point to the old, shared, value.
+ */
+ if (unshare_flags & CLONE_VM) {
+ if (tsk->mm)
+ mmput(tsk->mm);
+ tsk->mm = mm;
+ atomic_inc(&mm->mm_users);
+ }
+ goto unshare_unlock_task;
+
+bad_unshare_invalid_val:
+ retval = -EINVAL;
+ goto unshare_out;
+}
* Re: [RFC][patch 1/2] new system call, unshare
From: Jonathan Corbet @ 2005-05-10 19:44 UTC (permalink / raw)
To: Janak Desai; +Cc: linux-fsdevel
Janak Desai <janak@us.ibm.com> wrote:
> +asmlinkage long sys_unshare(unsigned long unshare_flags)
> +{
> [...]
>
> + if (!(unshare_flags & (CLONE_NEWNS | CLONE_VM | CLONE_SIGHAND)))
> + goto bad_unshare_invalid_val;
If I read this function correctly, it will silently ignore any flags
other than the three above, as long as at least one of those three is
present. It might be a good idea to fail the call if any unrecognized
flags are present or you risk changing application behavior in the
future if any new flags are added.
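
A stricter check along these lines could look like the following user-space sketch. It is not code from this thread: check_unshare_flags() and UNSHARE_VALID_FLAGS are hypothetical names, and the CLONE_* constants are taken from <sched.h>. It also keeps the patch's rule that CLONE_SIGHAND requires CLONE_VM.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>  /* CLONE_NEWNS, CLONE_VM, CLONE_SIGHAND */

/* The three flags the posted patch handles. */
#define UNSHARE_VALID_FLAGS (CLONE_NEWNS | CLONE_VM | CLONE_SIGHAND)

/*
 * Validate an unshare flag word.
 * Returns 0 if the flags are acceptable and -EINVAL otherwise.
 */
long check_unshare_flags(unsigned long flags)
{
    /* Reject any bit outside the supported set. */
    if (flags & ~(unsigned long)UNSHARE_VALID_FLAGS)
        return -EINVAL;

    /* At least one supported flag must be present. */
    if (!(flags & UNSHARE_VALID_FLAGS))
        return -EINVAL;

    /* Shared signal handlers imply shared VM (rule from the patch). */
    if ((flags & CLONE_SIGHAND) && !(flags & CLONE_VM))
        return -EINVAL;

    return 0;
}
```

With this check, a call such as check_unshare_flags(CLONE_NEWNS | CLONE_FS) fails instead of silently ignoring CLONE_FS, so giving that flag a meaning later would not change the behavior of existing callers.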
jon
Jonathan Corbet
Executive editor, LWN.net
corbet@lwn.net
* Re: [RFC][patch 1/2] new system call, unshare
From: Janak Desai @ 2005-05-11 12:11 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-fsdevel
Jonathan Corbet wrote:
> Janak Desai <janak@us.ibm.com> wrote:
>
>
>>+asmlinkage long sys_unshare(unsigned long unshare_flags)
>>+{
>>[...]
>>
>>+ if (!(unshare_flags & (CLONE_NEWNS | CLONE_VM | CLONE_SIGHAND)))
>>+ goto bad_unshare_invalid_val;
>
>
> If I read this function correctly, it will silently ignore any flags
> other than the three above, as long as at least one of those three is
> present. It might be a good idea to fail the call if any unrecognized
> flags are present or you risk changing application behavior in the
> future if any new flags are added.
>
Ok, thanks Jonathan. I will update the patch accordingly.
-Janak