[PATCH 0/2] virtiofsd: add net and pid namespace sandboxing

* [PATCH 0/2] virtiofsd: add net and pid namespace sandboxing
@ 2019-10-16 16:01 Stefan Hajnoczi
  2019-10-16 16:01 ` [PATCH 1/2] virtiofsd: move to an empty network namespace Stefan Hajnoczi
  2019-10-16 16:01 ` [PATCH 2/2] virtiofsd: move to a new pid namespace Stefan Hajnoczi
  0 siblings, 2 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2019-10-16 16:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: virtio-fs, Dr. David Alan Gilbert, Stefan Hajnoczi

These patches are based on gitlab.com/virtio-fs/qemu.git virtio-fs-dev.

virtiofsd is sandboxed so that it does not have access to the system in the
event that the process is compromised.  At the moment we use seccomp and mount
namespaces to restrict the list of allowed syscalls and only give access to the
shared directory.

This patch series enhances sandboxing by putting virtiofsd into an empty
network and pid namespace.  If the process is compromised it will be unable to
perform network activity, even to localhost services running on the host.  It
will also be unable to see other processes running on the system since it runs
as pid 1 in a new pid namespace.

These enhancements are inspired by the Crosvm virtio-fs device's jail
configuration.

Stefan Hajnoczi (2):
  virtiofsd: move to an empty network namespace
  virtiofsd: move to a new pid namespace

 contrib/virtiofsd/passthrough_ll.c | 109 +++++++++++++++++++++++------
 1 file changed, 86 insertions(+), 23 deletions(-)

-- 
2.21.0
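
One way to observe the effect of this series on a running daemon is to compare the namespace identifiers the kernel exposes under /proc/<pid>/ns. The sketch below is not part of the patches: read_ns_id() is a hypothetical helper name, and it assumes /proc is mounted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patches): read the identifier of one
 * of the calling process's namespaces, e.g. "net:[4026531992]", from the
 * /proc/self/ns/<name> symlink.  Returns 0 on success, -1 on error.
 */
static int read_ns_id(const char *name, char *buf, size_t len)
{
	char path[64];
	ssize_t n;

	assert(name != NULL && buf != NULL && len > 0);

	if (snprintf(path, sizeof(path), "/proc/self/ns/%s", name) >=
	    (int)sizeof(path)) {
		return -1;
	}

	n = readlink(path, buf, len - 1);
	if (n < 0) {
		return -1;
	}
	buf[n] = '\0';	/* readlink() does not NUL-terminate */
	return 0;
}
```

Two processes share a namespace exactly when these strings match, so with the series applied the "net" and "pid" identifiers of virtiofsd should differ from those of the shell that launched it.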

* [PATCH 1/2] virtiofsd: move to an empty network namespace
  2019-10-16 16:01 [PATCH 0/2] virtiofsd: add net and pid namespace sandboxing Stefan Hajnoczi
@ 2019-10-16 16:01 ` Stefan Hajnoczi
  2019-10-23  9:34   ` Dr. David Alan Gilbert
  2019-10-16 16:01 ` [PATCH 2/2] virtiofsd: move to a new pid namespace Stefan Hajnoczi
  1 sibling, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2019-10-16 16:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: virtio-fs, Dr. David Alan Gilbert, Stefan Hajnoczi

If the process is compromised there should be no network access.  Use an
empty network namespace to sandbox networking.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 contrib/virtiofsd/passthrough_ll.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/contrib/virtiofsd/passthrough_ll.c b/contrib/virtiofsd/passthrough_ll.c
index 84b60d85bd..c27ff7d800 100644
--- a/contrib/virtiofsd/passthrough_ll.c
+++ b/contrib/virtiofsd/passthrough_ll.c
@@ -2736,6 +2736,19 @@ static void setup_shared_versions(struct lo_data *lo)
 	lo->version_table = addr;
 }
 
+/*
+ * Called after our UNIX domain sockets have been created, now we can move to
+ * an empty network namespace to prevent TCP/IP and other network activity in
+ * case this process is compromised.
+ */
+static void setup_net_namespace(void)
+{
+	if (unshare(CLONE_NEWNET) != 0) {
+		fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWNET): %m\n");
+		exit(1);
+	}
+}
+
 /* This magic is based on lxc's lxc_pivot_root() */
 static void setup_pivot_root(const char *source)
 {
@@ -2818,6 +2831,7 @@ static void setup_mount_namespace(const char *source)
  */
 static void setup_sandbox(struct lo_data *lo, bool enable_syslog)
 {
+	setup_net_namespace();
 	setup_mount_namespace(lo->source);
 	setup_seccomp(enable_syslog);
 }
-- 
2.21.0
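
For experimenting with the behaviour outside virtiofsd, here is a minimal sketch under stated assumptions. enter_empty_net_namespace() is a non-fatal variant of setup_net_namespace() above, and probe_localhost() is a hypothetical helper that does not exist in the tree. In a freshly created network namespace the loopback interface is down, so the probe is expected to fail, typically with ENETUNREACH. unshare(CLONE_NEWNET) needs CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Non-fatal variant of setup_net_namespace(): returns 0 on success and
 * -1 with errno set on failure (EPERM without CAP_SYS_ADMIN).
 */
static int enter_empty_net_namespace(void)
{
	return unshare(CLONE_NEWNET);
}

/*
 * Hypothetical probe: attempt a TCP connection to 127.0.0.1:port.
 * Returns 0 if the connection succeeded, otherwise the errno value.
 */
static int probe_localhost(unsigned short port)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(port),
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};
	int fd, err = 0;

	assert(port != 0);

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		return errno;
	}
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		err = errno;
	}
	close(fd);
	return err;
}
```

File descriptors opened before the unshare() call keep working, which is why the patch moves to the new namespace only after the UNIX domain sockets have been created.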



* Re: [PATCH 1/2] virtiofsd: move to an empty network namespace
  2019-10-16 16:01 ` [PATCH 1/2] virtiofsd: move to an empty network namespace Stefan Hajnoczi
@ 2019-10-23  9:34   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 9+ messages in thread
From: Dr. David Alan Gilbert @ 2019-10-23  9:34 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> If the process is compromised there should be no network access.  Use an
> empty network namespace to sandbox networking.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  contrib/virtiofsd/passthrough_ll.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/contrib/virtiofsd/passthrough_ll.c b/contrib/virtiofsd/passthrough_ll.c
> index 84b60d85bd..c27ff7d800 100644
> --- a/contrib/virtiofsd/passthrough_ll.c
> +++ b/contrib/virtiofsd/passthrough_ll.c
> @@ -2736,6 +2736,19 @@ static void setup_shared_versions(struct lo_data *lo)
>  	lo->version_table = addr;
>  }
>  
> +/*
> + * Called after our UNIX domain sockets have been created, now we can move to
> + * an empty network namespace to prevent TCP/IP and other network activity in
> + * case this process is compromised.
> + */
> +static void setup_net_namespace(void)
> +{
> +	if (unshare(CLONE_NEWNET) != 0) {
> +		fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWNET): %m\n");
> +		exit(1);
> +	}
> +}
> +
>  /* This magic is based on lxc's lxc_pivot_root() */
>  static void setup_pivot_root(const char *source)
>  {
> @@ -2818,6 +2831,7 @@ static void setup_mount_namespace(const char *source)
>   */
>  static void setup_sandbox(struct lo_data *lo, bool enable_syslog)
>  {
> +	setup_net_namespace();
>  	setup_mount_namespace(lo->source);
>  	setup_seccomp(enable_syslog);
>  }
> -- 
> 2.21.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* [PATCH 2/2] virtiofsd: move to a new pid namespace
  2019-10-16 16:01 [PATCH 0/2] virtiofsd: add net and pid namespace sandboxing Stefan Hajnoczi
  2019-10-16 16:01 ` [PATCH 1/2] virtiofsd: move to an empty network namespace Stefan Hajnoczi
@ 2019-10-16 16:01 ` Stefan Hajnoczi
  2019-10-17 14:45   ` [Virtio-fs] " Vivek Goyal
                     ` (2 more replies)
  1 sibling, 3 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2019-10-16 16:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: virtio-fs, Dr. David Alan Gilbert, Stefan Hajnoczi

virtiofsd needs access to /proc/self/fd.  Let's move to a new pid
namespace so that a compromised process cannot see any other
processes running on the system.

One wrinkle in this approach: unshare(CLONE_NEWPID) affects *child*
processes and not the current process.  Therefore we need to fork the
pid 1 process that will actually run virtiofsd and leave a parent in
waitpid(2).  This is not the same thing as daemonization and parent
processes should not notice a difference.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 contrib/virtiofsd/passthrough_ll.c | 95 ++++++++++++++++++++++--------
 1 file changed, 72 insertions(+), 23 deletions(-)

diff --git a/contrib/virtiofsd/passthrough_ll.c b/contrib/virtiofsd/passthrough_ll.c
index c27ff7d800..b6ee9b2e90 100644
--- a/contrib/virtiofsd/passthrough_ll.c
+++ b/contrib/virtiofsd/passthrough_ll.c
@@ -56,9 +56,12 @@
 #include <sys/xattr.h>
 #include <sys/mman.h>
 #include <sys/socket.h>
+#include <sys/types.h>
 #include <sys/un.h>
+#include <sys/wait.h>
 #include <sys/capability.h>
 
+
 #include "ireg.h"
 #include <sys/mount.h>
 #include <sys/resource.h>
@@ -2749,6 +2752,72 @@ static void setup_net_namespace(void)
 	}
 }
 
+/*
+ * Move to a new pid namespace to prevent access to other processes if this
+ * process is compromised.
+ */
+static void setup_pid_namespace(void)
+{
+	pid_t child;
+
+	/*
+	 * Create a new pid namespace for *child* processes.  We'll have to
+	 * fork in order to enter the new pid namespace.  A new mount namespace
+	 * is also needed so that we can remount /proc for the new pid
+	 * namespace.
+	 */
+	if (unshare(CLONE_NEWPID | CLONE_NEWNS) != 0) {
+		fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
+		exit(1);
+	}
+
+	child = fork();
+	if (child < 0) {
+		fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
+		exit(1);
+	}
+	if (child > 0) {
+		pid_t waited;
+		int wstatus;
+
+		/* The parent waits for the child */
+		do {
+			waited = waitpid(child, &wstatus, 0);
+		} while (waited < 0 && errno == EINTR);
+
+		if (WIFEXITED(wstatus)) {
+			exit(WEXITSTATUS(wstatus));
+		}
+
+		exit(1);
+	}
+
+	/*
+	 * If the mounts have shared propagation then we want to opt out so our
+	 * mount changes don't affect the parent mount namespace.
+	 */
+	if (mount(NULL, "/", NULL, MS_REC|MS_SLAVE, NULL) < 0) {
+		fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_SLAVE): %m\n");
+		exit(1);
+	}
+
+	/* The child must remount /proc to use the new pid namespace */
+	if (mount("proc", "/proc", "proc",
+		  MS_NODEV | MS_NOEXEC | MS_NOSUID | MS_RELATIME, NULL) < 0) {
+		fuse_log(FUSE_LOG_ERR, "mount(/proc): %m\n");
+		exit(1);
+	}
+}
+
+static void setup_proc_self_fd(struct lo_data *lo)
+{
+	lo->proc_self_fd = open("/proc/self/fd", O_PATH);
+	if (lo->proc_self_fd == -1) {
+		fuse_log(FUSE_LOG_ERR, "open(/proc/self/fd, O_PATH): %m\n");
+		exit(1);
+	}
+}
+
 /* This magic is based on lxc's lxc_pivot_root() */
 static void setup_pivot_root(const char *source)
 {
@@ -2803,20 +2872,10 @@ static void setup_pivot_root(const char *source)
 
 /*
  * Make the source directory our root so symlinks cannot escape and no other
- * files are accessible.
+ * files are accessible.  Assumes unshare(CLONE_NEWNS) was already called.
  */
 static void setup_mount_namespace(const char *source)
 {
-	if (unshare(CLONE_NEWNS) != 0) {
-		fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWNS): %m\n");
-		exit(1);
-	}
-
-	if (mount(NULL, "/", NULL, MS_REC|MS_SLAVE, NULL) < 0) {
-		fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_PRIVATE): %m\n");
-		exit(1);
-	}
-
 	if (mount(source, source, NULL, MS_BIND, NULL) < 0) {
 		fuse_log(FUSE_LOG_ERR, "mount(%s, %s, MS_BIND): %m\n", source, source);
 		exit(1);
@@ -2831,6 +2890,8 @@ static void setup_mount_namespace(const char *source)
  */
 static void setup_sandbox(struct lo_data *lo, bool enable_syslog)
 {
+	setup_pid_namespace();
+	setup_proc_self_fd(lo);
 	setup_net_namespace();
 	setup_mount_namespace(lo->source);
 	setup_seccomp(enable_syslog);
@@ -2860,15 +2921,6 @@ static void setup_root(struct lo_data *lo, struct lo_inode *root)
 	g_atomic_int_set(&root->refcount, 2);
 }
 
-static void setup_proc_self_fd(struct lo_data *lo)
-{
-	lo->proc_self_fd = open("/proc/self/fd", O_PATH);
-	if (lo->proc_self_fd == -1) {
-		fuse_log(FUSE_LOG_ERR, "open(/proc/self/fd, O_PATH): %m\n");
-		exit(1);
-	}
-}
-
 /* Raise the maximum number of open file descriptors to the system limit */
 static void setup_nofile_rlimit(void)
 {
@@ -3110,9 +3162,6 @@ int main(int argc, char *argv[])
 		get_shared(&lo, &lo.root);
 	}
 
-	/* Must be after daemonize to get the right /proc/self/fd */
-	setup_proc_self_fd(&lo);
-
 	setup_sandbox(&lo, opts.syslog);
 
 	setup_root(&lo, &lo.root);
-- 
2.21.0
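
A note on the parent side of the fork: in the hunk above, wstatus is inspected even if waitpid() fails with an error other than EINTR, in which case it was never written. The sketch below shows one possible tightening. It is not what this patch does, and wait_for_child() is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Wait for @child and map the result to an exit code for the parent:
 * the child's own exit status if it exited normally, otherwise 1
 * (the child was killed by a signal, or waitpid() itself failed).
 */
static int wait_for_child(pid_t child)
{
	pid_t waited;
	int wstatus = 0;

	assert(child > 0);

	/* Retry if a signal interrupts the wait */
	do {
		waited = waitpid(child, &wstatus, 0);
	} while (waited < 0 && errno == EINTR);

	/* Only trust wstatus if waitpid() actually reaped the child */
	if (waited == child && WIFEXITED(wstatus)) {
		return WEXITSTATUS(wstatus);
	}
	return 1;
}
```

The parent in setup_pid_namespace() could then end with exit(wait_for_child(child)), keeping the behaviour that whoever launched virtiofsd sees the exit status of the pid 1 child.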



* Re: [Virtio-fs] [PATCH 2/2] virtiofsd: move to a new pid namespace
  2019-10-16 16:01 ` [PATCH 2/2] virtiofsd: move to a new pid namespace Stefan Hajnoczi
@ 2019-10-17 14:45   ` Vivek Goyal
  2019-10-17 16:11     ` Stefan Hajnoczi
  2019-10-23  9:46   ` Dr. David Alan Gilbert
  2019-10-24 10:26   ` Daniel P. Berrangé
  2 siblings, 1 reply; 9+ messages in thread
From: Vivek Goyal @ 2019-10-17 14:45 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel

On Wed, Oct 16, 2019 at 05:01:57PM +0100, Stefan Hajnoczi wrote:

[..]
> +	/*
> +	 * If the mounts have shared propagation then we want to opt out so our
> +	 * mount changes don't affect the parent mount namespace.
> +	 */
> +	if (mount(NULL, "/", NULL, MS_REC|MS_SLAVE, NULL) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_SLAVE): %m\n");
> +		exit(1);
> +	}

So we will get mount propagation from the parent, but our mounts will not
propagate back. Sounds reasonable.

Can we take away CAP_SYS_ADMIN from virtiofsd? That way it will not be
able to mount anything at all.

I am wondering whether we depend on the daemon having CAP_SYS_ADMIN.

Thanks
Vivek


* Re: [Virtio-fs] [PATCH 2/2] virtiofsd: move to a new pid namespace
  2019-10-17 14:45   ` [Virtio-fs] " Vivek Goyal
@ 2019-10-17 16:11     ` Stefan Hajnoczi
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2019-10-17 16:11 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, qemu-devel

On Thu, Oct 17, 2019 at 10:45:53AM -0400, Vivek Goyal wrote:
> On Wed, Oct 16, 2019 at 05:01:57PM +0100, Stefan Hajnoczi wrote:
> 
> [..]
> > +	/*
> > +	 * If the mounts have shared propagation then we want to opt out so our
> > +	 * mount changes don't affect the parent mount namespace.
> > +	 */
> > +	if (mount(NULL, "/", NULL, MS_REC|MS_SLAVE, NULL) < 0) {
> > +		fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_SLAVE): %m\n");
> > +		exit(1);
> > +	}
> 
> So we will get mount propagation from the parent, but our mounts will not
> propagate back. Sounds reasonable.
> 
> Can we take away CAP_SYS_ADMIN from virtiofsd? That way it will not be
> able to mount anything at all.
> 
> I am wondering whether we depend on the daemon having CAP_SYS_ADMIN.

I don't know the answer.  Additional patches to reduce the capability
set as much as possible would be great, but are a separate task.

Stefan

* Re: [PATCH 2/2] virtiofsd: move to a new pid namespace
  2019-10-16 16:01 ` [PATCH 2/2] virtiofsd: move to a new pid namespace Stefan Hajnoczi
  2019-10-17 14:45   ` [Virtio-fs] " Vivek Goyal
@ 2019-10-23  9:46   ` Dr. David Alan Gilbert
  2019-10-24 10:26   ` Daniel P. Berrangé
  2 siblings, 0 replies; 9+ messages in thread
From: Dr. David Alan Gilbert @ 2019-10-23  9:46 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> virtiofsd needs access to /proc/self/fd.  Let's move to a new pid
> namespace so that a compromised process cannot see any other
> processes running on the system.
> 
> One wrinkle in this approach: unshare(CLONE_NEWPID) affects *child*
> processes and not the current process.  Therefore we need to fork the
> pid 1 process that will actually run virtiofsd and leave a parent in
> waitpid(2).  This is not the same thing as daemonization and parent
> processes should not notice a difference.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

OK, I think that's OK (I don't know the mount semantics that well).

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  contrib/virtiofsd/passthrough_ll.c | 95 ++++++++++++++++++++++--------
>  1 file changed, 72 insertions(+), 23 deletions(-)
> 
> diff --git a/contrib/virtiofsd/passthrough_ll.c b/contrib/virtiofsd/passthrough_ll.c
> index c27ff7d800..b6ee9b2e90 100644
> --- a/contrib/virtiofsd/passthrough_ll.c
> +++ b/contrib/virtiofsd/passthrough_ll.c
> @@ -56,9 +56,12 @@
>  #include <sys/xattr.h>
>  #include <sys/mman.h>
>  #include <sys/socket.h>
> +#include <sys/types.h>
>  #include <sys/un.h>
> +#include <sys/wait.h>
>  #include <sys/capability.h>
>  
> +
>  #include "ireg.h"
>  #include <sys/mount.h>
>  #include <sys/resource.h>
> @@ -2749,6 +2752,72 @@ static void setup_net_namespace(void)
>  	}
>  }
>  
> +/*
> + * Move to a new pid namespace to prevent access to other processes if this
> + * process is compromised.
> + */
> +static void setup_pid_namespace(void)
> +{
> +	pid_t child;
> +
> +	/*
> +	 * Create a new pid namespace for *child* processes.  We'll have to
> +	 * fork in order to enter the new pid namespace.  A new mount namespace
> +	 * is also needed so that we can remount /proc for the new pid
> +	 * namespace.
> +	 */
> +	if (unshare(CLONE_NEWPID | CLONE_NEWNS) != 0) {
> +		fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
> +		exit(1);
> +	}
> +
> +	child = fork();
> +	if (child < 0) {
> +		fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
> +		exit(1);
> +	}
> +	if (child > 0) {
> +		pid_t waited;
> +		int wstatus;
> +
> +		/* The parent waits for the child */
> +		do {
> +			waited = waitpid(child, &wstatus, 0);
> +		} while (waited < 0 && errno == EINTR);
> +
> +		if (WIFEXITED(wstatus)) {
> +			exit(WEXITSTATUS(wstatus));
> +		}
> +
> +		exit(1);
> +	}
> +
> +	/*
> +	 * If the mounts have shared propagation then we want to opt out so our
> +	 * mount changes don't affect the parent mount namespace.
> +	 */
> +	if (mount(NULL, "/", NULL, MS_REC|MS_SLAVE, NULL) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_SLAVE): %m\n");
> +		exit(1);
> +	}
> +
> +	/* The child must remount /proc to use the new pid namespace */
> +	if (mount("proc", "/proc", "proc",
> +		  MS_NODEV | MS_NOEXEC | MS_NOSUID | MS_RELATIME, NULL) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "mount(/proc): %m\n");
> +		exit(1);
> +	}
> +}
> +
> +static void setup_proc_self_fd(struct lo_data *lo)
> +{
> +	lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> +	if (lo->proc_self_fd == -1) {
> +		fuse_log(FUSE_LOG_ERR, "open(/proc/self/fd, O_PATH): %m\n");
> +		exit(1);
> +	}
> +}
> +
>  /* This magic is based on lxc's lxc_pivot_root() */
>  static void setup_pivot_root(const char *source)
>  {
> @@ -2803,20 +2872,10 @@ static void setup_pivot_root(const char *source)
>  
>  /*
>   * Make the source directory our root so symlinks cannot escape and no other
> - * files are accessible.
> + * files are accessible.  Assumes unshare(CLONE_NEWNS) was already called.
>   */
>  static void setup_mount_namespace(const char *source)
>  {
> -	if (unshare(CLONE_NEWNS) != 0) {
> -		fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWNS): %m\n");
> -		exit(1);
> -	}
> -
> -	if (mount(NULL, "/", NULL, MS_REC|MS_SLAVE, NULL) < 0) {
> -		fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_PRIVATE): %m\n");
> -		exit(1);
> -	}
> -
>  	if (mount(source, source, NULL, MS_BIND, NULL) < 0) {
>  		fuse_log(FUSE_LOG_ERR, "mount(%s, %s, MS_BIND): %m\n", source, source);
>  		exit(1);
> @@ -2831,6 +2890,8 @@ static void setup_mount_namespace(const char *source)
>   */
>  static void setup_sandbox(struct lo_data *lo, bool enable_syslog)
>  {
> +	setup_pid_namespace();
> +	setup_proc_self_fd(lo);
>  	setup_net_namespace();
>  	setup_mount_namespace(lo->source);
>  	setup_seccomp(enable_syslog);
> @@ -2860,15 +2921,6 @@ static void setup_root(struct lo_data *lo, struct lo_inode *root)
>  	g_atomic_int_set(&root->refcount, 2);
>  }
>  
> -static void setup_proc_self_fd(struct lo_data *lo)
> -{
> -	lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> -	if (lo->proc_self_fd == -1) {
> -		fuse_log(FUSE_LOG_ERR, "open(/proc/self/fd, O_PATH): %m\n");
> -		exit(1);
> -	}
> -}
> -
>  /* Raise the maximum number of open file descriptors to the system limit */
>  static void setup_nofile_rlimit(void)
>  {
> @@ -3110,9 +3162,6 @@ int main(int argc, char *argv[])
>  		get_shared(&lo, &lo.root);
>  	}
>  
> -	/* Must be after daemonize to get the right /proc/self/fd */
> -	setup_proc_self_fd(&lo);
> -
>  	setup_sandbox(&lo, opts.syslog);
>  
>  	setup_root(&lo, &lo.root);
> -- 
> 2.21.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [PATCH 2/2] virtiofsd: move to a new pid namespace
  2019-10-16 16:01 ` [PATCH 2/2] virtiofsd: move to a new pid namespace Stefan Hajnoczi
  2019-10-17 14:45   ` [Virtio-fs] " Vivek Goyal
  2019-10-23  9:46   ` Dr. David Alan Gilbert
@ 2019-10-24 10:26   ` Daniel P. Berrangé
  2019-10-25 12:53     ` Stefan Hajnoczi
  2 siblings, 1 reply; 9+ messages in thread
From: Daniel P. Berrangé @ 2019-10-24 10:26 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, Dr. David Alan Gilbert

On Wed, Oct 16, 2019 at 05:01:57PM +0100, Stefan Hajnoczi wrote:
> virtiofsd needs access to /proc/self/fd.  Let's move to a new pid
> namespace so that a compromised process cannot see any other
> processes running on the system.
> 
> One wrinkle in this approach: unshare(CLONE_NEWPID) affects *child*
> processes and not the current process.  Therefore we need to fork the
> pid 1 process that will actually run virtiofsd and leave a parent in
> waitpid(2).  This is not the same thing as daemonization and parent
> processes should not notice a difference.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  contrib/virtiofsd/passthrough_ll.c | 95 ++++++++++++++++++++++--------
>  1 file changed, 72 insertions(+), 23 deletions(-)
> 
> diff --git a/contrib/virtiofsd/passthrough_ll.c b/contrib/virtiofsd/passthrough_ll.c
> index c27ff7d800..b6ee9b2e90 100644
> --- a/contrib/virtiofsd/passthrough_ll.c
> +++ b/contrib/virtiofsd/passthrough_ll.c
> @@ -56,9 +56,12 @@
>  #include <sys/xattr.h>
>  #include <sys/mman.h>
>  #include <sys/socket.h>
> +#include <sys/types.h>
>  #include <sys/un.h>
> +#include <sys/wait.h>
>  #include <sys/capability.h>
>  
> +
>  #include "ireg.h"
>  #include <sys/mount.h>
>  #include <sys/resource.h>
> @@ -2749,6 +2752,72 @@ static void setup_net_namespace(void)
>  	}
>  }
>  
> +/*
> + * Move to a new pid namespace to prevent access to other processes if this
> + * process is compromised.
> + */
> +static void setup_pid_namespace(void)
> +{
> +	pid_t child;
> +
> +	/*
> +	 * Create a new pid namespace for *child* processes.  We'll have to
> +	 * fork in order to enter the new pid namespace.  A new mount namespace
> +	 * is also needed so that we can remount /proc for the new pid
> +	 * namespace.
> +	 */
> +	if (unshare(CLONE_NEWPID | CLONE_NEWNS) != 0) {
> +		fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
> +		exit(1);
> +	}
> +
> +	child = fork();
> +	if (child < 0) {
> +		fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
> +		exit(1);
> +	}
> +	if (child > 0) {
> +		pid_t waited;
> +		int wstatus;
> +
> +		/* The parent waits for the child */
> +		do {
> +			waited = waitpid(child, &wstatus, 0);
> +		} while (waited < 0 && errno == EINTR);
> +
> +		if (WIFEXITED(wstatus)) {
> +			exit(WEXITSTATUS(wstatus));
> +		}
> +
> +		exit(1);
> +	}

It might be useful to call prctl(PR_SET_PDEATHSIG) here, so that
if the parent process exits for any reason, the child will be killed
off too.
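
A minimal sketch of that call, assuming SIGKILL as the signal (nothing in the patch fixes the choice); both helper names are hypothetical. There is a window between fork() and prctl() in which the parent may already have exited; closing that window is left out here.

```c
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>
#include <unistd.h>

/*
 * Hypothetical helper, to be called in the child right after fork():
 * ask the kernel to send SIGKILL to this process when its parent dies.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int kill_me_when_parent_dies(void)
{
	return prctl(PR_SET_PDEATHSIG, (unsigned long)SIGKILL) == 0 ? 0 : -1;
}

/* Read back the current parent-death signal, or -1 on error */
static int get_parent_death_signal(void)
{
	int sig = 0;

	if (prctl(PR_GET_PDEATHSIG, &sig) != 0) {
		return -1;
	}
	return sig;
}
```

The setting applies to the calling process only and is cleared in children created by a later fork(), so it would have to be made in the child branch of setup_pid_namespace().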

> +
> +	/*
> +	 * If the mounts have shared propagation then we want to opt out so our
> +	 * mount changes don't affect the parent mount namespace.
> +	 */
> +	if (mount(NULL, "/", NULL, MS_REC|MS_SLAVE, NULL) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_SLAVE): %m\n");
> +		exit(1);
> +	}
> +
> +	/* The child must remount /proc to use the new pid namespace */
> +	if (mount("proc", "/proc", "proc",
> +		  MS_NODEV | MS_NOEXEC | MS_NOSUID | MS_RELATIME, NULL) < 0) {
> +		fuse_log(FUSE_LOG_ERR, "mount(/proc): %m\n");
> +		exit(1);
> +	}
> +}

I feel like this is making things a bit misleading.

 setup_pid_namespace()

is now creating the mount namespace and pid namespace, and doing
some mount point config

 setup_mount_namespace()

is not creating the mount namespace, but is doing some more mount
point config.

And then there's setup_net_namespace() too.

I think there could be a single

  setup_namespaces()

method that does the unshare(CLONE_NEWNS|CLONE_NEWNET|CLONE_NEWPID)
and forking the child.

And a setup_mounts()

method that does all the mount() calls.
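
Roughly, the first half could look like the sketch below. It is only an illustration of the split: errors are returned instead of being logged with fuse_log(), and the mount() calls are assumed to move unchanged into a separate setup_mounts().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Sketch of a combined setup_namespaces(): one unshare() call for the
 * mount, network and pid namespaces, followed by the fork that places
 * the child in the new pid namespace.
 *
 * Returns 0 in the child (pid 1 of the new namespace) and -1 with errno
 * set if unshare() or fork() failed.  The parent never returns: it waits
 * for the child and exits with the child's status.
 */
static int setup_namespaces(void)
{
	pid_t child, waited;
	int wstatus = 0;

	if (unshare(CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWPID) != 0) {
		return -1;
	}

	child = fork();
	if (child < 0) {
		return -1;
	}
	if (child == 0) {
		/* Caller continues with setup_mounts(), seccomp, etc. */
		return 0;
	}

	/* The parent waits for the child */
	do {
		waited = waitpid(child, &wstatus, 0);
	} while (waited < 0 && errno == EINTR);

	exit(waited == child && WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : 1);
}
```

With this shape, setup_sandbox() would read as setup_namespaces(), then setup_mounts(), then setup_seccomp(), and each function name would describe what it actually does.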

> +
> +static void setup_proc_self_fd(struct lo_data *lo)
> +{
> +	lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> +	if (lo->proc_self_fd == -1) {
> +		fuse_log(FUSE_LOG_ERR, "open(/proc/self/fd, O_PATH): %m\n");
> +		exit(1);
> +	}
> +}
> +
>  /* This magic is based on lxc's lxc_pivot_root() */
>  static void setup_pivot_root(const char *source)
>  {
> @@ -2803,20 +2872,10 @@ static void setup_pivot_root(const char *source)
>  
>  /*
>   * Make the source directory our root so symlinks cannot escape and no other
> - * files are accessible.
> + * files are accessible.  Assumes unshare(CLONE_NEWNS) was already called.
>   */
>  static void setup_mount_namespace(const char *source)
>  {
> -	if (unshare(CLONE_NEWNS) != 0) {
> -		fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWNS): %m\n");
> -		exit(1);
> -	}
> -
> -	if (mount(NULL, "/", NULL, MS_REC|MS_SLAVE, NULL) < 0) {
> -		fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_PRIVATE): %m\n");
> -		exit(1);
> -	}
> -
>  	if (mount(source, source, NULL, MS_BIND, NULL) < 0) {
>  		fuse_log(FUSE_LOG_ERR, "mount(%s, %s, MS_BIND): %m\n", source, source);
>  		exit(1);
> @@ -2831,6 +2890,8 @@ static void setup_mount_namespace(const char *source)
>   */
>  static void setup_sandbox(struct lo_data *lo, bool enable_syslog)
>  {
> +	setup_pid_namespace();
> +	setup_proc_self_fd(lo);
>  	setup_net_namespace();
>  	setup_mount_namespace(lo->source);
>  	setup_seccomp(enable_syslog);
> @@ -2860,15 +2921,6 @@ static void setup_root(struct lo_data *lo, struct lo_inode *root)
>  	g_atomic_int_set(&root->refcount, 2);
>  }
>  
> -static void setup_proc_self_fd(struct lo_data *lo)
> -{
> -	lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> -	if (lo->proc_self_fd == -1) {
> -		fuse_log(FUSE_LOG_ERR, "open(/proc/self/fd, O_PATH): %m\n");
> -		exit(1);
> -	}
> -}
> -
>  /* Raise the maximum number of open file descriptors to the system limit */
>  static void setup_nofile_rlimit(void)
>  {
> @@ -3110,9 +3162,6 @@ int main(int argc, char *argv[])
>  		get_shared(&lo, &lo.root);
>  	}
>  
> -	/* Must be after daemonize to get the right /proc/self/fd */
> -	setup_proc_self_fd(&lo);
> -
>  	setup_sandbox(&lo, opts.syslog);
>  
>  	setup_root(&lo, &lo.root);
> -- 
> 2.21.0
> 
> 

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: [PATCH 2/2] virtiofsd: move to a new pid namespace
  2019-10-24 10:26   ` Daniel P. Berrangé
@ 2019-10-25 12:53     ` Stefan Hajnoczi
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2019-10-25 12:53 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: virtio-fs, qemu-devel, Dr. David Alan Gilbert

On Thu, Oct 24, 2019 at 11:26:11AM +0100, Daniel P. Berrangé wrote:
> On Wed, Oct 16, 2019 at 05:01:57PM +0100, Stefan Hajnoczi wrote:
> It might be useful to call prctl(PR_SET_PDEATHSIG) here, so that
> if the parent process exits for any reason, the child will be killed
> off too.
[...]
> I feel like this is making things a bit misleading.
> 
>  setup_pid_namespace()
> 
> is now creating the mount namespace and pid namespace, and doing
> some mount point config
> 
>  setup_mount_namespace()
> 
> is not creating the mount namespace, but is doing some more mount
> point config.
> 
> And then there's setup_net_namespace() too.
> 
> I think there could be a single
> 
>   setup_namespaces()
> 
> method that does the unshare(CLONE_NEWNS|CLONE_NEWNET|CLONE_NEWPID)
> and forking the child.
> 
> And a setup_mounts()
> 
> method that does all the mount() calls.

Thanks for your suggestions.  I'll implement both of them as follow-up
patches since this has already been included in the virtiofsd code.

Stefan
