qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
@ 2023-06-28 15:27 Bin Meng
  2023-06-28 15:27 ` [PATCH v4 1/6] tests/tcg/cris: Fix the coding style Bin Meng
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Bin Meng @ 2023-06-28 15:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Zhangjin Wu, Claudio Imbrenda,
	Daniel P. Berrangé, Edgar E. Iglesias, Jason Wang,
	Kevin Wolf, Marc-André Lureau, Michael S. Tsirkin,
	Paolo Bonzini, Philippe Mathieu-Daudé, Thomas Huth,
	Xuzhou Cheng


Current codes using a brute-force traversal of all file descriptors
do not scale on a system where the maximum number of file descriptors
is set to a very large value (e.g.: in a Docker container of Manjaro
distribution it is set to 1073741816). QEMU just looks frozen during
start-up.

The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
doesn't need to manually close the fds for child process as the proper
O_CLOEXEC flag should have been set properly on files with its own
codes, QEMU uses a huge number of 3rd party libraries and we don't
trust them to reliably be using O_CLOEXEC on everything they open.

Modern Linux and BSDs have the close_range() call we can use to do the
job, and on Linux we have one more way to walk through /proc/self/fd
to complete the task efficiently, which is what qemu_close_range()
does, a new API we add in util/osdep.c.

V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/

Changes in v4:
- add 'first > last' check logic
- reorder the ifdefs logic
- change i to unsigned int type
- use qemu_strtoi() instead of atoi()
- limit last upper value to sysconf(_SC_OPEN_MAX) - 1
- call sysconf directly instead of using a variable
- put fd on its own line

Changes in v3:
- fix win32 build failure
- limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)

Changes in v2:
- new patch: "tests/tcg/cris: Fix the coding style"
- new patch: "tests/tcg/cris: Correct the off-by-one error"
- new patch: "util/async-teardown: Fall back to close fds one by one"
- new patch: "util/osdep: Introduce qemu_close_range()"
- new patch: "util/async-teardown: Use qemu_close_range() to close fds"
- Change to use qemu_close_range() to close fds for child process efficiently
- v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/

Bin Meng (4):
  tests/tcg/cris: Fix the coding style
  tests/tcg/cris: Correct the off-by-one error
  util/async-teardown: Fall back to close fds one by one
  util/osdep: Introduce qemu_close_range()

Zhangjin Wu (2):
  util/async-teardown: Use qemu_close_range() to close fds
  net: tap: Use qemu_close_range() to close fds

 include/qemu/osdep.h                |  1 +
 net/tap.c                           | 24 ++++++------
 tests/tcg/cris/libc/check_openpf5.c | 57 +++++++++++++--------------
 util/async-teardown.c               | 37 +-----------------
 util/osdep.c                        | 60 +++++++++++++++++++++++++++++
 5 files changed, 101 insertions(+), 78 deletions(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v4 1/6] tests/tcg/cris: Fix the coding style
  2023-06-28 15:27 [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large Bin Meng
@ 2023-06-28 15:27 ` Bin Meng
  2023-06-28 15:27 ` [PATCH v4 2/6] tests/tcg/cris: Correct the off-by-one error Bin Meng
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Bin Meng @ 2023-06-28 15:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Zhangjin Wu, Philippe Mathieu-Daudé,
	Edgar E. Iglesias

The code style does not conform with QEMU's. Correct it so that the
upcoming commit does not trigger checkpatch warnings.

Signed-off-by: Bin Meng <bmeng@tinylab.org>
Acked-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

---

(no changes since v2)

Changes in v2:
- new patch: "tests/tcg/cris: Fix the coding style"

 tests/tcg/cris/libc/check_openpf5.c | 57 ++++++++++++++---------------
 1 file changed, 27 insertions(+), 30 deletions(-)

diff --git a/tests/tcg/cris/libc/check_openpf5.c b/tests/tcg/cris/libc/check_openpf5.c
index 1f86ea283d..0037fbca4c 100644
--- a/tests/tcg/cris/libc/check_openpf5.c
+++ b/tests/tcg/cris/libc/check_openpf5.c
@@ -13,44 +13,41 @@
 #include <fcntl.h>
 #include <string.h>
 
-int main (int argc, char *argv[])
+int main(int argc, char *argv[])
 {
-  int i;
-  int filemax;
+    int i;
+    int filemax;
 
 #ifdef OPEN_MAX
-  filemax = OPEN_MAX;
+    filemax = OPEN_MAX;
 #else
-  filemax = sysconf (_SC_OPEN_MAX);
+    filemax = sysconf(_SC_OPEN_MAX);
 #endif
 
-  char *fn = malloc (strlen (argv[0]) + 2);
-  if (fn == NULL)
-    abort ();
-  strcpy (fn, "/");
-  strcat (fn, argv[0]);
+    char *fn = malloc(strlen(argv[0]) + 2);
+    if (fn == NULL) {
+        abort();
+    }
+    strcpy(fn, "/");
+    strcat(fn, argv[0]);
 
-  for (i = 0; i < filemax + 1; i++)
-    {
-      if (open (fn, O_RDONLY) < 0)
-	{
-	  /* Shouldn't happen too early.  */
-	  if (i < filemax - 3 - 1)
-	    {
-	      fprintf (stderr, "i: %d\n", i);
-	      abort ();
-	    }
-	  if (errno != EMFILE)
-	    {
-	      perror ("open");
-	      abort ();
-	    }
-	  goto ok;
-	}
+    for (i = 0; i < filemax + 1; i++) {
+        if (open(fn, O_RDONLY) < 0) {
+            /* Shouldn't happen too early.  */
+            if (i < filemax - 3 - 1) {
+                fprintf(stderr, "i: %d\n", i);
+                abort();
+            }
+            if (errno != EMFILE) {
+                perror("open");
+                abort();
+            }
+        goto ok;
+        }
     }
-  abort ();
+    abort();
 
 ok:
-  printf ("pass\n");
-  exit (0);
+    printf("pass\n");
+    exit(0);
 }
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 2/6] tests/tcg/cris: Correct the off-by-one error
  2023-06-28 15:27 [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large Bin Meng
  2023-06-28 15:27 ` [PATCH v4 1/6] tests/tcg/cris: Fix the coding style Bin Meng
@ 2023-06-28 15:27 ` Bin Meng
  2023-06-28 15:27 ` [PATCH v4 3/6] util/async-teardown: Fall back to close fds one by one Bin Meng
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Bin Meng @ 2023-06-28 15:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Zhangjin Wu, Philippe Mathieu-Daudé,
	Edgar E. Iglesias

sysconf(_SC_OPEN_MAX) returns the maximum number of files that
a process can have open at any time, which means the fd should
not be larger than or equal to the return value.

Signed-off-by: Bin Meng <bmeng@tinylab.org>
Acked-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

---

(no changes since v2)

Changes in v2:
- new patch: "tests/tcg/cris: Correct the off-by-one error"

 tests/tcg/cris/libc/check_openpf5.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/tcg/cris/libc/check_openpf5.c b/tests/tcg/cris/libc/check_openpf5.c
index 0037fbca4c..7f585c6d37 100644
--- a/tests/tcg/cris/libc/check_openpf5.c
+++ b/tests/tcg/cris/libc/check_openpf5.c
@@ -31,10 +31,10 @@ int main(int argc, char *argv[])
     strcpy(fn, "/");
     strcat(fn, argv[0]);
 
-    for (i = 0; i < filemax + 1; i++) {
+    for (i = 0; i < filemax; i++) {
         if (open(fn, O_RDONLY) < 0) {
             /* Shouldn't happen too early.  */
-            if (i < filemax - 3 - 1) {
+            if (i < filemax - 3) {
                 fprintf(stderr, "i: %d\n", i);
                 abort();
             }
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 3/6] util/async-teardown: Fall back to close fds one by one
  2023-06-28 15:27 [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large Bin Meng
  2023-06-28 15:27 ` [PATCH v4 1/6] tests/tcg/cris: Fix the coding style Bin Meng
  2023-06-28 15:27 ` [PATCH v4 2/6] tests/tcg/cris: Correct the off-by-one error Bin Meng
@ 2023-06-28 15:27 ` Bin Meng
  2023-06-28 15:27 ` [PATCH v4 4/6] util/osdep: Introduce qemu_close_range() Bin Meng
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Bin Meng @ 2023-06-28 15:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Zhangjin Wu, Claudio Imbrenda,
	Michael S. Tsirkin, Philippe Mathieu-Daudé, Thomas Huth

When opening /proc/self/fd fails, current codes just return directly,
but we can fall back to close fds one by one.

Signed-off-by: Bin Meng <bmeng@tinylab.org>

---
- feel free to drop this patch if it does not make too much sense

(no changes since v2)

Changes in v2:
- new patch: "util/async-teardown: Fall back to close fds one by one"

 util/async-teardown.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/util/async-teardown.c b/util/async-teardown.c
index 3ab19c8740..7e0177a8da 100644
--- a/util/async-teardown.c
+++ b/util/async-teardown.c
@@ -48,7 +48,11 @@ static void close_all_open_fd(void)
 
     dir = opendir("/proc/self/fd");
     if (!dir) {
-        /* If /proc is not mounted, there is nothing that can be done. */
+        /* If /proc is not mounted, close fds one by one. */
+        int open_max = sysconf(_SC_OPEN_MAX), i;
+        for (i = 0; i < open_max; i++) {
+                close(i);
+        }
         return;
     }
     /* Avoid closing the directory. */
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 4/6] util/osdep: Introduce qemu_close_range()
  2023-06-28 15:27 [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large Bin Meng
                   ` (2 preceding siblings ...)
  2023-06-28 15:27 ` [PATCH v4 3/6] util/async-teardown: Fall back to close fds one by one Bin Meng
@ 2023-06-28 15:27 ` Bin Meng
  2023-07-07 14:40   ` Markus Armbruster
  2023-06-28 15:27 ` [PATCH v4 5/6] util/async-teardown: Use qemu_close_range() to close fds Bin Meng
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Bin Meng @ 2023-06-28 15:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Zhangjin Wu, Daniel P. Berrangé,
	Kevin Wolf, Marc-André Lureau, Paolo Bonzini, Thomas Huth,
	Xuzhou Cheng

This introduces a new QEMU API qemu_close_range() that closes all
open file descriptors from first to last (included).

This API will try a more efficient call to close_range(), or walk
through of /proc/self/fd whenever these are possible, otherwise it
falls back to a plain close loop.

Co-developed-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Bin Meng <bmeng@tinylab.org>

---

Changes in v4:
- add 'first > last' check logic
- reorder the ifdefs logic
- change i to unsigned int type
- use qemu_strtoi() instead of atoi()
- limit last upper value to sysconf(_SC_OPEN_MAX) - 1

Changes in v3:
- fix win32 build failure

Changes in v2:
- new patch: "util/osdep: Introduce qemu_close_range()"

 include/qemu/osdep.h |  1 +
 util/osdep.c         | 60 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index cc61b00ba9..e22434ce10 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -560,6 +560,7 @@ int qemu_open_old(const char *name, int flags, ...);
 int qemu_open(const char *name, int flags, Error **errp);
 int qemu_create(const char *name, int flags, mode_t mode, Error **errp);
 int qemu_close(int fd);
+int qemu_close_range(unsigned int first, unsigned int last);
 int qemu_unlink(const char *name);
 #ifndef _WIN32
 int qemu_dup_flags(int fd, int flags);
diff --git a/util/osdep.c b/util/osdep.c
index e996c4744a..1d8c719b3f 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -411,6 +411,66 @@ int qemu_close(int fd)
     return close(fd);
 }
 
+int qemu_close_range(unsigned int first, unsigned int last)
+{
+    if (first > last) {
+        errno = EINVAL;
+        return -1;
+    }
+
+#ifndef _WIN32
+    if (last >= sysconf(_SC_OPEN_MAX)) {
+        last = sysconf(_SC_OPEN_MAX) - 1;
+    }
+#endif
+
+#ifdef CONFIG_CLOSE_RANGE
+    int r = close_range(first, last, 0);
+    if (!r) {
+        /* Success, no need to try other ways */
+        return 0;
+    }
+#endif
+
+#ifdef __linux__
+    DIR *dir = opendir("/proc/self/fd");
+    if (dir) {
+        /* Avoid closing the directory */
+        int dfd = dirfd(dir);
+
+        for (struct dirent *de = readdir(dir); de; de = readdir(dir)) {
+            int fd, ret;
+
+            ret = qemu_strtoi(de->d_name, NULL, 10, &fd);
+            if (ret) {
+                /* skip "." and ".." */
+                continue;
+            }
+            if (fd < first || fd > last) {
+                /* Exclude the fds outside the target range */
+                continue;
+            }
+            if (fd != dfd) {
+                close(fd);
+            }
+        }
+        closedir(dir);
+
+        return 0;
+    }
+#endif
+
+    /*
+     * If /proc is not mounted or /proc/self/fd is not supported,
+     * try close() from first to last.
+     */
+    for (unsigned int i = first; i <= last; i++) {
+        close(i);
+    }
+
+    return 0;
+}
+
 /*
  * Delete a file from the filesystem, unless the filename is /dev/fdset/...
  *
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 5/6] util/async-teardown: Use qemu_close_range() to close fds
  2023-06-28 15:27 [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large Bin Meng
                   ` (3 preceding siblings ...)
  2023-06-28 15:27 ` [PATCH v4 4/6] util/osdep: Introduce qemu_close_range() Bin Meng
@ 2023-06-28 15:27 ` Bin Meng
  2023-07-07 14:40   ` Markus Armbruster
  2023-06-28 15:27 ` [PATCH v4 6/6] net: tap: " Bin Meng
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Bin Meng @ 2023-06-28 15:27 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, Zhangjin Wu, Claudio Imbrenda,
	Michael S. Tsirkin, Philippe Mathieu-Daudé, Thomas Huth

From: Zhangjin Wu <falcon@tinylab.org>

Based on the old close_all_open_fd() of util/async-teardown.c, a new
generic qemu_close_range() has been added in osdep.c.

Now, let's switch over to use the generic qemu_close_range().

Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Bin Meng <bmeng@tinylab.org>

---

Changes in v4:
- call sysconf directly instead of using a variable

Changes in v3:
- limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)

Changes in v2:
- new patch: "util/async-teardown: Use qemu_close_range() to close fds"

 util/async-teardown.c | 41 +----------------------------------------
 1 file changed, 1 insertion(+), 40 deletions(-)

diff --git a/util/async-teardown.c b/util/async-teardown.c
index 7e0177a8da..a038a255ff 100644
--- a/util/async-teardown.c
+++ b/util/async-teardown.c
@@ -29,44 +29,6 @@
 
 static pid_t the_ppid;
 
-/*
- * Close all open file descriptors.
- */
-static void close_all_open_fd(void)
-{
-    struct dirent *de;
-    int fd, dfd;
-    DIR *dir;
-
-#ifdef CONFIG_CLOSE_RANGE
-    int r = close_range(0, ~0U, 0);
-    if (!r) {
-        /* Success, no need to try other ways. */
-        return;
-    }
-#endif
-
-    dir = opendir("/proc/self/fd");
-    if (!dir) {
-        /* If /proc is not mounted, close fds one by one. */
-        int open_max = sysconf(_SC_OPEN_MAX), i;
-        for (i = 0; i < open_max; i++) {
-                close(i);
-        }
-        return;
-    }
-    /* Avoid closing the directory. */
-    dfd = dirfd(dir);
-
-    for (de = readdir(dir); de; de = readdir(dir)) {
-        fd = atoi(de->d_name);
-        if (fd != dfd) {
-            close(fd);
-        }
-    }
-    closedir(dir);
-}
-
 static void hup_handler(int signal)
 {
     /* Check every second if this process has been reparented. */
@@ -92,9 +54,8 @@ static int async_teardown_fn(void *arg)
     /*
      * Close all file descriptors that might have been inherited from the
      * main qemu process when doing clone, needed to make libvirt happy.
-     * Not using close_range for increased compatibility with older kernels.
      */
-    close_all_open_fd();
+    qemu_close_range(0, sysconf(_SC_OPEN_MAX) - 1);
 
     /* Set up a handler for SIGHUP and unblock SIGHUP. */
     sigaction(SIGHUP, &sa, NULL);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 6/6] net: tap: Use qemu_close_range() to close fds
  2023-06-28 15:27 [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large Bin Meng
                   ` (4 preceding siblings ...)
  2023-06-28 15:27 ` [PATCH v4 5/6] util/async-teardown: Use qemu_close_range() to close fds Bin Meng
@ 2023-06-28 15:27 ` Bin Meng
  2023-06-30  4:52   ` Jason Wang
  2023-06-29  8:33 ` [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large Michael Tokarev
  2023-07-09 15:47 ` Bin Meng
  7 siblings, 1 reply; 16+ messages in thread
From: Bin Meng @ 2023-06-28 15:27 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Zhangjin Wu, Jason Wang

From: Zhangjin Wu <falcon@tinylab.org>

Current codes using a brute-force traversal of all file descriptors
do not scale on a system where the maximum number of file descriptors
is set to a very large value (e.g.: in a Docker container of Manjaro
distribution it is set to 1073741816). QEMU just looks frozen during
start-up.

The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
doesn't need to manually close the fds for child process as the proper
O_CLOEXEC flag should have been set properly on files with its own
codes, QEMU uses a huge number of 3rd party libraries and we don't
trust them to reliably be using O_CLOEXEC on everything they open.

Modern Linux and BSDs have the close_range() call we can use to do the
job, and on Linux we have one more way to walk through /proc/self/fd
to complete the task efficiently, which is what qemu_close_range() does.

Reported-by: Zhangjin Wu <falcon@tinylab.org>
Co-developed-by: Bin Meng <bmeng@tinylab.org>
Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Bin Meng <bmeng@tinylab.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

---

Changes in v4:
- put fd on its own line

Changes in v2:
- Change to use qemu_close_range() to close fds for child process efficiently
- v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/

 net/tap.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/net/tap.c b/net/tap.c
index 1bf085d422..9f080215f0 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -446,13 +446,13 @@ static void launch_script(const char *setup_script, const char *ifname,
         return;
     }
     if (pid == 0) {
-        int open_max = sysconf(_SC_OPEN_MAX), i;
+        unsigned int last_fd = sysconf(_SC_OPEN_MAX) - 1;
+
+        /* skip stdin, stdout and stderr */
+        qemu_close_range(3, fd - 1);
+        /* skip the currently used fd */
+        qemu_close_range(fd + 1, last_fd);
 
-        for (i = 3; i < open_max; i++) {
-            if (i != fd) {
-                close(i);
-            }
-        }
         parg = args;
         *parg++ = (char *)setup_script;
         *parg++ = (char *)ifname;
@@ -536,16 +536,16 @@ static int net_bridge_run_helper(const char *helper, const char *bridge,
         return -1;
     }
     if (pid == 0) {
-        int open_max = sysconf(_SC_OPEN_MAX), i;
+        unsigned int last_fd = sysconf(_SC_OPEN_MAX) - 1;
+        unsigned int fd = sv[1];
         char *fd_buf = NULL;
         char *br_buf = NULL;
         char *helper_cmd = NULL;
 
-        for (i = 3; i < open_max; i++) {
-            if (i != sv[1]) {
-                close(i);
-            }
-        }
+        /* skip stdin, stdout and stderr */
+        qemu_close_range(3, fd - 1);
+        /* skip the currently used fd */
+        qemu_close_range(fd + 1, last_fd);
 
         fd_buf = g_strdup_printf("%s%d", "--fd=", sv[1]);
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
  2023-06-28 15:27 [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large Bin Meng
                   ` (5 preceding siblings ...)
  2023-06-28 15:27 ` [PATCH v4 6/6] net: tap: " Bin Meng
@ 2023-06-29  8:33 ` Michael Tokarev
  2023-06-29  9:05   ` Daniel P. Berrangé
  2023-07-09 15:47 ` Bin Meng
  7 siblings, 1 reply; 16+ messages in thread
From: Michael Tokarev @ 2023-06-29  8:33 UTC (permalink / raw)
  To: Bin Meng, qemu-devel
  Cc: Richard Henderson, Zhangjin Wu, Claudio Imbrenda,
	Daniel P. Berrangé, Edgar E. Iglesias, Jason Wang,
	Kevin Wolf, Marc-André Lureau, Michael S. Tsirkin,
	Paolo Bonzini, Philippe Mathieu-Daudé, Thomas Huth,
	Xuzhou Cheng

28.06.2023 18:27, Bin Meng wrote:
> 
> Current codes using a brute-force traversal of all file descriptors
> do not scale on a system where the maximum number of file descriptors
> is set to a very large value (e.g.: in a Docker container of Manjaro
> distribution it is set to 1073741816). QEMU just looks frozen during
> start-up.

So, the same question as before. *Why* do we close all filedescriptors
to begin with?

Thanks,

/mjt


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
  2023-06-29  8:33 ` [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large Michael Tokarev
@ 2023-06-29  9:05   ` Daniel P. Berrangé
  0 siblings, 0 replies; 16+ messages in thread
From: Daniel P. Berrangé @ 2023-06-29  9:05 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Bin Meng, qemu-devel, Richard Henderson, Zhangjin Wu,
	Claudio Imbrenda, Edgar E. Iglesias, Jason Wang, Kevin Wolf,
	Marc-André Lureau, Michael S. Tsirkin, Paolo Bonzini,
	Philippe Mathieu-Daudé, Thomas Huth, Xuzhou Cheng

On Thu, Jun 29, 2023 at 11:33:29AM +0300, Michael Tokarev wrote:
> 28.06.2023 18:27, Bin Meng wrote:
> > 
> > Current codes using a brute-force traversal of all file descriptors
> > do not scale on a system where the maximum number of file descriptors
> > is set to a very large value (e.g.: in a Docker container of Manjaro
> > distribution it is set to 1073741816). QEMU just looks frozen during
> > start-up.
> 
> So, the same question as before. *Why* do we close all filedescriptors
> to begin with?

The O_CLOSEXEC flag is a terrible concept, as the default behaviour of
file descriptors is to be leaked into all child processes, unless code
takes explicit action to set O_CLOEXEC in every case. Even if they are
diligent about their own code, apps developers can have zero confidence
that every library they use has set O_CLOEXEC. Not just set it after the
FD is open, but set it atomically when the the FD is open, because
threads create race conditions if not atomically set.

Leaking FDs is a security risk, and QEMU is an especially security
critical application. QEMU needs stronger guarantees that O_CLOEXEC
can offer, and mass-close before execve is the only viable option
to achieve this.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 6/6] net: tap: Use qemu_close_range() to close fds
  2023-06-28 15:27 ` [PATCH v4 6/6] net: tap: " Bin Meng
@ 2023-06-30  4:52   ` Jason Wang
  0 siblings, 0 replies; 16+ messages in thread
From: Jason Wang @ 2023-06-30  4:52 UTC (permalink / raw)
  To: Bin Meng; +Cc: qemu-devel, Richard Henderson, Zhangjin Wu

On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote:
>
> From: Zhangjin Wu <falcon@tinylab.org>
>
> Current codes using a brute-force traversal of all file descriptors
> do not scale on a system where the maximum number of file descriptors
> is set to a very large value (e.g.: in a Docker container of Manjaro
> distribution it is set to 1073741816). QEMU just looks frozen during
> start-up.
>
> The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
> 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
> doesn't need to manually close the fds for child process as the proper
> O_CLOEXEC flag should have been set properly on files with its own
> codes, QEMU uses a huge number of 3rd party libraries and we don't
> trust them to reliably be using O_CLOEXEC on everything they open.
>
> Modern Linux and BSDs have the close_range() call we can use to do the
> job, and on Linux we have one more way to walk through /proc/self/fd
> to complete the task efficiently, which is what qemu_close_range() does.
>
> Reported-by: Zhangjin Wu <falcon@tinylab.org>
> Co-developed-by: Bin Meng <bmeng@tinylab.org>
> Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
> Signed-off-by: Bin Meng <bmeng@tinylab.org>
> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

Patch looks good but I'm not sure using helper scripts is good for the
production environment since it increases attack surfaces. Passing TAP
fd should be a better way.

Thanks

>
> ---
>
> Changes in v4:
> - put fd on its own line
>
> Changes in v2:
> - Change to use qemu_close_range() to close fds for child process efficiently
> - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
>
>  net/tap.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/net/tap.c b/net/tap.c
> index 1bf085d422..9f080215f0 100644
> --- a/net/tap.c
> +++ b/net/tap.c
> @@ -446,13 +446,13 @@ static void launch_script(const char *setup_script, const char *ifname,
>          return;
>      }
>      if (pid == 0) {
> -        int open_max = sysconf(_SC_OPEN_MAX), i;
> +        unsigned int last_fd = sysconf(_SC_OPEN_MAX) - 1;
> +
> +        /* skip stdin, stdout and stderr */
> +        qemu_close_range(3, fd - 1);
> +        /* skip the currently used fd */
> +        qemu_close_range(fd + 1, last_fd);
>
> -        for (i = 3; i < open_max; i++) {
> -            if (i != fd) {
> -                close(i);
> -            }
> -        }
>          parg = args;
>          *parg++ = (char *)setup_script;
>          *parg++ = (char *)ifname;
> @@ -536,16 +536,16 @@ static int net_bridge_run_helper(const char *helper, const char *bridge,
>          return -1;
>      }
>      if (pid == 0) {
> -        int open_max = sysconf(_SC_OPEN_MAX), i;
> +        unsigned int last_fd = sysconf(_SC_OPEN_MAX) - 1;
> +        unsigned int fd = sv[1];
>          char *fd_buf = NULL;
>          char *br_buf = NULL;
>          char *helper_cmd = NULL;
>
> -        for (i = 3; i < open_max; i++) {
> -            if (i != sv[1]) {
> -                close(i);
> -            }
> -        }
> +        /* skip stdin, stdout and stderr */
> +        qemu_close_range(3, fd - 1);
> +        /* skip the currently used fd */
> +        qemu_close_range(fd + 1, last_fd);
>
>          fd_buf = g_strdup_printf("%s%d", "--fd=", sv[1]);
>
> --
> 2.34.1
>



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 4/6] util/osdep: Introduce qemu_close_range()
  2023-06-28 15:27 ` [PATCH v4 4/6] util/osdep: Introduce qemu_close_range() Bin Meng
@ 2023-07-07 14:40   ` Markus Armbruster
  0 siblings, 0 replies; 16+ messages in thread
From: Markus Armbruster @ 2023-07-07 14:40 UTC (permalink / raw)
  To: Bin Meng
  Cc: qemu-devel, Richard Henderson, Zhangjin Wu,
	Daniel P. Berrangé, Kevin Wolf, Marc-André Lureau,
	Paolo Bonzini, Thomas Huth, Xuzhou Cheng

Bin Meng <bmeng@tinylab.org> writes:

> This introduces a new QEMU API qemu_close_range() that closes all
> open file descriptors from first to last (included).
>
> This API will try a more efficient call to close_range(), or walk
> through of /proc/self/fd whenever these are possible, otherwise it
> falls back to a plain close loop.
>
> Co-developed-by: Zhangjin Wu <falcon@tinylab.org>
> Signed-off-by: Bin Meng <bmeng@tinylab.org>
>
> ---
>
> Changes in v4:
> - add 'first > last' check logic
> - reorder the ifdefs logic
> - change i to unsigned int type
> - use qemu_strtoi() instead of atoi()
> - limit last upper value to sysconf(_SC_OPEN_MAX) - 1
>
> Changes in v3:
> - fix win32 build failure
>
> Changes in v2:
> - new patch: "util/osdep: Introduce qemu_close_range()"
>
>  include/qemu/osdep.h |  1 +
>  util/osdep.c         | 60 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 61 insertions(+)
>
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index cc61b00ba9..e22434ce10 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -560,6 +560,7 @@ int qemu_open_old(const char *name, int flags, ...);
>  int qemu_open(const char *name, int flags, Error **errp);
>  int qemu_create(const char *name, int flags, mode_t mode, Error **errp);
>  int qemu_close(int fd);
> +int qemu_close_range(unsigned int first, unsigned int last);
>  int qemu_unlink(const char *name);
>  #ifndef _WIN32
>  int qemu_dup_flags(int fd, int flags);
> diff --git a/util/osdep.c b/util/osdep.c
> index e996c4744a..1d8c719b3f 100644
> --- a/util/osdep.c
> +++ b/util/osdep.c
> @@ -411,6 +411,66 @@ int qemu_close(int fd)
>      return close(fd);
>  }
>  
> +int qemu_close_range(unsigned int first, unsigned int last)
> +{
> +    if (first > last) {
> +        errno = EINVAL;
> +        return -1;
> +    }
> +
> +#ifndef _WIN32
> +    if (last >= sysconf(_SC_OPEN_MAX)) {
> +        last = sysconf(_SC_OPEN_MAX) - 1;
> +    }
> +#endif
> +
> +#ifdef CONFIG_CLOSE_RANGE
> +    int r = close_range(first, last, 0);

TOCTTOU if sysconf(_SC_OPEN_MAX) can change at run time.

Say the caller passes ~0U to @last, like the example program in
close_range()'s manual page does.

Since this is larger than sysconf(_SC_OPEN_MAX), we clamp it to
sysconf(_SC_OPEN_MAX).  If the actual value increases before we get to
call close_range(), we can fail to close all fds.

Can't happen if we simply drop the clamping.

> +    if (!r) {
> +        /* Success, no need to try other ways */
> +        return 0;
> +    }
> +#endif

What are the failure modes of close_range() where the other ways are
worth trying?  I asked this in review of v3, and you replied it should
only ever fail when first > last, which you catch before getting here in
v4.

Why not simply return r?

> +
> +#ifdef __linux__
> +    DIR *dir = opendir("/proc/self/fd");
> +    if (dir) {
> +        /* Avoid closing the directory */
> +        int dfd = dirfd(dir);
> +
> +        for (struct dirent *de = readdir(dir); de; de = readdir(dir)) {
> +            int fd, ret;
> +
> +            ret = qemu_strtoi(de->d_name, NULL, 10, &fd);
> +            if (ret) {
> +                /* skip "." and ".." */

Anything that isn't a decimal integer, actually.  Should be just "." and
".." in practice.

> +                continue;
> +            }
> +            if (fd < first || fd > last) {

Not clamping @last is just fine here.

> +                /* Exclude the fds outside the target range */

Suggest to put this comment right before the if.

> +                continue;
> +            }
> +            if (fd != dfd) {
> +                close(fd);
> +            }

Do we still need this now we skip "."?

> +        }
> +        closedir(dir);
> +
> +        return 0;
> +    }
> +#endif
> +
> +    /*
> +     * If /proc is not mounted or /proc/self/fd is not supported,
> +     * try close() from first to last.
> +     */
> +    for (unsigned int i = first; i <= last; i++) {

Here, we do need to stop at sysconf(_SC_OPEN_MAX) - 1.  Recommend to
move the clamping before this loop.

Still a TOCTTOU, but acceptable here, because this fallback is
fundamentally racy no matter what.

> +        close(i);
> +    }
> +
> +    return 0;
> +}
> +
>  /*
>   * Delete a file from the filesystem, unless the filename is /dev/fdset/...
>   *



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 5/6] util/async-teardown: Use qemu_close_range() to close fds
  2023-06-28 15:27 ` [PATCH v4 5/6] util/async-teardown: Use qemu_close_range() to close fds Bin Meng
@ 2023-07-07 14:40   ` Markus Armbruster
  0 siblings, 0 replies; 16+ messages in thread
From: Markus Armbruster @ 2023-07-07 14:40 UTC (permalink / raw)
  To: Bin Meng
  Cc: qemu-devel, Richard Henderson, Zhangjin Wu, Claudio Imbrenda,
	Michael S. Tsirkin, Philippe Mathieu-Daudé, Thomas Huth

Bin Meng <bmeng@tinylab.org> writes:

> From: Zhangjin Wu <falcon@tinylab.org>
>
> Based on the old close_all_open_fd() of util/async-teardown.c, a new
> generic qemu_close_range() has been added in osdep.c.
>
> Now, let's switch over to use the generic qemu_close_range().
>
> Signed-off-by: Zhangjin Wu <falcon@tinylab.org>
> Signed-off-by: Bin Meng <bmeng@tinylab.org>
>
> ---
>
> Changes in v4:
> - call sysconf directly instead of using a variable
>
> Changes in v3:
> - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)
>
> Changes in v2:
> - new patch: "util/async-teardown: Use qemu_close_range() to close fds"
>
>  util/async-teardown.c | 41 +----------------------------------------
>  1 file changed, 1 insertion(+), 40 deletions(-)
>
> diff --git a/util/async-teardown.c b/util/async-teardown.c
> index 7e0177a8da..a038a255ff 100644
> --- a/util/async-teardown.c
> +++ b/util/async-teardown.c
> @@ -29,44 +29,6 @@
>  
>  static pid_t the_ppid;
>  
> -/*
> - * Close all open file descriptors.
> - */
> -static void close_all_open_fd(void)
> -{
> -    struct dirent *de;
> -    int fd, dfd;
> -    DIR *dir;
> -
> -#ifdef CONFIG_CLOSE_RANGE
> -    int r = close_range(0, ~0U, 0);
> -    if (!r) {
> -        /* Success, no need to try other ways. */
> -        return;
> -    }
> -#endif
> -
> -    dir = opendir("/proc/self/fd");
> -    if (!dir) {
> -        /* If /proc is not mounted, close fds one by one. */
> -        int open_max = sysconf(_SC_OPEN_MAX), i;
> -        for (i = 0; i < open_max; i++) {
> -                close(i);
> -        }
> -        return;
> -    }
> -    /* Avoid closing the directory. */
> -    dfd = dirfd(dir);
> -
> -    for (de = readdir(dir); de; de = readdir(dir)) {
> -        fd = atoi(de->d_name);
> -        if (fd != dfd) {
> -            close(fd);
> -        }
> -    }
> -    closedir(dir);
> -}
> -
>  static void hup_handler(int signal)
>  {
>      /* Check every second if this process has been reparented. */
> @@ -92,9 +54,8 @@ static int async_teardown_fn(void *arg)
>      /*
>       * Close all file descriptors that might have been inherited from the
>       * main qemu process when doing clone, needed to make libvirt happy.
> -     * Not using close_range for increased compatibility with older kernels.
>       */
> -    close_all_open_fd();
> +    qemu_close_range(0, sysconf(_SC_OPEN_MAX) - 1);

Why do you change the upper bound from ~0U to sysconf(_SC_OPEN_MAX) - 1?

>  
>      /* Set up a handler for SIGHUP and unblock SIGHUP. */
>      sigaction(SIGHUP, &sa, NULL);



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
  2023-06-28 15:27 [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large Bin Meng
                   ` (6 preceding siblings ...)
  2023-06-29  8:33 ` [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large Michael Tokarev
@ 2023-07-09 15:47 ` Bin Meng
  2023-07-10  3:05   ` Jason Wang
  7 siblings, 1 reply; 16+ messages in thread
From: Bin Meng @ 2023-07-09 15:47 UTC (permalink / raw)
  To: Bin Meng
  Cc: qemu-devel, Richard Henderson, Zhangjin Wu, Claudio Imbrenda,
	Daniel P. Berrangé, Edgar E. Iglesias, Jason Wang,
	Kevin Wolf, Marc-André Lureau, Michael S. Tsirkin,
	Paolo Bonzini, Philippe Mathieu-Daudé, Thomas Huth,
	Xuzhou Cheng

On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote:
>
>
> Current codes using a brute-force traversal of all file descriptors
> do not scale on a system where the maximum number of file descriptors
> is set to a very large value (e.g.: in a Docker container of Manjaro
> distribution it is set to 1073741816). QEMU just looks frozen during
> start-up.
>
> The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
> 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
> doesn't need to manually close the fds for child process as the proper
> O_CLOEXEC flag should have been set properly on files with its own
> codes, QEMU uses a huge number of 3rd party libraries and we don't
> trust them to reliably be using O_CLOEXEC on everything they open.
>
> Modern Linux and BSDs have the close_range() call we can use to do the
> job, and on Linux we have one more way to walk through /proc/self/fd
> to complete the task efficiently, which is what qemu_close_range()
> does, a new API we add in util/osdep.c.
>
> V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
>
> Changes in v4:
> - add 'first > last' check logic
> - reorder the ifdefs logic
> - change i to unsigned int type
> - use qemu_strtoi() instead of atoi()
> - limit last upper value to sysconf(_SC_OPEN_MAX) - 1
> - call sysconf directly instead of using a variable
> - put fd on its own line
>
> Changes in v3:
> - fix win32 build failure
> - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)
>
> Changes in v2:
> - new patch: "tests/tcg/cris: Fix the coding style"
> - new patch: "tests/tcg/cris: Correct the off-by-one error"
> - new patch: "util/async-teardown: Fall back to close fds one by one"
> - new patch: "util/osdep: Introduce qemu_close_range()"
> - new patch: "util/async-teardown: Use qemu_close_range() to close fds"
> - Change to use qemu_close_range() to close fds for child process efficiently
> - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
>
> Bin Meng (4):
>   tests/tcg/cris: Fix the coding style
>   tests/tcg/cris: Correct the off-by-one error
>   util/async-teardown: Fall back to close fds one by one
>   util/osdep: Introduce qemu_close_range()
>
> Zhangjin Wu (2):
>   util/async-teardown: Use qemu_close_range() to close fds
>   net: tap: Use qemu_close_range() to close fds
>

Ping for 8.1?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
  2023-07-09 15:47 ` Bin Meng
@ 2023-07-10  3:05   ` Jason Wang
  2023-07-10  6:07     ` Markus Armbruster
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Wang @ 2023-07-10  3:05 UTC (permalink / raw)
  To: Bin Meng
  Cc: Bin Meng, qemu-devel, Richard Henderson, Zhangjin Wu,
	Claudio Imbrenda, Daniel P. Berrangé, Edgar E. Iglesias,
	Kevin Wolf, Marc-André Lureau, Michael S. Tsirkin,
	Paolo Bonzini, Philippe Mathieu-Daudé, Thomas Huth,
	Xuzhou Cheng

On Sun, Jul 9, 2023 at 11:48 PM Bin Meng <bmeng.cn@gmail.com> wrote:
>
> On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote:
> >
> >
> > Current codes using a brute-force traversal of all file descriptors
> > do not scale on a system where the maximum number of file descriptors
> > is set to a very large value (e.g.: in a Docker container of Manjaro
> > distribution it is set to 1073741816). QEMU just looks frozen during
> > start-up.
> >
> > The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
> > 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
> > doesn't need to manually close the fds for child process as the proper
> > O_CLOEXEC flag should have been set properly on files with its own
> > codes, QEMU uses a huge number of 3rd party libraries and we don't
> > trust them to reliably be using O_CLOEXEC on everything they open.
> >
> > Modern Linux and BSDs have the close_range() call we can use to do the
> > job, and on Linux we have one more way to walk through /proc/self/fd
> > to complete the task efficiently, which is what qemu_close_range()
> > does, a new API we add in util/osdep.c.
> >
> > V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
> >
> > Changes in v4:
> > - add 'first > last' check logic
> > - reorder the ifdefs logic
> > - change i to unsigned int type
> > - use qemu_strtoi() instead of atoi()
> > - limit last upper value to sysconf(_SC_OPEN_MAX) - 1
> > - call sysconf directly instead of using a variable
> > - put fd on its own line
> >
> > Changes in v3:
> > - fix win32 build failure
> > - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)
> >
> > Changes in v2:
> > - new patch: "tests/tcg/cris: Fix the coding style"
> > - new patch: "tests/tcg/cris: Correct the off-by-one error"
> > - new patch: "util/async-teardown: Fall back to close fds one by one"
> > - new patch: "util/osdep: Introduce qemu_close_range()"
> > - new patch: "util/async-teardown: Use qemu_close_range() to close fds"
> > - Change to use qemu_close_range() to close fds for child process efficiently
> > - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
> >
> > Bin Meng (4):
> >   tests/tcg/cris: Fix the coding style
> >   tests/tcg/cris: Correct the off-by-one error
> >   util/async-teardown: Fall back to close fds one by one
> >   util/osdep: Introduce qemu_close_range()
> >
> > Zhangjin Wu (2):
> >   util/async-teardown: Use qemu_close_range() to close fds
> >   net: tap: Use qemu_close_range() to close fds
> >
>
> Ping for 8.1?

Queued.

Thanks



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
  2023-07-10  3:05   ` Jason Wang
@ 2023-07-10  6:07     ` Markus Armbruster
  2023-07-11  2:40       ` Jason Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Markus Armbruster @ 2023-07-10  6:07 UTC (permalink / raw)
  To: Jason Wang
  Cc: Bin Meng, Bin Meng, qemu-devel, Richard Henderson, Zhangjin Wu,
	Claudio Imbrenda, Daniel P. Berrangé, Edgar E. Iglesias,
	Kevin Wolf, Marc-André Lureau, Michael S. Tsirkin,
	Paolo Bonzini, Philippe Mathieu-Daudé, Thomas Huth,
	Xuzhou Cheng

Jason Wang <jasowang@redhat.com> writes:

> On Sun, Jul 9, 2023 at 11:48 PM Bin Meng <bmeng.cn@gmail.com> wrote:
>>
>> On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote:
>> >
>> >
>> > Current codes using a brute-force traversal of all file descriptors
>> > do not scale on a system where the maximum number of file descriptors
>> > is set to a very large value (e.g.: in a Docker container of Manjaro
>> > distribution it is set to 1073741816). QEMU just looks frozen during
>> > start-up.
>> >
>> > The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
>> > 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
>> > doesn't need to manually close the fds for child process as the proper
>> > O_CLOEXEC flag should have been set properly on files with its own
>> > codes, QEMU uses a huge number of 3rd party libraries and we don't
>> > trust them to reliably be using O_CLOEXEC on everything they open.
>> >
>> > Modern Linux and BSDs have the close_range() call we can use to do the
>> > job, and on Linux we have one more way to walk through /proc/self/fd
>> > to complete the task efficiently, which is what qemu_close_range()
>> > does, a new API we add in util/osdep.c.
>> >
>> > V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
>> >
>> > Changes in v4:
>> > - add 'first > last' check logic
>> > - reorder the ifdefs logic
>> > - change i to unsigned int type
>> > - use qemu_strtoi() instead of atoi()
>> > - limit last upper value to sysconf(_SC_OPEN_MAX) - 1
>> > - call sysconf directly instead of using a variable
>> > - put fd on its own line
>> >
>> > Changes in v3:
>> > - fix win32 build failure
>> > - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)
>> >
>> > Changes in v2:
>> > - new patch: "tests/tcg/cris: Fix the coding style"
>> > - new patch: "tests/tcg/cris: Correct the off-by-one error"
>> > - new patch: "util/async-teardown: Fall back to close fds one by one"
>> > - new patch: "util/osdep: Introduce qemu_close_range()"
>> > - new patch: "util/async-teardown: Use qemu_close_range() to close fds"
>> > - Change to use qemu_close_range() to close fds for child process efficiently
>> > - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
>> >
>> > Bin Meng (4):
>> >   tests/tcg/cris: Fix the coding style
>> >   tests/tcg/cris: Correct the off-by-one error
>> >   util/async-teardown: Fall back to close fds one by one
>> >   util/osdep: Introduce qemu_close_range()
>> >
>> > Zhangjin Wu (2):
>> >   util/async-teardown: Use qemu_close_range() to close fds
>> >   net: tap: Use qemu_close_range() to close fds
>> >
>>
>> Ping for 8.1?
>
> Queued.

There are review questions open on PATCH 4+5.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
  2023-07-10  6:07     ` Markus Armbruster
@ 2023-07-11  2:40       ` Jason Wang
  0 siblings, 0 replies; 16+ messages in thread
From: Jason Wang @ 2023-07-11  2:40 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Bin Meng, Bin Meng, qemu-devel, Richard Henderson, Zhangjin Wu,
	Claudio Imbrenda, Daniel P. Berrangé, Edgar E. Iglesias,
	Kevin Wolf, Marc-André Lureau, Michael S. Tsirkin,
	Paolo Bonzini, Philippe Mathieu-Daudé, Thomas Huth,
	Xuzhou Cheng

On Mon, Jul 10, 2023 at 2:07 PM Markus Armbruster <armbru@redhat.com> wrote:
>
> Jason Wang <jasowang@redhat.com> writes:
>
> > On Sun, Jul 9, 2023 at 11:48 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> >>
> >> On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote:
> >> >
> >> >
> >> > Current codes using a brute-force traversal of all file descriptors
> >> > do not scale on a system where the maximum number of file descriptors
> >> > is set to a very large value (e.g.: in a Docker container of Manjaro
> >> > distribution it is set to 1073741816). QEMU just looks frozen during
> >> > start-up.
> >> >
> >> > The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
> >> > 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
> >> > doesn't need to manually close the fds for child process as the proper
> >> > O_CLOEXEC flag should have been set properly on files with its own
> >> > codes, QEMU uses a huge number of 3rd party libraries and we don't
> >> > trust them to reliably be using O_CLOEXEC on everything they open.
> >> >
> >> > Modern Linux and BSDs have the close_range() call we can use to do the
> >> > job, and on Linux we have one more way to walk through /proc/self/fd
> >> > to complete the task efficiently, which is what qemu_close_range()
> >> > does, a new API we add in util/osdep.c.
> >> >
> >> > V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
> >> >
> >> > Changes in v4:
> >> > - add 'first > last' check logic
> >> > - reorder the ifdefs logic
> >> > - change i to unsigned int type
> >> > - use qemu_strtoi() instead of atoi()
> >> > - limit last upper value to sysconf(_SC_OPEN_MAX) - 1
> >> > - call sysconf directly instead of using a variable
> >> > - put fd on its own line
> >> >
> >> > Changes in v3:
> >> > - fix win32 build failure
> >> > - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)
> >> >
> >> > Changes in v2:
> >> > - new patch: "tests/tcg/cris: Fix the coding style"
> >> > - new patch: "tests/tcg/cris: Correct the off-by-one error"
> >> > - new patch: "util/async-teardown: Fall back to close fds one by one"
> >> > - new patch: "util/osdep: Introduce qemu_close_range()"
> >> > - new patch: "util/async-teardown: Use qemu_close_range() to close fds"
> >> > - Change to use qemu_close_range() to close fds for child process efficiently
> >> > - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
> >> >
> >> > Bin Meng (4):
> >> >   tests/tcg/cris: Fix the coding style
> >> >   tests/tcg/cris: Correct the off-by-one error
> >> >   util/async-teardown: Fall back to close fds one by one
> >> >   util/osdep: Introduce qemu_close_range()
> >> >
> >> > Zhangjin Wu (2):
> >> >   util/async-teardown: Use qemu_close_range() to close fds
> >> >   net: tap: Use qemu_close_range() to close fds
> >> >
> >>
> >> Ping for 8.1?
> >
> > Queued.
>
> There are review questions open on PATCH 4+5.

Sorry, I missed that. I've dropped them from the queue now.

Thanks

>



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-07-11  2:41 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-28 15:27 [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large Bin Meng
2023-06-28 15:27 ` [PATCH v4 1/6] tests/tcg/cris: Fix the coding style Bin Meng
2023-06-28 15:27 ` [PATCH v4 2/6] tests/tcg/cris: Correct the off-by-one error Bin Meng
2023-06-28 15:27 ` [PATCH v4 3/6] util/async-teardown: Fall back to close fds one by one Bin Meng
2023-06-28 15:27 ` [PATCH v4 4/6] util/osdep: Introduce qemu_close_range() Bin Meng
2023-07-07 14:40   ` Markus Armbruster
2023-06-28 15:27 ` [PATCH v4 5/6] util/async-teardown: Use qemu_close_range() to close fds Bin Meng
2023-07-07 14:40   ` Markus Armbruster
2023-06-28 15:27 ` [PATCH v4 6/6] net: tap: " Bin Meng
2023-06-30  4:52   ` Jason Wang
2023-06-29  8:33 ` [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large Michael Tokarev
2023-06-29  9:05   ` Daniel P. Berrangé
2023-07-09 15:47 ` Bin Meng
2023-07-10  3:05   ` Jason Wang
2023-07-10  6:07     ` Markus Armbruster
2023-07-11  2:40       ` Jason Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).