* [PATCH v2 2/5] docs/cgroup: add entry for cgroup.kill
[not found] ` <20210503143922.3093755-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2021-05-03 14:39 ` Christian Brauner
[not found] ` <20210503143922.3093755-2-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2021-05-03 14:39 ` [PATCH v2 3/5] tests/cgroup: use cgroup.kill in cg_killall() Christian Brauner
` (7 subsequent siblings)
8 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2021-05-03 14:39 UTC (permalink / raw)
To: Tejun Heo, Roman Gushchin
Cc: Shakeel Butt, Zefan Li, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs/YUNznpcFYbw,
Christian Brauner
From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
Give a brief overview of the cgroup.kill functionality.
Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
---
/* v2 */
- Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>:
- Drop sentence that mentions combined usage of cgroup.kill and
cgroup.freezer.
---
Documentation/admin-guide/cgroup-v2.rst | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 64c62b979f2f..6adc749d863e 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -949,6 +949,21 @@ All cgroup core files are prefixed with "cgroup."
it's possible to delete a frozen (and empty) cgroup, as well as
create new sub-cgroups.
+ cgroup.kill
+ A write-only single value file which exists in non-root cgroups.
+ The only allowed value is "1".
+
+ Writing "1" to the file causes the cgroup and all descendant cgroups to
+ be killed. This means that all processes located in the affected cgroup
+ tree will be killed via SIGKILL.
+
+ Killing a cgroup tree will deal with concurrent forks appropriately and
+ is protected against migrations.
+
+ In a threaded cgroup, writing this file fails with EOPNOTSUPP as
+ killing cgroups is a process directed operation, i.e. it affects
+ the whole thread-group.
+
Controllers
===========
--
2.27.0
^ permalink raw reply related [flat|nested] 18+ messages in thread

* [PATCH v2 3/5] tests/cgroup: use cgroup.kill in cg_killall()
[not found] ` <20210503143922.3093755-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2021-05-03 14:39 ` [PATCH v2 2/5] docs/cgroup: add entry for cgroup.kill Christian Brauner
@ 2021-05-03 14:39 ` Christian Brauner
[not found] ` <20210503143922.3093755-3-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2021-05-03 14:39 ` [PATCH v2 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait() Christian Brauner
` (6 subsequent siblings)
8 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2021-05-03 14:39 UTC (permalink / raw)
To: Tejun Heo, Roman Gushchin
Cc: Shakeel Butt, Zefan Li, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs/YUNznpcFYbw,
Christian Brauner
From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
If the cgroup.kill file is supported, make use of it.
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
---
/* v2 */
- Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>:
- Fix whitespace.
---
tools/testing/selftests/cgroup/cgroup_util.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index 027014662fb2..f60f7d764690 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -252,6 +252,10 @@ int cg_killall(const char *cgroup)
char buf[PAGE_SIZE];
char *ptr = buf;
+ /* If cgroup.kill exists use it. */
+ if (!cg_write(cgroup, "cgroup.kill", "1"))
+ return 0;
+
if (cg_read(cgroup, "cgroup.procs", buf, sizeof(buf)))
return -1;
--
2.27.0
^ permalink raw reply related [flat|nested] 18+ messages in thread

* [PATCH v2 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait()
[not found] ` <20210503143922.3093755-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2021-05-03 14:39 ` [PATCH v2 2/5] docs/cgroup: add entry for cgroup.kill Christian Brauner
2021-05-03 14:39 ` [PATCH v2 3/5] tests/cgroup: use cgroup.kill in cg_killall() Christian Brauner
@ 2021-05-03 14:39 ` Christian Brauner
[not found] ` <20210503143922.3093755-4-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2021-05-03 14:39 ` [PATCH v2 5/5] tests/cgroup: test cgroup.kill Christian Brauner
` (5 subsequent siblings)
8 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2021-05-03 14:39 UTC (permalink / raw)
To: Tejun Heo, Roman Gushchin
Cc: Shakeel Butt, Zefan Li, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs/YUNznpcFYbw,
Christian Brauner
From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
as they will be used by the tests for cgroup killing.
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
---
/* v2 */
unchanged
---
tools/testing/selftests/cgroup/cgroup_util.c | 47 +++++++++++++++
tools/testing/selftests/cgroup/cgroup_util.h | 2 +
tools/testing/selftests/cgroup/test_freezer.c | 57 -------------------
3 files changed, 49 insertions(+), 57 deletions(-)
diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index f60f7d764690..623cec04ad42 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -5,10 +5,12 @@
#include <errno.h>
#include <fcntl.h>
#include <linux/limits.h>
+#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
+#include <sys/inotify.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
@@ -580,3 +582,48 @@ int clone_into_cgroup_run_wait(const char *cgroup)
(void)clone_reap(pid, WEXITED);
return 0;
}
+
+int cg_prepare_for_wait(const char *cgroup)
+{
+ int fd, ret = -1;
+
+ fd = inotify_init1(0);
+ if (fd == -1)
+ return fd;
+
+ ret = inotify_add_watch(fd, cg_control(cgroup, "cgroup.events"),
+ IN_MODIFY);
+ if (ret == -1) {
+ close(fd);
+ fd = -1;
+ }
+
+ return fd;
+}
+
+int cg_wait_for(int fd)
+{
+ int ret = -1;
+ struct pollfd fds = {
+ .fd = fd,
+ .events = POLLIN,
+ };
+
+ while (true) {
+ ret = poll(&fds, 1, 10000);
+
+ if (ret == -1) {
+ if (errno == EINTR)
+ continue;
+
+ break;
+ }
+
+ if (ret > 0 && fds.revents & POLLIN) {
+ ret = 0;
+ break;
+ }
+ }
+
+ return ret;
+}
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h
index 5a1305dd1f0b..82e59cdf16e7 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -54,3 +54,5 @@ extern pid_t clone_into_cgroup(int cgroup_fd);
extern int clone_reap(pid_t pid, int options);
extern int clone_into_cgroup_run_wait(const char *cgroup);
extern int dirfd_open_opath(const char *dir);
+extern int cg_prepare_for_wait(const char *cgroup);
+extern int cg_wait_for(int fd);
diff --git a/tools/testing/selftests/cgroup/test_freezer.c b/tools/testing/selftests/cgroup/test_freezer.c
index 23d8fa4a3e4e..ff519029f6f4 100644
--- a/tools/testing/selftests/cgroup/test_freezer.c
+++ b/tools/testing/selftests/cgroup/test_freezer.c
@@ -7,9 +7,7 @@
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
-#include <poll.h>
#include <stdlib.h>
-#include <sys/inotify.h>
#include <string.h>
#include <sys/wait.h>
@@ -54,61 +52,6 @@ static int cg_freeze_nowait(const char *cgroup, bool freeze)
return cg_write(cgroup, "cgroup.freeze", freeze ? "1" : "0");
}
-/*
- * Prepare for waiting on cgroup.events file.
- */
-static int cg_prepare_for_wait(const char *cgroup)
-{
- int fd, ret = -1;
-
- fd = inotify_init1(0);
- if (fd == -1) {
- debug("Error: inotify_init1() failed\n");
- return fd;
- }
-
- ret = inotify_add_watch(fd, cg_control(cgroup, "cgroup.events"),
- IN_MODIFY);
- if (ret == -1) {
- debug("Error: inotify_add_watch() failed\n");
- close(fd);
- fd = -1;
- }
-
- return fd;
-}
-
-/*
- * Wait for an event. If there are no events for 10 seconds,
- * treat this an error.
- */
-static int cg_wait_for(int fd)
-{
- int ret = -1;
- struct pollfd fds = {
- .fd = fd,
- .events = POLLIN,
- };
-
- while (true) {
- ret = poll(&fds, 1, 10000);
-
- if (ret == -1) {
- if (errno == EINTR)
- continue;
- debug("Error: poll() failed\n");
- break;
- }
-
- if (ret > 0 && fds.revents & POLLIN) {
- ret = 0;
- break;
- }
- }
-
- return ret;
-}
-
/*
* Attach a task to the given cgroup and wait for a cgroup frozen event.
* All transient events (e.g. populated) are ignored.
--
2.27.0
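The moved helpers combine two primitives: an inotify watch for IN_MODIFY on cgroup.events, and a poll() loop with a timeout that retries on EINTR. The same pattern works on any file; a standalone sketch under that assumption (the 10-second timeout is the selftests' choice, parameterized here):

```c
#include <errno.h>
#include <poll.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Arm an IN_MODIFY watch on @file; returns the inotify fd or -1. */
static int prepare_wait(const char *file)
{
	int fd = inotify_init1(IN_CLOEXEC);

	if (fd < 0)
		return -1;

	if (inotify_add_watch(fd, file, IN_MODIFY) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Wait up to @timeout_ms for a modification event, retrying poll()
 * on EINTR.  Returns 0 on event, -1 on error or timeout. */
static int wait_for_event(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	for (;;) {
		int ret = poll(&pfd, 1, timeout_ms);

		if (ret < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (ret == 0)
			return -1;		/* timed out */
		if (pfd.revents & POLLIN)
			return 0;		/* file was modified */
	}
}
```

In the selftests the watched file is cgroup.events, so a populated or frozen transition shows up as an IN_MODIFY event and wakes the poll().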
^ permalink raw reply related [flat|nested] 18+ messages in thread

* [PATCH v2 5/5] tests/cgroup: test cgroup.kill
[not found] ` <20210503143922.3093755-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
` (2 preceding siblings ...)
2021-05-03 14:39 ` [PATCH v2 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait() Christian Brauner
@ 2021-05-03 14:39 ` Christian Brauner
[not found] ` <20210503143922.3093755-5-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2021-05-03 17:18 ` [PATCH v2 1/5] cgroup: introduce cgroup.kill Shakeel Butt
` (4 subsequent siblings)
8 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2021-05-03 14:39 UTC (permalink / raw)
To: Tejun Heo, Roman Gushchin
Cc: Shakeel Butt, Zefan Li, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs/YUNznpcFYbw,
Christian Brauner
From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
Test that the new cgroup.kill feature works as intended.
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
---
/* v2 */
- Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>:
- Remove debug logging (It wasn't very helpful in the first place.).
---
tools/testing/selftests/cgroup/.gitignore | 3 +-
tools/testing/selftests/cgroup/Makefile | 2 +
tools/testing/selftests/cgroup/test_kill.c | 291 +++++++++++++++++++++
3 files changed, 295 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/cgroup/test_kill.c
diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
index 84cfcabea838..be9643ef6285 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -2,4 +2,5 @@
test_memcontrol
test_core
test_freezer
-test_kmem
\ No newline at end of file
+test_kmem
+test_kill
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index f027d933595b..59e222460581 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -9,6 +9,7 @@ TEST_GEN_PROGS = test_memcontrol
TEST_GEN_PROGS += test_kmem
TEST_GEN_PROGS += test_core
TEST_GEN_PROGS += test_freezer
+TEST_GEN_PROGS += test_kill
include ../lib.mk
@@ -16,3 +17,4 @@ $(OUTPUT)/test_memcontrol: cgroup_util.c ../clone3/clone3_selftests.h
$(OUTPUT)/test_kmem: cgroup_util.c ../clone3/clone3_selftests.h
$(OUTPUT)/test_core: cgroup_util.c ../clone3/clone3_selftests.h
$(OUTPUT)/test_freezer: cgroup_util.c ../clone3/clone3_selftests.h
+$(OUTPUT)/test_kill: cgroup_util.c ../clone3/clone3_selftests.h ../pidfd/pidfd.h
diff --git a/tools/testing/selftests/cgroup/test_kill.c b/tools/testing/selftests/cgroup/test_kill.c
new file mode 100644
index 000000000000..3e1132deed33
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_kill.c
@@ -0,0 +1,291 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <errno.h>
+#include <linux/limits.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "../pidfd/pidfd.h"
+#include "cgroup_util.h"
+
+/*
+ * Kill the given cgroup and wait for the inotify signal.
+ * If there are no events in 10 seconds, treat this as an error.
+ * Then check that the cgroup is in the desired state.
+ */
+static int cg_kill_wait(const char *cgroup)
+{
+ int fd, ret = -1;
+
+ fd = cg_prepare_for_wait(cgroup);
+ if (fd < 0)
+ return fd;
+
+ ret = cg_write(cgroup, "cgroup.kill", "1");
+ if (ret)
+ goto out;
+
+ ret = cg_wait_for(fd);
+ if (ret)
+ goto out;
+
+ ret = cg_read_strcmp(cgroup, "cgroup.events", "populated 0\n");
+out:
+ close(fd);
+ return ret;
+}
+
+/*
+ * A simple process running in a sleep loop until being
+ * re-parented.
+ */
+static int child_fn(const char *cgroup, void *arg)
+{
+ int ppid = getppid();
+
+ while (getppid() == ppid)
+ usleep(1000);
+
+ return getppid() == ppid;
+}
+
+static int test_cgkill_simple(const char *root)
+{
+ pid_t pids[100];
+ int ret = KSFT_FAIL;
+ char *cgroup = NULL;
+ int i;
+
+ cgroup = cg_name(root, "cg_test_simple");
+ if (!cgroup)
+ goto cleanup;
+
+ if (cg_create(cgroup))
+ goto cleanup;
+
+ for (i = 0; i < 100; i++)
+ pids[i] = cg_run_nowait(cgroup, child_fn, NULL);
+
+ if (cg_wait_for_proc_count(cgroup, 100))
+ goto cleanup;
+
+ if (cg_write(cgroup, "cgroup.kill", "1"))
+ goto cleanup;
+
+ if (cg_read_strcmp(cgroup, "cgroup.events", "populated 1\n"))
+ goto cleanup;
+
+ if (cg_kill_wait(cgroup))
+ goto cleanup;
+
+ if (cg_read_strcmp(cgroup, "cgroup.events", "populated 0\n"))
+ goto cleanup;
+
+ ret = KSFT_PASS;
+
+cleanup:
+ for (i = 0; i < 100; i++)
+ wait_for_pid(pids[i]);
+
+ if (cgroup)
+ cg_destroy(cgroup);
+ free(cgroup);
+ return ret;
+}
+
+/*
+ * The test creates the following hierarchy:
+ * A
+ * / / \ \
+ * B E I K
+ * /\ |
+ * C D F
+ * |
+ * G
+ * |
+ * H
+ *
+ * with a process in C, H and 3 processes in K.
+ * Then it tries to kill the whole tree.
+ */
+static int test_cgkill_tree(const char *root)
+{
+ pid_t pids[5];
+ char *cgroup[10] = {0};
+ int ret = KSFT_FAIL;
+ int i;
+
+ cgroup[0] = cg_name(root, "cg_test_tree_A");
+ if (!cgroup[0])
+ goto cleanup;
+
+ cgroup[1] = cg_name(cgroup[0], "B");
+ if (!cgroup[1])
+ goto cleanup;
+
+ cgroup[2] = cg_name(cgroup[1], "C");
+ if (!cgroup[2])
+ goto cleanup;
+
+ cgroup[3] = cg_name(cgroup[1], "D");
+ if (!cgroup[3])
+ goto cleanup;
+
+ cgroup[4] = cg_name(cgroup[0], "E");
+ if (!cgroup[4])
+ goto cleanup;
+
+ cgroup[5] = cg_name(cgroup[4], "F");
+ if (!cgroup[5])
+ goto cleanup;
+
+ cgroup[6] = cg_name(cgroup[5], "G");
+ if (!cgroup[6])
+ goto cleanup;
+
+ cgroup[7] = cg_name(cgroup[6], "H");
+ if (!cgroup[7])
+ goto cleanup;
+
+ cgroup[8] = cg_name(cgroup[0], "I");
+ if (!cgroup[8])
+ goto cleanup;
+
+ cgroup[9] = cg_name(cgroup[0], "K");
+ if (!cgroup[9])
+ goto cleanup;
+
+ for (i = 0; i < 10; i++)
+ if (cg_create(cgroup[i]))
+ goto cleanup;
+
+ pids[0] = cg_run_nowait(cgroup[2], child_fn, NULL);
+ pids[1] = cg_run_nowait(cgroup[7], child_fn, NULL);
+ pids[2] = cg_run_nowait(cgroup[9], child_fn, NULL);
+ pids[3] = cg_run_nowait(cgroup[9], child_fn, NULL);
+ pids[4] = cg_run_nowait(cgroup[9], child_fn, NULL);
+
+ /*
+ * Wait until all child processes will enter
+ * corresponding cgroups.
+ */
+
+ if (cg_wait_for_proc_count(cgroup[2], 1) ||
+ cg_wait_for_proc_count(cgroup[7], 1) ||
+ cg_wait_for_proc_count(cgroup[9], 3))
+ goto cleanup;
+
+ /*
+ * Kill A and check that we get an empty notification.
+ */
+ if (cg_kill_wait(cgroup[0]))
+ goto cleanup;
+
+ ret = KSFT_PASS;
+
+cleanup:
+ for (i = 0; i < 5; i++)
+ wait_for_pid(pids[i]);
+
+ for (i = 9; i >= 0 && cgroup[i]; i--) {
+ cg_destroy(cgroup[i]);
+ free(cgroup[i]);
+ }
+
+ return ret;
+}
+
+static int forkbomb_fn(const char *cgroup, void *arg)
+{
+ int ppid;
+
+ fork();
+ fork();
+
+ ppid = getppid();
+
+ while (getppid() == ppid)
+ usleep(1000);
+
+ return getppid() == ppid;
+}
+
+/*
+ * The test runs a fork bomb in a cgroup and tries to kill it.
+ */
+static int test_cgkill_forkbomb(const char *root)
+{
+ int ret = KSFT_FAIL;
+ char *cgroup = NULL;
+ pid_t pid = -ESRCH;
+
+ cgroup = cg_name(root, "cg_forkbomb_test");
+ if (!cgroup)
+ goto cleanup;
+
+ if (cg_create(cgroup))
+ goto cleanup;
+
+ pid = cg_run_nowait(cgroup, forkbomb_fn, NULL);
+ if (pid < 0)
+ goto cleanup;
+
+ usleep(100000);
+
+ if (cg_kill_wait(cgroup))
+ goto cleanup;
+
+ if (cg_wait_for_proc_count(cgroup, 0))
+ goto cleanup;
+
+ ret = KSFT_PASS;
+
+cleanup:
+ if (cgroup)
+ cg_destroy(cgroup);
+ if (pid > 0)
+ wait_for_pid(pid);
+ free(cgroup);
+ return ret;
+}
+
+#define T(x) { x, #x }
+struct cgkill_test {
+ int (*fn)(const char *root);
+ const char *name;
+} tests[] = {
+ T(test_cgkill_simple),
+ T(test_cgkill_tree),
+ T(test_cgkill_forkbomb),
+};
+#undef T
+
+int main(int argc, char *argv[])
+{
+ char root[PATH_MAX];
+ int i, ret = EXIT_SUCCESS;
+
+ if (cg_find_unified_root(root, sizeof(root)))
+ ksft_exit_skip("cgroup v2 isn't mounted\n");
+ for (i = 0; i < ARRAY_SIZE(tests); i++) {
+ switch (tests[i].fn(root)) {
+ case KSFT_PASS:
+ ksft_test_result_pass("%s\n", tests[i].name);
+ break;
+ case KSFT_SKIP:
+ ksft_test_result_skip("%s\n", tests[i].name);
+ break;
+ default:
+ ret = EXIT_FAILURE;
+ ksft_test_result_fail("%s\n", tests[i].name);
+ break;
+ }
+ }
+
+ return ret;
+}
--
2.27.0
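The test's child_fn() detects that it survived a kill attempt by spinning until getppid() changes, i.e. until it is re-parented after its parent dies. That trick can be demonstrated outside the cgroup machinery: a sketch using PR_SET_CHILD_SUBREAPER so the top process can reap the orphaned grandchild directly (the pipe-based synchronization is an addition here to avoid a startup race, not something the selftest needs since its parent stays alive):

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a helper, which forks a grandchild running child_fn()'s loop;
 * when the helper exits, the grandchild is re-parented to us (the
 * subreaper), notices via getppid() and exits 0.  Returns 0 on the
 * expected sequence, -1 otherwise. */
static int run_reparent_demo(void)
{
	int sync_pipe[2], status;
	pid_t helper;
	char c;

	if (prctl(PR_SET_CHILD_SUBREAPER, 1) < 0)
		return -1;

	if (pipe(sync_pipe) < 0)
		return -1;

	helper = fork();
	if (helper < 0)
		return -1;

	if (helper == 0) {
		pid_t pid = fork();

		if (pid == 0) {
			pid_t ppid = getppid();

			/* Tell the helper its pid has been recorded... */
			if (write(sync_pipe[1], "r", 1) != 1)
				_exit(1);
			/* ...then spin until re-parented, exactly like
			 * the selftest's child_fn(). */
			while (getppid() == ppid)
				usleep(1000);
			_exit(0);
		}
		if (pid < 0)
			_exit(1);
		/* Don't orphan the grandchild before it is ready. */
		if (read(sync_pipe[0], &c, 1) != 1)
			_exit(1);
		_exit(0);
	}

	if (waitpid(helper, &status, 0) != helper || status != 0)
		return -1;

	/* Reap the re-parented grandchild. */
	if (waitpid(-1, &status, 0) < 0)
		return -1;

	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```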
^ permalink raw reply related [flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
[not found] ` <20210503143922.3093755-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
` (3 preceding siblings ...)
2021-05-03 14:39 ` [PATCH v2 5/5] tests/cgroup: test cgroup.kill Christian Brauner
@ 2021-05-03 17:18 ` Shakeel Butt
2021-05-04 1:47 ` Serge E. Hallyn
` (3 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Shakeel Butt @ 2021-05-03 17:18 UTC (permalink / raw)
To: Christian Brauner
Cc: Tejun Heo, Roman Gushchin, Zefan Li, Johannes Weiner, Cgroups,
containers-cunTk1MwBs/YUNznpcFYbw, Christian Brauner
On Mon, May 3, 2021 at 7:40 AM Christian Brauner <brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
>
> Introduce the cgroup.kill file. It does what it says on the tin and
> allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> The file is available in non-root cgroups.
>
> Killing cgroups is a process directed operation, i.e. the whole
> thread-group is affected. Consequently trying to write to cgroup.kill in
> threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> aligns with cgroup.procs where reads in threaded-cgroups are rejected
> with EOPNOTSUPP.
>
> The cgroup.kill file is write-only since killing a cgroup is an event,
> not a state, which makes it different from e.g. freezer where a cgroup
> transitions between two states.
>
> As with all new cgroup features cgroup.kill is recursive by default.
>
> Killing a cgroup is protected against concurrent migrations through the
> cgroup mutex. To protect against forkbombs and to mitigate the effect of
> racing forks a new CGRP_KILL css set lock protected flag is introduced
> that is set prior to killing a cgroup and unset after the cgroup has
> been killed. We can then check in cgroup_post_fork() where we hold the
> css set lock already whether the cgroup is currently being killed. If so
> we send the child a SIGKILL signal immediately taking it down as soon as
> it returns to userspace. To make the killing of the child semantically
> clean it is killed after all cgroup attachment operations have been
> finalized.
>
> There are various use-cases of this interface:
> - Containers usually have a conservative layout where each container
> usually has a delegated cgroup. For such layouts there is a 1:1
> mapping between container and cgroup. If the container in addition
> uses a separate pid namespace then killing a container usually becomes
> a simple kill -9 <container-init-pid> from an ancestor pid namespace.
> However, there are quite a few scenarios where that isn't true. For
> example, there are containers that share the cgroup with other
> processes on purpose that are supposed to be bound to the lifetime of
> the container but are not in the same pidns of the container.
> Containers that are in a delegated cgroup but share the pid namespace
> with the host or other containers.
> - Service managers such as systemd use cgroups to group and organize
> processes belonging to a service. They usually rely on a recursive
> algorithm now to kill a service. With cgroup.kill this becomes a
> simple write to cgroup.kill.
> - Userspace OOM implementations can make good use of this feature to
> efficiently take down whole cgroups quickly.
> - The kill program can gain a new
> kill --cgroup /sys/fs/cgroup/delegated
> flag to take down cgroups.
>
> A few observations about the semantics:
> - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
> not specified we are not taking cgroup mutex meaning the cgroup can be
> killed while a process in that cgroup is forking.
> If the kill request happens right before cgroup_can_fork() and before
> the parent grabs its siglock the parent is guaranteed to see the
> pending SIGKILL. In addition we perform another check in
> cgroup_post_fork() whether the cgroup is being killed and if so take
> down the child (see above). This is robust enough and protects against
> forkbombs. If userspace really really wants to have stricter
> protection the simple solution would be to grab the write side of the
> cgroup threadgroup rwsem which will force all ongoing forks to
> complete before killing starts. We concluded that this is not
> necessary as the semantics for concurrent forking should simply align
> with freezer where a similar check as cgroup_post_fork() is performed.
>
> For all other cases CLONE_INTO_CGROUP is required. In this case we
> will grab the cgroup mutex so the cgroup can't be killed while we
> fork. Once we're done with the fork and have dropped cgroup mutex we
> are visible and will be found by any subsequent kill request.
> - We obviously don't kill kthreads. This means a cgroup that has a
> kthread will not become empty after killing and consequently no
> unpopulated event will be generated. The assumption is that kthreads
> should be in the root cgroup only anyway so this is not an issue.
> - We skip killing tasks that already have pending fatal signals.
> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
> you have two tasks in different pid namespaces the cgroup would still
> be frozen. The cgroup.kill mechanism consequently behaves the same
> way, i.e. we kill all processes and ignore in which pid namespace they
> exist.
> - If the caller is located in a cgroup that is killed the caller will
> obviously be killed as well.
>
> Cc: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
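The commit message above pins down the write-side semantics: only the literal value "1" is accepted (after whitespace stripping), anything else non-numeric fails like kstrtoint() would, and other integers get -ERANGE. A userspace model of that validation logic (function name and buffer size are illustrative; the kernel does this with strstrip() and kstrtoint()):

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Model of cgroup_kill_write()'s input handling: strip surrounding
 * whitespace, parse as an integer, accept only the value 1.
 * Returns 0, -EINVAL (not a number), or -ERANGE (wrong value). */
static int parse_kill_value(const char *buf)
{
	char tmp[32];
	size_t len;
	char *end;
	long val;

	/* Strip leading/trailing whitespace, as strstrip() does. */
	while (isspace((unsigned char)*buf))
		buf++;
	len = strlen(buf);
	while (len && isspace((unsigned char)buf[len - 1]))
		len--;
	if (!len || len >= sizeof(tmp))
		return -EINVAL;
	memcpy(tmp, buf, len);
	tmp[len] = '\0';

	errno = 0;
	val = strtol(tmp, &end, 0);
	if (errno || *end)
		return -EINVAL;		/* kstrtoint() would reject this */
	if (val != 1)
		return -ERANGE;		/* only "1" is allowed */
	return 0;
}
```

The threaded-cgroup check (EOPNOTSUPP) happens after this parse in the kernel, once the live cgroup has been locked.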
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
[not found] ` <20210503143922.3093755-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
` (4 preceding siblings ...)
2021-05-03 17:18 ` [PATCH v2 1/5] cgroup: introduce cgroup.kill Shakeel Butt
@ 2021-05-04 1:47 ` Serge E. Hallyn
2021-05-04 18:29 ` Shakeel Butt
` (2 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Serge E. Hallyn @ 2021-05-04 1:47 UTC (permalink / raw)
To: Christian Brauner
Cc: Tejun Heo, Roman Gushchin, Shakeel Butt, Zefan Li,
Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
containers-cunTk1MwBs/YUNznpcFYbw, Christian Brauner
On Mon, May 03, 2021 at 04:39:19PM +0200, Christian Brauner wrote:
> From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
>
> Introduce the cgroup.kill file. It does what it says on the tin and
> allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> The file is available in non-root cgroups.
>
> Killing cgroups is a process directed operation, i.e. the whole
> thread-group is affected. Consequently trying to write to cgroup.kill in
> threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> aligns with cgroup.procs where reads in threaded-cgroups are rejected
> with EOPNOTSUPP.
>
> The cgroup.kill file is write-only since killing a cgroup is an event,
> not a state, which makes it different from e.g. freezer where a cgroup
> transitions between two states.
>
> As with all new cgroup features cgroup.kill is recursive by default.
>
> Killing a cgroup is protected against concurrent migrations through the
> cgroup mutex. To protect against forkbombs and to mitigate the effect of
> racing forks a new CGRP_KILL css set lock protected flag is introduced
> that is set prior to killing a cgroup and unset after the cgroup has
> been killed. We can then check in cgroup_post_fork() where we hold the
> css set lock already whether the cgroup is currently being killed. If so
> we send the child a SIGKILL signal immediately taking it down as soon as
> it returns to userspace. To make the killing of the child semantically
> clean it is killed after all cgroup attachment operations have been
> finalized.
>
> There are various use-cases of this interface:
> - Containers usually have a conservative layout where each container
> usually has a delegated cgroup. For such layouts there is a 1:1
> mapping between container and cgroup. If the container in addition
> uses a separate pid namespace then killing a container usually becomes
> a simple kill -9 <container-init-pid> from an ancestor pid namespace.
> However, there are quite a few scenarios where that isn't true. For
> example, there are containers that share the cgroup with other
> processes on purpose that are supposed to be bound to the lifetime of
> the container but are not in the same pidns of the container.
> Containers that are in a delegated cgroup but share the pid namespace
> with the host or other containers.
> - Service managers such as systemd use cgroups to group and organize
> processes belonging to a service. They usually rely on a recursive
> algorithm now to kill a service. With cgroup.kill this becomes a
> simple write to cgroup.kill.
> - Userspace OOM implementations can make good use of this feature to
> efficiently take down whole cgroups quickly.
> - The kill program can gain a new
> kill --cgroup /sys/fs/cgroup/delegated
> flag to take down cgroups.
>
> A few observations about the semantics:
> - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
> not specified we are not taking cgroup mutex meaning the cgroup can be
> killed while a process in that cgroup is forking.
> If the kill request happens right before cgroup_can_fork() and before
> the parent grabs its siglock the parent is guaranteed to see the
> pending SIGKILL. In addition we perform another check in
> cgroup_post_fork() whether the cgroup is being killed and if so take
> down the child (see above). This is robust enough and protects against
> forkbombs. If userspace really really wants to have stricter
> protection the simple solution would be to grab the write side of the
> cgroup threadgroup rwsem which will force all ongoing forks to
> complete before killing starts. We concluded that this is not
> necessary as the semantics for concurrent forking should simply align
> with freezer where a similar check as cgroup_post_fork() is performed.
>
> For all other cases CLONE_INTO_CGROUP is required. In this case we
> will grab the cgroup mutex so the cgroup can't be killed while we
> fork. Once we're done with the fork and have dropped cgroup mutex we
> are visible and will be found by any subsequent kill request.
> - We obviously don't kill kthreads. This means a cgroup that has a
> kthread will not become empty after killing and consequently no
> unpopulated event will be generated. The assumption is that kthreads
> should be in the root cgroup only anyway so this is not an issue.
> - We skip killing tasks that already have pending fatal signals.
> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
> you have two tasks in different pid namespaces the cgroup would still
> be frozen. The cgroup.kill mechanism consequently behaves the same
> way, i.e. we kill all processes and ignore in which pid namespace they
> exist.
> - If the caller is located in a cgroup that is killed the caller will
> obviously be killed as well.
>
> Cc: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
> ---
>
> The series can be pulled from
>
> git-OoYKEaZ2EDaWaY/ihj7yzEB+6BGkLq7r@public.gmane.org:pub/scm/linux/kernel/git/brauner/linux tags/cgroup.kill.v5.14
>
> /* v2 */
> - Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>:
> - Retrieve cgrp->flags only once and check CGRP_* bits on it.
> ---
> include/linux/cgroup-defs.h | 3 +
> kernel/cgroup/cgroup.c | 127 ++++++++++++++++++++++++++++++++----
> 2 files changed, 116 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 559ee05f86b2..43fef771009a 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -71,6 +71,9 @@ enum {
>
> /* Cgroup is frozen. */
> CGRP_FROZEN,
> +
> + /* Control group has to be killed. */
> + CGRP_KILL,
> };
>
> /* cgroup_root->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 9153b20e5cc6..aee84b99534a 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +static void __cgroup_kill(struct cgroup *cgrp)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> +
> + lockdep_assert_held(&cgroup_mutex);
> +
> + spin_lock_irq(&css_set_lock);
> + set_bit(CGRP_KILL, &cgrp->flags);
> + spin_unlock_irq(&css_set_lock);
> +
> + css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
> + while ((task = css_task_iter_next(&it))) {
> + /* Ignore kernel threads here. */
> + if (task->flags & PF_KTHREAD)
> + continue;
> +
> + /* Skip tasks that are already dying. */
> + if (__fatal_signal_pending(task))
> + continue;
> +
> + send_sig(SIGKILL, task, 0);
> + }
> + css_task_iter_end(&it);
> +
> + spin_lock_irq(&css_set_lock);
> + clear_bit(CGRP_KILL, &cgrp->flags);
> + spin_unlock_irq(&css_set_lock);
> +}
> +
> +static void cgroup_kill(struct cgroup *cgrp)
> +{
> + struct cgroup_subsys_state *css;
> + struct cgroup *dsct;
> +
> + lockdep_assert_held(&cgroup_mutex);
> +
> + cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
> + __cgroup_kill(dsct);
> +}
> +
> +static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
> + size_t nbytes, loff_t off)
> +{
> + ssize_t ret = 0;
> + int kill;
> + struct cgroup *cgrp;
> +
> + ret = kstrtoint(strstrip(buf), 0, &kill);
> + if (ret)
> + return ret;
> +
> + if (kill != 1)
> + return -ERANGE;
> +
> + cgrp = cgroup_kn_lock_live(of->kn, false);
> + if (!cgrp)
> + return -ENOENT;
> +
> + /*
> + * Killing is a process directed operation, i.e. the whole thread-group
> + * is taken down so act like we do for cgroup.procs and only make this
> + * writable in non-threaded cgroups.
> + */
> + if (cgroup_is_threaded(cgrp))
> + ret = -EOPNOTSUPP;
> + else
> + cgroup_kill(cgrp);
> +
> + cgroup_kn_unlock(of->kn);
> +
> + return ret ?: nbytes;
> +}
> +
> static int cgroup_file_open(struct kernfs_open_file *of)
> {
> struct cftype *cft = of_cft(of);
> @@ -4846,6 +4920,11 @@ static struct cftype cgroup_base_files[] = {
> .seq_show = cgroup_freeze_show,
> .write = cgroup_freeze_write,
> },
> + {
> + .name = "cgroup.kill",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .write = cgroup_kill_write,
> + },
> {
> .name = "cpu.stat",
> .seq_show = cpu_stat_show,
> @@ -6077,6 +6156,8 @@ void cgroup_post_fork(struct task_struct *child,
> struct kernel_clone_args *kargs)
> __releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
> {
> + unsigned long cgrp_flags = 0;
> + bool kill = false;
> struct cgroup_subsys *ss;
> struct css_set *cset;
> int i;
> @@ -6088,6 +6169,11 @@ void cgroup_post_fork(struct task_struct *child,
>
> /* init tasks are special, only link regular threads */
> if (likely(child->pid)) {
> + if (kargs->cgrp)
> + cgrp_flags = kargs->cgrp->flags;
> + else
> + cgrp_flags = cset->dfl_cgrp->flags;
> +
> WARN_ON_ONCE(!list_empty(&child->cg_list));
> cset->nr_tasks++;
> css_set_move_task(child, NULL, cset, false);
> @@ -6096,23 +6182,32 @@ void cgroup_post_fork(struct task_struct *child,
> cset = NULL;
> }
>
> - /*
> - * If the cgroup has to be frozen, the new task has too. Let's set
> - * the JOBCTL_TRAP_FREEZE jobctl bit to get the task into the
> - * frozen state.
> - */
> - if (unlikely(cgroup_task_freeze(child))) {
> - spin_lock(&child->sighand->siglock);
> - WARN_ON_ONCE(child->frozen);
> - child->jobctl |= JOBCTL_TRAP_FREEZE;
> - spin_unlock(&child->sighand->siglock);
> + if (!(child->flags & PF_KTHREAD)) {
> + if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
> + /*
> + * If the cgroup has to be frozen, the new task has
> + * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
> + * get the task into the frozen state.
> + */
> + spin_lock(&child->sighand->siglock);
> + WARN_ON_ONCE(child->frozen);
> + child->jobctl |= JOBCTL_TRAP_FREEZE;
> + spin_unlock(&child->sighand->siglock);
> +
> + /*
> + * Calling cgroup_update_frozen() isn't required here,
> + * because it will be called anyway a bit later from
> + * do_freezer_trap(). So we avoid cgroup's transient
> + * switch from the frozen state and back.
> + */
> + }
>
> /*
> - * Calling cgroup_update_frozen() isn't required here,
> - * because it will be called anyway a bit later from
> - * do_freezer_trap(). So we avoid cgroup's transient switch
> - * from the frozen state and back.
> + * If the cgroup is to be killed notice it now and take the
> + * child down right after we finished preparing it for
> + * userspace.
> */
> + kill = test_bit(CGRP_KILL, &cgrp_flags);
> }
>
> spin_unlock_irq(&css_set_lock);
> @@ -6135,6 +6230,10 @@ void cgroup_post_fork(struct task_struct *child,
> put_css_set(rcset);
> }
>
> + /* Cgroup has to be killed so take down child immediately. */
> + if (kill)
> + send_sig(SIGKILL, child, 0);
> +
> cgroup_css_set_put_fork(kargs);
> }
>
>
> base-commit: 9f4ad9e425a1d3b6a34617b8ea226d56a119a717
> --
> 2.27.0
>
^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
[not found] ` <20210503143922.3093755-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
` (5 preceding siblings ...)
2021-05-04 1:47 ` Serge E. Hallyn
@ 2021-05-04 18:29 ` Shakeel Butt
2021-05-05 17:57 ` Roman Gushchin
2021-05-05 18:31 ` Eric W. Biederman
8 siblings, 0 replies; 18+ messages in thread
From: Shakeel Butt @ 2021-05-04 18:29 UTC (permalink / raw)
To: Christian Brauner, Michal Hocko
Cc: Tejun Heo, Roman Gushchin, Zefan Li, Johannes Weiner, Cgroups,
containers-cunTk1MwBs/YUNznpcFYbw, Christian Brauner, Linux MM
+Michal Hocko
On Mon, May 3, 2021 at 7:40 AM Christian Brauner <brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
>
> Introduce the cgroup.kill file. It does what it says on the tin and
> allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> The file is available in non-root cgroups.
>
> Killing cgroups is a process directed operation, i.e. the whole
> thread-group is affected. Consequently trying to write to cgroup.kill in
> threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> aligns with cgroup.procs where reads in threaded-cgroups are rejected
> with EOPNOTSUPP.
>
> The cgroup.kill file is write-only since killing a cgroup is an
> event, not a state, which makes it different from e.g. the freezer
> where a cgroup transitions between the two states.
>
> As with all new cgroup features cgroup.kill is recursive by default.
>
> Killing a cgroup is protected against concurrent migrations through the
> cgroup mutex. To protect against forkbombs and to mitigate the effect of
> racing forks a new CGRP_KILL css set lock protected flag is introduced
> that is set prior to killing a cgroup and unset after the cgroup has
> been killed. We can then check in cgroup_post_fork() where we hold the
> css set lock already whether the cgroup is currently being killed. If so
> we send the child a SIGKILL signal immediately taking it down as soon as
> it returns to userspace. To make the killing of the child semantically
> clean it is killed after all cgroup attachment operations have been
> finalized.
>
> There are various use-cases of this interface:
> - Containers usually have a conservative layout where each container
> usually has a delegated cgroup. For such layouts there is a 1:1
> mapping between container and cgroup. If the container in addition
> uses a separate pid namespace then killing a container usually becomes
> a simple kill -9 <container-init-pid> from an ancestor pid namespace.
> However, there are quite a few scenarios where that isn't true. For
> example, there are containers that share the cgroup with other
> processes on purpose that are supposed to be bound to the lifetime of
> the container but are not in the same pidns of the container.
> Containers that are in a delegated cgroup but share the pid namespace
> with the host or other containers.
> - Service managers such as systemd use cgroups to group and organize
> processes belonging to a service. They usually rely on a recursive
> algorithm now to kill a service. With cgroup.kill this becomes a
> simple write to cgroup.kill.
> - Userspace OOM implementations can make good use of this feature to
> efficiently take down whole cgroups quickly.
To further motivate the userspace oom-killer use case: instead of
traversing the tree, opening cgroup.procs, and manually killing the
processes under memory pressure, the userspace oom-killer can simply
keep the list of cgroup.kill files open and write '1' to them when it
decides to trigger the oom-kill.
Michal, what do you think of putting the processes killed through this
interface into the oom_reaper_list as well? Will there be any
concerns?
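As a rough user-space sketch of that pattern (the helper name and the
pre-opened-descriptor handling are illustrative assumptions, not part of
this series), the hot path reduces to a single write(2):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write "1" to an already-open cgroup.kill file descriptor.
 * A userspace oom-killer can pre-open one descriptor per candidate
 * cgroup (e.g. /sys/fs/cgroup/<victim>/cgroup.kill) so that the
 * kill decision itself involves no path lookups under pressure. */
static int cg_kill(int fd)
{
	return write(fd, "1", 1) == 1 ? 0 : -1;
}
```

Against a real cgroup2 mount the writer of course needs delegation
rights over the target cgroup; for demonstration the descriptor can
point at any writable file.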
> - The kill program can gain a new
> kill --cgroup /sys/fs/cgroup/delegated
> flag to take down cgroups.
>
> A few observations about the semantics:
> - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
> not specified we are not taking cgroup mutex meaning the cgroup can be
> killed while a process in that cgroup is forking.
> If the kill request happens right before cgroup_can_fork() and before
> the parent grabs its siglock the parent is guaranteed to see the
> pending SIGKILL. In addition we perform another check in
> cgroup_post_fork() whether the cgroup is being killed and if so take
> down the child (see above). This is robust enough and protects against
> forkbombs. If userspace really really wants to have stricter
> protection the simple solution would be to grab the write side of the
> cgroup threadgroup rwsem which will force all ongoing forks to
> complete before killing starts. We concluded that this is not
> necessary as the semantics for concurrent forking should simply align
> with freezer where a similar check as cgroup_post_fork() is performed.
>
> For all other cases CLONE_INTO_CGROUP is required. In this case we
> will grab the cgroup mutex so the cgroup can't be killed while we
> fork. Once we're done with the fork and have dropped cgroup mutex we
> are visible and will be found by any subsequent kill request.
> - We obviously don't kill kthreads. This means a cgroup that has a
> kthread will not become empty after killing and consequently no
> unpopulated event will be generated. The assumption is that kthreads
> should be in the root cgroup only anyway so this is not an issue.
> - We skip killing tasks that already have pending fatal signals.
> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
> you have two tasks in different pid namespaces the cgroup would still
> be frozen. The cgroup.kill mechanism consequently behaves the same
> way, i.e. we kill all processes and ignore in which pid namespace they
> exist.
> - If the caller is located in a cgroup that is killed the caller will
> obviously be killed as well.
>
> Cc: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
[...]
* Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
[not found] ` <20210503143922.3093755-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
` (6 preceding siblings ...)
2021-05-04 18:29 ` Shakeel Butt
@ 2021-05-05 17:57 ` Roman Gushchin
[not found] ` <YJLcbOtcv8qWtMRQ-lLJQVQxiE4uLfgCeKHXN1g2O0Ztt9esIQQ4Iyu8u01E@public.gmane.org>
2021-05-05 18:31 ` Eric W. Biederman
8 siblings, 1 reply; 18+ messages in thread
From: Roman Gushchin @ 2021-05-05 17:57 UTC (permalink / raw)
To: Christian Brauner
Cc: Tejun Heo, Shakeel Butt, Zefan Li, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs/YUNznpcFYbw,
Christian Brauner
On Mon, May 03, 2021 at 04:39:19PM +0200, Christian Brauner wrote:
> From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
>
> Introduce the cgroup.kill file. It does what it says on the tin and
> allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> The file is available in non-root cgroups.
>
> Killing cgroups is a process directed operation, i.e. the whole
> thread-group is affected. Consequently trying to write to cgroup.kill in
> threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> aligns with cgroup.procs where reads in threaded-cgroups are rejected
> with EOPNOTSUPP.
>
> The cgroup.kill file is write-only since killing a cgroup is an
> event, not a state, which makes it different from e.g. the freezer
> where a cgroup transitions between the two states.
>
> As with all new cgroup features cgroup.kill is recursive by default.
>
> Killing a cgroup is protected against concurrent migrations through the
> cgroup mutex. To protect against forkbombs and to mitigate the effect of
> racing forks a new CGRP_KILL css set lock protected flag is introduced
> that is set prior to killing a cgroup and unset after the cgroup has
> been killed. We can then check in cgroup_post_fork() where we hold the
> css set lock already whether the cgroup is currently being killed. If so
> we send the child a SIGKILL signal immediately taking it down as soon as
> it returns to userspace. To make the killing of the child semantically
> clean it is killed after all cgroup attachment operations have been
> finalized.
>
> There are various use-cases of this interface:
> - Containers usually have a conservative layout where each container
> usually has a delegated cgroup. For such layouts there is a 1:1
> mapping between container and cgroup. If the container in addition
> uses a separate pid namespace then killing a container usually becomes
> a simple kill -9 <container-init-pid> from an ancestor pid namespace.
> However, there are quite a few scenarios where that isn't true. For
> example, there are containers that share the cgroup with other
> processes on purpose that are supposed to be bound to the lifetime of
> the container but are not in the same pidns of the container.
> Containers that are in a delegated cgroup but share the pid namespace
> with the host or other containers.
> - Service managers such as systemd use cgroups to group and organize
> processes belonging to a service. They usually rely on a recursive
> algorithm now to kill a service. With cgroup.kill this becomes a
> simple write to cgroup.kill.
> - Userspace OOM implementations can make good use of this feature to
> efficiently take down whole cgroups quickly.
> - The kill program can gain a new
> kill --cgroup /sys/fs/cgroup/delegated
> flag to take down cgroups.
>
> A few observations about the semantics:
> - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
> not specified we are not taking cgroup mutex meaning the cgroup can be
> killed while a process in that cgroup is forking.
> If the kill request happens right before cgroup_can_fork() and before
> the parent grabs its siglock the parent is guaranteed to see the
> pending SIGKILL. In addition we perform another check in
> cgroup_post_fork() whether the cgroup is being killed and if so take
> down the child (see above). This is robust enough and protects against
> forkbombs. If userspace really really wants to have stricter
> protection the simple solution would be to grab the write side of the
> cgroup threadgroup rwsem which will force all ongoing forks to
> complete before killing starts. We concluded that this is not
> necessary as the semantics for concurrent forking should simply align
> with freezer where a similar check as cgroup_post_fork() is performed.
>
> For all other cases CLONE_INTO_CGROUP is required. In this case we
> will grab the cgroup mutex so the cgroup can't be killed while we
> fork. Once we're done with the fork and have dropped cgroup mutex we
> are visible and will be found by any subsequent kill request.
> - We obviously don't kill kthreads. This means a cgroup that has a
> kthread will not become empty after killing and consequently no
> unpopulated event will be generated. The assumption is that kthreads
> should be in the root cgroup only anyway so this is not an issue.
> - We skip killing tasks that already have pending fatal signals.
> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
> you have two tasks in different pid namespaces the cgroup would still
> be frozen. The cgroup.kill mechanism consequently behaves the same
> way, i.e. we kill all processes and ignore in which pid namespace they
> exist.
> - If the caller is located in a cgroup that is killed the caller will
> obviously be killed as well.
>
> Cc: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> ---
>
> The series can be pulled from
>
> git-OoYKEaZ2EDaWaY/ihj7yzEB+6BGkLq7r@public.gmane.org:pub/scm/linux/kernel/git/brauner/linux tags/cgroup.kill.v5.14
>
> /* v2 */
> - Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>:
> - Retrieve cgrp->flags only once and check CGRP_* bits on it.
> ---
> include/linux/cgroup-defs.h | 3 +
> kernel/cgroup/cgroup.c | 127 ++++++++++++++++++++++++++++++++----
> 2 files changed, 116 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 559ee05f86b2..43fef771009a 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -71,6 +71,9 @@ enum {
>
> /* Cgroup is frozen. */
> CGRP_FROZEN,
> +
> + /* Control group has to be killed. */
> + CGRP_KILL,
> };
>
> /* cgroup_root->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 9153b20e5cc6..aee84b99534a 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +static void __cgroup_kill(struct cgroup *cgrp)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> +
> + lockdep_assert_held(&cgroup_mutex);
> +
> + spin_lock_irq(&css_set_lock);
> + set_bit(CGRP_KILL, &cgrp->flags);
> + spin_unlock_irq(&css_set_lock);
> +
> + css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
> + while ((task = css_task_iter_next(&it))) {
> + /* Ignore kernel threads here. */
> + if (task->flags & PF_KTHREAD)
> + continue;
> +
> + /* Skip tasks that are already dying. */
> + if (__fatal_signal_pending(task))
> + continue;
> +
> + send_sig(SIGKILL, task, 0);
> + }
> + css_task_iter_end(&it);
> +
> + spin_lock_irq(&css_set_lock);
> + clear_bit(CGRP_KILL, &cgrp->flags);
> + spin_unlock_irq(&css_set_lock);
> +}
> +
> +static void cgroup_kill(struct cgroup *cgrp)
> +{
> + struct cgroup_subsys_state *css;
> + struct cgroup *dsct;
> +
> + lockdep_assert_held(&cgroup_mutex);
> +
> + cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
> + __cgroup_kill(dsct);
> +}
> +
> +static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
> + size_t nbytes, loff_t off)
> +{
> + ssize_t ret = 0;
> + int kill;
> + struct cgroup *cgrp;
> +
> + ret = kstrtoint(strstrip(buf), 0, &kill);
> + if (ret)
> + return ret;
> +
> + if (kill != 1)
> + return -ERANGE;
> +
> + cgrp = cgroup_kn_lock_live(of->kn, false);
> + if (!cgrp)
> + return -ENOENT;
> +
> + /*
> + * Killing is a process directed operation, i.e. the whole thread-group
> + * is taken down so act like we do for cgroup.procs and only make this
> + * writable in non-threaded cgroups.
> + */
> + if (cgroup_is_threaded(cgrp))
> + ret = -EOPNOTSUPP;
> + else
> + cgroup_kill(cgrp);
> +
> + cgroup_kn_unlock(of->kn);
> +
> + return ret ?: nbytes;
> +}
> +
> static int cgroup_file_open(struct kernfs_open_file *of)
> {
> struct cftype *cft = of_cft(of);
> @@ -4846,6 +4920,11 @@ static struct cftype cgroup_base_files[] = {
> .seq_show = cgroup_freeze_show,
> .write = cgroup_freeze_write,
> },
> + {
> + .name = "cgroup.kill",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .write = cgroup_kill_write,
> + },
> {
> .name = "cpu.stat",
> .seq_show = cpu_stat_show,
> @@ -6077,6 +6156,8 @@ void cgroup_post_fork(struct task_struct *child,
> struct kernel_clone_args *kargs)
> __releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
> {
> + unsigned long cgrp_flags = 0;
> + bool kill = false;
> struct cgroup_subsys *ss;
> struct css_set *cset;
> int i;
> @@ -6088,6 +6169,11 @@ void cgroup_post_fork(struct task_struct *child,
>
> /* init tasks are special, only link regular threads */
> if (likely(child->pid)) {
> + if (kargs->cgrp)
> + cgrp_flags = kargs->cgrp->flags;
> + else
> + cgrp_flags = cset->dfl_cgrp->flags;
> +
> WARN_ON_ONCE(!list_empty(&child->cg_list));
> cset->nr_tasks++;
> css_set_move_task(child, NULL, cset, false);
> @@ -6096,23 +6182,32 @@ void cgroup_post_fork(struct task_struct *child,
> cset = NULL;
> }
>
> - /*
> - * If the cgroup has to be frozen, the new task has too. Let's set
> - * the JOBCTL_TRAP_FREEZE jobctl bit to get the task into the
> - * frozen state.
> - */
> - if (unlikely(cgroup_task_freeze(child))) {
> - spin_lock(&child->sighand->siglock);
> - WARN_ON_ONCE(child->frozen);
> - child->jobctl |= JOBCTL_TRAP_FREEZE;
> - spin_unlock(&child->sighand->siglock);
> + if (!(child->flags & PF_KTHREAD)) {
> + if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
> + /*
> + * If the cgroup has to be frozen, the new task has
> + * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
> + * get the task into the frozen state.
> + */
> + spin_lock(&child->sighand->siglock);
> + WARN_ON_ONCE(child->frozen);
> + child->jobctl |= JOBCTL_TRAP_FREEZE;
> + spin_unlock(&child->sighand->siglock);
> +
> + /*
> + * Calling cgroup_update_frozen() isn't required here,
> + * because it will be called anyway a bit later from
> + * do_freezer_trap(). So we avoid cgroup's transient
> + * switch from the frozen state and back.
> + */
> + }
I think this part can be optimized a bit further:
1) we don't need an atomic test_bit() here
2) the PF_KTHREAD, CGRP_FREEZE and CGRP_KILL cases are all very unlikely
So something like this could work (completely untested):
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 0965b44ff425..f567ca69119d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6190,13 +6190,15 @@ void cgroup_post_fork(struct task_struct *child,
cset = NULL;
}
- if (!(child->flags & PF_KTHREAD)) {
- if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
- /*
- * If the cgroup has to be frozen, the new task has
- * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
- * get the task into the frozen state.
- */
+
+ if (unlikely(!(child->flags & PF_KTHREAD) &&
+ cgrp_flags & (CGRP_FREEZE | CGRP_KILL))) {
+ /*
+ * If the cgroup has to be frozen, the new task has
+ * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
+ * get the task into the frozen state.
+ */
+ if (cgrp_flags & CGRP_FREEZE) {
spin_lock(&child->sighand->siglock);
WARN_ON_ONCE(child->frozen);
child->jobctl |= JOBCTL_TRAP_FREEZE;
@@ -6215,7 +6217,8 @@ void cgroup_post_fork(struct task_struct *child,
* child down right after we finished preparing it for
* userspace.
*/
- kill = test_bit(CGRP_KILL, &cgrp_flags);
+ if (cgrp_flags & CGRP_KILL)
+ kill = true;
}
spin_unlock_irq(&css_set_lock);
>
> /*
> - * Calling cgroup_update_frozen() isn't required here,
> - * because it will be called anyway a bit later from
> - * do_freezer_trap(). So we avoid cgroup's transient switch
> - * from the frozen state and back.
> + * If the cgroup is to be killed notice it now and take the
> + * child down right after we finished preparing it for
> + * userspace.
> */
> + kill = test_bit(CGRP_KILL, &cgrp_flags);
> }
>
> spin_unlock_irq(&css_set_lock);
> @@ -6135,6 +6230,10 @@ void cgroup_post_fork(struct task_struct *child,
> put_css_set(rcset);
> }
>
> + /* Cgroup has to be killed so take down child immediately. */
> + if (kill)
> + send_sig(SIGKILL, child, 0);
I think it's better to use do_send_sig_info() here, which skips the check
for the signal number, since the number is obviously valid.
Thanks!
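One caveat worth noting about the untested sketch above: the CGRP_*
values are enum bit numbers, not masks, so an open-coded non-atomic
check needs an explicit BIT() conversion. A small user-space
illustration (the enum positions here are illustrative, not the
kernel's exact values):

```c
#include <assert.h>

/* Bit numbers in the style of the cgroup flags enum
 * (exact positions are illustrative). */
enum { CGRP_FREEZE = 2, CGRP_FROZEN = 3, CGRP_KILL = 4 };

#define BIT(nr) (1UL << (nr))

/* Correct non-atomic check: convert the bit number to a mask first. */
static int cgrp_flag_test(unsigned long flags, int bit)
{
	return (flags & BIT(bit)) != 0;
}
```

With flags = BIT(CGRP_KILL), the bare expression `flags & CGRP_KILL`
evaluates to 0 because mask 4 tests bit 2 (CGRP_FREEZE's position),
which is why the mask conversion matters.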
* Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
[not found] ` <20210503143922.3093755-1-brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
` (7 preceding siblings ...)
2021-05-05 17:57 ` Roman Gushchin
@ 2021-05-05 18:31 ` Eric W. Biederman
[not found] ` <m1v97x6niq.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
8 siblings, 1 reply; 18+ messages in thread
From: Eric W. Biederman @ 2021-05-05 18:31 UTC (permalink / raw)
To: Christian Brauner
Cc: Tejun Heo, Roman Gushchin, Shakeel Butt, Zefan Li,
Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
containers-cunTk1MwBs/YUNznpcFYbw, Christian Brauner
Please see below; this patch uses the wrong function to send SIGKILL.
Eric
Christian Brauner <brauner-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:
> From: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
>
> Introduce the cgroup.kill file. It does what it says on the tin and
> allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> The file is available in non-root cgroups.
>
> Killing cgroups is a process directed operation, i.e. the whole
> thread-group is affected. Consequently trying to write to cgroup.kill in
> threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> aligns with cgroup.procs where reads in threaded-cgroups are rejected
> with EOPNOTSUPP.
>
> The cgroup.kill file is write-only since killing a cgroup is an
> event, not a state, which makes it different from e.g. the freezer
> where a cgroup transitions between the two states.
>
> As with all new cgroup features cgroup.kill is recursive by default.
>
> Killing a cgroup is protected against concurrent migrations through the
> cgroup mutex. To protect against forkbombs and to mitigate the effect of
> racing forks a new CGRP_KILL css set lock protected flag is introduced
> that is set prior to killing a cgroup and unset after the cgroup has
> been killed. We can then check in cgroup_post_fork() where we hold the
> css set lock already whether the cgroup is currently being killed. If so
> we send the child a SIGKILL signal immediately taking it down as soon as
> it returns to userspace. To make the killing of the child semantically
> clean it is killed after all cgroup attachment operations have been
> finalized.
>
> There are various use-cases of this interface:
> - Containers usually have a conservative layout where each container
> usually has a delegated cgroup. For such layouts there is a 1:1
> mapping between container and cgroup. If the container in addition
> uses a separate pid namespace then killing a container usually becomes
> a simple kill -9 <container-init-pid> from an ancestor pid namespace.
> However, there are quite a few scenarios where that isn't true. For
> example, there are containers that share the cgroup with other
> processes on purpose that are supposed to be bound to the lifetime of
> the container but are not in the same pidns of the container.
> Containers that are in a delegated cgroup but share the pid namespace
> with the host or other containers.
> - Service managers such as systemd use cgroups to group and organize
> processes belonging to a service. They usually rely on a recursive
> algorithm now to kill a service. With cgroup.kill this becomes a
> simple write to cgroup.kill.
> - Userspace OOM implementations can make good use of this feature to
> efficiently take down whole cgroups quickly.
> - The kill program can gain a new
> kill --cgroup /sys/fs/cgroup/delegated
> flag to take down cgroups.
>
> A few observations about the semantics:
> - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
> not specified we are not taking cgroup mutex meaning the cgroup can be
> killed while a process in that cgroup is forking.
> If the kill request happens right before cgroup_can_fork() and before
> the parent grabs its siglock the parent is guaranteed to see the
> pending SIGKILL. In addition we perform another check in
> cgroup_post_fork() whether the cgroup is being killed and if so take
> down the child (see above). This is robust enough and protects against
> forkbombs. If userspace really really wants to have stricter
> protection the simple solution would be to grab the write side of the
> cgroup threadgroup rwsem which will force all ongoing forks to
> complete before killing starts. We concluded that this is not
> necessary as the semantics for concurrent forking should simply align
> with freezer where a similar check as cgroup_post_fork() is performed.
>
> For all other cases CLONE_INTO_CGROUP is required. In this case we
> will grab the cgroup mutex so the cgroup can't be killed while we
> fork. Once we're done with the fork and have dropped cgroup mutex we
> are visible and will be found by any subsequent kill request.
> - We obviously don't kill kthreads. This means a cgroup that has a
> kthread will not become empty after killing and consequently no
> unpopulated event will be generated. The assumption is that kthreads
> should be in the root cgroup only anyway so this is not an issue.
> - We skip killing tasks that already have pending fatal signals.
> - Freezer doesn't care about tasks in different pid namespaces, i.e. if
> you have two tasks in different pid namespaces the cgroup would still
> be frozen. The cgroup.kill mechanism consequently behaves the same
> way, i.e. we kill all processes and ignore in which pid namespace they
> exist.
> - If the caller is located in a cgroup that is killed the caller will
> obviously be killed as well.
>
> Cc: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Cc: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: Christian Brauner <christian.brauner-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> ---
>
> The series can be pulled from
>
> git-OoYKEaZ2EDaWaY/ihj7yzEB+6BGkLq7r@public.gmane.org:pub/scm/linux/kernel/git/brauner/linux tags/cgroup.kill.v5.14
>
> /* v2 */
> - Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>:
> - Retrieve cgrp->flags only once and check CGRP_* bits on it.
> ---
> include/linux/cgroup-defs.h | 3 +
> kernel/cgroup/cgroup.c | 127 ++++++++++++++++++++++++++++++++----
> 2 files changed, 116 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 559ee05f86b2..43fef771009a 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -71,6 +71,9 @@ enum {
>
> /* Cgroup is frozen. */
> CGRP_FROZEN,
> +
> + /* Control group has to be killed. */
> + CGRP_KILL,
> };
>
> /* cgroup_root->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 9153b20e5cc6..aee84b99534a 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +static void __cgroup_kill(struct cgroup *cgrp)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> +
> + lockdep_assert_held(&cgroup_mutex);
> +
> + spin_lock_irq(&css_set_lock);
> + set_bit(CGRP_KILL, &cgrp->flags);
> + spin_unlock_irq(&css_set_lock);
> +
> + css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
> + while ((task = css_task_iter_next(&it))) {
> + /* Ignore kernel threads here. */
> + if (task->flags & PF_KTHREAD)
> + continue;
> +
> + /* Skip tasks that are already dying. */
> + if (__fatal_signal_pending(task))
> + continue;
> +
> + send_sig(SIGKILL, task, 0);
^^^^^^^^
Using send_sig here is wrong. The function send_sig
is the interface to send a signal to a single task/thread.
The signal SIGKILL can not be sent to a single task/thread.
So it never makes sense to use send_sig with SIGKILL.
As this all happens in the context of the process writing
to the file this can either be:
group_send_sig_info(SIGKILL, SEND_SIG_NOINFO, task, PIDTYPE_TGID);
Which will check that the caller actually has permissions to kill the
specified task. Or:
do_send_sig_info(SIGKILL, SEND_SIG_NOINFO, task, PIDTYPE_TGID);
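The process-directed nature of SIGKILL can also be observed from user
space; a hedged sketch (using pthread_kill(), which targets a single
thread) shows the whole thread-group going down anyway:

```c
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void *spin(void *arg)
{
	(void)arg;
	for (;;)
		pause();
	return NULL;
}

/* Fork a child with one extra thread, deliver SIGKILL to that single
 * thread, and report whether the *entire* child process was killed. */
static int whole_group_died(void)
{
	pid_t pid = fork();
	if (pid == 0) {
		pthread_t t;
		if (pthread_create(&t, NULL, spin, NULL) != 0)
			_exit(1);
		pthread_kill(t, SIGKILL);	/* thread-directed send... */
		sleep(5);			/* ...never reached: group dies */
		_exit(0);
	}
	int status;
	if (waitpid(pid, &status, 0) != pid)
		return 0;
	return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

Even though only the worker thread is targeted, the parent observes the
child terminated by SIGKILL, mirroring why a thread-directed kernel
interface is the wrong shape for this signal.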
> + }
> + css_task_iter_end(&it);
> +
> + spin_lock_irq(&css_set_lock);
> + clear_bit(CGRP_KILL, &cgrp->flags);
> + spin_unlock_irq(&css_set_lock);
> +}
> +
> +static void cgroup_kill(struct cgroup *cgrp)
> +{
> + struct cgroup_subsys_state *css;
> + struct cgroup *dsct;
> +
> + lockdep_assert_held(&cgroup_mutex);
> +
> + cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
> + __cgroup_kill(dsct);
> +}
> +
> +static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
> + size_t nbytes, loff_t off)
> +{
> + ssize_t ret = 0;
> + int kill;
> + struct cgroup *cgrp;
> +
> + ret = kstrtoint(strstrip(buf), 0, &kill);
> + if (ret)
> + return ret;
> +
> + if (kill != 1)
> + return -ERANGE;
> +
> + cgrp = cgroup_kn_lock_live(of->kn, false);
> + if (!cgrp)
> + return -ENOENT;
> +
> + /*
> + * Killing is a process directed operation, i.e. the whole thread-group
> + * is taken down so act like we do for cgroup.procs and only make this
> + * writable in non-threaded cgroups.
> + */
> + if (cgroup_is_threaded(cgrp))
> + ret = -EOPNOTSUPP;
> + else
> + cgroup_kill(cgrp);
> +
> + cgroup_kn_unlock(of->kn);
> +
> + return ret ?: nbytes;
> +}
> +
> static int cgroup_file_open(struct kernfs_open_file *of)
> {
> struct cftype *cft = of_cft(of);
> @@ -4846,6 +4920,11 @@ static struct cftype cgroup_base_files[] = {
> .seq_show = cgroup_freeze_show,
> .write = cgroup_freeze_write,
> },
> + {
> + .name = "cgroup.kill",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .write = cgroup_kill_write,
> + },
> {
> .name = "cpu.stat",
> .seq_show = cpu_stat_show,
> @@ -6077,6 +6156,8 @@ void cgroup_post_fork(struct task_struct *child,
> struct kernel_clone_args *kargs)
> __releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
> {
> + unsigned long cgrp_flags = 0;
> + bool kill = false;
> struct cgroup_subsys *ss;
> struct css_set *cset;
> int i;
> @@ -6088,6 +6169,11 @@ void cgroup_post_fork(struct task_struct *child,
>
> /* init tasks are special, only link regular threads */
> if (likely(child->pid)) {
> + if (kargs->cgrp)
> + cgrp_flags = kargs->cgrp->flags;
> + else
> + cgrp_flags = cset->dfl_cgrp->flags;
> +
> WARN_ON_ONCE(!list_empty(&child->cg_list));
> cset->nr_tasks++;
> css_set_move_task(child, NULL, cset, false);
> @@ -6096,23 +6182,32 @@ void cgroup_post_fork(struct task_struct *child,
> cset = NULL;
> }
>
> - /*
> - * If the cgroup has to be frozen, the new task has too. Let's set
> - * the JOBCTL_TRAP_FREEZE jobctl bit to get the task into the
> - * frozen state.
> - */
> - if (unlikely(cgroup_task_freeze(child))) {
> - spin_lock(&child->sighand->siglock);
> - WARN_ON_ONCE(child->frozen);
> - child->jobctl |= JOBCTL_TRAP_FREEZE;
> - spin_unlock(&child->sighand->siglock);
> + if (!(child->flags & PF_KTHREAD)) {
> + if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
> + /*
> + * If the cgroup has to be frozen, the new task has
> + * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
> + * get the task into the frozen state.
> + */
> + spin_lock(&child->sighand->siglock);
> + WARN_ON_ONCE(child->frozen);
> + child->jobctl |= JOBCTL_TRAP_FREEZE;
> + spin_unlock(&child->sighand->siglock);
> +
> + /*
> + * Calling cgroup_update_frozen() isn't required here,
> + * because it will be called anyway a bit later from
> + * do_freezer_trap(). So we avoid cgroup's transient
> + * switch from the frozen state and back.
> + */
> + }
>
> /*
> - * Calling cgroup_update_frozen() isn't required here,
> - * because it will be called anyway a bit later from
> - * do_freezer_trap(). So we avoid cgroup's transient switch
> - * from the frozen state and back.
> + * If the cgroup is to be killed notice it now and take the
> + * child down right after we finished preparing it for
> + * userspace.
> */
> + kill = test_bit(CGRP_KILL, &cgrp_flags);
> }
>
> spin_unlock_irq(&css_set_lock);
> @@ -6135,6 +6230,10 @@ void cgroup_post_fork(struct task_struct *child,
> put_css_set(rcset);
> }
>
> + /* Cgroup has to be killed so take down child immediately. */
> + if (kill)
> + send_sig(SIGKILL, child, 0);
^^^^^^^^
Using send_sig is wrong here for the same reasons as above.
Is a change to cgroup_post_fork necessary at all? Fork already
has protections against a signal effectively being delivered
during fork, which may be enough in this case.
> +
> cgroup_css_set_put_fork(kargs);
> }
>
>
> base-commit: 9f4ad9e425a1d3b6a34617b8ea226d56a119a717