From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, David Goode <dgoode@fb.com>,
Nina Schiff <ninasc@fb.com>,
"David S. Miller" <davem@davemloft.net>,
Tejun Heo <tj@kernel.org>
Subject: [PATCH 4.9 73/88] cgroup, net_cls: iterate the fds of only the tasks which are being migrated
Date: Tue, 28 Mar 2017 14:31:28 +0200
Message-ID: <20170328122751.643801615@linuxfoundation.org>
In-Reply-To: <20170328122748.656530096@linuxfoundation.org>
4.9-stable review patch. If anyone has any objections, please let me know.
------------------
From: Tejun Heo <tj@kernel.org>
commit a05d4fd9176003e0c1f9c3d083f4dac19fd346ab upstream.
The net_cls controller controls the classid field of each socket which
is associated with the cgroup. Because the classid is a per-socket
attribute, when a task migrates to another cgroup or the configured
classid of the cgroup changes, the controller needs to walk all
sockets and update the classid value, which was implemented by
3b13758f51de ("cgroups: Allow dynamically changing net_classid").
While the approach is not scalable, migrating tasks which have a lot
of fds attached to them is rare and the cost is borne by the ones
initiating the operations. However, for simplicity, both the
migration and classid config change paths call update_classid() which
scans all fds of all tasks in the target css. This is overkill for
the migration path, which only needs to cover the much smaller subset
of tasks which are actually getting migrated in.
On cgroup v1, this can lead to unexpected scalability issues when one
tries to migrate a task or process into a net_cls cgroup which already
contains a lot of fds. Even if the migration target doesn't have many
fds to scan, update_classid() ends up scanning all fds in the target
cgroup, which can be extremely numerous.
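To make the cost difference concrete, here is a minimal userspace sketch,
not kernel code. The struct model_task type and both helper names are
hypothetical and exist only for illustration; the sketch assumes that the
only thing that matters is how many fds each task holds and whether the
task is part of the current migration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task: only its open-fd count matters here. */
struct model_task {
	size_t nr_fds;		/* number of open fds held by the task */
	int being_migrated;	/* nonzero if the task is part of this migration */
};

/*
 * Models the pre-patch migration path: every fd of every task already in
 * the target cgroup is visited, whether or not the task is migrating.
 */
static size_t fds_scanned_whole_cgroup(const struct model_task *tasks, size_t n)
{
	size_t total = 0;

	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		total += tasks[i].nr_fds;
	return total;
}

/*
 * Models the post-patch migration path: only fds of the tasks that are
 * actually being migrated are visited.
 */
static size_t fds_scanned_migrated_only(const struct model_task *tasks, size_t n)
{
	size_t total = 0;

	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (tasks[i].being_migrated)
			total += tasks[i].nr_fds;
	return total;
}
```

With two resident tasks holding 100000 and 250000 fds and one incoming
task holding 8 fds, the first helper counts 350008 fds while the second
counts 8.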
Unfortunately, on cgroup v2, which doesn't use net_cls, the problem is
even worse. Before bfc2cf6f61fc ("cgroup: call subsys->*attach() only
for subsystems which are actually affected by migration"), cgroup core
would call the ->css_attach callback even for controllers which don't
see actual migration to a different css.
As net_cls is always disabled but still mounted on cgroup v2, whenever
a process is migrated on the cgroup v2 hierarchy, net_cls sees
identity migration from root to root and cgroup core used to call
the ->css_attach callback for those. The net_cls ->css_attach ends up
calling update_classid() on the root net_cls css, to which all
processes on the system belong as the controller isn't used. This
makes any cgroup v2 migration O(total_number_of_fds_on_the_system),
which is horrible and easily leads to noticeable stalls triggering RCU
stall warnings and so on.
The worst symptom is already fixed in upstream by bfc2cf6f61fc
("cgroup: call subsys->*attach() only for subsystems which are
actually affected by migration"); however, backporting that commit is
too invasive and we want to avoid other cases too.
This patch updates net_cls's cgrp_attach() to iterate fds of only the
processes which are actually getting migrated. This removes the
surprising migration cost which is dependent on the total number of
fds in the target cgroup. As this leaves write_classid() as the only
user of update_classid(), open-code the helper into write_classid().
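The per-task walk relies on a callback that receives the classid packed
into a pointer-sized argument. The following is a userspace sketch of
that pattern only, loosely modelled on how the patch calls iterate_fd()
with update_classid_sock(). struct model_file, model_iterate_fd() and
model_update_classid_sock() are hypothetical names invented here, and
the early-stop behaviour is an assumption of this model rather than a
statement about the kernel helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an open file; not the kernel's struct file. */
struct model_file {
	int is_socket;		/* nonzero if this fd refers to a socket */
	uint32_t classid;	/* only meaningful for sockets */
};

typedef int (*model_fd_cb)(const void *arg, struct model_file *file,
			   unsigned int fd);

/*
 * Walk files[start..nr_files) and call cb on each entry. Stops early and
 * returns cb's value if it is nonzero (an assumption of this model).
 */
static int model_iterate_fd(struct model_file *files, unsigned int nr_files,
			    unsigned int start, model_fd_cb cb,
			    const void *arg)
{
	assert(cb != NULL);
	for (unsigned int fd = start; fd < nr_files; fd++) {
		int ret = cb(arg, &files[fd], fd);

		if (ret)
			return ret;
	}
	return 0;
}

/*
 * Callback: recover the classid from the pointer-sized argument and
 * store it on sockets; non-socket files are left untouched.
 */
static int model_update_classid_sock(const void *arg, struct model_file *file,
				     unsigned int fd)
{
	(void)fd;	/* the fd index is not needed in this model */
	if (file->is_socket)
		file->classid = (uint32_t)(uintptr_t)arg;
	return 0;
}
```

A caller would pass the classid as (const void *)(uintptr_t)classid,
mirroring the (void *)(unsigned long) cast used in the patch, so no
separate allocation is needed to hand a 32-bit value to the callback.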
Reported-by: David Goode <dgoode@fb.com>
Fixes: 3b13758f51de ("cgroups: Allow dynamically changing net_classid")
Cc: Nina Schiff <ninasc@fb.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
net/core/netclassid_cgroup.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -69,27 +69,17 @@ static int update_classid_sock(const voi
 	return 0;
 }
 
-static void update_classid(struct cgroup_subsys_state *css, void *v)
+static void cgrp_attach(struct cgroup_taskset *tset)
 {
-	struct css_task_iter it;
+	struct cgroup_subsys_state *css;
 	struct task_struct *p;
 
-	css_task_iter_start(css, &it);
-	while ((p = css_task_iter_next(&it))) {
+	cgroup_taskset_for_each(p, css, tset) {
 		task_lock(p);
-		iterate_fd(p->files, 0, update_classid_sock, v);
+		iterate_fd(p->files, 0, update_classid_sock,
+			   (void *)(unsigned long)css_cls_state(css)->classid);
 		task_unlock(p);
 	}
-	css_task_iter_end(&it);
-}
-
-static void cgrp_attach(struct cgroup_taskset *tset)
-{
-	struct cgroup_subsys_state *css;
-
-	cgroup_taskset_first(tset, &css);
-	update_classid(css,
-		       (void *)(unsigned long)css_cls_state(css)->classid);
 }
 
 static u64 read_classid(struct cgroup_subsys_state *css, struct cftype *cft)
@@ -101,12 +91,22 @@ static int write_classid(struct cgroup_s
 			 u64 value)
 {
 	struct cgroup_cls_state *cs = css_cls_state(css);
+	struct css_task_iter it;
+	struct task_struct *p;
 
 	cgroup_sk_alloc_disable();
 
 	cs->classid = (u32)value;
 
-	update_classid(css, (void *)(unsigned long)cs->classid);
+	css_task_iter_start(css, &it);
+	while ((p = css_task_iter_next(&it))) {
+		task_lock(p);
+		iterate_fd(p->files, 0, update_classid_sock,
+			   (void *)(unsigned long)cs->classid);
+		task_unlock(p);
+	}
+	css_task_iter_end(&it);
+
 
 	return 0;
 }