From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.8 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY, SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2CE4AC43381 for ; Tue, 12 Mar 2019 10:20:21 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id DBCCD214AE for ; Tue, 12 Mar 2019 10:20:20 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="r0dQiOtM" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726506AbfCLKUU (ORCPT ); Tue, 12 Mar 2019 06:20:20 -0400 Received: from mail-ed1-f65.google.com ([209.85.208.65]:38734 "EHLO mail-ed1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726434AbfCLKUU (ORCPT ); Tue, 12 Mar 2019 06:20:20 -0400 Received: by mail-ed1-f65.google.com with SMTP id h58so1791011edb.5 for ; Tue, 12 Mar 2019 03:20:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=iIQD/CveG1PY5jem3+DLzJ9G/eZ3V9mAk4zPWaxwqc4=; b=r0dQiOtMdksBaoTX1eso4blxKJrLqXAtukyXF1l17tz/fmBouE7qW6fYotwE9LSvaN Amg+0T4YQyd8QBtPbvU2U9XV4zZgQolVAx5yRwD1dzaoEORVQQHdi01DMGXF6oKIRykg wPxAO0wGWSUVww0WWEMNDxvq1MUpzf7hvM5QB0A3M/quXz2/NetBNUxx+iQN9WKwPMx5 gVejANsC/IYA3FYeveKFwTx5aCXonDjOcwGC+aErhMweM7kbM6ASS6+hjK3CpAO/KxV2 O1gLDjFO0DyEqzqeDT1X9gCwFrdZByg/DmfmGvVR0Cxq5K70mWavwTuIwDoCAh9kOflL 38SA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=iIQD/CveG1PY5jem3+DLzJ9G/eZ3V9mAk4zPWaxwqc4=; b=Caf9Gubd3Hb/AygyOJpbYKgg5Rd4ndBUNPnW3+7+bxRJ34carkOMHuwYWHblz+FfQu gpIHfY3zMugX2wYzWj0yBHtDHRlMX9bbjENrHqJzP2dtuGYOhWn0D7FF2WkyFSG0+oeL Uqhi2s8U38uzegJb2YW+Ynl+lqD8Ro6Lyqi8XKZDwWnPOhaq4uayWya8sRYz5JFZbyjQ 3qU7nZ9I/hYeaGCMf53Ppcm4g1WCZ4V0UWGKIYbtBHHRVgwLZLm5ApSHp3eS4H/Rb0ri OCgdy0vHh1byzBvXY9x7FcbdlK/oVWWhpj8fG6FYcq+3MEB4bbAu2QMLqgJqPoTmvJG4 yj4g== X-Gm-Message-State: APjAAAXt8kuNbKBi62f8o2C2W5GwjBs7NCagkrVdlkou5tsev7/djhzA dIU8pHYfE0kVRkRDEOyZMlU= X-Google-Smtp-Source: APXvYqyJyQOXvgiI0iejQMIUSH6BpFILpcvUo/9z5QW3DbKi15+kCUaOebSnrXzBY2r0PArMmIXA/w== X-Received: by 2002:a05:6402:386:: with SMTP id o6mr2827176edv.9.1552386017139; Tue, 12 Mar 2019 03:20:17 -0700 (PDT) Received: from jwang-pc.pb.local ([2001:1438:4010:254c:1e6f:65ff:fed4:d10]) by smtp.gmail.com with ESMTPSA id q12sm408698edd.12.2019.03.12.03.20.15 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 12 Mar 2019 03:20:16 -0700 (PDT) From: Jack Wang To: gregkh@linuxfoundation.org, stable@vger.kernel.org Cc: Johannes Weiner , Christopher Lameter , Ingo Molnar , Johannes Weiner , Mike Galbraith , Peter Enderborg , Randy Dunlap , Shakeel Butt , Vinayak Menon , Andrew Morton , Linus Torvalds , Jack Wang Subject: [stable-4.14 09/11] psi: cgroup support Date: Tue, 12 Mar 2019 11:20:00 +0100 Message-Id: <20190312102002.31737-10-jinpuwang@gmail.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190312102002.31737-1-jinpuwang@gmail.com> References: <20190312102002.31737-1-jinpuwang@gmail.com> Sender: stable-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: Johannes Weiner On a system that executes multiple cgrouped jobs and independent workloads, we don't just care about the health of the overall system, but also that of individual jobs, so that we can ensure individual job health, fairness between jobs, or prioritize some jobs over others. This patch implements pressure stall tracking for cgroups. In kernels with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure, and io.pressure files that track aggregate pressure stall times for only the tasks inside the cgroup. Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Acked-by: Tejun Heo Acked-by: Peter Zijlstra (Intel) Tested-by: Daniel Drake Tested-by: Suren Baghdasaryan Cc: Christopher Lameter Cc: Ingo Molnar Cc: Johannes Weiner Cc: Mike Galbraith Cc: Peter Enderborg Cc: Randy Dunlap Cc: Shakeel Butt Cc: Vinayak Menon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit 992fe97ea79dcb03297f72c36bd7cff990e02232) [jwang: to do check error case] Signed-off-by: Jack Wang Conflicts: Documentation/cgroup-v2.txt kernel/cgroup/cgroup.c --- Documentation/accounting/psi.txt | 9 +++ Documentation/cgroup-v2.txt | 17 +++++ include/linux/cgroup-defs.h | 4 ++ include/linux/cgroup.h | 15 ++++ include/linux/psi.h | 25 +++++++ init/Kconfig | 4 ++ kernel/cgroup/cgroup.c | 44 +++++++++++- kernel/sched/psi.c | 118 ++++++++++++++++++++++++++++--- 8 files changed, 227 insertions(+), 9 deletions(-) diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt index 3753a82f1cf5..b8ca28b60215 100644 --- a/Documentation/accounting/psi.txt +++ b/Documentation/accounting/psi.txt @@ -62,3 +62,12 @@ well as medium and long term trends. The total absolute stall time is tracked and exported as well, to allow detection of latency spikes which wouldn't necessarily make a dent in the time averages, or to average trends over custom time frames. + +Cgroup2 interface +================= + +In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem +mounted, pressure stall information is also tracked for tasks grouped +into cgroups. Each subdirectory in the cgroupfs mountpoint contains +cpu.pressure, memory.pressure, and io.pressure files; the format is +the same as the /proc/pressure/ files. diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index dc44785dc0fa..5cd589694a6a 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -957,6 +957,11 @@ All time durations are in microseconds. which indicates that the group may consume upto $MAX in each $PERIOD duration. If only one number is written, $MAX is updated. + cpu.pressure + A read-only nested-key file which exists on non-root cgroups. + + Shows pressure stall information for CPU. See + Documentation/accounting/psi.txt for details. Memory @@ -1194,6 +1199,12 @@ PAGE_SIZE multiple when read back. Swap usage hard limit. If a cgroup's swap usage reaches this limit, anonymous meomry of the cgroup will not be swapped out. + memory.pressure + A read-only nested-key file which exists on non-root cgroups. + + Shows pressure stall information for memory. See + Documentation/accounting/psi.txt for details. + Usage Guidelines ~~~~~~~~~~~~~~~~ @@ -1329,6 +1340,12 @@ IO Interface Files 8:16 rbps=2097152 wbps=max riops=max wiops=max + io.pressure + A read-only nested-key file which exists on non-root cgroups. + + Shows pressure stall information for IO. See + Documentation/accounting/psi.txt for details. + Writeback ~~~~~~~~~ diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index e7905d9353e8..f113ebd940e0 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -19,6 +19,7 @@ #include #include #include +#include #ifdef CONFIG_CGROUPS @@ -368,6 +369,9 @@ struct cgroup { /* used to schedule release agent */ struct work_struct release_agent_work; + /* used to track pressure stalls */ + struct psi_group psi; + /* used to store eBPF programs */ struct cgroup_bpf bpf; diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index dddbc29e2009..28a1a067da52 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -626,6 +626,11 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp) pr_cont_kernfs_path(cgrp->kn); } +static inline struct psi_group *cgroup_psi(struct cgroup *cgrp) +{ + return &cgrp->psi; +} + static inline void cgroup_init_kthreadd(void) { /* @@ -679,6 +684,16 @@ static inline union kernfs_node_id *cgroup_get_kernfs_id(struct cgroup *cgrp) return NULL; } +static inline struct cgroup *cgroup_parent(struct cgroup *cgrp) +{ + return NULL; +} + +static inline struct psi_group *cgroup_psi(struct cgroup *cgrp) +{ + return NULL; +} + static inline bool task_under_cgroup_hierarchy(struct task_struct *task, struct cgroup *ancestor) { diff --git a/include/linux/psi.h b/include/linux/psi.h index b0daf050de58..8e0725aac0aa 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -4,6 +4,9 @@ #include #include +struct seq_file; +struct css_set; + #ifdef CONFIG_PSI extern bool psi_disabled; @@ -16,6 +19,14 @@ void psi_memstall_tick(struct task_struct *task, int cpu); void psi_memstall_enter(unsigned long *flags); void psi_memstall_leave(unsigned long *flags); +int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res); + +#ifdef CONFIG_CGROUPS +int psi_cgroup_alloc(struct cgroup *cgrp); +void psi_cgroup_free(struct cgroup *cgrp); +void cgroup_move_task(struct task_struct *p, struct css_set *to); +#endif + #else /* CONFIG_PSI */ static inline void psi_init(void) {} @@ -23,6 +34,20 @@ static inline void psi_init(void) {} static inline void psi_memstall_enter(unsigned long *flags) {} static inline void psi_memstall_leave(unsigned long *flags) {} +#ifdef CONFIG_CGROUPS +static inline int psi_cgroup_alloc(struct cgroup *cgrp) +{ + return 0; +} +static inline void psi_cgroup_free(struct cgroup *cgrp) +{ +} +static inline void cgroup_move_task(struct task_struct *p, struct css_set *to) +{ + rcu_assign_pointer(p->cgroups, to); +} +#endif + #endif /* CONFIG_PSI */ #endif /* _LINUX_PSI_H */ diff --git a/init/Kconfig b/init/Kconfig index d19c853088c1..cb22efbd74d9 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -481,6 +481,10 @@ config PSI the share of walltime in which some or all tasks in the system are delayed due to contention of the respective resource. + In kernels with cgroup support, cgroups (cgroup2 only) will + have cpu.pressure, memory.pressure, and io.pressure files, + which aggregate pressure stalls for the grouped tasks only. + For more details see Documentation/accounting/psi.txt. Say N if unsure. diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 21bbfc09e395..0897e980a2e2 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -54,6 +54,7 @@ #include #include #include +#include #include #define CREATE_TRACE_POINTS @@ -794,7 +795,7 @@ static void css_set_move_task(struct task_struct *task, */ WARN_ON_ONCE(task->flags & PF_EXITING); - rcu_assign_pointer(task->cgroups, to_cset); + cgroup_move_task(task, to_cset); list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks : &to_cset->tasks); } @@ -3329,6 +3330,21 @@ static int cgroup_stat_show(struct seq_file *seq, void *v) return 0; } +#ifdef CONFIG_PSI +static int cgroup_io_pressure_show(struct seq_file *seq, void *v) +{ + return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_IO); +} +static int cgroup_memory_pressure_show(struct seq_file *seq, void *v) +{ + return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_MEM); +} +static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v) +{ + return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU); +} +#endif + static int cgroup_file_open(struct kernfs_open_file *of) { struct cftype *cft = of->kn->priv; @@ -4439,6 +4455,23 @@ static struct cftype cgroup_base_files[] = { .name = "cgroup.stat", .seq_show = cgroup_stat_show, }, +#ifdef CONFIG_PSI + { + .name = "io.pressure", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = cgroup_io_pressure_show, + }, + { + .name = "memory.pressure", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = cgroup_memory_pressure_show, + }, + { + .name = "cpu.pressure", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = cgroup_cpu_pressure_show, + }, +#endif { } /* terminate */ }; @@ -4499,6 +4532,7 @@ static void css_free_work_fn(struct work_struct *work) */ cgroup_put(cgroup_parent(cgrp)); kernfs_put(cgrp->kn); + psi_cgroup_free(cgrp); kfree(cgrp); } else { /* @@ -4742,6 +4776,10 @@ static struct cgroup *cgroup_create(struct cgroup *parent) cgrp->root = root; cgrp->level = level; + ret = psi_cgroup_alloc(cgrp); + if (ret) + goto out_idr_free; + for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) { cgrp->ancestor_ids[tcgrp->level] = tcgrp->id; @@ -4782,6 +4820,10 @@ static struct cgroup *cgroup_create(struct cgroup *parent) return cgrp; +out_psi_free: + psi_cgroup_free(cgrp); +out_idr_free: + cgroup_idr_remove(&root->cgroup_idr, cgrp->id); out_cancel_ref: percpu_ref_exit(&cgrp->self.refcnt); out_free_cgrp: diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 595414599b98..7cdecfc010af 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -473,9 +473,35 @@ static void psi_group_change(struct psi_group *group, int cpu, schedule_delayed_work(&group->clock_work, PSI_FREQ); } +static struct psi_group *iterate_groups(struct task_struct *task, void **iter) +{ +#ifdef CONFIG_CGROUPS + struct cgroup *cgroup = NULL; + + if (!*iter) + cgroup = task->cgroups->dfl_cgrp; + else if (*iter == &psi_system) + return NULL; + else + cgroup = cgroup_parent(*iter); + + if (cgroup && cgroup_parent(cgroup)) { + *iter = cgroup; + return cgroup_psi(cgroup); + } +#else + if (*iter) + return NULL; +#endif + *iter = &psi_system; + return &psi_system; +} + void psi_task_change(struct task_struct *task, int clear, int set) { int cpu = task_cpu(task); + struct psi_group *group; + void *iter = NULL; if (!task->pid) return; @@ -492,17 +518,23 @@ void psi_task_change(struct task_struct *task, int clear, int set) task->psi_flags &= ~clear; task->psi_flags |= set; - psi_group_change(&psi_system, cpu, clear, set); + while ((group = iterate_groups(task, &iter))) + psi_group_change(group, cpu, clear, set); } void psi_memstall_tick(struct task_struct *task, int cpu) { - struct psi_group_cpu *groupc; + struct psi_group *group; + void *iter = NULL; - groupc = per_cpu_ptr(psi_system.pcpu, cpu); - write_seqcount_begin(&groupc->seq); - record_times(groupc, cpu, true); - write_seqcount_end(&groupc->seq); + while ((group = iterate_groups(task, &iter))) { + struct psi_group_cpu *groupc; + + groupc = per_cpu_ptr(group->pcpu, cpu); + write_seqcount_begin(&groupc->seq); + record_times(groupc, cpu, true); + write_seqcount_end(&groupc->seq); + } } /** @@ -565,8 +597,78 @@ void psi_memstall_leave(unsigned long *flags) rq_unlock_irq(rq, &rf); } -static int psi_show(struct seq_file *m, struct psi_group *group, - enum psi_res res) +#ifdef CONFIG_CGROUPS +int psi_cgroup_alloc(struct cgroup *cgroup) +{ + if (psi_disabled) + return 0; + + cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu); + if (!cgroup->psi.pcpu) + return -ENOMEM; + group_init(&cgroup->psi); + return 0; +} + +void psi_cgroup_free(struct cgroup *cgroup) +{ + if (psi_disabled) + return; + + cancel_delayed_work_sync(&cgroup->psi.clock_work); + free_percpu(cgroup->psi.pcpu); +} + +/** + * cgroup_move_task - move task to a different cgroup + * @task: the task + * @to: the target css_set + * + * Move task to a new cgroup and safely migrate its associated stall + * state between the different groups. + * + * This function acquires the task's rq lock to lock out concurrent + * changes to the task's scheduling state and - in case the task is + * running - concurrent changes to its stall state. + */ +void cgroup_move_task(struct task_struct *task, struct css_set *to) +{ + bool move_psi = !psi_disabled; + unsigned int task_flags = 0; + struct rq_flags rf; + struct rq *rq; + + if (move_psi) { + rq = task_rq_lock(task, &rf); + + if (task_on_rq_queued(task)) + task_flags = TSK_RUNNING; + else if (task->in_iowait) + task_flags = TSK_IOWAIT; + + if (task->flags & PF_MEMSTALL) + task_flags |= TSK_MEMSTALL; + + if (task_flags) + psi_task_change(task, task_flags, 0); + } + + /* + * Lame to do this here, but the scheduler cannot be locked + * from the outside, so we move cgroups from inside sched/. + */ + rcu_assign_pointer(task->cgroups, to); + + if (move_psi) { + if (task_flags) + psi_task_change(task, 0, task_flags); + + task_rq_unlock(rq, task, &rf); + } +} +#endif /* CONFIG_CGROUPS */ + +int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) { int full; -- 2.17.1