From: Gabriele Monaco
To: linux-kernel@vger.kernel.org, Andrew Morton, David Hildenbrand,
	Ingo Molnar, Peter Zijlstra, Mathieu Desnoyers,
	"Paul E. McKenney", linux-mm@kvack.org
Cc: Gabriele Monaco, Ingo Molnar
Subject: [PATCH v2 2/4] rseq: Run the mm_cid_compaction from rseq_handle_notify_resume()
Date: Wed, 16 Jul 2025 18:06:06 +0200
Message-ID: <20250716160603.138385-8-gmonaco@redhat.com>
In-Reply-To: <20250716160603.138385-6-gmonaco@redhat.com>
References: <20250716160603.138385-6-gmonaco@redhat.com>

Currently the mm_cid_compaction is triggered by the scheduler tick and
runs in a task_work. Its behaviour is less predictable with periodic
tasks with short runtime, which may rarely run during a tick.

Run the mm_cid_compaction from the rseq_handle_notify_resume() call,
which runs from resume_user_mode_work.
Since the context is the same where the task_work would run, skip this
step and call the compaction function directly.
The compaction function still returns early if the scan is not
required, that is, when the pseudo-period of 100ms has not elapsed.

Keep a tick handler used for long-running tasks that are never preempted
(i.e. that never call rseq_handle_notify_resume), which triggers a
compaction and mm_cid update only in that case.

Signed-off-by: Gabriele Monaco
---
 include/linux/mm.h       |  2 ++
 include/linux/mm_types.h | 11 ++++++++
 include/linux/sched.h    |  2 +-
 kernel/rseq.c            |  2 ++
 kernel/sched/core.c      | 55 +++++++++++++++++++++++++---------------
 kernel/sched/sched.h     |  2 ++
 6 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa538feaa8d95..cc8c1c9ae26c1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2294,6 +2294,7 @@ void sched_mm_cid_before_execve(struct task_struct *t);
 void sched_mm_cid_after_execve(struct task_struct *t);
 void sched_mm_cid_fork(struct task_struct *t);
 void sched_mm_cid_exit_signals(struct task_struct *t);
+void task_mm_cid_work(struct task_struct *t);
 static inline int task_mm_cid(struct task_struct *t)
 {
 	return t->mm_cid;
@@ -2303,6 +2304,7 @@ static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
 static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
 static inline void sched_mm_cid_fork(struct task_struct *t) { }
 static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
+static inline void task_mm_cid_work(struct task_struct *t) { }
 static inline int task_mm_cid(struct task_struct *t)
 {
 	/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d6b91e8a66d6d..e6d6e468e64b4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1420,6 +1420,13 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
 	WRITE_ONCE(mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
 	raw_spin_unlock(&mm->cpus_allowed_lock);
 }
+
+static inline bool mm_cid_needs_scan(struct mm_struct *mm)
+{
+	if (!mm)
+		return false;
+	return time_after(jiffies, READ_ONCE(mm->mm_cid_next_scan));
+}
 #else /* CONFIG_SCHED_MM_CID */
 static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) { }
 static inline int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p) { return 0; }
@@ -1430,6 +1437,10 @@ static inline unsigned int mm_cid_size(void)
 	return 0;
 }
 static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
+static inline bool mm_cid_needs_scan(struct mm_struct *mm)
+{
+	return false;
+}
 #endif /* CONFIG_SCHED_MM_CID */
 
 struct mmu_gather;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index aa9c5be7a6325..a75f61cea2271 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1428,7 +1428,7 @@ struct task_struct {
 	int				last_mm_cid;	/* Most recent cid in mm */
 	int				migrate_from_cpu;
 	int				mm_cid_active;	/* Whether cid bitmap is active */
-	struct callback_head		cid_work;
+	unsigned long			last_cid_reset;	/* Time of last reset in jiffies */
 #endif
 
 	struct tlbflush_unmap_batch	tlb_ubc;
diff --git a/kernel/rseq.c b/kernel/rseq.c
index b7a1ec327e811..100f81e330dc6 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -441,6 +441,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	}
 	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;
+	/* The mm_cid compaction returns prematurely if scan is not needed. */
+	task_mm_cid_work(t);
 	return;
 
 error:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81c6df746df17..27b856a1cb0a9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10589,22 +10589,13 @@ static void sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
 	sched_mm_cid_remote_clear(mm, pcpu_cid, cpu);
 }
 
-static void task_mm_cid_work(struct callback_head *work)
+void task_mm_cid_work(struct task_struct *t)
 {
 	unsigned long now = jiffies, old_scan, next_scan;
-	struct task_struct *t = current;
 	struct cpumask *cidmask;
-	struct mm_struct *mm;
 	int weight, cpu;
+	struct mm_struct *mm = t->mm;
 
-	WARN_ON_ONCE(t != container_of(work, struct task_struct, cid_work));
-
-	work->next = work;	/* Prevent double-add */
-	if (t->flags & PF_EXITING)
-		return;
-	mm = t->mm;
-	if (!mm)
-		return;
 	old_scan = READ_ONCE(mm->mm_cid_next_scan);
 	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
 	if (!old_scan) {
@@ -10643,23 +10634,47 @@ void init_sched_mm_cid(struct task_struct *t)
 		if (mm_users == 1)
 			mm->mm_cid_next_scan = jiffies + msecs_to_jiffies(MM_CID_SCAN_DELAY);
 	}
-	t->cid_work.next = &t->cid_work;	/* Protect against double add */
-	init_task_work(&t->cid_work, task_mm_cid_work);
 }
 
 void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
 {
-	struct callback_head *work = &curr->cid_work;
-	unsigned long now = jiffies;
+	u64 rtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime;
 
+	/*
+	 * If a task is running unpreempted for a long time, it won't get its
+	 * mm_cid compacted and won't update its mm_cid value after a
+	 * compaction occurs.
+	 * For such a task, this function does two things:
+	 * A) trigger the mm_cid recompaction,
+	 * B) trigger an update of the task's rseq->mm_cid field at some point
+	 * after recompaction, so it can get a mm_cid value closer to 0.
+	 * A change in the mm_cid triggers an rseq_preempt.
+	 *
+	 * B occurs once after the compaction work completes, neither A nor B
+	 * run as long as the compaction work is pending, the task is exiting
+	 * or is not a userspace task.
+	 */
 	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
-	    work->next != work)
+	    test_tsk_thread_flag(curr, TIF_NOTIFY_RESUME))
 		return;
-	if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
+	if (rtime < RSEQ_UNPREEMPTED_THRESHOLD)
 		return;
-
-	/* No page allocation under rq lock */
-	task_work_add(curr, work, TWA_RESUME);
+	if (mm_cid_needs_scan(curr->mm)) {
+		/* Trigger mm_cid recompaction */
+		rseq_set_notify_resume(curr);
+	} else if (time_after(jiffies, curr->last_cid_reset +
+			      msecs_to_jiffies(MM_CID_SCAN_DELAY))) {
+		/* Update mm_cid field */
+		int old_cid = curr->mm_cid;
+
+		if (!curr->mm_cid_active)
+			return;
+		mm_cid_snapshot_time(rq, curr->mm);
+		mm_cid_put_lazy(curr);
+		curr->last_mm_cid = curr->mm_cid = mm_cid_get(rq, curr, curr->mm);
+		if (old_cid != curr->mm_cid)
+			rseq_preempt(curr);
+	}
 }
 
 void sched_mm_cid_exit_signals(struct task_struct *t)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295e..90a5b58188232 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3606,6 +3606,7 @@ extern const char *preempt_modes[];
 
 #define SCHED_MM_CID_PERIOD_NS	(100ULL * 1000000)	/* 100ms */
 #define MM_CID_SCAN_DELAY	100			/* 100ms */
+#define RSEQ_UNPREEMPTED_THRESHOLD	SCHED_MM_CID_PERIOD_NS
 
 extern raw_spinlock_t cid_lock;
 extern int use_cid_lock;
@@ -3809,6 +3810,7 @@ static inline int mm_cid_get(struct rq *rq, struct task_struct *t,
 	int cid;
 
 	lockdep_assert_rq_held(rq);
+	t->last_cid_reset = jiffies;
 	cpumask = mm_cidmask(mm);
 	cid = __this_cpu_read(pcpu_cid->cid);
 	if (mm_cid_is_valid(cid)) {
-- 
2.50.1
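
Illustrative note, not part of the patch: the decision order in the reworked task_tick_mm_cid() can be easier to follow as a small userspace model. The sketch below is only an approximation under stated assumptions. Every name in it (sketch_task, sketch_tick, sketch_time_after, SKETCH_UNPREEMPTED_NS) is hypothetical, the struct fields stand in for kernel state such as TIF_NOTIFY_RESUME and mm->mm_cid_next_scan, and jiffies are modelled as plain unsigned long counters. It does not call any kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for RSEQ_UNPREEMPTED_THRESHOLD (100ms in ns). */
#define SKETCH_UNPREEMPTED_NS (100ULL * 1000000)

/* What the modelled tick handler decides to do. */
enum sketch_action {
	SKETCH_NONE,               /* nothing to do on this tick */
	SKETCH_REQUEST_COMPACTION, /* case A: ask for a notify-resume pass */
	SKETCH_REFRESH_CID,        /* case B: re-read the task's mm_cid */
};

/* Hypothetical flattened view of the state the handler looks at. */
struct sketch_task {
	bool has_mm;                /* models curr->mm != NULL */
	bool exiting_or_kthread;    /* models PF_EXITING | PF_KTHREAD */
	bool notify_resume_pending; /* models TIF_NOTIFY_RESUME being set */
	bool mm_cid_active;         /* models curr->mm_cid_active */
	uint64_t unpreempted_ns;    /* runtime since the task was scheduled in */
	unsigned long next_scan;    /* models mm->mm_cid_next_scan */
	unsigned long last_cid_reset; /* models curr->last_cid_reset */
};

/* Wraparound-safe "a is after b", in the spirit of the kernel's time_after(). */
static inline bool sketch_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

static inline enum sketch_action sketch_tick(const struct sketch_task *t,
					     unsigned long now,
					     unsigned long delay_jiffies)
{
	assert(t);

	/* Neither A nor B runs for these tasks or while work is pending. */
	if (!t->has_mm || t->exiting_or_kthread || t->notify_resume_pending)
		return SKETCH_NONE;
	/* Only tasks that ran unpreempted for a full period are handled. */
	if (t->unpreempted_ns < SKETCH_UNPREEMPTED_NS)
		return SKETCH_NONE;
	/* A) the scan period elapsed: request the compaction. */
	if (sketch_time_after(now, t->next_scan))
		return SKETCH_REQUEST_COMPACTION;
	/* B) the cid was last reset more than one period ago: refresh it. */
	if (sketch_time_after(now, t->last_cid_reset + delay_jiffies) &&
	    t->mm_cid_active)
		return SKETCH_REFRESH_CID;
	return SKETCH_NONE;
}
```

In this model, case A takes priority over case B, and a pending notify-resume suppresses both, which is how the comment in the hunk above describes the intended behaviour.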