* [PATCH AUTOSEL 6.1 2/7] rcu-tasks: Repair RCU Tasks Trace quiescence check
2024-03-24 17:07 [PATCH AUTOSEL 6.1 1/7] sysv: don't call sb_bread() with pointers_lock held Sasha Levin
@ 2024-03-24 17:07 ` Sasha Levin
2024-03-24 17:07 ` [PATCH AUTOSEL 6.1 3/7] rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocks Sasha Levin
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Sasha Levin @ 2024-03-24 17:07 UTC
To: linux-kernel, stable
Cc: Paul E. McKenney, Steven Rostedt, Boqun Feng, Sasha Levin,
frederic, quic_neeraju, joel, josh, rcu
From: "Paul E. McKenney" <paulmck@kernel.org>
[ Upstream commit 2eb52fa8900e642b3b5054c4bf9776089d2a935f ]
The context-switch-time check for RCU Tasks Trace quiescence expects
current->trc_reader_special.b.need_qs to be zero, and if so, updates
it to TRC_NEED_QS_CHECKED. This is backwards, because if this value
is zero, there is no RCU Tasks Trace grace period in flight, and thus
no need for a quiescent state. Instead, when a grace period starts,
this field is set to TRC_NEED_QS.
This commit therefore changes the check from zero to TRC_NEED_QS.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
include/linux/rcupdate.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index d2507168b9c7b..e7474d7833424 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -205,9 +205,9 @@ void rcu_tasks_trace_qs_blkd(struct task_struct *t);
do { \
int ___rttq_nesting = READ_ONCE((t)->trc_reader_nesting); \
\
- if (likely(!READ_ONCE((t)->trc_reader_special.b.need_qs)) && \
+ if (unlikely(READ_ONCE((t)->trc_reader_special.b.need_qs) == TRC_NEED_QS) && \
likely(!___rttq_nesting)) { \
- rcu_trc_cmpxchg_need_qs((t), 0, TRC_NEED_QS_CHECKED); \
+ rcu_trc_cmpxchg_need_qs((t), TRC_NEED_QS, TRC_NEED_QS_CHECKED); \
} else if (___rttq_nesting && ___rttq_nesting != INT_MIN && \
!READ_ONCE((t)->trc_reader_special.b.blocked)) { \
rcu_tasks_trace_qs_blkd(t); \
--
2.43.0
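The corrected check in the rcu-tasks patch above can be modelled outside the kernel. What follows is a minimal user-space sketch, not the kernel macro: `struct model_task`, `model_trace_qs()` and the `QS_*` constants are hypothetical stand-ins with illustrative values, and C11 atomics take the place of READ_ONCE() and rcu_trc_cmpxchg_need_qs().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the need_qs states; values are illustrative. */
enum { QS_CLEAR = 0, QS_NEEDED = 1, QS_CHECKED = 2 };

struct model_task {
	_Atomic unsigned char need_qs;	/* models trc_reader_special.b.need_qs */
	int reader_nesting;		/* models trc_reader_nesting */
};

/*
 * Report a quiescent state only if a grace period asked for one
 * (need_qs == QS_NEEDED) and the task is not inside a reader.
 * Returns true if the state was advanced to QS_CHECKED.
 */
static bool model_trace_qs(struct model_task *t)
{
	unsigned char expected = QS_NEEDED;

	if (atomic_load(&t->need_qs) != QS_NEEDED || t->reader_nesting != 0)
		return false;
	return atomic_compare_exchange_strong(&t->need_qs, &expected, QS_CHECKED);
}
```

With the old zero-based test, a task with no grace period in flight would have been marked as checked; in this model such a task is simply left alone.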
* [PATCH AUTOSEL 6.1 3/7] rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocks
From: Sasha Levin @ 2024-03-24 17:07 UTC
To: linux-kernel, stable
Cc: Paul E. McKenney, Chen Zhongjin, Yang Jihong, Frederic Weisbecker,
Boqun Feng, Sasha Levin, mingo, peterz, juri.lelli,
vincent.guittot, quic_neeraju, joel, josh, rcu
From: "Paul E. McKenney" <paulmck@kernel.org>
[ Upstream commit bfe93930ea1ea3c6c115a7d44af6e4fea609067e ]
Holding a mutex across synchronize_rcu_tasks() and acquiring
that same mutex in code called from do_exit() after its call to
exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
results in deadlock. This is by design, because tasks that are far
enough into do_exit() are no longer present on the tasks list, making
it a bit difficult for RCU Tasks to find them, let alone wait on them
to do a voluntary context switch. However, such deadlocks are becoming
more frequent. In addition, lockdep currently does not detect such
deadlocks and they can be difficult to reproduce.
In addition, if a task voluntarily context switches during that time
(for example, if it blocks acquiring a mutex), then this task is in an
RCU Tasks quiescent state. And with some adjustments, RCU Tasks could
just as well take advantage of that fact.
This commit therefore adds the data structures that will be needed
to rely on these quiescent states and to eliminate these deadlocks.
Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
include/linux/sched.h | 2 ++
kernel/rcu/tasks.h | 2 ++
2 files changed, 4 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0cac69902ec58..ffcd100de169c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -848,6 +848,8 @@ struct task_struct {
u8 rcu_tasks_idx;
int rcu_tasks_idle_cpu;
struct list_head rcu_tasks_holdout_list;
+ int rcu_tasks_exit_cpu;
+ struct list_head rcu_tasks_exit_list;
#endif /* #ifdef CONFIG_TASKS_RCU */
#ifdef CONFIG_TASKS_TRACE_RCU
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index b5d5b6cf093a7..919c22698569e 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -30,6 +30,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
* @rtp_irq_work: IRQ work queue for deferred wakeups.
* @barrier_q_head: RCU callback for barrier operation.
* @rtp_blkd_tasks: List of tasks blocked as readers.
+ * @rtp_exit_list: List of tasks in the latter portion of do_exit().
* @cpu: CPU number corresponding to this entry.
* @rtpp: Pointer to the rcu_tasks structure.
*/
@@ -42,6 +43,7 @@ struct rcu_tasks_percpu {
struct irq_work rtp_irq_work;
struct rcu_head barrier_q_head;
struct list_head rtp_blkd_tasks;
+ struct list_head rtp_exit_list;
int cpu;
struct rcu_tasks *rtpp;
};
--
2.43.0
* [PATCH AUTOSEL 6.1 4/7] rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks
From: Sasha Levin @ 2024-03-24 17:07 UTC
To: linux-kernel, stable
Cc: Paul E. McKenney, Chen Zhongjin, Yang Jihong, Frederic Weisbecker,
Boqun Feng, Sasha Levin, quic_neeraju, joel, josh, rcu
From: "Paul E. McKenney" <paulmck@kernel.org>
[ Upstream commit 6b70399f9ef3809f6e308fd99dd78b072c1bd05c ]
This commit continues the elimination of deadlocks involving do_exit()
and RCU tasks by causing exit_tasks_rcu_start() to add the current
task to a per-CPU list and causing exit_tasks_rcu_stop() to remove the
current task from whatever list it is on. These lists will be used to
track tasks that are exiting, while still accounting for any RCU-tasks
quiescent states that these tasks pass through.
[ paulmck: Apply Frederic Weisbecker feedback. ]
Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
kernel/rcu/tasks.h | 43 +++++++++++++++++++++++++++++++++----------
1 file changed, 33 insertions(+), 10 deletions(-)
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 919c22698569e..76ad0bc2b3312 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1014,25 +1014,48 @@ EXPORT_SYMBOL_GPL(show_rcu_tasks_classic_gp_kthread);
#endif // !defined(CONFIG_TINY_RCU)
/*
- * Contribute to protect against tasklist scan blind spot while the
- * task is exiting and may be removed from the tasklist. See
- * corresponding synchronize_srcu() for further details.
+ * Protect against tasklist scan blind spot while the task is exiting and
+ * may be removed from the tasklist. Do this by adding the task to yet
+ * another list.
+ *
+ * Note that the task will remove itself from this list, so there is no
+ * need for get_task_struct(), except in the case where rcu_tasks_pertask()
+ * adds it to the holdout list, in which case rcu_tasks_pertask() supplies
+ * the needed get_task_struct().
*/
-void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu)
+void exit_tasks_rcu_start(void)
{
- current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu);
+ unsigned long flags;
+ struct rcu_tasks_percpu *rtpcp;
+ struct task_struct *t = current;
+
+ WARN_ON_ONCE(!list_empty(&t->rcu_tasks_exit_list));
+ preempt_disable();
+ rtpcp = this_cpu_ptr(rcu_tasks.rtpcpu);
+ t->rcu_tasks_exit_cpu = smp_processor_id();
+ raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
+ if (!rtpcp->rtp_exit_list.next)
+ INIT_LIST_HEAD(&rtpcp->rtp_exit_list);
+ list_add(&t->rcu_tasks_exit_list, &rtpcp->rtp_exit_list);
+ raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
+ preempt_enable();
}
/*
- * Contribute to protect against tasklist scan blind spot while the
- * task is exiting and may be removed from the tasklist. See
- * corresponding synchronize_srcu() for further details.
+ * Remove the task from the "yet another list" because do_exit() is now
+ * non-preemptible, allowing synchronize_rcu() to wait beyond this point.
*/
-void exit_tasks_rcu_stop(void) __releases(&tasks_rcu_exit_srcu)
+void exit_tasks_rcu_stop(void)
{
+ unsigned long flags;
+ struct rcu_tasks_percpu *rtpcp;
struct task_struct *t = current;
- __srcu_read_unlock(&tasks_rcu_exit_srcu, t->rcu_tasks_idx);
+ WARN_ON_ONCE(list_empty(&t->rcu_tasks_exit_list));
+ rtpcp = per_cpu_ptr(rcu_tasks.rtpcpu, t->rcu_tasks_exit_cpu);
+ raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
+ list_del_init(&t->rcu_tasks_exit_list);
+ raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
}
/*
--
2.43.0
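The list handling in the patch above (add the exiting task to the current CPU's list, record that CPU, later remove the task from whatever list it is on) can be sketched in user space. This is an illustrative model only: `model_list`, `model_task`, `model_exit_start()` and `model_exit_stop()` are hypothetical names, the CPU is passed in explicitly, and the per-CPU lock that the kernel takes around each list operation is deliberately omitted.

```c
#include <stddef.h>

#define MODEL_NR_CPUS 2	/* hypothetical CPU count for the model */

struct model_list { struct model_list *next, *prev; };

struct model_task {
	int exit_cpu;			/* models rcu_tasks_exit_cpu */
	struct model_list exit_list;	/* models rcu_tasks_exit_list */
};

/* One list head per modelled CPU, like rtp_exit_list; zero-initialized. */
static struct model_list model_exit_lists[MODEL_NR_CPUS];

static void model_list_init(struct model_list *l) { l->next = l->prev = l; }
static int model_list_empty(const struct model_list *l) { return l->next == l; }

/* Add the task to the list of the given CPU and remember that CPU. */
static void model_exit_start(struct model_task *t, int cpu)
{
	struct model_list *head = &model_exit_lists[cpu];

	if (!head->next)		/* lazily initialize, as the patch does */
		model_list_init(head);
	t->exit_cpu = cpu;
	t->exit_list.next = head->next;
	t->exit_list.prev = head;
	head->next->prev = &t->exit_list;
	head->next = &t->exit_list;
}

/* Unlink the task from whatever list it is on and mark it unlisted. */
static void model_exit_stop(struct model_task *t)
{
	t->exit_list.prev->next = t->exit_list.next;
	t->exit_list.next->prev = t->exit_list.prev;
	model_list_init(&t->exit_list);
}
```

In the kernel code the recorded CPU matters because the task may have migrated by the time it is removed, so the lock to take is the one for the CPU whose list the task was added to, not the CPU it is currently running on.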
* [PATCH AUTOSEL 6.1 5/7] block: prevent division by zero in blk_rq_stat_sum()
From: Sasha Levin @ 2024-03-24 17:07 UTC
To: linux-kernel, stable
Cc: Roman Smirnov, Sergey Shtylyov, Jens Axboe, Sasha Levin,
linux-block
From: Roman Smirnov <r.smirnov@omp.ru>
[ Upstream commit 93f52fbeaf4b676b21acfe42a5152620e6770d02 ]
The expression dst->nr_samples + src->nr_samples may
wrap around to zero on overflow. Add a check to avoid
division by zero.
Found by Linux Verification Center (linuxtesting.org) with Svace.
Signed-off-by: Roman Smirnov <r.smirnov@omp.ru>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/20240305134509.23108-1-r.smirnov@omp.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
block/blk-stat.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-stat.c b/block/blk-stat.c
index da9407b7d4abf..41be89ecaf20e 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -28,7 +28,7 @@ void blk_rq_stat_init(struct blk_rq_stat *stat)
/* src is a per-cpu stat, mean isn't initialized */
void blk_rq_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
{
- if (!src->nr_samples)
+ if (dst->nr_samples + src->nr_samples <= dst->nr_samples)
return;
dst->min = min(dst->min, src->min);
--
2.43.0
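The new condition in the block patch above covers two cases at once: an empty source (the sum equals dst) and a wrapped sum (the sum is smaller than dst). A small user-space sketch of that test follows; `model_can_merge()` is a hypothetical helper, and treating the sample counters as 32-bit unsigned values is an assumption of this model.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when adding src samples to dst yields a strictly larger count,
 * that is, src is nonzero and the 32-bit sum did not wrap around.
 * This is the negation of the early-return condition in the patch.
 */
static bool model_can_merge(uint32_t dst_samples, uint32_t src_samples)
{
	uint32_t sum = dst_samples + src_samples;	/* wraps modulo 2^32 */

	return sum > dst_samples;
}
```

Because a sum that passes this test is strictly greater than dst, it is never zero, so using it as a divisor is safe in this model.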
* [PATCH AUTOSEL 6.1 6/7] fs: improve dump_mapping() robustness
From: Sasha Levin @ 2024-03-24 17:07 UTC
To: linux-kernel, stable
Cc: Baolin Wang, Matthew Wilcox, Christian Brauner, Sasha Levin, viro,
linux-fsdevel
From: Baolin Wang <baolin.wang@linux.alibaba.com>
[ Upstream commit 8b3d838139bcd1e552f1899191f734264ce2a1a5 ]
We hit a kernel crash while running stress-ng tests: the system
crashes when printing the dentry name in dump_mapping().
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
pc : dentry_name+0xd8/0x224
lr : pointer+0x22c/0x370
sp : ffff800025f134c0
......
Call trace:
dentry_name+0xd8/0x224
pointer+0x22c/0x370
vsnprintf+0x1ec/0x730
vscnprintf+0x2c/0x60
vprintk_store+0x70/0x234
vprintk_emit+0xe0/0x24c
vprintk_default+0x3c/0x44
vprintk_func+0x84/0x2d0
printk+0x64/0x88
__dump_page+0x52c/0x530
dump_page+0x14/0x20
set_migratetype_isolate+0x110/0x224
start_isolate_page_range+0xc4/0x20c
offline_pages+0x124/0x474
memory_block_offline+0x44/0xf4
memory_subsys_offline+0x3c/0x70
device_offline+0xf0/0x120
......
The root cause is that one thread is doing page migration, which uses
the target page's ->mapping field to save the 'anon_vma' pointer between
page unmap and page move; at that point the target page is locked and its
refcount is 1.
Concurrently, another stress-ng thread is performing memory hotplug and
attempts to offline the target page that is being migrated. It discovers
that the refcount of this target page is 1, which prevents the offline
operation, and so proceeds to dump the page. However, page_mapping() of the
target page may return an incorrect file mapping and crash the system in
dump_mapping(), since the target page->mapping only saves the 'anon_vma'
pointer without setting the PAGE_MAPPING_ANON flag.
The page migration issue has been fixed by commit d1adb25df711 ("mm: migrate:
fix getting incorrect page mapping during page migration"). In addition,
Matthew suggested we should also improve dump_mapping()'s robustness to
be resilient against the kernel crash [1].
With checks on the 'dentry.d_parent' and 'dentry.d_name.name' fields used
by dentry_name(), I can see that dump_mapping() outputs the invalid dentry
instead of crashing the system when this issue is reproduced again.
[12211.189128] page:fffff7de047741c0 refcount:1 mapcount:0 mapping:ffff989117f55ea0 index:0x1 pfn:0x211dd07
[12211.189144] aops:0x0 ino:1 invalid dentry:74786574206e6870
[12211.189148] flags: 0x57ffffc0000001(locked|node=1|zone=2|lastcpupid=0x1fffff)
[12211.189150] page_type: 0xffffffff()
[12211.189153] raw: 0057ffffc0000001 0000000000000000 dead000000000122 ffff989117f55ea0
[12211.189154] raw: 0000000000000001 0000000000000001 00000001ffffffff 0000000000000000
[12211.189155] page dumped because: unmovable page
[1] https://lore.kernel.org/all/ZXxn%2F0oixJxxAnpF@casper.infradead.org/
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://lore.kernel.org/r/937ab1f87328516821d39be672b6bc18861d9d3e.1705391420.git.baolin.wang@linux.alibaba.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/inode.c b/fs/inode.c
index 8cfda7a6d5900..5ad22efb5def4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -589,7 +589,8 @@ void dump_mapping(const struct address_space *mapping)
}
dentry_ptr = container_of(dentry_first, struct dentry, d_u.d_alias);
- if (get_kernel_nofault(dentry, dentry_ptr)) {
+ if (get_kernel_nofault(dentry, dentry_ptr) ||
+ !dentry.d_parent || !dentry.d_name.name) {
pr_warn("aops:%ps ino:%lx invalid dentry:%px\n",
a_ops, ino, dentry_ptr);
return;
--
2.43.0
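The extra validation in the fs patch above amounts to refusing to print a dentry copy whose parent or name pointer is NULL. A reduced user-space sketch is shown here; `struct model_dentry` and `model_printable_name()` are hypothetical, and the safe copy that the kernel performs with get_kernel_nofault() has no user-space equivalent and is not modelled.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for the fields dentry_name() dereferences. */
struct model_dentry {
	const struct model_dentry *parent;	/* models d_parent */
	const char *name;			/* models d_name.name */
};

/*
 * Return the name to print, or NULL if the copied dentry looks invalid
 * because either pointer that the name formatter would follow is NULL.
 */
static const char *model_printable_name(const struct model_dentry *copy)
{
	if (!copy->parent || !copy->name)
		return NULL;
	return copy->name;
}
```

A NULL check like this only catches the zeroed-pointer case seen in the crash report; it cannot detect pointers that are non-NULL but stale.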
* [PATCH AUTOSEL 6.1 7/7] nvme: clear caller pointer on identify failure
From: Sasha Levin @ 2024-03-24 17:07 UTC
To: linux-kernel, stable
Cc: Keith Busch, Christoph Hellwig, Sasha Levin, sagi, linux-nvme
From: Keith Busch <kbusch@kernel.org>
[ Upstream commit 7e80eb792bd7377a20f204943ac31c77d859be89 ]
The memory allocated for the identification is freed on failure. Set
it to NULL so the caller doesn't have a pointer to that freed address.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/nvme/host/core.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0c088db944706..20c79cc67ce54 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1363,8 +1363,10 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct nvme_id_ctrl **id)
error = nvme_submit_sync_cmd(dev->admin_q, &c, *id,
sizeof(struct nvme_id_ctrl));
- if (error)
+ if (error) {
kfree(*id);
+ *id = NULL;
+ }
return error;
}
@@ -1493,6 +1495,7 @@ static int nvme_identify_ns(struct nvme_ctrl *ctrl, unsigned nsid,
if (error) {
dev_warn(ctrl->device, "Identify namespace failed (%d)\n", error);
kfree(*id);
+ *id = NULL;
}
return error;
}
--
2.43.0
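The free-then-clear pattern from the nvme patch above can be shown with a small user-space analogue. Everything in this sketch is hypothetical (`struct model_id`, `model_identify()`, and the two `model_submit_*` callbacks); malloc-family calls stand in for the kernel allocator, and the callback stands in for the command submission.

```c
#include <errno.h>
#include <stdlib.h>

struct model_id { unsigned char data[64]; };	/* hypothetical payload */

/* Hypothetical command callback: returns 0 on success, nonzero on failure. */
typedef int (*model_submit_fn)(void *buf, size_t len);

static int model_submit_ok(void *buf, size_t len)
{
	(void)buf; (void)len;
	return 0;
}

static int model_submit_fail(void *buf, size_t len)
{
	(void)buf; (void)len;
	return -EIO;
}

/*
 * Allocate *id, fill it through submit(), and on failure free the buffer
 * and clear *id so the caller never holds a pointer to freed memory.
 */
static int model_identify(model_submit_fn submit, struct model_id **id)
{
	int error;

	*id = calloc(1, sizeof(**id));
	if (!*id)
		return -ENOMEM;

	error = submit(*id, sizeof(**id));
	if (error) {
		free(*id);
		*id = NULL;
	}
	return error;
}
```

With the pointer cleared, a caller's cleanup path can call free(id) unconditionally, since freeing NULL is a no-op; without the clearing, the same cleanup would be a double free.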