* [PATCH V2 0/3] block: model freeze/enter queue as lock for lockdep
@ 2024-10-25 0:37 Ming Lei
2024-10-25 0:37 ` [PATCH V2 1/3] blk-mq: add non_owner variant of start_freeze/unfreeze queue APIs Ming Lei
` (3 more replies)
0 siblings, 4 replies; 25+ messages in thread
From: Ming Lei @ 2024-10-25 0:37 UTC
To: Jens Axboe, linux-block
Cc: Christoph Hellwig, Peter Zijlstra, Waiman Long, Boqun Feng,
Ingo Molnar, Will Deacon, linux-kernel, Bart Van Assche, Ming Lei
Hello,
The 1st patch adds non_owner variants of start_freeze/unfreeze queue
API.
The 2nd patch applies the non_owner variants to nvme_start_freeze() & nvme_unfreeze().
The 3rd patch models freeze/enter queue as lock for lockdep support.
V2:
- improve commit log & documentation (Christoph)
- add reviewed-by tag
Ming Lei (3):
blk-mq: add non_owner variant of start_freeze/unfreeze queue APIs
nvme: core: switch to non_owner variant of start_freeze/unfreeze queue
block: model freeze & enter queue as lock for supporting lockdep
block/blk-core.c | 18 ++++++++++++++--
block/blk-mq.c | 44 +++++++++++++++++++++++++++++++++++++---
block/blk.h | 29 +++++++++++++++++++++++---
block/genhd.c | 15 ++++++++++----
drivers/nvme/host/core.c | 9 ++++++--
include/linux/blk-mq.h | 2 ++
include/linux/blkdev.h | 6 ++++++
7 files changed, 109 insertions(+), 14 deletions(-)
--
2.46.0
* [PATCH V2 1/3] blk-mq: add non_owner variant of start_freeze/unfreeze queue APIs
2024-10-25 0:37 [PATCH V2 0/3] block: model freeze/enter queue as lock for lockdep Ming Lei
@ 2024-10-25 0:37 ` Ming Lei
2024-10-25 0:37 ` [PATCH V2 2/3] nvme: core: switch to non_owner variant of start_freeze/unfreeze queue Ming Lei
` (2 subsequent siblings)
3 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2024-10-25 0:37 UTC
To: Jens Axboe, linux-block
Cc: Christoph Hellwig, Peter Zijlstra, Waiman Long, Boqun Feng,
Ingo Molnar, Will Deacon, linux-kernel, Bart Van Assche, Ming Lei
Add non_owner variants of the start_freeze/unfreeze queue APIs, so that
callers make it explicit what they are doing, and we can skip lockdep
support for the non_owner variants at the per-call level.
This prepares for adding lockdep support for freezing/unfreezing queues.
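As a rough illustration of the intended calling convention (a hypothetical
sketch, not code from this patch; the example_* helpers don't exist in the
tree), the task that starts the freeze may differ from the task that
releases it:

static void example_timeout_work(struct request_queue *q)
{
	/* the freeze starts here, e.g. from a timeout work item ... */
	blk_freeze_queue_start_non_owner(q);
}

static void example_reset_work(struct request_queue *q)
{
	/* ... and is released later from a different task */
	blk_mq_unfreeze_queue_non_owner(q);
}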
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 20 ++++++++++++++++++++
include/linux/blk-mq.h | 2 ++
2 files changed, 22 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4b2c8e940f59..96858fb3b9ff 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -196,6 +196,26 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
}
EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
+/*
+ * non_owner variant of blk_freeze_queue_start
+ *
+ * Unlike blk_freeze_queue_start, the queue doesn't need to be unfrozen
+ * by the same task. This is fragile and should not be used if at all
+ * possible.
+ */
+void blk_freeze_queue_start_non_owner(struct request_queue *q)
+{
+ blk_freeze_queue_start(q);
+}
+EXPORT_SYMBOL_GPL(blk_freeze_queue_start_non_owner);
+
+/* non_owner variant of blk_mq_unfreeze_queue */
+void blk_mq_unfreeze_queue_non_owner(struct request_queue *q)
+{
+ __blk_mq_unfreeze_queue(q, false);
+}
+EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue_non_owner);
+
/*
* FIXME: replace the scsi_internal_device_*block_nowait() calls in the
* mpt3sas driver such that this function can be removed.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 4fecf46ef681..c5063e0a38a0 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -925,6 +925,8 @@ void blk_freeze_queue_start(struct request_queue *q);
void blk_mq_freeze_queue_wait(struct request_queue *q);
int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
unsigned long timeout);
+void blk_mq_unfreeze_queue_non_owner(struct request_queue *q);
+void blk_freeze_queue_start_non_owner(struct request_queue *q);
void blk_mq_map_queues(struct blk_mq_queue_map *qmap);
void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
--
2.46.0
* [PATCH V2 2/3] nvme: core: switch to non_owner variant of start_freeze/unfreeze queue
2024-10-25 0:37 [PATCH V2 0/3] block: model freeze/enter queue as lock for lockdep Ming Lei
2024-10-25 0:37 ` [PATCH V2 1/3] blk-mq: add non_owner variant of start_freeze/unfreeze queue APIs Ming Lei
@ 2024-10-25 0:37 ` Ming Lei
2024-10-25 0:37 ` [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep Ming Lei
2024-10-26 13:15 ` [PATCH V2 0/3] block: model freeze/enter queue as lock for lockdep Jens Axboe
3 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2024-10-25 0:37 UTC
To: Jens Axboe, linux-block
Cc: Christoph Hellwig, Peter Zijlstra, Waiman Long, Boqun Feng,
Ingo Molnar, Will Deacon, linux-kernel, Bart Van Assche, Ming Lei
nvme_start_freeze() and nvme_unfreeze() may be called from different
contexts, so switch them to call the non_owner variants of the
start_freeze/unfreeze queue APIs.
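For context, a minimal sketch of that asymmetry (simplified, not the
literal driver code; the sketch_* helpers are hypothetical):

static void sketch_timeout_path(struct nvme_ctrl *ctrl)
{
	nvme_start_freeze(ctrl);	/* called from timeout work */
}

static void sketch_reset_work(struct nvme_ctrl *ctrl)
{
	nvme_unfreeze(ctrl);		/* called from the reset worker */
}

Since the task that starts the freeze never releases it, the per-task
lockdep annotations added in the next patch must be skipped here, which
is exactly what the non_owner variants do.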
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/nvme/host/core.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ba6508455e18..66d76a9296b1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4871,7 +4871,7 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl)
srcu_idx = srcu_read_lock(&ctrl->srcu);
list_for_each_entry_rcu(ns, &ctrl->namespaces, list)
- blk_mq_unfreeze_queue(ns->queue);
+ blk_mq_unfreeze_queue_non_owner(ns->queue);
srcu_read_unlock(&ctrl->srcu, srcu_idx);
clear_bit(NVME_CTRL_FROZEN, &ctrl->flags);
}
@@ -4913,7 +4913,12 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
set_bit(NVME_CTRL_FROZEN, &ctrl->flags);
srcu_idx = srcu_read_lock(&ctrl->srcu);
list_for_each_entry_rcu(ns, &ctrl->namespaces, list)
- blk_freeze_queue_start(ns->queue);
+ /*
+ * The typical non_owner use case is the pci driver, where
+ * start_freeze is called from the timeout work function but
+ * unfreeze is done in the reset work context.
+ */
+ blk_freeze_queue_start_non_owner(ns->queue);
srcu_read_unlock(&ctrl->srcu, srcu_idx);
}
EXPORT_SYMBOL_GPL(nvme_start_freeze);
--
2.46.0
* [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-10-25 0:37 [PATCH V2 0/3] block: model freeze/enter queue as lock for lockdep Ming Lei
2024-10-25 0:37 ` [PATCH V2 1/3] blk-mq: add non_owner variant of start_freeze/unfreeze queue APIs Ming Lei
2024-10-25 0:37 ` [PATCH V2 2/3] nvme: core: switch to non_owner variant of start_freeze/unfreeze queue Ming Lei
@ 2024-10-25 0:37 ` Ming Lei
[not found] ` <CGME20241029111338eucas1p2bd56c697b825eef235604e892569207e@eucas1p2.samsung.com>
2024-10-30 6:45 ` Lai, Yi
2024-10-26 13:15 ` [PATCH V2 0/3] block: model freeze/enter queue as lock for lockdep Jens Axboe
3 siblings, 2 replies; 25+ messages in thread
From: Ming Lei @ 2024-10-25 0:37 UTC
To: Jens Axboe, linux-block
Cc: Christoph Hellwig, Peter Zijlstra, Waiman Long, Boqun Feng,
Ingo Molnar, Will Deacon, linux-kernel, Bart Van Assche, Ming Lei
Recently we got several deadlock reports[1][2][3] caused by
blk_mq_freeze_queue() and blk_enter_queue().
It turns out the two behave just like acquiring a read/write lock, so
model them as a read/write lock for lockdep support:
1) model q->q_usage_counter as two locks (io lock and queue lock)
- queue lock covers sync with blk_enter_queue()
- io lock covers sync with bio_enter_queue()
2) make the lockdep class/key per-queue:
- different subsystems have very different lock usage patterns; a shared
lock class easily causes false positives
- freeze_queue degrades to no lock once the disk state becomes DEAD,
because bio_enter_queue() won't be blocked any more
- freeze_queue degrades to no lock once the request queue becomes dying,
because blk_enter_queue() won't be blocked any more
3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
- it is an exclusive lock, so the dependency with blk_enter_queue() is covered
- it is a trylock because blk_mq_freeze_queue() is allowed to run
concurrently
4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
- nested blk_enter_queue() calls are allowed
- the dependency with blk_mq_freeze_queue() is covered
- blk_queue_exit() is often called from other contexts (such as irq) and
can't be annotated as lock_release(), so simply do the release in
blk_enter_queue(); this still covers as many cases as possible
With lockdep support, such problems can be reported as soon as possible,
without waiting until the real deadlock is triggered.
For example, a lockdep report can be triggered for case [3] with this
patch applied.
[1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
https://bugzilla.kernel.org/show_bug.cgi?id=219166
[2] del_gendisk() vs blk_queue_enter() race condition
https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/
[3] queue_freeze & queue_enter deadlock in scsi
https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u
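To make the model concrete, the kind of inversion it is meant to flag
looks roughly like this (illustrative sketch only; L stands for any lock
shared by the two paths):

/*
 *   CPU0                              CPU1
 *   ----                              ----
 *   mutex_lock(&L);                   blk_queue_enter(q, 0);  // read
 *   blk_mq_freeze_queue(q);           mutex_lock(&L);         // blocked
 *     // exclusive acquire: waits
 *     // for CPU1 to exit the queue
 *
 * Without annotations this only surfaces as a runtime hang; with
 * q->q_usage_counter modeled as a lock, lockdep records the
 * L -> q_usage_counter vs q_usage_counter -> L ordering conflict and
 * reports it even when the timing never lines up in practice.
 */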
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-core.c | 18 ++++++++++++++++--
block/blk-mq.c | 26 ++++++++++++++++++++++----
block/blk.h | 29 ++++++++++++++++++++++++++---
block/genhd.c | 15 +++++++++++----
include/linux/blkdev.h | 6 ++++++
5 files changed, 81 insertions(+), 13 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index bc5e8c5eaac9..09d10bb95fda 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -261,6 +261,8 @@ static void blk_free_queue(struct request_queue *q)
blk_mq_release(q);
ida_free(&blk_queue_ida, q->id);
+ lockdep_unregister_key(&q->io_lock_cls_key);
+ lockdep_unregister_key(&q->q_lock_cls_key);
call_rcu(&q->rcu_head, blk_free_queue_rcu);
}
@@ -278,18 +280,20 @@ void blk_put_queue(struct request_queue *q)
}
EXPORT_SYMBOL(blk_put_queue);
-void blk_queue_start_drain(struct request_queue *q)
+bool blk_queue_start_drain(struct request_queue *q)
{
/*
* When queue DYING flag is set, we need to block new req
* entering queue, so we call blk_freeze_queue_start() to
* prevent I/O from crossing blk_queue_enter().
*/
- blk_freeze_queue_start(q);
+ bool freeze = __blk_freeze_queue_start(q);
if (queue_is_mq(q))
blk_mq_wake_waiters(q);
/* Make blk_queue_enter() reexamine the DYING flag. */
wake_up_all(&q->mq_freeze_wq);
+
+ return freeze;
}
/**
@@ -321,6 +325,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
return -ENODEV;
}
+ rwsem_acquire_read(&q->q_lockdep_map, 0, 0, _RET_IP_);
+ rwsem_release(&q->q_lockdep_map, _RET_IP_);
return 0;
}
@@ -352,6 +358,8 @@ int __bio_queue_enter(struct request_queue *q, struct bio *bio)
goto dead;
}
+ rwsem_acquire_read(&q->io_lockdep_map, 0, 0, _RET_IP_);
+ rwsem_release(&q->io_lockdep_map, _RET_IP_);
return 0;
dead:
bio_io_error(bio);
@@ -441,6 +449,12 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
if (error)
goto fail_stats;
+ lockdep_register_key(&q->io_lock_cls_key);
+ lockdep_register_key(&q->q_lock_cls_key);
+ lockdep_init_map(&q->io_lockdep_map, "&q->q_usage_counter(io)",
+ &q->io_lock_cls_key, 0);
+ lockdep_init_map(&q->q_lockdep_map, "&q->q_usage_counter(queue)",
+ &q->q_lock_cls_key, 0);
q->nr_requests = BLKDEV_DEFAULT_RQ;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 96858fb3b9ff..76f277a30c11 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -120,17 +120,29 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct block_device *part,
inflight[1] = mi.inflight[1];
}
-void blk_freeze_queue_start(struct request_queue *q)
+bool __blk_freeze_queue_start(struct request_queue *q)
{
+ int freeze;
+
mutex_lock(&q->mq_freeze_lock);
if (++q->mq_freeze_depth == 1) {
percpu_ref_kill(&q->q_usage_counter);
mutex_unlock(&q->mq_freeze_lock);
if (queue_is_mq(q))
blk_mq_run_hw_queues(q, false);
+ freeze = true;
} else {
mutex_unlock(&q->mq_freeze_lock);
+ freeze = false;
}
+
+ return freeze;
+}
+
+void blk_freeze_queue_start(struct request_queue *q)
+{
+ if (__blk_freeze_queue_start(q))
+ blk_freeze_acquire_lock(q, false, false);
}
EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
@@ -176,8 +188,10 @@ void blk_mq_freeze_queue(struct request_queue *q)
}
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
-void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic)
+bool __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic)
{
+ int unfreeze = false;
+
mutex_lock(&q->mq_freeze_lock);
if (force_atomic)
q->q_usage_counter.data->force_atomic = true;
@@ -186,13 +200,17 @@ void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic)
if (!q->mq_freeze_depth) {
percpu_ref_resurrect(&q->q_usage_counter);
wake_up_all(&q->mq_freeze_wq);
+ unfreeze = true;
}
mutex_unlock(&q->mq_freeze_lock);
+
+ return unfreeze;
}
void blk_mq_unfreeze_queue(struct request_queue *q)
{
- __blk_mq_unfreeze_queue(q, false);
+ if (__blk_mq_unfreeze_queue(q, false))
+ blk_unfreeze_release_lock(q, false, false);
}
EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
@@ -205,7 +223,7 @@ EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
*/
void blk_freeze_queue_start_non_owner(struct request_queue *q)
{
- blk_freeze_queue_start(q);
+ __blk_freeze_queue_start(q);
}
EXPORT_SYMBOL_GPL(blk_freeze_queue_start_non_owner);
diff --git a/block/blk.h b/block/blk.h
index c718e4291db0..832e54c5a271 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -4,6 +4,7 @@
#include <linux/bio-integrity.h>
#include <linux/blk-crypto.h>
+#include <linux/lockdep.h>
#include <linux/memblock.h> /* for max_pfn/max_low_pfn */
#include <linux/sched/sysctl.h>
#include <linux/timekeeping.h>
@@ -35,8 +36,9 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
void blk_free_flush_queue(struct blk_flush_queue *q);
void blk_freeze_queue(struct request_queue *q);
-void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
-void blk_queue_start_drain(struct request_queue *q);
+bool __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
+bool blk_queue_start_drain(struct request_queue *q);
+bool __blk_freeze_queue_start(struct request_queue *q);
int __bio_queue_enter(struct request_queue *q, struct bio *bio);
void submit_bio_noacct_nocheck(struct bio *bio);
void bio_await_chain(struct bio *bio);
@@ -69,8 +71,11 @@ static inline int bio_queue_enter(struct bio *bio)
{
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
- if (blk_try_enter_queue(q, false))
+ if (blk_try_enter_queue(q, false)) {
+ rwsem_acquire_read(&q->io_lockdep_map, 0, 0, _RET_IP_);
+ rwsem_release(&q->io_lockdep_map, _RET_IP_);
return 0;
+ }
return __bio_queue_enter(q, bio);
}
@@ -734,4 +739,22 @@ void blk_integrity_verify(struct bio *bio);
void blk_integrity_prepare(struct request *rq);
void blk_integrity_complete(struct request *rq, unsigned int nr_bytes);
+static inline void blk_freeze_acquire_lock(struct request_queue *q, bool
+ disk_dead, bool queue_dying)
+{
+ if (!disk_dead)
+ rwsem_acquire(&q->io_lockdep_map, 0, 1, _RET_IP_);
+ if (!queue_dying)
+ rwsem_acquire(&q->q_lockdep_map, 0, 1, _RET_IP_);
+}
+
+static inline void blk_unfreeze_release_lock(struct request_queue *q, bool
+ disk_dead, bool queue_dying)
+{
+ if (!queue_dying)
+ rwsem_release(&q->q_lockdep_map, _RET_IP_);
+ if (!disk_dead)
+ rwsem_release(&q->io_lockdep_map, _RET_IP_);
+}
+
#endif /* BLK_INTERNAL_H */
diff --git a/block/genhd.c b/block/genhd.c
index 1c05dd4c6980..6ad3fcde0110 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -581,13 +581,13 @@ static void blk_report_disk_dead(struct gendisk *disk, bool surprise)
rcu_read_unlock();
}
-static void __blk_mark_disk_dead(struct gendisk *disk)
+static bool __blk_mark_disk_dead(struct gendisk *disk)
{
/*
* Fail any new I/O.
*/
if (test_and_set_bit(GD_DEAD, &disk->state))
- return;
+ return false;
if (test_bit(GD_OWNS_QUEUE, &disk->state))
blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
@@ -600,7 +600,7 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
/*
* Prevent new I/O from crossing bio_queue_enter().
*/
- blk_queue_start_drain(disk->queue);
+ return blk_queue_start_drain(disk->queue);
}
/**
@@ -641,6 +641,7 @@ void del_gendisk(struct gendisk *disk)
struct request_queue *q = disk->queue;
struct block_device *part;
unsigned long idx;
+ bool start_drain, queue_dying;
might_sleep();
@@ -668,7 +669,10 @@ void del_gendisk(struct gendisk *disk)
* Drop all partitions now that the disk is marked dead.
*/
mutex_lock(&disk->open_mutex);
- __blk_mark_disk_dead(disk);
+ start_drain = __blk_mark_disk_dead(disk);
+ queue_dying = blk_queue_dying(q);
+ if (start_drain)
+ blk_freeze_acquire_lock(q, true, queue_dying);
xa_for_each_start(&disk->part_tbl, idx, part, 1)
drop_partition(part);
mutex_unlock(&disk->open_mutex);
@@ -725,6 +729,9 @@ void del_gendisk(struct gendisk *disk)
if (queue_is_mq(q))
blk_mq_exit_queue(q);
}
+
+ if (start_drain)
+ blk_unfreeze_release_lock(q, true, queue_dying);
}
EXPORT_SYMBOL(del_gendisk);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 50c3b959da28..57f1ee386b57 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -25,6 +25,7 @@
#include <linux/uuid.h>
#include <linux/xarray.h>
#include <linux/file.h>
+#include <linux/lockdep.h>
struct module;
struct request_queue;
@@ -471,6 +472,11 @@ struct request_queue {
struct xarray hctx_table;
struct percpu_ref q_usage_counter;
+ struct lock_class_key io_lock_cls_key;
+ struct lockdep_map io_lockdep_map;
+
+ struct lock_class_key q_lock_cls_key;
+ struct lockdep_map q_lockdep_map;
struct request *last_merge;
--
2.46.0
* Re: [PATCH V2 0/3] block: model freeze/enter queue as lock for lockdep
2024-10-25 0:37 [PATCH V2 0/3] block: model freeze/enter queue as lock for lockdep Ming Lei
` (2 preceding siblings ...)
2024-10-25 0:37 ` [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep Ming Lei
@ 2024-10-26 13:15 ` Jens Axboe
3 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2024-10-26 13:15 UTC
To: linux-block, Ming Lei
Cc: Christoph Hellwig, Peter Zijlstra, Waiman Long, Boqun Feng,
Ingo Molnar, Will Deacon, linux-kernel, Bart Van Assche
On Fri, 25 Oct 2024 08:37:17 +0800, Ming Lei wrote:
> The 1st patch adds non_owner variants of start_freeze/unfreeze queue
> API.
>
> The 2nd patch applies the non_owner variants to nvme_start_freeze() & nvme_unfreeze().
>
> The 3rd patch models freeze/enter queue as lock for lockdep support.
>
> [...]
Applied, thanks!
[1/3] blk-mq: add non_owner variant of start_freeze/unfreeze queue APIs
commit: 8acdd0e7bfadda6b5103f2960d293581954454ed
[2/3] nvme: core: switch to non_owner variant of start_freeze/unfreeze queue
commit: 6b6f6c41c8ac9b5ef758f16b793e1fd998cd25b4
[3/3] block: model freeze & enter queue as lock for supporting lockdep
commit: f1be1788a32e8fa63416ad4518bbd1a85a825c9d
Best regards,
--
Jens Axboe
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
[not found] ` <CGME20241029111338eucas1p2bd56c697b825eef235604e892569207e@eucas1p2.samsung.com>
@ 2024-10-29 11:13 ` Marek Szyprowski
2024-10-29 15:58 ` Ming Lei
0 siblings, 1 reply; 25+ messages in thread
From: Marek Szyprowski @ 2024-10-29 11:13 UTC
To: Ming Lei, Jens Axboe, linux-block
Cc: Christoph Hellwig, Peter Zijlstra, Waiman Long, Boqun Feng,
Ingo Molnar, Will Deacon, linux-kernel, Bart Van Assche
On 25.10.2024 02:37, Ming Lei wrote:
> Recently we got several deadlock reports[1][2][3] caused by
> blk_mq_freeze_queue() and blk_enter_queue().
>
> It turns out the two behave just like acquiring a read/write lock, so
> model them as a read/write lock for lockdep support:
>
> 1) model q->q_usage_counter as two locks (io lock and queue lock)
>
> - queue lock covers sync with blk_enter_queue()
>
> - io lock covers sync with bio_enter_queue()
>
> 2) make the lockdep class/key per-queue:
>
> - different subsystems have very different lock usage patterns; a shared
> lock class easily causes false positives
>
> - freeze_queue degrades to no lock once the disk state becomes DEAD,
> because bio_enter_queue() won't be blocked any more
>
> - freeze_queue degrades to no lock once the request queue becomes dying,
> because blk_enter_queue() won't be blocked any more
>
> 3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
> - it is an exclusive lock, so the dependency with blk_enter_queue() is covered
>
> - it is a trylock because blk_mq_freeze_queue() is allowed to run
> concurrently
>
> 4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
> - nested blk_enter_queue() calls are allowed
>
> - the dependency with blk_mq_freeze_queue() is covered
>
> - blk_queue_exit() is often called from other contexts (such as irq) and
> can't be annotated as lock_release(), so simply do the release in
> blk_enter_queue(); this still covers as many cases as possible
>
> With lockdep support, such problems can be reported as soon as possible,
> without waiting until the real deadlock is triggered.
>
> For example, a lockdep report can be triggered for case [3] with this
> patch applied.
>
> [1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
> https://bugzilla.kernel.org/show_bug.cgi?id=219166
>
> [2] del_gendisk() vs blk_queue_enter() race condition
> https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/
>
> [3] queue_freeze & queue_enter deadlock in scsi
> https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
This patch landed yesterday in linux-next as commit f1be1788a32e
("block: model freeze & enter queue as lock for supporting lockdep").
In my tests I found that it introduces the following 2 lockdep warnings:
1. On Samsung Exynos 4412-based Odroid U3 board (ARM 32bit), observed
when booting it:
======================================================
WARNING: possible circular locking dependency detected
6.12.0-rc4-00037-gf1be1788a32e #9290 Not tainted
------------------------------------------------------
find/1284 is trying to acquire lock:
cf3b8534 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x30/0x70
but task is already holding lock:
c203a0c8 (&sb->s_type->i_mutex_key#2){++++}-{3:3}, at:
iterate_dir+0x30/0x140
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&sb->s_type->i_mutex_key#2){++++}-{3:3}:
down_write+0x44/0xc4
start_creating+0x8c/0x170
debugfs_create_dir+0x1c/0x178
blk_register_queue+0xa0/0x1c0
add_disk_fwnode+0x210/0x434
brd_alloc+0x1cc/0x210
brd_init+0xac/0x104
do_one_initcall+0x64/0x30c
kernel_init_freeable+0x1c4/0x228
kernel_init+0x1c/0x12c
ret_from_fork+0x14/0x28
-> #3 (&q->debugfs_mutex){+.+.}-{3:3}:
__mutex_lock+0x94/0x94c
mutex_lock_nested+0x1c/0x24
blk_mq_init_sched+0x140/0x204
elevator_init_mq+0xb8/0x130
add_disk_fwnode+0x3c/0x434
mmc_blk_alloc_req+0x34c/0x464
mmc_blk_probe+0x1f4/0x6f8
really_probe+0xe0/0x3d8
__driver_probe_device+0x9c/0x1e4
driver_probe_device+0x30/0xc0
__device_attach_driver+0xa8/0x120
bus_for_each_drv+0x80/0xcc
__device_attach+0xac/0x1fc
bus_probe_device+0x8c/0x90
device_add+0x5a4/0x7cc
mmc_add_card+0x118/0x2c8
mmc_attach_mmc+0xd8/0x174
mmc_rescan+0x2ec/0x3a8
process_one_work+0x240/0x6d0
worker_thread+0x1a0/0x398
kthread+0x104/0x138
ret_from_fork+0x14/0x28
-> #2 (&q->q_usage_counter(io)#17){++++}-{0:0}:
blk_mq_submit_bio+0x8dc/0xb34
__submit_bio+0x78/0x148
submit_bio_noacct_nocheck+0x204/0x32c
ext4_mpage_readpages+0x558/0x7b0
read_pages+0x64/0x28c
page_cache_ra_unbounded+0x118/0x1bc
filemap_get_pages+0x124/0x7ec
filemap_read+0x174/0x5b0
__kernel_read+0x128/0x2c0
bprm_execve+0x230/0x7a4
kernel_execve+0xec/0x194
try_to_run_init_process+0xc/0x38
kernel_init+0xdc/0x12c
ret_from_fork+0x14/0x28
-> #1 (mapping.invalidate_lock#2){++++}-{3:3}:
down_read+0x44/0x224
filemap_fault+0x648/0x10f0
__do_fault+0x38/0xd4
handle_mm_fault+0xaf8/0x14d0
do_page_fault+0xe0/0x5c8
do_DataAbort+0x3c/0xb0
__dabt_svc+0x50/0x80
__clear_user_std+0x34/0x68
elf_load+0x1a8/0x208
load_elf_binary+0x3f4/0x13cc
bprm_execve+0x28c/0x7a4
do_execveat_common+0x150/0x198
sys_execve+0x30/0x38
ret_fast_syscall+0x0/0x1c
-> #0 (&mm->mmap_lock){++++}-{3:3}:
__lock_acquire+0x158c/0x2970
lock_acquire+0x130/0x384
__might_fault+0x50/0x70
filldir64+0x94/0x28c
dcache_readdir+0x174/0x260
iterate_dir+0x64/0x140
sys_getdents64+0x60/0x130
ret_fast_syscall+0x0/0x1c
other info that might help us debug this:
Chain exists of:
&mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#2
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
rlock(&sb->s_type->i_mutex_key#2);
lock(&q->debugfs_mutex);
lock(&sb->s_type->i_mutex_key#2);
rlock(&mm->mmap_lock);
*** DEADLOCK ***
2 locks held by find/1284:
#0: c3df1e88 (&f->f_pos_lock){+.+.}-{3:3}, at: fdget_pos+0x88/0xd0
#1: c203a0c8 (&sb->s_type->i_mutex_key#2){++++}-{3:3}, at:
iterate_dir+0x30/0x140
stack backtrace:
CPU: 1 UID: 0 PID: 1284 Comm: find Not tainted
6.12.0-rc4-00037-gf1be1788a32e #9290
Hardware name: Samsung Exynos (Flattened Device Tree)
Call trace:
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x68/0x88
dump_stack_lvl from print_circular_bug+0x31c/0x394
print_circular_bug from check_noncircular+0x16c/0x184
check_noncircular from __lock_acquire+0x158c/0x2970
__lock_acquire from lock_acquire+0x130/0x384
lock_acquire from __might_fault+0x50/0x70
__might_fault from filldir64+0x94/0x28c
filldir64 from dcache_readdir+0x174/0x260
dcache_readdir from iterate_dir+0x64/0x140
iterate_dir from sys_getdents64+0x60/0x130
sys_getdents64 from ret_fast_syscall+0x0/0x1c
Exception stack(0xf22b5fa8 to 0xf22b5ff0)
5fa0: 004b4fa0 004b4f80 00000004 004b4fa0 00008000
00000000
5fc0: 004b4fa0 004b4f80 00000001 000000d9 00000000 004b4af0 00000000
000010ea
5fe0: 004b1eb4 bea05af0 b6da4b08 b6da4a28
--->8---
2. On QEMU's ARM64 virt machine, observed during system suspend/resume
cycle:
# time rtcwake -s10 -mmem
rtcwake: wakeup from "mem" using /dev/rtc0 at Tue Oct 29 11:54:30 2024
PM: suspend entry (s2idle)
Filesystems sync: 0.004 seconds
Freezing user space processes
Freezing user space processes completed (elapsed 0.007 seconds)
OOM killer disabled.
Freezing remaining freezable tasks
Freezing remaining freezable tasks completed (elapsed 0.004 seconds)
======================================================
WARNING: possible circular locking dependency detected
6.12.0-rc4+ #9291 Not tainted
------------------------------------------------------
rtcwake/1299 is trying to acquire lock:
ffff80008358a7f8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock+0x1c/0x28
but task is already holding lock:
ffff000006136d68 (&q->q_usage_counter(io)#5){++++}-{0:0}, at:
virtblk_freeze+0x24/0x60
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&q->q_usage_counter(io)#5){++++}-{0:0}:
blk_mq_submit_bio+0x7c0/0x9d8
__submit_bio+0x74/0x164
submit_bio_noacct_nocheck+0x2d4/0x3b4
submit_bio_noacct+0x148/0x3fc
submit_bio+0x130/0x204
submit_bh_wbc+0x148/0x1bc
submit_bh+0x18/0x24
ext4_read_bh_nowait+0x70/0x100
ext4_sb_breadahead_unmovable+0x50/0x80
__ext4_get_inode_loc+0x354/0x654
ext4_get_inode_loc+0x40/0xa8
ext4_reserve_inode_write+0x40/0xf0
__ext4_mark_inode_dirty+0x88/0x300
ext4_ext_tree_init+0x40/0x4c
__ext4_new_inode+0x99c/0x1614
ext4_create+0xe4/0x1d4
lookup_open.isra.0+0x47c/0x540
path_openat+0x370/0x9e8
do_filp_open+0x80/0x130
do_sys_openat2+0xb4/0xe8
__arm64_compat_sys_openat+0x5c/0xa4
invoke_syscall+0x48/0x110
el0_svc_common.constprop.0+0x40/0xe8
do_el0_svc_compat+0x20/0x3c
el0_svc_compat+0x44/0xe0
el0t_32_sync_handler+0x98/0x148
el0t_32_sync+0x194/0x198
-> #2 (jbd2_handle){++++}-{0:0}:
start_this_handle+0x178/0x4e8
jbd2__journal_start+0x110/0x298
__ext4_journal_start_sb+0x9c/0x274
ext4_dirty_inode+0x38/0x88
__mark_inode_dirty+0x90/0x568
generic_update_time+0x50/0x64
touch_atime+0x2c0/0x324
ext4_file_mmap+0x68/0x88
mmap_region+0x448/0xa38
do_mmap+0x3dc/0x540
vm_mmap_pgoff+0xf8/0x1b4
ksys_mmap_pgoff+0x148/0x1f0
__arm64_compat_sys_aarch32_mmap2+0x20/0x2c
invoke_syscall+0x48/0x110
el0_svc_common.constprop.0+0x40/0xe8
do_el0_svc_compat+0x20/0x3c
el0_svc_compat+0x44/0xe0
el0t_32_sync_handler+0x98/0x148
el0t_32_sync+0x194/0x198
-> #1 (&mm->mmap_lock){++++}-{3:3}:
__might_fault+0x5c/0x80
inet_gifconf+0xcc/0x278
dev_ifconf+0xc4/0x1f8
sock_ioctl+0x234/0x384
compat_sock_ioctl+0x180/0x35c
__arm64_compat_sys_ioctl+0x14c/0x16c
invoke_syscall+0x48/0x110
el0_svc_common.constprop.0+0x40/0xe8
do_el0_svc_compat+0x20/0x3c
el0_svc_compat+0x44/0xe0
el0t_32_sync_handler+0x98/0x148
el0t_32_sync+0x194/0x198
-> #0 (rtnl_mutex){+.+.}-{3:3}:
__lock_acquire+0x1374/0x2224
lock_acquire+0x200/0x340
__mutex_lock+0x98/0x428
mutex_lock_nested+0x24/0x30
rtnl_lock+0x1c/0x28
virtnet_freeze_down.isra.0+0x20/0x9c
virtnet_freeze+0x44/0x60
virtio_device_freeze+0x68/0x94
virtio_mmio_freeze+0x14/0x20
platform_pm_suspend+0x2c/0x6c
dpm_run_callback+0xa4/0x218
device_suspend+0x12c/0x52c
dpm_suspend+0x10c/0x2e4
dpm_suspend_start+0x70/0x78
suspend_devices_and_enter+0x130/0x798
pm_suspend+0x1ac/0x2e8
state_store+0x8c/0x110
kobj_attr_store+0x18/0x2c
sysfs_kf_write+0x4c/0x78
kernfs_fop_write_iter+0x120/0x1b4
vfs_write+0x2b0/0x35c
ksys_write+0x68/0xf4
__arm64_sys_write+0x1c/0x28
invoke_syscall+0x48/0x110
el0_svc_common.constprop.0+0x40/0xe8
do_el0_svc_compat+0x20/0x3c
el0_svc_compat+0x44/0xe0
el0t_32_sync_handler+0x98/0x148
el0t_32_sync+0x194/0x198
other info that might help us debug this:
Chain exists of:
rtnl_mutex --> jbd2_handle --> &q->q_usage_counter(io)#5
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->q_usage_counter(io)#5);
lock(jbd2_handle);
lock(&q->q_usage_counter(io)#5);
lock(rtnl_mutex);
*** DEADLOCK ***
9 locks held by rtcwake/1299:
#0: ffff0000069103f8 (sb_writers#7){.+.+}-{0:0}, at: vfs_write+0x1e4/0x35c
#1: ffff000007c0ae88 (&of->mutex#2){+.+.}-{3:3}, at:
kernfs_fop_write_iter+0xf0/0x1b4
#2: ffff0000046906e8 (kn->active#24){.+.+}-{0:0}, at:
kernfs_fop_write_iter+0xf8/0x1b4
#3: ffff800082fd9908 (system_transition_mutex){+.+.}-{3:3}, at:
pm_suspend+0x88/0x2e8
#4: ffff000006137678 (&q->q_usage_counter(io)#6){++++}-{0:0}, at:
virtblk_freeze+0x24/0x60
#5: ffff0000061376b0 (&q->q_usage_counter(queue)#2){++++}-{0:0}, at:
virtblk_freeze+0x24/0x60
#6: ffff000006136d68 (&q->q_usage_counter(io)#5){++++}-{0:0}, at:
virtblk_freeze+0x24/0x60
#7: ffff000006136da0 (&q->q_usage_counter(queue)){++++}-{0:0}, at:
virtblk_freeze+0x24/0x60
#8: ffff0000056208f8 (&dev->mutex){....}-{3:3}, at:
device_suspend+0xf8/0x52c
stack backtrace:
CPU: 0 UID: 0 PID: 1299 Comm: rtcwake Not tainted 6.12.0-rc4+ #9291
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x94/0xec
show_stack+0x18/0x24
dump_stack_lvl+0x90/0xd0
dump_stack+0x18/0x24
print_circular_bug+0x298/0x37c
check_noncircular+0x15c/0x170
__lock_acquire+0x1374/0x2224
lock_acquire+0x200/0x340
__mutex_lock+0x98/0x428
mutex_lock_nested+0x24/0x30
rtnl_lock+0x1c/0x28
virtnet_freeze_down.isra.0+0x20/0x9c
virtnet_freeze+0x44/0x60
virtio_device_freeze+0x68/0x94
virtio_mmio_freeze+0x14/0x20
platform_pm_suspend+0x2c/0x6c
dpm_run_callback+0xa4/0x218
device_suspend+0x12c/0x52c
dpm_suspend+0x10c/0x2e4
dpm_suspend_start+0x70/0x78
suspend_devices_and_enter+0x130/0x798
pm_suspend+0x1ac/0x2e8
state_store+0x8c/0x110
kobj_attr_store+0x18/0x2c
sysfs_kf_write+0x4c/0x78
kernfs_fop_write_iter+0x120/0x1b4
vfs_write+0x2b0/0x35c
ksys_write+0x68/0xf4
__arm64_sys_write+0x1c/0x28
invoke_syscall+0x48/0x110
el0_svc_common.constprop.0+0x40/0xe8
do_el0_svc_compat+0x20/0x3c
el0_svc_compat+0x44/0xe0
el0t_32_sync_handler+0x98/0x148
el0t_32_sync+0x194/0x198
virtio_blk virtio2: 1/0/0 default/read/poll queues
virtio_blk virtio3: 1/0/0 default/read/poll queues
OOM killer enabled.
Restarting tasks ... done.
random: crng reseeded on system resumption
PM: suspend exit
--->8---
Let me know if you need more information about my test systems.
> ---
> block/blk-core.c | 18 ++++++++++++++++--
> block/blk-mq.c | 26 ++++++++++++++++++++++----
> block/blk.h | 29 ++++++++++++++++++++++++++---
> block/genhd.c | 15 +++++++++++----
> include/linux/blkdev.h | 6 ++++++
> 5 files changed, 81 insertions(+), 13 deletions(-)
> ...
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-10-29 11:13 ` Marek Szyprowski
@ 2024-10-29 15:58 ` Ming Lei
2024-10-29 16:59 ` Marek Szyprowski
2024-11-12 8:36 ` Marek Szyprowski
0 siblings, 2 replies; 25+ messages in thread
From: Ming Lei @ 2024-10-29 15:58 UTC
To: Marek Szyprowski
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche
On Tue, Oct 29, 2024 at 12:13:35PM +0100, Marek Szyprowski wrote:
> On 25.10.2024 02:37, Ming Lei wrote:
> > Recently we got several deadlock reports[1][2][3] caused by
> > blk_mq_freeze_queue() and blk_enter_queue().
> >
> > It turns out the two behave just like acquiring a read/write lock, so
> > model them as a read/write lock for lockdep support:
> >
> > 1) model q->q_usage_counter as two locks (io lock and queue lock)
> >
> > - queue lock covers sync with blk_enter_queue()
> >
> > - io lock covers sync with bio_enter_queue()
> >
> > 2) make the lockdep class/key per-queue:
> >
> > - different subsystems have very different lock usage patterns; a shared
> > lock class easily causes false positives
> >
> > - freeze_queue degrades to no lock once the disk state becomes DEAD,
> > because bio_enter_queue() won't be blocked any more
> >
> > - freeze_queue degrades to no lock once the request queue becomes dying,
> > because blk_enter_queue() won't be blocked any more
> >
> > 3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
> > - it is an exclusive lock, so the dependency with blk_enter_queue() is covered
> >
> > - it is a trylock because blk_mq_freeze_queue() is allowed to run
> > concurrently
> >
> > 4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
> > - nested blk_enter_queue() calls are allowed
> >
> > - the dependency with blk_mq_freeze_queue() is covered
> >
> > - blk_queue_exit() is often called from other contexts (such as irq) and
> > can't be annotated as lock_release(), so simply do the release in
> > blk_enter_queue(); this still covers as many cases as possible
> >
> > With lockdep support, such problems can be reported as soon as possible,
> > without waiting until the real deadlock is triggered.
> >
> > For example, a lockdep report can be triggered for case [3] with this
> > patch applied.
> >
> > [1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
> > https://bugzilla.kernel.org/show_bug.cgi?id=219166
> >
> > [2] del_gendisk() vs blk_queue_enter() race condition
> > https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/
> >
> > [3] queue_freeze & queue_enter deadlock in scsi
> > https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u
> >
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> This patch landed yesterday in linux-next as commit f1be1788a32e
> ("block: model freeze & enter queue as lock for supporting lockdep").
> In my tests I found that it introduces the following 2 lockdep warnings:
>
> 1. On Samsung Exynos 4412-based Odroid U3 board (ARM 32bit), observed
> when booting it:
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.12.0-rc4-00037-gf1be1788a32e #9290 Not tainted
> ------------------------------------------------------
> find/1284 is trying to acquire lock:
> cf3b8534 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x30/0x70
>
> but task is already holding lock:
> c203a0c8 (&sb->s_type->i_mutex_key#2){++++}-{3:3}, at:
> iterate_dir+0x30/0x140
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #4 (&sb->s_type->i_mutex_key#2){++++}-{3:3}:
> down_write+0x44/0xc4
> start_creating+0x8c/0x170
> debugfs_create_dir+0x1c/0x178
> blk_register_queue+0xa0/0x1c0
> add_disk_fwnode+0x210/0x434
> brd_alloc+0x1cc/0x210
> brd_init+0xac/0x104
> do_one_initcall+0x64/0x30c
> kernel_init_freeable+0x1c4/0x228
> kernel_init+0x1c/0x12c
> ret_from_fork+0x14/0x28
>
> -> #3 (&q->debugfs_mutex){+.+.}-{3:3}:
> __mutex_lock+0x94/0x94c
> mutex_lock_nested+0x1c/0x24
> blk_mq_init_sched+0x140/0x204
> elevator_init_mq+0xb8/0x130
> add_disk_fwnode+0x3c/0x434
The above chain can be cut by the following patch, because the disk state
can be treated as DEAD in add_disk(). Can you test it?
diff --git a/block/elevator.c b/block/elevator.c
index 4122026b11f1..efa6ff941a25 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -600,12 +600,14 @@ void elevator_init_mq(struct request_queue *q)
* requests, then no need to quiesce queue which may add long boot
* latency, especially when lots of disks are involved.
*/
- blk_mq_freeze_queue(q);
+ if (__blk_freeze_queue_start(q))
+ blk_freeze_acquire_lock(q, true, false);
blk_mq_cancel_work_sync(q);
err = blk_mq_init_sched(q, e);
- blk_mq_unfreeze_queue(q);
+ if (__blk_mq_unfreeze_queue(q, false))
+ blk_unfreeze_release_lock(q, true, false);
if (err) {
pr_warn("\"%s\" elevator initialization failed, "
...
> 2 locks held by find/1284:
> #0: c3df1e88 (&f->f_pos_lock){+.+.}-{3:3}, at: fdget_pos+0x88/0xd0
> #1: c203a0c8 (&sb->s_type->i_mutex_key#2){++++}-{3:3}, at:
> iterate_dir+0x30/0x140
>
> stack backtrace:
> CPU: 1 UID: 0 PID: 1284 Comm: find Not tainted
> 6.12.0-rc4-00037-gf1be1788a32e #9290
> Hardware name: Samsung Exynos (Flattened Device Tree)
> Call trace:
> unwind_backtrace from show_stack+0x10/0x14
> show_stack from dump_stack_lvl+0x68/0x88
> dump_stack_lvl from print_circular_bug+0x31c/0x394
> print_circular_bug from check_noncircular+0x16c/0x184
> check_noncircular from __lock_acquire+0x158c/0x2970
> __lock_acquire from lock_acquire+0x130/0x384
> lock_acquire from __might_fault+0x50/0x70
> __might_fault from filldir64+0x94/0x28c
> filldir64 from dcache_readdir+0x174/0x260
> dcache_readdir from iterate_dir+0x64/0x140
> iterate_dir from sys_getdents64+0x60/0x130
> sys_getdents64 from ret_fast_syscall+0x0/0x1c
> Exception stack(0xf22b5fa8 to 0xf22b5ff0)
> 5fa0: 004b4fa0 004b4f80 00000004 004b4fa0 00008000
> 00000000
> 5fc0: 004b4fa0 004b4f80 00000001 000000d9 00000000 004b4af0 00000000
> 000010ea
> 5fe0: 004b1eb4 bea05af0 b6da4b08 b6da4a28
>
> --->8---
>
>
> 2. On QEMU's ARM64 virt machine, observed during system suspend/resume
> cycle:
>
> # time rtcwake -s10 -mmem
> rtcwake: wakeup from "mem" using /dev/rtc0 at Tue Oct 29 11:54:30 2024
> PM: suspend entry (s2idle)
> Filesystems sync: 0.004 seconds
> Freezing user space processes
> Freezing user space processes completed (elapsed 0.007 seconds)
> OOM killer disabled.
> Freezing remaining freezable tasks
> Freezing remaining freezable tasks completed (elapsed 0.004 seconds)
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.12.0-rc4+ #9291 Not tainted
> ------------------------------------------------------
> rtcwake/1299 is trying to acquire lock:
> ffff80008358a7f8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock+0x1c/0x28
>
> but task is already holding lock:
> ffff000006136d68 (&q->q_usage_counter(io)#5){++++}-{0:0}, at:
> virtblk_freeze+0x24/0x60
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
This one looks like a real issue; at least the added lockdep code works as
expected. Also, the blk_mq_freeze_queue() use in virtio-blk's ->suspend()
is questionable. I will take a further look.
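In other words, a compressed view of the two chains (my reading of the
trace above, sketch only):

/*
 * suspend path (this report):        recorded chain (#1..#3):
 *
 * virtblk_freeze()                   rtnl_mutex
 *   blk_mq_freeze_queue()              -> &mm->mmap_lock
 *     [holds q_usage_counter(io)]        -> jbd2_handle
 * virtio-net freeze                        -> q_usage_counter(io)
 *   rtnl_lock()
 *
 * i.e. q_usage_counter(io) -> rtnl_mutex on the suspend side, against
 * the already-recorded rtnl_mutex -> ... -> q_usage_counter(io) chain.
 */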
Thanks,
Ming
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-10-29 15:58 ` Ming Lei
@ 2024-10-29 16:59 ` Marek Szyprowski
2024-11-12 8:36 ` Marek Szyprowski
1 sibling, 0 replies; 25+ messages in thread
From: Marek Szyprowski @ 2024-10-29 16:59 UTC
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche
On 29.10.2024 16:58, Ming Lei wrote:
> On Tue, Oct 29, 2024 at 12:13:35PM +0100, Marek Szyprowski wrote:
>> On 25.10.2024 02:37, Ming Lei wrote:
>>> Recently we got several deadlock reports[1][2][3] caused by
>>> blk_mq_freeze_queue() and blk_enter_queue().
>>>
>>> It turns out the two behave just like acquiring a read/write lock, so
>>> model them as a read/write lock for lockdep support:
>>>
>>> 1) model q->q_usage_counter as two locks (io lock and queue lock)
>>>
>>> - queue lock covers sync with blk_enter_queue()
>>>
>>> - io lock covers sync with bio_enter_queue()
>>>
>>> 2) make the lockdep class/key per-queue:
>>>
>>> - different subsystems have very different lock usage patterns; a shared
>>> lock class easily causes false positives
>>>
>>> - freeze_queue degrades to no lock once the disk state becomes DEAD,
>>> because bio_enter_queue() won't be blocked any more
>>>
>>> - freeze_queue degrades to no lock once the request queue becomes dying,
>>> because blk_enter_queue() won't be blocked any more
>>>
>>> 3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
>>> - it is an exclusive lock, so the dependency with blk_enter_queue() is covered
>>>
>>> - it is a trylock because blk_mq_freeze_queue() is allowed to run
>>> concurrently
>>>
>>> 4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
>>> - nested blk_enter_queue() calls are allowed
>>>
>>> - the dependency with blk_mq_freeze_queue() is covered
>>>
>>> - blk_queue_exit() is often called from other contexts (such as irq) and
>>> can't be annotated as lock_release(), so simply do the release in
>>> blk_enter_queue(); this still covers as many cases as possible
>>>
>>> With lockdep support, such problems can be reported as soon as possible,
>>> without waiting until the real deadlock is triggered.
>>>
>>> For example, a lockdep report can be triggered for case [3] with this
>>> patch applied.
>>>
>>> [1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
>>> https://bugzilla.kernel.org/show_bug.cgi?id=219166
>>>
>>> [2] del_gendisk() vs blk_queue_enter() race condition
>>> https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/
>>>
>>> [3] queue_freeze & queue_enter deadlock in scsi
>>> https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u
>>>
>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>> This patch landed yesterday in linux-next as commit f1be1788a32e
>> ("block: model freeze & enter queue as lock for supporting lockdep").
>> In my tests I found that it introduces the following 2 lockdep warnings:
>>
>> 1. On Samsung Exynos 4412-based Odroid U3 board (ARM 32bit), observed
>> when booting it:
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 6.12.0-rc4-00037-gf1be1788a32e #9290 Not tainted
>> ------------------------------------------------------
>> find/1284 is trying to acquire lock:
>> cf3b8534 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x30/0x70
>>
>> but task is already holding lock:
>> c203a0c8 (&sb->s_type->i_mutex_key#2){++++}-{3:3}, at:
>> iterate_dir+0x30/0x140
>>
>> which lock already depends on the new lock.
>>
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #4 (&sb->s_type->i_mutex_key#2){++++}-{3:3}:
>> down_write+0x44/0xc4
>> start_creating+0x8c/0x170
>> debugfs_create_dir+0x1c/0x178
>> blk_register_queue+0xa0/0x1c0
>> add_disk_fwnode+0x210/0x434
>> brd_alloc+0x1cc/0x210
>> brd_init+0xac/0x104
>> do_one_initcall+0x64/0x30c
>> kernel_init_freeable+0x1c4/0x228
>> kernel_init+0x1c/0x12c
>> ret_from_fork+0x14/0x28
>>
>> -> #3 (&q->debugfs_mutex){+.+.}-{3:3}:
>> __mutex_lock+0x94/0x94c
>> mutex_lock_nested+0x1c/0x24
>> blk_mq_init_sched+0x140/0x204
>> elevator_init_mq+0xb8/0x130
>> add_disk_fwnode+0x3c/0x434
> The above chain can be cut by the following patch, because the disk state
> can be treated as DEAD in add_disk(). Can you test it?
This seems to fix the issue. Feel free to add:
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
> diff --git a/block/elevator.c b/block/elevator.c
> index 4122026b11f1..efa6ff941a25 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -600,12 +600,14 @@ void elevator_init_mq(struct request_queue *q)
> * requests, then no need to quiesce queue which may add long boot
> * latency, especially when lots of disks are involved.
> */
> - blk_mq_freeze_queue(q);
> + if (__blk_freeze_queue_start(q))
> + blk_freeze_acquire_lock(q, true, false);
> blk_mq_cancel_work_sync(q);
>
> err = blk_mq_init_sched(q, e);
>
> - blk_mq_unfreeze_queue(q);
> + if (__blk_mq_unfreeze_queue(q, false))
> + blk_unfreeze_release_lock(q, true, false);
>
> if (err) {
> pr_warn("\"%s\" elevator initialization failed, "
>
> ...
>
>> 2 locks held by find/1284:
>> #0: c3df1e88 (&f->f_pos_lock){+.+.}-{3:3}, at: fdget_pos+0x88/0xd0
>> #1: c203a0c8 (&sb->s_type->i_mutex_key#2){++++}-{3:3}, at:
>> iterate_dir+0x30/0x140
>>
>> stack backtrace:
>> CPU: 1 UID: 0 PID: 1284 Comm: find Not tainted
>> 6.12.0-rc4-00037-gf1be1788a32e #9290
>> Hardware name: Samsung Exynos (Flattened Device Tree)
>> Call trace:
>> unwind_backtrace from show_stack+0x10/0x14
>> show_stack from dump_stack_lvl+0x68/0x88
>> dump_stack_lvl from print_circular_bug+0x31c/0x394
>> print_circular_bug from check_noncircular+0x16c/0x184
>> check_noncircular from __lock_acquire+0x158c/0x2970
>> __lock_acquire from lock_acquire+0x130/0x384
>> lock_acquire from __might_fault+0x50/0x70
>> __might_fault from filldir64+0x94/0x28c
>> filldir64 from dcache_readdir+0x174/0x260
>> dcache_readdir from iterate_dir+0x64/0x140
>> iterate_dir from sys_getdents64+0x60/0x130
>> sys_getdents64 from ret_fast_syscall+0x0/0x1c
>> Exception stack(0xf22b5fa8 to 0xf22b5ff0)
>> 5fa0: 004b4fa0 004b4f80 00000004 004b4fa0 00008000
>> 00000000
>> 5fc0: 004b4fa0 004b4f80 00000001 000000d9 00000000 004b4af0 00000000
>> 000010ea
>> 5fe0: 004b1eb4 bea05af0 b6da4b08 b6da4a28
>>
>> --->8---
>>
>>
>> 2. On QEMU's ARM64 virt machine, observed during system suspend/resume
>> cycle:
>>
>> # time rtcwake -s10 -mmem
>> rtcwake: wakeup from "mem" using /dev/rtc0 at Tue Oct 29 11:54:30 2024
>> PM: suspend entry (s2idle)
>> Filesystems sync: 0.004 seconds
>> Freezing user space processes
>> Freezing user space processes completed (elapsed 0.007 seconds)
>> OOM killer disabled.
>> Freezing remaining freezable tasks
>> Freezing remaining freezable tasks completed (elapsed 0.004 seconds)
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 6.12.0-rc4+ #9291 Not tainted
>> ------------------------------------------------------
>> rtcwake/1299 is trying to acquire lock:
>> ffff80008358a7f8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock+0x1c/0x28
>>
>> but task is already holding lock:
>> ffff000006136d68 (&q->q_usage_counter(io)#5){++++}-{0:0}, at:
>> virtblk_freeze+0x24/0x60
>>
>> which lock already depends on the new lock.
>>
>>
>> the existing dependency chain (in reverse order) is:
> This one looks like a real issue; at least the added lockdep code works as
> expected. Also, the blk_mq_freeze_queue() use in virtio-blk's ->suspend()
> is questionable. I will take a further look.
>
>
> Thanks,
> Ming
>
>
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-10-25 0:37 ` [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep Ming Lei
[not found] ` <CGME20241029111338eucas1p2bd56c697b825eef235604e892569207e@eucas1p2.samsung.com>
@ 2024-10-30 6:45 ` Lai, Yi
2024-10-30 7:13 ` Ming Lei
1 sibling, 1 reply; 25+ messages in thread
From: Lai, Yi @ 2024-10-30 6:45 UTC
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche, yi1.lai, syzkaller-bugs
Hi Ming,
Greetings!
I used Syzkaller and found that there is a possible deadlock in __submit_bio in linux-next next-20241029.
After bisection, the first bad commit is:
"
f1be1788a32e block: model freeze & enter queue as lock for supporting lockdep
"
All detailed info can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/241029_183511___submit_bio
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/241029_183511___submit_bio/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/241029_183511___submit_bio/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/241029_183511___submit_bio/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/241029_183511___submit_bio/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/241029_183511___submit_bio/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/241029_183511___submit_bio/bzImage_6fb2fa9805c501d9ade047fc511961f3273cdcb5
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/241029_183511___submit_bio/6fb2fa9805c501d9ade047fc511961f3273cdcb5_dmesg.log
"
[ 22.219103] 6.12.0-rc5-next-20241029-6fb2fa9805c5 #1 Not tainted
[ 22.219512] ------------------------------------------------------
[ 22.219827] repro/735 is trying to acquire lock:
[ 22.220066] ffff888010f1a768 (&q->q_usage_counter(io)#25){++++}-{0:0}, at: __submit_bio+0x39f/0x550
[ 22.220568]
[ 22.220568] but task is already holding lock:
[ 22.220884] ffffffff872322a0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x76b/0x21e0
[ 22.221453]
[ 22.221453] which lock already depends on the new lock.
[ 22.221453]
[ 22.221862]
[ 22.221862] the existing dependency chain (in reverse order) is:
[ 22.222247]
[ 22.222247] -> #1 (fs_reclaim){+.+.}-{0:0}:
[ 22.222630] lock_acquire+0x80/0xb0
[ 22.222920] fs_reclaim_acquire+0x116/0x160
[ 22.223244] __kmalloc_cache_node_noprof+0x59/0x470
[ 22.223528] blk_mq_init_tags+0x79/0x1a0
[ 22.223771] blk_mq_alloc_map_and_rqs+0x1f4/0xdd0
[ 22.224127] blk_mq_init_sched+0x33d/0x6d0
[ 22.224376] elevator_init_mq+0x2b2/0x400
[ 22.224619] add_disk_fwnode+0x11c/0x1300
[ 22.224866] device_add_disk+0x31/0x40
[ 22.225097] sd_probe+0xa79/0x1080
[ 22.225324] really_probe+0x27c/0xac0
[ 22.225556] __driver_probe_device+0x1f3/0x460
[ 22.225819] driver_probe_device+0x56/0x1b0
[ 22.226069] __device_attach_driver+0x1e7/0x300
[ 22.226336] bus_for_each_drv+0x159/0x1e0
[ 22.226576] __device_attach_async_helper+0x1e4/0x2a0
[ 22.226871] async_run_entry_fn+0xa3/0x450
[ 22.227127] process_one_work+0x92e/0x1b50
[ 22.227380] worker_thread+0x68d/0xe90
[ 22.227614] kthread+0x35a/0x470
[ 22.227834] ret_from_fork+0x56/0x90
[ 22.228119] ret_from_fork_asm+0x1a/0x30
[ 22.228365]
[ 22.228365] -> #0 (&q->q_usage_counter(io)#25){++++}-{0:0}:
[ 22.228739] __lock_acquire+0x2ff8/0x5d60
[ 22.228982] lock_acquire.part.0+0x142/0x390
[ 22.229237] lock_acquire+0x80/0xb0
[ 22.229452] blk_mq_submit_bio+0x1cbe/0x2590
[ 22.229708] __submit_bio+0x39f/0x550
[ 22.229931] submit_bio_noacct_nocheck+0x647/0xcc0
[ 22.230212] submit_bio_noacct+0x620/0x1e00
[ 22.230463] submit_bio+0xce/0x480
[ 22.230677] __swap_writepage+0x2f1/0xdf0
[ 22.230923] swap_writepage+0x464/0xbc0
[ 22.231157] shmem_writepage+0xdeb/0x1340
[ 22.231406] pageout+0x3bc/0x9b0
[ 22.231617] shrink_folio_list+0x16b9/0x3b60
[ 22.231884] shrink_lruvec+0xd78/0x2790
[ 22.232220] shrink_node+0xb29/0x2870
[ 22.232447] do_try_to_free_pages+0x2e3/0x1790
[ 22.232789] try_to_free_pages+0x2bc/0x750
[ 22.233065] __alloc_pages_slowpath.constprop.0+0x812/0x21e0
[ 22.233528] __alloc_pages_noprof+0x5d4/0x6f0
[ 22.233861] alloc_pages_mpol_noprof+0x30a/0x580
[ 22.234233] alloc_pages_noprof+0xa9/0x180
[ 22.234555] kimage_alloc_pages+0x79/0x240
[ 22.234808] kimage_alloc_control_pages+0x1cb/0xa60
[ 22.235119] do_kexec_load+0x3a6/0x8e0
[ 22.235418] __x64_sys_kexec_load+0x1cc/0x240
[ 22.235759] x64_sys_call+0xf0f/0x20d0
[ 22.236058] do_syscall_64+0x6d/0x140
[ 22.236343] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 22.236643]
[ 22.236643] other info that might help us debug this:
[ 22.236643]
[ 22.237042] Possible unsafe locking scenario:
[ 22.237042]
[ 22.237359] CPU0 CPU1
[ 22.237662] ---- ----
[ 22.237926] lock(fs_reclaim);
[ 22.238098] lock(&q->q_usage_counter(io)#25);
[ 22.238535] lock(fs_reclaim);
[ 22.238935] rlock(&q->q_usage_counter(io)#25);
[ 22.239182]
[ 22.239182] *** DEADLOCK ***
[ 22.239182]
[ 22.239521] 1 lock held by repro/735:
[ 22.239778] #0: ffffffff872322a0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x76b/0x21e0
[ 22.240320]
[ 22.240320] stack backtrace:
[ 22.240550] CPU: 1 UID: 0 PID: 735 Comm: repro Not tainted 6.12.0-rc5-next-20241029-6fb2fa9805c5 #1
[ 22.241079] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 22.241780] Call Trace:
[ 22.241928] <TASK>
[ 22.242095] dump_stack_lvl+0xea/0x150
[ 22.242351] dump_stack+0x19/0x20
[ 22.242534] print_circular_bug+0x47f/0x750
[ 22.242761] check_noncircular+0x2f4/0x3e0
[ 22.242992] ? __pfx_check_noncircular+0x10/0x10
[ 22.243302] ? __pfx_lock_release+0x10/0x10
[ 22.243586] ? lockdep_lock+0xd0/0x1d0
[ 22.243865] ? __pfx_lockdep_lock+0x10/0x10
[ 22.244103] __lock_acquire+0x2ff8/0x5d60
[ 22.244382] ? __kernel_text_address+0x16/0x50
[ 22.244633] ? __pfx___lock_acquire+0x10/0x10
[ 22.244871] ? stack_trace_save+0x97/0xd0
[ 22.245102] lock_acquire.part.0+0x142/0x390
[ 22.245394] ? __submit_bio+0x39f/0x550
[ 22.245646] ? __pfx_lock_acquire.part.0+0x10/0x10
"
I hope you find it useful.
Regards,
Yi Lai
---
If you don't need the following environment to reproduce the problem, or if you
already have a reproduction environment, please ignore the following information.
How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
// start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
// You could change the bzImage_xxx as you want
// Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You can use the command below to log in; there is no password for root.
ssh -p 10023 root@localhost
After logging in to the vm (virtual machine) successfully, you can transfer the
reproducer binary to the vm as shown below, and reproduce the problem in the vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/
Get the bzImage for the target kernel:
Please use the target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage // x should be equal to or less than the number of CPUs your PC has
Fill the bzImage file into the above start3.sh to load the target kernel in the vm.
Tips:
If you already have qemu-system-x86_64, please ignore the info below.
If you want to install qemu v7.1.0:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install
On Fri, Oct 25, 2024 at 08:37:20AM +0800, Ming Lei wrote:
> Recently we got several deadlock reports[1][2][3] caused by
> blk_mq_freeze_queue() and blk_enter_queue().
>
> It turns out the two behave just like acquiring a read/write lock, so
> model them as a read/write lock for lockdep support:
>
> 1) model q->q_usage_counter as two locks (io lock and queue lock)
>
> - queue lock covers sync with blk_enter_queue()
>
> - io lock covers sync with bio_enter_queue()
>
> 2) make the lockdep class/key per-queue:
>
> - different subsystems have very different lock usage patterns; a shared
> lock class easily causes false positives
>
> - freeze_queue degrades to no lock once the disk state becomes DEAD,
> because bio_enter_queue() won't be blocked any more
>
> - freeze_queue degrades to no lock once the request queue becomes dying,
> because blk_enter_queue() won't be blocked any more
>
> 3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
> - it is an exclusive lock, so the dependency with blk_enter_queue() is covered
>
> - it is a trylock because blk_mq_freeze_queue() is allowed to run
> concurrently
>
> 4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
> - nested blk_enter_queue() calls are allowed
>
> - the dependency with blk_mq_freeze_queue() is covered
>
> - blk_queue_exit() is often called from other contexts (such as irq) and
> can't be annotated as lock_release(), so simply do the release in
> blk_enter_queue(); this still covers as many cases as possible
>
> With lockdep support, such problems can be reported as soon as possible,
> without waiting until the real deadlock is triggered.
>
> For example, a lockdep report can be triggered for case [3] with this
> patch applied.
>
> [1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
> https://bugzilla.kernel.org/show_bug.cgi?id=219166
>
> [2] del_gendisk() vs blk_queue_enter() race condition
> https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/
>
> [3] queue_freeze & queue_enter deadlock in scsi
> https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/blk-core.c | 18 ++++++++++++++++--
> block/blk-mq.c | 26 ++++++++++++++++++++++----
> block/blk.h | 29 ++++++++++++++++++++++++++---
> block/genhd.c | 15 +++++++++++----
> include/linux/blkdev.h | 6 ++++++
> 5 files changed, 81 insertions(+), 13 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index bc5e8c5eaac9..09d10bb95fda 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -261,6 +261,8 @@ static void blk_free_queue(struct request_queue *q)
> blk_mq_release(q);
>
> ida_free(&blk_queue_ida, q->id);
> + lockdep_unregister_key(&q->io_lock_cls_key);
> + lockdep_unregister_key(&q->q_lock_cls_key);
> call_rcu(&q->rcu_head, blk_free_queue_rcu);
> }
>
> @@ -278,18 +280,20 @@ void blk_put_queue(struct request_queue *q)
> }
> EXPORT_SYMBOL(blk_put_queue);
>
> -void blk_queue_start_drain(struct request_queue *q)
> +bool blk_queue_start_drain(struct request_queue *q)
> {
> /*
> * When queue DYING flag is set, we need to block new req
> * entering queue, so we call blk_freeze_queue_start() to
> * prevent I/O from crossing blk_queue_enter().
> */
> - blk_freeze_queue_start(q);
> + bool freeze = __blk_freeze_queue_start(q);
> if (queue_is_mq(q))
> blk_mq_wake_waiters(q);
> /* Make blk_queue_enter() reexamine the DYING flag. */
> wake_up_all(&q->mq_freeze_wq);
> +
> + return freeze;
> }
>
> /**
> @@ -321,6 +325,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> return -ENODEV;
> }
>
> + rwsem_acquire_read(&q->q_lockdep_map, 0, 0, _RET_IP_);
> + rwsem_release(&q->q_lockdep_map, _RET_IP_);
> return 0;
> }
>
> @@ -352,6 +358,8 @@ int __bio_queue_enter(struct request_queue *q, struct bio *bio)
> goto dead;
> }
>
> + rwsem_acquire_read(&q->io_lockdep_map, 0, 0, _RET_IP_);
> + rwsem_release(&q->io_lockdep_map, _RET_IP_);
> return 0;
> dead:
> bio_io_error(bio);
> @@ -441,6 +449,12 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
> PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
> if (error)
> goto fail_stats;
> + lockdep_register_key(&q->io_lock_cls_key);
> + lockdep_register_key(&q->q_lock_cls_key);
> + lockdep_init_map(&q->io_lockdep_map, "&q->q_usage_counter(io)",
> + &q->io_lock_cls_key, 0);
> + lockdep_init_map(&q->q_lockdep_map, "&q->q_usage_counter(queue)",
> + &q->q_lock_cls_key, 0);
>
> q->nr_requests = BLKDEV_DEFAULT_RQ;
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 96858fb3b9ff..76f277a30c11 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -120,17 +120,29 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct block_device *part,
> inflight[1] = mi.inflight[1];
> }
>
> -void blk_freeze_queue_start(struct request_queue *q)
> +bool __blk_freeze_queue_start(struct request_queue *q)
> {
> + int freeze;
> +
> mutex_lock(&q->mq_freeze_lock);
> if (++q->mq_freeze_depth == 1) {
> percpu_ref_kill(&q->q_usage_counter);
> mutex_unlock(&q->mq_freeze_lock);
> if (queue_is_mq(q))
> blk_mq_run_hw_queues(q, false);
> + freeze = true;
> } else {
> mutex_unlock(&q->mq_freeze_lock);
> + freeze = false;
> }
> +
> + return freeze;
> +}
> +
> +void blk_freeze_queue_start(struct request_queue *q)
> +{
> + if (__blk_freeze_queue_start(q))
> + blk_freeze_acquire_lock(q, false, false);
> }
> EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
>
> @@ -176,8 +188,10 @@ void blk_mq_freeze_queue(struct request_queue *q)
> }
> EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
>
> -void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic)
> +bool __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic)
> {
> + int unfreeze = false;
> +
> mutex_lock(&q->mq_freeze_lock);
> if (force_atomic)
> q->q_usage_counter.data->force_atomic = true;
> @@ -186,13 +200,17 @@ void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic)
> if (!q->mq_freeze_depth) {
> percpu_ref_resurrect(&q->q_usage_counter);
> wake_up_all(&q->mq_freeze_wq);
> + unfreeze = true;
> }
> mutex_unlock(&q->mq_freeze_lock);
> +
> + return unfreeze;
> }
>
> void blk_mq_unfreeze_queue(struct request_queue *q)
> {
> - __blk_mq_unfreeze_queue(q, false);
> + if (__blk_mq_unfreeze_queue(q, false))
> + blk_unfreeze_release_lock(q, false, false);
> }
> EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
>
> @@ -205,7 +223,7 @@ EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
> */
> void blk_freeze_queue_start_non_owner(struct request_queue *q)
> {
> - blk_freeze_queue_start(q);
> + __blk_freeze_queue_start(q);
> }
> EXPORT_SYMBOL_GPL(blk_freeze_queue_start_non_owner);
>
> diff --git a/block/blk.h b/block/blk.h
> index c718e4291db0..832e54c5a271 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -4,6 +4,7 @@
>
> #include <linux/bio-integrity.h>
> #include <linux/blk-crypto.h>
> +#include <linux/lockdep.h>
> #include <linux/memblock.h> /* for max_pfn/max_low_pfn */
> #include <linux/sched/sysctl.h>
> #include <linux/timekeeping.h>
> @@ -35,8 +36,9 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
> void blk_free_flush_queue(struct blk_flush_queue *q);
>
> void blk_freeze_queue(struct request_queue *q);
> -void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
> -void blk_queue_start_drain(struct request_queue *q);
> +bool __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
> +bool blk_queue_start_drain(struct request_queue *q);
> +bool __blk_freeze_queue_start(struct request_queue *q);
> int __bio_queue_enter(struct request_queue *q, struct bio *bio);
> void submit_bio_noacct_nocheck(struct bio *bio);
> void bio_await_chain(struct bio *bio);
> @@ -69,8 +71,11 @@ static inline int bio_queue_enter(struct bio *bio)
> {
> struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>
> - if (blk_try_enter_queue(q, false))
> + if (blk_try_enter_queue(q, false)) {
> + rwsem_acquire_read(&q->io_lockdep_map, 0, 0, _RET_IP_);
> + rwsem_release(&q->io_lockdep_map, _RET_IP_);
> return 0;
> + }
> return __bio_queue_enter(q, bio);
> }
>
> @@ -734,4 +739,22 @@ void blk_integrity_verify(struct bio *bio);
> void blk_integrity_prepare(struct request *rq);
> void blk_integrity_complete(struct request *rq, unsigned int nr_bytes);
>
> +static inline void blk_freeze_acquire_lock(struct request_queue *q, bool
> + disk_dead, bool queue_dying)
> +{
> + if (!disk_dead)
> + rwsem_acquire(&q->io_lockdep_map, 0, 1, _RET_IP_);
> + if (!queue_dying)
> + rwsem_acquire(&q->q_lockdep_map, 0, 1, _RET_IP_);
> +}
> +
> +static inline void blk_unfreeze_release_lock(struct request_queue *q, bool
> + disk_dead, bool queue_dying)
> +{
> + if (!queue_dying)
> + rwsem_release(&q->q_lockdep_map, _RET_IP_);
> + if (!disk_dead)
> + rwsem_release(&q->io_lockdep_map, _RET_IP_);
> +}
> +
> #endif /* BLK_INTERNAL_H */
> diff --git a/block/genhd.c b/block/genhd.c
> index 1c05dd4c6980..6ad3fcde0110 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -581,13 +581,13 @@ static void blk_report_disk_dead(struct gendisk *disk, bool surprise)
> rcu_read_unlock();
> }
>
> -static void __blk_mark_disk_dead(struct gendisk *disk)
> +static bool __blk_mark_disk_dead(struct gendisk *disk)
> {
> /*
> * Fail any new I/O.
> */
> if (test_and_set_bit(GD_DEAD, &disk->state))
> - return;
> + return false;
>
> if (test_bit(GD_OWNS_QUEUE, &disk->state))
> blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> @@ -600,7 +600,7 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
> /*
> * Prevent new I/O from crossing bio_queue_enter().
> */
> - blk_queue_start_drain(disk->queue);
> + return blk_queue_start_drain(disk->queue);
> }
>
> /**
> @@ -641,6 +641,7 @@ void del_gendisk(struct gendisk *disk)
> struct request_queue *q = disk->queue;
> struct block_device *part;
> unsigned long idx;
> + bool start_drain, queue_dying;
>
> might_sleep();
>
> @@ -668,7 +669,10 @@ void del_gendisk(struct gendisk *disk)
> * Drop all partitions now that the disk is marked dead.
> */
> mutex_lock(&disk->open_mutex);
> - __blk_mark_disk_dead(disk);
> + start_drain = __blk_mark_disk_dead(disk);
> + queue_dying = blk_queue_dying(q);
> + if (start_drain)
> + blk_freeze_acquire_lock(q, true, queue_dying);
> xa_for_each_start(&disk->part_tbl, idx, part, 1)
> drop_partition(part);
> mutex_unlock(&disk->open_mutex);
> @@ -725,6 +729,9 @@ void del_gendisk(struct gendisk *disk)
> if (queue_is_mq(q))
> blk_mq_exit_queue(q);
> }
> +
> + if (start_drain)
> + blk_unfreeze_release_lock(q, true, queue_dying);
> }
> EXPORT_SYMBOL(del_gendisk);
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 50c3b959da28..57f1ee386b57 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -25,6 +25,7 @@
> #include <linux/uuid.h>
> #include <linux/xarray.h>
> #include <linux/file.h>
> +#include <linux/lockdep.h>
>
> struct module;
> struct request_queue;
> @@ -471,6 +472,11 @@ struct request_queue {
> struct xarray hctx_table;
>
> struct percpu_ref q_usage_counter;
> + struct lock_class_key io_lock_cls_key;
> + struct lockdep_map io_lockdep_map;
> +
> + struct lock_class_key q_lock_cls_key;
> + struct lockdep_map q_lockdep_map;
>
> struct request *last_merge;
>
> --
> 2.46.0
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-10-30 6:45 ` Lai, Yi
@ 2024-10-30 7:13 ` Ming Lei
2024-10-30 8:50 ` Lai, Yi
0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2024-10-30 7:13 UTC (permalink / raw)
To: Lai, Yi
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche, yi1.lai, syzkaller-bugs
On Wed, Oct 30, 2024 at 02:45:03PM +0800, Lai, Yi wrote:
> Hi Ming,
>
> Greetings!
>
> I used Syzkaller and found a possible deadlock in __submit_bio in linux-next next-20241029.
>
> After bisection, the first bad commit is:
> "
> f1be1788a32e block: model freeze & enter queue as lock for supporting lockdep
> "
>
> All detailed info can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241029_183511___submit_bio
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241029_183511___submit_bio/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241029_183511___submit_bio/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241029_183511___submit_bio/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/241029_183511___submit_bio/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/241029_183511___submit_bio/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/241029_183511___submit_bio/bzImage_6fb2fa9805c501d9ade047fc511961f3273cdcb5
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/241029_183511___submit_bio/6fb2fa9805c501d9ade047fc511961f3273cdcb5_dmesg.log
>
> "
> [ 22.219103] 6.12.0-rc5-next-20241029-6fb2fa9805c5 #1 Not tainted
> [ 22.219512] ------------------------------------------------------
> [ 22.219827] repro/735 is trying to acquire lock:
> [ 22.220066] ffff888010f1a768 (&q->q_usage_counter(io)#25){++++}-{0:0}, at: __submit_bio+0x39f/0x550
> [ 22.220568]
> [ 22.220568] but task is already holding lock:
> [ 22.220884] ffffffff872322a0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x76b/0x21e0
> [ 22.221453]
> [ 22.221453] which lock already depends on the new lock.
> [ 22.221453]
> [ 22.221862]
> [ 22.221862] the existing dependency chain (in reverse order) is:
> [ 22.222247]
> [ 22.222247] -> #1 (fs_reclaim){+.+.}-{0:0}:
> [ 22.222630] lock_acquire+0x80/0xb0
> [ 22.222920] fs_reclaim_acquire+0x116/0x160
> [ 22.223244] __kmalloc_cache_node_noprof+0x59/0x470
> [ 22.223528] blk_mq_init_tags+0x79/0x1a0
> [ 22.223771] blk_mq_alloc_map_and_rqs+0x1f4/0xdd0
> [ 22.224127] blk_mq_init_sched+0x33d/0x6d0
> [ 22.224376] elevator_init_mq+0x2b2/0x400
It should be addressed by the following patch:
https://lore.kernel.org/linux-block/ZyEGLdg744U_xBjp@fedora/
Thanks,
Ming
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-10-30 7:13 ` Ming Lei
@ 2024-10-30 8:50 ` Lai, Yi
2024-10-30 9:50 ` Ming Lei
0 siblings, 1 reply; 25+ messages in thread
From: Lai, Yi @ 2024-10-30 8:50 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche, yi1.lai, syzkaller-bugs
On Wed, Oct 30, 2024 at 03:13:09PM +0800, Ming Lei wrote:
> On Wed, Oct 30, 2024 at 02:45:03PM +0800, Lai, Yi wrote:
> > ...
>
> It should be addressed by the following patch:
>
> https://lore.kernel.org/linux-block/ZyEGLdg744U_xBjp@fedora/
>
I have applied the proposed fix patch on top of next-20241029. The issue
can still be reproduced.
It seems the dependency chain is different between Marek's log and mine.
> Thanks,
> Ming
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-10-30 8:50 ` Lai, Yi
@ 2024-10-30 9:50 ` Ming Lei
2024-10-30 10:39 ` Lai, Yi
0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2024-10-30 9:50 UTC (permalink / raw)
To: Lai, Yi
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche, yi1.lai, syzkaller-bugs
On Wed, Oct 30, 2024 at 4:51 PM Lai, Yi <yi1.lai@linux.intel.com> wrote:
>
> On Wed, Oct 30, 2024 at 03:13:09PM +0800, Ming Lei wrote:
> > On Wed, Oct 30, 2024 at 02:45:03PM +0800, Lai, Yi wrote:
...
> >
> > It should be addressed by the following patch:
> >
> > https://lore.kernel.org/linux-block/ZyEGLdg744U_xBjp@fedora/
> >
>
> I have applied the proposed fix patch on top of next-20241029. The issue
> can still be reproduced.
>
> It seems the dependency chain is different between Marek's log and mine.
Can you post the new log since q->q_usage_counter(io)->fs_reclaim from
blk_mq_init_sched is cut down by the patch?
Thanks,
Ming
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-10-30 9:50 ` Ming Lei
@ 2024-10-30 10:39 ` Lai, Yi
2024-10-30 11:08 ` Ming Lei
0 siblings, 1 reply; 25+ messages in thread
From: Lai, Yi @ 2024-10-30 10:39 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche, yi1.lai, syzkaller-bugs
On Wed, Oct 30, 2024 at 05:50:15PM +0800, Ming Lei wrote:
> On Wed, Oct 30, 2024 at 4:51 PM Lai, Yi <yi1.lai@linux.intel.com> wrote:
> >
> > On Wed, Oct 30, 2024 at 03:13:09PM +0800, Ming Lei wrote:
> > > On Wed, Oct 30, 2024 at 02:45:03PM +0800, Lai, Yi wrote:
> ...
> > >
> > > It should be addressed by the following patch:
> > >
> > > https://lore.kernel.org/linux-block/ZyEGLdg744U_xBjp@fedora/
> > >
> >
> > I have applied the proposed fix patch on top of next-20241029. The issue
> > can still be reproduced.
> >
> > It seems the dependency chain is different between Marek's log and mine.
>
> Can you post the new log since q->q_usage_counter(io)->fs_reclaim from
> blk_mq_init_sched is cut down by the patch?
>
New possible deadlock log after patch applied:
[ 52.485023] repro: page allocation failure: order:1, mode:0x10cc0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null),cpuset=/,mems0
[ 52.486074] CPU: 1 UID: 0 PID: 635 Comm: repro Not tainted 6.12.0-rc5-next-20241029-kvm-dirty #6
[ 52.486752] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/014
[ 52.487616] Call Trace:
[ 52.487820] <TASK>
[ 52.488001] dump_stack_lvl+0x121/0x150
[ 52.488345] dump_stack+0x19/0x20
[ 52.488616] warn_alloc+0x218/0x350
[ 52.488913] ? __pfx_warn_alloc+0x10/0x10
[ 52.489263] ? __alloc_pages_direct_compact+0x130/0xa10
[ 52.489699] ? __pfx___alloc_pages_direct_compact+0x10/0x10
[ 52.490151] ? __drain_all_pages+0x27b/0x480
[ 52.490522] __alloc_pages_slowpath.constprop.0+0x14d6/0x21e0
[ 52.491018] ? __pfx___alloc_pages_slowpath.constprop.0+0x10/0x10
[ 52.491519] ? lock_is_held_type+0xef/0x150
[ 52.491875] ? __pfx_get_page_from_freelist+0x10/0x10
[ 52.492291] ? lock_acquire+0x80/0xb0
[ 52.492619] __alloc_pages_noprof+0x5d4/0x6f0
[ 52.492992] ? __pfx___alloc_pages_noprof+0x10/0x10
[ 52.493405] ? __sanitizer_cov_trace_switch+0x58/0xa0
[ 52.493830] ? policy_nodemask+0xf9/0x450
[ 52.494169] alloc_pages_mpol_noprof+0x30a/0x580
[ 52.494561] ? __pfx_alloc_pages_mpol_noprof+0x10/0x10
[ 52.494982] ? sysvec_apic_timer_interrupt+0x6a/0xd0
[ 52.495396] ? asm_sysvec_apic_timer_interrupt+0x1f/0x30
[ 52.495845] alloc_pages_noprof+0xa9/0x180
[ 52.496201] kimage_alloc_pages+0x79/0x240
[ 52.496558] kimage_alloc_control_pages+0x1cb/0xa60
[ 52.496982] ? __pfx_kimage_alloc_control_pages+0x10/0x10
[ 52.497437] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 52.497897] do_kexec_load+0x3a6/0x8e0
[ 52.498228] ? __pfx_do_kexec_load+0x10/0x10
[ 52.498593] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 52.499035] ? _copy_from_user+0xb6/0xf0
[ 52.499371] __x64_sys_kexec_load+0x1cc/0x240
[ 52.499740] x64_sys_call+0xf0f/0x20d0
[ 52.500055] do_syscall_64+0x6d/0x140
[ 52.500367] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 52.500778] RIP: 0033:0x7f310423ee5d
[ 52.501077] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c88
[ 52.502494] RSP: 002b:00007fffcecca558 EFLAGS: 00000207 ORIG_RAX: 00000000000000f6
[ 52.503087] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f310423ee5d
[ 52.503644] RDX: 0000000020000040 RSI: 0000000000000009 RDI: 0000000000000000
[ 52.504198] RBP: 00007fffcecca560 R08: 00007fffcecc9fd0 R09: 00007fffcecca590
[ 52.504767] R10: 0000000000000000 R11: 0000000000000207 R12: 00007fffcecca6d8
[ 52.505345] R13: 0000000000401c72 R14: 0000000000403e08 R15: 00007f3104469000
[ 52.505949] </TASK>
[ 52.506239] Mem-Info:
[ 52.506449] active_anon:119 inactive_anon:14010 isolated_anon:0
[ 52.506449] active_file:17895 inactive_file:87 isolated_file:0
[ 52.506449] unevictable:0 dirty:15 writeback:0
[ 52.506449] slab_reclaimable:6957 slab_unreclaimable:20220
[ 52.506449] mapped:11598 shmem:1150 pagetables:766
[ 52.506449] sec_pagetables:0 bounce:0
[ 52.506449] kernel_misc_reclaimable:0
[ 52.506449] free:13776 free_pcp:99 free_cma:0
[ 52.509456] Node 0 active_anon:476kB inactive_anon:56040kB active_file:71580kB inactive_file:348kB unevictable:0kB isolateo
[ 52.511881] Node 0 DMA free:440kB boost:0kB min:440kB low:548kB high:656kB reserved_highatomic:0KB active_anon:0kB inactivB
[ 52.513883] lowmem_reserve[]: 0 1507 0 0 0
[ 52.514269] Node 0 DMA32 free:54664kB boost:0kB min:44612kB low:55764kB high:66916kB reserved_highatomic:0KB active_anon:4B
[ 52.516485] lowmem_reserve[]: 0 0 0 0 0
[ 52.516831] Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 1*32kB (U) 0*64kB 1*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4B
[ 52.517895] Node 0 DMA32: 2970*4kB (UME) 1123*8kB (UME) 532*16kB (UME) 280*32kB (UM) 126*64kB (UM) 27*128kB (UME) 9*256kB B
[ 52.519279] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 52.519387]
[ 52.519971] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 52.520138] ======================================================
[ 52.520805] 19113 total pagecache pages
[ 52.521702] WARNING: possible circular locking dependency detected
[ 52.522016] 0 pages in swap cache
[ 52.522022] Free swap = 124984kB
[ 52.522027] Total swap = 124996kB
[ 52.523720] 6.12.0-rc5-next-20241029-kvm-dirty #6 Not tainted
[ 52.523741] ------------------------------------------------------
[ 52.524059] 524158 pages RAM
[ 52.525050] kswapd0/56 is trying to acquire lock:
[ 52.525452] 0 pages HighMem/MovableOnly
[ 52.525461] 129765 pages reserved
[ 52.525465] 0 pages cma reserved
[ 52.525469] 0 pages hwpoisoned
[ 52.527163] ffff8880104374e8 (&q->q_usage_counter(io)#25){++++}-{0:0}, at: __submit_bio+0x39f/0x550
[ 52.532396]
[ 52.532396] but task is already holding lock:
[ 52.533293] ffffffff872322a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xb0f/0x1500
[ 52.534508]
[ 52.534508] which lock already depends on the new lock.
[ 52.534508]
[ 52.535723]
[ 52.535723] the existing dependency chain (in reverse order) is:
[ 52.536818]
[ 52.536818] -> #2 (fs_reclaim){+.+.}-{0:0}:
[ 52.537705] lock_acquire+0x80/0xb0
[ 52.538337] fs_reclaim_acquire+0x116/0x160
[ 52.539076] blk_mq_alloc_and_init_hctx+0x4df/0x1200
[ 52.539906] blk_mq_realloc_hw_ctxs+0x4cf/0x610
[ 52.540676] blk_mq_init_allocated_queue+0x3da/0x11b0
[ 52.541547] blk_mq_alloc_queue+0x22c/0x300
[ 52.542279] __blk_mq_alloc_disk+0x34/0x100
[ 52.543011] loop_add+0x4c9/0xbd0
[ 52.543622] loop_init+0x133/0x1a0
[ 52.544248] do_one_initcall+0x114/0x5d0
[ 52.544954] kernel_init_freeable+0xab0/0xeb0
[ 52.545732] kernel_init+0x28/0x2f0
[ 52.546366] ret_from_fork+0x56/0x90
[ 52.547009] ret_from_fork_asm+0x1a/0x30
[ 52.547698]
[ 52.547698] -> #1 (&q->sysfs_lock){+.+.}-{4:4}:
[ 52.548625] lock_acquire+0x80/0xb0
[ 52.549276] __mutex_lock+0x17c/0x1540
[ 52.549958] mutex_lock_nested+0x1f/0x30
[ 52.550664] queue_attr_store+0xea/0x180
[ 52.551360] sysfs_kf_write+0x11f/0x180
[ 52.552036] kernfs_fop_write_iter+0x40e/0x630
[ 52.552808] vfs_write+0xc59/0x1140
[ 52.553446] ksys_write+0x14f/0x290
[ 52.554068] __x64_sys_write+0x7b/0xc0
[ 52.554728] x64_sys_call+0x1685/0x20d0
[ 52.555397] do_syscall_64+0x6d/0x140
[ 52.556029] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 52.556865]
[ 52.556865] -> #0 (&q->q_usage_counter(io)#25){++++}-{0:0}:
[ 52.557963] __lock_acquire+0x2ff8/0x5d60
[ 52.558667] lock_acquire.part.0+0x142/0x390
[ 52.559427] lock_acquire+0x80/0xb0
[ 52.560057] blk_mq_submit_bio+0x1cbe/0x2590
[ 52.560801] __submit_bio+0x39f/0x550
[ 52.561473] submit_bio_noacct_nocheck+0x647/0xcc0
[ 52.562285] submit_bio_noacct+0x620/0x1e00
[ 52.563017] submit_bio+0xce/0x480
[ 52.563637] __swap_writepage+0x2f1/0xdf0
[ 52.564349] swap_writepage+0x464/0xbc0
[ 52.565022] shmem_writepage+0xdeb/0x1340
[ 52.565745] pageout+0x3bc/0x9b0
[ 52.566353] shrink_folio_list+0x16b9/0x3b60
[ 52.567104] shrink_lruvec+0xd78/0x2790
[ 52.567794] shrink_node+0xb29/0x2870
[ 52.568454] balance_pgdat+0x9c2/0x1500
[ 52.569142] kswapd+0x765/0xe00
[ 52.569741] kthread+0x35a/0x470
[ 52.570340] ret_from_fork+0x56/0x90
[ 52.570993] ret_from_fork_asm+0x1a/0x30
[ 52.571696]
[ 52.571696] other info that might help us debug this:
[ 52.571696]
[ 52.572904] Chain exists of:
[ 52.572904] &q->q_usage_counter(io)#25 --> &q->sysfs_lock --> fs_reclaim
[ 52.572904]
[ 52.574631] Possible unsafe locking scenario:
[ 52.574631]
[ 52.575547] CPU0 CPU1
[ 52.576246] ---- ----
[ 52.576942] lock(fs_reclaim);
[ 52.577467] lock(&q->sysfs_lock);
[ 52.578382] lock(fs_reclaim);
[ 52.579250] rlock(&q->q_usage_counter(io)#25);
[ 52.579974]
[ 52.579974] *** DEADLOCK ***
[ 52.579974]
[ 52.580866] 1 lock held by kswapd0/56:
[ 52.581459] #0: ffffffff872322a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xb0f/0x1500
[ 52.582731]
[ 52.582731] stack backtrace:
[ 52.583404] CPU: 0 UID: 0 PID: 56 Comm: kswapd0 Not tainted 6.12.0-rc5-next-20241029-kvm-dirty #6
[ 52.584735] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/014
[ 52.586439] Call Trace:
[ 52.586836] <TASK>
[ 52.587190] dump_stack_lvl+0xea/0x150
[ 52.587753] dump_stack+0x19/0x20
[ 52.588253] print_circular_bug+0x47f/0x750
[ 52.588872] check_noncircular+0x2f4/0x3e0
[ 52.589492] ? __pfx_check_noncircular+0x10/0x10
[ 52.590180] ? lockdep_lock+0xd0/0x1d0
[ 52.590741] ? __pfx_lockdep_lock+0x10/0x10
[ 52.590790] kexec: Could not allocate control_code_buffer
[ 52.591341] __lock_acquire+0x2ff8/0x5d60
[ 52.592365] ? __pfx___lock_acquire+0x10/0x10
[ 52.593087] ? __pfx_mark_lock.part.0+0x10/0x10
[ 52.593753] ? __kasan_check_read+0x15/0x20
[ 52.594366] lock_acquire.part.0+0x142/0x390
[ 52.594989] ? __submit_bio+0x39f/0x550
[ 52.595554] ? __pfx_lock_acquire.part.0+0x10/0x10
[ 52.596246] ? debug_smp_processor_id+0x20/0x30
[ 52.596900] ? rcu_is_watching+0x19/0xc0
[ 52.597484] ? trace_lock_acquire+0x139/0x1b0
[ 52.598118] lock_acquire+0x80/0xb0
[ 52.598633] ? __submit_bio+0x39f/0x550
[ 52.599191] blk_mq_submit_bio+0x1cbe/0x2590
[ 52.599805] ? __submit_bio+0x39f/0x550
[ 52.600361] ? __kasan_check_read+0x15/0x20
[ 52.600966] ? __pfx_blk_mq_submit_bio+0x10/0x10
[ 52.601632] ? __pfx_mark_lock.part.0+0x10/0x10
[ 52.602285] ? __this_cpu_preempt_check+0x21/0x30
[ 52.602968] ? __this_cpu_preempt_check+0x21/0x30
[ 52.603646] ? lock_release+0x441/0x870
[ 52.604207] __submit_bio+0x39f/0x550
[ 52.604742] ? __pfx___submit_bio+0x10/0x10
[ 52.605364] ? __this_cpu_preempt_check+0x21/0x30
[ 52.606045] ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
[ 52.606940] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 52.607707] ? kvm_clock_get_cycles+0x43/0x70
[ 52.608345] submit_bio_noacct_nocheck+0x647/0xcc0
[ 52.609045] ? __pfx_submit_bio_noacct_nocheck+0x10/0x10
[ 52.609820] ? __sanitizer_cov_trace_switch+0x58/0xa0
[ 52.610552] submit_bio_noacct+0x620/0x1e00
[ 52.611167] submit_bio+0xce/0x480
[ 52.611677] __swap_writepage+0x2f1/0xdf0
[ 52.612267] swap_writepage+0x464/0xbc0
[ 52.612837] shmem_writepage+0xdeb/0x1340
[ 52.613441] ? __pfx_shmem_writepage+0x10/0x10
[ 52.614090] ? __kasan_check_write+0x18/0x20
[ 52.614716] ? folio_clear_dirty_for_io+0xc1/0x600
[ 52.615403] pageout+0x3bc/0x9b0
[ 52.615894] ? __pfx_pageout+0x10/0x10
[ 52.616471] ? __pfx_folio_referenced_one+0x10/0x10
[ 52.617169] ? __pfx_folio_lock_anon_vma_read+0x10/0x10
[ 52.617918] ? __pfx_invalid_folio_referenced_vma+0x10/0x10
[ 52.618713] shrink_folio_list+0x16b9/0x3b60
[ 52.619346] ? __pfx_shrink_folio_list+0x10/0x10
[ 52.620021] ? __this_cpu_preempt_check+0x21/0x30
[ 52.620713] ? mark_lock.part.0+0xf3/0x17b0
[ 52.621339] ? isolate_lru_folios+0xcb1/0x1250
[ 52.621991] ? __pfx_mark_lock.part.0+0x10/0x10
[ 52.622655] ? __this_cpu_preempt_check+0x21/0x30
[ 52.623335] ? lock_release+0x441/0x870
[ 52.623900] ? __this_cpu_preempt_check+0x21/0x30
[ 52.624573] ? _raw_spin_unlock_irq+0x2c/0x60
[ 52.625204] ? lockdep_hardirqs_on+0x89/0x110
[ 52.625848] shrink_lruvec+0xd78/0x2790
[ 52.626422] ? __pfx_shrink_lruvec+0x10/0x10
[ 52.627040] ? __this_cpu_preempt_check+0x21/0x30
[ 52.627729] ? __this_cpu_preempt_check+0x21/0x30
[ 52.628423] ? trace_lock_acquire+0x139/0x1b0
[ 52.629061] ? trace_lock_acquire+0x139/0x1b0
[ 52.629752] shrink_node+0xb29/0x2870
[ 52.630305] ? __pfx_shrink_node+0x10/0x10
[ 52.630899] ? pgdat_balanced+0x1d4/0x230
[ 52.631490] balance_pgdat+0x9c2/0x1500
[ 52.632055] ? __pfx_balance_pgdat+0x10/0x10
[ 52.632669] ? __this_cpu_preempt_check+0x21/0x30
[ 52.633380] kswapd+0x765/0xe00
[ 52.633861] ? __pfx_kswapd+0x10/0x10
[ 52.634393] ? local_clock_noinstr+0xb0/0xd0
[ 52.635015] ? __pfx_autoremove_wake_function+0x10/0x10
[ 52.635759] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 52.636525] ? __kthread_parkme+0x15d/0x230
[ 52.637134] kthread+0x35a/0x470
[ 52.637616] ? __pfx_kswapd+0x10/0x10
[ 52.638146] ? __pfx_kthread+0x10/0x10
[ 52.638693] ret_from_fork+0x56/0x90
[ 52.639227] ? __pfx_kthread+0x10/0x10
[ 52.639778] ret_from_fork_asm+0x1a/0x30
[ 52.640391] </TASK>
> Thanks,
> Ming
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-10-30 10:39 ` Lai, Yi
@ 2024-10-30 11:08 ` Ming Lei
2024-12-04 3:21 ` Lai, Yi
0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2024-10-30 11:08 UTC (permalink / raw)
To: Lai, Yi
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche, yi1.lai, syzkaller-bugs
On Wed, Oct 30, 2024 at 06:39:13PM +0800, Lai, Yi wrote:
> On Wed, Oct 30, 2024 at 05:50:15PM +0800, Ming Lei wrote:
> > On Wed, Oct 30, 2024 at 4:51 PM Lai, Yi <yi1.lai@linux.intel.com> wrote:
> > >
> > > On Wed, Oct 30, 2024 at 03:13:09PM +0800, Ming Lei wrote:
> > > > On Wed, Oct 30, 2024 at 02:45:03PM +0800, Lai, Yi wrote:
> > ...
> > > >
> > > > It should be addressed by the following patch:
> > > >
> > > > https://lore.kernel.org/linux-block/ZyEGLdg744U_xBjp@fedora/
> > > >
> > >
> > > I have applied the proposed fix patch on top of next-20241029. The issue
> > > can still be reproduced.
> > >
> > > It seems the dependency chain is different between Marek's log and mine.
> >
> > Can you post the new log since q->q_usage_counter(io)->fs_reclaim from
> > blk_mq_init_sched is cut down by the patch?
> >
>
> New possible deadlock log after patch applied:
This one looks like a real deadlock; any memory allocation done with
q->sysfs_lock held carries this risk.
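To make the hazard concrete, below is a minimal sketch of the pattern
lockdep complains about here -- illustrative only, not actual kernel
code; sysfs_store_example() is a made-up stand-in for a ->store()
handler such as queue_attr_store():

static ssize_t sysfs_store_example(struct request_queue *q,
				   const char *page, size_t count)
{
	void *buf;

	mutex_lock(&q->sysfs_lock);

	/*
	 * A GFP_KERNEL allocation may enter direct reclaim (fs_reclaim),
	 * and reclaim may submit writeback bios that need
	 * q->q_usage_counter(io).  Together with the reported
	 * q->q_usage_counter(io) --> q->sysfs_lock --> fs_reclaim
	 * chain, that is a potential deadlock.
	 */
	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
	if (!buf) {
		mutex_unlock(&q->sysfs_lock);
		return -ENOMEM;
	}

	kfree(buf);
	mutex_unlock(&q->sysfs_lock);
	return count;
}

Moving such allocations out of the q->sysfs_lock critical section, or
doing them under memalloc_noio_save()/GFP_NOIO, would break the
q->sysfs_lock -> fs_reclaim link.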
There is another similar report related to the queue sysfs store operation:
https://lore.kernel.org/linux-scsi/ZxG38G9BuFdBpBHZ@fedora/
Thanks,
Ming
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-10-29 15:58 ` Ming Lei
2024-10-29 16:59 ` Marek Szyprowski
@ 2024-11-12 8:36 ` Marek Szyprowski
2024-11-12 10:15 ` Ming Lei
1 sibling, 1 reply; 25+ messages in thread
From: Marek Szyprowski @ 2024-11-12 8:36 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche
On 29.10.2024 16:58, Ming Lei wrote:
> On Tue, Oct 29, 2024 at 12:13:35PM +0100, Marek Szyprowski wrote:
>> On 25.10.2024 02:37, Ming Lei wrote:
>>> ...
>> This patch landed yesterday in linux-next as commit f1be1788a32e
>> ("block: model freeze & enter queue as lock for supporting lockdep").
>> In my tests I found that it introduces the following 2 lockdep warnings:
>>
>> > ...
>>
>>
>> 2. On QEMU's ARM64 virt machine, observed during system suspend/resume
>> cycle:
>>
>> # time rtcwake -s10 -mmem
>> rtcwake: wakeup from "mem" using /dev/rtc0 at Tue Oct 29 11:54:30 2024
>> PM: suspend entry (s2idle)
>> Filesystems sync: 0.004 seconds
>> Freezing user space processes
>> Freezing user space processes completed (elapsed 0.007 seconds)
>> OOM killer disabled.
>> Freezing remaining freezable tasks
>> Freezing remaining freezable tasks completed (elapsed 0.004 seconds)
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 6.12.0-rc4+ #9291 Not tainted
>> ------------------------------------------------------
>> rtcwake/1299 is trying to acquire lock:
>> ffff80008358a7f8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock+0x1c/0x28
>>
>> but task is already holding lock:
>> ffff000006136d68 (&q->q_usage_counter(io)#5){++++}-{0:0}, at:
>> virtblk_freeze+0x24/0x60
>>
>> which lock already depends on the new lock.
>>
>>
>> the existing dependency chain (in reverse order) is:
> This one looks like a real issue; at least the added lockdep code works as
> expected. Also, the blk_mq_freeze_queue() use in virtio-blk's ->suspend()
> is questionable. I will take a further look.
Did you find a way to fix this one? I still observe such warnings in my
tests, even though your lockdep fixes are already merged to -next:
https://lore.kernel.org/all/20241031133723.303835-1-ming.lei@redhat.com/
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-11-12 8:36 ` Marek Szyprowski
@ 2024-11-12 10:15 ` Ming Lei
2024-11-12 11:32 ` Marek Szyprowski
0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2024-11-12 10:15 UTC (permalink / raw)
To: Marek Szyprowski
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche
On Tue, Nov 12, 2024 at 09:36:40AM +0100, Marek Szyprowski wrote:
> On 29.10.2024 16:58, Ming Lei wrote:
> > On Tue, Oct 29, 2024 at 12:13:35PM +0100, Marek Szyprowski wrote:
> >> On 25.10.2024 02:37, Ming Lei wrote:
> >>> ...
> >> This patch landed yesterday in linux-next as commit f1be1788a32e
> >> ("block: model freeze & enter queue as lock for supporting lockdep").
> >> In my tests I found that it introduces the following 2 lockdep warnings:
> >>
> >> > ...
> >>
> >>
> >> 2. On QEMU's ARM64 virt machine, observed during system suspend/resume
> >> cycle:
> >>
> >> ...
> > This one looks like a real issue; at least the added lockdep code works as
> > expected. Also, the blk_mq_freeze_queue() use in virtio-blk's ->suspend()
> > is questionable. I will take a further look.
>
> Did you find a way to fix this one? I still observe such warnings in my
> tests, even though your lockdep fixes are already merged to -next:
> https://lore.kernel.org/all/20241031133723.303835-1-ming.lei@redhat.com/
The lockdep fixes in -next are just for making the added lockdep
annotations work correctly; virtio-blk is another story.
It might be fine to annotate it with blk_mq_freeze_queue_no_owner(),
but it looks very fragile to freeze the queue in ->suspend(), since the
lock is then held across the whole suspend code path.
Can you try the following patch?
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 194417abc105..21488740eb15 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -1594,6 +1594,7 @@ static int virtblk_freeze(struct virtio_device *vdev)
/* Ensure no requests in virtqueues before deleting vqs. */
blk_mq_freeze_queue(vblk->disk->queue);
+ blk_mq_unfreeze_queue(vblk->disk->queue);
/* Ensure we don't receive any more interrupts */
virtio_reset_device(vdev);
@@ -1617,8 +1618,6 @@ static int virtblk_restore(struct virtio_device *vdev)
return ret;
virtio_device_ready(vdev);
-
- blk_mq_unfreeze_queue(vblk->disk->queue);
return 0;
}
#endif
Thanks,
Ming
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-11-12 10:15 ` Ming Lei
@ 2024-11-12 11:32 ` Marek Szyprowski
2024-11-12 11:48 ` Ming Lei
0 siblings, 1 reply; 25+ messages in thread
From: Marek Szyprowski @ 2024-11-12 11:32 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche
On 12.11.2024 11:15, Ming Lei wrote:
> On Tue, Nov 12, 2024 at 09:36:40AM +0100, Marek Szyprowski wrote:
>> On 29.10.2024 16:58, Ming Lei wrote:
>>> On Tue, Oct 29, 2024 at 12:13:35PM +0100, Marek Szyprowski wrote:
>>>> On 25.10.2024 02:37, Ming Lei wrote:
>>>>> ...
>>>> This patch landed yesterday in linux-next as commit f1be1788a32e
>>>> ("block: model freeze & enter queue as lock for supporting lockdep").
>>>> In my tests I found that it introduces the following 2 lockdep warnings:
>>>>
>>>>> ...
>>>>
>>>> 2. On QEMU's ARM64 virt machine, observed during system suspend/resume
>>>> cycle:
>>>>
>>>> ...
>>> This one looks like a real issue; at least the added lockdep code works as
>>> expected. Also, the blk_mq_freeze_queue() use in virtio-blk's ->suspend()
>>> is questionable. I will take a further look.
>> Did you find a way to fix this one? I still observe such warnings in my
>> tests, even though your lockdep fixes are already merged to -next:
>> https://lore.kernel.org/all/20241031133723.303835-1-ming.lei@redhat.com/
> The lockdep fixes in -next are just for making the added lockdep
> annotations work correctly; virtio-blk is another story.
>
> It might be fine to annotate it with blk_mq_freeze_queue_no_owner(),
> but it looks very fragile to freeze the queue in ->suspend(), since the
> lock is then held across the whole suspend code path.
>
> Can you try the following patch?
Yes, this hides this lockdep warning, but imho it looks like a
workaround, not a final fix.
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
> ...
> Thanks,
> Ming
>
>
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-11-12 11:32 ` Marek Szyprowski
@ 2024-11-12 11:48 ` Ming Lei
0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2024-11-12 11:48 UTC (permalink / raw)
To: Marek Szyprowski
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche
On Tue, Nov 12, 2024 at 12:32:29PM +0100, Marek Szyprowski wrote:
> On 12.11.2024 11:15, Ming Lei wrote:
> > On Tue, Nov 12, 2024 at 09:36:40AM +0100, Marek Szyprowski wrote:
> >> On 29.10.2024 16:58, Ming Lei wrote:
> >>> On Tue, Oct 29, 2024 at 12:13:35PM +0100, Marek Szyprowski wrote:
> >>>> On 25.10.2024 02:37, Ming Lei wrote:
> >>>>> ...
> >>>> This patch landed yesterday in linux-next as commit f1be1788a32e
> >>>> ("block: model freeze & enter queue as lock for supporting lockdep").
> >>>> In my tests I found that it introduces the following 2 lockdep warnings:
> >>>>
> >>>>> ...
> >>>>
> >>>> 2. On QEMU's ARM64 virt machine, observed during system suspend/resume
> >>>> cycle:
> >>>>
> >>>> ...
> >>> This one looks like a real issue; at least the added lockdep code works as
> >>> expected. Also, the blk_mq_freeze_queue() use in virtio-blk's ->suspend()
> >>> is questionable. I will take a further look.
> >> Did you find a way to fix this one? I still observe such warnings in my
> >> tests, even though your lockdep fixes are already merged to -next:
> >> https://lore.kernel.org/all/20241031133723.303835-1-ming.lei@redhat.com/
> > The lockdep fixes in -next are just for making the added lockdep
> > annotations work correctly; virtio-blk is another story.
> >
> > It might be fine to annotate it with blk_mq_freeze_queue_no_owner(),
> > but it looks very fragile to freeze the queue in ->suspend(), since the
> > lock is then held across the whole suspend code path.
> >
> > Can you try the following patch?
>
> Yes, this hides this lockdep warning, but imho it looks like a
> workaround, not a final fix.
>
> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Thanks for the test!
It is actually not a workaround: what virtblk_freeze() needs is to drain
all in-flight IOs. One thing still missing is marking the queue as
quiesced; I will post a formal patch that covers queue quiesce as well.
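Roughly, the idea is something like the sketch below -- only an
illustration of the intended flow, assuming blk_mq_quiesce_queue() is
used for the quiesce part; the formal patch may differ in detail:

/* Sketch of the intended flow, not the final patch. */
static int virtblk_freeze_sketch(struct virtio_device *vdev)
{
	struct virtio_blk *vblk = vdev->priv;

	/* Drain all in-flight requests, then release the freeze "lock"
	 * right away instead of holding it across the whole suspend
	 * code path. */
	blk_mq_freeze_queue(vblk->disk->queue);
	blk_mq_unfreeze_queue(vblk->disk->queue);

	/* Keep new requests from being dispatched to the virtqueues,
	 * which are about to be reset; virtblk_restore() would then
	 * call blk_mq_unquiesce_queue() instead of unfreezing. */
	blk_mq_quiesce_queue(vblk->disk->queue);

	/* Ensure we don't receive any more interrupts */
	virtio_reset_device(vdev);
	return 0;
}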
Thanks,
Ming
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-10-30 11:08 ` Ming Lei
@ 2024-12-04 3:21 ` Lai, Yi
2024-12-04 3:30 ` Ming Lei
2025-01-13 14:39 ` Chris Bainbridge
0 siblings, 2 replies; 25+ messages in thread
From: Lai, Yi @ 2024-12-04 3:21 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche, yi1.lai, syzkaller-bugs
On Wed, Oct 30, 2024 at 07:08:48PM +0800, Ming Lei wrote:
> On Wed, Oct 30, 2024 at 06:39:13PM +0800, Lai, Yi wrote:
> > On Wed, Oct 30, 2024 at 05:50:15PM +0800, Ming Lei wrote:
> > > ...
> >
> > New possible deadlock log after patch applied:
>
> This one looks like a real deadlock; any memory allocation with
> q->sysfs_lock held carries such a risk.
>
> There is another similar report related to a queue sysfs store operation:
>
> https://lore.kernel.org/linux-scsi/ZxG38G9BuFdBpBHZ@fedora/
>
>
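The risky pattern Ming describes above is, schematically, the one below.
This is an illustrative sketch only (a hypothetical store handler, not
code from any posted patch):

	static ssize_t store_example(struct request_queue *q,
				     const char *page, size_t count)
	{
		void *buf;

		mutex_lock(&q->sysfs_lock);
		/*
		 * A GFP_KERNEL allocation may enter direct reclaim
		 * (fs_reclaim), and reclaim may issue IO, which acquires
		 * q->q_usage_counter(io).  Any other path that orders
		 * q->q_usage_counter(io) before q->sysfs_lock then closes
		 * the circular dependency lockdep reports.
		 */
		buf = kmalloc(count, GFP_KERNEL);
		if (!buf) {
			mutex_unlock(&q->sysfs_lock);
			return -ENOMEM;
		}
		/* parse 'page' into 'buf' and apply the new setting here */
		kfree(buf);
		mutex_unlock(&q->sysfs_lock);
		return count;
	}

Doing the allocation before grabbing q->sysfs_lock (or doing it with
GFP_NOIO) breaks the chain.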
For v6.13-rc3, there are possible deadlocks in blk_trace_ioctl and in
perf_event_ctx_lock_nested (logs below). Based on my bisection log, they
all bisect to your commit f1be1788a32e ("block: model freeze & enter
queue as lock for supporting lockdep").
I am hoping this will be insightful to you.
possible deadlock in blk_trace_ioctl:
[ 33.317360] ffff88800fc8f3e0 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0xf1/0x1b0
[ 33.317859]
[ 33.317859] but task is already holding lock:
[ 33.318206] ffff8880109b92e0 (&q->debugfs_mutex){+.+.}-{4:4}, at: blk_trace_ioctl+0xaa/0x290
[ 33.318722]
[ 33.318722] which lock already depends on the new lock.
[ 33.318722]
[ 33.319198]
[ 33.319198] the existing dependency chain (in reverse order) is:
[ 33.319637]
[ 33.319637] -> #4 (&q->debugfs_mutex){+.+.}-{4:4}:
[ 33.320018] lock_acquire+0x80/0xb0
[ 33.320270] __mutex_lock+0x17c/0x1540
[ 33.320555] mutex_lock_nested+0x1f/0x30
[ 33.320839] blk_register_queue+0x146/0x510
[ 33.321177] add_disk_fwnode+0x7b7/0x1300
[ 33.321490] device_add_disk+0x31/0x40
[ 33.321759] brd_alloc.isra.0+0x4d4/0x710
[ 33.322042] brd_init+0x11e/0x1a0
[ 33.322283] do_one_initcall+0x114/0x5f0
[ 33.322556] kernel_init_freeable+0xab0/0xeb0
[ 33.322857] kernel_init+0x28/0x2f0
[ 33.323108] ret_from_fork+0x56/0x90
[ 33.323362] ret_from_fork_asm+0x1a/0x30
[ 33.323638]
[ 33.323638] -> #3 (&q->sysfs_lock){+.+.}-{4:4}:
[ 33.324004] lock_acquire+0x80/0xb0
[ 33.324253] __mutex_lock+0x17c/0x1540
[ 33.324535] mutex_lock_nested+0x1f/0x30
[ 33.324814] queue_attr_store+0xea/0x180
[ 33.325127] sysfs_kf_write+0x11f/0x180
[ 33.325440] kernfs_fop_write_iter+0x40e/0x630
[ 33.325744] vfs_write+0xc59/0x1140
[ 33.325995] ksys_write+0x14f/0x280
[ 33.326242] __x64_sys_write+0x7b/0xc0
[ 33.326507] x64_sys_call+0x16b3/0x2140
[ 33.326776] do_syscall_64+0x6d/0x140
[ 33.327032] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 33.327367]
[ 33.327367] -> #2 (&q->q_usage_counter(io)#25){++++}-{0:0}:
[ 33.327796] lock_acquire+0x80/0xb0
[ 33.328045] blk_mq_submit_bio+0x1b81/0x2430
[ 33.328338] __submit_bio+0x37b/0x530
[ 33.328614] submit_bio_noacct_nocheck+0x642/0xcd0
[ 33.328940] submit_bio_noacct+0x61f/0x1d10
[ 33.329274] submit_bio+0xce/0x480
[ 33.329540] submit_bh_wbc+0x58c/0x750
[ 33.329807] submit_bh+0x2b/0x40
[ 33.330042] ext4_read_bh_nowait+0x17f/0x260
[ 33.330337] ext4_sb_breadahead_unmovable+0xc3/0x100
[ 33.330668] __ext4_get_inode_loc+0x107c/0x1550
[ 33.330977] ext4_get_inode_loc+0xd1/0x180
[ 33.331264] ext4_reserve_inode_write+0xd7/0x280
[ 33.331573] __ext4_mark_inode_dirty+0x18b/0x8c0
[ 33.331885] ext4_ext_tree_init+0x163/0x1a0
[ 33.332172] __ext4_new_inode+0x4697/0x52d0
[ 33.332458] ext4_create+0x32e/0x550
[ 33.332736] lookup_open.isra.0+0x117c/0x1580
[ 33.333047] path_openat+0xdbf/0x2bf0
[ 33.333343] do_filp_open+0x1fd/0x460
[ 33.333623] do_sys_openat2+0x185/0x1f0
[ 33.333892] __x64_sys_openat+0x17a/0x240
[ 33.334172] x64_sys_call+0x1774/0x2140
[ 33.334442] do_syscall_64+0x6d/0x140
[ 33.334696] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 33.335034]
[ 33.335034] -> #1 (jbd2_handle){++++}-{0:0}:
[ 33.335384] lock_acquire+0x80/0xb0
[ 33.335634] start_this_handle+0x1081/0x1590
[ 33.335926] jbd2__journal_start+0x397/0x6b0
[ 33.336216] __ext4_journal_start_sb+0x458/0x600
[ 33.336548] ext4_dirty_inode+0xb3/0x140
[ 33.336826] __mark_inode_dirty+0x1cb/0xd80
[ 33.337154] generic_update_time+0xe0/0x110
[ 33.337473] touch_atime+0x514/0x5f0
[ 33.337730] ext4_file_mmap+0x344/0x6a0
[ 33.338000] __mmap_region+0x10e7/0x25a0
[ 33.338273] mmap_region+0x13b/0x2f0
[ 33.338529] do_mmap+0xd9b/0x11f0
[ 33.338787] vm_mmap_pgoff+0x1ea/0x390
[ 33.339056] vm_mmap+0xa3/0xd0
[ 33.339290] elf_load+0x3fe/0x750
[ 33.339527] load_elf_binary+0xe47/0x5260
[ 33.339805] bprm_execve+0x73b/0x1950
[ 33.340063] do_execveat_common.isra.0+0x521/0x650
[ 33.340387] __x64_sys_execve+0x96/0xc0
[ 33.340677] x64_sys_call+0x1b31/0x2140
[ 33.340960] do_syscall_64+0x6d/0x140
[ 33.341267] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 33.341616]
[ 33.341616] -> #0 (&mm->mmap_lock){++++}-{4:4}:
[ 33.341989] __lock_acquire+0x2ff8/0x5d60
[ 33.342267] lock_acquire.part.0+0x142/0x390
[ 33.342564] lock_acquire+0x80/0xb0
[ 33.342814] __might_fault+0x122/0x1b0
[ 33.343079] _copy_from_user+0x2e/0xa0
[ 33.343347] __blk_trace_setup+0xa3/0x190
[ 33.343625] blk_trace_ioctl+0x14c/0x290
[ 33.343899] blkdev_ioctl+0x11e/0x6a0
[ 33.344161] __x64_sys_ioctl+0x1ba/0x220
[ 33.344434] x64_sys_call+0x1227/0x2140
[ 33.344724] do_syscall_64+0x6d/0x140
[ 33.344988] entry_SYSCALL_64_after_hwframe+0x76/0x7e
possible deadlock in perf_event_ctx_lock_nested
[ 36.357846] ffff88801108f3e0 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0xf1/0x1b0
[ 36.358351]
[ 36.358351] but task is already holding lock:
[ 36.358690] ffff88806c441508 (&cpuctx_mutex){+.+.}-{4:4}, at: perf_event_ctx_lock_nested+0x252/0x4f0
[ 36.359229]
[ 36.359229] which lock already depends on the new lock.
[ 36.359229]
[ 36.359695]
[ 36.359695] the existing dependency chain (in reverse order) is:
[ 36.360149]
[ 36.360149] -> #6 (&cpuctx_mutex){+.+.}-{4:4}:
[ 36.360521] lock_acquire+0x80/0xb0
[ 36.360770] __mutex_lock+0x17c/0x1540
[ 36.361036] mutex_lock_nested+0x1f/0x30
[ 36.361307] perf_event_init_cpu+0x2be/0x910
[ 36.361597] perf_event_init+0x520/0x700
[ 36.361888] start_kernel+0x26e/0x550
[ 36.362157] x86_64_start_reservations+0x1c/0x30
[ 36.362466] x86_64_start_kernel+0xa0/0xb0
[ 36.362744] common_startup_64+0x13e/0x141
[ 36.363023]
[ 36.363023] -> #5 (pmus_lock){+.+.}-{4:4}:
[ 36.363357] lock_acquire+0x80/0xb0
[ 36.363602] __mutex_lock+0x17c/0x1540
[ 36.363862] mutex_lock_nested+0x1f/0x30
[ 36.364155] perf_event_init_cpu+0xd8/0x910
[ 36.364450] cpuhp_invoke_callback+0x3cf/0x9d0
[ 36.364747] __cpuhp_invoke_callback_range+0x11f/0x250
[ 36.365084] _cpu_up+0x3b0/0x7d0
[ 36.365332] cpu_up+0x192/0x220
[ 36.365562] cpuhp_bringup_mask+0x112/0x1c0
[ 36.365878] bringup_nonboot_cpus+0x1b8/0x200
[ 36.366186] smp_init+0x3c/0x140
[ 36.366418] kernel_init_freeable+0x8fc/0xeb0
[ 36.366714] kernel_init+0x28/0x2f0
[ 36.366959] ret_from_fork+0x56/0x90
[ 36.367212] ret_from_fork_asm+0x1a/0x30
[ 36.367480]
[ 36.367480] -> #4 (cpu_hotplug_lock){++++}-{0:0}:
[ 36.367849] lock_acquire+0x80/0xb0
[ 36.368119] __cpuhp_state_add_instance+0x58/0x2f0
[ 36.368448] blk_mq_alloc_and_init_hctx+0x270/0x1200
[ 36.368776] blk_mq_realloc_hw_ctxs+0x4cf/0x610
[ 36.369079] blk_mq_init_allocated_queue+0x3da/0x11b0
[ 36.369428] blk_mq_alloc_queue+0x225/0x2f0
[ 36.369723] __blk_mq_alloc_disk+0x34/0x100
[ 36.370028] loop_add+0x4bb/0xbc0
[ 36.370274] loop_init+0x133/0x1a0
[ 36.370513] do_one_initcall+0x114/0x5f0
[ 36.370781] kernel_init_freeable+0xab0/0xeb0
[ 36.371079] kernel_init+0x28/0x2f0
[ 36.371323] ret_from_fork+0x56/0x90
[ 36.371573] ret_from_fork_asm+0x1a/0x30
[ 36.371841]
[ 36.371841] -> #3 (&q->sysfs_lock){+.+.}-{4:4}:
[ 36.372234] lock_acquire+0x80/0xb0
[ 36.372490] __mutex_lock+0x17c/0x1540
[ 36.372749] mutex_lock_nested+0x1f/0x30
[ 36.373021] queue_attr_store+0xea/0x180
[ 36.373291] sysfs_kf_write+0x11f/0x180
[ 36.373558] kernfs_fop_write_iter+0x40e/0x630
[ 36.373874] vfs_write+0xc59/0x1140
[ 36.374133] ksys_write+0x14f/0x280
[ 36.374376] __x64_sys_write+0x7b/0xc0
[ 36.374634] x64_sys_call+0x16b3/0x2140
[ 36.374901] do_syscall_64+0x6d/0x140
[ 36.375150] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 36.375502]
[ 36.375502] -> #2 (&q->q_usage_counter(io)#25){++++}-{0:0}:
[ 36.375956] lock_acquire+0x80/0xb0
[ 36.376216] blk_mq_submit_bio+0x1b81/0x2430
[ 36.376503] __submit_bio+0x37b/0x530
[ 36.376753] submit_bio_noacct_nocheck+0x642/0xcd0
[ 36.377068] submit_bio_noacct+0x61f/0x1d10
[ 36.377349] submit_bio+0xce/0x480
[ 36.377587] submit_bh_wbc+0x58c/0x750
[ 36.377864] submit_bh+0x2b/0x40
[ 36.378098] ext4_read_bh_nowait+0x17f/0x260
[ 36.378398] ext4_sb_breadahead_unmovable+0xc3/0x100
[ 36.378726] __ext4_get_inode_loc+0x107c/0x1550
[ 36.379030] ext4_get_inode_loc+0xd1/0x180
[ 36.379312] ext4_reserve_inode_write+0xd7/0x280
[ 36.379617] __ext4_mark_inode_dirty+0x18b/0x8c0
[ 36.379940] ext4_ext_tree_init+0x163/0x1a0
[ 36.380228] __ext4_new_inode+0x4697/0x52d0
[ 36.380520] ext4_create+0x32e/0x550
[ 36.380766] lookup_open.isra.0+0x117c/0x1580
[ 36.381059] path_openat+0xdbf/0x2bf0
[ 36.381310] do_filp_open+0x1fd/0x460
[ 36.381558] do_sys_openat2+0x185/0x1f0
[ 36.381839] __x64_sys_openat+0x17a/0x240
[ 36.382123] x64_sys_call+0x1774/0x2140
[ 36.382396] do_syscall_64+0x6d/0x140
[ 36.382646] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 36.382973]
[ 36.382973] -> #1 (jbd2_handle){++++}-{0:0}:
[ 36.383317] lock_acquire+0x80/0xb0
[ 36.383559] start_this_handle+0x1081/0x1590
[ 36.383844] jbd2__journal_start+0x397/0x6b0
[ 36.384153] __ext4_journal_start_sb+0x458/0x600
[ 36.384473] ext4_dirty_inode+0xb3/0x140
[ 36.384741] __mark_inode_dirty+0x1cb/0xd80
[ 36.385024] generic_update_time+0xe0/0x110
possible deadlock in blk_trace_ioctl
[ 42.443488] ======================================================
[ 42.444172] WARNING: possible circular locking dependency detected
[ 42.444846] 6.12.0-rc4+ #1 Not tainted
[ 42.445265] ------------------------------------------------------
[ 42.445936] repro/724 is trying to acquire lock:
[ 42.446451] ffff8880055e4b98 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0xf1/0x1b0
[ 42.447368]
[ 42.447368] but task is already holding lock:
[ 42.448012] ffff8880050f2fc0 (&q->debugfs_mutex){+.+.}-{3:3}, at: blk_trace_ioctl+0xaa/0x290
[ 42.448969]
[ 42.448969] which lock already depends on the new lock.
[ 42.448969]
[ 42.449879]
[ 42.449879] the existing dependency chain (in reverse order) is:
[ 42.450713]
[ 42.450713] -> #3 (&q->debugfs_mutex){+.+.}-{3:3}:
[ 42.451442] lock_acquire+0x80/0xb0
[ 42.451907] __mutex_lock+0x15c/0x1490
[ 42.452410] mutex_lock_nested+0x1f/0x30
[ 42.452934] blk_mq_init_sched+0x46e/0x6d0
[ 42.453475] elevator_init_mq+0x2b2/0x400
[ 42.453999] add_disk_fwnode+0x11c/0x1300
[ 42.454524] device_add_disk+0x31/0x40
[ 42.455017] sd_probe+0xa79/0x1080
[ 42.455473] really_probe+0x27c/0xac0
[ 42.455952] __driver_probe_device+0x1f3/0x460
[ 42.456515] driver_probe_device+0x56/0x1b0
[ 42.457052] __device_attach_driver+0x1e7/0x300
[ 42.457625] bus_for_each_drv+0x159/0x1e0
[ 42.458155] __device_attach_async_helper+0x1e4/0x2a0
[ 42.458792] async_run_entry_fn+0xa3/0x450
[ 42.459336] process_one_work+0x92e/0x1b50
[ 42.459870] worker_thread+0x68d/0xe90
[ 42.460371] kthread+0x35a/0x470
[ 42.460808] ret_from_fork+0x56/0x90
[ 42.461288] ret_from_fork_asm+0x1a/0x30
[ 42.461802]
[ 42.461802] -> #2 (&q->q_usage_counter(io)#25){++++}-{0:0}:
[ 42.462618] lock_acquire+0x80/0xb0
[ 42.463083] blk_mq_submit_bio+0x1cb8/0x2580
[ 42.463628] __submit_bio+0x39f/0x550
[ 42.464116] submit_bio_noacct_nocheck+0x6a0/0xcc0
[ 42.464717] submit_bio_noacct+0x61d/0x1dc0
[ 42.465258] submit_bio+0xce/0x480
[ 42.465711] submit_bh_wbc+0x58a/0x740
[ 42.466220] submit_bh+0x2b/0x40
[ 42.466668] ext4_read_bh+0x14e/0x2c0
[ 42.467163] ext4_read_bh_lock+0x88/0xd0
[ 42.467685] ext4_block_zero_page_range+0x486/0xa20
[ 42.468298] ext4_truncate+0xe7e/0x1210
[ 42.468802] ext4_setattr+0x1b10/0x2560
[ 42.470507] notify_change+0x6d8/0x1270
[ 42.471000] do_truncate+0x171/0x240
[ 42.471488] do_ftruncate+0x602/0x750
[ 42.471967] do_sys_ftruncate+0x71/0xd0
[ 42.472471] __x64_sys_ftruncate+0x61/0x90
[ 42.472999] x64_sys_call+0x553/0x20d0
[ 42.473501] do_syscall_64+0x6d/0x140
[ 42.473984] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 42.474617]
[ 42.474617] -> #1 (jbd2_handle){++++}-{0:0}:
[ 42.475277] lock_acquire+0x80/0xb0
[ 42.475742] start_this_handle+0x1081/0x1590
[ 42.476292] jbd2__journal_start+0x397/0x6b0
[ 42.476841] __ext4_journal_start_sb+0x466/0x600
[ 42.477424] ext4_dirty_inode+0xb3/0x140
[ 42.477936] __mark_inode_dirty+0x1d0/0xd50
[ 42.478480] generic_update_time+0xe0/0x110
[ 42.479022] touch_atime+0x514/0x5f0
[ 42.479507] ext4_file_mmap+0x344/0x6a0
[ 42.480005] mmap_region+0x1152/0x29c0
[ 42.480510] do_mmap+0xd9b/0x11f0
[ 42.480963] vm_mmap_pgoff+0x1ea/0x390
[ 42.481456] ksys_mmap_pgoff+0x3dc/0x520
[ 42.481972] __x64_sys_mmap+0x139/0x1d0
[ 42.482480] x64_sys_call+0x18c6/0x20d0
[ 42.482981] do_syscall_64+0x6d/0x140
[ 42.483467] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 42.484106]
[ 42.484106] -> #0 (&mm->mmap_lock){++++}-{3:3}:
[ 42.484786] __lock_acquire+0x2f1b/0x5c90
[ 42.485307] lock_acquire.part.0+0x142/0x390
[ 42.485851] lock_acquire+0x80/0xb0
[ 42.486316] __might_fault+0x122/0x1b0
[ 42.486807] _copy_from_user+0x33/0xe0
[ 42.487311] __blk_trace_setup+0xa3/0x190
[ 42.487837] blk_trace_ioctl+0x14c/0x290
[ 42.488345] blkdev_ioctl+0x11e/0x6a0
[ 42.488819] __x64_sys_ioctl+0x1b5/0x230
[ 42.489334] x64_sys_call+0x1209/0x20d0
[ 42.489854] do_syscall_64+0x6d/0x140
[ 42.490354] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 42.491011]
[ 42.491011] other info that might help us debug this:
[ 42.491011]
[ 42.491866] Chain exists of:
[ 42.491866] &mm->mmap_lock --> &q->q_usage_counter(io)#25 --> &q->debugfs_mutex
[ 42.491866]
[ 42.493153] Possible unsafe locking scenario:
[ 42.493153]
[ 42.493788] CPU0 CPU1
[ 42.494285] ---- ----
[ 42.494780] lock(&q->debugfs_mutex);
[ 42.495207] lock(&q->q_usage_counter(io)#25);
[ 42.495978] lock(&q->debugfs_mutex);
[ 42.496697] rlock(&mm->mmap_lock);
[ 42.497121]
[ 42.497121] *** DEADLOCK ***
Regards,
Yi Lai
>
>
> Thanks,
> Ming
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-12-04 3:21 ` Lai, Yi
@ 2024-12-04 3:30 ` Ming Lei
2025-01-13 14:39 ` Chris Bainbridge
1 sibling, 0 replies; 25+ messages in thread
From: Ming Lei @ 2024-12-04 3:30 UTC (permalink / raw)
To: Lai, Yi
Cc: Jens Axboe, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche, yi1.lai, syzkaller-bugs
On Wed, Dec 04, 2024 at 11:21:53AM +0800, Lai, Yi wrote:
> On Wed, Oct 30, 2024 at 07:08:48PM +0800, Ming Lei wrote:
> > On Wed, Oct 30, 2024 at 06:39:13PM +0800, Lai, Yi wrote:
> > > On Wed, Oct 30, 2024 at 05:50:15PM +0800, Ming Lei wrote:
> > > > On Wed, Oct 30, 2024 at 4:51 PM Lai, Yi <yi1.lai@linux.intel.com> wrote:
> > > > >
> > > > > On Wed, Oct 30, 2024 at 03:13:09PM +0800, Ming Lei wrote:
> > > > > > On Wed, Oct 30, 2024 at 02:45:03PM +0800, Lai, Yi wrote:
> > > > ...
> > > > > >
> > > > > > It should be addressed by the following patch:
> > > > > >
> > > > > > https://lore.kernel.org/linux-block/ZyEGLdg744U_xBjp@fedora/
> > > > > >
> > > > >
> > > > > I have applied the proposed fix patch on top of next-20241029. The
> > > > > issue can still be reproduced.
> > > > >
> > > > > It seems the dependency chain differs between Marek's log and mine.
> > > >
> > > > Can you post the new log, since the q->q_usage_counter(io) ->
> > > > fs_reclaim dependency from blk_mq_init_sched is cut by the patch?
> > > >
> > >
> > > New possible deadlock log after patch applied:
> >
> > This one looks like a real deadlock; any memory allocation with
> > q->sysfs_lock held carries such a risk.
> >
> > There is another similar report related to a queue sysfs store operation:
> >
> > https://lore.kernel.org/linux-scsi/ZxG38G9BuFdBpBHZ@fedora/
> >
> >
> For v6.13-rc3, there are possible deadlocks in blk_trace_ioctl and in
> perf_event_ctx_lock_nested (logs below). Based on my bisection log, they
> all bisect to your commit f1be1788a32e ("block: model freeze & enter
> queue as lock for supporting lockdep").
>
> I am hoping this will be insightful to you.
>
> possible deadlock in blk_trace_ioctl:
> [ 33.317360] ffff88800fc8f3e0 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0xf1/0x1b0
> [ 33.317859]
> [ 33.317859] but task is already holding lock:
> [ 33.318206] ffff8880109b92e0 (&q->debugfs_mutex){+.+.}-{4:4}, at: blk_trace_ioctl+0xaa/0x290
> [ 33.318722]
> [ 33.318722] which lock already depends on the new lock.
> [ 33.318722]
The warning has been fixed by the following patches in for-6.14/block:
blktrace: move copy_[to|from]_user() out of ->debugfs_lock
blktrace: don't centralize grabbing q->debugfs_mutex in blk_trace_ioctl
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=for-6.14/block
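In other words, the setup side now faults on the user buffer before
grabbing q->debugfs_mutex, along the lines of the simplified sketch below
(the idea of those patches, not their literal code):

	static int blk_trace_setup(struct request_queue *q, char *name,
				   dev_t dev, struct block_device *bdev,
				   char __user *arg)
	{
		struct blk_user_trace_setup buts;
		int ret;

		/* may fault and take mmap_lock: do it before the mutex */
		if (copy_from_user(&buts, arg, sizeof(buts)))
			return -EFAULT;

		mutex_lock(&q->debugfs_mutex);
		ret = do_blk_trace_setup(q, name, dev, bdev, &buts);
		mutex_unlock(&q->debugfs_mutex);
		if (ret)
			return ret;

		/* copy the generated name back, again without the mutex */
		if (copy_to_user(arg, &buts, sizeof(buts))) {
			blk_trace_remove(q);
			return -EFAULT;
		}
		return 0;
	}

With both user copies outside q->debugfs_mutex, the debugfs_mutex ->
mmap_lock edge in the report above goes away.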
Thanks,
Ming
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2024-12-04 3:21 ` Lai, Yi
2024-12-04 3:30 ` Ming Lei
@ 2025-01-13 14:39 ` Chris Bainbridge
2025-01-13 15:11 ` Ming Lei
2025-01-13 15:23 ` Chris Bainbridge
1 sibling, 2 replies; 25+ messages in thread
From: Chris Bainbridge @ 2025-01-13 14:39 UTC (permalink / raw)
To: Lai, Yi
Cc: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig,
Peter Zijlstra, Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon,
linux-kernel, Bart Van Assche, yi1.lai, syzkaller-bugs
Hi,
With the latest mainline 6.13-rc6, I have been getting intermittent
lockdep warnings when using a btrfs filesystem. The warnings bisect to
this commit:
f1be1788a32e8fa63416ad4518bbd1a85a825c9d is the first bad commit
commit f1be1788a32e8fa63416ad4518bbd1a85a825c9d
Author: Ming Lei <ming.lei@redhat.com>
Date: Fri Oct 25 08:37:20 2024 +0800
block: model freeze & enter queue as lock for supporting lockdep
On my system, these lockdep warnings are reproducible just by doing a
large fs operation, such as copying the whole linux kernel git repo to
the btrfs filesystem.
The lockdep warning is:
[ 437.745808] ======================================================
[ 437.745810] WARNING: possible circular locking dependency detected
[ 437.745811] 6.13.0-rc6-00037-gac70f027bab6 #112 Not tainted
[ 437.745813] ------------------------------------------------------
[ 437.745814] kswapd0/141 is trying to acquire lock:
[ 437.745815] ffff925c11095e90 (&delayed_node->mutex){+.+.}-{4:4}, at: __btrfs_release_delayed_node.part.0+0x3f/0x280 [btrfs]
[ 437.745862]
but task is already holding lock:
[ 437.745863] ffffffffb9cc8c80 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x578/0xa80
[ 437.745869]
which lock already depends on the new lock.
[ 437.745870]
the existing dependency chain (in reverse order) is:
[ 437.745871]
-> #3 (fs_reclaim){+.+.}-{0:0}:
[ 437.745873] fs_reclaim_acquire+0xbd/0xf0
[ 437.745877] __kmalloc_node_noprof+0xa1/0x4f0
[ 437.745880] __kvmalloc_node_noprof+0x24/0x100
[ 437.745881] sbitmap_init_node+0x98/0x240
[ 437.745885] scsi_realloc_sdev_budget_map+0xdd/0x1d0
[ 437.745889] scsi_add_lun+0x458/0x760
[ 437.745891] scsi_probe_and_add_lun+0x15e/0x480
[ 437.745892] __scsi_scan_target+0xfb/0x230
[ 437.745893] scsi_scan_channel+0x65/0xc0
[ 437.745894] scsi_scan_host_selected+0xfb/0x160
[ 437.745896] do_scsi_scan_host+0x9d/0xb0
[ 437.745897] do_scan_async+0x1c/0x1a0
[ 437.745898] async_run_entry_fn+0x2d/0x120
[ 437.745901] process_one_work+0x210/0x730
[ 437.745903] worker_thread+0x193/0x350
[ 437.745905] kthread+0xf3/0x120
[ 437.745906] ret_from_fork+0x40/0x70
[ 437.745910] ret_from_fork_asm+0x11/0x20
[ 437.745912]
-> #2 (&q->q_usage_counter(io)#10){++++}-{0:0}:
[ 437.745916] blk_mq_submit_bio+0x970/0xb40
[ 437.745918] __submit_bio+0x116/0x200
[ 437.745921] submit_bio_noacct_nocheck+0x1bf/0x420
[ 437.745923] submit_bio_noacct+0x1cd/0x610
[ 437.745925] submit_bio+0x38/0x100
[ 437.745927] btrfs_submit_dev_bio+0xf1/0x340 [btrfs]
[ 437.745956] btrfs_submit_bio+0x132/0x170 [btrfs]
[ 437.745980] btrfs_submit_chunk+0x198/0x690 [btrfs]
[ 437.746004] btrfs_submit_bbio+0x1b/0x30 [btrfs]
[ 437.746028] read_extent_buffer_pages+0x197/0x220 [btrfs]
[ 437.746060] btrfs_read_extent_buffer+0x95/0x1d0 [btrfs]
[ 437.746093] read_block_for_search+0x21c/0x3b0 [btrfs]
[ 437.746123] btrfs_search_slot+0x36c/0x1040 [btrfs]
[ 437.746151] btrfs_init_root_free_objectid+0x88/0x120 [btrfs]
[ 437.746180] btrfs_get_root_ref+0x21a/0x3c0 [btrfs]
[ 437.746206] btrfs_get_fs_root+0x13/0x20 [btrfs]
[ 437.746231] btrfs_lookup_dentry+0x606/0x670 [btrfs]
[ 437.746259] btrfs_lookup+0x12/0x40 [btrfs]
[ 437.746285] __lookup_slow+0xf9/0x1a0
[ 437.746288] walk_component+0x10c/0x180
[ 437.746290] path_lookupat+0x67/0x1a0
[ 437.746291] filename_lookup+0xee/0x200
[ 437.746294] vfs_path_lookup+0x54/0x90
[ 437.746296] mount_subtree+0x8b/0x150
[ 437.746297] btrfs_get_tree+0x3a8/0x8b0 [btrfs]
[ 437.746326] vfs_get_tree+0x27/0x100
[ 437.746328] path_mount+0x4f3/0xc00
[ 437.746330] __x64_sys_mount+0x120/0x160
[ 437.746331] x64_sys_call+0x8a1/0xfb0
[ 437.746333] do_syscall_64+0x87/0x140
[ 437.746337] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 437.746340]
-> #1 (btrfs-tree-01){++++}-{4:4}:
[ 437.746343] lock_release+0x130/0x2c0
[ 437.746345] up_write+0x1c/0x1f0
[ 437.746348] btrfs_tree_unlock+0x1e/0x30 [btrfs]
[ 437.746378] unlock_up+0x1ce/0x380 [btrfs]
[ 437.746406] btrfs_search_slot+0x344/0x1040 [btrfs]
[ 437.746433] btrfs_lookup_inode+0x52/0xe0 [btrfs]
[ 437.746461] __btrfs_update_delayed_inode+0x6f/0x2f0 [btrfs]
[ 437.746490] btrfs_commit_inode_delayed_inode+0x123/0x130 [btrfs]
[ 437.746516] btrfs_evict_inode+0x2ff/0x440 [btrfs]
[ 437.746545] evict+0x11f/0x2d0
[ 437.746547] iput.part.0+0x1bf/0x270
[ 437.746548] iput+0x1c/0x30
[ 437.746549] do_unlinkat+0x2d2/0x320
[ 437.746551] __x64_sys_unlinkat+0x35/0x70
[ 437.746552] x64_sys_call+0x51b/0xfb0
[ 437.746554] do_syscall_64+0x87/0x140
[ 437.746556] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 437.746558]
-> #0 (&delayed_node->mutex){+.+.}-{4:4}:
[ 437.746560] __lock_acquire+0x1615/0x27d0
[ 437.746561] lock_acquire+0xc9/0x300
[ 437.746563] __mutex_lock+0xd3/0x920
[ 437.746565] mutex_lock_nested+0x1b/0x30
[ 437.746567] __btrfs_release_delayed_node.part.0+0x3f/0x280 [btrfs]
[ 437.746592] btrfs_remove_delayed_node+0x2a/0x40 [btrfs]
[ 437.746616] btrfs_evict_inode+0x1a5/0x440 [btrfs]
[ 437.746643] evict+0x11f/0x2d0
[ 437.746644] dispose_list+0x39/0x80
[ 437.746645] prune_icache_sb+0x59/0x90
[ 437.746647] super_cache_scan+0x14e/0x1d0
[ 437.746649] do_shrink_slab+0x176/0x7a0
[ 437.746651] shrink_slab+0x4b6/0x970
[ 437.746652] shrink_one+0x125/0x200
[ 437.746654] shrink_node+0xca3/0x13f0
[ 437.746655] balance_pgdat+0x50b/0xa80
[ 437.746657] kswapd+0x224/0x480
[ 437.746658] kthread+0xf3/0x120
[ 437.746659] ret_from_fork+0x40/0x70
[ 437.746661] ret_from_fork_asm+0x11/0x20
[ 437.746663]
other info that might help us debug this:
[ 437.746664] Chain exists of:
&delayed_node->mutex --> &q->q_usage_counter(io)#10 --> fs_reclaim
[ 437.746667] Possible unsafe locking scenario:
[ 437.746668] CPU0 CPU1
[ 437.746668] ---- ----
[ 437.746669] lock(fs_reclaim);
[ 437.746670] lock(&q->q_usage_counter(io)#10);
[ 437.746672] lock(fs_reclaim);
[ 437.746673] lock(&delayed_node->mutex);
[ 437.746674]
*** DEADLOCK ***
[ 437.746675] 2 locks held by kswapd0/141:
[ 437.746676] #0: ffffffffb9cc8c80 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x578/0xa80
[ 437.746680] #1: ffff925c027fc0e0 (&type->s_umount_key#30){++++}-{4:4}, at: super_cache_scan+0x39/0x1d0
[ 437.746684]
stack backtrace:
[ 437.746685] CPU: 4 UID: 0 PID: 141 Comm: kswapd0 Not tainted 6.13.0-rc6-00037-gac70f027bab6 #112
[ 437.746687] Hardware name: HP HP Pavilion Aero Laptop 13-be0xxx/8916, BIOS F.16 08/01/2024
[ 437.746688] Call Trace:
[ 437.746689] <TASK>
[ 437.746692] dump_stack_lvl+0x8d/0xe0
[ 437.746694] dump_stack+0x10/0x20
[ 437.746696] print_circular_bug+0x27d/0x350
[ 437.746698] check_noncircular+0x14c/0x170
[ 437.746700] __lock_acquire+0x1615/0x27d0
[ 437.746703] lock_acquire+0xc9/0x300
[ 437.746704] ? __btrfs_release_delayed_node.part.0+0x3f/0x280 [btrfs]
[ 437.746730] __mutex_lock+0xd3/0x920
[ 437.746731] ? __btrfs_release_delayed_node.part.0+0x3f/0x280 [btrfs]
[ 437.746754] ? __btrfs_release_delayed_node.part.0+0x3f/0x280 [btrfs]
[ 437.746777] mutex_lock_nested+0x1b/0x30
[ 437.746779] ? mutex_lock_nested+0x1b/0x30
[ 437.746781] __btrfs_release_delayed_node.part.0+0x3f/0x280 [btrfs]
[ 437.746803] btrfs_remove_delayed_node+0x2a/0x40 [btrfs]
[ 437.746826] btrfs_evict_inode+0x1a5/0x440 [btrfs]
[ 437.746852] ? lock_release+0xdb/0x2c0
[ 437.746853] ? evict+0x10d/0x2d0
[ 437.746856] evict+0x11f/0x2d0
[ 437.746858] dispose_list+0x39/0x80
[ 437.746860] prune_icache_sb+0x59/0x90
[ 437.746861] super_cache_scan+0x14e/0x1d0
[ 437.746864] do_shrink_slab+0x176/0x7a0
[ 437.746866] shrink_slab+0x4b6/0x970
[ 437.746868] ? shrink_slab+0x2fe/0x970
[ 437.746869] ? shrink_slab+0x383/0x970
[ 437.746872] shrink_one+0x125/0x200
[ 437.746873] ? shrink_node+0xc87/0x13f0
[ 437.746875] shrink_node+0xca3/0x13f0
[ 437.746877] ? shrink_node+0xad5/0x13f0
[ 437.746879] ? mem_cgroup_iter+0x352/0x470
[ 437.746882] balance_pgdat+0x50b/0xa80
[ 437.746883] ? balance_pgdat+0x50b/0xa80
[ 437.746886] ? lock_acquire+0xc9/0x300
[ 437.746889] kswapd+0x224/0x480
[ 437.746891] ? sugov_hold_freq+0xc0/0xc0
[ 437.746894] ? balance_pgdat+0xa80/0xa80
[ 437.746895] kthread+0xf3/0x120
[ 437.746897] ? kthread_insert_work_sanity_check+0x60/0x60
[ 437.746898] ret_from_fork+0x40/0x70
[ 437.746900] ? kthread_insert_work_sanity_check+0x60/0x60
[ 437.746902] ret_from_fork_asm+0x11/0x20
[ 437.746905] </TASK>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2025-01-13 14:39 ` Chris Bainbridge
@ 2025-01-13 15:11 ` Ming Lei
2025-01-13 15:33 ` Chris Bainbridge
2025-01-13 15:23 ` Chris Bainbridge
1 sibling, 1 reply; 25+ messages in thread
From: Ming Lei @ 2025-01-13 15:11 UTC (permalink / raw)
To: Chris Bainbridge
Cc: Lai, Yi, Jens Axboe, linux-block, Christoph Hellwig,
Peter Zijlstra, Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon,
linux-kernel, Bart Van Assche, yi1.lai, syzkaller-bugs
On Mon, Jan 13, 2025 at 02:39:22PM +0000, Chris Bainbridge wrote:
> Hi,
>
> With the latest mainline 6.13-rc6, I have been getting intermittent
> lockdep warnings when using a btrfs filesystem. The warnings bisect to
> this commit:
>
> f1be1788a32e8fa63416ad4518bbd1a85a825c9d is the first bad commit
> commit f1be1788a32e8fa63416ad4518bbd1a85a825c9d
> Author: Ming Lei <ming.lei@redhat.com>
> Date: Fri Oct 25 08:37:20 2024 +0800
>
> block: model freeze & enter queue as lock for supporting lockdep
>
>
> On my system, these lockdep warnings are reproducible just by doing a
> large fs operation, such as copying the whole linux kernel git repo to
> the btrfs filesystem.
>
> The lockdep warning is:
>
> [ 437.745808] ======================================================
> [ 437.745810] WARNING: possible circular locking dependency detected
> [ 437.745811] 6.13.0-rc6-00037-gac70f027bab6 #112 Not tainted
> [ 437.745813] ------------------------------------------------------
> [ 437.745814] kswapd0/141 is trying to acquire lock:
> [ 437.745815] ffff925c11095e90 (&delayed_node->mutex){+.+.}-{4:4}, at: __btrfs_release_delayed_node.part.0+0x3f/0x280 [btrfs]
> [ 437.745862]
> but task is already holding lock:
> [ 437.745863] ffffffffb9cc8c80 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x578/0xa80
> [ 437.745869]
> which lock already depends on the new lock.
>
> [ 437.745870]
> the existing dependency chain (in reverse order) is:
> [ 437.745871]
> -> #3 (fs_reclaim){+.+.}-{0:0}:
> [ 437.745873] fs_reclaim_acquire+0xbd/0xf0
> [ 437.745877] __kmalloc_node_noprof+0xa1/0x4f0
> [ 437.745880] __kvmalloc_node_noprof+0x24/0x100
> [ 437.745881] sbitmap_init_node+0x98/0x240
> [ 437.745885] scsi_realloc_sdev_budget_map+0xdd/0x1d0
> [ 437.745889] scsi_add_lun+0x458/0x760
> [ 437.745891] scsi_probe_and_add_lun+0x15e/0x480
> [ 437.745892] __scsi_scan_target+0xfb/0x230
> [ 437.745893] scsi_scan_channel+0x65/0xc0
> [ 437.745894] scsi_scan_host_selected+0xfb/0x160
> [ 437.745896] do_scsi_scan_host+0x9d/0xb0
> [ 437.745897] do_scan_async+0x1c/0x1a0
> [ 437.745898] async_run_entry_fn+0x2d/0x120
> [ 437.745901] process_one_work+0x210/0x730
> [ 437.745903] worker_thread+0x193/0x350
> [ 437.745905] kthread+0xf3/0x120
> [ 437.745906] ret_from_fork+0x40/0x70
> [ 437.745910] ret_from_fork_asm+0x11/0x20
> [ 437.745912]
> -> #2 (&q->q_usage_counter(io)#10){++++}-{0:0}:
Hello,
This one has been solved in for-6.14/block:
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=for-6.14/block
Thanks,
Ming
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2025-01-13 14:39 ` Chris Bainbridge
2025-01-13 15:11 ` Ming Lei
@ 2025-01-13 15:23 ` Chris Bainbridge
1 sibling, 0 replies; 25+ messages in thread
From: Chris Bainbridge @ 2025-01-13 15:23 UTC (permalink / raw)
To: Lai, Yi
Cc: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig,
Peter Zijlstra, Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon,
linux-kernel, Bart Van Assche, yi1.lai, syzkaller-bugs,
regressions
On Mon, Jan 13, 2025 at 02:39:22PM +0000, Chris Bainbridge wrote:
> Hi,
>
> With the latest mainline 6.13-rc6, I have been getting intermittent
> lockdep warnings when using a btrfs filesystem. The warnings bisect to
> this commit:
>
> f1be1788a32e8fa63416ad4518bbd1a85a825c9d is the first bad commit
> commit f1be1788a32e8fa63416ad4518bbd1a85a825c9d
> Author: Ming Lei <ming.lei@redhat.com>
> Date: Fri Oct 25 08:37:20 2024 +0800
>
> block: model freeze & enter queue as lock for supporting lockdep
I just read through this thread and noticed:
> The warning has been fixed by the following patches in for-6.14/block:
>
> blktrace: move copy_[to|from]_user() out of ->debugfs_lock
> blktrace: don't centralize grabbing q->debugfs_mutex in blk_trace_ioctl
>
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=for-6.14/block
I tested linux-block/for-6.14/block with a directory copy and did not
get the lockdep warning, so it looks like it might be the same issue
that has already been fixed. If so, it would be nice if the fix could
make it into v6.13; otherwise many people will get hit with this
warning and end up tracking it down and re-reporting it.
#regzbot title: intermittent block/fs/kswapd0 lockdep warning
#regzbot introduced: f1be1788a32e ^
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2025-01-13 15:11 ` Ming Lei
@ 2025-01-13 15:33 ` Chris Bainbridge
2025-01-13 15:52 ` Jens Axboe
0 siblings, 1 reply; 25+ messages in thread
From: Chris Bainbridge @ 2025-01-13 15:33 UTC (permalink / raw)
To: Ming Lei
Cc: Lai, Yi, Jens Axboe, linux-block, Christoph Hellwig,
Peter Zijlstra, Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon,
linux-kernel, Bart Van Assche, yi1.lai, syzkaller-bugs
On Mon, Jan 13, 2025 at 11:11:56PM +0800, Ming Lei wrote:
>
> This one has been solved in for-6.14/block:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=for-6.14/block
Thank you, yes I thought that might be the case.
Would it not make sense for the fix to be included in 6.13? Or is it too
big a change / just too late?
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep
2025-01-13 15:33 ` Chris Bainbridge
@ 2025-01-13 15:52 ` Jens Axboe
0 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2025-01-13 15:52 UTC (permalink / raw)
To: Chris Bainbridge, Ming Lei
Cc: Lai, Yi, linux-block, Christoph Hellwig, Peter Zijlstra,
Waiman Long, Boqun Feng, Ingo Molnar, Will Deacon, linux-kernel,
Bart Van Assche, yi1.lai, syzkaller-bugs
On 1/13/25 8:33 AM, Chris Bainbridge wrote:
> On Mon, Jan 13, 2025 at 11:11:56PM +0800, Ming Lei wrote:
>>
>> This one has been solved in for-6.14/block:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=for-6.14/block
>
> Thank you, yes I thought that might be the case.
>
> Would it not make sense for the fix to be included in 6.13? Or is it too
> big a change / just too late?
It's too late. If need be, we can shuffle them towards stable once
6.13 has been released.
--
Jens Axboe
^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2025-01-13 15:52 UTC | newest]
Thread overview: 25+ messages
2024-10-25 0:37 [PATCH V2 0/3] block: model freeze/enter queue as lock for lockdep Ming Lei
2024-10-25 0:37 ` [PATCH V2 1/3] blk-mq: add non_owner variant of start_freeze/unfreeze queue APIs Ming Lei
2024-10-25 0:37 ` [PATCH V2 2/3] nvme: core: switch to non_owner variant of start_freeze/unfreeze queue Ming Lei
2024-10-25 0:37 ` [PATCH V2 3/3] block: model freeze & enter queue as lock for supporting lockdep Ming Lei
[not found] ` <CGME20241029111338eucas1p2bd56c697b825eef235604e892569207e@eucas1p2.samsung.com>
2024-10-29 11:13 ` Marek Szyprowski
2024-10-29 15:58 ` Ming Lei
2024-10-29 16:59 ` Marek Szyprowski
2024-11-12 8:36 ` Marek Szyprowski
2024-11-12 10:15 ` Ming Lei
2024-11-12 11:32 ` Marek Szyprowski
2024-11-12 11:48 ` Ming Lei
2024-10-30 6:45 ` Lai, Yi
2024-10-30 7:13 ` Ming Lei
2024-10-30 8:50 ` Lai, Yi
2024-10-30 9:50 ` Ming Lei
2024-10-30 10:39 ` Lai, Yi
2024-10-30 11:08 ` Ming Lei
2024-12-04 3:21 ` Lai, Yi
2024-12-04 3:30 ` Ming Lei
2025-01-13 14:39 ` Chris Bainbridge
2025-01-13 15:11 ` Ming Lei
2025-01-13 15:33 ` Chris Bainbridge
2025-01-13 15:52 ` Jens Axboe
2025-01-13 15:23 ` Chris Bainbridge
2024-10-26 13:15 ` [PATCH V2 0/3] block: model freeze/enter queue as lock for lockdep Jens Axboe