* [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-18 8:28 [PATCHv2 0/6] block: fix lock order and remove redundant locking Nilay Shroff
@ 2025-02-18 8:28 ` Nilay Shroff
2025-02-18 8:46 ` Christoph Hellwig
2025-02-18 12:10 ` Ming Lei
2025-02-18 8:28 ` [PATCHv2 2/6] blk-sysfs: acquire q->limits_lock while reading attributes Nilay Shroff
` (5 subsequent siblings)
6 siblings, 2 replies; 34+ messages in thread
From: Nilay Shroff @ 2025-02-18 8:28 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, dlemoal, axboe, gjoyce
There're few sysfs attributes in block layer which don't really need
acquiring q->sysfs_lock while accessing it. The reason being, writing
a value to such attributes are either atomic or could be easily
protected using WRITE_ONCE()/READ_ONCE(). Moreover, sysfs attributes
are inherently protected with sysfs/kernfs internal locking.
So this change help segregate all existing sysfs attributes for which
we could avoid acquiring q->sysfs_lock. We group all such attributes,
which don't require any sorts of locking, using macro QUEUE_RO_ENTRY_
NOLOCK() or QUEUE_RW_ENTRY_NOLOCK(). The newly introduced show/store
method (show_nolock/store_nolock) is assigned to attributes using these
new macros. The show_nolock/store_nolock run without holding q->sysfs_
lock.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-settings.c | 2 +-
block/blk-sysfs.c | 106 ++++++++++++++++++++++++++++++++-----------
2 files changed, 81 insertions(+), 27 deletions(-)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index c44dadc35e1e..c541bf22f543 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -21,7 +21,7 @@
void blk_queue_rq_timeout(struct request_queue *q, unsigned int timeout)
{
- q->rq_timeout = timeout;
+ WRITE_ONCE(q->rq_timeout, timeout);
}
EXPORT_SYMBOL_GPL(blk_queue_rq_timeout);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6f548a4376aa..0c9be7c7ecc1 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -23,9 +23,14 @@
struct queue_sysfs_entry {
struct attribute attr;
ssize_t (*show)(struct gendisk *disk, char *page);
+ ssize_t (*show_nolock)(struct gendisk *disk, char *page);
+
ssize_t (*store)(struct gendisk *disk, const char *page, size_t count);
+ ssize_t (*store_nolock)(struct gendisk *disk,
+ const char *page, size_t count);
int (*store_limit)(struct gendisk *disk, const char *page,
size_t count, struct queue_limits *lim);
+
void (*load_module)(struct gendisk *disk, const char *page, size_t count);
};
@@ -320,7 +325,12 @@ queue_rq_affinity_store(struct gendisk *disk, const char *page, size_t count)
ret = queue_var_store(&val, page, count);
if (ret < 0)
return ret;
-
+ /*
+ * Here we update two queue flags each using atomic bitops, although
+ * updating two flags isn't atomic it should be harmless as those flags
+ * are accessed individually using atomic test_bit operation. So we
+ * don't grab any lock while updating these flags.
+ */
if (val == 2) {
blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
@@ -353,7 +363,8 @@ static ssize_t queue_poll_store(struct gendisk *disk, const char *page,
static ssize_t queue_io_timeout_show(struct gendisk *disk, char *page)
{
- return sysfs_emit(page, "%u\n", jiffies_to_msecs(disk->queue->rq_timeout));
+ return sysfs_emit(page, "%u\n",
+ jiffies_to_msecs(READ_ONCE(disk->queue->rq_timeout)));
}
static ssize_t queue_io_timeout_store(struct gendisk *disk, const char *page,
@@ -405,6 +416,19 @@ static struct queue_sysfs_entry _prefix##_entry = { \
.show = _prefix##_show, \
};
+#define QUEUE_RO_ENTRY_NOLOCK(_prefix, _name) \
+static struct queue_sysfs_entry _prefix##_entry = { \
+ .attr = {.name = _name, .mode = 0644 }, \
+ .show_nolock = _prefix##_show, \
+}
+
+#define QUEUE_RW_ENTRY_NOLOCK(_prefix, _name) \
+static struct queue_sysfs_entry _prefix##_entry = { \
+ .attr = {.name = _name, .mode = 0644 }, \
+ .show_nolock = _prefix##_show, \
+ .store_nolock = _prefix##_store, \
+}
+
#define QUEUE_RW_ENTRY(_prefix, _name) \
static struct queue_sysfs_entry _prefix##_entry = { \
.attr = { .name = _name, .mode = 0644 }, \
@@ -446,7 +470,7 @@ QUEUE_RO_ENTRY(queue_max_discard_segments, "max_discard_segments");
QUEUE_RO_ENTRY(queue_discard_granularity, "discard_granularity");
QUEUE_RO_ENTRY(queue_max_hw_discard_sectors, "discard_max_hw_bytes");
QUEUE_LIM_RW_ENTRY(queue_max_discard_sectors, "discard_max_bytes");
-QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
+QUEUE_RO_ENTRY_NOLOCK(queue_discard_zeroes_data, "discard_zeroes_data");
QUEUE_RO_ENTRY(queue_atomic_write_max_sectors, "atomic_write_max_bytes");
QUEUE_RO_ENTRY(queue_atomic_write_boundary_sectors,
@@ -454,25 +478,25 @@ QUEUE_RO_ENTRY(queue_atomic_write_boundary_sectors,
QUEUE_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max_bytes");
QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
-QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
+QUEUE_RO_ENTRY_NOLOCK(queue_write_same_max, "write_same_max_bytes");
QUEUE_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
QUEUE_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
QUEUE_RO_ENTRY(queue_zoned, "zoned");
-QUEUE_RO_ENTRY(queue_nr_zones, "nr_zones");
+QUEUE_RO_ENTRY_NOLOCK(queue_nr_zones, "nr_zones");
QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
-QUEUE_RW_ENTRY(queue_nomerges, "nomerges");
+QUEUE_RW_ENTRY_NOLOCK(queue_nomerges, "nomerges");
QUEUE_LIM_RW_ENTRY(queue_iostats_passthrough, "iostats_passthrough");
-QUEUE_RW_ENTRY(queue_rq_affinity, "rq_affinity");
-QUEUE_RW_ENTRY(queue_poll, "io_poll");
-QUEUE_RW_ENTRY(queue_poll_delay, "io_poll_delay");
+QUEUE_RW_ENTRY_NOLOCK(queue_rq_affinity, "rq_affinity");
+QUEUE_RW_ENTRY_NOLOCK(queue_poll, "io_poll");
+QUEUE_RW_ENTRY_NOLOCK(queue_poll_delay, "io_poll_delay");
QUEUE_LIM_RW_ENTRY(queue_wc, "write_cache");
QUEUE_RO_ENTRY(queue_fua, "fua");
QUEUE_RO_ENTRY(queue_dax, "dax");
-QUEUE_RW_ENTRY(queue_io_timeout, "io_timeout");
+QUEUE_RW_ENTRY_NOLOCK(queue_io_timeout, "io_timeout");
QUEUE_RO_ENTRY(queue_virt_boundary_mask, "virt_boundary_mask");
QUEUE_RO_ENTRY(queue_dma_alignment, "dma_alignment");
@@ -561,9 +585,11 @@ QUEUE_RW_ENTRY(queue_wb_lat, "wbt_lat_usec");
/* Common attributes for bio-based and request-based queues. */
static struct attribute *queue_attrs[] = {
+ /*
+ * attributes protected with q->sysfs_lock
+ */
&queue_ra_entry.attr,
&queue_max_hw_sectors_entry.attr,
- &queue_max_sectors_entry.attr,
&queue_max_segments_entry.attr,
&queue_max_discard_segments_entry.attr,
&queue_max_integrity_segments_entry.attr,
@@ -575,46 +601,63 @@ static struct attribute *queue_attrs[] = {
&queue_io_min_entry.attr,
&queue_io_opt_entry.attr,
&queue_discard_granularity_entry.attr,
- &queue_max_discard_sectors_entry.attr,
&queue_max_hw_discard_sectors_entry.attr,
- &queue_discard_zeroes_data_entry.attr,
&queue_atomic_write_max_sectors_entry.attr,
&queue_atomic_write_boundary_sectors_entry.attr,
&queue_atomic_write_unit_min_entry.attr,
&queue_atomic_write_unit_max_entry.attr,
- &queue_write_same_max_entry.attr,
&queue_max_write_zeroes_sectors_entry.attr,
&queue_max_zone_append_sectors_entry.attr,
&queue_zone_write_granularity_entry.attr,
- &queue_rotational_entry.attr,
&queue_zoned_entry.attr,
- &queue_nr_zones_entry.attr,
&queue_max_open_zones_entry.attr,
&queue_max_active_zones_entry.attr,
- &queue_nomerges_entry.attr,
+ &queue_fua_entry.attr,
+ &queue_dax_entry.attr,
+ &queue_virt_boundary_mask_entry.attr,
+ &queue_dma_alignment_entry.attr,
+
+ /*
+ * attributes protected with q->limits_lock
+ */
+ &queue_max_sectors_entry.attr,
+ &queue_max_discard_sectors_entry.attr,
+ &queue_rotational_entry.attr,
&queue_iostats_passthrough_entry.attr,
&queue_iostats_entry.attr,
&queue_stable_writes_entry.attr,
&queue_add_random_entry.attr,
- &queue_poll_entry.attr,
&queue_wc_entry.attr,
- &queue_fua_entry.attr,
- &queue_dax_entry.attr,
+
+ /*
+ * attributes which don't require locking
+ */
+ &queue_nomerges_entry.attr,
+ &queue_poll_entry.attr,
&queue_poll_delay_entry.attr,
- &queue_virt_boundary_mask_entry.attr,
- &queue_dma_alignment_entry.attr,
+ &queue_discard_zeroes_data_entry.attr,
+ &queue_write_same_max_entry.attr,
+ &queue_nr_zones_entry.attr,
+
NULL,
};
/* Request-based queue attributes that are not relevant for bio-based queues. */
static struct attribute *blk_mq_queue_attrs[] = {
+ /*
+ * attributes protected with q->sysfs_lock
+ */
&queue_requests_entry.attr,
&elv_iosched_entry.attr,
- &queue_rq_affinity_entry.attr,
- &queue_io_timeout_entry.attr,
#ifdef CONFIG_BLK_WBT
&queue_wb_lat_entry.attr,
#endif
+ /*
+ * attrbiutes which don't require locking
+ */
+ &queue_rq_affinity_entry.attr,
+ &queue_io_timeout_entry.attr,
+
NULL,
};
@@ -666,8 +709,12 @@ queue_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
ssize_t res;
- if (!entry->show)
+ if (!entry->show && !entry->show_nolock)
return -EIO;
+
+ if (entry->show_nolock)
+ return entry->show_nolock(disk, page);
+
mutex_lock(&disk->queue->sysfs_lock);
res = entry->show(disk, page);
mutex_unlock(&disk->queue->sysfs_lock);
@@ -684,7 +731,7 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
unsigned int memflags;
ssize_t res;
- if (!entry->store_limit && !entry->store)
+ if (!entry->store_limit && !entry->store_nolock && !entry->store)
return -EIO;
/*
@@ -695,6 +742,13 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
if (entry->load_module)
entry->load_module(disk, page, length);
+ if (entry->store_nolock) {
+ memflags = blk_mq_freeze_queue(q);
+ res = entry->store_nolock(disk, page, length);
+ blk_mq_unfreeze_queue(q, memflags);
+ return res;
+ }
+
if (entry->store_limit) {
struct queue_limits lim = queue_limits_start_update(q);
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-18 8:28 ` [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it Nilay Shroff
@ 2025-02-18 8:46 ` Christoph Hellwig
2025-02-18 11:26 ` Nilay Shroff
2025-02-18 12:10 ` Ming Lei
1 sibling, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2025-02-18 8:46 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, ming.lei, dlemoal, axboe, gjoyce
On Tue, Feb 18, 2025 at 01:58:54PM +0530, Nilay Shroff wrote:
> There're few sysfs attributes in block layer which don't really need
> acquiring q->sysfs_lock while accessing it. The reason being, writing
> a value to such attributes are either atomic or could be easily
> protected using WRITE_ONCE()/READ_ONCE(). Moreover, sysfs attributes
> are inherently protected with sysfs/kernfs internal locking.
>
> So this change help segregate all existing sysfs attributes for which
> we could avoid acquiring q->sysfs_lock. We group all such attributes,
> which don't require any sorts of locking, using macro QUEUE_RO_ENTRY_
> NOLOCK() or QUEUE_RW_ENTRY_NOLOCK(). The newly introduced show/store
> method (show_nolock/store_nolock) is assigned to attributes using these
> new macros. The show_nolock/store_nolock run without holding q->sysfs_
> lock.
Can you add the analys why they don't need sysfs_lock to this commit
message please?
With that:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-18 8:46 ` Christoph Hellwig
@ 2025-02-18 11:26 ` Nilay Shroff
2025-02-21 14:02 ` Nilay Shroff
0 siblings, 1 reply; 34+ messages in thread
From: Nilay Shroff @ 2025-02-18 11:26 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-block, ming.lei, dlemoal, axboe, gjoyce
On 2/18/25 2:16 PM, Christoph Hellwig wrote:
> On Tue, Feb 18, 2025 at 01:58:54PM +0530, Nilay Shroff wrote:
>> There're few sysfs attributes in block layer which don't really need
>> acquiring q->sysfs_lock while accessing it. The reason being, writing
>> a value to such attributes are either atomic or could be easily
>> protected using WRITE_ONCE()/READ_ONCE(). Moreover, sysfs attributes
>> are inherently protected with sysfs/kernfs internal locking.
>>
>> So this change help segregate all existing sysfs attributes for which
>> we could avoid acquiring q->sysfs_lock. We group all such attributes,
>> which don't require any sorts of locking, using macro QUEUE_RO_ENTRY_
>> NOLOCK() or QUEUE_RW_ENTRY_NOLOCK(). The newly introduced show/store
>> method (show_nolock/store_nolock) is assigned to attributes using these
>> new macros. The show_nolock/store_nolock run without holding q->sysfs_
>> lock.
>
> Can you add the analys why they don't need sysfs_lock to this commit
> message please?
Sure will do it in next patchset.
>
> With that:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-18 11:26 ` Nilay Shroff
@ 2025-02-21 14:02 ` Nilay Shroff
2025-02-22 12:44 ` Ming Lei
2025-02-24 8:41 ` Hannes Reinecke
0 siblings, 2 replies; 34+ messages in thread
From: Nilay Shroff @ 2025-02-21 14:02 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-block, ming.lei, dlemoal, axboe, gjoyce
Hi Christoph, Ming and others,
On 2/18/25 4:56 PM, Nilay Shroff wrote:
>
>
> On 2/18/25 2:16 PM, Christoph Hellwig wrote:
>> On Tue, Feb 18, 2025 at 01:58:54PM +0530, Nilay Shroff wrote:
>>> There're few sysfs attributes in block layer which don't really need
>>> acquiring q->sysfs_lock while accessing it. The reason being, writing
>>> a value to such attributes are either atomic or could be easily
>>> protected using WRITE_ONCE()/READ_ONCE(). Moreover, sysfs attributes
>>> are inherently protected with sysfs/kernfs internal locking.
>>>
>>> So this change help segregate all existing sysfs attributes for which
>>> we could avoid acquiring q->sysfs_lock. We group all such attributes,
>>> which don't require any sorts of locking, using macro QUEUE_RO_ENTRY_
>>> NOLOCK() or QUEUE_RW_ENTRY_NOLOCK(). The newly introduced show/store
>>> method (show_nolock/store_nolock) is assigned to attributes using these
>>> new macros. The show_nolock/store_nolock run without holding q->sysfs_
>>> lock.
>>
>> Can you add the analys why they don't need sysfs_lock to this commit
>> message please?
> Sure will do it in next patchset.
>>
>> With that:
>>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>
>
I think we discussed about all attributes which don't require locking,
however there's one which I was looking at "nr_zones" which we haven't
discussed. This is read-only attribute and currently protected with
q->sysfs_lock.
Write to this attribute (nr_zones) mostly happens in the driver probe
method (except nvme) before disk is added and outside of q->sysfs_lock
or any other lock. But in case of nvme it could be updated from disk
scan.
nvme_validate_ns
-> nvme_update_ns_info_block
-> blk_revalidate_disk_zones
-> disk_update_zone_resources
The update to disk->nr_zones is done outside of queue freeze or any
other lock today. So do you agree if we could use READ_ONCE/WRITE_ONCE
to protect this attribute and remove q->sysfs_lock? I think, it'd be
great if we could agree upon this one before I send the next patchset.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-21 14:02 ` Nilay Shroff
@ 2025-02-22 12:44 ` Ming Lei
2025-02-24 13:09 ` Nilay Shroff
2025-02-24 14:49 ` Christoph Hellwig
2025-02-24 8:41 ` Hannes Reinecke
1 sibling, 2 replies; 34+ messages in thread
From: Ming Lei @ 2025-02-22 12:44 UTC (permalink / raw)
To: Nilay Shroff; +Cc: Christoph Hellwig, linux-block, dlemoal, axboe, gjoyce
On Fri, Feb 21, 2025 at 07:32:52PM +0530, Nilay Shroff wrote:
>
> Hi Christoph, Ming and others,
>
> On 2/18/25 4:56 PM, Nilay Shroff wrote:
> >
> >
> > On 2/18/25 2:16 PM, Christoph Hellwig wrote:
> >> On Tue, Feb 18, 2025 at 01:58:54PM +0530, Nilay Shroff wrote:
> >>> There're few sysfs attributes in block layer which don't really need
> >>> acquiring q->sysfs_lock while accessing it. The reason being, writing
> >>> a value to such attributes are either atomic or could be easily
> >>> protected using WRITE_ONCE()/READ_ONCE(). Moreover, sysfs attributes
> >>> are inherently protected with sysfs/kernfs internal locking.
> >>>
> >>> So this change help segregate all existing sysfs attributes for which
> >>> we could avoid acquiring q->sysfs_lock. We group all such attributes,
> >>> which don't require any sorts of locking, using macro QUEUE_RO_ENTRY_
> >>> NOLOCK() or QUEUE_RW_ENTRY_NOLOCK(). The newly introduced show/store
> >>> method (show_nolock/store_nolock) is assigned to attributes using these
> >>> new macros. The show_nolock/store_nolock run without holding q->sysfs_
> >>> lock.
> >>
> >> Can you add the analys why they don't need sysfs_lock to this commit
> >> message please?
> > Sure will do it in next patchset.
> >>
> >> With that:
> >>
> >> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>
> >
> I think we discussed about all attributes which don't require locking,
> however there's one which I was looking at "nr_zones" which we haven't
> discussed. This is read-only attribute and currently protected with
> q->sysfs_lock.
>
> Write to this attribute (nr_zones) mostly happens in the driver probe
> method (except nvme) before disk is added and outside of q->sysfs_lock
> or any other lock. But in case of nvme it could be updated from disk
> scan.
> nvme_validate_ns
> -> nvme_update_ns_info_block
> -> blk_revalidate_disk_zones
> -> disk_update_zone_resources
>
> The update to disk->nr_zones is done outside of queue freeze or any
> other lock today. So do you agree if we could use READ_ONCE/WRITE_ONCE
> to protect this attribute and remove q->sysfs_lock? I think, it'd be
> great if we could agree upon this one before I send the next patchset.
IMO, it is fine to read it lockless without READ_ONCE/WRITE_ONCE because
disk->nr_zones is defined as 'unsigned int', which won't return garbage
value anytime.
But I don't object if you want to change to READ_ONCE/WRITE_ONCE.
Thanks,
Ming
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-22 12:44 ` Ming Lei
@ 2025-02-24 13:09 ` Nilay Shroff
2025-02-24 14:49 ` Christoph Hellwig
1 sibling, 0 replies; 34+ messages in thread
From: Nilay Shroff @ 2025-02-24 13:09 UTC (permalink / raw)
To: Ming Lei; +Cc: Christoph Hellwig, linux-block, dlemoal, axboe, gjoyce
On 2/22/25 6:14 PM, Ming Lei wrote:
> On Fri, Feb 21, 2025 at 07:32:52PM +0530, Nilay Shroff wrote:
>>
>> Hi Christoph, Ming and others,
>>
>> On 2/18/25 4:56 PM, Nilay Shroff wrote:
>>>
>>>
>>> On 2/18/25 2:16 PM, Christoph Hellwig wrote:
>>>> On Tue, Feb 18, 2025 at 01:58:54PM +0530, Nilay Shroff wrote:
>>>>> There're few sysfs attributes in block layer which don't really need
>>>>> acquiring q->sysfs_lock while accessing it. The reason being, writing
>>>>> a value to such attributes are either atomic or could be easily
>>>>> protected using WRITE_ONCE()/READ_ONCE(). Moreover, sysfs attributes
>>>>> are inherently protected with sysfs/kernfs internal locking.
>>>>>
>>>>> So this change help segregate all existing sysfs attributes for which
>>>>> we could avoid acquiring q->sysfs_lock. We group all such attributes,
>>>>> which don't require any sorts of locking, using macro QUEUE_RO_ENTRY_
>>>>> NOLOCK() or QUEUE_RW_ENTRY_NOLOCK(). The newly introduced show/store
>>>>> method (show_nolock/store_nolock) is assigned to attributes using these
>>>>> new macros. The show_nolock/store_nolock run without holding q->sysfs_
>>>>> lock.
>>>>
>>>> Can you add the analys why they don't need sysfs_lock to this commit
>>>> message please?
>>> Sure will do it in next patchset.
>>>>
>>>> With that:
>>>>
>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>
>>>
>> I think we discussed about all attributes which don't require locking,
>> however there's one which I was looking at "nr_zones" which we haven't
>> discussed. This is read-only attribute and currently protected with
>> q->sysfs_lock.
>>
>> Write to this attribute (nr_zones) mostly happens in the driver probe
>> method (except nvme) before disk is added and outside of q->sysfs_lock
>> or any other lock. But in case of nvme it could be updated from disk
>> scan.
>> nvme_validate_ns
>> -> nvme_update_ns_info_block
>> -> blk_revalidate_disk_zones
>> -> disk_update_zone_resources
>>
>> The update to disk->nr_zones is done outside of queue freeze or any
>> other lock today. So do you agree if we could use READ_ONCE/WRITE_ONCE
>> to protect this attribute and remove q->sysfs_lock? I think, it'd be
>> great if we could agree upon this one before I send the next patchset.
>
> IMO, it is fine to read it lockless without READ_ONCE/WRITE_ONCE because
> disk->nr_zones is defined as 'unsigned int', which won't return garbage
> value anytime.
>
> But I don't object if you want to change to READ_ONCE/WRITE_ONCE.
>
Thanks Ming! So, as it seems we can't go wrong here if we don't use
READ_ONCE/WRITE_ONCE or read/write nr_zones without any lock, I'd
send next patchset without any protection for nr_zones. Lets see
if Christoph or others have any objection against this.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-22 12:44 ` Ming Lei
2025-02-24 13:09 ` Nilay Shroff
@ 2025-02-24 14:49 ` Christoph Hellwig
2025-02-26 12:09 ` Nilay Shroff
1 sibling, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2025-02-24 14:49 UTC (permalink / raw)
To: Ming Lei
Cc: Nilay Shroff, Christoph Hellwig, linux-block, dlemoal, axboe,
gjoyce
On Sat, Feb 22, 2025 at 08:44:51PM +0800, Ming Lei wrote:
> IMO, it is fine to read it lockless without READ_ONCE/WRITE_ONCE because
> disk->nr_zones is defined as 'unsigned int', which won't return garbage
> value anytime.
>
> But I don't object if you want to change to READ_ONCE/WRITE_ONCE.
It changes every time the disk capacity changes. And on the (uncommon)
reformats. So the best locking would be the same as for the disk
capacity.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-24 14:49 ` Christoph Hellwig
@ 2025-02-26 12:09 ` Nilay Shroff
0 siblings, 0 replies; 34+ messages in thread
From: Nilay Shroff @ 2025-02-26 12:09 UTC (permalink / raw)
To: Christoph Hellwig, Ming Lei; +Cc: linux-block, dlemoal, axboe, gjoyce
On 2/24/25 8:19 PM, Christoph Hellwig wrote:
> On Sat, Feb 22, 2025 at 08:44:51PM +0800, Ming Lei wrote:
>> IMO, it is fine to read it lockless without READ_ONCE/WRITE_ONCE because
>> disk->nr_zones is defined as 'unsigned int', which won't return garbage
>> value anytime.
>>
>> But I don't object if you want to change to READ_ONCE/WRITE_ONCE.
>
> It changes every time the disk capacity changes. And on the (uncommon)
> reformats. So the best locking would be the same as for the disk
> capacity.
>
So does that mean we shall use bdev->bd_size_lock or disk->open_mutex here?
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-21 14:02 ` Nilay Shroff
2025-02-22 12:44 ` Ming Lei
@ 2025-02-24 8:41 ` Hannes Reinecke
2025-02-24 13:12 ` Nilay Shroff
1 sibling, 1 reply; 34+ messages in thread
From: Hannes Reinecke @ 2025-02-24 8:41 UTC (permalink / raw)
To: Nilay Shroff, Christoph Hellwig
Cc: linux-block, ming.lei, dlemoal, axboe, gjoyce
On 2/21/25 15:02, Nilay Shroff wrote:
>
> Hi Christoph, Ming and others,
>
> On 2/18/25 4:56 PM, Nilay Shroff wrote:
>>
>>
>> On 2/18/25 2:16 PM, Christoph Hellwig wrote:
>>> On Tue, Feb 18, 2025 at 01:58:54PM +0530, Nilay Shroff wrote:
>>>> There're few sysfs attributes in block layer which don't really need
>>>> acquiring q->sysfs_lock while accessing it. The reason being, writing
>>>> a value to such attributes are either atomic or could be easily
>>>> protected using WRITE_ONCE()/READ_ONCE(). Moreover, sysfs attributes
>>>> are inherently protected with sysfs/kernfs internal locking.
>>>>
>>>> So this change help segregate all existing sysfs attributes for which
>>>> we could avoid acquiring q->sysfs_lock. We group all such attributes,
>>>> which don't require any sorts of locking, using macro QUEUE_RO_ENTRY_
>>>> NOLOCK() or QUEUE_RW_ENTRY_NOLOCK(). The newly introduced show/store
>>>> method (show_nolock/store_nolock) is assigned to attributes using these
>>>> new macros. The show_nolock/store_nolock run without holding q->sysfs_
>>>> lock.
>>>
>>> Can you add the analys why they don't need sysfs_lock to this commit
>>> message please?
>> Sure will do it in next patchset.
>>>
>>> With that:
>>>
>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>
>>
> I think we discussed about all attributes which don't require locking,
> however there's one which I was looking at "nr_zones" which we haven't
> discussed. This is read-only attribute and currently protected with
> q->sysfs_lock.
>
> Write to this attribute (nr_zones) mostly happens in the driver probe
> method (except nvme) before disk is added and outside of q->sysfs_lock
> or any other lock. But in case of nvme it could be updated from disk
> scan.
> nvme_validate_ns
> -> nvme_update_ns_info_block
> -> blk_revalidate_disk_zones
> -> disk_update_zone_resources
>
> The update to disk->nr_zones is done outside of queue freeze or any
> other lock today. So do you agree if we could use READ_ONCE/WRITE_ONCE
> to protect this attribute and remove q->sysfs_lock? I think, it'd be
> great if we could agree upon this one before I send the next patchset.
>
READ_ONCE should be fine here. 'nr_zones' is unlikely to change, and
if that is updated we've done a full disk revalidation including a read
of all zones. So not a critical operation, and nothing which needs to be
protected.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-24 8:41 ` Hannes Reinecke
@ 2025-02-24 13:12 ` Nilay Shroff
0 siblings, 0 replies; 34+ messages in thread
From: Nilay Shroff @ 2025-02-24 13:12 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig
Cc: linux-block, ming.lei, dlemoal, axboe, gjoyce
On 2/24/25 2:11 PM, Hannes Reinecke wrote:
> On 2/21/25 15:02, Nilay Shroff wrote:
>>
>> Hi Christoph, Ming and others,
>>
>> On 2/18/25 4:56 PM, Nilay Shroff wrote:
>>>
>>>
>>> On 2/18/25 2:16 PM, Christoph Hellwig wrote:
>>>> On Tue, Feb 18, 2025 at 01:58:54PM +0530, Nilay Shroff wrote:
>>>>> There're few sysfs attributes in block layer which don't really need
>>>>> acquiring q->sysfs_lock while accessing it. The reason being, writing
>>>>> a value to such attributes are either atomic or could be easily
>>>>> protected using WRITE_ONCE()/READ_ONCE(). Moreover, sysfs attributes
>>>>> are inherently protected with sysfs/kernfs internal locking.
>>>>>
>>>>> So this change help segregate all existing sysfs attributes for which
>>>>> we could avoid acquiring q->sysfs_lock. We group all such attributes,
>>>>> which don't require any sorts of locking, using macro QUEUE_RO_ENTRY_
>>>>> NOLOCK() or QUEUE_RW_ENTRY_NOLOCK(). The newly introduced show/store
>>>>> method (show_nolock/store_nolock) is assigned to attributes using these
>>>>> new macros. The show_nolock/store_nolock run without holding q->sysfs_
>>>>> lock.
>>>>
>>>> Can you add the analys why they don't need sysfs_lock to this commit
>>>> message please?
>>> Sure will do it in next patchset.
>>>>
>>>> With that:
>>>>
>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>
>>>
>> I think we discussed about all attributes which don't require locking,
>> however there's one which I was looking at "nr_zones" which we haven't
>> discussed. This is read-only attribute and currently protected with
>> q->sysfs_lock.
>>
>> Write to this attribute (nr_zones) mostly happens in the driver probe
>> method (except nvme) before disk is added and outside of q->sysfs_lock
>> or any other lock. But in case of nvme it could be updated from disk
>> scan.
>> nvme_validate_ns
>> -> nvme_update_ns_info_block
>> -> blk_revalidate_disk_zones
>> -> disk_update_zone_resources
>>
>> The update to disk->nr_zones is done outside of queue freeze or any
>> other lock today. So do you agree if we could use READ_ONCE/WRITE_ONCE
>> to protect this attribute and remove q->sysfs_lock? I think, it'd be
>> great if we could agree upon this one before I send the next patchset.
>>
> READ_ONCE should be fine here. 'nr_zones' is unlikely to change, and
> if that is updated we've done a full disk revalidation including a read
> of all zones. So not a critical operation, and nothing which needs to be
> protected.
>
Thanks Hannes! As you as well as Ming pointed out that it's unlikely to go
wrong here with nr_zones even if we read/write it without any protection,
I'd send next patchset without using READ_ONCE/WRITE_ONCE.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-18 8:28 ` [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it Nilay Shroff
2025-02-18 8:46 ` Christoph Hellwig
@ 2025-02-18 12:10 ` Ming Lei
2025-02-18 13:11 ` Nilay Shroff
1 sibling, 1 reply; 34+ messages in thread
From: Ming Lei @ 2025-02-18 12:10 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, dlemoal, axboe, gjoyce
On Tue, Feb 18, 2025 at 01:58:54PM +0530, Nilay Shroff wrote:
> There're few sysfs attributes in block layer which don't really need
> acquiring q->sysfs_lock while accessing it. The reason being, writing
> a value to such attributes are either atomic or could be easily
> protected using WRITE_ONCE()/READ_ONCE(). Moreover, sysfs attributes
> are inherently protected with sysfs/kernfs internal locking.
>
> So this change help segregate all existing sysfs attributes for which
> we could avoid acquiring q->sysfs_lock. We group all such attributes,
> which don't require any sorts of locking, using macro QUEUE_RO_ENTRY_
> NOLOCK() or QUEUE_RW_ENTRY_NOLOCK(). The newly introduced show/store
> method (show_nolock/store_nolock) is assigned to attributes using these
> new macros. The show_nolock/store_nolock run without holding q->sysfs_
> lock.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
...
>
> +#define QUEUE_RO_ENTRY_NOLOCK(_prefix, _name) \
> +static struct queue_sysfs_entry _prefix##_entry = { \
> + .attr = {.name = _name, .mode = 0644 }, \
> + .show_nolock = _prefix##_show, \
> +}
> +
> +#define QUEUE_RW_ENTRY_NOLOCK(_prefix, _name) \
> +static struct queue_sysfs_entry _prefix##_entry = { \
> + .attr = {.name = _name, .mode = 0644 }, \
> + .show_nolock = _prefix##_show, \
> + .store_nolock = _prefix##_store, \
> +}
> +
> #define QUEUE_RW_ENTRY(_prefix, _name) \
> static struct queue_sysfs_entry _prefix##_entry = { \
> .attr = { .name = _name, .mode = 0644 }, \
> @@ -446,7 +470,7 @@ QUEUE_RO_ENTRY(queue_max_discard_segments, "max_discard_segments");
> QUEUE_RO_ENTRY(queue_discard_granularity, "discard_granularity");
> QUEUE_RO_ENTRY(queue_max_hw_discard_sectors, "discard_max_hw_bytes");
> QUEUE_LIM_RW_ENTRY(queue_max_discard_sectors, "discard_max_bytes");
> -QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
> +QUEUE_RO_ENTRY_NOLOCK(queue_discard_zeroes_data, "discard_zeroes_data");
I think all QUEUE_RO_ENTRY needn't sysfs_lock, why do you just convert
part of them?
Thanks,
Ming
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-18 12:10 ` Ming Lei
@ 2025-02-18 13:11 ` Nilay Shroff
2025-02-18 13:45 ` Ming Lei
0 siblings, 1 reply; 34+ messages in thread
From: Nilay Shroff @ 2025-02-18 13:11 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, dlemoal, axboe, gjoyce
On 2/18/25 5:40 PM, Ming Lei wrote:
> On Tue, Feb 18, 2025 at 01:58:54PM +0530, Nilay Shroff wrote:
>> There're few sysfs attributes in block layer which don't really need
>> acquiring q->sysfs_lock while accessing it. The reason being, writing
>> a value to such attributes are either atomic or could be easily
>> protected using WRITE_ONCE()/READ_ONCE(). Moreover, sysfs attributes
>> are inherently protected with sysfs/kernfs internal locking.
>>
>> So this change help segregate all existing sysfs attributes for which
>> we could avoid acquiring q->sysfs_lock. We group all such attributes,
>> which don't require any sorts of locking, using macro QUEUE_RO_ENTRY_
>> NOLOCK() or QUEUE_RW_ENTRY_NOLOCK(). The newly introduced show/store
>> method (show_nolock/store_nolock) is assigned to attributes using these
>> new macros. The show_nolock/store_nolock run without holding q->sysfs_
>> lock.
>>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>
> ...
>
>>
>> +#define QUEUE_RO_ENTRY_NOLOCK(_prefix, _name) \
>> +static struct queue_sysfs_entry _prefix##_entry = { \
>> + .attr = {.name = _name, .mode = 0644 }, \
>> + .show_nolock = _prefix##_show, \
>> +}
>> +
>> +#define QUEUE_RW_ENTRY_NOLOCK(_prefix, _name) \
>> +static struct queue_sysfs_entry _prefix##_entry = { \
>> + .attr = {.name = _name, .mode = 0644 }, \
>> + .show_nolock = _prefix##_show, \
>> + .store_nolock = _prefix##_store, \
>> +}
>> +
>> #define QUEUE_RW_ENTRY(_prefix, _name) \
>> static struct queue_sysfs_entry _prefix##_entry = { \
>> .attr = { .name = _name, .mode = 0644 }, \
>> @@ -446,7 +470,7 @@ QUEUE_RO_ENTRY(queue_max_discard_segments, "max_discard_segments");
>> QUEUE_RO_ENTRY(queue_discard_granularity, "discard_granularity");
>> QUEUE_RO_ENTRY(queue_max_hw_discard_sectors, "discard_max_hw_bytes");
>> QUEUE_LIM_RW_ENTRY(queue_max_discard_sectors, "discard_max_bytes");
>> -QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
>> +QUEUE_RO_ENTRY_NOLOCK(queue_discard_zeroes_data, "discard_zeroes_data");
>
> I think all QUEUE_RO_ENTRY needn't sysfs_lock, why do you just convert
> part of them?
>
I think we have few read-only attributes which still need protection
using q->limits_lock. So if you refer 2nd patch in the series then you'd
find it. In this patch we group only attributes which don't require any
locking and grouped them under show_nolock/store_nolock.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-18 13:11 ` Nilay Shroff
@ 2025-02-18 13:45 ` Ming Lei
2025-02-18 16:29 ` Christoph Hellwig
0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2025-02-18 13:45 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, dlemoal, axboe, gjoyce
On Tue, Feb 18, 2025 at 06:41:06PM +0530, Nilay Shroff wrote:
>
>
> On 2/18/25 5:40 PM, Ming Lei wrote:
> > On Tue, Feb 18, 2025 at 01:58:54PM +0530, Nilay Shroff wrote:
> >> There're few sysfs attributes in block layer which don't really need
> >> acquiring q->sysfs_lock while accessing it. The reason being, writing
> >> a value to such attributes are either atomic or could be easily
> >> protected using WRITE_ONCE()/READ_ONCE(). Moreover, sysfs attributes
> >> are inherently protected with sysfs/kernfs internal locking.
> >>
> >> So this change help segregate all existing sysfs attributes for which
> >> we could avoid acquiring q->sysfs_lock. We group all such attributes,
> >> which don't require any sorts of locking, using macro QUEUE_RO_ENTRY_
> >> NOLOCK() or QUEUE_RW_ENTRY_NOLOCK(). The newly introduced show/store
> >> method (show_nolock/store_nolock) is assigned to attributes using these
> >> new macros. The show_nolock/store_nolock run without holding q->sysfs_
> >> lock.
> >>
> >> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> >> ---
> >
> > ...
> >
> >>
> >> +#define QUEUE_RO_ENTRY_NOLOCK(_prefix, _name) \
> >> +static struct queue_sysfs_entry _prefix##_entry = { \
> >> + .attr = {.name = _name, .mode = 0644 }, \
> >> + .show_nolock = _prefix##_show, \
> >> +}
> >> +
> >> +#define QUEUE_RW_ENTRY_NOLOCK(_prefix, _name) \
> >> +static struct queue_sysfs_entry _prefix##_entry = { \
> >> + .attr = {.name = _name, .mode = 0644 }, \
> >> + .show_nolock = _prefix##_show, \
> >> + .store_nolock = _prefix##_store, \
> >> +}
> >> +
> >> #define QUEUE_RW_ENTRY(_prefix, _name) \
> >> static struct queue_sysfs_entry _prefix##_entry = { \
> >> .attr = { .name = _name, .mode = 0644 }, \
> >> @@ -446,7 +470,7 @@ QUEUE_RO_ENTRY(queue_max_discard_segments, "max_discard_segments");
> >> QUEUE_RO_ENTRY(queue_discard_granularity, "discard_granularity");
> >> QUEUE_RO_ENTRY(queue_max_hw_discard_sectors, "discard_max_hw_bytes");
> >> QUEUE_LIM_RW_ENTRY(queue_max_discard_sectors, "discard_max_bytes");
> >> -QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
> >> +QUEUE_RO_ENTRY_NOLOCK(queue_discard_zeroes_data, "discard_zeroes_data");
> >
> > I think all QUEUE_RO_ENTRY needn't sysfs_lock, why do you just convert
> > part of them?
> >
> I think we have few read-only attributes which still need protection
> using q->limits_lock. So if you refer 2nd patch in the series then you'd
> find it. In this patch we group only attributes which don't require any
> locking and grouped them under show_nolock/store_nolock.
IMO, this RO attributes needn't protection from q->limits_lock:
- no lifetime issue
- in-tree code needn't limits_lock.
- all are scalar variable, so the attribute itself is updated atomically
- the limits still may be updated after lock is released
So what do you want to protect on these RO attributes? My concern is
that it is too complicated to maintain multiple versions of such macro.
Thanks,
Ming
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-18 13:45 ` Ming Lei
@ 2025-02-18 16:29 ` Christoph Hellwig
2025-02-19 3:24 ` Ming Lei
0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2025-02-18 16:29 UTC (permalink / raw)
To: Ming Lei; +Cc: Nilay Shroff, linux-block, hch, dlemoal, axboe, gjoyce
On Tue, Feb 18, 2025 at 09:45:02PM +0800, Ming Lei wrote:
> IMO, this RO attributes needn't protection from q->limits_lock:
>
> - no lifetime issue
>
> - in-tree code needn't limits_lock.
>
> - all are scalar variable, so the attribute itself is updated atomically
Except in the memory model they aren't without READ_ONCE/WRITE_ONCE.
Given that the limits_lock is not a hot lock taking the lock is a very
easy way to mark our intent. And if we get things like thread thread
sanitizer patches merged that will become essential. Even KCSAN
might object already without it.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-18 16:29 ` Christoph Hellwig
@ 2025-02-19 3:24 ` Ming Lei
2025-02-19 5:42 ` Christoph Hellwig
2025-02-19 8:34 ` Nilay Shroff
0 siblings, 2 replies; 34+ messages in thread
From: Ming Lei @ 2025-02-19 3:24 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nilay Shroff, linux-block, dlemoal, axboe, gjoyce
On Tue, Feb 18, 2025 at 05:29:53PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 18, 2025 at 09:45:02PM +0800, Ming Lei wrote:
> > IMO, this RO attributes needn't protection from q->limits_lock:
> >
> > - no lifetime issue
> >
> > - in-tree code needn't limits_lock.
> >
> > - all are scalar variable, so the attribute itself is updated atomically
>
> Except in the memory model they aren't without READ_ONCE/WRITE_ONCE.
RW_ONCE is supposed for avoiding compiler optimization, and scalar
variable atomic update should be decided by hardware.
>
> Given that the limits_lock is not a hot lock taking the lock is a very
> easy way to mark our intent. And if we get things like thread thread
> sanitizer patches merged that will become essential. Even KCSAN
> might object already without it.
My main concern is that there are too many ->store()/->load() variants
now, but not deal if you think this way is fine, :-)
Thanks,
Ming
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-19 3:24 ` Ming Lei
@ 2025-02-19 5:42 ` Christoph Hellwig
2025-02-19 8:34 ` Nilay Shroff
1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2025-02-19 5:42 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Nilay Shroff, linux-block, dlemoal, axboe,
gjoyce
On Wed, Feb 19, 2025 at 11:24:43AM +0800, Ming Lei wrote:
> On Tue, Feb 18, 2025 at 05:29:53PM +0100, Christoph Hellwig wrote:
> > On Tue, Feb 18, 2025 at 09:45:02PM +0800, Ming Lei wrote:
> > > IMO, this RO attributes needn't protection from q->limits_lock:
> > >
> > > - no lifetime issue
> > >
> > > - in-tree code needn't limits_lock.
> > >
> > > - all are scalar variable, so the attribute itself is updated atomically
> >
> > Except in the memory model they aren't without READ_ONCE/WRITE_ONCE.
>
> RW_ONCE is supposed for avoiding compiler optimization, and scalar
> variable atomic update should be decided by hardware.
Nothing force the compiler to update atomically without that. Yes,
it is very unlikely to get thing wrong, but I'd rather be explicit.
And make sure static and dynamic analysis knows what we are doing here.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-19 3:24 ` Ming Lei
2025-02-19 5:42 ` Christoph Hellwig
@ 2025-02-19 8:34 ` Nilay Shroff
2025-02-19 8:56 ` Nilay Shroff
1 sibling, 1 reply; 34+ messages in thread
From: Nilay Shroff @ 2025-02-19 8:34 UTC (permalink / raw)
To: Ming Lei, Christoph Hellwig; +Cc: linux-block, dlemoal, axboe, gjoyce
On 2/19/25 8:54 AM, Ming Lei wrote:
> On Tue, Feb 18, 2025 at 05:29:53PM +0100, Christoph Hellwig wrote:
>> On Tue, Feb 18, 2025 at 09:45:02PM +0800, Ming Lei wrote:
>>> IMO, this RO attributes needn't protection from q->limits_lock:
>>>
>>> - no lifetime issue
>>>
>>> - in-tree code needn't limits_lock.
>>>
>>> - all are scalar variable, so the attribute itself is updated atomically
>>
>> Except in the memory model they aren't without READ_ONCE/WRITE_ONCE.
>
> RW_ONCE is supposed for avoiding compiler optimization, and scalar
> variable atomic update should be decided by hardware.
>
>>
>> Given that the limits_lock is not a hot lock taking the lock is a very
>> easy way to mark our intent. And if we get things like thread thread
>> sanitizer patches merged that will become essential. Even KCSAN
>> might object already without it.
>
> My main concern is that there are too many ->store()/->load() variants
> now, but not deal if you think this way is fine, :-)
>
We will only have ->store_limit()/->show_limit() and ->store()/->load() in
the next patchset as I am going to cleanup load_module() as well as get away with show_nolock() and store_nolock() methods as discussed with Christoph in
another thread.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-19 8:34 ` Nilay Shroff
@ 2025-02-19 8:56 ` Nilay Shroff
2025-02-19 9:20 ` Ming Lei
0 siblings, 1 reply; 34+ messages in thread
From: Nilay Shroff @ 2025-02-19 8:56 UTC (permalink / raw)
To: Ming Lei, Christoph Hellwig; +Cc: linux-block, dlemoal, axboe, gjoyce
On 2/19/25 2:04 PM, Nilay Shroff wrote:
>
>
> On 2/19/25 8:54 AM, Ming Lei wrote:
>> On Tue, Feb 18, 2025 at 05:29:53PM +0100, Christoph Hellwig wrote:
>>> On Tue, Feb 18, 2025 at 09:45:02PM +0800, Ming Lei wrote:
>>>> IMO, this RO attributes needn't protection from q->limits_lock:
>>>>
>>>> - no lifetime issue
>>>>
>>>> - in-tree code needn't limits_lock.
>>>>
>>>> - all are scalar variable, so the attribute itself is updated atomically
>>>
>>> Except in the memory model they aren't without READ_ONCE/WRITE_ONCE.
>>
>> RW_ONCE is supposed for avoiding compiler optimization, and scalar
>> variable atomic update should be decided by hardware.
>>
>>>
>>> Given that the limits_lock is not a hot lock taking the lock is a very
>>> easy way to mark our intent. And if we get things like thread thread
>>> sanitizer patches merged that will become essential. Even KCSAN
>>> might object already without it.
>>
>> My main concern is that there are too many ->store()/->load() variants
>> now, but not deal if you think this way is fine, :-)
>>
> We will only have ->store_limit()/->show_limit() and ->store()/->load() in
> the next patchset as I am going to cleanup load_module() as well as get away with show_nolock() and store_nolock() methods as discussed with Christoph in
> another thread.
>
Sorry a typo, I meant we will only have ->store_limit()/->show_limit()
and ->store()/show() methods. Also, we'll cleanup load_module() as well
as get away with show_nolock() and store_nolock() methods in the next
patchset as discussed with Christoph in another thread.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it
2025-02-19 8:56 ` Nilay Shroff
@ 2025-02-19 9:20 ` Ming Lei
0 siblings, 0 replies; 34+ messages in thread
From: Ming Lei @ 2025-02-19 9:20 UTC (permalink / raw)
To: Nilay Shroff; +Cc: Christoph Hellwig, linux-block, dlemoal, axboe, gjoyce
On Wed, Feb 19, 2025 at 02:26:49PM +0530, Nilay Shroff wrote:
>
>
> On 2/19/25 2:04 PM, Nilay Shroff wrote:
> >
> >
> > On 2/19/25 8:54 AM, Ming Lei wrote:
> >> On Tue, Feb 18, 2025 at 05:29:53PM +0100, Christoph Hellwig wrote:
> >>> On Tue, Feb 18, 2025 at 09:45:02PM +0800, Ming Lei wrote:
> >>>> IMO, this RO attributes needn't protection from q->limits_lock:
> >>>>
> >>>> - no lifetime issue
> >>>>
> >>>> - in-tree code needn't limits_lock.
> >>>>
> >>>> - all are scalar variable, so the attribute itself is updated atomically
> >>>
> >>> Except in the memory model they aren't without READ_ONCE/WRITE_ONCE.
> >>
> >> RW_ONCE is supposed for avoiding compiler optimization, and scalar
> >> variable atomic update should be decided by hardware.
> >>
> >>>
> >>> Given that the limits_lock is not a hot lock taking the lock is a very
> >>> easy way to mark our intent. And if we get things like thread thread
> >>> sanitizer patches merged that will become essential. Even KCSAN
> >>> might object already without it.
> >>
> >> My main concern is that there are too many ->store()/->load() variants
> >> now, but not deal if you think this way is fine, :-)
> >>
> > We will only have ->store_limit()/->show_limit() and ->store()/->load() in
> > the next patchset as I am going to cleanup load_module() as well as get away with show_nolock() and store_nolock() methods as discussed with Christoph in
> > another thread.
> >
>
> Sorry a typo, I meant we will only have ->store_limit()/->show_limit()
> and ->store()/show() methods. Also, we'll cleanup load_module() as well
> as get away with show_nolock() and store_nolock() methods in the next
> patchset as discussed with Christoph in another thread.
OK, that looks much better!
Thanks,
Ming
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCHv2 2/6] blk-sysfs: acquire q->limits_lock while reading attributes
2025-02-18 8:28 [PATCHv2 0/6] block: fix lock order and remove redundant locking Nilay Shroff
2025-02-18 8:28 ` [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it Nilay Shroff
@ 2025-02-18 8:28 ` Nilay Shroff
2025-02-18 8:46 ` Christoph Hellwig
2025-02-18 8:28 ` [PATCHv2 3/6] block: Introduce a dedicated lock for protecting queue elevator updates Nilay Shroff
` (4 subsequent siblings)
6 siblings, 1 reply; 34+ messages in thread
From: Nilay Shroff @ 2025-02-18 8:28 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, dlemoal, axboe, gjoyce
There're few sysfs attributes(RW) whose store method is protected with
q->limits_lock, however the corresponding show method of such attributes
runs with holding q->sysfs_lock and that doesn't make sense. Hence
update the show method of these sysfs attributes so that reading of
these attributes acquire q->limits_lock instead of q->sysfs_lock.
Similarly, there're few sysfs attributes(RO) whose show method is
currently protected with q->sysfs_lock however updates to these
attributes could occur using atomic limit update APIs such as queue_
limits_start_update() and queue_limits_commit_update() which run
holding q->limits_lock. So that means that reading such attributes
holding q->sysfs_lock doesn't make sense. Hence update the show method
of these sysfs attributes(RO) such that show method of these attributes
runs with holding q->limits_lock instead of q->sysfs_lock.
We have defined a new macro QUEUE_LIM_RO_ENTRY() which uses new show_
limit() method and it runs holding q->limits_lock. All sysfs existing
attributes(RO) which needs protection using q->limits_lock while
reading the entry have been now moved to use this new macro for
attribute initialization.
Similarly, the existing QUEUE_LIM_RW_ENTRY() is updated to use new
show_limit() method for reading attributes instead of existing show()
method. As show_limit() runs holding q->limits_lock the existing sysfs
attributes(RW) whose read/show method needs protection using q->limits_
lock are now inherently protected.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-sysfs.c | 100 ++++++++++++++++++++++++++--------------------
1 file changed, 57 insertions(+), 43 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0c9be7c7ecc1..7e22ec96f2b3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -24,6 +24,7 @@ struct queue_sysfs_entry {
struct attribute attr;
ssize_t (*show)(struct gendisk *disk, char *page);
ssize_t (*show_nolock)(struct gendisk *disk, char *page);
+ ssize_t (*show_limit)(struct gendisk *disk, char *page);
ssize_t (*store)(struct gendisk *disk, const char *page, size_t count);
ssize_t (*store_nolock)(struct gendisk *disk,
@@ -436,10 +437,16 @@ static struct queue_sysfs_entry _prefix##_entry = { \
.store = _prefix##_store, \
};
+#define QUEUE_LIM_RO_ENTRY(_prefix, _name) \
+static struct queue_sysfs_entry _prefix##_entry = { \
+ .attr = { .name = _name, .mode = 0644 }, \
+ .show_limit = _prefix##_show, \
+}
+
#define QUEUE_LIM_RW_ENTRY(_prefix, _name) \
-static struct queue_sysfs_entry _prefix##_entry = { \
+static struct queue_sysfs_entry _prefix##_entry = { \
.attr = { .name = _name, .mode = 0644 }, \
- .show = _prefix##_show, \
+ .show_limit = _prefix##_show, \
.store_limit = _prefix##_store, \
}
@@ -454,39 +461,39 @@ static struct queue_sysfs_entry _prefix##_entry = { \
QUEUE_RW_ENTRY(queue_requests, "nr_requests");
QUEUE_RW_ENTRY(queue_ra, "read_ahead_kb");
QUEUE_LIM_RW_ENTRY(queue_max_sectors, "max_sectors_kb");
-QUEUE_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
-QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
-QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
-QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
+QUEUE_LIM_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
+QUEUE_LIM_RO_ENTRY(queue_max_segments, "max_segments");
+QUEUE_LIM_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
+QUEUE_LIM_RO_ENTRY(queue_max_segment_size, "max_segment_size");
QUEUE_RW_LOAD_MODULE_ENTRY(elv_iosched, "scheduler");
-QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size");
-QUEUE_RO_ENTRY(queue_physical_block_size, "physical_block_size");
-QUEUE_RO_ENTRY(queue_chunk_sectors, "chunk_sectors");
-QUEUE_RO_ENTRY(queue_io_min, "minimum_io_size");
-QUEUE_RO_ENTRY(queue_io_opt, "optimal_io_size");
+QUEUE_LIM_RO_ENTRY(queue_logical_block_size, "logical_block_size");
+QUEUE_LIM_RO_ENTRY(queue_physical_block_size, "physical_block_size");
+QUEUE_LIM_RO_ENTRY(queue_chunk_sectors, "chunk_sectors");
+QUEUE_LIM_RO_ENTRY(queue_io_min, "minimum_io_size");
+QUEUE_LIM_RO_ENTRY(queue_io_opt, "optimal_io_size");
-QUEUE_RO_ENTRY(queue_max_discard_segments, "max_discard_segments");
-QUEUE_RO_ENTRY(queue_discard_granularity, "discard_granularity");
-QUEUE_RO_ENTRY(queue_max_hw_discard_sectors, "discard_max_hw_bytes");
+QUEUE_LIM_RO_ENTRY(queue_max_discard_segments, "max_discard_segments");
+QUEUE_LIM_RO_ENTRY(queue_discard_granularity, "discard_granularity");
+QUEUE_LIM_RO_ENTRY(queue_max_hw_discard_sectors, "discard_max_hw_bytes");
QUEUE_LIM_RW_ENTRY(queue_max_discard_sectors, "discard_max_bytes");
QUEUE_RO_ENTRY_NOLOCK(queue_discard_zeroes_data, "discard_zeroes_data");
-QUEUE_RO_ENTRY(queue_atomic_write_max_sectors, "atomic_write_max_bytes");
-QUEUE_RO_ENTRY(queue_atomic_write_boundary_sectors,
+QUEUE_LIM_RO_ENTRY(queue_atomic_write_max_sectors, "atomic_write_max_bytes");
+QUEUE_LIM_RO_ENTRY(queue_atomic_write_boundary_sectors,
"atomic_write_boundary_bytes");
-QUEUE_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max_bytes");
-QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
+QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max_bytes");
+QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
QUEUE_RO_ENTRY_NOLOCK(queue_write_same_max, "write_same_max_bytes");
-QUEUE_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
-QUEUE_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
-QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
+QUEUE_LIM_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
+QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
+QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
-QUEUE_RO_ENTRY(queue_zoned, "zoned");
+QUEUE_LIM_RO_ENTRY(queue_zoned, "zoned");
QUEUE_RO_ENTRY_NOLOCK(queue_nr_zones, "nr_zones");
-QUEUE_RO_ENTRY(queue_max_open_zones, "max_open_zones");
-QUEUE_RO_ENTRY(queue_max_active_zones, "max_active_zones");
+QUEUE_LIM_RO_ENTRY(queue_max_open_zones, "max_open_zones");
+QUEUE_LIM_RO_ENTRY(queue_max_active_zones, "max_active_zones");
QUEUE_RW_ENTRY_NOLOCK(queue_nomerges, "nomerges");
QUEUE_LIM_RW_ENTRY(queue_iostats_passthrough, "iostats_passthrough");
@@ -494,16 +501,16 @@ QUEUE_RW_ENTRY_NOLOCK(queue_rq_affinity, "rq_affinity");
QUEUE_RW_ENTRY_NOLOCK(queue_poll, "io_poll");
QUEUE_RW_ENTRY_NOLOCK(queue_poll_delay, "io_poll_delay");
QUEUE_LIM_RW_ENTRY(queue_wc, "write_cache");
-QUEUE_RO_ENTRY(queue_fua, "fua");
-QUEUE_RO_ENTRY(queue_dax, "dax");
+QUEUE_LIM_RO_ENTRY(queue_fua, "fua");
+QUEUE_LIM_RO_ENTRY(queue_dax, "dax");
QUEUE_RW_ENTRY_NOLOCK(queue_io_timeout, "io_timeout");
-QUEUE_RO_ENTRY(queue_virt_boundary_mask, "virt_boundary_mask");
-QUEUE_RO_ENTRY(queue_dma_alignment, "dma_alignment");
+QUEUE_LIM_RO_ENTRY(queue_virt_boundary_mask, "virt_boundary_mask");
+QUEUE_LIM_RO_ENTRY(queue_dma_alignment, "dma_alignment");
/* legacy alias for logical_block_size: */
static struct queue_sysfs_entry queue_hw_sector_size_entry = {
- .attr = {.name = "hw_sector_size", .mode = 0444 },
- .show = queue_logical_block_size_show,
+ .attr = {.name = "hw_sector_size", .mode = 0444 },
+ .show_limit = queue_logical_block_size_show,
};
QUEUE_LIM_RW_ENTRY(queue_rotational, "rotational");
@@ -589,6 +596,18 @@ static struct attribute *queue_attrs[] = {
* attributes protected with q->sysfs_lock
*/
&queue_ra_entry.attr,
+
+ /*
+ * attributes protected with q->limits_lock
+ */
+ &queue_max_sectors_entry.attr,
+ &queue_max_discard_sectors_entry.attr,
+ &queue_rotational_entry.attr,
+ &queue_iostats_passthrough_entry.attr,
+ &queue_iostats_entry.attr,
+ &queue_stable_writes_entry.attr,
+ &queue_add_random_entry.attr,
+ &queue_wc_entry.attr,
&queue_max_hw_sectors_entry.attr,
&queue_max_segments_entry.attr,
&queue_max_discard_segments_entry.attr,
@@ -617,18 +636,6 @@ static struct attribute *queue_attrs[] = {
&queue_virt_boundary_mask_entry.attr,
&queue_dma_alignment_entry.attr,
- /*
- * attributes protected with q->limits_lock
- */
- &queue_max_sectors_entry.attr,
- &queue_max_discard_sectors_entry.attr,
- &queue_rotational_entry.attr,
- &queue_iostats_passthrough_entry.attr,
- &queue_iostats_entry.attr,
- &queue_stable_writes_entry.attr,
- &queue_add_random_entry.attr,
- &queue_wc_entry.attr,
-
/*
* attributes which don't require locking
*/
@@ -709,12 +716,19 @@ queue_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
ssize_t res;
- if (!entry->show && !entry->show_nolock)
+ if (!entry->show && !entry->show_nolock && !entry->show_limit)
return -EIO;
if (entry->show_nolock)
return entry->show_nolock(disk, page);
+ if (entry->show_limit) {
+ mutex_lock(&disk->queue->limits_lock);
+ res = entry->show_limit(disk, page);
+ mutex_unlock(&disk->queue->limits_lock);
+ return res;
+ }
+
mutex_lock(&disk->queue->sysfs_lock);
res = entry->show(disk, page);
mutex_unlock(&disk->queue->sysfs_lock);
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread* [PATCHv2 3/6] block: Introduce a dedicated lock for protecting queue elevator updates
2025-02-18 8:28 [PATCHv2 0/6] block: fix lock order and remove redundant locking Nilay Shroff
2025-02-18 8:28 ` [PATCHv2 1/6] blk-sysfs: remove q->sysfs_lock for attributes which don't need it Nilay Shroff
2025-02-18 8:28 ` [PATCHv2 2/6] blk-sysfs: acquire q->limits_lock while reading attributes Nilay Shroff
@ 2025-02-18 8:28 ` Nilay Shroff
2025-02-18 9:05 ` Christoph Hellwig
2025-02-18 8:28 ` [PATCHv2 4/6] blk-sysfs: protect nr_requests update using q->elevator_lock Nilay Shroff
` (3 subsequent siblings)
6 siblings, 1 reply; 34+ messages in thread
From: Nilay Shroff @ 2025-02-18 8:28 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, dlemoal, axboe, gjoyce
A queue's elevator can be updated either when modifying nr_hw_queues
or through the sysfs scheduler attribute. To prevent simultaneous
updates from causing race conditions, introduce a dedicated lock
q->elevator_lock. Currently, elevator switching/updating is protected
using q->sysfs_lock, but this has led to lockdep splats[1] due to
inconsistent lock ordering between q->sysfs_lock and the freeze-lock
in multiple block layer call sites.
As the scope of q->sysfs_lock is not well-defined, its misuse has
resulted in numerous lockdep warnings. To resolve this, replace q->
sysfs_lock with a new dedicated q->elevator_lock, which will be
exclusively used to protect elevator switching and updates.
[1] https://lore.kernel.org/all/67637e70.050a0220.3157ee.000c.GAE@google.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-core.c | 1 +
block/blk-mq.c | 12 ++--
block/blk-sysfs.c | 133 ++++++++++++++++++++++++++++-------------
block/elevator.c | 18 ++++--
block/genhd.c | 9 ++-
include/linux/blkdev.h | 1 +
6 files changed, 119 insertions(+), 55 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index d6c4fa3943b5..222cdcb662c2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -431,6 +431,7 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
mutex_init(&q->debugfs_mutex);
mutex_init(&q->sysfs_lock);
mutex_init(&q->limits_lock);
+ mutex_init(&q->elevator_lock);
mutex_init(&q->rq_qos_mutex);
spin_lock_init(&q->queue_lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 40490ac88045..f58e11dee8a0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4467,7 +4467,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
unsigned long i, j;
/* protect against switching io scheduler */
- mutex_lock(&q->sysfs_lock);
+ mutex_lock(&q->elevator_lock);
for (i = 0; i < set->nr_hw_queues; i++) {
int old_node;
int node = blk_mq_get_hctx_node(set, i);
@@ -4500,7 +4500,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
xa_for_each_start(&q->hctx_table, j, hctx, j)
blk_mq_exit_hctx(q, set, hctx, j);
- mutex_unlock(&q->sysfs_lock);
+ mutex_unlock(&q->elevator_lock);
/* unregister cpuhp callbacks for exited hctxs */
blk_mq_remove_hw_queues_cpuhp(q);
@@ -4934,7 +4934,7 @@ static bool blk_mq_elv_switch_none(struct list_head *head,
return false;
/* q->elevator needs protection from ->sysfs_lock */
- mutex_lock(&q->sysfs_lock);
+ mutex_lock(&q->elevator_lock);
/* the check has to be done with holding sysfs_lock */
if (!q->elevator) {
@@ -4950,7 +4950,7 @@ static bool blk_mq_elv_switch_none(struct list_head *head,
list_add(&qe->node, head);
elevator_disable(q);
unlock:
- mutex_unlock(&q->sysfs_lock);
+ mutex_unlock(&q->elevator_lock);
return true;
}
@@ -4980,11 +4980,11 @@ static void blk_mq_elv_switch_back(struct list_head *head,
list_del(&qe->node);
kfree(qe);
- mutex_lock(&q->sysfs_lock);
+ mutex_lock(&q->elevator_lock);
elevator_switch(q, t);
/* drop the reference acquired in blk_mq_elv_switch_none */
elevator_put(t);
- mutex_unlock(&q->sysfs_lock);
+ mutex_unlock(&q->elevator_lock);
}
static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 7e22ec96f2b3..355dfb514712 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -58,7 +58,13 @@ queue_var_store(unsigned long *var, const char *page, size_t count)
static ssize_t queue_requests_show(struct gendisk *disk, char *page)
{
- return queue_var_show(disk->queue->nr_requests, page);
+ int ret;
+
+ mutex_lock(&disk->queue->sysfs_lock);
+ ret = queue_var_show(disk->queue->nr_requests, page);
+ mutex_unlock(&disk->queue->sysfs_lock);
+
+ return ret;
}
static ssize_t
@@ -66,27 +72,42 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
{
unsigned long nr;
int ret, err;
+ unsigned int memflags;
+ struct request_queue *q = disk->queue;
- if (!queue_is_mq(disk->queue))
- return -EINVAL;
+ mutex_lock(&q->sysfs_lock);
+ memflags = blk_mq_freeze_queue(q);
+
+ if (!queue_is_mq(disk->queue)) {
+ ret = -EINVAL;
+ goto out;
+ }
ret = queue_var_store(&nr, page, count);
if (ret < 0)
- return ret;
+ goto out;
if (nr < BLKDEV_MIN_RQ)
nr = BLKDEV_MIN_RQ;
err = blk_mq_update_nr_requests(disk->queue, nr);
if (err)
- return err;
-
+ ret = err;
+out:
+ blk_mq_unfreeze_queue(q, memflags);
+ mutex_unlock(&q->sysfs_lock);
return ret;
}
static ssize_t queue_ra_show(struct gendisk *disk, char *page)
{
- return queue_var_show(disk->bdi->ra_pages << (PAGE_SHIFT - 10), page);
+ int ret;
+
+ mutex_lock(&disk->queue->sysfs_lock);
+ ret = queue_var_show(disk->bdi->ra_pages << (PAGE_SHIFT - 10), page);
+ mutex_unlock(&disk->queue->sysfs_lock);
+
+ return ret;
}
static ssize_t
@@ -94,11 +115,19 @@ queue_ra_store(struct gendisk *disk, const char *page, size_t count)
{
unsigned long ra_kb;
ssize_t ret;
+ unsigned int memflags;
+ struct request_queue *q = disk->queue;
+
+ mutex_lock(&q->sysfs_lock);
+ memflags = blk_mq_freeze_queue(q);
ret = queue_var_store(&ra_kb, page, count);
if (ret < 0)
- return ret;
+ goto out;
disk->bdi->ra_pages = ra_kb >> (PAGE_SHIFT - 10);
+out:
+ blk_mq_unfreeze_queue(q, memflags);
+ mutex_unlock(&q->sysfs_lock);
return ret;
}
@@ -534,14 +563,24 @@ static ssize_t queue_var_store64(s64 *var, const char *page)
static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
{
- if (!wbt_rq_qos(disk->queue))
- return -EINVAL;
+ int ret;
+
+ mutex_lock(&disk->queue->sysfs_lock);
+
+ if (!wbt_rq_qos(disk->queue)) {
+ ret = -EINVAL;
+ goto out;
+ }
if (wbt_disabled(disk->queue))
- return sysfs_emit(page, "0\n");
+ ret = sysfs_emit(page, "0\n");
+ else
+ ret = sysfs_emit(page, "%llu\n",
+ div_u64(wbt_get_min_lat(disk->queue), 1000));
- return sysfs_emit(page, "%llu\n",
- div_u64(wbt_get_min_lat(disk->queue), 1000));
+out:
+ mutex_unlock(&disk->queue->sysfs_lock);
+ return ret;
}
static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
@@ -551,18 +590,24 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
struct rq_qos *rqos;
ssize_t ret;
s64 val;
+ unsigned int memflags;
+
+ mutex_lock(&q->sysfs_lock);
+ memflags = blk_mq_freeze_queue(q);
ret = queue_var_store64(&val, page);
if (ret < 0)
- return ret;
- if (val < -1)
- return -EINVAL;
+ goto out;
+ if (val < -1) {
+ ret = -EINVAL;
+ goto out;
+ }
rqos = wbt_rq_qos(q);
if (!rqos) {
ret = wbt_init(disk);
if (ret)
- return ret;
+ goto out;
}
if (val == -1)
@@ -570,8 +615,10 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
else if (val >= 0)
val *= 1000ULL;
- if (wbt_get_min_lat(q) == val)
- return count;
+ if (wbt_get_min_lat(q) == val) {
+ ret = count;
+ goto out;
+ }
/*
* Ensure that the queue is idled, in case the latency update
@@ -584,7 +631,10 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
blk_mq_unquiesce_queue(q);
- return count;
+out:
+ blk_mq_unfreeze_queue(q, memflags);
+ mutex_unlock(&q->sysfs_lock);
+ return ret;
}
QUEUE_RW_ENTRY(queue_wb_lat, "wbt_lat_usec");
@@ -593,7 +643,7 @@ QUEUE_RW_ENTRY(queue_wb_lat, "wbt_lat_usec");
/* Common attributes for bio-based and request-based queues. */
static struct attribute *queue_attrs[] = {
/*
- * attributes protected with q->sysfs_lock
+ * attributes which require some form of locking
*/
&queue_ra_entry.attr,
@@ -652,10 +702,10 @@ static struct attribute *queue_attrs[] = {
/* Request-based queue attributes that are not relevant for bio-based queues. */
static struct attribute *blk_mq_queue_attrs[] = {
/*
- * attributes protected with q->sysfs_lock
+ * attributes which require some form of locking
*/
- &queue_requests_entry.attr,
&elv_iosched_entry.attr,
+ &queue_requests_entry.attr,
#ifdef CONFIG_BLK_WBT
&queue_wb_lat_entry.attr,
#endif
@@ -729,10 +779,7 @@ queue_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
return res;
}
- mutex_lock(&disk->queue->sysfs_lock);
- res = entry->show(disk, page);
- mutex_unlock(&disk->queue->sysfs_lock);
- return res;
+ return entry->show(disk, page);
}
static ssize_t
@@ -778,12 +825,7 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
return length;
}
- mutex_lock(&q->sysfs_lock);
- memflags = blk_mq_freeze_queue(q);
- res = entry->store(disk, page, length);
- blk_mq_unfreeze_queue(q, memflags);
- mutex_unlock(&q->sysfs_lock);
- return res;
+ return entry->store(disk, page, length);
}
static const struct sysfs_ops queue_sysfs_ops = {
@@ -852,15 +894,19 @@ int blk_register_queue(struct gendisk *disk)
if (ret)
goto out_debugfs_remove;
+ ret = blk_crypto_sysfs_register(disk);
+ if (ret)
+ goto out_unregister_ia_ranges;
+
+ mutex_lock(&q->elevator_lock);
if (q->elevator) {
ret = elv_register_queue(q, false);
- if (ret)
- goto out_unregister_ia_ranges;
+ if (ret) {
+ mutex_unlock(&q->elevator_lock);
+ goto out_crypto_sysfs_unregister;
+ }
}
-
- ret = blk_crypto_sysfs_register(disk);
- if (ret)
- goto out_elv_unregister;
+ mutex_unlock(&q->elevator_lock);
blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
wbt_enable_default(disk);
@@ -885,8 +931,8 @@ int blk_register_queue(struct gendisk *disk)
return ret;
-out_elv_unregister:
- elv_unregister_queue(q);
+out_crypto_sysfs_unregister:
+ blk_crypto_sysfs_unregister(disk);
out_unregister_ia_ranges:
disk_unregister_independent_access_ranges(disk);
out_debugfs_remove:
@@ -932,8 +978,11 @@ void blk_unregister_queue(struct gendisk *disk)
blk_mq_sysfs_unregister(disk);
blk_crypto_sysfs_unregister(disk);
- mutex_lock(&q->sysfs_lock);
+ mutex_lock(&q->elevator_lock);
elv_unregister_queue(q);
+ mutex_unlock(&q->elevator_lock);
+
+ mutex_lock(&q->sysfs_lock);
disk_unregister_independent_access_ranges(disk);
mutex_unlock(&q->sysfs_lock);
diff --git a/block/elevator.c b/block/elevator.c
index cd2ce4921601..6ee372f0220c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -457,7 +457,7 @@ int elv_register_queue(struct request_queue *q, bool uevent)
struct elevator_queue *e = q->elevator;
int error;
- lockdep_assert_held(&q->sysfs_lock);
+ lockdep_assert_held(&q->elevator_lock);
error = kobject_add(&e->kobj, &q->disk->queue_kobj, "iosched");
if (!error) {
@@ -481,7 +481,7 @@ void elv_unregister_queue(struct request_queue *q)
{
struct elevator_queue *e = q->elevator;
- lockdep_assert_held(&q->sysfs_lock);
+ lockdep_assert_held(&q->elevator_lock);
if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
kobject_uevent(&e->kobj, KOBJ_REMOVE);
@@ -618,7 +618,7 @@ int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
unsigned int memflags;
int ret;
- lockdep_assert_held(&q->sysfs_lock);
+ lockdep_assert_held(&q->elevator_lock);
memflags = blk_mq_freeze_queue(q);
blk_mq_quiesce_queue(q);
@@ -655,7 +655,7 @@ void elevator_disable(struct request_queue *q)
{
unsigned int memflags;
- lockdep_assert_held(&q->sysfs_lock);
+ lockdep_assert_held(&q->elevator_lock);
memflags = blk_mq_freeze_queue(q);
blk_mq_quiesce_queue(q);
@@ -723,11 +723,19 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
{
char elevator_name[ELV_NAME_MAX];
int ret;
+ unsigned int memflags;
+ struct request_queue *q = disk->queue;
strscpy(elevator_name, buf, sizeof(elevator_name));
+
+ memflags = blk_mq_freeze_queue(q);
+ mutex_lock(&q->elevator_lock);
ret = elevator_change(disk->queue, strstrip(elevator_name));
+ mutex_unlock(&q->elevator_lock);
+ blk_mq_unfreeze_queue(q, memflags);
if (!ret)
return count;
+
return ret;
}
@@ -738,6 +746,7 @@ ssize_t elv_iosched_show(struct gendisk *disk, char *name)
struct elevator_type *cur = NULL, *e;
int len = 0;
+ mutex_lock(&q->elevator_lock);
if (!q->elevator) {
len += sprintf(name+len, "[none] ");
} else {
@@ -755,6 +764,7 @@ ssize_t elv_iosched_show(struct gendisk *disk, char *name)
spin_unlock(&elv_list_lock);
len += sprintf(name+len, "\n");
+ mutex_unlock(&q->elevator_lock);
return len;
}
diff --git a/block/genhd.c b/block/genhd.c
index e9375e20d866..c2bd86cd09de 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -565,8 +565,11 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
if (disk->major == BLOCK_EXT_MAJOR)
blk_free_ext_minor(disk->first_minor);
out_exit_elevator:
- if (disk->queue->elevator)
+ if (disk->queue->elevator) {
+ mutex_lock(&disk->queue->elevator_lock);
elevator_exit(disk->queue);
+ mutex_unlock(&disk->queue->elevator_lock);
+ }
return ret;
}
EXPORT_SYMBOL_GPL(add_disk_fwnode);
@@ -742,9 +745,9 @@ void del_gendisk(struct gendisk *disk)
blk_mq_quiesce_queue(q);
if (q->elevator) {
- mutex_lock(&q->sysfs_lock);
+ mutex_lock(&q->elevator_lock);
elevator_exit(q);
- mutex_unlock(&q->sysfs_lock);
+ mutex_unlock(&q->elevator_lock);
}
rq_qos_exit(q);
blk_mq_unquiesce_queue(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 248416ecd01c..5690d2f3588e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -562,6 +562,7 @@ struct request_queue {
struct mutex sysfs_lock;
struct mutex limits_lock;
+ struct mutex elevator_lock;
/*
* for reusing dead hctx instance in case of updating
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread* Re: [PATCHv2 3/6] block: Introduce a dedicated lock for protecting queue elevator updates
2025-02-18 8:28 ` [PATCHv2 3/6] block: Introduce a dedicated lock for protecting queue elevator updates Nilay Shroff
@ 2025-02-18 9:05 ` Christoph Hellwig
2025-02-18 11:14 ` Nilay Shroff
0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2025-02-18 9:05 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, ming.lei, dlemoal, axboe, gjoyce
On Tue, Feb 18, 2025 at 01:58:56PM +0530, Nilay Shroff wrote:
> As the scope of q->sysfs_lock is not well-defined, its misuse has
> resulted in numerous lockdep warnings. To resolve this, replace q->
> sysfs_lock with a new dedicated q->elevator_lock, which will be
> exclusively used to protect elevator switching and updates.
Well, it's not replacing q->sysfs_lock as that is still around.
It changes some data to now be protected by elevator_lock instead,
and you should spell out which, preferably in a comment next to the
elevator_lock definition (I should have done the same for limits_lock
despite the trivial applicability to the limits field).
> /* q->elevator needs protection from ->sysfs_lock */
> - mutex_lock(&q->sysfs_lock);
> + mutex_lock(&q->elevator_lock);
Well, the above comment is no trivially wrong.
>
> /* the check has to be done with holding sysfs_lock */
Same for this one.
They could probably go away with the proper comments near elevator_lock
itself.
> static ssize_t queue_requests_show(struct gendisk *disk, char *page)
> {
> - return queue_var_show(disk->queue->nr_requests, page);
> + int ret;
> +
> + mutex_lock(&disk->queue->sysfs_lock);
> + ret = queue_var_show(disk->queue->nr_requests, page);
> + mutex_unlock(&disk->queue->sysfs_lock);
This also shifts taking sysfs_lock into the the ->show and ->store
methods. Which feels like a good thing, but isn't mentioned in the
commit log or directly releate to elevator_lock. Can you please move
taking the locking into the methods into a preparation patch with a
proper commit log? Also it's pretty strange the ->store_nolock is
now called with the queue frozen but ->store is not and both are
called without any locks. So I think part of that prep patch should
also be moving the queue freezing into ->store and do away with
the separate ->store_nolock, and just keep the special ->store_limit.
There's also no more need for ->lock_module when the elevator store
method is called unfrozen and without a lock.
->show and ->show_nolock also are identical now and can be done
away with.
So maybe shifting the freezing and locking into the methods should go
very first, before dropping the lock for the attributes that don't it?
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [PATCHv2 3/6] block: Introduce a dedicated lock for protecting queue elevator updates
2025-02-18 9:05 ` Christoph Hellwig
@ 2025-02-18 11:14 ` Nilay Shroff
2025-02-18 16:32 ` Christoph Hellwig
0 siblings, 1 reply; 34+ messages in thread
From: Nilay Shroff @ 2025-02-18 11:14 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-block, ming.lei, dlemoal, axboe, gjoyce
On 2/18/25 2:35 PM, Christoph Hellwig wrote:
> On Tue, Feb 18, 2025 at 01:58:56PM +0530, Nilay Shroff wrote:
>> As the scope of q->sysfs_lock is not well-defined, its misuse has
>> resulted in numerous lockdep warnings. To resolve this, replace q->
>> sysfs_lock with a new dedicated q->elevator_lock, which will be
>> exclusively used to protect elevator switching and updates.
>
> Well, it's not replacing q->sysfs_lock as that is still around.
> It changes some data to now be protected by elevator_lock instead,
> and you should spell out which, preferably in a comment next to the
> elevator_lock definition (I should have done the same for limits_lock
> despite the trivial applicability to the limits field).
>
>> /* q->elevator needs protection from ->sysfs_lock */
>> - mutex_lock(&q->sysfs_lock);
>> + mutex_lock(&q->elevator_lock);
>
> Well, the above comment is no trivially wrong.
Yes I will update comment, I don't know how I missed updating it :(
>
>>
>> /* the check has to be done with holding sysfs_lock */
>
> Same for this one.
> They could probably go away with the proper comments near elevator_lock
> itself.
yes I will add proper comment.
>
>> static ssize_t queue_requests_show(struct gendisk *disk, char *page)
>> {
>> - return queue_var_show(disk->queue->nr_requests, page);
>> + int ret;
>> +
>> + mutex_lock(&disk->queue->sysfs_lock);
>> + ret = queue_var_show(disk->queue->nr_requests, page);
>> + mutex_unlock(&disk->queue->sysfs_lock);
>
> This also shifts taking sysfs_lock into the the ->show and ->store
> methods. Which feels like a good thing, but isn't mentioned in the
> commit log or directly releate to elevator_lock. Can you please move
> taking the locking into the methods into a preparation patch with a
> proper commit log?
Yes I would do add proper commit log as suggested
> Also it's pretty strange the ->store_nolock is
> now called with the queue frozen but ->store is not and both are
> called without any locks. So I think part of that prep patch should
> also be moving the queue freezing into ->store and do away with
> the separate ->store_nolock, and just keep the special ->store_limit.
You are absolutely correct and that was my original plan to do away with
->store_nolock and ->show_nolock methods in the end however problem arised
due to "read_ahead_kb" attribute. As you know, "read_ahead_kb" requires
update outside of atomic limit APIs (for the reason mentioned in last
patch comment as well as in the cover letter). So we can't move
"read_ahead_kb" into ->store_limit. For updating "read_ahead_kb",
we follow the below sequence:
lock ->limits_lock
lock ->freeze_lock
update bdi->ra_pages
unlock ->freeze_lock
unlock ->limits_lock
As you could see above, we have to take ->limits_lock before ->freeze_lock
while updating ra_pages (doing the opposite would surely trigger lockdep
splat). However for attributes which are grouped under ->store_nolock the
update sequence is:
lock ->freeze_lock
invoke attribute specific store method
unlock ->freeze_lock.
So now if you suggest keeping only ->store and do away with ->store_nolock
method then we have to move feeze-lock inside each attributes' corresponding
store method as we can't keep freeze-lock under ->store in queue_attr_store().
> There's also no more need for ->lock_module when the elevator store
> method is called unfrozen and without a lock.
>
yes agreed, I would remove ->load_module and we can now load module
from elevator ->store method directly before we freeze the queue.
> ->show and ->show_nolock also are identical now and can be done
> away with.
>
Yes this is possible and I just kept it for pairing show_nolock with
store_nolock. But I would remove it.
> So maybe shifting the freezing and locking into the methods should go
> very first, before dropping the lock for the attributes that don't it?
>
Yes this can be done. I will add it when I spin next patchset.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [PATCHv2 3/6] block: Introduce a dedicated lock for protecting queue elevator updates
2025-02-18 11:14 ` Nilay Shroff
@ 2025-02-18 16:32 ` Christoph Hellwig
2025-02-19 8:41 ` Nilay Shroff
0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2025-02-18 16:32 UTC (permalink / raw)
To: Nilay Shroff
Cc: Christoph Hellwig, linux-block, ming.lei, dlemoal, axboe, gjoyce
On Tue, Feb 18, 2025 at 04:44:44PM +0530, Nilay Shroff wrote:
> So now if you suggest keeping only ->store and do away with ->store_nolock
> method then we have to move feeze-lock inside each attributes' corresponding
> store method as we can't keep freeze-lock under ->store in queue_attr_store().
Yes, that's what I suggested in my previous mail:
"So I think part of that prep patch should
also be moving the queue freezing into ->store and do away with
the separate ->store_nolock"
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 3/6] block: Introduce a dedicated lock for protecting queue elevator updates
2025-02-18 16:32 ` Christoph Hellwig
@ 2025-02-19 8:41 ` Nilay Shroff
0 siblings, 0 replies; 34+ messages in thread
From: Nilay Shroff @ 2025-02-19 8:41 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-block, ming.lei, dlemoal, axboe, gjoyce
On 2/18/25 10:02 PM, Christoph Hellwig wrote:
> On Tue, Feb 18, 2025 at 04:44:44PM +0530, Nilay Shroff wrote:
>> So now if you suggest keeping only ->store and do away with ->store_nolock
>> method then we have to move feeze-lock inside each attributes' corresponding
>> store method as we can't keep freeze-lock under ->store in queue_attr_store().
>
> Yes, that's what I suggested in my previous mail:
>
> "So I think part of that prep patch should
> also be moving the queue freezing into ->store and do away with
> the separate ->store_nolock"
>
>
Alright, I'm addressing this in the the next patchset.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCHv2 4/6] blk-sysfs: protect nr_requests update using q->elevator_lock
2025-02-18 8:28 [PATCHv2 0/6] block: fix lock order and remove redundant locking Nilay Shroff
` (2 preceding siblings ...)
2025-02-18 8:28 ` [PATCHv2 3/6] block: Introduce a dedicated lock for protecting queue elevator updates Nilay Shroff
@ 2025-02-18 8:28 ` Nilay Shroff
2025-02-18 8:28 ` [PATCHv2 5/6] blk-sysfs: protect wbt_lat_usec " Nilay Shroff
` (2 subsequent siblings)
6 siblings, 0 replies; 34+ messages in thread
From: Nilay Shroff @ 2025-02-18 8:28 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, dlemoal, axboe, gjoyce
The sysfs attribute nr_requests could be simultaneously updated from
elevator switch/update or nr_hw_queue update code path. The update to
nr_requests for each of those code path now runs holding q->elevator_
lock. So we should now protect access to sysfs attribute nr_requests
using q->elevator_lock instead of q->sysfs_lock.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-sysfs.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 355dfb514712..37ac73468d4e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -59,10 +59,11 @@ queue_var_store(unsigned long *var, const char *page, size_t count)
static ssize_t queue_requests_show(struct gendisk *disk, char *page)
{
int ret;
+ struct request_queue *q = disk->queue;
- mutex_lock(&disk->queue->sysfs_lock);
- ret = queue_var_show(disk->queue->nr_requests, page);
- mutex_unlock(&disk->queue->sysfs_lock);
+ mutex_lock(&q->elevator_lock);
+ ret = queue_var_show(q->nr_requests, page);
+ mutex_unlock(&q->elevator_lock);
return ret;
}
@@ -75,8 +76,8 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
unsigned int memflags;
struct request_queue *q = disk->queue;
- mutex_lock(&q->sysfs_lock);
memflags = blk_mq_freeze_queue(q);
+ mutex_lock(&q->elevator_lock);
if (!queue_is_mq(disk->queue)) {
ret = -EINVAL;
@@ -94,8 +95,9 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
if (err)
ret = err;
out:
+ mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
- mutex_unlock(&q->sysfs_lock);
+
return ret;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread* [PATCHv2 5/6] blk-sysfs: protect wbt_lat_usec using q->elevator_lock
2025-02-18 8:28 [PATCHv2 0/6] block: fix lock order and remove redundant locking Nilay Shroff
` (3 preceding siblings ...)
2025-02-18 8:28 ` [PATCHv2 4/6] blk-sysfs: protect nr_requests update using q->elevator_lock Nilay Shroff
@ 2025-02-18 8:28 ` Nilay Shroff
2025-02-18 8:28 ` [PATCHv2 6/6] blk-sysfs: protect read_ahead_kb using q->limits_lock Nilay Shroff
2025-02-18 9:21 ` [PATCHv2 0/6] block: fix lock order and remove redundant locking Christoph Hellwig
6 siblings, 0 replies; 34+ messages in thread
From: Nilay Shroff @ 2025-02-18 8:28 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, dlemoal, axboe, gjoyce
The wbt latency and state could be updated while initializing the
elevator or exiting the elevator. The elevator code path is now
protected with q->elevator_lock. So we should now protect the access
to sysfs attribute wbt_lat_usec using q->elevator_lock instead of
q->sysfs_lock.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-sysfs.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 37ac73468d4e..876376bfdac3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -567,7 +567,7 @@ static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
{
int ret;
- mutex_lock(&disk->queue->sysfs_lock);
+ mutex_lock(&disk->queue->elevator_lock);
if (!wbt_rq_qos(disk->queue)) {
ret = -EINVAL;
@@ -581,7 +581,7 @@ static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
div_u64(wbt_get_min_lat(disk->queue), 1000));
out:
- mutex_unlock(&disk->queue->sysfs_lock);
+ mutex_unlock(&disk->queue->elevator_lock);
return ret;
}
@@ -594,8 +594,8 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
s64 val;
unsigned int memflags;
- mutex_lock(&q->sysfs_lock);
memflags = blk_mq_freeze_queue(q);
+ mutex_lock(&q->elevator_lock);
ret = queue_var_store64(&val, page);
if (ret < 0)
@@ -634,8 +634,8 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
blk_mq_unquiesce_queue(q);
out:
+ mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
- mutex_unlock(&q->sysfs_lock);
return ret;
}
@@ -907,11 +907,11 @@ int blk_register_queue(struct gendisk *disk)
mutex_unlock(&q->elevator_lock);
goto out_crypto_sysfs_unregister;
}
+ wbt_enable_default(disk);
}
mutex_unlock(&q->elevator_lock);
blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
- wbt_enable_default(disk);
/* Now everything is ready and send out KOBJ_ADD uevent */
kobject_uevent(&disk->queue_kobj, KOBJ_ADD);
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread* [PATCHv2 6/6] blk-sysfs: protect read_ahead_kb using q->limits_lock
2025-02-18 8:28 [PATCHv2 0/6] block: fix lock order and remove redundant locking Nilay Shroff
` (4 preceding siblings ...)
2025-02-18 8:28 ` [PATCHv2 5/6] blk-sysfs: protect wbt_lat_usec " Nilay Shroff
@ 2025-02-18 8:28 ` Nilay Shroff
2025-02-18 9:12 ` Christoph Hellwig
2025-02-18 9:21 ` [PATCHv2 0/6] block: fix lock order and remove redundant locking Christoph Hellwig
6 siblings, 1 reply; 34+ messages in thread
From: Nilay Shroff @ 2025-02-18 8:28 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, dlemoal, axboe, gjoyce
The bdi->ra_pages could be updated under q->limits_lock while applying
bdi limits (please refer blk_apply_bdi_limits()). So protect accessing
sysfs attribute read_ahead_kb using q->limits_lock instead of q->sysfs_
lock.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-sysfs.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 876376bfdac3..a8116d3d9127 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -105,9 +105,9 @@ static ssize_t queue_ra_show(struct gendisk *disk, char *page)
{
int ret;
- mutex_lock(&disk->queue->sysfs_lock);
+ mutex_lock(&disk->queue->limits_lock);
ret = queue_var_show(disk->bdi->ra_pages << (PAGE_SHIFT - 10), page);
- mutex_unlock(&disk->queue->sysfs_lock);
+ mutex_unlock(&disk->queue->limits_lock);
return ret;
}
@@ -119,17 +119,24 @@ queue_ra_store(struct gendisk *disk, const char *page, size_t count)
ssize_t ret;
unsigned int memflags;
struct request_queue *q = disk->queue;
-
- mutex_lock(&q->sysfs_lock);
+ /*
+ * We don't use atomic update helper queue_limits_start_update() and
+ * queue_limits_commit_update() here for updaing ra_pages bacause
+ * blk_apply_bdi_limits() which is invoked from queue_limits_commit_
+ * update() can overwrite the ra_pages value which user actaully wants
+ * to store here. The blk_apply_bdi_limits() sets value of ra_pages
+ * based on the optimal I/O size(io_opt).
+ */
+ mutex_lock(&q->limits_lock);
memflags = blk_mq_freeze_queue(q);
-
ret = queue_var_store(&ra_kb, page, count);
if (ret < 0)
goto out;
disk->bdi->ra_pages = ra_kb >> (PAGE_SHIFT - 10);
out:
+ mutex_unlock(&q->limits_lock);
blk_mq_unfreeze_queue(q, memflags);
- mutex_unlock(&q->sysfs_lock);
+
return ret;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 34+ messages in thread* Re: [PATCHv2 6/6] blk-sysfs: protect read_ahead_kb using q->limits_lock
2025-02-18 8:28 ` [PATCHv2 6/6] blk-sysfs: protect read_ahead_kb using q->limits_lock Nilay Shroff
@ 2025-02-18 9:12 ` Christoph Hellwig
2025-02-18 11:27 ` Nilay Shroff
0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2025-02-18 9:12 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, ming.lei, dlemoal, axboe, gjoyce
> + /*
> + * We don't use atomic update helper queue_limits_start_update() and
> + * queue_limits_commit_update() here for updaing ra_pages bacause
> + * blk_apply_bdi_limits() which is invoked from queue_limits_commit_
> + * update() can overwrite the ra_pages value which user actaully wants
> + * to store here. The blk_apply_bdi_limits() sets value of ra_pages
> + * based on the optimal I/O size(io_opt).
> + */
Maybe replace this with:
/*
* ra_pages is protected by limit_lock because it is usually
* calculated from the queue limits by queue_limits_commit_update.
*/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 6/6] blk-sysfs: protect read_ahead_kb using q->limits_lock
2025-02-18 9:12 ` Christoph Hellwig
@ 2025-02-18 11:27 ` Nilay Shroff
0 siblings, 0 replies; 34+ messages in thread
From: Nilay Shroff @ 2025-02-18 11:27 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-block, ming.lei, dlemoal, axboe, gjoyce
On 2/18/25 2:42 PM, Christoph Hellwig wrote:
>> + /*
>> + * We don't use atomic update helper queue_limits_start_update() and
>> + * queue_limits_commit_update() here for updaing ra_pages bacause
>> + * blk_apply_bdi_limits() which is invoked from queue_limits_commit_
>> + * update() can overwrite the ra_pages value which user actaully wants
>> + * to store here. The blk_apply_bdi_limits() sets value of ra_pages
>> + * based on the optimal I/O size(io_opt).
>> + */
>
> Maybe replace this with:
>
> /*
> * ra_pages is protected by limit_lock because it is usually
> * calculated from the queue limits by queue_limits_commit_update.
> */
>
Yeah will update comment.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCHv2 0/6] block: fix lock order and remove redundant locking
2025-02-18 8:28 [PATCHv2 0/6] block: fix lock order and remove redundant locking Nilay Shroff
` (5 preceding siblings ...)
2025-02-18 8:28 ` [PATCHv2 6/6] blk-sysfs: protect read_ahead_kb using q->limits_lock Nilay Shroff
@ 2025-02-18 9:21 ` Christoph Hellwig
2025-02-18 12:09 ` Nilay Shroff
6 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2025-02-18 9:21 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, ming.lei, dlemoal, axboe, gjoyce
The mix of blk-sysfs and block in the subject lines is a bit odd.
Maybe just use the block prefix everywhere?
Also q->sysfs_lock is almost unused now and we should probably look
into killing it entirely.
blk_mq_hw_sysfs_show takes it around the ->show methods which
looks pretty useless. The debugfs code takes it for a few undocumented
things, which are worth digging into and if needed split into a separate
lock.
The concurrent ranges code takes it - I think that is because it does
register a complex sysfs hierarchy from something that could race with
add_disk / del_gendisk. Damien, can you help with your thoughts?
(sd.c also has a comment reference it and the removed sysfs_dir_lock
which needs fixing anyway).
blk_register_queue still takes it around a pretty random range of code
including nesting with other locks. I can't see what it protects
against, but it could use a careful look.
blk_unregister_queue takes it just to clear QUEUE_FLAG_REGISTERED,
which by definition can't really protect against anything.
Also the sysfs_lock in the elevator_queue should probably go away or
be replaced with the new elevator_lock for the non-show/store path
for the same reasons as outlined in this series.
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [PATCHv2 0/6] block: fix lock order and remove redundant locking
2025-02-18 9:21 ` [PATCHv2 0/6] block: fix lock order and remove redundant locking Christoph Hellwig
@ 2025-02-18 12:09 ` Nilay Shroff
0 siblings, 0 replies; 34+ messages in thread
From: Nilay Shroff @ 2025-02-18 12:09 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-block, ming.lei, dlemoal, axboe, gjoyce
On 2/18/25 2:51 PM, Christoph Hellwig wrote:
> The mix of blk-sysfs and block in the subject lines is a bit odd.
> Maybe just use the block prefix everywhere?
>
Okay I will update subject line of each patch to have "block" prefix everywhere.
> Also q->sysfs_lock is almost unused now and we should probably look
> into killing it entirely.
>
Yes that's the eventual goal and I'd work towards it.
> blk_mq_hw_sysfs_show takes it around the ->show methods which
> looks pretty useless. The debugfs code takes it for a few undocumented
> things, which are worth digging into and if needed split into a separate
> lock.
>
> The concurrent ranges code takes it - I think that is because it does
> register a complex sysfs hierarchy from something that could race with
> add_disk / del_gendisk. Damien, can you help with your thoughts?
> (sd.c also has a comment reference it and the removed sysfs_dir_lock
> which needs fixing anyway).
>
> blk_register_queue still takes it around a pretty random range of code
> including nesting with other locks. I can't see what it protects
> against, but it could use a careful look.
>
> blk_unregister_queue takes it just to clear QUEUE_FLAG_REGISTERED,
> which by definition can't really protect against anything.
Yes and also as clearing QUEUE_FLAG_REGISTERED uses atomic bitops,
we don't need to acquire q->sysfs_lock here.
>
> Also the sysfs_lock in the elevator_queue should probably go away or
> be replaced with the new elevator_lock for the non-show/store path
> for the same reasons as outlined in this series.
yes agreed.
Once this patch series is approved, I'd work further to eliminate the
remaining use of q->sysfs_lock in block layer. At some places we may
be able to just straight away remove it and other places we may replace
it with appropriate lock.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 34+ messages in thread