All of lore.kernel.org
 help / color / mirror / Atom feed
From: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
To: John Garry <john.g.garry@oracle.com>,
	song@kernel.org, yukuai@fygo.io, magiclinan@didiglobal.com,
	xiao@kernel.org, axboe@kernel.dk, vverma@digitalocean.com,
	martin.petersen@oracle.com, linux-kernel@vger.kernel.org
Cc: linux-raid@vger.kernel.org
Subject: Re: [PATCH v2 2/7] md/raid1: advertise atomic write limits and handle runtime constraints
Date: Tue, 30 Jun 2026 10:39:34 +0200	[thread overview]
Message-ID: <m2se64zak9.fsf@gmail.com> (raw)
In-Reply-To: <7cbaaca3-4eb5-434d-a13f-f9574c9f977b@oracle.com>


Hi John,

On Mon, Jun 29, 2026 at 15:48 +0100, John Garry wrote:
> On 28/06/2026 15:24, Abd-Alrhman Masalkhi wrote:
>> Atomic writes in RAID1 must fit within a single barrier unit. Advertise
>> this restriction through the queue limits by setting
>> atomic_write_hw_unit_max to BARRIER_UNIT_SECTOR_SIZE so that bios which
>> would cross a barrier-unit boundary are rejected by the block layer
>> before reaching MD.
>> 
>> A bio that passes block-layer validation may still become unserviceable
>> within RAID1 due to bad blocks or write-behind constraints. In the former
>> case, complete the bio with EIO. In the latter case, disable
>> write-behind rather than failing the bio with EIO.
>> 
>> Fixes: f2a38abf5f1c ("md/raid1: Atomic write support")
>> Fixes: a4c55c902670 ("md/raid1: simplify raid1_write_request() error handling")
>> Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
>> ---
>> Changes in v2:
>>   - Drop the early atomic write split check from raid1_write_request().
>>   - Advertise the atomic write size limit via queue limits.
>>   - Disable write-behind instead of failing atomic writes when the
>>     BIO_MAX_VECS limit is encountered.
>>   - Link to v1: https://urldefense.com/v3/__https://lore.kernel.org/linux-raid/20260623072456.333437-3-abd.masalkhi@gmail.com/__;!!ACWV5N9M2RV99hQ!LbMSGSClRi0PNBqQti5ZNWGDVjDd34-7saYEAwNyBNjpNTjEA7veqM5RHG8KB1QiscarW4UaIefjm19ywSImtIgh$
>> ---
>>   drivers/md/raid1.c | 36 +++++++++++++++++++-----------------
>>   1 file changed, 19 insertions(+), 17 deletions(-)
>> 
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index afe2ca96ad8c..f322048ab3c2 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1522,6 +1522,7 @@ static bool raid1_write_request(struct mddev *mddev, struct bio *bio,
>>   	int first_clone;
>>   	bool write_behind = false;
>>   	bool nowait = bio->bi_opf & REQ_NOWAIT;
>> +	bool atomic = bio->bi_opf & REQ_ATOMIC;
>>   	bool is_discard = op_is_discard(bio->bi_opf);
>>   	sector_t sector = bio->bi_iter.bi_sector;
>>   
>> @@ -1603,20 +1604,6 @@ static bool raid1_write_request(struct mddev *mddev, struct bio *bio,
>>   			}
>>   			if (is_bad) {
>>   				int good_sectors;
>> -
>> -				/*
>> -				 * We cannot atomically write this, so just
>> -				 * error in that case. It could be possible to
>> -				 * atomically write other mirrors, but the
>> -				 * complexity of supporting that is not worth
>> -				 * the benefit.
>> -				 */
>> -				if (bio->bi_opf & REQ_ATOMIC) {
>> -					bio->bi_status = BLK_STS_NOTSUPP;
>> -					bio_endio(bio);
>> -					goto err_dec_pending;
>> -				}
>> -
>>   				good_sectors = first_bad - sector;
>>   				if (good_sectors < max_sectors)
>>   					max_sectors = good_sectors;
>> @@ -1633,10 +1620,24 @@ static bool raid1_write_request(struct mddev *mddev, struct bio *bio,
>>   	 * at a time and thus needs a new bio that can fit the whole payload
>>   	 * this bio in page sized chunks.
>>   	 */
>> -	if (write_behind && mddev->bitmap)
>> -		max_sectors = min_t(int, max_sectors,
>> -				    BIO_MAX_VECS * (PAGE_SIZE >> 9));
>> +	if (write_behind && mddev->bitmap) {
>> +		if (atomic && max_sectors > BIO_MAX_VECS * (PAGE_SIZE >> 9))
>
> where does BIO_MAX_VECS * (PAGE_SIZE >> 9) even come from?
>

BIO_MAX_VECS * (PAGE_SIZE >> 9) defines the maximum size supported by
write-behind. The write-behind copy (alloc_behind_master_bio) uses a
single bio, which can hold at most BIO_MAX_VECS pages, making this the
largest payload it can carry. With a 4 KiB PAGE_SIZE, that corresponds
to 256 pages, or 1 MiB (2048 sectors).
This patch changes the behavior for atomic writes that exceed this
limit. Instead of failing the write with -EIO when the number of sectors
must be reduced, it disables write-behind and proceeds with the atomic
write.

>> +			/*
>> +			 * Atomic writes cannot be split, so disable
>> +			 * write-behind.
>> +			 */
>> +			write_behind = false;
>> +		else
>> +			max_sectors = min_t(int, max_sectors,
>> +					    BIO_MAX_VECS * (PAGE_SIZE >> 9));
>> +	}
>> +
>>   	if (max_sectors < bio_sectors(bio)) {
>> +		if (atomic) {
>> +			bio_io_error(bio);
>> +			goto err_dec_pending;
>> +		}
>> +
>>   		bio = bio_submit_split_bioset(bio, max_sectors,
>>   					      &conf->bio_split);
>>   		if (!bio)
>> @@ -3229,6 +3230,7 @@ static int raid1_set_limits(struct mddev *mddev)
>>   	lim.max_write_zeroes_sectors = 0;
>>   	lim.max_hw_wzeroes_unmap_sectors = 0;
>>   	lim.logical_block_size = mddev->logical_block_size;
>> +	lim.atomic_write_hw_unit_max = BARRIER_UNIT_SECTOR_SIZE;
>
> This BARRIER_UNIT_SECTOR_SIZE is a bit like chunk sectors, no? I am just 
> wondering if we just should set it to chunk sectors = 
> BARRIER_UNIT_SECTOR_SIZE
>
> I assume that it affects more than Reads and writes, e.g. discard also.
>

BARRIER_UNIT_SECTOR_SIZE is the resync barrier-bucket, not the layout
chunk size. Unless I'm missing something, using
atomic_write_hw_unit_max seems more appropriate than using the chunk
size. That way, the limit only applies to atomic writes instead of
affecting other operations such.

>>   	lim.features |= BLK_FEAT_ATOMIC_WRITES;
>>   	lim.features |= BLK_FEAT_PCI_P2PDMA;
>>   	err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
>

-- 
Best Regards,
Abd-Alrhman

  reply	other threads:[~2026-06-30  8:39 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-28 14:24 [PATCH v2 0/7] md/raid10: fixes, atomic write handling, and error-path cleanup Abd-Alrhman Masalkhi
2026-06-28 14:24 ` [PATCH v2 1/7] md/raid10: fix r10bio leak in raid10_write_request() error paths Abd-Alrhman Masalkhi
2026-06-28 14:39   ` sashiko-bot
2026-06-28 14:24 ` [PATCH v2 2/7] md/raid1: advertise atomic write limits and handle runtime constraints Abd-Alrhman Masalkhi
2026-06-28 14:38   ` sashiko-bot
2026-06-29 14:48   ` John Garry
2026-06-30  8:39     ` Abd-Alrhman Masalkhi [this message]
2026-06-28 14:24 ` [PATCH v2 3/7] md/raid10: consistently fail atomic writes that require splitting Abd-Alrhman Masalkhi
2026-06-28 14:36   ` sashiko-bot
2026-06-28 21:35     ` Abd-Alrhman Masalkhi
2026-06-28 14:24 ` [PATCH v2 4/7] md/raid10: remove unnecessary barrier around bio_submit_split_bioset() Abd-Alrhman Masalkhi
2026-06-28 14:24 ` [PATCH v2 5/7] md/raid10: replace wait loop with wait_event_idle() Abd-Alrhman Masalkhi
2026-06-28 14:24 ` [PATCH v2 6/7] md/raid10: simplify write request error handling Abd-Alrhman Masalkhi
2026-06-28 14:24 ` [PATCH v2 7/7] md/raid10: simplify read " Abd-Alrhman Masalkhi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=m2se64zak9.fsf@gmail.com \
    --to=abd.masalkhi@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=john.g.garry@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-raid@vger.kernel.org \
    --cc=magiclinan@didiglobal.com \
    --cc=martin.petersen@oracle.com \
    --cc=song@kernel.org \
    --cc=vverma@digitalocean.com \
    --cc=xiao@kernel.org \
    --cc=yukuai@fygo.io \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.