linux-fsdevel.vger.kernel.org archive mirror
* md raid0 Direct IO DMA alignment
@ 2025-06-17  2:19 Jason Rahman
  2025-06-17  3:12 ` Jason Rahman
  2025-06-17 16:36 ` Keith Busch
  0 siblings, 2 replies; 3+ messages in thread
From: Jason Rahman @ 2025-06-17  2:19 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
  Cc: Adam Prout, Girish Mittur Venkataramanappa, kbusch@meta.com,
	James Bottomley

Hi, I have a question about DMA alignment for MD block devices (RAID0 in our case, but it applies to other MD array types as well). Many underlying block devices support much more permissive DMA alignment requirements than a full sector. For example, in https://github.com/torvalds/linux/commit/52fde2c07da606f3f120af4f734eadcfb52b04be#diff-dc92ff74575224dc8a460fa8ea47dd00968c082be4205ecc672530e116a0043bL1776 the NVMe DMA requirement was relaxed to require only 8-byte alignment of the buffer provided for Direct IO.
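
To make that concrete, here is a rough userspace sketch of the kind of Direct IO the relaxed limit permits against a raw NVMe namespace. The device path is just an example, and the 8-byte figure is the one cited above; only the memory alignment of the buffer is relaxed, while the file offset and length still have to be logical-block aligned:

/*
 * Rough sketch only: an O_DIRECT read with a buffer that is 8-byte
 * aligned but deliberately not sector aligned. Device path is an example.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	char *region, *buf;
	int fd;

	fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* 8-byte aligned start that is not a multiple of 512 */
	region = aligned_alloc(4096, 8192);
	if (!region)
		return 1;
	buf = region + 8;

	/* offset and length stay logical-block aligned; only the memory
	 * alignment of the buffer relies on the relaxed dma_alignment */
	if (pread(fd, buf, 4096, 0) < 0)
		perror("pread");	/* EINVAL on a queue that wants 511 */

	free(region);
	close(fd);
	return 0;
}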

However, when NVMe devices (or any other devices with less restrictive DMA alignment) are used to back an MD device (RAID0 in our case), the dma_alignment on the MD block device queue is set to a much more restrictive value than what the backing devices support. From initial exploration, I don't see why that is necessary. If the underlying devices accept less strictly aligned Direct IO buffers, and the sector/block sizes are a multiple of that alignment, then every address handed off to the backing devices will be correctly aligned. For example, even if the buffer is split across multiple stripes of an mdraid array, the IO on each disk starts sector-aligned, so any offset that is a whole number of sectors from the start of the buffer is still correctly aligned.
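
A quick way to see the discrepancy from userspace (a sketch that assumes the kernel exposes the limit via queue/dma_alignment in sysfs; the device names are examples):

#include <stdio.h>

static void show(const char *dev)
{
	char path[128], buf[32];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/dma_alignment", dev);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("%s: dma_alignment=%s", dev, buf);
	fclose(f);
}

int main(void)
{
	show("nvme0n1");	/* example backing device */
	show("md0");		/* example RAID0 array */
	return 0;
}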

Within the md driver and block layer, when the md block device queue limits are set up, md_init_stacking_limits() is called, which in turn picks up default values from blk_set_stacking_limits() here: https://github.com/torvalds/linux/blob/9afe652958c3ee88f24df1e4a97f298afce89407/block/blk-settings.c#L42. The DMA alignment initialized there (SECTOR_SIZE - 1) is far stricter than what many, if not most, actual backing devices require. When the md layer later calls mddev_stack_rdev_limits(), that in turn calls queue_limits_stack_bdev(), which takes the maximum of the dma_alignment in the current queue limits and that of each subsequent device in the mddev.
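
To illustrate why the initial value matters, here is a toy userspace simulation of that stacking (simplified; the masks are examples, with 7 standing in for an 8-byte requirement and 511 for SECTOR_SIZE - 1):

#include <stdio.h>

#define SECTOR_SIZE 512

int main(void)
{
	/* md_init_stacking_limits() -> blk_set_stacking_limits() default */
	unsigned int md_dma_alignment = SECTOR_SIZE - 1;
	/* two backing devices that only need 8-byte alignment (mask 7) */
	unsigned int rdev_dma_alignment[] = { 7, 7 };
	int i;

	/* mddev_stack_rdev_limits() -> queue_limits_stack_bdev() takes the
	 * max per rdev, so the stacked mask can only grow, never shrink */
	for (i = 0; i < 2; i++)
		if (rdev_dma_alignment[i] > md_dma_alignment)
			md_dma_alignment = rdev_dma_alignment[i];

	printf("stacked dma_alignment = %u\n", md_dma_alignment); /* 511 */
	return 0;
}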

It seems that rather than setting dma_alignment to SECTOR_SIZE - 1 in md_init_stacking_limits(), it should be set to zero; then, as queue_limits_stack_bdev() is called for each backing device, dma_alignment would end up as the largest value among all backing devices. Are there any thoughts or concerns about updating the mddev dma_alignment computation to track the underlying backing devices more closely, without today's SECTOR_SIZE - 1 lower bound?
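
With the same toy simulation, starting from zero instead of SECTOR_SIZE - 1 makes the array track the strictest backing device (illustrative only, not a tested kernel patch):

#include <stdio.h>

int main(void)
{
	unsigned int md_dma_alignment = 0;	/* proposed initial value */
	unsigned int rdev_dma_alignment[] = { 7, 7 };
	int i;

	for (i = 0; i < 2; i++)
		if (rdev_dma_alignment[i] > md_dma_alignment)
			md_dma_alignment = rdev_dma_alignment[i];

	printf("stacked dma_alignment = %u\n", md_dma_alignment); /* 7 */
	return 0;
}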

Regards,
   Jason


* Re: md raid0 Direct IO DMA alignment
  2025-06-17  2:19 md raid0 Direct IO DMA alignment Jason Rahman
@ 2025-06-17  3:12 ` Jason Rahman
  2025-06-17 16:36 ` Keith Busch
  1 sibling, 0 replies; 3+ messages in thread
From: Jason Rahman @ 2025-06-17  3:12 UTC (permalink / raw)
  To: linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
  Cc: kbusch@meta.com, James Bottomley

Apologies for the unintended recall message on the list; it was meant for a separate, prior version of this email. With that said, any input or thoughts here would be greatly appreciated.

Regards,
   Jason


* Re: md raid0 Direct IO DMA alignment
  2025-06-17  2:19 md raid0 Direct IO DMA alignment Jason Rahman
  2025-06-17  3:12 ` Jason Rahman
@ 2025-06-17 16:36 ` Keith Busch
  1 sibling, 0 replies; 3+ messages in thread
From: Keith Busch @ 2025-06-17 16:36 UTC (permalink / raw)
  To: Jason Rahman
  Cc: linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Adam Prout, Girish Mittur Venkataramanappa, kbusch@meta.com,
	James Bottomley

On Tue, Jun 17, 2025 at 02:19:11AM +0000, Jason Rahman wrote:
> It seems that rather than setting dma_alignment to SECTOR_SIZE - 1 in
> md_init_stacking_limits, it should be set to zero, and as
> queue_limits_stack_bdev is called on each backing device, the
> dma_alignment value will be updated to the largest dma_alignment value
> among all backing devices. Are there any thoughts/concerns about
> updating the mddev dma_alignment computation to track the underlying
> backing device more closely, without the minimum SECTOR_SIZE - 1 lower
> bound today?

I believe it should be safe to stack the dma alignment to the least common
multiple of the alignments of the block devices you're stacking.
blk_stack_limits already tries to do that, at least.

So I think you're right, it should be okay not to set the dma_alignment
limit when initializing the stacking limits. Any block device that hasn't
set its dma_alignment limit will default to SECTOR_SIZE - 1 later anyway,
so I don't think stacking needs to explicitly initialize it.
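
(The fallback I have in mind is the default applied when the queue limits
are validated, which, roughly and from memory rather than a verbatim
quote, amounts to:)

	/* if a driver never sets a dma_alignment limit, validation falls
	 * back to the historical sector-size mask */
	if (!lim->dma_alignment)
		lim->dma_alignment = SECTOR_SIZE - 1;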


Thread overview: 3+ messages
2025-06-17  2:19 md raid0 Direct IO DMA alignment Jason Rahman
2025-06-17  3:12 ` Jason Rahman
2025-06-17 16:36 ` Keith Busch
