Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64

From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-block@vger.kernel.org, lsf-pc@lists.linux-foundation.org,
	david@fromorbit.com, leon@kernel.org, hch@lst.de,
	kbusch@kernel.org, sagi@grimberg.me, axboe@kernel.dk,
	joro@8bytes.org, brauner@kernel.org, hare@suse.de,
	willy@infradead.org, john.g.garry@oracle.com,
	p.raghav@samsung.com, gost.dev@samsung.com, da.gomez@samsung.com
Subject: Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
Date: Fri, 21 Mar 2025 07:43:09 +0530
Message-ID: <87jz8jrv0q.fsf@gmail.com>
In-Reply-To: <20250320213034.GG2803730@frogsfrogsfrogs>

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Fri, Mar 21, 2025 at 12:16:28AM +0530, Ritesh Harjani wrote:
>> Luis Chamberlain <mcgrof@kernel.org> writes:
>> 
>> > We've been constrained to a max single 512 KiB IO for a while now on x86_64.
>> > This is due to the number of DMA segments and the segment size. With LBS the
>> > segments can be much bigger without using huge pages, and so on a 64 KiB
>> > block size filesystem you can now see 2 MiB IOs when using buffered IO.
>> > But direct IO is still crippled, because allocations are from anonymous
>> > memory, and unless you are using mTHP you won't get large folios. mTHP
>> > is also non-deterministic, and so you end up in a worse situation for
>> > direct IO if you want to rely on large folios, as you may *sometimes*
>> > end up with large folios and sometimes you might not. IO patterns can
>> > therefore be erratic.
>> >
>> > As I just posted in a simple RFC [0], I believe the two step DMA API
>> > helps resolve this.  Provided we move the block integrity stuff to the
>> > new DMA API as well, the only patches really needed to support larger
>> > IOs for direct IO for NVMe are:
>> >
>> >   iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page
>> >   blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
>> 
>> Maybe some naive questions, however I would like some help from people
>> who could confirm if my understanding here is correct or not.
>> 
>> Given that we now support large folios in buffered I/O directly on raw
>> block devices, applications must carefully serialize direct I/O and
>> buffered I/O operations on these devices, right?
>> 
>> IIUC, until now, mixing buffered I/O and direct I/O (for doing I/O on
>> /dev/xxx) on separate boundaries (blocksize == pagesize) worked fine,
>> since direct I/O would only invalidate its corresponding page in the
>> page cache. This assumes that both direct I/O and buffered I/O use the
>> same blocksize and pagesize (e.g. both using 4K or both using 64K).
>> However with large folios now introduced in the buffered I/O path for
>> block devices, direct I/O may end up invalidating an entire large folio,
>> which could span across a region where an ongoing direct I/O operation
>
> I don't understand the question.  Should this read  ^^^ "buffered"?

oops, yes.

> As in, directio submits its write bio, meanwhile another thread
> initiates a buffered write nearby, the write gets a 2MB folio, and
> then the post-write invalidation knocks down the entire large folio?
> Even though the two ranges written are (say) 256k apart?
>

Yes, Darrick. That is my question. 

I.e., without large folios on block devices, one could do direct-io and
buffered-io in parallel, even right next to each other (assuming a 4k
page size).

           |4k-direct-io | 4k-buffered-io | 


However, with large folios now supported in the buffered-io path for
block devices, the application cannot submit such a direct-io +
buffered-io pattern in parallel, since direct-io can end up invalidating
the entire folio spanning its 4k range, on which buffered-io is in
progress.

So applications now need to be careful not to submit direct-io and
buffered-io in parallel in patterns like the one above on a raw block
device, correct? That is what I would like to confirm.
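
To make the pattern concrete, here is a minimal userspace sketch of the
arithmetic (not kernel code). It assumes, for simplicity, that the page
cache uses naturally aligned folios of one fixed size and that the
invalidation after a direct write drops every folio touching the written
range. The helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: index of the folio covering byte offset `off`,
 * assuming every folio is `folio_size` bytes and naturally aligned.
 * Real page cache folios can differ in size from one range to the next.
 */
static uint64_t folio_index(uint64_t off, uint64_t folio_size)
{
	return off / folio_size;
}

/*
 * Returns true if invalidating the folios that cover the direct-io range
 * [dio_off, dio_off + dio_len) would also drop a folio that caches part
 * of the buffered-io range [buf_off, buf_off + buf_len).
 * Both lengths are assumed to be non-zero.
 */
static bool dio_invalidation_hits_buffered(uint64_t dio_off, uint64_t dio_len,
					   uint64_t buf_off, uint64_t buf_len,
					   uint64_t folio_size)
{
	uint64_t dio_first = folio_index(dio_off, folio_size);
	uint64_t dio_last = folio_index(dio_off + dio_len - 1, folio_size);
	uint64_t buf_first = folio_index(buf_off, folio_size);
	uint64_t buf_last = folio_index(buf_off + buf_len - 1, folio_size);

	/* The two folio index ranges overlap iff each starts before the
	 * other ends. */
	return dio_first <= buf_last && buf_first <= dio_last;
}
```

With a 4k direct write at offset 0 and a 4k buffered write at offset 4k,
this returns false for 4k folios and true for 2M folios. It also returns
true for 2M folios when the buffered write starts 256k away, as in
Darrick's example.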

> --D
>
>> is taking place. That means, with large folio support in block devices,
>> application developers must now ensure that direct I/O and buffered I/O
>> operations on block devices are properly serialized, correct?
>> 
>> I was looking at the POSIX page [1] and I don't think the POSIX standard
>> defines the semantics for operations on block devices. So it is really up
>> to the individual OS implementation, correct?
>> 
>> And IIUC, what Linux recommends is to never mix any kind of direct-io
>> and buffered-io when doing I/O on raw block devices, but I cannot find
>> this recommendation in any Documentation. So can someone please point me
>> to one where we recommend this?

And this ^^^ 


-ritesh

>> 
>> [1]: https://pubs.opengroup.org/onlinepubs/9799919799/
>> 
>> 
>> -ritesh
>> 
>> >
>> > The other two nvme-pci patches in that series are to just help with
>> > experimentation now and they can be ignored.
>> >
>> > It does beg a few questions:
>> >
>> >  - How are we computing the new max single IO anyway? Are we really
>> >    bounded only by what devices support?
>> >  - Do we believe this is the step in the right direction?
>> >  - Is 2 MiB a sensible max block sector size limit for the next few years?
>> >  - What other considerations should we have?
>> >  - Do we want something more deterministic for large folios for direct IO?
>> >
>> > [0] https://lkml.kernel.org/r/20250320111328.2841690-1-mcgrof@kernel.org
>> >
>> >   Luis
>>
