Re: [RFC] Documentation: Add initial iomap document

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, Dave Chinner <david@fromorbit.com>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>,
	Christian Brauner <brauner@kernel.org>,
	Ojaswin Mujoo <ojaswin@linux.ibm.com>, Jan Kara <jack@suse.cz>
Subject: Re: [RFC] Documentation: Add initial iomap document
Date: Wed, 05 Jun 2024 09:37:51 +0530	[thread overview]
Message-ID: <8734psjalk.fsf@gmail.com> (raw)
In-Reply-To: <20240603233422.GI53013@frogsfrogsfrogs>

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Fri, May 03, 2024 at 07:40:19PM +0530, Ritesh Harjani (IBM) wrote:
>> This adds an initial first draft of iomap documentation. Hopefully this
>> will come useful to those who are looking for converting their
>> filesystems to iomap. Currently this is in text format since this is the
>> first draft. I would prefer to work on it's conversion to .rst once we
>> receive the feedback/review comments on the overall content of the document.
>> But feel free to let me know if we prefer it otherwise.
>> 
>> A lot of this has been collected from various email conversations, code
>> comments, commit messages and/or my own understanding of iomap. Please
>> note a large part of this has been taken from Dave's reply to last iomap
>> doc patchset. Thanks to Dave, Darrick, Matthew, Christoph and other iomap
>> developers who have taken time to explain the iomap design in various emails,
>> commits, comments etc.
>> 
>> Please note that this is not the complete iomap design doc. but a brief
>> overview of iomap.
>> 
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  Documentation/filesystems/index.rst |   1 +
>>  Documentation/filesystems/iomap.txt | 289 ++++++++++++++++++++++++++++
>>  MAINTAINERS                         |   1 +
>>  3 files changed, 291 insertions(+)
>>  create mode 100644 Documentation/filesystems/iomap.txt
>> 
>> diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
>> index 1f9b4c905a6a..c17b5a2ec29b 100644
>> --- a/Documentation/filesystems/index.rst
>> +++ b/Documentation/filesystems/index.rst
>> @@ -34,6 +34,7 @@ algorithms work.
>>     seq_file
>>     sharedsubtree
>>     idmappings
>> +   iomap
>> 
>>     automount-support
>> 
>> diff --git a/Documentation/filesystems/iomap.txt b/Documentation/filesystems/iomap.txt
>> new file mode 100644
>> index 000000000000..4f766b129975
>> --- /dev/null
>> +++ b/Documentation/filesystems/iomap.txt
>> @@ -0,0 +1,289 @@
>> +Introduction
>> +============
>> +iomap is a filesystem centric mapping layer that maps file's logical offset
>
> It's really more of a library for filesystem implementations that need
> to handle file operations that involve physical storage devices.
>

Maybe something like this...

iomap is a filesystem library for handling various filesystem operations
that involves mapping of file's logical offset ranges to physical extents.


>> +ranges to physical extents. It provides several iterator APIs which filesystems
>
> Minor style nit: Start each sentence on a new line, so that future
> revisions to a sentence don't produce diff for adjoining sentences.
>

I will try to keep that in mind wherever possible.

>> +can use for doing various file_operations, address_space_operations,
>
> "...for doing various file and pagecache operations, such as:
>
>  * Pagecache reads and writes
>  * Page faults
>  * Dirty pagecache writeback
>  * FIEMAP
>  * directio
>  * lseek
>  * swapfile activation
> <etc>
>

Thanks this is more clear.


>> +vm_operations, inode_operations etc. It supports APIs for doing direct-io,
>> +buffered-io, lseek, dax-io, page-mkwrite, swap_activate and extent reporting
>> +via fiemap.
>> +
>> +iomap is termed above as filesystem centric because it first calls
>> +->iomap_begin() phase supplied by the filesystem to get a mapped extent and
>> +then loops over each folio within that mapped extent.
>> +This is useful for filesystems because now they can allocate/reserve a much
>> +larger extent at begin phase v/s the older approach of doing block allocation
>> +of one block at a time by calling filesystem's provided ->get_blocks() routine.
>
> I think this should be more general because all the iomap operations
> work this way, not just the ones that involve folios:
>
> "Unlike the classic Linux IO model which breaks file io into small units
> (generally memory pages or blocks) and looks up space mappings on the
> basis of that unit, the iomap model asks the filesystem for the largest
> space mappings that it can create for a given file operation and
> initiates operations on that basis.
> This strategy improves the filesystem's visibility into the size of the
> operation being performed, which enables it to combat fragmentation with
> larger space allocations when possible.
> Larger mappings improve runtime performance by amortizing the cost of a
> mapping function call into the filesystem across a larger amount of
> data."
>
> "At a high level, an iomap operation looks like this:
>
>   for each byte in the operation range,
>       obtain space mapping via ->iomap_begin
>       for each sub-unit of work,
>           revalidate the mapping
>           do the work
>       increment file range cursor
>       if needed, release the mapping via ->iomap_end
>

Thanks for making it more general. I will use your above version.

> "Each iomap operation will be covered in more detail below."
>
>> +
>> +i.e. at a high level how iomap does write iter is [1]::
>> +	user IO
>> +	  loop for file IO range
>> +	    loop for each mapped extent
>> +	      if (buffered) {
>> +		loop for each page/folio {
>> +		  instantiate page cache
>> +		  copy data to/from page cache
>> +		  update page cache state
>> +		}
>> +	      } else { /* direct IO */
>> +		loop for each bio {
>> +		  pack user pages into bio
>> +		  submit bio
>> +		}
>> +	      }
>> +	    }
>> +	  }
>
> I agree with the others that each io operation (pagecache, directio,
> etc) should be a separate section.
>

Actually the idea for adding this loop was mainly to show that
->iomap_begin operation first maps the extent and then we loop over each
folio. This is some sense is already covered in your paragraph above.
So I am not sure how much of a value it will add to add these loops for
individual operations. But I will re-asses that while doing v2.


>> +
>> +
>> +Motivation for filesystems to convert to iomap
>> +===============================================
>
> Maybe this is the section titled buffered IO?
>

No the idea here was to list the modern filesystem features iomap
library supports. e.g. large folios, upcoming atomic writes for
buffered-io, abstracts page cache operation completely...

But, I got your point. Let me find a better line for this section :)

>> +1. iomap is a modern filesystem mapping layer VFS abstraction.
>
> I don't think this is much of a justification -- buffer heads were
> modern once.
>

Sure depending upon what ends up being the section name. I will see what
needs to come here, or will drop it.

>> +2. It also supports large folios for buffered-writes. Large folios can help
>> +improve filesystem buffered-write performance and can also improve overall
>> +system performance.
>
> How does it improve pagecache performance?  Is this a result of needing
> to take fewer locks?  Is it because page table walks can stop early?  Is
> it because we can reduce the number of memcpy calls?
>

You mean to ask how does large folios in iomap improves buffered-write
performance & system performance?
- less no. of struct folio entries to iterate over while doing
buffered-I/O. 
- large folios in pagecache helps in reducing MM fragmentation.
- Smaller LRU lists size means, lesser work overhead for MM subsystems; means
better system performance.

I will go ahead and add these subpoints for point 2 then. 


>> +3. Less maintenance overhead for individual filesystem maintainers.
>> +iomap is able to abstract away common folio-cache related operations from the
>> +filesystem to within the iomap layer itself. e.g. allocating, instantiating,
>> +locking and unlocking of the folios for buffered-write operations are now taken
>> +care within iomap. No ->write_begin(), ->write_end() or direct_IO
>> +address_space_operations are required to be implemented by filesystem using
>> +iomap.
>
> I think it's stronger to say that iomap abstracts away /all/ the
> pagecache operations, which gets each fs implementation out of the
> business of doing that itself.
>

Sure. Make sense. Will do that.

>> +
>> +
>> +blocksize < pagesize path/large folios
>> +======================================
>
> Is this a subsection?  Should the annotation be "----" instead of "====" ?
>

Ok.

>> +Large folio support or systems with large pagesize e.g 64K on Power/ARM64 and
>> +4k blocksize, needs filesystems to support bs < ps paths. iomap embeds
>> +struct iomap_folio_state (ifs) within folio->private. ifs maintains uptodate
>> +and dirty bits for each subblock within the folio. Using ifs iomap can track
>> +update and dirty status of each block within the folio. This helps in supporting
>> +bs < ps path for such systems with large pagesize or with large folios [2].
>
> TBH I think it's sufficient to mention that the iomap pagecache
> implementation tracks uptodate and dirty status for each the fsblocks
> cached by a folio to reduce read and write amplification.  No need to go
> into the details of exactly how that happens.
>
> Perhaps also mention that variable sized folios are required to handle
> fsblock > pagesize filesystems.
>

Ok.

>> +
>> +
>> +struct iomap
>> +=============
>> +This structure defines a file mapping information of logical file offset range
>> +to a physical mapped extent on which an IO operation could be performed.
>> +An iomap reflects a single contiguous range of filesystem address space that
>> +either exists in memory or on a block device.
>
> "This structure contains information to map a range of file data to
> physical space on a storage device.
> The information is useful for translating file operations into action.
> The actions taken are specific to the target of the operation, such as
> disk cache, physical storage devices, or another part of the kernel."
>
> "Each iomap operation will pass operation flags as needed to the
> ->iomap_begin function.
> These flags will be documented in the operation-specific sections below.
> For a write operation, IOMAP_WRITE will be set.
> For any other operation, IOMAP_WRITE will not be set."
>
> ...and then down in the section on pagecache operations, you can do
> something like:
>
> "For pagecache operations, the @flags argument to ->iomap_begin can be
> any of the following:"
>
>  * IOMAP_WRITE: write to a file
>
>  * IOMAP_WRITE | IOMAP_UNSHARE: if any of the space backing a file is
>    shared, mark that part of the pagecache dirty so that it will be
>    unshared during writeback
>
>  * IOMAP_WRITE | IOMAP_FAULT: this is a write fault for mmapped
>    pagecache
>
>  * IOMAP_ZERO: zero parts of a file
>
>  * 0: read from the pagecache
>

Ok. Thanks for this.

> Ugh, this namespace collides with IOMAP_{HOLE..INLINE}.  Sigh.
>
>> +1. The type field within iomap determines what type the range maps to e.g.
>> +IOMAP_HOLE, IOMAP_DELALLOC, IOMAP_UNWRITTEN etc.
>
> Maybe have a list here stating what a "DELALLOC" is?  Not all readers
> are going to know what these types are.
>
>  * IOMAP_HOLE: No storage has been allocated.  This must never be
>    returned in response to an IOMAP_WRITE operation because writes must
>    allocate and map space, and return the mapping.
>
>  * IOMAP_DELALLOC: A promise to allocate space at a later time ("delayed
>    allocation").  If the filesystem returns IOMAP_F_NEW here and the
>    write fails, the ->iomap_end function must delete the reservation.
>
>  * IOMAP_MAPPED: The file range maps to this space on the storage
>    device.  The device is returned in @bdev or @dax_dev, and the device
>    address of the space is returned in @addr.
>
>  * IOMAP_UNWRITTEN: The file range maps to this space on the storage
>    device, but the space has not yet been initialized.  The device is
>    returned in @bdev or @dax_dev, and the device address of the space is
>    returned in @addr.  Reads will return zeroes, and the write ioend
>    must update the mapping to MAPPED.
>
>  * IOMAP_INLINE: The file range maps to the in-memory buffer
>    @inline_data.  For IOMAP_WRITE operations, the ->iomap_end function
>    presumable handles persisting the data.
>

I was thinking maybe it's corresponding file will already have that
that's why, so we need not cover it here.
And I also wanted to kept initial RFC document as a brief overview.
But sure thanks for doing this. I will add this info in v2.


>> +2. The flags field represent the state flags (e.g. IOMAP_F_*), most of which are
>> +set the by the filesystem during mapping time that indicates how iomap
>> +infrastructure should modify it's behaviour to do the right thing.
>
> I think you should summarize the IOMAP_F_ flags here.
>

Sure. Thanks!

>> +3. private void pointer within iomap allows the filesystems to pass filesystem's
>> +private data from ->iomap_begin() to ->iomap_end() [3].
>> +(see include/linux/iomap.h for more details)
>
> 4. Maybe we should mention what bdev/dax_dev/inline_data do here?
>

Let me read about this.

> 5. What does the validity cookie do?
>

Sure, I will add that.

>> +
>> +
>> +iomap operations
>> +================
>> +iomap provides different iterator APIs for direct-io, buffered-io, lseek,
>> +dax-io, page-mkwrite, swap_activate and extent reporting via fiemap. It requires
>> +various struct operations to be prepared by filesystem and to be supplied to
>> +iomap iterator APIs either at the beginning of iomap api call or attaching it
>> +during the mapping callback time e.g iomap_folio_ops is attached to
>> +iomap->folio_ops during ->iomap_begin() call.
>> +
>> +Following provides various ops to be supplied by filesystems to iomap layer for
>> +doing different I/O types as discussed above.
>> +
>> +iomap_ops: IO interface specific operations
>> +==========
>> +The methods are designed to be used as pairs. The begin method creates the iomap
>> +and attaches all the necessary state and information which subsequent iomap
>> +methods & their callbacks might need. Once the iomap infrastructure has finished
>> +working on the iomap it will call the end method to allow the filesystem to tear
>> +down any unused space and/or structures it created for the specific iomap
>> +context.
>> +
>> +Almost all iomap iterator APIs require filesystems to define iomap_ops so that
>> +filesystems can be called into for providing logical to physical extent mapping,
>> +wherever required. This is required by the iomap iter apis used for the
>> +operations which are listed in the beginning of "iomap operations" section.
>> +  - iomap_begin: This either returns an existing mapping or reserve/allocates a
>> +    new mapping when called by iomap. pos and length are passed as function
>> +    arguments. Filesystem returns the new mapping information within struct
>> +    iomap which also gets passed as a function argument. Filesystems should
>> +    provide the type of this extent in iomap->type for e.g. IOMAP_HOLE,
>> +    IOMAP_UNWRITTEN and it should set the iomap->flags e.g. IOMAP_F_*
>> +    (see details in include/linux/iomap.h)
>> +
>> +    Note that iomap_begin() call has srcmap passed as another argument. This is
>> +    mainly used only during the begin phase for COW mappings to identify where
>> +    the reads are to be performed from. Filesystems needs to fill that mapping
>> +    information if iomap should read data for partially written blocks from a
>> +    different location than the write target [4].
>> +
>> +  - iomap_end: Commit and/or unreserve space which was previously allocated
>> +    using iomap_begin. During buffered-io, when a short writes occurs,
>> +    filesystem may need to remove the reserved space that was allocated
>> +    during ->iomap_begin. For filesystems that use delalloc allocation, we need
>> +    to punch out delalloc extents from the range that are not dirty in the page
>> +    cache. See comments in iomap_file_buffered_write_punch_delalloc() for more
>> +    info [5][6].
>
> I think these three sections "struct iomap", "iomap operations", and
> "iomap_ops" should come first before you talk about specific things like
> buffered io operations.
>

Ok. 

> I'm going to stop here for the day.
>
> --D
>

Thanks Darrick for taking time to do the extensive review. 

I will take yours and others review into account and will work on v2.

-ritesh

     prev parent reply	other threads:[~2024-06-05 10:16 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-03 14:10 [RFC] Documentation: Add initial iomap document Ritesh Harjani (IBM)
2024-05-03 16:41 ` Christoph Hellwig
2024-05-12 12:44 ` Jan Kara
2024-06-03 23:34 ` Darrick J. Wong
2024-06-05  4:07   ` Ritesh Harjani [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8734psjalk.fsf@gmail.com \
    --to=ritesh.list@gmail.com \
    --cc=brauner@kernel.org \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.