From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: John Garry <john.g.garry@oracle.com>, linux-ext4@vger.kernel.org
Cc: Theodore Ts'o <tytso@mit.edu>, Jan Kara <jack@suse.cz>,
	"Darrick J . Wong" <djwong@kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	Ojaswin Mujoo <ojaswin@linux.ibm.com>,
	Dave Chinner <david@fromorbit.com>,
	linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes
Date: Fri, 25 Oct 2024 18:06:10 +0530
Message-ID: <87r084mkat.fsf@gmail.com>
In-Reply-To: <7aea00d4-3914-414d-a18f-586a303868c1@oracle.com>

John Garry <john.g.garry@oracle.com> writes:

> On 25/10/2024 12:19, Ritesh Harjani (IBM) wrote:
>> John Garry <john.g.garry@oracle.com> writes:
>> 
>>> On 25/10/2024 11:35, Ritesh Harjani (IBM) wrote:
>>>>>> Same as mentioned above. We can't let atomic writes get split.
>>>>>> This patch is just lifting the restriction of iomap to allow more than
>>>>>> blocksize but the mapped length should still meet iter->len, as
>>>>>> otherwise the writes can get split.
>>>>> Sure, I get this. But I wonder why would we be getting multiple
>>>>> mappings? Why cannot the FS always provide a single mapping?
>>>> The FS can decide to split the mapping when it cannot allocate a single
>>>> large mapping of the requested length. This could be due to:
>>>> - an already allocated extent followed by EOF,
>>>> - an already allocated extent followed by a hole,
>>>> - an already mapped extent followed by an extent of a different type
>>>>   (e.g. written followed by unwritten, or unwritten followed by written).
>>>
>>> This is the sort of scenario which I am concerned with. This issue has
>>> been discussed at length for XFS forcealign support for atomic writes.
>> 
>> extsize and forcealign are being worked on for ext4 as well, where we
>> can add such support, sure.
>> 
>>>
>>> So far, the user can atomically write a single FS block regardless of
>>> whether the extent it would be part of is in the written or
>>> unwritten state.
>>>
>>> Now the rule will be that, to write multiple FS blocks atomically, all
>>> blocks need to be in the same written or unwritten state.
>> 
>> The FS needs to ensure that writes do not get torn. So if the FS splits
>> the mapping for whatever reason, we need to return -EINVAL so that
>> such writes cannot be torn. This patch does just that.
>> 
>> But I get your point. More below.
>> 
>>>
>>> This oddity at least needs to be documented.
>> 
>> Got it. Yes, we can do that.
>> 
>>>
>>> Better yet would be to not have this restriction.
>>>
>> 
>> I haven't thought of a clever way to avoid zeroing out the rest of
>> the unwritten mapping. With ext4 bigalloc, since the entire cluster is
>> reserved anyway, I was wondering whether we can come up with a clever
>> way of doing atomic writes to the entire user-requested size w/o
>> zeroing out.
>
> The following was the main method that was being attempted:
>
> https://lore.kernel.org/linux-fsdevel/20240429174746.2132161-15-john.g.garry@oracle.com/
>
> There were other ideas in different versions of the forcealign/XFS block
> atomic writes series.
>
>> 
>> Zeroing out the other unwritten extent is also a cost penalty to the
>> user anyways.
>
> Sure, unless we have a special inode flag to say "pre-zero the extent".
>
>> So the user will anyway have to be made aware not to attempt writes
>> in a fashion that can incur such penalties.
>> 
>> As patch-6 mentions, this is base support for bs = ps systems for
>> enabling atomic writes using bigalloc. For now we return -EINVAL when we
>> can't allocate a contiguous mapping of the user-requested length, which
>> means it won't support operations such as an 8k write followed by a
>> 16k write.
>> 
>
> That's my least-preferred option.
>
> I think it would be better to always reject atomic writes that cover
> unwritten extents - but that boat is about to sail...

That's what this patch does. If, for whatever reason, we cannot allocate
a single contiguous region of the requested size for an atomic write, we
always reject the request, don't we? Or maybe I didn't understand your
comment.
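
As a rough illustration of that rule, here is a userspace sketch (not the
kernel code). struct mapping and atomic_write_fits() are hypothetical names
made up for this example; the only thing taken from the discussion is that
a single mapping must cover the whole requested length, otherwise the write
is rejected with -EINVAL instead of being split.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the single mapping an FS returns. */
struct mapping {
	uint64_t offset;	/* file offset where the mapping starts */
	uint64_t length;	/* bytes covered by this one mapping */
};

/*
 * Return 0 if one mapping covers the whole atomic write [pos, pos + len),
 * or -EINVAL if the write would have to be split across mappings.
 */
static int atomic_write_fits(const struct mapping *map, uint64_t pos,
			     uint64_t len)
{
	uint64_t avail;

	assert(len > 0);

	/* The write must start inside the mapping. */
	if (pos < map->offset || pos - map->offset >= map->length)
		return -EINVAL;

	/* Bytes of the mapping that remain at and after pos. */
	avail = map->length - (pos - map->offset);

	return avail >= len ? 0 : -EINVAL;
}
```

With an 8k mapping at offset 0, an 8k write at offset 0 passes, while a
16k write at offset 0 gets -EINVAL because the mapping ends before the
request does.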

If others prefer, we could add such a check (e.g. ext4_dio_atomic_write_checks())
for atomic writes in ext4_dio_write_checks(), similar to how we detect the
overwrite case to decide whether we need a read or a write semaphore.
This could check whether the user-requested region is only partially
backed by an allocated extent and, if so, return -EINVAL from
ext4_dio_write_iter() itself.
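
A minimal userspace sketch of what such an up-front check could look like,
purely to make the idea concrete. struct extent and atomic_precheck() are
hypothetical, the extent list is assumed to be non-overlapping, and the
policy (accept a range that is untouched by any extent or fully inside one
extent, reject everything else) is an assumption about what "partially
allocated" would mean here, not something the patch series defines.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent: a run of allocated bytes in the file. */
struct extent {
	uint64_t start;
	uint64_t len;
};

/*
 * Check the request [pos, pos + len) against a non-overlapping extent list.
 * Returns 0 if the range touches no extent at all or lies fully inside a
 * single extent, and -EINVAL if it is only partially allocated or spans
 * more than one extent.
 */
static int atomic_precheck(const struct extent *ext, size_t n,
			   uint64_t pos, uint64_t len)
{
	uint64_t end = pos + len;
	uint64_t covered = 0;	/* bytes of the request backed by extents */
	size_t touched = 0;	/* number of extents the request overlaps */

	assert(len > 0);

	for (size_t i = 0; i < n; i++) {
		uint64_t ext_end = ext[i].start + ext[i].len;
		uint64_t s = ext[i].start > pos ? ext[i].start : pos;
		uint64_t e = ext_end < end ? ext_end : end;

		if (s < e) {
			covered += e - s;
			touched++;
		}
	}

	if (touched == 0)
		return 0;	/* nothing allocated in the range yet */
	if (touched == 1 && covered == len)
		return 0;	/* fully inside one existing extent */
	return -EINVAL;		/* partially allocated or multiple extents */
}
```

For example, with extents at [0, 8k) and [16k, 24k), an 8k request at
offset 0 passes, while a 16k request at offset 0 is rejected because only
its first half is allocated.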

I think this may be a better option than waiting until ->iomap_begin().
It might also allow all atomic write constraints to be checked in one
place, i.e. in ext4_file_write_iter() itself.

Thoughts?

-ritesh
