public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed
From: Zhang Yi <yi.zhang@huaweicloud.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, ritesh.list@gmail.com,
	yi.zhang@huawei.com, chengzhihao1@huawei.com, yukuai3@huawei.com
Subject: Re: [PATCH v2 06/10] ext4: update delalloc data reserve spcae in ext4_es_insert_extent()
Date: Sat, 10 Aug 2024 12:01:28 +0800	[thread overview]
Message-ID: <5dcb2dc2-7622-05bd-d330-610e9c009fe2@huaweicloud.com> (raw)
In-Reply-To: <20240809162013.tieom26umwqcsfe4@quack3>

On 2024/8/10 0:20, Jan Kara wrote:
> On Fri 09-08-24 11:35:49, Zhang Yi wrote:
>> On 2024/8/9 2:36, Jan Kara wrote:
>>> On Thu 08-08-24 19:18:30, Zhang Yi wrote:
>>>> On 2024/8/8 1:41, Jan Kara wrote:
>>>>> On Fri 02-08-24 19:51:16, Zhang Yi wrote:
>>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>>
>>>>>> Now that we update data reserved space for delalloc after allocating
>>>>>> new blocks in ext4_{ind|ext}_map_blocks(), and if bigalloc feature is
>>>>>> enabled, we also need to query the extents_status tree to calculate the
>>>>>> exact reserved clusters. This is complicated now and it appears that
>>>>>> it's better to do this job in ext4_es_insert_extent(), because
>>>>>> __es_remove_extent() have already count delalloc blocks when removing
>>>>>> delalloc extents and __revise_pending() return new adding pending count,
>>>>>> we could update the reserved blocks easily in ext4_es_insert_extent().
>>>>>>
>>>>>> Thers is one special case needs to concern is the quota claiming, when
>>>>>> bigalloc is enabled, if the delayed cluster allocation has been raced
>>>>>> by another no-delayed allocation(e.g. from fallocate) which doesn't
>>>>>> cover the delayed blocks:
>>>>>>
>>>>>>   |<       one cluster       >|
>>>>>>   hhhhhhhhhhhhhhhhhhhdddddddddd
>>>>>>   ^            ^
>>>>>>   |<          >| < fallocate this range, don't claim quota again
>>>>>>
>>>>>> We can't claim quota as usual because the fallocate has already claimed
>>>>>> it in ext4_mb_new_blocks(), we could notice this case through the
>>>>>> removed delalloc blocks count.
>>>>>>
>>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>> ...
>>>>>> @@ -926,9 +928,27 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>>>>>>  			__free_pending(pr);
>>>>>>  			pr = NULL;
>>>>>>  		}
>>>>>> +		pending = err3;
>>>>>>  	}
>>>>>>  error:
>>>>>>  	write_unlock(&EXT4_I(inode)->i_es_lock);
>>>>>> +	/*
>>>>>> +	 * Reduce the reserved cluster count to reflect successful deferred
>>>>>> +	 * allocation of delayed allocated clusters or direct allocation of
>>>>>> +	 * clusters discovered to be delayed allocated.  Once allocated, a
>>>>>> +	 * cluster is not included in the reserved count.
>>>>>> +	 *
>>>>>> +	 * When bigalloc is enabled, allocating non-delayed allocated blocks
>>>>>> +	 * which belong to delayed allocated clusters (from fallocate, filemap,
>>>>>> +	 * DIO, or clusters allocated when delalloc has been disabled by
>>>>>> +	 * ext4_nonda_switch()). Quota has been claimed by ext4_mb_new_blocks(),
>>>>>> +	 * so release the quota reservations made for any previously delayed
>>>>>> +	 * allocated clusters.
>>>>>> +	 */
>>>>>> +	resv_used = rinfo.delonly_cluster + pending;
>>>>>> +	if (resv_used)
>>>>>> +		ext4_da_update_reserve_space(inode, resv_used,
>>>>>> +					     rinfo.delonly_block);
>>>>>
>>>>> I'm not sure I understand here. We are inserting extent into extent status
>>>>> tree. We are replacing resv_used clusters worth of space with delayed
>>>>> allocation reservation with normally allocated clusters so we need to
>>>>> release the reservation (mballoc already reduced freeclusters counter).
>>>>> That I understand. In normal case we should also claim quota because we are
>>>>> converting from reserved into allocated state. Now if we allocated blocks
>>>>> under this range (e.g. from fallocate()) without
>>>>> EXT4_GET_BLOCKS_DELALLOC_RESERVE, we need to release quota reservation here
>>>>> instead of claiming it. But I fail to see how rinfo.delonly_block > 0 is
>>>>> related to whether EXT4_GET_BLOCKS_DELALLOC_RESERVE was set when allocating
>>>>> blocks for this extent or not.
>>>>
>>>> Oh, this is really complicated due to the bigalloc feature, please let me
>>>> explain it more clearly by listing all related situations.
>>>>
>>>> There are 2 types of paths of allocating delayed/reserved cluster:
>>>> 1. Normal case, normally allocate delayed clusters from the write back path.
>>>> 2. Special case, allocate blocks under this delayed range, e.g. from
>>>>    fallocate().
>>>>
>>>> There are 4 situations below:
>>>>
>>>> A. bigalloc is disabled. This case is simple, after path 2, we don't need
>>>>    to distinguish path 1 and 2, when calling ext4_es_insert_extent(), we
>>>>    set EXT4_GET_BLOCKS_DELALLOC_RESERVE after EXT4_MAP_DELAYED bit is
>>>>    detected. If the flag is set, we must be replacing a delayed extent and
>>>>    rinfo.delonly_block must be > 0. So rinfo.delonly_block > 0 is equal
>>>>    to set EXT4_GET_BLOCKS_DELALLOC_RESERVE.
>>>
>>> Right. So fallocate() will call ext4_map_blocks() and
>>> ext4_es_lookup_extent() will find delayed extent and set EXT4_MAP_DELAYED
>>> which you (due to patch 2 of this series) transform into
>>> EXT4_GET_BLOCKS_DELALLOC_RESERVE. We used to update the delalloc
>>> accounting through in ext4_ext_map_blocks() but this patch moved the update
>>> to ext4_es_insert_extent(). But there is one cornercase even here AFAICT:
>>>
>>> Suppose fallocate is called for range 0..16k, we have delalloc extent at
>>> 8k..16k. In this case ext4_map_blocks() at block 0 will not find the
>>> delalloc extent but ext4_ext_map_blocks() will allocate 16k from mballoc
>>> without using delalloc reservation but then ext4_es_insert_extent() will
>>> still have rinfo.delonly_block > 0 so we claim the quota reservation
>>> instead of releasing it?
>>>
>>
>> After commit 6430dea07e85 ("ext4: correct the hole length returned by
>> ext4_map_blocks()"), the fallocate range 0-16K would be divided into two
>> rounds. When we first calling ext4_map_blocks() with 0-16K, the map range
>> will be corrected to 0-8k by ext4_ext_determine_insert_hole() and the
>> allocating range should not cover any delayed range.
> 
> Eww, subtle, subtle, subtle... And isn't this also racy? We drop i_data_sem
> in ext4_map_blocks() after we do the initial lookup. So there can be some
> changes to both the extent tree and extent status tree before we grab
> i_data_sem again for the allocation. We hold inode_lock so there can be
> only writeback and page faults racing with us but e.g. ext4_page_mkwrite()
> -> block_page_mkwrite -> ext4_da_get_block_prep() -> ext4_da_map_blocks()
> can add delayed extent into extent status tree in that window causing
> breakage, can't it?

Oh! you are totally right, I missed that current ext4_fallocate() doesn't
hold invalidate_lock for the normal fallocate path, hence there's nothing
could prevent this race now, thanks a lot for pointing this out.

> 
>> Then
>> ext4_alloc_file_blocks() will call ext4_map_blocks() again to allocate
>> 8K-16K in the second round, in this round, we are allocating a real
>> delayed range. Please below graph for details,
>>
>> ext4_alloc_file_blocks() //0-16K
>>  ext4_map_blocks()  //0-16K
>>   ext4_es_lookup_extent() //find nothing
>>    ext4_ext_map_blocks(0)
>>     ext4_ext_determine_insert_hole() //change map range to 0-8K
>>    ext4_ext_map_blocks(EXT4_GET_BLOCKS_CREATE) //allocate blocks under hole
>>  ext4_map_blocks()  //8-16K
>>   ext4_es_lookup_extent() //find delayed extent
>>   ext4_ext_map_blocks(EXT4_GET_BLOCKS_CREATE)
>>     //allocate blocks under a whole delayed range,
>>     //use rinfo.delonly_block > 0 is okay
>>
>> Hence the allocating range can't mixed with delayed and non-delayed extent
>> at a time, and the rinfo.delonly_block > 0 should work.
> 
> Besides the race above I agree. So either we need to trim mapping extent in
> ext4_map_blocks() after re-acquiring i_data_sem

Yeah, if we keep on using this solution, it looks like we have to add similar
logic we've done in ext4_da_map_blocks() a few months ago into the begin of
the new helper ext4_map_create_blocks(). I guess it may expensive and not
worth now.

	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
		map->m_len = min_t(unsigned int, map->m_len,
				   es.es_len - (map->m_lblk - es.es_lblk));
	} else
		retval = ext4_map_query_blocks(NULL, inode, map);
		...
	}

> or we need to deal with
> unwritten extents that are partially delalloc. I'm more and more leaning
> towards just passing the information whether delalloc was used or not to
> extent status tree insertion. Because that can deal with partial extents
> just fine...
> 

Yeah, I agree with you, passing the information to ext4_es_init_extent()
is simple and looks fine. I will change to use this solution.

> Thanks for your patience with me :).
> 

Anytime! I appreciate your review and suggestions as well. :)

Thanks,
Yi.


  reply	other threads:[~2024-08-10  4:01 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-02 11:51 [PATCH v2 00/10] ext4: simplify the counting and management of delalloc reserved blocks Zhang Yi
2024-08-02 11:51 ` [PATCH v2 01/10] ext4: factor out ext4_map_create_blocks() to allocate new blocks Zhang Yi
2024-08-06 14:38   ` Jan Kara
2024-08-02 11:51 ` [PATCH v2 02/10] ext4: optimize the EXT4_GET_BLOCKS_DELALLOC_RESERVE flag set Zhang Yi
2024-08-06 14:48   ` Jan Kara
2024-08-02 11:51 ` [PATCH v2 03/10] ext4: don't set EXTENT_STATUS_DELAYED on allocated blocks Zhang Yi
2024-08-06 15:23   ` Jan Kara
2024-08-07 12:18     ` Zhang Yi
2024-08-07 13:37       ` Jan Kara
2024-08-02 11:51 ` [PATCH v2 04/10] ext4: let __revise_pending() return newly inserted pendings Zhang Yi
2024-08-02 11:51 ` [PATCH v2 05/10] ext4: count removed reserved blocks for delalloc only extent entry Zhang Yi
2024-08-02 11:51 ` [PATCH v2 06/10] ext4: update delalloc data reserve spcae in ext4_es_insert_extent() Zhang Yi
2024-08-07 17:41   ` Jan Kara
2024-08-08 11:18     ` Zhang Yi
2024-08-08 18:36       ` Jan Kara
2024-08-09  3:35         ` Zhang Yi
2024-08-09 16:20           ` Jan Kara
2024-08-10  4:01             ` Zhang Yi [this message]
2024-08-02 11:51 ` [PATCH v2 07/10] ext4: drop ext4_es_delayed_clu() Zhang Yi
2024-08-02 11:51 ` [PATCH v2 08/10] ext4: use ext4_map_query_blocks() in ext4_map_blocks() Zhang Yi
2024-08-07 17:43   ` Jan Kara
2024-08-02 11:51 ` [PATCH v2 09/10] ext4: drop ext4_es_is_delonly() Zhang Yi
2024-08-07 17:48   ` Jan Kara
2024-08-08 11:21     ` Zhang Yi
2024-08-02 11:51 ` [PATCH v2 10/10] ext4: drop all delonly descriptions Zhang Yi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5dcb2dc2-7622-05bd-d330-610e9c009fe2@huaweicloud.com \
    --to=yi.zhang@huaweicloud.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=chengzhihao1@huawei.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=yi.zhang@huawei.com \
    --cc=yukuai3@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox