From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pj1-f54.google.com (mail-pj1-f54.google.com [209.85.216.54]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4309C3FD131 for ; Wed, 17 Jun 2026 13:00:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.54 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781701251; cv=none; b=u3VDgvFMpEAghcfDvDJJRaHWsrA5O0nl8WDPj9qG47d53dPHYfcgjPm1HgU7BYhkinLS4bqZiABh6DVT7YZH+zVnj7EEifE0Q98nHGTCBlzwlGdil2BUnxwCVQc1Bk1xliRrD3kE7LgERZZDgJyDnbFAYq6c215FL+jY9t7MVA8= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781701251; c=relaxed/simple; bh=UEOqeZy4nuSiMcJbDnhYvbNKPHX4GO6itY+DAaN4aYY=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=ov9EWPNJs31CemA7pq+5nU54kwfNgNZHPPFMwLyf0yH886p/Q1NG6ysPUaZSLRDH7Vp9eGIppe8YExob7GnVa6H8ZJR7ZAIX/qPZxswr+J9RsLZRxF0g49/8AzFNsUirZTPNAG1nw/lu3CZ29ANOPdCLSXZNJeMkVRc1IBmLNHA= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=HtjdaZGI; arc=none smtp.client-ip=209.85.216.54 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="HtjdaZGI" Received: by mail-pj1-f54.google.com with SMTP id 98e67ed59e1d1-37c5b9d42efso457972a91.1 for ; Wed, 17 Jun 2026 06:00:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1781701250; x=1782306050; darn=vger.kernel.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=1k2m3aSHTjqA6bdTXWu0OZg7gZ+wtGbmj+2TwUuz208=; b=HtjdaZGIt71DANrlStykmIbZJ6NYy7wf4Bmf/+FsHNupNMZm/GLQ/lzmVbVJ1788VY fadHx6vLegwJ9JJkLUemxjcwqoFI11carrj+pvsurlHfpFP7TLwy0bC2t6VQji3nYvGT qz+Gv+joHGjaLs9jLZcXD6zThrUhh33ZQ5FngmR/uFaaR1gSknEPDfr3CpYIJHpcytXP 8bLwv+MDqC3YR2y791vbtXPLZaGqZfoCyVBZDrGFUnowqkBmlnI+YqzjSSfYhVovqbkd dpuXAP6MZWEyq7ILgKKNLpt5MmEIIfVlDg2/65djQfsA6LtRboNhNSXbKO9a7NLWUeex c0zw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1781701250; x=1782306050; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-gg:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=1k2m3aSHTjqA6bdTXWu0OZg7gZ+wtGbmj+2TwUuz208=; b=mNzAzzmqX7eHtuzCXTZpjfeAHgwbJC0hq+jlLR891EdfPL7+Yee0TPwyXOk0Dm5I0e 3Llf8eAobC1aEz7etE/guhsGyqe48KpDXcWliENoVSktjpSrsEOxLVm8j9jQTUXQ3WTW YePAo7OffgMjZFO38MlphDH/jecfhKuJjn9PEJOGeFGagJj9lEVEnfR0mA2TDdBTM5Xf xYXft6otSXekyE83VaAMGJ5NwaNkkXg2CQP8TXM9de+fnVX2MrqgSVaY1Pg4DKowaLpt MHDI/8/snwy10Pszw2I1jrzKI2HbRjbwesqwBUL7XunNKICirCwZrbqPjuihRihaD5MN Ctyw== X-Gm-Message-State: AOJu0YxpgAHiGajpZbm4qwsQ/xQQhpSn4S/hR2Xb+gvlrkjEJ1qAMCmY ulC514e72Ow9vq82OpYbzsJkJezIPM8S6NymbeL8DRoTuYLjoJu/AMld X-Gm-Gg: AfdE7ckYMqzCXls2OipRnRGGEBIbjhQ2xOFS9vgh4HqWb5QEJFKCIquZd0lveRSqVIf 01+fj9EJY4U5AX2gDjK/BpkI6FWEGCtm/OguusyrAHedHiJymvBvXMnNI08yYXUkrU4mzxUsYP4 /LxhgaVntVCltjJCYAbnU/WCOb7Gn4dSAaDmVTi+qQt9tFYepz4v/mtgfKzPZxgojCvvba+nCxT DB02gjFC3ZXTZLCK39dU/Nl+HsbR7hcN88irprP+BfjX/IPUsD8F/WmD6UDbSfPFy4kM3cNP+n9 hzjF/i5B47zuBxJ4dQ1jmRtR6qb7h9MvKYUdXp42DxUrsAJpIUX7I7/CHPNmTbmNOcqVxtFwAQX GHU+uMyDV6OIqZbfCEr8lGQ4w+KSUkEcK3vV4BOuLRgAbXc72YXqNYuKhNxXwTixZdV4OykpDYo Cpx5/Twmkrn+2L/TjmePSgurDnL5DnXxC689ms/AFuWkiG3mDWm5h4Qf2HmqKpmrE= X-Received: by 2002:a17:90b:2704:b0:37c:543f:b7c with SMTP id 98e67ed59e1d1-37ca718e760mr2033578a91.10.1781701248124; Wed, 17 Jun 2026 06:00:48 -0700 (PDT) Received: from ?IPV6:240e:390:a92:e6d1:ed78:8bd4:5db9:c767? ([240e:390:a92:e6d1:ed78:8bd4:5db9:c767]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-37c70aa47f0sm3238095a91.13.2026.06.17.06.00.38 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 17 Jun 2026 06:00:47 -0700 (PDT) Message-ID: <070f205c-11d6-4f7f-8b15-a928e37bb37e@gmail.com> Date: Wed, 17 Jun 2026 21:00:35 +0800 Precedence: bulk X-Mailing-List: linux-ext4@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap To: Jan Kara , Zhang Yi Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca, libaokun@linux.alibaba.com, ojaswin@linux.ibm.com, ritesh.list@gmail.com, djwong@kernel.org, hch@infradead.org, yi.zhang@huawei.com, yangerkun@huawei.com, yukuai@fnnas.com, Brian Foster References: <20260511072344.191271-1-yi.zhang@huaweicloud.com> <20260511072344.191271-15-yi.zhang@huaweicloud.com> <16cccb83-cdad-4113-8182-e8ea9e3049a2@huaweicloud.com> Content-Language: en-US From: Zhang Yi In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit On 6/17/2026 6:50 PM, Jan Kara wrote: > On Wed 17-06-26 16:14:40, Zhang Yi wrote: >> On 6/16/2026 8:28 PM, Jan Kara wrote: >>> On Mon 11-05-26 15:23:34, Zhang Yi wrote: >>>> From: Zhang Yi >>>> >>>> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with >>>> ext4_iomap_block_zero_range() to implement block zeroing via the iomap >>>> infrastructure for ext4. >>>> >>>> ext4_iomap_block_zero_range() calls iomap_zero_range() with >>>> ext4_iomap_zero_begin() as the callback. The callback locates and zeros >>>> out either a mapped partial block or a dirty, unwritten partial block. >>>> >>>> Important constraints: >>>> >>>> Zeroing out under an active journal handle can cause deadlock, because >>>> the order of acquiring the folio lock and starting a handle is >>>> inconsistent with the iomap writeback path. >>>> >>>> Therefore, ext4_iomap_block_zero_range(): >>>> - Must NOT be called under an active handle. >>>> - Cannot rely on data=ordered mode to ensure zeroed data persistence >>>> before updating i_disksize (for the cases of post-EOF append write, >>>> post-EOF fallocate, and truncate up). In subsequent patches, we will >>>> address this by synchronizing commit I/O but doesn't waiting for >>>> completion, and updating i_disksize to i_size only after the zeroed >>>> data has been written back. >>>> >>>> Signed-off-by: Zhang Yi >>>> --- >>>> fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++ >>>> 1 file changed, 92 insertions(+) >>>> >>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >>>> index c6fe42d012fc..e0dae2501292 100644 >>>> --- a/fs/ext4/inode.c >>>> +++ b/fs/ext4/inode.c >>>> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset, >>>> return 0; >>>> } >>>> >>>> +static int ext4_iomap_zero_begin(struct inode *inode, >>>> + loff_t offset, loff_t length, unsigned int flags, >>>> + struct iomap *iomap, struct iomap *srcmap) >>>> +{ >>>> + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); >>> >>> This looks like a layering violation to me. I don't think you can safely >>> assume the iomap you're passed is a part of iomap_iter... >>> >>>> + struct ext4_map_blocks map; >>>> + u8 blkbits = inode->i_blkbits; >>>> + unsigned int iomap_flags = 0; >>>> + int ret; >>>> + >>>> + ret = ext4_emergency_state(inode->i_sb); >>>> + if (unlikely(ret)) >>>> + return ret; >>>> + >>>> + if (WARN_ON_ONCE(!(flags & IOMAP_ZERO))) >>>> + return -EINVAL; >>>> + >>>> + ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map); >>>> + if (ret < 0) >>>> + return ret; >>>> + >>>> + /* >>>> + * Look up dirty folios for unwritten mappings within EOF. Providing >>>> + * this bypasses the flush iomap uses to trigger extent conversion >>>> + * when unwritten mappings have dirty pagecache in need of zeroing. >>>> + */ >>>> + if (map.m_flags & EXT4_MAP_UNWRITTEN) { >>>> + loff_t start = ((loff_t)map.m_lblk) << blkbits; >>>> + loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits; >>>> + >>>> + iomap_fill_dirty_folios(iter, &start, end, &iomap_flags); >>>> + if ((start >> blkbits) < map.m_lblk + map.m_len) >>>> + map.m_len = (start >> blkbits) - map.m_lblk; >>>> + } >>> >>> ... and you need access to iter only for this which seems to be really a >>> hack that's trying to outsmart the iomap code. I have to admit I don't >>> fully understand what you are trying to achieve here. Are you trying to >>> avoid flushing of the range that will be zeroed out? >> >> This logic is copied from the XFS and iomap infrastructure. Its primary >> purpose is to optimize the zeroing operations on dirty written extents. >> It was introduced by Brian in [1]. > > Ah, I see. I still find it hacky but apparently it is an established hack > in iomap :). Fair. > >> The history as I understand it: originally, the iomap infrastructure >> could not zero dirty unwritten extents during zero range processing, >> which led to stale data exposure. XFS had to flush dirty ranges itself >> before zeroing — a workaround that was not generic. >> >> In c5c810b94cf ("iomap: fix handling of dirty folios over unwritten >> extents"), Brian added an unconditional flush in the iomap >> infrastructure, ensuring that by the time zeroing runs the extent has >> already been converted to written so the zero can proceed correctly. >> However, this flush was too heavy and introduced noticeable performance >> overhead. >> >> This was then optimized in 7d9b474ee4cc3 ("iomap: make zero range flush >> conditional on unwritten mappings"), which restricts flushing to only >> dirty pagecache over unwritten or hole mappings. >> >> Brian later proposed a different approach: rather than relying on flush >> to convert the extent type, find dirty folios ahead of the zero range >> and zero the dirty unwritten extents directly. In [1] he added this >> lookup logic. The filesystem now supplies a folio batch (a collection of >> dirty folios) via the iomap begin callback, and zero range iterates over >> these dirty folios to perform zeroing. Clean regions not covered by the >> batch are simply skipped. This entirely eliminates the need to flush. >> >> [1] https://lore.kernel.org/linux-xfs/20251003134642.604736-1-bfoster@redhat.com/ > > Thanks for the summary! So I was confused because somehow I thought this is > about fallocate(FALLOC_FL_ZERO_RANGE) and so I was wondering why we just > cannot evict the page cache and be done with that. Only after reading > everything again I've realized this is about zeroing partial blocks on hole > punch etc. And we may need to really handle multiple folios because XFS > also uses this mechanism to implement FALLOC_FL_ZERO_RANGE for zoned > storage. Ugh. OK, anyway for now this looks like your patch is following > how things are expected to be done so feel free to add: > > Reviewed-by: Jan Kara > >>>> + /* >>>> + * TODO: The iomap does not distinguish between different types of >>>> + * zeroing and always sets zero_written if a zeroing operation is >>>> + * performed, which may result in unnecessary order operations. >>>> + */ >>> >>> Is this still true after your fix to did_zero handling? >> >> Yeah. Currently, iomap_zero_range() can only report whether a zeroing >> operation has occurred through did_zero parameter, but it cannot >> distinguish whether the zeroed range is a written extent that already >> exists on disk. That is, even if the zeroing is performed on a delalloc >> extent, did_zero will still return true. > > So maybe write in the comment explicitely, that this may result in > unnecessary flushing of folios if zeroing happened in > delayed-not-yet-allocated blocks? > > Honza Sure, I'll include it in the next iteration. :) Thanks, Yi.