From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [45.249.212.188]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5905180B; Mon, 25 Nov 2024 01:47:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.188 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1732499265; cv=none; b=j0HXkQwDpJX+JE/yBerVwkW5co0V1gLurcIOrmtLqh5z0l3OM9RnPivBoXg0niyQPNYD0uurK8/XJOloGAOLP8OutM1/YDwSDWEE74+wHeKhzu/rf6DUO17Xuyg+Ilsn27OzWVXedO/jwRqi3vEyLj2x3AqhhGLWhrczVD78We4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1732499265; c=relaxed/simple; bh=F2GUL/+T8zED1xEtnrSOwlk6AQj2HTjq3G7hOoUPcyQ=; h=Date:From:To:CC:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=hClW1h7PO5rxuQlggLLkzgZ+Lx9GjHW5iSOjUuDM/fqlrhg0YtqxeHxvX+kCUIw21JiESXRw0sKzvzy9j1ZMW/hqfwysYewEFLSu5OIOIuxB2VE8SVBo+rKoUziKlLypkK7duXo5m1WG6tzsHkK4PpFfvzEsocAc8Ge9hsK9atA= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.188 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.163.48]) by szxga02-in.huawei.com (SkyGuard) with ESMTP id 4XxT7J4wbYzqScw; Mon, 25 Nov 2024 09:45:48 +0800 (CST) Received: from dggpemf500017.china.huawei.com (unknown [7.185.36.126]) by mail.maildlp.com (Postfix) with ESMTPS id 22F4318009B; Mon, 25 Nov 2024 09:47:39 +0800 (CST) Received: from localhost (10.175.112.188) by dggpemf500017.china.huawei.com (7.185.36.126) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 25 Nov 2024 09:47:38 +0800 Date: Mon, 25 Nov 2024 09:45:46 +0800 From: Long Li To: Brian Foster CC: , , , , , , , Subject: Re: [PATCH v3 1/2] iomap: fix zero padding data issue in concurrent append writes Message-ID: References: <20241121063430.3304895-1-leo.lilong@huawei.com> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Disposition: inline In-Reply-To: X-ClientProxiedBy: dggems704-chm.china.huawei.com (10.3.19.181) To dggpemf500017.china.huawei.com (7.185.36.126) On Fri, Nov 22, 2024 at 09:19:53AM -0500, Brian Foster wrote: > On Thu, Nov 21, 2024 at 02:34:29PM +0800, Long Li wrote: > > During concurrent append writes to XFS filesystem, zero padding data > > may appear in the file after power failure. This happens due to imprecise > > disk size updates when handling write completion. > > > > Consider this scenario with concurrent append writes same file: > > > > Thread 1: Thread 2: > > ------------ ----------- > > write [A, A+B] > > update inode size to A+B > > submit I/O [A, A+BS] > > write [A+B, A+B+C] > > update inode size to A+B+C > > > > > > > > After reboot: > > 1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C] > > > > |< Block Size (BS) >| > > |DDDDDDDDDDDDDDDD0000000000000000| > > ^ ^ ^ > > A A+B A+B+C > > (EOF) > > > > 2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS] > > > > |< Block Size (BS) >|< Block Size (BS) >| > > |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000| > > ^ ^ ^ ^ > > A A+B A+BS A+B+C > > (EOF) > > > > D = Valid Data > > 0 = Zero Padding > > > > The issue stems from disk size being set to min(io_offset + io_size, > > inode->i_size) at I/O completion. Since io_offset+io_size is block > > size granularity, it may exceed the actual valid file data size. In > > the case of concurrent append writes, inode->i_size may be larger > > than the actual range of valid file data written to disk, leading to > > inaccurate disk size updates. > > > > This patch changes the meaning of io_size to represent the size of > > valid data in ioend, while the extent size of ioend can be obtained > > by rounding up based on block size. It ensures more precise disk > > size updates and avoids the zero padding issue. Another benefit is > > that it makes the xfs_ioend_is_append() check more accurate, which > > can reduce unnecessary end bio callbacks of xfs_end_bio() in certain > > scenarios, such as repeated writes at the file tail without extending > > the file size. > > > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") > > Signed-off-by: Long Li > > --- > > v2->v3: > > 1. Modify commit message and add the description of A+B+C > A+BS > > 2. Rename iomap_ioend_extent_size() to iomap_ioend_size_aligned() > > 3. Move iomap_ioend_size_aligned to buffered-io.c and avoid exposed > > to new users. > > 4. Add comment for rounding up io_size to explain when/why use it > > > > Thanks for the tweaks.. > > > fs/iomap/buffered-io.c | 43 ++++++++++++++++++++++++++++++++++++------ > > include/linux/iomap.h | 2 +- > > 2 files changed, 38 insertions(+), 7 deletions(-) > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > index d42f01e0fc1c..3f59dfb4d58d 100644 > > --- a/fs/iomap/buffered-io.c > > +++ b/fs/iomap/buffered-io.c > > @@ -1593,12 +1593,35 @@ iomap_finish_ioends(struct iomap_ioend *ioend, int error) > > } > > EXPORT_SYMBOL_GPL(iomap_finish_ioends); > > > > +/* > > + * Calculates the extent size of an ioend by rounding up to block size. When > > + * the last block in the ioend's extent contains the file EOF and the EOF is > > + * not block-aligned, the io_size will not be block-aligned. > > + * > > + * This function is specifically used for ioend grow/merge management: > > + * 1. In concurrent writes, when one write's io_size is truncated due to > > + * non-block-aligned file size while another write extends the file size, > > + * if these two writes are physically and logically contiguous at block > > + * boundaries, rounding up io_size to block boundaries helps grow the > > + * first write's ioend and share this ioend between both writes. > > + * 2. During IO completion, we try to merge physically and logically > > + * contiguous ioends before completion to minimize the number of > > + * transactions. Rounding up io_size to block boundaries helps merge > > + * ioends whose io_size is not block-aligned. > > + */ > > I might suggest to simplify this and maybe split off a comment to where > the ioend size is trimmed as well. For example: > > "Calculate the physical size of an ioend by rounding up to block > granularity. io_size might be unaligned if the last block crossed an > unaligned i_size boundary at creation time." Ok, I want to move the explain of iomap_ioend_size_aligned() to commit message. > > > +static inline size_t iomap_ioend_size_aligned(struct iomap_ioend *ioend) > > +{ > > + return round_up(ioend->io_size, i_blocksize(ioend->io_inode)); > > +} > > + > > /* > > * We can merge two adjacent ioends if they have the same set of work to do. > > */ > > static bool > > iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next) > > { > > + size_t size = iomap_ioend_size_aligned(ioend); > > + > > if (ioend->io_bio.bi_status != next->io_bio.bi_status) > > return false; > > if (next->io_flags & IOMAP_F_BOUNDARY) > ... > > @@ -1784,12 +1810,17 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, > > wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos); > > } > > > > - if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff)) > > + ioend = wpc->ioend; > > + if (!bio_add_folio(&ioend->io_bio, folio, len, poff)) > > goto new_ioend; > > > > if (ifs) > > atomic_add(len, &ifs->write_bytes_pending); > > - wpc->ioend->io_size += len; > > + > > And here.. > > "If the ioend spans i_size, trim io_size to the former to provide the fs > with more accurate size information. This is useful for completion time > on-disk size updates." > it's fine to me. > > + ioend->io_size = iomap_ioend_size_aligned(ioend) + len; > > + if (ioend->io_offset + ioend->io_size > isize) > > + ioend->io_size = isize - ioend->io_offset; > > + > > wbc_account_cgroup_owner(wbc, folio, len); > > return 0; > > } > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > > index 5675af6b740c..956a0f7c2a8d 100644 > > --- a/include/linux/iomap.h > > +++ b/include/linux/iomap.h > > @@ -335,7 +335,7 @@ struct iomap_ioend { > > u16 io_type; > > u16 io_flags; /* IOMAP_F_* */ > > struct inode *io_inode; /* file being written to */ > > - size_t io_size; /* size of the extent */ > > + size_t io_size; /* size of valid data */ > > "valid data" is kind of unclear to me. Maybe just "size of data within > eof..?" It's fine to me. > > But those are just suggestions. Feel free to take, leave or reword any > of it, and this LGTM regardless: > > Reviewed-by: Brian Foster Thanks for your review, I'll update it as suggested in next version. Long Li