From: Joanne Koong
Date: Tue, 10 Jun 2025 13:13:09 -0700
Subject: Re: [PATCH v1 2/8] iomap: add IOMAP_IN_MEM iomap type
To: Christoph Hellwig
Cc: miklos@szeredi.hu, djwong@kernel.org, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	bernd.schubert@fastmail.fm, kernel-team@meta.com
References: <20250606233803.1421259-1-joannelkoong@gmail.com>
	<20250606233803.1421259-3-joannelkoong@gmail.com>

On Tue, Jun 10, 2025 at 6:27 AM Christoph Hellwig wrote:
>
> So I looked into something else: what if we just use ->read_folio
> despite it not seeming ideal initially? After going through with
> it I think it's actually less bad than I thought. This passes
> -g auto on xfs with 4k blocks, and has three regressions with 1k
> blocks: two look to be the seek hole testers upset that we can't
> easily create detectable sub-block holes now, and one is because
> generic/563 thinks the cgroup accounting is off, probably because
> we read more data now or something like that.
>
> ---
> From c5d3cf651c815d3327199c74eac43149fc958098 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig
> Date: Tue, 10 Jun 2025 09:39:57 +0200
> Subject: iomap: use ->read_folio instead of iomap_read_folio_sync
>
> iomap_file_buffered_write has its own private read path for reading
> in folios that are only partially overwritten, which not only adds
> extra code, but also extra problems when e.g. we want reads to go
> through a file system method to support checksums or RAID, or even
> support non-block based file systems.
>
> Switch to using ->read_folio instead, which has a few up- and downsides.
>
> ->read_folio always reads the entire folio and not just the start and
> the tail that is not being overwritten. Historically this was seen as a
> downside as it reads more data than needed. But with modern file systems
> and modern storage devices this is probably a benefit. If the folio is
> stored contiguously on disk, the single read will be more efficient than
> two small reads on almost all current hardware. If the folio is backed by
> two blocks, at least we pipeline the two reads instead of doing two
> synchronous ones. And if the file system fragmented the folio so badly
> that we'll now need to do more than two reads, we're still at least
> pipelining it, although that should basically never happen with modern
> file systems.

If the filesystem wants granular folio reads, it can also just do that
itself by calling an iomap helper (eg what iomap_adjust_read_range()
is doing right now) in its ->read_folio() implementation, correct?
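Something like this is what I have in mind - a rough, untested sketch,
assuming helpers with the semantics of today's static
iomap_adjust_read_range() and iomap_set_range_uptodate() were exported,
and where myfs_read_range() stands in for whatever (possibly expensive)
backend read the filesystem does:

static int myfs_read_folio(struct file *file, struct folio *folio)
{
	struct inode *inode = folio->mapping->host;
	loff_t pos = folio_pos(folio);
	loff_t end = folio_pos(folio) + folio_size(folio);
	size_t poff, plen;
	int ret = 0;

	do {
		/* Trim to the next sub-range that is not yet uptodate. */
		iomap_adjust_read_range(inode, folio, &pos, end - pos,
				&poff, &plen);
		if (plen == 0)
			break;

		/* Only this sub-range hits the backend read. */
		ret = myfs_read_range(inode, folio, poff, plen);
		if (ret)
			break;

		iomap_set_range_uptodate(folio, poff, plen);
	} while ((pos += plen) < end);

	folio_unlock(folio);
	return ret;
}

That way only the blocks not already uptodate in the ifs bitmap ever
get fetched from the backend.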
For fuse at least, we definitely want granular reads, since reads may
be extremely expensive (eg a network fetch) and there's non-trivial
memcpy overhead incurred by fuse needing to copy read buffer data from
userspace back to the kernel.

>
> ->read_folio unlocks the folio on completion. This adds extra atomics
> to the write fast path, but the actual signaling by doing a lock_page
> after ->read_folio is not any slower than the completion wakeup. We
> just have to recheck the mapping in this case to lock out truncates
> and other mapping manipulations.
>
> ->read_folio starts another, nested, iomap iteration, with an extra
> lookup of the extent at the current file position. For in-place update
> file systems this is extra work, although if they use a good data
> structure like the xfs iext btree there is very little overhead in
> another lookup. For file systems that write out of place this actually
> implements the desired semantics, as they don't care about the existing
> data for the write iteration at all, although untangling this and
> removing the srcmap member in the iomap_iter will require additional
> work to turn the block zeroing and unshare helpers upside down.
>
> Signed-off-by: Christoph Hellwig
> ---
>  fs/iomap/buffered-io.c | 116 ++++++++++++++++-------------
>  1 file changed, 45 insertions(+), 71 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 3729391a18f3..52b4040208dd 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -667,30 +667,34 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>  			pos + len - 1);
>  }
>
> -static int iomap_read_folio_sync(loff_t block_start, struct folio *folio,
> -		size_t poff, size_t plen, const struct iomap *iomap)
> +/*
> + * Now that we have a locked folio, check that the iomap we have cached is not
> + * stale before we do anything.
> + *
> + * The extent mapping can change due to concurrent IO in flight, e.g. the
> + * IOMAP_UNWRITTEN state can change and memory reclaim could have reclaimed a
> + * previously partially written page at this index after IO completion before
> + * this write reaches this file offset, and hence we could do the wrong thing
> + * here (zero a page range incorrectly or fail to zero) and corrupt data.
> + */
> +static bool iomap_validate(struct iomap_iter *iter)
>  {
> -	struct bio_vec bvec;
> -	struct bio bio;
> +	const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
>
> -	bio_init(&bio, iomap->bdev, &bvec, 1, REQ_OP_READ);
> -	bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
> -	bio_add_folio_nofail(&bio, folio, plen, poff);
> -	return submit_bio_wait(&bio);
> +	if (folio_ops && folio_ops->iomap_valid &&
> +	    !folio_ops->iomap_valid(iter->inode, &iter->iomap)) {
> +		iter->iomap.flags |= IOMAP_F_STALE;
> +		return false;
> +	}
> +
> +	return true;
>  }
>
> -static int __iomap_write_begin(const struct iomap_iter *iter, size_t len,
> +static int __iomap_write_begin(struct iomap_iter *iter, size_t len,
>  		struct folio *folio)
>  {
> -	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> +	struct inode *inode = iter->inode;
>  	struct iomap_folio_state *ifs;
> -	loff_t pos = iter->pos;
> -	loff_t block_size = i_blocksize(iter->inode);
> -	loff_t block_start = round_down(pos, block_size);
> -	loff_t block_end = round_up(pos + len, block_size);
> -	unsigned int nr_blocks = i_blocks_per_folio(iter->inode, folio);
> -	size_t from = offset_in_folio(folio, pos), to = from + len;
> -	size_t poff, plen;
>
>  	/*
>  	 * If the write or zeroing completely overlaps the current folio, then
> @@ -699,45 +703,29 @@ static int __iomap_write_begin(const struct iomap_iter *iter, size_t len,
>  	 * For the unshare case, we must read in the ondisk contents because we
>  	 * are not changing pagecache contents.
>  	 */
> -	if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) &&
> -	    pos + len >= folio_pos(folio) + folio_size(folio))
> +	if (!(iter->flags & IOMAP_UNSHARE) &&
> +	    iter->pos <= folio_pos(folio) &&
> +	    iter->pos + len >= folio_pos(folio) + folio_size(folio))
>  		return 0;
>
> -	ifs = ifs_alloc(iter->inode, folio, iter->flags);
> -	if ((iter->flags & IOMAP_NOWAIT) && !ifs && nr_blocks > 1)
> +	ifs = ifs_alloc(inode, folio, iter->flags);
> +	if ((iter->flags & IOMAP_NOWAIT) && !ifs &&
> +	    i_blocks_per_folio(inode, folio) > 1)
>  		return -EAGAIN;
>
> -	if (folio_test_uptodate(folio))
> -		return 0;
> -
> -	do {
> -		iomap_adjust_read_range(iter->inode, folio, &block_start,
> -				block_end - block_start, &poff, &plen);
> -		if (plen == 0)
> -			break;
> +	if (!folio_test_uptodate(folio)) {
> +		inode->i_mapping->a_ops->read_folio(NULL, folio);
>
> -		if (!(iter->flags & IOMAP_UNSHARE) &&
> -		    (from <= poff || from >= poff + plen) &&
> -		    (to <= poff || to >= poff + plen))
> -			continue;
> -
> -		if (iomap_block_needs_zeroing(iter, block_start)) {
> -			if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
> -				return -EIO;
> -			folio_zero_segments(folio, poff, from, to, poff + plen);
> -		} else {
> -			int status;
> -
> -			if (iter->flags & IOMAP_NOWAIT)
> -				return -EAGAIN;
> -
> -			status = iomap_read_folio_sync(block_start, folio,
> -					poff, plen, srcmap);
> -			if (status)
> -				return status;
> -		}
> -		iomap_set_range_uptodate(folio, poff, plen);
> -	} while ((block_start += plen) < block_end);
> +		/*
> +		 * ->read_folio unlocks the folio. Relock and revalidate the
> +		 * folio.
> +		 */
> +		folio_lock(folio);
> +		if (unlikely(folio->mapping != inode->i_mapping))
> +			return 1;
> +		if (unlikely(!iomap_validate(iter)))
> +			return 1;

Does this now basically mean that every caller that uses iomap for
writes will have to implement ->iomap_valid and bump the sequence
counter anytime there's a write or truncate, in case the folio changes
during the lock drop? Or were we already supposed to be doing this?
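(For context, this is the pattern I'm referring to, modeled on what xfs
does with validity_cookie; the myfs_* names and the i_extent_seq counter
are hypothetical:

/*
 * Bump a per-inode sequence counter on every extent map change, stamp it
 * into the mapping at lookup time, and recheck it under the folio lock.
 */
static bool myfs_iomap_valid(struct inode *inode, const struct iomap *iomap)
{
	/* Stale if the extent map changed since ->iomap_begin sampled it. */
	return iomap->validity_cookie ==
	       READ_ONCE(MYFS_I(inode)->i_extent_seq);
}

static const struct iomap_folio_ops myfs_iomap_folio_ops = {
	.iomap_valid	= myfs_iomap_valid,
};

with ->iomap_begin doing

	iomap->validity_cookie = READ_ONCE(MYFS_I(inode)->i_extent_seq);
	iomap->folio_ops = &myfs_iomap_folio_ops;

and every extent map modification (unwritten conversion, truncate, hole
punch, ...) incrementing i_extent_seq.)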
> +	}
>
>  	return 0;
>  }
> @@ -803,7 +791,6 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
>  static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>  		size_t *poffset, u64 *plen)
>  {
> -	const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
>  	loff_t pos = iter->pos;
>  	u64 len = min_t(u64, SIZE_MAX, iomap_length(iter));
> @@ -818,28 +805,14 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>  	if (fatal_signal_pending(current))
>  		return -EINTR;
>
> +lookup_again:
>  	folio = __iomap_get_folio(iter, len);
>  	if (IS_ERR(folio))
>  		return PTR_ERR(folio);
>
> -	/*
> -	 * Now we have a locked folio, before we do anything with it we need to
> -	 * check that the iomap we have cached is not stale. The inode extent
> -	 * mapping can change due to concurrent IO in flight (e.g.
> -	 * IOMAP_UNWRITTEN state can change and memory reclaim could have
> -	 * reclaimed a previously partially written page at this index after IO
> -	 * completion before this write reaches this file offset) and hence we
> -	 * could do the wrong thing here (zero a page range incorrectly or fail
> -	 * to zero) and corrupt data.
> -	 */
> -	if (folio_ops && folio_ops->iomap_valid) {
> -		bool iomap_valid = folio_ops->iomap_valid(iter->inode,
> -							&iter->iomap);
> -		if (!iomap_valid) {
> -			iter->iomap.flags |= IOMAP_F_STALE;
> -			status = 0;
> -			goto out_unlock;
> -		}
> +	if (unlikely(!iomap_validate(iter))) {
> +		status = 0;
> +		goto out_unlock;
>  	}
>
>  	pos = iomap_trim_folio_range(iter, folio, poffset, &len);
> @@ -860,7 +833,8 @@ static int iomap_write_begin(struct iomap_iter *iter, struct folio **foliop,
>
>  out_unlock:
>  	__iomap_put_folio(iter, 0, folio);
> -
> +	if (status == 1)
> +		goto lookup_again;
>  	return status;
>  }
>
> --
> 2.47.2
>