From: Joanne Koong
Date: Thu, 4 Sep 2025 13:58:09 -0700
Subject: Re: [PATCH v1 12/16] iomap: add iomap_read_ops for read and readahead
To: "Darrick J. Wong"
Cc: brauner@kernel.org, miklos@szeredi.hu, hch@infradead.org,
	linux-fsdevel@vger.kernel.org, kernel-team@meta.com,
	linux-xfs@vger.kernel.org, linux-doc@vger.kernel.org
In-Reply-To: <20250903210856.GT1587915@frogsfrogsfrogs>
References: <20250829235627.4053234-1-joannelkoong@gmail.com>
	<20250829235627.4053234-13-joannelkoong@gmail.com>
	<20250903210856.GT1587915@frogsfrogsfrogs>

On Wed, Sep 3, 2025 at 2:08 PM Darrick J. Wong wrote:
>
> On Fri, Aug 29, 2025 at 04:56:23PM -0700, Joanne Koong wrote:
> > Add a "struct iomap_read_ops" that contains a read_folio_range()
> > callback that callers can provide as a custom handler for reading in a
> > folio range, if the caller does not wish to issue bio read requests
> > (which otherwise is the default behavior). read_folio_range() may read
> > the request asynchronously or synchronously. The caller is responsible
> > for calling iomap_start_folio_read()/iomap_finish_folio_read() when
> > reading the folio range.
> >
> > This makes it so that non-block based filesystems may use iomap for
> > reads.
> >
> > Signed-off-by: Joanne Koong
> > ---
> >  .../filesystems/iomap/operations.rst |   19 +++++
> >  block/fops.c                         |    4 +-
> >  fs/erofs/data.c                      |    4 +-
> >  fs/gfs2/aops.c                       |    4 +-
> >  fs/iomap/buffered-io.c               |   79 +++++++++++++------
> >  fs/xfs/xfs_aops.c                    |    4 +-
> >  fs/zonefs/file.c                     |    4 +-
> >  include/linux/iomap.h                |   21 ++++-
> >  8 files changed, 105 insertions(+), 34 deletions(-)
> >
> > diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
> > index 067ed8e14ef3..215053f0779d 100644
> > --- a/Documentation/filesystems/iomap/operations.rst
> > +++ b/Documentation/filesystems/iomap/operations.rst
> > @@ -57,6 +57,25 @@ The following address space operations can be wrapped easily:
> >   * ``bmap``
> >   * ``swap_activate``
> >
> > +``struct iomap_read_ops``
> > +--------------------------
> > +
> > +.. code-block:: c
> > +
> > + struct iomap_read_ops {
> > +     int (*read_folio_range)(const struct iomap_iter *iter,
> > +             struct folio *folio, loff_t pos, size_t len);
> > + };
> > +
> > +iomap calls these functions:
> > +
> > +  - ``read_folio_range``: Called to read in the range (read does not need to
> > +    be synchronous). The caller is responsible for calling
>
> Er... does this perform the read synchronously or asynchronously?
> Does the implementer need to know?  How does iomap figure out what
> happened?

It is up to the implementer whether they do the read synchronously or
asynchronously. Most filesystems, I think, will issue readahead
asynchronously, but for fuse, readahead needs to be synchronous unless
the server explicitly opts into async reads; otherwise the read
requests can be sent in non-sequential order, which some servers may
not be able to handle.

I don't think it matters from the iomap side whether the read is
synchronous or asynchronous, or even whether the read has completed by
the time iomap_readahead() returns. iomap only needs to make sure the
reads get kicked off.

>
> My guess is that iomap_finish_folio_read unlocks the folio, and anyone
> who cared is by this point already waiting on the folio lock?  So it's
> actually not important if the ->read_folio_range implementation runs
> async or not; the key is that the folio stays locked until we've
> completed the read IO?

This is my understanding too. The unlocking and
waking-any-waiting-threads stuff happens in folio_end_read().

>
> > +    iomap_start_folio_read() and iomap_finish_folio_read() when reading the
> > +    folio range. This should be done even if an error is encountered during
> > +    the read. If this function is not provided by the caller, then iomap
> > +    will default to issuing asynchronous bio read requests.
>
> What is this function supposed to return?  The usual 0 or negative
> errno?

Good point, I will add that info in.

> > +
> >  ``struct iomap_write_ops``
> >  --------------------------
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 5d153c6b16b6..06f2c857de64 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -335,8 +335,8 @@ void iomap_start_folio_read(struct folio *folio, size_t len)
> >  }
> >  EXPORT_SYMBOL_GPL(iomap_start_folio_read);
> >
> > -void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
> > -		int error)
> > +static void __iomap_finish_folio_read(struct folio *folio, size_t off,
> > +		size_t len, int error, bool update_bitmap)
> >  {
> >  	struct iomap_folio_state *ifs = folio->private;
> >  	bool uptodate = !error;
> > @@ -346,7 +346,7 @@ void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
> >  		unsigned long flags;
> >
> >  		spin_lock_irqsave(&ifs->state_lock, flags);
> > -		if (!error)
> > +		if (!error && update_bitmap)
> >  			uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
>
> When do we /not/ want to set uptodate after a successful read?  I guess
> iomap_read_folio_range_async goes through the bio machinery and sets
> uptodate via iomap_finish_folio_read()?  Does the ->read_folio_range
> function need to set the uptodate bits itself?  Possibly by calling
> iomap_finish_folio_read as well?
>

Maybe this is hacky but I'm not sure there's a better way to do it: I
essentially need a "bias" on read requests if they are async, so that
we don't prematurely end the read on the folio if the first few async
requests complete before the next ones are issued. For example, if
there's a large folio and a readahead request for 16 pages in that
folio, and readahead on those 16 pages is split into 4 async requests,
then if the first request is sent off and completes before we send off
the second request, the iomap_finish_folio_read() call for the first
request will drop ifs->read_bytes_pending to 0 and call
folio_end_read().

For that reason, I added a "__iomap_start_folio_read(folio, ..., 1);"
before any async requests are sent and a
"__iomap_finish_folio_read(folio, ..., 1);" after all the requests have
been sent. (This is the same thing the iomap writeback logic does for
its async requests.) Those calls, though, should not update the
uptodate bitmap; they are used only to prevent the premature
folio_end_read().
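
Roughly, the shape of it is the following (just a sketch, not the
literal patch; next_read_range() and submit_read_range() are made-up
stand-ins for the real iteration and submission in
iomap_readfolio_iter()):

struct iomap_folio_state *ifs;
loff_t pos;
size_t plen;
int ret = 0;

ifs = ifs_alloc(iter->inode, folio, iter->flags);

/* bias: one extra pending "byte" so read_bytes_pending can't hit zero
 * until every range below has been submitted */
if (ifs)
	iomap_start_folio_read(folio, 1);

/* next_read_range()/submit_read_range() are placeholders only */
while (next_read_range(iter, folio, &pos, &plen)) {
	/* each submission adds plen to read_bytes_pending; its completion
	 * subtracts it via iomap_finish_folio_read(), which also sets the
	 * uptodate bits on success */
	ret = submit_read_range(iter, folio, pos, plen);
	if (ret)
		break;
}

/* drop the bias without touching the uptodate bitmap, so
 * folio_end_read() only runs once the last outstanding range completes */
if (ifs)
	__iomap_finish_folio_read(folio, 0, 1, ret, false);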

Yes, ->read_folio_range() is responsible for calling
iomap_finish_folio_read() (as well as iomap_start_folio_read()); that
call is what updates the uptodate bitmap.

Thanks,
Joanne

> >  		ifs->read_bytes_pending -= len;
> >  		finished = !ifs->read_bytes_pending;
> > @@ -356,6 +356,12 @@ void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
> >  	if (finished)
> >  		folio_end_read(folio, uptodate);
> >  }
> > +
> > @@ -471,7 +478,14 @@ static int iomap_readfolio_iter(struct iomap_iter *iter,
> >  	}
> >
> >  	/* zero post-eof blocks as the page may be mapped */
> > -	ifs_alloc(iter->inode, folio, iter->flags);
> > +	ifs = ifs_alloc(iter->inode, folio, iter->flags);
> > +
> > +	/*
> > +	 * Add a bias to ifs->read_bytes_pending so that a read is ended only
> > +	 * after all the ranges have been read in.
> > +	 */
> > +	if (ifs)
> > +		iomap_start_folio_read(folio, 1);
> >
> >  	length = min_t(loff_t, length,
> >  			folio_size(folio) - offset_in_folio(folio, pos));
> > @@ -479,35 +493,53 @@ static int iomap_readfolio_iter(struct iomap_iter *iter,
> >  		iomap_adjust_read_range(iter->inode, folio, &pos,
> >  				length, &poff, &plen);
> >  		count = pos - iter->pos + plen;
> > -		if (plen == 0)
> > -			return iomap_iter_advance(iter, &count);
> > +		if (plen == 0) {
> > +			ret = iomap_iter_advance(iter, &count);
> > +			break;
> > +		}
> >
> >  		if (iomap_block_needs_zeroing(iter, pos)) {
> >  			folio_zero_range(folio, poff, plen);
> >  			iomap_set_range_uptodate(folio, poff, plen);
> >  		} else {
> > -			iomap_read_folio_range_async(iter, ctx, pos, plen);
> > +			ctx->folio_unlocked = true;
> > +			if (read_ops && read_ops->read_folio_range) {
> > +				ret = read_ops->read_folio_range(iter, folio, pos, plen);
> > +				if (ret)
> > +					break;
> > +			} else {
> > +				iomap_read_folio_range_async(iter, ctx, pos, plen);
> > +			}
> >  		}
> >
> >  		length -= count;
> >  		ret = iomap_iter_advance(iter, &count);
> >  		if (ret)
> > -			return ret;
> > +			break;
> >  		pos = iter->pos;
> >  	}
> > -	return 0;
> > +
> > +	if (ifs) {
> > +		__iomap_finish_folio_read(folio, 0, 1, ret, false);
> > +		ctx->folio_unlocked = true;
>
> Er.... so we subtract 1 from read_bytes_pending?  I thought the
> ->read_folio_range ioend was supposed to decrease that?
>
> --D
>
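
For reference, a filesystem hook following that contract could look
something like this (hypothetical, synchronous example, not part of the
series; my_fs_do_read() is a made-up helper that fills the folio range
and returns 0 or a negative errno):

/* bracket the read with iomap_start_folio_read()/iomap_finish_folio_read(),
 * and call iomap_finish_folio_read() even if the read fails */
static int my_fs_read_folio_range(const struct iomap_iter *iter,
		struct folio *folio, loff_t pos, size_t len)
{
	size_t poff = offset_in_folio(folio, pos);
	int error;

	iomap_start_folio_read(folio, len);
	error = my_fs_do_read(iter->inode, folio, poff, pos, len);
	iomap_finish_folio_read(folio, poff, len, error);

	return error;
}

static const struct iomap_read_ops my_fs_read_ops = {
	.read_folio_range	= my_fs_read_folio_range,
};

An async implementation would instead call iomap_finish_folio_read()
from its completion path, the same way the default bio path does.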