From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 34627C369B2 for ; Mon, 14 Apr 2025 20:24:41 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B243528008C; Mon, 14 Apr 2025 16:24:39 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id AD31D280088; Mon, 14 Apr 2025 16:24:39 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9725B28008C; Mon, 14 Apr 2025 16:24:39 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 7934E280088 for ; Mon, 14 Apr 2025 16:24:39 -0400 (EDT) Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id C9ED0AE08C for ; Mon, 14 Apr 2025 20:24:39 +0000 (UTC) X-FDA: 83333777478.25.4DE9CF1 Received: from mail-qt1-f173.google.com (mail-qt1-f173.google.com [209.85.160.173]) by imf18.hostedemail.com (Postfix) with ESMTP id E5DAA1C0005 for ; Mon, 14 Apr 2025 20:24:37 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Y5os91eq; spf=pass (imf18.hostedemail.com: domain of joannelkoong@gmail.com designates 209.85.160.173 as permitted sender) smtp.mailfrom=joannelkoong@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1744662277; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=hjalnf6NhmbK3zLnEA93KvQq39O1mk67mqpHcJeLGtY=; b=UP7F1vAlrlG6WwH4VAngBD+qGaTBF1pygnLN40JmUwUbaqh0lC+Kqo4rJpiRgePwTNOBLC FvZKx7nsV8zoiza8b7/c7J74mccENde4R/WriWSS69FFvzmDRiHKzSLIti+oyGkCKAEemY 33XPbnSAAbzSAnzQwz5TpKUArOp+xhk= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1744662277; a=rsa-sha256; cv=none; b=PYnOQvcEQrw5opxMeRjZ7PwzZ5iqFeYqOX5QVoidDofhe4rKwce1AVaQX8BUdOxZpDpqs4 EuX51p+xYag/IpjrgTgm2UQyrDG3odzvU3l6XdcY2OapLOM+uQD9l/Uuk5zfwMvMkjtuUE 7iUShMPHFwB0hWzMlYgkltIDQFSRu/E= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Y5os91eq; spf=pass (imf18.hostedemail.com: domain of joannelkoong@gmail.com designates 209.85.160.173 as permitted sender) smtp.mailfrom=joannelkoong@gmail.com; dmarc=pass (policy=none) header.from=gmail.com Received: by mail-qt1-f173.google.com with SMTP id d75a77b69052e-4769bbc21b0so42261451cf.2 for ; Mon, 14 Apr 2025 13:24:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1744662277; x=1745267077; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=hjalnf6NhmbK3zLnEA93KvQq39O1mk67mqpHcJeLGtY=; b=Y5os91eqtlxR9Hf/8T3SgF85SVuzGe/FcrYQ3HnOGvCuaMFtin2eUFFouqLGEy5s/g LNAOx8d/+3Y3QESfrrmvEckw1Idufzf64uB2r7zV2Yy9oMg0uNw8OrfvKNC7NYdyQ2HE ois/lmzmFI7Wgwx+yTJbXEWH+EJzd0YUET3fpCXfM2DGPAYpp+4uTgOeMFftA14ZvYl6 R/nCcJcguI6nFRIVDtDh85stfHLbMTFR75J6Sb9I7k8zZaOgdBWGyJpVp6TYxYJsO5OZ zZWXAlF7hcQaU521XYinCwsffWJUVl7vpeKh5CquGVjaN428CKHT5c9+/vtOUm9JHcKf rLBA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1744662277; x=1745267077; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=hjalnf6NhmbK3zLnEA93KvQq39O1mk67mqpHcJeLGtY=; b=cZ7yj649GeRBMyyxIydddoubFLQsIhZM3E0roC74+FwdUD10jDcQkbuMjkDo6jKpwq NPrBs5dR2ykECkyaEznPBWfGfUCMewnqjpp/QeGqMtGy5c6RQHGqM2aw0mut4Af7d6V3 pUttZcSB6r24V46ueUMHLlGnKTgsGnBx9jZh+yzSIFKuD0isXtyVYRtF0EluTbsvQwmm w6BaYX2EPbL0WvMKLM2dwTceqXrwH1jiyK0PCqOVla3Y9dUx4GQnyzXMpdCyKPiwf4/0 /L8GOQLhq0k3k/K6YVhtNUO7vYpQvImh3Lai5rva1wEWIO9VWufGPK/5kkf1DoT0xeNe A59A== X-Forwarded-Encrypted: i=1; AJvYcCW9eFIYisYERZHlrhJuG/kkBrgW/7O988dgnaqm4S+tQs1j5SAocp+NHO09dK46AmqaKUA42/CKog==@kvack.org X-Gm-Message-State: AOJu0YwxVNvOVsxRBBMz4CZRlOrYrRYn2IaRvEQ5Y/AT3rHX1KgiLBHK B8mgkHfsVNEpn56AsN46pfQ4WbrYcm2bThW6Nqt83h0NKEUz/OTn57qHubr6kWjAf2/IT58SSnZ anclC8FGMNpBa1a4c05geZ5iofNE= X-Gm-Gg: ASbGncu13KHeVbTJM6I1K/piGKozq7ajHK+BhlXooac6XuWYZ5ZXVLf3eZWqGVFwfPu b4oN4xuKVxiwLa6FVsyW3fPn8qCBPOGLikOS60wAHcUxzi+xh9GHW4NB0a1is4uR4qIz2xV+RfX TgbTghhFd+aK6T2x2N5BwBTdtkbYo8yveHSpZxTA== X-Google-Smtp-Source: AGHT+IH9XmMhDR49J4+TwAnE+Fg9oII6/lpKKBu8unyVgb6A/IHau/R3sHHWV7hbiTUZ0VEtUyCZHY3RYZ/nxv+DbDg= X-Received: by 2002:a05:622a:180d:b0:477:6ec9:1033 with SMTP id d75a77b69052e-479775261d0mr189130041cf.1.1744662276892; Mon, 14 Apr 2025 13:24:36 -0700 (PDT) MIME-Version: 1.0 References: <20250404181443.1363005-1-joannelkoong@gmail.com> <20250404181443.1363005-4-joannelkoong@gmail.com> <7e9b1a40-4708-42a8-b8fc-44fa50227e5b@linux.alibaba.com> <9a3cfb55-faae-4551-9bef-b9650432848a@linux.alibaba.com> In-Reply-To: From: Joanne Koong Date: Mon, 14 Apr 2025 13:24:24 -0700 X-Gm-Features: ATxdqUGHOQAamSeYPCJOivk7Bm3Xf4bayW887pC1sF1cWjckxnHY4wsgvIq81Eg Message-ID: Subject: Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree To: Jingbo Xu Cc: miklos@szeredi.hu, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, shakeel.butt@linux.dev, david@redhat.com, bernd.schubert@fastmail.fm, ziy@nvidia.com, jlayton@kernel.org, kernel-team@meta.com, Miklos Szeredi Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: okzjeg5eg6t447ozr88y7n3xppxwn1hc X-Rspam-User: X-Rspamd-Queue-Id: E5DAA1C0005 X-Rspamd-Server: rspam08 X-HE-Tag: 1744662277-711256 X-HE-Meta: U2FsdGVkX1+QAiA0lUVUKAlZovVPigoJygRGtMKFTxPS8b2TXwyjux46HhG2aY5CNt/Evy3OvWRqQOwUob9/08gfJusjzQkLR0LP1vOPmfzSkz5hDyRg+L/RXS6YWUiFpexC2Pv9xjWjBEdL1c0O06MfhBG5bsBgYO7O6SugYt6JZPS6hwtJvBSpvFUtghrcbw+iUZPo14PNfHuElzu8gyPep824J/srGwb9yl+pBi1yYqU7/LdMT2E0B/i4vXodO/O0dYbOzWe7Dq9lDE1L04jFgoLXBE3qM7Hq/CLZE1S1BBMjG7B8/ImFF5V/5SrcSGTLRxbFQU1WdXiHnu0nOAbYBK++T/wBIHFT/0mZkFr7DxF9kg8jGKLTGW7AhqrrtT12PG7KujepMKcdZNW8ZCI3XbAstGeEFnn2e6ekgC9jB7pdxEKILME+aVhbPAW110csR3mn9L14sZ5TYKqKbobH+v1B3NRHrRcLjJlf7N7zUmClDQJHYoVCklvXZorIyKC7TwJNbilx/Apy8fL8JPcqINUDcw2OnK9SO2QffnSxSTfBuLAPO0R9YY1uGiRXqArOQchTTdix/+5NFjENg01PnclEdlCib3KqY9l4YTCGtjrjsbft5ubPfaYGbtRrHz64aL12Iv52myBoSlRptZQqf2hVkUo399v7nTDK7Rv4y7XSllevq1ougS++EyzyR60Mx6LxUANZdNbQnB1w6EGcuUVPpADyYy+ZqrB0c2iamwlWt8EjK8ZmJ0Jzq1Tgq811ILHVBQ4nJVt4yJJVkCbNTBwOhyivyMP/gk/yEsDqR7KCcBXzPHor6q0l0ErzTX++AaONhMVxb6ZjHoCccyn3YftA50k8PUKuwzOUD2dHrFt8wSsE1jTm4GPg0KX0ocI6WEIHoO++MhBT3DYtOAhJxjjwbqgJftLwG1q5zj9SUgrK+O1WB7X8vB/FNbyVPiJrzVd618xQUCNQdj6 +7W2Trar RIJ8LAHjM+dc3bD1RHDCq8aXJRfOIzvCncpUNMs4mPMCwby15LnhBn/IlwsBVC5/pmYf6WP7dNnOJ/uVklhFTW+SEPbyk16NByJIBgcWWzj0Hf63ayew9WpDk7xFIbloM9XyMRfSmpubNsAyF6x0ni7OAQK0DGinc33+N17tvKH0FrcoJMb76pRHYZ5JDGJwraUx8Y+elACNbse1PCTYxMbCzqw13f97PXbPc7SjRcAd8zsn3DZEPYjSFVoGfxdRxmlI+FGyXwoZRaS7QfSCofdnNzBc3Vj2DczLXBjq5u6Ghk/aVMc/9IWzcKP16HIvpCarXCD4UzL99Qg94LLc2NqjuJj+AQUxl6bdsttRIJ97Ea8fpPOJI4xQ+8f6Reuj3x3BfNlgmNrLJMcU= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Apr 10, 2025 at 9:11=E2=80=AFAM Joanne Koong wrote: > > On Thu, Apr 10, 2025 at 8:11=E2=80=AFAM Jingbo Xu wrote: > > > > On 4/10/25 11:07 PM, Joanne Koong wrote: > > > On Wed, Apr 9, 2025 at 7:12=E2=80=AFPM Jingbo Xu wrote: > > >> > > >> > > >> > > >> On 4/10/25 7:47 AM, Joanne Koong wrote: > > >>> On Tue, Apr 8, 2025 at 7:43=E2=80=AFPM Jingbo Xu wrote: > > >>>> > > >>>> Hi Joanne, > > >>>> > > >>>> On 4/5/25 2:14 AM, Joanne Koong wrote: > > >>>>> In the current FUSE writeback design (see commit 3be5a52b30aa > > >>>>> ("fuse: support writable mmap")), a temp page is allocated for ev= ery > > >>>>> dirty page to be written back, the contents of the dirty page are= copied over > > >>>>> to the temp page, and the temp page gets handed to the server to = write back. > > >>>>> > > >>>>> This is done so that writeback may be immediately cleared on the = dirty page, > > >>>>> and this in turn is done in order to mitigate the following deadl= ock scenario > > >>>>> that may arise if reclaim waits on writeback on the dirty page to= complete: > > >>>>> * single-threaded FUSE server is in the middle of handling a requ= est > > >>>>> that needs a memory allocation > > >>>>> * memory allocation triggers direct reclaim > > >>>>> * direct reclaim waits on a folio under writeback > > >>>>> * the FUSE server can't write back the folio since it's stuck in > > >>>>> direct reclaim > > >>>>> > > >>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mi= tigates > > >>>>> the situations described above, FUSE writeback does not need to u= se > > >>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode map= pings. > > >>>>> > > >>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings > > >>>>> and removes the temporary pages + extra copying and the internal = rb > > >>>>> tree. > > >>>>> > > >>>>> fio benchmarks -- > > >>>>> (using averages observed from 10 runs, throwing away outliers) > > >>>>> > > >>>>> Setup: > > >>>>> sudo mount -t tmpfs -o size=3D30G tmpfs ~/tmp_mount > > >>>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threa= ds=3D4 -o source=3D~/tmp_mount ~/fuse_mount > > >>>>> > > >>>>> fio --name=3Dwriteback --ioengine=3Dsync --rw=3Dwrite --bs=3D{1k,= 4k,1M} --size=3D2G > > >>>>> --numjobs=3D2 --ramp_time=3D30 --group_reporting=3D1 --directory= =3D/root/fuse_mount > > >>>>> > > >>>>> bs =3D 1k 4k 1M > > >>>>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s > > >>>>> After 341 MiB/s 2246 MiB/s 2685 MiB/s > > >>>>> % diff -3% 23% 45% > > >>>>> > > >>>>> Signed-off-by: Joanne Koong > > >>>>> Reviewed-by: Jingbo Xu > > >>>>> Acked-by: Miklos Szeredi > > >>>> > > >>> > > >>> Hi Jingbo, > > >>> > > >>> Thanks for sharing your analysis for this. > > >>> > > >>>> Overall this patch LGTM. > > >>>> > > >>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechan= ism is > > >>>> also unneeded then, at least the DIRECT IO routine (i.e. > > >>> > > >>> I took a look at fi->writectr and fi->queued_writes and my > > >>> understanding is that we do still need this. For example, for > > >>> truncates (I'm looking at fuse_do_setattr()), I think we still need= to > > >>> prevent concurrent writeback or else the setattr request and the > > >>> writeback request could race which would result in a mismatch betwe= en > > >>> the file's reported size and the actual data written to disk. > > >> > > >> I haven't looked into the truncate routine yet. I will see it later= . > > >> > > >>> > > >>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore. That i= s > > >>>> because after removing the temp page, the DIRECT IO routine has al= ready > > >>>> been waiting for all inflight WRITE requests, see > > >>>> > > >>>> # DIRECT read > > >>>> generic_file_read_iter > > >>>> kiocb_write_and_wait > > >>>> filemap_write_and_wait_range > > >>> > > >>> Where do you see generic_file_read_iter() getting called for direct= io reads? > > >> > > >> # DIRECT read > > >> fuse_file_read_iter > > >> fuse_cache_read_iter > > >> generic_file_read_iter > > >> kiocb_write_and_wait > > >> filemap_write_and_wait_range > > >> a_ops->direct_IO(),i.e. fuse_direct_IO() > > >> > > > > > > Oh I see, I thought files opened with O_DIRECT automatically call the > > > .direct_IO handler for reads/writes but you're right, it first goes > > > through .read_iter / .write_iter handlers, and the .direct_IO handler > > > only gets invoked through generic_file_read_iter() / > > > generic_file_direct_write() in mm/filemap.c > > > > > > There's two paths for direct io in FUSE: > > > a) fuse server sets fi->direct_io =3D true when a file is opened, whi= ch > > > will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side > > > b) fuse server doesn't set fi->direct_io =3D true, but the client ope= ns > > > the file with O_DIRECT > > > > > > We only go through the stack trace you listed above for the b) case. > > > For the a) case, we'll hit > > > > > > if (ff->open_flags & FOPEN_DIRECT_IO) > > > return fuse_direct_read_iter(iocb, to); > > > > > > and > > > > > > if (ff->open_flags & FOPEN_DIRECT_IO) > > > return fuse_direct_write_iter(iocb, from); > > > > > > which will invoke fuse_direct_IO() / fuse_direct_io() without going > > > through the kiocb_write_and_wait() -> filemap_write_and_wait_range() = / > > > kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed > > > above. > > > > > > So for the a) case I think we'd still need the fuse_sync_writes() in > > > case there's still pending writeback. > > > > > > Do you agree with this analysis or am I missing something here? > > > > Yeah, that's true. But instead of calling fuse_sync_writes(), we can > > call filemap_wait_range() or something similar here. > > > > Agreed. Actually, the more I look at this, the more I think we can > replace all fuse_sync_writes() and get rid of it entirely. Right now, > fuse_sync_writes() is called in: This is nontrivial so I'll leave this optimization to be done as a separate future patchset so as to not slow this one down. Thanks, Joanne > > fuse_fsync(): > /* > * Start writeback against all dirty pages of the inode, then > * wait for all outstanding writes, before sending the FSYNC > * request. > */ > err =3D file_write_and_wait_range(file, start, end); > if (err) > goto out; > > fuse_sync_writes(inode); > > /* > * Due to implementation of fuse writeback > * file_write_and_wait_range() does not catch errors. > * We have to do this directly after fuse_sync_writes() > */ > err =3D file_check_and_advance_wb_err(file); > if (err) > goto out; > > > We can get rid of the fuse_sync_writes() and > file_check_and_advance_wb_err() entirely since now without temp pages, > the file_write_and_wait_range() call actually ensures that writeback > is completed > > > > fuse_writeback_range(): > static int fuse_writeback_range(struct inode *inode, loff_t > start, loff_t end) > { > int err =3D > filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX); > > if (!err) > fuse_sync_writes(inode); > > return err; > } > > > We can replace fuse_writeback_range() entirely with > filemap_write_and_wait_range(). > > > > fuse_direct_io(): > if (fopen_direct_io && fc->direct_io_allow_mmap) { > res =3D filemap_write_and_wait_range(mapping, pos, pos + > count - 1); > if (res) { > fuse_io_free(ia); > return res; > } > } > if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + > count - 1))) { > if (!write) > inode_lock(inode); > fuse_sync_writes(inode); > if (!write) > inode_unlock(inode); > } > > > I think this can just replaced with > if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse= )) { > res =3D filemap_write_and_wait_range(mapping, > pos, pos + count - 1); > if (res) { > fuse_io_free(ia); > return res; > } > } > since for the !fopen_direct_io case, it will already go through > filemap_write_and_wait_range(), as you mentioned in your previous > message. I think this also fixes a bug (?) in the original code - in > the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we > still need to write out dirty pages first, which we don't currently > do. > > > What do you think? > > Thanks for all your careful review on this patchset throughout all of > its iterations. > > > > > > > >> filp_close() > > >> filp_flush() > > >> filp->f_op->flush() > > >> fuse_flush() > > >> write_inode_now > > >> writeback_single_inode(WB_SYNC_ALL) > > >> do_writepages > > >> # flush dirty page > > >> filemap_fdatawait > > >> # wait for WRITE completion > > > > > > Nice. I missed that write_inode_now() will invoke filemap_fdatawait()= . > > > This seems pretty straightforward. I'll remove the fuse_sync_writes() > > > call in fuse_flush() when I send out v8. > > > > > > The direct io one above is less straight-forward. I won't add that to > > > v8 but that can be done in a separate future patch when we figure tha= t > > > out. > > > > Thanks for keep working on this. Appreciated. > > > > -- > > Thanks, > > Jingbo