next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
@ 2025-07-23 11:44 Naresh Kamboju
  2025-07-23 14:46 ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Naresh Kamboju @ 2025-07-23 11:44 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-xfs, open list, lkft-triage,
	Linux Regressions
  Cc: Miklos Szeredi, Jan Kara, Andrew Morton, Christian Brauner,
	Darrick J. Wong, Lorenzo Stoakes, Liam R. Howlett, Arnd Bergmann,
	Dan Carpenter, Anders Roxell, Ben Copeland

Regressions found while running LTP msync04 tests on qemu-arm64 running
Linux next-20250721, next-20250722 and next-20250723 with 16K and 64K
page size enabled builds.

CONFIG_ARM64_64K_PAGES=y ( kernel warning as below )
CONFIG_ARM64_16K_PAGES=y ( kernel warning as below )

No warning noticed with 4K page size.
CONFIG_ARM64_4K_PAGES=y works as expected


First seen on the tag next-20250721.
Good: next-20250718
Bad:  next-20250721 to next-20250723

Regression Analysis:
- New regression? Yes
- Reproducibility? Yes

Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
fuse file.c at fuse_iomap_writeback_range

Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>

## Test log
------------[ cut here ]------------
[  343.828105] WARNING: fs/fuse/file.c:2146 at
fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
[  343.830969] Modules linked in: btrfs blake2b_generic xor xor_neon
raid6_pq zstd_compress sm3_ce sha3_ce drm fuse backlight ip_tables
x_tables
[  343.833830] CPU: 0 UID: 0 PID: 4190 Comm: msync04 Not tainted
6.16.0-rc7-next-20250723 #1 PREEMPT
[  343.834736] Hardware name: linux,dummy-virt (DT)
[  343.835788] pstate: 03402009 (nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[  343.836455] pc : fuse_iomap_writeback_range+0x478/0x558 fuse
[  343.837294] lr : iomap_writeback_folio (fs/iomap/buffered-io.c:1586
fs/iomap/buffered-io.c:1710)
[  343.838178] sp : ffff80008b26f8d0
[  343.838668] x29: ffff80008b26f8d0 x28: fff00000e7f8c800 x27: 0000000000000000
[  343.839391] x26: fff00000d4b30000 x25: 0000000000000000 x24: 0000000000000000
[  343.840305] x23: 0000000000000000 x22: fffffc1fc0334200 x21: 0000000000001000
[  343.840928] x20: ffff80008b26fa00 x19: 0000000000000000 x18: 0000000000000000
[  343.841782] x17: 0000000000000000 x16: ffffb8d3b90c67c8 x15: 0000000000000000
[  343.842565] x14: ffffb8d3ba91e340 x13: 0000ffff8ff3ffff x12: 0000000000000000
[  343.843002] x11: 1ffe000004b74a21 x10: fff0000025ba510c x9 : ffffb8d3b90c6308
[  343.843962] x8 : ffff80008b26f788 x7 : ffffb8d365830b90 x6 : ffffb8d3bb6c9000
[  343.844718] x5 : 0000000000000000 x4 : 000000000000000a x3 : 0000000000001000
[  343.845333] x2 : fff00000c0b5ecc0 x1 : 000000000000ffff x0 : 0bfffe000000400b
[  343.846323] Call trace:
[  343.846767] fuse_iomap_writeback_range+0x478/0x558 fuse (P)
[  343.847288] iomap_writeback_folio (fs/iomap/buffered-io.c:1586
fs/iomap/buffered-io.c:1710)
[  343.847930] iomap_writepages (fs/iomap/buffered-io.c:1762)
[  343.848494] fuse_writepages+0xa0/0xe8 fuse
[  343.849112] do_writepages (mm/page-writeback.c:2634)
[  343.849614] filemap_fdatawrite_wbc (mm/filemap.c:386 mm/filemap.c:376)
[  343.850202] __filemap_fdatawrite_range (mm/filemap.c:420)
[  343.850791] file_write_and_wait_range (mm/filemap.c:794)
[  343.851108] fuse_fsync+0x6c/0x138 fuse
[  343.851688] vfs_fsync_range (fs/sync.c:188)
[  343.852002] __arm64_sys_msync (mm/msync.c:96 mm/msync.c:32 mm/msync.c:32)
[  343.852197] invoke_syscall.constprop.0
(arch/arm64/include/asm/syscall.h:61 arch/arm64/kernel/syscall.c:54)
[  343.852914] do_el0_svc (include/linux/thread_info.h:135
(discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2)
arch/arm64/kernel/syscall.c:151 (discriminator 2))
[  343.853389] el0_svc (arch/arm64/include/asm/irqflags.h:82
(discriminator 1) arch/arm64/include/asm/irqflags.h:123 (discriminator
1) arch/arm64/include/asm/irqflags.h:136 (discriminator 1)
arch/arm64/kernel/entry-common.c:169 (discriminator 1)
arch/arm64/kernel/entry-common.c:182 (discriminator 1)
arch/arm64/kernel/entry-common.c:880 (discriminator 1))
[  343.853829] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:899)
[  343.854350] el0t_64_sync (arch/arm64/kernel/entry.S:596)
[  343.854652] ---[ end trace 0000000000000000 ]---



## Source
* Git tree: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
* Project: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250723/
* Git sha: a933d3dc1968fcfb0ab72879ec304b1971ed1b9a
* Git describe: 6.16.0-rc7-next-20250723
* kernel version: next-20250723
* Architectures: arm64
* Toolchains: gcc-13
* Kconfigs: defconfig + CONFIG_ARM64_64K_PAGES=y
* Kconfigs: defconfig + CONFIG_ARM64_16K_PAGES=y

## Test
* Test log 1: https://qa-reports.linaro.org/api/testruns/29227309/log_file/
* Test log 2: https://qa-reports.linaro.org/api/testruns/29227074/log_file/
* Test run: https://regressions.linaro.org/lkft/linux-next-master/next-20250723/testruns/1713367/
* Test history:
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250723/testrun/29227309/suite/log-parser-test/test/exception-warning-fsfusefile-at-fuse_iomap_writeback_range/history/
* Test plan: https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/30G3hpJVVdXkZKnB15v1qoQOL03
* Build link: https://storage.tuxsuite.com/public/linaro/lkft/builds/30G3dvSFyHHQ3E8CvKH7tjU98I6/
* Kernel config:
https://storage.tuxsuite.com/public/linaro/lkft/builds/30G3dvSFyHHQ3E8CvKH7tjU98I6/config

--
Linaro LKFT
https://lkft.linaro.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-23 11:44 next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range Naresh Kamboju
@ 2025-07-23 14:46 ` Darrick J. Wong
  2025-07-23 18:42   ` Joanne Koong
  0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2025-07-23 14:46 UTC (permalink / raw)
  To: Naresh Kamboju, joannelkoong
  Cc: linux-fsdevel, linux-mm, linux-xfs, open list, lkft-triage,
	Linux Regressions, Miklos Szeredi, Jan Kara, Andrew Morton,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Arnd Bergmann, Dan Carpenter, Anders Roxell, Ben Copeland

[cc Joanne]

On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> Regressions found while running LTP msync04 tests on qemu-arm64 running
> Linux next-20250721, next-20250722 and next-20250723 with 16K and 64K
> page size enabled builds.
> 
> CONFIG_ARM64_64K_PAGES=y ( kernel warning as below )
> CONFIG_ARM64_16K_PAGES=y ( kernel warning as below )
> 
> No warning noticed with 4K page size.
> CONFIG_ARM64_4K_PAGES=y works as expected

You might want to cc Joanne since she's been working on large folio
support in fuse.

> First seen on the tag next-20250721.
> Good: next-20250718
> Bad:  next-20250721 to next-20250723
> 
> Regression Analysis:
> - New regression? Yes
> - Reproducibility? Yes
> 
> Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> fuse file.c at fuse_iomap_writeback_range
> 
> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> 
> ## Test log
> ------------[ cut here ]------------
> [  343.828105] WARNING: fs/fuse/file.c:2146 at
> fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190

	WARN_ON_ONCE(len & (PAGE_SIZE - 1));

/me speculates that this might be triggered by an attempt to write back
some 4k fsblock within a 16/64k base page?

--D

> [  343.830969] Modules linked in: btrfs blake2b_generic xor xor_neon
> raid6_pq zstd_compress sm3_ce sha3_ce drm fuse backlight ip_tables
> x_tables
> [  343.833830] CPU: 0 UID: 0 PID: 4190 Comm: msync04 Not tainted
> 6.16.0-rc7-next-20250723 #1 PREEMPT
> [  343.834736] Hardware name: linux,dummy-virt (DT)
> [  343.835788] pstate: 03402009 (nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> [  343.836455] pc : fuse_iomap_writeback_range+0x478/0x558 fuse
> [  343.837294] lr : iomap_writeback_folio (fs/iomap/buffered-io.c:1586
> fs/iomap/buffered-io.c:1710)
> [  343.838178] sp : ffff80008b26f8d0
> [  343.838668] x29: ffff80008b26f8d0 x28: fff00000e7f8c800 x27: 0000000000000000
> [  343.839391] x26: fff00000d4b30000 x25: 0000000000000000 x24: 0000000000000000
> [  343.840305] x23: 0000000000000000 x22: fffffc1fc0334200 x21: 0000000000001000
> [  343.840928] x20: ffff80008b26fa00 x19: 0000000000000000 x18: 0000000000000000
> [  343.841782] x17: 0000000000000000 x16: ffffb8d3b90c67c8 x15: 0000000000000000
> [  343.842565] x14: ffffb8d3ba91e340 x13: 0000ffff8ff3ffff x12: 0000000000000000
> [  343.843002] x11: 1ffe000004b74a21 x10: fff0000025ba510c x9 : ffffb8d3b90c6308
> [  343.843962] x8 : ffff80008b26f788 x7 : ffffb8d365830b90 x6 : ffffb8d3bb6c9000
> [  343.844718] x5 : 0000000000000000 x4 : 000000000000000a x3 : 0000000000001000
> [  343.845333] x2 : fff00000c0b5ecc0 x1 : 000000000000ffff x0 : 0bfffe000000400b
> [  343.846323] Call trace:
> [  343.846767] fuse_iomap_writeback_range+0x478/0x558 fuse (P)
> [  343.847288] iomap_writeback_folio (fs/iomap/buffered-io.c:1586
> fs/iomap/buffered-io.c:1710)
> [  343.847930] iomap_writepages (fs/iomap/buffered-io.c:1762)
> [  343.848494] fuse_writepages+0xa0/0xe8 fuse
> [  343.849112] do_writepages (mm/page-writeback.c:2634)
> [  343.849614] filemap_fdatawrite_wbc (mm/filemap.c:386 mm/filemap.c:376)
> [  343.850202] __filemap_fdatawrite_range (mm/filemap.c:420)
> [  343.850791] file_write_and_wait_range (mm/filemap.c:794)
> [  343.851108] fuse_fsync+0x6c/0x138 fuse
> [  343.851688] vfs_fsync_range (fs/sync.c:188)
> [  343.852002] __arm64_sys_msync (mm/msync.c:96 mm/msync.c:32 mm/msync.c:32)
> [  343.852197] invoke_syscall.constprop.0
> (arch/arm64/include/asm/syscall.h:61 arch/arm64/kernel/syscall.c:54)
> [  343.852914] do_el0_svc (include/linux/thread_info.h:135
> (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2)
> arch/arm64/kernel/syscall.c:151 (discriminator 2))
> [  343.853389] el0_svc (arch/arm64/include/asm/irqflags.h:82
> (discriminator 1) arch/arm64/include/asm/irqflags.h:123 (discriminator
> 1) arch/arm64/include/asm/irqflags.h:136 (discriminator 1)
> arch/arm64/kernel/entry-common.c:169 (discriminator 1)
> arch/arm64/kernel/entry-common.c:182 (discriminator 1)
> arch/arm64/kernel/entry-common.c:880 (discriminator 1))
> [  343.853829] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:899)
> [  343.854350] el0t_64_sync (arch/arm64/kernel/entry.S:596)
> [  343.854652] ---[ end trace 0000000000000000 ]---
> 
> 
> 
> ## Source
> * Git tree: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
> * Project: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250723/
> * Git sha: a933d3dc1968fcfb0ab72879ec304b1971ed1b9a
> * Git describe: 6.16.0-rc7-next-20250723
> * kernel version: next-20250723
> * Architectures: arm64
> * Toolchains: gcc-13
> * Kconfigs: defconfig + CONFIG_ARM64_64K_PAGES=y
> * Kconfigs: defconfig + CONFIG_ARM64_16K_PAGES=y
> 
> ## Test
> * Test log 1: https://qa-reports.linaro.org/api/testruns/29227309/log_file/
> * Test log 2: https://qa-reports.linaro.org/api/testruns/29227074/log_file/
> * Test run: https://regressions.linaro.org/lkft/linux-next-master/next-20250723/testruns/1713367/
> * Test history:
> https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250723/testrun/29227309/suite/log-parser-test/test/exception-warning-fsfusefile-at-fuse_iomap_writeback_range/history/
> * Test plan: https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/30G3hpJVVdXkZKnB15v1qoQOL03
> * Build link: https://storage.tuxsuite.com/public/linaro/lkft/builds/30G3dvSFyHHQ3E8CvKH7tjU98I6/
> * Kernel config:
> https://storage.tuxsuite.com/public/linaro/lkft/builds/30G3dvSFyHHQ3E8CvKH7tjU98I6/config
> 
> --
> Linaro LKFT
> https://lkft.linaro.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-23 14:46 ` Darrick J. Wong
@ 2025-07-23 18:42   ` Joanne Koong
  2025-07-23 21:20     ` Darrick J. Wong
  2025-07-23 23:15     ` Joanne Koong
  0 siblings, 2 replies; 22+ messages in thread
From: Joanne Koong @ 2025-07-23 18:42 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> [cc Joanne]
>
> On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > Regressions found while running LTP msync04 tests on qemu-arm64 running
> > Linux next-20250721, next-20250722 and next-20250723 with 16K and 64K
> > page size enabled builds.
> >
> > CONFIG_ARM64_64K_PAGES=y ( kernel warning as below )
> > CONFIG_ARM64_16K_PAGES=y ( kernel warning as below )
> >
> > No warning noticed with 4K page size.
> > CONFIG_ARM64_4K_PAGES=y works as expected
>
> You might want to cc Joanne since she's been working on large folio
> support in fuse.
>
> > First seen on the tag next-20250721.
> > Good: next-20250718
> > Bad:  next-20250721 to next-20250723

Thanks for the report. Is there a link to the script that mounts the
fuse server for these tests? I'm curious whether this was mounted as a
fuseblk filesystem.

> >
> > Regression Analysis:
> > - New regression? Yes
> > - Reproducibility? Yes
> >
> > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > fuse file.c at fuse_iomap_writeback_range
> >
> > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> >
> > ## Test log
> > ------------[ cut here ]------------
> > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
>
>         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
>
> /me speculates that this might be triggered by an attempt to write back
> some 4k fsblock within a 16/64k base page?
>

I think this can happen on 4k base pages as well actually. On the
iomap side, the length passed is always block-aligned and in fuse, we
set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
page-aligned, but I missed that if it's a "fuseblk" filesystem, that
isn't true and the blocksize is initialized to a default size of 512
or whatever block size is passed in when it's mounted.

I'll send out a patch to remove this line. It doesn't make any
difference for fuse_iomap_writeback_range() logic whether len is
page-aligned or not; I had added it as a sanity-check against sketchy
ranges.

Also, I just noticed that apparently the blocksize can change
dynamically for an inode in fuse through getattr replies from the
server (see fuse_change_attributes_common()). This is a problem since
the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
think we will have to cache the inode blkbits in the iomap_folio_state
struct unfortunately :( I'll think about this some more and send out a
patch for this.


Thanks,
Joanne

> --D
>
> > [  343.830969] Modules linked in: btrfs blake2b_generic xor xor_neon
> > raid6_pq zstd_compress sm3_ce sha3_ce drm fuse backlight ip_tables
> > x_tables
> > [  343.833830] CPU: 0 UID: 0 PID: 4190 Comm: msync04 Not tainted
> > 6.16.0-rc7-next-20250723 #1 PREEMPT
> > [  343.834736] Hardware name: linux,dummy-virt (DT)
> > [  343.835788] pstate: 03402009 (nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> > [  343.836455] pc : fuse_iomap_writeback_range+0x478/0x558 fuse
> > [  343.837294] lr : iomap_writeback_folio (fs/iomap/buffered-io.c:1586
> > fs/iomap/buffered-io.c:1710)
> > [  343.838178] sp : ffff80008b26f8d0
> > [  343.838668] x29: ffff80008b26f8d0 x28: fff00000e7f8c800 x27: 0000000000000000
> > [  343.839391] x26: fff00000d4b30000 x25: 0000000000000000 x24: 0000000000000000
> > [  343.840305] x23: 0000000000000000 x22: fffffc1fc0334200 x21: 0000000000001000
> > [  343.840928] x20: ffff80008b26fa00 x19: 0000000000000000 x18: 0000000000000000
> > [  343.841782] x17: 0000000000000000 x16: ffffb8d3b90c67c8 x15: 0000000000000000
> > [  343.842565] x14: ffffb8d3ba91e340 x13: 0000ffff8ff3ffff x12: 0000000000000000
> > [  343.843002] x11: 1ffe000004b74a21 x10: fff0000025ba510c x9 : ffffb8d3b90c6308
> > [  343.843962] x8 : ffff80008b26f788 x7 : ffffb8d365830b90 x6 : ffffb8d3bb6c9000
> > [  343.844718] x5 : 0000000000000000 x4 : 000000000000000a x3 : 0000000000001000
> > [  343.845333] x2 : fff00000c0b5ecc0 x1 : 000000000000ffff x0 : 0bfffe000000400b
> > [  343.846323] Call trace:
> > [  343.846767] fuse_iomap_writeback_range+0x478/0x558 fuse (P)
> > [  343.847288] iomap_writeback_folio (fs/iomap/buffered-io.c:1586
> > fs/iomap/buffered-io.c:1710)
> > [  343.847930] iomap_writepages (fs/iomap/buffered-io.c:1762)
> > [  343.848494] fuse_writepages+0xa0/0xe8 fuse
> > [  343.849112] do_writepages (mm/page-writeback.c:2634)
> > [  343.849614] filemap_fdatawrite_wbc (mm/filemap.c:386 mm/filemap.c:376)
> > [  343.850202] __filemap_fdatawrite_range (mm/filemap.c:420)
> > [  343.850791] file_write_and_wait_range (mm/filemap.c:794)
> > [  343.851108] fuse_fsync+0x6c/0x138 fuse
> > [  343.851688] vfs_fsync_range (fs/sync.c:188)
> > [  343.852002] __arm64_sys_msync (mm/msync.c:96 mm/msync.c:32 mm/msync.c:32)
> > [  343.852197] invoke_syscall.constprop.0
> > (arch/arm64/include/asm/syscall.h:61 arch/arm64/kernel/syscall.c:54)
> > [  343.852914] do_el0_svc (include/linux/thread_info.h:135
> > (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2)
> > arch/arm64/kernel/syscall.c:151 (discriminator 2))
> > [  343.853389] el0_svc (arch/arm64/include/asm/irqflags.h:82
> > (discriminator 1) arch/arm64/include/asm/irqflags.h:123 (discriminator
> > 1) arch/arm64/include/asm/irqflags.h:136 (discriminator 1)
> > arch/arm64/kernel/entry-common.c:169 (discriminator 1)
> > arch/arm64/kernel/entry-common.c:182 (discriminator 1)
> > arch/arm64/kernel/entry-common.c:880 (discriminator 1))
> > [  343.853829] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:899)
> > [  343.854350] el0t_64_sync (arch/arm64/kernel/entry.S:596)
> > [  343.854652] ---[ end trace 0000000000000000 ]---
> >
> >
> >
> > ## Source
> > * Git tree: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
> > * Project: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250723/
> > * Git sha: a933d3dc1968fcfb0ab72879ec304b1971ed1b9a
> > * Git describe: 6.16.0-rc7-next-20250723
> > * kernel version: next-20250723
> > * Architectures: arm64
> > * Toolchains: gcc-13
> > * Kconfigs: defconfig + CONFIG_ARM64_64K_PAGES=y
> > * Kconfigs: defconfig + CONFIG_ARM64_16K_PAGES=y
> >
> > ## Test
> > * Test log 1: https://qa-reports.linaro.org/api/testruns/29227309/log_file/
> > * Test log 2: https://qa-reports.linaro.org/api/testruns/29227074/log_file/
> > * Test run: https://regressions.linaro.org/lkft/linux-next-master/next-20250723/testruns/1713367/
> > * Test history:
> > https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250723/testrun/29227309/suite/log-parser-test/test/exception-warning-fsfusefile-at-fuse_iomap_writeback_range/history/
> > * Test plan: https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/30G3hpJVVdXkZKnB15v1qoQOL03
> > * Build link: https://storage.tuxsuite.com/public/linaro/lkft/builds/30G3dvSFyHHQ3E8CvKH7tjU98I6/
> > * Kernel config:
> > https://storage.tuxsuite.com/public/linaro/lkft/builds/30G3dvSFyHHQ3E8CvKH7tjU98I6/config
> >
> > --
> > Linaro LKFT
> > https://lkft.linaro.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-23 18:42   ` Joanne Koong
@ 2025-07-23 21:20     ` Darrick J. Wong
  2025-07-23 22:37       ` Joanne Koong
  2025-07-23 23:15     ` Joanne Koong
  1 sibling, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2025-07-23 21:20 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > [cc Joanne]
> >
> > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > Regressions found while running LTP msync04 tests on qemu-arm64 running
> > > Linux next-20250721, next-20250722 and next-20250723 with 16K and 64K
> > > page size enabled builds.
> > >
> > > CONFIG_ARM64_64K_PAGES=y ( kernel warning as below )
> > > CONFIG_ARM64_16K_PAGES=y ( kernel warning as below )
> > >
> > > No warning noticed with 4K page size.
> > > CONFIG_ARM64_4K_PAGES=y works as expected
> >
> > You might want to cc Joanne since she's been working on large folio
> > support in fuse.
> >
> > > First seen on the tag next-20250721.
> > > Good: next-20250718
> > > Bad:  next-20250721 to next-20250723
> 
> Thanks for the report. Is there a link to the script that mounts the
> fuse server for these tests? I'm curious whether this was mounted as a
> fuseblk filesystem.
> 
> > >
> > > Regression Analysis:
> > > - New regression? Yes
> > > - Reproducibility? Yes
> > >
> > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > fuse file.c at fuse_iomap_writeback_range
> > >
> > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > >
> > > ## Test log
> > > ------------[ cut here ]------------
> > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> >
> >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> >
> > /me speculates that this might be triggered by an attempt to write back
> > some 4k fsblock within a 16/64k base page?
> >
> 
> I think this can happen on 4k base pages as well actually. On the
> iomap side, the length passed is always block-aligned and in fuse, we
> set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> isn't true and the blocksize is initialized to a default size of 512
> or whatever block size is passed in when it's mounted.

<nod> I think you're correct.

> I'll send out a patch to remove this line. It doesn't make any
> difference for fuse_iomap_writeback_range() logic whether len is
> page-aligned or not; I had added it as a sanity-check against sketchy
> ranges.
> 
> Also, I just noticed that apparently the blocksize can change
> dynamically for an inode in fuse through getattr replies from the
> server (see fuse_change_attributes_common()). This is a problem since
> the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> think we will have to cache the inode blkbits in the iomap_folio_state
> struct unfortunately :( I'll think about this some more and send out a
> patch for this.

From my understanding of the iomap code, it's possible to do that if you
flush and unmap the entire pagecache (whilst holding i_rwsem and
mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
so I have no idea if it actually works, however.  Note that even I don't
implement the flush and unmap bit; I just scream loudly and do nothing:

void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
{
	trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);

	if (inode->i_blkbits == new_blkbits)
		return;

	if (!S_ISREG(inode->i_mode))
		goto set_it;

	/*
	 * iomap attaches per-block state to each folio, so we cannot allow
	 * the file block size to change if there's anything in the page cache.
	 * In theory, fuse servers should never be doing this.
	 */
	if (inode->i_mapping->nrpages > 0) {
		WARN_ON(inode->i_blkbits != new_blkbits &&
			inode->i_mapping->nrpages > 0);
		return;
	}

set_it:
	inode->i_blkbits = new_blkbits;
}

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884

--D

> 
> Thanks,
> Joanne
> 
> > --D
> >
> > > [  343.830969] Modules linked in: btrfs blake2b_generic xor xor_neon
> > > raid6_pq zstd_compress sm3_ce sha3_ce drm fuse backlight ip_tables
> > > x_tables
> > > [  343.833830] CPU: 0 UID: 0 PID: 4190 Comm: msync04 Not tainted
> > > 6.16.0-rc7-next-20250723 #1 PREEMPT
> > > [  343.834736] Hardware name: linux,dummy-virt (DT)
> > > [  343.835788] pstate: 03402009 (nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> > > [  343.836455] pc : fuse_iomap_writeback_range+0x478/0x558 fuse
> > > [  343.837294] lr : iomap_writeback_folio (fs/iomap/buffered-io.c:1586
> > > fs/iomap/buffered-io.c:1710)
> > > [  343.838178] sp : ffff80008b26f8d0
> > > [  343.838668] x29: ffff80008b26f8d0 x28: fff00000e7f8c800 x27: 0000000000000000
> > > [  343.839391] x26: fff00000d4b30000 x25: 0000000000000000 x24: 0000000000000000
> > > [  343.840305] x23: 0000000000000000 x22: fffffc1fc0334200 x21: 0000000000001000
> > > [  343.840928] x20: ffff80008b26fa00 x19: 0000000000000000 x18: 0000000000000000
> > > [  343.841782] x17: 0000000000000000 x16: ffffb8d3b90c67c8 x15: 0000000000000000
> > > [  343.842565] x14: ffffb8d3ba91e340 x13: 0000ffff8ff3ffff x12: 0000000000000000
> > > [  343.843002] x11: 1ffe000004b74a21 x10: fff0000025ba510c x9 : ffffb8d3b90c6308
> > > [  343.843962] x8 : ffff80008b26f788 x7 : ffffb8d365830b90 x6 : ffffb8d3bb6c9000
> > > [  343.844718] x5 : 0000000000000000 x4 : 000000000000000a x3 : 0000000000001000
> > > [  343.845333] x2 : fff00000c0b5ecc0 x1 : 000000000000ffff x0 : 0bfffe000000400b
> > > [  343.846323] Call trace:
> > > [  343.846767] fuse_iomap_writeback_range+0x478/0x558 fuse (P)
> > > [  343.847288] iomap_writeback_folio (fs/iomap/buffered-io.c:1586
> > > fs/iomap/buffered-io.c:1710)
> > > [  343.847930] iomap_writepages (fs/iomap/buffered-io.c:1762)
> > > [  343.848494] fuse_writepages+0xa0/0xe8 fuse
> > > [  343.849112] do_writepages (mm/page-writeback.c:2634)
> > > [  343.849614] filemap_fdatawrite_wbc (mm/filemap.c:386 mm/filemap.c:376)
> > > [  343.850202] __filemap_fdatawrite_range (mm/filemap.c:420)
> > > [  343.850791] file_write_and_wait_range (mm/filemap.c:794)
> > > [  343.851108] fuse_fsync+0x6c/0x138 fuse
> > > [  343.851688] vfs_fsync_range (fs/sync.c:188)
> > > [  343.852002] __arm64_sys_msync (mm/msync.c:96 mm/msync.c:32 mm/msync.c:32)
> > > [  343.852197] invoke_syscall.constprop.0
> > > (arch/arm64/include/asm/syscall.h:61 arch/arm64/kernel/syscall.c:54)
> > > [  343.852914] do_el0_svc (include/linux/thread_info.h:135
> > > (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2)
> > > arch/arm64/kernel/syscall.c:151 (discriminator 2))
> > > [  343.853389] el0_svc (arch/arm64/include/asm/irqflags.h:82
> > > (discriminator 1) arch/arm64/include/asm/irqflags.h:123 (discriminator
> > > 1) arch/arm64/include/asm/irqflags.h:136 (discriminator 1)
> > > arch/arm64/kernel/entry-common.c:169 (discriminator 1)
> > > arch/arm64/kernel/entry-common.c:182 (discriminator 1)
> > > arch/arm64/kernel/entry-common.c:880 (discriminator 1))
> > > [  343.853829] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:899)
> > > [  343.854350] el0t_64_sync (arch/arm64/kernel/entry.S:596)
> > > [  343.854652] ---[ end trace 0000000000000000 ]---
> > >
> > >
> > >
> > > ## Source
> > > * Git tree: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
> > > * Project: https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250723/
> > > * Git sha: a933d3dc1968fcfb0ab72879ec304b1971ed1b9a
> > > * Git describe: 6.16.0-rc7-next-20250723
> > > * kernel version: next-20250723
> > > * Architectures: arm64
> > > * Toolchains: gcc-13
> > > * Kconfigs: defconfig + CONFIG_ARM64_64K_PAGES=y
> > > * Kconfigs: defconfig + CONFIG_ARM64_16K_PAGES=y
> > >
> > > ## Test
> > > * Test log 1: https://qa-reports.linaro.org/api/testruns/29227309/log_file/
> > > * Test log 2: https://qa-reports.linaro.org/api/testruns/29227074/log_file/
> > > * Test run: https://regressions.linaro.org/lkft/linux-next-master/next-20250723/testruns/1713367/
> > > * Test history:
> > > https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250723/testrun/29227309/suite/log-parser-test/test/exception-warning-fsfusefile-at-fuse_iomap_writeback_range/history/
> > > * Test plan: https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/30G3hpJVVdXkZKnB15v1qoQOL03
> > > * Build link: https://storage.tuxsuite.com/public/linaro/lkft/builds/30G3dvSFyHHQ3E8CvKH7tjU98I6/
> > > * Kernel config:
> > > https://storage.tuxsuite.com/public/linaro/lkft/builds/30G3dvSFyHHQ3E8CvKH7tjU98I6/config
> > >
> > > --
> > > Linaro LKFT
> > > https://lkft.linaro.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-23 21:20     ` Darrick J. Wong
@ 2025-07-23 22:37       ` Joanne Koong
  2025-07-24 19:14         ` Joanne Koong
  0 siblings, 1 reply; 22+ messages in thread
From: Joanne Koong @ 2025-07-23 22:37 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > [cc Joanne]
> > >
> > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > Regressions found while running LTP msync04 tests on qemu-arm64 running
> > > > Linux next-20250721, next-20250722 and next-20250723 with 16K and 64K
> > > > page size enabled builds.
> > > >
> > > > CONFIG_ARM64_64K_PAGES=y ( kernel warning as below )
> > > > CONFIG_ARM64_16K_PAGES=y ( kernel warning as below )
> > > >
> > > > No warning noticed with 4K page size.
> > > > CONFIG_ARM64_4K_PAGES=y works as expected
> > >
> > > You might want to cc Joanne since she's been working on large folio
> > > support in fuse.
> > >
> > > > First seen on the tag next-20250721.
> > > > Good: next-20250718
> > > > Bad:  next-20250721 to next-20250723
> >
> > Thanks for the report. Is there a link to the script that mounts the
> > fuse server for these tests? I'm curious whether this was mounted as a
> > fuseblk filesystem.
> >
> > > >
> > > > Regression Analysis:
> > > > - New regression? Yes
> > > > - Reproducibility? Yes
> > > >
> > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > fuse file.c at fuse_iomap_writeback_range
> > > >
> > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > >
> > > > ## Test log
> > > > ------------[ cut here ]------------
> > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > >
> > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > >
> > > /me speculates that this might be triggered by an attempt to write back
> > > some 4k fsblock within a 16/64k base page?
> > >
> >
> > I think this can happen on 4k base pages as well actually. On the
> > iomap side, the length passed is always block-aligned and in fuse, we
> > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > isn't true and the blocksize is initialized to a default size of 512
> > or whatever block size is passed in when it's mounted.
>
> <nod> I think you're correct.
>
> > I'll send out a patch to remove this line. It doesn't make any
> > difference for fuse_iomap_writeback_range() logic whether len is
> > page-aligned or not; I had added it as a sanity-check against sketchy
> > ranges.
> >
> > Also, I just noticed that apparently the blocksize can change
> > dynamically for an inode in fuse through getattr replies from the
> > server (see fuse_change_attributes_common()). This is a problem since
> > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > think we will have to cache the inode blkbits in the iomap_folio_state
> > struct unfortunately :( I'll think about this some more and send out a
> > patch for this.
>
> From my understanding of the iomap code, it's possible to do that if you
> flush and unmap the entire pagecache (whilst holding i_rwsem and
> mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> so I have no idea if it actually works, however.  Note that even I don't
> implement the flush and unmap bit; I just scream loudly and do nothing:

lol! i wish I could scream loudly and do nothing too for my case.

AFAICT, I think I just need to flush and unmap that file and can leave
the rest of the files/folios in the pagecache as is? But then if the
file has active refcounts on it or has been pinned into memory, can I
still unmap and remove it from the page cache? I see the
invalidate_inode_pages2() function but my understanding is that the
page still stays in the cache if it has has active references, and if
the page gets mmaped and there's a page fault on it, it'll end up
using the preexisting old page in the page cache.

I don't think I really need to have it removed from the page cache so
much as just have the ifs state for all the folios in the file freed
(after flushing the file) so that it can start over with a new ifs.
Ideally we could just flush the file, then iterate through all the
folios in the mapping in order of ascending index, and kfree their
->private, but I'm not seeing how we can prevent the case of new
writes / a new ifs getting allocated for folios at previous indexes
while we're trying to do the iteration/kfreeing.

>
> void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> {
>         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
>
>         if (inode->i_blkbits == new_blkbits)
>                 return;
>
>         if (!S_ISREG(inode->i_mode))
>                 goto set_it;
>
>         /*
>          * iomap attaches per-block state to each folio, so we cannot allow
>          * the file block size to change if there's anything in the page cache.
>          * In theory, fuse servers should never be doing this.
>          */
>         if (inode->i_mapping->nrpages > 0) {
>                 WARN_ON(inode->i_blkbits != new_blkbits &&
>                         inode->i_mapping->nrpages > 0);
>                 return;
>         }
>
> set_it:
>         inode->i_blkbits = new_blkbits;
> }
>
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
>
> --D
>
> >
> > Thanks,
> > Joanne
> >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-23 18:42   ` Joanne Koong
  2025-07-23 21:20     ` Darrick J. Wong
@ 2025-07-23 23:15     ` Joanne Koong
  1 sibling, 0 replies; 22+ messages in thread
From: Joanne Koong @ 2025-07-23 23:15 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Wed, Jul 23, 2025 at 11:42 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > [cc Joanne]
> >
> > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > Regressions found while running LTP msync04 tests on qemu-arm64 running
> > > Linux next-20250721, next-20250722 and next-20250723 with 16K and 64K
> > > page size enabled builds.
> > >
> > > CONFIG_ARM64_64K_PAGES=y ( kernel warning as below )
> > > CONFIG_ARM64_16K_PAGES=y ( kernel warning as below )
> > >
> > > No warning noticed with 4K page size.
> > > CONFIG_ARM64_4K_PAGES=y works as expected
> >
> > You might want to cc Joanne since she's been working on large folio
> > support in fuse.
> >
> > > First seen on the tag next-20250721.
> > > Good: next-20250718
> > > Bad:  next-20250721 to next-20250723
>
> Thanks for the report. Is there a link to the script that mounts the
> fuse server for these tests? I'm curious whether this was mounted as a
> fuseblk filesystem.
>
> > >
> > > Regression Analysis:
> > > - New regression? Yes
> > > - Reproducibility? Yes
> > >
> > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > fuse file.c at fuse_iomap_writeback_range
> > >
> > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > >
> > > ## Test log
> > > ------------[ cut here ]------------
> > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> >
> >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> >
> > /me speculates that this might be triggered by an attempt to write back
> > some 4k fsblock within a 16/64k base page?
> >
>
> I think this can happen on 4k base pages as well actually. On the
> iomap side, the length passed is always block-aligned and in fuse, we
> set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> isn't true and the blocksize is initialized to a default size of 512
> or whatever block size is passed in when it's mounted.
>
> I'll send out a patch to remove this line. It doesn't make any
> difference for fuse_iomap_writeback_range() logic whether len is
> page-aligned or not; I had added it as a sanity-check against sketchy
> ranges.
>

https://lore.kernel.org/linux-fsdevel/20250723230850.2395561-1-joannelkoong@gmail.com/T/#u
is the patch for removing this

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-23 22:37       ` Joanne Koong
@ 2025-07-24 19:14         ` Joanne Koong
  2025-07-26  1:16           ` Joanne Koong
  0 siblings, 1 reply; 22+ messages in thread
From: Joanne Koong @ 2025-07-24 19:14 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > [cc Joanne]
> > > >
> > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > Regressions found while running LTP msync04 tests on qemu-arm64 running
> > > > > Linux next-20250721, next-20250722 and next-20250723 with 16K and 64K
> > > > > page size enabled builds.
> > > > >
> > > > > CONFIG_ARM64_64K_PAGES=y ( kernel warning as below )
> > > > > CONFIG_ARM64_16K_PAGES=y ( kernel warning as below )
> > > > >
> > > > > No warning noticed with 4K page size.
> > > > > CONFIG_ARM64_4K_PAGES=y works as expected
> > > >
> > > > You might want to cc Joanne since she's been working on large folio
> > > > support in fuse.
> > > >
> > > > > First seen on the tag next-20250721.
> > > > > Good: next-20250718
> > > > > Bad:  next-20250721 to next-20250723
> > >
> > > Thanks for the report. Is there a link to the script that mounts the
> > > fuse server for these tests? I'm curious whether this was mounted as a
> > > fuseblk filesystem.
> > >
> > > > >
> > > > > Regression Analysis:
> > > > > - New regression? Yes
> > > > > - Reproducibility? Yes
> > > > >
> > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > fuse file.c at fuse_iomap_writeback_range
> > > > >
> > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > >
> > > > > ## Test log
> > > > > ------------[ cut here ]------------
> > > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > >
> > > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > >
> > > > /me speculates that this might be triggered by an attempt to write back
> > > > some 4k fsblock within a 16/64k base page?
> > > >
> > >
> > > I think this can happen on 4k base pages as well actually. On the
> > > iomap side, the length passed is always block-aligned and in fuse, we
> > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > isn't true and the blocksize is initialized to a default size of 512
> > > or whatever block size is passed in when it's mounted.
> >
> > <nod> I think you're correct.
> >
> > > I'll send out a patch to remove this line. It doesn't make any
> > > difference for fuse_iomap_writeback_range() logic whether len is
> > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > ranges.
> > >
> > > Also, I just noticed that apparently the blocksize can change
> > > dynamically for an inode in fuse through getattr replies from the
> > > server (see fuse_change_attributes_common()). This is a problem since
> > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > struct unfortunately :( I'll think about this some more and send out a
> > > patch for this.
> >
> > From my understanding of the iomap code, it's possible to do that if you
> > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> > so I have no idea if it actually works, however.  Note that even I don't
> > implement the flush and unmap bit; I just scream loudly and do nothing:
>
> lol! i wish I could scream loudly and do nothing too for my case.
>
> AFAICT, I think I just need to flush and unmap that file and can leave
> the rest of the files/folios in the pagecache as is? But then if the
> file has active refcounts on it or has been pinned into memory, can I
> still unmap and remove it from the page cache? I see the
> invalidate_inode_pages2() function but my understanding is that the
> page still stays in the cache if it has has active references, and if
> the page gets mmaped and there's a page fault on it, it'll end up
> using the preexisting old page in the page cache.

Never mind, I was mistaken about this. Johannes confirmed that even if
there's active refcounts on the folio, it'll still get removed from
the page cache after unmapping and the page cache reference will get
dropped.

I think I can just do what you suggested and call
filemap_invalidate_inode() in fuse_change_attributes_common() then if
the inode blksize gets changed. Thanks for the suggestion!

>
> I don't think I really need to have it removed from the page cache so
> much as just have the ifs state for all the folios in the file freed
> (after flushing the file) so that it can start over with a new ifs.
> Ideally we could just flush the file, then iterate through all the
> folios in the mapping in order of ascending index, and kfree their
> ->private, but I'm not seeing how we can prevent the case of new
> writes / a new ifs getting allocated for folios at previous indexes
> while we're trying to do the iteration/kfreeing.
>
> >
> > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > {
> >         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> >
> >         if (inode->i_blkbits == new_blkbits)
> >                 return;
> >
> >         if (!S_ISREG(inode->i_mode))
> >                 goto set_it;
> >
> >         /*
> >          * iomap attaches per-block state to each folio, so we cannot allow
> >          * the file block size to change if there's anything in the page cache.
> >          * In theory, fuse servers should never be doing this.
> >          */
> >         if (inode->i_mapping->nrpages > 0) {
> >                 WARN_ON(inode->i_blkbits != new_blkbits &&
> >                         inode->i_mapping->nrpages > 0);
> >                 return;
> >         }
> >
> > set_it:
> >         inode->i_blkbits = new_blkbits;
> > }
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> >
> > --D
> >
> > >
> > > Thanks,
> > > Joanne
> > >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-24 19:14         ` Joanne Koong
@ 2025-07-26  1:16           ` Joanne Koong
  2025-07-28 17:14             ` Darrick J. Wong
  2025-07-28 17:34             ` Matthew Wilcox
  0 siblings, 2 replies; 22+ messages in thread
From: Joanne Koong @ 2025-07-26  1:16 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Thu, Jul 24, 2025 at 12:14 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > [cc Joanne]
> > > > >
> > > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > > Regressions found while running LTP msync04 tests on qemu-arm64 running
> > > > > > Linux next-20250721, next-20250722 and next-20250723 with 16K and 64K
> > > > > > page size enabled builds.
> > > > > >
> > > > > > CONFIG_ARM64_64K_PAGES=y ( kernel warning as below )
> > > > > > CONFIG_ARM64_16K_PAGES=y ( kernel warning as below )
> > > > > >
> > > > > > No warning noticed with 4K page size.
> > > > > > CONFIG_ARM64_4K_PAGES=y works as expected
> > > > >
> > > > > You might want to cc Joanne since she's been working on large folio
> > > > > support in fuse.
> > > > >
> > > > > > First seen on the tag next-20250721.
> > > > > > Good: next-20250718
> > > > > > Bad:  next-20250721 to next-20250723
> > > >
> > > > Thanks for the report. Is there a link to the script that mounts the
> > > > fuse server for these tests? I'm curious whether this was mounted as a
> > > > fuseblk filesystem.
> > > >
> > > > > >
> > > > > > Regression Analysis:
> > > > > > - New regression? Yes
> > > > > > - Reproducibility? Yes
> > > > > >
> > > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > > fuse file.c at fuse_iomap_writeback_range
> > > > > >
> > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > >
> > > > > > ## Test log
> > > > > > ------------[ cut here ]------------
> > > > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > > >
> > > > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > > >
> > > > > /me speculates that this might be triggered by an attempt to write back
> > > > > some 4k fsblock within a 16/64k base page?
> > > > >
> > > >
> > > > I think this can happen on 4k base pages as well actually. On the
> > > > iomap side, the length passed is always block-aligned and in fuse, we
> > > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > > isn't true and the blocksize is initialized to a default size of 512
> > > > or whatever block size is passed in when it's mounted.
> > >
> > > <nod> I think you're correct.
> > >
> > > > I'll send out a patch to remove this line. It doesn't make any
> > > > difference for fuse_iomap_writeback_range() logic whether len is
> > > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > > ranges.
> > > >
> > > > Also, I just noticed that apparently the blocksize can change
> > > > dynamically for an inode in fuse through getattr replies from the
> > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > struct unfortunately :( I'll think about this some more and send out a
> > > > patch for this.
> > >
> > > From my understanding of the iomap code, it's possible to do that if you
> > > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > > mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> > > so I have no idea if it actually works, however.  Note that even I don't
> > > implement the flush and unmap bit; I just scream loudly and do nothing:
> >
> > lol! i wish I could scream loudly and do nothing too for my case.
> >
> > AFAICT, I think I just need to flush and unmap that file and can leave
> > the rest of the files/folios in the pagecache as is? But then if the
> > file has active refcounts on it or has been pinned into memory, can I
> > still unmap and remove it from the page cache? I see the
> > invalidate_inode_pages2() function but my understanding is that the
> > page still stays in the cache if it has has active references, and if
> > the page gets mmaped and there's a page fault on it, it'll end up
> > using the preexisting old page in the page cache.
>
> Never mind, I was mistaken about this. Johannes confirmed that even if
> there's active refcounts on the folio, it'll still get removed from
> the page cache after unmapping and the page cache reference will get
> dropped.
>
> I think I can just do what you suggested and call
> filemap_invalidate_inode() in fuse_change_attributes_common() then if
> the inode blksize gets changed. Thanks for the suggestion!
>

Thinking about this some more, I don't think this works after all
because the writeback + page cache removal and inode blkbits update
needs to be atomic, else after we write back and remove the pages from
the page cache, a write could be issued right before we update the
inode blkbits. I don't think we can hold the inode lock the whole time
for it either since writeback could be intensive. (also btw, I
realized in hindsight that invalidate_inode_pages2_range() would have
been the better function to call instead of
filemap_invalidate_inode()).

> >
> > I don't think I really need to have it removed from the page cache so
> > much as just have the ifs state for all the folios in the file freed
> > (after flushing the file) so that it can start over with a new ifs.
> > Ideally we could just flush the file, then iterate through all the
> > folios in the mapping in order of ascending index, and kfree their
> > ->private, but I'm not seeing how we can prevent the case of new
> > writes / a new ifs getting allocated for folios at previous indexes
> > while we're trying to do the iteration/kfreeing.
> >

Going back to this idea, I think this can work. I realized we don't
need to flush the file, it's enough to free the ifs, then update the
inode->i_blkbits, then reallocate the ifs (which will now use the
updated blkbits size), and if we hold the inode lock throughout, that
prevents any concurrent writes.
Something like:
     inode_lock(inode);
     XA_STATE(xas, &mapping->i_pages, 0);
     xa_lock_irq(&mapping->i_pages);
     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
          folio_lock(folio);
          if (folio_test_dirty(folio)) {
                  folio_wait_writeback(folio);
                  kfree(folio->private);
          }
          folio_unlock(folio);
     }
    inode->i_blkbits = new_blkbits_size;
    xas_set(&xas, 0);
    xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
          folio_lock(folio);
          if (folio_test_dirty(folio) && !folio_test_writeback(folio))
                 folio_mark_dirty(folio);
          folio_unlock(folio);
    }
    xa_unlock_irq(&mapping->i_pages);
    inode_unlock(inode);


I think this is the only approach that doesn't require changes to iomap.

I'm going to think about this some more next week and will try to send
out a patch for this then.


Thanks,
Joanne

> > >
> > > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > > {
> > >         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> > >
> > >         if (inode->i_blkbits == new_blkbits)
> > >                 return;
> > >
> > >         if (!S_ISREG(inode->i_mode))
> > >                 goto set_it;
> > >
> > >         /*
> > >          * iomap attaches per-block state to each folio, so we cannot allow
> > >          * the file block size to change if there's anything in the page cache.
> > >          * In theory, fuse servers should never be doing this.
> > >          */
> > >         if (inode->i_mapping->nrpages > 0) {
> > >                 WARN_ON(inode->i_blkbits != new_blkbits &&
> > >                         inode->i_mapping->nrpages > 0);
> > >                 return;
> > >         }
> > >
> > > set_it:
> > >         inode->i_blkbits = new_blkbits;
> > > }
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> > >
> > > --D
> > >
> > > >
> > > > Thanks,
> > > > Joanne
> > > >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-26  1:16           ` Joanne Koong
@ 2025-07-28 17:14             ` Darrick J. Wong
  2025-07-28 17:44               ` Joanne Koong
  2025-07-28 17:34             ` Matthew Wilcox
  1 sibling, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2025-07-28 17:14 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> On Thu, Jul 24, 2025 at 12:14 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > [cc Joanne]
> > > > > >
> > > > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > > > Regressions found while running LTP msync04 tests on qemu-arm64 running
> > > > > > > Linux next-20250721, next-20250722 and next-20250723 with 16K and 64K
> > > > > > > page size enabled builds.
> > > > > > >
> > > > > > > CONFIG_ARM64_64K_PAGES=y ( kernel warning as below )
> > > > > > > CONFIG_ARM64_16K_PAGES=y ( kernel warning as below )
> > > > > > >
> > > > > > > No warning noticed with 4K page size.
> > > > > > > CONFIG_ARM64_4K_PAGES=y works as expected
> > > > > >
> > > > > > You might want to cc Joanne since she's been working on large folio
> > > > > > support in fuse.
> > > > > >
> > > > > > > First seen on the tag next-20250721.
> > > > > > > Good: next-20250718
> > > > > > > Bad:  next-20250721 to next-20250723
> > > > >
> > > > > Thanks for the report. Is there a link to the script that mounts the
> > > > > fuse server for these tests? I'm curious whether this was mounted as a
> > > > > fuseblk filesystem.
> > > > >
> > > > > > >
> > > > > > > Regression Analysis:
> > > > > > > - New regression? Yes
> > > > > > > - Reproducibility? Yes
> > > > > > >
> > > > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > > > fuse file.c at fuse_iomap_writeback_range
> > > > > > >
> > > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > > >
> > > > > > > ## Test log
> > > > > > > ------------[ cut here ]------------
> > > > > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > > > >
> > > > > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > > > >
> > > > > > /me speculates that this might be triggered by an attempt to write back
> > > > > > some 4k fsblock within a 16/64k base page?
> > > > > >
> > > > >
> > > > > I think this can happen on 4k base pages as well actually. On the
> > > > > iomap side, the length passed is always block-aligned and in fuse, we
> > > > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > > > isn't true and the blocksize is initialized to a default size of 512
> > > > > or whatever block size is passed in when it's mounted.
> > > >
> > > > <nod> I think you're correct.
> > > >
> > > > > I'll send out a patch to remove this line. It doesn't make any
> > > > > difference for fuse_iomap_writeback_range() logic whether len is
> > > > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > > > ranges.
> > > > >
> > > > > Also, I just noticed that apparently the blocksize can change
> > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > patch for this.
> > > >
> > > > From my understanding of the iomap code, it's possible to do that if you
> > > > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > > > mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> > > > so I have no idea if it actually works, however.  Note that even I don't
> > > > implement the flush and unmap bit; I just scream loudly and do nothing:
> > >
> > > lol! i wish I could scream loudly and do nothing too for my case.
> > >
> > > AFAICT, I think I just need to flush and unmap that file and can leave
> > > the rest of the files/folios in the pagecache as is? But then if the
> > > file has active refcounts on it or has been pinned into memory, can I
> > > still unmap and remove it from the page cache? I see the
> > > invalidate_inode_pages2() function but my understanding is that the
> > > page still stays in the cache if it has has active references, and if
> > > the page gets mmaped and there's a page fault on it, it'll end up
> > > using the preexisting old page in the page cache.
> >
> > Never mind, I was mistaken about this. Johannes confirmed that even if
> > there's active refcounts on the folio, it'll still get removed from
> > the page cache after unmapping and the page cache reference will get
> > dropped.
> >
> > I think I can just do what you suggested and call
> > filemap_invalidate_inode() in fuse_change_attributes_common() then if
> > the inode blksize gets changed. Thanks for the suggestion!
> >
> 
> Thinking about this some more, I don't think this works after all
> because the writeback + page cache removal and inode blkbits update
> needs to be atomic, else after we write back and remove the pages from
> the page cache, a write could be issued right before we update the
> inode blkbits. I don't think we can hold the inode lock the whole time
> for it either since writeback could be intensive. (also btw, I
> realized in hindsight that invalidate_inode_pages2_range() would have
> been the better function to call instead of
> filemap_invalidate_inode()).
> 
> > >
> > > I don't think I really need to have it removed from the page cache so
> > > much as just have the ifs state for all the folios in the file freed
> > > (after flushing the file) so that it can start over with a new ifs.
> > > Ideally we could just flush the file, then iterate through all the
> > > folios in the mapping in order of ascending index, and kfree their
> > > ->private, but I'm not seeing how we can prevent the case of new
> > > writes / a new ifs getting allocated for folios at previous indexes
> > > while we're trying to do the iteration/kfreeing.
> > >
> 
> Going back to this idea, I think this can work. I realized we don't
> need to flush the file, it's enough to free the ifs, then update the
> inode->i_blkbits, then reallocate the ifs (which will now use the
> updated blkbits size), and if we hold the inode lock throughout, that
> prevents any concurrent writes.
> Something like:
>      inode_lock(inode);
>      XA_STATE(xas, &mapping->i_pages, 0);
>      xa_lock_irq(&mapping->i_pages);
>      xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
>           folio_lock(folio);
>           if (folio_test_dirty(folio)) {
>                   folio_wait_writeback(folio);
>                   kfree(folio->private);
>           }
>           folio_unlock(folio);
>      }
>     inode->i_blkbits = new_blkbits_size;

The trouble is, you also have to resize the iomap_folio_state objects
attached to each folio if you change i_blkbits...

>     xas_set(&xas, 0);
>     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
>           folio_lock(folio);
>           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
>                  folio_mark_dirty(folio);

...because iomap_dirty_folio doesn't know how to reallocate the folio
state object in response to i_blkbits having changed.

--D

>           folio_unlock(folio);
>     }
>     xa_unlock_irq(&mapping->i_pages);
>     inode_unlock(inode);
> 
> 
> I think this is the only approach that doesn't require changes to iomap.
> 
> I'm going to think about this some more next week and will try to send
> out a patch for this then.
> 
> 
> Thanks,
> Joanne
> 
> > > >
> > > > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > > > {
> > > >         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> > > >
> > > >         if (inode->i_blkbits == new_blkbits)
> > > >                 return;
> > > >
> > > >         if (!S_ISREG(inode->i_mode))
> > > >                 goto set_it;
> > > >
> > > >         /*
> > > >          * iomap attaches per-block state to each folio, so we cannot allow
> > > >          * the file block size to change if there's anything in the page cache.
> > > >          * In theory, fuse servers should never be doing this.
> > > >          */
> > > >         if (inode->i_mapping->nrpages > 0) {
> > > >                 WARN_ON(inode->i_blkbits != new_blkbits &&
> > > >                         inode->i_mapping->nrpages > 0);
> > > >                 return;
> > > >         }
> > > >
> > > > set_it:
> > > >         inode->i_blkbits = new_blkbits;
> > > > }
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> > > >
> > > > --D
> > > >
> > > > >
> > > > > Thanks,
> > > > > Joanne
> > > > >
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-26  1:16           ` Joanne Koong
  2025-07-28 17:14             ` Darrick J. Wong
@ 2025-07-28 17:34             ` Matthew Wilcox
  2025-07-28 17:55               ` Joanne Koong
  1 sibling, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2025-07-28 17:34 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Darrick J. Wong, Naresh Kamboju, linux-fsdevel, linux-mm,
	linux-xfs, open list, lkft-triage, Linux Regressions,
	Miklos Szeredi, Jan Kara, Andrew Morton, Christian Brauner,
	Lorenzo Stoakes, Liam R. Howlett, Arnd Bergmann, Dan Carpenter,
	Anders Roxell, Ben Copeland

On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > > > > Also, I just noticed that apparently the blocksize can change
> > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > patch for this.

Does this actually happen in practice, once you've started _using_ the
block device?  Rather than all this complicated stuff to invalidate the
page cache based on the fuse server telling us something, maybe just
declare the server to be misbehaving and shut the whole filesystem down?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-28 17:14             ` Darrick J. Wong
@ 2025-07-28 17:44               ` Joanne Koong
  2025-07-28 19:11                 ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Joanne Koong @ 2025-07-28 17:44 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Mon, Jul 28, 2025 at 10:14 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > On Thu, Jul 24, 2025 at 12:14 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > > > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > >
> > > > > > > [cc Joanne]
> > > > > > >
> > > > > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > > > > Regressions found while running LTP msync04 tests on qemu-arm64 running
> > > > > > > > Linux next-20250721, next-20250722 and next-20250723 with 16K and 64K
> > > > > > > > page size enabled builds.
> > > > > > > >
> > > > > > > > CONFIG_ARM64_64K_PAGES=y ( kernel warning as below )
> > > > > > > > CONFIG_ARM64_16K_PAGES=y ( kernel warning as below )
> > > > > > > >
> > > > > > > > No warning noticed with 4K page size.
> > > > > > > > CONFIG_ARM64_4K_PAGES=y works as expected
> > > > > > >
> > > > > > > You might want to cc Joanne since she's been working on large folio
> > > > > > > support in fuse.
> > > > > > >
> > > > > > > > First seen on the tag next-20250721.
> > > > > > > > Good: next-20250718
> > > > > > > > Bad:  next-20250721 to next-20250723
> > > > > >
> > > > > > Thanks for the report. Is there a link to the script that mounts the
> > > > > > fuse server for these tests? I'm curious whether this was mounted as a
> > > > > > fuseblk filesystem.
> > > > > >
> > > > > > > >
> > > > > > > > Regression Analysis:
> > > > > > > > - New regression? Yes
> > > > > > > > - Reproducibility? Yes
> > > > > > > >
> > > > > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > > > > fuse file.c at fuse_iomap_writeback_range
> > > > > > > >
> > > > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > > > >
> > > > > > > > ## Test log
> > > > > > > > ------------[ cut here ]------------
> > > > > > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > > > > >
> > > > > > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > > > > >
> > > > > > > /me speculates that this might be triggered by an attempt to write back
> > > > > > > some 4k fsblock within a 16/64k base page?
> > > > > > >
> > > > > >
> > > > > > I think this can happen on 4k base pages as well actually. On the
> > > > > > iomap side, the length passed is always block-aligned and in fuse, we
> > > > > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > > > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > > > > isn't true and the blocksize is initialized to a default size of 512
> > > > > > or whatever block size is passed in when it's mounted.
> > > > >
> > > > > <nod> I think you're correct.
> > > > >
> > > > > > I'll send out a patch to remove this line. It doesn't make any
> > > > > > difference for fuse_iomap_writeback_range() logic whether len is
> > > > > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > > > > ranges.
> > > > > >
> > > > > > Also, I just noticed that apparently the blocksize can change
> > > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > > patch for this.
> > > > >
> > > > > From my understanding of the iomap code, it's possible to do that if you
> > > > > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > > > > mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> > > > > so I have no idea if it actually works, however.  Note that even I don't
> > > > > implement the flush and unmap bit; I just scream loudly and do nothing:
> > > >
> > > > lol! i wish I could scream loudly and do nothing too for my case.
> > > >
> > > > AFAICT, I think I just need to flush and unmap that file and can leave
> > > > the rest of the files/folios in the pagecache as is? But then if the
> > > > file has active refcounts on it or has been pinned into memory, can I
> > > > still unmap and remove it from the page cache? I see the
> > > > invalidate_inode_pages2() function but my understanding is that the
> > > > page still stays in the cache if it has has active references, and if
> > > > the page gets mmaped and there's a page fault on it, it'll end up
> > > > using the preexisting old page in the page cache.
> > >
> > > Never mind, I was mistaken about this. Johannes confirmed that even if
> > > there's active refcounts on the folio, it'll still get removed from
> > > the page cache after unmapping and the page cache reference will get
> > > dropped.
> > >
> > > I think I can just do what you suggested and call
> > > filemap_invalidate_inode() in fuse_change_attributes_common() then if
> > > the inode blksize gets changed. Thanks for the suggestion!
> > >
> >
> > Thinking about this some more, I don't think this works after all
> > because the writeback + page cache removal and inode blkbits update
> > needs to be atomic, else after we write back and remove the pages from
> > the page cache, a write could be issued right before we update the
> > inode blkbits. I don't think we can hold the inode lock the whole time
> > for it either since writeback could be intensive. (also btw, I
> > realized in hindsight that invalidate_inode_pages2_range() would have
> > been the better function to call instead of
> > filemap_invalidate_inode()).
> >
> > > >
> > > > I don't think I really need to have it removed from the page cache so
> > > > much as just have the ifs state for all the folios in the file freed
> > > > (after flushing the file) so that it can start over with a new ifs.
> > > > Ideally we could just flush the file, then iterate through all the
> > > > folios in the mapping in order of ascending index, and kfree their
> > > > ->private, but I'm not seeing how we can prevent the case of new
> > > > writes / a new ifs getting allocated for folios at previous indexes
> > > > while we're trying to do the iteration/kfreeing.
> > > >
> >
> > Going back to this idea, I think this can work. I realized we don't
> > need to flush the file, it's enough to free the ifs, then update the
> > inode->i_blkbits, then reallocate the ifs (which will now use the
> > updated blkbits size), and if we hold the inode lock throughout, that
> > prevents any concurrent writes.
> > Something like:
> >      inode_lock(inode);
> >      XA_STATE(xas, &mapping->i_pages, 0);
> >      xa_lock_irq(&mapping->i_pages);
> >      xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> >           folio_lock(folio);
> >           if (folio_test_dirty(folio)) {
> >                   folio_wait_writeback(folio);
> >                   kfree(folio->private);
> >           }
> >           folio_unlock(folio);
> >      }
> >     inode->i_blkbits = new_blkbits_size;
>
> The trouble is, you also have to resize the iomap_folio_state objects
> attached to each folio if you change i_blkbits...

I think the iomap_folio_state objects automatically get resized here,
no? We first kfree the folio->private which kfrees the entire ifs,
then we change inode->i_blkbits to the new size, then when we call
folio_mark_dirty(), it'll create the new ifs which creates a new folio
state object using the new/updated i_blkbits size

>
> >     xas_set(&xas, 0);
> >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> >           folio_lock(folio);
> >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> >                  folio_mark_dirty(folio);
>
> ...because iomap_dirty_folio doesn't know how to reallocate the folio
> state object in response to i_blkbits having changed.
>
> --D
>
> >           folio_unlock(folio);
> >     }
> >     xa_unlock_irq(&mapping->i_pages);
> >     inode_unlock(inode);
> >
> >
> > I think this is the only approach that doesn't require changes to iomap.
> >
> > I'm going to think about this some more next week and will try to send
> > out a patch for this then.
> >
> >
> > Thanks,
> > Joanne
> >
> > > > >
> > > > > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > > > > {
> > > > >         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> > > > >
> > > > >         if (inode->i_blkbits == new_blkbits)
> > > > >                 return;
> > > > >
> > > > >         if (!S_ISREG(inode->i_mode))
> > > > >                 goto set_it;
> > > > >
> > > > >         /*
> > > > >          * iomap attaches per-block state to each folio, so we cannot allow
> > > > >          * the file block size to change if there's anything in the page cache.
> > > > >          * In theory, fuse servers should never be doing this.
> > > > >          */
> > > > >         if (inode->i_mapping->nrpages > 0) {
> > > > >                 WARN_ON(inode->i_blkbits != new_blkbits &&
> > > > >                         inode->i_mapping->nrpages > 0);
> > > > >                 return;
> > > > >         }
> > > > >
> > > > > set_it:
> > > > >         inode->i_blkbits = new_blkbits;
> > > > > }
> > > > >
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> > > > >
> > > > > --D
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Joanne
> > > > > >
> >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-28 17:34             ` Matthew Wilcox
@ 2025-07-28 17:55               ` Joanne Koong
  2025-07-28 18:43                 ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Joanne Koong @ 2025-07-28 17:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Naresh Kamboju, linux-fsdevel, linux-mm,
	linux-xfs, open list, lkft-triage, Linux Regressions,
	Miklos Szeredi, Jan Kara, Andrew Morton, Christian Brauner,
	Lorenzo Stoakes, Liam R. Howlett, Arnd Bergmann, Dan Carpenter,
	Anders Roxell, Ben Copeland

On Mon, Jul 28, 2025 at 10:34 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > > > > > Also, I just noticed that apparently the blocksize can change
> > > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > > patch for this.
>
> Does this actually happen in practice, once you've started _using_ the
> block device?  Rather than all this complicated stuff to invalidate the
> page cache based on the fuse server telling us something, maybe just
> declare the server to be misbehaving and shut the whole filesystem down?
>

I don't think this case is likely at all but I guess one scenario
where the server might want to change the block size midway through is
if they send the data to some network filesystem on the backend and if
that backend shuts down or is at full capacity for whatever reason and
they need to migrate to another backend that uses a different block
size then I guess this would be useful for that.

fuse currently does allow the block size to be changed dynamically so
I'm not sure if we can change that behavior without breaking backwards
compatibility.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-28 17:55               ` Joanne Koong
@ 2025-07-28 18:43                 ` Darrick J. Wong
  0 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2025-07-28 18:43 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Matthew Wilcox, Naresh Kamboju, linux-fsdevel, linux-mm,
	linux-xfs, open list, lkft-triage, Linux Regressions,
	Miklos Szeredi, Jan Kara, Andrew Morton, Christian Brauner,
	Lorenzo Stoakes, Liam R. Howlett, Arnd Bergmann, Dan Carpenter,
	Anders Roxell, Ben Copeland

On Mon, Jul 28, 2025 at 10:55:42AM -0700, Joanne Koong wrote:
> On Mon, Jul 28, 2025 at 10:34 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > > > > > > Also, I just noticed that apparently the blocksize can change
> > > > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > > > patch for this.
> >
> > Does this actually happen in practice, once you've started _using_ the
> > block device?  Rather than all this complicated stuff to invalidate the

For most block device filesystems?  No.  And as far as I can tell, none
of the filesystems actually support changing i_blkbits on the fly; I
think only block devices can do that:

$ git grep 'i_blkbits\s=\s'
block/bdev.c:150:       BD_INODE(bdev)->i_blkbits = blksize_bits(bsize);
block/bdev.c:209:               inode->i_blkbits = blksize_bits(size);
fs/ceph/inode.c:81:     inode->i_blkbits = CEPH_FSCRYPT_BLOCK_SHIFT;
fs/ceph/inode.c:1071:           inode->i_blkbits = CEPH_FSCRYPT_BLOCK_SHIFT;
fs/ceph/inode.c:1076:           inode->i_blkbits = CEPH_BLOCK_SHIFT;
fs/ceph/inode.c:1180:           inode->i_blkbits = PAGE_SHIFT;
fs/direct-io.c:612:     unsigned int i_blkbits = sdio->blkbits + sdio->blkfactor;
fs/direct-io.c:908:     const unsigned i_blkbits = blkbits + sdio->blkfactor;
fs/direct-io.c:1110:    unsigned i_blkbits = READ_ONCE(inode->i_blkbits);
fs/erofs/fscache.c:527: inode->i_blkbits = EROFS_SB(sb)->blkszbits;
fs/fuse/file_iomap.c:2327:      inode->i_blkbits = new_blkbits;
fs/fuse/inode.c:304:            inode->i_blkbits = new_blkbits;
fs/inode.c:234: inode->i_blkbits = sb->s_blocksize_bits;
fs/libfs.c:1761:        inode->i_blkbits = PAGE_SHIFT;
fs/ocfs2/aops.c:2123:   unsigned int i_blkbits = inode->i_sb->s_blocksize_bits;
fs/orangefs/orangefs-utils.c:320:                       inode->i_blkbits = ffs(new_op->downcall.resp.getattr.
fs/smb/client/cifsfs.c:411:     cifs_inode->netfs.inode.i_blkbits = 14;  /* 2**14 = CIFS_MAX_MSGSIZE */
fs/stack.c:72:  dest->i_blkbits = src->i_blkbits;
fs/vboxsf/utils.c:123:  inode->i_blkbits = 12;

> > page cache based on the fuse server telling us something, maybe just
> > declare the server to be misbehaving and shut the whole filesystem down?
> >
> 
> I don't think this case is likely at all but I guess one scenario
> where the server might want to change the block size midway through is
> if they send the data to some network filesystem on the backend and if
> that backend shuts down or is at full capacity for whatever reason and
> they need to migrate to another backend that uses a different block
> size then I guess this would be useful for that.

Maybe, but it would still be pretty extraordinary to change the block
size on an open file -- any program that tries to do its IO in blocks
(i.e. not a byte stream) has already stat'd the file and will be very
confused.

> fuse currently does allow the block size to be changed dynamically so
> I'm not sure if we can change that behavior without breaking backwards
> compatibility.

<nod> For fuse+iomap I'm not going to allow it initially if there's
anything in the pagecache ... but I could be talked into it.

--D

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-28 17:44               ` Joanne Koong
@ 2025-07-28 19:11                 ` Darrick J. Wong
  2025-07-28 21:28                   ` Joanne Koong
  0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2025-07-28 19:11 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Mon, Jul 28, 2025 at 10:44:01AM -0700, Joanne Koong wrote:
> On Mon, Jul 28, 2025 at 10:14 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > > On Thu, Jul 24, 2025 at 12:14 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > > > > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > >
> > > > > > > > [cc Joanne]
> > > > > > > >
> > > > > > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > > > > > Regressions found while running LTP msync04 tests on qemu-arm64 running
> > > > > > > > > Linux next-20250721, next-20250722 and next-20250723 with 16K and 64K
> > > > > > > > > page size enabled builds.
> > > > > > > > >
> > > > > > > > > CONFIG_ARM64_64K_PAGES=y ( kernel warning as below )
> > > > > > > > > CONFIG_ARM64_16K_PAGES=y ( kernel warning as below )
> > > > > > > > >
> > > > > > > > > No warning noticed with 4K page size.
> > > > > > > > > CONFIG_ARM64_4K_PAGES=y works as expected
> > > > > > > >
> > > > > > > > You might want to cc Joanne since she's been working on large folio
> > > > > > > > support in fuse.
> > > > > > > >
> > > > > > > > > First seen on the tag next-20250721.
> > > > > > > > > Good: next-20250718
> > > > > > > > > Bad:  next-20250721 to next-20250723
> > > > > > >
> > > > > > > Thanks for the report. Is there a link to the script that mounts the
> > > > > > > fuse server for these tests? I'm curious whether this was mounted as a
> > > > > > > fuseblk filesystem.
> > > > > > >
> > > > > > > > >
> > > > > > > > > Regression Analysis:
> > > > > > > > > - New regression? Yes
> > > > > > > > > - Reproducibility? Yes
> > > > > > > > >
> > > > > > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > > > > > fuse file.c at fuse_iomap_writeback_range
> > > > > > > > >
> > > > > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > > > > >
> > > > > > > > > ## Test log
> > > > > > > > > ------------[ cut here ]------------
> > > > > > > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > > > > > >
> > > > > > > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > > > > > >
> > > > > > > > /me speculates that this might be triggered by an attempt to write back
> > > > > > > > some 4k fsblock within a 16/64k base page?
> > > > > > > >
> > > > > > >
> > > > > > > I think this can happen on 4k base pages as well actually. On the
> > > > > > > iomap side, the length passed is always block-aligned and in fuse, we
> > > > > > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > > > > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > > > > > isn't true and the blocksize is initialized to a default size of 512
> > > > > > > or whatever block size is passed in when it's mounted.
> > > > > >
> > > > > > <nod> I think you're correct.
> > > > > >
> > > > > > > I'll send out a patch to remove this line. It doesn't make any
> > > > > > > difference for fuse_iomap_writeback_range() logic whether len is
> > > > > > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > > > > > ranges.
> > > > > > >
> > > > > > > Also, I just noticed that apparently the blocksize can change
> > > > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > > > patch for this.
> > > > > >
> > > > > > From my understanding of the iomap code, it's possible to do that if you
> > > > > > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > > > > > mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> > > > > > so I have no idea if it actually works, however.  Note that even I don't
> > > > > > implement the flush and unmap bit; I just scream loudly and do nothing:
> > > > >
> > > > > lol! i wish I could scream loudly and do nothing too for my case.
> > > > >
> > > > > AFAICT, I think I just need to flush and unmap that file and can leave
> > > > > the rest of the files/folios in the pagecache as is? But then if the
> > > > > file has active refcounts on it or has been pinned into memory, can I
> > > > > still unmap and remove it from the page cache? I see the
> > > > > invalidate_inode_pages2() function but my understanding is that the
> > > > > page still stays in the cache if it has has active references, and if
> > > > > the page gets mmaped and there's a page fault on it, it'll end up
> > > > > using the preexisting old page in the page cache.
> > > >
> > > > Never mind, I was mistaken about this. Johannes confirmed that even if
> > > > there's active refcounts on the folio, it'll still get removed from
> > > > the page cache after unmapping and the page cache reference will get
> > > > dropped.
> > > >
> > > > I think I can just do what you suggested and call
> > > > filemap_invalidate_inode() in fuse_change_attributes_common() then if
> > > > the inode blksize gets changed. Thanks for the suggestion!
> > > >
> > >
> > > Thinking about this some more, I don't think this works after all
> > > because the writeback + page cache removal and inode blkbits update
> > > needs to be atomic, else after we write back and remove the pages from
> > > the page cache, a write could be issued right before we update the
> > > inode blkbits. I don't think we can hold the inode lock the whole time
> > > for it either since writeback could be intensive. (also btw, I
> > > realized in hindsight that invalidate_inode_pages2_range() would have
> > > been the better function to call instead of
> > > filemap_invalidate_inode()).
> > >
> > > > >
> > > > > I don't think I really need to have it removed from the page cache so
> > > > > much as just have the ifs state for all the folios in the file freed
> > > > > (after flushing the file) so that it can start over with a new ifs.
> > > > > Ideally we could just flush the file, then iterate through all the
> > > > > folios in the mapping in order of ascending index, and kfree their
> > > > > ->private, but I'm not seeing how we can prevent the case of new
> > > > > writes / a new ifs getting allocated for folios at previous indexes
> > > > > while we're trying to do the iteration/kfreeing.
> > > > >
> > >
> > > Going back to this idea, I think this can work. I realized we don't
> > > need to flush the file, it's enough to free the ifs, then update the
> > > inode->i_blkbits, then reallocate the ifs (which will now use the
> > > updated blkbits size), and if we hold the inode lock throughout, that
> > > prevents any concurrent writes.
> > > Something like:
> > >      inode_lock(inode);
> > >      XA_STATE(xas, &mapping->i_pages, 0);
> > >      xa_lock_irq(&mapping->i_pages);
> > >      xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > >           folio_lock(folio);
> > >           if (folio_test_dirty(folio)) {
> > >                   folio_wait_writeback(folio);
> > >                   kfree(folio->private);
> > >           }

Heh, I didn't even see this chunk, distracted as I am today. :/

So this doesn't actually /initiate/ writeback, it just waits
(potentially for a long time) for someone else to come along and do it.
That might not be what you want since the blocksize change will appear
to stall while nothing else is going on in the system.

Also, unless you're going to put this in buffered-io.c, it's not
desirable for a piece of code to free something it didn't allocate.
IOWs, I don't think it's a good idea for *fuse* to go messing with a
folio->private that iomap set.

> > >           folio_unlock(folio);
> > >      }
> > >     inode->i_blkbits = new_blkbits_size;
> >
> > The trouble is, you also have to resize the iomap_folio_state objects
> > attached to each folio if you change i_blkbits...
> 
> I think the iomap_folio_state objects automatically get resized here,
> no? We first kfree the folio->private which kfrees the entire ifs,

Err, right, it does free the ifs and recreate it later if necessary.

> then we change inode->i_blkbits to the new size, then when we call
> folio_mark_dirty(), it'll create the new ifs which creates a new folio
> state object using the new/updated i_blkbits size
> 
> >
> > >     xas_set(&xas, 0);
> > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > >           folio_lock(folio);
> > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > >                  folio_mark_dirty(folio);
> >
> > ...because iomap_dirty_folio doesn't know how to reallocate the folio
> > state object in response to i_blkbits having changed.

Also, what about clean folios that have an ifs?  You'd still need to
handle the ifs's attached to those.

So I guess if you wanted iomap to handle a blocksize change, you could
do something like:

iomap_change_file_blocksize(inode, new_blkbits) {
	inode_lock()
	filemap_invalidate_lock()

	inode_dio_wait()
	filemap_write_and_wait()
	if (new_blkbits > mapping_min_folio_order()) {
		truncate_pagecache()
		inode->i_blkbits = new_blkbits;
	} else {
		inode->i_blkbits = new_blkbits;
		xas_for_each(...) {
			<create new ifs>
			<translate uptodate/dirty state to new ifs>
			<swap ifs>
			<free old ifs>
		}
	}

	filemap_invalidate_unlock()
	inode_unlock()
}

--D

> > --D
> >
> > >           folio_unlock(folio);
> > >     }
> > >     xa_unlock_irq(&mapping->i_pages);
> > >     inode_unlock(inode);
> > >
> > >
> > > I think this is the only approach that doesn't require changes to iomap.
> > >
> > > I'm going to think about this some more next week and will try to send
> > > out a patch for this then.
> > >
> > >
> > > Thanks,
> > > Joanne
> > >
> > > > > >
> > > > > > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > > > > > {
> > > > > >         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> > > > > >
> > > > > >         if (inode->i_blkbits == new_blkbits)
> > > > > >                 return;
> > > > > >
> > > > > >         if (!S_ISREG(inode->i_mode))
> > > > > >                 goto set_it;
> > > > > >
> > > > > >         /*
> > > > > >          * iomap attaches per-block state to each folio, so we cannot allow
> > > > > >          * the file block size to change if there's anything in the page cache.
> > > > > >          * In theory, fuse servers should never be doing this.
> > > > > >          */
> > > > > >         if (inode->i_mapping->nrpages > 0) {
> > > > > >                 WARN_ON(inode->i_blkbits != new_blkbits &&
> > > > > >                         inode->i_mapping->nrpages > 0);
> > > > > >                 return;
> > > > > >         }
> > > > > >
> > > > > > set_it:
> > > > > >         inode->i_blkbits = new_blkbits;
> > > > > > }
> > > > > >
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> > > > > >
> > > > > > --D
> > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Joanne
> > > > > > >
> > >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-28 19:11                 ` Darrick J. Wong
@ 2025-07-28 21:28                   ` Joanne Koong
  2025-07-29 20:21                     ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Joanne Koong @ 2025-07-28 21:28 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Mon, Jul 28, 2025 at 12:11 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Mon, Jul 28, 2025 at 10:44:01AM -0700, Joanne Koong wrote:
> > On Mon, Jul 28, 2025 at 10:14 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > > > On Thu, Jul 24, 2025 at 12:14 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > >
> > > > > > > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > > > > > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > [cc Joanne]
> > > > > > > > >
> > > > > > > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > > > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > > > > > > fuse file.c at fuse_iomap_writeback_range
> > > > > > > > > >
> > > > > > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > > > > > >
> > > > > > > > > > ## Test log
> > > > > > > > > > ------------[ cut here ]------------
> > > > > > > > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > > > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > > > > > > >
> > > > > > > > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > > > > > > >
> > > > > > > > > /me speculates that this might be triggered by an attempt to write back
> > > > > > > > > some 4k fsblock within a 16/64k base page?
> > > > > > > > >
> > > > > > > >
> > > > > > > > I think this can happen on 4k base pages as well actually. On the
> > > > > > > > iomap side, the length passed is always block-aligned and in fuse, we
> > > > > > > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > > > > > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > > > > > > isn't true and the blocksize is initialized to a default size of 512
> > > > > > > > or whatever block size is passed in when it's mounted.
> > > > > > >
> > > > > > > <nod> I think you're correct.
> > > > > > >
> > > > > > > > I'll send out a patch to remove this line. It doesn't make any
> > > > > > > > difference for fuse_iomap_writeback_range() logic whether len is
> > > > > > > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > > > > > > ranges.
> > > > > > > >
> > > > > > > > Also, I just noticed that apparently the blocksize can change
> > > > > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > > > > patch for this.
> > > > > > >
> > > > > > > From my understanding of the iomap code, it's possible to do that if you
> > > > > > > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > > > > > > mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> > > > > > > so I have no idea if it actually works, however.  Note that even I don't
> > > > > > > implement the flush and unmap bit; I just scream loudly and do nothing:
> > > > > >
> > > > > > lol! i wish I could scream loudly and do nothing too for my case.
> > > > > >
> > > > > > AFAICT, I think I just need to flush and unmap that file and can leave
> > > > > > the rest of the files/folios in the pagecache as is? But then if the
> > > > > > file has active refcounts on it or has been pinned into memory, can I
> > > > > > still unmap and remove it from the page cache? I see the
> > > > > > invalidate_inode_pages2() function but my understanding is that the
> > > > > > page still stays in the cache if it has has active references, and if
> > > > > > the page gets mmaped and there's a page fault on it, it'll end up
> > > > > > using the preexisting old page in the page cache.
> > > > >
> > > > > Never mind, I was mistaken about this. Johannes confirmed that even if
> > > > > there's active refcounts on the folio, it'll still get removed from
> > > > > the page cache after unmapping and the page cache reference will get
> > > > > dropped.
> > > > >
> > > > > I think I can just do what you suggested and call
> > > > > filemap_invalidate_inode() in fuse_change_attributes_common() then if
> > > > > the inode blksize gets changed. Thanks for the suggestion!
> > > > >
> > > >
> > > > Thinking about this some more, I don't think this works after all
> > > > because the writeback + page cache removal and inode blkbits update
> > > > needs to be atomic, else after we write back and remove the pages from
> > > > the page cache, a write could be issued right before we update the
> > > > inode blkbits. I don't think we can hold the inode lock the whole time
> > > > for it either since writeback could be intensive. (also btw, I
> > > > realized in hindsight that invalidate_inode_pages2_range() would have
> > > > been the better function to call instead of
> > > > filemap_invalidate_inode()).
> > > >
> > > > > >
> > > > > > I don't think I really need to have it removed from the page cache so
> > > > > > much as just have the ifs state for all the folios in the file freed
> > > > > > (after flushing the file) so that it can start over with a new ifs.
> > > > > > Ideally we could just flush the file, then iterate through all the
> > > > > > folios in the mapping in order of ascending index, and kfree their
> > > > > > ->private, but I'm not seeing how we can prevent the case of new
> > > > > > writes / a new ifs getting allocated for folios at previous indexes
> > > > > > while we're trying to do the iteration/kfreeing.
> > > > > >
> > > >
> > > > Going back to this idea, I think this can work. I realized we don't
> > > > need to flush the file, it's enough to free the ifs, then update the
> > > > inode->i_blkbits, then reallocate the ifs (which will now use the
> > > > updated blkbits size), and if we hold the inode lock throughout, that
> > > > prevents any concurrent writes.
> > > > Something like:
> > > >      inode_lock(inode);
> > > >      XA_STATE(xas, &mapping->i_pages, 0);
> > > >      xa_lock_irq(&mapping->i_pages);
> > > >      xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > >           folio_lock(folio);
> > > >           if (folio_test_dirty(folio)) {
> > > >                   folio_wait_writeback(folio);
> > > >                   kfree(folio->private);
> > > >           }
>
> Heh, I didn't even see this chunk, distracted as I am today. :/
>
> So this doesn't actually /initiate/ writeback, it just waits
> (potentially for a long time) for someone else to come along and do it.
> That might not be what you want since the blocksize change will appear
> to stall while nothing else is going on in the system.

I thought if the folio isn't under writeback then
folio_wait_writeback() just returns immediately as a no-op.
I don't think we need/want to initiate writeback, I think we only need
to ensure that if it is already under writeback, that writeback
finishes while it uses the old i_blksize so nothing gets corrupted. As
I understand it (but maybe I'm misjudging this), holding the inode
lock and then initiating writeback is too much given that writeback
can take a long time (eg if the fuse server writes the data over some
network).

>
> Also, unless you're going to put this in buffered-io.c, it's not
> desirable for a piece of code to free something it didn't allocate.
> IOWs, I don't think it's a good idea for *fuse* to go messing with a
> folio->private that iomap set.

Okay, good point. I agree. I was hoping to have this not bleed into
the iomap library but maybe there's no getting around that in a good
way.

>
> > > >           folio_unlock(folio);
> > > >      }
> > > >     inode->i_blkbits = new_blkbits_size;
> > >
> > > The trouble is, you also have to resize the iomap_folio_state objects
> > > attached to each folio if you change i_blkbits...
> >
> > I think the iomap_folio_state objects automatically get resized here,
> > no? We first kfree the folio->private which kfrees the entire ifs,
>
> Err, right, it does free the ifs and recreate it later if necessary.
>
> > then we change inode->i_blkbits to the new size, then when we call
> > folio_mark_dirty(), it'll create the new ifs which creates a new folio
> > state object using the new/updated i_blkbits size
> >
> > >
> > > >     xas_set(&xas, 0);
> > > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > >           folio_lock(folio);
> > > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > > >                  folio_mark_dirty(folio);
> > >
> > > ...because iomap_dirty_folio doesn't know how to reallocate the folio
> > > state object in response to i_blkbits having changed.
>
> Also, what about clean folios that have an ifs?  You'd still need to
> handle the ifs's attached to those.

Ah you're right, there could be clean folios there too that have an
ifs. I think in the above logic, if we iterate through all
mapping->i_pages (not just PAGECACHE_TAG_DIRTY marked ones) and move
the kfree to after the "if (folio_test_dirty(folio))" block, then it
addresses that case. eg something like this:

     inode_lock(inode);
     XA_STATE(xas, &mapping->i_pages, 0);
     xa_lock_irq(&mapping->i_pages);
     xas_for_each(&xas, folio, ULONG_MAX) {
          folio_lock(folio);
          if (folio_test_dirty(folio))
                  folio_wait_writeback(folio);
          kfree(folio->private);
          folio_unlock(folio);
     }
    inode->i_blkbits = new_blkbits;
    xas_set(&xas, 0);
    xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
          folio_lock(folio);
          if (folio_test_dirty(folio) && !folio_test_writeback(folio))
                 folio_mark_dirty(folio);
          folio_unlock(folio);
    }
    xa_unlock_irq(&mapping->i_pages);
    inode_unlock(inode);


>
> So I guess if you wanted iomap to handle a blocksize change, you could
> do something like:
>
> iomap_change_file_blocksize(inode, new_blkbits) {
>         inode_lock()
>         filemap_invalidate_lock()
>
>         inode_dio_wait()
>         filemap_write_and_wait()
>         if (new_blkbits > mapping_min_folio_order()) {
>                 truncate_pagecache()
>                 inode->i_blkbits = new_blkbits;
>         } else {
>                 inode->i_blkbits = new_blkbits;
>                 xas_for_each(...) {
>                         <create new ifs>
>                         <translate uptodate/dirty state to new ifs>
>                         <swap ifs>
>                         <free old ifs>
>                 }
>         }
>
>         filemap_invalidate_unlock()
>         inode_unlock()
> }

Do you prefer this logic to the one above that walks through
&mapping->i_pages? If so, then I'll go with this way.
The part I'm unsure about is that this logic seems more disruptive (eg
initiating writeback while holding the inode lock and doing work for
unmapping/page cache removal) than the other approach, but I guess
this is also rare enough that it doesn't matter much.


Thanks,
Joanne

>
> --D
>
> > > --D
> > >
> > > >           folio_unlock(folio);
> > > >     }
> > > >     xa_unlock_irq(&mapping->i_pages);
> > > >     inode_unlock(inode);
> > > >
> > > >
> > > > I think this is the only approach that doesn't require changes to iomap.
> > > >
> > > > I'm going to think about this some more next week and will try to send
> > > > out a patch for this then.
> > > >
> > > >
> > > > Thanks,
> > > > Joanne
> > > >
> > > > > > >
> > > > > > > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > > > > > > {
> > > > > > >         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> > > > > > >
> > > > > > >         if (inode->i_blkbits == new_blkbits)
> > > > > > >                 return;
> > > > > > >
> > > > > > >         if (!S_ISREG(inode->i_mode))
> > > > > > >                 goto set_it;
> > > > > > >
> > > > > > >         /*
> > > > > > >          * iomap attaches per-block state to each folio, so we cannot allow
> > > > > > >          * the file block size to change if there's anything in the page cache.
> > > > > > >          * In theory, fuse servers should never be doing this.
> > > > > > >          */
> > > > > > >         if (inode->i_mapping->nrpages > 0) {
> > > > > > >                 WARN_ON(inode->i_blkbits != new_blkbits &&
> > > > > > >                         inode->i_mapping->nrpages > 0);
> > > > > > >                 return;
> > > > > > >         }
> > > > > > >
> > > > > > > set_it:
> > > > > > >         inode->i_blkbits = new_blkbits;
> > > > > > > }
> > > > > > >
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> > > > > > >
> > > > > > > --D
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Joanne
> > > > > > > >
> > > >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-28 21:28                   ` Joanne Koong
@ 2025-07-29 20:21                     ` Darrick J. Wong
  2025-07-29 23:23                       ` Joanne Koong
  0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2025-07-29 20:21 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Mon, Jul 28, 2025 at 02:28:31PM -0700, Joanne Koong wrote:
> On Mon, Jul 28, 2025 at 12:11 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Mon, Jul 28, 2025 at 10:44:01AM -0700, Joanne Koong wrote:
> > > On Mon, Jul 28, 2025 at 10:14 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > > > > On Thu, Jul 24, 2025 at 12:14 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > > > > > > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > > >
> > > > > > > > > > [cc Joanne]
> > > > > > > > > >
> > > > > > > > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > > > > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > > > > > > > fuse file.c at fuse_iomap_writeback_range
> > > > > > > > > > >
> > > > > > > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > > > > > > >
> > > > > > > > > > > ## Test log
> > > > > > > > > > > ------------[ cut here ]------------
> > > > > > > > > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > > > > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > > > > > > > >
> > > > > > > > > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > > > > > > > >
> > > > > > > > > > /me speculates that this might be triggered by an attempt to write back
> > > > > > > > > > some 4k fsblock within a 16/64k base page?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I think this can happen on 4k base pages as well actually. On the
> > > > > > > > > iomap side, the length passed is always block-aligned and in fuse, we
> > > > > > > > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > > > > > > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > > > > > > > isn't true and the blocksize is initialized to a default size of 512
> > > > > > > > > or whatever block size is passed in when it's mounted.
> > > > > > > >
> > > > > > > > <nod> I think you're correct.
> > > > > > > >
> > > > > > > > > I'll send out a patch to remove this line. It doesn't make any
> > > > > > > > > difference for fuse_iomap_writeback_range() logic whether len is
> > > > > > > > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > > > > > > > ranges.
> > > > > > > > >
> > > > > > > > > Also, I just noticed that apparently the blocksize can change
> > > > > > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > > > > > patch for this.
> > > > > > > >
> > > > > > > > From my understanding of the iomap code, it's possible to do that if you
> > > > > > > > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > > > > > > > mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> > > > > > > > so I have no idea if it actually works, however.  Note that even I don't
> > > > > > > > implement the flush and unmap bit; I just scream loudly and do nothing:
> > > > > > >
> > > > > > > lol! i wish I could scream loudly and do nothing too for my case.
> > > > > > >
> > > > > > > AFAICT, I think I just need to flush and unmap that file and can leave
> > > > > > > the rest of the files/folios in the pagecache as is? But then if the
> > > > > > > file has active refcounts on it or has been pinned into memory, can I
> > > > > > > still unmap and remove it from the page cache? I see the
> > > > > > > invalidate_inode_pages2() function but my understanding is that the
> > > > > > > page still stays in the cache if it has has active references, and if
> > > > > > > the page gets mmaped and there's a page fault on it, it'll end up
> > > > > > > using the preexisting old page in the page cache.
> > > > > >
> > > > > > Never mind, I was mistaken about this. Johannes confirmed that even if
> > > > > > there's active refcounts on the folio, it'll still get removed from
> > > > > > the page cache after unmapping and the page cache reference will get
> > > > > > dropped.
> > > > > >
> > > > > > I think I can just do what you suggested and call
> > > > > > filemap_invalidate_inode() in fuse_change_attributes_common() then if
> > > > > > the inode blksize gets changed. Thanks for the suggestion!
> > > > > >
> > > > >
> > > > > Thinking about this some more, I don't think this works after all
> > > > > because the writeback + page cache removal and inode blkbits update
> > > > > needs to be atomic, else after we write back and remove the pages from
> > > > > the page cache, a write could be issued right before we update the
> > > > > inode blkbits. I don't think we can hold the inode lock the whole time
> > > > > for it either since writeback could be intensive. (also btw, I
> > > > > realized in hindsight that invalidate_inode_pages2_range() would have
> > > > > been the better function to call instead of
> > > > > filemap_invalidate_inode()).
> > > > >
> > > > > > >
> > > > > > > I don't think I really need to have it removed from the page cache so
> > > > > > > much as just have the ifs state for all the folios in the file freed
> > > > > > > (after flushing the file) so that it can start over with a new ifs.
> > > > > > > Ideally we could just flush the file, then iterate through all the
> > > > > > > folios in the mapping in order of ascending index, and kfree their
> > > > > > > ->private, but I'm not seeing how we can prevent the case of new
> > > > > > > writes / a new ifs getting allocated for folios at previous indexes
> > > > > > > while we're trying to do the iteration/kfreeing.
> > > > > > >
> > > > >
> > > > > Going back to this idea, I think this can work. I realized we don't
> > > > > need to flush the file, it's enough to free the ifs, then update the
> > > > > inode->i_blkbits, then reallocate the ifs (which will now use the
> > > > > updated blkbits size), and if we hold the inode lock throughout, that
> > > > > prevents any concurrent writes.
> > > > > Something like:
> > > > >      inode_lock(inode);
> > > > >      XA_STATE(xas, &mapping->i_pages, 0);
> > > > >      xa_lock_irq(&mapping->i_pages);
> > > > >      xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > >           folio_lock(folio);
> > > > >           if (folio_test_dirty(folio)) {
> > > > >                   folio_wait_writeback(folio);
> > > > >                   kfree(folio->private);
> > > > >           }
> >
> > Heh, I didn't even see this chunk, distracted as I am today. :/
> >
> > So this doesn't actually /initiate/ writeback, it just waits
> > (potentially for a long time) for someone else to come along and do it.
> > That might not be what you want since the blocksize change will appear
> > to stall while nothing else is going on in the system.
> 
> I thought if the folio isn't under writeback then
> folio_wait_writeback() just returns immediately as a no-op.
> I don't think we need/want to initiate writeback, I think we only need
> to ensure that if it is already under writeback, that writeback
> finishes while it uses the old i_blksize so nothing gets corrupted. As
> I understand it (but maybe I'm misjudging this), holding the inode
> lock and then initiating writeback is too much given that writeback
> can take a long time (eg if the fuse server writes the data over some
> network).
> 
> >
> > Also, unless you're going to put this in buffered-io.c, it's not
> > desirable for a piece of code to free something it didn't allocate.
> > IOWs, I don't think it's a good idea for *fuse* to go messing with a
> > folio->private that iomap set.
> 
> Okay, good point. I agree. I was hoping to have this not bleed into
> the iomap library but maybe there's no getting around that in a good
> way.

<shrug> Any other filesystem that has mutable file block size is going
to need something to enact a change.

> >
> > > > >           folio_unlock(folio);
> > > > >      }
> > > > >     inode->i_blkbits = new_blkbits_size;
> > > >
> > > > The trouble is, you also have to resize the iomap_folio_state objects
> > > > attached to each folio if you change i_blkbits...
> > >
> > > I think the iomap_folio_state objects automatically get resized here,
> > > no? We first kfree the folio->private which kfrees the entire ifs,
> >
> > Err, right, it does free the ifs and recreate it later if necessary.
> >
> > > then we change inode->i_blkbits to the new size, then when we call
> > > folio_mark_dirty(), it'll create the new ifs which creates a new folio
> > > state object using the new/updated i_blkbits size
> > >
> > > >
> > > > >     xas_set(&xas, 0);
> > > > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > >           folio_lock(folio);
> > > > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > > > >                  folio_mark_dirty(folio);
> > > >
> > > > ...because iomap_dirty_folio doesn't know how to reallocate the folio
> > > > state object in response to i_blkbits having changed.
> >
> > Also, what about clean folios that have an ifs?  You'd still need to
> > handle the ifs's attached to those.
> 
> Ah you're right, there could be clean folios there too that have an
> ifs. I think in the above logic, if we iterate through all
> mapping->i_pages (not just PAGECACHE_TAG_DIRTY marked ones) and move
> the kfree to after the "if (folio_test_dirty(folio))" block, then it
> addresses that case. eg something like this:
> 
>      inode_lock(inode);
>      XA_STATE(xas, &mapping->i_pages, 0);
>      xa_lock_irq(&mapping->i_pages);
>      xas_for_each(&xas, folio, ULONG_MAX) {
>           folio_lock(folio);
>           if (folio_test_dirty(folio))
>                   folio_wait_writeback(folio);
>           kfree(folio->private);
>           folio_unlock(folio);
>      }
>     inode->i_blkbits = new_blkbits;
>     xas_set(&xas, 0);
>     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
>           folio_lock(folio);
>           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
>                  folio_mark_dirty(folio);
>           folio_unlock(folio);
>     }
>     xa_unlock_irq(&mapping->i_pages);
>     inode_unlock(inode);
> 
> 
> >
> > So I guess if you wanted iomap to handle a blocksize change, you could
> > do something like:
> >
> > iomap_change_file_blocksize(inode, new_blkbits) {
> >         inode_lock()
> >         filemap_invalidate_lock()
> >
> >         inode_dio_wait()
> >         filemap_write_and_wait()
> >         if (new_blkbits > mapping_min_folio_order()) {
> >                 truncate_pagecache()
> >                 inode->i_blkbits = new_blkbits;
> >         } else {
> >                 inode->i_blkbits = new_blkbits;
> >                 xas_for_each(...) {
> >                         <create new ifs>
> >                         <translate uptodate/dirty state to new ifs>
> >                         <swap ifs>
> >                         <free old ifs>
> >                 }
> >         }
> >
> >         filemap_invalidate_unlock()
> >         inode_unlock()
> > }
> 
> Do you prefer this logic to the one above that walks through
> &mapping->i_pages? If so, then I'll go with this way.

Yes.  iomap should not be tightly bound to the pagecache's xarray; I
don't even really like the xas_for_each that I suggested above.

> The part I'm unsure about is that this logic seems more disruptive (eg
> initiating writeback while holding the inode lock and doing work for
> unmapping/page cache removal) than the other approach, but I guess
> this is also rare enough that it doesn't matter much.

I hope it's rare enough that doing truncate_pagecache unconditionally
won't be seen as a huge burden.

iomap_change_file_blocksize(inode, new_blkbits) {
        inode_dio_wait()
        filemap_write_and_wait()
        truncate_pagecache()

        inode->i_blkbits = new_blkbits;
}

fuse_file_change_blocksize(inode, new_blkbits) {
        inode_lock()
        filemap_invalidate_lock()

	iomap_change_file_blocksize(inode, new_blkbits);

        filemap_invalidate_unlock()
        inode_unlock()
}

Though my question remains -- is there a fuse filesystem that changes
the blocksize at runtime such that we can test this??

--D

> Thanks,
> Joanne
> 
> >
> > --D
> >
> > > > --D
> > > >
> > > > >           folio_unlock(folio);
> > > > >     }
> > > > >     xa_unlock_irq(&mapping->i_pages);
> > > > >     inode_unlock(inode);
> > > > >
> > > > >
> > > > > I think this is the only approach that doesn't require changes to iomap.
> > > > >
> > > > > I'm going to think about this some more next week and will try to send
> > > > > out a patch for this then.
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Joanne
> > > > >
> > > > > > > >
> > > > > > > > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > > > > > > > {
> > > > > > > >         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> > > > > > > >
> > > > > > > >         if (inode->i_blkbits == new_blkbits)
> > > > > > > >                 return;
> > > > > > > >
> > > > > > > >         if (!S_ISREG(inode->i_mode))
> > > > > > > >                 goto set_it;
> > > > > > > >
> > > > > > > >         /*
> > > > > > > >          * iomap attaches per-block state to each folio, so we cannot allow
> > > > > > > >          * the file block size to change if there's anything in the page cache.
> > > > > > > >          * In theory, fuse servers should never be doing this.
> > > > > > > >          */
> > > > > > > >         if (inode->i_mapping->nrpages > 0) {
> > > > > > > >                 WARN_ON(inode->i_blkbits != new_blkbits &&
> > > > > > > >                         inode->i_mapping->nrpages > 0);
> > > > > > > >                 return;
> > > > > > > >         }
> > > > > > > >
> > > > > > > > set_it:
> > > > > > > >         inode->i_blkbits = new_blkbits;
> > > > > > > > }
> > > > > > > >
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> > > > > > > >
> > > > > > > > --D
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Joanne
> > > > > > > > >
> > > > >
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-29 20:21                     ` Darrick J. Wong
@ 2025-07-29 23:23                       ` Joanne Koong
  2025-07-29 23:40                         ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Joanne Koong @ 2025-07-29 23:23 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Tue, Jul 29, 2025 at 1:21 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Mon, Jul 28, 2025 at 02:28:31PM -0700, Joanne Koong wrote:
> > On Mon, Jul 28, 2025 at 12:11 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Mon, Jul 28, 2025 at 10:44:01AM -0700, Joanne Koong wrote:
> > > > On Mon, Jul 28, 2025 at 10:14 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > > > > > On Thu, Jul 24, 2025 at 12:14 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > > > > > > > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > [cc Joanne]
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > > > > > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > > > > > > > > fuse file.c at fuse_iomap_writeback_range
> > > > > > > > > > > >
> > > > > > > > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > > > > > > > >
> > > > > > > > > > > > ## Test log
> > > > > > > > > > > > ------------[ cut here ]------------
> > > > > > > > > > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > > > > > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > > > > > > > > >
> > > > > > > > > > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > > > > > > > > >
> > > > > > > > > > > /me speculates that this might be triggered by an attempt to write back
> > > > > > > > > > > some 4k fsblock within a 16/64k base page?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I think this can happen on 4k base pages as well actually. On the
> > > > > > > > > > iomap side, the length passed is always block-aligned and in fuse, we
> > > > > > > > > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > > > > > > > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > > > > > > > > isn't true and the blocksize is initialized to a default size of 512
> > > > > > > > > > or whatever block size is passed in when it's mounted.
> > > > > > > > >
> > > > > > > > > <nod> I think you're correct.
> > > > > > > > >
> > > > > > > > > > I'll send out a patch to remove this line. It doesn't make any
> > > > > > > > > > difference for fuse_iomap_writeback_range() logic whether len is
> > > > > > > > > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > > > > > > > > ranges.
> > > > > > > > > >
> > > > > > > > > > Also, I just noticed that apparently the blocksize can change
> > > > > > > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > > > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > > > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > > > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > > > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > > > > > > patch for this.
> > > > > > > > >
> > > > > > > > > From my understanding of the iomap code, it's possible to do that if you
> > > > > > > > > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > > > > > > > > mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> > > > > > > > > so I have no idea if it actually works, however.  Note that even I don't
> > > > > > > > > implement the flush and unmap bit; I just scream loudly and do nothing:
> > > > > > > >
> > > > > > > > lol! i wish I could scream loudly and do nothing too for my case.
> > > > > > > >
> > > > > > > > AFAICT, I think I just need to flush and unmap that file and can leave
> > > > > > > > the rest of the files/folios in the pagecache as is? But then if the
> > > > > > > > file has active refcounts on it or has been pinned into memory, can I
> > > > > > > > still unmap and remove it from the page cache? I see the
> > > > > > > > invalidate_inode_pages2() function but my understanding is that the
> > > > > > > > page still stays in the cache if it has has active references, and if
> > > > > > > > the page gets mmaped and there's a page fault on it, it'll end up
> > > > > > > > using the preexisting old page in the page cache.
> > > > > > >
> > > > > > > Never mind, I was mistaken about this. Johannes confirmed that even if
> > > > > > > there's active refcounts on the folio, it'll still get removed from
> > > > > > > the page cache after unmapping and the page cache reference will get
> > > > > > > dropped.
> > > > > > >
> > > > > > > I think I can just do what you suggested and call
> > > > > > > filemap_invalidate_inode() in fuse_change_attributes_common() then if
> > > > > > > the inode blksize gets changed. Thanks for the suggestion!
> > > > > > >
> > > > > >
> > > > > > Thinking about this some more, I don't think this works after all
> > > > > > because the writeback + page cache removal and inode blkbits update
> > > > > > needs to be atomic, else after we write back and remove the pages from
> > > > > > the page cache, a write could be issued right before we update the
> > > > > > inode blkbits. I don't think we can hold the inode lock the whole time
> > > > > > for it either since writeback could be intensive. (also btw, I
> > > > > > realized in hindsight that invalidate_inode_pages2_range() would have
> > > > > > been the better function to call instead of
> > > > > > filemap_invalidate_inode()).
> > > > > >
> > > > > > > >
> > > > > > > > I don't think I really need to have it removed from the page cache so
> > > > > > > > much as just have the ifs state for all the folios in the file freed
> > > > > > > > (after flushing the file) so that it can start over with a new ifs.
> > > > > > > > Ideally we could just flush the file, then iterate through all the
> > > > > > > > folios in the mapping in order of ascending index, and kfree their
> > > > > > > > ->private, but I'm not seeing how we can prevent the case of new
> > > > > > > > writes / a new ifs getting allocated for folios at previous indexes
> > > > > > > > while we're trying to do the iteration/kfreeing.
> > > > > > > >
> > > > > >
> > > > > > Going back to this idea, I think this can work. I realized we don't
> > > > > > need to flush the file, it's enough to free the ifs, then update the
> > > > > > inode->i_blkbits, then reallocate the ifs (which will now use the
> > > > > > updated blkbits size), and if we hold the inode lock throughout, that
> > > > > > prevents any concurrent writes.
> > > > > > Something like:
> > > > > >      inode_lock(inode);
> > > > > >      XA_STATE(xas, &mapping->i_pages, 0);
> > > > > >      xa_lock_irq(&mapping->i_pages);
> > > > > >      xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > >           folio_lock(folio);
> > > > > >           if (folio_test_dirty(folio)) {
> > > > > >                   folio_wait_writeback(folio);
> > > > > >                   kfree(folio->private);
> > > > > >           }
> > >
> > > Heh, I didn't even see this chunk, distracted as I am today. :/
> > >
> > > So this doesn't actually /initiate/ writeback, it just waits
> > > (potentially for a long time) for someone else to come along and do it.
> > > That might not be what you want since the blocksize change will appear
> > > to stall while nothing else is going on in the system.
> >
> > I thought if the folio isn't under writeback then
> > folio_wait_writeback() just returns immediately as a no-op.
> > I don't think we need/want to initiate writeback, I think we only need
> > to ensure that if it is already under writeback, that writeback
> > finishes while it uses the old i_blksize so nothing gets corrupted. As
> > I understand it (but maybe I'm misjudging this), holding the inode
> > lock and then initiating writeback is too much given that writeback
> > can take a long time (eg if the fuse server writes the data over some
> > network).
> >
> > >
> > > Also, unless you're going to put this in buffered-io.c, it's not
> > > desirable for a piece of code to free something it didn't allocate.
> > > IOWs, I don't think it's a good idea for *fuse* to go messing with a
> > > folio->private that iomap set.
> >
> > Okay, good point. I agree. I was hoping to have this not bleed into
> > the iomap library but maybe there's no getting around that in a good
> > way.
>
> <shrug> Any other filesystem that has mutable file block size is going
> to need something to enact a change.
>
> > >
> > > > > >           folio_unlock(folio);
> > > > > >      }
> > > > > >     inode->i_blkbits = new_blkbits_size;
> > > > >
> > > > > The trouble is, you also have to resize the iomap_folio_state objects
> > > > > attached to each folio if you change i_blkbits...
> > > >
> > > > I think the iomap_folio_state objects automatically get resized here,
> > > > no? We first kfree the folio->private which kfrees the entire ifs,
> > >
> > > Err, right, it does free the ifs and recreate it later if necessary.
> > >
> > > > then we change inode->i_blkbits to the new size, then when we call
> > > > folio_mark_dirty(), it'll create the new ifs which creates a new folio
> > > > state object using the new/updated i_blkbits size
> > > >
> > > > >
> > > > > >     xas_set(&xas, 0);
> > > > > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > >           folio_lock(folio);
> > > > > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > > > > >                  folio_mark_dirty(folio);
> > > > >
> > > > > ...because iomap_dirty_folio doesn't know how to reallocate the folio
> > > > > state object in response to i_blkbits having changed.
> > >
> > > Also, what about clean folios that have an ifs?  You'd still need to
> > > handle the ifs's attached to those.
> >
> > Ah you're right, there could be clean folios there too that have an
> > ifs. I think in the above logic, if we iterate through all
> > mapping->i_pages (not just PAGECACHE_TAG_DIRTY marked ones) and move
> > the kfree to after the "if (folio_test_dirty(folio))" block, then it
> > addresses that case. eg something like this:
> >
> >      inode_lock(inode);
> >      XA_STATE(xas, &mapping->i_pages, 0);
> >      xa_lock_irq(&mapping->i_pages);
> >      xas_for_each(&xas, folio, ULONG_MAX) {
> >           folio_lock(folio);
> >           if (folio_test_dirty(folio))
> >                   folio_wait_writeback(folio);
> >           kfree(folio->private);
> >           folio_unlock(folio);
> >      }
> >     inode->i_blkbits = new_blkbits;
> >     xas_set(&xas, 0);
> >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> >           folio_lock(folio);
> >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> >                  folio_mark_dirty(folio);
> >           folio_unlock(folio);
> >     }
> >     xa_unlock_irq(&mapping->i_pages);
> >     inode_unlock(inode);
> >
> >
> > >
> > > So I guess if you wanted iomap to handle a blocksize change, you could
> > > do something like:
> > >
> > > iomap_change_file_blocksize(inode, new_blkbits) {
> > >         inode_lock()
> > >         filemap_invalidate_lock()
> > >
> > >         inode_dio_wait()
> > >         filemap_write_and_wait()
> > >         if (new_blkbits > mapping_min_folio_order()) {
> > >                 truncate_pagecache()
> > >                 inode->i_blkbits = new_blkbits;
> > >         } else {
> > >                 inode->i_blkbits = new_blkbits;
> > >                 xas_for_each(...) {
> > >                         <create new ifs>
> > >                         <translate uptodate/dirty state to new ifs>
> > >                         <swap ifs>
> > >                         <free old ifs>
> > >                 }
> > >         }
> > >
> > >         filemap_invalidate_unlock()
> > >         inode_unlock()
> > > }
> >
> > Do you prefer this logic to the one above that walks through
> > &mapping->i_pages? If so, then I'll go with this way.
>
> Yes.  iomap should not be tightly bound to the pagecache's xarray; I
> don't even really like the xas_for_each that I suggested above.

Okay, sounds good.

>
> > The part I'm unsure about is that this logic seems more disruptive (eg
> > initiating writeback while holding the inode lock and doing work for
> > unmapping/page cache removal) than the other approach, but I guess
> > this is also rare enough that it doesn't matter much.
>
> I hope it's rare enough that doing truncate_pagecache unconditionally
> won't be seen as a huge burden.
>
> iomap_change_file_blocksize(inode, new_blkbits) {
>         inode_dio_wait()
>         filemap_write_and_wait()
>         truncate_pagecache()
>
>         inode->i_blkbits = new_blkbits;
> }
>
> fuse_file_change_blocksize(inode, new_blkbits) {
>         inode_lock()
>         filemap_invalidate_lock()
>
>         iomap_change_file_blocksize(inode, new_blkbits);
>
>         filemap_invalidate_unlock()
>         inode_unlock()
> }
>
> Though my question remains -- is there a fuse filesystem that changes
> the blocksize at runtime such that we can test this??

There's not one currently but I was planning to hack up the libfuse
passthrough_hp server to test the change.

>
> --D
>
> > Thanks,
> > Joanne
> >
> > >
> > > --D
> > >
> > > > > --D
> > > > >
> > > > > >           folio_unlock(folio);
> > > > > >     }
> > > > > >     xa_unlock_irq(&mapping->i_pages);
> > > > > >     inode_unlock(inode);
> > > > > >
> > > > > >
> > > > > > I think this is the only approach that doesn't require changes to iomap.
> > > > > >
> > > > > > I'm going to think about this some more next week and will try to send
> > > > > > out a patch for this then.
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Joanne
> > > > > >
> > > > > > > > >
> > > > > > > > > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > > > > > > > > {
> > > > > > > > >         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> > > > > > > > >
> > > > > > > > >         if (inode->i_blkbits == new_blkbits)
> > > > > > > > >                 return;
> > > > > > > > >
> > > > > > > > >         if (!S_ISREG(inode->i_mode))
> > > > > > > > >                 goto set_it;
> > > > > > > > >
> > > > > > > > >         /*
> > > > > > > > >          * iomap attaches per-block state to each folio, so we cannot allow
> > > > > > > > >          * the file block size to change if there's anything in the page cache.
> > > > > > > > >          * In theory, fuse servers should never be doing this.
> > > > > > > > >          */
> > > > > > > > >         if (inode->i_mapping->nrpages > 0) {
> > > > > > > > >                 WARN_ON(inode->i_blkbits != new_blkbits &&
> > > > > > > > >                         inode->i_mapping->nrpages > 0);
> > > > > > > > >                 return;
> > > > > > > > >         }
> > > > > > > > >
> > > > > > > > > set_it:
> > > > > > > > >         inode->i_blkbits = new_blkbits;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> > > > > > > > >
> > > > > > > > > --D
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Joanne
> > > > > > > > > >
> > > > > >
> >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-29 23:23                       ` Joanne Koong
@ 2025-07-29 23:40                         ` Darrick J. Wong
  2025-07-30 22:54                           ` Joanne Koong
  0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2025-07-29 23:40 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Tue, Jul 29, 2025 at 04:23:02PM -0700, Joanne Koong wrote:
> On Tue, Jul 29, 2025 at 1:21 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Mon, Jul 28, 2025 at 02:28:31PM -0700, Joanne Koong wrote:
> > > On Mon, Jul 28, 2025 at 12:11 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Mon, Jul 28, 2025 at 10:44:01AM -0700, Joanne Koong wrote:
> > > > > On Mon, Jul 28, 2025 at 10:14 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > > > > > > On Thu, Jul 24, 2025 at 12:14 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > > > > > > > > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > [cc Joanne]
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > > > > > > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > > > > > > > > > fuse file.c at fuse_iomap_writeback_range
> > > > > > > > > > > > >
> > > > > > > > > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > > > > > > > > >
> > > > > > > > > > > > > ## Test log
> > > > > > > > > > > > > ------------[ cut here ]------------
> > > > > > > > > > > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > > > > > > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > > > > > > > > > >
> > > > > > > > > > > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > > > > > > > > > >
> > > > > > > > > > > > /me speculates that this might be triggered by an attempt to write back
> > > > > > > > > > > > some 4k fsblock within a 16/64k base page?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I think this can happen on 4k base pages as well actually. On the
> > > > > > > > > > > iomap side, the length passed is always block-aligned and in fuse, we
> > > > > > > > > > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > > > > > > > > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > > > > > > > > > isn't true and the blocksize is initialized to a default size of 512
> > > > > > > > > > > or whatever block size is passed in when it's mounted.
> > > > > > > > > >
> > > > > > > > > > <nod> I think you're correct.
> > > > > > > > > >
> > > > > > > > > > > I'll send out a patch to remove this line. It doesn't make any
> > > > > > > > > > > difference for fuse_iomap_writeback_range() logic whether len is
> > > > > > > > > > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > > > > > > > > > ranges.
> > > > > > > > > > >
> > > > > > > > > > > Also, I just noticed that apparently the blocksize can change
> > > > > > > > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > > > > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > > > > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > > > > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > > > > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > > > > > > > patch for this.
> > > > > > > > > >
> > > > > > > > > > From my understanding of the iomap code, it's possible to do that if you
> > > > > > > > > > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > > > > > > > > > mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> > > > > > > > > > so I have no idea if it actually works, however.  Note that even I don't
> > > > > > > > > > implement the flush and unmap bit; I just scream loudly and do nothing:
> > > > > > > > >
> > > > > > > > > lol! i wish I could scream loudly and do nothing too for my case.
> > > > > > > > >
> > > > > > > > > AFAICT, I think I just need to flush and unmap that file and can leave
> > > > > > > > > the rest of the files/folios in the pagecache as is? But then if the
> > > > > > > > > file has active refcounts on it or has been pinned into memory, can I
> > > > > > > > > still unmap and remove it from the page cache? I see the
> > > > > > > > > invalidate_inode_pages2() function but my understanding is that the
> > > > > > > > > page still stays in the cache if it has has active references, and if
> > > > > > > > > the page gets mmaped and there's a page fault on it, it'll end up
> > > > > > > > > using the preexisting old page in the page cache.
> > > > > > > >
> > > > > > > > Never mind, I was mistaken about this. Johannes confirmed that even if
> > > > > > > > there's active refcounts on the folio, it'll still get removed from
> > > > > > > > the page cache after unmapping and the page cache reference will get
> > > > > > > > dropped.
> > > > > > > >
> > > > > > > > I think I can just do what you suggested and call
> > > > > > > > filemap_invalidate_inode() in fuse_change_attributes_common() then if
> > > > > > > > the inode blksize gets changed. Thanks for the suggestion!
> > > > > > > >
> > > > > > >
> > > > > > > Thinking about this some more, I don't think this works after all
> > > > > > > because the writeback + page cache removal and inode blkbits update
> > > > > > > needs to be atomic, else after we write back and remove the pages from
> > > > > > > the page cache, a write could be issued right before we update the
> > > > > > > inode blkbits. I don't think we can hold the inode lock the whole time
> > > > > > > for it either since writeback could be intensive. (also btw, I
> > > > > > > realized in hindsight that invalidate_inode_pages2_range() would have
> > > > > > > been the better function to call instead of
> > > > > > > filemap_invalidate_inode()).
> > > > > > >
> > > > > > > > >
> > > > > > > > > I don't think I really need to have it removed from the page cache so
> > > > > > > > > much as just have the ifs state for all the folios in the file freed
> > > > > > > > > (after flushing the file) so that it can start over with a new ifs.
> > > > > > > > > Ideally we could just flush the file, then iterate through all the
> > > > > > > > > folios in the mapping in order of ascending index, and kfree their
> > > > > > > > > ->private, but I'm not seeing how we can prevent the case of new
> > > > > > > > > writes / a new ifs getting allocated for folios at previous indexes
> > > > > > > > > while we're trying to do the iteration/kfreeing.
> > > > > > > > >
> > > > > > >
> > > > > > > Going back to this idea, I think this can work. I realized we don't
> > > > > > > need to flush the file, it's enough to free the ifs, then update the
> > > > > > > inode->i_blkbits, then reallocate the ifs (which will now use the
> > > > > > > updated blkbits size), and if we hold the inode lock throughout, that
> > > > > > > prevents any concurrent writes.
> > > > > > > Something like:
> > > > > > >      inode_lock(inode);
> > > > > > >      XA_STATE(xas, &mapping->i_pages, 0);
> > > > > > >      xa_lock_irq(&mapping->i_pages);
> > > > > > >      xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > > >           folio_lock(folio);
> > > > > > >           if (folio_test_dirty(folio)) {
> > > > > > >                   folio_wait_writeback(folio);
> > > > > > >                   kfree(folio->private);
> > > > > > >           }
> > > >
> > > > Heh, I didn't even see this chunk, distracted as I am today. :/
> > > >
> > > > So this doesn't actually /initiate/ writeback, it just waits
> > > > (potentially for a long time) for someone else to come along and do it.
> > > > That might not be what you want since the blocksize change will appear
> > > > to stall while nothing else is going on in the system.
> > >
> > > I thought if the folio isn't under writeback then
> > > folio_wait_writeback() just returns immediately as a no-op.
> > > I don't think we need/want to initiate writeback, I think we only need
> > > to ensure that if it is already under writeback, that writeback
> > > finishes while it uses the old i_blksize so nothing gets corrupted. As
> > > I understand it (but maybe I'm misjudging this), holding the inode
> > > lock and then initiating writeback is too much given that writeback
> > > can take a long time (eg if the fuse server writes the data over some
> > > network).
> > >
> > > >
> > > > Also, unless you're going to put this in buffered-io.c, it's not
> > > > desirable for a piece of code to free something it didn't allocate.
> > > > IOWs, I don't think it's a good idea for *fuse* to go messing with a
> > > > folio->private that iomap set.
> > >
> > > Okay, good point. I agree. I was hoping to have this not bleed into
> > > the iomap library but maybe there's no getting around that in a good
> > > way.
> >
> > <shrug> Any other filesystem that has mutable file block size is going
> > to need something to enact a change.
> >
> > > >
> > > > > > >           folio_unlock(folio);
> > > > > > >      }
> > > > > > >     inode->i_blkbits = new_blkbits_size;
> > > > > >
> > > > > > The trouble is, you also have to resize the iomap_folio_state objects
> > > > > > attached to each folio if you change i_blkbits...
> > > > >
> > > > > I think the iomap_folio_state objects automatically get resized here,
> > > > > no? We first kfree the folio->private which kfrees the entire ifs,
> > > >
> > > > Err, right, it does free the ifs and recreate it later if necessary.
> > > >
> > > > > then we change inode->i_blkbits to the new size, then when we call
> > > > > folio_mark_dirty(), it'll create the new ifs which creates a new folio
> > > > > state object using the new/updated i_blkbits size
> > > > >
> > > > > >
> > > > > > >     xas_set(&xas, 0);
> > > > > > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > > >           folio_lock(folio);
> > > > > > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > > > > > >                  folio_mark_dirty(folio);
> > > > > >
> > > > > > ...because iomap_dirty_folio doesn't know how to reallocate the folio
> > > > > > state object in response to i_blkbits having changed.
> > > >
> > > > Also, what about clean folios that have an ifs?  You'd still need to
> > > > handle the ifs's attached to those.
> > >
> > > Ah you're right, there could be clean folios there too that have an
> > > ifs. I think in the above logic, if we iterate through all
> > > mapping->i_pages (not just PAGECACHE_TAG_DIRTY marked ones) and move
> > > the kfree to after the "if (folio_test_dirty(folio))" block, then it
> > > addresses that case. eg something like this:
> > >
> > >      inode_lock(inode);
> > >      XA_STATE(xas, &mapping->i_pages, 0);
> > >      xa_lock_irq(&mapping->i_pages);
> > >      xas_for_each(&xas, folio, ULONG_MAX) {
> > >           folio_lock(folio);
> > >           if (folio_test_dirty(folio))
> > >                   folio_wait_writeback(folio);
> > >           kfree(folio->private);
> > >           folio_unlock(folio);
> > >      }
> > >     inode->i_blkbits = new_blkbits;
> > >     xas_set(&xas, 0);
> > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > >           folio_lock(folio);
> > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > >                  folio_mark_dirty(folio);
> > >           folio_unlock(folio);
> > >     }
> > >     xa_unlock_irq(&mapping->i_pages);
> > >     inode_unlock(inode);
> > >
> > >
> > > >
> > > > So I guess if you wanted iomap to handle a blocksize change, you could
> > > > do something like:
> > > >
> > > > iomap_change_file_blocksize(inode, new_blkbits) {
> > > >         inode_lock()
> > > >         filemap_invalidate_lock()
> > > >
> > > >         inode_dio_wait()
> > > >         filemap_write_and_wait()
> > > >         if (new_blkbits > mapping_min_folio_order()) {
> > > >                 truncate_pagecache()
> > > >                 inode->i_blkbits = new_blkbits;
> > > >         } else {
> > > >                 inode->i_blkbits = new_blkbits;
> > > >                 xas_for_each(...) {
> > > >                         <create new ifs>
> > > >                         <translate uptodate/dirty state to new ifs>
> > > >                         <swap ifs>
> > > >                         <free old ifs>
> > > >                 }
> > > >         }
> > > >
> > > >         filemap_invalidate_unlock()
> > > >         inode_unlock()
> > > > }
> > >
> > > Do you prefer this logic to the one above that walks through
> > > &mapping->i_pages? If so, then I'll go with this way.
> >
> > Yes.  iomap should not be tightly bound to the pagecache's xarray; I
> > don't even really like the xas_for_each that I suggested above.
> 
> Okay, sounds good.
> 
> >
> > > The part I'm unsure about is that this logic seems more disruptive (eg
> > > initiating writeback while holding the inode lock and doing work for
> > > unmapping/page cache removal) than the other approach, but I guess
> > > this is also rare enough that it doesn't matter much.
> >
> > I hope it's rare enough that doing truncate_pagecache unconditionally
> > won't be seen as a huge burden.
> >
> > iomap_change_file_blocksize(inode, new_blkbits) {
> >         inode_dio_wait()
> >         filemap_write_and_wait()
> >         truncate_pagecache()
> >
> >         inode->i_blkbits = new_blkbits;
> > }
> >
> > fuse_file_change_blocksize(inode, new_blkbits) {
> >         inode_lock()
> >         filemap_invalidate_lock()
> >
> >         iomap_change_file_blocksize(inode, new_blkbits);
> >
> >         filemap_invalidate_unlock()
> >         inode_unlock()
> > }
> >
> > Though my question remains -- is there a fuse filesystem that changes
> > the blocksize at runtime such that we can test this??
> 
> There's not one currently but I was planning to hack up the libfuse
> passthrough_hp server to test the change.

Heh, ok.

I guess I could also hack up fuse2fs to change its own blocksize
randomly to see how many programs that pisses off. :)

(Not right now though, gotta prepare for fossy tomorrow...)

--D

> >
> > --D
> >
> > > Thanks,
> > > Joanne
> > >
> > > >
> > > > --D
> > > >
> > > > > > --D
> > > > > >
> > > > > > >           folio_unlock(folio);
> > > > > > >     }
> > > > > > >     xa_unlock_irq(&mapping->i_pages);
> > > > > > >     inode_unlock(inode);
> > > > > > >
> > > > > > >
> > > > > > > I think this is the only approach that doesn't require changes to iomap.
> > > > > > >
> > > > > > > I'm going to think about this some more next week and will try to send
> > > > > > > out a patch for this then.
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Joanne
> > > > > > >
> > > > > > > > > >
> > > > > > > > > > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > > > > > > > > > {
> > > > > > > > > >         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> > > > > > > > > >
> > > > > > > > > >         if (inode->i_blkbits == new_blkbits)
> > > > > > > > > >                 return;
> > > > > > > > > >
> > > > > > > > > >         if (!S_ISREG(inode->i_mode))
> > > > > > > > > >                 goto set_it;
> > > > > > > > > >
> > > > > > > > > >         /*
> > > > > > > > > >          * iomap attaches per-block state to each folio, so we cannot allow
> > > > > > > > > >          * the file block size to change if there's anything in the page cache.
> > > > > > > > > >          * In theory, fuse servers should never be doing this.
> > > > > > > > > >          */
> > > > > > > > > >         if (inode->i_mapping->nrpages > 0) {
> > > > > > > > > >                 WARN_ON(inode->i_blkbits != new_blkbits &&
> > > > > > > > > >                         inode->i_mapping->nrpages > 0);
> > > > > > > > > >                 return;
> > > > > > > > > >         }
> > > > > > > > > >
> > > > > > > > > > set_it:
> > > > > > > > > >         inode->i_blkbits = new_blkbits;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> > > > > > > > > >
> > > > > > > > > > --D
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Joanne
> > > > > > > > > > >
> > > > > > >
> > >
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-29 23:40                         ` Darrick J. Wong
@ 2025-07-30 22:54                           ` Joanne Koong
  2025-07-31 17:55                             ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Joanne Koong @ 2025-07-30 22:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Tue, Jul 29, 2025 at 4:40 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Jul 29, 2025 at 04:23:02PM -0700, Joanne Koong wrote:
> > On Tue, Jul 29, 2025 at 1:21 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Mon, Jul 28, 2025 at 02:28:31PM -0700, Joanne Koong wrote:
> > > > On Mon, Jul 28, 2025 at 12:11 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > On Mon, Jul 28, 2025 at 10:44:01AM -0700, Joanne Koong wrote:
> > > > > > On Mon, Jul 28, 2025 at 10:14 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > >
> > > > > > > On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > > > > > > > On Thu, Jul 24, 2025 at 12:14 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > > > > > > > > > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > [cc Joanne]
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > > > > > > > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > > > > > > > > > > fuse file.c at fuse_iomap_writeback_range
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ## Test log
> > > > > > > > > > > > > > ------------[ cut here ]------------
> > > > > > > > > > > > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > > > > > > > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > > > > > > > > > > >
> > > > > > > > > > > > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > > > > > > > > > > >
> > > > > > > > > > > > > /me speculates that this might be triggered by an attempt to write back
> > > > > > > > > > > > > some 4k fsblock within a 16/64k base page?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I think this can happen on 4k base pages as well actually. On the
> > > > > > > > > > > > iomap side, the length passed is always block-aligned and in fuse, we
> > > > > > > > > > > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > > > > > > > > > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > > > > > > > > > > isn't true and the blocksize is initialized to a default size of 512
> > > > > > > > > > > > or whatever block size is passed in when it's mounted.
> > > > > > > > > > >
> > > > > > > > > > > <nod> I think you're correct.
> > > > > > > > > > >
> > > > > > > > > > > > I'll send out a patch to remove this line. It doesn't make any
> > > > > > > > > > > > difference for fuse_iomap_writeback_range() logic whether len is
> > > > > > > > > > > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > > > > > > > > > > ranges.
> > > > > > > > > > > >
> > > > > > > > > > > > Also, I just noticed that apparently the blocksize can change
> > > > > > > > > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > > > > > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > > > > > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > > > > > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > > > > > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > > > > > > > > patch for this.
> > > > > > > > > > >
> > > > > > > > > > > From my understanding of the iomap code, it's possible to do that if you
> > > > > > > > > > > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > > > > > > > > > > mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> > > > > > > > > > > so I have no idea if it actually works, however.  Note that even I don't
> > > > > > > > > > > implement the flush and unmap bit; I just scream loudly and do nothing:
> > > > > > > > > >
> > > > > > > > > > lol! i wish I could scream loudly and do nothing too for my case.
> > > > > > > > > >
> > > > > > > > > > AFAICT, I think I just need to flush and unmap that file and can leave
> > > > > > > > > > the rest of the files/folios in the pagecache as is? But then if the
> > > > > > > > > > file has active refcounts on it or has been pinned into memory, can I
> > > > > > > > > > still unmap and remove it from the page cache? I see the
> > > > > > > > > > invalidate_inode_pages2() function but my understanding is that the
> > > > > > > > > > page still stays in the cache if it has has active references, and if
> > > > > > > > > > the page gets mmaped and there's a page fault on it, it'll end up
> > > > > > > > > > using the preexisting old page in the page cache.
> > > > > > > > >
> > > > > > > > > Never mind, I was mistaken about this. Johannes confirmed that even if
> > > > > > > > > there's active refcounts on the folio, it'll still get removed from
> > > > > > > > > the page cache after unmapping and the page cache reference will get
> > > > > > > > > dropped.
> > > > > > > > >
> > > > > > > > > I think I can just do what you suggested and call
> > > > > > > > > filemap_invalidate_inode() in fuse_change_attributes_common() then if
> > > > > > > > > the inode blksize gets changed. Thanks for the suggestion!
> > > > > > > > >
> > > > > > > >
> > > > > > > > Thinking about this some more, I don't think this works after all
> > > > > > > > because the writeback + page cache removal and inode blkbits update
> > > > > > > > needs to be atomic, else after we write back and remove the pages from
> > > > > > > > the page cache, a write could be issued right before we update the
> > > > > > > > inode blkbits. I don't think we can hold the inode lock the whole time
> > > > > > > > for it either since writeback could be intensive. (also btw, I
> > > > > > > > realized in hindsight that invalidate_inode_pages2_range() would have
> > > > > > > > been the better function to call instead of
> > > > > > > > filemap_invalidate_inode()).
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I don't think I really need to have it removed from the page cache so
> > > > > > > > > > much as just have the ifs state for all the folios in the file freed
> > > > > > > > > > (after flushing the file) so that it can start over with a new ifs.
> > > > > > > > > > Ideally we could just flush the file, then iterate through all the
> > > > > > > > > > folios in the mapping in order of ascending index, and kfree their
> > > > > > > > > > ->private, but I'm not seeing how we can prevent the case of new
> > > > > > > > > > writes / a new ifs getting allocated for folios at previous indexes
> > > > > > > > > > while we're trying to do the iteration/kfreeing.
> > > > > > > > > >
> > > > > > > >
> > > > > > > > Going back to this idea, I think this can work. I realized we don't
> > > > > > > > need to flush the file, it's enough to free the ifs, then update the
> > > > > > > > inode->i_blkbits, then reallocate the ifs (which will now use the
> > > > > > > > updated blkbits size), and if we hold the inode lock throughout, that
> > > > > > > > prevents any concurrent writes.
> > > > > > > > Something like:
> > > > > > > >      inode_lock(inode);
> > > > > > > >      XA_STATE(xas, &mapping->i_pages, 0);
> > > > > > > >      xa_lock_irq(&mapping->i_pages);
> > > > > > > >      xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > > > >           folio_lock(folio);
> > > > > > > >           if (folio_test_dirty(folio)) {
> > > > > > > >                   folio_wait_writeback(folio);
> > > > > > > >                   kfree(folio->private);
> > > > > > > >           }
> > > > >
> > > > > Heh, I didn't even see this chunk, distracted as I am today. :/
> > > > >
> > > > > So this doesn't actually /initiate/ writeback, it just waits
> > > > > (potentially for a long time) for someone else to come along and do it.
> > > > > That might not be what you want since the blocksize change will appear
> > > > > to stall while nothing else is going on in the system.
> > > >
> > > > I thought if the folio isn't under writeback then
> > > > folio_wait_writeback() just returns immediately as a no-op.
> > > > I don't think we need/want to initiate writeback, I think we only need
> > > > to ensure that if it is already under writeback, that writeback
> > > > finishes while it uses the old i_blksize so nothing gets corrupted. As
> > > > I understand it (but maybe I'm misjudging this), holding the inode
> > > > lock and then initiating writeback is too much given that writeback
> > > > can take a long time (eg if the fuse server writes the data over some
> > > > network).
> > > >
> > > > >
> > > > > Also, unless you're going to put this in buffered-io.c, it's not
> > > > > desirable for a piece of code to free something it didn't allocate.
> > > > > IOWs, I don't think it's a good idea for *fuse* to go messing with a
> > > > > folio->private that iomap set.
> > > >
> > > > Okay, good point. I agree. I was hoping to have this not bleed into
> > > > the iomap library but maybe there's no getting around that in a good
> > > > way.
> > >
> > > <shrug> Any other filesystem that has mutable file block size is going
> > > to need something to enact a change.
> > >
> > > > >
> > > > > > > >           folio_unlock(folio);
> > > > > > > >      }
> > > > > > > >     inode->i_blkbits = new_blkbits_size;
> > > > > > >
> > > > > > > The trouble is, you also have to resize the iomap_folio_state objects
> > > > > > > attached to each folio if you change i_blkbits...
> > > > > >
> > > > > > I think the iomap_folio_state objects automatically get resized here,
> > > > > > no? We first kfree the folio->private which kfrees the entire ifs,
> > > > >
> > > > > Err, right, it does free the ifs and recreate it later if necessary.
> > > > >
> > > > > > then we change inode->i_blkbits to the new size, then when we call
> > > > > > folio_mark_dirty(), it'll create the new ifs which creates a new folio
> > > > > > state object using the new/updated i_blkbits size
> > > > > >
> > > > > > >
> > > > > > > >     xas_set(&xas, 0);
> > > > > > > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > > > >           folio_lock(folio);
> > > > > > > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > > > > > > >                  folio_mark_dirty(folio);
> > > > > > >
> > > > > > > ...because iomap_dirty_folio doesn't know how to reallocate the folio
> > > > > > > state object in response to i_blkbits having changed.
> > > > >
> > > > > Also, what about clean folios that have an ifs?  You'd still need to
> > > > > handle the ifs's attached to those.
> > > >
> > > > Ah you're right, there could be clean folios there too that have an
> > > > ifs. I think in the above logic, if we iterate through all
> > > > mapping->i_pages (not just PAGECACHE_TAG_DIRTY marked ones) and move
> > > > the kfree to after the "if (folio_test_dirty(folio))" block, then it
> > > > addresses that case. eg something like this:
> > > >
> > > >      inode_lock(inode);
> > > >      XA_STATE(xas, &mapping->i_pages, 0);
> > > >      xa_lock_irq(&mapping->i_pages);
> > > >      xas_for_each(&xas, folio, ULONG_MAX) {
> > > >           folio_lock(folio);
> > > >           if (folio_test_dirty(folio))
> > > >                   folio_wait_writeback(folio);
> > > >           kfree(folio->private);
> > > >           folio_unlock(folio);
> > > >      }
> > > >     inode->i_blkbits = new_blkbits;
> > > >     xas_set(&xas, 0);
> > > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > >           folio_lock(folio);
> > > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > > >                  folio_mark_dirty(folio);
> > > >           folio_unlock(folio);
> > > >     }
> > > >     xa_unlock_irq(&mapping->i_pages);
> > > >     inode_unlock(inode);
> > > >
> > > >
> > > > >
> > > > > So I guess if you wanted iomap to handle a blocksize change, you could
> > > > > do something like:
> > > > >
> > > > > iomap_change_file_blocksize(inode, new_blkbits) {
> > > > >         inode_lock()
> > > > >         filemap_invalidate_lock()
> > > > >
> > > > >         inode_dio_wait()
> > > > >         filemap_write_and_wait()
> > > > >         if (new_blkbits > mapping_min_folio_order()) {
> > > > >                 truncate_pagecache()
> > > > >                 inode->i_blkbits = new_blkbits;
> > > > >         } else {
> > > > >                 inode->i_blkbits = new_blkbits;
> > > > >                 xas_for_each(...) {
> > > > >                         <create new ifs>
> > > > >                         <translate uptodate/dirty state to new ifs>
> > > > >                         <swap ifs>
> > > > >                         <free old ifs>
> > > > >                 }
> > > > >         }
> > > > >
> > > > >         filemap_invalidate_unlock()
> > > > >         inode_unlock()
> > > > > }
> > > >
> > > > Do you prefer this logic to the one above that walks through
> > > > &mapping->i_pages? If so, then I'll go with this way.
> > >
> > > Yes.  iomap should not be tightly bound to the pagecache's xarray; I
> > > don't even really like the xas_for_each that I suggested above.
> >
> > Okay, sounds good.
> >
> > >
> > > > The part I'm unsure about is that this logic seems more disruptive (eg
> > > > initiating writeback while holding the inode lock and doing work for
> > > > unmapping/page cache removal) than the other approach, but I guess
> > > > this is also rare enough that it doesn't matter much.
> > >
> > > I hope it's rare enough that doing truncate_pagecache unconditionally
> > > won't be seen as a huge burden.
> > >
> > > iomap_change_file_blocksize(inode, new_blkbits) {
> > >         inode_dio_wait()
> > >         filemap_write_and_wait()
> > >         truncate_pagecache()
> > >
> > >         inode->i_blkbits = new_blkbits;
> > > }
> > >
> > > fuse_file_change_blocksize(inode, new_blkbits) {
> > >         inode_lock()
> > >         filemap_invalidate_lock()
> > >
> > >         iomap_change_file_blocksize(inode, new_blkbits);
> > >
> > >         filemap_invalidate_unlock()
> > >         inode_unlock()
> > > }
> > >
> > > Though my question remains -- is there a fuse filesystem that changes
> > > the blocksize at runtime such that we can test this??
> >
> > There's not one currently but I was planning to hack up the libfuse
> > passthrough_hp server to test the change.
>
> Heh, ok.
>
> I guess I could also hack up fuse2fs to change its own blocksize
> randomly to see how many programs that pisses off. :)
>
> (Not right now though, gotta prepare for fossy tomorrow...)
>

What I've been using as a helpful sanity-check so far has been running
fstests generic/750 after adding this line to libfuse:

+++ b/lib/fuse_lowlevel.c
@@ -547,6 +547,8 @@ int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
        arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
        convert_stat(attr, &arg.attr);
+       arg.attr.blksize = 4096;
        return send_reply_ok(req, &arg, size);

and modifying the kernel side logic in fuse_change_attributes_common()
to unconditionally execute the page cache removal logic if
attr->blksize != 0.


While running this however, I discovered another problem :/ we can't
grab the inode lock here in the fuse path because the vfs layer that
calls into this logic may already be holding the inode lock (eg the
stack traces I was seeing included path_openat()  ->
inode_permission() -> fuse_permission() which then fetches the
blksize, and the vfs rename path), while there are other call paths
that may not be holding the lock already.

I don't really see a good solution here. The simplest one imo would be
to cache "u8 blkbits" in the iomap_folio_state struct - are you okay
with that or do you think there's a better solution here?


Thanks,
Joanne

> --D
>
> > >
> > > --D
> > >
> > > > Thanks,
> > > > Joanne
> > > >
> > > > >
> > > > > --D
> > > > >
> > > > > > > --D
> > > > > > >
> > > > > > > >           folio_unlock(folio);
> > > > > > > >     }
> > > > > > > >     xa_unlock_irq(&mapping->i_pages);
> > > > > > > >     inode_unlock(inode);
> > > > > > > >
> > > > > > > >
> > > > > > > > I think this is the only approach that doesn't require changes to iomap.
> > > > > > > >
> > > > > > > > I'm going to think about this some more next week and will try to send
> > > > > > > > out a patch for this then.
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Joanne
> > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > > > > > > > > > > {
> > > > > > > > > > >         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> > > > > > > > > > >
> > > > > > > > > > >         if (inode->i_blkbits == new_blkbits)
> > > > > > > > > > >                 return;
> > > > > > > > > > >
> > > > > > > > > > >         if (!S_ISREG(inode->i_mode))
> > > > > > > > > > >                 goto set_it;
> > > > > > > > > > >
> > > > > > > > > > >         /*
> > > > > > > > > > >          * iomap attaches per-block state to each folio, so we cannot allow
> > > > > > > > > > >          * the file block size to change if there's anything in the page cache.
> > > > > > > > > > >          * In theory, fuse servers should never be doing this.
> > > > > > > > > > >          */
> > > > > > > > > > >         if (inode->i_mapping->nrpages > 0) {
> > > > > > > > > > >                 WARN_ON(inode->i_blkbits != new_blkbits &&
> > > > > > > > > > >                         inode->i_mapping->nrpages > 0);
> > > > > > > > > > >                 return;
> > > > > > > > > > >         }
> > > > > > > > > > >
> > > > > > > > > > > set_it:
> > > > > > > > > > >         inode->i_blkbits = new_blkbits;
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> > > > > > > > > > >
> > > > > > > > > > > --D
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Joanne
> > > > > > > > > > > >
> > > > > > > >
> > > >
> >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-30 22:54                           ` Joanne Koong
@ 2025-07-31 17:55                             ` Darrick J. Wong
  2025-07-31 20:48                               ` Joanne Koong
  0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2025-07-31 17:55 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Wed, Jul 30, 2025 at 03:54:15PM -0700, Joanne Koong wrote:
> On Tue, Jul 29, 2025 at 4:40 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Tue, Jul 29, 2025 at 04:23:02PM -0700, Joanne Koong wrote:
> > > On Tue, Jul 29, 2025 at 1:21 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Mon, Jul 28, 2025 at 02:28:31PM -0700, Joanne Koong wrote:
> > > > > On Mon, Jul 28, 2025 at 12:11 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > On Mon, Jul 28, 2025 at 10:44:01AM -0700, Joanne Koong wrote:
> > > > > > > On Mon, Jul 28, 2025 at 10:14 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > > > > > > > > On Thu, Jul 24, 2025 at 12:14 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > > > > > > > > > > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [cc Joanne]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > > > > > > > > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > > > > > > > > > > > fuse file.c at fuse_iomap_writeback_range
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ## Test log
> > > > > > > > > > > > > > > ------------[ cut here ]------------
> > > > > > > > > > > > > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > > > > > > > > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > /me speculates that this might be triggered by an attempt to write back
> > > > > > > > > > > > > > some 4k fsblock within a 16/64k base page?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think this can happen on 4k base pages as well actually. On the
> > > > > > > > > > > > > iomap side, the length passed is always block-aligned and in fuse, we
> > > > > > > > > > > > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > > > > > > > > > > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > > > > > > > > > > > isn't true and the blocksize is initialized to a default size of 512
> > > > > > > > > > > > > or whatever block size is passed in when it's mounted.
> > > > > > > > > > > >
> > > > > > > > > > > > <nod> I think you're correct.
> > > > > > > > > > > >
> > > > > > > > > > > > > I'll send out a patch to remove this line. It doesn't make any
> > > > > > > > > > > > > difference for fuse_iomap_writeback_range() logic whether len is
> > > > > > > > > > > > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > > > > > > > > > > > ranges.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Also, I just noticed that apparently the blocksize can change
> > > > > > > > > > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > > > > > > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > > > > > > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > > > > > > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > > > > > > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > > > > > > > > > patch for this.
> > > > > > > > > > > >
> > > > > > > > > > > > From my understanding of the iomap code, it's possible to do that if you
> > > > > > > > > > > > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > > > > > > > > > > > mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> > > > > > > > > > > > so I have no idea if it actually works, however.  Note that even I don't
> > > > > > > > > > > > implement the flush and unmap bit; I just scream loudly and do nothing:
> > > > > > > > > > >
> > > > > > > > > > > lol! i wish I could scream loudly and do nothing too for my case.
> > > > > > > > > > >
> > > > > > > > > > > AFAICT, I think I just need to flush and unmap that file and can leave
> > > > > > > > > > > the rest of the files/folios in the pagecache as is? But then if the
> > > > > > > > > > > file has active refcounts on it or has been pinned into memory, can I
> > > > > > > > > > > still unmap and remove it from the page cache? I see the
> > > > > > > > > > > invalidate_inode_pages2() function but my understanding is that the
> > > > > > > > > > > page still stays in the cache if it has has active references, and if
> > > > > > > > > > > the page gets mmaped and there's a page fault on it, it'll end up
> > > > > > > > > > > using the preexisting old page in the page cache.
> > > > > > > > > >
> > > > > > > > > > Never mind, I was mistaken about this. Johannes confirmed that even if
> > > > > > > > > > there's active refcounts on the folio, it'll still get removed from
> > > > > > > > > > the page cache after unmapping and the page cache reference will get
> > > > > > > > > > dropped.
> > > > > > > > > >
> > > > > > > > > > I think I can just do what you suggested and call
> > > > > > > > > > filemap_invalidate_inode() in fuse_change_attributes_common() then if
> > > > > > > > > > the inode blksize gets changed. Thanks for the suggestion!
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thinking about this some more, I don't think this works after all
> > > > > > > > > because the writeback + page cache removal and inode blkbits update
> > > > > > > > > needs to be atomic, else after we write back and remove the pages from
> > > > > > > > > the page cache, a write could be issued right before we update the
> > > > > > > > > inode blkbits. I don't think we can hold the inode lock the whole time
> > > > > > > > > for it either since writeback could be intensive. (also btw, I
> > > > > > > > > realized in hindsight that invalidate_inode_pages2_range() would have
> > > > > > > > > been the better function to call instead of
> > > > > > > > > filemap_invalidate_inode()).
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I don't think I really need to have it removed from the page cache so
> > > > > > > > > > > much as just have the ifs state for all the folios in the file freed
> > > > > > > > > > > (after flushing the file) so that it can start over with a new ifs.
> > > > > > > > > > > Ideally we could just flush the file, then iterate through all the
> > > > > > > > > > > folios in the mapping in order of ascending index, and kfree their
> > > > > > > > > > > ->private, but I'm not seeing how we can prevent the case of new
> > > > > > > > > > > writes / a new ifs getting allocated for folios at previous indexes
> > > > > > > > > > > while we're trying to do the iteration/kfreeing.
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Going back to this idea, I think this can work. I realized we don't
> > > > > > > > > need to flush the file, it's enough to free the ifs, then update the
> > > > > > > > > inode->i_blkbits, then reallocate the ifs (which will now use the
> > > > > > > > > updated blkbits size), and if we hold the inode lock throughout, that
> > > > > > > > > prevents any concurrent writes.
> > > > > > > > > Something like:
> > > > > > > > >      inode_lock(inode);
> > > > > > > > >      XA_STATE(xas, &mapping->i_pages, 0);
> > > > > > > > >      xa_lock_irq(&mapping->i_pages);
> > > > > > > > >      xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > > > > >           folio_lock(folio);
> > > > > > > > >           if (folio_test_dirty(folio)) {
> > > > > > > > >                   folio_wait_writeback(folio);
> > > > > > > > >                   kfree(folio->private);
> > > > > > > > >           }
> > > > > >
> > > > > > Heh, I didn't even see this chunk, distracted as I am today. :/
> > > > > >
> > > > > > So this doesn't actually /initiate/ writeback, it just waits
> > > > > > (potentially for a long time) for someone else to come along and do it.
> > > > > > That might not be what you want since the blocksize change will appear
> > > > > > to stall while nothing else is going on in the system.
> > > > >
> > > > > I thought if the folio isn't under writeback then
> > > > > folio_wait_writeback() just returns immediately as a no-op.
> > > > > I don't think we need/want to initiate writeback, I think we only need
> > > > > to ensure that if it is already under writeback, that writeback
> > > > > finishes while it uses the old i_blksize so nothing gets corrupted. As
> > > > > I understand it (but maybe I'm misjudging this), holding the inode
> > > > > lock and then initiating writeback is too much given that writeback
> > > > > can take a long time (eg if the fuse server writes the data over some
> > > > > network).
> > > > >
> > > > > >
> > > > > > Also, unless you're going to put this in buffered-io.c, it's not
> > > > > > desirable for a piece of code to free something it didn't allocate.
> > > > > > IOWs, I don't think it's a good idea for *fuse* to go messing with a
> > > > > > folio->private that iomap set.
> > > > >
> > > > > Okay, good point. I agree. I was hoping to have this not bleed into
> > > > > the iomap library but maybe there's no getting around that in a good
> > > > > way.
> > > >
> > > > <shrug> Any other filesystem that has mutable file block size is going
> > > > to need something to enact a change.
> > > >
> > > > > >
> > > > > > > > >           folio_unlock(folio);
> > > > > > > > >      }
> > > > > > > > >     inode->i_blkbits = new_blkbits_size;
> > > > > > > >
> > > > > > > > The trouble is, you also have to resize the iomap_folio_state objects
> > > > > > > > attached to each folio if you change i_blkbits...
> > > > > > >
> > > > > > > I think the iomap_folio_state objects automatically get resized here,
> > > > > > > no? We first kfree the folio->private which kfrees the entire ifs,
> > > > > >
> > > > > > Err, right, it does free the ifs and recreate it later if necessary.
> > > > > >
> > > > > > > then we change inode->i_blkbits to the new size, then when we call
> > > > > > > folio_mark_dirty(), it'll create the new ifs which creates a new folio
> > > > > > > state object using the new/updated i_blkbits size
> > > > > > >
> > > > > > > >
> > > > > > > > >     xas_set(&xas, 0);
> > > > > > > > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > > > > >           folio_lock(folio);
> > > > > > > > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > > > > > > > >                  folio_mark_dirty(folio);
> > > > > > > >
> > > > > > > > ...because iomap_dirty_folio doesn't know how to reallocate the folio
> > > > > > > > state object in response to i_blkbits having changed.
> > > > > >
> > > > > > Also, what about clean folios that have an ifs?  You'd still need to
> > > > > > handle the ifs's attached to those.
> > > > >
> > > > > Ah you're right, there could be clean folios there too that have an
> > > > > ifs. I think in the above logic, if we iterate through all
> > > > > mapping->i_pages (not just PAGECACHE_TAG_DIRTY marked ones) and move
> > > > > the kfree to after the "if (folio_test_dirty(folio))" block, then it
> > > > > addresses that case. eg something like this:
> > > > >
> > > > >      inode_lock(inode);
> > > > >      XA_STATE(xas, &mapping->i_pages, 0);
> > > > >      xa_lock_irq(&mapping->i_pages);
> > > > >      xas_for_each(&xas, folio, ULONG_MAX) {
> > > > >           folio_lock(folio);
> > > > >           if (folio_test_dirty(folio))
> > > > >                   folio_wait_writeback(folio);
> > > > >           kfree(folio->private);
> > > > >           folio_unlock(folio);
> > > > >      }
> > > > >     inode->i_blkbits = new_blkbits;
> > > > >     xas_set(&xas, 0);
> > > > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > >           folio_lock(folio);
> > > > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > > > >                  folio_mark_dirty(folio);
> > > > >           folio_unlock(folio);
> > > > >     }
> > > > >     xa_unlock_irq(&mapping->i_pages);
> > > > >     inode_unlock(inode);
> > > > >
> > > > >
> > > > > >
> > > > > > So I guess if you wanted iomap to handle a blocksize change, you could
> > > > > > do something like:
> > > > > >
> > > > > > iomap_change_file_blocksize(inode, new_blkbits) {
> > > > > >         inode_lock()
> > > > > >         filemap_invalidate_lock()
> > > > > >
> > > > > >         inode_dio_wait()
> > > > > >         filemap_write_and_wait()
> > > > > >         if (new_blkbits > mapping_min_folio_order()) {
> > > > > >                 truncate_pagecache()
> > > > > >                 inode->i_blkbits = new_blkbits;
> > > > > >         } else {
> > > > > >                 inode->i_blkbits = new_blkbits;
> > > > > >                 xas_for_each(...) {
> > > > > >                         <create new ifs>
> > > > > >                         <translate uptodate/dirty state to new ifs>
> > > > > >                         <swap ifs>
> > > > > >                         <free old ifs>
> > > > > >                 }
> > > > > >         }
> > > > > >
> > > > > >         filemap_invalidate_unlock()
> > > > > >         inode_unlock()
> > > > > > }
> > > > >
> > > > > Do you prefer this logic to the one above that walks through
> > > > > &mapping->i_pages? If so, then I'll go with this way.
> > > >
> > > > Yes.  iomap should not be tightly bound to the pagecache's xarray; I
> > > > don't even really like the xas_for_each that I suggested above.
> > >
> > > Okay, sounds good.
> > >
> > > >
> > > > > The part I'm unsure about is that this logic seems more disruptive (eg
> > > > > initiating writeback while holding the inode lock and doing work for
> > > > > unmapping/page cache removal) than the other approach, but I guess
> > > > > this is also rare enough that it doesn't matter much.
> > > >
> > > > I hope it's rare enough that doing truncate_pagecache unconditionally
> > > > won't be seen as a huge burden.
> > > >
> > > > iomap_change_file_blocksize(inode, new_blkbits) {
> > > >         inode_dio_wait()
> > > >         filemap_write_and_wait()
> > > >         truncate_pagecache()
> > > >
> > > >         inode->i_blkbits = new_blkbits;
> > > > }
> > > >
> > > > fuse_file_change_blocksize(inode, new_blkbits) {
> > > >         inode_lock()
> > > >         filemap_invalidate_lock()
> > > >
> > > >         iomap_change_file_blocksize(inode, new_blkbits);
> > > >
> > > >         filemap_invalidate_unlock()
> > > >         inode_unlock()
> > > > }
> > > >
> > > > Though my question remains -- is there a fuse filesystem that changes
> > > > the blocksize at runtime such that we can test this??
> > >
> > > There's not one currently but I was planning to hack up the libfuse
> > > passthrough_hp server to test the change.
> >
> > Heh, ok.
> >
> > I guess I could also hack up fuse2fs to change its own blocksize
> > randomly to see how many programs that pisses off. :)
> >
> > (Not right now though, gotta prepare for fossy tomorrow...)
> >
> 
> What I've been using as a helpful sanity-check so far has been running
> fstests generic/750 after adding this line to libfuse:
> 
> +++ b/lib/fuse_lowlevel.c
> @@ -547,6 +547,8 @@ int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
>         arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
>         convert_stat(attr, &arg.attr);
> +       arg.attr.blksize = 4096;
>         return send_reply_ok(req, &arg, size);
> 
> and modifying the kernel side logic in fuse_change_attributes_common()
> to unconditionally execute the page cache removal logic if
> attr->blksize != 0.
> 
> 
> While running this however, I discovered another problem :/ we can't
> grab the inode lock here in the fuse path because the vfs layer that
> calls into this logic may already be holding the inode lock (eg the
> stack traces I was seeing included path_openat()  ->
> inode_permission() -> fuse_permission() which then fetches the
> blksize, and the vfs rename path), while there are other call paths
> that may not be holding the lock already.

Oh nooooo heisenlocking.  Which paths do not hold i_rwsem?

> I don't really see a good solution here. The simplest one imo would be
> to cache "u8 blkbits" in the iomap_folio_state struct - are you okay
> with that or do you think there's a better solution here?

1. Don't support changing the blocksize, complain loudly if anyone does,
and only then implement it.  Writeback cache is a fairly new feature so
the impact should be low, right? ;)

[there is no 2 :D]

--D

> 
> Thanks,
> Joanne
> 
> > --D
> >
> > > >
> > > > --D
> > > >
> > > > > Thanks,
> > > > > Joanne
> > > > >
> > > > > >
> > > > > > --D
> > > > > >
> > > > > > > > --D
> > > > > > > >
> > > > > > > > >           folio_unlock(folio);
> > > > > > > > >     }
> > > > > > > > >     xa_unlock_irq(&mapping->i_pages);
> > > > > > > > >     inode_unlock(inode);
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I think this is the only approach that doesn't require changes to iomap.
> > > > > > > > >
> > > > > > > > > I'm going to think about this some more next week and will try to send
> > > > > > > > > out a patch for this then.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Joanne
> > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > > > > > > > > > > > {
> > > > > > > > > > > >         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> > > > > > > > > > > >
> > > > > > > > > > > >         if (inode->i_blkbits == new_blkbits)
> > > > > > > > > > > >                 return;
> > > > > > > > > > > >
> > > > > > > > > > > >         if (!S_ISREG(inode->i_mode))
> > > > > > > > > > > >                 goto set_it;
> > > > > > > > > > > >
> > > > > > > > > > > >         /*
> > > > > > > > > > > >          * iomap attaches per-block state to each folio, so we cannot allow
> > > > > > > > > > > >          * the file block size to change if there's anything in the page cache.
> > > > > > > > > > > >          * In theory, fuse servers should never be doing this.
> > > > > > > > > > > >          */
> > > > > > > > > > > >         if (inode->i_mapping->nrpages > 0) {
> > > > > > > > > > > >                 WARN_ON(inode->i_blkbits != new_blkbits &&
> > > > > > > > > > > >                         inode->i_mapping->nrpages > 0);
> > > > > > > > > > > >                 return;
> > > > > > > > > > > >         }
> > > > > > > > > > > >
> > > > > > > > > > > > set_it:
> > > > > > > > > > > >         inode->i_blkbits = new_blkbits;
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> > > > > > > > > > > >
> > > > > > > > > > > > --D
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Joanne
> > > > > > > > > > > > >
> > > > > > > > >
> > > > >
> > >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-31 17:55                             ` Darrick J. Wong
@ 2025-07-31 20:48                               ` Joanne Koong
  2025-08-12 16:42                                 ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Joanne Koong @ 2025-07-31 20:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Thu, Jul 31, 2025 at 10:55 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Jul 30, 2025 at 03:54:15PM -0700, Joanne Koong wrote:
> > On Tue, Jul 29, 2025 at 4:40 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Tue, Jul 29, 2025 at 04:23:02PM -0700, Joanne Koong wrote:
> > > > On Tue, Jul 29, 2025 at 1:21 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > On Mon, Jul 28, 2025 at 02:28:31PM -0700, Joanne Koong wrote:
> > > > > > On Mon, Jul 28, 2025 at 12:11 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > >
> > > > > > > On Mon, Jul 28, 2025 at 10:44:01AM -0700, Joanne Koong wrote:
> > > > > > > > On Mon, Jul 28, 2025 at 10:14 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > > > > > > > > > On Thu, Jul 24, 2025 at 12:14 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > > > > > > > > > > > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > [cc Joanne]
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > > > > > > > > > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > > > > > > > > > > > > fuse file.c at fuse_iomap_writeback_range
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ## Test log
> > > > > > > > > > > > > > > > ------------[ cut here ]------------
> > > > > > > > > > > > > > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > > > > > > > > > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > /me speculates that this might be triggered by an attempt to write back
> > > > > > > > > > > > > > > some 4k fsblock within a 16/64k base page?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think this can happen on 4k base pages as well actually. On the
> > > > > > > > > > > > > > iomap side, the length passed is always block-aligned and in fuse, we
> > > > > > > > > > > > > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > > > > > > > > > > > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > > > > > > > > > > > > isn't true and the blocksize is initialized to a default size of 512
> > > > > > > > > > > > > > or whatever block size is passed in when it's mounted.
> > > > > > > > > > > > >
> > > > > > > > > > > > > <nod> I think you're correct.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I'll send out a patch to remove this line. It doesn't make any
> > > > > > > > > > > > > > difference for fuse_iomap_writeback_range() logic whether len is
> > > > > > > > > > > > > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > > > > > > > > > > > > ranges.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Also, I just noticed that apparently the blocksize can change
> > > > > > > > > > > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > > > > > > > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > > > > > > > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > > > > > > > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > > > > > > > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > > > > > > > > > > patch for this.
> > > > > > > > > > > > >
> > > > > > > > > > > > > From my understanding of the iomap code, it's possible to do that if you
> > > > > > > > > > > > > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > > > > > > > > > > > > mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> > > > > > > > > > > > > so I have no idea if it actually works, however.  Note that even I don't
> > > > > > > > > > > > > implement the flush and unmap bit; I just scream loudly and do nothing:
> > > > > > > > > > > >
> > > > > > > > > > > > lol! i wish I could scream loudly and do nothing too for my case.
> > > > > > > > > > > >
> > > > > > > > > > > > AFAICT, I think I just need to flush and unmap that file and can leave
> > > > > > > > > > > > the rest of the files/folios in the pagecache as is? But then if the
> > > > > > > > > > > > file has active refcounts on it or has been pinned into memory, can I
> > > > > > > > > > > > still unmap and remove it from the page cache? I see the
> > > > > > > > > > > > invalidate_inode_pages2() function but my understanding is that the
> > > > > > > > > > > > page still stays in the cache if it has has active references, and if
> > > > > > > > > > > > the page gets mmaped and there's a page fault on it, it'll end up
> > > > > > > > > > > > using the preexisting old page in the page cache.
> > > > > > > > > > >
> > > > > > > > > > > Never mind, I was mistaken about this. Johannes confirmed that even if
> > > > > > > > > > > there's active refcounts on the folio, it'll still get removed from
> > > > > > > > > > > the page cache after unmapping and the page cache reference will get
> > > > > > > > > > > dropped.
> > > > > > > > > > >
> > > > > > > > > > > I think I can just do what you suggested and call
> > > > > > > > > > > filemap_invalidate_inode() in fuse_change_attributes_common() then if
> > > > > > > > > > > the inode blksize gets changed. Thanks for the suggestion!
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thinking about this some more, I don't think this works after all
> > > > > > > > > > because the writeback + page cache removal and inode blkbits update
> > > > > > > > > > needs to be atomic, else after we write back and remove the pages from
> > > > > > > > > > the page cache, a write could be issued right before we update the
> > > > > > > > > > inode blkbits. I don't think we can hold the inode lock the whole time
> > > > > > > > > > for it either since writeback could be intensive. (also btw, I
> > > > > > > > > > realized in hindsight that invalidate_inode_pages2_range() would have
> > > > > > > > > > been the better function to call instead of
> > > > > > > > > > filemap_invalidate_inode()).
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I don't think I really need to have it removed from the page cache so
> > > > > > > > > > > > much as just have the ifs state for all the folios in the file freed
> > > > > > > > > > > > (after flushing the file) so that it can start over with a new ifs.
> > > > > > > > > > > > Ideally we could just flush the file, then iterate through all the
> > > > > > > > > > > > folios in the mapping in order of ascending index, and kfree their
> > > > > > > > > > > > ->private, but I'm not seeing how we can prevent the case of new
> > > > > > > > > > > > writes / a new ifs getting allocated for folios at previous indexes
> > > > > > > > > > > > while we're trying to do the iteration/kfreeing.
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Going back to this idea, I think this can work. I realized we don't
> > > > > > > > > > need to flush the file, it's enough to free the ifs, then update the
> > > > > > > > > > inode->i_blkbits, then reallocate the ifs (which will now use the
> > > > > > > > > > updated blkbits size), and if we hold the inode lock throughout, that
> > > > > > > > > > prevents any concurrent writes.
> > > > > > > > > > Something like:
> > > > > > > > > >      inode_lock(inode);
> > > > > > > > > >      XA_STATE(xas, &mapping->i_pages, 0);
> > > > > > > > > >      xa_lock_irq(&mapping->i_pages);
> > > > > > > > > >      xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > > > > > >           folio_lock(folio);
> > > > > > > > > >           if (folio_test_dirty(folio)) {
> > > > > > > > > >                   folio_wait_writeback(folio);
> > > > > > > > > >                   kfree(folio->private);
> > > > > > > > > >           }
> > > > > > >
> > > > > > > Heh, I didn't even see this chunk, distracted as I am today. :/
> > > > > > >
> > > > > > > So this doesn't actually /initiate/ writeback, it just waits
> > > > > > > (potentially for a long time) for someone else to come along and do it.
> > > > > > > That might not be what you want since the blocksize change will appear
> > > > > > > to stall while nothing else is going on in the system.
> > > > > >
> > > > > > I thought if the folio isn't under writeback then
> > > > > > folio_wait_writeback() just returns immediately as a no-op.
> > > > > > I don't think we need/want to initiate writeback, I think we only need
> > > > > > to ensure that if it is already under writeback, that writeback
> > > > > > finishes while it uses the old i_blksize so nothing gets corrupted. As
> > > > > > I understand it (but maybe I'm misjudging this), holding the inode
> > > > > > lock and then initiating writeback is too much given that writeback
> > > > > > can take a long time (eg if the fuse server writes the data over some
> > > > > > network).
> > > > > >
> > > > > > >
> > > > > > > Also, unless you're going to put this in buffered-io.c, it's not
> > > > > > > desirable for a piece of code to free something it didn't allocate.
> > > > > > > IOWs, I don't think it's a good idea for *fuse* to go messing with a
> > > > > > > folio->private that iomap set.
> > > > > >
> > > > > > Okay, good point. I agree. I was hoping to have this not bleed into
> > > > > > the iomap library but maybe there's no getting around that in a good
> > > > > > way.
> > > > >
> > > > > <shrug> Any other filesystem that has mutable file block size is going
> > > > > to need something to enact a change.
> > > > >
> > > > > > >
> > > > > > > > > >           folio_unlock(folio);
> > > > > > > > > >      }
> > > > > > > > > >     inode->i_blkbits = new_blkbits_size;
> > > > > > > > >
> > > > > > > > > The trouble is, you also have to resize the iomap_folio_state objects
> > > > > > > > > attached to each folio if you change i_blkbits...
> > > > > > > >
> > > > > > > > I think the iomap_folio_state objects automatically get resized here,
> > > > > > > > no? We first kfree the folio->private which kfrees the entire ifs,
> > > > > > >
> > > > > > > Err, right, it does free the ifs and recreate it later if necessary.
> > > > > > >
> > > > > > > > then we change inode->i_blkbits to the new size, then when we call
> > > > > > > > folio_mark_dirty(), it'll create the new ifs which creates a new folio
> > > > > > > > state object using the new/updated i_blkbits size
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >     xas_set(&xas, 0);
> > > > > > > > > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > > > > > >           folio_lock(folio);
> > > > > > > > > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > > > > > > > > >                  folio_mark_dirty(folio);
> > > > > > > > >
> > > > > > > > > ...because iomap_dirty_folio doesn't know how to reallocate the folio
> > > > > > > > > state object in response to i_blkbits having changed.
> > > > > > >
> > > > > > > Also, what about clean folios that have an ifs?  You'd still need to
> > > > > > > handle the ifs's attached to those.
> > > > > >
> > > > > > Ah you're right, there could be clean folios there too that have an
> > > > > > ifs. I think in the above logic, if we iterate through all
> > > > > > mapping->i_pages (not just PAGECACHE_TAG_DIRTY marked ones) and move
> > > > > > the kfree to after the "if (folio_test_dirty(folio))" block, then it
> > > > > > addresses that case. eg something like this:
> > > > > >
> > > > > >      inode_lock(inode);
> > > > > >      XA_STATE(xas, &mapping->i_pages, 0);
> > > > > >      xa_lock_irq(&mapping->i_pages);
> > > > > >      xas_for_each(&xas, folio, ULONG_MAX) {
> > > > > >           folio_lock(folio);
> > > > > >           if (folio_test_dirty(folio))
> > > > > >                   folio_wait_writeback(folio);
> > > > > >           kfree(folio->private);
> > > > > >           folio_unlock(folio);
> > > > > >      }
> > > > > >     inode->i_blkbits = new_blkbits;
> > > > > >     xas_set(&xas, 0);
> > > > > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > >           folio_lock(folio);
> > > > > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > > > > >                  folio_mark_dirty(folio);
> > > > > >           folio_unlock(folio);
> > > > > >     }
> > > > > >     xa_unlock_irq(&mapping->i_pages);
> > > > > >     inode_unlock(inode);
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > So I guess if you wanted iomap to handle a blocksize change, you could
> > > > > > > do something like:
> > > > > > >
> > > > > > > iomap_change_file_blocksize(inode, new_blkbits) {
> > > > > > >         inode_lock()
> > > > > > >         filemap_invalidate_lock()
> > > > > > >
> > > > > > >         inode_dio_wait()
> > > > > > >         filemap_write_and_wait()
> > > > > > >         if (new_blkbits > mapping_min_folio_order()) {
> > > > > > >                 truncate_pagecache()
> > > > > > >                 inode->i_blkbits = new_blkbits;
> > > > > > >         } else {
> > > > > > >                 inode->i_blkbits = new_blkbits;
> > > > > > >                 xas_for_each(...) {
> > > > > > >                         <create new ifs>
> > > > > > >                         <translate uptodate/dirty state to new ifs>
> > > > > > >                         <swap ifs>
> > > > > > >                         <free old ifs>
> > > > > > >                 }
> > > > > > >         }
> > > > > > >
> > > > > > >         filemap_invalidate_unlock()
> > > > > > >         inode_unlock()
> > > > > > > }
> > > > > >
> > > > > > Do you prefer this logic to the one above that walks through
> > > > > > &mapping->i_pages? If so, then I'll go with this way.
> > > > >
> > > > > Yes.  iomap should not be tightly bound to the pagecache's xarray; I
> > > > > don't even really like the xas_for_each that I suggested above.
> > > >
> > > > Okay, sounds good.
> > > >
> > > > >
> > > > > > The part I'm unsure about is that this logic seems more disruptive (eg
> > > > > > initiating writeback while holding the inode lock and doing work for
> > > > > > unmapping/page cache removal) than the other approach, but I guess
> > > > > > this is also rare enough that it doesn't matter much.
> > > > >
> > > > > I hope it's rare enough that doing truncate_pagecache unconditionally
> > > > > won't be seen as a huge burden.
> > > > >
> > > > > iomap_change_file_blocksize(inode, new_blkbits) {
> > > > >         inode_dio_wait()
> > > > >         filemap_write_and_wait()
> > > > >         truncate_pagecache()
> > > > >
> > > > >         inode->i_blkbits = new_blkbits;
> > > > > }
> > > > >
> > > > > fuse_file_change_blocksize(inode, new_blkbits) {
> > > > >         inode_lock()
> > > > >         filemap_invalidate_lock()
> > > > >
> > > > >         iomap_change_file_blocksize(inode, new_blkbits);
> > > > >
> > > > >         filemap_invalidate_unlock()
> > > > >         inode_unlock()
> > > > > }
> > > > >
> > > > > Though my question remains -- is there a fuse filesystem that changes
> > > > > the blocksize at runtime such that we can test this??
> > > >
> > > > There's not one currently but I was planning to hack up the libfuse
> > > > passthrough_hp server to test the change.
> > >
> > > Heh, ok.
> > >
> > > I guess I could also hack up fuse2fs to change its own blocksize
> > > randomly to see how many programs that pisses off. :)
> > >
> > > (Not right now though, gotta prepare for fossy tomorrow...)
> > >
> >
> > What I've been using as a helpful sanity-check so far has been running
> > fstests generic/750 after adding this line to libfuse:
> >
> > +++ b/lib/fuse_lowlevel.c
> > @@ -547,6 +547,8 @@ int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
> >         arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
> >         convert_stat(attr, &arg.attr);
> > +       arg.attr.blksize = 4096;
> >         return send_reply_ok(req, &arg, size);
> >
> > and modifying the kernel side logic in fuse_change_attributes_common()
> > to unconditionally execute the page cache removal logic if
> > attr->blksize != 0.
> >
> >
> > While running this however, I discovered another problem :/ we can't
> > grab the inode lock here in the fuse path because the vfs layer that
> > calls into this logic may already be holding the inode lock (eg the
> > stack traces I was seeing included path_openat()  ->
> > inode_permission() -> fuse_permission() which then fetches the
> > blksize, and the vfs rename path), while there are other call paths
> > that may not be holding the lock already.
>
> Oh nooooo heisenlocking.  Which paths do not hold i_rwsem?

A path I was seeing that doesn't hold the inode lock was

[ 19.738097] Call Trace:
[ 19.738468] inode_permission+0xea/0x190
[ 19.738790] may_open+0x6e/0x150
[ 19.739053] path_openat+0x4cf/0x1120
[ 19.739341] ? generic_fillattr+0x49/0x130
[ 19.739711] do_filp_open+0xc1/0x170
[ 19.740064] ? kmem_cache_alloc_noprof+0x11b/0x380
[ 19.740458] ? __check_object_size+0x22a/0x2c0
[ 19.740834] ? alloc_fd+0xea/0x1b0
[ 19.741125] do_sys_openat2+0x71/0xd0
[ 19.741435] __x64_sys_openat+0x56/0xa0
[ 19.741754] do_syscall_64+0x50/0x1c0
[ 19.742068] entry_SYSCALL_64_after_hwframe+0x76/0x7e

and a path that does hold the inode lock:

[   42.176858]  inode_permission+0xea/0x190
[   42.177372]  path_openat+0xd34/0x1120
[   42.177838]  do_filp_open+0xc1/0x170
[   42.178381]  ? kmem_cache_alloc_noprof+0x11b/0x380
[   42.178970]  ? __check_object_size+0x22a/0x2c0
[   42.179525]  ? alloc_fd+0xea/0x1b0
[   42.179955]  do_sys_openat2+0x71/0xd0
[   42.180417]  __x64_sys_creat+0x4c/0x70
[   42.180868]  do_syscall_64+0x50/0x1c0


>
> > I don't really see a good solution here. The simplest one imo would be
> > to cache "u8 blkbits" in the iomap_folio_state struct - are you okay
> > with that or do you think there's a better solution here?
>
> 1. Don't support changing the blocksize, complain loudly if anyone does,
> and only then implement it.  Writeback cache is a fairly new feature so
> the impact should be low, right? ;)
>
> [there is no 2 :D]

I think technically the writeback cache was added in 2014 but I don't
think anyone changes the blocksize so I'm happy to go with this
approach if you/Miklos think it's fine :D I'll send out a patch for
this then. Thanks for all the discussion on this!

>
> --D
>
> >
> > Thanks,
> > Joanne
> >
> > > --D
> > >
> > > > >
> > > > > --D
> > > > >
> > > > > > Thanks,
> > > > > > Joanne
> > > > > >
> > > > > > >
> > > > > > > --D
> > > > > > >
> > > > > > > > > --D
> > > > > > > > >
> > > > > > > > > >           folio_unlock(folio);
> > > > > > > > > >     }
> > > > > > > > > >     xa_unlock_irq(&mapping->i_pages);
> > > > > > > > > >     inode_unlock(inode);
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I think this is the only approach that doesn't require changes to iomap.
> > > > > > > > > >
> > > > > > > > > > I'm going to think about this some more next week and will try to send
> > > > > > > > > > out a patch for this then.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Joanne
> > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > > > > > > > > > > > > {
> > > > > > > > > > > > >         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> > > > > > > > > > > > >
> > > > > > > > > > > > >         if (inode->i_blkbits == new_blkbits)
> > > > > > > > > > > > >                 return;
> > > > > > > > > > > > >
> > > > > > > > > > > > >         if (!S_ISREG(inode->i_mode))
> > > > > > > > > > > > >                 goto set_it;
> > > > > > > > > > > > >
> > > > > > > > > > > > >         /*
> > > > > > > > > > > > >          * iomap attaches per-block state to each folio, so we cannot allow
> > > > > > > > > > > > >          * the file block size to change if there's anything in the page cache.
> > > > > > > > > > > > >          * In theory, fuse servers should never be doing this.
> > > > > > > > > > > > >          */
> > > > > > > > > > > > >         if (inode->i_mapping->nrpages > 0) {
> > > > > > > > > > > > >                 WARN_ON(inode->i_blkbits != new_blkbits &&
> > > > > > > > > > > > >                         inode->i_mapping->nrpages > 0);
> > > > > > > > > > > > >                 return;
> > > > > > > > > > > > >         }
> > > > > > > > > > > > >
> > > > > > > > > > > > > set_it:
> > > > > > > > > > > > >         inode->i_blkbits = new_blkbits;
> > > > > > > > > > > > > }
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> > > > > > > > > > > > >
> > > > > > > > > > > > > --D
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Joanne
> > > > > > > > > > > > > >
> > > > > > > > > >
> > > > > >
> > > >

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range
  2025-07-31 20:48                               ` Joanne Koong
@ 2025-08-12 16:42                                 ` Darrick J. Wong
  0 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2025-08-12 16:42 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Naresh Kamboju, linux-fsdevel, linux-mm, linux-xfs, open list,
	lkft-triage, Linux Regressions, Miklos Szeredi, Jan Kara,
	Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Arnd Bergmann, Dan Carpenter, Anders Roxell,
	Ben Copeland

On Thu, Jul 31, 2025 at 01:48:15PM -0700, Joanne Koong wrote:
> On Thu, Jul 31, 2025 at 10:55 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Wed, Jul 30, 2025 at 03:54:15PM -0700, Joanne Koong wrote:
> > > On Tue, Jul 29, 2025 at 4:40 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Tue, Jul 29, 2025 at 04:23:02PM -0700, Joanne Koong wrote:
> > > > > On Tue, Jul 29, 2025 at 1:21 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > On Mon, Jul 28, 2025 at 02:28:31PM -0700, Joanne Koong wrote:
> > > > > > > On Mon, Jul 28, 2025 at 12:11 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, Jul 28, 2025 at 10:44:01AM -0700, Joanne Koong wrote:
> > > > > > > > > On Mon, Jul 28, 2025 at 10:14 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Jul 25, 2025 at 06:16:15PM -0700, Joanne Koong wrote:
> > > > > > > > > > > On Thu, Jul 24, 2025 at 12:14 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jul 23, 2025 at 3:37 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jul 23, 2025 at 2:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jul 23, 2025 at 11:42:42AM -0700, Joanne Koong wrote:
> > > > > > > > > > > > > > > On Wed, Jul 23, 2025 at 7:46 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [cc Joanne]
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Jul 23, 2025 at 05:14:28PM +0530, Naresh Kamboju wrote:
> > > > > > > > > > > > > > > > > Test regression: next-20250721 arm64 16K and 64K page size WARNING fs
> > > > > > > > > > > > > > > > > fuse file.c at fuse_iomap_writeback_range
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ## Test log
> > > > > > > > > > > > > > > > > ------------[ cut here ]------------
> > > > > > > > > > > > > > > > > [  343.828105] WARNING: fs/fuse/file.c:2146 at
> > > > > > > > > > > > > > > > > fuse_iomap_writeback_range+0x478/0x558 [fuse], CPU#0: msync04/4190
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >         WARN_ON_ONCE(len & (PAGE_SIZE - 1));
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > /me speculates that this might be triggered by an attempt to write back
> > > > > > > > > > > > > > > > some 4k fsblock within a 16/64k base page?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think this can happen on 4k base pages as well actually. On the
> > > > > > > > > > > > > > > iomap side, the length passed is always block-aligned and in fuse, we
> > > > > > > > > > > > > > > set blkbits to be PAGE_SHIFT so theoretically block-aligned is always
> > > > > > > > > > > > > > > page-aligned, but I missed that if it's a "fuseblk" filesystem, that
> > > > > > > > > > > > > > > isn't true and the blocksize is initialized to a default size of 512
> > > > > > > > > > > > > > > or whatever block size is passed in when it's mounted.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > <nod> I think you're correct.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'll send out a patch to remove this line. It doesn't make any
> > > > > > > > > > > > > > > difference for fuse_iomap_writeback_range() logic whether len is
> > > > > > > > > > > > > > > page-aligned or not; I had added it as a sanity-check against sketchy
> > > > > > > > > > > > > > > ranges.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Also, I just noticed that apparently the blocksize can change
> > > > > > > > > > > > > > > dynamically for an inode in fuse through getattr replies from the
> > > > > > > > > > > > > > > server (see fuse_change_attributes_common()). This is a problem since
> > > > > > > > > > > > > > > the iomap uses inode->i_blkbits for reading/writing to the bitmap. I
> > > > > > > > > > > > > > > think we will have to cache the inode blkbits in the iomap_folio_state
> > > > > > > > > > > > > > > struct unfortunately :( I'll think about this some more and send out a
> > > > > > > > > > > > > > > patch for this.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > From my understanding of the iomap code, it's possible to do that if you
> > > > > > > > > > > > > > flush and unmap the entire pagecache (whilst holding i_rwsem and
> > > > > > > > > > > > > > mmap_invalidate_lock) before you change i_blkbits.  Nobody *does* this
> > > > > > > > > > > > > > so I have no idea if it actually works, however.  Note that even I don't
> > > > > > > > > > > > > > implement the flush and unmap bit; I just scream loudly and do nothing:
> > > > > > > > > > > > >
> > > > > > > > > > > > > lol! i wish I could scream loudly and do nothing too for my case.
> > > > > > > > > > > > >
> > > > > > > > > > > > > AFAICT, I think I just need to flush and unmap that file and can leave
> > > > > > > > > > > > > the rest of the files/folios in the pagecache as is? But then if the
> > > > > > > > > > > > > file has active refcounts on it or has been pinned into memory, can I
> > > > > > > > > > > > > still unmap and remove it from the page cache? I see the
> > > > > > > > > > > > > invalidate_inode_pages2() function but my understanding is that the
> > > > > > > > > > > > > page still stays in the cache if it has has active references, and if
> > > > > > > > > > > > > the page gets mmaped and there's a page fault on it, it'll end up
> > > > > > > > > > > > > using the preexisting old page in the page cache.
> > > > > > > > > > > >
> > > > > > > > > > > > Never mind, I was mistaken about this. Johannes confirmed that even if
> > > > > > > > > > > > there's active refcounts on the folio, it'll still get removed from
> > > > > > > > > > > > the page cache after unmapping and the page cache reference will get
> > > > > > > > > > > > dropped.
> > > > > > > > > > > >
> > > > > > > > > > > > I think I can just do what you suggested and call
> > > > > > > > > > > > filemap_invalidate_inode() in fuse_change_attributes_common() then if
> > > > > > > > > > > > the inode blksize gets changed. Thanks for the suggestion!
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thinking about this some more, I don't think this works after all
> > > > > > > > > > > because the writeback + page cache removal and inode blkbits update
> > > > > > > > > > > needs to be atomic, else after we write back and remove the pages from
> > > > > > > > > > > the page cache, a write could be issued right before we update the
> > > > > > > > > > > inode blkbits. I don't think we can hold the inode lock the whole time
> > > > > > > > > > > for it either since writeback could be intensive. (also btw, I
> > > > > > > > > > > realized in hindsight that invalidate_inode_pages2_range() would have
> > > > > > > > > > > been the better function to call instead of
> > > > > > > > > > > filemap_invalidate_inode()).
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I don't think I really need to have it removed from the page cache so
> > > > > > > > > > > > > much as just have the ifs state for all the folios in the file freed
> > > > > > > > > > > > > (after flushing the file) so that it can start over with a new ifs.
> > > > > > > > > > > > > Ideally we could just flush the file, then iterate through all the
> > > > > > > > > > > > > folios in the mapping in order of ascending index, and kfree their
> > > > > > > > > > > > > ->private, but I'm not seeing how we can prevent the case of new
> > > > > > > > > > > > > writes / a new ifs getting allocated for folios at previous indexes
> > > > > > > > > > > > > while we're trying to do the iteration/kfreeing.
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Going back to this idea, I think this can work. I realized we don't
> > > > > > > > > > > need to flush the file, it's enough to free the ifs, then update the
> > > > > > > > > > > inode->i_blkbits, then reallocate the ifs (which will now use the
> > > > > > > > > > > updated blkbits size), and if we hold the inode lock throughout, that
> > > > > > > > > > > prevents any concurrent writes.
> > > > > > > > > > > Something like:
> > > > > > > > > > >      inode_lock(inode);
> > > > > > > > > > >      XA_STATE(xas, &mapping->i_pages, 0);
> > > > > > > > > > >      xa_lock_irq(&mapping->i_pages);
> > > > > > > > > > >      xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > > > > > > >           folio_lock(folio);
> > > > > > > > > > >           if (folio_test_dirty(folio)) {
> > > > > > > > > > >                   folio_wait_writeback(folio);
> > > > > > > > > > >                   kfree(folio->private);
> > > > > > > > > > >           }
> > > > > > > >
> > > > > > > > Heh, I didn't even see this chunk, distracted as I am today. :/
> > > > > > > >
> > > > > > > > So this doesn't actually /initiate/ writeback, it just waits
> > > > > > > > (potentially for a long time) for someone else to come along and do it.
> > > > > > > > That might not be what you want since the blocksize change will appear
> > > > > > > > to stall while nothing else is going on in the system.
> > > > > > >
> > > > > > > I thought if the folio isn't under writeback then
> > > > > > > folio_wait_writeback() just returns immediately as a no-op.
> > > > > > > I don't think we need/want to initiate writeback, I think we only need
> > > > > > > to ensure that if it is already under writeback, that writeback
> > > > > > > finishes while it uses the old i_blksize so nothing gets corrupted. As
> > > > > > > I understand it (but maybe I'm misjudging this), holding the inode
> > > > > > > lock and then initiating writeback is too much given that writeback
> > > > > > > can take a long time (eg if the fuse server writes the data over some
> > > > > > > network).
> > > > > > >
> > > > > > > >
> > > > > > > > Also, unless you're going to put this in buffered-io.c, it's not
> > > > > > > > desirable for a piece of code to free something it didn't allocate.
> > > > > > > > IOWs, I don't think it's a good idea for *fuse* to go messing with a
> > > > > > > > folio->private that iomap set.
> > > > > > >
> > > > > > > Okay, good point. I agree. I was hoping to have this not bleed into
> > > > > > > the iomap library but maybe there's no getting around that in a good
> > > > > > > way.
> > > > > >
> > > > > > <shrug> Any other filesystem that has mutable file block size is going
> > > > > > to need something to enact a change.
> > > > > >
> > > > > > > >
> > > > > > > > > > >           folio_unlock(folio);
> > > > > > > > > > >      }
> > > > > > > > > > >     inode->i_blkbits = new_blkbits_size;
> > > > > > > > > >
> > > > > > > > > > The trouble is, you also have to resize the iomap_folio_state objects
> > > > > > > > > > attached to each folio if you change i_blkbits...
> > > > > > > > >
> > > > > > > > > I think the iomap_folio_state objects automatically get resized here,
> > > > > > > > > no? We first kfree the folio->private which kfrees the entire ifs,
> > > > > > > >
> > > > > > > > Err, right, it does free the ifs and recreate it later if necessary.
> > > > > > > >
> > > > > > > > > then we change inode->i_blkbits to the new size, then when we call
> > > > > > > > > folio_mark_dirty(), it'll create the new ifs which creates a new folio
> > > > > > > > > state object using the new/updated i_blkbits size
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >     xas_set(&xas, 0);
> > > > > > > > > > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > > > > > > >           folio_lock(folio);
> > > > > > > > > > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > > > > > > > > > >                  folio_mark_dirty(folio);
> > > > > > > > > >
> > > > > > > > > > ...because iomap_dirty_folio doesn't know how to reallocate the folio
> > > > > > > > > > state object in response to i_blkbits having changed.
> > > > > > > >
> > > > > > > > Also, what about clean folios that have an ifs?  You'd still need to
> > > > > > > > handle the ifs's attached to those.
> > > > > > >
> > > > > > > Ah you're right, there could be clean folios there too that have an
> > > > > > > ifs. I think in the above logic, if we iterate through all
> > > > > > > mapping->i_pages (not just PAGECACHE_TAG_DIRTY marked ones) and move
> > > > > > > the kfree to after the "if (folio_test_dirty(folio))" block, then it
> > > > > > > addresses that case. eg something like this:
> > > > > > >
> > > > > > >      inode_lock(inode);
> > > > > > >      XA_STATE(xas, &mapping->i_pages, 0);
> > > > > > >      xa_lock_irq(&mapping->i_pages);
> > > > > > >      xas_for_each(&xas, folio, ULONG_MAX) {
> > > > > > >           folio_lock(folio);
> > > > > > >           if (folio_test_dirty(folio))
> > > > > > >                   folio_wait_writeback(folio);
> > > > > > >           kfree(folio->private);
> > > > > > >           folio_unlock(folio);
> > > > > > >      }
> > > > > > >     inode->i_blkbits = new_blkbits;
> > > > > > >     xas_set(&xas, 0);
> > > > > > >     xas_for_each_marked(&xas, folio, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
> > > > > > >           folio_lock(folio);
> > > > > > >           if (folio_test_dirty(folio) && !folio_test_writeback(folio))
> > > > > > >                  folio_mark_dirty(folio);
> > > > > > >           folio_unlock(folio);
> > > > > > >     }
> > > > > > >     xa_unlock_irq(&mapping->i_pages);
> > > > > > >     inode_unlock(inode);
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > So I guess if you wanted iomap to handle a blocksize change, you could
> > > > > > > > do something like:
> > > > > > > >
> > > > > > > > iomap_change_file_blocksize(inode, new_blkbits) {
> > > > > > > >         inode_lock()
> > > > > > > >         filemap_invalidate_lock()
> > > > > > > >
> > > > > > > >         inode_dio_wait()
> > > > > > > >         filemap_write_and_wait()
> > > > > > > >         if (new_blkbits > mapping_min_folio_order()) {
> > > > > > > >                 truncate_pagecache()
> > > > > > > >                 inode->i_blkbits = new_blkbits;
> > > > > > > >         } else {
> > > > > > > >                 inode->i_blkbits = new_blkbits;
> > > > > > > >                 xas_for_each(...) {
> > > > > > > >                         <create new ifs>
> > > > > > > >                         <translate uptodate/dirty state to new ifs>
> > > > > > > >                         <swap ifs>
> > > > > > > >                         <free old ifs>
> > > > > > > >                 }
> > > > > > > >         }
> > > > > > > >
> > > > > > > >         filemap_invalidate_unlock()
> > > > > > > >         inode_unlock()
> > > > > > > > }
> > > > > > >
> > > > > > > Do you prefer this logic to the one above that walks through
> > > > > > > &mapping->i_pages? If so, then I'll go with this way.
> > > > > >
> > > > > > Yes.  iomap should not be tightly bound to the pagecache's xarray; I
> > > > > > don't even really like the xas_for_each that I suggested above.
> > > > >
> > > > > Okay, sounds good.
> > > > >
> > > > > >
> > > > > > > The part I'm unsure about is that this logic seems more disruptive (eg
> > > > > > > initiating writeback while holding the inode lock and doing work for
> > > > > > > unmapping/page cache removal) than the other approach, but I guess
> > > > > > > this is also rare enough that it doesn't matter much.
> > > > > >
> > > > > > I hope it's rare enough that doing truncate_pagecache unconditionally
> > > > > > won't be seen as a huge burden.
> > > > > >
> > > > > > iomap_change_file_blocksize(inode, new_blkbits) {
> > > > > >         inode_dio_wait()
> > > > > >         filemap_write_and_wait()
> > > > > >         truncate_pagecache()
> > > > > >
> > > > > >         inode->i_blkbits = new_blkbits;
> > > > > > }
> > > > > >
> > > > > > fuse_file_change_blocksize(inode, new_blkbits) {
> > > > > >         inode_lock()
> > > > > >         filemap_invalidate_lock()
> > > > > >
> > > > > >         iomap_change_file_blocksize(inode, new_blkbits);
> > > > > >
> > > > > >         filemap_invalidate_unlock()
> > > > > >         inode_unlock()
> > > > > > }
> > > > > >
> > > > > > Though my question remains -- is there a fuse filesystem that changes
> > > > > > the blocksize at runtime such that we can test this??
> > > > >
> > > > > There's not one currently but I was planning to hack up the libfuse
> > > > > passthrough_hp server to test the change.
> > > >
> > > > Heh, ok.
> > > >
> > > > I guess I could also hack up fuse2fs to change its own blocksize
> > > > randomly to see how many programs that pisses off. :)
> > > >
> > > > (Not right now though, gotta prepare for fossy tomorrow...)
> > > >
> > >
> > > What I've been using as a helpful sanity-check so far has been running
> > > fstests generic/750 after adding this line to libfuse:
> > >
> > > +++ b/lib/fuse_lowlevel.c
> > > @@ -547,6 +547,8 @@ int fuse_reply_attr(fuse_req_t req, const struct stat *attr,
> > >         arg.attr_valid_nsec = calc_timeout_nsec(attr_timeout);
> > >         convert_stat(attr, &arg.attr);
> > > +       arg.attr.blksize = 4096;
> > >         return send_reply_ok(req, &arg, size);
> > >
> > > and modifying the kernel side logic in fuse_change_attributes_common()
> > > to unconditionally execute the page cache removal logic if
> > > attr->blksize != 0.
> > >
> > >
> > > While running this however, I discovered another problem :/ we can't
> > > grab the inode lock here in the fuse path because the vfs layer that
> > > calls into this logic may already be holding the inode lock (eg the
> > > stack traces I was seeing included path_openat()  ->
> > > inode_permission() -> fuse_permission() which then fetches the
> > > blksize, and the vfs rename path), while there are other call paths
> > > that may not be holding the lock already.
> >
> > Oh nooooo heisenlocking.  Which paths do not hold i_rwsem?
> 
> A path I was seeing that doesn't hold the inode lock was
> 
> [ 19.738097] Call Trace:
> [ 19.738468] inode_permission+0xea/0x190
> [ 19.738790] may_open+0x6e/0x150
> [ 19.739053] path_openat+0x4cf/0x1120
> [ 19.739341] ? generic_fillattr+0x49/0x130
> [ 19.739711] do_filp_open+0xc1/0x170
> [ 19.740064] ? kmem_cache_alloc_noprof+0x11b/0x380
> [ 19.740458] ? __check_object_size+0x22a/0x2c0
> [ 19.740834] ? alloc_fd+0xea/0x1b0
> [ 19.741125] do_sys_openat2+0x71/0xd0
> [ 19.741435] __x64_sys_openat+0x56/0xa0
> [ 19.741754] do_syscall_64+0x50/0x1c0
> [ 19.742068] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> and a path that does hold the inode lock:
> 
> [   42.176858]  inode_permission+0xea/0x190
> [   42.177372]  path_openat+0xd34/0x1120
> [   42.177838]  do_filp_open+0xc1/0x170
> [   42.178381]  ? kmem_cache_alloc_noprof+0x11b/0x380
> [   42.178970]  ? __check_object_size+0x22a/0x2c0
> [   42.179525]  ? alloc_fd+0xea/0x1b0
> [   42.179955]  do_sys_openat2+0x71/0xd0
> [   42.180417]  __x64_sys_creat+0x4c/0x70
> [   42.180868]  do_syscall_64+0x50/0x1c0

Heh, yep, exactly where I feared it would be. :(

> 
> >
> > > I don't really see a good solution here. The simplest one imo would be
> > > to cache "u8 blkbits" in the iomap_folio_state struct - are you okay
> > > with that or do you think there's a better solution here?
> >
> > 1. Don't support changing the blocksize, complain loudly if anyone does,
> > and only then implement it.  Writeback cache is a fairly new feature so
> > the impact should be low, right? ;)
> >
> > [there is no 2 :D]
> 
> I think technically the writeback cache was added in 2014 but I don't
> think anyone changes the blocksize so I'm happy to go with this
> approach if you/Miklos think it's fine :D I'll send out a patch for
> this then. Thanks for all the discussion on this!

<nod> I'll go look at that thread.

(still recovering after fossy...)

--D

> >
> > --D
> >
> > >
> > > Thanks,
> > > Joanne
> > >
> > > > --D
> > > >
> > > > > >
> > > > > > --D
> > > > > >
> > > > > > > Thanks,
> > > > > > > Joanne
> > > > > > >
> > > > > > > >
> > > > > > > > --D
> > > > > > > >
> > > > > > > > > > --D
> > > > > > > > > >
> > > > > > > > > > >           folio_unlock(folio);
> > > > > > > > > > >     }
> > > > > > > > > > >     xa_unlock_irq(&mapping->i_pages);
> > > > > > > > > > >     inode_unlock(inode);
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I think this is the only approach that doesn't require changes to iomap.
> > > > > > > > > > >
> > > > > > > > > > > I'm going to think about this some more next week and will try to send
> > > > > > > > > > > out a patch for this then.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Joanne
> > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > void fuse_iomap_set_i_blkbits(struct inode *inode, u8 new_blkbits)
> > > > > > > > > > > > > > {
> > > > > > > > > > > > > >         trace_fuse_iomap_set_i_blkbits(inode, new_blkbits);
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >         if (inode->i_blkbits == new_blkbits)
> > > > > > > > > > > > > >                 return;
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >         if (!S_ISREG(inode->i_mode))
> > > > > > > > > > > > > >                 goto set_it;
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >         /*
> > > > > > > > > > > > > >          * iomap attaches per-block state to each folio, so we cannot allow
> > > > > > > > > > > > > >          * the file block size to change if there's anything in the page cache.
> > > > > > > > > > > > > >          * In theory, fuse servers should never be doing this.
> > > > > > > > > > > > > >          */
> > > > > > > > > > > > > >         if (inode->i_mapping->nrpages > 0) {
> > > > > > > > > > > > > >                 WARN_ON(inode->i_blkbits != new_blkbits &&
> > > > > > > > > > > > > >                         inode->i_mapping->nrpages > 0);
> > > > > > > > > > > > > >                 return;
> > > > > > > > > > > > > >         }
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > set_it:
> > > > > > > > > > > > > >         inode->i_blkbits = new_blkbits;
> > > > > > > > > > > > > > }
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fuse-iomap-attrs&id=da9b25d994c1140aae2f5ebf10e54d0872f5c884
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --D
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Joanne
> > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > >
> > > > >
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-08-12 16:42 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-23 11:44 next-20250721 arm64 16K and 64K page size WARNING fs fuse file.c at fuse_iomap_writeback_range Naresh Kamboju
2025-07-23 14:46 ` Darrick J. Wong
2025-07-23 18:42   ` Joanne Koong
2025-07-23 21:20     ` Darrick J. Wong
2025-07-23 22:37       ` Joanne Koong
2025-07-24 19:14         ` Joanne Koong
2025-07-26  1:16           ` Joanne Koong
2025-07-28 17:14             ` Darrick J. Wong
2025-07-28 17:44               ` Joanne Koong
2025-07-28 19:11                 ` Darrick J. Wong
2025-07-28 21:28                   ` Joanne Koong
2025-07-29 20:21                     ` Darrick J. Wong
2025-07-29 23:23                       ` Joanne Koong
2025-07-29 23:40                         ` Darrick J. Wong
2025-07-30 22:54                           ` Joanne Koong
2025-07-31 17:55                             ` Darrick J. Wong
2025-07-31 20:48                               ` Joanne Koong
2025-08-12 16:42                                 ` Darrick J. Wong
2025-07-28 17:34             ` Matthew Wilcox
2025-07-28 17:55               ` Joanne Koong
2025-07-28 18:43                 ` Darrick J. Wong
2025-07-23 23:15     ` Joanne Koong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).