Linux EXT4 FS development

Linux EXT4 FS development
 help / color / mirror / Atom feed

* Re: [PATCH v2 00/89] fs: new accessors for inode->i_ctime
From: Jeff Layton @ 2023-07-10 13:32 UTC (permalink / raw)
  To: Christian Brauner
  Cc: jk, arnd, mpe, npiggin, christophe.leroy, hca, gor, agordeev,
	borntraeger, svens, gregkh, arve, tkjos, maco, joel, cmllamas,
	surenb, dennis.dalessandro, jgg, leon, bwarrum, rituagar, ericvh,
	lucho, asmadeus, linux_oss, dsterba, dhowells, marc.dionne, viro,
	raven, luisbg, salah.triki, aivazian.tigran, ebiederm, keescook,
	clm, josef, xiubli, idryomov, jaharkes, coda, jlbec, hch, nico,
	rafael, code, ardb, xiang, chao, huyue2, jefflexu, linkinjeon,
	sj1557.seo, jack, tytso, adilger.kernel, jaegeuk, hirofumi,
	miklos, rpeterso, agruenba, richard, anton.ivanov, johannes,
	mikulas, mike.kravetz, muchun.song, dwmw2, shaggy, tj,
	trond.myklebust, anna, chuck.lever, neilb, kolga, Dai.Ngo, tom,
	konishi.ryusuke, anton, almaz.alexandrovich, mark, joseph.qi, me,
	hubcap, martin, amir73il, mcgrof, yzaikin, tony.luck, gpiccoli,
	al, sfrench, pc, lsahlber, sprasad, senozhatsky, phillip, rostedt,
	mhiramat, dushistov, hdegoede, djwong, dlemoal, naohiro.aota, jth,
	ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, hughd, akpm, davem, edumazet, kuba,
	pabeni, john.johansen, paul, jmorris, serge, stephen.smalley.work,
	eparis, jgross, stern, lrh2000, sebastian.reichel, wsa+renesas,
	quic_ugoswami, quic_linyyuan, john, error27, quic_uaggarwa,
	hayama, jomajm, axboe, dhavale, dchinner, hannes, zhangpeng362,
	slava, gargaditya08, penguin-kernel, yifeliu, madkar, ezk, yuzhe,
	willy, okanatov, jeffxu, linux, mirimmad17, yijiangshan,
	yang.yang29, xu.xin16, chengzhihao1, shr, Liam.Howlett, adobriyan,
	chi.minghao, roberto.sassu, linuszeng, bvanassche, zohar,
	yi.zhang, trix, fmdefrancesco, ebiggers, princekumarmaurya06,
	chenzhongjin, riel, shaozhengchao, jingyuwang_vip, linuxppc-dev,
	linux-kernel, linux-s390, linux-rdma, linux-usb, v9fs,
	linux-fsdevel, linux-afs, autofs, linux-mm, linux-btrfs,
	ceph-devel, codalist, ecryptfs, linux-efi, linux-erofs,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-um, linux-mtd,
	jfs-discussion, linux-nfs, linux-nilfs, linux-ntfs-dev, ntfs3,
	ocfs2-devel, linux-karma-devel, devel, linux-unionfs,
	linux-hardening, reiserfs-devel, linux-cifs, samba-technical,
	linux-trace-kernel, linux-xfs, bpf, netdev, apparmor,
	linux-security-module, selinux
In-Reply-To: <20230710-zudem-entkam-bb508cbd8c78@brauner>

On Mon, 2023-07-10 at 14:35 +0200, Christian Brauner wrote:
> On Fri, Jul 07, 2023 at 08:42:31AM -0400, Jeff Layton wrote:
> > On Wed, 2023-07-05 at 14:58 -0400, Jeff Layton wrote:
> > > v2:
> > > - prepend patches to add missing ctime updates
> > > - add simple_rename_timestamp helper function
> > > - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_*
> > > - drop individual inode_ctime_set_{sec,nsec} helpers
> > > 
> > 
> > After review by Jan and others, and Jan's ext4 rework, the diff on top
> > of the series I posted a couple of days ago is below. I don't really
> > want to spam everyone with another ~100 patch v3 series, but I can if
> > you think that's best.
> > 
> > Christian, what would you like me to do here?
> 
> I picked up the series from the list and folded the fixups you posted
> here into the respective fs conversion patches. I hope that helps you
> avoid a resend. You should have received a separate "thank you" mail for
> all of this.
> 
> To each patch that I folded one of the fixlets from below into I added a
> git note that records a link to your mail here and the respective patch
> hunk from this mail that I folded into the patch. git.kernel.org will
> show notes by default. For example,
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.ctime&id=8b0e3c2e99004609a16ba145bcbdfdddb78e220e
> should show you the note I added. You can also fetch them via
> git fetch $remote refs/notes/*:refs/notes/*
> (You probably know that ofc but jic.) if you're interested.
> 
> Based on v6.5-rc1 as of today.
> 

Many thanks!!! I'll get to work rebasing the multigrain timestamp series
on top of that.

> Btw, both b4 and patchwork somehow treat the series in weird was.
> IOW, based on the message id of the cover letter I was able to pull most
> messages except for:
> 
> [07/92] fs: add ctime accessors infrastructure
> [08/92] fs: new helper: simple_rename_timestamp
> [92/92] fs: rename i_ctime field to __i_ctime
> 
> which I pulled in separately. Not sure what the cause of 
> 
> this is.

Good to know.

I ended up doing the send in two phases: one for the cover letter and
infrastructure patches that went to everyone, and one for the per-
subsystem patches that went do individual maintainers and lists.

I suspect that screwed up the message IDs somehow. Hopefully I won't
need to do a posting like that again soon, but I'll pay closer attention
to the message id handling next time.

Thanks again!
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [syzbot] [ext4?] kernel BUG in ext4_split_extent_at (2)
From: yebin (H) @ 2023-07-10 15:06 UTC (permalink / raw)
  To: syzbot, adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel,
	syzkaller-bugs, tytso
In-Reply-To: <000000000000f1378f0600025915@google.com>



On 2023/7/9 7:45, syzbot wrote:
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit:    8689f4f2ea56 Merge tag 'mmc-v6.5-2' of git://git.kernel.or..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=12b9cb02a80000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=7ad417033279f15a
> dashboard link: https://syzkaller.appspot.com/bug?extid=0f4d9f68fb6632330c6c
> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=13977778a80000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1004666aa80000
>
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/21b63023cf5a/disk-8689f4f2.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/e04836fe057e/vmlinux-8689f4f2.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/ee05dfd52843/bzImage-8689f4f2.xz
> mounted in repro: https://storage.googleapis.com/syzbot-assets/e2ab005f1edb/mount_0.gz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+0f4d9f68fb6632330c6c@syzkaller.appspotmail.com
>
> ------------[ cut here ]------------
> kernel BUG at fs/ext4/ext4_extents.h:200!
> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> CPU: 1 PID: 5885 Comm: syz-executor219 Not tainted 6.4.0-syzkaller-12365-g8689f4f2ea56 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
> RIP: 0010:ext4_ext_mark_unwritten fs/ext4/ext4_extents.h:200 [inline]
> RIP: 0010:ext4_split_extent_at+0xd11/0xe10 fs/ext4/extents.c:3221
> Code: e9 d2 f8 ff ff e8 1f 6d 5d ff 66 81 c5 00 80 e9 32 fd ff ff e8 10 6d 5d ff 44 8d bd 00 80 ff ff e9 d1 fc ff ff e8 ff 6c 5d ff <0f> 0b 48 8b 7c 24 18 e8 73 9d b0 ff e9 7f f3 ff ff 48 89 cf e8 46
> RSP: 0018:ffffc900055ef268 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: ffff88807d9c0000 RSI: ffffffff82277271 RDI: 0000000000000007
> RBP: 0000000000000000 R08: 0000000000000007 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000001 R12: ffff88807587d00c
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff88807587d012
> FS:  00007f4820ae9700(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f4820ae8fe8 CR3: 0000000028105000 CR4: 0000000000350ee0
> Call Trace:
>   <TASK>
>   ext4_split_extent+0x3fc/0x530 fs/ext4/extents.c:3384
>   ext4_ext_handle_unwritten_extents fs/ext4/extents.c:3874 [inline]
>   ext4_ext_map_blocks+0x2e22/0x5bc0 fs/ext4/extents.c:4166
>   ext4_map_blocks+0x760/0x1890 fs/ext4/inode.c:621
>   ext4_iomap_alloc fs/ext4/inode.c:3276 [inline]
>   ext4_iomap_begin+0x43d/0x7a0 fs/ext4/inode.c:3326
>   iomap_iter+0x446/0x10e0 fs/iomap/iter.c:91
>   __iomap_dio_rw+0x6e3/0x1d80 fs/iomap/direct-io.c:574
>   iomap_dio_rw+0x40/0xa0 fs/iomap/direct-io.c:665
>   ext4_dio_write_iter fs/ext4/file.c:609 [inline]
>   ext4_file_write_iter+0x1102/0x1880 fs/ext4/file.c:720
>   call_write_iter include/linux/fs.h:1871 [inline]
>   new_sync_write fs/read_write.c:491 [inline]
>   vfs_write+0x981/0xda0 fs/read_write.c:584
>   ksys_write+0x122/0x250 fs/read_write.c:637
>   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>   do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
>   entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f4828f26cd9
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 61 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f4820ae9208 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 0000000000000034 RCX: 00007f4828f26cd9
> RDX: 0000000000000012 RSI: 0000000020000000 RDI: 0000000000000004
> RBP: 00007f4828fa4790 R08: 00007f4828fa4798 R09: 00007f4828fa4798
> R10: 00007f4828fa4798 R11: 0000000000000246 R12: 00007f4828fa479c
> R13: 00007ffef851f96f R14: 00007f4820ae9300 R15: 0000000000022000
>   </TASK>
> Modules linked in:
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:ext4_ext_mark_unwritten fs/ext4/ext4_extents.h:200 [inline]
> RIP: 0010:ext4_split_extent_at+0xd11/0xe10 fs/ext4/extents.c:3221
> Code: e9 d2 f8 ff ff e8 1f 6d 5d ff 66 81 c5 00 80 e9 32 fd ff ff e8 10 6d 5d ff 44 8d bd 00 80 ff ff e9 d1 fc ff ff e8 ff 6c 5d ff <0f> 0b 48 8b 7c 24 18 e8 73 9d b0 ff e9 7f f3 ff ff 48 89 cf e8 46
> RSP: 0018:ffffc900055ef268 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: ffff88807d9c0000 RSI: ffffffff82277271 RDI: 0000000000000007
> RBP: 0000000000000000 R08: 0000000000000007 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000001 R12: ffff88807587d00c
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff88807587d012
> FS:  00007f4820ae9700(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ffef85799a0 CR3: 0000000028105000 CR4: 0000000000350ee0
>
>
> ---
> If you want syzbot to run the reproducer, reply with:
> #syz test: git://repo/address.git branch-or-commit-hash
> If you attach or paste a git patch, syzbot will apply it before testing.
>
> .
This should be caused by the write block device.
Here's the log I captured writing the block device：
[T15721] blkdev_write_iter:---------bdev=0xffff888117fc5080 loop2-----
[T15722] EXT4-fs error (device loop2) in ext4_reserve_inode_write:5718: 
Corrupt filesystem
[T15722] EXT4-fs (loop2): Remounting filesystem read-only
[T15722] EXT4-fs error (device loop2): ext4_ext_grow_indepth:1468: inode 
#16: comm rep: mark_inode_dirty error
[T15722] general protection fault, probably for non-canonical address 
0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
[T15722] KASAN: null-ptr-deref in range 
[0x0000000000000000-0x0000000000000007]



^ permalink raw reply

* Re: generic/269 failure on ext4 dev branch
From: Eric Whitney @ 2023-07-10 19:01 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Eric Whitney, linux-ext4, tytso, ojaswin
In-Reply-To: <87edlkk8jc.fsf@doe.com>

* Ritesh Harjani <ritesh.list@gmail.com>:
> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> 
> > Eric Whitney <enwlinux@gmail.com> writes:
> >
> >> I've discovered that generic/269 will trigger a BUG_ON on line 5070 in
> >
> > Can you confirm in your tree what is line 5070 out of the two?
> >
> > BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
> > BUG_ON(ac->ac_pa == NULL);
> >
> > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev&id=ab8627e104696b8c1c6953ad5255def5b0821e06#n5070
> > [2]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev#n5070
> >
> > Based on which tree you tested, it could differ I guess.
> >
> > I am assuming it is... 
> >   BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
> >
> >> ext4_mb_new_inode_pa when running kvm-xfstests on the 1k test case
> >> with a kernel built from the current ext4 dev branch. After hitting the
> >> BUG_ON, the kernel then reports persistent soft lockups.  I mentioned this in
> >> today's concall, and Ted confirmed the current dev branch should reflect
> >> what's upstream at this time.
> >>
> >> This test reproduces for me 5 to 10% of the time, but reliably enough - I
> >> typically don't need more than 25 trials to see the failure, and 10 often
> >> suffices. (I haven't yet tried the 4k test case, but will do so.)
> >>
> >> The failure bisects to:
> >> 7e170922f06b ("ext4: Add allocation criteria 1.5 (CR1_5)")
> >>
> >
> > Thanks for the bisection.
> >
> >> Trace follows.
> >>
> >> Eric
> >>
> >>
> >> generic/269 24s ...  [21:41:11][  284.208474] run fstests generic/269 at 2023-07-03 21:41:11
> >> [  284.511484] EXT4-fs (vdc): mounted filesystem 2b1fbdd6-2724-47bc-b7b5-f4a73c9f19be r/w with ordered data mode. Quota mode: none.
> >> [  284.950657] ------------[ cut here ]------------
> >> [  284.950901] kernel BUG at fs/ext4/mballoc.c:5070!
> >> [  284.951104] invalid opcode: 0000 [#1] PREEMPT SMP
> >> [  284.951296] CPU: 0 PID: 12039 Comm: fsstress Not tainted 6.4.0-rc5+ #6
> >> [  284.951567] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
> >> [  284.951900] RIP: 0010:ext4_mb_new_inode_pa+0x2a6/0x2c0
> >> [  284.952124] Code: b5 7e 0f 85 b5 fe ff ff 0f 1f 44 00 00 e9 ab fe ff ff e8 9d 56 d3 ff 84 c0 0f 85 b5 fe ff ff 0f 0b e9 ae fe ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 4c 89 c1 31 c0 e9 42 ff ff ff 0f 1f 84 00
> >> [  284.952891] RSP: 0018:ffffc90004053970 EFLAGS: 00010a87
> >> [  284.953124] RAX: 0000000000000002 RBX: ffff8880342d0720 RCX: 0000000000000001
> >> [  284.953420] RDX: 0000000000004000 RSI: 00001e4000000000 RDI: ffff8880342d0720
> >> [  284.953720] RBP: ffffc90004053a00 R08: ffff88800a5fc000 R09: 0000000000000000
> >> [  284.954020] R10: ffff888007964a98 R11: 0000000000000002 R12: 0000000000000003
> >> [  284.954321] R13: ffff8880342d0720 R14: ffff88800a5fc000 R15: ffff88800abfc000
> >> [  284.954610] FS:  00007f3d9db07740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
> >> [  284.954923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [  284.955148] CR2: 000055b54ce0fde8 CR3: 0000000006730006 CR4: 0000000000770ef0
> >> [  284.955443] PKRU: 55555554
> >> [  284.955559] Call Trace:
> >> [  284.955669]  <TASK>
> >> [  284.955760]  ? die+0x33/0x90
> >> [  284.955887]  ? do_trap+0xe0/0x110
> >> [  284.956031]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> >> [  284.956223]  ? do_error_trap+0x65/0x80
> >> [  284.956385]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> >> [  284.956575]  ? exc_invalid_op+0x4b/0x70
> >> [  284.956738]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> >> [  284.956929]  ? asm_exc_invalid_op+0x16/0x20
> >> [  284.957112]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> >> [  284.957305]  ext4_mb_complex_scan_group+0x2e0/0x3e0
> >> [  284.957512]  ext4_mb_regular_allocator+0x3be/0xd80
> >> [  284.957716]  ext4_mb_new_blocks+0x9dc/0x1040
> >> [  284.957895]  ? __kmalloc+0xca/0x150
> >> [  284.958038]  ? ext4_find_extent+0x3ec/0x450
> >> [  284.958204]  ? _raw_write_unlock+0x29/0x50
> >> [  284.958369]  ext4_ext_map_blocks+0x9a4/0x19d0
> >> [  284.958543]  ? __kmem_cache_free+0x17d/0x2e0
> >> [  284.958723]  ? find_held_lock+0x2b/0x80
> >> [  284.958889]  ext4_map_blocks+0x230/0x5d0
> >> [  284.959056]  ? lock_release+0x139/0x280
> >> [  284.959222]  ext4_getblk+0x7b/0x2d0
> >> [  284.959369]  ext4_bread+0xc/0x70
> >> [  284.959510]  ext4_append+0x8d/0x190
> >> [  284.959665]  ext4_init_new_dir+0xd5/0x1b0
> >> [  284.959835]  ext4_mkdir+0x192/0x340
> >
> > hmm.. looks like a allocation request for a directory inode.
> >
> >> [  284.959987]  vfs_mkdir+0x98/0x140
> >> [  284.960133]  do_mkdirat+0x131/0x160
> >> [  284.960285]  __x64_sys_mkdir+0x48/0x70
> >> [  284.960445]  do_syscall_64+0x38/0x90
> >> [  284.960600]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> >> [  284.960814] RIP: 0033:0x7f3d9dbf8b07
> >> [  284.960967] Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
> >> [  284.961741] RSP: 002b:00007ffcd5131098 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
> >> [  284.962075] RAX: ffffffffffffffda RBX: 00007ffcd5131200 RCX: 00007f3d9dbf8b07
> >> [  284.962363] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055b54cdac240
> >> [  284.962647] RBP: 00000000000001ff R08: 0000000000000001 R09: 0000000000000003
> >> [  284.962928] R10: 00007ffcd5130d16 R11: 0000000000000206 R12: 00000000000000cb
> >> [  284.963209] R13: 8f5c28f5c28f5c29 R14: 000055b54c8ec660 R15: 00000000000000cb
> >> [  284.963492]  </TASK>
> >> [  284.963586] Modules linked in:
> >> [  284.963730] ---[ end trace 0000000000000000 ]---
> >
> >
> > @Ojaswin,
> >
> > I was looking at the code. In function ext4_mb_choose_next_group_best_avail(),
> > we use fls() on goal len to calculate "order". But we never subtract 1
> > from it. Then we set the goal len based on this "order". This might make
> > ac_g_ex.fe_len > ac_o_ex.fe_len in some cases where we really don't want
> > that (like the current case).
> > You have added some comments there, so I was not sure if that was
> > intentional. 
> >
> > Now, IIUC, the overall concept of cr_1.5 is to trim the max len order
> > from goal len to something which is still larger than original length. 
> > But this is only valid for regular files allocation request. Because we don't
> > normalize the request length for non-regular files. See
> > ext4_mb_normalize_request() function. As I also see from the current
> > bug_on, the request was for dir inode I guess.
> >
> > Although, I still think we should check the function logic in
> > ext4_mb_choose_next_group_best_avail(), but either ways I guess we don't
> > want to use CR_BEST_AVAIL_LEN criteria for non-regular files right,
> > given we anyways don't normalize the allocation request len for such files.
> >
> > In that case do you think below diff make sense?
> >
> >
> > mballoc: Don't use CR_BEST_AVAIL_LEN for non-regular files
> >
> > Using CR_BEST_AVAIL_LEN only make sense for regular files, as for
> > non-regular files we never normalize the allocation request length i.e.
> > goal len is same as original length (ac_g_ex.fe_len == ac_o_ex.fe_len).
> >
> > Hence there is no scope of trimming the goal length such that it can
> > still satisfy original request len. Thus this patch avoids using
> > CR_BEST_AVAIL_LEN criteria for non-regular files request.
> >
> > ---
> >  fs/ext4/mballoc.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index a2475b8c9fb5..5fbbd7344456 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -974,7 +974,19 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *
> >                 *group = grp->bb_group;
> >                 ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
> >         } else {
> > -               *new_cr = CR_BEST_AVAIL_LEN;
> > +               /*
> > +                * CR_BEST_AVAIL_LEN works based on the concept that we have
> > +                * a larger normalized goal len request which can be trimmed to
> > +                * a smaller goal len such that it can still satisfy original
> > +                * request len. However, allocation request for non-regular
> > +                * files never gets normalized.
> > +                * See function ext4_mb_normalize_request() (EXT4_MB_HINT_DATA).
> > +                */
> > +               if ((ac->ac_criteria & EXT4_MB_HINT_DATA))
> 
> my bad. It should be 
>                   if ((ac->ac_flags & EXT4_MB_HINT_DATA))
> 
> -ritesh


Ritesh:

An update: the generic/269 failure on 1k is easily reproducible on 6.5-rc1.
The BUG_ON on line 5069 in mballoc.c is now triggering for me, rather than 5070:
BUG_ON(!S_ISREG(ac->ac_inode->i_mode));

Representative trace follows - operation at top of trace varies run to run.

Eric

generic/269 25s ...  [18:07:16][   54.255436] run fstests generic/269 at 2023-06
[   54.508385] EXT4-fs (vdc): mounted filesystem a986a0fa-5c36-4e16-bad9-2d53b8.
[   57.914332] ------------[ cut here ]------------
[   57.914534] kernel BUG at fs/ext4/mballoc.c:5069!
[   57.914863] invalid opcode: 0000 [#1] PREEMPT SMP
[   57.915156] CPU: 1 PID: 4157 Comm: fsstress Not tainted 6.5.0-rc1 #1
[   57.915531] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4
[   57.916033] RIP: 0010:ext4_mb_new_inode_pa+0x2a6/0x2c0
[   57.916354] Code: b5 7e 0f 85 b5 fe ff ff 0f 1f 44 00 00 e9 ab fe ff ff e8 70
[   57.917473] RSP: 0018:ffffc900057f39f0 EFLAGS: 00010206
[   57.917799] RAX: 0000000000000002 RBX: ffff888017b70d10 RCX: 0000000000000001
[   57.918233] RDX: 000000000000a000 RSI: 00000b1c00000000 RDI: ffff888017b70d10
[   57.918667] RBP: ffffc900057f3a80 R08: ffff88800672f000 R09: ffff888009ac65a0
[   57.919100] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000068
[   57.919534] R13: ffff888017b70d10 R14: ffff88800672f000 R15: ffff888007b02000
[   57.919971] FS:  00007f03368b6740(0000) GS:ffff88807dd00000(0000) knlGS:00000
[   57.920461] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   57.920789] CR2: 000055fe23769d78 CR3: 000000000787e005 CR4: 0000000000770ee0
[   57.921113] PKRU: 55555554
[   57.921213] Call Trace:
[   57.921327]  <TASK>
[   57.921443]  ? die+0x33/0x90
[   57.921611]  ? do_trap+0xe0/0x110
[   57.921807]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
[   57.922023]  ? do_error_trap+0x65/0x80
[   57.922174]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
[   57.922354]  ? exc_invalid_op+0x4b/0x70
[   57.922510]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
[   57.922693]  ? asm_exc_invalid_op+0x16/0x20
[   57.922859]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
[   57.923039]  ext4_mb_complex_scan_group+0x2e0/0x3e0
[   57.923232]  ext4_mb_regular_allocator+0x3be/0xd80
[   57.923425]  ext4_mb_new_blocks+0x9dc/0x1040
[   57.923592]  ? __kmalloc+0xca/0x150
[   57.923732]  ? ext4_find_extent+0x3ec/0x450
[   57.923896]  ? __kmem_cache_free+0x191/0x310
[   57.924064]  ? rcu_is_watching+0xd/0x40
[   57.924218]  ext4_ext_map_blocks+0x981/0x19c0
[   57.924393]  ? __lock_acquire.constprop.0+0x4b/0x6a0
[   57.924601]  ? __lock_acquire.constprop.0+0x4b/0x6a0
[   57.924871]  ext4_map_blocks+0x22a/0x5b0
[   57.925079]  ext4_getblk+0x7b/0x2d0
[   57.925260]  ext4_bread+0xc/0x70
[   57.925392]  ext4_symlink+0x12b/0x3c0
[   57.925610]  vfs_symlink+0x18c/0x230
[   57.925839]  do_symlinkat+0x11c/0x130
[   57.926071]  __x64_sys_symlink+0x38/0x50
[   57.926318]  do_syscall_64+0x38/0x90
[   57.926544]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[   57.926866] RIP: 0033:0x7f03369a9a97
[   57.927093] Code: f0 ff ff 73 01 c3 48 8b 0d f6 d3 0c 00 f7 d8 64 89 01 48 88
[   57.928176] RSP: 002b:00007ffdf02ee128 EFLAGS: 00000206 ORIG_RAX: 00000000008
[   57.928574] RAX: ffffffffffffffda RBX: 00007ffdf02ee290 RCX: 00007f03369a9a97
[   57.928879] RDX: 0000000000000064 RSI: 000055fe23514410 RDI: 000055fe23542d10
[   57.929161] RBP: 000055fe23514410 R08: 000055fe23542d10 R09: 0000000000000003
[   57.929446] R10: 00007ffdf02edda6 R11: 0000000000000206 R12: 0000000000000170
[   57.929722] R13: 000055fe23542d10 R14: 00007ffdf02ee290 R15: 000055fe23542d10
[   57.930029]  </TASK>
[   57.930126] Modules linked in:
[   57.930274] ---[ end trace 0000000000000000 ]---



^ permalink raw reply

* [PATCH] ext4: do not mark inode dirty which is already dirtied under append-write scenario
From: Liu Song @ 2023-07-11  3:42 UTC (permalink / raw)
  To: tytso, adilger.kernel; +Cc: linux-ext4, linux-kernel, liusong

In the append-write scenario, after ensuring that the dirty inode can be
seen by the writeback process, there is no need to execute
"mark_inode_dirty" for every write. Instead, we can rely on
"ext4_mark_inode_dirty" executed when updating i_disksize in
"mpage_map_and_submit_extent" to ensure data consistency, which can
significantly improve performance in high-frequency append-write
scenarios.

In test scenarios of Kafka version 2.6.2, using packet size of 2K
resulted in a 10% performance improvement.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
---
 fs/ext4/inode.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 56 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9d9f414f99fe..d1aa775c9936 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3128,6 +3128,57 @@ static int ext4_da_should_update_i_disksize(struct page *page,
 	return 1;
 }
 
+/*
+ * Copy from generic_write_end, add conditions to execute mark_inode_dirty
+ * to avoid additional overhead caused by frequent dirty inode operations
+ */
+static int ext4_da_generic_write_end(struct file *file, struct address_space *mapping,
+			loff_t pos, unsigned len, unsigned copied, bool need_dirty,
+			struct page *page, void *fsdata)
+{
+	struct inode *inode = mapping->host;
+	loff_t old_size = inode->i_size;
+	bool i_size_changed = false;
+	int was_dirty;
+
+	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
+	/*
+	 * No need to use i_size_read() here, the i_size cannot change under us
+	 * because we hold i_rwsem.
+	 *
+	 * But it's important to update i_size while still holding page lock:
+	 * page writeout could otherwise come in and zero beyond i_size.
+	 */
+	if (pos + copied > inode->i_size) {
+		i_size_write(inode, pos + copied);
+		i_size_changed = true;
+	}
+
+	unlock_page(page);
+	put_page(page);
+
+	if (old_size < pos)
+		pagecache_isize_extended(inode, old_size, pos);
+
+	/*
+	 * In the append-write scenario, if the inode is marked as dirty,
+	 * it is ensured that the inode will be seen by the writeback process.
+	 * In the ext4_writepages process, when updating i_disksize,
+	 * corresponding metadata updates are also performed.
+	 * Therefore, it is unnecessary to repeatedly execute mark_inode_dirty
+	 * to improve performance.
+	 */
+	if (i_size_changed) {
+		spin_lock(&inode->i_lock);
+		was_dirty = inode->i_state & I_DIRTY;
+		spin_unlock(&inode->i_lock);
+		if (!was_dirty || need_dirty)
+			mark_inode_dirty(inode);
+	}
+	return copied;
+}
+
 static int ext4_da_write_end(struct file *file,
 			     struct address_space *mapping,
 			     loff_t pos, unsigned len, unsigned copied,
@@ -3137,6 +3188,7 @@ static int ext4_da_write_end(struct file *file,
 	loff_t new_i_size;
 	unsigned long start, end;
 	int write_mode = (int)(unsigned long)fsdata;
+	bool need_dirty = false;
 
 	if (write_mode == FALL_BACK_TO_NONDELALLOC)
 		return ext4_write_end(file, mapping, pos,
@@ -3169,10 +3221,12 @@ static int ext4_da_write_end(struct file *file,
 	 */
 	new_i_size = pos + copied;
 	if (copied && new_i_size > inode->i_size &&
-	    ext4_da_should_update_i_disksize(page, end))
+	    ext4_da_should_update_i_disksize(page, end)) {
 		ext4_update_i_disksize(inode, new_i_size);
+		need_dirty = true;
+	}
 
-	return generic_write_end(file, mapping, pos, len, copied, page, fsdata);
+	return ext4_da_generic_write_end(file, mapping, pos, len, copied, need_dirty, page, fsdata);
 }
 
 /*
-- 
2.19.1.6.gb485710b


^ permalink raw reply related

* Re: generic/269 failure on ext4 dev branch
From: Ritesh Harjani @ 2023-07-11  8:03 UTC (permalink / raw)
  To: Eric Whitney; +Cc: Eric Whitney, linux-ext4, tytso, ojaswin
In-Reply-To: <ZKxVfsmH71ahAkXN@debian-BULLSEYE-live-builder-AMD64>

Eric Whitney <enwlinux@gmail.com> writes:

> * Ritesh Harjani <ritesh.list@gmail.com>:
>> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
>> 
>> > Eric Whitney <enwlinux@gmail.com> writes:
>> >
>> >> I've discovered that generic/269 will trigger a BUG_ON on line 5070 in
>> >
>> > Can you confirm in your tree what is line 5070 out of the two?
>> >
>> > BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
>> > BUG_ON(ac->ac_pa == NULL);
>> >
>> > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev&id=ab8627e104696b8c1c6953ad5255def5b0821e06#n5070
>> > [2]: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/mballoc.c?h=dev#n5070
>> >
>> > Based on which tree you tested, it could differ I guess.
>> >
>> > I am assuming it is... 
>> >   BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
>> >
>> >> ext4_mb_new_inode_pa when running kvm-xfstests on the 1k test case
>> >> with a kernel built from the current ext4 dev branch. After hitting the
>> >> BUG_ON, the kernel then reports persistent soft lockups.  I mentioned this in
>> >> today's concall, and Ted confirmed the current dev branch should reflect
>> >> what's upstream at this time.
>> >>
>> >> This test reproduces for me 5 to 10% of the time, but reliably enough - I
>> >> typically don't need more than 25 trials to see the failure, and 10 often
>> >> suffices. (I haven't yet tried the 4k test case, but will do so.)
>> >>
>> >> The failure bisects to:
>> >> 7e170922f06b ("ext4: Add allocation criteria 1.5 (CR1_5)")
>> >>
>> >
>> > Thanks for the bisection.
>> >
>> >> Trace follows.
>> >>
>> >> Eric
>> >>
>> >>
>> >> generic/269 24s ...  [21:41:11][  284.208474] run fstests generic/269 at 2023-07-03 21:41:11
>> >> [  284.511484] EXT4-fs (vdc): mounted filesystem 2b1fbdd6-2724-47bc-b7b5-f4a73c9f19be r/w with ordered data mode. Quota mode: none.
>> >> [  284.950657] ------------[ cut here ]------------
>> >> [  284.950901] kernel BUG at fs/ext4/mballoc.c:5070!
>> >> [  284.951104] invalid opcode: 0000 [#1] PREEMPT SMP
>> >> [  284.951296] CPU: 0 PID: 12039 Comm: fsstress Not tainted 6.4.0-rc5+ #6
>> >> [  284.951567] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
>> >> [  284.951900] RIP: 0010:ext4_mb_new_inode_pa+0x2a6/0x2c0
>> >> [  284.952124] Code: b5 7e 0f 85 b5 fe ff ff 0f 1f 44 00 00 e9 ab fe ff ff e8 9d 56 d3 ff 84 c0 0f 85 b5 fe ff ff 0f 0b e9 ae fe ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 4c 89 c1 31 c0 e9 42 ff ff ff 0f 1f 84 00
>> >> [  284.952891] RSP: 0018:ffffc90004053970 EFLAGS: 00010a87
>> >> [  284.953124] RAX: 0000000000000002 RBX: ffff8880342d0720 RCX: 0000000000000001
>> >> [  284.953420] RDX: 0000000000004000 RSI: 00001e4000000000 RDI: ffff8880342d0720
>> >> [  284.953720] RBP: ffffc90004053a00 R08: ffff88800a5fc000 R09: 0000000000000000
>> >> [  284.954020] R10: ffff888007964a98 R11: 0000000000000002 R12: 0000000000000003
>> >> [  284.954321] R13: ffff8880342d0720 R14: ffff88800a5fc000 R15: ffff88800abfc000
>> >> [  284.954610] FS:  00007f3d9db07740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
>> >> [  284.954923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> >> [  284.955148] CR2: 000055b54ce0fde8 CR3: 0000000006730006 CR4: 0000000000770ef0
>> >> [  284.955443] PKRU: 55555554
>> >> [  284.955559] Call Trace:
>> >> [  284.955669]  <TASK>
>> >> [  284.955760]  ? die+0x33/0x90
>> >> [  284.955887]  ? do_trap+0xe0/0x110
>> >> [  284.956031]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
>> >> [  284.956223]  ? do_error_trap+0x65/0x80
>> >> [  284.956385]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
>> >> [  284.956575]  ? exc_invalid_op+0x4b/0x70
>> >> [  284.956738]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
>> >> [  284.956929]  ? asm_exc_invalid_op+0x16/0x20
>> >> [  284.957112]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
>> >> [  284.957305]  ext4_mb_complex_scan_group+0x2e0/0x3e0
>> >> [  284.957512]  ext4_mb_regular_allocator+0x3be/0xd80
>> >> [  284.957716]  ext4_mb_new_blocks+0x9dc/0x1040
>> >> [  284.957895]  ? __kmalloc+0xca/0x150
>> >> [  284.958038]  ? ext4_find_extent+0x3ec/0x450
>> >> [  284.958204]  ? _raw_write_unlock+0x29/0x50
>> >> [  284.958369]  ext4_ext_map_blocks+0x9a4/0x19d0
>> >> [  284.958543]  ? __kmem_cache_free+0x17d/0x2e0
>> >> [  284.958723]  ? find_held_lock+0x2b/0x80
>> >> [  284.958889]  ext4_map_blocks+0x230/0x5d0
>> >> [  284.959056]  ? lock_release+0x139/0x280
>> >> [  284.959222]  ext4_getblk+0x7b/0x2d0
>> >> [  284.959369]  ext4_bread+0xc/0x70
>> >> [  284.959510]  ext4_append+0x8d/0x190
>> >> [  284.959665]  ext4_init_new_dir+0xd5/0x1b0
>> >> [  284.959835]  ext4_mkdir+0x192/0x340
>> >
>> > hmm.. looks like a allocation request for a directory inode.
>> >
>> >> [  284.959987]  vfs_mkdir+0x98/0x140
>> >> [  284.960133]  do_mkdirat+0x131/0x160
>> >> [  284.960285]  __x64_sys_mkdir+0x48/0x70
>> >> [  284.960445]  do_syscall_64+0x38/0x90
>> >> [  284.960600]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
>> >> [  284.960814] RIP: 0033:0x7f3d9dbf8b07
>> >> [  284.960967] Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
>> >> [  284.961741] RSP: 002b:00007ffcd5131098 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
>> >> [  284.962075] RAX: ffffffffffffffda RBX: 00007ffcd5131200 RCX: 00007f3d9dbf8b07
>> >> [  284.962363] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 000055b54cdac240
>> >> [  284.962647] RBP: 00000000000001ff R08: 0000000000000001 R09: 0000000000000003
>> >> [  284.962928] R10: 00007ffcd5130d16 R11: 0000000000000206 R12: 00000000000000cb
>> >> [  284.963209] R13: 8f5c28f5c28f5c29 R14: 000055b54c8ec660 R15: 00000000000000cb
>> >> [  284.963492]  </TASK>
>> >> [  284.963586] Modules linked in:
>> >> [  284.963730] ---[ end trace 0000000000000000 ]---
>> >
>> >
>> > @Ojaswin,
>> >
>> > I was looking at the code. In function ext4_mb_choose_next_group_best_avail(),
>> > we use fls() on goal len to calculate "order". But we never subtract 1
>> > from it. Then we set the goal len based on this "order". This might make
>> > ac_g_ex.fe_len > ac_o_ex.fe_len in some cases where we really don't want
>> > that (like the current case).
>> > You have added some comments there, so I was not sure if that was
>> > intentional. 
>> >
>> > Now, IIUC, the overall concept of cr_1.5 is to trim the max len order
>> > from goal len to something which is still larger than original length. 
>> > But this is only valid for regular files allocation request. Because we don't
>> > normalize the request length for non-regular files. See
>> > ext4_mb_normalize_request() function. As I also see from the current
>> > bug_on, the request was for dir inode I guess.
>> >
>> > Although, I still think we should check the function logic in
>> > ext4_mb_choose_next_group_best_avail(), but either ways I guess we don't
>> > want to use CR_BEST_AVAIL_LEN criteria for non-regular files right,
>> > given we anyways don't normalize the allocation request len for such files.
>> >
>> > In that case do you think below diff make sense?
>> >
>> >
>> > mballoc: Don't use CR_BEST_AVAIL_LEN for non-regular files
>> >
>> > Using CR_BEST_AVAIL_LEN only make sense for regular files, as for
>> > non-regular files we never normalize the allocation request length i.e.
>> > goal len is same as original length (ac_g_ex.fe_len == ac_o_ex.fe_len).
>> >
>> > Hence there is no scope of trimming the goal length such that it can
>> > still satisfy original request len. Thus this patch avoids using
>> > CR_BEST_AVAIL_LEN criteria for non-regular files request.
>> >
>> > ---
>> >  fs/ext4/mballoc.c | 14 +++++++++++++-
>> >  1 file changed, 13 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> > index a2475b8c9fb5..5fbbd7344456 100644
>> > --- a/fs/ext4/mballoc.c
>> > +++ b/fs/ext4/mballoc.c
>> > @@ -974,7 +974,19 @@ static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_context *
>> >                 *group = grp->bb_group;
>> >                 ac->ac_flags |= EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED;
>> >         } else {
>> > -               *new_cr = CR_BEST_AVAIL_LEN;
>> > +               /*
>> > +                * CR_BEST_AVAIL_LEN works based on the concept that we have
>> > +                * a larger normalized goal len request which can be trimmed to
>> > +                * a smaller goal len such that it can still satisfy original
>> > +                * request len. However, allocation request for non-regular
>> > +                * files never gets normalized.
>> > +                * See function ext4_mb_normalize_request() (EXT4_MB_HINT_DATA).
>> > +                */
>> > +               if ((ac->ac_criteria & EXT4_MB_HINT_DATA))
>> 
>> my bad. It should be 
>>                   if ((ac->ac_flags & EXT4_MB_HINT_DATA))
>> 
>> -ritesh
>
>
> Ritesh:
>
> An update: the generic/269 failure on 1k is easily reproducible on 6.5-rc1.
> The BUG_ON on line 5069 in mballoc.c is now triggering for me, rather than 5070:
> BUG_ON(!S_ISREG(ac->ac_inode->i_mode));

Thanks Eric for updating. Yes, I gave this a shot on my setup too and I
am able to reproduce the same BUG_ON on my setup with a similar call
trace that you pasted below. I was trying to see if I can hit the other
bug on with ac->ac_pa = NULL, but I wasn't able to hit that at all.

>
> Representative trace follows - operation at top of trace varies run to run.
>
> Eric
>
> generic/269 25s ...  [18:07:16][   54.255436] run fstests generic/269 at 2023-06
> [   54.508385] EXT4-fs (vdc): mounted filesystem a986a0fa-5c36-4e16-bad9-2d53b8.
> [   57.914332] ------------[ cut here ]------------
> [   57.914534] kernel BUG at fs/ext4/mballoc.c:5069!
> [   57.914863] invalid opcode: 0000 [#1] PREEMPT SMP
> [   57.915156] CPU: 1 PID: 4157 Comm: fsstress Not tainted 6.5.0-rc1 #1
> [   57.915531] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4
> [   57.916033] RIP: 0010:ext4_mb_new_inode_pa+0x2a6/0x2c0
> [   57.916354] Code: b5 7e 0f 85 b5 fe ff ff 0f 1f 44 00 00 e9 ab fe ff ff e8 70
> [   57.917473] RSP: 0018:ffffc900057f39f0 EFLAGS: 00010206
> [   57.917799] RAX: 0000000000000002 RBX: ffff888017b70d10 RCX: 0000000000000001
> [   57.918233] RDX: 000000000000a000 RSI: 00000b1c00000000 RDI: ffff888017b70d10
> [   57.918667] RBP: ffffc900057f3a80 R08: ffff88800672f000 R09: ffff888009ac65a0
> [   57.919100] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000068
> [   57.919534] R13: ffff888017b70d10 R14: ffff88800672f000 R15: ffff888007b02000
> [   57.919971] FS:  00007f03368b6740(0000) GS:ffff88807dd00000(0000) knlGS:00000
> [   57.920461] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   57.920789] CR2: 000055fe23769d78 CR3: 000000000787e005 CR4: 0000000000770ee0
> [   57.921113] PKRU: 55555554
> [   57.921213] Call Trace:
> [   57.921327]  <TASK>
> [   57.921443]  ? die+0x33/0x90
> [   57.921611]  ? do_trap+0xe0/0x110
> [   57.921807]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> [   57.922023]  ? do_error_trap+0x65/0x80
> [   57.922174]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> [   57.922354]  ? exc_invalid_op+0x4b/0x70
> [   57.922510]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> [   57.922693]  ? asm_exc_invalid_op+0x16/0x20
> [   57.922859]  ? ext4_mb_new_inode_pa+0x2a6/0x2c0
> [   57.923039]  ext4_mb_complex_scan_group+0x2e0/0x3e0
> [   57.923232]  ext4_mb_regular_allocator+0x3be/0xd80
> [   57.923425]  ext4_mb_new_blocks+0x9dc/0x1040
> [   57.923592]  ? __kmalloc+0xca/0x150
> [   57.923732]  ? ext4_find_extent+0x3ec/0x450
> [   57.923896]  ? __kmem_cache_free+0x191/0x310
> [   57.924064]  ? rcu_is_watching+0xd/0x40
> [   57.924218]  ext4_ext_map_blocks+0x981/0x19c0
> [   57.924393]  ? __lock_acquire.constprop.0+0x4b/0x6a0
> [   57.924601]  ? __lock_acquire.constprop.0+0x4b/0x6a0
> [   57.924871]  ext4_map_blocks+0x22a/0x5b0
> [   57.925079]  ext4_getblk+0x7b/0x2d0
> [   57.925260]  ext4_bread+0xc/0x70
> [   57.925392]  ext4_symlink+0x12b/0x3c0
> [   57.925610]  vfs_symlink+0x18c/0x230
> [   57.925839]  do_symlinkat+0x11c/0x130
> [   57.926071]  __x64_sys_symlink+0x38/0x50
> [   57.926318]  do_syscall_64+0x38/0x90
> [   57.926544]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [   57.926866] RIP: 0033:0x7f03369a9a97
> [   57.927093] Code: f0 ff ff 73 01 c3 48 8b 0d f6 d3 0c 00 f7 d8 64 89 01 48 88
> [   57.928176] RSP: 002b:00007ffdf02ee128 EFLAGS: 00000206 ORIG_RAX: 00000000008
> [   57.928574] RAX: ffffffffffffffda RBX: 00007ffdf02ee290 RCX: 00007f03369a9a97
> [   57.928879] RDX: 0000000000000064 RSI: 000055fe23514410 RDI: 000055fe23542d10
> [   57.929161] RBP: 000055fe23514410 R08: 000055fe23542d10 R09: 0000000000000003
> [   57.929446] R10: 00007ffdf02edda6 R11: 0000000000000206 R12: 0000000000000170
> [   57.929722] R13: 000055fe23542d10 R14: 00007ffdf02ee290 R15: 000055fe23542d10
> [   57.930029]  </TASK>
> [   57.930126] Modules linked in:
> [   57.930274] ---[ end trace 0000000000000000 ]---

-ritesh

^ permalink raw reply

* Re: [syzbot] [ext4?] WARNING: locking bug in ext4_move_extents
From: syzbot @ 2023-07-11 15:19 UTC (permalink / raw)
  To: adilger.kernel, linux-ext4, linux-fsdevel, linux-kernel,
	syzkaller-bugs, tytso
In-Reply-To: <000000000000e55d2005fd59d6c9@google.com>

syzbot has bisected this issue to:

commit aff3bea95388299eec63440389b4545c8041b357
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Wed May 24 03:49:51 2023 +0000

    ext4: add lockdep annotations for i_data_sem for ea_inode's

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=14abf682a80000
start commit:   3f01e9fed845 Merge tag 'linux-watchdog-6.5-rc2' of git://w..
git tree:       upstream
final oops:     https://syzkaller.appspot.com/x/report.txt?x=16abf682a80000
console output: https://syzkaller.appspot.com/x/log.txt?x=12abf682a80000
kernel config:  https://syzkaller.appspot.com/x/.config?x=150188feee7071a7
dashboard link: https://syzkaller.appspot.com/bug?extid=7f4a6f7f7051474e40ad
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=11d8f2b0a80000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1061e6b0a80000

Reported-by: syzbot+7f4a6f7f7051474e40ad@syzkaller.appspotmail.com
Fixes: aff3bea95388 ("ext4: add lockdep annotations for i_data_sem for ea_inode's")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply

* Re: [PATCH 1/1] mke2fs: the -d option can now handle tarball input
From: Darrick J. Wong @ 2023-07-11 23:53 UTC (permalink / raw)
  To: Johannes Schauer Marin Rodrigues; +Cc: linux-ext4
In-Reply-To: <168836303674.2483784.4947178089926484601@localhost>

On Mon, Jul 03, 2023 at 07:43:56AM +0200, Johannes Schauer Marin Rodrigues wrote:
> Hi,
> 
> Quoting Darrick J. Wong (2023-06-30 17:51:28)
> > On Tue, Jun 20, 2023 at 02:16:41PM +0200, Johannes Schauer Marin Rodrigues wrote:
> > > If archive.h is available during compilation, enable mke2fs to read a
> > > tarball as input. Since libarchive.so.13 is opened with dlopen,
> > > libarchive is not a hard library dependency of the resulting binary.
> > 
> > I can't say I'm in favor of adding build dependencies to e2fsprogs,
> > since the point of -d taking a directory arg was to *avoid* having to
> > understand anything other than posix(ish) directory tree walking APIs.
> 
> this is why the build dependency is optional.

As Ted said elsewhere, the big question is (a) do we really want
e2fsprogs depending on libarchive at all, and (b) is libarchive's API
stable enough that you'll maintain it for us?  Merging this patch *is*
adding to the complexity of what most distros consider to be critical
system utility.

> It should be perfectly possible
> to build e2fsprogs without libarchive as well. I copied the pattern that was
> already implemented for libmagic which is also not a hard dependency but gets
> dlopened-ed at runtime. If this mechanism is fine for libmagic it should be
> fine for others as well, no?
> 
> The tar format (minus some features) is also not terribly complicated. Would

There's at least five formats known to GNU tar, according to its manpage:

Format	UID		File Size	File Name	Devn
gnu	1.8e19		Unlimited	Unlimited	63
oldgnu	1.8e19		Unlimited	Unlimited	63
v7	2097151		8GB		99		n/a
ustar	2097151		8GB		256		21
posix	Unlimited	Unlimited	Unlimited	Unlimited

https://www.gnu.org/software/tar/manual/html_chapter/Formats.html

> you prefer I add a rudimentary tar parser that will be used in the event that
> libarchive is not available? The tar format is not that complicated but adding
> such code to e2fsprogs would be overkill for a functionality that is otherwise
> optional, no?

Indeed not.

> > > This enables the creation of filesystems containing files which would
> > > otherwise need superuser privileges to create (like device nodes, which are
> > > also not allowed in unshared user namespaces). By reading from standard
> > > input when the filename is a dash (-), mke2fs can be used as part of a
> > > shell pipeline without temporary files.
> > What if the argument is actually a Microsoft CAB archive (which libarchive
> > claims to support)?  Will it actually copy the cab archive into an ext4
> > image?
> 
> I didn't have a cab archive so I couldn't test this but it does work with other
> archive formats like zip files. Would you like me to artificially restrict the
> input format to only tarballs?

No -- if Ted wants libarchive input for e2fsprogs, it may as well take
full advantage of it.

--D

> Thanks!
> 
> cheers, josch



^ permalink raw reply

* Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write
From: Darrick J. Wong @ 2023-07-12  0:05 UTC (permalink / raw)
  To: zhanchengbin
  Cc: Theodore Ts'o, linux-ext4, linfeilong, louhongxiang,
	liuzhiqiang26, Ye Bin
In-Reply-To: <84a1a21a-be09-f70d-1d1b-234c706ddf14@huawei.com>

On Sat, Jul 08, 2023 at 03:29:51PM +0800, zhanchengbin wrote:
> On 2023/7/5 3:33, Theodore Ts'o wrote:
> 
> I have written the ioctl for EXT4_IOC_SET_ERROR_BEHAVIOR according to your
> instructions and verified that it can indeed modify the data on the disk.
> 
> However, I realized some problems. If we use the ioctl method to modify the
> data, it would require multiple ioctls in user space to modify the
> superblock.
> If one ioctl successfully modifies the superblock data, but another ioctl
> fails, the atomicity of the superblock cannot be guaranteed. This is just
> within one user space process, not to mention the scenario of multiple user
> space processes calling ioctls concurrently. Additionally, multiple ioctls
> modifying the superblock may be placed in multiple transactions, which
> further
> compromises atomicity.
> 
> Writing the entire superblock buffer_head to disk can ensure atomicity.

...at a cost of racing with the mounted fs, which might be updating the
superblock at the same time; and prohibiting the kernel devs from
closing the "scribble on mounted bdev" attack surface.

How many of these attributes do you /really/ need to be able to commit
atomically, anyway?  If the system crashes, can't you re-run the
program and end up with the same super fields?

--D

> However, these data need to be converted to ext4_sb_info. Otherwise, during
> unmount, the data in memory will overwrite the data on the disk.
> (Modifying the values in ext4_sb_info can potentially introduce unexpected
> issues, just like the problem thata arises when setting the UUID without
> checking ext4_has_feature_csum_seed.)
> 
> > So the better approach is to have multiple new ioctl's for each
> > superblock field (or set of fields) that you might want modify.  We
> > have some of these already --- for example, EXT4_IOC_SETFSUUID and
> > FS_IOC_SETFSLABEL.  For tune2fs, all of additional ioctls for making
> > configuration changes while the file system is mounted are:
> > 
> >     * EXT4_IOC_SET_FEATURES
> > 	- for tune2fs -O...
> >     * EXT4_IOC_CLEAR_FEATURES
> > 	- for tune2fs -O^...
> >     * EXT4_IOC_SET_ERROR_BEHAVIOR
> > 	- for tune2fs -e
> >     * EXT4_IOC_SET_DEFAULT_MOUNT_FLAGS
> >          - for tune2fs -o
> >     * EXT4_IOC_SET_DEFAULT_MOUNT_OPTS
> >          - for tune2fs -E mount_opts=XXX
> >     * EXT4_IOC_SET_FSCK_POLICY
> > 	- for tune2fs -[cCiT]
> >     * EXT4_IOC_SET_RESERVED_ALLOC
> > 	- for tune2fs -[ugm]
> 
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 331859511f80..d598e1975786 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -51,6 +51,11 @@ static void ext4_sb_setuuid(struct ext4_super_block *es,
> const void *arg)
>  	memcpy(es->s_uuid, (__u8 *)arg, UUID_SIZE);
>  }
> 
> +static void ext4_sb_set_error_behavior(struct ext4_super_block *es, const
> void *arg)
> +{
> +	es->s_errors = cpu_to_le16(*(__u16 *)arg);
> +}
> +
>  static
>  int ext4_update_primary_sb(struct super_block *sb, handle_t *handle,
>  			   ext4_update_sb_callback func,
> @@ -1220,6 +1225,32 @@ static int ext4_ioctl_setuuid(struct file *filp,
>  	return ret;
>  }
> 
> +static int ext4_ioctl_set_error_behavior(struct file *filp,
> +			const __u16 __user *uerror_behavior)
> +{
> +	int ret = 0;
> +	struct super_block *sb = file_inode(filp)->i_sb;
> +	__u16 error_behavior;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	if (copy_from_user(&error_behavior, uerror_behavior,
> sizeof(error_behavior)))
> +		return -EFAULT;
> +
> +	if (error_behavior < EXT4_ERRORS_CONTINUE || error_behavior >
> EXT4_ERRORS_PANIC)
> +		return -EINVAL;
> +
> +	ret = mnt_want_write_file(filp);
> +	if (ret)
> +		return ret;
> +
> +	ret = ext4_update_superblocks_fn(sb, ext4_sb_set_error_behavior,
> &error_behavior);
> +	mnt_drop_write_file(filp);
> +
> +	return ret;
> +}
> +
>  static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long
> arg)
>  {
>  	struct inode *inode = file_inode(filp);
> @@ -1607,6 +1638,8 @@ static long __ext4_ioctl(struct file *filp, unsigned
> int cmd, unsigned long arg)
>  		return ext4_ioctl_getuuid(EXT4_SB(sb), (void __user *)arg);
>  	case EXT4_IOC_SETFSUUID:
>  		return ext4_ioctl_setuuid(filp, (const void __user *)arg);
> +	case EXT4_IOC_SET_ERROR_BEHAVIOR:
> +		return ext4_ioctl_set_error_behavior(filp, (const void __user *)arg);
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/include/uapi/linux/ext4.h b/include/uapi/linux/ext4.h
> index 1c4c2dd29112..68329d51a8a7 100644
> --- a/include/uapi/linux/ext4.h
> +++ b/include/uapi/linux/ext4.h
> @@ -33,6 +33,7 @@
>  #define EXT4_IOC_CHECKPOINT		_IOW('f', 43, __u32)
>  #define EXT4_IOC_GETFSUUID		_IOR('f', 44, struct fsuuid)
>  #define EXT4_IOC_SETFSUUID		_IOW('f', 44, struct fsuuid)
> +#define EXT4_IOC_SET_ERROR_BEHAVIOR	_IOW('f', 45, __u16)
> 
>  #define EXT4_IOC_SHUTDOWN _IOR('X', 125, __u32)
> 
> 
> 
> Thanks,
>  - bin.
> 

^ permalink raw reply

* Słowa kluczowe do wypozycjonowania
From: Adam Charachuta @ 2023-07-12  8:05 UTC (permalink / raw)
  To: linux-ext4

Dzień dobry,

zapoznałem się z Państwa ofertą i z przyjemnością przyznaję, że przyciąga uwagę i zachęca do dalszych rozmów. 

Pomyślałem, że może mógłbym mieć swój wkład w Państwa rozwój i pomóc dotrzeć z tą ofertą do większego grona odbiorców. Pozycjonuję strony www, dzięki czemu generują świetny ruch w sieci.

Możemy porozmawiać w najbliższym czasie?

Pozdrawiam
Adam Charachuta

^ permalink raw reply

* Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write
From: zhanchengbin @ 2023-07-12  9:06 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Theodore Ts'o, linux-ext4, linfeilong, louhongxiang,
	liuzhiqiang26, Ye Bin
In-Reply-To: <20230712000511.GA11427@frogsfrogsfrogs>

On 2023/7/12 8:05, Darrick J. Wong wrote:
> On Sat, Jul 08, 2023 at 03:29:51PM +0800, zhanchengbin wrote:
>> On 2023/7/5 3:33, Theodore Ts'o wrote:
>>
>> I have written the ioctl for EXT4_IOC_SET_ERROR_BEHAVIOR according to your
>> instructions and verified that it can indeed modify the data on the disk.
>>
>> However, I realized some problems. If we use the ioctl method to modify the
>> data, it would require multiple ioctls in user space to modify the
>> superblock.
>> If one ioctl successfully modifies the superblock data, but another ioctl
>> fails, the atomicity of the superblock cannot be guaranteed. This is just
>> within one user space process, not to mention the scenario of multiple user
>> space processes calling ioctls concurrently. Additionally, multiple ioctls
>> modifying the superblock may be placed in multiple transactions, which
>> further
>> compromises atomicity.
>>
>> Writing the entire superblock buffer_head to disk can ensure atomicity.
> 
> ...at a cost of racing with the mounted fs, which might be updating the
> superblock at the same time; and prohibiting the kernel devs from
> closing the "scribble on mounted bdev" attack surface.

Regardless of whether I am modifying a single byte or the entire
buffer_head, there will always be a situation of contention with the kernel
lock, You can take a look at ext4_update_superblocks_fn which calls
lock_buffer.

What I am more concerned about is that the superblock needs to be
synchronized to the memory before being saved to the disk. Otherwise, during
the ext4_commit_super process, outdated data may be saved to the disk.

> 
> How many of these attributes do you /really/ need to be able to commit

My plan with Tytso is to add seven attribute modifications.

> atomically, anyway?  If the system crashes, can't you re-run the
> program and end up with the same super fields?
Just run the program again after the system crashes, o.O? I don't think so.

The program perceives that the superblock has been modified 
successfully, and
the value of the modified superblock is saved on disk in
ext4_update_primary_sb, but there is no guarantee whether the super block is
saved in journal on the disk or whether it is checkpointed. If the super 
block
has not been saved in journal on the disk and the system crashes, the
modification of the super block may be overwritten when the journal
recover; similarly, this problem will also occur for the translation 
that has
not been checkpointed; Both of these scenarios are not perceptible to user
process unless there is a backup mechanism implemented in user mode.

Moreover, the method of rerunning the program cannot resolve the conflicting
racing condition between the two ioctls.

Thanks,
  - bin.

> 
> --D
> 
>> However, these data need to be converted to ext4_sb_info. Otherwise, during
>> unmount, the data in memory will overwrite the data on the disk.
>> (Modifying the values in ext4_sb_info can potentially introduce unexpected
>> issues, just like the problem thata arises when setting the UUID without
>> checking ext4_has_feature_csum_seed.)

^ permalink raw reply

* [PATCH] ext4: fix decoding of raw_inode timestamps
From: Jeff Layton @ 2023-07-12 15:02 UTC (permalink / raw)
  To: brauner, Theodore Ts'o, Andreas Dilger, Jan Kara
  Cc: linux-fsdevel, linux-ext4, linux-kernel

When we covert a timestamp from raw disk format, we need to consider it
to be signed, as the value may represent a date earlier than 1970. This
fixes generic/258 on ext4.

Cc: Jan Kara <jack@suse.cz>
Fixes: f2ddb05870fb ("ext4: convert to ctime accessor functions")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ext4/ext4.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

It might be best to just squash this fix in with the ext4 conversion in
the vfs tree.

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d63543187359..2af347669db7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -877,7 +877,7 @@ static inline __le32 ext4_encode_extra_time(struct timespec64 ts)
 static inline struct timespec64 ext4_decode_extra_time(__le32 base,
 						       __le32 extra)
 {
-	struct timespec64 ts = { .tv_sec = le32_to_cpu(base) };
+	struct timespec64 ts = { .tv_sec = (signed)le32_to_cpu(base) };
 
 	if (unlikely(extra & cpu_to_le32(EXT4_EPOCH_MASK)))
 		ts.tv_sec += (u64)(le32_to_cpu(extra) & EXT4_EPOCH_MASK) << 32;
-- 
2.41.0


^ permalink raw reply related

* Re: [PATCH] ext4: fix decoding of raw_inode timestamps
From: Christian Brauner @ 2023-07-12 15:32 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christian Brauner, linux-fsdevel, linux-ext4, linux-kernel,
	Theodore Ts'o, Andreas Dilger, Jan Kara
In-Reply-To: <20230712150251.163790-1-jlayton@kernel.org>

On Wed, 12 Jul 2023 11:02:49 -0400, Jeff Layton wrote:
> When we covert a timestamp from raw disk format, we need to consider it
> to be signed, as the value may represent a date earlier than 1970. This
> fixes generic/258 on ext4.
> 
> 

Applied to the vfs.ctime branch of the vfs/vfs.git tree.
Patches in the vfs.ctime branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.ctime

[1/1 FOLDED] ext4: convert to ctime accessor functions
      https://git.kernel.org/vfs/vfs/c/f65cb009d449

^ permalink raw reply

* Re: [PATCH 01/79] fs: add ctime accessors infrastructure
From: Randy Dunlap @ 2023-07-12 15:31 UTC (permalink / raw)
  To: Jeff Layton, linux-kernel@vger.kernel.org,
	Linux FS-devel Mailing List, linux-um
In-Reply-To: <20230621144507.55591-2-jlayton@kernel.org>

Hi Jeff,

On arch/um/, (subarch i386 or x86_64), hostfs build fails with:

../fs/hostfs/hostfs_kern.c:520:36: error: incompatible type for arg
ument 2 of 'inode_set_ctime_to_ts'
../include/linux/fs.h:1499:73: note: expected 'struct timespec64' b
ut argument is of type 'const struct hostfs_timespec *'


-- 
~Randy

^ permalink raw reply

* Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write
From: Theodore Ts'o @ 2023-07-12 15:42 UTC (permalink / raw)
  To: zhanchengbin
  Cc: Darrick J. Wong, linux-ext4, linfeilong, louhongxiang,
	liuzhiqiang26, Ye Bin
In-Reply-To: <4a3ac0da-69cf-e282-dc56-aefaa0e90718@huawei.com>

On Wed, Jul 12, 2023 at 05:06:31PM +0800, zhanchengbin wrote:
> > ...at a cost of racing with the mounted fs, which might be updating the
> > superblock at the same time; and prohibiting the kernel devs from
> > closing the "scribble on mounted bdev" attack surface.
> 
> Regardless of whether I am modifying a single byte or the entire
> buffer_head, there will always be a situation of contention with the kernel
> lock, You can take a look at ext4_update_superblocks_fn which calls
> lock_buffer.

Many/most of the fields that tune2fs will need to modify are ones
which the kernel never needs to modify.  So no locking will be
necessary, and so long as you are using the journal, we don't need to
worry about an invalid checksum getting written to the disk.

There might be races with buffered reads to the superblock, but those
races exist today.  E2fsprogs has a way of dealing this where if the
checksum is invalid, it will just sleep and retry.  Another way of
dealing with is to use an O_DIRECT read to the superblock.

Longer-term, we may want to have a EXT4_IOC_GET_SUPERBLOCK ioctl, at
which point we may need to have some kind of new kernel locking ----
or we can just take a snapshot of the superblock, and check to see the
checksum is valid; if not, it can just retry the snapshot.

> What I am more concerned about is that the superblock needs to be
> synchronized to the memory before being saved to the disk. Otherwise, during
> the ext4_commit_super process, outdated data may be saved to the disk.

As I've noted above, this is already not a problem if journalling is
enabled.  If journalling is not enabled, it's possible that an invalid
superblock is written to disk.  This can sort of happen already, if
you have an orphan list update racing with an ext4_error() update to
the superblock.  It's a debateable point how much we should care,
since if you don't have journalling enabled, on a crash the file
system may be corrupted in many situations, and so we will need to use
fsck.ext4 anyway.  If there is only a single superblock, then fsck
might not have a fallback superblock to use, but arguably that's a
corner case.

> The program perceives that the superblock has been modified
> successfully, and the value of the modified superblock is saved on
> disk in ext4_update_primary_sb, but there is no guarantee whether
> the super block is saved in journal on the disk or whether it is
> checkpointed.

So in actual practice, e2fsprogs will replay the journal and then
reread the superblock.  So if you change the max_mounts_count via
tune2fs -c (even though very few systems use that these days, and it's
largely a deprecated feature), so long as the transaction has been
committed, the superblock update will be honored by e2fsck.

I'm not sure we care *all* that much, but if we really want to make
sure things like tune2fs -c will always "take" after a crash, we can
simply have the ioctl force a journal commit before returning.

> If the super block has not been saved in journal on
> the disk and the system crashes, the modification of the super block
> may be overwritten when the journal recover; similarly, this problem
> will also occur for the translation that has not been checkpointed;
> Both of these scenarios are not perceptible to user process unless
> there is a backup mechanism implemented in user mode.

We now *always* update the superblock through the journal.  We only
fall back to a direct update of the superblock if the journal is not
present, or the journal has aborted due to some kind of fundamental
failure (so that the ext4_error can be written out before we reboot
the system or force the file system to be read-only).

> Moreover, the method of rerunning the program cannot resolve the conflicting
> racing condition between the two ioctls.

These ioctls are quite rarely used, and it's a problem we have today
if we have two racing tune2fs commands.  By having the kernel handle
it, so long as the two ioctls are modifying different superblock
fields, both ioctl updates will happen --- where as today, when we
have two racing tune2fs, one of the tune2fs updates could be
completely lost.  This has been true for decades in ext2, ext3, and
ext4, and no one has actually reported this as a problem.

Cheers,

^ permalink raw reply

* Re: [PATCH 01/32] block: Provide blkdev_get_handle_* functions
From: Haris Iqbal @ 2023-07-12 16:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-block, linux-fsdevel, Jens Axboe, Alasdair Kergon,
	Andrew Morton, Anna Schumaker, Chao Yu, Christian Borntraeger,
	Darrick J. Wong, Dave Kleikamp, David Sterba, dm-devel, drbd-dev,
	Gao Xiang, Jack Wang, Jaegeuk Kim, jfs-discussion, Joern Engel,
	Joseph Qi, Kent Overstreet, linux-bcache, linux-btrfs,
	linux-erofs, linux-ext4, linux-f2fs-devel, linux-mm, linux-mtd,
	linux-nfs, linux-nilfs, linux-nvme, linux-pm, linux-raid,
	linux-s390, linux-scsi, linux-xfs, Mike Snitzer, Minchan Kim,
	ocfs2-devel, reiserfs-devel, Sergey Senozhatsky, Song Liu,
	Sven Schnelle, target-devel, Ted Tso, Trond Myklebust, xen-devel
In-Reply-To: <ZKbgAG5OoHVyUKOG@infradead.org>

On Thu, Jul 6, 2023 at 5:38 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, Jul 04, 2023 at 02:21:28PM +0200, Jan Kara wrote:
> > Create struct bdev_handle that contains all parameters that need to be
> > passed to blkdev_put() and provide blkdev_get_handle_* functions that
> > return this structure instead of plain bdev pointer. This will
> > eventually allow us to pass one more argument to blkdev_put() without
> > too much hassle.
>
> Can we use the opportunity to come up with better names?  blkdev_get_*
> was always a rather horrible naming convention for something that
> ends up calling into ->open.
>
> What about:
>
> struct bdev_handle *bdev_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
>                 const struct blk_holder_ops *hops);
> struct bdev_handle *bdev_open_by_path(dev_t dev, blk_mode_t mode,
>                 void *holder, const struct blk_holder_ops *hops);
> void bdev_release(struct bdev_handle *handle);

+1 to this.
Also, if we are removing "handle" from the function, should the name
of the structure it returns also change? Would something like bdev_ctx
be better?

(Apologies for the previous non-plaintext email)

>
> ?

^ permalink raw reply

* Re: [PATCH] ext4: fix decoding of raw_inode timestamps
From: Theodore Ts'o @ 2023-07-12 17:52 UTC (permalink / raw)
  To: Jeff Layton
  Cc: brauner, Andreas Dilger, Jan Kara, linux-fsdevel, linux-ext4,
	linux-kernel
In-Reply-To: <20230712150251.163790-1-jlayton@kernel.org>

On Wed, Jul 12, 2023 at 11:02:49AM -0400, Jeff Layton wrote:
> When we covert a timestamp from raw disk format, we need to consider it
> to be signed, as the value may represent a date earlier than 1970. This
> fixes generic/258 on ext4.
> 
> Cc: Jan Kara <jack@suse.cz>
> Fixes: f2ddb05870fb ("ext4: convert to ctime accessor functions")
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Acked-by: Theodore Ts'o <tytso@mit.edu>

Thanks for the fix!

It had been on my list to checking to see if the ext4 kunit tests
would pass, since Jan had mentioned that he had done the work to make
sure the ext4 kunit test would compile, but he hadn't gotten around to
try run the kunit test.  Unfortunately, I hadn't gotten to it.

I *think* the ext4 kunit tests should have caught this as well; out of
curiosity, have you tried running the ext4 kunit tests either before
or after this patch?  If so, what were your findings?

Cheers,

					- Ted

^ permalink raw reply

* Re: [PATCH] ext4: fix decoding of raw_inode timestamps
From: Jeff Layton @ 2023-07-12 18:09 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: brauner, Andreas Dilger, Jan Kara, linux-fsdevel, linux-ext4,
	linux-kernel
In-Reply-To: <20230712175258.GB3677745@mit.edu>

On Wed, 2023-07-12 at 13:52 -0400, Theodore Ts'o wrote:
> On Wed, Jul 12, 2023 at 11:02:49AM -0400, Jeff Layton wrote:
> > When we covert a timestamp from raw disk format, we need to consider it
> > to be signed, as the value may represent a date earlier than 1970. This
> > fixes generic/258 on ext4.
> > 
> > Cc: Jan Kara <jack@suse.cz>
> > Fixes: f2ddb05870fb ("ext4: convert to ctime accessor functions")
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> Acked-by: Theodore Ts'o <tytso@mit.edu>
> 
> Thanks for the fix!
> 
> It had been on my list to checking to see if the ext4 kunit tests
> would pass, since Jan had mentioned that he had done the work to make
> sure the ext4 kunit test would compile, but he hadn't gotten around to
> try run the kunit test.  Unfortunately, I hadn't gotten to it.
> 
> I *think* the ext4 kunit tests should have caught this as well; out of
> curiosity, have you tried running the ext4 kunit tests either before
> or after this patch?  If so, what were your findings?
> 
> Cheers,
> 
> 					- Ted

No, I haven't. I'm running fstests on it now. Is there a quickstart for
running those tests?
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [PATCH v3 0/6] jbd2: fix several checkpoint inconsistent issues
From: Theodore Ts'o @ 2023-07-12 18:29 UTC (permalink / raw)
  To: linux-ext4, Zhang Yi
  Cc: Theodore Ts'o, adilger.kernel, jack, yi.zhang, yukuai3,
	chengzhihao1
In-Reply-To: <20230606135928.434610-1-yi.zhang@huaweicloud.com>


On Tue, 06 Jun 2023 21:59:22 +0800, Zhang Yi wrote:
> v2->v3:
>  - Init released parameter in journal_shrink_one_cp_list() instead of
>    __jbd2_journal_clean_checkpoint_list() in patch 3.
>  - Fix a comment in patch 5.
> v1->v2:
>  - Drop the last patch in [1].
>  - Merge journal_clean_one_cp_list() into journal_shrink_one_cp_list().
>  - Fix the race issues through trying to hold buffer lock and check
>    dirty state under the lock.
>  - Append a cleanup patch, remove __journal_try_to_free_buffer().
> 
> [...]

Applied, thanks!

[1/6] jbd2: recheck chechpointing non-dirty buffer
      commit: c2d6fd9d6f35079f1669f0100f05b46708c74b7f
[2/6] jbd2: remove t_checkpoint_io_list
      commit: be22255360f80d3af789daad00025171a65424a5
[3/6] jbd2: remove journal_clean_one_cp_list()
      commit: b98dba273a0e47dbfade89c9af73c5b012a4eabb
[4/6] jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
      commit: e34c8dd238d0c9368b746480f313055f5bab5040
[5/6] jbd2: fix a race when checking checkpoint buffer busy
      commit: 46f881b5b1758dc4a35fba4a643c10717d0cf427
[6/6] jbd2: remove __journal_try_to_free_buffer()
      commit: 3c55097c553c49deab60ac62c83ef17565004a97

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply

* Re: [PATCH] ext4: fix decoding of raw_inode timestamps
From: Theodore Ts'o @ 2023-07-12 21:25 UTC (permalink / raw)
  To: Jeff Layton
  Cc: brauner, Andreas Dilger, Jan Kara, linux-fsdevel, linux-ext4,
	linux-kernel
In-Reply-To: <4c29c4e8f88509b2f8e8c08197dba8cfeb07c045.camel@kernel.org>

On Wed, Jul 12, 2023 at 02:09:59PM -0400, Jeff Layton wrote:
> 
> No, I haven't. I'm running fstests on it now. Is there a quickstart for
> running those tests?

At the top level kernel sources:

./tools/testing/kunit/kunit.py  run --kunitconfig ./fs/ext4/.kunitconfig

You should get:

[17:23:09] Starting KUnit Kernel (1/1)...
[17:23:09] ============================================================
[17:23:09] =============== ext4_inode_test (1 subtest) ================
[17:23:09] ============= inode_test_xtimestamp_decoding  ==============
[17:23:09] [PASSED] 1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits
[17:23:09] [PASSED] 1969-12-31 Upper bound of 32bit < 0 timestamp, no extra bits
[17:23:09] [PASSED] 1970-01-01 Lower bound of 32bit >=0 timestamp, no extra bits
[17:23:09] [PASSED] 2038-01-19 Upper bound of 32bit >=0 timestamp, no extra bits
[17:23:09] [PASSED] 2038-01-19 Lower bound of 32bit <0 timestamp, lo extra sec bit on
[17:23:09] [PASSED] 2106-02-07 Upper bound of 32bit <0 timestamp, lo extra sec bit on
[17:23:09] [PASSED] 2106-02-07 Lower bound of 32bit >=0 timestamp, lo extra sec bit on
[17:23:09] [PASSED] 2174-02-25 Upper bound of 32bit >=0 timestamp, lo extra sec bit on
[17:23:09] [PASSED] 2174-02-25 Lower bound of 32bit <0 timestamp, hi extra sec bit on
[17:23:09] [PASSED] 2242-03-16 Upper bound of 32bit <0 timestamp, hi extra sec bit on
[17:23:09] [PASSED] 2242-03-16 Lower bound of 32bit >=0 timestamp, hi extra sec bit on
[17:23:09] [PASSED] 2310-04-04 Upper bound of 32bit >=0 timestamp, hi extra sec bit on
[17:23:09] [PASSED] 2310-04-04 Upper bound of 32bit>=0 timestamp, hi extra sec bit 1. 1 ns
[17:23:09] [PASSED] 2378-04-22 Lower bound of 32bit>= timestamp. Extra sec bits 1. Max ns
[17:23:09] [PASSED] 2378-04-22 Lower bound of 32bit >=0 timestamp. All extra sec bits on
[17:23:09] [PASSED] 2446-05-10 Upper bound of 32bit >=0 timestamp. All extra sec bits on
[17:23:09] ========= [PASSED] inode_test_xtimestamp_decoding ==========
[17:23:09] ================= [PASSED] ext4_inode_test =================
[17:23:09] ============================================================
[17:23:09] Testing complete. Ran 16 tests: passed: 16
[17:23:09] Elapsed time: 1.943s total, 0.001s configuring, 1.777s building, 0.123s running

	   	   	 	       	      		   - Ted

^ permalink raw reply

* [PATCH 4/7] migrate: Use folio_set_bh() instead of set_bh_page()
From: Matthew Wilcox (Oracle) @ 2023-07-13  3:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), David Sterba, linux-fsdevel,
	Pankaj Raghav, Konstantin Komarov, ntfs3, Theodore Tso, Jan Kara,
	linux-ext4
In-Reply-To: <20230713035512.4139457-1-willy@infradead.org>

This function was converted before folio_set_bh() existed.  Catch
up to the new API.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index af8557d78549..1363053894ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -773,7 +773,7 @@ static int __buffer_migrate_folio(struct address_space *mapping,
 
 	bh = head;
 	do {
-		set_bh_page(bh, &dst->page, bh_offset(bh));
+		folio_set_bh(bh, dst, bh_offset(bh));
 		bh = bh->b_this_page;
 	} while (bh != head);
 
-- 
2.39.2


^ permalink raw reply related

* [PATCH 5/7] ntfs3: Convert ntfs_get_block_vbo() to use a folio
From: Matthew Wilcox (Oracle) @ 2023-07-13  3:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), David Sterba, linux-fsdevel,
	Pankaj Raghav, Konstantin Komarov, ntfs3, Theodore Tso, Jan Kara,
	linux-ext4
In-Reply-To: <20230713035512.4139457-1-willy@infradead.org>

Remove a user of set_bh_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/ntfs3/inode.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/ntfs3/inode.c b/fs/ntfs3/inode.c
index dc7e7ab701c6..8ae572aacc69 100644
--- a/fs/ntfs3/inode.c
+++ b/fs/ntfs3/inode.c
@@ -554,7 +554,7 @@ static noinline int ntfs_get_block_vbo(struct inode *inode, u64 vbo,
 	struct super_block *sb = inode->i_sb;
 	struct ntfs_sb_info *sbi = sb->s_fs_info;
 	struct ntfs_inode *ni = ntfs_i(inode);
-	struct page *page = bh->b_page;
+	struct folio *folio = bh->b_folio;
 	u8 cluster_bits = sbi->cluster_bits;
 	u32 block_size = sb->s_blocksize;
 	u64 bytes, lbo, valid;
@@ -569,7 +569,7 @@ static noinline int ntfs_get_block_vbo(struct inode *inode, u64 vbo,
 
 	if (is_resident(ni)) {
 		ni_lock(ni);
-		err = attr_data_read_resident(ni, page);
+		err = attr_data_read_resident(ni, &folio->page);
 		ni_unlock(ni);
 
 		if (!err)
@@ -642,17 +642,17 @@ static noinline int ntfs_get_block_vbo(struct inode *inode, u64 vbo,
 		 */
 		bytes = block_size;
 
-		if (page) {
+		if (folio) {
 			u32 voff = valid - vbo;
 
 			bh->b_size = block_size;
 			off = vbo & (PAGE_SIZE - 1);
-			set_bh_page(bh, page, off);
+			folio_set_bh(bh, folio, off);
 
 			err = bh_read(bh, 0);
 			if (err < 0)
 				goto out;
-			zero_user_segment(page, off + voff, off + block_size);
+			folio_zero_segment(folio, off + voff, off + block_size);
 		}
 	}
 
-- 
2.39.2


^ permalink raw reply related

* [PATCH 1/7] highmem: Add memcpy_to_folio() and memcpy_from_folio()
From: Matthew Wilcox (Oracle) @ 2023-07-13  3:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), David Sterba, linux-fsdevel,
	Pankaj Raghav, Konstantin Komarov, ntfs3, Theodore Tso, Jan Kara,
	linux-ext4
In-Reply-To: <20230713035512.4139457-1-willy@infradead.org>

These are the folio equivalent of memcpy_to_page() and memcpy_from_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/highmem.h | 44 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 68da30625a6c..0280f57d4744 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -439,6 +439,50 @@ static inline void memzero_page(struct page *page, size_t offset, size_t len)
 	kunmap_local(addr);
 }
 
+static inline void memcpy_from_folio(char *to, struct folio *folio,
+		size_t offset, size_t len)
+{
+	VM_BUG_ON(offset + len > folio_size(folio));
+
+	do {
+		char *from = kmap_local_folio(folio, offset);
+		size_t chunk = len;
+
+		if (folio_test_highmem(folio) &&
+		    (chunk > (PAGE_SIZE - offset_in_page(offset))))
+			chunk = PAGE_SIZE - offset_in_page(offset);
+		memcpy(to, from, len);
+		kunmap_local(from);
+
+		from += chunk;
+		offset += chunk;
+		len -= chunk;
+	} while (len > 0);
+}
+
+static inline void memcpy_to_folio(struct folio *folio, size_t offset,
+		const char *from, size_t len)
+{
+	VM_BUG_ON(offset + len > folio_size(folio));
+
+	do {
+		char *to = kmap_local_folio(folio, offset);
+		size_t chunk = len;
+
+		if (folio_test_highmem(folio) &&
+		    (chunk > (PAGE_SIZE - offset_in_page(offset))))
+			chunk = PAGE_SIZE - offset_in_page(offset);
+		memcpy(to, from, len);
+		kunmap_local(to);
+
+		from += chunk;
+		offset += chunk;
+		len -= chunk;
+	} while (len > 0);
+
+	flush_dcache_folio(folio);
+}
+
 /**
  * memcpy_from_file_folio - Copy some bytes from a file folio.
  * @to: The destination buffer.
-- 
2.39.2


^ permalink raw reply related

* [PATCH 6/7] jbd2: Use a folio in jbd2_journal_write_metadata_buffer()
From: Matthew Wilcox (Oracle) @ 2023-07-13  3:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), David Sterba, linux-fsdevel,
	Pankaj Raghav, Konstantin Komarov, ntfs3, Theodore Tso, Jan Kara,
	linux-ext4
In-Reply-To: <20230713035512.4139457-1-willy@infradead.org>

The primary goal here is removing the use of set_bh_page().
Take the opportunity to switch from kmap_atomic() to kmap_local().
This simplifies the function as the offset is already added to
the pointer.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/jbd2/journal.c | 35 ++++++++++++++++-------------------
 1 file changed, 16 insertions(+), 19 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index fbce16fedaa4..1b5a45ab62b0 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -341,7 +341,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 	int do_escape = 0;
 	char *mapped_data;
 	struct buffer_head *new_bh;
-	struct page *new_page;
+	struct folio *new_folio;
 	unsigned int new_offset;
 	struct buffer_head *bh_in = jh2bh(jh_in);
 	journal_t *journal = transaction->t_journal;
@@ -370,14 +370,14 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 	 */
 	if (jh_in->b_frozen_data) {
 		done_copy_out = 1;
-		new_page = virt_to_page(jh_in->b_frozen_data);
-		new_offset = offset_in_page(jh_in->b_frozen_data);
+		new_folio = virt_to_folio(jh_in->b_frozen_data);
+		new_offset = offset_in_folio(new_folio, jh_in->b_frozen_data);
 	} else {
-		new_page = jh2bh(jh_in)->b_page;
-		new_offset = offset_in_page(jh2bh(jh_in)->b_data);
+		new_folio = jh2bh(jh_in)->b_folio;
+		new_offset = offset_in_folio(new_folio, jh2bh(jh_in)->b_data);
 	}
 
-	mapped_data = kmap_atomic(new_page);
+	mapped_data = kmap_local_folio(new_folio, new_offset);
 	/*
 	 * Fire data frozen trigger if data already wasn't frozen.  Do this
 	 * before checking for escaping, as the trigger may modify the magic
@@ -385,18 +385,17 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 	 * data in the buffer.
 	 */
 	if (!done_copy_out)
-		jbd2_buffer_frozen_trigger(jh_in, mapped_data + new_offset,
+		jbd2_buffer_frozen_trigger(jh_in, mapped_data,
 					   jh_in->b_triggers);
 
 	/*
 	 * Check for escaping
 	 */
-	if (*((__be32 *)(mapped_data + new_offset)) ==
-				cpu_to_be32(JBD2_MAGIC_NUMBER)) {
+	if (*((__be32 *)mapped_data) == cpu_to_be32(JBD2_MAGIC_NUMBER)) {
 		need_copy_out = 1;
 		do_escape = 1;
 	}
-	kunmap_atomic(mapped_data);
+	kunmap_local(mapped_data);
 
 	/*
 	 * Do we need to do a data copy?
@@ -417,12 +416,10 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 		}
 
 		jh_in->b_frozen_data = tmp;
-		mapped_data = kmap_atomic(new_page);
-		memcpy(tmp, mapped_data + new_offset, bh_in->b_size);
-		kunmap_atomic(mapped_data);
+		memcpy_from_folio(tmp, new_folio, new_offset, bh_in->b_size);
 
-		new_page = virt_to_page(tmp);
-		new_offset = offset_in_page(tmp);
+		new_folio = virt_to_folio(tmp);
+		new_offset = offset_in_folio(new_folio, tmp);
 		done_copy_out = 1;
 
 		/*
@@ -438,12 +435,12 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 	 * copying, we can finally do so.
 	 */
 	if (do_escape) {
-		mapped_data = kmap_atomic(new_page);
-		*((unsigned int *)(mapped_data + new_offset)) = 0;
-		kunmap_atomic(mapped_data);
+		mapped_data = kmap_local_folio(new_folio, new_offset);
+		*((unsigned int *)mapped_data) = 0;
+		kunmap_local(mapped_data);
 	}
 
-	set_bh_page(new_bh, new_page, new_offset);
+	folio_set_bh(new_bh, new_folio, new_offset);
 	new_bh->b_size = bh_in->b_size;
 	new_bh->b_bdev = journal->j_dev;
 	new_bh->b_blocknr = blocknr;
-- 
2.39.2


^ permalink raw reply related

* [PATCH 7/7] buffer: Remove set_bh_page()
From: Matthew Wilcox (Oracle) @ 2023-07-13  3:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), David Sterba, linux-fsdevel,
	Pankaj Raghav, Konstantin Komarov, ntfs3, Theodore Tso, Jan Kara,
	linux-ext4
In-Reply-To: <20230713035512.4139457-1-willy@infradead.org>

With all users converted to folio_set_bh(), remove this function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/buffer.c                 | 15 ---------------
 include/linux/buffer_head.h |  2 --
 2 files changed, 17 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 587e4d4af9de..f0563ebae75f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1539,21 +1539,6 @@ void invalidate_bh_lrus_cpu(void)
 	bh_lru_unlock();
 }
 
-void set_bh_page(struct buffer_head *bh,
-		struct page *page, unsigned long offset)
-{
-	bh->b_page = page;
-	BUG_ON(offset >= PAGE_SIZE);
-	if (PageHighMem(page))
-		/*
-		 * This catches illegal uses and preserves the offset:
-		 */
-		bh->b_data = (char *)(0 + offset);
-	else
-		bh->b_data = page_address(page) + offset;
-}
-EXPORT_SYMBOL(set_bh_page);
-
 void folio_set_bh(struct buffer_head *bh, struct folio *folio,
 		  unsigned long offset)
 {
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index a7377877ff4e..06566aee94ca 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -194,8 +194,6 @@ void buffer_check_dirty_writeback(struct folio *folio,
 void mark_buffer_dirty(struct buffer_head *bh);
 void mark_buffer_write_io_error(struct buffer_head *bh);
 void touch_buffer(struct buffer_head *bh);
-void set_bh_page(struct buffer_head *bh,
-		struct page *page, unsigned long offset);
 void folio_set_bh(struct buffer_head *bh, struct folio *folio,
 		  unsigned long offset);
 bool try_to_free_buffers(struct folio *);
-- 
2.39.2


^ permalink raw reply related

* [PATCH 0/7] More filesystem folio conversions for 6.6
From: Matthew Wilcox (Oracle) @ 2023-07-13  3:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), David Sterba, linux-fsdevel,
	Pankaj Raghav, Konstantin Komarov, ntfs3, Theodore Tso, Jan Kara,
	linux-ext4

Remove the only spots in affs which actually use a struct page; there
are a few places where one is mentioned, but it's part of the interface.

The rest of this is removing the remaining calls to set_bh_page(),
and then removing the function before any new users show up.

Matthew Wilcox (Oracle) (7):
  highmem: Add memcpy_to_folio() and memcpy_from_folio()
  affs: Convert affs_symlink_read_folio() to use the folio
  affs: Convert data read and write to use folios
  migrate: Use folio_set_bh() instead of set_bh_page()
  ntfs3: Convert ntfs_get_block_vbo() to use a folio
  jbd2: Use a folio in jbd2_journal_write_metadata_buffer()
  buffer: Remove set_bh_page()

 fs/affs/file.c              | 77 ++++++++++++++++++-------------------
 fs/affs/symlink.c           | 12 +++---
 fs/buffer.c                 | 15 --------
 fs/jbd2/journal.c           | 35 ++++++++---------
 fs/ntfs3/inode.c            | 10 ++---
 include/linux/buffer_head.h |  2 -
 include/linux/highmem.h     | 44 +++++++++++++++++++++
 mm/migrate.c                |  2 +-
 8 files changed, 109 insertions(+), 88 deletions(-)

-- 
2.39.2


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox