* EXT4/JBD2 Not Fully Released device after unmount of NVMe-oF Block Device
@ 2025-06-01 11:02 Mitta Sai Chaithanya
  2025-06-01 22:04 ` Theodore Ts'o
  0 siblings, 1 reply; 5+ messages in thread

From: Mitta Sai Chaithanya @ 2025-06-01 11:02 UTC (permalink / raw)
  To: linux-ext4@vger.kernel.org
  Cc: Nilesh Awate, Ganesan Kalyanasundaram, Pawan Sharma

Hi Team,

I'm encountering journal block device (JBD2) errors after unmounting a
device and have been trying to trace the source of these errors. I've
observed that these JBD2 errors only occur if the entries under
/proc/fs/ext4/<device_name> or /proc/fs/jbd2/<device_name> still exist
even after a successful unmount (the unmount command returns success).

For context: the block device (/dev/nvme0n1) is connected over NVMe-oF TCP
to a remote target. I'm confident that no I/O is stuck on the target side,
as there are no related I/O errors or warnings in the kernel logs on the
node where the target is connected. However, the /proc entries mentioned
above remain even after a successful unmount, and this seems to correlate
with the journal-related errors.

I'd like to understand how to debug this issue further to determine the
root cause. Specifically, I'm looking for guidance on what kernel-level
references or subsystems might still be holding on to the journal or
device structures post-unmount, and how to trace or identify them
effectively, or has this been fixed in the latest versions of ext4?

Proc entries exist even after unmount:

root@aks-nodepool1-44537149-vmss000002 [ / ]# ls /proc/fs/ext4/nvme0n1/
es_shrinker_info  fc_info  mb_groups  mb_stats  mb_structs_summary  options
root@aks-nodepool1-44537149-vmss000002 [ / ]# ls /proc/fs/jbd2/nvme0n1-8/
info

Active processes associated with the unmounted device:

root    636845  0.0  0.0  0  0 ?  S   08:43  0:03  [jbd2/nvme0n1-8]
root    636987  0.0  0.0  0  0 ?  I<  08:43  0:00  [dio/nvme0n1]
root    699903  0.0  0.0  0  0 ?  I   09:18  0:01  [kworker/u16:1-nvme-wq]
root    761100  0.0  0.0  0  0 ?  I<  09:50  0:00  [kworker/1:1H-nvme_tcp_wq]
root    763896  0.0  0.0  0  0 ?  I<  09:52  0:00  [kworker/0:0H-nvme_tcp_wq]
root    779007  0.0  0.0  0  0 ?  I<  10:01  0:00  [kworker/0:1H-nvme_tcp_wq]

Stack traces of the processes (after unmount):

root@aks-nodepool1-44537149-vmss000002 [ / ]# cat /proc/636845/stack
[<0>] kjournald2+0x219/0x270
[<0>] kthread+0x12a/0x150
[<0>] ret_from_fork+0x22/0x30
root@aks-nodepool1-44537149-vmss000002 [ / ]# cat /proc/636846/stack
[<0>] rescuer_thread+0x2db/0x3b0
[<0>] kthread+0x12a/0x150
[<0>] ret_from_fork+0x22/0x30
[ / ]# cat /proc/636987/stack
[<0>] rescuer_thread+0x2db/0x3b0
[<0>] kthread+0x12a/0x150
[<0>] ret_from_fork+0x22/0x30
[ / ]# cat /proc/699903/stack
[<0>] worker_thread+0xcd/0x3d0
[<0>] kthread+0x12a/0x150
[<0>] ret_from_fork+0x22/0x30
[ / ]# cat /proc/761100/stack
[<0>] worker_thread+0xcd/0x3d0
[<0>] kthread+0x12a/0x150
[<0>] ret_from_fork+0x22/0x30
[ / ]# cat /proc/763896/stack
[<0>] worker_thread+0xcd/0x3d0
[<0>] kthread+0x12a/0x150
[<0>] ret_from_fork+0x22/0x30
[ / ]# cat /proc/779007/stack
[<0>] worker_thread+0xcd/0x3d0
[<0>] kthread+0x12a/0x150
[<0>] ret_from_fork+0x22/0x30

Kernel logs:

2025-06-01T10:01:11.568304+00:00 aks-nodepool1-44537149-vmss000002 kernel: [30452.346875] nvme nvme0: Failed reconnect attempt 6
2025-06-01T10:01:11.568330+00:00 aks-nodepool1-44537149-vmss000002 kernel: [30452.346881] nvme nvme0: Reconnecting in 10 seconds...
2025-06-01T10:01:21.814134+00:00 aks-nodepool1-44537149-vmss000002 kernel: [30462.596133] nvme nvme0: Connect command failed, error wo/DNR bit: 6
2025-06-01T10:01:21.814165+00:00 aks-nodepool1-44537149-vmss000002 kernel: [30462.596186] nvme nvme0: failed to connect queue: 0 ret=6
2025-06-01T10:01:21.814174+00:00 aks-nodepool1-44537149-vmss000002 kernel: [30462.596289] nvme nvme0: Failed reconnect attempt 7
2025-06-01T10:01:21.814176+00:00 aks-nodepool1-44537149-vmss000002 kernel: [30462.596292] nvme nvme0: Reconnecting in 10 seconds...
2025-06-01T10:01:32.055063+00:00 aks-nodepool1-44537149-vmss000002 kernel: [30472.836929] nvme nvme0: queue_size 128 > ctrl sqsize 64, clamping down
2025-06-01T10:01:32.055094+00:00 aks-nodepool1-44537149-vmss000002 kernel: [30472.837002] nvme nvme0: creating 2 I/O queues.
2025-06-01T10:01:32.108286+00:00 aks-nodepool1-44537149-vmss000002 kernel: [30472.886546] nvme nvme0: mapped 2/0/0 default/read/poll queues.
2025-06-01T10:01:32.108313+00:00 aks-nodepool1-44537149-vmss000002 kernel: [30472.887450] nvme nvme0: Successfully reconnected (8 attempt)

High level information of ext4:

root@aks-nodepool1-44537149-vmss000002 [ / ]# dumpe2fs /dev/nvme0n1
dumpe2fs 1.46.5 (30-Dec-2021)
Filesystem volume name:   <none>
Last mounted on:          /datadir
Filesystem UUID:          1a564b4d-8f34-4f71-8370-802a239e350a
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index FEATURE_C12 filetype needs_recovery extent 64bit flex_bg metadata_csum_seed sparse_super large_file huge_file dir_nlink extra_isize metadata_csum FEATURE_R16
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              655360
Block count:              2620155
Reserved block count:     131007
Overhead clusters:        66747
Free blocks:              454698
Free inodes:              655344
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      1024
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stripe width:        32
Flex block group size:    16
Filesystem created:       Sun Jun 1 08:36:28 2025
Last mount time:          Sun Jun 1 08:43:57 2025
Last write time:          Sun Jun 1 08:43:57 2025
Mount count:              4
Maximum mount count:      -1
Last checked:             Sun Jun 1 08:36:28 2025
Check interval:           0 (<none>)
Lifetime writes:          576 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      22fed392-1993-4796-a996-feab145379ba
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0xea839b0c
Checksum seed:            0x8e742ce9
Journal features:         journal_64bit journal_checksum_v3
Total journal size:       64M
Total journal blocks:     16384
Max transaction length:   16384
Fast commit length:       0
Journal sequence:         0x000002a0
Journal start:            6816
Journal checksum type:    crc32c
Journal checksum:         0xa35736ab

Thanks & Regards,
Sai

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: EXT4/JBD2 Not Fully Released device after unmount of NVMe-oF Block Device
  2025-06-01 11:02 EXT4/JBD2 Not Fully Released device after unmount of NVMe-oF Block Device Mitta Sai Chaithanya
@ 2025-06-01 22:04 ` Theodore Ts'o
  2025-06-02 21:32   ` [EXTERNAL] " Mitta Sai Chaithanya
  0 siblings, 1 reply; 5+ messages in thread

From: Theodore Ts'o @ 2025-06-01 22:04 UTC (permalink / raw)
  To: Mitta Sai Chaithanya
  Cc: linux-ext4@vger.kernel.org, Nilesh Awate, Ganesan Kalyanasundaram, Pawan Sharma

On Sun, Jun 01, 2025 at 11:02:05AM +0000, Mitta Sai Chaithanya wrote:
> Hi Team,
>
> I'm encountering journal block device (JBD2) errors after unmounting
> a device and have been trying to trace the source of
> these errors. I've observed that these JBD2 errors only
> occur if the entries under /proc/fs/ext4/<device_name> or
> /proc/fs/jbd2/<device_name> still exist even after a
> successful unmount (the unmount command returns success).

What you are seeing is I/O errors, not jbd2 errors, i.e.:

> 2025-06-01T10:01:11.568304+00:00 aks-nodepool1-44537149-vmss000002 kernel: [30452.346875] nvme nvme0: Failed reconnect attempt 6

These errors may have been caused by the jbd2 layer issuing I/O
requests, but they are not failures of the jbd2 subsystem. Rather,
_apparently_ ext4/jbd2 is issuing I/Os after the NVMe-oF connection has
been torn down.

It appears that you are assuming that once the umount command/system
call has successfully returned, the kernel file system will be done
sending I/O requests to the block device. This is simply not true.
For example, consider what happens if you do something like:

   # mount /dev/sda1 /mnt
   # mount --bind /mnt /mnt2
   # umount /mnt

The umount command will have returned successfully, but the ext4 file
system is still mounted, thanks to the bind mount.

And it's not just bind mounts. If you have one or more processes in a
different mount namespace (created using clone(2) with the CLONE_NEWNS
flag), then so long as those processes are active, the file system will
stay active regardless of it being unmounted in the original mount
namespace.

Internally, inside the kernel, this is the distinction between the
"struct super" object and the "struct vfsmnt" object. The umount(2)
system call removes the vfsmnt object from a mount namespace object
and decrements the refcount of the vfsmnt object. The "struct super"
object cannot be deleted so long as there is at least one vfsmnt object
pointing at it.

So when you say that /proc/fs/ext4/<device_name> still exists, that is
an indication that the "struct super" for that particular ext4 file
system is still alive, and so of course there can still be ext4 and
jbd2 I/O activity happening.

> I'd like to understand how to debug this issue further to determine
> the root cause. Specifically, I'm looking for guidance on what
> kernel-level references or subsystems might still be holding on to
> the journal or device structures post-unmount, and how to trace or
> identify them effectively, or has this been fixed in the latest
> versions of ext4?

I don't see any evidence of anything "wrong" that requires fixing in
the kernel. It looks like something or someone assumed that the file
system was deactivated after the umount and then tore down the NVMe-oF
TCP connection, even though the file system was still active, resulting
in those errors. But that's not a kernel bug; rather, it's a bug in
some human's understanding of how umount works in the context of bind
mounts and mount namespaces.
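
For what it's worth, the mount-namespace case is easy to reproduce with
util-linux's unshare(1), which sets up a new mount namespace via the
same CLONE_NEWNS mechanism. A rough, illustrative sketch (assuming
unshare's default private propagation, with /dev/sda1 standing in for
the real device):

   # mount /dev/sda1 /mnt
   # unshare --mount sleep 3600 &   # new namespace gets its own copy of the mount tree
   # umount /mnt                    # succeeds in the original namespace
   # ls /proc/fs/ext4/              # sda1 is still listed

Only once the sleep exits (or the mount is also removed inside that
namespace) does the last vfsmnt reference go away, at which point the
VFS tears down the "struct super" and the /proc entries disappear.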
Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [EXTERNAL] Re: EXT4/JBD2 Not Fully Released device after unmount of NVMe-oF Block Device
  2025-06-01 22:04 ` Theodore Ts'o
@ 2025-06-02 21:32   ` Mitta Sai Chaithanya
  2025-06-03  0:29     ` Theodore Ts'o
  0 siblings, 1 reply; 5+ messages in thread

From: Mitta Sai Chaithanya @ 2025-06-02 21:32 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: linux-ext4@vger.kernel.org, Nilesh Awate, Ganesan Kalyanasundaram, Pawan Sharma

Hi Ted,

Thanks for your quick response.

You're right that we use a bind mount; however, I'm certain that we first
unmount the bind mount before unmounting the original mount. I also
checked in a different namespace and couldn't find any reference to the
NVMe device being mounted.

> 2025-06-01T10:01:11.568304+00:00 aks-nodepool1-44537149-vmss000002 kernel: [30452.346875] nvme nvme0: Failed reconnect attempt 6

> Rather, _apparently_ ext4/jbd2 is issuing I/Os after the NVMe-oF
> connection has been torn down.

Yes, I am reproducing the issue and expected to see connection outage
errors for a few seconds (i.e., within the tolerable time frame). However,
after the connection is re-established and the device is unmounted from
all namespaces, I still observe errors from both ext4 and jbd2, especially
when the device is disconnected.

> So when you say that /proc/fs/ext4/<device_name> still exists, that is
> an indication that the "struct super" for that particular ext4 file
> system is still alive, and so of course there can still be ext4 and
> jbd2 I/O activity happening.

So even when no user-space process is holding the device, and it has been
unmounted from all namespaces, mounts, and bind mounts, is there still a
possibility of I/O occurring on the device? If so, how long does the
kernel typically take to flush any remaining I/O operations, whether from
ext4 or jbd2?

Another point I would like to mention: I am observing JBD2 errors
especially after the NVMe-oF device has been disconnected; the logs are
below.

Logs:

[Wed May 14 16:58:50 2025] nvme nvme0: Removing ctrl: NQN "nqn.2019-05.io.openebs:4cde20d8-ed8f-47ef-90c7-8cf9521a5734"
[Wed May 14 16:58:50 2025] Buffer I/O error on dev nvme0n1, logical block 1081344, lost sync page write
[Wed May 14 16:58:50 2025] JBD2: Error -5 detected when updating journal superblock for nvme0n1-8.
[Wed May 14 16:58:50 2025] Aborting journal on device nvme0n1-8.
[Wed May 14 16:58:50 2025] blk_update_request: recoverable transport error, dev nvme0n1, sector 8650752 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[Wed May 14 16:58:50 2025] Buffer I/O error on dev nvme0n1, logical block 1081344, lost sync page write
[Wed May 14 16:58:50 2025] JBD2: Error -5 detected when updating journal superblock for nvme0n1-8.
[Wed May 14 16:58:50 2025] EXT4-fs error (device nvme0n1): ext4_put_super:1205: comm ig: Couldn't clean up the journal
[Wed May 14 16:58:50 2025] blk_update_request: recoverable transport error, dev nvme0n1, sector 0 op 0x1:(WRITE) flags 0x23800 phys_seg 1 prio class 0
[Wed May 14 16:58:50 2025] Buffer I/O error on dev nvme0n1, logical block 0, lost sync page write
[Wed May 14 16:58:50 2025] EXT4-fs (nvme0n1): I/O error while writing superblock
[Wed May 14 16:58:50 2025] EXT4-fs (nvme0n1): Remounting filesystem read-only
[Wed May 14 16:58:50 2025] blk_update_request: recoverable transport error, dev nvme0n1, sector 0 op 0x1:(WRITE) flags 0x23800 phys_seg 1 prio class 0
[Wed May 14 16:58:50 2025] Buffer I/O error on dev nvme0n1, logical block 0, lost sync page write
[Wed May 14 16:58:50 2025] EXT4-fs (nvme0n1): I/O error while writing superblock

Thanks & Regards,
Sai

________________________________________
From: Theodore Ts'o <tytso@mit.edu>
Sent: Monday, June 02, 2025 03:34
To: Mitta Sai Chaithanya <mittas@microsoft.com>
Cc: linux-ext4@vger.kernel.org <linux-ext4@vger.kernel.org>; Nilesh Awate <Nilesh.Awate@microsoft.com>; Ganesan Kalyanasundaram <ganesanka@microsoft.com>; Pawan Sharma <sharmapawan@microsoft.com>
Subject: [EXTERNAL] Re: EXT4/JBD2 Not Fully Released device after unmount of NVMe-oF Block Device
^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [EXTERNAL] Re: EXT4/JBD2 Not Fully Released device after unmount of NVMe-oF Block Device
  2025-06-02 21:32 ` [EXTERNAL] " Mitta Sai Chaithanya
@ 2025-06-03  0:29   ` Theodore Ts'o
  2025-06-03 20:32     ` Andreas Dilger
  0 siblings, 1 reply; 5+ messages in thread

From: Theodore Ts'o @ 2025-06-03 0:29 UTC (permalink / raw)
  To: Mitta Sai Chaithanya
  Cc: linux-ext4@vger.kernel.org, Nilesh Awate, Ganesan Kalyanasundaram, Pawan Sharma

On Mon, Jun 02, 2025 at 09:32:18PM +0000, Mitta Sai Chaithanya wrote:
>
> However, after the connection is re-established and the device is
> unmounted from all namespaces, I still observe errors from both ext4
> and jbd2, especially when the device is disconnected.

How do you *know* that you've unmounted the device in all namespaces?
I seem to recall that some process (I think one of the systemd daemons,
but I could be wrong) was creating a namespace that users were not
expecting, resulting in the device staying mounted when users were not
expecting it.

The fact that /proc/fs/ext4/<device_name> still exists means that the
kernel (specifically, the VFS layer) doesn't think that the file system
can be shut down. As a result, the VFS layer has not called ext4's
put_super() and kill_sb() methods. And so yes, I/O activity can still
happen, because the file system has not been shut down.

If you still see /proc/fs/ext4/<device_name>, my suggestion would be to
grep /proc/*/mounts, looking to see which processes have a namespace
which still has the device mounted. I suspect that you will see that
there is some namespace that you weren't aware of that is keeping the
ext4 struct super object pinned and alive.

> Another point I would like to mention: I am observing JBD2 errors
> especially after the NVMe-oF device has been disconnected; the logs
> are below.

Sure, but that's the effect, not the cause, of the NVMe-oF device
getting ripped down while the file system is still active. Which I am
99.997% sure is because it is still mounted in some namespace. The
other 0.003% chance is that there is some refcount problem in the VFS
subsystem, and I would suggest that you ask Microsoft's VFS experts
(such as Christian Brauner, who is one of the VFS maintainers) to take
a look. I very much doubt it is a kernel bug, though.

						- Ted

^ permalink raw reply	[flat|nested] 5+ messages in thread
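
A concrete form of that /proc scan might look like the following rough
sketch (nvme0n1 is the device name from the report above; <pid> stands
for any process ID the first command turns up):

   # grep -l nvme0n1 /proc/[0-9]*/mounts 2>/dev/null
   # readlink /proc/<pid>/ns/mnt        # mount namespace of a matching PID
   # cat /proc/<pid>/comm               # what that process is

Any PID that shows up still has the filesystem mounted in its mount
namespace, which is what keeps the ext4 "struct super" pinned and the
/proc/fs/ext4 and /proc/fs/jbd2 entries alive.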
* Re: [EXTERNAL] Re: EXT4/JBD2 Not Fully Released device after unmount of NVMe-oF Block Device
  2025-06-03  0:29 ` Theodore Ts'o
@ 2025-06-03 20:32   ` Andreas Dilger
  0 siblings, 0 replies; 5+ messages in thread

From: Andreas Dilger @ 2025-06-03 20:32 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Mitta Sai Chaithanya, linux-ext4@vger.kernel.org, Nilesh Awate,
      Ganesan Kalyanasundaram, Pawan Sharma

[-- Attachment #1: Type: text/plain, Size: 2383 bytes --]

> On Jun 2, 2025, at 6:29 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Mon, Jun 02, 2025 at 09:32:18PM +0000, Mitta Sai Chaithanya wrote:
>
>> However, after the connection is re-established and the device is
>> unmounted from all namespaces, I still observe errors from both ext4
>> and jbd2, especially when the device is disconnected.
>
> How do you *know* that you've unmounted the device in all namespaces?
> I seem to recall that some process (I think one of the systemd
> daemons, but I could be wrong) was creating a namespace that users
> were not expecting, resulting in the device staying mounted when
> users were not expecting it.
>
> The fact that /proc/fs/ext4/<device_name> still exists means that the
> kernel (specifically, the VFS layer) doesn't think that the file
> system can be shut down. As a result, the VFS layer has not called
> ext4's put_super() and kill_sb() methods. And so yes, I/O activity
> can still happen, because the file system has not been shut down.
>
> If you still see /proc/fs/ext4/<device_name>, my suggestion would be
> to grep /proc/*/mounts, looking to see which processes have a
> namespace which still has the device mounted. I suspect that you
> will see that there is some namespace that you weren't aware of that
> is keeping the ext4 struct super object pinned and alive.
>
>> Another point I would like to mention: I am observing JBD2 errors
>> especially after the NVMe-oF device has been disconnected; the logs
>> are below.
>
> Sure, but that's the effect, not the cause, of the NVMe-oF device
> getting ripped down while the file system is still active. Which I am
> 99.997% sure is because it is still mounted in some namespace. The
> other 0.003% chance is that there is some refcount problem in the VFS
> subsystem, and I would suggest that you ask Microsoft's VFS experts
> (such as Christian Brauner, who is one of the VFS maintainers) to
> take a look. I very much doubt it is a kernel bug, though.

We've definitely seen similar situations with filesystem mounts inside
of a namespace keeping the mountpoint busy. Adding debugging in
ext4_put_super(), printing the process name if current->comm != "umount",
showed monitoring tools running in the container that held open
references on the mountpoint until they exited and closed their files.

Cheers, Andreas

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-06-03 20:32 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-01 11:02 EXT4/JBD2 Not Fully Released device after unmount of NVMe-oF Block Device Mitta Sai Chaithanya
2025-06-01 22:04 ` Theodore Ts'o
2025-06-02 21:32   ` [EXTERNAL] " Mitta Sai Chaithanya
2025-06-03  0:29     ` Theodore Ts'o
2025-06-03 20:32       ` Andreas Dilger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).