* Issue with ext4 filesystem corruption when writing to a file after disk exhaustion @ 2025-07-11 3:20 Jiany Wu 2025-07-11 5:29 ` Theodore Ts'o 0 siblings, 1 reply; 10+ messages in thread From: Jiany Wu @ 2025-07-11 3:20 UTC (permalink / raw) To: yi.zhang, jack, linux-ext4 Hello, Recently I encountered an issue in kernel 6.1.123, when writing to a file after disk exhaustion, it will report EFSCORRUPTED. I think it is un-expected behavior. Could you help clarify: 1. Why writing to file after disk exhaust will cause "Error while async write back metadata"? Assume it might be inode or block metadata is corrupted there? 2. Why would the file system corrupt, like "Aborting journal on device"? Thank you~ Detailed reproduction steps are: # 1. Create ext4 file system in mydisk root@testbed:/tmp# touch mydisk root@testbed:/tmp# ls -l mydisk -rw-r--r-- 1 root root 0 Jul 8 05:36 mydisk root@testbed:/tmp# truncate -s 128M mydisk root@testbed:/tmp# ls -lh mydisk -rw-r--r-- 1 root root 128M Jul 8 05:36 mydisk root@testbed:/tmp# mkfs.ext4 mydisk mke2fs 1.47.0 (5-Feb-2023) Discarding device blocks: done Creating filesystem with 131072 1k blocks and 32768 inodes Filesystem UUID: b0b12002-d497-436e-b89d-d0e02f53b46d Superblock backups stored on blocks: 8193, 24577, 40961, 57345, 73729 Allocating group tables: done Writing inode tables: done Creating journal (4096 blocks): done Writing superblocks and filesystem accounting information: done # 2. Mount mydisk to /mnt/test_fs root@testbed:/tmp# mkdir /mnt/test_fs root@testbed:/tmp# mount mydisk /mnt/test_fs/ root@testbed:/tmp# findmnt /mnt/test_fs TARGET SOURCE FSTYPE OPTIONS /mnt/test_fs /dev/loop2 ext4 rw,relatime root@testbed:/mnt/test_fs# file /tmp/mydisk /tmp/mydisk: Linux rev 1.0 ext4 filesystem data, UUID=b0b12002-d497-436e-b89d-d0e02f53b46d (needs journal recovery) (extents) (64bit) (large files) (huge files) # 3. 
Exhaust disk in /mnt/test_fs with 32G test_file root@testbed:/mnt/test_fs# fallocate -l 32716560K /mnt/test_fs/test_file fallocate: fallocate failed: No space left on device root@testbed:/mnt/test_fs# ls lost+found test_file root@testbed:/mnt/test_fs# journalctl -f Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 9178112, length 1024. Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024. Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024. Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024. Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024. Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024. Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024. Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024. Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024. Jul 08 05:43:07 testbed kernel: I/O error, dev loop2, sector 17926 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 2 Jul 08 05:43:07 testbed kernel: Buffer I/O error on dev loop2, logical block 8963, lost async page write Jul 08 05:43:07 testbed kernel: I/O error, dev loop2, sector 518 op 0x1:(WRITE) flags 0x103000 phys_seg 17 prio class 2 Jul 08 05:43:07 testbed kernel: Buffer I/O error on dev loop2, logical block 259, lost async page write Jul 08 05:43:07 testbed kernel: Buffer I/O error on dev loop2, logical block 260, lost async page write Jul 08 05:43:07 testbed kernel: Buffer I/O error on dev loop2, logical block 261, lost async page write Jul 08 05:43:07 testbed kernel: Buffer I/O error on dev loop2, logical block 262, lost async page write # 4. Write to /mnt/test_fs/file.dat with dd cmd, I/O error appears. 
root@testbed:/mnt/test_fs# dd if=/dev/zero of=/mnt/test_fs/file.dat bs=1M count=64 root@testbed:/mnt/test_fs# journalctl -f Jul 08 05:49:24 testbed kernel: Buffer I/O error on dev loop2, logical block 268, lost async page write Jul 08 05:49:26 testbed kernel: EXT4-fs error (device loop2): ext4_check_bdev_write_error:217: comm dd: Error while async write back metadata Jul 08 05:49:26 testbed kernel: I/O error, dev loop2, sector 20482 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 2 # Can see First error function is ext4_check_bdev_write_error. root@testbed:/mnt/test_fs# dumpe2fs -h /dev/loop2 dumpe2fs 1.47.0 (5-Feb-2023) Filesystem volume name: <none> Last mounted on: /mnt/test_fs Filesystem UUID: b0b12002-d497-436e-b89d-d0e02f53b46d Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum Filesystem flags: signed_directory_hash Default mount options: user_xattr acl Filesystem state: clean with errors Errors behavior: Continue Filesystem OS type: Linux Inode count: 32768 Block count: 131072 Reserved block count: 6553 Overhead clusters: 13869 Free blocks: 42236 Free inodes: 32754 First block: 1 Block size: 1024 Fragment size: 1024 Group descriptor size: 64 Reserved GDT blocks: 256 Blocks per group: 8192 Fragments per group: 8192 Inodes per group: 2048 Inode blocks per group: 512 Flex block group size: 16 Filesystem created: Tue Jul 8 05:37:11 2025 Last mount time: Tue Jul 8 05:37:37 2025 Last write time: Tue Jul 8 05:50:36 2025 Mount count: 1 Maximum mount count: -1 Last checked: Tue Jul 8 05:37:11 2025 Check interval: 0 (<none>) Lifetime writes: 74 MB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 32 Desired extra isize: 32 Journal inode: 8 Default directory hash: half_md4 Directory Hash Seed: 
bf9009a6-ff19-41d5-8abc-a4d9cd65eeb4 Journal backup: inode blocks FS Error count: 4 First error time: Tue Jul 8 05:45:22 2025 First error function: ext4_check_bdev_write_error First error line #: 217 First error err: EIO Last error time: Tue Jul 8 05:50:36 2025 Last error function: ext4_check_bdev_write_error Last error line #: 217 Last error err: EIO Checksum type: crc32c Checksum: 0x0583faaa Journal features: journal_incompat_revoke journal_64bit journal_checksum_v3 Total journal size: 4096k Total journal blocks: 4096 Max transaction length: 4096 Fast commit length: 0 Journal sequence: 0x00000002 Journal start: 1 Journal checksum type: crc32c Journal checksum: 0xebf7b874 # 5. unmount the filesystem, file system became read-only, result show EFSCORRUPTED Jul 08 06:44:17 testbed kernel: EXT4-fs (loop2): unmounting filesystem. Jul 08 06:44:17 testbed kernel: Aborting journal on device loop2-8. Jul 08 06:44:17 testbed kernel: EXT4-fs error (device loop2): ext4_put_super:1232: comm umount: Couldn't clean up the journal Jul 08 06:44:17 testbed kernel: EXT4-fs (loop2): Remounting filesystem read-only root@testbed:/tmp# dumpe2fs -h /dev/loop2 ... FS Error count: 9 First error time: Tue Jul 8 05:45:22 2025 First error function: ext4_check_bdev_write_error First error line #: 217 First error err: EIO Last error time: Tue Jul 8 06:46:30 2025 Last error function: ext4_validate_block_bitmap Last error line #: 420 Last error err: EFSCORRUPTED ... Thank you~ Best regards, Jianyue Wu ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-11  3:20 Issue with ext4 filesystem corruption when writing to a file after disk exhaustion Jiany Wu
@ 2025-07-11  5:29 ` Theodore Ts'o
  2025-07-11  9:56 ` Jiany Wu
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2025-07-11  5:29 UTC (permalink / raw)
  To: Jiany Wu; +Cc: yi.zhang, jack, linux-ext4

On Fri, Jul 11, 2025 at 11:20:32AM +0800, Jiany Wu wrote:
> Hello,
>
> Recently I encountered an issue in kernel 6.1.123, when writing to a
> file after disk exhaustion, it will report EFSCORRUPTED. I think it is
> un-expected behavior.

What you did was create a file system in /tmp/mydisk by creating a sparse image file:

> root@testbed:/tmp# touch mydisk
> root@testbed:/tmp# ls -l mydisk
> -rw-r--r-- 1 root root 0 Jul 8 05:36 mydisk
> root@testbed:/tmp# truncate -s 128M mydisk
> root@testbed:/tmp# mkfs.ext4 mydisk

The potential problem is that this assumes /tmp had enough free space to hold 128M of data. But it's clear that it didn't. So not only did you exhaust the space in the file system, you *also* exhausted the space in /tmp. You can see this from the I/O errors when writing to /dev/loop2:

> root@testbed:/tmp# mount mydisk /mnt/test_fs/
> root@testbed:/tmp# findmnt /mnt/test_fs
> TARGET SOURCE FSTYPE OPTIONS
> /mnt/test_fs /dev/loop2 ext4 rw,relatime
> ...
> root@testbed:/mnt/test_fs# fallocate -l 32716560K /mnt/test_fs/test_file
> fallocate: fallocate failed: No space left on device
> root@testbed:/mnt/test_fs# journalctl -f
> Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 9178112, length 1024.
> Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024.

These error messages are write errors on /dev/loop2, which were almost certainly caused by ENOSPC errors when trying to write to /tmp/mydisk.
This is the moral equivalent of buying a fraudulent USB thumb drive from the back alleys of Shenzhen, where the USB thumb drive was *labelled* as having 128MB of storage, but which only had 16MB of flash, such that writes after the first 16MB would fail (or overwrite other disk blocks).

If /tmp had enough space, then you wouldn't have seen these errors. One alternative way you could create the image would have been to replace

> root@testbed:/tmp# touch mydisk
> root@testbed:/tmp# ls -l mydisk
> -rw-r--r-- 1 root root 0 Jul 8 05:36 mydisk
> root@testbed:/tmp# truncate -s 128M mydisk

with:

root@testbed:/tmp# dd if=/dev/zero of=mydisk bs=1M count=128

This allocates 128MB to /tmp/mydisk, and if there isn't enough space in /tmp, the dd will fail with an error. If it succeeds, then when you create the file system and mount it, you won't see the error messages writing to /dev/loopN.

The bottom line is that the bug is a PEBCAK ("problem exists between chair and keyboard"), which is another way of saying that it's a failure of the system administrator to understand that they had done something bad. It is not a kernel bug, but rather a bug in your procedure / system setup.

Cheers,

- Ted
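A quick way to see the difference between the truncate and dd approaches is to compare each image's apparent size with the space actually allocated to it. The following is a minimal sketch rather than anything posted in the thread: it assumes GNU coreutils (truncate, stat -c, dd with status=none), works in a scratch directory from mktemp, and the helper names apparent_bytes and allocated_bytes are made up for illustration. The images are 8M instead of 128M only to keep it quick.

```shell
#!/bin/sh
# Sketch: sparse image (truncate) vs. fully written image (dd).
# apparent_bytes and allocated_bytes are illustrative helper names.

# Size the file claims to have (its i_size).
apparent_bytes() { stat -c %s "$1"; }

# Space actually allocated: block count times the unit stat reports it in.
allocated_bytes() { echo $(( $(stat -c '%b * %B' "$1") )); }

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
sparse="$workdir/sparse.img"
full="$workdir/full.img"

# truncate only sets the size; no data blocks are written.
truncate -s 8M "$sparse"
# dd writes every block, so a full host shows up as a dd error here.
dd if=/dev/zero of="$full" bs=1M count=8 status=none

for img in "$sparse" "$full"; do
    printf '%s apparent=%s allocated=%s\n' \
        "${img##*/}" "$(apparent_bytes "$img")" "$(allocated_bytes "$img")"
done
```

Both files should report the same apparent size. On most file systems the sparse image is expected to show an allocated size near zero and the dd image roughly 8M, although the exact figures depend on the host file system.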
* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion 2025-07-11 5:29 ` Theodore Ts'o @ 2025-07-11 9:56 ` Jiany Wu 2025-07-11 15:40 ` Theodore Ts'o 0 siblings, 1 reply; 10+ messages in thread From: Jiany Wu @ 2025-07-11 9:56 UTC (permalink / raw) To: Theodore Ts'o; +Cc: yi.zhang, jack, linux-ext4 Hello, Ted, Thanks indeed for the help, really appreciated! BTW, is it proper to fallocate whole disk space to exhaust disk? I see even fallocate full disk size, seems file size equal to avail size still can be allocated. i.e. When /tmp availability space is 26G, but fallocate requests 32G (total disk space), we see it finally allocated a 26G file, but exit code is 1. Is it legal usage or will it trigger some unknown issue? I'm a newbie on fallocate:) Best regards, Jianyue Wu ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-11  9:56 ` Jiany Wu
@ 2025-07-11 15:40 ` Theodore Ts'o
  2025-07-12  4:27 ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2025-07-11 15:40 UTC (permalink / raw)
  To: Jiany Wu; +Cc: yi.zhang, jack, linux-ext4

On Fri, Jul 11, 2025 at 05:56:18PM +0800, Jiany Wu wrote:
> Hello, Ted,
>
> Thanks indeed for the help, really appreciated!
> BTW, is it proper to fallocate whole disk space to exhaust disk?

I'm not sure what you mean by "proper". It depends on what you are trying to do, I suppose.

The other thing here is I think you are seriously confusing yourself (and others) by using a loopback file image which is mounted. That's because now you need to worry about failures at two levels: at the level of the storage device containing the image (e.g., /tmp for the image /tmp/mydisk) and at the level of the loopback file system (e.g., /mnt/test_fs when /tmp/mydisk is mounted on top of /mnt/test_fs). You could potentially have ENOSPC errors at either level.

> I see even fallocate full disk size, seems file size equal to avail
> size still can be allocated.
> i.e. When /tmp availability space is 26G, but fallocate requests 32G
> (total disk space), we see it finally allocated a 26G file, but exit
> code is 1.

You're being ambiguous here. When you say "full disk size", which level are you talking about? /tmp or /mnt/test_fs? And when you fallocate, which one are you fallocating?

What I would recommend is to fallocate *first* at the /tmp/mydisk level. So do this:

# fallocate -l 32G /tmp/mydisk
# mkfs.ext4 /tmp/mydisk

If /tmp only has 26GB of free space, then the fallocate will fail --- but that's fine. That tells you that you don't have enough free space to fully allocate the file system image.
So *stop*, and do this somewhere you have enough free space:

# fallocate -l 32G /mnt/huge-10TB-disk-with-lots-of-free-space/mydisk
# mkfs.ext4 /mnt/huge-10TB-disk-with-lots-of-free-space/mydisk

Now you know that no matter what, when you mount mydisk, you don't need to worry about I/O errors when writing to mydisk. And you can proceed with your experimentation.

Now, what if you don't have that huge 10TB disk? Can you use /tmp/mydisk to create a 32GB file system even though /tmp only has 26GB of free space? You *can*, but you need to be careful, because when you start writing to the mounted file system, you will eventually run out of space in /tmp.

For example:

% cp /dev/null /tmp/test.img
% ls -lsh /tmp/test.img
0 -rw-r--r-- 1 tytso tytso 0 Jul 11 11:19 /tmp/test.img
% mkfs.ext4 -q /tmp/test.img 32G
% ls -lsh /tmp/test.img
6.4M -rw-r--r-- 1 tytso tytso 32G Jul 11 11:19 /tmp/test.img

So you can see here that we have created a test file system which is 32 GiB in size, but so far, the actual amount of *space* consumed in /tmp is 6.4 MiB. The i_size of the file is 32 GiB, but it is a sparse file, which means not all of the blocks between logical offset 0 and 32 GiB have been allocated.

Now, if we mount the file system, as we start writing into it, we will allocate space in /tmp. Now, the way fallocate works in the mounted file system is that it guarantees space in the file system, but it won't write the data blocks, so the space consumed in /tmp by /tmp/test.img will grow only by the space needed to update the metadata blocks in the file system contained in /tmp/test.img.
% sudo mount /tmp/test.img /mnt/test
% df -h /mnt/test
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop1       32G  2.1M   30G   1% /mnt/test
% sudo fallocate -l 16G /mnt/test/testfile
% df -h /mnt/test
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop1       32G   17G   14G  55% /mnt/test
% ls -lsh /mnt/test/testfile
17G -rw-r--r-- 1 root root 16G Jul 11 11:25 /mnt/test/testfile
% ls -lsh /tmp/test.img
7.5M -rw-r--r-- 1 tytso tytso 32G Jul 11 11:25 /tmp/test.img

So here, we fallocated 16GB in the file system in /tmp/test.img. You can see that it created a file which is 16GB in size, but which is a bit more than 16GB once you include the metadata blocks for /mnt/test/testfile. That's why the space used is 17GB (the ls program rounded up) but the i_size is 16GB.

*But* the space consumed by the file /tmp/test.img only went up from 6.4 MiB to 7.5 MiB. That's because although we reserved space in the file system /tmp/test.img, we didn't reserve any space in /tmp.

This is working as intended; and if what you are doing is "thin provisioning", this is a feature, not a bug.

But what it means is that if /tmp only has 26GB of space, eventually if you keep writing to the file system in /tmp/test.img, there will be block level errors in the loop device when /tmp runs out of space. That was what you saw in your original example:

Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024.

As soon as there are block I/O errors in the underlying file system, all bets are off. There could be data loss (if you had been writing to a data block when /tmp ran out of space) or file system corruption (if the kernel had been trying to write a metadata block when /tmp ran out of space), or possibly both. So as soon as you see block I/O errors, don't assume that the file system is unscathed, because you probably *will* have lost data or have a corrupted file system.

> Is it legal usage or will it trigger some unknown issue?
> I'm a newbie on fallocate:)

So it's *legal* to do thin provisioning; if you are trying to test a very large file system, and you don't have enough space, then you might not have a choice. Or if you are trying to be more efficient, it might allow you to let users *think* they have more space than you have actually purchased, since very often, users don't actually use the space; they just want to feel good that they have it.

And if you have a large number of users, thin provisioning might make sense because it saves money. But it's much like a bank which has lent out money that depositors have on deposit, relying on the fact that it is very rare that all of the depositors will suddenly show up and withdraw all of their money at the same time. If that happens, then you have a run on the bank, and there could be civil unrest, and things get ugly. Which is why after bank runs, government regulators will demand that banks keep more money on reserve, which lowers their profits and makes the bank's shareholders sad --- but better that than angry bank customers. :-)

So if you know what you are doing, it *can* work. But it might trigger an issue which is unknown/unexpected for you, even though for someone who understands how things work it makes perfect sense and is the system working as designed.

If you have lots of disk space, then just use fallocate to allocate space for /tmp/mydisk, and then you can use fallocate to allocate space for the file system contained in /tmp/mydisk. And it will all work, but it will require more disk space to be available in /tmp.

Cheers,

- Ted
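The bank-run risk described above can be estimated for a single image by comparing how far the image could still grow with how much free space its host file system has right now. This is a rough sketch, not something from the thread: it assumes GNU stat and GNU df (for --output and -B1), and the function names unbacked_bytes, host_free_bytes and check_image are hypothetical. It is only a snapshot; other writers on the host file system can change the answer at any time.

```shell
#!/bin/sh
# Sketch: is a (possibly sparse) image file overcommitted on its host file system?
# All function names here are illustrative.

# Bytes the image would still need if every block in it were eventually written.
unbacked_bytes() {
    size=$(stat -c %s "$1")
    alloc=$(( $(stat -c '%b * %B' "$1") ))
    if [ "$alloc" -ge "$size" ]; then
        echo 0
    else
        echo $(( size - alloc ))
    fi
}

# Available bytes, as df reports them, on the file system holding the image.
host_free_bytes() {
    df --output=avail -B1 "$1" | tail -n 1 | tr -d ' '
}

# Prints a verdict and returns non-zero when the image could outgrow its host.
check_image() {
    need=$(unbacked_bytes "$1")
    free=$(host_free_bytes "$1")
    if [ "$need" -gt "$free" ]; then
        echo "overcommitted: $1 may need $need more bytes, host has $free free"
        return 1
    fi
    echo "ok: $1 may need $need more bytes, host has $free free"
}

# Example: check_image /tmp/test.img
```

For a fully preallocated image unbacked_bytes should come out as 0, so the check passes regardless of how full the host is.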
* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-11 15:40 ` Theodore Ts'o
@ 2025-07-12  4:27 ` Darrick J. Wong
  2025-07-12 14:34 ` Theodore Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2025-07-12  4:27 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jiany Wu, yi.zhang, jack, linux-ext4

On Fri, Jul 11, 2025 at 11:40:12AM -0400, Theodore Ts'o wrote:
> On Fri, Jul 11, 2025 at 05:56:18PM +0800, Jiany Wu wrote:
> > Hello, Ted,
> >
> > Thanks indeed for the help, really appreciated!
> > BTW, is it proper to fallocate whole disk space to exhaust disk?
>
> I'm not sure what do you mean by "proper". It depends on what you are
> trying to do, I suppose.
>
> The other thing here is I think you are seriously confusing yourself
> (and others) by using a loopback file image which si mounted. That's
> because now you need worry about failures at two levels; at the level
> of the storage device containing the image (e.g., /tmp for the image
> /tmp/mydisk) and the loopback file system (e.g., /mnt/tmp when
> /tmp/mydisk is mounted on top of /mnt/tmp). You could potentially
> have ENOSPC errors at either level.

Honestly it's really too bad that there's no way for an fs to ask the block device how much space it thinks is available, and then teach its own statfs method to return min(fs space available, bdev space available).

Then at least df could report that your 500T ramdisk filesystem on a 4G /tmp really only has 4G of space available.

> > I see even fallocate full disk size, seems file size equal to avail
> > size still can be allocated.
> > i.e. When /tmp availability space is 26G, but fallocate requests 32G
> > (total disk space), we see it finally allocated a 26G file, but exit
> > code is 1.
>
> You're being ambiguous here. When you say "full disk sice", which
> level are you talking about? /tmp or /mnt/test? And when you
> fallocate, which are you fallocating.
> > What I would recommend is to fallocate *first* at the /mnt/mydisk > level. So do this: > > # fallocate -l 32G /tmp/mydisk > # mkfs.ext4 /tmp/mydisk > > If /tmp only has 26GB of free space, then the fallocate will fail --- > but that's fine. That tells you that you don't have enough free space > to fully allocate the file system image. So *stop*, and do this > somewhere you have enough free space: > > # fallocate -l 32G /mnt/huge-10TB-disk-with-lots-of-free-space/mydisk > # mkfs.ext4 /mnt/huge-10TB-disk-with-lots-of-free-space/mydisk > > Now you know that no matter what, when you mount mydisk, you don't > need to worry about I/O errors when writing to mydisk. And you can > proceed with your experimentation. > > > Now, what if you don't have that huge 10TB disk. Can you use > /tmp/mydisk to create a 32TB file system even though /tmp only has > 26GB of free space. You *can*. but you need to be careful, because > eventually when you start writing to the mounted file system, you will > eventually run out of space in /tmp. > > For example: > > % cp /dev/null /tmp/test.img > % ls -lsh /tmp/test.img > 0 -rw-r--r-- 1 tytso tytso 0 Jul 11 11:19 /tmp/test.img > % mkfs.ext4 -q /tmp/test.img 32G > % ls -lsh /tmp/test.img > 6.4M -rw-r--r-- 1 tytso tytso 32G Jul 11 11:19 /tmp/test.img > > So you can see here that we have created a test file system which is > 32 GiB in size, but so far, the actual amount of *space* consumed in > /tmp is 6.4 MiB. The i_size of the file is 32 GiB, but it is a sparse > file, which means not all of the blocks between logical offset 0 and > 32 GiB have been allocated. > > Now, if we mount the file system, as we start writing into the file, > we will allocate space in /tmp. 
Now, the way fallocate works in the > mounted file system is that it guarantees space in the file system, > but it won't write the data blocks, so space confusmed in /tmp by > /tmp/mydisk will grow only by the space needed when we updated the > metadata blocks in the file system contained in /tmp/mydisk. > > % sudo mount /tmp/test.img /mnt/test > % df -h /mnt/test > Filesystem Size Used Avail Use% Mounted on > /dev/loop1 32G 2.1M 30G 1% /mnt/test > % sudo fallocate -l 16G /mnt/test/testfile > 1093% df -h /mnt/test > Filesystem Size Used Avail Use% Mounted on > /dev/loop1 32G 17G 14G 55% /mnt/test > % ls -lsh /mnt/test/testfile > 17G -rw-r--r-- 1 root root 16G Jul 11 11:25 /mnt/test/testfile > % ls -lsh /tmp/test.img > 7.5M -rw-r--r-- 1 tytso tytso 32G Jul 11 11:25 /tmp/test.img > > So here, we fallocated 16GB in the file system in /tmp/test.img. You > can see that it created a file which is 16GB in size, but which is a > bit more than 16GB once you include the metadata blocks for > /tmp/test/testfile. That's why the space used is 17GB (the ls program > rounded up) but the i_size is 16GB. > > *But* the space consumed by the file /tmp/test.img only went up from > 6.4 MiB to 7.5 MiB. That's because although we reserved space in the > file system /tmp/test.img, we didn't reserve any space in /tmp. > > This is working as intended; and if what you are doing is "thin > provisioning", this is a feature, but a bug. > > But what it means is that if /tmp only has 26GB of space, eventually > if you keep writing to /tmp/test.img, there will be block level errors > in the loop device when /tmp runs out of space. That was what you saw > in your original example: > > Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024. > > As soon as there are block I/O errors in the underlying file system, > all bets are off. 
There could be data loss (if you had been writing > to a data block when /tmp ran out of space) or file system corruption > (if the kernel had been trying to write a metadata block when /tmp ran > out of space), or possibly both. So as soon as you see block I/O > errors, don't assume that file system is unscathed, because you > probably *will* have lost data or have a corrupted file sytem. > > > Is it legal usage or will it trigger some unknown issue? I'm a newbie > > on fallocate:) > > So it's *legal* to do thin provisioning; if you are trying to test a > very large file system, and you don't have enough space, then you > might not have a choice. Or if you are trying to be more efficient, > it mgiht allow you to allow users to *think* they have more space than > you actually have purchased, since very often, users don't artually > use; they just want to feel good that they have the space. > > And if you have a large number of users, thin provisioning might make > sense because it saves money. But it's much like a bank which has > lent out money that depositors have on deposit, relying on the fact > that it is very rare that all of the depositors will suddenly show up > and withdraw all of their money all at the same time. If that > happens, then you have a run on the bank, and there could be civil > unrest, and things get ugly. Which is why after bank runs, government > regulartors will demand that banks keep more money on reserve, which > lowers their profits and makes the bank's shareholders sad --- but > better that than angry bank customers. :-) LOL SVB :( --D > So if you know what you are doing, it *can* work. But it might > trigger an issue which is unknown/unexpected for you, even though for > soemone who understands how things work it makes perfect sense and is > the system working as designed. 
> > If you have lots of disk space, then just use fallocate to allocate > space for /tmp/mydisk, and then you can use fallocate to allocate > space for the file system contained in /tmp/mydisk. And it will all > work, but it will require more disk space to be available in /tmp. > > Cheers, > > - Ted > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion 2025-07-12 4:27 ` Darrick J. Wong @ 2025-07-12 14:34 ` Theodore Ts'o 2025-07-14 4:37 ` Jiany Wu 0 siblings, 1 reply; 10+ messages in thread From: Theodore Ts'o @ 2025-07-12 14:34 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Jiany Wu, yi.zhang, jack, linux-ext4 On Fri, Jul 11, 2025 at 09:27:14PM -0700, Darrick J. Wong wrote: > > Honestly it's really too bad that there's no way for an fs to ask the > block device how much space it thinks is available, and then teach its > own statfs method to return min(fs space available, bdev space > availble). > > Then at least df could report that your 500T ramdisk filesystem on a 4G > /tmp really only has 4G of space available. I think it would be better if there was an extra field in the statfs structure that reported bdev space available, and have it show up as an extra (optional) column in the df report. The problem is that bdev space available could be highly variable. For example, suppose you had a few thousand users all sharing thinly provisioned space. If a whole bunch of users suddenly all start using space, the available space at the storage layer could suddenly plummet. And if the available space starts getting low, this might trigger automated, central fstrims on all of the volumes, causing the free space to go back up. Having the free space on a file system as reported by df go up and down randomly would very likely cause users to get very confused and upset, especially when it wasn't under their control. Even for a single user system the free space in tmpfs could go down suddenly when some huge process suddenly started, and then go up suddenly when that process gets OOM-killed. :-) - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
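Until something like this exists in statfs, a script can approximate both ideas from user space for the loop-mount case. The sketch below is only an illustration under assumptions: it relies on findmnt and losetup from util-linux and on GNU df, it only makes sense when the mount's source is a loop device, and the function names (avail_bytes, backing_file, min_bytes, report) are hypothetical. The host figure is the free space on the file system holding the image, which is at best a stand-in for the "bdev space available" discussed above.

```shell
#!/bin/sh
# Sketch: report fs-level and host-level free space for a loop-mounted image.
# All function names here are illustrative.

# Available bytes, as df reports them, for the file system containing path $1.
avail_bytes() { df --output=avail -B1 "$1" | tail -n 1 | tr -d ' '; }

# Backing file of the loop device mounted at $1 (assumes a loop mount).
backing_file() {
    dev=$(findmnt -n -o SOURCE --target "$1")
    losetup -l -n -O BACK-FILE "$dev"
}

# Smaller of two byte counts.
min_bytes() {
    if [ "$1" -le "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Print both figures side by side, plus their minimum.
report() {
    fs=$(avail_bytes "$1")
    host=$(avail_bytes "$(backing_file "$1")")
    printf 'fs_avail=%s host_avail=%s min=%s\n' \
        "$fs" "$host" "$(min_bytes "$fs" "$host")"
}

# Example: report /mnt/test
```

Printing the two figures separately mirrors the extra-column suggestion, while the min value mirrors the earlier proposal. The minimum is conservative for an image file, since blocks that are already allocated in the image need no new space on the host.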
* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion 2025-07-12 14:34 ` Theodore Ts'o @ 2025-07-14 4:37 ` Jiany Wu 2025-07-14 13:09 ` Theodore Ts'o 0 siblings, 1 reply; 10+ messages in thread From: Jiany Wu @ 2025-07-14 4:37 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Darrick J. Wong, yi.zhang, jack, linux-ext4 Hello, Ted, Good day, thanks indeed for the clarification~ Yes, previously tried to mount a specific ext4 disk-img to /var/log, with /dev/loop1 device, and rsyslogd will write to /var/log/syslog. When /tmp directory exhaust manually via fallocate, / dir will be also occupied as 100%, and rsyslog write errors in /dev/loop1 happen, later mount as read-only. Different from the early scenario, but this scenario is not easy to reproduce. Tried updating the test case, not fallocate all spaces in disk, now alloc 95%, everything is normal now, no related error prints anymore. It is confirmed errors are caused by disk exhaust. I think the main hesitation part is whether fallocate is allowed to use the whole disk space. 
root@testbed:~$ df -Th Filesystem Type Size Used Avail Use% Mounted on udev devtmpfs 16G 0 16G 0% /dev tmpfs tmpfs 3.2G 53M 3.1G 2% /run root-overlay overlay 32G 6.2G 25G 20% / /dev/nvme0n1p3 ext4 32G 6.2G 25G 20% /host /dev/loop1 ext4 3.9G 189M 3.5G 6% /var/log tmpfs tmpfs 16G 236M 16G 2% /dev/shm tmpfs tmpfs 5.0M 0 5.0M 0% /run/lock tmpfs tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup root@testbed:~$ mount | grep log /host/disk-img/var-log.ext4 on /var/log type ext4 (rw,relatime) root@testbed:~$ ls -lh /host/disk-img/var-log.ext4 -rw-r--r-- 1 root root 4.0G Jul 14 07:05 /host/disk-img/var-log.ext4 root@testbed:~$ file /host/disk-img/var-log.ext4 /host/disk-img/var-log.ext4: Linux rev 1.0 ext4 filesystem data, UUID=49281462-eb22-4f19-8d03-51338eaf278a (needs journal recovery) (extents) (64bit) (large files) (huge files) # fallocate to exhaust /tmp directly root@testbed:~$ df /tmp Filesystem 1K-blocks Used Available Use% Mounted on root-overlay 229572940 229556556 0 100% / # loop write error testbed ERR kernel: [ 1019.470013] I/O error, dev loop1, sector 266248 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 2 testbed ERR kernel: [ 1019.479242] Buffer I/O error on dev loop1, logical block 33281, lost async page write testbed ERR kernel: [ 1009.228833] loop: Write error at byte offset 673349632, length 4096. testbed CRIT kernel: [ 1019.487101] EXT4-fs error (device loop1): ext4_check_bdev_write_error:217: comm rs:main Q:Reg: Error while async write back metadata # remounting fs as read-only testbed ERR kernel: [ 1326.758055] Aborting journal on device loop1-8. 
testbed CRIT kernel: [ 1326.765336] EXT4-fs error (device loop1): ext4_journal_check_start:83: comm auditd: Detected aborted journal testbed CRIT kernel: [ 1326.765960] EXT4-fs error (device loop1): ext4_journal_check_start:83: comm rs:main Q:Reg: Detected aborted journal testbed CRIT kernel: [ 1326.775629] EXT4-fs (loop1): Remounting filesystem read-only Best regards, Jianyue Wu On Sat, Jul 12, 2025 at 10:34 PM Theodore Ts'o <tytso@mit.edu> wrote: > > On Fri, Jul 11, 2025 at 09:27:14PM -0700, Darrick J. Wong wrote: > > > > Honestly it's really too bad that there's no way for an fs to ask the > > block device how much space it thinks is available, and then teach its > > own statfs method to return min(fs space available, bdev space > > availble). > > > > Then at least df could report that your 500T ramdisk filesystem on a 4G > > /tmp really only has 4G of space available. > > I think it would be better if there was an extra field in the statfs > structure that reported bdev space available, and have it show up > as an extra (optional) column in the df report. > > The problem is that bdev space available could be highly variable. > For example, suppose you had a few thousand users all sharing thinly > provisioned space. If a whole bunch of users suddenly all start using > space, the available space at the storage layer could suddenly > plummet. And if the available space starts getting low, this might trigger > automated, central fstrims on all of the volumes, causing the free > space to go back up. > > Having the free space on a file system as reported by df go up and > down randomly would very likely cause users to get very confused > and upset, especially when it wasn't under their control. Even for a > single user system the free space in tmpfs could go down suddenly when > some huge process suddenly started, and then go up suddenly when that > process gets OOM-killed. :-) > > - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-14  4:37 ` Jiany Wu
@ 2025-07-14 13:09   ` Theodore Ts'o
  2025-07-15  1:27     ` Jiany Wu
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2025-07-14 13:09 UTC
To: Jiany Wu; +Cc: Darrick J. Wong, yi.zhang, jack, linux-ext4

On Mon, Jul 14, 2025 at 12:37:21PM +0800, Jiany Wu wrote:
> Hello, Ted,
>
> Good day, thanks indeed for the clarification~
> Yes, previously tried to mount a specific ext4 disk-img to /var/log,
> with /dev/loop1 device, and rsyslogd will write to /var/log/syslog.
> When /tmp directory exhaust manually via fallocate, / dir will be also
> occupied as 100%, and rsyslog write errors in /dev/loop1 happen, later
> mount as read-only. Different from the early scenario, but this
> scenario is not easy to reproduce.
> Tried updating the test case, not fallocate all spaces in disk, now
> alloc 95%, everything is normal now, no related error prints anymore.
> It is confirmed errors are caused by disk exhaust.
> I think the main hesitation part is whether fallocate is allowed to
> use the whole disk space.

The fallocate system call is allowed to use the whole space on the
*file system*. But it doesn't know how much free space is available
in the storage underlying a thin-provisioned device. If you are
using a loopback-mounted image on a disk and the underlying file
system on the disk fills up, then the block device will have I/O errors
--- and then the file system on the loop device will run into
problems, either data loss or metadata corruption.

So this is working as intended. If you don't want this, don't use a
loopback mount with a sparse file: either use fallocate when
creating the image file, or don't use a loopback mount.

					- Ted
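The difference between a sparse image file and a preallocated one can be checked from the shell. This is a minimal sketch, assuming GNU stat and util-linux fallocate; the function names is_sparse and make_image are hypothetical. An image created with `truncate -s 128M mydisk`, as in the reproduction steps, reserves no space up front, whereas fallocate asks the underlying file system to reserve the blocks when the image is created.

```shell
#!/bin/sh
# Hypothetical helpers; assumes GNU stat and util-linux fallocate.

# is_sparse FILE: succeed if FILE has fewer bytes allocated on disk
# than its apparent size (i.e. it contains holes)
is_sparse() {
    apparent=$(stat -c %s "$1")   # apparent size in bytes
    blocks=$(stat -c %b "$1")     # number of allocated blocks
    bsize=$(stat -c %B "$1")      # size in bytes of each block counted by %b
    [ $((blocks * bsize)) -lt "$apparent" ]
}

# make_image FILE SIZE: create an image file whose blocks are reserved
# up front, instead of a sparse file
make_image() {
    fallocate -l "$2" "$1"
}
```

For example, `make_image mydisk 128M` followed by `is_sparse mydisk || echo "fully allocated"`. One caveat to verify on the system in question: the mkfs.ext4 output earlier in the thread shows "Discarding device blocks", and a discard on an image file may deallocate blocks again, so it seems worth re-running is_sparse after mkfs (or passing `-E nodiscard` to mkfs.ext4).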
* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-14 13:09 ` Theodore Ts'o
@ 2025-07-15  1:27   ` Jiany Wu
  2025-07-15  3:42     ` Theodore Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Jiany Wu @ 2025-07-15 1:27 UTC
To: Theodore Ts'o; +Cc: Darrick J. Wong, yi.zhang, jack, linux-ext4

Hello, Ted,

Thanks indeed for the clarification, it is clear now.
OK, so when using a loopback-mounted image on a disk, if the underlying
file system fills up, the block device will have I/O errors.
This loopback mount belongs to a third-party common config. As a first
workaround, I'll fallocate less disk space so that the disk is not
exhausted.
Thanks again for the help :)

Best regards,
Jianyue Wu

On Mon, Jul 14, 2025 at 9:09 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Mon, Jul 14, 2025 at 12:37:21PM +0800, Jiany Wu wrote:
> > Hello, Ted,
> >
> > Good day, thanks indeed for the clarification~
> > Yes, previously tried to mount a specific ext4 disk-img to /var/log,
> > with /dev/loop1 device, and rsyslogd will write to /var/log/syslog.
> > When /tmp directory exhaust manually via fallocate, / dir will be also
> > occupied as 100%, and rsyslog write errors in /dev/loop1 happen, later
> > mount as read-only. Different from the early scenario, but this
> > scenario is not easy to reproduce.
> > Tried updating the test case, not fallocate all spaces in disk, now
> > alloc 95%, everything is normal now, no related error prints anymore.
> > It is confirmed errors are caused by disk exhaust.
> > I think the main hesitation part is whether fallocate is allowed to
> > use the whole disk space.
>
> The fallocate system call is allowed to use the whole space on the
> *file system*. But it doesn't know how much free space is available
> in the storage underlying a thin-provisioned device. If you are
> using a loopback-mounted image on a disk and the underlying file
> system on the disk fills up, then the block device will have I/O errors
> --- and then the file system on the loop device will run into
> problems, either data loss or metadata corruption.
>
> So this is working as intended. If you don't want this, don't use a
> loopback mount with a sparse file: either use fallocate when
> creating the image file, or don't use a loopback mount.
>
> - Ted
* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-15  1:27 ` Jiany Wu
@ 2025-07-15  3:42   ` Theodore Ts'o
  0 siblings, 0 replies; 10+ messages in thread
From: Theodore Ts'o @ 2025-07-15 3:42 UTC
To: Jiany Wu; +Cc: Darrick J. Wong, yi.zhang, jack, linux-ext4

On Tue, Jul 15, 2025 at 09:27:01AM +0800, Jiany Wu wrote:
>
> Thanks indeed for the clarification, it is clear now.
> OK, so when using a loopback-mounted image on a disk, if the underlying
> file system fills up, the block device will have I/O errors.
> This loopback mount belongs to a third-party common config. As a first
> workaround, I'll fallocate less disk space so that the disk is not
> exhausted.

The question I'd ask is *why* someone set up that loopback mount in
the first place. As a guess, perhaps the goal was to restrict the
amount of disk space that could be used for log files in /var/log, so
that if there is runaway logging, it won't consume all of the free
space on the root partition.

But if that's the case, there are much better ways of achieving the
same goal. For example, in addition to using log rotation programs,
you could back that up by using project quotas, which restrict how
much space can get used in a subdirectory.

Cheers,

					- Ted
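The project-quota suggestion above could look roughly like the following on an ext4 file system. This is a hedged sketch, not a tested recipe for the setup in this thread: the device, mount point, directory, project ID, and limit are all hypothetical placeholders, and the wrapper function project_quota_setup is invented here for illustration. It assumes e2fsprogs (tune2fs, chattr) and the quota tools (setquota) are installed, and that the file system can be unmounted while the feature is enabled. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# All values below are hypothetical placeholders, not taken from the thread.
DEV=${DEV:-/dev/sdXN}           # block device holding the ext4 file system
MNT=${MNT:-/mnt/data}           # where that file system is mounted
DIR=${DIR:-/mnt/data/log}       # directory whose usage should be capped
PROJ_ID=${PROJ_ID:-42}          # arbitrary project ID
LIMIT_KB=${LIMIT_KB:-1048576}   # hard block limit in 1 KiB units (1 GiB)

# Print each command by default; set DRY_RUN=0 to execute (needs root).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

project_quota_setup() {
    # The project feature is enabled while the file system is unmounted.
    run umount "$MNT"
    run tune2fs -O project -Q prjquota "$DEV"
    run mount -o prjquota "$DEV" "$MNT"
    # Tag the directory with the project ID; +P makes new files inherit it.
    run chattr -p "$PROJ_ID" +P "$DIR"
    # Soft block limit 0 (none), hard block limit LIMIT_KB, no inode limits.
    run setquota -P "$PROJ_ID" 0 "$LIMIT_KB" 0 0 "$MNT"
}
```

Calling `project_quota_setup` prints the five commands for review; `DRY_RUN=0 project_quota_setup` would execute them. Files that already exist under the directory keep their old project ID unless chattr is applied recursively, and whether this approach fits the overlay root described earlier in the thread is not settled here.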