Issue with ext4 filesystem corruption when writing to a file after disk exhaustion

* Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
@ 2025-07-11  3:20 Jiany Wu
  2025-07-11  5:29 ` Theodore Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Jiany Wu @ 2025-07-11  3:20 UTC (permalink / raw)
  To: yi.zhang, jack, linux-ext4

Hello,

Recently I ran into an issue on kernel 6.1.123: writing to a file after
the disk has been exhausted reports EFSCORRUPTED. I think this is
unexpected behavior.
Could you help clarify:
1. Why does writing to a file after disk exhaustion cause "Error while
async write back metadata"? Is it because inode or block metadata has
been corrupted?
2. Why does the file system become corrupted, as in "Aborting journal
on device"?
Thank you~

Detailed reproduction steps are:
# 1. Create ext4 file system in mydisk
root@testbed:/tmp# touch mydisk
root@testbed:/tmp# ls -l mydisk
-rw-r--r-- 1 root root 0 Jul  8 05:36 mydisk
root@testbed:/tmp# truncate -s 128M mydisk
root@testbed:/tmp# ls -lh mydisk
-rw-r--r-- 1 root root 128M Jul  8 05:36 mydisk
root@testbed:/tmp# mkfs.ext4 mydisk
mke2fs 1.47.0 (5-Feb-2023)
Discarding device blocks: done
Creating filesystem with 131072 1k blocks and 32768 inodes
Filesystem UUID: b0b12002-d497-436e-b89d-d0e02f53b46d
Superblock backups stored on blocks:
        8193, 24577, 40961, 57345, 73729

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done

# 2. Mount mydisk to /mnt/test_fs
root@testbed:/tmp# mkdir /mnt/test_fs
root@testbed:/tmp# mount mydisk /mnt/test_fs/
root@testbed:/tmp# findmnt /mnt/test_fs
TARGET       SOURCE     FSTYPE OPTIONS
/mnt/test_fs /dev/loop2 ext4   rw,relatime

root@testbed:/mnt/test_fs# file /tmp/mydisk
/tmp/mydisk: Linux rev 1.0 ext4 filesystem data,
UUID=b0b12002-d497-436e-b89d-d0e02f53b46d (needs journal recovery)
(extents) (64bit) (large files) (huge files)

# 3. Exhaust disk in /mnt/test_fs with 32G test_file
root@testbed:/mnt/test_fs# fallocate -l 32716560K /mnt/test_fs/test_file
fallocate: fallocate failed: No space left on device
root@testbed:/mnt/test_fs# ls
lost+found  test_file
root@testbed:/mnt/test_fs# journalctl -f
Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset
9178112, length 1024.
Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset
274432, length 1024.
Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset
274432, length 1024.
Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset
274432, length 1024.
Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset
274432, length 1024.
Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset
274432, length 1024.
Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset
274432, length 1024.
Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset
274432, length 1024.
Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset
274432, length 1024.
Jul 08 05:43:07 testbed kernel: I/O error, dev loop2, sector 17926 op
0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 2
Jul 08 05:43:07 testbed kernel: Buffer I/O error on dev loop2, logical
block 8963, lost async page write
Jul 08 05:43:07 testbed kernel: I/O error, dev loop2, sector 518 op
0x1:(WRITE) flags 0x103000 phys_seg 17 prio class 2
Jul 08 05:43:07 testbed kernel: Buffer I/O error on dev loop2, logical
block 259, lost async page write
Jul 08 05:43:07 testbed kernel: Buffer I/O error on dev loop2, logical
block 260, lost async page write
Jul 08 05:43:07 testbed kernel: Buffer I/O error on dev loop2, logical
block 261, lost async page write
Jul 08 05:43:07 testbed kernel: Buffer I/O error on dev loop2, logical
block 262, lost async page write

# 4. Write to /mnt/test_fs/file.dat with dd cmd, I/O error appears.
root@testbed:/mnt/test_fs# dd if=/dev/zero of=/mnt/test_fs/file.dat bs=1M count=64
root@testbed:/mnt/test_fs# journalctl -f
Jul 08 05:49:24 testbed kernel: Buffer I/O error on dev loop2, logical
block 268, lost async page write
Jul 08 05:49:26 testbed kernel: EXT4-fs error (device loop2):
ext4_check_bdev_write_error:217: comm dd: Error while async write back
metadata
Jul 08 05:49:26 testbed kernel: I/O error, dev loop2, sector 20482 op
0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 2

# dumpe2fs shows that the first error function is ext4_check_bdev_write_error.
root@testbed:/mnt/test_fs# dumpe2fs -h /dev/loop2
dumpe2fs 1.47.0 (5-Feb-2023)
Filesystem volume name:   <none>
Last mounted on:          /mnt/test_fs
Filesystem UUID:          b0b12002-d497-436e-b89d-d0e02f53b46d
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index
filetype needs_recovery extent 64bit flex_bg sparse_super large_file
huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              32768
Block count:              131072
Reserved block count:     6553
Overhead clusters:        13869
Free blocks:              42236
Free inodes:              32754
First block:              1
Block size:               1024
Fragment size:            1024
Group descriptor size:    64
Reserved GDT blocks:      256
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         2048
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Tue Jul  8 05:37:11 2025
Last mount time:          Tue Jul  8 05:37:37 2025
Last write time:          Tue Jul  8 05:50:36 2025
Mount count:              1
Maximum mount count:      -1
Last checked:             Tue Jul  8 05:37:11 2025
Check interval:           0 (<none>)
Lifetime writes:          74 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      bf9009a6-ff19-41d5-8abc-a4d9cd65eeb4
Journal backup:           inode blocks
FS Error count:           4
First error time:         Tue Jul  8 05:45:22 2025
First error function:     ext4_check_bdev_write_error
First error line #:       217
First error err:          EIO
Last error time:          Tue Jul  8 05:50:36 2025
Last error function:      ext4_check_bdev_write_error
Last error line #:        217
Last error err:           EIO
Checksum type:            crc32c
Checksum:                 0x0583faaa
Journal features:         journal_incompat_revoke journal_64bit
journal_checksum_v3
Total journal size:       4096k
Total journal blocks:     4096
Max transaction length:   4096
Fast commit length:       0
Journal sequence:         0x00000002
Journal start:            1
Journal checksum type:    crc32c
Journal checksum:         0xebf7b874

# 5. Unmount the file system; it is remounted read-only and the last error is EFSCORRUPTED
Jul 08 06:44:17 testbed kernel: EXT4-fs (loop2): unmounting filesystem.
Jul 08 06:44:17 testbed kernel: Aborting journal on device loop2-8.
Jul 08 06:44:17 testbed kernel: EXT4-fs error (device loop2):
ext4_put_super:1232: comm umount: Couldn't clean up the journal
Jul 08 06:44:17 testbed kernel: EXT4-fs (loop2): Remounting filesystem read-only

root@testbed:/tmp# dumpe2fs -h /dev/loop2
...
FS Error count:           9
First error time:         Tue Jul  8 05:45:22 2025
First error function:     ext4_check_bdev_write_error
First error line #:       217
First error err:          EIO
Last error time:          Tue Jul  8 06:46:30 2025
Last error function:      ext4_validate_block_bitmap
Last error line #:        420
Last error err:           EFSCORRUPTED
...
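
To compare the error state between two runs of dumpe2fs like the ones
above, the relevant header fields can be filtered out. This is only a
sketch: the function name ext4_error_summary is made up for
illustration, and the field names are simply the ones visible in the
listings in this message.

```shell
# Keep only the error-tracking fields from `dumpe2fs -h` output on stdin.
# The field names are taken from the listings in this thread.
ext4_error_summary() {
    grep -E '^(Filesystem state|FS Error count|(First|Last) error (time|function|line #|err)):'
}
```

It would be used as `dumpe2fs -h /dev/loop2 2>/dev/null | ext4_error_summary`.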

Thank you~
Best regards,
Jianyue Wu

* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-11  3:20 Issue with ext4 filesystem corruption when writing to a file after disk exhaustion Jiany Wu
@ 2025-07-11  5:29 ` Theodore Ts'o
  2025-07-11  9:56   ` Jiany Wu
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2025-07-11  5:29 UTC (permalink / raw)
  To: Jiany Wu; +Cc: yi.zhang, jack, linux-ext4

On Fri, Jul 11, 2025 at 11:20:32AM +0800, Jiany Wu wrote:
> Hello,
> 
> Recently I ran into an issue on kernel 6.1.123: writing to a file after
> the disk has been exhausted reports EFSCORRUPTED. I think this is
> unexpected behavior.

What you did was create a file system in /tmp/mydisk by creating a
sparse image file:

> root@testbed:/tmp# touch mydisk
> root@testbed:/tmp# ls -l mydisk
> -rw-r--r-- 1 root root 0 Jul  8 05:36 mydisk
> root@testbed:/tmp# truncate -s 128M mydisk
> root@testbed:/tmp# mkfs.ext4 mydisk

The potential problem is that this assumes /tmp has enough free space
to hold the full 128M.  But it's clear that it didn't.  So not only did
you exhaust the space in the file system, you *also* exhausted the
space in /tmp.  You can see this from the I/O errors when writing to
/dev/loop2:

> root@testbed:/tmp# mount mydisk /mnt/test_fs/
> root@testbed:/tmp# findmnt /mnt/test_fs
> TARGET       SOURCE     FSTYPE OPTIONS
> /mnt/test_fs /dev/loop2 ext4   rw,relatime
> ...
> root@testbed:/mnt/test_fs# fallocate -l 32716560K /mnt/test_fs/test_file
> fallocate: fallocate failed: No space left on device
> root@testbed:/mnt/test_fs# journalctl -f
> Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset
> 9178112, length 1024.
> Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset
> 274432, length 1024.

These error messages are write errors in /dev/loop2, which were almost
certainly caused by ENOSPC errors when trying to write to /tmp/mydisk.

This is the moral equivalent of buying a fraudulent USB thumb drive
from the back alleys of Shenzhen, where the USB thumb drive was
*labelled* as having 128MB of storage, but which only had 16MB of
flash, such that writes after the first 16MB would fail (or overwrite
other disk blocks).

If /tmp had had enough space, then you wouldn't have seen these errors.

One alternative way you could create the image would have been to replace 

> root@testbed:/tmp# touch mydisk
> root@testbed:/tmp# ls -l mydisk
> -rw-r--r-- 1 root root 0 Jul  8 05:36 mydisk
> root@testbed:/tmp# truncate -s 128M mydisk

with:

root@testbed:/tmp# dd if=/dev/zero of=mydisk bs=1M count=128

This allocates 128MB to /tmp/mydisk, and if there isn't enough space
in /tmp, the dd will fail with an error.  If it succeeds, then when
you create the file system and mount it, you won't see the error
messages writing to /dev/loopN.
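
To check which kind of image a given procedure produced, one can
compare the file's apparent size with the space actually allocated for
it.  The following is a minimal sketch, assuming GNU coreutils stat;
the function name image_is_sparse is hypothetical and not part of any
tool mentioned in this thread.

```shell
# Report whether an image file is sparse, i.e. whether fewer bytes are
# allocated on the backing file system than its apparent size (i_size).
# Assumes GNU stat: %s = size in bytes, %b = allocated blocks,
# %B = bytes per block as counted by %b.
image_is_sparse() (
    img=$1
    size=$(stat -c '%s' "$img") || exit 2
    allocated=$(( $(stat -c '%b' "$img") * $(stat -c '%B' "$img") ))
    echo "apparent=$size allocated=$allocated"
    # Exit status 0 means "sparse": some of the file has no blocks behind it.
    [ "$allocated" -lt "$size" ]
)
```

An image made with truncate should be reported as sparse, while one
written out with dd normally should not be, although file systems with
compression or similar features may report fewer allocated blocks.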

The bottom line is that the bug is a PEBCAK ("problem exists between
chair and keyboard"), which is another way of saying it's a failure of
the system administrator to understand that they had done something
bad.  It is not a kernel bug, but rather a bug in your procedure /
system setup.

Cheers,

						- Ted

* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-11  5:29 ` Theodore Ts'o
@ 2025-07-11  9:56   ` Jiany Wu
  2025-07-11 15:40     ` Theodore Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Jiany Wu @ 2025-07-11  9:56 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: yi.zhang, jack, linux-ext4

Hello, Ted,

Thank you for the help, it is really appreciated!
By the way, is it proper to fallocate the whole disk space in order to
exhaust the disk?
I see that even when fallocate asks for the full disk size, a file
equal to the available size can still be allocated.
For example, when /tmp has 26G available but fallocate requests 32G
(the total disk space), it ends up allocating a 26G file, but the exit
code is 1.
Is this legal usage, or will it trigger some unknown issue? I'm a
newbie with fallocate :)

Best regards,
Jianyue Wu

* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-11  9:56   ` Jiany Wu
@ 2025-07-11 15:40     ` Theodore Ts'o
  2025-07-12  4:27       ` Darrick J. Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2025-07-11 15:40 UTC (permalink / raw)
  To: Jiany Wu; +Cc: yi.zhang, jack, linux-ext4

On Fri, Jul 11, 2025 at 05:56:18PM +0800, Jiany Wu wrote:
> Hello, Ted,
> 
> Thank you for the help, it is really appreciated!
> By the way, is it proper to fallocate the whole disk space in order to
> exhaust the disk?

I'm not sure what you mean by "proper".  It depends on what you are
trying to do, I suppose.

The other thing here is I think you are seriously confusing yourself
(and others) by using a loopback file image which is mounted.  That's
because now you need to worry about failures at two levels: at the
level of the storage device containing the image (e.g., /tmp for the
image /tmp/mydisk) and the loopback file system (e.g., /mnt/tmp when
/tmp/mydisk is mounted on top of /mnt/tmp).  You could potentially
have ENOSPC errors at either level.

> I see that even when fallocate asks for the full disk size, a file
> equal to the available size can still be allocated.
> For example, when /tmp has 26G available but fallocate requests 32G
> (the total disk space), it ends up allocating a 26G file, but the exit
> code is 1.

You're being ambiguous here.  When you say "full disk size", which
level are you talking about?  /tmp or /mnt/test?  And when you
fallocate, which one are you fallocating?

What I would recommend is to fallocate *first* at the /tmp/mydisk
level.  So do this:

# fallocate -l 32G /tmp/mydisk
# mkfs.ext4 /tmp/mydisk

If /tmp only has 26GB of free space, then the fallocate will fail ---
but that's fine.  That tells you that you don't have enough free space
to fully allocate the file system image.  So *stop*, and do this
somewhere you have enough free space:

# fallocate -l 32G /mnt/huge-10TB-disk-with-lots-of-free-space/mydisk
# mkfs.ext4 /mnt/huge-10TB-disk-with-lots-of-free-space/mydisk

Now you know that no matter what, when you mount mydisk, you don't
need to worry about I/O errors when writing to mydisk.  And you can
proceed with your experimentation.
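
The "preallocate first, and stop if it fails" rule can be wrapped in a
small helper.  This is a sketch under assumptions: it relies on
util-linux fallocate and on a backing file system that supports
preallocation, and the function name make_full_image is hypothetical.
Because a failed fallocate can leave a partially allocated file behind
(as reported earlier in this thread), the helper removes it again.

```shell
# Preallocate an image file of the given size, or fail without leaving
# a partial file behind.  Intended for a fresh path only: an existing
# file is never touched.
make_full_image() (
    img=$1
    size=$2
    if [ -e "$img" ]; then
        echo "refusing to overwrite existing $img" >&2
        exit 1
    fi
    if ! fallocate -l "$size" "$img"; then
        echo "could not preallocate $size for $img" >&2
        rm -f "$img"    # drop whatever part was allocated before the failure
        exit 1
    fi
    echo "preallocated $size at $img"
)
```

Chaining it as `make_full_image /tmp/mydisk 32G && mkfs.ext4 /tmp/mydisk`
means the file system is only created when the space really exists.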

Now, what if you don't have that huge 10TB disk?  Can you use
/tmp/mydisk to create a 32GB file system even though /tmp only has
26GB of free space?  You *can*, but you need to be careful, because
once you start writing to the mounted file system, you will
eventually run out of space in /tmp.

For example:

% cp /dev/null /tmp/test.img
% ls -lsh /tmp/test.img
0 -rw-r--r-- 1 tytso tytso 0 Jul 11 11:19 /tmp/test.img
% mkfs.ext4 -q /tmp/test.img 32G
% ls -lsh /tmp/test.img
6.4M -rw-r--r-- 1 tytso tytso 32G Jul 11 11:19 /tmp/test.img

So you can see here that we have created a test file system which is
32 GiB in size, but so far, the actual amount of *space* consumed in
/tmp is 6.4 MiB.  The i_size of the file is 32 GiB, but it is a sparse
file, which means not all of the blocks between logical offset 0 and
32 GiB have been allocated.

Now, if we mount the file system, then as we start writing into it,
we will allocate space in /tmp.  Now, the way fallocate works in the
mounted file system is that it guarantees space in the file system,
but it won't write the data blocks, so the space consumed in /tmp by
/tmp/test.img will grow only by the space needed to update the
metadata blocks in the file system contained in /tmp/test.img.

% sudo mount /tmp/test.img /mnt/test
% df -h /mnt/test
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop1       32G  2.1M   30G   1% /mnt/test
% sudo fallocate -l 16G /mnt/test/testfile
% df -h /mnt/test
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop1       32G   17G   14G  55% /mnt/test
% ls -lsh /mnt/test/testfile
17G -rw-r--r-- 1 root root 16G Jul 11 11:25 /mnt/test/testfile
% ls -lsh /tmp/test.img
7.5M -rw-r--r-- 1 tytso tytso 32G Jul 11 11:25 /tmp/test.img

So here, we fallocated 16GB in the file system in /tmp/test.img.  You
can see that it created a file which is 16GB in size, but which takes
a bit more than 16GB once you include the metadata blocks for
/mnt/test/testfile.  That's why the space used is 17GB (the ls program
rounded up) but the i_size is 16GB.

*But* the space consumed by the file /tmp/test.img only went up from
6.4 MiB to 7.5 MiB.  That's because although we reserved space in the
file system /tmp/test.img, we didn't reserve any space in /tmp.
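
The gap between what the image promises and what /tmp can deliver can
be estimated from userspace.  The sketch below assumes GNU stat and GNU
df; the three function names are hypothetical, and the result is only a
point-in-time estimate, since free space in the backing file system can
change at any moment.

```shell
# Bytes of a (possibly sparse) image that are not yet backed by
# allocated blocks in the backing file system.
unallocated_bytes() (
    img=$1
    size=$(stat -c '%s' "$img") || exit 2
    allocated=$(( $(stat -c '%b' "$img") * $(stat -c '%B' "$img") ))
    remaining=$(( size - allocated ))
    [ "$remaining" -lt 0 ] && remaining=0
    echo "$remaining"
)

# Bytes currently available in the file system that holds the image.
backing_avail_bytes() (
    df --output=avail -B1 "$(dirname "$1")" | tail -n 1 | tr -d ' '
)

# Succeed only if the backing file system could currently absorb the
# image being written out in full.
image_fits() (
    [ "$(unallocated_bytes "$1")" -le "$(backing_avail_bytes "$1")" ]
)
```

For example, `image_fits /tmp/test.img || echo "thinly provisioned"`
would flag an image that cannot be fully written right now.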

This is working as intended; and if what you are doing is "thin
provisioning", this is a feature, not a bug.

But what it means is that if /tmp only has 26GB of space, eventually
if you keep writing to /tmp/test.img, there will be block level errors
in the loop device when /tmp runs out of space.  That was what you saw
in your original example:

Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024.

As soon as there are block I/O errors in the underlying file system,
all bets are off.  There could be data loss (if you had been writing
to a data block when /tmp ran out of space) or file system corruption
(if the kernel had been trying to write a metadata block when /tmp ran
out of space), or possibly both.  So as soon as you see block I/O
errors, don't assume that the file system is unscathed, because you
probably *will* have lost data or have a corrupted file system.
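
One way to notice such block I/O errors early is to filter the kernel
log for them.  This is a sketch only: the patterns are copied from the
log lines quoted in this thread rather than from any kernel reference,
and the function name scan_loop_errors is made up for illustration.

```shell
# Print kernel log lines from stdin that look like loop-device write
# failures; exit status 0 means at least one such line was found.
scan_loop_errors() {
    grep -E 'loop: Write error at byte offset|Buffer I/O error on dev loop[0-9]+|I/O error, dev loop[0-9]+'
}
```

It could be fed from the kernel log, for example with
`journalctl -k | scan_loop_errors`.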

> Is this legal usage, or will it trigger some unknown issue? I'm a
> newbie with fallocate :)

So it's *legal* to do thin provisioning; if you are trying to test a
very large file system, and you don't have enough space, then you
might not have a choice.  Or if you are trying to be more efficient,
it might let you allow users to *think* they have more space than
you have actually purchased, since very often, users don't actually
use it; they just want to feel good that they have the space.

And if you have a large number of users, thin provisioning might make
sense because it saves money.  But it's much like a bank which has
lent out money that depositors have on deposit, relying on the fact
that it is very rare that all of the depositors will suddenly show up
and withdraw all of their money all at the same time.  If that
happens, then you have a run on the bank, and there could be civil
unrest, and things get ugly.  Which is why after bank runs, government
regulators will demand that banks keep more money on reserve, which
lowers their profits and makes the bank's shareholders sad --- but
better that than angry bank customers.  :-)

So if you know what you are doing, it *can* work.  But it might
trigger an issue which is unknown/unexpected for you, even though for
someone who understands how things work it makes perfect sense and is
the system working as designed.

If you have lots of disk space, then just use fallocate to allocate
space for /tmp/mydisk, and then you can use fallocate to allocate
space for the file system contained in /tmp/mydisk.  And it will all
work, but it will require more disk space to be available in /tmp.

Cheers,

						- Ted

* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-11 15:40     ` Theodore Ts'o
@ 2025-07-12  4:27       ` Darrick J. Wong
  2025-07-12 14:34         ` Theodore Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2025-07-12  4:27 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jiany Wu, yi.zhang, jack, linux-ext4

On Fri, Jul 11, 2025 at 11:40:12AM -0400, Theodore Ts'o wrote:
> On Fri, Jul 11, 2025 at 05:56:18PM +0800, Jiany Wu wrote:
> > Hello, Ted,
> > 
> > Thank you for the help, it is really appreciated!
> > By the way, is it proper to fallocate the whole disk space in order to
> > exhaust the disk?
> 
> I'm not sure what you mean by "proper".  It depends on what you are
> trying to do, I suppose.
> 
> The other thing here is I think you are seriously confusing yourself
> (and others) by using a loopback file image which is mounted.  That's
> because now you need to worry about failures at two levels: at the
> level of the storage device containing the image (e.g., /tmp for the
> image /tmp/mydisk) and the loopback file system (e.g., /mnt/tmp when
> /tmp/mydisk is mounted on top of /mnt/tmp).  You could potentially
> have ENOSPC errors at either level.

Honestly it's really too bad that there's no way for an fs to ask the
block device how much space it thinks is available, and then teach its
own statfs method to return min(fs space available, bdev space
available).

Then at least df could report that your 500T ramdisk filesystem on a 4G
/tmp really only has 4G of space available.
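
A rough userspace approximation of that min() idea can be sketched
with df alone.  It assumes GNU df, the function name
effective_avail_bytes is hypothetical, and the caller has to know
which path backs the image.  It is a conservative estimate, since
writes to image blocks that are already allocated do not consume any
further space in the backing file system.

```shell
# Print the smaller of two "available bytes" figures: that of the
# mounted file system ($1) and that of the file system backing its
# image ($2).
effective_avail_bytes() (
    fs_avail=$(df --output=avail -B1 "$1" | tail -n 1 | tr -d ' ')
    backing_avail=$(df --output=avail -B1 "$2" | tail -n 1 | tr -d ' ')
    if [ "$fs_avail" -le "$backing_avail" ]; then
        echo "$fs_avail"
    else
        echo "$backing_avail"
    fi
)
```

With the layout from this thread, that would be something like
`effective_avail_bytes /mnt/test_fs /tmp`.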

> > I see that even when fallocate asks for the full disk size, a file
> > equal to the available size can still be allocated.
> > For example, when /tmp has 26G available but fallocate requests 32G
> > (the total disk space), it ends up allocating a 26G file, but the exit
> > code is 1.
> 
> You're being ambiguous here.  When you say "full disk size", which
> level are you talking about?  /tmp or /mnt/test?  And when you
> fallocate, which one are you fallocating?
> 
> What I would recommend is to fallocate *first* at the /tmp/mydisk
> level.  So do this:
> 
> # fallocate -l 32G /tmp/mydisk
> # mkfs.ext4 /tmp/mydisk
> 
> If /tmp only has 26GB of free space, then the fallocate will fail ---
> but that's fine.  That tells you that you don't have enough free space
> to fully allocate the file system image.  So *stop*, and do this
> somewhere you have enough free space:
> 
> # fallocate -l 32G /mnt/huge-10TB-disk-with-lots-of-free-space/mydisk
> # mkfs.ext4 /mnt/huge-10TB-disk-with-lots-of-free-space/mydisk
> 
> Now you know that no matter what, when you mount mydisk, you don't
> need to worry about I/O errors when writing to mydisk.  And you can
> proceed with your experimentation.
> 
> 
> Now, what if you don't have that huge 10TB disk?  Can you use
> /tmp/mydisk to create a 32GB file system even though /tmp only has
> 26GB of free space?  You *can*, but you need to be careful, because
> once you start writing to the mounted file system, you will
> eventually run out of space in /tmp.
> 
> For example:
> 
> % cp /dev/null /tmp/test.img
> % ls -lsh /tmp/test.img
> 0 -rw-r--r-- 1 tytso tytso 0 Jul 11 11:19 /tmp/test.img
> % mkfs.ext4 -q /tmp/test.img 32G
> % ls -lsh /tmp/test.img
> 6.4M -rw-r--r-- 1 tytso tytso 32G Jul 11 11:19 /tmp/test.img
> 
> So you can see here that we have created a test file system which is
> 32 GiB in size, but so far, the actual amount of *space* consumed in
> /tmp is 6.4 MiB.  The i_size of the file is 32 GiB, but it is a sparse
> file, which means not all of the blocks between logical offset 0 and
> 32 GiB have been allocated.
> 
> Now, if we mount the file system, then as we start writing into it,
> we will allocate space in /tmp.  Now, the way fallocate works in the
> mounted file system is that it guarantees space in the file system,
> but it won't write the data blocks, so the space consumed in /tmp by
> /tmp/test.img will grow only by the space needed to update the
> metadata blocks in the file system contained in /tmp/test.img.
> 
> % sudo mount /tmp/test.img /mnt/test
> % df -h /mnt/test
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/loop1       32G  2.1M   30G   1% /mnt/test
> % sudo fallocate -l 16G /mnt/test/testfile
> % df -h /mnt/test
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/loop1       32G   17G   14G  55% /mnt/test
> % ls -lsh /mnt/test/testfile
> 17G -rw-r--r-- 1 root root 16G Jul 11 11:25 /mnt/test/testfile
> % ls -lsh /tmp/test.img
> 7.5M -rw-r--r-- 1 tytso tytso 32G Jul 11 11:25 /tmp/test.img
> 
> So here, we fallocated 16GB in the file system in /tmp/test.img.  You
> can see that it created a file which is 16GB in size, but which takes
> a bit more than 16GB once you include the metadata blocks for
> /mnt/test/testfile.  That's why the space used is 17GB (the ls program
> rounded up) but the i_size is 16GB.
> 
> *But* the space consumed by the file /tmp/test.img only went up from
> 6.4 MiB to 7.5 MiB.  That's because although we reserved space in the
> file system /tmp/test.img, we didn't reserve any space in /tmp.
> 
> This is working as intended; and if what you are doing is "thin
> provisioning", this is a feature, not a bug.
> 
> But what it means is that if /tmp only has 26GB of space, eventually
> if you keep writing to /tmp/test.img, there will be block level errors
> in the loop device when /tmp runs out of space.  That was what you saw
> in your original example:
> 
> Jul 08 05:43:07 testbed kernel: loop: Write error at byte offset 274432, length 1024.
> 
> As soon as there are block I/O errors in the underlying file system,
> all bets are off.  There could be data loss (if you had been writing
> to a data block when /tmp ran out of space) or file system corruption
> (if the kernel had been trying to write a metadata block when /tmp ran
> out of space), or possibly both.  So as soon as you see block I/O
> errors, don't assume that the file system is unscathed, because you
> probably *will* have lost data or have a corrupted file system.
> 
> > Is this legal usage, or will it trigger some unknown issue? I'm a
> > newbie with fallocate :)
> 
> So it's *legal* to do thin provisioning; if you are trying to test a
> very large file system, and you don't have enough space, then you
> might not have a choice.  Or if you are trying to be more efficient,
> it might let you allow users to *think* they have more space than
> you have actually purchased, since very often, users don't actually
> use it; they just want to feel good that they have the space.
> 
> And if you have a large number of users, thin provisioning might make
> sense because it saves money.  But it's much like a bank which has
> lent out money that depositors have on deposit, relying on the fact
> that it is very rare that all of the depositors will suddenly show up
> and withdraw all of their money all at the same time.  If that
> happens, then you have a run on the bank, and there could be civil
> unrest, and things get ugly.  Which is why after bank runs, government
> regulators will demand that banks keep more money on reserve, which
> lowers their profits and makes the bank's shareholders sad --- but
> better that than angry bank customers.  :-)

LOL SVB :(

--D

> So if you know what you are doing, it *can* work.  But it might
> trigger an issue which is unknown/unexpected for you, even though for
> someone who understands how things work it makes perfect sense and is
> the system working as designed.
> 
> If you have lots of disk space, then just use fallocate to allocate
> space for /tmp/mydisk, and then you can use fallocate to allocate
> space for the file system contained in /tmp/mydisk.  And it will all
> work, but it will require more disk space to be available in /tmp.
> 
> Cheers,
> 
> 						- Ted
> 

* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-12  4:27       ` Darrick J. Wong
@ 2025-07-12 14:34         ` Theodore Ts'o
  2025-07-14  4:37           ` Jiany Wu
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2025-07-12 14:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Jiany Wu, yi.zhang, jack, linux-ext4

On Fri, Jul 11, 2025 at 09:27:14PM -0700, Darrick J. Wong wrote:
> 
> Honestly it's really too bad that there's no way for an fs to ask the
> block device how much space it thinks is available, and then teach its
> own statfs method to return min(fs space available, bdev space
> available).
> 
> Then at least df could report that your 500T ramdisk filesystem on a 4G
> /tmp really only has 4G of space available.

I think it would be better if there was an extra field in the statfs
structure that reported bdev space available, and have it show up
as an extra (optional) column in the df report.

The problem is that bdev space available could be highly variable.
For example, suppose you had a few thousand users all sharing thinly
provisioned space.  If a whole bunch of users suddenly all start using
space, the available space at the storage layer could suddenly
plummet.  And if the available space starts getting low, this might trigger
automated, central fstrims on all of the volumes, causing the free
space to go back up.

Having the free space on a file system as reported by df go up and
down randomly would very likely cause users to get very confused
and upset, especially when it wasn't under their control.  Even for a
single user system the free space in tmpfs could go down suddenly when
some huge process suddenly started, and then go up suddenly when that
process gets OOM-killed.  :-)

     	   	    	      	      	   - Ted

* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-12 14:34         ` Theodore Ts'o
@ 2025-07-14  4:37           ` Jiany Wu
  2025-07-14 13:09             ` Theodore Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Jiany Wu @ 2025-07-14  4:37 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Darrick J. Wong, yi.zhang, jack, linux-ext4

Hello, Ted,

Good day, and thank you for the clarification~
Yes, previously I mounted a dedicated ext4 disk image on /var/log via
the /dev/loop1 device, with rsyslogd writing to /var/log/syslog.
When /tmp is exhausted manually via fallocate, / also becomes 100%
full, rsyslogd write errors appear on /dev/loop1, and the file system
is later remounted read-only. This differs from the earlier scenario,
and it is not easy to reproduce.
I updated the test case so that fallocate no longer takes all of the
space on the disk; it now allocates 95%, everything is normal, and no
related errors are printed anymore.
This confirms that the errors are caused by disk exhaustion.
I think the main point of hesitation is whether fallocate is allowed
to use the whole disk space.
root@testbed:~$ df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
udev           devtmpfs   16G     0   16G   0% /dev
tmpfs          tmpfs     3.2G   53M  3.1G   2% /run
root-overlay   overlay    32G  6.2G   25G  20% /
/dev/nvme0n1p3 ext4       32G  6.2G   25G  20% /host
/dev/loop1     ext4      3.9G  189M  3.5G   6% /var/log
tmpfs          tmpfs      16G  236M   16G   2% /dev/shm
tmpfs          tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs          tmpfs     4.0M     0  4.0M   0% /sys/fs/cgroup
root@testbed:~$ mount | grep log
/host/disk-img/var-log.ext4 on /var/log type ext4 (rw,relatime)
root@testbed:~$ ls -lh /host/disk-img/var-log.ext4
-rw-r--r-- 1 root root 4.0G Jul 14 07:05 /host/disk-img/var-log.ext4
root@testbed:~$ file /host/disk-img/var-log.ext4
/host/disk-img/var-log.ext4: Linux rev 1.0 ext4 filesystem data,
UUID=49281462-eb22-4f19-8d03-51338eaf278a (needs journal recovery)
(extents) (64bit) (large files) (huge files)

# fallocate to exhaust /tmp directly
root@testbed:~$ df /tmp
Filesystem     1K-blocks      Used Available Use% Mounted on
root-overlay   229572940 229556556         0 100% /

# loop write error
testbed ERR kernel: [ 1019.470013] I/O error, dev loop1, sector 266248
op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 2
testbed ERR kernel: [ 1019.479242] Buffer I/O error on dev loop1,
logical block 33281, lost async page write
testbed ERR kernel: [ 1009.228833] loop: Write error at byte offset
673349632, length 4096.
testbed CRIT kernel: [ 1019.487101] EXT4-fs error (device loop1):
ext4_check_bdev_write_error:217: comm rs:main Q:Reg: Error while async
write back metadata

# remounting fs as read-only
testbed ERR kernel: [ 1326.758055] Aborting journal on device loop1-8.
testbed CRIT kernel: [ 1326.765336] EXT4-fs error (device loop1):
ext4_journal_check_start:83: comm auditd: Detected aborted journal
testbed CRIT kernel: [ 1326.765960] EXT4-fs error (device loop1):
ext4_journal_check_start:83: comm rs:main Q:Reg: Detected aborted
journal
testbed CRIT kernel: [ 1326.775629] EXT4-fs (loop1): Remounting
filesystem read-only
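
As a cross-check of the first two log lines above: the block layer
reports 512-byte sectors, while the buffer layer reports file system
blocks, so both messages point at the same location. A minimal sketch
of the arithmetic, assuming a 4096-byte block size (the length shown in
the loop write error; the logs do not state the block size directly).
The function name is illustrative only:

```shell
#!/bin/sh
# Convert a 512-byte sector number (as printed in "I/O error, dev loop1,
# sector N") to a file system block number. The 4096-byte default is an
# assumption; verify it on the real image, e.g. with `dumpe2fs -h`.
sector_to_block() {
    sector=$1
    block_size=${2:-4096}
    echo $(( sector * 512 / block_size ))
}

sector_to_block 266248    # prints 33281, the "logical block" in the log
```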

Best regards,
Jianyue Wu

On Sat, Jul 12, 2025 at 10:34 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Fri, Jul 11, 2025 at 09:27:14PM -0700, Darrick J. Wong wrote:
> >
> > Honestly it's really too bad that there's no way for an fs to ask the
> > block device how much space it thinks is available, and then teach its
> > own statfs method to return min(fs space available, bdev space
> > available).
> >
> > Then at least df could report that your 500T ramdisk filesystem on a 4G
> > /tmp really only has 4G of space available.
>
> I think it would be better if there was an extra field in the statfs
> structure that reported bdev space available, and have it show up
> as an extra (optional) column in the df report.
>
> The problem is that bdev space available could be highly variable.
> For example, suppose you had a few thousand users all sharing thinly
> provisioned space.  If a whole bunch of users suddenly all start using
> space, the available space at the storage layer could suddenly
> plummet.  And if the available space starts getting low, this might trigger
> automated, central fstrims on all of the volumes, causing the free
> space to go back up.
>
> Having the free space on a file system as reported by df go up and
> down randomly would very likely cause users to get very confused
> and upset, especially when it wasn't under their control.  Even for a
> single user system the free space in tmpfs could go down suddenly when
> some huge process suddenly started, and then go up suddenly when that
> process gets OOM-killed.  :-)
>
>                                            - Ted
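
For the loop-mounted image in this thread, the min(fs space available,
bdev space available) idea can be approximated from user space by
comparing the free space on the mounted file system with the free space
on the file system that holds the image file. This is only a sketch: it
assumes GNU coreutils df (for -B1 and --output=avail), the function
names are made up for illustration, and the result is a snapshot that
can change at any moment, which is exactly the variability described
above.

```shell
#!/bin/sh
# Available bytes on the file system containing path $1 (GNU df assumed).
avail_bytes() {
    df -B1 --output=avail "$1" | tail -n 1 | tr -d ' '
}

# min(fs space available, backing-store space available)
effective_avail() {
    fs_avail=$(avail_bytes "$1")       # mount point of the loop file system
    backing_avail=$(avail_bytes "$2")  # image file on the lower file system
    if [ "$fs_avail" -lt "$backing_avail" ]; then
        echo "$fs_avail"
    else
        echo "$backing_avail"
    fi
}

# Example with the paths from this thread:
# effective_avail /var/log /host/disk-img/var-log.ext4
```

The comparison only matters for a sparse image; a fully preallocated
image does not depend on the free space of the lower file system.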

* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-14  4:37           ` Jiany Wu
@ 2025-07-14 13:09             ` Theodore Ts'o
  2025-07-15  1:27               ` Jiany Wu
  0 siblings, 1 reply; 10+ messages in thread
From: Theodore Ts'o @ 2025-07-14 13:09 UTC (permalink / raw)
  To: Jiany Wu; +Cc: Darrick J. Wong, yi.zhang, jack, linux-ext4

On Mon, Jul 14, 2025 at 12:37:21PM +0800, Jiany Wu wrote:
> Hello, Ted,
> 
> Good day, thanks indeed for the clarification~
> Yes, previously I mounted a dedicated ext4 disk image on /var/log
> via /dev/loop1, with rsyslogd writing to /var/log/syslog.
> When /tmp is exhausted manually via fallocate, / also reaches 100%
> usage, rsyslog write errors occur on /dev/loop1, and the file system
> is later remounted read-only. This differs from the earlier scenario,
> and it is not easy to reproduce.
> I updated the test case to not fallocate all of the disk space; with
> 95% allocated, everything is normal and no related errors are printed.
> This confirms that the errors are caused by disk exhaustion.
> I think the main open question is whether fallocate is allowed to
> use the whole disk space.

The fallocate system call is allowed to use the whole space on the
*file system*.  But it doesn't know how much free space is available
in a thin-provisioned device's underlying storage.  If you are using a
loopback-mounted image on a disk and the underlying file system on the
disk fills up, then the block device will have I/O errors --- and then
the file system on the loop device will run into problems, either data
loss or metadata corruption.

So this is working as intended.  If you don't want this, don't use a
loopback mount with a sparse file: either use fallocate when creating
the image file, or don't use a loopback mount.

				- Ted
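
The difference between a sparse image and a preallocated one can be
seen by comparing apparent size with allocated size. A minimal sketch,
assuming GNU coreutils (truncate, du --apparent-size) and util-linux
fallocate on a file system that supports preallocation; the file names,
helper functions, and the 16M size are illustrative only:

```shell
#!/bin/sh
# Compare a sparse image (truncate) with a preallocated one (fallocate).
allocated_kib() { du -k "$1" | cut -f1; }
apparent_kib()  { du -k --apparent-size "$1" | cut -f1; }

dir=$(mktemp -d)
truncate -s 16M "$dir/sparse.img"     # sets the size, allocates no data blocks
fallocate -l 16M "$dir/prealloc.img"  # reserves the blocks up front

for img in "$dir/sparse.img" "$dir/prealloc.img"; do
    echo "$img: apparent $(apparent_kib "$img") KiB," \
         "allocated $(allocated_kib "$img") KiB"
done
```

Writes into the preallocated image cannot fail for lack of space on the
lower file system, because its blocks are already reserved. One caveat,
stated as an assumption: mke2fs may discard blocks when it creates the
file system, which on a regular file can release the reservation again;
the -E nodiscard option skips that step.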

* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-14 13:09             ` Theodore Ts'o
@ 2025-07-15  1:27               ` Jiany Wu
  2025-07-15  3:42                 ` Theodore Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Jiany Wu @ 2025-07-15  1:27 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Darrick J. Wong, yi.zhang, jack, linux-ext4

Hello, Ted,

Thank you for the clarification; it is clear now.
OK, so with a loopback-mounted image on a disk, once the underlying
file system is full the block device will get I/O errors.
This loopback mount is part of a common third-party configuration. As
a first workaround, I'll fallocate less space so that the disk is not
exhausted.
Thanks again for the help:)

Best regards,
Jianyue Wu


* Re: Issue with ext4 filesystem corruption when writing to a file after disk exhaustion
  2025-07-15  1:27               ` Jiany Wu
@ 2025-07-15  3:42                 ` Theodore Ts'o
  0 siblings, 0 replies; 10+ messages in thread
From: Theodore Ts'o @ 2025-07-15  3:42 UTC (permalink / raw)
  To: Jiany Wu; +Cc: Darrick J. Wong, yi.zhang, jack, linux-ext4

On Tue, Jul 15, 2025 at 09:27:01AM +0800, Jiany Wu wrote:
> 
> Thank you for the clarification; it is clear now.
> OK, so with a loopback-mounted image on a disk, once the underlying
> file system is full the block device will get I/O errors.
> This loopback mount is part of a common third-party configuration. As
> a first workaround, I'll fallocate less space so that the disk is not
> exhausted.

The question I'd ask is *why* someone set up that loopback mount in
the first place.  As a guess, perhaps the goal was to restrict the
amount of disk space that could be used for log files in /var/log, so
that if there is runaway logging, all of the free space on the root
partition won't be consumed.

But if that's the case, there are much better ways of achieving the
same goal.  For example, in addition to using log rotation programs,
you could back that up with project quotas, which restrict how much
space can get used in a subdirectory.

Cheers,

					- Ted
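
A rough sketch of what a project-quota setup for /var/log might look
like, using tune2fs and chattr from e2fsprogs and setquota from the
quota tools. The script only prints the commands, since they need root
and an ext4 device, and enabling the feature generally requires the
file system to be unmounted. The device name, project ID, limit, and
function name are placeholders chosen for illustration, not values from
this thread:

```shell
#!/bin/sh
# Print (not run) a possible project-quota setup that caps one directory.
project_quota_plan() {
    dev=$1
    dir=$2
    proj_id=$3
    limit_kib=$4
    # Enable the project feature and project quota (file system unmounted).
    echo "tune2fs -O project -Q prjquota $dev"
    # Tag the directory tree with the project ID; +P makes new files inherit it.
    echo "chattr -R +P -p $proj_id $dir"
    # Hard block limit in KiB; 0 = no soft block limit and no inode limits.
    echo "setquota -P $proj_id 0 $limit_kib 0 0 $dev"
}

project_quota_plan /dev/sdXN /var/log 42 4194304
```

With a limit like this in place, runaway logging hits EDQUOT inside
/var/log instead of consuming the rest of the root partition, and no
loop device is involved.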

Thread overview: 10+ messages
2025-07-11  3:20 Issue with ext4 filesystem corruption when writing to a file after disk exhaustion Jiany Wu
2025-07-11  5:29 ` Theodore Ts'o
2025-07-11  9:56   ` Jiany Wu
2025-07-11 15:40     ` Theodore Ts'o
2025-07-12  4:27       ` Darrick J. Wong
2025-07-12 14:34         ` Theodore Ts'o
2025-07-14  4:37           ` Jiany Wu
2025-07-14 13:09             ` Theodore Ts'o
2025-07-15  1:27               ` Jiany Wu
2025-07-15  3:42                 ` Theodore Ts'o
