Journal under-reservation bug on first >2G file

public inbox for linux-ext4@vger.kernel.org

* Journal under-reservation bug on first >2G file
@ 2014-09-30 21:10 Eric Sandeen
  2014-09-30 21:22 ` Eric Sandeen
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2014-09-30 21:10 UTC (permalink / raw)
  To: ext4 development

Hey all -

So the following testcase will overrun the 1-credit journal reservation
made during a delalloc write in ext4_da_write_begin(), because we
may cross the 2G threshold, and need to modify both the inode and the
superblock in the same transaction.

I see a few ways to fix this:

1) Always set LARGE_FILE on mount if not set.  This will break
   RW compatibility with very old kernels.  Do we care?
2) Bump the reservation to 2 under the fiddly condition of
   large file not yet set but this write might do it
3) bump the delalloc reservation to 2 just in case, always

I'll be happy to write the patch to fix it, just wondering what
people think the best approach is

Thoughts?
-Eric

#!/bin/bash

# A 400m fs won't get the large_file feature, oddly
# enough, because the resize inode will be < 2G.

truncate --size=400m test.img
mkfs.ext4 -F test.img
# This shouldn't have large_file set, exit if it does for some reason
dumpe2fs -h test.img | grep large_file && exit

mkdir -p mnt
mount -o loop test.img mnt

echo "writing 1 byte at 2147483646"
dd if=/dev/zero of=mnt/testfile bs=1 seek=2147483646 count=1 conv=notrunc
sync

# This will make sure i_disksize is on disk, and
# that the buffer will be mapped on the next write.
#
# This is critical because ext4_da_should_update_i_disksize()
# checks buffer_mapped():
#
#        if (!buffer_mapped(bh) || (buffer_delay(bh)) || buffer_unwritten(bh))
#                return 0;
#        return 1;

# This tries to update i_disksize, and also requires a superblock
# update for the large_file feature flag, but only has 1 credit
# available on the delalloc write path

echo "writing 1 byte at 2147483647"
dd if=/dev/zero of=mnt/testfile bs=1 seek=2147483647 count=1 conv=notrunc

# Should go boom, but if not, unmount
umount mnt
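
The two offsets in the script sit on either side of the large_file boundary. A minimal userspace sketch of that arithmetic follows, assuming the kernel treats any size strictly greater than 0x7fffffff as needing large_file; the helper names `size_after_write` and `needs_large_file` are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed threshold: the largest size that does not need large_file
 * is 2^31 - 1 bytes. */
#define LARGE_FILE_THRESHOLD 0x7fffffffULL

/* File size after writing len bytes at offset pos into a file that
 * currently has old_size bytes. */
static uint64_t size_after_write(uint64_t old_size, uint64_t pos, uint64_t len)
{
    assert(pos <= UINT64_MAX - len);    /* offset arithmetic must not wrap */
    uint64_t end = pos + len;
    return end > old_size ? end : old_size;
}

/* True when a file of this size needs the large_file feature flag. */
static bool needs_large_file(uint64_t size)
{
    return size > LARGE_FILE_THRESHOLD;
}
```

Under that assumption, the first dd leaves the file at exactly 2^31 - 1 bytes, which still fits without the flag, and the second dd pushes it one byte past the boundary.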

* Re: Journal under-reservation bug on first >2G file
  2014-09-30 21:10 Journal under-reservation bug on first >2G file Eric Sandeen
@ 2014-09-30 21:22 ` Eric Sandeen
  2014-09-30 21:36   ` Andreas Dilger
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2014-09-30 21:22 UTC (permalink / raw)
  To: ext4 development

On 9/30/14 4:10 PM, Eric Sandeen wrote:
> Hey all -
> 
> So the following testcase will overrun the 1-credit journal reservation
> made during a delalloc write in ext4_da_write_begin(), because we
> may cross the 2G threshold, and need to modify both the inode and the
> superblock in the same transaction.
> 
> I see a few ways to fix this:
> 
> 1) Always set LARGE_FILE on mount if not set.  This will break
>    RW compatibility with very old kernels.  Do we care?

  1.5) Don't update the feature on the fly - we don't for
       HUGE_FILE, either.

  1.5a) Always set the large_file feature with a fresh mkfs, instead
        of relying on the accident of the resize inode being > 2G!

> 2) Bump the reservation to 2 under the fiddly condition of
>    large file not yet set but this write might do it
> 3) bump the delalloc reservation to 2 just in case, always
> 
> I'll be happy to write the patch to fix it, just wondering what
> people think the best approach is
> 
> Thoughts?
> -Eric
> 

* Re: Journal under-reservation bug on first >2G file
  2014-09-30 21:22 ` Eric Sandeen
@ 2014-09-30 21:36   ` Andreas Dilger
  2014-09-30 22:10     ` Darrick J. Wong
  2014-10-01 11:53     ` Theodore Ts'o
  0 siblings, 2 replies; 11+ messages in thread
From: Andreas Dilger @ 2014-09-30 21:36 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: ext4 development

On Sep 30, 2014, at 3:22 PM, Eric Sandeen <sandeen@redhat.com> wrote:
> On 9/30/14 4:10 PM, Eric Sandeen wrote:
>> Hey all -
>> 
>> So the following testcase will overrun the 1-credit journal reservation
>> made during a delalloc write in ext4_da_write_begin(), because we
>> may cross the 2G threshold, and need to modify both the inode and the
>> superblock in the same transaction.
>> 
>> I see a few ways to fix this:
>> 
>> 1) Always set LARGE_FILE on mount if not set.  This will break
>>   RW compatibility with very old kernels.  Do we care?
> 
>  1.5) Don't update the feature on the fly - we don't for
>       HUGE_FILE, either.
> 
>  1.5a) Always set the large_file feature with a fresh mkfs, instead
>        of relying on the accident of the resize inode being > 2G!

I think that 1.5a is definitely the way to go for new mke2fs, I'm a
bit surprised that we didn't do this for "-t ext4" a long time ago
given that we've enabled lots of other features automatically.

There shouldn't be any problem to do this retroactively in e2fsck
and potentially at mount time for filesystems that already have some
features enabled that are post-large_file (e.g. extents, flex_bg, etc.)
This definitely would not impose any compatibility issues, because any
kernel that supports those features already understands large_file.

I'm pretty sure that e2fsck doesn't turn off large_file automatically
anymore if it can't find any files over 2GB, but it is worthwhile to
verify this.

>> 2) Bump the reservation to 2 under the fiddly condition of
>>   large file not yet set but this write might do it
>> 3) bump the delalloc reservation to 2 just in case, always

Given how many other reservations we have for normal operations,
I don't think it is so bad to reserve an extra block if the
large_file feature isn't enabled yet.  This could be fine tuned
based on the size and offset of the write, but I'm not sure if
the extra complexity warrants it.

It doesn't make sense to reserve this block if the feature
is already set, and I don't think that there are (m)any features
that are turned on automatically by the kernel anymore so it is
overhead to reserve the block if you know it won't be needed.

I don't know if this is belt and suspenders, but it might be
something to consider for supporting older kernels and we may not
need it in newer kernels.

Cheers, Andreas
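
A minimal userspace sketch of the reservation choice discussed above, not kernel code. The names `da_write_credits` and `da_write_credits_tuned` are hypothetical, and the 0x7fffffff threshold is an assumption about where large_file becomes necessary. The first helper reserves the extra block whenever large_file is not yet set; the second is the fine-tuned variant that also looks at where the write ends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed threshold: sizes above 2^31 - 1 need large_file. */
#define LARGE_FILE_THRESHOLD 0x7fffffffULL

/* One credit for the inode, plus one for the superblock when the
 * feature flag might still have to be set by this write. */
static int da_write_credits(bool has_large_file)
{
    return has_large_file ? 1 : 2;
}

/* Fine-tuned variant: reserve the extra credit only when the write
 * itself would end beyond the threshold. */
static int da_write_credits_tuned(bool has_large_file,
                                  uint64_t pos, uint64_t len)
{
    assert(pos <= UINT64_MAX - len);    /* offset arithmetic must not wrap */
    if (has_large_file)
        return 1;
    return (pos + len > LARGE_FILE_THRESHOLD) ? 2 : 1;
}
```

The tuned variant saves one credit on most writes to filesystems without large_file, at the cost of the extra condition; whether that is worth the complexity is the open question in the paragraph above.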


* Re: Journal under-reservation bug on first >2G file
  2014-09-30 21:36   ` Andreas Dilger
@ 2014-09-30 22:10     ` Darrick J. Wong
  2014-10-01 11:53     ` Theodore Ts'o
  1 sibling, 0 replies; 11+ messages in thread
From: Darrick J. Wong @ 2014-09-30 22:10 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Eric Sandeen, ext4 development

On Tue, Sep 30, 2014 at 03:36:17PM -0600, Andreas Dilger wrote:
> On Sep 30, 2014, at 3:22 PM, Eric Sandeen <sandeen@redhat.com> wrote:
> > On 9/30/14 4:10 PM, Eric Sandeen wrote:
> >> Hey all -
> >> 
> >> So the following testcase will overrun the 1-credit journal reservation
> >> made during a delalloc write in ext4_da_write_begin(), because we
> >> may cross the 2G threshold, and need to modify both the inode and the
> >> superblock in the same transaction.
> >> 
> >> I see a few ways to fix this:
> >> 
> >> 1) Always set LARGE_FILE on mount if not set.  This will break
> >>   RW compatibility with very old kernels.  Do we care?
> > 
> >  1.5) Don't update the feature on the fly - we don't for
> >       HUGE_FILE, either.
> > 
> >  1.5a) Always set the large_file feature with a fresh mkfs, instead
> >        of relying on the accident of the resize inode being > 2G!
> 
> I think that 1.5a is definitely the way to go for new mke2fs, I'm a
> bit surprised that we didn't do this for "-t ext4" a long time ago
> given that we've enabled lots of other features automatically.

Sounds good to me.

> There shouldn't be any problem to do this retroactively in e2fsck
> and potentially at mount time for filesystems that already have some
> features enabled that are post-large_file (e.g. extents, flex_bg, etc.)
> This definitely would not impose any compatibility issues, because any
> kernel that supports those features already understands large_file.
> 
> I'm pretty sure that e2fsck doesn't turn off large_file automatically
> anymore if it can't find any files over 2GB, but it is worthwhile to
> verify this.

It doesn't.

> >> 2) Bump the reservation to 2 under the fiddly condition of
> >>   large file not yet set but this write might do it
> >> 3) bump the delalloc reservation to 2 just in case, always
> 
> Given how many other reservations we have for normal operations,
> I don't think it is so bad to reserve an extra block if the
> large_file feature isn't enabled yet.  This could be fine tuned
> based on the size and offset of the write, but I'm not sure if
> the extra complexity warrants it.
> 
> It doesn't make sense to reserve this block if the feature
> is already set, and I don't think that there are (m)any features
> that are turned on automatically by the kernel anymore so it is
> overhead to reserve the block if you know it won't be needed.
> 
> I don't know if this is belt and suspenders, but it might be
> something to consider for supporting older kernels and we may not
> need it in newer kernels.

1.5a and (2 if ^large_file) seem fine to me.

--D

* Re: Journal under-reservation bug on first >2G file
  2014-09-30 21:36   ` Andreas Dilger
  2014-09-30 22:10     ` Darrick J. Wong
@ 2014-10-01 11:53     ` Theodore Ts'o
  2014-10-01 14:43       ` Eric Sandeen
  1 sibling, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2014-10-01 11:53 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Eric Sandeen, ext4 development

On Tue, Sep 30, 2014 at 03:36:17PM -0600, Andreas Dilger wrote:
> > 
> >  1.5a) Always set the large_file feature with a fresh mkfs, instead
> >        of relying on the accident of the resize inode being > 2G!
> 
> I think that 1.5a is definitely the way to go for new mke2fs, I'm a
> bit surprised that we didn't do this for "-t ext4" a long time ago
> given that we've enabled lots of other features automatically.

Yes, I agree that would be a good thing to do.  I'll make the change
to mke2fs.conf.

> There shouldn't be any problem to do this retroactively in e2fsck
> and potentially at mount time for filesystems that already have some
> features enabled that are post-large_file (e.g. extents, flex_bg, etc.)
> This definitely would not impose any compatibility issues, because any
> kernel that supports those features already understands large_file.

That sounds like a plan.  If we only enable it automatically at mount
time (iff we mounted the file system read/write) if any of the ext3 or
ext4 specific features are enabled, that should be completely safe.

In fact the only reason why we shouldn't turn it on unconditionally is
because there are other implementations of ext2 (most notably, GNU
Hurd and *BSD) which might not support large_file.

						- Ted
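
The mount-time rule described above can be sketched as a single predicate. This is a userspace illustration only; `struct mount_info` and `should_set_large_file_at_mount` are hypothetical names, and how "any ext3 or ext4 specific feature" would actually be tested in the kernel is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the mount state relevant to the decision. */
struct mount_info {
    bool read_write;          /* filesystem is mounted read/write */
    bool has_large_file;      /* large_file is already set on disk */
    bool has_ext34_feature;   /* any ext3- or ext4-specific feature is set */
};

/* Set large_file at mount only when it is missing, the mount is
 * read/write (so the superblock may be modified), and the filesystem
 * already uses a feature that a plain ext2 implementation would not
 * understand anyway. */
static bool should_set_large_file_at_mount(const struct mount_info *mi)
{
    assert(mi != NULL);
    return mi->read_write && !mi->has_large_file && mi->has_ext34_feature;
}
```

A plain ext2 filesystem, such as one shared with Hurd or a BSD, never satisfies the last condition, so it would be left untouched.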

* Re: Journal under-reservation bug on first >2G file
  2014-10-01 11:53     ` Theodore Ts'o
@ 2014-10-01 14:43       ` Eric Sandeen
  2014-10-01 19:59         ` Theodore Ts'o
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2014-10-01 14:43 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger; +Cc: ext4 development

On 10/1/14 6:53 AM, Theodore Ts'o wrote:
> On Tue, Sep 30, 2014 at 03:36:17PM -0600, Andreas Dilger wrote:
>>>
>>>  1.5a) Always set the large_file feature with a fresh mkfs, instead
>>>        of relying on the accident of the resize inode being > 2G!
>>
>> I think that 1.5a is definitely the way to go for new mke2fs, I'm a
>> bit surprised that we didn't do this for "-t ext4" a long time ago
>> given that we've enabled lots of other features automatically.
> 
> Yes, I agree that would be a good thing to do.  I'll make the change
> to mke2fs.conf.
> 
>> There shouldn't be any problem to do this retroactively in e2fsck
>> and potentially at mount time for filesystems that already have some
>> features enabled that are post-large_file (e.g. extents, flex_bg, etc.)
>> This definitely would not impose any compatibility issues, because any
>> kernel that supports those features already understands large_file.
> 
> That sounds like a plan.  If we only enable it automatically at mount
> time (iff we mounted the file system read/write) if any of the ext3 or
> ext4 specific features are enabled, that should be completely safe.

Ok, so do that, and don't bump the reservations? I suppose
the size test & superblock write can be removed, then...

This does bug me a little; at one point we were very carefully not
enabling any new features by mounting with a new kernel; that was
specific to mounting-ext2-with-ext4 etc, but it still feels slightly
inconsistent.  Although I guess we enable it today by mounting-and-
writing-a-big-enough-file.

Something like this should fix it too, though, with less unexpected
behind-your-back behavior:

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3aa26e9..2f94cd6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2563,9 +2563,15 @@ retry_grab:
         * if there is delayed block allocation. But we still need
         * to journalling the i_disksize update if writes to the end
         * of file which has an already mapped buffer.
+        * If this write might need to update the superblock due to the
+        * filesize adding a new superblock feature flag, add that too.
         */
 retry_journal:
-       handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, 1);
+       handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
+                                   EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+                                       EXT4_FEATURE_RO_COMPAT_LARGE_FILE) ?
+                                   1 : 2);
+
        if (IS_ERR(handle)) {
                page_cache_release(page);
                return PTR_ERR(handle);


-Eric

* Re: Journal under-reservation bug on first >2G file
  2014-10-01 14:43       ` Eric Sandeen
@ 2014-10-01 19:59         ` Theodore Ts'o
  2014-10-01 20:37           ` Eric Sandeen
  0 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2014-10-01 19:59 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Andreas Dilger, ext4 development

On Wed, Oct 01, 2014 at 09:43:32AM -0500, Eric Sandeen wrote:
> > That sounds like a plan.  If we only enable it automatically at mount
> > time (iff we mounted the file system read/write) if any of the ext3 or
> > ext4 specific features are enabled, that should be completely safe.
> 
> Ok, so do that, and don't bump the reservations? I suppose
> the size test & superblock write can be removed, then...
> 
> This does bug me a little; at one point we were very carefully not
> enabling any new features by mounting with a new kernel; that was
> specific to mounting-ext2-with-ext4 etc, but it still feels slightly
> inconsistent.  Although I guess we enable it today by mounting-and-
> writing-a-big-enough-file.

Yeah, this behaviour was one that dates back a *long* time, before we
established the rule that we don't enable any new features
automatically.  If this was a new feature, I wouldn't be advocating
this.  But if we change this now, we could introduce a regression, or
at least a surprising breakage.

> Something like this should fix it too, though, with less unexpected
> behind-your-back behavior:
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3aa26e9..2f94cd6 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2563,9 +2563,15 @@ retry_grab:
>          * if there is delayed block allocation. But we still need
>          * to journalling the i_disksize update if writes to the end
>          * of file which has an already mapped buffer.
> +        * If this write might need to update the superblock due to the
> +        * filesize adding a new superblock feature flag, add that too.
>          */
>  retry_journal:
> -       handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, 1);
> +       handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
> +                                   EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> +                                       EXT4_FEATURE_RO_COMPAT_LARGE_FILE) ?
> +                                   1 : 2);
> +

Yes, I suppose that would work as well.  It means that file systems
which don't have LARGE_FILE will waste a bit more space in the
journal, causing the journal to potentially close prematurely.

The code would be a bit simpler if we removed "set only if i_size has
gotten too big", and replaced it with a "set it unconditionally at
mount time".  So there are tradeoffs with either approach.  At this
point I'm slightly in favor of enabling it by default if ext4 features
are enabled, either in the kernel or in e2fsck.  And if we're
going to do that, doing it in the kernel is more foolproof, and it
will have the same net result.

				- Ted

* Re: Journal under-reservation bug on first >2G file
  2014-10-01 19:59         ` Theodore Ts'o
@ 2014-10-01 20:37           ` Eric Sandeen
  2014-10-01 22:43             ` Theodore Ts'o
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2014-10-01 20:37 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andreas Dilger, ext4 development

On 10/1/14 2:59 PM, Theodore Ts'o wrote:
> On Wed, Oct 01, 2014 at 09:43:32AM -0500, Eric Sandeen wrote:
>>> That sounds like a plan.  If we only enable it automatically at mount
>>> time (iff we mounted the file system read/write) if any of the ext3 or
>>> ext4 specific features are enabled, that should be completely safe.
>>
>> Ok, so do that, and don't bump the reservations? I suppose
>> the size test & superblock write can be removed, then...
>>
>> This does bug me a little; at one point we were very carefully not
>> enabling any new features by mounting with a new kernel; that was
>> specific to mounting-ext2-with-ext4 etc, but it still feels slightly
>> inconsistent.  Although I guess we enable it today by mounting-and-
>> writing-a-big-enough-file.
> 
> Yeah, this behaviour was one that dates back a *long* time, before we
> established the rule that we don't enable any new features
> automatically.  If this was a new feature, I wouldn't be advocating
> this.  But if we change this now, we could introduce a regression, or
> at least a surprising breakage.
> 
>> Something like this should fix it too, though, with less unexpected
>> behind-your-back behavior:
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 3aa26e9..2f94cd6 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -2563,9 +2563,15 @@ retry_grab:
>>          * if there is delayed block allocation. But we still need
>>          * to journalling the i_disksize update if writes to the end
>>          * of file which has an already mapped buffer.
>> +        * If this write might need to update the superblock due to the
>> +        * filesize adding a new superblock feature flag, add that too.
>>          */
>>  retry_journal:
>> -       handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, 1);
>> +       handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
>> +                                   EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
>> +                                       EXT4_FEATURE_RO_COMPAT_LARGE_FILE) ?
>> +                                   1 : 2);
>> +
> 
> Yes, I suppose that would work as well.  It means that file systems
> which don't have LARGE_FILE will waste a bit more space in the
> journal, causing the journal to potentially close prematurely.
> 
> The code would be a bit simpler if we removed "set only if i_size has
> gotten too big", and replaced it with a "set it unconditionally at
> mount time".  So there are tradeoffs with either approach.  At this
> point I'm slightly in favor of enabling it by default if ext4 features
> are enabled, either in the kernel or in e2fsck.  And if we're
> going to do that, doing it in the kernel is more foolproof, and it
> will have the same net result.

Ok.  I guess this is only an issue for ext4 - well, at least this specific
issue.  Delalloc makes it much different than ext2 & ext3, which reserve quite a
lot more.  Whether there's a corner case over there which breaks, I dunno...

So it seems like the simplest test is simply: Are we RW mounted with delalloc?
And if so, update the feature.  Seems simpler than mucking with "which features
are unique to ext4"

(because we could be mounting ext3-with-ext4, having no ext4-specific features,
and still hit the problem right?   ... test test test ... right.)

I'll whip that up.

Thanks,
-Eric

* Re: Journal under-reservation bug on first >2G file
  2014-10-01 20:37           ` Eric Sandeen
@ 2014-10-01 22:43             ` Theodore Ts'o
  2014-10-02  5:49               ` Eric Sandeen
  0 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2014-10-01 22:43 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Andreas Dilger, ext4 development

On Wed, Oct 01, 2014 at 03:37:17PM -0500, Eric Sandeen wrote:
> 
> Ok.  I guess this is only an issue for ext4 - well, at least this specific
> issue.  Delalloc makes it much different than ext2 & ext3, which reserve quite a
> lot more.  Whether there's a corner case over there which breaks, I dunno...
> 
> So it seems like the simplest test is simply: Are we RW mounted with delalloc?
> And if so, update the feature.  Seems simpler than mucking with "which features
> are unique to ext4"

I'd do "are we RW mounted with the extents feature".  That way we
don't need to worry about someone accidentally mounting a partition
meant for Hurd using ext4, which would imply delalloc, and then
causing Hurd to no longer be able to deal with the file system.  That
*shouldn't* happen, but in case someone accidentally mounts the file
system with -t ext4, it seems safer to gate it on the existence of the
extents feature.

						- Ted

* Re: Journal under-reservation bug on first >2G file
  2014-10-01 22:43             ` Theodore Ts'o
@ 2014-10-02  5:49               ` Eric Sandeen
  2014-10-02 11:26                 ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2014-10-02  5:49 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andreas Dilger, ext4 development

On 10/1/14 5:43 PM, Theodore Ts'o wrote:
> On Wed, Oct 01, 2014 at 03:37:17PM -0500, Eric Sandeen wrote:
>>
>> Ok.  I guess this is only an issue for ext4 - well, at least this specific
>> issue.  Delalloc makes it much different than ext2 & ext3, which reserve quite a
>> lot more.  Whether there's a corner case over there which breaks, I dunno...
>>
>> So it seems like the simplest test is simply: Are we RW mounted with delalloc?
>> And if so, update the feature.  Seems simpler than mucking with "which features
>> are unique to ext4"
> 
> I'd do "are we RW mounted with the extents feature".  That way we
> don't need to worry about someone accidentally mounting a partition
> meant for Hurd using ext4, which would imply delalloc, and then
> causing Hurd to no longer be able to deal with the file system.  That
> *shouldn't* happen, but in case someone accidentally mounts the file
> system with -t ext4, it seems safer to gate it on the existence of the
> extents feature.

Problem is, we can hit the same problem with an ext3 filesystem (no
extents) mounted with -t ext4 (enabling delalloc).

Ugh.  Can't we just bump the da write reservation to 2 and be done with it? ;)
(AFAICT the non-delalloc reservations can be wildly overestimated).

Or maybe ext4_journal_extend() when we try to update the superblock?
It could fail, but it wouldn't be catastrophic if it did, fsck would find
that the feature is missing...

-Eric
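
The gap in the extents gate can be shown with a small truth-table sketch. It is illustrative only: `struct fs_state` and both helpers are hypothetical names, and the state is the one seen at mount time, before any fixup is applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of a filesystem at mount time. */
struct fs_state {
    bool has_extents;      /* extents feature set on disk */
    bool has_large_file;   /* large_file feature set on disk */
    bool delalloc;         /* mounted with delayed allocation */
};

/* The 1-credit delalloc reservation can be overrun on any delalloc
 * mount where large_file still has to be set by a later write. */
static bool exposed_to_underreservation(const struct fs_state *fs)
{
    assert(fs != NULL);
    return fs->delalloc && !fs->has_large_file;
}

/* Proposed gate: set large_file at mount only if extents is set. */
static bool covered_by_extents_gate(const struct fs_state *fs)
{
    assert(fs != NULL);
    return fs->has_extents;
}
```

An ext3-format filesystem mounted with -t ext4 has delalloc but no extents, so it is exposed and not covered; that is the case the gate misses.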


* Re: Journal under-reservation bug on first >2G file
  2014-10-02  5:49               ` Eric Sandeen
@ 2014-10-02 11:26                 ` Jan Kara
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2014-10-02 11:26 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Theodore Ts'o, Andreas Dilger, ext4 development

On Thu 02-10-14 00:49:09, Eric Sandeen wrote:
> On 10/1/14 5:43 PM, Theodore Ts'o wrote:
> > On Wed, Oct 01, 2014 at 03:37:17PM -0500, Eric Sandeen wrote:
> >>
> >> Ok.  I guess this is only an issue for ext4 - well, at least this specific
> >> issue.  Delalloc makes it much different than ext2 & ext3, which reserve quite a
> >> lot more.  Whether there's a corner case over there which breaks, I dunno...
> >>
> >> So it seems like the simplest test is simply: Are we RW mounted with delalloc?
> >> And if so, update the feature.  Seems simpler than mucking with "which features
> >> are unique to ext4"
> > 
> > I'd do "are we RW mounted with the extents feature".  That way we
> > don't need to worry about someone accidentally mounting a partition
> > meant for Hurd using ext4, which would imply delalloc, and then
> > causing Hurd to no longer be able to deal with the file system.  That
> > *shouldn't* happen, but in case someone accidentally mounts the file
> > system with -t ext4, it seems safer to gate it on the existence of the
> > extents feature.
> 
> Problem is, we can hit the same problem with an ext3 filesystem (no
> extents) mounted with -t ext4 (enabling delalloc).
> 
> Ugh.  Can't we just bump the da write reservation to 2 and be done with it? ;)
> (AFAICT the non-delalloc reservations can be wildly overestimated).
> 
> Or maybe ext4_journal_extend() when we try to update the superblock?
> It could fail, but it wouldn't be catastrophic if it did, fsck would find
> that the feature is missing...
  A couple of notes:
1) Using 2 would be fine. Journal code is clever enough and it returns
unused handle credits to the transaction so using 2 instead of 1 limits
only the number of handles in ext4_da_write_begin() running in parallel.
So I'd frankly just bump the number to 2 (with a comment!) and be done with
it.

2) If we want to optimize a bit, we can check whether the write is going to
extend beyond 2G and first set the feature in a separate transaction.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
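
Note 2 above could be sketched roughly as follows, as a userspace model rather than kernel code. `struct write_plan` and `plan_da_write` are hypothetical, and the 0x7fffffff threshold is an assumption about where large_file becomes necessary. The idea is that the write keeps its single credit, and the superblock update moves to its own transaction only when the write would cross the boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed threshold: sizes above 2^31 - 1 need large_file. */
#define LARGE_FILE_THRESHOLD 0x7fffffffULL

/* Hypothetical plan for one delalloc write. */
struct write_plan {
    bool set_feature_first;   /* update the superblock in a separate transaction */
    int  write_credits;       /* credits reserved for the write itself */
};

static struct write_plan plan_da_write(bool has_large_file,
                                       uint64_t pos, uint64_t len)
{
    struct write_plan plan = { false, 1 };

    assert(pos <= UINT64_MAX - len);    /* offset arithmetic must not wrap */
    if (!has_large_file && pos + len > LARGE_FILE_THRESHOLD)
        plan.set_feature_first = true;
    return plan;
}
```

Compared with always reserving 2 credits (note 1), this keeps the common path at 1 credit but adds a second transaction on the rare write that first crosses 2G.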
