linux-fsdevel.vger.kernel.org archive mirror
* Re: [PATCH] mm: fix cpu hangs on truncating last page of a 16t sparse file
       [not found] <560723F8.3010909@gmail.com>
@ 2015-09-27  1:36 ` Hugh Dickins
  2015-09-27  2:14   ` angelo
  2015-09-27  2:21   ` angelo
  0 siblings, 2 replies; 9+ messages in thread
From: Hugh Dickins @ 2015-09-27  1:36 UTC
  To: angelo; +Cc: linux-fsdevel, linux-kernel, linux-mm

Let's Cc linux-fsdevel, who will be more knowledgeable.

On Sun, 27 Sep 2015, angelo wrote:

> Hi all,
> 
> running xfstests, generic/308 on whatever 32-bit arch, it is possible
> to observe the cpu hanging near 100% on unlink.
> The test removes a sparse file of length 16TB where only the last
> 4096-byte block is mapped.
> At line 265 of truncate.c there is a
> if (index >= end)
>     break;
> But if index is, as in this case, 4294967295, it matches the -1 used as
> EOF. Hence the cpu loops at 100% just after.

That's odd.  I've not checked your patch, because I think the problem
would go beyond truncate, and the root cause lie elsewhere.

My understanding is that the 32-bit
#define MAX_LFS_FILESIZE (((loff_t)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1) 
makes a page->index of -1 (or any "negative") impossible to reach.
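
To make that concrete, a minimal userspace sketch of the arithmetic
(hypothetical demo code - 4096-byte pages and a 32-bit unsigned long
standing in for pgoff_t, not the kernel's actual types):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* 32-bit MAX_LFS_FILESIZE: ((loff_t)4096 << 31) - 1 = 0x7ff ffffffff */
        long long max_lfs = (4096LL << 31) - 1;
        /* highest page index reachable under that limit */
        uint32_t max_index = (uint32_t)(max_lfs >> 12);

        /* prints 0x7fffffff vs 0xffffffff: the -1 sentinel stays out of reach */
        printf("max index %#x, (pgoff_t)-1 %#x\n", max_index, (uint32_t)-1);
        return 0;
    }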

I don't know offhand the rules for mounting a filesystem populated with
a 64-bit kernel on a 32-bit kernel, what's to happen when a too-large
file is encountered; but assume that's not the case here - you're
just running xfstests/tests/generic/308.

Is pwrite missing a check for offset beyond s_maxbytes?

Or is this filesystem-dependent?  Which filesystem?

Hugh

> 
> -------------------
> 
> On 32-bit archs, with CONFIG_LBDAF=y, when truncating the last page
> of a 16TB file, the "index" variable is set to 4294967295, and hence
> matches the -1 used as the EOF value. This results in an infinite loop
> when unlink is executed on this file.
> 
> Signed-off-by: Angelo Dureghello <angelo@sysam.it>
> ---
>  mm/truncate.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 76e35ad..3751034 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -283,14 +283,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
>                 pagevec_remove_exceptionals(&pvec);
>                 pagevec_release(&pvec);
>                 cond_resched();
> -               index++;
> +               if (index < end)
> +                       index++;
>         }
> 
>         if (partial_start) {
>                 struct page *page = find_lock_page(mapping, start - 1);
>                 if (page) {
>                         unsigned int top = PAGE_CACHE_SIZE;
> -                       if (start > end) {
> +                       if (start > end && end != -1) {
>                                 /* Truncation within a single page */
>                                 top = partial_end;
>                                 partial_end = 0;
> @@ -322,7 +323,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>          * If the truncation happened within a single page no pages
>          * will be released, just zeroed, so we can bail out now.
>          */
> -       if (start >= end)
> +       if (start >= end && end != -1)
>                 return;
> 
>         index = start;
> @@ -337,7 +338,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>                         index = start;
>                         continue;
>                 }
> -               if (index == start && indices[0] >= end) {
> +               if (index == start && (indices[0] >= end && end != -1)) {
>                         /* All gone out of hole to be punched, we're done */
>                         pagevec_remove_exceptionals(&pvec);
>                         pagevec_release(&pvec);
> @@ -348,7 +349,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
> 
>                         /* We rely upon deletion not changing page->index */
>                         index = indices[i];
> -                       if (index >= end) {
> +                       if (index >= end && (end != -1)) {
>                                 /* Restart punch to make sure all gone */
>                                 index = start - 1;
>                                 break;
> -- 
> 2.5.3



* Re: [PATCH] mm: fix cpu hangs on truncating last page of a 16t sparse file
  2015-09-27  1:36 ` [PATCH] mm: fix cpu hangs on truncating last page of a 16t sparse file Hugh Dickins
@ 2015-09-27  2:14   ` angelo
  2015-09-27  2:21   ` angelo
  1 sibling, 0 replies; 9+ messages in thread
From: angelo @ 2015-09-27  2:14 UTC
  To: Hugh Dickins; +Cc: linux-fsdevel, linux-kernel, linux-mm

Hi Hugh,

thanks for the fast reply.

Looks like the XFS file system can support files up to 16TB
when CONFIG_LBDAF is enabled.

On XFS on a 32-bit arch, s_maxbytes is actually set (with CONFIG_LBDAF=y)
to 17592186044415.

But if s_maxbytes must not be greater than MAX_LFS_FILESIZE,
I agree the issue should be fixed in the layers above.

The fact is that everything still works correctly up to an offset of
17592186044415 - 4096, and there may be users who already
have such big files in use in their setups.
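
For what it's worth, the collision can be shown in a few lines of
userspace C (a hypothetical standalone demo, assuming 4096-byte pages
and a 32-bit pgoff_t):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        long long s_maxbytes = 17592186044415LL;  /* 2^44 - 1, XFS with CONFIG_LBDAF=y */
        long long last_block = s_maxbytes - 4095; /* offset of the last 4096-byte block */
        uint32_t index = (uint32_t)(last_block >> 12);

        printf("%#x\n", index);  /* prints 0xffffffff: the (pgoff_t)-1 used as EOF */
        return 0;
    }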

What do you think?

Best regards
Angelo Dureghello


On 27/09/2015 03:36, Hugh Dickins wrote:
> Let's Cc linux-fsdevel, who will be more knowledgeable.
>
> On Sun, 27 Sep 2015, angelo wrote:
>
>> Hi all,
>>
>> running xfstests, generic/308 on whatever 32-bit arch, it is possible
>> to observe the cpu hanging near 100% on unlink.
>> The test removes a sparse file of length 16TB where only the last
>> 4096-byte block is mapped.
>> At line 265 of truncate.c there is a
>> if (index >= end)
>>      break;
>> But if index is, as in this case, 4294967295, it matches the -1 used as
>> EOF. Hence the cpu loops at 100% just after.
> That's odd.  I've not checked your patch, because I think the problem
> would go beyond truncate, and the root cause lie elsewhere.
>
> My understanding is that the 32-bit
> #define MAX_LFS_FILESIZE (((loff_t)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1)
> makes a page->index of -1 (or any "negative") impossible to reach.
>
> I don't know offhand the rules for mounting a filesystem populated with
> a 64-bit kernel on a 32-bit kernel, what's to happen when a too-large
> file is encountered; but assume that's not the case here - you're
> just running xfstests/tests/generic/308.
>
> Is pwrite missing a check for offset beyond s_maxbytes?
>
> Or is this filesystem-dependent?  Which filesystem?
>
> Hugh
>
>> -------------------
>>
>> On 32-bit archs, with CONFIG_LBDAF=y, when truncating the last page
>> of a 16TB file, the "index" variable is set to 4294967295, and hence
>> matches the -1 used as the EOF value. This results in an infinite loop
>> when unlink is executed on this file.
>>
>> Signed-off-by: Angelo Dureghello <angelo@sysam.it>
>> ---
>>   mm/truncate.c | 11 ++++++-----
>>   1 file changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 76e35ad..3751034 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -283,14 +283,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
>>                  pagevec_remove_exceptionals(&pvec);
>>                  pagevec_release(&pvec);
>>                  cond_resched();
>> -               index++;
>> +               if (index < end)
>> +                       index++;
>>          }
>>
>>          if (partial_start) {
>>                  struct page *page = find_lock_page(mapping, start - 1);
>>                  if (page) {
>>                          unsigned int top = PAGE_CACHE_SIZE;
>> -                       if (start > end) {
>> +                       if (start > end && end != -1) {
>>                                  /* Truncation within a single page */
>>                                  top = partial_end;
>>                                  partial_end = 0;
>> @@ -322,7 +323,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>>           * If the truncation happened within a single page no pages
>>           * will be released, just zeroed, so we can bail out now.
>>           */
>> -       if (start >= end)
>> +       if (start >= end && end != -1)
>>                  return;
>>
>>          index = start;
>> @@ -337,7 +338,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>>                          index = start;
>>                          continue;
>>                  }
>> -               if (index == start && indices[0] >= end) {
>> +               if (index == start && (indices[0] >= end && end != -1)) {
>>                          /* All gone out of hole to be punched, we're done */
>>                          pagevec_remove_exceptionals(&pvec);
>>                          pagevec_release(&pvec);
>> @@ -348,7 +349,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>>
>>                          /* We rely upon deletion not changing page->index */
>>                          index = indices[i];
>> -                       if (index >= end) {
>> +                       if (index >= end && (end != -1)) {
>>                                  /* Restart punch to make sure all gone */
>>                                  index = start - 1;
>>                                  break;
>> -- 
>> 2.5.3



* Re: [PATCH] mm: fix cpu hangs on truncating last page of a 16t sparse file
  2015-09-27  1:36 ` [PATCH] mm: fix cpu hangs on truncating last page of a 16t sparse file Hugh Dickins
  2015-09-27  2:14   ` angelo
@ 2015-09-27  2:21   ` angelo
  2015-09-27 17:59     ` Hugh Dickins
  1 sibling, 1 reply; 9+ messages in thread
From: angelo @ 2015-09-27  2:21 UTC
  To: linux-kernel; +Cc: Hugh Dickins, linux-fsdevel, linux-mm

Hi Hugh,

thanks for the fast reply.

Looks like the XFS file system can support files up to 16TB
when CONFIG_LBDAF is enabled.

On XFS on a 32-bit arch, s_maxbytes is actually set (with CONFIG_LBDAF=y)
to 17592186044415.

But if s_maxbytes must not be greater than MAX_LFS_FILESIZE,
I agree the issue should be fixed in the layers above.

The fact is that everything still works correctly up to an offset of
17592186044415 - 4096, and there may be users who already
have such big files in use in their setups.

What do you think?

Best regards
Angelo Dureghello

On 27/09/2015 03:36, Hugh Dickins wrote:
> Let's Cc linux-fsdevel, who will be more knowledgeable.
>
> On Sun, 27 Sep 2015, angelo wrote:
>
>> Hi all,
>>
>> running xfstests, generic/308 on whatever 32-bit arch, it is possible
>> to observe the cpu hanging near 100% on unlink.
>> The test removes a sparse file of length 16TB where only the last
>> 4096-byte block is mapped.
>> At line 265 of truncate.c there is a
>> if (index >= end)
>>      break;
>> But if index is, as in this case, 4294967295, it matches the -1 used as
>> EOF. Hence the cpu loops at 100% just after.
> That's odd.  I've not checked your patch, because I think the problem
> would go beyond truncate, and the root cause lie elsewhere.
>
> My understanding is that the 32-bit
> #define MAX_LFS_FILESIZE (((loff_t)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1)
> makes a page->index of -1 (or any "negative") impossible to reach.
>
> I don't know offhand the rules for mounting a filesystem populated with
> a 64-bit kernel on a 32-bit kernel, what's to happen when a too-large
> file is encountered; but assume that's not the case here - you're
> just running xfstests/tests/generic/308.
>
> Is pwrite missing a check for offset beyond s_maxbytes?
>
> Or is this filesystem-dependent?  Which filesystem?
>
> Hugh
>
>> -------------------
>>
>> On 32-bit archs, with CONFIG_LBDAF=y, when truncating the last page
>> of a 16TB file, the "index" variable is set to 4294967295, and hence
>> matches the -1 used as the EOF value. This results in an infinite loop
>> when unlink is executed on this file.
>>
>> Signed-off-by: Angelo Dureghello <angelo@sysam.it>
>> ---
>>   mm/truncate.c | 11 ++++++-----
>>   1 file changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 76e35ad..3751034 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -283,14 +283,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
>>                  pagevec_remove_exceptionals(&pvec);
>>                  pagevec_release(&pvec);
>>                  cond_resched();
>> -               index++;
>> +               if (index < end)
>> +                       index++;
>>          }
>>
>>          if (partial_start) {
>>                  struct page *page = find_lock_page(mapping, start - 1);
>>                  if (page) {
>>                          unsigned int top = PAGE_CACHE_SIZE;
>> -                       if (start > end) {
>> +                       if (start > end && end != -1) {
>>                                  /* Truncation within a single page */
>>                                  top = partial_end;
>>                                  partial_end = 0;
>> @@ -322,7 +323,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>>           * If the truncation happened within a single page no pages
>>           * will be released, just zeroed, so we can bail out now.
>>           */
>> -       if (start >= end)
>> +       if (start >= end && end != -1)
>>                  return;
>>
>>          index = start;
>> @@ -337,7 +338,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>>                          index = start;
>>                          continue;
>>                  }
>> -               if (index == start && indices[0] >= end) {
>> +               if (index == start && (indices[0] >= end && end != -1)) {
>>                          /* All gone out of hole to be punched, we're done */
>>                          pagevec_remove_exceptionals(&pvec);
>>                          pagevec_release(&pvec);
>> @@ -348,7 +349,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>>
>>                          /* We rely upon deletion not changing page->index */
>>                          index = indices[i];
>> -                       if (index >= end) {
>> +                       if (index >= end && (end != -1)) {
>>                                  /* Restart punch to make sure all gone */
>>                                  index = start - 1;
>>                                  break;
>> -- 
>> 2.5.3



* Re: [PATCH] mm: fix cpu hangs on truncating last page of a 16t sparse file
  2015-09-27  2:21   ` angelo
@ 2015-09-27 17:59     ` Hugh Dickins
  2015-09-27 23:26       ` Dave Chinner
  2015-09-28 17:03       ` Andi Kleen
  0 siblings, 2 replies; 9+ messages in thread
From: Hugh Dickins @ 2015-09-27 17:59 UTC
  To: angelo
  Cc: Al Viro, Andrew Morton, Dave Chinner, Christoph Hellwig,
	Andi Kleen, Jeff Layton, Eryu Guan, Hugh Dickins, linux-kernel,
	linux-fsdevel, linux-mm

On Sun, 27 Sep 2015, angelo wrote:
> On 27/09/2015 03:36, Hugh Dickins wrote:
> > Let's Cc linux-fsdevel, who will be more knowledgeable.
> > 
> > On Sun, 27 Sep 2015, angelo wrote:
> > 
> > > Hi all,
> > > 
> > > running xfstests, generic/308 on whatever 32-bit arch, it is possible
> > > to observe the cpu hanging near 100% on unlink.

I have since tried to repeat your result, but generic/308 on 32-bit just
skipped the test for me.  I didn't investigate why: it's quite possible
that I had a leftover 64-bit executable in the path that it tried to use,
but didn't show the relevant error message.

I did verify your result with a standalone test; and that proves that
nobody has actually been using such files in practice before you,
since unmounting the xfs filesystem would hang in the same way if
they didn't unlink them.

> > > The test removes a sparse file of length 16TB where only the last
> > > 4096-byte block is mapped.
> > > At line 265 of truncate.c there is a
> > > if (index >= end)
> > >      break;
> > > But if index is, as in this case, 4294967295, it matches the -1 used as
> > > EOF. Hence the cpu loops at 100% just after.
> > > 
> > That's odd.  I've not checked your patch, because I think the problem
> > would go beyond truncate, and the root cause lie elsewhere.
> > 
> > My understanding is that the 32-bit
> > #define MAX_LFS_FILESIZE (((loff_t)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1)
> > makes a page->index of -1 (or any "negative") impossible to reach.
> > 
> > I don't know offhand the rules for mounting a filesystem populated with
> > a 64-bit kernel on a 32-bit kernel, what's to happen when a too-large
> > file is encountered; but assume that's not the case here - you're
> > just running xfstests/tests/generic/308.
> > 
> > Is pwrite missing a check for offset beyond s_maxbytes?
> > 
> > Or is this filesystem-dependent?  Which filesystem?
>
> Hi Hugh,
> 
> thanks for the fast reply.
> 
> Looks like the XFS file system can support files up to 16TB
> when CONFIG_LBDAF is enabled.
> 
> On XFS on a 32-bit arch, s_maxbytes is actually set (with CONFIG_LBDAF=y)
> to 17592186044415.

This is a valuable catch, no doubt of that, thank you.

A surprise to me, and I expect to others, that 32-bit xfs is not
respecting MAX_LFS_FILESIZE: going its own way with 0xfff ffffffff
instead of 0x7ff ffffffff (on a PAGE_CACHE_SIZE 4096 system).

MAX_LFS_FILESIZE has been defined that way ever since v2.5.4:
this is probably just an oversight from when xfs was later added
into the Linux tree.

I can't tell you why MAX_LFS_FILESIZE was defined to exclude half
of the available range.  I've always assumed that it's because there
were known or feared areas of the code, which manipulate between
bytes and pages, and might hit sign extension issues - though
I cannot identify those places myself.

> 
> But if s_maxbytes must not be greater than MAX_LFS_FILESIZE,
> I agree the issue should be fixed in the layers above.

There is a "filesystems should never set s_maxbytes larger than
MAX_LFS_FILESIZE" comment in fs/super.c, but unfortunately its
warning is written with just 64-bit in mind (testing for negative).
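
For reference, the check reads roughly like this (quoting fs/super.c
from memory, so the exact wording may differ):

    /*
     * filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE
     * but s_maxbytes was an unsigned long long for many releases. Throw
     * this warning for a little while to try and catch filesystems that
     * violate this rule. This warning should be either removed or
     * converted to a BUG() in 2.6.34.
     */
    WARN((sb->s_maxbytes < 0), "%s set sb->s_maxbytes to "
        "negative value (%lld)\n", type->name, sb->s_maxbytes);

With s_maxbytes a signed loff_t, that test can only catch 64-bit
overflow; a 32-bit value like xfs's 0xfff ffffffff sails straight past.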

> 
> The fact is that everything still works correctly up to an offset of
> 17592186044415 - 4096, and there may be users who already
> have such big files in use in their setups.

It's too soon to say "everything still works correctly" before that:
there may be a number of incorrect syscall argument validation checks,
particularly at the mm end, or incorrect byte<->page offset conversions.

> 
> What do you think?

It's a matter for vfs and mm and xfs maintainers to decide.

FWIW, I don't expect there would be much enthusiasm for doubling
MAX_LFS_FILESIZE now: it would involve more of a code audit than
I'd want to spend time on myself.  So personally I'd favour xfs
enforcing the lower limit, then we keep an eye open for whether
any user regression is reported.

There have been suggestions that there should be a 32-bit CONFIG
with a 64-bit page->index, to allow a 32-bit kernel to access a
fully 64-bit filesystem; but that's a different and larger task.

Hugh

> 
> Best regards
> Angelo Dureghello
> 
> > 
> > Hugh
> > 
> > > -------------------
> > > 
> > > On 32-bit archs, with CONFIG_LBDAF=y, when truncating the last page
> > > of a 16TB file, the "index" variable is set to 4294967295, and hence
> > > matches the -1 used as the EOF value. This results in an infinite loop
> > > when unlink is executed on this file.
> > > 
> > > Signed-off-by: Angelo Dureghello <angelo@sysam.it>
> > > ---
> > >   mm/truncate.c | 11 ++++++-----
> > >   1 file changed, 6 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/mm/truncate.c b/mm/truncate.c
> > > index 76e35ad..3751034 100644
> > > --- a/mm/truncate.c
> > > +++ b/mm/truncate.c
> > > @@ -283,14 +283,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
> > >                  pagevec_remove_exceptionals(&pvec);
> > >                  pagevec_release(&pvec);
> > >                  cond_resched();
> > > -               index++;
> > > +               if (index < end)
> > > +                       index++;
> > >          }
> > > 
> > >          if (partial_start) {
> > >                  struct page *page = find_lock_page(mapping, start - 1);
> > >                  if (page) {
> > >                          unsigned int top = PAGE_CACHE_SIZE;
> > > -                       if (start > end) {
> > > +                       if (start > end && end != -1) {
> > >                                  /* Truncation within a single page */
> > >                                  top = partial_end;
> > >                                  partial_end = 0;
> > > @@ -322,7 +323,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
> > >           * If the truncation happened within a single page no pages
> > >           * will be released, just zeroed, so we can bail out now.
> > >           */
> > > -       if (start >= end)
> > > +       if (start >= end && end != -1)
> > >                  return;
> > > 
> > >          index = start;
> > > @@ -337,7 +338,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
> > >                          index = start;
> > >                          continue;
> > >                  }
> > > -               if (index == start && indices[0] >= end) {
> > > +               if (index == start && (indices[0] >= end && end != -1)) {
> > >                          /* All gone out of hole to be punched, we're done */
> > >                          pagevec_remove_exceptionals(&pvec);
> > >                          pagevec_release(&pvec);
> > > @@ -348,7 +349,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
> > > 
> > >                          /* We rely upon deletion not changing page->index */
> > >                          index = indices[i];
> > > -                       if (index >= end) {
> > > +                       if (index >= end && (end != -1)) {
> > >                                  /* Restart punch to make sure all gone */
> > >                                  index = start - 1;
> > >                                  break;
> > > -- 
> > > 2.5.3


* Re: [PATCH] mm: fix cpu hangs on truncating last page of a 16t sparse file
  2015-09-27 17:59     ` Hugh Dickins
@ 2015-09-27 23:26       ` Dave Chinner
  2015-09-27 23:56         ` Jeff Layton
  2015-09-28 17:03       ` Andi Kleen
  1 sibling, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2015-09-27 23:26 UTC
  To: Hugh Dickins
  Cc: angelo, Al Viro, Andrew Morton, Christoph Hellwig, Andi Kleen,
	Jeff Layton, Eryu Guan, linux-kernel, linux-fsdevel, linux-mm

On Sun, Sep 27, 2015 at 10:59:33AM -0700, Hugh Dickins wrote:
> On Sun, 27 Sep 2015, angelo wrote:
> > On 27/09/2015 03:36, Hugh Dickins wrote:
> > > Let's Cc linux-fsdevel, who will be more knowledgeable.
> > > 
> > > On Sun, 27 Sep 2015, angelo wrote:
> > > 
> > > > Hi all,
> > > > 
> > > > running xfstests, generic/308 on whatever 32-bit arch, it is possible
> > > > to observe the cpu hanging near 100% on unlink.
> 
> I have since tried to repeat your result, but generic/308 on 32-bit just
> skipped the test for me.  I didn't investigate why: it's quite possible
> that I had a leftover 64-bit executable in the path that it tried to use,
> but didn't show the relevant error message.
>
> I did verify your result with a standalone test; and that proves that
> nobody has actually been using such files in practice before you,
> since unmounting the xfs filesystem would hang in the same way if
> they didn't unlink them.

It used to work - this is a regression. Just because nobody has
reported it recently simply means nobody has run xfstests on 32 bit
storage recently. There are 32 bit systems out there that expect
this to work, and we've broken it.

The regression was introduced in 3.11 by this commit:

commit 5a7203947a1d9b6f3a00a39fda08c2466489555f
Author: Lukas Czerner <lczerner@redhat.com>
Date:   Mon May 27 23:32:35 2013 -0400

    mm: teach truncate_inode_pages_range() to handle non page aligned ranges
    
    This commit changes truncate_inode_pages_range() so it can handle non
    page aligned regions of the truncate. Currently we can hit BUG_ON when
    the end of the range is not page aligned, but we can handle unaligned
    start of the range.
    
    Being able to handle non page aligned regions of the page can help file
    system punch_hole implementations and save some work, because once we're
    holding the page we might as well deal with it right away.
    
    In previous commits we've changed ->invalidatepage() prototype to accept
    'length' argument to be able to specify range to invalidate. No we can
    use that new ability in truncate_inode_pages_range().
    
    Signed-off-by: Lukas Czerner <lczerner@redhat.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>


> > > > The test removes a sparse file of length 16TB where only the last
> > > > 4096-byte block is mapped.
> > > > At line 265 of truncate.c there is a
> > > > if (index >= end)
> > > >      break;
> > > > But if index is, as in this case, 4294967295, it matches the -1 used as
> > > > EOF. Hence the cpu loops at 100% just after.
> > > > 
> > > That's odd.  I've not checked your patch, because I think the problem
> > > would go beyond truncate, and the root cause lie elsewhere.
> > > 
> > > My understanding is that the 32-bit
> > > #define MAX_LFS_FILESIZE (((loff_t)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1)
> > > makes a page->index of -1 (or any "negative") impossible to reach.

We've supported > 8TB files on 32 bit XFS file systems since
mid-2003:

http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=d13d78f6b83eefbd90a6cac5c9fbe42560c6511e

And it's been documented as such for a long time, too:

http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s04.html

(that was written, IIRC, back in 2007).

i.e. whatever the definition says about MAX_LFS_FILESIZE being an
8TB limit on 32 bit is stale and has been for a very long time.

> A surprise to me, and I expect to others, that 32-bit xfs is not
> respecting MAX_LFS_FILESIZE: going its own way with 0xfff ffffffff
> instead of 0x7ff ffffffff (on a PAGE_CACHE_SIZE 4096 system).
> 
> MAX_LFS_FILESIZE has been defined that way ever since v2.5.4:
> this is probably just an oversight from when xfs was later added
> into the Linux tree.

We supported >8 TB file offsets on 32 bit systems on 2.4 kernels
with XFS, so it sounds like it was wrong even when it was first
committed. Of course, XFS wasn't merged until 2.5.36, so I guess
nobody realised... ;)

> > But if s_maxbytes must not be greater than MAX_LFS_FILESIZE,
> > I agree the issue should be fixed in the layers above.
> 
> There is a "filesystems should never set s_maxbytes larger than
> MAX_LFS_FILESIZE" comment in fs/super.c, but unfortunately its
> warning is written with just 64-bit in mind (testing for negative).

Yup, introduced here:

commit 42cb56ae2ab67390da34906b27bedc3f2ff1393b
Author: Jeff Layton <jlayton@redhat.com>
Date:   Fri Sep 18 13:05:53 2009 -0700

    vfs: change sb->s_maxbytes to a loff_t
    
    sb->s_maxbytes is supposed to indicate the maximum size of a file that can
    exist on the filesystem.  It's declared as an unsigned long long.

And yes, that will never fire on a 32bit filesystem, because loff_t
is a "long long" type....

> > The fact is that everything still works correctly up to an offset of
> > 17592186044415 - 4096, and there may be users who already
> > have such big files in use in their setups.
> 
> It's too soon to say "everything still works correctly" before that:
> there may be a number of incorrect syscall argument validation checks,
> particularly at the mm end, or incorrect byte<->page offset conversions.

It's "worked correctly" for many years - that regression test used
to pass.  W.r.t. syscall arguments for manipulating
files that large, they are 64 bit and hence "just work". W.r.t.
byte/page offset conversion, page->index is typed as pgoff_t,
which is:

#define pgoff_t unsigned long

So there should be no sign conversion issues in correctly written
code.
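
As an illustrative sketch of a correct conversion (not the kernel's
actual helper; PAGE_CACHE_SHIFT of 12 assumed):

    typedef unsigned long pgoff_t;        /* 32 bits wide on a 32-bit kernel */

    /* byte offset (signed 64-bit loff_t) -> page index (unsigned) */
    static pgoff_t byte_to_index(long long pos)
    {
        return (pgoff_t)(pos >> 12);
    }

16TB - 4096 comes out as 0xffffffff: representable, no sign extension;
it merely collides with the -1 "EOF" sentinel that truncate uses.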

Further: block devices on 32 bit systems support 16TB; they
support buffered IO through the page cache, and you can mmap them,
too:

# xfs_io -f -c "mmap 0 32k" -c "mwrite 0 4k" -c msync /dev/ram0
# hexdump /dev/ram0
0000000 5858 5858 5858 5858 5858 5858 5858 5858
*
0001000 0000 0000 0000 0000 0000 0000 0000 0000
*
#

Hence the mm subsystem and the page cache *must* support operation up
to 16TB offsets on 32 bit systems regardless of MAX_LFS_FILESIZE
definitions, otherwise direct block device access goes wrong and
then we're really screwed....

> > What do you think ?
> 
> It's a matter for vfs and mm and xfs maintainers to decide.
> 
> FWIW, I don't expect there would be much enthusiasm for doubling
> MAX_LFS_FILESIZE now: it would involve more of a code audit than
> I'd want to spend time on myself.  So personally I'd favour xfs
> enforcing the lower limit, then we keep an eye open for whether
> any user regression is reported.

The horse has already bolted. The code used to work, and there are
users out there that are using >8TB files on 32 bit systems. Hence we have
no choice but to fix the regression.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [PATCH] mm: fix cpu hangs on truncating last page of a 16t sparse file
  2015-09-27 23:26       ` Dave Chinner
@ 2015-09-27 23:56         ` Jeff Layton
  2015-09-28  1:06           ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff Layton @ 2015-09-27 23:56 UTC
  To: Dave Chinner
  Cc: Hugh Dickins, angelo, Al Viro, Andrew Morton, Christoph Hellwig,
	Andi Kleen, Eryu Guan, linux-kernel, linux-fsdevel, linux-mm

On Mon, 28 Sep 2015 09:26:45 +1000
Dave Chinner <david@fromorbit.com> wrote:

> On Sun, Sep 27, 2015 at 10:59:33AM -0700, Hugh Dickins wrote:
> > On Sun, 27 Sep 2015, angelo wrote:
> > > On 27/09/2015 03:36, Hugh Dickins wrote:
> > > > Let's Cc linux-fsdevel, who will be more knowledgeable.
> > > > 
> > > > On Sun, 27 Sep 2015, angelo wrote:
> > > > 
> > > > > Hi all,
> > > > > 
> > > > > running xfstests, generic/308 on whatever 32-bit arch, it is possible
> > > > > to observe the cpu hanging near 100% on unlink.
> > 
> > I have since tried to repeat your result, but generic/308 on 32-bit just
> > skipped the test for me.  I didn't investigate why: it's quite possible
> > that I had a leftover 64-bit executable in the path that it tried to use,
> > but didn't show the relevant error message.
> >
> > I did verify your result with a standalone test; and that proves that
> > nobody has actually been using such files in practice before you,
> > since unmounting the xfs filesystem would hang in the same way if
> > they didn't unlink them.
> 
> It used to work - this is a regression. Just because nobody has
> reported it recently simply means nobody has run xfstests on 32 bit
> storage recently. There are 32 bit systems out there that expect
> this to work, and we've broken it.
> 
> The regression was introduced in 3.11 by this commit:
> 
> commit 5a7203947a1d9b6f3a00a39fda08c2466489555f
> Author: Lukas Czerner <lczerner@redhat.com>
> Date:   Mon May 27 23:32:35 2013 -0400
> 
>     mm: teach truncate_inode_pages_range() to handle non page aligned ranges
>     
>     This commit changes truncate_inode_pages_range() so it can handle non
>     page aligned regions of the truncate. Currently we can hit BUG_ON when
>     the end of the range is not page aligned, but we can handle unaligned
>     start of the range.
>     
>     Being able to handle non page aligned regions of the page can help file
>     system punch_hole implementations and save some work, because once we're
>     holding the page we might as well deal with it right away.
>     
>     In previous commits we've changed ->invalidatepage() prototype to accept
> >     'length' argument to be able to specify range to invalidate. Now we can
>     use that new ability in truncate_inode_pages_range().
>     
>     Signed-off-by: Lukas Czerner <lczerner@redhat.com>
>     Cc: Andrew Morton <akpm@linux-foundation.org>
>     Cc: Hugh Dickins <hughd@google.com>
>     Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> 
> 
> > > > > The test removes a sparse file of length 16TB where only the last
> > > > > 4096-byte block is mapped.
> > > > > At line 265 of truncate.c there is a
> > > > > if (index >= end)
> > > > >      break;
> > > > > But if index is, as in this case, 4294967295, it matches the -1 used as
> > > > > EOF. Hence the cpu loops at 100% just after.
> > > > > 
> > > > That's odd.  I've not checked your patch, because I think the problem
> > > > would go beyond truncate, and the root cause lie elsewhere.
> > > > 
> > > > My understanding is that the 32-bit
> > > > #define MAX_LFS_FILESIZE (((loff_t)PAGE_CACHE_SIZE << (BITS_PER_LONG-1))-1)
> > > > makes a page->index of -1 (or any "negative") impossible to reach.
> 
> We've supported > 8TB files on 32 bit XFS file systems since
> mid-2003:
> 
> http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=d13d78f6b83eefbd90a6cac5c9fbe42560c6511e
> 
> And it's been documented as such for a long time, too:
> 
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s04.html
> 
> (that was written, IIRC, back in 2007).
> 
> i.e. whatever the definition says about MAX_LFS_FILESIZE being an
> 8TB limit on 32 bit is stale and has been for a very long time.
> 
> > A surprise to me, and I expect to others, that 32-bit xfs is not
> > respecting MAX_LFS_FILESIZE: going its own way with 0xfff ffffffff
> > instead of 0x7ff ffffffff (on a PAGE_CACHE_SIZE 4096 system).
> > 
> > MAX_LFS_FILESIZE has been defined that way ever since v2.5.4:
> > this is probably just an oversight from when xfs was later added
> > into the Linux tree.
> 
> We supported >8 TB file offsets on 32 bit systems on 2.4 kernels
> with XFS, so it sounds like it was wrong even when it was first
> committed. Of course, XFS wasn't merged until 2.5.36, so I guess
> nobody realised... ;)
> 
> > > But if s_maxbytes must not be greater than MAX_LFS_FILESIZE,
> > > I agree the issue should be fixed in the layers above.
> > 
> > There is a "filesystems should never set s_maxbytes larger than
> > MAX_LFS_FILESIZE" comment in fs/super.c, but unfortunately its
> > warning is written with just 64-bit in mind (testing for negative).
> 
> Yup, introduced here:
> 
> commit 42cb56ae2ab67390da34906b27bedc3f2ff1393b
> Author: Jeff Layton <jlayton@redhat.com>
> Date:   Fri Sep 18 13:05:53 2009 -0700
> 
>     vfs: change sb->s_maxbytes to a loff_t
>     
>     sb->s_maxbytes is supposed to indicate the maximum size of a file that can
>     exist on the filesystem.  It's declared as an unsigned long long.
> 
> And yes, that will never fire on a 32bit filesystem, because loff_t
> is a "long long" type....
> 

Hmm...should we change that to something like this instead?

    WARN((unsigned long long)sb->s_maxbytes > (unsigned long long)MAX_LFS_FILESIZE,
	"%s set sb->s_maxbytes to too large a value (0x%llx)\n", type->name, sb->s_maxbytes);

-- 
Jeff Layton <jlayton@poochiereds.net>



* Re: [PATCH] mm: fix cpu hangs on truncating last page of a 16t sparse file
  2015-09-27 23:56         ` Jeff Layton
@ 2015-09-28  1:06           ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2015-09-28  1:06 UTC
  To: Jeff Layton
  Cc: Hugh Dickins, angelo, Al Viro, Andrew Morton, Christoph Hellwig,
	Andi Kleen, Eryu Guan, linux-kernel, linux-fsdevel, linux-mm

On Sun, Sep 27, 2015 at 07:56:55PM -0400, Jeff Layton wrote:
> On Mon, 28 Sep 2015 09:26:45 +1000 Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Sep 27, 2015 at 10:59:33AM -0700, Hugh Dickins wrote:
> > > > But if s_maxbytes must not be greater than MAX_LFS_FILESIZE,
> > > > I agree the issue should be fixed in the layers above.
> > > 
> > > There is a "filesystems should never set s_maxbytes larger than
> > > MAX_LFS_FILESIZE" comment in fs/super.c, but unfortunately its
> > > warning is written with just 64-bit in mind (testing for negative).
> > 
> > Yup, introduced here:
> > 
> > commit 42cb56ae2ab67390da34906b27bedc3f2ff1393b
> > Author: Jeff Layton <jlayton@redhat.com>
> > Date:   Fri Sep 18 13:05:53 2009 -0700
> > 
> >     vfs: change sb->s_maxbytes to a loff_t
> >     
> >     sb->s_maxbytes is supposed to indicate the maximum size of a file that can
> >     exist on the filesystem.  It's declared as an unsigned long long.
> > 
> > And yes, that will never fire on a 32bit filesystem, because loff_t
> > is a "long long" type....
> > 
> 
> Hmm...should we change that to something like this instead?
> 
>     WARN((unsigned long long)sb->s_maxbytes > (unsigned long long)MAX_LFS_FILESIZE,
> 	"%s set sb->s_maxbytes to too large a value (0x%llx)\n", type->name, sb->s_maxbytes);

Well, it doesn't change the fact that we've actually been supporting
sb->s_maxbytes > MAX_LFS_FILESIZE for a long time on 32 bit systems.
And it's pretty unfriendly to start issuing warnings on every mount
of every XFS filesystem on every 32 bit system in existence for
something we've explicitly supported since 2.4 kernels...

I suspect the warning should have been removed back in 2.6.34, as
was originally intended. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] mm: fix cpu hangs on truncating last page of a 16t sparse file
  2015-09-27 17:59     ` Hugh Dickins
  2015-09-27 23:26       ` Dave Chinner
@ 2015-09-28 17:03       ` Andi Kleen
  2015-09-28 18:08         ` Hugh Dickins
  1 sibling, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2015-09-28 17:03 UTC
  To: Hugh Dickins
  Cc: angelo, Al Viro, Andrew Morton, Dave Chinner, Christoph Hellwig,
	Andi Kleen, Jeff Layton, Eryu Guan, linux-kernel, linux-fsdevel,
	linux-mm

> I can't tell you why MAX_LFS_FILESIZE was defined to exclude half
> of the available range.  I've always assumed that it's because there
> were known or feared areas of the code, which manipulate between
> bytes and pages, and might hit sign extension issues - though
> I cannot identify those places myself.

The limit was intentional to handle old user space. I don't think
it has anything to do with the kernel.

off_t is sometimes used signed, mainly with lseek SEEK_CUR/END when you
want to seek backwards. It would be quite odd to sometimes
have off_t be signed (SEEK_CUR/END) and sometimes be unsigned
(when using SEEK_SET).  So it made some sense to set the limit
to the signed max value.
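
A tiny illustration of that signed usage (a sketch only - error
handling omitted, and "somefile" is a placeholder name):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("somefile", O_RDONLY);
        /* off_t must be signed for this to mean "4096 bytes before EOF" */
        off_t pos = lseek(fd, -4096, SEEK_END);
        (void)pos;
        return close(fd);
    }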

Here's the original "Large file standard" that describes
the issues in more detail:

http://www.unix.org/version2/whatsnew/lfs20mar.html

This document explicitly requests signed off_t:

>>>


Mixed sizes of off_t
    During a period of transition from existing systems to systems able to support an arbitrarily large file size, most systems will need to support binaries with two or more sizes of the off_t data type (and related data types). This mixed off_t environment may occur on a system with an ABI that supports different sizes of off_t. It may occur on a system which has both a 64-bit and a 32-bit ABI. Finally, it may occur when using a distributed system where clients and servers have differing sizes of off_t. In effect, the period of transition will not end until we need 128-bit file sizes, requiring yet another transition! The proposed changes may also be used as a model for the 64 to 128-bit file size transition. 
Offset maximum
    Most, but unfortunately not all, of the numeric values in the SUS are protected by opaque type definitions. In theory this allows programs to use these types rather than the underlying C language data types to avoid issues like overflow. However, most existing code maps these opaque data types like off_t to long integers that can overflow for the values needed to represent the offsets possible in large files.

    To protect existing binaries from arbitrarily large files, a new value (offset maximum) will be part of the open file description. An offset maximum is the largest offset that can be used as a file offset. Operations attempting to go beyond the offset maximum will return an error. The offset maximum is normally established as the size of the off_t "extended signed integral type" used by the program creating the file description.

    The open() function and other interfaces establish the offset maximum for a file description, returning an error if the file size is larger than the offset maximum at the time of the call. Returning errors when the 
<<<

-Andi


* Re: [PATCH] mm: fix cpu hangs on truncating last page of a 16t sparse file
  2015-09-28 17:03       ` Andi Kleen
@ 2015-09-28 18:08         ` Hugh Dickins
  0 siblings, 0 replies; 9+ messages in thread
From: Hugh Dickins @ 2015-09-28 18:08 UTC
  To: Andi Kleen
  Cc: Hugh Dickins, angelo, Al Viro, Andrew Morton, Dave Chinner,
	Christoph Hellwig, Jeff Layton, Eryu Guan, linux-kernel,
	linux-fsdevel, linux-mm

On Mon, 28 Sep 2015, Andi Kleen wrote:

> > I can't tell you why MAX_LFS_FILESIZE was defined to exclude half
> > of the available range.  I've always assumed that it's because there
> > were known or feared areas of the code, which manipulate between
> > bytes and pages, and might hit sign extension issues - though
> > I cannot identify those places myself.
> 
> The limit was intentional to handle old user space. I don't think
> it has anything to do with the kernel.
> 
> off_t is sometimes used signed, mainly with lseek SEEK_CUR/END when you
> want to seek backwards. It would be quite odd to sometimes
> have off_t be signed (SEEK_CUR/END) and sometimes be unsigned
> (when using SEEK_SET).  So it made some sense to set the limit
> to the signed max value.

Thanks a lot for filling in the history, Andi, I was hoping you could.

I think that's a good argument for MAX_NON_LFS 0x7fffffff, but
MAX_LFS_FILESIZE 0x7ff ffffffff is just a mistake: it's a very long way
away from any ambiguity between signed and unsigned, and 0xfff ffffffff
(or perhaps 0xfff fffff000) would have made better use of the space.

Never mind, a bit late now.  (And apologies to those with non-4096
pagesize, but I find it easier to follow with concrete numbers.)
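
Spelling out the concrete numbers (4096-byte pages assumed):

    MAX_LFS_FILESIZE today:  ((loff_t)4096 << 31) - 1    = 0x7ff ffffffff  (8TB - 1)
    full pgoff_t range:      ((loff_t)4096 << 32) - 1    = 0xfff ffffffff  (16TB - 1)
    page-aligned variant:    ((loff_t)4096 << 32) - 4096 = 0xfff fffff000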

Hugh

> 
> Here's the original "Large file standard" that describes
> the issues in more detail:
> 
> http://www.unix.org/version2/whatsnew/lfs20mar.html
> 
> This document explicitly requests signed off_t:
> 
> >>>
> 
> 
> Mixed sizes of off_t
>     During a period of transition from existing systems to systems able to support an arbitrarily large file size, most systems will need to support binaries with two or more sizes of the off_t data type (and related data types). This mixed off_t environment may occur on a system with an ABI that supports different sizes of off_t. It may occur on a system which has both a 64-bit and a 32-bit ABI. Finally, it may occur when using a distributed system where clients and servers have differing sizes of off_t. In effect, the period of transition will not end until we need 128-bit file sizes, requiring yet another transition! The proposed changes may also be used as a model for the 64 to 128-bit file size transition. 
> Offset maximum
>     Most, but unfortunately not all, of the numeric values in the SUS are protected by opaque type definitions. In theory this allows programs to use these types rather than the underlying C language data types to avoid issues like overflow. However, most existing code maps these opaque data types like off_t to long integers that can overflow for the values needed to represent the offsets possible in large files.
> 
>     To protect existing binaries from arbitrarily large files, a new value (offset maximum) will be part of the open file description. An offset maximum is the largest offset that can be used as a file offset. Operations attempting to go beyond the offset maximum will return an error. The offset maximum is normally established as the size of the off_t "extended signed integral type" used by the program creating the file description.
> 
>     The open() function and other interfaces establish the offset maximum for a file description, returning an error if the file size is larger than the offset maximum at the time of the call. Returning errors when the 
> <<<
> 
> -Andi
> 

