* [LSF/MM TOPIC] Fixing large block devices on 32 bit
From: James Bottomley @ 2014-01-31 19:02 UTC (permalink / raw)
To: linux-scsi, linux-ide, linux-fsdevel, linux-mm; +Cc: lsf-pc
It has been reported:
http://marc.info/?t=139111447200006
That large block devices (specifically devices > 16TB) crash when
mounted on 32 bit systems. The problem specifically is that although
CONFIG_LBDAF extends the size of sector_t within the block and storage
layers to 64 bits, the buffer cache isn't big enough. Specifically,
buffers are mapped through a single page cache mapping on the backing
device inode. The size of the allowed offset in the page cache radix
tree is pgoff_t, which is 32 bits, so once the size of the device goes
beyond 16TB, this offset wraps and all hell breaks loose.
The problem is that although the current single drive limit is about
4TB, it will only be a couple of years before 16TB devices are
available. By then, I bet that most arm (and other exotic CPU) Linux
based personal file servers are still going to be 32 bit, so they're not
going to be able to take this generation (or beyond) of drives. The
thing I'd like to discuss is how to fix this. There are several options
I see, but there might be others.
1. Try to pretend that CONFIG_LBDAF is supposed to cap out at 16TB
and there's nothing we can do about it ... this won't be at all
popular with arm based file server manufacturers.
2. Slyly make sure that the buffer cache won't go over 16TB by
keeping filesystem metadata below that limit ... the horse has
probably already bolted on this one.
3. Increase pgoff_t and the radix tree indexes to u64 for
CONFIG_LBDAF. This will blow out the size of struct page on 32
bits by 4 bytes and may have other knock on effects, but at
least it will be transparent.
4. Add an additional radix tree lookup within the buffer cache, so
instead of a single inode for the buffer cache, we have a radix
tree of them which are added and removed at the granularity of
16TB offsets as entries are requested.
James
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
From: Dave Jones @ 2014-01-31 19:26 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-scsi, linux-ide, linux-fsdevel, linux-mm, lsf-pc
On Fri, Jan 31, 2014 at 11:02:58AM -0800, James Bottomley wrote:
> it will only be a couple of years before 16TB devices are
> available. By then, I bet that most arm (and other exotic CPU) Linux
> based personal file servers are still going to be 32 bit, so they're not
> going to be able to take this generation (or beyond) of drives.
>
> 1. Try to pretend that CONFIG_LBDAF is supposed to cap out at 16TB
> and there's nothing we can do about it ... this won't be at all
> popular with arm based file server manufacturers.
Some of the higher-end home NASes have already moved from arm/ppc -> x86_64 [1].
Unless ARM64 starts appearing at a low enough price point, I wouldn't be
surprised to see the smaller vendors make a similar move just to stay
competitive (probably while keeping 'legacy' product lines for a while at a
cheaper price point that won't take bigger disks).
Dave
[1] http://forum.synology.com/wiki/index.php/What_kind_of_CPU_does_my_NAS_have
* Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
From: Chris Mason @ 2014-01-31 21:20 UTC (permalink / raw)
To: James Bottomley, linux-scsi, linux-ide, linux-fsdevel, linux-mm; +Cc: lsf-pc
On 01/31/2014 02:02 PM, James Bottomley wrote:
> It has been reported:
>
> http://marc.info/?t=139111447200006
>
> That large block devices (specifically devices > 16TB) crash when
> mounted on 32 bit systems. The problem specifically is that although
> CONFIG_LBDAF extends the size of sector_t within the block and storage
> layers to 64 bits, the buffer cache isn't big enough. Specifically,
> buffers are mapped through a single page cache mapping on the backing
> device inode. The size of the allowed offset in the page cache radix
> tree is pgoff_t which is 32 bits, so once the size of device goes beyond
> 16TB, this offset wraps and all hell breaks loose.
>
> The problem is that although the current single drive limit is about
> 4TB, it will only be a couple of years before 16TB devices are
> available. By then, I bet that most arm (and other exotic CPU) Linux
> based personal file servers are still going to be 32 bit, so they're not
> going to be able to take this generation (or beyond) of drives. The
> thing I'd like to discuss is how to fix this. There are several options
> I see, but there might be others.
>
> 1. Try to pretend that CONFIG_LBDAF is supposed to cap out at 16TB
> and there's nothing we can do about it ... this won't be at all
> popular with arm based file server manufacturers.
> 2. Slyly make sure that the buffer cache won't go over 16TB by
> keeping filesystem metadata below that limit ... the horse has
> probably already bolted on this one.
> 3. Increase pgoff_t and the radix tree indexes to u64 for
> CONFIG_LBDAF. This will blow out the size of struct page on 32
> bits by 4 bytes and may have other knock on effects, but at
> least it will be transparent.
> 4. add an additional radix tree lookup within the buffer cache, so
> instead of a single inode for the buffer cache, we have a radix
> tree of them which are added and removed at the granularity of
> 16TB offsets as entries are requested.
>
I started typing up that #3 is going to cause problems with RCU radix,
but it looks ok. I think we'll find a really scary number of places
that interchange pgoff_t with unsigned long, though.
I prefer #4, but it means each FS needs to add code too. We assume
page_offset(page) maps to the disk in more than a few places.
-chris
* Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
From: Dave Hansen @ 2014-01-31 21:47 UTC (permalink / raw)
To: James Bottomley, linux-scsi, linux-ide, linux-fsdevel, linux-mm; +Cc: lsf-pc
On 01/31/2014 11:02 AM, James Bottomley wrote:
> 3. Increase pgoff_t and the radix tree indexes to u64 for
> CONFIG_LBDAF. This will blow out the size of struct page on 32
> bits by 4 bytes and may have other knock on effects, but at
> least it will be transparent.
I'm not sure how many acrobatics we want to go through for 32-bit, but...
Between page->mapping and page->index, we have 64 bits of space, which
*should* be plenty to uniquely identify a block. We could easily add a
second-level lookup somewhere so that we store some cookie for the
address_space instead of a direct pointer. How many devices would we
need, practically? 8 bits' worth?
* Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
From: James Bottomley @ 2014-01-31 23:14 UTC (permalink / raw)
To: Chris Mason; +Cc: linux-scsi, linux-ide, linux-fsdevel, linux-mm, lsf-pc
On Fri, 2014-01-31 at 16:20 -0500, Chris Mason wrote:
> On 01/31/2014 02:02 PM, James Bottomley wrote:
> > It has been reported:
> >
> > http://marc.info/?t=139111447200006
> >
> > That large block devices (specifically devices > 16TB) crash when
> > mounted on 32 bit systems. The problem specifically is that although
> > CONFIG_LBDAF extends the size of sector_t within the block and storage
> > layers to 64 bits, the buffer cache isn't big enough. Specifically,
> > buffers are mapped through a single page cache mapping on the backing
> > device inode. The size of the allowed offset in the page cache radix
> > tree is pgoff_t which is 32 bits, so once the size of device goes beyond
> > 16TB, this offset wraps and all hell breaks loose.
> >
> > The problem is that although the current single drive limit is about
> > 4TB, it will only be a couple of years before 16TB devices are
> > available. By then, I bet that most arm (and other exotic CPU) Linux
> > based personal file servers are still going to be 32 bit, so they're not
> > going to be able to take this generation (or beyond) of drives. The
> > thing I'd like to discuss is how to fix this. There are several options
> > I see, but there might be others.
> >
> > 1. Try to pretend that CONFIG_LBDAF is supposed to cap out at 16TB
> > and there's nothing we can do about it ... this won't be at all
> > popular with arm based file server manufacturers.
> > 2. Slyly make sure that the buffer cache won't go over 16TB by
> > keeping filesystem metadata below that limit ... the horse has
> > probably already bolted on this one.
> > 3. Increase pgoff_t and the radix tree indexes to u64 for
> > CONFIG_LBDAF. This will blow out the size of struct page on 32
> > bits by 4 bytes and may have other knock on effects, but at
> > least it will be transparent.
> > 4. add an additional radix tree lookup within the buffer cache, so
> > instead of a single inode for the buffer cache, we have a radix
> > tree of them which are added and removed at the granularity of
> > 16TB offsets as entries are requested.
> >
>
> I started typing up that #3 is going to cause problems with RCU radix,
> but it looks ok. I think we'll find a really scarey number of places
> that interchange pgoff_t with unsigned long though.
Yes, beyond the performance issues of doing 64 bits in the radix tree,
it does look reasonably safe.
> I prefer #4, but it means each FS needs to add code too. We assume
> page_offset(page) maps to the disk in more than a few places.
Hmm, yes, that's just a few cases of the readahead code, though, isn't
it? The necessary fixes look fairly small per filesystem.
James
* Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
From: James Bottomley @ 2014-01-31 23:16 UTC (permalink / raw)
To: Dave Jones; +Cc: linux-scsi, linux-ide, linux-fsdevel, linux-mm, lsf-pc
On Fri, 2014-01-31 at 14:26 -0500, Dave Jones wrote:
> On Fri, Jan 31, 2014 at 11:02:58AM -0800, James Bottomley wrote:
>
> > it will only be a couple of years before 16TB devices are
> > available. By then, I bet that most arm (and other exotic CPU) Linux
> > based personal file servers are still going to be 32 bit, so they're not
> > going to be able to take this generation (or beyond) of drives.
> >
> > 1. Try to pretend that CONFIG_LBDAF is supposed to cap out at 16TB
> > and there's nothing we can do about it ... this won't be at all
> > popular with arm based file server manufacturers.
>
> Some of the higher end home-NAS's have already moved from arm/ppc -> x86_64[1]
> Unless ARM64 starts appearing at a low enough price point, I wouldn't be
> surprised to see the smaller vendors do a similar move just to stay competitive.
> (probably while keeping 'legacy' product lines for a while at a cheaper pricepoint
> that won't take bigger disks).
So would you bet on the problem solving itself *before* we get 16TB
disks? Because if we ignore it, that's the bet we're making.
James
* Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
From: James Bottomley @ 2014-01-31 23:27 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-scsi, linux-ide, linux-fsdevel, linux-mm, lsf-pc
On Fri, 2014-01-31 at 13:47 -0800, Dave Hansen wrote:
> On 01/31/2014 11:02 AM, James Bottomley wrote:
> > 3. Increase pgoff_t and the radix tree indexes to u64 for
> > CONFIG_LBDAF. This will blow out the size of struct page on 32
> > bits by 4 bytes and may have other knock on effects, but at
> > least it will be transparent.
>
> I'm not sure how many acrobatics we want to go through for 32-bit, but...
That's partly the question: 32 bits was dying in the x86 space (at least
until Quark), but it's still predominant in embedded.
> Between page->mapping and page->index, we have 64 bits of space, which
> *should* be plenty to uniquely identify a block. We could easily add a
> second-level lookup somewhere so that we store some cookie for the
> address_space instead of a direct pointer. How many devices would need,
> practically? 8 bits worth?
That might work. 8 bits would get us up to 4PB, which is looking a bit
high for single disk spinning rust. However, how would the cookie work
efficiently? Remember, we'll be doing this lookup every time we pull a
page out of the page cache. And the problem is that most of our lookups
will be on file inodes, which won't be > 16TB, so it's a lot of overhead
in the generic machinery for a problem that only occurs on buffer
related page cache lookups.
James
* Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
From: Dave Hansen @ 2014-02-01 0:19 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-scsi, linux-ide, linux-fsdevel, linux-mm, lsf-pc
On 01/31/2014 03:27 PM, James Bottomley wrote:
> On Fri, 2014-01-31 at 13:47 -0800, Dave Hansen wrote:
>> On 01/31/2014 11:02 AM, James Bottomley wrote:
>>> 3. Increase pgoff_t and the radix tree indexes to u64 for
>>> CONFIG_LBDAF. This will blow out the size of struct page on 32
>>> bits by 4 bytes and may have other knock on effects, but at
>>> least it will be transparent.
>>
>> I'm not sure how many acrobatics we want to go through for 32-bit, but...
>
> That's partly the question: 32 bits was dying in the x86 space (at least
> until quark), but it's still predominant in embedded.
>
>> Between page->mapping and page->index, we have 64 bits of space, which
>> *should* be plenty to uniquely identify a block. We could easily add a
>> second-level lookup somewhere so that we store some cookie for the
>> address_space instead of a direct pointer. How many devices would need,
>> practically? 8 bits worth?
>
> That might work. 8 bits would get us up to 4PB, which is looking a bit
> high for single disk spinning rust. However, how would the cookie work
> efficiently? remember we'll be doing this lookup every time we pull a
> page out of the page cache. And the problem is that most of our lookups
> will be on file inodes, which won't be > 16TB, so it's a lot of overhead
> in the generic machinery for a problem that only occurs on buffer
> related page cache lookups.
I think all we have to do is set a low bit in page->mapping (or in
page->flags, but it's more constrained) to say: "this isn't a direct
pointer". We only set the bit for the buffer cache pages, and thus only
go to the slow(er) lookup path for those. Whatever we use for the
lookups (radix tree or whatever) uses the remaining bits for an index.
We'd probably also need a last-lookup cache like mm->mmap_cache, but
probably not much more than that.
We already have page_mapping() in place to redirect folks away from
using page->mapping directly, so there shouldn't be too much code impact.
* Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
From: Kirill A. Shutemov @ 2014-02-01 0:25 UTC (permalink / raw)
To: Dave Hansen
Cc: James Bottomley, linux-scsi, linux-ide, linux-fsdevel, linux-mm,
lsf-pc
On Fri, Jan 31, 2014 at 04:19:43PM -0800, Dave Hansen wrote:
> On 01/31/2014 03:27 PM, James Bottomley wrote:
> > On Fri, 2014-01-31 at 13:47 -0800, Dave Hansen wrote:
> >> On 01/31/2014 11:02 AM, James Bottomley wrote:
> >>> 3. Increase pgoff_t and the radix tree indexes to u64 for
> >>> CONFIG_LBDAF. This will blow out the size of struct page on 32
> >>> bits by 4 bytes and may have other knock on effects, but at
> >>> least it will be transparent.
> >>
> >> I'm not sure how many acrobatics we want to go through for 32-bit, but...
> >
> > That's partly the question: 32 bits was dying in the x86 space (at least
> > until quark), but it's still predominant in embedded.
> >
> >> Between page->mapping and page->index, we have 64 bits of space, which
> >> *should* be plenty to uniquely identify a block. We could easily add a
> >> second-level lookup somewhere so that we store some cookie for the
> >> address_space instead of a direct pointer. How many devices would need,
> >> practically? 8 bits worth?
> >
> > That might work. 8 bits would get us up to 4PB, which is looking a bit
> > high for single disk spinning rust. However, how would the cookie work
> > efficiently? remember we'll be doing this lookup every time we pull a
> > page out of the page cache. And the problem is that most of our lookups
> > will be on file inodes, which won't be > 16TB, so it's a lot of overhead
> > in the generic machinery for a problem that only occurs on buffer
> > related page cache lookups.
>
> I think all we have to do is set a low bit in page->mapping
It's already in use to say page->mapping is anon_vma. ;)
--
Kirill A. Shutemov
* Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
From: Dave Hansen @ 2014-02-01 0:32 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: James Bottomley, linux-scsi, linux-ide, linux-fsdevel, linux-mm,
lsf-pc
On 01/31/2014 04:25 PM, Kirill A. Shutemov wrote:
>> > I think all we have to do is set a low bit in page->mapping
> It's already in use to say page->mapping is anon_vma. ;)
I weasel-worded that by not saying *THE* low bit. ;)
We find *some* discriminator, whether it's a page flag or an actual bit
in page->mapping, or a magic value that doesn't collide with the
existing PAGE_MAPPING_* flags.
Poor 'struct page'. It's the doormat of data structures.