linux-fsdevel.vger.kernel.org archive mirror
* [LSF/MM TOPIC] Fixing large block devices on 32 bit
@ 2014-01-31 19:02 James Bottomley
From: James Bottomley @ 2014-01-31 19:02 UTC
  To: linux-scsi, linux-ide, linux-fsdevel, linux-mm; +Cc: lsf-pc

It has been reported

http://marc.info/?t=139111447200006

that large block devices (specifically devices > 16TB) crash when
mounted on 32-bit systems.  The problem is that although CONFIG_LBDAF
extends sector_t within the block and storage layers to 64 bits, the
buffer cache cannot address that much.  Buffers are mapped through a
single page cache mapping on the backing device inode.  The offset into
the page cache radix tree is a pgoff_t, which is 32 bits on these
systems, so with 4KB pages it can index at most 2^32 * 4KB = 16TB.  Once
the size of the device goes beyond 16TB, this offset wraps and all hell
breaks loose.
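
A minimal sketch of the arithmetic behind the 16TB figure, assuming 4KB
pages and 512-byte sectors.  This is plain userspace C, not kernel code,
and the helper names (page_index_32, page_index_64, max_bytes_32,
check_wrap) are hypothetical; uint32_t stands in for a 32-bit pgoff_t.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9   /* assumption: 512-byte sectors */
#define PAGE_SHIFT   12  /* assumption: 4KB pages */

/* Page index as a 32-bit pgoff_t would store it: high bits are lost. */
static inline uint32_t page_index_32(uint64_t sector)
{
    return (uint32_t)(sector >> (PAGE_SHIFT - SECTOR_SHIFT));
}

/* The same page index without truncation, for comparison. */
static inline uint64_t page_index_64(uint64_t sector)
{
    return sector >> (PAGE_SHIFT - SECTOR_SHIFT);
}

/* Number of bytes addressable through a 32-bit page index. */
static inline uint64_t max_bytes_32(void)
{
    return (uint64_t)1 << (32 + PAGE_SHIFT);
}

/* The first sector past the limit maps back onto page 0. */
static inline void check_wrap(void)
{
    uint64_t first_bad_sector = max_bytes_32() >> SECTOR_SHIFT;

    assert(page_index_64(first_bad_sector) == ((uint64_t)1 << 32));
    assert(page_index_32(first_bad_sector) == 0);
}
```

With these assumptions, the first sector beyond 16TB lands on the same
page index as sector 0, which is the aliasing described above.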

This matters because, although the current single drive limit is about
4TB, it will only be a couple of years before 16TB devices are
available.  By then, I bet that most ARM (and other exotic CPU) Linux
based personal file servers are still going to be 32-bit, so they're not
going to be able to take this generation (or beyond) of drives.  The
thing I'd like to discuss is how to fix this.  There are several options
I see, but there might be others.

     1. Try to pretend that CONFIG_LBDAF is supposed to cap out at 16TB
        and there's nothing we can do about it ... this won't be at all
        popular with ARM-based file server manufacturers.
     2. Slyly make sure that the buffer cache won't go over 16TB by
        keeping filesystem metadata below that limit ... the horse has
        probably already bolted on this one.
     3. Increase pgoff_t and the radix tree indexes to u64 for
        CONFIG_LBDAF.  This will blow out the size of struct page on
        32-bit systems by 4 bytes and may have other knock-on effects,
        but at least it will be transparent.
     4. Add an additional radix tree lookup within the buffer cache, so
        instead of a single inode for the buffer cache, we have a radix
        tree of them which are added and removed at the granularity of
        16TB offsets as entries are requested.
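
To illustrate the addressing behind option 4, here is a rough sketch of
how a 64-bit device offset could be split into a 16TB chunk number (which
would select one mapping in the outer tree) and a 32-bit page index
within that chunk.  It assumes 4KB pages; struct bdev_page_key,
split_offset and join_offset are hypothetical names, not existing kernel
interfaces, and the outer tree itself is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT       12  /* assumption: 4KB pages */
#define CHUNK_INDEX_BITS 32  /* one chunk = 2^32 pages = 16TB */

/* Hypothetical two-level key: which 16TB chunk, and which page in it. */
struct bdev_page_key {
    uint32_t chunk;  /* would select the per-chunk mapping */
    uint32_t index;  /* fits a 32-bit pgoff_t within that mapping */
};

/* Split a byte offset on the device into (chunk, page index). */
static inline struct bdev_page_key split_offset(uint64_t byte_offset)
{
    uint64_t page = byte_offset >> PAGE_SHIFT;
    struct bdev_page_key key = {
        .chunk = (uint32_t)(page >> CHUNK_INDEX_BITS),
        .index = (uint32_t)page,
    };

    return key;
}

/* Rebuild the page-aligned byte offset from a key. */
static inline uint64_t join_offset(struct bdev_page_key key)
{
    uint64_t page = ((uint64_t)key.chunk << CHUNK_INDEX_BITS) | key.index;

    return page << PAGE_SHIFT;
}
```

Offsets below 16TB all fall in chunk 0, so devices under the limit would
behave as they do today; only larger devices would need a second chunk.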

James


