From mboxrd@z Thu Jan 1 00:00:00 1970
From: James Bottomley
Subject: [LSF/MM TOPIC] Fixing large block devices on 32 bit
Date: Fri, 31 Jan 2014 11:02:58 -0800
Message-ID: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-15"
Content-Transfer-Encoding: 7bit
Return-path:
Sender: owner-linux-mm@kvack.org
To: linux-scsi, linux-ide, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Cc: lsf-pc@lists.linux-foundation.org
List-Id: linux-ide@vger.kernel.org

It has been reported:

http://marc.info/?t=139111447200006

That large block devices (specifically devices > 16TB) crash when
mounted on 32 bit systems. The problem specifically is that although
CONFIG_LBDAF extends the size of sector_t within the block and storage
layers to 64 bits, the buffer cache isn't big enough. Specifically,
buffers are mapped through a single page cache mapping on the backing
device inode. The size of the allowed offset in the page cache radix
tree is pgoff_t which is 32 bits, so once the size of device goes beyond
16TB, this offset wraps and all hell breaks loose.

The problem is that although the current single drive limit is about
4TB, it will only be a couple of years before 16TB devices are
available. By then, I bet that most arm (and other exotic CPU) Linux
based personal file servers are still going to be 32 bit, so they're not
going to be able to take this generation (or beyond) of drives. The
thing I'd like to discuss is how to fix this. There are several options
I see, but there might be others.

     1. Try to pretend that CONFIG_LBDAF is supposed to cap out at 16TB
        and there's nothing we can do about it ... this won't be at all
        popular with arm based file server manufacturers.
     2. Slyly make sure that the buffer cache won't go over 16TB by
        keeping filesystem metadata below that limit ... the horse has
        probably already bolted on this one.
     3. Increase pgoff_t and the radix tree indexes to u64 for
        CONFIG_LBDAF. This will blow out the size of struct page on 32
        bits by 4 bytes and may have other knock on effects, but at
        least it will be transparent.
     4. add an additional radix tree lookup within the buffer cache, so
        instead of a single inode for the buffer cache, we have a radix
        tree of them which are added and removed at the granularity of
        16TB offsets as entries are requested.

James

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dave Jones
Subject: Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
Date: Fri, 31 Jan 2014 14:26:17 -0500
Message-ID: <20140131192617.GA14098@redhat.com>
References: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Content-Disposition: inline
In-Reply-To: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
Sender: owner-linux-mm@kvack.org
To: James Bottomley
Cc: linux-scsi, linux-ide, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, lsf-pc@lists.linux-foundation.org
List-Id: linux-ide@vger.kernel.org

On Fri, Jan 31, 2014 at 11:02:58AM -0800, James Bottomley wrote:

> it will only be a couple of years before 16TB devices are
> available. By then, I bet that most arm (and other exotic CPU) Linux
> based personal file servers are still going to be 32 bit, so they're not
> going to be able to take this generation (or beyond) of drives.
>
> 1. Try to pretend that CONFIG_LBDAF is supposed to cap out at 16TB
> and there's nothing we can do about it ... this won't be at all
> popular with arm based file server manufacturers.

Some of the higher end home-NAS's have already moved from arm/ppc -> x86_64[1]
Unless ARM64 starts appearing at a low enough price point, I wouldn't be
surprised to see the smaller vendors do a similar move just to stay competitive.
(probably while keeping 'legacy' product lines for a while at a cheaper pricepoint
that won't take bigger disks).

Dave

[1] http://forum.synology.com/wiki/index.php/What_kind_of_CPU_does_my_NAS_have

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chris Mason
Subject: Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
Date: Fri, 31 Jan 2014 16:20:32 -0500
Message-ID: <52EC13A0.2080806@fb.com>
References: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
Sender: owner-linux-mm@kvack.org
To: James Bottomley, linux-scsi, linux-ide, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Cc: lsf-pc@lists.linux-foundation.org
List-Id: linux-ide@vger.kernel.org

On 01/31/2014 02:02 PM, James Bottomley wrote:
> It has been reported:
>
> http://marc.info/?t=139111447200006
>
> That large block devices (specifically devices > 16TB) crash when
> mounted on 32 bit systems. The problem specifically is that although
> CONFIG_LBDAF extends the size of sector_t within the block and storage
> layers to 64 bits, the buffer cache isn't big enough. Specifically,
> buffers are mapped through a single page cache mapping on the backing
> device inode. The size of the allowed offset in the page cache radix
> tree is pgoff_t which is 32 bits, so once the size of device goes beyond
> 16TB, this offset wraps and all hell breaks loose.
>
> The problem is that although the current single drive limit is about
> 4TB, it will only be a couple of years before 16TB devices are
> available. By then, I bet that most arm (and other exotic CPU) Linux
> based personal file servers are still going to be 32 bit, so they're not
> going to be able to take this generation (or beyond) of drives. The
> thing I'd like to discuss is how to fix this. There are several options
> I see, but there might be others.
>
> 1. Try to pretend that CONFIG_LBDAF is supposed to cap out at 16TB
> and there's nothing we can do about it ... this won't be at all
> popular with arm based file server manufacturers.
> 2. Slyly make sure that the buffer cache won't go over 16TB by
> keeping filesystem metadata below that limit ... the horse has
> probably already bolted on this one.
> 3. Increase pgoff_t and the radix tree indexes to u64 for
> CONFIG_LBDAF. This will blow out the size of struct page on 32
> bits by 4 bytes and may have other knock on effects, but at
> least it will be transparent.
> 4. add an additional radix tree lookup within the buffer cache, so
> instead of a single inode for the buffer cache, we have a radix
> tree of them which are added and removed at the granularity of
> 16TB offsets as entries are requested.
>

I started typing up that #3 is going to cause problems with RCU radix,
but it looks ok. I think we'll find a really scary number of places
that interchange pgoff_t with unsigned long though.

I prefer #4, but it means each FS needs to add code too. We assume
page_offset(page) maps to the disk in more than a few places.

-chris

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dave Hansen
Subject: Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
Date: Fri, 31 Jan 2014 13:47:18 -0800
Message-ID: <52EC19E6.9010509@intel.com>
References: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mga09.intel.com ([134.134.136.24]:11762 "EHLO mga09.intel.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752785AbaAaVrT
 (ORCPT ); Fri, 31 Jan 2014 16:47:19 -0500
In-Reply-To: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
Sender: linux-ide-owner@vger.kernel.org
List-Id: linux-ide@vger.kernel.org
To: James Bottomley, linux-scsi, linux-ide, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Cc: lsf-pc@lists.linux-foundation.org

On 01/31/2014 11:02 AM, James Bottomley wrote:
> 3. Increase pgoff_t and the radix tree indexes to u64 for
> CONFIG_LBDAF. This will blow out the size of struct page on 32
> bits by 4 bytes and may have other knock on effects, but at
> least it will be transparent.

I'm not sure how many acrobatics we want to go through for 32-bit, but...

Between page->mapping and page->index, we have 64 bits of space, which
*should* be plenty to uniquely identify a block. We could easily add a
second-level lookup somewhere so that we store some cookie for the
address_space instead of a direct pointer. How many devices would need,
practically? 8 bits worth?
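
The arithmetic behind the 16TB limit and the "8 bits worth" of cookie can
be sketched in plain userspace C. This is only an illustration under
stated assumptions: it assumes 4 KiB pages (which is what makes a 32-bit
page index top out at 16TB), and every name in it (SKETCH_PAGE_SHIFT,
struct split_index, truncated_page_index, split_page_index,
addressable_bytes) is hypothetical and not a kernel interface.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u   /* assumption: 4 KiB pages */

/* Hypothetical split of a 64-bit page index; not a kernel structure. */
struct split_index {
    uint32_t cookie;  /* which 16 TiB window of the device */
    uint32_t index;   /* page index inside that window, fits in 32 bits */
};

/* What a 32-bit page index sees: the upper bits are silently dropped. */
uint32_t truncated_page_index(uint64_t byte_offset)
{
    return (uint32_t)(byte_offset >> SKETCH_PAGE_SHIFT);
}

/* Keep the dropped bits as a separate cookie instead of losing them. */
struct split_index split_page_index(uint64_t byte_offset)
{
    uint64_t page = byte_offset >> SKETCH_PAGE_SHIFT;
    struct split_index s = { (uint32_t)(page >> 32), (uint32_t)page };
    return s;
}

/* Bytes addressable with cookie_bits of cookie plus a 32-bit page index. */
uint64_t addressable_bytes(unsigned cookie_bits)
{
    assert(cookie_bits + 32u + SKETCH_PAGE_SHIFT < 64u);
    return UINT64_C(1) << (cookie_bits + 32u + SKETCH_PAGE_SHIFT);
}
```

Under these assumptions a byte offset of exactly 2^44 (16 TiB) truncates
to page index 0, which is the wrap described in the original report, while
the split form keeps it distinct as cookie 1, index 0. With no cookie bits
the reach is 2^44 bytes (16 TiB); with 8 cookie bits it is 2^52 bytes
(4 PiB).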
From mboxrd@z Thu Jan 1 00:00:00 1970
From: James Bottomley
Subject: Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
Date: Fri, 31 Jan 2014 15:14:01 -0800
Message-ID: <1391210041.2172.52.camel@dabdike.int.hansenpartnership.com>
References: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
 <52EC13A0.2080806@fb.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-15"
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <52EC13A0.2080806@fb.com>
Sender: owner-linux-mm@kvack.org
To: Chris Mason
Cc: linux-scsi, linux-ide, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, lsf-pc@lists.linux-foundation.org
List-Id: linux-ide@vger.kernel.org

On Fri, 2014-01-31 at 16:20 -0500, Chris Mason wrote:
> On 01/31/2014 02:02 PM, James Bottomley wrote:
> > It has been reported:
> >
> > http://marc.info/?t=139111447200006
> >
> > That large block devices (specifically devices > 16TB) crash when
> > mounted on 32 bit systems. The problem specifically is that although
> > CONFIG_LBDAF extends the size of sector_t within the block and storage
> > layers to 64 bits, the buffer cache isn't big enough. Specifically,
> > buffers are mapped through a single page cache mapping on the backing
> > device inode. The size of the allowed offset in the page cache radix
> > tree is pgoff_t which is 32 bits, so once the size of device goes beyond
> > 16TB, this offset wraps and all hell breaks loose.
> >
> > The problem is that although the current single drive limit is about
> > 4TB, it will only be a couple of years before 16TB devices are
> > available. By then, I bet that most arm (and other exotic CPU) Linux
> > based personal file servers are still going to be 32 bit, so they're not
> > going to be able to take this generation (or beyond) of drives. The
> > thing I'd like to discuss is how to fix this. There are several options
> > I see, but there might be others.
> >
> > 1. Try to pretend that CONFIG_LBDAF is supposed to cap out at 16TB
> > and there's nothing we can do about it ... this won't be at all
> > popular with arm based file server manufacturers.
> > 2. Slyly make sure that the buffer cache won't go over 16TB by
> > keeping filesystem metadata below that limit ... the horse has
> > probably already bolted on this one.
> > 3. Increase pgoff_t and the radix tree indexes to u64 for
> > CONFIG_LBDAF. This will blow out the size of struct page on 32
> > bits by 4 bytes and may have other knock on effects, but at
> > least it will be transparent.
> > 4. add an additional radix tree lookup within the buffer cache, so
> > instead of a single inode for the buffer cache, we have a radix
> > tree of them which are added and removed at the granularity of
> > 16TB offsets as entries are requested.
> >
>
> I started typing up that #3 is going to cause problems with RCU radix,
> but it looks ok. I think we'll find a really scary number of places
> that interchange pgoff_t with unsigned long though.

Yes, beyond the performance issues of doing 64 bits in the radix tree,
it does look reasonably safe.

> I prefer #4, but it means each FS needs to add code too. We assume
> page_offset(page) maps to the disk in more than a few places.

Hmm, yes, that's just a few cases of the readahead code, though, isn't
it? The necessary fixes look fairly small per filesystem.

James

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: James Bottomley
Subject: Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
Date: Fri, 31 Jan 2014 15:16:11 -0800
Message-ID: <1391210171.2172.54.camel@dabdike.int.hansenpartnership.com>
References: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
 <20140131192617.GA14098@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-15"
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20140131192617.GA14098@redhat.com>
Sender: owner-linux-mm@kvack.org
To: Dave Jones
Cc: linux-scsi, linux-ide, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, lsf-pc@lists.linux-foundation.org
List-Id: linux-ide@vger.kernel.org

On Fri, 2014-01-31 at 14:26 -0500, Dave Jones wrote:
> On Fri, Jan 31, 2014 at 11:02:58AM -0800, James Bottomley wrote:
>
> > it will only be a couple of years before 16TB devices are
> > available. By then, I bet that most arm (and other exotic CPU) Linux
> > based personal file servers are still going to be 32 bit, so they're not
> > going to be able to take this generation (or beyond) of drives.
> >
> > 1. Try to pretend that CONFIG_LBDAF is supposed to cap out at 16TB
> > and there's nothing we can do about it ... this won't be at all
> > popular with arm based file server manufacturers.
>
> Some of the higher end home-NAS's have already moved from arm/ppc -> x86_64[1]
> Unless ARM64 starts appearing at a low enough price point, I wouldn't be
> surprised to see the smaller vendors do a similar move just to stay competitive.
> (probably while keeping 'legacy' product lines for a while at a cheaper pricepoint
> that won't take bigger disks).

So would you bet on the problem solving itself *before* we get 16TB
disks? Because if we ignore it, that's the bet we're making.

James

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: James Bottomley
Subject: Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
Date: Fri, 31 Jan 2014 15:27:44 -0800
Message-ID: <1391210864.2172.61.camel@dabdike.int.hansenpartnership.com>
References: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
 <52EC19E6.9010509@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-15"
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <52EC19E6.9010509@intel.com>
Sender: linux-scsi-owner@vger.kernel.org
To: Dave Hansen
Cc: linux-scsi, linux-ide, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, lsf-pc@lists.linux-foundation.org
List-Id: linux-ide@vger.kernel.org

On Fri, 2014-01-31 at 13:47 -0800, Dave Hansen wrote:
> On 01/31/2014 11:02 AM, James Bottomley wrote:
> > 3. Increase pgoff_t and the radix tree indexes to u64 for
> > CONFIG_LBDAF. This will blow out the size of struct page on 32
> > bits by 4 bytes and may have other knock on effects, but at
> > least it will be transparent.
>
> I'm not sure how many acrobatics we want to go through for 32-bit, but...

That's partly the question: 32 bits was dying in the x86 space (at least
until quark), but it's still predominant in embedded.

> Between page->mapping and page->index, we have 64 bits of space, which
> *should* be plenty to uniquely identify a block. We could easily add a
> second-level lookup somewhere so that we store some cookie for the
> address_space instead of a direct pointer. How many devices would need,
> practically? 8 bits worth?

That might work. 8 bits would get us up to 4PB, which is looking a bit
high for single disk spinning rust. However, how would the cookie work
efficiently? Remember we'll be doing this lookup every time we pull a
page out of the page cache. And the problem is that most of our lookups
will be on file inodes, which won't be > 16TB, so it's a lot of overhead
in the generic machinery for a problem that only occurs on buffer
related page cache lookups.

James

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dave Hansen
Subject: Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
Date: Fri, 31 Jan 2014 16:19:43 -0800
Message-ID: <52EC3D9F.8040702@intel.com>
References: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
 <52EC19E6.9010509@intel.com>
 <1391210864.2172.61.camel@dabdike.int.hansenpartnership.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <1391210864.2172.61.camel@dabdike.int.hansenpartnership.com>
Sender: owner-linux-mm@kvack.org
To: James Bottomley
Cc: linux-scsi, linux-ide, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, lsf-pc@lists.linux-foundation.org
List-Id: linux-ide@vger.kernel.org

On 01/31/2014 03:27 PM, James Bottomley wrote:
> On Fri, 2014-01-31 at 13:47 -0800, Dave Hansen wrote:
>> On 01/31/2014 11:02 AM, James Bottomley wrote:
>>> 3. Increase pgoff_t and the radix tree indexes to u64 for
>>> CONFIG_LBDAF. This will blow out the size of struct page on 32
>>> bits by 4 bytes and may have other knock on effects, but at
>>> least it will be transparent.
>>
>> I'm not sure how many acrobatics we want to go through for 32-bit, but...
>
> That's partly the question: 32 bits was dying in the x86 space (at least
> until quark), but it's still predominant in embedded.
>
>> Between page->mapping and page->index, we have 64 bits of space, which
>> *should* be plenty to uniquely identify a block. We could easily add a
>> second-level lookup somewhere so that we store some cookie for the
>> address_space instead of a direct pointer. How many devices would need,
>> practically? 8 bits worth?
>
> That might work. 8 bits would get us up to 4PB, which is looking a bit
> high for single disk spinning rust. However, how would the cookie work
> efficiently? Remember we'll be doing this lookup every time we pull a
> page out of the page cache. And the problem is that most of our lookups
> will be on file inodes, which won't be > 16TB, so it's a lot of overhead
> in the generic machinery for a problem that only occurs on buffer
> related page cache lookups.

I think all we have to do is set a low bit in page->mapping (or in
page->flags, but it's more constrained) to say: "this isn't a direct
pointer". We only set the bit for the buffer cache pages, and thus only
go to the slow(er) lookup path for those. Whatever we use for the
lookups (radix tree or whatever) uses the remaining bits for an index.
We'd probably also need a last-lookup cache like mm->mmap_cache, but
probably not much more than that.

We already have page_mapping() in place to redirect folks away from
using page->mapping directly, so there shouldn't be too much code impact.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Kirill A. Shutemov"
Subject: Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
Date: Sat, 1 Feb 2014 02:25:47 +0200
Message-ID: <20140201002547.GA3551@node.dhcp.inet.fi>
References: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
 <52EC19E6.9010509@intel.com>
 <1391210864.2172.61.camel@dabdike.int.hansenpartnership.com>
 <52EC3D9F.8040702@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Content-Disposition: inline
In-Reply-To: <52EC3D9F.8040702@intel.com>
Sender: owner-linux-mm@kvack.org
To: Dave Hansen
Cc: James Bottomley, linux-scsi, linux-ide, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, lsf-pc@lists.linux-foundation.org
List-Id: linux-ide@vger.kernel.org

On Fri, Jan 31, 2014 at 04:19:43PM -0800, Dave Hansen wrote:
> On 01/31/2014 03:27 PM, James Bottomley wrote:
> > On Fri, 2014-01-31 at 13:47 -0800, Dave Hansen wrote:
> >> On 01/31/2014 11:02 AM, James Bottomley wrote:
> >>> 3. Increase pgoff_t and the radix tree indexes to u64 for
> >>> CONFIG_LBDAF. This will blow out the size of struct page on 32
> >>> bits by 4 bytes and may have other knock on effects, but at
> >>> least it will be transparent.
> >>
> >> I'm not sure how many acrobatics we want to go through for 32-bit, but...
> >
> > That's partly the question: 32 bits was dying in the x86 space (at least
> > until quark), but it's still predominant in embedded.
> >
> >> Between page->mapping and page->index, we have 64 bits of space, which
> >> *should* be plenty to uniquely identify a block. We could easily add a
> >> second-level lookup somewhere so that we store some cookie for the
> >> address_space instead of a direct pointer. How many devices would need,
> >> practically? 8 bits worth?
> >
> > That might work. 8 bits would get us up to 4PB, which is looking a bit
> > high for single disk spinning rust. However, how would the cookie work
> > efficiently? Remember we'll be doing this lookup every time we pull a
> > page out of the page cache. And the problem is that most of our lookups
> > will be on file inodes, which won't be > 16TB, so it's a lot of overhead
> > in the generic machinery for a problem that only occurs on buffer
> > related page cache lookups.
>
> I think all we have to do is set a low bit in page->mapping

It's already in use to say page->mapping is anon_vma. ;)

--
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dave Hansen
Subject: Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
Date: Fri, 31 Jan 2014 16:32:03 -0800
Message-ID: <52EC4083.8010309@intel.com>
References: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
 <52EC19E6.9010509@intel.com>
 <1391210864.2172.61.camel@dabdike.int.hansenpartnership.com>
 <52EC3D9F.8040702@intel.com>
 <20140201002547.GA3551@node.dhcp.inet.fi>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20140201002547.GA3551@node.dhcp.inet.fi>
Sender: owner-linux-mm@kvack.org
To: "Kirill A. Shutemov"
Cc: James Bottomley, linux-scsi, linux-ide, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, lsf-pc@lists.linux-foundation.org
List-Id: linux-ide@vger.kernel.org

On 01/31/2014 04:25 PM, Kirill A. Shutemov wrote:
>> > I think all we have to do is set a low bit in page->mapping
> It's already in use to say page->mapping is anon_vma. ;)

I weasel-worded that by not saying *THE* low bit. ;)

We find *some* discriminator whether it be a page flag or an actual bit
in page->mapping, or a magic value that doesn't collide with the
existing PAGE_MAPPING_* flags.

Poor 'struct page'. It's the doormat of data structures.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
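
The "cookie instead of a direct pointer" idea discussed in this thread can
be sketched in userspace C as a tagged word: untagged values are ordinary
pointers and take the fast path, while tagged values are slots in a small
second-level table. This is a rough illustration, not the kernel's
page->mapping handling. The tag value 0x4, the forced 8-byte alignment,
the 256-entry table and all sketch_* names are assumptions made up for
the example; as noted in the thread, the lowest bit of page->mapping is
already taken, so a real discriminator might have to be a page flag or a
magic value instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag bit, chosen only for this sketch. */
#define SKETCH_MAPPING_COOKIE ((uintptr_t)0x4)
#define SKETCH_TABLE_SIZE 256u   /* "8 bits worth" of cookies */

/* Stand-in for an address_space; 8-byte alignment keeps bit 2 clear. */
struct sketch_mapping { _Alignas(8) int id; };

/* Second-level table, consulted only for tagged words. */
struct sketch_mapping *sketch_table[SKETCH_TABLE_SIZE];

/* Encode a table slot as a tagged word rather than a pointer. */
uintptr_t sketch_encode_cookie(unsigned slot)
{
    assert(slot < SKETCH_TABLE_SIZE);
    return ((uintptr_t)slot << 3) | SKETCH_MAPPING_COOKIE;
}

/* Fast path: untagged words are plain pointers. Slow path: table lookup. */
struct sketch_mapping *sketch_resolve(uintptr_t word)
{
    if (!(word & SKETCH_MAPPING_COOKIE))
        return (struct sketch_mapping *)word;
    return sketch_table[(word >> 3) % SKETCH_TABLE_SIZE];
}
```

The point of the layout is that ordinary file mappings never touch the
table, so the extra cost would be paid only by the buffer cache pages that
carry the tag.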