From: Chris Mason
Subject: Re: [LSF/MM TOPIC] Fixing large block devices on 32 bit
Date: Fri, 31 Jan 2014 16:20:32 -0500
Message-ID: <52EC13A0.2080806@fb.com>
References: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
In-Reply-To: <1391194978.2172.20.camel@dabdike.int.hansenpartnership.com>
To: James Bottomley, linux-scsi, linux-ide
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On 01/31/2014 02:02 PM, James Bottomley wrote:
> It has been reported:
>
> http://marc.info/?t=139111447200006
>
> That large block devices (specifically devices > 16TB) crash when
> mounted on 32 bit systems. The problem specifically is that although
> CONFIG_LBDAF extends the size of sector_t within the block and storage
> layers to 64 bits, the buffer cache isn't big enough. Specifically,
> buffers are mapped through a single page cache mapping on the backing
> device inode. The size of the allowed offset in the page cache radix
> tree is pgoff_t which is 32 bits, so once the size of device goes beyond
> 16TB, this offset wraps and all hell breaks loose.
>
> The problem is that although the current single drive limit is about
> 4TB, it will only be a couple of years before 16TB devices are
> available. By then, I bet that most arm (and other exotic CPU) Linux
> based personal file servers are still going to be 32 bit, so they're not
> going to be able to take this generation (or beyond) of drives. The
> thing I'd like to discuss is how to fix this. There are several options
> I see, but there might be others.
>
>      1. Try to pretend that CONFIG_LBDAF is supposed to cap out at 16TB
>         and there's nothing we can do about it ... this won't be at all
>         popular with arm based file server manufacturers.
>      2. Slyly make sure that the buffer cache won't go over 16TB by
>         keeping filesystem metadata below that limit ... the horse has
>         probably already bolted on this one.
>      3. Increase pgoff_t and the radix tree indexes to u64 for
>         CONFIG_LBDAF. This will blow out the size of struct page on 32
>         bits by 4 bytes and may have other knock on effects, but at
>         least it will be transparent.
>      4. add an additional radix tree lookup within the buffer cache, so
>         instead of a single inode for the buffer cache, we have a radix
>         tree of them which are added and removed at the granularity of
>         16TB offsets as entries are requested.
>

I started typing up that #3 is going to cause problems with RCU radix,
but it looks ok. I think we'll find a really scary number of places
that interchange pgoff_t with unsigned long though.

I prefer #4, but it means each FS needs to add code too. We assume
page_offset(page) maps to the disk in more than a few places.

-chris
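
For reference, the 16TB figure in the quoted message falls out of the arithmetic: a 32-bit page index covers 2^32 pages, and with 4KiB pages that is 2^44 bytes. Below is a minimal sketch in C, assuming 4KiB pages. It is not kernel code: pgoff32_t, page_index_32(), struct chunked_index and split_index() are hypothetical names used only to illustrate the wrap and the kind of (16TB window, index within window) split that option #4 implies.

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical stand-in for pgoff_t on a 32-bit kernel, where
 * unsigned long is 32 bits wide. */
typedef uint32_t pgoff32_t;

/* Page index as a 32-bit index would hold it: any bits above the
 * low 32 are discarded, so offsets at or beyond 2^44 bytes alias
 * onto low page indexes. */
static inline pgoff32_t page_index_32(uint64_t byte_offset)
{
    return (pgoff32_t)(byte_offset >> PAGE_SHIFT);
}

/* Illustration of the split behind option #4: a window number that
 * selects one 16 TiB region, plus a 32-bit page index within it. */
struct chunked_index {
    uint32_t  chunk; /* which 16 TiB window the offset falls in */
    pgoff32_t index; /* page index inside that window */
};

static inline struct chunked_index split_index(uint64_t byte_offset)
{
    uint64_t full = byte_offset >> PAGE_SHIFT; /* full 64-bit page index */
    struct chunked_index ci;

    ci.chunk = (uint32_t)(full >> 32);
    ci.index = (pgoff32_t)full;
    return ci;
}
```

With these definitions, the last page below 2^44 bytes has index 0xFFFFFFFF, while the page at exactly 2^44 bytes gets index 0 from page_index_32(), the same index as the first page of the device. split_index() keeps the two apart by reporting window 1 for the latter.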