From: Nick Piggin
Date: Fri, 27 Apr 2007 20:05:27 +1000
To: Christoph Hellwig
CC: Christoph Lameter, "Eric W. Biederman", linux-kernel@vger.kernel.org, Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky
Subject: Re: [00/17] Large Blocksize Support V3
Message-ID: <4631CAE7.2080109@yahoo.com.au>
In-Reply-To: <20070426160715.GB16337@infradead.org>
References: <463048FE.5000600@yahoo.com.au> <46304D50.1040706@yahoo.com.au> <46305327.2000206@yahoo.com.au> <4630593C.8070905@yahoo.com.au> <20070426160715.GB16337@infradead.org>

Christoph Hellwig wrote:
> On Thu, Apr 26, 2007 at 05:48:12PM +1000, Nick Piggin wrote:
>
>>>Well maybe you could explain what you want. Preferably without redefining
>>>the established terms?
>>
>>Support for larger buffers than page cache pages.
>
>
> I don't think you really want this :) The whole non-pagecache I/O
> path before 2.3 was a total pain just because it used buffers to drive
> I/O. Add to that buffers bigger than a page and you add another
> two magnitudes of complexity. If you want to see a mess like that
> download one of the early XFS/Linux releases that had an I/O path
> like that. I _really_ _really_ don't want to go there.

I'm not actually suggesting to add anything like that. But I think
larger blocks can be doable while retaining the "buffer" layer as a
relatively simple pagecache to block translation.

Anyway, I'm working on patches... they might crash and burn, but we
might have something to talk about later.

> Linux has a long tradition of trading a tiny bit of efficiency for
> much cleaner code, and I'd for 100% go down Christoph's route here.
> Then again I'd actually be rather surprised if > page buffers
> were more efficient - you'd run into shitloads of overhead due to
> them being non-contiguous like calling vmap all over the place,
> reprogramming iommus to at least make them look virtually contiguous [1],
> etc..

I still think hardware should work reasonably well with 4K pages. The
SGI io controllers and/or the Linux block layer that doesn't allow more
than 128 sg entries is clearly suboptimal if the hardware runs twice as
fast with 2MB submissions.

> I also don't quite get what your problem with higher order allocations
> is. order 1 allocations are generally just fine, and in fact
> thread stacks are >= order 1 on most architectures. And if the pagecache
> uses higher order allocations that means we'll finally fix our problems
> with them, which we have to do anyway. Workloads continue to grow and
> with them the kernel overhead to manage them, while the pagesize for
> many architectures is fixed. So we'll have to deal with order 1
> and order 2 allocations better just for backing kmalloc and co.

The pagecache is much bigger and often sees a lot more activity than
these other things though. Also, the more things you add to higher
order allocations, the more pressure you have.

I like PAGE_SIZE pagecache, because it is reliable and really fast: if
you need to reclaim a page it should be almost O(1).

> Or think jumboframes for that matter.

They can actually run into problems if the hardware wants contiguous
memory.

I don't know why you think the fragmentation issues are just magically
fixed. It is hard and inefficient to reclaim larger order blocks (even
with lumpy reclaim), and Mel's patches aren't perfect. Actually, last
time I looked, they needed to keep at least 16MB of pages free to be
reasonably effective (or do we just say that people with less than X MB
of memory shouldn't be accessing these filesystems anyway?), and I'm
not sure if they have been tested for long term stability in the
presence of a reasonable amount of higher order allocations.

-- 
SUSE Labs, Novell Inc.
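
To put rough numbers on the scatter-gather point above (128 sg entries
versus 2MB submissions), here is a small back-of-the-envelope sketch.
It assumes a 4K PAGE_SIZE and, as a worst case, that every sg entry
covers exactly one allocation unit with no merging of physically
contiguous pages. The function names are made up for illustration and
are not kernel interfaces.

```python
PAGE_SIZE = 4096  # assumed 4K base page, as discussed in the mail


def max_request_bytes(sg_entries, segment_bytes=PAGE_SIZE):
    """Largest single I/O if each sg entry covers one segment (no merging)."""
    return sg_entries * segment_bytes


def sg_entries_needed(request_bytes, segment_bytes=PAGE_SIZE):
    """Entries needed for a request of this size (ceiling division)."""
    return -(-request_bytes // segment_bytes)


def order_bytes(order, page_size=PAGE_SIZE):
    """Size of an order-N allocation: 2**order contiguous pages."""
    return page_size << order


if __name__ == "__main__":
    two_mb = 2 * 1024 * 1024
    # Hypothetical worst case: 128 entries, one 4K page each.
    print("max I/O with 128 x 4K entries:", max_request_bytes(128), "bytes")
    print("4K entries needed for 2MB:", sg_entries_needed(two_mb))
    for order in (0, 1, 2):
        seg = order_bytes(order)
        print("order", order, "=", seg, "bytes ->",
              sg_entries_needed(two_mb, seg), "entries for 2MB")
```

Under these assumptions a 128-entry limit caps a request at 512KB with
4K pages, while a 2MB submission needs 512 such entries, or 128 entries
if every segment is an order 2 (16K) block. In practice adjacent pages
that happen to be physically contiguous can share an entry, so the
real entry count may be lower.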