Date: Tue, 14 Dec 2010 16:20:50 +0800
From: Wu Fengguang
To: Trond Myklebust
Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner, "Theodore Ts'o", Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim, linux-mm, "linux-fsdevel@vger.kernel.org", LKML, "Tang, Feng"
Subject: Re: [PATCH 30/35] nfs: heuristics to avoid commit
Message-ID: <20101214082050.GC6940@localhost>
References: <20101213144646.341970461@intel.com> <20101213150329.953837345@intel.com> <1292273626.8795.19.camel@heimdal.trondhjem.org>
In-Reply-To: <1292273626.8795.19.camel@heimdal.trondhjem.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Dec 14, 2010 at 04:53:46AM +0800, Trond Myklebust wrote:
> On Mon, 2010-12-13 at 22:47 +0800, Wu Fengguang wrote:
> > plain text document attachment (writeback-nfs-should-commit.patch)
> > The heuristics introduced by commit 420e3646 ("NFS: Reduce the number of
> > unnecessary COMMIT calls") do not work well for large inodes being
> > actively written to.
> >
> > Refine the criterion to
> > - it has gone quiet (all data transferred to server)
> > - has accumulated >= 4MB data to commit (so it will be large IO)
> > - too few active commits (hence active IO) in the server
>
> Where does the number 4MB come from?
> If I'm writing a 4GB file, I
> certainly do not want to commit every 4MB; that would make for a total
> of 1000 commit requests in addition to the writes. On a 64-bit client
> +server both having loads of memory and connected by a decently fast
> network, that can be a significant slowdown...

Sorry, the description omits too many details. Let me show you the
behavior in a real workload first.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/writeback-inode.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/nfs-commit-300.png

On a 3GB client writing 50MB/s to the NFS server, the write chunk size
and commit size are mostly 32MB and 64MB. The ->writepages() size and
the later commit size actually scale up to the available write
bandwidth ("[PATCH 20/35] writeback: scale IO chunk size up to device
bandwidth").

So the "4MB" here is merely the minimal threshold. I chose it mainly by
the rule of thumb "it's not too bad an IO size". It's mainly used for
these cases:

1) low client=>server write bandwidth

In this case the VFS will call ->writepages() with a small (but always
>= 4MB, see patch 20/35) nr_to_write, and the 4MB threshold helps
accumulate to-be-committed pages over multiple ->write_inode() calls.

As you said, it would help to further scale this 4MB threshold up to
the client's memory size. But complexity arises in the next case.
2) bandwidth/memory is high, but there are lots of concurrent dd's

When doing 10 dd's with mem=3G, it still achieves 20-30MB write/commit
sizes:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-10dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-13/writeback-300.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-10dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-13/nfs-commit-300.png

However, when there are 100 dd's, you cannot wait for each inode to
accumulate much more than 4MB of pages to commit, because 4*100MB is
approaching the client's dirty limit. So you'll see around 4-5MB commit
sizes in this graph:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/nfs-commit-300.png

Then you see the problem: how to decide on one auto-scaled threshold to
start a commit for the current inode? It's easy for the 1-dd case.
However, when there are N dd's (admittedly NFS clients rarely do large
N), we don't readily know the number N by which to scale down the
threshold that's suitable for the 1-dd case.

So I gave up the scale-to-memory commit threshold idea that could help
case (1) and just do it in a dumb but hopefully good enough way. But
I'm open to better ideas :)

> Most of the time, we really want the server to be managing its dirty
> cache entirely independently of the client. The latter should only be
> sending the commit when it really needs to free up those pages.

Agreed. And that creates the one major conflict I'm fighting with: make
the commit size large, but not so large that it causes unacceptable
fluctuations in the data flow. That led to the decision to include
patch 20/35 in this series. It magically reduces the frequency of
->writepages()/write_inode() calls and makes the number of pages
written in each ->writepages() (and the later commit) semi-adaptive to
the number of concurrent dd's.

Thanks,
Fengguang
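
To make the arithmetic in this thread concrete, here is a minimal sketch
in C. It is not code from the patch: the helper names commits_needed()
and pinned_bytes() are hypothetical, and it assumes binary units
(1MB = 2^20 bytes) and that every writer holds back exactly the
threshold amount before committing.

```c
#include <stdint.h>

#define MB (1024ULL * 1024ULL)
#define GB (1024ULL * MB)

/*
 * Hypothetical helper: number of COMMIT requests needed to cover
 * file_size bytes if one COMMIT is sent per commit_size bytes,
 * rounding up for a partial tail.
 */
static uint64_t commits_needed(uint64_t file_size, uint64_t commit_size)
{
	return (file_size + commit_size - 1) / commit_size;
}

/*
 * Hypothetical helper: uncommitted data held on the client if each of
 * nr_writers inodes accumulates per_inode_threshold bytes before it
 * starts a commit (a simplifying assumption, see the lead-in).
 */
static uint64_t pinned_bytes(uint64_t nr_writers, uint64_t per_inode_threshold)
{
	return nr_writers * per_inode_threshold;
}
```

With these, commits_needed(4 * GB, 4 * MB) is 1024, which is the "1000
commit requests" concern in the quoted mail, while
commits_needed(4 * GB, 64 * MB) is only 64. In the other direction,
pinned_bytes(100, 4 * MB) is 400MB, the 4*100MB figure that approaches
the client's dirty limit in the 100-dd case.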