From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC 0/8] Cpuset aware writeback
From: Peter Zijlstra
In-Reply-To: <20070116054743.15358.77287.sendpatchset@schroedinger.engr.sgi.com>
References: <20070116054743.15358.77287.sendpatchset@schroedinger.engr.sgi.com>
Content-Type: text/plain
Date: Tue, 16 Jan 2007 08:38:10 +0100
Message-Id: <1168933090.22935.30.camel@twins>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Christoph Lameter
Cc: akpm@osdl.org, Paul Menage, linux-kernel@vger.kernel.org,
    Nick Piggin, linux-mm@kvack.org, Andi Kleen, Paul Jackson,
    Dave Chinner
List-ID:

On Mon, 2007-01-15 at 21:47 -0800, Christoph Lameter wrote:
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole. This may result in a large percentage of a cpuset
> becoming dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.
>
> Writeback will occur during the LRU scans. But such writeout
> is not effective since we write page by page and not in inode page
> order (regular writeback).
>
> In order to fix the problem we first of all introduce a method to
> establish a map of nodes that contain dirty pages for each
> inode mapping.
>
> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.
>
> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.
>
> After we have the cpuset throttling in place we can then make
> further fixups:
>
> A. We can do inode based writeout from direct reclaim,
>    avoiding single page writes to the filesystem.
>
> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
>    from the available pages in a node. This allows us to
>    accurately calculate the dirty ratio even if large portions
>    of the node have been allocated for huge pages or for
>    slab pages.

What about mlock'ed pages?

> There are a couple of points where some better ideas could be used:
>
> 1. The nodemask expands the inode structure significantly if the
>    architecture allows a high number of nodes. This is only an issue
>    for IA64. For that platform we expand the inode structure by
>    128 bytes (to support 1024 nodes). The last patch attempts to
>    address the issue by using the knowledge about the maximum
>    possible number of nodes, determined at bootup, to shrink the
>    nodemask.

Not the prettiest indeed; no ideas though.

> 2. The calculation of the per cpuset limits can require looping over
>    a number of nodes, which may bring the performance of
>    get_dirty_limits near pre-2.6.18 performance (before the
>    introduction of the ZVC counters) (only for cpuset based limit
>    calculation). There is no way of keeping these counters per
>    cpuset since cpusets may overlap.

Well, you gain functionality, you lose some runtime; sad, but probably
worth it.

Otherwise it all looks good.

Acked-by: Peter Zijlstra

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
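
[Editorial sketch: the per-inode dirty node map described above, modelled
in self-contained userspace C. The names (struct mapping, dirty_nodes,
mapping_set_dirty_node, mapping_dirty_in_cpuset) are illustrative
stand-ins, not the identifiers used by the patch series; a plain machine
word stands in for the kernel's nodemask_t. The comment on size restates
point 1 from the email: 1024 possible nodes is 1024/8 = 128 bytes of mask
per inode.]

/*
 * Minimal model of a "map of nodes that contain dirty pages" kept
 * per inode mapping: a bit is set when a page of the mapping is
 * dirtied on a node, and the mask is intersected with a cpuset's
 * nodemask when selecting inodes for writeback.
 */
#include <stdio.h>
#include <stdbool.h>

#define MAX_NODES 64	/* one word here; IA64 allows 1024 nodes,
			 * which is where the 128-byte mask comes from */

struct mapping {
	unsigned long dirty_nodes;	/* bit n set: node n has dirty pages */
};

/* Called when a page belonging to @m is dirtied on @node. */
static void mapping_set_dirty_node(struct mapping *m, int node)
{
	m->dirty_nodes |= 1UL << node;
}

/*
 * Writeback selection: only pick this inode if it has dirty pages
 * on at least one node of the cpuset doing the writeback.
 */
static bool mapping_dirty_in_cpuset(const struct mapping *m,
				    unsigned long cpuset_nodes)
{
	return (m->dirty_nodes & cpuset_nodes) != 0;
}

int main(void)
{
	struct mapping m = { 0 };

	mapping_set_dirty_node(&m, 2);

	/* A cpuset on nodes 0-1 skips this inode; nodes 2-3 pick it up. */
	printf("nodes 0-1: %d, nodes 2-3: %d\n",
	       mapping_dirty_in_cpuset(&m, 0x3),
	       mapping_dirty_in_cpuset(&m, 0xc));
	return 0;
}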
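
[Editorial sketch: the per-cpuset dirty limit calculation that point 2
worries about, again as self-contained userspace C under assumed names.
struct node_stats, cpuset_dirty_exceeded and the 40% ratio are made up
for illustration; the real series sums the kernel's per-node ZVC
counters inside get_dirty_limits. The loop also folds in fixup B:
NR_UNRECLAIMABLE pages are subtracted from the base the ratio is
computed against.]

/*
 * Model of a per-cpuset dirty check: sum the counters over only the
 * nodes in the cpuset's nodemask and compare against a dirty ratio.
 * This is O(nodes in the cpuset) on every call, unlike a single
 * global ZVC read, hence the pre-2.6.18 performance concern.
 */
#include <stdio.h>
#include <stdbool.h>

#define MAX_NODES 8

struct node_stats {
	unsigned long pages;		/* pages present on the node */
	unsigned long dirty;		/* NR_FILE_DIRTY-style count */
	unsigned long unreclaimable;	/* NR_UNRECLAIMABLE-style count */
};

static struct node_stats nodes[MAX_NODES];

static bool cpuset_dirty_exceeded(unsigned long nodemask, int dirty_ratio)
{
	unsigned long pages = 0, dirty = 0;
	int node;

	for (node = 0; node < MAX_NODES; node++) {
		if (!(nodemask & (1UL << node)))
			continue;
		/* NR_UNRECLAIMABLE shrinks the base, as in fixup B. */
		pages += nodes[node].pages - nodes[node].unreclaimable;
		dirty += nodes[node].dirty;
	}

	/* dirty/pages > dirty_ratio%, kept in integer arithmetic */
	return dirty * 100 > pages * dirty_ratio;
}

int main(void)
{
	/* Node 0 is 90% dirty; node 1 is large and clean. */
	nodes[0] = (struct node_stats){ .pages = 1000, .dirty = 900 };
	nodes[1] = (struct node_stats){ .pages = 4000, .dirty = 0 };

	/* System-wide the 40% limit is not hit (prints 0)... */
	printf("system: %d\n", cpuset_dirty_exceeded(0x3, 40));
	/* ...but a cpuset confined to node 0 is far over it (prints 1),
	 * which is exactly the case the series wants to throttle. */
	printf("cpuset: %d\n", cpuset_dirty_exceeded(0x1, 40));
	return 0;
}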