From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754120AbZLBB4A (ORCPT ); Tue, 1 Dec 2009 20:56:00 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753343AbZLBB4A (ORCPT ); Tue, 1 Dec 2009 20:56:00 -0500 Received: from mx1.redhat.com ([209.132.183.28]:30310 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753445AbZLBBz7 (ORCPT ); Tue, 1 Dec 2009 20:55:59 -0500 Message-ID: <4B15C902.4020008@redhat.com> Date: Tue, 01 Dec 2009 20:55:14 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.4pre) Gecko/20090922 Fedora/3.0-3.9.b4.fc12 Lightning/1.0pre Thunderbird/3.0b4 MIME-Version: 1.0 To: Larry Woodman CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org Subject: Re: [RFC] high system time & lock contention running large mixed workload References: <20091125133752.2683c3e4@bree.surriel.com> <1259618429.2345.3.camel@dhcp-100-19-198.bos.redhat.com> In-Reply-To: <1259618429.2345.3.camel@dhcp-100-19-198.bos.redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/30/2009 05:00 PM, Larry Woodman wrote: > While running workloads that do lots of forking processes, exiting > processes and page reclamation(AIM 7) on large systems very high system > time(100%) and lots of lock contention was observed. > > > > CPU5: > [] ? _spin_lock+0x27/0x48 > [] ? anon_vma_link+0x2a/0x5a > [] ? dup_mm+0x242/0x40c > [] ? copy_process+0xab1/0x12be > [] ? do_fork+0x151/0x330 > [] ? default_wake_function+0x0/0x36 > [] ? _spin_lock_irqsave+0x2f/0x68 > [] ? stub_clone+0x13/0x20 > [] ? system_call_fastpath+0x16/0x1b > > CPU4: > [] ? _spin_lock+0x29/0x48 > [] ? anon_vma_unlink+0x2a/0x84 > [] ? free_pgtables+0x3c/0xe1 > [] ? exit_mmap+0xc5/0x110 > [] ? 
mmput+0x55/0xd9 > [] ? exit_mm+0x109/0x129 > [] ? do_exit+0x1d7/0x712 > [] ? _spin_lock_irqsave+0x2f/0x68 > [] ? do_group_exit+0x86/0xb2 > [] ? sys_exit_group+0x22/0x3e > [] ? system_call_fastpath+0x16/0x1b > > CPU0: > [] ? _spin_lock+0x29/0x48 > [] ? page_check_address+0x9e/0x16f > [] ? page_referenced_one+0x53/0x10b > [] ? page_referenced+0xcd/0x167 > [] ? shrink_active_list+0x1ed/0x2a3 > [] ? shrink_zone+0xa06/0xa38 > [] ? getnstimeofday+0x64/0xce > [] ? do_try_to_free_pages+0x1e5/0x362 > [] ? try_to_free_pages+0x7a/0x94 > [] ? isolate_pages_global+0x0/0x242 > [] ? __alloc_pages_nodemask+0x397/0x572 > [] ? __get_free_pages+0x19/0x6e > [] ? copy_process+0xd1/0x12be > [] ? avc_has_perm+0x5c/0x84 > [] ? user_path_at+0x65/0xa3 > [] ? do_fork+0x151/0x330 > [] ? check_for_new_grace_period+0x78/0xab > [] ? stub_clone+0x13/0x20 > [] ? system_call_fastpath+0x16/0x1b > > > ------------------------------------------------------------------------------ > PerfTop: 864 irqs/sec kernel:99.7% [100000 cycles], (all, 8 > CPUs) > ------------------------------------------------------------------------------ > > samples pcnt RIP kernel function > ______ _______ _____ ________________ _______________ > > 3235.00 - 75.1% - ffffffff814afb21 : _spin_lock > 670.00 - 15.6% - ffffffff81101a33 : page_check_address > 165.00 - 3.8% - ffffffffa01cbc39 : rpc_sleep_on [sunrpc] > 40.00 - 0.9% - ffffffff81102113 : try_to_unmap_one > 29.00 - 0.7% - ffffffff81101c65 : page_referenced_one > 27.00 - 0.6% - ffffffff81101964 : vma_address > 8.00 - 0.2% - ffffffff8125a5a0 : clear_page_c > 6.00 - 0.1% - ffffffff8125a5f0 : copy_page_c > 6.00 - 0.1% - ffffffff811023ca : try_to_unmap_anon > 5.00 - 0.1% - ffffffff810fb014 : copy_page_range > 5.00 - 0.1% - ffffffff810e4d18 : get_page_from_freelist > > > > The cause was determined to be the unconditional call to > page_referenced() for every mapped page encountered in > shrink_active_list(). 
page_referenced() takes the anon_vma->lock and
> calls page_referenced_one() for each vma.  page_referenced_one() then
> calls page_check_address(), which takes the pte_lockptr spinlock.  If
> several CPUs are doing this at the same time there is a lot of
> pte_lockptr spinlock contention with the anon_vma->lock held.  This
> causes contention on the anon_vma->lock, stalling the fork and exit
> paths, and very high system time.
>
> Before the split-LRU patch, shrink_active_list() would only call
> page_referenced() when reclaim_mapped got set.  reclaim_mapped only got
> set when the priority worked its way from 12 all the way down to 7.  This
> prevented page_referenced() from being called from shrink_active_list()
> until the system was really struggling to reclaim memory.
>
> One way to prevent this is to change page_check_address() to do a
> spin_trylock(ptl) when called from shrink_active_list(), and simply
> fail if it cannot take the pte_lockptr spinlock.  This makes
> shrink_active_list() consider the page not referenced and allows the
> anon_vma->lock to be dropped much more quickly.
>
> The attached patch does just that, thoughts???
>

My first thought is that you haven't read the code you are
trying to patch.

The purpose of calling page_referenced() on anonymous pages from
shrink_active_list() is to CLEAR the referenced bit, not to test it!

Not clearing the referenced bit will break page replacement, because
pages that were not recently referenced will appear to have been,
causing them to get another round on the active list, which in turn
could increase the list movement...