From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758136Ab1CaO7c (ORCPT ); Thu, 31 Mar 2011 10:59:32 -0400 Received: from e33.co.us.ibm.com ([32.97.110.151]:49652 "EHLO e33.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753015Ab1CaO7b (ORCPT ); Thu, 31 Mar 2011 10:59:31 -0400 Date: Thu, 31 Mar 2011 07:59:18 -0700 From: "Paul E. McKenney" To: Linus Torvalds Cc: Dave Jones , Andrew Morton , Linux Kernel Subject: Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) Message-ID: <20110331145918.GE2258@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <20110329040939.GA32764@redhat.com> <20110331030917.GB26057@redhat.com> <20110331033432.GA434@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Mar 30, 2011 at 08:44:06PM -0700, Linus Torvalds wrote: > On Wed, Mar 30, 2011 at 8:34 PM, Dave Jones wrote: > > > > so 'perf kmem record sleep 5' shows hundreds of calls kmem_cache_free from > > the kworker processes. Called from shmem_i_callback, __d_free and file_free_rcu. > > My guess is that my fuzzing caused so many allocations that the rcu freeing is > > still ongoing an hour or so after the process has quit.  does that make any sense? > > No, that shouldn't be the case. RCU freeing should go on for just a > few RCU periods, and be done. I think there is some "limit the work we > do for rcu each time" in order to not have bad latencies, but even so > that shouldn't take _that_ long. And as you say, you should see the > freeing in the slab stats. > > So clearly there are shmem inodes being destroyed, but it shouldn't be > from an hour ago. I wonder if your system isn't as idle as you think > it is. > > But I'm cc'ing Paul, maybe he'll disagree and say it's expected and > that the RCU batch size is really small. Or at least give some hint > about how to check the pending rcu state. There is by default a ten-callback-per-softirq per-CPU limit (set by module parameter "blimit"), but if there are more than 10,000 callbacks queued (set by module parameter "qhimark"), RCU stops being Mr. Nice Guy and just plows through the callbacks. I am having a really hard time believing that you could queue a batch of callbacks and still have RCU processing them an hour later. But one way this can happen is if the callbacks requeue themselves, for example, if the RCU callback function contains a call to call_rcu(). I am not seeing that in the __d_free and file_free_rcu callbacks that Dave mentioned, but this does happen in some places in the Linux kernel. And maybe such a place is being invoked in a non-obvious way. There is a debugfs file "rcu/rcudata" that reports the current RCU queue length that is enabled with CONFIG_RCU_TRACE. Here is sample output: rcu_sched: 0 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=10951/1 dn=0 df=1101 of=0 ri=36 ql=0 b=10 1 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=16117/1 dn=0 df=1015 of=0 ri=0 ql=0 b=10 2 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1445/1 dn=0 df=1839 of=0 ri=0 ql=0 b=10 3 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=6681/1 dn=0 df=1545 of=0 ri=0 ql=0 b=10 4 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1003/1 dn=0 df=1992 of=0 ri=0 ql=0 b=10 5 c=17829 g=17830 pq=1 pqc=17829 qp=1 dt=3887/1 dn=0 df=3331 of=0 ri=4 ql=2 b=10 6 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=859/1 dn=0 df=3224 of=0 ri=0 ql=0 b=10 7 c=17829 g=17830 pq=0 pqc=17829 qp=1 dt=3761/1 dn=0 df=1818 of=0 ri=0 ql=2 b=10 rcu_bh: 0 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=10951/1 dn=0 df=0 of=0 ri=0 ql=0 b=10 1 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=16117/1 dn=0 df=13 of=0 ri=0 ql=0 b=10 2 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1445/1 dn=0 df=15 of=0 ri=0 ql=0 b=10 3 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=6681/1 dn=0 df=9 of=0 ri=0 ql=0 b=10 4 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1003/1 dn=0 df=15 of=0 ri=0 ql=0 b=10 5 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3887/1 dn=0 df=15 of=0 ri=0 ql=0 b=10 6 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=859/1 dn=0 df=15 of=0 ri=0 ql=0 b=10 7 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3761/1 dn=0 df=15 of=0 ri=0 ql=0 b=10 The "ql=" gives the per-CPU callback queue size. For example, CPU 7 has two rcu_sched callbacks queued. In CONFIG_TREE_PREEMPT_RCU builds, there will be an additional "rcu_preempt" section. If you see some huge number after a "ql=", that would indicate a problem. ;-) Does this help? Thanx, Paul