* Re: System CPU increasing on idle 2.6.36
       [not found] ` <20101218010801.GE28367@hostway.ca>
@ 2010-12-29 22:03   ` Simon Kirby
       [not found]     ` <AANLkTinq074Q-B=5B4_aXuBm-3ELwGknsJv+OvS1z1OX@mail.gmail.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Simon Kirby @ 2010-12-29 22:03 UTC (permalink / raw)
  To: linux-nfs; +Cc: linux-kernel

On Fri, Dec 17, 2010 at 05:08:01PM -0800, Simon Kirby wrote:

> On Wed, Dec 08, 2010 at 01:25:05PM -0800, Simon Kirby wrote:
>
> > Possibly related to the flush-processes-taking-CPU issues I saw
> > previously, I thought this was interesting. I found a log-crunching box
> > that does all of its work via NFS and spends most of the day sleeping.
> > It has been using a linearly increasing amount of system time during the
> > time when it is sleeping. munin graph:
> >
> > http://0x.ca/sim/ref/2.6.36/cpu_logcrunch_nfs.png
> >...
> > Known 2.6.36 issue? This did not occur on 2.6.35.4, according to the
> > munin graphs. I'll try 2.6.37-rc and see if it changes.
>
> So, back on this topic,
>
> It seems that system CPU from "flush" processes is still increasing
> during and after periods of NFS activity, even with 2.6.37-rc5-git4:
>
> http://0x.ca/sim/ref/2.6.37/cpu_nfs.png
>
> Something is definitely going on while NFS is active, and it then keeps
> happening in the idle periods. top and perf top look the same as in
> 2.6.36. No userland activity at all, but the kernel keeps doing stuff.
>
> I could bisect this, but I have to wait a day for each build, unless I
> can come up with a way to reproduce it more quickly. The mount points
> for which the flush processes are active are the two mount points where
> the logs are read from, rotated, compressed, and unlinked, and where the
> reports are written, running in parallel under an "xargs -P 15".
>
> I'm pretty sure the only syscalls that are reproducing this are read(),
> readdir(), lstat(), write(), rename(), unlink(), and close(). There's
> nothing special happening here...
I've noticed nfs_inode_cache is ever-increasing as well with 2.6.37:

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
2562514 2541520  99%    0.95K  78739       33   2519648K nfs_inode_cache
 467200  285110  61%    0.02K   1825      256      7300K kmalloc-16
 299397  242350  80%    0.19K  14257       21     57028K dentry
 217434  131978  60%    0.55K   7767       28    124272K radix_tree_node
 215232   81522  37%    0.06K   3363       64     13452K kmalloc-64
 183027  136802  74%    0.10K   4693       39     18772K buffer_head
 101120   71184  70%    0.03K    790      128      3160K kmalloc-32
  79616   59713  75%    0.12K   2488       32      9952K kmalloc-128
  66560   41257  61%    0.01K    130      512       520K kmalloc-8
  42126   26650  63%    0.75K   2006       21     32096K ext3_inode_cache

http://0x.ca/sim/ref/2.6.37/inodes_nfs.png
http://0x.ca/sim/ref/2.6.37/cpu2_nfs.png

Perhaps I could bisect just fs/nfs changes between 2.6.35 and 2.6.36 to
try to track this down without having to wait too long, unless somebody
can see what is happening here.

Simon-

^ permalink raw reply	[flat|nested] 16+ messages in thread
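[Editor's note: the fs/nfs-limited bisect Simon proposes could look like the sketch below. The `-- fs/nfs` pathspec makes git choose candidate commits only from those touching that directory, shrinking the number of day-long build/soak cycles. The tag names are from the thread; nothing here is run against a real kernel tree.]

```shell
# Path-limited bisect sketch (v2.6.36 = bad, v2.6.35 = good):
bisect_cmd="git bisect start v2.6.36 v2.6.35 -- fs/nfs"
echo "$bisect_cmd"
# Then, after each build + day-long soak test:
#   git bisect bad     # flush threads started eating system time
#   git bisect good    # system time stayed flat
#   git bisect log     # record of the session, shareable on-list
```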
[parent not found: <AANLkTinq074Q-B=5B4_aXuBm-3ELwGknsJv+OvS1z1OX@mail.gmail.com>]
* Re: System CPU increasing on idle 2.6.36
       [not found]     ` <AANLkTinq074Q-B=5B4_aXuBm-3ELwGknsJv+OvS1z1OX@mail.gmail.com>
@ 2011-01-04 21:40       ` Simon Kirby
  2011-01-05 19:43         ` Mark Moseley
  0 siblings, 1 reply; 16+ messages in thread
From: Simon Kirby @ 2011-01-04 21:40 UTC (permalink / raw)
  To: Mark Moseley; +Cc: linux-nfs, linux-kernel

On Tue, Jan 04, 2011 at 09:42:14AM -0800, Mark Moseley wrote:

> On Wed, Dec 29, 2010 at 2:03 PM, Simon Kirby <sim@hostway.ca> wrote:
>
> > I've noticed nfs_inode_cache is ever-increasing as well with 2.6.37:
> >
> > [slabinfo table snipped; see the previous message]
> >
> > http://0x.ca/sim/ref/2.6.37/inodes_nfs.png
> > http://0x.ca/sim/ref/2.6.37/cpu2_nfs.png
> >
> > Perhaps I could bisect just fs/nfs changes between 2.6.35 and 2.6.36 to
> > try to track this down without having to wait too long, unless somebody
> > can see what is happening here.
>
> I'll get started bisecting too, since this is something of a
> show-stopper. Boxes that pre-2.6.36 would stay up for months at a time
> now have to be powercycled every couple of days (which is about how
> long it takes for this behavior to show up). This is across-the-board
> for about 50 boxes, ranging from 2.6.36 to 2.6.36.2.
> Simon: It's probably irrelevant since these are kernel threads, but
> I'm curious what distro your boxes are running. Ours are Debian Lenny,
> i386, Dell PowerEdge 850s. Just trying to figure out any
> commonalities. I'll get my boxes back on 2.6.36.2 and start watching
> nfs_inode_cache as well.

Same distro, x86_64, similar servers.

I'm not sure if the two cases I am seeing are exactly the same problem,
but on the log-crunching boxes, system time seems proportional to
nfs_inode_cache, and nfs_inode_cache just keeps growing forever; however,
if I stop the load and unmount the NFS mount points, all of the
nfs_inode_cache objects do actually go away (after umount finishes).

It seems the shrinker callback might not be working as intended here.

On the shared-server case, the crazy spinlock contention from all of the
flusher processes happens suddenly, overloads the boxes for 10-15
minutes, and then everything recovers. Across 21 of these boxes, each
has about 500k-700k nfs_inode_cache objects. The log cruncher hit
3.3 million nfs_inode_cache objects before I unmounted.

Are your boxes repeating this behaviour at any predictable interval?

Simon-
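[Editor's note: one way to test Simon's shrinker hypothesis without a full unmount is to force reclaim with `drop_caches=2` and watch whether the nfs_inode_cache count drops. A hedged sketch follows; the `PROBE_LIVE` guard and the canned sample line are illustrative, since the live probe needs root.]

```shell
#!/bin/sh
# Probe: does forcing inode/dentry reclaim shrink nfs_inode_cache?
# The live branch needs root, so it only runs when PROBE_LIVE=1; by
# default the parser is exercised on a canned slabinfo-style line
# (numbers copied from the thread).
count_nfs_inodes() {
    # /proc/slabinfo format: name <active_objs> <num_objs> ...
    awk '$1 == "nfs_inode_cache" { print $2 }' "$1"
}
if [ "${PROBE_LIVE:-0}" = 1 ]; then
    before=$(count_nfs_inodes /proc/slabinfo)
    echo 2 > /proc/sys/vm/drop_caches   # ask the VM to reclaim dentries+inodes
    after=$(count_nfs_inodes /proc/slabinfo)
else
    sample=$(mktemp)
    echo "nfs_inode_cache 2541520 2562514 976 33 8" > "$sample"
    before=$(count_nfs_inodes "$sample")
    after=$before
    rm -f "$sample"
fi
echo "nfs_inode_cache active_objs: before=$before after=$after"
```

If `after` barely moves on an idle mount, that points at the shrinker path (or at objects genuinely pinned, e.g. by dirty pages awaiting writeback).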
* Re: System CPU increasing on idle 2.6.36
  2011-01-04 21:40         ` Simon Kirby
@ 2011-01-05 19:43           ` Mark Moseley
  2011-01-07 18:05             ` Mark Moseley
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Moseley @ 2011-01-05 19:43 UTC (permalink / raw)
  To: Simon Kirby; +Cc: linux-nfs, linux-kernel

On Tue, Jan 4, 2011 at 1:40 PM, Simon Kirby <sim@hostway.ca> wrote:
> On Tue, Jan 04, 2011 at 09:42:14AM -0800, Mark Moseley wrote:
>
>> [...]
>
> Same distro, x86_64, similar servers.
>
> I'm not sure if the two cases I am seeing are exactly the same problem,
> but on the log crunching boxes, system time seems proportional to
> nfs_inode_cache and nfs_inode_cache just keeps growing forever; however,
> if I stop the load and unmount the NFS mount points, all of the
> nfs_inode_cache objects do actually go away (after umount finishes).
>
> It seems the shrinker callback might not be working as intended here.
>
> On the shared server case, the crazy spinlock contention from all of the
> flusher processes happens suddenly and overloads the boxes for 10-15
> minutes, and then everything recovers. Over 21 of these boxes, they
> each have about 500k-700k nfs_inode_cache objects. The log cruncher hit
> 3.3 million nfs_inode_cache objects before I unmounted.
>
> Are your boxes repeating this behaviour at any predictable interval?

Simon:
My boxes definitely fall into the latter category, with spinlock
contention regularly sitting at 60-80% CPU (according to 'perf top').
As far as predictability, not strictly, but it's typically after an
uptime of 2-3 days. They take so long to get into this state that I've
never seen the actual transition in person, just the after-effects of
flush-0:xx gone crazy. These boxes have a number of other NFS mounts,
but it's the flush-0:xx's for the heavily written-to NFS mounts that
are spinning wildly, which you'd expect to be the case. The
infrequently written-to NFS servers' flush-0:xx isn't to be found in
'top' output.
I'd booted into older kernels after my initial reply, so I'm 14 hrs
into booting one box back into 2.6.36.2 and another box into a
double-bisected 2.6.35-2.6.36 kernel (my first bisect hit compile
errors). Both are running normally, but that fits with the pattern so
far.

NFS guys:
Anything else we can be digging up to help debug this? This is a
pretty ugly issue.
* Re: System CPU increasing on idle 2.6.36
  2011-01-05 19:43           ` Mark Moseley
@ 2011-01-07 18:05             ` Mark Moseley
  2011-01-07 18:12               ` Mark Moseley
  2011-01-08  0:52               ` Simon Kirby
  0 siblings, 2 replies; 16+ messages in thread
From: Mark Moseley @ 2011-01-07 18:05 UTC (permalink / raw)
  To: Simon Kirby; +Cc: linux-nfs, linux-kernel

On Wed, Jan 5, 2011 at 11:43 AM, Mark Moseley <moseleymark@gmail.com> wrote:
> On Tue, Jan 4, 2011 at 1:40 PM, Simon Kirby <sim@hostway.ca> wrote:
>> [...]
>
> NFS Guys:
> Anything else we can be digging up to help debug this? This is a
> pretty ugly issue.

NOTE: NFS/Kernel guys: I've left the 2.6.36.2 box still thrashing
about in case there's something you'd like me to look at.

Ok, my 2.6.36.2 kernel box and my 2.6.35->2.6.36 bisect box (bisected
to ce7db282a3830f57f5b05ec48288c23a5c4d66d5 -- this is my first time
doing bisect, so I'll preemptively apologize for doing anything silly)
both went berserk within 15 mins of each other, after an uptime of
around 63 hours for 2.6.36.2 and 65 hours for the bisected box. The
2.6.36.2 one is still running, with all the various flush-0:xx threads
spinning wildly. The bisected box just keeled over and died, but is
back up now. The only kernel messages logged are of the "task
kworker/4:1:359 blocked for more than 120 seconds" variety, all with
_raw_spin_lock_irq at the top of the stack trace.

Looking at slabinfo, specifically this:

  stats=$(grep nfs_inode_cache /proc/slabinfo | awk '{ print $2, $3 }')

the active_objs and num_objs both increase to over a million (these
boxes are delivering mail to NFS-mounted mailboxes, so that's
perfectly reasonable). On both boxes, looking at sar, things start to
go awry around 10am today EST.
At that time on the 2.6.36.2 box, the NFS numbers look like this:

Fri Jan 7 09:58:00 2011: 1079433 1079650
Fri Jan 7 09:59:00 2011: 1079632 1080300
Fri Jan 7 10:00:00 2011: 1080196 1080300
Fri Jan 7 10:01:01 2011: 1080599 1080716
Fri Jan 7 10:02:01 2011: 1081074 1081288

on the bisected, like this:

Fri Jan 7 09:59:34 2011: 1162786 1165320
Fri Jan 7 10:00:34 2011: 1163301 1165320
Fri Jan 7 10:01:34 2011: 1164369 1165450
Fri Jan 7 10:02:35 2011: 1164179 1165450
Fri Jan 7 10:03:35 2011: 1165795 1166958

When the bisected box finally died, the last numbers were:

Fri Jan 7 10:40:33 2011: 1177156 1177202
Fri Jan 7 10:42:21 2011: 1177157 1177306
Fri Jan 7 10:44:55 2011: 1177201 1177324
Fri Jan 7 10:45:55 2011: 1177746 1177826

On the still-thrashing 2.6.36.2 box, the high-water mark is:

Fri Jan 7 10:23:30 2011: 1084020 1084070

and once things went awry, the active_objs started falling away while
num_objs has stayed at 1084070. The last numbers were:

Fri Jan 7 12:19:34 2011: 826623 1084070

The bisected box had reached 1 million entries (or, more significantly,
1048576 = 2^20 entries) by yesterday evening. The 2.6.36.2 box hit that
by 7am EST today. So in neither case was there a big spike in entries.

These boxes have identical workloads. They're not accessible from the
net, so there's no chance of a DDoS or anything like that. The
significance of 10am EST could be either uptime-related (all of these
have gone down after 2-3 days) or just due to the expected
early-morning spike in mail flow.
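[Editor's note: the timestamped pairs above come from periodically sampling /proc/slabinfo with the grep/awk one-liner quoted in the message. A minimal sketch of such a sampler is below; the helper name and canned input line are illustrative, and on a real box it would be pointed at /proc/slabinfo (as root) from cron or a sleep loop.]

```shell
#!/bin/sh
# Emit one "<date>: <active_objs> <num_objs>" line per call, using the
# same two slabinfo fields as the one-liner in the message above.
# Demonstrated on a canned slabinfo-style line so it runs anywhere.
sample_nfs_slab() {   # $1 = file in /proc/slabinfo format
    awk -v ts="$(date)" \
        '$1 == "nfs_inode_cache" { printf "%s: %s %s\n", ts, $2, $3 }' "$1"
}
demo=$(mktemp)
echo "nfs_inode_cache 1079433 1079650 976 33 8" > "$demo"
line=$(sample_nfs_slab "$demo")
echo "$line"
rm -f "$demo"
```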
* Re: System CPU increasing on idle 2.6.36
  2011-01-07 18:05             ` Mark Moseley
@ 2011-01-07 18:12               ` Mark Moseley
  2011-01-07 19:33                 ` Mark Moseley
  1 sibling, 1 reply; 16+ messages in thread
From: Mark Moseley @ 2011-01-07 18:12 UTC (permalink / raw)
  To: Simon Kirby; +Cc: linux-nfs, linux-kernel

On Fri, Jan 7, 2011 at 10:05 AM, Mark Moseley <moseleymark@gmail.com> wrote:
> [...]
>
> the active_objs and num_objs both increase to over a million (these
> boxes are delivering mail to NFS-mounted mailboxes, so that's
> perfectly reasonable).
>
> [...]

Actually, I shouldn't say 'perfectly reasonable'. On other boxes in
that pool, doing the same workload (but running 2.6.32.27), the number
is more like:

117196 120756

after 9 days of uptime. So maybe 8x that isn't quite reasonable after
only 2.5 days of uptime.
* Re: System CPU increasing on idle 2.6.36 2011-01-07 18:12 ` Mark Moseley @ 2011-01-07 19:33 ` Mark Moseley 0 siblings, 0 replies; 16+ messages in thread From: Mark Moseley @ 2011-01-07 19:33 UTC (permalink / raw) To: Simon Kirby; +Cc: linux-nfs, linux-kernel On Fri, Jan 7, 2011 at 10:12 AM, Mark Moseley <moseleymark@gmail.com> wrote: > On Fri, Jan 7, 2011 at 10:05 AM, Mark Moseley <moseleymark@gmail.com> wrote: >> On Wed, Jan 5, 2011 at 11:43 AM, Mark Moseley <moseleymark@gmail.com> wrote: >>> On Tue, Jan 4, 2011 at 1:40 PM, Simon Kirby <sim@hostway.ca> wrote: >>>> On Tue, Jan 04, 2011 at 09:42:14AM -0800, Mark Moseley wrote: >>>> >>>>> On Wed, Dec 29, 2010 at 2:03 PM, Simon Kirby <sim@hostway.ca> wrote: >>>>> >>>>> > I've noticed nfs_inode_cache is ever-increasing as well with 2.6.37: >>>>> > >>>>> > ?OBJS ACTIVE ?USE OBJ SIZE ?SLABS OBJ/SLAB CACHE SIZE NAME >>>>> > 2562514 2541520 ?99% ? ?0.95K ?78739 ? ? ? 33 ? 2519648K nfs_inode_cache >>>>> > 467200 285110 ?61% ? ?0.02K ? 1825 ? ? ?256 ? ? ?7300K kmalloc-16 >>>>> > 299397 242350 ?80% ? ?0.19K ?14257 ? ? ? 21 ? ? 57028K dentry >>>>> > 217434 131978 ?60% ? ?0.55K ? 7767 ? ? ? 28 ? ?124272K radix_tree_node >>>>> > 215232 ?81522 ?37% ? ?0.06K ? 3363 ? ? ? 64 ? ? 13452K kmalloc-64 >>>>> > 183027 136802 ?74% ? ?0.10K ? 4693 ? ? ? 39 ? ? 18772K buffer_head >>>>> > 101120 ?71184 ?70% ? ?0.03K ? ?790 ? ? ?128 ? ? ?3160K kmalloc-32 >>>>> > ?79616 ?59713 ?75% ? ?0.12K ? 2488 ? ? ? 32 ? ? ?9952K kmalloc-128 >>>>> > ?66560 ?41257 ?61% ? ?0.01K ? ?130 ? ? ?512 ? ? ? 520K kmalloc-8 >>>>> > ?42126 ?26650 ?63% ? ?0.75K ? 2006 ? ? ? 21 ? ? 32096K ext3_inode_cache >>>>> > >>>>> > http://0x.ca/sim/ref/2.6.37/inodes_nfs.png >>>>> > http://0x.ca/sim/ref/2.6.37/cpu2_nfs.png >>>>> > >>>>> > Perhaps I could bisect just fs/nfs changes between 2.6.35 and 2.6.36 to >>>>> > try to track this down without having to wait too long, unless somebody >>>>> > can see what is happening here. 
>>>>> >>>>> I'll get started bisecting too, since this is something of a >>>>> show-stopper. Boxes that pre-2.6.36 would stay up for months at a time >>>>> now have to be powercycled every couple of days (which is about how >>>>> long it takes for this behavior to show up). This is across-the-board >>>>> for about 50 boxes, ranging from 2.6.36 to 2.6.36.2. >>>>> >>>>> Simon: It's probably irrelevant since these are kernel threads, but >>>>> I'm curious what distro your boxes are running. Ours are Debian Lenny, >>>>> i386, Dell Poweredge 850s. Just trying to figure out any >>>>> commonalities. I'll get my boxes back on 2.6.36.2 and start watching >>>>> nfs_inode_cache as well. >>>> >>>> Same distro, x86_64, similar servers. >>>> >>>> I'm not sure if the two cases I am seeing are exactly the same problem, >>>> but on the log crunching boxes, system time seems proportional to >>>> nfs_inode_cache and nfs_inode_cache just keeps growing forever; however, >>>> if I stop the load and unmount the NFS mount points, all of the >>>> nfs_inode_cache objects do actually go away (after umount finishes). >>>> >>>> It seems the shrinker callback might not be working as intended here. >>>> >>>> On the shared server case, the crazy spinlock contention from all of the >>>> flusher processes happens suddenly and overloads the boxes for 10-15 >>>> minutes, and then everything recovers. Over 21 of these boxes, they >>>> each have about 500k-700k nfs_inode_cache objects. The log cruncher hit >>>> 3.3 million nfs_inode_cache objects before I unmounted. >>>> >>>> Are your boxes repeating this behaviour at any predictable interval? >>> >>> Simon: >>> My boxes definitely fall into the latter category, with spinlock >>> regularly sitting at 60-80% CPU (according to 'perf top'). As far as >>> predictability, not strictly, but it's typically after an uptime of >>> 2-3 days. 
They take so long to get into this state that I've never >>> seen the actual transition in person, just the after-effects of >>> flush-0:xx gone crazy. These boxes have a number of other NFS mounts, >>> but it's on the flush-0:xx's for the heavily written-to NFS mounts >>> that are spinning wildly, which you'd expect to be the case. The >>> infrequently written-to NFS servers' flush-0:xx isn't to be found in >>> 'top' output. >>> >>> I'd booted into older kernels after my initial reply, so I'm 14 hrs >>> into booting a box back into 2.6.36.2 and another box into a >>> double-bisected 2.6.35-2.6.36 kernel (my first bisect hit compile >>> errors). Both are running normally but that fits with the pattern so >>> far. >>> >>> NFS Guys: >>> Anything else we can be digging up to help debug this? This is a >>> pretty ugly issue. >>> >> >> NOTE: NFS/Kernel guys: I've left the 2.6.36.2 box still thrashing >> about in case there's something you'd like me to look at. >> >> Ok, both my 2.6.36.2 kernel box and my 2.6.35->2.6.36 bisect box (was >> bisected to ce7db282a3830f57f5b05ec48288c23a5c4d66d5 -- this is my >> first time doing bisect, so I'll preemptively apologize for doing >> anything silly) both went berserk within 15 mins of each other, after >> an uptime of around 63 hours for 2.6.36.2 and 65 hours for the >> bisected box. The 2.6.36.2 one is still running with all the various >> flush-0:xx threads spinning wildly. The bisected box just keeled over >> and died, but is back up now. The only kernel messages logged are just >> of the "task kworker/4:1:359 blocked for more than 120 seconds" >> variety, all with _raw_spin_lock_irq at the top of the stack trace. >> >> Looking at slabinfo, specifically, this: stats=$(grep nfs_inode_cache >> /proc/slabinfo | awk '{ print $2, $3 }') >> the active_objs and num_objs both increase to over a million (these >> boxes are delivering mail to NFS-mounted mailboxes, so that's >> perfectly reasonable). 
On both boxes, looking at sar, things start to >> go awry around 10am today EST. At that time on the 2.6.36.2 box, the >> NFS numbers look like this: >> >> Fri Jan 7 09:58:00 2011: 1079433 1079650 >> Fri Jan 7 09:59:00 2011: 1079632 1080300 >> Fri Jan 7 10:00:00 2011: 1080196 1080300 >> Fri Jan 7 10:01:01 2011: 1080599 1080716 >> Fri Jan 7 10:02:01 2011: 1081074 1081288 >> >> on the bisected, like this: >> >> Fri Jan 7 09:59:34 2011: 1162786 1165320 >> Fri Jan 7 10:00:34 2011: 1163301 1165320 >> Fri Jan 7 10:01:34 2011: 1164369 1165450 >> Fri Jan 7 10:02:35 2011: 1164179 1165450 >> Fri Jan 7 10:03:35 2011: 1165795 1166958 >> >> When the bisected box finally died, the last numbers were: >> >> Fri Jan 7 10:40:33 2011: 1177156 1177202 >> Fri Jan 7 10:42:21 2011: 1177157 1177306 >> Fri Jan 7 10:44:55 2011: 1177201 1177324 >> Fri Jan 7 10:45:55 2011: 1177746 1177826 >> >> On the still-thrashing 2.6.36.2 box, the highwater mark is: >> >> Fri Jan 7 10:23:30 2011: 1084020 1084070 >> >> and once things went awry, the active_objs started falling away and >> the number_objs has stayed at 1084070. Last numbers were: >> >> Fri Jan 7 12:19:34 2011: 826623 1084070 >> >> The bisected box had reached 1mil entries (or more significantly >> 1048576 entries) by y'day evening. The 2.6.36.2 box hit that by 7am >> EST today. So in neither case was there a big spike in entries. >> >> These boxes have identical workloads. They're not accessible from the >> net, so there's no chance of a DDoS or something. The significance of >> 10am EST could be either uptime-related (all these have gone down >> after 2-3 days) or just due to the to-be-expected early morning spike >> in mail flow. >> > > Actually, I shouldn't say 'perfectly reasonable'. On other boxes in > that pool, doing the same workload (but running 2.6.32.27), the number > is more like: > > 117196 120756 > > after 9 days of uptime. So maybe 8 x that isn't quite reasonable > after only 2.5 days of uptime. 
> On the 2.6.36.2 box, after shutting down everything else that'd touch NFS, I let the flush threads spin for a long time with nfs and rpc debug turned up to 65535. I also tried unmounting one of the NFS partitions which hung for a very very long time. As they spun there was no debug logging except the occasional thing. It eventually spit out 197903 messages of "NFS: clear cookie (0xeadd3288/0x(nil))" (different addresses in each line, obviously) and since then things quieted back down. The flush-0:xx threads don't even show up in 'top' output anymore and _raw_spin_lock in 'perf top' has gone back down to its usual area of 3-5%. ^ permalink raw reply [flat|nested] 16+ messages in thread
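The timestamped nfs_inode_cache samples quoted above come from a one-liner along the lines of `stats=$(grep nfs_inode_cache /proc/slabinfo | awk '{ print $2, $3 }')`. A minimal sketch of such a sampler, wrapped as a helper — the function name and log path are invented for illustration; fields 2 and 3 of a /proc/slabinfo data line are active_objs and num_objs in the standard layout:

```shell
#!/bin/sh
# Print "<active_objs> <num_objs>" for nfs_inode_cache.
# Takes an alternate slabinfo path as $1 so it can be exercised offline.
sample_nfs_inodes() {
    grep '^nfs_inode_cache ' "${1:-/proc/slabinfo}" | awk '{ print $2, $3 }'
}

# Run from cron (or a loop) to build a per-minute history like the one above:
#   echo "$(date): $(sample_nfs_inodes)" >> /var/log/nfs_inode_cache.log
```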
* Re: System CPU increasing on idle 2.6.36 2011-01-07 18:05 ` Mark Moseley 2011-01-07 18:12 ` Mark Moseley @ 2011-01-08 0:52 ` Simon Kirby 2011-01-08 1:30 ` Mark Moseley 1 sibling, 1 reply; 16+ messages in thread From: Simon Kirby @ 2011-01-08 0:52 UTC (permalink / raw) To: Mark Moseley; +Cc: linux-nfs, linux-kernel On Fri, Jan 07, 2011 at 10:05:39AM -0800, Mark Moseley wrote: > NOTE: NFS/Kernel guys: I've left the 2.6.36.2 box still thrashing > about in case there's something you'd like me to look at. > > Ok, both my 2.6.36.2 kernel box and my 2.6.35->2.6.36 bisect box (was > bisected to ce7db282a3830f57f5b05ec48288c23a5c4d66d5 -- this is my > first time doing bisect, so I'll preemptively apologize for doing That's a merge commit, so that doesn't sound right (it doesn't contain changes by itself). How were you running the bisect? Were you definitely seeing a good case since v2.6.35? Could you post your "git bisect log" so I can see if I can follow it? > anything silly) both went berserk within 15 mins of each other, after > an uptime of around 63 hours for 2.6.36.2 and 65 hours for the > bisected box. The 2.6.36.2 one is still running with all the various > flush-0:xx threads spinning wildly. The bisected box just keeled over > and died, but is back up now. The only kernel messages logged are just > of the "task kworker/4:1:359 blocked for more than 120 seconds" > variety, all with _raw_spin_lock_irq at the top of the stack trace. > > Looking at slabinfo, specifically, this: stats=$(grep nfs_inode_cache > /proc/slabinfo | awk '{ print $2, $3 }') > the active_objs and num_objs both increase to over a million (these > boxes are delivering mail to NFS-mounted mailboxes, so that's > perfectly reasonable). On both boxes, looking at sar, things start to > go awry around 10am today EST. 
At that time on the 2.6.36.2 box, the > NFS numbers look like this: > > Fri Jan 7 09:58:00 2011: 1079433 1079650 > Fri Jan 7 09:59:00 2011: 1079632 1080300 > Fri Jan 7 10:00:00 2011: 1080196 1080300 > Fri Jan 7 10:01:01 2011: 1080599 1080716 > Fri Jan 7 10:02:01 2011: 1081074 1081288 > > on the bisected, like this: > > Fri Jan 7 09:59:34 2011: 1162786 1165320 > Fri Jan 7 10:00:34 2011: 1163301 1165320 > Fri Jan 7 10:01:34 2011: 1164369 1165450 > Fri Jan 7 10:02:35 2011: 1164179 1165450 > Fri Jan 7 10:03:35 2011: 1165795 1166958 > > When the bisected box finally died, the last numbers were: > > Fri Jan 7 10:40:33 2011: 1177156 1177202 > Fri Jan 7 10:42:21 2011: 1177157 1177306 > Fri Jan 7 10:44:55 2011: 1177201 1177324 > Fri Jan 7 10:45:55 2011: 1177746 1177826 Yeah, I was seeing load proportional to these, but I don't think their growth is anything wrong. It just seems to be something that is taking up CPU from flush proceses as they grow, and something which is causing a thundering herd of reclaim with lots of spinlock contention. Hmmm.. I tried to see if drop_caches would clear nfs_inode_cache and thus the CPU usage. 
The top of slabtop before doing anything:

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1256343 1221799  97%    0.95K  38071       33   1218272K nfs_inode_cache

# echo 2 > /proc/sys/vm/drop_caches

this took about 40 seconds to execute, and perf top showed this for
pretty much the whole time:

             samples  pcnt function                  DSO
             _______ _____ _________________________ ___________

             1188.00 45.1% _raw_spin_lock            [kernel]
              797.00 30.2% rpcauth_cache_shrinker    [kernel]
              132.00  5.0% shrink_slab               [kernel]
               60.00  2.3% queue_io                  [kernel]
               59.00  2.2% _cond_resched             [kernel]
               44.00  1.7% nfs_writepages            [kernel]
               35.00  1.3% writeback_single_inode    [kernel]
               34.00  1.3% writeback_sb_inodes       [kernel]
               23.00  0.9% do_writepages             [kernel]
               20.00  0.8% bit_waitqueue             [kernel]
               15.00  0.6% redirty_tail              [kernel]
               11.00  0.4% __iget                    [kernel]
               11.00  0.4% __GI_strstr               libc-2.7.so

slabtop then showed:

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1256211  531720  42%    0.95K  38067       33   1218144K nfs_inode_cache

and this doesn't change further with drop_caches=3 or 1. It's strange
that most of them were freed, but not all...

We do actually have RPC_CREDCACHE_DEFAULT_HASHBITS set to 10 at build
time, which might explain a bigger credcache and time to free that. Did
you change your auth_hashtable_size at all? Anyway, even with all of
that, system CPU doesn't change.

Ok, weird. Running "sync" actually caused "vmstat 1" to go from 25% CPU
spikes every 5 seconds with nfs_writepages in the "perf top" every 5
seconds to being idle! /proc/vmstat showed only nr_dirty 8 and
nr_writeback 0 before I ran "sync". Either that, or it was just good
timing.
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free  buff cache   si   so   bi   bo   in   cs us sy  id wa
 0  0      0 6767892  9696 41088    0    0    5  180   15   70 62  7  29  1
 0  0      0 6767892  9696 41088    0    0    0    0 1124 1978  0  0 100  0
 0  0      0 6767892  9696 41088    0    0    0    0 1105 1967  0  0 100  0
 0  0      0 6767892  9696 41088    0    0    0    0 1180 1984  0  2  98  0
 2  0      0 6767892  9696 41088    0    0    0    0 2102 1968  0 22  78  0
 0  0      0 6767768  9704 41084    0    0    0   52 1504 2022  0  8  91  1
 0  0      0 6767768  9704 41084    0    0    0    0 1059 1965  0  0 100  0
 0  0      0 6767768  9704 41088    0    0    0    0 1077 1982  0  0 100  0
 0  0      0 6767768  9704 41088    0    0    0    0 1166 1972  1  2  97  0
 0  0      0 6767768  9704 41088    0    0    0    0 1353 1990  0  6  94  0
 0  0      0 6767644  9704 41100    0    0    0    0 2531 2050  0 27  72  0
 0  0      0 6767644  9704 41100    0    0    0    0 1065 1974  0  0 100  0
 1  0      0 6767644  9704 41100    0    0    0    0 1090 1967  0  0 100  0
 0  0      0 6767644  9704 41100    0    0    0    0 1175 1971  0  2  98  0
 0  0      0 6767644  9704 41100    0    0    0    0 1321 1967  0  6  94  0
 1  0      0 6767644  9712 41100    0    0    0   52 1995 1995  0 18  81  1
 0  0      0 6767644  9712 41100    0    0    0    0 1342 1964  0  6  94  0
 0  0      0 6767644  9712 41100    0    0    0    0 1064 1974  0  0 100  0
 0  0      0 6767644  9712 41100    0    0    0    0 1163 1968  0  2  98  0
 0  0      0 6767644  9712 41100    0    0    0    0 1390 1976  1  6  94  0
 0  0      0 6767768  9712 41112    0    0    0    0 1282 2018  0  1  99  0
 0  0      0 6767768  9712 41112    0    0    0    0 2166 1972  0 24  76  0
 0  0      0 6786260  9712 41112    0    0    0    0 1092 1989  0  0 100  0
 0  0      0 6786260  9712 41112    0    0    0    0 1039 2003  0  2  98  0
 0  0      0 6774588  9712 41112    0    0    0    0 1524 2015  3  9  89  0
 0  1      0 6770124  9716 41112    0    0    0   48 1315 1990  3  0  87 11
 1  0      0 6769628  9720 41112    0    0    0    4 2196 1966  0 20  79  0
 0  0      0 6768364  9724 41108    0    0   36    0 1403 2031  1  5  94  1
 0  0      0 6768388  9724 41144    0    0    0    0 1185 1969  0  2  98  0
 0  0      0 6768388  9724 41144    0    0    0    0 1351 1975  0  6  94  0
 0  0      0 6768140  9724 41156    0    0    0    0 1261 2020  0  1  99  0
 0  0      0 6768140  9724 41156    0    0    0    0 1053 1973  0  0 100  0
 1  0      0 6768140  9724 41156    0    0    0    0 2194 1971  1 26  73  0
 0  0      0 6768140  9732 41148    0    0    0   52 1283 2088  0  4  96  0
 2  0      0 6768140  9732 41156    0    0    0    0 1184 1965  0  2  98  0
 0  0      0 6786368  9732 41156    0    0    0    0 1297 2000  0  4  97  0
 0  0      0 6786384  9732 41156    0    0    0    0  998 1949  0  0 100  0

(stop vmstat 1, run sync, restart vmstat 1)

 1  0      0 6765304  9868 41736    0    0    0    0 1239 1967  0  0 100  0
 0  0      0 6765304  9868 41736    0    0    0    0 1174 2035  0  0 100  0
 0  0      0 6765304  9868 41736    0    0    0    0 1084 1976  0  0 100  0
 1  0      0 6765172  9868 41736    0    0    0    0 1100 1975  0  0 100  0
 0  0      0 6765304  9868 41760    0    0    0    0 1501 1997  1  1  98  0
 0  0      0 6765304  9868 41760    0    0    0    0 1092 1964  0  0 100  0
 1  0      0 6765304  9868 41760    0    0    0    0 1096 1963  0  0 100  0
 0  0      0 6765304  9868 41760    0    0    0    0 1083 1967  0  0 100  0
 0  0      0 6765304  9876 41752    0    0    0   76 1101 1980  0  0  99  1
 0  0      0 6765304  9876 41760    0    0    0    4 1199 2097  0  2  98  0
 0  0      0 6765304  9876 41760    0    0    0    0 1055 1964  0  0 100  0
 0  0      0 6765304  9876 41760    0    0    0    0 1064 1977  0  0 100  0
 0  0      0 6765304  9876 41760    0    0    0    0 1073 1968  0  0 100  0
 0  0      0 6765304  9876 41760    0    0    0    0 1080 1976  0  0 100  0
 0  0      0 6765156  9884 41772    0    0    0   20 1329 2034  1  1  98  1
 0  0      0 6765180  9884 41780    0    0    0    0 1097 1971  0  1  99  0
 0  0      0 6765180  9884 41780    0    0    0    0 1073 1960  0  0 100  0
 0  0      0 6765180  9884 41780    0    0    0    0 1079 1976  0  0 100  0
 0  0      0 6765172  9884 41780    0    0    0    0 1120 1980  0  0 100  0
 0  0      0 6765180  9892 41772    0    0    0   60 1097 1982  0  0 100  0
 0  0      0 6765180  9892 41780    0    0    0    0 1067 1969  0  0 100  0
 0  0      0 6765180  9892 41780    0    0    0    0 1063 1973  0  0 100  0
 0  0      0 6765180  9892 41780    0    0    0    0 1064 1969  0  1 100  0

Hmmm. I'll watch it and see if it climbs back up again. That would mean
that for now, echo 2 > /proc/sys/vm/drop_caches ; sync is a workaround
for the problem we're seeing with system CPU slowly creeping up since
2.6.36. I have no idea if this affects the sudden flush process spinlock
contention and attempted writeback without much progress during normal
load, but it certainly changes things here.

http://0x.ca/sim/ref/2.6.37/cpu_after_drop_caches_2_and_sync.png

Simon-

^ permalink raw reply	[flat|nested] 16+ messages in thread
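For reference, the workaround sequence described above (`echo 2 > /proc/sys/vm/drop_caches` followed by `sync`) can be wrapped as a guarded helper. This is only a sketch, not a fix: it needs root, and the value meanings are from the kernel's Documentation/sysctl/vm.txt — 1 frees the page cache, 2 frees reclaimable slab objects such as dentries and inodes, 3 frees both:

```shell
#!/bin/sh
# Workaround sketch for the creeping system CPU: reclaim the dentry/inode
# slabs, then sync. Needs root; harmless but heavy-handed on a busy box.
drop_caches() {
    case "$1" in
        1|2|3) ;;                       # only these values are meaningful
        *) echo "drop_caches: want 1, 2 or 3, got '$1'" >&2; return 1 ;;
    esac
    sync                                # flush dirty pages first
    echo "$1" > /proc/sys/vm/drop_caches
}

# The sequence that cleared the flush-0:xx CPU usage in this thread:
#   drop_caches 2 && sync
```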
* Re: System CPU increasing on idle 2.6.36 2011-01-08 0:52 ` Simon Kirby @ 2011-01-08 1:30 ` Mark Moseley 0 siblings, 0 replies; 16+ messages in thread From: Mark Moseley @ 2011-01-08 1:30 UTC (permalink / raw) To: Simon Kirby; +Cc: linux-nfs, linux-kernel On Fri, Jan 7, 2011 at 4:52 PM, Simon Kirby <sim@hostway.ca> wrote: > On Fri, Jan 07, 2011 at 10:05:39AM -0800, Mark Moseley wrote: > >> NOTE: NFS/Kernel guys: I've left the 2.6.36.2 box still thrashing >> about in case there's something you'd like me to look at. >> >> Ok, both my 2.6.36.2 kernel box and my 2.6.35->2.6.36 bisect box (was >> bisected to ce7db282a3830f57f5b05ec48288c23a5c4d66d5 -- this is my >> first time doing bisect, so I'll preemptively apologize for doing > > That's a merge commit, so that doesn't sound right (it doesn't contain > changes by itself). How were you running the bisect? Were you > definitely seeing a good case since v2.6.35? Could you post your > "git bisect log" so I can see if I can follow it? Here's 'git bisect log': # git bisect log git bisect start # bad: [f6f94e2ab1b33f0082ac22d71f66385a60d8157f] Linux 2.6.36 git bisect bad f6f94e2ab1b33f0082ac22d71f66385a60d8157f # good: [9fe6206f400646a2322096b56c59891d530e8d51] Linux 2.6.35 git bisect good 9fe6206f400646a2322096b56c59891d530e8d51 # good: [78417334b5cb6e1f915b8fdcc4fce3f1a1b4420c] Merge branch 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing git bisect good 78417334b5cb6e1f915b8fdcc4fce3f1a1b4420c # good: [14a4fa20a10d76eb98b7feb25be60735217929ba] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6 git bisect good 14a4fa20a10d76eb98b7feb25be60735217929ba # bad: [ce7db282a3830f57f5b05ec48288c23a5c4d66d5] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu git bisect bad ce7db282a3830f57f5b05ec48288c23a5c4d66d5 I did an initial checkout of stable (git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git) and 
(after git bisect start), did a 'git bisect good v2.6.35; git bisect bad v2.6.36' but that seemed to be way too old. I tried to approximate 2.6.35.9->2.6.36 by doing 'git bisect good ea8a52f9f4bcc3420c38ae07f8378a2f18443970; git bisect bad f6f94e2ab1b33f0082ac22d71f66385a60d8157f' but that had compilation errors so I did 'git bisect good 9fe6206f400646a2322096b56c59891d530e8d51' which compiled ok -- heh, did I mention this was my first outing with 'git bisect' :) > Yeah, I was seeing load proportional to these, but I don't think their > growth is anything wrong. It just seems to be something that is taking > up CPU from flush proceses as they grow, and something which is causing a > thundering herd of reclaim with lots of spinlock contention. > > Hmmm.. I tried to see if drop_caches would clear nfs_inode_cache and thus > the CPU usage. The top of slabtop before doing anything: > > OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME > 1256343 1221799 97% 0.95K 38071 33 1218272K nfs_inode_cache > > # echo 2 > /proc/sys/vm/drop_caches > > this took about 40 seconds to execute, and perf top showed this for > pretty much the whole time: > > samples pcnt function DSO > _______ _____ ________________________ ___________ > > 1188.00 45.1% _raw_spin_lock [kernel] > 797.00 30.2% rpcauth_cache_shrinker [kernel] > 132.00 5.0% shrink_slab [kernel] > 60.00 2.3% queue_io [kernel] > 59.00 2.2% _cond_resched [kernel] > 44.00 1.7% nfs_writepages [kernel] > 35.00 1.3% writeback_single_inode [kernel] > 34.00 1.3% writeback_sb_inodes [kernel] > 23.00 0.9% do_writepages [kernel] > 20.00 0.8% bit_waitqueue [kernel] > 15.00 0.6% redirty_tail [kernel] > 11.00 0.4% __iget [kernel] > 11.00 0.4% __GI_strstr libc-2.7.so > > slabtop then showed: > > OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME > 1256211 531720 42% 0.95K 38067 33 1218144K nfs_inode_cache > > and this doesn't change further with drop_caches=3 or 1. It's strange > that most of them were freed, but not all... 
We do actually have > RPC_CREDCACHE_DEFAULT_HASHBITS set to 10 at build time, which might > explain a bigger credcache and time to free that. Did you change your > auth_hashtable_size at all? Anyway, even with all of that, system CPU > doesn't change. Oh, now that's interesting. I'm setting RPC_CREDCACHE_DEFAULT_HASHBITS to 8 (and have been since Jan, so like 2.6.33 or something) -- I was having the same issue as in this thread, http://www.spinics.net/lists/linux-nfs/msg09261.html. Bumping up RPC_CREDCACHE_DEFAULT_HASHBITS was like a miracle on some of our systems. I was going to get around to switching to the 'right way' to do that as a boot param one of these days, instead of a hack. In the bisected kernel, I definitely did *not* touch RPC_CREDCACHE_DEFAULT_HASHBITS and it blew up anyway. If it hadn't completely died, it'd have been nice to see what 'perf top' said. > Ok, weird, Running "sync" actually caused "vmstat 1" to go from 25% CPU > spikes every 5 seconds with nfs_writepages in the "perf top" every 5 > seconds to being idle! /proc/vmstat showed only nr_dirty 8 and > nr_writeback 0 before I ran "sync". Either that, or it was just good > timing. 
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd    free  buff cache   si   so   bi   bo   in   cs us sy  id wa
>  0  0      0 6767892  9696 41088    0    0    5  180   15   70 62  7  29  1
>  0  0      0 6767892  9696 41088    0    0    0    0 1124 1978  0  0 100  0
>  0  0      0 6767892  9696 41088    0    0    0    0 1105 1967  0  0 100  0
>  0  0      0 6767892  9696 41088    0    0    0    0 1180 1984  0  2  98  0
>  2  0      0 6767892  9696 41088    0    0    0    0 2102 1968  0 22  78  0
>  0  0      0 6767768  9704 41084    0    0    0   52 1504 2022  0  8  91  1
>  0  0      0 6767768  9704 41084    0    0    0    0 1059 1965  0  0 100  0
>  0  0      0 6767768  9704 41088    0    0    0    0 1077 1982  0  0 100  0
>  0  0      0 6767768  9704 41088    0    0    0    0 1166 1972  1  2  97  0
>  0  0      0 6767768  9704 41088    0    0    0    0 1353 1990  0  6  94  0
>  0  0      0 6767644  9704 41100    0    0    0    0 2531 2050  0 27  72  0
>  0  0      0 6767644  9704 41100    0    0    0    0 1065 1974  0  0 100  0
>  1  0      0 6767644  9704 41100    0    0    0    0 1090 1967  0  0 100  0
>  0  0      0 6767644  9704 41100    0    0    0    0 1175 1971  0  2  98  0
>  0  0      0 6767644  9704 41100    0    0    0    0 1321 1967  0  6  94  0
>  1  0      0 6767644  9712 41100    0    0    0   52 1995 1995  0 18  81  1
>  0  0      0 6767644  9712 41100    0    0    0    0 1342 1964  0  6  94  0
>  0  0      0 6767644  9712 41100    0    0    0    0 1064 1974  0  0 100  0
>  0  0      0 6767644  9712 41100    0    0    0    0 1163 1968  0  2  98  0
>  0  0      0 6767644  9712 41100    0    0    0    0 1390 1976  1  6  94  0
>  0  0      0 6767768  9712 41112    0    0    0    0 1282 2018  0  1  99  0
>  0  0      0 6767768  9712 41112    0    0    0    0 2166 1972  0 24  76  0
>  0  0      0 6786260  9712 41112    0    0    0    0 1092 1989  0  0 100  0
>  0  0      0 6786260  9712 41112    0    0    0    0 1039 2003  0  2  98  0
>  0  0      0 6774588  9712 41112    0    0    0    0 1524 2015  3  9  89  0
>  0  1      0 6770124  9716 41112    0    0    0   48 1315 1990  3  0  87 11
>  1  0      0 6769628  9720 41112    0    0    0    4 2196 1966  0 20  79  0
>  0  0      0 6768364  9724 41108    0    0   36    0 1403 2031  1  5  94  1
>  0  0      0 6768388  9724 41144    0    0    0    0 1185 1969  0  2  98  0
>  0  0      0 6768388  9724 41144    0    0    0    0 1351 1975  0  6  94  0
>  0  0      0 6768140  9724 41156    0    0    0    0 1261 2020  0  1  99  0
>  0  0      0 6768140  9724 41156    0    0    0    0 1053 1973  0  0 100  0
>  1  0      0 6768140  9724 41156    0    0    0    0 2194 1971  1 26  73  0
>  0  0      0 6768140  9732 41148    0    0    0   52 1283 2088  0  4  96  0
>  2  0      0 6768140  9732 41156    0    0    0    0 1184 1965  0  2  98  0
>  0  0      0 6786368  9732 41156    0    0    0    0 1297 2000  0  4  97  0
>  0  0      0 6786384  9732 41156    0    0    0    0  998 1949  0  0 100  0
>
> (stop vmstat 1, run sync, restart vmstat 1)
>
>  1  0      0 6765304  9868 41736    0    0    0    0 1239 1967  0  0 100  0
>  0  0      0 6765304  9868 41736    0    0    0    0 1174 2035  0  0 100  0
>  0  0      0 6765304  9868 41736    0    0    0    0 1084 1976  0  0 100  0
>  1  0      0 6765172  9868 41736    0    0    0    0 1100 1975  0  0 100  0
>  0  0      0 6765304  9868 41760    0    0    0    0 1501 1997  1  1  98  0
>  0  0      0 6765304  9868 41760    0    0    0    0 1092 1964  0  0 100  0
>  1  0      0 6765304  9868 41760    0    0    0    0 1096 1963  0  0 100  0
>  0  0      0 6765304  9868 41760    0    0    0    0 1083 1967  0  0 100  0
>  0  0      0 6765304  9876 41752    0    0    0   76 1101 1980  0  0  99  1
>  0  0      0 6765304  9876 41760    0    0    0    4 1199 2097  0  2  98  0
>  0  0      0 6765304  9876 41760    0    0    0    0 1055 1964  0  0 100  0
>  0  0      0 6765304  9876 41760    0    0    0    0 1064 1977  0  0 100  0
>  0  0      0 6765304  9876 41760    0    0    0    0 1073 1968  0  0 100  0
>  0  0      0 6765304  9876 41760    0    0    0    0 1080 1976  0  0 100  0
>  0  0      0 6765156  9884 41772    0    0    0   20 1329 2034  1  1  98  1
>  0  0      0 6765180  9884 41780    0    0    0    0 1097 1971  0  1  99  0
>  0  0      0 6765180  9884 41780    0    0    0    0 1073 1960  0  0 100  0
>  0  0      0 6765180  9884 41780    0    0    0    0 1079 1976  0  0 100  0
>  0  0      0 6765172  9884 41780    0    0    0    0 1120 1980  0  0 100  0
>  0  0      0 6765180  9892 41772    0    0    0   60 1097 1982  0  0 100  0
>  0  0      0 6765180  9892 41780    0    0    0    0 1067 1969  0  0 100  0
>  0  0      0 6765180  9892 41780    0    0    0    0 1063 1973  0  0 100  0
>  0  0      0 6765180  9892 41780    0    0    0    0 1064 1969  0  1 100  0
>
> Hmmm. I'll watch it and see if it climbs back up again. That would mean
> that for now, echo 2 > /proc/sys/vm/drop_caches ; sync is a workaround
> for the problem we're seeing with system CPU slowly creeping up since
> 2.6.36. I have no idea if this affects the sudden flush process spinlock
> contention and attempted writeback without much progress during normal
> load, but it certainly changes things here.
> > http://0x.ca/sim/ref/2.6.37/cpu_after_drop_caches_2_and_sync.png That's pretty painful to have to do though (less painful than the box becoming unusable, I'll admit). I wonder if when I did a umount on mine (which must've sync'd at least that mount), it gave something a big enough kick for the flush-0:xx threads to unstick themselves, presumably what happened with your sync (though I didn't drop the cache). ^ permalink raw reply [flat|nested] 16+ messages in thread
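As an aside on the bisect procedure above: when a revision fails to compile mid-bisect, `git bisect skip` asks git to try a nearby revision instead; hand-substituting a different "good" commit can steer the bisection onto the wrong commit (here it landed on a merge commit). A self-contained demonstration of the mechanics on a throwaway repository — the repo and commit messages are invented; in the real session the endpoints were `git bisect good v2.6.35` and `git bisect bad v2.6.36`:

```shell
#!/bin/sh
# Demonstrate the bisect commands from this thread, including "git bisect
# skip" for revisions that fail to build, on a 5-commit throwaway repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect
for n in 1 2 3 4 5; do
    echo "$n" > file
    git add file
    git commit -qm "commit $n"
done
git bisect start
git bisect bad HEAD            # stands in for "git bisect bad v2.6.36"
git bisect good HEAD~4         # stands in for "git bisect good v2.6.35"
git bisect skip                # current revision won't compile: skip, don't guess
git bisect log                 # this is the log worth posting to the list
git bisect reset >/dev/null    # back to the original branch when done
```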
[parent not found: <1291845189.3067.31.camel@heimdal.trondhjem.org>]
* NFS client growing system CPU [not found] ` <1291845189.3067.31.camel@heimdal.trondhjem.org> @ 2011-09-27 0:39 ` Simon Kirby 2011-09-27 11:42 ` Trond Myklebust 0 siblings, 1 reply; 16+ messages in thread From: Simon Kirby @ 2011-09-27 0:39 UTC (permalink / raw) To: Trond Myklebust; +Cc: linux-nfs, linux-kernel Hello! Following up on "System CPU increasing on idle 2.6.36", this issue is still happening even on 3.1-rc7. So, since it has been 9 months since I reported this, I figured I'd bisect this issue. The first bisection ended in an IPMI regression that looked like the problem, so I had to start again. Eventually, I got commit b80c3cb628f0ebc241b02e38dd028969fb8026a2 which made it into 2.6.34-rc4. With this commit, system CPU keeps rising as the log crunch box runs (reads log files via NFS and spews out HTML files into NFS-mounted report directories). When it finishes the daily run, the system time stays non-zero and continues to be higher and higher after each run, until the box never completes a run within a day due to all of the wasted cycles. I previously posted a munin CPU graph picture here: http://0x.ca/sim/ref/2.6.36/cpu_logcrunch_nfs.png Here is the most recent graph of me bisecting, roughly one day at a time: http://0x.ca/sim/ref/2.6.34/cpu.png Running a kernel built on this commit, I can actually make the problem stop by just typing "sync". CPU then goes down to 0 and everything is fine. Future kernel versions, such as 2.6.36, do not change with just a "sync", but "sync;echo 3 > /proc/sys/vm/drop_caches;sync" _does_ stop the system CPU usage. I can bisect this change, too, if that would help. 
Here is the commit:

commit b80c3cb628f0ebc241b02e38dd028969fb8026a2
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Fri Apr 9 19:07:07 2010 -0400

    NFS: Ensure that writeback_single_inode() calls write_inode() when syncing

    Since writeback_single_inode() checks the inode->i_state flags _before_
    it flushes out the data, we need to ensure that the I_DIRTY_DATASYNC
    flag is already set. Otherwise we risk not seeing a call to
    write_inode(), which again means that we break fsync() et al...

    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 53ff70e..7f40ea3 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -421,6 +421,7 @@ static void nfs_mark_request_dirty(struct nfs_page *req)
 {
 	__set_page_dirty_nobuffers(req->wb_page);
+	__mark_inode_dirty(req->wb_page->mapping->host, I_DIRTY_DATASYNC);
 }
 
 #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
@@ -660,6 +661,7 @@ static int nfs_writepage_setup(struct nfs_open_context *ctx, struct page *page,
 	req = nfs_setup_write_request(ctx, page, offset, count);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
+	nfs_mark_request_dirty(req);
 	/* Update file length */
 	nfs_grow_file(page, offset, count);
 	nfs_mark_uptodate(page, req->wb_pgbase, req->wb_bytes);
@@ -739,8 +741,6 @@ int nfs_updatepage(struct file *file, struct page *page,
 	status = nfs_writepage_setup(ctx, page, offset, count);
 	if (status < 0)
 		nfs_set_pageerror(page);
-	else
-		__set_page_dirty_nobuffers(page);
 
 	dprintk("NFS: nfs_updatepage returns %d (isize %lld)\n",
 		status, (long long)i_size_read(inode));

With the build before this commit, the system time stays low always and
returns to 0% after every daily run.

Simon-

On Wed, Dec 08, 2010 at 04:53:09PM -0500, Trond Myklebust wrote:
> On Wed, 2010-12-08 at 13:25 -0800, Simon Kirby wrote:
> > Possibly related to the flush-processes-taking-CPU issues I saw
> > previously, I thought this was interesting.
I found a log-crunching box > > that does all of its work via NFS and spends most of the day sleeping. > > It has been using a linearly-increasing amount of system time during the > > time where is sleeping. munin graph: > > > > http://0x.ca/sim/ref/2.6.36/cpu_logcrunch_nfs.png > > > > top shows: > > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > > 4767 root 20 0 0 0 0 S 16 0.0 2413:29 flush-0:69 > > 22385 root 20 0 0 0 0 S 10 0.0 1320:23 flush-0:70 > > (rest idle) > > > > perf top shows: > > > > ------------------------------------------------------------------------------------------- > > PerfTop: 252 irqs/sec kernel:97.2% exact: 0.0% [1000Hz cycles], (all, 2 CPUs) > > ------------------------------------------------------------------------------------------- > > > > samples pcnt function DSO > > _______ _____ _______________________________ ___________ > > > > 912.00 21.2% _raw_spin_lock [kernel] > > 592.00 13.8% nfs_writepages [kernel] > > 474.00 11.0% queue_io [kernel] > > 428.00 10.0% writeback_single_inode [kernel] > > 379.00 8.8% writeback_sb_inodes [kernel] > > 237.00 5.5% bit_waitqueue [kernel] > > 184.00 4.3% do_writepages [kernel] > > 169.00 3.9% __iget [kernel] > > 132.00 3.1% redirty_tail [kernel] > > 85.00 2.0% iput [kernel] > > 72.00 1.7% _atomic_dec_and_lock [kernel] > > 67.00 1.6% write_cache_pages [kernel] > > 63.00 1.5% port_inb [ipmi_si] > > 62.00 1.4% __mark_inode_dirty [kernel] > > 48.00 1.1% __wake_up_bit [kernel] > > 32.00 0.7% nfs_write_inode [kernel] > > 26.00 0.6% native_read_tsc [kernel] > > 25.00 0.6% radix_tree_gang_lookup_tag_slot [kernel] > > 25.00 0.6% schedule [kernel] > > 24.00 0.6% wake_up_bit [kernel] > > 20.00 0.5% apic_timer_interrupt [kernel] > > 19.00 0.4% ia32_syscall [kernel] > > 16.00 0.4% find_get_pages_tag [kernel] > > 15.00 0.3% _raw_spin_lock_irqsave [kernel] > > 15.00 0.3% load_balance [kernel] > > 13.00 0.3% ktime_get [kernel] > > 12.00 0.3% shrink_icache_memory [kernel] > > 11.00 0.3% semop 
libc-2.7.so > > 9.00 0.2% nfs_pageio_doio [kernel] > > 8.00 0.2% dso__find_symbol perf > > 8.00 0.2% down_read [kernel] > > 7.00 0.2% __switch_to [kernel] > > 7.00 0.2% __phys_addr [kernel] > > 7.00 0.2% nfs_pageio_init [kernel] > > 7.00 0.2% pagevec_lookup_tag [kernel] > > 7.00 0.2% _cond_resched [kernel] > > 7.00 0.2% up_read [kernel] > > 6.00 0.1% sched_clock_local [kernel] > > 5.00 0.1% rb_erase [kernel] > > 5.00 0.1% page_fault [kernel] > > > > "tcpdump -i any -n port 2049" and "nfsstat" shows this work does not actually result > > in any NFS packets. > > > > Known 2.6.36 issue? This did not occur on 2.6.35.4, according to the > > munin graphs. I'll try 2.6.37-rc an see if it changes. > > Possibly a side effect of the fs/fs-writeback.c changes in 2.6.36? You > do appear to be hitting a lot of spinlock contention, but I suspect that > a lot of it is coming from writeback_sb_inodes, writeback_single_inode > and queue_io, all of which seem unnaturally high on your stats above. > > I don't see how you can be seeing no traffic on the wire. You are > certainly hitting some page writeout (0.2% nfs_pageio_doio). > > Cheers > Trond > -- > Trond Myklebust > Linux NFS client maintainer > > NetApp > Trond.Myklebust@netapp.com > www.netapp.com ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: NFS client growing system CPU 2011-09-27 0:39 ` NFS client growing system CPU Simon Kirby @ 2011-09-27 11:42 ` Trond Myklebust 2011-09-27 16:49 ` Simon Kirby 0 siblings, 1 reply; 16+ messages in thread From: Trond Myklebust @ 2011-09-27 11:42 UTC (permalink / raw) To: Simon Kirby; +Cc: linux-nfs, linux-kernel On Mon, 2011-09-26 at 17:39 -0700, Simon Kirby wrote: > Hello! > > Following up on "System CPU increasing on idle 2.6.36", this issue is > still happening even on 3.1-rc7. So, since it has been 9 months since I > reported this, I figured I'd bisect this issue. The first bisection ended > in an IPMI regression that looked like the problem, so I had to start > again. Eventually, I got commit b80c3cb628f0ebc241b02e38dd028969fb8026a2 > which made it into 2.6.34-rc4. > > With this commit, system CPU keeps rising as the log crunch box runs > (reads log files via NFS and spews out HTML files into NFS-mounted report > directories). When it finishes the daily run, the system time stays > non-zero and continues to be higher and higher after each run, until the > box never completes a run within a day due to all of the wasted cycles. So reverting that commit fixes the problem on 3.1-rc7? As far as I can see, doing so should be safe thanks to commit 5547e8aac6f71505d621a612de2fca0dd988b439 (writeback: Update dirty flags in two steps) which fixes the original problem at the VFS level. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: NFS client growing system CPU 2011-09-27 11:42 ` Trond Myklebust @ 2011-09-27 16:49 ` Simon Kirby 2011-09-27 17:04 ` Trond Myklebust 0 siblings, 1 reply; 16+ messages in thread From: Simon Kirby @ 2011-09-27 16:49 UTC (permalink / raw) To: Trond Myklebust; +Cc: linux-nfs, linux-kernel On Tue, Sep 27, 2011 at 07:42:53AM -0400, Trond Myklebust wrote: > On Mon, 2011-09-26 at 17:39 -0700, Simon Kirby wrote: > > Hello! > > > > Following up on "System CPU increasing on idle 2.6.36", this issue is > > still happening even on 3.1-rc7. So, since it has been 9 months since I > > reported this, I figured I'd bisect this issue. The first bisection ended > > in an IPMI regression that looked like the problem, so I had to start > > again. Eventually, I got commit b80c3cb628f0ebc241b02e38dd028969fb8026a2 > > which made it into 2.6.34-rc4. > > > > With this commit, system CPU keeps rising as the log crunch box runs > > (reads log files via NFS and spews out HTML files into NFS-mounted report > > directories). When it finishes the daily run, the system time stays > > non-zero and continues to be higher and higher after each run, until the > > box never completes a run within a day due to all of the wasted cycles. > > So reverting that commit fixes the problem on 3.1-rc7? > > As far as I can see, doing so should be safe thanks to commit > 5547e8aac6f71505d621a612de2fca0dd988b439 (writeback: Update dirty flags > in two steps) which fixes the original problem at the VFS level. Hmm, I went to git revert b80c3cb628f0ebc241b02e38dd028969fb8026a2, but for some reason git left the nfs_mark_request_dirty(req); line in nfs_writepage_setup(), even though the original commit had that. Is that OK or should I remove that as well? Once that is sorted, I'll build it and let it run for a day and let you know. Thanks! Simon- ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: NFS client growing system CPU 2011-09-27 16:49 ` Simon Kirby @ 2011-09-27 17:04 ` Trond Myklebust 2011-09-28 19:58 ` Simon Kirby 0 siblings, 1 reply; 16+ messages in thread From: Trond Myklebust @ 2011-09-27 17:04 UTC (permalink / raw) To: Simon Kirby; +Cc: linux-nfs, linux-kernel On Tue, 2011-09-27 at 09:49 -0700, Simon Kirby wrote: > On Tue, Sep 27, 2011 at 07:42:53AM -0400, Trond Myklebust wrote: > > > On Mon, 2011-09-26 at 17:39 -0700, Simon Kirby wrote: > > > Hello! > > > > > > Following up on "System CPU increasing on idle 2.6.36", this issue is > > > still happening even on 3.1-rc7. So, since it has been 9 months since I > > > reported this, I figured I'd bisect this issue. The first bisection ended > > > in an IPMI regression that looked like the problem, so I had to start > > > again. Eventually, I got commit b80c3cb628f0ebc241b02e38dd028969fb8026a2 > > > which made it into 2.6.34-rc4. > > > > > > With this commit, system CPU keeps rising as the log crunch box runs > > > (reads log files via NFS and spews out HTML files into NFS-mounted report > > > directories). When it finishes the daily run, the system time stays > > > non-zero and continues to be higher and higher after each run, until the > > > box never completes a run within a day due to all of the wasted cycles. > > > > So reverting that commit fixes the problem on 3.1-rc7? > > > > As far as I can see, doing so should be safe thanks to commit > > 5547e8aac6f71505d621a612de2fca0dd988b439 (writeback: Update dirty flags > > in two steps) which fixes the original problem at the VFS level. > > Hmm, I went to git revert b80c3cb628f0ebc241b02e38dd028969fb8026a2, but > for some reason git left the nfs_mark_request_dirty(req); line in > nfs_writepage_setup(), even though the original commit had that. Is that > OK or should I remove that as well? > > Once that is sorted, I'll build it and let it run for a day and let you > know. Thanks! 
It shouldn't make any difference whether you leave it or remove it. The resulting second call to __set_page_dirty_nobuffers() will always be a no-op since the page will already be marked as dirty. Cheers Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: NFS client growing system CPU
  2011-09-27 17:04       ` Trond Myklebust
@ 2011-09-28 19:58        ` Simon Kirby
  2011-09-30  0:58          ` Simon Kirby
  0 siblings, 1 reply; 16+ messages in thread
From: Simon Kirby @ 2011-09-28 19:58 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, linux-kernel

On Tue, Sep 27, 2011 at 01:04:15PM -0400, Trond Myklebust wrote:

> On Tue, 2011-09-27 at 09:49 -0700, Simon Kirby wrote:
> > On Tue, Sep 27, 2011 at 07:42:53AM -0400, Trond Myklebust wrote:
> >
> > > On Mon, 2011-09-26 at 17:39 -0700, Simon Kirby wrote:
> > > > Hello!
> > > >
> > > > Following up on "System CPU increasing on idle 2.6.36", this issue is
> > > > still happening even on 3.1-rc7. So, since it has been 9 months since
> > > > I reported this, I figured I'd bisect it. The first bisection ended
> > > > in an IPMI regression that looked like the problem, so I had to start
> > > > again. Eventually, I got commit
> > > > b80c3cb628f0ebc241b02e38dd028969fb8026a2, which made it into
> > > > 2.6.34-rc4.
> > > >
> > > > With this commit, system CPU keeps rising as the log crunch box runs
> > > > (reads log files via NFS and spews out HTML files into NFS-mounted
> > > > report directories). When it finishes the daily run, the system time
> > > > stays non-zero and continues to be higher and higher after each run,
> > > > until the box never completes a run within a day due to all of the
> > > > wasted cycles.
> > >
> > > So reverting that commit fixes the problem on 3.1-rc7?
> > >
> > > As far as I can see, doing so should be safe thanks to commit
> > > 5547e8aac6f71505d621a612de2fca0dd988b439 (writeback: Update dirty flags
> > > in two steps), which fixes the original problem at the VFS level.
> >
> > Hmm, I went to git revert b80c3cb628f0ebc241b02e38dd028969fb8026a2, but
> > for some reason git left the nfs_mark_request_dirty(req); line in
> > nfs_writepage_setup(), even though the original commit had added it. Is
> > that OK, or should I remove that as well?
> >
> > Once that is sorted, I'll build it and let it run for a day and let you
> > know. Thanks!
>
> It shouldn't make any difference whether you leave it or remove it. The
> resulting second call to __set_page_dirty_nobuffers() will always be a
> no-op since the page will already be marked as dirty.

Ok, confirmed: git revert b80c3cb628f0ebc241b02e38dd028969fb8026a2 on
3.1-rc7 fixes the problem for me. Does this make sense, then, or do we
need further investigation and/or testing?

Simon-

^ permalink raw reply	[flat|nested] 16+ messages in thread
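[Editor's note: the revert workflow discussed above can be sketched end-to-end. This is a minimal demonstration in a throwaway repository; the file name and line content are stand-ins — the real target was commit b80c3cb628f0ebc241b02e38dd028969fb8026a2 touching fs/nfs/write.c in a 3.1-rc7 tree.]

```shell
# Throwaway repo: one base commit, one "suspect" commit that adds a line,
# then a revert of the suspect commit.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email test@example.com
git config user.name test
printf 'static int nfs_writepage_setup(void);\n' > write.c
git add write.c
git commit -qm 'base'
printf 'static int nfs_writepage_setup(void);\nnfs_mark_request_dirty(req);\n' > write.c
git commit -qam 'suspect commit'
suspect=$(git rev-parse HEAD)
# git revert applies the reverse diff as a new commit, leaving history intact:
git revert --no-edit "$suspect" > /dev/null
if grep -q 'nfs_mark_request_dirty' write.c; then
    echo 'line still present'
else
    echo 'line removed by revert'
fi
```

In this clean case the added line disappears. In a real kernel tree, later commits that touched the surrounding context can make the reverse hunk apply with fuzz or leave lines behind — which is presumably why the nfs_mark_request_dirty() call survived the revert here.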
* Re: NFS client growing system CPU
  2011-09-28 19:58        ` Simon Kirby
@ 2011-09-30  0:58          ` Simon Kirby
  2011-09-30  1:11            ` Myklebust, Trond
  0 siblings, 1 reply; 16+ messages in thread
From: Simon Kirby @ 2011-09-30 0:58 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs, linux-kernel

On Wed, Sep 28, 2011 at 12:58:35PM -0700, Simon Kirby wrote:

> On Tue, Sep 27, 2011 at 01:04:15PM -0400, Trond Myklebust wrote:
> >...
> > It shouldn't make any difference whether you leave it or remove it. The
> > resulting second call to __set_page_dirty_nobuffers() will always be a
> > no-op since the page will already be marked as dirty.
>
> Ok, confirmed: git revert b80c3cb628f0ebc241b02e38dd028969fb8026a2 on
> 3.1-rc7 fixes the problem for me. Does this make sense, then, or do we
> need further investigation and/or testing?

Just to clear up what I said before: on plain 3.1-rc8, I am actually
able to clear the endless CPU use in nfs_writepages by just running
"sync". I am not sure when this changed, but I'm pretty sure that some
versions between 2.6.34 and 3.1-rc were not affected by "sync" alone
unless it was paired with drop_caches. Maybe this makes the problem
more obvious...

Simon-

^ permalink raw reply	[flat|nested] 16+ messages in thread
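[Editor's note: for reference, the two knobs being compared here are both standard kernel interfaces; a minimal sketch follows — the drop_caches write requires root, so it is guarded.]

```shell
# sync(1) forces writeback of dirty pages; on an NFS client this also
# drives the COMMIT calls that let unstably-written pages be freed.
sync
echo "sync completed"

# drop_caches additionally evicts clean page cache and reclaimable slab
# objects (e.g. nfs_inode_cache): 1 = page cache, 2 = slab, 3 = both.
# It needs root, and should be preceded by sync so dirty data is flushed
# first (dirty pages cannot be dropped).
if [ "$(id -u)" -eq 0 ] && [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches
    echo "caches dropped"
else
    echo "skipping drop_caches (not root)"
fi
```

That "sync" alone now clears the spinning is consistent with the loop being driven by dirty state (inodes dirtied for COMMIT) rather than by cached objects that only drop_caches would evict.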
* RE: NFS client growing system CPU
  2011-09-30  0:58          ` Simon Kirby
@ 2011-09-30  1:11            ` Myklebust, Trond
  2011-10-05 23:07              ` Simon Kirby
  0 siblings, 1 reply; 16+ messages in thread
From: Myklebust, Trond @ 2011-09-30 1:11 UTC (permalink / raw)
  To: Simon Kirby; +Cc: linux-nfs, linux-kernel

> -----Original Message-----
> From: Simon Kirby [mailto:sim@hostway.ca]
> Sent: Thursday, September 29, 2011 8:58 PM
> To: Myklebust, Trond
> Cc: linux-nfs@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: NFS client growing system CPU
>
> >...
>
> Just to clear up what I said before, it seems that on plain 3.1-rc8, I
> am actually able to clear the endless CPU use in nfs_writepages by just
> running "sync". I am not sure when this changed, but I'm pretty sure
> that some versions between 2.6.34 and 3.1-rc used to not be affected by
> just "sync" unless it was paired with drop_caches. Maybe this makes the
> problem more obvious...

Hi Simon,

I think you are just finding yourself cycling through the VFS writeback
routines all the time because we dirty the inode for COMMIT at the same
time as we dirty a new page. Usually, we want to wait until after the
WRITE rpc call has completed, and so it was only the VFS race that
forced us to write this workaround so that we could guarantee reliable
fsync() behaviour.

My only concern at this point is to make sure that in reverting that
patch, we haven't overlooked some other fsync() bug that this patch
fixed. So far, it looks as if Dmitry's patch is sufficient to deal with
any issues that I can see.

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 16+ messages in thread
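[Editor's note: one way to check whether a workload is actually generating the WRITE and COMMIT traffic described above is the per-mount RPC counters the NFS client exports in /proc/self/mountstats (available since 2.6.17). This sketch is an editorial addition, not from the original thread.]

```shell
# Print the WRITE and COMMIT op counters for each NFS mount, if any.
# mountstats entries look roughly like:
#   device server:/export mounted on /mnt with fstype nfs statvers=1.0
#   ...
#           WRITE: <ops> <trans> <timeouts> <bytes sent> ...
#           COMMIT: <ops> <trans> <timeouts> ...
if grep -q 'fstype nfs' /proc/self/mountstats 2>/dev/null; then
    awk '/fstype nfs/ { dev = $2 }
         dev && ($1 == "WRITE:" || $1 == "COMMIT:") { print dev, $0 }' \
        /proc/self/mountstats
else
    echo "no NFS mounts on this host"
fi
```

Sampling these counters before and after a run shows whether COMMITs are being issued at the expected rate relative to WRITEs.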
* Re: NFS client growing system CPU
  2011-09-30  1:11            ` Myklebust, Trond
@ 2011-10-05 23:07              ` Simon Kirby
  0 siblings, 0 replies; 16+ messages in thread
From: Simon Kirby @ 2011-10-05 23:07 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: Mark Moseley, linux-nfs, linux-kernel

On Thu, Sep 29, 2011 at 06:11:17PM -0700, Myklebust, Trond wrote:

> >...
>
> I think you are just finding yourself cycling through the VFS writeback
> routines all the time because we dirty the inode for COMMIT at the same
> time as we dirty a new page. Usually, we want to wait until after the
> WRITE rpc call has completed, and so it was only the vfs race that
> forced us to write this workaround so that we can guarantee reliable
> fsync() behaviour.
>
> My only concern at this point is to make sure that in reverting that
> patch, we haven't overlooked some other fsync() bug that this patch
> fixed. So far, it looks as if Dmitry's patch is sufficient to deal with
> any issues that I can see.

Hello!

So, this is a regression that has caused uptime issues for us since
2.6.34-rc4. Dmitry's patch went into 2.6.35, so I think this revert
should be committed and be a stable candidate for 2.6.35 - 3.1. We have
not seen any problems resulting from the revert, but our loads are not
particularly fsync()-heavy. How did things work before this patch,
anyway?

Here is another graph showing the revert fixing the problem on this box
with a relatively simple workload (revert applied Tuesday evening):

	http://0x.ca/sim/ref/3.1-rc8/cpu-analog02-revert-b80c3cb6.png

It is helping on many other boxes, too, but they get various spurts of
memory pressure and other CPU spikes that make the difference harder to
see.

We're still running your sunrpc/clnt.c debugging patch as well, but
haven't hit the hang again yet.

Simon-

^ permalink raw reply	[flat|nested] 16+ messages in thread
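[Editor's note: the graphs above track system time. On kernels of this era (2.6.32 through 3.1, before per-bdi flusher threads were folded into generic kworkers) the same numbers can be sampled directly from /proc for the flush-<major>:<minor> threads; field 15 of /proc/<pid>/stat is stime in clock ticks, per proc(5). A sketch:]

```shell
# Sample accumulated system time of any per-bdi "flush-*" writeback
# threads. On newer kernels this work runs in kworker threads instead,
# so the list may simply be empty.
tick=$(getconf CLK_TCK)
count=0
for pid in $(pgrep '^flush-' 2>/dev/null); do
    stime=$(awk '{ print $15 }' "/proc/$pid/stat")
    echo "pid $pid: $((stime / tick)) seconds of system time"
    count=$((count + 1))
done
echo "sampled $count flush thread(s)"
```

Run from cron and graphed, a per-thread series like this would have isolated the flush threads' contribution from other sources of system time on the busier boxes mentioned above.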
end of thread, other threads:[~2011-10-05 23:07 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20101208212505.GA18192@hostway.ca>
[not found] ` <20101218010801.GE28367@hostway.ca>
2010-12-29 22:03 ` System CPU increasing on idle 2.6.36 Simon Kirby
[not found] ` <AANLkTinq074Q-B=5B4_aXuBm-3ELwGknsJv+OvS1z1OX@mail.gmail.com>
2011-01-04 21:40 ` Simon Kirby
2011-01-05 19:43 ` Mark Moseley
2011-01-07 18:05 ` Mark Moseley
2011-01-07 18:12 ` Mark Moseley
2011-01-07 19:33 ` Mark Moseley
2011-01-08 0:52 ` Simon Kirby
2011-01-08 1:30 ` Mark Moseley
[not found] ` <1291845189.3067.31.camel@heimdal.trondhjem.org>
2011-09-27 0:39 ` NFS client growing system CPU Simon Kirby
2011-09-27 11:42 ` Trond Myklebust
2011-09-27 16:49 ` Simon Kirby
2011-09-27 17:04 ` Trond Myklebust
2011-09-28 19:58 ` Simon Kirby
2011-09-30 0:58 ` Simon Kirby
2011-09-30 1:11 ` Myklebust, Trond
2011-10-05 23:07 ` Simon Kirby