2.4.20-rc1 - hang with processes stuck in D

public inbox for linux-kernel@vger.kernel.org

* 2.4.20-rc1 - hang with processes stuck in D
@ 2002-11-06  0:25 Jeff Dike
  2002-11-06  0:37 ` Andrew Morton
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Dike @ 2002-11-06  0:25 UTC (permalink / raw)
  To: linux-kernel

2.4.20-rc1 reliably gets processes stuck in D, eventually wedging the whole
system.  This is triggered by diffing two kernel pools, one of which has nine
core files of 138764288 bytes each.
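
For anyone trying to confirm the same symptom, here is a minimal user-space
sketch that lists tasks currently in state D (uninterruptible sleep).  It is
not part of the original report.  It assumes the "pid (comm) state ..." layout
of /proc/<pid>/stat described in proc(5), and the helper names stat_state and
list_d_state are made up for illustration.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Return the state letter from one /proc/<pid>/stat line, or 0 if the line
 * is malformed.  The command name is wrapped in parentheses and may itself
 * contain spaces or ')', so search from the last ')' in the line. */
static char stat_state(const char *line)
{
    const char *p = strrchr(line, ')');

    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return 0;
    return p[2];
}

/* Print the pid of every task in state D.  Returns the number of such
 * tasks, or -1 if /proc cannot be opened. */
static int list_d_state(void)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    int count = 0;

    if (proc == NULL)
        return -1;
    while ((de = readdir(proc)) != NULL) {
        char path[300], line[512];
        FILE *f;

        /* Only numeric directory names are processes. */
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;
        snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
        f = fopen(path, "r");
        if (f == NULL)
            continue;           /* the task exited in the meantime */
        if (fgets(line, sizeof(line), f) != NULL && stat_state(line) == 'D') {
            printf("pid %s is in state D\n", de->d_name);
            count++;
        }
        fclose(f);
    }
    closedir(proc);
    return count;
}
```

Calling list_d_state() from a small main() every few seconds while the diff
runs should show whether the set of stuck tasks keeps growing.  A task that
shows up once may just be doing ordinary disk I/O; only tasks that stay in D
across many samples match the hang described here.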

The diff itself is stuck in __wait_on_buffer:

	Trace; c0131608 <__wait_on_buffer+68/90>
	Trace; c0132258 <getblk+28/60>
	Trace; c0132269 <getblk+39/60>
	Trace; c01324d6 <bread+46/70>
	Trace; c0121918 <handle_mm_fault+58/c0>
	Trace; c0163b02 <ext2_get_branch+52/c0>
	Trace; c0163d99 <ext2_get_block+59/320>
	Trace; c01109fa <do_page_fault+17a/4ab>
	Trace; c01326b2 <create_buffers+62/f0>
	Trace; c01326b8 <create_buffers+68/f0>
	Trace; c0132fec <block_read_full_page+ec/240>
	Trace; c0123a3d <add_to_page_cache_unique+6d/80>
	Trace; c0123ad8 <page_cache_read+88/c0>
	Trace; c0163d40 <ext2_get_block+0/320>
	Trace; c01240b5 <generic_file_readahead+f5/130>
	Trace; c012430f <do_generic_file_read+1df/430>
	Trace; c012487c <generic_file_read+7c/110>
	Trace; c0124780 <file_read_actor+0/80>
	Trace; c0130796 <sys_read+96/f0>
	Trace; c010bafb <sys_mmap2+2b/30>
	Trace; c0106d8b <system_call+33/38>

kupdated and bdflush are both stuck in __wait_on_buffer called from timer_bh:

kupdated:
	Trace; c01a0595 <__get_request_wait+95/d0>
	Trace; c01a0b6b <__make_request+3db/570>
	Trace; c011b424 <timer_bh+274/390>
	Trace; c011817b <bh_action+1b/50>
	Trace; c0118084 <tasklet_hi_action+44/70>
	Trace; c01a0e0e <generic_make_request+10e/130>
	Trace; c010833c <do_IRQ+9c/b0>
	Trace; c01a0e7b <submit_bh+4b/70>
	Trace; c0131684 <write_locked_buffers+24/30>
	Trace; c0131731 <write_some_buffers+a1/f0>
	Trace; c013455c <sync_old_buffers+1c/40>
	Trace; c0134824 <kupdate+f4/120>
	Trace; c0105000 <_stext+0/0>
	Trace; c0105000 <_stext+0/0>
	Trace; c01055d6 <kernel_thread+26/30>
	Trace; c0134730 <kupdate+0/120>

bdflush:
	Trace; c01a0595 <__get_request_wait+95/d0>
	Trace; c01a0b6b <__make_request+3db/570>
	Trace; c011b1d7 <timer_bh+27/390>
	Trace; c011817b <bh_action+1b/50>
	Trace; c0118084 <tasklet_hi_action+44/70>
	Trace; c0110e0e <remap_area_pages+7e/1d0>
	Trace; c010833c <do_IRQ+9c/b0>
	Trace; c01a0e7b <submit_bh+4b/70>
	Trace; c0131684 <write_locked_buffers+24/30>
	Trace; c0131731 <write_some_buffers+a1/f0>
	Trace; c01346fe <bdflush+9e/d0>
	Trace; c0105000 <_stext+0/0>
	Trace; c01055d6 <kernel_thread+26/30>
	Trace; c0134660 <bdflush+0/d0>



* Re: 2.4.20-rc1 - hang with processes stuck in D
  2002-11-06  0:25 2.4.20-rc1 - hang with processes stuck in D Jeff Dike
@ 2002-11-06  0:37 ` Andrew Morton
  2002-11-06  3:08   ` Jeff Dike
  2002-11-08  9:01   ` Marcelo Tosatti
  0 siblings, 2 replies; 8+ messages in thread
From: Andrew Morton @ 2002-11-06  0:37 UTC (permalink / raw)
  To: Jeff Dike; +Cc: linux-kernel

Jeff Dike wrote:
> 
> 2.4.20-rc1 reliably gets processes stuck in D, eventually wedging the whole
> system.  This is by diffing two kernel pools, one of which has 9 138764288
> byte core files.
> 
> The diff itself is stuck in __wait_on_buffer:
> 
>         Trace; c0131608 <__wait_on_buffer+68/90>

Kernel is waiting for IO completion on a read.  I would be
suspecting your IO system, or interrupt system.

Reverting your ide/scsi/whatever drivers to the last-known-to-work
version would be interesting.


* Re: 2.4.20-rc1 - hang with processes stuck in D
  2002-11-06  0:37 ` Andrew Morton
@ 2002-11-06  3:08   ` Jeff Dike
  2002-11-08  4:17     ` Jakob Oestergaard
  2002-11-08  9:01   ` Marcelo Tosatti
  1 sibling, 1 reply; 8+ messages in thread
From: Jeff Dike @ 2002-11-06  3:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

akpm@digeo.com said:
> Kernel is waiting for IO completion on a read.  I would be suspecting
> your IO system, or interrupt system. 

Yup.  The disk access light is stuck on continuously at this point, FWIW.


> Reverting your ide/scsi/whatever drivers to the last-known-to-work
> version would be interesting. 

IDE - this didn't happen on 2.4.18.  It seems to happen on all more recent
kernels.  UML seems to trigger it, especially on UML servers.

				Jeff



* Re: 2.4.20-rc1 - hang with processes stuck in D
  2002-11-06  3:08   ` Jeff Dike
@ 2002-11-08  4:17     ` Jakob Oestergaard
  2002-11-08 18:43       ` Trond Myklebust
  0 siblings, 1 reply; 8+ messages in thread
From: Jakob Oestergaard @ 2002-11-08  4:17 UTC (permalink / raw)
  To: Jeff Dike; +Cc: Andrew Morton, linux-kernel

On Tue, Nov 05, 2002 at 10:08:30PM -0500, Jeff Dike wrote:
> akpm@digeo.com said:
> > Kernel is waiting for IO completion on a read.  I would be suspecting
> > your IO system, or interrupt system. 
> 
> Yup.  The disk access light is stuck on continuously at this point, FWIW.
> 
> 
> > Reverting your ide/scsi/whatever drivers to the last-known-to-work
> > version would be interesting. 
> 
> IDE - this didn't happen on 2.4.18.  It seems to happen on all more recent
> kernels.  UML seems to trigger it, especially on UML servers.

Maybe not related, but I see 5 second "pauses" on a RAID-0+1 (software
RAID, Seagate 80G disks, Promise Ultra66+Ultra133 controllers, dual x86)
file server here.

I suspected NFS problems (looks like someone re-wrote NFS between 2.4.18
and 2.4.20-rc1) - but this is *not* the case.  The pauses happen on
locally running processes as well.

It seems to correlate well with a remote host delivering a mail (using
maildir over NFS) - but this is not the only situation in which it
happens.

Everything using disk, both on NFS clients and locally running
processes, just pauses.  Five seconds later everything is as if it never
happened.

Nothing in dmesg.

Didn't happen in 2.4.18, happens in 2.4.20-rc1.

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


* Re: 2.4.20-rc1 - hang with processes stuck in D
  2002-11-08  4:17     ` Jakob Oestergaard
@ 2002-11-08 18:43       ` Trond Myklebust
  2002-11-09 22:39         ` Jakob Oestergaard
  0 siblings, 1 reply; 8+ messages in thread
From: Trond Myklebust @ 2002-11-08 18:43 UTC (permalink / raw)
  To: Jakob Oestergaard; +Cc: Jeff Dike, Andrew Morton, linux-kernel

>>>>> " " == Jakob Oestergaard <jakob@unthought.net> writes:

     > I suspected NFS problems (looks like someone re-wrote NFS
     > between 2.4.18 and 2.4.20-rc1) - but this is *not* the case.
     > The pauses happen on locally running processes as well.

     > It seems to correlate well with a remote host delivering a mail
     > (using maildir over NFS) - but this is not the only situation
     > in which it happens.

     > Everything using disk, both on NFS clients and locally running
     > processes, just pause. Five seconds after everything is like it
     > never happened.

If you are using HIGHMEM, then the stock 2.4.20-rc1 has a known issue
with an unbalanced kmap. Marcelo has already applied the following
patch in the latest bitkeeper update.

Cheers,
  Trond

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.774   -> 1.775  
#	    net/sunrpc/xdr.c	1.7     -> 1.8    
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/11/06	trond.myklebust@fys.uio.no	1.775
# [PATCH] another kmap imbalance in 2.4.x/2.5.x RPC
# 
# >>>>> Andrew Ryan <andrewr@nam-shub.com> writes:
#      > So far so good on the crashes.  I'm able to get through a
#      > complete run of dbench using TCP mounts on 2.4.20rc1, which I
#      > haven't been able to do before this.
# 
# Marcelo, Linus
# 
#   We've uncovered yet another kmap imbalance in the new RPC code. This
# looks like it might be the last one (my debugging printks have been
# unable to unearth any more). One line fix + 4 line comment
# appended. Please apply to both 2.4.20-rc1 and 2.5.45...
# 
# Cheers,
#   Trond
# --------------------------------------------
#
diff -Nru a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
--- a/net/sunrpc/xdr.c	Fri Nov  8 19:42:24 2002
+++ b/net/sunrpc/xdr.c	Fri Nov  8 19:42:24 2002
@@ -244,6 +244,11 @@
 		pglen -= base;
 		base  += xdr->page_base;
 		ppage += base >> PAGE_CACHE_SHIFT;
+		/* Note: The offset means that the length of the first
+		 * page is really (PAGE_CACHE_SIZE - (base & ~PAGE_CACHE_MASK)).
+		 * In order to avoid an extra test inside the loop,
+		 * we bump pglen here, and just subtract PAGE_CACHE_SIZE... */
+		pglen += base & ~PAGE_CACHE_MASK;
 	}
 	for (;;) {
 		flush_dcache_page(*ppage);
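
The arithmetic in the added comment can be checked outside the kernel.  The
following is a minimal user-space sketch, not the xdr.c code: it assumes a
4 KiB page size and a loop that handles one page per pass and subtracts a
whole page each time, as the comment describes.  The names PAGE_SIZE_ASSUMED,
pages_reference and pages_loop are hypothetical.

```c
#include <assert.h>

#define PAGE_SIZE_ASSUMED 4096u                 /* assumption: 4 KiB pages */
#define OFFSET_MASK (PAGE_SIZE_ASSUMED - 1u)    /* in-page offset bits */

/* Reference answer: number of pages touched by `len` bytes (len > 0)
 * starting at byte offset `start`. */
static unsigned int pages_reference(unsigned int start, unsigned int len)
{
    unsigned int first = start / PAGE_SIZE_ASSUMED;
    unsigned int last = (start + len - 1u) / PAGE_SIZE_ASSUMED;

    return last - first + 1u;
}

/* Count pages with a loop that always subtracts a full page per pass.
 * With bump != 0 the length is first increased by the in-page offset of
 * `start`, which is the adjustment the one-line fix makes to pglen. */
static unsigned int pages_loop(unsigned int start, unsigned int len, int bump)
{
    unsigned int n = 0;

    assert(len > 0);
    if (bump)
        len += start & OFFSET_MASK;
    for (;;) {
        n++;                            /* one page handled per pass */
        if (len <= PAGE_SIZE_ASSUMED)
            break;
        len -= PAGE_SIZE_ASSUMED;
    }
    return n;
}
```

With start = 4000 and len = 200 the data straddles two pages, and
pages_reference agrees.  Without the bump the loop stops after one page; with
the bump it counts two.  A miscount of this kind between the code that maps
pages and the code that unmaps them is one way an imbalance could arise,
though the sketch does not claim to reproduce the actual failure.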


* Re: 2.4.20-rc1 - hang with processes stuck in D
  2002-11-08 18:43       ` Trond Myklebust
@ 2002-11-09 22:39         ` Jakob Oestergaard
  0 siblings, 0 replies; 8+ messages in thread
From: Jakob Oestergaard @ 2002-11-09 22:39 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Jeff Dike, Andrew Morton, linux-kernel

On Fri, Nov 08, 2002 at 07:43:10PM +0100, Trond Myklebust wrote:
> >>>>> " " == Jakob Oestergaard <jakob@unthought.net> writes:
...
>      > Everything using disk, both on NFS clients and locally running
>      > processes, just pause. Five seconds after everything is like it
>      > never happened.
> 
> If you are using HIGHMEM, then the stock 2.4.20-rc1 has a known issue
> with an unbalanced kmap. Marcelo has already applied the following
> patch in the latest bitkeeper update.

No highmem.  The box has 512 MB RAM.

I get some
eth1: TX underrun, threshold adjusted.
eth0: TX underrun, threshold adjusted.
messages in the syslog - probably around 100 messages or so, but they
stop appearing after a day of uptime or so.  This is two bonded Intel
eepro100 cards, using the "Becker" driver (not the Intel one which I saw
was included).  Those messages do not seem to be correlated with the
pauses at all though.

That's the *only* anomaly except for the pauses, that I see on the box.

The machine has run 2.4.20-rc1 for 5 days now, with an average load
probably around 3 or 4  (load 2 caused by two long-running CPU hogs, the
rest comes from disk I/O, mostly because it's NFS exporting a 147G fs).

Stable so far, but the "hiccups" are weird.

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


* Re: 2.4.20-rc1 - hang with processes stuck in D
  2002-11-06  0:37 ` Andrew Morton
  2002-11-06  3:08   ` Jeff Dike
@ 2002-11-08  9:01   ` Marcelo Tosatti
  2002-11-10 21:22     ` Jeff Dike
  1 sibling, 1 reply; 8+ messages in thread
From: Marcelo Tosatti @ 2002-11-08  9:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jeff Dike, linux-kernel



On Tue, 5 Nov 2002, Andrew Morton wrote:

> Jeff Dike wrote:
> >
> > 2.4.20-rc1 reliably gets processes stuck in D, eventually wedging the whole
> > system.  This is by diffing two kernel pools, one of which has 9 138764288
> > byte core files.
> >
> > The diff itself is stuck in __wait_on_buffer:
> >
> >         Trace; c0131608 <__wait_on_buffer+68/90>
>
> Kernel is waiting for IO completion on a read.  I would be
> suspecting your IO system, or interrupt system.

Or rather try it on a different box.

Jeff, can you please mail me privately the exact test case which produces
the problem so I can try it around here?





* Re: 2.4.20-rc1 - hang with processes stuck in D
  2002-11-08  9:01   ` Marcelo Tosatti
@ 2002-11-10 21:22     ` Jeff Dike
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Dike @ 2002-11-10 21:22 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Andrew Morton, linux-kernel

marcelo@conectiva.com.br said:
> Or rather try it on a different box. 

This has been seen on a number of different boxes running a variety of kernels.

The ones that have happened to other people that I have heard of have all
involved UML.  I've also made my laptop hang with BK, diff, and emacs.

Here are some threads talking about this problem:

    http://marc.theaimsgroup.com/?l=user-mode-linux-user&m=103644225423660&w=2
and http://marc.theaimsgroup.com/?l=user-mode-linux-user&m=103644252023954&w=2

    http://marc.theaimsgroup.com/?l=linux-kernel&m=103351640614665&w=2

    http://marc.theaimsgroup.com/?l=user-mode-linux-user&m=103582756229685&w=2
and http://marc.theaimsgroup.com/?l=user-mode-linux-user&m=103582861831037&w=2

There's a variety of kernels and hardware involved here.  My laptop is 
bog-standard IDE afaik.  Zaphod, the subject of the second URL, is IDE behind
a 3ware raid controller.  Not sure about the others.

				Jeff

