* 2.4.20-rc1 - hang with processes stuck in D
From: Jeff Dike @ 2002-11-06 0:25 UTC
To: linux-kernel

2.4.20-rc1 reliably gets processes stuck in D, eventually wedging the whole
system. This is by diffing two kernel pools, one of which has 9 138764288
byte core files.

The diff itself is stuck in __wait_on_buffer:

Trace; c0131608 <__wait_on_buffer+68/90>
Trace; c0132258 <getblk+28/60>
Trace; c0132269 <getblk+39/60>
Trace; c01324d6 <bread+46/70>
Trace; c0121918 <handle_mm_fault+58/c0>
Trace; c0163b02 <ext2_get_branch+52/c0>
Trace; c0163d99 <ext2_get_block+59/320>
Trace; c01109fa <do_page_fault+17a/4ab>
Trace; c01326b2 <create_buffers+62/f0>
Trace; c01326b8 <create_buffers+68/f0>
Trace; c0132fec <block_read_full_page+ec/240>
Trace; c0123a3d <add_to_page_cache_unique+6d/80>
Trace; c0123ad8 <page_cache_read+88/c0>
Trace; c0163d40 <ext2_get_block+0/320>
Trace; c01240b5 <generic_file_readahead+f5/130>
Trace; c012430f <do_generic_file_read+1df/430>
Trace; c012487c <generic_file_read+7c/110>
Trace; c0124780 <file_read_actor+0/80>
Trace; c0130796 <sys_read+96/f0>
Trace; c010bafb <sys_mmap2+2b/30>
Trace; c0106d8b <system_call+33/38>

kupdated and bdflush are both stuck in __wait_on_buffer called from timer_bh:

kupdated:

Trace; c01a0595 <__get_request_wait+95/d0>
Trace; c01a0b6b <__make_request+3db/570>
Trace; c011b424 <timer_bh+274/390>
Trace; c011817b <bh_action+1b/50>
Trace; c0118084 <tasklet_hi_action+44/70>
Trace; c01a0e0e <generic_make_request+10e/130>
Trace; c010833c <do_IRQ+9c/b0>
Trace; c01a0e7b <submit_bh+4b/70>
Trace; c0131684 <write_locked_buffers+24/30>
Trace; c0131731 <write_some_buffers+a1/f0>
Trace; c013455c <sync_old_buffers+1c/40>
Trace; c0134824 <kupdate+f4/120>
Trace; c0105000 <_stext+0/0>
Trace; c0105000 <_stext+0/0>
Trace; c01055d6 <kernel_thread+26/30>
Trace; c0134730 <kupdate+0/120>

bdflush:

Trace; c01a0595 <__get_request_wait+95/d0>
Trace; c01a0b6b <__make_request+3db/570>
Trace; c011b1d7 <timer_bh+27/390>
Trace; c011817b <bh_action+1b/50>
Trace; c0118084 <tasklet_hi_action+44/70>
Trace; c0110e0e <remap_area_pages+7e/1d0>
Trace; c010833c <do_IRQ+9c/b0>
Trace; c01a0e7b <submit_bh+4b/70>
Trace; c0131684 <write_locked_buffers+24/30>
Trace; c0131731 <write_some_buffers+a1/f0>
Trace; c01346fe <bdflush+9e/d0>
Trace; c0105000 <_stext+0/0>
Trace; c01055d6 <kernel_thread+26/30>
Trace; c0134660 <bdflush+0/d0>
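
[Editorial sketch, not part of the original message.] The report above is about tasks stuck in state D (uninterruptible sleep). One way to check a task for that state is to read the third field of /proc/<pid>/stat, which holds a one-character state code. The sketch below assumes a Linux /proc filesystem; the helper names `stat_line_state` and `pid_is_uninterruptible` are hypothetical and do not come from the thread or from any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the state character from one /proc/<pid>/stat line, or '?' if the
 * line cannot be parsed. The command name is wrapped in parentheses and may
 * itself contain spaces or ')', so search for the LAST ')' in the line. */
char stat_line_state(const char *line)
{
    assert(line != NULL);
    const char *close = strrchr(line, ')');
    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return '?';
    return close[2];
}

/* Read /proc/<pid>/stat and report whether the task is in state D.
 * Returns 1 if the state is D, 0 for any other state, and -1 if the
 * file cannot be opened or read (for example, no such pid). */
int pid_is_uninterruptible(int pid)
{
    char path[64];
    char buf[1024];

    snprintf(path, sizeof path, "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(buf, sizeof buf, f);
    fclose(f);
    if (ok == NULL)
        return -1;
    return stat_line_state(buf) == 'D';
}
```

Calling `pid_is_uninterruptible()` for each numeric directory under /proc would list the stuck tasks. It only reports the state; it says nothing about why a task is blocked, which is what the traces above are for.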

* Re: 2.4.20-rc1 - hang with processes stuck in D
From: Andrew Morton @ 2002-11-06 0:37 UTC
To: Jeff Dike; +Cc: linux-kernel

Jeff Dike wrote:
>
> 2.4.20-rc1 reliably gets processes stuck in D, eventually wedging the whole
> system. This is by diffing two kernel pools, one of which has 9 138764288
> byte core files.
>
> The diff itself is stuck in __wait_on_buffer:
>
> Trace; c0131608 <__wait_on_buffer+68/90>

Kernel is waiting for IO completion on a read. I would be suspecting
your IO system, or interrupt system.

Reverting your ide/scsi/whatever drivers to the last-known-to-work
version would be interesting.

* Re: 2.4.20-rc1 - hang with processes stuck in D
From: Jeff Dike @ 2002-11-06 3:08 UTC
To: Andrew Morton; +Cc: linux-kernel

akpm@digeo.com said:
> Kernel is waiting for IO completion on a read. I would be suspecting
> your IO system, or interrupt system.

Yup. The disk access light is stuck on continuously at this point, FWIW.

> Reverting your ide/scsi/whatever drivers to the last-known-to-work
> version would be interesting.

IDE - this didn't happen on 2.4.18. It seems to happen on all more recent
kernels. UML seems to trigger it, especially on UML servers.

Jeff

* Re: 2.4.20-rc1 - hang with processes stuck in D
From: Jakob Oestergaard @ 2002-11-08 4:17 UTC
To: Jeff Dike; +Cc: Andrew Morton, linux-kernel

On Tue, Nov 05, 2002 at 10:08:30PM -0500, Jeff Dike wrote:
> akpm@digeo.com said:
> > Kernel is waiting for IO completion on a read. I would be suspecting
> > your IO system, or interrupt system.
>
> Yup. The disk access light is stuck on continuously at this point, FWIW.
>
> > Reverting your ide/scsi/whatever drivers to the last-known-to-work
> > version would be interesting.
>
> IDE - this didn't happen on 2.4.18. It seems to happen on all more recent
> kernels. UML seems to trigger it, especially on UML servers.

Maybe not related, but I see 5 second "pauses" on a RAID-0+1 (software
RAID, Seagate 80G disks, Promise Ultra66+Ultra133 controllers, dual x86)
file server here.

I suspected NFS problems (looks like someone re-wrote NFS between 2.4.18
and 2.4.20-rc1) - but this is *not* the case. The pauses happen on
locally running processes as well.

It seems to correlate well with a remote host delivering a mail (using
maildir over NFS) - but this is not the only situation in which it
happens.

Everything using disk, both on NFS clients and locally running
processes, just pause. Five seconds after everything is like it never
happened.

Nothing in dmesg.

Didn't happen in 2.4.18, happens in 2.4.20-rc1.

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

* Re: 2.4.20-rc1 - hang with processes stuck in D
From: Trond Myklebust @ 2002-11-08 18:43 UTC
To: Jakob Oestergaard; +Cc: Jeff Dike, Andrew Morton, linux-kernel

>>>>> " " == Jakob Oestergaard <jakob@unthought.net> writes:

     > I suspected NFS problems (looks like someone re-wrote NFS
     > between 2.4.18 and 2.4.20-rc1) - but this is *not* the case.
     > The pauses happen on locally running processes as well.
     > It seems to correlate well with a remote host delivering a mail
     > (using maildir over NFS) - but this is not the only situation
     > in which it happens.
     > Everything using disk, both on NFS clients and locally running
     > processes, just pause. Five seconds after everything is like it
     > never happened.

If you are using HIGHMEM, then the stock 2.4.20-rc1 has a known issue
with an unbalanced kmap. Marcelo has already applied the following
patch in the latest bitkeeper update.

Cheers,
  Trond

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.774   -> 1.775
#	    net/sunrpc/xdr.c	1.7     -> 1.8
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/11/06	trond.myklebust@fys.uio.no	1.775
# [PATCH] another kmap imbalance in 2.4.x/2.5.x RPC
#
# >>>>> Andrew Ryan <andrewr@nam-shub.com> writes:
#      > So far so good on the crashes. I'm able to get through a
#      > complete run of dbench using TCP mounts on 2.4.20rc1, which I
#      > haven't been able to do before this.
#
# Marcelo, Linus
#
#   We've uncovered yet another kmap imbalance in the new RPC code. This
# looks like it might be the last one (my debugging printks have been
# unable to unearth any more). One line fix + 4 line comment
# appended. Please apply to both 2.4.20-rc1 and 2.5.45...
#
# Cheers,
#   Trond
# --------------------------------------------
#
diff -Nru a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
--- a/net/sunrpc/xdr.c	Fri Nov  8 19:42:24 2002
+++ b/net/sunrpc/xdr.c	Fri Nov  8 19:42:24 2002
@@ -244,6 +244,11 @@
 		pglen -= base;
 		base  += xdr->page_base;
 		ppage += base >> PAGE_CACHE_SHIFT;
+		/* Note: The offset means that the length of the first
+		 * page is really (PAGE_CACHE_SIZE - (base & ~PAGE_CACHE_MASK)).
+		 * In order to avoid an extra test inside the loop,
+		 * we bump pglen here, and just subtract PAGE_CACHE_SIZE... */
+		pglen += base & ~PAGE_CACHE_MASK;
 	}
 	for (;;) {
 		flush_dcache_page(*ppage);
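
[Editorial sketch, not part of the original message.] The comment in the patch describes an arithmetic shortcut: when a byte range starts part-way into its first page, adding the in-page offset to the length once lets the loop treat every page, including the first, as a full page. The standalone sketch below illustrates only that idea and is not the kernel code. `SKETCH_PAGE_SIZE` and all three function names are hypothetical, and the `offset` parameter stands in for `base & ~PAGE_CACHE_MASK`.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, for illustration only */

/* Reference version: handle the short first page with per-iteration logic. */
unsigned pages_with_first_page_test(size_t offset, size_t len)
{
    assert(offset < SKETCH_PAGE_SIZE);
    unsigned pages = 0;
    size_t room = SKETCH_PAGE_SIZE - offset;  /* usable bytes in first page */

    while (len > 0) {
        size_t chunk = len < room ? len : room;
        len -= chunk;
        room = SKETCH_PAGE_SIZE;              /* later pages are full-sized */
        pages++;
    }
    return pages;
}

/* Bumped version: add the offset to the length once, up front, then
 * subtract a full page per iteration with no special first-page case. */
unsigned pages_with_bumped_len(size_t offset, size_t len)
{
    assert(offset < SKETCH_PAGE_SIZE);
    unsigned pages = 0;

    if (len == 0)
        return 0;
    len += offset;
    for (;;) {
        pages++;
        if (len <= SKETCH_PAGE_SIZE)
            break;
        len -= SKETCH_PAGE_SIZE;
    }
    return pages;
}

/* Same loop with the bump left out: the offset is ignored, so the count
 * comes out one short whenever the range spills into an extra page. */
unsigned pages_without_bump(size_t offset, size_t len)
{
    (void)offset;
    unsigned pages = 0;

    if (len == 0)
        return 0;
    for (;;) {
        pages++;
        if (len <= SKETCH_PAGE_SIZE)
            break;
        len -= SKETCH_PAGE_SIZE;
    }
    return pages;
}
```

For a range of 4096 bytes starting 100 bytes into a page, the first two functions both count two pages, while the version without the bump counts one. A page count that is off by one is the kind of discrepancy that can leave page mappings unbalanced, although the sketch does not model kmap itself.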

* Re: 2.4.20-rc1 - hang with processes stuck in D
From: Jakob Oestergaard @ 2002-11-09 22:39 UTC
To: Trond Myklebust; +Cc: Jeff Dike, Andrew Morton, linux-kernel

On Fri, Nov 08, 2002 at 07:43:10PM +0100, Trond Myklebust wrote:
> >>>>> " " == Jakob Oestergaard <jakob@unthought.net> writes:
...
>      > Everything using disk, both on NFS clients and locally running
>      > processes, just pause. Five seconds after everything is like it
>      > never happened.
>
> If you are using HIGHMEM, then the stock 2.4.20-rc1 has a known issue
> with an unbalanced kmap. Marcelo has already applied the following
> patch in the latest bitkeeper update.

No highmem. The box has 512 MB RAM.

I get some

 eth1: TX underrun, threshold adjusted.
 eth0: TX underrun, threshold adjusted.

messages in the syslog - probably around 100 messages or so, but they
stop appearing after a day of uptime or so. This is two bonded Intel
eepro100 cards, using the "Becker" driver (not the Intel one which I saw
was included).

Those messages do not seem to be correlated with the pauses at all
though.

That's the *only* anomaly except for the pauses, that I see on the box.

The machine has run 2.4.20-rc1 for 5 days now, with an average load
probably around 3 or 4 (load 2 caused by two long-running CPU hogs, the
rest comes from disk I/O, mostly because it's NFS exporting a 147G fs).

Stable so far, but the "hiccups" are weird.

-- 
................................................................
:   jakob@unthought.net   : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:

* Re: 2.4.20-rc1 - hang with processes stuck in D
From: Marcelo Tosatti @ 2002-11-08 9:01 UTC
To: Andrew Morton; +Cc: Jeff Dike, linux-kernel

On Tue, 5 Nov 2002, Andrew Morton wrote:

> Jeff Dike wrote:
> >
> > 2.4.20-rc1 reliably gets processes stuck in D, eventually wedging the whole
> > system. This is by diffing two kernel pools, one of which has 9 138764288
> > byte core files.
> >
> > The diff itself is stuck in __wait_on_buffer:
> >
> > Trace; c0131608 <__wait_on_buffer+68/90>
>
> Kernel is waiting for IO completion on a read. I would be
> suspecting your IO system, or interrupt system.

Or rather try it on a different box.

Jeff, can you please mail me privately the exact test case which produces
the problem so I can try it around here?

* Re: 2.4.20-rc1 - hang with processes stuck in D
From: Jeff Dike @ 2002-11-10 21:22 UTC
To: Marcelo Tosatti; +Cc: Andrew Morton, linux-kernel

marcelo@conectiva.com.br said:
> Or rather try it on a different box.

This has been seen on a number of different boxes running a variety of
kernels. The ones that have happened to other people that I have heard of
have all involved UML. I've also made my laptop hang with BK, diff, and emacs.

Here are some threads talking about this problem:

	http://marc.theaimsgroup.com/?l=user-mode-linux-user&m=103644225423660&w=2
and
	http://marc.theaimsgroup.com/?l=user-mode-linux-user&m=103644252023954&w=2

	http://marc.theaimsgroup.com/?l=linux-kernel&m=103351640614665&w=2

	http://marc.theaimsgroup.com/?l=user-mode-linux-user&m=103582756229685&w=2
and
	http://marc.theaimsgroup.com/?l=user-mode-linux-user&m=103582861831037&w=2

There's a variety of kernels and hardware involved here. My laptop is
bog-standard IDE afaik. Zaphod, the subject of the second URL, is IDE behind
a 3ware raid controller. Not sure about the others.

Jeff

end of thread

Thread overview: 8+ messages
2002-11-06  0:25 2.4.20-rc1 - hang with processes stuck in D Jeff Dike
2002-11-06  0:37 ` Andrew Morton
2002-11-06  3:08   ` Jeff Dike
2002-11-08  4:17     ` Jakob Oestergaard
2002-11-08 18:43       ` Trond Myklebust
2002-11-09 22:39         ` Jakob Oestergaard
2002-11-08  9:01   ` Marcelo Tosatti
2002-11-10 21:22     ` Jeff Dike