Re: VM deadlock

public inbox for linux-kernel@vger.kernel.org

* Re: VM deadlock
  2001-06-27 14:27 VM deadlock Xuan Baldauf
@ 2001-06-27 13:11 ` Marcelo Tosatti
  2001-06-27 16:13   ` Xuan Baldauf
  2001-06-27 15:09 ` Chris Mason
  1 sibling, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2001-06-27 13:11 UTC
  To: Xuan Baldauf; +Cc: linux-kernel, reiserfs-list@namesys.com



On Wed, 27 Jun 2001, Xuan Baldauf wrote:

> Hello,
> 
> I'm not sure whether this is a reiserfs bug or a kernel bug,
> so I'm posting to both lists...
> 
> My linux box suddenly was not available using ssh|telnet,
> but it responded to pings. On console login, I could type
> "root", but after pressing "return", there was no reaction,
> and pressing keys did not result in writing them on the
> screen.
> 
> "Emergency sync" and "Remount R/O" did not have any
> response.
> 
> That's why I pressed Alt+SysRq+P 5 times and wrote all stack
> traces (without registers) onto paper. After that, I pressed
> Alt+SysRq+T and also wrote 3 long stack traces (others were
> available too, but too short) down.

Xuan, 

Are you using kiobuf IO ?



* VM deadlock
@ 2001-06-27 14:27 Xuan Baldauf
  2001-06-27 13:11 ` Marcelo Tosatti
  2001-06-27 15:09 ` Chris Mason
  0 siblings, 2 replies; 18+ messages in thread
From: Xuan Baldauf @ 2001-06-27 14:27 UTC
  To: linux-kernel; +Cc: reiserfs-list@namesys.com

Hello,

I'm not sure whether this is a reiserfs bug or a kernel bug,
so I'm posting to both lists...

My linux box suddenly was not available using ssh|telnet,
but it responded to pings. On console login, I could type
"root", but after pressing "return", there was no reaction,
and pressing keys did not result in writing them on the
screen.

"Emergency sync" and "Remount R/O" did not have any
response.

That's why I pressed Alt+SysRq+P 5 times and wrote all stack
traces (without registers) onto paper. After that, I pressed
Alt+SysRq+T and also wrote 3 long stack traces (others were
available too, but too short) down.

After that, I wanted to kill processes and accidentally
pressed Alt+SysRq+K (I did not want to kill init). After
that, there was sudden disk and screen activity. The system
seemed to come alive again, but by then every process had
been killed. On screen, I could also see many oopses from
reiserfs complaining about calling "journal_begin" on
read-only media, but I think these come from "Remount R/O".

This is what I copied by hand from paper, in a format that
should be suitable for ksymoops:

---start---
5 stack traces obtained by Alt+SysRq+P

EIP: <c012839c>
Trace:
 <c0128ef5>
 <c012905e>
 <c0129d05>
 <c0129b36>
 <c012a425>
 <c0120198>
 <c01201f5>
 <c0120550>
 <c01113b4>
 <c0111513>
 <c01113b4>
 <c0111fb0>
 <c0129e12>
 <c0129e38>
 <c013c63a>
 <c013c928>
 <c0106be4>
 <c013cd24>
 <c0106ad3>

EIP: <c0128393>
Trace:
 <c0128ef5>
 <c012905e>
 <c0129d05>
 <c0129b36>
 <c012038e>
 <c012042f>
 <c012053f>
 <c01113b4>
 <c0111513>
 <c01113b4>
 <c0148ae9>
 <c01f258a>
 <c01f25cc>
 <c0106be4>
 <c01f1910>
 <c014886d>
 <c012ee26>
 <c0106ad3>

EIP: <c01285f6>
Trace:
 <c0128ef5>
 <c012905e>
 <c0129d05>
 <c0129b36>
 <c012a425>
 <c01201fb>
 <c0120550>
 <c01113b4>
 <c0111513>
 <c01113b4>
 <c0111fb0>
 <c013c63a>
 <c013c928>
 <c013c962>
 <c013cdef>
 <c0106be4>

EIP: <c0128d33>
Trace:
 <c0128b86>
 <c0128ef5>
 <c012905e>
 <c0129d05>
 <c0129b36>
 <c0129dc2>
 <c013c677>
 <c01bb4a0>
 <c01b6b4f>
 <c013c84b>
 <c013ccb2>
 <c0106ad3>

EIP: <c01283c8>
Trace:
 <c0128ef5>
 <c012905e>
 <c0129d05>
 <c0129b36>
 <c012038e>
 <c012042f>
 <c012053f>
 <c01113b4>
 <c0111513>
 <c01113b4>
 <c0148ae9>
 <c01f258a>
 <c01f25cc>
 <c0106be4>
 <c01f1910>
 <c014886d>
 <c012ee26>
 <c0106ad3>

3 chosen stack traces in the output of Alt+SysRq+T

java S 7FFFFFFF 0 4323 4240 (NOTLB) 4322
  Call Trace:
  EIP: <c0111cbf>
  Trace:
  <c01cc1a1>
  <c01bad82>
  <c01bae9e>
  <c01e8845>
  <c01b67d1>
  <c01b755c>
  <c01113b4>
  <c0111513>
  <c01385a4>
  <c011a862>
  <c01b7cab>
  <c0106ad3>

fetchmail R 00000000 0 4881 1 (NOTLB) 4902 4066
  Call Trace:
  EIP: <c0129c3c>
  Trace:
  <c0129b36>
  <c0122378>
  <c0123410>
  <c012044e>
  <c012053f>
  <c01113b4>
  <c0111513>
  <c01113b4>
  <c0117053>
  <c0117148>
  <c012ee53>
  <c0106be4>

mirrordir R 00000000 2672 4907 4894 (NOTLB)
  Call Trace:
  EIP: <c0129c3c>
  Trace:
  <c0129b36>
  <c0132282>
  <c013058f>
  <c013059c>
  <c0130992>
  <c0130b8e>
  <c0171f8b>
  <c016cc2e>
  <c015e189>
  <c0141b09>
  <c0141d46>
  <c015e1fc>
  <c015a958>
  <c013863b>
  <c0138ce1>
  <c01383db>
  <c01392b8>
  <c01361e6>
  <c0106ad3>


Ping over network: response
Emergency sync: no response
Remount R/O: no response
---end---

This is the output from ksymoops with the above file as
input:

---start---
ksymoops 2.4.1 on i586 2.4.6-pre5.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.6-pre5/ (default)
     -m /boot/System.map-2.4.6-pre5 (default)

Warning: You did not tell me where to find symbol
information.  I will
assume that the log matches the kernel and modules that are
running
right now and I'll use the default options above for symbol
resolution.
If the current kernel and/or modules do not match the log,
you can get
more accurate output by telling me the kernel version and
where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

Error (regular_file): read_system_map stat
/boot/System.map-2.4.6-pre5 failed
EIP: <c012839c>
Using defaults from ksymoops -t elf32-i386 -a i386
Trace:
        <c0128ef5>
        <c012905e>
        <c0129d05>
        <c0129b36>
        <c012a425>
        <c0120198>
        <c01201f5>
        <c0120550>
        <c01113b4>
        <c0111513>
        <c01113b4>
        <c0111fb0>
        <c0129e12>
        <c0129e38>
        <c013c63a>
        <c013c928>
        <c0106be4>
        <c013cd24>
        <c0106ad3>
EIP: <c0128393>
Trace:
        <c0128ef5>
        <c012905e>
        <c0129d05>
        <c0129b36>
        <c012038e>
        <c012042f>
        <c012053f>
        <c01113b4>
        <c0111513>
        <c01113b4>
        <c0148ae9>
        <c01f258a>
        <c01f25cc>
        <c0106be4>
        <c01f1910>
        <c014886d>
        <c012ee26>
        <c0106ad3>
EIP: <c01285f6>
Trace:
        <c0128ef5>
        <c012905e>
        <c0129d05>
        <c0129b36>
        <c012a425>
        <c01201fb>
        <c0120550>
        <c01113b4>
        <c0111513>
        <c01113b4>
        <c0111fb0>
        <c013c63a>
        <c013c928>
        <c013c962>
        <c013cdef>
        <c0106be4>
EIP: <c0128d33>
Trace:
        <c0128b86>
        <c0128ef5>
        <c012905e>
        <c0129d05>
        <c0129b36>
        <c0129dc2>
        <c013c677>
        <c01bb4a0>
        <c01b6b4f>
        <c013c84b>
        <c013ccb2>
        <c0106ad3>
EIP: <c01283c8>
Trace:
        <c0128ef5>
        <c012905e>
        <c0129d05>
        <c0129b36>
        <c012038e>
        <c012042f>
        <c012053f>
        <c01113b4>
        <c0111513>
        <c01113b4>
        <c0148ae9>
        <c01f258a>
        <c01f25cc>
        <c0106be4>
        <c01f1910>
        <c014886d>
        <c012ee26>
        <c0106ad3>
               EIP: <c0111cbf>
               Trace:
               <c01cc1a1>
               <c01bad82>
               <c01bae9e>
               <c01e8845>
               <c01b67d1>
               <c01b755c>
               <c01113b4>
               <c0111513>
               <c01385a4>
               <c011a862>
               <c01b7cab>
               <c0106ad3>
               EIP: <c0129c3c>
               Trace:
               <c0129b36>
               <c0122378>
               <c0123410>
               <c012044e>
               <c012053f>
               <c01113b4>
               <c0111513>
               <c01113b4>
               <c0117053>
               <c0117148>
               <c012ee53>
               <c0106be4>
               EIP: <c0129c3c>
               Trace:
               <c0129b36>
               <c0132282>
               <c013058f>
               <c013059c>
               <c0130992>
               <c0130b8e>
               <c0171f8b>
               <c016cc2e>
               <c015e189>
               <c0141b09>
               <c0141d46>
               <c015e1fc>
               <c015a958>
               <c013863b>
               <c0138ce1>
               <c01383db>
               <c01392b8>
               <c01361e6>
               <c0106ad3>
Warning (Oops_read): Code line not seen, dumping what data
is available

>>EIP; c012839c <deactivate_page+e94/2618>   <=====
Trace; c0128ef5 <deactivate_page+19ed/2618>
Trace; c012905e <deactivate_page+1b56/2618>
Trace; c0129d05 <__alloc_pages+1cd/280>
Trace; c0129b36 <_alloc_pages+16/18>
Trace; c012a425 <free_pages+611/1cac>
Trace; c0120198 <vmtruncate+1c4/878>
Trace; c01201f5 <vmtruncate+221/878>
Trace; c0120550 <vmtruncate+57c/878>
Trace; c01113b4 <__verify_write+104/784>
Trace; c0111513 <__verify_write+263/784>
Trace; c01113b4 <__verify_write+104/784>
Trace; c0111fb0 <schedule+264/394>
Trace; c0129e12 <__free_pages+1a/1c>
Trace; c0129e38 <free_pages+24/1cac>
Trace; c013c63a <poll_freewait+3a/44>
Trace; c013c928 <__pollwait+2e4/ef4>
Trace; c0106be4 <__up_wakeup+1140/2374>
Trace; c013cd24 <__pollwait+6e0/ef4>
Trace; c0106ad3 <__up_wakeup+102f/2374>
>>EIP; c0128393 <deactivate_page+e8b/2618>   <=====
Trace; c0128ef5 <deactivate_page+19ed/2618>
Trace; c012905e <deactivate_page+1b56/2618>
Trace; c0129d05 <__alloc_pages+1cd/280>
Trace; c0129b36 <_alloc_pages+16/18>
Trace; c012038e <vmtruncate+3ba/878>
Trace; c012042f <vmtruncate+45b/878>
Trace; c012053f <vmtruncate+56b/878>
Trace; c01113b4 <__verify_write+104/784>
Trace; c0111513 <__verify_write+263/784>
Trace; c01113b4 <__verify_write+104/784>
Trace; c0148ae9 <kiobuf_wait_for_io+5f85/6238>
Trace; c01f258a <vsprintf+25e/35c>
Trace; c01f25cc <vsprintf+2a0/35c>
Trace; c0106be4 <__up_wakeup+1140/2374>
Trace; c01f1910 <csum_partial_copy_generic+114/128>
Trace; c014886d <kiobuf_wait_for_io+5d09/6238>
Trace; c012ee26 <default_llseek+25e/914>
Trace; c0106ad3 <__up_wakeup+102f/2374>
>>EIP; c01285f6 <deactivate_page+10ee/2618>   <=====
Trace; c0128ef5 <deactivate_page+19ed/2618>
Trace; c012905e <deactivate_page+1b56/2618>
Trace; c0129d05 <__alloc_pages+1cd/280>
Trace; c0129b36 <_alloc_pages+16/18>
Trace; c012a425 <free_pages+611/1cac>
Trace; c01201fb <vmtruncate+227/878>
Trace; c0120550 <vmtruncate+57c/878>
Trace; c01113b4 <__verify_write+104/784>
Trace; c0111513 <__verify_write+263/784>
Trace; c01113b4 <__verify_write+104/784>
Trace; c0111fb0 <schedule+264/394>
Trace; c013c63a <poll_freewait+3a/44>
Trace; c013c928 <__pollwait+2e4/ef4>
Trace; c013c962 <__pollwait+31e/ef4>
Trace; c013cdef <__pollwait+7ab/ef4>
Trace; c0106be4 <__up_wakeup+1140/2374>
>>EIP; c0128d33 <deactivate_page+182b/2618>   <=====
Trace; c0128b86 <deactivate_page+167e/2618>
Trace; c0128ef5 <deactivate_page+19ed/2618>
Trace; c012905e <deactivate_page+1b56/2618>
Trace; c0129d05 <__alloc_pages+1cd/280>
Trace; c0129b36 <_alloc_pages+16/18>
Trace; c0129dc2 <__get_free_pages+a/1c>
Trace; c013c677 <__pollwait+33/ef4>
Trace; c01bb4a0 <datagram_poll+24/17c>
Trace; c01b6b4f <sock_recvmsg+3bb/5e8>
Trace; c013c84b <__pollwait+207/ef4>
Trace; c013ccb2 <__pollwait+66e/ef4>
Trace; c0106ad3 <__up_wakeup+102f/2374>
>>EIP; c01283c8 <deactivate_page+ec0/2618>   <=====
Trace; c0128ef5 <deactivate_page+19ed/2618>
Trace; c012905e <deactivate_page+1b56/2618>
Trace; c0129d05 <__alloc_pages+1cd/280>
Trace; c0129b36 <_alloc_pages+16/18>
Trace; c012038e <vmtruncate+3ba/878>
Trace; c012042f <vmtruncate+45b/878>
Trace; c012053f <vmtruncate+56b/878>
Trace; c01113b4 <__verify_write+104/784>
Trace; c0111513 <__verify_write+263/784>
Trace; c01113b4 <__verify_write+104/784>
Trace; c0148ae9 <kiobuf_wait_for_io+5f85/6238>
Trace; c01f258a <vsprintf+25e/35c>
Trace; c01f25cc <vsprintf+2a0/35c>
Trace; c0106be4 <__up_wakeup+1140/2374>
Trace; c01f1910 <csum_partial_copy_generic+114/128>
Trace; c014886d <kiobuf_wait_for_io+5d09/6238>
Trace; c012ee26 <default_llseek+25e/914>
Trace; c0106ad3 <__up_wakeup+102f/2374>
>>EIP; c0111cbf <schedule_timeout+17/a4>   <=====
Trace; c01cc1a1 <ip_options_undo+829/830>
Trace; c01bad82 <csum_partial_copy_fromiovecend+2ca/338>
Trace; c01bae9e <skb_recv_datagram+ae/c4>
Trace; c01e8845 <inet_accept+119/1ac>
Trace; c01b67d1 <sock_recvmsg+3d/5e8>
Trace; c01b755c <sock_create+764/f30>
Trace; c01113b4 <__verify_write+104/784>
Trace; c0111513 <__verify_write+263/784>
Trace; c01385a4 <path_release+3c/144>
Trace; c011a862 <del_timer+75a/b88>
Trace; c01b7cab <sock_create+eb3/f30>
Trace; c0106ad3 <__up_wakeup+102f/2374>
>>EIP; c0129c3c <__alloc_pages+104/280>   <=====
Trace; c0129b36 <_alloc_pages+16/18>
Trace; c0122378 <filemap_fdatawait+2a0/30c>
Trace; c0123410 <filemap_nopage+14c/3d8>
Trace; c012044e <vmtruncate+47a/878>
Trace; c012053f <vmtruncate+56b/878>
Trace; c01113b4 <__verify_write+104/784>
Trace; c0111513 <__verify_write+263/784>
Trace; c01113b4 <__verify_write+104/784>
Trace; c0117053 <up_and_exit+61b/890>
Trace; c0117148 <up_and_exit+710/890>
Trace; c012ee53 <default_llseek+28b/914>
Trace; c0106be4 <__up_wakeup+1140/2374>
>>EIP; c0129c3c <__alloc_pages+104/280>   <=====
Trace; c0129b36 <_alloc_pages+16/18>
Trace; c0132282 <block_symlink+14e/570>
Trace; c013058f <set_blocksize+24f/284>
Trace; c013059c <set_blocksize+25c/284>
Trace; c0130992 <getblk+f2/170>
Trace; c0130b8e <bread+1a/90c>
Trace; c0171f8b <load_nls_default+1c9c3/27d64>
Trace; c016cc2e <load_nls_default+17666/27d64>
Trace; c015e189 <load_nls_default+8bc1/27d64>
Trace; c0141b09 <get_empty_inode+141/1e4>
Trace; c0141d46 <iget4+c2/d4>
Trace; c015e1fc <load_nls_default+8c34/27d64>
Trace; c015a958 <load_nls_default+5390/27d64>
Trace; c013863b <path_release+d3/144>
Trace; c0138ce1 <path_walk+529/890>
Trace; c01383db <getname+5b/98>
Trace; c01392b8 <__user_walk+3c/58>
Trace; c01361e6 <cdput+446/890>
Trace; c0106ad3 <__up_wakeup+102f/2374>


2 warnings and 1 error issued.  Results may not be reliable.

---end---
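
A note on reading the decoded lines above: ksymoops reports that it
could not read /boot/System.map-2.4.6-pre5, so the names are probably
just the nearest preceding symbol it could find, which would explain
offsets such as e94 into a "function" of 0x2618 bytes. The sketch below
shows that nearest-symbol lookup. The start addresses are not from a
System.map; they are derived from the trace lines themselves (address
minus printed offset), and the names ksym and resolve_addr are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Symbol start addresses, derived from the decoded trace lines
 * (address minus printed offset). Must stay sorted by address. */
struct ksym {
    uint32_t start;
    const char *name;
};

static const struct ksym ksyms[] = {
    { 0xc0127508u, "deactivate_page" },
    { 0xc0129b20u, "_alloc_pages" },
    { 0xc0129b38u, "__alloc_pages" },
    { 0xc0129db8u, "__get_free_pages" },
};

/* Format addr as "name+offset/size" using the nearest preceding symbol.
 * The size is the distance to the next symbol in the table.
 * Returns 0 on success, -1 if addr lies before the first symbol or
 * inside the last one (whose size is unknown in this small table). */
static int resolve_addr(uint32_t addr, char *out, size_t outlen)
{
    size_t n = sizeof ksyms / sizeof ksyms[0];

    for (size_t i = 0; i + 1 < n; i++) {
        if (addr >= ksyms[i].start && addr < ksyms[i + 1].start) {
            snprintf(out, outlen, "%s+%x/%x", ksyms[i].name,
                     (unsigned)(addr - ksyms[i].start),
                     (unsigned)(ksyms[i + 1].start - ksyms[i].start));
            return 0;
        }
    }
    return -1;
}
```

With a table that only contains a few symbols, every address in the gap
between two of them is attributed to the lower one, so a name in the
output above does not necessarily mean the trace really passed through
that function.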

It seems to me that the system ran low on memory and Linux
was unable to recover from that. Maybe this is a
deadlock where every kernel process tries to allocate
memory, fails, and therefore relinquishes the CPU to the
next kernel process, which in turn fails, and so on...
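
To make that guess concrete, here is a toy model of it, not kernel
code: tasks take turns, each needs one page, and a task whose
allocation fails simply yields to the next one. The function toy_run
and all of its parameters are made up for illustration.

```c
#define TOY_MAX_TASKS 16

/* Toy model of the suspected livelock. Each round, reclaim adds
 * reclaimed_per_round pages, then every unfinished task tries to take
 * one page; on failure it just yields to the next task.
 * Returns how many tasks got their page within `rounds` rounds. */
static int toy_run(int free_pages, int ntasks, int rounds,
                   int reclaimed_per_round)
{
    int done[TOY_MAX_TASKS] = { 0 };
    int completed = 0;

    if (ntasks > TOY_MAX_TASKS)
        ntasks = TOY_MAX_TASKS;

    for (int r = 0; r < rounds && completed < ntasks; r++) {
        free_pages += reclaimed_per_round;
        for (int t = 0; t < ntasks; t++) {
            if (done[t])
                continue;
            if (free_pages > 0) {
                free_pages--;
                done[t] = 1;
                completed++;
            }
            /* else: allocation failed, the task yields the CPU */
        }
    }
    return completed;
}
```

If nothing is ever reclaimed (reclaimed_per_round is 0) and there are
no free pages, the tasks keep being scheduled but none of them ever
completes, however many rounds are run. That would match a box that
still answers pings but makes no progress in any process.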

I had a probably similar or related problem (but with no
"ping" responding) with linux-2.4.5-pre3, described here:
http://lists.omnipotent.net/reiserfs/200106/msg00214.html

The kernel version and the suspected modules loaded at crash
time:

router|16:13:41|~/temp> uname -a
Linux router 2.4.6-pre5 #3 Tue Jun 26 23:36:26 CEST 2001
i586 unknown
router|16:16:08|~/temp> lsmod
Module                  Size  Used by
ipt_MASQUERADE          2032   1  (autoclean)
ip_nat_ftp              3872   0  (unused)
ip_conntrack_ftp        3856   0  [ip_nat_ftp]
iptable_nat            20912   1  [ipt_MASQUERADE
ip_nat_ftp]
ip_conntrack           21216   2  [ipt_MASQUERADE ip_nat_ftp
ip_conntrack_ftp iptable_nat]
ipt_REJECT              3232   2  (autoclean)
iptable_filter          2048   0  (autoclean) (unused)
serial                 43280   1  (autoclean)
isa-pnp                28272   0  (autoclean) [serial]
ipv6                  129904  -1  (autoclean)
ip_tables              13152   6  [ipt_MASQUERADE
iptable_nat ipt_REJECT iptable_filter]
eepro100               16016   1  (autoclean)
wd                      5232   1  (autoclean)
8390                    6304   0  (autoclean) [wd]
hisax                 133104  30
isdn                  124880  31  [hisax]
slhc                    4912  14  [isdn]
rtc                     5472   0  (autoclean)
floppy                 46064   0  (autoclean)
ide-cd                 26192   0  (autoclean)
cdrom                  27552   0  (autoclean) [ide-cd]
router|16:26:56|~/temp>

Is this kind of problem known?

Xuân.




* Re: VM deadlock
  2001-06-27 14:27 VM deadlock Xuan Baldauf
  2001-06-27 13:11 ` Marcelo Tosatti
@ 2001-06-27 15:09 ` Chris Mason
  2001-06-27 16:20   ` Xuan Baldauf
                     ` (2 more replies)
  1 sibling, 3 replies; 18+ messages in thread
From: Chris Mason @ 2001-06-27 15:09 UTC
  To: Xuan Baldauf, linux-kernel, andrea; +Cc: reiserfs-list@namesys.com



On Wednesday, June 27, 2001 04:27:45 PM +0200 Xuan Baldauf <xuan--lkml@baldauf.org> wrote:

> My linux box suddenly was not available using ssh|telnet,
> but it responded to pings. On console login, I could type
> "root", but after pressing "return", there was no reaction,
> and pressing keys did not result in writing them on the
> screen.

> Warning (Oops_read): Code line not seen, dumping what data
> is available
> 
>>> EIP; c012839c <deactivate_page+e94/2618>   <=====
> Trace; c0128ef5 <deactivate_page+19ed/2618>
> Trace; c012905e <deactivate_page+1b56/2618>
> Trace; c0129d05 <__alloc_pages+1cd/280>
> Trace; c0129b36 <_alloc_pages+16/18>
> Trace; c012a425 <free_pages+611/1cac>
> Trace; c0120198 <vmtruncate+1c4/878>
> Trace; c01201f5 <vmtruncate+221/878>
> Trace; c0120550 <vmtruncate+57c/878>

> I had a probably similar|connected problem (but with no
> "ping" responding) with linux-2.4.5-pre3, described here:
> http://lists.omnipotent.net/reiserfs/200106/msg00214.html
> 
> Linux router 2.4.6-pre5 #3 Tue Jun 26 23:36:26 CEST 2001

Sounds like a deadlock andrea recently found.

Could you please give this a try:

diff -urN 2.4.6pre5aa1/include/linux/swap.h 2.4.6pre5aa1-backout-page_launder/include/linux/swap.h
--- 2.4.6pre5aa1/include/linux/swap.h	Sun Jun 24 02:06:13 2001
+++ 2.4.6pre5aa1-backout-page_launder/include/linux/swap.h	Sun Jun 24 21:37:12 2001
@@ -205,16 +205,6 @@
 	page->zone->inactive_dirty_pages++; \
 }
 
-/* Like the above, but add us after the bookmark. */
-#define add_page_to_inactive_dirty_list_marker(page) { \
-	DEBUG_ADD_PAGE \
-	ZERO_PAGE_BUG \
-	SetPageInactiveDirty(page); \
-	list_add(&(page)->lru, marker_lru); \
-	nr_inactive_dirty_pages++; \
-	page->zone->inactive_dirty_pages++; \
-}
-
 #define add_page_to_inactive_clean_list(page) { \
 	DEBUG_ADD_PAGE \
 	ZERO_PAGE_BUG \
diff -urN 2.4.6pre5aa1/mm/vmscan.c 2.4.6pre5aa1-backout-page_launder/mm/vmscan.c
--- 2.4.6pre5aa1/mm/vmscan.c	Sun Jun 24 01:41:09 2001
+++ 2.4.6pre5aa1-backout-page_launder/mm/vmscan.c	Sun Jun 24 21:37:11 2001
@@ -407,7 +407,7 @@
 /**
  * page_launder - clean dirty inactive pages, move to inactive_clean list
  * @gfp_mask: what operations we are allowed to do
- * @sync: are we allowed to do synchronous IO in emergencies ?
+ * @sync: should we wait synchronously for the cleaning of pages
  *
  * When this function is called, we are most likely low on free +
  * inactive_clean pages. Since we want to refill those pages as
@@ -426,61 +426,23 @@
 #define MAX_LAUNDER 		(4 * (1 << page_cluster))
 #define CAN_DO_IO		(gfp_mask & __GFP_IO)
 #define CAN_DO_BUFFERS		(gfp_mask & __GFP_BUFFER)
-#define marker_lru		(&marker_page_struct.lru)
 int page_launder(int gfp_mask, int sync)
 {
-	static int cannot_free_pages;
 	int launder_loop, maxscan, cleaned_pages, maxlaunder;
 	struct list_head * page_lru;
 	struct page * page;
 
-	/* Our bookmark of where we are in the inactive_dirty list. */
-	struct page marker_page_struct = { zone: NULL };
-
 	launder_loop = 0;
 	maxlaunder = 0;
 	cleaned_pages = 0;
 
 dirty_page_rescan:
 	spin_lock(&pagemap_lru_lock);
-	/*
-	 * By not scanning all inactive dirty pages we'll write out
-	 * really old dirty pages before evicting newer clean pages.
-	 * This should cause some LRU behaviour if we have a large
-	 * amount of inactive pages (due to eg. drop behind).
-	 *
-	 * It also makes us accumulate dirty pages until we have enough
-	 * to be worth writing to disk without causing excessive disk
-	 * seeks and eliminates the infinite penalty clean pages incurred
-	 * vs. dirty pages.
-	 */
-	maxscan = nr_inactive_dirty_pages / 4;
-	if (launder_loop)
-		maxscan *= 2;
-	list_add_tail(marker_lru, &inactive_dirty_list);
-	for (;;) {
-		page_lru = marker_lru->prev;
-		if (page_lru == &inactive_dirty_list)
-			break;
-		if (--maxscan < 0)
-			break;
-		if (!free_shortage())
-			break;
-
+	maxscan = nr_inactive_dirty_pages;
+	while ((page_lru = inactive_dirty_list.prev) != &inactive_dirty_list &&
+				maxscan-- > 0) {
 		page = list_entry(page_lru, struct page, lru);
 
-		/* Move the bookmark backwards.. */
-		list_del(marker_lru);
-		list_add_tail(marker_lru, page_lru);
-
-		/* Don't waste CPU if chances are we cannot free anything. */
-		if (launder_loop && maxlaunder < 0 && cannot_free_pages)
-			break;
-
-		/* Skip other people's marker pages. */
-		if (!page->zone)
-			continue;
-
 		/* Wrong page on list?! (list corruption, should not happen) */
 		if (!PageInactiveDirty(page)) {
 			printk("VM: page_launder, wrong page on list.\n");
@@ -492,6 +454,7 @@
 
 		/* Page is or was in use?  Move it to the active list. */
 		if (PageReferenced(page) || page->age > 0 ||
+				page->zone->free_pages > page->zone->pages_high ||
 				(!page->buffers && page_count(page) > 1) ||
 				page_ramdisk(page)) {
 			del_page_from_inactive_dirty_list(page);
@@ -501,9 +464,11 @@
 
 		/*
 		 * The page is locked. IO in progress?
-		 * Skip the page, we'll take a look when it unlocks.
+		 * Move it to the back of the list.
 		 */
 		if (TryLockPage(page)) {
+			list_del(page_lru);
+			list_add(page_lru, &inactive_dirty_list);
 			continue;
 		}
 
@@ -517,8 +482,10 @@
 			if (!writepage)
 				goto page_active;
 
-			/* First time through? Skip the page. */
+			/* First time through? Move it to the back of the list */
 			if (!launder_loop || !CAN_DO_IO) {
+				list_del(page_lru);
+				list_add(page_lru, &inactive_dirty_list);
 				UnlockPage(page);
 				continue;
 			}
@@ -531,8 +498,6 @@
 			writepage(page);
 			page_cache_release(page);
 
-			maxlaunder--;
-
 			/* And re-start the thing.. */
 			spin_lock(&pagemap_lru_lock);
 			continue;
@@ -560,9 +525,9 @@
 			spin_unlock(&pagemap_lru_lock);
 
 			/* Will we do (asynchronous) IO? */
-			if (launder_loop && maxlaunder-- == 0 && sync)
+			if (launder_loop && maxlaunder == 0 && sync)
 				wait = 2;	/* Synchrounous IO */
-			else if (launder_loop && maxlaunder > 0)
+			else if (launder_loop && maxlaunder-- > 0)
 				wait = 1;	/* Async IO */
 			else
 				wait = 0;	/* No IO */
@@ -579,7 +544,7 @@
 
 			/* The buffers were not freed. */
 			if (!clearedbuf) {
-				add_page_to_inactive_dirty_list_marker(page);
+				add_page_to_inactive_dirty_list(page);
 
 			/* The page was only in the buffer cache. */
 			} else if (!page->mapping) {
@@ -635,8 +600,6 @@
 			UnlockPage(page);
 		}
 	}
-	/* Remove our marker. */
-	list_del(marker_lru);
 	spin_unlock(&pagemap_lru_lock);
 
 	/*
@@ -652,29 +615,16 @@
 	 */
 	if ((CAN_DO_IO || CAN_DO_BUFFERS) && !launder_loop && free_shortage()) {
 		launder_loop = 1;
-		/*
-		 * If we, or the previous process running page_launder(),
-		 * managed to free any pages we never do synchronous IO.
-		 */
-		if (cleaned_pages || !cannot_free_pages)
+		/* If we cleaned pages, never do synchronous IO. */
+		if (cleaned_pages)
 			sync = 0;
-		/* Else, do synchronous IO (if we are allowed to). */
-		else if (sync)
-			sync = 1;
 		/* We only do a few "out of order" flushes. */
 		maxlaunder = MAX_LAUNDER;
-		/* Let bdflush take care of the rest. */
+		/* Kflushd takes care of the rest. */
 		wakeup_bdflush(0);
 		goto dirty_page_rescan;
 	}
 
-	/*
-	 * If we failed to free pages (because all pages are dirty)
-	 * we remember this for the next time. This will prevent us
-	 * from wasting too much CPU here.
-	 */
-	cannot_free_pages = !cleaned_pages;
-
 	/* Return the number of pages moved to the inactive_clean list. */
 	return cleaned_pages;
 }
@@ -899,7 +849,7 @@
 	 * list, so this is a relatively cheap operation.
 	 */
 	if (free_shortage()) {
-		ret += page_launder(gfp_mask, 1);
+		ret += page_launder(gfp_mask, user);
 		shrink_dcache_memory(DEF_PRIORITY, gfp_mask);
 		shrink_icache_memory(DEF_PRIORITY, gfp_mask);
 	}
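
One hunk that is easy to misread is the one choosing the wait mode,
where the post-decrement of maxlaunder moves from the first test to the
second. The two helpers below copy only that if/else chain from the
patch so the difference can be tried in isolation; the function names
are hypothetical, and the rest of page_launder() (including the
separate maxlaunder-- the patch removes after writepage) is left out.

```c
/* Wait mode as chosen before the patch: the decrement sits in the
 * first test, so it runs on every call with launder_loop set. */
static int wait_mode_before(int launder_loop, int *maxlaunder, int sync)
{
    if (launder_loop && (*maxlaunder)-- == 0 && sync)
        return 2;   /* synchronous IO */
    else if (launder_loop && *maxlaunder > 0)
        return 1;   /* async IO */
    return 0;       /* no IO */
}

/* Wait mode as chosen after the patch: maxlaunder is only decremented
 * when an async write is actually picked, and it stops at zero. */
static int wait_mode_after(int launder_loop, int *maxlaunder, int sync)
{
    if (launder_loop && *maxlaunder == 0 && sync)
        return 2;   /* synchronous IO */
    else if (launder_loop && (*maxlaunder)-- > 0)
        return 1;   /* async IO */
    return 0;       /* no IO */
}
```

Starting from maxlaunder = 2 with launder_loop and sync both set, the
old chain yields 1, 0, 2, 0 on successive calls and leaves maxlaunder
negative, while the patched chain yields 1, 1, 2, 2 and keeps
maxlaunder at 0, so synchronous IO stays available once the async
budget is used up.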



* Re: VM deadlock
  2001-06-27 13:11 ` Marcelo Tosatti
@ 2001-06-27 16:13   ` Xuan Baldauf
  0 siblings, 0 replies; 18+ messages in thread
From: Xuan Baldauf @ 2001-06-27 16:13 UTC
  To: Marcelo Tosatti; +Cc: linux-kernel



Marcelo Tosatti wrote:

> On Wed, 27 Jun 2001, Xuan Baldauf wrote:
>
> > Hello,
> >
> > I'm not sure whether this is a reiserfs bug or a kernel bug,
> > so I'm posting to both lists...
> >
> > My linux box suddenly was not available using ssh|telnet,
> > but it responded to pings. On console login, I could type
> > "root", but after pressing "return", there was no reaction,
> > and pressing keys did not result in writing them on the
> > screen.
> >
> > "Emergency sync" and "Remount R/O" did not have any
> > response.
> >
> > That's why I pressed Alt+SysRq+P 5 times and wrote all stack
> > traces (without registers) onto paper. After that, I pressed
> > Alt+SysRq+T and also wrote 3 long stack traces (others were
> > available too, but too short) down.
>
> Xuan,
>
> Are you using kiobuf IO ?

I do not know exactly what kiobuf IO is, so I suppose not.

Xuân.




* Re: VM deadlock
  2001-06-27 15:09 ` Chris Mason
@ 2001-06-27 16:20   ` Xuan Baldauf
  2001-06-27 17:43   ` Marcelo Tosatti
  2001-06-27 18:16   ` Rik van Riel
  2 siblings, 0 replies; 18+ messages in thread
From: Xuan Baldauf @ 2001-06-27 16:20 UTC
  To: Chris Mason; +Cc: linux-kernel, andrea, reiserfs-list@namesys.com



Chris Mason wrote:

> On Wednesday, June 27, 2001 04:27:45 PM +0200 Xuan Baldauf <xuan--lkml@baldauf.org> wrote:
>
> > My linux box suddenly was not available using ssh|telnet,
> > but it responded to pings. On console login, I could type
> > "root", but after pressing "return", there was no reaction,
> > and pressing keys did not result in writing them on the
> > screen.
>
> > Warning (Oops_read): Code line not seen, dumping what data
> > is available
> >
> >>> EIP; c012839c <deactivate_page+e94/2618>   <=====
> > Trace; c0128ef5 <deactivate_page+19ed/2618>
> > Trace; c012905e <deactivate_page+1b56/2618>
> > Trace; c0129d05 <__alloc_pages+1cd/280>
> > Trace; c0129b36 <_alloc_pages+16/18>
> > Trace; c012a425 <free_pages+611/1cac>
> > Trace; c0120198 <vmtruncate+1c4/878>
> > Trace; c01201f5 <vmtruncate+221/878>
> > Trace; c0120550 <vmtruncate+57c/878>
>
> > I had a probably similar|connected problem (but with no
> > "ping" responding) with linux-2.4.5-pre3, described here:
> > http://lists.omnipotent.net/reiserfs/200106/msg00214.html
> >
> > Linux router 2.4.6-pre5 #3 Tue Jun 26 23:36:26 CEST 2001
>
> Sounds like a deadlock andrea recently found.
>
> Could you please give this a try:
>
> diff -urN 2.4.6pre5aa1/include/linux/swap.h 2.4.6pre5aa1-backout-page_launder/include/linux/swap.h
> --- 2.4.6pre5aa1/include/linux/swap.h   Sun Jun 24 02:06:13 2001
> +++ 2.4.6pre5aa1-backout-page_launder/include/linux/swap.h      Sun Jun 24 21:37:12 2001
> @@ -205,16 +205,6 @@
>         page->zone->inactive_dirty_pages++; \
>  }
>
> -/* Like the above, but add us after the bookmark. */
> -#define add_page_to_inactive_dirty_list_marker(page) { \
> -       DEBUG_ADD_PAGE \
> -       ZERO_PAGE_BUG \
> -       SetPageInactiveDirty(page); \
> -       list_add(&(page)->lru, marker_lru); \
> -       nr_inactive_dirty_pages++; \
> -       page->zone->inactive_dirty_pages++; \
> -}
> -
>  #define add_page_to_inactive_clean_list(page) { \
>         DEBUG_ADD_PAGE \
>         ZERO_PAGE_BUG \
> diff -urN 2.4.6pre5aa1/mm/vmscan.c 2.4.6pre5aa1-backout-page_launder/mm/vmscan.c
> --- 2.4.6pre5aa1/mm/vmscan.c    Sun Jun 24 01:41:09 2001
> +++ 2.4.6pre5aa1-backout-page_launder/mm/vmscan.c       Sun Jun 24 21:37:11 2001
> @@ -407,7 +407,7 @@
>  /**
>   * page_launder - clean dirty inactive pages, move to inactive_clean list
>   * @gfp_mask: what operations we are allowed to do
> - * @sync: are we allowed to do synchronous IO in emergencies ?
> + * @sync: should we wait synchronously for the cleaning of pages
>   *
>   * When this function is called, we are most likely low on free +
>   * inactive_clean pages. Since we want to refill those pages as
> @@ -426,61 +426,23 @@
>  #define MAX_LAUNDER            (4 * (1 << page_cluster))
>  #define CAN_DO_IO              (gfp_mask & __GFP_IO)
>  #define CAN_DO_BUFFERS         (gfp_mask & __GFP_BUFFER)
> -#define marker_lru             (&marker_page_struct.lru)
>  int page_launder(int gfp_mask, int sync)
>  {
> -       static int cannot_free_pages;
>         int launder_loop, maxscan, cleaned_pages, maxlaunder;
>         struct list_head * page_lru;
>         struct page * page;
>
> -       /* Our bookmark of where we are in the inactive_dirty list. */
> -       struct page marker_page_struct = { zone: NULL };
> -
>         launder_loop = 0;
>         maxlaunder = 0;
>         cleaned_pages = 0;
>
>  dirty_page_rescan:
>         spin_lock(&pagemap_lru_lock);
> -       /*
> -        * By not scanning all inactive dirty pages we'll write out
> -        * really old dirty pages before evicting newer clean pages.
> -        * This should cause some LRU behaviour if we have a large
> -        * amount of inactive pages (due to eg. drop behind).
> -        *
> -        * It also makes us accumulate dirty pages until we have enough
> -        * to be worth writing to disk without causing excessive disk
> -        * seeks and eliminates the infinite penalty clean pages incurred
> -        * vs. dirty pages.
> -        */
> -       maxscan = nr_inactive_dirty_pages / 4;
> -       if (launder_loop)
> -               maxscan *= 2;
> -       list_add_tail(marker_lru, &inactive_dirty_list);
> -       for (;;) {
> -               page_lru = marker_lru->prev;
> -               if (page_lru == &inactive_dirty_list)
> -                       break;
> -               if (--maxscan < 0)
> -                       break;
> -               if (!free_shortage())
> -                       break;
> -
> +       maxscan = nr_inactive_dirty_pages;
> +       while ((page_lru = inactive_dirty_list.prev) != &inactive_dirty_list &&
> +                               maxscan-- > 0) {
>                 page = list_entry(page_lru, struct page, lru);
>
> -               /* Move the bookmark backwards.. */
> -               list_del(marker_lru);
> -               list_add_tail(marker_lru, page_lru);
> -
> -               /* Don't waste CPU if chances are we cannot free anything. */
> -               if (launder_loop && maxlaunder < 0 && cannot_free_pages)
> -                       break;
> -
> -               /* Skip other people's marker pages. */
> -               if (!page->zone)
> -                       continue;
> -
>                 /* Wrong page on list?! (list corruption, should not happen) */
>                 if (!PageInactiveDirty(page)) {
>                         printk("VM: page_launder, wrong page on list.\n");
> @@ -492,6 +454,7 @@
>
>                 /* Page is or was in use?  Move it to the active list. */
>                 if (PageReferenced(page) || page->age > 0 ||
> +                               page->zone->free_pages > page->zone->pages_high ||
>                                 (!page->buffers && page_count(page) > 1) ||
>                                 page_ramdisk(page)) {
>                         del_page_from_inactive_dirty_list(page);
> @@ -501,9 +464,11 @@
>
>                 /*
>                  * The page is locked. IO in progress?
> -                * Skip the page, we'll take a look when it unlocks.
> +                * Move it to the back of the list.
>                  */
>                 if (TryLockPage(page)) {
> +                       list_del(page_lru);
> +                       list_add(page_lru, &inactive_dirty_list);
>                         continue;
>                 }
>
> @@ -517,8 +482,10 @@
>                         if (!writepage)
>                                 goto page_active;
>
> -                       /* First time through? Skip the page. */
> +                       /* First time through? Move it to the back of the list */
>                         if (!launder_loop || !CAN_DO_IO) {
> +                               list_del(page_lru);
> +                               list_add(page_lru, &inactive_dirty_list);
>                                 UnlockPage(page);
>                                 continue;
>                         }
> @@ -531,8 +498,6 @@
>                         writepage(page);
>                         page_cache_release(page);
>
> -                       maxlaunder--;
> -
>                         /* And re-start the thing.. */
>                         spin_lock(&pagemap_lru_lock);
>                         continue;
> @@ -560,9 +525,9 @@
>                         spin_unlock(&pagemap_lru_lock);
>
>                         /* Will we do (asynchronous) IO? */
> -                       if (launder_loop && maxlaunder-- == 0 && sync)
> +                       if (launder_loop && maxlaunder == 0 && sync)
>                                 wait = 2;       /* Synchrounous IO */
> -                       else if (launder_loop && maxlaunder > 0)
> +                       else if (launder_loop && maxlaunder-- > 0)
>                                 wait = 1;       /* Async IO */
>                         else
>                                 wait = 0;       /* No IO */
> @@ -579,7 +544,7 @@
>
>                         /* The buffers were not freed. */
>                         if (!clearedbuf) {
> -                               add_page_to_inactive_dirty_list_marker(page);
> +                               add_page_to_inactive_dirty_list(page);
>
>                         /* The page was only in the buffer cache. */
>                         } else if (!page->mapping) {
> @@ -635,8 +600,6 @@
>                         UnlockPage(page);
>                 }
>         }
> -       /* Remove our marker. */
> -       list_del(marker_lru);
>         spin_unlock(&pagemap_lru_lock);
>
>         /*
> @@ -652,29 +615,16 @@
>          */
>         if ((CAN_DO_IO || CAN_DO_BUFFERS) && !launder_loop && free_shortage()) {
>                 launder_loop = 1;
> -               /*
> -                * If we, or the previous process running page_launder(),
> -                * managed to free any pages we never do synchronous IO.
> -                */
> -               if (cleaned_pages || !cannot_free_pages)
> +               /* If we cleaned pages, never do synchronous IO. */
> +               if (cleaned_pages)
>                         sync = 0;
> -               /* Else, do synchronous IO (if we are allowed to). */
> -               else if (sync)
> -                       sync = 1;
>                 /* We only do a few "out of order" flushes. */
>                 maxlaunder = MAX_LAUNDER;
> -               /* Let bdflush take care of the rest. */
> +               /* Kflushd takes care of the rest. */
>                 wakeup_bdflush(0);
>                 goto dirty_page_rescan;
>         }
>
> -       /*
> -        * If we failed to free pages (because all pages are dirty)
> -        * we remember this for the next time. This will prevent us
> -        * from wasting too much CPU here.
> -        */
> -       cannot_free_pages = !cleaned_pages;
> -
>         /* Return the number of pages moved to the inactive_clean list. */
>         return cleaned_pages;
>  }
> @@ -899,7 +849,7 @@
>          * list, so this is a relatively cheap operation.
>          */
>         if (free_shortage()) {
> -               ret += page_launder(gfp_mask, 1);
> +               ret += page_launder(gfp_mask, user);
>                 shrink_dcache_memory(DEF_PRIORITY, gfp_mask);
>                 shrink_icache_memory(DEF_PRIORITY, gfp_mask);
>         }
>

Thank you, Chris.

I am currently compiling. For now, it does not look like a reiserfs-specific bug. Because I do
not know how to trigger the bug, I do not know whether or how it will happen again. The deadlock
described occurred within 1 day of uptime of my linux-2.4.5-pre5 kernel, so if I do not report it again
within a week or so, the fix above might be the right one.

Xuân.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VM deadlock
  2001-06-27 15:09 ` Chris Mason
  2001-06-27 16:20   ` Xuan Baldauf
@ 2001-06-27 17:43   ` Marcelo Tosatti
  2001-06-27 19:36     ` Chris Mason
  2001-06-27 18:16   ` Rik van Riel
  2 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2001-06-27 17:43 UTC (permalink / raw)
  To: Chris Mason; +Cc: Xuan Baldauf, linux-kernel, andrea, reiserfs-list@namesys.com



On Wed, 27 Jun 2001, Chris Mason wrote:

> 
> 
> On Wednesday, June 27, 2001 04:27:45 PM +0200 Xuan Baldauf <xuan--lkml@baldauf.org> wrote:
> 
> > My linux box suddenly was not available using ssh|telnet,
> > but it responded to pings. On console login, I could type
> > "root", but after pressing "return", there was no reaction,
> > and pressing keys did not result in writing them on the
> > screen.
> 
> > Warning (Oops_read): Code line not seen, dumping what data
> > is available
> > 
> >>> EIP; c012839c <deactivate_page+e94/2618>   <=====
> > Trace; c0128ef5 <deactivate_page+19ed/2618>
> > Trace; c012905e <deactivate_page+1b56/2618>
> > Trace; c0129d05 <__alloc_pages+1cd/280>
> > Trace; c0129b36 <_alloc_pages+16/18>
> > Trace; c012a425 <free_pages+611/1cac>
> > Trace; c0120198 <vmtruncate+1c4/878>
> > Trace; c01201f5 <vmtruncate+221/878>
> > Trace; c0120550 <vmtruncate+57c/878>
> 
> > I had a probably similar|connected problem (but with no
> > "ping" responding) with linux-2.4.5-pre3, described here:
> > http://lists.omnipotent.net/reiserfs/200106/msg00214.html
> > 
> > Linux router 2.4.6-pre5 #3 Tue Jun 26 23:36:26 CEST 2001
> 
> Sounds like a deadlock andrea recently found.

Chris,

Looking at http://lists.omnipotent.net/reiserfs/200106/msg00214.html:

>>EIP; c0128228 <page_launder+b8/90c>   <=====
Trace; c01303df <refill_freelist+1f/54>
Trace; c01307e2 <getblk+f2/108>
Trace; c5141308 <END_OF_CODE+4e978b8/????>
Trace; c0176c4b <do_journal_end+63f/ac0>
Trace; c5160848 <END_OF_CODE+4eb6df8/????>
Trace; c01759e6 <journal_end_sync+16/1c>
Trace; c015e23a <reiserfs_write_inode+56/64>
Trace; c0141055 <try_to_sync_unused_inodes+101/1a8>
Trace; c01416dd <prune_icache+105/114>
Trace; c014170d <shrink_icache_memory+21/30>
Trace; c0128d67 <do_try_to_free_pages+2b/58>
Trace; c0128deb <kswapd+57/e4>
Trace; c0105434 <kernel_thread+28/38>



refill_freelist() calls page_launder(GFP_BUFFER). Now GFP_BUFFER _will_
block writing out buffers with try_to_free_buffers().

Maybe that's the reason for the deadlock we're seeing here in this specific
trace?



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VM deadlock
  2001-06-27 15:09 ` Chris Mason
  2001-06-27 16:20   ` Xuan Baldauf
  2001-06-27 17:43   ` Marcelo Tosatti
@ 2001-06-27 18:16   ` Rik van Riel
  2001-06-27 18:38     ` Chris Mason
  2 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2001-06-27 18:16 UTC (permalink / raw)
  To: Chris Mason; +Cc: Xuan Baldauf, linux-kernel, andrea, reiserfs-list@namesys.com

On Wed, 27 Jun 2001, Chris Mason wrote:
> On Wednesday, June 27, 2001 04:27:45 PM +0200 Xuan Baldauf <xuan--lkml@baldauf.org> wrote:
>
> > My linux box suddenly was not available using ssh|telnet,
> > but it responded to pings. On console login, I could type
> > "root", but after pressing "return", there was no reaction,
>
> Sounds like a deadlock andrea recently found.

It would be nice if Andrea would TELL US every
once in a while what he found ;)


Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VM deadlock
  2001-06-27 18:16   ` Rik van Riel
@ 2001-06-27 18:38     ` Chris Mason
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Mason @ 2001-06-27 18:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Xuan Baldauf, linux-kernel, andrea, reiserfs-list@namesys.com

On Wednesday, June 27, 2001 03:16:09 PM -0300 Rik van Riel <riel@conectiva.com.br> wrote:

> On Wed, 27 Jun 2001, Chris Mason wrote:
>> On Wednesday, June 27, 2001 04:27:45 PM +0200 Xuan Baldauf <xuan--lkml@baldauf.org> wrote:
>> 
>> > My linux box suddenly was not available using ssh|telnet,
>> > but it responded to pings. On console login, I could type
>> > "root", but after pressing "return", there was no reaction,
>> 
>> Sounds like a deadlock andrea recently found.
> 
> It would be nice if Andrea would TELL US every
> once in a while what he found ;)

Well, I got an auto-reply from andrea saying he wasn't reading email until
July 5th (yeah, I've gotten other mails since then, we all know how
that goes ;-) 

The original email I had regarding the patch said he thought some
of the page lists were getting corrupted, leading to someone trying to free
a page that didn't exist anymore.  This was a recent discovery, I don't
think the patch is even in an aa kernel yet ;-)

Since Xuan's stack trace had things waiting in deactivate_page, it sounded
similar to the problem andrea described.  We had a few test boxes
hanging under load, they are testing the patch now, plus Xuan, plus
one other l-k user.  If their problems go away, we'll have to dig to
find the exact corruption.

-chris

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VM deadlock
  2001-06-27 17:43   ` Marcelo Tosatti
@ 2001-06-27 19:36     ` Chris Mason
  2001-06-27 19:43       ` Rik van Riel
  2001-06-27 19:50       ` [reiserfs-list] " Xuan Baldauf
  0 siblings, 2 replies; 18+ messages in thread
From: Chris Mason @ 2001-06-27 19:36 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Xuan Baldauf, linux-kernel, andrea, reiserfs-list@namesys.com

On Wednesday, June 27, 2001 02:43:57 PM -0300 Marcelo Tosatti
<marcelo@conectiva.com.br> wrote:
> 
> Looking at http://lists.omnipotent.net/reiserfs/200106/msg00214.html:

Also from Xuan ;-)

> 
>>> EIP; c0128228 <page_launder+b8/90c>   <=====
> Trace; c01303df <refill_freelist+1f/54>
> Trace; c01307e2 <getblk+f2/108>
> Trace; c5141308 <END_OF_CODE+4e978b8/????>
> Trace; c0176c4b <do_journal_end+63f/ac0>
> Trace; c5160848 <END_OF_CODE+4eb6df8/????>
> Trace; c01759e6 <journal_end_sync+16/1c>
> Trace; c015e23a <reiserfs_write_inode+56/64>
> Trace; c0141055 <try_to_sync_unused_inodes+101/1a8>
> Trace; c01416dd <prune_icache+105/114>
> Trace; c014170d <shrink_icache_memory+21/30>
> Trace; c0128d67 <do_try_to_free_pages+2b/58>
> Trace; c0128deb <kswapd+57/e4>
> Trace; c0105434 <kernel_thread+28/38>
> 
> 
> 
> refill_freelist() calls page_launder(GFP_BUFFER). Now GFP_BUFFER _will_
> block writing out buffers with try_to_free_buffers().

Grrr, how did I miss this before?  I thought Xuan's hang went away after
pre3, so I didn't look into this trace hard enough.  

Reiserfs expects write_inode() calls initiated by kswapd to always have
sync==0.  Otherwise, kswapd ends up waiting on the log, which isn't what we
want at all.

The dirty inode callback ensures there are no dirty inodes that haven't
been logged.  I took the sync parameter to mean it is initiated by fsync or
O_SYNC, so I trigger a full commit when sync == 1.

So, my choices are to ignore sync == 1 write_inode calls when kswapd is
doing it, or make a private inode dirty list.

> 
> Maybe that's the reason for the deadlock we're seeing here in this specific
> trace?
> 

The trace above is caused by the dirty inode problem, and I think the more
recent trace is something different.

-chris

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VM deadlock
  2001-06-27 19:36     ` Chris Mason
@ 2001-06-27 19:43       ` Rik van Riel
  2001-06-27 20:24         ` Chris Mason
  2001-06-27 19:50       ` [reiserfs-list] " Xuan Baldauf
  1 sibling, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2001-06-27 19:43 UTC (permalink / raw)
  To: Chris Mason
  Cc: Marcelo Tosatti, Xuan Baldauf, linux-kernel, andrea,
	reiserfs-list@namesys.com

On Wed, 27 Jun 2001, Chris Mason wrote:

> Reiserfs expects write_inode() calls initiated by kswapd to
> always have sync==0.  Otherwise, kswapd ends up waiting on the
> log, which isn't what we want at all.

If you don't have free memory, you are limited to 2 choices:

1) wait on IO
2) spin endlessly, wasting CPU until the IO is done

If (1) isn't possible in reiserfs, I'd say something in
reiserfs needs to be fixed, otherwise you will always
have problems when the system has lots of dirty mappings
that need to be written out.

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [reiserfs-list] Re: VM deadlock
  2001-06-27 19:36     ` Chris Mason
  2001-06-27 19:43       ` Rik van Riel
@ 2001-06-27 19:50       ` Xuan Baldauf
  1 sibling, 0 replies; 18+ messages in thread
From: Xuan Baldauf @ 2001-06-27 19:50 UTC (permalink / raw)
  To: Chris Mason
  Cc: Marcelo Tosatti, linux-kernel, andrea, reiserfs-list@namesys.com



Chris Mason wrote:

> On Wednesday, June 27, 2001 02:43:57 PM -0300 Marcelo Tosatti
> <marcelo@conectiva.com.br> wrote:
> >
> > Looking at http://lists.omnipotent.net/reiserfs/200106/msg00214.html:
>
> Also from Xuan ;-)
>
> >
> >>> EIP; c0128228 <page_launder+b8/90c>   <=====
> > Trace; c01303df <refill_freelist+1f/54>
> > Trace; c01307e2 <getblk+f2/108>
> > Trace; c5141308 <END_OF_CODE+4e978b8/????>
> > Trace; c0176c4b <do_journal_end+63f/ac0>
> > Trace; c5160848 <END_OF_CODE+4eb6df8/????>
> > Trace; c01759e6 <journal_end_sync+16/1c>
> > Trace; c015e23a <reiserfs_write_inode+56/64>
> > Trace; c0141055 <try_to_sync_unused_inodes+101/1a8>
> > Trace; c01416dd <prune_icache+105/114>
> > Trace; c014170d <shrink_icache_memory+21/30>
> > Trace; c0128d67 <do_try_to_free_pages+2b/58>
> > Trace; c0128deb <kswapd+57/e4>
> > Trace; c0105434 <kernel_thread+28/38>
> >
> >
> >
> > refill_freelist() calls page_launder(GFP_BUFFER). Now GFP_BUFFER _will_
> > block writing out buffers with try_to_free_buffers().
>
> Grrr, how did I miss this before?  I thought Xuan's hang went away after
> pre3, so I didn't look into this trace hard enough.

Actually, it went away :-), but only because I switched back from
linux-2.4.6-pre3 to linux-2.4.5-pre5 or so due to a symbol problem ("do_softirq"
or the like) which made some of this kernel's modules not loadable. So the bug
which caused my first report is not fixed.

>
>
> Reiserfs expects write_inode() calls initiated by kswapd to always have
> sync==0.  Otherwise, kswapd ends up waiting on the log, which isn't what we
> want at all.
>
> The dirty inode callback ensures there are no dirty inodes that haven't
> been logged.  I took the sync parameter to mean it is initiated by fsync or
> O_SYNC, so I trigger a full commit when sync == 1.
>
> So, my choices are to ignore sync == 1 write_inode calls when kswapd is
> doing it, or make a private inode dirty list.
>
> >
> > Maybe that's the reason for the deadlock we're seeing here in this specific
> > trace?
> >
>
> The trace above is caused by the dirty inode problem, and I think the more
> recent trace is something different.
>
> -chris

I also think that my new lockup is a different problem, because the stack traces are
different. The only things in common are that the kernel version numbers are close to
each other and that I had to sit, not virtually but physically, in front of the monitor
connected to that box and write undecoded stack traces onto paper...

Xuân.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VM deadlock
  2001-06-27 19:43       ` Rik van Riel
@ 2001-06-27 20:24         ` Chris Mason
  2001-06-27 20:36           ` Rik van Riel
  2001-06-28  3:21           ` Andrew Morton
  0 siblings, 2 replies; 18+ messages in thread
From: Chris Mason @ 2001-06-27 20:24 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Marcelo Tosatti, Xuan Baldauf, linux-kernel, andrea,
	reiserfs-list@namesys.com

On Wednesday, June 27, 2001 04:43:28 PM -0300 Rik van Riel
<riel@conectiva.com.br> wrote:

> On Wed, 27 Jun 2001, Chris Mason wrote:
> 
>> Reiserfs expects write_inode() calls initiated by kswapd to
>> always have sync==0.  Otherwise, kswapd ends up waiting on the
>> log, which isn't what we want at all.
> 
> If you don't have free memory, you are limited to 2 choices:
> 
> 1) wait on IO
> 2) spin endlessly, wasting CPU until the IO is done
> 
> If (1) isn't possible in reiserfs, I'd say something in
> reiserfs needs to be fixed, otherwise you will always
> have problems when the system has lots of dirty mappings
> that need to be written out.
> 

Ok, I need to describe the problem a little better.  reiserfs inodes need
to be logged, which means you have to join/start a transaction in order to
write them.

So, if kswapd tries to write them, it might end up waiting on the log.
Normally this is not a big deal, but almost all allocations in reiserfs use
GFP_BUFFER, which means we never end up doing i/o ourselves in
page_launder, and always end up waiting on kswapd.  So, kswapd waits on
reiserfs and reiserfs waits on kswapd (none of these are spin locks ;-)
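
The circular wait described above can be modelled outside the kernel. What follows is a minimal userspace sketch, not kernel code: the task ids, the waits_on array and in_wait_cycle() are all hypothetical names, chosen only to show why neither side can make progress once each is blocked on the other.

```c
#include <assert.h>

/* Hypothetical task ids for this model; these are not kernel symbols. */
enum { KSWAPD = 0, FS_WRITER = 1, NR_TASKS = 2 };
#define NOBODY (-1)

/*
 * waits_on[t] names the task that t is blocked on, or NOBODY.
 * Returns 1 if following the chain from start leads back to start.
 */
static int in_wait_cycle(const int waits_on[NR_TASKS], int start)
{
	int cur = start;
	int steps;

	assert(start >= 0 && start < NR_TASKS);

	/* A chain that returns to start does so within NR_TASKS hops. */
	for (steps = 0; steps < NR_TASKS; steps++) {
		cur = waits_on[cur];
		if (cur == NOBODY)
			return 0;
		if (cur == start)
			return 1;
	}
	return 0;
}
```

With waits_on = { FS_WRITER, KSWAPD } both tasks sit in a cycle. If kswapd stops waiting on the log (waits_on[KSWAPD] = NOBODY), the cycle is gone even though the filesystem writer still waits on kswapd.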

The work around I've been using is the dirty_inode method.  Whenever
mark_inode_dirty is called, reiserfs logs the dirty inode.  This means
inode changes are _always_ reflected in the buffer cache right away, and
the inode itself is never actually dirty.

So, the only time reiserfs_write_inode needs to do something is for fsync
and/or O_SYNC writes, and all it needs to do is commit the transaction.  

Any time kswapd is calling write_inode, it is just trying to free the inode
struct, and reiserfs can safely ignore the write request, regardless of if
a sync is requested.
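
As a rough illustration of the workaround, here is a userspace sketch under stated assumptions. The model_* names, the size/logged_size fields and the from_kswapd parameter are all invented for this example and do not exist in reiserfs. It only shows the idea that the log copy is refreshed on every dirtying, so a write request coming from kswapd has nothing left to do.

```c
#include <assert.h>

/* Hypothetical inode model: one field plus the copy held in the log. */
struct model_inode {
	long size;		/* in-core field that callers modify */
	long logged_size;	/* copy captured in the log */
};

/* Stand-in for the dirty_inode callback: log the inode right away. */
static void model_dirty_inode(struct model_inode *inode)
{
	inode->logged_size = inode->size;
}

/* Stand-in for a change followed by mark_inode_dirty(). */
static void model_set_size(struct model_inode *inode, long size)
{
	inode->size = size;
	model_dirty_inode(inode);
}

/*
 * Stand-in for write_inode. Returns 1 if a commit would be triggered,
 * 0 if the request is ignored.
 */
static int model_write_inode(const struct model_inode *inode,
			     int sync, int from_kswapd)
{
	/* The log already reflects the in-core inode. */
	assert(inode->logged_size == inode->size);

	if (from_kswapd)
		return 0;	/* only freeing the struct: nothing to write */
	return sync ? 1 : 0;	/* fsync or O_SYNC: commit the transaction */
}
```

In this model a sync request from kswapd returns 0, while the same request from an fsync-style caller returns 1.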

-chris

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VM deadlock
  2001-06-27 20:24         ` Chris Mason
@ 2001-06-27 20:36           ` Rik van Riel
  2001-06-27 20:52             ` Chris Mason
  2001-06-28  3:21           ` Andrew Morton
  1 sibling, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2001-06-27 20:36 UTC (permalink / raw)
  To: Chris Mason
  Cc: Marcelo Tosatti, Xuan Baldauf, linux-kernel, andrea,
	reiserfs-list@namesys.com

On Wed, 27 Jun 2001, Chris Mason wrote:
> On Wednesday, June 27, 2001 04:43:28 PM -0300 Rik van Riel

> > If you don't have free memory, you are limited to 2 choices:
> >
> > 1) wait on IO
> > 2) spin endlessly, wasting CPU until the IO is done
>
> Ok, I need to describe the problem a little better.  reiserfs
> inodes need to be logged, which means you have to join/start a
> transaction in order to write them.

> So, the only time reiserfs_write_inode needs to do something is for fsync
> and/or O_SYNC writes, and all it needs to do is commit the transaction.
>
> Any time kswapd is calling write_inode, it is just trying to
> free the inode struct, and reiserfs can safely ignore the write
> request, regardless of if a sync is requested.

OK, sounds sane enough to me ;)

So the fix is just to let reiserfs_write_inode always be
asynchronous, independent of its arguments, as long as
we're not in fsync() or O_SYNC.

OTOH, if we are called synchronously, we could also just
walk down the code path taken when we _are_ called by
fsync(), right ?

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VM deadlock
  2001-06-27 20:36           ` Rik van Riel
@ 2001-06-27 20:52             ` Chris Mason
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Mason @ 2001-06-27 20:52 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Marcelo Tosatti, Xuan Baldauf, linux-kernel, andrea,
	reiserfs-list@namesys.com



On Wednesday, June 27, 2001 05:36:45 PM -0300 Rik van Riel
<riel@conectiva.com.br> wrote:

> On Wed, 27 Jun 2001, Chris Mason wrote:
>> On Wednesday, June 27, 2001 04:43:28 PM -0300 Rik van Riel
> 
>> > If you don't have free memory, you are limited to 2 choices:
>> > 
>> > 1) wait on IO
>> > 2) spin endlessly, wasting CPU until the IO is done
>> 
>> Ok, I need to describe the problem a little better.  reiserfs
>> inodes need to be logged, which means you have to join/start a
>> transaction in order to write them.
> 
>> So, the only time reiserfs_write_inode needs to do something is for fsync
>> and/or O_SYNC writes, and all it needs to do is commit the transaction.
>> 
>> Any time kswapd is calling write_inode, it is just trying to
>> free the inode struct, and reiserfs can safely ignore the write
>> request, regardless of if a sync is requested.
> 
> OK, sounds sane enough to me ;)

Well, I guess that's one word for it...I'll bet $5 Al's got a few others
;-) A better fix is to have private inode dirty lists....

> 
> So the fix is just to let reiserfs_write_inode always be
> asynchronous, independent of its arguments, as long as
> we're not in fsync() or O_SYNC.

I think so, but there needs to be some testing there.  Note that I managed
to run a heavy stress test (put my machine far, far into swap) for 3 solid
days without hitting this.  When I initially made the dirty_inode kludge, I
could trigger it in ~10 minutes.  

> 
> OTOH, if we are called synchronously, we could also just
> walk down the code path taken when we _are_ called by
> fsync(), right ?

sorry, not sure what you mean.  In fsync we do a commit, which might wait
on the current transaction, so kswapd can't go down that code path.

-chris



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VM deadlock
  2001-06-27 20:24         ` Chris Mason
  2001-06-27 20:36           ` Rik van Riel
@ 2001-06-28  3:21           ` Andrew Morton
  2001-06-28 12:53             ` Chris Mason
  1 sibling, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2001-06-28  3:21 UTC (permalink / raw)
  To: Chris Mason
  Cc: Rik van Riel, Marcelo Tosatti, Xuan Baldauf, linux-kernel, andrea,
	reiserfs-list@namesys.com

Chris Mason wrote:
> 
> ...
> The work around I've been using is the dirty_inode method.  Whenever
> mark_inode_dirty is called, reiserfs logs the dirty inode.  This means
> inode changes are _always_ reflected in the buffer cache right away, and
> the inode itself is never actually dirty.

reiserfs_mark_inode_dirty() has taken a copy of the in-core inode, so
it can do this:

            spin_lock(&inode_lock);
            if ((inode->i_state & I_LOCK) == 0)
                    inode->i_state &= ~(I_DIRTY_SYNC|I_DIRTY_DATASYNC);
            spin_unlock(&inode_lock);

Unfortunately there is no API function to do this, so inode_lock
needs to be exported :(

The effect of this is that the filesystem almost never has dirty inodes
as far as the VFS is concerned: shrink_icache_memory() can just drop the
inodes without calling into the fs at all.  Which is nice.

So you end up with:

void reiserfs_write_inode(struct inode * inode, int do_sync)
{
        /* Nothing to do: the inode was already logged when it was marked dirty. */
}

The write_inode() method is still called by shrink_icache_memory(),
but extremely infrequently.  I haven't looked into the reasons why.  It may
be an SMP window.

This is not just a memory-tweak-optimisation hack, BTW.
shrink_icache->write_inode is a horrible embarrassment because it
can be called at any time.   The caller may have a transaction open
against a different fs.  It would cause nested transactions which
will exceed the current reservation, which will require a commit,
which simply cannot be performed, etc.

sync(), fsync() and msync() can be handled in ->fsync().

-

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VM deadlock
  2001-06-28  3:21           ` Andrew Morton
@ 2001-06-28 12:53             ` Chris Mason
  2001-06-28 14:08               ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Chris Mason @ 2001-06-28 12:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Marcelo Tosatti, Xuan Baldauf, linux-kernel, andrea,
	reiserfs-list@namesys.com

On Thursday, June 28, 2001 01:21:28 PM +1000 Andrew Morton
<andrewm@uow.edu.au> wrote:

> Chris Mason wrote:
>> 
>> ...
>> The work around I've been using is the dirty_inode method.  Whenever
>> mark_inode_dirty is called, reiserfs logs the dirty inode.  This means
>> inode changes are _always_ reflected in the buffer cache right away, and
>> the inode itself is never actually dirty.
> 
> reiserfs_mark_inode_dirty() has taken a copy of the in-core inode, so
> it can do this:
> 
>             spin_lock(&inode_lock);
>             if ((inode->i_state & I_LOCK) == 0)
>                     inode->i_state &= ~(I_DIRTY_SYNC|I_DIRTY_DATASYNC);
>             spin_unlock(&inode_lock);
> 
> Unfortunately there is no API function to do this, so inode_lock
> needs to be exported :(

Well, this is kind of my own fault.  I didn't want the dirty_inode call
back to be able to screw with the internals of how inode.c dealt with
things, I wanted it purely to allow actions in addition to what inode.c
wanted to do.

So, mark_inode_dirty calls dirty_inode, and then it sets whatever dirty
bits it wants to.  Clearing them in your own dirty_inode call won't matter,
they should just get set again later.
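
The ordering can be shown with a small userspace sketch. This is a model of the behaviour described here, not the code in inode.c: the M_DIRTY_* values, struct model_inode and both functions are hypothetical.

```c
#include <assert.h>

/* Hypothetical dirty bits for this model. */
#define M_DIRTY_SYNC		0x1u
#define M_DIRTY_DATASYNC	0x2u

struct model_inode {
	unsigned int i_state;
};

typedef void (*dirty_cb)(struct model_inode *);

/* A callback that tries to leave the inode clean. */
static void clearing_callback(struct model_inode *inode)
{
	inode->i_state &= ~(M_DIRTY_SYNC | M_DIRTY_DATASYNC);
}

/* Model of the described order: callback first, dirty bits second. */
static void model_mark_inode_dirty(struct model_inode *inode,
				   unsigned int flags, dirty_cb cb)
{
	assert((flags & ~(M_DIRTY_SYNC | M_DIRTY_DATASYNC)) == 0);

	if (cb)
		cb(inode);
	inode->i_state |= flags;	/* undoes whatever the callback cleared */
}
```

Calling model_mark_inode_dirty(&ino, M_DIRTY_SYNC, clearing_callback) still leaves M_DIRTY_SYNC set, because the bits are applied after the callback returns.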

If we really want to leave the inode clean,  fsync isn't as much of a
concern as O_SYNC writes, since you want generic_osync_inode to properly
flush the updated inode.  But, that can be dealt with by having your
commit_write func test for O_SYNC.

What we can't get around is our friend knfsd, who uses write_inode_now.
The I_DIRTY bit needs to be accurate there (although it doesn't seem
perfect right now anyway).

The real problem I see is that we've overloaded the sync flag to write_inode.
It means both "flush now to get the data safe" and "flush now to free ram".
Normally this kind of overloading is ok, but once logging comes into play I
believe a distinction is needed.

So, my current plan to fix reiserfs_write_inode is to do nothing when
PF_MEMALLOC is set in current->flags.  I'm not wild about it, but don't see
many other fixes that don't involve api changes.
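
One detail matters when this check is turned into C: an expression spelled flags & PF_MEMALLOC == 1 parses as flags & (PF_MEMALLOC == 1), because == binds tighter than &. The sketch below is a userspace illustration only. MODEL_PF_MEMALLOC uses an assumed value, and the three functions are hypothetical stand-ins, not the reiserfs implementation.

```c
#include <assert.h>

/* Assumed flag value, for illustration only. */
#define MODEL_PF_MEMALLOC 0x00000800u

/* How C parses "flags & MODEL_PF_MEMALLOC == 1": the == runs first. */
static int memalloc_literal(unsigned int flags)
{
	return (flags & (MODEL_PF_MEMALLOC == 1)) != 0;
}

/* The intended test: is the bit set at all? */
static int memalloc_intended(unsigned int flags)
{
	return (flags & MODEL_PF_MEMALLOC) != 0;
}

/* Stand-in for the planned write_inode: returns 1 if it would commit. */
static int model_write_inode(unsigned int task_flags, int sync)
{
	assert(sync == 0 || sync == 1);

	if (memalloc_intended(task_flags))
		return 0;	/* called from memory reclaim: do nothing */
	return sync;
}
```

With the bit set, memalloc_literal() still reports 0 while memalloc_intended() reports 1, so only the parenthesised form skips the write when the caller is reclaiming memory.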

I'd rather not do a private inode list until there is a clean way to apply
memory pressure to it, since reiserfs pins enough memory as it is.

-chris

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VM deadlock
  2001-06-28 12:53             ` Chris Mason
@ 2001-06-28 14:08               ` Andrew Morton
  2001-06-28 14:25                 ` Chris Mason
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2001-06-28 14:08 UTC (permalink / raw)
  To: Chris Mason
  Cc: Rik van Riel, Marcelo Tosatti, Xuan Baldauf, linux-kernel, andrea,
	reiserfs-list@namesys.com

Chris Mason wrote:
> 
> On Thursday, June 28, 2001 01:21:28 PM +1000 Andrew Morton
> <andrewm@uow.edu.au> wrote:
> ...
> > reiserfs_mark_inode_dirty() has taken a copy of the in-core inode, so
> > it can do this:
> >
> >             spin_lock(&inode_lock);
> >             if ((inode->i_state & I_LOCK) == 0)
> >                     inode->i_state &= ~(I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> >             spin_unlock(&inode_lock);
> >
> > Unfortunately there is no API function to do this, so inode_lock
> > needs to be exported :(
> 
> Well, this is kind of my own fault.  I didn't want the dirty_inode call
> back to be able to screw with the internals of how inode.c dealt with
> things, I wanted it purely to allow actions in addition to what inode.c
> wanted to do.
> 
> So, mark_inode_dirty calls dirty_inode, and then it sets whatever dirty
> bits it wants to.  Clearing them in your own dirty_inode call won't matter,
> they should just get set again later.

yes, the above code is a bit of a waste of space :)

The reason ->write_inode() can be a no-op is that __sync_one()
marks the inode clean, then calls ->write_inode().  We *know*
that we took a copy of the inode in mark_inode_dirty(), so
we don't need to do anything.
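
A userspace sketch of that argument, under stated assumptions: the model_* functions, M_DIRTY and the size/logged_size fields are invented for illustration and only mimic the order of operations described here. It also shows the case that breaks the scheme, where the inode is changed after it was marked dirty.

```c
#include <assert.h>

#define M_DIRTY 0x1u	/* hypothetical dirty bit */

struct model_inode {
	unsigned int i_state;
	long size;		/* in-core field */
	long logged_size;	/* copy taken when the inode was marked dirty */
};

/* The filesystem takes its copy here, then the dirty bit is set. */
static void model_mark_inode_dirty(struct model_inode *inode)
{
	inode->logged_size = inode->size;
	inode->i_state |= M_DIRTY;
}

/* No-op: the copy was already taken in model_mark_inode_dirty(). */
static void model_write_inode(struct model_inode *inode, int sync)
{
	(void)inode;
	(void)sync;
}

/* Model of the described order: mark clean first, then call write_inode. */
static void model_sync_one(struct model_inode *inode, int sync)
{
	unsigned int dirty = inode->i_state & M_DIRTY;

	inode->i_state &= ~M_DIRTY;
	assert((inode->i_state & M_DIRTY) == 0);
	if (dirty)
		model_write_inode(inode, sync);
}

/* Returns 1 if the copy taken at mark time still matches the inode. */
static int model_log_is_current(const struct model_inode *inode)
{
	return inode->logged_size == inode->size;
}
```

If size changes after model_mark_inode_dirty() and before model_sync_one(), the inode ends up clean with a stale copy, which is the risk spelled out in the following paragraphs.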

Of course this absolutely requires all inode dirtiers to
call mark_inode_dirty() after doing the dirty, which is a risk.
But we face that risk with the PF_MEMALLOC case anyway.  No
problems have appeared in testing.

mark_inode_dirty() is the only way in which those bits can get
set. So the risk we face is that someone calls mark_inode_dirty(),
then alters the inode, then there is a call to write_inode().
That would be a bug, IMO.

As for knfsd, well, someone must have called mark_inode_dirty()
at some time, else they'd never get written.

It's all rather dodgy.

-

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: VM deadlock
  2001-06-28 14:08               ` Andrew Morton
@ 2001-06-28 14:25                 ` Chris Mason
  0 siblings, 0 replies; 18+ messages in thread
From: Chris Mason @ 2001-06-28 14:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Marcelo Tosatti, Xuan Baldauf, linux-kernel, andrea,
	reiserfs-list@namesys.com

On Friday, June 29, 2001 12:08:37 AM +1000 Andrew Morton
<andrewm@uow.edu.au> wrote:

> The reason ->write_inode() can be a no-op is that __sync_one()
> marks the inode clean, then calls ->write_inode().  We *know*
> that we took a copy of the inode in mark_inode_dirty(), so
> we don't need to do anything.

Yes, the only exception is that write_inode needs to honor the sync flag,
at least when not called under PF_MEMALLOC.  The biggest reason I can find
so far is knfsd, who calls write_inode_now and expects the inode to be
securely on disk.  It doesn't call mark_inode_dirty directly, it calls some
FS func (link, create, whatever) and then uses write_inode_now to commit.

> 
> Of course this absolutely requires all inode dirtiers to
> call mark_inode_dirty() after doing the dirty, which is a risk.
> But we face that risk with the PF_MEMALLOC case anyway.  No
> problems have appeared in testing.

I haven't seen any problems caused by it yet, but that might be because
reiserfs does all the important inode writes on its own.  I believe
generic_commit_write is the only place outside the FS that calls
mark_inode_dirty with something other than an atime update.

-chris

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2001-06-28 14:26 UTC | newest]

Thread overview: 18+ messages
2001-06-27 14:27 VM deadlock Xuan Baldauf
2001-06-27 13:11 ` Marcelo Tosatti
2001-06-27 16:13   ` Xuan Baldauf
2001-06-27 15:09 ` Chris Mason
2001-06-27 16:20   ` Xuan Baldauf
2001-06-27 17:43   ` Marcelo Tosatti
2001-06-27 19:36     ` Chris Mason
2001-06-27 19:43       ` Rik van Riel
2001-06-27 20:24         ` Chris Mason
2001-06-27 20:36           ` Rik van Riel
2001-06-27 20:52             ` Chris Mason
2001-06-28  3:21           ` Andrew Morton
2001-06-28 12:53             ` Chris Mason
2001-06-28 14:08               ` Andrew Morton
2001-06-28 14:25                 ` Chris Mason
2001-06-27 19:50       ` [reiserfs-list] " Xuan Baldauf
2001-06-27 18:16   ` Rik van Riel
2001-06-27 18:38     ` Chris Mason
