possible ext4 related deadlock

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* possible ext4 related deadlock
@ 2010-02-12 12:49 Enrik Berkhan
  2010-02-18  1:55 ` Jan Kara
  2010-03-05 13:56 ` Enrik Berkhan
  0 siblings, 2 replies; 5+ messages in thread
From: Enrik Berkhan @ 2010-02-12 12:49 UTC (permalink / raw)
  To: linux-ext4

Hi,

currently we're experiencing some process hangs that seem to be 
ext4-related. (Kernel 2.6.28.10-Blackfin, i.e. with Analog Devices
patches including some memory management changes for NOMMU.)

The situation is as follows:

We have two threads writing to an ext4-filesystem. After several hours 
and accross about 20 systems there happens one hang where
(reconstructed from Alt-SysRq-W output):

1. pdflush waits in start_this_handle
2. kjournald2 waits in jdb2_journal_commit_transaction
3. thread 1 waits in start_this_handle
4. thread 2 waits in
   ext4_da_write_begin
     (start_this_handle succeeded)
     grab_cache_page_write_begin
       __alloc_pages_internal
         try_to_free_pages
           do_try_to_free_pages
             congestion_wait

Actually, thread 2 shouldn't be completely blocked, because 
congestion_wait has a timeout if I understand the code correctly. 
Unfortunately, I pressed Alt-SysRq-W only once when having a chance to 
reproduce the problem on a test system with console access.

When the system is in this state, some external event like telnet login 
or killing a monitoring process in an older telnet sessin by pressing 
Ctrl-C makes it continue to work normally. I suspect that this triggers 
some memory freeing which allows thread 2 in the example above to get 
some pages and continue running.

I had a look at all the recent ext4/jbd2 changes since about 2.6.28 but 
couldn't identify anything that would solve this problem. But maybe I 
just couldn't identify the right thing.

What I have noticed is that the order of start_this_handle and 
grab_cache_page_write_begin has changed between ext3 and ext4:

ext3_write_begin:
   ...
   page = grab_cache_page_write_begin(mapping, index, flags);
   if (!page)
     return -ENOMEM;
   *pagep = page;

   handle = ext3_journal_start(inode, needed_blocks);
   ...

ext4_{da_}_write_begin:
   ...
   handle = ext4_journal_start(inode, needed_blocks);
   if (IS_ERR(handle)) {
     ret = PTR_ERR(handle);
     goto out;
   }

   /* We cannot recurse into the filesystem as the transaction is already
    * started */
   flags |= AOP_FLAG_NOFS;

   page = grab_cache_page_write_begin(mapping, index, flags);
   ...

As I understand the change of the order requires the AOP_FLAG_NOFS in 
the ext4 code.

Might this be the reason for the deadlock? Would it be worth trying to 
change the order back or is there a very good reason for the change 
between ext3 and ext4?

Or am I looking in a completely wrong place?

Any help would be appreciated.

Enrik

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: possible ext4 related deadlock
  2010-02-12 12:49 possible ext4 related deadlock Enrik Berkhan
@ 2010-02-18  1:55 ` Jan Kara
  2010-03-05 13:56 ` Enrik Berkhan
  1 sibling, 0 replies; 5+ messages in thread
From: Jan Kara @ 2010-02-18  1:55 UTC (permalink / raw)
  To: Enrik Berkhan; +Cc: linux-ext4

  Hi,

> currently we're experiencing some process hangs that seem to be  
> ext4-related. (Kernel 2.6.28.10-Blackfin, i.e. with Analog Devices
> patches including some memory management changes for NOMMU.)
>
> The situation is as follows:
>
> We have two threads writing to an ext4-filesystem. After several hours  
> and accross about 20 systems there happens one hang where
> (reconstructed from Alt-SysRq-W output):
>
> 1. pdflush waits in start_this_handle
> 2. kjournald2 waits in jdb2_journal_commit_transaction
> 3. thread 1 waits in start_this_handle
> 4. thread 2 waits in
>   ext4_da_write_begin
>     (start_this_handle succeeded)
>     grab_cache_page_write_begin
>       __alloc_pages_internal
>         try_to_free_pages
>           do_try_to_free_pages
>             congestion_wait
>
> Actually, thread 2 shouldn't be completely blocked, because  
> congestion_wait has a timeout if I understand the code correctly.  
> Unfortunately, I pressed Alt-SysRq-W only once when having a chance to  
> reproduce the problem on a test system with console access.
  Yes, thread 2 should eventually proceed, finish (or fail) the write and thus
all other processes should continue. If it does not, it's really strange. I've
checked the code but don't see where we could possibly loop - at worst, we
should spend like 1.5s waiting, then we conclude there's no free page for us
and we bail out with ENOMEM. If this does not happen it would be good to
find out whether we get stuck somewhere else or what...

> When the system is in this state, some external event like telnet login  
> or killing a monitoring process in an older telnet sessin by pressing  
> Ctrl-C makes it continue to work normally. I suspect that this triggers  
> some memory freeing which allows thread 2 in the example above to get  
> some pages and continue running.
>
> I had a look at all the recent ext4/jbd2 changes since about 2.6.28 but  
> couldn't identify anything that would solve this problem. But maybe I  
> just couldn't identify the right thing.
>
> What I have noticed is that the order of start_this_handle and  
> grab_cache_page_write_begin has changed between ext3 and ext4:
>
>
> ext3_write_begin:
>   ...
>   page = grab_cache_page_write_begin(mapping, index, flags);
>   if (!page)
>     return -ENOMEM;
>   *pagep = page;
>
>   handle = ext3_journal_start(inode, needed_blocks);
>   ...
>
>
> ext4_{da_}_write_begin:
>   ...
>   handle = ext4_journal_start(inode, needed_blocks);
>   if (IS_ERR(handle)) {
>     ret = PTR_ERR(handle);
>     goto out;
>   }
>
>   /* We cannot recurse into the filesystem as the transaction is already
>    * started */
>   flags |= AOP_FLAG_NOFS;
>
>   page = grab_cache_page_write_begin(mapping, index, flags);
>   ...
>
>
> As I understand the change of the order requires the AOP_FLAG_NOFS in  
> the ext4 code.
  Yes.

> Might this be the reason for the deadlock? Would it be worth trying to  
> change the order back or is there a very good reason for the change  
> between ext3 and ext4?
  It isn't a good idea to change the ordering in ext4 back to the one
in ext3. The main reason is that also other places in ext4 code rely
on start_handle -> lock_page ordering and thus you'd create real deadlocks
if you've changed just ext4_{da_}write_begin.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: possible ext4 related deadlock
  2010-02-12 12:49 possible ext4 related deadlock Enrik Berkhan
  2010-02-18  1:55 ` Jan Kara
@ 2010-03-05 13:56 ` Enrik Berkhan
  2010-03-05 15:45   ` tytso
  1 sibling, 1 reply; 5+ messages in thread
From: Enrik Berkhan @ 2010-03-05 13:56 UTC (permalink / raw)
  To: linux-ext4

[-- Attachment #1: Type: text/plain, Size: 15017 bytes --]

Hi,

Enrik Berkhan wrote:
> currently we're experiencing some process hangs that seem to be 
> ext4-related. (Kernel 2.6.28.10-Blackfin, i.e. with Analog Devices
> patches including some memory management changes for NOMMU.)
> 
> The situation is as follows:
> 
> We have two threads writing to an ext4-filesystem. After several hours 
> and accross about 20 systems there happens one hang where
> (reconstructed from Alt-SysRq-W output):
> 
> 1. pdflush waits in start_this_handle
> 2. kjournald2 waits in jdb2_journal_commit_transaction
> 3. thread 1 waits in start_this_handle
> 4. thread 2 waits in
>   ext4_da_write_begin
>     (start_this_handle succeeded)
>     grab_cache_page_write_begin
>       __alloc_pages_internal
>         try_to_free_pages
>           do_try_to_free_pages
>             congestion_wait
> 
> Actually, thread 2 shouldn't be completely blocked, because 
> congestion_wait has a timeout if I understand the code correctly. 
> Unfortunately, I pressed Alt-SysRq-W only once when having a chance to 
> reproduce the problem on a test system with console access.

Meanwhile, I have found out that thread 2 actually isn't completely
blocked but loops in __alloc_pages_internal:

get_page_from_freelist() doesn't return a page;
try_to_free_pages() returns did_some_progress == 0;
later, do_retry == 1 and the loop restarts with goto rebalance;

I have implemented a workaround by limiting the "goto rebalance" retries 
to 3 iterations and triggering a page allocation failure otherwise.

I have changed ext4_{da_,}write_begin() to retry 
grab_cache_page_write_begin() 3 times, too, by calling 
ext4_journal_stop(); schedule_timeout(1) and goto retry;

Using this change the system recovers from the situation on first 
grab_cache_page_write_begin() retry. I have seen that 17 times during a 
test last night.

Interestingly, the show_mem() call triggered in __alloc_pages_internal() 
on page allocation failure shows plenty of memory free, see this example:

> Mar  5 02:26:16 bfp53 [21804.947985] manage_daq: page allocation failure. order:0, mode:0x120052
> Mar  5 02:26:16 bfp53 [21804.954607] Hardware Trace:
> Mar  5 02:26:16 bfp53 [21804.957377]    0 Target : <0x000049ac> { _dump_stack + 0x0 }
> Mar  5 02:26:16 bfp53 [21804.963139]      Source : <0x00037618> { ___alloc_pages_internal + 0x1cc } CALL pcrel
> Mar  5 02:26:16 bfp53 [21804.970996]    1 Target : <0x00037618> { ___alloc_pages_internal + 0x1cc }
> Mar  5 02:26:16 bfp53 [21804.977966]      Source : <0x0000fd2a> { _printk + 0x16 } RTS
> Mar  5 02:26:16 bfp53 [21804.983735]    2 Target : <0x0000fd26> { _printk + 0x12 }
> Mar  5 02:26:16 bfp53 [21804.989226]      Source : <0x0001058e> { _vprintk + 0x116 } RTS
> Mar  5 02:26:16 bfp53 [21804.995165]    3 Target : <0x00010586> { _vprintk + 0x10e }
> Mar  5 02:26:16 bfp53 [21805.005705]      Source : <0xffa00c76> /* kernel dynamic memory */
> Mar  5 02:26:16 bfp53 [21805.013596]    4 Target : <0xffa00c14> /* kernel dynamic memory */
> Mar  5 02:26:16 bfp53 [21805.022158]      Source : <0xffa00a5c> /* kernel dynamic memory */
> Mar  5 02:26:16 bfp53 [21805.030932]    5 Target : <0xffa00a5c> /* kernel dynamic memory */
> Mar  5 02:26:16 bfp53 [21805.040640]      Source : <0xffa00a32> /* kernel dynamic memory */
> Mar  5 02:26:16 bfp53 [21805.048552]    6 Target : <0xffa00a04> /* kernel dynamic memory */
> Mar  5 02:26:16 bfp53 [21805.057211]      Source : <0xffa00c10> /* kernel dynamic memory */
> Mar  5 02:26:16 bfp53 [21805.065086]    7 Target : <0xffa00c0e> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.072947]      Source : <0xffa00efe> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.081598]    8 Target : <0xffa00efa> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.090706]      Source : <0xffa003d2> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.098692]    9 Target : <0xffa003ca> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.104880]      Source : <0x00014024> { _irq_exit + 0x50 } RTS
> Mar  5 02:26:17 bfp53 [21805.110793]   10 Target : <0x0001401e> { _irq_exit + 0x4a }
> Mar  5 02:26:17 bfp53 [21805.116454]      Source : <0x0001402c> { _irq_exit + 0x58 } IF CC JUMP
> Mar  5 02:26:17 bfp53 [21805.123013]   11 Target : <0x0001402a> { _irq_exit + 0x56 }
> Mar  5 02:26:17 bfp53 [21805.128674]      Source : <0x00009d8c> { _idle_cpu + 0x1c } RTS
> Mar  5 02:26:17 bfp53 [21805.134643]   12 Target : <0x00009d70> { _idle_cpu + 0x0 }
> Mar  5 02:26:17 bfp53 [21805.140192]      Source : <0x00014026> { _irq_exit + 0x52 } CALL pcrel
> Mar  5 02:26:17 bfp53 [21805.146756]   13 Target : <0x00014026> { _irq_exit + 0x52 }
> Mar  5 02:26:17 bfp53 [21805.152471]      Source : <0x0001401c> { _irq_exit + 0x48 } IF !CC JUMP
> Mar  5 02:26:17 bfp53 [21805.159064]   14 Target : <0x00014006> { _irq_exit + 0x32 }
> Mar  5 02:26:17 bfp53 [21805.164783]      Source : <0x00014048> { _irq_exit + 0x74 } JUMP.S
> Mar  5 02:26:17 bfp53 [21805.170932]   15 Target : <0x00014048> { _irq_exit + 0x74 }
> Mar  5 02:26:17 bfp53 [21805.176592]      Source : <0x00013d6e> { ___do_softirq + 0xa2 } RTS
> Mar  5 02:26:17 bfp53 [21805.182864] Stack info:
> Mar  5 02:26:17 bfp53 [21805.191811]  SP: [0x02ab9b60] <0x02ab9b60> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.198495]  FP: (0x02ab9b64)
> Mar  5 02:26:17 bfp53 [21805.201403]  Memory from 0x02ab9b60 to 02aba000
> Mar  5 02:26:17 bfp53 [21805.205835] 02ab9b60:[02ab9bbc](02ab9bbc)<0003761c> 03bde680  00000000  00120052  03bde880  00000000
> Mar  5 02:26:17 bfp53 [21805.215086] 02ab9b80: 00120052  00000002  00000042  02ab8000  02ab8000  00000000  00120052  00000000
> Mar  5 02:26:17 bfp53 [21805.224375] 02ab9ba0: 02ab8000  00000010  00000002  00000000  00000000  00000000  00000000 (02ab9bf0)
> Mar  5 02:26:17 bfp53 [21805.233690] 02ab9bc0:<00033300><02f4e3dc> 04a85000 <02f4e344> 00000000  00000080  00004a85  ffffff7f
> Mar  5 02:26:17 bfp53 [21805.243044] 02ab9be0: 02ab9c04 <00a8f76e> 00000000  00000000 (02ab9c44)<00a8892c> 00000000  031423d8
> Mar  5 02:26:17 bfp53 [21805.252333] 02ab9c00: 00000000  00000004  031423d8  000005cc  04a85000  000005cc  00000000  000005cc
> Mar  5 02:26:17 bfp53 [21805.261648] 02ab9c20: 04a84a34  00000000  000005cc  000005cc  04a86000  00000000  00004a85  00000003
> Mar  5 02:26:17 bfp53 [21805.271002] 02ab9c40: 00000000 (02ab9ca4)<00032b08> 02ab8000  00001000 <02f4e3dc> 000005cc  00000000
> Mar  5 02:26:17 bfp53 [21805.280292] 02ab9c60: 04a85000  00001000  0306a260 <02f4e3dc> 04a85000  00000000  00001000  00000000
> Mar  5 02:26:17 bfp53 [21805.289608] 02ab9c80: 02ab9ca0  02ab9c9c  00000000  00a9fc74  000005cc  00000000  00000000  00000000
> Mar  5 02:26:17 bfp53 [21805.298962] 02ab9ca0: 00265100 (02ab9cf0)<00033a6c> 0306a260 <02f4e3dc><02f4e344> 02ab9ce0  00000000
> Mar  5 02:26:17 bfp53 [21805.308252] 02ab9cc0: 04a84a34  00000000  0306a260  02ab9ce0  04a84a34  00000000  00000000  00a9fc74
> Mar  5 02:26:17 bfp53 [21805.317566] 02ab9ce0: 02ab9e94  00000001  000005cc  00001c24 (02ab9d60)<00033e8a><02f4e344> 000021f0
> Mar  5 02:26:17 bfp53 [21805.326919] 02ab9d00: 0306a260  00000000  02ab9e94  00000000  04a84a34  02ab9e10  02ab9dfc <000fda3a>
> Mar  5 02:26:17 bfp53 [21805.336211] 02ab9d20: 04a84a34  00000000  02ab9e58  000021f0  00000000  00000000 <02ab9d70><02f4e3dc>
> Mar  5 02:26:17 bfp53 [21805.345529] 02ab9d40: 00000000  00000001  ffffffff  00000000  00000000  00000000  00000000  000021f0
> Mar  5 02:26:17 bfp53 [21805.354882] 02ab9d60:<02ab9da0><0003440a> 02ab9e10 <02f4e344> 0306a260  02f4e3b0  02ab9e94  00000000
> Mar  5 02:26:17 bfp53 [21805.364168] 02ab9d80: 04a84a34  02ab9e10  02ab9e94  00000001  02ab9e58 <0001410e> 02ab9dbc <02f4e3dc>
> Mar  5 02:26:17 bfp53 [21805.364168] 02ab9d80: 04a84a34  02ab9e10  02ab9e94  00000001  02ab9e58 <0001410e> 02ab9dbc <02f4e3dc>
> Mar  5 02:26:17 bfp53 [21805.373488] 02ab9da0: 02ab9dd8 <00a81c28><02f4e344> 0306a260  02f4e2cc  00000000  04a84a34  00000001
> Mar  5 02:26:17 bfp53 [21805.382834] 02ab9dc0: 02ab9e94  00000000  00040000  00000001  04a84a34  00000000  02ab9e9c <00042f34>
> Mar  5 02:26:17 bfp53 [21805.392127] 02ab9de0: 0306a260  02ab9eec  00e00fb8  02ab9e10  02ab9e94  000021f0  00dda938  02ab9e10
> Mar  5 02:26:17 bfp53 [21805.401446] 02ab9e00:<000fdc24> 00bdcd40  04a84a34  00000000 <000fdc5c> 003cf20e  00000000  00000001
> Mar  5 02:26:17 bfp53 [21805.410791] 02ab9e20: ffffffff  0306a260  00000000  00000000  00000000  00000000  03bde680  00000000
> Mar  5 02:26:17 bfp53 [21805.420082] 02ab9e40: 00000000  001f8c3c  03bde680  00020280  02ab9e50  02ab9e50  04a84a34  00000000
> Mar  5 02:26:17 bfp53 [21805.429405] 02ab9e60:<00013d28> 001f18fc  02ab8000  001f1888  000021f0  02ab9e94 <00013d68> 001f18fc
> Mar  5 02:26:17 bfp53 [21805.438755] 02ab9e80: 00000000  00000000  02ab9ea4 <00008f04> 001f63f4  02b00004  000021f0  02ab9ec4
> Mar  5 02:26:17 bfp53 [21805.448041] 02ab9ea0:<00043514> 0306a260  00000004  02ab9eec  02b00004  00000000  00000000  02ab9ee0
> Mar  5 02:26:17 bfp53 [21805.457364] 02ab9ec0: 02ab9eec  02ab9ef8 <000437f8> 0306a260  02b00004  000021f0  02b00004  00000141
> Mar  5 02:26:17 bfp53 [21805.466714] 02ab9ee0: 00000000  ffffe000  02ab9eec  04a84a34  00000000  00000001  00000000  ffa008d0
> Mar  5 02:26:17 bfp53 [21805.475999] 02ab9f00: 000437c4  00000000  ffffe000  030cf8dc  00e09388  000005a8  00000000  00dda938
> Mar  5 02:26:17 bfp53 [21805.485322] 02ab9f20: 02b00004  003cf20e  00008000  00000000  00000000  02aba000  003cf20e  003cf20e
> Mar  5 02:26:17 bfp53 [21805.494675] 02ab9f40:<00dcdb62> ffa00e46  02001025  003de2bb  003e03f1  003de2ba  003e03ee  00000000
> Mar  5 02:26:17 bfp53 [21805.503958] 02ab9f60: 00000000  00000000  00000000  00000000  00000000  00000000  00000000  00000000
> Mar  5 02:26:17 bfp53 [21805.513281] 02ab9f80: 00000000  00000000  00000000  00000000  00000000  039a0510  039a0508  0360af3c
> Mar  5 02:26:17 bfp53 [21805.522631] 02ab9fa0: 00000020  00000000  00000000  00000000  003f7838  02ab3c7c  02ab3c88  00df09b0
> Mar  5 02:26:17 bfp53 [21805.531916] 02ab9fc0: 00000017  00e00fb8  00df13f8  003cf204  00000004  00000000  000021f0  02b00004
> Mar  5 02:26:17 bfp53 [21805.541236] 02ab9fe0: 00dda938  00000000  000021f0  02b00004  00000017  00000017  00000004  00000006
> Mar  5 02:26:17 bfp53 [21805.550731] Return addresses in stack:
> Mar  5 02:26:17 bfp53 [21805.554584]    frame  1 : <0x0003761c> { ___alloc_pages_internal + 0x1d0 }
> Mar  5 02:26:17 bfp53 [21805.561534]    frame  2 : <0x00033300> { _grab_cache_page_write_begin + 0x50 }
> Mar  5 02:26:17 bfp53 [21805.584838]     address : <0x02f4e3dc> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.592680]     address : <0x02f4e344> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.599028]     address : <0x00a8f76e> { :ext4:_ext4_journal_start_sb + 0x26 }
> Mar  5 02:26:17 bfp53 [21805.606329]    frame  3 : <0x00a8892c> { :ext4:_ext4_da_write_begin + 0xac }
> Mar  5 02:26:17 bfp53 [21805.613392]    frame  4 : <0x00032b08> { _generic_perform_write + 0x98 }
> Mar  5 02:26:17 bfp53 [21805.622262]     address : <0x02f4e3dc> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.630250]     address : <0x02f4e3dc> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.636537]    frame  5 : <0x00033a6c> { _generic_file_buffered_write + 0x50 }
> Mar  5 02:26:17 bfp53 [21805.645740]     address : <0x02f4e3dc> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.653708]     address : <0x02f4e344> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.659928]    frame  6 : <0x00033e8a> { ___generic_file_aio_write_nolock + 0x1ba }
> Mar  5 02:26:17 bfp53 [21805.669659]     address : <0x02f4e344> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.675988]     address : <0x000fda3a> { _sock_recvmsg + 0x96 }
> Mar  5 02:26:17 bfp53 [21805.684148]     address : <0x02ab9d70> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.692186]     address : <0x02f4e3dc> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.700105]     address : <0x02ab9da0> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.706305]     address : <0x0003440a> { _generic_file_aio_write + 0x52 }
> Mar  5 02:26:17 bfp53 [21805.715083]     address : <0x02f4e344> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.721392]     address : <0x0001410e> { _irq_enter + 0xa }
> Mar  5 02:26:17 bfp53 [21805.728914]     address : <0x02f4e3dc> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.735283]     address : <0x00a81c28> { :ext4:_ext4_file_write + 0x40 }
> Mar  5 02:26:17 bfp53 [21805.744015]     address : <0x02f4e344> /* kernel dynamic memory */
> Mar  5 02:26:17 bfp53 [21805.750296]     address : <0x00042f34> { _do_sync_write + 0x90 }
> Mar  5 02:26:17 bfp53 [21805.756313]     address : <0x000fdc24> { _sys_recvfrom + 0x64 }
> Mar  5 02:26:17 bfp53 [21805.762262]     address : <0x000fdc5c> { _sys_recvfrom + 0x9c }
> Mar  5 02:26:17 bfp53 [21805.768263]     address : <0x00013d28> { ___do_softirq + 0x5c }
> Mar  5 02:26:17 bfp53 [21805.774254]     address : <0x00013d68> { ___do_softirq + 0x9c }
> Mar  5 02:26:17 bfp53 [21805.780216]     address : <0x00008f04> { _bfin_handle_irq + 0x1c }
> Mar  5 02:26:17 bfp53 [21805.786449]     address : <0x00043514> { _vfs_write + 0x60 }
> Mar  5 02:26:17 bfp53 [21805.792159]     address : <0x000437f8> { _sys_write + 0x34 }
> Mar  5 02:26:17 bfp53 [21805.800542]     address : <0x00dcdb62> [ /opt/nge/lib/libngus.so.0.0.0 + 0xdb62 ]
> Mar  5 02:26:17 bfp53 [21805.808171] Mem-Info:
> Mar  5 02:26:17 bfp53 [21805.810372] DMA per-cpu:
> Mar  5 02:26:17 bfp53 [21805.812915] CPU    0: hi:   18, btch:   3 usd:  15
> Mar  5 02:26:17 bfp53 [21805.817749] Active_anon:0 active_file:0 inactive_anon:0
> Mar  5 02:26:17 bfp53 [21805.817768]  inactive_file:620 dirty:565 writeback:0 unstable:0
> Mar  5 02:26:17 bfp53 [21805.817786]  free:7162 slab:1112 mapped:0 pagetables:0 bounce:0
> Mar  5 02:26:17 bfp53 [21805.835000] DMA free:28648kB min:984kB low:1228kB high:1476kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:2480kB present:60956kB pages_scanned:0 all_unreclaimable? no
> Mar  5 02:26:17 bfp53 [21805.851771] lowmem_reserve[]: 0 0 0
> Mar  5 02:26:17 bfp53 [21805.855236] DMA: 318*4kB 396*8kB 299*16kB 205*32kB 115*64kB 29*128kB 5*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB = 28648kB
> Mar  5 02:26:17 bfp53 [21805.868869] 620 total pagecache pages
> Mar  5 02:26:17 bfp53 [21805.876098] 15359 pages RAM
> Mar  5 02:26:17 bfp53 [21805.878803] 673 pages reserved
> Mar  5 02:26:17 bfp53 [21805.881839] 589 pages shared
> Mar  5 02:26:17 bfp53 [21805.884771] 6850 pages non-shared
> Mar  5 02:26:17 bfp53 [21805.888129] ext4_da_write_begin: retrying grab_cache_page_write_begin()

I will attach my workaround patch for reference, too.

Can anybody explain this behaviour and maybe direct me to the root cause?

Of course, this now looks more like a page allocation problem than an 
ext4 one.

Thanks,
Enrik


[-- Attachment #2: ext4-oom-endless-loop-workaround.diff --]
[-- Type: text/x-diff, Size: 2509 bytes --]

---
 fs/ext4/inode.c |   24 ++++++++++++++++++++++++
 mm/page_alloc.c |    4 +++-
 2 files changed, 27 insertions(+), 1 deletion(-)

Index: linux-2.6.28.10/fs/ext4/inode.c
===================================================================
--- linux-2.6.28.10.orig/fs/ext4/inode.c	2010-03-03 15:29:23.000000000 +0100
+++ linux-2.6.28.10/fs/ext4/inode.c	2010-03-03 15:53:20.000000000 +0100
@@ -1335,6 +1335,7 @@ static int ext4_write_begin(struct file 
 	struct page *page;
  	pgoff_t index;
 	unsigned from, to;
+	int grab_cache_page_retries = 3;
 
  	index = pos >> PAGE_CACHE_SHIFT;
 	from = pos & (PAGE_CACHE_SIZE - 1);
@@ -1354,6 +1355,17 @@ retry:
 	page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page) {
 		ext4_journal_stop(handle);
+
+		/* retry to get memory because the reason for OOM could be
+		 * AOP_FLAG_NOFS! */
+		if (grab_cache_page_retries) {
+			printk(KERN_DEBUG "%s: retrying "
+			       "grab_cache_page_write_begin()\n", __func__);
+			grab_cache_page_retries--;
+			schedule_timeout_uninterruptible(1);
+			goto retry;
+		}
+
 		ret = -ENOMEM;
 		goto out;
 	}
@@ -2584,6 +2596,7 @@ static int ext4_da_write_begin(struct fi
 	unsigned from, to;
 	struct inode *inode = mapping->host;
 	handle_t *handle;
+	int grab_cache_page_retries = 3;
 
 	index = pos >> PAGE_CACHE_SHIFT;
 	from = pos & (PAGE_CACHE_SIZE - 1);
@@ -2614,6 +2627,17 @@ retry:
 	page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page) {
 		ext4_journal_stop(handle);
+
+		/* retry to get memory because the reason for OOM could be
+		 * AOP_FLAG_NOFS! */
+		if (grab_cache_page_retries) {
+			printk(KERN_DEBUG "%s: retrying "
+			       "grab_cache_page_write_begin()\n", __func__);
+			grab_cache_page_retries--;
+			schedule_timeout_uninterruptible(1);
+			goto retry;
+		}
+
 		ret = -ENOMEM;
 		goto out;
 	}
Index: linux-2.6.28.10/mm/page_alloc.c
===================================================================
--- linux-2.6.28.10.orig/mm/page_alloc.c	2010-03-03 15:52:18.000000000 +0100
+++ linux-2.6.28.10/mm/page_alloc.c	2010-03-03 15:52:22.000000000 +0100
@@ -1476,6 +1476,7 @@ __alloc_pages_internal(gfp_t gfp_mask, u
 	int alloc_flags;
 	unsigned long did_some_progress;
 	unsigned long pages_reclaimed = 0;
+	int retries = 3;
 
 	might_sleep_if(wait);
 
@@ -1659,8 +1660,9 @@ nofail_alloc:
 		if (gfp_mask & __GFP_NOFAIL)
 			do_retry = 1;
 	}
-	if (do_retry) {
+	if (do_retry && retries) {
 		congestion_wait(WRITE, HZ/50);
+		retries--;
 		goto rebalance;
 	}
 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: possible ext4 related deadlock
  2010-03-05 13:56 ` Enrik Berkhan
@ 2010-03-05 15:45   ` tytso
  2010-03-10 16:23     ` Enrik Berkhan
  0 siblings, 1 reply; 5+ messages in thread
From: tytso @ 2010-03-05 15:45 UTC (permalink / raw)
  To: Enrik Berkhan; +Cc: linux-ext4

On Fri, Mar 05, 2010 at 02:56:28PM +0100, Enrik Berkhan wrote:
> 
> Meanwhile, I have found out that thread 2 actually isn't completely
> blocked but loops in __alloc_pages_internal:
> 
> get_page_from_freelist() doesn't return a page;
> try_to_free_pages() returns did_some_progress == 0;
> later, do_retry == 1 and the loop restarts with goto rebalance;
>
> 
> Can anybody explain this behaviour and maybe direct me to the root cause?
> 
> Of course, this now looks more like a page allocation problem than
> an ext4 one.

Yep, I'd have to agree with you.  We're only trying to allocate a
single page here, and you have plenty of pages available.  Just
checking....  you don't have CONFIG_NUMA enabled and doing something
crazy with NUMA nodes, are you?

The reason why there is no retry logic in this piece of code is this
is not something that we've *ever* anticipated would fail --- and if
it fails, the system is so badly in the weeds that we're probably
better off just returning ENOMEM and return an error to userspace; if
you can't even allocate a single page, the OOM killer should have been
killing off random processes for a while anyway, so a ENOMEM return to
a write(2) system call means the process has gotten off lightly; it's
lucky to still be alive, after all, with the OOM killer surely going
postal if things are so bad that a 4k page allocation isn't
succeeding.  :-)

So I would definitely say this is a problem with the page allocation
mechanism; you say it's been modified a bit for NOMMU for the Blackfin
architecture?  I imagine the fact that you don't have any kind of VM
means that the allocator must be playing all sorts of tricks to make
sure that a process can allocate memory contiguously in its data
segment, for example.  I assume that there are regions of memory which
are reserved for page cache?  I'm guessing you have to change how much
room is reserved for the page cache, or some such.  This sounds like
it's a Blackfin-specific issue.

My recommendation would be that you don't put the retry loop in the
ext4 code, since I suspect this is not going to be the only place
where the Blackfin is going to randomly fail to give you a 4k page,
even though there's plenty of memory somewhere else.  The lack of an
MMU pretty much guarantees this sort of thing can happen.  So putting
the retry loop in the page allocator, with a WARN_ON(1) when it
happens, so the developers can appropriately tune the manual memory
partition settings, seems like the better approach.

Regards,

					- Ted

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: possible ext4 related deadlock
  2010-03-05 15:45   ` tytso
@ 2010-03-10 16:23     ` Enrik Berkhan
  0 siblings, 0 replies; 5+ messages in thread
From: Enrik Berkhan @ 2010-03-10 16:23 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

tytso@mit.edu wrote:
> On Fri, Mar 05, 2010 at 02:56:28PM +0100, Enrik Berkhan wrote:
>> Meanwhile, I have found out that thread 2 actually isn't completely
>> blocked but loops in __alloc_pages_internal:
>>
>> get_page_from_freelist() doesn't return a page;
>> try_to_free_pages() returns did_some_progress == 0;
>> later, do_retry == 1 and the loop restarts with goto rebalance;
>>
>>
>> Can anybody explain this behaviour and maybe direct me to the root cause?

I think, I have isolated it further: the Blackfin/NOMMU changes are
simply to call drop_pagecache() in __alloc_pages_internal() before
trying harder to get pages, which generally is a good thing on NOMMU. We 
have far less OOMs since that has been introduced into the Blackfin patches.

So, the call sequence may reduce to

...
/* got no free page on first try */
drop_pagecache();
rebalance:
did_some_progress = try_to_free_pages();
/* returns 0, most probably because drop_pagecache() has already cleaned 
up everything possible, thus no call to get_page_from_freelist() */
drop_pagecache();
goto rebalance;
...

>> Of course, this now looks more like a page allocation problem than
>> an ext4 one.
> 
> Yep, I'd have to agree with you.  We're only trying to allocate a
> single page here, and you have plenty of pages available.  Just
> checking....  you don't have CONFIG_NUMA enabled and doing something
> crazy with NUMA nodes, are you?

no NUMA, of course :)

The ext4 contribution to the problem is setting AOP_FLAG_NOFS, which is
correct, of course. And because most probably no one else in the world 
uses ext4 on Blackfin/NOMMU, the endless loop only triggers here.

So it's definitely a page allocation problem and a better workaround is 
to call get_page_from_freelist() after each call to drop_pagecache().

I will continue this discussion on the Blackfin list.

Thanks for your patience.

Enrik

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2010-03-10 16:23 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-02-12 12:49 possible ext4 related deadlock Enrik Berkhan
2010-02-18  1:55 ` Jan Kara
2010-03-05 13:56 ` Enrik Berkhan
2010-03-05 15:45   ` tytso
2010-03-10 16:23     ` Enrik Berkhan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).