* deadlock with latest xfs
@ 2008-10-23 9:17 Lachlan McIlroy
2008-10-23 20:57 ` Christoph Hellwig
0 siblings, 1 reply; 20+ messages in thread
From: Lachlan McIlroy @ 2008-10-23 9:17 UTC (permalink / raw)
To: xfs-oss
another problem with latest xfs
I ran fsstress with 1024 threads and they all locked up within a few minutes.
Some of the stack traces show threads stuck in the log:
Stack traceback for pid 6675
0xffff881003436d60 6675 6648 0 1 D 0xffff8810034371c8 fsstress
sp ip Function (args)
0xffff8810034e56d8 0xffffffff8155d9d6 thread_return
0xffff8810034e5770 0xffffffff8155de7f schedule_timeout+0x22 (0x7fffffffffffffff)
0xffff8810034e57e0 0xffffffff811a57bc xlog_grant_log_space+0x10a (0xffff881026beb230, 0xffff8807dc582d68)
0xffff8810034e5850 0xffffffff811a5b88 xfs_log_reserve+0x160 (0xffff88100d085b18, invalid, invalid, 0xffff8807d6ae5c40, invalid, invalid, 0x10)
0xffff8810034e5890 0xffffffff811b0de6 xfs_trans_reserve+0x173 (0xffff8807d6ae5c00, invalid, invalid, invalid, invalid, 0x2)
0xffff8810034e58e0 0xffffffff8119fab7 xfs_iomap_write_direct+0x204 (0xffff8808a4220000, 0xe3000, invalid, invalid, 0xffff8810034e5a38, 0xffff8810034e5a64, 0xffffc20000000001)
0xffff8810034e59e0 0xffffffff811a0588 xfs_iomap+0x282 (0xffff8808a4220000, 0xe3000, invalid, invalid, 0xffff8810034e5ab8, 0xffff8810034e5af4)
0xffff8810034e5aa0 0xffffffff811bc134 __xfs_get_blocks+0xa3 (0xffff8808a42202a0, invalid, 0xffff881026a8b830, invalid, invalid, invalid)
0xffff8810034e5b30 0xffffffff811bc29a xfs_get_blocks_direct+0x15
0xffff8810034e5b40 0xffffffff810d21b6 __blockdev_direct_IO+0x53c (invalid, invalid, 0xffff8808a42202a0, invalid, invalid, 0xd6c00, 0x1, 0xffffffff811bc285, 0xffffffff811bd031)
0xffff8810034e5be0 0xffffffff811bdcc0 xfs_vm_direct_IO+0xeb (invalid, 0xffff8810034e5de8, 0xffff8810034e5ed8, 0xd6c00, 0x1)
0xffff8810034e5c50 0xffffffff8107d42a generic_file_direct_write+0xfd (0xffff8810034e5de8, 0xffff8810034e5ed8, 0xffff8810034e5d78, 0xd6c00, 0xffff8810034e5e68, invalid, 0xce00)
0xffff8810034e5cb0 0xffffffff811c536f xfs_write+0x579 (0xffff8808a4220000, 0xffff8810034e5de8, 0xffff8810034e5ed8, 0x1, 0xffff8810034e5e68, 0x34e5e6800000005)
0xffff8810034e5dc0 0xffffffff811c0e00 __xfs_file_write+0x4c (invalid, invalid, invalid, invalid, invalid)
0xffff8810034e5dd0 0xffffffff811c0e26 xfs_file_aio_write+0x11 (invalid, invalid, invalid, invalid)
0xffff8810034e5de0 0xffffffff810a96ec do_sync_write+0xe2 (0xffff880fff69d680, 0x7f60ef402000, 0xce00, 0xffff8810034e5f48)
0xffff8810034e5f10 0xffffffff810a9ee8 vfs_write+0xae (0xffff880fff69d680, 0x7f60ef402000, invalid, 0xffff8810034e5f48)
0xffff8810034e5f40 0xffffffff810aa3f8 sys_write+0x47 (invalid, 0x7f60ef402000, 0xce00)
But most of the fsstress threads are stuck with stack traces like this one:
Stack traceback for pid 6674
0xffff881003435dc0 6674 6648 0 3 D 0xffff881003436228 fsstress
sp ip Function (args)
0xffff8810034e3848 0xffffffff8155d9d6 thread_return
0xffff8810034e38e0 0xffffffff8155de07 io_schedule+0x5c
0xffff8810034e3900 0xffffffff8107c1fc sync_page+0x3f (invalid)
0xffff8810034e3910 0xffffffff8155e09a __wait_on_bit+0x45 (0xffff880028071cc0, 0xffff8810034e3958, 0xffffffff8107c1bd, invalid)
0xffff8810034e3950 0xffffffff8107c442 wait_on_page_bit+0x6e (0xffffe2003a7de578, invalid)
0xffff8810034e39b0 0xffffffff81082e57 write_cache_pages+0x191 (0xffff880c3dab6a08, 0xffff8810034e3b18, 0xffffffff81082966, 0xffff880c3dab6a08)
0xffff8810034e3ab0 0xffffffff81083019 generic_writepages+0x22 (invalid)
0xffff8810034e3ac0 0xffffffff811bdb52 xfs_vm_writepages+0x46 (0xffff880c3dab6a08, 0xffff8810034e3b18)
0xffff8810034e3af0 0xffffffff8108304a do_writepages+0x2b (invalid, 0xffff8810034e3b18)
0xffff8810034e3b10 0xffffffff8107cafd __filemap_fdatawrite_range+0x5b (invalid, 0x0, 0x7fffffffffffffff, invalid)
0xffff8810034e3b70 0xffffffff8107ccad filemap_fdatawrite+0x1a
0xffff8810034e3b80 0xffffffff8107cccb filemap_write_and_wait+0x1c (0xffff880c3dab6a08)
0xffff8810034e3ba0 0xffffffff811c1296 xfs_flushinval_pages+0x4e (0xffff880c3dab6580, 0x78000)
0xffff8810034e3bd0 0xffffffff811b5718 xfs_free_file_space+0x196 (0xffff880c3dab6580, 0x78a52, 0x894ec, invalid)
0xffff8810034e3ce0 0xffffffff811b771a xfs_change_file_space+0x163 (0xffff880c3dab6580, invalid, 0xffff8810034e3d98, invalid, 0x0, invalid)
0xffff8810034e3d90 0xffffffff811c1faf xfs_ioc_space+0xab (0xffff880c3dab6580, invalid, 0xffff880fffbfb500, invalid, invalid, invalid)
0xffff8810034e3e00 0xffffffff811c2e8e xfs_ioctl+0x296 (0xffff880c3dab6580, 0xffff880fffbfb500, invalid, invalid, 0x7ffff7a32c20)
0xffff8810034e3e80 0xffffffff811c1173 xfs_file_ioctl+0x36 (invalid, invalid, invalid)
0xffff8810034e3eb0 0xffffffff810b5c42 vfs_ioctl+0x2a (0xffff880fffbfb500, invalid, 0x7ffff7a32c20)
0xffff8810034e3ee0 0xffffffff810b5eee do_vfs_ioctl+0x25f (invalid, invalid, invalid, 0x7ffff7a32c20)
0xffff8810034e3f30 0xffffffff810b5f62 sys_ioctl+0x57 (invalid, invalid, 0x7ffff7a32c20)
The system has plenty of memory available. The deadlock is reproducible.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: deadlock with latest xfs
2008-10-23 9:17 deadlock with latest xfs Lachlan McIlroy
@ 2008-10-23 20:57 ` Christoph Hellwig
2008-10-23 22:28 ` Dave Chinner
2008-10-24 3:08 ` Lachlan McIlroy
0 siblings, 2 replies; 20+ messages in thread
From: Christoph Hellwig @ 2008-10-23 20:57 UTC (permalink / raw)
To: Lachlan McIlroy; +Cc: xfs-oss
On Thu, Oct 23, 2008 at 07:17:30PM +1000, Lachlan McIlroy wrote:
> another problem with latest xfs
Is this with the 2.6.27-based ptools/cvs tree or with the 2.6.28-based
git tree? It does look more like a VM issue than an XFS issue to me.
* Re: deadlock with latest xfs
2008-10-23 20:57 ` Christoph Hellwig
@ 2008-10-23 22:28 ` Dave Chinner
2008-10-24 3:08 ` Lachlan McIlroy
1 sibling, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2008-10-23 22:28 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Lachlan McIlroy, xfs-oss
On Thu, Oct 23, 2008 at 04:57:28PM -0400, Christoph Hellwig wrote:
> On Thu, Oct 23, 2008 at 07:17:30PM +1000, Lachlan McIlroy wrote:
> > another problem with latest xfs
>
> Is this with the 2.6.27-based ptools/cvs tree or with the 2.6.28-based
> git tree? It does look more like a VM issue than an XFS issue to me.
I hit this immediately after I upgraded to a 2.6.28 base tree from a
2.6.27-rc9. I couldn't get to the bottom of it - it did look like
a lost I/O or VM lockup, but I haven't seen it since, so I haven't
been able to do anything more....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: deadlock with latest xfs
2008-10-23 20:57 ` Christoph Hellwig
2008-10-23 22:28 ` Dave Chinner
@ 2008-10-24 3:08 ` Lachlan McIlroy
2008-10-24 5:24 ` Dave Chinner
2008-10-26 22:39 ` Dave Chinner
1 sibling, 2 replies; 20+ messages in thread
From: Lachlan McIlroy @ 2008-10-24 3:08 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs-oss
Christoph Hellwig wrote:
> On Thu, Oct 23, 2008 at 07:17:30PM +1000, Lachlan McIlroy wrote:
>> another problem with latest xfs
>
> Is this with the 2.6.27-based ptools/cvs tree or with the 2.6.28-based
> git tree? It does look more like a VM issue than an XFS issue to me.
>
It's with the 2.6.27-rc8 based ptools tree. Prior to checking
in these patches:
Can't lock inodes in radix tree preload region
stop using xfs_itobp in xfs_bulkstat
free partially initialized inodes using destroy_inode
I was able to stress a system for about 4 hours before it ran out
of memory. Now I hit the deadlock within a few minutes. I need
to roll back to find which patch changed the behaviour.
* Re: deadlock with latest xfs
2008-10-24 3:08 ` Lachlan McIlroy
@ 2008-10-24 5:24 ` Dave Chinner
2008-10-24 6:48 ` Dave Chinner
2008-10-24 8:46 ` Lachlan McIlroy
2008-10-26 22:39 ` Dave Chinner
1 sibling, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2008-10-24 5:24 UTC (permalink / raw)
To: Lachlan McIlroy; +Cc: Christoph Hellwig, xfs-oss
On Fri, Oct 24, 2008 at 01:08:55PM +1000, Lachlan McIlroy wrote:
> Christoph Hellwig wrote:
>> On Thu, Oct 23, 2008 at 07:17:30PM +1000, Lachlan McIlroy wrote:
>>> another problem with latest xfs
>>
>> Is this with the 2.6.27-based ptools/cvs tree or with the 2.6.28-based
>> git tree? It does look more like a VM issue than an XFS issue to me.
>>
>
> It's with the 2.6.27-rc8 based ptools tree. Prior to checking
> in these patches:
>
> Can't lock inodes in radix tree preload region
> stop using xfs_itobp in xfs_bulkstat
> free partially initialized inodes using destroy_inode
>
> I was able to stress a system for about 4 hours before it ran out
> of memory. Now I hit the deadlock within a few minutes. I need
> to roll back to find which patch changed the behaviour.
Does it go away when you add the "XFS: Fix race when looking up
reclaimable inodes" patch I sent this morning?
Also, is there a thread stuck in xfs_setfilesize() waiting on an
ilock during I/O completion?
i.e. did the log hang because I/O completion is stuck waiting on
an ilock that is held by a thread waiting on I/O completion?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: deadlock with latest xfs
2008-10-24 5:24 ` Dave Chinner
@ 2008-10-24 6:48 ` Dave Chinner
2008-10-26 0:53 ` Dave Chinner
2008-10-24 8:46 ` Lachlan McIlroy
1 sibling, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2008-10-24 6:48 UTC (permalink / raw)
To: Lachlan McIlroy, Christoph Hellwig, xfs-oss
On Fri, Oct 24, 2008 at 04:24:18PM +1100, Dave Chinner wrote:
> On Fri, Oct 24, 2008 at 01:08:55PM +1000, Lachlan McIlroy wrote:
> > Christoph Hellwig wrote:
> >> On Thu, Oct 23, 2008 at 07:17:30PM +1000, Lachlan McIlroy wrote:
> >>> another problem with latest xfs
> >>
>> Is this with the 2.6.27-based ptools/cvs tree or with the 2.6.28-based
>> git tree? It does look more like a VM issue than an XFS issue to me.
> >>
> >
> > It's with the 2.6.27-rc8 based ptools tree. Prior to checking
> > in these patches:
> >
> > Can't lock inodes in radix tree preload region
> > stop using xfs_itobp in xfs_bulkstat
> > free partially initialized inodes using destroy_inode
> >
> > I was able to stress a system for about 4 hours before it ran out
> > of memory. Now I hit the deadlock within a few minutes. I need
> > to roll back to find which patch changed the behaviour.
>
> Does it go away when you add the "XFS: Fix race when looking up
> reclaimable inodes" patch I sent this morning?
>
> Also, is there a thread stuck in xfs_setfilesize() waiting on an
> ilock during I/O completion?
>
> i.e. did the log hang because I/O completion is stuck waiting on
> an ilock that is held by a thread waiting on I/O completion?
OK, I just hung a single-threaded rm -rf after this completed:
# fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
It has hung with this trace:
# echo w > /proc/sysrq-trigger
[42954211.590000] SysRq : Show Blocked State
[42954211.590000] task PC stack pid father
[42954211.590000] rm D 00000000407219f0 0 2504 1155
[42954211.590000] 604692d8 6002e40a 808ad040 79484000 79487850 60014f0d 808ad040 6032b3e0
[42954211.590000] 79484000 6c8a2808 60468e00 808ad040 794878a0 60324b21 79484000 00000250
[42954211.590000] 79484000 79484000 7fffffffffffffff 79045e88 80014d28 80014df8 79487900 60324e6d <6>Call Trace:
[42954211.590000] 794877f8: [<6002e40a>] update_curr+0x3a/0x50
[42954211.590000] 79487818: [<60014f0d>] _switch_to+0x6d/0xe0
[42954211.590000] 79487858: [<60324b21>] schedule+0x171/0x2c0
[42954211.590000] 794878a8: [<60324e6d>] schedule_timeout+0xad/0xf0
[42954211.590000] 794878c8: [<60326e98>] _spin_unlock_irqrestore+0x18/0x20
[42954211.590000] 79487908: [<60195455>] xlog_grant_log_space+0x245/0x470
[42954211.590000] 79487920: [<60030ba0>] default_wake_function+0x0/0x10
[42954211.590000] 79487978: [<601957a2>] xfs_log_reserve+0x122/0x140
[42954211.590000] 794879c8: [<601a36e7>] xfs_trans_reserve+0x147/0x2e0
[42954211.590000] 794879f8: [<60087374>] kmem_cache_alloc+0x84/0x100
[42954211.590000] 79487a38: [<601ab01f>] xfs_inactive_symlink_rmt+0x9f/0x450
[42954211.590000] 79487a88: [<601ada94>] kmem_zone_zalloc+0x34/0x50
[42954211.590000] 79487aa8: [<601a3a6d>] _xfs_trans_alloc+0x2d/0x70
[42954211.590000] 79487ac8: [<601a3b52>] xfs_trans_alloc+0xa2/0xb0
[42954211.590000] 79487ad8: [<60326ea9>] _spin_unlock+0x9/0x10
[42954211.590000] 79487ae8: [<601a85ef>] xfs_inode_is_filestream+0x5f/0x80
[42954211.590000] 79487b28: [<601ab597>] xfs_inactive+0x1c7/0x530
[42954211.590000] 79487b78: [<601b94ec>] xfs_fs_clear_inode+0x3c/0x70
[42954211.590000] 79487b98: [<6009e881>] clear_inode+0x91/0x150
[42954211.590000] 79487bb8: [<6009f05f>] generic_delete_inode+0xff/0x130
[42954211.590000] 79487bd8: [<6009f20d>] generic_drop_inode+0x17d/0x1a0
[42954211.590000] 79487bf8: [<6009e317>] iput+0x57/0x90
[42954211.590000] 79487c18: [<60095be3>] do_unlinkat+0x113/0x1c0
[42954211.590000] 79487c98: [<60098e90>] sys_getdents+0x110/0x150
[42954211.590000] 79487cd8: [<60095ded>] sys_unlinkat+0x1d/0x40
[42954211.590000] 79487ce8: [<60018150>] handle_syscall+0x50/0x80
[42954211.590000] 79487d08: [<6002b05e>] userspace+0x48e/0x550
[42954211.590000] 79487f58: [<600269a7>] save_registers+0x17/0x40
[42954211.590000] 79487fc8: [<60014df2>] fork_handler+0x62/0x70
[42954211.590000]
Which implies that the log tail is not moving forward. I'm about to jump
on a plane, so I won't be able to look at this until tomorrow....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: deadlock with latest xfs
2008-10-24 5:24 ` Dave Chinner
2008-10-24 6:48 ` Dave Chinner
@ 2008-10-24 8:46 ` Lachlan McIlroy
1 sibling, 0 replies; 20+ messages in thread
From: Lachlan McIlroy @ 2008-10-24 8:46 UTC (permalink / raw)
To: Lachlan McIlroy, Christoph Hellwig, xfs-oss
Dave Chinner wrote:
> On Fri, Oct 24, 2008 at 01:08:55PM +1000, Lachlan McIlroy wrote:
>> Christoph Hellwig wrote:
>>> On Thu, Oct 23, 2008 at 07:17:30PM +1000, Lachlan McIlroy wrote:
>>>> another problem with latest xfs
>>> Is this with the 2.6.27-based ptools/cvs tree or with the 2.6.28-based
>>> git tree? It does look more like a VM issue than an XFS issue to me.
>>>
>> It's with the 2.6.27-rc8 based ptools tree. Prior to checking
>> in these patches:
>>
>> Can't lock inodes in radix tree preload region
>> stop using xfs_itobp in xfs_bulkstat
>> free partially initialized inodes using destroy_inode
>>
>> I was able to stress a system for about 4 hours before it ran out
>> of memory. Now I hit the deadlock within a few minutes. I need
>> to roll back to find which patch changed the behaviour.
>
> Does it go away when you add the "XFS: Fix race when looking up
> reclaimable inodes" patch I sent this morning?
I haven't had a chance to test it yet - will do that on Monday.
>
> Also, is there a thread stuck in xfs_setfilesize() waiting on an
> ilock during I/O completion?
Haven't seen one but then I haven't looked through all 1024 stuck
threads.
>
> i.e. did the log hang because I/O completion is stuck waiting on
> an ilock that is held by a thread waiting on I/O completion?
It could be. I was hoping that if I found the offending mod it
would be easier to find out what caused the problem. I pulled
out each of the changes listed above in turn and I can still hit
the problem.
* Re: deadlock with latest xfs
2008-10-24 6:48 ` Dave Chinner
@ 2008-10-26 0:53 ` Dave Chinner
2008-10-26 2:50 ` Dave Chinner
0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2008-10-26 0:53 UTC (permalink / raw)
To: Lachlan McIlroy, Christoph Hellwig, xfs-oss
On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
> OK, I just hung a single-threaded rm -rf after this completed:
>
> # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
>
> It has hung with this trace:
>
> # echo w > /proc/sysrq-trigger
....
> [42954211.590000] 794877f8: [<6002e40a>] update_curr+0x3a/0x50
> [42954211.590000] 79487818: [<60014f0d>] _switch_to+0x6d/0xe0
> [42954211.590000] 79487858: [<60324b21>] schedule+0x171/0x2c0
> [42954211.590000] 794878a8: [<60324e6d>] schedule_timeout+0xad/0xf0
> [42954211.590000] 794878c8: [<60326e98>] _spin_unlock_irqrestore+0x18/0x20
> [42954211.590000] 79487908: [<60195455>] xlog_grant_log_space+0x245/0x470
> [42954211.590000] 79487920: [<60030ba0>] default_wake_function+0x0/0x10
> [42954211.590000] 79487978: [<601957a2>] xfs_log_reserve+0x122/0x140
> [42954211.590000] 794879c8: [<601a36e7>] xfs_trans_reserve+0x147/0x2e0
> [42954211.590000] 794879f8: [<60087374>] kmem_cache_alloc+0x84/0x100
> [42954211.590000] 79487a38: [<601ab01f>] xfs_inactive_symlink_rmt+0x9f/0x450
> [42954211.590000] 79487a88: [<601ada94>] kmem_zone_zalloc+0x34/0x50
> [42954211.590000] 79487aa8: [<601a3a6d>] _xfs_trans_alloc+0x2d/0x70
....
I came back to the system, and found that the hang had gone away - the
rm -rf had finished sometime in the ~36 hours between triggering the
problem and coming back to look at the corpse....
So nothing to report yet.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: deadlock with latest xfs
2008-10-26 0:53 ` Dave Chinner
@ 2008-10-26 2:50 ` Dave Chinner
2008-10-26 4:20 ` Dave Chinner
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Dave Chinner @ 2008-10-26 2:50 UTC (permalink / raw)
To: Lachlan McIlroy, Christoph Hellwig, xfs-oss; +Cc: linux-mm
On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
> On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
> > OK, I just hung a single-threaded rm -rf after this completed:
> >
> > # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
> >
> > It has hung with this trace:
> >
> > # echo w > /proc/sysrq-trigger
> ....
> > [42954211.590000] 794877f8: [<6002e40a>] update_curr+0x3a/0x50
> > [42954211.590000] 79487818: [<60014f0d>] _switch_to+0x6d/0xe0
> > [42954211.590000] 79487858: [<60324b21>] schedule+0x171/0x2c0
> > [42954211.590000] 794878a8: [<60324e6d>] schedule_timeout+0xad/0xf0
> > [42954211.590000] 794878c8: [<60326e98>] _spin_unlock_irqrestore+0x18/0x20
> > [42954211.590000] 79487908: [<60195455>] xlog_grant_log_space+0x245/0x470
> > [42954211.590000] 79487920: [<60030ba0>] default_wake_function+0x0/0x10
> > [42954211.590000] 79487978: [<601957a2>] xfs_log_reserve+0x122/0x140
> > [42954211.590000] 794879c8: [<601a36e7>] xfs_trans_reserve+0x147/0x2e0
> > [42954211.590000] 794879f8: [<60087374>] kmem_cache_alloc+0x84/0x100
> > [42954211.590000] 79487a38: [<601ab01f>] xfs_inactive_symlink_rmt+0x9f/0x450
> > [42954211.590000] 79487a88: [<601ada94>] kmem_zone_zalloc+0x34/0x50
> > [42954211.590000] 79487aa8: [<601a3a6d>] _xfs_trans_alloc+0x2d/0x70
> ....
>
> I came back to the system, and found that the hang had gone away - the
> rm -rf had finished sometime in the ~36 hours between triggering the
> problem and coming back to look at the corpse....
>
> So nothing to report yet.
Got it now. I can reproduce this in a couple of minutes now that both
the test fs and the fs hosting the UML fs images are using lazy-count=1
(and the frequent 10s long host system freezes have gone away, too).
Looks like *another* new memory allocation problem [1]:
[42950422.270000] xfsdatad/0 D 000000000043bf7a 0 51 2
[42950422.270000] 804add98 804ad8f8 60498c40 80474000 804776a0 60014f0d 80442780 1000111a8
[42950422.270000] 80474000 7ff1ac08 804ad8c0 80442780 804776f0 60324b21 80474000 80477700
[42950422.270000] 80474000 1000111a8 80477700 0000000a 804777e0 80477950 80477750 60324e39 <6>Call Trace:
[42950422.270000] 80477668: [<60014f0d>] _switch_to+0x6d/0xe0
[42950422.270000] 804776a8: [<60324b21>] schedule+0x171/0x2c0
[42950422.270000] 804776f8: [<60324e39>] schedule_timeout+0x79/0xf0
[42950422.270000] 80477718: [<60040360>] process_timeout+0x0/0x10
[42950422.270000] 80477758: [<60324619>] io_schedule_timeout+0x19/0x30
[42950422.270000] 80477778: [<6006eb74>] congestion_wait+0x74/0xa0
[42950422.270000] 80477790: [<6004c5b0>] autoremove_wake_function+0x0/0x40
[42950422.270000] 804777d8: [<600692a0>] throttle_vm_writeout+0x80/0xa0
[42950422.270000] 80477818: [<6006cdf4>] shrink_zone+0xac4/0xb10
[42950422.270000] 80477828: [<601adb5b>] kmem_alloc+0x5b/0x140
[42950422.270000] 804778c8: [<60186d48>] xfs_iext_inline_to_direct+0x68/0x80
[42950422.270000] 804778f8: [<60187e38>] xfs_iext_realloc_direct+0x128/0x1c0
[42950422.270000] 80477928: [<60188594>] xfs_iext_add+0xc4/0x290
[42950422.270000] 80477978: [<60166388>] xfs_bmbt_set_all+0x18/0x20
[42950422.270000] 80477988: [<601887c4>] xfs_iext_insert+0x64/0x80
[42950422.270000] 804779c8: [<6006d75a>] try_to_free_pages+0x1ea/0x330
[42950422.270000] 80477a40: [<6006ba40>] isolate_pages_global+0x0/0x40
[42950422.270000] 80477a98: [<60067887>] __alloc_pages_internal+0x267/0x540
[42950422.270000] 80477b68: [<60086b61>] cache_alloc_refill+0x4c1/0x970
[42950422.270000] 80477b88: [<60326ea9>] _spin_unlock+0x9/0x10
[42950422.270000] 80477bd8: [<6002ffc5>] __might_sleep+0x55/0x120
[42950422.270000] 80477c08: [<601ad9cd>] kmem_zone_alloc+0x7d/0x110
[42950422.270000] 80477c18: [<600873c3>] kmem_cache_alloc+0xd3/0x100
[42950422.270000] 80477c58: [<601ad9cd>] kmem_zone_alloc+0x7d/0x110
[42950422.270000] 80477ca8: [<601ada78>] kmem_zone_zalloc+0x18/0x50
[42950422.270000] 80477cc8: [<601a3a6d>] _xfs_trans_alloc+0x2d/0x70
[42950422.270000] 80477ce8: [<601a3b52>] xfs_trans_alloc+0xa2/0xb0
[42950422.270000] 80477d18: [<60027655>] set_signals+0x35/0x40
[42950422.270000] 80477d48: [<6018f93a>] xfs_iomap_write_unwritten+0x5a/0x260
[42950422.270000] 80477d50: [<60063d12>] mempool_free_slab+0x12/0x20
[42950422.270000] 80477d68: [<60027655>] set_signals+0x35/0x40
[42950422.270000] 80477db8: [<60063d12>] mempool_free_slab+0x12/0x20
[42950422.270000] 80477dc8: [<60063dbf>] mempool_free+0x4f/0x90
[42950422.270000] 80477e18: [<601af5e5>] xfs_end_bio_unwritten+0x65/0x80
[42950422.270000] 80477e38: [<60048574>] run_workqueue+0xa4/0x180
[42950422.270000] 80477e50: [<601af580>] xfs_end_bio_unwritten+0x0/0x80
[42950422.270000] 80477e58: [<6004c791>] prepare_to_wait+0x51/0x80
[42950422.270000] 80477e98: [<600488e0>] worker_thread+0x70/0xd0
We've entered memory reclaim inside the xfsdatad while trying to do
unwritten extent conversion during I/O completion, and that memory
reclaim is now blocked waiting for I/O completion that cannot make
progress.
Nasty.
My initial thought is to make _xfs_trans_alloc() able to take a KM_NOFS argument
so we don't re-enter the FS here. If we get an ENOMEM in this case, we should
then re-queue the I/O completion at the back of the workqueue and let other
I/O completions progress before retrying this one. That way the I/O that
is simply cleaning memory will make progress, hence allowing memory
allocation to occur successfully when we retry this I/O completion...
XFS-folk - thoughts?
[1] I don't see how any of the XFS changes we made make this easier to hit.
What I suspect is a VM regression w.r.t. memory reclaim, because this is
the second problem since 2.6.26 that appears to be a result of memory
allocation failures in places where we've never, ever seen failures before.
The other new failure is this one:
http://bugzilla.kernel.org/show_bug.cgi?id=11805
which is an alloc_pages(GFP_KERNEL) failure....
mm-folk - care to weigh in?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: deadlock with latest xfs
2008-10-26 2:50 ` Dave Chinner
@ 2008-10-26 4:20 ` Dave Chinner
2008-10-27 1:42 ` Lachlan McIlroy
[not found] ` <200810281702.17135.nickpiggin@yahoo.com.au>
2 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2008-10-26 4:20 UTC (permalink / raw)
To: Lachlan McIlroy, Christoph Hellwig, xfs-oss, linux-mm
On Sun, Oct 26, 2008 at 01:50:13PM +1100, Dave Chinner wrote:
> On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
> > On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
> > > OK, I just hung a single-threaded rm -rf after this completed:
> > >
> > > # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
> > >
> > > It has hung with this trace:
> > >
> > > # echo w > /proc/sysrq-trigger
> > ....
> > > [42954211.590000] 794877f8: [<6002e40a>] update_curr+0x3a/0x50
> > > [42954211.590000] 79487818: [<60014f0d>] _switch_to+0x6d/0xe0
> > > [42954211.590000] 79487858: [<60324b21>] schedule+0x171/0x2c0
> > > [42954211.590000] 794878a8: [<60324e6d>] schedule_timeout+0xad/0xf0
> > > [42954211.590000] 794878c8: [<60326e98>] _spin_unlock_irqrestore+0x18/0x20
> > > [42954211.590000] 79487908: [<60195455>] xlog_grant_log_space+0x245/0x470
> > > [42954211.590000] 79487920: [<60030ba0>] default_wake_function+0x0/0x10
> > > [42954211.590000] 79487978: [<601957a2>] xfs_log_reserve+0x122/0x140
> > > [42954211.590000] 794879c8: [<601a36e7>] xfs_trans_reserve+0x147/0x2e0
> > > [42954211.590000] 794879f8: [<60087374>] kmem_cache_alloc+0x84/0x100
> > > [42954211.590000] 79487a38: [<601ab01f>] xfs_inactive_symlink_rmt+0x9f/0x450
> > > [42954211.590000] 79487a88: [<601ada94>] kmem_zone_zalloc+0x34/0x50
> > > [42954211.590000] 79487aa8: [<601a3a6d>] _xfs_trans_alloc+0x2d/0x70
> > ....
> >
> > I came back to the system, and found that the hang had gone away - the
> > rm -rf had finished sometime in the ~36 hours between triggering the
> > problem and coming back to look at the corpse....
> >
> > So nothing to report yet.
>
> Got it now. I can reproduce this in a couple of minutes now that both
> the test fs and the fs hosting the UML fs images are using lazy-count=1
> (and the frequent 10s long host system freezes have gone away, too).
>
> Looks like *another* new memory allocation problem [1]:
[snip]
And having fixed that, I'm now seeing the log reservation hang:
[42950307.350000] xfsdatad/0 D 00000000407219f0 0 51 2
[42950307.350000] 7bd1acd8 7bd1a838 60498c40 81074000 81077b40 60014f0d 81044780 81074000
[42950307.350000] 81074000 7e15f808 7bd1a800 81044780 81077b90 60324bc1 81074000 00000250
[42950307.350000] 81074000 81074000 7fffffffffffffff 6646a168 80b6dd28 80b6ddf8 81077bf0 60324f0d <6>Call Trace:
[42950307.350000] 81077b08: [<60014f0d>] _switch_to+0x6d/0xe0
[42950307.350000] 81077b48: [<60324bc1>] schedule+0x171/0x2c0
[42950307.350000] 81077b98: [<60324f0d>] schedule_timeout+0xad/0xf0
[42950307.350000] 81077bb8: [<60326f38>] _spin_unlock_irqrestore+0x18/0x20
[42950307.350000] 81077bf8: [<601953e9>] xlog_grant_log_space+0x169/0x470
[42950307.350000] 81077c10: [<60030ba0>] default_wake_function+0x0/0x10
[42950307.350000] 81077c68: [<60195812>] xfs_log_reserve+0x122/0x140
[42950307.350000] 81077cb8: [<601a3757>] xfs_trans_reserve+0x147/0x2e0
[42950307.350000] 81077ce8: [<601adb14>] kmem_zone_zalloc+0x34/0x50
[42950307.350000] 81077d28: [<6018f985>] xfs_iomap_write_unwritten+0xa5/0x2d0
[42950307.350000] 81077d38: [<60326f38>] _spin_unlock_irqrestore+0x18/0x20
[42950307.350000] 81077d48: [<60085750>] cache_free_debugcheck+0x150/0x2e0
[42950307.350000] 81077d50: [<60063d12>] mempool_free_slab+0x12/0x20
[42950307.350000] 81077d88: [<60085e02>] kmem_cache_free+0x72/0xb0
[42950307.350000] 81077dc8: [<60063dbf>] mempool_free+0x4f/0x90
[42950307.350000] 81077e08: [<601af66d>] xfs_end_bio_unwritten+0x6d/0xa0
[42950307.350000] 81077e38: [<60048574>] run_workqueue+0xa4/0x180
[42950307.350000] 81077e50: [<601af600>] xfs_end_bio_unwritten+0x0/0xa0
[42950307.350000] 81077e58: [<6004c791>] prepare_to_wait+0x51/0x80
[42950307.350000] 81077e98: [<600488e0>] worker_thread+0x70/0xd0
[42950307.350000] 81077eb0: [<6004c5b0>] autoremove_wake_function+0x0/0x40
[42950307.350000] 81077ee8: [<60048870>] worker_thread+0x0/0xd0
[42950307.350000] 81077f08: [<6004c204>] kthread+0x64/0xb0
[42950307.350000] 81077f48: [<60026285>] run_kernel_thread+0x35/0x60
[42950307.350000] 81077f58: [<6004c1a0>] kthread+0x0/0xb0
[42950307.350000] 81077f98: [<60026278>] run_kernel_thread+0x28/0x60
[42950307.350000] 81077fc8: [<60014e71>] new_thread_handler+0x71/0xa0
Basically, the log is too small to fit the number of transaction reservations
that are currently being attempted (roughly 1000 parallel transactions), and so
xlog_grant_log_space() is sleeping. Because it is sleeping in the I/O completion
path, further I/O completions cannot run and the log tail can't move forward.
I think that at this point, we need a separate workqueue for unwritten extent
conversion to prevent it from blocking normal data and metadata I/O completion.
That way we can allow it to recurse on allocation and transaction reservation
without introducing I/O completion deadlocks....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: deadlock with latest xfs
2008-10-24 3:08 ` Lachlan McIlroy
2008-10-24 5:24 ` Dave Chinner
@ 2008-10-26 22:39 ` Dave Chinner
2008-10-27 2:30 ` Timothy Shimmin
2008-10-27 7:33 ` Lachlan McIlroy
1 sibling, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2008-10-26 22:39 UTC (permalink / raw)
To: Lachlan McIlroy; +Cc: Christoph Hellwig, xfs-oss
On Fri, Oct 24, 2008 at 01:08:55PM +1000, Lachlan McIlroy wrote:
> Christoph Hellwig wrote:
>> On Thu, Oct 23, 2008 at 07:17:30PM +1000, Lachlan McIlroy wrote:
>>> another problem with latest xfs
>>
>> Is this with the 2.6.27-based ptools/cvs tree or with the 2.6.28 based
>> git tree? It does looks more like a VM issue than a XFS issue to me.
>>
>
> It's with the 2.6.27-rc8 based ptools tree. Prior to checking
> in these patches:
>
> Can't lock inodes in radix tree preload region
> stop using xfs_itobp in xfs_bulkstat
> free partially initialized inodes using destroy_inode
>
> I was able to stress a system for about 4 hours before it ran out
> of memory. Now I hit the deadlock within a few minutes. I need
> to roll back to find which patch changed the behaviour.
Ok, I think I've found the regression - it's introduced by the AIL
cursor modifications. The patch below has been running for 15
minutes now on my UML box, which would have hung in a couple of
minutes otherwise.
FYI, the way I found this was:
- put a breakpoint on xfs_create() once the fs hung
- `touch /mnt/xfs2/fred` to trigger the break point.
- look at:
- mp->m_ail->xa_target
- mp->m_ail->xa_ail.next->li_lsn
- mp->m_log->l_tail_lsn
which indicated the push target was way ahead the
tail of the log, so AIL pushing was obviously not
happening otherwise we'd be making progress.
- added breakpoint on xfsaild_push() and continued
- xfsaild_push() bp triggered, looked at *last_lsn
and found it way behind the tail of the log (like
3 cycles behind), which meant the lookup would return
NULL instead of the first object and AIL pushing
would abort. Confirmed with single stepping.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
XFS: correctly select first log item to push
Under heavy metadata load we are seeing log hangs. The
AIL has items in it ready to be pushed, and they are within
the push target window. However, we are not pushing them
when the last pushed LSN is less than the LSN of the
first log item on the AIL. This is a regression introduced
by the AIL push cursor modifications.
---
fs/xfs/xfs_trans_ail.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 67ee466..2d47f10 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -228,7 +228,7 @@ xfs_trans_ail_cursor_first(
 	list_for_each_entry(lip, &ailp->xa_ail, li_ail) {
 		if (XFS_LSN_CMP(lip->li_lsn, lsn) >= 0)
-			break;
+			goto out;
 	}
 	lip = NULL;
 out:
* Re: deadlock with latest xfs
2008-10-26 2:50 ` Dave Chinner
2008-10-26 4:20 ` Dave Chinner
@ 2008-10-27 1:42 ` Lachlan McIlroy
2008-10-27 5:30 ` Dave Chinner
[not found] ` <200810281702.17135.nickpiggin@yahoo.com.au>
2 siblings, 1 reply; 20+ messages in thread
From: Lachlan McIlroy @ 2008-10-27 1:42 UTC (permalink / raw)
To: Lachlan McIlroy, Christoph Hellwig, xfs-oss, linux-mm
Dave Chinner wrote:
> On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
>> On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
>>> OK, I just hung a single-threaded rm -rf after this completed:
>>>
>>> # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
>>>
>>> It has hung with this trace:
>>>
>>> # echo w > /proc/sysrq-trigger
>> ....
>>> [42954211.590000] 794877f8: [<6002e40a>] update_curr+0x3a/0x50
>>> [42954211.590000] 79487818: [<60014f0d>] _switch_to+0x6d/0xe0
>>> [42954211.590000] 79487858: [<60324b21>] schedule+0x171/0x2c0
>>> [42954211.590000] 794878a8: [<60324e6d>] schedule_timeout+0xad/0xf0
>>> [42954211.590000] 794878c8: [<60326e98>] _spin_unlock_irqrestore+0x18/0x20
>>> [42954211.590000] 79487908: [<60195455>] xlog_grant_log_space+0x245/0x470
>>> [42954211.590000] 79487920: [<60030ba0>] default_wake_function+0x0/0x10
>>> [42954211.590000] 79487978: [<601957a2>] xfs_log_reserve+0x122/0x140
>>> [42954211.590000] 794879c8: [<601a36e7>] xfs_trans_reserve+0x147/0x2e0
>>> [42954211.590000] 794879f8: [<60087374>] kmem_cache_alloc+0x84/0x100
>>> [42954211.590000] 79487a38: [<601ab01f>] xfs_inactive_symlink_rmt+0x9f/0x450
>>> [42954211.590000] 79487a88: [<601ada94>] kmem_zone_zalloc+0x34/0x50
>>> [42954211.590000] 79487aa8: [<601a3a6d>] _xfs_trans_alloc+0x2d/0x70
>> ....
>>
>> I came back to the system, and found that the hang had gone away - the
>> rm -rf had finished sometime in the ~36 hours between triggering the
>> problem and coming back to look at the corpse....
>>
>> So nothing to report yet.
>
> Got it now. I can reproduce this in a couple of minutes now that both
> the test fs and the fs hosting the UML fs images are using lazy-count=1
> (and the frequent 10s long host system freezes have gone away, too).
>
> Looks like *another* new memory allocation problem [1]:
>
> [42950422.270000] xfsdatad/0 D 000000000043bf7a 0 51 2
> [42950422.270000] 804add98 804ad8f8 60498c40 80474000 804776a0 60014f0d 80442780 1000111a8
> [42950422.270000] 80474000 7ff1ac08 804ad8c0 80442780 804776f0 60324b21 80474000 80477700
> [42950422.270000] 80474000 1000111a8 80477700 0000000a 804777e0 80477950 80477750 60324e39 <6>Call Trace:
> [42950422.270000] 80477668: [<60014f0d>] _switch_to+0x6d/0xe0
> [42950422.270000] 804776a8: [<60324b21>] schedule+0x171/0x2c0
> [42950422.270000] 804776f8: [<60324e39>] schedule_timeout+0x79/0xf0
> [42950422.270000] 80477718: [<60040360>] process_timeout+0x0/0x10
> [42950422.270000] 80477758: [<60324619>] io_schedule_timeout+0x19/0x30
> [42950422.270000] 80477778: [<6006eb74>] congestion_wait+0x74/0xa0
> [42950422.270000] 80477790: [<6004c5b0>] autoremove_wake_function+0x0/0x40
> [42950422.270000] 804777d8: [<600692a0>] throttle_vm_writeout+0x80/0xa0
> [42950422.270000] 80477818: [<6006cdf4>] shrink_zone+0xac4/0xb10
> [42950422.270000] 80477828: [<601adb5b>] kmem_alloc+0x5b/0x140
> [42950422.270000] 804778c8: [<60186d48>] xfs_iext_inline_to_direct+0x68/0x80
> [42950422.270000] 804778f8: [<60187e38>] xfs_iext_realloc_direct+0x128/0x1c0
> [42950422.270000] 80477928: [<60188594>] xfs_iext_add+0xc4/0x290
> [42950422.270000] 80477978: [<60166388>] xfs_bmbt_set_all+0x18/0x20
> [42950422.270000] 80477988: [<601887c4>] xfs_iext_insert+0x64/0x80
> [42950422.270000] 804779c8: [<6006d75a>] try_to_free_pages+0x1ea/0x330
> [42950422.270000] 80477a40: [<6006ba40>] isolate_pages_global+0x0/0x40
> [42950422.270000] 80477a98: [<60067887>] __alloc_pages_internal+0x267/0x540
> [42950422.270000] 80477b68: [<60086b61>] cache_alloc_refill+0x4c1/0x970
> [42950422.270000] 80477b88: [<60326ea9>] _spin_unlock+0x9/0x10
> [42950422.270000] 80477bd8: [<6002ffc5>] __might_sleep+0x55/0x120
> [42950422.270000] 80477c08: [<601ad9cd>] kmem_zone_alloc+0x7d/0x110
> [42950422.270000] 80477c18: [<600873c3>] kmem_cache_alloc+0xd3/0x100
> [42950422.270000] 80477c58: [<601ad9cd>] kmem_zone_alloc+0x7d/0x110
> [42950422.270000] 80477ca8: [<601ada78>] kmem_zone_zalloc+0x18/0x50
> [42950422.270000] 80477cc8: [<601a3a6d>] _xfs_trans_alloc+0x2d/0x70
> [42950422.270000] 80477ce8: [<601a3b52>] xfs_trans_alloc+0xa2/0xb0
> [42950422.270000] 80477d18: [<60027655>] set_signals+0x35/0x40
> [42950422.270000] 80477d48: [<6018f93a>] xfs_iomap_write_unwritten+0x5a/0x260
> [42950422.270000] 80477d50: [<60063d12>] mempool_free_slab+0x12/0x20
> [42950422.270000] 80477d68: [<60027655>] set_signals+0x35/0x40
> [42950422.270000] 80477db8: [<60063d12>] mempool_free_slab+0x12/0x20
> [42950422.270000] 80477dc8: [<60063dbf>] mempool_free+0x4f/0x90
> [42950422.270000] 80477e18: [<601af5e5>] xfs_end_bio_unwritten+0x65/0x80
> [42950422.270000] 80477e38: [<60048574>] run_workqueue+0xa4/0x180
> [42950422.270000] 80477e50: [<601af580>] xfs_end_bio_unwritten+0x0/0x80
> [42950422.270000] 80477e58: [<6004c791>] prepare_to_wait+0x51/0x80
> [42950422.270000] 80477e98: [<600488e0>] worker_thread+0x70/0xd0
>
> We've entered memory reclaim inside the xfsdatad while trying to do
> unwritten extent completion during I/O completion, and that memory
> reclaim is now blocked waiting for I/O completion that cannot make
> progress.
>
> Nasty.
>
> My initial thought is to make _xfs_trans_alloc() able to take a KM_NOFS argument
> so we don't re-enter the FS here. If we get an ENOMEM in this case, we should
> then re-queue the I/O completion at the back of the workqueue and let other
> I/O completions progress before retrying this one. That way the I/O that
> is simply cleaning memory will make progress, hence allowing memory
> allocation to occur successfully when we retry this I/O completion...
It could work - unless it's a synchronous I/O in which case the I/O is not
complete until the extent conversion takes place.
Could we allocate the memory up front before the I/O is issued?
>
> XFS-folk - thoughts?
>
> [1] I don't see how any of the XFS changes we made make this easier to hit.
> What I suspect is a VM regression w.r.t. memory reclaim because this is
> the second problem since 2.6.26 that appears to be a result of memory
> allocation failures in places where we've never, ever seen failures before.
>
> The other new failure is this one:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=11805
>
> which is an alloc_pages(GFP_KERNEL) failure....
>
> mm-folk - care to weigh in?
>
> Cheers,
>
> Dave.
* Re: deadlock with latest xfs
2008-10-26 22:39 ` Dave Chinner
@ 2008-10-27 2:30 ` Timothy Shimmin
2008-10-27 5:47 ` Dave Chinner
2008-10-27 7:33 ` Lachlan McIlroy
1 sibling, 1 reply; 20+ messages in thread
From: Timothy Shimmin @ 2008-10-27 2:30 UTC (permalink / raw)
To: Dave Chinner; +Cc: Lachlan McIlroy, Christoph Hellwig, xfs-oss
Dave Chinner wrote:
> Ok, I think I've found the regression - it's introduced by the AIL
> cursor modifications. The patch below has been running for 15
> minutes now on my UML box that would have hung in a couple of
> minutes otherwise.
>
> FYI, the way I found this was:
>
> - put a breakpoint on xfs_create() once the fs hung
> - `touch /mnt/xfs2/fred` to trigger the break point.
> - look at:
> - mp->m_ail->xa_target
> - mp->m_ail->xa_ail.next->li_lsn
> - mp->m_log->l_tail_lsn
> which indicated the push target was way ahead the
> tail of the log, so AIL pushing was obviously not
> happening otherwise we'd be making progress.
> - added breakpoint on xfsaild_push() and continued
> - xfsaild_push() bp triggered, looked at *last_lsn
> and found it way behind the tail of the log (like
> 3 cycle behind), which meant that would return
> NULL instead of the first object and AIL pushing
> would abort. Confirmed with single stepping.
>
> Cheers,
>
> Dave.
> XFS: correctly select first log item to push
>
> Under heavy metadata load we are seeing log hangs. The
> AIL has items in it ready to be pushed, and they are within
> the push target window. However, we are not pushing them
> when the last pushed LSN is less than the LSN of the
> first log item on the AIL. This is a regression introduced
> by the AIL push cursor modifications.
> ---
> fs/xfs/xfs_trans_ail.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 67ee466..2d47f10 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -228,7 +228,7 @@ xfs_trans_ail_cursor_first(
>
> list_for_each_entry(lip, &ailp->xa_ail, li_ail) {
> if (XFS_LSN_CMP(lip->li_lsn, lsn) >= 0)
> - break;
> + goto out;
> }
> lip = NULL;
> out:
Yeah, the fix looks good. The previous code is pretty
obviously broken - a search which always returns NULL.
Which raises the question of the best way of testing this AIL code.
I dunno - it would be nice for independent testing of data structures
but perhaps that is too ambitious.
OOC, so the call path for this code....
xfsaild -> xfsaild_push(ailp, &last_pushed_lsn)
-> lip = xfs_trans_ail_cursor_first(ailp, cur, *last_lsn)
Initially, last_lsn = 0 in xfsaild
but it will be updated via last_pushed_lsn.
So it looks like things will work initially when lsn==0, because
xfs_trans_ail_cursor_first special cases that and uses the min.
But as soon as the lsn is set to non-zero,
xfs_trans_ail_cursor_first will return NULL,
and xfsaild_push will return early.
--Tim
* Re: deadlock with latest xfs
2008-10-27 1:42 ` Lachlan McIlroy
@ 2008-10-27 5:30 ` Dave Chinner
2008-10-27 6:29 ` Lachlan McIlroy
0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2008-10-27 5:30 UTC (permalink / raw)
To: Lachlan McIlroy; +Cc: Christoph Hellwig, xfs-oss, linux-mm
On Mon, Oct 27, 2008 at 12:42:09PM +1100, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
>>> On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
>>>> OK, I just hung a single-threaded rm -rf after this completed:
>>>>
>>>> # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
>>>>
>>>> It has hung with this trace:
....
>> Got it now. I can reproduce this in a couple of minutes now that both
>> the test fs and the fs hosting the UML fs images are using lazy-count=1
>> (and the frequent 10s long host system freezes have gone away, too).
>>
>> Looks like *another* new memory allocation problem [1]:
.....
>> We've entered memory reclaim inside the xfsdatad while trying to do
>> unwritten extent completion during I/O completion, and that memory
>> reclaim is now blocked waiting for I/o completion that cannot make
>> progress.
>>
>> Nasty.
>>
>> My initial though is to make _xfs_trans_alloc() able to take a KM_NOFS argument
>> so we don't re-enter the FS here. If we get an ENOMEM in this case, we should
>> then re-queue the I/O completion at the back of the workqueue and let other
>> I/o completions progress before retrying this one. That way the I/O that
>> is simply cleaning memory will make progress, hence allowing memory
>> allocation to occur successfully when we retry this I/O completion...
> It could work - unless it's a synchronous I/O in which case the I/O is not
> complete until the extent conversion takes place.
Right. Pushing unwritten extent conversion onto a different
workqueue is probably the only way to handle this easily.
That's the same solution Irix has been using for a long time
(the xfsc thread)....
> Could we allocate the memory up front before the I/O is issued?
Possibly, but that will create more memory pressure than
allocation in I/O completion because now we could need to hold
thousands of allocations across an I/O - think of the case where
we are running low on memory and have a disk subsystem capable of
a few hundred thousand I/Os per second. The allocation failing would
prevent the I/Os from being issued, and if this is buffered writes
into unwritten extents we'd be preventing dirty pages from being
cleaned....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: deadlock with latest xfs
2008-10-27 2:30 ` Timothy Shimmin
@ 2008-10-27 5:47 ` Dave Chinner
0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2008-10-27 5:47 UTC (permalink / raw)
To: Timothy Shimmin; +Cc: Lachlan McIlroy, Christoph Hellwig, xfs-oss
On Mon, Oct 27, 2008 at 01:30:58PM +1100, Timothy Shimmin wrote:
> Dave Chinner wrote:
> > Ok, I think I've found the regression - it's introduced by the AIL
> > cursor modifications. The patch below has been running for 15
> > minutes now on my UML box that would have hung in a couple of
> > minutes otherwise.
.....
> Yeah, the fix looks good. The previous code is pretty
> obviously broken - a search which always returns NULL.
>
> Which begs the question on the best way of testing this ail code.
> I dunno - it would be nice for independent testing of data structures
> but perhaps that is too ambitious.
>
> OOC, so the call path for this code....
> xfsaild -> xfsaild_push(ailp, &last_pushed_lsn)
> -> lip = xfs_trans_ail_cursor_first(ailp, cur, *last_lsn)
> Initially, last_lsn = 0 in xfsaild
> but it will be updated via last_pushed_lsn.
Right.
> So it looks like things will work initially when lsn==0, because
> xfs_trans_ail_cursor_first special cases that and uses the min.
> But as soon as the lsn is set to non-zero,
> xfs_trans_ail_cursor_first will return NULL,
> and xfsaild_push will return early.
Right - that was the bug. With the fix we will only return NULL if
we walk off the end of the AIL list before we get to the LSN being
requested to start at. Otherwise we jump over the "lip = NULL" and
start at the first log item with an LSN greater than or equal to the
last_lsn....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: deadlock with latest xfs
2008-10-27 5:30 ` Dave Chinner
@ 2008-10-27 6:29 ` Lachlan McIlroy
2008-10-27 6:54 ` Dave Chinner
0 siblings, 1 reply; 20+ messages in thread
From: Lachlan McIlroy @ 2008-10-27 6:29 UTC (permalink / raw)
To: Lachlan McIlroy, Christoph Hellwig, xfs-oss, linux-mm
Dave Chinner wrote:
> On Mon, Oct 27, 2008 at 12:42:09PM +1100, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
>>>> On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
>>>>> OK, I just hung a single-threaded rm -rf after this completed:
>>>>>
>>>>> # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
>>>>>
>>>>> It has hung with this trace:
> ....
>>> Got it now. I can reproduce this in a couple of minutes now that both
>>> the test fs and the fs hosting the UML fs images are using lazy-count=1
>>> (and the frequent 10s long host system freezes have gone away, too).
>>>
>>> Looks like *another* new memory allocation problem [1]:
> .....
>>> We've entered memory reclaim inside the xfsdatad while trying to do
>>> unwritten extent completion during I/O completion, and that memory
>>> reclaim is now blocked waiting for I/o completion that cannot make
>>> progress.
>>>
>>> Nasty.
>>>
>>> My initial though is to make _xfs_trans_alloc() able to take a KM_NOFS argument
>>> so we don't re-enter the FS here. If we get an ENOMEM in this case, we should
>>> then re-queue the I/O completion at the back of the workqueue and let other
>>> I/o completions progress before retrying this one. That way the I/O that
>>> is simply cleaning memory will make progress, hence allowing memory
>>> allocation to occur successfully when we retry this I/O completion...
>> It could work - unless it's a synchronous I/O in which case the I/O is not
>> complete until the extent conversion takes place.
>
> Right. Pushing unwritten extent conversion onto a different
> workqueue is probably the only way to handle this easily.
> That's the same solution Irix has been using for a long time
> (the xfsc thread)....
Would that be a workqueue specific to one filesystem? Right now our
workqueues are per-cpu so they can contain I/O completions for multiple
filesystems.
>
>> Could we allocate the memory up front before the I/O is issued?
>
> Possibly, but that will create more memory pressure than
> allocation in I/O completion because now we could need to hold
> thousands of allocations across an I/O - think of the case where
> we are running low on memory and have a disk subsystem capable of
> a few hundred thousand I/Os per second. the allocation failing would
> prevent the I/os from being issued, and if this is buffered writes
> into unwritten extents we'd be preventing dirty pages from being
> cleaned....
The allocation has to be done sometime - if we have a few hundred thousand
I/Os per second then the queue of unwritten extent conversion requests
is going to grow very quickly. If a separate workqueue will fix this
then that's a better solution anyway.
* Re: deadlock with latest xfs
2008-10-27 6:29 ` Lachlan McIlroy
@ 2008-10-27 6:54 ` Dave Chinner
2008-10-27 7:31 ` Lachlan McIlroy
0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2008-10-27 6:54 UTC (permalink / raw)
To: Lachlan McIlroy; +Cc: Christoph Hellwig, xfs-oss, linux-mm
On Mon, Oct 27, 2008 at 05:29:50PM +1100, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Mon, Oct 27, 2008 at 12:42:09PM +1100, Lachlan McIlroy wrote:
>>> Dave Chinner wrote:
>>>> On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
>>>>> On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
>>>>>> OK, I just hung a single-threaded rm -rf after this completed:
>>>>>>
>>>>>> # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
>>>>>>
>>>>>> It has hung with this trace:
>> ....
>>>> Got it now. I can reproduce this in a couple of minutes now that both
>>>> the test fs and the fs hosting the UML fs images are using lazy-count=1
>>>> (and the frequent 10s long host system freezes have gone away, too).
>>>>
>>>> Looks like *another* new memory allocation problem [1]:
>> .....
>>>> We've entered memory reclaim inside the xfsdatad while trying to do
>>>> unwritten extent completion during I/O completion, and that memory
>>>> reclaim is now blocked waiting for I/o completion that cannot make
>>>> progress.
>>>>
>>>> Nasty.
>>>>
>>>> My initial though is to make _xfs_trans_alloc() able to take a KM_NOFS argument
>>>> so we don't re-enter the FS here. If we get an ENOMEM in this case, we should
>>>> then re-queue the I/O completion at the back of the workqueue and let other
>>>> I/o completions progress before retrying this one. That way the I/O that
>>>> is simply cleaning memory will make progress, hence allowing memory
>>>> allocation to occur successfully when we retry this I/O completion...
>>> It could work - unless it's a synchronous I/O in which case the I/O is not
>>> complete until the extent conversion takes place.
>>
>> Right. Pushing unwritten extent conversion onto a different
>> workqueue is probably the only way to handle this easily.
>> That's the same solution Irix has been using for a long time
>> (the xfsc thread)....
>
> Would that be a workqueue specific to one filesystem? Right now our
> workqueues are per-cpu so they can contain I/O completions for multiple
> filesystems.
I've simply implemented another per-cpu workqueue set.
>>> Could we allocate the memory up front before the I/O is issued?
>>
>> Possibly, but that will create more memory pressure than
>> allocation in I/O completion because now we could need to hold
>> thousands of allocations across an I/O - think of the case where
>> we are running low on memory and have a disk subsystem capable of
>> a few hundred thousand I/Os per second. the allocation failing would
>> prevent the I/os from being issued, and if this is buffered writes
>> into unwritten extents we'd be preventing dirty pages from being
>> cleaned....
>
> The allocation has to be done sometime - if have a few hundred thousand
> I/Os per second then the queue of unwritten extent conversion requests
> is going to grow very quickly.
Sure, but the difference is that in a workqueue we are doing:
alloc
free
alloc
free
.....
alloc
free
So the instantaneous memory usage is bound by the number of
workqueue threads doing conversions. The "pre-allocate" case is:
alloc
alloc
alloc
alloc
......
<io completes>
free
.....
<io_completes>
free
.....
so the allocation is bound by the number of parallel I/Os we have
not completed. Given that the transaction structure is *800* bytes,
they will consume memory very quickly if pre-allocated before the
I/O is dispatched.
> If a separate workqueue will fix this
> then that's a better solution anyway.
I think so. The patch I have been testing is below.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
XFS: Prevent unwritten extent conversion from blocking I/O completion
Unwritten extent conversion can recurse back into the filesystem due
to memory allocation. Memory reclaim requires I/O completions to be
processed to allow the callers to make progress. If the I/O
completion workqueue thread is doing the recursion, then we have a
deadlock situation.
Move unwritten extent completion into its own workqueue so it
doesn't block I/O completions for normal delayed allocation or
overwrite data.
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
fs/xfs/linux-2.6/xfs_aops.c | 38 +++++++++++++++++++++-----------------
fs/xfs/linux-2.6/xfs_aops.h | 1 +
fs/xfs/linux-2.6/xfs_buf.c | 9 +++++++++
3 files changed, 31 insertions(+), 17 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 6f4ebd0..f8fa620 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -119,23 +119,6 @@ xfs_find_bdev_for_inode(
}
/*
- * Schedule IO completion handling on a xfsdatad if this was
- * the final hold on this ioend. If we are asked to wait,
- * flush the workqueue.
- */
-STATIC void
-xfs_finish_ioend(
- xfs_ioend_t *ioend,
- int wait)
-{
- if (atomic_dec_and_test(&ioend->io_remaining)) {
- queue_work(xfsdatad_workqueue, &ioend->io_work);
- if (wait)
- flush_workqueue(xfsdatad_workqueue);
- }
-}
-
-/*
* We're now finished for good with this ioend structure.
* Update the page state via the associated buffer_heads,
* release holds on the inode and bio, and finally free
@@ -266,6 +249,27 @@ xfs_end_bio_read(
}
/*
+ * Schedule IO completion handling on a xfsdatad if this was
+ * the final hold on this ioend. If we are asked to wait,
+ * flush the workqueue.
+ */
+STATIC void
+xfs_finish_ioend(
+ xfs_ioend_t *ioend,
+ int wait)
+{
+ if (atomic_dec_and_test(&ioend->io_remaining)) {
+ struct workqueue_struct *wq = xfsdatad_workqueue;
+ if (ioend->io_work.func == xfs_end_bio_unwritten)
+ wq = xfsconvertd_workqueue;
+
+ queue_work(wq, &ioend->io_work);
+ if (wait)
+ flush_workqueue(wq);
+ }
+}
+
+/*
* Allocate and initialise an IO completion structure.
* We need to track unwritten extent write completion here initially.
* We'll need to extend this for updating the ondisk inode size later
diff --git a/fs/xfs/linux-2.6/xfs_aops.h b/fs/xfs/linux-2.6/xfs_aops.h
index 3ba0631..7643f82 100644
--- a/fs/xfs/linux-2.6/xfs_aops.h
+++ b/fs/xfs/linux-2.6/xfs_aops.h
@@ -19,6 +19,7 @@
#define __XFS_AOPS_H__
extern struct workqueue_struct *xfsdatad_workqueue;
+extern struct workqueue_struct *xfsconvertd_workqueue;
extern mempool_t *xfs_ioend_pool;
typedef void (*xfs_ioend_func_t)(void *);
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 36d5fcd..c1f55b3 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -45,6 +45,7 @@ static struct shrinker xfs_buf_shake = {
static struct workqueue_struct *xfslogd_workqueue;
struct workqueue_struct *xfsdatad_workqueue;
+struct workqueue_struct *xfsconvertd_workqueue;
#ifdef XFS_BUF_TRACE
void
@@ -1756,6 +1757,7 @@ xfs_flush_buftarg(
xfs_buf_t *bp, *n;
int pincount = 0;
+ xfs_buf_runall_queues(xfsconvertd_workqueue);
xfs_buf_runall_queues(xfsdatad_workqueue);
xfs_buf_runall_queues(xfslogd_workqueue);
@@ -1812,9 +1814,15 @@ xfs_buf_init(void)
if (!xfsdatad_workqueue)
goto out_destroy_xfslogd_workqueue;
+ xfsconvertd_workqueue = create_workqueue("xfsconvertd");
+ if (!xfsconvertd_workqueue)
+ goto out_destroy_xfsdatad_workqueue;
+
register_shrinker(&xfs_buf_shake);
return 0;
+ out_destroy_xfsdatad_workqueue:
+ destroy_workqueue(xfsdatad_workqueue);
out_destroy_xfslogd_workqueue:
destroy_workqueue(xfslogd_workqueue);
out_free_buf_zone:
@@ -1830,6 +1838,7 @@ void
xfs_buf_terminate(void)
{
unregister_shrinker(&xfs_buf_shake);
+ destroy_workqueue(xfsconvertd_workqueue);
destroy_workqueue(xfsdatad_workqueue);
destroy_workqueue(xfslogd_workqueue);
kmem_zone_destroy(xfs_buf_zone);
* Re: deadlock with latest xfs
2008-10-27 6:54 ` Dave Chinner
@ 2008-10-27 7:31 ` Lachlan McIlroy
0 siblings, 0 replies; 20+ messages in thread
From: Lachlan McIlroy @ 2008-10-27 7:31 UTC (permalink / raw)
To: Lachlan McIlroy, Christoph Hellwig, xfs-oss, linux-mm
Dave Chinner wrote:
> On Mon, Oct 27, 2008 at 05:29:50PM +1100, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Mon, Oct 27, 2008 at 12:42:09PM +1100, Lachlan McIlroy wrote:
>>>> Dave Chinner wrote:
>>>>> On Sun, Oct 26, 2008 at 11:53:51AM +1100, Dave Chinner wrote:
>>>>>> On Fri, Oct 24, 2008 at 05:48:04PM +1100, Dave Chinner wrote:
>>>>>>> OK, I just hung a single-threaded rm -rf after this completed:
>>>>>>>
>>>>>>> # fsstress -p 1024 -n 100 -d /mnt/xfs2/fsstress
>>>>>>>
>>>>>>> It has hung with this trace:
>>> ....
>>>>> Got it now. I can reproduce this in a couple of minutes now that both
>>>>> the test fs and the fs hosting the UML fs images are using lazy-count=1
>>>>> (and the frequent 10s long host system freezes have gone away, too).
>>>>>
>>>>> Looks like *another* new memory allocation problem [1]:
>>> .....
>>>>> We've entered memory reclaim inside the xfsdatad while trying to do
>>>>> unwritten extent completion during I/O completion, and that memory
>>>>> reclaim is now blocked waiting for I/o completion that cannot make
>>>>> progress.
>>>>>
>>>>> Nasty.
>>>>>
>>>>> My initial though is to make _xfs_trans_alloc() able to take a KM_NOFS argument
>>>>> so we don't re-enter the FS here. If we get an ENOMEM in this case, we should
>>>>> then re-queue the I/O completion at the back of the workqueue and let other
>>>>> I/o completions progress before retrying this one. That way the I/O that
>>>>> is simply cleaning memory will make progress, hence allowing memory
>>>>> allocation to occur successfully when we retry this I/O completion...
>>>> It could work - unless it's a synchronous I/O in which case the I/O is not
>>>> complete until the extent conversion takes place.
>>> Right. Pushing unwritten extent conversion onto a different
>>> workqueue is probably the only way to handle this easily.
>>> That's the same solution Irix has been using for a long time
>>> (the xfsc thread)....
>> Would that be a workqueue specific to one filesystem? Right now our
>> workqueues are per-cpu so they can contain I/O completions for multiple
>> filesystems.
>
> I've simply implemented another per-cpu workqueue set.
>
>>>> Could we allocate the memory up front before the I/O is issued?
>>> Possibly, but that will create more memory pressure than
>>> allocation in I/O completion because now we could need to hold
>>> thousands of allocations across an I/O - think of the case where
>>> we are running low on memory and have a disk subsystem capable of
>>> a few hundred thousand I/Os per second. the allocation failing would
>>> prevent the I/os from being issued, and if this is buffered writes
>>> into unwritten extents we'd be preventing dirty pages from being
>>> cleaned....
>> The allocation has to be done sometime - if have a few hundred thousand
>> I/Os per second then the queue of unwritten extent conversion requests
>> is going to grow very quickly.
>
> Sure, but the difference is that in a workqueue we are doing:
>
> alloc
> free
> alloc
> free
> .....
> alloc
> free
>
> So the instantaneous memory usage is bound by the number of
> workqueue threads doing conversions. The "pre-allocate" case is:
>
> alloc
> alloc
> alloc
> alloc
> ......
> <io completes>
> free
> .....
> <io_completes>
> free
> .....
>
> so the allocation is bound by the number of parallel I/Os we have
> not completed. Given that the transaction structure is *800* bytes,
> they will consume memory very quickly if pre-allocated before the
> I/O is dispatched.
Ah, yes of course I see your point. It would only really work for
synchronous I/O.
Even with the current code we could have queues that grow very large
because buffered writes to unwritten extents don't wait for the
conversion. So even for the small amount of memory we allocate for
each queue entry we still could consume a lot in total.
>
>> If a separate workqueue will fix this
>> then that's a better solution anyway.
>
> I think so. The patch I have been testing is below.
Thanks, I'll add it to the list.
* Re: deadlock with latest xfs
2008-10-26 22:39 ` Dave Chinner
2008-10-27 2:30 ` Timothy Shimmin
@ 2008-10-27 7:33 ` Lachlan McIlroy
1 sibling, 0 replies; 20+ messages in thread
From: Lachlan McIlroy @ 2008-10-27 7:33 UTC (permalink / raw)
To: Lachlan McIlroy, Christoph Hellwig, xfs-oss
Dave Chinner wrote:
> On Fri, Oct 24, 2008 at 01:08:55PM +1000, Lachlan McIlroy wrote:
>> Christoph Hellwig wrote:
>>> On Thu, Oct 23, 2008 at 07:17:30PM +1000, Lachlan McIlroy wrote:
>>>> another problem with latest xfs
>>> Is this with the 2.6.27-based ptools/cvs tree or with the 2.6.28 based
>>> git tree? It does looks more like a VM issue than a XFS issue to me.
>>>
>> It's with the 2.6.27-rc8 based ptools tree. Prior to checking
>> in these patches:
>>
>> Can't lock inodes in radix tree preload region
>> stop using xfs_itobp in xfs_bulkstat
>> free partially initialized inodes using destroy_inode
>>
>> I was able to stress a system for about 4 hours before it ran out
>> of memory. Now I hit the deadlock within a few minutes. I need
>> to roll back to find which patch changed the behaviour.
>
> Ok, I think I've found the regression - it's introduced by the AIL
> cursor modifications. The patch below has been running for 15
> minutes now on my UML box that would have hung in a couple of
> minutes otherwise.
Yep, looks good here too. My test system has been up at least an hour
and still chugging.
>
> FYI, the way I found this was:
>
> - put a breakpoint on xfs_create() once the fs hung
> - `touch /mnt/xfs2/fred` to trigger the break point.
> - look at:
> - mp->m_ail->xa_target
> - mp->m_ail->xa_ail.next->li_lsn
> - mp->m_log->l_tail_lsn
> which indicated the push target was way ahead the
> tail of the log, so AIL pushing was obviously not
> happening otherwise we'd be making progress.
> - added breakpoint on xfsaild_push() and continued
> - xfsaild_push() bp triggered, looked at *last_lsn
> and found it way behind the tail of the log (like
> 3 cycle behind), which meant that would return
> NULL instead of the first object and AIL pushing
> would abort. Confirmed with single stepping.
>
> Cheers,
>
> Dave.
* Re: deadlock with latest xfs
[not found] ` <200810281702.17135.nickpiggin@yahoo.com.au>
@ 2008-10-28 6:25 ` Dave Chinner
0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2008-10-28 6:25 UTC (permalink / raw)
To: Nick Piggin; +Cc: Lachlan McIlroy, Christoph Hellwig, xfs-oss, linux-mm
On Tue, Oct 28, 2008 at 05:02:16PM +1100, Nick Piggin wrote:
> On Sunday 26 October 2008 13:50, Dave Chinner wrote:
>
> > [1] I don't see how any of the XFS changes we made make this easier to hit.
> > What I suspect is a VM regression w.r.t. memory reclaim because this is
> > the second problem since 2.6.26 that appears to be a result of memory
> > allocation failures in places that we've never, ever seen failures before.
> >
> > The other new failure is this one:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=11805
> >
> > which is an alloc_pages(GFP_KERNEL) failure....
> >
> mm-folk - care to weigh in?
>
> An order-0 GFP_KERNEL page allocation can fail sometimes:
> if it is called from reclaim or a PF_MEMALLOC thread; if
> the task is OOM-killed; or under fault injection.
>
> This is even the case for __GFP_NOFAIL allocations (which basically
> are buggy anyway).
>
> Not sure why it might have started happening, but I didn't
> see exactly which alloc_pages call you are talking about.
> If it is via slab, then maybe some parameters have changed
> (eg. in SLUB) such that it is using higher order allocations.
In fs/xfs/linux-2.6/xfs_buf.c::xfs_buf_get_noaddr(). It's doing a
single page allocation at a time.
It may be that this failure is caused by an increase in the
base memory consumption of the kernel, as this failure was
reported in an lguest guest and reproduced with a simple
'modprobe xfs ; mount /dev/xxx /mnt/xfs' command. Maybe the
guest had very little memory available to begin with, and
trying to allocate 2MB of pages for 8x256k log buffers may
have been too much for it...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
Thread overview: 20+ messages
2008-10-23 9:17 deadlock with latest xfs Lachlan McIlroy
2008-10-23 20:57 ` Christoph Hellwig
2008-10-23 22:28 ` Dave Chinner
2008-10-24 3:08 ` Lachlan McIlroy
2008-10-24 5:24 ` Dave Chinner
2008-10-24 6:48 ` Dave Chinner
2008-10-26 0:53 ` Dave Chinner
2008-10-26 2:50 ` Dave Chinner
2008-10-26 4:20 ` Dave Chinner
2008-10-27 1:42 ` Lachlan McIlroy
2008-10-27 5:30 ` Dave Chinner
2008-10-27 6:29 ` Lachlan McIlroy
2008-10-27 6:54 ` Dave Chinner
2008-10-27 7:31 ` Lachlan McIlroy
[not found] ` <200810281702.17135.nickpiggin@yahoo.com.au>
2008-10-28 6:25 ` Dave Chinner
2008-10-24 8:46 ` Lachlan McIlroy
2008-10-26 22:39 ` Dave Chinner
2008-10-27 2:30 ` Timothy Shimmin
2008-10-27 5:47 ` Dave Chinner
2008-10-27 7:33 ` Lachlan McIlroy