* sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
@ 2009-06-08 2:37 Wu Fengguang
2009-06-08 4:55 ` KOSAKI Motohiro
2009-07-06 10:52 ` Herbert Xu
0 siblings, 2 replies; 17+ messages in thread
From: Wu Fengguang @ 2009-06-08 2:37 UTC (permalink / raw)
To: LKML; +Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
Hi,
This lockdep warning appears when doing stress memory tests over NFS.
page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
Any ideas?
Thanks,
Fengguang
---
[ 1630.751276] NFS: Server wrote zero bytes, expected 4096.
[ 1637.984875]
[ 1637.984878] =================================
[ 1637.987429] [ INFO: inconsistent lock state ]
[ 1637.987429] 2.6.30-rc8-mm1 #299
[ 1637.987429] ---------------------------------
[ 1637.987429] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 1637.987429] kswapd0/387 [HC0[0]:SC0[1]:HE1:SE0] takes:
[ 1637.987429] (sk_lock-AF_INET-RPC){+.+.?.}, at: [<ffffffff81458972>] tcp_sendmsg+0x22/0xbc0
[ 1637.987429] {RECLAIM_FS-ON-W} state was registered at:
[ 1637.987429] [<ffffffff81079b18>] mark_held_locks+0x68/0x90
[ 1637.987429] [<ffffffff81079c35>] lockdep_trace_alloc+0xf5/0x100
[ 1637.987429] [<ffffffff810c7f55>] __alloc_pages_nodemask+0x95/0x6c0
[ 1637.987429] [<ffffffff810f71f9>] __slab_alloc_page+0xb9/0x3b0
[ 1637.987429] [<ffffffff810f8596>] kmem_cache_alloc_node+0x166/0x200
[ 1637.987429] [<ffffffff81423cba>] __alloc_skb+0x4a/0x160
[ 1637.987429] [<ffffffff81466da6>] tcp_send_fin+0x86/0x1a0
[ 1637.987429] [<ffffffff814573c0>] tcp_close+0x3f0/0x4b0
[ 1637.987429] [<ffffffff81478dd2>] inet_release+0x42/0x70
[ 1637.987429] [<ffffffff8141b2d4>] sock_release+0x24/0x90
[ 1637.987429] [<ffffffff814d8fd8>] xs_reset_transport+0xb8/0xd0
[ 1637.987429] [<ffffffff814d900d>] xs_close+0x1d/0x60
[ 1637.987429] [<ffffffff814d9082>] xs_destroy+0x32/0xa0
[ 1637.987429] [<ffffffff814d69be>] xprt_destroy+0x6e/0x90
[ 1637.987429] [<ffffffff8126ed77>] kref_put+0x37/0x70
[ 1637.987429] [<ffffffff814d6940>] xprt_put+0x10/0x20
[ 1637.987429] [<ffffffff814d5e2b>] rpc_free_client+0x8b/0x100
[ 1637.987429] [<ffffffff8126ed77>] kref_put+0x37/0x70
[ 1637.987429] [<ffffffff814d5ee1>] rpc_free_auth+0x41/0x70
[ 1637.987429] [<ffffffff8126ed77>] kref_put+0x37/0x70
[ 1637.987429] [<ffffffff814d5d5e>] rpc_release_client+0x2e/0x70
[ 1637.987429] [<ffffffff814d5f5c>] rpc_shutdown_client+0x4c/0xf0
[ 1637.987429] [<ffffffff814e52b1>] rpcb_getport_sync+0xa1/0xf0
[ 1637.987429] [<ffffffff81814f5f>] nfs_root_data+0x3a9/0x40a
[ 1637.987429] [<ffffffff817f63ae>] mount_root+0x1f/0x141
[ 1637.987429] [<ffffffff817f65c8>] prepare_namespace+0xf8/0x190
[ 1637.987429] [<ffffffff817f5728>] kernel_init+0x1b5/0x1d2
[ 1637.987429] [<ffffffff8100d0ca>] child_rip+0xa/0x20
[ 1637.987429] [<ffffffffffffffff>] 0xffffffffffffffff
[ 1637.987429] irq event stamp: 285158
[ 1637.987429] hardirqs last enabled at (285157): [<ffffffff81544cff>] _spin_unlock_irqrestore+0x3f/0x70
[ 1637.987429] hardirqs last disabled at (285156): [<ffffffff8154506d>] _spin_lock_irqsave+0x2d/0x90
[ 1637.987429] softirqs last enabled at (285152): [<ffffffff814d7f8f>] xprt_transmit+0x1bf/0x2d0
[ 1637.987429] softirqs last disabled at (285158): [<ffffffff81544f67>] _spin_lock_bh+0x17/0x70
[ 1637.987429]
[ 1637.987429] other info that might help us debug this:
[ 1637.987429] no locks held by kswapd0/387.
[ 1637.987429]
[ 1637.987429] stack backtrace:
[ 1637.987429] Pid: 387, comm: kswapd0 Not tainted 2.6.30-rc8-mm1 #299
[ 1637.987429] Call Trace:
[ 1637.987429] [<ffffffff810793bc>] print_usage_bug+0x18c/0x1f0
[ 1638.251441] [<ffffffff810798bf>] mark_lock+0x49f/0x690
[ 1638.259418] [<ffffffff8107a310>] ? check_usage_forwards+0x0/0xc0
[ 1638.267420] [<ffffffff8107b039>] __lock_acquire+0x289/0x1b40
[ 1638.267420] [<ffffffff810127f0>] ? native_sched_clock+0x20/0x80
[ 1638.277673] [<ffffffff8107c9d1>] lock_acquire+0xe1/0x120
[ 1638.277673] [<ffffffff81458972>] ? tcp_sendmsg+0x22/0xbc0
[ 1638.287418] [<ffffffff8141d9b5>] lock_sock_nested+0x105/0x120
[ 1638.287418] [<ffffffff81458972>] ? tcp_sendmsg+0x22/0xbc0
[ 1638.287418] [<ffffffff810127f0>] ? native_sched_clock+0x20/0x80
[ 1638.287418] [<ffffffff81458972>] tcp_sendmsg+0x22/0xbc0
[ 1638.287418] [<ffffffff81077944>] ? find_usage_forwards+0x94/0xd0
[ 1638.287418] [<ffffffff81077944>] ? find_usage_forwards+0x94/0xd0
[ 1638.287418] [<ffffffff8141ad2f>] sock_sendmsg+0xdf/0x110
[ 1638.287418] [<ffffffff81077944>] ? find_usage_forwards+0x94/0xd0
[ 1638.287418] [<ffffffff81066ae0>] ? autoremove_wake_function+0x0/0x40
[ 1638.287418] [<ffffffff8107a36e>] ? check_usage_forwards+0x5e/0xc0
[ 1638.287418] [<ffffffff8107966e>] ? mark_lock+0x24e/0x690
[ 1638.287418] [<ffffffff8141b0b4>] kernel_sendmsg+0x34/0x50
[ 1638.287418] [<ffffffff814d9234>] xs_send_kvec+0x94/0xa0
[ 1638.287418] [<ffffffff81079e55>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1638.287418] [<ffffffff814d92bd>] xs_sendpages+0x7d/0x220
[ 1638.287418] [<ffffffff814d95b9>] xs_tcp_send_request+0x59/0x190
[ 1638.287418] [<ffffffff814d7e4e>] xprt_transmit+0x7e/0x2d0
[ 1638.287418] [<ffffffff814d4eb8>] call_transmit+0x1c8/0x2a0
[ 1638.287418] [<ffffffff814dccb2>] __rpc_execute+0xb2/0x2b0
[ 1638.287418] [<ffffffff814dced8>] rpc_execute+0x28/0x30
[ 1638.403414] [<ffffffff814d5b5b>] rpc_run_task+0x3b/0x80
[ 1638.403414] [<ffffffff811c7b6d>] nfs_write_rpcsetup+0x1ad/0x250
[ 1638.403414] [<ffffffff811c9b69>] nfs_flush_one+0xb9/0x100
[ 1638.419417] [<ffffffff811c3f82>] nfs_pageio_doio+0x32/0x70
[ 1638.419417] [<ffffffff811c3fc9>] nfs_pageio_complete+0x9/0x10
[ 1638.427413] [<ffffffff811c7ee5>] nfs_writepage_locked+0x85/0xc0
[ 1638.435414] [<ffffffff811c9ab0>] ? nfs_flush_one+0x0/0x100
[ 1638.435414] [<ffffffff81079e55>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1638.435414] [<ffffffff811c8509>] nfs_writepage+0x19/0x40
[ 1638.435414] [<ffffffff810ce005>] shrink_page_list+0x675/0x810
[ 1638.435414] [<ffffffff810127f0>] ? native_sched_clock+0x20/0x80
[ 1638.435414] [<ffffffff810ce761>] shrink_list+0x301/0x650
[ 1638.435414] [<ffffffff810ced23>] shrink_zone+0x273/0x370
[ 1638.435414] [<ffffffff810cf9f9>] kswapd+0x729/0x7a0
[ 1638.435414] [<ffffffff810cca80>] ? isolate_pages_global+0x0/0x250
[ 1638.435414] [<ffffffff81066ae0>] ? autoremove_wake_function+0x0/0x40
[ 1638.435414] [<ffffffff810cf2d0>] ? kswapd+0x0/0x7a0
[ 1638.435414] [<ffffffff810666de>] kthread+0x9e/0xb0
[ 1638.435414] [<ffffffff8100d0ca>] child_rip+0xa/0x20
[ 1638.435414] [<ffffffff8100ca90>] ? restore_args+0x0/0x30
[ 1638.435414] [<ffffffff81066640>] ? kthread+0x0/0xb0
[ 1638.435414] [<ffffffff8100d0c0>] ? child_rip+0x0/0x20

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: KOSAKI Motohiro @ 2009-06-08 4:55 UTC (permalink / raw)
To: Wu Fengguang
Cc: kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, LKML,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA

Hi

> Hi,
>
> This lockdep warning appears when doing stress memory tests over NFS.
>
> page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
>
> tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
>
> Any ideas?

AFAIK, btrfs has re-dirty hack.

------------------------------------------------------------------
static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
{
	struct extent_io_tree *tree;

	if (current->flags & PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}
	tree = &BTRFS_I(page->mapping->host)->io_tree;
	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
}
---------------------------------------------------------------

PF_MEMALLOC mean caller is try_to_free_pages(). (not normal write nor kswapd)
Can't nfs does similar hack?

I'm not net nor nfs expert. perhaps I'm wrong :-)

Thanks.
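
For concreteness, a minimal sketch of what KOSAKI's suggestion would look like if transplanted into NFS. This is purely illustrative and not a patch from this thread: the non-guarded part of nfs_writepage() is only approximated from kernels of this era, and whether redirtying from reclaim context is an acceptable trade-off for NFS is exactly what the replies below go on to debate.

------------------------------------------------------------------
/*
 * Hypothetical adaptation of the btrfs PF_MEMALLOC guard to NFS.
 * Only the PF_MEMALLOC branch is new; the rest approximates the
 * existing nfs_writepage() of this era.
 */
static int nfs_writepage(struct page *page, struct writeback_control *wbc)
{
	int ret;

	if (current->flags & PF_MEMALLOC) {
		/*
		 * Entered from page reclaim: don't issue an RPC (and the
		 * socket allocations behind it) here; put the page back on
		 * the dirty list and let ordinary writeback flush it.
		 */
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	ret = nfs_writepage_locked(page, wbc);
	unlock_page(page);
	return ret;
}
------------------------------------------------------------------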

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: Wu Fengguang @ 2009-06-08 5:00 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: LKML, linux-nfs@vger.kernel.org, netdev@vger.kernel.org

On Mon, Jun 08, 2009 at 12:55:18PM +0800, KOSAKI Motohiro wrote:
> Hi
>
> > Hi,
> >
> > This lockdep warning appears when doing stress memory tests over NFS.
> >
> > page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
> >
> > tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
> >
> > Any ideas?
>
> AFAIK, btrfs has re-dirty hack.
>
> ------------------------------------------------------------------
> static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
> {
> 	struct extent_io_tree *tree;
>
> 	if (current->flags & PF_MEMALLOC) {
> 		redirty_page_for_writepage(wbc, page);
> 		unlock_page(page);
> 		return 0;
> 	}
> 	tree = &BTRFS_I(page->mapping->host)->io_tree;
> 	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
> }
> ---------------------------------------------------------------
>
> PF_MEMALLOC mean caller is try_to_free_pages(). (not normal write nor kswapd)
> Can't nfs does similar hack?

But the trace shows that current is kswapd:

[ 1638.403414] [<ffffffff811c9b69>] nfs_flush_one+0xb9/0x100
[ 1638.419417] [<ffffffff811c3f82>] nfs_pageio_doio+0x32/0x70
[ 1638.419417] [<ffffffff811c3fc9>] nfs_pageio_complete+0x9/0x10
[ 1638.427413] [<ffffffff811c7ee5>] nfs_writepage_locked+0x85/0xc0
[ 1638.435414] [<ffffffff811c8509>] nfs_writepage+0x19/0x40
[ 1638.435414] [<ffffffff810ce005>] shrink_page_list+0x675/0x810
[ 1638.435414] [<ffffffff810ce761>] shrink_list+0x301/0x650
[ 1638.435414] [<ffffffff810ced23>] shrink_zone+0x273/0x370
[ 1638.435414] [<ffffffff810cf9f9>] kswapd+0x729/0x7a0
[ 1638.435414] [<ffffffff810666de>] kthread+0x9e/0xb0
[ 1638.435414] [<ffffffff8100d0ca>] child_rip+0xa/0x20

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: KOSAKI Motohiro @ 2009-06-08 5:07 UTC (permalink / raw)
To: Wu Fengguang
Cc: kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, LKML,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

> On Mon, Jun 08, 2009 at 12:55:18PM +0800, KOSAKI Motohiro wrote:
> > Hi
> >
> > > Hi,
> > >
> > > This lockdep warning appears when doing stress memory tests over NFS.
> > >
> > > page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
> > >
> > > tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
> > >
> > > Any ideas?
> >
> > AFAIK, btrfs has re-dirty hack.
> >
> > ------------------------------------------------------------------
> > static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
> > {
> > 	struct extent_io_tree *tree;
> >
> > 	if (current->flags & PF_MEMALLOC) {
> > 		redirty_page_for_writepage(wbc, page);
> > 		unlock_page(page);
> > 		return 0;
> > 	}
> > 	tree = &BTRFS_I(page->mapping->host)->io_tree;
> > 	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
> > }
> > ---------------------------------------------------------------
> >
> > PF_MEMALLOC mean caller is try_to_free_pages(). (not normal write nor kswapd)
> > Can't nfs does similar hack?
>
> But the trace shows that current is kswapd:
>
> [ 1638.403414] [<ffffffff811c9b69>] nfs_flush_one+0xb9/0x100
> [ 1638.419417] [<ffffffff811c3f82>] nfs_pageio_doio+0x32/0x70
> [ 1638.419417] [<ffffffff811c3fc9>] nfs_pageio_complete+0x9/0x10
> [ 1638.427413] [<ffffffff811c7ee5>] nfs_writepage_locked+0x85/0xc0
> [ 1638.435414] [<ffffffff811c8509>] nfs_writepage+0x19/0x40
> [ 1638.435414] [<ffffffff810ce005>] shrink_page_list+0x675/0x810
> [ 1638.435414] [<ffffffff810ce761>] shrink_list+0x301/0x650
> [ 1638.435414] [<ffffffff810ced23>] shrink_zone+0x273/0x370
> [ 1638.435414] [<ffffffff810cf9f9>] kswapd+0x729/0x7a0
> [ 1638.435414] [<ffffffff810666de>] kthread+0x9e/0xb0
> [ 1638.435414] [<ffffffff8100d0ca>] child_rip+0xa/0x20

kswapd can't hold sk-lock before calling reclaim. Thus, we don't need
care its bogus warning, I think.

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: Wu Fengguang @ 2009-06-08 5:53 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Jun 08, 2009 at 01:07:26PM +0800, KOSAKI Motohiro wrote:
> > On Mon, Jun 08, 2009 at 12:55:18PM +0800, KOSAKI Motohiro wrote:
> > > Hi
> > >
> > > > Hi,
> > > >
> > > > This lockdep warning appears when doing stress memory tests over NFS.
> > > >
> > > > page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
> > > >
> > > > tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
> > > >
> > > > Any ideas?
> > >
> > > AFAIK, btrfs has re-dirty hack.
> > >
> > > ------------------------------------------------------------------
> > > static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
> > > {
> > > 	struct extent_io_tree *tree;
> > >
> > > 	if (current->flags & PF_MEMALLOC) {
> > > 		redirty_page_for_writepage(wbc, page);
> > > 		unlock_page(page);
> > > 		return 0;
> > > 	}
> > > 	tree = &BTRFS_I(page->mapping->host)->io_tree;
> > > 	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
> > > }
> > > ---------------------------------------------------------------
> > >
> > > PF_MEMALLOC mean caller is try_to_free_pages(). (not normal write nor kswapd)
> > > Can't nfs does similar hack?
> >
> > But the trace shows that current is kswapd:
> >
> > [ 1638.403414] [<ffffffff811c9b69>] nfs_flush_one+0xb9/0x100
> > [ 1638.419417] [<ffffffff811c3f82>] nfs_pageio_doio+0x32/0x70
> > [ 1638.419417] [<ffffffff811c3fc9>] nfs_pageio_complete+0x9/0x10
> > [ 1638.427413] [<ffffffff811c7ee5>] nfs_writepage_locked+0x85/0xc0
> > [ 1638.435414] [<ffffffff811c8509>] nfs_writepage+0x19/0x40
> > [ 1638.435414] [<ffffffff810ce005>] shrink_page_list+0x675/0x810
> > [ 1638.435414] [<ffffffff810ce761>] shrink_list+0x301/0x650
> > [ 1638.435414] [<ffffffff810ced23>] shrink_zone+0x273/0x370
> > [ 1638.435414] [<ffffffff810cf9f9>] kswapd+0x729/0x7a0
> > [ 1638.435414] [<ffffffff810666de>] kthread+0x9e/0xb0
> > [ 1638.435414] [<ffffffff8100d0ca>] child_rip+0xa/0x20
>
> kswapd can't hold sk-lock before calling reclaim. Thus, we don't need
> care its bogus warning, I think.

Right. Although this path is possible:
        tcp_sendmsg() => page reclaim => tcp_send_fin()
But it won't happen for the same socket, so one sk_lock won't be
grabbed twice and go deadlock.

So it's a harmful warning for both direct/background page reclaims?

Thanks,
Fengguang

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: Wu Fengguang @ 2009-06-08 5:56 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Jun 08, 2009 at 01:53:26PM +0800, Wu Fengguang wrote:
> On Mon, Jun 08, 2009 at 01:07:26PM +0800, KOSAKI Motohiro wrote:
> > > On Mon, Jun 08, 2009 at 12:55:18PM +0800, KOSAKI Motohiro wrote:
> > > > Hi
> > > >
> > > > > Hi,
> > > > >
> > > > > This lockdep warning appears when doing stress memory tests over NFS.
> > > > >
> > > > > page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
> > > > >
> > > > > tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
> > > > >
> > > > > Any ideas?
> > > >
> > > > AFAIK, btrfs has re-dirty hack.
> > > >
> > > > ------------------------------------------------------------------
> > > > static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
> > > > {
> > > > 	struct extent_io_tree *tree;
> > > >
> > > > 	if (current->flags & PF_MEMALLOC) {
> > > > 		redirty_page_for_writepage(wbc, page);
> > > > 		unlock_page(page);
> > > > 		return 0;
> > > > 	}
> > > > 	tree = &BTRFS_I(page->mapping->host)->io_tree;
> > > > 	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
> > > > }
> > > > ---------------------------------------------------------------
> > > >
> > > > PF_MEMALLOC mean caller is try_to_free_pages(). (not normal write nor kswapd)
> > > > Can't nfs does similar hack?
> > >
> > > But the trace shows that current is kswapd:
> > >
> > > [ 1638.403414] [<ffffffff811c9b69>] nfs_flush_one+0xb9/0x100
> > > [ 1638.419417] [<ffffffff811c3f82>] nfs_pageio_doio+0x32/0x70
> > > [ 1638.419417] [<ffffffff811c3fc9>] nfs_pageio_complete+0x9/0x10
> > > [ 1638.427413] [<ffffffff811c7ee5>] nfs_writepage_locked+0x85/0xc0
> > > [ 1638.435414] [<ffffffff811c8509>] nfs_writepage+0x19/0x40
> > > [ 1638.435414] [<ffffffff810ce005>] shrink_page_list+0x675/0x810
> > > [ 1638.435414] [<ffffffff810ce761>] shrink_list+0x301/0x650
> > > [ 1638.435414] [<ffffffff810ced23>] shrink_zone+0x273/0x370
> > > [ 1638.435414] [<ffffffff810cf9f9>] kswapd+0x729/0x7a0
> > > [ 1638.435414] [<ffffffff810666de>] kthread+0x9e/0xb0
> > > [ 1638.435414] [<ffffffff8100d0ca>] child_rip+0xa/0x20
> >
> > kswapd can't hold sk-lock before calling reclaim. Thus, we don't need
> > care its bogus warning, I think.
>
> Right. Although this path is possible:
>         tcp_sendmsg() => page reclaim => tcp_send_fin()
> But it won't happen for the same socket, so one sk_lock won't be
> grabbed twice and go deadlock.
>
> So it's a harmful warning for both direct/background page reclaims?

btw, can anyone explain these NFS warnings? It happens in a very
memory tight and busy nfsroot system.

[ 113.267340] NFS: Server wrote zero bytes, expected 3671.
[ 423.202607] NFS: Server wrote zero bytes, expected 108.
[ 723.588411] NFS: Server wrote zero bytes, expected 560.
[ 1060.246747] NFS: Server wrote zero bytes, expected 54.
[ 1397.841183] NFS: Server wrote zero bytes, expected 402.
[ 1779.545035] NFS: Server wrote zero bytes, expected 319.

Thanks,
Fengguang

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: KOSAKI Motohiro @ 2009-06-08 6:12 UTC (permalink / raw)
To: Wu Fengguang
Cc: kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, LKML,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

> btw, can anyone explain these NFS warnings? It happens in a very
> memory tight and busy nfsroot system.
>
> [ 113.267340] NFS: Server wrote zero bytes, expected 3671.
> [ 423.202607] NFS: Server wrote zero bytes, expected 108.
> [ 723.588411] NFS: Server wrote zero bytes, expected 560.
> [ 1060.246747] NFS: Server wrote zero bytes, expected 54.
> [ 1397.841183] NFS: Server wrote zero bytes, expected 402.
> [ 1779.545035] NFS: Server wrote zero bytes, expected 319.

server side write function is below

-----------------------------------------------------------
static __be32
nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
				loff_t offset, struct kvec *vec, int vlen,
				unsigned long *cnt, int *stablep)
{
	(snip)

	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &offset);	// (1)

	(snip)

	/*
	 * Gathered writes: If another process is currently
	 * writing to the file, there's a high chance
	 * this is another nfsd (triggered by a bulk write
	 * from a client's biod). Rather than syncing the
	 * file with each write request, we sleep for 10 msec.
	 *
	 * I don't know if this roughly approximates
	 * C. Juszak's idea of gathered writes, but it's a
	 * nice and simple solution (IMHO), and it seems to
	 * work:-)
	 */
	if (EX_WGATHER(exp)) {
		(snip)
		if (inode->i_state & I_DIRTY) {
			dprintk("nfsd: write sync %d\n", task_pid_nr(current));
			host_err=nfsd_sync(file);	// (2)
		}
	}

	(snip)

	dprintk("nfsd: write complete host_err=%d\n", host_err);
	if (host_err >= 0) {
		err = 0;
		*cnt = host_err;
	} else
		err = nfserrno(host_err);
out:
	return err;
}
---------------------------------------------------------------------------

if (1) or (2) makes host_err == 0, it makes your warning messages.

Thanks.

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: Wu Fengguang @ 2009-06-09 3:07 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: LKML, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    chris.mason-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org

On Mon, Jun 08, 2009 at 12:55:18PM +0800, KOSAKI Motohiro wrote:
> Hi
>
> > Hi,
> >
> > This lockdep warning appears when doing stress memory tests over NFS.
> >
> > page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
> >
> > tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
> >
> > Any ideas?
>
> AFAIK, btrfs has re-dirty hack.
>
> ------------------------------------------------------------------
> static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
> {
> 	struct extent_io_tree *tree;
>
> 	if (current->flags & PF_MEMALLOC) {
> 		redirty_page_for_writepage(wbc, page);
> 		unlock_page(page);
> 		return 0;
> 	}
> 	tree = &BTRFS_I(page->mapping->host)->io_tree;
> 	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
> }
> ---------------------------------------------------------------
>
> PF_MEMALLOC mean caller is try_to_free_pages(). (not normal write nor kswapd)

No, kswapd also sets the PF_MEMALLOC flag. It looks like btrfs_writepage()
is trying to avoid inefficient page outs at the cost of pinning dirty
pages in memory (even when we really want free pages).

Thanks,
Fengguang

> Can't nfs does similar hack?
>
> I'm not net nor nfs expert. perhaps I'm wrong :-)
>
> Thanks.
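
Wu's correction is grounded in the kswapd thread function itself, which marks the whole kswapd task with PF_MEMALLOC at startup. Roughly, an abbreviated excerpt in the spirit of mm/vmscan.c of this era, with surrounding context elided:

------------------------------------------------------------------
static int kswapd(void *p)
{
	struct task_struct *tsk = current;

	/* ... setup elided ... */

	/*
	 * kswapd tags itself as a memory reclaimer, so a PF_MEMALLOC
	 * test in a ->writepage() catches both kswapd and direct
	 * reclaim (try_to_free_pages()), not direct reclaim alone.
	 */
	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;

	/* ... balance_pgdat() loop elided ... */
}
------------------------------------------------------------------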

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: KOSAKI Motohiro @ 2009-06-09 3:15 UTC (permalink / raw)
To: Wu Fengguang
Cc: kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, LKML,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    chris.mason-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org

> On Mon, Jun 08, 2009 at 12:55:18PM +0800, KOSAKI Motohiro wrote:
> > Hi
> >
> > > Hi,
> > >
> > > This lockdep warning appears when doing stress memory tests over NFS.
> > >
> > > page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
> > >
> > > tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
> > >
> > > Any ideas?
> >
> > AFAIK, btrfs has re-dirty hack.
> >
> > ------------------------------------------------------------------
> > static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
> > {
> > 	struct extent_io_tree *tree;
> >
> > 	if (current->flags & PF_MEMALLOC) {
> > 		redirty_page_for_writepage(wbc, page);
> > 		unlock_page(page);
> > 		return 0;
> > 	}
> > 	tree = &BTRFS_I(page->mapping->host)->io_tree;
> > 	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
> > }
> > ---------------------------------------------------------------
> >
> > PF_MEMALLOC mean caller is try_to_free_pages(). (not normal write nor kswapd)
>
> No, kswapd also sets the PF_MEMALLOC flag. It looks like btrfs_writepage()
> is trying to avoid inefficient page outs at the cost of pinning dirty
> pages in memory (even when we really want free pages).

Sorry, I was confused ;)

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: Herbert Xu @ 2009-07-06 10:52 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
    netdev-u79uwXL29TY76Z2rM5mHXA, David S. Miller

Wu Fengguang <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>
> This lockdep warning appears when doing stress memory tests over NFS.
>
> page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
>
> tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim

Well perhaps not this particular path, but it is certainly possible
if an existing NFS socket dies and NFS tries to reestablish it.

I suggest that NFS should utilise the sk_allocation field and
set an appropriate value.  Note that you may have to patch TCP
so that it uses sk_allocation everywhere necessary, e.g., in
tcp_send_fin.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
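
As the next reply points out, the sunrpc transport already assigns sk_allocation when it brings its TCP socket up. Schematically, the relevant piece of xs_tcp_finish_connecting() in net/sunrpc/xprtsock.c looks like the sketch below; the signature and the surrounding callback setup are simplified, so this is not a verbatim excerpt:

------------------------------------------------------------------
static void xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
{
	struct sock *sk = sock->sk;

	sk->sk_user_data  = xprt;
	/*
	 * RPC sends can run under memory pressure / from reclaim context,
	 * so the transport asks for non-blocking socket allocations.
	 */
	sk->sk_allocation = GFP_ATOMIC;

	/* ... install sk_data_ready / sk_state_change / sk_write_space ... */
}
------------------------------------------------------------------

Herbert's point is that this field only helps if the TCP paths reachable from NFS actually honour it instead of hard-coding GFP_KERNEL, which is what the patch in the next message addresses.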

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: Wu Fengguang @ 2009-07-09 13:17 UTC (permalink / raw)
To: Herbert Xu
Cc: linux-kernel@vger.kernel.org, linux-nfs@vger.kernel.org,
    netdev@vger.kernel.org, David S. Miller

On Mon, Jul 06, 2009 at 06:52:16PM +0800, Herbert Xu wrote:
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > This lockdep warning appears when doing stress memory tests over NFS.
> >
> > page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
> >
> > tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
>
> Well perhaps not this particular path, but it is certainly possible
> if an existing NFS socket dies and NFS tries to reestablish it.
>
> I suggest that NFS should utilise the sk_allocation field and
> set an appropriate value.  Note that you may have to patch TCP
> so that it uses sk_allocation everywhere necessary, e.g., in
> tcp_send_fin.

Good suggestion! NFS already sets sk_allocation to GFP_ATOMIC in
linux/net/sunrpc/xprtsock.c <<xs_tcp_finish_connecting>>.

To fix this warning and possible recursions, I converted some
GPF_KERNEL cases to sk_allocation in the tcp/ipv4 code:

---
tcp: replace hard coded GFP_KERNEL with sk_allocation

This fixed a lockdep warning which appeared when doing stress
memory tests over NFS:

	inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.

	mount_root => nfs_root_data => tcp_close => lock sk_lock =>
		tcp_send_fin => alloc_skb_fclone => page reclaim

	page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock

CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
CC: David S. Miller <davem@davemloft.net>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 net/ipv4/tcp.c        |    4 ++--
 net/ipv4/tcp_ipv4.c   |    5 +++--
 net/ipv4/tcp_output.c |    5 +++--
 3 files changed, 8 insertions(+), 6 deletions(-)

--- linux.orig/net/ipv4/tcp_ipv4.c
+++ linux/net/ipv4/tcp_ipv4.c
@@ -970,8 +970,9 @@ static int tcp_v4_parse_md5_keys(struct
 	if (!tcp_sk(sk)->md5sig_info) {
 		struct tcp_sock *tp = tcp_sk(sk);
-		struct tcp_md5sig_info *p = kzalloc(sizeof(*p), GFP_KERNEL);
+		struct tcp_md5sig_info *p;
 
+		p = kzalloc(sizeof(*p), sk->sk_allocation);
 		if (!p)
 			return -EINVAL;
 
@@ -979,7 +980,7 @@ static int tcp_v4_parse_md5_keys(struct
 		sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
 	}
 
-	newkey = kmemdup(cmd.tcpm_key, cmd.tcpm_keylen, GFP_KERNEL);
+	newkey = kmemdup(cmd.tcpm_key, cmd.tcpm_keylen, sk->sk_allocation);
 	if (!newkey)
 		return -ENOMEM;
 	return tcp_v4_md5_do_add(sk, sin->sin_addr.s_addr,
--- linux.orig/net/ipv4/tcp_output.c
+++ linux/net/ipv4/tcp_output.c
@@ -2100,7 +2100,8 @@ void tcp_send_fin(struct sock *sk)
 	} else {
 		/* Socket is locked, keep trying until memory is available. */
 		for (;;) {
-			skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
+			skb = alloc_skb_fclone(MAX_TCP_HEADER,
+					       sk->sk_allocation);
 			if (skb)
 				break;
 			yield();
@@ -2358,7 +2359,7 @@ int tcp_connect(struct sock *sk)
 	sk->sk_wmem_queued += buff->truesize;
 	sk_mem_charge(sk, buff->truesize);
 	tp->packets_out += tcp_skb_pcount(buff);
-	tcp_transmit_skb(sk, buff, 1, GFP_KERNEL);
+	tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
 
 	/* We change tp->snd_nxt after the tcp_transmit_skb() call
 	 * in order to make this packet get counted in tcpOutSegs.
--- linux.orig/net/ipv4/tcp.c
+++ linux/net/ipv4/tcp.c
@@ -1834,7 +1834,7 @@ void tcp_close(struct sock *sk, long tim
 		/* Unread data was tossed, zap the connection. */
 		NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
 		tcp_set_state(sk, TCP_CLOSE);
-		tcp_send_active_reset(sk, GFP_KERNEL);
+		tcp_send_active_reset(sk, sk->sk_allocation);
 	} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
 		/* Check zero linger _after_ checking for unread data. */
 		sk->sk_prot->disconnect(sk, 0);
@@ -2666,7 +2666,7 @@ static struct tcp_md5sig_pool **__tcp_al
 		struct tcp_md5sig_pool *p;
 		struct crypto_hash *hash;
 
-		p = kzalloc(sizeof(*p), GFP_KERNEL);
+		p = kzalloc(sizeof(*p), sk->sk_allocation);
 		if (!p)
 			goto out_free;
 		*per_cpu_ptr(pool, cpu) = p;

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: David Miller @ 2009-07-10 0:13 UTC (permalink / raw)
To: fengguang.wu-ral2JQCrhuEAvxtiuMwx3w
Cc: herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
    netdev-u79uwXL29TY76Z2rM5mHXA

From: Wu Fengguang <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Date: Thu, 9 Jul 2009 21:17:46 +0800

> @@ -2100,7 +2100,8 @@ void tcp_send_fin(struct sock *sk)
>  	} else {
>  		/* Socket is locked, keep trying until memory is available. */
>  		for (;;) {
> -			skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
> +			skb = alloc_skb_fclone(MAX_TCP_HEADER,
> +					       sk->sk_allocation);
>  			if (skb)
>  				break;
>  			yield();

I think this specific case needs more thinking.

If the allocation fails, and it's GFP_ATOMIC, we are going to yield()
(which sleeps) and loop endlessly waiting for the allocation to
succeed.

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: Herbert Xu @ 2009-07-10 0:59 UTC (permalink / raw)
To: David Miller
Cc: fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA

On Thu, Jul 09, 2009 at 05:13:55PM -0700, David Miller wrote:
>
> I think this specific case needs more thinking.
>
> If the allocation fails, and it's GFP_ATOMIC, we are going to yield()
> (which sleeps) and loop endlessly waiting for the allocation to
> succeed.

Indeed.  We could do one of the following:

1) Preallocate, either universally or conditinally on sk_allocation.

2) Set a bit somewhere to indicate FIN and retry the allocation
as part of retransmit.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
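
To make option (1) a little more concrete, a very rough sketch of the preallocation idea follows. It is hypothetical and was never posted as a patch in this thread; in particular, the fin_skb field does not exist in struct tcp_sock and is invented purely for illustration.

------------------------------------------------------------------
/*
 * Hypothetical sketch of option (1): grab the FIN skb while sleeping
 * is still allowed, so tcp_send_fin() never has to allocate (or loop
 * on a failing GFP_ATOMIC allocation) under sk_lock.
 */
static int tcp_prealloc_fin_skb(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (!tp->fin_skb)	/* tp->fin_skb: invented field, see above */
		tp->fin_skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
	return tp->fin_skb ? 0 : -ENOMEM;
}

/*
 * tcp_send_fin() would then consume tp->fin_skb instead of running the
 * alloc_skb_fclone()/yield() retry loop quoted above.
 */
------------------------------------------------------------------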

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: Wu Fengguang @ 2009-07-10 8:00 UTC (permalink / raw)
To: David Miller
Cc: herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Fri, Jul 10, 2009 at 08:13:55AM +0800, David Miller wrote:
> From: Wu Fengguang <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Date: Thu, 9 Jul 2009 21:17:46 +0800
>
> > @@ -2100,7 +2100,8 @@ void tcp_send_fin(struct sock *sk)
> >  	} else {
> >  		/* Socket is locked, keep trying until memory is available. */
> >  		for (;;) {
> > -			skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
> > +			skb = alloc_skb_fclone(MAX_TCP_HEADER,
> > +					       sk->sk_allocation);
> >  			if (skb)
> >  				break;
> >  			yield();
>
> I think this specific case needs more thinking.
>
> If the allocation fails, and it's GFP_ATOMIC, we are going to yield()
> (which sleeps) and loop endlessly waiting for the allocation to
> succeed.

The _retried_ GFP_ATOMIC won't be much worse than GFP_KERNEL.
GFP_KERNEL can directly reclaim FS pages; GFP_ATOMIC will wake up
kswapd to do that. So after yield(), GFP_ATOMIC have good opportunity
to succeed if GFP_KERNEL could succeed.

The original GFP_KERNEL does have _a bit_ better chance to succeed,
but there are no guarantee. It could loop endlessly whether it be
GFP_KERNEL or GFP_ATOMIC.

btw, generally speaking, it would be more robust that NFS set
sk_allocation to GFP_NOIO, and let the networking code choose whether
to use plain sk_allocation or (sk_allocation & ~__GFP_WAIT).

The (sk_allocation & ~__GFP_WAIT) cases should be rare, but I guess
the networking code shall do it anyway, because sk_allocation defaults
to GFP_KERNEL. It seems that currently the networking code simply uses
a lot of GFP_ATOMIC, do they really mean "I cannot sleep"?

Thanks,
Fengguang
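
The split Wu is proposing here could be captured in a tiny helper; sk_gfp() below is hypothetical (not an existing kernel API) and simply restates the idea that the socket owner picks a blocking mask such as GFP_NOIO, while atomic paths derive their mask from it instead of hard-coding GFP_ATOMIC:

------------------------------------------------------------------
/*
 * Hypothetical helper illustrating the proposal above.  Callers that
 * may sleep pass may_sleep = true and get sk->sk_allocation unchanged
 * (e.g. GFP_NOIO if NFS set it that way); softirq callers pass false
 * and get the same mask with the blocking flag stripped.
 */
static inline gfp_t sk_gfp(const struct sock *sk, bool may_sleep)
{
	return may_sleep ? sk->sk_allocation
			 : (sk->sk_allocation & ~__GFP_WAIT);
}
------------------------------------------------------------------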

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: Herbert Xu @ 2009-07-10 8:02 UTC (permalink / raw)
To: Wu Fengguang
Cc: David Miller, linux-kernel@vger.kernel.org, linux-nfs@vger.kernel.org,
    netdev@vger.kernel.org

On Fri, Jul 10, 2009 at 04:00:17PM +0800, Wu Fengguang wrote:
>
> The (sk_allocation & ~__GFP_WAIT) cases should be rare, but I guess
> the networking code shall do it anyway, because sk_allocation defaults
> to GFP_KERNEL. It seems that currently the networking code simply uses
> a lot of GFP_ATOMIC, do they really mean "I cannot sleep"?

Yep because they're done from softirq context.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: David Miller @ 2009-07-14 16:04 UTC (permalink / raw)
To: herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q
Cc: fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA

From: Herbert Xu <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
Date: Fri, 10 Jul 2009 16:02:47 +0800

> On Fri, Jul 10, 2009 at 04:00:17PM +0800, Wu Fengguang wrote:
>>
>> The (sk_allocation & ~__GFP_WAIT) cases should be rare, but I guess
>> the networking code shall do it anyway, because sk_allocation defaults
>> to GFP_KERNEL. It seems that currently the networking code simply uses
>> a lot of GFP_ATOMIC, do they really mean "I cannot sleep"?
>
> Yep because they're done from softirq context.

Yes, this is the core issue.

All of Wu's talk about how "GFP_ATOMIC will wake up kswapd and
therefore can succeed just as well as GFP_KERNEL" is not relevant,
because GFP_ATOMIC means sleeping is not allowed.

* Re: sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage
From: Wu Fengguang @ 2009-07-15 7:45 UTC (permalink / raw)
To: David Miller
Cc: herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Wed, Jul 15, 2009 at 12:04:32AM +0800, David Miller wrote:
> From: Herbert Xu <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
> Date: Fri, 10 Jul 2009 16:02:47 +0800
>
> > On Fri, Jul 10, 2009 at 04:00:17PM +0800, Wu Fengguang wrote:
> >>
> >> The (sk_allocation & ~__GFP_WAIT) cases should be rare, but I guess
> >> the networking code shall do it anyway, because sk_allocation defaults
> >> to GFP_KERNEL. It seems that currently the networking code simply uses
> >> a lot of GFP_ATOMIC, do they really mean "I cannot sleep"?
> >
> > Yep because they're done from softirq context.
>
> Yes, this is the core issue.

Yes, that's general true. But..

> All of Wu's talk about how "GFP_ATOMIC will wake up kswapd and
> therefore can succeed just as well as GFP_KERNEL" is not relevant,
> because GFP_ATOMIC means sleeping is not allowed.

We are talking about tcp_send_fin() here, which can sleep.

Thanks,
Fengguang

Thread overview: 17+ messages
2009-06-08 2:37 sk_lock: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage Wu Fengguang
2009-06-08 4:55 ` KOSAKI Motohiro
2009-06-08 5:00 ` Wu Fengguang
2009-06-08 5:07 ` KOSAKI Motohiro
[not found] ` <20090608140529.4376.A69D9226-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2009-06-08 5:53 ` Wu Fengguang
2009-06-08 5:56 ` Wu Fengguang
2009-06-08 6:12 ` KOSAKI Motohiro
[not found] ` <20090608134428.4373.A69D9226-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2009-06-09 3:07 ` Wu Fengguang
2009-06-09 3:15 ` KOSAKI Motohiro
2009-07-06 10:52 ` Herbert Xu
2009-07-09 13:17 ` Wu Fengguang
2009-07-10 0:13 ` David Miller
[not found] ` <20090709.171355.09466097.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
2009-07-10 0:59 ` Herbert Xu
2009-07-10 8:00 ` Wu Fengguang
2009-07-10 8:02 ` Herbert Xu
[not found] ` <20090710080247.GA2693-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>
2009-07-14 16:04 ` David Miller
[not found] ` <20090714.090432.13343695.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
2009-07-15 7:45 ` Wu Fengguang