* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
[not found] ` <CAOi1vP-doHSj8epQ1zLBnEi8QM4Eb7nFb5uo-XeUquZUkhacsg@mail.gmail.com>
@ 2017-03-29 10:41 ` Michal Hocko
2017-03-29 10:55 ` Michal Hocko
2017-03-29 11:05 ` Brian Foster
0 siblings, 2 replies; 21+ messages in thread
From: Michal Hocko @ 2017-03-29 10:41 UTC (permalink / raw)
To: Ilya Dryomov
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
[CC xfs guys]
On Wed 29-03-17 11:21:44, Ilya Dryomov wrote:
[...]
> This is a set of stack traces from http://tracker.ceph.com/issues/19309
> (linked in the changelog):
>
> Workqueue: ceph-msgr con_work [libceph]
> ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
> 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
> ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
> Call Trace:
> [<ffffffff816dd629>] schedule+0x29/0x70
> [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
> [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
> [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
> [<ffffffff81086335>] flush_work+0x165/0x250
I suspect this is xlog_cil_push_now -> flush_work(&cil->xc_push_work)
right? I kind of got lost where this waits on an IO.
> [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
> [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
> [<ffffffff816d6b42>] ? __slab_free+0xee/0x234
> [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
> [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
> [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
> [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
> [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
> [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
> [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
> [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
[...]
> [<ffffffff815b933e>] sock_alloc+0x1e/0x80
> [<ffffffff815ba855>] __sock_create+0x95/0x220
> [<ffffffff815baa04>] sock_create_kern+0x24/0x30
> [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
> [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
> [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> [<ffffffff8108561b>] worker_thread+0x11b/0x530
> [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>
> We are writing out data on ceph_connection X:
>
> ceph_con_workfn
> mutex_lock(&con->mutex) # ceph_connection::mutex
> try_write
> ceph_tcp_connect
> sock_create_kern
> GFP_KERNEL allocation
> allocator recurses into XFS, more I/O is issued
I am not sure this is true actually. XFS tends to do IO from
separate kworkers rather than from the direct reclaim context.
> Workqueue: rbd rbd_request_workfn [rbd]
> ffff880047a83b38 0000000000000046 ffff881025350c00 ffff8800383fa9e0
> 0000000000012b00 0000000000000000 ffff880047a83fd8 0000000000012b00
> ffff88014b638860 ffff8800383fa9e0 ffff880047a83b38 ffff8810878dc1b8
> Call Trace:
> [<ffffffff816dd629>] schedule+0x29/0x70
> [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
> [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
> [<ffffffffa048ad66>] ? ceph_str_hash+0x26/0x80 [libceph]
> [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
> [<ffffffffa04784fd>] ceph_con_send+0x4d/0x130 [libceph]
> [<ffffffffa047d3f0>] __send_queued+0x120/0x150 [libceph]
> [<ffffffffa047fe7b>] __ceph_osdc_start_request+0x5b/0xd0 [libceph]
> [<ffffffffa047ff41>] ceph_osdc_start_request+0x51/0x80 [libceph]
> [<ffffffffa04a8050>] rbd_obj_request_submit.isra.27+0x10/0x20 [rbd]
> [<ffffffffa04aa6de>] rbd_img_obj_request_submit+0x23e/0x500 [rbd]
> [<ffffffffa04aa9ec>] rbd_img_request_submit+0x4c/0x60 [rbd]
> [<ffffffffa04ab3d5>] rbd_request_workfn+0x305/0x410 [rbd]
> [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> [<ffffffff8108561b>] worker_thread+0x11b/0x530
> [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>
> Here is that I/O. We grab ceph_osd_client::request_mutex, but
> ceph_connection::mutex is being held by the worker that recursed into
> XFS:
>
> rbd_queue_workfn
> ceph_osdc_start_request
> mutex_lock(&osdc->request_mutex);
> ceph_con_send
> mutex_lock(&con->mutex) # deadlock
>
>
> Workqueue: ceph-msgr con_work [libceph]
> ffff88014a89fc08 0000000000000046 ffff88014a89fc18 ffff88013a2d90c0
> 0000000000012b00 0000000000000000 ffff88014a89ffd8 0000000000012b00
> ffff880015a210c0 ffff88013a2d90c0 0000000000000000 ffff882028a84798
> Call Trace:
> [<ffffffff816dd629>] schedule+0x29/0x70
> [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
> [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
> [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
> [<ffffffffa047ec1f>] alloc_msg+0xcf/0x210 [libceph]
> [<ffffffffa0479c55>] con_work+0x1675/0x2050 [libceph]
> [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> [<ffffffff8108561b>] worker_thread+0x11b/0x530
> [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>
> Workqueue: ceph-msgr con_work [libceph]
> ffff88014c10fc08 0000000000000046 ffff88013a2d9988 ffff88013a2d9920
> 0000000000012b00 0000000000000000 ffff88014c10ffd8 0000000000012b00
> ffffffff81c1b4a0 ffff88013a2d9920 0000000000000000 ffff882028a84798
> Call Trace:
> [<ffffffff816dd629>] schedule+0x29/0x70
> [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
> [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
> [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
> [<ffffffffa047ec1f>] alloc_msg+0xcf/0x210 [libceph]
> [<ffffffffa0479c55>] con_work+0x1675/0x2050 [libceph]
> [<ffffffff810a076c>] ? put_prev_entity+0x3c/0x2e0
> [<ffffffff8109b315>] ? sched_clock_cpu+0x95/0xd0
> [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> [<ffffffff8108561b>] worker_thread+0x11b/0x530
> [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>
> These two are replies on ceph_connections Y and Z, which need
> ceph_osd_client::request_mutex to figure out which requests can be
> completed:
>
> alloc_msg
> get_reply
> mutex_lock(&osdc->request_mutex);
>
> Eventually everything else blocks on ceph_osd_client::request_mutex,
> since it's used for both submitting requests and handling replies.
>
> This really is a straightforward "using GFP_KERNEL on the writeback
> path isn't allowed" case. I'm not sure what made you worried here.
I am still not sure there is a dependency there. But if anything, and
con->mutex really is the lock under which it is dangerous to recurse
back into the FS, then please wrap the whole scope which takes the lock
with memalloc_noio_save (or memalloc_nofs_save, currently sitting in the
mmotm tree, if you can wait until that API gets merged) and add a big
fat comment explaining why that is needed. Sticking the scope protection
deep down the call path is just hard to understand later on. And as
already mentioned, NOFS/NOIO contexts are (ab)used way too much without
a clear/good reason.
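For illustration, a minimal sketch of what I have in mind (assuming the
existing memalloc_noio_save()/restore() API from <linux/sched.h>; the
worker body and the container_of plumbing are simplified, so take the
names as approximate rather than as the actual libceph code):

static void ceph_con_workfn(struct work_struct *work)
{
	struct ceph_connection *con =
		container_of(work, struct ceph_connection, work.work);
	unsigned int noio_flag;

	/*
	 * con->mutex is taken on the I/O submission path, so nothing
	 * done under it may recurse back into the FS/IO layers.  The
	 * scope flag turns every allocation below -- including
	 * sock_create_kern() in ceph_tcp_connect() -- into GFP_NOIO.
	 */
	noio_flag = memalloc_noio_save();
	mutex_lock(&con->mutex);

	/* ... the existing try_read()/try_write() loop ... */

	mutex_unlock(&con->mutex);
	memalloc_noio_restore(noio_flag);
}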
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-29 10:41 ` [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations Michal Hocko
@ 2017-03-29 10:55 ` Michal Hocko
2017-03-29 11:10 ` Ilya Dryomov
2017-03-29 11:05 ` Brian Foster
1 sibling, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-03-29 10:55 UTC (permalink / raw)
To: Ilya Dryomov
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Wed 29-03-17 12:41:26, Michal Hocko wrote:
[...]
> > ceph_con_workfn
> > mutex_lock(&con->mutex) # ceph_connection::mutex
> > try_write
> > ceph_tcp_connect
> > sock_create_kern
> > GFP_KERNEL allocation
> > allocator recurses into XFS, more I/O is issued
One more note. So what happens if this is a GFP_NOIO request which
cannot make any progress? Your IO thread is blocked on con->mutex
as you write below, but the above thread cannot proceed either. So I am
_really_ not sure this actually helps.
[...]
> >
> > rbd_queue_workfn
> > ceph_osdc_start_request
> > mutex_lock(&osdc->request_mutex);
> > ceph_con_send
> > mutex_lock(&con->mutex) # deadlock
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-29 10:41 ` [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations Michal Hocko
2017-03-29 10:55 ` Michal Hocko
@ 2017-03-29 11:05 ` Brian Foster
2017-03-29 11:14 ` Ilya Dryomov
1 sibling, 1 reply; 21+ messages in thread
From: Brian Foster @ 2017-03-29 11:05 UTC (permalink / raw)
To: Michal Hocko
Cc: Ilya Dryomov, Greg Kroah-Hartman, linux-kernel@vger.kernel.org,
stable, Sergey Jerusalimov, Jeff Layton, linux-xfs
On Wed, Mar 29, 2017 at 12:41:26PM +0200, Michal Hocko wrote:
> [CC xfs guys]
>
> On Wed 29-03-17 11:21:44, Ilya Dryomov wrote:
> [...]
> > This is a set of stack traces from http://tracker.ceph.com/issues/19309
> > (linked in the changelog):
> >
> > Workqueue: ceph-msgr con_work [libceph]
> > ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
> > 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
> > ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
> > Call Trace:
> > [<ffffffff816dd629>] schedule+0x29/0x70
> > [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
> > [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> > [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> > [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
> > [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
> > [<ffffffff81086335>] flush_work+0x165/0x250
>
> I suspect this is xlog_cil_push_now -> flush_work(&cil->xc_push_work)
> right? I kind of got lost where this waits on an IO.
>
Yep. That means a CIL push is already in progress. We wait on that to
complete here. After that, the resulting task queues execution of
xlog_cil_push_work()->xlog_cil_push() on m_cil_workqueue. That task may
submit I/O to the log.
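For context, roughly what that path looks like (a from-memory sketch,
not the exact 3.18 code; the locking around the push sequence check is
omitted):

static void xlog_cil_push_now(struct xlog *log, xfs_lsn_t push_seq)
{
	struct xfs_cil	*cil = log->l_cilp;

	/* wait for any background push that is already in flight */
	flush_work(&cil->xc_push_work);

	/*
	 * ... then queue our own push; xlog_cil_push_work() ->
	 * xlog_cil_push() runs on m_cil_workqueue and may submit
	 * I/O to the log.
	 */
	cil->xc_push_seq = push_seq;
	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
}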
I don't see any reference to xlog_cil_push() anywhere in the traces here
or in the bug referenced above, however..?
Brian
> > [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
> > [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
> > [<ffffffff816d6b42>] ? __slab_free+0xee/0x234
> > [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
> > [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30
> > [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
> > [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs]
> > [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
> > [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
> > [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40
> > [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs]
> [...]
> > [<ffffffff815b933e>] sock_alloc+0x1e/0x80
> > [<ffffffff815ba855>] __sock_create+0x95/0x220
> > [<ffffffff815baa04>] sock_create_kern+0x24/0x30
> > [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph]
> > [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd]
> > [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> > [<ffffffff8108561b>] worker_thread+0x11b/0x530
> > [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> > [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> > [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> >
> > We are writing out data on ceph_connection X:
> >
> > ceph_con_workfn
> > mutex_lock(&con->mutex) # ceph_connection::mutex
> > try_write
> > ceph_tcp_connect
> > sock_create_kern
> > GFP_KERNEL allocation
> > allocator recurses into XFS, more I/O is issued
>
> I am not sure this is true actually. XFS tends to do IO from
> separate kworkers rather than from the direct reclaim context.
>
> > Workqueue: rbd rbd_request_workfn [rbd]
> > ffff880047a83b38 0000000000000046 ffff881025350c00 ffff8800383fa9e0
> > 0000000000012b00 0000000000000000 ffff880047a83fd8 0000000000012b00
> > ffff88014b638860 ffff8800383fa9e0 ffff880047a83b38 ffff8810878dc1b8
> > Call Trace:
> > [<ffffffff816dd629>] schedule+0x29/0x70
> > [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
> > [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
> > [<ffffffffa048ad66>] ? ceph_str_hash+0x26/0x80 [libceph]
> > [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
> > [<ffffffffa04784fd>] ceph_con_send+0x4d/0x130 [libceph]
> > [<ffffffffa047d3f0>] __send_queued+0x120/0x150 [libceph]
> > [<ffffffffa047fe7b>] __ceph_osdc_start_request+0x5b/0xd0 [libceph]
> > [<ffffffffa047ff41>] ceph_osdc_start_request+0x51/0x80 [libceph]
> > [<ffffffffa04a8050>] rbd_obj_request_submit.isra.27+0x10/0x20 [rbd]
> > [<ffffffffa04aa6de>] rbd_img_obj_request_submit+0x23e/0x500 [rbd]
> > [<ffffffffa04aa9ec>] rbd_img_request_submit+0x4c/0x60 [rbd]
> > [<ffffffffa04ab3d5>] rbd_request_workfn+0x305/0x410 [rbd]
> > [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> > [<ffffffff8108561b>] worker_thread+0x11b/0x530
> > [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> > [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> > [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> >
> > Here is that I/O. We grab ceph_osd_client::request_mutex, but
> > ceph_connection::mutex is being held by the worker that recursed into
> > XFS:
> >
> > rbd_queue_workfn
> > ceph_osdc_start_request
> > mutex_lock(&osdc->request_mutex);
> > ceph_con_send
> > mutex_lock(&con->mutex) # deadlock
> >
> >
> > Workqueue: ceph-msgr con_work [libceph]
> > ffff88014a89fc08 0000000000000046 ffff88014a89fc18 ffff88013a2d90c0
> > 0000000000012b00 0000000000000000 ffff88014a89ffd8 0000000000012b00
> > ffff880015a210c0 ffff88013a2d90c0 0000000000000000 ffff882028a84798
> > Call Trace:
> > [<ffffffff816dd629>] schedule+0x29/0x70
> > [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
> > [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
> > [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
> > [<ffffffffa047ec1f>] alloc_msg+0xcf/0x210 [libceph]
> > [<ffffffffa0479c55>] con_work+0x1675/0x2050 [libceph]
> > [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> > [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> > [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> > [<ffffffff8108561b>] worker_thread+0x11b/0x530
> > [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> > [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> > [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> >
> > Workqueue: ceph-msgr con_work [libceph]
> > ffff88014c10fc08 0000000000000046 ffff88013a2d9988 ffff88013a2d9920
> > 0000000000012b00 0000000000000000 ffff88014c10ffd8 0000000000012b00
> > ffffffff81c1b4a0 ffff88013a2d9920 0000000000000000 ffff882028a84798
> > Call Trace:
> > [<ffffffff816dd629>] schedule+0x29/0x70
> > [<ffffffff816dd906>] schedule_preempt_disabled+0x16/0x20
> > [<ffffffff816df755>] __mutex_lock_slowpath+0xa5/0x110
> > [<ffffffff816df7f6>] mutex_lock+0x36/0x4a
> > [<ffffffffa047ec1f>] alloc_msg+0xcf/0x210 [libceph]
> > [<ffffffffa0479c55>] con_work+0x1675/0x2050 [libceph]
> > [<ffffffff810a076c>] ? put_prev_entity+0x3c/0x2e0
> > [<ffffffff8109b315>] ? sched_clock_cpu+0x95/0xd0
> > [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> > [<ffffffff8108561b>] worker_thread+0x11b/0x530
> > [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> > [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> > [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> > [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> >
> > These two are replies on ceph_connections Y and Z, which need
> > ceph_osd_client::request_mutex to figure out which requests can be
> > completed:
> >
> > alloc_msg
> > get_reply
> > mutex_lock(&osdc->request_mutex);
> >
> > Eventually everything else blocks on ceph_osd_client::request_mutex,
> > since it's used for both submitting requests and handling replies.
> >
> > This really is a straightforward "using GFP_KERNEL on the writeback
> > path isn't allowed" case. I'm not sure what made you worried here.
>
> I am still not sure there is a dependency there. But if anything, and
> con->mutex really is the lock under which it is dangerous to recurse
> back into the FS, then please wrap the whole scope which takes the lock
> with memalloc_noio_save (or memalloc_nofs_save, currently sitting in the
> mmotm tree, if you can wait until that API gets merged) and add a big
> fat comment explaining why that is needed. Sticking the scope protection
> deep down the call path is just hard to understand later on. And as
> already mentioned, NOFS/NOIO contexts are (ab)used way too much without
> a clear/good reason.
> --
> Michal Hocko
> SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-29 10:55 ` Michal Hocko
@ 2017-03-29 11:10 ` Ilya Dryomov
2017-03-29 11:16 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Ilya Dryomov @ 2017-03-29 11:10 UTC (permalink / raw)
To: Michal Hocko
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 29-03-17 12:41:26, Michal Hocko wrote:
> [...]
>> > ceph_con_workfn
>> > mutex_lock(&con->mutex) # ceph_connection::mutex
>> > try_write
>> > ceph_tcp_connect
>> > sock_create_kern
>> > GFP_KERNEL allocation
>> > allocator recurses into XFS, more I/O is issued
>
> One more note. So what happens if this is a GFP_NOIO request which
> cannot make any progress? Your IO thread is blocked on con->mutex
> as you write below, but the above thread cannot proceed either. So I am
> _really_ not sure this actually helps.
This is not the only I/O worker. A ceph cluster typically consists of
at least a few OSDs and can be as large as thousands of OSDs. This is
the reason we are calling sock_create_kern() on the writeback path in
the first place: pre-opening thousands of sockets isn't feasible.
Thanks,
Ilya
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-29 11:05 ` Brian Foster
@ 2017-03-29 11:14 ` Ilya Dryomov
2017-03-29 11:18 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Ilya Dryomov @ 2017-03-29 11:14 UTC (permalink / raw)
To: Brian Foster
Cc: Michal Hocko, Greg Kroah-Hartman, linux-kernel@vger.kernel.org,
stable, Sergey Jerusalimov, Jeff Layton, linux-xfs
On Wed, Mar 29, 2017 at 1:05 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Wed, Mar 29, 2017 at 12:41:26PM +0200, Michal Hocko wrote:
>> [CC xfs guys]
>>
>> On Wed 29-03-17 11:21:44, Ilya Dryomov wrote:
>> [...]
>> > This is a set of stack traces from http://tracker.ceph.com/issues/19309
>> > (linked in the changelog):
>> >
>> > Workqueue: ceph-msgr con_work [libceph]
>> > ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
>> > 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
>> > ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
>> > Call Trace:
>> > [<ffffffff816dd629>] schedule+0x29/0x70
>> > [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
>> > [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
>> > [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
>> > [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
>> > [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
>> > [<ffffffff81086335>] flush_work+0x165/0x250
>>
>> I suspect this is xlog_cil_push_now -> flush_work(&cil->xc_push_work)
>> right? I kind of got lost where this waits on an IO.
>>
>
> Yep. That means a CIL push is already in progress. We wait on that to
> complete here. After that, the resulting task queues execution of
> xlog_cil_push_work()->xlog_cil_push() on m_cil_workqueue. That task may
> submit I/O to the log.
>
> I don't see any reference to xlog_cil_push() anywhere in the traces here
> or in the bug referenced above, however..?
Well, it's prefaced with "Interesting is:"... Sergey (the original
reporter, CCed here) might still have the rest of them.
Thanks,
Ilya
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-29 11:10 ` Ilya Dryomov
@ 2017-03-29 11:16 ` Michal Hocko
2017-03-29 14:25 ` Ilya Dryomov
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-03-29 11:16 UTC (permalink / raw)
To: Ilya Dryomov
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Wed 29-03-17 13:10:01, Ilya Dryomov wrote:
> On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Wed 29-03-17 12:41:26, Michal Hocko wrote:
> > [...]
> >> > ceph_con_workfn
> >> > mutex_lock(&con->mutex) # ceph_connection::mutex
> >> > try_write
> >> > ceph_tcp_connect
> >> > sock_create_kern
> >> > GFP_KERNEL allocation
> >> > allocator recurses into XFS, more I/O is issued
> >
> > One more note. So what happens if this is a GFP_NOIO request which
> > cannot make any progress? Your IO thread is blocked on con->mutex
> > as you write below, but the above thread cannot proceed either. So I am
> > _really_ not sure this actually helps.
>
> This is not the only I/O worker. A ceph cluster typically consists of
> at least a few OSDs and can be as large as thousands of OSDs. This is
> the reason we are calling sock_create_kern() on the writeback path in
> the first place: pre-opening thousands of sockets isn't feasible.
Sorry for being dense here, but what actually guarantees forward
progress? My current understanding is that the deadlock is caused by
con->mutex being held while the allocation cannot make forward
progress. I can imagine this would be possible if the other IO flushers
depend on this lock. But then NOIO vs. KERNEL allocation doesn't make
much difference. What am I missing?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-29 11:14 ` Ilya Dryomov
@ 2017-03-29 11:18 ` Michal Hocko
2017-03-29 11:49 ` Brian Foster
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-03-29 11:18 UTC (permalink / raw)
To: Ilya Dryomov
Cc: Brian Foster, Greg Kroah-Hartman, linux-kernel@vger.kernel.org,
stable, Sergey Jerusalimov, Jeff Layton, linux-xfs
On Wed 29-03-17 13:14:42, Ilya Dryomov wrote:
> On Wed, Mar 29, 2017 at 1:05 PM, Brian Foster <bfoster@redhat.com> wrote:
> > On Wed, Mar 29, 2017 at 12:41:26PM +0200, Michal Hocko wrote:
> >> [CC xfs guys]
> >>
> >> On Wed 29-03-17 11:21:44, Ilya Dryomov wrote:
> >> [...]
> >> > This is a set of stack traces from http://tracker.ceph.com/issues/19309
> >> > (linked in the changelog):
> >> >
> >> > Workqueue: ceph-msgr con_work [libceph]
> >> > ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
> >> > 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
> >> > ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
> >> > Call Trace:
> >> > [<ffffffff816dd629>] schedule+0x29/0x70
> >> > [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
> >> > [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> >> > [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> >> > [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
> >> > [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
> >> > [<ffffffff81086335>] flush_work+0x165/0x250
> >>
> >> I suspect this is xlog_cil_push_now -> flush_work(&cil->xc_push_work)
> >> right? I kind of got lost where this waits on an IO.
> >>
> >
> > Yep. That means a CIL push is already in progress. We wait on that to
> > complete here. After that, the resulting task queues execution of
> > xlog_cil_push_work()->xlog_cil_push() on m_cil_workqueue. That task may
> > submit I/O to the log.
> >
> > I don't see any reference to xlog_cil_push() anywhere in the traces here
> > or in the bug referenced above, however..?
>
> Well, it's prefaced with "Interesting is:"... Sergey (the original
> reporter, CCed here) might still have the rest of them.
JFTR
http://tracker.ceph.com/attachments/download/2769/full_kern_trace.txt
[288420.754637] Workqueue: xfs-cil/rbd1 xlog_cil_push_work [xfs]
[288420.754638] ffff880130c1fb38 0000000000000046 ffff880130c1fac8 ffff880130d72180
[288420.754640] 0000000000012b00 ffff880130c1fad8 ffff880130c1ffd8 0000000000012b00
[288420.754641] ffff8810297b6480 ffff880130d72180 ffffffffa03b1264 ffff8820263d6800
[288420.754643] Call Trace:
[288420.754652] [<ffffffffa03b1264>] ? xlog_bdstrat+0x34/0x70 [xfs]
[288420.754653] [<ffffffff816dd629>] schedule+0x29/0x70
[288420.754661] [<ffffffffa03b3b9c>] xlog_state_get_iclog_space+0xdc/0x2e0 [xfs]
[288420.754669] [<ffffffffa03b1264>] ? xlog_bdstrat+0x34/0x70 [xfs]
[288420.754670] [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
[288420.754678] [<ffffffffa03b4090>] xlog_write+0x190/0x730 [xfs]
[288420.754686] [<ffffffffa03b5d9e>] xlog_cil_push+0x24e/0x3e0 [xfs]
[288420.754693] [<ffffffffa03b5f45>] xlog_cil_push_work+0x15/0x20 [xfs]
[288420.754695] [<ffffffff81084c19>] process_one_work+0x159/0x4f0
[288420.754697] [<ffffffff81084fdc>] process_scheduled_works+0x2c/0x40
[288420.754698] [<ffffffff8108579b>] worker_thread+0x29b/0x530
[288420.754699] [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
[288420.754701] [<ffffffff8108b6f9>] kthread+0xc9/0xe0
[288420.754703] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
[288420.754705] [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
[288420.754707] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-29 11:18 ` Michal Hocko
@ 2017-03-29 11:49 ` Brian Foster
2017-03-29 14:30 ` Ilya Dryomov
0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2017-03-29 11:49 UTC (permalink / raw)
To: Michal Hocko
Cc: Ilya Dryomov, Greg Kroah-Hartman, linux-kernel@vger.kernel.org,
stable, Sergey Jerusalimov, Jeff Layton, linux-xfs
On Wed, Mar 29, 2017 at 01:18:34PM +0200, Michal Hocko wrote:
> On Wed 29-03-17 13:14:42, Ilya Dryomov wrote:
> > On Wed, Mar 29, 2017 at 1:05 PM, Brian Foster <bfoster@redhat.com> wrote:
> > > On Wed, Mar 29, 2017 at 12:41:26PM +0200, Michal Hocko wrote:
> > >> [CC xfs guys]
> > >>
> > >> On Wed 29-03-17 11:21:44, Ilya Dryomov wrote:
> > >> [...]
> > >> > This is a set of stack traces from http://tracker.ceph.com/issues/19309
> > >> > (linked in the changelog):
> > >> >
> > >> > Workqueue: ceph-msgr con_work [libceph]
> > >> > ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
> > >> > 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
> > >> > ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
> > >> > Call Trace:
> > >> > [<ffffffff816dd629>] schedule+0x29/0x70
> > >> > [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
> > >> > [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> > >> > [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> > >> > [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
> > >> > [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
> > >> > [<ffffffff81086335>] flush_work+0x165/0x250
> > >>
> > >> I suspect this is xlog_cil_push_now -> flush_work(&cil->xc_push_work)
> > >> right? I kind of got lost where this waits on an IO.
> > >>
> > >
> > > Yep. That means a CIL push is already in progress. We wait on that to
> > > complete here. After that, the resulting task queues execution of
> > > xlog_cil_push_work()->xlog_cil_push() on m_cil_workqueue. That task may
> > > submit I/O to the log.
> > >
> > > I don't see any reference to xlog_cil_push() anywhere in the traces here
> > > or in the bug referenced above, however..?
> >
> > Well, it's prefaced with "Interesting is:"... Sergey (the original
> > reporter, CCed here) might still have the rest of them.
>
> JFTR
> http://tracker.ceph.com/attachments/download/2769/full_kern_trace.txt
> [288420.754637] Workqueue: xfs-cil/rbd1 xlog_cil_push_work [xfs]
> [288420.754638] ffff880130c1fb38 0000000000000046 ffff880130c1fac8 ffff880130d72180
> [288420.754640] 0000000000012b00 ffff880130c1fad8 ffff880130c1ffd8 0000000000012b00
> [288420.754641] ffff8810297b6480 ffff880130d72180 ffffffffa03b1264 ffff8820263d6800
> [288420.754643] Call Trace:
> [288420.754652] [<ffffffffa03b1264>] ? xlog_bdstrat+0x34/0x70 [xfs]
> [288420.754653] [<ffffffff816dd629>] schedule+0x29/0x70
> [288420.754661] [<ffffffffa03b3b9c>] xlog_state_get_iclog_space+0xdc/0x2e0 [xfs]
> [288420.754669] [<ffffffffa03b1264>] ? xlog_bdstrat+0x34/0x70 [xfs]
> [288420.754670] [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
> [288420.754678] [<ffffffffa03b4090>] xlog_write+0x190/0x730 [xfs]
> [288420.754686] [<ffffffffa03b5d9e>] xlog_cil_push+0x24e/0x3e0 [xfs]
> [288420.754693] [<ffffffffa03b5f45>] xlog_cil_push_work+0x15/0x20 [xfs]
> [288420.754695] [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> [288420.754697] [<ffffffff81084fdc>] process_scheduled_works+0x2c/0x40
> [288420.754698] [<ffffffff8108579b>] worker_thread+0x29b/0x530
> [288420.754699] [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> [288420.754701] [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> [288420.754703] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> [288420.754705] [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> [288420.754707] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
Ah, thanks. According to the above, xfs-cil is waiting on log space to free
up. This means xfs-cil is probably in:
xlog_state_get_iclog_space()
->xlog_wait(&log->l_flush_wait, &log->l_icloglock);
l_flush_wait is awoken during log I/O completion handling via the
xfs-log workqueue. That guy is here:
[288420.773968] INFO: task kworker/6:3:420227 blocked for more than 300 seconds.
[288420.773986] Not tainted 3.18.43-40 #1
[288420.773997] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[288420.774017] kworker/6:3 D ffff880103893650 0 420227 2 0x00000000
[288420.774027] Workqueue: xfs-log/rbd1 xfs_log_worker [xfs]
[288420.774028] ffff88010357fac8 0000000000000046 0000000000000000 ffff880103893240
[288420.774030] 0000000000012b00 ffff880146361128 ffff88010357ffd8 0000000000012b00
[288420.774031] ffff8810297b7540 ffff880103893240 ffff88010357fae8 ffff88010357fbf8
[288420.774033] Call Trace:
[288420.774035] [<ffffffff816dd629>] schedule+0x29/0x70
[288420.774036] [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
[288420.774038] [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
[288420.774040] [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
[288420.774042] [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
[288420.774043] [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
[288420.774044] [<ffffffff81086335>] flush_work+0x165/0x250
[288420.774046] [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
[288420.774054] [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
[288420.774056] [<ffffffff8109f3cc>] ? dequeue_entity+0x17c/0x520
[288420.774063] [<ffffffffa03b478e>] _xfs_log_force+0x6e/0x280 [xfs]
[288420.774065] [<ffffffff810a076c>] ? put_prev_entity+0x3c/0x2e0
[288420.774067] [<ffffffff8109b315>] ? sched_clock_cpu+0x95/0xd0
[288420.774068] [<ffffffff810145a2>] ? __switch_to+0xf2/0x5f0
[288420.774076] [<ffffffffa03b49d9>] xfs_log_force+0x39/0xe0 [xfs]
[288420.774083] [<ffffffffa03b4aa8>] xfs_log_worker+0x28/0x50 [xfs]
[288420.774085] [<ffffffff81084c19>] process_one_work+0x159/0x4f0
[288420.774086] [<ffffffff8108561b>] worker_thread+0x11b/0x530
[288420.774088] [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
[288420.774089] [<ffffffff8108b6f9>] kthread+0xc9/0xe0
[288420.774091] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
[288420.774093] [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
[288420.774095] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
... which is back waiting on xfs-cil.
Ilya,
Have you looked into this[1] patch by any chance? Note that 7a29ac474
("xfs: give all workqueues rescuer threads") may also be a potential
band aid for this. Or IOW, the lack thereof in v3.18.z may make this
problem more likely.
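(For reference, WQ_MEM_RECLAIM is what gives a workqueue a rescuer
thread. A sketch of the kind of change 7a29ac474 makes, using the stock
alloc_workqueue() API, with error handling and the other XFS workqueues
omitted:)

/*
 * A rescuer thread lets queued work items run even when no new kworker
 * can be forked under memory pressure, so CIL/log pushes can still make
 * progress during reclaim.
 */
mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
		WQ_MEM_RECLAIM | WQ_FREEZABLE, 0, mp->m_fsname);
if (!mp->m_cil_workqueue)
	return -ENOMEM;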
Brian
[1] http://www.spinics.net/lists/linux-xfs/msg04886.html
> --
> Michal Hocko
> SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-29 11:16 ` Michal Hocko
@ 2017-03-29 14:25 ` Ilya Dryomov
2017-03-30 6:25 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Ilya Dryomov @ 2017-03-29 14:25 UTC (permalink / raw)
To: Michal Hocko
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Wed, Mar 29, 2017 at 1:16 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 29-03-17 13:10:01, Ilya Dryomov wrote:
>> On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Wed 29-03-17 12:41:26, Michal Hocko wrote:
>> > [...]
>> >> > ceph_con_workfn
>> >> > mutex_lock(&con->mutex) # ceph_connection::mutex
>> >> > try_write
>> >> > ceph_tcp_connect
>> >> > sock_create_kern
>> >> > GFP_KERNEL allocation
>> >> > allocator recurses into XFS, more I/O is issued
>> >
>> > One more note. So what happens if this is a GFP_NOIO request which
>> > cannot make any progress? Your IO thread is blocked on con->mutex
> > as you write below, but the above thread cannot proceed either. So I am
> > _really_ not sure this actually helps.
>>
>> This is not the only I/O worker. A ceph cluster typically consists of
>> at least a few OSDs and can be as large as thousands of OSDs. This is
>> the reason we are calling sock_create_kern() on the writeback path in
>> the first place: pre-opening thousands of sockets isn't feasible.
>
> > Sorry for being dense here, but what actually guarantees forward
> > progress? My current understanding is that the deadlock is caused by
> > con->mutex being held while the allocation cannot make forward
> > progress. I can imagine this would be possible if the other IO flushers
> > depend on this lock. But then NOIO vs. KERNEL allocation doesn't make
> > much difference. What am I missing?
con->mutex is per-ceph_connection, osdc->request_mutex is global and is
the real problem here because we need both on the submit side, at least
in 3.18. You are correct that even with GFP_NOIO this code may lock up
in theory, however I think it's very unlikely in practice.
We got rid of osdc->request_mutex in 4.7, so these workers are almost
independent in newer kernels and should be able to free up memory for
those blocked on GFP_NOIO retries with their respective con->mutex
held. Using GFP_KERNEL and thus allowing the recursion is just asking
for an AA deadlock on con->mutex OTOH, so it does make a difference.
I'm a little confused by this discussion because for me this patch was
a no-brainer... Locking aside, you said it was the stack trace in the
changelog that got your attention -- are you saying it's OK for a block
device to recurse back into the filesystem when doing I/O, potentially
generating more I/O?
Thanks,
Ilya
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-29 11:49 ` Brian Foster
@ 2017-03-29 14:30 ` Ilya Dryomov
0 siblings, 0 replies; 21+ messages in thread
From: Ilya Dryomov @ 2017-03-29 14:30 UTC (permalink / raw)
To: Brian Foster
Cc: Michal Hocko, Greg Kroah-Hartman, linux-kernel@vger.kernel.org,
stable, Sergey Jerusalimov, Jeff Layton, linux-xfs
On Wed, Mar 29, 2017 at 1:49 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Wed, Mar 29, 2017 at 01:18:34PM +0200, Michal Hocko wrote:
>> On Wed 29-03-17 13:14:42, Ilya Dryomov wrote:
>> > On Wed, Mar 29, 2017 at 1:05 PM, Brian Foster <bfoster@redhat.com> wrote:
>> > > On Wed, Mar 29, 2017 at 12:41:26PM +0200, Michal Hocko wrote:
>> > >> [CC xfs guys]
>> > >>
>> > >> On Wed 29-03-17 11:21:44, Ilya Dryomov wrote:
>> > >> [...]
>> > >> > This is a set of stack traces from http://tracker.ceph.com/issues/19309
>> > >> > (linked in the changelog):
>> > >> >
>> > >> > Workqueue: ceph-msgr con_work [libceph]
>> > >> > ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
>> > >> > 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
>> > >> > ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
>> > >> > Call Trace:
>> > >> > [<ffffffff816dd629>] schedule+0x29/0x70
>> > >> > [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
>> > >> > [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
>> > >> > [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
>> > >> > [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
>> > >> > [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
>> > >> > [<ffffffff81086335>] flush_work+0x165/0x250
>> > >>
>> > >> I suspect this is xlog_cil_push_now -> flush_work(&cil->xc_push_work)
>> > >> right? I kind of got lost where this waits on an IO.
>> > >>
>> > >
>> > > Yep. That means a CIL push is already in progress. We wait on that to
>> > > complete here. After that, the resulting task queues execution of
>> > > xlog_cil_push_work()->xlog_cil_push() on m_cil_workqueue. That task may
>> > > submit I/O to the log.
>> > >
>> > > I don't see any reference to xlog_cil_push() anywhere in the traces here
>> > > or in the bug referenced above, however..?
>> >
>> > Well, it's prefaced with "Interesting is:"... Sergey (the original
>> > reporter, CCed here) might still have the rest of them.
>>
>> JFTR
>> http://tracker.ceph.com/attachments/download/2769/full_kern_trace.txt
>> [288420.754637] Workqueue: xfs-cil/rbd1 xlog_cil_push_work [xfs]
>> [288420.754638] ffff880130c1fb38 0000000000000046 ffff880130c1fac8 ffff880130d72180
>> [288420.754640] 0000000000012b00 ffff880130c1fad8 ffff880130c1ffd8 0000000000012b00
>> [288420.754641] ffff8810297b6480 ffff880130d72180 ffffffffa03b1264 ffff8820263d6800
>> [288420.754643] Call Trace:
>> [288420.754652] [<ffffffffa03b1264>] ? xlog_bdstrat+0x34/0x70 [xfs]
>> [288420.754653] [<ffffffff816dd629>] schedule+0x29/0x70
>> [288420.754661] [<ffffffffa03b3b9c>] xlog_state_get_iclog_space+0xdc/0x2e0 [xfs]
>> [288420.754669] [<ffffffffa03b1264>] ? xlog_bdstrat+0x34/0x70 [xfs]
>> [288420.754670] [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
>> [288420.754678] [<ffffffffa03b4090>] xlog_write+0x190/0x730 [xfs]
>> [288420.754686] [<ffffffffa03b5d9e>] xlog_cil_push+0x24e/0x3e0 [xfs]
>> [288420.754693] [<ffffffffa03b5f45>] xlog_cil_push_work+0x15/0x20 [xfs]
>> [288420.754695] [<ffffffff81084c19>] process_one_work+0x159/0x4f0
>> [288420.754697] [<ffffffff81084fdc>] process_scheduled_works+0x2c/0x40
>> [288420.754698] [<ffffffff8108579b>] worker_thread+0x29b/0x530
>> [288420.754699] [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
>> [288420.754701] [<ffffffff8108b6f9>] kthread+0xc9/0xe0
>> [288420.754703] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>> [288420.754705] [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
>> [288420.754707] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>
> Ah, thanks. According to the above, xfs-cil is waiting on log space to free
> up. This means xfs-cil is probably in:
>
> xlog_state_get_iclog_space()
> ->xlog_wait(&log->l_flush_wait, &log->l_icloglock);
>
> l_flush_wait is awoken during log I/O completion handling via the
> xfs-log workqueue. That guy is here:
>
> [288420.773968] INFO: task kworker/6:3:420227 blocked for more than 300 seconds.
> [288420.773986] Not tainted 3.18.43-40 #1
> [288420.773997] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [288420.774017] kworker/6:3 D ffff880103893650 0 420227 2 0x00000000
> [288420.774027] Workqueue: xfs-log/rbd1 xfs_log_worker [xfs]
> [288420.774028] ffff88010357fac8 0000000000000046 0000000000000000 ffff880103893240
> [288420.774030] 0000000000012b00 ffff880146361128 ffff88010357ffd8 0000000000012b00
> [288420.774031] ffff8810297b7540 ffff880103893240 ffff88010357fae8 ffff88010357fbf8
> [288420.774033] Call Trace:
> [288420.774035] [<ffffffff816dd629>] schedule+0x29/0x70
> [288420.774036] [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200
> [288420.774038] [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120
> [288420.774040] [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70
> [288420.774042] [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180
> [288420.774043] [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390
> [288420.774044] [<ffffffff81086335>] flush_work+0x165/0x250
> [288420.774046] [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0
> [288420.774054] [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs]
> [288420.774056] [<ffffffff8109f3cc>] ? dequeue_entity+0x17c/0x520
> [288420.774063] [<ffffffffa03b478e>] _xfs_log_force+0x6e/0x280 [xfs]
> [288420.774065] [<ffffffff810a076c>] ? put_prev_entity+0x3c/0x2e0
> [288420.774067] [<ffffffff8109b315>] ? sched_clock_cpu+0x95/0xd0
> [288420.774068] [<ffffffff810145a2>] ? __switch_to+0xf2/0x5f0
> [288420.774076] [<ffffffffa03b49d9>] xfs_log_force+0x39/0xe0 [xfs]
> [288420.774083] [<ffffffffa03b4aa8>] xfs_log_worker+0x28/0x50 [xfs]
> [288420.774085] [<ffffffff81084c19>] process_one_work+0x159/0x4f0
> [288420.774086] [<ffffffff8108561b>] worker_thread+0x11b/0x530
> [288420.774088] [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0
> [288420.774089] [<ffffffff8108b6f9>] kthread+0xc9/0xe0
> [288420.774091] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
> [288420.774093] [<ffffffff816e1b98>] ret_from_fork+0x58/0x90
> [288420.774095] [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90
>
> ... which is back waiting on xfs-cil.
>
> Ilya,
>
> Have you looked into this[1] patch by any chance? Note that 7a29ac474
> ("xfs: give all workqueues rescuer threads") may also be a potential
> band aid for this. Or IOW, the lack thereof in v3.18.z may make this
> problem more likely.
No, I haven't -- this was a clear rbd/libceph bug to me.
Thanks,
Ilya
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-29 14:25 ` Ilya Dryomov
@ 2017-03-30 6:25 ` Michal Hocko
2017-03-30 10:02 ` Ilya Dryomov
2017-03-30 13:53 ` Ilya Dryomov
0 siblings, 2 replies; 21+ messages in thread
From: Michal Hocko @ 2017-03-30 6:25 UTC (permalink / raw)
To: Ilya Dryomov
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
> On Wed, Mar 29, 2017 at 1:16 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Wed 29-03-17 13:10:01, Ilya Dryomov wrote:
> >> On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <mhocko@kernel.org> wrote:
> >> > On Wed 29-03-17 12:41:26, Michal Hocko wrote:
> >> > [...]
> >> >> > ceph_con_workfn
> >> >> > mutex_lock(&con->mutex) # ceph_connection::mutex
> >> >> > try_write
> >> >> > ceph_tcp_connect
> >> >> > sock_create_kern
> >> >> > GFP_KERNEL allocation
> >> >> > allocator recurses into XFS, more I/O is issued
> >> >
> >> > One more note. So what happens if this is a GFP_NOIO request which
> >> > cannot make any progress? Your IO thread is blocked on con->mutex
> >> > as you write below, but the above thread cannot proceed either. So I am
> >> > _really_ not sure this actually helps.
> >>
> >> This is not the only I/O worker. A ceph cluster typically consists of
> >> at least a few OSDs and can be as large as thousands of OSDs. This is
> >> the reason we are calling sock_create_kern() on the writeback path in
> >> the first place: pre-opening thousands of sockets isn't feasible.
> >
> > Sorry for being dense here, but what actually guarantees forward
> > progress? My current understanding is that the deadlock is caused by
> > con->mutex being held while the allocation cannot make forward
> > progress. I can imagine this would be possible if the other IO flushers
> > depend on this lock. But then NOIO vs. KERNEL allocation doesn't make
> > much difference. What am I missing?
>
> con->mutex is per-ceph_connection, osdc->request_mutex is global and is
> the real problem here because we need both on the submit side, at least
> in 3.18. You are correct that even with GFP_NOIO this code may lock up
> in theory, however I think it's very unlikely in practice.
No, it would just make such a bug more obscure. The real problem seems
to be that you rely on locks which cannot guarantee forward progress
in the IO path. And that is a bug IMHO.
> We got rid of osdc->request_mutex in 4.7, so these workers are almost
> independent in newer kernels and should be able to free up memory for
> those blocked on GFP_NOIO retries with their respective con->mutex
> held. Using GFP_KERNEL and thus allowing the recursion is just asking
> for an AA deadlock on con->mutex OTOH, so it does make a difference.
You keep saying this but so far I haven't heard how the AA deadlock is
possible. Both GFP_KERNEL and GFP_NOIO can stall for an unbounded amount
of time and that would cause you problems AFAIU.
> I'm a little confused by this discussion because for me this patch was
> a no-brainer...
No, it is a brainer. Because recursion prevention should be carefully
thought through. The lack of this approach is why we have thousands of
GFP_NOFS uses all over the kernel without a clear or proper
justification. Adding more on top doesn't help long-term
maintainability.
> Locking aside, you said it was the stack trace in the changelog that
> got your attention
No, it is the usage of the scope GFP_NOIO API without a proper
explanation which caught my attention.
> are you saying it's OK for a block
> device to recurse back into the filesystem when doing I/O, potentially
> generating more I/O?
No, a block device has to make a forward progress guarantee when
allocating, and so should use mempools or other means to achieve that.
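E.g. something along these lines (a generic sketch, not actual libceph
code; the pool and helper names are made up for illustration):

#include <linux/mempool.h>

static mempool_t *io_req_pool;	/* reserve sized at init time */

static int init_io_req_pool(size_t req_size)
{
	/* keep a couple of requests' worth of memory in reserve */
	io_req_pool = mempool_create_kmalloc_pool(2, req_size);
	return io_req_pool ? 0 : -ENOMEM;
}

static void *alloc_io_req(gfp_t gfp)
{
	/*
	 * mempool_alloc() falls back to (and, if need be, waits for)
	 * the pre-allocated reserve when the page allocator cannot
	 * satisfy the request, which is what provides the forward
	 * progress guarantee on the writeback path.
	 */
	return mempool_alloc(io_req_pool, gfp);
}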
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-30 6:25 ` Michal Hocko
@ 2017-03-30 10:02 ` Ilya Dryomov
2017-03-30 11:21 ` Michal Hocko
2017-03-30 13:53 ` Ilya Dryomov
1 sibling, 1 reply; 21+ messages in thread
From: Ilya Dryomov @ 2017-03-30 10:02 UTC (permalink / raw)
To: Michal Hocko
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Thu, Mar 30, 2017 at 8:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
>> On Wed, Mar 29, 2017 at 1:16 PM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Wed 29-03-17 13:10:01, Ilya Dryomov wrote:
>> >> On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <mhocko@kernel.org> wrote:
>> >> > On Wed 29-03-17 12:41:26, Michal Hocko wrote:
>> >> > [...]
>> >> >> > ceph_con_workfn
>> >> >> > mutex_lock(&con->mutex) # ceph_connection::mutex
>> >> >> > try_write
>> >> >> > ceph_tcp_connect
>> >> >> > sock_create_kern
>> >> >> > GFP_KERNEL allocation
>> >> >> > allocator recurses into XFS, more I/O is issued
>> >> >
>> >> > One more note. So what happens if this is a GFP_NOIO request which
>> >> > cannot make any progress? Your IO thread is blocked on con->mutex
>> >> > as you write below, but the above thread cannot proceed either. So I am
>> >> > _really_ not sure this actually helps.
>> >>
>> >> This is not the only I/O worker. A ceph cluster typically consists of
>> >> at least a few OSDs and can be as large as thousands of OSDs. This is
>> >> the reason we are calling sock_create_kern() on the writeback path in
>> >> the first place: pre-opening thousands of sockets isn't feasible.
>> >
>> > Sorry for being dense here, but what actually guarantees forward
>> > progress? My current understanding is that the deadlock is caused by
>> > con->mutex being held while the allocation cannot make forward
>> > progress. I can imagine this would be possible if the other IO flushers
>> > depend on this lock. But then NOIO vs. KERNEL allocation doesn't make
>> > much difference. What am I missing?
>>
>> con->mutex is per-ceph_connection, osdc->request_mutex is global and is
>> the real problem here because we need both on the submit side, at least
>> in 3.18. You are correct that even with GFP_NOIO this code may lock up
>> in theory, however I think it's very unlikely in practice.
>
> No, it would just make such a bug more obscure. The real problem seems
> to be that you rely on locks which cannot guarantee forward progress
> in the IO path. And that is a bug IMHO.
Just to be clear: the "may lock up" comment above goes for 3.18, which
is where these stack traces came from. osdc->request_mutex, which stood
in the way of other ceph_connection workers, is no more.
>
>> We got rid of osdc->request_mutex in 4.7, so these workers are almost
>> independent in newer kernels and should be able to free up memory for
>> those blocked on GFP_NOIO retries with their respective con->mutex
>> held. Using GFP_KERNEL and thus allowing the recursion is just asking
>> for an AA deadlock on con->mutex OTOH, so it does make a difference.
>
> You keep saying this but so far I haven't heard how the AA deadlock is
> possible. Both GFP_KERNEL and GFP_NOIO can stall for an unbounded amount
> of time and that would cause you problems AFAIU.
Suppose we have an I/O for OSD X, which means it's got to go through
ceph_connection X:
ceph_con_workfn
mutex_lock(&con->mutex)
try_write
ceph_tcp_connect
sock_create_kern
GFP_KERNEL allocation
Suppose that generates another I/O for OSD X and blocks on it. Well,
it's got to go through the same ceph_connection:
rbd_queue_workfn
ceph_osdc_start_request
ceph_con_send
mutex_lock(&con->mutex) # deadlock, OSD X worker is knocked out
Now if that was a GFP_NOIO allocation, we would simply block in the
allocator. The placement algorithm distributes objects across the OSDs
in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
that OSD, some other I/Os for other OSDs would complete in the meantime
and free up memory. If we are under the kind of memory pressure that
makes GFP_NOIO allocations block for an extended period of time, we are
bound to have a lot of pre-open sockets, as we would have done at least
some flushing by then.
Thanks,
Ilya
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-30 10:02 ` Ilya Dryomov
@ 2017-03-30 11:21 ` Michal Hocko
2017-03-30 13:48 ` Ilya Dryomov
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-03-30 11:21 UTC (permalink / raw)
To: Ilya Dryomov
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Thu 30-03-17 12:02:03, Ilya Dryomov wrote:
> On Thu, Mar 30, 2017 at 8:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
[...]
> >> We got rid of osdc->request_mutex in 4.7, so these workers are almost
> >> independent in newer kernels and should be able to free up memory for
> >> those blocked on GFP_NOIO retries with their respective con->mutex
> >> held. Using GFP_KERNEL and thus allowing the recursion is just asking
> >> for an AA deadlock on con->mutex OTOH, so it does make a difference.
> >
> > You keep saying this but so far I haven't heard how the AA deadlock is
> > possible. Both GFP_KERNEL and GFP_NOIO can stall for an unbounded amount
> > of time and that would cause you problems AFAIU.
>
> Suppose we have an I/O for OSD X, which means it's got to go through
> ceph_connection X:
>
> ceph_con_workfn
> mutex_lock(&con->mutex)
> try_write
> ceph_tcp_connect
> sock_create_kern
> GFP_KERNEL allocation
>
> Suppose that generates another I/O for OSD X and blocks on it.
Yeah, I understand that, but I am asking _who_ is going to generate
that IO. We do not do writeback from the direct reclaim path. I am not
familiar with Ceph at all, but do any of its (slab) shrinkers generate
IO to recurse back?
> Well,
> it's got to go through the same ceph_connection:
>
> rbd_queue_workfn
> ceph_osdc_start_request
> ceph_con_send
> mutex_lock(&con->mutex) # deadlock, OSD X worker is knocked out
>
> Now if that was a GFP_NOIO allocation, we would simply block in the
> allocator. The placement algorithm distributes objects across the OSDs
> in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
> that OSD, some other I/Os for other OSDs would complete in the meantime
> and free up memory. If we are under the kind of memory pressure that
> makes GFP_NOIO allocations block for an extended period of time, we are
> bound to have a lot of pre-open sockets, as we would have done at least
> some flushing by then.
How is this any different from xfs waiting for its IO to be done?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-30 11:21 ` Michal Hocko
@ 2017-03-30 13:48 ` Ilya Dryomov
2017-03-30 14:36 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Ilya Dryomov @ 2017-03-30 13:48 UTC (permalink / raw)
To: Michal Hocko
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Thu, Mar 30, 2017 at 1:21 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Thu 30-03-17 12:02:03, Ilya Dryomov wrote:
>> On Thu, Mar 30, 2017 at 8:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
> [...]
>> >> We got rid of osdc->request_mutex in 4.7, so these workers are almost
>> >> independent in newer kernels and should be able to free up memory for
>> >> those blocked on GFP_NOIO retries with their respective con->mutex
>> >> held. Using GFP_KERNEL and thus allowing the recursion is just asking
>> >> for an AA deadlock on con->mutex OTOH, so it does make a difference.
>> >
>> > You keep saying this but so far I haven't heard how the AA deadlock is
>> > possible. Both GFP_KERNEL and GFP_NOIO can stall for an unbounded amount
>> > of time and that would cause you problems AFAIU.
>>
>> Suppose we have an I/O for OSD X, which means it's got to go through
>> ceph_connection X:
>>
>> ceph_con_workfn
>> mutex_lock(&con->mutex)
>> try_write
>> ceph_tcp_connect
>> sock_create_kern
>> GFP_KERNEL allocation
>>
>> Suppose that generates another I/O for OSD X and blocks on it.
>
> Yeah, I understand that, but I am asking _who_ is going to generate
> that IO. We do not do writeback from the direct reclaim path. I am not
It doesn't have to be a newly issued I/O; it could also be a wait on
something that depends on another I/O to OSD X, but I can't back this
up with any actual stack traces because the ones we have are too old.
That's just one scenario though. With such recursion allowed, we can
just as easily deadlock in the filesystem. Here are a couple of traces
from circa 4.8, where it's the mutex in xfs_reclaim_inodes_ag():
cc1 D ffff92243fad8180 0 6772 6770 0x00000080
ffff9224d107b200 ffff922438de2f40 ffff922e8304fed8 ffff9224d107b200
ffff922ea7554000 ffff923034fb0618 0000000000000000 ffff9224d107b200
ffff9230368e5400 ffff92303788b000 ffffffff951eb4e1 0000003e00095bc0
Nov 28 18:21:23 dude kernel: Call Trace:
[<ffffffff951eb4e1>] ? schedule+0x31/0x80
[<ffffffffc0ab0570>] ? _xfs_log_force_lsn+0x1b0/0x340 [xfs]
[<ffffffff94ca5790>] ? wake_up_q+0x60/0x60
[<ffffffffc0a9f7ff>] ? __xfs_iunpin_wait+0x9f/0x160 [xfs]
[<ffffffffc0ab0730>] ? xfs_log_force_lsn+0x30/0xb0 [xfs]
[<ffffffffc0a97041>] ? xfs_reclaim_inode+0x131/0x370 [xfs]
[<ffffffffc0a9f7ff>] ? __xfs_iunpin_wait+0x9f/0x160 [xfs]
[<ffffffff94cbcf80>] ? autoremove_wake_function+0x40/0x40
[<ffffffffc0a97041>] ? xfs_reclaim_inode+0x131/0x370 [xfs]
[<ffffffffc0a97442>] ? xfs_reclaim_inodes_ag+0x1c2/0x2d0 [xfs]
[<ffffffff94cb197c>] ? enqueue_task_fair+0x5c/0x920
[<ffffffff94c35895>] ? sched_clock+0x5/0x10
[<ffffffff94ca47e0>] ? check_preempt_curr+0x50/0x90
[<ffffffff94ca4834>] ? ttwu_do_wakeup+0x14/0xe0
[<ffffffff94ca53c3>] ? try_to_wake_up+0x53/0x3a0
[<ffffffffc0a98331>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
[<ffffffff94e05bfe>] ? super_cache_scan+0x17e/0x190
[<ffffffff94d919f3>] ? shrink_slab.part.38+0x1e3/0x3d0
[<ffffffff94d9616a>] ? shrink_node+0x10a/0x320
[<ffffffff94d96474>] ? do_try_to_free_pages+0xf4/0x350
[<ffffffff94d967ba>] ? try_to_free_pages+0xea/0x1b0
[<ffffffff94d863bd>] ? __alloc_pages_nodemask+0x61d/0xe60
[<ffffffff94dd918a>] ? alloc_pages_vma+0xba/0x280
[<ffffffff94db0f8b>] ? wp_page_copy+0x45b/0x6c0
[<ffffffff94db3e12>] ? alloc_set_pte+0x2e2/0x5f0
[<ffffffff94db2169>] ? do_wp_page+0x4a9/0x7e0
[<ffffffff94db4bd2>] ? handle_mm_fault+0x872/0x1250
[<ffffffff94c65a53>] ? __do_page_fault+0x1e3/0x500
[<ffffffff951f0cd8>] ? page_fault+0x28/0x30
kworker/9:3 D ffff92303f318180 0 20732 2 0x00000080
Workqueue: ceph-msgr ceph_con_workfn [libceph]
ffff923035dd4480 ffff923038f8a0c0 0000000000000001 000000009eb27318
ffff92269eb28000 ffff92269eb27338 ffff923036b145ac ffff923035dd4480
00000000ffffffff ffff923036b145b0 ffffffff951eb4e1 ffff923036b145a8
Call Trace:
[<ffffffff951eb4e1>] ? schedule+0x31/0x80
[<ffffffff951eb77a>] ? schedule_preempt_disabled+0xa/0x10
[<ffffffff951ed1f4>] ? __mutex_lock_slowpath+0xb4/0x130
[<ffffffff951ed28b>] ? mutex_lock+0x1b/0x30
[<ffffffffc0a974b3>] ? xfs_reclaim_inodes_ag+0x233/0x2d0 [xfs]
[<ffffffff94d92ba5>] ? move_active_pages_to_lru+0x125/0x270
[<ffffffff94f2b985>] ? radix_tree_gang_lookup_tag+0xc5/0x1c0
[<ffffffff94dad0f3>] ? __list_lru_walk_one.isra.3+0x33/0x120
[<ffffffffc0a98331>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
[<ffffffff94e05bfe>] ? super_cache_scan+0x17e/0x190
[<ffffffff94d919f3>] ? shrink_slab.part.38+0x1e3/0x3d0
[<ffffffff94d9616a>] ? shrink_node+0x10a/0x320
[<ffffffff94d96474>] ? do_try_to_free_pages+0xf4/0x350
[<ffffffff94d967ba>] ? try_to_free_pages+0xea/0x1b0
[<ffffffff94d863bd>] ? __alloc_pages_nodemask+0x61d/0xe60
[<ffffffff94ddf42d>] ? cache_grow_begin+0x9d/0x560
[<ffffffff94ddfb88>] ? fallback_alloc+0x148/0x1c0
[<ffffffff94de09db>] ? __kmalloc+0x1eb/0x580
# a buggy ceph_connection worker doing a GFP_KERNEL allocation
xz D ffff92303f358180 0 5932 5928 0x00000084
ffff921a56201180 ffff923038f8ae00 ffff92303788b2c8 0000000000000001
ffff921e90234000 ffff921e90233820 ffff923036b14eac ffff921a56201180
00000000ffffffff ffff923036b14eb0 ffffffff951eb4e1 ffff923036b14ea8
Call Trace:
[<ffffffff951eb4e1>] ? schedule+0x31/0x80
[<ffffffff951eb77a>] ? schedule_preempt_disabled+0xa/0x10
[<ffffffff951ed1f4>] ? __mutex_lock_slowpath+0xb4/0x130
[<ffffffff951ed28b>] ? mutex_lock+0x1b/0x30
[<ffffffffc0a974b3>] ? xfs_reclaim_inodes_ag+0x233/0x2d0 [xfs]
[<ffffffff94f2b985>] ? radix_tree_gang_lookup_tag+0xc5/0x1c0
[<ffffffff94dad0f3>] ? __list_lru_walk_one.isra.3+0x33/0x120
[<ffffffffc0a98331>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
[<ffffffff94e05bfe>] ? super_cache_scan+0x17e/0x190
[<ffffffff94d919f3>] ? shrink_slab.part.38+0x1e3/0x3d0
[<ffffffff94d9616a>] ? shrink_node+0x10a/0x320
[<ffffffff94d96474>] ? do_try_to_free_pages+0xf4/0x350
[<ffffffff94d967ba>] ? try_to_free_pages+0xea/0x1b0
[<ffffffff94d863bd>] ? __alloc_pages_nodemask+0x61d/0xe60
[<ffffffff94dd73b1>] ? alloc_pages_current+0x91/0x140
[<ffffffff94e0ab98>] ? pipe_write+0x208/0x3f0
[<ffffffff94e01b08>] ? new_sync_write+0xd8/0x130
[<ffffffff94e02293>] ? vfs_write+0xb3/0x1a0
[<ffffffff94e03672>] ? SyS_write+0x52/0xc0
[<ffffffff94c03b8a>] ? do_syscall_64+0x7a/0xd0
[<ffffffff951ef9a5>] ? entry_SYSCALL64_slow_path+0x25/0x25
We have since fixed that allocation site, but the point is it was
a combination of direct reclaim and GFP_KERNEL recursion.
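For reference, the patch being backported here handles the socket case by
putting the allocation under the scoped NOIO API. A simplified sketch of the
ceph_tcp_connect() change -- surrounding setup is elided and the field
accesses are approximate, so read it as an illustration rather than the
verbatim patch:
  #include <linux/ceph/messenger.h>   /* struct ceph_connection (assumed layout) */
  #include <linux/net.h>              /* sock_create_kern() */
  #include <linux/sched.h>            /* memalloc_noio_{save,restore}() */
  #include <net/net_namespace.h>      /* init_net */
  static int ceph_tcp_connect(struct ceph_connection *con)
  {
          struct sockaddr_storage *paddr = &con->peer_addr.in_addr;
          struct socket *sock;
          unsigned int noio_flag;
          int ret;
          /*
           * sock_create_kern() takes no gfp argument and allocates with
           * GFP_KERNEL internally, so mark the whole section NOIO: every
           * allocation in this window behaves as if it were GFP_NOIO and
           * cannot recurse into filesystem reclaim while con->mutex is held.
           */
          noio_flag = memalloc_noio_save();
          ret = sock_create_kern(&init_net /* real code uses the messenger's netns */,
                                 paddr->ss_family, SOCK_STREAM, IPPROTO_TCP, &sock);
          memalloc_noio_restore(noio_flag);
          if (ret)
                  return ret;
          /* ... socket options, sock->ops->connect(), error handling elided ... */
          con->sock = sock;
          return 0;
  }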
> familiar with Ceph at all but does any of its (slab) shrinkers generate
> IO to recurse back?
We don't register any custom shrinkers. This is XFS on top of rbd,
a ceph-backed block device.
>
>> Well,
>> it's got to go through the same ceph_connection:
>>
>> rbd_queue_workfn
>> ceph_osdc_start_request
>> ceph_con_send
>> mutex_lock(&con->mutex) # deadlock, OSD X worker is knocked out
>>
>> Now if that was a GFP_NOIO allocation, we would simply block in the
>> allocator. The placement algorithm distributes objects across the OSDs
>> in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
>> that OSD, some other I/Os for other OSDs would complete in the meantime
>> and free up memory. If we are under the kind of memory pressure that
>> makes GFP_NOIO allocations block for an extended period of time, we are
>> bound to have a lot of pre-open sockets, as we would have done at least
>> some flushing by then.
>
> How is this any different from xfs waiting for its IO to be done?
I feel like we are talking past each other here. If the worker in
question isn't deadlocked, it will eventually get its socket and start
flushing I/O. If it has deadlocked, it won't...
Thanks,
Ilya
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-30 6:25 ` Michal Hocko
2017-03-30 10:02 ` Ilya Dryomov
@ 2017-03-30 13:53 ` Ilya Dryomov
2017-03-30 13:59 ` Michal Hocko
1 sibling, 1 reply; 21+ messages in thread
From: Ilya Dryomov @ 2017-03-30 13:53 UTC (permalink / raw)
To: Michal Hocko
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Thu, Mar 30, 2017 at 8:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
>> On Wed, Mar 29, 2017 at 1:16 PM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Wed 29-03-17 13:10:01, Ilya Dryomov wrote:
>> >> On Wed, Mar 29, 2017 at 12:55 PM, Michal Hocko <mhocko@kernel.org> wrote:
>> >> > On Wed 29-03-17 12:41:26, Michal Hocko wrote:
>> >> > [...]
>> >> >> > ceph_con_workfn
>> >> >> > mutex_lock(&con->mutex) # ceph_connection::mutex
>> >> >> > try_write
>> >> >> > ceph_tcp_connect
>> >> >> > sock_create_kern
>> >> >> > GFP_KERNEL allocation
>> >> >> > allocator recurses into XFS, more I/O is issued
>> >> >
>> >> > One more note. So what happens if this is a GFP_NOIO request which
>> >> > cannot make any progress? Your IO thread is blocked on con->mutex
>> >> > as you write below but the above thread cannot proceed as well. So I am
>> >> > _really_ not sure this actually helps.
>> >>
>> >> This is not the only I/O worker. A ceph cluster typically consists of
>> >> at least a few OSDs and can be as large as thousands of OSDs. This is
>> >> the reason we are calling sock_create_kern() on the writeback path in
>> >> the first place: pre-opening thousands of sockets isn't feasible.
>> >
>> > Sorry for being dense here but what actually guarantees forward
>> > progress? My current understanding is that the deadlock is caused by
>> > con->mutex being held while the allocation cannot make forward
>> > progress. I can imagine this would be possible if the other IO flushers
>> > depend on this lock. But then NOIO vs. KERNEL allocation doesn't make
>> > much difference. What am I missing?
>>
>> con->mutex is per-ceph_connection; osdc->request_mutex is global and is
>> the real problem here because we need both on the submit side, at least
>> in 3.18. You are correct that even with GFP_NOIO this code may lock up
>> in theory; however, I think it's very unlikely in practice.
>
> No, it would just make such a bug more obscure. The real problem seems
> to be that you rely on locks which cannot guarantee forward progress
> in the IO path. And that is a bug IMHO.
>
>> We got rid of osdc->request_mutex in 4.7, so these workers are almost
>> independent in newer kernels and should be able to free up memory for
>> those blocked on GFP_NOIO retries with their respective con->mutex
>> held. Using GFP_KERNEL and thus allowing the recursion is just asking
>> for an AA deadlock on con->mutex OTOH, so it does make a difference.
>
> You keep saying this but so far I haven't heard how the AA deadlock is
> possible. Both GFP_KERNEL and GFP_NOIO can stall for an unbounded amount
> of time and that would cause you problems AFAIU.
>
>> I'm a little confused by this discussion because for me this patch was
>> a no-brainer...
>
> No, it is a brainer, because recursion prevention should be carefully
> thought through. The lack of this approach has left us with
> thousands of GFP_NOFS uses all over the kernel without a clear or proper
> justification. Adding more on top doesn't help long-term
> maintainability.
>
>> Locking aside, you said it was the stack trace in the changelog that
>> got your attention
>
> No, it is the use of the scoped GFP_NOIO API without a proper
> explanation that caught my attention.
>
>> are you saying it's OK for a block
>> device to recurse back into the filesystem when doing I/O, potentially
>> generating more I/O?
>
> No, block device has to make a forward progress guarantee when
> allocating and so use mempools or other means to achieve the same.
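The mempool pattern referred to above looks roughly like this; the struct
and function names are illustrative, not actual rbd/libceph code:
  #include <linux/mempool.h>
  #include <linux/slab.h>
  struct demo_request {                   /* illustrative request type */
          void *payload;
          unsigned int len;
  };
  static struct kmem_cache *demo_req_cache;
  static mempool_t *demo_req_pool;
  static int demo_reserve_init(void)
  {
          demo_req_cache = KMEM_CACHE(demo_request, 0);
          if (!demo_req_cache)
                  return -ENOMEM;
          /* pre-allocate a reserve of 16 requests for the I/O path */
          demo_req_pool = mempool_create_slab_pool(16, demo_req_cache);
          if (!demo_req_pool) {
                  kmem_cache_destroy(demo_req_cache);
                  return -ENOMEM;
          }
          return 0;
  }
  static struct demo_request *demo_get_request(void)
  {
          /*
           * mempool_alloc() with GFP_NOIO never recurses into filesystem
           * reclaim; if the page allocator has nothing to give, it falls
           * back to the reserve and, failing that, sleeps until an earlier
           * request is freed back into the pool -- forward progress is tied
           * to I/O completion rather than to reclaim.
           */
          return mempool_alloc(demo_req_pool, GFP_NOIO);
  }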
OK, let me put this differently. Do you agree that a block device
cannot make _any_ kind of progress guarantee if it does a GFP_KERNEL
allocation in the I/O path?
Thanks,
Ilya
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-30 13:53 ` Ilya Dryomov
@ 2017-03-30 13:59 ` Michal Hocko
0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-03-30 13:59 UTC (permalink / raw)
To: Ilya Dryomov
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Thu 30-03-17 15:53:35, Ilya Dryomov wrote:
> On Thu, Mar 30, 2017 at 8:25 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Wed 29-03-17 16:25:18, Ilya Dryomov wrote:
[...]
> >> are you saying it's OK for a block
> >> device to recurse back into the filesystem when doing I/O, potentially
> >> generating more I/O?
> >
> > No, block device has to make a forward progress guarantee when
> > allocating and so use mempools or other means to achieve the same.
>
> OK, let me put this differently. Do you agree that a block device
> cannot make _any_ kind of progress guarantee if it does a GFP_KERNEL
> allocation in the I/O path?
Yes, that is correct. And the same is true for GFP_NOIO allocations as
well.
--
Michal Hocko
SUSE Labs
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-30 13:48 ` Ilya Dryomov
@ 2017-03-30 14:36 ` Michal Hocko
2017-03-30 15:06 ` Ilya Dryomov
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-03-30 14:36 UTC (permalink / raw)
To: Ilya Dryomov
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Thu 30-03-17 15:48:42, Ilya Dryomov wrote:
> On Thu, Mar 30, 2017 at 1:21 PM, Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > familiar with Ceph at all but does any of its (slab) shrinkers generate
> > IO to recurse back?
>
> We don't register any custom shrinkers. This is XFS on top of rbd,
> a ceph-backed block device.
OK, that was the part I was missing. So you depend on XFS to make
forward progress here.
> >> Well,
> >> it's got to go through the same ceph_connection:
> >>
> >> rbd_queue_workfn
> >> ceph_osdc_start_request
> >> ceph_con_send
> >> mutex_lock(&con->mutex) # deadlock, OSD X worker is knocked out
> >>
> >> Now if that was a GFP_NOIO allocation, we would simply block in the
> >> allocator. The placement algorithm distributes objects across the OSDs
> >> in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
> >> that OSD, some other I/Os for other OSDs would complete in the meantime
> >> and free up memory. If we are under the kind of memory pressure that
> >> makes GFP_NOIO allocations block for an extended period of time, we are
> >> bound to have a lot of pre-open sockets, as we would have done at least
> >> some flushing by then.
> >
> > How is this any different from xfs waiting for its IO to be done?
>
> I feel like we are talking past each other here. If the worker in
> question isn't deadlocked, it will eventually get its socket and start
> flushing I/O. If it has deadlocked, it won't...
But if the allocation is stuck then the holder of the lock cannot make
forward progress and it is effectively deadlocked, because other IO
depends on the lock it holds. Maybe I just ask bad questions, but what
makes GFP_NOIO different from GFP_KERNEL here? We know that the latter
might need to wait for an IO to finish in the shrinker, but it itself
doesn't take the lock in question directly. The former depends on the
allocator's forward progress as well, and that in turn waits for somebody
else to proceed with the IO. So to me, any blocking allocation while
holding a lock which blocks further IO from completing is simply broken.
--
Michal Hocko
SUSE Labs
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-30 14:36 ` Michal Hocko
@ 2017-03-30 15:06 ` Ilya Dryomov
2017-03-30 16:12 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Ilya Dryomov @ 2017-03-30 15:06 UTC (permalink / raw)
To: Michal Hocko
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Thu, Mar 30, 2017 at 4:36 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Thu 30-03-17 15:48:42, Ilya Dryomov wrote:
>> On Thu, Mar 30, 2017 at 1:21 PM, Michal Hocko <mhocko@kernel.org> wrote:
> [...]
>> > familiar with Ceph at all but does any of its (slab) shrinkers generate
>> > IO to recurse back?
>>
>> We don't register any custom shrinkers. This is XFS on top of rbd,
>> a ceph-backed block device.
>
> OK, that was the part I was missing. So you depend on XFS to make
> forward progress here.
>
>> >> Well,
>> >> it's got to go through the same ceph_connection:
>> >>
>> >> rbd_queue_workfn
>> >> ceph_osdc_start_request
>> >> ceph_con_send
>> >> mutex_lock(&con->mutex) # deadlock, OSD X worker is knocked out
>> >>
>> >> Now if that was a GFP_NOIO allocation, we would simply block in the
>> >> allocator. The placement algorithm distributes objects across the OSDs
>> >> in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
>> >> that OSD, some other I/Os for other OSDs would complete in the meantime
>> >> and free up memory. If we are under the kind of memory pressure that
>> >> makes GFP_NOIO allocations block for an extended period of time, we are
>> >> bound to have a lot of pre-open sockets, as we would have done at least
>> >> some flushing by then.
>> >
>> > How is this any different from xfs waiting for its IO to be done?
>>
>> I feel like we are talking past each other here. If the worker in
>> question isn't deadlocked, it will eventually get its socket and start
>> flushing I/O. If it has deadlocked, it won't...
>
> But if the allocation is stuck then the holder of the lock cannot make
> forward progress and it is effectively deadlocked, because other IO
> depends on the lock it holds. Maybe I just ask bad questions, but what
Only I/O to the same OSD. A typical ceph cluster has dozens of OSDs,
so there is plenty of room for other in-flight I/Os to finish and move
the allocator forward. The lock in question is per-ceph_connection
(read: per-OSD).
> makes GFP_NOIO different from GFP_KERNEL here? We know that the latter
> might need to wait for an IO to finish in the shrinker, but it itself
> doesn't take the lock in question directly. The former depends on the
> allocator's forward progress as well, and that in turn waits for somebody
> else to proceed with the IO. So to me, any blocking allocation while
> holding a lock which blocks further IO from completing is simply broken.
Right, with GFP_NOIO we simply wait -- there is nothing wrong with
a blocking allocation, at least in the general case. With GFP_KERNEL
we deadlock, either in rbd/libceph (less likely) or in the filesystem
above (more likely, shown in the xfs_reclaim_inodes_ag() traces you
omitted in your quote).
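The difference boils down to the gfp bits -- approximately, from
include/linux/gfp.h of that era:
  /*
   * Both allow blocking in direct reclaim, but GFP_NOIO lacks the bits
   * that let reclaim start new I/O or call back into filesystems -- which
   * is exactly the recursion at issue here.
   */
  #define GFP_NOIO        (__GFP_RECLAIM)
  #define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS)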
Thanks,
Ilya
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-30 15:06 ` Ilya Dryomov
@ 2017-03-30 16:12 ` Michal Hocko
2017-03-30 17:19 ` Ilya Dryomov
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-03-30 16:12 UTC (permalink / raw)
To: Ilya Dryomov
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Thu 30-03-17 17:06:51, Ilya Dryomov wrote:
[...]
> > But if the allocation is stuck then the holder of the lock cannot make
> > forward progress and it is effectively deadlocked, because other IO
> > depends on the lock it holds. Maybe I just ask bad questions, but what
>
> Only I/O to the same OSD. A typical ceph cluster has dozens of OSDs,
> so there is plenty of room for other in-flight I/Os to finish and move
> the allocator forward. The lock in question is per-ceph_connection
> (read: per-OSD).
>
> > makes GFP_NOIO different from GFP_KERNEL here? We know that the latter
> > might need to wait for an IO to finish in the shrinker, but it itself
> > doesn't take the lock in question directly. The former depends on the
> > allocator's forward progress as well, and that in turn waits for somebody
> > else to proceed with the IO. So to me, any blocking allocation while
> > holding a lock which blocks further IO from completing is simply broken.
>
> Right, with GFP_NOIO we simply wait -- there is nothing wrong with
> a blocking allocation, at least in the general case. With GFP_KERNEL
> we deadlock, either in rbd/libceph (less likely) or in the filesystem
> above (more likely, shown in the xfs_reclaim_inodes_ag() traces you
> omitted in your quote).
I am not convinced. It seems you are relying on something that is not
guaranteed fundamentally. AFAIU all the IO paths should _guarantee_
forward progress and use mempools for that purpose if they need to
allocate.
But, hey, I will not argue as my understanding of ceph is close to
zero. You are the maintainer so it is your call. I would just really
appreciate if you could document this as much as possible (ideally
at the place where you call memalloc_noio_save and describe the lock
dependency there).
Thanks!
--
Michal Hocko
SUSE Labs
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-30 16:12 ` Michal Hocko
@ 2017-03-30 17:19 ` Ilya Dryomov
2017-03-30 18:44 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Ilya Dryomov @ 2017-03-30 17:19 UTC (permalink / raw)
To: Michal Hocko
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Thu, Mar 30, 2017 at 6:12 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Thu 30-03-17 17:06:51, Ilya Dryomov wrote:
> [...]
>> > But if the allocation is stuck then the holder of the lock cannot make
>> > forward progress and it is effectively deadlocked, because other IO
>> > depends on the lock it holds. Maybe I just ask bad questions, but what
>>
>> Only I/O to the same OSD. A typical ceph cluster has dozens of OSDs,
>> so there is plenty of room for other in-flight I/Os to finish and move
>> the allocator forward. The lock in question is per-ceph_connection
>> (read: per-OSD).
>>
>> > makes GFP_NOIO different from GFP_KERNEL here? We know that the latter
>> > might need to wait for an IO to finish in the shrinker, but it itself
>> > doesn't take the lock in question directly. The former depends on the
>> > allocator's forward progress as well, and that in turn waits for somebody
>> > else to proceed with the IO. So to me, any blocking allocation while
>> > holding a lock which blocks further IO from completing is simply broken.
>>
>> Right, with GFP_NOIO we simply wait -- there is nothing wrong with
>> a blocking allocation, at least in the general case. With GFP_KERNEL
>> we deadlock, either in rbd/libceph (less likely) or in the filesystem
>> above (more likely, shown in the xfs_reclaim_inodes_ag() traces you
>> omitted in your quote).
>
> I am not convinced. It seems you are relying on something that is not
> guaranteed fundamentally. AFAIU all the IO paths should _guarantee_
> forward progress and use mempools for that purpose if they need to
> allocate.
>
> But, hey, I will not argue as my understanding of ceph is close to
> zero. You are the maintainer so it is your call. I would just really
> appreciate if you could document this as much as possible (ideally
> at the place where you call memalloc_noio_save and describe the lock
> dependency there).
It's certainly not perfect (especially this socket case -- putting
together a pool of sockets is not easy) and I'm sure one could poke
some holes in the entire thing, but I'm convinced we are much better
off with the memalloc_noio_{save,restore}() pair in there.
I'll try to come up with a better comment, but the problem is that it
can be an arbitrary lock in an arbitrary filesystem, not just libceph's
con->mutex, so it's hard to be specific.
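Something along these lines, perhaps; the wording below is only a sketch of
such a comment, not the final text:
  /*
   * con->mutex is held while the socket is (re)opened.  If this
   * allocation were allowed to enter direct reclaim with I/O and FS
   * enabled, the filesystem sitting on top of rbd could issue or wait
   * on I/O that has to go through this very ceph_connection (i.e.
   * through con->mutex), or block on one of its own reclaim locks
   * already held by another task -- see the xfs_reclaim_inodes_ag()
   * traces earlier in this thread.  Keep the whole section under the
   * scoped NOIO API so that reclaim from here never generates new I/O.
   */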
Do I have your OK to poke Greg to get the backports going?
Thanks,
Ilya
* Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
2017-03-30 17:19 ` Ilya Dryomov
@ 2017-03-30 18:44 ` Michal Hocko
0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-03-30 18:44 UTC (permalink / raw)
To: Ilya Dryomov
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable,
Sergey Jerusalimov, Jeff Layton, linux-xfs
On Thu 30-03-17 19:19:59, Ilya Dryomov wrote:
> On Thu, Mar 30, 2017 at 6:12 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Thu 30-03-17 17:06:51, Ilya Dryomov wrote:
> > [...]
> >> > But if the allocation is stuck then the holder of the lock cannot make
> >> > forward progress and it is effectively deadlocked, because other IO
> >> > depends on the lock it holds. Maybe I just ask bad questions, but what
> >>
> >> Only I/O to the same OSD. A typical ceph cluster has dozens of OSDs,
> >> so there is plenty of room for other in-flight I/Os to finish and move
> >> the allocator forward. The lock in question is per-ceph_connection
> >> (read: per-OSD).
> >>
> >> > makes GFP_NOIO different from GFP_KERNEL here? We know that the latter
> >> > might need to wait for an IO to finish in the shrinker, but it itself
> >> > doesn't take the lock in question directly. The former depends on the
> >> > allocator's forward progress as well, and that in turn waits for somebody
> >> > else to proceed with the IO. So to me, any blocking allocation while
> >> > holding a lock which blocks further IO from completing is simply broken.
> >>
> >> Right, with GFP_NOIO we simply wait -- there is nothing wrong with
> >> a blocking allocation, at least in the general case. With GFP_KERNEL
> >> we deadlock, either in rbd/libceph (less likely) or in the filesystem
> >> above (more likely, shown in the xfs_reclaim_inodes_ag() traces you
> >> omitted in your quote).
> >
> > I am not convinced. It seems you are relying on something that is not
> > guaranteed fundamentally. AFAIU all the IO paths should _guarantee_
> > forward progress and use mempools for that purpose if they need to
> > allocate.
> >
> > But, hey, I will not argue as my understanding of ceph is close to
> > zero. You are the maintainer so it is your call. I would just really
> > appreciate if you could document this as much as possible (ideally
> > at the place where you call memalloc_noio_save and describe the lock
> > dependency there).
>
> It's certainly not perfect (especially this socket case -- putting
> together a pool of sockets is not easy) and I'm sure one could poke
> some holes in the entire thing,
I would recommend testing under heavy memory pressure (involving OOM
killer invocations) with a lot of IO pressure to see what falls out.
> but I'm convinced we are much better
> off with the memalloc_noio_{save,restore}() pair in there.
>
> I'll try to come up with a better comment, but the problem is that it
> can be an arbitrary lock in an arbitrary filesystem, not just libceph's
> con->mutex, so it's hard to be specific.
But the particular path should describe what the deadlock scenario is,
regardless of the FS (xfs is likely not the only one to wait for the
IO to finish).
> Do I have your OK to poke Greg to get the backports going?
As I've said, it's your call; if you feel comfortable with this, then I
will certainly not stand in the way.
--
Michal Hocko
SUSE Labs
Thread overview: 21+ messages
[not found] <20170328122559.966310440@linuxfoundation.org>
[not found] ` <20170328122601.905696872@linuxfoundation.org>
[not found] ` <20170328124312.GE18241@dhcp22.suse.cz>
[not found] ` <CAOi1vP-TeEwNM8n=Z5b6yx1epMDVJ4f7+S1poubA7zfT7L0hQQ@mail.gmail.com>
[not found] ` <20170328133040.GJ18241@dhcp22.suse.cz>
[not found] ` <CAOi1vP-doHSj8epQ1zLBnEi8QM4Eb7nFb5uo-XeUquZUkhacsg@mail.gmail.com>
2017-03-29 10:41 ` [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations Michal Hocko
2017-03-29 10:55 ` Michal Hocko
2017-03-29 11:10 ` Ilya Dryomov
2017-03-29 11:16 ` Michal Hocko
2017-03-29 14:25 ` Ilya Dryomov
2017-03-30 6:25 ` Michal Hocko
2017-03-30 10:02 ` Ilya Dryomov
2017-03-30 11:21 ` Michal Hocko
2017-03-30 13:48 ` Ilya Dryomov
2017-03-30 14:36 ` Michal Hocko
2017-03-30 15:06 ` Ilya Dryomov
2017-03-30 16:12 ` Michal Hocko
2017-03-30 17:19 ` Ilya Dryomov
2017-03-30 18:44 ` Michal Hocko
2017-03-30 13:53 ` Ilya Dryomov
2017-03-30 13:59 ` Michal Hocko
2017-03-29 11:05 ` Brian Foster
2017-03-29 11:14 ` Ilya Dryomov
2017-03-29 11:18 ` Michal Hocko
2017-03-29 11:49 ` Brian Foster
2017-03-29 14:30 ` Ilya Dryomov