* backport 7a29ac474a47eb8cf212b45917683ae89d6fa13b to stable ?
@ 2016-02-23 16:13 Jean-Tiare Le Bigot
From: Jean-Tiare Le Bigot @ 2016-02-23 16:13 UTC (permalink / raw)
To: xfs
Hi,
We've hit a kernel hang related to XFS reclaim under heavy I/O load on a
couple of storage servers running XFS over flashcache on a 3.13.y kernel.
On the crash dumps, kthreadd is blocked, waiting for XFS to reclaim some
memory but the related reclaim job is queued on a worker_pool stuck
waiting for some I/O, itself depending on other jobs on other queues
which would require additional threads to go forward. Unfortunately
kthreadd is blocked.
The host has plenty of memory (~128GB), about 80% of which is used
for the page cache.
It looks like this is fixed by commit
7a29ac474a47eb8cf212b45917683ae89d6fa13b. We manually applied a fix to
our internal branch, but I could not find a similar commit on the
longterm branches. Maybe it would be a good candidate for backporting
for other users?
On linux-3.14.y, this would be:
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index d971f49..36af881 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -858,17 +858,17 @@ xfs_init_mount_workqueues(
 		goto out_destroy_unwritten;
 
 	mp->m_reclaim_workqueue = alloc_workqueue("xfs-reclaim/%s",
-			0, 0, mp->m_fsname);
+			WQ_MEM_RECLAIM, 0, mp->m_fsname);
 	if (!mp->m_reclaim_workqueue)
 		goto out_destroy_cil;
 
 	mp->m_log_workqueue = alloc_workqueue("xfs-log/%s",
-			0, 0, mp->m_fsname);
+			WQ_MEM_RECLAIM, 0, mp->m_fsname);
 	if (!mp->m_log_workqueue)
 		goto out_destroy_reclaim;
 
 	mp->m_eofblocks_workqueue = alloc_workqueue("xfs-eofblocks/%s",
-			0, 0, mp->m_fsname);
+			WQ_MEM_RECLAIM, 0, mp->m_fsname);
 	if (!mp->m_eofblocks_workqueue)
 		goto out_destroy_log;
 
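For reference, WQ_MEM_RECLAIM is what guarantees a workqueue its own
pre-allocated rescuer thread, so queued work can still make progress
when no new kworker can be forked. A minimal sketch of the semantics
(my_wq and my_work_fn are made-up names for illustration):

#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *my_wq;
static struct work_struct my_work;

static void my_work_fn(struct work_struct *work)
{
	/* Can still run under memory pressure: when the pool fails to
	 * fork a new kworker, the "mayday" path hands this item to the
	 * rescuer thread created by WQ_MEM_RECLAIM. */
}

static int __init my_init(void)
{
	/* The rescuer is created here, at allocation time, not lazily
	 * at queue_work() time. */
	my_wq = alloc_workqueue("my-wq", WQ_MEM_RECLAIM, 0);
	if (!my_wq)
		return -ENOMEM;

	INIT_WORK(&my_work, my_work_fn);
	queue_work(my_wq, &my_work);
	return 0;
}

static void __exit my_exit(void)
{
	destroy_workqueue(my_wq);	/* also stops the rescuer */
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");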
Regards,
--
Jean-Tiare Le Bigot, OVH
* Re: backport 7a29ac474a47eb8cf212b45917683ae89d6fa13b to stable ?
@ 2016-02-23 22:45 ` Dave Chinner
From: Dave Chinner @ 2016-02-23 22:45 UTC (permalink / raw)
To: Jean-Tiare Le Bigot; +Cc: xfs
On Tue, Feb 23, 2016 at 05:13:35PM +0100, Jean-Tiare Le Bigot wrote:
> Hi,
>
> We've hit a kernel hang related to XFS reclaim under heavy I/O load on a
> couple of storage servers running XFS over flashcache on a 3.13.y kernel.
>
> On the crash dumps, kthreadd is blocked, waiting for XFS to reclaim some
> memory but the related reclaim job is queued on a worker_pool stuck
> waiting for some I/O, itself depending on other jobs on other queues
> which would require additional threads to go forward. Unfortunately
> kthreadd is blocked.
> The host has plenty of memory (~128GB), about 80% of which is used
> for the page cache.
>
> It looks like this is fixed by commit
> 7a29ac474a47eb8cf212b45917683ae89d6fa13b.
That commit fixed a regression introduced, IIRC, in 3.19. The
problem it fixed didn't exist before then, so I doubt you are seeing
the problem that the above commit fixed. Perhaps you'd like to
describe your problem along with the stack traces, etc., so we have
some idea of what you are trying to fix?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: backport 7a29ac474a47eb8cf212b45917683ae89d6fa13b to stable ?
@ 2016-02-24 17:27 ` Jean-Tiare Le Bigot
From: Jean-Tiare Le Bigot @ 2016-02-24 17:27 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Hi,
Thanks for having a look.
On 02/23/2016 11:45 PM, Dave Chinner wrote:
> On Tue, Feb 23, 2016 at 05:13:35PM +0100, Jean-Tiare Le Bigot wrote:
>> Hi,
>>
>> We've hit a kernel hang related to XFS reclaim under heavy I/O load on a
>> couple of storage servers running XFS over flashcache on a 3.13.y kernel.
>>
>> On the crash dumps, kthreadd is blocked, waiting for XFS to reclaim some
>> memory but the related reclaim job is queued on a worker_pool stuck
>> waiting for some I/O, itself depending on other jobs on other queues
>> which would require additional threads to go forward. Unfortunately
>> kthreadd is blocked.
>> The host has plenty of memory (~128GB), about 80% of which is used
>> for the page cache.
>>
>> It looks like this is fixed by commit
>> 7a29ac474a47eb8cf212b45917683ae89d6fa13b.
>
> That commit fixed a regression introduced, IIRC, in 3.19. The
> problem it fixed didn't exist before then, so I doubt you are seeing
> the problem that the above commit fixed. Perhaps you'd like to
> describe your problem along with the stack traces, etc., so we have
> some idea of what you are trying to fix?
On the dump we investigated, kthreadd is stuck, waiting for some memory
crash> bt 2
PID: 2 TASK: ffff881fd2a39800 CPU: 0 COMMAND: "kthreadd"
#0 [ffff881fd2ab37c8] __schedule at ffffffff81724e19
#1 [ffff881fd2ab3830] io_schedule at ffffffff817255fd
#2 [ffff881fd2ab3848] __xfs_iunpin_wait at ffffffffa06cf269 [xfs]
#3 [ffff881fd2ab38c0] xfs_iunpin_wait at ffffffffa06d2629 [xfs]
#4 [ffff881fd2ab38d0] xfs_reclaim_inode at ffffffffa068ed4c [xfs]
#5 [ffff881fd2ab3910] xfs_reclaim_inodes_ag at ffffffffa068f267 [xfs]
#6 [ffff881fd2ab3aa0] xfs_reclaim_inodes_nr at ffffffffa068fd73 [xfs]
#7 [ffff881fd2ab3ac0] xfs_fs_free_cached_objects at ffffffffa069a3a5 [xfs]
#8 [ffff881fd2ab3ad0] super_cache_scan at ffffffff811c13e9
#9 [ffff881fd2ab3b18] shrink_slab at ffffffff81160f27
#10 [ffff881fd2ab3bc0] do_try_to_free_pages at ffffffff8116405d
#11 [ffff881fd2ab3c38] try_to_free_pages at ffffffff8116429c
#12 [ffff881fd2ab3cd8] __alloc_pages_nodemask at ffffffff81158f15
#13 [ffff881fd2ab3e10] copy_process at ffffffff810652b3
#14 [ffff881fd2ab3e90] do_fork at ffffffff810669f5
#15 [ffff881fd2ab3ef8] kernel_thread at ffffffff81066c86
#16 [ffff881fd2ab3f08] kthreadd at ffffffff8108beea
#17 [ffff881fd2ab3f50] ret_from_fork at ffffffff817318bc
This triggered the insertion of an xfs-reclaim job for device dm-46 on
the worker pool bound to CPU #0.
This xfs_reclaim job is the first *pending* job on the worker pool. The
worker pool has 2 active workers stuck in "xfs_log_worker", both blocked
in "xlog_state_get_iclog_space". My guess is they are waiting for some
underlying flashcache I/O (which won't happen, see below).
--> busy_hash aka busy workers:
crash> struct worker_pool.busy_hash 0xffff88207fc12d40 | grep -o 0xf.*
0xffff8806dd3b5780 --> 132626
--> in xfs_log_worker
--> in xlog_state_get_iclog_space
--> dm-40
0xffff88142d06ae80 --> 129142
--> in xfs_log_worker
--> in xlog_state_get_iclog_space
--> dm-63
--> pending jobs:
crash> list -o work_struct.entry -s work_struct.func -H ffff88207fc12d58
-x | awk 'BEGIN{w="";c=0} {if($1=="func") printf("%2d %s --> %s\n", c,
w, $4); else {w=$1;c++}}'
WORK FUNC DEVICE
1 ffff881fceda6ca8 --> <xfs_reclaim_worker> dm-46
2 ffff881fcd51bc28 --> <xfs_log_worker> dm-46
3 ffff88207fc0f3a0 --> <vmstat_update> N/A
...
To make progress, this queue needs either:
- the I/O to complete
- a new thread to handle the xfs_reclaim
(which does not trigger I/O in this context, if I got the code right)
The I/O is stuck, so we need a new thread. The pool manager requested
this thread BUT the request is pending in kthreadd's queue. It is #5 in
the list:
crash> list -o kthread_create_info.list -s kthread_create_info -H
kthread_create_list | grep threadfn | sort | uniq -c
For each pending request, the 'data' field represents a struct worker:
WORKER POOL ID
1 0xffff881851c50780 0xffff881fff011c00 4 unbound pool 24
2 0xffff881a08c5b680 0xffff881fd1cb4c00 0 unbound pool 25
3 0xffff881295297700 0xffff88207fd72d40 0 cpu 11 pool 22
4 0xffff88131ed9c580 0xffff88207fc72d40 1 cpu 3 pool 6
5 0xffff88062e8e9b80 0xffff88207fc12d40 0 cpu 0 pool 0
6 0xffff8819440d1500 0xffff88207fc92d40 2 cpu 4 pool 8
7 0xffff8802288aa280 0xffff88207fc52d40 0 cpu 2 pool 4
8 0xffff88120d711200 0xffff88207fcb2d40 0 cpu 5 pool 10
9 0xffff881cf648fb00 0xffff88207fc32d40 0 cpu 1 pool 2
10 0xffff8819c1471a00 0xffff88207fd32d40 2 cpu 9 pool 18
Hence, the "mayday" mode should trigger and move the xfs_reclaim job to
the rescuer worker.
On the other hand, we also hit a similar issue in the underlying
flashcache. If it had not happened, I bet the system would not have
locked up. Anyway, as I understand the code, flashcache queues a single
job on the default bound system queue; when this job is executed, it
processes its own internal queues sequentially. So basically, if this
single job is stuck, all I/O on all flashcache devices is stuck (!).
(See the sketch after the dump below.)
This is on the worker pool bound to CPU #6:
--> busy_hash aka busy workers:
0xffff881330823080 --> 139598
--> in xfs_end_io
--> in rwsem_down_write_failed
0xffff881330823c80 --> 139599
--> in xfs_end_io
--> in ???
0xffff880e1a890e80 --> 127402
--> in xfs_end_io
--> rwsem_down_write_failed
--> pending jobs:
crash> list -o work_struct.entry -s work_struct.func -H ffff88207fcd2d58
-x | awk 'BEGIN{w="";c=0} {if($1=="func") printf(" %2d %s --> %s\n",
c, w, $4); else {w=$1;c++}}'
WORK FUNC NOTES
1 ffffffffa062e780 --> <do_work> FLASHCACHE JOB
2 ffff881f582dd350 --> <flashcache_clean_all_sets>
3 ffff88207fccf3a0 --> <vmstat_update>
4 ffff881ed1d05350 --> <flashcache_clean_all_sets>
5 ffff881fff010d10 --> <vmpressure_work_fn>
6 ffff881d9998bca8 --> <xfs_reclaim_worker> dm-4
7 ffffffff81ca8880 --> <push_to_pool>
8 ffff881fcc425ca8 --> <xfs_reclaim_worker> dm-45
9 ffff88207fccf180 --> <lru_add_drain_per_cpu>
The flashcache job is #1 (yes, 'do_work' is its actual name). All
workers are waiting for I/O. The pool is waiting for a thread. Which is
waiting for memory. Which ... OK, we're stuck.
Some assumptions may be wrong; I have seemingly truncated stack traces
looking like the one below, which does not help much :s
crash> bt 139599
PID: 139599 TASK: ffff8816c7970000 CPU: 6 COMMAND: "kworker/6:1"
#0 [ffff88011c407700] __schedule at ffffffff81724e19
#1 [ffff88011c407768] schedule at ffffffff817252d9
#2 [ffff88011c407778] schedule_timeout at ffffffff81724529
Long story short, my understanding is that the hang is the result of a
design issue in flashcache, which could be mitigated in such a
not-so-low-memory situation (80% is page cache) if the reclaim job had
a rescuer thread. Actually, the quick production fix was a clever
brute-force hack in flashcache.
Regards,
>
> Cheers,
>
> Dave.
>
--
Jean-Tiare Le Bigot, OVH
* Re: backport 7a29ac474a47eb8cf212b45917683ae89d6fa13b to stable ?
@ 2016-02-24 21:30 ` Dave Chinner
From: Dave Chinner @ 2016-02-24 21:30 UTC (permalink / raw)
To: Jean-Tiare Le Bigot; +Cc: xfs
On Wed, Feb 24, 2016 at 06:27:29PM +0100, Jean-Tiare Le Bigot wrote:
> Hi,
>
> Thanks for having a look.
>
> On 02/23/2016 11:45 PM, Dave Chinner wrote:
> > On Tue, Feb 23, 2016 at 05:13:35PM +0100, Jean-Tiare Le Bigot wrote:
> >> Hi,
> >>
> >> We've hit a kernel hang related to XFS reclaim under heavy I/O load on a
> >> couple of storage servers running XFS over flashcache on a 3.13.y kernel.
> >>
> >> On the crash dumps, kthreadd is blocked, waiting for XFS to reclaim some
> >> memory but the related reclaim job is queued on a worker_pool stuck
> >> waiting for some I/O, itself depending on other jobs on other queues
> >> which would require additional threads to go forward. Unfortunately
> >> kthreadd is blocked.
> >> The host has plenty of memory (~128GB), about 80% of which is used
> >> for the page cache.
> >>
> >> It looks like this is fixed by commit
> >> 7a29ac474a47eb8cf212b45917683ae89d6fa13b.
> >
> > That commit fixed a regression introduced, IIRC, in 3.19. The
> > problem it fixed didn't exist before then, so I doubt you are seeing
> > the problem that the above commit fixed. Perhaps you'd like to
> > describe your problem along with the stack traces, etc., so we have
> > some idea of what you are trying to fix?
>
> On the dump we investigated, kthreadd is stuck, waiting for some memory
>
> crash> bt 2
> PID: 2 TASK: ffff881fd2a39800 CPU: 0 COMMAND: "kthreadd"
> #0 [ffff881fd2ab37c8] __schedule at ffffffff81724e19
> #1 [ffff881fd2ab3830] io_schedule at ffffffff817255fd
> #2 [ffff881fd2ab3848] __xfs_iunpin_wait at ffffffffa06cf269 [xfs]
> #3 [ffff881fd2ab38c0] xfs_iunpin_wait at ffffffffa06d2629 [xfs]
> #4 [ffff881fd2ab38d0] xfs_reclaim_inode at ffffffffa068ed4c [xfs]
> #5 [ffff881fd2ab3910] xfs_reclaim_inodes_ag at ffffffffa068f267 [xfs]
> #6 [ffff881fd2ab3aa0] xfs_reclaim_inodes_nr at ffffffffa068fd73 [xfs]
> #7 [ffff881fd2ab3ac0] xfs_fs_free_cached_objects at ffffffffa069a3a5 [xfs]
> #8 [ffff881fd2ab3ad0] super_cache_scan at ffffffff811c13e9
> #9 [ffff881fd2ab3b18] shrink_slab at ffffffff81160f27
> #10 [ffff881fd2ab3bc0] do_try_to_free_pages at ffffffff8116405d
> #11 [ffff881fd2ab3c38] try_to_free_pages at ffffffff8116429c
> #12 [ffff881fd2ab3cd8] __alloc_pages_nodemask at ffffffff81158f15
> #13 [ffff881fd2ab3e10] copy_process at ffffffff810652b3
> #14 [ffff881fd2ab3e90] do_fork at ffffffff810669f5
> #15 [ffff881fd2ab3ef8] kernel_thread at ffffffff81066c86
> #16 [ffff881fd2ab3f08] kthreadd at ffffffff8108beea
> #17 [ffff881fd2ab3f50] ret_from_fork at ffffffff817318bc
So, it's waiting for a log force to complete because it needs to
reclaim a dirty inode...
> This triggered the insertion of an xfs-reclaim job for device dm-46 on
> the worker pool bound to CPU #0.
>
> This xfs_reclaim job is the first *pending* job on the worker pool. The
> worker pool has 2 active workers stuck in "xfs_log_worker", both blocked
> in "xlog_state_get_iclog_space". My guess is they are waiting for some
> underlying flashcache I/O (which won't happen, see below).
>
>
> --> busy_hash aka busy workers:
> crash> struct worker_pool.busy_hash 0xffff88207fc12d40 | grep -o 0xf.*
> 0xffff8806dd3b5780 --> 132626
> --> in xfs_log_worker
> --> in xlog_state_get_iclog_space
> --> dm-40
> 0xffff88142d06ae80 --> 129142
> --> in xfs_log_worker
> --> in xlog_state_get_iclog_space
> --> dm-63
These are both waiting on IO completion of log IO. And the IO
completion is blocked behind metadata buffer completion that is
blocked waiting on locks. Yes, this can happen when we are running
low on memory, but it can also occur when we hit maximum workqueue
concurrency limits. The problem is not solved by adding a rescuer
thread to the workqueue because it ends up blocking, too.
> --> pending jobs:
> crash> list -o work_struct.entry -s work_struct.func -H ffff88207fc12d58
> -x | awk 'BEGIN{w="";c=0} {if($1=="func") printf("%2d %s --> %s\n", c,
> w, $4); else {w=$1;c++}}'
> WORK FUNC DEVICE
> 1 ffff881fceda6ca8 --> <xfs_reclaim_worker> dm-46
> 2 ffff881fcd51bc28 --> <xfs_log_worker> dm-46
> 3 ffff88207fc0f3a0 --> <vmstat_update> N/A
> ...
>
> To make progress, this queue needs either:
> - the I/O to complete
> - a new thread to handle the xfs_reclaim
> (which does not trigger I/O in this context, if I got the code right)
The reclaim worker runs background inode reclaim - it doesn't
guarantee forwards progress. Running it won't make the log IO
completion the kthreadd is waiting on complete any faster, because
that's not what is preventing IO completion from running.
> The I/O is stuck, so we need a new thread. The pool manager requested
> this thread BUT the request is pending in kthreadd's queue. It is #5 in
> the list:
Actually, we needed a different workqueue that is marked as a high
priority workqueue so log IO completion doesn't get held up by
metadata buffer IO completion waiting on locks and stalling. This
is the problem that commit b29c70f59870 ("xfs: split metadata and
log buffer completion to separate workqueues") addressed in 3.18.
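Roughly, paraphrasing that change from memory (the details may not
match the commit line for line):

/* Completion is queued to a per-buffer workqueue pointer, which
 * defaults to the metadata buffer completion queue... */
static inline void
xfs_buf_ioend_async(
	struct xfs_buf	*bp)
{
	INIT_WORK(&bp->b_ioend_work, xfs_buf_ioend_work);
	queue_work(bp->b_ioend_wq, &bp->b_ioend_work);
}

/* ...while xlog_alloc_log() points the iclog buffers at the
 * high-priority queue, so log IO completion can't get stuck
 * behind metadata buffer completion waiting on locks: */
bp->b_ioend_wq = mp->m_log_workqueue;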
Remember how I said commit 7a29ac474 ("xfs: give all workqueues
rescuer threads") fixed a regression we'd introduced, not the
problem you are reporting? Yeah, in commit b29c70f59870 I forgot to
add the WQ_MEM_RECLAIM flag to m_log_workqueue when we moved all the
log IO completions to it, and that introduced the problem seen in
subsequent kernels.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: backport 7a29ac474a47eb8cf212b45917683ae89d6fa13b to stable ?
@ 2016-02-25 16:45 ` Jean-Tiare Le Bigot
From: Jean-Tiare Le Bigot @ 2016-02-25 16:45 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Thanks for these in-depth explanations. I understand my mistake now.
"xfs_reclaim_inodes_nr" ensures the background job on queue #0 is
scheduled but also run a synchronous reclaim which is blocked by the
I/O. For some reason, I focused on the job and completely forgot the
synchronous part.
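For the record, this is the part I had missed, quoting the 3.x code
from memory (exact lines may differ):

/* fs/xfs/xfs_icache.c */
int
xfs_reclaim_inodes_nr(
	struct xfs_mount	*mp,
	int			nr_to_scan)
{
	/* kick background reclaimer and push the AIL */
	xfs_reclaim_work_queue(mp);
	xfs_ail_push_all(mp->m_ail);

	/* ...but then also reclaim synchronously; this SYNC_WAIT
	 * pass is what blocks on the stuck I/O. */
	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT,
				     &nr_to_scan);
}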
Hence the only way to fix this hang is to change flashcache to use a
dedicated queue with a rescuer thread, which has nothing to do with XFS.
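A minimal sketch of what that flashcache-side fix could look like
(fc_wq and the wiring are assumptions, not the actual patch we
deployed):

static struct workqueue_struct *fc_wq;

static int fc_init_wq(void)
{
	/* A dedicated queue with a rescuer thread, instead of the
	 * shared system workqueue. */
	fc_wq = alloc_workqueue("flashcache", WQ_MEM_RECLAIM, 0);
	return fc_wq ? 0 : -ENOMEM;
}

/* ...and the existing single job from the sketch earlier in the
 * thread would be queued there instead: */
queue_work(fc_wq, &flashcache_work);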
Regards,
On 02/24/2016 10:30 PM, Dave Chinner wrote:
> On Wed, Feb 24, 2016 at 06:27:29PM +0100, Jean-Tiare Le Bigot wrote:
>> Hi,
>>
>> Thanks for having a look.
>>
>> On 02/23/2016 11:45 PM, Dave Chinner wrote:
>>> On Tue, Feb 23, 2016 at 05:13:35PM +0100, Jean-Tiare Le Bigot wrote:
>>>> Hi,
>>>>
>>>> We've hit a kernel hang related to XFS reclaim under heavy I/O load on a
>>>> couple of storage servers running XFS over flashcache on a 3.13.y kernel.
>>>>
>>>> On the crash dumps, kthreadd is blocked, waiting for XFS to reclaim some
>>>> memory but the related reclaim job is queued on a worker_pool stuck
>>>> waiting for some I/O, itself depending on other jobs on other queues
>>>> which would require additional threads to go forward. Unfortunately
>>>> kthreadd is blocked.
>>>> The host has plenty of memory (~128GB), about 80% of which is used
>>>> for the page cache.
>>>>
>>>> It looks like this is fixed by commit
>>>> 7a29ac474a47eb8cf212b45917683ae89d6fa13b.
>>>
>>> That commit fixed a regression introduced, IIRC, in 3.19. The
>>> problem it fixed didn't exist before then, so I doubt you are seeing
>>> the problem that the above commit fixed. Perhaps you'd like to
>>> describe your problem along with the stack traces, etc., so we have
>>> some idea of what you are trying to fix?
>>
>> On the dump we investigated, kthreadd is stuck, waiting for some memory
>>
>> crash> bt 2
>> PID: 2 TASK: ffff881fd2a39800 CPU: 0 COMMAND: "kthreadd"
>> #0 [ffff881fd2ab37c8] __schedule at ffffffff81724e19
>> #1 [ffff881fd2ab3830] io_schedule at ffffffff817255fd
>> #2 [ffff881fd2ab3848] __xfs_iunpin_wait at ffffffffa06cf269 [xfs]
>> #3 [ffff881fd2ab38c0] xfs_iunpin_wait at ffffffffa06d2629 [xfs]
>> #4 [ffff881fd2ab38d0] xfs_reclaim_inode at ffffffffa068ed4c [xfs]
>> #5 [ffff881fd2ab3910] xfs_reclaim_inodes_ag at ffffffffa068f267 [xfs]
>> #6 [ffff881fd2ab3aa0] xfs_reclaim_inodes_nr at ffffffffa068fd73 [xfs]
>> #7 [ffff881fd2ab3ac0] xfs_fs_free_cached_objects at ffffffffa069a3a5 [xfs]
>> #8 [ffff881fd2ab3ad0] super_cache_scan at ffffffff811c13e9
>> #9 [ffff881fd2ab3b18] shrink_slab at ffffffff81160f27
>> #10 [ffff881fd2ab3bc0] do_try_to_free_pages at ffffffff8116405d
>> #11 [ffff881fd2ab3c38] try_to_free_pages at ffffffff8116429c
>> #12 [ffff881fd2ab3cd8] __alloc_pages_nodemask at ffffffff81158f15
>> #13 [ffff881fd2ab3e10] copy_process at ffffffff810652b3
>> #14 [ffff881fd2ab3e90] do_fork at ffffffff810669f5
>> #15 [ffff881fd2ab3ef8] kernel_thread at ffffffff81066c86
>> #16 [ffff881fd2ab3f08] kthreadd at ffffffff8108beea
>> #17 [ffff881fd2ab3f50] ret_from_fork at ffffffff817318bc
>
> So, it's waiting for a log force to complete because it needs to
> reclaim a dirty inode...
>
>> This triggered the insertion of an xfs-reclaim job for device dm-46 on
>> the worker pool bound to CPU #0.
>>
>> This xfs_reclaim job is the first *pending* job on the worker pool. The
>> worker pool has 2 active workers stuck in "xfs_log_worker", both blocked
>> in "xlog_state_get_iclog_space". My guess is they are waiting for some
>> underlying flashcache I/O (which won't happen, see below).
>>
>>
>> --> busy_hash aka busy workers:
>> crash> struct worker_pool.busy_hash 0xffff88207fc12d40 | grep -o 0xf.*
>> 0xffff8806dd3b5780 --> 132626
>> --> in xfs_log_worker
>> --> in xlog_state_get_iclog_space
>> --> dm-40
>> 0xffff88142d06ae80 --> 129142
>> --> in xfs_log_worker
>> --> in xlog_state_get_iclog_space
>> --> dm-63
>
> These are both waiting on IO completion of log IO. And the IO
> completion is blocked behind metadata buffer completion that is
> blocked waiting on locks. Yes, this can happen when we are running
> low on memory, but it can also occur when we hit maximum workqueue
> concurrency limits. The problem is not solved by adding a rescuer
> thread to the workqueue because it ends up blocking, too.
>
>> --> pending jobs:
>> crash> list -o work_struct.entry -s work_struct.func -H ffff88207fc12d58
>> -x | awk 'BEGIN{w="";c=0} {if($1=="func") printf("%2d %s --> %s\n", c,
>> w, $4); else {w=$1;c++}}'
>> WORK FUNC DEVICE
>> 1 ffff881fceda6ca8 --> <xfs_reclaim_worker> dm-46
>> 2 ffff881fcd51bc28 --> <xfs_log_worker> dm-46
>> 3 ffff88207fc0f3a0 --> <vmstat_update> N/A
>> ...
>>
>> To make progress, this queue needs either:
>> - the I/O to complete
>> - a new thread to handle the xfs_reclaim
>> (which does not trigger I/O in this context, if I got the code right)
>
> The reclaim worker runs background inode reclaim - it doesn't
> guarantee forwards progress. Running it won't make the log IO
> completion the kthreadd is waiting on complete any faster, because
> that's not what is preventing IO completion from running.
>
>> The I/O is stuck, so we need a new thread. The pool manager requested
>> this thread BUT the request is pending in kthreadd's queue. It is #5 in
>> the list:
>
> Actually, we needed a different workqueue that is marked as a high
> priority workqueue so log IO completion doesn't get held up by
> metadata buffer IO completion waiting on locks and stalling. This
> is the problem that commit b29c70f59870 ("xfs: split metadata and
> log buffer completion to separate workqueues") addressed in 3.18.
>
> Remember how I said commit 7a29ac474 ("xfs: give all workqueues
> rescuer threads") fixed a regression we'd introduced, not the
> problem you are reporting? Yeah, in commit b29c70f59870 I forgot to
> add the WQ_MEM_RECLAIM flag to m_log_workqueue when we moved all the
> log IO completions to it, and that introduced the problem seen in
> subsequent kernels.
>
> Cheers,
>
> Dave.
>
--
Jean-Tiare Le Bigot, OVH