Re: 6.5.0rc5 fs hang - ext4? raid? - Dr. David Alan Gilbert

public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed

From: "Dr. David Alan Gilbert" <dave@treblig.org>
To: Jens Axboe <axboe@kernel.dk>
Cc: Theodore Ts'o <tytso@mit.edu>,
	hch@lst.de, adilger.kernel@dilger.ca, song@kernel.org,
	linux-raid@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-ext4@vger.kernel.org
Subject: Re: 6.5.0rc5 fs hang - ext4? raid?
Date: Wed, 16 Aug 2023 15:06:01 +0000	[thread overview]
Message-ID: <ZNzl2Sq9UJ3FiTgV@gallifrey> (raw)
In-Reply-To: <ZNzg1/zhxYV2EkBX@gallifrey>

* Dr. David Alan Gilbert (dave@treblig.org) wrote:
> * Jens Axboe (axboe@kernel.dk) wrote:
> > On 8/15/23 7:31 PM, Dr. David Alan Gilbert wrote:
> > > (Copying in Christoph and Jens)
> > > 
> > > * Dr. David Alan Gilbert (dave@treblig.org) wrote:
> > >> * Dr. David Alan Gilbert (dave@treblig.org) wrote:
> > >>> * Theodore Ts'o (tytso@mit.edu) wrote:
> > >>>> On Mon, Aug 14, 2023 at 09:02:53PM +0000, Dr. David Alan Gilbert wrote:
> > >>>>> dg         29594   29592  0 18:40 pts/0    00:00:00 /usr/bin/ar --plugin /usr/libexec/gcc/x86_64-redhat-linux/13/liblto_plugin.so -csrDT src/intel/perf/libintel_perf.a src/intel/perf/libintel_perf.a.p/meson-generated_.._intel_perf_metrics.c.o src/intel/perf/libintel_perf.a.p/intel_perf.c.o src/intel/perf/libintel_perf.a.p/intel_perf_query.c.o src/intel/perf/libintel_perf.a.p/intel_perf_mdapi.c.o
> > >>>>>
> > >>>>> [root@dalek dg]# cat /proc/29594/stack 
> > >>>>> [<0>] md_super_wait+0xa2/0xe0
> > >>>>> [<0>] md_bitmap_unplug+0xd2/0x120
> > >>>>> [<0>] flush_bio_list+0xf3/0x100 [raid1]
> > >>>>> [<0>] raid1_unplug+0x3b/0xb0 [raid1]
> > >>>>> [<0>] __blk_flush_plug+0xd7/0x150
> > >>>>> [<0>] blk_finish_plug+0x29/0x40
> > >>>>> [<0>] ext4_do_writepages+0x401/0xc90
> > >>>>> [<0>] ext4_writepages+0xad/0x180
> > >>>>
> > >>>> If you want a few seconds and try grabbing cat /proc/29594/stack
> > >>>> again, what does the stack trace stay consistent as above?
> > >>>
> > >>> I'll get back to that and retry it.
> > >>
> > >> Yeh, the stack is consistent; this time around it's an 'ar' in a kernel
> > >> build:
> > >>
> > >> [root@dalek dg]# cat /proc/17970/stack
> > >> [<0>] md_super_wait+0xa2/0xe0
> > >> [<0>] md_bitmap_unplug+0xad/0x120
> > >> [<0>] flush_bio_list+0xf3/0x100 [raid1]
> > >> [<0>] raid1_unplug+0x3b/0xb0 [raid1]
> > >> [<0>] __blk_flush_plug+0xd7/0x150
> > >> [<0>] blk_finish_plug+0x29/0x40
> > >> [<0>] ext4_do_writepages+0x401/0xc90
> > >> [<0>] ext4_writepages+0xad/0x180
> > >> [<0>] do_writepages+0xd2/0x1e0
> > >> [<0>] filemap_fdatawrite_wbc+0x63/0x90
> > >> [<0>] __filemap_fdatawrite_range+0x5c/0x80
> > >> [<0>] ext4_release_file+0x74/0xb0
> > >> [<0>] __fput+0xf5/0x2a0
> > >> [<0>] task_work_run+0x5d/0x90
> > >> [<0>] exit_to_user_mode_prepare+0x1e6/0x1f0
> > >> [<0>] syscall_exit_to_user_mode+0x1b/0x40
> > >> [<0>] do_syscall_64+0x6c/0x90
> > >> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > >> [root@dalek dg]# cat /proc/17970/stack
> > >> [<0>] md_super_wait+0xa2/0xe0
> > >> [<0>] md_bitmap_unplug+0xad/0x120
> > >> [<0>] flush_bio_list+0xf3/0x100 [raid1]
> > >> [<0>] raid1_unplug+0x3b/0xb0 [raid1]
> > >> [<0>] __blk_flush_plug+0xd7/0x150
> > >> [<0>] blk_finish_plug+0x29/0x40
> > >> [<0>] ext4_do_writepages+0x401/0xc90
> > >> [<0>] ext4_writepages+0xad/0x180
> > >> [<0>] do_writepages+0xd2/0x1e0
> > >> [<0>] filemap_fdatawrite_wbc+0x63/0x90
> > >> [<0>] __filemap_fdatawrite_range+0x5c/0x80
> > >> [<0>] ext4_release_file+0x74/0xb0
> > >> [<0>] __fput+0xf5/0x2a0
> > >> [<0>] task_work_run+0x5d/0x90
> > >> [<0>] exit_to_user_mode_prepare+0x1e6/0x1f0
> > >> [<0>] syscall_exit_to_user_mode+0x1b/0x40
> > >> [<0>] do_syscall_64+0x6c/0x90
> > >> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > >>
> > >>>> Also, if you have iostat installed (usually part of the sysstat
> > >>>> package), does "iostat 1" show any I/O activity on the md device?
> > >>
> > >> iostat is showing something odd, most devices are at 0,
> > >> except for 3 of the dm's that are stuck at 100% utilisation with
> > >> apparently nothing going on:
> > >>
> > >> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > >>            0.06    0.00    0.03   53.06    0.00   46.84
> > >>
> > >> Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
> > >> ...
> > >> dm-16            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00 100.00
> > >> dm-17            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00 100.00
> > >> dm-18            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
> > >> dm-19            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
> > >> dm-2             0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
> > >> dm-20            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00 100.00
> > >> ....
> > >>
> > >> dm-20 is the /dev/mapper/main-more which is the RAID on which the
> > >> fs runs, 16 and 17 are main-more_rmeta_0 and main-more_rimage_0
> > >> so something screwy is going on there.
> > > 
> > > I've just finished a bisect of this hang, and got to:
> > > 
> > > 615939a2ae734e3e68c816d6749d1f5f79c62ab7 is the first bad commit
> > > commit 615939a2ae734e3e68c816d6749d1f5f79c62ab7
> > > Author: Christoph Hellwig <hch@lst.de>
> > > Date:   Fri May 19 06:40:48 2023 +0200
> > > 
> > >     blk-mq: defer to the normal submission path for post-flush requests
> > > 
> > >     Requests with the FUA bit on hardware without FUA support need a post
> > >     flush before returning to the caller, but they can still be sent using
> > >     the normal I/O path after initializing the flush-related fields and
> > >     end I/O handler.
> > > 
> > >     Signed-off-by: Christoph Hellwig <hch@lst.de>
> > >     Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> > >     Link: https://lore.kernel.org/r/20230519044050.107790-6-hch@lst.de
> > >     Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > 
> > Can you try and pull in:
> > 
> > https://git.kernel.dk/cgit/linux/commit/?h=block-6.5&id=5ff3213a5387e076af2b87f796f94b36965e8c3a
> > 
> > and see if that helps?
> 
> <testing....>

Yes it seems to fix it - thanks!

Dave

> Dave
> 
> > -- 
> > Jens Axboe
> > 
> -- 
>  -----Open up your eyes, open up your mind, open up your code -------   
> / Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
> \        dave @ treblig.org |                               | In Hex /
>  \ _________________________|_____ http://www.treblig.org   |_______/
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

next prev parent reply	other threads:[~2023-08-16 15:07 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-14 21:02 6.5.0rc5 fs hang - ext4? raid? Dr. David Alan Gilbert
2023-08-15  0:16 ` Bagas Sanjaya
2023-08-15  0:23   ` Dr. David Alan Gilbert
2023-08-15  1:11 ` Bagas Sanjaya
2023-08-15 11:47 ` Yu Kuai
2023-08-15 12:23   ` Dr. David Alan Gilbert
2023-08-15 12:47     ` Yu Kuai
2023-08-15 12:51 ` Theodore Ts'o
2023-08-15 12:55   ` Dr. David Alan Gilbert
2023-08-15 17:50     ` Dr. David Alan Gilbert
2023-08-16  1:31       ` Dr. David Alan Gilbert
2023-08-16 14:34         ` Jens Axboe
2023-08-16 14:44           ` Dr. David Alan Gilbert
2023-08-16 15:06             ` Dr. David Alan Gilbert [this message]
2023-08-16 15:30               ` Jens Axboe
2023-08-28 13:52               ` Bagas Sanjaya
2023-08-28 13:55                 ` Jens Axboe
2023-08-28 14:00                   ` Dr. David Alan Gilbert
2023-08-28 14:02                   ` Bagas Sanjaya
2023-08-28 14:08                 ` Linux regression tracking #update (Thorsten Leemhuis)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZNzl2Sq9UJ3FiTgV@gallifrey \
    --to=dave@treblig.org \
    --cc=adilger.kernel@dilger.ca \
    --cc=axboe@kernel.dk \
    --cc=hch@lst.de \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-raid@vger.kernel.org \
    --cc=song@kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox