linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
@ 2020-11-28 12:25 Donald Buczek
  2020-11-30  2:06 ` Guoqing Jiang
  0 siblings, 1 reply; 50+ messages in thread
From: Donald Buczek @ 2020-11-28 12:25 UTC (permalink / raw)
  To: Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Linux mdraid people,

we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.

The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.

The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with

     echo idle > /sys/devices/virtual/block/md1/md/sync_action

This doesn't complete. Here's /proc/stack of this process:

     root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
     UID        PID  PPID  C STIME TTY          TIME CMD
     root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
     root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
     [<0>] kthread_stop+0x6e/0x150
     [<0>] md_unregister_thread+0x3e/0x70
     [<0>] md_reap_sync_thread+0x1f/0x1e0
     [<0>] action_store+0x141/0x2b0
     [<0>] md_attr_store+0x71/0xb0
     [<0>] kernfs_fop_write+0x113/0x1a0
     [<0>] vfs_write+0xbc/0x250
     [<0>] ksys_write+0xa1/0xe0
     [<0>] do_syscall_64+0x33/0x40
     [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Note, that md0 has been paused successfully just before.

     2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
     2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
     2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
     2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
     2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
     
     2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
     2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
     2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
     [ nothing related following.... ]

After that, we see md1_raid6 in a busy loop:

     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     2376 root     20   0       0      0      0 R 100.0  0.0   1387:38 md1_raid6

Also, all processes doing I/O do the md device block.

This is /proc/mdstat:

     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
           bitmap: 0/59 pages [0KB], 65536KB chunk
     
     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           bitmap: 0/59 pages [0KB], 65536KB chunk

There doesn't seem to be any further progress.

I've taken a function_graph trace of the looping md1_raid6 process: https://owww.molgen.mpg.de/~buczek/2020-11-27_trace.txt (30 MB)

Maybe this helps to get an idea what might be going on?

Best
   Donald

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

^ permalink raw reply	[flat|nested] 50+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
@ 2024-05-30  8:12 Yaohui Hu
  0 siblings, 0 replies; 50+ messages in thread
From: Yaohui Hu @ 2024-05-30  8:12 UTC (permalink / raw)
  To: yukuai1
  Cc: buczek, dragan, guoqing.jiang, it+raid, linux-kernel, linux-raid,
	msmith626, song, yangerkun, yukuai3

Hi,

After applied suggested 7 commits to kernel v6.1. There are still deadlocks in raid456 such as following. Do we have other follow up patch to fix this issue? 

[2378440.310330] INFO: task systemd-journal:1415 blocked for more than 120 seconds.
[2378440.318649]       Tainted: G           OE K    6.1.51-0digitalocean2-generic #0digitalocean2
[2378440.328327] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2378440.337306] task:systemd-journal state:D stack:0     pid:1415  ppid:1      flags:0x00000006
[2378440.337310] Call Trace:
[2378440.337312]  <TASK>
[2378440.337315]  __schedule+0x1f5/0x5c0
[2378440.337324]  schedule+0x53/0xb0
[2378440.337327]  io_schedule+0x42/0x70
[2378440.337329]  folio_wait_bit_common+0x11f/0x310
[2378440.337335]  ? filemap_invalidate_unlock_two+0x40/0x40
[2378440.337338]  ? ext4_da_map_blocks.constprop.0+0x370/0x370
[2378440.337342]  block_page_mkwrite+0x113/0x160
[2378440.337346]  ext4_page_mkwrite+0x233/0x6d0
[2378440.337349]  do_page_mkwrite+0x4d/0xe0
[2378440.337352]  do_wp_page+0x1d9/0x440
[2378440.337355]  __handle_mm_fault+0x50a/0x5d0
[2378440.337358]  handle_mm_fault+0xe3/0x2e0
[2378440.337359]  do_user_addr_fault+0x187/0x560
[2378440.337363]  exc_page_fault+0x6c/0x150
[2378440.337366]  asm_exc_page_fault+0x22/0x30
[2378440.337371] RIP: 0033:0x7f2b4b7c9e7a
[2378440.337377] RSP: 002b:00007ffcd6387340 EFLAGS: 00010202
[2378440.337379] RAX: 0000000004725c68 RBX: 000056420cb077f0 RCX: 00007f2b3a7983b0
[2378440.337381] RDX: 00007f2b3fb25c68 RSI: 000056420cb077f0 RDI: 00007f2b3a7983d8
[2378440.337382] RBP: 000056420caca8f0 R08: 00000000003983b0 R09: 00007ffcd6387370
[2378440.337383] R10: 86c0275e98fea605 R11: 0000000000244a4c R12: 000000000000000a
[2378440.337384] R13: e4bdb9e57ca6925c R14: 0000000000000000 R15: 00007ffcd6387378
[2378440.337386]  </TASK>
[2378440.337587] INFO: task md8_resync:3952093 blocked for more than 120 seconds.
[2378440.345699]       Tainted: G           OE K    6.1.51-0digitalocean2-generic #0digitalocean2
[2378440.355371] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2378440.364359] task:md8_resync      state:D stack:0     pid:3952093 ppid:2      flags:0x00004000
[2378440.364362] Call Trace:
[2378440.364363]  <TASK>
[2378440.364364]  __schedule+0x1f5/0x5c0
[2378440.364368]  schedule+0x53/0xb0
[2378440.364372]  raid5_get_active_stripe+0x1f6/0x290 [raid456]
[2378440.364381]  ? destroy_sched_domains_rcu+0x30/0x30
[2378440.364385]  raid5_sync_request+0x289/0x2c0 [raid456]
[2378440.364392]  ? is_mddev_idle+0xb5/0x115
[2378440.364395]  md_do_sync.cold+0x3eb/0x96b
[2378440.364398]  ? destroy_sched_domains_rcu+0x30/0x30
[2378440.364400]  md_thread+0xa9/0x160
[2378440.364406]  ? super_90_load.part.0+0x300/0x300
[2378440.364408]  kthread+0xb9/0xe0
[2378440.364412]  ? kthread_complete_and_exit+0x20/0x20
[2378440.364414]  ret_from_fork+0x1f/0x30
[2378440.364418]  </TASK>

Thanks

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2024-05-30  8:12 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-11-28 12:25 md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition Donald Buczek
2020-11-30  2:06 ` Guoqing Jiang
2020-12-01  9:29   ` Donald Buczek
2020-12-02 17:28     ` Donald Buczek
2020-12-03  1:55       ` Guoqing Jiang
2020-12-03 11:42         ` Donald Buczek
2020-12-21 12:33           ` Donald Buczek
2021-01-19 11:30             ` Donald Buczek
2021-01-20 16:33               ` Guoqing Jiang
2021-01-23 13:04                 ` Donald Buczek
2021-01-25  8:54                   ` Donald Buczek
2021-01-25 21:32                     ` Donald Buczek
2021-01-26  0:44                       ` Guoqing Jiang
2021-01-26  9:50                         ` Donald Buczek
2021-01-26 11:14                           ` Guoqing Jiang
2021-01-26 12:58                             ` Donald Buczek
2021-01-26 14:06                               ` Guoqing Jiang
2021-01-26 16:05                                 ` Donald Buczek
2021-02-02 15:42                                   ` Guoqing Jiang
2021-02-08 11:38                                     ` Donald Buczek
2021-02-08 14:53                                       ` Guoqing Jiang
2021-02-08 18:41                                         ` Donald Buczek
2021-02-09  0:46                                           ` Guoqing Jiang
2021-02-09  9:24                                             ` Donald Buczek
2023-03-14 13:25                                             ` Marc Smith
2023-03-14 13:55                                               ` Guoqing Jiang
2023-03-14 14:45                                                 ` Marc Smith
2023-03-16 15:25                                                   ` Marc Smith
2023-03-29  0:01                                                     ` Song Liu
2023-08-22 21:16                                                       ` Dragan Stancevic
2023-08-23  1:22                                                         ` Yu Kuai
2023-08-23 15:33                                                           ` Dragan Stancevic
2023-08-24  1:18                                                             ` Yu Kuai
2023-08-28 20:32                                                               ` Dragan Stancevic
2023-08-30  1:36                                                                 ` Yu Kuai
2023-09-05  3:50                                                                   ` Yu Kuai
2023-09-05 13:54                                                                     ` Dragan Stancevic
2023-09-13  9:08                                                                       ` Donald Buczek
2023-09-13 14:16                                                                         ` Dragan Stancevic
2023-09-14  6:03                                                                           ` Donald Buczek
2023-09-17  8:55                                                                             ` Donald Buczek
2023-09-24 14:35                                                                               ` Donald Buczek
2023-09-25  1:11                                                                                 ` Yu Kuai
2023-09-25  9:11                                                                                   ` Donald Buczek
2023-09-25  9:32                                                                                     ` Yu Kuai
2023-03-15  3:02                                                 ` Yu Kuai
2023-03-15  9:30                                                   ` Guoqing Jiang
2023-03-15  9:53                                                     ` Yu Kuai
2023-03-15  7:52                                               ` Donald Buczek
  -- strict thread matches above, loose matches on Subject: below --
2024-05-30  8:12 Yaohui Hu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).