linux-raid.vger.kernel.org archive mirror
* How to debug intermittent increasing md/inflight but no disk activity?
@ 2024-07-10 11:46 Paul Menzel
  2024-07-10 11:54 ` Roger Heflin
  2024-07-10 23:12 ` Dave Chinner
  0 siblings, 2 replies; 12+ messages in thread
From: Paul Menzel @ 2024-07-10 11:46 UTC (permalink / raw)
  To: linux-raid, linux-nfs; +Cc: linux-block, linux-xfs, it+linux-raid

Dear Linux folks,


Exporting directories over NFS on a Dell PowerEdge R420 with Linux 
5.15.86, users noticed intermittent hangs. For example,

     df /project/something # on an NFS client

on a different system timed out.

     @grele:~$ more /proc/mdstat
     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
     md3 : active raid6 sdr[0] sdp[11] sdx[10] sdt[9] sdo[8] sdw[7] sds[6] sdm[5] sdu[4] sdq[3] sdn[2] sdv[1]
           156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
           bitmap: 0/117 pages [0KB], 65536KB chunk

     md2 : active raid6 sdap[0] sdan[11] sdav[10] sdar[12] sdam[8] sdau[7] sdaq[6] sdak[5] sdas[4] sdao[3] sdal[2] sdat[1]
           156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
           bitmap: 0/117 pages [0KB], 65536KB chunk

     md1 : active raid6 sdb[0] sdl[11] sdh[10] sdd[9] sdk[8] sdg[7] sdc[6] sdi[5] sde[4] sda[3] sdj[2] sdf[1]
           156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
           bitmap: 2/117 pages [8KB], 65536KB chunk

     md0 : active raid6 sdaj[0] sdz[11] sdad[10] sdah[9] sdy[8] sdac[7] sdag[6] sdaa[5] sdae[4] sdai[3] sdab[2] sdaf[1]
           156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
           bitmap: 7/117 pages [28KB], 65536KB chunk

     unused devices: <none>

During that time, we noticed all 64 NFSD processes being in uninterruptible 
sleep and the number of I/O requests currently in flight increasing for the 
RAID6 device *md0*

     /sys/devices/virtual/block/md0/inflight : 10 921

but with no disk activity according to iostat. There was only “little 
NFS activity” going on as far as we could see. This alternated for around 
half an hour, and then we decreased the number of NFS processes from 64 to 
8. After a while the problem settled, meaning the in-flight I/O requests 
went down, so it might be related to the access pattern, but we’d be 
curious to figure out exactly what is going on.

We captured some more data from sysfs [1].

Of course it’s not reproducible at will, but any insight into how to debug 
this the next time it happens would be much appreciated.
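
In case it helps, here is a rough sketch of a sampling loop we could run 
the next time the hang shows up (just a sketch; it assumes the default 
sysfs layout and that md0 is again the affected device):

     # Sample in-flight counters, per-disk stats and nfsd stack traces every 5 s.
     while true; do
         date
         cat /sys/block/md0/inflight
         grep -E ' (md0|sd[a-z]+) ' /proc/diskstats
         for t in $(pgrep nfsd); do
             echo "== nfsd $t =="
             cat /proc/"$t"/stack
         done
         sleep 5
     done > /var/tmp/md0-hang-$(date +%s).log 2>&1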


Kind regards,

Paul


[1]: https://owww.molgen.mpg.de/~pmenzel/grele.2.txt

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: How to debug intermittent increasing md/inflight but no disk activity?
  2024-07-10 11:46 How to debug intermittent increasing md/inflight but no disk activity? Paul Menzel
@ 2024-07-10 11:54 ` Roger Heflin
  2024-07-23 10:33   ` Paul Menzel
  2024-07-10 23:12 ` Dave Chinner
  1 sibling, 1 reply; 12+ messages in thread
From: Roger Heflin @ 2024-07-10 11:54 UTC (permalink / raw)
  To: Paul Menzel; +Cc: linux-raid, linux-nfs, linux-block, linux-xfs, it+linux-raid

How long does it freeze this way?

Disks that are developing bad blocks do show up as activity stopping for
3-60 seconds (depending on the disks' internal settings).

smartctl --xall <device> | grep -iE 'sector|reall' should show the
reallocation counters.

What kind of disks does the machine have?

On my home machine a bad sector freezes it for 7 seconds (scterc
defaults to 7 seconds). On some large-disk, big-RAID machines at work
the hang is minutes. The raw disk timeout is set to 10 seconds (that is
what the vendor told us), and that 10 seconds, multiplied by a
potentially large pile of IOs queued against the bad sector, shows up
as minutes.

I wrote a script that we use at work which both times how long smartctl
takes for each disk (a bad disk takes >5 seconds, and up to minutes)
and records the reallocated count, saving a copy every hour, so one can
see which disk incremented its counter in the last hour and replace
that disk.
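
Not my exact script, but the idea is roughly this (a sketch; the device
glob and log directory are placeholders):

     #!/bin/bash
     # Time smartctl per disk and record the reallocation counters.  A disk
     # that takes many seconds to answer, or whose counter keeps growing
     # hour over hour, is the one to replace.
     logdir=/var/log/disk-health
     mkdir -p "$logdir"
     out="$logdir/smart-$(date +%Y%m%d-%H%M).log"
     for dev in /dev/sd? /dev/sd??; do
         [ -b "$dev" ] || continue
         start=$(date +%s)
         counters=$(smartctl --xall "$dev" | grep -iE 'sector|reall')
         end=$(date +%s)
         {
             echo "$dev: smartctl took $((end - start))s"
             echo "$counters"
         } >> "$out"
     done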

On Wed, Jul 10, 2024 at 6:46 AM Paul Menzel <pmenzel@molgen.mpg.de> wrote:
>
> Dear Linux folks,
>
>
> Exporting directories over NFS on a Dell PowerEdge R420 with Linux
> 5.15.86, users noticed intermittent hangs. For example,
>
>      df /project/something # on an NFS client
>
> on a different system timed out.
>
>      @grele:~$ more /proc/mdstat
>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>      md3 : active raid6 sdr[0] sdp[11] sdx[10] sdt[9] sdo[8] sdw[7] sds[6] sdm[5] sdu[4] sdq[3] sdn[2] sdv[1]
>            156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
>            bitmap: 0/117 pages [0KB], 65536KB chunk
>
>      md2 : active raid6 sdap[0] sdan[11] sdav[10] sdar[12] sdam[8] sdau[7] sdaq[6] sdak[5] sdas[4] sdao[3] sdal[2] sdat[1]
>            156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
>            bitmap: 0/117 pages [0KB], 65536KB chunk
>
>      md1 : active raid6 sdb[0] sdl[11] sdh[10] sdd[9] sdk[8] sdg[7] sdc[6] sdi[5] sde[4] sda[3] sdj[2] sdf[1]
>            156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
>            bitmap: 2/117 pages [8KB], 65536KB chunk
>
>      md0 : active raid6 sdaj[0] sdz[11] sdad[10] sdah[9] sdy[8] sdac[7] sdag[6] sdaa[5] sdae[4] sdai[3] sdab[2] sdaf[1]
>            156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
>            bitmap: 7/117 pages [28KB], 65536KB chunk
>
>      unused devices: <none>
>
> In that time, we noticed all 64 NFSD processes being in uninterruptible
> sleep and the I/O requests currently in process increasing for the RAID6
> device *md0*
>
>      /sys/devices/virtual/block/md0/inflight : 10 921
>
> but with no disk activity according to iostat. There was only “little
> NFS activity” going on as far as we saw. This alternated for around half
> an hour, and then we decreased the NFS processes from 64 to 8. After a
> while the problem settled, meaning the I/O requests went down, so it
> might be related to the access pattern, but we’d be curious to figure
> out exactly what is going on.
>
> We captured some more data from sysfs [1].
>
> Of course it’s not reproducible, but any insight how to debug this next
> time is much welcomed.
>
>
> Kind regards,
>
> Paul
>
>
> [1]: https://owww.molgen.mpg.de/~pmenzel/grele.2.txt
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: How to debug intermittent increasing md/inflight but no disk activity?
  2024-07-10 11:46 How to debug intermittent increasing md/inflight but no disk activity? Paul Menzel
  2024-07-10 11:54 ` Roger Heflin
@ 2024-07-10 23:12 ` Dave Chinner
  2024-07-11  8:51   ` Johannes Truschnigg
                     ` (2 more replies)
  1 sibling, 3 replies; 12+ messages in thread
From: Dave Chinner @ 2024-07-10 23:12 UTC (permalink / raw)
  To: Paul Menzel; +Cc: linux-raid, linux-nfs, linux-block, linux-xfs, it+linux-raid

On Wed, Jul 10, 2024 at 01:46:01PM +0200, Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> Exporting directories over NFS on a Dell PowerEdge R420 with Linux 5.15.86,
> users noticed intermittent hangs. For example,
> 
>     df /project/something # on an NFS client
> 
> on a different system timed out.
> 
>     @grele:~$ more /proc/mdstat
>     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
>     md3 : active raid6 sdr[0] sdp[11] sdx[10] sdt[9] sdo[8] sdw[7] sds[6]
> sdm[5] sdu[4] sdq[3] sdn[2] sdv[1]
>           156257474560 blocks super 1.2 level 6, 1024k chunk, algorithm 2
                                                   ^^^^^^^^^^^^

There's the likely source of your problem - 1MB raid chunk size for
a 10+2 RAID6 array. That's a stripe width of 10MB, and that will
only work well with really large sequential streaming IO. However,
the general NFS server IO pattern over time will be small
semi-random IO patterns scattered all over the place. 

IOWs, the average NFS server IO pattern is about the worst case IO
pattern for a massively wide RAID6 stripe....
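
Back-of-the-envelope numbers for that geometry (a sketch; it assumes
the worst case where a tiny write forces the whole data stripe to be
read so that P and Q can be recomputed):

     # 10 data disks * 1 MiB chunk = 10 MiB stripe width
     chunk_kib=1024
     data_disks=10
     stripe_kib=$((chunk_kib * data_disks))
     echo "stripe width: ${stripe_kib} KiB"                          # 10240 KiB
     # ratio of stripe width to a single 4 KiB synchronous write:
     echo "stripe KiB touched per 4 KiB write: $((stripe_kib / 4))x" # 2560x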

> In that time, we noticed all 64 NFSD processes being in uninterruptible
> sleep and the I/O requests currently in process increasing for the RAID6
> device *md0*
> 
>     /sys/devices/virtual/block/md0/inflight : 10 921
> 
> but with no disk activity according to iostat. There was only “little NFS
> activity” going on as far as we saw. This alternated for around half an hour,
> and then we decreased the NFS processes from 64 to 8.

Has nothing to do with the number of NFSD tasks, I think. They are
all stuck waiting for page cache IO, journal write IO or
the inode locks for the inodes that are blocked on this IO.

The most informative nfsd stack trace is this one:

# # /proc/1414/task/1414: nfsd :
# cat /proc/1414/task/1414/stack

[<0>] submit_bio_wait+0x5b/0x90
[<0>] iomap_read_page_sync+0xaf/0xe0
[<0>] iomap_write_begin+0x3d1/0x5e0
[<0>] iomap_file_buffered_write+0x125/0x250
[<0>] xfs_file_buffered_write+0xc6/0x2d0
[<0>] do_iter_readv_writev+0x14f/0x1b0
[<0>] do_iter_write+0x7b/0x1d0
[<0>] nfsd_vfs_write+0x2f3/0x640 [nfsd]
[<0>] nfsd4_write+0x116/0x210 [nfsd]
[<0>] nfsd4_proc_compound+0x2d1/0x640 [nfsd]
[<0>] nfsd_dispatch+0x150/0x250 [nfsd]
[<0>] svc_process_common+0x440/0x6d0 [sunrpc]
[<0>] svc_process+0xb7/0xf0 [sunrpc]
[<0>] nfsd+0xe8/0x140 [nfsd]
[<0>] kthread+0x124/0x150
[<0>] ret_from_fork+0x1f/0x30

That's a buffered write into the page cache blocking on a read IO
to fill the page, because the NFS client is doing sub-page or
unaligned IO. So there are slow, IO-latency-dependent small RMW write
operations happening at the page cache level.

IOWs, you've got an NFS client side application performing suboptimal
unaligned IO patterns, causing the incoming writes to take the slow
path through the page cache (i.e. a RMW cycle).

When these get flushed from the cache by the nfs commit operation:

# # /proc/1413/task/1413: nfsd :
# cat /proc/1413/task/1413/stack

[<0>] wait_on_page_bit_common+0xfa/0x3b0
[<0>] wait_on_page_writeback+0x2a/0x80
[<0>] __filemap_fdatawait_range+0x81/0xf0
[<0>] file_write_and_wait_range+0xdf/0x100
[<0>] xfs_file_fsync+0x63/0x250
[<0>] nfsd_commit+0xd8/0x180 [nfsd]
[<0>] nfsd4_proc_compound+0x2d1/0x640 [nfsd]
[<0>] nfsd_dispatch+0x150/0x250 [nfsd]
[<0>] svc_process_common+0x440/0x6d0 [sunrpc]
[<0>] svc_process+0xb7/0xf0 [sunrpc]
[<0>] nfsd+0xe8/0x140 [nfsd]
[<0>] kthread+0x124/0x150
[<0>] ret_from_fork+0x1f/0x30

writeback then stalls waiting for the underlying MD device to flush
the small IO to the RAID6 storage. This causes massive write
amplification at the RAID6 level, as it requires a RMW of the RAID6
stripe to recalculate the parity blocks with the new changed data in
it, and that then gets forced to disk because the NFSD is asking for
the data being written to be durable.

This is basically worst case behaviour for small write IO both in
terms of latency and write amplification.

> We captured some more data from sysfs [1].

md0 is the busy device:

# # /proc/855/task/855: md0_raid6 :
# cat /proc/855/task/855/stack

[<0>] blk_mq_get_tag+0x11d/0x2c0
[<0>] __blk_mq_alloc_request+0xe1/0x120
[<0>] blk_mq_submit_bio+0x141/0x530
[<0>] submit_bio_noacct+0x26b/0x2b0
[<0>] ops_run_io+0x7e2/0xcf0
[<0>] handle_stripe+0xacb/0x2100
[<0>] handle_active_stripes.constprop.0+0x3d9/0x580
[<0>] raid5d+0x338/0x5a0
[<0>] md_thread+0xa8/0x160
[<0>] kthread+0x124/0x150
[<0>] ret_from_fork+0x1f/0x30

This is the background stripe cache flushing getting stuck waiting
for tag space on the underlying block devices. This happens because
they have run out of IO queue depth (i.e. are at capacity and
overloaded).

The raid6-md0 slab indicates:

raid6-md0         125564 125564   4640    1    2 : tunables    8    4    0 : slabdata 125564 125564      0

there are 125k stripe head objects active, which means a *lot* of
partial stripe writes have been done and are still active in memory.
The "inflight" IOs indicate that it's bottlenecked on writeback
of dirty cached stripes.
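
If it happens again, the raid5/6 stripe cache counters in sysfs should
confirm this directly; something along these lines (a sketch, device
names are whatever the box actually has):

     # stripe_cache_active close to stripe_cache_size means the cache is
     # full of dirty stripes waiting on slow member-disk RMW IO.
     watch -n 5 'grep -H "" /sys/block/md0/md/stripe_cache_size \
         /sys/block/md0/md/stripe_cache_active /sys/block/sd*/inflight'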

Oh, there's CIFS server on this machine, too, and it's having the
same issues:

# # /proc/7624/task/12539: smbd : /usr/local/samba4/sbin/smbd --configfile=/etc/samba4/smb.conf-grele --daemon
# cat /proc/7624/task/12539/stack

[<0>] md_bitmap_startwrite+0x14a/0x1c0
[<0>] add_stripe_bio+0x3f6/0x6d0
[<0>] raid5_make_request+0x1dd/0xbb0
[<0>] md_handle_request+0x11f/0x1b0
[<0>] md_submit_bio+0x66/0xb0
[<0>] __submit_bio+0x162/0x200
[<0>] submit_bio_noacct+0xa8/0x2b0
[<0>] iomap_do_writepage+0x382/0x800
[<0>] write_cache_pages+0x18f/0x3f0
[<0>] iomap_writepages+0x1c/0x40
[<0>] xfs_vm_writepages+0x71/0xa0
[<0>] do_writepages+0xc0/0x1f0
[<0>] filemap_fdatawrite_wbc+0x78/0xc0
[<0>] file_write_and_wait_range+0x9c/0x100
[<0>] xfs_file_fsync+0x63/0x250
[<0>] __x64_sys_fsync+0x34/0x60
[<0>] do_syscall_64+0x40/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb

Yup, that looks to be a partial stripe write being started, with MD
having to update the dirty bitmap, and it's probably blocking because
the dirty bitmap is full. I.e. it is stuck waiting for stripe
writeback to complete and, as per the md0_raid6 stack above, that
is waiting on busy devices to complete IO.

> Of course it’s not reproducible, but any insight how to debug this next time
> is much welcomed.

Probably not a lot you can do short of reconfiguring your RAID6
storage devices to handle small IOs better. However, in general,
RAID6 /always sucks/ for small IOs, and the only way to fix this
problem is to use high performance SSDs to give you a massive excess
of write bandwidth to burn on write amplification....

You could also try to find which NFS/CIFS clients are doing poor/small
IO patterns and get them to issue better IO, but that might not
be fixable.
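
One low-effort way to spot such a client is to look at the average
size per write RPC on the client side, e.g. with nfsiostat from
nfs-utils (a sketch; the mount point is a placeholder):

     # On a suspect NFS client: a write kB/op far below the wsize (or
     # even below 4 kB) points at small/unaligned application writes.
     nfsiostat 5 /project/something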

tl;dr: this isn't a bug in kernel code, this is a result of a worst
case workload for a given storage configuration.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: How to debug intermittent increasing md/inflight but no disk activity?
  2024-07-10 23:12 ` Dave Chinner
@ 2024-07-11  8:51   ` Johannes Truschnigg
  2024-07-11 11:23   ` Andre Noll
  2024-07-12  3:54   ` Dragan Milivojević
  2 siblings, 0 replies; 12+ messages in thread
From: Johannes Truschnigg @ 2024-07-11  8:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Paul Menzel, linux-raid, linux-nfs, linux-block, linux-xfs

I just wanted to chime in to express a sincere "thank you!" to all of you
involved in this thread, for providing such a terrific example of how to
clearly and exhaustively present a systems problem, and then also how to
methodically determine its root cause.

If there were a textbook on this subject in the IT/CompSci context (if there
is, please let me know!), *this* exchange should make it into the next edition
on how to get problems analyzed and resolved.

I will keep it in my personal bookmarks right next to [0], which is a short
but great essay about how what you just put into practice is possible in
theory, and which I've used whenever needed to inspire hope in others (and,
truth be told, also in myself :)) in the face of the daunting complexity of
modern computer systems.

[0]: https://blog.nelhage.com/post/computers-can-be-understood/

-- 
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: How to debug intermittent increasing md/inflight but no disk activity?
  2024-07-10 23:12 ` Dave Chinner
  2024-07-11  8:51   ` Johannes Truschnigg
@ 2024-07-11 11:23   ` Andre Noll
  2024-07-11 22:26     ` Dave Chinner
  2024-07-23 15:13     ` Paul Menzel
  2024-07-12  3:54   ` Dragan Milivojević
  2 siblings, 2 replies; 12+ messages in thread
From: Andre Noll @ 2024-07-11 11:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Paul Menzel, linux-raid, linux-nfs, linux-block, linux-xfs,
	it+linux-raid

On Thu, Jul 11, 09:12, Dave Chinner wrote

> > Of course it’s not reproducible, but any insight how to debug this next time
> > is much welcomed.
> 
> Probably not a lot you can do short of reconfiguring your RAID6
> storage devices to handle small IOs better. However, in general,
> RAID6 /always sucks/ for small IOs, and the only way to fix this
> problem is to use high performance SSDs to give you a massive excess
> of write bandwidth to burn on write amplification....

FWIW, our approach to mitigate the write amplification suckage of large
HDD-backed raid6 arrays for small I/Os is to set up a bcache device
by combining such arrays with two small SSDs (configured as raid1).
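
For reference, the shape of such a setup is roughly this (a sketch
with placeholder device names, not our exact commands):

     # Cache set on a small RAID1 of SSDs; backing device is the big RAID6.
     make-bcache -C /dev/md/ssd-raid1
     make-bcache -B /dev/md0
     # Attach the backing device to the cache set (UUID from bcache-super-show).
     echo "$CSET_UUID" > /sys/block/bcache0/bcache/attach
     echo writeback > /sys/block/bcache0/bcache/cache_mode
     mkfs.xfs /dev/bcache0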

Best
Andre
-- 
Max Planck Institute for Biology
Tel: (+49) 7071 601 829
Max-Planck-Ring 5, 72076 Tübingen, Germany
http://people.tuebingen.mpg.de/maan/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: How to debug intermittent increasing md/inflight but no disk activity?
  2024-07-11 11:23   ` Andre Noll
@ 2024-07-11 22:26     ` Dave Chinner
  2024-07-13 15:47       ` Andre Noll
  2024-07-23 15:13     ` Paul Menzel
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2024-07-11 22:26 UTC (permalink / raw)
  To: Andre Noll
  Cc: Paul Menzel, linux-raid, linux-nfs, linux-block, linux-xfs,
	it+linux-raid

On Thu, Jul 11, 2024 at 01:23:12PM +0200, Andre Noll wrote:
> On Thu, Jul 11, 09:12, Dave Chinner wrote
> 
> > > Of course it’s not reproducible, but any insight how to debug this next time
> > > is much welcomed.
> > 
> > Probably not a lot you can do short of reconfiguring your RAID6
> > storage devices to handle small IOs better. However, in general,
> > RAID6 /always sucks/ for small IOs, and the only way to fix this
> > problem is to use high performance SSDs to give you a massive excess
> > of write bandwidth to burn on write amplification....
> 
> FWIW, our approach to mitigate the write amplification suckage of large
> HDD-backed raid6 arrays for small I/Os is to set up a bcache device
> by combining such arrays with two small SSDs (configured as raid1).

Which is effectively the same sort of setup as having a NVRAM cache
in front of the RAID6 volume (i.e. hardware RAID controller).

That can work if the cache is large enough to soak up bursts of
small writes followed by enough idle time for the back end RAID6
device to do all its RMW cycles to clean the cache.

However, if the cache fills up with small writes, then slowdowns and
IO latencies get even worse than if you are just using a plain RAID6
device. Think about a cache with several million cached random 4kB
writes, and how long that will take to flush to the RAID6 volume
that might only be able to do 100 IOPS.

It's not uncommon to see such setups stall for *hours* in situations
like this. We get stalls like this on hardware RAID reported to us
at least a couple of times a year. There's little we can do about it
because writeback caching mode is being used to boost burst
performance and there's not enough idle time between the bursts to
drain the cache. Yes, they could use write-through caching, but that
doesn't improve the performance of bursty workloads.

Hence deploying a fast cache in front of a very slow drive is not
exactly straightforward. Making it work reliably requires
awareness of workload IO patterns. Special attention needs to be
paid to the amount of idle time. If there isn't enough idle time,
the cache will eventually stall and it will take much longer to
recover than a stall on a plain RAID volume.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: How to debug intermittent increasing md/inflight but no disk activity?
  2024-07-10 23:12 ` Dave Chinner
  2024-07-11  8:51   ` Johannes Truschnigg
  2024-07-11 11:23   ` Andre Noll
@ 2024-07-12  3:54   ` Dragan Milivojević
  2024-07-12 23:45     ` Dave Chinner
  2 siblings, 1 reply; 12+ messages in thread
From: Dragan Milivojević @ 2024-07-12  3:54 UTC (permalink / raw)
  To: Dave Chinner, Paul Menzel
  Cc: linux-raid, linux-nfs, linux-block, linux-xfs, it+linux-raid

On 11/07/2024 01:12, Dave Chinner wrote:
> Probably not a lot you can do short of reconfiguring your RAID6
> storage devices to handle small IOs better. However, in general,
> RAID6 /always sucks/ for small IOs, and the only way to fix this
> problem is to use high performance SSDs to give you a massive excess
> of write bandwidth to burn on write amplification....
  
RAID5/6 has the same issues with NVMe drives.
The major issue is the bitmap.

5 disk NVMe RAID5, 64K chunk

Test                   BW         IOPS
bitmap internal 64M    700KiB/s   174
bitmap internal 128M   702KiB/s   175
bitmap internal 512M   1142KiB/s  285
bitmap internal 1024M  40.4MiB/s  10.3k
bitmap internal 2G     66.5MiB/s  17.0k
bitmap external 64M    67.8MiB/s  17.3k
bitmap external 1024M  76.5MiB/s  19.6k
bitmap none            80.6MiB/s  20.6k
Single disk 1K         54.1MiB/s  55.4k
Single disk 4K         269MiB/s   68.8k

Tested with fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1 --group_reporting --time_based --name=Raid5

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: How to debug intermittent increasing md/inflight but no disk activity?
  2024-07-12  3:54   ` Dragan Milivojević
@ 2024-07-12 23:45     ` Dave Chinner
  2024-07-13 17:44       ` Dragan Milivojević
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2024-07-12 23:45 UTC (permalink / raw)
  To: Dragan Milivojević
  Cc: Paul Menzel, linux-raid, linux-nfs, linux-block, linux-xfs,
	it+linux-raid

On Fri, Jul 12, 2024 at 05:54:05AM +0200, Dragan Milivojević wrote:
> On 11/07/2024 01:12, Dave Chinner wrote:
> > Probably not a lot you can do short of reconfiguring your RAID6
> > storage devices to handle small IOs better. However, in general,
> > RAID6 /always sucks/ for small IOs, and the only way to fix this
> > problem is to use high performance SSDs to give you a massive excess
> > of write bandwidth to burn on write amplification....
> RAID5/6 has the same issues with NVME drives.
> Major issue is the bitmap.

That's irrelevant to the problem being discussed. The OP is
reporting stalls due to the bursty incoming workload vastly
outpacing the rate of draining of the storage device. The above comment
is not about how close to "raw performance" the MD device gets on
NVMe SSDs - it's about how much faster it is for the given workload
than HDDs.

I.e. what matters is the relative performance differential, and
according to your numbers below, it is at least two orders of
magnitude. That would make a 100s stall into a 1s stall, and that
would largely make the OP's problems go away....

> 5 disk NVMe RAID5, 64K chunk
> 
> Test                   BW         IOPS
> bitmap internal 64M    700KiB/s   174
> bitmap internal 128M   702KiB/s   175
> bitmap internal 512M   1142KiB/s  285
> bitmap internal 1024M  40.4MiB/s  10.3k
> bitmap internal 2G     66.5MiB/s  17.0k
> bitmap external 64M    67.8MiB/s  17.3k
> bitmap external 1024M  76.5MiB/s  19.6k
> bitmap none            80.6MiB/s  20.6k
> Single disk 1K         54.1MiB/s  55.4k
> Single disk 4K         269MiB/s   68.8k
> 
> Tested with fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite
> --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1
> --group_reporting --time_based --name=Raid5

Oh, you're only testing a queue-depth-1, block-aligned, async direct IO
random write to the block device. The problem case that was reported
was unaligned, synchronous buffered IO to multiple files through the
filesystem page cache (i.e. RMW at the page cache level as well as at
the MD device) at IO depths of up to 64, with periodic fsyncs thrown
into the mix.

So the OP's workload was not only doing synchronous buffered writes,
it also triggered a lot of dependent synchronous random read IO to
go with the async write IOs issued by fsyncs and page cache
writeback.

If you were to simulate all that, I would expect that the difference
between HDDs and NVMe SSDs to be much greater than just 2 orders of
magnitude.
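
Something closer to that with fio would be buffered, sub-page,
fsync-heavy writes to files through the filesystem, e.g. (a sketch,
not a calibrated reproduction of the NFS workload):

     fio --name=nfs-like --directory=/mnt/raid6 --numjobs=64 --size=1g \
         --rw=randwrite --bs=1k --direct=0 --ioengine=psync --fsync=32 \
         --runtime=60 --time_based --group_reporting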

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: How to debug intermittent increasing md/inflight but no disk activity?
  2024-07-11 22:26     ` Dave Chinner
@ 2024-07-13 15:47       ` Andre Noll
  0 siblings, 0 replies; 12+ messages in thread
From: Andre Noll @ 2024-07-13 15:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Paul Menzel, linux-raid, linux-nfs, linux-block, linux-xfs,
	it+linux-raid

On Fri, Jul 12, 08:26, Dave Chinner wrote
> On Thu, Jul 11, 2024 at 01:23:12PM +0200, Andre Noll wrote:
> > On Thu, Jul 11, 09:12, Dave Chinner wrote
> > 
> > > > Of course it’s not reproducible, but any insight how to debug this next time
> > > > is much welcomed.
> > > 
> > > Probably not a lot you can do short of reconfiguring your RAID6
> > > storage devices to handle small IOs better. However, in general,
> > > RAID6 /always sucks/ for small IOs, and the only way to fix this
> > > problem is to use high performance SSDs to give you a massive excess
> > > of write bandwidth to burn on write amplification....
> > 
> > FWIW, our approach to mitigate the write amplification suckage of large
> > HDD-backed raid6 arrays for small I/Os is to set up a bcache device
> > by combining such arrays with two small SSDs (configured as raid1).
> 
> Which is effectively the same sort of setup as having a NVRAM cache
> in front of the RAID6 volume (i.e. hardware RAID controller).

Yes, bcache is CacheVault on the cheap, plus the additional benefit
that bcache tries to detect and skip sequential I/O, bypassing
the cache.

> That can work if the cache is large enough to soak up bursts of
> small writes followed by enough idle time for the back end RAID6
> device to do all it's RMW cycles to clean the cache.
> 
> However, if the cache fills up with small writes, then slowdowns and
> IO latencies get even worse than if you are just using a plain RAID6
> device. Think about a cache with several million cached random 4kB
> writes, and how long that will take to flush to the RAID6 volume
> that might only be able to do 100 IOPS.

Indeed, we also see these stalls occasionally, especially under
mixed workloads where large file copies happen in parallel with heavy
metadata I/O such as a recursive chmod/chown. However, the stalls we
see are usually short. At most a couple of minutes, but not hours.

> Hence deploying a fast cache in front of a very slow drive is not
> exactly straight forward. Making it work reliably requires
> awareness of workload IO patterns. Special attention needs to be
> paid to the amount of idle time.

The problem is that knowing the I/O patterns might be too much to ask
for. In our case, many scientists use the servers at the same time,
and in very different ways. Some are experimenting with closed source
special purpose software that has unknown I/O characteristics. So the
workload and the I/O patterns are kind of unpredictable and vary a lot.

If people complain about slowness or high latencies, I usually
recommend writing to SSD-only scratch space first, then copying the
results over to the large HDD-backed arrays. Sometimes it's the
unsophisticated solutions that work best :)

Thanks
Andre
-- 
Max Planck Institute for Biology
Tel: (+49) 7071 601 829
Max-Planck-Ring 5, 72076 Tübingen, Germany
http://people.tuebingen.mpg.de/maan/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: How to debug intermittent increasing md/inflight but no disk activity?
  2024-07-12 23:45     ` Dave Chinner
@ 2024-07-13 17:44       ` Dragan Milivojević
  0 siblings, 0 replies; 12+ messages in thread
From: Dragan Milivojević @ 2024-07-13 17:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Paul Menzel, linux-raid, linux-nfs, linux-block, linux-xfs,
	it+linux-raid

On 13/07/2024 01:45, Dave Chinner wrote:

> That's irrelevant to the problem being discussed. The OP is
> reporting stalls due to the bursty incoming workload vastly
> outpacing the rate of draining of the storage device. The above comment
> is not about how close to "raw performance" the MD device gets on
> NVMe SSDs - it's about how much faster it is for the given workload
> than HDDs.


NVMe RAID is faster than HDD RAID, that is true. The relative performance
degradation is a different matter. When used with the default bitmap
settings, MD RAID sends a ton of disk flushes, even with full stripe writes,
and that kills the already atrocious performance.

The OP should modify his array and either remove the bitmap or move it to
an external drive, and see how much that helps.
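
That can be done online with mdadm, roughly like this (a sketch; check
mdadm(8) for your version, and note an external bitmap file must not
live on the array itself):

     # Drop the internal bitmap entirely:
     mdadm --grow --bitmap=none /dev/md0
     # ...or re-add it with a much coarser chunk, which cuts the update rate:
     #   mdadm --grow --bitmap=internal --bitmap-chunk=1024M /dev/md0
     # ...or keep it in a file on a separate, fast device:
     #   mdadm --grow --bitmap=/ssd/md0-bitmap /dev/md0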


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: How to debug intermittent increasing md/inflight but no disk activity?
  2024-07-10 11:54 ` Roger Heflin
@ 2024-07-23 10:33   ` Paul Menzel
  0 siblings, 0 replies; 12+ messages in thread
From: Paul Menzel @ 2024-07-23 10:33 UTC (permalink / raw)
  To: Roger Heflin; +Cc: linux-raid, linux-nfs, linux-block, linux-xfs, it+linux-raid

Dear Roger,


Thank you for your reply.

Am 10.07.24 um 13:54 schrieb Roger Heflin:
> How long does it freeze this way?

It froze for up to five minutes, I’d say.

> The disks getting bad blocks do show up as stopping activity for 3-60
> seconds (depending on the disks internal settings).
> 
> smartctl --xall <device> | grep -iE 'sector|reall' should show the
> reallocation counters.

These are SAS disks, and none of the array members has any errors. Example:

```
@grele:~$ sudo smartctl --xall /dev/sdy
[…]
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     655487.372           0
write:         0        0         0         0          0      38289.771           0
```

> What kind of disks does the machine have?

Seagate ST16000NM004J (16 TB, SAS)

> On my home machine a bad sector freezes it for 7 seconds (scterc
> defaults to 7).  On some work large disk big raid the hang is minutes.
>     The raw disk is set to 10 (that is what the vendor told us) and
> that 10 + having potentially a bunch of IOs against the bad sector
> shows as minutes.
> 
> I wrote a script that work uses that both times how long smartctl
> takes for each disk (the bad disk takes >5 seconds, and up to minutes)
> and also shows the reallocated count and save a copy every hour so one
> can see what disk incremented its counter in the last hour and replace
> that disk.

A colleague also wrote a Perl program, diskcheck, that is run regularly 
to check all the disks. Nothing suspicious there.


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: How to debug intermittent increasing md/inflight but no disk activity?
  2024-07-11 11:23   ` Andre Noll
  2024-07-11 22:26     ` Dave Chinner
@ 2024-07-23 15:13     ` Paul Menzel
  1 sibling, 0 replies; 12+ messages in thread
From: Paul Menzel @ 2024-07-23 15:13 UTC (permalink / raw)
  To: Andre Noll, Dave Chinner
  Cc: linux-raid, linux-nfs, linux-block, linux-xfs, it+linux-raid

Dear Andre, dear Dave,


Thank you for your replies.


Am 11.07.24 um 13:23 schrieb Andre Noll:
> On Thu, Jul 11, 09:12, Dave Chinner wrote
> 
>>> Of course it’s not reproducible, but any insight how to debug this next time
>>> is much welcomed.
>>
>> Probably not a lot you can do short of reconfiguring your RAID6
>> storage devices to handle small IOs better. However, in general,
>> RAID6 /always sucks/ for small IOs, and the only way to fix this
>> problem is to use high performance SSDs to give you a massive excess
>> of write bandwidth to burn on write amplification....
> 
> FWIW, our approach to mitigate the write amplification suckage of large
> HDD-backed raid6 arrays for small I/Os is to set up a bcache device
> by combining such arrays with two small SSDs (configured as raid1).

Now that file servers with software RAID proliferate in our institute, 
as old systems with battery-backed hardware RAID controllers are taken 
offline, we have noticed performance problems. (We have not found a 
silver bullet yet.) My colleague Donald was testing bcache in March, 
but due to the slightly more complex setup, a colleague is currently 
experimenting with a write journal for the software RAID (see the rough 
sketch below).
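
As far as I understand, the write journal has to be specified when the 
array is created, roughly like this (only a sketch with placeholder 
devices):

     mdadm --create /dev/md9 --level=6 --raid-devices=12 \
           --write-journal /dev/nvme0n1p1 /dev/sd[a-l]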


Kind regards,

Paul


PS: *bcache* performance test:

     time bash -c '(cd /jbod/MG002/scratch/x && for i in $(seq -w 1000); do echo a > data.$i; done)'

| setting                                              | time/s | time/s | time/s |
|-------------------------------------------------------|--------|--------|--------|
| xfs/raid6                                             | 40.826 | 41.638 | 44.685 |
| bcache/xfs/raid6 mode none                            | 32.642 | 29.274 | 27.491 |
| bcache/xfs/raid6 mode writethrough                    | 27.028 | 31.754 | 28.884 |
| bcache/xfs/raid6 mode writearound                     | 24.526 | 30.808 | 28.940 |
| bcache/xfs/raid6 mode writeback                       |  5.795 |  6.456 |  7.230 |
| bcachefs 10+2                                         | 10.321 | 11.832 | 12.671 |
| bcachefs 10+2+nvme (writeback)                        |  9.026 |  8.676 |  8.619 |
| xfs/raid6 (12*100GB)                                  | 32.446 | 25.583 | 24.007 |
| xfs/raid5 (12*100GB)                                  | 27.934 | 23.705 | 22.558 |
| xfs/bcache(10*raid6,2*raid1 cache) writethrough       | 56.240 | 47.997 | 45.321 |
| xfs/bcache(10*raid6,2*raid1 cache) writeback          | 82.230 | 85.779 | 85.814 |
| xfs/bcache(10*raid6,2*raid1 cache(ssd)) writethrough  | 26.459 | 23.631 | 23.586 |
| xfs/bcache(10*raid6,2*raid1 cache(ssd)) writeback     |  7.729 |  7.073 |  6.958 |
| as above with sequential_cutoff=0                     |  6.397 |  6.826 |  6.759 |

`sequential_cutoff=0` significantly speeds up `tar xf 
node-v20.11.0.tar.gz`, from 13m45.108s to 5m31.379s! Maybe the 
sequential cutoff heuristic doesn't work well over NFS.
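
Setting it is a one-liner per bcache device, e.g.

     echo 0 > /sys/block/bcache0/bcache/sequential_cutoff

though as far as I know the value does not persist across reboots, so 
it needs a udev rule or boot-time script.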

1.  Build kernel over NFS with the usual setup: 27m38s
2.  Build kernel over NFS with xfs+bcache with two (raid1) SSDs: 10m27s

^ permalink raw reply	[flat|nested] 12+ messages in thread

Thread overview: 12+ messages
2024-07-10 11:46 How to debug intermittent increasing md/inflight but no disk activity? Paul Menzel
2024-07-10 11:54 ` Roger Heflin
2024-07-23 10:33   ` Paul Menzel
2024-07-10 23:12 ` Dave Chinner
2024-07-11  8:51   ` Johannes Truschnigg
2024-07-11 11:23   ` Andre Noll
2024-07-11 22:26     ` Dave Chinner
2024-07-13 15:47       ` Andre Noll
2024-07-23 15:13     ` Paul Menzel
2024-07-12  3:54   ` Dragan Milivojević
2024-07-12 23:45     ` Dave Chinner
2024-07-13 17:44       ` Dragan Milivojević
