* XFS writing issue
@ 2023-07-11 15:31 Eugene K.
2023-07-11 15:42 ` Darrick J. Wong
2023-07-11 23:11 ` Dave Chinner
0 siblings, 2 replies; 3+ messages in thread
From: Eugene K. @ 2023-07-11 15:31 UTC (permalink / raw)
To: linux-xfs
Hello.
While investigating a flapping performance problem, we found that once a
process writes a large amount of data in one go, the filesystem focuses
on that write and no other process can perform any IO on the
filesystem.
We noticed huge %iowait on a software RAID1 (mdraid) array running on 2
SSD drives on every attempt to write more than 1 GB.
The issue happens on any server running a 6.4.2, 6.4.0, 6.3.3, or
6.2.12 kernel. Investigation and testing showed that server IO
performance can be completely killed with a single command:
# cat /dev/zero > ./removeme
assuming the ./removeme file resides on the rootfs and the rootfs is XFS.
While this runs, the server becomes so unresponsive that after ~15
seconds it is not even possible to log in via ssh!
We reproduced this on every machine running the mentioned kernels with
XFS as the rootfs. However, when we converted the rootfs from XFS to
ext4 (or btrfs), the problem disappeared - same OS, same kernel binary,
same hardware, just ext4 or btrfs instead of XFS.
Note: during the hang, the SSD drives write data at the expected rate;
it is just that every process except the writing one hangs.
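For anyone testing this on their own machine, a bounded stand-in for the reproducer may be safer, since `cat /dev/zero` runs until the disk fills. This is only a sketch: the 64 MiB size is illustrative and would need to be raised well past the dirty-page threshold (a percentage of RAM) to actually trigger the stall described above.

```shell
# Bounded stand-in for `cat /dev/zero > ./removeme`: writes a fixed
# amount (64 MiB here; raise count well past vm.dirty_ratio percent of
# RAM to reproduce the stall) and flushes it before dd exits.
dd if=/dev/zero of=./removeme bs=1M count=64 conv=fdatasync status=none
stat -c '%s' ./removeme   # prints 67108864
```

Unlike the original command, this terminates on its own and leaves a file that can simply be removed afterwards.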
* Re: XFS writing issue
2023-07-11 15:31 XFS writing issue Eugene K.
@ 2023-07-11 15:42 ` Darrick J. Wong
2023-07-11 23:11 ` Dave Chinner
1 sibling, 0 replies; 3+ messages in thread
From: Darrick J. Wong @ 2023-07-11 15:42 UTC (permalink / raw)
To: Eugene K.; +Cc: linux-xfs
On Tue, Jul 11, 2023 at 05:31:13PM +0200, Eugene K. wrote:
> Hello.
>
> While investigating a flapping performance problem, we found that once a
> process writes a large amount of data in one go, the filesystem focuses on
> that write and no other process can perform any IO on the filesystem.
>
> We noticed huge %iowait on a software RAID1 (mdraid) array running on 2
> SSD drives on every attempt to write more than 1 GB.
>
> The issue happens on any server running a 6.4.2, 6.4.0, 6.3.3, or 6.2.12
> kernel. Investigation and testing showed that server IO performance can be
> completely killed with a single command:
>
> # cat /dev/zero > ./removeme
>
> assuming the ./removeme file resides on the rootfs and the rootfs is XFS.
>
> While this runs, the server becomes so unresponsive that after ~15
> seconds it is not even possible to log in via ssh!
>
> We reproduced this on every machine running the mentioned kernels with
> XFS as the rootfs. However, when we converted the rootfs from XFS to ext4
> (or btrfs), the problem disappeared - same OS, same kernel binary, same
> hardware, just ext4 or btrfs instead of XFS.
So use ext4.
--D
> Note: during the hang, the SSD drives write data at the expected rate;
> it is just that every process except the writing one hangs.
>
>
* Re: XFS writing issue
2023-07-11 15:31 XFS writing issue Eugene K.
2023-07-11 15:42 ` Darrick J. Wong
@ 2023-07-11 23:11 ` Dave Chinner
1 sibling, 0 replies; 3+ messages in thread
From: Dave Chinner @ 2023-07-11 23:11 UTC (permalink / raw)
To: Eugene K.; +Cc: linux-xfs
On Tue, Jul 11, 2023 at 05:31:13PM +0200, Eugene K. wrote:
> Hello.
>
> While investigating a flapping performance problem, we found that once a
> process writes a large amount of data in one go, the filesystem focuses on
> that write and no other process can perform any IO on the filesystem.
What hardware are you testing on? RAM, CPUs, SSD models, etc.
Also, xfs_info for the filesystem you are testing, the output of 'grep .
/proc/sys/vm/*', and dumps of 'iostat -dxm 1' and 'vmstat 1' while you
are running the test. Also capture the dmesg output after 'echo w >
/proc/sysrq-trigger', and run 'cat /proc/meminfo' multiple times while
the test is running.
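A sketch of a collection script for the diagnostics requested above; it must be run as root while the reproducer is active, and the output filenames and the 30-second sampling window are illustrative, not part of the original request.

```shell
# Write the collection steps to a script rather than running them here,
# since sysrq and iostat need root and a live reproducer.
cat > collect-diag.sh <<'EOF'
#!/bin/sh
# Gather the data requested in the thread (run as root during the test).
xfs_info / > xfs-info.txt
grep . /proc/sys/vm/* > vm-sysctls.txt 2>/dev/null
iostat -dxm 1 30 > iostat.txt &
vmstat 1 30 > vmstat.txt &
for i in 1 2 3; do
    echo w > /proc/sysrq-trigger      # blocked-task dump into dmesg
    cat /proc/meminfo > meminfo.$i.txt
    sleep 10
done
wait
dmesg > dmesg.txt
EOF
chmod +x collect-diag.sh
```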
> We noticed huge %iowait on a software RAID1 (mdraid) array running on 2
> SSD drives on every attempt to write more than 1 GB.
I would expect "huge" iowait for this workload: the bandwidth of the
pipe is much greater than the bandwidth of your MD device, so writes to
the fs get throttled in balance_dirty_pages_ratelimited() once a
certain percentage of RAM is dirtied.
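The thresholds involved can be inspected via sysctl; a quick check, with the interpretation (which is an assumption about the usual configuration, not something stated in the thread) noted in the comments:

```shell
# Writers are throttled once dirty page cache exceeds roughly
# vm.dirty_ratio percent of reclaimable memory; background writeback
# kicks in at vm.dirty_background_ratio (or the *_bytes equivalents,
# whichever pair is nonzero on the system).
grep . /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
```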
> The issue happens on any server running a 6.4.2, 6.4.0, 6.3.3, or 6.2.12
> kernel. Investigation and testing showed that server IO performance can be
> completely killed with a single command:
>
> # cat /dev/zero > ./removeme
flat profile from 'perf top -U':
35.85% [kernel] [k] __pv_queued_spin_lock_slowpath
6.86% [kernel] [k] rep_movs_alternative
5.92% [kernel] [k] do_raw_spin_lock
5.61% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
2.62% [kernel] [k] rep_stos_alternative
2.25% [kernel] [k] do_raw_spin_unlock
1.77% [kernel] [k] __folio_end_writeback
1.68% [kernel] [k] xas_start
1.60% [kernel] [k] xas_descend
1.46% [kernel] [k] __remove_mapping
1.36% [kernel] [k] __folio_start_writeback
1.05% [kernel] [k] __filemap_add_folio
0.98% [kernel] [k] iomap_write_begin
0.87% [kernel] [k] percpu_counter_add_batch
0.83% [kernel] [k] folio_clear_dirty_for_io
0.82% [kernel] [k] get_page_from_freelist
0.79% [kernel] [k] iomap_write_end
0.78% [kernel] [k] inode_to_bdi
0.72% [kernel] [k] folio_unlock
0.71% [kernel] [k] node_dirty_ok
0.71% [kernel] [k] __mod_node_page_state
0.65% [kernel] [k] write_cache_pages
0.65% [kernel] [k] __might_resched
0.65% [kernel] [k] __mod_lruvec_page_state
0.64% [kernel] [k] iomap_do_writepage
0.57% [kernel] [k] xas_store
0.53% [kernel] [k] shrink_folio_list
0.50% [kernel] [k] balance_dirty_pages_ratelimited_flags
0.49% [kernel] [k] __mod_memcg_lruvec_state
0.49% [kernel] [k] filemap_dirty_folio
0.48% [kernel] [k] __folio_mark_dirty
0.48% [kernel] [k] __rmqueue_pcplist
0.48% [kernel] [k] __rcu_read_lock
0.45% [kernel] [k] xas_load
0.43% [kernel] [k] __mod_zone_page_state
0.40% [kernel] [k] lru_add_fn
0.39% [kernel] [k] __list_del_entry_valid
0.38% [kernel] [k] mod_zone_page_state
0.37% [kernel] [k] __filemap_remove_folio
0.36% [kernel] [k] node_page_state
0.34% [kernel] [k] __filemap_get_folio
0.33% [kernel] [k] filemap_get_folios_tag
0.31% [kernel] [k] isolate_lru_folios
0.30% [kernel] [k] folio_end_writeback
Almost nothing XFS there - it's all lock contention in the page
cache.
This smells of mapping tree lock contention. Yup, the callgraph
profile indicates that all of the lock contention is on the mapping
tree, between kswapd (multiple processes), the write process, the
writeback worker, and the XFS IO completion worker.
Hmmm - the system is definitely slow. Ah - the write to the file fills
all of free memory with page cache pages on the same mapping, so every
memory allocation then has to reclaim memory; allocations go into
direct reclaim, and that adds even more contention on the mapping tree
lock....
IOWs, this looks like a mapping tree lock contention problem at its
core. The mapping tree is exposed to unbounded concurrency in these
sorts of situations.
> assuming the ./removeme file resides on the rootfs and the rootfs is XFS.
It doesn't need to be the root fs - I just reproduced it on an XFS
filesystem mounted at /mnt/scratch with an ext3 rootfs.
> While this runs, the server becomes so unresponsive that after ~15
> seconds it is not even possible to log in via ssh!
Direct memory reclaim getting stuck on the mapping lock because it
adds to the contention problem?
> We reproduced this on every machine running the mentioned kernels with
> XFS as the rootfs. However, when we converted the rootfs from XFS to ext4
> (or btrfs), the problem disappeared - same OS, same kernel binary, same
> hardware, just ext4 or btrfs instead of XFS.
Experience has taught me that XFS tends to trigger lock contention
problems in generic code sooner than other filesystems. So this
wouldn't be unexpected, but if the cause is really mapping tree lock
contention then XFS is just the canary....
> Note: during the hang, the SSD drives write data at the expected rate;
> it is just that every process except the writing one hangs.
Yup, that's definitely expected - everything on the write() and
writeback side is running at full IO speed, it's just that
everything else is thrashing on the mapping tree waiting for IO to
clean pages....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com