* [Cluster-devel] GFS2 deadlock in 4.19 (iomap/writeback?)
@ 2019-03-06 11:00 Edwin Török
2019-03-13 17:13 ` [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter Andreas Gruenbacher
2019-08-13 12:47 ` [Cluster-devel] GFS2 deadlock in 4.19 (iomap/writeback?) Andreas Gruenbacher
0 siblings, 2 replies; 9+ messages in thread
From: Edwin Török @ 2019-03-06 11:00 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hello,
I've been trying to debug a GFS2 deadlock that we see in our lab quite frequently with a 4.19 kernel. With 4.4 and older kernels we were not able to reproduce this.
See below for lockdep dumps and stacktraces.
Ignoring the lockdep warnings mentioned in my previous email, I think I have narrowed the problem down to a deadlock between iomap and writeback:
IIUC the sequence of calls leading up to this is:
aio_write ->
gfs2_file_write_iter ->
iomap_file_buffered_write ->
iomap_apply ->
iomap_begin -> gfs2_iomap_begin_write -> gfs2_trans_begin -> down_read(&sdp->sd_log_flush_lock)
iomap_write_actor ->
balance_dirty_pages ->
... waits for writeback ...
Meanwhile, writeback tries to do a log flush and fails to acquire down_write(&sdp->sd_log_flush_lock), because the path above keeps the lock held for read:
[<0>] call_rwsem_down_write_failed+0x13/0x20
[<0>] gfs2_log_flush+0x47/0x6c0 [gfs2]
[<0>] gfs2_write_inode+0xa0/0x160 [gfs2]
[<0>] __writeback_single_inode+0x2ec/0x3e0
[<0>] writeback_sb_inodes+0x263/0x470
[<0>] __writeback_inodes_wb+0x87/0xb0
[<0>] wb_writeback+0x332/0x350
[<0>] wb_workfn+0x1bf/0x490
[<0>] process_one_work+0x209/0x4a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xfb/0x130
[<0>] ret_from_fork+0x35
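To make the inversion easier to see outside the kernel, here is a minimal userspace sketch of the same pattern (plain Python threading, not GFS2 code; a plain Lock stands in for the sd_log_flush_lock rwsem, with the read/write distinction elided, and the two functions stand in for the writing process and the writeback worker):

import threading

log_flush_lock = threading.Lock()    # stand-in for sdp->sd_log_flush_lock
writeback_done = threading.Event()   # stand-in for "writeback made progress"

def writer():
    # gfs2_file_write_iter -> gfs2_iomap_begin_write -> gfs2_trans_begin
    # takes the lock; balance_dirty_pages then waits for writeback while
    # the lock is still held.
    with log_flush_lock:
        writeback_done.wait(timeout=2)

def flusher():
    # writeback -> gfs2_write_inode -> gfs2_log_flush needs the same lock
    # before it can make progress and signal completion.
    with log_flush_lock:
        writeback_done.set()

threads = [threading.Thread(target=writer), threading.Thread(target=flusher)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("only the 2-second timeout above breaks the cycle; the kernel has none")

In the kernel there is no such timeout, so once the writer holds the lock and waits for writeback, neither side can ever make progress.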
Calling gfs2_trans_begin from gfs2_iomap_begin_write seems dangerous, because writeback, which needs that lock for write, can be triggered at any time while the read side is held.
Also, we only ever write to this file in O_DIRECT mode through qemu, so there shouldn't be anything to flush to begin with (other than metadata),
but of course if writeback gets triggered at the wrong time it can still deadlock:
	if (iocb->ki_flags & IOCB_DIRECT) {
		struct address_space *mapping = file->f_mapping;
		loff_t pos, endbyte;
		ssize_t buffered;

		written = gfs2_file_direct_write(iocb, from);
		if (written < 0 || !iov_iter_count(from))
			goto out2;

		ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
		if (unlikely(ret < 0))
			goto out2;
		buffered = ret;
I tried disabling writeback throttling by writing 0 to wbt_lat_usec on the underlying iSCSI block device, but that didn't help; writeback still deadlocks.
Any suggestions on how to fix this?
A lockdep dump when the problem occurs looks like this:
Mar 4 11:30:51 localhost kernel: [38534.988859] Showing all locks held in the system:
Mar 4 11:30:51 localhost kernel: [38534.988863] 3 locks held by kworker/0:0H/6:
Mar 4 11:30:51 localhost kernel: [38534.988865] #0: 00000000c104076e ((wq_completion)"glock_workqueue"){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.988884] #1: 00000000a7d48703 ((work_completion)(&(&gl->gl_work)->work)){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.988897] #2: 000000005d7246b0 (&sdp->sd_log_flush_lock){....}, at: gfs2_log_flush+0x47/0x6c0 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.988923] 3 locks held by kworker/u4:0/7:
Mar 4 11:30:51 localhost kernel: [38534.988925] #0: 000000002fe9e66a ((wq_completion)"writeback"){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.988940] #1: 000000000289a6d6 ((work_completion)(&(&wb->dwork)->work)){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.988955] #2: 000000005d018367 (&type->s_umount_key#33){....}, at: trylock_super+0x16/0x50
Mar 4 11:30:51 localhost kernel: [38534.988976] 2 locks held by kworker/u5:0/37:
Mar 4 11:30:51 localhost kernel: [38534.988978] #0: 00000000376f5de7 ((wq_completion)"knbd-recv"){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.988993] #1: 00000000daa752e1 ((work_completion)(&args->work)){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.989059] 3 locks held by kworker/0:1H/281:
Mar 4 11:30:51 localhost kernel: [38534.989069] #0: 0000000052c15a61 ((wq_completion)"kblockd"){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.989134] #1: 00000000f22dc4a3 ((work_completion)(&q->timeout_work)#2){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.989210] #2: 0000000026b2250c (&nsock->tx_lock){....}, at: sock_shutdown+0x4f/0xa0 [nbd]
Mar 4 11:30:51 localhost kernel: [38534.989279] 2 locks held by in:imklog/781:
Mar 4 11:30:51 localhost kernel: [38534.989288] 3 locks held by rs:main Q:Reg/782:
Mar 4 11:30:51 localhost kernel: [38534.989300] #0: 000000006efedb7e (&f->f_pos_lock){....}, at: __fdget_pos+0x4d/0x60
Mar 4 11:30:51 localhost kernel: [38534.989362] #1: 00000000e9351b73 (sb_writers#9){....}, at: vfs_write+0x17e/0x1b0
Mar 4 11:30:51 localhost kernel: [38534.989439] #2: 00000000da9579f5 (&sb->s_type->i_mutex_key#12){....}, at: ext4_file_write_iter+0x83/0x410
Mar 4 11:30:51 localhost kernel: [38534.989537] 2 locks held by python/2778:
Mar 4 11:30:51 localhost kernel: [38534.989540] #0: 00000000ceca4c21 (&tty->ldisc_sem){....}, at: tty_ldisc_ref_wait+0x24/0x50
Mar 4 11:30:51 localhost kernel: [38534.989555] #1: 00000000b68a5ccd (&ldata->atomic_read_lock){....}, at: n_tty_read+0xdb/0x8f0
Mar 4 11:30:51 localhost kernel: [38534.989576] 4 locks held by kworker/u4:2/12433:
Mar 4 11:30:51 localhost kernel: [38534.989578] #0: 000000002fe9e66a ((wq_completion)"writeback"){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.989592] #1: 00000000e02b1d4a ((work_completion)(&(&wb->dwork)->work)){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.989613] #2: 000000000ea8b7ae (&type->s_umount_key#38){....}, at: trylock_super+0x16/0x50
Mar 4 11:30:51 localhost kernel: [38534.989628] #3: 000000005d7246b0 (&sdp->sd_log_flush_lock){....}, at: gfs2_log_flush+0x47/0x6c0 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.989651] 5 locks held by kworker/0:2H/13124:
Mar 4 11:30:51 localhost kernel: [38534.989653] #0: 0000000052c15a61 ((wq_completion)"kblockd"){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.989667] #1: 000000009e2ab5ef ((work_completion)(&(&hctx->run_work)->work)){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.989683] #2: 00000000996c6773 (hctx->srcu){....}, at: hctx_lock+0x1c/0x90
Mar 4 11:30:51 localhost kernel: [38534.989697] #3: 00000000c5e879db (&cmd->lock){....}, at: nbd_queue_rq+0x51/0x508 [nbd]
Mar 4 11:30:51 localhost kernel: [38534.989713] #4: 0000000026b2250c (&nsock->tx_lock){....}, at: nbd_queue_rq+0xec/0x508 [nbd]
Mar 4 11:30:51 localhost kernel: [38534.989728] 1 lock held by gfs2_logd/13483:
Mar 4 11:30:51 localhost kernel: [38534.989730] #0: 000000005d7246b0 (&sdp->sd_log_flush_lock){....}, at: gfs2_log_flush+0x47/0x6c0 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.989751] 3 locks held by gfs2_quotad/13484:
Mar 4 11:30:51 localhost kernel: [38534.989753] #0: 000000006cb1852f (sb_writers#15){....}, at: gfs2_statfs_sync+0x51/0x1e0 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.989777] #1: 0000000020773729 (sb_internal#2){....}, at: gfs2_trans_begin+0xba/0x140 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.989800] #2: 000000005d7246b0 (&sdp->sd_log_flush_lock){....}, at: gfs2_log_reserve+0x160/0x300 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.989820] 3 locks held by qemu-dp/14083:
Mar 4 11:30:51 localhost kernel: [38534.989824] #0: 00000000cf9ac145 (&sb->s_type->i_mutex_key#18){....}, at: gfs2_file_write_iter+0xa4/0x350 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.989846] #1: 0000000020773729 (sb_internal#2){....}, at: gfs2_trans_begin+0xba/0x140 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.989870] #2: 000000005d7246b0 (&sdp->sd_log_flush_lock){....}, at: gfs2_log_reserve+0x160/0x300 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.989888] 3 locks held by kworker/0:3H/14213:
Mar 4 11:30:51 localhost kernel: [38534.989890] #0: 00000000c104076e ((wq_completion)"glock_workqueue"){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.989906] #1: 0000000040479723 ((work_completion)(&(&gl->gl_work)->work)){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.989921] #2: 000000005d7246b0 (&sdp->sd_log_flush_lock){....}, at: gfs2_log_flush+0x47/0x6c0 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.989941] 3 locks held by qemu-dp/14271:
Mar 4 11:30:51 localhost kernel: [38534.989943] #0: 000000008d3c4857 (&sb->s_type->i_mutex_key#18){....}, at: gfs2_file_write_iter+0xa4/0x350 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.989967] #1: 0000000020773729 (sb_internal#2){....}, at: gfs2_trans_begin+0xba/0x140 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.989988] #2: 000000005d7246b0 (&sdp->sd_log_flush_lock){....}, at: gfs2_log_reserve+0x160/0x300 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.990008] 4 locks held by python/14293:
Mar 4 11:30:51 localhost kernel: [38534.990010] #0: 000000006cb1852f (sb_writers#15){....}, at: mnt_want_write+0x20/0x50
Mar 4 11:30:51 localhost kernel: [38534.990028] #1: 0000000070e2532a (&sb->s_type->i_mutex_key#18){....}, at: do_truncate+0x65/0xc0
Mar 4 11:30:51 localhost kernel: [38534.990044] #2: 0000000020773729 (sb_internal#2){....}, at: gfs2_trans_begin+0xba/0x140 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.990066] #3: 000000005d7246b0 (&sdp->sd_log_flush_lock){....}, at: gfs2_log_reserve+0x160/0x300 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.990084] 3 locks held by kworker/0:4H/14543:
Mar 4 11:30:51 localhost kernel: [38534.990086] #0: 00000000c104076e ((wq_completion)"glock_workqueue"){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.990110] #1: 000000009df75ab0 ((work_completion)(&(&gl->gl_work)->work)){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.990122] #2: 000000005d7246b0 (&sdp->sd_log_flush_lock){....}, at: gfs2_log_flush+0x47/0x6c0 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.990143] 3 locks held by kworker/0:5H/18090:
Mar 4 11:30:51 localhost kernel: [38534.990145] #0: 00000000c104076e ((wq_completion)"glock_workqueue"){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.990159] #1: 000000000b482fa1 ((work_completion)(&(&gl->gl_work)->work)){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.990175] #2: 000000005d7246b0 (&sdp->sd_log_flush_lock){....}, at: gfs2_log_flush+0x47/0x6c0 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.990192] 3 locks held by kworker/0:6H/18091:
Mar 4 11:30:51 localhost kernel: [38534.990195] #0: 00000000c104076e ((wq_completion)"glock_workqueue"){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.990209] #1: 0000000031fb3d54 ((work_completion)(&(&gl->gl_work)->work)){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.990224] #2: 000000005d7246b0 (&sdp->sd_log_flush_lock){....}, at: gfs2_log_flush+0x47/0x6c0 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.990244] 3 locks held by kworker/0:7H/18092:
Mar 4 11:30:51 localhost kernel: [38534.990246] #0: 00000000c104076e ((wq_completion)"glock_workqueue"){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.990308] #1: 0000000055374d30 ((work_completion)(&(&gl->gl_work)->work)){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.990380] #2: 000000005d7246b0 (&sdp->sd_log_flush_lock){....}, at: gfs2_log_flush+0x47/0x6c0 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.990451] 3 locks held by kworker/0:8H/18093:
Mar 4 11:30:51 localhost kernel: [38534.990462] #0: 00000000c104076e ((wq_completion)"glock_workqueue"){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.990527] #1: 000000009c509cc6 ((work_completion)(&(&gl->gl_work)->work)){....}, at: process_one_work+0x180/0x4a0
Mar 4 11:30:51 localhost kernel: [38534.990594] #2: 000000005d7246b0 (&sdp->sd_log_flush_lock){....}, at: gfs2_log_flush+0x47/0x6c0 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.990661] 1 lock held by mandb/3726:
Mar 4 11:30:51 localhost kernel: [38534.990669] #0: 0000000045a063d1 (&mm->mmap_sem){....}, at: __do_page_fault+0x158/0x4f0
Mar 4 11:30:51 localhost kernel: [38534.990739] 2 locks held by bash/5179:
Mar 4 11:30:51 localhost kernel: [38534.990747] #0: 00000000d3b17bfd (&tty->ldisc_sem){....}, at: tty_ldisc_ref_wait+0x24/0x50
Mar 4 11:30:51 localhost kernel: [38534.990791] #1: 000000007c50b530 (&ldata->atomic_read_lock){....}, at: n_tty_read+0xdb/0x8f0
Mar 4 11:30:51 localhost kernel: [38534.990809] 1 lock held by glocktop/5412:
Mar 4 11:30:51 localhost kernel: [38534.990811] #0: 000000000ea8b7ae (&type->s_umount_key#38){....}, at: __get_super+0xd5/0x100
Mar 4 11:30:51 localhost kernel: [38534.990827] 2 locks held by bash/5465:
Mar 4 11:30:51 localhost kernel: [38534.990830] #0: 00000000776a454d (&tty->ldisc_sem){....}, at: tty_ldisc_ref_wait+0x24/0x50
Mar 4 11:30:51 localhost kernel: [38534.990845] #1: 000000008c9e374c (&ldata->atomic_read_lock){....}, at: n_tty_read+0xdb/0x8f0
Mar 4 11:30:51 localhost kernel: [38534.990860] 2 locks held by bash/6756:
Mar 4 11:30:51 localhost kernel: [38534.990863] #0: 000000009d483d9b (&tty->ldisc_sem){....}, at: tty_ldisc_ref_wait+0x24/0x50
Mar 4 11:30:51 localhost kernel: [38534.990877] #1: 0000000034732f61 (&ldata->atomic_read_lock){....}, at: n_tty_read+0xdb/0x8f0
Mar 4 11:30:51 localhost kernel: [38534.990892] 2 locks held by agetty/7626:
Mar 4 11:30:51 localhost kernel: [38534.990895] #0: 00000000d59eb8fe (&tty->ldisc_sem){....}, at: tty_ldisc_ref_wait+0x24/0x50
Mar 4 11:30:51 localhost kernel: [38534.990909] #1: 000000002babda64 (&ldata->atomic_read_lock){....}, at: n_tty_read+0xdb/0x8f0
Mar 4 11:30:51 localhost kernel: [38534.990924] 3 locks held by bash/13302:
Mar 4 11:30:51 localhost kernel: [38534.990927] #0: 0000000085ac1eae (sb_writers#3){....}, at: vfs_write+0x17e/0x1b0
Mar 4 11:30:51 localhost kernel: [38534.990946] #1: 0000000008d46706 (rcu_read_lock){....}, at: __handle_sysrq+0x5/0x180
Mar 4 11:30:51 localhost kernel: [38534.990960] #2: 0000000008d46706 (rcu_read_lock){....}, at: debug_show_all_locks+0xe/0x110
Mar 4 11:30:51 localhost kernel: [38534.990977] 2 locks held by login/13641:
Mar 4 11:30:51 localhost kernel: [38534.990979] #0: 00000000a0d762d6 (&tty->ldisc_sem){....}, at: tty_ldisc_ref_wait+0x24/0x50
Mar 4 11:30:51 localhost kernel: [38534.990993] #1: 000000004f9c3423 (&ldata->atomic_read_lock){....}, at: n_tty_read+0xdb/0x8f0
Mar 4 11:30:51 localhost kernel: [38534.991009]
Mar 4 11:30:51 localhost kernel: [38534.991011] =============================================
Mar 4 11:30:51 localhost kernel: [38534.991011]
Mar 4 11:30:51 localhost kernel: [38534.991014] Showing busy workqueues and worker pools:
Mar 4 11:30:51 localhost kernel: [38534.991021] workqueue writeback: flags=0x4e
Mar 4 11:30:51 localhost kernel: [38534.991024] pwq 4: cpus=0-1 flags=0x4 nice=0 active=4/256
Mar 4 11:30:51 localhost kernel: [38534.991036] in-flight: 7:wb_workfn wb_workfn, 12433:wb_workfn wb_workfn
Mar 4 11:30:51 localhost kernel: [38534.991059] workqueue kblockd: flags=0x18
Mar 4 11:30:51 localhost kernel: [38534.991064] pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=3/256
Mar 4 11:30:51 localhost kernel: [38534.991076] in-flight: 13124:blk_mq_run_work_fn blk_mq_run_work_fn, 281:blk_mq_timeout_work
Mar 4 11:30:51 localhost kernel: [38534.991103] workqueue glock_workqueue: flags=0x1c
Mar 4 11:30:51 localhost kernel: [38534.991106] pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=7/256
Mar 4 11:30:51 localhost kernel: [38534.991118] in-flight: 14543:glock_work_func [gfs2], 6:glock_work_func [gfs2], 18093:glock_work_func [gfs2], 18090:glock_work_func [gfs2], 18091:glock_work_func [gfs2], 14213:glock_work_func [gfs2], 18092:glock_work_func [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.991182] workqueue knbd-recv: flags=0x1a
Mar 4 11:30:51 localhost kernel: [38534.991185] pwq 5: cpus=0-1 flags=0x4 nice=-20 active=1/256
Mar 4 11:30:51 localhost kernel: [38534.991197] in-flight: 37:recv_work [nbd]
Mar 4 11:30:51 localhost kernel: [38534.991209] pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=0s workers=11 idle: 18095 18094
Mar 4 11:30:51 localhost kernel: [38534.991232] pool 4: cpus=0-1 flags=0x4 nice=0 hung=0s workers=6 idle: 4015 13227 13249 11636
Mar 4 11:30:51 localhost kernel: [38534.991253] pool 5: cpus=0-1 flags=0x4 nice=-20 hung=37461s workers=3 idle: 14317 14153
Stacktraces of processes containing gfs2:
/proc/6/stack
[<0>] call_rwsem_down_write_failed+0x13/0x20
[<0>] gfs2_log_flush+0x47/0x6c0 [gfs2]
[<0>] inode_go_sync+0xa1/0x190 [gfs2]
[<0>] do_xmote+0xd0/0x200 [gfs2]
[<0>] glock_work_func+0x58/0x100 [gfs2]
[<0>] process_one_work+0x209/0x4a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xfb/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
/proc/12433/stack
[<0>] call_rwsem_down_write_failed+0x13/0x20
[<0>] gfs2_log_flush+0x47/0x6c0 [gfs2]
[<0>] gfs2_write_inode+0xa0/0x160 [gfs2]
[<0>] __writeback_single_inode+0x2ec/0x3e0
[<0>] writeback_sb_inodes+0x263/0x470
[<0>] __writeback_inodes_wb+0x87/0xb0
[<0>] wb_writeback+0x332/0x350
[<0>] wb_workfn+0x1bf/0x490
[<0>] process_one_work+0x209/0x4a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xfb/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
/proc/13483/stack
[<0>] call_rwsem_down_write_failed+0x13/0x20
[<0>] gfs2_log_flush+0x47/0x6c0 [gfs2]
[<0>] gfs2_logd+0xc6/0x360 [gfs2]
[<0>] kthread+0xfb/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
/proc/13484/stack
[<0>] call_rwsem_down_read_failed+0x14/0x30
[<0>] gfs2_log_reserve+0x160/0x300 [gfs2]
[<0>] gfs2_trans_begin+0xc5/0x140 [gfs2]
[<0>] gfs2_statfs_sync+0x19a/0x1e0 [gfs2]
[<0>] gfs2_quotad+0x1c2/0x2a0 [gfs2]
[<0>] kthread+0xfb/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
/proc/13495/stack
[<0>] gfs2_glock_wait+0x3e/0x80 [gfs2]
[<0>] gfs2_permission+0xc4/0x150 [gfs2]
[<0>] inode_permission+0xc2/0x190
[<0>] link_path_walk+0x2ad/0x550
[<0>] path_openat+0x89/0xbd0
[<0>] do_filp_open+0x91/0x100
[<0>] do_sys_open+0x126/0x210
[<0>] do_syscall_64+0x4e/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
/proc/13497/stack
[<0>] gfs2_glock_wait+0x3e/0x80 [gfs2]
[<0>] gfs2_permission+0xc4/0x150 [gfs2]
[<0>] inode_permission+0xc2/0x190
[<0>] link_path_walk+0x2ad/0x550
[<0>] path_openat+0x89/0xbd0
[<0>] do_filp_open+0x91/0x100
[<0>] do_sys_open+0x126/0x210
[<0>] do_syscall_64+0x4e/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
/proc/13498/stack
[<0>] gfs2_glock_wait+0x3e/0x80 [gfs2]
[<0>] gfs2_permission+0xc4/0x150 [gfs2]
[<0>] inode_permission+0xc2/0x190
[<0>] link_path_walk+0x2ad/0x550
[<0>] path_openat+0x89/0xbd0
[<0>] do_filp_open+0x91/0x100
[<0>] do_sys_open+0x126/0x210
[<0>] do_syscall_64+0x4e/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
/proc/13499/stack
[<0>] gfs2_glock_wait+0x3e/0x80 [gfs2]
[<0>] gfs2_permission+0xc4/0x150 [gfs2]
[<0>] inode_permission+0xc2/0x190
[<0>] link_path_walk+0x2ad/0x550
[<0>] path_openat+0x89/0xbd0
[<0>] do_filp_open+0x91/0x100
[<0>] do_sys_open+0x126/0x210
[<0>] do_syscall_64+0x4e/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
/proc/14083/stack
[<0>] balance_dirty_pages+0x6e0/0x79d
[<0>] balance_dirty_pages_ratelimited+0x387/0x520
[<0>] iomap_write_actor+0x10e/0x170
[<0>] iomap_apply+0xb8/0x130
[<0>] iomap_file_buffered_write+0x62/0x90
[<0>] gfs2_file_write_iter+0x2c3/0x350 [gfs2]
[<0>] aio_write+0xf8/0x1d0
[<0>] io_submit_one+0x236/0x7b0
[<0>] __x64_sys_io_submit+0xb8/0x1e0
[<0>] do_syscall_64+0x4e/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
/proc/14213/stack
[<0>] call_rwsem_down_write_failed+0x13/0x20
[<0>] gfs2_log_flush+0x47/0x6c0 [gfs2]
[<0>] inode_go_sync+0xa1/0x190 [gfs2]
[<0>] do_xmote+0xd0/0x200 [gfs2]
[<0>] glock_work_func+0x58/0x100 [gfs2]
[<0>] process_one_work+0x209/0x4a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xfb/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
/proc/14271/stack
[<0>] call_rwsem_down_read_failed+0x14/0x30
[<0>] gfs2_log_reserve+0x160/0x300 [gfs2]
[<0>] gfs2_trans_begin+0xc5/0x140 [gfs2]
[<0>] gfs2_dirty_inode+0x1d4/0x240 [gfs2]
[<0>] __mark_inode_dirty+0x1cf/0x3d0
[<0>] generic_update_time+0x6b/0xd0
[<0>] file_update_time+0xc0/0x110
[<0>] gfs2_file_write_iter+0x1d2/0x350 [gfs2]
[<0>] aio_write+0xf8/0x1d0
[<0>] io_submit_one+0x236/0x7b0
[<0>] __x64_sys_io_submit+0xb8/0x1e0
[<0>] do_syscall_64+0x4e/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
/proc/14293/stack
[<0>] call_rwsem_down_read_failed+0x14/0x30
[<0>] gfs2_log_reserve+0x160/0x300 [gfs2]
[<0>] gfs2_trans_begin+0xc5/0x140 [gfs2]
[<0>] gfs2_setattr_size+0xe3/0x260 [gfs2]
[<0>] gfs2_setattr+0x16e/0x2b0 [gfs2]
[<0>] notify_change+0x26c/0x440
[<0>] do_truncate+0x72/0xc0
[<0>] path_openat+0x70c/0xbd0
[<0>] do_filp_open+0x91/0x100
[<0>] do_sys_open+0x126/0x210
[<0>] do_syscall_64+0x4e/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
/proc/14543/stack
[<0>] call_rwsem_down_write_failed+0x13/0x20
[<0>] gfs2_log_flush+0x47/0x6c0 [gfs2]
[<0>] inode_go_sync+0xa1/0x190 [gfs2]
[<0>] do_xmote+0xd0/0x200 [gfs2]
[<0>] glock_work_func+0x58/0x100 [gfs2]
[<0>] process_one_work+0x209/0x4a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xfb/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
/proc/18090/stack
[<0>] call_rwsem_down_write_failed+0x13/0x20
[<0>] gfs2_log_flush+0x47/0x6c0 [gfs2]
[<0>] inode_go_sync+0xa1/0x190 [gfs2]
[<0>] do_xmote+0xd0/0x200 [gfs2]
[<0>] glock_work_func+0x58/0x100 [gfs2]
[<0>] process_one_work+0x209/0x4a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xfb/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
/proc/18091/stack
[<0>] call_rwsem_down_write_failed+0x13/0x20
[<0>] gfs2_log_flush+0x47/0x6c0 [gfs2]
[<0>] inode_go_sync+0xa1/0x190 [gfs2]
[<0>] do_xmote+0xd0/0x200 [gfs2]
[<0>] glock_work_func+0x58/0x100 [gfs2]
[<0>] process_one_work+0x209/0x4a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xfb/0x130
[<0>] ret_from_fork+0x35
[<0>] call_rwsem_down_write_failed+0x13/0x20
[<0>] gfs2_log_flush+0x47/0x6c0 [gfs2]
[<0>] inode_go_sync+0xa1/0x190 [gfs2]
[<0>] do_xmote+0xd0/0x200 [gfs2]
[<0>] glock_work_func+0x58/0x100 [gfs2]
[<0>] process_one_work+0x209/0x4a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xfb/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
/proc/18093/stack
[<0>] call_rwsem_down_write_failed+0x13/0x20
[<0>] gfs2_log_flush+0x47/0x6c0 [gfs2]
[<0>] inode_go_sync+0xa1/0x190 [gfs2]
[<0>] do_xmote+0xd0/0x200 [gfs2]
[<0>] glock_work_func+0x58/0x100 [gfs2]
[<0>] process_one_work+0x209/0x4a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xfb/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
Comparing the stack traces and the lockdep dumps for processes holding or acquiring sd_log_flush_lock:
PIDs 6, 12433, 13483, 13484, 14213, 14271, 14293, 14543, 18090, 18091, 18092, 18093: appear in both
PID 14083: holds sd_log_flush_lock (which everyone else is waiting for), but is stuck here:
/proc/14083/stack
[<0>] balance_dirty_pages+0x6e0/0x79d
[<0>] balance_dirty_pages_ratelimited+0x387/0x520
[<0>] iomap_write_actor+0x10e/0x170
[<0>] iomap_apply+0xb8/0x130
[<0>] iomap_file_buffered_write+0x62/0x90
[<0>] gfs2_file_write_iter+0x2c3/0x350 [gfs2]
balance_dirty_pages is presumably waiting for writeback to make progress, but the writeback worker is stuck as well; this looks like a deadlock:
Mar 4 11:30:51 localhost kernel: [38534.963177] ? __schedule+0x363/0x9b0
Mar 4 11:30:51 localhost kernel: [38534.963189] schedule+0x32/0x80
Mar 4 11:30:51 localhost kernel: [38534.963193] schedule_timeout+0x1d7/0x3d0
Mar 4 11:30:51 localhost kernel: [38534.963204] ? __next_timer_interrupt+0xd0/0xd0
Mar 4 11:30:51 localhost kernel: [38534.963229] io_schedule_timeout+0x19/0x40
Mar 4 11:30:51 localhost kernel: [38534.963235] balance_dirty_pages+0x6e0/0x79d
Mar 4 11:30:51 localhost kernel: [38534.963252] ? gfs2_trans_add_data+0xc0/0x170 [gfs2]
Mar 4 11:30:51 localhost kernel: [38534.963267] ? gfs2_dirty_inode+0x1c0/0x240 [gfs2]
/proc/12433/stack
[<0>] call_rwsem_down_write_failed+0x13/0x20
[<0>] gfs2_log_flush+0x47/0x6c0 [gfs2]
[<0>] gfs2_write_inode+0xa0/0x160 [gfs2]
[<0>] __writeback_single_inode+0x2ec/0x3e0
[<0>] writeback_sb_inodes+0x263/0x470
[<0>] __writeback_inodes_wb+0x87/0xb0
[<0>] wb_writeback+0x332/0x350
[<0>] wb_workfn+0x1bf/0x490
[<0>] process_one_work+0x209/0x4a0
[<0>] worker_thread+0x46/0x3d0
[<0>] kthread+0xfb/0x130
[<0>] ret_from_fork+0x35
I tried disabling writeback throttling, but it still got stuck:
/proc/24847/stack
[<0>] balance_dirty_pages+0x737/0x93d
[<0>] balance_dirty_pages_ratelimited+0x264/0x3c0
[<0>] fault_dirty_shared_page.isra.101+0x4a/0x90
[<0>] do_wp_page+0x449/0x510
[<0>] __handle_mm_fault+0x3c0/0xd10
[<0>] handle_mm_fault+0xfa/0x210
[<0>] __do_page_fault+0x217/0x4f0
[<0>] page_fault+0x1e/0x30
[<0>] 0xffffffffffffffff
/proc/25122/stack
[<0>] io_schedule+0x12/0x40
[<0>] wbt_wait+0x25b/0x330
[<0>] rq_qos_throttle+0x31/0x40
[<0>] blk_mq_make_request+0x103/0x500
[<0>] generic_make_request+0x19a/0x3d0
[<0>] submit_bio+0x6e/0x130
[<0>] submit_bh_wbc+0x157/0x190
[<0>] __block_write_full_page+0x157/0x420
[<0>] __writepage+0x19/0x50
[<0>] write_cache_pages+0x21f/0x470
[<0>] generic_writepages+0x51/0x80
[<0>] do_writepages+0x1a/0x70
[<0>] __filemap_fdatawrite_range+0xc5/0x100
[<0>] filemap_write_and_wait+0x2c/0x70
[<0>] invalidate_partition+0x1e/0x40
[<0>] rescan_partitions+0x4d/0x350
[<0>] __blkdev_get+0x4a8/0x4f0
[<0>] blkdev_get+0x1a7/0x300
[<0>] do_dentry_open+0x1f1/0x380
[<0>] path_openat+0x379/0x1170
[<0>] do_filp_open+0x91/0x100
[<0>] do_sys_open+0x126/0x210
[<0>] do_syscall_64+0x4e/0x100
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff
/proc/275/stack
[<0>] io_schedule+0x12/0x40
[<0>] wbt_wait+0x25b/0x330
[<0>] rq_qos_throttle+0x31/0x40
[<0>] blk_mq_make_request+0x103/0x500
[<0>] generic_make_request+0x19a/0x3d0
[<0>] submit_bio+0x6e/0x130
[<0>] submit_bh_wbc+0x157/0x190
[<0>] __block_write_full_page+0x157/0x420
[<0>] __writepage+0x19/0x50
[<0>] write_cache_pages+0x21f/0x470
[<0>] generic_writepages+0x51/0x80
[<0>] do_writepages+0x1a/0x70
[<0>] __writeback_single_inode+0x3d/0x340
[<0>] writeback_sb_inodes+0x24f/0x4b0
[<0>] __writeback_inodes_wb+0x87/0xb0
[<0>] wb_writeback+0x27c/0x310
[<0>] wb_workfn+0x1a0/0x450
[<0>] process_one_work+0x165/0x370
[<0>] worker_thread+0x49/0x3e0
[<0>] kthread+0xf8/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
/proc/13869/stack
[<0>] call_rwsem_down_write_failed+0x13/0x20
[<0>] gfs2_log_flush+0x46/0x5f0 [gfs2]
[<0>] inode_go_sync+0x9e/0x180 [gfs2]
[<0>] do_xmote+0xd1/0x200 [gfs2]
[<0>] glock_work_func+0x58/0x110 [gfs2]
[<0>] process_one_work+0x165/0x370
[<0>] rescuer_thread+0x1c0/0x310
[<0>] kthread+0xf8/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
/proc/13872/stack
[<0>] rescuer_thread+0x2ad/0x310
[<0>] kthread+0xf8/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
/proc/13873/stack
[<0>] rescuer_thread+0x2ad/0x310
[<0>] kthread+0xf8/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
/proc/13900/stack
[<0>] call_rwsem_down_write_failed+0x13/0x20
[<0>] gfs2_log_flush+0x46/0x5f0 [gfs2]
[<0>] inode_go_sync+0x9e/0x180 [gfs2]
[<0>] do_xmote+0xd1/0x200 [gfs2]
[<0>] glock_work_func+0x58/0x110 [gfs2]
[<0>] process_one_work+0x165/0x370
[<0>] worker_thread+0x49/0x3e0
[<0>] kthread+0xf8/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff
....
Best regards,
--Edwin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: debug.tar.gz
Type: application/gzip
Size: 526714 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20190306/3e418ac3/attachment.gz>
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter
2019-03-06 11:00 [Cluster-devel] GFS2 deadlock in 4.19 (iomap/writeback?) Edwin Török
@ 2019-03-13 17:13 ` Andreas Gruenbacher
2019-03-14 11:18 ` Ross Lagerwall
2019-03-15 20:58 ` Andreas Gruenbacher
2019-08-13 12:47 ` [Cluster-devel] GFS2 deadlock in 4.19 (iomap/writeback?) Andreas Gruenbacher
1 sibling, 2 replies; 9+ messages in thread
From: Andreas Gruenbacher @ 2019-03-13 17:13 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi Edwin,
On Wed, 6 Mar 2019 at 12:08, Edwin Török <edvin.torok@citrix.com> wrote:
> Hello,
>
> I've been trying to debug a GFS2 deadlock that we see in our lab quite frequently with a 4.19 kernel. With 4.4 and older kernels we were not able to reproduce this.
> See below for lockdep dumps and stacktraces.
thanks for the thorough bug report. Does the below fix work for you?
Thanks,
Andreas
--
Prevent page reclaim on the backing device of the filesystem in
gfs2_file_write_iter:
- In gfs2_file_write_iter, iomap_file_buffered_write grabs the log
flush lock (via iomap_apply -> iomap_begin -> gfs2_iomap_begin_write
-> gfs2_trans_begin) and holds it for the duration of the write.
- It then calls iomap_write_actor -> balance_dirty_pages.
- If that triggers writeback on the same filesystem, we would try
to grab the log flush lock again (via writeback_sb_inodes ->
gfs2_write_inode -> gfs2_log_flush) and deadlock.
Prevent that by not setting backing_dev_info in gfs2_file_write_iter.
Reported-by: Edwin Török <edvin.torok@citrix.com>
Fixes: 64bc06bb32ee ("gfs2: iomap buffered write support")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
fs/gfs2/file.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index a2dea5bc04276..8c7d296d58608 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -816,16 +816,13 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (ret <= 0)
goto out;
- /* We can write back this queue in page reclaim */
- current->backing_dev_info = inode_to_bdi(inode);
-
ret = file_remove_privs(file);
if (ret)
- goto out2;
+ goto out;
ret = file_update_time(file);
if (ret)
- goto out2;
+ goto out;
if (iocb->ki_flags & IOCB_DIRECT) {
struct address_space *mapping = file->f_mapping;
@@ -834,11 +831,11 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
written = gfs2_file_direct_write(iocb, from);
if (written < 0 || !iov_iter_count(from))
- goto out2;
+ goto out;
ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
if (unlikely(ret < 0))
- goto out2;
+ goto out;
buffered = ret;
/*
@@ -867,8 +864,6 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
iocb->ki_pos += ret;
}
-out2:
- current->backing_dev_info = NULL;
out:
inode_unlock(inode);
if (likely(ret > 0)) {
--
2.20.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter
2019-03-13 17:13 ` [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter Andreas Gruenbacher
@ 2019-03-14 11:18 ` Ross Lagerwall
2019-03-15 20:58 ` Andreas Gruenbacher
1 sibling, 0 replies; 9+ messages in thread
From: Ross Lagerwall @ 2019-03-14 11:18 UTC (permalink / raw)
To: cluster-devel.redhat.com
On 3/13/19 5:13 PM, Andreas Gruenbacher wrote:
> Hi Edwin,
>
> On Wed, 6 Mar 2019 at 12:08, Edwin Török <edvin.torok@citrix.com> wrote:
>> Hello,
>>
>> I've been trying to debug a GFS2 deadlock that we see in our lab quite frequently with a 4.19 kernel. With 4.4 and older kernels we were not able to reproduce this.
>> See below for lockdep dumps and stacktraces.
>
> thanks for the thorough bug report. Does the below fix work for you?
>
Hi Andreas,
I've tested the patch and it doesn't fix the issue. As far as I can see,
current->backing_dev_info is not used by any of the code called from
balance_dirty_pages_ratelimited() so I don't see how it could work.
I found a way of consistently reproducing the issue almost immediately
(tested with the latest master commit):
# cat a.py
import os

fd = os.open("f", os.O_CREAT|os.O_TRUNC|os.O_WRONLY)

for i in range(1000):
    os.mkdir("xxx" + str(i), 0777)

buf = 'x' * 4096

while True:
    count = os.write(fd, buf)
    if count <= 0:
        break

# cat b.py
import os
while True:
    os.mkdir("x", 0777)
    os.rmdir("x")
# echo 8192 > /proc/sys/vm/dirty_bytes
# cd /gfs2mnt
# (mkdir tmp1; cd tmp1; python2 ~/a.py) &
# (mkdir tmp2; cd tmp2; python2 ~/a.py) &
# (mkdir tmp3; cd tmp3; python2 ~/b.py) &
This should deadlock almost immediately. One of the processes will be
waiting in balance_dirty_pages() and holding sd_log_flush_lock and
several others will be waiting for sd_log_flush_lock.
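For anyone trying this, a throwaway helper like the one below (purely illustrative; it assumes a kernel that exposes /proc/<pid>/stack and needs to be run as root) can confirm which tasks are stuck on the flush lock versus in balance_dirty_pages:

import glob

# Print the kernel stack of every task currently stuck in gfs2_log_flush
# or balance_dirty_pages, to cross-check against the lockdep output.
for path in glob.glob("/proc/[0-9]*/stack"):
    try:
        with open(path) as f:
            stack = f.read()
    except (IOError, OSError):
        continue  # task exited or stack not readable
    if "gfs2_log_flush" in stack or "balance_dirty_pages" in stack:
        print(path)
        print(stack)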
I came up with the following patch which seems to resolve the issue by
failing to write the inode if it can't take the lock, but it seems like
a dirty workaround rather than a proper fix:
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index ee20ea42e7b5..0659560bb8c6 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -766,24 +766,29 @@ static void log_write_header(struct gfs2_sbd *sdp, u32 flags)
 }

 /**
- * gfs2_log_flush - flush incore transaction(s)
+ * __gfs2_log_flush - flush incore transaction(s)
  * @sdp: the filesystem
  * @gl: The glock structure to flush. If NULL, flush the whole incore log
  * @flags: The log header flags: GFS2_LOG_HEAD_FLUSH_* and debug flags
+ * @wait: Wait for the sd_log_flush_lock
  *
  */
-void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
+int __gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags,
+		     bool wait)
 {
 	struct gfs2_trans *tr;
 	enum gfs2_freeze_state state = atomic_read(&sdp->sd_freeze_state);

-	down_write(&sdp->sd_log_flush_lock);
+	if (wait)
+		down_write(&sdp->sd_log_flush_lock);
+	else if (!down_write_trylock(&sdp->sd_log_flush_lock))
+		return -EAGAIN;

 	/* Log might have been flushed while we waited for the flush lock */
 	if (gl && !test_bit(GLF_LFLUSH, &gl->gl_flags)) {
 		up_write(&sdp->sd_log_flush_lock);
-		return;
+		return 0;
 	}
 	trace_gfs2_log_flush(sdp, 1, flags);

@@ -857,6 +862,12 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)

 	up_write(&sdp->sd_log_flush_lock);
 	kfree(tr);
+
+	return 0;
+}
+
+void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
+{
+	__gfs2_log_flush(sdp, gl, flags, true);
 }

 /**
diff --git a/fs/gfs2/log.h b/fs/gfs2/log.h
index 20241436126d..2404167004bb 100644
--- a/fs/gfs2/log.h
+++ b/fs/gfs2/log.h
@@ -72,6 +72,8 @@ extern void gfs2_log_release(struct gfs2_sbd *sdp, unsigned int blks);
 extern int gfs2_log_reserve(struct gfs2_sbd *sdp, unsigned int blks);
 extern void gfs2_write_log_header(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd,
 				  u64 seq, u32 tail, u32 flags, int op_flags);
+extern int __gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl,
+			    u32 type, bool wait);
 extern void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl,
 			   u32 type);
 extern void gfs2_log_commit(struct gfs2_sbd *sdp, struct gfs2_trans *trans);
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index a971862b186e..369502a16ad6 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -756,10 +756,14 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
 	int ret = 0;
 	bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));

-	if (flush_all)
-		gfs2_log_flush(GFS2_SB(inode), ip->i_gl,
+	if (flush_all) {
+		ret = __gfs2_log_flush(GFS2_SB(inode), ip->i_gl,
 			       GFS2_LOG_HEAD_FLUSH_NORMAL |
-			       GFS2_LFC_WRITE_INODE);
+			       GFS2_LFC_WRITE_INODE,
+			       wbc->sync_mode == WB_SYNC_ALL);
+		if (ret)
+			return ret;
+	}

 	if (bdi->wb.dirty_exceeded)
 		gfs2_ail1_flush(sdp, wbc);
 	else
---
Regards,
--
Ross Lagerwall
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter
2019-03-13 17:13 ` [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter Andreas Gruenbacher
2019-03-14 11:18 ` Ross Lagerwall
@ 2019-03-15 20:58 ` Andreas Gruenbacher
1 sibling, 0 replies; 9+ messages in thread
From: Andreas Gruenbacher @ 2019-03-15 20:58 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi Ross,
On Thu, 14 Mar 2019 at 12:18, Ross Lagerwall <ross.lagerwall@citrix.com> wrote:
> On 3/13/19 5:13 PM, Andreas Gruenbacher wrote:
> > Hi Edwin,
> >
> > On Wed, 6 Mar 2019 at 12:08, Edwin Török <edvin.torok@citrix.com>
> > wrote:
> >> Hello,
> >>
> >> I've been trying to debug a GFS2 deadlock that we see in our lab
> >> quite frequently with a 4.19 kernel. With 4.4 and older kernels we
> >> were not able to reproduce this.
> >> See below for lockdep dumps and stacktraces.
> >
> > thanks for the thorough bug report. Does the below fix work for
> > you?
> >
> Hi Andreas,
>
> I've tested the patch and it doesn't fix the issue. As far as I can see,
> current->backing_dev_info is not used by any of the code called from
> balance_dirty_pages_ratelimited() so I don't see how it could work.
yes, I see now.
> I found a way of consistently reproducing the issue almost immediately
> (tested with the latest master commit):
>
> # cat a.py
> import os
>
> fd = os.open("f", os.O_CREAT|os.O_TRUNC|os.O_WRONLY)
>
> for i in range(1000):
>     os.mkdir("xxx" + str(i), 0777)
>
> buf = 'x' * 4096
>
> while True:
>     count = os.write(fd, buf)
>     if count <= 0:
>         break
>
> # cat b.py
> import os
> while True:
>     os.mkdir("x", 0777)
>     os.rmdir("x")
>
> # echo 8192 > /proc/sys/vm/dirty_bytes
> # cd /gfs2mnt
> # (mkdir tmp1; cd tmp1; python2 ~/a.py) &
> # (mkdir tmp2; cd tmp2; python2 ~/a.py) &
> # (mkdir tmp3; cd tmp3; python2 ~/b.py) &
>
> This should deadlock almost immediately. One of the processes will be
> waiting in balance_dirty_pages() and holding sd_log_flush_lock and
> several others will be waiting for sd_log_flush_lock.
This doesn't work for me: the python processes don't even start properly
when dirty_bytes is set so low.
> I came up with the following patch which seems to resolve the issue by
> failing to write the inode if it can't take the lock, but it seems
> like a dirty workaround rather than a proper fix:
>
> [...]
Looking at ext4_dirty_inode, it seems that we should just be able to
bail out of gfs2_write_inode and return 0 when PF_MEMALLOC is set in
current->flags.
Also, we should probably add the current->flags checks from
xfs_do_writepage to gfs2_writepage_common.
So what do you get with the below patch?
Thanks,
Andreas
---
fs/gfs2/aops.c | 7 +++++++
fs/gfs2/super.c | 4 ++++
2 files changed, 11 insertions(+)
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 05dd78f..694ff91 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -102,6 +102,13 @@ static int gfs2_writepage_common(struct page *page,
pgoff_t end_index = i_size >> PAGE_SHIFT;
unsigned offset;
+ /* (see xfs_do_writepage) */
+ if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
+ PF_MEMALLOC))
+ goto redirty;
+ if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
+ goto redirty;
+
if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
goto out;
if (current->journal_info)
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index ca71163..540535c 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -756,6 +756,10 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
int ret = 0;
bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));
+ /* (see ext4_dirty_inode) */
+ if (current->flags & PF_MEMALLOC)
+ return 0;
+
if (flush_all)
gfs2_log_flush(GFS2_SB(inode), ip->i_gl,
GFS2_LOG_HEAD_FLUSH_NORMAL |
--
1.8.3.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter
[not found] <19:84>
@ 2019-03-16 23:58 ` Mark Syms
2019-03-17 20:06 ` Andreas Gruenbacher
0 siblings, 1 reply; 9+ messages in thread
From: Mark Syms @ 2019-03-16 23:58 UTC (permalink / raw)
To: cluster-devel.redhat.com
Sadly, this doesn't help and seems to make the situation worse. Our automated tests were previously seeing about 5% failure rate and with this patch it's 20%. We still need to verify that they're all down to the same failure, but I thought it better to give some early feedback.
Surprised that the script didn't run for you but the limit on dirty bytes was just to get it to occur more frequently. You may have success with larger values.
Mark
Message: 1
Date: Fri, 15 Mar 2019 21:58:12 +0100
From: Andreas Gruenbacher <agruenba at redhat.com>
To: Ross Lagerwall <ross.lagerwall at citrix.com>
Cc: igor.druzhinin at citrix.com, Sergey Dyasli
	<sergey.dyasli at citrix.com>, Mark Syms <Mark.Syms at citrix.com>,
	cluster-devel at redhat.com
Subject: Re: [Cluster-devel] [PATCH] gfs2: Prevent writeback in
gfs2_file_write_iter
Message-ID: <20190315205812.22727-1-agruenba at redhat.com>
Content-Type: text/plain; charset=UTF-8
Hi Ross,
On Thu, 14 Mar 2019 at 12:18, Ross Lagerwall <ross.lagerwall at citrix.com> wrote:
> On 3/13/19 5:13 PM, Andreas Gruenbacher wrote:
> > Hi Edwin,
> >
> > On Wed, 6 Mar 2019 at 12:08, Edwin Török <edvin.torok at citrix.com>
> > wrote:
> >> Hello,
> >>
> >> I've been trying to debug a GFS2 deadlock that we see in our lab
> >> quite frequently with a 4.19 kernel. With 4.4 and older kernels we
> >> were not able to reproduce this.
> >> See below for lockdep dumps and stacktraces.
> >
> > thanks for the thorough bug report. Does the below fix work for
> > you?
> >
> Hi Andreas,
>
> I've tested the patch and it doesn't fix the issue. As far as I can see,
> current->backing_dev_info is not used by any of the code called from
> balance_dirty_pages_ratelimited() so I don't see how it could work.
yes, I see now.
> I found a way of consistently reproducing the issue almost immediately
> (tested with the latest master commit):
>
> # cat a.py
> import os
>
> fd = os.open("f", os.O_CREAT|os.O_TRUNC|os.O_WRONLY)
>
> for i in range(1000):
>     os.mkdir("xxx" + str(i), 0777)
>
> buf = 'x' * 4096
>
> while True:
>     count = os.write(fd, buf)
>     if count <= 0:
>         break
>
> # cat b.py
> import os
> while True:
>     os.mkdir("x", 0777)
>     os.rmdir("x")
>
> # echo 8192 > /proc/sys/vm/dirty_bytes
> # cd /gfs2mnt
> # (mkdir tmp1; cd tmp1; python2 ~/a.py) &
> # (mkdir tmp2; cd tmp2; python2 ~/a.py) &
> # (mkdir tmp3; cd tmp3; python2 ~/b.py) &
>
> This should deadlock almost immediately. One of the processes will be
> waiting in balance_dirty_pages() and holding sd_log_flush_lock and
> several others will be waiting for sd_log_flush_lock.
This doesn't work for me: the python processes don't even start properly
when dirty_bytes is set so low.
> I came up with the following patch which seems to resolve the issue by
> failing to write the inode if it can't take the lock, but it seems
> like a dirty workaround rather than a proper fix:
>
> [...]
Looking at ext4_dirty_inode, it seems that we should just be able to
bail out of gfs2_write_inode and return 0 when PF_MEMALLOC is set in
current->flags.
Also, we should probably add the current->flags checks from
xfs_do_writepage to gfs2_writepage_common.
So what do you get with the below patch?
Thanks,
Andreas
---
fs/gfs2/aops.c | 7 +++++++
fs/gfs2/super.c | 4 ++++
2 files changed, 11 insertions(+)
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 05dd78f..694ff91 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -102,6 +102,13 @@ static int gfs2_writepage_common(struct page *page,
pgoff_t end_index = i_size >> PAGE_SHIFT;
unsigned offset;
+ /* (see xfs_do_writepage) */
+ if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
+ PF_MEMALLOC))
+ goto redirty;
+ if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
+ goto redirty;
+
if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
goto out;
if (current->journal_info)
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index ca71163..540535c 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -756,6 +756,10 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
int ret = 0;
bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));
+ /* (see ext4_dirty_inode) */
+ if (current->flags & PF_MEMALLOC)
+ return 0;
+
if (flush_all)
gfs2_log_flush(GFS2_SB(inode), ip->i_gl,
GFS2_LOG_HEAD_FLUSH_NORMAL |
--
1.8.3.1
------------------------------
_______________________________________________
Cluster-devel mailing list
Cluster-devel at redhat.com
https://www.redhat.com/mailman/listinfo/cluster-devel
End of Cluster-devel Digest, Vol 154, Issue 8
*********************************************
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20190316/c07148c0/attachment.htm>
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter
2019-03-16 23:58 ` [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter Mark Syms
@ 2019-03-17 20:06 ` Andreas Gruenbacher
2019-03-18 15:10 ` Mark Syms
0 siblings, 1 reply; 9+ messages in thread
From: Andreas Gruenbacher @ 2019-03-17 20:06 UTC (permalink / raw)
To: cluster-devel.redhat.com
On Sun, 17 Mar 2019 at 00:59, Mark Syms <Mark.Syms@citrix.com> wrote:
>
> Sadly, this doesn't help and seems to make the situation worse. Our automated tests were previously seeing about 5% failure rate and with this patch it's 20%.
So the PF_MEMALLOC flag doesn't get set on that code path. In that
case, we should probably check wbc->sync_mode instead. How about the
following?
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 540535c..36de2f7 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -757,7 +757,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));
/* (see ext4_dirty_inode) */
- if (current->flags & PF_MEMALLOC)
+ if (wbc->sync_mode != WB_SYNC_ALL)
return 0;
if (flush_all)
It may help to wrap that condition in WARN_ON_ONCE for debugging.
> We still need to verify that they're all down to the same failure but I thought it better to give some early feedback.
>
> Surprised that the script didn't run for you but the limit on dirty bytes was just to get it to occur more frequently. You may have success with larger values.
I'll keep trying to reproduce.
Thanks,
Andreas
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter
2019-03-17 20:06 ` Andreas Gruenbacher
@ 2019-03-18 15:10 ` Mark Syms
2019-03-18 16:30 ` Andreas Gruenbacher
0 siblings, 1 reply; 9+ messages in thread
From: Mark Syms @ 2019-03-18 15:10 UTC (permalink / raw)
To: cluster-devel.redhat.com
-----Original Message-----
From: Andreas Gruenbacher <agruenba@redhat.com>
Sent: 17 March 2019 20:06
To: Mark Syms <Mark.Syms@citrix.com>
Cc: cluster-devel at redhat.com; Sergey Dyasli <sergey.dyasli@citrix.com>; Igor Druzhinin <igor.druzhinin@citrix.com>; Edvin Torok <edvin.torok@citrix.com>
Subject: Re: [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter
On Sun, 17 Mar 2019 at 00:59, Mark Syms <Mark.Syms@citrix.com> wrote:
>
> Sadly, this doesn't help and seems to make the situation worse. Our automated tests were previously seeing about 5% failure rate and with this patch it's 20%.
So the PF_MEMALLOC flag doesn't get set on that code path. In that case, we should probably check wbc->sync_mode instead. How about the following?
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c index 540535c..36de2f7 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -757,7 +757,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
bool flush_all = (wbc->sync_mode == WB_SYNC_ALL || gfs2_is_jdata(ip));
/* (see ext4_dirty_inode) */
- if (current->flags & PF_MEMALLOC)
+ if (wbc->sync_mode != WB_SYNC_ALL)
return 0;
if (flush_all)
It may help to wrap that condition in WARN_ON_ONCE for debugging.
[Mark Syms] That works better. We were wondering whether it might be a bit too aggressive, though, in that it skips writing the inode entirely unless we have WB_SYNC_ALL, whereas the patch Ross Lagerwall originally posted would use a trylock when we don't have WB_SYNC_ALL and then fail out. Whether this should silently return 0 or return -EAGAIN as Ross's patch does is a matter of style, I guess. What are your thoughts on this?
Thanks,
Mark.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter
2019-03-18 15:10 ` Mark Syms
@ 2019-03-18 16:30 ` Andreas Gruenbacher
0 siblings, 0 replies; 9+ messages in thread
From: Andreas Gruenbacher @ 2019-03-18 16:30 UTC (permalink / raw)
To: cluster-devel.redhat.com
On Mon, 18 Mar 2019 at 16:15, Mark Syms <Mark.Syms@citrix.com> wrote:
> [Mark Syms] That works better.
Okay, great.
> We were wondering whether it might be a bit too aggressive though in that it skips writing the inode entirely unless we have WB_SYNC_ALL whereas the patch that Ross Lagerwall posted originally would use a trylock in the case where we don't have WB_SYNC_ALL and then fail out.
Yes, that may well be true. I'll have to think about it some more;
maybe we can avoid the trylock somehow else.
> Whether or not this should just silently return 0 or -EAGAIN as Ross' patch does is a matter of style I guess. What are your thoughts on this?
Well, write_inode is allowed to skip inodes in the WB_SYNC_NONE case.
If we return an error here, that error might leak and create nasty
surprises, so we'd better not.
Thanks,
Andreas
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Cluster-devel] GFS2 deadlock in 4.19 (iomap/writeback?)
2019-03-06 11:00 [Cluster-devel] GFS2 deadlock in 4.19 (iomap/writeback?) Edwin Török
2019-03-13 17:13 ` [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter Andreas Gruenbacher
@ 2019-08-13 12:47 ` Andreas Gruenbacher
1 sibling, 0 replies; 9+ messages in thread
From: Andreas Gruenbacher @ 2019-08-13 12:47 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi Edwin and all,
On Wed, 6 Mar 2019 at 12:08, Edwin Török <edvin.torok@citrix.com> wrote:
>
> Hello,
>
> I've been trying to debug a GFS2 deadlock that we see in our lab quite frequently with a 4.19 kernel. With 4.4 and older kernels we were not able to reproduce this.
> See below for lockdep dumps and stacktraces.
> Ignoring the lockdep warnings mentioned in my previous email I think I narrowed down the problem to a deadlock between iomap and writeback:
>
> IIUC the sequence of calls leading up to this is:
> aio_write ->
> gfs2_file_write_iter ->
> iomap_file_buffered_write ->
> iomap_apply ->
> iomap_begin -> gfs2_iomap_begin_write -> gfs2_trans_begin -> down_read(&sdp->sd_log_flush_lock)
> iomap_write_actor ->
> balance_dirty_pages ->
> ... waits for writeback ...
it took us several iterations, but that deadlock and all the follow-up
issues that were popping up should be fixed now. The deadlock fix
(gfs2: Fix iomap write page reclaim deadlock) required a follow-up fix
(gfs2: Inode dirtying fix) and caused a performance regression (iomap:
don't mark the inode dirty in iomap_write_end); in addition, function
gfs2_walk_metadata from v4.18 turned out to have issues (gfs2:
gfs2_walk_metadata fix). This should all be working fine in v5.3-rc4
now.
We're working on getting Bob's recovery patch queue ready for the next
merge window, which should fix the issues you've been seeing with
iSCSI failures. The changes for that will be close to what's on the
for-next.recovery10 branch in the gfs2 repository right now. If you
can still reproduce problems with that code base, please do let us
know.
Thanks,
Andreas
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2019-08-13 12:47 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-06 11:00 [Cluster-devel] GFS2 deadlock in 4.19 (iomap/writeback?) Edwin Török
2019-03-13 17:13 ` [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter Andreas Gruenbacher
2019-03-14 11:18 ` Ross Lagerwall
2019-03-15 20:58 ` Andreas Gruenbacher
2019-08-13 12:47 ` [Cluster-devel] GFS2 deadlock in 4.19 (iomap/writeback?) Andreas Gruenbacher
[not found] <19:84>
2019-03-16 23:58 ` [Cluster-devel] [PATCH] gfs2: Prevent writeback in gfs2_file_write_iter Mark Syms
2019-03-17 20:06 ` Andreas Gruenbacher
2019-03-18 15:10 ` Mark Syms
2019-03-18 16:30 ` Andreas Gruenbacher