* [Cluster-devel] [PATCH 0/6] gfs2: Handle page faults during read and write
@ 2021-05-20 12:25 Andreas Gruenbacher
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 1/6] gfs2: Fix mmap + page fault deadlocks (part 1) Andreas Gruenbacher
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-05-20 12:25 UTC (permalink / raw)
  To: cluster-devel.redhat.com
Hello,
this is an update to my previous posting [1] on how to deal with
potential mmap + page fault deadlocks in gfs2.
What's happening is that a page fault triggers during a read or write
operation, while we're holding a glock (the cluster-wide gfs2 inode
lock), and the page fault requires another glock.  We can recognize and
handle the case when both glocks are the same, but when the page fault
requires another glock, there is a chance that taking that other glock
will deadlock.
The solution I've come up with for this is to turn locking requests into
locking attempts when we're in a potential recursive locking situation,
and to actively fault in pages and retry at the outer level when a
locking attempt fails.  Those kinds of conflicts should be relatively
rare.
Note that we need to fault in pages writable in ->read_iter, so this
patch set adds a new iov_iter_fault_in_writeable() helper hat mirrors
iov_iter_fault_in_readable().
In the initial prototype [1], I've added a restart_hack flag to struct
task_struct; this has now been moved to the lower two bits of
current->journal_info.
I've posted a new fstest [2] that triggers the self-recursion case so
that those kind of bugs will be caught early next time, with no feedback
in the last two weeks.
Those patches are currently on the gfs2 for-next branch [3].  If there
are no objections, I'll ask Linus to pull them from there.
Thanks,
Andreas
[1] [RFC] Trigger retry from fault vm operation,
https://lore.kernel.org/linux-fsdevel/20210511140113.1225981-1-agruenba at redhat.com/
[2] https://lore.kernel.org/fstests/20210520114218.1595735-1-agruenba at redhat.com/T/#u
[3] https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/log/?h=for-next
Andreas Gruenbacher (6):
  gfs2: Fix mmap + page fault deadlocks (part 1)
  iov_iter: Add iov_iter_fault_in_writeable()
  gfs2: Add wrappers for accessing journal_info
  gfs2: Encode glock holding and retry flags in journal_info
  gfs2: Add LM_FLAG_OUTER glock holder flag
  gfs2: Fix mmap + page fault deadlocks (part 2)
 fs/gfs2/aops.c      |   6 +--
 fs/gfs2/bmap.c      |  31 +++++++-------
 fs/gfs2/file.c      | 102 +++++++++++++++++++++++++++++++++++++-------
 fs/gfs2/glock.c     |  12 ++++++
 fs/gfs2/glock.h     |  13 ++++--
 fs/gfs2/incore.h    |  41 ++++++++++++++++++
 fs/gfs2/inode.c     |   2 +-
 fs/gfs2/log.c       |   4 +-
 fs/gfs2/lops.c      |   2 +-
 fs/gfs2/meta_io.c   |   6 +--
 fs/gfs2/super.c     |   2 +-
 fs/gfs2/trans.c     |  16 +++----
 include/linux/uio.h |   1 +
 lib/iov_iter.c      |  20 ++++++++-
 14 files changed, 204 insertions(+), 54 deletions(-)
-- 
2.26.3
^ permalink raw reply	[flat|nested] 12+ messages in thread
* [Cluster-devel] [PATCH 1/6] gfs2: Fix mmap + page fault deadlocks (part 1)
  2021-05-20 12:25 [Cluster-devel] [PATCH 0/6] gfs2: Handle page faults during read and write Andreas Gruenbacher
@ 2021-05-20 12:25 ` Andreas Gruenbacher
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 2/6] iov_iter: Add iov_iter_fault_in_writeable() Andreas Gruenbacher
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-05-20 12:25 UTC (permalink / raw)
  To: cluster-devel.redhat.com
When the buffer passed to a read or write system call is memory mapped to the
same file, a page fault can occur in gfs2_fault.  In that case, the task will
already be holding the inode glock, and trying to take it again will result in
a BUG in add_to_queue().  Fix that by recognizing the self-recursion case and
either skipping the lock taking (when the glock is held in a compatible way),
or fail the operation.
Likewise, a request to un-share a copy-on-write page can *probably* happen in
similar situations, so treat the locking in gfs2_page_mkwrite in the same way.
A future patch will handle this case more gracefully, along with addressing
more complex deadlock scenarios.
Reported-by: Jan Kara <jack@suse.cz>
Fixes: 20f829999c38 ("gfs2: Rework read and page fault locking")
Cc: stable at vger.kernel.org # v5.8+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 6d77743f11a4..7d88abb4629b 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -423,6 +423,7 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	struct page *page = vmf->page;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_holder *outer_gh = gfs2_glock_is_locked_by_me(ip->i_gl);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	struct gfs2_alloc_parms ap = { .aflags = 0, };
 	u64 offset = page_offset(page);
@@ -436,10 +437,18 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	sb_start_pagefault(inode->i_sb);
 
 	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
-	err = gfs2_glock_nq(&gh);
-	if (err) {
-		ret = block_page_mkwrite_return(err);
-		goto out_uninit;
+	if (likely(!outer_gh)) {
+		err = gfs2_glock_nq(&gh);
+		if (err) {
+			ret = block_page_mkwrite_return(err);
+			goto out_uninit;
+		}
+	} else {
+		if (!gfs2_holder_is_compatible(outer_gh, LM_ST_EXCLUSIVE)) {
+			/* We could try to upgrade outer_gh here. */
+			ret = VM_FAULT_SIGBUS;
+			goto out_uninit;
+		}
 	}
 
 	/* Check page index against inode size */
@@ -540,7 +549,8 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 out_quota_unlock:
 	gfs2_quota_unlock(ip);
 out_unlock:
-	gfs2_glock_dq(&gh);
+	if (likely(!outer_gh))
+		gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	if (ret == VM_FAULT_LOCKED) {
@@ -555,6 +565,7 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_holder *outer_gh = gfs2_glock_is_locked_by_me(ip->i_gl);
 	struct gfs2_holder gh;
 	vm_fault_t ret;
 	u16 state;
@@ -562,13 +573,22 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
 
 	state = (vmf->flags & FAULT_FLAG_WRITE) ? LM_ST_EXCLUSIVE : LM_ST_SHARED;
 	gfs2_holder_init(ip->i_gl, state, 0, &gh);
-	err = gfs2_glock_nq(&gh);
-	if (err) {
-		ret = block_page_mkwrite_return(err);
-		goto out_uninit;
+	if (likely(!outer_gh)) {
+		err = gfs2_glock_nq(&gh);
+		if (err) {
+			ret = block_page_mkwrite_return(err);
+			goto out_uninit;
+		}
+	} else {
+		if (!gfs2_holder_is_compatible(outer_gh, state)) {
+			/* We could try to upgrade outer_gh here. */
+			ret = VM_FAULT_SIGBUS;
+			goto out_uninit;
+		}
 	}
 	ret = filemap_fault(vmf);
-	gfs2_glock_dq(&gh);
+	if (likely(!outer_gh))
+		gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	return ret;
-- 
2.26.3
^ permalink raw reply related	[flat|nested] 12+ messages in thread
* [Cluster-devel] [PATCH 2/6] iov_iter: Add iov_iter_fault_in_writeable()
  2021-05-20 12:25 [Cluster-devel] [PATCH 0/6] gfs2: Handle page faults during read and write Andreas Gruenbacher
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 1/6] gfs2: Fix mmap + page fault deadlocks (part 1) Andreas Gruenbacher
@ 2021-05-20 12:25 ` Andreas Gruenbacher
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 3/6] gfs2: Add wrappers for accessing journal_info Andreas Gruenbacher
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-05-20 12:25 UTC (permalink / raw)
  To: cluster-devel.redhat.com
Add the equivalent of iov_iter_fault_in_readable(), but for pages that will be
written to.
While at it, fix an indentation error in iov_iter_fault_in_readable().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/uio.h |  1 +
 lib/iov_iter.c      | 20 +++++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/include/linux/uio.h b/include/linux/uio.h
index d3ec87706d75..6811eb6ac6e3 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -124,6 +124,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_writeable(struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c701b7a187f2..317c94eac907 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -480,13 +480,31 @@ int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 		iterate_iovec(i, bytes, v, iov, skip, ({
 			err = fault_in_pages_readable(v.iov_base, v.iov_len);
 			if (unlikely(err))
-			return err;
+				return err;
 		0;}))
 	}
 	return 0;
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
+int iov_iter_fault_in_writeable(struct iov_iter *i, size_t bytes)
+{
+	size_t skip = i->iov_offset;
+	const struct iovec *iov;
+	int err;
+	struct iovec v;
+
+	if (!(i->type & (ITER_BVEC|ITER_KVEC))) {
+		iterate_iovec(i, bytes, v, iov, skip, ({
+			err = fault_in_pages_writeable(v.iov_base, v.iov_len);
+			if (unlikely(err))
+				return err;
+		0;}))
+	}
+	return 0;
+}
+EXPORT_SYMBOL(iov_iter_fault_in_writeable);
+
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
-- 
2.26.3
^ permalink raw reply related	[flat|nested] 12+ messages in thread
* [Cluster-devel] [PATCH 3/6] gfs2: Add wrappers for accessing journal_info
  2021-05-20 12:25 [Cluster-devel] [PATCH 0/6] gfs2: Handle page faults during read and write Andreas Gruenbacher
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 1/6] gfs2: Fix mmap + page fault deadlocks (part 1) Andreas Gruenbacher
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 2/6] iov_iter: Add iov_iter_fault_in_writeable() Andreas Gruenbacher
@ 2021-05-20 12:25 ` Andreas Gruenbacher
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 4/6] gfs2: Encode glock holding and retry flags in journal_info Andreas Gruenbacher
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-05-20 12:25 UTC (permalink / raw)
  To: cluster-devel.redhat.com
No longer access current->journal_info directly.  We'll use that to
encode additional information in current->journal_info later.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/aops.c    |  6 +++---
 fs/gfs2/bmap.c    | 28 ++++++++++++++--------------
 fs/gfs2/incore.h  | 10 ++++++++++
 fs/gfs2/inode.c   |  2 +-
 fs/gfs2/log.c     |  4 ++--
 fs/gfs2/lops.c    |  2 +-
 fs/gfs2/meta_io.c |  6 +++---
 fs/gfs2/super.c   |  2 +-
 fs/gfs2/trans.c   | 16 ++++++++--------
 9 files changed, 43 insertions(+), 33 deletions(-)
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 23b5be3db044..50dd1771d00c 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -95,7 +95,7 @@ static int gfs2_writepage(struct page *page, struct writeback_control *wbc)
 
 	if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
 		goto out;
-	if (current->journal_info)
+	if (current_trans())
 		goto redirty;
 	return iomap_writepage(page, wbc, &wpc, &gfs2_writeback_ops);
 
@@ -182,7 +182,7 @@ static int gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc
 
 	if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
 		goto out;
-	if (PageChecked(page) || current->journal_info)
+	if (PageChecked(page) || current_trans())
 		goto out_ignore;
 	return __gfs2_jdata_writepage(page, wbc);
 
@@ -620,7 +620,7 @@ void adjust_fs_space(struct inode *inode)
  
 static int jdata_set_page_dirty(struct page *page)
 {
-	if (current->journal_info)
+	if (current_trans())
 		SetPageChecked(page);
 	return __set_page_dirty_buffers(page);
 }
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 0bcf11a9987b..2ff501c413f4 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1016,7 +1016,7 @@ static void gfs2_iomap_page_done(struct inode *inode, loff_t pos,
 				 unsigned copied, struct page *page,
 				 struct iomap *iomap)
 {
-	struct gfs2_trans *tr = current->journal_info;
+	struct gfs2_trans *tr = current_trans();
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 
@@ -1099,7 +1099,7 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 			}
 		}
 
-		tr = current->journal_info;
+		tr = current_trans();
 		if (tr->tr_num_buf_new)
 			__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 
@@ -1347,7 +1347,7 @@ int gfs2_alloc_extent(struct inode *inode, u64 lblock, u64 *dblock,
 static int gfs2_block_zero_range(struct inode *inode, loff_t from,
 				 unsigned int length)
 {
-	BUG_ON(current->journal_info);
+	BUG_ON(current_trans());
 	return iomap_zero_range(inode, from, length, NULL, &gfs2_iomap_ops);
 }
 
@@ -1386,7 +1386,7 @@ static int gfs2_journaled_truncate(struct inode *inode, u64 oldsize, u64 newsize
 		truncate_pagecache(inode, oldsize - chunk);
 		oldsize -= chunk;
 
-		tr = current->journal_info;
+		tr = current_trans();
 		if (!test_bit(TR_TOUCHED, &tr->tr_flags))
 			continue;
 
@@ -1447,7 +1447,7 @@ static int trunc_start(struct inode *inode, u64 newsize)
 
 out:
 	brelse(dibh);
-	if (current->journal_info)
+	if (current_trans())
 		gfs2_trans_end(sdp);
 	return error;
 }
@@ -1555,7 +1555,7 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, struct gfs2_holder *rd_gh,
 		   the rgrp. So we estimate. We know it can't be more than
 		   the dinode's i_blocks and we don't want to exceed the
 		   journal flush threshold, sd_log_thresh2. */
-		if (current->journal_info == NULL) {
+		if (!current_trans()) {
 			unsigned int jblocks_rqsted, revokes;
 
 			jblocks_rqsted = rgd->rd_length + RES_DINODE +
@@ -1577,7 +1577,7 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, struct gfs2_holder *rd_gh,
 			down_write(&ip->i_rw_mutex);
 		}
 		/* check if we will exceed the transaction blocks requested */
-		tr = current->journal_info;
+		tr = current_trans();
 		if (tr->tr_num_buf_new + RES_STATFS +
 		    RES_QUOTA >= atomic_read(&sdp->sd_log_thresh2)) {
 			/* We set blks_outside_rgrp to ensure the loop will
@@ -1625,7 +1625,7 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, struct gfs2_holder *rd_gh,
 	if (!ret && blks_outside_rgrp) { /* If buffer still has non-zero blocks
 					    outside the rgrp we just processed,
 					    do it all over again. */
-		if (current->journal_info) {
+		if (current_trans()) {
 			struct buffer_head *dibh;
 
 			ret = gfs2_meta_inode_buffer(ip, &dibh);
@@ -1991,7 +1991,7 @@ static int punch_hole(struct gfs2_inode *ip, u64 offset, u64 length)
 	}
 
 	if (btotal) {
-		if (current->journal_info == NULL) {
+		if (!current_trans()) {
 			ret = gfs2_trans_begin(sdp, RES_DINODE + RES_STATFS +
 					       RES_QUOTA, 0);
 			if (ret)
@@ -2011,7 +2011,7 @@ static int punch_hole(struct gfs2_inode *ip, u64 offset, u64 length)
 out:
 	if (gfs2_holder_initialized(&rd_gh))
 		gfs2_glock_dq_uninit(&rd_gh);
-	if (current->journal_info) {
+	if (current_trans()) {
 		up_write(&ip->i_rw_mutex);
 		gfs2_trans_end(sdp);
 		cond_resched();
@@ -2436,7 +2436,7 @@ static int gfs2_journaled_truncate_range(struct inode *inode, loff_t offset,
 		offset += chunk;
 		length -= chunk;
 
-		tr = current->journal_info;
+		tr = current_trans();
 		if (!test_bit(TR_TOUCHED, &tr->tr_flags))
 			continue;
 
@@ -2501,7 +2501,7 @@ int __gfs2_punch_hole(struct file *file, loff_t offset, loff_t length)
 	}
 
 	if (gfs2_is_jdata(ip)) {
-		BUG_ON(!current->journal_info);
+		BUG_ON(!current_trans());
 		gfs2_journaled_truncate_range(inode, offset, length);
 	} else
 		truncate_pagecache_range(inode, offset, offset + length - 1);
@@ -2509,14 +2509,14 @@ int __gfs2_punch_hole(struct file *file, loff_t offset, loff_t length)
 	file_update_time(file);
 	mark_inode_dirty(inode);
 
-	if (current->journal_info)
+	if (current_trans())
 		gfs2_trans_end(sdp);
 
 	if (!gfs2_is_stuffed(ip))
 		error = punch_hole(ip, offset, length);
 
 out:
-	if (current->journal_info)
+	if (current_trans())
 		gfs2_trans_end(sdp);
 	return error;
 }
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index e6f820f146cb..aa8d1a23132d 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -871,5 +871,15 @@ static inline unsigned gfs2_max_stuffed_size(const struct gfs2_inode *ip)
 	return GFS2_SB(&ip->i_inode)->sd_sb.sb_bsize - sizeof(struct gfs2_dinode);
 }
 
+static inline struct gfs2_trans *current_trans(void)
+{
+	return current->journal_info;
+}
+
+static inline void set_current_trans(struct gfs2_trans *tr)
+{
+	current->journal_info = tr;
+}
+
 #endif /* __INCORE_DOT_H__ */
 
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 6e15434b23ac..1b94cbdc00cc 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -1883,7 +1883,7 @@ static int gfs2_setattr_simple(struct inode *inode, struct iattr *attr)
 {
 	int error;
 
-	if (current->journal_info)
+	if (current_trans())
 		return __gfs2_setattr_simple(inode, attr);
 
 	error = gfs2_trans_begin(GFS2_SB(inode), RES_DINODE, 0);
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 42c15cfc0821..3ee29045ab90 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -204,7 +204,7 @@ void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct writeback_control *wbc)
 	ret = 0;
 	if (time_after(jiffies, flush_start + (HZ * 600))) {
 		fs_err(sdp, "Error: In %s for ten minutes! t=%d\n",
-		       __func__, current->journal_info ? 1 : 0);
+		       __func__, current_trans() ? 1 : 0);
 		dump_ail_list(sdp);
 		goto out;
 	}
@@ -971,7 +971,7 @@ static void empty_ail1_list(struct gfs2_sbd *sdp)
 	for (;;) {
 		if (time_after(jiffies, start + (HZ * 600))) {
 			fs_err(sdp, "Error: In %s for 10 minutes! t=%d\n",
-			       __func__, current->journal_info ? 1 : 0);
+			       __func__, current_trans() ? 1 : 0);
 			dump_ail_list(sdp);
 			return;
 		}
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 8ee05d25dfa6..9bd080e5db43 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -43,7 +43,7 @@ void gfs2_pin(struct gfs2_sbd *sdp, struct buffer_head *bh)
 {
 	struct gfs2_bufdata *bd;
 
-	BUG_ON(!current->journal_info);
+	BUG_ON(!current_trans());
 
 	clear_buffer_dirty(bh);
 	if (test_set_buffer_pinned(bh))
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index d68184ebbfdd..f5622393de63 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -294,7 +294,7 @@ int gfs2_meta_read(struct gfs2_glock *gl, u64 blkno, int flags,
 	bh = *bhp;
 	wait_on_buffer(bh);
 	if (unlikely(!buffer_uptodate(bh))) {
-		struct gfs2_trans *tr = current->journal_info;
+		struct gfs2_trans *tr = current_trans();
 		if (tr && test_bit(TR_TOUCHED, &tr->tr_flags))
 			gfs2_io_error_bh_wd(sdp, bh);
 		brelse(bh);
@@ -321,7 +321,7 @@ int gfs2_meta_wait(struct gfs2_sbd *sdp, struct buffer_head *bh)
 	wait_on_buffer(bh);
 
 	if (!buffer_uptodate(bh)) {
-		struct gfs2_trans *tr = current->journal_info;
+		struct gfs2_trans *tr = current_trans();
 		if (tr && test_bit(TR_TOUCHED, &tr->tr_flags))
 			gfs2_io_error_bh_wd(sdp, bh);
 		return -EIO;
@@ -337,7 +337,7 @@ void gfs2_remove_from_journal(struct buffer_head *bh, int meta)
 	struct address_space *mapping = bh->b_page->mapping;
 	struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
 	struct gfs2_bufdata *bd = bh->b_private;
-	struct gfs2_trans *tr = current->journal_info;
+	struct gfs2_trans *tr = current_trans();
 	int was_pinned = 0;
 
 	if (test_clear_buffer_pinned(bh)) {
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 4d4ceb0b6903..5cb823e58d01 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -557,7 +557,7 @@ static void gfs2_dirty_inode(struct inode *inode, int flags)
 	} else if (WARN_ON_ONCE(ip->i_gl->gl_state != LM_ST_EXCLUSIVE))
 		return;
 
-	if (current->journal_info == NULL) {
+	if (!current_trans()) {
 		ret = gfs2_trans_begin(sdp, RES_DINODE, 0);
 		if (ret) {
 			fs_err(sdp, "dirty_inode: gfs2_trans_begin %d\n", ret);
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index 63fec11ef2ce..7681fbb12050 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -43,8 +43,8 @@ int __gfs2_trans_begin(struct gfs2_trans *tr, struct gfs2_sbd *sdp,
 {
 	unsigned int extra_revokes;
 
-	if (current->journal_info) {
-		gfs2_print_trans(sdp, current->journal_info);
+	if (current_trans()) {
+		gfs2_print_trans(sdp, current_trans());
 		BUG();
 	}
 	BUG_ON(blocks == 0 && revokes == 0);
@@ -101,7 +101,7 @@ int __gfs2_trans_begin(struct gfs2_trans *tr, struct gfs2_sbd *sdp,
 		return -EROFS;
 	}
 
-	current->journal_info = tr;
+	set_current_trans(tr);
 
 	return 0;
 }
@@ -123,10 +123,10 @@ int gfs2_trans_begin(struct gfs2_sbd *sdp, unsigned int blocks,
 
 void gfs2_trans_end(struct gfs2_sbd *sdp)
 {
-	struct gfs2_trans *tr = current->journal_info;
+	struct gfs2_trans *tr = current_trans();
 	s64 nbuf;
 
-	current->journal_info = NULL;
+	set_current_trans(NULL);
 
 	if (!test_bit(TR_TOUCHED, &tr->tr_flags)) {
 		gfs2_log_release_revokes(sdp, tr->tr_revokes);
@@ -191,7 +191,7 @@ static struct gfs2_bufdata *gfs2_alloc_bufdata(struct gfs2_glock *gl,
  */
 void gfs2_trans_add_data(struct gfs2_glock *gl, struct buffer_head *bh)
 {
-	struct gfs2_trans *tr = current->journal_info;
+	struct gfs2_trans *tr = current_trans();
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	struct gfs2_bufdata *bd;
 
@@ -232,7 +232,7 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	struct gfs2_bufdata *bd;
 	struct gfs2_meta_header *mh;
-	struct gfs2_trans *tr = current->journal_info;
+	struct gfs2_trans *tr = current_trans();
 	enum gfs2_freeze_state state = atomic_read(&sdp->sd_freeze_state);
 
 	lock_buffer(bh);
@@ -288,7 +288,7 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
 
 void gfs2_trans_add_revoke(struct gfs2_sbd *sdp, struct gfs2_bufdata *bd)
 {
-	struct gfs2_trans *tr = current->journal_info;
+	struct gfs2_trans *tr = current_trans();
 
 	BUG_ON(!list_empty(&bd->bd_list));
 	gfs2_add_revoke(sdp, bd);
-- 
2.26.3
^ permalink raw reply related	[flat|nested] 12+ messages in thread
* [Cluster-devel] [PATCH 4/6] gfs2: Encode glock holding and retry flags in journal_info
  2021-05-20 12:25 [Cluster-devel] [PATCH 0/6] gfs2: Handle page faults during read and write Andreas Gruenbacher
                   ` (2 preceding siblings ...)
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 3/6] gfs2: Add wrappers for accessing journal_info Andreas Gruenbacher
@ 2021-05-20 12:25 ` Andreas Gruenbacher
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 5/6] gfs2: Add LM_FLAG_OUTER glock holder flag Andreas Gruenbacher
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 6/6] gfs2: Fix mmap + page fault deadlocks (part 2) Andreas Gruenbacher
  5 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-05-20 12:25 UTC (permalink / raw)
  To: cluster-devel.redhat.com
Use the lowest two bits in current->journal_info to encode when
we're holding a glock and when an operation holding a glock
needs to be retried.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/incore.h | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index aa8d1a23132d..e32433df119c 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -871,14 +871,45 @@ static inline unsigned gfs2_max_stuffed_size(const struct gfs2_inode *ip)
 	return GFS2_SB(&ip->i_inode)->sd_sb.sb_bsize - sizeof(struct gfs2_dinode);
 }
 
+/*
+ * Transactions are always memory aligned, so we use bit 0 of
+ * current->journal_info to indicate when we're holding a glock and so taking
+ * random additional glocks might deadlock, and bit 1 to indicate when such an
+ * operation needs to be retried after dropping and re-acquiring that "outer"
+ * glock.
+ */
+
 static inline struct gfs2_trans *current_trans(void)
 {
-	return current->journal_info;
+	return (void *)((long)current->journal_info & ~3);
 }
 
 static inline void set_current_trans(struct gfs2_trans *tr)
 {
-	current->journal_info = tr;
+	long flags = (long)current->journal_info & 3;
+	current->journal_info = (void *)((long)tr | flags);
+}
+
+static inline bool current_holds_glock(void)
+{
+	return (long)current->journal_info & 1;
+}
+
+static inline bool current_needs_retry(void)
+{
+	return (long)current->journal_info & 2;
+}
+
+static inline void set_current_holds_glock(bool b)
+{
+	current->journal_info =
+		(void *)(((long)current->journal_info & ~1) | b);
+}
+
+static inline void set_current_needs_retry(bool b)
+{
+	current->journal_info =
+		(void *)(((long)current->journal_info & ~2) | (b << 1));
 }
 
 #endif /* __INCORE_DOT_H__ */
-- 
2.26.3
^ permalink raw reply related	[flat|nested] 12+ messages in thread
* [Cluster-devel] [PATCH 5/6] gfs2: Add LM_FLAG_OUTER glock holder flag
  2021-05-20 12:25 [Cluster-devel] [PATCH 0/6] gfs2: Handle page faults during read and write Andreas Gruenbacher
                   ` (3 preceding siblings ...)
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 4/6] gfs2: Encode glock holding and retry flags in journal_info Andreas Gruenbacher
@ 2021-05-20 12:25 ` Andreas Gruenbacher
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 6/6] gfs2: Fix mmap + page fault deadlocks (part 2) Andreas Gruenbacher
  5 siblings, 0 replies; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-05-20 12:25 UTC (permalink / raw)
  To: cluster-devel.redhat.com
When a glock holder has the LM_FLAG_OUTER flag set, we set the
current_holds_glock() flag upon taking the lock.  With that flag set, we can
then recognize when trying to take an "inner" glock and react accordingly.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/glock.c | 12 ++++++++++++
 fs/gfs2/glock.h | 13 ++++++++++---
 2 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index d9cb261f55b0..f6cae2ee1c83 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1427,6 +1427,11 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
 	if (glock_blocked_by_withdraw(gl) && !(gh->gh_flags & LM_FLAG_NOEXP))
 		return -EIO;
 
+	if (gh->gh_flags & LM_FLAG_OUTER) {
+		BUG_ON(current_holds_glock());
+		set_current_holds_glock(true);
+	}
+
 	if (test_bit(GLF_LRU, &gl->gl_flags))
 		gfs2_glock_remove_from_lru(gl);
 
@@ -1514,6 +1519,11 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
 		__gfs2_glock_queue_work(gl, delay);
 	}
 	spin_unlock(&gl->gl_lockref.lock);
+
+	if (gh->gh_flags & LM_FLAG_OUTER) {
+		BUG_ON(!current_holds_glock());
+		set_current_holds_glock(false);
+	}
 }
 
 void gfs2_glock_dq_wait(struct gfs2_holder *gh)
@@ -2068,6 +2078,8 @@ static const char *hflags2str(char *buf, u16 flags, unsigned long iflags)
 		*p++ = 'p';
 	if (flags & LM_FLAG_NODE_SCOPE)
 		*p++ = 'n';
+	if (flags & LM_FLAG_OUTER)
+		*p++ = 'o';
 	if (flags & GL_ASYNC)
 		*p++ = 'a';
 	if (flags & GL_EXACT)
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index f0ef6fd24ba4..8b145269fb14 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -94,6 +94,12 @@ static inline bool gfs2_holder_is_compatible(struct gfs2_holder *gh, int state)
  * This holder agrees to share the lock within this node. In other words,
  * the glock is held in EX mode according to DLM, but local holders on the
  * same node can share it.
+ *
+ * LM_FLAG_OUTER
+ * Use set_current_holds_glock() to indicate when the current task is holding
+ * this "upper" glock, and current_holds_glock() to detect when the current
+ * task is trying to take another glock.  Used to prevent deadlocks involving
+ * the inode glock during page faults.
  */
 
 #define LM_FLAG_TRY		0x0001
@@ -102,9 +108,10 @@ static inline bool gfs2_holder_is_compatible(struct gfs2_holder *gh, int state)
 #define LM_FLAG_ANY		0x0008
 #define LM_FLAG_PRIORITY	0x0010
 #define LM_FLAG_NODE_SCOPE	0x0020
-#define GL_ASYNC		0x0040
-#define GL_EXACT		0x0080
-#define GL_SKIP			0x0100
+#define LM_FLAG_OUTER		0x0040
+#define GL_ASYNC		0x0080
+#define GL_EXACT		0x0100
+#define GL_SKIP			0x0200
 #define GL_NOCACHE		0x0400
   
 /*
-- 
2.26.3
^ permalink raw reply related	[flat|nested] 12+ messages in thread
* [Cluster-devel] [PATCH 6/6] gfs2: Fix mmap + page fault deadlocks (part 2)
  2021-05-20 12:25 [Cluster-devel] [PATCH 0/6] gfs2: Handle page faults during read and write Andreas Gruenbacher
                   ` (4 preceding siblings ...)
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 5/6] gfs2: Add LM_FLAG_OUTER glock holder flag Andreas Gruenbacher
@ 2021-05-20 12:25 ` Andreas Gruenbacher
  2021-05-20 13:30   ` Jan Kara
  5 siblings, 1 reply; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-05-20 12:25 UTC (permalink / raw)
  To: cluster-devel.redhat.com
Now that we handle self-recursion on the inode glock in gfs2_fault and
gfs2_page_mkwrite, we need to take care of more complex deadlock
scenarios like the following (example by Jan Kara):
Two independent processes P1, P2. Two files F1, F2, and two mappings M1,
M2 where M1 is a mapping of F1, M2 is a mapping of F2. Now P1 does DIO
to F1 with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer. They
can race like:
P1                                      P2
read()                                  read()
  gfs2_file_read_iter()                   gfs2_file_read_iter()
    gfs2_file_direct_read()                 gfs2_file_direct_read()
      locks glock of F1                       locks glock of F2
      iomap_dio_rw()                          iomap_dio_rw()
        bio_iov_iter_get_pages()                bio_iov_iter_get_pages()
          <fault in M2>                           <fault in M1>
            gfs2_fault()                            gfs2_fault()
              tries to grab glock of F2               tries to grab glock of F1
Those kinds of scenarios are much harder to reproduce than
self-recursion.
We deal with such situations by using the LM_FLAG_OUTER flag to mark
"outer" glock taking.  Then, when taking an "inner" glock, we use the
LM_FLAG_TRY flag so that locking attempts that don't immediately succeed
will be aborted.  In case of a failed locking attempt, we "unroll" to
where the "outer" glock was taken, drop the "outer" glock, and fault in
the first offending user page.  This will re-trigger the "inner" locking
attempt but without the LM_FLAG_TRY flag.  Once that has happened, we
re-acquire the "outer" glock and retry the original operation.
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/bmap.c |  3 ++-
 fs/gfs2/file.c | 62 +++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 58 insertions(+), 7 deletions(-)
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 2ff501c413f4..82e4506984e3 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -967,7 +967,8 @@ static int gfs2_write_lock(struct inode *inode)
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	int error;
 
-	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, LM_FLAG_OUTER,
+			 &ip->i_gh);
 	error = gfs2_glock_nq(&ip->i_gh);
 	if (error)
 		goto out_uninit;
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 7d88abb4629b..8b26893f8dc6 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -431,21 +431,30 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	vm_fault_t ret = VM_FAULT_LOCKED;
 	struct gfs2_holder gh;
 	unsigned int length;
+	u16 flags = 0;
 	loff_t size;
 	int err;
 
 	sb_start_pagefault(inode->i_sb);
 
-	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
+	if (current_holds_glock())
+		flags |= LM_FLAG_TRY;
+
+	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, flags, &gh);
 	if (likely(!outer_gh)) {
 		err = gfs2_glock_nq(&gh);
 		if (err) {
 			ret = block_page_mkwrite_return(err);
+			if (err == GLR_TRYFAILED) {
+				set_current_needs_retry(true);
+				ret = VM_FAULT_SIGBUS;
+			}
 			goto out_uninit;
 		}
 	} else {
 		if (!gfs2_holder_is_compatible(outer_gh, LM_ST_EXCLUSIVE)) {
 			/* We could try to upgrade outer_gh here. */
+			set_current_needs_retry(true);
 			ret = VM_FAULT_SIGBUS;
 			goto out_uninit;
 		}
@@ -568,20 +577,28 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
 	struct gfs2_holder *outer_gh = gfs2_glock_is_locked_by_me(ip->i_gl);
 	struct gfs2_holder gh;
 	vm_fault_t ret;
-	u16 state;
+	u16 state, flags = 0;
 	int err;
 
+	if (current_holds_glock())
+		flags |= LM_FLAG_TRY;
+
 	state = (vmf->flags & FAULT_FLAG_WRITE) ? LM_ST_EXCLUSIVE : LM_ST_SHARED;
-	gfs2_holder_init(ip->i_gl, state, 0, &gh);
+	gfs2_holder_init(ip->i_gl, state, flags, &gh);
 	if (likely(!outer_gh)) {
 		err = gfs2_glock_nq(&gh);
 		if (err) {
 			ret = block_page_mkwrite_return(err);
+			if (err == GLR_TRYFAILED) {
+				set_current_needs_retry(true);
+				ret = VM_FAULT_SIGBUS;
+			}
 			goto out_uninit;
 		}
 	} else {
 		if (!gfs2_holder_is_compatible(outer_gh, state)) {
 			/* We could try to upgrade outer_gh here. */
+			set_current_needs_retry(true);
 			ret = VM_FAULT_SIGBUS;
 			goto out_uninit;
 		}
@@ -807,13 +824,21 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 	if (!count)
 		return 0; /* skip atime */
 
-	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, LM_FLAG_OUTER, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
 
 	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
 	gfs2_glock_dq(gh);
+	if (unlikely(current_needs_retry())) {
+		set_current_needs_retry(false);
+		if (ret == -EFAULT) {
+			if (!iov_iter_fault_in_writeable(to, PAGE_SIZE))
+				goto retry;
+		}
+	}
 out_uninit:
 	gfs2_holder_uninit(gh);
 	return ret;
@@ -837,7 +862,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	 * unfortunately, have the option of only flushing a range like the
 	 * VFS does.
 	 */
-	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, LM_FLAG_OUTER, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
@@ -851,6 +877,13 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 		ret = 0;
 out:
 	gfs2_glock_dq(gh);
+	if (unlikely(current_needs_retry())) {
+		set_current_needs_retry(false);
+		if (ret == -EFAULT) {
+			if (!iov_iter_fault_in_readable(from, PAGE_SIZE))
+				goto retry;
+		}
+	}
 out_uninit:
 	gfs2_holder_uninit(gh);
 	return ret;
@@ -883,7 +916,8 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			return ret;
 	}
 	ip = GFS2_I(iocb->ki_filp->f_mapping->host);
-	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_OUTER, &gh);
+retry:
 	ret = gfs2_glock_nq(&gh);
 	if (ret)
 		goto out_uninit;
@@ -891,6 +925,13 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (ret > 0)
 		written += ret;
 	gfs2_glock_dq(&gh);
+	if (unlikely(current_needs_retry())) {
+		set_current_needs_retry(false);
+		if (ret == -EFAULT) {
+			if (!iov_iter_fault_in_writeable(to, PAGE_SIZE))
+				goto retry;
+		}
+	}
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	return written ? written : ret;
@@ -902,9 +943,18 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 	struct inode *inode = file_inode(file);
 	ssize_t ret;
 
+retry:
 	current->backing_dev_info = inode_to_bdi(inode);
 	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
 	current->backing_dev_info = NULL;
+	if (unlikely(current_needs_retry())) {
+		set_current_needs_retry(false);
+		if (ret == -EFAULT) {
+			if (!iov_iter_fault_in_readable(from, PAGE_SIZE))
+				goto retry;
+		}
+	}
+
 	return ret;
 }
 
-- 
2.26.3
^ permalink raw reply related	[flat|nested] 12+ messages in thread
* [Cluster-devel] [PATCH 6/6] gfs2: Fix mmap + page fault deadlocks (part 2)
  2021-05-20 12:25 ` [Cluster-devel] [PATCH 6/6] gfs2: Fix mmap + page fault deadlocks (part 2) Andreas Gruenbacher
@ 2021-05-20 13:30   ` Jan Kara
  2021-05-20 14:07     ` Andreas Gruenbacher
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2021-05-20 13:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com
On Thu 20-05-21 14:25:36, Andreas Gruenbacher wrote:
> Now that we handle self-recursion on the inode glock in gfs2_fault and
> gfs2_page_mkwrite, we need to take care of more complex deadlock
> scenarios like the following (example by Jan Kara):
> 
> Two independent processes P1, P2. Two files F1, F2, and two mappings M1,
> M2 where M1 is a mapping of F1, M2 is a mapping of F2. Now P1 does DIO
> to F1 with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer. They
> can race like:
> 
> P1                                      P2
> read()                                  read()
>   gfs2_file_read_iter()                   gfs2_file_read_iter()
>     gfs2_file_direct_read()                 gfs2_file_direct_read()
>       locks glock of F1                       locks glock of F2
>       iomap_dio_rw()                          iomap_dio_rw()
>         bio_iov_iter_get_pages()                bio_iov_iter_get_pages()
>           <fault in M2>                           <fault in M1>
>             gfs2_fault()                            gfs2_fault()
>               tries to grab glock of F2               tries to grab glock of F1
> 
> Those kinds of scenarios are much harder to reproduce than
> self-recursion.
> 
> We deal with such situations by using the LM_FLAG_OUTER flag to mark
> "outer" glock taking.  Then, when taking an "inner" glock, we use the
> LM_FLAG_TRY flag so that locking attempts that don't immediately succeed
> will be aborted.  In case of a failed locking attempt, we "unroll" to
> where the "outer" glock was taken, drop the "outer" glock, and fault in
> the first offending user page.  This will re-trigger the "inner" locking
> attempt but without the LM_FLAG_TRY flag.  Once that has happened, we
> re-acquire the "outer" glock and retry the original operation.
> 
> Reported-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
...
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 7d88abb4629b..8b26893f8dc6 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -431,21 +431,30 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
>  	vm_fault_t ret = VM_FAULT_LOCKED;
>  	struct gfs2_holder gh;
>  	unsigned int length;
> +	u16 flags = 0;
>  	loff_t size;
>  	int err;
>  
>  	sb_start_pagefault(inode->i_sb);
>  
> -	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
> +	if (current_holds_glock())
> +		flags |= LM_FLAG_TRY;
> +
> +	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, flags, &gh);
>  	if (likely(!outer_gh)) {
>  		err = gfs2_glock_nq(&gh);
>  		if (err) {
>  			ret = block_page_mkwrite_return(err);
> +			if (err == GLR_TRYFAILED) {
> +				set_current_needs_retry(true);
> +				ret = VM_FAULT_SIGBUS;
> +			}
I've checked to make sure but do_user_addr_fault() indeed calls do_sigbus()
which raises the SIGBUS signal. So if the application does not ignore
SIGBUS, your retry will be visible to the application and can cause all
sorts of interesting results... So you probably need to add a new VM_FAULT_
return code that will behave like VM_FAULT_SIGBUS except it will not raise
the signal.
Otherwise it seems to me your approach should work.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 12+ messages in thread
* [Cluster-devel] [PATCH 6/6] gfs2: Fix mmap + page fault deadlocks (part 2)
  2021-05-20 13:30   ` Jan Kara
@ 2021-05-20 14:07     ` Andreas Gruenbacher
  2021-05-21 15:23       ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-05-20 14:07 UTC (permalink / raw)
  To: cluster-devel.redhat.com
On Thu, May 20, 2021 at 3:30 PM Jan Kara <jack@suse.cz> wrote:
> On Thu 20-05-21 14:25:36, Andreas Gruenbacher wrote:
> > Now that we handle self-recursion on the inode glock in gfs2_fault and
> > gfs2_page_mkwrite, we need to take care of more complex deadlock
> > scenarios like the following (example by Jan Kara):
> >
> > Two independent processes P1, P2. Two files F1, F2, and two mappings M1,
> > M2 where M1 is a mapping of F1, M2 is a mapping of F2. Now P1 does DIO
> > to F1 with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer. They
> > can race like:
> >
> > P1                                      P2
> > read()                                  read()
> >   gfs2_file_read_iter()                   gfs2_file_read_iter()
> >     gfs2_file_direct_read()                 gfs2_file_direct_read()
> >       locks glock of F1                       locks glock of F2
> >       iomap_dio_rw()                          iomap_dio_rw()
> >         bio_iov_iter_get_pages()                bio_iov_iter_get_pages()
> >           <fault in M2>                           <fault in M1>
> >             gfs2_fault()                            gfs2_fault()
> >               tries to grab glock of F2               tries to grab glock of F1
> >
> > Those kinds of scenarios are much harder to reproduce than
> > self-recursion.
> >
> > We deal with such situations by using the LM_FLAG_OUTER flag to mark
> > "outer" glock taking.  Then, when taking an "inner" glock, we use the
> > LM_FLAG_TRY flag so that locking attempts that don't immediately succeed
> > will be aborted.  In case of a failed locking attempt, we "unroll" to
> > where the "outer" glock was taken, drop the "outer" glock, and fault in
> > the first offending user page.  This will re-trigger the "inner" locking
> > attempt but without the LM_FLAG_TRY flag.  Once that has happened, we
> > re-acquire the "outer" glock and retry the original operation.
> >
> > Reported-by: Jan Kara <jack@suse.cz>
> > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
>
> ...
>
> > diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> > index 7d88abb4629b..8b26893f8dc6 100644
> > --- a/fs/gfs2/file.c
> > +++ b/fs/gfs2/file.c
> > @@ -431,21 +431,30 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
> >       vm_fault_t ret = VM_FAULT_LOCKED;
> >       struct gfs2_holder gh;
> >       unsigned int length;
> > +     u16 flags = 0;
> >       loff_t size;
> >       int err;
> >
> >       sb_start_pagefault(inode->i_sb);
> >
> > -     gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
> > +     if (current_holds_glock())
> > +             flags |= LM_FLAG_TRY;
> > +
> > +     gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, flags, &gh);
> >       if (likely(!outer_gh)) {
> >               err = gfs2_glock_nq(&gh);
> >               if (err) {
> >                       ret = block_page_mkwrite_return(err);
> > +                     if (err == GLR_TRYFAILED) {
> > +                             set_current_needs_retry(true);
> > +                             ret = VM_FAULT_SIGBUS;
> > +                     }
>
> I've checked to make sure but do_user_addr_fault() indeed calls do_sigbus()
> which raises the SIGBUS signal. So if the application does not ignore
> SIGBUS, your retry will be visible to the application and can cause all
> sorts of interesting results...
I would have noticed that, but no SIGBUS signals were actually
delivered. So we probably end up in kernelmode_fixup_or_oops() when in
kernel mode, which just does nothing in that case.
Andy Lutomirski, you've been involved with this, could you please shed
some light?
> So you probably need to add a new VM_FAULT_
> return code that will behave like VM_FAULT_SIGBUS except it will not raise
> the signal.
A new VM_FAULT_* flag might make the code easier to read, but I don't
know if we can have one.
> Otherwise it seems to me your approach should work.
Thanks a lot,
Andreas
^ permalink raw reply	[flat|nested] 12+ messages in thread
* [Cluster-devel] [PATCH 6/6] gfs2: Fix mmap + page fault deadlocks (part 2)
  2021-05-20 14:07     ` Andreas Gruenbacher
@ 2021-05-21 15:23       ` Jan Kara
  2021-05-21 15:46         ` Andreas Gruenbacher
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2021-05-21 15:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com
On Thu 20-05-21 16:07:56, Andreas Gruenbacher wrote:
> On Thu, May 20, 2021 at 3:30 PM Jan Kara <jack@suse.cz> wrote:
> > On Thu 20-05-21 14:25:36, Andreas Gruenbacher wrote:
> > > Now that we handle self-recursion on the inode glock in gfs2_fault and
> > > gfs2_page_mkwrite, we need to take care of more complex deadlock
> > > scenarios like the following (example by Jan Kara):
> > >
> > > Two independent processes P1, P2. Two files F1, F2, and two mappings M1,
> > > M2 where M1 is a mapping of F1, M2 is a mapping of F2. Now P1 does DIO
> > > to F1 with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer. They
> > > can race like:
> > >
> > > P1                                      P2
> > > read()                                  read()
> > >   gfs2_file_read_iter()                   gfs2_file_read_iter()
> > >     gfs2_file_direct_read()                 gfs2_file_direct_read()
> > >       locks glock of F1                       locks glock of F2
> > >       iomap_dio_rw()                          iomap_dio_rw()
> > >         bio_iov_iter_get_pages()                bio_iov_iter_get_pages()
> > >           <fault in M2>                           <fault in M1>
> > >             gfs2_fault()                            gfs2_fault()
> > >               tries to grab glock of F2               tries to grab glock of F1
> > >
> > > Those kinds of scenarios are much harder to reproduce than
> > > self-recursion.
> > >
> > > We deal with such situations by using the LM_FLAG_OUTER flag to mark
> > > "outer" glock taking.  Then, when taking an "inner" glock, we use the
> > > LM_FLAG_TRY flag so that locking attempts that don't immediately succeed
> > > will be aborted.  In case of a failed locking attempt, we "unroll" to
> > > where the "outer" glock was taken, drop the "outer" glock, and fault in
> > > the first offending user page.  This will re-trigger the "inner" locking
> > > attempt but without the LM_FLAG_TRY flag.  Once that has happened, we
> > > re-acquire the "outer" glock and retry the original operation.
> > >
> > > Reported-by: Jan Kara <jack@suse.cz>
> > > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> >
> > ...
> >
> > > diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> > > index 7d88abb4629b..8b26893f8dc6 100644
> > > --- a/fs/gfs2/file.c
> > > +++ b/fs/gfs2/file.c
> > > @@ -431,21 +431,30 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
> > >       vm_fault_t ret = VM_FAULT_LOCKED;
> > >       struct gfs2_holder gh;
> > >       unsigned int length;
> > > +     u16 flags = 0;
> > >       loff_t size;
> > >       int err;
> > >
> > >       sb_start_pagefault(inode->i_sb);
> > >
> > > -     gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
> > > +     if (current_holds_glock())
> > > +             flags |= LM_FLAG_TRY;
> > > +
> > > +     gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, flags, &gh);
> > >       if (likely(!outer_gh)) {
> > >               err = gfs2_glock_nq(&gh);
> > >               if (err) {
> > >                       ret = block_page_mkwrite_return(err);
> > > +                     if (err == GLR_TRYFAILED) {
> > > +                             set_current_needs_retry(true);
> > > +                             ret = VM_FAULT_SIGBUS;
> > > +                     }
> >
> > I've checked to make sure but do_user_addr_fault() indeed calls do_sigbus()
> > which raises the SIGBUS signal. So if the application does not ignore
> > SIGBUS, your retry will be visible to the application and can cause all
> > sorts of interesting results...
> 
> I would have noticed that, but no SIGBUS signals were actually
> delivered. So we probably end up in kernelmode_fixup_or_oops() when in
> kernel mode, which just does nothing in that case.
Hum, but how would we get there? I don't think fatal_signal_pending() would
return true yet...
> > So you probably need to add a new VM_FAULT_
> > return code that will behave like VM_FAULT_SIGBUS except it will not raise
> > the signal.
> 
> A new VM_FAULT_* flag might make the code easier to read, but I don't
> know if we can have one.
Well, this is kernel-internal API and there's still plenty of space in
vm_fault_reason.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 12+ messages in thread
* [Cluster-devel] [PATCH 6/6] gfs2: Fix mmap + page fault deadlocks (part 2)
  2021-05-21 15:23       ` Jan Kara
@ 2021-05-21 15:46         ` Andreas Gruenbacher
  2021-05-24  8:39           ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Andreas Gruenbacher @ 2021-05-21 15:46 UTC (permalink / raw)
  To: cluster-devel.redhat.com
On Fri, May 21, 2021 at 5:23 PM Jan Kara <jack@suse.cz> wrote:
> On Thu 20-05-21 16:07:56, Andreas Gruenbacher wrote:
> > On Thu, May 20, 2021 at 3:30 PM Jan Kara <jack@suse.cz> wrote:
> > > On Thu 20-05-21 14:25:36, Andreas Gruenbacher wrote:
> > > > Now that we handle self-recursion on the inode glock in gfs2_fault and
> > > > gfs2_page_mkwrite, we need to take care of more complex deadlock
> > > > scenarios like the following (example by Jan Kara):
> > > >
> > > > Two independent processes P1, P2. Two files F1, F2, and two mappings M1,
> > > > M2 where M1 is a mapping of F1, M2 is a mapping of F2. Now P1 does DIO
> > > > to F1 with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer. They
> > > > can race like:
> > > >
> > > > P1                                      P2
> > > > read()                                  read()
> > > >   gfs2_file_read_iter()                   gfs2_file_read_iter()
> > > >     gfs2_file_direct_read()                 gfs2_file_direct_read()
> > > >       locks glock of F1                       locks glock of F2
> > > >       iomap_dio_rw()                          iomap_dio_rw()
> > > >         bio_iov_iter_get_pages()                bio_iov_iter_get_pages()
> > > >           <fault in M2>                           <fault in M1>
> > > >             gfs2_fault()                            gfs2_fault()
> > > >               tries to grab glock of F2               tries to grab glock of F1
> > > >
> > > > Those kinds of scenarios are much harder to reproduce than
> > > > self-recursion.
> > > >
> > > > We deal with such situations by using the LM_FLAG_OUTER flag to mark
> > > > "outer" glock taking.  Then, when taking an "inner" glock, we use the
> > > > LM_FLAG_TRY flag so that locking attempts that don't immediately succeed
> > > > will be aborted.  In case of a failed locking attempt, we "unroll" to
> > > > where the "outer" glock was taken, drop the "outer" glock, and fault in
> > > > the first offending user page.  This will re-trigger the "inner" locking
> > > > attempt but without the LM_FLAG_TRY flag.  Once that has happened, we
> > > > re-acquire the "outer" glock and retry the original operation.
> > > >
> > > > Reported-by: Jan Kara <jack@suse.cz>
> > > > Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> > >
> > > ...
> > >
> > > > diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> > > > index 7d88abb4629b..8b26893f8dc6 100644
> > > > --- a/fs/gfs2/file.c
> > > > +++ b/fs/gfs2/file.c
> > > > @@ -431,21 +431,30 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
> > > >       vm_fault_t ret = VM_FAULT_LOCKED;
> > > >       struct gfs2_holder gh;
> > > >       unsigned int length;
> > > > +     u16 flags = 0;
> > > >       loff_t size;
> > > >       int err;
> > > >
> > > >       sb_start_pagefault(inode->i_sb);
> > > >
> > > > -     gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
> > > > +     if (current_holds_glock())
> > > > +             flags |= LM_FLAG_TRY;
> > > > +
> > > > +     gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, flags, &gh);
> > > >       if (likely(!outer_gh)) {
> > > >               err = gfs2_glock_nq(&gh);
> > > >               if (err) {
> > > >                       ret = block_page_mkwrite_return(err);
> > > > +                     if (err == GLR_TRYFAILED) {
> > > > +                             set_current_needs_retry(true);
> > > > +                             ret = VM_FAULT_SIGBUS;
> > > > +                     }
> > >
> > > I've checked to make sure but do_user_addr_fault() indeed calls do_sigbus()
> > > which raises the SIGBUS signal. So if the application does not ignore
> > > SIGBUS, your retry will be visible to the application and can cause all
> > > sorts of interesting results...
> >
> > I would have noticed that, but no SIGBUS signals were actually
> > delivered. So we probably end up in kernelmode_fixup_or_oops() when in
> > kernel mode, which just does nothing in that case.
>
> Hum, but how would we get there? I don't think fatal_signal_pending() would
> return true yet...
Hmm, right ...
> > > So you probably need to add a new VM_FAULT_
> > > return code that will behave like VM_FAULT_SIGBUS except it will not raise
> > > the signal.
> >
> > A new VM_FAULT_* flag might make the code easier to read, but I don't
> > know if we can have one.
>
> Well, this is kernel-internal API and there's still plenty of space in
> vm_fault_reason.
That's in the context of the page fault. The other issue is how to
propagate that out through iov_iter_fault_in_readable ->
fault_in_pages_readable -> __get_user, for example. I don't think
there's much of a chance to get an additional error code out of
__get_user and __put_user.
Thanks,
Andreas
^ permalink raw reply	[flat|nested] 12+ messages in thread
* [Cluster-devel] [PATCH 6/6] gfs2: Fix mmap + page fault deadlocks (part 2)
  2021-05-21 15:46         ` Andreas Gruenbacher
@ 2021-05-24  8:39           ` Jan Kara
  0 siblings, 0 replies; 12+ messages in thread
From: Jan Kara @ 2021-05-24  8:39 UTC (permalink / raw)
  To: cluster-devel.redhat.com
On Fri 21-05-21 17:46:04, Andreas Gruenbacher wrote:
> On Fri, May 21, 2021 at 5:23 PM Jan Kara <jack@suse.cz> wrote:
> > On Thu 20-05-21 16:07:56, Andreas Gruenbacher wrote:
> > > > So you probably need to add a new VM_FAULT_
> > > > return code that will behave like VM_FAULT_SIGBUS except it will not raise
> > > > the signal.
> > >
> > > A new VM_FAULT_* flag might make the code easier to read, but I don't
> > > know if we can have one.
> >
> > Well, this is kernel-internal API and there's still plenty of space in
> > vm_fault_reason.
> 
> That's in the context of the page fault. The other issue is how to
> propagate that out through iov_iter_fault_in_readable ->
> fault_in_pages_readable -> __get_user, for example. I don't think
> there's much of a chance to get an additional error code out of
> __get_user and __put_user.
Yes, at that level we'd get EFAULT as in any other case. Really the only
difference of the new VM_FAULT_ error code from a case of "standard" error
and VM_FAULT_SIGBUS would be not raising the signal.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2021-05-24  8:39 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-05-20 12:25 [Cluster-devel] [PATCH 0/6] gfs2: Handle page faults during read and write Andreas Gruenbacher
2021-05-20 12:25 ` [Cluster-devel] [PATCH 1/6] gfs2: Fix mmap + page fault deadlocks (part 1) Andreas Gruenbacher
2021-05-20 12:25 ` [Cluster-devel] [PATCH 2/6] iov_iter: Add iov_iter_fault_in_writeable() Andreas Gruenbacher
2021-05-20 12:25 ` [Cluster-devel] [PATCH 3/6] gfs2: Add wrappers for accessing journal_info Andreas Gruenbacher
2021-05-20 12:25 ` [Cluster-devel] [PATCH 4/6] gfs2: Encode glock holding and retry flags in journal_info Andreas Gruenbacher
2021-05-20 12:25 ` [Cluster-devel] [PATCH 5/6] gfs2: Add LM_FLAG_OUTER glock holder flag Andreas Gruenbacher
2021-05-20 12:25 ` [Cluster-devel] [PATCH 6/6] gfs2: Fix mmap + page fault deadlocks (part 2) Andreas Gruenbacher
2021-05-20 13:30   ` Jan Kara
2021-05-20 14:07     ` Andreas Gruenbacher
2021-05-21 15:23       ` Jan Kara
2021-05-21 15:46         ` Andreas Gruenbacher
2021-05-24  8:39           ` Jan Kara
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).