public inbox for linux-fsdevel@vger.kernel.org
* [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile
@ 2024-08-19  5:20 NeilBrown
  2024-08-19  5:20 ` [PATCH 1/9] i915: remove wake_up on I915_RESET_MODESET NeilBrown
                   ` (10 more replies)
  0 siblings, 11 replies; 38+ messages in thread
From: NeilBrown @ 2024-08-19  5:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

I wasn't really sure who to send this to, and get_maintainer.pl
suggested 132 addresses, which seemed excessive.  So I've focussed on
'sched' maintainers.  I'll probably submit individual patches to
relevant maintainers/lists if I get positive feedback at this level.

This series was motivated by 

   Commit ed0172af5d6f ("SUNRPC: Fix a race to wake a sync task")

which adds smp_mb__after_atomic().  I thought "any API that requires that
sort of thing needs to be fixed".

The main patches here are 7 and 8 which revise wake_up_bit and
wake_up_var respectively.  They result in 3 interfaces:
  wake_up_{bit,var}           includes smp_mb__after_atomic()
  wake_up_{bit,var}_relaxed() doesn't have a barrier
  wake_up_{bit,var}_mb()      includes smp_mb().

I think this set of interfaces should be easier to use correctly.  They
are also now documented more clearly.

The preceding patches clean up various places where the existing
interfaces weren't used optimally.  The final patch uses
clear_and_wake_up_bit() more widely because it seems like a good idea.

I have three questions:

1/ Is my understanding of the needed barriers correct?
 i.e.:
   smp_mb__after_atomic() needed after a clear_bit() or atomic_set()
     or similar, or a change inside a locked region
   smp_mb() needed after any non-locked update
   nothing needed after test_and_clear_bit() or atomic_dec_and_test()
     or similar.
   (I realised while working on this that my previous understanding
    of the barrier requirements was wrong, so maybe it still is).
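For what it's worth, the rules in question 1 can be sketched in
userspace with C11 atomics.  This is only an illustrative analogue, not
kernel code: the function names are invented here, and waiter_queued
stands in for waitqueue_active().

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool cond_flag;      /* the bit/var being waited on */
static atomic_bool waiter_queued;  /* stands in for waitqueue_active() */

/* Non-atomic (plain store) update: a full barrier - the analogue of
 * smp_mb() - is needed between clearing the condition and checking
 * for waiters, or the waker can miss a waiter that saw the old value. */
static bool clear_and_check_waiters(void)
{
	atomic_store_explicit(&cond_flag, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() analogue */
	return atomic_load_explicit(&waiter_queued, memory_order_relaxed);
}

/* Fully ordered RMW, like test_and_clear_bit() or atomic_dec_and_test():
 * the exchange is sequentially consistent, so no extra fence is needed
 * before inspecting the waitqueue. */
static bool test_clear_and_check_waiters(void)
{
	bool was_set = atomic_exchange(&cond_flag, false);
	return was_set && atomic_load(&waiter_queued);
}
```

The point is only the fence placement: the relaxed-store path needs the
explicit fence, the seq-cst RMW path does not.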

2/ How should we handle the "flag day" where a barrier is added to
   wake_up_bit() and wake_up_var()?  Some options are:
   a/ have a big patch for the flag-day as this series does
   b/ add the barrier in a new wake_up_atomic_{bit,var} and deprecate
      wake_up_{bit,var}
   c/ don't worry about the fact that there will be an extra barrier for
      a while - just make the change to wake_up_xxx() first, then submit
      individual patches to remove barriers as needed.

3/ Who else should I ask to review this at this high level?

Thanks,
NeilBrown


 [PATCH 1/9] i915: remove wake_up on I915_RESET_MODESET.
 [PATCH 2/9] Introduce atomic_dec_and_wake_up_var().
 [PATCH 3/9] XFS: use wait_var_event() when waiting on i_pincount.
 [PATCH 4/9] Use wait_var_event() instead of I_DIO_WAKEUP
 [PATCH 5/9] Block: switch bd_prepare_to_claim to use
 [PATCH 6/9] block/pktdvd: switch congestion waiting to
 [PATCH 7/9] Improve and expand wake_up_bit() interface.
 [PATCH 8/9] Improve and extend wake_up_var() interface.
 [PATCH 9/9] Use clear_and_wake_up_bit() where appropriate.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/9] i915: remove wake_up on I915_RESET_MODESET.
  2024-08-19  5:20 [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile NeilBrown
@ 2024-08-19  5:20 ` NeilBrown
  2024-08-19  5:20 ` [PATCH 2/9] Introduce atomic_dec_and_wake_up_var() NeilBrown
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: NeilBrown @ 2024-08-19  5:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

Since Commit d59cf7bb73f3 ("drm/i915/display: Use dma_fence interfaces
instead of i915_sw_fence") no code has waited for this wake_up, so let's
remove the wake_up itself.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/gpu/drm/i915/display/intel_display_reset.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_display_reset.c b/drivers/gpu/drm/i915/display/intel_display_reset.c
index c2c347b22448..123571bf8af5 100644
--- a/drivers/gpu/drm/i915/display/intel_display_reset.c
+++ b/drivers/gpu/drm/i915/display/intel_display_reset.c
@@ -35,8 +35,6 @@ void intel_display_reset_prepare(struct drm_i915_private *dev_priv)
 
 	/* We have a modeset vs reset deadlock, defensively unbreak it. */
 	set_bit(I915_RESET_MODESET, &to_gt(dev_priv)->reset.flags);
-	smp_mb__after_atomic();
-	wake_up_bit(&to_gt(dev_priv)->reset.flags, I915_RESET_MODESET);
 
 	if (atomic_read(&dev_priv->gpu_error.pending_fb_pin)) {
 		drm_dbg_kms(&dev_priv->drm,
-- 
2.44.0



* [PATCH 2/9] Introduce atomic_dec_and_wake_up_var().
  2024-08-19  5:20 [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile NeilBrown
  2024-08-19  5:20 ` [PATCH 1/9] i915: remove wake_up on I915_RESET_MODESET NeilBrown
@ 2024-08-19  5:20 ` NeilBrown
  2024-08-20  5:47   ` kernel test robot
  2024-08-21  5:23   ` kernel test robot
  2024-08-19  5:20 ` [PATCH 3/9] XFS: use wait_var_event() when waiting on i_pincount NeilBrown
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 38+ messages in thread
From: NeilBrown @ 2024-08-19  5:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

There is a common pattern of atomic_dec_and_test() being followed by
wake_up_var().

So provide atomic_dec_and_wake_up_var() to do both.
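As a rough userspace sketch of the pattern such a helper encapsulates
(illustrative only: a pthread condition variable stands in for the var
waitqueue, and the mutex supplies the ordering that the kernel gets
from barriers):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* Decrement *var; if it reached zero, wake any waiters.  Returns true
 * if this call performed the final decrement. */
static bool dec_and_wake(atomic_int *var)
{
	if (atomic_fetch_sub(var, 1) != 1)
		return false;              /* not zero yet: no wakeup needed */
	pthread_mutex_lock(&lock);
	pthread_cond_broadcast(&cond);     /* wake_up_var() analogue */
	pthread_mutex_unlock(&lock);
	return true;
}

/* wait_var_event(var, atomic_read(var) == 0) analogue. */
static void wait_for_zero(atomic_int *var)
{
	pthread_mutex_lock(&lock);
	while (atomic_load(var) != 0)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
}
```

The waiter re-checks the counter under the mutex, so a final decrement
that races with wait_for_zero() cannot be lost.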

A future patch will change the default barrier used by wake_up_var().
Doing this first will reduce the size of that patch.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/dma-buf/st-dma-fence-chain.c          | 3 +--
 drivers/gpu/drm/display/drm_dp_aux_dev.c      | 6 ++----
 drivers/gpu/drm/i915/selftests/i915_request.c | 7 +------
 drivers/media/platform/qcom/venus/hfi.c       | 3 +--
 fs/afs/cell.c                                 | 3 +--
 fs/afs/internal.h                             | 3 +--
 fs/afs/rxrpc.c                                | 4 +---
 fs/btrfs/block-group.c                        | 6 ++----
 fs/netfs/objects.c                            | 3 +--
 fs/nfs/pagelist.c                             | 3 +--
 fs/nfs/write.c                                | 6 +-----
 fs/nfsd/nfs4callback.c                        | 4 +---
 fs/reiserfs/journal.c                         | 3 +--
 include/linux/wait_bit.h                      | 8 ++++++++
 mm/shmem.c                                    | 3 +--
 net/rxrpc/call_accept.c                       | 3 +--
 net/rxrpc/call_object.c                       | 3 +--
 net/rxrpc/conn_object.c                       | 3 +--
 net/sunrpc/xprt.c                             | 3 +--
 19 files changed, 28 insertions(+), 49 deletions(-)

diff --git a/drivers/dma-buf/st-dma-fence-chain.c b/drivers/dma-buf/st-dma-fence-chain.c
index ed4b323886e4..36466ae588e8 100644
--- a/drivers/dma-buf/st-dma-fence-chain.c
+++ b/drivers/dma-buf/st-dma-fence-chain.c
@@ -434,8 +434,7 @@ static int __find_race(void *arg)
 		cond_resched();
 	}
 
-	if (atomic_dec_and_test(&data->children))
-		wake_up_var(&data->children);
+	atomic_dec_and_wake_up_var(&data->children);
 	return err;
 }
 
diff --git a/drivers/gpu/drm/display/drm_dp_aux_dev.c b/drivers/gpu/drm/display/drm_dp_aux_dev.c
index 29555b9f03c8..bb03f1dff840 100644
--- a/drivers/gpu/drm/display/drm_dp_aux_dev.c
+++ b/drivers/gpu/drm/display/drm_dp_aux_dev.c
@@ -180,8 +180,7 @@ static ssize_t auxdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		res = pos - iocb->ki_pos;
 	iocb->ki_pos = pos;
 
-	if (atomic_dec_and_test(&aux_dev->usecount))
-		wake_up_var(&aux_dev->usecount);
+	atomic_dec_and_wake_up_var(&aux_dev->usecount);
 
 	return res;
 }
@@ -223,8 +222,7 @@ static ssize_t auxdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		res = pos - iocb->ki_pos;
 	iocb->ki_pos = pos;
 
-	if (atomic_dec_and_test(&aux_dev->usecount))
-		wake_up_var(&aux_dev->usecount);
+	atomic_dec_and_wake_up_var(&aux_dev->usecount);
 
 	return res;
 }
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index acae30a04a94..8a45f79f10a0 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -1530,12 +1530,7 @@ static void __live_parallel_engineN(struct kthread_work *work)
 
 static bool wake_all(struct drm_i915_private *i915)
 {
-	if (atomic_dec_and_test(&i915->selftest.counter)) {
-		wake_up_var(&i915->selftest.counter);
-		return true;
-	}
-
-	return false;
+	return atomic_dec_and_wake_up_var(&i915->selftest.counter);
 }
 
 static int wait_for_all(struct drm_i915_private *i915)
diff --git a/drivers/media/platform/qcom/venus/hfi.c b/drivers/media/platform/qcom/venus/hfi.c
index e00aedb41d16..4c75b827a341 100644
--- a/drivers/media/platform/qcom/venus/hfi.c
+++ b/drivers/media/platform/qcom/venus/hfi.c
@@ -255,8 +255,7 @@ void hfi_session_destroy(struct venus_inst *inst)
 
 	mutex_lock(&core->lock);
 	list_del_init(&inst->list);
-	if (atomic_dec_and_test(&core->insts_count))
-		wake_up_var(&core->insts_count);
+	atomic_dec_and_wake_up_var(&core->insts_count);
 	mutex_unlock(&core->lock);
 }
 EXPORT_SYMBOL_GPL(hfi_session_destroy);
diff --git a/fs/afs/cell.c b/fs/afs/cell.c
index caa09875f520..422bcc26becc 100644
--- a/fs/afs/cell.c
+++ b/fs/afs/cell.c
@@ -25,8 +25,7 @@ static void afs_manage_cell_work(struct work_struct *);
 
 static void afs_dec_cells_outstanding(struct afs_net *net)
 {
-	if (atomic_dec_and_test(&net->cells_outstanding))
-		wake_up_var(&net->cells_outstanding);
+	atomic_dec_and_wake_up_var(&net->cells_outstanding);
 }
 
 /*
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 6e1d3c4daf72..d19db6c00cae 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -1484,8 +1484,7 @@ static inline void afs_inc_servers_outstanding(struct afs_net *net)
 
 static inline void afs_dec_servers_outstanding(struct afs_net *net)
 {
-	if (atomic_dec_and_test(&net->servers_outstanding))
-		wake_up_var(&net->servers_outstanding);
+	atomic_dec_and_wake_up_var(&net->servers_outstanding);
 }
 
 static inline bool afs_is_probing_server(struct afs_server *server)
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index c453428f3c8b..36e61b20f937 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -195,9 +195,7 @@ void afs_put_call(struct afs_call *call)
 			       __builtin_return_address(0));
 		kfree(call);
 
-		o = atomic_dec_return(&net->nr_outstanding_calls);
-		if (o == 0)
-			wake_up_var(&net->nr_outstanding_calls);
+		atomic_dec_and_wake_up_var(&net->nr_outstanding_calls);
 	}
 }
 
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 2e49d978f504..d26e176fa531 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -352,8 +352,7 @@ struct btrfs_block_group *btrfs_inc_nocow_writers(struct btrfs_fs_info *fs_info,
  */
 void btrfs_dec_nocow_writers(struct btrfs_block_group *bg)
 {
-	if (atomic_dec_and_test(&bg->nocow_writers))
-		wake_up_var(&bg->nocow_writers);
+	atomic_dec_and_wake_up_var(&bg->nocow_writers);
 
 	/* For the lookup done by a previous call to btrfs_inc_nocow_writers(). */
 	btrfs_put_block_group(bg);
@@ -371,8 +370,7 @@ void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
 
 	bg = btrfs_lookup_block_group(fs_info, start);
 	ASSERT(bg);
-	if (atomic_dec_and_test(&bg->reservations))
-		wake_up_var(&bg->reservations);
+	atomic_dec_and_wake_up_var(&bg->reservations);
 	btrfs_put_block_group(bg);
 }
 
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index f4a642727479..39e3ab3d0042 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -145,8 +145,7 @@ static void netfs_free_request(struct work_struct *work)
 		kvfree(rreq->direct_bv);
 	}
 
-	if (atomic_dec_and_test(&ictx->io_count))
-		wake_up_var(&ictx->io_count);
+	atomic_dec_and_wake_up_var(&ictx->io_count);
 	call_rcu(&rreq->rcu, netfs_free_request_rcu);
 }
 
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 04124f226665..8ae767578cd9 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -559,8 +559,7 @@ static void nfs_clear_request(struct nfs_page *req)
 		req->wb_page = NULL;
 	}
 	if (l_ctx != NULL) {
-		if (atomic_dec_and_test(&l_ctx->io_count)) {
-			wake_up_var(&l_ctx->io_count);
+		if (atomic_dec_and_wake_up_var(&l_ctx->io_count)) {
 			ctx = l_ctx->open_context;
 			if (test_bit(NFS_CONTEXT_UNLOCK, &ctx->flags))
 				rpc_wake_up(&NFS_SERVER(d_inode(ctx->dentry))->uoc_rpcwaitq);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index d074d0ceb4f0..7d9d6ba5c71c 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1646,11 +1646,7 @@ void nfs_commit_begin(struct nfs_mds_commit_info *cinfo)
 
 bool nfs_commit_end(struct nfs_mds_commit_info *cinfo)
 {
-	if (atomic_dec_and_test(&cinfo->rpcs_out)) {
-		wake_up_var(&cinfo->rpcs_out);
-		return true;
-	}
-	return false;
+	return atomic_dec_and_wake_up_var(&cinfo->rpcs_out);
 }
 
 void nfs_commitdata_release(struct nfs_commit_data *data)
diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index d756f443fc44..cef65be34525 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -993,9 +993,7 @@ static void nfsd41_cb_inflight_begin(struct nfs4_client *clp)
 
 static void nfsd41_cb_inflight_end(struct nfs4_client *clp)
 {
-
-	if (atomic_dec_and_test(&clp->cl_cb_inflight))
-		wake_up_var(&clp->cl_cb_inflight);
+	atomic_dec_and_wake_up_var(&clp->cl_cb_inflight);
 }
 
 static void nfsd41_cb_inflight_wait_complete(struct nfs4_client *clp)
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index e477ee0ff35d..f3da240e486b 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -1059,8 +1059,7 @@ static int flush_commit_list(struct super_block *s,
 			put_bh(tbh) ;
 		}
 	}
-	if (atomic_dec_and_test(&journal->j_async_throttle))
-		wake_up_var(&journal->j_async_throttle);
+	atomic_dec_and_wake_up_var(&journal->j_async_throttle);
 
 	for (i = 0; i < (jl->j_len + 1); i++) {
 		bn = SB_ONDISK_JOURNAL_1st_BLOCK(s) +
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 7725b7579b78..178ed8bed91c 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -335,4 +335,12 @@ static inline void clear_and_wake_up_bit(int bit, void *word)
 	wake_up_bit(word, bit);
 }
 
+static inline bool atomic_dec_and_wake_up_var(atomic_t *var)
+{
+	if (!atomic_dec_and_test(var))
+		return false;
+	wake_up_var(var);
+	return true;
+}
+
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/mm/shmem.c b/mm/shmem.c
index 5a77acf6ac6a..414c69d7596f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1421,8 +1421,7 @@ int shmem_unuse(unsigned int type)
 		next = list_next_entry(info, swaplist);
 		if (!info->swapped)
 			list_del_init(&info->swaplist);
-		if (atomic_dec_and_test(&info->stop_eviction))
-			wake_up_var(&info->stop_eviction);
+		atomic_dec_and_wake_up_var(&info->stop_eviction);
 		if (error)
 			break;
 	}
diff --git a/net/rxrpc/call_accept.c b/net/rxrpc/call_accept.c
index 0f5a1d77b890..ece4137bb464 100644
--- a/net/rxrpc/call_accept.c
+++ b/net/rxrpc/call_accept.c
@@ -209,8 +209,7 @@ void rxrpc_discard_prealloc(struct rxrpc_sock *rx)
 		list_del(&conn->proc_link);
 		write_unlock(&rxnet->conn_lock);
 		kfree(conn);
-		if (atomic_dec_and_test(&rxnet->nr_conns))
-			wake_up_var(&rxnet->nr_conns);
+		atomic_dec_and_wake_up_var(&rxnet->nr_conns);
 		tail = (tail + 1) & (size - 1);
 	}
 
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index f9e983a12c14..b25f6ed4bcc0 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -661,8 +661,7 @@ static void rxrpc_rcu_free_call(struct rcu_head *rcu)
 	struct rxrpc_net *rxnet = READ_ONCE(call->rxnet);
 
 	kmem_cache_free(rxrpc_call_jar, call);
-	if (atomic_dec_and_test(&rxnet->nr_calls))
-		wake_up_var(&rxnet->nr_calls);
+	atomic_dec_and_wake_up_var(&rxnet->nr_calls);
 }
 
 /*
diff --git a/net/rxrpc/conn_object.c b/net/rxrpc/conn_object.c
index 1539d315afe7..738663251188 100644
--- a/net/rxrpc/conn_object.c
+++ b/net/rxrpc/conn_object.c
@@ -294,8 +294,7 @@ static void rxrpc_rcu_free_connection(struct rcu_head *rcu)
 			 rxrpc_conn_free);
 	kfree(conn);
 
-	if (atomic_dec_and_test(&rxnet->nr_conns))
-		wake_up_var(&rxnet->nr_conns);
+	atomic_dec_and_wake_up_var(&rxnet->nr_conns);
 }
 
 /*
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 09f245cda526..1a801a08671f 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1113,8 +1113,7 @@ void xprt_unpin_rqst(struct rpc_rqst *req)
 		atomic_dec(&req->rq_pin);
 		return;
 	}
-	if (atomic_dec_and_test(&req->rq_pin))
-		wake_up_var(&req->rq_pin);
+	atomic_dec_and_wake_up_var(&req->rq_pin);
 }
 EXPORT_SYMBOL_GPL(xprt_unpin_rqst);
 
-- 
2.44.0



* [PATCH 3/9] XFS: use wait_var_event() when waiting on i_pincount.
  2024-08-19  5:20 [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile NeilBrown
  2024-08-19  5:20 ` [PATCH 1/9] i915: remove wake_up on I915_RESET_MODESET NeilBrown
  2024-08-19  5:20 ` [PATCH 2/9] Introduce atomic_dec_and_wake_up_var() NeilBrown
@ 2024-08-19  5:20 ` NeilBrown
  2024-08-19  5:20 ` [PATCH 4/9] Use wait_var_event() instead of I_DIO_WAKEUP NeilBrown
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: NeilBrown @ 2024-08-19  5:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

__xfs_iunpin_wait() is nearly an open-coded version of wait_var_event().
The bit XFS_IPINNED in ->i_flags is never used (not set, cleared,
or tested).  Its only role is in choosing a wait_queue.

The code is more transparent if we discard that flag bit and use
wait_var_event() to wait for the pincount to reach zero, which is
signalled by wake_up_var() - or more specifically
atomic_dec_and_wake_up_var().

In order to use io_schedule(), we actually use ___wait_var_event().

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/xfs/xfs_inode.c      | 24 +++++-------------------
 fs/xfs/xfs_inode.h      |  2 --
 fs/xfs/xfs_inode_item.c |  3 +--
 3 files changed, 6 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7dc6f326936c..a855295363b5 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1899,29 +1899,15 @@ xfs_iunpin(
 
 }
 
-static void
-__xfs_iunpin_wait(
-	struct xfs_inode	*ip)
-{
-	wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_IPINNED_BIT);
-	DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_IPINNED_BIT);
-
-	xfs_iunpin(ip);
-
-	do {
-		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (xfs_ipincount(ip))
-			io_schedule();
-	} while (xfs_ipincount(ip));
-	finish_wait(wq, &wait.wq_entry);
-}
-
 void
 xfs_iunpin_wait(
 	struct xfs_inode	*ip)
 {
-	if (xfs_ipincount(ip))
-		__xfs_iunpin_wait(ip);
+	if (xfs_ipincount(ip)) {
+		xfs_iunpin(ip);
+		___wait_var_event(&ip->i_pincount, xfs_ipincount(ip) == 0,
+				  TASK_UNINTERRUPTIBLE, 0, 0, io_schedule());
+	}
 }
 
 /*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 51defdebef30..5062580034ec 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -337,8 +337,6 @@ static inline bool xfs_inode_has_bigrtalloc(struct xfs_inode *ip)
 #define XFS_ITRUNCATED		(1 << 5) /* truncated down so flush-on-close */
 #define XFS_IDIRTY_RELEASE	(1 << 6) /* dirty release already seen */
 #define XFS_IFLUSHING		(1 << 7) /* inode is being flushed */
-#define __XFS_IPINNED_BIT	8	 /* wakeup key for zero pin count */
-#define XFS_IPINNED		(1 << __XFS_IPINNED_BIT)
 #define XFS_IEOFBLOCKS		(1 << 9) /* has the preallocblocks tag set */
 #define XFS_NEED_INACTIVE	(1 << 10) /* see XFS_INACTIVATING below */
 /*
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index b509cbd191f4..7fee751c7dfd 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -712,8 +712,7 @@ xfs_inode_item_unpin(
 	trace_xfs_inode_unpin(ip, _RET_IP_);
 	ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE));
 	ASSERT(atomic_read(&ip->i_pincount) > 0);
-	if (atomic_dec_and_test(&ip->i_pincount))
-		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
+	atomic_dec_and_wake_up_var(&ip->i_pincount);
 }
 
 STATIC uint
-- 
2.44.0



* [PATCH 4/9] Use wait_var_event() instead of I_DIO_WAKEUP
  2024-08-19  5:20 [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile NeilBrown
                   ` (2 preceding siblings ...)
  2024-08-19  5:20 ` [PATCH 3/9] XFS: use wait_var_event() when waiting on i_pincount NeilBrown
@ 2024-08-19  5:20 ` NeilBrown
  2024-08-20  7:22   ` Christian Brauner
  2024-08-19  5:20 ` [PATCH 5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event() NeilBrown
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 38+ messages in thread
From: NeilBrown @ 2024-08-19  5:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

inode_dio_wait() is essentially an open-coded version of
wait_var_event().  Similarly inode_dio_wait_interruptible() is an
open-coded version of wait_var_event_interruptible().

If we switch to waiting on the var, instead of an imaginary bit, the
code is more transparent, is shorter, and we can discard I_DIO_WAKEUP.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/inode.c         | 16 ++--------------
 fs/netfs/locking.c | 19 ++-----------------
 include/linux/fs.h |  7 +------
 3 files changed, 5 insertions(+), 37 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 86670941884b..91bb2f80fa03 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2467,18 +2467,6 @@ EXPORT_SYMBOL(inode_owner_or_capable);
 /*
  * Direct i/o helper functions
  */
-static void __inode_dio_wait(struct inode *inode)
-{
-	wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP);
-	DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP);
-
-	do {
-		prepare_to_wait(wq, &q.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (atomic_read(&inode->i_dio_count))
-			schedule();
-	} while (atomic_read(&inode->i_dio_count));
-	finish_wait(wq, &q.wq_entry);
-}
 
 /**
  * inode_dio_wait - wait for outstanding DIO requests to finish
@@ -2492,8 +2480,8 @@ static void __inode_dio_wait(struct inode *inode)
  */
 void inode_dio_wait(struct inode *inode)
 {
-	if (atomic_read(&inode->i_dio_count))
-		__inode_dio_wait(inode);
+	wait_var_event(&inode->i_dio_count,
+		       atomic_read(&inode->i_dio_count) == 0);
 }
 EXPORT_SYMBOL(inode_dio_wait);
 
diff --git a/fs/netfs/locking.c b/fs/netfs/locking.c
index 75dc52a49b3a..c171a0a48ed0 100644
--- a/fs/netfs/locking.c
+++ b/fs/netfs/locking.c
@@ -21,23 +21,8 @@
  */
 static int inode_dio_wait_interruptible(struct inode *inode)
 {
-	if (!atomic_read(&inode->i_dio_count))
-		return 0;
-
-	wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP);
-	DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP);
-
-	for (;;) {
-		prepare_to_wait(wq, &q.wq_entry, TASK_INTERRUPTIBLE);
-		if (!atomic_read(&inode->i_dio_count))
-			break;
-		if (signal_pending(current))
-			break;
-		schedule();
-	}
-	finish_wait(wq, &q.wq_entry);
-
-	return atomic_read(&inode->i_dio_count) ? -ERESTARTSYS : 0;
+	return wait_var_event_interruptible(&inode->i_dio_count,
+					    !atomic_read(&inode->i_dio_count));
 }
 
 /* Call with exclusively locked inode->i_rwsem */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd34b5755c0b..fc1b68134cbf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2372,8 +2372,6 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
  *
  * I_REFERENCED		Marks the inode as recently references on the LRU list.
  *
- * I_DIO_WAKEUP		Never set.  Only used as a key for wait_on_bit().
- *
  * I_WB_SWITCH		Cgroup bdi_writeback switching in progress.  Used to
  *			synchronize competing switching instances and to tell
  *			wb stat updates to grab the i_pages lock.  See
@@ -2405,8 +2403,6 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 #define __I_SYNC		7
 #define I_SYNC			(1 << __I_SYNC)
 #define I_REFERENCED		(1 << 8)
-#define __I_DIO_WAKEUP		9
-#define I_DIO_WAKEUP		(1 << __I_DIO_WAKEUP)
 #define I_LINKABLE		(1 << 10)
 #define I_DIRTY_TIME		(1 << 11)
 #define I_WB_SWITCH		(1 << 13)
@@ -3237,8 +3233,7 @@ static inline void inode_dio_begin(struct inode *inode)
  */
 static inline void inode_dio_end(struct inode *inode)
 {
-	if (atomic_dec_and_test(&inode->i_dio_count))
-		wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
+	atomic_dec_and_wake_up_var(&inode->i_dio_count);
 }
 
 extern void inode_set_flags(struct inode *inode, unsigned int flags,
-- 
2.44.0



* [PATCH 5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event()
  2024-08-19  5:20 [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile NeilBrown
                   ` (3 preceding siblings ...)
  2024-08-19  5:20 ` [PATCH 4/9] Use wait_var_event() instead of I_DIO_WAKEUP NeilBrown
@ 2024-08-19  5:20 ` NeilBrown
  2024-08-20  4:18   ` Dave Chinner
  2024-08-21  7:10   ` kernel test robot
  2024-08-19  5:20 ` [PATCH 6/9] block/pktdvd: switch congestion waiting to ___wait_var_event() NeilBrown
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 38+ messages in thread
From: NeilBrown @ 2024-08-19  5:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

bd_prepare_to_claim() currently uses a bit waitqueue with a matching
wake_up_bit() in bd_clear_claiming().  However, it is really waiting on
a "var", not a "bit".

So change to wake_up_var(), and use ___wait_var_event() for the waiting.
Using the triple-underscore version allows us to drop the mutex across
the schedule() call.

Add a missing memory barrier before the wake_up_var() call.
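Roughly, the resulting claim protocol behaves like this userspace
sketch (names are illustrative, not kernel code: a mutex plays
bdev_lock and a condition variable plays the bd_claiming waitqueue, so
no explicit barrier is needed in this analogue):

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t bdev_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  claim_done = PTHREAD_COND_INITIALIZER;
static void *bd_holder, *bd_claiming;

/* Can @holder claim the device?  0 if yes, -EBUSY otherwise. */
static int may_claim(void *holder)
{
	if (bd_holder && bd_holder != holder)
		return -EBUSY;
	return 0;
}

static int prepare_to_claim(void *holder)
{
	int err;

	pthread_mutex_lock(&bdev_lock);
	/* Wait until we definitively fail or no claim is in flight.
	 * pthread_cond_wait() drops the lock while sleeping, like the
	 * unlock/schedule/lock arguments to ___wait_var_event(). */
	while ((err = may_claim(holder)) == 0 && bd_claiming)
		pthread_cond_wait(&claim_done, &bdev_lock);
	if (!err)
		bd_claiming = holder;	/* yay, all mine */
	pthread_mutex_unlock(&bdev_lock);
	return err;
}

static void clear_claiming(void)
{
	pthread_mutex_lock(&bdev_lock);
	bd_claiming = NULL;
	pthread_cond_broadcast(&claim_done);	/* wake_up_var() analogue */
	pthread_mutex_unlock(&bdev_lock);
}
```

The loop exits as soon as the claim either definitely fails or can
definitely proceed, mirroring the err/!bd_claiming condition in the
patch.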

Signed-off-by: NeilBrown <neilb@suse.de>
---
 block/bdev.c | 49 ++++++++++++++++++++-----------------------------
 1 file changed, 20 insertions(+), 29 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index c5507b6f63b8..d804c91c651b 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -487,10 +487,10 @@ long nr_blockdev_pages(void)
  * Test whether @bdev can be claimed by @holder.
  *
  * RETURNS:
- * %true if @bdev can be claimed, %false otherwise.
+ * %0 if @bdev can be claimed, %-EBUSY otherwise.
  */
-static bool bd_may_claim(struct block_device *bdev, void *holder,
-		const struct blk_holder_ops *hops)
+static int bd_may_claim(struct block_device *bdev, void *holder,
+			const struct blk_holder_ops *hops)
 {
 	struct block_device *whole = bdev_whole(bdev);
 
@@ -503,9 +503,9 @@ static bool bd_may_claim(struct block_device *bdev, void *holder,
 		if (bdev->bd_holder == holder) {
 			if (WARN_ON_ONCE(bdev->bd_holder_ops != hops))
-				return false;
+				return -EBUSY;
-			return true;
+			return 0;
 		}
-		return false;
+		return -EBUSY;
 	}
 
 	/*
@@ -514,8 +514,8 @@ static bool bd_may_claim(struct block_device *bdev, void *holder,
 	 */
 	if (whole != bdev &&
 	    whole->bd_holder && whole->bd_holder != bd_may_claim)
-		return false;
-	return true;
+		return -EBUSY;
+	return 0;
 }
 
 /**
@@ -535,33 +535,23 @@ int bd_prepare_to_claim(struct block_device *bdev, void *holder,
 		const struct blk_holder_ops *hops)
 {
 	struct block_device *whole = bdev_whole(bdev);
+	int err = 0;
 
 	if (WARN_ON_ONCE(!holder))
 		return -EINVAL;
-retry:
-	mutex_lock(&bdev_lock);
-	/* if someone else claimed, fail */
-	if (!bd_may_claim(bdev, holder, hops)) {
-		mutex_unlock(&bdev_lock);
-		return -EBUSY;
-	}
-
-	/* if claiming is already in progress, wait for it to finish */
-	if (whole->bd_claiming) {
-		wait_queue_head_t *wq = bit_waitqueue(&whole->bd_claiming, 0);
-		DEFINE_WAIT(wait);
 
-		prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
-		mutex_unlock(&bdev_lock);
-		schedule();
-		finish_wait(wq, &wait);
-		goto retry;
-	}
+	mutex_lock(&bdev_lock);
+	___wait_var_event(&whole->bd_claiming,
+			  (err = bd_may_claim(bdev, holder, hops)) != 0 || !whole->bd_claiming,
+			  TASK_UNINTERRUPTIBLE, 0, 0,
+			  mutex_unlock(&bdev_lock); schedule(); mutex_lock(&bdev_lock));
 
-	/* yay, all mine */
-	whole->bd_claiming = holder;
+	/* if someone else claimed, fail */
+	if (!err)
+		/* yay, all mine */
+		whole->bd_claiming = holder;
 	mutex_unlock(&bdev_lock);
-	return 0;
+	return err;
 }
 EXPORT_SYMBOL_GPL(bd_prepare_to_claim); /* only for the loop driver */
 
@@ -571,7 +561,8 @@ static void bd_clear_claiming(struct block_device *whole, void *holder)
 	/* tell others that we're done */
 	BUG_ON(whole->bd_claiming != holder);
 	whole->bd_claiming = NULL;
-	wake_up_bit(&whole->bd_claiming, 0);
+	smp_mb();
+	wake_up_var(&whole->bd_claiming);
 }
 
 /**
-- 
2.44.0



* [PATCH 6/9] block/pktdvd: switch congestion waiting to ___wait_var_event()
  2024-08-19  5:20 [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile NeilBrown
                   ` (4 preceding siblings ...)
  2024-08-19  5:20 ` [PATCH 5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event() NeilBrown
@ 2024-08-19  5:20 ` NeilBrown
  2024-08-19  5:20 ` [PATCH 7/9] Improve and expand wake_up_bit() interface NeilBrown
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: NeilBrown @ 2024-08-19  5:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

Rather than having an open-coded wait event loop, use
___wait_var_event().  This fixes a bug, as the existing loop doesn't
call finish_wait().

Also add missing memory barrier before wake_up_var().
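The unlock/schedule/relock arguments passed to ___wait_var_event() here
are the manual equivalent of what pthread_cond_wait() does atomically
in userspace.  A sketch of the same throttling logic (field and
function names are invented for illustration):

```c
#include <pthread.h>
#include <stdbool.h>

struct dev {
	pthread_mutex_t lock;
	pthread_cond_t  decongested;
	int  queue_size;
	int  congestion_on;
	int  congestion_off;
	bool congested;
};

/* Submission path: once the queue reaches the high-water mark, sleep
 * until it drains to the low-water mark.  pthread_cond_wait() drops
 * d->lock while sleeping and retakes it on wakeup - the dance the
 * ___wait_var_event() arguments open-code in the patch. */
static void submit_throttled(struct dev *d)
{
	pthread_mutex_lock(&d->lock);
	if (d->congestion_on > 0 && d->queue_size >= d->congestion_on) {
		while (d->queue_size > d->congestion_off) {
			d->congested = true;
			pthread_cond_wait(&d->decongested, &d->lock);
		}
	}
	d->queue_size++;
	pthread_mutex_unlock(&d->lock);
}

/* Completion path: wake throttled submitters once below the mark. */
static void complete_one(struct dev *d)
{
	pthread_mutex_lock(&d->lock);
	d->queue_size--;
	if (d->congested && d->queue_size <= d->congestion_off) {
		d->congested = false;
		pthread_cond_broadcast(&d->decongested);  /* wake_up_var() analogue */
	}
	pthread_mutex_unlock(&d->lock);
}
```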

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/block/pktcdvd.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 7cece5884b9c..273fbe05d80f 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1210,6 +1210,7 @@ static int pkt_handle_queue(struct pktcdvd_device *pd)
 	if (pd->congested &&
 	    pd->bio_queue_size <= pd->write_congestion_off) {
 		pd->congested = false;
+		smp_mb();
 		wake_up_var(&pd->congested);
 	}
 	spin_unlock(&pd->lock);
@@ -2383,20 +2384,16 @@ static void pkt_make_request_write(struct bio *bio)
 	spin_lock(&pd->lock);
 	if (pd->write_congestion_on > 0
 	    && pd->bio_queue_size >= pd->write_congestion_on) {
-		struct wait_bit_queue_entry wqe;
 
-		init_wait_var_entry(&wqe, &pd->congested, 0);
-		for (;;) {
-			prepare_to_wait_event(__var_waitqueue(&pd->congested),
-					      &wqe.wq_entry,
-					      TASK_UNINTERRUPTIBLE);
-			if (pd->bio_queue_size <= pd->write_congestion_off)
-				break;
-			pd->congested = true;
-			spin_unlock(&pd->lock);
-			schedule();
-			spin_lock(&pd->lock);
-		}
+		___wait_var_event(&pd->congested,
+				  pd->bio_queue_size <= pd->write_congestion_off,
+				  TASK_UNINTERRUPTIBLE, 0, 0,
+				  ({ pd->congested = true;
+				    spin_unlock(&pd->lock);
+				    schedule();
+				    spin_lock(&pd->lock);
+				  })
+		);
 	}
 	spin_unlock(&pd->lock);
 
-- 
2.44.0



* [PATCH 7/9] Improve and expand wake_up_bit() interface.
  2024-08-19  5:20 [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile NeilBrown
                   ` (5 preceding siblings ...)
  2024-08-19  5:20 ` [PATCH 6/9] block/pktdvd: switch congestion waiting to ___wait_var_event() NeilBrown
@ 2024-08-19  5:20 ` NeilBrown
  2024-08-19  5:20 ` [PATCH 8/9] Improve and extend wake_up_var() interface NeilBrown
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: NeilBrown @ 2024-08-19  5:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

wake_up_bit() is a fragile interface.  It requires a preceding barrier
in most cases, but this barrier is often forgotten by developers.

There are three cases:
1/ in the common case the relevant bit has been cleared by an atomic
   operation such as clear_bit().  In this case smp_mb__after_atomic()
   is needed.  This patch changes wake_up_bit() to include this barrier
   so that in the common case it works correctly.

2/ If the bit has been cleared by a non-atomic operation such as a
   simple bit mask and assignment, then a full barrier - smp_mb() -
   is needed.  A new interface wake_up_bit_mb() is provided which
   provides this barrier and documents the expected use case.

3/ If a fully ordered operation such as test_and_clear_bit() is used to
   clear the bit then no barrier is needed.  A new interface
   wake_up_bit_relaxed() is added which provides for this possibility.
   It is exactly the old wake_up_bit() function.

This patch also removes all explicit barriers from before wake_up_bit(),
changing some into wake_up_bit_mb() as required.

It also changes some to wake_up_bit_relaxed() and removes the
barrier in cases where it was never needed.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/bluetooth/btintel.h                 |  2 +-
 drivers/bluetooth/btmtk.c                   |  9 ++--
 drivers/bluetooth/btmtksdio.c               |  8 ++--
 drivers/bluetooth/btmtkuart.c               |  8 ++--
 drivers/bluetooth/hci_intel.c               |  8 ++--
 drivers/bluetooth/hci_mrvl.c                |  2 -
 drivers/md/dm-bufio.c                       |  4 --
 drivers/md/dm-snap.c                        |  1 -
 drivers/md/dm-zoned-metadata.c              |  2 -
 drivers/md/dm-zoned-reclaim.c               |  1 -
 drivers/md/dm.c                             |  1 -
 drivers/media/usb/dvb-usb-v2/dvb_usb_core.c |  3 --
 fs/btrfs/extent_io.c                        |  2 -
 fs/buffer.c                                 |  1 -
 fs/ceph/dir.c                               |  2 +-
 fs/ceph/file.c                              |  4 +-
 fs/dcache.c                                 |  3 +-
 fs/ext4/fast_commit.c                       |  6 +--
 fs/fs-writeback.c                           |  4 +-
 fs/gfs2/glock.c                             |  2 -
 fs/gfs2/lock_dlm.c                          |  2 -
 fs/gfs2/recovery.c                          |  1 -
 fs/gfs2/sys.c                               |  1 -
 fs/gfs2/util.c                              |  1 -
 fs/inode.c                                  |  8 ++--
 fs/jbd2/commit.c                            |  7 +---
 fs/nfs/inode.c                              |  1 -
 fs/nfs/pagelist.c                           |  2 -
 fs/nfs/pnfs.c                               |  5 +--
 fs/nfsd/nfs4recover.c                       |  1 -
 fs/orangefs/file.c                          |  1 -
 fs/smb/client/inode.c                       |  1 -
 fs/smb/server/oplock.c                      |  2 -
 include/linux/wait_bit.h                    | 46 +++++++++++++++++++--
 kernel/sched/wait_bit.c                     | 13 +++---
 kernel/signal.c                             |  3 +-
 mm/ksm.c                                    |  1 -
 net/bluetooth/hci_event.c                   |  5 +--
 net/sunrpc/sched.c                          |  1 -
 security/keys/key.c                         |  4 +-
 40 files changed, 80 insertions(+), 99 deletions(-)

diff --git a/drivers/bluetooth/btintel.h b/drivers/bluetooth/btintel.h
index aa70e4c27416..1271ded14f9a 100644
--- a/drivers/bluetooth/btintel.h
+++ b/drivers/bluetooth/btintel.h
@@ -202,7 +202,7 @@ struct btintel_data {
 #define btintel_wake_up_flag(hdev, nr)					\
 	do {								\
 		struct btintel_data *intel = hci_get_priv((hdev));	\
-		wake_up_bit(intel->flags, (nr));			\
+		wake_up_bit_relaxed(intel->flags, (nr));		\
 	} while (0)
 
 #define btintel_get_flag(hdev)						\
diff --git a/drivers/bluetooth/btmtk.c b/drivers/bluetooth/btmtk.c
index 2b7c80043aa2..3bbb566bb1c0 100644
--- a/drivers/bluetooth/btmtk.c
+++ b/drivers/bluetooth/btmtk.c
@@ -482,12 +482,9 @@ static void btmtk_usb_wmt_recv(struct urb *urb)
 		}
 
 		if (test_and_clear_bit(BTMTK_TX_WAIT_VND_EVT,
-				       &data->flags)) {
-			/* Barrier to sync with other CPUs */
-			smp_mb__after_atomic();
-			wake_up_bit(&data->flags,
-				    BTMTK_TX_WAIT_VND_EVT);
-		}
+				       &data->flags))
+			wake_up_bit_relaxed(&data->flags,
+					    BTMTK_TX_WAIT_VND_EVT);
 		kfree(urb->setup_packet);
 		return;
 	} else if (urb->status == -ENOENT) {
diff --git a/drivers/bluetooth/btmtksdio.c b/drivers/bluetooth/btmtksdio.c
index 39d6898497a4..3fd84ae746c4 100644
--- a/drivers/bluetooth/btmtksdio.c
+++ b/drivers/bluetooth/btmtksdio.c
@@ -401,11 +401,9 @@ static int btmtksdio_recv_event(struct hci_dev *hdev, struct sk_buff *skb)
 
 	if (evt == HCI_EV_WMT) {
 		if (test_and_clear_bit(BTMTKSDIO_TX_WAIT_VND_EVT,
-				       &bdev->tx_state)) {
-			/* Barrier to sync with other CPUs */
-			smp_mb__after_atomic();
-			wake_up_bit(&bdev->tx_state, BTMTKSDIO_TX_WAIT_VND_EVT);
-		}
+				       &bdev->tx_state))
+			wake_up_bit_relaxed(&bdev->tx_state,
+					    BTMTKSDIO_TX_WAIT_VND_EVT);
 	}
 
 	return 0;
diff --git a/drivers/bluetooth/btmtkuart.c b/drivers/bluetooth/btmtkuart.c
index aa87c3e78871..b02b56d7d1b0 100644
--- a/drivers/bluetooth/btmtkuart.c
+++ b/drivers/bluetooth/btmtkuart.c
@@ -211,11 +211,9 @@ static int btmtkuart_recv_event(struct hci_dev *hdev, struct sk_buff *skb)
 
 	if (hdr->evt == HCI_EV_WMT) {
 		if (test_and_clear_bit(BTMTKUART_TX_WAIT_VND_EVT,
-				       &bdev->tx_state)) {
-			/* Barrier to sync with other CPUs */
-			smp_mb__after_atomic();
-			wake_up_bit(&bdev->tx_state, BTMTKUART_TX_WAIT_VND_EVT);
-		}
+				       &bdev->tx_state))
+			wake_up_bit_relaxed(&bdev->tx_state,
+					    BTMTKUART_TX_WAIT_VND_EVT);
 	}
 
 	return 0;
diff --git a/drivers/bluetooth/hci_intel.c b/drivers/bluetooth/hci_intel.c
index 999ccd5bb4f2..06ab03df9a7a 100644
--- a/drivers/bluetooth/hci_intel.c
+++ b/drivers/bluetooth/hci_intel.c
@@ -887,7 +887,7 @@ static int intel_recv_event(struct hci_dev *hdev, struct sk_buff *skb)
 
 		if (test_and_clear_bit(STATE_DOWNLOADING, &intel->flags) &&
 		    test_bit(STATE_FIRMWARE_LOADED, &intel->flags))
-			wake_up_bit(&intel->flags, STATE_DOWNLOADING);
+			wake_up_bit_relaxed(&intel->flags, STATE_DOWNLOADING);
 
 	/* When switching to the operational firmware the device
 	 * sends a vendor specific event indicating that the bootup
@@ -896,7 +896,7 @@ static int intel_recv_event(struct hci_dev *hdev, struct sk_buff *skb)
 	} else if (skb->len == 9 && hdr->evt == 0xff && hdr->plen == 0x07 &&
 		   skb->data[2] == 0x02) {
 		if (test_and_clear_bit(STATE_BOOTING, &intel->flags))
-			wake_up_bit(&intel->flags, STATE_BOOTING);
+			wake_up_bit_relaxed(&intel->flags, STATE_BOOTING);
 	}
 recv:
 	return hci_recv_frame(hdev, skb);
@@ -934,12 +934,12 @@ static int intel_recv_lpm(struct hci_dev *hdev, struct sk_buff *skb)
 	case LPM_OP_SUSPEND_ACK:
 		set_bit(STATE_SUSPENDED, &intel->flags);
 		if (test_and_clear_bit(STATE_LPM_TRANSACTION, &intel->flags))
-			wake_up_bit(&intel->flags, STATE_LPM_TRANSACTION);
+			wake_up_bit_relaxed(&intel->flags, STATE_LPM_TRANSACTION);
 		break;
 	case LPM_OP_RESUME_ACK:
 		clear_bit(STATE_SUSPENDED, &intel->flags);
 		if (test_and_clear_bit(STATE_LPM_TRANSACTION, &intel->flags))
-			wake_up_bit(&intel->flags, STATE_LPM_TRANSACTION);
+			wake_up_bit_relaxed(&intel->flags, STATE_LPM_TRANSACTION);
 		break;
 	default:
 		bt_dev_err(hdev, "Unknown LPM opcode (%02x)", lpm->opcode);
diff --git a/drivers/bluetooth/hci_mrvl.c b/drivers/bluetooth/hci_mrvl.c
index e08222395772..5486cf78a99b 100644
--- a/drivers/bluetooth/hci_mrvl.c
+++ b/drivers/bluetooth/hci_mrvl.c
@@ -184,7 +184,6 @@ static int mrvl_recv_fw_req(struct hci_dev *hdev, struct sk_buff *skb)
 	mrvl->tx_len = le16_to_cpu(pkt->lhs);
 
 	clear_bit(STATE_FW_REQ_PENDING, &mrvl->flags);
-	smp_mb__after_atomic();
 	wake_up_bit(&mrvl->flags, STATE_FW_REQ_PENDING);
 
 done:
@@ -219,7 +218,6 @@ static int mrvl_recv_chip_ver(struct hci_dev *hdev, struct sk_buff *skb)
 	bt_dev_info(hdev, "Controller id = %x, rev = %x", mrvl->id, mrvl->rev);
 
 	clear_bit(STATE_CHIP_VER_PENDING, &mrvl->flags);
-	smp_mb__after_atomic();
 	wake_up_bit(&mrvl->flags, STATE_CHIP_VER_PENDING);
 
 done:
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 098bf526136c..14b4d7cabbd6 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -1432,8 +1432,6 @@ static void write_endio(struct dm_buffer *b, blk_status_t status)
 
 	smp_mb__before_atomic();
 	clear_bit(B_WRITING, &b->state);
-	smp_mb__after_atomic();
-
 	wake_up_bit(&b->state, B_WRITING);
 }
 
@@ -1843,8 +1841,6 @@ static void read_endio(struct dm_buffer *b, blk_status_t status)
 
 	smp_mb__before_atomic();
 	clear_bit(B_READING, &b->state);
-	smp_mb__after_atomic();
-
 	wake_up_bit(&b->state, B_READING);
 }
 
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index f40c18da4000..1549ab975021 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -921,7 +921,6 @@ static int init_hash_tables(struct dm_snapshot *s)
 static void merge_shutdown(struct dm_snapshot *s)
 {
 	clear_bit_unlock(RUNNING_MERGE, &s->state_bits);
-	smp_mb__after_atomic();
 	wake_up_bit(&s->state_bits, RUNNING_MERGE);
 }
 
diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c
index 8156881a31de..7ea225ce418f 100644
--- a/drivers/md/dm-zoned-metadata.c
+++ b/drivers/md/dm-zoned-metadata.c
@@ -525,7 +525,6 @@ static void dmz_mblock_bio_end_io(struct bio *bio)
 		flag = DMZ_META_READING;
 
 	clear_bit_unlock(flag, &mblk->state);
-	smp_mb__after_atomic();
 	wake_up_bit(&mblk->state, flag);
 
 	bio_put(bio);
@@ -1917,7 +1916,6 @@ void dmz_unlock_zone_reclaim(struct dm_zone *zone)
 	WARN_ON(!dmz_in_reclaim(zone));
 
 	clear_bit_unlock(DMZ_RECLAIM, &zone->flags);
-	smp_mb__after_atomic();
 	wake_up_bit(&zone->flags, DMZ_RECLAIM);
 }
 
diff --git a/drivers/md/dm-zoned-reclaim.c b/drivers/md/dm-zoned-reclaim.c
index d58db9a27e6c..9a7dadbb8eb7 100644
--- a/drivers/md/dm-zoned-reclaim.c
+++ b/drivers/md/dm-zoned-reclaim.c
@@ -107,7 +107,6 @@ static void dmz_reclaim_kcopy_end(int read_err, unsigned long write_err,
 		zrc->kc_err = 0;
 
 	clear_bit_unlock(DMZ_RECLAIM_KCOPY, &zrc->flags);
-	smp_mb__after_atomic();
 	wake_up_bit(&zrc->flags, DMZ_RECLAIM_KCOPY);
 }
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 97fab2087df8..3fd19d478f62 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -3154,7 +3154,6 @@ static void __dm_internal_resume(struct mapped_device *md)
 	}
 done:
 	clear_bit(DMF_SUSPENDED_INTERNALLY, &md->flags);
-	smp_mb__after_atomic();
 	wake_up_bit(&md->flags, DMF_SUSPENDED_INTERNALLY);
 }
 
diff --git a/drivers/media/usb/dvb-usb-v2/dvb_usb_core.c b/drivers/media/usb/dvb-usb-v2/dvb_usb_core.c
index f1c79f351ec8..7df6b0791ebd 100644
--- a/drivers/media/usb/dvb-usb-v2/dvb_usb_core.c
+++ b/drivers/media/usb/dvb-usb-v2/dvb_usb_core.c
@@ -376,7 +376,6 @@ static int dvb_usb_stop_feed(struct dvb_demux_feed *dvbdmxfeed)
 
 	/* clear 'streaming' status bit */
 	clear_bit(ADAP_STREAMING, &adap->state_bits);
-	smp_mb__after_atomic();
 	wake_up_bit(&adap->state_bits, ADAP_STREAMING);
 skip_feed_stop:
 
@@ -581,7 +580,6 @@ static int dvb_usb_fe_init(struct dvb_frontend *fe)
 err:
 	if (!adap->suspend_resume_active) {
 		clear_bit(ADAP_INIT, &adap->state_bits);
-		smp_mb__after_atomic();
 		wake_up_bit(&adap->state_bits, ADAP_INIT);
 	}
 
@@ -621,7 +619,6 @@ static int dvb_usb_fe_sleep(struct dvb_frontend *fe)
 	if (!adap->suspend_resume_active) {
 		adap->active_fe = -1;
 		clear_bit(ADAP_SLEEP, &adap->state_bits);
-		smp_mb__after_atomic();
 		wake_up_bit(&adap->state_bits, ADAP_SLEEP);
 	}
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index aa7f8148cd0d..0b9cb4c87adf 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1762,7 +1762,6 @@ static void end_bbio_meta_write(struct btrfs_bio *bbio)
 	}
 
 	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
-	smp_mb__after_atomic();
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
 
 	bio_put(&bbio->bio);
@@ -3506,7 +3505,6 @@ void set_extent_buffer_uptodate(struct extent_buffer *eb)
 static void clear_extent_buffer_reading(struct extent_buffer *eb)
 {
 	clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
-	smp_mb__after_atomic();
 	wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
 }
 
diff --git a/fs/buffer.c b/fs/buffer.c
index e55ad471c530..2932618c88e4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -75,7 +75,6 @@ EXPORT_SYMBOL(__lock_buffer);
 void unlock_buffer(struct buffer_head *bh)
 {
 	clear_bit_unlock(BH_Lock, &bh->b_state);
-	smp_mb__after_atomic();
 	wake_up_bit(&bh->b_state, BH_Lock);
 }
 EXPORT_SYMBOL(unlock_buffer);
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 18c72b305858..20d02142b059 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1254,7 +1254,7 @@ static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
 
 	spin_lock(&dentry->d_lock);
 	di->flags &= ~CEPH_DENTRY_ASYNC_UNLINK;
-	wake_up_bit(&di->flags, CEPH_DENTRY_ASYNC_UNLINK_BIT);
+	wake_up_bit_mb(&di->flags, CEPH_DENTRY_ASYNC_UNLINK_BIT);
 	spin_unlock(&dentry->d_lock);
 
 	synchronize_rcu();
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 4b8d59ebda00..d779780abea5 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -581,7 +581,7 @@ static void wake_async_create_waiters(struct inode *inode,
 	spin_lock(&ci->i_ceph_lock);
 	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
 		ci->i_ceph_flags &= ~CEPH_I_ASYNC_CREATE;
-		wake_up_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT);
+		wake_up_bit_mb(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT);
 
 		if (ci->i_ceph_flags & CEPH_I_ASYNC_CHECK_CAPS) {
 			ci->i_ceph_flags &= ~CEPH_I_ASYNC_CHECK_CAPS;
@@ -766,7 +766,7 @@ static int ceph_finish_async_create(struct inode *dir, struct inode *inode,
 
 	spin_lock(&dentry->d_lock);
 	di->flags &= ~CEPH_DENTRY_ASYNC_CREATE;
-	wake_up_bit(&di->flags, CEPH_DENTRY_ASYNC_CREATE_BIT);
+	wake_up_bit_mb(&di->flags, CEPH_DENTRY_ASYNC_CREATE_BIT);
 	spin_unlock(&dentry->d_lock);
 
 	return ret;
diff --git a/fs/dcache.c b/fs/dcache.c
index 3d8daaecb6d1..8f63de904d7c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1908,8 +1908,7 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode)
 	__d_instantiate(entry, inode);
 	WARN_ON(!(inode->i_state & I_NEW));
 	inode->i_state &= ~I_NEW & ~I_CREATING;
-	smp_mb();
-	wake_up_bit(&inode->i_state, __I_NEW);
+	wake_up_bit_mb(&inode->i_state, __I_NEW);
 	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(d_instantiate_new);
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 3926a05eceee..03d94f5fc8c8 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -1290,12 +1290,10 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 				       EXT4_STATE_FC_COMMITTING);
 		if (tid_geq(tid, iter->i_sync_tid))
 			ext4_fc_reset_inode(&iter->vfs_inode);
-		/* Make sure EXT4_STATE_FC_COMMITTING bit is clear */
-		smp_mb();
 #if (BITS_PER_LONG < 64)
-		wake_up_bit(&iter->i_state_flags, EXT4_STATE_FC_COMMITTING);
+		wake_up_bit_mb(&iter->i_state_flags, EXT4_STATE_FC_COMMITTING);
 #else
-		wake_up_bit(&iter->i_flags, EXT4_STATE_FC_COMMITTING);
+		wake_up_bit_mb(&iter->i_flags, EXT4_STATE_FC_COMMITTING);
 #endif
 	}
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b865a3fa52f3..0f6646d67a73 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1384,9 +1384,7 @@ static void inode_sync_complete(struct inode *inode)
 	inode->i_state &= ~I_SYNC;
 	/* If inode is clean an unused, put it into LRU now... */
 	inode_add_lru(inode);
-	/* Waiters must see I_SYNC cleared before being woken up */
-	smp_mb();
-	wake_up_bit(&inode->i_state, __I_SYNC);
+	wake_up_bit_mb(&inode->i_state, __I_SYNC);
 }
 
 static bool inode_dirtied_after(struct inode *inode, unsigned long t)
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 12a769077ea0..e1afe9aa7c2a 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -380,7 +380,6 @@ static inline bool may_grant(struct gfs2_glock *gl,
 static void gfs2_holder_wake(struct gfs2_holder *gh)
 {
 	clear_bit(HIF_WAIT, &gh->gh_iflags);
-	smp_mb__after_atomic();
 	wake_up_bit(&gh->gh_iflags, HIF_WAIT);
 	if (gh->gh_flags & GL_ASYNC) {
 		struct gfs2_sbd *sdp = gh->gh_gl->gl_name.ln_sbd;
@@ -576,7 +575,6 @@ static void gfs2_demote_wake(struct gfs2_glock *gl)
 {
 	gl->gl_demote_state = LM_ST_EXCLUSIVE;
 	clear_bit(GLF_DEMOTE, &gl->gl_flags);
-	smp_mb__after_atomic();
 	wake_up_bit(&gl->gl_flags, GLF_DEMOTE);
 }
 
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index fa5134df985f..921b26b96192 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1224,7 +1224,6 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
 		queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);
 
 	clear_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
-	smp_mb__after_atomic();
 	wake_up_bit(&ls->ls_recover_flags, DFL_DLM_RECOVERY);
 	spin_unlock(&ls->ls_recover_spin);
 }
@@ -1366,7 +1365,6 @@ static int gdlm_mount(struct gfs2_sbd *sdp, const char *table)
 
 	ls->ls_first = !!test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
 	clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
-	smp_mb__after_atomic();
 	wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
 	return 0;
 
diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index f4fe7039f725..e70cf003d524 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -558,7 +558,6 @@ void gfs2_recover_func(struct work_struct *work)
 	gfs2_recovery_done(sdp, jd->jd_jid, LM_RD_GAVEUP);
 done:
 	clear_bit(JDF_RECOVERY, &jd->jd_flags);
-	smp_mb__after_atomic();
 	wake_up_bit(&jd->jd_flags, JDF_RECOVERY);
 }
 
diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index ecc699f8d9fc..738337e35724 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -587,7 +587,6 @@ static ssize_t jid_store(struct gfs2_sbd *sdp, const char *buf, size_t len)
 		rv = jid = -EINVAL;
 	sdp->sd_lockstruct.ls_jid = jid;
 	clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
-	smp_mb__after_atomic();
 	wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
 out:
 	spin_unlock(&sdp->sd_jindex_spin);
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 13be8d1d228b..2818f3891498 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -353,7 +353,6 @@ int gfs2_withdraw(struct gfs2_sbd *sdp)
 		fs_err(sdp, "File system withdrawn\n");
 		dump_stack();
 		clear_bit(SDF_WITHDRAW_IN_PROG, &sdp->sd_flags);
-		smp_mb__after_atomic();
 		wake_up_bit(&sdp->sd_flags, SDF_WITHDRAW_IN_PROG);
 	}
 
diff --git a/fs/inode.c b/fs/inode.c
index 91bb2f80fa03..f5f35b701196 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -687,7 +687,7 @@ static void evict(struct inode *inode)
 	 * used as an indicator whether blocking on it is safe.
 	 */
 	spin_lock(&inode->i_lock);
-	wake_up_bit(&inode->i_state, __I_NEW);
+	wake_up_bit_relaxed(&inode->i_state, __I_NEW);
 	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
 	spin_unlock(&inode->i_lock);
 
@@ -1095,8 +1095,7 @@ void unlock_new_inode(struct inode *inode)
 	spin_lock(&inode->i_lock);
 	WARN_ON(!(inode->i_state & I_NEW));
 	inode->i_state &= ~I_NEW & ~I_CREATING;
-	smp_mb();
-	wake_up_bit(&inode->i_state, __I_NEW);
+	wake_up_bit_mb(&inode->i_state, __I_NEW);
 	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(unlock_new_inode);
@@ -1107,8 +1106,7 @@ void discard_new_inode(struct inode *inode)
 	spin_lock(&inode->i_lock);
 	WARN_ON(!(inode->i_state & I_NEW));
 	inode->i_state &= ~I_NEW;
-	smp_mb();
-	wake_up_bit(&inode->i_state, __I_NEW);
+	wake_up_bit_mb(&inode->i_state, __I_NEW);
 	spin_unlock(&inode->i_lock);
 	iput(inode);
 }
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 4305a1ac808a..55ba3f62fbe3 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -40,7 +40,6 @@ static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 		clear_buffer_uptodate(bh);
 	if (orig_bh) {
 		clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
-		smp_mb__after_atomic();
 		wake_up_bit(&orig_bh->b_state, BH_Shadow);
 	}
 	unlock_buffer(bh);
@@ -230,8 +229,7 @@ static int journal_submit_data_buffers(journal_t *journal,
 		spin_lock(&journal->j_list_lock);
 		J_ASSERT(jinode->i_transaction == commit_transaction);
 		jinode->i_flags &= ~JI_COMMIT_RUNNING;
-		smp_mb();
-		wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
+		wake_up_bit_mb(&jinode->i_flags, __JI_COMMIT_RUNNING);
 	}
 	spin_unlock(&journal->j_list_lock);
 	return ret;
@@ -273,8 +271,7 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 		cond_resched();
 		spin_lock(&journal->j_list_lock);
 		jinode->i_flags &= ~JI_COMMIT_RUNNING;
-		smp_mb();
-		wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
+		wake_up_bit_mb(&jinode->i_flags, __JI_COMMIT_RUNNING);
 	}
 
 	/* Now refile inode to proper lists */
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index b4914a11c3c2..19d175446899 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1429,7 +1429,6 @@ int nfs_clear_invalid_mapping(struct address_space *mapping)
 	trace_nfs_invalidate_mapping_exit(inode, ret);
 
 	clear_bit_unlock(NFS_INO_INVALIDATING, bitlock);
-	smp_mb__after_atomic();
 	wake_up_bit(bitlock, NFS_INO_INVALIDATING);
 out:
 	return ret;
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 8ae767578cd9..9f45c08d2e79 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -215,7 +215,6 @@ void
 nfs_page_clear_headlock(struct nfs_page *req)
 {
 	clear_bit_unlock(PG_HEADLOCK, &req->wb_flags);
-	smp_mb__after_atomic();
 	if (!test_bit(PG_CONTENDED1, &req->wb_flags))
 		return;
 	wake_up_bit(&req->wb_flags, PG_HEADLOCK);
@@ -520,7 +519,6 @@ nfs_create_subreq(struct nfs_page *req,
 void nfs_unlock_request(struct nfs_page *req)
 {
 	clear_bit_unlock(PG_BUSY, &req->wb_flags);
-	smp_mb__after_atomic();
 	if (!test_bit(PG_CONTENDED2, &req->wb_flags))
 		return;
 	wake_up_bit(&req->wb_flags, PG_BUSY);
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index aa698481bec8..877fc154eb2b 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -388,7 +388,6 @@ static void pnfs_clear_layoutreturn_waitbit(struct pnfs_layout_hdr *lo)
 {
 	clear_bit_unlock(NFS_LAYOUT_RETURN, &lo->plh_flags);
 	clear_bit(NFS_LAYOUT_RETURN_LOCK, &lo->plh_flags);
-	smp_mb__after_atomic();
 	wake_up_bit(&lo->plh_flags, NFS_LAYOUT_RETURN);
 	rpc_wake_up(&NFS_SERVER(lo->plh_inode)->roc_rpcwaitq);
 }
@@ -2044,7 +2043,7 @@ static void nfs_layoutget_end(struct pnfs_layout_hdr *lo)
 {
 	if (atomic_dec_and_test(&lo->plh_outstanding) &&
 	    test_and_clear_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags))
-		wake_up_bit(&lo->plh_flags, NFS_LAYOUT_DRAIN);
+		wake_up_bit_relaxed(&lo->plh_flags, NFS_LAYOUT_DRAIN);
 }
 
 static bool pnfs_is_first_layoutget(struct pnfs_layout_hdr *lo)
@@ -2057,7 +2056,6 @@ static void pnfs_clear_first_layoutget(struct pnfs_layout_hdr *lo)
 	unsigned long *bitlock = &lo->plh_flags;
 
 	clear_bit_unlock(NFS_LAYOUT_FIRST_LAYOUTGET, bitlock);
-	smp_mb__after_atomic();
 	wake_up_bit(bitlock, NFS_LAYOUT_FIRST_LAYOUTGET);
 }
 
@@ -3230,7 +3228,6 @@ static void pnfs_clear_layoutcommitting(struct inode *inode)
 	unsigned long *bitlock = &NFS_I(inode)->flags;
 
 	clear_bit_unlock(NFS_INO_LAYOUTCOMMITTING, bitlock);
-	smp_mb__after_atomic();
 	wake_up_bit(bitlock, NFS_INO_LAYOUTCOMMITTING);
 }
 
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 67d8673a9391..3e1a434bc649 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -1897,7 +1897,6 @@ nfsd4_cltrack_upcall_unlock(struct nfs4_client *clp)
 {
 	smp_mb__before_atomic();
 	clear_bit(NFSD4_CLIENT_UPCALL_LOCK, &clp->cl_flags);
-	smp_mb__after_atomic();
 	wake_up_bit(&clp->cl_flags, NFSD4_CLIENT_UPCALL_LOCK);
 }
 
diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index d68372241b30..d620b56a1002 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -314,7 +314,6 @@ int orangefs_revalidate_mapping(struct inode *inode)
 	    orangefs_cache_timeout_msecs*HZ/1000;
 
 	clear_bit(1, bitlock);
-	smp_mb__after_atomic();
 	wake_up_bit(bitlock, 1);
 
 	return ret;
diff --git a/fs/smb/client/inode.c b/fs/smb/client/inode.c
index dd0afa23734c..8da74f15cc95 100644
--- a/fs/smb/client/inode.c
+++ b/fs/smb/client/inode.c
@@ -2511,7 +2511,6 @@ cifs_revalidate_mapping(struct inode *inode)
 
 skip_invalidate:
 	clear_bit_unlock(CIFS_INO_LOCK, flags);
-	smp_mb__after_atomic();
 	wake_up_bit(flags, CIFS_INO_LOCK);
 
 	return rc;
diff --git a/fs/smb/server/oplock.c b/fs/smb/server/oplock.c
index a8f52c4ebbda..b039b242df46 100644
--- a/fs/smb/server/oplock.c
+++ b/fs/smb/server/oplock.c
@@ -594,8 +594,6 @@ static void wait_for_break_ack(struct oplock_info *opinfo)
 static void wake_up_oplock_break(struct oplock_info *opinfo)
 {
 	clear_bit_unlock(0, &opinfo->pending_break);
-	/* memory barrier is needed for wake_up_bit() */
-	smp_mb__after_atomic();
 	wake_up_bit(&opinfo->pending_break, 0);
 }
 
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 178ed8bed91c..609c10fcd344 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -26,7 +26,7 @@ typedef int wait_bit_action_f(struct wait_bit_key *key, int mode);
 void __wake_up_bit(struct wait_queue_head *wq_head, void *word, int bit);
 int __wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
 int __wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
-void wake_up_bit(void *word, int bit);
+void wake_up_bit_relaxed(void *word, int bit);
 int out_of_line_wait_on_bit(void *word, int, wait_bit_action_f *action, unsigned int mode);
 int out_of_line_wait_on_bit_timeout(void *word, int, wait_bit_action_f *action, unsigned int mode, unsigned long timeout);
 int out_of_line_wait_on_bit_lock(void *word, int, wait_bit_action_f *action, unsigned int mode);
@@ -318,6 +318,48 @@ do {									\
 	__ret;								\
 })
 
+/**
+ * wake_up_bit - wake up waiters on a bit
+ * @word: the word being waited on, a kernel virtual address
+ * @bit: the bit of the word being waited on
+ *
+ * There is a standard hashed waitqueue table for generic use. This
+ * is the part of the hash-table's accessor API that wakes up waiters
+ * on a bit. For instance, if one were to have waiters on a bitflag,
+ * one would call wake_up_bit() after atomically clearing the bit.
+ *
+ * This interface should only be used when the bit is cleared atomically,
+ * such as with clear_bit(), atomic_andnot(), or by code inside a locked
+ * region (with this interface called after the unlock).  If the
+ * bit is cleared non-atomically, wake_up_bit_mb() should be used.
+ */
+static inline void wake_up_bit(void *word, int bit)
+{
+	smp_mb__after_atomic();
+	wake_up_bit_relaxed(word, bit);
+}
+
+/**
+ * wake_up_bit_mb - wake up waiters on a bit with full barrier
+ * @word: the word being waited on, a kernel virtual address
+ * @bit: the bit of the word being waited on
+ *
+ * There is a standard hashed waitqueue table for generic use. This
+ * is the part of the hash-table's accessor API that wakes up waiters
+ * on a bit. For instance, if one were to have waiters on a bitflag,
+ * one would call wake_up_bit() after non-atomically clearing the bit.
+ *
+ * This interface has a full barrier and so is safe to use anywhere.
+ * It is particularly intended for use after the bit has been cleared
+ * (or set) non-atomically with a simple assignment.  Where the bit
+ * is cleared atomically, the barrier used may be excessive.
+ */
+static inline void wake_up_bit_mb(void *word, int bit)
+{
+	smp_mb();
+	wake_up_bit_relaxed(word, bit);
+}
+
 /**
  * clear_and_wake_up_bit - clear a bit and wake up anyone waiting on that bit
  *
@@ -330,8 +372,6 @@ do {									\
 static inline void clear_and_wake_up_bit(int bit, void *word)
 {
 	clear_bit_unlock(bit, word);
-	/* See wake_up_bit() for which memory barrier you need to use. */
-	smp_mb__after_atomic();
 	wake_up_bit(word, bit);
 }
 
diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 134d7112ef71..51d923bf207e 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -128,7 +128,7 @@ void __wake_up_bit(struct wait_queue_head *wq_head, void *word, int bit)
 EXPORT_SYMBOL(__wake_up_bit);
 
 /**
- * wake_up_bit - wake up a waiter on a bit
+ * wake_up_bit_relaxed - wake up waiters on a bit
  * @word: the word being waited on, a kernel virtual address
  * @bit: the bit of the word being waited on
  *
@@ -139,16 +139,15 @@ EXPORT_SYMBOL(__wake_up_bit);
  *
  * In order for this to function properly, as it uses waitqueue_active()
  * internally, some kind of memory barrier must be done prior to calling
- * this. Typically, this will be smp_mb__after_atomic(), but in some
- * cases where bitflags are manipulated non-atomically under a lock, one
- * may need to use a less regular barrier, such fs/inode.c's smp_mb(),
- * because spin_unlock() does not guarantee a memory barrier.
+ * this. Typically this will be provided implicitly by test_and_clear_bit().
+ * If the bit is cleared without full ordering, an alternate interface
+ * such as wake_up_bit_mb() or wake_up_bit() should be used.
  */
-void wake_up_bit(void *word, int bit)
+void wake_up_bit_relaxed(void *word, int bit)
 {
 	__wake_up_bit(bit_waitqueue(word, bit), word, bit);
 }
-EXPORT_SYMBOL(wake_up_bit);
+EXPORT_SYMBOL(wake_up_bit_relaxed);
 
 wait_queue_head_t *__var_waitqueue(void *p)
 {
diff --git a/kernel/signal.c b/kernel/signal.c
index 60c737e423a1..1c3d51027cd0 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -306,8 +306,7 @@ void task_clear_jobctl_trapping(struct task_struct *task)
 {
 	if (unlikely(task->jobctl & JOBCTL_TRAPPING)) {
 		task->jobctl &= ~JOBCTL_TRAPPING;
-		smp_mb();	/* advised by wake_up_bit() */
-		wake_up_bit(&task->jobctl, JOBCTL_TRAPPING_BIT);
+		wake_up_bit_mb(&task->jobctl, JOBCTL_TRAPPING_BIT);
 	}
 }
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 14d9e53b1ec2..edc55ba641fb 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3267,7 +3267,6 @@ static int ksm_memory_callback(struct notifier_block *self,
 		ksm_run &= ~KSM_RUN_OFFLINE;
 		mutex_unlock(&ksm_thread_mutex);
 
-		smp_mb();	/* wake_up_bit advises this */
 		wake_up_bit(&ksm_run, ilog2(KSM_RUN_OFFLINE));
 		break;
 	}
diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
index d0c118c47f6c..226b37017d56 100644
--- a/net/bluetooth/hci_event.c
+++ b/net/bluetooth/hci_event.c
@@ -105,7 +105,6 @@ static u8 hci_cc_inquiry_cancel(struct hci_dev *hdev, void *data,
 		return rp->status;
 
 	clear_bit(HCI_INQUIRY, &hdev->flags);
-	smp_mb__after_atomic(); /* wake_up_bit advises about this barrier */
 	wake_up_bit(&hdev->flags, HCI_INQUIRY);
 
 	hci_dev_lock(hdev);
@@ -2972,9 +2971,7 @@ static void hci_inquiry_complete_evt(struct hci_dev *hdev, void *data,
 
 	if (!test_and_clear_bit(HCI_INQUIRY, &hdev->flags))
 		return;
-
-	smp_mb__after_atomic(); /* wake_up_bit advises about this barrier */
-	wake_up_bit(&hdev->flags, HCI_INQUIRY);
+	wake_up_bit_relaxed(&hdev->flags, HCI_INQUIRY);
 
 	if (!hci_dev_test_flag(hdev, HCI_MGMT))
 		return;
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index cef623ea1506..1f64652cb629 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -370,7 +370,6 @@ static void rpc_make_runnable(struct workqueue_struct *wq,
 		INIT_WORK(&task->u.tk_work, rpc_async_schedule);
 		queue_work(wq, &task->u.tk_work);
 	} else {
-		smp_mb__after_atomic();
 		wake_up_bit(&task->tk_runstate, RPC_TASK_QUEUED);
 	}
 }
diff --git a/security/keys/key.c b/security/keys/key.c
index 3d7d185019d3..8f7380935eb5 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -474,7 +474,7 @@ static int __key_instantiate_and_link(struct key *key,
 
 	/* wake up anyone waiting for a key to be constructed */
 	if (awaken)
-		wake_up_bit(&key->flags, KEY_FLAG_USER_CONSTRUCT);
+		wake_up_bit_relaxed(&key->flags, KEY_FLAG_USER_CONSTRUCT);
 
 	return ret;
 }
@@ -629,7 +629,7 @@ int key_reject_and_link(struct key *key,
 
 	/* wake up anyone waiting for a key to be constructed */
 	if (awaken)
-		wake_up_bit(&key->flags, KEY_FLAG_USER_CONSTRUCT);
+		wake_up_bit_relaxed(&key->flags, KEY_FLAG_USER_CONSTRUCT);
 
 	return ret == 0 ? link_ret : ret;
 }
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 8/9] Improve and extend wake_up_var() interface.
  2024-08-19  5:20 [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile NeilBrown
                   ` (6 preceding siblings ...)
  2024-08-19  5:20 ` [PATCH 7/9] Improve and expand wake_up_bit() interface NeilBrown
@ 2024-08-19  5:20 ` NeilBrown
  2024-08-19  5:20 ` [PATCH 9/9] Use clear_and_wake_up_bit() where appropriate NeilBrown
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: NeilBrown @ 2024-08-19  5:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

wake_up_var() is similar to wake_up_bit() and benefits from a richer
interface which includes barriers.

- wake_up_var() now has smp_mb__after_atomic() and should be used after
  the variable is changed atomically - including inside a locked region.
- wake_up_var_mb() now has smp_mb() and should be used after the variable
  has been changed non-atomically.
- wake_up_var_relaxed() can be used when no barrier is needed, such as
  after atomic_dec_and_test().
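
For illustration, the intended usage of each variant looks like this
(kernel-style sketch only, not taken from any caller in this series; the
variable names are hypothetical):

```c
/* After an atomic update without full ordering (e.g. clear_bit(),
 * atomic_set(), or a store inside a since-unlocked region):
 */
clear_bit(MY_FLAG, &state);
wake_up_var(&state);            /* supplies smp_mb__after_atomic() */

/* After a plain, non-atomic store: */
done = true;
wake_up_var_mb(&done);          /* supplies a full smp_mb() */

/* After a fully-ordered atomic op, no extra barrier is needed: */
if (atomic_dec_and_test(&refs))
	wake_up_var_relaxed(&refs);
```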

Signed-off-by: NeilBrown <neilb@suse.de>
---
 block/bdev.c              |  3 +-
 drivers/block/pktcdvd.c   |  3 +-
 fs/afs/cell.c             | 16 +++++------
 fs/netfs/fscache_cache.c  |  2 +-
 fs/netfs/fscache_cookie.c |  6 ----
 fs/netfs/fscache_volume.c |  2 +-
 fs/netfs/io.c             |  4 +--
 fs/nfs/dir.c              |  3 +-
 fs/nfs/nfs4state.c        |  2 +-
 fs/nfsd/nfs4state.c       |  2 +-
 fs/notify/mark.c          |  2 +-
 fs/super.c                | 20 +++++---------
 fs/xfs/xfs_buf.c          |  2 +-
 include/linux/mbcache.h   |  2 +-
 include/linux/wait_bit.h  | 58 +++++++++++++++++++++++++++++++++++++--
 kernel/events/core.c      |  6 ++--
 kernel/sched/core.c       |  2 +-
 kernel/sched/wait_bit.c   | 21 ++++++++++++--
 kernel/softirq.c          |  3 +-
 mm/memremap.c             |  2 +-
 security/landlock/fs.c    |  2 +-
 21 files changed, 108 insertions(+), 55 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index d804c91c651b..34ee3e155c18 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -561,8 +561,7 @@ static void bd_clear_claiming(struct block_device *whole, void *holder)
 	/* tell others that we're done */
 	BUG_ON(whole->bd_claiming != holder);
 	whole->bd_claiming = NULL;
-	smp_mb();
-	wake_up_var(&whole->bd_claiming);
+	wake_up_var_mb(&whole->bd_claiming);
 }
 
 /**
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 273fbe05d80f..e774057329c6 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1210,8 +1210,7 @@ static int pkt_handle_queue(struct pktcdvd_device *pd)
 	if (pd->congested &&
 	    pd->bio_queue_size <= pd->write_congestion_off) {
 		pd->congested = false;
-		smp_mb();
-		wake_up_var(&pd->congested);
+		wake_up_var_mb(&pd->congested);
 	}
 	spin_unlock(&pd->lock);
 
diff --git a/fs/afs/cell.c b/fs/afs/cell.c
index 422bcc26becc..726bf48094ce 100644
--- a/fs/afs/cell.c
+++ b/fs/afs/cell.c
@@ -478,7 +478,7 @@ static int afs_update_cell(struct afs_cell *cell)
 out_wake:
 	smp_store_release(&cell->dns_lookup_count,
 			  cell->dns_lookup_count + 1); /* vs source/status */
-	wake_up_var(&cell->dns_lookup_count);
+	wake_up_var_mb(&cell->dns_lookup_count);
 	_leave(" = %d", ret);
 	return ret;
 }
@@ -748,12 +748,12 @@ static void afs_manage_cell(struct afs_cell *cell)
 		if (cell->state == AFS_CELL_FAILED)
 			goto done;
 		smp_store_release(&cell->state, AFS_CELL_UNSET);
-		wake_up_var(&cell->state);
+		wake_up_var_mb(&cell->state);
 		goto again;
 
 	case AFS_CELL_UNSET:
 		smp_store_release(&cell->state, AFS_CELL_ACTIVATING);
-		wake_up_var(&cell->state);
+		wake_up_var_mb(&cell->state);
 		goto again;
 
 	case AFS_CELL_ACTIVATING:
@@ -762,7 +762,7 @@ static void afs_manage_cell(struct afs_cell *cell)
 			goto activation_failed;
 
 		smp_store_release(&cell->state, AFS_CELL_ACTIVE);
-		wake_up_var(&cell->state);
+		wake_up_var_mb(&cell->state);
 		goto again;
 
 	case AFS_CELL_ACTIVE:
@@ -775,7 +775,7 @@ static void afs_manage_cell(struct afs_cell *cell)
 			goto done;
 		}
 		smp_store_release(&cell->state, AFS_CELL_DEACTIVATING);
-		wake_up_var(&cell->state);
+		wake_up_var_mb(&cell->state);
 		goto again;
 
 	case AFS_CELL_DEACTIVATING:
@@ -783,7 +783,7 @@ static void afs_manage_cell(struct afs_cell *cell)
 			goto reverse_deactivation;
 		afs_deactivate_cell(net, cell);
 		smp_store_release(&cell->state, AFS_CELL_INACTIVE);
-		wake_up_var(&cell->state);
+		wake_up_var_mb(&cell->state);
 		goto again;
 
 	case AFS_CELL_REMOVED:
@@ -800,12 +800,12 @@ static void afs_manage_cell(struct afs_cell *cell)
 	afs_deactivate_cell(net, cell);
 
 	smp_store_release(&cell->state, AFS_CELL_FAILED); /* vs error */
-	wake_up_var(&cell->state);
+	wake_up_var_mb(&cell->state);
 	goto again;
 
 reverse_deactivation:
 	smp_store_release(&cell->state, AFS_CELL_ACTIVE);
-	wake_up_var(&cell->state);
+	wake_up_var_mb(&cell->state);
 	_leave(" [deact->act]");
 	return;
 
diff --git a/fs/netfs/fscache_cache.c b/fs/netfs/fscache_cache.c
index 9397ed39b0b4..83e6d25a5e0a 100644
--- a/fs/netfs/fscache_cache.c
+++ b/fs/netfs/fscache_cache.c
@@ -321,7 +321,7 @@ void fscache_end_cache_access(struct fscache_cache *cache, enum fscache_access_t
 	trace_fscache_access_cache(cache->debug_id, refcount_read(&cache->ref),
 				   n_accesses, why);
 	if (n_accesses == 0)
-		wake_up_var(&cache->n_accesses);
+		wake_up_var_relaxed(&cache->n_accesses);
 }
 
 /**
diff --git a/fs/netfs/fscache_cookie.c b/fs/netfs/fscache_cookie.c
index bce2492186d0..93c66938b164 100644
--- a/fs/netfs/fscache_cookie.c
+++ b/fs/netfs/fscache_cookie.c
@@ -191,12 +191,6 @@ bool fscache_begin_cookie_access(struct fscache_cookie *cookie,
 
 static inline void wake_up_cookie_state(struct fscache_cookie *cookie)
 {
-	/* Use a barrier to ensure that waiters see the state variable
-	 * change, as spin_unlock doesn't guarantee a barrier.
-	 *
-	 * See comments over wake_up_bit() and waitqueue_active().
-	 */
-	smp_mb();
 	wake_up_var(&cookie->state);
 }
 
diff --git a/fs/netfs/fscache_volume.c b/fs/netfs/fscache_volume.c
index cb75c07b5281..c6c43a87f56e 100644
--- a/fs/netfs/fscache_volume.c
+++ b/fs/netfs/fscache_volume.c
@@ -128,7 +128,7 @@ void fscache_end_volume_access(struct fscache_volume *volume,
 				    refcount_read(&volume->ref),
 				    n_accesses, why);
 	if (n_accesses == 0)
-		wake_up_var(&volume->n_accesses);
+		wake_up_var_relaxed(&volume->n_accesses);
 }
 EXPORT_SYMBOL(fscache_end_volume_access);
 
diff --git a/fs/netfs/io.c b/fs/netfs/io.c
index c93851b98368..ebae3cfcad20 100644
--- a/fs/netfs/io.c
+++ b/fs/netfs/io.c
@@ -181,7 +181,7 @@ static bool netfs_rreq_perform_resubmissions(struct netfs_io_request *rreq)
 	if (atomic_dec_and_test(&rreq->nr_outstanding))
 		return true;
 
-	wake_up_var(&rreq->nr_outstanding);
+	wake_up_var_relaxed(&rreq->nr_outstanding);
 	return false;
 }
 
@@ -372,7 +372,7 @@ void netfs_subreq_terminated(struct netfs_io_subrequest *subreq,
 	if (u == 0)
 		netfs_rreq_terminated(rreq, was_async);
 	else if (u == 1)
-		wake_up_var(&rreq->nr_outstanding);
+		wake_up_var_relaxed(&rreq->nr_outstanding);
 
 	netfs_put_subrequest(subreq, was_async, netfs_sreq_trace_put_terminated);
 	return;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 4cb97ef41350..1d745f105095 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1837,9 +1837,8 @@ static void block_revalidate(struct dentry *dentry)
 
 static void unblock_revalidate(struct dentry *dentry)
 {
-	/* store_release ensures wait_var_event() sees the update */
 	smp_store_release(&dentry->d_fsdata, NULL);
-	wake_up_var(&dentry->d_fsdata);
+	wake_up_var_mb(&dentry->d_fsdata);
 }
 
 /*
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 877f682b45f2..d038409a9a9a 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1220,7 +1220,7 @@ void nfs4_schedule_state_manager(struct nfs_client *clp)
 		swapon = !test_and_set_bit(NFS4CLNT_MANAGER_AVAILABLE,
 					   &clp->cl_state);
 		if (!swapon) {
-			wake_up_var(&clp->cl_state);
+			wake_up_var_relaxed(&clp->cl_state);
 			return;
 		}
 	}
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index a20c2c9d7d45..d156ac7637cf 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -7478,8 +7478,8 @@ nfsd4_delegreturn(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
 		goto put_stateid;
 
 	trace_nfsd_deleg_return(stateid);
-	wake_up_var(d_inode(cstate->current_fh.fh_dentry));
 	destroy_delegation(dp);
+	wake_up_var(d_inode(cstate->current_fh.fh_dentry));
 put_stateid:
 	nfs4_put_stid(&dp->dl_stid);
 out:
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 5e170e713088..ff3d84e9db4d 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -139,7 +139,7 @@ static void fsnotify_get_sb_watched_objects(struct super_block *sb)
 static void fsnotify_put_sb_watched_objects(struct super_block *sb)
 {
 	if (atomic_long_dec_and_test(fsnotify_sb_watched_objects(sb)))
-		wake_up_var(fsnotify_sb_watched_objects(sb));
+		wake_up_var_relaxed(fsnotify_sb_watched_objects(sb));
 }
 
 static void fsnotify_get_inode_ref(struct inode *inode)
diff --git a/fs/super.c b/fs/super.c
index 38d72a3cf6fc..96b9a682a7ca 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -158,13 +158,7 @@ static void super_wake(struct super_block *sb, unsigned int flag)
 	 * seeing SB_BORN sent.
 	 */
 	smp_store_release(&sb->s_flags, sb->s_flags | flag);
-	/*
-	 * Pairs with the barrier in prepare_to_wait_event() to make sure
-	 * ___wait_var_event() either sees SB_BORN set or
-	 * waitqueue_active() check in wake_up_var() sees the waiter.
-	 */
-	smp_mb();
-	wake_up_var(&sb->s_flags);
+	wake_up_var_mb(&sb->s_flags);
 }
 
 /*
@@ -2074,7 +2068,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
 		/* Nothing to do really... */
 		WARN_ON_ONCE(freeze_inc(sb, who) > 1);
 		sb->s_writers.frozen = SB_FREEZE_COMPLETE;
-		wake_up_var(&sb->s_writers.frozen);
+		wake_up_var_mb(&sb->s_writers.frozen);
 		super_unlock_excl(sb);
 		return 0;
 	}
@@ -2094,7 +2088,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
 	if (ret) {
 		sb->s_writers.frozen = SB_UNFROZEN;
 		sb_freeze_unlock(sb, SB_FREEZE_PAGEFAULT);
-		wake_up_var(&sb->s_writers.frozen);
+		wake_up_var_mb(&sb->s_writers.frozen);
 		deactivate_locked_super(sb);
 		return ret;
 	}
@@ -2110,7 +2104,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
 				"VFS:Filesystem freeze failed\n");
 			sb->s_writers.frozen = SB_UNFROZEN;
 			sb_freeze_unlock(sb, SB_FREEZE_FS);
-			wake_up_var(&sb->s_writers.frozen);
+			wake_up_var_mb(&sb->s_writers.frozen);
 			deactivate_locked_super(sb);
 			return ret;
 		}
@@ -2121,7 +2115,7 @@ int freeze_super(struct super_block *sb, enum freeze_holder who)
 	 */
 	WARN_ON_ONCE(freeze_inc(sb, who) > 1);
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
-	wake_up_var(&sb->s_writers.frozen);
+	wake_up_var_mb(&sb->s_writers.frozen);
 	lockdep_sb_freeze_release(sb);
 	super_unlock_excl(sb);
 	return 0;
@@ -2150,7 +2144,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
 
 	if (sb_rdonly(sb)) {
 		sb->s_writers.frozen = SB_UNFROZEN;
-		wake_up_var(&sb->s_writers.frozen);
+		wake_up_var_mb(&sb->s_writers.frozen);
 		goto out_deactivate;
 	}
 
@@ -2167,7 +2161,7 @@ static int thaw_super_locked(struct super_block *sb, enum freeze_holder who)
 	}
 
 	sb->s_writers.frozen = SB_UNFROZEN;
-	wake_up_var(&sb->s_writers.frozen);
+	wake_up_var_mb(&sb->s_writers.frozen);
 	sb_freeze_unlock(sb, SB_FREEZE_FS);
 out_deactivate:
 	deactivate_locked_super(sb);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index aa4dbda7b536..84355a859c86 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2137,7 +2137,7 @@ xfs_buf_list_del(
 	struct xfs_buf		*bp)
 {
 	list_del_init(&bp->b_list);
-	wake_up_var(&bp->b_list);
+	wake_up_var_mb(&bp->b_list);
 }
 
 /*
diff --git a/include/linux/mbcache.h b/include/linux/mbcache.h
index 97e64184767d..65cc6bc1baa6 100644
--- a/include/linux/mbcache.h
+++ b/include/linux/mbcache.h
@@ -52,7 +52,7 @@ static inline void mb_cache_entry_put(struct mb_cache *cache,
 
 	if (cnt > 0) {
 		if (cnt <= 2)
-			wake_up_var(&entry->e_refcnt);
+			wake_up_var_relaxed(&entry->e_refcnt);
 		return;
 	}
 	__mb_cache_entry_free(cache, entry);
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 609c10fcd344..7cdd0ab34f21 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -236,7 +236,7 @@ wait_on_bit_lock_action(unsigned long *word, int bit, wait_bit_action_f *action,
 }
 
 extern void init_wait_var_entry(struct wait_bit_queue_entry *wbq_entry, void *var, int flags);
-extern void wake_up_var(void *var);
+extern void wake_up_var_relaxed(void *var);
 extern wait_queue_head_t *__var_waitqueue(void *p);
 
 #define ___wait_var_event(var, condition, state, exclusive, ret, cmd)	\
@@ -375,12 +375,66 @@ static inline void clear_and_wake_up_bit(int bit, void *word)
 	wake_up_bit(word, bit);
 }
 
+/**
+ * atomic_dec_and_wake_up_var - decrement a count and wake up anyone waiting on it.
+ *
+ * @var: the atomic_t variable being waited on.
+ *
+ * @var is decremented and, if the value reaches zero, any task waiting
+ * in wait_var_event() on the variable will be woken.
+ */
 static inline bool atomic_dec_and_wake_up_var(atomic_t *var)
 {
 	if (!atomic_dec_and_test(var))
 		return false;
-	wake_up_var(var);
+	wake_up_var_relaxed(var);
 	return true;
 }
 
+/**
+ * wake_up_var_mb - wake up all waiters on a var
+ * @var: the address being waited on, a kernel virtual address
+ *
+ * There is a standard hashed waitqueue table for generic use.  This is
+ * the part of the hash-table's accessor API that wakes up waiters on an
+ * address.  For instance, if one were to have waiters on an address one
+ * would call wake_up_var_mb() after non-atomically modifying the
+ * variable.
+ *
+ * This interface has a full barrier and so is safe to use anywhere.
+ * It is particularly intended for use after the variable has been
+ * updated non-atomically with a simple assignment.  Where the variable
+ * is updated atomically, the full barrier may be excessively costly.
+ *
+ * Note that it is often appropriate to use smp_store_release() to
+ * update a field in a structure that is being waited on.  This ensures
+ * dependent fields which have previously been set will have their new
+ * values visible by the time the update to the waited-on field is visible.
+ */
+static inline void wake_up_var_mb(void *var)
+{
+	smp_mb();
+	wake_up_var_relaxed(var);
+}
+
+/**
+ * wake_up_var - wake up all waiters on a var
+ * @var: the address being waited on, a kernel virtual address
+ *
+ * There is a standard hashed waitqueue table for generic use. This
+ * is the part of the hash-table's accessor API that wakes up waiters
+ * on a variable. Tasks waiting in wait_var_event() can be woken by this.
+ *
+ * This interface should only be used after the variable has been updated
+ * atomically, such as within a locked region which has since been
+ * unlocked, or by atomic_set() or xchg() etc.
+ * If the variable has been updated non-atomically then wake_up_var_mb()
+ * should be used.
+ */
+static inline void wake_up_var(void *var)
+{
+	smp_mb__after_atomic();
+	wake_up_var_relaxed(var);
+}
+
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index aa3450bdc227..e00be2c7dae7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5451,8 +5451,7 @@ int perf_event_release_kernel(struct perf_event *event)
 			 * ctx while the child_mutex got released above, make sure to
 			 * notify about the preceding put_ctx().
 			 */
-			smp_mb(); /* pairs with wait_var_event() */
-			wake_up_var(var);
+			wake_up_var_mb(var);
 		}
 		goto again;
 	}
@@ -5468,8 +5467,7 @@ int perf_event_release_kernel(struct perf_event *event)
 		 * Wake any perf_event_free_task() waiting for this event to be
 		 * freed.
 		 */
-		smp_mb(); /* pairs with wait_var_event() */
-		wake_up_var(var);
+		wake_up_var_mb(var);
 	}
 
 no_ctx:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f3951e4a55e5..e4054f19bb25 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2898,7 +2898,7 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 	wait_for_completion(&pending->done);
 
 	if (refcount_dec_and_test(&pending->refs))
-		wake_up_var(&pending->refs); /* No UaF, just an address */
+		wake_up_var_relaxed(&pending->refs); /* No UaF, just an address */
 
 	/*
 	 * Block the original owner of &pending until all subsequent callers
diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 51d923bf207e..970a7874d785 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -187,11 +187,28 @@ void init_wait_var_entry(struct wait_bit_queue_entry *wbq_entry, void *var, int
 }
 EXPORT_SYMBOL(init_wait_var_entry);
 
-void wake_up_var(void *var)
+/**
+ * wake_up_var_relaxed - wake up all waiters on an address
+ * @var: the address being waited on, a kernel virtual address
+ *
+ * There is a standard hashed waitqueue table for generic use.  This is
+ * the part of the hash-table's accessor API that wakes up waiters on an
+ * address.  For instance, if one were to have waiters on an address one
+ * would call wake_up_var_relaxed() after updating a value at, or near,
+ * the address.
+ *
+ * In order for this to function properly, as it uses waitqueue_active()
+ * internally, some kind of memory barrier must be done prior to calling
+ * this.  Typically that is provided implicitly by an atomic function
+ * that returns a value, such as atomic_dec_return() or test_and_clear_bit().
+ * If the variable is updated without full ordering, an alternative
+ * interface such as wake_up_var_mb() or wake_up_var() should be used.
+ */
+void wake_up_var_relaxed(void *var)
 {
 	__wake_up_bit(__var_waitqueue(var), var, -1);
 }
-EXPORT_SYMBOL(wake_up_var);
+EXPORT_SYMBOL(wake_up_var_relaxed);
 
 __sched int bit_wait(struct wait_bit_key *word, int mode)
 {
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 02582017759a..a1f3bc234848 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -749,7 +749,7 @@ EXPORT_SYMBOL(__tasklet_hi_schedule);
 static bool tasklet_clear_sched(struct tasklet_struct *t)
 {
 	if (test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) {
-		wake_up_var(&t->state);
+		wake_up_var_relaxed(&t->state);
 		return true;
 	}
 
@@ -885,7 +885,6 @@ void tasklet_unlock(struct tasklet_struct *t)
 {
 	smp_mb__before_atomic();
 	clear_bit(TASKLET_STATE_RUN, &t->state);
-	smp_mb__after_atomic();
 	wake_up_var(&t->state);
 }
 EXPORT_SYMBOL_GPL(tasklet_unlock);
diff --git a/mm/memremap.c b/mm/memremap.c
index 40d4547ce514..c04c16230551 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -523,7 +523,7 @@ bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
 	 * stable because nobody holds a reference on the page.
 	 */
 	if (folio_ref_sub_return(folio, refs) == 1)
-		wake_up_var(&folio->_refcount);
+		wake_up_var_relaxed(&folio->_refcount);
 	return true;
 }
 EXPORT_SYMBOL(__put_devmap_managed_folio_refs);
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index 7877a64cc6b8..0c971d64604e 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -82,7 +82,7 @@ static void release_inode(struct landlock_object *const object)
 
 	iput(inode);
 	if (atomic_long_dec_and_test(&landlock_superblock(sb)->inode_refs))
-		wake_up_var(&landlock_superblock(sb)->inode_refs);
+		wake_up_var_relaxed(&landlock_superblock(sb)->inode_refs);
 }
 
 static const struct landlock_object_underops landlock_fs_underops = {
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 9/9] Use clear_and_wake_up_bit() where appropriate.
  2024-08-19  5:20 [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile NeilBrown
                   ` (7 preceding siblings ...)
  2024-08-19  5:20 ` [PATCH 8/9] Improve and extend wake_up_var() interface NeilBrown
@ 2024-08-19  5:20 ` NeilBrown
  2024-08-19  6:13 ` [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile Linus Torvalds
  2024-08-19  8:16 ` Christian Brauner
  10 siblings, 0 replies; 38+ messages in thread
From: NeilBrown @ 2024-08-19  5:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

clear_and_wake_up_bit() contains two memory barriers - one before the bit
is cleared, so that changes implied by the cleared bit are visible, and
another before the lockless waitqueue_active() test in
wake_up_bit(), to ensure the cleared bit is visible to a waiter
immediately after that waiter is added to the waitqueue.

This function is open-coded in many places, some of which omit one or
both of the barriers.

For consistency, this patch changes all reasonable candidates to use
clear_and_wake_up_bit() directly.
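
The pattern being replaced, and its replacement, look roughly like this
(sketch only; the bit name is hypothetical, and callers vary in which
barriers they remembered to include):

```c
/* Open-coded, and easy to get wrong - either barrier is
 * sometimes omitted:
 */
smp_mb__before_atomic();
clear_bit(MY_BIT, &flags);
smp_mb__after_atomic();
wake_up_bit(&flags, MY_BIT);

/* Equivalent, with both barriers supplied by the helper: */
clear_and_wake_up_bit(MY_BIT, &flags);
```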

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/bluetooth/hci_mrvl.c                |  6 ++----
 drivers/md/dm-bufio.c                       |  8 ++------
 drivers/md/dm-snap.c                        |  3 +--
 drivers/md/dm-zoned-metadata.c              |  6 ++----
 drivers/md/dm-zoned-reclaim.c               |  3 +--
 drivers/md/dm.c                             |  3 +--
 drivers/media/usb/dvb-usb-v2/dvb_usb_core.c | 12 ++++--------
 fs/afs/server.c                             |  3 +--
 fs/afs/vl_probe.c                           |  3 +--
 fs/afs/volume.c                             |  3 +--
 fs/bcachefs/buckets.h                       |  3 +--
 fs/btrfs/extent_io.c                        |  6 ++----
 fs/buffer.c                                 |  3 +--
 fs/gfs2/glock.c                             |  6 ++----
 fs/gfs2/glops.c                             |  6 ++----
 fs/gfs2/lock_dlm.c                          |  6 ++----
 fs/gfs2/recovery.c                          |  3 +--
 fs/gfs2/sys.c                               |  3 +--
 fs/gfs2/util.c                              |  3 +--
 fs/jbd2/commit.c                            |  6 ++----
 fs/netfs/fscache_volume.c                   |  3 +--
 fs/netfs/io.c                               |  3 +--
 fs/netfs/write_collect.c                    |  9 +++------
 fs/nfs/inode.c                              |  3 +--
 fs/nfs/pnfs.c                               |  6 ++----
 fs/nfsd/nfs4recover.c                       |  4 +---
 fs/nfsd/nfs4state.c                         |  3 +--
 fs/orangefs/file.c                          |  3 +--
 fs/smb/client/connect.c                     |  3 +--
 fs/smb/client/inode.c                       |  3 +--
 fs/smb/client/misc.c                        | 14 ++++++--------
 fs/smb/server/oplock.c                      |  3 +--
 net/bluetooth/hci_event.c                   |  3 +--
 security/keys/gc.c                          |  3 +--
 34 files changed, 53 insertions(+), 105 deletions(-)

diff --git a/drivers/bluetooth/hci_mrvl.c b/drivers/bluetooth/hci_mrvl.c
index 5486cf78a99b..130bfe8f86d2 100644
--- a/drivers/bluetooth/hci_mrvl.c
+++ b/drivers/bluetooth/hci_mrvl.c
@@ -183,8 +183,7 @@ static int mrvl_recv_fw_req(struct hci_dev *hdev, struct sk_buff *skb)
 
 	mrvl->tx_len = le16_to_cpu(pkt->lhs);
 
-	clear_bit(STATE_FW_REQ_PENDING, &mrvl->flags);
-	wake_up_bit(&mrvl->flags, STATE_FW_REQ_PENDING);
+	clear_and_wake_up_bit(STATE_FW_REQ_PENDING, &mrvl->flags);
 
 done:
 	kfree_skb(skb);
@@ -217,8 +216,7 @@ static int mrvl_recv_chip_ver(struct hci_dev *hdev, struct sk_buff *skb)
 
 	bt_dev_info(hdev, "Controller id = %x, rev = %x", mrvl->id, mrvl->rev);
 
-	clear_bit(STATE_CHIP_VER_PENDING, &mrvl->flags);
-	wake_up_bit(&mrvl->flags, STATE_CHIP_VER_PENDING);
+	clear_and_wake_up_bit(STATE_CHIP_VER_PENDING, &mrvl->flags);
 
 done:
 	kfree_skb(skb);
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 14b4d7cabbd6..45696f9e814a 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -1430,9 +1430,7 @@ static void write_endio(struct dm_buffer *b, blk_status_t status)
 
 	BUG_ON(!test_bit(B_WRITING, &b->state));
 
-	smp_mb__before_atomic();
-	clear_bit(B_WRITING, &b->state);
-	wake_up_bit(&b->state, B_WRITING);
+	clear_and_wake_up_bit(B_WRITING, &b->state);
 }
 
 /*
@@ -1839,9 +1837,7 @@ static void read_endio(struct dm_buffer *b, blk_status_t status)
 
 	BUG_ON(!test_bit(B_READING, &b->state));
 
-	smp_mb__before_atomic();
-	clear_bit(B_READING, &b->state);
-	wake_up_bit(&b->state, B_READING);
+	clear_and_wake_up_bit(B_READING, &b->state);
 }
 
 /*
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 1549ab975021..4be3426a79ed 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -920,8 +920,7 @@ static int init_hash_tables(struct dm_snapshot *s)
 
 static void merge_shutdown(struct dm_snapshot *s)
 {
-	clear_bit_unlock(RUNNING_MERGE, &s->state_bits);
-	wake_up_bit(&s->state_bits, RUNNING_MERGE);
+	clear_and_wake_up_bit(RUNNING_MERGE, &s->state_bits);
 }
 
 static struct bio *__release_queued_bios_after_merge(struct dm_snapshot *s)
diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c
index 7ea225ce418f..61df3079899a 100644
--- a/drivers/md/dm-zoned-metadata.c
+++ b/drivers/md/dm-zoned-metadata.c
@@ -524,8 +524,7 @@ static void dmz_mblock_bio_end_io(struct bio *bio)
 	else
 		flag = DMZ_META_READING;
 
-	clear_bit_unlock(flag, &mblk->state);
-	wake_up_bit(&mblk->state, flag);
+	clear_and_wake_up_bit(flag, &mblk->state);
 
 	bio_put(bio);
 }
@@ -1915,8 +1914,7 @@ void dmz_unlock_zone_reclaim(struct dm_zone *zone)
 	WARN_ON(dmz_is_active(zone));
 	WARN_ON(!dmz_in_reclaim(zone));
 
-	clear_bit_unlock(DMZ_RECLAIM, &zone->flags);
-	wake_up_bit(&zone->flags, DMZ_RECLAIM);
+	clear_and_wake_up_bit(DMZ_RECLAIM, &zone->flags);
 }
 
 /*
diff --git a/drivers/md/dm-zoned-reclaim.c b/drivers/md/dm-zoned-reclaim.c
index 9a7dadbb8eb7..12b9606357ec 100644
--- a/drivers/md/dm-zoned-reclaim.c
+++ b/drivers/md/dm-zoned-reclaim.c
@@ -106,8 +106,7 @@ static void dmz_reclaim_kcopy_end(int read_err, unsigned long write_err,
 	else
 		zrc->kc_err = 0;
 
-	clear_bit_unlock(DMZ_RECLAIM_KCOPY, &zrc->flags);
-	wake_up_bit(&zrc->flags, DMZ_RECLAIM_KCOPY);
+	clear_and_wake_up_bit(DMZ_RECLAIM_KCOPY, &zrc->flags);
 }
 
 /*
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3fd19d478f62..f453e6d78971 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -3153,8 +3153,7 @@ static void __dm_internal_resume(struct mapped_device *md)
 		set_bit(DMF_SUSPENDED, &md->flags);
 	}
 done:
-	clear_bit(DMF_SUSPENDED_INTERNALLY, &md->flags);
-	wake_up_bit(&md->flags, DMF_SUSPENDED_INTERNALLY);
+	clear_and_wake_up_bit(DMF_SUSPENDED_INTERNALLY, &md->flags);
 }
 
 void dm_internal_suspend_noflush(struct mapped_device *md)
diff --git a/drivers/media/usb/dvb-usb-v2/dvb_usb_core.c b/drivers/media/usb/dvb-usb-v2/dvb_usb_core.c
index 7df6b0791ebd..4f9347ec0f43 100644
--- a/drivers/media/usb/dvb-usb-v2/dvb_usb_core.c
+++ b/drivers/media/usb/dvb-usb-v2/dvb_usb_core.c
@@ -375,8 +375,7 @@ static int dvb_usb_stop_feed(struct dvb_demux_feed *dvbdmxfeed)
 	usb_urb_killv2(&adap->stream);
 
 	/* clear 'streaming' status bit */
-	clear_bit(ADAP_STREAMING, &adap->state_bits);
-	wake_up_bit(&adap->state_bits, ADAP_STREAMING);
+	clear_and_wake_up_bit(ADAP_STREAMING, &adap->state_bits);
 skip_feed_stop:
 
 	if (ret)
@@ -578,10 +577,8 @@ static int dvb_usb_fe_init(struct dvb_frontend *fe)
 			goto err;
 	}
 err:
-	if (!adap->suspend_resume_active) {
-		clear_bit(ADAP_INIT, &adap->state_bits);
-		wake_up_bit(&adap->state_bits, ADAP_INIT);
-	}
+	if (!adap->suspend_resume_active)
+		clear_and_wake_up_bit(ADAP_INIT, &adap->state_bits);
 
 	dev_dbg(&d->udev->dev, "%s: ret=%d\n", __func__, ret);
 	return ret;
@@ -618,8 +615,7 @@ static int dvb_usb_fe_sleep(struct dvb_frontend *fe)
 err:
 	if (!adap->suspend_resume_active) {
 		adap->active_fe = -1;
-		clear_bit(ADAP_SLEEP, &adap->state_bits);
-		wake_up_bit(&adap->state_bits, ADAP_SLEEP);
+		clear_and_wake_up_bit(ADAP_SLEEP, &adap->state_bits);
 	}
 
 	dev_dbg(&d->udev->dev, "%s: ret=%d\n", __func__, ret);
diff --git a/fs/afs/server.c b/fs/afs/server.c
index 038f9d0ae3af..8dc3c60f8f81 100644
--- a/fs/afs/server.c
+++ b/fs/afs/server.c
@@ -688,8 +688,7 @@ bool afs_check_server_record(struct afs_operation *op, struct afs_server *server
 	if (!test_and_set_bit_lock(AFS_SERVER_FL_UPDATING, &server->flags)) {
 		clear_bit(AFS_SERVER_FL_NEEDS_UPDATE, &server->flags);
 		success = afs_update_server_record(op, server, key);
-		clear_bit_unlock(AFS_SERVER_FL_UPDATING, &server->flags);
-		wake_up_bit(&server->flags, AFS_SERVER_FL_UPDATING);
+		clear_and_wake_up_bit(AFS_SERVER_FL_UPDATING, &server->flags);
 		_leave(" = %d", success);
 		return success;
 	}
diff --git a/fs/afs/vl_probe.c b/fs/afs/vl_probe.c
index 3d2e0c925460..9850e028c44b 100644
--- a/fs/afs/vl_probe.c
+++ b/fs/afs/vl_probe.c
@@ -22,8 +22,7 @@ static void afs_finished_vl_probe(struct afs_vlserver *server)
 		clear_bit(AFS_VLSERVER_FL_RESPONDING, &server->flags);
 	}
 
-	clear_bit_unlock(AFS_VLSERVER_FL_PROBING, &server->flags);
-	wake_up_bit(&server->flags, AFS_VLSERVER_FL_PROBING);
+	clear_and_wake_up_bit(AFS_VLSERVER_FL_PROBING, &server->flags);
 }
 
 /*
diff --git a/fs/afs/volume.c b/fs/afs/volume.c
index af3a3f57c1b3..601b425cf093 100644
--- a/fs/afs/volume.c
+++ b/fs/afs/volume.c
@@ -439,9 +439,8 @@ int afs_check_volume_status(struct afs_volume *volume, struct afs_operation *op)
 		ret = afs_update_volume_status(volume, op->key);
 		if (ret < 0)
 			set_bit(AFS_VOLUME_NEEDS_UPDATE, &volume->flags);
-		clear_bit_unlock(AFS_VOLUME_WAIT, &volume->flags);
 		clear_bit_unlock(AFS_VOLUME_UPDATING, &volume->flags);
-		wake_up_bit(&volume->flags, AFS_VOLUME_WAIT);
+		clear_and_wake_up_bit(AFS_VOLUME_WAIT, &volume->flags);
 		_leave(" = %d", ret);
 		return ret;
 	}
diff --git a/fs/bcachefs/buckets.h b/fs/bcachefs/buckets.h
index edbdffd508fc..5c7cfee3707b 100644
--- a/fs/bcachefs/buckets.h
+++ b/fs/bcachefs/buckets.h
@@ -70,8 +70,7 @@ static inline void bucket_unlock(struct bucket *b)
 {
 	BUILD_BUG_ON(!((union ulong_byte_assert) { .ulong = 1UL << BUCKET_LOCK_BITNR }).byte);
 
-	clear_bit_unlock(BUCKET_LOCK_BITNR, (void *) &b->lock);
-	wake_up_bit((void *) &b->lock, BUCKET_LOCK_BITNR);
+	clear_and_wake_up_bit(BUCKET_LOCK_BITNR, (void *) &b->lock);
 }
 
 static inline void bucket_lock(struct bucket *b)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0b9cb4c87adf..ab70f92f41bf 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1761,8 +1761,7 @@ static void end_bbio_meta_write(struct btrfs_bio *bbio)
 		bio_offset += len;
 	}
 
-	clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
-	wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
+	clear_and_wake_up_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
 
 	bio_put(&bbio->bio);
 }
@@ -3504,8 +3503,7 @@ void set_extent_buffer_uptodate(struct extent_buffer *eb)
 
 static void clear_extent_buffer_reading(struct extent_buffer *eb)
 {
-	clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
-	wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
+	clear_and_wake_up_bit(EXTENT_BUFFER_READING, &eb->bflags);
 }
 
 static void end_bbio_meta_read(struct btrfs_bio *bbio)
diff --git a/fs/buffer.c b/fs/buffer.c
index 2932618c88e4..2b55fad8bfc9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -74,8 +74,7 @@ EXPORT_SYMBOL(__lock_buffer);
 
 void unlock_buffer(struct buffer_head *bh)
 {
-	clear_bit_unlock(BH_Lock, &bh->b_state);
-	wake_up_bit(&bh->b_state, BH_Lock);
+	clear_and_wake_up_bit(BH_Lock, &bh->b_state);
 }
 EXPORT_SYMBOL(unlock_buffer);
 
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index e1afe9aa7c2a..6b310a02c66d 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -379,8 +379,7 @@ static inline bool may_grant(struct gfs2_glock *gl,
 
 static void gfs2_holder_wake(struct gfs2_holder *gh)
 {
-	clear_bit(HIF_WAIT, &gh->gh_iflags);
-	wake_up_bit(&gh->gh_iflags, HIF_WAIT);
+	clear_and_wake_up_bit(HIF_WAIT, &gh->gh_iflags);
 	if (gh->gh_flags & GL_ASYNC) {
 		struct gfs2_sbd *sdp = gh->gh_gl->gl_name.ln_sbd;
 
@@ -574,8 +573,7 @@ static void gfs2_set_demote(struct gfs2_glock *gl)
 static void gfs2_demote_wake(struct gfs2_glock *gl)
 {
 	gl->gl_demote_state = LM_ST_EXCLUSIVE;
-	clear_bit(GLF_DEMOTE, &gl->gl_flags);
-	wake_up_bit(&gl->gl_flags, GLF_DEMOTE);
+	clear_and_wake_up_bit(GLF_DEMOTE, &gl->gl_flags);
 }
 
 /**
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 95d8081681dc..2c8bc1dce8d1 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -276,8 +276,7 @@ static void gfs2_clear_glop_pending(struct gfs2_inode *ip)
 	if (!ip)
 		return;
 
-	clear_bit_unlock(GIF_GLOP_PENDING, &ip->i_flags);
-	wake_up_bit(&ip->i_flags, GIF_GLOP_PENDING);
+	clear_and_wake_up_bit(GIF_GLOP_PENDING, &ip->i_flags);
 }
 
 /**
@@ -644,8 +643,7 @@ static void inode_go_unlocked(struct gfs2_glock *gl)
 	 * to NULL by this point in its lifecycle. */
 	if (!test_bit(GLF_UNLOCKED, &gl->gl_flags))
 		return;
-	clear_bit_unlock(GLF_UNLOCKED, &gl->gl_flags);
-	wake_up_bit(&gl->gl_flags, GLF_UNLOCKED);
+	clear_and_wake_up_bit(GLF_UNLOCKED, &gl->gl_flags);
 }
 
 /**
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 921b26b96192..d5ac3751f66d 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1223,8 +1223,7 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
 	if (!test_bit(DFL_UNMOUNT, &ls->ls_recover_flags))
 		queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);
 
-	clear_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
-	wake_up_bit(&ls->ls_recover_flags, DFL_DLM_RECOVERY);
+	clear_and_wake_up_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
 	spin_unlock(&ls->ls_recover_spin);
 }
 
@@ -1364,8 +1363,7 @@ static int gdlm_mount(struct gfs2_sbd *sdp, const char *table)
 	}
 
 	ls->ls_first = !!test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
-	clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
-	wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
+	clear_and_wake_up_bit(SDF_NOJOURNALID, &sdp->sd_flags);
 	return 0;
 
 fail_release:
diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index e70cf003d524..c162d46903ae 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -557,8 +557,7 @@ void gfs2_recover_func(struct work_struct *work)
 	jd->jd_recover_error = error;
 	gfs2_recovery_done(sdp, jd->jd_jid, LM_RD_GAVEUP);
 done:
-	clear_bit(JDF_RECOVERY, &jd->jd_flags);
-	wake_up_bit(&jd->jd_flags, JDF_RECOVERY);
+	clear_and_wake_up_bit(JDF_RECOVERY, &jd->jd_flags);
 }
 
 int gfs2_recover_journal(struct gfs2_jdesc *jd, bool wait)
diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index 738337e35724..0c998c92cb6d 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -586,8 +586,7 @@ static ssize_t jid_store(struct gfs2_sbd *sdp, const char *buf, size_t len)
 	if (sdp->sd_args.ar_spectator && jid > 0)
 		rv = jid = -EINVAL;
 	sdp->sd_lockstruct.ls_jid = jid;
-	clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
-	wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
+	clear_and_wake_up_bit(SDF_NOJOURNALID, &sdp->sd_flags);
 out:
 	spin_unlock(&sdp->sd_jindex_spin);
 	return rv ? rv : len;
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 2818f3891498..111aba955084 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -352,8 +352,7 @@ int gfs2_withdraw(struct gfs2_sbd *sdp)
 		}
 		fs_err(sdp, "File system withdrawn\n");
 		dump_stack();
-		clear_bit(SDF_WITHDRAW_IN_PROG, &sdp->sd_flags);
-		wake_up_bit(&sdp->sd_flags, SDF_WITHDRAW_IN_PROG);
+		clear_and_wake_up_bit(SDF_WITHDRAW_IN_PROG, &sdp->sd_flags);
 	}
 
 	if (sdp->sd_args.ar_errors == GFS2_ERRORS_PANIC)
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 55ba3f62fbe3..f4bf090b59e9 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -38,10 +38,8 @@ static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 		set_buffer_uptodate(bh);
 	else
 		clear_buffer_uptodate(bh);
-	if (orig_bh) {
-		clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
-		wake_up_bit(&orig_bh->b_state, BH_Shadow);
-	}
+	if (orig_bh)
+		clear_and_wake_up_bit(BH_Shadow, &orig_bh->b_state);
 	unlock_buffer(bh);
 }
 
diff --git a/fs/netfs/fscache_volume.c b/fs/netfs/fscache_volume.c
index c6c43a87f56e..7b8aacf48172 100644
--- a/fs/netfs/fscache_volume.c
+++ b/fs/netfs/fscache_volume.c
@@ -322,8 +322,7 @@ void fscache_create_volume(struct fscache_volume *volume, bool wait)
 	}
 	return;
 no_wait:
-	clear_bit_unlock(FSCACHE_VOLUME_CREATING, &volume->flags);
-	wake_up_bit(&volume->flags, FSCACHE_VOLUME_CREATING);
+	clear_and_wake_up_bit(FSCACHE_VOLUME_CREATING, &volume->flags);
 }
 
 /*
diff --git a/fs/netfs/io.c b/fs/netfs/io.c
index ebae3cfcad20..4782ffd75fa5 100644
--- a/fs/netfs/io.c
+++ b/fs/netfs/io.c
@@ -272,8 +272,7 @@ static void netfs_rreq_assess(struct netfs_io_request *rreq, bool was_async)
 		netfs_rreq_assess_dio(rreq);
 
 	trace_netfs_rreq(rreq, netfs_rreq_trace_wake_ip);
-	clear_bit_unlock(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
-	wake_up_bit(&rreq->flags, NETFS_RREQ_IN_PROGRESS);
+	clear_and_wake_up_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
 
 	netfs_rreq_completed(rreq, was_async);
 }
diff --git a/fs/netfs/write_collect.c b/fs/netfs/write_collect.c
index 426cf87aaf2e..2b8c2c8d802a 100644
--- a/fs/netfs/write_collect.c
+++ b/fs/netfs/write_collect.c
@@ -582,8 +582,7 @@ static void netfs_collect_write_results(struct netfs_io_request *wreq)
 		goto need_retry;
 	if ((notes & MADE_PROGRESS) && test_bit(NETFS_RREQ_PAUSE, &wreq->flags)) {
 		trace_netfs_rreq(wreq, netfs_rreq_trace_unpause);
-		clear_bit_unlock(NETFS_RREQ_PAUSE, &wreq->flags);
-		wake_up_bit(&wreq->flags, NETFS_RREQ_PAUSE);
+		clear_and_wake_up_bit(NETFS_RREQ_PAUSE, &wreq->flags);
 	}
 
 	if (notes & NEED_REASSESS) {
@@ -686,8 +685,7 @@ void netfs_write_collection_worker(struct work_struct *work)
 
 	_debug("finished");
 	trace_netfs_rreq(wreq, netfs_rreq_trace_wake_ip);
-	clear_bit_unlock(NETFS_RREQ_IN_PROGRESS, &wreq->flags);
-	wake_up_bit(&wreq->flags, NETFS_RREQ_IN_PROGRESS);
+	clear_and_wake_up_bit(NETFS_RREQ_IN_PROGRESS, &wreq->flags);
 
 	if (wreq->iocb) {
 		size_t written = min(wreq->transferred, wreq->len);
@@ -795,8 +793,7 @@ void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error,
 
 	trace_netfs_sreq(subreq, netfs_sreq_trace_terminated);
 
-	clear_bit_unlock(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
-	wake_up_bit(&subreq->flags, NETFS_SREQ_IN_PROGRESS);
+	clear_and_wake_up_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
 
 	/* If we are at the head of the queue, wake up the collector,
 	 * transferring a ref to it if we were the ones to do so.
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 19d175446899..34b9e5366983 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1428,8 +1428,7 @@ int nfs_clear_invalid_mapping(struct address_space *mapping)
 	ret = nfs_invalidate_mapping(inode, mapping);
 	trace_nfs_invalidate_mapping_exit(inode, ret);
 
-	clear_bit_unlock(NFS_INO_INVALIDATING, bitlock);
-	wake_up_bit(bitlock, NFS_INO_INVALIDATING);
+	clear_and_wake_up_bit(NFS_INO_INVALIDATING, bitlock);
 out:
 	return ret;
 }
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 877fc154eb2b..4254de655d5e 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -2055,8 +2055,7 @@ static void pnfs_clear_first_layoutget(struct pnfs_layout_hdr *lo)
 {
 	unsigned long *bitlock = &lo->plh_flags;
 
-	clear_bit_unlock(NFS_LAYOUT_FIRST_LAYOUTGET, bitlock);
-	wake_up_bit(bitlock, NFS_LAYOUT_FIRST_LAYOUTGET);
+	clear_and_wake_up_bit(NFS_LAYOUT_FIRST_LAYOUTGET, bitlock);
 }
 
 static void _add_to_server_list(struct pnfs_layout_hdr *lo,
@@ -3227,8 +3226,7 @@ static void pnfs_clear_layoutcommitting(struct inode *inode)
 {
 	unsigned long *bitlock = &NFS_I(inode)->flags;
 
-	clear_bit_unlock(NFS_INO_LAYOUTCOMMITTING, bitlock);
-	wake_up_bit(bitlock, NFS_INO_LAYOUTCOMMITTING);
+	clear_and_wake_up_bit(NFS_INO_LAYOUTCOMMITTING, bitlock);
 }
 
 /*
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 3e1a434bc649..0d25ad5eeae3 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -1895,9 +1895,7 @@ nfsd4_cltrack_upcall_lock(struct nfs4_client *clp)
 static void
 nfsd4_cltrack_upcall_unlock(struct nfs4_client *clp)
 {
-	smp_mb__before_atomic();
-	clear_bit(NFSD4_CLIENT_UPCALL_LOCK, &clp->cl_flags);
-	wake_up_bit(&clp->cl_flags, NFSD4_CLIENT_UPCALL_LOCK);
+	clear_and_wake_up_bit(NFSD4_CLIENT_UPCALL_LOCK, &clp->cl_flags);
 }
 
 static void
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index d156ac7637cf..bd948cca34b6 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3076,8 +3076,7 @@ nfsd4_cb_getattr_release(struct nfsd4_callback *cb)
 			container_of(ncf, struct nfs4_delegation, dl_cb_fattr);
 
 	nfs4_put_stid(&dp->dl_stid);
-	clear_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags);
-	wake_up_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY);
+	clear_and_wake_up_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags);
 }
 
 static const struct nfsd4_callback_ops nfsd4_cb_recall_any_ops = {
diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index d620b56a1002..c03bd14c1bc7 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -313,8 +313,7 @@ int orangefs_revalidate_mapping(struct inode *inode)
 	orangefs_inode->mapping_time = jiffies +
 	    orangefs_cache_timeout_msecs*HZ/1000;
 
-	clear_bit(1, bitlock);
-	wake_up_bit(bitlock, 1);
+	clear_and_wake_up_bit(1, bitlock);
 
 	return ret;
 }
diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c
index d2307162a2de..ff30d11f8e45 100644
--- a/fs/smb/client/connect.c
+++ b/fs/smb/client/connect.c
@@ -4257,8 +4257,7 @@ cifs_sb_tlink(struct cifs_sb_info *cifs_sb)
 	}
 
 	tlink->tl_tcon = cifs_construct_tcon(cifs_sb, fsuid);
-	clear_bit(TCON_LINK_PENDING, &tlink->tl_flags);
-	wake_up_bit(&tlink->tl_flags, TCON_LINK_PENDING);
+	clear_and_wake_up_bit(TCON_LINK_PENDING, &tlink->tl_flags);
 
 	if (IS_ERR(tlink->tl_tcon)) {
 		cifs_put_tlink(tlink);
diff --git a/fs/smb/client/inode.c b/fs/smb/client/inode.c
index 8da74f15cc95..fd0e3d1e49cb 100644
--- a/fs/smb/client/inode.c
+++ b/fs/smb/client/inode.c
@@ -2510,8 +2510,7 @@ cifs_revalidate_mapping(struct inode *inode)
 	}
 
 skip_invalidate:
-	clear_bit_unlock(CIFS_INO_LOCK, flags);
-	wake_up_bit(flags, CIFS_INO_LOCK);
+	clear_and_wake_up_bit(CIFS_INO_LOCK, flags);
 
 	return rc;
 }
diff --git a/fs/smb/client/misc.c b/fs/smb/client/misc.c
index c6f11e6f9eb9..455d50102d93 100644
--- a/fs/smb/client/misc.c
+++ b/fs/smb/client/misc.c
@@ -594,8 +594,8 @@ int cifs_get_writer(struct cifsInodeInfo *cinode)
 	if (test_bit(CIFS_INODE_PENDING_OPLOCK_BREAK, &cinode->flags)) {
 		cinode->writers--;
 		if (cinode->writers == 0) {
-			clear_bit(CIFS_INODE_PENDING_WRITERS, &cinode->flags);
-			wake_up_bit(&cinode->flags, CIFS_INODE_PENDING_WRITERS);
+			clear_and_wake_up_bit(CIFS_INODE_PENDING_WRITERS,
+					      &cinode->flags);
 		}
 		spin_unlock(&cinode->writers_lock);
 		goto start;
@@ -608,10 +608,9 @@ void cifs_put_writer(struct cifsInodeInfo *cinode)
 {
 	spin_lock(&cinode->writers_lock);
 	cinode->writers--;
-	if (cinode->writers == 0) {
-		clear_bit(CIFS_INODE_PENDING_WRITERS, &cinode->flags);
-		wake_up_bit(&cinode->flags, CIFS_INODE_PENDING_WRITERS);
-	}
+	if (cinode->writers == 0)
+		clear_and_wake_up_bit(CIFS_INODE_PENDING_WRITERS,
+				      &cinode->flags);
 	spin_unlock(&cinode->writers_lock);
 }
 
@@ -640,8 +639,7 @@ void cifs_queue_oplock_break(struct cifsFileInfo *cfile)
 
 void cifs_done_oplock_break(struct cifsInodeInfo *cinode)
 {
-	clear_bit(CIFS_INODE_PENDING_OPLOCK_BREAK, &cinode->flags);
-	wake_up_bit(&cinode->flags, CIFS_INODE_PENDING_OPLOCK_BREAK);
+	clear_and_wake_up_bit(CIFS_INODE_PENDING_OPLOCK_BREAK, &cinode->flags);
 }
 
 bool
diff --git a/fs/smb/server/oplock.c b/fs/smb/server/oplock.c
index b039b242df46..40adc1d42f96 100644
--- a/fs/smb/server/oplock.c
+++ b/fs/smb/server/oplock.c
@@ -593,8 +593,7 @@ static void wait_for_break_ack(struct oplock_info *opinfo)
 
 static void wake_up_oplock_break(struct oplock_info *opinfo)
 {
-	clear_bit_unlock(0, &opinfo->pending_break);
-	wake_up_bit(&opinfo->pending_break, 0);
+	clear_and_wake_up_bit(0, &opinfo->pending_break);
 }
 
 static int oplock_break_pending(struct oplock_info *opinfo, int req_op_level)
diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
index 226b37017d56..91f65d9ac82f 100644
--- a/net/bluetooth/hci_event.c
+++ b/net/bluetooth/hci_event.c
@@ -104,8 +104,7 @@ static u8 hci_cc_inquiry_cancel(struct hci_dev *hdev, void *data,
 	if (rp->status)
 		return rp->status;
 
-	clear_bit(HCI_INQUIRY, &hdev->flags);
-	wake_up_bit(&hdev->flags, HCI_INQUIRY);
+	clear_and_wake_up_bit(HCI_INQUIRY, &hdev->flags);
 
 	hci_dev_lock(hdev);
 	/* Set discovery state to stopped if we're not doing LE active
diff --git a/security/keys/gc.c b/security/keys/gc.c
index 7d687b0962b1..0033f8546fa9 100644
--- a/security/keys/gc.c
+++ b/security/keys/gc.c
@@ -317,8 +317,7 @@ static void key_garbage_collector(struct work_struct *work)
 	if (unlikely(gc_state & KEY_GC_REAPING_DEAD_3)) {
 		kdebug("dead wake");
 		smp_mb();
-		clear_bit(KEY_GC_REAPING_KEYTYPE, &key_gc_flags);
-		wake_up_bit(&key_gc_flags, KEY_GC_REAPING_KEYTYPE);
+		clear_and_wake_up_bit(KEY_GC_REAPING_KEYTYPE, &key_gc_flags);
 	}
 
 	if (gc_state & KEY_GC_REAP_AGAIN)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile
  2024-08-19  5:20 [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile NeilBrown
                   ` (8 preceding siblings ...)
  2024-08-19  5:20 ` [PATCH 9/9] Use clear_and_wake_up_bit() where appropriate NeilBrown
@ 2024-08-19  6:13 ` Linus Torvalds
  2024-08-20 21:47   ` NeilBrown
  2024-08-19  8:16 ` Christian Brauner
  10 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2024-08-19  6:13 UTC (permalink / raw)
  To: NeilBrown; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, linux-fsdevel

On Sun, 18 Aug 2024 at 22:36, NeilBrown <neilb@suse.de> wrote:
>
> The main patches here are 7 and 8 which revise wake_up_bit and
> wake_up_var respectively.  They result in 3 interfaces:
>   wake_up_{bit,var}           includes smp_mb__after_atomic()

I actually think this is even worse than the current model, in that
now it subtly only works after atomic ops, and it's not obvious from
the name.

At least the current model, correct code looks like

      do_some_atomic_op
      smp_mb__after_atomic()
      wake_up_{bit,var}

and the smp_mb__after_atomic() makes sense and pairs with the atomic.
So the current one may be complex, but at the same time it's also
explicit. Your changed interface is still complex, but now it's even
less obvious what is actually going on.

With your suggested interface, a plain "wake_up_{bit,var}" only works
after atomic ops, and other ops have to magically know that they
should use the _mb() version or whatever. And somebody who doesn't
understand that subtlety, and copies the code (but changes the op from
an atomic one to something else) now introduces code that looks fine,
but is really subtly wrong.

The reason for the barrier is for the serialization with the
waitqueue_active() check. Honestly, if you worry about correctness
here, I think you should leave the existing wake_up_{bit,var}() alone,
and concentrate on having helpers that do the whole "set and wake up".

IOW, I do not think you should change existing semantics, but *this*
kind of pattern:

>  [PATCH 2/9] Introduce atomic_dec_and_wake_up_var().
>  [PATCH 9/9] Use clear_and_wake_up_bit() where appropriate.

sounds like a good idea.

IOW, once you have a whole "atomic_dec_and_wake_up()" (skip the "_var"
- it's implied by the fact that it's an atomic_dec), *then* that
function makes for a simple-to-use model, and now the "atomic_dec(),
the smp_mb__after_atomic(), and the wake_up_var()" are all together.

For all the same reasons, it makes total sense to have
"clear_bit_and_wake()" etc.

But exposing those "three different memory barrier scenarios" as three
different helpers is the *opposite* of helpful. It keeps the current
complexity, and makes it worse by making the barrier rules even more
opaque, imho.

               Linus


* Re: [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile
  2024-08-19  5:20 [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile NeilBrown
                   ` (9 preceding siblings ...)
  2024-08-19  6:13 ` [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile Linus Torvalds
@ 2024-08-19  8:16 ` Christian Brauner
  2024-08-19 17:45   ` Linus Torvalds
  2024-08-19 20:52   ` NeilBrown
  10 siblings, 2 replies; 38+ messages in thread
From: Christian Brauner @ 2024-08-19  8:16 UTC (permalink / raw)
  To: NeilBrown, Peter Zijlstra
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, linux-kernel,
	linux-fsdevel

On Mon, Aug 19, 2024 at 03:20:34PM GMT, NeilBrown wrote:
> I wasn't really sure who to send this too, and get_maintainer.pl
> suggested 132 addresses which seemed excessive.  So I've focussed on
> 'sched' maintainers.  I'll probably submit individual patches to
> relevant maintainers/lists if I get positive feedback at this level.
> 
> This series was motivated by 
> 
>    Commit ed0172af5d6f ("SUNRPC: Fix a race to wake a sync task")
> 
> which adds smp_mb__after_atomic().  I thought "any API that requires that
> sort of thing needs to be fixed".
> 
> The main patches here are 7 and 8 which revise wake_up_bit and
> wake_up_var respectively.  They result in 3 interfaces:
>   wake_up_{bit,var}           includes smp_mb__after_atomic()
>   wake_up_{bit,var}_relaxed() doesn't have a barrier
>   wake_up_{bit,var}_mb()      includes smb_mb().

It's great that this api is brought up because it gives me a reason to
ask a stupid question I've had for a while.

I want to change the i_state member in struct inode from an unsigned
long to a u32 because really we're wasting 4 bytes on 64 bit that we're
never going to use given how few I_* flags we actually have and I
dislike that we use that vacuous type in a bunch of our structures for
that reason.

(Together with another 4 byte shrinkage we would get a whopping 8 bytes
back.)

The problem is that we currently use wait_on_bit() and wake_up_bit() in
various places on i_state and all of these functions require an unsigned
long (probably because some architectures only handle atomic ops on
unsigned long).

I have an experimental patch converting all of that from wait_on_bit()
to wait_var_event() but I was hesitant about it because iiuc the
semantics don't nicely translate into each other. Specifically, if some
code uses wait_on_bit(SOME_BIT) and someone calls
wake_up_bit(SOME_OTHER_BIT) then iiuc only SOME_OTHER_BIT waiters will
be woken. IOW, SOME_BIT waiters are unaffected.

But if this is switched to wait_var_event() and wake_up_var() and there
are two waiters e.g.,

W1: wait_var_event(inode->i_state, !(inode->i_state & SOME_BIT))
W2: wait_var_event(inode->i_state, !(inode->i_state & SOME_OTHER_BIT))

then a waker like

inode->i_state &= ~SOME_OTHER_BIT;
// insert appropriate barrier
wake_up_var(inode->i_state)

will cause both W1 and W2 to be woken but W1 is put back to sleep? Is
there a nicer way to do this? The only thing I want is a (pony) 32bit
bit-wait-like mechanism.


* Re: [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile
  2024-08-19  8:16 ` Christian Brauner
@ 2024-08-19 17:45   ` Linus Torvalds
  2024-08-19 20:52   ` NeilBrown
  1 sibling, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2024-08-19 17:45 UTC (permalink / raw)
  To: Christian Brauner
  Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, linux-kernel,
	linux-fsdevel

On Mon, 19 Aug 2024 at 01:16, Christian Brauner <brauner@kernel.org> wrote:
>
> The problem is that we currently use wait_on_bit() and wake_up_bit() in
> various places on i_state and all of these functions require an unsigned
> long (probably because some architectures only handle atomic ops on
> unsigned long).

It's actually mostly because of endianness, not atomicity.

The whole "flags in one single unsigned long" is a very traditional
Linux pattern. Even originally, when "unsigned", "unsigned int", and
"unsigned long" were all the same 32-bit thing, the pattern for the
kernel was "unsigned long" for the native architecture accesses.

It may not be a *great* pattern, and arguably it should always have
been "flags in a single unsigned int" (which was the same thing back
in the days). But hey, hindsight is 20:20.

[ And I say "arguably", not "obviously".

  The kernel basically takes the approach that "unsigned long" is the
size of GP registers, and so "array of unsigned long" is in some
respect fundamentally more efficient than "array of unsigned int",
because you can very naturally do operations in bigger chunks.

  So bitops working on some more architecture-neutral size like just
bytes or "u32" or "unsigned int" would have advantages, but "unsigned
long" in many ways is also a real advantage, and can have serious
alignment advantages for the simple cases ]

And it turns out byte order matters. Because you absolutely want to be
able to mix things like simple initializers, ie

     unsigned long flags = 1ul << BITPOS;
     ...
     clear_bit(BITPOS, &flags);

then the clear_bit() really _fundamentally_ works only on arrays of
"unsigned long", because on big-endian machines the bit position
really depends on the chunk size.

And yes, big-endianness is a disease (and yes, bit ordering is
literally one of the reasons), and it's happily mostly gone, but sadly
"mostly" is not "entirely".

So we are more or less stuck with "bit operations fundamentally work
on arrays of unsigned long", and no, you *cannot* use them on smaller
types because of the horror that is big-endianness.

Could we do "u32 bitops"? Yeah, but because of all the endianness
problems, it really would end up having to be a whole new set of
interfaces.

We do, btw, have a special case: we support the notion of
"little-endian bitmaps".  When you actually have data structures that
are binary objects across architectures, you need to have a sane
*portable* bitmap representation, and the only sane model is the
little-endian one that basically is the same across any type size (ie
"bit 0" is the same physical bit in a byte array as it is in a word
array and in a unsigned long array).

So you have things like filesystems use this model with test_bit_le()
and friends. But note that that does add extra overhead on BE
machines. And while we probably *should* have just said that the
normal bitops are always little-endian, we didn't, because by the time
BE machines were an issue, we already had tons of those simple
initializers (that would now need to use some helper macro to do the
bit swizzling on big-endian HW).

So I suspect we're kind of stuck with it. If you use the bit
operations - including the wait/wake_bit stuff - you *have* to use
"unsigned long".

Note that the "var_wait" versions don't actually care about the size
of the variable. They never look at the value in memory, so they
basically just treat the address of the variable as a cookie for
waiting. So you can use the "var" versions with absolutely anything.

[ Side note: the wake_up_bit() interface is broken garbage. It uses
"void *word" for the word. That's very dangerous because it allows
type mis-use without warnings. I didn't notice until I checked.
Thankfully wait_on_bit() itself has the right signature, so hopefully
nobody ever does that ]

            Linus


* Re: [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile
  2024-08-19  8:16 ` Christian Brauner
  2024-08-19 17:45   ` Linus Torvalds
@ 2024-08-19 20:52   ` NeilBrown
  2024-08-19 21:12     ` Linus Torvalds
  1 sibling, 1 reply; 38+ messages in thread
From: NeilBrown @ 2024-08-19 20:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Peter Zijlstra, Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	linux-kernel, linux-fsdevel

On Mon, 19 Aug 2024, Christian Brauner wrote:
> On Mon, Aug 19, 2024 at 03:20:34PM GMT, NeilBrown wrote:
> > I wasn't really sure who to send this too, and get_maintainer.pl
> > suggested 132 addresses which seemed excessive.  So I've focussed on
> > 'sched' maintainers.  I'll probably submit individual patches to
> > relevant maintainers/lists if I get positive feedback at this level.
> > 
> > This series was motivated by 
> > 
> >    Commit ed0172af5d6f ("SUNRPC: Fix a race to wake a sync task")
> > 
> > which adds smp_mb__after_atomic().  I thought "any API that requires that
> > sort of thing needs to be fixed".
> > 
> > The main patches here are 7 and 8 which revise wake_up_bit and
> > wake_up_var respectively.  They result in 3 interfaces:
> >   wake_up_{bit,var}           includes smp_mb__after_atomic()
> >   wake_up_{bit,var}_relaxed() doesn't have a barrier
> >   wake_up_{bit,var}_mb()      includes smb_mb().
> 
> It's great that this api is brought up because it gives me a reason to
> ask a stupid question I've had for a while.
> 
> I want to change the i_state member in struct inode from an unsigned
> long to a u32 because really we're wasting 4 bytes on 64 bit that we're
> never going to use given how little I_* flags we actually have and I
> dislike that we use that vacuous type in a bunch of our structures for
> that reason.
> 
> (Together with another 4 byte shrinkage we would get a whopping 8 bytes
> back.)
> 
> The problem is that we currently use wait_on_bit() and wake_up_bit() in
> various places on i_state and all of these functions require an unsigned
> long (probably because some architectures only handle atomic ops on
> unsigned long).

i_state contains two bits that are used for wake_up - I_NEW and I_SYNC,
one virtual bit that is used for wake_up - I_DIO_WAKEUP - and 15 others.
You could fit those in a short and two bools which gives you three
different addresses to pass to wake_up_var().
Doing that would make it a little difficult to test for 
   I_NEW | I_FREEING | I_WILL_FREE
but you could probably make an inline helper that the compiler will
optimise effectively.

Alternatively you could union the "u32" with an array of 4 chars to give
you 4 addresses for wakeup.

Both of these would be effective, but a bit hackish.  If you want
something clean we would need a new interface.  Maybe
wake_var_bit()/wait_var_bit_event().
It could store the bit in wait_bit_key.bit_nr as "-2-n" or similar.  Or
better still, add another field: enum { WAIT_BIT, WAIT_VAR, WAIT_VAR_BIT } wait_type
to wait_bit_key.

I would probably go with the union approach and re-order the bits so that
the ones you want to use for wake_up are less than sizeof(u32).

NeilBrown


* Re: [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile
  2024-08-19 20:52   ` NeilBrown
@ 2024-08-19 21:12     ` Linus Torvalds
  2024-08-20 16:06       ` [PATCH RFC 0/5] inode: turn i_state into u32 Christian Brauner
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2024-08-19 21:12 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Peter Zijlstra, Ingo Molnar, linux-kernel,
	linux-fsdevel

On Mon, 19 Aug 2024 at 13:52, NeilBrown <neilb@suse.de> wrote:
>
> You could fit those in a short and two bools which gives you three
> different addresses to pass to wake_up_var().

You don't actually have to even do that.

The address passed to 'wake_up_var()' doesn't actually have to *match*
anything. It's used purely as a cookie.

So you can literally do something like

   #define inode_state(X,inode) ((X)+(char *)&(inode)->i_state)

and then just use inode_state(0/1/2,inode) for waiting/waking the
different bits (and the numbers 0/1/2 do not have to bear any relation
to the bit numbers, although you may obviously do that).

              Linus


* Re: [PATCH 5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event()
  2024-08-19  5:20 ` [PATCH 5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event() NeilBrown
@ 2024-08-20  4:18   ` Dave Chinner
  2024-08-20 21:52     ` NeilBrown
  2024-08-21  7:10   ` kernel test robot
  1 sibling, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2024-08-20  4:18 UTC (permalink / raw)
  To: NeilBrown
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, linux-kernel,
	linux-fsdevel

On Mon, Aug 19, 2024 at 03:20:39PM +1000, NeilBrown wrote:
> bd_prepare_to_claim() current uses a bit waitqueue with a matching
> wake_up_bit() in bd_clear_claiming().  However it is really waiting on a
> "var", not a "bit".
> 
> So change to wake_up_var(), and use ___wait_var_event() for the waiting.
> Using the triple-underscore version allows us to drop the mutex across
> the schedule() call.
....
> @@ -535,33 +535,23 @@ int bd_prepare_to_claim(struct block_device *bdev, void *holder,
>  		const struct blk_holder_ops *hops)
>  {
>  	struct block_device *whole = bdev_whole(bdev);
> +	int err = 0;
>  
>  	if (WARN_ON_ONCE(!holder))
>  		return -EINVAL;
> -retry:
> -	mutex_lock(&bdev_lock);
> -	/* if someone else claimed, fail */
> -	if (!bd_may_claim(bdev, holder, hops)) {
> -		mutex_unlock(&bdev_lock);
> -		return -EBUSY;
> -	}
> -
> -	/* if claiming is already in progress, wait for it to finish */
> -	if (whole->bd_claiming) {
> -		wait_queue_head_t *wq = bit_waitqueue(&whole->bd_claiming, 0);
> -		DEFINE_WAIT(wait);
>  
> -		prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
> -		mutex_unlock(&bdev_lock);
> -		schedule();
> -		finish_wait(wq, &wait);
> -		goto retry;
> -	}
> +	mutex_lock(&bdev_lock);
> +	___wait_var_event(&whole->bd_claiming,
> +			  (err = bd_may_claim(bdev, holder, hops)) != 0 || !whole->bd_claiming,
> +			  TASK_UNINTERRUPTIBLE, 0, 0,
> +			  mutex_unlock(&bdev_lock); schedule(); mutex_lock(&bdev_lock));

That's not an improvement. Instead of nice, obvious, readable code,
I now have to go look at a macro and manually substitute the
parameters to work out what this abomination actually does.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/9] Introduce atomic_dec_and_wake_up_var().
  2024-08-19  5:20 ` [PATCH 2/9] Introduce atomic_dec_and_wake_up_var() NeilBrown
@ 2024-08-20  5:47   ` kernel test robot
  2024-08-21  5:23   ` kernel test robot
  1 sibling, 0 replies; 38+ messages in thread
From: kernel test robot @ 2024-08-20  5:47 UTC (permalink / raw)
  To: NeilBrown, Ingo Molnar, Peter Zijlstra, Linus Torvalds
  Cc: oe-kbuild-all, LKML, linux-fsdevel

Hi NeilBrown,

kernel test robot noticed the following build errors:

[auto build test ERROR on trondmy-nfs/linux-next]
[also build test ERROR on gfs2/for-next device-mapper-dm/for-next linus/master v6.11-rc4 next-20240819]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/NeilBrown/i915-remove-wake_up-on-I915_RESET_MODESET/20240819-134414
base:   git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
patch link:    https://lore.kernel.org/r/20240819053605.11706-3-neilb%40suse.de
patch subject: [PATCH 2/9] Introduce atomic_dec_and_wake_up_var().
config: i386-allmodconfig (https://download.01.org/0day-ci/archive/20240820/202408201216.GvF1K3RG-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240820/202408201216.GvF1K3RG-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408201216.GvF1K3RG-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from drivers/gpu/drm/i915/i915_request.c:2277:
   drivers/gpu/drm/i915/selftests/i915_request.c: In function 'wake_all':
>> drivers/gpu/drm/i915/selftests/i915_request.c:1533:16: error: implicit declaration of function 'atomic_dec_and_wakeup_var'; did you mean 'atomic_dec_and_wake_up_var'? [-Werror=implicit-function-declaration]
    1533 |         return atomic_dec_and_wakeup_var(&i915->selftest.counter);
         |                ^~~~~~~~~~~~~~~~~~~~~~~~~
         |                atomic_dec_and_wake_up_var
   cc1: some warnings being treated as errors


vim +1533 drivers/gpu/drm/i915/selftests/i915_request.c

  1530	
  1531	static bool wake_all(struct drm_i915_private *i915)
  1532	{
> 1533		return atomic_dec_and_wakeup_var(&i915->selftest.counter);
  1534	}
  1535	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/9] Use wait_var_event() instead of I_DIO_WAKEUP
  2024-08-19  5:20 ` [PATCH 4/9] Use wait_var_event() instead of I_DIO_WAKEUP NeilBrown
@ 2024-08-20  7:22   ` Christian Brauner
  2024-08-20 19:12     ` Christian Brauner
  0 siblings, 1 reply; 38+ messages in thread
From: Christian Brauner @ 2024-08-20  7:22 UTC (permalink / raw)
  To: NeilBrown
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, linux-kernel,
	linux-fsdevel

On Mon, Aug 19, 2024 at 03:20:38PM GMT, NeilBrown wrote:
> inode_dio_wait() is essentially an open-coded version of
> wait_var_event().  Similarly inode_dio_wait_interruptible() is an
> open-coded version of wait_var_event_interruptible().
> 
> If we switch to waiting on the var, instead of an imaginary bit, the
> code is more transparent, is shorter, and we can discard I_DIO_WAKEUP.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---

Neil, I've sent a patch for this last week already removing
__I_DIO_WAKEUP and it's in -next as
0009dc756e81 ("inode: remove __I_DIO_WAKEUP"). So you can drop this
patch, please.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH RFC 0/5] inode: turn i_state into u32
  2024-08-19 21:12     ` Linus Torvalds
@ 2024-08-20 16:06       ` Christian Brauner
  2024-08-20 16:06         ` [PATCH RFC 1/5] fs: add i_state helpers Christian Brauner
                           ` (4 more replies)
  0 siblings, 5 replies; 38+ messages in thread
From: Christian Brauner @ 2024-08-20 16:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
	Jeff Layton, Jan Kara, linux-fsdevel

Hey,

I've recently looked for some free space in struct inode again because
of some exec kerfuffle we recently had and while my idea didn't turn
into anything I noticed that we often waste bytes when using wait bit
operations. So I set out to switch that to another mechanism that would
allow us to free up bytes. This is an attempt to turn i_state from
an unsigned long into a u32 using the address of the i_state bit in the
wait var event mechanism.

This survives LTP, xfstests on various filesystems, and will-it-scale.
It's possible I got it all wrong but I want to have the RFC out.

---
---
base-commit: 01e603fb789c75b3a0c63bddd42a42a710da7a52
change-id: 20240820-work-i_state-4e34db39bcf8


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH RFC 1/5] fs: add i_state helpers
  2024-08-20 16:06       ` [PATCH RFC 0/5] inode: turn i_state into u32 Christian Brauner
@ 2024-08-20 16:06         ` Christian Brauner
  2024-08-20 17:10           ` Linus Torvalds
  2024-08-20 16:06         ` [PATCH RFC 2/5] writeback: port __I_SYNC to var event Christian Brauner
                           ` (3 subsequent siblings)
  4 siblings, 1 reply; 38+ messages in thread
From: Christian Brauner @ 2024-08-20 16:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
	Jeff Layton, Jan Kara, linux-fsdevel

The i_state member is an unsigned long so that it can be used with the
wait bit infrastructure which expects unsigned long. This wastes 4 bytes
which we're unlikely to ever use. Switch to using the var event wait
mechanism using the address of the bit. Thanks to Linus for the address
idea.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/inode.c         | 11 +++++++++++
 include/linux/fs.h | 16 ++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 154f8689457f..d0f614677798 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -472,6 +472,17 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
 		inode->i_state |= I_REFERENCED;
 }
 
+struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
+					    struct inode *inode, int bit);
+{
+        struct wait_queue_head *wq_head;
+        void *bit_address;
+
+        bit_address = inode_state_wait_address(inode, __I_SYNC);
+        init_wait_var_entry(wqe, bit_address, 0);
+        return __var_waitqueue(bit_address);
+}
+
 /*
  * Add inode to LRU if needed (inode is unused and clean).
  *
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 23e7d46b818a..f854f83e91af 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -744,6 +744,22 @@ struct inode {
 	void			*i_private; /* fs or device private pointer */
 } __randomize_layout;
 
+/*
+ * Get bit address from inode->i_state to use with wait_var_event()
+ * infrastructure.
+ */
+#define inode_state_wait_address(inode, bit) ((char *)&(inode)->i_state + (bit))
+
+struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
+					    struct inode *inode, int bit);
+
+static inline void inode_wake_up_bit(struct inode *inode, unsigned int bit)
+{
+	/* Ensure @bit will be seen cleared/set when caller is woken up. */
+	smp_mb();
+	wake_up_var(inode_state_wait_address(inode, bit));
+}
+
 struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
 
 static inline unsigned int i_blocksize(const struct inode *node)

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH RFC 2/5] writeback: port __I_SYNC to var event
  2024-08-20 16:06       ` [PATCH RFC 0/5] inode: turn i_state into u32 Christian Brauner
  2024-08-20 16:06         ` [PATCH RFC 1/5] fs: add i_state helpers Christian Brauner
@ 2024-08-20 16:06         ` Christian Brauner
  2024-08-20 16:06         ` [PATCH RFC 3/5] inode: port __I_NEW " Christian Brauner
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 38+ messages in thread
From: Christian Brauner @ 2024-08-20 16:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
	Jeff Layton, Jan Kara, linux-fsdevel

Port the __I_SYNC mechanism to use the new var event mechanism.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/fs-writeback.c | 39 +++++++++++++++++++++------------------
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1a5006329f6f..3619c86273a4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1389,9 +1389,7 @@ static void inode_sync_complete(struct inode *inode)
 	inode->i_state &= ~I_SYNC;
 	/* If inode is clean an unused, put it into LRU now... */
 	inode_add_lru(inode);
-	/* Waiters must see I_SYNC cleared before being woken up */
-	smp_mb();
-	wake_up_bit(&inode->i_state, __I_SYNC);
+	inode_wake_up_bit(inode, __I_SYNC);
 }
 
 static bool inode_dirtied_after(struct inode *inode, unsigned long t)
@@ -1512,17 +1510,21 @@ static int write_inode(struct inode *inode, struct writeback_control *wbc)
  */
 void inode_wait_for_writeback(struct inode *inode)
 {
-	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
-	wait_queue_head_t *wqh;
+	struct wait_bit_queue_entry wqe;
+	struct wait_queue_head *wq_head;
 
-	lockdep_assert_held(&inode->i_lock);
-	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
-	while (inode->i_state & I_SYNC) {
-		spin_unlock(&inode->i_lock);
-		__wait_on_bit(wqh, &wq, bit_wait,
-			      TASK_UNINTERRUPTIBLE);
-		spin_lock(&inode->i_lock);
+	wq_head = inode_bit_waitqueue(&wqe, inode, __I_SYNC);
+	for (;;) {
+		prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
+		if (inode->i_state & I_SYNC) {
+			spin_unlock(&inode->i_lock);
+			schedule();
+			spin_lock(&inode->i_lock);
+			continue;
+		}
+		break;
 	}
+	finish_wait(wq_head, &wqe.wq_entry);
 }
 
 /*
@@ -1533,16 +1535,17 @@ void inode_wait_for_writeback(struct inode *inode)
 static void inode_sleep_on_writeback(struct inode *inode)
 	__releases(inode->i_lock)
 {
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
-	int sleep;
+	struct wait_bit_queue_entry wqe;
+	struct wait_queue_head *wq_head;
+	bool sleep;
 
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	sleep = inode->i_state & I_SYNC;
+	wq_head = inode_bit_waitqueue(&wqe, inode, __I_SYNC);
+	prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
+	sleep = !!(inode->i_state & I_SYNC);
 	spin_unlock(&inode->i_lock);
 	if (sleep)
 		schedule();
-	finish_wait(wqh, &wait);
+	finish_wait(wq_head, &wqe.wq_entry);
 }
 
 /*

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH RFC 3/5] inode: port __I_NEW to var event
  2024-08-20 16:06       ` [PATCH RFC 0/5] inode: turn i_state into u32 Christian Brauner
  2024-08-20 16:06         ` [PATCH RFC 1/5] fs: add i_state helpers Christian Brauner
  2024-08-20 16:06         ` [PATCH RFC 2/5] writeback: port __I_SYNC to var event Christian Brauner
@ 2024-08-20 16:06         ` Christian Brauner
  2024-08-20 16:06         ` [PATCH RFC 4/5] inode: port __I_LRU_ISOLATING " Christian Brauner
  2024-08-20 16:06         ` [PATCH RFC 5/5] inode: make i_state a u32 Christian Brauner
  4 siblings, 0 replies; 38+ messages in thread
From: Christian Brauner @ 2024-08-20 16:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
	Jeff Layton, Jan Kara, linux-fsdevel

Port the __I_NEW mechanism to use the new var event mechanism.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/bcachefs/fs.c          | 10 ++++++----
 fs/dcache.c               |  3 +--
 fs/inode.c                | 18 ++++++++----------
 include/linux/writeback.h |  3 ++-
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 94c392abef65..8575aedb9fde 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -1644,14 +1644,16 @@ void bch2_evict_subvolume_inodes(struct bch_fs *c, snapshot_id_list *s)
 				break;
 			}
 		} else if (clean_pass && this_pass_clean) {
-			wait_queue_head_t *wq = bit_waitqueue(&inode->v.i_state, __I_NEW);
-			DEFINE_WAIT_BIT(wait, &inode->v.i_state, __I_NEW);
+			struct wait_bit_queue_entry wqe;
+			struct wait_queue_head *wq_head;
 
-			prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+			wq_head = inode_bit_waitqueue(&wqe, inode, __I_NEW);
+			prepare_to_wait_event(wq_head, &wqe.wq_entry,
+					      TASK_UNINTERRUPTIBLE);
 			mutex_unlock(&c->vfs_inodes_lock);
 
 			schedule();
-			finish_wait(wq, &wait.wq_entry);
+			finish_wait(wq_head, &wqe.wq_entry);
 			goto again;
 		}
 	}
diff --git a/fs/dcache.c b/fs/dcache.c
index 1af75fa68638..7037f9312ed4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1908,8 +1908,7 @@ void d_instantiate_new(struct dentry *entry, struct inode *inode)
 	__d_instantiate(entry, inode);
 	WARN_ON(!(inode->i_state & I_NEW));
 	inode->i_state &= ~I_NEW & ~I_CREATING;
-	smp_mb();
-	wake_up_bit(&inode->i_state, __I_NEW);
+	inode_wake_up_bit(inode, __I_NEW);
 	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(d_instantiate_new);
diff --git a/fs/inode.c b/fs/inode.c
index d0f614677798..d9e119b5eeef 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -734,7 +734,7 @@ static void evict(struct inode *inode)
 	 * used as an indicator whether blocking on it is safe.
 	 */
 	spin_lock(&inode->i_lock);
-	wake_up_bit(&inode->i_state, __I_NEW);
+	inode_wake_up_bit(inode, __I_NEW);
 	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
 	spin_unlock(&inode->i_lock);
 
@@ -1142,8 +1142,7 @@ void unlock_new_inode(struct inode *inode)
 	spin_lock(&inode->i_lock);
 	WARN_ON(!(inode->i_state & I_NEW));
 	inode->i_state &= ~I_NEW & ~I_CREATING;
-	smp_mb();
-	wake_up_bit(&inode->i_state, __I_NEW);
+	inode_wake_up_bit(inode, __I_NEW);
 	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(unlock_new_inode);
@@ -1154,8 +1153,7 @@ void discard_new_inode(struct inode *inode)
 	spin_lock(&inode->i_lock);
 	WARN_ON(!(inode->i_state & I_NEW));
 	inode->i_state &= ~I_NEW;
-	smp_mb();
-	wake_up_bit(&inode->i_state, __I_NEW);
+	inode_wake_up_bit(inode, __I_NEW);
 	spin_unlock(&inode->i_lock);
 	iput(inode);
 }
@@ -2344,8 +2342,8 @@ EXPORT_SYMBOL(inode_needs_sync);
  */
 static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked)
 {
-	wait_queue_head_t *wq;
-	DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
+	struct wait_bit_queue_entry wqe;
+	struct wait_queue_head *wq_head;
 
 	/*
 	 * Handle racing against evict(), see that routine for more details.
@@ -2356,14 +2354,14 @@ static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_lock
 		return;
 	}
 
-	wq = bit_waitqueue(&inode->i_state, __I_NEW);
-	prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+	wq_head = inode_bit_waitqueue(&wqe, inode, __I_NEW);
+	prepare_to_wait_event(wq_head, &wqe.wq_entry, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
 	rcu_read_unlock();
 	if (is_inode_hash_locked)
 		spin_unlock(&inode_hash_lock);
 	schedule();
-	finish_wait(wq, &wait.wq_entry);
+	finish_wait(wq_head, &wqe.wq_entry);
 	if (is_inode_hash_locked)
 		spin_lock(&inode_hash_lock);
 	rcu_read_lock();
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 56b85841ae4c..bed795b8340b 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -200,7 +200,8 @@ void inode_io_list_del(struct inode *inode);
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
 {
-	wait_on_bit(&inode->i_state, __I_NEW, TASK_UNINTERRUPTIBLE);
+	wait_var_event(inode_state_wait_address(inode, __I_NEW),
+		       !(inode->i_state & I_NEW));
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH RFC 4/5] inode: port __I_LRU_ISOLATING to var event
  2024-08-20 16:06       ` [PATCH RFC 0/5] inode: turn i_state into u32 Christian Brauner
                           ` (2 preceding siblings ...)
  2024-08-20 16:06         ` [PATCH RFC 3/5] inode: port __I_NEW " Christian Brauner
@ 2024-08-20 16:06         ` Christian Brauner
  2024-08-20 16:06         ` [PATCH RFC 5/5] inode: make i_state a u32 Christian Brauner
  4 siblings, 0 replies; 38+ messages in thread
From: Christian Brauner @ 2024-08-20 16:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
	Jeff Layton, Jan Kara, linux-fsdevel

Port the __I_LRU_ISOLATING mechanism to use the new var event mechanism.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/inode.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index d9e119b5eeef..a48b6df05139 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -511,8 +511,7 @@ static void inode_unpin_lru_isolating(struct inode *inode)
 	spin_lock(&inode->i_lock);
 	WARN_ON(!(inode->i_state & I_LRU_ISOLATING));
 	inode->i_state &= ~I_LRU_ISOLATING;
-	smp_mb();
-	wake_up_bit(&inode->i_state, __I_LRU_ISOLATING);
+	inode_wake_up_bit(inode, __I_LRU_ISOLATING);
 	spin_unlock(&inode->i_lock);
 }
 
@@ -520,13 +519,22 @@ static void inode_wait_for_lru_isolating(struct inode *inode)
 {
 	lockdep_assert_held(&inode->i_lock);
 	if (inode->i_state & I_LRU_ISOLATING) {
-		DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LRU_ISOLATING);
-		wait_queue_head_t *wqh;
-
-		wqh = bit_waitqueue(&inode->i_state, __I_LRU_ISOLATING);
-		spin_unlock(&inode->i_lock);
-		__wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
-		spin_lock(&inode->i_lock);
+		struct wait_bit_queue_entry wqe;
+		struct wait_queue_head *wq_head;
+
+		wq_head = inode_bit_waitqueue(&wqe, inode, __I_LRU_ISOLATING);
+		for (;;) {
+			prepare_to_wait_event(wq_head, &wqe.wq_entry,
+					      TASK_UNINTERRUPTIBLE);
+			if (inode->i_state & I_LRU_ISOLATING) {
+				spin_unlock(&inode->i_lock);
+				schedule();
+				spin_lock(&inode->i_lock);
+				continue;
+			}
+			break;
+		}
+		finish_wait(wq_head, &wqe.wq_entry);
 		WARN_ON(inode->i_state & I_LRU_ISOLATING);
 	}
 }

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH RFC 5/5] inode: make i_state a u32
  2024-08-20 16:06       ` [PATCH RFC 0/5] inode: turn i_state into u32 Christian Brauner
                           ` (3 preceding siblings ...)
  2024-08-20 16:06         ` [PATCH RFC 4/5] inode: port __I_LRU_ISOLATING " Christian Brauner
@ 2024-08-20 16:06         ` Christian Brauner
  4 siblings, 0 replies; 38+ messages in thread
From: Christian Brauner @ 2024-08-20 16:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, NeilBrown, Peter Zijlstra, Ingo Molnar,
	Jeff Layton, Jan Kara, linux-fsdevel

Now that we use the wait var event mechanism, make i_state a u32 and free
up 4 bytes. This means we currently have two 4 byte holes in struct
inode which we can pack.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f854f83e91af..54cfb75b6a28 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -681,7 +681,7 @@ struct inode {
 #endif
 
 	/* Misc */
-	unsigned long		i_state;
+	u32			i_state;
 	struct rw_semaphore	i_rwsem;
 
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 1/5] fs: add i_state helpers
  2024-08-20 16:06         ` [PATCH RFC 1/5] fs: add i_state helpers Christian Brauner
@ 2024-08-20 17:10           ` Linus Torvalds
  2024-08-20 17:19             ` Linus Torvalds
  2024-08-20 19:08             ` Christian Brauner
  0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2024-08-20 17:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jeff Layton, Jan Kara,
	linux-fsdevel

On Tue, 20 Aug 2024 at 09:07, Christian Brauner <brauner@kernel.org> wrote:
>
>
> +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> +                                           struct inode *inode, int bit);
> +{
> +        struct wait_queue_head *wq_head;
> +        void *bit_address;
> +
> +        bit_address = inode_state_wait_address(inode, __I_SYNC);

Shouldn't that '__I_SYNC' be 'bit' instead?

Now it always uses the __I_SYNC wait address, but:

> +static inline void inode_wake_up_bit(struct inode *inode, unsigned int bit)
> +{
> +       /* Ensure @bit will be seen cleared/set when caller is woken up. */
> +       smp_mb();
> +       wake_up_var(inode_state_wait_address(inode, bit));
> +}

This uses the 'bit' wait address as expected.

          Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 1/5] fs: add i_state helpers
  2024-08-20 17:10           ` Linus Torvalds
@ 2024-08-20 17:19             ` Linus Torvalds
  2024-08-20 19:10               ` Christian Brauner
  2024-08-20 19:08             ` Christian Brauner
  1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2024-08-20 17:19 UTC (permalink / raw)
  To: Christian Brauner
  Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jeff Layton, Jan Kara,
	linux-fsdevel

.. and one more comment on that patch: it would probably be a good
idea to make sure that the __I_xyz constants that are used for this
are in the range 0-3.

It doesn't really *matter*, in the sense that it will all just be a
cookie with a random address, but if anybody else ever uses the same
trick (or just uses bit_waitqueue) for another field in the inode, the
two cookies might end up being the same if you are very unlucky.

So from a future-proofing standpoint it would be good if the cookies
that are used are always "within" the address range of i_state.

I don't think any of the bits in i_state have any actual meaning, so
moving the bits around shouldn't be a problem.

                 Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 1/5] fs: add i_state helpers
  2024-08-20 17:10           ` Linus Torvalds
  2024-08-20 17:19             ` Linus Torvalds
@ 2024-08-20 19:08             ` Christian Brauner
  1 sibling, 0 replies; 38+ messages in thread
From: Christian Brauner @ 2024-08-20 19:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jeff Layton, Jan Kara,
	linux-fsdevel

On Tue, Aug 20, 2024 at 10:10:51AM GMT, Linus Torvalds wrote:
> On Tue, 20 Aug 2024 at 09:07, Christian Brauner <brauner@kernel.org> wrote:
> >
> >
> > +struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe,
> > +                                           struct inode *inode, int bit);

Bah, I sent from the wrong branch. This is the branch where I even
forgot to remove that godforsaken ; at the end here...

> > +{
> > +        struct wait_queue_head *wq_head;
> > +        void *bit_address;
> > +
> > +        bit_address = inode_state_wait_address(inode, __I_SYNC);
> 
> Shouldn't that '__I_SYNC' be 'bit' instead?

Yeah, that's also fixed on the work.i_state branch. I sent from
work.i_state.wip... :/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 1/5] fs: add i_state helpers
  2024-08-20 17:19             ` Linus Torvalds
@ 2024-08-20 19:10               ` Christian Brauner
  0 siblings, 0 replies; 38+ messages in thread
From: Christian Brauner @ 2024-08-20 19:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: NeilBrown, Peter Zijlstra, Ingo Molnar, Jeff Layton, Jan Kara,
	linux-fsdevel

On Tue, Aug 20, 2024 at 10:19:11AM GMT, Linus Torvalds wrote:
> .. and one more comment on that patch: it would probably be a good
> idea to make sure that the __I_xyz constants that are used for this
> are in the range 0-3.
> 
> It doesn't really *matter*, in the sense that it will all just be a
> cookie with a random address, but if anybody else ever uses the same
> trick (or just uses bit_waitqueue) for another field in the inode, the
> two cookies might end up being the same if you are very unlucky.
> 
> So from a future-proofing standpoint it would be good if the cookies
> that are used are always "within" the address range of i_state.
> 
> I don't think any of the bits in i_state have any actual meaning, so
> moving the bits around shouldn't be a problem.

Yeah, I reordered. I did not think this was too big of an issue but
you're right.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/9] Use wait_var_event() instead of I_DIO_WAKEUP
  2024-08-20  7:22   ` Christian Brauner
@ 2024-08-20 19:12     ` Christian Brauner
  0 siblings, 0 replies; 38+ messages in thread
From: Christian Brauner @ 2024-08-20 19:12 UTC (permalink / raw)
  To: NeilBrown
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, linux-kernel,
	linux-fsdevel

On Tue, Aug 20, 2024 at 09:22:33AM GMT, Christian Brauner wrote:
> On Mon, Aug 19, 2024 at 03:20:38PM GMT, NeilBrown wrote:
> > inode_dio_wait() is essentially an open-coded version of
> > wait_var_event().  Similarly inode_dio_wait_interruptible() is an
> > open-coded version of wait_var_event_interruptible().
> > 
> > If we switch to waiting on the var, instead of an imaginary bit, the
> > code is more transparent, is shorter, and we can discard I_DIO_WAKEUP.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> 
> Neil, I've sent a patch for this last week already removing
> __I_DIO_WAKEUP and it's in -next as
> 0009dc756e81 ("inode: remove __I_DIO_WAKEUP"). So you can drop this

Today's the day of getting things slightly wrong it seems...
2726a7a8477d8c0 is what I meant.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile
  2024-08-19  6:13 ` [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile Linus Torvalds
@ 2024-08-20 21:47   ` NeilBrown
  2024-08-20 21:58     ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: NeilBrown @ 2024-08-20 21:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, linux-fsdevel

On Mon, 19 Aug 2024, Linus Torvalds wrote:
> On Sun, 18 Aug 2024 at 22:36, NeilBrown <neilb@suse.de> wrote:
> >
> > The main patches here are 7 and 8 which revise wake_up_bit and
> > wake_up_var respectively.  They result in 3 interfaces:
> >   wake_up_{bit,var}           includes smp_mb__after_atomic()
> 
> I actually think this is even worse than the current model, in that
> now it subtly only works after atomic ops, and it's not obvious from
> the name.
> 
> At least the current model, correct code looks like
> 
>       do_some_atomic_op
>       smp_mb__after_atomic()
>       wake_up_{bit,var}
> 
> and the smp_mb__after_atomic() makes sense and pairs with the atomic.
> So the current one may be complex, but at the same time it's also
> explicit. Your changed interface is still complex, but now it's even
> less obvious what is actually going on.
> 
> With your suggested interface, a plain "wake_up_{bit,var}" only works
> after atomic ops, and other ops have to magically know that they
> should use the _mb() version or whatever. And somebody who doesn't
> understand that subtlety, and copies the code (but changes the op from
> an atomic one to something else) now introduces code that looks fine,
> but is really subtly wrong.
> 
> The reason for the barrier is for the serialization with the
> waitqueue_active() check. Honestly, if you worry about correctness
> here, I think you should leave the existing wake_up_{bit,var}() alone,
> and concentrate on having helpers that do the whole "set and wake up".
> 
> IOW, I do not think you should change existing semantics, but *this*
> kind of pattern:
> 
> >  [PATCH 2/9] Introduce atomic_dec_and_wake_up_var().
> >  [PATCH 9/9] Use clear_and_wake_up_bit() where appropriate.
> 
> sounds like a good idea.
> 
> IOW, once you have a whole "atomic_dec_and_wake_up()" (skip the "_var"
> - it's implied by the fact that it's an atomic_dec), *then* that
> function makes for a simple-to-use model, and now the "atomic_dec(),
> the smp_mb__after_atomic(), and the wake_up_var()" are all together.
> 
> For all the same reasons, it makes total sense to have
> "clear_bit_and_wake()" etc.

I can definitely get behind the idea of having a few more helpers and
using them more widely.  But unless we get rid of wake_up_bit(), people
will still use it and some will use it wrongly.

What would you think of changing wake_up_bit/var to always have a full
memory barrier, and adding wake_up_bit/var_relaxed() for use when a
different barrier, or no barrier, is wanted?

Thanks,
NeilBrown


> 
> But exposing those "three different memory barrier scenarios" as three
> different helpers is the *opposite* of helpful. It keeps the current
> complexity, and makes it worse by making the barrier rules even more
> opaque, imho.
> 
>                Linus
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event()
  2024-08-20  4:18   ` Dave Chinner
@ 2024-08-20 21:52     ` NeilBrown
  2024-08-28  1:22       ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: NeilBrown @ 2024-08-20 21:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, linux-kernel,
	linux-fsdevel

On Tue, 20 Aug 2024, Dave Chinner wrote:
> On Mon, Aug 19, 2024 at 03:20:39PM +1000, NeilBrown wrote:
> > bd_prepare_to_claim() current uses a bit waitqueue with a matching
> > wake_up_bit() in bd_clear_claiming().  However it is really waiting on a
> > "var", not a "bit".
> > 
> > So change to wake_up_var(), and use ___wait_var_event() for the waiting.
> > Using the triple-underscore version allows us to drop the mutex across
> > the schedule() call.
> ....
> > @@ -535,33 +535,23 @@ int bd_prepare_to_claim(struct block_device *bdev, void *holder,
> >  		const struct blk_holder_ops *hops)
> >  {
> >  	struct block_device *whole = bdev_whole(bdev);
> > +	int err = 0;
> >  
> >  	if (WARN_ON_ONCE(!holder))
> >  		return -EINVAL;
> > -retry:
> > -	mutex_lock(&bdev_lock);
> > -	/* if someone else claimed, fail */
> > -	if (!bd_may_claim(bdev, holder, hops)) {
> > -		mutex_unlock(&bdev_lock);
> > -		return -EBUSY;
> > -	}
> > -
> > -	/* if claiming is already in progress, wait for it to finish */
> > -	if (whole->bd_claiming) {
> > -		wait_queue_head_t *wq = bit_waitqueue(&whole->bd_claiming, 0);
> > -		DEFINE_WAIT(wait);
> >  
> > -		prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
> > -		mutex_unlock(&bdev_lock);
> > -		schedule();
> > -		finish_wait(wq, &wait);
> > -		goto retry;
> > -	}
> > +	mutex_lock(&bdev_lock);
> > +	___wait_var_event(&whole->bd_claiming,
> > +			  (err = bd_may_claim(bdev, holder, hops)) != 0 || !whole->bd_claiming,
> > +			  TASK_UNINTERRUPTIBLE, 0, 0,
> > +			  mutex_unlock(&bdev_lock); schedule(); mutex_lock(&bdev_lock));
> 
> That's not an improvement. Instead of nice, obvious, readable code,
> I now have to go look at a macro and manually substitute the
> parameters to work out what this abomination actually does.

Interesting - I thought the function as a whole was more readable this
way.
I agree that the ___wait_var_event macro isn't the best part.
Is your dislike simply that it isn't a macro that you are familiar with,
or is there something specific that you don't like?

Suppose we could add a new macro so that it read:

     wait_var_event_mutex(&whole->bd_claiming,
			  (err = bd_may_claim(bdev, holder, hops)) != 0 || !whole->bd_claiming,
			  &bdev_lock);

would that be less abominable?

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile
  2024-08-20 21:47   ` NeilBrown
@ 2024-08-20 21:58     ` Linus Torvalds
  2024-08-20 22:15       ` NeilBrown
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2024-08-20 21:58 UTC (permalink / raw)
  To: NeilBrown; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, linux-fsdevel

On Tue, 20 Aug 2024 at 14:47, NeilBrown <neilb@suse.de> wrote:
>
> I can definitely get behind the idea of having a few more helpers and
> using them more widely.  But unless we get rid of wake_up_bit(), people
> will still use it and some will use it wrongly.

I do not believe this is a valid argument.

"We have interfaces that somebody can use wrongly" is a fact of life,
not an argument.

The whole "wake_up_bit()" is a very special thing, and dammit, if
people don't know the rules, then they shouldn't be using it.

Anybody using that interface *ALREADY* has to have some model of
atomicity for the actual bit they are changing. And yes, they can get
that wrong too.

The only way to actually make it a simple interface is to do the bit
operation and the wakeup together. Which is why I think that
interfaces like clear_bit_and_wake() or set_bit_and_wake() are fine,
because at that point you actually have a valid rule for the whole
operation.

But wake_up_bit() on its own ALREADY depends on the user doing the
right thing for the bit itself. Putting a memory barrier in it will
only *HIDE* incompetence, it won't be fixing it.

So no. Don't add interfaces that hide the problem.

                  Linus

* Re: [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile
  2024-08-20 21:58     ` Linus Torvalds
@ 2024-08-20 22:15       ` NeilBrown
  2024-08-20 22:24         ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: NeilBrown @ 2024-08-20 22:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, linux-fsdevel

On Wed, 21 Aug 2024, Linus Torvalds wrote:
> On Tue, 20 Aug 2024 at 14:47, NeilBrown <neilb@suse.de> wrote:
> >
> > I can definitely get behind the idea of having a few more helpers and
> > using them more widely.  But unless we get rid of wake_up_bit(), people
> > will still use it and some will use it wrongly.
> 
> I do not believe this is a valid argument.
> 
> "We have interfaces that somebody can use wrongly" is a fact of life,
> not an argument.

The argument is more like "we have interfaces that are often used
wrongly and the resulting bugs are hard to find through testing because
they don't affect the more popular architectures".

> 
> The whole "wake_up_bit()" is a very special thing, and dammit, if
> people don't know the rules, then they shouldn't be using it.
> 
> Anybody using that interface *ALREADY* has to have some model of
> atomicity for the actual bit they are changing. And yes, they can get
> that wrong too.
> 
> The only way to actually make it a simple interface is to do the bit
> operation and the wakeup together. Which is why I think that
> interfaces like clear_bit_and_wake() or set_bit_and_wake() are fine,
> because at that point you actually have a valid rule for the whole
> operation.
> 
> But wake_up_bit() on its own ALREADY depends on the user doing the
> right thing for the bit itself. Putting a memory barrier in it will
> only *HIDE* incompetence, it won't be fixing it.
> 
> So no. Don't add interfaces that hide the problem.

Ok, thanks.  I'll focus my efforts on helper functions.

NeilBrown


> 
>                   Linus
> 


* Re: [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile
  2024-08-20 22:15       ` NeilBrown
@ 2024-08-20 22:24         ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2024-08-20 22:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, linux-fsdevel

On Tue, 20 Aug 2024 at 15:16, NeilBrown <neilb@suse.de> wrote:
>
> The argument is more like "we have interfaces that are often used
> wrongly and the resulting bugs are hard to find through testing because
> they don't affect the more popular architectures".

Right, but let's make the fix be that we actually then make those
places use better interfaces that don't _have_ any memory ordering
issues.

THAT is my argument. In the "combined" interface, the problem simply
goes away entirely, rather than being hidden by adding possibly
totally pointless barriers.

                 Linus

* Re: [PATCH 2/9] Introduce atomic_dec_and_wake_up_var().
  2024-08-19  5:20 ` [PATCH 2/9] Introduce atomic_dec_and_wake_up_var() NeilBrown
  2024-08-20  5:47   ` kernel test robot
@ 2024-08-21  5:23   ` kernel test robot
  1 sibling, 0 replies; 38+ messages in thread
From: kernel test robot @ 2024-08-21  5:23 UTC (permalink / raw)
  To: NeilBrown, Ingo Molnar, Peter Zijlstra, Linus Torvalds
  Cc: llvm, oe-kbuild-all, LKML, linux-fsdevel

Hi NeilBrown,

kernel test robot noticed the following build errors:

[auto build test ERROR on trondmy-nfs/linux-next]
[also build test ERROR on gfs2/for-next device-mapper-dm/for-next linus/master v6.11-rc4 next-20240820]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/NeilBrown/i915-remove-wake_up-on-I915_RESET_MODESET/20240819-134414
base:   git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
patch link:    https://lore.kernel.org/r/20240819053605.11706-3-neilb%40suse.de
patch subject: [PATCH 2/9] Introduce atomic_dec_and_wake_up_var().
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20240821/202408211354.SHlpNZbs-lkp@intel.com/config)
compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240821/202408211354.SHlpNZbs-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408211354.SHlpNZbs-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from drivers/gpu/drm/i915/i915_request.c:2277:
>> drivers/gpu/drm/i915/selftests/i915_request.c:1533:9: error: call to undeclared function 'atomic_dec_and_wakeup_var'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    1533 |         return atomic_dec_and_wakeup_var(&i915->selftest.counter);
         |                ^
   drivers/gpu/drm/i915/selftests/i915_request.c:1533:9: note: did you mean 'atomic_dec_and_wake_up_var'?
   include/linux/wait_bit.h:338:20: note: 'atomic_dec_and_wake_up_var' declared here
     338 | static inline bool atomic_dec_and_wake_up_var(atomic_t *var)
         |                    ^
   1 error generated.


vim +/atomic_dec_and_wakeup_var +1533 drivers/gpu/drm/i915/selftests/i915_request.c

  1530	
  1531	static bool wake_all(struct drm_i915_private *i915)
  1532	{
> 1533		return atomic_dec_and_wakeup_var(&i915->selftest.counter);
  1534	}
  1535	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH 5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event()
  2024-08-19  5:20 ` [PATCH 5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event() NeilBrown
  2024-08-20  4:18   ` Dave Chinner
@ 2024-08-21  7:10   ` kernel test robot
  1 sibling, 0 replies; 38+ messages in thread
From: kernel test robot @ 2024-08-21  7:10 UTC (permalink / raw)
  To: NeilBrown
  Cc: oe-lkp, lkp, linux-block, Ingo Molnar, Peter Zijlstra,
	Linus Torvalds, linux-kernel, linux-fsdevel, oliver.sang



Hello,

kernel test robot noticed "kernel_BUG_at_block/bdev.c" on:

commit: b80d7798a6f9db2c094014570a73728ace8d844d ("[PATCH 5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event()")
url: https://github.com/intel-lab-lkp/linux/commits/NeilBrown/i915-remove-wake_up-on-I915_RESET_MODESET/20240819-134414
base: git://git.linux-nfs.org/projects/trondmy/linux-nfs.git linux-next
patch link: https://lore.kernel.org/all/20240819053605.11706-6-neilb@suse.de/
patch subject: [PATCH 5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event()

in testcase: boot

compiler: clang-18
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+------------------------------------------+------------+------------+
|                                          | 30a670cac3 | b80d7798a6 |
+------------------------------------------+------------+------------+
| boot_successes                           | 9          | 0          |
| boot_failures                            | 0          | 9          |
| kernel_BUG_at_block/bdev.c               | 0          | 9          |
| Oops:invalid_opcode:#[##]PREEMPT_SMP_PTI | 0          | 9          |
| RIP:bd_finish_claiming                   | 0          | 9          |
| Kernel_panic-not_syncing:Fatal_exception | 0          | 9          |
+------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202408211439.954a6d41-lkp@intel.com


[    8.768327][ T2733] ------------[ cut here ]------------
[    8.768333][ T2733] kernel BUG at block/bdev.c:583!
[    8.768342][ T2733] Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
[    8.768348][ T2733] CPU: 1 UID: 0 PID: 2733 Comm: cdrom_id Not tainted 6.11.0-rc3-00005-gb80d7798a6f9 #1
[    8.768352][ T2733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 8.768355][ T2733] RIP: 0010:bd_finish_claiming (block/bdev.c:583) 
[ 8.768388][ T2733] Code: 48 c7 03 00 00 00 00 f0 83 44 24 fc 00 48 89 df e8 0f bc b1 ff 48 c7 c7 00 97 a0 82 5b 41 5c 41 5d 41 5e 41 5f e9 5a aa 95 00 <0f> 0b 0f 0b 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90
All code
========
   0:	48 c7 03 00 00 00 00 	movq   $0x0,(%rbx)
   7:	f0 83 44 24 fc 00    	lock addl $0x0,-0x4(%rsp)
   d:	48 89 df             	mov    %rbx,%rdi
  10:	e8 0f bc b1 ff       	call   0xffffffffffb1bc24
  15:	48 c7 c7 00 97 a0 82 	mov    $0xffffffff82a09700,%rdi
  1c:	5b                   	pop    %rbx
  1d:	41 5c                	pop    %r12
  1f:	41 5d                	pop    %r13
  21:	41 5e                	pop    %r14
  23:	41 5f                	pop    %r15
  25:	e9 5a aa 95 00       	jmp    0x95aa84
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	0f 0b                	ud2
  2e:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  34:	90                   	nop
  35:	90                   	nop
  36:	90                   	nop
  37:	90                   	nop
  38:	90                   	nop
  39:	90                   	nop
  3a:	90                   	nop
  3b:	90                   	nop
  3c:	90                   	nop
  3d:	90                   	nop
  3e:	90                   	nop
  3f:	90                   	nop

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	0f 0b                	ud2
   4:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
   a:	90                   	nop
   b:	90                   	nop
   c:	90                   	nop
   d:	90                   	nop
   e:	90                   	nop
   f:	90                   	nop
  10:	90                   	nop
  11:	90                   	nop
  12:	90                   	nop
  13:	90                   	nop
  14:	90                   	nop
  15:	90                   	nop
[    8.768392][ T2733] RSP: 0000:ffffc90000a5bc00 EFLAGS: 00010246
[    8.768396][ T2733] RAX: 0000000000000000 RBX: ffff888125940000 RCX: 0000000000000000
[    8.768398][ T2733] RDX: 0000000000000000 RSI: ffff88812a326800 RDI: ffff888125940000
[    8.768400][ T2733] RBP: 000000000000000d R08: 0000000000000004 R09: 00000002f1ee446d
[    8.768402][ T2733] R10: 00646b636f6c6200 R11: ffffffffa0015f60 R12: ffff888125940000
[    8.768404][ T2733] R13: 0000000000000000 R14: ffff88812a326800 R15: 0000000000000000
[    8.768407][ T2733] FS:  0000000000000000(0000) GS:ffff88842fd00000(0063) knlGS:00000000f7d406c0
[    8.768410][ T2733] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[    8.768412][ T2733] CR2: 00000000f7d0315c CR3: 000000012a244000 CR4: 00000000000406f0
[    8.768417][ T2733] Call Trace:
[    8.769866][ T2733]  <TASK>
[ 8.769875][ T2733] ? __die_body (arch/x86/kernel/dumpstack.c:421) 
[ 8.769884][ T2733] ? die (arch/x86/kernel/dumpstack.c:? arch/x86/kernel/dumpstack.c:447) 
[ 8.769888][ T2733] ? do_trap (arch/x86/kernel/traps.c:129) 
[ 8.769893][ T2733] ? bd_finish_claiming (block/bdev.c:583) 
[ 8.769898][ T2733] ? do_error_trap (arch/x86/kernel/traps.c:175) 
[ 8.769902][ T2733] ? bd_finish_claiming (block/bdev.c:583) 
[ 8.769905][ T2733] ? handle_invalid_op (arch/x86/kernel/traps.c:212) 
[ 8.769909][ T2733] ? bd_finish_claiming (block/bdev.c:583) 
[ 8.769912][ T2733] ? exc_invalid_op (arch/x86/kernel/traps.c:267) 
[ 8.769917][ T2733] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621) 
[ 8.769926][ T2733] ? __pfx_sr_open (drivers/scsi/sr.c:593) sr_mod
[ 8.769932][ T2733] ? bd_finish_claiming (block/bdev.c:583) 
[ 8.769936][ T2733] bdev_open (block/bdev.c:914) 
[ 8.769940][ T2733] ? iput (fs/inode.c:1821) 
[ 8.769945][ T2733] blkdev_open (block/fops.c:631) 
[ 8.769949][ T2733] ? __pfx_blkdev_open (block/fops.c:610) 
[ 8.769952][ T2733] do_dentry_open (fs/open.c:959) 
[ 8.769958][ T2733] vfs_open (fs/open.c:1089) 
[ 8.769962][ T2733] path_openat (fs/namei.c:3727) 
[ 8.769966][ T2733] ? call_rcu (kernel/rcu/tree.c:2996) 
[ 8.769972][ T2733] do_filp_open (fs/namei.c:3913) 
[ 8.769978][ T2733] do_sys_openat2 (fs/open.c:1416) 
[ 8.769982][ T2733] do_sys_open (fs/open.c:1431) 
[ 8.769986][ T2733] do_int80_emulation (arch/x86/entry/common.c:?) 
[ 8.769990][ T2733] ? irqentry_exit_to_user_mode (arch/x86/include/asm/processor.h:702 arch/x86/include/asm/entry-common.h:91 include/linux/entry-common.h:364 kernel/entry/common.c:233) 
[ 8.769994][ T2733] asm_int80_emulation (arch/x86/include/asm/idtentry.h:626) 
[    8.769999][ T2733] RIP: 0023:0xf7f111b2
[ 8.770004][ T2733] Code: 89 c2 31 c0 89 d7 f3 aa 8b 44 24 1c 89 30 c6 40 04 00 83 c4 2c 89 f0 5b 5e 5f 5d c3 90 90 90 90 90 90 90 90 90 90 90 90 cd 80 <c3> 8d b6 00 00 00 00 8d bc 27 00 00 00 00 8b 1c 24 c3 8d b6 00 00
All code
========
   0:	89 c2                	mov    %eax,%edx
   2:	31 c0                	xor    %eax,%eax
   4:	89 d7                	mov    %edx,%edi
   6:	f3 aa                	rep stos %al,%es:(%rdi)
   8:	8b 44 24 1c          	mov    0x1c(%rsp),%eax
   c:	89 30                	mov    %esi,(%rax)
   e:	c6 40 04 00          	movb   $0x0,0x4(%rax)
  12:	83 c4 2c             	add    $0x2c,%esp
  15:	89 f0                	mov    %esi,%eax
  17:	5b                   	pop    %rbx
  18:	5e                   	pop    %rsi
  19:	5f                   	pop    %rdi
  1a:	5d                   	pop    %rbp
  1b:	c3                   	ret
  1c:	90                   	nop
  1d:	90                   	nop
  1e:	90                   	nop
  1f:	90                   	nop
  20:	90                   	nop
  21:	90                   	nop
  22:	90                   	nop
  23:	90                   	nop
  24:	90                   	nop
  25:	90                   	nop
  26:	90                   	nop
  27:	90                   	nop
  28:	cd 80                	int    $0x80
  2a:*	c3                   	ret		<-- trapping instruction
  2b:	8d b6 00 00 00 00    	lea    0x0(%rsi),%esi
  31:	8d bc 27 00 00 00 00 	lea    0x0(%rdi,%riz,1),%edi
  38:	8b 1c 24             	mov    (%rsp),%ebx
  3b:	c3                   	ret
  3c:	8d                   	.byte 0x8d
  3d:	b6 00                	mov    $0x0,%dh
	...

Code starting with the faulting instruction
===========================================
   0:	c3                   	ret
   1:	8d b6 00 00 00 00    	lea    0x0(%rsi),%esi
   7:	8d bc 27 00 00 00 00 	lea    0x0(%rdi,%riz,1),%edi
   e:	8b 1c 24             	mov    (%rsp),%ebx
  11:	c3                   	ret
  12:	8d                   	.byte 0x8d
  13:	b6 00                	mov    $0x0,%dh


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240821/202408211439.954a6d41-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH 5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event()
  2024-08-20 21:52     ` NeilBrown
@ 2024-08-28  1:22       ` Dave Chinner
  2024-08-28  5:20         ` NeilBrown
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2024-08-28  1:22 UTC (permalink / raw)
  To: NeilBrown
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, linux-kernel,
	linux-fsdevel

On Wed, Aug 21, 2024 at 07:52:39AM +1000, NeilBrown wrote:
> On Tue, 20 Aug 2024, Dave Chinner wrote:
> > On Mon, Aug 19, 2024 at 03:20:39PM +1000, NeilBrown wrote:
> > > bd_prepare_to_claim() currently uses a bit waitqueue with a matching
> > > wake_up_bit() in bd_clear_claiming().  However it is really waiting on a
> > > "var", not a "bit".
> > > 
> > > So change to wake_up_var(), and use ___wait_var_event() for the waiting.
> > > Using the triple-underscore version allows us to drop the mutex across
> > > the schedule() call.
> > ....
> > > @@ -535,33 +535,23 @@ int bd_prepare_to_claim(struct block_device *bdev, void *holder,
> > >  		const struct blk_holder_ops *hops)
> > >  {
> > >  	struct block_device *whole = bdev_whole(bdev);
> > > +	int err = 0;
> > >  
> > >  	if (WARN_ON_ONCE(!holder))
> > >  		return -EINVAL;
> > > -retry:
> > > -	mutex_lock(&bdev_lock);
> > > -	/* if someone else claimed, fail */
> > > -	if (!bd_may_claim(bdev, holder, hops)) {
> > > -		mutex_unlock(&bdev_lock);
> > > -		return -EBUSY;
> > > -	}
> > > -
> > > -	/* if claiming is already in progress, wait for it to finish */
> > > -	if (whole->bd_claiming) {
> > > -		wait_queue_head_t *wq = bit_waitqueue(&whole->bd_claiming, 0);
> > > -		DEFINE_WAIT(wait);
> > >  
> > > -		prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
> > > -		mutex_unlock(&bdev_lock);
> > > -		schedule();
> > > -		finish_wait(wq, &wait);
> > > -		goto retry;
> > > -	}
> > > +	mutex_lock(&bdev_lock);
> > > +	___wait_var_event(&whole->bd_claiming,
> > > +			  (err = bd_may_claim(bdev, holder, hops)) != 0 || !whole->bd_claiming,
> > > +			  TASK_UNINTERRUPTIBLE, 0, 0,
> > > +			  mutex_unlock(&bdev_lock); schedule(); mutex_lock(&bdev_lock));
> > 
> > That's not an improvement. Instead of nice, obvious, readable code,
> > I now have to go look at a macro and manually substitute the
> > parameters to work out what this abomination actually does.
> 
> Interesting - I thought the function as a whole was more readable this
> way.
> I agree that the ___wait_var_event macro isn't the best part.
> Is your dislike simply that it isn't a macro that you are familiar with,
> or is there something specific that you don't like?

It's the encoding of non-trivial logic and code into the macro
parameters that is the problem....

> Suppose we could add a new macro so that it read:
> 
>      wait_var_event_mutex(&whole->bd_claiming,
> 			  (err = bd_may_claim(bdev, holder, hops)) != 0 || !whole->bd_claiming,
> 			  &bdev_lock);

.... and this still does it. 

In fact, it's worse, because now I have -zero idea- of what locking
is being performed in this case, and so now I definitely have to go
pull that macro apart to understand what this is actually doing.

Complex macros don't make understanding the code easier - they may
make writing the code faster, but that comes at the expense of
clarity and obviousness of the logic flow of the code...

-Dave.

-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event()
  2024-08-28  1:22       ` Dave Chinner
@ 2024-08-28  5:20         ` NeilBrown
  0 siblings, 0 replies; 38+ messages in thread
From: NeilBrown @ 2024-08-28  5:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds, linux-kernel,
	linux-fsdevel

On Wed, 28 Aug 2024, Dave Chinner wrote:
> On Wed, Aug 21, 2024 at 07:52:39AM +1000, NeilBrown wrote:
> > On Tue, 20 Aug 2024, Dave Chinner wrote:
> > > On Mon, Aug 19, 2024 at 03:20:39PM +1000, NeilBrown wrote:
> > > > bd_prepare_to_claim() currently uses a bit waitqueue with a matching
> > > > wake_up_bit() in bd_clear_claiming().  However it is really waiting on a
> > > > "var", not a "bit".
> > > > 
> > > > So change to wake_up_var(), and use ___wait_var_event() for the waiting.
> > > > Using the triple-underscore version allows us to drop the mutex across
> > > > the schedule() call.
> > > ....
> > > > @@ -535,33 +535,23 @@ int bd_prepare_to_claim(struct block_device *bdev, void *holder,
> > > >  		const struct blk_holder_ops *hops)
> > > >  {
> > > >  	struct block_device *whole = bdev_whole(bdev);
> > > > +	int err = 0;
> > > >  
> > > >  	if (WARN_ON_ONCE(!holder))
> > > >  		return -EINVAL;
> > > > -retry:
> > > > -	mutex_lock(&bdev_lock);
> > > > -	/* if someone else claimed, fail */
> > > > -	if (!bd_may_claim(bdev, holder, hops)) {
> > > > -		mutex_unlock(&bdev_lock);
> > > > -		return -EBUSY;
> > > > -	}
> > > > -
> > > > -	/* if claiming is already in progress, wait for it to finish */
> > > > -	if (whole->bd_claiming) {
> > > > -		wait_queue_head_t *wq = bit_waitqueue(&whole->bd_claiming, 0);
> > > > -		DEFINE_WAIT(wait);
> > > >  
> > > > -		prepare_to_wait(wq, &wait, TASK_UNINTERRUPTIBLE);
> > > > -		mutex_unlock(&bdev_lock);
> > > > -		schedule();
> > > > -		finish_wait(wq, &wait);
> > > > -		goto retry;
> > > > -	}
> > > > +	mutex_lock(&bdev_lock);
> > > > +	___wait_var_event(&whole->bd_claiming,
> > > > +			  (err = bd_may_claim(bdev, holder, hops)) != 0 || !whole->bd_claiming,
> > > > +			  TASK_UNINTERRUPTIBLE, 0, 0,
> > > > +			  mutex_unlock(&bdev_lock); schedule(); mutex_lock(&bdev_lock));
> > > 
> > > That's not an improvement. Instead of nice, obvious, readable code,
> > > I now have to go look at a macro and manually substitute the
> > > parameters to work out what this abomination actually does.
> > 
> > Interesting - I thought the function as a whole was more readable this
> > way.
> > I agree that the ___wait_var_event macro isn't the best part.
> > Is your dislike simply that it isn't a macro that you are familiar with,
> > or is there something specific that you don't like?
> 
> It's the encoding of non-trivial logic and code into the macro
> parameters that is the problem....

It would probably make sense to move all the logic into bd_may_claim()
so that it returns:
  -EBUSY if claim cannot succeed
  -EAGAIN if claim might succeed soon, or
  0 if it can be claimed now.
Then the wait becomes:

   wait_var_event_mutex(&whole->bd_claiming,
			bd_may_claim(bdev, holder, hops) != -EAGAIN,
			&bdev_lock);

> 
> > Suppose we could add a new macro so that it read:
> > 
> >      wait_var_event_mutex(&whole->bd_claiming,
> > 			  (err = bd_may_claim(bdev, holder, hops)) != 0 || !whole->bd_claiming,
> > 			  &bdev_lock);
> 
> .... and this still does it. 
> 
> In fact, it's worse, because now I have -zero idea- of what locking
> is being performed in this case, and so now I definitely have to go
> pull that macro apart to understand what this is actually doing.
> 
> Complex macros don't make understanding the code easier - they may
> make writing the code faster, but that comes at the expense of
> clarity and obviousness of the logic flow of the code...

I think that SIMPLE macros rarely make the code easier to understand -
for precisely the reason that you have to go and find out what the macro
actually does.
Complex macros obviously suffer the same problem but I believe they
bring tangible benefits by making review easier for those who understand
the macros, and consequently reducing bugs.

I'm currently particularly sensitive to this since finding that the
open-coded wait loop in pkt_make_request_write() - which I wrote - is
missing a finish_wait() call.  Ouch.  If there had been a
wait_var_event_spinlock() when I wrote that code, the mistake would not
have happened.

The argument about locking being non-obvious is, I think, doubly true
for wait_on_bit_lock().  But that is still a useful interface.

Thanks,
NeilBrown

end of thread, other threads:[~2024-08-28  5:20 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-19  5:20 [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile NeilBrown
2024-08-19  5:20 ` [PATCH 1/9] i915: remove wake_up on I915_RESET_MODESET NeilBrown
2024-08-19  5:20 ` [PATCH 2/9] Introduce atomic_dec_and_wake_up_var() NeilBrown
2024-08-20  5:47   ` kernel test robot
2024-08-21  5:23   ` kernel test robot
2024-08-19  5:20 ` [PATCH 3/9] XFS: use wait_var_event() when waiting of i_pincount NeilBrown
2024-08-19  5:20 ` [PATCH 4/9] Use wait_var_event() instead of I_DIO_WAKEUP NeilBrown
2024-08-20  7:22   ` Christian Brauner
2024-08-20 19:12     ` Christian Brauner
2024-08-19  5:20 ` [PATCH 5/9] Block: switch bd_prepare_to_claim to use ___wait_var_event() NeilBrown
2024-08-20  4:18   ` Dave Chinner
2024-08-20 21:52     ` NeilBrown
2024-08-28  1:22       ` Dave Chinner
2024-08-28  5:20         ` NeilBrown
2024-08-21  7:10   ` kernel test robot
2024-08-19  5:20 ` [PATCH 6/9] block/pktdvd: switch congestion waiting to ___wait_var_event() NeilBrown
2024-08-19  5:20 ` [PATCH 7/9] Improve and expand wake_up_bit() interface NeilBrown
2024-08-19  5:20 ` [PATCH 8/9] Improve and extend wake_up_var() interface NeilBrown
2024-08-19  5:20 ` [PATCH 9/9] Use clear_and_wake_up_bit() where appropriate NeilBrown
2024-08-19  6:13 ` [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile Linus Torvalds
2024-08-20 21:47   ` NeilBrown
2024-08-20 21:58     ` Linus Torvalds
2024-08-20 22:15       ` NeilBrown
2024-08-20 22:24         ` Linus Torvalds
2024-08-19  8:16 ` Christian Brauner
2024-08-19 17:45   ` Linus Torvalds
2024-08-19 20:52   ` NeilBrown
2024-08-19 21:12     ` Linus Torvalds
2024-08-20 16:06       ` [PATCH RFC 0/5] inode: turn i_state into u32 Christian Brauner
2024-08-20 16:06         ` [PATCH RFC 1/5] fs: add i_state helpers Christian Brauner
2024-08-20 17:10           ` Linus Torvalds
2024-08-20 17:19             ` Linus Torvalds
2024-08-20 19:10               ` Christian Brauner
2024-08-20 19:08             ` Christian Brauner
2024-08-20 16:06         ` [PATCH RFC 2/5] writeback: port __I_SYNC to var event Christian Brauner
2024-08-20 16:06         ` [PATCH RFC 3/5] inode: port __I_NEW " Christian Brauner
2024-08-20 16:06         ` [PATCH RFC 4/5] inode: port __I_LRU_ISOLATING " Christian Brauner
2024-08-20 16:06         ` [PATCH RFC 5/5] inode: make i_state a u32 Christian Brauner
