stable.vger.kernel.org archive mirror
From: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
To: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>
Subject: [GIT PULL for-4.9 23/48] ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock
Date: Wed, 11 Oct 2017 00:45:25 +0000
Message-ID: <20171011004512.7949-24-alexander.levin@verizon.com>
In-Reply-To: <20171011004512.7949-1-alexander.levin@verizon.com>

From: Eric Ren <zren@suse.com>

[ Upstream commit 439a36b8ef38657f765b80b775e2885338d72451 ]

We have to avoid recursive cluster locking, but there is currently no way
to check whether a cluster lock has already been taken by a process.

Mostly, we can avoid recursive locking by writing code carefully.
However, we found that it's very hard to handle the routines that are
invoked directly by vfs code.  For instance:

  const struct inode_operations ocfs2_file_iops = {
      .permission     = ocfs2_permission,
      .get_acl        = ocfs2_iop_get_acl,
      .set_acl        = ocfs2_iop_set_acl,
  };

Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR):

  do_sys_open
   may_open
    inode_permission
     ocfs2_permission
      ocfs2_inode_lock() <=== first time
       generic_permission
        get_acl
         ocfs2_iop_get_acl
          ocfs2_inode_lock() <=== recursive one

A deadlock will occur if a remote EX request comes in between the two
ocfs2_inode_lock() calls.  Briefly, the deadlock forms as follows:

On one hand, the OCFS2_LOCK_BLOCKED flag of this lockres is set in the
BAST (ocfs2_generic_handle_bast) when the downconvert is started on behalf
of the remote EX lock request.  On the other hand, the recursive cluster
lock (the second one) is blocked in __ocfs2_cluster_lock() because of
OCFS2_LOCK_BLOCKED.  The downconvert never completes, because there is no
chance for the first cluster lock on this node to be unlocked: we have
blocked ourselves in the code path.
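
For illustration only (not part of the patch), the circular wait described
above can be written down as a toy userspace model.  Every name below is
hypothetical, and the model ignores lock levels and the real DLM state
machine; it only captures the two waits and why they never resolve.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one lock resource on the local node; all names are hypothetical. */
struct toy_lockres {
	int  local_holders;	/* local cluster-lock holders not yet unlocked */
	bool blocked;		/* stands in for OCFS2_LOCK_BLOCKED */
};

/* A remote EX request arrives: the BAST marks the lockres as blocked. */
void toy_remote_ex_request(struct toy_lockres *res)
{
	res->blocked = true;
}

/* The downconvert can finish only after every local holder has unlocked. */
bool toy_downconvert_can_finish(const struct toy_lockres *res)
{
	return res->local_holders == 0;
}

/* A new local lock request has to wait while the lockres is blocked. */
bool toy_lock_request_must_wait(const struct toy_lockres *res)
{
	return res->blocked;
}

/*
 * A task that already holds the lock and asks for it again is stuck:
 * its request waits for the downconvert, and the downconvert waits for
 * the same task to drop the lock it already holds.
 */
bool toy_recursive_request_deadlocks(const struct toy_lockres *res)
{
	assert(res->local_holders > 0);	/* the caller is one of the holders */
	return toy_lock_request_must_wait(res) &&
	       !toy_downconvert_can_finish(res);
}
```

With one local holder and no remote request, a second request from the same
task would simply proceed; once toy_remote_ex_request() has run, the same
request can no longer make progress.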

The idea to fix this issue is mostly taken from gfs2 code.

1. introduce a new field, struct ocfs2_lock_res.l_holders, to keep track
   of the pids of the processes that have taken the cluster lock of this
   lock resource;

2. introduce a new flag for ocfs2_inode_lock_full(),
   OCFS2_META_LOCK_GETBH; it means "just get the disk inode bh back for
   us", for use when we already hold the cluster lock;

3. export a helper, ocfs2_is_locked_by_me(), to check whether we have
   already taken the cluster lock in the upper code path.
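
As a rough userspace sketch of item 1 and item 3 (not the kernel code), the
holder list can be modelled with a hand-rolled circular list.  The names
model_* are hypothetical, a plain int stands in for struct pid, and the
l_lock spinlock that the real helpers take around list access is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ocfs2_lock_holder. */
struct model_holder {
	struct model_holder *prev, *next;
	int owner;			/* stands in for oh_owner_pid */
};

/* Hypothetical stand-in for the l_holders field of struct ocfs2_lock_res. */
struct model_lockres {
	struct model_holder holders;	/* sentinel node, empty when self-linked */
};

void model_lockres_init(struct model_lockres *res)
{
	res->holders.prev = res->holders.next = &res->holders;
	res->holders.owner = -1;	/* the sentinel never matches a real owner */
}

/* Queue a holder at the tail, as list_add_tail() does in the patch. */
void model_add_holder(struct model_lockres *res, struct model_holder *oh,
		      int owner)
{
	assert(owner >= 0);
	oh->owner = owner;
	oh->prev = res->holders.prev;
	oh->next = &res->holders;
	res->holders.prev->next = oh;
	res->holders.prev = oh;
}

/* Unlink a holder from whatever list it is on. */
void model_remove_holder(struct model_holder *oh)
{
	oh->prev->next = oh->next;
	oh->next->prev = oh->prev;
	oh->prev = oh->next = oh;
}

/* Return 1 if some queued holder belongs to 'owner', 0 otherwise. */
int model_is_locked_by(const struct model_lockres *res, int owner)
{
	const struct model_holder *oh;

	for (oh = res->holders.next; oh != &res->holders; oh = oh->next)
		if (oh->owner == owner)
			return 1;
	return 0;
}
```

The holder itself lives in the caller (on the stack in the patch), so adding
and removing it needs no allocation.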

The tracking logic should be used by some of the ocfs2 VFS callbacks to
solve the recursive locking issue caused by the fact that VFS routines
can call into each other.
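
A minimal single-task sketch of how nested callbacks might pair the two
_tracker calls is given here for illustration.  It is a userspace toy under
stated assumptions: the toy_* names are hypothetical, a single flag replaces
the holder list, and the ret_bh / OCFS2_META_LOCK_GETBH path and error
returns are omitted.  This patch only adds the helpers; it does not convert
any caller.

```c
#include <assert.h>

/* Hypothetical toy lock resource for a single task. */
struct toy_lockres {
	int cluster_locks;	/* times the simulated cluster lock is held */
	int held_by_me;		/* 1 while the current task has a holder queued */
};

/* Returns 0 if this call took the lock, 1 if the task already held it. */
int toy_lock_tracker(struct toy_lockres *res)
{
	int has_locked = res->held_by_me;

	if (!has_locked) {
		res->cluster_locks++;	/* the real cluster lock is taken once */
		res->held_by_me = 1;	/* like ocfs2_add_holder() */
	}
	return has_locked;
}

/* Only the call that actually took the lock releases it. */
void toy_unlock_tracker(struct toy_lockres *res, int had_lock)
{
	if (!had_lock) {
		res->held_by_me = 0;	/* like ocfs2_remove_holder() */
		res->cluster_locks--;
	}
}

/* Inner callback, in the position of a get_acl handler. */
int toy_get_acl(struct toy_lockres *res)
{
	int had_lock = toy_lock_tracker(res);
	int locks_seen = res->cluster_locks;

	toy_unlock_tracker(res, had_lock);
	return locks_seen;
}

/* Outer callback, in the position of a permission handler. */
int toy_permission(struct toy_lockres *res)
{
	int had_lock = toy_lock_tracker(res);
	int locks_seen = toy_get_acl(res);	/* nested call, no second lock */

	assert(res->held_by_me);	/* the inner unlock must not drop our lock */
	toy_unlock_tracker(res, had_lock);
	return locks_seen;
}
```

In this model the nested call sees a lock count of 1 rather than 2, and the
lock is dropped only when the outermost caller unlocks.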

The performance penalty of processing the holder list should only be
seen in the few cases where the tracking logic is used, such as get/set
acl.

You may ask: what if we got a PR lock the first time and want an EX lock
the second time?  Fortunately, this case never happens in the real world,
as far as I can see, including permission check and (get|set)_(acl|attr),
and the gfs2 code does the same.

[sfr@canb.auug.org.au remove some inlines]
Link: http://lkml.kernel.org/r/20170117100948.11657-2-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
---
 fs/ocfs2/dlmglue.c | 105 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ocfs2/dlmglue.h |  18 +++++++++
 fs/ocfs2/ocfs2.h   |   1 +
 3 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 77d1632e905d..8dce4099a6ca 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -532,6 +532,7 @@ void ocfs2_lock_res_init_once(struct ocfs2_lock_res *res)
 	init_waitqueue_head(&res->l_event);
 	INIT_LIST_HEAD(&res->l_blocked_list);
 	INIT_LIST_HEAD(&res->l_mask_waiters);
+	INIT_LIST_HEAD(&res->l_holders);
 }
 
 void ocfs2_inode_lock_res_init(struct ocfs2_lock_res *res,
@@ -749,6 +750,50 @@ void ocfs2_lock_res_free(struct ocfs2_lock_res *res)
 	res->l_flags = 0UL;
 }
 
+/*
+ * Keep a list of processes who have interest in a lockres.
+ * Note: this is currently used only to check for recursive cluster locking.
+ */
+static inline void ocfs2_add_holder(struct ocfs2_lock_res *lockres,
+				   struct ocfs2_lock_holder *oh)
+{
+	INIT_LIST_HEAD(&oh->oh_list);
+	oh->oh_owner_pid = get_pid(task_pid(current));
+
+	spin_lock(&lockres->l_lock);
+	list_add_tail(&oh->oh_list, &lockres->l_holders);
+	spin_unlock(&lockres->l_lock);
+}
+
+static inline void ocfs2_remove_holder(struct ocfs2_lock_res *lockres,
+				       struct ocfs2_lock_holder *oh)
+{
+	spin_lock(&lockres->l_lock);
+	list_del(&oh->oh_list);
+	spin_unlock(&lockres->l_lock);
+
+	put_pid(oh->oh_owner_pid);
+}
+
+static inline int ocfs2_is_locked_by_me(struct ocfs2_lock_res *lockres)
+{
+	struct ocfs2_lock_holder *oh;
+	struct pid *pid;
+
+	/* look in the list of holders for one with the current task as owner */
+	spin_lock(&lockres->l_lock);
+	pid = task_pid(current);
+	list_for_each_entry(oh, &lockres->l_holders, oh_list) {
+		if (oh->oh_owner_pid == pid) {
+			spin_unlock(&lockres->l_lock);
+			return 1;
+		}
+	}
+	spin_unlock(&lockres->l_lock);
+
+	return 0;
+}
+
 static inline void ocfs2_inc_holders(struct ocfs2_lock_res *lockres,
 				     int level)
 {
@@ -2333,8 +2378,9 @@ int ocfs2_inode_lock_full_nested(struct inode *inode,
 		goto getbh;
 	}
 
-	if (ocfs2_mount_local(osb))
-		goto local;
+	if ((arg_flags & OCFS2_META_LOCK_GETBH) ||
+	    ocfs2_mount_local(osb))
+		goto update;
 
 	if (!(arg_flags & OCFS2_META_LOCK_RECOVERY))
 		ocfs2_wait_for_recovery(osb);
@@ -2363,7 +2409,7 @@ int ocfs2_inode_lock_full_nested(struct inode *inode,
 	if (!(arg_flags & OCFS2_META_LOCK_RECOVERY))
 		ocfs2_wait_for_recovery(osb);
 
-local:
+update:
 	/*
 	 * We only see this flag if we're being called from
 	 * ocfs2_read_locked_inode(). It means we're locking an inode
@@ -2497,6 +2543,59 @@ void ocfs2_inode_unlock(struct inode *inode,
 		ocfs2_cluster_unlock(OCFS2_SB(inode->i_sb), lockres, level);
 }
 
+/*
+ * These _tracker variants are introduced to deal with the recursive cluster
+ * locking issue. The idea is to keep track of a lock holder on the stack of
+ * the current process. If there's a lock holder on the stack, we know the
+ * task context is already protected by cluster locking. Currently, they're
+ * used in some VFS entry routines.
+ *
+ * return < 0 on error, return == 0 if there's no lock holder on the stack
+ * before this call, return == 1 if this call would be a recursive locking.
+ */
+int ocfs2_inode_lock_tracker(struct inode *inode,
+			     struct buffer_head **ret_bh,
+			     int ex,
+			     struct ocfs2_lock_holder *oh)
+{
+	int status;
+	int arg_flags = 0, has_locked;
+	struct ocfs2_lock_res *lockres;
+
+	lockres = &OCFS2_I(inode)->ip_inode_lockres;
+	has_locked = ocfs2_is_locked_by_me(lockres);
+	/* Just get buffer head if the cluster lock has been taken */
+	if (has_locked)
+		arg_flags = OCFS2_META_LOCK_GETBH;
+
+	if (likely(!has_locked || ret_bh)) {
+		status = ocfs2_inode_lock_full(inode, ret_bh, ex, arg_flags);
+		if (status < 0) {
+			if (status != -ENOENT)
+				mlog_errno(status);
+			return status;
+		}
+	}
+	if (!has_locked)
+		ocfs2_add_holder(lockres, oh);
+
+	return has_locked;
+}
+
+void ocfs2_inode_unlock_tracker(struct inode *inode,
+				int ex,
+				struct ocfs2_lock_holder *oh,
+				int had_lock)
+{
+	struct ocfs2_lock_res *lockres;
+
+	lockres = &OCFS2_I(inode)->ip_inode_lockres;
+	if (!had_lock) {
+		ocfs2_remove_holder(lockres, oh);
+		ocfs2_inode_unlock(inode, ex);
+	}
+}
+
 int ocfs2_orphan_scan_lock(struct ocfs2_super *osb, u32 *seqno)
 {
 	struct ocfs2_lock_res *lockres;
diff --git a/fs/ocfs2/dlmglue.h b/fs/ocfs2/dlmglue.h
index d293a22c32c5..a7fc18ba0dc1 100644
--- a/fs/ocfs2/dlmglue.h
+++ b/fs/ocfs2/dlmglue.h
@@ -70,6 +70,11 @@ struct ocfs2_orphan_scan_lvb {
 	__be32	lvb_os_seqno;
 };
 
+struct ocfs2_lock_holder {
+	struct list_head oh_list;
+	struct pid *oh_owner_pid;
+};
+
 /* ocfs2_inode_lock_full() 'arg_flags' flags */
 /* don't wait on recovery. */
 #define OCFS2_META_LOCK_RECOVERY	(0x01)
@@ -77,6 +82,8 @@ struct ocfs2_orphan_scan_lvb {
 #define OCFS2_META_LOCK_NOQUEUE		(0x02)
 /* don't block waiting for the downconvert thread, instead return -EAGAIN */
 #define OCFS2_LOCK_NONBLOCK		(0x04)
+/* just get back disk inode bh if we've got cluster lock. */
+#define OCFS2_META_LOCK_GETBH		(0x08)
 
 /* Locking subclasses of inode cluster lock */
 enum {
@@ -170,4 +177,15 @@ void ocfs2_put_dlm_debug(struct ocfs2_dlm_debug *dlm_debug);
 
 /* To set the locking protocol on module initialization */
 void ocfs2_set_locking_protocol(void);
+
+/* The _tracker pair is used to avoid cluster recursive locking */
+int ocfs2_inode_lock_tracker(struct inode *inode,
+			     struct buffer_head **ret_bh,
+			     int ex,
+			     struct ocfs2_lock_holder *oh);
+void ocfs2_inode_unlock_tracker(struct inode *inode,
+				int ex,
+				struct ocfs2_lock_holder *oh,
+				int had_lock);
+
 #endif	/* DLMGLUE_H */
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index e63af7ddfe68..594575e380e8 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -172,6 +172,7 @@ struct ocfs2_lock_res {
 
 	struct list_head         l_blocked_list;
 	struct list_head         l_mask_waiters;
+	struct list_head	 l_holders;
 
 	unsigned long		 l_flags;
 	char                     l_name[OCFS2_LOCK_ID_MAX_LEN];
-- 
2.11.0
