From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f44.google.com ([74.125.82.44]:36926 "EHLO
	mail-wm0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753144AbcADScy (ORCPT );
	Mon, 4 Jan 2016 13:32:54 -0500
Received: by mail-wm0-f44.google.com with SMTP id f206so226549769wmf.0
	for ; Mon, 04 Jan 2016 10:32:52 -0800 (PST)
Subject: Re: [PATCH] ipc/sem.c: Fix complex_count vs. simple op race
To: Davidlohr Bueso
References: <1451736291-8115-1-git-send-email-manfred@colorfullife.com>
	<20160104130231.GA3013@linux-uzut.site>
Cc: Andrew Morton , LKML , 1vier1@web.de, Ingo Molnar ,
	felixh@informatik.uni-bremen.de, stable@vger.kernel.org
From: Manfred Spraul
Message-ID: <568ABAD2.4060707@colorfullife.com>
Date: Mon, 4 Jan 2016 19:32:50 +0100
MIME-Version: 1.0
In-Reply-To: <20160104130231.GA3013@linux-uzut.site>
Content-Type: multipart/mixed; boundary="------------080502020409050003040901"
Sender: stable-owner@vger.kernel.org
List-ID:

This is a multi-part message in MIME format.
--------------080502020409050003040901
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit

On 01/04/2016 02:02 PM, Davidlohr Bueso wrote:
> On Sat, 02 Jan 2016, Manfred Spraul wrote:
>
>> Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a
>> race:
>>
>> sem_lock has a fast path that allows parallel simple operations.
>> There are two reasons why a simple operation cannot run in parallel:
>> - a non-simple operation is ongoing (sma->sem_perm.lock held)
>> - a complex operation is sleeping (sma->complex_count != 0)
>>
>> As both facts are stored independently, a thread can bypass the current
>> checks by sleeping in the right positions. See below for more details
>> (or kernel bugzilla 105651).
>>
>> The patch fixes that by creating one variable (complex_mode)
>> that tracks both reasons why parallel operations are not possible.
>>
>> The patch also updates stale documentation regarding the locking.
>>
>> With regards to stable kernels:
>> The patch is required for all kernels that include the commit
>> 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?)
>>
>> The alternative is to revert the patch that introduced the race.
>
> I am just now catching up with this, but a quick thought is that we
> probably want to keep 6d07b68ce16a as waiting on unlocking all
> sem->lock should be much worse for performance than keeping track of
> the complex 'mode'.

Keeping track is simple - and the fast path gets simpler compared to
the current code.

> Specially if we have a large array.

Yes, I would prefer my patch as a backport.
But if someone has objections, then a revert would be an alternative.

> Also, any idea what workload exposed this race? Anyway, will take a
> closer look at the patch/issue.

It was found by a theoretical review: the sem_lock code was used as an
exercise at the University of Bremen.

--
    Manfred

P.S.: If we replace the "bool" with an "int", we could even introduce a
hysteresis, to further reduce the number of array scans.
--------------080502020409050003040901
Content-Type: text/x-patch; name="0001-ipc-sem-sem_lock-with-hysteresis.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename="0001-ipc-sem-sem_lock-with-hysteresis.patch"

>From d3ea4cddaff440c03b89b045ee5a0be8c4db623d Mon Sep 17 00:00:00 2001
From: Manfred Spraul
Date: Mon, 4 Jan 2016 19:20:34 +0100
Subject: [PATCH] ipc/sem: sem_lock with hysteresis

Untested.
May not work.
May not pass checkpatch.pl.
May not be an improvement at all, not even for microbenchmarks.
May be a worthless pseudo improvement without any real-world advantages
(except more complexity and more risks for bugs).
---
 include/linux/sem.h |  2 +-
 ipc/sem.c           | 60 ++++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index d0efd6e..6fb3227 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -21,7 +21,7 @@ struct sem_array {
 	struct list_head	list_id;	/* undo requests on this array */
 	int			sem_nsems;	/* no. of semaphores in array */
 	int			complex_count;	/* pending complex operations */
-	bool			complex_mode;	/* no parallel simple ops */
+	int			complex_mode;	/* >0: no parallel simple ops */
 };

 #ifdef CONFIG_SYSVIPC
diff --git a/ipc/sem.c b/ipc/sem.c
index 87e1f5d..f580c7c 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -154,6 +154,13 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 #define SEMOPM_FAST	64  /* ~ 372 bytes on stack */

 /*
+ * Switching from the mode suitable for complex ops
+ * to the mode for simple ops is costly. Therefore:
+ * use some hysteresis
+ */
+#define COMPLEX_MODE_ENTER	10
+
+/*
  * Locking:
  * a) global sem_lock() for read/write
  *    sem_undo.id_next,
@@ -278,11 +285,16 @@ static void complexmode_enter(struct sem_array *sma)
 	int i;
 	struct sem *sem;

-	if (sma->complex_mode) {
-		/* We are already in complex_mode. Nothing to do */
+	if (sma->complex_mode > 0) {
+		/*
+		 * We are already in complex_mode.
+		 * Nothing to do, just re-arm the
+		 * counter that delays the return to simple mode
+		 */
+		WRITE_ONCE(sma->complex_mode, COMPLEX_MODE_ENTER);
 		return;
 	}
-	WRITE_ONCE(sma->complex_mode, true);
+	WRITE_ONCE(sma->complex_mode, COMPLEX_MODE_ENTER);

 	/* We need a full barrier:
	 * The write to complex_mode must be visible
@@ -309,15 +321,18 @@ static void complexmode_tryleave(struct sem_array *sma)
 		 */
 		return;
 	}
+	if (sma->complex_mode == 0) {
+		pr_info("error.");
+	}

 	/*
-	 * Immediately after setting complex_mode to false,
+	 * Immediately after setting complex_mode to 0,
 	 * a simple op can start. Thus: all memory writes
 	 * performed by the current operation must be visible
 	 * before we set complex_mode to false.
 	 */
 	smp_wmb();
-	WRITE_ONCE(sma->complex_mode, false);
+	WRITE_ONCE(sma->complex_mode, sma->complex_mode-1);
 }

 /*
@@ -377,19 +392,30 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,

 	if (sma->complex_count == 0) {
 		/* False alarm:
-		 * There is no complex operation, thus we can switch
-		 * back to the fast path.
-		 */
-		spin_lock(&sem->lock);
-		ipc_unlock_object(&sma->sem_perm);
-		return sops->sem_num;
-	} else {
-		/* Not a false alarm, thus complete the sequence for a
-		 * full lock.
+		 * There is no complex operation; check the hysteresis.
+		 * If it is 0, switch back to the fast path.
 		 */
+		if (sma->complex_mode > 0) {
+			/*
+			 * barrier provided by spin_lock
+			 * sma->sem_perm.lock
+			 */
+
+			WRITE_ONCE(sma->complex_mode, sma->complex_mode-1);
+		}
+		if (sma->complex_mode == 0) {
+			spin_lock(&sem->lock);
+			ipc_unlock_object(&sma->sem_perm);
+			return sops->sem_num;
+		}
 	}
+	/*
+	 * Not a false alarm, but we must already be in
+	 * complex_mode: either because of waiting complex ops,
+	 * or due to hysteresis.
+	 */
+	WARN_ON(sma->complex_mode == 0);
+	return -1;
 }

 static inline void sem_unlock(struct sem_array *sma, int locknum)
@@ -548,7 +574,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 	}

 	sma->complex_count = 0;
-	sma->complex_mode = true; /* dropped by sem_unlock below */
+	WRITE_ONCE(sma->complex_mode, COMPLEX_MODE_ENTER);
 	INIT_LIST_HEAD(&sma->pending_alter);
 	INIT_LIST_HEAD(&sma->pending_const);
 	INIT_LIST_HEAD(&sma->list_id);
-- 
2.4.3

--------------080502020409050003040901--