From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, felixh@informatik.uni-bremen.de,
"H. Peter Anvin" <hpa@zytor.com>,
Peter Zijlstra <peterz@infradead.org>,
Davidlohr Bueso <dave@stgolabs.net>,
Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>,
1vier1@web.de, Andrew Morton <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 4.8 38/57] ipc/sem.c: fix complex_count vs. simple op race
Date: Fri, 21 Oct 2016 11:18:01 +0200 [thread overview]
Message-ID: <20161021091437.220420413@linuxfoundation.org> (raw)
In-Reply-To: <20161021091435.435647262@linuxfoundation.org>
4.8-stable review patch. If anyone has any objections, please let me know.
------------------
From: Manfred Spraul <manfred@colorfullife.com>
commit 5864a2fd3088db73d47942370d0f7210a807b9bc upstream.
Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a
race:
sem_lock has a fast path that allows parallel simple operations.
There are two reasons why a simple operation cannot run in parallel:
- a non-simple operations is ongoing (sma->sem_perm.lock held)
- a complex operation is sleeping (sma->complex_count != 0)
As both facts are stored independently, a thread can bypass the current
checks by sleeping in the right positions. See below for more details
(or kernel bugzilla 105651).
The patch fixes that by creating one variable (complex_mode)
that tracks both reasons why parallel operations are not possible.
The patch also updates stale documentation regarding the locking.
With regards to stable kernels:
The patch is required for all kernels that include the
commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?)
The alternative is to revert the patch that introduced the race.
The patch is safe for backporting, i.e. it makes no assumptions
about memory barriers in spin_unlock_wait().
Background:
Here is the race of the current implementation:
Thread A: (simple op)
- does the first "sma->complex_count == 0" test
Thread B: (complex op)
- does sem_lock(): This includes an array scan. But the scan can't
find Thread A, because Thread A does not own sem->lock yet.
- the thread does the operation, increases complex_count,
drops sem_lock, sleeps
Thread A:
- spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
- sleeps before the complex_count test
Thread C: (complex op)
- does sem_lock (no array scan, complex_count==1)
- wakes up Thread B.
- decrements complex_count
Thread A:
- does the complex_count test
Bug:
Now both thread A and thread C operate on the same array, without
any synchronization.
Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
Link: http://lkml.kernel.org/r/1469123695-5661-1-git-send-email-manfred@colorfullife.com
Reported-by: <felixh@informatik.uni-bremen.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <1vier1@web.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
include/linux/sem.h | 1
ipc/sem.c | 138 +++++++++++++++++++++++++++++++---------------------
2 files changed, 84 insertions(+), 55 deletions(-)
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -21,6 +21,7 @@ struct sem_array {
struct list_head list_id; /* undo requests on this array */
int sem_nsems; /* no. of semaphores in array */
int complex_count; /* pending complex operations */
+ bool complex_mode; /* no parallel simple ops */
};
#ifdef CONFIG_SYSVIPC
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -162,14 +162,21 @@ static int sysvipc_sem_proc_show(struct
/*
* Locking:
+ * a) global sem_lock() for read/write
* sem_undo.id_next,
* sem_array.complex_count,
- * sem_array.pending{_alter,_cont},
- * sem_array.sem_undo: global sem_lock() for read/write
- * sem_undo.proc_next: only "current" is allowed to read/write that field.
+ * sem_array.complex_mode
+ * sem_array.pending{_alter,_const},
+ * sem_array.sem_undo
*
+ * b) global or semaphore sem_lock() for read/write:
* sem_array.sem_base[i].pending_{const,alter}:
- * global or semaphore sem_lock() for read/write
+ * sem_array.complex_mode (for read)
+ *
+ * c) special:
+ * sem_undo_list.list_proc:
+ * * undo_list->lock for write
+ * * rcu for read
*/
#define sc_semmsl sem_ctls[0]
@@ -260,30 +267,61 @@ static void sem_rcu_free(struct rcu_head
}
/*
- * Wait until all currently ongoing simple ops have completed.
+ * Enter the mode suitable for non-simple operations:
* Caller must own sem_perm.lock.
- * New simple ops cannot start, because simple ops first check
- * that sem_perm.lock is free.
- * that a) sem_perm.lock is free and b) complex_count is 0.
*/
-static void sem_wait_array(struct sem_array *sma)
+static void complexmode_enter(struct sem_array *sma)
{
int i;
struct sem *sem;
- if (sma->complex_count) {
- /* The thread that increased sma->complex_count waited on
- * all sem->lock locks. Thus we don't need to wait again.
- */
+ if (sma->complex_mode) {
+ /* We are already in complex_mode. Nothing to do */
return;
}
+ /* We need a full barrier after seting complex_mode:
+ * The write to complex_mode must be visible
+ * before we read the first sem->lock spinlock state.
+ */
+ smp_store_mb(sma->complex_mode, true);
+
for (i = 0; i < sma->sem_nsems; i++) {
sem = sma->sem_base + i;
spin_unlock_wait(&sem->lock);
}
+ /*
+ * spin_unlock_wait() is not a memory barriers, it is only a
+ * control barrier. The code must pair with spin_unlock(&sem->lock),
+ * thus just the control barrier is insufficient.
+ *
+ * smp_rmb() is sufficient, as writes cannot pass the control barrier.
+ */
+ smp_rmb();
+}
+
+/*
+ * Try to leave the mode that disallows simple operations:
+ * Caller must own sem_perm.lock.
+ */
+static void complexmode_tryleave(struct sem_array *sma)
+{
+ if (sma->complex_count) {
+ /* Complex ops are sleeping.
+ * We must stay in complex mode
+ */
+ return;
+ }
+ /*
+ * Immediately after setting complex_mode to false,
+ * a simple op can start. Thus: all memory writes
+ * performed by the current operation must be visible
+ * before we set complex_mode to false.
+ */
+ smp_store_release(&sma->complex_mode, false);
}
+#define SEM_GLOBAL_LOCK (-1)
/*
* If the request contains only one semaphore operation, and there are
* no complex transactions pending, lock only the semaphore involved.
@@ -300,56 +338,42 @@ static inline int sem_lock(struct sem_ar
/* Complex operation - acquire a full lock */
ipc_lock_object(&sma->sem_perm);
- /* And wait until all simple ops that are processed
- * right now have dropped their locks.
- */
- sem_wait_array(sma);
- return -1;
+ /* Prevent parallel simple ops */
+ complexmode_enter(sma);
+ return SEM_GLOBAL_LOCK;
}
/*
* Only one semaphore affected - try to optimize locking.
- * The rules are:
- * - optimized locking is possible if no complex operation
- * is either enqueued or processed right now.
- * - The test for enqueued complex ops is simple:
- * sma->complex_count != 0
- * - Testing for complex ops that are processed right now is
- * a bit more difficult. Complex ops acquire the full lock
- * and first wait that the running simple ops have completed.
- * (see above)
- * Thus: If we own a simple lock and the global lock is free
- * and complex_count is now 0, then it will stay 0 and
- * thus just locking sem->lock is sufficient.
+ * Optimized locking is possible if no complex operation
+ * is either enqueued or processed right now.
+ *
+ * Both facts are tracked by complex_mode.
*/
sem = sma->sem_base + sops->sem_num;
- if (sma->complex_count == 0) {
+ /*
+ * Initial check for complex_mode. Just an optimization,
+ * no locking, no memory barrier.
+ */
+ if (!sma->complex_mode) {
/*
* It appears that no complex operation is around.
* Acquire the per-semaphore lock.
*/
spin_lock(&sem->lock);
- /* Then check that the global lock is free */
- if (!spin_is_locked(&sma->sem_perm.lock)) {
- /*
- * We need a memory barrier with acquire semantics,
- * otherwise we can race with another thread that does:
- * complex_count++;
- * spin_unlock(sem_perm.lock);
- */
- smp_acquire__after_ctrl_dep();
+ /*
+ * See 51d7d5205d33
+ * ("powerpc: Add smp_mb() to arch_spin_is_locked()"):
+ * A full barrier is required: the write of sem->lock
+ * must be visible before the read is executed
+ */
+ smp_mb();
- /*
- * Now repeat the test of complex_count:
- * It can't change anymore until we drop sem->lock.
- * Thus: if is now 0, then it will stay 0.
- */
- if (sma->complex_count == 0) {
- /* fast path successful! */
- return sops->sem_num;
- }
+ if (!smp_load_acquire(&sma->complex_mode)) {
+ /* fast path successful! */
+ return sops->sem_num;
}
spin_unlock(&sem->lock);
}
@@ -369,15 +393,16 @@ static inline int sem_lock(struct sem_ar
/* Not a false alarm, thus complete the sequence for a
* full lock.
*/
- sem_wait_array(sma);
- return -1;
+ complexmode_enter(sma);
+ return SEM_GLOBAL_LOCK;
}
}
static inline void sem_unlock(struct sem_array *sma, int locknum)
{
- if (locknum == -1) {
+ if (locknum == SEM_GLOBAL_LOCK) {
unmerge_queues(sma);
+ complexmode_tryleave(sma);
ipc_unlock_object(&sma->sem_perm);
} else {
struct sem *sem = sma->sem_base + locknum;
@@ -529,6 +554,7 @@ static int newary(struct ipc_namespace *
}
sma->complex_count = 0;
+ sma->complex_mode = true; /* dropped by sem_unlock below */
INIT_LIST_HEAD(&sma->pending_alter);
INIT_LIST_HEAD(&sma->pending_const);
INIT_LIST_HEAD(&sma->list_id);
@@ -2184,10 +2210,10 @@ static int sysvipc_sem_proc_show(struct
/*
* The proc interface isn't aware of sem_lock(), it calls
* ipc_lock_object() directly (in sysvipc_find_ipc).
- * In order to stay compatible with sem_lock(), we must wait until
- * all simple semop() calls have left their critical regions.
+ * In order to stay compatible with sem_lock(), we must
+ * enter / leave complex_mode.
*/
- sem_wait_array(sma);
+ complexmode_enter(sma);
sem_otime = get_semotime(sma);
@@ -2204,6 +2230,8 @@ static int sysvipc_sem_proc_show(struct
sem_otime,
sma->sem_ctime);
+ complexmode_tryleave(sma);
+
return 0;
}
#endif
next prev parent reply other threads:[~2016-10-21 9:22 UTC|newest]
Thread overview: 56+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <CGME20161021092156uscas1p2448161ce8162a8e99cfe189d1b435176@uscas1p2.samsung.com>
2016-10-21 9:17 ` [PATCH 4.8 00/57] 4.8.4-stable review Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 03/57] serial: 8250_dw: Check the data->pclk when get apb_pclk Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 04/57] serial: 8250_port: fix runtime PM use in __do_stop_tx_rs485() Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 05/57] ARCv2: intc: Use kflag if STATUS32.IE must be reset Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 06/57] ARCv2: fix local_save_flags Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 07/57] debugfs: introduce a public file_operations accessor Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 08/57] b43: fix debugfs crash Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 09/57] b43legacy: " Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 10/57] carl9170: fix debugfs crashes Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 14/57] btrfs: assign error values to the correct bio structs Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 15/57] mei: amthif: fix deadlock in initialization during a reset Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 16/57] drivers: base: dma-mapping: page align the size when unmap_kernel_range Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 17/57] IB/hfi1: Fix defered ack race with qp destroy Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 18/57] clk: mvebu: fix setting unwanted flags in CP110 gate clock Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 19/57] clk: mvebu: dynamically allocate resources in Armada CP110 system controller Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 20/57] fuse: listxattr: verify xattr list Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 22/57] fuse: fix killing s[ug]id in setattr Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 23/57] mm: filemap: fix mapping->nrpages double accounting in fuse Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 24/57] i40e: avoid NULL pointer dereference and recursive errors on early PCI error Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 25/57] xfs: change mailing list address Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 26/57] mm: filemap: dont plant shadow entries without radix tree node Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 27/57] brcmfmac: fix pmksa->bssid usage Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 30/57] ASoC: nau8825: fix bug in FLL parameter Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 31/57] ASoC: Intel: Atom: add a missing star in a memcpy call Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 32/57] reiserfs: Unlock superblock before calling reiserfs_quota_on_mount() Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 33/57] async_pq_val: fix DMA memory leak Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 34/57] autofs: Fix automounts by using current_real_cred()->uid Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 35/57] scsi: arcmsr: Buffer overflow in arcmsr_iop_message_xfer() Greg Kroah-Hartman
2016-10-21 9:17 ` [PATCH 4.8 36/57] scsi: arcmsr: Simplify user_len checking Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 37/57] scsi: ibmvfc: Fix I/O hang when port is not mapped Greg Kroah-Hartman
2016-10-21 9:18 ` Greg Kroah-Hartman [this message]
2016-10-21 9:18 ` [PATCH 4.8 39/57] mm/hugetlb: fix memory offline with hugepage size > memory block size Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 40/57] vfs,mm: fix a dead loop in truncate_inode_pages_range() Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 41/57] jbd2: fix lockdep annotation in add_transaction_credits() Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 42/57] ext4: enforce online defrag restriction for encrypted files Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 43/57] ext4: reinforce check of i_dtime when clearing high fields of uid and gid Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 45/57] ext4: fix memory leak in ext4_insert_range() Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 46/57] ext4: fix memory leak when symlink decryption fails Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 47/57] ext4: allow DAX writeback for hole punch Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 48/57] ext4: release bh in make_indexed_dir Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 49/57] ext4: unmap metadata when zeroing blocks Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 50/57] crypto: ghash-generic - move common definitions to a new header file Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 51/57] crypto: vmx - Fix memory corruption caused by p8_ghash Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 52/57] dlm: free workqueues after the connections Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 53/57] vfs: move permission checking into notify_change() for utimes(NULL) Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 54/57] cachefiles: Fix attempt to read i_blocks after deleting file [ver #2] Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 55/57] drm: virtio: reinstate drm_virtio_set_busid() Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 56/57] acpi, nfit: check for the correct event code in notifications Greg Kroah-Hartman
2016-10-21 9:18 ` [PATCH 4.8 57/57] cfq: fix starvation of asynchronous writes Greg Kroah-Hartman
2016-10-21 15:46 ` [PATCH 4.8 00/57] 4.8.4-stable review Shuah Khan
2016-10-22 9:56 ` Greg Kroah-Hartman
2016-10-21 19:17 ` Guenter Roeck
2016-10-22 9:56 ` Greg Kroah-Hartman
2016-10-21 21:02 ` Rafael J. Wysocki
2016-10-22 9:58 ` Greg Kroah-Hartman
2016-10-23 0:04 ` Rafael J. Wysocki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20161021091437.220420413@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=1vier1@web.de \
--cc=akpm@linux-foundation.org \
--cc=dave@stgolabs.net \
--cc=felixh@informatik.uni-bremen.de \
--cc=hpa@zytor.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=peterz@infradead.org \
--cc=stable@vger.kernel.org \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).