stable.vger.kernel.org archive mirror
* [3.8.y.z extended stable] Linux 3.8.13.28 stable review
@ 2014-08-25 16:54 Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 01/13] x86_32, entry: Do syscall exit work on badsys (CVE-2014-4508) Kamal Mostafa
                   ` (13 more replies)
  0 siblings, 14 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team; +Cc: Kamal Mostafa

This is the start of the review cycle for the Linux 3.8.13.28 stable kernel.

 ** NOTE: This will be the last Linux 3.8.y.z extended stable version
 ** to be released and supported by me and the Ubuntu Kernel team.

This version contains 13 new patches, summarized below.  The new patches are
posted as replies to this message and also available in this git branch:

http://kernel.ubuntu.com/git?p=ubuntu/linux.git;h=linux-3.8.y-review;a=shortlog

git://kernel.ubuntu.com/ubuntu/linux.git  linux-3.8.y-review

The review period for version 3.8.13.28 will be open for the next three days.
To report a problem, please reply to the relevant follow-up patch message.

For more information about the Linux 3.8.y.z extended stable kernel version,
see https://wiki.ubuntu.com/Kernel/Dev/ExtendedStable .

 -Kamal

--
 arch/x86/include/asm/ptrace.h   |  16 +++++++
 arch/x86/kernel/entry_32.S      |  13 +++--
 drivers/target/target_core_rd.c |   2 +-
 fs/namespace.c                  |  59 ++++++++++++++++++++---
 include/linux/mount.h           |   9 +++-
 include/linux/ptrace.h          |   3 ++
 mm/shmem.c                      | 104 +++++++++++++++++++++++++++++++++++-----
 net/l2tp/l2tp_ppp.c             |   4 +-
 net/sctp/associola.c            |   2 +-
 9 files changed, 185 insertions(+), 27 deletions(-)

Andy Lutomirski (1):
      x86_32, entry: Do syscall exit work on badsys (CVE-2014-4508)

Eric W. Biederman (4):
      mnt: Only change user settable mount flags in remount
      mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
      mnt: Correct permission checks in do_remount
      mnt: Change the default remount atime from relatime to the existing value

Hugh Dickins (3):
      shmem: fix faulting into a hole while it's punched
      shmem: fix faulting into a hole, not taking i_mutex
      shmem: fix splicing from a hole while it's punched

Nicholas Bellinger (1):
      target: Explicitly clear ramdisk_mcp backend pages

Sasha Levin (1):
      net/l2tp: don't fall back on UDP [get|set]sockopt

Sven Wegener (1):
      x86_32, entry: Store badsys error code in %eax

Tejun Heo (1):
      ptrace,x86: force IRET path after a ptrace_stop()

Xufeng Zhang (1):
      sctp: Fix sk_ack_backlog wrap-around problem

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 3.8 01/13] x86_32, entry: Do syscall exit work on badsys (CVE-2014-4508)
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
@ 2014-08-25 16:54 ` Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 02/13] x86_32, entry: Store badsys error code in %eax Kamal Mostafa
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team
  Cc: Roland McGrath, Andy Lutomirski, H. Peter Anvin, Kamal Mostafa

3.8.13.28 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: Andy Lutomirski <luto@amacapital.net>

commit 554086d85e71f30abe46fc014fea31929a7c6a8a upstream.

The bad syscall nr paths are their own incomprehensible route
through the entry control flow.  Rearrange them to work just like
syscalls that return -ENOSYS.

This fixes an OOPS in the audit code when fast-path auditing is
enabled and sysenter gets a bad syscall nr (CVE-2014-4508).

This has probably been broken since Linux 2.6.27:
af0575bba0 i386 syscall audit fast-path

Cc: Roland McGrath <roland@redhat.com>
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/e09c499eade6fc321266dd6b54da7beb28d6991c.1403558229.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 arch/x86/kernel/entry_32.S | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 60d03c2..b32b466 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -434,9 +434,10 @@ sysenter_past_esp:
 	jnz sysenter_audit
 sysenter_do_call:
 	cmpl $(NR_syscalls), %eax
-	jae syscall_badsys
+	jae sysenter_badsys
 	call *sys_call_table(,%eax,4)
 	movl %eax,PT_EAX(%esp)
+sysenter_after_call:
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
@@ -686,7 +687,12 @@ END(syscall_fault)
 
 syscall_badsys:
 	movl $-ENOSYS,PT_EAX(%esp)
-	jmp resume_userspace
+	jmp syscall_exit
+END(syscall_badsys)
+
+sysenter_badsys:
+	movl $-ENOSYS,PT_EAX(%esp)
+	jmp sysenter_after_call
 END(syscall_badsys)
 	CFI_ENDPROC
 /*
-- 
1.9.1



* [PATCH 3.8 02/13] x86_32, entry: Store badsys error code in %eax
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 01/13] x86_32, entry: Do syscall exit work on badsys (CVE-2014-4508) Kamal Mostafa
@ 2014-08-25 16:54 ` Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 03/13] shmem: fix faulting into a hole while it's punched Kamal Mostafa
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team
  Cc: Sven Wegener, H. Peter Anvin, Andy Lutomirski, Kamal Mostafa

3.8.13.28 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: Sven Wegener <sven.wegener@stealer.net>

commit 8142b215501f8b291a108a202b3a053a265b03dd upstream.

Commit 554086d ("x86_32, entry: Do syscall exit work on badsys
(CVE-2014-4508)") introduced a regression in the x86_32 syscall entry
code, resulting in syscall() not returning proper errors for undefined
syscalls on CPUs supporting the sysenter feature.

The following code:

> int result = syscall(666);
> printf("result=%d errno=%d error=%s\n", result, errno, strerror(errno));

results in:

> result=666 errno=0 error=Success

Obviously, the syscall return value is the called syscall number, but it
should have been an ENOSYS error. When run under ptrace it behaves
correctly, which makes it hard to debug in the wild:

> result=-1 errno=38 error=Function not implemented

The %eax register is the return value register. For debugging via ptrace
the syscall entry code stores the complete register context on the
stack. The badsys handlers only store the ENOSYS error code in the
ptrace register set and do not set %eax like a regular syscall handler
would. The old resume_userspace call chain contains code that clobbers
%eax and it restores %eax from the ptrace registers afterwards. The same
goes for the ptrace-enabled call chain. When ptrace is not used, the
syscall return value is the passed-in syscall number from the untouched
%eax register.

Use %eax as the return value register in syscall_badsys and
sysenter_badsys, like a real syscall handler does, and have the caller
push the value onto the stack for ptrace access.

Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Link: http://lkml.kernel.org/r/alpine.LNX.2.11.1407221022380.31021@titan.int.lan.stealer.net
Reviewed-and-tested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 arch/x86/kernel/entry_32.S | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index b32b466..1843543 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -436,8 +436,8 @@ sysenter_do_call:
 	cmpl $(NR_syscalls), %eax
 	jae sysenter_badsys
 	call *sys_call_table(,%eax,4)
-	movl %eax,PT_EAX(%esp)
 sysenter_after_call:
+	movl %eax,PT_EAX(%esp)
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
@@ -517,6 +517,7 @@ ENTRY(system_call)
 	jae syscall_badsys
 syscall_call:
 	call *sys_call_table(,%eax,4)
+syscall_after_call:
 	movl %eax,PT_EAX(%esp)		# store the return value
 syscall_exit:
 	LOCKDEP_SYS_EXIT
@@ -686,12 +687,12 @@ syscall_fault:
 END(syscall_fault)
 
 syscall_badsys:
-	movl $-ENOSYS,PT_EAX(%esp)
-	jmp syscall_exit
+	movl $-ENOSYS,%eax
+	jmp syscall_after_call
 END(syscall_badsys)
 
 sysenter_badsys:
-	movl $-ENOSYS,PT_EAX(%esp)
+	movl $-ENOSYS,%eax
 	jmp sysenter_after_call
 END(syscall_badsys)
 	CFI_ENDPROC
-- 
1.9.1



* [PATCH 3.8 03/13] shmem: fix faulting into a hole while it's punched
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 01/13] x86_32, entry: Do syscall exit work on badsys (CVE-2014-4508) Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 02/13] x86_32, entry: Store badsys error code in %eax Kamal Mostafa
@ 2014-08-25 16:54 ` Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 04/13] shmem: fix faulting into a hole, not taking i_mutex Kamal Mostafa
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team
  Cc: Hugh Dickins, Dave Jones, Andrew Morton, Linus Torvalds,
	Kamal Mostafa

3.8.13.28 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: Hugh Dickins <hughd@google.com>

commit f00cdc6df7d7cfcabb5b740911e6788cb0802bdb upstream.

Trinity finds that mmap access to a hole while it's punched from shmem
can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
from completing, until the reader chooses to stop; with the puncher's
hold on i_mutex locking out all other writers until it can complete.

It appears that the tmpfs fault path is too light in comparison with its
hole-punching path, lacking an i_data_sem to obstruct it; but we don't
want to slow down the common case.

Extend shmem_fallocate()'s existing range notification mechanism, so
shmem_fault() can refrain from faulting pages into the hole while it's
punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
faulting when not).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 mm/shmem.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 52 insertions(+), 4 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index efd0b3a..f5cd1f0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -78,11 +78,12 @@ static struct vfsmount *shm_mnt;
 #define SHORT_SYMLINK_LEN 128
 
 /*
- * shmem_fallocate and shmem_writepage communicate via inode->i_private
- * (with i_mutex making sure that it has only one user at a time):
- * we would prefer not to enlarge the shmem inode just for that.
+ * shmem_fallocate communicates with shmem_fault or shmem_writepage via
+ * inode->i_private (with i_mutex making sure that it has only one user at
+ * a time): we would prefer not to enlarge the shmem inode just for that.
  */
 struct shmem_falloc {
+	int	mode;		/* FALLOC_FL mode currently operating */
 	pgoff_t start;		/* start of range currently being fallocated */
 	pgoff_t next;		/* the next page offset to be fallocated */
 	pgoff_t nr_falloced;	/* how many new pages have been fallocated */
@@ -825,6 +826,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 			spin_lock(&inode->i_lock);
 			shmem_falloc = inode->i_private;
 			if (shmem_falloc &&
+			    !shmem_falloc->mode &&
 			    index >= shmem_falloc->start &&
 			    index < shmem_falloc->next)
 				shmem_falloc->nr_unswapped++;
@@ -1299,6 +1301,44 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	int error;
 	int ret = VM_FAULT_LOCKED;
 
+	/*
+	 * Trinity finds that probing a hole which tmpfs is punching can
+	 * prevent the hole-punch from ever completing: which in turn
+	 * locks writers out with its hold on i_mutex.  So refrain from
+	 * faulting pages into the hole while it's being punched, and
+	 * wait on i_mutex to be released if vmf->flags permits.
+	 */
+	if (unlikely(inode->i_private)) {
+		struct shmem_falloc *shmem_falloc;
+
+		spin_lock(&inode->i_lock);
+		shmem_falloc = inode->i_private;
+		if (!shmem_falloc ||
+		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
+		    vmf->pgoff < shmem_falloc->start ||
+		    vmf->pgoff >= shmem_falloc->next)
+			shmem_falloc = NULL;
+		spin_unlock(&inode->i_lock);
+		/*
+		 * i_lock has protected us from taking shmem_falloc seriously
+		 * once return from shmem_fallocate() went back up that stack.
+		 * i_lock does not serialize with i_mutex at all, but it does
+		 * not matter if sometimes we wait unnecessarily, or sometimes
+		 * miss out on waiting: we just need to make those cases rare.
+		 */
+		if (shmem_falloc) {
+			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
+			   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
+				up_read(&vma->vm_mm->mmap_sem);
+				mutex_lock(&inode->i_mutex);
+				mutex_unlock(&inode->i_mutex);
+				return VM_FAULT_RETRY;
+			}
+			/* cond_resched? Leave that to GUP or return to user */
+			return VM_FAULT_NOPAGE;
+		}
+	}
+
 	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
 	if (error)
 		return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
@@ -1816,18 +1856,26 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 
 	mutex_lock(&inode->i_mutex);
 
+	shmem_falloc.mode = mode & ~FALLOC_FL_KEEP_SIZE;
+
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		struct address_space *mapping = file->f_mapping;
 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
 
+		shmem_falloc.start = unmap_start >> PAGE_SHIFT;
+		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
+		spin_lock(&inode->i_lock);
+		inode->i_private = &shmem_falloc;
+		spin_unlock(&inode->i_lock);
+
 		if ((u64)unmap_end > (u64)unmap_start)
 			unmap_mapping_range(mapping, unmap_start,
 					    1 + unmap_end - unmap_start, 0);
 		shmem_truncate_range(inode, offset, offset + len - 1);
 		/* No need to unmap again: hole-punching leaves COWed pages */
 		error = 0;
-		goto out;
+		goto undone;
 	}
 
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
-- 
1.9.1



* [PATCH 3.8 04/13] shmem: fix faulting into a hole, not taking i_mutex
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
                   ` (2 preceding siblings ...)
  2014-08-25 16:54 ` [PATCH 3.8 03/13] shmem: fix faulting into a hole while it's punched Kamal Mostafa
@ 2014-08-25 16:54 ` Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 05/13] shmem: fix splicing from a hole while it's punched Kamal Mostafa
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team
  Cc: Hugh Dickins, Vlastimil Babka, Konstantin Khlebnikov,
	Johannes Weiner, Lukas Czerner, Dave Jones, Andrew Morton,
	Linus Torvalds, Luis Henriques, Kamal Mostafa

3.8.13.28 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: Hugh Dickins <hughd@google.com>

commit 8e205f779d1443a94b5ae81aa359cb535dd3021e upstream.

Commit f00cdc6df7d7 ("shmem: fix faulting into a hole while it's
punched") was buggy: Sasha sent a lockdep report to remind us that
grabbing i_mutex in the fault path is a no-no (write syscall may already
hold i_mutex while faulting user buffer).

We tried a completely different approach (see following patch) but that
proved inadequate: good enough for a rational workload, but not good
enough against trinity - which forks off so many mappings of the object
that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
into serious starvation when concurrent faults force the puncher to fall
back to single-page unmap_mapping_range() searches of the i_mmap tree.

So return to the original umbrella approach, but keep away from i_mutex
this time.  We really don't want to bloat every shmem inode with a new
mutex or completion, just to protect this unlikely case from trinity.
So extend the original with wait_queue_head on stack at the hole-punch
end, and wait_queue item on the stack at the fault end.

This involves further use of i_lock to guard against the races: lockdep
has been happy so far, and I see fs/inode.c:unlock_new_inode() holds
i_lock around wake_up_bit(), which is comparable to what we do here.
i_lock is more convenient, but we could switch to shmem's info->lock.

This issue has been tagged with CVE-2014-4171, which will require commit
f00cdc6df7d7 and this and the following patch to be backported: we
suggest to 3.1+, though in fact the trinity forkbomb effect might go
back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might
not, since much has changed, with i_mmap_mutex a spinlock before 3.0.
Anyone running trinity on 3.0 and earlier? I don't think we need care.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 mm/shmem.c | 78 +++++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 52 insertions(+), 26 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index f5cd1f0..e679c38 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -83,7 +83,7 @@ static struct vfsmount *shm_mnt;
  * a time): we would prefer not to enlarge the shmem inode just for that.
  */
 struct shmem_falloc {
-	int	mode;		/* FALLOC_FL mode currently operating */
+	wait_queue_head_t *waitq; /* faults into hole wait for punch to end */
 	pgoff_t start;		/* start of range currently being fallocated */
 	pgoff_t next;		/* the next page offset to be fallocated */
 	pgoff_t nr_falloced;	/* how many new pages have been fallocated */
@@ -826,7 +826,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 			spin_lock(&inode->i_lock);
 			shmem_falloc = inode->i_private;
 			if (shmem_falloc &&
-			    !shmem_falloc->mode &&
+			    !shmem_falloc->waitq &&
 			    index >= shmem_falloc->start &&
 			    index < shmem_falloc->next)
 				shmem_falloc->nr_unswapped++;
@@ -1305,38 +1305,58 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 * Trinity finds that probing a hole which tmpfs is punching can
 	 * prevent the hole-punch from ever completing: which in turn
 	 * locks writers out with its hold on i_mutex.  So refrain from
-	 * faulting pages into the hole while it's being punched, and
-	 * wait on i_mutex to be released if vmf->flags permits.
+	 * faulting pages into the hole while it's being punched.  Although
+	 * shmem_undo_range() does remove the additions, it may be unable to
+	 * keep up, as each new page needs its own unmap_mapping_range() call,
+	 * and the i_mmap tree grows ever slower to scan if new vmas are added.
+	 *
+	 * It does not matter if we sometimes reach this check just before the
+	 * hole-punch begins, so that one fault then races with the punch:
+	 * we just need to make racing faults a rare case.
+	 *
+	 * The implementation below would be much simpler if we just used a
+	 * standard mutex or completion: but we cannot take i_mutex in fault,
+	 * and bloating every shmem inode for this unlikely case would be sad.
 	 */
 	if (unlikely(inode->i_private)) {
 		struct shmem_falloc *shmem_falloc;
 
 		spin_lock(&inode->i_lock);
 		shmem_falloc = inode->i_private;
-		if (!shmem_falloc ||
-		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
-		    vmf->pgoff < shmem_falloc->start ||
-		    vmf->pgoff >= shmem_falloc->next)
-			shmem_falloc = NULL;
-		spin_unlock(&inode->i_lock);
-		/*
-		 * i_lock has protected us from taking shmem_falloc seriously
-		 * once return from shmem_fallocate() went back up that stack.
-		 * i_lock does not serialize with i_mutex at all, but it does
-		 * not matter if sometimes we wait unnecessarily, or sometimes
-		 * miss out on waiting: we just need to make those cases rare.
-		 */
-		if (shmem_falloc) {
+		if (shmem_falloc &&
+		    shmem_falloc->waitq &&
+		    vmf->pgoff >= shmem_falloc->start &&
+		    vmf->pgoff < shmem_falloc->next) {
+			wait_queue_head_t *shmem_falloc_waitq;
+			DEFINE_WAIT(shmem_fault_wait);
+
+			ret = VM_FAULT_NOPAGE;
 			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
 			   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
+				/* It's polite to up mmap_sem if we can */
 				up_read(&vma->vm_mm->mmap_sem);
-				mutex_lock(&inode->i_mutex);
-				mutex_unlock(&inode->i_mutex);
-				return VM_FAULT_RETRY;
+				ret = VM_FAULT_RETRY;
 			}
-			/* cond_resched? Leave that to GUP or return to user */
-			return VM_FAULT_NOPAGE;
+
+			shmem_falloc_waitq = shmem_falloc->waitq;
+			prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
+					TASK_UNINTERRUPTIBLE);
+			spin_unlock(&inode->i_lock);
+			schedule();
+
+			/*
+			 * shmem_falloc_waitq points into the shmem_fallocate()
+			 * stack of the hole-punching task: shmem_falloc_waitq
+			 * is usually invalid by the time we reach here, but
+			 * finish_wait() does not dereference it in that case;
+			 * though i_lock needed lest racing with wake_up_all().
+			 */
+			spin_lock(&inode->i_lock);
+			finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
+			spin_unlock(&inode->i_lock);
+			return ret;
 		}
+		spin_unlock(&inode->i_lock);
 	}
 
 	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
@@ -1856,13 +1876,13 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 
 	mutex_lock(&inode->i_mutex);
 
-	shmem_falloc.mode = mode & ~FALLOC_FL_KEEP_SIZE;
-
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		struct address_space *mapping = file->f_mapping;
 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
+		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
 
+		shmem_falloc.waitq = &shmem_falloc_waitq;
 		shmem_falloc.start = unmap_start >> PAGE_SHIFT;
 		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
 		spin_lock(&inode->i_lock);
@@ -1874,8 +1894,13 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 					    1 + unmap_end - unmap_start, 0);
 		shmem_truncate_range(inode, offset, offset + len - 1);
 		/* No need to unmap again: hole-punching leaves COWed pages */
+
+		spin_lock(&inode->i_lock);
+		inode->i_private = NULL;
+		wake_up_all(&shmem_falloc_waitq);
+		spin_unlock(&inode->i_lock);
 		error = 0;
-		goto undone;
+		goto out;
 	}
 
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -1891,6 +1916,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		goto out;
 	}
 
+	shmem_falloc.waitq = NULL;
 	shmem_falloc.start = start;
 	shmem_falloc.next  = start;
 	shmem_falloc.nr_falloced = 0;
-- 
1.9.1



* [PATCH 3.8 05/13] shmem: fix splicing from a hole while it's punched
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
                   ` (3 preceding siblings ...)
  2014-08-25 16:54 ` [PATCH 3.8 04/13] shmem: fix faulting into a hole, not taking i_mutex Kamal Mostafa
@ 2014-08-25 16:54 ` Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 06/13] target: Explicitly clear ramdisk_mcp backend pages Kamal Mostafa
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team
  Cc: Hugh Dickins, Konstantin Khlebnikov, Johannes Weiner,
	Lukas Czerner, Dave Jones, Andrew Morton, Linus Torvalds,
	Luis Henriques, Kamal Mostafa

3.8.13.28 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: Hugh Dickins <hughd@google.com>

commit b1a366500bd537b50c3aad26dc7df083ec03a448 upstream.

shmem_fault() is the actual culprit in trinity's hole-punch starvation,
and the most significant cause of such problems: since a page faulted is
one that then appears page_mapped(), needing unmap_mapping_range() and
i_mmap_mutex to be unmapped again.

But it is not the only way in which a page can be brought into a hole in
the radix_tree while that hole is being punched; and Vlastimil's testing
implies that if enough other processors are busy filling in the hole,
then shmem_undo_range() can be kept from completing indefinitely.

shmem_file_splice_read() is the main other user of SGP_CACHE, which can
instantiate shmem pagecache pages in the read-only case (without holding
i_mutex, so perhaps concurrently with a hole-punch).  Probably it's
silly not to use SGP_READ already (using the ZERO_PAGE for holes): which
ought to be safe, but might bring surprises - not a change to be rushed.

shmem_read_mapping_page_gfp() is an internal interface used by
drivers/gpu/drm GEM (and next by uprobes): it should be okay.  And
shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
called internally by the kernel (perhaps for a stacking filesystem,
which might rely on holes to be reserved): it's unclear whether it could
be provoked to keep hole-punch busy or not.

We could apply the same umbrella as now used in shmem_fault() to
shmem_file_splice_read() and the others; but it looks ugly, and use over
a range raises questions - should it actually be per page? can these get
starved themselves?

The origin of this part of the problem is my v3.1 commit d0823576bf4b
("mm: pincer in truncate_inode_pages_range"), once it was duplicated
into shmem.c.  It seemed like a nice idea at the time, to ensure
(barring RCU lookup fuzziness) that there's an instant when the entire
hole is empty; but the indefinitely repeated scans to ensure that make
it vulnerable.

Revert that "enhancement" to hole-punch from shmem_undo_range(), but
retain the unproblematic rescanning when it's truncating; add a couple
of comments there.

Remove the "indices[0] >= end" test: that is now handled satisfactorily
by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
to be worth avoiding here.

But if we do not always loop indefinitely, we do need to handle the case
of swap swizzled back to page before shmem_free_swap() gets it: add a
retry for that case, as suggested by Konstantin Khlebnikov; and for the
case of page swizzled back to swap, as suggested by Johannes Weiner.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ luis: backported to 3.11: used hughd's backport to 3.10.50 ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 mm/shmem.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index e679c38..840643a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -533,22 +533,19 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		return;
 
 	index = start;
-	for ( ; ; ) {
+	while (index < end) {
 		cond_resched();
 		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
 							pvec.pages, indices);
 		if (!pvec.nr) {
-			if (index == start || unfalloc)
+			/* If all gone or hole-punch or unfalloc, we're done */
+			if (index == start || end != -1)
 				break;
+			/* But if truncating, restart to make sure all gone */
 			index = start;
 			continue;
 		}
-		if ((index == start || unfalloc) && indices[0] >= end) {
-			shmem_deswap_pagevec(&pvec);
-			pagevec_release(&pvec);
-			break;
-		}
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -560,8 +557,12 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			if (radix_tree_exceptional_entry(page)) {
 				if (unfalloc)
 					continue;
-				nr_swaps_freed += !shmem_free_swap(mapping,
-								index, page);
+				if (shmem_free_swap(mapping, index, page)) {
+					/* Swap was replaced by page: retry */
+					index--;
+					break;
+				}
+				nr_swaps_freed++;
 				continue;
 			}
 
@@ -570,6 +571,11 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 				if (page->mapping == mapping) {
 					VM_BUG_ON(PageWriteback(page));
 					truncate_inode_page(mapping, page);
+				} else {
+					/* Page was replaced by swap: retry */
+					unlock_page(page);
+					index--;
+					break;
 				}
 			}
 			unlock_page(page);
-- 
1.9.1



* [PATCH 3.8 06/13] target: Explicitly clear ramdisk_mcp backend pages
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
                   ` (4 preceding siblings ...)
  2014-08-25 16:54 ` [PATCH 3.8 05/13] shmem: fix splicing from a hole while it's punched Kamal Mostafa
@ 2014-08-25 16:54 ` Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 07/13] sctp: Fix sk_ack_backlog wrap-around problem Kamal Mostafa
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team
  Cc: Jorge Daniel Sequeira Matias, Nicholas Bellinger, Kamal Mostafa

3.8.13.28 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: Nicholas Bellinger <nab@linux-iscsi.org>

[Note that a different patch to address the same issue went in during
v3.15-rc1 (commit 4442dc8a), but includes a bunch of other changes that
don't strictly apply to fixing the bug.]

This patch changes rd_allocate_sgl_table() to explicitly clear
ramdisk_mcp backend memory pages by passing __GFP_ZERO into
alloc_pages().

This addresses a potential security issue where reading from a
ramdisk_mcp could return sensitive information, and follows what
>= v3.15 does to explicitly clear ramdisk_mcp memory at backend
device initialization time.

Reported-by: Jorge Daniel Sequeira Matias <jdsm@tecnico.ulisboa.pt>
Cc: Jorge Daniel Sequeira Matias <jdsm@tecnico.ulisboa.pt>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Reference: CVE-2014-4027
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 drivers/target/target_core_rd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/target/target_core_rd.c b/drivers/target/target_core_rd.c
index 0457de3..51968ed 100644
--- a/drivers/target/target_core_rd.c
+++ b/drivers/target/target_core_rd.c
@@ -174,7 +174,7 @@ static int rd_build_device_space(struct rd_dev *rd_dev)
 						- 1;
 
 		for (j = 0; j < sg_per_table; j++) {
-			pg = alloc_pages(GFP_KERNEL, 0);
+			pg = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);
 			if (!pg) {
 				pr_err("Unable to allocate scatterlist"
 					" pages for struct rd_dev_sg_table\n");
-- 
1.9.1



* [PATCH 3.8 07/13] sctp: Fix sk_ack_backlog wrap-around problem
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
                   ` (5 preceding siblings ...)
  2014-08-25 16:54 ` [PATCH 3.8 06/13] target: Explicitly clear ramdisk_mcp backend pages Kamal Mostafa
@ 2014-08-25 16:54 ` Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 08/13] mnt: Only change user settable mount flags in remount Kamal Mostafa
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team
  Cc: Xufeng Zhang, David S. Miller, Kamal Mostafa

3.8.13.28 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: Xufeng Zhang <xufeng.zhang@windriver.com>

commit d3217b15a19a4779c39b212358a5c71d725822ee upstream.

Consider the following scenario: for a TCP-style socket, while
processing the COOKIE_ECHO chunk in sctp_sf_do_5_1D_ce(), after it has
passed a series of sanity checks, a new association is created in
sctp_unpack_cookie().  If some later processing fails,
sctp_association_free() is called to free the previously allocated
association, and it decrements sk_ack_backlog for this socket.  Since
the initial value of sk_ack_backlog is 0, the decrement wraps it
around to 65535.  If we then try to establish new associations on the
same socket, an ABORT is triggered because SCTP deems the accept queue
full.

Fix this issue by only decrementing sk_ack_backlog for associations in
the endpoint's list.
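
The wrap-around is plain unsigned arithmetic, since sk_ack_backlog is a 16-bit unsigned counter; a minimal sketch (the helper name is made up, this is not kernel code):

```c
#include <assert.h>

/* Sketch: freeing an association decrements the owning socket's
 * backlog counter.  If the association was never actually counted
 * (it never made it onto the endpoint's list), the unsigned 16-bit
 * counter wraps from 0 to 65535 and the accept queue looks full. */
static unsigned short free_association(unsigned short sk_ack_backlog)
{
	return (unsigned short)(sk_ack_backlog - 1);
}
```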

Fix-suggested-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reference: CVE-2014-4667
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 net/sctp/associola.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 67c6823..c5cd799 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -396,7 +396,7 @@ void sctp_association_free(struct sctp_association *asoc)
 	/* Only real associations count against the endpoint, so
 	 * don't bother for if this is a temporary association.
 	 */
-	if (!asoc->temp) {
+	if (!list_empty(&asoc->asocs)) {
 		list_del(&asoc->asocs);
 
 		/* Decrement the backlog value for a TCP-style listening
-- 
1.9.1



* [PATCH 3.8 08/13] mnt: Only change user settable mount flags in remount
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
                   ` (6 preceding siblings ...)
  2014-08-25 16:54 ` [PATCH 3.8 07/13] sctp: Fix sk_ack_backlog wrap-around problem Kamal Mostafa
@ 2014-08-25 16:54 ` Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 09/13] mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount Kamal Mostafa
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team; +Cc: Eric W. Biederman, Kamal Mostafa

3.8.13.28 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: "Eric W. Biederman" <ebiederm@xmission.com>

commit a6138db815df5ee542d848318e5dae681590fccd upstream.

Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
read-only bind mount read-only in a user namespace the
MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
to remount a read-only mount read-write.

Correct this by replacing the mask of mount flags to preserve
with a mask of mount flags that may be changed, and preserving
all others.  This ensures that any future bugs involving this mask and
remount will fail in an easy-to-detect way: new mount flags
simply won't change.
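
The masking change can be sketched as follows (flag values here are illustrative, not the kernel's real MNT_* constants):

```c
#include <assert.h>

/* Illustrative flag values, not the kernel's actual MNT_* constants. */
#define MNT_READONLY            0x01
#define MNT_NOSUID              0x02
#define MNT_LOCK_READONLY       0x40  /* not settable by the user */
#define MNT_USER_SETTABLE_MASK  (MNT_READONLY | MNT_NOSUID)

/* New remount logic: preserve every bit the user may NOT set, so any
 * flag added in the future survives a remount by default instead of
 * being silently dropped. */
static int remount_flags(int old_flags, int requested_flags)
{
	return requested_flags | (old_flags & ~MNT_USER_SETTABLE_MASK);
}
```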

Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 fs/namespace.c        | 2 +-
 include/linux/mount.h | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 5dd7709..ddbd5bc 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1782,7 +1782,7 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
 		err = do_remount_sb(sb, flags, data, 0);
 	if (!err) {
 		br_write_lock(&vfsmount_lock);
-		mnt_flags |= mnt->mnt.mnt_flags & MNT_PROPAGATION_MASK;
+		mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
 		mnt->mnt.mnt_flags = mnt_flags;
 		br_write_unlock(&vfsmount_lock);
 	}
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 73005f9..16fc05d 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -42,7 +42,9 @@ struct mnt_namespace;
  * flag, consider how it interacts with shared mounts.
  */
 #define MNT_SHARED_MASK	(MNT_UNBINDABLE)
-#define MNT_PROPAGATION_MASK	(MNT_SHARED | MNT_UNBINDABLE)
+#define MNT_USER_SETTABLE_MASK  (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
+				 | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
+				 | MNT_READONLY)
 
 
 #define MNT_INTERNAL	0x4000
-- 
1.9.1



* [PATCH 3.8 09/13] mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
                   ` (7 preceding siblings ...)
  2014-08-25 16:54 ` [PATCH 3.8 08/13] mnt: Only change user settable mount flags in remount Kamal Mostafa
@ 2014-08-25 16:54 ` Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 10/13] mnt: Correct permission checks in do_remount Kamal Mostafa
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team; +Cc: Eric W. Biederman, Kamal Mostafa

3.8.13.28 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: "Eric W. Biederman" <ebiederm@xmission.com>

commit 07b645589dcda8b7a5249e096fece2a67556f0f4 upstream.

There are no races as locked mount flags are guaranteed to never change.

Moving the test into do_remount makes it more visible, and ensures all
filesystem remounts pass the MNT_LOCK_READONLY permission check.  This
second case is not an issue today as filesystem remounts are guarded
by capable(CAP_SYS_ADMIN) and thus will always fail in less privileged
mount namespaces, but it could become an issue in the future.

Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 fs/namespace.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ddbd5bc..9171ac3 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1740,9 +1740,6 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
 	if (readonly_request == __mnt_is_readonly(mnt))
 		return 0;
 
-	if (mnt->mnt_flags & MNT_LOCK_READONLY)
-		return -EPERM;
-
 	if (readonly_request)
 		error = mnt_make_readonly(real_mount(mnt));
 	else
@@ -1771,6 +1768,16 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
 	if (path->dentry != path->mnt->mnt_root)
 		return -EINVAL;
 
+	/* Don't allow changing of locked mnt flags.
+	 *
+	 * No locks need to be held here while testing the various
+	 * MNT_LOCK flags because those flags can never be cleared
+	 * once they are set.
+	 */
+	if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
+	    !(mnt_flags & MNT_READONLY)) {
+		return -EPERM;
+	}
 	err = security_sb_remount(sb, data);
 	if (err)
 		return err;
-- 
1.9.1



* [PATCH 3.8 10/13] mnt: Correct permission checks in do_remount
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
                   ` (8 preceding siblings ...)
  2014-08-25 16:54 ` [PATCH 3.8 09/13] mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount Kamal Mostafa
@ 2014-08-25 16:54 ` Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 11/13] mnt: Change the default remount atime from relatime to the existing value Kamal Mostafa
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team; +Cc: Eric W. Biederman, Kamal Mostafa

3.8.13.28 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: "Eric W. Biederman" <ebiederm@xmission.com>

commit 9566d6742852c527bf5af38af5cbb878dad75705 upstream.

While investigating the issue where "mount --bind -o remount,ro ..."
would result in a later "mount --bind -o remount,rw" succeeding even
if the mount started off locked, I realized that there are several
additional mount flags that should be locked and are not.

In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime
flags, in addition to MNT_READONLY, should all be locked.  These
flags are all per superblock, can all be changed with MS_BIND,
and should not be changeable if set by a more privileged user.

The following additions to the current logic are made in this patch:
- nosuid may not be cleared by a less privileged user.
- nodev  may not be cleared by a less privileged user.
- noexec may not be cleared by a less privileged user.
- atime flags may not be changed by a less privileged user.

The logic with atime is that always setting atime on access is a
global policy: backup software and auditing software could break if
atime bits are not updated (when they are configured to be updated),
and serious performance degradation could result (a DoS attack) if
atime updates happen when they have been explicitly disabled.
Therefore an unprivileged user should not be able to mess with the
atime bits set by a more privileged user.

The additional restrictions are implemented with the addition of
MNT_LOCK_NOSUID, MNT_LOCK_NODEV, MNT_LOCK_NOEXEC, and MNT_LOCK_ATIME
mnt flags.

Taken together these changes and the fixes for MNT_LOCK_READONLY
should make it safe for an unprivileged user to create a user
namespace and to call "mount --bind -o remount,... ..." without
the danger of mount flags being changed maliciously.
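
The shape of the new per-flag checks can be sketched like this (flag values are illustrative; only the nodev case is shown, since the nosuid and noexec checks are identical in form):

```c
#include <assert.h>

/* Illustrative flag values, not the kernel's actual MNT_* constants. */
#define MNT_NODEV       0x1
#define MNT_LOCK_NODEV  0x2

/* Sketch of the do_remount() checks: a locked attribute may be kept
 * across a remount but never cleared by a less privileged user. */
static int remount_permitted(int current_flags, int requested_flags)
{
	if ((current_flags & MNT_LOCK_NODEV) && !(requested_flags & MNT_NODEV))
		return 0;  /* the kernel returns -EPERM here */
	return 1;
}
```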

Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 fs/namespace.c        | 36 +++++++++++++++++++++++++++++++++---
 include/linux/mount.h |  5 +++++
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9171ac3..1d8b3d8 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -799,8 +799,21 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~MNT_WRITE_HOLD;
 	/* Don't allow unprivileged users to change mount flags */
-	if ((flag & CL_UNPRIVILEGED) && (mnt->mnt.mnt_flags & MNT_READONLY))
-		mnt->mnt.mnt_flags |= MNT_LOCK_READONLY;
+	if (flag & CL_UNPRIVILEGED) {
+		mnt->mnt.mnt_flags |= MNT_LOCK_ATIME;
+
+		if (mnt->mnt.mnt_flags & MNT_READONLY)
+			mnt->mnt.mnt_flags |= MNT_LOCK_READONLY;
+
+		if (mnt->mnt.mnt_flags & MNT_NODEV)
+			mnt->mnt.mnt_flags |= MNT_LOCK_NODEV;
+
+		if (mnt->mnt.mnt_flags & MNT_NOSUID)
+			mnt->mnt.mnt_flags |= MNT_LOCK_NOSUID;
+
+		if (mnt->mnt.mnt_flags & MNT_NOEXEC)
+			mnt->mnt.mnt_flags |= MNT_LOCK_NOEXEC;
+	}
 
 	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_sb = sb;
@@ -1778,6 +1791,23 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
 	    !(mnt_flags & MNT_READONLY)) {
 		return -EPERM;
 	}
+	if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
+	    !(mnt_flags & MNT_NODEV)) {
+		return -EPERM;
+	}
+	if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
+	    !(mnt_flags & MNT_NOSUID)) {
+		return -EPERM;
+	}
+	if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
+	    !(mnt_flags & MNT_NOEXEC)) {
+		return -EPERM;
+	}
+	if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
+	    ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK))) {
+		return -EPERM;
+	}
+
 	err = security_sb_remount(sb, data);
 	if (err)
 		return err;
@@ -1978,7 +2008,7 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
 		 */
 		if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
 			flags |= MS_NODEV;
-			mnt_flags |= MNT_NODEV;
+			mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
 		}
 	}
 
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 16fc05d..f058e13 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -45,10 +45,15 @@ struct mnt_namespace;
 #define MNT_USER_SETTABLE_MASK  (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
 				 | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
 				 | MNT_READONLY)
+#define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
 
 
 #define MNT_INTERNAL	0x4000
 
+#define MNT_LOCK_ATIME		0x040000
+#define MNT_LOCK_NOEXEC		0x080000
+#define MNT_LOCK_NOSUID		0x100000
+#define MNT_LOCK_NODEV		0x200000
 #define MNT_LOCK_READONLY	0x400000
 
 struct vfsmount {
-- 
1.9.1



* [PATCH 3.8 11/13] mnt: Change the default remount atime from relatime to the existing value
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
                   ` (9 preceding siblings ...)
  2014-08-25 16:54 ` [PATCH 3.8 10/13] mnt: Correct permission checks in do_remount Kamal Mostafa
@ 2014-08-25 16:54 ` Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 12/13] ptrace,x86: force IRET path after a ptrace_stop() Kamal Mostafa
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team; +Cc: Eric W. Biederman, Kamal Mostafa

3.8.13.28 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: "Eric W. Biederman" <ebiederm@xmission.com>

commit ffbc6f0ead47fa5a1dc9642b0331cb75c20a640e upstream.

Since March 2009 the kernel has defaulted to relatime when no
MS_...ATIME flags are passed.

Defaulting to relatime instead of the existing atime state during a
remount is silly, and causes problems in practice for people who don't
specify any MS_...ATIME flags and expect to get the default filesystem
atime setting.  Those users may encounter a permission error because
the default atime setting does not work.

A default that does not work and causes permission problems is
ridiculous, so preserve the existing value to have a default
atime setting that is always guaranteed to work.

Using the default atime setting in this way is particularly
interesting for applications built to run in restricted userspace
environments without /proc mounted, as the existing atime mount
options of a filesystem cannot be read from /proc/mounts.

In practice this fixes user space that uses the default atime
setting on remount and was broken by the permission checks
keeping less privileged users from changing a more privileged
user's atime settings.
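
The new default can be sketched as follows (flag values are illustrative, not the kernel's real constants):

```c
#include <assert.h>

/* Illustrative flag values, not the kernel's actual MNT_* constants. */
#define MNT_NOATIME     0x1
#define MNT_RELATIME    0x2
#define MNT_ATIME_MASK  (MNT_NOATIME | MNT_RELATIME)

/* Sketch of the remount default: when the caller passes no MS_*ATIME
 * flags, copy the existing mount's atime bits instead of forcing the
 * relatime default onto it. */
static int remount_atime(int existing_flags, int requested_flags,
			 int caller_passed_atime_flags)
{
	if (!caller_passed_atime_flags) {
		requested_flags &= ~MNT_ATIME_MASK;
		requested_flags |= existing_flags & MNT_ATIME_MASK;
	}
	return requested_flags;
}
```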

Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 fs/namespace.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 1d8b3d8..4d63cfe 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2327,6 +2327,14 @@ long do_mount(const char *dev_name, const char *dir_name,
 	if (flags & MS_RDONLY)
 		mnt_flags |= MNT_READONLY;
 
+	/* The default atime for remount is preservation */
+	if ((flags & MS_REMOUNT) &&
+	    ((flags & (MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
+		       MS_STRICTATIME)) == 0)) {
+		mnt_flags &= ~MNT_ATIME_MASK;
+		mnt_flags |= path.mnt->mnt_flags & MNT_ATIME_MASK;
+	}
+
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
 		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
 		   MS_STRICTATIME);
-- 
1.9.1



* [PATCH 3.8 12/13] ptrace,x86: force IRET path after a ptrace_stop()
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
                   ` (10 preceding siblings ...)
  2014-08-25 16:54 ` [PATCH 3.8 11/13] mnt: Change the default remount atime from relatime to the existing value Kamal Mostafa
@ 2014-08-25 16:54 ` Kamal Mostafa
  2014-08-25 16:54 ` [PATCH 3.8 13/13] net/l2tp: don't fall back on UDP [get|set]sockopt Kamal Mostafa
  2014-08-25 17:00 ` [3.8.y.z extended stable] Linux 3.8.13.28 stable review Greg KH
  13 siblings, 0 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team
  Cc: Tejun Heo, Linus Torvalds, Kamal Mostafa

3.8.13.28 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: Tejun Heo <tj@kernel.org>

commit b9cd18de4db3c9ffa7e17b0dc0ca99ed5aa4d43a upstream.

The 'sysret' fastpath does not correctly restore even all regular
registers, much less any segment registers or eflags values.  That is
very much part of why it's faster than 'iret'.

Normally that isn't a problem, because the normal ptrace() interface
catches the process using the signal handler infrastructure, which
always returns with an iret.

However, some paths can get caught using ptrace_event() instead of the
signal path, and for those we need to make sure that we aren't going to
return to user space using 'sysret'.  Otherwise the modifications that
may have been done to the register set by the tracer wouldn't
necessarily take effect.

Fix it by setting TIF_NOTIFY_RESUME from arch_ptrace_stop_needed(),
which is invoked from ptrace_stop(); this forces the IRET return path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CVE-2014-4699
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 arch/x86/include/asm/ptrace.h | 16 ++++++++++++++++
 include/linux/ptrace.h        |  3 +++
 2 files changed, 19 insertions(+)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 942a086..68e9f00 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -232,6 +232,22 @@ static inline unsigned long regs_get_kernel_stack_nth(struct pt_regs *regs,
 
 #define ARCH_HAS_USER_SINGLE_STEP_INFO
 
+/*
+ * When hitting ptrace_stop(), we cannot return using SYSRET because
+ * that does not restore the full CPU state, only a minimal set.  The
+ * ptracer can change arbitrary register values, which is usually okay
+ * because the usual ptrace stops run off the signal delivery path which
+ * forces IRET; however, ptrace_event() stops happen in arbitrary places
+ * in the kernel and don't force IRET path.
+ *
+ * So force IRET path after a ptrace stop.
+ */
+#define arch_ptrace_stop_needed(code, info)				\
+({									\
+	set_thread_flag(TIF_NOTIFY_RESUME);				\
+	false;								\
+})
+
 struct user_desc;
 extern int do_get_thread_area(struct task_struct *p, int idx,
 			      struct user_desc __user *info);
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 2e99b8e..bb980ae 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -337,6 +337,9 @@ static inline void user_single_step_siginfo(struct task_struct *tsk,
  * calling arch_ptrace_stop() when it would be superfluous.  For example,
  * if the thread has not been back to user mode since the last stop, the
  * thread state might indicate that nothing needs to be done.
+ *
+ * This is guaranteed to be invoked once before a task stops for ptrace and
+ * may include arch-specific operations necessary prior to a ptrace stop.
  */
 #define arch_ptrace_stop_needed(code, info)	(0)
 #endif
-- 
1.9.1



* [PATCH 3.8 13/13] net/l2tp: don't fall back on UDP [get|set]sockopt
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
                   ` (11 preceding siblings ...)
  2014-08-25 16:54 ` [PATCH 3.8 12/13] ptrace,x86: force IRET path after a ptrace_stop() Kamal Mostafa
@ 2014-08-25 16:54 ` Kamal Mostafa
  2014-08-25 17:00 ` [3.8.y.z extended stable] Linux 3.8.13.28 stable review Greg KH
  13 siblings, 0 replies; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 16:54 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team
  Cc: Phil Turnbull, Vegard Nossum, Willy Tarreau, Linus Torvalds,
	Kamal Mostafa

3.8.13.28 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: Sasha Levin <sasha.levin@oracle.com>

commit 3cf521f7dc87c031617fd47e4b7aa2593c2f3daf upstream.

The l2tp [get|set]sockopt() code has fallen back to the UDP functions
for socket option levels != SOL_PPPOL2TP since day one, but that has
never actually worked, since the l2tp socket isn't an inet socket.

As David Miller points out:

  "If we wanted this to work, it'd have to look up the tunnel and then
   use tunnel->sk, but I wonder how useful that would be"

Since this can never have worked, nobody could possibly have depended
on that functionality; just remove the broken code and return -EINVAL.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: James Chapman <jchapman@katalix.com>
Acked-by: David Miller <davem@davemloft.net>
Cc: Phil Turnbull <phil.turnbull@oracle.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CVE-2014-4943
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 net/l2tp/l2tp_ppp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index 26dbd9a..fbd6bbf 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -1404,7 +1404,7 @@ static int pppol2tp_setsockopt(struct socket *sock, int level, int optname,
 	int err;
 
 	if (level != SOL_PPPOL2TP)
-		return udp_prot.setsockopt(sk, level, optname, optval, optlen);
+		return -EINVAL;
 
 	if (optlen < sizeof(int))
 		return -EINVAL;
@@ -1530,7 +1530,7 @@ static int pppol2tp_getsockopt(struct socket *sock, int level, int optname,
 	struct pppol2tp_session *ps;
 
 	if (level != SOL_PPPOL2TP)
-		return udp_prot.getsockopt(sk, level, optname, optval, optlen);
+		return -EINVAL;
 
 	if (get_user(len, optlen))
 		return -EFAULT;
-- 
1.9.1



* Re: [3.8.y.z extended stable] Linux 3.8.13.28 stable review
  2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
                   ` (12 preceding siblings ...)
  2014-08-25 16:54 ` [PATCH 3.8 13/13] net/l2tp: don't fall back on UDP [get|set]sockopt Kamal Mostafa
@ 2014-08-25 17:00 ` Greg KH
  2014-08-25 19:14   ` Kamal Mostafa
  13 siblings, 1 reply; 17+ messages in thread
From: Greg KH @ 2014-08-25 17:00 UTC (permalink / raw)
  To: Kamal Mostafa; +Cc: linux-kernel, stable, kernel-team

On Mon, Aug 25, 2014 at 09:54:46AM -0700, Kamal Mostafa wrote:
> This is the start of the review cycle for the Linux 3.8.13.28 stable kernel.
> 
>  ** NOTE: This will be the last Linux 3.8.y.z extended stable version
>  ** to be released and supported by me and the Ubuntu Kernel team.

I know this is the last release, but I have gotten some complaints from
people who are confused by the numbering scheme you are using for the
Canonical stable kernel releases.  Can you please use the EXTRAVERSION
field with a "-something" tag in it to differentiate from the "normal"
kernel.org stable releases?

A simple "-ubuntu" or "-canonical" or "-fuzzy_bunnies" marking would be
fine, or whatever you want to pick.

Everyone used to do this a long time ago when we had lots of different
kernel releases / trees floating around in the pre-git days, which made
things much easier to figure out where the trees came from.

thanks,

greg k-h


* Re: [3.8.y.z extended stable] Linux 3.8.13.28 stable review
  2014-08-25 17:00 ` [3.8.y.z extended stable] Linux 3.8.13.28 stable review Greg KH
@ 2014-08-25 19:14   ` Kamal Mostafa
  2014-08-25 19:37     ` Greg KH
  0 siblings, 1 reply; 17+ messages in thread
From: Kamal Mostafa @ 2014-08-25 19:14 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, stable, kernel-team, H. Peter Anvin

On Mon, 2014-08-25 at 10:00 -0700, Greg KH wrote:
> On Mon, Aug 25, 2014 at 09:54:46AM -0700, Kamal Mostafa wrote:
> > This is the start of the review cycle for the Linux 3.8.13.28 stable kernel.
> > 
> >  ** NOTE: This will be the last Linux 3.8.y.z extended stable version
> >  ** to be released and supported by me and the Ubuntu Kernel team.
> 
> I know this is the last release, but I have gotten some complaints from
> people who are confused by the numbering scheme you are using for the
> Canonical stable kernel releases.  Can you please use the EXTRAVERSION
> field with a "-something" tag in it to differentiate from the "normal"
> kernel.org stable releases?

We'll consider that idea for future extended stable versions maintained
by our team.

Thanks,

 -Kamal


> A simple "-ubuntu" or "-canonical" or "-fuzzy_bunnies" marking would be
> fine, or whatever you want to pick.
> 
> Everyone used to do this a long time ago when we had lots of different
> kernel releases / trees floating around in the pre-git days, which made
> things much easier to figure out where the trees came from.
> 
> thanks,
> 
> greg k-h
> 




* Re: [3.8.y.z extended stable] Linux 3.8.13.28 stable review
  2014-08-25 19:14   ` Kamal Mostafa
@ 2014-08-25 19:37     ` Greg KH
  0 siblings, 0 replies; 17+ messages in thread
From: Greg KH @ 2014-08-25 19:37 UTC (permalink / raw)
  To: Kamal Mostafa; +Cc: linux-kernel, stable, kernel-team, H. Peter Anvin

On Mon, Aug 25, 2014 at 12:14:18PM -0700, Kamal Mostafa wrote:
> On Mon, 2014-08-25 at 10:00 -0700, Greg KH wrote:
> > On Mon, Aug 25, 2014 at 09:54:46AM -0700, Kamal Mostafa wrote:
> > > This is the start of the review cycle for the Linux 3.8.13.28 stable kernel.
> > > 
> > >  ** NOTE: This will be the last Linux 3.8.y.z extended stable version
> > >  ** to be released and supported by me and the Ubuntu Kernel team.
> > 
> > I know this is the last release, but I have gotten some complaints from
> > people who are confused by the numbering scheme you are using for the
> > Canonical stable kernel releases.  Can you please use the EXTRAVERSION
> > field with a "-something" tag in it to differentiate from the "normal"
> > kernel.org stable releases?
> 
> We'll consider that idea for future extended stable versions maintained
> by our team.

Thank you very much, that should help reduce the current confusion.

greg k-h


end of thread, other threads:[~2014-08-25 19:37 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-25 16:54 [3.8.y.z extended stable] Linux 3.8.13.28 stable review Kamal Mostafa
2014-08-25 16:54 ` [PATCH 3.8 01/13] x86_32, entry: Do syscall exit work on badsys (CVE-2014-4508) Kamal Mostafa
2014-08-25 16:54 ` [PATCH 3.8 02/13] x86_32, entry: Store badsys error code in %eax Kamal Mostafa
2014-08-25 16:54 ` [PATCH 3.8 03/13] shmem: fix faulting into a hole while it's punched Kamal Mostafa
2014-08-25 16:54 ` [PATCH 3.8 04/13] shmem: fix faulting into a hole, not taking i_mutex Kamal Mostafa
2014-08-25 16:54 ` [PATCH 3.8 05/13] shmem: fix splicing from a hole while it's punched Kamal Mostafa
2014-08-25 16:54 ` [PATCH 3.8 06/13] target: Explicitly clear ramdisk_mcp backend pages Kamal Mostafa
2014-08-25 16:54 ` [PATCH 3.8 07/13] sctp: Fix sk_ack_backlog wrap-around problem Kamal Mostafa
2014-08-25 16:54 ` [PATCH 3.8 08/13] mnt: Only change user settable mount flags in remount Kamal Mostafa
2014-08-25 16:54 ` [PATCH 3.8 09/13] mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount Kamal Mostafa
2014-08-25 16:54 ` [PATCH 3.8 10/13] mnt: Correct permission checks in do_remount Kamal Mostafa
2014-08-25 16:54 ` [PATCH 3.8 11/13] mnt: Change the default remount atime from relatime to the existing value Kamal Mostafa
2014-08-25 16:54 ` [PATCH 3.8 12/13] ptrace,x86: force IRET path after a ptrace_stop() Kamal Mostafa
2014-08-25 16:54 ` [PATCH 3.8 13/13] net/l2tp: don't fall back on UDP [get|set]sockopt Kamal Mostafa
2014-08-25 17:00 ` [3.8.y.z extended stable] Linux 3.8.13.28 stable review Greg KH
2014-08-25 19:14   ` Kamal Mostafa
2014-08-25 19:37     ` Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).