stable.vger.kernel.org archive mirror
* [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation
@ 2015-09-30  9:59 Jiri Slaby
  2015-09-30  9:59 ` [patch added to the 3.12 stable tree] v4l: omap3isp: Fix sub-device power management code Jiri Slaby
                   ` (29 more replies)
  0 siblings, 30 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30  9:59 UTC (permalink / raw)
  To: stable; +Cc: David Härdeman, Mauro Carvalho Chehab, Jiri Slaby

From: David Härdeman <david@hardeman.nu>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit a66b0c41ad277ae62a3ae6ac430a71882f899557 upstream.

The input_dev is already gone when the rc device is being unregistered,
so checking for its presence only means that no remove uevent will be
generated.

Signed-off-by: David Härdeman <david@hardeman.nu>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 drivers/media/rc/rc-main.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/media/rc/rc-main.c b/drivers/media/rc/rc-main.c
index 46da365c9c84..f972de9f02e6 100644
--- a/drivers/media/rc/rc-main.c
+++ b/drivers/media/rc/rc-main.c
@@ -978,9 +978,6 @@ static int rc_dev_uevent(struct device *device, struct kobj_uevent_env *env)
 {
 	struct rc_dev *dev = to_rc_dev(device);
 
-	if (!dev || !dev->input_dev)
-		return -ENODEV;
-
 	if (dev->rc_map.name)
 		ADD_HOTPLUG_VAR("NAME=%s", dev->rc_map.name);
 	if (dev->driver_name)
-- 
2.6.0



* [patch added to the 3.12 stable tree] v4l: omap3isp: Fix sub-device power management code
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
@ 2015-09-30  9:59 ` Jiri Slaby
  2015-09-30  9:59 ` [patch added to the 3.12 stable tree] Btrfs: check if previous transaction aborted to avoid fs corruption Jiri Slaby
                   ` (28 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30  9:59 UTC (permalink / raw)
  To: stable; +Cc: Sakari Ailus, Laurent Pinchart, Mauro Carvalho Chehab, Jiri Slaby

From: Sakari Ailus <sakari.ailus@iki.fi>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit 9d39f05490115bf145e5ea03c0b7ec9d3d015b01 upstream.

Commit 813f5c0ac5cc ("media: Change media device link_notify behaviour")
modified the media controller link setup notification API and updated the
OMAP3 ISP driver accordingly. As a side effect it introduced a bug by
turning power on after setting the link instead of before. This results in
sub-devices not being powered down in some cases when they should be. Fix
it.

Fixes: 813f5c0ac5cc ("[media] media: Change media device link_notify behaviour")

Signed-off-by: Sakari Ailus <sakari.ailus@iki.fi>
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 drivers/media/platform/omap3isp/isp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/media/platform/omap3isp/isp.c b/drivers/media/platform/omap3isp/isp.c
index df3a0ec7fd2c..9dd0e0cc65cf 100644
--- a/drivers/media/platform/omap3isp/isp.c
+++ b/drivers/media/platform/omap3isp/isp.c
@@ -814,14 +814,14 @@ static int isp_pipeline_link_notify(struct media_link *link, u32 flags,
 	int ret;
 
 	if (notification == MEDIA_DEV_NOTIFY_POST_LINK_CH &&
-	    !(link->flags & MEDIA_LNK_FL_ENABLED)) {
+	    !(flags & MEDIA_LNK_FL_ENABLED)) {
 		/* Powering off entities is assumed to never fail. */
 		isp_pipeline_pm_power(source, -sink_use);
 		isp_pipeline_pm_power(sink, -source_use);
 		return 0;
 	}
 
-	if (notification == MEDIA_DEV_NOTIFY_POST_LINK_CH &&
+	if (notification == MEDIA_DEV_NOTIFY_PRE_LINK_CH &&
 		(flags & MEDIA_LNK_FL_ENABLED)) {
 
 		ret = isp_pipeline_pm_power(source, sink_use);
-- 
2.6.0



* [patch added to the 3.12 stable tree] Btrfs: check if previous transaction aborted to avoid fs corruption
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
  2015-09-30  9:59 ` [patch added to the 3.12 stable tree] v4l: omap3isp: Fix sub-device power management code Jiri Slaby
@ 2015-09-30  9:59 ` Jiri Slaby
  2015-09-30  9:59 ` [patch added to the 3.12 stable tree] NFSv4: don't set SETATTR for O_RDONLY|O_EXCL Jiri Slaby
                   ` (27 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30  9:59 UTC (permalink / raw)
  To: stable; +Cc: Filipe Manana, Chris Mason, Jiri Slaby

From: Filipe Manana <fdmanana@suse.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit 1f9b8c8fbc9a4d029760b16f477b9d15500e3a34 upstream.

While we are committing a transaction, it's possible the previous one is
still finishing its commit and therefore we wait for it to finish first.
However we were not checking if that previous transaction ended up getting
aborted after we waited for it to commit, so we ended up committing the
current transaction, which can lead to fs corruption because the new
superblock can point to trees that have had one or more nodes/leaves that
were never durably persisted.

The following sequence diagram exemplifies how this is possible:

          CPU 0                                                        CPU 1

  transaction N starts

  (...)

  btrfs_commit_transaction(N)

    cur_trans->state = TRANS_STATE_COMMIT_START;
    (...)
    cur_trans->state = TRANS_STATE_COMMIT_DOING;
    (...)

    cur_trans->state = TRANS_STATE_UNBLOCKED;
    root->fs_info->running_transaction = NULL;

                                                              btrfs_start_transaction()
                                                                 --> starts transaction N + 1

    btrfs_write_and_wait_transaction(trans, root);
      --> starts writing all new or COWed ebs created
          at transaction N

                                                              creates some new ebs, COWs some
                                                              existing ebs but doesn't COW or
                                                              deletes eb X

                                                              btrfs_commit_transaction(N + 1)
                                                                (...)
                                                                cur_trans->state = TRANS_STATE_COMMIT_START;
                                                                (...)
                                                                wait_for_commit(root, prev_trans);
                                                                  --> prev_trans == transaction N

    btrfs_write_and_wait_transaction() continues
    writing ebs
       --> fails writing eb X, we abort transaction N
           and set bit BTRFS_FS_STATE_ERROR on
           fs_info->fs_state, so no new transactions
           can start after setting that bit

       cleanup_transaction()
         btrfs_cleanup_one_transaction()
           wakes up task at CPU 1

                                                                continues, doesn't abort because
                                                                cur_trans->aborted (transaction N + 1)
                                                                is zero, and no checks for bit
                                                                BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                                                are made

                                                                btrfs_write_and_wait_transaction(trans, root);
                                                                  --> succeeds, no errors during writeback

                                                                write_ctree_super(trans, root, 0);
                                                                  --> succeeds
                                                                  --> we have now a superblock that points us
                                                                      to some root that uses eb X, which was
                                                                      never written to disk

In this scenario, future attempts to read eb X from disk result in an
error message like "parent transid verify failed on X wanted Y found Z".

So fix this by aborting the current transaction if, after waiting for the
previous transaction, we verify that it was aborted.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 fs/btrfs/transaction.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 069c2fd37ce7..9218ea8dbfe5 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1711,8 +1711,11 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 			spin_unlock(&root->fs_info->trans_lock);
 
 			wait_for_commit(root, prev_trans);
+			ret = prev_trans->aborted;
 
 			btrfs_put_transaction(prev_trans);
+			if (ret)
+				goto cleanup_transaction;
 		} else {
 			spin_unlock(&root->fs_info->trans_lock);
 		}
-- 
2.6.0



* [patch added to the 3.12 stable tree] NFSv4: don't set SETATTR for O_RDONLY|O_EXCL
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
  2015-09-30  9:59 ` [patch added to the 3.12 stable tree] v4l: omap3isp: Fix sub-device power management code Jiri Slaby
  2015-09-30  9:59 ` [patch added to the 3.12 stable tree] Btrfs: check if previous transaction aborted to avoid fs corruption Jiri Slaby
@ 2015-09-30  9:59 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] NFS: nfs_set_pgio_error sometimes misses errors Jiri Slaby
                   ` (26 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30  9:59 UTC (permalink / raw)
  To: stable; +Cc: NeilBrown, Trond Myklebust, Jiri Slaby

From: NeilBrown <neilb@suse.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit efcbc04e16dfa95fef76309f89710dd1d99a5453 upstream.

It is unusual to combine the open flags O_RDONLY and O_EXCL, but
it appears that LibreOffice does just that.

[pid  3250] stat("/home/USER/.config", {st_mode=S_IFDIR|0700, st_size=8192, ...}) = 0
[pid  3250] open("/home/USER/.config/libreoffice/4-suse/user/extensions/buildid", O_RDONLY|O_EXCL <unfinished ...>

NFSv4 takes O_EXCL as a sign that a setattr command should be sent,
probably to reset the timestamps.

When it was an O_RDONLY open, the SETATTR command does not
identify any actual attributes to change.
If no delegation was provided to the open, the SETATTR uses the
all-zeros stateid and the request is accepted (at least by the
Linux NFS server - no harm, no foul).

If a read delegation was provided, it is used in the SETATTR
request, and a NetApp filer will justifiably return
NFS4ERR_BAD_STATEID, which the Linux client takes as a sign
to retry, indefinitely.

So only treat O_EXCL specially if O_CREAT was also given.
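
Purely for illustration, here is a minimal userspace sketch (hypothetical
path, not part of this patch) that issues the same unusual flag combination
shown in the strace above:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          /* O_EXCL without O_CREAT has no exclusive-create meaning for an
           * existing file; LibreOffice nevertheless opens its buildid file
           * this way.  The path below is only an example. */
          int fd = open("/tmp/buildid", O_RDONLY | O_EXCL);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          close(fd);
          return 0;
  }

With the fix, the client treats O_EXCL as exclusive-create semantics only
when O_CREAT is also set, so an open like the one above no longer results
in a SETATTR being sent.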

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 fs/nfs/nfs4proc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 36a72b59d7c8..794af58b388f 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2267,7 +2267,7 @@ static int _nfs4_do_open(struct inode *dir,
 		goto err_free_label;
 	state = ctx->state;
 
-	if ((opendata->o_arg.open_flags & O_EXCL) &&
+	if ((opendata->o_arg.open_flags & (O_CREAT|O_EXCL)) == (O_CREAT|O_EXCL) &&
 	    (opendata->o_arg.createmode != NFS4_CREATE_GUARDED)) {
 		nfs4_exclusive_attrset(opendata, sattr);
 
-- 
2.6.0



* [patch added to the 3.12 stable tree] NFS: nfs_set_pgio_error sometimes misses errors
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (2 preceding siblings ...)
  2015-09-30  9:59 ` [patch added to the 3.12 stable tree] NFSv4: don't set SETATTR for O_RDONLY|O_EXCL Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] parisc: Filter out spurious interrupts in PA-RISC irq handler Jiri Slaby
                   ` (25 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Trond Myklebust, Jiri Slaby

From: Trond Myklebust <trond.myklebust@primarydata.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit e9ae58aeee8842a50f7e199d602a5ccb2e41a95f upstream.

We should ensure that we always set the pgio_header's error field
if a READ or WRITE RPC call returns an error. The current code depends
on 'hdr->good_bytes' always being initialised to a large value, which
is not always done correctly by callers.
When this happens, applications may end up missing important errors.
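
As a simplified, single-threaded sketch of the intended decision (not the
kernel code, which uses test_and_set_bit() under hdr->lock), an error is
recorded the first time one is seen, or whenever it occurs at an earlier
offset than the error already recorded:

  /* Illustrative only: "error_set" stands in for NFS_IOHDR_ERROR. */
  struct pgio_state {
          int  error_set;
          long io_start;
          long good_bytes;
          int  error;
  };

  static void set_pgio_error(struct pgio_state *s, int error, long pos)
  {
          if (!s->error_set || pos < s->io_start + s->good_bytes) {
                  s->error_set = 1;
                  s->good_bytes = pos - s->io_start;
                  s->error = error;
          }
  }

This way the error field is always set on the first failure, regardless of
how good_bytes was initialised by the caller.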

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 fs/nfs/pagelist.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 27d7f2742592..11763ce73709 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -60,8 +60,8 @@ EXPORT_SYMBOL_GPL(nfs_pgheader_init);
 void nfs_set_pgio_error(struct nfs_pgio_header *hdr, int error, loff_t pos)
 {
 	spin_lock(&hdr->lock);
-	if (pos < hdr->io_start + hdr->good_bytes) {
-		set_bit(NFS_IOHDR_ERROR, &hdr->flags);
+	if (!test_and_set_bit(NFS_IOHDR_ERROR, &hdr->flags)
+	    || pos < hdr->io_start + hdr->good_bytes) {
 		clear_bit(NFS_IOHDR_EOF, &hdr->flags);
 		hdr->good_bytes = pos - hdr->io_start;
 		hdr->error = error;
-- 
2.6.0



* [patch added to the 3.12 stable tree] parisc: Filter out spurious interrupts in PA-RISC irq handler
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (3 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] NFS: nfs_set_pgio_error sometimes misses errors Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] vmscan: fix increasing nr_isolated incurred by putback unevictable pages Jiri Slaby
                   ` (24 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Helge Deller, Jiri Slaby

From: Helge Deller <deller@gmx.de>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit b1b4e435e4ef7de77f07bf2a42c8380b960c2d44 upstream.

When detecting a serial port on newer PA-RISC machines (with iosapic) we have a
long way to go: find the right IRQ line, register it, then register the
serial port and the irq handler for the serial port. During this phase spurious
interrupts for the serial port may happen, which then crash the kernel because
the action handler might not have been set up yet.

So, basically it's a race condition between the serial port hardware and the
CPU which sets up the necessary fields in the irq structs. The main reason for
this race is that we unmask the serial port irqs too early without having set
up everything properly before (which isn't easily possible because we need the
IRQ number to register the serial ports).

This patch is a work-around for this problem. It adds checks to the CPU irq
handler to verify if the IRQ action field has been initialized already. If not,
we just skip this interrupt (which isn't critical for a serial port at bootup).
The real fix would probably involve rewriting all PA-RISC specific IRQ code
(for CPU, IOSAPIC, GSC and EISA) to use IRQ domains with proper parenting of
the irq chips and proper irq enabling along this line.

This bug has been in the PA-RISC port since the beginning, but the crashes
happened very rarely with currently used hardware. But on the latest machine
which I bought (a C8000 workstation), which uses the fastest CPUs (4 x PA8900,
1GHz) and which has the largest possible L1 cache size (64MB each), the kernel
crashed at every boot because of this race. So, without this patch the machine
would currently be unusable.

For the record, here is the flow logic:
1. serial_init_chip() in 8250_gsc.c calls iosapic_serial_irq().
2. iosapic_serial_irq() calls txn_alloc_irq() to find the irq.
3. iosapic_serial_irq() calls cpu_claim_irq() to register the CPU irq.
4. cpu_claim_irq() unmasks the CPU irq (which it shouldn't!).
5. serial_init_chip() then registers the 8250 port.
Problems:
- In step 4 the CPU irq shouldn't have been unmasked yet, but only after step 5.
- If a serial irq arrives after step 4 but before step 5 has finished, the kernel will crash.

Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 arch/parisc/kernel/irq.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/parisc/kernel/irq.c b/arch/parisc/kernel/irq.c
index 2e6443b1e922..c32a37e0e0d2 100644
--- a/arch/parisc/kernel/irq.c
+++ b/arch/parisc/kernel/irq.c
@@ -524,8 +524,8 @@ void do_cpu_irq_mask(struct pt_regs *regs)
 	struct pt_regs *old_regs;
 	unsigned long eirr_val;
 	int irq, cpu = smp_processor_id();
-#ifdef CONFIG_SMP
 	struct irq_desc *desc;
+#ifdef CONFIG_SMP
 	cpumask_t dest;
 #endif
 
@@ -538,8 +538,12 @@ void do_cpu_irq_mask(struct pt_regs *regs)
 		goto set_out;
 	irq = eirr_to_irq(eirr_val);
 
-#ifdef CONFIG_SMP
+	/* Filter out spurious interrupts, mostly from serial port at bootup */
 	desc = irq_to_desc(irq);
+	if (unlikely(!desc->action))
+		goto set_out;
+
+#ifdef CONFIG_SMP
 	cpumask_copy(&dest, desc->irq_data.affinity);
 	if (irqd_is_per_cpu(&desc->irq_data) &&
 	    !cpu_isset(smp_processor_id(), dest)) {
-- 
2.6.0



* [patch added to the 3.12 stable tree] vmscan: fix increasing nr_isolated incurred by putback unevictable pages
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (4 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] parisc: Filter out spurious interrupts in PA-RISC irq handler Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] fs: if a coredump already exists, unlink and recreate with O_EXCL Jiri Slaby
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Jaewon Kim, Mel Gorman, Andrew Morton, Linus Torvalds, Jiri Slaby

From: Jaewon Kim <jaewon31.kim@samsung.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit c54839a722a02818677bcabe57e957f0ce4f841d upstream.

reclaim_clean_pages_from_list() assumes that shrink_page_list() returns
the number of pages removed from the candidate list.  But shrink_page_list()
puts back mlocked pages without passing them to the caller and without
counting them as nr_reclaimed.  This increases nr_isolated.

To fix this, this patch changes shrink_page_list() to pass unevictable
pages back to the caller.  The caller will take care of those pages.

Minchan said:

It fixes two issues.

1. With unevictable pages, cma_alloc will be successful.

Strictly speaking, cma_alloc in the current kernel can fail due to
unevictable pages.

2. It fixes the leaking of the NR_ISOLATED counter in vmstat.

With it, too_many_isolated() works.  Otherwise, it could cause a hang until
the process gets SIGKILL.

Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 04c33d5fb079..6dc33d9dc2cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1087,7 +1087,7 @@ cull_mlocked:
 		if (PageSwapCache(page))
 			try_to_free_swap(page);
 		unlock_page(page);
-		putback_lru_page(page);
+		list_add(&page->lru, &ret_pages);
 		continue;
 
 activate_locked:
-- 
2.6.0



* [patch added to the 3.12 stable tree] fs: if a coredump already exists, unlink and recreate with O_EXCL
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (5 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] vmscan: fix increasing nr_isolated incurred by putback unevictable pages Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] mmc: core: fix race condition in mmc_wait_data_done Jiri Slaby
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable
  Cc: Jann Horn, Kees Cook, Al Viro, Andrew Morton, Linus Torvalds,
	Jiri Slaby

From: Jann Horn <jann@thejh.net>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit fbb1816942c04429e85dbf4c1a080accc534299e upstream.

It was possible for an attacking user to trick root (or another user) into
writing his coredumps into an attacker-readable, pre-existing file using
rename() or link(), causing the disclosure of secret data from the victim
process' virtual memory.  Depending on the configuration, it was also
possible to trick root into overwriting system files with coredumps.  Fix
that issue by never writing coredumps into existing files.

Requirements for the attack:
 - The attack only applies if the victim's process has a nonzero
   RLIMIT_CORE and is dumpable.
 - The attacker can trick the victim into coredumping into an
   attacker-writable directory D, either because the core_pattern is
   relative and the victim's cwd is attacker-writable or because an
   absolute core_pattern pointing to a world-writable directory is used.
 - The attacker has one of these:
  A: on a system with protected_hardlinks=0:
     execute access to a folder containing a victim-owned,
     attacker-readable file on the same partition as D, and the
     victim-owned file will be deleted before the main part of the attack
     takes place. (In practice, there are lots of files that fulfill
     this condition, e.g. entries in Debian's /var/lib/dpkg/info/.)
     This does not apply to most Linux systems because most distros set
     protected_hardlinks=1.
  B: on a system with protected_hardlinks=1:
     execute access to a folder containing a victim-owned,
     attacker-readable and attacker-writable file on the same partition
     as D, and the victim-owned file will be deleted before the main part
     of the attack takes place.
     (This seems to be uncommon.)
  C: on any system, independent of protected_hardlinks:
     write access to a non-sticky folder containing a victim-owned,
     attacker-readable file on the same partition as D
     (This seems to be uncommon.)

The basic idea is that the attacker moves the victim-owned file to where
he expects the victim process to dump its core.  The victim process dumps
its core into the existing file, and the attacker reads the coredump from
it.

If the attacker can't move the file because he does not have write access
to the containing directory, he can instead link the file to a directory
he controls, then wait for the original link to the file to be deleted
(because the kernel checks that the link count of the corefile is 1).

A less reliable variant that requires D to be non-sticky works with link()
and does not require deletion of the original link: link() the file into
D, but then unlink() it directly before the kernel performs the link count
check.

On systems with protected_hardlinks=0, this variant allows an attacker to
not only gain information from coredumps, but also clobber existing,
victim-writable files with coredumps.  (This could theoretically lead to a
privilege escalation.)
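
The pattern the fix adopts can be illustrated with a small userspace sketch
(hypothetical path and helper name, not the kernel code, which additionally
skips the unlink in the suid-safe case):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Never dump into a pre-existing file: unlink first (best effort, ENOENT
   * is fine), then insist on O_CREAT|O_EXCL so that a file planted by an
   * attacker between the two calls makes the open fail instead of being
   * overwritten or disclosed. */
  static int open_corefile(const char *path)
  {
          (void) unlink(path);
          return open(path, O_WRONLY | O_CREAT | O_EXCL | O_NOFOLLOW, 0600);
  }

  int main(void)
  {
          int fd = open_corefile("/tmp/example-core");

          if (fd < 0) {
                  perror("open_corefile");
                  return 1;
          }
          close(fd);
          return 0;
  }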

Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 fs/coredump.c | 38 ++++++++++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 88adbdd15193..ff78d9075316 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -499,10 +499,10 @@ void do_coredump(siginfo_t *siginfo)
 	const struct cred *old_cred;
 	struct cred *cred;
 	int retval = 0;
-	int flag = 0;
 	int ispipe;
 	struct files_struct *displaced;
-	bool need_nonrelative = false;
+	/* require nonrelative corefile path and be extra careful */
+	bool need_suid_safe = false;
 	bool core_dumped = false;
 	static atomic_t core_dump_count = ATOMIC_INIT(0);
 	struct coredump_params cprm = {
@@ -536,9 +536,8 @@ void do_coredump(siginfo_t *siginfo)
 	 */
 	if (__get_dumpable(cprm.mm_flags) == SUID_DUMP_ROOT) {
 		/* Setuid core dump mode */
-		flag = O_EXCL;		/* Stop rewrite attacks */
 		cred->fsuid = GLOBAL_ROOT_UID;	/* Dump root private */
-		need_nonrelative = true;
+		need_suid_safe = true;
 	}
 
 	retval = coredump_wait(siginfo->si_signo, &core_state);
@@ -619,7 +618,7 @@ void do_coredump(siginfo_t *siginfo)
 		if (cprm.limit < binfmt->min_coredump)
 			goto fail_unlock;
 
-		if (need_nonrelative && cn.corename[0] != '/') {
+		if (need_suid_safe && cn.corename[0] != '/') {
 			printk(KERN_WARNING "Pid %d(%s) can only dump core "\
 				"to fully qualified path!\n",
 				task_tgid_vnr(current), current->comm);
@@ -627,8 +626,35 @@ void do_coredump(siginfo_t *siginfo)
 			goto fail_unlock;
 		}
 
+		/*
+		 * Unlink the file if it exists unless this is a SUID
+		 * binary - in that case, we're running around with root
+		 * privs and don't want to unlink another user's coredump.
+		 */
+		if (!need_suid_safe) {
+			mm_segment_t old_fs;
+
+			old_fs = get_fs();
+			set_fs(KERNEL_DS);
+			/*
+			 * If it doesn't exist, that's fine. If there's some
+			 * other problem, we'll catch it at the filp_open().
+			 */
+			(void) sys_unlink((const char __user *)cn.corename);
+			set_fs(old_fs);
+		}
+
+		/*
+		 * There is a race between unlinking and creating the
+		 * file, but if that causes an EEXIST here, that's
+		 * fine - another process raced with us while creating
+		 * the corefile, and the other process won. To userspace,
+		 * what matters is that at least one of the two processes
+		 * writes its coredump successfully, not which one.
+		 */
 		cprm.file = filp_open(cn.corename,
-				 O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
+				 O_CREAT | 2 | O_NOFOLLOW |
+				 O_LARGEFILE | O_EXCL,
 				 0600);
 		if (IS_ERR(cprm.file))
 			goto fail_unlock;
-- 
2.6.0



* [patch added to the 3.12 stable tree] mmc: core: fix race condition in mmc_wait_data_done
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (6 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] fs: if a coredump already exists, unlink and recreate with O_EXCL Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] md/raid10: always set reshape_safe when initializing reshape_position Jiri Slaby
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Jialing Fu, Shawn Lin, Ulf Hansson, Jiri Slaby

From: Jialing Fu <jlfu@marvell.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit 71f8a4b81d040b3d094424197ca2f1bf811b1245 upstream.

The following panic was captured on kernel 3.14, but the issue still exists
in the latest kernel.
---------------------------------------------------------------------
[   20.738217] c0 3136 (Compiler) Unable to handle kernel NULL pointer dereference
at virtual address 00000578
......
[   20.738499] c0 3136 (Compiler) PC is at _raw_spin_lock_irqsave+0x24/0x60
[   20.738527] c0 3136 (Compiler) LR is at _raw_spin_lock_irqsave+0x20/0x60
[   20.740134] c0 3136 (Compiler) Call trace:
[   20.740165] c0 3136 (Compiler) [<ffffffc0008ee900>] _raw_spin_lock_irqsave+0x24/0x60
[   20.740200] c0 3136 (Compiler) [<ffffffc0000dd024>] __wake_up+0x1c/0x54
[   20.740230] c0 3136 (Compiler) [<ffffffc000639414>] mmc_wait_data_done+0x28/0x34
[   20.740262] c0 3136 (Compiler) [<ffffffc0006391a0>] mmc_request_done+0xa4/0x220
[   20.740314] c0 3136 (Compiler) [<ffffffc000656894>] sdhci_tasklet_finish+0xac/0x264
[   20.740352] c0 3136 (Compiler) [<ffffffc0000a2b58>] tasklet_action+0xa0/0x158
[   20.740382] c0 3136 (Compiler) [<ffffffc0000a2078>] __do_softirq+0x10c/0x2e4
[   20.740411] c0 3136 (Compiler) [<ffffffc0000a24bc>] irq_exit+0x8c/0xc0
[   20.740439] c0 3136 (Compiler) [<ffffffc00008489c>] handle_IRQ+0x48/0xac
[   20.740469] c0 3136 (Compiler) [<ffffffc000081428>] gic_handle_irq+0x38/0x7c
----------------------------------------------------------------------
In SMP, "mrq" is subject to a race condition between the two paths below:
path1: CPU0: <tasklet context>
  static void mmc_wait_data_done(struct mmc_request *mrq)
  {
     mrq->host->context_info.is_done_rcv = true;
     //
     // If CPU0 has just finished "is_done_rcv = true" in path1, and at
     // this moment an IRQ or an icache miss occurs on CPU0,
     // what happens on CPU1 (path2)?
     //
     // If the mmcqd thread on CPU1 (path2) hasn't entered sleep mode:
     // path2 gets a chance to break out of wait_event_interruptible
     // in mmc_wait_for_data_req_done and continue to run for the next
     // mmc_request (mmc_blk_rw_rq_prep).
     //
     // Within mmc_blk_rw_rq_prep, mrq is cleared to 0.
     // If the line below still loads host from "mrq" as compiled, the
     // panic happens as we traced.
     wake_up_interruptible(&mrq->host->context_info.wait);
  }

path2: CPU1: <The mmcqd thread runs mmc_queue_thread>
  static int mmc_wait_for_data_req_done(...
  {
     ...
     while (1) {
           wait_event_interruptible(context_info->wait,
                   (context_info->is_done_rcv ||
                    context_info->is_new_req));
     	   static void mmc_blk_rw_rq_prep(...
           {
           ...
           memset(brq, 0, sizeof(struct mmc_blk_request));

This issue happens only very rarely; however, adding mdelay(1) in
mmc_wait_data_done() as below makes it easy to reproduce.

   static void mmc_wait_data_done(struct mmc_request *mrq)
   {
     mrq->host->context_info.is_done_rcv = true;
+    mdelay(1);
     wake_up_interruptible(&mrq->host->context_info.wait);
    }

At runtime, an IRQ or an icache miss may happen at exactly the point
where the mdelay(1) was inserted.

This patch fetches the mmc_context_info pointer at the beginning of the
function, which avoids this race condition.
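
The essence of the fix, as a hedged userspace sketch adapted to pthreads
(illustrative names, and a mutex because a condition variable needs one,
unlike the kernel wait queue): take a private pointer to the context
structure before publishing the completion, so the request object is never
dereferenced after the waiter may have reused it.

  #include <pthread.h>
  #include <stdbool.h>

  struct context_info {
          pthread_mutex_t lock;
          pthread_cond_t  wait;
          bool            is_done_rcv;
  };

  struct request {
          struct context_info *ctx;   /* may be reused once done is seen */
  };

  /* Completion side: grab the context pointer first and never touch
   * "req" again once the flag is visible to the waiter. */
  static void request_done(struct request *req)
  {
          struct context_info *ctx = req->ctx;

          pthread_mutex_lock(&ctx->lock);
          ctx->is_done_rcv = true;
          pthread_cond_signal(&ctx->wait);
          pthread_mutex_unlock(&ctx->lock);
  }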

Signed-off-by: Jialing Fu <jlfu@marvell.com>
Tested-by: Shawn Lin <shawn.lin@rock-chips.com>
Fixes: 2220eedfd7ae ("mmc: fix async request mechanism ....")
Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 drivers/mmc/core/core.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index e743d3984d29..4b12543b0826 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -328,8 +328,10 @@ EXPORT_SYMBOL(mmc_start_bkops);
  */
 static void mmc_wait_data_done(struct mmc_request *mrq)
 {
-	mrq->host->context_info.is_done_rcv = true;
-	wake_up_interruptible(&mrq->host->context_info.wait);
+	struct mmc_context_info *context_info = &mrq->host->context_info;
+
+	context_info->is_done_rcv = true;
+	wake_up_interruptible(&context_info->wait);
 }
 
 static void mmc_wait_done(struct mmc_request *mrq)
-- 
2.6.0



* [patch added to the 3.12 stable tree] md/raid10: always set reshape_safe when initializing reshape_position.
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (7 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] mmc: core: fix race condition in mmc_wait_data_done Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] xen/gntdev: convert priv->lock to a mutex Jiri Slaby
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: NeilBrown, Jiri Slaby

From: NeilBrown <neilb@suse.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit 299b0685e31c9f3dcc2d58ee3beca761a40b44b3 upstream.

'reshape_position' tracks how far the reshape has progressed.
'reshape_safe' tracks how far the reshape has been safely recorded
in the metadata.

These are compared to determine when to update the metadata.
So it is important that reshape_safe is initialised properly.
Currently it isn't.  When starting a reshape from the beginning
it usually has the correct value by luck.  But when reducing the
number of devices in a RAID10, it has the wrong value and this leads
to the metadata not being updated correctly.
This can lead to corruption if the reshape is not allowed to complete.

This patch is suitable for any -stable kernel which supports RAID10
reshape, which is 3.5 and later.

Fixes: 3ea7daa5d7fd ("md/raid10: add reshape support")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 drivers/md/raid10.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 9ccb107c982e..d525d663bb22 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3599,6 +3599,7 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 			/* far_copies must be 1 */
 			conf->prev.stride = conf->dev_sectors;
 	}
+	conf->reshape_safe = conf->reshape_progress;
 	spin_lock_init(&conf->device_lock);
 	INIT_LIST_HEAD(&conf->retry_list);
 
@@ -3806,7 +3807,6 @@ static int run(struct mddev *mddev)
 		}
 		conf->offset_diff = min_offset_diff;
 
-		conf->reshape_safe = conf->reshape_progress;
 		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 		set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
@@ -4151,6 +4151,7 @@ static int raid10_start_reshape(struct mddev *mddev)
 		conf->reshape_progress = size;
 	} else
 		conf->reshape_progress = 0;
+	conf->reshape_safe = conf->reshape_progress;
 	spin_unlock_irq(&conf->device_lock);
 
 	if (mddev->delta_disks && mddev->bitmap) {
@@ -4217,6 +4218,7 @@ abort:
 		rdev->new_data_offset = rdev->data_offset;
 	smp_wmb();
 	conf->reshape_progress = MaxSector;
+	conf->reshape_safe = MaxSector;
 	mddev->reshape_position = MaxSector;
 	spin_unlock_irq(&conf->device_lock);
 	return ret;
@@ -4564,6 +4566,7 @@ static void end_reshape(struct r10conf *conf)
 	md_finish_reshape(conf->mddev);
 	smp_wmb();
 	conf->reshape_progress = MaxSector;
+	conf->reshape_safe = MaxSector;
 	spin_unlock_irq(&conf->device_lock);
 
 	/* read-ahead size must cover two whole stripes, which is
-- 
2.6.0



* [patch added to the 3.12 stable tree] xen/gntdev: convert priv->lock to a mutex
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (8 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] md/raid10: always set reshape_safe when initializing reshape_position Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] hfs: fix B-tree corruption after insertion at position 0 Jiri Slaby
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: David Vrabel, Ian Campbell, Jiri Slaby

From: David Vrabel <david.vrabel@citrix.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit 1401c00e59ea021c575f74612fe2dbba36d6a4ee upstream.

Unmapping may require sleeping and we unmap while holding priv->lock, so
convert it to a mutex.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 drivers/xen/gntdev.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index e41c79c986ea..0b5806995718 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -67,7 +67,7 @@ struct gntdev_priv {
 	 * Only populated if populate_freeable_maps == 1 */
 	struct list_head freeable_maps;
 	/* lock protects maps and freeable_maps */
-	spinlock_t lock;
+	struct mutex lock;
 	struct mm_struct *mm;
 	struct mmu_notifier mn;
 };
@@ -216,9 +216,9 @@ static void gntdev_put_map(struct gntdev_priv *priv, struct grant_map *map)
 	}
 
 	if (populate_freeable_maps && priv) {
-		spin_lock(&priv->lock);
+		mutex_lock(&priv->lock);
 		list_del(&map->next);
-		spin_unlock(&priv->lock);
+		mutex_unlock(&priv->lock);
 	}
 
 	if (map->pages && !use_ptemod)
@@ -387,9 +387,9 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
 		 * not do any unmapping, since that has been done prior to
 		 * closing the vma, but it may still iterate the unmap_ops list.
 		 */
-		spin_lock(&priv->lock);
+		mutex_lock(&priv->lock);
 		map->vma = NULL;
-		spin_unlock(&priv->lock);
+		mutex_unlock(&priv->lock);
 	}
 	vma->vm_private_data = NULL;
 	gntdev_put_map(priv, map);
@@ -433,14 +433,14 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
 
-	spin_lock(&priv->lock);
+	mutex_lock(&priv->lock);
 	list_for_each_entry(map, &priv->maps, next) {
 		unmap_if_in_range(map, start, end);
 	}
 	list_for_each_entry(map, &priv->freeable_maps, next) {
 		unmap_if_in_range(map, start, end);
 	}
-	spin_unlock(&priv->lock);
+	mutex_unlock(&priv->lock);
 }
 
 static void mn_invl_page(struct mmu_notifier *mn,
@@ -457,7 +457,7 @@ static void mn_release(struct mmu_notifier *mn,
 	struct grant_map *map;
 	int err;
 
-	spin_lock(&priv->lock);
+	mutex_lock(&priv->lock);
 	list_for_each_entry(map, &priv->maps, next) {
 		if (!map->vma)
 			continue;
@@ -476,7 +476,7 @@ static void mn_release(struct mmu_notifier *mn,
 		err = unmap_grant_pages(map, /* offset */ 0, map->count);
 		WARN_ON(err);
 	}
-	spin_unlock(&priv->lock);
+	mutex_unlock(&priv->lock);
 }
 
 static struct mmu_notifier_ops gntdev_mmu_ops = {
@@ -498,7 +498,7 @@ static int gntdev_open(struct inode *inode, struct file *flip)
 
 	INIT_LIST_HEAD(&priv->maps);
 	INIT_LIST_HEAD(&priv->freeable_maps);
-	spin_lock_init(&priv->lock);
+	mutex_init(&priv->lock);
 
 	if (use_ptemod) {
 		priv->mm = get_task_mm(current);
@@ -572,10 +572,10 @@ static long gntdev_ioctl_map_grant_ref(struct gntdev_priv *priv,
 		return -EFAULT;
 	}
 
-	spin_lock(&priv->lock);
+	mutex_lock(&priv->lock);
 	gntdev_add_map(priv, map);
 	op.index = map->index << PAGE_SHIFT;
-	spin_unlock(&priv->lock);
+	mutex_unlock(&priv->lock);
 
 	if (copy_to_user(u, &op, sizeof(op)) != 0)
 		return -EFAULT;
@@ -594,7 +594,7 @@ static long gntdev_ioctl_unmap_grant_ref(struct gntdev_priv *priv,
 		return -EFAULT;
 	pr_debug("priv %p, del %d+%d\n", priv, (int)op.index, (int)op.count);
 
-	spin_lock(&priv->lock);
+	mutex_lock(&priv->lock);
 	map = gntdev_find_map_index(priv, op.index >> PAGE_SHIFT, op.count);
 	if (map) {
 		list_del(&map->next);
@@ -602,7 +602,7 @@ static long gntdev_ioctl_unmap_grant_ref(struct gntdev_priv *priv,
 			list_add_tail(&map->next, &priv->freeable_maps);
 		err = 0;
 	}
-	spin_unlock(&priv->lock);
+	mutex_unlock(&priv->lock);
 	if (map)
 		gntdev_put_map(priv, map);
 	return err;
@@ -670,7 +670,7 @@ static long gntdev_ioctl_notify(struct gntdev_priv *priv, void __user *u)
 	out_flags = op.action;
 	out_event = op.event_channel_port;
 
-	spin_lock(&priv->lock);
+	mutex_lock(&priv->lock);
 
 	list_for_each_entry(map, &priv->maps, next) {
 		uint64_t begin = map->index << PAGE_SHIFT;
@@ -698,7 +698,7 @@ static long gntdev_ioctl_notify(struct gntdev_priv *priv, void __user *u)
 	rc = 0;
 
  unlock_out:
-	spin_unlock(&priv->lock);
+	mutex_unlock(&priv->lock);
 
 	/* Drop the reference to the event channel we did not save in the map */
 	if (out_flags & UNMAP_NOTIFY_SEND_EVENT)
@@ -748,7 +748,7 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
 	pr_debug("map %d+%d at %lx (pgoff %lx)\n",
 			index, count, vma->vm_start, vma->vm_pgoff);
 
-	spin_lock(&priv->lock);
+	mutex_lock(&priv->lock);
 	map = gntdev_find_map_index(priv, index, count);
 	if (!map)
 		goto unlock_out;
@@ -783,7 +783,7 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
 			map->flags |= GNTMAP_readonly;
 	}
 
-	spin_unlock(&priv->lock);
+	mutex_unlock(&priv->lock);
 
 	if (use_ptemod) {
 		err = apply_to_page_range(vma->vm_mm, vma->vm_start,
@@ -811,11 +811,11 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
 	return 0;
 
 unlock_out:
-	spin_unlock(&priv->lock);
+	mutex_unlock(&priv->lock);
 	return err;
 
 out_unlock_put:
-	spin_unlock(&priv->lock);
+	mutex_unlock(&priv->lock);
 out_put_map:
 	if (use_ptemod)
 		map->vma = NULL;
-- 
2.6.0



* [patch added to the 3.12 stable tree] hfs: fix B-tree corruption after insertion at position 0
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (9 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] xen/gntdev: convert priv->lock to a mutex Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/qib: Change lkey table allocation to support more MRs Jiri Slaby
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable
  Cc: Hin-Tak Leung, Sergei Antonov, Joe Perches, Anton Altaparmakov,
	Al Viro, Christoph Hellwig, Andrew Morton, Linus Torvalds,
	Jiri Slaby

From: Hin-Tak Leung <htl10@users.sourceforge.net>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit b4cc0efea4f0bfa2477c56af406cfcf3d3e58680 upstream.

Fix B-tree corruption when a new record is inserted at position 0 in the
node in hfs_brec_insert().

This makes the same change to the corresponding hfs B-tree code as Sergei
Antonov's "hfsplus: fix B-tree corruption after insertion at position 0",
to keep similar code paths in the hfs and hfsplus drivers in sync, where
appropriate.

Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Cc: Sergei Antonov <saproj@gmail.com>
Cc: Joe Perches <joe@perches.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 fs/hfs/brec.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/hfs/brec.c b/fs/hfs/brec.c
index 9f4ee7f52026..6fc766df0461 100644
--- a/fs/hfs/brec.c
+++ b/fs/hfs/brec.c
@@ -131,13 +131,16 @@ skip:
 	hfs_bnode_write(node, entry, data_off + key_len, entry_len);
 	hfs_bnode_dump(node);
 
-	if (new_node) {
-		/* update parent key if we inserted a key
-		 * at the start of the first node
-		 */
-		if (!rec && new_node != node)
-			hfs_brec_update_parent(fd);
+	/*
+	 * update parent key if we inserted a key
+	 * at the start of the node and it is not the new node
+	 */
+	if (!rec && new_node != node) {
+		hfs_bnode_read_key(node, fd->search_key, data_off + size);
+		hfs_brec_update_parent(fd);
+	}
 
+	if (new_node) {
 		hfs_bnode_put(fd->bnode);
 		if (!new_node->parent) {
 			hfs_btree_inc_height(tree);
@@ -166,9 +169,6 @@ skip:
 		goto again;
 	}
 
-	if (!rec)
-		hfs_brec_update_parent(fd);
-
 	return 0;
 }
 
@@ -366,6 +366,8 @@ again:
 	if (IS_ERR(parent))
 		return PTR_ERR(parent);
 	__hfs_brec_find(parent, fd);
+	if (fd->record < 0)
+		return -ENOENT;
 	hfs_bnode_dump(parent);
 	rec = fd->record;
 
-- 
2.6.0



* [patch added to the 3.12 stable tree] IB/qib: Change lkey table allocation to support more MRs
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (10 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] hfs: fix B-tree corruption after insertion at position 0 Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/uverbs: reject invalid or unknown opcodes Jiri Slaby
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Mike Marciniszyn, Doug Ledford, Jiri Slaby

From: Mike Marciniszyn <mike.marciniszyn@intel.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit d6f1c17e162b2a11e708f28fa93f2f79c164b442 upstream.

The lkey table is allocated with __get_free_pages() with an
order based on the number of index bits from a module parameter.

The underlying kernel code cannot allocate that many contiguous pages.

There is no reason the underlying memory needs to be physically
contiguous.
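
As a rough worked example (assuming 64-bit pointers and 4 KiB pages): with
the maximum of 23 index bits the table holds 2^23 entries of 8 bytes each,
i.e. 64 MiB, which would be an order-14 page allocation - far beyond what
the buddy allocator will normally hand out as one physically contiguous
block, whereas vmalloc() only needs virtually contiguous address space.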

This patch:
- switches the allocation/deallocation to vmalloc/vfree
- caps the number of bits to 23 to ensure at least 1 generation bit
  o this matches the module parameter description

Reviewed-by: Vinit Agnihotri <vinit.abhay.agnihotri@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 drivers/infiniband/hw/qib/qib_keys.c  |  4 ++++
 drivers/infiniband/hw/qib/qib_verbs.c | 14 ++++++++++----
 drivers/infiniband/hw/qib/qib_verbs.h |  2 ++
 3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_keys.c b/drivers/infiniband/hw/qib/qib_keys.c
index 3b9afccaaade..eabe54738be6 100644
--- a/drivers/infiniband/hw/qib/qib_keys.c
+++ b/drivers/infiniband/hw/qib/qib_keys.c
@@ -86,6 +86,10 @@ int qib_alloc_lkey(struct qib_mregion *mr, int dma_region)
 	 * unrestricted LKEY.
 	 */
 	rkt->gen++;
+	/*
+	 * bits are capped in qib_verbs.c to insure enough bits
+	 * for generation number
+	 */
 	mr->lkey = (r << (32 - ib_qib_lkey_table_size)) |
 		((((1 << (24 - ib_qib_lkey_table_size)) - 1) & rkt->gen)
 		 << 8);
diff --git a/drivers/infiniband/hw/qib/qib_verbs.c b/drivers/infiniband/hw/qib/qib_verbs.c
index 092b0bb1bb78..c141b9b2493d 100644
--- a/drivers/infiniband/hw/qib/qib_verbs.c
+++ b/drivers/infiniband/hw/qib/qib_verbs.c
@@ -40,6 +40,7 @@
 #include <linux/rculist.h>
 #include <linux/mm.h>
 #include <linux/random.h>
+#include <linux/vmalloc.h>
 
 #include "qib.h"
 #include "qib_common.h"
@@ -2086,10 +2087,16 @@ int qib_register_ib_device(struct qib_devdata *dd)
 	 * the LKEY).  The remaining bits act as a generation number or tag.
 	 */
 	spin_lock_init(&dev->lk_table.lock);
+	/* insure generation is at least 4 bits see keys.c */
+	if (ib_qib_lkey_table_size > MAX_LKEY_TABLE_BITS) {
+		qib_dev_warn(dd, "lkey bits %u too large, reduced to %u\n",
+			ib_qib_lkey_table_size, MAX_LKEY_TABLE_BITS);
+		ib_qib_lkey_table_size = MAX_LKEY_TABLE_BITS;
+	}
 	dev->lk_table.max = 1 << ib_qib_lkey_table_size;
 	lk_tab_size = dev->lk_table.max * sizeof(*dev->lk_table.table);
 	dev->lk_table.table = (struct qib_mregion __rcu **)
-		__get_free_pages(GFP_KERNEL, get_order(lk_tab_size));
+		vmalloc(lk_tab_size);
 	if (dev->lk_table.table == NULL) {
 		ret = -ENOMEM;
 		goto err_lk;
@@ -2262,7 +2269,7 @@ err_tx:
 					sizeof(struct qib_pio_header),
 				  dev->pio_hdrs, dev->pio_hdrs_phys);
 err_hdrs:
-	free_pages((unsigned long) dev->lk_table.table, get_order(lk_tab_size));
+	vfree(dev->lk_table.table);
 err_lk:
 	kfree(dev->qp_table);
 err_qpt:
@@ -2316,8 +2323,7 @@ void qib_unregister_ib_device(struct qib_devdata *dd)
 					sizeof(struct qib_pio_header),
 				  dev->pio_hdrs, dev->pio_hdrs_phys);
 	lk_tab_size = dev->lk_table.max * sizeof(*dev->lk_table.table);
-	free_pages((unsigned long) dev->lk_table.table,
-		   get_order(lk_tab_size));
+	vfree(dev->lk_table.table);
 	kfree(dev->qp_table);
 }
 
diff --git a/drivers/infiniband/hw/qib/qib_verbs.h b/drivers/infiniband/hw/qib/qib_verbs.h
index 012e2c7575ad..61b162b64dc6 100644
--- a/drivers/infiniband/hw/qib/qib_verbs.h
+++ b/drivers/infiniband/hw/qib/qib_verbs.h
@@ -647,6 +647,8 @@ struct qib_qpn_table {
 	struct qpn_map map[QPNMAP_ENTRIES];
 };
 
+#define MAX_LKEY_TABLE_BITS 23
+
 struct qib_lkey_table {
 	spinlock_t lock; /* protect changes in this struct */
 	u32 next;               /* next unused index (speeds search) */
-- 
2.6.0



* [patch added to the 3.12 stable tree] IB/uverbs: reject invalid or unknown opcodes
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (11 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/qib: Change lkey table allocation to support more MRs Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/uverbs: Fix race between ib_uverbs_open and remove_one Jiri Slaby
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Christoph Hellwig, Doug Ledford, Jiri Slaby

From: Christoph Hellwig <hch@lst.de>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit b632ffa7cee439ba5dce3b3bc4a5cbe2b3e20133 upstream.

We have many WR opcodes that are only supported in kernel space
and/or require optional information to be copied into the WR
structure.  Reject all those not explicitly handled so that we
can't pass invalid information to drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 drivers/infiniband/core/uverbs_cmd.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 2f0f01b70e3b..f08f79c58f6a 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -2120,6 +2120,12 @@ ssize_t ib_uverbs_post_send(struct ib_uverbs_file *file,
 		next->send_flags = user_wr->send_flags;
 
 		if (is_ud) {
+			if (next->opcode != IB_WR_SEND &&
+			    next->opcode != IB_WR_SEND_WITH_IMM) {
+				ret = -EINVAL;
+				goto out_put;
+			}
+
 			next->wr.ud.ah = idr_read_ah(user_wr->wr.ud.ah,
 						     file->ucontext);
 			if (!next->wr.ud.ah) {
@@ -2156,9 +2162,11 @@ ssize_t ib_uverbs_post_send(struct ib_uverbs_file *file,
 					user_wr->wr.atomic.compare_add;
 				next->wr.atomic.swap = user_wr->wr.atomic.swap;
 				next->wr.atomic.rkey = user_wr->wr.atomic.rkey;
+			case IB_WR_SEND:
 				break;
 			default:
-				break;
+				ret = -EINVAL;
+				goto out_put;
 			}
 		}
 
-- 
2.6.0



* [patch added to the 3.12 stable tree] IB/uverbs: Fix race between ib_uverbs_open and remove_one
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (12 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/uverbs: reject invalid or unknown opcodes Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/mlx4: Forbid using sysfs to change RoCE pkeys Jiri Slaby
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Yishai Hadas, Shachar Raindel, Doug Ledford, Jiri Slaby

From: Yishai Hadas <yishaih@mellanox.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit 35d4a0b63dc0c6d1177d4f532a9deae958f0662c upstream.

Fixes: 2a72f212263701b927559f6850446421d5906c41 ("IB/uverbs: Remove dev_table")

Before this commit there was a device look-up table, protected by a
spin_lock, that was used by ib_uverbs_open and by ib_uverbs_remove_one. When
it was dropped and container_of() was used instead, it enabled a race
with remove_one, as dev might be freed just after:
dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev)
but before the kref_get.

In addition, this buggy patch added some dead code, as
container_of(x,y,z) can never be NULL and so dev can never be NULL.
As a result, the comment above ib_uverbs_open saying "the open method
will either immediately run -ENXIO" is wrong, as that can never happen.

The solution follows Jason Gunthorpe suggestion from below URL:
https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg25692.html

cdev will hold a kref on the parent (the containing structure,
ib_uverbs_device) and only when that kref is released it is
guaranteed that open will never be called again.

In addition, fix the active count scheme to use an atomic
rather than a kref, to prevent a WARN_ON, as pointed out in the
above comment from Jason.
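
The open path then follows the usual "take a reference only if the object
is still alive" idiom; a minimal C11 userspace sketch of that idiom
(illustrative names, not the uverbs code itself):

  #include <stdatomic.h>
  #include <stdbool.h>

  struct device_obj {
          atomic_int refcount;   /* 0 means removal is in progress */
  };

  /* Userspace equivalent of atomic_inc_not_zero(): take a reference only
   * while at least one other reference still exists. */
  static bool get_device_ref(struct device_obj *dev)
  {
          int old = atomic_load(&dev->refcount);

          while (old != 0) {
                  if (atomic_compare_exchange_weak(&dev->refcount,
                                                   &old, old + 1))
                          return true;   /* reference taken */
          }
          return false;                  /* device already going away */
  }

When this fails, ib_uverbs_open bails out with -ENXIO instead of touching a
device that remove_one may already be freeing.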

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 drivers/infiniband/core/uverbs.h      |  3 ++-
 drivers/infiniband/core/uverbs_main.c | 43 ++++++++++++++++++++++++-----------
 2 files changed, 32 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index d8f9c6c272d7..5252cd0be039 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -69,7 +69,7 @@
  */
 
 struct ib_uverbs_device {
-	struct kref				ref;
+	atomic_t				refcount;
 	int					num_comp_vectors;
 	struct completion			comp;
 	struct device			       *dev;
@@ -78,6 +78,7 @@ struct ib_uverbs_device {
 	struct cdev			        cdev;
 	struct rb_root				xrcd_tree;
 	struct mutex				xrcd_tree_mutex;
+	struct kobject				kobj;
 };
 
 struct ib_uverbs_event_file {
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 849c9dc7d1f6..68e5496c5d58 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -124,14 +124,18 @@ static ssize_t (*uverbs_cmd_table[])(struct ib_uverbs_file *file,
 static void ib_uverbs_add_one(struct ib_device *device);
 static void ib_uverbs_remove_one(struct ib_device *device);
 
-static void ib_uverbs_release_dev(struct kref *ref)
+static void ib_uverbs_release_dev(struct kobject *kobj)
 {
 	struct ib_uverbs_device *dev =
-		container_of(ref, struct ib_uverbs_device, ref);
+		container_of(kobj, struct ib_uverbs_device, kobj);
 
-	complete(&dev->comp);
+	kfree(dev);
 }
 
+static struct kobj_type ib_uverbs_dev_ktype = {
+	.release = ib_uverbs_release_dev,
+};
+
 static void ib_uverbs_release_event_file(struct kref *ref)
 {
 	struct ib_uverbs_event_file *file =
@@ -295,13 +299,19 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 	return context->device->dealloc_ucontext(context);
 }
 
+static void ib_uverbs_comp_dev(struct ib_uverbs_device *dev)
+{
+	complete(&dev->comp);
+}
+
 static void ib_uverbs_release_file(struct kref *ref)
 {
 	struct ib_uverbs_file *file =
 		container_of(ref, struct ib_uverbs_file, ref);
 
 	module_put(file->device->ib_dev->owner);
-	kref_put(&file->device->ref, ib_uverbs_release_dev);
+	if (atomic_dec_and_test(&file->device->refcount))
+		ib_uverbs_comp_dev(file->device);
 
 	kfree(file);
 }
@@ -665,9 +675,7 @@ static int ib_uverbs_open(struct inode *inode, struct file *filp)
 	int ret;
 
 	dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev);
-	if (dev)
-		kref_get(&dev->ref);
-	else
+	if (!atomic_inc_not_zero(&dev->refcount))
 		return -ENXIO;
 
 	if (!try_module_get(dev->ib_dev->owner)) {
@@ -688,6 +696,7 @@ static int ib_uverbs_open(struct inode *inode, struct file *filp)
 	mutex_init(&file->mutex);
 
 	filp->private_data = file;
+	kobject_get(&dev->kobj);
 
 	return nonseekable_open(inode, filp);
 
@@ -695,13 +704,16 @@ err_module:
 	module_put(dev->ib_dev->owner);
 
 err:
-	kref_put(&dev->ref, ib_uverbs_release_dev);
+	if (atomic_dec_and_test(&dev->refcount))
+		ib_uverbs_comp_dev(dev);
+
 	return ret;
 }
 
 static int ib_uverbs_close(struct inode *inode, struct file *filp)
 {
 	struct ib_uverbs_file *file = filp->private_data;
+	struct ib_uverbs_device *dev = file->device;
 
 	ib_uverbs_cleanup_ucontext(file, file->ucontext);
 
@@ -709,6 +721,7 @@ static int ib_uverbs_close(struct inode *inode, struct file *filp)
 		kref_put(&file->async_file->ref, ib_uverbs_release_event_file);
 
 	kref_put(&file->ref, ib_uverbs_release_file);
+	kobject_put(&dev->kobj);
 
 	return 0;
 }
@@ -804,10 +817,11 @@ static void ib_uverbs_add_one(struct ib_device *device)
 	if (!uverbs_dev)
 		return;
 
-	kref_init(&uverbs_dev->ref);
+	atomic_set(&uverbs_dev->refcount, 1);
 	init_completion(&uverbs_dev->comp);
 	uverbs_dev->xrcd_tree = RB_ROOT;
 	mutex_init(&uverbs_dev->xrcd_tree_mutex);
+	kobject_init(&uverbs_dev->kobj, &ib_uverbs_dev_ktype);
 
 	spin_lock(&map_lock);
 	devnum = find_first_zero_bit(dev_map, IB_UVERBS_MAX_DEVICES);
@@ -834,6 +848,7 @@ static void ib_uverbs_add_one(struct ib_device *device)
 	cdev_init(&uverbs_dev->cdev, NULL);
 	uverbs_dev->cdev.owner = THIS_MODULE;
 	uverbs_dev->cdev.ops = device->mmap ? &uverbs_mmap_fops : &uverbs_fops;
+	uverbs_dev->cdev.kobj.parent = &uverbs_dev->kobj;
 	kobject_set_name(&uverbs_dev->cdev.kobj, "uverbs%d", uverbs_dev->devnum);
 	if (cdev_add(&uverbs_dev->cdev, base, 1))
 		goto err_cdev;
@@ -864,9 +879,10 @@ err_cdev:
 		clear_bit(devnum, overflow_map);
 
 err:
-	kref_put(&uverbs_dev->ref, ib_uverbs_release_dev);
+	if (atomic_dec_and_test(&uverbs_dev->refcount))
+		ib_uverbs_comp_dev(uverbs_dev);
 	wait_for_completion(&uverbs_dev->comp);
-	kfree(uverbs_dev);
+	kobject_put(&uverbs_dev->kobj);
 	return;
 }
 
@@ -886,9 +902,10 @@ static void ib_uverbs_remove_one(struct ib_device *device)
 	else
 		clear_bit(uverbs_dev->devnum - IB_UVERBS_MAX_DEVICES, overflow_map);
 
-	kref_put(&uverbs_dev->ref, ib_uverbs_release_dev);
+	if (atomic_dec_and_test(&uverbs_dev->refcount))
+		ib_uverbs_comp_dev(uverbs_dev);
 	wait_for_completion(&uverbs_dev->comp);
-	kfree(uverbs_dev);
+	kobject_put(&uverbs_dev->kobj);
 }
 
 static char *uverbs_devnode(struct device *dev, umode_t *mode)
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] IB/mlx4: Forbid using sysfs to change RoCE pkeys
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (13 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/uverbs: Fix race between ib_uverbs_open and remove_one Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/mlx4: Use correct SL on AH query under RoCE Jiri Slaby
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Jack Morgenstein, Or Gerlitz, Doug Ledford, Jiri Slaby

From: Jack Morgenstein <jackm@dev.mellanox.co.il>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit 2b135db3e81301d0452e6aa107349abe67b097d6 upstream.

The pkey mapping for RoCE must remain the default mapping:
VFs:
  virtual index 0 = mapped to real index 0 (0xFFFF)
  All other indices: mapped to a real pkey index containing an
                      invalid pkey.
PF:
  virtual index i = real index i.

Don't allow users to change these mappings using files found in
sysfs.

Fixes: c1e7e466120b ('IB/mlx4: Add iov directory in sysfs under the ib device')
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 drivers/infiniband/hw/mlx4/sysfs.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx4/sysfs.c b/drivers/infiniband/hw/mlx4/sysfs.c
index 97516eb363b7..c5ce4082fdc7 100644
--- a/drivers/infiniband/hw/mlx4/sysfs.c
+++ b/drivers/infiniband/hw/mlx4/sysfs.c
@@ -563,6 +563,8 @@ static int add_port(struct mlx4_ib_dev *dev, int port_num, int slave)
 	struct mlx4_port *p;
 	int i;
 	int ret;
+	int is_eth = rdma_port_get_link_layer(&dev->ib_dev, port_num) ==
+			IB_LINK_LAYER_ETHERNET;
 
 	p = kzalloc(sizeof *p, GFP_KERNEL);
 	if (!p)
@@ -580,7 +582,8 @@ static int add_port(struct mlx4_ib_dev *dev, int port_num, int slave)
 
 	p->pkey_group.name  = "pkey_idx";
 	p->pkey_group.attrs =
-		alloc_group_attrs(show_port_pkey, store_port_pkey,
+		alloc_group_attrs(show_port_pkey,
+				  is_eth ? NULL : store_port_pkey,
 				  dev->dev->caps.pkey_table_len[port_num]);
 	if (!p->pkey_group.attrs)
 		goto err_alloc;
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] IB/mlx4: Use correct SL on AH query under RoCE
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (14 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/mlx4: Forbid using sysfs to change RoCE pkeys Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] stmmac: fix check for phydev being open Jiri Slaby
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Noa Osherovich, Shani Michaeli, Or Gerlitz, Doug Ledford,
	Jiri Slaby

From: Noa Osherovich <noaos@mellanox.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit 5e99b139f1b68acd65e36515ca347b03856dfb5a upstream.

The mlx4 IB driver implementation for ib_query_ah used a wrong offset
(28 instead of 29) when the link type is Ethernet. Fixed to use the correct one.
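
For illustration only (inferred from the shifts in the diff below): the SL
sits in the top bits of the big-endian sl_tclass_flowlabel word, three bits
for the Ethernet AV and four for the IB AV:

        if (ll == IB_LINK_LAYER_ETHERNET)
                /* eth AV: SL in bits 31..29 */
                ah_attr->sl = be32_to_cpu(ah->av.eth.sl_tclass_flowlabel) >> 29;
        else
                /* ib AV: SL in bits 31..28 */
                ah_attr->sl = be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 28;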

Fixes: fa417f7b520e ('IB/mlx4: Add support for IBoE')
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 drivers/infiniband/hw/mlx4/ah.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx4/ah.c b/drivers/infiniband/hw/mlx4/ah.c
index a251becdaa98..890c23b3d714 100644
--- a/drivers/infiniband/hw/mlx4/ah.c
+++ b/drivers/infiniband/hw/mlx4/ah.c
@@ -169,9 +169,13 @@ int mlx4_ib_query_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr)
 	enum rdma_link_layer ll;
 
 	memset(ah_attr, 0, sizeof *ah_attr);
-	ah_attr->sl = be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 28;
 	ah_attr->port_num = be32_to_cpu(ah->av.ib.port_pd) >> 24;
 	ll = rdma_port_get_link_layer(ibah->device, ah_attr->port_num);
+	if (ll == IB_LINK_LAYER_ETHERNET)
+		ah_attr->sl = be32_to_cpu(ah->av.eth.sl_tclass_flowlabel) >> 29;
+	else
+		ah_attr->sl = be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 28;
+
 	ah_attr->dlid = ll == IB_LINK_LAYER_INFINIBAND ? be16_to_cpu(ah->av.ib.dlid) : 0;
 	if (ah->av.ib.stat_rate)
 		ah_attr->static_rate = ah->av.ib.stat_rate - MLX4_STAT_RATE_OFFSET;
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] stmmac: fix check for phydev being open
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (15 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/mlx4: Use correct SL on AH query under RoCE Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 11:28   ` Sergei Shtylyov
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] stmmac: troubleshoot unexpected bits in des0 & des1 Jiri Slaby
                   ` (12 subsequent siblings)
  29 siblings, 1 reply; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable
  Cc: Alexey Brodkin, Sergei Shtylyov, Giuseppe Cavallaro, linux-kernel,
	David Miller, Alexey Brodkin, Jiri Slaby

From: Alexey Brodkin <Alexey.Brodkin@synopsys.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit dfc50fcaad574e5c8c85cbc83eca1426b2413fa4 upstream.

The current check of phydev with IS_ERR(phydev) makes little sense
because of_phy_connect() returns NULL on failure instead of an error value.

Still, for checking the result of phy_connect(), IS_ERR() makes perfect sense.

So let's use the combined check IS_ERR_OR_NULL(), which covers both cases.
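
For illustration only, a sketch of why the combined check is needed, with
comments marking which caller produces which failure value:

        /* phy_connect() fails with an ERR_PTR-encoded errno, while
         * of_phy_connect() fails with plain NULL, so neither IS_ERR()
         * nor a NULL check alone covers both callers.
         */
        if (IS_ERR_OR_NULL(phydev)) {
                if (!phydev)
                        return -ENODEV;         /* NULL: no errno encoded */
                return PTR_ERR(phydev);         /* ERR_PTR: propagate it */
        }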

Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: linux-kernel@vger.kernel.org
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 8d4ccd35a016..14c0d31c10ad 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -801,8 +801,11 @@ static int stmmac_init_phy(struct net_device *dev)
 
 	phydev = phy_connect(dev, phy_id_fmt, &stmmac_adjust_link, interface);
 
-	if (IS_ERR(phydev)) {
+	if (IS_ERR_OR_NULL(phydev)) {
 		pr_err("%s: Could not attach to PHY\n", dev->name);
+		if (!phydev)
+			return -ENODEV;
+
 		return PTR_ERR(phydev);
 	}
 
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] stmmac: troubleshoot unexpected bits in des0 & des1
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (16 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] stmmac: fix check for phydev being open Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] hfs,hfsplus: cache pages correctly between bnode_create and bnode_free Jiri Slaby
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable
  Cc: Alexey Brodkin, Alexey Brodkin, Giuseppe Cavallaro, arc-linux-dev,
	linux-kernel, David Miller, Jiri Slaby

From: Alexey Brodkin <Alexey.Brodkin@synopsys.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit f1590670ce069eefeb93916391a67643e6ad1630 upstream.

The current implementation of the descriptor init procedure only takes
care of setting/clearing the ownership flag in the "des0"/"des1"
fields, while it is perfectly possible to get unexpected bits
set because of the following factors:

 [1] On driver probe underlying memory allocated with
     dma_alloc_coherent() might not be zeroed and so
     it will be filled with garbage.

 [2] During driver operation some bits could be set by SD/MMC
     controller (for example error flags etc).

Unexpected and/or randomly set flags in the "des0"/"des1"
fields may lead to unpredictable behavior of the GMAC DMA block.

This change addresses both items above with:

 [1] Use of dma_zalloc_coherent() instead of simple
     dma_alloc_coherent() to make sure allocated memory is
     zeroed. That shouldn't affect performance because
     this allocation only happens once on driver probe.

 [2] Do explicit zeroing of both "des0" and "des1" fields
     of all buffer descriptors during initialization of
     DMA transfer.

And while at it, the indentation of the dma_free_coherent()
counterpart is fixed as well.
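
For illustration only, a condensed sketch of the two changes (the diff below
is authoritative): allocate the rings zeroed, and wipe the whole flag word
before setting the bits the driver actually owns:

        /* probe time: descriptor rings come back zero-filled */
        priv->dma_rx = dma_zalloc_coherent(priv->device,
                                           rxsize * sizeof(struct dma_desc),
                                           &priv->dma_rx_phy, GFP_KERNEL);

        /* (re)init time: clear des0/des1 in one go, then set what is needed */
        p->des01.all_flags = 0;
        p->des01.erx.own = 1;
        p->des01.erx.buffer1_size = BUF_SIZE_8KiB - 1;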

Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: arc-linux-dev@synopsys.com
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Cc: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 drivers/net/ethernet/stmicro/stmmac/descs.h       |  2 ++
 drivers/net/ethernet/stmicro/stmmac/enh_desc.c    |  3 +-
 drivers/net/ethernet/stmicro/stmmac/norm_desc.c   |  3 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 44 +++++++++++------------
 4 files changed, 28 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/descs.h b/drivers/net/ethernet/stmicro/stmmac/descs.h
index ad3996038018..799c2929c536 100644
--- a/drivers/net/ethernet/stmicro/stmmac/descs.h
+++ b/drivers/net/ethernet/stmicro/stmmac/descs.h
@@ -158,6 +158,8 @@ struct dma_desc {
 			u32 buffer2_size:13;
 			u32 reserved4:3;
 		} etx;		/* -- enhanced -- */
+
+		u64 all_flags;
 	} des01;
 	unsigned int des2;
 	unsigned int des3;
diff --git a/drivers/net/ethernet/stmicro/stmmac/enh_desc.c b/drivers/net/ethernet/stmicro/stmmac/enh_desc.c
index 7e6628a91514..59fb7f69841b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/enh_desc.c
+++ b/drivers/net/ethernet/stmicro/stmmac/enh_desc.c
@@ -240,6 +240,7 @@ static int enh_desc_get_rx_status(void *data, struct stmmac_extra_stats *x,
 static void enh_desc_init_rx_desc(struct dma_desc *p, int disable_rx_ic,
 				  int mode, int end)
 {
+	p->des01.all_flags = 0;
 	p->des01.erx.own = 1;
 	p->des01.erx.buffer1_size = BUF_SIZE_8KiB - 1;
 
@@ -254,7 +255,7 @@ static void enh_desc_init_rx_desc(struct dma_desc *p, int disable_rx_ic,
 
 static void enh_desc_init_tx_desc(struct dma_desc *p, int mode, int end)
 {
-	p->des01.etx.own = 0;
+	p->des01.all_flags = 0;
 	if (mode == STMMAC_CHAIN_MODE)
 		ehn_desc_tx_set_on_chain(p, end);
 	else
diff --git a/drivers/net/ethernet/stmicro/stmmac/norm_desc.c b/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
index 35ad4f427ae2..48c3456445b2 100644
--- a/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
+++ b/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
@@ -123,6 +123,7 @@ static int ndesc_get_rx_status(void *data, struct stmmac_extra_stats *x,
 static void ndesc_init_rx_desc(struct dma_desc *p, int disable_rx_ic, int mode,
 			       int end)
 {
+	p->des01.all_flags = 0;
 	p->des01.rx.own = 1;
 	p->des01.rx.buffer1_size = BUF_SIZE_2KiB - 1;
 
@@ -137,7 +138,7 @@ static void ndesc_init_rx_desc(struct dma_desc *p, int disable_rx_ic, int mode,
 
 static void ndesc_init_tx_desc(struct dma_desc *p, int mode, int end)
 {
-	p->des01.tx.own = 0;
+	p->des01.all_flags = 0;
 	if (mode == STMMAC_CHAIN_MODE)
 		ndesc_tx_set_on_chain(p, end);
 	else
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 14c0d31c10ad..3fd4b00c5b5e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1006,41 +1006,41 @@ static int init_dma_desc_rings(struct net_device *dev)
 			 txsize, rxsize, bfsize);
 
 	if (priv->extend_desc) {
-		priv->dma_erx = dma_alloc_coherent(priv->device, rxsize *
-						   sizeof(struct
-							  dma_extended_desc),
-						   &priv->dma_rx_phy,
-						   GFP_KERNEL);
+		priv->dma_erx = dma_zalloc_coherent(priv->device, rxsize *
+						    sizeof(struct
+							   dma_extended_desc),
+						    &priv->dma_rx_phy,
+						    GFP_KERNEL);
 		if (!priv->dma_erx)
 			goto err_dma;
 
-		priv->dma_etx = dma_alloc_coherent(priv->device, txsize *
-						   sizeof(struct
-							  dma_extended_desc),
-						   &priv->dma_tx_phy,
-						   GFP_KERNEL);
+		priv->dma_etx = dma_zalloc_coherent(priv->device, txsize *
+						    sizeof(struct
+							   dma_extended_desc),
+						    &priv->dma_tx_phy,
+						    GFP_KERNEL);
 		if (!priv->dma_etx) {
 			dma_free_coherent(priv->device, priv->dma_rx_size *
-					sizeof(struct dma_extended_desc),
-					priv->dma_erx, priv->dma_rx_phy);
+					  sizeof(struct dma_extended_desc),
+					  priv->dma_erx, priv->dma_rx_phy);
 			goto err_dma;
 		}
 	} else {
-		priv->dma_rx = dma_alloc_coherent(priv->device, rxsize *
-						  sizeof(struct dma_desc),
-						  &priv->dma_rx_phy,
-						  GFP_KERNEL);
+		priv->dma_rx = dma_zalloc_coherent(priv->device, rxsize *
+						   sizeof(struct dma_desc),
+						   &priv->dma_rx_phy,
+						   GFP_KERNEL);
 		if (!priv->dma_rx)
 			goto err_dma;
 
-		priv->dma_tx = dma_alloc_coherent(priv->device, txsize *
-						  sizeof(struct dma_desc),
-						  &priv->dma_tx_phy,
-						  GFP_KERNEL);
+		priv->dma_tx = dma_zalloc_coherent(priv->device, txsize *
+						   sizeof(struct dma_desc),
+						   &priv->dma_tx_phy,
+						   GFP_KERNEL);
 		if (!priv->dma_tx) {
 			dma_free_coherent(priv->device, priv->dma_rx_size *
-					sizeof(struct dma_desc),
-					priv->dma_rx, priv->dma_rx_phy);
+					  sizeof(struct dma_desc),
+					  priv->dma_rx, priv->dma_rx_phy);
 			goto err_dma;
 		}
 	}
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] hfs,hfsplus: cache pages correctly between bnode_create and bnode_free
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (17 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] stmmac: troubleshoot unexpected bits in des0 & des1 Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] ip6_gre: release cached dst on tunnel removal Jiri Slaby
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable
  Cc: Hin-Tak Leung, Sergei Antonov, Al Viro, Christoph Hellwig,
	Vyacheslav Dubeyko, Sougata Santra, Andrew Morton, Linus Torvalds,
	Jiri Slaby

From: Hin-Tak Leung <htl10@users.sourceforge.net>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit 7cb74be6fd827e314f81df3c5889b87e4c87c569 upstream.

Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
hfs_bnode_find() for finding or creating pages corresponding to an inode)
are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
and should not be page_cache_release()'ed until hfs_bnode_free().
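
For illustration only, a minimal sketch of the corrected page lifetime
(simplified from the hfs variant in the diff below):

        /* __hfs_bnode_create(): take a reference per page and keep it */
        page = read_mapping_page(mapping, block, NULL);
        if (IS_ERR(page))
                goto fail;
        node->page[i] = page;           /* reference held until free */

        /* hfs_bnode_free(): drop the references taken at create time */
        for (i = 0; i < node->tree->pages_per_bnode; i++)
                if (node->page[i])
                        page_cache_release(node->page[i]);
        kfree(node);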

This patch fixes a problem I first saw in July 2012: merely running "du"
on a large hfsplus-mounted directory a few times on a reasonably loaded
system would get the hfsplus driver all confused and complaining about
B-tree inconsistencies, and generate a "BUG: Bad page state".  Most
recently, I can generate this problem on up-to-date Fedora 22 with shipped
kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
lightly-used QEMU VM image of the full Mac OS X 10.9:

$ df -i / /home /mnt
Filesystem                  Inodes   IUsed      IFree IUse% Mounted on
/dev/mapper/fedora-root    3276800  551665    2725135   17% /
/dev/mapper/fedora-home   52879360  716221   52163139    2% /home
/dev/nbd0p2             4294967295 1387818 4293579477    1% /mnt

After applying the patch, I was able to run "du /" (60+ times) and "du
/mnt" (150+ times) continuously and simultaneously for 6+ hours.

There are many reports of the hfsplus driver getting confused under load
and generating "BUG: Bad page state" or other similar issues over the
years.  [1]

The unpatched code [2] has always been wrong since it entered the kernel
tree.  The only reason why it gets away with it is that the
kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
the kernel has not had a chance to reuse the memory for something else,
most of the time.

The current RW driver appears to have followed the design and development
of the earlier read-only hfsplus driver [3], whereby version 0.1 (Dec
2001) had a B-tree node-centric approach to
read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
migrating towards version 0.2 (June 2002) of caching and releasing pages
per inode extents.  When the current RW code first entered the kernel [2]
in 2005, there was a REF_PAGES conditional (and "//" commented-out code)
to switch between B-node-centric paging and inode-centric paging.  There
was a mistake with the direction of one of the REF_PAGES conditionals in
__hfs_bnode_create().  In a subsequent "remove debug code" commit [4], the
read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
removed, but a page_cache_release() was mistakenly left in (propagating
the "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out
page_cache_release() in bnode_release() (which should be spanned by
!REF_PAGES) was never enabled.

References:
[1]:
Michael Fox, Apr 2013
http://www.spinics.net/lists/linux-fsdevel/msg63807.html
("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")

Sasha Levin, Feb 2015
http://lkml.org/lkml/2015/2/20/85 ("use after free")

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
https://bugzilla.kernel.org/show_bug.cgi?id=42342
https://bugzilla.kernel.org/show_bug.cgi?id=63841
https://bugzilla.kernel.org/show_bug.cgi?id=78761

[2]:
http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
Author: Andrew Morton <akpm@osdl.org>
Date:   Wed Feb 25 16:17:36 2004 -0800

    [PATCH] HFS rewrite

http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd

commit 91556682e0bf004d98a529bf829d339abb98bbbd
Author: Andrew Morton <akpm@osdl.org>
Date:   Wed Feb 25 16:17:48 2004 -0800

    [PATCH] HFS+ support

[3]:
http://sourceforge.net/projects/linux-hfsplus/

http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/

http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
fs/hfsplus/bnode.c?r1=1.4&r2=1.5

Date:   Thu Jun 6 09:45:14 2002 +0000
Use buffer cache instead of page cache in bnode.c. Cache inode extents.

[4]:
http://git.kernel.org/cgit/linux/kernel/git/\
stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d

commit a5e3985fa014029eb6795664c704953720cc7f7d
Author: Roman Zippel <zippel@linux-m68k.org>
Date:   Tue Sep 6 15:18:47 2005 -0700

[PATCH] hfs: remove debug code

Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Sougata Santra <sougata@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 fs/hfs/bnode.c     | 9 ++++-----
 fs/hfsplus/bnode.c | 3 ---
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/hfs/bnode.c b/fs/hfs/bnode.c
index d3fa6bd9503e..221719eac5de 100644
--- a/fs/hfs/bnode.c
+++ b/fs/hfs/bnode.c
@@ -288,7 +288,6 @@ static struct hfs_bnode *__hfs_bnode_create(struct hfs_btree *tree, u32 cnid)
 			page_cache_release(page);
 			goto fail;
 		}
-		page_cache_release(page);
 		node->page[i] = page;
 	}
 
@@ -398,11 +397,11 @@ node_error:
 
 void hfs_bnode_free(struct hfs_bnode *node)
 {
-	//int i;
+	int i;
 
-	//for (i = 0; i < node->tree->pages_per_bnode; i++)
-	//	if (node->page[i])
-	//		page_cache_release(node->page[i]);
+	for (i = 0; i < node->tree->pages_per_bnode; i++)
+		if (node->page[i])
+			page_cache_release(node->page[i]);
 	kfree(node);
 }
 
diff --git a/fs/hfsplus/bnode.c b/fs/hfsplus/bnode.c
index 11c860204520..bedfe5f7d332 100644
--- a/fs/hfsplus/bnode.c
+++ b/fs/hfsplus/bnode.c
@@ -456,7 +456,6 @@ static struct hfs_bnode *__hfs_bnode_create(struct hfs_btree *tree, u32 cnid)
 			page_cache_release(page);
 			goto fail;
 		}
-		page_cache_release(page);
 		node->page[i] = page;
 	}
 
@@ -568,13 +567,11 @@ node_error:
 
 void hfs_bnode_free(struct hfs_bnode *node)
 {
-#if 0
 	int i;
 
 	for (i = 0; i < node->tree->pages_per_bnode; i++)
 		if (node->page[i])
 			page_cache_release(node->page[i]);
-#endif
 	kfree(node);
 }
 
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] ip6_gre: release cached dst on tunnel removal
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (18 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] hfs,hfsplus: cache pages correctly between bnode_create and bnode_free Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] usbnet: Get EVENT_NO_RUNTIME_PM bit before it is cleared Jiri Slaby
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable
  Cc: huaibin Wang, Dmitry Kozlov, Nicolas Dichtel, David S . Miller,
	Jiri Slaby

From: huaibin Wang <huaibin.wang@6wind.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

[ Upstream commit d4257295ba1b389c693b79de857a96e4b7cd8ac0 ]

When a tunnel is deleted, the cached dst entry should be released.

This problem may prevent the removal of a netns (seen with a x-netns IPv6
gre tunnel):
  unregister_netdevice: waiting for lo to become free. Usage count = 3

CC: Dmitry Kozlov <xeb@mail.ru>
Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
Signed-off-by: huaibin Wang <huaibin.wang@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 net/ipv6/ip6_gre.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 7d640f276e87..b2e4c77d9a8c 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -360,6 +360,7 @@ static void ip6gre_tunnel_uninit(struct net_device *dev)
 	struct ip6gre_net *ign = net_generic(net, ip6gre_net_id);
 
 	ip6gre_tunnel_unlink(ign, netdev_priv(dev));
+	ip6_tnl_dst_reset(netdev_priv(dev));
 	dev_put(dev);
 }
 
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] usbnet: Get EVENT_NO_RUNTIME_PM bit before it is cleared
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (19 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] ip6_gre: release cached dst on tunnel removal Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] ipv6: fix exthdrs offload registration in out_rt path Jiri Slaby
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Eugene Shatokhin, David S . Miller, Jiri Slaby

From: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

[ Upstream commit f50791ac1aca1ac1b0370d62397b43e9f831421a ]

The EVENT_NO_RUNTIME_PM bit of dev->flags needs to be checked in
usbnet_stop(), but its value should be read before it is cleared
when dev->flags is set to 0.
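
For illustration only, a sketch of the required ordering in usbnet_stop()
(the flag clearing in between is paraphrased):

        /* sample (and clear) the bit while dev->flags is still valid */
        mpn = !test_and_clear_bit(EVENT_NO_RUNTIME_PM, &dev->flags);

        dev->flags = 0;         /* later on, workers are turned into NOPs */

        if (info->manage_power && mpn)
                info->manage_power(dev, 0);
        else
                usb_autopm_put_interface(dev->intf);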

The problem was spotted and the fix was provided by
Oliver Neukum <oneukum@suse.de>.

Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
Acked-by: Oliver Neukum <oneukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 drivers/net/usb/usbnet.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 1d4da74595f9..3b7b7b2eba1b 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -779,7 +779,7 @@ int usbnet_stop (struct net_device *net)
 {
 	struct usbnet		*dev = netdev_priv(net);
 	struct driver_info	*info = dev->driver_info;
-	int			retval, pm;
+	int			retval, pm, mpn;
 
 	clear_bit(EVENT_DEV_OPEN, &dev->flags);
 	netif_stop_queue (net);
@@ -810,6 +810,8 @@ int usbnet_stop (struct net_device *net)
 
 	usbnet_purge_paused_rxq(dev);
 
+	mpn = !test_and_clear_bit(EVENT_NO_RUNTIME_PM, &dev->flags);
+
 	/* deferred work (task, timer, softirq) must also stop.
 	 * can't flush_scheduled_work() until we drop rtnl (later),
 	 * else workers could deadlock; so make workers a NOP.
@@ -820,8 +822,7 @@ int usbnet_stop (struct net_device *net)
 	if (!pm)
 		usb_autopm_put_interface(dev->intf);
 
-	if (info->manage_power &&
-	    !test_and_clear_bit(EVENT_NO_RUNTIME_PM, &dev->flags))
+	if (info->manage_power && mpn)
 		info->manage_power(dev, 0);
 	else
 		usb_autopm_put_interface(dev->intf);
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] ipv6: fix exthdrs offload registration in out_rt path
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (20 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] usbnet: Get EVENT_NO_RUNTIME_PM bit before it is cleared Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] net/ipv6: Correct PIM6 mrt_lock handling Jiri Slaby
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Daniel Borkmann, David S . Miller, Jiri Slaby

From: Daniel Borkmann <daniel@iogearbox.net>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

[ Upstream commit e41b0bedba0293b9e1e8d1e8ed553104b9693656 ]

We previously register the IPPROTO_ROUTING offload under inet6_add_offload(),
but in the error path, we try to unregister it with inet_del_offload(). This
doesn't seem correct, it should actually be inet6_del_offload(), also
ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
also uses rthdr_offload twice), but it got removed entirely later on.

Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 net/ipv6/exthdrs_offload.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/exthdrs_offload.c b/net/ipv6/exthdrs_offload.c
index 447a7fbd1bb6..f5e2ba1c18bf 100644
--- a/net/ipv6/exthdrs_offload.c
+++ b/net/ipv6/exthdrs_offload.c
@@ -36,6 +36,6 @@ out:
 	return ret;
 
 out_rt:
-	inet_del_offload(&rthdr_offload, IPPROTO_ROUTING);
+	inet6_del_offload(&rthdr_offload, IPPROTO_ROUTING);
 	goto out;
 }
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] net/ipv6: Correct PIM6 mrt_lock handling
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (21 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] ipv6: fix exthdrs offload registration in out_rt path Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] netlink, mmap: transform mmap skb into full skb on taps Jiri Slaby
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Richard Laing, David S . Miller, Jiri Slaby

From: Richard Laing <richard.laing@alliedtelesis.co.nz>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

[ Upstream commit 25b4a44c19c83d98e8c0807a7ede07c1f28eab8b ]

In the IPv6 multicast routing code the mrt_lock was not being released
correctly in the MFC iterator; as a result, adding or deleting a MIF would
cause a hang because the mrt_lock could not be acquired.

This fix is a copy of the code for the IPv4 case and ensures that the lock
is released correctly.
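
For illustration only: the seq_file stop callback has to release exactly the
lock that start()/next() left held, which depends on which cache the
iterator currently points into:

        if (it->cache == &mrt->mfc6_unres_queue)
                spin_unlock_bh(&mfc_unres_lock);        /* unresolved queue */
        else if (it->cache == &mrt->mfc6_cache_array[it->ct])
                read_unlock(&mrt_lock);                 /* resolved cache */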

Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 net/ipv6/ip6mr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 8737400af0a0..821d8dfb2ddd 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -552,7 +552,7 @@ static void ipmr_mfc_seq_stop(struct seq_file *seq, void *v)
 
 	if (it->cache == &mrt->mfc6_unres_queue)
 		spin_unlock_bh(&mfc_unres_lock);
-	else if (it->cache == mrt->mfc6_cache_array)
+	else if (it->cache == &mrt->mfc6_cache_array[it->ct])
 		read_unlock(&mrt_lock);
 }
 
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] netlink, mmap: transform mmap skb into full skb on taps
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (22 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] net/ipv6: Correct PIM6 mrt_lock handling Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] sctp: fix race on protocol/netns initialization Jiri Slaby
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Daniel Borkmann, David S . Miller, Jiri Slaby

From: Daniel Borkmann <daniel@iogearbox.net>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

[ Upstream commit 1853c949646005b5959c483becde86608f548f24 ]

Ken-ichirou reported that running netlink in mmap mode for receive in
combination with nlmon will throw a NULL pointer dereference in
__kfree_skb() on nlmon_xmit(); in my case I can also trigger an "unable
to handle kernel paging request". The problem is the skb_clone() in
__netlink_deliver_tap_skb() for skbs that are mmaped.

I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
skb has it pointed to netlink_skb_destructor(), set in the handler
netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
that in such cases, __kfree_skb() doesn't perform a skb_release_data()
via skb_release_all(), where skb->head is possibly being freed through
kfree(head) into slab allocator, although netlink mmap skb->head points
to the mmap buffer. Similarly, the same has to be done also for large
netlink skbs where the data area is vmalloced. Therefore, as discussed,
make a copy for these rather rare cases for now. This fixes the issue
on my and Ken-ichirou's test-cases.
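
For illustration only, the resulting tap delivery logic in a nutshell: skbs
whose head lives in an mmap ring or a vmalloc area get a deep copy, everything
else keeps the cheap clone:

        if (netlink_skb_is_mmaped(skb) || is_vmalloc_addr(skb->head))
                nskb = netlink_to_full_skb(skb, GFP_ATOMIC);    /* deep copy */
        else
                nskb = skb_clone(skb, GFP_ATOMIC);              /* shares head */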

Reference: http://thread.gmane.org/gmane.linux.network/371129
Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
Reported-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 net/netlink/af_netlink.c | 30 +++++++++++++++++++++++-------
 net/netlink/af_netlink.h |  9 +++++++++
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 22e0f478a2a3..10805856dfba 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -115,6 +115,24 @@ static inline struct hlist_head *nl_portid_hashfn(struct nl_portid_hash *hash, u
 	return &hash->table[jhash_1word(portid, hash->rnd) & hash->mask];
 }
 
+static struct sk_buff *netlink_to_full_skb(const struct sk_buff *skb,
+					   gfp_t gfp_mask)
+{
+	unsigned int len = skb_end_offset(skb);
+	struct sk_buff *new;
+
+	new = alloc_skb(len, gfp_mask);
+	if (new == NULL)
+		return NULL;
+
+	NETLINK_CB(new).portid = NETLINK_CB(skb).portid;
+	NETLINK_CB(new).dst_group = NETLINK_CB(skb).dst_group;
+	NETLINK_CB(new).creds = NETLINK_CB(skb).creds;
+
+	memcpy(skb_put(new, len), skb->data, len);
+	return new;
+}
+
 int netlink_add_tap(struct netlink_tap *nt)
 {
 	if (unlikely(nt->dev->type != ARPHRD_NETLINK))
@@ -200,7 +218,11 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
 	int ret = -ENOMEM;
 
 	dev_hold(dev);
-	nskb = skb_clone(skb, GFP_ATOMIC);
+
+	if (netlink_skb_is_mmaped(skb) || is_vmalloc_addr(skb->head))
+		nskb = netlink_to_full_skb(skb, GFP_ATOMIC);
+	else
+		nskb = skb_clone(skb, GFP_ATOMIC);
 	if (nskb) {
 		nskb->dev = dev;
 		nskb->protocol = htons((u16) sk->sk_protocol);
@@ -263,11 +285,6 @@ static void netlink_rcv_wake(struct sock *sk)
 }
 
 #ifdef CONFIG_NETLINK_MMAP
-static bool netlink_skb_is_mmaped(const struct sk_buff *skb)
-{
-	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
-}
-
 static bool netlink_rx_is_mmaped(struct sock *sk)
 {
 	return nlk_sk(sk)->rx_ring.pg_vec != NULL;
@@ -819,7 +836,6 @@ static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
 }
 
 #else /* CONFIG_NETLINK_MMAP */
-#define netlink_skb_is_mmaped(skb)	false
 #define netlink_rx_is_mmaped(sk)	false
 #define netlink_tx_is_mmaped(sk)	false
 #define netlink_mmap			sock_no_mmap
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index acbd774eeb7c..dcc89c74b514 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -65,6 +65,15 @@ struct nl_portid_hash {
 	u32			rnd;
 };
 
+static inline bool netlink_skb_is_mmaped(const struct sk_buff *skb)
+{
+#ifdef CONFIG_NETLINK_MMAP
+	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
+#else
+	return false;
+#endif /* CONFIG_NETLINK_MMAP */
+}
+
 struct netlink_table {
 	struct nl_portid_hash	hash;
 	struct hlist_head	mc_list;
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] sctp: fix race on protocol/netns initialization
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (23 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] netlink, mmap: transform mmap skb into full skb on taps Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] openvswitch: Zero flows on allocation Jiri Slaby
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Marcelo Ricardo Leitner, Vlad Yasevich, David S . Miller,
	Jiri Slaby

From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

[ Upstream commit 8e2d61e0aed2b7c4ecb35844fe07e0b2b762dee4 ]

Consider that the sctp module is unloaded and is being requested because a
user is creating an sctp socket.

During initialization, sctp will add the new protocol type and then
initialize pernet subsys:

        status = sctp_v4_protosw_init();
        if (status)
                goto err_protosw_init;

        status = sctp_v6_protosw_init();
        if (status)
                goto err_v6_protosw_init;

        status = register_pernet_subsys(&sctp_net_ops);

The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
is possible for userspace to create SCTP sockets as if the module were
already fully loaded. If that happens, one of the possible effects is
that we will have readers of the net->sctp.local_addr_list list earlier
than expected, and sctp_net_init() does not take precautions while
dealing with that list, leading to a potential panic but not limited to
that, as sctp_sock_init() will copy a bunch of blank/partially
initialized values from net->sctp.

The race happens like this:

     CPU 0                           |  CPU 1
  socket()                           |
   __sock_create                     | socket()
    inet_create                      |  __sock_create
     list_for_each_entry_rcu(        |
        answer, &inetsw[sock->type], |
        list) {                      |   inet_create
      /* no hits */                  |
     if (unlikely(err)) {            |
      ...                            |
      request_module()               |
      /* socket creation is blocked  |
       * the module is fully loaded  |
       */                            |
       sctp_init                     |
        sctp_v4_protosw_init         |
         inet_register_protosw       |
          list_add_rcu(&p->list,     |
                       last_perm);   |
                                     |  list_for_each_entry_rcu(
                                     |     answer, &inetsw[sock->type],
        sctp_v6_protosw_init         |     list) {
                                     |     /* hit, so assumes protocol
                                     |      * is already loaded
                                     |      */
                                     |  /* socket creation continues
                                     |   * before netns is initialized
                                     |   */
        register_pernet_subsys       |

Simply inverting the initialization order between
register_pernet_subsys() and sctp_v4_protosw_init() is not possible
because register_pernet_subsys() will create a control sctp socket, so
the protocol must be already visible by then. Deferring the socket
creation to a workqueue is not good, especially because we lose the
ability to handle its errors.

So, as suggested by Vlad, the fix is to split netns initialization in
two moments: defaults and control socket, so that the defaults are
already loaded by when we register the protocol, while control socket
initialization is kept at the same moment it is today.
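
For illustration only, the resulting order in sctp_init() (condensed from the
diff below): netns defaults first, then the protosw entries that make the
protocol visible, and the per-netns control socket last:

        status = register_pernet_subsys(&sctp_defaults_ops);
        if (status)
                goto err_register_defaults;

        status = sctp_v4_protosw_init();        /* sockets may be created from here on */
        if (status)
                goto err_protosw_init;

        status = sctp_v6_protosw_init();
        if (status)
                goto err_v6_protosw_init;

        status = register_pernet_subsys(&sctp_ctrlsock_ops);
        if (status)
                goto err_register_ctrlsock;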

Fixes: 4db67e808640 ("sctp: Make the address lists per network namespace")
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 net/sctp/protocol.c | 64 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 41 insertions(+), 23 deletions(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 2b216f1f6b23..599757e0c23a 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1167,7 +1167,7 @@ static void sctp_v4_del_protocol(void)
 	unregister_inetaddr_notifier(&sctp_inetaddr_notifier);
 }
 
-static int __net_init sctp_net_init(struct net *net)
+static int __net_init sctp_defaults_init(struct net *net)
 {
 	int status;
 
@@ -1260,12 +1260,6 @@ static int __net_init sctp_net_init(struct net *net)
 
 	sctp_dbg_objcnt_init(net);
 
-	/* Initialize the control inode/socket for handling OOTB packets.  */
-	if ((status = sctp_ctl_sock_init(net))) {
-		pr_err("Failed to initialize the SCTP control sock\n");
-		goto err_ctl_sock_init;
-	}
-
 	/* Initialize the local address list. */
 	INIT_LIST_HEAD(&net->sctp.local_addr_list);
 	spin_lock_init(&net->sctp.local_addr_lock);
@@ -1281,9 +1275,6 @@ static int __net_init sctp_net_init(struct net *net)
 
 	return 0;
 
-err_ctl_sock_init:
-	sctp_dbg_objcnt_exit(net);
-	sctp_proc_exit(net);
 err_init_proc:
 	cleanup_sctp_mibs(net);
 err_init_mibs:
@@ -1292,15 +1283,12 @@ err_sysctl_register:
 	return status;
 }
 
-static void __net_exit sctp_net_exit(struct net *net)
+static void __net_exit sctp_defaults_exit(struct net *net)
 {
 	/* Free the local address list */
 	sctp_free_addr_wq(net);
 	sctp_free_local_addr_list(net);
 
-	/* Free the control endpoint.  */
-	inet_ctl_sock_destroy(net->sctp.ctl_sock);
-
 	sctp_dbg_objcnt_exit(net);
 
 	sctp_proc_exit(net);
@@ -1308,9 +1296,32 @@ static void __net_exit sctp_net_exit(struct net *net)
 	sctp_sysctl_net_unregister(net);
 }
 
-static struct pernet_operations sctp_net_ops = {
-	.init = sctp_net_init,
-	.exit = sctp_net_exit,
+static struct pernet_operations sctp_defaults_ops = {
+	.init = sctp_defaults_init,
+	.exit = sctp_defaults_exit,
+};
+
+static int __net_init sctp_ctrlsock_init(struct net *net)
+{
+	int status;
+
+	/* Initialize the control inode/socket for handling OOTB packets.  */
+	status = sctp_ctl_sock_init(net);
+	if (status)
+		pr_err("Failed to initialize the SCTP control sock\n");
+
+	return status;
+}
+
+static void __net_init sctp_ctrlsock_exit(struct net *net)
+{
+	/* Free the control endpoint.  */
+	inet_ctl_sock_destroy(net->sctp.ctl_sock);
+}
+
+static struct pernet_operations sctp_ctrlsock_ops = {
+	.init = sctp_ctrlsock_init,
+	.exit = sctp_ctrlsock_exit,
 };
 
 /* Initialize the universe into something sensible.  */
@@ -1444,8 +1455,11 @@ static __init int sctp_init(void)
 	sctp_v4_pf_init();
 	sctp_v6_pf_init();
 
-	status = sctp_v4_protosw_init();
+	status = register_pernet_subsys(&sctp_defaults_ops);
+	if (status)
+		goto err_register_defaults;
 
+	status = sctp_v4_protosw_init();
 	if (status)
 		goto err_protosw_init;
 
@@ -1453,9 +1467,9 @@ static __init int sctp_init(void)
 	if (status)
 		goto err_v6_protosw_init;
 
-	status = register_pernet_subsys(&sctp_net_ops);
+	status = register_pernet_subsys(&sctp_ctrlsock_ops);
 	if (status)
-		goto err_register_pernet_subsys;
+		goto err_register_ctrlsock;
 
 	status = sctp_v4_add_protocol();
 	if (status)
@@ -1472,12 +1486,14 @@ out:
 err_v6_add_protocol:
 	sctp_v4_del_protocol();
 err_add_protocol:
-	unregister_pernet_subsys(&sctp_net_ops);
-err_register_pernet_subsys:
+	unregister_pernet_subsys(&sctp_ctrlsock_ops);
+err_register_ctrlsock:
 	sctp_v6_protosw_exit();
 err_v6_protosw_init:
 	sctp_v4_protosw_exit();
 err_protosw_init:
+	unregister_pernet_subsys(&sctp_defaults_ops);
+err_register_defaults:
 	sctp_v4_pf_exit();
 	sctp_v6_pf_exit();
 	sctp_sysctl_unregister();
@@ -1510,12 +1526,14 @@ static __exit void sctp_exit(void)
 	sctp_v6_del_protocol();
 	sctp_v4_del_protocol();
 
-	unregister_pernet_subsys(&sctp_net_ops);
+	unregister_pernet_subsys(&sctp_ctrlsock_ops);
 
 	/* Free protosw registrations */
 	sctp_v6_protosw_exit();
 	sctp_v4_protosw_exit();
 
+	unregister_pernet_subsys(&sctp_defaults_ops);
+
 	/* Unregister with socket layer. */
 	sctp_v6_pf_exit();
 	sctp_v4_pf_exit();
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] openvswitch: Zero flows on allocation.
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (24 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] sctp: fix race on protocol/netns initialization Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] fib_rules: fix fib rule dumps across multiple skbs Jiri Slaby
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Jesse Gross, David S . Miller, Jiri Slaby

From: Jesse Gross <jesse@nicira.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

[ Upstream commit ae5f2fb1d51fa128a460bcfbe3c56d7ab8bf6a43 ]

When support for megaflows was introduced, OVS needed to start
installing flows with a mask applied to them. Since masking is an
expensive operation, OVS also had an optimization that would only
take the parts of the flow keys that were covered by a non-zero
mask. The values stored in the remaining pieces should not matter
because they are masked out.

While this works fine for the purposes of matching (which must always
look at the mask), serialization to netlink can be problematic. Since
the flow and the mask are serialized separately, the uninitialized
portions of the flow can be encoded with whatever values happen to be
present.

In terms of functionality, this has little effect since these fields
will be masked out by definition. However, it leaks kernel memory to
userspace, which is a potential security vulnerability. It is also
possible that other code paths could look at the masked key and get
uninitialized data, although this does not currently appear to be an
issue in practice.

This removes the mask optimization for flows that are being installed.
This was always intended to be the case as the mask optimizations were
really targeting per-packet flow operations.
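
For illustration only, the two call sites after the change: the install path
masks the full key so no uninitialized bytes remain, while per-packet lookups
keep the cheaper range-limited masking:

        /* flow install: cover the whole key */
        ovs_flow_key_mask(&masked_key, &key, true, &mask);

        /* per-packet lookup: only bytes inside mask->range are compared */
        ovs_flow_key_mask(&masked_key, unmasked, false, mask);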

Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 net/openvswitch/datapath.c |  2 +-
 net/openvswitch/flow.c     | 21 ++++++++++++---------
 net/openvswitch/flow.h     |  2 +-
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 2aa13bd7f2b2..aa0a5f2794f1 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1266,7 +1266,7 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
 		if (IS_ERR(acts))
 			goto error;
 
-		ovs_flow_key_mask(&masked_key, &key, &mask);
+		ovs_flow_key_mask(&masked_key, &key, true, &mask);
 		error = validate_and_copy_actions(a[OVS_FLOW_ATTR_ACTIONS],
 						  &masked_key, 0, &acts);
 		if (error) {
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 410db90db73d..b5d58cfa4fdc 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -373,18 +373,21 @@ static bool icmp6hdr_ok(struct sk_buff *skb)
 }
 
 void ovs_flow_key_mask(struct sw_flow_key *dst, const struct sw_flow_key *src,
-		       const struct sw_flow_mask *mask)
+		       bool full, const struct sw_flow_mask *mask)
 {
-	const long *m = (long *)((u8 *)&mask->key + mask->range.start);
-	const long *s = (long *)((u8 *)src + mask->range.start);
-	long *d = (long *)((u8 *)dst + mask->range.start);
+	int start = full ? 0 : mask->range.start;
+	int len = full ? sizeof *dst : range_n_bytes(&mask->range);
+	const long *m = (const long *)((const u8 *)&mask->key + start);
+	const long *s = (const long *)((const u8 *)src + start);
+	long *d = (long *)((u8 *)dst + start);
 	int i;
 
-	/* The memory outside of the 'mask->range' are not set since
-	 * further operations on 'dst' only uses contents within
-	 * 'mask->range'.
+	/* If 'full' is true then all of 'dst' is fully initialized. Otherwise,
+	 * if 'full' is false the memory outside of the 'mask->range' is left
+	 * uninitialized. This can be used as an optimization when further
+	 * operations on 'dst' only use contents within 'mask->range'.
 	 */
-	for (i = 0; i < range_n_bytes(&mask->range); i += sizeof(long))
+	for (i = 0; i < len; i += sizeof(long))
 		*d++ = *s++ & *m++;
 }
 
@@ -1085,7 +1088,7 @@ static struct sw_flow *ovs_masked_flow_lookup(struct flow_table *table,
 	u32 hash;
 	struct sw_flow_key masked_key;
 
-	ovs_flow_key_mask(&masked_key, unmasked, mask);
+	ovs_flow_key_mask(&masked_key, unmasked, false, mask);
 	hash = ovs_flow_hash(&masked_key, key_start, key_end);
 	head = find_bucket(table, hash);
 	hlist_for_each_entry_rcu(flow, head, hash_node[table->node_ver]) {
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 212fbf7510c4..a5da8e1ab854 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -255,5 +255,5 @@ void ovs_sw_flow_mask_insert(struct flow_table *, struct sw_flow_mask *);
 struct sw_flow_mask *ovs_sw_flow_mask_find(const struct flow_table *,
 		const struct sw_flow_mask *);
 void ovs_flow_key_mask(struct sw_flow_key *dst, const struct sw_flow_key *src,
-		       const struct sw_flow_mask *mask);
+		       bool full, const struct sw_flow_mask *mask);
 #endif /* flow.h */
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] fib_rules: fix fib rule dumps across multiple skbs
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (25 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] openvswitch: Zero flows on allocation Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] x86/nmi/64: Improve nested NMI comments Jiri Slaby
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable; +Cc: Wilson Kok, Roopa Prabhu, David S . Miller, Jiri Slaby

From: Wilson Kok <wkok@cumulusnetworks.com>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

[ Upstream commit 41fc014332d91ee90c32840bf161f9685b7fbf2b ]

dump_rules returns the skb length and not an error.
But when family == AF_UNSPEC, the caller of dump_rules
assumes that it returns an error. Hence, when family == AF_UNSPEC,
we continue trying to dump on -EMSGSIZE errors, resulting in an
incorrect dump idx carried between skbs belonging to the same dump.
This results in the fib rule dump only ever dumping rules that fit
into the first skb.

This patch fixes dump_rules to return the error so that we exit
correctly and the idx is correctly maintained between skbs that are
part of the same dump.
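
For illustration only, the resulting split of responsibilities (condensed
from the diff below): dump_rules() reports the first fill error, and only the
single-family caller converts that into the skb length expected by netlink:

        /* dump_rules(): stop on the first fill error (e.g. -EMSGSIZE) */
        err = fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid,
                               cb->nlh->nlmsg_seq, RTM_NEWRULE,
                               NLM_F_MULTI, ops);
        if (err)
                break;

        /* fib_nl_dumprule(), family != AF_UNSPEC: report bytes written */
        dump_rules(skb, cb, ops);
        return skb->len;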

Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 net/core/fib_rules.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 185c341fafbd..aeedc3a961a1 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -621,15 +621,17 @@ static int dump_rules(struct sk_buff *skb, struct netlink_callback *cb,
 {
 	int idx = 0;
 	struct fib_rule *rule;
+	int err = 0;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(rule, &ops->rules_list, list) {
 		if (idx < cb->args[1])
 			goto skip;
 
-		if (fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid,
-				     cb->nlh->nlmsg_seq, RTM_NEWRULE,
-				     NLM_F_MULTI, ops) < 0)
+		err = fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid,
+				       cb->nlh->nlmsg_seq, RTM_NEWRULE,
+				       NLM_F_MULTI, ops);
+		if (err)
 			break;
 skip:
 		idx++;
@@ -638,7 +640,7 @@ skip:
 	cb->args[1] = idx;
 	rules_ops_put(ops);
 
-	return skb->len;
+	return err;
 }
 
 static int fib_nl_dumprule(struct sk_buff *skb, struct netlink_callback *cb)
@@ -654,7 +656,9 @@ static int fib_nl_dumprule(struct sk_buff *skb, struct netlink_callback *cb)
 		if (ops == NULL)
 			return -EAFNOSUPPORT;
 
-		return dump_rules(skb, cb, ops);
+		dump_rules(skb, cb, ops);
+
+		return skb->len;
 	}
 
 	rcu_read_lock();
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] x86/nmi/64: Improve nested NMI comments
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (26 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] fib_rules: fix fib rule dumps across multiple skbs Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] x86/nmi/64: Reorder nested NMI checks Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection Jiri Slaby
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable
  Cc: Andy Lutomirski, Borislav Petkov, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Jiri Slaby

From: Andy Lutomirski <luto@kernel.org>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit 0b22930ebad563ae97ff3f8d7b9f12060b4c6e6b upstream.

I found the nested NMI documentation to be difficult to follow.
Improve the comments.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 arch/x86/kernel/entry_64.S | 159 ++++++++++++++++++++++++++-------------------
 arch/x86/kernel/nmi.c      |   4 +-
 2 files changed, 93 insertions(+), 70 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 691337073e1f..795245d48eb3 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1693,11 +1693,12 @@ ENTRY(nmi)
 	 *  If the variable is not set and the stack is not the NMI
 	 *  stack then:
 	 *    o Set the special variable on the stack
-	 *    o Copy the interrupt frame into a "saved" location on the stack
-	 *    o Copy the interrupt frame into a "copy" location on the stack
+	 *    o Copy the interrupt frame into an "outermost" location on the
+	 *      stack
+	 *    o Copy the interrupt frame into an "iret" location on the stack
 	 *    o Continue processing the NMI
 	 *  If the variable is set or the previous stack is the NMI stack:
-	 *    o Modify the "copy" location to jump to the repeate_nmi
+	 *    o Modify the "iret" location to jump to the repeat_nmi
 	 *    o return back to the first NMI
 	 *
 	 * Now on exit of the first NMI, we first clear the stack variable
@@ -1777,18 +1778,60 @@ ENTRY(nmi)
 
 .Lnmi_from_kernel:
 	/*
-	 * Check the special variable on the stack to see if NMIs are
-	 * executing.
+	 * Here's what our stack frame will look like:
+	 * +---------------------------------------------------------+
+	 * | original SS                                             |
+	 * | original Return RSP                                     |
+	 * | original RFLAGS                                         |
+	 * | original CS                                             |
+	 * | original RIP                                            |
+	 * +---------------------------------------------------------+
+	 * | temp storage for rdx                                    |
+	 * +---------------------------------------------------------+
+	 * | "NMI executing" variable                                |
+	 * +---------------------------------------------------------+
+	 * | iret SS          } Copied from "outermost" frame        |
+	 * | iret Return RSP  } on each loop iteration; overwritten  |
+	 * | iret RFLAGS      } by a nested NMI to force another     |
+	 * | iret CS          } iteration if needed.                 |
+	 * | iret RIP         }                                      |
+	 * +---------------------------------------------------------+
+	 * | outermost SS          } initialized in first_nmi;       |
+	 * | outermost Return RSP  } will not be changed before      |
+	 * | outermost RFLAGS      } NMI processing is done.         |
+	 * | outermost CS          } Copied to "iret" frame on each  |
+	 * | outermost RIP         } iteration.                      |
+	 * +---------------------------------------------------------+
+	 * | pt_regs                                                 |
+	 * +---------------------------------------------------------+
+	 *
+	 * The "original" frame is used by hardware.  Before re-enabling
+	 * NMIs, we need to be done with it, and we need to leave enough
+	 * space for the asm code here.
+	 *
+	 * We return by executing IRET while RSP points to the "iret" frame.
+	 * That will either return for real or it will loop back into NMI
+	 * processing.
+	 *
+	 * The "outermost" frame is copied to the "iret" frame on each
+	 * iteration of the loop, so each iteration starts with the "iret"
+	 * frame pointing to the final return target.
+	 */
+
+	/*
+	 * Determine whether we're a nested NMI.
+	 *
+	 * First check "NMI executing".  If it's set, then we're nested.
+	 * This will not detect if we interrupted an outer NMI just
+	 * before IRET.
 	 */
 	cmpl $1, -8(%rsp)
 	je nested_nmi
 
 	/*
-	 * Now test if the previous stack was an NMI stack.
-	 * We need the double check. We check the NMI stack to satisfy the
-	 * race when the first NMI clears the variable before returning.
-	 * We check the variable because the first NMI could be in a
-	 * breakpoint routine using a breakpoint stack.
+	 * Now test if the previous stack was an NMI stack.  This covers
+	 * the case where we interrupt an outer NMI after it clears
+	 * "NMI executing" but before IRET.
 	 */
 	lea 6*8(%rsp), %rdx
 	test_in_nmi rdx, 4*8(%rsp), nested_nmi, first_nmi
@@ -1796,9 +1839,11 @@ ENTRY(nmi)
 
 nested_nmi:
 	/*
-	 * Do nothing if we interrupted the fixup in repeat_nmi.
-	 * It's about to repeat the NMI handler, so we are fine
-	 * with ignoring this one.
+	 * If we interrupted an NMI that is between repeat_nmi and
+	 * end_repeat_nmi, then we must not modify the "iret" frame
+	 * because it's being written by the outer NMI.  That's okay;
+	 * the outer NMI handler is about to call do_nmi anyway,
+	 * so we can just resume the outer NMI.
 	 */
 	movq $repeat_nmi, %rdx
 	cmpq 8(%rsp), %rdx
@@ -1808,7 +1853,10 @@ nested_nmi:
 	ja nested_nmi_out
 
 1:
-	/* Set up the interrupted NMIs stack to jump to repeat_nmi */
+	/*
+	 * Modify the "iret" frame to point to repeat_nmi, forcing another
+	 * iteration of NMI handling.
+	 */
 	leaq -1*8(%rsp), %rdx
 	movq %rdx, %rsp
 	CFI_ADJUST_CFA_OFFSET 1*8
@@ -1827,60 +1875,23 @@ nested_nmi_out:
 	popq_cfi %rdx
 	CFI_RESTORE rdx
 
-	/* No need to check faults here */
+	/* We are returning to kernel mode, so this cannot result in a fault. */
 	INTERRUPT_RETURN
 
 	CFI_RESTORE_STATE
 first_nmi:
-	/*
-	 * Because nested NMIs will use the pushed location that we
-	 * stored in rdx, we must keep that space available.
-	 * Here's what our stack frame will look like:
-	 * +-------------------------+
-	 * | original SS             |
-	 * | original Return RSP     |
-	 * | original RFLAGS         |
-	 * | original CS             |
-	 * | original RIP            |
-	 * +-------------------------+
-	 * | temp storage for rdx    |
-	 * +-------------------------+
-	 * | NMI executing variable  |
-	 * +-------------------------+
-	 * | copied SS               |
-	 * | copied Return RSP       |
-	 * | copied RFLAGS           |
-	 * | copied CS               |
-	 * | copied RIP              |
-	 * +-------------------------+
-	 * | Saved SS                |
-	 * | Saved Return RSP        |
-	 * | Saved RFLAGS            |
-	 * | Saved CS                |
-	 * | Saved RIP               |
-	 * +-------------------------+
-	 * | pt_regs                 |
-	 * +-------------------------+
-	 *
-	 * The saved stack frame is used to fix up the copied stack frame
-	 * that a nested NMI may change to make the interrupted NMI iret jump
-	 * to the repeat_nmi. The original stack frame and the temp storage
-	 * is also used by nested NMIs and can not be trusted on exit.
-	 */
-	/* Do not pop rdx, nested NMIs will corrupt that part of the stack */
+	/* Restore rdx. */
 	movq (%rsp), %rdx
 	CFI_RESTORE rdx
 
-	/* Set the NMI executing variable on the stack. */
+	/* Set "NMI executing" on the stack. */
 	pushq_cfi $1
 
-	/*
-	 * Leave room for the "copied" frame
-	 */
+	/* Leave room for the "iret" frame */
 	subq $(5*8), %rsp
 	CFI_ADJUST_CFA_OFFSET 5*8
 
-	/* Copy the stack frame to the Saved frame */
+	/* Copy the "original" frame to the "outermost" frame */
 	.rept 5
 	pushq_cfi 11*8(%rsp)
 	.endr
@@ -1888,6 +1899,7 @@ first_nmi:
 
 	/* Everything up to here is safe from nested NMIs */
 
+repeat_nmi:
 	/*
 	 * If there was a nested NMI, the first NMI's iret will return
 	 * here. But NMIs are still enabled and we can take another
@@ -1896,16 +1908,21 @@ first_nmi:
 	 * it will just return, as we are about to repeat an NMI anyway.
 	 * This makes it safe to copy to the stack frame that a nested
 	 * NMI will update.
-	 */
-repeat_nmi:
-	/*
-	 * Update the stack variable to say we are still in NMI (the update
-	 * is benign for the non-repeat case, where 1 was pushed just above
-	 * to this very stack slot).
+	 *
+	 * RSP is pointing to "outermost RIP".  gsbase is unknown, but, if
+	 * we're repeating an NMI, gsbase has the same value that it had on
+	 * the first iteration.  paranoid_entry will load the kernel
+	 * gsbase if needed before we call do_nmi.
+	 *
+	 * Set "NMI executing" in case we came back here via IRET.
 	 */
 	movq $1, 10*8(%rsp)
 
-	/* Make another copy, this one may be modified by nested NMIs */
+	/*
+	 * Copy the "outermost" frame to the "iret" frame.  NMIs that nest
+	 * here must not modify the "iret" frame while we're writing to
+	 * it or it will end up containing garbage.
+	 */
 	addq $(10*8), %rsp
 	CFI_ADJUST_CFA_OFFSET -10*8
 	.rept 5
@@ -1916,9 +1933,9 @@ repeat_nmi:
 end_repeat_nmi:
 
 	/*
-	 * Everything below this point can be preempted by a nested
-	 * NMI if the first NMI took an exception and reset our iret stack
-	 * so that we repeat another NMI.
+	 * Everything below this point can be preempted by a nested NMI.
+	 * If this happens, then the inner NMI will change the "iret"
+	 * frame to point back to repeat_nmi.
 	 */
 	pushq_cfi $-1		/* ORIG_RAX: no syscall to restart */
 	subq $ORIG_RAX-R15, %rsp
@@ -1946,9 +1963,15 @@ nmi_restore:
 	/* Pop the extra iret frame at once */
 	RESTORE_ALL 6*8
 
-	/* Clear the NMI executing stack variable */
+	/* Clear "NMI executing". */
 	movq $0, 5*8(%rsp)
-	jmp irq_return
+
+	/*
+	 * INTERRUPT_RETURN reads the "iret" frame and exits the NMI
+	 * stack in a single instruction.  We are returning to kernel
+	 * mode, so this cannot result in a fault.
+	 */
+	INTERRUPT_RETURN
 	CFI_ENDPROC
 END(nmi)
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index b82e0fdc7edb..85ede73e5ed7 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -392,8 +392,8 @@ static __kprobes void default_do_nmi(struct pt_regs *regs)
 }
 
 /*
- * NMIs can hit breakpoints which will cause it to lose its NMI context
- * with the CPU when the breakpoint or page fault does an IRET.
+ * NMIs can page fault or hit breakpoints, which will cause them to lose
+ * their NMI context with the CPU when the breakpoint or page fault does an IRET.
  *
  * As a result, NMIs can nest if NMIs get unmasked due an IRET during
  * NMI processing.  On x86_64, the asm glue protects us from nested NMIs
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] x86/nmi/64: Reorder nested NMI checks
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (27 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] x86/nmi/64: Improve nested NMI comments Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection Jiri Slaby
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable
  Cc: Andy Lutomirski, Borislav Petkov, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Jiri Slaby

From: Andy Lutomirski <luto@kernel.org>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit a27507ca2d796cfa8d907de31ad730359c8a6d06 upstream.

Check the repeat_nmi .. end_repeat_nmi special case first.  The
next patch will rework the RSP check and, as a side effect, the
RSP check will no longer detect repeat_nmi .. end_repeat_nmi, so
we'll need this ordering of the checks.

Note: this is more subtle than it appears.  The check for
repeat_nmi .. end_repeat_nmi jumps straight out of the NMI code
instead of adjusting the "iret" frame to force a repeat.  This
is necessary, because the code between repeat_nmi and
end_repeat_nmi sets "NMI executing" and then writes to the
"iret" frame itself.  If a nested NMI comes in and modifies the
"iret" frame while repeat_nmi is also modifying it, we'll end up
with garbage.  The old code got this right, as does the new
code, but the new code is a bit more explicit.

If we were to move the check right after the "NMI executing"
check, then we'd get it wrong and have random crashes.

( Because the "NMI executing" check would jump to the code that would
  modify the "iret" frame without checking if the interrupted NMI was
  currently modifying it. )
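
Just as an illustration of that ordering (a sketch with invented names,
not the entry_64.S assembly), the classification after this patch reads
roughly like this, with the repeat_nmi range check done before the
"NMI executing" test:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum nmi_kind { NMI_RESUME_OUTER, NMI_NESTED, NMI_FIRST };

static enum nmi_kind classify_nmi(uintptr_t rip, uintptr_t repeat_nmi,
				  uintptr_t end_repeat_nmi,
				  bool nmi_executing, bool on_nmi_stack)
{
	/* Checked first: the outer NMI is busy writing the "iret" frame,
	 * so we must not touch it -- just resume the outer NMI. */
	if (rip >= repeat_nmi && rip < end_repeat_nmi)
		return NMI_RESUME_OUTER;

	/* Only now is it safe to treat "NMI executing" (or an RSP that is
	 * still on the NMI stack) as "nested: redirect the iret frame to
	 * repeat_nmi". */
	if (nmi_executing || on_nmi_stack)
		return NMI_NESTED;

	return NMI_FIRST;
}

int main(void)
{
	/* An NMI landing inside repeat_nmi..end_repeat_nmi resumes the
	 * outer NMI even though "NMI executing" is already set. */
	printf("%d\n", classify_nmi(0x1010, 0x1000, 0x1040, true, true));
	return 0;
}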

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 arch/x86/kernel/entry_64.S | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 795245d48eb3..b58cfdd0bc50 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1821,7 +1821,23 @@ ENTRY(nmi)
 	/*
 	 * Determine whether we're a nested NMI.
 	 *
-	 * First check "NMI executing".  If it's set, then we're nested.
+	 * If we interrupted kernel code between repeat_nmi and
+	 * end_repeat_nmi, then we are a nested NMI.  We must not
+	 * modify the "iret" frame because it's being written by
+	 * the outer NMI.  That's okay; the outer NMI handler is
+	 * about to call do_nmi anyway, so we can just
+	 * resume the outer NMI.
+	 */
+	movq	$repeat_nmi, %rdx
+	cmpq	8(%rsp), %rdx
+	ja	1f
+	movq	$end_repeat_nmi, %rdx
+	cmpq	8(%rsp), %rdx
+	ja	nested_nmi_out
+1:
+
+	/*
+	 * Now check "NMI executing".  If it's set, then we're nested.
 	 * This will not detect if we interrupted an outer NMI just
 	 * before IRET.
 	 */
@@ -1839,21 +1855,6 @@ ENTRY(nmi)
 
 nested_nmi:
 	/*
-	 * If we interrupted an NMI that is between repeat_nmi and
-	 * end_repeat_nmi, then we must not modify the "iret" frame
-	 * because it's being written by the outer NMI.  That's okay;
-	 * the outer NMI handler is about to call do_nmi anyway,
-	 * so we can just resume the outer NMI.
-	 */
-	movq $repeat_nmi, %rdx
-	cmpq 8(%rsp), %rdx
-	ja 1f
-	movq $end_repeat_nmi, %rdx
-	cmpq 8(%rsp), %rdx
-	ja nested_nmi_out
-
-1:
-	/*
 	 * Modify the "iret" frame to point to repeat_nmi, forcing another
 	 * iteration of NMI handling.
 	 */
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [patch added to the 3.12 stable tree] x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection
  2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
                   ` (28 preceding siblings ...)
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] x86/nmi/64: Reorder nested NMI checks Jiri Slaby
@ 2015-09-30 10:00 ` Jiri Slaby
  29 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 10:00 UTC (permalink / raw)
  To: stable
  Cc: Andy Lutomirski, Borislav Petkov, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Jiri Slaby

From: Andy Lutomirski <luto@kernel.org>

This patch has been added to the 3.12 stable tree. If you have any
objections, please let us know.

===============

commit 810bc075f78ff2c221536eb3008eac6a492dba2d upstream.

We have a tricky bug in the nested NMI code: if we see RSP
pointing to the NMI stack on NMI entry from kernel mode, we
assume that we are executing a nested NMI.

This isn't quite true.  A malicious userspace program can point
RSP at the NMI stack, issue SYSCALL, and arrange for an NMI to
happen while RSP is still pointing at the NMI stack.

Fix it with a sneaky trick.  Set DF in the region of code that
the RSP check is intended to detect.  IRET will clear DF
atomically.

( Note: other than paravirt, there's little need for all this
  complexity. We could check RIP instead of RSP. )
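
As a rough sketch of the strengthened test (illustrative only, with
invented helper names; the real check is done in assembly against the
saved EFLAGS on the stack), "RSP on the NMI stack" alone is no longer
taken as proof of nesting:

#include <stdbool.h>
#include <stdio.h>

static bool looks_like_nested_nmi(bool rsp_on_nmi_stack, bool df_set)
{
	/*
	 * DF is set by the outer NMI just before it clears "NMI
	 * executing", and SYSCALL is programmed to mask DF, so userspace
	 * that merely points RSP at the NMI stack cannot forge the
	 * combination "RSP on NMI stack" + "DF set".
	 */
	return rsp_on_nmi_stack && df_set;
}

int main(void)
{
	printf("%d\n", looks_like_nested_nmi(true, false)); /* user-controlled RSP */
	printf("%d\n", looks_like_nested_nmi(true, true));  /* genuine nesting */
	return 0;
}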

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 arch/x86/kernel/entry_64.S | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index b58cfdd0bc50..7ed99df028ca 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1847,10 +1847,25 @@ ENTRY(nmi)
 	/*
 	 * Now test if the previous stack was an NMI stack.  This covers
 	 * the case where we interrupt an outer NMI after it clears
-	 * "NMI executing" but before IRET.
+	 * "NMI executing" but before IRET.  We need to be careful, though:
+	 * there is one case in which RSP could point to the NMI stack
+	 * despite there being no NMI active: naughty userspace controls
+	 * RSP at the very beginning of the SYSCALL targets.  We can
+	 * pull a fast one on naughty userspace, though: we program
+	 * SYSCALL to mask DF, so userspace cannot cause DF to be set
+	 * if it controls the kernel's RSP.  We set DF before we clear
+	 * "NMI executing".
 	 */
 	lea 6*8(%rsp), %rdx
 	test_in_nmi rdx, 4*8(%rsp), nested_nmi, first_nmi
+
+	/* Ah, it is within the NMI stack. */
+
+	testb	$(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
+	jz	first_nmi	/* RSP was user controlled. */
+
+	/* This is a nested NMI. */
+
 	CFI_REMEMBER_STATE
 
 nested_nmi:
@@ -1964,8 +1979,16 @@ nmi_restore:
 	/* Pop the extra iret frame at once */
 	RESTORE_ALL 6*8
 
-	/* Clear "NMI executing". */
-	movq $0, 5*8(%rsp)
+	/*
+	 * Clear "NMI executing".  Set DF first so that we can easily
+	 * distinguish the remaining code between here and IRET from
+	 * the SYSCALL entry and exit paths.  On a native kernel, we
+	 * could just inspect RIP, but, on paravirt kernels,
+	 * INTERRUPT_RETURN can translate into a jump into a
+	 * hypercall page.
+	 */
+	std
+	movq	$0, 5*8(%rsp)		/* clear "NMI executing" */
 
 	/*
 	 * INTERRUPT_RETURN reads the "iret" frame and exits the NMI
-- 
2.6.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [patch added to the 3.12 stable tree] stmmac: fix check for phydev being open
  2015-09-30 10:00 ` [patch added to the 3.12 stable tree] stmmac: fix check for phydev being open Jiri Slaby
@ 2015-09-30 11:28   ` Sergei Shtylyov
  2015-09-30 11:47     ` Jiri Slaby
  0 siblings, 1 reply; 33+ messages in thread
From: Sergei Shtylyov @ 2015-09-30 11:28 UTC (permalink / raw)
  To: Jiri Slaby, stable
  Cc: Alexey Brodkin, Giuseppe Cavallaro, linux-kernel, David Miller,
	Alexey Brodkin

On 9/30/2015 1:00 PM, Jiri Slaby wrote:

> From: Alexey Brodkin <Alexey.Brodkin@synopsys.com>

> This patch has been added to the 3.12 stable tree. If you have any
> objections, please let us know.

    I do -- of_phy_connect() isn't called in this version, so the patch is 
totally useless.

> ===============
>
> commit dfc50fcaad574e5c8c85cbc83eca1426b2413fa4 upstream.
>
> The current check of phydev with IS_ERR(phydev) may not make much sense,
> because of_phy_connect() returns NULL on failure instead of an error value.
>
> Still, for checking the result of phy_connect(), IS_ERR() makes perfect sense.
>
> So let's use the combined check IS_ERR_OR_NULL(), which covers both cases.
>
> Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: David Miller <davem@davemloft.net>
> Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
> ---
>   drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 8d4ccd35a016..14c0d31c10ad 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -801,8 +801,11 @@ static int stmmac_init_phy(struct net_device *dev)
>
>   	phydev = phy_connect(dev, phy_id_fmt, &stmmac_adjust_link, interface);
>
> -	if (IS_ERR(phydev)) {
> +	if (IS_ERR_OR_NULL(phydev)) {
>   		pr_err("%s: Could not attach to PHY\n", dev->name);
> +		if (!phydev)
> +			return -ENODEV;
> +
>   		return PTR_ERR(phydev);
>   	}
>
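
For reference, a small self-contained sketch of the IS_ERR()/NULL/
IS_ERR_OR_NULL() pattern under discussion (the ERR_PTR helpers are
re-implemented here purely for illustration; this is not the stmmac or
kernel source):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_ERRNO	4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline bool IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
static inline bool IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

/* phydev stands in for the result of phy_connect()/of_phy_connect():
 * one variant reports failure as ERR_PTR(-errno), the other as NULL. */
static int init_phy_sketch(void *phydev)
{
	if (IS_ERR_OR_NULL(phydev)) {
		fprintf(stderr, "Could not attach to PHY\n");
		if (!phydev)
			return -ENODEV;
		return PTR_ERR(phydev);
	}
	return 0;
}

int main(void)
{
	printf("%d\n", init_phy_sketch(NULL));            /* of_phy_connect()-style failure */
	printf("%d\n", init_phy_sketch(ERR_PTR(-EBUSY))); /* phy_connect()-style failure */
	return 0;
}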

MBR, Sergei


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch added to the 3.12 stable tree] stmmac: fix check for phydev being open
  2015-09-30 11:28   ` Sergei Shtylyov
@ 2015-09-30 11:47     ` Jiri Slaby
  0 siblings, 0 replies; 33+ messages in thread
From: Jiri Slaby @ 2015-09-30 11:47 UTC (permalink / raw)
  To: Sergei Shtylyov, stable
  Cc: Alexey Brodkin, Giuseppe Cavallaro, linux-kernel, David Miller,
	Alexey Brodkin

On 09/30/2015, 01:28 PM, Sergei Shtylyov wrote:
> On 9/30/2015 1:00 PM, Jiri Slaby wrote:
> 
>> From: Alexey Brodkin <Alexey.Brodkin@synopsys.com>
> 
>> This patch has been added to the 3.12 stable tree. If you have any
>> objections, please let us know.
> 
>    I do -- of_phy_connect() isn't called in this version, so the patch
> is totally useless.

Dropped from 3.12. Thanks.

>> ===============
>>
>> commit dfc50fcaad574e5c8c85cbc83eca1426b2413fa4 upstream.



-- 
js
suse labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2015-09-30 11:47 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-30  9:59 [patch added to the 3.12 stable tree] rc-core: fix remove uevent generation Jiri Slaby
2015-09-30  9:59 ` [patch added to the 3.12 stable tree] v4l: omap3isp: Fix sub-device power management code Jiri Slaby
2015-09-30  9:59 ` [patch added to the 3.12 stable tree] Btrfs: check if previous transaction aborted to avoid fs corruption Jiri Slaby
2015-09-30  9:59 ` [patch added to the 3.12 stable tree] NFSv4: don't set SETATTR for O_RDONLY|O_EXCL Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] NFS: nfs_set_pgio_error sometimes misses errors Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] parisc: Filter out spurious interrupts in PA-RISC irq handler Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] vmscan: fix increasing nr_isolated incurred by putback unevictable pages Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] fs: if a coredump already exists, unlink and recreate with O_EXCL Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] mmc: core: fix race condition in mmc_wait_data_done Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] md/raid10: always set reshape_safe when initializing reshape_position Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] xen/gntdev: convert priv->lock to a mutex Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] hfs: fix B-tree corruption after insertion at position 0 Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/qib: Change lkey table allocation to support more MRs Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/uverbs: reject invalid or unknown opcodes Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/uverbs: Fix race between ib_uverbs_open and remove_one Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/mlx4: Forbid using sysfs to change RoCE pkeys Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] IB/mlx4: Use correct SL on AH query under RoCE Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] stmmac: fix check for phydev being open Jiri Slaby
2015-09-30 11:28   ` Sergei Shtylyov
2015-09-30 11:47     ` Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] stmmac: troubleshoot unexpected bits in des0 & des1 Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] hfs,hfsplus: cache pages correctly between bnode_create and bnode_free Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] ip6_gre: release cached dst on tunnel removal Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] usbnet: Get EVENT_NO_RUNTIME_PM bit before it is cleared Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] ipv6: fix exthdrs offload registration in out_rt path Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] net/ipv6: Correct PIM6 mrt_lock handling Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] netlink, mmap: transform mmap skb into full skb on taps Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] sctp: fix race on protocol/netns initialization Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] openvswitch: Zero flows on allocation Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] fib_rules: fix fib rule dumps across multiple skbs Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] x86/nmi/64: Improve nested NMI comments Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] x86/nmi/64: Reorder nested NMI checks Jiri Slaby
2015-09-30 10:00 ` [patch added to the 3.12 stable tree] x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection Jiri Slaby
