Netdev List
 help / color / mirror / Atom feed
* [PATCH v3 07/31] ext4: Define usercopy region in ext4_inode_cache slab cache
From: Kees Cook @ 2017-09-20 20:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, David Windsor, Theodore Ts'o, Andreas Dilger,
	linux-ext4, linux-fsdevel, netdev, linux-mm, kernel-hardening
In-Reply-To: <1505940337-79069-1-git-send-email-keescook@chromium.org>

From: David Windsor <dave@nullcore.net>

The ext4 symlink pathnames, stored in struct ext4_inode_info.i_data
and therefore contained in the ext4_inode_cache slab cache, need
to be copied to/from userspace.

cache object allocation:
    fs/ext4/super.c:
        ext4_alloc_inode(...):
            struct ext4_inode_info *ei;
            ...
            ei = kmem_cache_alloc(ext4_inode_cachep, GFP_NOFS);
            ...
            return &ei->vfs_inode;

    include/trace/events/ext4.h:
            #define EXT4_I(inode) \
                (container_of(inode, struct ext4_inode_info, vfs_inode))

    fs/ext4/namei.c:
        ext4_symlink(...):
            ...
            inode->i_link = (char *)&EXT4_I(inode)->i_data;

example usage trace:
    readlink_copy+0x43/0x70
    vfs_readlink+0x62/0x110
    SyS_readlinkat+0x100/0x130

    fs/namei.c:
        readlink_copy(..., link):
            ...
            copy_to_user(..., link, len)

        (inlined into vfs_readlink)
        generic_readlink(dentry, ...):
            struct inode *inode = d_inode(dentry);
            const char *link = inode->i_link;
            ...
            readlink_copy(..., link);

In support of usercopy hardening, this patch defines a region in the
ext4_inode_cache slab cache in which userspace copy operations are
allowed.

This region is known as the slab cache's usercopy region. Slab caches can
now check that each copy operation involving cache-managed memory falls
entirely within the slab's usercopy region.

This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, provide usage trace]
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 fs/ext4/super.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b104096fce9e..b5d393321b7b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1036,11 +1036,13 @@ static void init_once(void *foo)
 
 static int __init init_inodecache(void)
 {
-	ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
-					     sizeof(struct ext4_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD|SLAB_ACCOUNT),
-					     init_once);
+	ext4_inode_cachep = kmem_cache_create_usercopy("ext4_inode_cache",
+				sizeof(struct ext4_inode_info), 0,
+				(SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD|
+					SLAB_ACCOUNT),
+				offsetof(struct ext4_inode_info, i_data),
+				sizeof_field(struct ext4_inode_info, i_data),
+				init_once);
 	if (ext4_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v3 06/31] vfs: Copy struct mount.mnt_id to userspace using put_user()
From: Kees Cook @ 2017-09-20 20:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, David Windsor, Alexander Viro, linux-fsdevel, netdev,
	linux-mm, kernel-hardening
In-Reply-To: <1505940337-79069-1-git-send-email-keescook@chromium.org>

From: David Windsor <dave@nullcore.net>

The mnt_id field can be copied with put_user(), so there is no need to
use copy_to_user(). In both cases, hardened usercopy is being bypassed
since the size is constant, and not open to runtime manipulation.

This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log]
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 fs/fhandle.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/fhandle.c b/fs/fhandle.c
index 58a61f55e0d0..46e00ccca8f0 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -68,8 +68,7 @@ static long do_sys_name_to_handle(struct path *path,
 	} else
 		retval = 0;
 	/* copy the mount id */
-	if (copy_to_user(mnt_id, &real_mount(path->mnt)->mnt_id,
-			 sizeof(*mnt_id)) ||
+	if (put_user(real_mount(path->mnt)->mnt_id, mnt_id) ||
 	    copy_to_user(ufh, handle,
 			 sizeof(struct file_handle) + handle_bytes))
 		retval = -EFAULT;
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 05/31] vfs: Define usercopy region in names_cache slab caches
From: Kees Cook @ 2017-09-20 20:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, David Windsor, Alexander Viro, linux-fsdevel, netdev,
	linux-mm, kernel-hardening
In-Reply-To: <1505940337-79069-1-git-send-email-keescook@chromium.org>

From: David Windsor <dave@nullcore.net>

VFS pathnames are stored in the names_cache slab cache, either inline
or across an entire allocation entry (when approaching PATH_MAX). These
are copied to/from userspace, so they must be entirely whitelisted.

cache object allocation:
    include/linux/fs.h:
        #define __getname()    kmem_cache_alloc(names_cachep, GFP_KERNEL)

example usage trace:
    strncpy_from_user+0x4d/0x170
    getname_flags+0x6f/0x1f0
    user_path_at_empty+0x23/0x40
    do_mount+0x69/0xda0
    SyS_mount+0x83/0xd0

    fs/namei.c:
        getname_flags(...):
            ...
            result = __getname();
            ...
            kname = (char *)result->iname;
            result->name = kname;
            len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
            ...
            if (unlikely(len == EMBEDDED_NAME_MAX)) {
                const size_t size = offsetof(struct filename, iname[1]);
                kname = (char *)result;

                result = kzalloc(size, GFP_KERNEL);
                ...
                result->name = kname;
                len = strncpy_from_user(kname, filename, PATH_MAX);

In support of usercopy hardening, this patch defines the entire cache
object in the names_cache slab cache as whitelisted, since it may entirely
hold name strings to be copied to/from userspace.

This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, add usage trace]
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 fs/dcache.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5f5e7c1fcf4b..34ef9a9169be 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3642,8 +3642,8 @@ void __init vfs_caches_init_early(void)
 
 void __init vfs_caches_init(void)
 {
-	names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
-			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+	names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0,
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL);
 
 	dcache_init();
 	inode_init();
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v3 04/31] dcache: Define usercopy region in dentry_cache slab cache
From: Kees Cook @ 2017-09-20 20:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, David Windsor, Alexander Viro, linux-fsdevel, netdev,
	linux-mm, kernel-hardening
In-Reply-To: <1505940337-79069-1-git-send-email-keescook@chromium.org>

From: David Windsor <dave@nullcore.net>

When a dentry name is short enough, it can be stored directly in the
dentry itself (instead in a separate kmalloc allocation). These dentry
short names, stored in struct dentry.d_iname and therefore contained in
the dentry_cache slab cache, need to be coped to userspace.

cache object allocation:
    fs/dcache.c:
        __d_alloc(...):
            ...
            dentry = kmem_cache_alloc(dentry_cache, ...);
            ...
            dentry->d_name.name = dentry->d_iname;

example usage trace:
    filldir+0xb0/0x140
    dcache_readdir+0x82/0x170
    iterate_dir+0x142/0x1b0
    SyS_getdents+0xb5/0x160

    fs/readdir.c:
        (called via ctx.actor by dir_emit)
        filldir(..., const char *name, ...):
            ...
            copy_to_user(..., name, namlen)

    fs/libfs.c:
        dcache_readdir(...):
            ...
            next = next_positive(dentry, p, 1)
            ...
            dir_emit(..., next->d_name.name, ...)

In support of usercopy hardening, this patch defines a region in the
dentry_cache slab cache in which userspace copy operations are allowed.

This region is known as the slab cache's usercopy region. Slab caches can
now check that each copy operation involving cache-managed memory falls
entirely within the slab's usercopy region.

This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust hunks for kmalloc-specific things moved later]
[kees: adjust commit log, provide usage trace]
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 fs/dcache.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f90141387f01..5f5e7c1fcf4b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3603,8 +3603,9 @@ static void __init dcache_init(void)
 	 * but it is probably not worth it because of the cache nature
 	 * of the dcache.
 	 */
-	dentry_cache = KMEM_CACHE(dentry,
-		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT);
+	dentry_cache = KMEM_CACHE_USERCOPY(dentry,
+		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT,
+		d_iname);
 
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 03/31] usercopy: Mark kmalloc caches as usercopy caches
From: Kees Cook @ 2017-09-20 20:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, David Windsor, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, linux-mm, linux-xfs,
	linux-fsdevel, netdev, kernel-hardening
In-Reply-To: <1505940337-79069-1-git-send-email-keescook@chromium.org>

From: David Windsor <dave@nullcore.net>

Mark the kmalloc slab caches as entirely whitelisted. These caches
are frequently used to fulfill kernel allocations that contain data
to be copied to/from userspace. Internal-only uses are also common,
but are scattered in the kernel. For now, mark all the kmalloc caches
as whitelisted.

This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: merged in moved kmalloc hunks, adjust commit log]
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 mm/slab.c        |  3 ++-
 mm/slab.h        |  3 ++-
 mm/slab_common.c | 10 ++++++----
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index df268999cf02..9af16f675927 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1291,7 +1291,8 @@ void __init kmem_cache_init(void)
 	 */
 	kmalloc_caches[INDEX_NODE] = create_kmalloc_cache(
 				kmalloc_info[INDEX_NODE].name,
-				kmalloc_size(INDEX_NODE), ARCH_KMALLOC_FLAGS);
+				kmalloc_size(INDEX_NODE), ARCH_KMALLOC_FLAGS,
+				0, kmalloc_size(INDEX_NODE));
 	slab_state = PARTIAL_NODE;
 	setup_kmalloc_cache_index_table();
 
diff --git a/mm/slab.h b/mm/slab.h
index 044755ff9632..2e0fe357d777 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -97,7 +97,8 @@ struct kmem_cache *kmalloc_slab(size_t, gfp_t);
 extern int __kmem_cache_create(struct kmem_cache *, unsigned long flags);
 
 extern struct kmem_cache *create_kmalloc_cache(const char *name, size_t size,
-			unsigned long flags);
+			unsigned long flags, size_t useroffset,
+			size_t usersize);
 extern void create_boot_cache(struct kmem_cache *, const char *name,
 			size_t size, unsigned long flags, size_t useroffset,
 			size_t usersize);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 36408f5f2a34..d4e6442f9bbc 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -920,14 +920,15 @@ void __init create_boot_cache(struct kmem_cache *s, const char *name, size_t siz
 }
 
 struct kmem_cache *__init create_kmalloc_cache(const char *name, size_t size,
-				unsigned long flags)
+				unsigned long flags, size_t useroffset,
+				size_t usersize)
 {
 	struct kmem_cache *s = kmem_cache_zalloc(kmem_cache, GFP_NOWAIT);
 
 	if (!s)
 		panic("Out of memory when creating slab %s\n", name);
 
-	create_boot_cache(s, name, size, flags, 0, size);
+	create_boot_cache(s, name, size, flags, useroffset, usersize);
 	list_add(&s->list, &slab_caches);
 	memcg_link_cache(s);
 	s->refcount = 1;
@@ -1081,7 +1082,8 @@ void __init setup_kmalloc_cache_index_table(void)
 static void __init new_kmalloc_cache(int idx, unsigned long flags)
 {
 	kmalloc_caches[idx] = create_kmalloc_cache(kmalloc_info[idx].name,
-					kmalloc_info[idx].size, flags);
+					kmalloc_info[idx].size, flags, 0,
+					kmalloc_info[idx].size);
 }
 
 /*
@@ -1122,7 +1124,7 @@ void __init create_kmalloc_caches(unsigned long flags)
 
 			BUG_ON(!n);
 			kmalloc_dma_caches[i] = create_kmalloc_cache(n,
-				size, SLAB_CACHE_DMA | flags);
+				size, SLAB_CACHE_DMA | flags, 0, 0);
 		}
 	}
 #endif
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v3 02/31] usercopy: Enforce slab cache usercopy region boundaries
From: Kees Cook @ 2017-09-20 20:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, David Windsor, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Laura Abbott,
	Ingo Molnar, Mark Rutland, linux-mm, linux-xfs, linux-fsdevel,
	netdev, kernel-hardening
In-Reply-To: <1505940337-79069-1-git-send-email-keescook@chromium.org>

From: David Windsor <dave@nullcore.net>

This patch adds the enforcement component of usercopy cache whitelisting,
and is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting
code in the last public patch of grsecurity/PaX based on my understanding
of the code. Changes or omissions from the original code are mine and
don't reflect the original grsecurity/PaX code.

The SLAB and SLUB allocators are modified to deny all copy operations
in which the kernel heap memory being modified falls outside of the cache's
defined usercopy region.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log and comments]
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 mm/slab.c     | 16 +++++++++++-----
 mm/slub.c     | 18 +++++++++++-------
 mm/usercopy.c | 12 ++++++++++++
 3 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 87b6e5e0cdaf..df268999cf02 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -4408,7 +4408,9 @@ module_init(slab_proc_init);
 
 #ifdef CONFIG_HARDENED_USERCOPY
 /*
- * Rejects objects that are incorrectly sized.
+ * Rejects incorrectly sized objects and objects that are to be copied
+ * to/from userspace but do not fall entirely within the containing slab
+ * cache's usercopy region.
  *
  * Returns NULL if check passes, otherwise const char * to name of cache
  * to indicate an error.
@@ -4428,11 +4430,15 @@ const char *__check_heap_object(const void *ptr, unsigned long n,
 	/* Find offset within object. */
 	offset = ptr - index_to_obj(cachep, page, objnr) - obj_offset(cachep);
 
-	/* Allow address range falling entirely within object size. */
-	if (offset <= cachep->object_size && n <= cachep->object_size - offset)
-		return NULL;
+	/* Make sure object falls entirely within cache's usercopy region. */
+	if (offset < cachep->useroffset)
+		return cachep->name;
+	if (offset - cachep->useroffset > cachep->usersize)
+		return cachep->name;
+	if (n > cachep->useroffset - offset + cachep->usersize)
+		return cachep->name;
 
-	return cachep->name;
+	return NULL;
 }
 #endif /* CONFIG_HARDENED_USERCOPY */
 
diff --git a/mm/slub.c b/mm/slub.c
index fae637726c44..bbf73024be3a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3833,7 +3833,9 @@ EXPORT_SYMBOL(__kmalloc_node);
 
 #ifdef CONFIG_HARDENED_USERCOPY
 /*
- * Rejects objects that are incorrectly sized.
+ * Rejects incorrectly sized objects and objects that are to be copied
+ * to/from userspace but do not fall entirely within the containing slab
+ * cache's usercopy region.
  *
  * Returns NULL if check passes, otherwise const char * to name of cache
  * to indicate an error.
@@ -3843,11 +3845,9 @@ const char *__check_heap_object(const void *ptr, unsigned long n,
 {
 	struct kmem_cache *s;
 	unsigned long offset;
-	size_t object_size;
 
 	/* Find object and usable object size. */
 	s = page->slab_cache;
-	object_size = slab_ksize(s);
 
 	/* Reject impossible pointers. */
 	if (ptr < page_address(page))
@@ -3863,11 +3863,15 @@ const char *__check_heap_object(const void *ptr, unsigned long n,
 		offset -= s->red_left_pad;
 	}
 
-	/* Allow address range falling entirely within object size. */
-	if (offset <= object_size && n <= object_size - offset)
-		return NULL;
+	/* Make sure object falls entirely within cache's usercopy region. */
+	if (offset < s->useroffset)
+		return s->name;
+	if (offset - s->useroffset > s->usersize)
+		return s->name;
+	if (n > s->useroffset - offset + s->usersize)
+		return s->name;
 
-	return s->name;
+	return NULL;
 }
 #endif /* CONFIG_HARDENED_USERCOPY */
 
diff --git a/mm/usercopy.c b/mm/usercopy.c
index a9852b24715d..cbffde670c49 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -58,6 +58,18 @@ static noinline int check_stack_object(const void *obj, unsigned long len)
 	return GOOD_STACK;
 }
 
+/*
+ * If this function is reached, then CONFIG_HARDENED_USERCOPY has found an
+ * unexpected state during a copy_from_user() or copy_to_user() call.
+ * There are several checks being performed on the buffer by the
+ * __check_object_size() function. Normal stack buffer usage should never
+ * trip the checks, and kernel text addressing will always trip the check.
+ * For cache objects, it is checking that only the whitelisted range of
+ * bytes for a given cache is being accessed (via the cache's usersize and
+ * useroffset fields). To adjust a cache whitelist, use the usercopy-aware
+ * kmem_cache_create_usercopy() function to create the cache (and
+ * carefully audit the whitelist range).
+ */
 static void report_usercopy(const void *ptr, unsigned long len,
 			    bool to_user, const char *type)
 {
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v3 01/31] usercopy: Prepare for usercopy whitelisting
From: Kees Cook @ 2017-09-20 20:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, David Windsor, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, linux-mm, linux-xfs,
	linux-fsdevel, netdev, kernel-hardening
In-Reply-To: <1505940337-79069-1-git-send-email-keescook@chromium.org>

From: David Windsor <dave@nullcore.net>

This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.

This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.

Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)

To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.

In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.

After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:

Total Slab Memory:           48074720
Usercopyable Memory:          6367532  13.2%
         task_struct                    0.2%         4480/1630720
         RAW                            0.3%            300/96000
         RAWv6                          2.1%           1408/64768
         ext4_inode_cache               3.0%       269760/8740224
         dentry                        11.1%       585984/5273856
         mm_struct                     29.1%         54912/188448
         kmalloc-8                    100.0%          24576/24576
         kmalloc-16                   100.0%          28672/28672
         kmalloc-32                   100.0%          81920/81920
         kmalloc-192                  100.0%          96768/96768
         kmalloc-128                  100.0%        143360/143360
         names_cache                  100.0%        163840/163840
         kmalloc-64                   100.0%        167936/167936
         kmalloc-256                  100.0%        339968/339968
         kmalloc-512                  100.0%        350720/350720
         kmalloc-96                   100.0%        455616/455616
         kmalloc-8192                 100.0%        655360/655360
         kmalloc-1024                 100.0%        812032/812032
         kmalloc-4096                 100.0%        819200/819200
         kmalloc-2048                 100.0%      1310720/1310720

After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:

Total Slab Memory:           95516184
Usercopyable Memory:          8497452   8.8%
         task_struct                    0.2%         4000/1456000
         RAW                            0.3%            300/96000
         RAWv6                          2.1%           1408/64768
         ext4_inode_cache               3.0%     1217280/39439872
         dentry                        11.1%     1623200/14608800
         mm_struct                     29.1%         73216/251264
         kmalloc-8                    100.0%          24576/24576
         kmalloc-16                   100.0%          28672/28672
         kmalloc-32                   100.0%          94208/94208
         kmalloc-192                  100.0%          96768/96768
         kmalloc-128                  100.0%        143360/143360
         names_cache                  100.0%        163840/163840
         kmalloc-64                   100.0%        245760/245760
         kmalloc-256                  100.0%        339968/339968
         kmalloc-512                  100.0%        350720/350720
         kmalloc-96                   100.0%        563520/563520
         kmalloc-8192                 100.0%        655360/655360
         kmalloc-1024                 100.0%        794624/794624
         kmalloc-4096                 100.0%        819200/819200
         kmalloc-2048                 100.0%      1257472/1257472

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 include/linux/slab.h     | 27 +++++++++++++++++++++------
 include/linux/slab_def.h |  3 +++
 include/linux/slub_def.h |  3 +++
 include/linux/stddef.h   |  2 ++
 mm/slab.c                |  2 +-
 mm/slab.h                |  5 ++++-
 mm/slab_common.c         | 46 ++++++++++++++++++++++++++++++++++++++--------
 mm/slub.c                | 11 +++++++++--
 8 files changed, 81 insertions(+), 18 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 41473df6dfb0..8b6cb384f8b6 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -126,9 +126,13 @@ struct mem_cgroup;
 void __init kmem_cache_init(void);
 bool slab_is_available(void);
 
-struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
-			unsigned long,
-			void (*)(void *));
+struct kmem_cache *kmem_cache_create(const char *name, size_t size,
+			size_t align, unsigned long flags,
+			void (*ctor)(void *));
+struct kmem_cache *kmem_cache_create_usercopy(const char *name,
+			size_t size, size_t align, unsigned long flags,
+			size_t useroffset, size_t usersize,
+			void (*ctor)(void *));
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 
@@ -144,9 +148,20 @@ void memcg_destroy_kmem_caches(struct mem_cgroup *);
  * f.e. add ____cacheline_aligned_in_smp to the struct declaration
  * then the objects will be properly aligned in SMP configurations.
  */
-#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
-		sizeof(struct __struct), __alignof__(struct __struct),\
-		(__flags), NULL)
+#define KMEM_CACHE(__struct, __flags)					\
+		kmem_cache_create(#__struct, sizeof(struct __struct),	\
+			__alignof__(struct __struct), (__flags), NULL)
+
+/*
+ * To whitelist a single field for copying to/from usercopy, use this
+ * macro instead for KMEM_CACHE() above.
+ */
+#define KMEM_CACHE_USERCOPY(__struct, __flags, __field)			\
+		kmem_cache_create_usercopy(#__struct,			\
+			sizeof(struct __struct),			\
+			__alignof__(struct __struct), (__flags),	\
+			offsetof(struct __struct, __field),		\
+			sizeof_field(struct __struct, __field), NULL)
 
 /*
  * Common kmalloc functions provided by all allocators
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 4ad2c5a26399..03eef0df8648 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -84,6 +84,9 @@ struct kmem_cache {
 	unsigned int *random_seq;
 #endif
 
+	size_t useroffset;		/* Usercopy region offset */
+	size_t usersize;		/* Usercopy region size */
+
 	struct kmem_cache_node *node[MAX_NUMNODES];
 };
 
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 0783b622311e..62866a1a767c 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -134,6 +134,9 @@ struct kmem_cache {
 	struct kasan_cache kasan_info;
 #endif
 
+	size_t useroffset;		/* Usercopy region offset */
+	size_t usersize;		/* Usercopy region size */
+
 	struct kmem_cache_node *node[MAX_NUMNODES];
 };
 
diff --git a/include/linux/stddef.h b/include/linux/stddef.h
index 9c61c7cda936..f00355086fb2 100644
--- a/include/linux/stddef.h
+++ b/include/linux/stddef.h
@@ -18,6 +18,8 @@ enum {
 #define offsetof(TYPE, MEMBER)	((size_t)&((TYPE *)0)->MEMBER)
 #endif
 
+#define sizeof_field(structure, field) sizeof((((structure *)0)->field))
+
 /**
  * offsetofend(TYPE, MEMBER)
  *
diff --git a/mm/slab.c b/mm/slab.c
index 04dec48c3ed7..87b6e5e0cdaf 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1281,7 +1281,7 @@ void __init kmem_cache_init(void)
 	create_boot_cache(kmem_cache, "kmem_cache",
 		offsetof(struct kmem_cache, node) +
 				  nr_node_ids * sizeof(struct kmem_cache_node *),
-				  SLAB_HWCACHE_ALIGN);
+				  SLAB_HWCACHE_ALIGN, 0, 0);
 	list_add(&kmem_cache->list, &slab_caches);
 	slab_state = PARTIAL;
 
diff --git a/mm/slab.h b/mm/slab.h
index 073362816acc..044755ff9632 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -21,6 +21,8 @@ struct kmem_cache {
 	unsigned int size;	/* The aligned/padded/added on size  */
 	unsigned int align;	/* Alignment as calculated */
 	unsigned long flags;	/* Active flags on the slab */
+	size_t useroffset;	/* Usercopy region offset */
+	size_t usersize;	/* Usercopy region size */
 	const char *name;	/* Slab name for sysfs */
 	int refcount;		/* Use counter */
 	void (*ctor)(void *);	/* Called on object slot creation */
@@ -97,7 +99,8 @@ extern int __kmem_cache_create(struct kmem_cache *, unsigned long flags);
 extern struct kmem_cache *create_kmalloc_cache(const char *name, size_t size,
 			unsigned long flags);
 extern void create_boot_cache(struct kmem_cache *, const char *name,
-			size_t size, unsigned long flags);
+			size_t size, unsigned long flags, size_t useroffset,
+			size_t usersize);
 
 int slab_unmergeable(struct kmem_cache *s);
 struct kmem_cache *find_mergeable(size_t size, size_t align,
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 904a83be82de..36408f5f2a34 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -272,6 +272,9 @@ int slab_unmergeable(struct kmem_cache *s)
 	if (s->ctor)
 		return 1;
 
+	if (s->usersize)
+		return 1;
+
 	/*
 	 * We may have set a slab to be unmergeable during bootstrap.
 	 */
@@ -357,12 +360,16 @@ unsigned long calculate_alignment(unsigned long flags,
 
 static struct kmem_cache *create_cache(const char *name,
 		size_t object_size, size_t size, size_t align,
-		unsigned long flags, void (*ctor)(void *),
+		unsigned long flags, size_t useroffset,
+		size_t usersize, void (*ctor)(void *),
 		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
 {
 	struct kmem_cache *s;
 	int err;
 
+	if (WARN_ON(useroffset + usersize > object_size))
+		useroffset = usersize = 0;
+
 	err = -ENOMEM;
 	s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
 	if (!s)
@@ -373,6 +380,8 @@ static struct kmem_cache *create_cache(const char *name,
 	s->size = size;
 	s->align = align;
 	s->ctor = ctor;
+	s->useroffset = useroffset;
+	s->usersize = usersize;
 
 	err = init_memcg_params(s, memcg, root_cache);
 	if (err)
@@ -397,11 +406,13 @@ static struct kmem_cache *create_cache(const char *name,
 }
 
 /*
- * kmem_cache_create - Create a cache.
+ * kmem_cache_create_usercopy - Create a cache.
  * @name: A string which is used in /proc/slabinfo to identify this cache.
  * @size: The size of objects to be created in this cache.
  * @align: The required alignment for the objects.
  * @flags: SLAB flags
+ * @useroffset: Usercopy region offset
+ * @usersize: Usercopy region size
  * @ctor: A constructor for the objects.
  *
  * Returns a ptr to the cache on success, NULL on failure.
@@ -421,8 +432,9 @@ static struct kmem_cache *create_cache(const char *name,
  * as davem.
  */
 struct kmem_cache *
-kmem_cache_create(const char *name, size_t size, size_t align,
-		  unsigned long flags, void (*ctor)(void *))
+kmem_cache_create_usercopy(const char *name, size_t size, size_t align,
+		  unsigned long flags, size_t useroffset, size_t usersize,
+		  void (*ctor)(void *))
 {
 	struct kmem_cache *s = NULL;
 	const char *cache_name;
@@ -453,7 +465,13 @@ kmem_cache_create(const char *name, size_t size, size_t align,
 	 */
 	flags &= CACHE_CREATE_MASK;
 
-	s = __kmem_cache_alias(name, size, align, flags, ctor);
+	/* Fail closed on bad usersize of useroffset values. */
+	if (WARN_ON(!usersize && useroffset) ||
+	    WARN_ON(size < usersize || size - usersize < useroffset))
+		usersize = useroffset = 0;
+
+	if (!usersize)
+		s = __kmem_cache_alias(name, size, align, flags, ctor);
 	if (s)
 		goto out_unlock;
 
@@ -465,7 +483,7 @@ kmem_cache_create(const char *name, size_t size, size_t align,
 
 	s = create_cache(cache_name, size, size,
 			 calculate_alignment(flags, align, size),
-			 flags, ctor, NULL, NULL);
+			 flags, useroffset, usersize, ctor, NULL, NULL);
 	if (IS_ERR(s)) {
 		err = PTR_ERR(s);
 		kfree_const(cache_name);
@@ -491,6 +509,15 @@ kmem_cache_create(const char *name, size_t size, size_t align,
 	}
 	return s;
 }
+EXPORT_SYMBOL(kmem_cache_create_usercopy);
+
+struct kmem_cache *
+kmem_cache_create(const char *name, size_t size, size_t align,
+		unsigned long flags, void (*ctor)(void *))
+{
+	return kmem_cache_create_usercopy(name, size, align, flags, 0, size,
+					  ctor);
+}
 EXPORT_SYMBOL(kmem_cache_create);
 
 static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work)
@@ -603,6 +630,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	s = create_cache(cache_name, root_cache->object_size,
 			 root_cache->size, root_cache->align,
 			 root_cache->flags & CACHE_CREATE_MASK,
+			 root_cache->useroffset, root_cache->usersize,
 			 root_cache->ctor, memcg, root_cache);
 	/*
 	 * If we could not create a memcg cache, do not complain, because
@@ -870,13 +898,15 @@ bool slab_is_available(void)
 #ifndef CONFIG_SLOB
 /* Create a cache during boot when no slab services are available yet */
 void __init create_boot_cache(struct kmem_cache *s, const char *name, size_t size,
-		unsigned long flags)
+		unsigned long flags, size_t useroffset, size_t usersize)
 {
 	int err;
 
 	s->name = name;
 	s->size = s->object_size = size;
 	s->align = calculate_alignment(flags, ARCH_KMALLOC_MINALIGN, size);
+	s->useroffset = useroffset;
+	s->usersize = usersize;
 
 	slab_init_memcg_params(s);
 
@@ -897,7 +927,7 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name, size_t size,
 	if (!s)
 		panic("Out of memory when creating slab %s\n", name);
 
-	create_boot_cache(s, name, size, flags);
+	create_boot_cache(s, name, size, flags, 0, size);
 	list_add(&s->list, &slab_caches);
 	memcg_link_cache(s);
 	s->refcount = 1;
diff --git a/mm/slub.c b/mm/slub.c
index 163352c537ab..fae637726c44 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4201,7 +4201,7 @@ void __init kmem_cache_init(void)
 	kmem_cache = &boot_kmem_cache;
 
 	create_boot_cache(kmem_cache_node, "kmem_cache_node",
-		sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN);
+		sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN, 0, 0);
 
 	register_hotmemory_notifier(&slab_memory_callback_nb);
 
@@ -4211,7 +4211,7 @@ void __init kmem_cache_init(void)
 	create_boot_cache(kmem_cache, "kmem_cache",
 			offsetof(struct kmem_cache, node) +
 				nr_node_ids * sizeof(struct kmem_cache_node *),
-		       SLAB_HWCACHE_ALIGN);
+		       SLAB_HWCACHE_ALIGN, 0, 0);
 
 	kmem_cache = bootstrap(&boot_kmem_cache);
 
@@ -5081,6 +5081,12 @@ static ssize_t cache_dma_show(struct kmem_cache *s, char *buf)
 SLAB_ATTR_RO(cache_dma);
 #endif
 
+static ssize_t usersize_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%zu\n", s->usersize);
+}
+SLAB_ATTR_RO(usersize);
+
 static ssize_t destroy_by_rcu_show(struct kmem_cache *s, char *buf)
 {
 	return sprintf(buf, "%d\n", !!(s->flags & SLAB_TYPESAFE_BY_RCU));
@@ -5455,6 +5461,7 @@ static struct attribute *slab_attrs[] = {
 #ifdef CONFIG_FAILSLAB
 	&failslab_attr.attr,
 #endif
+	&usersize_attr.attr,
 
 	NULL
 };
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v3 00/31] Hardened usercopy whitelisting
From: Kees Cook @ 2017-09-20 20:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kees Cook, linux-fsdevel, netdev, linux-mm, kernel-hardening,
	David Windsor

v3:
- added LKDTM update patch
- downgrade BUGs to WARNs and fail closed
- add Acks/Reviews from v2

v2:
- added tracing of allocation and usage
- refactored solutions for task_struct
- split up network patches for readability

I intend for this to land via my usercopy hardening tree, so Acks,
Reviewed, and Tested-bys would be greatly appreciated. I have some
questions in a few patches (e.g. CIFS and thread_stack) that would be nice
to get answered for completeness. FWIW, this series has survived generally
for weeks in 0-day testing, and specifically over a couple days rebased
on v4.14-rc1, so I intend to put this in -next shortly unless there is
further feedback.

----

This series is modified from Brad Spengler/PaX Team's PAX_USERCOPY code
in the last public patch of grsecurity/PaX based on our understanding
of the code. Changes or omissions from the original code are ours and
don't reflect the original grsecurity/PaX code.

David Windsor did the bulk of the porting, refactoring, splitting,
testing, etc; I did some extra tweaks, hunk moving, traces, and extra
patches.

Description from patch 1:


Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)

To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.

In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.

After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:

Total Slab Memory:           48074720
Usercopyable Memory:          6367532  13.2%
         task_struct                    0.2%         4480/1630720
         RAW                            0.3%            300/96000
         RAWv6                          2.1%           1408/64768
         ext4_inode_cache               3.0%       269760/8740224
         dentry                        11.1%       585984/5273856
         mm_struct                     29.1%         54912/188448
         kmalloc-8                    100.0%          24576/24576
         kmalloc-16                   100.0%          28672/28672
         kmalloc-32                   100.0%          81920/81920
         kmalloc-192                  100.0%          96768/96768
         kmalloc-128                  100.0%        143360/143360
         names_cache                  100.0%        163840/163840
         kmalloc-64                   100.0%        167936/167936
         kmalloc-256                  100.0%        339968/339968
         kmalloc-512                  100.0%        350720/350720
         kmalloc-96                   100.0%        455616/455616
         kmalloc-8192                 100.0%        655360/655360
         kmalloc-1024                 100.0%        812032/812032
         kmalloc-4096                 100.0%        819200/819200
         kmalloc-2048                 100.0%      1310720/1310720

After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:

Total Slab Memory:           95516184
Usercopyable Memory:          8497452   8.8%
         task_struct                    0.2%         4000/1456000
         RAW                            0.3%            300/96000
         RAWv6                          2.1%           1408/64768
         ext4_inode_cache               3.0%     1217280/39439872
         dentry                        11.1%     1623200/14608800
         mm_struct                     29.1%         73216/251264
         kmalloc-8                    100.0%          24576/24576
         kmalloc-16                   100.0%          28672/28672
         kmalloc-32                   100.0%          94208/94208
         kmalloc-192                  100.0%          96768/96768
         kmalloc-128                  100.0%        143360/143360
         names_cache                  100.0%        163840/163840
         kmalloc-64                   100.0%        245760/245760
         kmalloc-256                  100.0%        339968/339968
         kmalloc-512                  100.0%        350720/350720
         kmalloc-96                   100.0%        563520/563520
         kmalloc-8192                 100.0%        655360/655360
         kmalloc-1024                 100.0%        794624/794624
         kmalloc-4096                 100.0%        819200/819200
         kmalloc-2048                 100.0%      1257472/1257472

------
The patches are broken in several stages of changes:

Prepare and whitelist kmalloc:
    [PATCH 01/31] usercopy: Prepare for usercopy whitelisting
    [PATCH 02/31] usercopy: Enforce slab cache usercopy region boundaries
    [PATCH 03/31] usercopy: Mark kmalloc caches as usercopy caches

Update VFS layer for symlinks and other inline storage:
    [PATCH 04/31] dcache: Define usercopy region in dentry_cache slab
    [PATCH 05/31] vfs: Define usercopy region in names_cache slab caches
    [PATCH 06/31] vfs: Copy struct mount.mnt_id to userspace using
    [PATCH 07/31] ext4: Define usercopy region in ext4_inode_cache slab
    [PATCH 08/31] ext2: Define usercopy region in ext2_inode_cache slab
    [PATCH 09/31] jfs: Define usercopy region in jfs_ip slab cache
    [PATCH 10/31] befs: Define usercopy region in befs_inode_cache slab
    [PATCH 11/31] exofs: Define usercopy region in exofs_inode_cache slab
    [PATCH 12/31] orangefs: Define usercopy region in
    [PATCH 13/31] ufs: Define usercopy region in ufs_inode_cache slab
    [PATCH 14/31] vxfs: Define usercopy region in vxfs_inode slab cache
    [PATCH 15/31] xfs: Define usercopy region in xfs_inode slab cache
    [PATCH 16/31] cifs: Define usercopy region in cifs_request slab cache

Update scsi layer for inline storage:
    [PATCH 17/31] scsi: Define usercopy region in scsi_sense_cache slab

Whitelist a few network protocol-specific areas of memory:
    [PATCH 18/31] net: Define usercopy region in struct proto slab cache
    [PATCH 19/31] ip: Define usercopy region in IP proto slab cache
    [PATCH 20/31] caif: Define usercopy region in caif proto slab cache
    [PATCH 21/31] sctp: Define usercopy region in SCTP proto slab cache
    [PATCH 22/31] sctp: Copy struct sctp_sock.autoclose to userspace
    [PATCH 23/31] net: Restrict unwhitelisted proto caches to size 0

Whitelist areas of process memory:
    [PATCH 24/31] fork: Define usercopy region in mm_struct slab caches
    [PATCH 25/31] fork: Define usercopy region in thread_stack slab

Deal with per-architecture thread_struct whitelisting:
    [PATCH 26/31] fork: Provide usercopy whitelisting for task_struct
    [PATCH 27/31] x86: Implement thread_struct whitelist for hardened
    [PATCH 28/31] arm64: Implement thread_struct whitelist for hardened
    [PATCH 29/31] arm: Implement thread_struct whitelist for hardened

Make blacklisting the default:
    [PATCH 30/31] usercopy: Restrict non-usercopy caches to size 0

Update LKDTM:
    [PATCH 31/31] lkdtm: Update usercopy tests for whitelisting


Thanks!

-Kees (and David)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH net-next 08/14] gtp: Support encpasulating over IPv6
From: Tom Herbert @ 2017-09-20 20:40 UTC (permalink / raw)
  To: David Miller
  Cc: Tom Herbert, Linux Kernel Network Developers, Pablo Neira Ayuso,
	Harald Welte, Rohit Seth
In-Reply-To: <20170920.124511.922311380432026759.davem@davemloft.net>

On Wed, Sep 20, 2017 at 12:45 PM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Wed, 20 Sep 2017 11:03:52 -0700
>
>> On Mon, Sep 18, 2017 at 9:19 PM, David Miller <davem@davemloft.net> wrote:
>>> From: Tom Herbert <tom@quantonium.net>
>>> Date: Mon, 18 Sep 2017 17:38:58 -0700
>>>
>>>> Allow peers to be specified by IPv6 addresses.
>>>>
>>>> Signed-off-by: Tom Herbert <tom@quantonium.net>
>>>
>>> Hmmm, can you just check the socket family or something like that?
>>
>> I'm not sure what code you're referring to.
>
> There is a socket associated with the tunnel to do the encapsulation
> and it has an address family, right?

If fd's are set from userspace for the sockets then we could derive
the address family from them. I'll change that. Although, looking at
now I am wondering why were passing fds into GTP instead of just
having the kernel create the UDP port like is done for other encaps.

Tom

^ permalink raw reply

* [PATCH] [RESEND][for 4.14] net: qcom/emac: add software control for pause frame mode
From: Timur Tabi @ 2017-09-20 20:32 UTC (permalink / raw)
  To: David S. Miller, netdev; +Cc: timur

The EMAC has the option of sending only a single pause frame when
flow control is enabled and the RX queue is full.  Although sending
only one pause frame has little value, this would allow admins to
enable automatic flow control without having to worry about the EMAC
flooding nearby switches with pause frames if the kernel hangs.

The option is enabled by using the single-pause-mode private flag.

Signed-off-by: Timur Tabi <timur@codeaurora.org>
---
 drivers/net/ethernet/qualcomm/emac/emac-ethtool.c | 30 +++++++++++++++++++++++
 drivers/net/ethernet/qualcomm/emac/emac-mac.c     | 22 +++++++++++++++++
 drivers/net/ethernet/qualcomm/emac/emac.c         |  3 +++
 drivers/net/ethernet/qualcomm/emac/emac.h         |  3 +++
 4 files changed, 58 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-ethtool.c b/drivers/net/ethernet/qualcomm/emac/emac-ethtool.c
index bbe24639aa5a..c8c6231b87f3 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-ethtool.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-ethtool.c
@@ -88,6 +88,8 @@ static void emac_set_msglevel(struct net_device *netdev, u32 data)
 static int emac_get_sset_count(struct net_device *netdev, int sset)
 {
 	switch (sset) {
+	case ETH_SS_PRIV_FLAGS:
+		return 1;
 	case ETH_SS_STATS:
 		return EMAC_STATS_LEN;
 	default:
@@ -100,6 +102,10 @@ static void emac_get_strings(struct net_device *netdev, u32 stringset, u8 *data)
 	unsigned int i;
 
 	switch (stringset) {
+	case ETH_SS_PRIV_FLAGS:
+		strcpy(data, "single-pause-mode");
+		break;
+
 	case ETH_SS_STATS:
 		for (i = 0; i < EMAC_STATS_LEN; i++) {
 			strlcpy(data, emac_ethtool_stat_strings[i],
@@ -230,6 +236,27 @@ static int emac_get_regs_len(struct net_device *netdev)
 	return EMAC_MAX_REG_SIZE * sizeof(u32);
 }
 
+#define EMAC_PRIV_ENABLE_SINGLE_PAUSE	BIT(0)
+
+static int emac_set_priv_flags(struct net_device *netdev, u32 flags)
+{
+	struct emac_adapter *adpt = netdev_priv(netdev);
+
+	adpt->single_pause_mode = !!(flags & EMAC_PRIV_ENABLE_SINGLE_PAUSE);
+
+	if (netif_running(netdev))
+		return emac_reinit_locked(adpt);
+
+	return 0;
+}
+
+static u32 emac_get_priv_flags(struct net_device *netdev)
+{
+	struct emac_adapter *adpt = netdev_priv(netdev);
+
+	return adpt->single_pause_mode ? EMAC_PRIV_ENABLE_SINGLE_PAUSE : 0;
+}
+
 static const struct ethtool_ops emac_ethtool_ops = {
 	.get_link_ksettings = phy_ethtool_get_link_ksettings,
 	.set_link_ksettings = phy_ethtool_set_link_ksettings,
@@ -253,6 +280,9 @@ static int emac_get_regs_len(struct net_device *netdev)
 
 	.get_regs_len    = emac_get_regs_len,
 	.get_regs        = emac_get_regs,
+
+	.set_priv_flags = emac_set_priv_flags,
+	.get_priv_flags = emac_get_priv_flags,
 };
 
 void emac_set_ethtool_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/qualcomm/emac/emac-mac.c b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
index bcd4708b3745..0ea3ca09c689 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-mac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
@@ -551,6 +551,28 @@ static void emac_mac_start(struct emac_adapter *adpt)
 	mac &= ~(HUGEN | VLAN_STRIP | TPAUSE | SIMR | HUGE | MULTI_ALL |
 		 DEBUG_MODE | SINGLE_PAUSE_MODE);
 
+	/* Enable single-pause-frame mode if requested.
+	 *
+	 * If enabled, the EMAC will send a single pause frame when the RX
+	 * queue is full.  This normally leads to packet loss because
+	 * the pause frame disables the remote MAC only for 33ms (the quanta),
+	 * and then the remote MAC continues sending packets even though
+	 * the RX queue is still full.
+	 *
+	 * If disabled, the EMAC sends a pause frame every 31ms until the RX
+	 * queue is no longer full.  Normally, this is the preferred
+	 * method of operation.  However, when the system is hung (e.g.
+	 * cores are halted), the EMAC interrupt handler is never called
+	 * and so the RX queue fills up quickly and stays full.  The resuling
+	 * non-stop "flood" of pause frames sometimes has the effect of
+	 * disabling nearby switches.  In some cases, other nearby switches
+	 * are also affected, shutting down the entire network.
+	 *
+	 * The user can enable or disable single-pause-frame mode
+	 * via ethtool.
+	 */
+	mac |= adpt->single_pause_mode ? SINGLE_PAUSE_MODE : 0;
+
 	writel_relaxed(csr1, adpt->csr + EMAC_EMAC_WRAPPER_CSR1);
 
 	writel_relaxed(mac, adpt->base + EMAC_MAC_CTRL);
diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c b/drivers/net/ethernet/qualcomm/emac/emac.c
index 60850bfa3d32..759543512117 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -443,6 +443,9 @@ static void emac_init_adapter(struct emac_adapter *adpt)
 
 	/* default to automatic flow control */
 	adpt->automatic = true;
+
+	/* Disable single-pause-frame mode by default */
+	adpt->single_pause_mode = false;
 }
 
 /* Get the clock */
diff --git a/drivers/net/ethernet/qualcomm/emac/emac.h b/drivers/net/ethernet/qualcomm/emac/emac.h
index 8ee4ec6aef2e..d7c9f44209d4 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.h
+++ b/drivers/net/ethernet/qualcomm/emac/emac.h
@@ -363,6 +363,9 @@ struct emac_adapter {
 	bool				tx_flow_control;
 	bool				rx_flow_control;
 
+	/* True == use single-pause-frame mode. */
+	bool				single_pause_mode;
+
 	/* Ring parameter */
 	u8				tpd_burst;
 	u8				rfd_burst;
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related

* [PATCH iproute2 v2] ip: ipaddress: fix missing space after prefixlen
From: Julien Fortin @ 2017-09-20 20:26 UTC (permalink / raw)
  To: netdev; +Cc: roopa, nikolay, dsa, sd, Julien Fortin

From: Julien Fortin <julien@cumulusnetworks.com>

Fixes: d0e720111aad2 ("ip: ipaddress.c: add support for json output")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Julien Fortin <julien@cumulusnetworks.com>
---
 ip/ipaddress.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 97971450..fb496bbb 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -1604,7 +1604,7 @@ int print_addrinfo(const struct sockaddr_nl *who, struct nlmsghdr *n,
 					   format_host_rta(ifa->ifa_family,
 							   rta_tb[IFA_ADDRESS]));
 		}
-		print_int(PRINT_ANY, "prefixlen", "/%d", ifa->ifa_prefixlen);
+		print_int(PRINT_ANY, "prefixlen", "/%d ", ifa->ifa_prefixlen);
 	}
 
 	if (brief)
-- 
2.14.1

^ permalink raw reply related

* Re: [RFC PATCH] can: m_can: Support higher speed CAN-FD bitrates
From: Franklin S Cooper Jr @ 2017-09-20 20:19 UTC (permalink / raw)
  To: Yang, Wenyou, Sekhar Nori, wg, mkl, mario.huettel, socketcan,
	quentin.schulz, edumazet, linux-can, netdev, linux-kernel
  Cc: Wenyou Yang, Dong Aisheng
In-Reply-To: <532e09ae-775d-e8aa-e468-78e148c650da@Microchip.com>

Hi Wenyou,

On 09/17/2017 10:47 PM, Yang, Wenyou wrote:
> 
> 
> On 2017/9/14 13:06, Sekhar Nori wrote:
>> On Thursday 14 September 2017 03:28 AM, Franklin S Cooper Jr wrote:
>>>
>>> On 08/18/2017 02:39 PM, Franklin S Cooper Jr wrote:
>>>> During test transmitting using CAN-FD at high bitrates (4 Mbps) only
>>>> resulted in errors. Scoping the signals I noticed that only a single
>>>> bit
>>>> was being transmitted and with a bit more investigation realized the
>>>> actual
>>>> MCAN IP would go back to initialization mode automatically.
>>>>
>>>> It appears this issue is due to the MCAN needing to use the Transmitter
>>>> Delay Compensation Mode as defined in the MCAN User's Guide. When this
>>>> mode is used the User's Guide indicates that the Transmitter Delay
>>>> Compensation Offset register should be set. The document mentions
>>>> that this
>>>> register should be set to (1/dbitrate)/2*(Func Clk Freq).
>>>>
>>>> Additional CAN-CIA's "Bit Time Requirements for CAN FD" document
>>>> indicates
>>>> that this TDC mode is only needed for data bit rates above 2.5 Mbps.
>>>> Therefore, only enable this mode and only set TDCO when the data bit
>>>> rate
>>>> is above 2.5 Mbps.
>>>>
>>>> Signed-off-by: Franklin S Cooper Jr <fcooper@ti.com>
>>>> ---
>>>> I'm pretty surprised that this hasn't been implemented already since
>>>> the primary purpose of CAN-FD is to go beyond 1 Mbps and the MCAN IP
>>>> supports up to 10 Mbps.
>>>>
>>>> So it will be nice to get comments from users of this driver to
>>>> understand
>>>> if they have been able to use CAN-FD beyond 2.5 Mbps without this
>>>> patch.
>>>> If they haven't what did they do to get around it if they needed higher
>>>> speeds.
>>>>
>>>> Meanwhile I plan on testing this using a more "realistic" CAN bus to
>>>> insure
>>>> everything still works at 5 Mbps which is the max speed of my CAN
>>>> transceiver.
>>> ping. Anyone has any thoughts on this?
>> I added Dong who authored the m_can driver and Wenyou who added the only
>> in-kernel user of the driver for any help.
> I tested it on SAMA5D2 Xplained board both with and without this patch, 
> both work with the 4M bps data bit rate.

Thank you for testing this out. Its interesting that you have been able
to use higher speeds without this patch. What is the CAN transceiver
being used on the SAMA5D2 Xplained board? I tried looking at the
schematic but it seems the CAN signals are used on an extension board
which I can't find the schematic for. Also do you mind sharing your test
setup? Were you doing a short point to point test?

Thank You,
Franklin

> 
>>
>> Thanks,
>> Sekhar
>>
>>>>   drivers/net/can/m_can/m_can.c | 24 +++++++++++++++++++++++-
>>>>   1 file changed, 23 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/can/m_can/m_can.c
>>>> b/drivers/net/can/m_can/m_can.c
>>>> index f4947a7..720e073 100644
>>>> --- a/drivers/net/can/m_can/m_can.c
>>>> +++ b/drivers/net/can/m_can/m_can.c
>>>> @@ -126,6 +126,12 @@ enum m_can_mram_cfg {
>>>>   #define DBTP_DSJW_SHIFT        0
>>>>   #define DBTP_DSJW_MASK        (0xf << DBTP_DSJW_SHIFT)
>>>>   +/* Transmitter Delay Compensation Register (TDCR) */
>>>> +#define TDCR_TDCO_SHIFT        8
>>>> +#define TDCR_TDCO_MASK        (0x7F << TDCR_TDCO_SHIFT)
>>>> +#define TDCR_TDCF_SHIFT        0
>>>> +#define TDCR_TDCF_MASK        (0x7F << TDCR_TDCO_SHIFT)
>>>> +
>>>>   /* Test Register (TEST) */
>>>>   #define TEST_LBCK        BIT(4)
>>>>   @@ -977,6 +983,8 @@ static int m_can_set_bittiming(struct
>>>> net_device *dev)
>>>>       const struct can_bittiming *dbt = &priv->can.data_bittiming;
>>>>       u16 brp, sjw, tseg1, tseg2;
>>>>       u32 reg_btp;
>>>> +    u32 enable_tdc = 0;
>>>> +    u32 tdco;
>>>>         brp = bt->brp - 1;
>>>>       sjw = bt->sjw - 1;
>>>> @@ -991,9 +999,23 @@ static int m_can_set_bittiming(struct
>>>> net_device *dev)
>>>>           sjw = dbt->sjw - 1;
>>>>           tseg1 = dbt->prop_seg + dbt->phase_seg1 - 1;
>>>>           tseg2 = dbt->phase_seg2 - 1;
>>>> +
>>>> +        /* TDC is only needed for bitrates beyond 2.5 MBit/s
>>>> +         * Specified in the "Bit Time Requirements for CAN FD"
>>>> document
>>>> +         */
>>>> +        if (dbt->bitrate > 2500000) {
>>>> +            enable_tdc = DBTP_TDC;
>>>> +            /* Equation based on Bosch's M_CAN User Manual's
>>>> +             * Transmitter Delay Compensation Section
>>>> +             */
>>>> +            tdco = priv->can.clock.freq / (dbt->bitrate * 2);
>>>> +            m_can_write(priv, M_CAN_TDCR, tdco << TDCR_TDCO_SHIFT);
>>>> +        }
>>>> +
>>>>           reg_btp = (brp << DBTP_DBRP_SHIFT) | (sjw <<
>>>> DBTP_DSJW_SHIFT) |
>>>>               (tseg1 << DBTP_DTSEG1_SHIFT) |
>>>> -            (tseg2 << DBTP_DTSEG2_SHIFT);
>>>> +            (tseg2 << DBTP_DTSEG2_SHIFT) | enable_tdc;
>>>> +
>>>>           m_can_write(priv, M_CAN_DBTP, reg_btp);
>>>>       }
>>>>  
> 
> Regards,
> Wenyou Yang
> 

^ permalink raw reply

* Re: [iproute2 json PATCH] ip: ipaddress: fix missing space after prefixlen
From: Julien Fortin @ 2017-09-20 20:13 UTC (permalink / raw)
  To: netdev; +Cc: Roopa Prabhu, Nikolay Aleksandrov, David Ahern, sd, Julien Fortin
In-Reply-To: <20170920200421.94514-1-julien@cumulusnetworks.com>

Sorry Sabrina, looks like i forgot the reported-by tag, i'll send a v2...

On Wed, Sep 20, 2017 at 1:04 PM, Julien Fortin
<julien@cumulusnetworks.com> wrote:
> From: Julien Fortin <julien@cumulusnetworks.com>
>
> Fixes: d0e720111aad2 ("ip: ipaddress.c: add support for json output")
>
> Signed-off-by: Julien Fortin <julien@cumulusnetworks.com>
> ---
>  ip/ipaddress.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/ip/ipaddress.c b/ip/ipaddress.c
> index 97971450..fb496bbb 100644
> --- a/ip/ipaddress.c
> +++ b/ip/ipaddress.c
> @@ -1604,7 +1604,7 @@ int print_addrinfo(const struct sockaddr_nl *who, struct nlmsghdr *n,
>                                            format_host_rta(ifa->ifa_family,
>                                                            rta_tb[IFA_ADDRESS]));
>                 }
> -               print_int(PRINT_ANY, "prefixlen", "/%d", ifa->ifa_prefixlen);
> +               print_int(PRINT_ANY, "prefixlen", "/%d ", ifa->ifa_prefixlen);
>         }
>
>         if (brief)
> --
> 2.14.1
>

^ permalink raw reply

* Re: cross namespace interface notification for tun devices
From: Jason A. Donenfeld @ 2017-09-20 20:13 UTC (permalink / raw)
  To: Cong Wang; +Cc: Netdev, Mathias
In-Reply-To: <CAM_iQpXhjuh5NJJiAwBJG4EpkWXWz74WS7RFjt5L1mDV9nKTVw@mail.gmail.com>

On Wed, Sep 20, 2017 at 8:29 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> Sounds like we should set NETIF_F_NETNS_LOCAL for tun
> device.

Absolutely do not do this under any circumstances. This would be a
regression and would break API compatibility. As I wrote in my first
email, it's already possible to sleep-loop for that information using
the tun device's fd; I'm just looking for a better event-based
approach.

> What is your legitimate use case of send/receive packet to/from
> a tun device in a different netns?

Because sometimes it's very nice to be able to move network interfaces
that use tun devices into different namespaces, for some xnamespace
proxying.

What Dan described in the email he just sent is exactly this use case.

In WireGuard (a kernel thing), I have facilities for this --
https://www.wireguard.com/netns/ . Now I'm working on the userspace
version and would like to expose the same utility.

Anyway, the purpose of me sending this message to the list was not to
question the "legitimacy" of my application usage, but rather to
elicit feedback on two specific things:

1. to determine if there's already a mechanism in place for this that
I've overlooked; and
2. to determine particularities of me implementing a mechanism, if
it's not already there.

I'm slightly more convinced that there isn't currently a mechanism for
this. It seems like the easiest way, therefore, would be some kind of
control message that could be poll'd for, using the existing
per-process fd. That way there wouldn't be any violations of the
current namespace situation, yet processes could still get event
notifications as needed.

^ permalink raw reply

* Re: [PATCH RFC V1 net-next 0/6] Time based packet transmission
From: Richard Cochran @ 2017-09-20 20:11 UTC (permalink / raw)
  To: levipearson
  Cc: rcochran, netdev, linux-kernel, intel-wired-lan, vinicius.gomes,
	andre.guedes, john.stultz, jesus.sanchez-palencia, henrik, tglx,
	anna-maria, davem
In-Reply-To: <20170920173533.32537-1-levipearson@gmail.com>

On Wed, Sep 20, 2017 at 11:35:33AM -0600, levipearson@gmail.com wrote:
> Anyway, I am wholly in favor of this proposal--in fact, it is very similar to
> a patch set I shared with Eric Mann and others at Intel in early Dec 2016 with
> the intention to get some early feedback before submitting here. I never heard
> back and got busy with other things. I only mention this since you said
> elsewhere that you got this idea from Eric Mann yourself, and I am curious
> whether Eric and I came up with it independently (which I would not be
> surprised at).

Well, I actually thought of placing the Tx time in a CMSG all by
myself, but later I found Eric's talk from 2012,

  https://linuxplumbers.ubicast.tv/videos/linux-network-enabling-requirements-for-audiovideo-bridging-avb/

and so I wanted to give him credit.

Thanks,
Richard

^ permalink raw reply

* [iproute2 json PATCH] ip: ipaddress: fix missing space after prefixlen
From: Julien Fortin @ 2017-09-20 20:04 UTC (permalink / raw)
  To: netdev; +Cc: roopa, nikolay, dsa, sd, Julien Fortin

From: Julien Fortin <julien@cumulusnetworks.com>

Fixes: d0e720111aad2 ("ip: ipaddress.c: add support for json output")

Signed-off-by: Julien Fortin <julien@cumulusnetworks.com>
---
 ip/ipaddress.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 97971450..fb496bbb 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -1604,7 +1604,7 @@ int print_addrinfo(const struct sockaddr_nl *who, struct nlmsghdr *n,
 					   format_host_rta(ifa->ifa_family,
 							   rta_tb[IFA_ADDRESS]));
 		}
-		print_int(PRINT_ANY, "prefixlen", "/%d", ifa->ifa_prefixlen);
+		print_int(PRINT_ANY, "prefixlen", "/%d ", ifa->ifa_prefixlen);
 	}
 
 	if (brief)
-- 
2.14.1

^ permalink raw reply related

* Re: cross namespace interface notification for tun devices
From: Dan Williams @ 2017-09-20 19:57 UTC (permalink / raw)
  To: Cong Wang, Jason A. Donenfeld; +Cc: Netdev, Mathias
In-Reply-To: <CAM_iQpXhjuh5NJJiAwBJG4EpkWXWz74WS7RFjt5L1mDV9nKTVw@mail.gmail.com>

On Wed, 2017-09-20 at 11:29 -0700, Cong Wang wrote:
> On Tue, Sep 19, 2017 at 2:02 PM, Jason A. Donenfeld <Jason@zx2c4.com>
> wrote:
> > On Tue, Sep 19, 2017 at 10:40 PM, Cong Wang <xiyou.wangcong@gmail.c
> > om> wrote:
> > > By "notification" I assume you mean netlink notification.
> > 
> > Yes, netlink notification.
> > 
> > > The question is why does the process in A still care about
> > > the device sitting in B?
> > > 
> > > Also, the process should be able to receive a last notification
> > > on IFF_UP|IFF_RUNNING before device is finally moved to B.
> > > After this point, it should not have any relation to netns A
> > > any more, like the device were completely gone.
> > 
> > That's very clearly not the case with a tun device. Tun devices
> > work
> > by letting a userspace process control the inputs (ndo_start_xmit)
> > and
> > outputs (netif_rx) of the actual network device. This controlling
> > userspace process needs to know when its own interface that it
> > controls goes up and down. In the kernel, we can do this by just
> > checking dev->flags&IFF_UP, and receive notifications on ndo_open
> > and
> > ndo_stop. In userspace, the controlling process looses the ability
> > to
> > receive notifications like ndo_open/ndo_stop when the interface is
> > moved to a new namespace. After the interface is moved to a
> > namespace,
> > the process will still control inputs and ouputs (ndo_start_xmit
> > and
> > netif_rx), but it will no longer receive netlink notifications for
> > the
> > equivalent of ndo_open and ndo_stop. This is problematic.
> 
> Sounds like we should set NETIF_F_NETNS_LOCAL for tun
> device.
> 
> What is your legitimate use case of send/receive packet to/from
> a tun device in a different netns?

One thought: run openvpn in the master netns, but put its tun0
interface into an application's netns.  Per-application VPN,
essentially?  Or maybe that's not how people do this kind of thing, but
it's a thought.

Dan

^ permalink raw reply

* Re: usb/net/p54: trying to register non-static key in p54_unregister_leds
From: Johannes Berg @ 2017-09-20 19:55 UTC (permalink / raw)
  To: Christian Lamparter, Andrey Konovalov
  Cc: Kalle Valo, linux-wireless, netdev, LKML, Dmitry Vyukov,
	Kostya Serebryany, syzkaller, Stephen Boyd, Tejun Heo, Yong Zhang
In-Reply-To: <2277141.bYDD1vAb9W@debian64>

On Wed, 2017-09-20 at 21:27 +0200, Christian Lamparter wrote:

> It seems this is caused as a result of:
>     -> lock_map_acquire(&work->lockdep_map);
> 	    lock_map_release(&work->lockdep_map);
> 
>     in flush_work() [0]

Agree.

> This was added by:
> 
> 	commit 0976dfc1d0cd80a4e9dfaf87bd8744612bde475a
> 	Author: Stephen Boyd <sboyd@codeaurora.org>
> 	Date:   Fri Apr 20 17:28:50 2012 -0700
> 
> 	workqueue: Catch more locking problems with flush_work()

Yes, but that doesn't matter.
    
> Looking at the Stephen's patch, it's clear that it was made
> with "static DECLARE_WORK(work, my_work)" in mind. However
> p54's led_work is "per-device", hence it is stored in the
> devices context p54_common, which is dynamically allocated.
> So, maybe revert Stephen's patch?

I disagree - as the lockdep warning says:

> > INFO: trying to register non-static key.
> > the code is fine but needs lockdep annotation.
> > turning off the locking correctness validator.

What it needs is to actually correctly go through initializing the work
at least once.

Without more information, I can't really say what's going on, but I
assume that something is failing and p54_unregister_leds() is getting
invoked without p54_init_leds() having been invoked, so essentially
it's trying to flush a work that was never initialized?

INIT_DELAYED_WORK() does, after all, initialize the lockdep map
properly via __INIT_WORK().

johannes

^ permalink raw reply

* Re: [PATCH net-next 08/14] gtp: Support encpasulating over IPv6
From: David Miller @ 2017-09-20 19:45 UTC (permalink / raw)
  To: tom; +Cc: tom, netdev, pablo, laforge, rohit
In-Reply-To: <CALx6S36cJEU2LkbNf-d8kUHCPj78bt5Y2Bd_rdvjxJ9epOEqAQ@mail.gmail.com>

From: Tom Herbert <tom@herbertland.com>
Date: Wed, 20 Sep 2017 11:03:52 -0700

> On Mon, Sep 18, 2017 at 9:19 PM, David Miller <davem@davemloft.net> wrote:
>> From: Tom Herbert <tom@quantonium.net>
>> Date: Mon, 18 Sep 2017 17:38:58 -0700
>>
>>> Allow peers to be specified by IPv6 addresses.
>>>
>>> Signed-off-by: Tom Herbert <tom@quantonium.net>
>>
>> Hmmm, can you just check the socket family or something like that?
> 
> I'm not sure what code you're referring to.

There is a socket associated with the tunnel to do the encapsulation
and it has an address family, right?

^ permalink raw reply

* Re: Regression in throughput between kvm guests over virtual bridge
From: Matthew Rosato @ 2017-09-20 19:38 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst
In-Reply-To: <1173ab1f-e2b6-26b3-8c3c-bd5ceaa1bd8e@redhat.com>


> Seems to make some progress on wakeup mitigation. Previous patch tries
> to reduce the unnecessary traversal of waitqueue during rx. Attached
> patch goes even further which disables rx polling during processing tx.
> Please try it to see if it has any difference.

Unfortunately, this patch doesn't seem to have made a difference.  I
tried runs with both this patch and the previous patch applied, as well
as only this patch applied for comparison (numbers from vhost thread of
sending VM):

4.12    4.13     patch1   patch2   patch1+2
2.00%   +3.69%   +2.55%   +2.81%   +2.69%   [...] __wake_up_sync_key

In each case, the regression in throughput was still present.

> And two questions:
> - Is the issue existed if you do uperf between 2VMs (instead of 4VMs)

Verified that the second set of guests are not actually required, I can
see the regression with only 2 VMs.

> - Can enable batching in the tap of sending VM improve the performance
> (ethtool -C $tap rx-frames 64)

I tried this, but it did not help (actually seemed to make things a
little worse)

^ permalink raw reply

* Re: usb/net/p54: trying to register non-static key in p54_unregister_leds
From: Christian Lamparter @ 2017-09-20 19:27 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Kalle Valo, linux-wireless, netdev, LKML, Dmitry Vyukov,
	Kostya Serebryany, syzkaller, Stephen Boyd, Tejun Heo, Yong Zhang
In-Reply-To: <CAAeHK+xPoNn7i5xOzEqHX6tfDpmpb1rd0hcAxZpTHKi6B8xzsw@mail.gmail.com>

On Wednesday, September 20, 2017 8:37:08 PM CEST Andrey Konovalov wrote:
> Hi!
> 
> I've got the following report while fuzzing the kernel with syzkaller.
> 
> On commit ebb2c2437d8008d46796902ff390653822af6cc4 (Sep 18).
> 
> INFO: trying to register non-static key.
> the code is fine but needs lockdep annotation.
> turning off the locking correctness validator.
> CPU: 1 PID: 1404 Comm: kworker/1:1 Not tainted
> 4.14.0-rc1-42251-gebb2c2437d80-dirty #205
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Workqueue: usb_hub_wq hub_event
> Call Trace:
>  __dump_stack lib/dump_stack.c:16
>  dump_stack+0x292/0x395 lib/dump_stack.c:52
>  register_lock_class+0x6c4/0x1a00 kernel/locking/lockdep.c:769
>  __lock_acquire+0x27e/0x4550 kernel/locking/lockdep.c:3385
>  lock_acquire+0x259/0x620 kernel/locking/lockdep.c:4002
>  flush_work+0xf0/0x8c0 kernel/workqueue.c:2886
>  __cancel_work_timer+0x51d/0x870 kernel/workqueue.c:2961
>  cancel_delayed_work_sync+0x1f/0x30 kernel/workqueue.c:3081
>  p54_unregister_leds+0x6c/0xc0 drivers/net/wireless/intersil/p54/led.c:160
>  p54_unregister_common+0x3d/0xb0 drivers/net/wireless/intersil/p54/main.c:856
>  p54u_disconnect+0x86/0x120 drivers/net/wireless/intersil/p54/p54usb.c:1073
>  usb_unbind_interface+0x21c/0xa90 drivers/usb/core/driver.c:423
>  __device_release_driver drivers/base/dd.c:861
>  device_release_driver_internal+0x4f4/0x5c0 drivers/base/dd.c:893
>  device_release_driver+0x1e/0x30 drivers/base/dd.c:918
>  bus_remove_device+0x2f4/0x4b0 drivers/base/bus.c:565
>  device_del+0x5c4/0xab0 drivers/base/core.c:1985
>  usb_disable_device+0x1e9/0x680 drivers/usb/core/message.c:1170
>  usb_disconnect+0x260/0x7a0 drivers/usb/core/hub.c:2124
>  hub_port_connect drivers/usb/core/hub.c:4754
>  hub_port_connect_change drivers/usb/core/hub.c:5009
>  port_event drivers/usb/core/hub.c:5115
>  hub_event+0x1318/0x3740 drivers/usb/core/hub.c:5195
>  process_one_work+0xc7f/0x1db0 kernel/workqueue.c:2119
>  process_scheduled_works kernel/workqueue.c:2179
>  worker_thread+0xb2b/0x1850 kernel/workqueue.c:2255
>  kthread+0x3a1/0x470 kernel/kthread.c:231
>  ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
> 

It seems this is caused as a result of:
    -> lock_map_acquire(&work->lockdep_map);
	    lock_map_release(&work->lockdep_map);

    in flush_work() [0]

This was added by:

	commit 0976dfc1d0cd80a4e9dfaf87bd8744612bde475a
	Author: Stephen Boyd <sboyd@codeaurora.org>
	Date:   Fri Apr 20 17:28:50 2012 -0700

	workqueue: Catch more locking problems with flush_work()
    
Looking at the Stephen's patch, it's clear that it was made
with "static DECLARE_WORK(work, my_work)" in mind. However
p54's led_work is "per-device", hence it is stored in the
devices context p54_common, which is dynamically allocated.
So, maybe revert Stephen's patch?

[0] <http://elixir.free-electrons.com/linux/latest/source/kernel/workqueue.c#L2853>

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Paweł Staszewski @ 2017-09-20 19:23 UTC (permalink / raw)
  To: Cong Wang, Eric Dumazet
  Cc: Wei Wang, Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <07f0c800-7ee7-1ebe-9f5f-eaad52a7b42d@itcare.pl>



W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze:
>
>
> W dniu 2017-09-20 o 20:36, Cong Wang pisze:
>> On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet 
>> <eric.dumazet@gmail.com> wrote:
>>> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>>>> but dmesg at this time shows nothing about interfaces or flaps.
>>>>
>>>> This is very odd.
>>>>
>>>> We only free netdevice in free_netdev() and it is only called when
>>>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>>>> to be NULL.
>>> If there is a missing dev_hold() or one dev_put() in excess,
>>> this would allow the netdev to be freed too soon.
>>>
>>> -> Use after free.
>>> memory holding netdev could be reallocated-cleared by some other kernel
>>> user.
>>>
>> Sure, but only unregister could trigger a free. If there is no 
>> unregister,
>> like what Pawel claims, then there is no free, the refcnt just goes to
>> 0 but the memory is still there.
>>
> About possible mistake from my side with bisect - i can judge too 
> early that some bisect was good
> the road was:
> git bisect start
> # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 
> 'pinctrl-v4.13-1' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
> git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
> # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' 
> of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
> git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
> # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid 
> using stack larger than 1024.
> git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f
> # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 
> 'udp-reduce-cache-pressure'
> git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20
> # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 
> 's390-net-updates-part-2'
> git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230
> # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 
> 'bpf-ctx-narrow'
> git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70
> # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove 
> cp_outgoing
> git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2
> # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add 
> TCP_MD5SIG_EXT socket option to set a key address prefix
> git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d
> # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a 
> new function dst_dev_put()
>
> And currently have this running for about 4 hours without problems.
>
>
>
> git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
> # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove 
> DST_NOCACHE flag
>
> Here for sure - panic
>
> git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2
> # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call 
> dst_hold_safe() properly
> git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112
> # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call 
> dst_hold_safe() properly
> git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f
> # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take 
> dst->__refcnt for insertion into fib6 tree
>
> im not 100% sure tor last two
> Will test them again starting from
> [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() 
> properly
>
>
> git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855
> # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC 
> and remove the operation of dst_free()
>
>
>
> git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
> # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: 
> mark DST_NOGC and remove the operation of dst_free()
>
>
>
What i can say more
I can reproduce this on any server with similar configuration
the difference can be teamd instead of bonding
ixgbe or i40e and mlx5
Same problems

vlans - more or less prefixes learned from bgp -> zebra -> netlink -> kernel
But normally in lab when using only plain routing no bgpd and about 128 
vlans - with 128 routes - cant reproduce this - this apperas only with 
bgp - minimum where i can reproduce this was about 130k prefixes with 
about 286 nexthops

^ permalink raw reply

* Re: Latest net-next from GIT panic
From: Paweł Staszewski @ 2017-09-20 19:13 UTC (permalink / raw)
  To: Cong Wang, Eric Dumazet
  Cc: Wei Wang, Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <CAM_iQpVF3Gf3G=543o8g+5N2dtEL4DcUPN3a6=L8V_G0CXZ65w@mail.gmail.com>



W dniu 2017-09-20 o 20:36, Cong Wang pisze:
> On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>>> but dmesg at this time shows nothing about interfaces or flaps.
>>>
>>> This is very odd.
>>>
>>> We only free netdevice in free_netdev() and it is only called when
>>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>>> to be NULL.
>> If there is a missing dev_hold() or one dev_put() in excess,
>> this would allow the netdev to be freed too soon.
>>
>> -> Use after free.
>> memory holding netdev could be reallocated-cleared by some other kernel
>> user.
>>
> Sure, but only unregister could trigger a free. If there is no unregister,
> like what Pawel claims, then there is no free, the refcnt just goes to
> 0 but the memory is still there.
>
About possible mistake from my side with bisect - i can judge too early 
that some bisect was good
the road was:
git bisect start
# bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 
'pinctrl-v4.13-1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' 
of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using 
stack larger than 1024.
git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f
# good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 
'udp-reduce-cache-pressure'
git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20
# bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 
's390-net-updates-part-2'
git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230
# good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 
'bpf-ctx-narrow'
git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70
# good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove 
cp_outgoing
git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2
# bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add 
TCP_MD5SIG_EXT socket option to set a key address prefix
git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d
# good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new 
function dst_dev_put()

And currently have this running for about 4 hours without problems.



git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
# bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove 
DST_NOCACHE flag

Here for sure - panic

git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2
# bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call 
dst_hold_safe() properly
git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112
# good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call 
dst_hold_safe() properly
git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f
# bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take 
dst->__refcnt for insertion into fib6 tree

im not 100% sure tor last two
Will test them again starting from
[95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly


git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855
# bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC 
and remove the operation of dst_free()



git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
# first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: 
mark DST_NOGC and remove the operation of dst_free()

^ permalink raw reply

* [PATCH] hv_netvsc: fix send buffer failure on MTU change
From: Stephen Hemminger @ 2017-09-20 18:17 UTC (permalink / raw)
  To: davem, kys; +Cc: netdev, Alex Ng, Stephen Hemminger

From: Alex Ng <alexng@microsoft.com>

If MTU is changed the host would reject the send buffer change.
This problem is result of recent change to allow changing send
buffer size.

Every time we change the MTU, we store the previous net_device section
count before destroying the buffer, but we don’t store the previous
section size. When we reinitialize the buffer, its size is calculated
by multiplying the previous count and previous size. Since we
continuously increase the MTU, the host returns us a decreasing count
value while the section size is reinitialized to 1728 bytes every
time.

This eventually leads to a condition where the calculated buf_size is
so small that the host rejects it.

Fixes: 8b5327975ae1 ("netvsc: allow controlling send/recv buffer size")
Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h | 2 ++
 drivers/net/hyperv/netvsc.c     | 7 ++-----
 drivers/net/hyperv/netvsc_drv.c | 8 ++++++++
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index d98cdfb1536b..5176be76ca7d 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -150,6 +150,8 @@ struct netvsc_device_info {
 	u32  num_chn;
 	u32  send_sections;
 	u32  recv_sections;
+	u32  send_section_size;
+	u32  recv_section_size;
 };
 
 enum rndis_device_state {
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index a5511b7326af..8d5077fb0492 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -76,9 +76,6 @@ static struct netvsc_device *alloc_net_device(void)
 	net_device->max_pkt = RNDIS_MAX_PKT_DEFAULT;
 	net_device->pkt_align = RNDIS_PKT_ALIGN_DEFAULT;
 
-	net_device->recv_section_size = NETVSC_RECV_SECTION_SIZE;
-	net_device->send_section_size = NETVSC_SEND_SECTION_SIZE;
-
 	init_completion(&net_device->channel_init_wait);
 	init_waitqueue_head(&net_device->subchan_open);
 	INIT_WORK(&net_device->subchan_work, rndis_set_subchannel);
@@ -262,7 +259,7 @@ static int netvsc_init_buf(struct hv_device *device,
 	int ret = 0;
 
 	/* Get receive buffer area. */
-	buf_size = device_info->recv_sections * net_device->recv_section_size;
+	buf_size = device_info->recv_sections * device_info->recv_section_size;
 	buf_size = roundup(buf_size, PAGE_SIZE);
 
 	net_device->recv_buf = vzalloc(buf_size);
@@ -344,7 +341,7 @@ static int netvsc_init_buf(struct hv_device *device,
 		goto cleanup;
 
 	/* Now setup the send buffer. */
-	buf_size = device_info->send_sections * net_device->send_section_size;
+	buf_size = device_info->send_sections * device_info->send_section_size;
 	buf_size = round_up(buf_size, PAGE_SIZE);
 
 	net_device->send_buf = vzalloc(buf_size);
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index d4902ee5f260..a32ae02e1b6c 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -848,7 +848,9 @@ static int netvsc_set_channels(struct net_device *net,
 	device_info.num_chn = count;
 	device_info.ring_size = ring_size;
 	device_info.send_sections = nvdev->send_section_cnt;
+	device_info.send_section_size = nvdev->send_section_size;
 	device_info.recv_sections = nvdev->recv_section_cnt;
+	device_info.recv_section_size = nvdev->recv_section_size;
 
 	rndis_filter_device_remove(dev, nvdev);
 
@@ -963,7 +965,9 @@ static int netvsc_change_mtu(struct net_device *ndev, int mtu)
 	device_info.ring_size = ring_size;
 	device_info.num_chn = nvdev->num_chn;
 	device_info.send_sections = nvdev->send_section_cnt;
+	device_info.send_section_size = nvdev->send_section_size;
 	device_info.recv_sections = nvdev->recv_section_cnt;
+	device_info.recv_section_size = nvdev->recv_section_size;
 
 	rndis_filter_device_remove(hdev, nvdev);
 
@@ -1485,7 +1489,9 @@ static int netvsc_set_ringparam(struct net_device *ndev,
 	device_info.num_chn = nvdev->num_chn;
 	device_info.ring_size = ring_size;
 	device_info.send_sections = new_tx;
+	device_info.send_section_size = nvdev->send_section_size;
 	device_info.recv_sections = new_rx;
+	device_info.recv_section_size = nvdev->recv_section_size;
 
 	netif_device_detach(ndev);
 	was_opened = rndis_filter_opened(nvdev);
@@ -1934,7 +1940,9 @@ static int netvsc_probe(struct hv_device *dev,
 	device_info.ring_size = ring_size;
 	device_info.num_chn = VRSS_CHANNEL_DEFAULT;
 	device_info.send_sections = NETVSC_DEFAULT_TX;
+	device_info.send_section_size = NETVSC_SEND_SECTION_SIZE;
 	device_info.recv_sections = NETVSC_DEFAULT_RX;
+	device_info.recv_section_size = NETVSC_RECV_SECTION_SIZE;
 
 	nvdev = rndis_filter_device_add(dev, &device_info);
 	if (IS_ERR(nvdev)) {
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH net-next] bpf: Optimize lpm trie delete
From: Craig Gallek @ 2017-09-20 18:51 UTC (permalink / raw)
  To: Daniel Mack; +Cc: Alexei Starovoitov, Daniel Borkmann, David S . Miller, netdev
In-Reply-To: <47e47f0d-d04d-b52b-647a-7883aa5f3268@zonque.org>

On Wed, Sep 20, 2017 at 12:51 PM, Daniel Mack <daniel@zonque.org> wrote:
> Hi Craig,
>
> Thanks, this looks much cleaner already :)
>
> On 09/20/2017 06:22 PM, Craig Gallek wrote:
>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> index 9d58a576b2ae..b5a7d70ec8b5 100644
>> --- a/kernel/bpf/lpm_trie.c
>> +++ b/kernel/bpf/lpm_trie.c
>> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
>>       struct lpm_trie_node __rcu **trim;
>>       struct lpm_trie_node *node;
>>       unsigned long irq_flags;
>> -     unsigned int next_bit;
>> +     unsigned int next_bit = 0;
>
> This default assignment seems wrong, and I guess you only added it to
> squelch a compiler warning?
Yes, this variable is only initialized after the lookup iterations
below (meaning it will never be initialized the the root-removal
case).

> [...]
>
>> +     /* If the node has one child, we may be able to collapse the tree
>> +      * while removing this node if the node's child is in the same
>> +      * 'next bit' slot as this node was in its parent or if the node
>> +      * itself is the root.
>> +      */
>> +     if (trim == &trie->root) {
>> +             next_bit = node->child[0] ? 0 : 1;
>> +             rcu_assign_pointer(trie->root, node->child[next_bit]);
>> +             kfree_rcu(node, rcu);
>
> I don't think you should treat this 'root' case special.
>
> Instead, move the 'next_bit' assignment outside of the condition ...
I'm not quite sure I follow.  Are you saying do something like this:

if (trim == &trie->root) {
  next_bit = node->child[0] ? 0 : 1;
}
if (rcu_access_pointer(node->child[next_bit])) {
...

This would save a couple lines of code, but I think the as-is
implementation is slightly easier to understand.  I don't have a
strong opinion either way, though.

Thanks for the pointers,
Craig

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox