Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v2 net-next 1/4] umh: introduce fork_usermode_blob() helper
From: Alexei Starovoitov @ 2018-05-05  1:37 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Alexei Starovoitov, davem, daniel, torvalds, gregkh, luto, netdev,
	linux-kernel, kernel-team, Al Viro, David Howells, Mimi Zohar,
	Kees Cook, Andrew Morton, Dominik Brodowski, Hugh Dickins,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, David Airlie,
	Rafael J. Wysocki, Linux FS Devel, p
In-Reply-To: <20180504195642.GB12838@wotan.suse.de>

On Fri, May 04, 2018 at 07:56:43PM +0000, Luis R. Rodriguez wrote:
> What a mighty short list of reviewers. Adding some more. My review below.
> I'd appreciate a Cc on future versions of these patches.

sure.

> On Wed, May 02, 2018 at 09:36:01PM -0700, Alexei Starovoitov wrote:
> > Introduce helper:
> > int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
> > struct umh_info {
> >        struct file *pipe_to_umh;
> >        struct file *pipe_from_umh;
> >        pid_t pid;
> > };
> > 
> > that GPLed kernel modules (signed or unsigned) can use it to execute part
> > of its own data as swappable user mode process.
> > 
> > The kernel will do:
> > - mount "tmpfs"
> 
> Actually its a *shared* vfsmount tmpfs for all umh blobs.

yep

> > - allocate a unique file in tmpfs
> > - populate that file with [data, data + len] bytes
> > - user-mode-helper code will do_execve that file and, before the process
> >   starts, the kernel will create two unix pipes for bidirectional
> >   communication between kernel module and umh
> > - close tmpfs file, effectively deleting it
> > - the fork_usermode_blob will return zero on success and populate
> >   'struct umh_info' with two unix pipes and the pid of the user process
> 
> But since its using UMH_WAIT_EXEC, all we can guarantee currently is the
> inception point was intended, well though out, and will run, but the return
> value in no way reflects the success or not of the execution. More below.

yep

> > As the first step in the development of the bpfilter project
> > the fork_usermode_blob() helper is introduced to allow user mode code
> > to be invoked from a kernel module. The idea is that user mode code plus
> > normal kernel module code are built as part of the kernel build
> > and installed as traditional kernel module into distro specified location,
> > such that from a distribution point of view, there is
> > no difference between regular kernel modules and kernel modules + umh code.
> > Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
> > by a kernel module doesn't make it any special from kernel and user space
> > tooling point of view.
> > 
> > Such approach enables kernel to delegate functionality traditionally done
> > by the kernel modules into the user space processes (either root or !root) and
> > reduces security attack surface of the new code. The buggy umh code would crash
> > the user process, but not the kernel. Another advantage is that umh code
> > of the kernel module can be debugged and tested out of user space
> > (e.g. opening the possibility to run clang sanitizers, fuzzers or
> > user space test suites on the umh code).
> > In case of the bpfilter project such architecture allows complex control plane
> > to be done in the user space while bpf based data plane stays in the kernel.
> > 
> > Since umh can crash, can be oom-ed by the kernel, killed by the admin,
> > the kernel module that uses them (like bpfilter) needs to manage life
> > time of umh on its own via two unix pipes and the pid of umh.
> > 
> > The exit code of such kernel module should kill the umh it started,
> > so that rmmod of the kernel module will cleanup the corresponding umh.
> > Just like if the kernel module does kmalloc() it should kfree() it in the exit code.
> > 
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  fs/exec.c               |  38 ++++++++---
> >  include/linux/binfmts.h |   1 +
> >  include/linux/umh.h     |  12 ++++
> >  kernel/umh.c            | 176 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  4 files changed, 215 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/exec.c b/fs/exec.c
> > index 183059c427b9..30a36c2a39bf 100644
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -1706,14 +1706,13 @@ static int exec_binprm(struct linux_binprm *bprm)
> >  /*
> >   * sys_execve() executes a new program.
> >   */
> > -static int do_execveat_common(int fd, struct filename *filename,
> > -			      struct user_arg_ptr argv,
> > -			      struct user_arg_ptr envp,
> > -			      int flags)
> > +static int __do_execve_file(int fd, struct filename *filename,
> > +			    struct user_arg_ptr argv,
> > +			    struct user_arg_ptr envp,
> > +			    int flags, struct file *file)
> >  {
> >  	char *pathbuf = NULL;
> >  	struct linux_binprm *bprm;
> > -	struct file *file;
> >  	struct files_struct *displaced;
> >  	int retval;
> 
> Keeping in mind a fuzzer...
> 
> Note, right below this, and not shown here in the hunk, is:
> 
>         if (IS_ERR(filename))                                                   
>                 return PTR_ERR(filename)
> >  
> > @@ -1752,7 +1751,8 @@ static int do_execveat_common(int fd, struct filename *filename,
> >  	check_unsafe_exec(bprm);
> >  	current->in_execve = 1;
> >  
> > -	file = do_open_execat(fd, filename, flags);
> > +	if (!file)
> > +		file = do_open_execat(fd, filename, flags);
> 
> 
> Here we now seem to allow !file and open the file with the passed fd as in
> the old days. This is an expected change.
> 
> >  	retval = PTR_ERR(file);
> >  	if (IS_ERR(file))
> >  		goto out_unmark;
> > @@ -1760,7 +1760,9 @@ static int do_execveat_common(int fd, struct filename *filename,
> >  	sched_exec();
> >  
> >  	bprm->file = file;
> > -	if (fd == AT_FDCWD || filename->name[0] == '/') {
> > +	if (!filename) {
> 
> If anything shouldn't this be:
> 
> 	if (IS_ERR(filename))

nope. it should be !filename as do_execve_file() passes NULL.
IS_ERR != IS_ERR_OR_NULL

> But, wouldn't the above first branch in the routine catch this?
> 
> > +		bprm->filename = "none";
> 
> Given this seems like a desirable branch which was tested, wonder how this
> ever got set if the above branch in the first hunk I noted hit true?
> 
> In any case, we seem to have two cases, can we rule out the exact requirements
> at the top so we can bail out with an error code if one or the other way to
> call this function does not align with expectations?

I think you're misreading the code or I don't understand the concern at all.

> > +	} else if (fd == AT_FDCWD || filename->name[0] == '/') {
> >  		bprm->filename = filename->name;
> >  	} else {
> >  		if (filename->name[0] == '\0')
> > @@ -1826,7 +1828,8 @@ static int do_execveat_common(int fd, struct filename *filename,
> >  	task_numa_free(current);
> >  	free_bprm(bprm);
> >  	kfree(pathbuf);
> > -	putname(filename);
> > +	if (filename)
> > +		putname(filename);
> >  	if (displaced)
> >  		put_files_struct(displaced);
> >  	return retval;
> > @@ -1849,10 +1852,27 @@ static int do_execveat_common(int fd, struct filename *filename,
> >  	if (displaced)
> >  		reset_files_struct(displaced);
> >  out_ret:
> > -	putname(filename);
> > +	if (filename)
> > +		putname(filename);
> >  	return retval;
> >  }
> >  
> > +static int do_execveat_common(int fd, struct filename *filename,
> 
> Further signs the filename is now optional. But I don't understand how these
> branches ever be true, but perhaps I'm missing something?
> 
> > +			      struct user_arg_ptr argv,
> > +			      struct user_arg_ptr envp,
> > +			      int flags)
> > +{
> > +	return __do_execve_file(fd, filename, argv, envp, flags, NULL);
> > +}
> > +
> > +int do_execve_file(struct file *file, void *__argv, void *__envp)
> > +{
> > +	struct user_arg_ptr argv = { .ptr.native = __argv };
> > +	struct user_arg_ptr envp = { .ptr.native = __envp };
> > +
> > +	return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
> > +}
> 
> Or maybe do the semantics expectations checks here, so we don't clutter
> do_execveat_common() with them?

specifically ?

> > +
> >  int do_execve(struct filename *filename,
> >  	const char __user *const __user *__argv,
> >  	const char __user *const __user *__envp)
> > diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> > index 4955e0863b83..c05f24fac4f6 100644
> > --- a/include/linux/binfmts.h
> > +++ b/include/linux/binfmts.h
> > @@ -150,5 +150,6 @@ extern int do_execveat(int, struct filename *,
> >  		       const char __user * const __user *,
> >  		       const char __user * const __user *,
> >  		       int);
> > +int do_execve_file(struct file *file, void *__argv, void *__envp);
> >  
> >  #endif /* _LINUX_BINFMTS_H */
> > diff --git a/include/linux/umh.h b/include/linux/umh.h
> > index 244aff638220..5c812acbb80a 100644
> > --- a/include/linux/umh.h
> > +++ b/include/linux/umh.h
> > @@ -22,8 +22,10 @@ struct subprocess_info {
> >  	const char *path;
> >  	char **argv;
> >  	char **envp;
> > +	struct file *file;
> >  	int wait;
> >  	int retval;
> > +	pid_t pid;
> >  	int (*init)(struct subprocess_info *info, struct cred *new);
> >  	void (*cleanup)(struct subprocess_info *info);
> >  	void *data;
> 
> While at it, can you kdocify struct subprocess_info and add new docs for at
> least these two entires you are adding ?

Sorry 'while at it' doesn't sound as a good reason to
add kdoc now instead of later.

> > @@ -38,6 +40,16 @@ call_usermodehelper_setup(const char *path, char **argv, char **envp,
> >  			  int (*init)(struct subprocess_info *info, struct cred *new),
> >  			  void (*cleanup)(struct subprocess_info *), void *data);
> >  
> > +struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
> > +			  int (*init)(struct subprocess_info *info, struct cred *new),
> > +			  void (*cleanup)(struct subprocess_info *), void *data);
> 
> Likewise but on the umc.c file.
> 
> > +struct umh_info {
> > +	struct file *pipe_to_umh;
> > +	struct file *pipe_from_umh;
> > +	pid_t pid;
> > +};
> 
> Likewise.

what 'likewise' ? The kdoc ?

> 
> > +int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
> 
> Likewise but on the umc.c files.
> 
> > +
> >  extern int
> >  call_usermodehelper_exec(struct subprocess_info *info, int wait);
> >  
> > diff --git a/kernel/umh.c b/kernel/umh.c
> > index f76b3ff876cf..c3f418d7d51a 100644
> > --- a/kernel/umh.c
> > +++ b/kernel/umh.c
> > @@ -25,6 +25,8 @@
> >  #include <linux/ptrace.h>
> >  #include <linux/async.h>
> >  #include <linux/uaccess.h>
> > +#include <linux/shmem_fs.h>
> > +#include <linux/pipe_fs_i.h>
> >  
> >  #include <trace/events/module.h>
> >  
> > @@ -97,9 +99,13 @@ static int call_usermodehelper_exec_async(void *data)
> >  
> >  	commit_creds(new);
> >  
> > -	retval = do_execve(getname_kernel(sub_info->path),
> > -			   (const char __user *const __user *)sub_info->argv,
> > -			   (const char __user *const __user *)sub_info->envp);
> > +	if (sub_info->file)
> > +		retval = do_execve_file(sub_info->file,
> > +					sub_info->argv, sub_info->envp);
> > +	else
> > +		retval = do_execve(getname_kernel(sub_info->path),
> > +				   (const char __user *const __user *)sub_info->argv,
> > +				   (const char __user *const __user *)sub_info->envp);
> >  out:
> >  	sub_info->retval = retval;
> >  	/*
> > @@ -185,6 +191,8 @@ static void call_usermodehelper_exec_work(struct work_struct *work)
> >  		if (pid < 0) {
> >  			sub_info->retval = pid;
> >  			umh_complete(sub_info);
> > +		} else {
> > +			sub_info->pid = pid;
> >  		}
> >  	}
> >  }
> > @@ -393,6 +401,168 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
> >  }
> >  EXPORT_SYMBOL(call_usermodehelper_setup);
> >  
> > +struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
> > +		int (*init)(struct subprocess_info *info, struct cred *new),
> > +		void (*cleanup)(struct subprocess_info *info), void *data)
> 
> Should be static, no other users outside of this file.

good catch. will change to static.

> Please use umh_setup_file().

sorry. makes no sense.
There is call_usermodehelper_setup() right above it.
call_usermodehelper_setup_file() just follows the naming convention.
If you prefer shorter names, both have to be renamed in the separate patch series.

> > +{
> > +	struct subprocess_info *sub_info;
> 
> Considering a possible fuzzer triggering random data we should probably
> return NULL early and avoid the kzalloc if:

I missing 'fuzzer' point here and earlier.
'fuzzer' cannot reach here. It's all internal api.

> 	if (!file || !init || !cleanup)
> 		return NULL;

sorry, nope. in kernel we don't do defensive programming like this.

> Is data optional? The kdoc could clarify this.

No. Should be obvious from this patch.
The only caller of call_usermodehelper_setup_file() is fork_usermode_blob()
and it passes 'struct umh_info *info'.

> 
> > +
> > +	sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL);
> > +	if (!sub_info)
> > +		return NULL;
> > +
> > +	INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
> > +	sub_info->path = "none";
> > +	sub_info->file = file;
> > +	sub_info->init = init;
> > +	sub_info->cleanup = cleanup;
> > +	sub_info->data = data;
> > +	return sub_info;
> > +}
> > +
> > +static struct vfsmount *umh_fs;
> > +
> > +static int init_tmpfs(void)
> 
> Please use umh_init_tmpfs(). 

ok

> Also see init/main.c do_basic_setup() which calls
> usermodehelper_enable() prior to do_initcalls(). Now, fortunately TMPFS is only
> bool, saving us from some races and we do call tmpfs's init first shmem_init():
> 
> static void __init do_basic_setup(void)
> {
> 	cpuset_init_smp();
> 	shmem_init();
> 	driver_init();
> 	init_irq_proc();
> 	do_ctors();
> 	usermodehelper_enable();
> 	do_initcalls();
> }
> 
> But it also means we're enabling your new call call fork_usermode_blob() on
> early init code even if we're not setup. Since this umh tmpfs vfsmount is
> shared I'd say just call this init right before usermodehelper_enable()
> on do_basic_setup().

Not following.
Why init_tmpfs() should be called by __init function?
Are you saying make 'static struct vfsmount *shm_mnt;'
global and use it here? so no init_tmpfs() necessary?
I think that can work, but feels that having two
tmpfs mounts (one for shmem and one for umh) is cleaner.

> 
> > +{
> > +	struct file_system_type *type;
> > +
> > +	if (umh_fs)
> > +		return 0;
> > +	type = get_fs_type("tmpfs");
> > +	if (!type)
> > +		return -ENODEV;
> > +	umh_fs = kern_mount(type);
> > +	if (IS_ERR(umh_fs)) {
> > +		int err = PTR_ERR(umh_fs);
> > +
> > +		put_filesystem(type);
> > +		umh_fs = NULL;
> > +		return err;
> > +	}
> > +	return 0;
> > +}
> > +
> > +static int alloc_tmpfs_file(size_t size, struct file **filp)
> 
> Please use umh_alloc_tmpfs_file()

ok

> > +{
> > +	struct file *file;
> > +	int err;
> > +
> > +	err = init_tmpfs();
> > +	if (err)
> > +		return err;
> > +	file = shmem_file_setup_with_mnt(umh_fs, "umh", size, VM_NORESERVE);
> > +	if (IS_ERR(file))
> > +		return PTR_ERR(file);
> > +	*filp = file;
> > +	return 0;
> > +}
> > +
> > +static int populate_file(struct file *file, const void *data, size_t size)
> 
> Please use umh_populate_file()

ok

> > +{
> > +	size_t offset = 0;
> > +	int err;
> > +
> > +	do {
> > +		unsigned int len = min_t(typeof(size), size, PAGE_SIZE);
> > +		struct page *page;
> > +		void *pgdata, *vaddr;
> > +
> > +		err = pagecache_write_begin(file, file->f_mapping, offset, len,
> > +					    0, &page, &pgdata);
> > +		if (err < 0)
> > +			goto fail;
> > +
> > +		vaddr = kmap(page);
> > +		memcpy(vaddr, data, len);
> > +		kunmap(page);
> > +
> > +		err = pagecache_write_end(file, file->f_mapping, offset, len,
> > +					  len, page, pgdata);
> > +		if (err < 0)
> > +			goto fail;
> > +
> > +		size -= len;
> > +		data += len;
> > +		offset += len;
> > +	} while (size);
> 
> Character for character, this looks like a wonderful copy and paste from
> i915_gem_object_create_from_data()'s own loop which does the same exact
> thing. Perhaps its time for a helper on mm/filemap.c with an export so
> if a bug is fixed in one place its fixed in both places.

yes, of course, but not right now.
Once it all lands that will be the time to create common helper.
It's not a good idea to mess with i915 in one patch set.

> > +	return 0;
> > +fail:
> > +	return err;
> > +}
> > +
> > +static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
> 
> The function name umh_pipe_setup() is also used on fs/coredump.c, with the same
> prototype, perhaps rename that before we take this on, even if both are static.

hmm. why?
These are two static functions with the same name, so?
tags get confusing?

> > +{
> > +	struct umh_info *umh_info = info->data;
> > +	struct file *from_umh[2];
> > +	struct file *to_umh[2];
> > +	int err;
> > +
> > +	/* create pipe to send data to umh */
> > +	err = create_pipe_files(to_umh, 0);
> > +	if (err)
> > +		return err;
> > +	err = replace_fd(0, to_umh[0], 0);
> > +	fput(to_umh[0]);
> > +	if (err < 0) {
> > +		fput(to_umh[1]);
> > +		return err;
> > +	}
> > +
> > +	/* create pipe to receive data from umh */
> > +	err = create_pipe_files(from_umh, 0);
> > +	if (err) {
> > +		fput(to_umh[1]);
> > +		replace_fd(0, NULL, 0);
> > +		return err;
> > +	}
> > +	err = replace_fd(1, from_umh[1], 0);
> > +	fput(from_umh[1]);
> > +	if (err < 0) {
> > +		fput(to_umh[1]);
> > +		replace_fd(0, NULL, 0);
> > +		fput(from_umh[0]);
> > +		return err;
> > +	}
> > +
> > +	umh_info->pipe_to_umh = to_umh[1];
> > +	umh_info->pipe_from_umh = from_umh[0];
> > +	return 0;
> > +}
> > +
> > +static void umh_save_pid(struct subprocess_info *info)
> > +{
> > +	struct umh_info *umh_info = info->data;
> > +
> > +	umh_info->pid = info->pid;
> > +}
> > +
> > +int fork_usermode_blob(void *data, size_t len, struct umh_info *info)
> 
> Please use umh_fork_blob()

sorry, no. fork_usermode_blob() is much more descriptive name.

> > +{
> > +	struct subprocess_info *sub_info;
> > +	struct file *file = NULL;
> > +	int err;
> > +
> > +	err = alloc_tmpfs_file(len, &file);
> > +	if (err)
> > +		return err;
> > +
> > +	err = populate_file(file, data, len);
> > +	if (err)
> > +		goto out;
> > +
> > +	err = -ENOMEM;
> > +	sub_info = call_usermodehelper_setup_file(file, umh_pipe_setup,
> > +						  umh_save_pid, info);
> > +	if (!sub_info)
> > +		goto out;
> > +
> > +	err = call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
> 
> Alright, neat, so to be clear, we're just glad to try inception, we have no
> clue or idea what the real return value would be, its up to the caller to track
> the progress somehow?

yep.

> Can you add a kdoc entry for this and clarify requirements?

ok. I'll add a comment to this helper.

> Also, can you extend lib/test_kmod.c with a test case for this with its own
> demo and try to blow it up?

in what sense? bpfilter is the test and the driving component for it.
I'm expecting that folks who want to use this helper to do usb drivers
in user space may want to extend this helper further, but that's their job.

> I hadn't tried suspend/resume during a kmod test, but since we're using a
> kernel_thread() I wouldn't be surprised if we barf while stress testing the
> module loader. Its surely a corner case, but better mention that now than cry
> later if we get heavy umh modules and all of a sudden we start using this for
> whatever reason close to suspend.

folks that care about suspend/resume should do that.
I'm happy to gate this helper for !CONFIG_SUSPEND, since I have
no idea what issues can be uncovered, how to fix them and no desire to do so.

Thanks

^ permalink raw reply

* Re: [PATCH V3] net/netlink: make sure the headers line up actual value output
From: Bo YU @ 2018-05-05  1:54 UTC (permalink / raw)
  To: David Miller; +Cc: xiyou.wangcong, yuzibode, netdev
In-Reply-To: <20180504.130223.1863527612784159289.davem@davemloft.net>

Hi,
On Fri, May 04, 2018 at 01:02:23PM -0400, David Miller wrote:
>From: YU Bo <tsu.yubo@gmail.com>
>Date: Thu, 3 May 2018 23:09:23 -0400
>
>> Making sure the headers line up properly with the actual value output
>> of the command
>> `cat /proc/net/netlink`
>>
>> Before the patch:
>> <sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode
>> <ffff8cd2c2f7b000 0 909 00000550 0 0 0 2 0 18946
>>
>> After the patch:
>>>sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode
>>>0000000033203952 0 897 00000113 0 0 0 2 0 14906
>>
>> Signed-off-by: Bo YU <tsu.yubo@gmail.com>
>
>Applied, but why did you send this V3 to the list two times?
Thanks a lot. When sent the email,i encounter networking issue and not sure
send to the list.Sorry for making noise.
>
>Thank you.

^ permalink raw reply

* Re: [v2 PATCH 1/1] tg3: fix meaningless hw_stats reading after tg3_halt memset 0 hw_stats
From: Zumeng Chen @ 2018-05-05  2:40 UTC (permalink / raw)
  To: Michael Chan
  Cc: Netdev, open list, Siva Reddy Kallam,
	prashant.sreedharan@broadcom.com, David Miller
In-Reply-To: <CACKFLi=Rv_+Gig5UpdnpbOPcyLyGr3TYGUbFEmeGcZWAk3YZRQ@mail.gmail.com>

On 05/03/2018 01:04 PM, Michael Chan wrote:
> On Wed, May 2, 2018 at 5:30 PM, Zumeng Chen <zumeng.chen@gmail.com> wrote:
>> On 2018年05月03日 01:32, Michael Chan wrote:
>>> On Wed, May 2, 2018 at 3:27 AM, Zumeng Chen <zumeng.chen@gmail.com> wrote:
>>>> On 2018年05月02日 13:12, Michael Chan wrote:
>>>>> On Tue, May 1, 2018 at 5:42 PM, Zumeng Chen <zumeng.chen@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> diff --git a/drivers/net/ethernet/broadcom/tg3.h
>>>>>> b/drivers/net/ethernet/broadcom/tg3.h
>>>>>> index 3b5e98e..c61d83c 100644
>>>>>> --- a/drivers/net/ethernet/broadcom/tg3.h
>>>>>> +++ b/drivers/net/ethernet/broadcom/tg3.h
>>>>>> @@ -3102,6 +3102,7 @@ enum TG3_FLAGS {
>>>>>>            TG3_FLAG_ROBOSWITCH,
>>>>>>            TG3_FLAG_ONE_DMA_AT_ONCE,
>>>>>>            TG3_FLAG_RGMII_MODE,
>>>>>> +       TG3_FLAG_HALT,
>>>>> I think you should be able to use the existing INIT_COMPLETE flag
>>>>
>>>> No,  it will bring the uncertain factors into the existed complicate
>>>> logic
>>>> of INIT_COMPLETE.
>>>> And I think it's very simple logic here to fix the meaningless hw_stats
>>>> reading and the problem
>>>> of commit f5992b72. I even suspect if you have read INIT_COMPLETE related
>>>> codes carefully.
>>>>
>>> We should use an existing flag whenever appropriate
>>
>> I disagree. This is sort of blahblah...
> I don't want to see another flag added that is practically the same as
> !INIT_COMPLETE.  The driver already has close to one hundred flags.
> Adding a new flag that is similar to an existing flag will just make
> the code more difficult to understand and maintain.
>
> If you don't want to fix it the cleaner way, Siva or I will fix it.

Feel free to go, I just take a double look, INIT_COMPLETE can directly 
be used as follows:

diff --git a/drivers/net/ethernet/broadcom/tg3.c 
b/drivers/net/ethernet/broadcom/tg3.c
index 08bbb63..0e04fd7 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -8733,14 +8733,11 @@ static void tg3_free_consistent(struct tg3 *tp)
         tg3_mem_rx_release(tp);
         tg3_mem_tx_release(tp);

-       /* Protect tg3_get_stats64() from reading freed tp->hw_stats. */
-       tg3_full_lock(tp, 0);
         if (tp->hw_stats) {
                 dma_free_coherent(&tp->pdev->dev, sizeof(struct 
tg3_hw_stats),
                                   tp->hw_stats, tp->stats_mapping);
                 tp->hw_stats = NULL;
         }
-       tg3_full_unlock(tp);
  }

  /*
@@ -14178,7 +14175,7 @@ static void tg3_get_stats64(struct net_device *dev,
         struct tg3 *tp = netdev_priv(dev);

         spin_lock_bh(&tp->lock);
-       if (!tp->hw_stats) {
+       if (!tp->hw_stats || tg3_flag(tp, INIT_COMPLETE)) {
                 *stats = tp->net_stats_prev;
                 spin_unlock_bh(&tp->lock);
                 return;

Cheers,
Zumeng

^ permalink raw reply related

* [PATCH] net/9p: correct the variable name in v9fs_get_trans_by_name() comment
From: Sun Lianwen @ 2018-05-05  3:29 UTC (permalink / raw)
  To: davem; +Cc: netdev

The v9fs_get_trans_by_name(char *s) variable name is not "name" but "s".

Signed-off-by: Sun Lianwen <sunlw.fnst@cn.fujitsu.com>
---
 net/9p/mod.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/9p/mod.c b/net/9p/mod.c
index 6ab36aea7727..eb9777f05755 100644
--- a/net/9p/mod.c
+++ b/net/9p/mod.c
@@ -104,7 +104,7 @@ EXPORT_SYMBOL(v9fs_unregister_trans);
 
 /**
  * v9fs_get_trans_by_name - get transport with the matching name
- * @name: string identifying transport
+ * @s: string identifying transport
  *
  */
 struct p9_trans_module *v9fs_get_trans_by_name(char *s)
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH v2 net-next 1/4] umh: introduce fork_usermode_blob() helper
From: Jann Horn @ 2018-05-05  4:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Daniel Borkmann, Linus Torvalds,
	Greg Kroah-Hartman, Andy Lutomirski, Network Development,
	kernel list, kernel-team
In-Reply-To: <20180503043604.1604587-2-ast@kernel.org>

On Thu, May 3, 2018 at 12:36 AM, Alexei Starovoitov <ast@kernel.org> wrote:
> Introduce helper:
> int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
> struct umh_info {
>        struct file *pipe_to_umh;
>        struct file *pipe_from_umh;
>        pid_t pid;
> };
>
> that GPLed kernel modules (signed or unsigned) can use it to execute part
> of its own data as swappable user mode process.
>
> The kernel will do:
> - mount "tmpfs"
> - allocate a unique file in tmpfs
> - populate that file with [data, data + len] bytes
> - user-mode-helper code will do_execve that file and, before the process
>   starts, the kernel will create two unix pipes for bidirectional
>   communication between kernel module and umh
> - close tmpfs file, effectively deleting it
> - the fork_usermode_blob will return zero on success and populate
>   'struct umh_info' with two unix pipes and the pid of the user process
>
> As the first step in the development of the bpfilter project
> the fork_usermode_blob() helper is introduced to allow user mode code
> to be invoked from a kernel module. The idea is that user mode code plus
> normal kernel module code are built as part of the kernel build
> and installed as traditional kernel module into distro specified location,
> such that from a distribution point of view, there is
> no difference between regular kernel modules and kernel modules + umh code.
> Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
> by a kernel module doesn't make it any special from kernel and user space
> tooling point of view.
[...]
> +static struct vfsmount *umh_fs;
> +
> +static int init_tmpfs(void)
> +{
> +       struct file_system_type *type;
> +
> +       if (umh_fs)
> +               return 0;
> +       type = get_fs_type("tmpfs");
> +       if (!type)
> +               return -ENODEV;
> +       umh_fs = kern_mount(type);
> +       if (IS_ERR(umh_fs)) {
> +               int err = PTR_ERR(umh_fs);
> +
> +               put_filesystem(type);
> +               umh_fs = NULL;
> +               return err;
> +       }
> +       return 0;
> +}

Should init_tmpfs() be holding some sort of mutex if it's fiddling
with `umh_fs`? The current code only calls it in initcall context, but
if that ever changes and two processes try to initialize the tmpfs at
the same time, a few things could go wrong.
I guess Luis' suggestion (putting a call to init_tmpfs() in
do_basic_setup()) might be the easiest way to get rid of that problem.

> +static int alloc_tmpfs_file(size_t size, struct file **filp)
> +{
> +       struct file *file;
> +       int err;
> +
> +       err = init_tmpfs();
> +       if (err)
> +               return err;
> +       file = shmem_file_setup_with_mnt(umh_fs, "umh", size, VM_NORESERVE);
> +       if (IS_ERR(file))
> +               return PTR_ERR(file);
> +       *filp = file;
> +       return 0;
> +}

^ permalink raw reply

* Re: [PATCH iproute2] rdma: fix header files
From: Stephen Hemminger @ 2018-05-05  4:58 UTC (permalink / raw)
  To: David Ahern; +Cc: swise, netdev
In-Reply-To: <d51c9746-d3e9-48f5-fc93-b63e4c5c7a8c@gmail.com>

On Fri, 4 May 2018 16:13:07 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 5/4/18 3:56 PM, Stephen Hemminger wrote:
> > All user api headers in iproute2 should be in include/uapi
> > so that script can be used to put correct sanitized kernel headers
> > there. And the header files for rdma must be a complete set; if one
> > header file includes another, all must be present.
> > 
> > This fixes build on older distributions, and Windows Services
> > for Linux.
> > 
> > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > ---
> >  include/uapi/rdma/ib_user_sa.h                |   77 ++
> >  include/uapi/rdma/ib_user_verbs.h             | 1210 +++++++++++++++++
> >  .../uapi/rdma/rdma_netlink.h                  |   13 +
> >  .../uapi/rdma/rdma_user_cm.h                  |    6 +-
> >  4 files changed, 1303 insertions(+), 3 deletions(-)
> >  create mode 100644 include/uapi/rdma/ib_user_sa.h
> >  create mode 100644 include/uapi/rdma/ib_user_verbs.h
> >  rename {rdma/include => include}/uapi/rdma/rdma_netlink.h (95%)
> >  rename {rdma/include => include}/uapi/rdma/rdma_user_cm.h (98%)
> >   
> 
> Stephen:
> 
> Per a recent discussion the RDMA folks need to take ownership of the
> uapi files. RDMA features do not hit Dave's net-next tree so the rdma
> code can never hit iproute2-next during a dev cycle.

I want all uapi headers in include/uapi because it avoids possible overlap problems,
During the linux-net/linus release cycle they should match what is Linus's tree.

During the net-next they can come from two sources.

^ permalink raw reply

* Re: [PATCH net-next v2] net: core: rework basic flow dissection helper
From: kbuild test robot @ 2018-05-05  5:21 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: kbuild-all, netdev, David S. Miller, Eric Dumazet, Jason Wang
In-Reply-To: <e87a05fed35c9cb8e60fe8f867c43b27b3044e21.1525426286.git.pabeni@redhat.com>

Hi Paolo,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Paolo-Abeni/net-core-rework-basic-flow-dissection-helper/20180505-090417
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   include/net/tipc.h:55:30: sparse: incorrect type in return expression (different base types) @@    expected unsigned int @@    got restriunsigned int @@
   include/net/tipc.h:55:30:    expected unsigned int
   include/net/tipc.h:55:30:    got restricted __be32 <noident>
   net/core/flow_dissector.c:840:48: sparse: incorrect type in assignment (different base types) @@    expected restricted __be32 [usertype] key @@    got  [usertype] key @@
   net/core/flow_dissector.c:840:48:    expected restricted __be32 [usertype] key
   net/core/flow_dissector.c:840:48:    got unsigned int
   net/core/flow_dissector.c:1035:30: sparse: expression using sizeof(void)
   net/core/flow_dissector.c:1035:30: sparse: expression using sizeof(void)
   net/core/flow_dissector.c:1276:25: sparse: expression using sizeof(void)
>> net/core/flow_dissector.c:1319:59: sparse: Using plain integer as NULL pointer

vim +1319 net/core/flow_dissector.c

  1305	
  1306	/**
  1307	 * skb_get_poff - get the offset to the payload
  1308	 * @skb: sk_buff to get the payload offset from
  1309	 *
  1310	 * The function will get the offset to the payload as far as it could
  1311	 * be dissected.  The main user is currently BPF, so that we can dynamically
  1312	 * truncate packets without needing to push actual payload to the user
  1313	 * space and can analyze headers only, instead.
  1314	 */
  1315	u32 skb_get_poff(const struct sk_buff *skb)
  1316	{
  1317		struct flow_keys_basic keys;
  1318	
> 1319		if (!skb_flow_dissect_flow_keys_basic(skb, &keys, 0, 0, 0, 0, 0))
  1320			return 0;
  1321	
  1322		return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb));
  1323	}
  1324	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* [PATCHv2 net] sctp: delay the authentication for the duplicated cookie-echo chunk
From: Xin Long @ 2018-05-05  6:59 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman

Now sctp only delays the authentication for the normal cookie-echo
chunk by setting chunk->auth_chunk in sctp_endpoint_bh_rcv(). But
for the duplicated one with auth, in sctp_assoc_bh_rcv(), it does
authentication first based on the old asoc, which will definitely
fail due to the different auth info in the old asoc.

The duplicated cookie-echo chunk will create a new asoc with the
auth info from this chunk, and the authentication should also be
done with the new asoc's auth info for all of the collision 'A',
'B' and 'D'. Otherwise, the duplicated cookie-echo chunk with auth
will never pass the authentication and create the new connection.

This issue exists since very beginning, and this fix is to make
sctp_assoc_bh_rcv() follow the way sctp_endpoint_bh_rcv() does
for the normal cookie-echo chunk to delay the authentication.

While at it, remove the unused params from sctp_sf_authenticate()
and define sctp_auth_chunk_verify() used for all the places that
do the delayed authentication.

v1->v2:
  fix the typo in changelog as Marcelo noticed.

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sctp/associola.c    | 30 ++++++++++++++++-
 net/sctp/sm_statefuns.c | 86 ++++++++++++++++++++++++++-----------------------
 2 files changed, 75 insertions(+), 41 deletions(-)

diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 837806d..a47179d 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1024,8 +1024,9 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
 	struct sctp_endpoint *ep;
 	struct sctp_chunk *chunk;
 	struct sctp_inq *inqueue;
-	int state;
+	int first_time = 1;	/* is this the first time through the loop */
 	int error = 0;
+	int state;
 
 	/* The association should be held so we should be safe. */
 	ep = asoc->ep;
@@ -1036,6 +1037,30 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
 		state = asoc->state;
 		subtype = SCTP_ST_CHUNK(chunk->chunk_hdr->type);
 
+		/* If the first chunk in the packet is AUTH, do special
+		 * processing specified in Section 6.3 of SCTP-AUTH spec
+		 */
+		if (first_time && subtype.chunk == SCTP_CID_AUTH) {
+			struct sctp_chunkhdr *next_hdr;
+
+			next_hdr = sctp_inq_peek(inqueue);
+			if (!next_hdr)
+				goto normal;
+
+			/* If the next chunk is COOKIE-ECHO, skip the AUTH
+			 * chunk while saving a pointer to it so we can do
+			 * Authentication later (during cookie-echo
+			 * processing).
+			 */
+			if (next_hdr->type == SCTP_CID_COOKIE_ECHO) {
+				chunk->auth_chunk = skb_clone(chunk->skb,
+							      GFP_ATOMIC);
+				chunk->auth = 1;
+				continue;
+			}
+		}
+
+normal:
 		/* SCTP-AUTH, Section 6.3:
 		 *    The receiver has a list of chunk types which it expects
 		 *    to be received only after an AUTH-chunk.  This list has
@@ -1074,6 +1099,9 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
 		/* If there is an error on chunk, discard this packet. */
 		if (error && chunk)
 			chunk->pdiscard = 1;
+
+		if (first_time)
+			first_time = 0;
 	}
 	sctp_association_put(asoc);
 }
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 28c070e..c9ae340 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -153,10 +153,7 @@ static enum sctp_disposition sctp_sf_violation_chunk(
 					struct sctp_cmd_seq *commands);
 
 static enum sctp_ierror sctp_sf_authenticate(
-					struct net *net,
-					const struct sctp_endpoint *ep,
 					const struct sctp_association *asoc,
-					const union sctp_subtype type,
 					struct sctp_chunk *chunk);
 
 static enum sctp_disposition __sctp_sf_do_9_1_abort(
@@ -626,6 +623,38 @@ enum sctp_disposition sctp_sf_do_5_1C_ack(struct net *net,
 	return SCTP_DISPOSITION_CONSUME;
 }
 
+static bool sctp_auth_chunk_verify(struct net *net, struct sctp_chunk *chunk,
+				   const struct sctp_association *asoc)
+{
+	struct sctp_chunk auth;
+
+	if (!chunk->auth_chunk)
+		return true;
+
+	/* SCTP-AUTH:  auth_chunk pointer is only set when the cookie-echo
+	 * is supposed to be authenticated and we have to do delayed
+	 * authentication.  We've just recreated the association using
+	 * the information in the cookie and now it's much easier to
+	 * do the authentication.
+	 */
+
+	/* Make sure that we and the peer are AUTH capable */
+	if (!net->sctp.auth_enable || !asoc->peer.auth_capable)
+		return false;
+
+	/* set-up our fake chunk so that we can process it */
+	auth.skb = chunk->auth_chunk;
+	auth.asoc = chunk->asoc;
+	auth.sctp_hdr = chunk->sctp_hdr;
+	auth.chunk_hdr = (struct sctp_chunkhdr *)
+				skb_push(chunk->auth_chunk,
+					 sizeof(struct sctp_chunkhdr));
+	skb_pull(chunk->auth_chunk, sizeof(struct sctp_chunkhdr));
+	auth.transport = chunk->transport;
+
+	return sctp_sf_authenticate(asoc, &auth) == SCTP_IERROR_NO_ERROR;
+}
+
 /*
  * Respond to a normal COOKIE ECHO chunk.
  * We are the side that is being asked for an association.
@@ -763,37 +792,9 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
 	if (error)
 		goto nomem_init;
 
-	/* SCTP-AUTH:  auth_chunk pointer is only set when the cookie-echo
-	 * is supposed to be authenticated and we have to do delayed
-	 * authentication.  We've just recreated the association using
-	 * the information in the cookie and now it's much easier to
-	 * do the authentication.
-	 */
-	if (chunk->auth_chunk) {
-		struct sctp_chunk auth;
-		enum sctp_ierror ret;
-
-		/* Make sure that we and the peer are AUTH capable */
-		if (!net->sctp.auth_enable || !new_asoc->peer.auth_capable) {
-			sctp_association_free(new_asoc);
-			return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
-		}
-
-		/* set-up our fake chunk so that we can process it */
-		auth.skb = chunk->auth_chunk;
-		auth.asoc = chunk->asoc;
-		auth.sctp_hdr = chunk->sctp_hdr;
-		auth.chunk_hdr = (struct sctp_chunkhdr *)
-					skb_push(chunk->auth_chunk,
-						 sizeof(struct sctp_chunkhdr));
-		skb_pull(chunk->auth_chunk, sizeof(struct sctp_chunkhdr));
-		auth.transport = chunk->transport;
-
-		ret = sctp_sf_authenticate(net, ep, new_asoc, type, &auth);
-		if (ret != SCTP_IERROR_NO_ERROR) {
-			sctp_association_free(new_asoc);
-			return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
-		}
+	if (!sctp_auth_chunk_verify(net, chunk, new_asoc)) {
+		sctp_association_free(new_asoc);
+		return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
 	}
 
 	repl = sctp_make_cookie_ack(new_asoc, chunk);
@@ -1797,13 +1798,15 @@ static enum sctp_disposition sctp_sf_do_dupcook_a(
 	if (sctp_auth_asoc_init_active_key(new_asoc, GFP_ATOMIC))
 		goto nomem;
 
+	if (!sctp_auth_chunk_verify(net, chunk, new_asoc))
+		return SCTP_DISPOSITION_DISCARD;
+
 	/* Make sure no new addresses are being added during the
 	 * restart.  Though this is a pretty complicated attack
 	 * since you'd have to get inside the cookie.
 	 */
-	if (!sctp_sf_check_restart_addrs(new_asoc, asoc, chunk, commands)) {
+	if (!sctp_sf_check_restart_addrs(new_asoc, asoc, chunk, commands))
 		return SCTP_DISPOSITION_CONSUME;
-	}
 
 	/* If the endpoint is in the SHUTDOWN-ACK-SENT state and recognizes
 	 * the peer has restarted (Action A), it MUST NOT setup a new
@@ -1912,6 +1915,9 @@ static enum sctp_disposition sctp_sf_do_dupcook_b(
 	if (sctp_auth_asoc_init_active_key(new_asoc, GFP_ATOMIC))
 		goto nomem;
 
+	if (!sctp_auth_chunk_verify(net, chunk, new_asoc))
+		return SCTP_DISPOSITION_DISCARD;
+
 	/* Update the content of current association.  */
 	sctp_add_cmd_sf(commands, SCTP_CMD_UPDATE_ASSOC, SCTP_ASOC(new_asoc));
 	sctp_add_cmd_sf(commands, SCTP_CMD_NEW_STATE,
@@ -2009,6 +2015,9 @@ static enum sctp_disposition sctp_sf_do_dupcook_d(
 	 * a COOKIE ACK.
 	 */
 
+	if (!sctp_auth_chunk_verify(net, chunk, asoc))
+		return SCTP_DISPOSITION_DISCARD;
+
 	/* Don't accidentally move back into established state. */
 	if (asoc->state < SCTP_STATE_ESTABLISHED) {
 		sctp_add_cmd_sf(commands, SCTP_CMD_TIMER_STOP,
@@ -4171,10 +4180,7 @@ enum sctp_disposition sctp_sf_eat_fwd_tsn_fast(
  * The return value is the disposition of the chunk.
  */
 static enum sctp_ierror sctp_sf_authenticate(
-					struct net *net,
-					const struct sctp_endpoint *ep,
 					const struct sctp_association *asoc,
-					const union sctp_subtype type,
 					struct sctp_chunk *chunk)
 {
 	struct sctp_shared_key *sh_key = NULL;
@@ -4275,7 +4281,7 @@ enum sctp_disposition sctp_sf_eat_auth(struct net *net,
 						  commands);
 
 	auth_hdr = (struct sctp_authhdr *)chunk->skb->data;
-	error = sctp_sf_authenticate(net, ep, asoc, type, chunk);
+	error = sctp_sf_authenticate(asoc, chunk);
 	switch (error) {
 	case SCTP_IERROR_AUTH_BAD_HMAC:
 		/* Generate the ERROR chunk and discard the rest
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH net] sctp: delay the authentication for the duplicated cookie-echo chunk
From: Xin Long @ 2018-05-05  7:01 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner; +Cc: network dev, linux-sctp, davem, Neil Horman
In-Reply-To: <20180504223337.GO5105@localhost.localdomain>

On Sat, May 5, 2018 at 6:33 AM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Fri, May 04, 2018 at 05:05:10PM +0800, Xin Long wrote:
>> Now sctp only delays the authentication for the normal cookie-echo
>> chunk by setting chunk->auth_chunk in sctp_endpoint_bh_rcv(). But
>> for the duplicated one with auth, in sctp_assoc_bh_rcv(), it does
>> authentication first based on the old asoc, which will definitely
>> fail due to the different auth info in the old asoc.
>>
>> The duplicated cookie-echo chunk will create a new asoc with the
>> auth info from this chunk, and the authentication should also be
>> done with the new asoc's auth info for all of the collision 'A',
>> 'B' and 'D'. Otherwise, the duplicated cookie-echo chunk with auth
>> will never pass the authentication and create the new connection.
>>
>> This issue exists since very beginning, and this fix is to make
>> sctp_assoc_bh_rcv() follow the way sctp_assoc_bh_rcv() does for
>    I guess you meant sctp_endpoint_bh_rcv here --^ right?
have posted v2, thanks.

>
> Other than this LGTM
>
>> the normal cookie-echo chunk to delay the authentication.
>>
>> While at it, remove the unused params from sctp_sf_authenticate()
>> and define sctp_auth_chunk_verify() used for all the places that
>> do the delayed authentication.
>>
>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
>> ---
>>  net/sctp/associola.c    | 30 ++++++++++++++++-
>>  net/sctp/sm_statefuns.c | 86 ++++++++++++++++++++++++++-----------------------
>>  2 files changed, 75 insertions(+), 41 deletions(-)
>>
>> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
>> index 837806d..a47179d 100644
>> --- a/net/sctp/associola.c
>> +++ b/net/sctp/associola.c
>> @@ -1024,8 +1024,9 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
>>       struct sctp_endpoint *ep;
>>       struct sctp_chunk *chunk;
>>       struct sctp_inq *inqueue;
>> -     int state;
>> +     int first_time = 1;     /* is this the first time through the loop */
>>       int error = 0;
>> +     int state;
>>
>>       /* The association should be held so we should be safe. */
>>       ep = asoc->ep;
>> @@ -1036,6 +1037,30 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
>>               state = asoc->state;
>>               subtype = SCTP_ST_CHUNK(chunk->chunk_hdr->type);
>>
>> +             /* If the first chunk in the packet is AUTH, do special
>> +              * processing specified in Section 6.3 of SCTP-AUTH spec
>> +              */
>> +             if (first_time && subtype.chunk == SCTP_CID_AUTH) {
>> +                     struct sctp_chunkhdr *next_hdr;
>> +
>> +                     next_hdr = sctp_inq_peek(inqueue);
>> +                     if (!next_hdr)
>> +                             goto normal;
>> +
>> +                     /* If the next chunk is COOKIE-ECHO, skip the AUTH
>> +                      * chunk while saving a pointer to it so we can do
>> +                      * Authentication later (during cookie-echo
>> +                      * processing).
>> +                      */
>> +                     if (next_hdr->type == SCTP_CID_COOKIE_ECHO) {
>> +                             chunk->auth_chunk = skb_clone(chunk->skb,
>> +                                                           GFP_ATOMIC);
>> +                             chunk->auth = 1;
>> +                             continue;
>> +                     }
>> +             }
>> +
>> +normal:
>>               /* SCTP-AUTH, Section 6.3:
>>                *    The receiver has a list of chunk types which it expects
>>                *    to be received only after an AUTH-chunk.  This list has
>> @@ -1074,6 +1099,9 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
>>               /* If there is an error on chunk, discard this packet. */
>>               if (error && chunk)
>>                       chunk->pdiscard = 1;
>> +
>> +             if (first_time)
>> +                     first_time = 0;
>>       }
>>       sctp_association_put(asoc);
>>  }
>> diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
>> index 28c070e..c9ae340 100644
>> --- a/net/sctp/sm_statefuns.c
>> +++ b/net/sctp/sm_statefuns.c
>> @@ -153,10 +153,7 @@ static enum sctp_disposition sctp_sf_violation_chunk(
>>                                       struct sctp_cmd_seq *commands);
>>
>>  static enum sctp_ierror sctp_sf_authenticate(
>> -                                     struct net *net,
>> -                                     const struct sctp_endpoint *ep,
>>                                       const struct sctp_association *asoc,
>> -                                     const union sctp_subtype type,
>>                                       struct sctp_chunk *chunk);
>>
>>  static enum sctp_disposition __sctp_sf_do_9_1_abort(
>> @@ -626,6 +623,38 @@ enum sctp_disposition sctp_sf_do_5_1C_ack(struct net *net,
>>       return SCTP_DISPOSITION_CONSUME;
>>  }
>>
>> +static bool sctp_auth_chunk_verify(struct net *net, struct sctp_chunk *chunk,
>> +                                const struct sctp_association *asoc)
>> +{
>> +     struct sctp_chunk auth;
>> +
>> +     if (!chunk->auth_chunk)
>> +             return true;
>> +
>> +     /* SCTP-AUTH:  auth_chunk pointer is only set when the cookie-echo
>> +      * is supposed to be authenticated and we have to do delayed
>> +      * authentication.  We've just recreated the association using
>> +      * the information in the cookie and now it's much easier to
>> +      * do the authentication.
>> +      */
>> +
>> +     /* Make sure that we and the peer are AUTH capable */
>> +     if (!net->sctp.auth_enable || !asoc->peer.auth_capable)
>> +             return false;
>> +
>> +     /* set-up our fake chunk so that we can process it */
>> +     auth.skb = chunk->auth_chunk;
>> +     auth.asoc = chunk->asoc;
>> +     auth.sctp_hdr = chunk->sctp_hdr;
>> +     auth.chunk_hdr = (struct sctp_chunkhdr *)
>> +                             skb_push(chunk->auth_chunk,
>> +                                      sizeof(struct sctp_chunkhdr));
>> +     skb_pull(chunk->auth_chunk, sizeof(struct sctp_chunkhdr));
>> +     auth.transport = chunk->transport;
>> +
>> +     return sctp_sf_authenticate(asoc, &auth) == SCTP_IERROR_NO_ERROR;
>> +}
>> +
>>  /*
>>   * Respond to a normal COOKIE ECHO chunk.
>>   * We are the side that is being asked for an association.
>> @@ -763,37 +792,9 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
>>       if (error)
>>               goto nomem_init;
>>
>> -     /* SCTP-AUTH:  auth_chunk pointer is only set when the cookie-echo
>> -      * is supposed to be authenticated and we have to do delayed
>> -      * authentication.  We've just recreated the association using
>> -      * the information in the cookie and now it's much easier to
>> -      * do the authentication.
>> -      */
>> -     if (chunk->auth_chunk) {
>> -             struct sctp_chunk auth;
>> -             enum sctp_ierror ret;
>> -
>> -             /* Make sure that we and the peer are AUTH capable */
>> -             if (!net->sctp.auth_enable || !new_asoc->peer.auth_capable) {
>> -                     sctp_association_free(new_asoc);
>> -                     return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
>> -             }
>> -
>> -             /* set-up our fake chunk so that we can process it */
>> -             auth.skb = chunk->auth_chunk;
>> -             auth.asoc = chunk->asoc;
>> -             auth.sctp_hdr = chunk->sctp_hdr;
>> -             auth.chunk_hdr = (struct sctp_chunkhdr *)
>> -                                     skb_push(chunk->auth_chunk,
>> -                                              sizeof(struct sctp_chunkhdr));
>> -             skb_pull(chunk->auth_chunk, sizeof(struct sctp_chunkhdr));
>> -             auth.transport = chunk->transport;
>> -
>> -             ret = sctp_sf_authenticate(net, ep, new_asoc, type, &auth);
>> -             if (ret != SCTP_IERROR_NO_ERROR) {
>> -                     sctp_association_free(new_asoc);
>> -                     return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
>> -             }
>> +     if (!sctp_auth_chunk_verify(net, chunk, new_asoc)) {
>> +             sctp_association_free(new_asoc);
>> +             return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
>>       }
>>
>>       repl = sctp_make_cookie_ack(new_asoc, chunk);
>> @@ -1797,13 +1798,15 @@ static enum sctp_disposition sctp_sf_do_dupcook_a(
>>       if (sctp_auth_asoc_init_active_key(new_asoc, GFP_ATOMIC))
>>               goto nomem;
>>
>> +     if (!sctp_auth_chunk_verify(net, chunk, new_asoc))
>> +             return SCTP_DISPOSITION_DISCARD;
>> +
>>       /* Make sure no new addresses are being added during the
>>        * restart.  Though this is a pretty complicated attack
>>        * since you'd have to get inside the cookie.
>>        */
>> -     if (!sctp_sf_check_restart_addrs(new_asoc, asoc, chunk, commands)) {
>> +     if (!sctp_sf_check_restart_addrs(new_asoc, asoc, chunk, commands))
>>               return SCTP_DISPOSITION_CONSUME;
>> -     }
>>
>>       /* If the endpoint is in the SHUTDOWN-ACK-SENT state and recognizes
>>        * the peer has restarted (Action A), it MUST NOT setup a new
>> @@ -1912,6 +1915,9 @@ static enum sctp_disposition sctp_sf_do_dupcook_b(
>>       if (sctp_auth_asoc_init_active_key(new_asoc, GFP_ATOMIC))
>>               goto nomem;
>>
>> +     if (!sctp_auth_chunk_verify(net, chunk, new_asoc))
>> +             return SCTP_DISPOSITION_DISCARD;
>> +
>>       /* Update the content of current association.  */
>>       sctp_add_cmd_sf(commands, SCTP_CMD_UPDATE_ASSOC, SCTP_ASOC(new_asoc));
>>       sctp_add_cmd_sf(commands, SCTP_CMD_NEW_STATE,
>> @@ -2009,6 +2015,9 @@ static enum sctp_disposition sctp_sf_do_dupcook_d(
>>        * a COOKIE ACK.
>>        */
>>
>> +     if (!sctp_auth_chunk_verify(net, chunk, asoc))
>> +             return SCTP_DISPOSITION_DISCARD;
>> +
>>       /* Don't accidentally move back into established state. */
>>       if (asoc->state < SCTP_STATE_ESTABLISHED) {
>>               sctp_add_cmd_sf(commands, SCTP_CMD_TIMER_STOP,
>> @@ -4171,10 +4180,7 @@ enum sctp_disposition sctp_sf_eat_fwd_tsn_fast(
>>   * The return value is the disposition of the chunk.
>>   */
>>  static enum sctp_ierror sctp_sf_authenticate(
>> -                                     struct net *net,
>> -                                     const struct sctp_endpoint *ep,
>>                                       const struct sctp_association *asoc,
>> -                                     const union sctp_subtype type,
>>                                       struct sctp_chunk *chunk)
>>  {
>>       struct sctp_shared_key *sh_key = NULL;
>> @@ -4275,7 +4281,7 @@ enum sctp_disposition sctp_sf_eat_auth(struct net *net,
>>                                                 commands);
>>
>>       auth_hdr = (struct sctp_authhdr *)chunk->skb->data;
>> -     error = sctp_sf_authenticate(net, ep, asoc, type, chunk);
>> +     error = sctp_sf_authenticate(asoc, chunk);
>>       switch (error) {
>>       case SCTP_IERROR_AUTH_BAD_HMAC:
>>               /* Generate the ERROR chunk and discard the rest
>> --
>> 2.1.0
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

^ permalink raw reply

* Re: [RFC PATCH 2/2] net: mac808211: mac802154: use lockdep_assert_in_softirq() instead own warning
From: Ingo Molnar @ 2018-05-05  7:06 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, linux-kernel, tglx, Ingo Molnar, David S. Miller,
	Johannes Berg, Alexander Aring, Stefan Schmidt, netdev,
	linux-wireless, linux-wpan, Anna-Maria Gleixner
In-Reply-To: <20180504190735.izmzibhb66gjb5wr@linutronix.de>


* Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> On 2018-05-04 20:51:32 [+0200], Peter Zijlstra wrote:
> > On Fri, May 04, 2018 at 08:45:39PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2018-05-04 20:32:49 [+0200], Peter Zijlstra wrote:
> > > > On Fri, May 04, 2018 at 07:51:44PM +0200, Sebastian Andrzej Siewior wrote:
> > > > > From: Anna-Maria Gleixner <anna-maria@linutronix.de>
> > > > > 
> > > > > The warning in ieee802154_rx() and ieee80211_rx_napi() is there to ensure
> > > > > the softirq context for the subsequent netif_receive_skb() call. 
> > > > 
> > > > That's not in fact what it does though; so while that might indeed be
> > > > the intent that's not what it does.
> > > 
> > > It was introduced in commit d20ef63d3246 ("mac80211: document
> > > ieee80211_rx() context requirement"):
> > > 
> > >     mac80211: document ieee80211_rx() context requirement
> > >     
> > >     ieee80211_rx() must be called with softirqs disabled
> > 
> > softirqs disabled, ack that is exactly what it checks.
> > 
> > But afaict the assertion you introduced tests that we are _in_ softirq
> > context, which is not the same.
> 
> indeed, now it clicked. Given what I wrote in the cover letter would you
> be in favour of (a proper) lockdep_assert_BH_disabled() or the cheaper
> local_bh_enable() (assuming the network folks don't mind the cheaper
> version)?

BTW., going by the hardirq variant nomenclature:

        lockdep_assert_irqs_enabled();

... the proper name would not be lockdep_assert_BH_disabled(), but:

        lockdep_assert_softirqs_disabled();

Thanks,

	Ingo

^ permalink raw reply

* Re: [net-next PATCH v2 2/8] udp: Verify that pulling UDP header in GSO segmentation doesn't fail
From: Willem de Bruijn @ 2018-05-05  8:12 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Network Development, Willem de Bruijn, David Miller
In-Reply-To: <20180504182857.5194.45504.stgit@localhost.localdomain>

On Fri, May 4, 2018 at 8:29 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> We should verify that we can pull the UDP header before we attempt to do
> so. Otherwise if this fails we have no way of knowing and GSO will not work
> correctly.

This already happened in the callers udp[46]_ufo_fragment

^ permalink raw reply

* Re: [net-next PATCH v2 3/8] udp: Do not pass MSS as parameter to GSO segmentation
From: Willem de Bruijn @ 2018-05-05  8:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Network Development, Willem de Bruijn,
	David Miller
In-Reply-To: <84200426-da0f-6ff4-8ae1-209e417f76c6@gmail.com>

On Fri, May 4, 2018 at 10:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 05/04/2018 11:29 AM, Alexander Duyck wrote:
>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>>
>> There is no point in passing MSS as a parameter for for the GSO
>> segmentation call as it is already available via the shared info for the
>> skb itself.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> ---
>
> Reviewed-by: Eric Dumazet <edumazet@google.com>

Acked-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* Re: [net-next PATCH v2 1/8] udp: Record gso_segs when supporting UDP segmentation offload
From: Willem de Bruijn @ 2018-05-05  8:23 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Network Development, Willem de Bruijn, David Miller
In-Reply-To: <20180504182847.5194.27449.stgit@localhost.localdomain>

On Fri, May 4, 2018 at 8:28 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> We need to record the number of segments that will be generated when this
> frame is segmented. The expectation is that if gso_size is set then
> gso_segs is set as well. Without this some drivers such as ixgbe get
> confused if they attempt to offload this as they record 0 segments for the
> entire packet instead of the correct value.
>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>  net/ipv4/udp.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index dd3102a37ef9..e07db83b311e 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -793,6 +793,8 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
>
>                 skb_shinfo(skb)->gso_size = cork->gso_size;
>                 skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
> +               skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(len - sizeof(uh),
> +                                                        cork->gso_size);
>                 goto csum_partial;
>         }

Acked-by: Willem de Bruijn <willemb@google.com>

But we have to be careful that both UDP GSO and UFO packets can traverse
qdisc_pkt_len_init. This does seem to fix the UDP GSO case, as it adds the
header size to each segment. But that is odd, as the code precedes UDP
GSO, so must have been intended to estimate UFO size on the wire. Only
external sources can generate UFO. It seems like we need to not add
sizeof(struct udphdr) to hdrlen in the SKB_GSO_DODGY branch.

^ permalink raw reply

* Re: [net-next PATCH v2 5/8] udp: Partially unroll handling of first segment and last segment
From: Willem de Bruijn @ 2018-05-05  8:37 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Network Development, Willem de Bruijn, David Miller
In-Reply-To: <20180504183019.5194.47789.stgit@localhost.localdomain>

On Fri, May 4, 2018 at 8:30 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> This patch allows us to take care of unrolling the first segment and the
> last segment

Only the last segment needs to be unrolled, right?

> of the loop for processing the segmented skb. Part of the
> motivation for this is that it makes it easier to process the fact that the
> first fame and all of the frames in between should be mostly identical
> in terms of header data, and the last frame has differences in the length
> and partial checksum.
>
> In addition I am dropping the header length calculation since we don't
> really need it for anything but the last frame and it can be easily
> obtained by just pulling the data_len and offset of tail from the transport
> header.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>
> v2: New break-out patch based on one patch from earlier series
>
>  net/ipv4/udp_offload.c |   35 ++++++++++++++++++++---------------
>  1 file changed, 20 insertions(+), 15 deletions(-)
>
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 5c4bb8b9e83e..946d06d2aa0c 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -193,7 +193,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
>         struct sk_buff *seg, *segs = ERR_PTR(-EINVAL);
>         struct sock *sk = gso_skb->sk;
>         unsigned int sum_truesize = 0;
> -       unsigned int hdrlen;
>         struct udphdr *uh;
>         unsigned int mss;
>         __sum16 check;
> @@ -206,7 +205,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
>         if (!pskb_may_pull(gso_skb, sizeof(*uh)))
>                 goto out;
>
> -       hdrlen = gso_skb->data - skb_mac_header(gso_skb);
>         skb_pull(gso_skb, sizeof(*uh));
>
>         /* clear destructor to avoid skb_segment assigning it to tail */
> @@ -219,7 +217,8 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
>                 return segs;
>         }
>
> -       uh = udp_hdr(segs);
> +       seg = segs;
> +       uh = udp_hdr(seg);
>
>         /* compute checksum adjustment based on old length versus new */
>         newlen = htons(sizeof(*uh) + mss);
> @@ -227,25 +226,31 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
>                                             ((__force u32)uh->len ^ 0xFFFF) +
>                                             (__force u32)newlen));
>
> -       for (seg = segs; seg; seg = seg->next) {
> -               uh = udp_hdr(seg);
> +       for (;;) {
> +               seg->destructor = sock_wfree;
> +               seg->sk = sk;
> +               sum_truesize += seg->truesize;
>
> -               /* last packet can be partial gso_size */
> -               if (!seg->next) {
> -                       newlen = htons(seg->len - hdrlen);
> -                       check = ~csum_fold((__force __wsum)((__force u32)uh->check +
> -                                                           ((__force u32)uh->len ^ 0xFFFF) +
> -                                                           (__force u32)newlen));
> -               }
> +               if (!seg->next)
> +                       break;

Not critical, but I find replacing

  for (seg = segs; seg; seg = seg->next) {
    uh = udp_hdr(seg);
    ...
  }

with

  uh = udp_hdr(seg);
  for (;;) {
    ...
    if (!seg->next)
      break;

    uh = udp_hdr(seg);
  }

much less easy to parse and it inflates patch size. How about just

  - for (seg = segs; seg; seg = seg->next)
  + for( seg = segs; seg->next; seg = seg->next)


>
>                 uh->len = newlen;
>                 uh->check = check;
>
> -               seg->destructor = sock_wfree;
> -               seg->sk = sk;
> -               sum_truesize += seg->truesize;
> +               seg = seg->next;
> +               uh = udp_hdr(seg);
>         }
>
> +       /* last packet can be partial gso_size, account for that in checksum */
> +       newlen = htons(skb_tail_pointer(seg) - skb_transport_header(seg) +
> +                      seg->data_len);
> +       check = ~csum_fold((__force __wsum)((__force u32)uh->check +
> +                                           ((__force u32)uh->len ^ 0xFFFF)  +
> +                                           (__force u32)newlen));
> +       uh->len = newlen;
> +       uh->check = check;
> +
> +       /* update refcount for the packet */
>         refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
>  out:
>         return segs;
>

^ permalink raw reply

* Re: [net-next PATCH v2 7/8] udp: Do not copy destructor if one is not present
From: Willem de Bruijn @ 2018-05-05  8:45 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Network Development, Willem de Bruijn, David Miller
In-Reply-To: <20180504183110.5194.5449.stgit@localhost.localdomain>

On Fri, May 4, 2018 at 8:31 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> This patch makes it so that if a destructor is not present we avoid trying
> to update the skb socket or any reference counting that would be associated
> with the NULL socket and/or descriptor. By doing this we can support
> traffic coming from another namespace without any issues.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

Acked-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* BUG?: receiving on a packet socket with .sll_protocoll and bridging
From: Uwe Kleine-König @ 2018-05-05  8:57 UTC (permalink / raw)
  To: netdev

Hello,

my eventual goal is to implement MRP and for that I started to program a
bit and stumbled over a problem I don't understand.

For testing purposes I created a veth device pair (veth0 veth1), open a
socket for each of the devices and send packets around between them. In
tcpdump a typical package looks as follows:

10:36:34.755208 ae:a9:da:50:48:db (oui Unknown) > 01:15:e4:00:00:01 (oui Unknown), ethertype Unknown (0x88e3), length 58: 
	0x0000:  0001 0212 8000 aea9 da50 48db 0000 0000  .........PH.....
	0x0010:  0000 0589 40f2 6574 6800 0000 0000 0000  ....@.eth.......
	0x0020:  0000 0100 0a80 3d38 4c5e 0000            ......=8L^..

The socket to receive these packages is opened using:

	#define ETH_P_MRP 0x88e3

	struct sockaddr_ll sa_ll = {
		.sll_family = AF_PACKET,
		.sll_protocol = htons(ETH_P_MRP),
		.sll_ifindex = if_nametoindex("veth0")
	};

	fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_MRP));
	bind(fd, (struct sockaddr *)&sa_ll, sizeof(sa_ll));

So far everything works fine and I can receive the packets I send.

If now I add veth0 to a bridge (e.g.

	ip link add br0 type bridge
	ip link set dev veth0 master br0

) and continue to send on veth1 and receive on veth0 I don't receive
the packets any more. The other direction (veth0 sending, veth1
receiving) still works fine. Each of the following changes allow to
receive again:

 a) take veth0 out of the bridge
 b) bind(2) the receiving socket to br0 instead of veth0
 c) use .sll_protocol = htons(ETH_P_ALL) for bind(2)

In the end only c) could be sensible (because I need to know the port
the packet entered the stack and that might well be bridged), but I
wonder why .sll_protocol = htons(ETH_P_MRP) seems to do the right thing
for an unbridged device but not for a bridged one.

Is this a bug or a feature I don't understand?

If someone wants to reproduce this locally, I can simplify my program
and provide it here. I tested this on a Debian 4.15 kernel (x86), but
also with the same symptoms on an arm64 with 4.16 and a dsa switch.

Best regards
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |

^ permalink raw reply

* Re: [PATCH 1/8] rhashtable: silence RCU warning in rhashtable_test.
From: Herbert Xu @ 2018-05-05  9:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <152540605424.18473.310003836978239224.stgit@noble>

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> print_ht in rhashtable_test calls rht_dereference() with neither
> RCU protection or the mutex.  This triggers an RCU warning.
> So take the mutex to silence the warning.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

I don't think the mutex is actually needed but since we don't
have rht_dereference_raw and this is just test code:

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 2/8] rhashtable: remove nulls_base and related code.
From: Herbert Xu @ 2018-05-05  9:12 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <152540605427.18473.12277050439942480863.stgit@noble>

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> This "feature" is unused, undocumented, and untested and so
> doesn't really belong.  If a use for the nulls marker
> is found, all this code would need to be reviewed to
> ensure it works as required.  It would be just as easy to
> just add the code if/when it is needed instead.
> 
> This patch actually fixes a bug too.  The table resizing allows a
> table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
> any growth beyond 2^27 is wasteful an ineffective.
> 
> This patch result in NULLS_MARKER(0) being used for all chains,
> and leave the use of rht_is_a_null() to test for it.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

I disagree.  This is a fundamental requirement for the use of
rhashtable in certain networking systems such as TCP/UDP.  So
we know that there will be a use for this.

As to the bug fix, please separate it out of the patch and resubmit.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 3/8] rhashtable: use cmpxchg() to protect ->future_tbl.
From: Herbert Xu @ 2018-05-05  9:27 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <152540605430.18473.11758878046224478232.stgit@noble>

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> Rather than borrowing one of the bucket locks to
> protect ->future_tbl updates, use cmpxchg().
> This gives more freedom to change how bucket locking
> is implemented.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

This looks nice.

> -	spin_unlock_bh(old_tbl->locks);
> +	rcu_assign_pointer(tmp, new_tbl);

Do we need this barrier since cmpxchg is supposed to provide memory
barrier semantics?

> +	if (cmpxchg(&old_tbl->future_tbl, NULL, tmp) != NULL)
> +		return -EEXIST;

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 4/8] rhashtable: fix race in nested_table_alloc()
From: Herbert Xu @ 2018-05-05  9:29 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <152540605432.18473.11813271279255176724.stgit@noble>

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> If two threads run nested_table_alloc() at the same time
> they could both allocate a new table.
> Best case is that one of them will never be freed, leaking memory.
> Worst case is hat entry get stored there before it leaks,
> and the are lost from the table.
> 
> So use cmpxchg to detect the race and free the unused table.
> 
> Fixes: da20420f83ea ("rhashtable: Add nested tables")
> Cc: stable@vger.kernel.org # 4.11+
> Signed-off-by: NeilBrown <neilb@suse.com>

What about the spinlock that's meant to be held around this
operation?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 5/8] rhashtable: remove rhashtable_walk_peek()
From: Herbert Xu @ 2018-05-05  9:30 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel, Tom Herbert
In-Reply-To: <152540605435.18473.12657571579874985957.stgit@noble>

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> This function has a somewhat confused behavior that is not properly
> described by the documentation.
> Sometimes is returns the previous object, sometimes it returns the
> next one.
> Sometimes it changes the iterator, sometimes it doesn't.
> 
> This function is not currently used and is not worth keeping, so
> remove it.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

I don't have a problem with this.  But it would be good to hear
from Tom Herbert who added this.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 8/8] rhashtable: don't hold lock on first table throughout insertion.
From: Herbert Xu @ 2018-05-05  9:41 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <152540605444.18473.9591316658457316578.stgit@noble>

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> rhashtable_try_insert() currently hold a lock on the bucket in
> the first table, while also locking buckets in subsequent tables.
> This is unnecessary and looks like a hold-over from some earlier
> version of the implementation.
> 
> As insert and remove always lock a bucket in each table in turn, and
> as insert only inserts in the final table, there cannot be any races
> that are not covered by simply locking a bucket in each table in turn.
> 
> When an insert call reaches that last table it can be sure that there
> is no match entry in any other table as it has searched them all, and
> insertion never happens anywhere but in the last table.  The fact that
> code tests for the existence of future_tbl while holding a lock on
> the relevant bucket ensures that two threads inserting the same key
> will make compatible decisions about which is the "last" table.
> 
> This simplifies the code and allows the ->rehash field to be
> discarded.
> We still need a way to ensure that a dead bucket_table is never
> re-linked by rhashtable_walk_stop().  This can be achieved by
> setting the ->size to 1.  This still allows lookup code to work (and
> correctly not find anything) but can never happen on an active bucket
> table (as the minimum size is 4).
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

I'm not convinced this is safe.  There can be multiple tables
in existence.  Unless you hold the lock on the first table, what
is to stop two parallel inserters each from inserting into their
"last" tables which may not be the same table?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 6/8] rhashtable: further improve stability of rhashtable_walk
From: Herbert Xu @ 2018-05-05  9:42 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel
In-Reply-To: <152540605438.18473.4797800779538116583.stgit@noble>

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> If the sequence:
>    obj = rhashtable_walk_next(iter);
>    rhashtable_walk_stop(iter);
>    rhashtable_remove_fast(ht, &obj->head, params);
>    rhashtable_walk_start(iter);
> 
>  races with another thread inserting or removing
>  an object on the same hash chain, a subsequent
>  rhashtable_walk_next() is not guaranteed to get the "next"
>  object. It is possible that an object could be
>  repeated, or missed.
> 
>  This can be made more reliable by keeping the objects in a hash chain
>  sorted by memory address.  A subsequent rhashtable_walk_next()
>  call can reliably find the correct position in the list, and thus
>  find the 'next' object.
> 
>  It is not possible (certainly not so easy) to achieve this with an
>  rhltable as keeping the hash chain in order is not so easy.  When the
>  first object with a given key is removed, it is replaced in the chain
>  with the next object with the same key, and the address of that
>  object may not be correctly ordered.
>  No current user of rhltable_walk_enter() calls
>  rhashtable_walk_start() more than once, so no current code
>  could benefit from a more reliable walk of rhltables.
> 
>  This patch only attempts to improve walks for rhashtables.
>  - a new object is always inserted after the last object with a
>    smaller address, or at the start
>  - when rhashtable_walk_start() is called, it records that 'p' is not
>    'safe', meaning that it cannot be dereferenced.  The revalidation
>    that was previously done here is moved to rhashtable_walk_next()
>  - when rhashtable_walk_next() is called while p is not NULL and not
>    safe, it walks the chain looking for the first object with an
>    address greater than p and returns that.  If there is none, it moves
>    to the next hash chain.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

I'm a bit torn on this.  On the hand this is definitely an improvement
over the status quo.  On the other this does not work on rhltable and
we do have a way of fixing it for both rhashtable and rhltable.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 7/8] rhashtable: add rhashtable_walk_prev()
From: Herbert Xu @ 2018-05-05  9:43 UTC (permalink / raw)
  To: NeilBrown; +Cc: Thomas Graf, netdev, linux-kernel, Tom Herbert
In-Reply-To: <152540605441.18473.4087381584733882012.stgit@noble>

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> rhashtable_walk_prev() returns the object returned by
> the previous rhashtable_walk_next(), providing it is still in the
> table (or was during this grace period).
> This works even if rhashtable_walk_stop() and rhashtable_talk_start()
> have been called since the last rhashtable_walk_next().
> 
> If there have been no calls to rhashtable_walk_next(), or if the
> object is gone from the table, then NULL is returned.
> 
> This can usefully be used in a seq_file ->start() function.
> If the pos is the same as was returned by the last ->next() call,
> then rhashtable_walk_prev() can be used to re-establish the
> current location in the table.  If it returns NULL, then
> rhashtable_walk_next() should be used.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

I will ack this if Tom is OK with replacing peek with it.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [net-next PATCH v2 4/8] udp: Do not pass checksum as a parameter to GSO segmentation
From: Willem de Bruijn @ 2018-05-05 10:01 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Network Development, Willem de Bruijn, David Miller
In-Reply-To: <20180504182958.5194.62398.stgit@localhost.localdomain>

On Fri, May 4, 2018 at 8:30 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> This patch is meant to allow us to avoid having to recompute the checksum
> from scratch and have it passed as a parameter.
>
> Instead of taking that approach we can take advantage of the fact that the
> length that was used to compute the existing checksum is included in the
> UDP header. If we cancel that out by adding the value XOR with 0xFFFF we
> can then just add the new length in and fold that into the new result.
>
> I think this may be fixing a checksum bug in the original code as well
> since the checksum that was passed included the UDP header in the checksum
> computation, but then excluded it for the adjustment on the last frame. I
> believe this may have an effect on things in the cases where the two differ
> by bits that would result in things crossing the byte boundaries.

The replacement code, below, subtracts original payload size then adds
the new payload size. mss here excludes the udp header size.

>                 /* last packet can be partial gso_size */
> -               if (!seg->next)
> -                       csum_replace2(&uh->check, htons(mss),
> -                                     htons(seg->len - hdrlen - sizeof(*uh)));

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox