Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v2 net-next 1/4] umh: introduce fork_usermode_blob() helper
From: Luis R. Rodriguez @ 2018-05-04 19:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, daniel, torvalds, gregkh, luto, netdev, linux-kernel,
	kernel-team, Al Viro, David Howells, Mimi Zohar, Kees Cook,
	Andrew Morton, Dominik Brodowski, Hugh Dickins, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, David Airlie, Rafael J. Wysocki,
	Luis R. Rodriguez, Linux FS Devel
In-Reply-To: <20180503043604.1604587-2-ast@kernel.org>

What a mighty short list of reviewers. Adding some more. My review below.
I'd appreciate a Cc on future versions of these patches.

On Wed, May 02, 2018 at 09:36:01PM -0700, Alexei Starovoitov wrote:
> Introduce helper:
> int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
> struct umh_info {
>        struct file *pipe_to_umh;
>        struct file *pipe_from_umh;
>        pid_t pid;
> };
> 
> that GPLed kernel modules (signed or unsigned) can use it to execute part
> of its own data as swappable user mode process.
> 
> The kernel will do:
> - mount "tmpfs"

Actually its a *shared* vfsmount tmpfs for all umh blobs.

> - allocate a unique file in tmpfs
> - populate that file with [data, data + len] bytes
> - user-mode-helper code will do_execve that file and, before the process
>   starts, the kernel will create two unix pipes for bidirectional
>   communication between kernel module and umh
> - close tmpfs file, effectively deleting it
> - the fork_usermode_blob will return zero on success and populate
>   'struct umh_info' with two unix pipes and the pid of the user process

But since its using UMH_WAIT_EXEC, all we can guarantee currently is the
inception point was intended, well though out, and will run, but the return
value in no way reflects the success or not of the execution. More below.

> As the first step in the development of the bpfilter project
> the fork_usermode_blob() helper is introduced to allow user mode code
> to be invoked from a kernel module. The idea is that user mode code plus
> normal kernel module code are built as part of the kernel build
> and installed as traditional kernel module into distro specified location,
> such that from a distribution point of view, there is
> no difference between regular kernel modules and kernel modules + umh code.
> Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
> by a kernel module doesn't make it any special from kernel and user space
> tooling point of view.
> 
> Such approach enables kernel to delegate functionality traditionally done
> by the kernel modules into the user space processes (either root or !root) and
> reduces security attack surface of the new code. The buggy umh code would crash
> the user process, but not the kernel. Another advantage is that umh code
> of the kernel module can be debugged and tested out of user space
> (e.g. opening the possibility to run clang sanitizers, fuzzers or
> user space test suites on the umh code).
> In case of the bpfilter project such architecture allows complex control plane
> to be done in the user space while bpf based data plane stays in the kernel.
> 
> Since umh can crash, can be oom-ed by the kernel, killed by the admin,
> the kernel module that uses them (like bpfilter) needs to manage life
> time of umh on its own via two unix pipes and the pid of umh.
> 
> The exit code of such kernel module should kill the umh it started,
> so that rmmod of the kernel module will cleanup the corresponding umh.
> Just like if the kernel module does kmalloc() it should kfree() it in the exit code.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  fs/exec.c               |  38 ++++++++---
>  include/linux/binfmts.h |   1 +
>  include/linux/umh.h     |  12 ++++
>  kernel/umh.c            | 176 +++++++++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 215 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 183059c427b9..30a36c2a39bf 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1706,14 +1706,13 @@ static int exec_binprm(struct linux_binprm *bprm)
>  /*
>   * sys_execve() executes a new program.
>   */
> -static int do_execveat_common(int fd, struct filename *filename,
> -			      struct user_arg_ptr argv,
> -			      struct user_arg_ptr envp,
> -			      int flags)
> +static int __do_execve_file(int fd, struct filename *filename,
> +			    struct user_arg_ptr argv,
> +			    struct user_arg_ptr envp,
> +			    int flags, struct file *file)
>  {
>  	char *pathbuf = NULL;
>  	struct linux_binprm *bprm;
> -	struct file *file;
>  	struct files_struct *displaced;
>  	int retval;

Keeping in mind a fuzzer...

Note, right below this, and not shown here in the hunk, is:

        if (IS_ERR(filename))                                                   
                return PTR_ERR(filename)
>  
> @@ -1752,7 +1751,8 @@ static int do_execveat_common(int fd, struct filename *filename,
>  	check_unsafe_exec(bprm);
>  	current->in_execve = 1;
>  
> -	file = do_open_execat(fd, filename, flags);
> +	if (!file)
> +		file = do_open_execat(fd, filename, flags);


Here we now seem to allow !file and open the file with the passed fd as in
the old days. This is an expected change.

>  	retval = PTR_ERR(file);
>  	if (IS_ERR(file))
>  		goto out_unmark;
> @@ -1760,7 +1760,9 @@ static int do_execveat_common(int fd, struct filename *filename,
>  	sched_exec();
>  
>  	bprm->file = file;
> -	if (fd == AT_FDCWD || filename->name[0] == '/') {
> +	if (!filename) {

If anything shouldn't this be:

	if (IS_ERR(filename))

But, wouldn't the above first branch in the routine catch this?

> +		bprm->filename = "none";

Given this seems like a desirable branch which was tested, wonder how this
ever got set if the above branch in the first hunk I noted hit true?

In any case, we seem to have two cases, can we rule out the exact requirements
at the top so we can bail out with an error code if one or the other way to
call this function does not align with expectations?

> +	} else if (fd == AT_FDCWD || filename->name[0] == '/') {
>  		bprm->filename = filename->name;
>  	} else {
>  		if (filename->name[0] == '\0')
> @@ -1826,7 +1828,8 @@ static int do_execveat_common(int fd, struct filename *filename,
>  	task_numa_free(current);
>  	free_bprm(bprm);
>  	kfree(pathbuf);
> -	putname(filename);
> +	if (filename)
> +		putname(filename);
>  	if (displaced)
>  		put_files_struct(displaced);
>  	return retval;
> @@ -1849,10 +1852,27 @@ static int do_execveat_common(int fd, struct filename *filename,
>  	if (displaced)
>  		reset_files_struct(displaced);
>  out_ret:
> -	putname(filename);
> +	if (filename)
> +		putname(filename);
>  	return retval;
>  }
>  
> +static int do_execveat_common(int fd, struct filename *filename,

Further signs the filename is now optional. But I don't understand how these
branches ever be true, but perhaps I'm missing something?

> +			      struct user_arg_ptr argv,
> +			      struct user_arg_ptr envp,
> +			      int flags)
> +{
> +	return __do_execve_file(fd, filename, argv, envp, flags, NULL);
> +}
> +
> +int do_execve_file(struct file *file, void *__argv, void *__envp)
> +{
> +	struct user_arg_ptr argv = { .ptr.native = __argv };
> +	struct user_arg_ptr envp = { .ptr.native = __envp };
> +
> +	return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
> +}

Or maybe do the semantics expectations checks here, so we don't clutter
do_execveat_common() with them?

> +
>  int do_execve(struct filename *filename,
>  	const char __user *const __user *__argv,
>  	const char __user *const __user *__envp)
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index 4955e0863b83..c05f24fac4f6 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -150,5 +150,6 @@ extern int do_execveat(int, struct filename *,
>  		       const char __user * const __user *,
>  		       const char __user * const __user *,
>  		       int);
> +int do_execve_file(struct file *file, void *__argv, void *__envp);
>  
>  #endif /* _LINUX_BINFMTS_H */
> diff --git a/include/linux/umh.h b/include/linux/umh.h
> index 244aff638220..5c812acbb80a 100644
> --- a/include/linux/umh.h
> +++ b/include/linux/umh.h
> @@ -22,8 +22,10 @@ struct subprocess_info {
>  	const char *path;
>  	char **argv;
>  	char **envp;
> +	struct file *file;
>  	int wait;
>  	int retval;
> +	pid_t pid;
>  	int (*init)(struct subprocess_info *info, struct cred *new);
>  	void (*cleanup)(struct subprocess_info *info);
>  	void *data;

While at it, can you kdocify struct subprocess_info and add new docs for at
least these two entires you are adding ?

> @@ -38,6 +40,16 @@ call_usermodehelper_setup(const char *path, char **argv, char **envp,
>  			  int (*init)(struct subprocess_info *info, struct cred *new),
>  			  void (*cleanup)(struct subprocess_info *), void *data);
>  
> +struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
> +			  int (*init)(struct subprocess_info *info, struct cred *new),
> +			  void (*cleanup)(struct subprocess_info *), void *data);

Likewise but on the umc.c file.

> +struct umh_info {
> +	struct file *pipe_to_umh;
> +	struct file *pipe_from_umh;
> +	pid_t pid;
> +};

Likewise.

> +int fork_usermode_blob(void *data, size_t len, struct umh_info *info);

Likewise but on the umc.c files.

> +
>  extern int
>  call_usermodehelper_exec(struct subprocess_info *info, int wait);
>  
> diff --git a/kernel/umh.c b/kernel/umh.c
> index f76b3ff876cf..c3f418d7d51a 100644
> --- a/kernel/umh.c
> +++ b/kernel/umh.c
> @@ -25,6 +25,8 @@
>  #include <linux/ptrace.h>
>  #include <linux/async.h>
>  #include <linux/uaccess.h>
> +#include <linux/shmem_fs.h>
> +#include <linux/pipe_fs_i.h>
>  
>  #include <trace/events/module.h>
>  
> @@ -97,9 +99,13 @@ static int call_usermodehelper_exec_async(void *data)
>  
>  	commit_creds(new);
>  
> -	retval = do_execve(getname_kernel(sub_info->path),
> -			   (const char __user *const __user *)sub_info->argv,
> -			   (const char __user *const __user *)sub_info->envp);
> +	if (sub_info->file)
> +		retval = do_execve_file(sub_info->file,
> +					sub_info->argv, sub_info->envp);
> +	else
> +		retval = do_execve(getname_kernel(sub_info->path),
> +				   (const char __user *const __user *)sub_info->argv,
> +				   (const char __user *const __user *)sub_info->envp);
>  out:
>  	sub_info->retval = retval;
>  	/*
> @@ -185,6 +191,8 @@ static void call_usermodehelper_exec_work(struct work_struct *work)
>  		if (pid < 0) {
>  			sub_info->retval = pid;
>  			umh_complete(sub_info);
> +		} else {
> +			sub_info->pid = pid;
>  		}
>  	}
>  }
> @@ -393,6 +401,168 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
>  }
>  EXPORT_SYMBOL(call_usermodehelper_setup);
>  
> +struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
> +		int (*init)(struct subprocess_info *info, struct cred *new),
> +		void (*cleanup)(struct subprocess_info *info), void *data)

Should be static, no other users outside of this file.
Please use umh_setup_file().

> +{
> +	struct subprocess_info *sub_info;

Considering a possible fuzzer triggering random data we should probably
return NULL early and avoid the kzalloc if:

	if (!file || !init || !cleanup)
		return NULL;

Is data optional? The kdoc could clarify this.


> +
> +	sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL);
> +	if (!sub_info)
> +		return NULL;
> +
> +	INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
> +	sub_info->path = "none";
> +	sub_info->file = file;
> +	sub_info->init = init;
> +	sub_info->cleanup = cleanup;
> +	sub_info->data = data;
> +	return sub_info;
> +}
> +
> +static struct vfsmount *umh_fs;
> +
> +static int init_tmpfs(void)

Please use umh_init_tmpfs(). Also see init/main.c do_basic_setup() which calls
usermodehelper_enable() prior to do_initcalls(). Now, fortunately TMPFS is only
bool, saving us from some races and we do call tmpfs's init first shmem_init():

static void __init do_basic_setup(void)
{
	cpuset_init_smp();
	shmem_init();
	driver_init();
	init_irq_proc();
	do_ctors();
	usermodehelper_enable();
	do_initcalls();
}

But it also means we're enabling your new call call fork_usermode_blob() on
early init code even if we're not setup. Since this umh tmpfs vfsmount is
shared I'd say just call this init right before usermodehelper_enable()
on do_basic_setup().

> +{
> +	struct file_system_type *type;
> +
> +	if (umh_fs)
> +		return 0;
> +	type = get_fs_type("tmpfs");
> +	if (!type)
> +		return -ENODEV;
> +	umh_fs = kern_mount(type);
> +	if (IS_ERR(umh_fs)) {
> +		int err = PTR_ERR(umh_fs);
> +
> +		put_filesystem(type);
> +		umh_fs = NULL;
> +		return err;
> +	}
> +	return 0;
> +}
> +
> +static int alloc_tmpfs_file(size_t size, struct file **filp)

Please use umh_alloc_tmpfs_file()

> +{
> +	struct file *file;
> +	int err;
> +
> +	err = init_tmpfs();
> +	if (err)
> +		return err;
> +	file = shmem_file_setup_with_mnt(umh_fs, "umh", size, VM_NORESERVE);
> +	if (IS_ERR(file))
> +		return PTR_ERR(file);
> +	*filp = file;
> +	return 0;
> +}
> +
> +static int populate_file(struct file *file, const void *data, size_t size)

Please use umh_populate_file()

> +{
> +	size_t offset = 0;
> +	int err;
> +
> +	do {
> +		unsigned int len = min_t(typeof(size), size, PAGE_SIZE);
> +		struct page *page;
> +		void *pgdata, *vaddr;
> +
> +		err = pagecache_write_begin(file, file->f_mapping, offset, len,
> +					    0, &page, &pgdata);
> +		if (err < 0)
> +			goto fail;
> +
> +		vaddr = kmap(page);
> +		memcpy(vaddr, data, len);
> +		kunmap(page);
> +
> +		err = pagecache_write_end(file, file->f_mapping, offset, len,
> +					  len, page, pgdata);
> +		if (err < 0)
> +			goto fail;
> +
> +		size -= len;
> +		data += len;
> +		offset += len;
> +	} while (size);

Character for character, this looks like a wonderful copy and paste from
i915_gem_object_create_from_data()'s own loop which does the same exact
thing. Perhaps its time for a helper on mm/filemap.c with an export so
if a bug is fixed in one place its fixed in both places.

> +	return 0;
> +fail:
> +	return err;
> +}
> +
> +static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)

The function name umh_pipe_setup() is also used on fs/coredump.c, with the same
prototype, perhaps rename that before we take this on, even if both are static.

> +{
> +	struct umh_info *umh_info = info->data;
> +	struct file *from_umh[2];
> +	struct file *to_umh[2];
> +	int err;
> +
> +	/* create pipe to send data to umh */
> +	err = create_pipe_files(to_umh, 0);
> +	if (err)
> +		return err;
> +	err = replace_fd(0, to_umh[0], 0);
> +	fput(to_umh[0]);
> +	if (err < 0) {
> +		fput(to_umh[1]);
> +		return err;
> +	}
> +
> +	/* create pipe to receive data from umh */
> +	err = create_pipe_files(from_umh, 0);
> +	if (err) {
> +		fput(to_umh[1]);
> +		replace_fd(0, NULL, 0);
> +		return err;
> +	}
> +	err = replace_fd(1, from_umh[1], 0);
> +	fput(from_umh[1]);
> +	if (err < 0) {
> +		fput(to_umh[1]);
> +		replace_fd(0, NULL, 0);
> +		fput(from_umh[0]);
> +		return err;
> +	}
> +
> +	umh_info->pipe_to_umh = to_umh[1];
> +	umh_info->pipe_from_umh = from_umh[0];
> +	return 0;
> +}
> +
> +static void umh_save_pid(struct subprocess_info *info)
> +{
> +	struct umh_info *umh_info = info->data;
> +
> +	umh_info->pid = info->pid;
> +}
> +
> +int fork_usermode_blob(void *data, size_t len, struct umh_info *info)

Please use umh_fork_blob()

> +{
> +	struct subprocess_info *sub_info;
> +	struct file *file = NULL;
> +	int err;
> +
> +	err = alloc_tmpfs_file(len, &file);
> +	if (err)
> +		return err;
> +
> +	err = populate_file(file, data, len);
> +	if (err)
> +		goto out;
> +
> +	err = -ENOMEM;
> +	sub_info = call_usermodehelper_setup_file(file, umh_pipe_setup,
> +						  umh_save_pid, info);
> +	if (!sub_info)
> +		goto out;
> +
> +	err = call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);

Alright, neat, so to be clear, we're just glad to try inception, we have no
clue or idea what the real return value would be, its up to the caller to track
the progress somehow?

Can you add a kdoc entry for this and clarify requirements?

Also, can you extend lib/test_kmod.c with a test case for this with its own
demo and try to blow it up?

I hadn't tried suspend/resume during a kmod test, but since we're using a
kernel_thread() I wouldn't be surprised if we barf while stress testing the
module loader. Its surely a corner case, but better mention that now than cry
later if we get heavy umh modules and all of a sudden we start using this for
whatever reason close to suspend.

  Luis

> +out:
> +	fput(file);
> +	return err;
> +}
> +EXPORT_SYMBOL_GPL(fork_usermode_blob);
> +
>  /**
>   * call_usermodehelper_exec - start a usermode application
>   * @sub_info: information about the subprocessa
> -- 
> 2.9.5


-- 
Do not panic

^ permalink raw reply

* Re: [PATCH v2 0/4] Introduce LSM-hook for socketpair(2)
From: James Morris @ 2018-05-04 19:51 UTC (permalink / raw)
  To: David Herrmann
  Cc: linux-kernel, Paul Moore, teg, Stephen Smalley, selinux,
	linux-security-module, Eric Paris, serge, Casey Schaufler, davem,
	netdev
In-Reply-To: <20180504142822.15233-1-dh.herrmann@gmail.com>

On Fri, 4 May 2018, David Herrmann wrote:

> Hi
> 
> This is v2 of the socketpair(2) LSM hook introduction.

Thanks, all applied to
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security.git next-general


-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply

* Re: Hello My Dear Friend!!!!
From: SISTER DOROTHY KENT @ 2018-05-04 19:39 UTC (permalink / raw)


Hello Dear

I am Sister Dorothy Kent I really need your assistance to help me discuss a project .

Thanks,
Sister Dorothy Kent

^ permalink raw reply

* Re: [PATCH 04/15] media: pxa_camera: remove the dmaengine compat need
From: Mauro Carvalho Chehab @ 2018-05-04 19:29 UTC (permalink / raw)
  To: Hans Verkuil
  Cc: Ulf Hansson, alsa-devel, Takashi Iwai, linux-ide, Pravin Shedge,
	netdev, linux-mtd, Laurent Pinchart, Robert Jarzmik, devel,
	Boris Brezillon, dmaengine, Vinod Koul, Richard Weinberger,
	Robert Jarzmik, Marek Vasut, Ezequiel Garcia, linux-media,
	Samuel Ortiz, Arnd Bergmann, Bartlomiej Zolnierkiewicz,
	Haojian Zhuang, Flavio Ceolin, Mark 
In-Reply-To: <89f3e97c-08a4-33b6-8fd3-0c1b4dea2065@xs4all.nl>

Em Sun, 22 Apr 2018 13:06:12 +0200
Hans Verkuil <hverkuil@xs4all.nl> escreveu:

> On 04/02/2018 04:26 PM, Robert Jarzmik wrote:
> > From: Robert Jarzmik <robert.jarzmik@renault.com>
> > 
> > As the pxa architecture switched towards the dmaengine slave map, the
> > old compatibility mechanism to acquire the dma requestor line number and
> > priority are not needed anymore.
> > 
> > This patch simplifies the dma resource acquisition, using the more
> > generic function dma_request_slave_channel().
> > 
> > Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr>  
> 
> Acked-by: Hans Verkuil <hans.verkuil@cisco.com>

I'm assuming that you'll be applying it together with other PXA patches.
So:

Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

Regards,
Mauro
> 
> Regards,
> 
> 	Hans
> 
> > ---
> >  drivers/media/platform/pxa_camera.c | 22 +++-------------------
> >  1 file changed, 3 insertions(+), 19 deletions(-)
> > 
> > diff --git a/drivers/media/platform/pxa_camera.c b/drivers/media/platform/pxa_camera.c
> > index c71a00736541..4c82d1880753 100644
> > --- a/drivers/media/platform/pxa_camera.c
> > +++ b/drivers/media/platform/pxa_camera.c
> > @@ -2357,8 +2357,6 @@ static int pxa_camera_probe(struct platform_device *pdev)
> >  		.src_maxburst = 8,
> >  		.direction = DMA_DEV_TO_MEM,
> >  	};
> > -	dma_cap_mask_t mask;
> > -	struct pxad_param params;
> >  	char clk_name[V4L2_CLK_NAME_SIZE];
> >  	int irq;
> >  	int err = 0, i;
> > @@ -2432,34 +2430,20 @@ static int pxa_camera_probe(struct platform_device *pdev)
> >  	pcdev->base = base;
> >  
> >  	/* request dma */
> > -	dma_cap_zero(mask);
> > -	dma_cap_set(DMA_SLAVE, mask);
> > -	dma_cap_set(DMA_PRIVATE, mask);
> > -
> > -	params.prio = 0;
> > -	params.drcmr = 68;
> > -	pcdev->dma_chans[0] =
> > -		dma_request_slave_channel_compat(mask, pxad_filter_fn,
> > -						 &params, &pdev->dev, "CI_Y");
> > +	pcdev->dma_chans[0] = dma_request_slave_channel(&pdev->dev, "CI_Y");
> >  	if (!pcdev->dma_chans[0]) {
> >  		dev_err(&pdev->dev, "Can't request DMA for Y\n");
> >  		return -ENODEV;
> >  	}
> >  
> > -	params.drcmr = 69;
> > -	pcdev->dma_chans[1] =
> > -		dma_request_slave_channel_compat(mask, pxad_filter_fn,
> > -						 &params, &pdev->dev, "CI_U");
> > +	pcdev->dma_chans[1] = dma_request_slave_channel(&pdev->dev, "CI_U");
> >  	if (!pcdev->dma_chans[1]) {
> >  		dev_err(&pdev->dev, "Can't request DMA for Y\n");
> >  		err = -ENODEV;
> >  		goto exit_free_dma_y;
> >  	}
> >  
> > -	params.drcmr = 70;
> > -	pcdev->dma_chans[2] =
> > -		dma_request_slave_channel_compat(mask, pxad_filter_fn,
> > -						 &params, &pdev->dev, "CI_V");
> > +	pcdev->dma_chans[2] = dma_request_slave_channel(&pdev->dev, "CI_V");
> >  	if (!pcdev->dma_chans[2]) {
> >  		dev_err(&pdev->dev, "Can't request DMA for V\n");
> >  		err = -ENODEV;
> >   
> 



Thanks,
Mauro

^ permalink raw reply

* Re: [PATCH v2 bpf-next 0/3] bpf: cleanups on managing subprog information
From: Daniel Borkmann @ 2018-05-04 19:20 UTC (permalink / raw)
  To: Jiong Wang, alexei.starovoitov, borkmann
  Cc: john.fastabend, ecree, netdev, oss-drivers
In-Reply-To: <1525292239-1309-1-git-send-email-jiong.wang@netronome.com>

On 05/02/2018 10:17 PM, Jiong Wang wrote:
> This patch set clean up some code logic related with managing subprog
> information.
> 
> Part of the set are inspried by Edwin's code in his RFC:
> 
>   "bpf/verifier: subprog/func_call simplifications"
> 
> but with clearer separation so it could be easier to review.
> 
>   - Path 1 unifies main prog and subprogs. All of them are registered in
>     env->subprog_starts.
> 
>   - After patch 1, it is clear that subprog_starts and subprog_stack_depth
>     could be merged as both of them now have main and subprog unified.
>     Patch 2 therefore does the merge, all subprog information are centred
>     at bpf_subprog_info.
> 
>   - Patch 3 goes further to introduce a new fake "exit" subprog which
>     serves as an ending marker to the subprog list. We could then turn the
>     following code snippets across verifier:
> 
>        if (env->subprog_cnt == cur_subprog + 1)
>                subprog_end = insn_cnt;
>        else
>                subprog_end = env->subprog_info[cur_subprog + 1].start;
> 
>     into:
>        subprog_end = env->subprog_info[cur_subprog + 1].start;
> 
> There is no functional change by this patch set.
> No bpf selftest (both non-jit and jit) regression found after this set.
> 
> v2:
>   - fixed adjust_subprog_starts to also update fake "exit" subprog start.
>   - for John's suggestion on renaming subprog to prog, I could work on
>     a follow-up patch if it is recognized as worth the change.
> 
> Jiong Wang (3):
>   bpf: unify main prog and subprog
>   bpf: centre subprog information fields
>   bpf: add faked "ending" subprog
> 
>  include/linux/bpf_verifier.h |   9 ++--
>  kernel/bpf/verifier.c        | 121 ++++++++++++++++++++++---------------------
>  2 files changed, 67 insertions(+), 63 deletions(-)
> 

LGTM, applied to bpf-next, thanks Jiong!

^ permalink raw reply

* Re: [PATCH v5 0/6] firmware_loader: cleanups for v4.18
From: Krzysztof Halasa @ 2018-05-04 19:17 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: gregkh, akpm, keescook, josh, teg, wagi, hdegoede, andresx7,
	zohar, kubakici, shuah, mfuzzey, dhowells, pali.rohar, tiwai,
	kvalo, arend.vanspriel, zajec5, nbroeking, markivx, broonie,
	dmitry.torokhov, dwmw2, torvalds, Abhay_Salunke, bjorn.andersson,
	jewalt, oneukum, cantabile.desu, ast, hare, jejb, martin.petersen,
	davem, maco, arve, tkjos
In-Reply-To: <20180504174356.13227-1-mcgrof@kernel.org>

"Luis R. Rodriguez" <mcgrof@kernel.org> writes:

>   * CONFIG_WANXL --> CONFIG_WANXL_BUILD_FIRMWARE
>   * CONFIG_SCSI_AIC79XX --> CONFIG_AIC79XX_BUILD_FIRMWARE
>
> To this day both of these drivers are building driver *firmwares* when
> the option CONFIG_PREVENT_FIRMWARE_BUILD is disabled, and they don't
> even make use of the firmware API at all.

Don't know for sure about Adaptec, but wanXL firmware fits every
definition of "stable". BTW it's a 1997 or so early PCI card, built
around the PLX PCI9060 bus mastering PCI bridge and Motorola 68360
(the original QUICC) processor. Maximum bit rate of 2 Mb/s on each sync
serial port.

It's more about delivering the .S source for the firmware, I guess.
Nobody is expected to build it. The fw is about 2.5 KB and is directly
linked with the driver.
-- 
Krzysztof Halasa

^ permalink raw reply

* Re: [PATCH net-next v8 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc
From: Toke Høiland-Jørgensen @ 2018-05-04 19:10 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers, Cake List
In-Reply-To: <CAM_iQpUEaYN+ziMF1CUaGkYR-sTH1hc9vPdQ5RQLvSMZJGdrVg@mail.gmail.com>

Thank you for the review! A few comments below, I'll fix the rest.

> [...]
> 
> So sch_cake doesn't accept normal tc filters? Is this intentional?
> If so, why?

For two reasons:

- The two-level scheduling used in CAKE (tins / diffserv classes, and
  flow hashing) does not map in an obvious way to the classification
  index of tc filters.

- No one has asked for it. We have done our best to accommodate the
  features people want in a home router qdisc directly in CAKE, and the
  ability to integrate tc filters has never been requested.

>> +static u16 quantum_div[CAKE_QUEUES + 1] = {0};
>> +
>> +#define REC_INV_SQRT_CACHE (16)
>> +static u32 cobalt_rec_inv_sqrt_cache[REC_INV_SQRT_CACHE] = {0};
>> +
>> +/* http://en.wikipedia.org/wiki/Methods_of_computing_square_roots
>> + * new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2)
>> + *
>> + * Here, invsqrt is a fixed point number (< 1.0), 32bit mantissa, aka Q0.32
>> + */
>> +
>> +static void cobalt_newton_step(struct cobalt_vars *vars)
>> +{
>> +       u32 invsqrt = vars->rec_inv_sqrt;
>> +       u32 invsqrt2 = ((u64)invsqrt * invsqrt) >> 32;
>> +       u64 val = (3LL << 32) - ((u64)vars->count * invsqrt2);
>> +
>> +       val >>= 2; /* avoid overflow in following multiply */
>> +       val = (val * invsqrt) >> (32 - 2 + 1);
>> +
>> +       vars->rec_inv_sqrt = val;
>> +}
>> +
>> +static void cobalt_invsqrt(struct cobalt_vars *vars)
>> +{
>> +       if (vars->count < REC_INV_SQRT_CACHE)
>> +               vars->rec_inv_sqrt = cobalt_rec_inv_sqrt_cache[vars->count];
>> +       else
>> +               cobalt_newton_step(vars);
>> +}
>
> Looks pretty much duplicated with codel...

Cobalt is derived from CoDel, and so naturally shares some features with
it. However, it is quite different in other respects, so we can't just
use the existing CoDel code for the parts that are similar. We don't
feel quite confident enough in Cobalt (yet) to propose it replace CoDel
everywhere else in the kernel; so we have elected to keep it internal to
CAKE instead.

>> [...]
>>
>> +static int cake_init(struct Qdisc *sch, struct nlattr *opt,
>> +                    struct netlink_ext_ack *extack)
>> +{
>> +       struct cake_sched_data *q = qdisc_priv(sch);
>> +       int i, j;
>> +
>> +       sch->limit = 10240;
>> +       q->tin_mode = CAKE_DIFFSERV_BESTEFFORT;
>> +       q->flow_mode  = CAKE_FLOW_TRIPLE;
>> +
>> +       q->rate_bps = 0; /* unlimited by default */
>> +
>> +       q->interval = 100000; /* 100ms default */
>> +       q->target   =   5000; /* 5ms: codel RFC argues
>> +                              * for 5 to 10% of interval
>> +                              */
>> +
>> +       q->cur_tin = 0;
>> +       q->cur_flow  = 0;
>> +
>> +       if (opt) {
>> +               int err = cake_change(sch, opt, extack);
>> +
>> +               if (err)
>> +                       return err;
>
>
> Not sure if you really want to reallocate q->tines below for this
> case.

I'm not sure what you mean here? If there's an error we return it and
the qdisc is not created. If there's not, we allocate and on subsequent
changes cake_change() will be called directly, or? Can the init function
ever be called again during the lifetime of the qdisc?

-Toke

^ permalink raw reply

* Re: [PATCH] net/mlx5: fix spelling mistake: "modfiy" -> "modify"
From: Saeed Mahameed @ 2018-05-04 19:09 UTC (permalink / raw)
  To: davem@davemloft.net, colin.king@canonical.com
  Cc: Matan Barak, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, leon@kernel.org,
	kernel-janitors@vger.kernel.org
In-Reply-To: <20180503.144440.629738993491549851.davem@davemloft.net>

On Thu, 2018-05-03 at 14:44 -0400, David Miller wrote:
> From: Colin King <colin.king@canonical.com>
> Date: Thu,  3 May 2018 14:35:03 +0100
> 
> > From: Colin Ian King <colin.king@canonical.com>
> > 
> > Trivial fix to spelling mistake in netdev_warn warning message
> > 
> > Signed-off-by: Colin Ian King <colin.king@canonical.com>
> 
> Saeed, please send this to me in your next pull request.
> 

Applied to mlx5-next, Thanks Colin and Dave !


^ permalink raw reply

* Re: [RFC PATCH 2/2] net: mac808211: mac802154: use lockdep_assert_in_softirq() instead own warning
From: Sebastian Andrzej Siewior @ 2018-05-04 19:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, tglx, Ingo Molnar, David S. Miller, Johannes Berg,
	Alexander Aring, Stefan Schmidt, netdev, linux-wireless,
	linux-wpan, Anna-Maria Gleixner
In-Reply-To: <20180504185132.GV12217@hirez.programming.kicks-ass.net>

On 2018-05-04 20:51:32 [+0200], Peter Zijlstra wrote:
> On Fri, May 04, 2018 at 08:45:39PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2018-05-04 20:32:49 [+0200], Peter Zijlstra wrote:
> > > On Fri, May 04, 2018 at 07:51:44PM +0200, Sebastian Andrzej Siewior wrote:
> > > > From: Anna-Maria Gleixner <anna-maria@linutronix.de>
> > > > 
> > > > The warning in ieee802154_rx() and ieee80211_rx_napi() is there to ensure
> > > > the softirq context for the subsequent netif_receive_skb() call. 
> > > 
> > > That's not in fact what it does though; so while that might indeed be
> > > the intent that's not what it does.
> > 
> > It was introduced in commit d20ef63d3246 ("mac80211: document
> > ieee80211_rx() context requirement"):
> > 
> >     mac80211: document ieee80211_rx() context requirement
> >     
> >     ieee80211_rx() must be called with softirqs disabled
> 
> softirqs disabled, ack that is exactly what it checks.
> 
> But afaict the assertion you introduced tests that we are _in_ softirq
> context, which is not the same.

indeed, now it clicked. Given what I wrote in the cover letter would you
be in favour of (a proper) lockdep_assert_BH_disabled() or the cheaper
local_bh_enable() (assuming the network folks don't mind the cheaper
version)?

Sebastian

^ permalink raw reply

* Re: [RFC PATCH 2/2] net: mac808211: mac802154: use lockdep_assert_in_softirq() instead own warning
From: Peter Zijlstra @ 2018-05-04 18:51 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, tglx, Ingo Molnar, David S. Miller, Johannes Berg,
	Alexander Aring, Stefan Schmidt, netdev, linux-wireless,
	linux-wpan, Anna-Maria Gleixner
In-Reply-To: <20180504184538.gavk7e2vtm73e32w@linutronix.de>

On Fri, May 04, 2018 at 08:45:39PM +0200, Sebastian Andrzej Siewior wrote:
> On 2018-05-04 20:32:49 [+0200], Peter Zijlstra wrote:
> > On Fri, May 04, 2018 at 07:51:44PM +0200, Sebastian Andrzej Siewior wrote:
> > > From: Anna-Maria Gleixner <anna-maria@linutronix.de>
> > > 
> > > The warning in ieee802154_rx() and ieee80211_rx_napi() is there to ensure
> > > the softirq context for the subsequent netif_receive_skb() call. 
> > 
> > That's not in fact what it does though; so while that might indeed be
> > the intent that's not what it does.
> 
> It was introduced in commit d20ef63d3246 ("mac80211: document
> ieee80211_rx() context requirement"):
> 
>     mac80211: document ieee80211_rx() context requirement
>     
>     ieee80211_rx() must be called with softirqs disabled

softirqs disabled, ack that is exactly what it checks.

But afaict the assertion you introduced tests that we are _in_ softirq
context, which is not the same.

^ permalink raw reply

* Re: [RFC PATCH 2/2] net: mac808211: mac802154: use lockdep_assert_in_softirq() instead own warning
From: Sebastian Andrzej Siewior @ 2018-05-04 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, tglx, Ingo Molnar, David S. Miller, Johannes Berg,
	Alexander Aring, Stefan Schmidt, netdev, linux-wireless,
	linux-wpan, Anna-Maria Gleixner
In-Reply-To: <20180504183249.GU12217@hirez.programming.kicks-ass.net>

On 2018-05-04 20:32:49 [+0200], Peter Zijlstra wrote:
> On Fri, May 04, 2018 at 07:51:44PM +0200, Sebastian Andrzej Siewior wrote:
> > From: Anna-Maria Gleixner <anna-maria@linutronix.de>
> > 
> > The warning in ieee802154_rx() and ieee80211_rx_napi() is there to ensure
> > the softirq context for the subsequent netif_receive_skb() call. 
> 
> That's not in fact what it does though; so while that might indeed be
> the intent that's not what it does.

It was introduced in commit d20ef63d3246 ("mac80211: document
ieee80211_rx() context requirement"):

    mac80211: document ieee80211_rx() context requirement
    
    ieee80211_rx() must be called with softirqs disabled
    since the networking stack requires this for netif_rx()
    and some code in mac80211 can assume that it can not
    be processing its own tasklet and this call at the same
    time.
    
    It may be possible to remove this requirement after a
    careful audit of mac80211 and doing any needed locking
    improvements in it along with disabling softirqs around
    netif_rx(). An alternative might be to push all packet
    processing to process context in mac80211, instead of
    to the tasklet, and add other synchronisation.

Sebastian

^ permalink raw reply

* Re: [RFC PATCH 2/2] net: mac808211: mac802154: use lockdep_assert_in_softirq() instead own warning
From: Peter Zijlstra @ 2018-05-04 18:32 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, tglx, Ingo Molnar, David S. Miller, Johannes Berg,
	Alexander Aring, Stefan Schmidt, netdev, linux-wireless,
	linux-wpan, Anna-Maria Gleixner
In-Reply-To: <20180504175144.12179-3-bigeasy@linutronix.de>

On Fri, May 04, 2018 at 07:51:44PM +0200, Sebastian Andrzej Siewior wrote:
> From: Anna-Maria Gleixner <anna-maria@linutronix.de>
> 
> The warning in ieee802154_rx() and ieee80211_rx_napi() is there to ensure
> the softirq context for the subsequent netif_receive_skb() call. 

That's not in fact what it does though; so while that might indeed be
the intent that's not what it does.

^ permalink raw reply

* [net-next PATCH v2 8/8] net: Add NETIF_F_GSO_UDP_L4 to list of GSO offloads with fallback
From: Alexander Duyck @ 2018-05-04 18:31 UTC (permalink / raw)
  To: netdev, willemb, davem
In-Reply-To: <20180504182537.5194.72775.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

Enable UDP offload as a generic software offload since we can now handle it
for multiple cases including if the checksum isn't present and via
gso_partial in the case of tunnels.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 include/linux/netdev_features.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index c87c3a3453c1..efbd8b2c0197 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -184,7 +184,8 @@ enum {
 
 /* List of features with software fallbacks. */
 #define NETIF_F_GSO_SOFTWARE	(NETIF_F_ALL_TSO | \
-				 NETIF_F_GSO_SCTP)
+				 NETIF_F_GSO_SCTP| \
+				 NETIF_F_GSO_UDP_L4)
 
 /*
  * If one device supports one of these features, then enable them

^ permalink raw reply related

* [net-next PATCH v2 7/8] udp: Do not copy destructor if one is not present
From: Alexander Duyck @ 2018-05-04 18:31 UTC (permalink / raw)
  To: netdev, willemb, davem
In-Reply-To: <20180504182537.5194.72775.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This patch makes it so that if a destructor is not present we avoid trying
to update the skb socket or any reference counting that would be associated
with the NULL socket and/or descriptor. By doing this we can support
traffic coming from another namespace without any issues.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

v2: Do not overwrite destructor if not sock_wfree as per Eric
    Drop WARN_ON as per Eric

 net/ipv4/udp_offload.c |   21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index fd94bbb369b2..e802f6331589 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -195,6 +195,7 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 	unsigned int sum_truesize = 0;
 	struct udphdr *uh;
 	unsigned int mss;
+	bool copy_dtor;
 	__sum16 check;
 	__be16 newlen;
 
@@ -208,12 +209,14 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 	skb_pull(gso_skb, sizeof(*uh));
 
 	/* clear destructor to avoid skb_segment assigning it to tail */
-	WARN_ON_ONCE(gso_skb->destructor != sock_wfree);
-	gso_skb->destructor = NULL;
+	copy_dtor = gso_skb->destructor == sock_wfree;
+	if (copy_dtor)
+		gso_skb->destructor = NULL;
 
 	segs = skb_segment(gso_skb, features);
 	if (unlikely(IS_ERR_OR_NULL(segs))) {
-		gso_skb->destructor = sock_wfree;
+		if (copy_dtor)
+			gso_skb->destructor = sock_wfree;
 		return segs;
 	}
 
@@ -234,9 +237,11 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 					    (__force u32)newlen));
 
 	for (;;) {
-		seg->destructor = sock_wfree;
-		seg->sk = sk;
-		sum_truesize += seg->truesize;
+		if (copy_dtor) {
+			seg->destructor = sock_wfree;
+			seg->sk = sk;
+			sum_truesize += seg->truesize;
+		}
 
 		if (!seg->next)
 			break;
@@ -268,7 +273,9 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 		uh->check = gso_make_checksum(seg, ~check);
 
 	/* update refcount for the packet */
-	refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
+	if (copy_dtor)
+		refcount_add(sum_truesize - gso_skb->truesize,
+			     &sk->sk_wmem_alloc);
 out:
 	return segs;
 }

^ permalink raw reply related

* [net-next PATCH v2 6/8] udp: Add support for software checksum and GSO_PARTIAL with GSO offload
From: Alexander Duyck @ 2018-05-04 18:31 UTC (permalink / raw)
  To: netdev, willemb, davem
In-Reply-To: <20180504182537.5194.72775.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This patch adds support for a software provided checksum and GSO_PARTIAL
segmentation support. With this we can offload UDP segmentation on devices
that only have partial support for tunnels.

Since we are no longer needing the hardware checksum we can drop the checks
in the segmentation code that were verifying if it was present.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/ipv4/udp_offload.c |   28 ++++++++++++++++++----------
 net/ipv6/udp_offload.c |   11 +----------
 2 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 946d06d2aa0c..fd94bbb369b2 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -217,6 +217,13 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 		return segs;
 	}
 
+	/* GSO partial and frag_list segmentation only requires splitting
+	 * the frame into an MSS multiple and possibly a remainder, both
+	 * cases return a GSO skb. So update the mss now.
+	 */
+	if (skb_is_gso(segs))
+		mss *= skb_shinfo(segs)->gso_segs;
+
 	seg = segs;
 	uh = udp_hdr(seg);
 
@@ -237,6 +244,11 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 		uh->len = newlen;
 		uh->check = check;
 
+		if (seg->ip_summed == CHECKSUM_PARTIAL)
+			gso_reset_checksum(seg, ~check);
+		else
+			uh->check = gso_make_checksum(seg, ~check);
+
 		seg = seg->next;
 		uh = udp_hdr(seg);
 	}
@@ -250,6 +262,11 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 	uh->len = newlen;
 	uh->check = check;
 
+	if (seg->ip_summed == CHECKSUM_PARTIAL)
+		gso_reset_checksum(seg, ~check);
+	else
+		uh->check = gso_make_checksum(seg, ~check);
+
 	/* update refcount for the packet */
 	refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
 out:
@@ -257,15 +274,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 }
 EXPORT_SYMBOL_GPL(__udp_gso_segment);
 
-static struct sk_buff *__udp4_gso_segment(struct sk_buff *gso_skb,
-					  netdev_features_t features)
-{
-	if (!can_checksum_protocol(features, htons(ETH_P_IP)))
-		return ERR_PTR(-EIO);
-
-	return __udp_gso_segment(gso_skb, features);
-}
-
 static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
 					 netdev_features_t features)
 {
@@ -289,7 +297,7 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
 		goto out;
 
 	if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
-		return __udp4_gso_segment(skb, features);
+		return __udp_gso_segment(skb, features);
 
 	mss = skb_shinfo(skb)->gso_size;
 	if (unlikely(skb->len <= mss))
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 61e34f1d2fa2..03a2ff3fe1e6 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -17,15 +17,6 @@
 #include <net/ip6_checksum.h>
 #include "ip6_offload.h"
 
-static struct sk_buff *__udp6_gso_segment(struct sk_buff *gso_skb,
-					  netdev_features_t features)
-{
-	if (!can_checksum_protocol(features, htons(ETH_P_IPV6)))
-		return ERR_PTR(-EIO);
-
-	return __udp_gso_segment(gso_skb, features);
-}
-
 static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
 					 netdev_features_t features)
 {
@@ -58,7 +49,7 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
 			goto out;
 
 		if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
-			return __udp6_gso_segment(skb, features);
+			return __udp_gso_segment(skb, features);
 
 		/* Do software UFO. Complete and fill in the UDP checksum as HW cannot
 		 * do checksum of UDP packets sent as multiple IP fragments.

^ permalink raw reply related

* [net-next PATCH v2 5/8] udp: Partially unroll handling of first segment and last segment
From: Alexander Duyck @ 2018-05-04 18:30 UTC (permalink / raw)
  To: netdev, willemb, davem
In-Reply-To: <20180504182537.5194.72775.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This patch allows us to take care of unrolling the first segment and the
last segment of the loop for processing the segmented skb. Part of the
motivation for this is that it makes it easier to process the fact that the
first fame and all of the frames in between should be mostly identical
in terms of header data, and the last frame has differences in the length
and partial checksum.

In addition I am dropping the header length calculation since we don't
really need it for anything but the last frame and it can be easily
obtained by just pulling the data_len and offset of tail from the transport
header.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

v2: New break-out patch based on one patch from earlier series

 net/ipv4/udp_offload.c |   35 ++++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 5c4bb8b9e83e..946d06d2aa0c 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -193,7 +193,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 	struct sk_buff *seg, *segs = ERR_PTR(-EINVAL);
 	struct sock *sk = gso_skb->sk;
 	unsigned int sum_truesize = 0;
-	unsigned int hdrlen;
 	struct udphdr *uh;
 	unsigned int mss;
 	__sum16 check;
@@ -206,7 +205,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 	if (!pskb_may_pull(gso_skb, sizeof(*uh)))
 		goto out;
 
-	hdrlen = gso_skb->data - skb_mac_header(gso_skb);
 	skb_pull(gso_skb, sizeof(*uh));
 
 	/* clear destructor to avoid skb_segment assigning it to tail */
@@ -219,7 +217,8 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 		return segs;
 	}
 
-	uh = udp_hdr(segs);
+	seg = segs;
+	uh = udp_hdr(seg);
 
 	/* compute checksum adjustment based on old length versus new */
 	newlen = htons(sizeof(*uh) + mss);
@@ -227,25 +226,31 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 					    ((__force u32)uh->len ^ 0xFFFF) +
 					    (__force u32)newlen));
 
-	for (seg = segs; seg; seg = seg->next) {
-		uh = udp_hdr(seg);
+	for (;;) {
+		seg->destructor = sock_wfree;
+		seg->sk = sk;
+		sum_truesize += seg->truesize;
 
-		/* last packet can be partial gso_size */
-		if (!seg->next) {
-			newlen = htons(seg->len - hdrlen);
-			check = ~csum_fold((__force __wsum)((__force u32)uh->check +
-							    ((__force u32)uh->len ^ 0xFFFF) +
-							    (__force u32)newlen));
-		}
+		if (!seg->next)
+			break;
 
 		uh->len = newlen;
 		uh->check = check;
 
-		seg->destructor = sock_wfree;
-		seg->sk = sk;
-		sum_truesize += seg->truesize;
+		seg = seg->next;
+		uh = udp_hdr(seg);
 	}
 
+	/* last packet can be partial gso_size, account for that in checksum */
+	newlen = htons(skb_tail_pointer(seg) - skb_transport_header(seg) +
+		       seg->data_len);
+	check = ~csum_fold((__force __wsum)((__force u32)uh->check +
+					    ((__force u32)uh->len ^ 0xFFFF)  +
+					    (__force u32)newlen));
+	uh->len = newlen;
+	uh->check = check;
+
+	/* update refcount for the packet */
 	refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
 out:
 	return segs;

^ permalink raw reply related

* [net-next PATCH v2 4/8] udp: Do not pass checksum as a parameter to GSO segmentation
From: Alexander Duyck @ 2018-05-04 18:30 UTC (permalink / raw)
  To: netdev, willemb, davem
In-Reply-To: <20180504182537.5194.72775.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This patch is meant to allow us to avoid having to recompute the checksum
from scratch and have it passed as a parameter.

Instead of taking that approach we can take advantage of the fact that the
length that was used to compute the existing checksum is included in the
UDP header. If we cancel that out by adding the value XOR with 0xFFFF we
can then just add the new length in and fold that into the new result.

I think this may be fixing a checksum bug in the original code as well
since the checksum that was passed included the UDP header in the checksum
computation, but then excluded it for the adjustment on the last frame. I
believe this may have an effect on things in the cases where the two differ
by bits that would result in things crossing the byte boundaries.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

v2: New break-out patch based on one patch from earlier series

 include/net/udp.h      |    3 +--
 net/ipv4/udp_offload.c |   35 +++++++++++++++++++++--------------
 net/ipv6/udp_offload.c |    7 +------
 3 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 8bd83b044ecd..9289b6425032 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -175,8 +175,7 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *skb,
 int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup);
 
 struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
-				  netdev_features_t features,
-				  __sum16 check);
+				  netdev_features_t features);
 
 static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 {
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index f21b63adcbbc..5c4bb8b9e83e 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -188,8 +188,7 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
 EXPORT_SYMBOL(skb_udp_tunnel_segment);
 
 struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
-				  netdev_features_t features,
-				  __sum16 check)
+				  netdev_features_t features)
 {
 	struct sk_buff *seg, *segs = ERR_PTR(-EINVAL);
 	struct sock *sk = gso_skb->sk;
@@ -197,6 +196,8 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 	unsigned int hdrlen;
 	struct udphdr *uh;
 	unsigned int mss;
+	__sum16 check;
+	__be16 newlen;
 
 	mss = skb_shinfo(gso_skb)->gso_size;
 	if (gso_skb->len <= sizeof(*uh) + mss)
@@ -218,17 +219,28 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 		return segs;
 	}
 
+	uh = udp_hdr(segs);
+
+	/* compute checksum adjustment based on old length versus new */
+	newlen = htons(sizeof(*uh) + mss);
+	check = ~csum_fold((__force __wsum)((__force u32)uh->check +
+					    ((__force u32)uh->len ^ 0xFFFF) +
+					    (__force u32)newlen));
+
 	for (seg = segs; seg; seg = seg->next) {
 		uh = udp_hdr(seg);
-		uh->len = htons(seg->len - hdrlen);
-		uh->check = check;
 
 		/* last packet can be partial gso_size */
-		if (!seg->next)
-			csum_replace2(&uh->check, htons(mss),
-				      htons(seg->len - hdrlen - sizeof(*uh)));
+		if (!seg->next) {
+			newlen = htons(seg->len - hdrlen);
+			check = ~csum_fold((__force __wsum)((__force u32)uh->check +
+							    ((__force u32)uh->len ^ 0xFFFF) +
+							    (__force u32)newlen));
+		}
+
+		uh->len = newlen;
+		uh->check = check;
 
-		uh->check = ~uh->check;
 		seg->destructor = sock_wfree;
 		seg->sk = sk;
 		sum_truesize += seg->truesize;
@@ -243,15 +255,10 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 static struct sk_buff *__udp4_gso_segment(struct sk_buff *gso_skb,
 					  netdev_features_t features)
 {
-	const struct iphdr *iph = ip_hdr(gso_skb);
-	unsigned int mss = skb_shinfo(gso_skb)->gso_size;
-
 	if (!can_checksum_protocol(features, htons(ETH_P_IP)))
 		return ERR_PTR(-EIO);
 
-	return __udp_gso_segment(gso_skb, features,
-				 udp_v4_check(sizeof(struct udphdr) + mss,
-					      iph->saddr, iph->daddr, 0));
+	return __udp_gso_segment(gso_skb, features);
 }
 
 static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index dea03ec09715..61e34f1d2fa2 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -20,15 +20,10 @@
 static struct sk_buff *__udp6_gso_segment(struct sk_buff *gso_skb,
 					  netdev_features_t features)
 {
-	const struct ipv6hdr *ip6h = ipv6_hdr(gso_skb);
-	unsigned int mss = skb_shinfo(gso_skb)->gso_size;
-
 	if (!can_checksum_protocol(features, htons(ETH_P_IPV6)))
 		return ERR_PTR(-EIO);
 
-	return __udp_gso_segment(gso_skb, features,
-				 udp_v6_check(sizeof(struct udphdr) + mss,
-					      &ip6h->saddr, &ip6h->daddr, 0));
+	return __udp_gso_segment(gso_skb, features);
 }
 
 static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,

^ permalink raw reply related

* [net-next PATCH v2 3/8] udp: Do not pass MSS as parameter to GSO segmentation
From: Alexander Duyck @ 2018-05-04 18:29 UTC (permalink / raw)
  To: netdev, willemb, davem
In-Reply-To: <20180504182537.5194.72775.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

There is no point in passing MSS as a parameter for for the GSO
segmentation call as it is already available via the shared info for the
skb itself.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

v2: New break-out patch based on one patch from earlier series

 include/net/udp.h      |    2 +-
 net/ipv4/udp_offload.c |    6 ++++--
 net/ipv6/udp_offload.c |    2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 05990746810e..8bd83b044ecd 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -176,7 +176,7 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *skb,
 
 struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 				  netdev_features_t features,
-				  unsigned int mss, __sum16 check);
+				  __sum16 check);
 
 static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 {
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 8303fff42940..f21b63adcbbc 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -189,14 +189,16 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
 
 struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 				  netdev_features_t features,
-				  unsigned int mss, __sum16 check)
+				  __sum16 check)
 {
 	struct sk_buff *seg, *segs = ERR_PTR(-EINVAL);
 	struct sock *sk = gso_skb->sk;
 	unsigned int sum_truesize = 0;
 	unsigned int hdrlen;
 	struct udphdr *uh;
+	unsigned int mss;
 
+	mss = skb_shinfo(gso_skb)->gso_size;
 	if (gso_skb->len <= sizeof(*uh) + mss)
 		goto out;
 
@@ -247,7 +249,7 @@ static struct sk_buff *__udp4_gso_segment(struct sk_buff *gso_skb,
 	if (!can_checksum_protocol(features, htons(ETH_P_IP)))
 		return ERR_PTR(-EIO);
 
-	return __udp_gso_segment(gso_skb, features, mss,
+	return __udp_gso_segment(gso_skb, features,
 				 udp_v4_check(sizeof(struct udphdr) + mss,
 					      iph->saddr, iph->daddr, 0));
 }
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index f7b85b1e6b3e..dea03ec09715 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -26,7 +26,7 @@ static struct sk_buff *__udp6_gso_segment(struct sk_buff *gso_skb,
 	if (!can_checksum_protocol(features, htons(ETH_P_IPV6)))
 		return ERR_PTR(-EIO);
 
-	return __udp_gso_segment(gso_skb, features, mss,
+	return __udp_gso_segment(gso_skb, features,
 				 udp_v6_check(sizeof(struct udphdr) + mss,
 					      &ip6h->saddr, &ip6h->daddr, 0));
 }

^ permalink raw reply related

* [net-next PATCH v2 2/8] udp: Verify that pulling UDP header in GSO segmentation doesn't fail
From: Alexander Duyck @ 2018-05-04 18:29 UTC (permalink / raw)
  To: netdev, willemb, davem
In-Reply-To: <20180504182537.5194.72775.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

We should verify that we can pull the UDP header before we attempt to do
so. Otherwise if this fails we have no way of knowing and GSO will not work
correctly.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

v2: New break-out patch based on one patch from earlier series

 net/ipv4/udp_offload.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 006257092f06..8303fff42940 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -191,14 +191,17 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 				  netdev_features_t features,
 				  unsigned int mss, __sum16 check)
 {
+	struct sk_buff *seg, *segs = ERR_PTR(-EINVAL);
 	struct sock *sk = gso_skb->sk;
 	unsigned int sum_truesize = 0;
-	struct sk_buff *segs, *seg;
 	unsigned int hdrlen;
 	struct udphdr *uh;
 
 	if (gso_skb->len <= sizeof(*uh) + mss)
-		return ERR_PTR(-EINVAL);
+		goto out;
+
+	if (!pskb_may_pull(gso_skb, sizeof(*uh)))
+		goto out;
 
 	hdrlen = gso_skb->data - skb_mac_header(gso_skb);
 	skb_pull(gso_skb, sizeof(*uh));
@@ -230,7 +233,7 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 	}
 
 	refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
-
+out:
 	return segs;
 }
 EXPORT_SYMBOL_GPL(__udp_gso_segment);

^ permalink raw reply related

* [net-next PATCH v2 1/8] udp: Record gso_segs when supporting UDP segmentation offload
From: Alexander Duyck @ 2018-05-04 18:28 UTC (permalink / raw)
  To: netdev, willemb, davem
In-Reply-To: <20180504182537.5194.72775.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

We need to record the number of segments that will be generated when this
frame is segmented. The expectation is that if gso_size is set then
gso_segs is set as well. Without this some drivers such as ixgbe get
confused if they attempt to offload this as they record 0 segments for the
entire packet instead of the correct value.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/ipv4/udp.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index dd3102a37ef9..e07db83b311e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -793,6 +793,8 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 *fl4,
 
 		skb_shinfo(skb)->gso_size = cork->gso_size;
 		skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
+		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(len - sizeof(uh),
+							 cork->gso_size);
 		goto csum_partial;
 	}
 

^ permalink raw reply related

* [net-next PATCH v2 0/8] UDP GSO Segmentation clean-ups
From: Alexander Duyck @ 2018-05-04 18:28 UTC (permalink / raw)
  To: netdev, willemb, davem

This patch set addresses a number of issues I found while sorting out
enabling UDP GSO Segmentation support for ixgbe/ixgbevf. Specifically there
were a number of issues related to the checksum and such that seemed to
cause either minor irregularities or kernel panics in the case of the
offload request being allowed to traverse between name spaces.

With this set applied I am was able to get UDP GSO traffic to pass over
vxlan tunnels in both offloaded modes and non-offloaded modes for ixgbe and
ixgbevf.

I submitted the driver specific patches earlier as an RFC:
https://patchwork.ozlabs.org/project/netdev/list/?series=42477&archive=both&state=*

v2: Updated patches based on feedback from Eric Dumazet
    Split first patch into several patches based on feedback from Eric

---

Alexander Duyck (8):
      udp: Record gso_segs when supporting UDP segmentation offload
      udp: Verify that pulling UDP header in GSO segmentation doesn't fail
      udp: Do not pass MSS as parameter to GSO segmentation
      udp: Do not pass checksum as a parameter to GSO segmentation
      udp: Partially unroll handling of first segment and last segment
      udp: Add support for software checksum and GSO_PARTIAL with GSO offload
      udp: Do not copy destructor if one is not present
      net: Add NETIF_F_GSO_UDP_L4 to list of GSO offloads with fallback


 include/linux/netdev_features.h |    3 +
 include/net/udp.h               |    3 -
 net/ipv4/udp.c                  |    2 +
 net/ipv4/udp_offload.c          |  104 ++++++++++++++++++++++++++-------------
 net/ipv6/udp_offload.c          |   16 ------
 5 files changed, 74 insertions(+), 54 deletions(-)

^ permalink raw reply

* [PATCH net] Added support for 802.1ad Q in Q Ethernet tagged frames
From: Elad Nachman @ 2018-05-04 18:25 UTC (permalink / raw)
  To: netdev, davem

stmmac reception handler calls stmmac_rx_vlan() to strip the vlan before calling napi_gro_receive().

The function assumes VLAN tagged frames are always tagged with 802.1Q protocol,
and assigns ETH_P_8021Q to the skb by hard-coding the parameter on call to __vlan_hwaccel_put_tag() .

This causes packets not to be passed to the VLAN slave if it was created with 802.1AD protocol
(ip link add link eth0 eth0.100 type vlan proto 802.1ad id 100).

This fix passes the protocol from the VLAN header into __vlan_hwaccel_put_tag()
instead of using the hard-coded value of ETH_P_8021Q.

Signed-off-by: Elad Nachman <eladn@gilat.com>


---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index b65e2d1..ced2d34 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3293,17 +3293,19 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 
 static void stmmac_rx_vlan(struct net_device *dev, struct sk_buff *skb)
 {
-	struct ethhdr *ehdr;
+	struct vlan_ethhdr *veth;
 	u16 vlanid;
+	__be16 vlan_proto;
 
 	if ((dev->features & NETIF_F_HW_VLAN_CTAG_RX) ==
 	    NETIF_F_HW_VLAN_CTAG_RX &&
 	    !__vlan_get_tag(skb, &vlanid)) {
 		/* pop the vlan tag */
-		ehdr = (struct ethhdr *)skb->data;
-		memmove(skb->data + VLAN_HLEN, ehdr, ETH_ALEN * 2);
+		veth = (struct vlan_ethhdr *)skb->data;
+		vlan_proto = veth->h_vlan_proto;
+		memmove(skb->data + VLAN_HLEN, veth, ETH_ALEN * 2);
 		skb_pull(skb, VLAN_HLEN);
-		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlanid);
+		__vlan_hwaccel_put_tag(skb, vlan_proto, vlanid);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related

* Hello
From: Faruk Sakawo @ 2018-05-04 18:14 UTC (permalink / raw)


I have a business to discuss with you, can we talk?


Regards
Faruk Sakawo

^ permalink raw reply

* [PATCH V2 net-next] liquidio: support use of ethtool to set link speed of CN23XX-225 cards
From: Felix Manlunas @ 2018-05-04 18:07 UTC (permalink / raw)
  To: davem
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	felix.manlunas, weilin.chang

From: Weilin Chang <weilin.chang@cavium.com>

Support setting the link speed of CN23XX-225 cards (which can do 25Gbps or
10Gbps) via ethtool_ops.set_link_ksettings.

Also fix the function assigned to ethtool_ops.get_link_ksettings to use the
new link_ksettings api completely (instead of partially via
ethtool_convert_legacy_u32_to_link_mode).

Signed-off-by: Weilin Chang <weilin.chang@cavium.com>
Acked-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
---
Patch Change Log:
  V1 -> V2:
    In lio_set_link_ksettings(), set the order of local variable
    declarations from longest to shortest line.

 drivers/net/ethernet/cavium/liquidio/lio_core.c    | 196 +++++++++++++++++++++
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c | 195 +++++++++++++++++---
 drivers/net/ethernet/cavium/liquidio/lio_main.c    |  20 +++
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c |   5 +
 .../net/ethernet/cavium/liquidio/liquidio_common.h |   4 +
 .../net/ethernet/cavium/liquidio/octeon_device.h   |  14 ++
 .../net/ethernet/cavium/liquidio/octeon_network.h  |  15 ++
 7 files changed, 425 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index 6821afc..8093c5e 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -1481,3 +1481,199 @@ int octnet_get_link_stats(struct net_device *netdev)
 
 	return 0;
 }
+
+static void liquidio_nic_seapi_ctl_callback(struct octeon_device *oct,
+					    u32 status,
+					    void *buf)
+{
+	struct liquidio_nic_seapi_ctl_context *ctx;
+	struct octeon_soft_command *sc = buf;
+
+	ctx = sc->ctxptr;
+
+	oct = lio_get_device(ctx->octeon_id);
+	if (status) {
+		dev_err(&oct->pci_dev->dev, "%s: instruction failed. Status: %llx\n",
+			__func__,
+			CVM_CAST64(status));
+	}
+	ctx->status = status;
+	complete(&ctx->complete);
+}
+
+int liquidio_set_speed(struct lio *lio, int speed)
+{
+	struct liquidio_nic_seapi_ctl_context *ctx;
+	struct octeon_device *oct = lio->oct_dev;
+	struct oct_nic_seapi_resp *resp;
+	struct octeon_soft_command *sc;
+	union octnet_cmd *ncmd;
+	u32 ctx_size;
+	int retval;
+	u32 var;
+
+	if (oct->speed_setting == speed)
+		return 0;
+
+	if (!OCTEON_CN23XX_PF(oct)) {
+		dev_err(&oct->pci_dev->dev, "%s: SET SPEED only for PF\n",
+			__func__);
+		return -EOPNOTSUPP;
+	}
+
+	ctx_size = sizeof(struct liquidio_nic_seapi_ctl_context);
+	sc = octeon_alloc_soft_command(oct, OCTNET_CMD_SIZE,
+				       sizeof(struct oct_nic_seapi_resp),
+				       ctx_size);
+	if (!sc)
+		return -ENOMEM;
+
+	ncmd = sc->virtdptr;
+	ctx  = sc->ctxptr;
+	resp = sc->virtrptr;
+	memset(resp, 0, sizeof(struct oct_nic_seapi_resp));
+
+	ctx->octeon_id = lio_get_device_id(oct);
+	ctx->status = 0;
+	init_completion(&ctx->complete);
+
+	ncmd->u64 = 0;
+	ncmd->s.cmd = SEAPI_CMD_SPEED_SET;
+	ncmd->s.param1 = speed;
+
+	octeon_swap_8B_data((u64 *)ncmd, (OCTNET_CMD_SIZE >> 3));
+
+	sc->iq_no = lio->linfo.txpciq[0].s.q_no;
+
+	octeon_prepare_soft_command(oct, sc, OPCODE_NIC,
+				    OPCODE_NIC_UBOOT_CTL, 0, 0, 0);
+
+	sc->callback = liquidio_nic_seapi_ctl_callback;
+	sc->callback_arg = sc;
+	sc->wait_time = 5000;
+
+	retval = octeon_send_soft_command(oct, sc);
+	if (retval == IQ_SEND_FAILED) {
+		dev_info(&oct->pci_dev->dev, "Failed to send soft command\n");
+		retval = -EBUSY;
+	} else {
+		/* Wait for response or timeout */
+		if (wait_for_completion_timeout(&ctx->complete,
+						msecs_to_jiffies(10000)) == 0) {
+			dev_err(&oct->pci_dev->dev, "%s: sc timeout\n",
+				__func__);
+			octeon_free_soft_command(oct, sc);
+			return -EINTR;
+		}
+
+		retval = resp->status;
+
+		if (retval) {
+			dev_err(&oct->pci_dev->dev, "%s failed, retval=%d\n",
+				__func__, retval);
+			octeon_free_soft_command(oct, sc);
+			return -EIO;
+		}
+
+		var = be32_to_cpu((__force __be32)resp->speed);
+		if (var != speed) {
+			dev_err(&oct->pci_dev->dev,
+				"%s: setting failed speed= %x, expect %x\n",
+				__func__, var, speed);
+		}
+
+		oct->speed_setting = var;
+	}
+
+	octeon_free_soft_command(oct, sc);
+
+	return retval;
+}
+
+int liquidio_get_speed(struct lio *lio)
+{
+	struct liquidio_nic_seapi_ctl_context *ctx;
+	struct octeon_device *oct = lio->oct_dev;
+	struct oct_nic_seapi_resp *resp;
+	struct octeon_soft_command *sc;
+	union octnet_cmd *ncmd;
+	u32 ctx_size;
+	int retval;
+
+	ctx_size = sizeof(struct liquidio_nic_seapi_ctl_context);
+	sc = octeon_alloc_soft_command(oct, OCTNET_CMD_SIZE,
+				       sizeof(struct oct_nic_seapi_resp),
+				       ctx_size);
+	if (!sc)
+		return -ENOMEM;
+
+	ncmd = sc->virtdptr;
+	ctx  = sc->ctxptr;
+	resp = sc->virtrptr;
+	memset(resp, 0, sizeof(struct oct_nic_seapi_resp));
+
+	ctx->octeon_id = lio_get_device_id(oct);
+	ctx->status = 0;
+	init_completion(&ctx->complete);
+
+	ncmd->u64 = 0;
+	ncmd->s.cmd = SEAPI_CMD_SPEED_GET;
+
+	octeon_swap_8B_data((u64 *)ncmd, (OCTNET_CMD_SIZE >> 3));
+
+	sc->iq_no = lio->linfo.txpciq[0].s.q_no;
+
+	octeon_prepare_soft_command(oct, sc, OPCODE_NIC,
+				    OPCODE_NIC_UBOOT_CTL, 0, 0, 0);
+
+	sc->callback = liquidio_nic_seapi_ctl_callback;
+	sc->callback_arg = sc;
+	sc->wait_time = 5000;
+
+	retval = octeon_send_soft_command(oct, sc);
+	if (retval == IQ_SEND_FAILED) {
+		dev_info(&oct->pci_dev->dev, "Failed to send soft command\n");
+		oct->no_speed_setting = 1;
+		oct->speed_setting = 25;
+
+		retval = -EBUSY;
+	} else {
+		if (wait_for_completion_timeout(&ctx->complete,
+						msecs_to_jiffies(10000)) == 0) {
+			dev_err(&oct->pci_dev->dev, "%s: sc timeout\n",
+				__func__);
+
+			oct->speed_setting = 25;
+			oct->no_speed_setting = 1;
+
+			octeon_free_soft_command(oct, sc);
+
+			return -EINTR;
+		}
+		retval = resp->status;
+		if (retval) {
+			dev_err(&oct->pci_dev->dev,
+				"%s failed retval=%d\n", __func__, retval);
+			oct->no_speed_setting = 1;
+			oct->speed_setting = 25;
+			octeon_free_soft_command(oct, sc);
+			retval = -EIO;
+		} else {
+			u32 var;
+
+			var = be32_to_cpu((__force __be32)resp->speed);
+			oct->speed_setting = var;
+			if (var == 0xffff) {
+				oct->no_speed_setting = 1;
+				/* unable to access boot variables
+				 * get the default value based on the NIC type
+				 */
+				oct->speed_setting = 25;
+			}
+		}
+	}
+
+	octeon_free_soft_command(oct, sc);
+
+	return retval;
+}
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
index a1d8430..06f7449 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
@@ -230,46 +230,147 @@ static int lio_get_link_ksettings(struct net_device *netdev,
 	struct lio *lio = GET_LIO(netdev);
 	struct octeon_device *oct = lio->oct_dev;
 	struct oct_link_info *linfo;
-	u32 supported = 0, advertising = 0;
 
 	linfo = &lio->linfo;
 
+	ethtool_link_ksettings_zero_link_mode(ecmd, supported);
+	ethtool_link_ksettings_zero_link_mode(ecmd, advertising);
+
 	switch (linfo->link.s.phy_type) {
 	case LIO_PHY_PORT_TP:
 		ecmd->base.port = PORT_TP;
-		supported = (SUPPORTED_10000baseT_Full |
-			     SUPPORTED_TP | SUPPORTED_Pause);
-		advertising = (ADVERTISED_10000baseT_Full | ADVERTISED_Pause);
 		ecmd->base.autoneg = AUTONEG_DISABLE;
+		ethtool_link_ksettings_add_link_mode(ecmd, supported, TP);
+		ethtool_link_ksettings_add_link_mode(ecmd, supported, Pause);
+		ethtool_link_ksettings_add_link_mode(ecmd, supported,
+						     10000baseT_Full);
+
+		ethtool_link_ksettings_add_link_mode(ecmd, advertising, Pause);
+		ethtool_link_ksettings_add_link_mode(ecmd, advertising,
+						     10000baseT_Full);
+
 		break;
 
 	case LIO_PHY_PORT_FIBRE:
-		ecmd->base.port = PORT_FIBRE;
-
-		if (linfo->link.s.speed == SPEED_10000) {
-			supported = SUPPORTED_10000baseT_Full;
-			advertising = ADVERTISED_10000baseT_Full;
+		if (linfo->link.s.if_mode == INTERFACE_MODE_XAUI ||
+		    linfo->link.s.if_mode == INTERFACE_MODE_RXAUI ||
+		    linfo->link.s.if_mode == INTERFACE_MODE_XLAUI ||
+		    linfo->link.s.if_mode == INTERFACE_MODE_XFI) {
+			dev_dbg(&oct->pci_dev->dev, "ecmd->base.transceiver is XCVR_EXTERNAL\n");
+		} else {
+			dev_err(&oct->pci_dev->dev, "Unknown link interface mode: %d\n",
+				linfo->link.s.if_mode);
 		}
 
-		supported |= SUPPORTED_FIBRE | SUPPORTED_Pause;
-		advertising |= ADVERTISED_Pause;
+		ecmd->base.port = PORT_FIBRE;
 		ecmd->base.autoneg = AUTONEG_DISABLE;
+		ethtool_link_ksettings_add_link_mode(ecmd, supported, FIBRE);
+
+		ethtool_link_ksettings_add_link_mode(ecmd, supported, Pause);
+		ethtool_link_ksettings_add_link_mode(ecmd, advertising, Pause);
+		if (oct->subsystem_id == OCTEON_CN2350_25GB_SUBSYS_ID ||
+		    oct->subsystem_id == OCTEON_CN2360_25GB_SUBSYS_ID) {
+			if (OCTEON_CN23XX_PF(oct)) {
+				ethtool_link_ksettings_add_link_mode
+					(ecmd, supported, 25000baseSR_Full);
+				ethtool_link_ksettings_add_link_mode
+					(ecmd, supported, 25000baseKR_Full);
+				ethtool_link_ksettings_add_link_mode
+					(ecmd, supported, 25000baseCR_Full);
+
+				if (oct->no_speed_setting == 0)  {
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, supported,
+						 10000baseSR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, supported,
+						 10000baseKR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, supported,
+						 10000baseCR_Full);
+				}
+
+				if (oct->no_speed_setting == 0)
+					liquidio_get_speed(lio);
+				else
+					oct->speed_setting = 25;
+
+				if (oct->speed_setting == 10) {
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, advertising,
+						 10000baseSR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, advertising,
+						 10000baseKR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, advertising,
+						 10000baseCR_Full);
+				}
+				if (oct->speed_setting == 25) {
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, advertising,
+						 25000baseSR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, advertising,
+						 25000baseKR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, advertising,
+						 25000baseCR_Full);
+				}
+			} else { /* VF */
+				if (linfo->link.s.speed == 10000) {
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, supported,
+						 10000baseSR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, supported,
+						 10000baseKR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, supported,
+						 10000baseCR_Full);
+
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, advertising,
+						 10000baseSR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, advertising,
+						 10000baseKR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, advertising,
+						 10000baseCR_Full);
+				}
+
+				if (linfo->link.s.speed == 25000) {
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, supported,
+						 25000baseSR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, supported,
+						 25000baseKR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, supported,
+						 25000baseCR_Full);
+
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, advertising,
+						 25000baseSR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, advertising,
+						 25000baseKR_Full);
+					ethtool_link_ksettings_add_link_mode
+						(ecmd, advertising,
+						 25000baseCR_Full);
+				}
+			}
+		} else {
+			ethtool_link_ksettings_add_link_mode(ecmd, supported,
+							     10000baseT_Full);
+			ethtool_link_ksettings_add_link_mode(ecmd, advertising,
+							     10000baseT_Full);
+		}
 		break;
 	}
 
-	if (linfo->link.s.if_mode == INTERFACE_MODE_XAUI ||
-	    linfo->link.s.if_mode == INTERFACE_MODE_RXAUI ||
-	    linfo->link.s.if_mode == INTERFACE_MODE_XLAUI ||
-	    linfo->link.s.if_mode == INTERFACE_MODE_XFI) {
-		ethtool_convert_legacy_u32_to_link_mode(
-			ecmd->link_modes.supported, supported);
-		ethtool_convert_legacy_u32_to_link_mode(
-			ecmd->link_modes.advertising, advertising);
-	} else {
-		dev_err(&oct->pci_dev->dev, "Unknown link interface reported %d\n",
-			linfo->link.s.if_mode);
-	}
-
 	if (linfo->link.s.link_up) {
 		ecmd->base.speed = linfo->link.s.speed;
 		ecmd->base.duplex = linfo->link.s.duplex;
@@ -281,6 +382,51 @@ static int lio_get_link_ksettings(struct net_device *netdev,
 	return 0;
 }
 
+static int lio_set_link_ksettings(struct net_device *netdev,
+				  const struct ethtool_link_ksettings *ecmd)
+{
+	const int speed = ecmd->base.speed;
+	struct lio *lio = GET_LIO(netdev);
+	struct oct_link_info *linfo;
+	struct octeon_device *oct;
+	u32 is25G = 0;
+
+	oct = lio->oct_dev;
+
+	linfo = &lio->linfo;
+
+	if (oct->subsystem_id == OCTEON_CN2350_25GB_SUBSYS_ID ||
+	    oct->subsystem_id == OCTEON_CN2360_25GB_SUBSYS_ID) {
+		is25G = 1;
+	} else {
+		return -EOPNOTSUPP;
+	}
+
+	if (oct->no_speed_setting) {
+		dev_err(&oct->pci_dev->dev, "%s: Changing speed is not supported\n",
+			__func__);
+		return -EOPNOTSUPP;
+	}
+
+	if ((ecmd->base.duplex != DUPLEX_UNKNOWN &&
+	     ecmd->base.duplex != linfo->link.s.duplex) ||
+	     ecmd->base.autoneg != AUTONEG_DISABLE ||
+	    (ecmd->base.speed != 10000 && ecmd->base.speed != 25000 &&
+	     ecmd->base.speed != SPEED_UNKNOWN))
+		return -EOPNOTSUPP;
+
+	if ((oct->speed_boot == speed / 1000) &&
+	    oct->speed_boot == oct->speed_setting)
+		return 0;
+
+	liquidio_set_speed(lio, speed / 1000);
+
+	dev_dbg(&oct->pci_dev->dev, "Port speed is set to %dG\n",
+		oct->speed_setting);
+
+	return 0;
+}
+
 static void
 lio_get_drvinfo(struct net_device *netdev, struct ethtool_drvinfo *drvinfo)
 {
@@ -2966,6 +3112,7 @@ static int lio_set_priv_flags(struct net_device *netdev, u32 flags)
 
 static const struct ethtool_ops lio_ethtool_ops = {
 	.get_link_ksettings	= lio_get_link_ksettings,
+	.set_link_ksettings	= lio_set_link_ksettings,
 	.get_link		= ethtool_op_get_link,
 	.get_drvinfo		= lio_get_drvinfo,
 	.get_ringparam		= lio_ethtool_get_ringparam,
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index ee75048..e500528 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -912,6 +912,9 @@ liquidio_probe(struct pci_dev *pdev,
 	/* set linux specific device pointer */
 	oct_dev->pci_dev = (void *)pdev;
 
+	oct_dev->subsystem_id = pdev->subsystem_vendor |
+		(pdev->subsystem_device << 16);
+
 	hs = &handshake[oct_dev->octeon_id];
 	init_completion(&hs->init);
 	init_completion(&hs->started);
@@ -3664,6 +3667,23 @@ static int setup_nic_devices(struct octeon_device *octeon_dev)
 			"NIC ifidx:%d Setup successful\n", i);
 
 		octeon_free_soft_command(octeon_dev, sc);
+
+		if (octeon_dev->subsystem_id ==
+			OCTEON_CN2350_25GB_SUBSYS_ID ||
+		    octeon_dev->subsystem_id ==
+			OCTEON_CN2360_25GB_SUBSYS_ID) {
+			liquidio_get_speed(lio);
+
+			if (octeon_dev->speed_setting == 0) {
+				octeon_dev->speed_setting = 25;
+				octeon_dev->no_speed_setting = 1;
+			}
+		} else {
+			octeon_dev->no_speed_setting = 1;
+			octeon_dev->speed_setting = 10;
+		}
+		octeon_dev->speed_boot = octeon_dev->speed_setting;
+
 	}
 
 	devlink = devlink_alloc(&liquidio_devlink_ops,
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
index 6295eee..7fa0212 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
@@ -411,6 +411,9 @@ liquidio_vf_probe(struct pci_dev *pdev,
 	/* set linux specific device pointer */
 	oct_dev->pci_dev = pdev;
 
+	oct_dev->subsystem_id = pdev->subsystem_vendor |
+		(pdev->subsystem_device << 16);
+
 	if (octeon_device_init(oct_dev)) {
 		liquidio_vf_remove(pdev);
 		return -ENOMEM;
@@ -2199,6 +2202,8 @@ static int setup_nic_devices(struct octeon_device *octeon_dev)
 			"NIC ifidx:%d Setup successful\n", i);
 
 		octeon_free_soft_command(octeon_dev, sc);
+
+		octeon_dev->no_speed_setting = 1;
 	}
 
 	return 0;
diff --git a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
index 0265704..285b248 100644
--- a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
+++ b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
@@ -93,6 +93,7 @@ enum octeon_tag_type {
 
 #define OPCODE_NIC_VF_REP_PKT          0x15
 #define OPCODE_NIC_VF_REP_CMD          0x16
+#define OPCODE_NIC_UBOOT_CTL           0x17
 
 #define CORE_DRV_TEST_SCATTER_OP    0xFFF5
 
@@ -249,6 +250,9 @@ static inline void add_sg_size(struct octeon_sg_entry *sg_entry,
 #define   OCTNET_CMD_VLAN_FILTER_ENABLE 0x1
 #define   OCTNET_CMD_VLAN_FILTER_DISABLE 0x0
 
+#define   SEAPI_CMD_SPEED_SET           0x2
+#define   SEAPI_CMD_SPEED_GET           0x3
+
 #define   LIO_CMD_WAIT_TM 100
 
 /* RX(packets coming from wire) Checksum verification flags */
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_device.h b/drivers/net/ethernet/cavium/liquidio/octeon_device.h
index 9430c0a..94a4ed88d 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_device.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_device.h
@@ -43,6 +43,13 @@
 #define  OCTEON_CN23XX_REV_1_1        0x01
 #define  OCTEON_CN23XX_REV_2_0        0x80
 
+/**SubsystemId for the chips */
+#define	 OCTEON_CN2350_10GB_SUBSYS_ID_1	0X3177d
+#define	 OCTEON_CN2350_10GB_SUBSYS_ID_2	0X4177d
+#define	 OCTEON_CN2360_10GB_SUBSYS_ID	0X5177d
+#define	 OCTEON_CN2350_25GB_SUBSYS_ID	0X7177d
+#define	 OCTEON_CN2360_25GB_SUBSYS_ID	0X6177d
+
 /** Endian-swap modes supported by Octeon. */
 enum octeon_pci_swap_mode {
 	OCTEON_PCI_PASSTHROUGH = 0,
@@ -430,6 +437,8 @@ struct octeon_device {
 
 	u16 rev_id;
 
+	u32 subsystem_id;
+
 	u16 pf_num;
 
 	u16 vf_num;
@@ -584,6 +593,11 @@ struct octeon_device {
 	struct lio_vf_rep_list vf_rep_list;
 	struct devlink *devlink;
 	enum devlink_eswitch_mode eswitch_mode;
+
+	/* for 25G NIC speed change */
+	u8  speed_boot;
+	u8  speed_setting;
+	u8  no_speed_setting;
 };
 
 #define  OCT_DRV_ONLINE 1
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_network.h b/drivers/net/ethernet/cavium/liquidio/octeon_network.h
index 8571f11..dd3177a 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_network.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_network.h
@@ -81,6 +81,18 @@ struct oct_nic_stats_ctrl {
 	struct net_device *netdev;
 };
 
+struct oct_nic_seapi_resp {
+	u64 rh;
+	u32 speed;
+	u64 status;
+};
+
+struct liquidio_nic_seapi_ctl_context {
+	int octeon_id;
+	u32 status;
+	struct completion complete;
+};
+
 /** LiquidIO per-interface network private data */
 struct lio {
 	/** State of the interface. Rx/Tx happens only in the RUNNING state.  */
@@ -230,6 +242,9 @@ void lio_delete_glists(struct lio *lio);
 
 int lio_setup_glists(struct octeon_device *oct, struct lio *lio, int num_qs);
 
+int liquidio_get_speed(struct lio *lio);
+int liquidio_set_speed(struct lio *lio, int speed);
+
 /**
  * \brief Net device change_mtu
  * @param netdev network device

^ permalink raw reply related

* Re: [PATCH net-next v8 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc
From: Cong Wang @ 2018-05-04 17:57 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Linux Kernel Network Developers, Cake List
In-Reply-To: <152544255298.11750.16512595539680113590.stgit@alrua-kau>

On Fri, May 4, 2018 at 7:02 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> +struct cake_sched_data {
> +       struct cake_tin_data *tins;
> +
> +       struct cake_heap_entry overflow_heap[CAKE_QUEUES * CAKE_MAX_TINS];
> +       u16             overflow_timeout;
> +
> +       u16             tin_cnt;
> +       u8              tin_mode;
> +       u8              flow_mode;
> +       u8              ack_filter;
> +       u8              atm_mode;
> +
> +       /* time_next = time_this + ((len * rate_ns) >> rate_shft) */
> +       u16             rate_shft;
> +       u64             time_next_packet;
> +       u64             failsafe_next_packet;
> +       u32             rate_ns;
> +       u32             rate_bps;
> +       u16             rate_flags;
> +       s16             rate_overhead;
> +       u16             rate_mpu;
> +       u32             interval;
> +       u32             target;
> +
> +       /* resource tracking */
> +       u32             buffer_used;
> +       u32             buffer_max_used;
> +       u32             buffer_limit;
> +       u32             buffer_config_limit;
> +
> +       /* indices for dequeue */
> +       u16             cur_tin;
> +       u16             cur_flow;
> +
> +       struct qdisc_watchdog watchdog;
> +       const u8        *tin_index;
> +       const u8        *tin_order;
> +
> +       /* bandwidth capacity estimate */
> +       u64             last_packet_time;
> +       u64             avg_packet_interval;
> +       u64             avg_window_begin;
> +       u32             avg_window_bytes;
> +       u32             avg_peak_bandwidth;
> +       u64             last_reconfig_time;
> +
> +       /* packet length stats */
> +       u32 avg_netoff;
> +       u16 max_netlen;
> +       u16 max_adjlen;
> +       u16 min_netlen;
> +       u16 min_adjlen;
> +};


So sch_cake doesn't accept normal tc filters? Is this intentional?
If so, why?


> +static u16 quantum_div[CAKE_QUEUES + 1] = {0};
> +
> +#define REC_INV_SQRT_CACHE (16)
> +static u32 cobalt_rec_inv_sqrt_cache[REC_INV_SQRT_CACHE] = {0};
> +
> +/* http://en.wikipedia.org/wiki/Methods_of_computing_square_roots
> + * new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2)
> + *
> + * Here, invsqrt is a fixed point number (< 1.0), 32bit mantissa, aka Q0.32
> + */
> +
> +static void cobalt_newton_step(struct cobalt_vars *vars)
> +{
> +       u32 invsqrt = vars->rec_inv_sqrt;
> +       u32 invsqrt2 = ((u64)invsqrt * invsqrt) >> 32;
> +       u64 val = (3LL << 32) - ((u64)vars->count * invsqrt2);
> +
> +       val >>= 2; /* avoid overflow in following multiply */
> +       val = (val * invsqrt) >> (32 - 2 + 1);
> +
> +       vars->rec_inv_sqrt = val;
> +}
> +
> +static void cobalt_invsqrt(struct cobalt_vars *vars)
> +{
> +       if (vars->count < REC_INV_SQRT_CACHE)
> +               vars->rec_inv_sqrt = cobalt_rec_inv_sqrt_cache[vars->count];
> +       else
> +               cobalt_newton_step(vars);
> +}


Looks pretty much duplicated with codel...


> +
> +/* There is a big difference in timing between the accurate values placed in
> + * the cache and the approximations given by a single Newton step for small
> + * count values, particularly when stepping from count 1 to 2 or vice versa.
> + * Above 16, a single Newton step gives sufficient accuracy in either
> + * direction, given the precision stored.
> + *
> + * The magnitude of the error when stepping up to count 2 is such as to give
> + * the value that *should* have been produced at count 4.
> + */
> +
> +static void cobalt_cache_init(void)
> +{
> +       struct cobalt_vars v;
> +
> +       memset(&v, 0, sizeof(v));
> +       v.rec_inv_sqrt = ~0U;
> +       cobalt_rec_inv_sqrt_cache[0] = v.rec_inv_sqrt;
> +
> +       for (v.count = 1; v.count < REC_INV_SQRT_CACHE; v.count++) {
> +               cobalt_newton_step(&v);
> +               cobalt_newton_step(&v);
> +               cobalt_newton_step(&v);
> +               cobalt_newton_step(&v);
> +
> +               cobalt_rec_inv_sqrt_cache[v.count] = v.rec_inv_sqrt;
> +       }
> +}
> +
> +static void cobalt_vars_init(struct cobalt_vars *vars)
> +{
> +       memset(vars, 0, sizeof(*vars));
> +
> +       if (!cobalt_rec_inv_sqrt_cache[0]) {
> +               cobalt_cache_init();
> +               cobalt_rec_inv_sqrt_cache[0] = ~0;
> +       }
> +}
> +
> +/* CoDel control_law is t + interval/sqrt(count)
> + * We maintain in rec_inv_sqrt the reciprocal value of sqrt(count) to avoid
> + * both sqrt() and divide operation.
> + */
> +static cobalt_time_t cobalt_control(cobalt_time_t t,
> +                                   cobalt_time_t interval,
> +                                   u32 rec_inv_sqrt)
> +{
> +       return t + reciprocal_scale(interval, rec_inv_sqrt);
> +}
> +
> +/* Call this when a packet had to be dropped due to queue overflow.  Returns
> + * true if the BLUE state was quiescent before but active after this call.
> + */
> +static bool cobalt_queue_full(struct cobalt_vars *vars,
> +                             struct cobalt_params *p,
> +                             cobalt_time_t now)
> +{
> +       bool up = false;
> +
> +       if ((now - vars->blue_timer) > p->target) {
> +               up = !vars->p_drop;
> +               vars->p_drop += p->p_inc;
> +               if (vars->p_drop < p->p_inc)
> +                       vars->p_drop = ~0;
> +               vars->blue_timer = now;
> +       }
> +       vars->dropping = true;
> +       vars->drop_next = now;
> +       if (!vars->count)
> +               vars->count = 1;
> +
> +       return up;
> +}
> +
> +/* Call this when the queue was serviced but turned out to be empty.  Returns
> + * true if the BLUE state was active before but quiescent after this call.
> + */
> +static bool cobalt_queue_empty(struct cobalt_vars *vars,
> +                              struct cobalt_params *p,
> +                              cobalt_time_t now)
> +{
> +       bool down = false;
> +
> +       if (vars->p_drop && (now - vars->blue_timer) > p->target) {
> +               if (vars->p_drop < p->p_dec)
> +                       vars->p_drop = 0;
> +               else
> +                       vars->p_drop -= p->p_dec;
> +               vars->blue_timer = now;
> +               down = !vars->p_drop;
> +       }
> +       vars->dropping = false;
> +
> +       if (vars->count && (now - vars->drop_next) >= 0) {
> +               vars->count--;
> +               cobalt_invsqrt(vars);
> +               vars->drop_next = cobalt_control(vars->drop_next,
> +                                                p->interval,
> +                                                vars->rec_inv_sqrt);
> +       }
> +
> +       return down;
> +}
> +
> +/* Call this with a freshly dequeued packet for possible congestion marking.
> + * Returns true as an instruction to drop the packet, false for delivery.
> + */
> +static bool cobalt_should_drop(struct cobalt_vars *vars,
> +                              struct cobalt_params *p,
> +                              cobalt_time_t now,
> +                              struct sk_buff *skb)
> +{
> +       bool drop = false;
> +
> +       /* Simplified Codel implementation */
> +       cobalt_tdiff_t sojourn  = now - cobalt_get_enqueue_time(skb);
> +
> +/* The 'schedule' variable records, in its sign, whether 'now' is before or
> + * after 'drop_next'.  This allows 'drop_next' to be updated before the next
> + * scheduling decision is actually branched, without destroying that
> + * information.  Similarly, the first 'schedule' value calculated is preserved
> + * in the boolean 'next_due'.
> + *
> + * As for 'drop_next', we take advantage of the fact that 'interval' is both
> + * the delay between first exceeding 'target' and the first signalling event,
> + * *and* the scaling factor for the signalling frequency.  It's therefore very
> + * natural to use a single mechanism for both purposes, and eliminates a
> + * significant amount of reference Codel's spaghetti code.  To help with this,
> + * both the '0' and '1' entries in the invsqrt cache are 0xFFFFFFFF, as close
> + * as possible to 1.0 in fixed-point.
> + */
> +
> +       cobalt_tdiff_t schedule = now - vars->drop_next;
> +
> +       bool over_target = sojourn > p->target &&
> +                          sojourn > p->mtu_time * 4;
> +       bool next_due    = vars->count && schedule >= 0;
> +
> +       vars->ecn_marked = false;
> +
> +       if (over_target) {
> +               if (!vars->dropping) {
> +                       vars->dropping = true;
> +                       vars->drop_next = cobalt_control(now,
> +                                                        p->interval,
> +                                                        vars->rec_inv_sqrt);
> +               }
> +               if (!vars->count)
> +                       vars->count = 1;
> +       } else if (vars->dropping) {
> +               vars->dropping = false;
> +       }
> +
> +       if (next_due && vars->dropping) {
> +               /* Use ECN mark if possible, otherwise drop */
> +               drop = !(vars->ecn_marked = INET_ECN_set_ce(skb));
> +
> +               vars->count++;
> +               if (!vars->count)
> +                       vars->count--;
> +               cobalt_invsqrt(vars);
> +               vars->drop_next = cobalt_control(vars->drop_next,
> +                                                p->interval,
> +                                                vars->rec_inv_sqrt);
> +               schedule = now - vars->drop_next;
> +       } else {
> +               while (next_due) {
> +                       vars->count--;
> +                       cobalt_invsqrt(vars);
> +                       vars->drop_next = cobalt_control(vars->drop_next,
> +                                                        p->interval,
> +                                                        vars->rec_inv_sqrt);
> +                       schedule = now - vars->drop_next;
> +                       next_due = vars->count && schedule >= 0;
> +               }
> +       }
> +
> +       /* Simple BLUE implementation.  Lack of ECN is deliberate. */
> +       if (vars->p_drop)
> +               drop |= (prandom_u32() < vars->p_drop);
> +
> +       /* Overload the drop_next field as an activity timeout */
> +       if (!vars->count)
> +               vars->drop_next = now + p->interval;
> +       else if (schedule > 0 && !drop)
> +               vars->drop_next = now;
> +
> +       return drop;
> +}
> +
> +/* Cake has several subtle multiple bit settings. In these cases you
> + *  would be matching triple isolate mode as well.
> + */
> +
> +static bool cake_dsrc(int flow_mode)
> +{
> +       return (flow_mode & CAKE_FLOW_DUAL_SRC) == CAKE_FLOW_DUAL_SRC;
> +}
> +
> +static bool cake_ddst(int flow_mode)
> +{
> +       return (flow_mode & CAKE_FLOW_DUAL_DST) == CAKE_FLOW_DUAL_DST;
> +}
> +
> +static u32 cake_hash(struct cake_tin_data *q, const struct sk_buff *skb,
> +                    int flow_mode)
> +{
> +       struct flow_keys keys, host_keys;
> +       u32 flow_hash = 0, srchost_hash, dsthost_hash;
> +       u16 reduced_hash, srchost_idx, dsthost_idx;
> +
> +       if (unlikely(flow_mode == CAKE_FLOW_NONE))
> +               return 0;
> +
> +       skb_flow_dissect_flow_keys(skb, &keys,
> +                                  FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
> +
> +       /* flow_hash_from_keys() sorts the addresses by value, so we have
> +        * to preserve their order in a separate data structure to treat
> +        * src and dst host addresses as independently selectable.
> +        */
> +       host_keys = keys;
> +       host_keys.ports.ports     = 0;
> +       host_keys.basic.ip_proto  = 0;
> +       host_keys.keyid.keyid     = 0;
> +       host_keys.tags.flow_label = 0;
> +
> +       switch (host_keys.control.addr_type) {
> +       case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
> +               host_keys.addrs.v4addrs.src = 0;
> +               dsthost_hash = flow_hash_from_keys(&host_keys);
> +               host_keys.addrs.v4addrs.src = keys.addrs.v4addrs.src;
> +               host_keys.addrs.v4addrs.dst = 0;
> +               srchost_hash = flow_hash_from_keys(&host_keys);
> +               break;
> +
> +       case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
> +               memset(&host_keys.addrs.v6addrs.src, 0,
> +                      sizeof(host_keys.addrs.v6addrs.src));
> +               dsthost_hash = flow_hash_from_keys(&host_keys);
> +               host_keys.addrs.v6addrs.src = keys.addrs.v6addrs.src;
> +               memset(&host_keys.addrs.v6addrs.dst, 0,
> +                      sizeof(host_keys.addrs.v6addrs.dst));
> +               srchost_hash = flow_hash_from_keys(&host_keys);
> +               break;
> +
> +       default:
> +               dsthost_hash = 0;
> +               srchost_hash = 0;
> +       }
> +
> +       /* This *must* be after the above switch, since as a
> +        * side-effect it sorts the src and dst addresses.
> +        */
> +       if (flow_mode & CAKE_FLOW_FLOWS)
> +               flow_hash = flow_hash_from_keys(&keys);
> +
> +       if (!(flow_mode & CAKE_FLOW_FLOWS)) {
> +               if (flow_mode & CAKE_FLOW_SRC_IP)
> +                       flow_hash ^= srchost_hash;
> +
> +               if (flow_mode & CAKE_FLOW_DST_IP)
> +                       flow_hash ^= dsthost_hash;
> +       }
> +
> +       reduced_hash = flow_hash % CAKE_QUEUES;
> +
> +       /* set-associative hashing */
> +       /* fast path if no hash collision (direct lookup succeeds) */
> +       if (likely(q->tags[reduced_hash] == flow_hash &&
> +                  q->flows[reduced_hash].set)) {
> +               q->way_directs++;
> +       } else {
> +               u32 inner_hash = reduced_hash % CAKE_SET_WAYS;
> +               u32 outer_hash = reduced_hash - inner_hash;
> +               u32 i, k;
> +               bool allocate_src = false;
> +               bool allocate_dst = false;
> +
> +               /* check if any active queue in the set is reserved for
> +                * this flow.
> +                */
> +               for (i = 0, k = inner_hash; i < CAKE_SET_WAYS;
> +                    i++, k = (k + 1) % CAKE_SET_WAYS) {
> +                       if (q->tags[outer_hash + k] == flow_hash) {
> +                               if (i)
> +                                       q->way_hits++;
> +
> +                               if (!q->flows[outer_hash + k].set) {
> +                                       /* need to increment host refcnts */
> +                                       allocate_src = cake_dsrc(flow_mode);
> +                                       allocate_dst = cake_ddst(flow_mode);
> +                               }
> +
> +                               goto found;
> +                       }
> +               }
> +
> +               /* no queue is reserved for this flow, look for an
> +                * empty one.
> +                */
> +               for (i = 0; i < CAKE_SET_WAYS;
> +                        i++, k = (k + 1) % CAKE_SET_WAYS) {
> +                       if (!q->flows[outer_hash + k].set) {
> +                               q->way_misses++;
> +                               allocate_src = cake_dsrc(flow_mode);
> +                               allocate_dst = cake_ddst(flow_mode);
> +                               goto found;
> +                       }
> +               }
> +
> +               /* With no empty queues, default to the original
> +                * queue, accept the collision, update the host tags.
> +                */
> +               q->way_collisions++;
> +               q->hosts[q->flows[reduced_hash].srchost].srchost_refcnt--;
> +               q->hosts[q->flows[reduced_hash].dsthost].dsthost_refcnt--;
> +               allocate_src = cake_dsrc(flow_mode);
> +               allocate_dst = cake_ddst(flow_mode);
> +found:
> +               /* reserve queue for future packets in same flow */
> +               reduced_hash = outer_hash + k;
> +               q->tags[reduced_hash] = flow_hash;
> +
> +               if (allocate_src) {
> +                       srchost_idx = srchost_hash % CAKE_QUEUES;
> +                       inner_hash = srchost_idx % CAKE_SET_WAYS;
> +                       outer_hash = srchost_idx - inner_hash;
> +                       for (i = 0, k = inner_hash; i < CAKE_SET_WAYS;
> +                               i++, k = (k + 1) % CAKE_SET_WAYS) {
> +                               if (q->hosts[outer_hash + k].srchost_tag ==
> +                                   srchost_hash)
> +                                       goto found_src;
> +                       }
> +                       for (i = 0; i < CAKE_SET_WAYS;
> +                               i++, k = (k + 1) % CAKE_SET_WAYS) {
> +                               if (!q->hosts[outer_hash + k].srchost_refcnt)
> +                                       break;
> +                       }
> +                       q->hosts[outer_hash + k].srchost_tag = srchost_hash;
> +found_src:
> +                       srchost_idx = outer_hash + k;
> +                       q->hosts[srchost_idx].srchost_refcnt++;
> +                       q->flows[reduced_hash].srchost = srchost_idx;
> +               }
> +
> +               if (allocate_dst) {
> +                       dsthost_idx = dsthost_hash % CAKE_QUEUES;
> +                       inner_hash = dsthost_idx % CAKE_SET_WAYS;
> +                       outer_hash = dsthost_idx - inner_hash;
> +                       for (i = 0, k = inner_hash; i < CAKE_SET_WAYS;
> +                            i++, k = (k + 1) % CAKE_SET_WAYS) {
> +                               if (q->hosts[outer_hash + k].dsthost_tag ==
> +                                   dsthost_hash)
> +                                       goto found_dst;
> +                       }
> +                       for (i = 0; i < CAKE_SET_WAYS;
> +                            i++, k = (k + 1) % CAKE_SET_WAYS) {
> +                               if (!q->hosts[outer_hash + k].dsthost_refcnt)
> +                                       break;
> +                       }
> +                       q->hosts[outer_hash + k].dsthost_tag = dsthost_hash;
> +found_dst:
> +                       dsthost_idx = outer_hash + k;
> +                       q->hosts[dsthost_idx].dsthost_refcnt++;
> +                       q->flows[reduced_hash].dsthost = dsthost_idx;
> +               }
> +       }
> +
> +       return reduced_hash;
> +}
> +
> +/* helper functions : might be changed when/if skb use a standard list_head */
> +/* remove one skb from head of slot queue */
> +
> +static struct sk_buff *dequeue_head(struct cake_flow *flow)
> +{
> +       struct sk_buff *skb = flow->head;
> +
> +       if (skb) {
> +               flow->head = skb->next;
> +               skb->next = NULL;
> +
> +               if (skb == flow->ackcheck)
> +                       flow->ackcheck = NULL;
> +       }
> +
> +       return skb;
> +}
> +
> +/* add skb to flow queue (tail add) */
> +
> +static void flow_queue_add(struct cake_flow *flow, struct sk_buff *skb)
> +{
> +       if (!flow->head)
> +               flow->head = skb;
> +       else
> +               flow->tail->next = skb;
> +       flow->tail = skb;
> +       skb->next = NULL;
> +}
> +
> +static cobalt_time_t cake_ewma(cobalt_time_t avg, cobalt_time_t sample,
> +                                     u32 shift)
> +{
> +       avg -= avg >> shift;
> +       avg += sample >> shift;
> +       return avg;
> +}
> +
> +static void cake_heap_swap(struct cake_sched_data *q, u16 i, u16 j)
> +{
> +       struct cake_heap_entry ii = q->overflow_heap[i];
> +       struct cake_heap_entry jj = q->overflow_heap[j];
> +
> +       q->overflow_heap[i] = jj;
> +       q->overflow_heap[j] = ii;
> +
> +       q->tins[ii.t].overflow_idx[ii.b] = j;
> +       q->tins[jj.t].overflow_idx[jj.b] = i;
> +}
> +
> +static u32 cake_heap_get_backlog(const struct cake_sched_data *q, u16 i)
> +{
> +       struct cake_heap_entry ii = q->overflow_heap[i];
> +
> +       return q->tins[ii.t].backlogs[ii.b];
> +}
> +
> +static void cake_heapify(struct cake_sched_data *q, u16 i)
> +{
> +       static const u32 a = CAKE_MAX_TINS * CAKE_QUEUES;
> +       u32 m = i;
> +       u32 mb = cake_heap_get_backlog(q, m);
> +
> +       while (m < a) {
> +               u32 l = m + m + 1;
> +               u32 r = l + 1;
> +
> +               if (l < a) {
> +                       u32 lb = cake_heap_get_backlog(q, l);
> +
> +                       if (lb > mb) {
> +                               m  = l;
> +                               mb = lb;
> +                       }
> +               }
> +
> +               if (r < a) {
> +                       u32 rb = cake_heap_get_backlog(q, r);
> +
> +                       if (rb > mb) {
> +                               m  = r;
> +                               mb = rb;
> +                       }
> +               }
> +
> +               if (m != i) {
> +                       cake_heap_swap(q, i, m);
> +                       i = m;
> +               } else {
> +                       break;
> +               }
> +       }
> +}
> +
> +static void cake_heapify_up(struct cake_sched_data *q, u16 i)
> +{
> +       while (i > 0 && i < CAKE_MAX_TINS * CAKE_QUEUES) {
> +               u16 p = (i - 1) >> 1;
> +               u32 ib = cake_heap_get_backlog(q, i);
> +               u32 pb = cake_heap_get_backlog(q, p);
> +
> +               if (ib > pb) {
> +                       cake_heap_swap(q, i, p);
> +                       i = p;
> +               } else {
> +                       break;
> +               }
> +       }
> +}
> +
> +static int cake_advance_shaper(struct cake_sched_data *q,
> +                              struct cake_tin_data *b,
> +                              struct sk_buff *skb,
> +                              u64 now, bool drop)
> +{
> +       u32 len = qdisc_pkt_len(skb);
> +
> +       /* charge packet bandwidth to this tin
> +        * and to the global shaper.
> +        */
> +       if (q->rate_ns) {
> +               s64 tdiff1 = b->tin_time_next_packet - now;
> +               s64 tdiff2 = (len * (u64)b->tin_rate_ns) >> b->tin_rate_shft;
> +               s64 tdiff3 = (len * (u64)q->rate_ns) >> q->rate_shft;
> +               s64 tdiff4 = tdiff3 + (tdiff3 >> 1);
> +
> +               if (tdiff1 < 0)
> +                       b->tin_time_next_packet += tdiff2;
> +               else if (tdiff1 < tdiff2)
> +                       b->tin_time_next_packet = now + tdiff2;
> +
> +               q->time_next_packet += tdiff3;
> +               if (!drop)
> +                       q->failsafe_next_packet += tdiff4;
> +       }
> +       return len;
> +}
> +
> +static unsigned int cake_drop(struct Qdisc *sch, struct sk_buff **to_free)
> +{
> +       struct cake_sched_data *q = qdisc_priv(sch);
> +       struct sk_buff *skb;
> +       u32 idx = 0, tin = 0, len;
> +       struct cake_tin_data *b;
> +       struct cake_flow *flow;
> +       struct cake_heap_entry qq;
> +       u64 now = cobalt_get_time();
> +
> +       if (!q->overflow_timeout) {
> +               int i;
> +               /* Build fresh max-heap */
> +               for (i = CAKE_MAX_TINS * CAKE_QUEUES / 2; i >= 0; i--)
> +                       cake_heapify(q, i);
> +       }
> +       q->overflow_timeout = 65535;
> +
> +       /* select longest queue for pruning */
> +       qq  = q->overflow_heap[0];
> +       tin = qq.t;
> +       idx = qq.b;
> +
> +       b = &q->tins[tin];
> +       flow = &b->flows[idx];
> +       skb = dequeue_head(flow);
> +       if (unlikely(!skb)) {
> +               /* heap has gone wrong, rebuild it next time */
> +               q->overflow_timeout = 0;
> +               return idx + (tin << 16);
> +       }
> +
> +       if (cobalt_queue_full(&flow->cvars, &b->cparams, now))
> +               b->unresponsive_flow_count++;
> +
> +       len = qdisc_pkt_len(skb);
> +       q->buffer_used      -= skb->truesize;
> +       b->backlogs[idx]    -= len;
> +       b->tin_backlog      -= len;
> +       sch->qstats.backlog -= len;
> +       qdisc_tree_reduce_backlog(sch, 1, len);
> +
> +       b->tin_dropped++;
> +       sch->qstats.drops++;
> +
> +       __qdisc_drop(skb, to_free);
> +       sch->q.qlen--;
> +
> +       cake_heapify(q, 0);
> +
> +       return idx + (tin << 16);
> +}
> +
> +static void cake_reconfigure(struct Qdisc *sch);
> +
> +static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> +                       struct sk_buff **to_free)
> +{
> +       struct cake_sched_data *q = qdisc_priv(sch);
> +       u32 idx, tin;
> +       struct cake_tin_data *b;
> +       struct cake_flow *flow;
> +       /* signed len to handle corner case filtered ACK larger than trigger */
> +       int len = qdisc_pkt_len(skb);
> +       u64 now = cobalt_get_time();
> +
> +       tin = 0;
> +       b = &q->tins[tin];
> +
> +       /* choose flow to insert into */
> +       idx = cake_hash(b, skb, q->flow_mode);
> +       flow = &b->flows[idx];
> +
> +       /* ensure shaper state isn't stale */
> +       if (!b->tin_backlog) {
> +               if (b->tin_time_next_packet < now)
> +                       b->tin_time_next_packet = now;
> +
> +               if (!sch->q.qlen) {
> +                       if (q->time_next_packet < now) {
> +                               q->failsafe_next_packet = now;
> +                               q->time_next_packet = now;
> +                       } else if (q->time_next_packet > now &&
> +                                  q->failsafe_next_packet > now) {
> +                               u64 next = min(q->time_next_packet,
> +                                              q->failsafe_next_packet);
> +                               sch->qstats.overlimits++;
> +                               qdisc_watchdog_schedule_ns(&q->watchdog, next);
> +                       }
> +               }
> +       }
> +
> +       if (unlikely(len > b->max_skblen))
> +               b->max_skblen = len;
> +
> +       cobalt_set_enqueue_time(skb, now);
> +       flow_queue_add(flow, skb);
> +
> +       sch->q.qlen++;
> +       q->buffer_used      += skb->truesize;
> +
> +       /* stats */
> +       b->packets++;
> +       b->bytes            += len;
> +       b->backlogs[idx]    += len;
> +       b->tin_backlog      += len;
> +       sch->qstats.backlog += len;
> +       q->avg_window_bytes += len;
> +
> +       if (q->overflow_timeout)
> +               cake_heapify_up(q, b->overflow_idx[idx]);
> +
> +       /* incoming bandwidth capacity estimate */
> +       q->avg_window_bytes = 0;
> +       q->last_packet_time = now;
> +
> +       /* flowchain */
> +       if (!flow->set || flow->set == CAKE_SET_DECAYING) {
> +               struct cake_host *srchost = &b->hosts[flow->srchost];
> +               struct cake_host *dsthost = &b->hosts[flow->dsthost];
> +               u16 host_load = 1;
> +
> +               if (!flow->set) {
> +                       list_add_tail(&flow->flowchain, &b->new_flows);
> +               } else {
> +                       b->decaying_flow_count--;
> +                       list_move_tail(&flow->flowchain, &b->new_flows);
> +               }
> +               flow->set = CAKE_SET_SPARSE;
> +               b->sparse_flow_count++;
> +
> +               if (cake_dsrc(q->flow_mode))
> +                       host_load = max(host_load, srchost->srchost_refcnt);
> +
> +               if (cake_ddst(q->flow_mode))
> +                       host_load = max(host_load, dsthost->dsthost_refcnt);
> +
> +               flow->deficit = (b->flow_quantum *
> +                                quantum_div[host_load]) >> 16;
> +       } else if (flow->set == CAKE_SET_SPARSE_WAIT) {
> +               /* this flow was empty, accounted as a sparse flow, but actually
> +                * in the bulk rotation.
> +                */
> +               flow->set = CAKE_SET_BULK;
> +               b->sparse_flow_count--;
> +               b->bulk_flow_count++;
> +       }
> +
> +       if (q->buffer_used > q->buffer_max_used)
> +               q->buffer_max_used = q->buffer_used;
> +
> +       if (q->buffer_used > q->buffer_limit) {
> +               u32 dropped = 0;
> +
> +               while (q->buffer_used > q->buffer_limit) {
> +                       dropped++;
> +                       cake_drop(sch, to_free);
> +               }
> +               b->drop_overlimit += dropped;
> +       }
> +       return NET_XMIT_SUCCESS;
> +}
> +
> +static struct sk_buff *cake_dequeue_one(struct Qdisc *sch)
> +{
> +       struct cake_sched_data *q = qdisc_priv(sch);
> +       struct cake_tin_data *b = &q->tins[q->cur_tin];
> +       struct cake_flow *flow = &b->flows[q->cur_flow];
> +       struct sk_buff *skb = NULL;
> +       u32 len;
> +
> +       if (flow->head) {
> +               skb = dequeue_head(flow);
> +               len = qdisc_pkt_len(skb);
> +               b->backlogs[q->cur_flow] -= len;
> +               b->tin_backlog           -= len;
> +               sch->qstats.backlog      -= len;
> +               q->buffer_used           -= skb->truesize;
> +               sch->q.qlen--;
> +
> +               if (q->overflow_timeout)
> +                       cake_heapify(q, b->overflow_idx[q->cur_flow]);
> +       }
> +       return skb;
> +}
> +
> +/* Discard leftover packets from a tin no longer in use. */
> +static void cake_clear_tin(struct Qdisc *sch, u16 tin)
> +{
> +       struct cake_sched_data *q = qdisc_priv(sch);
> +       struct sk_buff *skb;
> +
> +       q->cur_tin = tin;
> +       for (q->cur_flow = 0; q->cur_flow < CAKE_QUEUES; q->cur_flow++)
> +               while (!!(skb = cake_dequeue_one(sch)))
> +                       kfree_skb(skb);
> +}
> +
> +static struct sk_buff *cake_dequeue(struct Qdisc *sch)
> +{
> +       struct cake_sched_data *q = qdisc_priv(sch);
> +       struct sk_buff *skb;
> +       struct cake_tin_data *b = &q->tins[q->cur_tin];
> +       struct cake_flow *flow;
> +       struct cake_host *srchost, *dsthost;
> +       struct list_head *head;
> +       u32 len;
> +       u16 host_load;
> +       cobalt_time_t now = ktime_get_ns();
> +       cobalt_time_t delay;
> +       bool first_flow = true;
> +
> +begin:
> +       if (!sch->q.qlen)
> +               return NULL;
> +
> +       /* global hard shaper */
> +       if (q->time_next_packet > now && q->failsafe_next_packet > now) {
> +               u64 next = min(q->time_next_packet, q->failsafe_next_packet);
> +
> +               sch->qstats.overlimits++;
> +               qdisc_watchdog_schedule_ns(&q->watchdog, next);
> +               return NULL;
> +       }
> +
> +       /* Choose a class to work on. */
> +       if (!q->rate_ns) {
> +               /* In unlimited mode, can't rely on shaper timings, just balance
> +                * with DRR
> +                */
> +               while (b->tin_deficit < 0 ||
> +                      !(b->sparse_flow_count + b->bulk_flow_count)) {
> +                       if (b->tin_deficit <= 0)
> +                               b->tin_deficit += b->tin_quantum_band;
> +
> +                       q->cur_tin++;
> +                       b++;
> +                       if (q->cur_tin >= q->tin_cnt) {
> +                               q->cur_tin = 0;
> +                               b = q->tins;
> +                       }
> +               }
> +       } else {
> +               /* In shaped mode, choose:
> +                * - Highest-priority tin with queue and meeting schedule, or
> +                * - The earliest-scheduled tin with queue.
> +                */
> +               int tin, best_tin = 0;
> +               s64 best_time = 0xFFFFFFFFFFFFUL;
> +
> +               for (tin = 0; tin < q->tin_cnt; tin++) {
> +                       b = q->tins + tin;
> +                       if ((b->sparse_flow_count + b->bulk_flow_count) > 0) {
> +                               s64 tdiff = b->tin_time_next_packet - now;
> +
> +                               if (tdiff <= 0 || tdiff <= best_time) {
> +                                       best_time = tdiff;
> +                                       best_tin = tin;
> +                               }
> +                       }
> +               }
> +
> +               q->cur_tin = best_tin;
> +               b = q->tins + best_tin;
> +       }
> +
> +retry:
> +       /* service this class */
> +       head = &b->decaying_flows;
> +       if (!first_flow || list_empty(head)) {
> +               head = &b->new_flows;
> +               if (list_empty(head)) {
> +                       head = &b->old_flows;
> +                       if (unlikely(list_empty(head))) {
> +                               head = &b->decaying_flows;
> +                               if (unlikely(list_empty(head)))
> +                                       goto begin;
> +                       }
> +               }
> +       }
> +       flow = list_first_entry(head, struct cake_flow, flowchain);
> +       q->cur_flow = flow - b->flows;
> +       first_flow = false;
> +
> +       /* triple isolation (modified DRR++) */
> +       srchost = &b->hosts[flow->srchost];
> +       dsthost = &b->hosts[flow->dsthost];
> +       host_load = 1;
> +
> +       if (cake_dsrc(q->flow_mode))
> +               host_load = max(host_load, srchost->srchost_refcnt);
> +
> +       if (cake_ddst(q->flow_mode))
> +               host_load = max(host_load, dsthost->dsthost_refcnt);
> +
> +       WARN_ON(host_load > CAKE_QUEUES);
> +
> +       /* flow isolation (DRR++) */
> +       if (flow->deficit <= 0) {
> +               /* The shifted prandom_u32() is a way to apply dithering to
> +                * avoid accumulating roundoff errors
> +                */
> +               flow->deficit += (b->flow_quantum * quantum_div[host_load] +
> +                                 (prandom_u32() >> 16)) >> 16;
> +               list_move_tail(&flow->flowchain, &b->old_flows);
> +
> +               /* Keep all flows with deficits out of the sparse and decaying
> +                * rotations.  No non-empty flow can go into the decaying
> +                * rotation, so they can't get deficits
> +                */
> +               if (flow->set == CAKE_SET_SPARSE) {
> +                       if (flow->head) {
> +                               b->sparse_flow_count--;
> +                               b->bulk_flow_count++;
> +                               flow->set = CAKE_SET_BULK;
> +                       } else {
> +                               /* we've moved it to the bulk rotation for
> +                                * correct deficit accounting but we still want
> +                                * to count it as a sparse flow, not a bulk one.
> +                                */
> +                               flow->set = CAKE_SET_SPARSE_WAIT;
> +                       }
> +               }
> +               goto retry;
> +       }
> +
> +       /* Retrieve a packet via the AQM */
> +       while (1) {
> +               skb = cake_dequeue_one(sch);
> +               if (!skb) {
> +                       /* this queue was actually empty */
> +                       if (cobalt_queue_empty(&flow->cvars, &b->cparams, now))
> +                               b->unresponsive_flow_count--;
> +
> +                       if (flow->cvars.p_drop || flow->cvars.count ||
> +                           now < flow->cvars.drop_next) {
> +                               /* keep in the flowchain until the state has
> +                                * decayed to rest
> +                                */
> +                               list_move_tail(&flow->flowchain,
> +                                              &b->decaying_flows);
> +                               if (flow->set == CAKE_SET_BULK) {
> +                                       b->bulk_flow_count--;
> +                                       b->decaying_flow_count++;
> +                               } else if (flow->set == CAKE_SET_SPARSE ||
> +                                          flow->set == CAKE_SET_SPARSE_WAIT) {
> +                                       b->sparse_flow_count--;
> +                                       b->decaying_flow_count++;
> +                               }
> +                               flow->set = CAKE_SET_DECAYING;
> +                       } else {
> +                               /* remove empty queue from the flowchain */
> +                               list_del_init(&flow->flowchain);
> +                               if (flow->set == CAKE_SET_SPARSE ||
> +                                   flow->set == CAKE_SET_SPARSE_WAIT)
> +                                       b->sparse_flow_count--;
> +                               else if (flow->set == CAKE_SET_BULK)
> +                                       b->bulk_flow_count--;
> +                               else
> +                                       b->decaying_flow_count--;
> +
> +                               flow->set = CAKE_SET_NONE;
> +                               srchost->srchost_refcnt--;
> +                               dsthost->dsthost_refcnt--;
> +                       }
> +                       goto begin;
> +               }
> +
> +               /* Last packet in queue may be marked, shouldn't be dropped */
> +               if (!cobalt_should_drop(&flow->cvars, &b->cparams, now, skb) ||
> +                   !flow->head)
> +                       break;
> +
> +               b->tin_dropped++;
> +               qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
> +               qdisc_qstats_drop(sch);
> +               kfree_skb(skb);
> +       }
> +
> +       b->tin_ecn_mark += !!flow->cvars.ecn_marked;
> +       qdisc_bstats_update(sch, skb);
> +
> +       /* collect delay stats */
> +       delay = now - cobalt_get_enqueue_time(skb);
> +       b->avge_delay = cake_ewma(b->avge_delay, delay, 8);
> +       b->peak_delay = cake_ewma(b->peak_delay, delay,
> +                                 delay > b->peak_delay ? 2 : 8);
> +       b->base_delay = cake_ewma(b->base_delay, delay,
> +                                 delay < b->base_delay ? 2 : 8);
> +
> +       len = cake_advance_shaper(q, b, skb, now, false);
> +       flow->deficit -= len;
> +       b->tin_deficit -= len;
> +
> +       if (q->time_next_packet > now && sch->q.qlen) {
> +               u64 next = min(q->time_next_packet, q->failsafe_next_packet);
> +
> +               qdisc_watchdog_schedule_ns(&q->watchdog, next);
> +       } else if (!sch->q.qlen) {
> +               int i;
> +
> +               for (i = 0; i < q->tin_cnt; i++) {
> +                       if (q->tins[i].decaying_flow_count) {
> +                               u64 next = now + q->tins[i].cparams.target;
> +
> +                               qdisc_watchdog_schedule_ns(&q->watchdog, next);
> +                               break;
> +                       }
> +               }
> +       }
> +
> +       if (q->overflow_timeout)
> +               q->overflow_timeout--;
> +
> +       return skb;
> +}
> +
> +static void cake_reset(struct Qdisc *sch)
> +{
> +       u32 c;
> +
> +       for (c = 0; c < CAKE_MAX_TINS; c++)
> +               cake_clear_tin(sch, c);
> +}
> +
> +static const struct nla_policy cake_policy[TCA_CAKE_MAX + 1] = {
> +       [TCA_CAKE_BASE_RATE]     = { .type = NLA_U32 },
> +       [TCA_CAKE_DIFFSERV_MODE] = { .type = NLA_U32 },
> +       [TCA_CAKE_ATM]           = { .type = NLA_U32 },
> +       [TCA_CAKE_FLOW_MODE]     = { .type = NLA_U32 },
> +       [TCA_CAKE_OVERHEAD]      = { .type = NLA_S32 },
> +       [TCA_CAKE_RTT]           = { .type = NLA_U32 },
> +       [TCA_CAKE_TARGET]        = { .type = NLA_U32 },
> +       [TCA_CAKE_AUTORATE]      = { .type = NLA_U32 },
> +       [TCA_CAKE_MEMORY]        = { .type = NLA_U32 },
> +       [TCA_CAKE_NAT]           = { .type = NLA_U32 },
> +       [TCA_CAKE_RAW]           = { .type = NLA_U32 },
> +       [TCA_CAKE_WASH]          = { .type = NLA_U32 },
> +       [TCA_CAKE_MPU]           = { .type = NLA_U32 },
> +       [TCA_CAKE_INGRESS]       = { .type = NLA_U32 },
> +       [TCA_CAKE_ACK_FILTER]    = { .type = NLA_U32 },
> +};
> +
> +static void cake_set_rate(struct cake_tin_data *b, u64 rate, u32 mtu,
> +                         cobalt_time_t ns_target, cobalt_time_t rtt_est_ns)
> +{
> +       /* convert byte-rate into time-per-byte
> +        * so it will always unwedge in reasonable time.
> +        */
> +       static const u64 MIN_RATE = 64;
> +       u64 rate_ns = 0;
> +       u8  rate_shft = 0;
> +       cobalt_time_t byte_target_ns;
> +       u32 byte_target = mtu;
> +
> +       b->flow_quantum = 1514;
> +       if (rate) {
> +               b->flow_quantum = max(min(rate >> 12, 1514ULL), 300ULL);
> +               rate_shft = 32;
> +               rate_ns = ((u64)NSEC_PER_SEC) << rate_shft;
> +               do_div(rate_ns, max(MIN_RATE, rate));
> +               while (!!(rate_ns >> 32)) {
> +                       rate_ns >>= 1;
> +                       rate_shft--;
> +               }
> +       } /* else unlimited, ie. zero delay */
> +
> +       b->tin_rate_bps  = rate;
> +       b->tin_rate_ns   = rate_ns;
> +       b->tin_rate_shft = rate_shft;
> +
> +       byte_target_ns = (byte_target * rate_ns) >> rate_shft;
> +
> +       b->cparams.target = max((byte_target_ns * 3) / 2, ns_target);
> +       b->cparams.interval = max(rtt_est_ns +
> +                                    b->cparams.target - ns_target,
> +                                    b->cparams.target * 2);
> +       b->cparams.mtu_time = byte_target_ns;
> +       b->cparams.p_inc = 1 << 24; /* 1/256 */
> +       b->cparams.p_dec = 1 << 20; /* 1/4096 */
> +}
> +
> +static void cake_reconfigure(struct Qdisc *sch)
> +{
> +       struct cake_sched_data *q = qdisc_priv(sch);
> +       struct cake_tin_data *b = &q->tins[0];
> +       int c, ft = 0;
> +
> +       q->tin_cnt = 1;
> +       cake_set_rate(b, q->rate_bps, psched_mtu(qdisc_dev(sch)),
> +                     US2TIME(q->target), US2TIME(q->interval));
> +       b->tin_quantum_band = 65535;
> +       b->tin_quantum_prio = 65535;
> +
> +       for (c = q->tin_cnt; c < CAKE_MAX_TINS; c++) {
> +               cake_clear_tin(sch, c);
> +               q->tins[c].cparams.mtu_time = q->tins[ft].cparams.mtu_time;
> +       }
> +
> +       q->rate_ns   = q->tins[ft].tin_rate_ns;
> +       q->rate_shft = q->tins[ft].tin_rate_shft;
> +
> +       if (q->buffer_config_limit) {
> +               q->buffer_limit = q->buffer_config_limit;
> +       } else if (q->rate_bps) {
> +               u64 t = (u64)q->rate_bps * q->interval;
> +
> +               do_div(t, USEC_PER_SEC / 4);
> +               q->buffer_limit = max_t(u32, t, 4U << 20);
> +       } else {
> +               q->buffer_limit = ~0;
> +       }
> +
> +       sch->flags &= ~TCQ_F_CAN_BYPASS;
> +
> +       q->buffer_limit = min(q->buffer_limit,
> +                             max(sch->limit * psched_mtu(qdisc_dev(sch)),
> +                                 q->buffer_config_limit));
> +}
> +
> +static int cake_change(struct Qdisc *sch, struct nlattr *opt,
> +                      struct netlink_ext_ack *extack)
> +{
> +       struct cake_sched_data *q = qdisc_priv(sch);
> +       struct nlattr *tb[TCA_CAKE_MAX + 1];
> +       int err;
> +
> +       if (!opt)
> +               return -EINVAL;
> +
> +       err = nla_parse_nested(tb, TCA_CAKE_MAX, opt, cake_policy, extack);
> +       if (err < 0)
> +               return err;
> +
> +       if (tb[TCA_CAKE_BASE_RATE])
> +               q->rate_bps = nla_get_u32(tb[TCA_CAKE_BASE_RATE]);
> +
> +       if (tb[TCA_CAKE_FLOW_MODE])
> +               q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) &
> +                               CAKE_FLOW_MASK);
> +
> +       if (tb[TCA_CAKE_RTT]) {
> +               q->interval = nla_get_u32(tb[TCA_CAKE_RTT]);
> +
> +               if (!q->interval)
> +                       q->interval = 1;
> +       }
> +
> +       if (tb[TCA_CAKE_TARGET]) {
> +               q->target = nla_get_u32(tb[TCA_CAKE_TARGET]);
> +
> +               if (!q->target)
> +                       q->target = 1;
> +       }
> +
> +       if (tb[TCA_CAKE_MEMORY])
> +               q->buffer_config_limit = nla_get_u32(tb[TCA_CAKE_MEMORY]);
> +
> +       if (q->tins) {
> +               sch_tree_lock(sch);
> +               cake_reconfigure(sch);
> +               sch_tree_unlock(sch);
> +       }
> +
> +       return 0;
> +}
> +
> +static void cake_free(void *addr)
> +{
> +       if (addr)
> +               kvfree(addr);
> +}


Are you sure you have to check NULL for kvfree()?



> +
> +static void cake_destroy(struct Qdisc *sch)
> +{
> +       struct cake_sched_data *q = qdisc_priv(sch);
> +
> +       qdisc_watchdog_cancel(&q->watchdog);
> +
> +       if (q->tins)
> +               cake_free(q->tins);


Duplicated NULL check.


> +}
> +
> +static int cake_init(struct Qdisc *sch, struct nlattr *opt,
> +                    struct netlink_ext_ack *extack)
> +{
> +       struct cake_sched_data *q = qdisc_priv(sch);
> +       int i, j;
> +
> +       sch->limit = 10240;
> +       q->tin_mode = CAKE_DIFFSERV_BESTEFFORT;
> +       q->flow_mode  = CAKE_FLOW_TRIPLE;
> +
> +       q->rate_bps = 0; /* unlimited by default */
> +
> +       q->interval = 100000; /* 100ms default */
> +       q->target   =   5000; /* 5ms: codel RFC argues
> +                              * for 5 to 10% of interval
> +                              */
> +
> +       q->cur_tin = 0;
> +       q->cur_flow  = 0;
> +
> +       if (opt) {
> +               int err = cake_change(sch, opt, extack);
> +
> +               if (err)
> +                       return err;


Not sure if you really want to reallocate q->tines below for this
case.


> +       }
> +
> +       qdisc_watchdog_init(&q->watchdog, sch);
> +
> +       quantum_div[0] = ~0;
> +       for (i = 1; i <= CAKE_QUEUES; i++)
> +               quantum_div[i] = 65535 / i;
> +
> +       q->tins = kvzalloc(CAKE_MAX_TINS * sizeof(struct cake_tin_data),
> +                          GFP_KERNEL | __GFP_NOWARN);



Why __GFP_NOWARN?



> +       if (!q->tins)
> +               goto nomem;
> +
> +       for (i = 0; i < CAKE_MAX_TINS; i++) {
> +               struct cake_tin_data *b = q->tins + i;
> +
> +               INIT_LIST_HEAD(&b->new_flows);
> +               INIT_LIST_HEAD(&b->old_flows);
> +               INIT_LIST_HEAD(&b->decaying_flows);
> +               b->sparse_flow_count = 0;
> +               b->bulk_flow_count = 0;
> +               b->decaying_flow_count = 0;
> +
> +               for (j = 0; j < CAKE_QUEUES; j++) {
> +                       struct cake_flow *flow = b->flows + j;
> +                       u32 k = j * CAKE_MAX_TINS + i;
> +
> +                       INIT_LIST_HEAD(&flow->flowchain);
> +                       cobalt_vars_init(&flow->cvars);
> +
> +                       q->overflow_heap[k].t = i;
> +                       q->overflow_heap[k].b = j;
> +                       b->overflow_idx[j] = k;
> +               }
> +       }
> +
> +       cake_reconfigure(sch);
> +       q->avg_peak_bandwidth = q->rate_bps;
> +       q->min_netlen = ~0;
> +       q->min_adjlen = ~0;
> +       return 0;
> +
> +nomem:
> +       cake_destroy(sch);
> +       return -ENOMEM;
> +}
> +
> +static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
> +{
> +       struct cake_sched_data *q = qdisc_priv(sch);
> +       struct nlattr *opts;
> +
> +       opts = nla_nest_start(skb, TCA_OPTIONS);
> +       if (!opts)
> +               goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_CAKE_BASE_RATE, q->rate_bps))
> +               goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_CAKE_DIFFSERV_MODE, q->tin_mode))
> +               goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_CAKE_ATM, q->atm_mode))
> +               goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_CAKE_FLOW_MODE,
> +                       q->flow_mode & CAKE_FLOW_MASK))
> +               goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_CAKE_NAT,
> +                       !!(q->flow_mode & CAKE_FLOW_NAT_FLAG)))
> +               goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_CAKE_SPLIT_GSO,
> +                       !!(q->rate_flags & CAKE_FLAG_SPLIT_GSO)))
> +               goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_CAKE_WASH,
> +                       !!(q->rate_flags & CAKE_FLAG_WASH)))
> +               goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_CAKE_OVERHEAD, q->rate_overhead))
> +               goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_CAKE_MPU, q->rate_mpu))
> +               goto nla_put_failure;
> +
> +       if (!(q->rate_flags & CAKE_FLAG_OVERHEAD))
> +               if (nla_put_u32(skb, TCA_CAKE_RAW, 0))
> +                       goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_CAKE_RTT, q->interval))
> +               goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_CAKE_TARGET, q->target))
> +               goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_CAKE_AUTORATE,
> +                       !!(q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS)))
> +               goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_CAKE_INGRESS,
> +                       !!(q->rate_flags & CAKE_FLAG_INGRESS)))
> +               goto nla_put_failure;


Is this used by a following patch? If so, please move it there.



> +
> +       if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
> +               goto nla_put_failure;


Ditto.



> +
> +       if (nla_put_u32(skb, TCA_CAKE_MEMORY, q->buffer_config_limit))
> +               goto nla_put_failure;
> +
> +       return nla_nest_end(skb, opts);
> +
> +nla_put_failure:
> +       return -1;
> +}
> +
> +static int cake_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
> +{
> +       struct cake_sched_data *q = qdisc_priv(sch);
> +       struct nlattr *stats = nla_nest_start(d->skb, TCA_STATS_APP);
> +       struct nlattr *tstats, *ts;
> +       int i;
> +
> +       if (!stats)
> +               return -1;
> +
> +#define PUT_STAT_U32(attr, data) do {                                 \
> +               if (nla_put_u32(d->skb, TCA_CAKE_STATS_ ## attr, data)) \
> +                       goto nla_put_failure;                          \
> +       } while (0)
> +
> +       PUT_STAT_U32(CAPACITY_ESTIMATE, q->avg_peak_bandwidth);
> +       PUT_STAT_U32(MEMORY_LIMIT, q->buffer_limit);
> +       PUT_STAT_U32(MEMORY_USED, q->buffer_max_used);
> +       PUT_STAT_U32(AVG_NETOFF, ((q->avg_netoff + 0x8000) >> 16));
> +       PUT_STAT_U32(MAX_NETLEN, q->max_netlen);
> +       PUT_STAT_U32(MAX_ADJLEN, q->max_adjlen);
> +       PUT_STAT_U32(MIN_NETLEN, q->min_netlen);
> +       PUT_STAT_U32(MIN_ADJLEN, q->min_adjlen);
> +
> +#undef PUT_STAT_U32
> +
> +       tstats = nla_nest_start(d->skb, TCA_CAKE_STATS_TIN_STATS);
> +       if (!tstats)
> +               goto nla_put_failure;
> +
> +#define PUT_TSTAT_U32(attr, data) do {                                 \
> +               if (nla_put_u32(d->skb, TCA_CAKE_TIN_STATS_ ## attr, data)) \
> +                       goto nla_put_failure;                           \
> +       } while (0)
> +#define PUT_TSTAT_U64(attr, data) do {                                 \
> +               if (nla_put_u64_64bit(d->skb, TCA_CAKE_TIN_STATS_ ## attr, \
> +                                       data, TCA_CAKE_TIN_STATS_PAD))  \
> +                       goto nla_put_failure;                           \
> +       } while (0)
> +
> +       for (i = 0; i < q->tin_cnt; i++) {
> +               struct cake_tin_data *b = &q->tins[i];
> +
> +               ts = nla_nest_start(d->skb, i + 1);
> +               if (!ts)
> +                       goto nla_put_failure;
> +
> +               PUT_TSTAT_U32(THRESHOLD_RATE, b->tin_rate_bps);
> +               PUT_TSTAT_U32(TARGET_US, cobalt_time_to_us(b->cparams.target));
> +               PUT_TSTAT_U32(INTERVAL_US,
> +                             cobalt_time_to_us(b->cparams.interval));
> +
> +               PUT_TSTAT_U32(SENT_PACKETS, b->packets);
> +               PUT_TSTAT_U64(SENT_BYTES64, b->bytes);
> +               PUT_TSTAT_U32(DROPPED_PACKETS, b->tin_dropped);
> +               PUT_TSTAT_U32(ECN_MARKED_PACKETS, b->tin_ecn_mark);
> +               PUT_TSTAT_U64(BACKLOG_BYTES64, b->tin_backlog);
> +               PUT_TSTAT_U32(ACKS_DROPPED_PACKETS, b->ack_drops);
> +
> +               PUT_TSTAT_U32(PEAK_DELAY_US, cobalt_time_to_us(b->peak_delay));
> +               PUT_TSTAT_U32(AVG_DELAY_US, cobalt_time_to_us(b->avge_delay));
> +               PUT_TSTAT_U32(BASE_DELAY_US, cobalt_time_to_us(b->base_delay));
> +
> +               PUT_TSTAT_U32(WAY_INDIRECT_HITS, b->way_hits);
> +               PUT_TSTAT_U32(WAY_MISSES, b->way_misses);
> +               PUT_TSTAT_U32(WAY_COLLISIONS, b->way_collisions);
> +
> +               PUT_TSTAT_U32(SPARSE_FLOWS, b->sparse_flow_count +
> +                                          b->decaying_flow_count);
> +               PUT_TSTAT_U32(BULK_FLOWS, b->bulk_flow_count);
> +               PUT_TSTAT_U32(UNRESPONSIVE_FLOWS, b->unresponsive_flow_count);
> +               PUT_TSTAT_U32(MAX_SKBLEN, b->max_skblen);
> +
> +               PUT_TSTAT_U32(FLOW_QUANTUM, b->flow_quantum);
> +               nla_nest_end(d->skb, ts);
> +       }
> +
> +#undef PUT_TSTAT_U32
> +#undef PUT_TSTAT_U64
> +
> +       nla_nest_end(d->skb, tstats);
> +       return nla_nest_end(d->skb, stats);
> +
> +nla_put_failure:
> +       nla_nest_cancel(d->skb, stats);
> +       return -1;
> +}
> +
> +static struct Qdisc_ops cake_qdisc_ops __read_mostly = {
> +       .id             =       "cake",
> +       .priv_size      =       sizeof(struct cake_sched_data),
> +       .enqueue        =       cake_enqueue,
> +       .dequeue        =       cake_dequeue,
> +       .peek           =       qdisc_peek_dequeued,
> +       .init           =       cake_init,
> +       .reset          =       cake_reset,
> +       .destroy        =       cake_destroy,
> +       .change         =       cake_change,
> +       .dump           =       cake_dump,
> +       .dump_stats     =       cake_dump_stats,
> +       .owner          =       THIS_MODULE,
> +};
> +
> +static int __init cake_module_init(void)
> +{
> +       return register_qdisc(&cake_qdisc_ops);
> +}
> +
> +static void __exit cake_module_exit(void)
> +{
> +       unregister_qdisc(&cake_qdisc_ops);
> +}
> +
> +module_init(cake_module_init)
> +module_exit(cake_module_exit)
> +MODULE_AUTHOR("Jonathan Morton");
> +MODULE_LICENSE("Dual BSD/GPL");
> +MODULE_DESCRIPTION("The CAKE shaper.");
>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox