Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy
From: Mickaël Salaün @ 2018-04-11 22:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Borkmann, LKML, Alexei Starovoitov,
	Arnaldo Carvalho de Melo, Casey Schaufler, David Drysdale,
	David S . Miller, Eric W . Biederman, Jann Horn, Jonathan Corbet,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf,
	Tycho Andersen, Will Drewry, Kernel 
In-Reply-To: <20180410044821.tllxbaq2uj6gtzpn@ast-mbp.dhcp.thefacebook.com>


[-- Attachment #1.1: Type: text/plain, Size: 25891 bytes --]


On 04/10/2018 06:48 AM, Alexei Starovoitov wrote:
> On Mon, Apr 09, 2018 at 12:01:59AM +0200, Mickaël Salaün wrote:
>>
>> On 04/08/2018 11:06 PM, Andy Lutomirski wrote:
>>> On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>>>
>>>> On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
>>>>>
>>>>> On 27/02/2018 17:39, Andy Lutomirski wrote:
>>>>>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
>>>>>>>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
>>>>>>>>>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>>>>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>>>>>>>>>>>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>>>>>>>>>>>> to itself. As a seccomp filter, a Landlock program is enforced for the
>>>>>>>>>>>> current task and all its future children. A program is immutable and a
>>>>>>>>>>>> task can only add new restricting programs to itself, forming a list of
>>>>>>>>>>>> programss.
>>>>>>>>>>>>
>>>>>>>>>>>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>>>>>>>>>>>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>>>>>>>>>>>> capabilities, other LSM), then a Landlock hook related to this kind of
>>>>>>>>>>>> object is triggered. The list of programs for this hook is then
>>>>>>>>>>>> evaluated. Each program return a 32-bit value which can deny the action
>>>>>>>>>>>> on a kernel object with a non-zero value. If every programs of the list
>>>>>>>>>>>> return zero, then the action on the object is allowed.
>>>>>>>>>>>>
>>>>>>>>>>>> Multiple Landlock programs can be chained to share a 64-bits value for a
>>>>>>>>>>>> call chain (e.g. evaluating multiple elements of a file path).  This
>>>>>>>>>>>> chaining is restricted when a process construct this chain by loading a
>>>>>>>>>>>> program, but additional checks are performed when it requests to apply
>>>>>>>>>>>> this chain of programs to itself.  The restrictions ensure that it is
>>>>>>>>>>>> not possible to call multiple programs in a way that would imply to
>>>>>>>>>>>> handle multiple shared values (i.e. cookies) for one chain.  For now,
>>>>>>>>>>>> only a fs_pick program can be chained to the same type of program,
>>>>>>>>>>>> because it may make sense if they have different triggers (cf. next
>>>>>>>>>>>> commits).  This restrictions still allows to reuse Landlock programs in
>>>>>>>>>>>> a safe way (e.g. use the same loaded fs_walk program with multiple
>>>>>>>>>>>> chains of fs_pick programs).
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>>> +struct landlock_prog_set *landlock_prepend_prog(
>>>>>>>>>>>> +             struct landlock_prog_set *current_prog_set,
>>>>>>>>>>>> +             struct bpf_prog *prog)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +     struct landlock_prog_set *new_prog_set = current_prog_set;
>>>>>>>>>>>> +     unsigned long pages;
>>>>>>>>>>>> +     int err;
>>>>>>>>>>>> +     size_t i;
>>>>>>>>>>>> +     struct landlock_prog_set tmp_prog_set = {};
>>>>>>>>>>>> +
>>>>>>>>>>>> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>>>>>>>>>>>> +             return ERR_PTR(-EINVAL);
>>>>>>>>>>>> +
>>>>>>>>>>>> +     /* validate memory size allocation */
>>>>>>>>>>>> +     pages = prog->pages;
>>>>>>>>>>>> +     if (current_prog_set) {
>>>>>>>>>>>> +             size_t i;
>>>>>>>>>>>> +
>>>>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>>>>>>>>>>>> +                     struct landlock_prog_list *walker_p;
>>>>>>>>>>>> +
>>>>>>>>>>>> +                     for (walker_p = current_prog_set->programs[i];
>>>>>>>>>>>> +                                     walker_p; walker_p = walker_p->prev)
>>>>>>>>>>>> +                             pages += walker_p->prog->pages;
>>>>>>>>>>>> +             }
>>>>>>>>>>>> +             /* count a struct landlock_prog_set if we need to allocate one */
>>>>>>>>>>>> +             if (refcount_read(&current_prog_set->usage) != 1)
>>>>>>>>>>>> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>>>>>>>>>>>> +                             / PAGE_SIZE;
>>>>>>>>>>>> +     }
>>>>>>>>>>>> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>>>>>>>>>>>> +             return ERR_PTR(-E2BIG);
>>>>>>>>>>>> +
>>>>>>>>>>>> +     /* ensure early that we can allocate enough memory for the new
>>>>>>>>>>>> +      * prog_lists */
>>>>>>>>>>>> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>>>>>>>>>>>> +     if (err)
>>>>>>>>>>>> +             return ERR_PTR(err);
>>>>>>>>>>>> +
>>>>>>>>>>>> +     /*
>>>>>>>>>>>> +      * Each task_struct points to an array of prog list pointers.  These
>>>>>>>>>>>> +      * tables are duplicated when additions are made (which means each
>>>>>>>>>>>> +      * table needs to be refcounted for the processes using it). When a new
>>>>>>>>>>>> +      * table is created, all the refcounters on the prog_list are bumped (to
>>>>>>>>>>>> +      * track each table that references the prog). When a new prog is
>>>>>>>>>>>> +      * added, it's just prepended to the list for the new table to point
>>>>>>>>>>>> +      * at.
>>>>>>>>>>>> +      *
>>>>>>>>>>>> +      * Manage all the possible errors before this step to not uselessly
>>>>>>>>>>>> +      * duplicate current_prog_set and avoid a rollback.
>>>>>>>>>>>> +      */
>>>>>>>>>>>> +     if (!new_prog_set) {
>>>>>>>>>>>> +             /*
>>>>>>>>>>>> +              * If there is no Landlock program set used by the current task,
>>>>>>>>>>>> +              * then create a new one.
>>>>>>>>>>>> +              */
>>>>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>>>>> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
>>>>>>>>>>>> +             /*
>>>>>>>>>>>> +              * If the current task is not the sole user of its Landlock
>>>>>>>>>>>> +              * program set, then duplicate them.
>>>>>>>>>>>> +              */
>>>>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>>>>>>>>>>>> +                     new_prog_set->programs[i] =
>>>>>>>>>>>> +                             READ_ONCE(current_prog_set->programs[i]);
>>>>>>>>>>>> +                     if (new_prog_set->programs[i])
>>>>>>>>>>>> +                             refcount_inc(&new_prog_set->programs[i]->usage);
>>>>>>>>>>>> +             }
>>>>>>>>>>>> +
>>>>>>>>>>>> +             /*
>>>>>>>>>>>> +              * Landlock program set from the current task will not be freed
>>>>>>>>>>>> +              * here because the usage is strictly greater than 1. It is
>>>>>>>>>>>> +              * only prevented to be freed by another task thanks to the
>>>>>>>>>>>> +              * caller of landlock_prepend_prog() which should be locked if
>>>>>>>>>>>> +              * needed.
>>>>>>>>>>>> +              */
>>>>>>>>>>>> +             landlock_put_prog_set(current_prog_set);
>>>>>>>>>>>> +     }
>>>>>>>>>>>> +
>>>>>>>>>>>> +     /* prepend tmp_prog_set to new_prog_set */
>>>>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
>>>>>>>>>>>> +             /* get the last new list */
>>>>>>>>>>>> +             struct landlock_prog_list *last_list =
>>>>>>>>>>>> +                     tmp_prog_set.programs[i];
>>>>>>>>>>>> +
>>>>>>>>>>>> +             if (last_list) {
>>>>>>>>>>>> +                     while (last_list->prev)
>>>>>>>>>>>> +                             last_list = last_list->prev;
>>>>>>>>>>>> +                     /* no need to increment usage (pointer replacement) */
>>>>>>>>>>>> +                     last_list->prev = new_prog_set->programs[i];
>>>>>>>>>>>> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
>>>>>>>>>>>> +             }
>>>>>>>>>>>> +     }
>>>>>>>>>>>> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
>>>>>>>>>>>> +     return new_prog_set;
>>>>>>>>>>>> +
>>>>>>>>>>>> +put_tmp_lists:
>>>>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
>>>>>>>>>>>> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
>>>>>>>>>>>> +     return new_prog_set;
>>>>>>>>>>>> +}
>>>>>>>>>>>
>>>>>>>>>>> Nack on the chaining concept.
>>>>>>>>>>> Please do not reinvent the wheel.
>>>>>>>>>>> There is an existing mechanism for attaching/detaching/quering multiple
>>>>>>>>>>> programs attached to cgroup and tracing hooks that are also
>>>>>>>>>>> efficiently executed via BPF_PROG_RUN_ARRAY.
>>>>>>>>>>> Please use that instead.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I don't see how that would help.  Suppose you add a filter, then
>>>>>>>>>> fork(), and then the child adds another filter.  Do you want to
>>>>>>>>>> duplicate the entire array?  You certainly can't *modify* the array
>>>>>>>>>> because you'll affect processes that shouldn't be affected.
>>>>>>>>>>
>>>>>>>>>> In contrast, doing this through seccomp like the earlier patches
>>>>>>>>>> seemed just fine to me, and seccomp already had the right logic.
>>>>>>>>>
>>>>>>>>> it doesn't look to me that existing seccomp side of managing fork
>>>>>>>>> situation can be reused. Here there is an attempt to add 'chaining'
>>>>>>>>> concept which sort of an extension of existing seccomp style,
>>>>>>>>> but somehow heavily done on bpf side and contradicts cgroup/tracing.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't see why the seccomp way can't be used.  I agree with you that
>>>>>>>> the seccomp *style* shouldn't be used in bpf code like this, but I
>>>>>>>> think that Landlock programs can and should just live in the existing
>>>>>>>> seccomp chain.  If the existing seccomp code needs some modification
>>>>>>>> to make this work, then so be it.
>>>>>>>
>>>>>>> +1
>>>>>>> if that was the case...
>>>>>>> but that's not my reading of the patch set.
>>>>>>
>>>>>> An earlier version of the patch set used the seccomp filter chain.
>>>>>> Mickaël, what exactly was wrong with that approach other than that the
>>>>>> seccomp() syscall was awkward for you to use?  You could add a
>>>>>> seccomp_add_landlock_rule() syscall if you needed to.
>>>>>
>>>>> Nothing was wrong about about that, this part did not changed (see my
>>>>> next comment).
>>>>>
>>>>>>
>>>>>> As a side comment, why is this an LSM at all, let alone a non-stacking
>>>>>> LSM?  It would make a lot more sense to me to make Landlock depend on
>>>>>> having LSMs configured in but to call the landlock hooks directly from
>>>>>> the security_xyz() hooks.
>>>>>
>>>>> See Casey's answer and his patch series: https://lwn.net/Articles/741963/
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> In other words, the kernel already has two kinds of chaining:
>>>>>>>> seccomp's and bpf's.  bpf's doesn't work right for this type of usage
>>>>>>>> across fork(), whereas seccomp's already handles that case correctly.
>>>>>>>> (In contrast, seccomp's is totally wrong for cgroup-attached filters.)
>>>>>>>>  So IMO Landlock should use the seccomp core code and call into bpf
>>>>>>>> for the actual filtering.
>>>>>>>
>>>>>>> +1
>>>>>>> in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism,
>>>>>>> since cgroup hierarchy can be complicated with bpf progs attached
>>>>>>> at different levels with different override/multiprog properties,
>>>>>>> so walking link list and checking all flags at run-time would have
>>>>>>> been too slow. That's why we added compute_effective_progs().
>>>>>>
>>>>>> If we start adding override flags to Landlock, I think we're doing it
>>>>>> wrong.   With cgroup bpf programs, the whole mess is set up by the
>>>>>> administrator.  With seccomp, and with Landlock if done correctly, it
>>>>>> *won't* be set up by the administrator, so the chance that everyone
>>>>>> gets all the flags right is about zero.  All attached filters should
>>>>>> run unconditionally.
>>>>>
>>>>>
>>>>> There is a misunderstanding about this chaining mechanism. This should
>>>>> not be confused with the list of seccomp filters nor the cgroup
>>>>> hierarchies. Landlock programs can be stacked the same way seccomp's
>>>>> filters can (cf. struct landlock_prog_set, the "chain_last" field is an
>>>>> optimization which is not used for this struct handling). This stackable
>>>>> property did not changed from the previous patch series. The chaining
>>>>> mechanism is for another use case, which does not make sense for seccomp
>>>>> filters nor other eBPF program types, at least for now, from what I can
>>>>> tell.
>>>>>
>>>>> You may want to get a look at my talk at FOSDEM
>>>>> (https://landlock.io/talks/2018-02-04_landlock-fosdem.pdf), especially
>>>>> slides 11 and 12.
>>>>>
>>>>> Let me explain my reasoning about this program chaining thing.
>>>>>
>>>>> To check if an action on a file is allowed, we first need to identify
>>>>> this file and match it to the security policy. In a previous
>>>>> (non-public) patch series, I tried to use one type of eBPF program to
>>>>> check every kind of access to a file. To be able to identify a file, I
>>>>> relied on an eBPF map, similar to the current inode map. This map store
>>>>> a set of references to file descriptors. I then created a function
>>>>> bpf_is_file_beneath() to check if the requested file was beneath a file
>>>>> in the map. This way, no chaining, only one eBPF program type to check
>>>>> an access to a file... but some issues then emerged. First, this design
>>>>> create a side-channel which help an attacker using such a program to
>>>>> infer some information not normally available, for example to get a hint
>>>>> on where a file descriptor (received from a UNIX socket) come from.
>>>>> Another issue is that this type of program would be called for each
>>>>> component of a path. Indeed, when the kernel check if an access to a
>>>>> file is allowed, it walk through all of the directories in its path
>>>>> (checking if the current process is allowed to execute them). That first
>>>>> attempt led me to rethink the way we could filter an access to a file
>>>>> *path*.
>>>>>
>>>>> To minimize the number of called to an eBPF program dedicated to
>>>>> validate an access to a file path, I decided to create three subtype of
>>>>> eBPF programs. The FS_WALK type is called when walking through every
>>>>> directory of a file path (except the last one if it is the target). We
>>>>> can then restrict this type of program to the minimum set of functions
>>>>> it is allowed to call and the minimum set of data available from its
>>>>> context. The first implicit chaining is for this type of program. To be
>>>>> able to evaluate a path while being called for all its components, this
>>>>> program need to store a state (to remember what was the parent directory
>>>>> of this path). There is no "previous" field in the subtype for this
>>>>> program because it is chained with itself, for each directories. This
>>>>> enable to create a FS_WALK program to evaluate a file hierarchy, thank
>>>>> to the inode map which can be used to check if a directory of this
>>>>> hierarchy is part of an allowed (or denied) list of directories. This
>>>>> design enables to express a file hierarchy in a programmatic way,
>>>>> without requiring an eBPF helper to do the job (unlike my first experiment).
>>>>>
>>>>> The explicit chaining is used to tied a path evaluation (with a FS_WALK
>>>>> program) to an access to the actual file being requested (the last
>>>>> component of a file path), with a FS_PICK program. It is only at this
>>>>> time that the kernel check for the requested action (e.g. read, write,
>>>>> chdir, append...). To be able to filter such access request we can have
>>>>> one call to the same program for every action and let this program check
>>>>> for which action it was called. However, this design does not allow the
>>>>> kernel to know if the current action is indeed handled by this program.
>>>>> Hence, it is not possible to implement a cache mechanism to only call
>>>>> this program if it knows how to handle this action.
>>>>>
>>>>> The approach I took for this FS_PICK type of program is to add to its
>>>>> subtype which action it can handle (with the "triggers" bitfield, seen
>>>>> as ORed actions). This way, the kernel knows if a call to a FS_PICK
>>>>> program is necessary. If the user wants to enforce a different security
>>>>> policy according to the action requested on a file, then it needs
>>>>> multiple FS_PICK programs. However, to reduce the number of such
>>>>> programs, this patch series allow a FS_PICK program to be chained with
>>>>> another, the same way a FS_WALK is chained with itself. This way, if the
>>>>> user want to check if the action is a for example an "open" and a "read"
>>>>> and not a "map" and a "read", then it can chain multiple FS_PICK
>>>>> programs with different triggers actions. The OR check performed by the
>>>>> kernel is not a limitation then, only a way to know if a call to an eBPF
>>>>> program is needed.
>>>>>
>>>>> The last type of program is FS_GET. This one is called when a process
>>>>> get a struct file or change its working directory. This is the only
>>>>> program type able (and allowed) to tag a file. This restriction is
>>>>> important to not being subject to resource exhaustion attacks (i.e.
>>>>> tagging every inode accessible to an attacker, which would allocate too
>>>>> much kernel memory).
>>>>>
>>>>> This design gives room for improvements to create a cache of eBPF
>>>>> context (input data, including maps if any), with the result of an eBPF
>>>>> program. This would help limit the number of call to an eBPF program the
>>>>> same way SELinux or other kernel components do to limit costly checks.
>>>>>
>>>>> The eBPF maps of progs are useful to call the same type of eBPF
>>>>> program. It does not fit with this use case because we may want multiple
>>>>> eBPF program according to the action requested on a kernel object (e.g.
>>>>> FS_GET). The other reason is because the eBPF program does not know what
>>>>> will be the next (type of) access check performed by the kernel.
>>>>>
>>>>> To say it another way, this chaining mechanism is a way to split a
>>>>> kernel object evaluation with multiple specialized programs, each of
>>>>> them being able to deal with data tied to their type. Using a monolithic
>>>>> eBPF program to check everything does not scale and does not fit with
>>>>> unprivileged use either.
>>>>>
>>>>> As a side note, the cookie value is only an ephemeral value to keep a
>>>>> state between multiple programs call. It can be used to create a state
>>>>> machine for an object evaluation.
>>>>>
>>>>> I don't see a way to do an efficient and programmatic path evaluation,
>>>>> with different access checks, with the current eBPF features. Please let
>>>>> me know if you know how to do it another way.
>>>>>
>>>>
>>>> Andy, Alexei, Daniel, what do you think about this Landlock program
>>>> chaining and cookie?
>>>>
>>>
>>> Can you give a small pseudocode real world example that acutally needs
>>> chaining?  The mechanism is quite complicated and I'd like to
>>> understand how it'll be used.
>>>
>>
>> Here is the interesting part from the example (patch 09/11):
>>
>> +SEC("maps")
>> +struct bpf_map_def inode_map = {
>> +	.type = BPF_MAP_TYPE_INODE,
>> +	.key_size = sizeof(u32),
>> +	.value_size = sizeof(u64),
>> +	.max_entries = 20,
>> +};
>> +
>> +SEC("subtype/landlock1")
>> +static union bpf_prog_subtype _subtype1 = {
>> +	.landlock_hook = {
>> +		.type = LANDLOCK_HOOK_FS_WALK,
>> +	}
>> +};
>> +
>> +static __always_inline __u64 update_cookie(__u64 cookie, __u8 lookup,
>> +		void *inode, void *chain, bool freeze)
>> +{
>> +	__u64 map_allow = 0;
>> +
>> +	if (cookie == 0) {
>> +		cookie = bpf_inode_get_tag(inode, chain);
>> +		if (cookie)
>> +			return cookie;
>> +		/* only look for the first match in the map, ignore nested
>> +		 * paths in this example */
>> +		map_allow = bpf_inode_map_lookup(&inode_map, inode);
>> +		if (map_allow)
>> +			cookie = 1 | map_allow;
>> +	} else {
>> +		if (cookie & COOKIE_VALUE_FREEZED)
>> +			return cookie;
>> +		map_allow = cookie & _MAP_MARK_MASK;
>> +		cookie &= ~_MAP_MARK_MASK;
>> +		switch (lookup) {
>> +		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOTDOT:
>> +			cookie--;
>> +			break;
>> +		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOT:
>> +			break;
>> +		default:
>> +			/* ignore _MAP_MARK_MASK overflow in this example */
>> +			cookie++;
>> +			break;
>> +		}
>> +		if (cookie >= 1)
>> +			cookie |= map_allow;
>> +	}
>> +	/* do not modify the cookie for each fs_pick */
>> +	if (freeze && cookie)
>> +		cookie |= COOKIE_VALUE_FREEZED;
>> +	return cookie;
>> +}
>> +
>> +SEC("landlock1")
>> +int fs_walk(struct landlock_ctx_fs_walk *ctx)
>> +{
>> +	ctx->cookie = update_cookie(ctx->cookie, ctx->inode_lookup,
>> +			(void *)ctx->inode, (void *)ctx->chain, false);
>> +	return LANDLOCK_RET_ALLOW;
>> +}
>>
>> The program "landlock1" is called for every directory execution (except
>> the last one if it is the leaf of a path). This enables to identify a
>> file hierarchy with only a (one dimension) list of file descriptors
>> (i.e. inode_map).
>>
>> Underneath, the Landlock LSM part looks if there is an associated path
>> walk (nameidata) with each inode access request. If there is one, then
>> the cookie associated with the path walk (if any) is made available
>> through the eBPF program context. This enables to develop a state
>> machine with an eBPF program to "evaluate" a file path (without string
>> parsing).
>>
>> The goal with this chaining mechanism is to be able to express a complex
>> kernel object like a file, with multiple run of one or more eBPF
>> programs, as a multilayer evaluation. This semantic may only make sense
>> for the user/developer and his security policy. We must keep in mind
>> that this object identification should be available to unprivileged
>> processes. This means that we must be very careful to what kind of
>> information are available to an eBPF program because this can then leak
>> to a process (e.g. through a map). With this mechanism, only information
>> already available to user space is available to the eBPF program.
>>
>> In this example, the complexity of the path evaluation is in the eBPF
>> program. We can then keep the kernel code more simple and generic. This
>> enables more flexibility for a security policy definition.
> 
> it all sounds correct on paper, but it's pretty novel
> approach and I'm not sure I see all the details in the patch.
> When people say "inode" they most of the time mean inode integer number,
> whereas in this patch do you mean a raw pointer to in-kernel
> 'struct inode' ?
> To avoid confusion it should probably be called differently.

It's indeed a pointer to a "struct inode", not an inode number.

I was thinking about generalizing the BPF_MAP_TYPE_INODE by renaming it
to BPF_MAP_TYPE_FD. This map type could then be used either to identify
a set of inodes (pointers) or other kernel objects identifiable by a
file descriptor. A "subtype" (similar to the BPF prog subtype introduced
in this patch series) may be used to specialize such a map to statically
identify the kind of content it may hold. We could then add more
subtypes to identify sockets, devices, processes, and so on.

> 
> If you meant inode as a number then why inode only?
> where is superblock, device, mount point?
> How bpf side can compare inodes without this additional info?
> How bpf side will know what inode to compare to?
> What if inode number is reused?

This pointer can identify if a giver inode is the same as one pointed by
a file descriptor (or a file path).


> This approach is an optimization to compare inodes
> instead of strings passed into sys_open ?

Comparing paths with strings is less efficient but it is also very
error-prone. Another advantage of using file descriptors is for
unprivileged processes: we can be sure that this processes are allowed
to access a file referred by a file descriptor (opened file). Indeed we
check (security_inode_getattr) that the process is allowed to stat an
opened file. This way, a malicious process can't infer information by
crafting path strings.


> 
> If you meant inode as a pointer how bpf side will
> know the pointer before the walk begins?

The BPF map is filled by user space with file descriptors pointing to
opened files. When a path walk begin, the LSM part of Landlock is
notified that a process is requesting an access to the first element of
the path (e.g. "/"). This first element may be part of a map or not. The
BPF program can then choose if this request is legitimate or not.


> What guarantees that it's not a stale pointer?

When user space updates a map with a new file descriptor, the kernel
checks if this FD is valid. If this is the case, then the inode's usage
counter is incremented and its address is stored in the map.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH net] strparser: Fix incorrect strp->need_bytes value.
From: Doron Roberts-Kedes @ 2018-04-11 22:05 UTC (permalink / raw)
  To: davem; +Cc: netdev, Doron Roberts-Kedes

strp_data_ready resets strp->need_bytes to 0 if strp_peek_len indicates
that the remainder of the message has been received. However,
do_strp_work does not reset strp->need_bytes to 0. If do_strp_work
completes a partial message, the value of strp->need_bytes will continue
to reflect the needed bytes of the previous message, causing
future invocations of strp_data_ready to return early if
strp->need_bytes is less than strp_peek_len. Resetting strp->need_bytes
to 0 in __strp_recv on handing a full message to the upper layer solves
this problem.

__strp_recv also calculates strp->need_bytes using stm->accum_len before
stm->accum_len has been incremented by cand_len. This can cause
strp->need_bytes to be equal to the full length of the message instead
of the full length minus the accumulated length. This, in turn, causes
strp_data_ready to return early, even when there is sufficient data to
complete the partial message. Incrementing stm->accum_len before using
it to calculate strp->need_bytes solves this problem.

Found while testing net/tls_sw recv path.

Fixes: 43a0c6751a322847 ("strparser: Stream parser for messages")
Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
---
 net/strparser/strparser.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index b9283ce..805b139 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -296,9 +296,9 @@ static int __strp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 					strp_start_timer(strp, timeo);
 				}
 
+				stm->accum_len += cand_len;
 				strp->need_bytes = stm->strp.full_len -
 						       stm->accum_len;
-				stm->accum_len += cand_len;
 				stm->early_eaten = cand_len;
 				STRP_STATS_ADD(strp->stats.bytes, cand_len);
 				desc->count = 0; /* Stop reading socket */
@@ -321,6 +321,7 @@ static int __strp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 		/* Hurray, we have a new message! */
 		cancel_delayed_work(&strp->msg_timer_work);
 		strp->skb_head = NULL;
+		strp->need_bytes = 0;
 		STRP_STATS_INCR(strp->stats.msgs);
 
 		/* Give skb to upper layer */
@@ -410,9 +411,7 @@ void strp_data_ready(struct strparser *strp)
 		return;
 
 	if (strp->need_bytes) {
-		if (strp_peek_len(strp) >= strp->need_bytes)
-			strp->need_bytes = 0;
-		else
+		if (strp_peek_len(strp) < strp->need_bytes)
 			return;
 	}
 
-- 
2.9.5

^ permalink raw reply related

* [GIT] Networking
From: David Miller @ 2018-04-11 21:53 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) In ip_gre tunnel, handle the conflict between TUNNEL_{SEQ,CSUM} and
   GSO/LLTX properly.  From Sabrina Dubroca.

2) Stop properly on error in lan78xx_read_otp(), from Phil Elwell.

3) Don't uncompress in slip before rstate is initialized, from Tejaswi
   Tanikella.

4) When using 1.x firmware on aquantia, issue a deinit before we
   hardware reset the chip, otherwise we break dirty wake WOL.  From
   Igor Russkikh.

5) Correct log check in vhost_vq_access_ok(), from Stefan Hajnoczi.

6) Fix ethtool -x crashes in bnxt_en, from Michael Chan.

7) Fix races in l2tp tunnel creation and duplicate tunnel detection,
   from Guillaume Nault.

Please pull, thanks a lot!

The following changes since commit c18bb396d3d261ebbb4efbc05129c5d354c541e4:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2018-04-09 17:04:10 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 0c84cee8f131a090f77f5a3dea5d6a7bd99c00db:

  Merge branch 'l2tp-tunnel-creation-fixes' (2018-04-11 17:41:28 -0400)

----------------------------------------------------------------
Andy Gospodarek (1):
      bnxt_en: do not allow wildcard matches for L2 flows

Bassem Boubaker (1):
      cdc_ether: flag the Cinterion AHS8 modem by gemalto as WWAN

David S. Miller (4):
      Merge branch 'Aquantia-atlantic-critical-fixes-04-2018'
      Merge branch 'vhost-fix-vhost_vq_access_ok-log-check'
      Merge branch 'bnxt_en-Fixes-for-net'
      Merge branch 'l2tp-tunnel-creation-fixes'

Eric Auger (1):
      vhost: Fix vhost_copy_to_user()

Guillaume Nault (2):
      l2tp: fix races in tunnel creation
      l2tp: fix race in duplicate tunnel detection

Igor Russkikh (2):
      net: aquantia: Regression on reset with 1.x firmware
      net: aquantia: oops when shutdown on already stopped device

Ka-Cheong Poon (1):
      rds: MP-RDS may use an invalid c_path

Michael Chan (3):
      bnxt_en: Fix ethtool -x crash when device is down.
      bnxt_en: Need to include RDMA rings in bnxt_check_rings().
      bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().

Phil Elwell (3):
      lan78xx: Correctly indicate invalid OTP
      lan78xx: Avoid spurious kevent 4 "error"
      lan78xx: Don't reset the interface on open

Sabrina Dubroca (3):
      ip_gre: clear feature flags when incompatible o_flags are set
      tun: set the flags before registering the netdevice
      tun: send netlink notification when the device is modified

Sriharsha Basavapatna (2):
      bnxt_en: Ignore src port field in decap filter nodes
      bnxt_en: Support max-mtu with VF-reps

Stefan Hajnoczi (2):
      vhost: fix vhost_vq_access_ok() log check
      vhost: return bool from *_access_ok() functions

Tejaswi Tanikella (1):
      slip: Check if rstate is initialized before uncompressing

 drivers/net/ethernet/aquantia/atlantic/aq_nic.c              |   8 +-
 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c |  16 ++++
 drivers/net/ethernet/broadcom/bnxt/bnxt.c                    |   4 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c            |  11 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c                 |  63 +++++++++++++++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c                |  30 ++++++++
 drivers/net/slip/slhc.c                                      |   5 ++
 drivers/net/tun.c                                            |  33 ++++++--
 drivers/net/usb/cdc_ether.c                                  |   6 ++
 drivers/net/usb/lan78xx.c                                    |   9 +--
 drivers/vhost/vhost.c                                        |  72 +++++++++---------
 drivers/vhost/vhost.h                                        |   4 +-
 include/net/slhc_vj.h                                        |   1 +
 net/ipv4/ip_gre.c                                            |   6 ++
 net/l2tp/l2tp_core.c                                         | 225 ++++++++++++++++++++++++-------------------------------
 net/l2tp/l2tp_core.h                                         |   4 +-
 net/l2tp/l2tp_netlink.c                                      |  22 +++---
 net/l2tp/l2tp_ppp.c                                          |   9 +++
 net/rds/send.c                                               |  15 ++--
 19 files changed, 345 insertions(+), 198 deletions(-)

^ permalink raw reply

* Re: [PATCH net] net: validate attribute sizes in neigh_dump_table()
From: David Ahern @ 2018-04-11 21:48 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <20180411214600.203361-1-edumazet@google.com>

On 4/11/18 3:46 PM, Eric Dumazet wrote:
> Since neigh_dump_table() calls nlmsg_parse() without giving policy
> constraints, attributes can have arbirary size that we must validate
> 

...

> 
> Fixes: 21fdd092acc7 ("net: Add support for filtering neigh dump by master device")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: David Ahern <dsa@cumulusnetworks.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>
> ---
>  net/core/neighbour.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 

Acked-by: David Ahern <dsa@cumulusnetworks.com>

Thanks for fixing, Eric.

^ permalink raw reply

* [PATCH net] net: validate attribute sizes in neigh_dump_table()
From: Eric Dumazet @ 2018-04-11 21:46 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet, David Ahern

Since neigh_dump_table() calls nlmsg_parse() without giving policy
constraints, attributes can have arbirary size that we must validate

Reported by syzbot/KMSAN :

BUG: KMSAN: uninit-value in neigh_master_filtered net/core/neighbour.c:2292 [inline]
BUG: KMSAN: uninit-value in neigh_dump_table net/core/neighbour.c:2348 [inline]
BUG: KMSAN: uninit-value in neigh_dump_info+0x1af0/0x2250 net/core/neighbour.c:2438
CPU: 1 PID: 3575 Comm: syzkaller268891 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 neigh_master_filtered net/core/neighbour.c:2292 [inline]
 neigh_dump_table net/core/neighbour.c:2348 [inline]
 neigh_dump_info+0x1af0/0x2250 net/core/neighbour.c:2438
 netlink_dump+0x9ad/0x1540 net/netlink/af_netlink.c:2225
 __netlink_dump_start+0x1167/0x12a0 net/netlink/af_netlink.c:2322
 netlink_dump_start include/linux/netlink.h:214 [inline]
 rtnetlink_rcv_msg+0x1435/0x1560 net/core/rtnetlink.c:4598
 netlink_rcv_skb+0x355/0x5f0 net/netlink/af_netlink.c:2447
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:4653
 netlink_unicast_kernel net/netlink/af_netlink.c:1311 [inline]
 netlink_unicast+0x1672/0x1750 net/netlink/af_netlink.c:1337
 netlink_sendmsg+0x1048/0x1310 net/netlink/af_netlink.c:1900
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
 __sys_sendmsg net/socket.c:2080 [inline]
 SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
 SyS_sendmsg+0x54/0x80 net/socket.c:2087
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x43fed9
RSP: 002b:00007ffddbee2798 EFLAGS: 00000213 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fed9
RDX: 0000000000000000 RSI: 0000000020005000 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
R10: 00000000004002c8 R11: 0000000000000213 R12: 0000000000401800
R13: 0000000000401890 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
 slab_post_alloc_hook mm/slab.h:445 [inline]
 slab_alloc_node mm/slub.c:2737 [inline]
 __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:984 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1183 [inline]
 netlink_sendmsg+0x9a6/0x1310 net/netlink/af_netlink.c:1875
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
 __sys_sendmsg net/socket.c:2080 [inline]
 SYSC_sendmsg+0x2a3/0x3d0 net/socket.c:2091
 SyS_sendmsg+0x54/0x80 net/socket.c:2087
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 21fdd092acc7 ("net: Add support for filtering neigh dump by master device")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
 net/core/neighbour.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 7b7a14abba28e2b77c6448f1c3d151287afc79ad..a8bc02bb339f9f4c914ae7b23408cd5ccc8b3b8e 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2323,12 +2323,16 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 
 	err = nlmsg_parse(nlh, sizeof(struct ndmsg), tb, NDA_MAX, NULL, NULL);
 	if (!err) {
-		if (tb[NDA_IFINDEX])
+		if (tb[NDA_IFINDEX]) {
+			if (nla_len(tb[NDA_IFINDEX]) != sizeof(u32))
+				return -EINVAL;
 			filter_idx = nla_get_u32(tb[NDA_IFINDEX]);
-
-		if (tb[NDA_MASTER])
+		}
+		if (tb[NDA_MASTER]) {
+			if (nla_len(tb[NDA_MASTER]) != sizeof(u32))
+				return -EINVAL;
 			filter_master_idx = nla_get_u32(tb[NDA_MASTER]);
-
+		}
 		if (filter_idx || filter_master_idx)
 			flags |= NLM_F_DUMP_FILTERED;
 	}
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* Re: [PATCH net 0/2] l2tp: tunnel creation fixes
From: David Miller @ 2018-04-11 21:42 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, jchapman, tparkin
In-Reply-To: <cover.1523385906.git.g.nault@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Tue, 10 Apr 2018 21:01:07 +0200

> L2TP tunnel creation is racy. We need to make sure that the tunnel
> returned by l2tp_tunnel_create() isn't going to be freed while the
> caller is using it. This is done in patch #1, by separating tunnel
> creation from tunnel registration.
> 
> With the tunnel registration code in place, we can now check for
> duplicate tunnels in a race-free way. This is done in patch #2, which
> incidentally removes the last use of l2tp_tunnel_find().

Series applied and queued up for -stable, thanks.

^ permalink raw reply

* [PATCH net] tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets
From: Eric Dumazet @ 2018-04-11 21:36 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet

syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1]

I believe this was caused by a TCP_MD5SIG being set on live
flow.

This is highly unexpected, since TCP option space is limited.

For instance, presence of TCP MD5 option automatically disables
TCP TimeStamp option at SYN/SYNACK time, which we can not do
once flow has been established.

Really, adding/deleting an MD5 key only makes sense on sockets
in CLOSE or LISTEN state.

[1]
BUG: KMSAN: uninit-value in tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
CPU: 1 PID: 6177 Comm: syzkaller192004 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
 tcp_fast_parse_options net/ipv4/tcp_input.c:3858 [inline]
 tcp_validate_incoming+0x4f1/0x2790 net/ipv4/tcp_input.c:5184
 tcp_rcv_established+0xf60/0x2bb0 net/ipv4/tcp_input.c:5453
 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
 sk_backlog_rcv include/net/sock.h:908 [inline]
 __release_sock+0x2d6/0x680 net/core/sock.c:2271
 release_sock+0x97/0x2a0 net/core/sock.c:2786
 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
 SyS_sendto+0x8a/0xb0 net/socket.c:1715
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x448fe9
RSP: 002b:00007fd472c64d38 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00000000006e5a30 RCX: 0000000000448fe9
RDX: 000000000000029f RSI: 0000000020a88f88 RDI: 0000000000000004
RBP: 00000000006e5a34 R08: 0000000020e68000 R09: 0000000000000010
R10: 00000000200007fd R11: 0000000000000216 R12: 0000000000000000
R13: 00007fff074899ef R14: 00007fd472c659c0 R15: 0000000000000009

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
 slab_post_alloc_hook mm/slab.h:445 [inline]
 slab_alloc_node mm/slub.c:2737 [inline]
 __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:984 [inline]
 tcp_send_ack+0x18c/0x910 net/ipv4/tcp_output.c:3624
 __tcp_ack_snd_check net/ipv4/tcp_input.c:5040 [inline]
 tcp_ack_snd_check net/ipv4/tcp_input.c:5053 [inline]
 tcp_rcv_established+0x2103/0x2bb0 net/ipv4/tcp_input.c:5469
 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
 sk_backlog_rcv include/net/sock.h:908 [inline]
 __release_sock+0x2d6/0x680 net/core/sock.c:2271
 release_sock+0x97/0x2a0 net/core/sock.c:2786
 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
 SyS_sendto+0x8a/0xb0 net/socket.c:1715
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
 net/ipv4/tcp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bccc4c2700870b8c7ff592a6bd27acebd9bc6471..4fa3f812b9ff8954a9b6a018c648ff12ab995721 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2813,8 +2813,10 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 #ifdef CONFIG_TCP_MD5SIG
 	case TCP_MD5SIG:
 	case TCP_MD5SIG_EXT:
-		/* Read the IP->Key mappings from userspace */
-		err = tp->af_specific->md5_parse(sk, optname, optval, optlen);
+		if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
+			err = tp->af_specific->md5_parse(sk, optname, optval, optlen);
+		else
+			err = -EINVAL;
 		break;
 #endif
 	case TCP_USER_TIMEOUT:
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Jakub Kicinski @ 2018-04-11 21:19 UTC (permalink / raw)
  To: Michael Chan, Andy Gospodarek; +Cc: David Miller, Netdev
In-Reply-To: <CACKFLinjsSyMDSyut3h1Qyzrn0HGjaGTROdA-5U-syezEopkxQ@mail.gmail.com>

On Wed, 11 Apr 2018 13:55:11 -0700, Michael Chan wrote:
> On Wed, Apr 11, 2018 at 1:50 PM, Andy Gospodarek wrote:
> > On Wed, Apr 11, 2018 at 01:41:31PM -0700, Michael Chan wrote:  
> > True, but I'm not sure that tc_cls_common_offload is used in all cases.
> > Take red_offload() as one of those.  
> 
> For Flower, we know we have the extack pointer in
> tc_cls_common_offload struct and we can use it to set the netlink
> error message.  The point is that we don't have to modify
> ndo_setup_tc().

Yes, the extack is actually only populated when skip_sw is specified to
avoid warning users who don't care about offloads.

Flower offloads don't go via .ndo_setup_tc but TC block callbacks.  But
one day we will hopefully find a reasonable way to pass extack to qdisc
offloads as well..

FWIW your driver is actually already using extack under the veil of
tc_cls_can_offload_and_chain0() :)

^ permalink raw reply

* BUG: corrupted list in team_nl_cmd_options_set
From: syzbot @ 2018-04-11 21:02 UTC (permalink / raw)
  To: jiri, linux-kernel, netdev, syzkaller-bugs

Hello,

syzbot hit the following crash on upstream commit
b284d4d5a6785f8cd07eda2646a95782373cd01e (Tue Apr 10 19:25:30 2018 +0000)
Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=4d4af685432dc0e56c91

C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6161158629228544
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5600380654190592
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=4627738266697728
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-1223000601505858474
compiler: gcc (GCC) 8.0.1 20180301 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+4d4af685432dc0e56c91@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.
If you forward the report, please keep this part and the footer.

8021q: adding VLAN 0 to HW filter on device team0
netlink: 'syzkaller556835': attribute type 3 has an invalid length.
netlink: 'syzkaller556835': attribute type 3 has an invalid length.
list_add double add: new=0000000004f859c0, prev=00000000c9745291,  
next=0000000004f859c0.
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:31!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4466 Comm: syzkaller556835 Not tainted 4.16.0+ #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:__list_add_valid+0xaa/0xb0 lib/list_debug.c:29
RSP: 0018:ffff8801b04bf248 EFLAGS: 00010286
RAX: 0000000000000058 RBX: ffff8801c8fc7a90 RCX: 0000000000000000
RDX: 0000000000000058 RSI: ffffffff815fbf41 RDI: ffffed0036097e3f
RBP: ffff8801b04bf260 R08: ffff8801b0b2a700 R09: ffffed003b604f90
R10: ffffed003b604f90 R11: ffff8801db027c87 R12: ffff8801c8fc7a90
R13: ffff8801c8fc7a90 R14: dffffc0000000000 R15: 0000000000000000
FS:  0000000000b98880(0000) GS:ffff8801db000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000043fc30 CR3: 00000001afe8e000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  __list_add include/linux/list.h:60 [inline]
  list_add include/linux/list.h:79 [inline]
  team_nl_cmd_options_set+0x9ff/0x12b0 drivers/net/team/team.c:2571
  genl_family_rcv_msg+0x889/0x1120 net/netlink/genetlink.c:599
  genl_rcv_msg+0xc6/0x170 net/netlink/genetlink.c:624
  netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
  genl_rcv+0x28/0x40 net/netlink/genetlink.c:635
  netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
  netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
  netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
  sock_sendmsg_nosec net/socket.c:629 [inline]
  sock_sendmsg+0xd5/0x120 net/socket.c:639
  ___sys_sendmsg+0x805/0x940 net/socket.c:2117
  __sys_sendmsg+0x115/0x270 net/socket.c:2155
  SYSC_sendmsg net/socket.c:2164 [inline]
  SyS_sendmsg+0x29/0x30 net/socket.c:2162
  do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4458b9
RSP: 002b:00007ffd1d4a7278 EFLAGS: 00000213 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000000000000001b RCX: 00000000004458b9
RDX: 0000000000000010 RSI: 0000000020000d00 RDI: 0000000000000004
RBP: 00000000004a74ed R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000213 R12: 00007ffd1d4a7348
R13: 0000000000402a60 R14: 0000000000000000 R15: 0000000000000000
Code: 75 e8 eb a9 48 89 f7 48 89 75 e8 e8 d1 85 7b fe 48 8b 75 e8 eb bb 48  
89 f2 48 89 d9 4c 89 e6 48 c7 c7 a0 84 d8 87 e8 ea 67 28 fe <0f> 0b 0f 1f  
40 00 48 b8 00 00 00 00 00 fc ff df 55 48 89 e5 41
RIP: __list_add_valid+0xaa/0xb0 lib/list_debug.c:29 RSP: ffff8801b04bf248
---[ end trace b4f71d7dd7ca6d10 ]---


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Michael Chan @ 2018-04-11 20:55 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: Jakub Kicinski, David Miller, Netdev
In-Reply-To: <20180411205014.GE33938@C02RW35GFVH8.dhcp.broadcom.net>

On Wed, Apr 11, 2018 at 1:50 PM, Andy Gospodarek
<andrew.gospodarek@broadcom.com> wrote:
> On Wed, Apr 11, 2018 at 01:41:31PM -0700, Michael Chan wrote:
>> On Wed, Apr 11, 2018 at 1:31 PM, Andy Gospodarek
>> <andrew.gospodarek@broadcom.com> wrote:
>> > On Wed, Apr 11, 2018 at 11:43:14AM -0700, Jakub Kicinski wrote:
>> >> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
>> >> > @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
>> >> >             return false;
>> >> >     }
>> >> >
>> >> > +   /* Currently source/dest MAC cannot be partial wildcard  */
>> >> > +   if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
>> >> > +       !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
>> >> > +           netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");
>> >>
>> >> This wouldn't be something to do in net, but how do you feel about
>> >> using extack for messages like this?
>> >>
>> >
>> > I agree 'net' would not have been the place for a change like that, but
>> > I do think that would be a good idea.  It looks like we could easily
>> > change the ndo_setup_tc to something like this:
>> >
>> >         int                     (*ndo_setup_tc)(struct net_device *dev,
>> >                                                 enum tc_setup_type type,
>> >                                                 void *type_data,
>> >                                                 struct netlink_ext_ack *extack);
>>
>> I think the extack pointer is already in the tc_cls_common_offload
>> struct inside tc_cls_flower_offload struct.
>
> True, but I'm not sure that tc_cls_common_offload is used in all cases.
> Take red_offload() as one of those.

For Flower, we know we have the extack pointer in
tc_cls_common_offload struct and we can use it to set the netlink
error message.  The point is that we don't have to modify
ndo_setup_tc().

^ permalink raw reply

* [net 1/1] tipc: fix unbalanced reference counter
From: Jon Maloy @ 2018-04-11 20:52 UTC (permalink / raw)
  To: davem, netdev; +Cc: tipc-discussion, mohan.krishna.ghanta.krishnamurthy

When a topology subscription is created, we may encounter (or KASAN
may provoke) a failure to create a corresponding service instance in
the binding table. Instead of letting the tipc_nametbl_subscribe()
report the failure back to the caller, the function just makes a warning
printout and returns, without incrementing the subscription reference
counter as expected by the caller.

This makes the caller believe that the subscription was successful, so
it will at a later moment try to unsubscribe the item. This involves
a sub_put() call. Since the reference counter never was incremented
in the first place, we get a premature delete of the subscription item,
followed by a "use-after-free" warning.

We fix this by adding a return value to tipc_nametbl_subscribe() and
make the caller aware of the failure to subscribe.

This bug seems to always have been around, but this fix only applies
back to the commit shown below. Given the low risk of this happening
we believe this to be sufficient.

Fixes: commit 218527fe27ad ("tipc: replace name table service range
array with rb tree")
Reported-by: syzbot+aa245f26d42b8305d157@syzkaller.appspotmail.com

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
 net/tipc/name_table.c | 5 ++++-
 net/tipc/name_table.h | 2 +-
 net/tipc/subscr.c     | 5 ++++-
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index b1fe209..4068eaa 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -665,13 +665,14 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
 /**
  * tipc_nametbl_subscribe - add a subscription object to the name table
  */
-void tipc_nametbl_subscribe(struct tipc_subscription *sub)
+bool tipc_nametbl_subscribe(struct tipc_subscription *sub)
 {
 	struct name_table *nt = tipc_name_table(sub->net);
 	struct tipc_net *tn = tipc_net(sub->net);
 	struct tipc_subscr *s = &sub->evt.s;
 	u32 type = tipc_sub_read(s, seq.type);
 	struct tipc_service *sc;
+	bool res = true;
 
 	spin_lock_bh(&tn->nametbl_lock);
 	sc = tipc_service_find(sub->net, type);
@@ -685,8 +686,10 @@ void tipc_nametbl_subscribe(struct tipc_subscription *sub)
 		pr_warn("Failed to subscribe for {%u,%u,%u}\n", type,
 			tipc_sub_read(s, seq.lower),
 			tipc_sub_read(s, seq.upper));
+		res = false;
 	}
 	spin_unlock_bh(&tn->nametbl_lock);
+	return res;
 }
 
 /**
diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
index 4b14fc2..0febba4 100644
--- a/net/tipc/name_table.h
+++ b/net/tipc/name_table.h
@@ -126,7 +126,7 @@ struct publication *tipc_nametbl_insert_publ(struct net *net, u32 type,
 struct publication *tipc_nametbl_remove_publ(struct net *net, u32 type,
 					     u32 lower, u32 upper,
 					     u32 node, u32 key);
-void tipc_nametbl_subscribe(struct tipc_subscription *s);
+bool tipc_nametbl_subscribe(struct tipc_subscription *s);
 void tipc_nametbl_unsubscribe(struct tipc_subscription *s);
 int tipc_nametbl_init(struct net *net);
 void tipc_nametbl_stop(struct net *net);
diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
index b7d80bc..f340e53 100644
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -153,7 +153,10 @@ struct tipc_subscription *tipc_sub_subscribe(struct net *net,
 	memcpy(&sub->evt.s, s, sizeof(*s));
 	spin_lock_init(&sub->lock);
 	kref_init(&sub->kref);
-	tipc_nametbl_subscribe(sub);
+	if (!tipc_nametbl_subscribe(sub)) {
+		kfree(sub);
+		return NULL;
+	}
 	timer_setup(&sub->timer, tipc_sub_timeout, 0);
 	timeout = tipc_sub_read(&sub->evt.s, timeout);
 	if (timeout != TIPC_WAIT_FOREVER)
-- 
2.1.4


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot

^ permalink raw reply related

* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Andy Gospodarek @ 2018-04-11 20:50 UTC (permalink / raw)
  To: Michael Chan; +Cc: Andy Gospodarek, Jakub Kicinski, David Miller, Netdev
In-Reply-To: <CACKFLi=W=0a+2pap5Yis7_ZCRrP+cGxxik6XBLnEPws+ge873g@mail.gmail.com>

On Wed, Apr 11, 2018 at 01:41:31PM -0700, Michael Chan wrote:
> On Wed, Apr 11, 2018 at 1:31 PM, Andy Gospodarek
> <andrew.gospodarek@broadcom.com> wrote:
> > On Wed, Apr 11, 2018 at 11:43:14AM -0700, Jakub Kicinski wrote:
> >> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
> >> > @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
> >> >             return false;
> >> >     }
> >> >
> >> > +   /* Currently source/dest MAC cannot be partial wildcard  */
> >> > +   if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
> >> > +       !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
> >> > +           netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");
> >>
> >> This wouldn't be something to do in net, but how do you feel about
> >> using extack for messages like this?
> >>
> >
> > I agree 'net' would not have been the place for a change like that, but
> > I do think that would be a good idea.  It looks like we could easily
> > change the ndo_setup_tc to something like this:
> >
> >         int                     (*ndo_setup_tc)(struct net_device *dev,
> >                                                 enum tc_setup_type type,
> >                                                 void *type_data,
> >                                                 struct netlink_ext_ack *extack);
> 
> I think the extack pointer is already in the tc_cls_common_offload
> struct inside tc_cls_flower_offload struct.

True, but I'm not sure that tc_cls_common_offload is used in all cases.
Take red_offload() as one of those.

^ permalink raw reply

* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Michael Chan @ 2018-04-11 20:41 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: Jakub Kicinski, David Miller, Netdev
In-Reply-To: <20180411203152.GD33938@C02RW35GFVH8.dhcp.broadcom.net>

On Wed, Apr 11, 2018 at 1:31 PM, Andy Gospodarek
<andrew.gospodarek@broadcom.com> wrote:
> On Wed, Apr 11, 2018 at 11:43:14AM -0700, Jakub Kicinski wrote:
>> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
>> > @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
>> >             return false;
>> >     }
>> >
>> > +   /* Currently source/dest MAC cannot be partial wildcard  */
>> > +   if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
>> > +       !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
>> > +           netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");
>>
>> This wouldn't be something to do in net, but how do you feel about
>> using extack for messages like this?
>>
>
> I agree 'net' would not have been the place for a change like that, but
> I do think that would be a good idea.  It looks like we could easily
> change the ndo_setup_tc to something like this:
>
>         int                     (*ndo_setup_tc)(struct net_device *dev,
>                                                 enum tc_setup_type type,
>                                                 void *type_data,
>                                                 struct netlink_ext_ack *extack);

I think the extack pointer is already in the tc_cls_common_offload
struct inside tc_cls_flower_offload struct.

^ permalink raw reply

* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Andy Gospodarek @ 2018-04-11 20:31 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Michael Chan, davem, netdev
In-Reply-To: <20180411114303.6f927c45@cakuba.netronome.com>

On Wed, Apr 11, 2018 at 11:43:14AM -0700, Jakub Kicinski wrote:
> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
> > @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
> >  		return false;
> >  	}
> >  
> > +	/* Currently source/dest MAC cannot be partial wildcard  */
> > +	if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
> > +	    !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
> > +		netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");
> 
> This wouldn't be something to do in net, but how do you feel about
> using extack for messages like this?
> 

I agree 'net' would not have been the place for a change like that, but
I do think that would be a good idea.  It looks like we could easily
change the ndo_setup_tc to something like this:

        int                     (*ndo_setup_tc)(struct net_device *dev,
                                                enum tc_setup_type type,
                                                void *type_data,
						struct netlink_ext_ack *extack);

It also looks like most of the callers of ndo_setup_tc have infra in
place to pass extack easily when the call is sourced from a netlink
message.   The others can just pass in NULL or define a local
netlink_ext_ack variable for short-term use.

^ permalink raw reply

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Ivan Briano @ 2018-04-11 20:31 UTC (permalink / raw)
  To: Thomas Gleixner, Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	anna-maria, henrik, John Stultz, levi.pearson, edumazet, willemb,
	mlichvar
In-Reply-To: <alpine.DEB.2.21.1804112209450.1564@nanos.tec.linutronix.de>



On 04/11/2018 01:16 PM, Thomas Gleixner wrote:
> On Tue, 10 Apr 2018, Jesus Sanchez-Palencia wrote:
>>>> This will be provided by tbs if the socket which is transmitting packets is
>>>> configured for deadline mode.
>>>
>>> You don't want the socket to decide that. The qdisc into which a socket
>>> feeds defines the mode and the qdisc rejects requests with the wrong mode.
>>>
>>> Making a qdisc doing both and let the user decide what he wants it to be is
>>> not really going to fly. Especially if you have different users which want
>>> a different mode. It's clearly distinct functionality.
>>
>>
>> Ok, so just to make sure I got this right, are you suggesting that both the
>> 'tbs' qdisc *and* the socket (i.e. through SO_TXTIME) should have a config
>> parameter for specifying the txtime mode? This way if there is a mismatch,
>> packets from that socket are rejected by the qdisc.
> 
> Correct. The same is true if you try to set SO_TXTIME for something which
> is just routing regular traffic.
> 
>> (...)
>>>
>>>> Another question for this mode (but perhaps that applies to both modes) is, what
>>>> if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
>>>> the packet during dequeue.
>>>
>>> There the question is how user space is notified about that issue. The
>>> application which queued the packet on time does rightfully assume that
>>> it's going to be on the wire on time.
>>>
>>> This is a violation of the overall scheduling plan, so you need to have
>>> a sane design to handle that.
>>
>> In addition to the qdisc stats, we could look into using the socket's error
>> queue to notify the application about that.
> 
> Makes sense.
>  
>>>> Putting it all together, we end up with:
>>>>
>>>> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
>>>> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting
>>>
>>> Why CLOCK_REALTIME? The only interesting time in a TSN network is
>>> CLOCK_TAI, really.
>>
>> REALTIME was just an example here to show that the qdisc has to be configured
>> with a clockid parameter. Are you suggesting that instead both of the new qdiscs
>> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?
> 
> I think so. It's _the_ network time on which everything is based on.
> 
>>>> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
>>>> either as a txtime or as deadline by tbs (and further the NIC driver for the
>>>> offlaod case): SCM_TXTIME.
>>>>
>>>> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
>>>> socket, and will have as parameters a clockid and a txtime mode (deadline or
>>>> explicit), that defines the semantics of the timestamp set on packets using
>>>> SCM_TXTIME.
>>>>
>>>> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
>>>
>>> Can you remind me why we would need that?
>>
>> So there is a "clockid" that can be used for the full hw offload modes. On this
>> case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
>> can't just use a clockid that was computed from the fd pointing to /dev/ptpX .
> 
> And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
> yet another clock, right?
> 

Most likely, though you can technically have a different time domain
that is not based on TAI.

> Thanks,
> 
> 	tglx
> 

^ permalink raw reply

* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Michael Chan @ 2018-04-11 20:21 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Andy Gospodarek, David Miller, Netdev
In-Reply-To: <20180411114303.6f927c45@cakuba.netronome.com>

On Wed, Apr 11, 2018 at 11:43 AM, Jakub Kicinski <kubakici@wp.pl> wrote:
> On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
>> @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
>>               return false;
>>       }
>>
>> +     /* Currently source/dest MAC cannot be partial wildcard  */
>> +     if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
>> +         !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
>> +             netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");
>
> This wouldn't be something to do in net, but how do you feel about
> using extack for messages like this?
>

Sounds reasonable to me.  Just need to pass in the extack pointer to
this function to set the netlink error message.

^ permalink raw reply

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Thomas Gleixner @ 2018-04-11 20:16 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	anna-maria, henrik, John Stultz, levi.pearson, edumazet, willemb,
	mlichvar
In-Reply-To: <e758b90b-508b-d1e8-bd1b-41e7b40c357b@intel.com>

On Tue, 10 Apr 2018, Jesus Sanchez-Palencia wrote:
> >> This will be provided by tbs if the socket which is transmitting packets is
> >> configured for deadline mode.
> > 
> > You don't want the socket to decide that. The qdisc into which a socket
> > feeds defines the mode and the qdisc rejects requests with the wrong mode.
> > 
> > Making a qdisc doing both and let the user decide what he wants it to be is
> > not really going to fly. Especially if you have different users which want
> > a different mode. It's clearly distinct functionality.
> 
> 
> Ok, so just to make sure I got this right, are you suggesting that both the
> 'tbs' qdisc *and* the socket (i.e. through SO_TXTIME) should have a config
> parameter for specifying the txtime mode? This way if there is a mismatch,
> packets from that socket are rejected by the qdisc.

Correct. The same is true if you try to set SO_TXTIME for something which
is just routing regular traffic.

> (...)
> > 
> >> Another question for this mode (but perhaps that applies to both modes) is, what
> >> if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
> >> the packet during dequeue.
> > 
> > There the question is how user space is notified about that issue. The
> > application which queued the packet on time does rightfully assume that
> > it's going to be on the wire on time.
> > 
> > This is a violation of the overall scheduling plan, so you need to have
> > a sane design to handle that.
> 
> In addition to the qdisc stats, we could look into using the socket's error
> queue to notify the application about that.

Makes sense.
 
> >> Putting it all together, we end up with:
> >>
> >> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
> >> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting
> > 
> > Why CLOCK_REALTIME? The only interesting time in a TSN network is
> > CLOCK_TAI, really.
> 
> REALTIME was just an example here to show that the qdisc has to be configured
> with a clockid parameter. Are you suggesting that instead both of the new qdiscs
> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?

I think so. It's _the_ network time on which everything is based on.

> >> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
> >> either as a txtime or as deadline by tbs (and further the NIC driver for the
> >> offlaod case): SCM_TXTIME.
> >>
> >> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
> >> socket, and will have as parameters a clockid and a txtime mode (deadline or
> >> explicit), that defines the semantics of the timestamp set on packets using
> >> SCM_TXTIME.
> >>
> >> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
> > 
> > Can you remind me why we would need that?
> 
> So there is a "clockid" that can be used for the full hw offload modes. On this
> case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
> can't just use a clockid that was computed from the fd pointing to /dev/ptpX .

And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
yet another clock, right?

Thanks,

	tglx

^ permalink raw reply

* Re: [patch net] devlink: convert occ_get op to separate registration
From: Sasha Levin @ 2018-04-11 20:04 UTC (permalink / raw)
  To: gregkh@linuxfoundation.org
  Cc: David Miller, jiri@resnulli.us, jiri@mellanox.com,
	netdev@vger.kernel.org, idosch@mellanox.com,
	stable@vger.kernel.org
In-Reply-To: <20180411084600.GA26298@kroah.com>

On Wed, Apr 11, 2018 at 10:46:00AM +0200, gregkh@linuxfoundation.org wrote:
>On Tue, Apr 10, 2018 at 08:43:31PM +0000, Sasha Levin wrote:
>> >Bots are starting to overwhelm actual content from human beings
>> >on this list, and I want to put my foot on the brake right now
>> >before it gets even more out of control.
>>
>> I think we're just hitting the limitations of using a mailing list. Bots
>> aren't around to spam everyone for fun, right?
>>
>> 0day bot is spammy because patches fail to compile, syzbot is spammy
>> because we have tons of bugs users can hit and the stable bot is
>> spammy because we miss lots of commits that should be in stable.
>>
>> As the kernel grows, not only the codebase is expanding but also the
>> tooling around it. While spammy, bots provide valuable input that in the
>> past would take blood, sweat and tears to acquire. We just need a better
>> way to consume it rather than blocking off these inputs.
>
>As one of the primary abusers of the "I send too much email" flood of
>mess, I'm going to start cutting down on what type of message I respond
>to a public vger list from now on.  I'll write more on the stable@ list
>about this, but for your individual patches like this, how about just
>responding to the developers themselves and not a public list?  I do
>that when I commit a patch to my tree, stripping out lists, as it's not
>a useful message for anyone other than the person who wrote the patch
>and the reviewers.

Sure.

I think we should have a discussion as to the best way to "spam" people.
There are a few alternatives I can think of (a "digest" mail once a
day/week? Web UI? different mailing list?) but we keep using the one
size fits all approach for all our bots.

I don't see a reason why we can't tailor bot output based on
maintainer/author preference? Maybe David would be less upset if he'd
see just one mail per week with commits we ask to review?

We're stuck with mailing lists for kernel development, might as well
make the best out of it.

^ permalink raw reply

* Re: [PATCH net] sctp: do not check port in sctp_inet6_cmp_addr
From: Neil Horman @ 2018-04-11 19:59 UTC (permalink / raw)
  To: Xin Long; +Cc: Marcelo Ricardo Leitner, network dev, linux-sctp, davem
In-Reply-To: <CADvbK_egknPZCHRkZoLZ2MuGhRyaYKcMCaqG9Apt=QJ3-o-D6A@mail.gmail.com>

On Thu, Apr 12, 2018 at 12:16:58AM +0800, Xin Long wrote:
> On Wed, Apr 11, 2018 at 10:59 PM, Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> > On Wed, Apr 11, 2018 at 10:36:07AM -0400, Neil Horman wrote:
> >> On Wed, Apr 11, 2018 at 08:58:05PM +0800, Xin Long wrote:
> >> > pf->cmp_addr() is called before binding a v6 address to the sock. It
> >> > should not check ports, like in sctp_inet_cmp_addr.
> >> >
> >> > But sctp_inet6_cmp_addr checks the addr by invoking af(6)->cmp_addr,
> >> > sctp_v6_cmp_addr where it also compares the ports.
> >> >
> >> > This would cause that setsockopt(SCTP_SOCKOPT_BINDX_ADD) could bind
> >> > multiple duplicated IPv6 addresses after Commit 40b4f0fd74e4 ("sctp:
> >> > lack the check for ports in sctp_v6_cmp_addr").
> >> >
> >> > This patch is to remove af->cmp_addr called in sctp_inet6_cmp_addr,
> >> > but do the proper check for both v6 addrs and v4mapped addrs.
> >> >
> >> > Fixes: 40b4f0fd74e4 ("sctp: lack the check for ports in sctp_v6_cmp_addr")
> >> > Reported-by: Jianwen Ji <jiji@redhat.com>
> >> > Signed-off-by: Xin Long <lucien.xin@gmail.com>
> >> > ---
> >> >  net/sctp/ipv6.c | 27 ++++++++++++++++++++++++---
> >> >  1 file changed, 24 insertions(+), 3 deletions(-)
> >> >
> >> > diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> >> > index f1fc48e..be4b72c 100644
> >> > --- a/net/sctp/ipv6.c
> >> > +++ b/net/sctp/ipv6.c
> >> > @@ -846,8 +846,8 @@ static int sctp_inet6_cmp_addr(const union sctp_addr *addr1,
> >> >                            const union sctp_addr *addr2,
> >> >                            struct sctp_sock *opt)
> >> >  {
> >> > -   struct sctp_af *af1, *af2;
> >> >     struct sock *sk = sctp_opt2sk(opt);
> >> > +   struct sctp_af *af1, *af2;
> >> >
> >> >     af1 = sctp_get_af_specific(addr1->sa.sa_family);
> >> >     af2 = sctp_get_af_specific(addr2->sa.sa_family);
> >> > @@ -863,10 +863,31 @@ static int sctp_inet6_cmp_addr(const union sctp_addr *addr1,
> >> >     if (sctp_is_any(sk, addr1) || sctp_is_any(sk, addr2))
> >> >             return 1;
> >> >
> >> > -   if (addr1->sa.sa_family != addr2->sa.sa_family)
> >> > +   if (addr1->sa.sa_family != addr2->sa.sa_family) {
> >> > +           if (addr1->sa.sa_family == AF_INET &&
> >> > +               addr2->sa.sa_family == AF_INET6 &&
> >> > +               ipv6_addr_v4mapped(&addr2->v6.sin6_addr))
> >> > +                   if (addr2->v6.sin6_addr.s6_addr32[3] ==
> >> > +                       addr1->v4.sin_addr.s_addr)
> >> > +                           return 1;
> >> > +           if (addr2->sa.sa_family == AF_INET &&
> >> > +               addr1->sa.sa_family == AF_INET6 &&
> >> > +               ipv6_addr_v4mapped(&addr1->v6.sin6_addr))
> >> > +                   if (addr1->v6.sin6_addr.s6_addr32[3] ==
> >> > +                       addr2->v4.sin_addr.s_addr)
> >> > +                           return 1;
> >> > +           return 0;
> >> > +   }
> >> > +
> >> > +   if (!ipv6_addr_equal(&addr1->v6.sin6_addr, &addr2->v6.sin6_addr))
> >> > +           return 0;
> >> > +
> >> > +   if ((ipv6_addr_type(&addr1->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) &&
> >> > +       addr1->v6.sin6_scope_id && addr2->v6.sin6_scope_id &&
> >> > +       addr1->v6.sin6_scope_id != addr2->v6.sin6_scope_id)
> >> >             return 0;
> >> >
> >> > -   return af1->cmp_addr(addr1, addr2);
> >> > +   return 1;
> >> >  }
> >> >
> >> >  /* Verify that the provided sockaddr looks bindable.   Common verification,
> >> > --
> >> > 2.1.0
> >> >
> >> This looks correct to me, but is it worth duplicating the comparison code like
> >> this from the cmp_addr function?  It might be more worthwhile to add a flag to
> >> the cmp_addr method to direct weather it needs to check port values or not.
> >> That way you could continue to use the cmp_addr function here.
> >
> > Adding a flag into sctp_v6_cmp_addr will get us a terrible code to
> > read. It's already not one of the best looking part of it. Maybe
> > still duplicate part of it it, but at 'af' level? As in:
> > - af->cmp_addr
> > - af->cmp_addr_port
> >
> What do you think of:
> 
> static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
>                             const union sctp_addr *addr2)
> {
>         return __sctp_v6_cmp_addr(addr1, addr2) &&
>                addr1->v6.sin_port == addr2->v6.sin_port;
> }
> 
> (v6.sin_port and v4.sin_port have the same offset in union sctp_addr,
>  we've exploited this in many places in SCTP)
Yes, I'd be ok with that
Neil

> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-11 19:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <871sflk0zc.fsf@xmission.com>

On Wed, Apr 11, 2018 at 02:16:23PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Wed, Apr 11, 2018 at 01:37:18PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> 
> >> > On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> >> > Yeah, agreed.
> >> >> > But I think the patch is not complete. To guarantee that no non-initial
> >> >> > user namespace actually receives uevents we need to:
> >> >> > 1. only sent uevents to uevent sockets that are located in network
> >> >> >    namespaces that are owned by init_user_ns
> >> >> > 2. filter uevents that are sent to sockets in mc_list that have opened a
> >> >> >    uevent socket that is owned by init_user_ns *from* a
> >> >> >    non-init_user_ns
> >> >> >
> >> >> > We account for 1. by only recording uevent sockets in the global uevent
> >> >> > socket list who are owned by init_user_ns.
> >> >> > But to account for 2. we need to filter by the user namespace who owns
> >> >> > the socket in mc_list. So in addition to that we also need to slightly
> >> >> > change the filter logic in kobj_bcast_filter() I think:
> >> >> >
> >> >> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >> > index 22a2c1a98b8f..064d7d29ace5 100644
> >> >> > --- a/lib/kobject_uevent.c
> >> >> > +++ b/lib/kobject_uevent.c
> >> >> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >> >  		return sock_ns != ns;
> >> >> >  	}
> >> >> >  
> >> >> > -	return 0;
> >> >> > + 	/* Check if socket was opened from non-initial user namespace. */
> >> >> > + 	return sk_user_ns(dsk) != &init_user_ns;
> >> >> >  }
> >> >> >  #endif
> >> >> >  
> >> >> >
> >> >> > But correct me if I'm wrong.
> >> >> 
> >> >> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
> >> >> permissions and an explicit opt-in to receiving packets from multiple
> >> >> network namespaces.
> >> >
> >> > I don't think that's what I'm talking about unless that is somehow the
> >> > default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
> >> > doing
> >> >
> >> > unshare -U --map-root
> >> >
> >> > then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
> >> > uevents. Imho, this should not be possible because I'm in a
> >> > non-init_user_ns. But currently I'm able to - even with the patch to
> >> > come - since the uevent socket in the kernel was created when init_net
> >> > was created and hence is *owned* by the init_user_ns which means it is
> >> > in the list of uevent sockets. Here's a demo of what I mean:
> >> >
> >> > https://asciinema.org/a/175632
> >> 
> >> Why do you care about this case?
> >
> > It's not so much that I care about this case since any workload that
> > wants to run a separate udevd will have to unshare the network namespace
> > and the user namespace for it to make complete sense.
> > What I do care about is that the two of us are absolutely in the clear
> > about what semantics we are going to expose to userspace and it seems
> > that we were talking past each other wrt to this "corner case".
> > For userspace, it needs to be very clear that the intention is to filter
> > by *owning user namespace of the network namespace a given task resides
> > in* and not by user namespace of the task per se. This is what this
> > corner case basically shows, I think.
> 
> If this is just a clarification of semantics then yes this is a
> productive question.  I almost agree with your definition above.
> 
> I would make the definition very simple.  Uevents will not be broadcast
> via netlink in a network namespace where net->user_ns != &init_user_ns,
> with the exception of uevents for network devices in that network
> namespace.

Well, for the sake of posterity :) I should add that I'd prefer we'd add
what I suggested above:

-	return 0;
+ 	/* Check if socket was opened from non-initial user namespace. */
+ 	return sk_user_ns(dsk) != &init_user_ns;
 }

to slam the door shut once and for all for all non-init_user_ns
namespaces because it *seems* like the cleanest solution: uevents are
owned by init_user_ns; period. Because it is the only user namespace
that can do anything interesting with them *by default*.
But what we have now right now with my upcoming patch is at least
sufficient and safe.

Christian

> 
> The existing filtering by the sending uid and verifying that it is uid 0
> gives a little more room to filter if we want (as udev & friends will
> ignore the uevent), but I don't see the point.
> 
> Eric

^ permalink raw reply

* Re: WARNING: possible recursive locking detected
From: Julian Anastasov @ 2018-04-11 19:45 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzbot, Christian Brauner, David Miller, David Ahern,
	Florian Westphal, Jiri Benc, Kirill Tkhai, LKML, Xin Long, netdev,
	syzkaller-bugs
In-Reply-To: <CACT4Y+Z3Ktnio1B0TAZpx8HOEfd2evwyQqbMeWGvEkzh0yfhKQ@mail.gmail.com>


	Hello,

On Wed, 11 Apr 2018, Dmitry Vyukov wrote:

> On Wed, Apr 11, 2018 at 4:02 PM, syzbot
> <syzbot+3c43eecd7745a5ce1640@syzkaller.appspotmail.com> wrote:
> > Hello,
> >
> > syzbot hit the following crash on upstream commit
> > b284d4d5a6785f8cd07eda2646a95782373cd01e (Tue Apr 10 19:25:30 2018 +0000)
> > Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client
> > syzbot dashboard link:
> > https://syzkaller.appspot.com/bug?extid=3c43eecd7745a5ce1640
> >
> > So far this crash happened 3 times on upstream.
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5103706542440448
> > syzkaller reproducer:
> > https://syzkaller.appspot.com/x/repro.syz?id=5641659786199040
> > Raw console output:
> > https://syzkaller.appspot.com/x/log.txt?id=5099510896263168
> > Kernel config:
> > https://syzkaller.appspot.com/x/.config?id=-1223000601505858474
> > compiler: gcc (GCC) 8.0.1 20180301 (experimental)
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+3c43eecd7745a5ce1640@syzkaller.appspotmail.com
> > It will help syzbot understand when the bug is fixed. See footer for
> > details.
> > If you forward the report, please keep this part and the footer.
> 
> #syz dup: possible deadlock in rtnl_lock (5)

	Yes, patch is now in the "nf" tree, so all these
lockups around start_sync_thread should be resolved soon...

> > IPVS: sync thread started: state = BACKUP, mcast_ifn = lo, syncid = 0, id =
> > 0
> > IPVS: stopping backup sync thread 4546 ...
> >
> > ============================================
> > IPVS: stopping backup sync thread 4559 ...
> > WARNING: possible recursive locking detected

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)
From: Julian Anastasov @ 2018-04-11 19:37 UTC (permalink / raw)
  To: Stephen Suryaputra; +Cc: netdev
In-Reply-To: <1523415335-17154-1-git-send-email-ssuryaextr@gmail.com>


	Hello,

On Tue, 10 Apr 2018, Stephen Suryaputra wrote:

> This is enhanced from the proposed patch by Igor Maravic in 2011 to
> support per interface IPv4 stats. The enhancement is mainly adding a
> kernel configuration option CONFIG_IP_IFSTATS_TABLE.
> 
> Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
> ---

...

> diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
> index b54b948..bb9be11 100644
> --- a/net/ipv4/ip_forward.c
> +++ b/net/ipv4/ip_forward.c
> @@ -66,8 +66,8 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
>  {
>  	struct ip_options *opt	= &(IPCB(skb)->opt);
>  
> -	__IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
> -	__IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len);
> +	__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTFORWDATAGRAMS);
> +	__IP_ADD_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTOCTETS, skb->len);
>  
>  	if (unlikely(opt->optlen))
>  		ip_forward_options(skb);
> @@ -121,7 +121,7 @@ int ip_forward(struct sk_buff *skb)
>  	IPCB(skb)->flags |= IPSKB_FORWARDED;
>  	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
>  	if (ip_exceeds_mtu(skb, mtu)) {
> -		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
> +		IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_FRAGFAILS);
>  		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
>  			  htonl(mtu));
>  		goto drop;
> @@ -158,7 +158,7 @@ int ip_forward(struct sk_buff *skb)
>  
>  too_many_hops:
>  	/* Tell the sender its packet died... */
> -	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
> +	__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS);

	May be skb->dev if we want to account it to the
input device.

>  	icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
>  drop:
>  	kfree_skb(skb);

...

> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 4527921..32bd3af 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -286,7 +286,7 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
>  	{
>  		if (ip_hdr(skb)->ttl <= 1) {
>  			/* Tell the sender its packet died... */
> -			__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
> +			__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS);

	At this point, skb_dst(skb) can be:

- input route at LOCAL_IN => dst->dev is "lo", skb->dev = input_device
- output route at LOCAL_OUT => dst->dev is output_device, skb->dev = NULL

	We should see this error on LOCAL_IN but better to be
safe: use 'skb->dev ? : skb_dst(skb)->dev' instead of just
'skb_dst(skb)->dev'.

>  			icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
>  			return false;
>  		}

	The patch probably has other errors, for example,
using rt->dst.dev (lo) when rt->dst.error != 0 in ip_error,
may be 'dev' should be used instead...

Regards

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Eric W. Biederman @ 2018-04-11 19:16 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <20180411185715.GA15862@gmail.com>

Christian Brauner <christian.brauner@canonical.com> writes:

> On Wed, Apr 11, 2018 at 01:37:18PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>> 
>> > On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner <christian.brauner@canonical.com> writes:
>> >> > Yeah, agreed.
>> >> > But I think the patch is not complete. To guarantee that no non-initial
>> >> > user namespace actually receives uevents we need to:
>> >> > 1. only sent uevents to uevent sockets that are located in network
>> >> >    namespaces that are owned by init_user_ns
>> >> > 2. filter uevents that are sent to sockets in mc_list that have opened a
>> >> >    uevent socket that is owned by init_user_ns *from* a
>> >> >    non-init_user_ns
>> >> >
>> >> > We account for 1. by only recording uevent sockets in the global uevent
>> >> > socket list who are owned by init_user_ns.
>> >> > But to account for 2. we need to filter by the user namespace who owns
>> >> > the socket in mc_list. So in addition to that we also need to slightly
>> >> > change the filter logic in kobj_bcast_filter() I think:
>> >> >
>> >> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >> > index 22a2c1a98b8f..064d7d29ace5 100644
>> >> > --- a/lib/kobject_uevent.c
>> >> > +++ b/lib/kobject_uevent.c
>> >> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >> >  		return sock_ns != ns;
>> >> >  	}
>> >> >  
>> >> > -	return 0;
>> >> > + 	/* Check if socket was opened from non-initial user namespace. */
>> >> > + 	return sk_user_ns(dsk) != &init_user_ns;
>> >> >  }
>> >> >  #endif
>> >> >  
>> >> >
>> >> > But correct me if I'm wrong.
>> >> 
>> >> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
>> >> permissions and an explicit opt-in to receiving packets from multiple
>> >> network namespaces.
>> >
>> > I don't think that's what I'm talking about unless that is somehow the
>> > default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
>> > doing
>> >
>> > unshare -U --map-root
>> >
>> > then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
>> > uevents. Imho, this should not be possible because I'm in a
>> > non-init_user_ns. But currently I'm able to - even with the patch to
>> > come - since the uevent socket in the kernel was created when init_net
>> > was created and hence is *owned* by the init_user_ns which means it is
>> > in the list of uevent sockets. Here's a demo of what I mean:
>> >
>> > https://asciinema.org/a/175632
>> 
>> Why do you care about this case?
>
> It's not so much that I care about this case since any workload that
> wants to run a separate udevd will have to unshare the network namespace
> and the user namespace for it to make complete sense.
> What I do care about is that the two of us are absolutely in the clear
> about what semantics we are going to expose to userspace and it seems
> that we were talking past each other wrt to this "corner case".
> For userspace, it needs to be very clear that the intention is to filter
> by *owning user namespace of the network namespace a given task resides
> in* and not by user namespace of the task per se. This is what this
> corner case basically shows, I think.

If this is just a clarification of semantics then yes this is a
productive question.  I almost agree with your definition above.

I would make the definition very simple.  Uevents will not be broadcast
via netlink in a network namespace where net->user_ns != &init_user_ns,
with the exception of uevents for network devices in that network
namespace.

The existing filtering by the sending uid and verifying that it is uid 0
gives a little more room to filter if we want (as udev & friends will
ignore the uevent), but I don't see the point.

Eric

^ permalink raw reply

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Samudrala, Sridhar @ 2018-04-11 19:13 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh
In-Reply-To: <20180411155127.GQ2028@nanopsycho>

On 4/11/2018 8:51 AM, Jiri Pirko wrote:
> Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> This provides a generic interface for paravirtual drivers to listen
>> for netdev register/unregister/link change events from pci ethernet
>> devices with the same MAC and takeover their datapath. The notifier and
>> event handling code is based on the existing netvsc implementation.
>>
>> It exposes 2 sets of interfaces to the paravirtual drivers.
>> 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> master netdev is created. The paravirtual driver registers each bypass
>> instance along with a set of ops to manage the slave events.
>>      bypass_master_register()
>>      bypass_master_unregister()
>> 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> the bypass module provides interfaces to create/destroy additional master
>> netdev and all the slave events are managed internally.
>>       bypass_master_create()
>>       bypass_master_destroy()
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>> include/linux/netdevice.h |  14 +
>> include/net/bypass.h      |  96 ++++++
>> net/Kconfig               |  18 +
>> net/core/Makefile         |   1 +
>> net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> 5 files changed, 973 insertions(+)
>> create mode 100644 include/net/bypass.h
>> create mode 100644 net/core/bypass.c
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index cf44503ea81a..587293728f70 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> 	IFF_PHONY_HEADROOM		= 1<<24,
>> 	IFF_MACSEC			= 1<<25,
>> 	IFF_NO_RX_HANDLER		= 1<<26,
>> +	IFF_BYPASS			= 1 << 27,
>> +	IFF_BYPASS_SLAVE		= 1 << 28,
> I wonder, why you don't follow the existing coding style... Also, please
> add these to into the comment above.

To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
to the existing coding style to be consistent.

>
>
>> };
>>
>> #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> #define IFF_MACSEC			IFF_MACSEC
>> #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> +#define IFF_BYPASS			IFF_BYPASS
>> +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>>
>> /**
>>   *	struct net_device - The DEVICE structure.
>> @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> }
>>
>> +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> +{
>> +	return dev->priv_flags & IFF_BYPASS;
>> +}
>> +
>> +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> +{
>> +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> +}
>> +
>> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> static inline void netif_keep_dst(struct net_device *dev)
>> {
>> diff --git a/include/net/bypass.h b/include/net/bypass.h
>> new file mode 100644
>> index 000000000000..86b02cb894cf
>> --- /dev/null
>> +++ b/include/net/bypass.h
>> @@ -0,0 +1,96 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +#ifndef _NET_BYPASS_H
>> +#define _NET_BYPASS_H
>> +
>> +#include <linux/netdevice.h>
>> +
>> +struct bypass_ops {
>> +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> +				  struct net_device *bypass_netdev);
>> +	int (*slave_join)(struct net_device *slave_netdev,
>> +			  struct net_device *bypass_netdev);
>> +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> +				    struct net_device *bypass_netdev);
>> +	int (*slave_release)(struct net_device *slave_netdev,
>> +			     struct net_device *bypass_netdev);
>> +	int (*slave_link_change)(struct net_device *slave_netdev,
>> +				 struct net_device *bypass_netdev);
>> +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> +};
>> +
>> +struct bypass_master {
>> +	struct list_head list;
>> +	struct net_device __rcu *bypass_netdev;
>> +	struct bypass_ops __rcu *ops;
>> +};
>> +
>> +/* bypass state */
>> +struct bypass_info {
>> +	/* passthru netdev with same MAC */
>> +	struct net_device __rcu *active_netdev;
> You still use "active"/"backup" names which is highly misleading as
> it has completely different meaning that in bond for example.
> I noted that in my previous review already. Please change it.

I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
matches with the BACKUP feature bit we are adding to virtio_net.

With regards to alternate names for 'active', you suggested 'stolen', but i
am not too happy with it.
netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'



>
>
>> +
>> +	/* virtio_net netdev */
>> +	struct net_device __rcu *backup_netdev;
>> +
>> +	/* active netdev stats */
>> +	struct rtnl_link_stats64 active_stats;
>> +
>> +	/* backup netdev stats */
>> +	struct rtnl_link_stats64 backup_stats;
>> +
>> +	/* aggregated stats */
>> +	struct rtnl_link_stats64 bypass_stats;
>> +
>> +	/* spinlock while updating stats */
>> +	spinlock_t stats_lock;
>> +};
>> +
>> +#if IS_ENABLED(CONFIG_NET_BYPASS)
>> +
>> +int bypass_master_create(struct net_device *backup_netdev,
>> +			 struct bypass_master **pbypass_master);
>> +void bypass_master_destroy(struct bypass_master *bypass_master);
>> +
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> +			   struct bypass_master **pbypass_master);
>> +void bypass_master_unregister(struct bypass_master *bypass_master);
>> +
>> +int bypass_slave_unregister(struct net_device *slave_netdev);
>> +
>> +#else
>> +
>> +static inline
>> +int bypass_master_create(struct net_device *backup_netdev,
>> +			 struct bypass_master **pbypass_master);
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline
>> +void bypass_master_destroy(struct bypass_master *bypass_master)
>> +{
>> +}
>> +
>> +static inline
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> +			   struct pbypass_master **pbypass_master);
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline
>> +void bypass_master_unregister(struct bypass_master *bypass_master)
>> +{
>> +}
>> +
>> +static inline
>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>> +{
>> +	return 0;
>> +}
>> +
>> +#endif
>> +
>> +#endif /* _NET_BYPASS_H */
>> diff --git a/net/Kconfig b/net/Kconfig
>> index 0428f12c25c2..994445f4a96a 100644
>> --- a/net/Kconfig
>> +++ b/net/Kconfig
>> @@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
>> 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
>> 	  devlink is a loadable module and the driver using it is built-in.
>>
>> +config NET_BYPASS
>> +	tristate "Bypass interface"
>> +	---help---
>> +	  This provides a generic interface for paravirtual drivers to listen
>> +	  for netdev register/unregister/link change events from pci ethernet
>> +	  devices with the same MAC and takeover their datapath. This also
>> +	  enables live migration of a VM with direct attached VF by failing
>> +	  over to the paravirtual datapath when the VF is unplugged.
>> +
>> +config MAY_USE_BYPASS
>> +	tristate
>> +	default m if NET_BYPASS=m
>> +	default y if NET_BYPASS=y || NET_BYPASS=n
>> +	help
>> +	  Drivers using the bypass infrastructure should have a dependency
>> +	  on MAY_USE_BYPASS to ensure they do not cause link errors when
>> +	  bypass is a loadable module and the driver using it is built-in.
>> +
>> endif   # if NET
>>
>> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index 6dbbba8c57ae..a9727ed1c8fc 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
>> obj-$(CONFIG_HWBM) += hwbm.o
>> obj-$(CONFIG_NET_DEVLINK) += devlink.o
>> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>> +obj-$(CONFIG_NET_BYPASS) += bypass.o
>> diff --git a/net/core/bypass.c b/net/core/bypass.c
>> new file mode 100644
>> index 000000000000..b5b9cb554c3f
>> --- /dev/null
>> +++ b/net/core/bypass.c
>> @@ -0,0 +1,844 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2018, Intel Corporation. */
>> +
>> +/* A common module to handle registrations and notifications for paravirtual
>> + * drivers to enable accelerated datapath and support VF live migration.
>> + *
>> + * The notifier and event handling code is based on netvsc driver.
>> + */
>> +
>> +#include <linux/netdevice.h>
>> +#include <linux/etherdevice.h>
>> +#include <linux/ethtool.h>
>> +#include <linux/module.h>
>> +#include <linux/slab.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/netpoll.h>
>> +#include <linux/rtnetlink.h>
>> +#include <linux/if_vlan.h>
>> +#include <linux/pci.h>
>> +#include <net/sch_generic.h>
>> +#include <uapi/linux/if_arp.h>
>> +#include <net/bypass.h>
>> +
>> +static LIST_HEAD(bypass_master_list);
>> +static DEFINE_SPINLOCK(bypass_lock);
>> +
>> +static int bypass_slave_pre_register(struct net_device *slave_netdev,
>> +				     struct net_device *bypass_netdev,
>> +				     struct bypass_ops *bypass_ops)
>> +{
>> +	struct bypass_info *bi;
>> +	bool backup;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_pre_register)
>> +			return -EINVAL;
>> +
>> +		return bypass_ops->slave_pre_register(slave_netdev,
>> +						      bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> +	if (backup ? rtnl_dereference(bi->backup_netdev) :
>> +			rtnl_dereference(bi->active_netdev)) {
>> +		netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
>> +			   slave_netdev->name, backup ? "backup" : "active");
>> +		return -EEXIST;
>> +	}
>> +
>> +	/* Avoid non pci devices as active netdev */
>> +	if (!backup && (!slave_netdev->dev.parent ||
>> +			!dev_is_pci(slave_netdev->dev.parent)))
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int bypass_slave_join(struct net_device *slave_netdev,
>> +			     struct net_device *bypass_netdev,
>> +			     struct bypass_ops *bypass_ops)
>> +{
>> +	struct bypass_info *bi;
>> +	bool backup;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_join)
>> +			return -EINVAL;
>> +
>> +		return bypass_ops->slave_join(slave_netdev, bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>> +
>> +	dev_hold(slave_netdev);
>> +
>> +	if (backup) {
>> +		rcu_assign_pointer(bi->backup_netdev, slave_netdev);
>> +		dev_get_stats(bi->backup_netdev, &bi->backup_stats);
>> +	} else {
>> +		rcu_assign_pointer(bi->active_netdev, slave_netdev);
>> +		dev_get_stats(bi->active_netdev, &bi->active_stats);
>> +		bypass_netdev->min_mtu = slave_netdev->min_mtu;
>> +		bypass_netdev->max_mtu = slave_netdev->max_mtu;
>> +	}
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s joined\n",
>> +		    slave_netdev->name);
>> +
>> +	return 0;
>> +}
>> +
>> +/* Called when slave dev is injecting data into network stack.
>> + * Change the associated network device from lower dev to virtio.
>> + * note: already called with rcu_read_lock
>> + */
>> +static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
>> +{
>> +	struct sk_buff *skb = *pskb;
>> +	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>> +
>> +	skb->dev = ndev;
>> +
>> +	return RX_HANDLER_ANOTHER;
>> +}
>> +
>> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> +						  struct bypass_ops **ops)
>> +{
>> +	struct bypass_master *bypass_master;
>> +	struct net_device *bypass_netdev;
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
> As I wrote the last time, you don't need this list, spinlock.
> You can do just something like:
>          for_each_net(net) {
>                  for_each_netdev(net, dev) {
> 			if (netif_is_bypass_master(dev)) {

This function returns the upper netdev as well as the ops associated
with that netdev.
bypass_master_list is a list of 'struct bypass_master' that associates
'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
We need 'ops' only to support the 2 netdev model of netvsc. ops will be
NULL for 3-netdev model.


>
>
>
>
>> +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> +			*ops = rcu_dereference(bypass_master->ops);
> I don't see how rcu_dereference is ok here.
> 1) I don't see rcu_read_lock taken
> 2) Looks like bypass_master->ops has the same value across the whole
>     existence.

We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
Yes. ops doesn't change.

>
>
>> +			spin_unlock(&bypass_lock);
>> +			return bypass_netdev;
>> +		}
>> +	}
>> +	spin_unlock(&bypass_lock);
>> +	return NULL;
>> +}
>> +
>> +static int bypass_slave_register(struct net_device *slave_netdev)
>> +{
>> +	struct net_device *bypass_netdev;
>> +	struct bypass_ops *bypass_ops;
>> +	int ret, orig_mtu;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> +						&bypass_ops);
> For master, could you use word "master" in the variables so it is clear?
> Also, "dev" is fine instead of "netdev".
> Something like "bpmaster_dev"

bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
I can change all _netdev suffixes to _dev to make the names shorter.


>
>
>> +	if (!bypass_netdev)
>> +		goto done;
>> +
>> +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> +					bypass_ops);
>> +	if (ret != 0)
> 	Just "if (ret)" will do. You have this on more places.

OK.


>
>
>> +		goto done;
>> +
>> +	ret = netdev_rx_handler_register(slave_netdev,
>> +					 bypass_ops ? bypass_ops->handle_frame :
>> +					 bypass_handle_frame, bypass_netdev);
>> +	if (ret != 0) {
>> +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> +			   ret);
>> +		goto done;
>> +	}
>> +
>> +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> +	if (ret != 0) {
>> +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> +			   bypass_netdev->name, ret);
>> +		goto upper_link_failed;
>> +	}
>> +
>> +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> +
>> +	if (netif_running(bypass_netdev)) {
>> +		ret = dev_open(slave_netdev);
>> +		if (ret && (ret != -EBUSY)) {
>> +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> +				   slave_netdev->name, ret);
>> +			goto err_interface_up;
>> +		}
>> +	}
>> +
>> +	/* Align MTU of slave with master */
>> +	orig_mtu = slave_netdev->mtu;
>> +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> +	if (ret != 0) {
>> +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> +			   slave_netdev->name, bypass_netdev->mtu);
>> +		goto err_set_mtu;
>> +	}
>> +
>> +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> +	if (ret != 0)
>> +		goto err_join;
>> +
>> +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> +		    slave_netdev->name);
>> +
>> +	goto done;
>> +
>> +err_join:
>> +	dev_set_mtu(slave_netdev, orig_mtu);
>> +err_set_mtu:
>> +	dev_close(slave_netdev);
>> +err_interface_up:
>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> +upper_link_failed:
>> +	netdev_rx_handler_unregister(slave_netdev);
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> +				       struct net_device *bypass_netdev,
>> +				       struct bypass_ops *bypass_ops)
>> +{
>> +	struct net_device *backup_netdev, *active_netdev;
>> +	struct bypass_info *bi;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_pre_unregister)
>> +			return -EINVAL;
>> +
>> +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> +							bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int bypass_slave_release(struct net_device *slave_netdev,
>> +				struct net_device *bypass_netdev,
>> +				struct bypass_ops *bypass_ops)
>> +{
>> +	struct net_device *backup_netdev, *active_netdev;
>> +	struct bypass_info *bi;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_release)
>> +			return -EINVAL;
> I think it would be good to make the API to the driver more strict and
> have a separate set of ops for "active" and "backup" netdevices.
> That should stop people thinking about extending this to more slaves in
> the future.

We have checks in slave_pre_register() that allows only 1 'backup' and 1
'active' slave.


>
>
>
>> +
>> +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> +	}
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> +	if (slave_netdev == backup_netdev) {
>> +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> +	} else {
>> +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> +		if (backup_netdev) {
>> +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> +		}
>> +	}
>> +
>> +	dev_put(slave_netdev);
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> +		    slave_netdev->name);
>> +
>> +	return 0;
>> +}
>> +
>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>> +{
>> +	struct net_device *bypass_netdev;
>> +	struct bypass_ops *bypass_ops;
>> +	int ret;
>> +
>> +	if (!netif_is_bypass_slave(slave_netdev))
>> +		goto done;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> +						&bypass_ops);
>> +	if (!bypass_netdev)
>> +		goto done;
>> +
>> +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> +					  bypass_ops);
>> +	if (ret != 0)
>> +		goto done;
>> +
>> +	netdev_rx_handler_unregister(slave_netdev);
>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> +
>> +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> +
>> +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> +		    slave_netdev->name);
>> +
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> +
>> +static bool bypass_xmit_ready(struct net_device *dev)
>> +{
>> +	return netif_running(dev) && netif_carrier_ok(dev);
>> +}
>> +
>> +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> +{
>> +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> +	struct bypass_ops *bypass_ops;
>> +	struct bypass_info *bi;
>> +
>> +	if (!netif_is_bypass_slave(slave_netdev))
>> +		goto done;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> +						&bypass_ops);
>> +	if (!bypass_netdev)
>> +		goto done;
>> +
>> +	if (bypass_ops) {
>> +		if (!bypass_ops->slave_link_change)
>> +			goto done;
>> +
>> +		return bypass_ops->slave_link_change(slave_netdev,
>> +						     bypass_netdev);
>> +	}
>> +
>> +	if (!netif_running(bypass_netdev))
>> +		return 0;
>> +
>> +	bi = netdev_priv(bypass_netdev);
>> +
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +
>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> +		goto done;
> You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
> above is enough.

I think we need this check to not allow events from a slave that is not
attached to this master but has the same MAC.

>
>
>> +
>> +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> +		netif_carrier_on(bypass_netdev);
>> +		netif_tx_wake_all_queues(bypass_netdev);
>> +	} else {
>> +		netif_carrier_off(bypass_netdev);
>> +		netif_tx_stop_all_queues(bypass_netdev);
>> +	}
>> +
>> +done:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static bool bypass_validate_event_dev(struct net_device *dev)
>> +{
>> +	/* Skip parent events */
>> +	if (netif_is_bypass_master(dev))
>> +		return false;
>> +
>> +	/* Avoid non-Ethernet type devices */
>> +	if (dev->type != ARPHRD_ETHER)
>> +		return false;
>> +
>> +	/* Avoid Vlan dev with same MAC registering as VF */
>> +	if (is_vlan_dev(dev))
>> +		return false;
>> +
>> +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
> Yeah, this is certainly incorrect. One thing is, you should be using the
> helpers netif_is_bond_master().
> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>
> You need to do it not by blacklisting, but with whitelisting. You need
> to whitelist VF devices. My port flavours patchset might help with this.

May be i can use netdev_has_lower_dev() helper to make sure that the slave
device is not an upper dev.
Can you point to your port flavours patchset? Is it upstream?

>
>
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static int
>> +bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>> +{
>> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> +
>> +	if (!bypass_validate_event_dev(event_dev))
>> +		return NOTIFY_DONE;
>> +
>> +	switch (event) {
>> +	case NETDEV_REGISTER:
>> +		return bypass_slave_register(event_dev);
>> +	case NETDEV_UNREGISTER:
>> +		return bypass_slave_unregister(event_dev);
>> +	case NETDEV_UP:
>> +	case NETDEV_DOWN:
>> +	case NETDEV_CHANGE:
>> +		return bypass_slave_link_change(event_dev);
>> +	default:
>> +		return NOTIFY_DONE;
>> +	}
>> +}
>> +
>> +static struct notifier_block bypass_notifier = {
>> +	.notifier_call = bypass_event,
>> +};
>> +
>> +int bypass_open(struct net_device *dev)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *active_netdev, *backup_netdev;
>> +	int err;
>> +
>> +	netif_carrier_off(dev);
>> +	netif_tx_wake_all_queues(dev);
>> +
>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>> +	if (active_netdev) {
>> +		err = dev_open(active_netdev);
>> +		if (err)
>> +			goto err_active_open;
>> +	}
>> +
>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> +	if (backup_netdev) {
>> +		err = dev_open(backup_netdev);
>> +		if (err)
>> +			goto err_backup_open;
>> +	}
>> +
>> +	return 0;
>> +
>> +err_backup_open:
>> +	dev_close(active_netdev);
>> +err_active_open:
>> +	netif_tx_disable(dev);
>> +	return err;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_open);
>> +
>> +int bypass_close(struct net_device *dev)
>> +{
>> +	struct bypass_info *vi = netdev_priv(dev);
> This should be probably "bi"

Yes.


>
>
>> +	struct net_device *slave_netdev;
>> +
>> +	netif_tx_disable(dev);
>> +
>> +	slave_netdev = rtnl_dereference(vi->active_netdev);
>> +	if (slave_netdev)
>> +		dev_close(slave_netdev);
>> +
>> +	slave_netdev = rtnl_dereference(vi->backup_netdev);
>> +	if (slave_netdev)
>> +		dev_close(slave_netdev);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_close);
>> +
>> +static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> +	atomic_long_inc(&dev->tx_dropped);
>> +	dev_kfree_skb_any(skb);
>> +	return NETDEV_TX_OK;
>> +}
>> +
>> +netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
> If you rename the other variable to "bpmaster_dev", it would be nice to
> rename this to bpinfo or something more descriptive. "bi" is too short
> to know what that is right away.

Will rename bypass_netdev to bypass_dev. bypass indicates that it is
an upper master dev.


>
>
>> +	struct net_device *xmit_dev;
> Don't mix "dev" and "netdev" in one .c file. Just use "dev" for all.

OK.


>
>
>
>> +
>> +	/* Try xmit via active netdev followed by backup netdev */
>> +	xmit_dev = rcu_dereference_bh(bi->active_netdev);
>> +	if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
>> +		xmit_dev = rcu_dereference_bh(bi->backup_netdev);
>> +		if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
>> +			return bypass_drop_xmit(skb, dev);
>> +	}
>> +
>> +	skb->dev = xmit_dev;
>> +	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>> +
>> +	return dev_queue_xmit(skb);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_start_xmit);
>> +
>> +u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
>> +			void *accel_priv, select_queue_fallback_t fallback)
>> +{
>> +	/* This helper function exists to help dev_pick_tx get the correct
>> +	 * destination queue.  Using a helper function skips a call to
>> +	 * skb_tx_hash and will put the skbs in the queue we expect on their
>> +	 * way down to the bonding driver.
>> +	 */
>> +	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>> +
>> +	/* Save the original txq to restore before passing to the driver */
>> +	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>> +
>> +	if (unlikely(txq >= dev->real_num_tx_queues)) {
>> +		do {
>> +			txq -= dev->real_num_tx_queues;
>> +		} while (txq >= dev->real_num_tx_queues);
>> +	}
>> +
>> +	return txq;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_select_queue);
>> +
>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>> + * that some drivers can provide 32bit values only.
>> + */
>> +static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
>> +			      const struct rtnl_link_stats64 *_new,
>> +			      const struct rtnl_link_stats64 *_old)
>> +{
>> +	const u64 *new = (const u64 *)_new;
>> +	const u64 *old = (const u64 *)_old;
>> +	u64 *res = (u64 *)_res;
>> +	int i;
>> +
>> +	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>> +		u64 nv = new[i];
>> +		u64 ov = old[i];
>> +		s64 delta = nv - ov;
>> +
>> +		/* detects if this particular field is 32bit only */
>> +		if (((nv | ov) >> 32) == 0)
>> +			delta = (s64)(s32)((u32)nv - (u32)ov);
>> +
>> +		/* filter anomalies, some drivers reset their stats
>> +		 * at down/up events.
>> +		 */
>> +		if (delta > 0)
>> +			res[i] += delta;
>> +	}
>> +}
>> +
>> +void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
> You can WARN_ON and return in case the dev is not bypass master, just
> to catch buggy drivers. Same with other helpers.

I can make this static and not export this helper as well as all
bypass_netdev ops.

>
>
>> +	const struct rtnl_link_stats64 *new;
>> +	struct rtnl_link_stats64 temp;
>> +	struct net_device *slave_netdev;
>> +
>> +	spin_lock(&bi->stats_lock);
>> +	memcpy(stats, &bi->bypass_stats, sizeof(*stats));
>> +
>> +	rcu_read_lock();
>> +
>> +	slave_netdev = rcu_dereference(bi->active_netdev);
>> +	if (slave_netdev) {
>> +		new = dev_get_stats(slave_netdev, &temp);
>> +		bypass_fold_stats(stats, new, &bi->active_stats);
>> +		memcpy(&bi->active_stats, new, sizeof(*new));
>> +	}
>> +
>> +	slave_netdev = rcu_dereference(bi->backup_netdev);
>> +	if (slave_netdev) {
>> +		new = dev_get_stats(slave_netdev, &temp);
>> +		bypass_fold_stats(stats, new, &bi->backup_stats);
>> +		memcpy(&bi->backup_stats, new, sizeof(*new));
>> +	}
>> +
>> +	rcu_read_unlock();
>> +
>> +	memcpy(&bi->bypass_stats, stats, sizeof(*stats));
>> +	spin_unlock(&bi->stats_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_get_stats);
>> +
>> +int bypass_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *active_netdev, *backup_netdev;
>> +	int ret = 0;
> Pointless initialization.
>
>
>> +
>> +	active_netdev = rcu_dereference(bi->active_netdev);
>> +	if (active_netdev) {
>> +		ret = dev_set_mtu(active_netdev, new_mtu);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	backup_netdev = rcu_dereference(bi->backup_netdev);
>> +	if (backup_netdev) {
>> +		ret = dev_set_mtu(backup_netdev, new_mtu);
>> +		if (ret) {
>> +			dev_set_mtu(active_netdev, dev->mtu);
>> +			return ret;
>> +		}
>> +	}
>> +
>> +	dev->mtu = new_mtu;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_change_mtu);
>> +
>> +void bypass_set_rx_mode(struct net_device *dev)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *slave_netdev;
>> +
>> +	rcu_read_lock();
>> +
>> +	slave_netdev = rcu_dereference(bi->active_netdev);
>> +	if (slave_netdev) {
>> +		dev_uc_sync_multiple(slave_netdev, dev);
>> +		dev_mc_sync_multiple(slave_netdev, dev);
>> +	}
>> +
>> +	slave_netdev = rcu_dereference(bi->backup_netdev);
>> +	if (slave_netdev) {
>> +		dev_uc_sync_multiple(slave_netdev, dev);
>> +		dev_mc_sync_multiple(slave_netdev, dev);
>> +	}
>> +
>> +	rcu_read_unlock();
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
>> +
>> +static const struct net_device_ops bypass_netdev_ops = {
>> +	.ndo_open		= bypass_open,
>> +	.ndo_stop		= bypass_close,
>> +	.ndo_start_xmit		= bypass_start_xmit,
>> +	.ndo_select_queue	= bypass_select_queue,
>> +	.ndo_get_stats64	= bypass_get_stats,
>> +	.ndo_change_mtu		= bypass_change_mtu,
>> +	.ndo_set_rx_mode	= bypass_set_rx_mode,
>> +	.ndo_validate_addr	= eth_validate_addr,
>> +	.ndo_features_check	= passthru_features_check,
>> +};
>> +
>> +#define BYPASS_DRV_NAME "bypass"
>> +#define BYPASS_DRV_VERSION "0.1"
>> +
>> +static void bypass_ethtool_get_drvinfo(struct net_device *dev,
>> +				       struct ethtool_drvinfo *drvinfo)
>> +{
>> +	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>> +	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>> +}
>> +
>> +int bypass_ethtool_get_link_ksettings(struct net_device *dev,
>> +				      struct ethtool_link_ksettings *cmd)
>> +{
>> +	struct bypass_info *bi = netdev_priv(dev);
>> +	struct net_device *slave_netdev;
>> +
>> +	slave_netdev = rtnl_dereference(bi->active_netdev);
>> +	if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> +		slave_netdev = rtnl_dereference(bi->backup_netdev);
>> +		if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>> +			cmd->base.duplex = DUPLEX_UNKNOWN;
>> +			cmd->base.port = PORT_OTHER;
>> +			cmd->base.speed = SPEED_UNKNOWN;
>> +
>> +			return 0;
>> +		}
>> +	}
>> +
>> +	return __ethtool_get_link_ksettings(slave_netdev, cmd);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
>> +
>> +static const struct ethtool_ops bypass_ethtool_ops = {
>> +	.get_drvinfo            = bypass_ethtool_get_drvinfo,
>> +	.get_link               = ethtool_op_get_link,
>> +	.get_link_ksettings     = bypass_ethtool_get_link_ksettings,
>> +};
>> +
>> +static void bypass_register_existing_slave(struct net_device *bypass_netdev)
>> +{
>> +	struct net *net = dev_net(bypass_netdev);
>> +	struct net_device *dev;
>> +
>> +	rtnl_lock();
>> +	for_each_netdev(net, dev) {
>> +		if (dev == bypass_netdev)
>> +			continue;
>> +		if (!bypass_validate_event_dev(dev))
>> +			continue;
>> +		if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>> +			bypass_slave_register(dev);
>> +	}
>> +	rtnl_unlock();
>> +}
>> +
>> +int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>> +			   struct bypass_master **pbypass_master)
>> +{
>> +	struct bypass_master *bypass_master;
>> +
>> +	bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
>> +	if (!bypass_master)
>> +		return -ENOMEM;
>> +
>> +	rcu_assign_pointer(bypass_master->ops, ops);
>> +	dev_hold(dev);
>> +	dev->priv_flags |= IFF_BYPASS;
>> +	rcu_assign_pointer(bypass_master->bypass_netdev, dev);
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_add_tail(&bypass_master->list, &bypass_master_list);
>> +	spin_unlock(&bypass_lock);
>> +
>> +	bypass_register_existing_slave(dev);
>> +
>> +	*pbypass_master = bypass_master;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_register);
>> +
>> +void bypass_master_unregister(struct bypass_master *bypass_master)
>> +{
>> +	struct net_device *bypass_netdev;
>> +
>> +	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> +
>> +	bypass_netdev->priv_flags &= ~IFF_BYPASS;
>> +	dev_put(bypass_netdev);
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_del(&bypass_master->list);
>> +	spin_unlock(&bypass_lock);
>> +
>> +	kfree(bypass_master);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_unregister);
>> +
>> +int bypass_master_create(struct net_device *backup_netdev,
>> +			 struct bypass_master **pbypass_master)
>> +{
>> +	struct device *dev = backup_netdev->dev.parent;
>> +	struct net_device *bypass_netdev;
>> +	int err;
>> +
>> +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> +	 * that most devices being bonded won't have too many queues.
>> +	 */
>> +	bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
>> +	if (!bypass_netdev) {
>> +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> +		return -ENOMEM;
>> +	}
>> +
>> +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> +	SET_NETDEV_DEV(bypass_netdev, dev);
>> +
>> +	bypass_netdev->netdev_ops = &bypass_netdev_ops;
>> +	bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
>> +
>> +	/* Initialize the device options */
>> +	bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>> +	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>> +				       IFF_TX_SKB_SHARING);
>> +
>> +	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
>> +	bypass_netdev->features |= NETIF_F_LLTX;
>> +
>> +	/* Don't allow bypass devices to change network namespaces. */
>> +	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>> +
>> +	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>> +				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>> +				     NETIF_F_HIGHDMA | NETIF_F_LRO;
>> +
>> +	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>> +	bypass_netdev->features |= bypass_netdev->hw_features;
>> +
>> +	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>> +	       bypass_netdev->addr_len);
>> +
>> +	bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> +	bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> +
>> +	err = register_netdev(bypass_netdev);
>> +	if (err < 0) {
>> +		dev_err(dev, "Unable to register bypass_netdev!\n");
>> +		goto err_register_netdev;
>> +	}
>> +
>> +	netif_carrier_off(bypass_netdev);
>> +
>> +	err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
>> +	if (err < 0)
> just "if (err)" would do.

OK

>
>
>> +		goto err_bypass;
>> +
>> +	return 0;
>> +
>> +err_bypass:
>> +	unregister_netdev(bypass_netdev);
>> +err_register_netdev:
>> +	free_netdev(bypass_netdev);
>> +
>> +	return err;
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_create);
>> +
>> +void bypass_master_destroy(struct bypass_master *bypass_master)
>> +{
>> +	struct net_device *bypass_netdev;
>> +	struct net_device *slave_netdev;
>> +	struct bypass_info *bi;
>> +
>> +	if (!bypass_master)
>> +		return;
>> +
>> +	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> +	bi = netdev_priv(bypass_netdev);
>> +
>> +	netif_device_detach(bypass_netdev);
>> +
>> +	rtnl_lock();
>> +
>> +	slave_netdev = rtnl_dereference(bi->active_netdev);
>> +	if (slave_netdev)
>> +		bypass_slave_unregister(slave_netdev);
>> +
>> +	slave_netdev = rtnl_dereference(bi->backup_netdev);
>> +	if (slave_netdev)
>> +		bypass_slave_unregister(slave_netdev);
>> +
>> +	bypass_master_unregister(bypass_master);
>> +
>> +	unregister_netdevice(bypass_netdev);
>> +
>> +	rtnl_unlock();
>> +
>> +	free_netdev(bypass_netdev);
>> +}
>> +EXPORT_SYMBOL_GPL(bypass_master_destroy);
>> +
>> +static __init int
>> +bypass_init(void)
>> +{
>> +	register_netdevice_notifier(&bypass_notifier);
>> +
>> +	return 0;
>> +}
>> +module_init(bypass_init);
>> +
>> +static __exit
>> +void bypass_exit(void)
>> +{
>> +	unregister_netdevice_notifier(&bypass_notifier);
>> +}
>> +module_exit(bypass_exit);
>> +
>> +MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>> +MODULE_LICENSE("GPL v2");
>> -- 
>> 2.14.3
>>

^ permalink raw reply

* [PULL] virtio: feature pull
From: Michael S. Tsirkin @ 2018-04-11 19:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kvm, virtualization, netdev, linux-kernel, jonathan.helman, mst

The following changes since commit dc32bb678e103afbcfa4d814489af0566307f528:

  vhost: add vsock compat ioctl (2018-03-20 03:17:42 +0200)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 6c64fe7f2adcee21d7c4247f1ec021fd18428fc4:

  virtio_balloon: export hugetlb page allocation counts (2018-04-10 21:23:55 +0300)

----------------------------------------------------------------
virtio: feature

This adds reporting hugepage stats to virtio-balloon.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Jonathan Helman (1):
      virtio_balloon: export hugetlb page allocation counts

 drivers/virtio/virtio_balloon.c     | 6 ++++++
 include/uapi/linux/virtio_balloon.h | 4 +++-
 2 files changed, 9 insertions(+), 1 deletion(-)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox