* Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy
From: Mickaël Salaün @ 2018-04-11 22:18 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andy Lutomirski, Daniel Borkmann, LKML, Alexei Starovoitov,
Arnaldo Carvalho de Melo, Casey Schaufler, David Drysdale,
David S . Miller, Eric W . Biederman, Jann Horn, Jonathan Corbet,
Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf,
Tycho Andersen, Will Drewry, Kernel
In-Reply-To: <20180410044821.tllxbaq2uj6gtzpn@ast-mbp.dhcp.thefacebook.com>
[-- Attachment #1.1: Type: text/plain, Size: 25891 bytes --]
On 04/10/2018 06:48 AM, Alexei Starovoitov wrote:
> On Mon, Apr 09, 2018 at 12:01:59AM +0200, Mickaël Salaün wrote:
>>
>> On 04/08/2018 11:06 PM, Andy Lutomirski wrote:
>>> On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>>>
>>>> On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
>>>>>
>>>>> On 27/02/2018 17:39, Andy Lutomirski wrote:
>>>>>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
>>>>>>>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
>>>>>>>>>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>>>>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>>>>>>>>>>>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>>>>>>>>>>>> to itself. As a seccomp filter, a Landlock program is enforced for the
>>>>>>>>>>>> current task and all its future children. A program is immutable and a
>>>>>>>>>>>> task can only add new restricting programs to itself, forming a list of
>>>>>>>>>>>> programss.
>>>>>>>>>>>>
>>>>>>>>>>>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>>>>>>>>>>>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>>>>>>>>>>>> capabilities, other LSM), then a Landlock hook related to this kind of
>>>>>>>>>>>> object is triggered. The list of programs for this hook is then
>>>>>>>>>>>> evaluated. Each program return a 32-bit value which can deny the action
>>>>>>>>>>>> on a kernel object with a non-zero value. If every programs of the list
>>>>>>>>>>>> return zero, then the action on the object is allowed.
>>>>>>>>>>>>
>>>>>>>>>>>> Multiple Landlock programs can be chained to share a 64-bits value for a
>>>>>>>>>>>> call chain (e.g. evaluating multiple elements of a file path). This
>>>>>>>>>>>> chaining is restricted when a process construct this chain by loading a
>>>>>>>>>>>> program, but additional checks are performed when it requests to apply
>>>>>>>>>>>> this chain of programs to itself. The restrictions ensure that it is
>>>>>>>>>>>> not possible to call multiple programs in a way that would imply to
>>>>>>>>>>>> handle multiple shared values (i.e. cookies) for one chain. For now,
>>>>>>>>>>>> only a fs_pick program can be chained to the same type of program,
>>>>>>>>>>>> because it may make sense if they have different triggers (cf. next
>>>>>>>>>>>> commits). This restrictions still allows to reuse Landlock programs in
>>>>>>>>>>>> a safe way (e.g. use the same loaded fs_walk program with multiple
>>>>>>>>>>>> chains of fs_pick programs).
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>>> +struct landlock_prog_set *landlock_prepend_prog(
>>>>>>>>>>>> + struct landlock_prog_set *current_prog_set,
>>>>>>>>>>>> + struct bpf_prog *prog)
>>>>>>>>>>>> +{
>>>>>>>>>>>> + struct landlock_prog_set *new_prog_set = current_prog_set;
>>>>>>>>>>>> + unsigned long pages;
>>>>>>>>>>>> + int err;
>>>>>>>>>>>> + size_t i;
>>>>>>>>>>>> + struct landlock_prog_set tmp_prog_set = {};
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>>>>>>>>>>>> + return ERR_PTR(-EINVAL);
>>>>>>>>>>>> +
>>>>>>>>>>>> + /* validate memory size allocation */
>>>>>>>>>>>> + pages = prog->pages;
>>>>>>>>>>>> + if (current_prog_set) {
>>>>>>>>>>>> + size_t i;
>>>>>>>>>>>> +
>>>>>>>>>>>> + for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>>>>>>>>>>>> + struct landlock_prog_list *walker_p;
>>>>>>>>>>>> +
>>>>>>>>>>>> + for (walker_p = current_prog_set->programs[i];
>>>>>>>>>>>> + walker_p; walker_p = walker_p->prev)
>>>>>>>>>>>> + pages += walker_p->prog->pages;
>>>>>>>>>>>> + }
>>>>>>>>>>>> + /* count a struct landlock_prog_set if we need to allocate one */
>>>>>>>>>>>> + if (refcount_read(¤t_prog_set->usage) != 1)
>>>>>>>>>>>> + pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>>>>>>>>>>>> + / PAGE_SIZE;
>>>>>>>>>>>> + }
>>>>>>>>>>>> + if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>>>>>>>>>>>> + return ERR_PTR(-E2BIG);
>>>>>>>>>>>> +
>>>>>>>>>>>> + /* ensure early that we can allocate enough memory for the new
>>>>>>>>>>>> + * prog_lists */
>>>>>>>>>>>> + err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>>>>>>>>>>>> + if (err)
>>>>>>>>>>>> + return ERR_PTR(err);
>>>>>>>>>>>> +
>>>>>>>>>>>> + /*
>>>>>>>>>>>> + * Each task_struct points to an array of prog list pointers. These
>>>>>>>>>>>> + * tables are duplicated when additions are made (which means each
>>>>>>>>>>>> + * table needs to be refcounted for the processes using it). When a new
>>>>>>>>>>>> + * table is created, all the refcounters on the prog_list are bumped (to
>>>>>>>>>>>> + * track each table that references the prog). When a new prog is
>>>>>>>>>>>> + * added, it's just prepended to the list for the new table to point
>>>>>>>>>>>> + * at.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * Manage all the possible errors before this step to not uselessly
>>>>>>>>>>>> + * duplicate current_prog_set and avoid a rollback.
>>>>>>>>>>>> + */
>>>>>>>>>>>> + if (!new_prog_set) {
>>>>>>>>>>>> + /*
>>>>>>>>>>>> + * If there is no Landlock program set used by the current task,
>>>>>>>>>>>> + * then create a new one.
>>>>>>>>>>>> + */
>>>>>>>>>>>> + new_prog_set = new_landlock_prog_set();
>>>>>>>>>>>> + if (IS_ERR(new_prog_set))
>>>>>>>>>>>> + goto put_tmp_lists;
>>>>>>>>>>>> + } else if (refcount_read(¤t_prog_set->usage) > 1) {
>>>>>>>>>>>> + /*
>>>>>>>>>>>> + * If the current task is not the sole user of its Landlock
>>>>>>>>>>>> + * program set, then duplicate them.
>>>>>>>>>>>> + */
>>>>>>>>>>>> + new_prog_set = new_landlock_prog_set();
>>>>>>>>>>>> + if (IS_ERR(new_prog_set))
>>>>>>>>>>>> + goto put_tmp_lists;
>>>>>>>>>>>> + for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>>>>>>>>>>>> + new_prog_set->programs[i] =
>>>>>>>>>>>> + READ_ONCE(current_prog_set->programs[i]);
>>>>>>>>>>>> + if (new_prog_set->programs[i])
>>>>>>>>>>>> + refcount_inc(&new_prog_set->programs[i]->usage);
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + /*
>>>>>>>>>>>> + * Landlock program set from the current task will not be freed
>>>>>>>>>>>> + * here because the usage is strictly greater than 1. It is
>>>>>>>>>>>> + * only prevented to be freed by another task thanks to the
>>>>>>>>>>>> + * caller of landlock_prepend_prog() which should be locked if
>>>>>>>>>>>> + * needed.
>>>>>>>>>>>> + */
>>>>>>>>>>>> + landlock_put_prog_set(current_prog_set);
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + /* prepend tmp_prog_set to new_prog_set */
>>>>>>>>>>>> + for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
>>>>>>>>>>>> + /* get the last new list */
>>>>>>>>>>>> + struct landlock_prog_list *last_list =
>>>>>>>>>>>> + tmp_prog_set.programs[i];
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (last_list) {
>>>>>>>>>>>> + while (last_list->prev)
>>>>>>>>>>>> + last_list = last_list->prev;
>>>>>>>>>>>> + /* no need to increment usage (pointer replacement) */
>>>>>>>>>>>> + last_list->prev = new_prog_set->programs[i];
>>>>>>>>>>>> + new_prog_set->programs[i] = tmp_prog_set.programs[i];
>>>>>>>>>>>> + }
>>>>>>>>>>>> + }
>>>>>>>>>>>> + new_prog_set->chain_last = tmp_prog_set.chain_last;
>>>>>>>>>>>> + return new_prog_set;
>>>>>>>>>>>> +
>>>>>>>>>>>> +put_tmp_lists:
>>>>>>>>>>>> + for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
>>>>>>>>>>>> + put_landlock_prog_list(tmp_prog_set.programs[i]);
>>>>>>>>>>>> + return new_prog_set;
>>>>>>>>>>>> +}
>>>>>>>>>>>
>>>>>>>>>>> Nack on the chaining concept.
>>>>>>>>>>> Please do not reinvent the wheel.
>>>>>>>>>>> There is an existing mechanism for attaching/detaching/quering multiple
>>>>>>>>>>> programs attached to cgroup and tracing hooks that are also
>>>>>>>>>>> efficiently executed via BPF_PROG_RUN_ARRAY.
>>>>>>>>>>> Please use that instead.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I don't see how that would help. Suppose you add a filter, then
>>>>>>>>>> fork(), and then the child adds another filter. Do you want to
>>>>>>>>>> duplicate the entire array? You certainly can't *modify* the array
>>>>>>>>>> because you'll affect processes that shouldn't be affected.
>>>>>>>>>>
>>>>>>>>>> In contrast, doing this through seccomp like the earlier patches
>>>>>>>>>> seemed just fine to me, and seccomp already had the right logic.
>>>>>>>>>
>>>>>>>>> it doesn't look to me that existing seccomp side of managing fork
>>>>>>>>> situation can be reused. Here there is an attempt to add 'chaining'
>>>>>>>>> concept which sort of an extension of existing seccomp style,
>>>>>>>>> but somehow heavily done on bpf side and contradicts cgroup/tracing.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't see why the seccomp way can't be used. I agree with you that
>>>>>>>> the seccomp *style* shouldn't be used in bpf code like this, but I
>>>>>>>> think that Landlock programs can and should just live in the existing
>>>>>>>> seccomp chain. If the existing seccomp code needs some modification
>>>>>>>> to make this work, then so be it.
>>>>>>>
>>>>>>> +1
>>>>>>> if that was the case...
>>>>>>> but that's not my reading of the patch set.
>>>>>>
>>>>>> An earlier version of the patch set used the seccomp filter chain.
>>>>>> Mickaël, what exactly was wrong with that approach other than that the
>>>>>> seccomp() syscall was awkward for you to use? You could add a
>>>>>> seccomp_add_landlock_rule() syscall if you needed to.
>>>>>
>>>>> Nothing was wrong about about that, this part did not changed (see my
>>>>> next comment).
>>>>>
>>>>>>
>>>>>> As a side comment, why is this an LSM at all, let alone a non-stacking
>>>>>> LSM? It would make a lot more sense to me to make Landlock depend on
>>>>>> having LSMs configured in but to call the landlock hooks directly from
>>>>>> the security_xyz() hooks.
>>>>>
>>>>> See Casey's answer and his patch series: https://lwn.net/Articles/741963/
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> In other words, the kernel already has two kinds of chaining:
>>>>>>>> seccomp's and bpf's. bpf's doesn't work right for this type of usage
>>>>>>>> across fork(), whereas seccomp's already handles that case correctly.
>>>>>>>> (In contrast, seccomp's is totally wrong for cgroup-attached filters.)
>>>>>>>> So IMO Landlock should use the seccomp core code and call into bpf
>>>>>>>> for the actual filtering.
>>>>>>>
>>>>>>> +1
>>>>>>> in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism,
>>>>>>> since cgroup hierarchy can be complicated with bpf progs attached
>>>>>>> at different levels with different override/multiprog properties,
>>>>>>> so walking link list and checking all flags at run-time would have
>>>>>>> been too slow. That's why we added compute_effective_progs().
>>>>>>
>>>>>> If we start adding override flags to Landlock, I think we're doing it
>>>>>> wrong. With cgroup bpf programs, the whole mess is set up by the
>>>>>> administrator. With seccomp, and with Landlock if done correctly, it
>>>>>> *won't* be set up by the administrator, so the chance that everyone
>>>>>> gets all the flags right is about zero. All attached filters should
>>>>>> run unconditionally.
>>>>>
>>>>>
>>>>> There is a misunderstanding about this chaining mechanism. This should
>>>>> not be confused with the list of seccomp filters nor the cgroup
>>>>> hierarchies. Landlock programs can be stacked the same way seccomp's
>>>>> filters can (cf. struct landlock_prog_set, the "chain_last" field is an
>>>>> optimization which is not used for this struct handling). This stackable
>>>>> property did not changed from the previous patch series. The chaining
>>>>> mechanism is for another use case, which does not make sense for seccomp
>>>>> filters nor other eBPF program types, at least for now, from what I can
>>>>> tell.
>>>>>
>>>>> You may want to get a look at my talk at FOSDEM
>>>>> (https://landlock.io/talks/2018-02-04_landlock-fosdem.pdf), especially
>>>>> slides 11 and 12.
>>>>>
>>>>> Let me explain my reasoning about this program chaining thing.
>>>>>
>>>>> To check if an action on a file is allowed, we first need to identify
>>>>> this file and match it to the security policy. In a previous
>>>>> (non-public) patch series, I tried to use one type of eBPF program to
>>>>> check every kind of access to a file. To be able to identify a file, I
>>>>> relied on an eBPF map, similar to the current inode map. This map store
>>>>> a set of references to file descriptors. I then created a function
>>>>> bpf_is_file_beneath() to check if the requested file was beneath a file
>>>>> in the map. This way, no chaining, only one eBPF program type to check
>>>>> an access to a file... but some issues then emerged. First, this design
>>>>> create a side-channel which help an attacker using such a program to
>>>>> infer some information not normally available, for example to get a hint
>>>>> on where a file descriptor (received from a UNIX socket) come from.
>>>>> Another issue is that this type of program would be called for each
>>>>> component of a path. Indeed, when the kernel check if an access to a
>>>>> file is allowed, it walk through all of the directories in its path
>>>>> (checking if the current process is allowed to execute them). That first
>>>>> attempt led me to rethink the way we could filter an access to a file
>>>>> *path*.
>>>>>
>>>>> To minimize the number of called to an eBPF program dedicated to
>>>>> validate an access to a file path, I decided to create three subtype of
>>>>> eBPF programs. The FS_WALK type is called when walking through every
>>>>> directory of a file path (except the last one if it is the target). We
>>>>> can then restrict this type of program to the minimum set of functions
>>>>> it is allowed to call and the minimum set of data available from its
>>>>> context. The first implicit chaining is for this type of program. To be
>>>>> able to evaluate a path while being called for all its components, this
>>>>> program need to store a state (to remember what was the parent directory
>>>>> of this path). There is no "previous" field in the subtype for this
>>>>> program because it is chained with itself, for each directories. This
>>>>> enable to create a FS_WALK program to evaluate a file hierarchy, thank
>>>>> to the inode map which can be used to check if a directory of this
>>>>> hierarchy is part of an allowed (or denied) list of directories. This
>>>>> design enables to express a file hierarchy in a programmatic way,
>>>>> without requiring an eBPF helper to do the job (unlike my first experiment).
>>>>>
>>>>> The explicit chaining is used to tied a path evaluation (with a FS_WALK
>>>>> program) to an access to the actual file being requested (the last
>>>>> component of a file path), with a FS_PICK program. It is only at this
>>>>> time that the kernel check for the requested action (e.g. read, write,
>>>>> chdir, append...). To be able to filter such access request we can have
>>>>> one call to the same program for every action and let this program check
>>>>> for which action it was called. However, this design does not allow the
>>>>> kernel to know if the current action is indeed handled by this program.
>>>>> Hence, it is not possible to implement a cache mechanism to only call
>>>>> this program if it knows how to handle this action.
>>>>>
>>>>> The approach I took for this FS_PICK type of program is to add to its
>>>>> subtype which action it can handle (with the "triggers" bitfield, seen
>>>>> as ORed actions). This way, the kernel knows if a call to a FS_PICK
>>>>> program is necessary. If the user wants to enforce a different security
>>>>> policy according to the action requested on a file, then it needs
>>>>> multiple FS_PICK programs. However, to reduce the number of such
>>>>> programs, this patch series allow a FS_PICK program to be chained with
>>>>> another, the same way a FS_WALK is chained with itself. This way, if the
>>>>> user want to check if the action is a for example an "open" and a "read"
>>>>> and not a "map" and a "read", then it can chain multiple FS_PICK
>>>>> programs with different triggers actions. The OR check performed by the
>>>>> kernel is not a limitation then, only a way to know if a call to an eBPF
>>>>> program is needed.
>>>>>
>>>>> The last type of program is FS_GET. This one is called when a process
>>>>> get a struct file or change its working directory. This is the only
>>>>> program type able (and allowed) to tag a file. This restriction is
>>>>> important to not being subject to resource exhaustion attacks (i.e.
>>>>> tagging every inode accessible to an attacker, which would allocate too
>>>>> much kernel memory).
>>>>>
>>>>> This design gives room for improvements to create a cache of eBPF
>>>>> context (input data, including maps if any), with the result of an eBPF
>>>>> program. This would help limit the number of call to an eBPF program the
>>>>> same way SELinux or other kernel components do to limit costly checks.
>>>>>
>>>>> The eBPF maps of progs are useful to call the same type of eBPF
>>>>> program. It does not fit with this use case because we may want multiple
>>>>> eBPF program according to the action requested on a kernel object (e.g.
>>>>> FS_GET). The other reason is because the eBPF program does not know what
>>>>> will be the next (type of) access check performed by the kernel.
>>>>>
>>>>> To say it another way, this chaining mechanism is a way to split a
>>>>> kernel object evaluation with multiple specialized programs, each of
>>>>> them being able to deal with data tied to their type. Using a monolithic
>>>>> eBPF program to check everything does not scale and does not fit with
>>>>> unprivileged use either.
>>>>>
>>>>> As a side note, the cookie value is only an ephemeral value to keep a
>>>>> state between multiple programs call. It can be used to create a state
>>>>> machine for an object evaluation.
>>>>>
>>>>> I don't see a way to do an efficient and programmatic path evaluation,
>>>>> with different access checks, with the current eBPF features. Please let
>>>>> me know if you know how to do it another way.
>>>>>
>>>>
>>>> Andy, Alexei, Daniel, what do you think about this Landlock program
>>>> chaining and cookie?
>>>>
>>>
>>> Can you give a small pseudocode real world example that acutally needs
>>> chaining? The mechanism is quite complicated and I'd like to
>>> understand how it'll be used.
>>>
>>
>> Here is the interesting part from the example (patch 09/11):
>>
>> +SEC("maps")
>> +struct bpf_map_def inode_map = {
>> + .type = BPF_MAP_TYPE_INODE,
>> + .key_size = sizeof(u32),
>> + .value_size = sizeof(u64),
>> + .max_entries = 20,
>> +};
>> +
>> +SEC("subtype/landlock1")
>> +static union bpf_prog_subtype _subtype1 = {
>> + .landlock_hook = {
>> + .type = LANDLOCK_HOOK_FS_WALK,
>> + }
>> +};
>> +
>> +static __always_inline __u64 update_cookie(__u64 cookie, __u8 lookup,
>> + void *inode, void *chain, bool freeze)
>> +{
>> + __u64 map_allow = 0;
>> +
>> + if (cookie == 0) {
>> + cookie = bpf_inode_get_tag(inode, chain);
>> + if (cookie)
>> + return cookie;
>> + /* only look for the first match in the map, ignore nested
>> + * paths in this example */
>> + map_allow = bpf_inode_map_lookup(&inode_map, inode);
>> + if (map_allow)
>> + cookie = 1 | map_allow;
>> + } else {
>> + if (cookie & COOKIE_VALUE_FREEZED)
>> + return cookie;
>> + map_allow = cookie & _MAP_MARK_MASK;
>> + cookie &= ~_MAP_MARK_MASK;
>> + switch (lookup) {
>> + case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOTDOT:
>> + cookie--;
>> + break;
>> + case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOT:
>> + break;
>> + default:
>> + /* ignore _MAP_MARK_MASK overflow in this example */
>> + cookie++;
>> + break;
>> + }
>> + if (cookie >= 1)
>> + cookie |= map_allow;
>> + }
>> + /* do not modify the cookie for each fs_pick */
>> + if (freeze && cookie)
>> + cookie |= COOKIE_VALUE_FREEZED;
>> + return cookie;
>> +}
>> +
>> +SEC("landlock1")
>> +int fs_walk(struct landlock_ctx_fs_walk *ctx)
>> +{
>> + ctx->cookie = update_cookie(ctx->cookie, ctx->inode_lookup,
>> + (void *)ctx->inode, (void *)ctx->chain, false);
>> + return LANDLOCK_RET_ALLOW;
>> +}
>>
>> The program "landlock1" is called for every directory execution (except
>> the last one if it is the leaf of a path). This enables to identify a
>> file hierarchy with only a (one dimension) list of file descriptors
>> (i.e. inode_map).
>>
>> Underneath, the Landlock LSM part looks if there is an associated path
>> walk (nameidata) with each inode access request. If there is one, then
>> the cookie associated with the path walk (if any) is made available
>> through the eBPF program context. This enables to develop a state
>> machine with an eBPF program to "evaluate" a file path (without string
>> parsing).
>>
>> The goal with this chaining mechanism is to be able to express a complex
>> kernel object like a file, with multiple run of one or more eBPF
>> programs, as a multilayer evaluation. This semantic may only make sense
>> for the user/developer and his security policy. We must keep in mind
>> that this object identification should be available to unprivileged
>> processes. This means that we must be very careful to what kind of
>> information are available to an eBPF program because this can then leak
>> to a process (e.g. through a map). With this mechanism, only information
>> already available to user space is available to the eBPF program.
>>
>> In this example, the complexity of the path evaluation is in the eBPF
>> program. We can then keep the kernel code more simple and generic. This
>> enables more flexibility for a security policy definition.
>
> it all sounds correct on paper, but it's pretty novel
> approach and I'm not sure I see all the details in the patch.
> When people say "inode" they most of the time mean inode integer number,
> whereas in this patch do you mean a raw pointer to in-kernel
> 'struct inode' ?
> To avoid confusion it should probably be called differently.
It's indeed a pointer to a "struct inode", not an inode number.
I was thinking about generalizing the BPF_MAP_TYPE_INODE by renaming it
to BPF_MAP_TYPE_FD. This map type could then be used either to identify
a set of inodes (pointers) or other kernel objects identifiable by a
file descriptor. A "subtype" (similar to the BPF prog subtype introduced
in this patch series) may be used to specialize such a map to statically
identify the kind of content it may hold. We could then add more
subtypes to identify sockets, devices, processes, and so on.
>
> If you meant inode as a number then why inode only?
> where is superblock, device, mount point?
> How bpf side can compare inodes without this additional info?
> How bpf side will know what inode to compare to?
> What if inode number is reused?
This pointer can identify if a giver inode is the same as one pointed by
a file descriptor (or a file path).
> This approach is an optimization to compare inodes
> instead of strings passed into sys_open ?
Comparing paths with strings is less efficient but it is also very
error-prone. Another advantage of using file descriptors is for
unprivileged processes: we can be sure that this processes are allowed
to access a file referred by a file descriptor (opened file). Indeed we
check (security_inode_getattr) that the process is allowed to stat an
opened file. This way, a malicious process can't infer information by
crafting path strings.
>
> If you meant inode as a pointer how bpf side will
> know the pointer before the walk begins?
The BPF map is filled by user space with file descriptors pointing to
opened files. When a path walk begin, the LSM part of Landlock is
notified that a process is requesting an access to the first element of
the path (e.g. "/"). This first element may be part of a map or not. The
BPF program can then choose if this request is legitimate or not.
> What guarantees that it's not a stale pointer?
When user space updates a map with a new file descriptor, the kernel
checks if this FD is valid. If this is the case, then the inode's usage
counter is incremented and its address is stored in the map.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section
From: Eric Dumazet @ 2018-04-11 22:30 UTC (permalink / raw)
To: Saeed Mahameed, davem@davemloft.net; +Cc: netdev@vger.kernel.org
In-Reply-To: <1523473143.3402.55.camel@mellanox.com>
On 04/11/2018 11:59 AM, Saeed Mahameed wrote:
> On Tue, 2018-04-10 at 13:35 -0700, Eric Dumazet wrote:
>>
>> On 04/10/2018 10:16 AM, David Miller wrote:
>>>
>>> The tradeoff here is that now you are doing two unnecessary atomic
>>> operations per stats dump.
>>>
>>> That is what the RCU lock allows us to avoid.
>>>
>>
>> dev_hold() and dev_put() are actually per cpu increment and
>> decrements,
>> pretty cheap these days.
>>
>
> Yes, i am adding only 2 cpu instructions here.
> I think the trade-off here is too small and the price to finally have
> get_stats64 called from non atomic context is really worth it.
Oh... but you have not mentioned this reason in your changelog.
What about bonding stats ?
By sending partial patches like that, others have to take care of the details
and this is not really acceptable.
>
> It looks really odd to me that the device chain locks are held for
> such long periods, while we already have the means to avoid this, same
> goes for rtnl_lock, same trick can work here for many use cases and
> many ndos, we are just over protective for no reasons.
>
>
>> Problem here is that any preemption of the process holding device
>> reference
>> might trigger warnings in device unregister.
>>
>
> This is true for any other place dev_hold is used,
> as explained in the commit message dev_hold is used for a very brief
> moment before calling the stats ndo and released directly after.
Not really.
Other places usually have notifiers to remove the refcount when needed.
We worked quite hard in the past to remove extra dev_hold()
(before we finally converted it to percpu counter)
>
> looking at
>
> netdev_wait_allrefs(...)
> [...]
> msleep(250);
>
> refcnt = netdev_refcnt_read(dev);
>
> if (time_after(jiffies, warning_time + 10 * HZ)) {
> pr_emerg("unregister_netdevice: waiting for %s to
> become free. Usage count = %d\n",
> dev->name, refcnt);
> warning_time = jiffies;
> }
>
> The holder will get enough time to release the netdev way before the
> warning is triggered.
>
> The warning is triggered only if someone holds the dev for more than 10
> seconds which is impossible for the stats ndo to take more than this,
> in fact i just did a quick measurement and it seems that in average
> get_stats64 ndo takes 0.5us !
Average is nice, but what about max time ?
Sleeping more than 10 seconds to satify GFP_KERNEL memory allocation can definitely
happen in the real world, or simply if you get preempted by some RT/high prio tasks.
Just say no to 'might sleep in ndo_get_stats()', and you will save a lot of issues.
^ permalink raw reply
* Re: [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading
From: Marcelo Ricardo Leitner @ 2018-04-11 22:39 UTC (permalink / raw)
To: Vladislav Yasevich
Cc: netdev, linux-sctp, virtualization, mst, jasowang, nhorman,
Vladislav Yasevich
In-Reply-To: <20180402134006.10111-2-vyasevic@redhat.com>
On Mon, Apr 02, 2018 at 09:40:02AM -0400, Vladislav Yasevich wrote:
> To support SCTP checksum offloading, we need to add a new feature
> to virtio_net, so we can negotiate support between the hypervisor
> and the guest.
>
> The signalling to the guest that an alternate checksum needs to
> be used is done via a new flag in the virtio_net_hdr. If the
> flag is set, the host will know to perform an alternate checksum
> calculation, which right now is only CRC32c.
>
> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
> ---
> drivers/net/virtio_net.c | 11 ++++++++---
> include/linux/virtio_net.h | 6 ++++++
> include/uapi/linux/virtio_net.h | 2 ++
> 3 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7b187ec..b601294 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
> /* Do we support "hardware" checksums? */
> if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
> /* This opens up the world of extra features. */
> - dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> + netdev_features_t sctp = 0;
> +
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
> + sctp |= NETIF_F_SCTP_CRC;
> +
> + dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
> if (csum)
> - dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> + dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>
> if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
> dev->hw_features |= NETIF_F_TSO
> @@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
> };
>
> #define VIRTNET_FEATURES \
> - VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
> + VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, VIRTIO_NET_F_SCTP_CSUM, \
> VIRTIO_NET_F_MAC, \
> VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
> VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, \
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index f144216..2e7a64a 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
>
> if (!skb_partial_csum_set(skb, start, off))
> return -EINVAL;
> +
> + if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
> + skb->csum_not_inet = 1;
> }
>
> if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> @@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
> hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
> } /* else everything is zero */
>
> + if (skb->csum_not_inet)
> + hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
Shouldn't this be |= instead?
> +
> return 0;
> }
>
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index 5de6ed3..3f279c8 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -36,6 +36,7 @@
> #define VIRTIO_NET_F_GUEST_CSUM 1 /* Guest handles pkts w/ partial csum */
> #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */
> #define VIRTIO_NET_F_MTU 3 /* Initial MTU advice */
> +#define VIRTIO_NET_F_SCTP_CSUM 4 /* SCTP checksum offload support */
> #define VIRTIO_NET_F_MAC 5 /* Host has given MAC address. */
> #define VIRTIO_NET_F_GUEST_TSO4 7 /* Guest can handle TSOv4 in. */
> #define VIRTIO_NET_F_GUEST_TSO6 8 /* Guest can handle TSOv6 in. */
> @@ -101,6 +102,7 @@ struct virtio_net_config {
> struct virtio_net_hdr_v1 {
> #define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 /* Use csum_start, csum_offset */
> #define VIRTIO_NET_HDR_F_DATA_VALID 2 /* Csum is valid */
> +#define VIRTIO_NET_HDR_F_CSUM_NOT_INET 4 /* Checksum is not inet */
> __u8 flags;
> #define VIRTIO_NET_HDR_GSO_NONE 0 /* Not a GSO frame */
> #define VIRTIO_NET_HDR_GSO_TCPV4 1 /* GSO frame, IPv4 TCP (TSO) */
> --
> 2.9.5
>
^ permalink raw reply
* Re: [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading
From: Michael S. Tsirkin @ 2018-04-11 22:49 UTC (permalink / raw)
To: Vladislav Yasevich
Cc: netdev, linux-sctp, virtualization, jasowang, nhorman,
Vladislav Yasevich
In-Reply-To: <20180402134006.10111-2-vyasevic@redhat.com>
On Mon, Apr 02, 2018 at 09:40:02AM -0400, Vladislav Yasevich wrote:
> To support SCTP checksum offloading, we need to add a new feature
> to virtio_net, so we can negotiate support between the hypervisor
> and the guest.
>
> The signalling to the guest that an alternate checksum needs to
> be used is done via a new flag in the virtio_net_hdr. If the
> flag is set, the host will know to perform an alternate checksum
> calculation, which right now is only CRC32c.
>
> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
> ---
> drivers/net/virtio_net.c | 11 ++++++++---
> include/linux/virtio_net.h | 6 ++++++
> include/uapi/linux/virtio_net.h | 2 ++
> 3 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7b187ec..b601294 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
> /* Do we support "hardware" checksums? */
> if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
> /* This opens up the world of extra features. */
> - dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> + netdev_features_t sctp = 0;
> +
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
> + sctp |= NETIF_F_SCTP_CRC;
> +
> + dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
> if (csum)
> - dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> + dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>
> if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
> dev->hw_features |= NETIF_F_TSO
> @@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
> };
>
> #define VIRTNET_FEATURES \
> - VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
> + VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, VIRTIO_NET_F_SCTP_CSUM, \
> VIRTIO_NET_F_MAC, \
> VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
> VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, \
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index f144216..2e7a64a 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
>
> if (!skb_partial_csum_set(skb, start, off))
> return -EINVAL;
> +
> + if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
> + skb->csum_not_inet = 1;
> }
>
> if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> @@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
> hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
> } /* else everything is zero */
>
> + if (skb->csum_not_inet)
> + hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
> +
> return 0;
> }
>
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index 5de6ed3..3f279c8 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -36,6 +36,7 @@
> #define VIRTIO_NET_F_GUEST_CSUM 1 /* Guest handles pkts w/ partial csum */
> #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */
> #define VIRTIO_NET_F_MTU 3 /* Initial MTU advice */
> +#define VIRTIO_NET_F_SCTP_CSUM 4 /* SCTP checksum offload support */
> #define VIRTIO_NET_F_MAC 5 /* Host has given MAC address. */
> #define VIRTIO_NET_F_GUEST_TSO4 7 /* Guest can handle TSOv4 in. */
> #define VIRTIO_NET_F_GUEST_TSO6 8 /* Guest can handle TSOv6 in. */
Is this a guest or a host checksum? We should differenciate between the
two.
> @@ -101,6 +102,7 @@ struct virtio_net_config {
> struct virtio_net_hdr_v1 {
> #define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 /* Use csum_start, csum_offset */
> #define VIRTIO_NET_HDR_F_DATA_VALID 2 /* Csum is valid */
> +#define VIRTIO_NET_HDR_F_CSUM_NOT_INET 4 /* Checksum is not inet */
> __u8 flags;
> #define VIRTIO_NET_HDR_GSO_NONE 0 /* Not a GSO frame */
> #define VIRTIO_NET_HDR_GSO_TCPV4 1 /* GSO frame, IPv4 TCP (TSO) */
> --
> 2.9.5
^ permalink raw reply
* [net 1/1] tipc: fix missing initializer in tipc_sendmsg()
From: Jon Maloy @ 2018-04-11 23:15 UTC (permalink / raw)
To: davem, netdev
Cc: mohan.krishna.ghanta.krishnamurthy, tung.q.nguyen, hoang.h.le,
jon.maloy, canh.d.luu, ying.xue, tipc-discussion
The stack variable 'dnode' in __tipc_sendmsg() may theoretically
end up tipc_node_get_mtu() as an unitilalized variable.
We fix this by intializing the variable at declaration. We also add
a default else clause to the two conditional ones already there, so
that we never end up in the named function if the given address
type is illegal.
Reported-by: syzbot+b0975ce9355b347c1546@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
net/tipc/socket.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1fd1c8b..252a52ae 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1278,7 +1278,7 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen)
struct tipc_msg *hdr = &tsk->phdr;
struct tipc_name_seq *seq;
struct sk_buff_head pkts;
- u32 dnode, dport;
+ u32 dport, dnode = 0;
u32 type, inst;
int mtu, rc;
@@ -1348,6 +1348,8 @@ static int __tipc_sendmsg(struct socket *sock, struct msghdr *m, size_t dlen)
msg_set_destnode(hdr, dnode);
msg_set_destport(hdr, dest->addr.id.ref);
msg_set_hdr_sz(hdr, BASIC_H_SIZE);
+ } else {
+ return -EINVAL;
}
/* Block or return if destination link is congested */
--
2.1.4
^ permalink raw reply related
* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
From: Jesus Sanchez-Palencia @ 2018-04-11 23:38 UTC (permalink / raw)
To: Thomas Gleixner
Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
anna-maria, henrik, John Stultz, levi.pearson, edumazet, willemb,
mlichvar
In-Reply-To: <alpine.DEB.2.21.1804112209450.1564@nanos.tec.linutronix.de>
Hi,
On 04/11/2018 01:16 PM, Thomas Gleixner wrote:
>>>> Putting it all together, we end up with:
>>>>
>>>> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
>>>> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting
>>>
>>> Why CLOCK_REALTIME? The only interesting time in a TSN network is
>>> CLOCK_TAI, really.
>>
>> REALTIME was just an example here to show that the qdisc has to be configured
>> with a clockid parameter. Are you suggesting that instead both of the new qdiscs
>> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?
>
> I think so. It's _the_ network time on which everything is based on.
Yes, but more on this below.
>
>>>> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
>>>> either as a txtime or as deadline by tbs (and further the NIC driver for the
>>>> offlaod case): SCM_TXTIME.
>>>>
>>>> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
>>>> socket, and will have as parameters a clockid and a txtime mode (deadline or
>>>> explicit), that defines the semantics of the timestamp set on packets using
>>>> SCM_TXTIME.
>>>>
>>>> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
>>>
>>> Can you remind me why we would need that?
>>
>> So there is a "clockid" that can be used for the full hw offload modes. On this
>> case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
>> can't just use a clockid that was computed from the fd pointing to /dev/ptpX .
>
> And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
> yet another clock, right?
Just breaking this down a bit, yes, TAI is the network time base, and the NICs
PTP clock use that because PTP is (commonly) based on TAI. After the PHCs have
been synchronized over the network (e.g. with ptp4l), my understanding is that
if applications want to use the clockid_t CLOCK_TAI as a network clock reference
it's required that something (i.e. phc2sys) is synchronizing the PHCs and the
system clock, and also that something calls adjtime to apply the TAI vs UTC
offset to CLOCK_TAI.
If we are fine with those 'dependencies', then I agree there is no need for
another clock.
I was thinking about the full offload use-cases, thus when no scheduling is
happening inside the qdiscs. Applications could just read the time from the PHC
clocks directly without having to rely on any of the above. On this case,
userspace would use DYNAMIC_CLOCK just to flag that this is the case, but I must
admit it's not clear to me how common of a use-case that is, or even if it makes
sense.
Thanks,
Jesus
>
> Thanks,
>
> tglx
>
^ permalink raw reply
* Re: [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section
From: Saeed Mahameed @ 2018-04-11 23:47 UTC (permalink / raw)
To: eric.dumazet@gmail.com, davem@davemloft.net; +Cc: netdev@vger.kernel.org
In-Reply-To: <f1b3bc40-cfdf-5d89-9cfc-cf6996f99d8c@gmail.com>
On Wed, 2018-04-11 at 15:30 -0700, Eric Dumazet wrote:
>
> On 04/11/2018 11:59 AM, Saeed Mahameed wrote:
> > On Tue, 2018-04-10 at 13:35 -0700, Eric Dumazet wrote:
> > >
> > > On 04/10/2018 10:16 AM, David Miller wrote:
> > > >
> > > > The tradeoff here is that now you are doing two unnecessary
> > > > atomic
> > > > operations per stats dump.
> > > >
> > > > That is what the RCU lock allows us to avoid.
> > > >
> > >
> > > dev_hold() and dev_put() are actually per cpu increment and
> > > decrements,
> > > pretty cheap these days.
> > >
> >
> > Yes, i am adding only 2 cpu instructions here.
> > I think the trade-off here is too small and the price to finally
> > have
> > get_stats64 called from non atomic context is really worth it.
>
> Oh... but you have not mentioned this reason in your changelog.
>
from the commit message:
"This is really greedy and demanding from device drivers since
ndo_get_stats64 called from dev_seq_show while the rcu lock is held"
sorry if this wasn't clear enough I will fix this if this goes through.
> What about bonding stats ?
>
> By sending partial patches like that, others have to take care of the
> details
> and this is not really acceptable.
>
This is a RFC just to show the approach, if the approach is acceptable
of course i will provide a full series that will handle all other
places, the change should be the same, I already have 2 other patches
to address ovs and netlink stats, i just didn't want to waste your time
on small details like netlink messages.
> >
> > It looks really odd to me that the device chain locks are held for
> > such long periods, while we already have the means to avoid this,
> > same
> > goes for rtnl_lock, same trick can work here for many use cases and
> > many ndos, we are just over protective for no reasons.
> >
> >
> > > Problem here is that any preemption of the process holding device
> > > reference
> > > might trigger warnings in device unregister.
> > >
> >
> > This is true for any other place dev_hold is used,
> > as explained in the commit message dev_hold is used for a very
> > brief
> > moment before calling the stats ndo and released directly after.
>
> Not really.
>
> Other places usually have notifiers to remove the refcount when
> needed.
>
Other places hold the netdev for the whole lifetime of the netdev, they
don't know when to release it, this is why we need notifiers.
in this patch the approach is:
hold
call ndo
put
notifier is not needed in this case.
> We worked quite hard in the past to remove extra dev_hold()
> (before we finally converted it to percpu counter)
>
> >
> > looking at
> >
> > netdev_wait_allrefs(...)
> > [...]
> > msleep(250);
> >
> > refcnt = netdev_refcnt_read(dev);
> >
> > if (time_after(jiffies, warning_time + 10 * HZ)) {
> > pr_emerg("unregister_netdevice: waiting for %s to
> > become free. Usage count = %d\n",
> > dev->name, refcnt);
> > warning_time = jiffies;
> > }
> >
> > The holder will get enough time to release the netdev way before
> > the
> > warning is triggered.
> >
> > The warning is triggered only if someone holds the dev for more
> > than 10
> > seconds which is impossible for the stats ndo to take more than
> > this,
> > in fact i just did a quick measurement and it seems that in average
> > get_stats64 ndo takes 0.5us !
>
> Average is nice, but what about max time ?
>
Well if we allow devices to access HW counters via FW command
interfaces in ndo_get_stats and by testing mlx5 where we query up to 5
hw registers, it could take 100us, still this is way smaller than 10sec
:) and it is really a nice rate to fetch HW stats on demand.
> Sleeping more than 10 seconds to satify GFP_KERNEL memory allocation
> can definitely
> happen in the real world, or simply if you get preempted by some
> RT/high prio tasks.
>
Same issue can occur without this patch if an IRQ is triggered under
while under rcu lock.
And RT/high prio task shouldn't take more than 10sec.
In case GFP_KERNEL memory allocation takes more than 10sec you will
already get tons of OOM warnings.
Having a netdev warning in case of really the netdev is being
unregistered and ndo_get_stats was preempted for more than 10 seconds
will be the least of your problems.
> Just say no to 'might sleep in ndo_get_stats()', and you will save a
> lot of issues.
We are just being over protective here.
Just say no to 'might sleep in ndo_get_stats()' is by itself creating a
lot of issues, many drivers out there have a background thread running
N times a second just to cache HW stats. which is really silly and CPU
consuming for no reason.
The rate of ndo_get_stats should be equal to the rate the driver can
actually provide fresh stats, any background thread is just a a waste
of CPU. Counters should be fetched from HW on demand rather than
periodically for no reason.
The same goes for set_rx_mode ndo.
Bottom line it looks like the need for rcu locking today is only meant
for synchronizing accessing the netdev ndo with the device chain in
question (namespace/ovs/bonding/etc .. ).
There are many places where dev_get_stats is called from non atmoic
context where the caller knows it is safe to access the netdev ndo.
example:
net/bondign/bond_main.c: bond_enslave(..)
/* initialize slave stats */
dev_get_stats(new_slave->dev, &new_slave->slave_stats);
^ permalink raw reply
* [PATCH net 0/4] nfp: improve signal handing on FW waits and flower control message processing
From: Jakub Kicinski @ 2018-04-11 23:47 UTC (permalink / raw)
To: davem; +Cc: netdev, oss-drivers, Jakub Kicinski
Hi!
The first part of this set aims to improve handling of interrupted
waits. Patch 1 makes waiting for management FW responses
uninterruptible while patch 2 adds a message when signal arrives
while waiting for an NFP mutex. We can't interrupt execution of
FW commands so uninterruptible sleep seems reasonable there.
Exiting a wait for a mutex should be clean and have no side affects
so we are allowing to abort it. Note that both waits have rather
large timeouts (tens of seconds).
Patches 3 and 4 improve flower offload operation under heavy load.
Currently there is no cap on the number of queued FW notifications.
Some of the notifications have to be processed from a workqueue
which may lead to very large number of messages getting queued
if workqueue never gets a chance to run. Pieter puts a limit
on number of queued messages, tries to drop some messages we ignore
without queuing and process more important messages first.
Jakub Kicinski (2):
nfp: ignore signals when communicating with management FW
nfp: print a message when mutex wait is interrupted
Pieter Jansen van Vuuren (2):
nfp: flower: move route ack control messages out of the workqueue
nfp: flower: split and limit cmsg skb lists
drivers/net/ethernet/netronome/nfp/flower/cmsg.c | 44 ++++++++++++++++++----
drivers/net/ethernet/netronome/nfp/flower/cmsg.h | 2 +
drivers/net/ethernet/netronome/nfp/flower/main.c | 6 ++-
drivers/net/ethernet/netronome/nfp/flower/main.h | 8 +++-
.../net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c | 5 ++-
.../net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c | 3 +-
6 files changed, 54 insertions(+), 14 deletions(-)
--
2.16.2
^ permalink raw reply
* [PATCH net 1/4] nfp: ignore signals when communicating with management FW
From: Jakub Kicinski @ 2018-04-11 23:47 UTC (permalink / raw)
To: davem; +Cc: netdev, oss-drivers, Jakub Kicinski
In-Reply-To: <20180411234738.6766-1-jakub.kicinski@netronome.com>
We currently allow signals to interrupt the wait for management FW
commands. Exiting the wait should not cause trouble, the FW will
just finish executing the command in the background and new commands
will wait for the old one to finish.
However, this may not be what users expect (Ctrl-C not actually stopping
the command). Moreover some systems routinely request link information
with signals pending (Ubuntu 14.04 runs a landscape-sysinfo python tool
from MOTD) worrying users with errors like these:
nfp 0000:04:00.0: nfp_nsp: Error -512 waiting for code 0x0007 to start
nfp 0000:04:00.0: nfp: reading port table failed -512
Make the wait for management FW responses non-interruptible.
Fixes: 1a64821c6af7 ("nfp: add support for service processor access")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
index 99bb679a9801..2abee0fe3a7c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
@@ -281,8 +281,7 @@ nfp_nsp_wait_reg(struct nfp_cpp *cpp, u64 *reg, u32 nsp_cpp, u64 addr,
if ((*reg & mask) == val)
return 0;
- if (msleep_interruptible(25))
- return -ERESTARTSYS;
+ msleep(25);
if (time_after(start_time, wait_until))
return -ETIMEDOUT;
--
2.16.2
^ permalink raw reply related
* [PATCH net 2/4] nfp: print a message when mutex wait is interrupted
From: Jakub Kicinski @ 2018-04-11 23:47 UTC (permalink / raw)
To: davem; +Cc: netdev, oss-drivers, Jakub Kicinski
In-Reply-To: <20180411234738.6766-1-jakub.kicinski@netronome.com>
When waiting for an NFP mutex is interrupted print a message
to make root causing later error messages easier.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c
index f7b958181126..cb28ac03e4ca 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c
@@ -211,8 +211,11 @@ int nfp_cpp_mutex_lock(struct nfp_cpp_mutex *mutex)
break;
err = msleep_interruptible(timeout_ms);
- if (err != 0)
+ if (err != 0) {
+ nfp_info(mutex->cpp,
+ "interrupted waiting for NFP mutex\n");
return -ERESTARTSYS;
+ }
if (time_is_before_eq_jiffies(warn_at)) {
warn_at = jiffies + NFP_MUTEX_WAIT_NEXT_WARN * HZ;
--
2.16.2
^ permalink raw reply related
* [PATCH net 3/4] nfp: flower: move route ack control messages out of the workqueue
From: Jakub Kicinski @ 2018-04-11 23:47 UTC (permalink / raw)
To: davem; +Cc: netdev, oss-drivers, Pieter Jansen van Vuuren
In-Reply-To: <20180411234738.6766-1-jakub.kicinski@netronome.com>
From: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Previously we processed the route ack control messages in the workqueue,
this unnecessarily loads the workqueue. We can deal with these messages
sooner as we know we are going to drop them.
Fixes: 8e6a9046b66a ("nfp: flower vxlan neighbour offload")
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
---
drivers/net/ethernet/netronome/nfp/flower/cmsg.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
index 3735c09d2112..8ad857eb89c6 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
@@ -258,9 +258,6 @@ nfp_flower_cmsg_process_one_rx(struct nfp_app *app, struct sk_buff *skb)
case NFP_FLOWER_CMSG_TYPE_ACTIVE_TUNS:
nfp_tunnel_keep_alive(app, skb);
break;
- case NFP_FLOWER_CMSG_TYPE_TUN_NEIGH:
- /* Acks from the NFP that the route is added - ignore. */
- break;
default:
nfp_flower_cmsg_warn(app, "Cannot handle invalid repr control type %u\n",
type);
@@ -306,6 +303,9 @@ void nfp_flower_cmsg_rx(struct nfp_app *app, struct sk_buff *skb)
nfp_flower_process_mtu_ack(app, skb)) {
/* Handle MTU acks outside wq to prevent RTNL conflict. */
dev_consume_skb_any(skb);
+ } else if (cmsg_hdr->type == NFP_FLOWER_CMSG_TYPE_TUN_NEIGH) {
+ /* Acks from the NFP that the route is added - ignore. */
+ dev_consume_skb_any(skb);
} else {
skb_queue_tail(&priv->cmsg_skbs, skb);
schedule_work(&priv->cmsg_work);
--
2.16.2
^ permalink raw reply related
* [PATCH net 4/4] nfp: flower: split and limit cmsg skb lists
From: Jakub Kicinski @ 2018-04-11 23:47 UTC (permalink / raw)
To: davem; +Cc: netdev, oss-drivers, Pieter Jansen van Vuuren
In-Reply-To: <20180411234738.6766-1-jakub.kicinski@netronome.com>
From: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Introduce a second skb list for handling control messages and limit the
number of allowed messages. Some control messages are considered more
crucial than others, resulting in the need for a second skb list. By
splitting the list into a separate high and low priority list we can
ensure that messages on the high list get added to the head of the list
that gets processed, this however has no functional impact. Previously
there was no limit on the number of messages allowed on the queue, this
could result in the queue growing boundlessly and eventually the host
running out of memory.
Fixes: b985f870a5f0 ("nfp: process control messages in workqueue in flower app")
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
---
drivers/net/ethernet/netronome/nfp/flower/cmsg.c | 38 +++++++++++++++++++++---
drivers/net/ethernet/netronome/nfp/flower/cmsg.h | 2 ++
drivers/net/ethernet/netronome/nfp/flower/main.c | 6 ++--
drivers/net/ethernet/netronome/nfp/flower/main.h | 8 +++--
4 files changed, 46 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
index 8ad857eb89c6..577659f332e4 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
@@ -272,18 +272,49 @@ nfp_flower_cmsg_process_one_rx(struct nfp_app *app, struct sk_buff *skb)
void nfp_flower_cmsg_process_rx(struct work_struct *work)
{
+ struct sk_buff_head cmsg_joined;
struct nfp_flower_priv *priv;
struct sk_buff *skb;
priv = container_of(work, struct nfp_flower_priv, cmsg_work);
+ skb_queue_head_init(&cmsg_joined);
- while ((skb = skb_dequeue(&priv->cmsg_skbs)))
+ spin_lock_bh(&priv->cmsg_skbs_high.lock);
+ skb_queue_splice_tail_init(&priv->cmsg_skbs_high, &cmsg_joined);
+ spin_unlock_bh(&priv->cmsg_skbs_high.lock);
+
+ spin_lock_bh(&priv->cmsg_skbs_low.lock);
+ skb_queue_splice_tail_init(&priv->cmsg_skbs_low, &cmsg_joined);
+ spin_unlock_bh(&priv->cmsg_skbs_low.lock);
+
+ while ((skb = __skb_dequeue(&cmsg_joined)))
nfp_flower_cmsg_process_one_rx(priv->app, skb);
}
-void nfp_flower_cmsg_rx(struct nfp_app *app, struct sk_buff *skb)
+static void
+nfp_flower_queue_ctl_msg(struct nfp_app *app, struct sk_buff *skb, int type)
{
struct nfp_flower_priv *priv = app->priv;
+ struct sk_buff_head *skb_head;
+
+ if (type == NFP_FLOWER_CMSG_TYPE_PORT_REIFY ||
+ type == NFP_FLOWER_CMSG_TYPE_PORT_MOD)
+ skb_head = &priv->cmsg_skbs_high;
+ else
+ skb_head = &priv->cmsg_skbs_low;
+
+ if (skb_queue_len(skb_head) >= NFP_FLOWER_WORKQ_MAX_SKBS) {
+ nfp_flower_cmsg_warn(app, "Dropping queued control messages\n");
+ dev_kfree_skb_any(skb);
+ return;
+ }
+
+ skb_queue_tail(skb_head, skb);
+ schedule_work(&priv->cmsg_work);
+}
+
+void nfp_flower_cmsg_rx(struct nfp_app *app, struct sk_buff *skb)
+{
struct nfp_flower_cmsg_hdr *cmsg_hdr;
cmsg_hdr = nfp_flower_cmsg_get_hdr(skb);
@@ -307,7 +338,6 @@ void nfp_flower_cmsg_rx(struct nfp_app *app, struct sk_buff *skb)
/* Acks from the NFP that the route is added - ignore. */
dev_consume_skb_any(skb);
} else {
- skb_queue_tail(&priv->cmsg_skbs, skb);
- schedule_work(&priv->cmsg_work);
+ nfp_flower_queue_ctl_msg(app, skb, cmsg_hdr->type);
}
}
diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
index 96bc0e33980c..b6c0fd053a50 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
@@ -108,6 +108,8 @@
#define NFP_FL_IPV4_TUNNEL_TYPE GENMASK(7, 4)
#define NFP_FL_IPV4_PRE_TUN_INDEX GENMASK(2, 0)
+#define NFP_FLOWER_WORKQ_MAX_SKBS 30000
+
#define nfp_flower_cmsg_warn(app, fmt, args...) \
do { \
if (net_ratelimit()) \
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 6357e0720f43..ad02592a82b7 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -519,7 +519,8 @@ static int nfp_flower_init(struct nfp_app *app)
app->priv = app_priv;
app_priv->app = app;
- skb_queue_head_init(&app_priv->cmsg_skbs);
+ skb_queue_head_init(&app_priv->cmsg_skbs_high);
+ skb_queue_head_init(&app_priv->cmsg_skbs_low);
INIT_WORK(&app_priv->cmsg_work, nfp_flower_cmsg_process_rx);
init_waitqueue_head(&app_priv->reify_wait_queue);
@@ -549,7 +550,8 @@ static void nfp_flower_clean(struct nfp_app *app)
{
struct nfp_flower_priv *app_priv = app->priv;
- skb_queue_purge(&app_priv->cmsg_skbs);
+ skb_queue_purge(&app_priv->cmsg_skbs_high);
+ skb_queue_purge(&app_priv->cmsg_skbs_low);
flush_work(&app_priv->cmsg_work);
nfp_flower_metadata_cleanup(app);
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h b/drivers/net/ethernet/netronome/nfp/flower/main.h
index e030b3ce4510..c67e1b54c614 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -107,7 +107,10 @@ struct nfp_mtu_conf {
* @mask_table: Hash table used to store masks
* @flow_table: Hash table used to store flower rules
* @cmsg_work: Workqueue for control messages processing
- * @cmsg_skbs: List of skbs for control message processing
+ * @cmsg_skbs_high: List of higher priority skbs for control message
+ * processing
+ * @cmsg_skbs_low: List of lower priority skbs for control message
+ * processing
* @nfp_mac_off_list: List of MAC addresses to offload
* @nfp_mac_index_list: List of unique 8-bit indexes for non NFP netdevs
* @nfp_ipv4_off_list: List of IPv4 addresses to offload
@@ -136,7 +139,8 @@ struct nfp_flower_priv {
DECLARE_HASHTABLE(mask_table, NFP_FLOWER_MASK_HASH_BITS);
DECLARE_HASHTABLE(flow_table, NFP_FLOWER_HASH_BITS);
struct work_struct cmsg_work;
- struct sk_buff_head cmsg_skbs;
+ struct sk_buff_head cmsg_skbs_high;
+ struct sk_buff_head cmsg_skbs_low;
struct list_head nfp_mac_off_list;
struct list_head nfp_mac_index_list;
struct list_head nfp_ipv4_off_list;
--
2.16.2
^ permalink raw reply related
* [bpf PATCH 3/3] bpf: sockmap, fix double put_page on ENOMEM error in redirect path
From: John Fastabend @ 2018-04-11 23:56 UTC (permalink / raw)
To: ast, daniel; +Cc: netdev
In-Reply-To: <20180411235230.19135.63857.stgit@john-Precision-Tower-5810>
In the case where the socket memory boundary is hit the redirect
path returns an ENOMEM error. However, before checking for this
condition the redirect scatterlist buffer is setup with a valid
page and length. This is never unwound so when the buffers are
released latter in the error path we do a put_page() and clear
the scatterlist fields. But, because the initial error happens
before completing the scatterlist buffer we end up with both the
original buffer and the redirect buffer pointing to the same page
resulting in duplicate put_page() calls.
To fix this simply move the initial configuration of the redirect
scatterlist buffer below the sock memory check.
Found this while running TCP_STREAM test with netperf using Cilium.
Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
kernel/bpf/sockmap.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index ffe266b..0fefdb0 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -524,8 +524,6 @@ static int bpf_tcp_ingress(struct sock *sk, int apply_bytes,
i = md->sg_start;
do {
- r->sg_data[i] = md->sg_data[i];
-
size = (apply && apply_bytes < md->sg_data[i].length) ?
apply_bytes : md->sg_data[i].length;
@@ -536,6 +534,7 @@ static int bpf_tcp_ingress(struct sock *sk, int apply_bytes,
}
sk_mem_charge(sk, size);
+ r->sg_data[i] = md->sg_data[i];
r->sg_data[i].length = size;
md->sg_data[i].length -= size;
md->sg_data[i].offset += size;
^ permalink raw reply related
* [bpf PATCH 0/3] BPF, a couple sockmap fixes
From: John Fastabend @ 2018-04-11 23:56 UTC (permalink / raw)
To: ast, daniel; +Cc: netdev
While testing sockmap with more programs (besides our test programs)
I found a couple issues.
The attached series fixes an issue where pinned maps were not
working correctly, blocking sockets returned zero, and an error
path that when the sock hit an out of memory case resulted in a
double page_put() while doing ingress redirects.
See individual patches for more details.
Thanks,
John
---
John Fastabend (3):
bpf: sockmap, map_release does not hold refcnt for pinned maps
bpf: sockmap, sk_wait_event needed to handle blocking cases
bpf: sockmap, fix double put_page on ENOMEM error in redirect path
include/linux/bpf.h | 3 +++
kernel/bpf/sockmap.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++----
kernel/bpf/syscall.c | 10 +++++++++-
3 files changed, 59 insertions(+), 5 deletions(-)
^ permalink raw reply
* [bpf PATCH 2/3] bpf: sockmap, sk_wait_event needed to handle blocking cases
From: John Fastabend @ 2018-04-11 23:56 UTC (permalink / raw)
To: ast, daniel; +Cc: netdev
In-Reply-To: <20180411235230.19135.63857.stgit@john-Precision-Tower-5810>
In the recvmsg handler we need to add a wait event to support the
blocking use cases. Without this we return zero and may confuse
user applications. In the wait event any data received on the
sk either via sk_receive_queue or the psock ingress list will
wake up the sock.
Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
kernel/bpf/sockmap.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index cef187f..ffe266b 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -43,6 +43,7 @@
#include <net/tcp.h>
#include <linux/ptr_ring.h>
#include <net/inet_common.h>
+#include <linux/sched/signal.h>
#define SOCK_CREATE_FLAG_MASK \
(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
@@ -732,6 +733,26 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock,
return err;
}
+static int bpf_wait_data(struct sock *sk,
+ struct smap_psock *psk, int flags,
+ long timeo, int *err)
+{
+ int rc;
+
+ DEFINE_WAIT_FUNC(wait, woken_wake_function);
+
+ add_wait_queue(sk_sleep(sk), &wait);
+ sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+ rc = sk_wait_event(sk, &timeo,
+ !list_empty(&psk->ingress) ||
+ !skb_queue_empty(&sk->sk_receive_queue),
+ &wait);
+ sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+ remove_wait_queue(sk_sleep(sk), &wait);
+
+ return rc;
+}
+
static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
int nonblock, int flags, int *addr_len)
{
@@ -755,6 +776,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
lock_sock(sk);
+bytes_ready:
while (copied != len) {
struct scatterlist *sg;
struct sk_msg_buff *md;
@@ -809,6 +831,29 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
}
}
+ if (!copied) {
+ long timeo;
+ int data;
+ int err = 0;
+
+ timeo = sock_rcvtimeo(sk, nonblock);
+ data = bpf_wait_data(sk, psock, flags, timeo, &err);
+
+ if (data) {
+ if (!skb_queue_empty(&sk->sk_receive_queue)) {
+ release_sock(sk);
+ smap_release_sock(psock, sk);
+ copied = tcp_recvmsg(sk, msg, len,
+ nonblock, flags, addr_len);
+ return copied;
+ }
+ goto bytes_ready;
+ }
+
+ if (err)
+ copied = err;
+ }
+
release_sock(sk);
smap_release_sock(psock, sk);
return copied;
^ permalink raw reply related
* [bpf PATCH 1/3] bpf: sockmap, map_release does not hold refcnt for pinned maps
From: John Fastabend @ 2018-04-11 23:56 UTC (permalink / raw)
To: ast, daniel; +Cc: netdev
In-Reply-To: <20180411235230.19135.63857.stgit@john-Precision-Tower-5810>
Relying on map_release hook to decrement the reference counts when a
map is removed only works if the map is not being pinned. In the
pinned case the ref is decremented immediately and the BPF programs
released. After this BPF programs may not be in-use which is not
what the user would expect.
This patch moves the release logic into bpf_map_put_uref() and brings
sockmap in-line with how a similar case is handled in prog array maps.
Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not detached progs")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
include/linux/bpf.h | 3 +++
kernel/bpf/sockmap.c | 3 +--
kernel/bpf/syscall.c | 10 +++++++++-
3 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 95a7abd..8a71058 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -646,6 +646,7 @@ static inline void bpf_map_offload_map_free(struct bpf_map *map)
#if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_INET)
struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type);
+void sock_map_release(struct bpf_map *map);
#else
static inline struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
{
@@ -658,6 +659,8 @@ static inline int sock_map_prog(struct bpf_map *map,
{
return -EOPNOTSUPP;
}
+
+void sock_map_release(struct bpf_map *map) {}
#endif
/* verifier prototypes for helper functions called from eBPF programs */
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 8dd9210..cef187f 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1834,7 +1834,7 @@ static int sock_map_update_elem(struct bpf_map *map,
return err;
}
-static void sock_map_release(struct bpf_map *map, struct file *map_file)
+void sock_map_release(struct bpf_map *map)
{
struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
struct bpf_prog *orig;
@@ -1858,7 +1858,6 @@ static void sock_map_release(struct bpf_map *map, struct file *map_file)
.map_get_next_key = sock_map_get_next_key,
.map_update_elem = sock_map_update_elem,
.map_delete_elem = sock_map_delete_elem,
- .map_release = sock_map_release,
};
BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0244973..449a618 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -257,8 +257,16 @@ static void bpf_map_free_deferred(struct work_struct *work)
static void bpf_map_put_uref(struct bpf_map *map)
{
if (atomic_dec_and_test(&map->usercnt)) {
- if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
+ switch (map->map_type) {
+ case BPF_MAP_TYPE_PROG_ARRAY:
bpf_fd_array_map_clear(map);
+ break;
+ case BPF_MAP_TYPE_SOCKMAP:
+ sock_map_release(map);
+ break;
+ default:
+ break;
+ }
}
}
^ permalink raw reply related
* RE: WARNING in kobject_add_internal
From: Yuan, Linyu (NSB - CN/Shanghai) @ 2018-04-12 0:29 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: bridge@lists.linux-foundation.org, David Miller,
Greg Kroah-Hartman, LKML, netdev, stephen hemminger,
syzkaller-bugs
In-Reply-To: <CACT4Y+bO2HN9Tuw2iMND9tD8578DNSOwUs7_W6Rvg-8oX5Qipw@mail.gmail.com>
Hi,
I have a question,
"can syzbot auto test each tree with newest changeset" ?
> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Dmitry Vyukov
> Sent: Wednesday, April 11, 2018 10:58 PM
> To: syzbot
> Cc: bridge@lists.linux-foundation.org; David Miller; Greg Kroah-Hartman;
> LKML; netdev; stephen hemminger; syzkaller-bugs
> Subject: Re: WARNING in kobject_add_internal
>
> On Fri, Jan 5, 2018 at 10:41 PM, syzbot
> <syzbot+e204ced820ef739d71ef5438f5e1976a874abc8d@syzkaller.appspotma
> il.com>
> wrote:
> > syzkaller has found reproducer for the following crash on
> > 89876f275e8d562912d9c238cd888b52065cf25c
> > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached
> > Raw console output is attached.
> > C reproducer is attached
> > syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> > for information about syzkaller reproducers
> >
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by:
> >
> syzbot+e204ced820ef739d71ef5438f5e1976a874abc8d@syzkaller.appspotmail
> .com
> > It will help syzbot understand when the bug is fixed.
>
> #syz dup: WARNING: kobject bug in device_add
>
> > ------------[ cut here ]------------
> > kobject_add_internal failed for (error: -12 parent: net)
> > WARNING: CPU: 1 PID: 3494 at lib/kobject.c:244
> > kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:242
> > Kernel panic - not syncing: panic_on_warn set ...
> >
> > CPU: 1 PID: 3494 Comm: syzkaller425998 Not tainted 4.15.0-rc6+ #249
> > Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS
> > Google 01/01/2011
> > Call Trace:
> > __dump_stack lib/dump_stack.c:17 [inline]
> > dump_stack+0x194/0x257 lib/dump_stack.c:53
> > panic+0x1e4/0x41c kernel/panic.c:183
> > __warn+0x1dc/0x200 kernel/panic.c:547
> > report_bug+0x211/0x2d0 lib/bug.c:184
> > fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
> > fixup_bug arch/x86/kernel/traps.c:247 [inline]
> > do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
> > do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
> > invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1079
> > RIP: 0010:kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:242
> > RSP: 0018:ffff8801c53c76f0 EFLAGS: 00010286
> > RAX: dffffc0000000008 RBX: ffff8801bf5a88d8 RCX: ffffffff8159da9e
> > RDX: 0000000000000000 RSI: 1ffff10038a78e99 RDI: ffff8801c53c73f8
> > RBP: ffff8801c53c77e8 R08: 1ffff10038a78e5b R09: 0000000000000000
> > R10: ffff8801c53c74b0 R11: 0000000000000000 R12: 1ffff10038a78ee4
> > R13: 00000000fffffff4 R14: ffff8801d8359a80 R15: ffffffff86201980
> > kobject_add_varg lib/kobject.c:366 [inline]
> > kobject_add+0x132/0x1f0 lib/kobject.c:411
> > device_add+0x35d/0x1650 drivers/base/core.c:1787
> > netdev_register_kobject+0x183/0x360 net/core/net-sysfs.c:1604
> > register_netdevice+0xb2b/0x1010 net/core/dev.c:7698
> > tun_set_iff drivers/net/tun.c:2319 [inline]
> > __tun_chr_ioctl+0x1d89/0x3dd0 drivers/net/tun.c:2524
> > tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2773
> > vfs_ioctl fs/ioctl.c:46 [inline]
> > do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
> > SYSC_ioctl fs/ioctl.c:701 [inline]
> > SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
> > entry_SYSCALL_64_fastpath+0x23/0x9a
> > RIP: 0033:0x444fc9
> > RSP: 002b:00007fff53389dc8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> > RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000000444fc9
> > RDX: 0000000020533000 RSI: 00000000400454ca RDI: 0000000000000004
> > RBP: 0000000000000005 R08: 0000000000000002 R09: 0000006f00003131
> > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402500
> > R13: 0000000000402590 R14: 0000000000000000 R15: 0000000000000000
> >
> > Dumping ftrace buffer:
> > (ftrace buffer empty)
> > Kernel Offset: disabled
> > Rebooting in 86400 seconds..
> >
^ permalink raw reply
* Re: [PATCH v2] dec: tulip: de4x5: Replace mdelay with usleep_range in de4x5_hw_init
From: Jia-Ju Bai @ 2018-04-12 1:30 UTC (permalink / raw)
To: James Bottomley, davem, stephen, johannes.berg, arvind.yadav.cs,
dhowells
Cc: netdev, linux-parisc, linux-kernel
In-Reply-To: <1523463379.3221.18.camel@HansenPartnership.com>
On 2018/4/12 0:16, James Bottomley wrote:
> On Wed, 2018-04-11 at 23:39 +0800, Jia-Ju Bai wrote:
>> de4x5_hw_init() is never called in atomic context.
>>
>> de4x5_hw_init() is only called by de4x5_pci_probe(), which is only
>> set as ".probe" in struct pci_driver.
>>
>> Despite never getting called from atomic context, de4x5_hw_init()
>> calls mdelay() to busily wait. This is not necessary and can be
>> replaced with usleep_range() to avoid busy waiting.
>>
>> This is found by a static analysis tool named DCNS written by myself.
>> And I also manually check it.
> Did you actually test this? The usual reason for wanting m/udelay is
> that the timing must be exact. The driver is filled with mdelay()s for
> this reason. The one you've picked on is in the init path so it won't
> affect the runtime in any way. I also don't think we have the hrtimer
> machinery for usleep_range() to work properly on parisc, so I don't
> think the replacement works.
>
> James
>
Hello, James.
Thanks for your reply :)
I agree that usleep_range() here will not much affect the real execution
of this driver.
But I think usleep_range() can more opportunity for other threads to use
the CPU core to schedule during waiting.
That is why I detect mdelay() that can be replaced with msleep() or
usleep_range().
Best wishes,
Jia-Ju Bai
^ permalink raw reply
* [PATCH v2] net: dsa: b53: Using sleep-able operations in b53_switch_reset_gpio
From: Jia-Ju Bai @ 2018-04-12 1:48 UTC (permalink / raw)
To: f.fainelli, andrew, vivien.didelot, preid
Cc: netdev, linux-kernel, Jia-Ju Bai
b53_switch_reset_gpio() is never called in atomic context.
The call chain ending up at b53_switch_reset_gpio() is:
[1] b53_switch_reset_gpio() <- b53_switch_reset() <-
b53_reset_switch() <- b53_setup()
b53_switch_reset_gpio() is set as ".setup" in struct dsa_switch_ops.
This function is not called in atomic context.
Despite never getting called from atomic context, b53_switch_reset_gpio()
calls non-sleep operations mdelay() and gpio_set_value().
They are not necessary and can be replaced with msleep()
and gpio_set_value_cansleep().
This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
---
v2:
* Use gpio_set_value_cansleep() to replace gpio_set_value() additionally.
Thanks for Florian and Phil for good advice.
---
drivers/net/dsa/b53/b53_common.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 274f367..36cc60d 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -596,11 +596,11 @@ static void b53_switch_reset_gpio(struct b53_device *dev)
/* Reset sequence: RESET low(50ms)->high(20ms)
*/
- gpio_set_value(gpio, 0);
- mdelay(50);
+ gpio_set_value_cansleep(gpio, 0);
+ msleep(50);
- gpio_set_value(gpio, 1);
- mdelay(20);
+ gpio_set_value_cansleep(gpio, 1);
+ msleep(20);
dev->current_page = 0xff;
}
--
1.9.1
^ permalink raw reply related
* Re: [PATCH] net: dsa: b53: Replace mdelay with msleep in b53_switch_reset_gpio
From: Jia-Ju Bai @ 2018-04-12 1:49 UTC (permalink / raw)
To: Florian Fainelli, Phil Reid, andrew, vivien.didelot; +Cc: netdev, linux-kernel
In-Reply-To: <7d376874-8078-0252-ed7e-29392a519fc8@gmail.com>
On 2018/4/12 0:19, Florian Fainelli wrote:
> On 04/11/2018 12:14 AM, Jia-Ju Bai wrote:
>>
>> On 2018/4/11 13:30, Phil Reid wrote:
>>> On 11/04/2018 09:51, Jia-Ju Bai wrote:
>>>> b53_switch_reset_gpio() is never called in atomic context.
>>>>
>>>> The call chain ending up at b53_switch_reset_gpio() is:
>>>> [1] b53_switch_reset_gpio() <- b53_switch_reset() <-
>>>> b53_reset_switch() <- b53_setup()
>>>>
>>>> b53_switch_reset_gpio() is set as ".setup" in struct dsa_switch_ops.
>>>> This function is not called in atomic context.
>>>>
>>>> Despite never getting called from atomic context,
>>>> b53_switch_reset_gpio()
>>>> calls mdelay() to busily wait.
>>>> This is not necessary and can be replaced with msleep() to
>>>> avoid busy waiting.
>>>>
>>>> This is found by a static analysis tool named DCNS written by myself.
>>>> And I also manually check it.
>>>>
>>>> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
>>>> ---
>>>> drivers/net/dsa/b53/b53_common.c | 4 ++--
>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/dsa/b53/b53_common.c
>>>> b/drivers/net/dsa/b53/b53_common.c
>>>> index 274f367..e070ff6 100644
>>>> --- a/drivers/net/dsa/b53/b53_common.c
>>>> +++ b/drivers/net/dsa/b53/b53_common.c
>>>> @@ -597,10 +597,10 @@ static void b53_switch_reset_gpio(struct
>>>> b53_device *dev)
>>>> /* Reset sequence: RESET low(50ms)->high(20ms)
>>>> */
>>>> gpio_set_value(gpio, 0);
>>>> - mdelay(50);
>>>> + msleep(50);
>>>> gpio_set_value(gpio, 1);
>>>> - mdelay(20);
>>>> + msleep(20);
>>>> dev->current_page = 0xff;
>>>> }
>>>>
>>> Would that also imply gpio_set_value could be gpio_set_value_cansleep?
>>>
>> Yes, I think gpio_set_value_cansleep() is okay here?
>> Do I need to send a V2 patch to replace gpio_set_value()?
> Yes, I would lump these two changes in the same patch since this is
> effectively about solving sleeping vs. non sleeping operations.
Okay, I have sent a V2 patch, and you can have a look :)
Best wishes,
Jia-Ju Bai
^ permalink raw reply
* iproute2-4.16.0 no longer accepts routes via fe80::1
From: Thomas Deutschmann @ 2018-04-12 1:50 UTC (permalink / raw)
To: netdev
Hi,
I upgraded to iproute2-4.16.0 and rebooted my system.
On start I noticed that my IPv6 default route via "fe80::1" was rejected:
> # ip route add default via ff80::1
> Error: inet address is expected rather than "ff80::1".
This works with <iproute2-4.16.0.
--
Regards,
Thomas Deutschmann / Gentoo Linux Developer
C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5
^ permalink raw reply
* Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
From: Michael S. Tsirkin @ 2018-04-12 2:15 UTC (permalink / raw)
To: Björn Töpel
Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
netdev, michael.lundkvist, jesse.brandeburg, anjali.singhai,
qi.z.zhang, ravineet.singh
In-Reply-To: <20180327165919.17933-4-bjorn.topel@gmail.com>
On Tue, Mar 27, 2018 at 06:59:08PM +0200, Björn Töpel wrote:
> @@ -30,4 +31,18 @@ struct xdp_umem_reg {
> __u32 frame_headroom; /* Frame head room */
> };
>
> +/* Pgoff for mmaping the rings */
> +#define XDP_UMEM_PGOFF_FILL_QUEUE 0x100000000
> +
> +struct xdp_queue {
> + __u32 head_idx __attribute__((aligned(64)));
> + __u32 tail_idx __attribute__((aligned(64)));
> +};
> +
> +/* Used for the fill and completion queues for buffers */
> +struct xdp_umem_queue {
> + struct xdp_queue ptrs;
> + __u32 desc[0] __attribute__((aligned(64)));
> +};
> +
> #endif /* _LINUX_IF_XDP_H */
So IIUC it's a head/tail ring of 32 bit descriptors.
In my experience (from implementing ptr_ring) this
implies that head/tail cache lines bounce a lot between
CPUs. Caching will help some. You are also forced to
use barriers to check validity which is slow on
some architectures.
If instead you can use a special descriptor value (e.g. 0) as
a valid signal, things work much better:
- you read descriptor atomically, if it's not 0 it's fine
- same with write - write 0 to pass it to the other side
- there is a data dependency so no need for barriers (except on dec alpha)
- no need for power of 2 limitations, you can make it any size you like
- easy to resize too
architecture (if not implementation) would be shared with ptr_ring
so some of the optimization ideas like batched updates could
be lifted from there.
When I was building ptr_ring, any head/tail design underperformed
storing valid flag with data itself. YMMV.
--
MST
^ permalink raw reply
* Re: [PATCH v2] dec: tulip: de4x5: Replace mdelay with usleep_range in de4x5_hw_init
From: arvindY @ 2018-04-12 2:21 UTC (permalink / raw)
To: Jia-Ju Bai, James Bottomley, davem, stephen, johannes.berg,
dhowells
Cc: netdev, linux-parisc, linux-kernel
In-Reply-To: <8bac3385-97c6-fff1-17c6-11f5e98a039a@gmail.com>
On Thursday 12 April 2018 07:00 AM, Jia-Ju Bai wrote:
>
>
> On 2018/4/12 0:16, James Bottomley wrote:
>> On Wed, 2018-04-11 at 23:39 +0800, Jia-Ju Bai wrote:
>>> de4x5_hw_init() is never called in atomic context.
>>>
>>> de4x5_hw_init() is only called by de4x5_pci_probe(), which is only
>>> set as ".probe" in struct pci_driver.
>>>
>>> Despite never getting called from atomic context, de4x5_hw_init()
>>> calls mdelay() to busily wait. This is not necessary and can be
>>> replaced with usleep_range() to avoid busy waiting.
>>>
>>> This is found by a static analysis tool named DCNS written by myself.
>>> And I also manually check it.
>> Did you actually test this? The usual reason for wanting m/udelay is
>> that the timing must be exact. The driver is filled with mdelay()s for
>> this reason. The one you've picked on is in the init path so it won't
>> affect the runtime in any way. I also don't think we have the hrtimer
>> machinery for usleep_range() to work properly on parisc, so I don't
>> think the replacement works.
>>
>> James
>>
>
> Hello, James.
> Thanks for your reply :)
>
> I agree that usleep_range() here will not much affect the real
> execution of this driver.
>
> But I think usleep_range() can more opportunity for other threads to
> use the CPU core to schedule during waiting.
> That is why I detect mdelay() that can be replaced with msleep() or
> usleep_range().
>
James is right, You have added all usleep_range() during system boot-up
time.
During boot-up system will run as single threaded. Where this change will
not make much sense. System first priority is match the exact timing on
each and every boot-up.
~arvind
>
> Best wishes,
> Jia-Ju Bai
^ permalink raw reply
* iproute2-4.16.0 no longer accepts routes via fe80::1
From: Thomas Deutschmann @ 2018-04-12 2:23 UTC (permalink / raw)
To: netdev
Hi,
I upgraded to iproute2-4.16.0 and rebooted my system.
On start I noticed that my IPv6 default route via "fe80::1" was rejected:
> # ip route add default via ff80::1
> Error: inet address is expected rather than "ff80::1".
This works fine with <iproute2-4.16.0.
--
Regards,
Thomas Deutschmann / Gentoo Linux Developer
C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5
^ permalink raw reply
* Re: [PATCH v2] dec: tulip: de4x5: Replace mdelay with usleep_range in de4x5_hw_init
From: Jia-Ju Bai @ 2018-04-12 2:26 UTC (permalink / raw)
To: arvindY, James Bottomley, davem, stephen, johannes.berg, dhowells
Cc: netdev, linux-parisc, linux-kernel
In-Reply-To: <5ACEC2B6.9080904@gmail.com>
On 2018/4/12 10:21, arvindY wrote:
>
>
> On Thursday 12 April 2018 07:00 AM, Jia-Ju Bai wrote:
>>
>>
>> On 2018/4/12 0:16, James Bottomley wrote:
>>> On Wed, 2018-04-11 at 23:39 +0800, Jia-Ju Bai wrote:
>>>> de4x5_hw_init() is never called in atomic context.
>>>>
>>>> de4x5_hw_init() is only called by de4x5_pci_probe(), which is only
>>>> set as ".probe" in struct pci_driver.
>>>>
>>>> Despite never getting called from atomic context, de4x5_hw_init()
>>>> calls mdelay() to busily wait. This is not necessary and can be
>>>> replaced with usleep_range() to avoid busy waiting.
>>>>
>>>> This is found by a static analysis tool named DCNS written by myself.
>>>> And I also manually check it.
>>> Did you actually test this? The usual reason for wanting m/udelay is
>>> that the timing must be exact. The driver is filled with mdelay()s for
>>> this reason. The one you've picked on is in the init path so it won't
>>> affect the runtime in any way. I also don't think we have the hrtimer
>>> machinery for usleep_range() to work properly on parisc, so I don't
>>> think the replacement works.
>>>
>>> James
>>>
>>
>> Hello, James.
>> Thanks for your reply :)
>>
>> I agree that usleep_range() here will not much affect the real
>> execution of this driver.
>>
>> But I think usleep_range() can more opportunity for other threads to
>> use the CPU core to schedule during waiting.
>> That is why I detect mdelay() that can be replaced with msleep() or
>> usleep_range().
>>
>
> James is right, You have added all usleep_range() during system
> boot-up time.
> During boot-up system will run as single threaded. Where this change will
> not make much sense. System first priority is match the exact timing on
> each and every boot-up.
>
Hello, Arvind.
Thanks for your reply :)
I admit I am not familiar with this driver.
I did not know this driver is only loaded during system boot-up time,
I thought this driver can be loaded as a kernel module (like many
drivers) after system booting.
After knowing this, I admit my patch is not proper, sorry...
Best wishes,
Jia-Ju Bai
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox