Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 08/11] net: hns3: add interrupt affinity support for misc interrupt
From: Saeed Mahameed @ 2019-07-26 23:18 UTC (permalink / raw)
  To: linyunsheng@huawei.com, tanhuazhong@huawei.com,
	davem@davemloft.net
  Cc: lipeng321@huawei.com, yisen.zhuang@huawei.com,
	salil.mehta@huawei.com, linuxarm@huawei.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <db2d081f-b892-b141-7fa5-44e66dd97eb9@huawei.com>

On Thu, 2019-07-25 at 10:05 +0800, Yunsheng Lin wrote:
> On 2019/7/25 3:24, Saeed Mahameed wrote:
> > On Wed, 2019-07-24 at 11:18 +0800, Huazhong Tan wrote:
> > > From: Yunsheng Lin <linyunsheng@huawei.com>
> > > 

[...]
> > > 
> > > +static void hclge_irq_affinity_notify(struct irq_affinity_notify
> > > *notify,
> > > +				      const cpumask_t *mask)
> > > +{
> > > +	struct hclge_dev *hdev = container_of(notify, struct hclge_dev,
> > > +					      affinity_notify);
> > > +
> > > +	cpumask_copy(&hdev->affinity_mask, mask);
> > > +	del_timer_sync(&hdev->service_timer);
> > > +	hdev->service_timer.expires = jiffies + HZ;
> > > +	add_timer_on(&hdev->service_timer, cpumask_first(&hdev-
> > > > affinity_mask));
> > > +}
> > 
> > I don't see any relation between your misc irq vector and &hdev-
> > > service_timer, to me this looks like an abuse of the irq affinity
> > > API
> > to allow the user to move the service timer affinity.
> 
> Hi, thanks for reviewing.
> 
> hdev->service_timer is used to schedule the periodic work
> queue hdev->service_task， we want all the management work
> queue including hdev->service_task to bind to the same cpu
> to improve cache and power efficiency, it is better to move
> service timer affinity according to that.
> 
> The hdev->service_task is changed to delay work queue in
> next patch " net: hns3: make hclge_service use delayed workqueue",
> So the affinity in the timer of the delay work queue is automatically
> set to the affinity of the delay work queue, we will move the
> "make hclge_service use delayed workqueue" patch before the
> "add interrupt affinity support for misc interrupt" patch, so
> we do not have to set service timer affinity explicitly.
> 
> Also, There is there work queues(mbx_service_task, service_task,
> rst_service_task) in the hns3 driver, we plan to combine them
> to one or two workqueue to improve efficiency and readability.
> 

So just to make it clear, you have 3 deferred works, 2 are triggered by
the misc irq and 1 is periodic, you want them all on the same core and
you want to control their affinity via the irq affinity ? for works #1
and  #2 you get that for free since the irq is triggering them, but for
work #3 (the periodic one) you need to manually move it when the irq
affinity changes.

I guess i am ok with this, since moving the irq affinity isn't only
required by the periodic work but also for the other works that the irq
is actually triggering (so it is not an actual abuse, sort of .. )




^ permalink raw reply

* Re: memory leak in kobject_set_name_vargs (2)
From: syzbot @ 2019-07-26 23:26 UTC (permalink / raw)
  To: catalin.marinas, davem, dvyukov, herbert, kuznet, kvalo,
	linux-kernel, linux-mm, luciano.coelho, netdev, steffen.klassert,
	syzkaller-bugs, torvalds, yoshfuji
In-Reply-To: <000000000000edcb3c058e6143d5@google.com>

syzbot has bisected this bug to:

commit 0e034f5c4bc408c943f9c4a06244415d75d7108c
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed May 18 18:51:25 2016 +0000

     iwlwifi: fix mis-merge that breaks the driver

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10f955f0600000
start commit:   3bfe1fc4 Merge tag 'for-5.3/dm-changes-2' of git://git.ker..
git tree:       upstream
final crash:    https://syzkaller.appspot.com/x/report.txt?x=12f955f0600000
console output: https://syzkaller.appspot.com/x/log.txt?x=14f955f0600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=dcfc65ee492509c6
dashboard link: https://syzkaller.appspot.com/bug?extid=ad8ca40ecd77896d51e2
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=135cbed0600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=14dd4e34600000

Reported-by: syzbot+ad8ca40ecd77896d51e2@syzkaller.appspotmail.com
Fixes: 0e034f5c4bc4 ("iwlwifi: fix mis-merge that breaks the driver")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

^ permalink raw reply

* Re: [PATCH bpf-next 06/10] selftests/bpf: add CO-RE relocs array tests
From: Andrii Nakryiko @ 2019-07-26 23:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Yonghong Song, Kernel Team
In-Reply-To: <20190725232633.zt6fxixq72xqwwmz@ast-mbp>

On Thu, Jul 25, 2019 at 4:26 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 12:27:38PM -0700, Andrii Nakryiko wrote:
> > Add tests for various array handling/relocation scenarios.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ...
> > +
> > +#define CORE_READ(dst, src) \
> > +     bpf_probe_read(dst, sizeof(*src), __builtin_preserve_access_index(src))
>
> This is the key accessor that all progs will use.
> Please split just this single macro into individual commit and add
> detailed comment about its purpose and
> what __builtin_preserve_access_index() does underneath.

I'm planning to add more powerful and flexible set of macros to
support BCC style a->b->c->d accesses using single macro. Something
like BPF_CORE_READ(&dst, sizeof(dst), a, b, c, d). I want to move
bpf_helpers.h into libbpf itself first, after some clean up. How about
I write all that at that time and for now just add this simpler
CORE_READ into bpf_helpers.h?

Relocations recorded by __builtin_preserve_access_index() are
described pretty well in patch #1, which adds bpf_offset_reloc. I'll
double check if I mention this built-in there, and if not - will add
that.

>
> > +SEC("raw_tracepoint/sys_enter")
> > +int test_core_nesting(void *ctx)
> > +{
> > +     struct core_reloc_arrays *in = (void *)&data.in;
> > +     struct core_reloc_arrays_output *out = (void *)&data.out;
> > +
> > +     /* in->a[2] */
> > +     if (CORE_READ(&out->a2, &in->a[2]))
> > +             return 1;
> > +     /* in->b[1][2][3] */
> > +     if (CORE_READ(&out->b123, &in->b[1][2][3]))
> > +             return 1;

^ permalink raw reply

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
From: Brian Vazquez @ 2019-07-26 23:36 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, Song Liu, Brian Vazquez, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, Networking, bpf
In-Reply-To: <beb513cb-2d76-30d4-6500-2892c6566a7e@fb.com>

On Thu, Jul 25, 2019 at 11:10 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 7/25/19 6:47 PM, Alexei Starovoitov wrote:
> > On Thu, Jul 25, 2019 at 6:24 PM Brian Vazquez <brianvv.kernel@gmail.com> wrote:
> >>
> >> On Thu, Jul 25, 2019 at 4:54 PM Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >>>
> >>> On Thu, Jul 25, 2019 at 04:25:53PM -0700, Brian Vazquez wrote:
> >>>>>>> If prev_key is deleted before map_get_next_key(), we get the first key
> >>>>>>> again. This is pretty weird.
> >>>>>>
> >>>>>> Yes, I know. But note that the current scenario happens even for the
> >>>>>> old interface (imagine you are walking a map from userspace and you
> >>>>>> tried get_next_key the prev_key was removed, you will start again from
> >>>>>> the beginning without noticing it).
> >>>>>> I tried to sent a patch in the past but I was missing some context:
> >>>>>> before NULL was used to get the very first_key the interface relied in
> >>>>>> a random (non existent) key to retrieve the first_key in the map, and
> >>>>>> I was told what we still have to support that scenario.
> >>>>>
> >>>>> BPF_MAP_DUMP is slightly different, as you may return the first key
> >>>>> multiple times in the same call. Also, BPF_MAP_DUMP is new, so we
> >>>>> don't have to support legacy scenarios.
> >>>>>
> >>>>> Since BPF_MAP_DUMP keeps a list of elements. It is possible to try
> >>>>> to look up previous keys. Would something down this direction work?
> >>>>
> >>>> I've been thinking about it and I think first we need a way to detect
> >>>> that since key was not present we got the first_key instead:
> >>>>
> >>>> - One solution I had in mind was to explicitly asked for the first key
> >>>> with map_get_next_key(map, NULL, first_key) and while walking the map
> >>>> check that map_get_next_key(map, prev_key, key) doesn't return the
> >>>> same key. This could be done using memcmp.
> >>>> - Discussing with Stan, he mentioned that another option is to support
> >>>> a flag in map_get_next_key to let it know that we want an error
> >>>> instead of the first_key.
> >>>>
> >>>> After detecting the problem we also need to define what we want to do,
> >>>> here some options:
> >>>>
> >>>> a) Return the error to the caller
> >>>> b) Try with previous keys if any (which be limited to the keys that we
> >>>> have traversed so far in this dump call)
> >>>> c) continue with next entries in the map. array is easy just get the
> >>>> next valid key (starting on i+1), but hmap might be difficult since
> >>>> starting on the next bucket could potentially skip some keys that were
> >>>> concurrently added to the same bucket where key used to be, and
> >>>> starting on the same bucket could lead us to return repeated elements.
> >>>>
> >>>> Or maybe we could support those 3 cases via flags and let the caller
> >>>> decide which one to use?
> >>>
> >>> this type of indecision is the reason why I wasn't excited about
> >>> batch dumping in the first place and gave 'soft yes' when Stan
> >>> mentioned it during lsf/mm/bpf uconf.
> >>> We probably shouldn't do it.
> >>> It feels this map_dump makes api more complex and doesn't really
> >>> give much benefit to the user other than large map dump becomes faster.
> >>> I think we gotta solve this problem differently.
> >>
> >> Some users are working around the dumping problems with the existing
> >> api by creating a bpf_map_get_next_key_and_delete userspace function
> >> (see https://urldefense.proofpoint.com/v2/url?u=https-3A__www.bouncybouncy.net_blog_bpf-5Fmap-5Fget-5Fnext-5Fkey-2Dpitfalls_&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=DA8e1B5r073vIqRrFz7MRA&m=XvNxqsDhRi62gzZ04HbLRTOFJX8X6mTuK7PZGn80akY&s=7q7beZxOJJ3Q0el8L0r-xDctedSpnEejJ6PVX1XYotQ&e= )
> >> which in my opinion is actually a good idea. The only problem with
> >> that is that calling bpf_map_get_next_key(fd, key, next_key) and then
> >> bpf_map_delete_elem(fd, key) from userspace is racing with kernel code
> >> and it might lose some information when deleting.
> >> We could then do map_dump_and_delete using that idea but in the kernel
> >> where we could better handle the racing condition. In that scenario
> >> even if we retrieve the same key it will contain different info ( the
> >> delta between old and new value). Would that work?
> >
> > you mean get_next+lookup+delete at once?
> > Sounds useful.
> > Yonghong has been thinking about batching api as well.
>
> In bcc, we have many instances like this:
>     getting all (key value) pairs, do some analysis and output,
>     delete all keys
>
> The implementation typically like
>     /* to get all (key, value) pairs */
>     while(bpf_get_next_key() == 0)
>       bpf_map_lookup()
>     /* do analysis and output */
>     for (all keys)
>       bpf_map_delete()

If you do that in a map that is being modified while you are doing the
analysis and output, you will lose some new data by deleting the keys,
right?

> get_next+lookup+delete will be definitely useful.
> batching will be even better to save the number of syscalls.
>
> An alternative is to do batch get_next+lookup and batch delete
> to achieve similar goal as the above code.

What I mentioned above is what it makes me think that with the
deletion it'd be better if we perform these 3 operations at once:
get_next+lookup+delete in a jumbo/atomic command and batch them later?

>
> There is a minor difference between this approach
> and the above get_next+lookup+delete.
> During scanning the hash map, get_next+lookup may get less number
> of elements compared to get_next+lookup+delete as the latter
> may have more later-inserted hash elements after the operation
> start. But both are inaccurate, so probably the difference
> is minor.
>
> >
> > I think if we cannot figure out how to make a batch of two commands
> > get_next + lookup to work correctly then we need to identify/invent one
> > command and make batching more generic.
>
> not 100% sure. It will be hard to define what is "correctly".

I agree, it'll be hard to define what is the right behavior.

> For not changing map, looping of (get_next, lookup) and batch
> get_next+lookup should have the same results.

This is true for the api I'm presenting the only think that I was
missing was what to do for changing maps to avoid the weird scenario
(getting the first key due a concurrent deletion). And, in my opinion
the way to go should be what also Willem supported: return the err to
the caller and restart the dumping. I could do this with existing code
just by detecting that we do provide a prev_key and got the first_key
instead of the next_key or even implement a new function if you want
to.

> For constant changing loops, not sure how to define which one
> is correct. If users have concerns, they may need to just pick one
> which gives them more comfort.
>
> > Like make one jumbo/compound/atomic command to be get_next+lookup+delete.
> > Define the semantics of this single compound command.
> > And then let batching to be a multiplier of such command.
> > In a sense that multiplier 1 or N should be have the same way.
> > No extra flags to alter the batching.
> > The high level description of the batch would be:
> > pls execute get_next,lookup,delete and repeat it N times.
> > or
> > pls execute get_next,lookup and repeat N times.

But any attempt to do get_next+lookup will have same problem with
deletions right?

I don't see how we could do it more consistent than what I'm
proposing. Let's just support one case: report an error if the
prev_key was not found instead of retrieving the first_key. Would that
work?

> > where each command action is defined to be composable.
> >
> > Just a rough idea.
> >

^ permalink raw reply

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
From: Jakub Kicinski @ 2019-07-27  0:02 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Yonghong Song, Alexei Starovoitov, Song Liu, Brian Vazquez,
	Daniel Borkmann, David S . Miller, Stanislav Fomichev,
	Willem de Bruijn, Petar Penkov, Networking, bpf
In-Reply-To: <CABCgpaVB+iDGO132d9CTtC_GYiKJuuL6pe5_Krm3-THgvfMO=A@mail.gmail.com>

On Fri, 26 Jul 2019 16:36:19 -0700, Brian Vazquez wrote:
> > In bcc, we have many instances like this:
> >     getting all (key value) pairs, do some analysis and output,
> >     delete all keys
> >
> > The implementation typically like
> >     /* to get all (key, value) pairs */
> >     while(bpf_get_next_key() == 0)
> >       bpf_map_lookup()
> >     /* do analysis and output */
> >     for (all keys)
> >       bpf_map_delete()  
> 
> If you do that in a map that is being modified while you are doing the
> analysis and output, you will lose some new data by deleting the keys,
> right?
> 
> > get_next+lookup+delete will be definitely useful.
> > batching will be even better to save the number of syscalls.
> >
> > An alternative is to do batch get_next+lookup and batch delete
> > to achieve similar goal as the above code.  
> 
> What I mentioned above is what it makes me think that with the
> deletion it'd be better if we perform these 3 operations at once:
> get_next+lookup+delete in a jumbo/atomic command and batch them later?

Hm. The lookup+delete are only "atomic" if we insert an RCU sync in
between right? The elements' lifetime is protected by RCU.

	CPU 1				CPU 2
					val = lookup(map, key)				
	val = lookup(map, key)
	delete(map, key)
	dump(val)
					val->counter++

So we'd need to walk the hash table, disconnect all lists from the
buckets and splice them onto one combined list, then synchronize_rcu()
and only after that we can dump the elements and free them.

> > There is a minor difference between this approach
> > and the above get_next+lookup+delete.
> > During scanning the hash map, get_next+lookup may get less number
> > of elements compared to get_next+lookup+delete as the latter
> > may have more later-inserted hash elements after the operation
> > start. But both are inaccurate, so probably the difference
> > is minor.

> > For not changing map, looping of (get_next, lookup) and batch
> > get_next+lookup should have the same results.  
> 
> This is true for the api I'm presenting the only think that I was
> missing was what to do for changing maps to avoid the weird scenario
> (getting the first key due a concurrent deletion). And, in my opinion
> the way to go should be what also Willem supported: return the err to
> the caller and restart the dumping. I could do this with existing code
> just by detecting that we do provide a prev_key and got the first_key
> instead of the next_key or even implement a new function if you want
> to.

My knee jerk reaction to Willem's proposal was to nit pick that sizing
the dump space to the map's max entries doesn't guarantee all entries
will fit if an entry can be removed and re-added in a different place
while dumper proceeds.

If we keep elements disconnected from the hash array _without_
decrementing element count until synchronize_rcu() completes we have
that guarantee, but OTOH it may not be possible to add any new entries
from the datapath, so that doesn't sound great either :/

^ permalink raw reply

* Re: [PATCH v2 bpf] libbpf: fix missing __WORDSIZE definition
From: Arnaldo Carvalho de Melo @ 2019-07-27  0:03 UTC (permalink / raw)
  To: Andrii Nakryiko, Arnaldo Carvalho de Melo
  Cc: Daniel Borkmann, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Arnaldo Carvalho de Melo, Kernel Team
In-Reply-To: <CAEf4BzYZT3fmQUuGp45+Mn6hLLyWnT2NE3PxfpD88sThX8JS_w@mail.gmail.com>

On July 26, 2019 8:01:10 PM GMT-03:00, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>On Fri, Jul 26, 2019 at 1:49 PM Arnaldo Carvalho de Melo
><arnaldo.melo@gmail.com> wrote:
>>
>> Em Thu, Jul 18, 2019 at 10:30:21AM -0700, Andrii Nakryiko escreveu:
>> > hashmap.h depends on __WORDSIZE being defined. It is defined by
>> > glibc/musl in different headers. It's an explicit goal for musl to
>be
>> > "non-detectable" at compilation time, so instead include glibc
>header if
>> > glibc is explicitly detected and fall back to musl header
>otherwise.
>> >
>> > Fixes: e3b924224028 ("libbpf: add resizable non-thread safe
>internal hashmap")
>> > Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
>> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>>
>> Couldn't find ths in the bpf tree, please consider applying it:
>>
>> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
>
>Arnaldo, I somehow got impression that you were going to pull this
>into your perf tree. Can you please confirm that it wasn't pulled into
>your tree, so that Alexei can apply it to bpf tree? Thanks!
>

I can process it, just was unsure about where it should go by,

I'll have it in my next perf/urgent pull req,

Thanks,

- Arnaldo
>
>>
>>
>> - Arnaldo
>>
>> > ---
>> >  tools/lib/bpf/hashmap.h | 5 +++++
>> >  1 file changed, 5 insertions(+)
>> >
>> > diff --git a/tools/lib/bpf/hashmap.h b/tools/lib/bpf/hashmap.h
>> > index 03748a742146..bae8879cdf58 100644
>> > --- a/tools/lib/bpf/hashmap.h
>> > +++ b/tools/lib/bpf/hashmap.h
>> > @@ -10,6 +10,11 @@
>> >
>> >  #include <stdbool.h>
>> >  #include <stddef.h>
>> > +#ifdef __GLIBC__
>> > +#include <bits/wordsize.h>
>> > +#else
>> > +#include <bits/reg.h>
>> > +#endif
>> >  #include "libbpf_internal.h"
>> >
>> >  static inline size_t hash_bits(size_t h, int bits)
>> > --
>> > 2.17.1
>>
>> --
>>
>> - Arnaldo


^ permalink raw reply

* Re: [PATCH bpf] libbpf: fix erroneous multi-closing of BTF FD
From: Alexei Starovoitov @ 2019-07-27  0:24 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Network Development, Alexei Starovoitov, Daniel Borkmann,
	Andrey Ignatov, Yonghong Song, Andrii Nakryiko, Kernel Team
In-Reply-To: <20190726212438.1269599-1-andriin@fb.com>

On Fri, Jul 26, 2019 at 2:25 PM Andrii Nakryiko <andriin@fb.com> wrote:
>
> Libbpf stores associated BTF FD per each instance of bpf_program. When
> program is unloaded, that FD is closed. This is wrong, because leads to
> a race and possibly closing of unrelated files, if application
> simultaneously opens new files while bpf_programs are unloaded.
>
> It's also unnecessary, because struct btf "owns" that FD, and
> btf__free(), called from bpf_object__close() will close it. Thus the fix
> is to never have per-program BTF FD and fetch it from obj->btf, when
> necessary.
>
> Fixes: 2993e0515bb4 ("tools/bpf: add support to read .BTF.ext sections")
> Reported-by: Andrey Ignatov <rdna@fb.com>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

Applied. Thanks

^ permalink raw reply

* Re: [PATCH bpf-next 4/9] libbpf: add libbpf_swap_print to get previous print func
From: Alexei Starovoitov @ 2019-07-27  0:30 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Stanislav Fomichev, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team
In-Reply-To: <CAEf4BzYoiL7XAXFdLaf5TDDas42u+jUTy2WydgmJT7WiniqOqQ@mail.gmail.com>

On Fri, Jul 26, 2019 at 02:47:28PM -0700, Andrii Nakryiko wrote:
> On Fri, Jul 26, 2019 at 2:28 PM Stanislav Fomichev <sdf@fomichev.me> wrote:
> >
> > On 07/26, Andrii Nakryiko wrote:
> > > libbpf_swap_print allows to restore previously set print function.
> > > This is useful when running many independent test with one default print
> > > function, but overriding log verbosity for particular subset of tests.
> > Can we change the return type of libbpf_set_print instead and return
> > the old function from it? Will it break ABI?
> 
> Yeah, thought about that, but I wasn't sure about ABI breakage. It
> seems like it shouldn't, so I'll just change libbpf_set_print
> signature instead.

I think it's ok to change return value of libbpf_set_print() from void
to useful pointer.
This function is not marked as __attribute__((__warn_unused_result__)),
so there should be no abi issues.

Please double check by compiler perf with different gcc-s as Arnaldo's setup does.


^ permalink raw reply

* Re: [PATCH bpf-next 6/9] selftests/bpf: abstract away test log output
From: Alexei Starovoitov @ 2019-07-27  0:34 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Andrii Nakryiko, Andrii Nakryiko, bpf, Networking,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team
In-Reply-To: <20190726222652.GG24397@mini-arch>

On Fri, Jul 26, 2019 at 03:26:52PM -0700, Stanislav Fomichev wrote:
> On 07/26, Andrii Nakryiko wrote:
> > On Fri, Jul 26, 2019 at 2:31 PM Stanislav Fomichev <sdf@fomichev.me> wrote:
> > >
> > > On 07/26, Andrii Nakryiko wrote:
> > > > This patch changes how test output is printed out. By default, if test
> > > > had no errors, the only output will be a single line with test number,
> > > > name, and verdict at the end, e.g.:
> > > >
> > > >   #31 xdp:OK
> > > >
> > > > If test had any errors, all log output captured during test execution
> > > > will be output after test completes.
> > > >
> > > > It's possible to force output of log with `-v` (`--verbose`) option, in
> > > > which case output won't be buffered and will be output immediately.
> > > >
> > > > To support this, individual tests are required to use helper methods for
> > > > logging: `test__printf()` and `test__vprintf()`.
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > > > ---
> > > >  .../selftests/bpf/prog_tests/bpf_obj_id.c     |   6 +-
> > > >  .../bpf/prog_tests/bpf_verif_scale.c          |  31 ++--
> > > >  .../bpf/prog_tests/get_stack_raw_tp.c         |   4 +-
> > > >  .../selftests/bpf/prog_tests/l4lb_all.c       |   2 +-
> > > >  .../selftests/bpf/prog_tests/map_lock.c       |  10 +-
> > > >  .../selftests/bpf/prog_tests/send_signal.c    |   8 +-
> > > >  .../selftests/bpf/prog_tests/spinlock.c       |   2 +-
> > > >  .../bpf/prog_tests/stacktrace_build_id.c      |   4 +-
> > > >  .../bpf/prog_tests/stacktrace_build_id_nmi.c  |   4 +-
> > > >  .../selftests/bpf/prog_tests/xdp_noinline.c   |   3 +-
> > > >  tools/testing/selftests/bpf/test_progs.c      | 135 +++++++++++++-----
> > > >  tools/testing/selftests/bpf/test_progs.h      |  37 ++++-
> > > >  12 files changed, 173 insertions(+), 73 deletions(-)
> > > >
> > 
> > [...]
> > 
> > > >               error_cnt++;
> > > > -             printf("test_l4lb:FAIL:stats %lld %lld\n", bytes, pkts);
> > > > +             test__printf("test_l4lb:FAIL:stats %lld %lld\n", bytes, pkts);
> > > #define printf(...) test__printf(...) in tests.h?
> > >
> > > A bit ugly, but no need to retrain everyone to use new printf wrappers.
> > 
> > I try to reduce amount of magic and surprising things, not add new
> > ones :) I also led by example and converted all current instances of
> > printf usage to test__printf, so anyone new will just copy/paste good
> > example, hopefully. Even if not, this non-buffered output will be
> > immediately obvious to anyone who just runs `sudo ./test_progs`.
> 
> [..]
> > And
> > author of new test with this problem should hopefully be the first and
> > the only one to catch and fix this.
> Yeah, that is my only concern, that regular printfs will eventually
> creep in. It's already confusing to go to/from printf/printk.
> 
> 2c:
> 
> I'm coming from a perspective of tools/testing/selftests/kselftest.h
> which is supposed to be a generic framework with custom
> printf variants (ksft_print_msg), but I still see a bunch of tests
> calling printf :-/
> 
> 	grep -ril ksft_exit_fail_msg selftests/ | xargs -n1 grep -w printf
> 
> Since we don't expect regular buffered io from the tests anyway
> it might be easier just to add a bit of magic and call it a day.

I think #define printf()
is not a good style in general.
glibc functions should never be #define-d.


^ permalink raw reply

* Re: [PATCH] iplink: document 'change' option to ip link
From: Matteo Croce @ 2019-07-27  0:45 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20190726215959.6312-1-stephen@networkplumber.org>

On Sat, Jul 27, 2019 at 12:00 AM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> Add the command alias "change" to man page.
> Don't show it on usage, since it is not commonly used.
>
> Reported-off-by: Matteo Croce <mcroce@redhat.com>
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
>  man/man8/ip-link.8.in | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
> index 883d88077d66..a8ae72d2b097 100644
> --- a/man/man8/ip-link.8.in
> +++ b/man/man8/ip-link.8.in
> @@ -1815,6 +1815,11 @@ can move the system to an unpredictable state. The solution
>  is to avoid changing several parameters with one
>  .B ip link set
>  call.
> +The modifier
> +.B change
> +is equivalent to
> +.BR "set" .
> +
>
>  .TP
>  .BI dev " DEVICE "
> --
> 2.20.1
>

Acked-by: Matteo Croce <mcroce@redhat.com>

-- 
Matteo Croce
per aspera ad upstream

^ permalink raw reply

* Re: [PATCH net-next v3 2/3] flow_offload: support get tcf block immediately
From: Jakub Kicinski @ 2019-07-27  0:52 UTC (permalink / raw)
  To: wenxu; +Cc: pablo, fw, netfilter-devel, netdev
In-Reply-To: <1564148047-6428-3-git-send-email-wenxu@ucloud.cn>

On Fri, 26 Jul 2019 21:34:06 +0800, wenxu@ucloud.cn wrote:
> From: wenxu <wenxu@ucloud.cn>
> 
> Because the new flow-indr-block can't get the tcf_block
> directly.
> It provide a callback to find the tcf block immediately
> when the device register and contain a ingress block.
> 
> Signed-off-by: wenxu <wenxu@ucloud.cn>

Please CC people who gave you feedback on your subsequent submissions.

> diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
> index 66f89bc..3b2e848 100644
> --- a/include/net/flow_offload.h
> +++ b/include/net/flow_offload.h
> @@ -391,6 +391,10 @@ struct flow_indr_block_dev {
>  	struct flow_block *flow_block;
>  };
>  
> +typedef void flow_indr_get_default_block_t(struct flow_indr_block_dev *indr_dev);
> +
> +void flow_indr_set_default_block_cb(flow_indr_get_default_block_t *cb);
> +
>  struct flow_indr_block_dev *flow_indr_block_dev_lookup(struct net_device *dev);
>  
>  int __flow_indr_block_cb_register(struct net_device *dev, void *cb_priv,
> diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
> index 9f1ae67..db8469d 100644
> --- a/net/core/flow_offload.c
> +++ b/net/core/flow_offload.c
> @@ -298,6 +298,14 @@ struct flow_indr_block_dev *
>  }
>  EXPORT_SYMBOL(flow_indr_block_dev_lookup);
>  
> +static flow_indr_get_default_block_t *flow_indr_get_default_block;

This static variable which can only be set to the TC's callback really
is not a great API design :/

> +void flow_indr_set_default_block_cb(flow_indr_get_default_block_t *cb)
> +{
> +	flow_indr_get_default_block = cb;
> +}
> +EXPORT_SYMBOL(flow_indr_set_default_block_cb);
> +
>  static struct flow_indr_block_dev *flow_indr_block_dev_get(struct net_device *dev)
>  {
>  	struct flow_indr_block_dev *indr_dev;
> @@ -312,6 +320,10 @@ static struct flow_indr_block_dev *flow_indr_block_dev_get(struct net_device *de
>  
>  	INIT_LIST_HEAD(&indr_dev->cb_list);
>  	indr_dev->dev = dev;
> +
> +	if (flow_indr_get_default_block)
> +		flow_indr_get_default_block(indr_dev);
> +
>  	if (rhashtable_insert_fast(&indr_setup_block_ht, &indr_dev->ht_node,
>  				   flow_indr_setup_block_ht_params)) {
>  		kfree(indr_dev);


^ permalink raw reply

* Re: [PATCH net-next v3 1/3] flow_offload: move tc indirect block to flow offload
From: Jakub Kicinski @ 2019-07-27  0:56 UTC (permalink / raw)
  To: wenxu; +Cc: pablo, fw, netfilter-devel, netdev
In-Reply-To: <1564148047-6428-2-git-send-email-wenxu@ucloud.cn>

On Fri, 26 Jul 2019 21:34:05 +0800, wenxu@ucloud.cn wrote:
> From: wenxu <wenxu@ucloud.cn>
> 
> move tc indirect block to flow_offload and rename
> it to flow indirect block.The nf_tables can use the
> indr block architecture.
> 
> Signed-off-by: wenxu <wenxu@ucloud.cn>

> diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
> index 00b9aab..66f89bc 100644
> --- a/include/net/flow_offload.h
> +++ b/include/net/flow_offload.h
> @@ -4,6 +4,7 @@
>  #include <linux/kernel.h>
>  #include <linux/list.h>
>  #include <net/flow_dissector.h>
> +#include <linux/rhashtable.h>
>  
>  struct flow_match {
>  	struct flow_dissector	*dissector;
> @@ -366,4 +367,42 @@ static inline void flow_block_init(struct flow_block *flow_block)
>  	INIT_LIST_HEAD(&flow_block->cb_list);
>  }
>  
> +typedef int flow_indr_block_bind_cb_t(struct net_device *dev, void *cb_priv,
> +				      enum tc_setup_type type, void *type_data);
> +
> +struct flow_indr_block_cb {
> +	struct list_head list;
> +	void *cb_priv;
> +	flow_indr_block_bind_cb_t *cb;
> +	void *cb_ident;
> +};
> +
> +typedef void flow_indr_block_ing_cmd_t(struct net_device *dev,
> +				       struct flow_block *flow_block,
> +				       struct flow_indr_block_cb *indr_block_cb,
> +				       enum flow_block_command command);
> +
> +struct flow_indr_block_dev {
> +	struct rhash_head ht_node;
> +	struct net_device *dev;
> +	unsigned int refcnt;
> +	struct list_head cb_list;
> +	flow_indr_block_ing_cmd_t *ing_cmd_cb;
> +	struct flow_block *flow_block;

TC can only have one block per device. Now with nftables offload we can
have multiple blocks. Could you elaborate how this is solved?

> +};

^ permalink raw reply

* Re: [PATCH v2] net: dsa: qca8k: enable port flow control
From: xiaofeis @ 2019-07-27  1:15 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: vkoul, netdev
In-Reply-To: <20190726132919.GB18223@lunn.ch>

On 2019-07-26 21:29, Andrew Lunn wrote:
>> I didn't compile it on this tree, same code is just compiled and 
>> tested on
>> kernel v4.14.
> 
> For kernel development work, v4.14 is dead. It died 12th November
> 2017. It gets backports of bug fixes, but kernel developers otherwise
> don't touch it.
> 
>> We are working on one google project, all the change is
>> required to upstream by Google.
>> But if I do the change based on the new type for kernel 5.3, then the 
>> commit
>> can't be used directly for Google's project.
> 
> So you will need to backport the change. In this case, you will have a
> very different patch in v4.14 than in mainline, due to changes like
> this. That is part of the pain in using such an old kernel.
> 
> You should use the function
> 
> void phy_support_asym_pause(struct phy_device *phydev);
> 
> to indicate the MAC supports asym pause.
> 
>    Andrew

Hi Andrew

Thanks a lot, you are correct. phy_support_asym_pause is the API to do 
this.

Very appreciate for all your patinet explaination and good suggestion.

Thanks
Xiaofeis

^ permalink raw reply

* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Alexei Starovoitov @ 2019-07-27  1:40 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel, Alexander Viro, Alexei Starovoitov, Andrew Morton,
	Andy Lutomirski, Arnaldo Carvalho de Melo, Casey Schaufler,
	Daniel Borkmann, David Drysdale, David S . Miller,
	Eric W . Biederman, James Morris, Jann Horn, John Johansen,
	Jonathan Corbet, Kees Cook, Michael Kerrisk,
	Mickaël Salaün, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Stephen Smalley, Tejun Heo,
	Tetsuo Handa, Thomas Graf, Tycho Andersen, Will Drewry,
	kernel-hardening, linux-api, linux-fsdevel, linux-security-module,
	netdev
In-Reply-To: <20190721213116.23476-7-mic@digikod.net>

On Sun, Jul 21, 2019 at 11:31:12PM +0200, Mickaël Salaün wrote:
> FIXME: 64-bits in the doc
> 
> This new map store arbitrary values referenced by inode keys.  The map
> can be updated from user space with file descriptor pointing to inodes
> tied to a file system.  From an eBPF (Landlock) program point of view,
> such a map is read-only and can only be used to retrieved a value tied
> to a given inode.  This is useful to recognize an inode tagged by user
> space, without access right to this inode (i.e. no need to have a write
> access to this inode).
> 
> Add dedicated BPF functions to handle this type of map:
> * bpf_inode_htab_map_update_elem()
> * bpf_inode_htab_map_lookup_elem()
> * bpf_inode_htab_map_delete_elem()
> 
> This new map require a dedicated helper inode_map_lookup_elem() because
> of the key which is a pointer to an opaque data (only provided by the
> kernel).  This act like a (physical or cryptographic) key, which is why
> it is also not allowed to get the next key.
> 
> Signed-off-by: Mickaël Salaün <mic@digikod.net>

there are too many things to comment on.
Let's do this patch.

imo inode_map concept is interesting, but see below...

> +
> +	/*
> +	 * Limit number of entries in an inode map to the maximum number of
> +	 * open files for the current process. The maximum number of file
> +	 * references (including all inode maps) for a process is then
> +	 * (RLIMIT_NOFILE - 1) * RLIMIT_NOFILE. If the process' RLIMIT_NOFILE
> +	 * is 0, then any entry update is forbidden.
> +	 *
> +	 * An eBPF program can inherit all the inode map FD. The worse case is
> +	 * to fill a bunch of arraymaps, create an eBPF program, close the
> +	 * inode map FDs, and start again. The maximum number of inode map
> +	 * entries can then be close to RLIMIT_NOFILE^3.
> +	 */
> +	if (attr->max_entries > rlimit(RLIMIT_NOFILE))
> +		return -EMFILE;

rlimit is checked, but no fd are consumed.
Once created such inode map_fd can be passed to a different process.
map_fd can be pinned into bpffs.
etc.
what the value of the check?

> +
> +	/* decorelate UAPI from kernel API */
> +	attr->key_size = sizeof(struct inode *);
> +
> +	return htab_map_alloc_check(attr);
> +}
> +
> +static void inode_htab_put_key(void *key)
> +{
> +	struct inode **inode = key;
> +
> +	if ((*inode)->i_state & I_FREEING)
> +		return;

checking the state without take a lock? isn't it racy?

> +	iput(*inode);
> +}
> +
> +/* called from syscall or (never) from eBPF program */
> +static int map_get_next_no_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +	/* do not leak a file descriptor */

what this comment suppose to mean?

> +	return -ENOTSUPP;
> +}
> +
> +/* must call iput(inode) after this call */
> +static struct inode *inode_from_fd(int ufd, bool check_access)
> +{
> +	struct inode *ret;
> +	struct fd f;
> +	int deny;
> +
> +	f = fdget(ufd);
> +	if (unlikely(!f.file))
> +		return ERR_PTR(-EBADF);
> +	/* TODO?: add this check when called from an eBPF program too (already
> +	* checked by the LSM parent hooks anyway) */
> +	if (unlikely(IS_PRIVATE(file_inode(f.file)))) {
> +		ret = ERR_PTR(-EINVAL);
> +		goto put_fd;
> +	}
> +	/* check if the FD is tied to a mount point */
> +	/* TODO?: add this check when called from an eBPF program too */
> +	if (unlikely(f.file->f_path.mnt->mnt_flags & MNT_INTERNAL)) {
> +		ret = ERR_PTR(-EINVAL);
> +		goto put_fd;
> +	}

a bunch of TODOs do not inspire confidence.

> +	if (check_access) {
> +		/*
> +		* must be allowed to access attributes from this file to then
> +		* be able to compare an inode to its map entry
> +		*/
> +		deny = security_inode_getattr(&f.file->f_path);
> +		if (deny) {
> +			ret = ERR_PTR(deny);
> +			goto put_fd;
> +		}
> +	}
> +	ret = file_inode(f.file);
> +	ihold(ret);
> +
> +put_fd:
> +	fdput(f);
> +	return ret;
> +}
> +
> +/*
> + * The key is a FD when called from a syscall, but an inode address when called
> + * from an eBPF program.
> + */
> +
> +/* called from syscall */
> +int bpf_inode_fd_htab_map_lookup_elem(struct bpf_map *map, int *key, void *value)
> +{
> +	void *ptr;
> +	struct inode *inode;
> +	int ret;
> +
> +	/* check inode access */
> +	inode = inode_from_fd(*key, true);
> +	if (IS_ERR(inode))
> +		return PTR_ERR(inode);
> +
> +	rcu_read_lock();
> +	ptr = htab_map_lookup_elem(map, &inode);
> +	iput(inode);
> +	if (IS_ERR(ptr)) {
> +		ret = PTR_ERR(ptr);
> +	} else if (!ptr) {
> +		ret = -ENOENT;
> +	} else {
> +		ret = 0;
> +		copy_map_value(map, value, ptr);
> +	}
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +/* called from kernel */

wrong comment?
kernel side cannot call it, right?

> +int bpf_inode_ptr_locked_htab_map_delete_elem(struct bpf_map *map,
> +		struct inode **key, bool remove_in_inode)
> +{
> +	if (remove_in_inode)
> +		landlock_inode_remove_map(*key, map);
> +	return htab_map_delete_elem(map, key);
> +}
> +
> +/* called from syscall */
> +int bpf_inode_fd_htab_map_delete_elem(struct bpf_map *map, int *key)
> +{
> +	struct inode *inode;
> +	int ret;
> +
> +	/* do not check inode access (similar to directory check) */
> +	inode = inode_from_fd(*key, false);
> +	if (IS_ERR(inode))
> +		return PTR_ERR(inode);
> +	ret = bpf_inode_ptr_locked_htab_map_delete_elem(map, &inode, true);
> +	iput(inode);
> +	return ret;
> +}
> +
> +/* called from syscall */
> +int bpf_inode_fd_htab_map_update_elem(struct bpf_map *map, int *key, void *value,
> +		u64 map_flags)
> +{
> +	struct inode *inode;
> +	int ret;
> +
> +	WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +	/* check inode access */
> +	inode = inode_from_fd(*key, true);
> +	if (IS_ERR(inode))
> +		return PTR_ERR(inode);
> +	ret = htab_map_update_elem(map, &inode, value, map_flags);
> +	if (!ret)
> +		ret = landlock_inode_add_map(inode, map);
> +	iput(inode);
> +	return ret;
> +}
> +
> +static void inode_htab_map_free(struct bpf_map *map)
> +{
> +	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> +	struct hlist_nulls_node *n;
> +	struct hlist_nulls_head *head;
> +	struct htab_elem *l;
> +	int i;
> +
> +	for (i = 0; i < htab->n_buckets; i++) {
> +		head = select_bucket(htab, i);
> +		hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> +			landlock_inode_remove_map(*((struct inode **)l->key), map);
> +		}
> +	}
> +	htab_map_free(map);
> +}

user space can delete the map.
that will trigger inode_htab_map_free() which will call
landlock_inode_remove_map().
which will simply itereate the list and delete from the list.

While in parallel inode can be destoyed and hook_inode_free_security()
will be called.
I think nothing that protects from this race.

> +
> +/*
> + * We need a dedicated helper to deal with inode maps because the key is a
> + * pointer to an opaque data, only provided by the kernel.  This really act
> + * like a (physical or cryptographic) key, which is why it is also not allowed
> + * to get the next key with map_get_next_key().

inode pointer is like cryptographic key? :)

> + */
> +BPF_CALL_2(bpf_inode_map_lookup_elem, struct bpf_map *, map, void *, key)
> +{
> +	WARN_ON_ONCE(!rcu_read_lock_held());
> +	return (unsigned long)htab_map_lookup_elem(map, &key);
> +}
> +
> +const struct bpf_func_proto bpf_inode_map_lookup_elem_proto = {
> +	.func		= bpf_inode_map_lookup_elem,
> +	.gpl_only	= false,
> +	.pkt_access	= true,

pkt_access ? :)

> +	.ret_type	= RET_PTR_TO_MAP_VALUE_OR_NULL,
> +	.arg1_type	= ARG_CONST_MAP_PTR,
> +	.arg2_type	= ARG_PTR_TO_INODE,
> +};
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index b2a8cb14f28e..e46441c42b68 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -801,6 +801,8 @@ static int map_lookup_elem(union bpf_attr *attr)
>  	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
>  		   map->map_type == BPF_MAP_TYPE_STACK) {
>  		err = map->ops->map_peek_elem(map, value);
> +	} else if (map->map_type == BPF_MAP_TYPE_INODE) {
> +		err = bpf_inode_fd_htab_map_lookup_elem(map, key, value);
>  	} else {
>  		rcu_read_lock();
>  		if (map->ops->map_lookup_elem_sys_only)
> @@ -951,6 +953,10 @@ static int map_update_elem(union bpf_attr *attr)
>  	} else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
>  		   map->map_type == BPF_MAP_TYPE_STACK) {
>  		err = map->ops->map_push_elem(map, value, attr->flags);
> +	} else if (map->map_type == BPF_MAP_TYPE_INODE) {
> +		rcu_read_lock();
> +		err = bpf_inode_fd_htab_map_update_elem(map, key, value, attr->flags);
> +		rcu_read_unlock();
>  	} else {
>  		rcu_read_lock();
>  		err = map->ops->map_update_elem(map, key, value, attr->flags);
> @@ -1006,7 +1012,10 @@ static int map_delete_elem(union bpf_attr *attr)
>  	preempt_disable();
>  	__this_cpu_inc(bpf_prog_active);
>  	rcu_read_lock();
> -	err = map->ops->map_delete_elem(map, key);
> +	if (map->map_type == BPF_MAP_TYPE_INODE)
> +		err = bpf_inode_fd_htab_map_delete_elem(map, key);
> +	else
> +		err = map->ops->map_delete_elem(map, key);
>  	rcu_read_unlock();
>  	__this_cpu_dec(bpf_prog_active);
>  	preempt_enable();
> @@ -1018,6 +1027,22 @@ static int map_delete_elem(union bpf_attr *attr)
>  	return err;
>  }
>  
> +int bpf_inode_ptr_unlocked_htab_map_delete_elem(struct bpf_map *map,
> +						struct inode **key, bool remove_in_inode)
> +{
> +	int err;
> +
> +	preempt_disable();
> +	__this_cpu_inc(bpf_prog_active);
> +	rcu_read_lock();
> +	err = bpf_inode_ptr_locked_htab_map_delete_elem(map, key, remove_in_inode);
> +	rcu_read_unlock();
> +	__this_cpu_dec(bpf_prog_active);
> +	preempt_enable();
> +	maybe_wait_bpf_programs(map);

if that function was actually doing synchronize_rcu() the consequences
would have been unpleasant. Fortunately it's a nop in this case.
Please read the code carefully before copy-paste.
Also what do you think the reason of bpf_prog_active above?
What is the reason of rcu_read_lock above?

I think the patch set needs to shrink at least in half to be reviewable.
The way you tie seccomp and lsm is probably the biggest obstacle
than any of the bugs above.
Can you drop seccomp ? and do it as normal lsm ?


^ permalink raw reply

* Re: [PATCH] b43legacy: Remove pointless cond_resched() wrapper
From: Larry Finger @ 2019-07-27  1:52 UTC (permalink / raw)
  To: Thomas Gleixner, netdev; +Cc: b43-dev, Kalle Valo
In-Reply-To: <alpine.DEB.2.21.1907262157500.1791@nanos.tec.linutronix.de>

On 7/26/19 3:00 PM, Thomas Gleixner wrote:
> cond_resched() can be used unconditionally. If CONFIG_PREEMPT is set, it
> becomes a NOP scheduler wise.
> 
> Also the B43_BUG_ON() in that wrapper is a homebrewn variant of
> __might_sleep() which is part of cond_resched() already.
> 
> Remove the wrapper and invoke cond_resched() directly.
> 
> Found while looking for CONFIG_PREEMPT dependent code treewide.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: netdev@vger.kernel.org
> Cc: b43-dev@lists.infradead.org
> Cc: Kalle Valo <kvalo@codeaurora.org>
> Cc: Larry Finger <Larry.Finger@lwfinger.net>

Reviewed- and Tested-by: Larry Finger <Larry.Finger@lwfinger.net>

Thanks.

Larry

^ permalink raw reply

* Re: [PATCH] isdn/gigaset: check endpoint null in gigaset_probe
From: Phong Tran @ 2019-07-27  1:56 UTC (permalink / raw)
  To: Paul Bolle, isdn, gregkh
  Cc: tranmanphong, gigaset307x-common, netdev, linux-kernel,
	linux-kernel-mentees, syzbot+35b1c403a14f5c89eba7
In-Reply-To: <1876196a0e7fc665f0f50d5e9c0e2641f713e089.camel@tiscali.nl>

On 7/26/19 9:22 PM, Paul Bolle wrote:
> Phong Tran schreef op vr 26-07-2019 om 20:35 [+0700]:
>> This fixed the potential reference NULL pointer while using variable
>> endpoint.
>>
>> Reported-by: syzbot+35b1c403a14f5c89eba7@syzkaller.appspotmail.com
>> Tested by syzbot:
>> https://groups.google.com/d/msg/syzkaller-bugs/wnHG8eRNWEA/Qn2HhjNdBgAJ
>>
>> Signed-off-by: Phong Tran <tranmanphong@gmail.com>
>> ---
>>   drivers/isdn/gigaset/usb-gigaset.c | 9 +++++++++
> 
> This is now drivers/staging/isdn/gigaset/usb-gigaset.c.

this patch was created base on branch 
kasan/usb-fuzzer-usb-testing-2019.07.11 [1]
I did not notice about the driver was moved to staging.

> 
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/isdn/gigaset/usb-gigaset.c b/drivers/isdn/gigaset/usb-gigaset.c
>> index 1b9b43659bdf..2e011f3db59e 100644
>> --- a/drivers/isdn/gigaset/usb-gigaset.c
>> +++ b/drivers/isdn/gigaset/usb-gigaset.c
>> @@ -703,6 +703,10 @@ static int gigaset_probe(struct usb_interface *interface,
>>   	usb_set_intfdata(interface, cs);
>>   
>>   	endpoint = &hostif->endpoint[0].desc;
>> +        if (!endpoint) {
>> +		dev_err(cs->dev, "Couldn't get control endpoint\n");
>> +		return -ENODEV;
>> +	}
> 
> When can this happen? Is this one of those bugs that one can only trigger with
> a specially crafted (evil) usb device?
> 

Yes, in my understanding, this only happens with random test of syzbot.

>>   	buffer_size = le16_to_cpu(endpoint->wMaxPacketSize);
>>   	ucs->bulk_out_size = buffer_size;
>> @@ -722,6 +726,11 @@ static int gigaset_probe(struct usb_interface *interface,
>>   	}
>>   
>>   	endpoint = &hostif->endpoint[1].desc;
>> +        if (!endpoint) {
>> +		dev_err(cs->dev, "Endpoint not available\n");
>> +		retval = -ENODEV;
>> +		goto error;
>> +	}
>>   
>>   	ucs->busy = 0;
>>   
> 
> Please note that I'm very close to getting cut off from the ISDN network, so
> the chances of being able to testi this on a live system are getting small.
> 

This bug can be invalid now. Do you agree?
There is an instruction to report invalid bug to syzbot [2].

> Thanks,
> 
> 
> Paul Bolle
> 


[1] 
https://github.com/google/kasan/commits/usb-fuzzer-usb-testing-2019.07.11
[2] 
https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot

Thanks,
Phong

^ permalink raw reply

* [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
From: Josh Hunt @ 2019-07-27  2:23 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, Josh Hunt

The current implementation of TCP MTU probing can considerably
underestimate the MTU on lossy connections allowing the MSS to get down to
48. We have found that in almost all of these cases on our networks these
paths can handle much larger MTUs meaning the connections are being
artificially limited. Even though TCP MTU probing can raise the MSS back up
we have seen this not to be the case causing connections to be "stuck" with
an MSS of 48 when heavy loss is present.

Prior to pushing out this change we could not keep TCP MTU probing enabled
b/c of the above reasons. Now with a reasonble floor set we've had it
enabled for the past 6 months.

The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
administrators the ability to control the floor of MSS probing.

Signed-off-by: Josh Hunt <johunt@akamai.com>
---
 Documentation/networking/ip-sysctl.txt | 6 ++++++
 include/net/netns/ipv4.h               | 1 +
 net/ipv4/sysctl_net_ipv4.c             | 9 +++++++++
 net/ipv4/tcp_ipv4.c                    | 1 +
 net/ipv4/tcp_timer.c                   | 2 +-
 5 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index df33674799b5..49e95f438ed7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
 	Path MTU discovery (MTU probing).  If MTU probing is enabled,
 	this is the initial MSS used by the connection.
 
+tcp_mtu_probe_floor - INTEGER
+	If MTU probing is enabled this caps the minimum MSS used for search_low
+	for the connection.
+
+	Default : 48
+
 tcp_min_snd_mss - INTEGER
 	TCP SYN and SYNACK messages usually advertise an ADVMSS option,
 	as described in RFC 1122 and RFC 6691.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index bc24a8ec1ce5..c0c0791b1912 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -116,6 +116,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_l3mdev_accept;
 #endif
 	int sysctl_tcp_mtu_probing;
+	int sysctl_tcp_mtu_probe_floor;
 	int sysctl_tcp_base_mss;
 	int sysctl_tcp_min_snd_mss;
 	int sysctl_tcp_probe_threshold;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 0b980e841927..59ded25acd04 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra2		= &tcp_min_snd_mss_max,
 	},
 	{
+		.procname	= "tcp_mtu_probe_floor",
+		.data		= &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &tcp_min_snd_mss_min,
+		.extra2		= &tcp_min_snd_mss_max,
+	},
+	{
 		.procname	= "tcp_probe_threshold",
 		.data		= &init_net.ipv4.sysctl_tcp_probe_threshold,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d57641cb3477..e0a372676329 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
 	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
 	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
+	net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
 
 	net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
 	net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index c801cd37cc2a..dbd9d2d0ee63 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 	} else {
 		mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
 		mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
-		mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
+		mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
 		mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
 		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
 	}
-- 
2.7.4


^ permalink raw reply related

* Re: next-20190723: bpf/seccomp - systemd/journald issue?
From: Alexei Starovoitov @ 2019-07-27  2:24 UTC (permalink / raw)
  To: sedat.dilek
  Cc: Yonghong Song, Alexei Starovoitov, Daniel Borkmann, Martin Lau,
	Song Liu, netdev@vger.kernel.org, bpf@vger.kernel.org,
	Clang-Built-Linux ML, Kees Cook, Nick Desaulniers,
	Nathan Chancellor
In-Reply-To: <CA+icZUUe0QE9QGMom1iQwuG8nM7Oi4Mq0GKqrLvebyxfUmj6RQ@mail.gmail.com>

On Fri, Jul 26, 2019 at 2:19 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
>
> On Fri, Jul 26, 2019 at 11:10 PM Yonghong Song <yhs@fb.com> wrote:
> >
> >
> >
> > On 7/26/19 2:02 PM, Sedat Dilek wrote:
> > > On Fri, Jul 26, 2019 at 10:38 PM Sedat Dilek <sedat.dilek@gmail.com> wrote:
> > >>
> > >> Hi Yonghong Song,
> > >>
> > >> On Fri, Jul 26, 2019 at 5:45 PM Yonghong Song <yhs@fb.com> wrote:
> > >>>
> > >>>
> > >>>
> > >>> On 7/26/19 1:26 AM, Sedat Dilek wrote:
> > >>>> Hi,
> > >>>>
> > >>>> I have opened a new issue in the ClangBuiltLinux issue tracker.
> > >>>
> > >>> Glad to know clang 9 has asm goto support and now It can compile
> > >>> kernel again.
> > >>>
> > >>
> > >> Yupp.
> > >>
> > >>>>
> > >>>> I am seeing a problem in the area bpf/seccomp causing
> > >>>> systemd/journald/udevd services to fail.
> > >>>>
> > >>>> [Fri Jul 26 08:08:43 2019] systemd[453]: systemd-udevd.service: Failed
> > >>>> to connect stdout to the journal socket, ignoring: Connection refused
> > >>>>
> > >>>> This happens when I use the (LLVM) LLD ld.lld-9 linker but not with
> > >>>> BFD linker ld.bfd on Debian/buster AMD64.
> > >>>> In both cases I use clang-9 (prerelease).
> > >>>
> > >>> Looks like it is a lld bug.
> > >>>
> > >>> I see the stack trace has __bpf_prog_run32() which is used by
> > >>> kernel bpf interpreter. Could you try to enable bpf jit
> > >>>     sysctl net.core.bpf_jit_enable = 1
> > >>> If this passed, it will prove it is interpreter related.
> > >>>
> > >>
> > >> After...
> > >>
> > >> sysctl -w net.core.bpf_jit_enable=1
> > >>
> > >> I can start all failed systemd services.
> > >>
> > >> systemd-journald.service
> > >> systemd-udevd.service
> > >> haveged.service
> > >>
> > >> This is in maintenance mode.
> > >>
> > >> What is next: Do set a permanent sysctl setting for net.core.bpf_jit_enable?
> > >>
> > >
> > > This is what I did:
> >
> > I probably won't have cycles to debug this potential lld issue.
> > Maybe you already did, I suggest you put enough reproducible
> > details in the bug you filed against lld so they can take a look.
> >
>
> I understand and will put the journalctl-log into the CBL issue
> tracker and update informations.
>
> Thanks for your help understanding the BPF correlations.
>
> Is setting 'net.core.bpf_jit_enable = 2' helpful here?

jit_enable=1 is enough.
Or use CONFIG_BPF_JIT_ALWAYS_ON to workaround.

It sounds like clang miscompiles interpreter.
modprobe test_bpf
should be able to point out which part of interpreter is broken.

^ permalink raw reply

* Re: [PATCH bpf-next v5 0/6] xdp: Add devmap_hash map type
From: Alexei Starovoitov @ 2019-07-27  2:26 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Daniel Borkmann, Alexei Starovoitov, Network Development,
	David Miller, Jesper Dangaard Brouer, Jakub Kicinski,
	Björn Töpel, Yonghong Song
In-Reply-To: <156415721066.13581.737309854787645225.stgit@alrua-x1>

On Fri, Jul 26, 2019 at 9:06 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> This series adds a new map type, devmap_hash, that works like the existing
> devmap type, but using a hash-based indexing scheme. This is useful for the use
> case where a devmap is indexed by ifindex (for instance for use with the routing
> table lookup helper). For this use case, the regular devmap needs to be sized
> after the maximum ifindex number, not the number of devices in it. A hash-based
> indexing scheme makes it possible to size the map after the number of devices it
> should contain instead.
>
> This was previously part of my patch series that also turned the regular
> bpf_redirect() helper into a map-based one; for this series I just pulled out
> the patches that introduced the new map type.
>
> Changelog:
>
> v5:
>
> - Dynamically set the number of hash buckets by rounding up max_entries to the
>   nearest power of two (mirroring the regular hashmap), as suggested by Jesper.

fyi I'm waiting for Jesper to review this new version.

^ permalink raw reply

* Re: [PATCH V2 net-next 07/11] net: hns3: adds debug messages to identify eth down cause
From: liuyonglong @ 2019-07-27  2:28 UTC (permalink / raw)
  To: Joe Perches, Saeed Mahameed, tanhuazhong@huawei.com,
	davem@davemloft.net
  Cc: lipeng321@huawei.com, yisen.zhuang@huawei.com,
	salil.mehta@huawei.com, linuxarm@huawei.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <05602c954c689ffcd796e9468c52bca6fa4efe3f.camel@perches.com>



On 2019/7/27 6:18, Joe Perches wrote:
> On Fri, 2019-07-26 at 22:00 +0000, Saeed Mahameed wrote:
>> On Fri, 2019-07-26 at 11:24 +0800, Huazhong Tan wrote:
>>> From: Yonglong Liu <liuyonglong@huawei.com>
>>>
>>> Some times just see the eth interface have been down/up via
>>> dmesg, but can not know why the eth down. So adds some debug
>>> messages to identify the cause for this.
> []
>>> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
>> []
>>> @@ -459,6 +459,10 @@ static int hns3_nic_net_open(struct net_device
>>> *netdev)
>>>  		h->ae_algo->ops->set_timer_task(priv->ae_handle, true);
>>>  
>>>  	hns3_config_xps(priv);
>>> +
>>> +	if (netif_msg_drv(h))
>>> +		netdev_info(netdev, "net open\n");
>>> +
>>
>> to make sure this is only intended for debug, and to avoid repetition.
>> #define hns3_dbg(__dev, format, args...)			\
>> ({								\
>> 	if (netif_msg_drv(h))					\
>> 		netdev_info(h->netdev, format, ##args);         \
>> })
> 
> 	netif_dbg(h, drv, h->netdev, "net open\n")
> 

Hi, Saeed && Joe:
For our cases, maybe netif_info() can be use for HNS3 drivers?
netif_dbg need to open dynamic debug options additional.


^ permalink raw reply

* Re: memory leak in kobject_set_name_vargs (2)
From: Linus Torvalds @ 2019-07-27  2:29 UTC (permalink / raw)
  To: syzbot
  Cc: Catalin Marinas, David Miller, Dmitry Vyukov, Herbert Xu, kuznet,
	Kalle Valo, Linux List Kernel Mailing, Linux-MM, luciano.coelho,
	Netdev, steffen.klassert, syzkaller-bugs, yoshfuji
In-Reply-To: <00000000000083ffc4058e9dddf0@google.com>

On Fri, Jul 26, 2019 at 4:26 PM syzbot
<syzbot+ad8ca40ecd77896d51e2@syzkaller.appspotmail.com> wrote:
>
> syzbot has bisected this bug to:
>
> commit 0e034f5c4bc408c943f9c4a06244415d75d7108c
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Wed May 18 18:51:25 2016 +0000
>
>      iwlwifi: fix mis-merge that breaks the driver

While this bisection looks more likely than the other syzbot entry
that bisected to a version change, I don't think it is correct eitger.

The bisection ended up doing a lot of "git bisect skip" because of the

    undefined reference to `nf_nat_icmp_reply_translation'

issue. Also, the memory leak doesn't seem to be entirely reliable:
when the bisect does 10 runs to verify that some test kernel is bad,
there are a couple of cases where only one or two of the ten run
failed.

Which makes me wonder if one or two of the "everything OK" runs were
actually buggy, but just happened to have all ten pass...

               Linus

^ permalink raw reply

* Re: memory leak in kobject_set_name_vargs (2)
From: Qian Cai @ 2019-07-27  2:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Catalin Marinas, David Miller, Dmitry Vyukov, Herbert Xu,
	kuznet, Kalle Valo, Linux List Kernel Mailing, Linux-MM,
	luciano.coelho, Netdev, steffen.klassert, syzkaller-bugs,
	yoshfuji, Wang Hai, Andy Shevchenko, David S. Miller
In-Reply-To: <CAHk-=why-PdP_HNbskRADMp1bnj+FwUDYpUZSYoNLNHMRPtoVA@mail.gmail.com>



> On Jul 26, 2019, at 10:29 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Fri, Jul 26, 2019 at 4:26 PM syzbot
> <syzbot+ad8ca40ecd77896d51e2@syzkaller.appspotmail.com> wrote:
>> 
>> syzbot has bisected this bug to:
>> 
>> commit 0e034f5c4bc408c943f9c4a06244415d75d7108c
>> Author: Linus Torvalds <torvalds@linux-foundation.org>
>> Date:   Wed May 18 18:51:25 2016 +0000
>> 
>>     iwlwifi: fix mis-merge that breaks the driver
> 
> While this bisection looks more likely than the other syzbot entry
> that bisected to a version change, I don't think it is correct eitger.
> 
> The bisection ended up doing a lot of "git bisect skip" because of the
> 
>    undefined reference to `nf_nat_icmp_reply_translation'
> 
> issue. Also, the memory leak doesn't seem to be entirely reliable:
> when the bisect does 10 runs to verify that some test kernel is bad,
> there are a couple of cases where only one or two of the ten run
> failed.
> 
> Which makes me wonder if one or two of the "everything OK" runs were
> actually buggy, but just happened to have all ten pass…

Real bisection should point to,

8ed633b9baf9e (“Revert "net-sysfs: Fix memory leak in netdev_register_kobject”")

I did encounter those memory leak and comes up with a similar fix in,

6b70fc94afd1 ("net-sysfs: Fix memory leak in netdev_register_kobject”)

but those error handling paths are tricky that seems nobody did much testing there, so it will
keep hitting other bugs in upper functions.

^ permalink raw reply

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Andrew Lunn @ 2019-07-27  3:02 UTC (permalink / raw)
  To: Allan W. Nielsen
  Cc: Horatiu Vultur, Nikolay Aleksandrov, roopa, davem, bridge, netdev,
	linux-kernel
In-Reply-To: <20190726195010.7x75rr74v7ph3m6m@lx-anielsen.microsemi.net>

> As you properly guessed, this model is quite different from what we are used to.

Yes, it takes a while to get the idea that the hardware is just an
accelerator for what the Linux stack can already do. And if the switch
cannot do some feature, pass the frame to Linux so it can handle it.

You need to keep in mind that there could be other ports in the bridge
than switch ports, and those ports might be interested in the
multicast traffic. Hence the CPU needs to see the traffic. But IGMP
snooping can be used to optimise this. But you still need to be
careful, eg. IPv6 Neighbour discovery has often been broken on
mv88e6xxx because we have been too aggressive with filtering
multicast.

	Andrew

^ permalink raw reply

* Re: [PATCH V2 net-next 07/11] net: hns3: adds debug messages to identify eth down cause
From: Joe Perches @ 2019-07-27  3:14 UTC (permalink / raw)
  To: liuyonglong, Saeed Mahameed, tanhuazhong@huawei.com,
	davem@davemloft.net
  Cc: lipeng321@huawei.com, yisen.zhuang@huawei.com,
	salil.mehta@huawei.com, linuxarm@huawei.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <f517dc69-6356-98fe-fb7a-0427728814bb@huawei.com>

On Sat, 2019-07-27 at 10:28 +0800, liuyonglong wrote:
> On 2019/7/27 6:18, Joe Perches wrote:
> > On Fri, 2019-07-26 at 22:00 +0000, Saeed Mahameed wrote:
> > > On Fri, 2019-07-26 at 11:24 +0800, Huazhong Tan wrote:
> > > > From: Yonglong Liu <liuyonglong@huawei.com>
> > > > 
> > > > Some times just see the eth interface have been down/up via
> > > > dmesg, but can not know why the eth down. So adds some debug
> > > > messages to identify the cause for this.
> > []
> > > > diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
> > > []
> > > > @@ -459,6 +459,10 @@ static int hns3_nic_net_open(struct net_device
> > > > *netdev)
> > > >  		h->ae_algo->ops->set_timer_task(priv->ae_handle, true);
> > > >  
> > > >  	hns3_config_xps(priv);
> > > > +
> > > > +	if (netif_msg_drv(h))
> > > > +		netdev_info(netdev, "net open\n");
> > > > +
> > > 
> > > to make sure this is only intended for debug, and to avoid repetition.
> > > #define hns3_dbg(__dev, format, args...)			\
> > > ({								\
> > > 	if (netif_msg_drv(h))					\
> > > 		netdev_info(h->netdev, format, ##args);         \
> > > })
> > 
> > 	netif_dbg(h, drv, h->netdev, "net open\n")
> > 
> 
> Hi, Saeed && Joe:
> For our cases, maybe netif_info() can be use for HNS3 drivers?
> netif_dbg need to open dynamic debug options additional.

Your code, your choice.

I do think littering dmesg with "net open" style messages
and such may be unnecessary.  KERN_DEBUG seems a more
appropriate log level.



^ permalink raw reply

* Re: [PATCH bpf-next 01/10] libbpf: add .BTF.ext offset relocation section loading
From: Andrii Nakryiko @ 2019-07-27  5:11 UTC (permalink / raw)
  To: Song Liu
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Yonghong Song, Kernel Team
In-Reply-To: <B01B98E5-CDFB-4E3A-BD58-DBA3113C3C3F@fb.com>

On Wed, Jul 24, 2019 at 10:20 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jul 24, 2019, at 5:37 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Wed, Jul 24, 2019 at 5:00 PM Song Liu <songliubraving@fb.com> wrote:
> >>
> >>
> >>
> >>> On Jul 24, 2019, at 12:27 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> >>>
> >>> Add support for BPF CO-RE offset relocations. Add section/record
> >>> iteration macros for .BTF.ext. These macro are useful for iterating over
> >>> each .BTF.ext record, either for dumping out contents or later for BPF
> >>> CO-RE relocation handling.
> >>>
> >>> To enable other parts of libbpf to work with .BTF.ext contents, moved
> >>> a bunch of type definitions into libbpf_internal.h.
> >>>
> >>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> >>> ---
> >>> tools/lib/bpf/btf.c             | 64 +++++++++--------------
> >>> tools/lib/bpf/btf.h             |  4 ++
> >>> tools/lib/bpf/libbpf_internal.h | 91 +++++++++++++++++++++++++++++++++
> >>> 3 files changed, 118 insertions(+), 41 deletions(-)
> >>>
> >
> > [...]
> >
> >>> +
> >>> static int btf_ext_parse_hdr(__u8 *data, __u32 data_size)
> >>> {
> >>>      const struct btf_ext_header *hdr = (struct btf_ext_header *)data;
> >>> @@ -1004,6 +979,13 @@ struct btf_ext *btf_ext__new(__u8 *data, __u32 size)
> >>>      if (err)
> >>>              goto done;
> >>>
> >>> +     /* check if there is offset_reloc_off/offset_reloc_len fields */
> >>> +     if (btf_ext->hdr->hdr_len < sizeof(struct btf_ext_header))
> >>
> >> This check will break when we add more optional sections to btf_ext_header.
> >> Maybe use offsetof() instead?
> >
> > I didn't do it, because there are no fields after offset_reloc_len.
> > But now I though that maybe it would be ok to add zero-sized marker
> > field, kind of like marking off various versions of btf_ext header?
> >
> > Alternatively, I can add offsetofend() macro somewhere in libbpf_internal.h.
> >
> > Do you have any preference?
>
> We only need a stable number to compare against. offsetofend() works.
> Or we can simply have something like
>
>     if (btf_ext->hdr->hdr_len <= offsetof(struct btf_ext_header, offset_reloc_off))
>           goto done;
> or
>     if (btf_ext->hdr->hdr_len < offsetof(struct btf_ext_header, offset_reloc_len))
>           goto done;
>
> Does this make sense?

I think offsetofend() is the cleanest solution, I'll do just that.

>
> Thanks,
> Song

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox