* Re: [RFC v2 bpf-next 8/9] bpf: Provide helper to do lookups in kernel FIB table
From: David Miller @ 2018-05-02 17:00 UTC (permalink / raw)
To: dsahern
Cc: brouer, netdev, borkmann, ast, shm, roopa, toke, john.fastabend,
bernat
In-Reply-To: <4a92f04a-6865-ae70-1c54-e4b92d886491@gmail.com>
From: David Ahern <dsahern@gmail.com>
Date: Wed, 2 May 2018 09:37:21 -0600
> To share numbers from recent testing I did using Vincent's modules,
> lookup times in nsec (using local_clock) with MULTIPLE_TABLES config
> disabled for IPv4 and IPv6
>
> IPv4 IPv6-dst IPv6-fib6
> baseline 49 126 52
>
> I have other cases with combinations of configs and rules, but this
> shows the best possible case.
>
> IPv6 needs some more work to improve speeds with MULTIPLE_TABLES enabled
> (separate local and main tables unlike IPv4) and IPV6_SUBTREES enabled.
Yes, like for ipv4 sharing local and main tables will help a lot.
^ permalink raw reply
* Re: [PATCH bpf-next 0/3] bpf: cleanups on managing subprog information
From: Jiong Wang @ 2018-05-02 16:59 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: borkmann, ecree, netdev, oss-drivers
In-Reply-To: <20180501222257.cogbkcsncrmg54p5@ast-mbp>
On 01/05/2018 23:22, Alexei Starovoitov wrote:
...
> [ 27.784931] ? bpf_int_jit_compile+0x7ac/0xab0
> [ 27.785475] bpf_int_jit_compile+0x2b6/0xab0
> [ 27.786001] ? do_jit+0x6020/0x6020
> [ 27.786428] ? kasan_kmalloc+0xa0/0xd0
> [ 27.786885] bpf_check+0x2c05/0x4c40
> [ 27.787346] ? fixup_bpf_calls+0x1140/0x1140
> [ 27.787865] ? kasan_unpoison_shadow+0x30/0x40
> [ 27.788406] ? kasan_kmalloc+0xa0/0xd0
> [ 27.788865] ? memset+0x1f/0x40
> [ 27.789255] ? bpf_obj_name_cpy+0x2d/0x200
> [ 27.789750] bpf_prog_load+0xb07/0xeb0
>
> simply running test_verifier with JIT and kasan on.
Ah, sorry, I should add "sysctl net/core/bpf_jit_enable=1" to my test
script, error reproduced.
convert_ctx_accesses and fixup_bpf_calls might insert ebpf insns that
prog->len would change.
The new fake "exit" subprog whose .start offset is prog->len should be
updated as well.
The "for" condition in adjust_subprog_starts:
for (i = 0; i < env->subprog_cnt; i++) {
need to be changed into:
for (i = 0; i <= env->subprog_cnt; i++) {
Will respin the patch set.
Thanks.
Regards,
Jiong
^ permalink raw reply
* Re: [PATCH bpf-next 1/2] bpf: enable stackmap with build_id in nmi context
From: Song Liu @ 2018-05-02 16:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Networking, Kernel Team, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20180502092109.GI12180@hirez.programming.kicks-ass.net>
> On May 2, 2018, at 2:21 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, May 01, 2018 at 05:02:19PM -0700, Song Liu wrote:
>> @@ -267,17 +285,27 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>> {
>> int i;
>> struct vm_area_struct *vma;
>> + bool in_nmi_ctx = in_nmi();
>> + bool irq_work_busy = false;
>> + struct stack_map_irq_work *work;
>> +
>> + if (in_nmi_ctx) {
>> + work = this_cpu_ptr(&irq_work);
>> + if (work->sem)
>> + /* cannot queue more up_read, fallback */
>> + irq_work_busy = true;
>> + }
>>
>> /*
>> + * We cannot do up_read() in nmi context. To do build_id lookup
>> + * in nmi context, we need to run up_read() in irq_work. We use
>> + * a percpu variable to do the irq_work. If the irq_work is
>> + * already used by another lookup, we fall back to report ips.
>> *
>> * Same fallback is used for kernel stack (!user) on a stackmap
>> * with build_id.
>> */
>> + if (!user || !current || !current->mm || irq_work_busy ||
>> down_read_trylock(¤t->mm->mmap_sem) == 0) {
>> /* cannot access current->mm, fall back to ips */
>> for (i = 0; i < trace_nr; i++) {
>> @@ -299,7 +327,13 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>> - vma->vm_start;
>> id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
>> }
>> +
>> + if (!in_nmi_ctx)
>> + up_read(¤t->mm->mmap_sem);
>> + else {
>> + work->sem = ¤t->mm->mmap_sem;
>> + irq_work_queue(&work->work);
>> + }
>> }
>
> This is disguisting.. :-)
>
> It's broken though, I've bet you've never actually ran this with lockdep
> enabled for example.
I am not following here. I just run the new selftest with CONFIG_LOCKDEP on,
and got no warning for this.
> Also, you set work->sem before you do trylock, if the trylock fails you
> return early and keep work->sem set, which will thereafter always result
> in irq_work_busy.
work->sem was set after down_read_trylock(). I guess you misread the patch?
Thanks,
Song
^ permalink raw reply
* Re: [PATCH net-next 1/4] ipv6: Calculate hash thresholds for IPv6 nexthops
From: Eric Dumazet @ 2018-05-02 16:43 UTC (permalink / raw)
To: David Ahern, Ido Schimmel, netdev
Cc: davem, roopa, nikolay, pch, jkbs, yoshfuji, mlxsw
In-Reply-To: <5550c628-5014-427b-60c9-71cf80462723@gmail.com>
On 01/09/2018 07:43 PM, David Ahern wrote:
> On 1/9/18 7:40 AM, Ido Schimmel wrote:
>> Before we convert IPv6 to use hash-threshold instead of modulo-N, we
>> first need each nexthop to store its region boundary in the hash
>> function's output space.
>>
>> The boundary is calculated by dividing the output space equally between
>> the different active nexthops. That is, nexthops that are not dead or
>> linkdown.
>>
>> The boundaries are rebalanced whenever a nexthop is added or removed to
>> a multipath route and whenever a nexthop becomes active or inactive.
>>
>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>> ---
>> include/net/ip6_fib.h | 1 +
>> include/net/ip6_route.h | 7 ++++
>> net/ipv6/ip6_fib.c | 8 ++---
>> net/ipv6/route.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 106 insertions(+), 6 deletions(-)
>>
>
> LGTM.
> Acked-by: David Ahern <dsahern@gmail.com>
>
For some reason I have a divide by zero error booting my hosts with latest net tree.
What guarantee do we have that total is not zero when rt6_upper_bound_set() is called ?
[ 8.498639] divide error: 0000 [#1] SMP PTI
[ 8.503178] gsmi: Log Shutdown Reason 0x03
[ 8.507270] Modules linked in: bnx2x mdio
[ 8.511276] CPU: 17 PID: 116 Comm: kworker/17:0 Not tainted 4.17.0-smp-DEV #110
[ 8.518571] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 2.40.0 06/22/2016
[ 8.525526] Workqueue: ipv6_addrconf addrconf_dad_work
[ 8.530662] RIP: 0010:rt6_multipath_rebalance.part.82+0x1cb/0x1f0
[ 8.536752] RSP: 0018:ffffba72867cbbf8 EFLAGS: 00010246
[ 8.541966] RAX: 0000000000000000 RBX: 0000000000000025 RCX: ffff9d555ab73180
[ 8.549090] RDX: 0000000000000000 RSI: ffff9d4d5a34b1c0 RDI: 0000000000000000
[ 8.556212] RBP: ffffba72867cbc00 R08: 0000000000000000 R09: 0000000000000000
[ 8.563336] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9d5559f95680
[ 8.570457] R13: ffff9d4d5a34b1c0 R14: ffff9d555ab73180 R15: 0000000000000000
[ 8.577579] FS: 0000000000000000(0000) GS:ffff9d4d5fc40000(0000) knlGS:0000000000000000
[ 8.585654] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8.591391] CR2: 00007fffe47ff000 CR3: 0000000c39c0a001 CR4: 00000000000606e0
[ 8.598515] Call Trace:
[ 8.600961] ? rt6_multipath_rebalance+0x21/0x30
[ 8.605579] fib6_add+0x75f/0xf70
[ 8.608899] ? __wake_up+0x13/0x20
[ 8.612303] ? netlink_broadcast_filtered+0x14c/0x3c0
[ 8.617355] __ip6_ins_rt+0x4c/0x70
[ 8.620847] ip6_ins_rt+0x6e/0xa0
[ 8.624157] __ipv6_ifa_notify+0x226/0x2e0
[ 8.628249] ipv6_ifa_notify+0x2a/0x40
[ 8.631999] addrconf_dad_completed+0x59/0x360
[ 8.636438] addrconf_dad_work+0x11c/0x400
[ 8.640536] ? addrconf_dad_work+0x11c/0x400
[ 8.644810] process_one_work+0x184/0x370
[ 8.648820] ? process_one_work+0x184/0x370
[ 8.652996] worker_thread+0x35/0x3a0
[ 8.656654] kthread+0x121/0x140
[ 8.659887] ? process_one_work+0x370/0x370
[ 8.664073] ? kthread_create_worker_on_cpu+0x70/0x70
[ 8.669118] ret_from_fork+0x35/0x40
[ 8.672693] Code: c3 8b b9 38 01 00 00 eb aa 48 63 81 38 01 00 00 89 fa 41 89 f8 c1 ea 1f 01 fa d1 fa 48 63 d2 49 89 c2 48 c1 e0 1f 48 01 d0 31 d2 <49> f7 f0 83 e8 01 48 39 ce 89 81 b4 00 00 00 0f 85 e2 fe ff ff
[ 8.691533] RIP: rt6_multipath_rebalance.part.82+0x1cb/0x1f0 RSP: ffffba72867cbbf8
[ 8.699135] ---[ end trace 9ae26819121cdc3a ]---
[ 8.703760] Kernel panic - not syncing: Fatal exception in interrupt
[ 8.710169] Kernel Offset: 0x3d200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 8.721256] gsmi: Log Shutdown Reason 0x02
[ 8.725357] Rebooting in 10 seconds..
^ permalink raw reply
* Re: [PATCH net-next v9 2/4] net: Introduce generic failover module
From: Jiri Pirko @ 2018-05-02 16:15 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh, aaron.f.brown
In-Reply-To: <20180428090601.GL5632@nanopsycho.orion>
Sat, Apr 28, 2018 at 11:06:01AM CEST, jiri@resnulli.us wrote:
>Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudrala@intel.com wrote:
[...]
>>+
>>+ err = netdev_rx_handler_register(slave_dev, net_failover_handle_frame,
>>+ failover_dev);
>>+ if (err) {
>>+ netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
>>+ err);
>>+ goto err_handler_register;
>>+ }
>>+
>>+ err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
>
>Please use netdev_master_upper_dev_link().
Don't forget to fillup struct netdev_lag_upper_info - NETDEV_LAG_TX_TYPE_ACTIVEBACKUP
Also, please call netdev_lower_state_changed() when the active slave
device changes from primary->backup of backup->primary and whenever link
state of a slave changes
[...]
^ permalink raw reply
* Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-05-02 16:05 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: alexander.h.duyck, virtio-dev, mst, kubakici, netdev,
virtualization, loseweigh, aaron.f.brown, davem
In-Reply-To: <c94ce6a9-6f36-675c-cf5b-414454fc2244@intel.com>
Wed, May 02, 2018 at 05:34:44PM CEST, sridhar.samudrala@intel.com wrote:
>On 5/2/2018 12:50 AM, Jiri Pirko wrote:
>> Wed, May 02, 2018 at 02:20:26AM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/30/2018 12:20 AM, Jiri Pirko wrote:
>> > > > > Now I try to change mac of the failover master:
>> > > > > [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
>> > > > > RTNETLINK answers: Operation not supported
>> > > > >
>> > > > > That I did expect to work. I would expect this would change the mac of
>> > > > > the master and both standby and primary slaves.
>> > > > If a VF is untrusted, a VM will not able to change its MAC and moreover
>> > > Note that at this point, I have no VF. So I'm not sure why you mention
>> > > that.
>> > >
>> > >
>> > > > in this mode we are assuming that the hypervisor has assigned the MAC and
>> > > > guest is not expected to change the MAC.
>> > > Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
>> > > mac and all works fine. How is this different? Change mac on "failover
>> > > instance" should work and should propagate the mac down to its slaves.
>> > >
>> > >
>> > > > For the initial implementation, i would propose not allowing the guest to
>> > > > change the MAC of failover or standby dev.
>> > > I see no reason for such restriction.
>> > >
>> > It is true that a VM user can change mac address of a normal virtio-net interface,
>> > however when it is in STANDBY mode i think we should not allow this change specifically
>> > because we are creating a failover instance based on a MAC that is assigned by the
>> > hypervisor.
>> >
>> > Moreover, in a cloud environment i would think that PF/hypervisor assigns a MAC to
>> > the VF and it cannot be changed by the guest.
>> So that is easy. You allow the change of the mac and in the "failover"
>> mac change implementation you propagate the change down to slaves. If
>> one slave does not support the change, you bail out. And since VF does
>> not allow it as you say, once it will be enslaved, the mac change could
>> not be done. Seems like a correct behavior to me and is in-sync with how
>> bond/team behaves.
>
>If we allow the mac to be changed when the VF is not yet plugged in
>and the guest changes the mac, then VF cannot join the failover when
>it is hot plugged with the original mac assigned by the hypervisor.
>Specifically there could be timing window during the live migration where
>VF would be unplugged at the source and if we allow the guest to change the
>failover mac, then VF would not be able to register with the failover after
>the VM is migrated to the destination.
Okay. So as suggested by mst, just forbid the mac changes all together.
>
>>
>> > So for the initial implementation, do you see any issues with having this restriction
>> > in STANDBY mode.
>> >
>> >
>
^ permalink raw reply
* Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-05-02 16:04 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Samudrala, Sridhar, stephen, davem, netdev, virtualization,
virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
jasowang, loseweigh, aaron.f.brown
In-Reply-To: <20180502184326-mutt-send-email-mst@kernel.org>
Wed, May 02, 2018 at 05:47:27PM CEST, mst@redhat.com wrote:
>On Wed, May 02, 2018 at 09:50:21AM +0200, Jiri Pirko wrote:
>> Wed, May 02, 2018 at 02:20:26AM CEST, sridhar.samudrala@intel.com wrote:
>> >On 4/30/2018 12:20 AM, Jiri Pirko wrote:
>> >>
>> >> > > Now I try to change mac of the failover master:
>> >> > > [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
>> >> > > RTNETLINK answers: Operation not supported
>> >> > >
>> >> > > That I did expect to work. I would expect this would change the mac of
>> >> > > the master and both standby and primary slaves.
>> >> > If a VF is untrusted, a VM will not able to change its MAC and moreover
>> >> Note that at this point, I have no VF. So I'm not sure why you mention
>> >> that.
>> >>
>> >>
>> >> > in this mode we are assuming that the hypervisor has assigned the MAC and
>> >> > guest is not expected to change the MAC.
>> >> Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
>> >> mac and all works fine. How is this different? Change mac on "failover
>> >> instance" should work and should propagate the mac down to its slaves.
>> >>
>> >>
>> >> > For the initial implementation, i would propose not allowing the guest to
>> >> > change the MAC of failover or standby dev.
>> >> I see no reason for such restriction.
>> >>
>> >
>> >It is true that a VM user can change mac address of a normal virtio-net interface,
>> >however when it is in STANDBY mode i think we should not allow this change specifically
>> >because we are creating a failover instance based on a MAC that is assigned by the
>> >hypervisor.
>> >
>> >Moreover, in a cloud environment i would think that PF/hypervisor assigns a MAC to
>> >the VF and it cannot be changed by the guest.
>>
>> So that is easy. You allow the change of the mac and in the "failover"
>> mac change implementation you propagate the change down to slaves. If
>> one slave does not support the change, you bail out. And since VF does
>
>I wish people would say primary/standby and not "VF" :)
Sure, sorry.
>
>> not allow it as you say, once it will be enslaved, the mac change could
>> not be done. Seems like a correct behavior to me
>
>
>what if primary does not allow mac changes and is attached after
>mac is changed on standy?
Mac should be changed on failover. In that case, the primary would have
a different mac and therefore it won't get enslaved.
>
>
>> and is in-sync with how
>> bond/team behaves.
>
>I think in the end virtio can just block MAC changes for simplicity.
>
>I personally don't see softmac as a must have feature in v1,
>we can add it later.
Okay.
>
>What's the situation with init scripts and whether it's
>possible to make them work well would be a better question.
>
>>
>> >
>> >So for the initial implementation, do you see any issues with having this restriction
>> >in STANDBY mode.
>> >
>> >
^ permalink raw reply
* Re: [PATCH bpf-next v2] bpf: relax constraints on formatting for eBPF helper documentation
From: Daniel Borkmann @ 2018-05-02 15:48 UTC (permalink / raw)
To: Quentin Monnet, ast; +Cc: dsahern, yhs, ecree, netdev, oss-drivers
In-Reply-To: <20180502132024.14550-1-quentin.monnet@netronome.com>
On 05/02/2018 03:20 PM, Quentin Monnet wrote:
> The Python script used to parse and extract eBPF helpers documentation
> from include/uapi/linux/bpf.h expects a very specific formatting for the
> descriptions (single dot represents a space, '>' stands for a tab):
>
> /*
> ...
> *.int bpf_helper(list of arguments)
> *.> Description
> *.> > Start of description
> *.> > Another line of description
> *.> > And yet another line of description
> *.> Return
> *.> > 0 on success, or a negative error in case of failure
> ...
> */
>
> This is too strict, and painful for developers who wants to add
> documentation for new helpers. Worse, it is extremely difficult to check
> that the formatting is correct during reviews. Change the format
> expected by the script and make it more flexible. The script now works
> whether or not the initial space (right after the star) is present, and
> accepts both tabs and white spaces (or a combination of both) for
> indenting description sections and contents.
>
> Concretely, something like the following would now be supported:
>
> /*
> ...
> *int bpf_helper(list of arguments)
> *......Description
> *.> > Start of description...
> *> > Another line of description
> *..............And yet another line of description
> *> Return
> *.> ........0 on success, or a negative error in case of failure
> ...
> */
>
> While at it, remove unnecessary carets from each regex used with match()
> in the script. They are redundant, as match() tries to match from the
> beginning of the string by default.
>
> v2: Remove unnecessary caret when a regex is used with match().
>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Applied to bpf-next, thanks Quentin!
^ permalink raw reply
* Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Michael S. Tsirkin @ 2018-05-02 15:47 UTC (permalink / raw)
To: Jiri Pirko
Cc: Samudrala, Sridhar, stephen, davem, netdev, virtualization,
virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
jasowang, loseweigh, aaron.f.brown
In-Reply-To: <20180502075021.GC19250@nanopsycho>
On Wed, May 02, 2018 at 09:50:21AM +0200, Jiri Pirko wrote:
> Wed, May 02, 2018 at 02:20:26AM CEST, sridhar.samudrala@intel.com wrote:
> >On 4/30/2018 12:20 AM, Jiri Pirko wrote:
> >>
> >> > > Now I try to change mac of the failover master:
> >> > > [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
> >> > > RTNETLINK answers: Operation not supported
> >> > >
> >> > > That I did expect to work. I would expect this would change the mac of
> >> > > the master and both standby and primary slaves.
> >> > If a VF is untrusted, a VM will not able to change its MAC and moreover
> >> Note that at this point, I have no VF. So I'm not sure why you mention
> >> that.
> >>
> >>
> >> > in this mode we are assuming that the hypervisor has assigned the MAC and
> >> > guest is not expected to change the MAC.
> >> Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
> >> mac and all works fine. How is this different? Change mac on "failover
> >> instance" should work and should propagate the mac down to its slaves.
> >>
> >>
> >> > For the initial implementation, i would propose not allowing the guest to
> >> > change the MAC of failover or standby dev.
> >> I see no reason for such restriction.
> >>
> >
> >It is true that a VM user can change mac address of a normal virtio-net interface,
> >however when it is in STANDBY mode i think we should not allow this change specifically
> >because we are creating a failover instance based on a MAC that is assigned by the
> >hypervisor.
> >
> >Moreover, in a cloud environment i would think that PF/hypervisor assigns a MAC to
> >the VF and it cannot be changed by the guest.
>
> So that is easy. You allow the change of the mac and in the "failover"
> mac change implementation you propagate the change down to slaves. If
> one slave does not support the change, you bail out. And since VF does
I wish people would say primary/standby and not "VF" :)
> not allow it as you say, once it will be enslaved, the mac change could
> not be done. Seems like a correct behavior to me
what if primary does not allow mac changes and is attached after
mac is changed on standy?
> and is in-sync with how
> bond/team behaves.
I think in the end virtio can just block MAC changes for simplicity.
I personally don't see softmac as a must have feature in v1,
we can add it later.
What's the situation with init scripts and whether it's
possible to make them work well would be a better question.
>
> >
> >So for the initial implementation, do you see any issues with having this restriction
> >in STANDBY mode.
> >
> >
^ permalink raw reply
* Re: [PATCH v3 2/3] selftests/bpf: Makefile: add include path
From: Daniel Borkmann @ 2018-05-02 15:43 UTC (permalink / raw)
To: Sirio Balmelli; +Cc: netdev
In-Reply-To: <20180502110505.7n567iqy6duuzjbs@vm4>
On 05/02/2018 01:05 PM, Sirio Balmelli wrote:
> On some systems selftests fail to build, missing the following headers:
>
> asm/byteorder.h
> asm/socket.h
> asm/swab.h
>
> In the specific case of Ubuntu, this is because the files are in
> '/usr/include/x86_64-linux-gnu' (see <https://wiki.ubuntu.com/MultiarchSpec>)
> which is both architecture- and distro-specific.
>
> The solution is to add $(KERNEL)/usr/include to the Makefile,
> so the build references these from the current kernel build
> and not the running system.
>
> Signed-off-by: Sirio Balmelli <sirio@b-ad.ch>
> ---
> tools/testing/selftests/bpf/Makefile | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 9d76218..1ec09c6 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -84,7 +84,10 @@ else
> CPU ?= generic
> endif
>
> -CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \
> +# we are in 'tools/testing/selftests/bpf'
> +KERNEL=../../../..
> +TOOLS=../../..
> +CLANG_FLAGS = -I. -I./include/uapi -I$(TOOLS)/include/uapi -I$(KERNEL)/usr/include \
> -Wno-compare-distinct-pointer-types
First patch in the series looks good to me, thanks Sirio! Problem with this
one is still as described earlier in: http://patchwork.ozlabs.org/patch/906505/
This will break people's setup and build bots since they expect headers to be
available from tools infra and not -I<kernel-root>/usr/include/ via headers_install,
so adding the latter is not possible (unless Arnaldo agrees to rework the whole tools
include infra in a better way). This means the only other option we have is to pull
them into tools/ instead, even if this results in a lot of duplication, but whether
we like it or not that is the way the tools/include stuff was designed along with
perf, meaning unless there's a fundamental rework, we will have to stick to it.
Thanks,
Daniel
> $(OUTPUT)/test_l4lb_noinline.o: CLANG_FLAGS += -fno-inline
>
^ permalink raw reply
* Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring
From: Michael S. Tsirkin @ 2018-05-02 15:42 UTC (permalink / raw)
To: Tiwei Bie
Cc: Jason Wang, virtualization, linux-kernel, netdev, wexu, jfreimann
In-Reply-To: <20180502151255.h3x6rhszxa3euinl@debian>
On Wed, May 02, 2018 at 11:12:55PM +0800, Tiwei Bie wrote:
> On Wed, May 02, 2018 at 04:51:01PM +0300, Michael S. Tsirkin wrote:
> > On Wed, May 02, 2018 at 03:28:19PM +0800, Tiwei Bie wrote:
> > > On Wed, May 02, 2018 at 10:51:06AM +0800, Jason Wang wrote:
> > > > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > > > This commit introduces the event idx support in packed
> > > > > ring. This feature is temporarily disabled, because the
> > > > > implementation in this patch may not work as expected,
> > > > > and some further discussions on the implementation are
> > > > > needed, e.g. do we have to check the wrap counter when
> > > > > checking whether a kick is needed?
> > > > >
> > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > ---
> > > > > drivers/virtio/virtio_ring.c | 53 ++++++++++++++++++++++++++++++++++++++++----
> > > > > 1 file changed, 49 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 0181e93897be..b1039c2985b9 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
> > > > > static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > {
> > > > > struct vring_virtqueue *vq = to_vvq(_vq);
> > > > > - u16 flags;
> > > > > + u16 new, old, off_wrap, flags;
> > > > > bool needs_kick;
> > > > > u32 snapshot;
> > > > > @@ -995,7 +995,12 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > * suppressions. */
> > > > > virtio_mb(vq->weak_barriers);
> > > > > + old = vq->next_avail_idx - vq->num_added;
> > > > > + new = vq->next_avail_idx;
> > > > > + vq->num_added = 0;
> > > > > +
> > > > > snapshot = *(u32 *)vq->vring_packed.device;
> > > > > + off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0xffff);
> > > > > flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
> > > > > #ifdef DEBUG
> > > > > @@ -1006,7 +1011,10 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > > vq->last_add_time_valid = false;
> > > > > #endif
> > > > > - needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > > > > + if (flags == VRING_EVENT_F_DESC)
> > > > > + needs_kick = vring_need_event(off_wrap & ~(1<<15), new, old);
> > > >
> > > > I wonder whether or not the math is correct. Both new and event are in the
> > > > unit of descriptor ring size, but old looks not.
> > >
> > > What vring_need_event() cares is the distance between
> > > `new` and `old`, i.e. vq->num_added. So I think there
> > > is nothing wrong with `old`. But the calculation of the
> > > distance between `new` and `event_idx` isn't right when
> > > `new` wraps. How do you think about the below code:
> > >
> > > wrap_counter = off_wrap >> 15;
> > > event_idx = off_wrap & ~(1<<15);
> > > if (wrap_counter != vq->wrap_counter)
> > > event_idx -= vq->vring_packed.num;
> > >
> > > needs_kick = vring_need_event(event_idx, new, old);
> >
> > I suspect this hack won't work for non power of 2 ring.
>
> Above code doesn't require the ring size to be a power of 2.
>
> For (__u16)(new_idx - old), what we want to get is vq->num_added.
>
> old = vq->next_avail_idx - vq->num_added;
> new = vq->next_avail_idx;
>
> When vq->next_avail_idx >= vq->num_added, it's obvious that,
> (__u16)(new_idx - old) is vq->num_added.
>
> And when vq->next_avail_idx < vq->num_added, new will be smaller
> than old (old will be a big unsigned number), but (__u16)(new_idx
> - old) is still vq->num_added.
>
> For (__u16)(new_idx - event_idx - 1), when new wraps and event_idx
> doesn't wrap, the most straightforward way to calculate it is:
> (new + vq->vring_packed.num) - event_idx - 1.
So how about we use the straightforward way then?
> But we can also calculate it in this way:
>
> event_idx -= vq->vring_packed.num;
> (event_idx will be a big unsigned number)
>
> Then (__u16)(new_idx - event_idx - 1) will be the value we want.
>
> Best regards,
> Tiwei Bie
> >
> >
> > > Best regards,
> > > Tiwei Bie
> > >
> > >
> > > >
> > > > Thanks
> > > >
> > > > > + else
> > > > > + needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > > > > END_USE(vq);
> > > > > return needs_kick;
> > > > > }
> > > > > @@ -1116,6 +1124,15 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> > > > > if (vq->last_used_idx >= vq->vring_packed.num)
> > > > > vq->last_used_idx -= vq->vring_packed.num;
> > > > > + /* If we expect an interrupt for the next entry, tell host
> > > > > + * by writing event index and flush out the write before
> > > > > + * the read in the next get_buf call. */
> > > > > + if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
> > > > > + virtio_store_mb(vq->weak_barriers,
> > > > > + &vq->vring_packed.driver->off_wrap,
> > > > > + cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
> > > > > + (vq->wrap_counter << 15)));
> > > > > +
> > > > > #ifdef DEBUG
> > > > > vq->last_add_time_valid = false;
> > > > > #endif
> > > > > @@ -1143,10 +1160,17 @@ static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> > > > > /* We optimistically turn back on interrupts, then check if there was
> > > > > * more to do. */
> > > > > + /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > > > > + * either clear the flags bit or point the event index at the next
> > > > > + * entry. Always update the event index to keep code simple. */
> > > > > +
> > > > > + vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > > > > + vq->last_used_idx | (vq->wrap_counter << 15));
> > > > > if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > > > > virtio_wmb(vq->weak_barriers);
> > > > > - vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > > > > + vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > > > > + VRING_EVENT_F_ENABLE;
> > > > > vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > > > > vq->event_flags_shadow);
> > > > > }
> > > > > @@ -1172,15 +1196,34 @@ static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
> > > > > static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> > > > > {
> > > > > struct vring_virtqueue *vq = to_vvq(_vq);
> > > > > + u16 bufs, used_idx, wrap_counter;
> > > > > START_USE(vq);
> > > > > /* We optimistically turn back on interrupts, then check if there was
> > > > > * more to do. */
> > > > > + /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > > > > + * either clear the flags bit or point the event index at the next
> > > > > + * entry. Always update the event index to keep code simple. */
> > > > > +
> > > > > + /* TODO: tune this threshold */
> > > > > + bufs = (u16)(vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
> > > > > +
> > > > > + used_idx = vq->last_used_idx + bufs;
> > > > > + wrap_counter = vq->wrap_counter;
> > > > > +
> > > > > + if (used_idx >= vq->vring_packed.num) {
> > > > > + used_idx -= vq->vring_packed.num;
> > > > > + wrap_counter ^= 1;
> > > > > + }
> > > > > +
> > > > > + vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > > > > + used_idx | (wrap_counter << 15));
> > > > > if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > > > > virtio_wmb(vq->weak_barriers);
> > > > > - vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > > > > + vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > > > > + VRING_EVENT_F_ENABLE;
> > > > > vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > > > > vq->event_flags_shadow);
> > > > > }
> > > > > @@ -1822,8 +1865,10 @@ void vring_transport_features(struct virtio_device *vdev)
> > > > > switch (i) {
> > > > > case VIRTIO_RING_F_INDIRECT_DESC:
> > > > > break;
> > > > > +#if 0
> > > > > case VIRTIO_RING_F_EVENT_IDX:
> > > > > break;
> > > > > +#endif
> > > > > case VIRTIO_F_VERSION_1:
> > > > > break;
> > > > > case VIRTIO_F_IOMMU_PLATFORM:
> > > >
^ permalink raw reply
* Re: [RFC v2 bpf-next 9/9] samples/bpf: Add examples of ipv4 and ipv6 forwarding in XDP
From: David Ahern @ 2018-05-02 15:40 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, borkmann, ast, davem, shm, roopa, toke, john.fastabend,
Tariq Toukan
In-Reply-To: <20180502131306.07c3acee@redhat.com>
On 5/2/18 5:13 AM, Jesper Dangaard Brouer wrote:
>
> On Sun, 29 Apr 2018 11:07:52 -0700 David Ahern <dsahern@gmail.com> wrote:
>
>> + /* verify egress index has xdp support */
>> + // TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with
>> + // cannot pass map_type 14 into func bpf_map_lookup_elem#1:
>
> I just want to point out that I/we are aware of this "usability"
> problem with the sample program, but I don't want to block the FIB
> helper going upstream, we can fix this problem later.
>
> The problem is that if you load this bpf/xdp prog, then all incoming
> traffic (on that interface), will be forward using XDP, regardless
> whether the egress ifindex support XDP or not. And if not supported,
> then packets are dropped hard (only detectable via tracepoints).
>
> If the bpf prog could tell/know that the egress ifindex doesn't
> support XDP xmit, then it could simply return XDP_PASS to fallback to
> the normal network stack.
>
That's what the above lookup is intending but DEVMAP type does not
handle lookups from xdp programs. Any chance of fixing that?
^ permalink raw reply
* Re: [RFC v2 bpf-next 8/9] bpf: Provide helper to do lookups in kernel FIB table
From: David Ahern @ 2018-05-02 15:37 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netdev, borkmann, ast, davem, shm, roopa, toke, john.fastabend,
Vincent Bernat
In-Reply-To: <20180502132736.3560fcac@redhat.com>
On 5/2/18 5:27 AM, Jesper Dangaard Brouer wrote:
> On Sun, 29 Apr 2018 11:07:51 -0700
> David Ahern <dsahern@gmail.com> wrote:
>
>> Initial performance numbers collected by Jesper, forwarded packets/sec:
>>
>> Full stack XDP FIB lookup XDP Direct lookup
>> IPv4 1,947,969 7,074,156 7,415,333
>> IPv6 1,728,000 6,165,504 7,262,720
>
> Do notice these number is single CPU core forwarding performance!
> On a Broadwell E5-1650 v4 @ 3.60GHz.
I'll add that context to the commit message. Thanks,
>
> Another interesting data point is that xdp_redirect_map performance is
> 13,365,161 pps, which allow us to calculate/isolate the overhead/cost
> of the FIB lookup.
>
> (1/13365161-1/7074156)*10^9 = -66.5 ns
> (1/13365161-1/7415333)*10^9 = -60.0 ns
>
> Which is very close to the measured 50 ns cost of the FIB lookup, done
> by Vincent Bernat.
> See: https://vincent.bernat.im/en/blog/2017-ipv4-route-lookup-linux
>
>
>
> Another way I calculate this is by (ran a new benchmark):
>
> Performance: 7641593 (7,641,593) <= tx_unicast /sec
> * Packet-gap: (1/7641593*10^9) = 130.86 ns
>
> Find all FIB related lookup functions in perf-report::
>
> Samples: 93K of event 'cycles:ppp', Event count (approx.): 88553104553
> Overhead Cost CPU Command Symbol
> 20.63 % 26.99 ns 002 ksoftirqd/2 [k] fib_table_lookup
> 12.92 % 16.90 ns 002 ksoftirqd/2 [k] bpf_fib_lookup
> 2.40 % 3.14 ns 002 ksoftirqd/2 [k] fib_select_path
> 0.83 % 1.09 ns 002 ksoftirqd/2 [k] fib_get_table
> 0.40 % 0.52 ns 002 ksoftirqd/2 [k] l3mdev_fib_table_rcu
> -----------
> Tot:37.18 % (20.63+12.92+2.40+0.83+0.40)
> ===========
>
> Cost of FIB lookup:
> - 130.86/100*37.18 = 48.65 ns overhead by FIB lookup.
>
> Again very close to Vincent's IPv4 measurements of ~50 ns.
>
>
>
> Notice that the IPv6 measurements does not match up:
> https://vincent.bernat.im/en/blog/2017-ipv6-route-lookup-linux
> This is because, we/I'm just testing the IPv6 route cache here...
>
Vincent's blog is before recent changes -- 4.15 getting the rcu locking,
net-next getting separate fib entries and now this set adding a FIB
lookup without the dst.
To share numbers from recent testing I did using Vincent's modules,
lookup times in nsec (using local_clock) with MULTIPLE_TABLES config
disabled for IPv4 and IPv6
IPv4 IPv6-dst IPv6-fib6
baseline 49 126 52
I have other cases with combinations of configs and rules, but this
shows the best possible case.
IPv6 needs some more work to improve speeds with MULTIPLE_TABLES enabled
(separate local and main tables unlike IPv4) and IPV6_SUBTREES enabled.
^ permalink raw reply
* Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-05-02 15:34 UTC (permalink / raw)
To: Jiri Pirko
Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh, aaron.f.brown
In-Reply-To: <20180502075021.GC19250@nanopsycho>
On 5/2/2018 12:50 AM, Jiri Pirko wrote:
> Wed, May 02, 2018 at 02:20:26AM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/30/2018 12:20 AM, Jiri Pirko wrote:
>>>>> Now I try to change mac of the failover master:
>>>>> [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
>>>>> RTNETLINK answers: Operation not supported
>>>>>
>>>>> That I did expect to work. I would expect this would change the mac of
>>>>> the master and both standby and primary slaves.
>>>> If a VF is untrusted, a VM will not able to change its MAC and moreover
>>> Note that at this point, I have no VF. So I'm not sure why you mention
>>> that.
>>>
>>>
>>>> in this mode we are assuming that the hypervisor has assigned the MAC and
>>>> guest is not expected to change the MAC.
>>> Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
>>> mac and all works fine. How is this different? Change mac on "failover
>>> instance" should work and should propagate the mac down to its slaves.
>>>
>>>
>>>> For the initial implementation, i would propose not allowing the guest to
>>>> change the MAC of failover or standby dev.
>>> I see no reason for such restriction.
>>>
>> It is true that a VM user can change mac address of a normal virtio-net interface,
>> however when it is in STANDBY mode i think we should not allow this change specifically
>> because we are creating a failover instance based on a MAC that is assigned by the
>> hypervisor.
>>
>> Moreover, in a cloud environment i would think that PF/hypervisor assigns a MAC to
>> the VF and it cannot be changed by the guest.
> So that is easy. You allow the change of the mac and in the "failover"
> mac change implementation you propagate the change down to slaves. If
> one slave does not support the change, you bail out. And since VF does
> not allow it as you say, once it will be enslaved, the mac change could
> not be done. Seems like a correct behavior to me and is in-sync with how
> bond/team behaves.
If we allow the mac to be changed when the VF is not yet plugged in
and the guest changes the mac, then VF cannot join the failover when
it is hot plugged with the original mac assigned by the hypervisor.
Specifically there could be timing window during the live migration where
VF would be unplugged at the source and if we allow the guest to change the
failover mac, then VF would not be able to register with the failover after
the VM is migrated to the destination.
>
>> So for the initial implementation, do you see any issues with having this restriction
>> in STANDBY mode.
>>
>>
^ permalink raw reply
* Re: [PATCH net-next v2 2/2] mlxsw: spectrum_router: Return an error for routes added after abort
From: David Ahern @ 2018-05-02 15:28 UTC (permalink / raw)
To: Ido Schimmel, netdev; +Cc: davem, jiri, mlxsw
In-Reply-To: <20180502071735.32352-3-idosch@mellanox.com>
On 5/2/18 1:17 AM, Ido Schimmel wrote:
> We currently do not perform accounting in the driver and thus can't
> reject routes before resources are exceeded.
>
> However, in order to make users aware of the fact that routes are no
> longer offloaded we can return an error for routes configured after the
> abort mechanism was triggered.
>
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> ---
> drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> index added380e344..8028d221aece 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> @@ -5928,6 +5928,13 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
> router->mlxsw_sp);
> if (!err || info->extack)
> return notifier_from_errno(err);
> + break;
> + case FIB_EVENT_ENTRY_ADD:
> + if (router->aborted) {
> + NL_SET_ERR_MSG_MOD(info->extack, "FIB offload was aborted. Not configuring route");
> + return notifier_from_errno(-EINVAL);
> + }
> + break;
> }
>
> fib_work = kzalloc(sizeof(*fib_work), GFP_ATOMIC);
>
Reasonable next step.
Acked-by: David Ahern <dsahern@gmail.com>
^ permalink raw reply
* Re: [PATCH net-next v2 1/2] mlxsw: spectrum_router: Return an error for non-default FIB rules
From: David Ahern @ 2018-05-02 15:26 UTC (permalink / raw)
To: Ido Schimmel, netdev; +Cc: davem, jiri, mlxsw
In-Reply-To: <20180502071735.32352-2-idosch@mellanox.com>
On 5/2/18 1:17 AM, Ido Schimmel wrote:
> Since commit 9776d32537d2 ("net: Move call_fib_rule_notifiers up in
> fib_nl_newrule") it is possible to forbid the installation of
> unsupported FIB rules.
>
> Have mlxsw return an error for non-default FIB rules in addition to the
> existing extack message.
>
> Example:
> # ip rule add from 198.51.100.1 table 10
> Error: mlxsw_spectrum: FIB rules not supported.
Next, I think we need a user override flag for rules to prevent hardware
from limiting the rules that can be added and aborting offload. For
example, if the user knows that a rule is meant to direct traffic out of
a management VRF and is unrelated to front panel ports the offload
should ignore it.
>
> Note that offload is only aborted when non-default FIB rules are already
> installed and merely replayed during module initialization.
>
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> ---
> drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
Acked-by: David Ahern <dsahern@gmail.com>
^ permalink raw reply
* Re: [PATCH iproute2-next] Add tc-BPF example for a TCP pure ack recognizer
From: David Ahern @ 2018-05-02 15:22 UTC (permalink / raw)
To: Dave Taht; +Cc: Linux Kernel Network Developers
In-Reply-To: <CAA93jw7d6oiushFda=Z7nY6xCeVFWk9cBcJRUaCtjtCLcfJTXQ@mail.gmail.com>
On 5/1/18 10:51 PM, Dave Taht wrote:
> On Tue, May 1, 2018 at 9:12 PM, David Ahern <dsahern@gmail.com> wrote:
>> On 5/1/18 12:32 PM, Dave Taht wrote:
>>> ack_recognize can shift pure tcp acks into another flowid.
>>> ---
>>> examples/bpf/ack_recognize.c | 98 ++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 98 insertions(+)
>>> create mode 100644 examples/bpf/ack_recognize.c
>>>
>>> diff --git a/examples/bpf/ack_recognize.c b/examples/bpf/ack_recognize.c
>>> new file mode 100644
>>> index 0000000..5da620c
>>> --- /dev/null
>>> +++ b/examples/bpf/ack_recognize.c
>>> @@ -0,0 +1,98 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +/*
>>> + * Copyright 2017 Google Inc.
>>
>> 2017?? it's 2018. Did you write this last year and just now posting?
>
> November, 2017? Shortly prior to you taking iproute2-next off of steven's hands.
ok, just checking
>
> It was the first stage of a proof of concept showing (eventually) that
> a simple ack decimator/filter (like "drop every other ack", or simple
> "drop head" in the supplied example) had a ton of problems, and that
> to filter out acks correctly to any extent, it needed to peer back
> into the queue. (see the sch_cake ack-filter controversy on-going on
> this list)
>
> While it's better than what is in wondershaper, I wouldn't recommend
> anyone use it. It was, however, useful in acquiring a gut feel for
> what several other broken ack filters might be doing in the field. I
> submitted it as a possibly useful example for showing off tc-bpf and
> to add fuel to the ack-filter fire under cake.
>
> I can clean it up, SPF it, etc, if you want it. Otherwise, sorry for the noise.
I'm fine with the example, just prefer magic numbers to be clarified.
^ permalink raw reply
* Re: [PATCH RFC iproute2-next 1/2] rdma: update rdma_netlink.h to get provider attrs
From: David Ahern @ 2018-05-02 15:17 UTC (permalink / raw)
To: Steve Wise, leon; +Cc: stephen, netdev, linux-rdma
In-Reply-To: <10f219db72dfb6e8d5d84ee009c97594b6d211c1.1525100473.git.swise@opengridcomputing.com>
On 4/30/18 8:36 AM, Steve Wise wrote:
> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> ---
> rdma/include/uapi/rdma/rdma_netlink.h | 37 ++++++++++++++++++++++++++++++++++-
> 1 file changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/rdma/include/uapi/rdma/rdma_netlink.h b/rdma/include/uapi/rdma/rdma_netlink.h
> index 45474f1..faea9d5 100644
> --- a/rdma/include/uapi/rdma/rdma_netlink.h
> +++ b/rdma/include/uapi/rdma/rdma_netlink.h
> @@ -249,10 +249,22 @@ enum rdma_nldev_command {
> RDMA_NLDEV_NUM_OPS
> };
>
> +enum {
> + RDMA_NLDEV_ATTR_ENTRY_STRLEN = 16,
> +};
> +
> +enum rdma_nldev_print_type {
> + RDMA_NLDEV_PRINT_TYPE_UNSPEC,
> + RDMA_NLDEV_PRINT_TYPE_HEX,
> +};
> +
> enum rdma_nldev_attr {
> /* don't change the order or add anything between, this is ABI! */
> RDMA_NLDEV_ATTR_UNSPEC,
>
> + /* Pad attribute for 64b alignment */
> + RDMA_NLDEV_ATTR_PAD = RDMA_NLDEV_ATTR_UNSPEC,
are you really inserting a new value here?
> +
> /* Identifier for ib_device */
> RDMA_NLDEV_ATTR_DEV_INDEX, /* u32 */
>
> @@ -387,8 +399,31 @@ enum rdma_nldev_attr {
> RDMA_NLDEV_ATTR_RES_PD_ENTRY, /* nested table */
> RDMA_NLDEV_ATTR_RES_LOCAL_DMA_LKEY, /* u32 */
> RDMA_NLDEV_ATTR_RES_UNSAFE_GLOBAL_RKEY, /* u32 */
> + /*
> + * provider-specific attributes.
> + */
> + RDMA_NLDEV_ATTR_PROVIDER, /* nested table */
> + RDMA_NLDEV_ATTR_PROVIDER_ENTRY, /* nested table */
> + RDMA_NLDEV_ATTR_PROVIDER_STRING, /* string */
> + /*
> + * u8 values from enum rdma_nldev_print_type
> + */
> + RDMA_NLDEV_ATTR_PROVIDER_PRINT_TYPE, /* u8 */
> + RDMA_NLDEV_ATTR_PROVIDER_S32, /* s32 */
> + RDMA_NLDEV_ATTR_PROVIDER_U32, /* u32 */
> + RDMA_NLDEV_ATTR_PROVIDER_S64, /* s64 */
> + RDMA_NLDEV_ATTR_PROVIDER_U64, /* u64 */
and this sequence too.
>
> - /* Netdev information for relevant protocols, like RoCE and iWARP */
> + /*
> + * Provides logical name and index of netdevice which is
> + * connected to physical port. This information is relevant
> + * for RoCE and iWARP.
> + *
> + * The netdevices which are associated with containers are
> + * supposed to be exported together with GID table once it
> + * will be exposed through the netlink. Because the
> + * associated netdevices are properties of GIDs.
> + */
> RDMA_NLDEV_ATTR_NDEV_INDEX, /* u32 */
> RDMA_NLDEV_ATTR_NDEV_NAME, /* string */
>
>
^ permalink raw reply
* RE: [PATCH V2 6/8] net: stmmac: add dwmac-4.20a compatible
From: Christophe ROULLIER @ 2018-05-02 15:16 UTC (permalink / raw)
To: Jose Abreu, mark.rutland@arm.com, mcoquelin.stm32@gmail.com,
Alexandre TORGUE, Peppe CAVALLARO
Cc: devicetree@vger.kernel.org, andrew@lunn.ch,
linux-arm-kernel@lists.infradead.org, netdev@vger.kernel.org
In-Reply-To: <cb027f0e-7209-0035-84a9-b5d7a4ec9595@synopsys.com>
Hi Jose,
>Just being curious: Can you tell me which HW features do you have on your NIC?
Nothing specific to SNPS 4.20a, it is just to be align with our SNPS IP in our NIC ;-)
BR
Christophe.
-----Original Message-----
From: Jose Abreu [mailto:Jose.Abreu@synopsys.com]
Sent: mercredi 2 mai 2018 16:41
To: Christophe ROULLIER <christophe.roullier@st.com>; mark.rutland@arm.com; mcoquelin.stm32@gmail.com; Alexandre TORGUE <alexandre.torgue@st.com>; Peppe CAVALLARO <peppe.cavallaro@st.com>
Cc: devicetree@vger.kernel.org; linux-arm-kernel@lists.infradead.org; netdev@vger.kernel.org; andrew@lunn.ch
Subject: Re: [PATCH V2 6/8] net: stmmac: add dwmac-4.20a compatible
Hi Christophe,
On 02-05-2018 15:18, Christophe Roullier wrote:
> Manage dwmac-4.20a version from synopsys
>
>
Just being curious: Can you tell me which HW features do you have on your NIC?
Thanks and Best Regards,
Jose Miguel Abreu
^ permalink raw reply
* Re: [PATCH net] sctp: fix the issue that the cookie-ack with auth can't get processed
From: David Miller @ 2018-05-02 15:16 UTC (permalink / raw)
To: lucien.xin; +Cc: netdev, linux-sctp, marcelo.leitner, nhorman
In-Reply-To: <aec2a2b5bff80a4d8eb6a9db96a36a92e429e574.1525239912.git.lucien.xin@gmail.com>
From: Xin Long <lucien.xin@gmail.com>
Date: Wed, 2 May 2018 13:45:12 +0800
> When auth is enabled for cookie-ack chunk, in sctp_inq_pop, sctp
> processes auth chunk first, then continues to the next chunk in
> this packet if chunk_end + chunk_hdr size < skb_tail_pointer().
> Otherwise, it will go to the next packet or discard this chunk.
>
> However, it missed the fact that cookie-ack chunk's size is equal
> to chunk_hdr size, which couldn't match that check, and thus this
> chunk would not get processed.
>
> This patch fixes it by changing the check to chunk_end + chunk_hdr
> size <= skb_tail_pointer().
>
> Fixes: 26b87c788100 ("net: sctp: fix remote memory pressure from excessive queueing")
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
Applied and queued up for -stable.
^ permalink raw reply
* Re: [PATCH net] sctp: use the old asoc when making the cookie-ack chunk in dupcook_d
From: David Miller @ 2018-05-02 15:16 UTC (permalink / raw)
To: lucien.xin; +Cc: netdev, linux-sctp, marcelo.leitner, nhorman
In-Reply-To: <b4d64902d4a8e0774c44752056e35a32d69966a9.1525239586.git.lucien.xin@gmail.com>
From: Xin Long <lucien.xin@gmail.com>
Date: Wed, 2 May 2018 13:39:46 +0800
> When processing a duplicate cookie-echo chunk, for case 'D', sctp will
> not process the param from this chunk. It means old asoc has nothing
> to be updated, and the new temp asoc doesn't have the complete info.
>
> So there's no reason to use the new asoc when creating the cookie-ack
> chunk. Otherwise, like when auth is enabled for cookie-ack, the chunk
> can not be set with auth, and it will definitely be dropped by peer.
>
> This issue is there since very beginning, and we fix it by using the
> old asoc instead.
>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
Applied and queued up for -stable.
^ permalink raw reply
* Re: [PATCH net] sctp: init active key for the new asoc in dupcook_a and dupcook_b
From: David Miller @ 2018-05-02 15:16 UTC (permalink / raw)
To: lucien.xin; +Cc: netdev, linux-sctp, marcelo.leitner, nhorman
In-Reply-To: <96c2191206cb67ccec250ca1ec66d83dc9f40c9a.1525239464.git.lucien.xin@gmail.com>
From: Xin Long <lucien.xin@gmail.com>
Date: Wed, 2 May 2018 13:37:44 +0800
> When processing a duplicate cookie-echo chunk, for case 'A' and 'B',
> after sctp_process_init for the new asoc, if auth is enabled for the
> cookie-ack chunk, the active key should also be initialized.
>
> Otherwise, the cookie-ack chunk made later can not be set with auth
> shkey properly, and a crash can even be caused by this, as after
> Commit 1b1e0bc99474 ("sctp: add refcnt support for sh_key"), sctp
> needs to hold the shkey when making control chunks.
>
> Fixes: 1b1e0bc99474 ("sctp: add refcnt support for sh_key")
> Reported-by: Jianwen Ji <jiji@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
Applied.
^ permalink raw reply
* [PATCH v2 1/2] net: phy: broadcom: add support for BCM89610 PHY
From: Bhadram Varka @ 2018-05-02 15:13 UTC (permalink / raw)
To: andrew, f.fainelli; +Cc: davem, netdev, linux-tegra
It adds support for BCM89610 (Single-Port 10/100/1000BASE-T)
transceiver which is used in P3310 Tegra186 platform.
Signed-off-by: Bhadram Varka <vbhadram@nvidia.com>
---
drivers/net/phy/broadcom.c | 10 ++++++++++
include/linux/brcmphy.h | 1 +
2 files changed, 11 insertions(+)
diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index 3bb6b66..f9c2591 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -720,6 +720,15 @@ static struct phy_driver broadcom_drivers[] = {
.get_strings = bcm_phy_get_strings,
.get_stats = bcm53xx_phy_get_stats,
.probe = bcm53xx_phy_probe,
+}, {
+ .phy_id = PHY_ID_BCM89610,
+ .phy_id_mask = 0xfffffff0,
+ .name = "Broadcom BCM89610",
+ .features = PHY_GBIT_FEATURES,
+ .flags = PHY_HAS_INTERRUPT,
+ .config_init = bcm54xx_config_init,
+ .ack_interrupt = bcm_phy_ack_intr,
+ .config_intr = bcm_phy_config_intr,
} };
module_phy_driver(broadcom_drivers);
@@ -741,6 +750,7 @@ static struct mdio_device_id __maybe_unused broadcom_tbl[] = {
{ PHY_ID_BCMAC131, 0xfffffff0 },
{ PHY_ID_BCM5241, 0xfffffff0 },
{ PHY_ID_BCM5395, 0xfffffff0 },
+ { PHY_ID_BCM89610, 0xfffffff0 },
{ }
};
diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
index d3339dd..b324e01 100644
--- a/include/linux/brcmphy.h
+++ b/include/linux/brcmphy.h
@@ -25,6 +25,7 @@
#define PHY_ID_BCM54612E 0x03625e60
#define PHY_ID_BCM54616S 0x03625d10
#define PHY_ID_BCM57780 0x03625d90
+#define PHY_ID_BCM89610 0x03625cd0
#define PHY_ID_BCM7250 0xae025280
#define PHY_ID_BCM7260 0xae025190
--
2.7.4
^ permalink raw reply related
* Re: [PATCH net] tcp_bbr: fix to zero idle_restart only upon S/ACKed data
From: David Miller @ 2018-05-02 15:13 UTC (permalink / raw)
To: ncardwell; +Cc: netdev, ycheng, soheil, priyarjha, ysseung
In-Reply-To: <20180502014541.194259-1-ncardwell@google.com>
From: Neal Cardwell <ncardwell@google.com>
Date: Tue, 1 May 2018 21:45:41 -0400
> Previously the bbr->idle_restart tracking was zeroing out the
> bbr->idle_restart bit upon ACKs that did not SACK or ACK anything,
> e.g. receiving incoming data or receiver window updates. In such
> situations BBR would forget that this was a restart-from-idle
> situation, and if the min_rtt had expired it would unnecessarily enter
> PROBE_RTT (even though we were actually restarting from idle but had
> merely forgotten that fact).
>
> The fix is simple: we need to remember we are restarting from idle
> until we receive a S/ACK for some data (a S/ACK for the first flight
> of data we send as we are restarting).
>
> This commit is a stable candidate for kernels back as far as 4.9.
>
> Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
> Signed-off-by: Yousuk Seung <ysseung@google.com>
Applied and queued up for -stable.
^ permalink raw reply
* Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring
From: Tiwei Bie @ 2018-05-02 15:12 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Wang, virtualization, linux-kernel, netdev, wexu, jfreimann
In-Reply-To: <20180502164828-mutt-send-email-mst@kernel.org>
On Wed, May 02, 2018 at 04:51:01PM +0300, Michael S. Tsirkin wrote:
> On Wed, May 02, 2018 at 03:28:19PM +0800, Tiwei Bie wrote:
> > On Wed, May 02, 2018 at 10:51:06AM +0800, Jason Wang wrote:
> > > On 2018年04月25日 13:15, Tiwei Bie wrote:
> > > > This commit introduces the event idx support in packed
> > > > ring. This feature is temporarily disabled, because the
> > > > implementation in this patch may not work as expected,
> > > > and some further discussions on the implementation are
> > > > needed, e.g. do we have to check the wrap counter when
> > > > checking whether a kick is needed?
> > > >
> > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > ---
> > > > drivers/virtio/virtio_ring.c | 53 ++++++++++++++++++++++++++++++++++++++++----
> > > > 1 file changed, 49 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 0181e93897be..b1039c2985b9 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
> > > > static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > {
> > > > struct vring_virtqueue *vq = to_vvq(_vq);
> > > > - u16 flags;
> > > > + u16 new, old, off_wrap, flags;
> > > > bool needs_kick;
> > > > u32 snapshot;
> > > > @@ -995,7 +995,12 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > * suppressions. */
> > > > virtio_mb(vq->weak_barriers);
> > > > + old = vq->next_avail_idx - vq->num_added;
> > > > + new = vq->next_avail_idx;
> > > > + vq->num_added = 0;
> > > > +
> > > > snapshot = *(u32 *)vq->vring_packed.device;
> > > > + off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0xffff);
> > > > flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
> > > > #ifdef DEBUG
> > > > @@ -1006,7 +1011,10 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > vq->last_add_time_valid = false;
> > > > #endif
> > > > - needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > > > + if (flags == VRING_EVENT_F_DESC)
> > > > + needs_kick = vring_need_event(off_wrap & ~(1<<15), new, old);
> > >
> > > I wonder whether or not the math is correct. Both new and event are in the
> > > unit of descriptor ring size, but old looks not.
> >
> > What vring_need_event() cares is the distance between
> > `new` and `old`, i.e. vq->num_added. So I think there
> > is nothing wrong with `old`. But the calculation of the
> > distance between `new` and `event_idx` isn't right when
> > `new` wraps. How do you think about the below code:
> >
> > wrap_counter = off_wrap >> 15;
> > event_idx = off_wrap & ~(1<<15);
> > if (wrap_counter != vq->wrap_counter)
> > event_idx -= vq->vring_packed.num;
> >
> > needs_kick = vring_need_event(event_idx, new, old);
>
> I suspect this hack won't work for non power of 2 ring.
Above code doesn't require the ring size to be a power of 2.
For (__u16)(new_idx - old), what we want to get is vq->num_added.
old = vq->next_avail_idx - vq->num_added;
new = vq->next_avail_idx;
When vq->next_avail_idx >= vq->num_added, it's obvious that,
(__u16)(new_idx - old) is vq->num_added.
And when vq->next_avail_idx < vq->num_added, new will be smaller
than old (old will be a big unsigned number), but (__u16)(new_idx
- old) is still vq->num_added.
For (__u16)(new_idx - event_idx - 1), when new wraps and event_idx
doesn't wrap, the most straightforward way to calculate it is:
(new + vq->vring_packed.num) - event_idx - 1.
But we can also calculate it in this way:
event_idx -= vq->vring_packed.num;
(event_idx will be a big unsigned number)
Then (__u16)(new_idx - event_idx - 1) will be the value we want.
Best regards,
Tiwei Bie
>
>
> > Best regards,
> > Tiwei Bie
> >
> >
> > >
> > > Thanks
> > >
> > > > + else
> > > > + needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > > > END_USE(vq);
> > > > return needs_kick;
> > > > }
> > > > @@ -1116,6 +1124,15 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> > > > if (vq->last_used_idx >= vq->vring_packed.num)
> > > > vq->last_used_idx -= vq->vring_packed.num;
> > > > + /* If we expect an interrupt for the next entry, tell host
> > > > + * by writing event index and flush out the write before
> > > > + * the read in the next get_buf call. */
> > > > + if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
> > > > + virtio_store_mb(vq->weak_barriers,
> > > > + &vq->vring_packed.driver->off_wrap,
> > > > + cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
> > > > + (vq->wrap_counter << 15)));
> > > > +
> > > > #ifdef DEBUG
> > > > vq->last_add_time_valid = false;
> > > > #endif
> > > > @@ -1143,10 +1160,17 @@ static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> > > > /* We optimistically turn back on interrupts, then check if there was
> > > > * more to do. */
> > > > + /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > > > + * either clear the flags bit or point the event index at the next
> > > > + * entry. Always update the event index to keep code simple. */
> > > > +
> > > > + vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > > > + vq->last_used_idx | (vq->wrap_counter << 15));
> > > > if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > > > virtio_wmb(vq->weak_barriers);
> > > > - vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > > > + vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > > > + VRING_EVENT_F_ENABLE;
> > > > vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > > > vq->event_flags_shadow);
> > > > }
> > > > @@ -1172,15 +1196,34 @@ static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
> > > > static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> > > > {
> > > > struct vring_virtqueue *vq = to_vvq(_vq);
> > > > + u16 bufs, used_idx, wrap_counter;
> > > > START_USE(vq);
> > > > /* We optimistically turn back on interrupts, then check if there was
> > > > * more to do. */
> > > > + /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > > > + * either clear the flags bit or point the event index at the next
> > > > + * entry. Always update the event index to keep code simple. */
> > > > +
> > > > + /* TODO: tune this threshold */
> > > > + bufs = (u16)(vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
> > > > +
> > > > + used_idx = vq->last_used_idx + bufs;
> > > > + wrap_counter = vq->wrap_counter;
> > > > +
> > > > + if (used_idx >= vq->vring_packed.num) {
> > > > + used_idx -= vq->vring_packed.num;
> > > > + wrap_counter ^= 1;
> > > > + }
> > > > +
> > > > + vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > > > + used_idx | (wrap_counter << 15));
> > > > if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > > > virtio_wmb(vq->weak_barriers);
> > > > - vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > > > + vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > > > + VRING_EVENT_F_ENABLE;
> > > > vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > > > vq->event_flags_shadow);
> > > > }
> > > > @@ -1822,8 +1865,10 @@ void vring_transport_features(struct virtio_device *vdev)
> > > > switch (i) {
> > > > case VIRTIO_RING_F_INDIRECT_DESC:
> > > > break;
> > > > +#if 0
> > > > case VIRTIO_RING_F_EVENT_IDX:
> > > > break;
> > > > +#endif
> > > > case VIRTIO_F_VERSION_1:
> > > > break;
> > > > case VIRTIO_F_IOMMU_PLATFORM:
> > >
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox