Netdev List


* Re: [patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw
From: Jiri Pirko @ 2018-06-27  7:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Linux Netdev List, David Miller, Jamal Hadi Salim, Cong Wang,
	Simon Horman, John Hurley, David Ahern, mlxsw
In-Reply-To: <20180626141858.7f18730f@cakuba.netronome.com>

Tue, Jun 26, 2018 at 11:18:58PM CEST, jakub.kicinski@netronome.com wrote:
>On Tue, 26 Jun 2018 09:12:17 +0200, Jiri Pirko wrote:
>> Tue, Jun 26, 2018 at 09:00:45AM CEST, jakub.kicinski@netronome.com wrote:
>> >On Mon, Jun 25, 2018 at 11:43 PM, Jiri Pirko <jiri@resnulli.us> wrote:  
>> >> Tue, Jun 26, 2018 at 06:58:50AM CEST, jakub.kicinski@netronome.com wrote:  
>> >>>On Mon, 25 Jun 2018 23:01:39 +0200, Jiri Pirko wrote:  
>> >>>> From: Jiri Pirko <jiri@mellanox.com>
>> >>>>
>> >>>> For the TC clsact offload these days, some of HW drivers need
>> >>>> to hold a magic ball. The reason is, with the first inserted rule inside
>> >>>> HW they need to guess what fields will be used for the matching. If
>> >>>> later on this guess proves to be wrong and user adds a filter with a
>> >>>> different field to match, there's a problem. Mlxsw resolves it now with
>> >>>> couple of patterns. Those try to cover as many match fields as possible.
>> >>>> This approach is far from optimal, both performance-wise and scale-wise.
>> >>>> Also, there is a combination of filters that in certain order won't
>> >>>> succeed.
>> >>>>
>> >>>> Most of the time, when user inserts filters in chain, he knows right away
>> >>>> how the filters are going to look like - what type and option will they
>> >>>> have. For example, he knows that he will only insert filters of type
>> >>>> flower matching destination IP address. He can specify a template that
>> >>>> would cover all the filters in the chain.  
>> >>>
>> >>>Perhaps it's lack of sleep, but this paragraph threw me a little off
>> >>>the track.  IIUC the goal of this set is to provide a way to inform the
>> >>>HW about expected matches before any rule is programmed into the HW.
>> >>>Not before any rule is added to a particular chain.  One can just use
>> >>>the first rule in the chain to make a guess about the chain, but thanks
>> >>>to this set user can configure *all* chains before any rules are added.  
>> >>
>> >> The template is per-chain. User can use template for chain x and
>> >> not-use it for chain y. Up to him.  
>> >
>> >Makes sense.
>> >
>> >I can't help but wonder if it'd be better to associate the
>> >constraints/rules with chains instead of creating a new "template"
>> >object.  It seems more natural to create a chain with specific
>> >constraints in place than add and delete template of which there can
>> >be at most one to a chain...  Perhaps that's more about the user space
>> >tc command line.  Anyway, not a strong objection, just a thought.  
>> 
>> Hmm. I don't think it is a good idea. User should see the template in a
>> "show" command per chain. We would have to have 2 show commands, one to
>> list the template objects and one to list templates per chains. It makes
>> things more complicated for no good reason. I think that this simple
>> chain-lock is easier and serves the purpose.
>
>Hm, I think the dump is fine, what I was thinking about was:
>
># tc chain add dev dummy0 ingress chain_index 22 \
>     ^^^^^
>	template proto ip \
>	^^^^^^^^
>	flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF

Okay, I got it. I see 2 issues.
1) user might expect to add a chain without the template. But that does
   not make sense really. Chains are created/deleted implicitly
   according to refcount.
2) there is no chain object like this available to user. Adding it just
   for template looks odd. Also, the "filter" and "template" are very
   much alike. They both are added to a chain, they both implicitly
   create chain if it does not exist, etc.

if you don't like "tc filter template add dev dummy0 ingress", how
about:
"tc template add dev dummy0 ingress ..."
"tc template add dev dummy0 ingress chain 22 ..."
that makes more sense I think.


>
>instead of:
>
># tc filter template add dev dummy0 ingress \
>     ^^^^^^^^^^^^^^^
>	proto ip chain_index 22 \
>	flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF
>
>And then delete becomes:
>
># tc chain del dev dummy0 ingress chain_index 22
>Error: The chain is not empty.
>
>The fact that template is very much like a filter is sort of an
>implementation detail, from user perspective it may be more intuitive
>to model template as an attribute of the chain, not a filter object
>added to a chain.
>
>But I could well be the only person who feels that way :)


* Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw
From: Jiri Pirko @ 2018-06-27  7:03 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Cong Wang, Linux Kernel Network Developers, David Miller,
	Jamal Hadi Salim, Jakub Kicinski, Simon Horman, john.hurley,
	David Ahern, mlxsw
In-Reply-To: <4b2e7870-b225-3855-f3ac-183126142c1c@intel.com>

Wed, Jun 27, 2018 at 08:34:46AM CEST, sridhar.samudrala@intel.com wrote:
>
>On 6/26/2018 11:05 PM, Jiri Pirko wrote:
>> Wed, Jun 27, 2018 at 02:04:31AM CEST, xiyou.wangcong@gmail.com wrote:
>> > On Tue, Jun 26, 2018 at 1:01 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> > > Create dummy device with clsact first:
>> > > # ip link add type dummy
>> > > # tc qdisc add dev dummy0 clsact
>> > > 
>> > > There is no template assigned by default:
>> > > # tc filter template show dev dummy0 ingress
>> > > 
>> > > Add a template of type flower allowing to insert rules matching on last
>> > > 2 bytes of destination mac address:
>> > > # tc filter template add dev dummy0 ingress proto ip flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF
>> > Now you are extending 'tc filter' command with a new
>> > subcommand 'template', which looks weird.
>> > 
>> > Why not make it a new property of filter like you did for chain?
>> > Like:
>> > 
>> > tc filter add dev dummy0 ingress proto ip template flower
>> But unlike chain, this is not a filter property. For chain, when you add
>> filter, you add it to a specific chain. That makes sense.
>> But for template, you need to add the template first. Then, later on,
>> you add filters which either match or do not match the template.
>
>So can we say that template defines the types of rules(match fields/masks) that
>can be added to a specific chain and there is 1-1 relationship between a template
>and a chain?

yes

>
>Without attaching a template to a chain, i guess it is possible to add different
>types of rules to a chain?

yes

>
>
>> Does not make sense to have "template" the filter property as you
>> suggest.
>
>template seems to be a chain property.

yes

>
>> > which is much better IMHO.
>
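
[Editor's sketch] The exchange above settles that a template belongs to a
chain and restricts which match fields/masks its filters may use. A minimal
illustration of such a check follows, assuming that "fits the template" means
the filter's mask sets no bit outside the template's mask. That subset rule,
the MAC_LEN constant and the function name mask_fits_template() are
assumptions made for illustration only; they are not taken from the patchset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAC_LEN 6	/* bytes in an Ethernet address */

/* Hypothetical helper: return true when every bit set in the filter's
 * dst_mac mask is also set in the template's mask, i.e. the filter does
 * not match on anything the template did not announce.
 */
static bool mask_fits_template(const uint8_t filter_mask[MAC_LEN],
			       const uint8_t tmplt_mask[MAC_LEN])
{
	size_t i;

	assert(filter_mask != NULL && tmplt_mask != NULL);

	for (i = 0; i < MAC_LEN; i++) {
		/* Any bit present in the filter but absent in the template
		 * means the filter falls outside the template.
		 */
		if (filter_mask[i] & ~tmplt_mask[i])
			return false;
	}
	return true;
}
```

With the template mask 00:00:00:00:FF:FF from the example, a filter masking
only the last two bytes would pass this check, while a filter masking the
first byte would not.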


* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-27  7:03 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Michael S. Tsirkin, Jakub Kicinski, Netdev, Cornelia Huck,
	qemu-devel, virtualization, Venu Busireddy, boris.ostrovsky,
	aaron.f.brown, Joao Martins
In-Reply-To: <19654612-a321-2ce9-9c1c-bcbae3a10e2f@intel.com>

On Tue, Jun 26, 2018 at 11:49 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 6/26/2018 11:21 PM, Siwei Liu wrote:
>>
>> On Tue, Jun 26, 2018 at 5:29 PM, Michael S. Tsirkin <mst@redhat.com>
>> wrote:
>>>
>>> On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:
>>>>
>>>> On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin <mst@redhat.com>
>>>> wrote:
>>>>>
>>>>> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>>>>>>>>>>
>>>>>>>>>> Might not necessarily be something wrong, but it's very limited
>>>>>>>>>> to prohibit the MAC of VF from changing when enslaved by failover.
>>>>>>>>>
>>>>>>>>> You mean guest changing MAC? I'm not sure why we prohibit that.
>>>>>>>>
>>>>>>>> I think Sridhar and Jiri might be better person to answer it. My
>>>>>>>> impression was that sync'ing the MAC address change between all 3
>>>>>>>> devices is challenging, as the failover driver uses MAC address to
>>>>>>>> match net_device internally.
>>>>>>
>>>>>> Yes. The MAC address is assigned by the hypervisor and it needs to
>>>>>> manage the movement of the MAC between the PF and VF.  Allowing the
>>>>>> guest to change the MAC will require synchronization between the
>>>>>> hypervisor and the PF/VF drivers. Most of the VF drivers don't allow
>>>>>> changing guest MAC unless it is a trusted VF.
>>>>>
>>>>> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
>>>>> For example I can see host just
>>>>> failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
>>>>> I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
>>>>
>>>> That's why I think pairing using MAC is fragile IMHO. When VF's MAC
>>>> got changed before virtio attempts to match and pair the device, it
>>>> ends up with no pairing found out at all.
>>>
>>> Guest seems to match on the hardware mac and ignore whatever
>>> is set by user. Makes sense to me and should not be fragile.
>>
>> Host can change the hardware mac for VF any time.
>
>
> Live migration is initiated and controlled by the Host, so the Source Host
> will reset the MAC during live migration after unplugging the VF. This is to
> redirect the VM's frames towards PF so that they can be received via the
> virtio-net standby interface. The destination host will set the VF MAC and
> plug the VF after live migration is completed.
>
> Allowing the guest to change the MAC will require the qemu/libvirt/mgmt
> layers to track the MAC changes and replay that change after live migration.
>
If the failover's MAC is kept in sync with VF's MAC address change,
the VF on destination host can be paired using the permanent address
after plugging in, while failover interface will resync the MAC to the
current one in use when enslaving the VF. I think similar is done for
multicast and unicast address list on VF's registration, right? No
need for QEMU or mgmt software to keep track of MAC changes.

-Siwei

>
>
>>
>> -Siwei
>>>
>>>
>>>> UUID is better.
>>>>
>>>> -Siwei
>>>>
>>>>> --
>>>>> MST
>
>


* Re: Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Samudrala, Sridhar @ 2018-06-27  6:49 UTC (permalink / raw)
  To: Siwei Liu, Michael S. Tsirkin
  Cc: Cornelia Huck, Alexander Duyck, virtio-dev, aaron.f.brown,
	Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel, virtualization,
	konrad.wilk, boris.ostrovsky, Joao Martins, Venu Busireddy,
	vijay.balakrishna
In-Reply-To: <CADGSJ22Sv_AFUkDou6hdq9++RNC40BTB-puS9209tOqzGLFJ-g@mail.gmail.com>

On 6/26/2018 11:21 PM, Siwei Liu wrote:
> On Tue, Jun 26, 2018 at 5:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:
>>> On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>>>>>>>>> Might not necessarily be something wrong, but it's very limited to
>>>>>>>>> prohibit the MAC of VF from changing when enslaved by failover.
>>>>>>>> You mean guest changing MAC? I'm not sure why we prohibit that.
>>>>>>> I think Sridhar and Jiri might be better person to answer it. My
>>>>>>> impression was that sync'ing the MAC address change between all 3
>>>>>>> devices is challenging, as the failover driver uses MAC address to
>>>>>>> match net_device internally.
>>>>> Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
>>>>> of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
>>>>> synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
>>>>> don't allow changing guest MAC unless it is a trusted VF.
>>>> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
>>>> For example I can see host just
>>>> failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
>>>> I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
>>> That's why I think pairing using MAC is fragile IMHO. When VF's MAC
>>> got changed before virtio attempts to match and pair the device, it
>>> ends up with no pairing found out at all.
>> Guest seems to match on the hardware mac and ignore whatever
>> is set by user. Makes sense to me and should not be fragile.
> Host can change the hardware mac for VF any time.

Live migration is initiated and controlled by the Host, so the Source Host will
reset the MAC during live migration after unplugging the VF. This is to redirect the
VM's frames towards PF so that they can be received via virtio-net standby interface.
The destination host will set the VF MAC and plug the VF after live migration is
completed.

Allowing the guest to change the MAC will require the qemu/libvirt/mgmt layers to
track the MAC changes and replay that change after live migration.


>
> -Siwei
>>
>>> UUID is better.
>>>
>>> -Siwei
>>>
>>>> --
>>>> MST


* Re: [PATCH net-next] neighbour: force neigh_invalidate when NUD_FAILED update is from admin
From: David Miller @ 2018-06-27  6:41 UTC (permalink / raw)
  To: roopa; +Cc: netdev
In-Reply-To: <1529983973-5508-1-git-send-email-roopa@cumulusnetworks.com>

From: Roopa Prabhu <roopa@cumulusnetworks.com>
Date: Mon, 25 Jun 2018 20:32:53 -0700

> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> 
> In systems where neigh gc thresholds are set to high values,
> admin deleted neigh entries (eg ip neigh flush or ip neigh del) can
> linger around in NUD_FAILED state for a long time until periodic gc kicks
> in. This patch forces neigh_invalidate when NUD_FAILED neigh_update is
> from an admin.
> 
> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>

Applied.


* Re: [PATCH v2 net] nfp: cast sizeof() to int when comparing with error code
From: David Miller @ 2018-06-27  6:38 UTC (permalink / raw)
  To: cgxu519; +Cc: jakub.kicinski, oss-drivers, netdev
In-Reply-To: <20180626011631.22717-1-cgxu519@gmx.com>

From: Chengguang Xu <cgxu519@gmx.com>
Date: Tue, 26 Jun 2018 09:16:31 +0800

> sizeof() will return unsigned value so in the error check
> negative error code will be always larger than sizeof().
> 
> Fixes: a0d8e02c35ff ("nfp: add support for reading nffw info")
> 
> Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
> v2:
> - Add more information to patch subject and commit log.

I guess the:

	if (x < 0 || x < sizeof(foo))

better self-documents the situation, but this patch is fine too
so I have applied it.

Thanks.
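
[Editor's sketch] The pitfall discussed above can be reproduced in a few
lines. The sketch below is not the nfp driver code; struct foo and the
function names are hypothetical. It compares the unfixed comparison (with the
implicit conversion of x to size_t written out explicitly), the cast that the
patch applies, and the "x < 0 ||" form suggested in the reply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed-size record that a read is expected to fill. */
struct foo {
	char bytes[16];
};

/* What the unfixed check effectively does: x is converted to size_t before
 * the comparison, so a negative error code becomes a huge unsigned value
 * and is never reported as "too short".
 */
static bool short_read_unfixed(int x)
{
	return (size_t)x < sizeof(struct foo);
}

/* The patch's approach: cast sizeof() to int so the comparison is signed. */
static bool short_read_cast(int x)
{
	return x < (int)sizeof(struct foo);
}

/* The self-documenting form from the reply: test for an error first. */
static bool short_read_explicit(int x)
{
	return x < 0 || (size_t)x < sizeof(struct foo);
}

/* Expected behaviour for an error code, a short read and a full read. */
static void self_check(void)
{
	assert(!short_read_unfixed(-5));	/* error code slips through */
	assert(short_read_cast(-5));
	assert(short_read_explicit(-5));
	assert(short_read_cast(3) && short_read_explicit(3));
	assert(!short_read_cast(16) && !short_read_explicit(16));
}
```

Both fixed variants agree on every input that fits in an int; they differ
only in how visibly the error case is handled.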


* Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw
From: Samudrala, Sridhar @ 2018-06-27  6:34 UTC (permalink / raw)
  To: Jiri Pirko, Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Jamal Hadi Salim,
	Jakub Kicinski, Simon Horman, john.hurley, David Ahern, mlxsw
In-Reply-To: <20180627060504.GS2161@nanopsycho>


On 6/26/2018 11:05 PM, Jiri Pirko wrote:
> Wed, Jun 27, 2018 at 02:04:31AM CEST, xiyou.wangcong@gmail.com wrote:
>> On Tue, Jun 26, 2018 at 1:01 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>> Create dummy device with clsact first:
>>> # ip link add type dummy
>>> # tc qdisc add dev dummy0 clsact
>>>
>>> There is no template assigned by default:
>>> # tc filter template show dev dummy0 ingress
>>>
>>> Add a template of type flower allowing to insert rules matching on last
>>> 2 bytes of destination mac address:
>>> # tc filter template add dev dummy0 ingress proto ip flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF
>> Now you are extending 'tc filter' command with a new
>> subcommand 'template', which looks weird.
>>
>> Why not make it a new property of filter like you did for chain?
>> Like:
>>
>> tc filter add dev dummy0 ingress proto ip template flower
> But unlike chain, this is not a filter property. For chain, when you add
> filter, you add it to a specific chain. That makes sense.
> But for template, you need to add the template first. Then, later on,
> you add filters which either match or do not match the template.

So can we say that template defines the types of rules(match fields/masks) that
can be added to a specific chain and there is 1-1 relationship between a template
and a chain?

Without attaching a template to a chain, i guess it is possible to add different
types of rules to a chain?


> Does not make sense to have "template" the filter property as you
> suggest.

template seems to be a chain property.

>> which is much better IMHO.


* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-27  6:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, Cornelia Huck, Alexander Duyck, virtio-dev,
	aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
	virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
	Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180627032825-mutt-send-email-mst@kernel.org>

On Tue, Jun 26, 2018 at 5:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:
>> On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>> >> > > > > Might not necessarily be something wrong, but it's very limited to
>> >> > > > > prohibit the MAC of VF from changing when enslaved by failover.
>> >> > > > You mean guest changing MAC? I'm not sure why we prohibit that.
>> >> > > I think Sridhar and Jiri might be better person to answer it. My
>> >> > > impression was that sync'ing the MAC address change between all 3
>> >> > > devices is challenging, as the failover driver uses MAC address to
>> >> > > match net_device internally.
>> >>
>> >> Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
>> >> of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
>> >> synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
>> >> don't allow changing guest MAC unless it is a trusted VF.
>> >
>> > OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
>> > For example I can see host just
>> > failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
>> > I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
>>
>> That's why I think pairing using MAC is fragile IMHO. When VF's MAC
>> got changed before virtio attempts to match and pair the device, it
>> ends up with no pairing found out at all.
>
> Guest seems to match on the hardware mac and ignore whatever
> is set by user. Makes sense to me and should not be fragile.
Host can change the hardware mac for VF any time.

-Siwei
>
>
>> UUID is better.
>>
>> -Siwei
>>
>> >
>> > --
>> > MST


* Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw
From: Jiri Pirko @ 2018-06-27  6:05 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Jamal Hadi Salim,
	Jakub Kicinski, Simon Horman, john.hurley, David Ahern, mlxsw
In-Reply-To: <CAM_iQpVzuX4P_C=APtWcTMGtuFZiS_ncAaJvtF6Chz19pG_oJg@mail.gmail.com>

Wed, Jun 27, 2018 at 02:04:31AM CEST, xiyou.wangcong@gmail.com wrote:
>On Tue, Jun 26, 2018 at 1:01 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> Create dummy device with clsact first:
>> # ip link add type dummy
>> # tc qdisc add dev dummy0 clsact
>>
>> There is no template assigned by default:
>> # tc filter template show dev dummy0 ingress
>>
>> Add a template of type flower allowing to insert rules matching on last
>> 2 bytes of destination mac address:
>> # tc filter template add dev dummy0 ingress proto ip flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF
>
>Now you are extending 'tc filter' command with a new
>subcommand 'template', which looks weird.
>
>Why not make it a new property of filter like you did for chain?
>Like:
>
>tc filter add dev dummy0 ingress proto ip template flower

But unlike chain, this is not a filter property. For chain, when you add
filter, you add it to a specific chain. That makes sense.
But for template, you need to add the template first. Then, later on,
you add filters which either match or do not match the template.
Does not make sense to have "template" the filter property as you
suggest.

>
>which is much better IMHO.


* Re: [PATCH mlx5-next 05/12] net/mlx5: Rate limit errors in command interface
From: Leon Romanovsky @ 2018-06-27  5:48 UTC (permalink / raw)
  To: Doug Ledford, Jason Gunthorpe
  Cc: RDMA mailing list, Hadar Hen Zion, Matan Barak, Michael J Ruhl,
	Noa Osherovich, Raed Salem, Yishai Hadas, Saeed Mahameed,
	linux-netdev
In-Reply-To: <20180624082353.16138-6-leon@kernel.org>


On Sun, Jun 24, 2018 at 11:23:46AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
>
> Any error status returned by FW will trigger similar
> to the following error message in the dmesg.
>
> [   55.884355] mlx5_core 0000:00:04.0: mlx5_cmd_check:712:(pid 555):
> ALLOC_UAR(0x802) op_mod(0x0) failed, status limits exceeded(0x8),
> syndrome (0x0)
>
> Those prints are extremely valuable to diagnose issues with running
> system and it is important to keep them. However, not-so-careful user
> can trigger endless number of such prints by depleting HW resources
> and will spam dmesg.
>
> Rate limiting of such messages solves this issue.
>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/cmd.c       | 11 ++++-------
>  drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h |  6 ++++++
>  2 files changed, 10 insertions(+), 7 deletions(-)
>

Thanks, applied to mlx5-next.


* Re: [PATCH rdma-next 00/12] RDMA fixes 2018-06-24
From: Leon Romanovsky @ 2018-06-27  5:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, RDMA mailing list, Matan Barak, Michael J Ruhl,
	Noa Osherovich, Raed Salem, Yishai Hadas, Saeed Mahameed,
	linux-netdev
In-Reply-To: <20180626203921.GF5381@ziepe.ca>


On Tue, Jun 26, 2018 at 02:39:21PM -0600, Jason Gunthorpe wrote:
> On Tue, Jun 26, 2018 at 07:21:26AM +0300, Leon Romanovsky wrote:
> > On Mon, Jun 25, 2018 at 03:34:38PM -0600, Jason Gunthorpe wrote:
> > > On Sun, Jun 24, 2018 at 11:23:41AM +0300, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@mellanox.com>
> > > >
> > > > Hi,
> > > >
> > > > This is a bunch of patches triggered by running syzkaller internally.
> > > >
> > > > I'm sending them based on rdma-next mainly for two reasons:
> > > > 1, Most of the patches fix the old issues and it doesn't matter when
> > > > they will hit the Linus's tree: now or later in a couple of weeks
> > > > during merge window.
> > > > 2. They interleave with code cleanup, mlx5-next patches and Michael's
> > > > feedback on flow counters series.
> > > >
> > > > Thanks
> > > >
> > > > Leon Romanovsky (12):
> > > >   RDMA/uverbs: Protect from attempts to create flows on unsupported QP
> > > >   RDMA/uverbs: Fix slab-out-of-bounds in ib_uverbs_ex_create_flow
> > >
> > > I applied these two to for-rc
> > >
> > > >   RDMA/uverbs: Check existence of create_flow callback
> > > >   RDMA/verbs: Drop kernel variant of create_flow
> > > >   RDMA/verbs: Drop kernel variant of destroy_flow
> > > >   net/mlx5: Rate limit errors in command interface
> > > >   RDMA/uverbs: Don't overwrite NULL pointer with ZERO_SIZE_PTR
> > > >   RDMA/umem: Don't check for negative return value of dma_map_sg_attrs()
> > > >   RDMA/uverbs: Remove redundant check
> > >
> > > These to for-next
> >
> > Jason,
> >
> > We would like to see patch "[PATCH mlx5-next 05/12] net/mlx5:
> > Rate limit errors in command interface" in our mlx5-next. Is it possible
> > at this point to drop it from for-next, so I'll be able to take it into
> > mlx5-next?
>
> Okay, you got to this while it was still 'wip', so it is dropped. Add
> it to the mlx5-next branch and netdev or rdma can pull it next time
> there is some reason to pull the branch..

Thanks, I cherry-picked that patch from your wip branch, so it includes
your SOB too. Most probably, the RDMA will pull it in dump_fill_mkey
series.

Thanks

>
> Jason


* Re: [PATCH v2] fib_rules: match rules based on suppress_* properties too
From: Roopa Prabhu @ 2018-06-27  4:53 UTC (permalink / raw)
  To: David Miller; +Cc: Jason A. Donenfeld, netdev
In-Reply-To: <20180627.103417.614359212764375850.davem@davemloft.net>

On Tue, Jun 26, 2018 at 6:34 PM, David Miller <davem@davemloft.net> wrote:
> From: "Jason A. Donenfeld" <Jason@zx2c4.com>
> Date: Tue, 26 Jun 2018 01:39:32 +0200
>
>> Two rules with different values of suppress_prefix or suppress_ifgroup
>> are not the same. This fixes an -EEXIST when running:
>>
>>    $ ip -4 rule add table main suppress_prefixlength 0
>>
>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>> Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule")
>
> Roopa, thanks for doing all of that analysis.
>
> I think applying this patch makes the most sense at this point,
> so that is what I have done.


Thanks, will keep an eye out and add some more tests


* [PATCH net-next v2 4/4] net/sched: add tunnel option support to act_tunnel_key
From: Jakub Kicinski @ 2018-06-27  4:39 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, oss-drivers, netdev,
	Simon Horman, Pieter Jansen van Vuuren
In-Reply-To: <20180627043937.25431-1-jakub.kicinski@netronome.com>

From: Simon Horman <simon.horman@netronome.com>

Allow setting tunnel options using the act_tunnel_key action.

Options are expressed as class:type:data and multiple options
may be listed using a comma delimiter.

 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc add dev eth0 ingress
 # tc filter add dev eth0 protocol ip parent ffff: \
     flower indev eth0 \
        ip_proto udp \
        action tunnel_key \
            set src_ip 10.0.99.192 \
            dst_ip 10.0.99.193 \
            dst_port 6081 \
            id 11 \
            geneve_opts 0102:80:00800022,0102:80:00800022 \
    action mirred egress redirect dev geneve0
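
[Editor's sketch] To make the class:type:data notation concrete, the
following shows how a single option string such as 0102:80:00800022 might be
split into its fields. This is not the iproute2 or kernel parser. It assumes
fixed-width hex fields (4 digits for class, 2 for type), which is inferred
only from the example above, and it borrows the data-length limits from this
patch (4 to 128 bytes, a multiple of 4). The names struct parsed_opt and
parse_geneve_opt() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OPT_DATA_MAX 128	/* upper bound on option data, in bytes */

/* Hypothetical holder for one parsed class:type:data option. */
struct parsed_opt {
	uint16_t opt_class;
	uint8_t type;
	uint8_t data[OPT_DATA_MAX];
	size_t data_len;
};

static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Parse exactly ndigits hex digits at *p into *val and advance *p. */
static int parse_hex_fixed(const char **p, int ndigits, unsigned int *val)
{
	unsigned int v = 0;
	int i;

	for (i = 0; i < ndigits; i++) {
		int d = hex_val((*p)[i]);	/* stops at '\0' as well */

		if (d < 0)
			return -1;
		v = (v << 4) | (unsigned int)d;
	}
	*p += ndigits;
	*val = v;
	return 0;
}

/* Return 0 on success, -1 if the string is malformed. */
static int parse_geneve_opt(const char *s, struct parsed_opt *out)
{
	unsigned int cls, type, byte;
	size_t hex_len, i;

	assert(s != NULL && out != NULL);

	if (parse_hex_fixed(&s, 4, &cls) || *s++ != ':')
		return -1;
	if (parse_hex_fixed(&s, 2, &type) || *s++ != ':')
		return -1;

	/* Data must be 4..128 bytes and a multiple of 4 bytes, which is
	 * 8..256 hex digits in steps of 8.
	 */
	hex_len = strlen(s);
	if (hex_len < 8 || hex_len % 8 != 0 || hex_len > 2 * OPT_DATA_MAX)
		return -1;

	for (i = 0; i < hex_len / 2; i++) {
		if (parse_hex_fixed(&s, 2, &byte))
			return -1;
		out->data[i] = (uint8_t)byte;
	}

	out->opt_class = (uint16_t)cls;
	out->type = (uint8_t)type;
	out->data_len = hex_len / 2;
	return 0;
}
```

A comma-separated list such as the one in the example would be handled by
splitting on ',' and calling the parser once per element.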

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
v2:
 - use nla_get_be16()/nla_put_be16() for TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS
   as struct geneve_opt :: opt_class is __be16.
---
 include/uapi/linux/tc_act/tc_tunnel_key.h |  26 +++
 net/sched/act_tunnel_key.c                | 214 +++++++++++++++++++++-
 2 files changed, 236 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/tc_act/tc_tunnel_key.h b/include/uapi/linux/tc_act/tc_tunnel_key.h
index 72bbefe5d1d1..e284fec8c467 100644
--- a/include/uapi/linux/tc_act/tc_tunnel_key.h
+++ b/include/uapi/linux/tc_act/tc_tunnel_key.h
@@ -36,9 +36,35 @@ enum {
 	TCA_TUNNEL_KEY_PAD,
 	TCA_TUNNEL_KEY_ENC_DST_PORT,	/* be16 */
 	TCA_TUNNEL_KEY_NO_CSUM,		/* u8 */
+	TCA_TUNNEL_KEY_ENC_OPTS,	/* Nested TCA_TUNNEL_KEY_ENC_OPTS_
+					 * attributes
+					 */
 	__TCA_TUNNEL_KEY_MAX,
 };
 
 #define TCA_TUNNEL_KEY_MAX (__TCA_TUNNEL_KEY_MAX - 1)
 
+enum {
+	TCA_TUNNEL_KEY_ENC_OPTS_UNSPEC,
+	TCA_TUNNEL_KEY_ENC_OPTS_GENEVE,		/* Nested
+						 * TCA_TUNNEL_KEY_ENC_OPTS_
+						 * attributes
+						 */
+	__TCA_TUNNEL_KEY_ENC_OPTS_MAX,
+};
+
+#define TCA_TUNNEL_KEY_ENC_OPTS_MAX (__TCA_TUNNEL_KEY_ENC_OPTS_MAX - 1)
+
+enum {
+	TCA_TUNNEL_KEY_ENC_OPT_GENEVE_UNSPEC,
+	TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS,		/* be16 */
+	TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE,		/* u8 */
+	TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA,		/* 4 to 128 bytes */
+
+	__TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX,
+};
+
+#define TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX \
+	(__TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX - 1)
+
 #endif
diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 20e98ed8d498..ea203e386a92 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -13,6 +13,7 @@
 #include <linux/kernel.h>
 #include <linux/skbuff.h>
 #include <linux/rtnetlink.h>
+#include <net/geneve.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 #include <net/dst.h>
@@ -57,6 +58,135 @@ static int tunnel_key_act(struct sk_buff *skb, const struct tc_action *a,
 	return action;
 }
 
+static const struct nla_policy
+enc_opts_policy[TCA_TUNNEL_KEY_ENC_OPTS_MAX + 1] = {
+	[TCA_TUNNEL_KEY_ENC_OPTS_GENEVE]	= { .type = NLA_NESTED },
+};
+
+static const struct nla_policy
+geneve_opt_policy[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX + 1] = {
+	[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS]	   = { .type = NLA_U16 },
+	[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE]	   = { .type = NLA_U8 },
+	[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]	   = { .type = NLA_BINARY,
+						       .len = 128 },
+};
+
+static int
+tunnel_key_copy_geneve_opt(const struct nlattr *nla, void *dst, int dst_len,
+			   struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX + 1];
+	int err, data_len, opt_len;
+	u8 *data;
+
+	err = nla_parse_nested(tb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX,
+			       nla, geneve_opt_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS] ||
+	    !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE] ||
+	    !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]) {
+		NL_SET_ERR_MSG(extack, "Missing tunnel key geneve option class, type or data");
+		return -EINVAL;
+	}
+
+	data = nla_data(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]);
+	data_len = nla_len(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]);
+	if (data_len < 4) {
+		NL_SET_ERR_MSG(extack, "Tunnel key geneve option data is less than 4 bytes long");
+		return -ERANGE;
+	}
+	if (data_len % 4) {
+		NL_SET_ERR_MSG(extack, "Tunnel key geneve option data is not a multiple of 4 bytes long");
+		return -ERANGE;
+	}
+
+	opt_len = sizeof(struct geneve_opt) + data_len;
+	if (dst) {
+		struct geneve_opt *opt = dst;
+
+		WARN_ON(dst_len < opt_len);
+
+		opt->opt_class =
+			nla_get_be16(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS]);
+		opt->type = nla_get_u8(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE]);
+		opt->length = data_len / 4; /* length is in units of 4 bytes */
+		opt->r1 = 0;
+		opt->r2 = 0;
+		opt->r3 = 0;
+
+		memcpy(opt + 1, data, data_len);
+	}
+
+	return opt_len;
+}
+
+static int tunnel_key_copy_opts(const struct nlattr *nla, u8 *dst,
+				int dst_len, struct netlink_ext_ack *extack)
+{
+	int err, rem, opt_len, len = nla_len(nla), opts_len = 0;
+	const struct nlattr *attr, *head = nla_data(nla);
+
+	err = nla_validate(head, len, TCA_TUNNEL_KEY_ENC_OPTS_MAX,
+			   enc_opts_policy, extack);
+	if (err)
+		return err;
+
+	nla_for_each_attr(attr, head, len, rem) {
+		switch (nla_type(attr)) {
+		case TCA_TUNNEL_KEY_ENC_OPTS_GENEVE:
+			opt_len = tunnel_key_copy_geneve_opt(attr, dst,
+							     dst_len, extack);
+			if (opt_len < 0)
+				return opt_len;
+			opts_len += opt_len;
+			if (dst) {
+				dst_len -= opt_len;
+				dst += opt_len;
+			}
+			break;
+		}
+	}
+
+	if (!opts_len) {
+		NL_SET_ERR_MSG(extack, "Empty list of tunnel options");
+		return -EINVAL;
+	}
+
+	if (rem > 0) {
+		NL_SET_ERR_MSG(extack, "Trailing data after parsing tunnel key options attributes");
+		return -EINVAL;
+	}
+
+	return opts_len;
+}
+
+static int tunnel_key_get_opts_len(struct nlattr *nla,
+				   struct netlink_ext_ack *extack)
+{
+	return tunnel_key_copy_opts(nla, NULL, 0, extack);
+}
+
+static int tunnel_key_opts_set(struct nlattr *nla, struct ip_tunnel_info *info,
+			       int opts_len, struct netlink_ext_ack *extack)
+{
+	info->options_len = opts_len;
+	switch (nla_type(nla_data(nla))) {
+	case TCA_TUNNEL_KEY_ENC_OPTS_GENEVE:
+#if IS_ENABLED(CONFIG_INET)
+		info->key.tun_flags |= TUNNEL_GENEVE_OPT;
+		return tunnel_key_copy_opts(nla, ip_tunnel_info_opts(info),
+					    opts_len, extack);
+#else
+		return -EAFNOSUPPORT;
+#endif
+	default:
+		NL_SET_ERR_MSG(extack, "Cannot set tunnel options for unknown tunnel type");
+		return -EINVAL;
+	}
+}
+
 static const struct nla_policy tunnel_key_policy[TCA_TUNNEL_KEY_MAX + 1] = {
 	[TCA_TUNNEL_KEY_PARMS]	    = { .len = sizeof(struct tc_tunnel_key) },
 	[TCA_TUNNEL_KEY_ENC_IPV4_SRC] = { .type = NLA_U32 },
@@ -66,6 +196,7 @@ static const struct nla_policy tunnel_key_policy[TCA_TUNNEL_KEY_MAX + 1] = {
 	[TCA_TUNNEL_KEY_ENC_KEY_ID]   = { .type = NLA_U32 },
 	[TCA_TUNNEL_KEY_ENC_DST_PORT] = {.type = NLA_U16},
 	[TCA_TUNNEL_KEY_NO_CSUM]      = { .type = NLA_U8 },
+	[TCA_TUNNEL_KEY_ENC_OPTS]     = { .type = NLA_NESTED },
 };
 
 static int tunnel_key_init(struct net *net, struct nlattr *nla,
@@ -81,6 +212,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	struct tcf_tunnel_key *t;
 	bool exists = false;
 	__be16 dst_port = 0;
+	int opts_len = 0;
 	__be64 key_id;
 	__be16 flags;
 	int ret = 0;
@@ -128,6 +260,15 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 		if (tb[TCA_TUNNEL_KEY_ENC_DST_PORT])
 			dst_port = nla_get_be16(tb[TCA_TUNNEL_KEY_ENC_DST_PORT]);
 
+		if (tb[TCA_TUNNEL_KEY_ENC_OPTS]) {
+			opts_len = tunnel_key_get_opts_len(tb[TCA_TUNNEL_KEY_ENC_OPTS],
+							   extack);
+			if (opts_len < 0) {
+				ret = opts_len;
+				goto err_out;
+			}
+		}
+
 		if (tb[TCA_TUNNEL_KEY_ENC_IPV4_SRC] &&
 		    tb[TCA_TUNNEL_KEY_ENC_IPV4_DST]) {
 			__be32 saddr;
@@ -138,7 +279,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 
 			metadata = __ip_tun_set_dst(saddr, daddr, 0, 0,
 						    dst_port, flags,
-						    key_id, 0);
+						    key_id, opts_len);
 		} else if (tb[TCA_TUNNEL_KEY_ENC_IPV6_SRC] &&
 			   tb[TCA_TUNNEL_KEY_ENC_IPV6_DST]) {
 			struct in6_addr saddr;
@@ -162,6 +303,14 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 			goto err_out;
 		}
 
+		if (opts_len) {
+			ret = tunnel_key_opts_set(tb[TCA_TUNNEL_KEY_ENC_OPTS],
+						  &metadata->u.tun_info,
+						  opts_len, extack);
+			if (ret < 0)
+				goto err_out;
+		}
+
 		metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX;
 		break;
 	default:
@@ -234,6 +383,61 @@ static void tunnel_key_release(struct tc_action *a)
 	}
 }
 
+static int tunnel_key_geneve_opts_dump(struct sk_buff *skb,
+				       const struct ip_tunnel_info *info)
+{
+	int len = info->options_len;
+	u8 *src = (u8 *)(info + 1);
+	struct nlattr *start;
+
+	start = nla_nest_start(skb, TCA_TUNNEL_KEY_ENC_OPTS_GENEVE);
+	if (!start)
+		return -EMSGSIZE;
+
+	while (len > 0) {
+		struct geneve_opt *opt = (struct geneve_opt *)src;
+
+		if (nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS,
+				 opt->opt_class) ||
+		    nla_put_u8(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE,
+			       opt->type) ||
+		    nla_put(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA,
+			    opt->length * 4, opt + 1))
+			return -EMSGSIZE;
+
+		len -= sizeof(struct geneve_opt) + opt->length * 4;
+		src += sizeof(struct geneve_opt) + opt->length * 4;
+	}
+
+	nla_nest_end(skb, start);
+	return 0;
+}
+
+static int tunnel_key_opts_dump(struct sk_buff *skb,
+				const struct ip_tunnel_info *info)
+{
+	struct nlattr *start;
+	int err;
+
+	if (!info->options_len)
+		return 0;
+
+	start = nla_nest_start(skb, TCA_TUNNEL_KEY_ENC_OPTS);
+	if (!start)
+		return -EMSGSIZE;
+
+	if (info->key.tun_flags & TUNNEL_GENEVE_OPT) {
+		err = tunnel_key_geneve_opts_dump(skb, info);
+		if (err)
+			return err;
+	} else {
+		return -EINVAL;
+	}
+
+	nla_nest_end(skb, start);
+	return 0;
+}
+
 static int tunnel_key_dump_addresses(struct sk_buff *skb,
 				     const struct ip_tunnel_info *info)
 {
@@ -284,8 +488,9 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
 		goto nla_put_failure;
 
 	if (params->tcft_action == TCA_TUNNEL_KEY_ACT_SET) {
-		struct ip_tunnel_key *key =
-			&params->tcft_enc_metadata->u.tun_info.key;
+		struct ip_tunnel_info *info =
+			&params->tcft_enc_metadata->u.tun_info;
+		struct ip_tunnel_key *key = &info->key;
 		__be32 key_id = tunnel_id_to_key32(key->tun_id);
 
 		if (nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_KEY_ID, key_id) ||
@@ -293,7 +498,8 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
 					      &params->tcft_enc_metadata->u.tun_info) ||
 		    nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_DST_PORT, key->tp_dst) ||
 		    nla_put_u8(skb, TCA_TUNNEL_KEY_NO_CSUM,
-			       !(key->tun_flags & TUNNEL_CSUM)))
+			       !(key->tun_flags & TUNNEL_CSUM)) ||
+		    tunnel_key_opts_dump(skb, info))
 			goto nla_put_failure;
 	}
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags
From: Jakub Kicinski @ 2018-06-27  4:39 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, oss-drivers, netdev,
	Pieter Jansen van Vuuren, Jakub Kicinski, Daniel Borkmann
In-Reply-To: <20180627043937.25431-1-jakub.kicinski@netronome.com>

From: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>

Check the tunnel option type stored in tunnel flags when creating options
for tunnels, thereby ensuring we do not set geneve, vxlan or erspan tunnel
options on interfaces that are not associated with them.

Make sure all users of the infrastructure set the correct flags. For the BPF
helper we have to set all bits to keep backward compatibility.
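
The idea described above can be sketched in user space roughly as follows.
This is only an illustration under stated assumptions: struct tun_info,
tun_opts_set() and tun_opts_for() are hypothetical stand-ins for
struct ip_tunnel_info, ip_tunnel_info_opts_set() and the per-driver checks,
and the OPT_* bit values are illustrative rather than copied from
include/net/ip_tunnels.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative bit values only (assumption, not the kernel definitions). */
#define OPT_GENEVE 0x0800u
#define OPT_VXLAN  0x1000u
#define OPT_ERSPAN 0x4000u
#define OPT_ANY    (OPT_GENEVE | OPT_VXLAN | OPT_ERSPAN)

/* Hypothetical, simplified stand-in for struct ip_tunnel_info. */
struct tun_info {
	uint16_t tun_flags;
	uint8_t options_len;
	uint8_t options[32];
};

/* Store opaque option bytes and record which tunnel type they belong to. */
static int tun_opts_set(struct tun_info *info, const void *from,
			uint8_t len, uint16_t flags)
{
	if (len > sizeof(info->options))
		return -1;
	memcpy(info->options, from, len);
	info->options_len = len;
	info->tun_flags |= flags;
	return 0;
}

/* Hand options to a consumer only if they were set for its tunnel type. */
static const uint8_t *tun_opts_for(const struct tun_info *info, uint16_t want)
{
	if (!info->options_len || !(info->tun_flags & want))
		return NULL;
	return info->options;
}
```

A caller that cannot know the option type (the BPF helper case in this
sketch) would pass OPT_ANY, so every consumer keeps accepting its options.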

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
CC: Daniel Borkmann <daniel@iogearbox.net>

v2:
 - use __be16 for dst_opt_type in net/openvswitch/flow_netlink.c (build bot).
---
 drivers/net/geneve.c           | 6 ++++--
 drivers/net/vxlan.c            | 3 ++-
 include/net/ip_tunnels.h       | 8 ++++++--
 net/core/filter.c              | 2 +-
 net/ipv4/ip_gre.c              | 2 ++
 net/ipv6/ip6_gre.c             | 2 ++
 net/openvswitch/flow_netlink.c | 7 ++++++-
 7 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 3e94375b9b01..471edd76ff55 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -236,7 +236,8 @@ static void geneve_rx(struct geneve_dev *geneve, struct geneve_sock *gs,
 		}
 		/* Update tunnel dst according to Geneve options. */
 		ip_tunnel_info_opts_set(&tun_dst->u.tun_info,
-					gnvh->options, gnvh->opt_len * 4);
+					gnvh->options, gnvh->opt_len * 4,
+					TUNNEL_GENEVE_OPT);
 	} else {
 		/* Drop packets w/ critical options,
 		 * since we don't support any...
@@ -675,7 +676,8 @@ static void geneve_build_header(struct genevehdr *geneveh,
 	geneveh->proto_type = htons(ETH_P_TEB);
 	geneveh->rsvd2 = 0;
 
-	ip_tunnel_info_opts_get(geneveh->options, info);
+	if (info->key.tun_flags & TUNNEL_GENEVE_OPT)
+		ip_tunnel_info_opts_get(geneveh->options, info);
 }
 
 static int geneve_build_skb(struct dst_entry *dst, struct sk_buff *skb,
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index cc14e0cd5647..7eb30d7c8bd7 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2122,7 +2122,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		vni = tunnel_id_to_key32(info->key.tun_id);
 		ifindex = 0;
 		dst_cache = &info->dst_cache;
-		if (info->options_len)
+		if (info->options_len &&
+		    info->key.tun_flags & TUNNEL_VXLAN_OPT)
 			md = ip_tunnel_info_opts(info);
 		ttl = info->key.ttl;
 		tos = info->key.tos;
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 90ff430f5e9d..b0d022ff6ea1 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -466,10 +466,12 @@ static inline void ip_tunnel_info_opts_get(void *to,
 }
 
 static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info,
-					   const void *from, int len)
+					   const void *from, int len,
+					   __be16 flags)
 {
 	memcpy(ip_tunnel_info_opts(info), from, len);
 	info->options_len = len;
+	info->key.tun_flags |= flags;
 }
 
 static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstate)
@@ -511,9 +513,11 @@ static inline void ip_tunnel_info_opts_get(void *to,
 }
 
 static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info,
-					   const void *from, int len)
+					   const void *from, int len,
+					   __be16 flags)
 {
 	info->options_len = 0;
+	info->key.tun_flags |= flags;
 }
 
 #endif /* CONFIG_INET */
diff --git a/net/core/filter.c b/net/core/filter.c
index e7f12e9f598c..dade922678f6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3582,7 +3582,7 @@ BPF_CALL_3(bpf_skb_set_tunnel_opt, struct sk_buff *, skb,
 	if (unlikely(size > IP_TUNNEL_OPTS_MAX))
 		return -ENOMEM;
 
-	ip_tunnel_info_opts_set(info, from, size);
+	ip_tunnel_info_opts_set(info, from, size, TUNNEL_OPTIONS_PRESENT);
 
 	return 0;
 }
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 2d8efeecf619..c8ca5d8f0f75 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -587,6 +587,8 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev,
 		goto err_free_skb;
 
 	key = &tun_info->key;
+	if (!(tun_info->key.tun_flags & TUNNEL_ERSPAN_OPT))
+		goto err_free_rt;
 	md = ip_tunnel_info_opts(tun_info);
 	if (!md)
 		goto err_free_rt;
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index c8cf2fdbb13b..367177786e34 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -990,6 +990,8 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff *skb,
 		fl6.flowi6_uid = sock_net_uid(dev_net(dev), NULL);
 
 		dsfield = key->tos;
+		if (!(tun_info->key.tun_flags & TUNNEL_ERSPAN_OPT))
+			goto tx_err;
 		md = ip_tunnel_info_opts(tun_info);
 		if (!md)
 			goto tx_err;
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 492ab0c36f7c..391c4073a6dc 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2516,7 +2516,9 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	struct ovs_tunnel_info *ovs_tun;
 	struct nlattr *a;
 	int err = 0, start, opts_type;
+	__be16 dst_opt_type;
 
+	dst_opt_type = 0;
 	ovs_match_init(&match, &key, true, NULL);
 	opts_type = ip_tun_from_nlattr(nla_data(attr), &match, false, log);
 	if (opts_type < 0)
@@ -2528,10 +2530,13 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 			err = validate_geneve_opts(&key);
 			if (err < 0)
 				return err;
+			dst_opt_type = TUNNEL_GENEVE_OPT;
 			break;
 		case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
+			dst_opt_type = TUNNEL_VXLAN_OPT;
 			break;
 		case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS:
+			dst_opt_type = TUNNEL_ERSPAN_OPT;
 			break;
 		}
 	}
@@ -2574,7 +2579,7 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	 */
 	ip_tunnel_info_opts_set(tun_info,
 				TUN_METADATA_OPTS(&key, key.tun_opts_len),
-				key.tun_opts_len);
+				key.tun_opts_len, dst_opt_type);
 	add_nested_action_end(*sfa, start);
 
 	return err;
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 2/4] net/sched: act_tunnel_key: add extended ack support
From: Jakub Kicinski @ 2018-06-27  4:39 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, oss-drivers, netdev,
	Simon Horman, Alexander Aring, Pieter Jansen van Vuuren
In-Reply-To: <20180627043937.25431-1-jakub.kicinski@netronome.com>

From: Simon Horman <simon.horman@netronome.com>

Add extended ack support for the tunnel key action by using NL_SET_ERR_MSG
during validation of user input.

Cc: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
---
 net/sched/act_tunnel_key.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 2edd389e7c92..20e98ed8d498 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -86,16 +86,22 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	int ret = 0;
 	int err;
 
-	if (!nla)
+	if (!nla) {
+		NL_SET_ERR_MSG(extack, "Tunnel requires attributes to be passed");
 		return -EINVAL;
+	}
 
 	err = nla_parse_nested(tb, TCA_TUNNEL_KEY_MAX, nla, tunnel_key_policy,
-			       NULL);
-	if (err < 0)
+			       extack);
+	if (err < 0) {
+		NL_SET_ERR_MSG(extack, "Failed to parse nested tunnel key attributes");
 		return err;
+	}
 
-	if (!tb[TCA_TUNNEL_KEY_PARMS])
+	if (!tb[TCA_TUNNEL_KEY_PARMS]) {
+		NL_SET_ERR_MSG(extack, "Missing tunnel key parameters");
 		return -EINVAL;
+	}
 
 	parm = nla_data(tb[TCA_TUNNEL_KEY_PARMS]);
 	exists = tcf_idr_check(tn, parm->index, a, bind);
@@ -107,6 +113,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 		break;
 	case TCA_TUNNEL_KEY_ACT_SET:
 		if (!tb[TCA_TUNNEL_KEY_ENC_KEY_ID]) {
+			NL_SET_ERR_MSG(extack, "Missing tunnel key id");
 			ret = -EINVAL;
 			goto err_out;
 		}
@@ -144,11 +151,13 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 						      0, flags,
 						      key_id, 0);
 		} else {
+			NL_SET_ERR_MSG(extack, "Missing either ipv4 or ipv6 src and dst");
 			ret = -EINVAL;
 			goto err_out;
 		}
 
 		if (!metadata) {
+			NL_SET_ERR_MSG(extack, "Cannot allocate tunnel metadata dst");
 			ret = -ENOMEM;
 			goto err_out;
 		}
@@ -156,6 +165,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 		metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX;
 		break;
 	default:
+		NL_SET_ERR_MSG(extack, "Unknown tunnel key action");
 		ret = -EINVAL;
 		goto err_out;
 	}
@@ -163,14 +173,18 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	if (!exists) {
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_tunnel_key_ops, bind, true);
-		if (ret)
+		if (ret) {
+			NL_SET_ERR_MSG(extack, "Cannot create TC IDR");
 			return ret;
+		}
 
 		ret = ACT_P_CREATED;
 	} else {
 		tcf_idr_release(*a, bind);
-		if (!ovr)
+		if (!ovr) {
+			NL_SET_ERR_MSG(extack, "TC IDR already exists");
 			return -EEXIST;
+		}
 	}
 
 	t = to_tunnel_key(*a);
@@ -180,6 +194,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	if (unlikely(!params_new)) {
 		if (ret == ACT_P_CREATED)
 			tcf_idr_release(*a, bind);
+		NL_SET_ERR_MSG(extack, "Cannot allocate tunnel key parameters");
 		return -ENOMEM;
 	}
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 1/4] net/sched: act_tunnel_key: disambiguate metadata dst error cases
From: Jakub Kicinski @ 2018-06-27  4:39 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, oss-drivers, netdev,
	Simon Horman
In-Reply-To: <20180627043937.25431-1-jakub.kicinski@netronome.com>

From: Simon Horman <simon.horman@netronome.com>

Metadata may be NULL for one of two reasons:
* Missing user input
* Failure to allocate the metadata dst

Disambiguate these cases by returning -EINVAL for the former and -ENOMEM
for the latter rather than -EINVAL for both cases.

This is in preparation for using extended ack to provide more information
to users when parsing their input.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 net/sched/act_tunnel_key.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 626dac81a48a..2edd389e7c92 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -143,10 +143,13 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 			metadata = __ipv6_tun_set_dst(&saddr, &daddr, 0, 0, dst_port,
 						      0, flags,
 						      key_id, 0);
+		} else {
+			ret = -EINVAL;
+			goto err_out;
 		}
 
 		if (!metadata) {
-			ret = -EINVAL;
+			ret = -ENOMEM;
 			goto err_out;
 		}
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 0/4] net: Geneve options support for TC act_tunnel_key
From: Jakub Kicinski @ 2018-06-27  4:39 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, oss-drivers, netdev,
	Jakub Kicinski

Hi,

Simon & Pieter say:

This set adds Geneve Options support to the TC tunnel key action.
It provides the plumbing required to configure Geneve variable length
options.  The options can be configured in the form CLASS:TYPE:DATA,
where CLASS is represented as a 16bit hexadecimal value, TYPE as an 8bit
hexadecimal value and DATA as a variable length hexadecimal value.
Additionally multiple options may be listed using a comma delimiter.
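
As a rough illustration of the CLASS:TYPE:DATA layout described above, the
user-space sketch below packs one option into the 4-byte Geneve option
header followed by its data. The function names are hypothetical; the
"at least 4 bytes, multiple of 4" checks mirror the validation in patch 4/4,
while the 124-byte upper bound is an assumption derived from the 5-bit
length field of the option header and is not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GENEVE_OPT_HDR_LEN 4	/* class (2) + type (1) + rsvd/length (1) */

/* Total encoded size of one option, or -1 if data_len is not acceptable. */
static int geneve_opt_encoded_len(size_t data_len)
{
	if (data_len < 4 || data_len % 4)
		return -1;
	if (data_len / 4 > 31)	/* assumption: length field is 5 bits wide */
		return -1;
	return (int)(GENEVE_OPT_HDR_LEN + data_len);
}

/* Pack class, type and data into dst in wire order; returns bytes written. */
static int geneve_opt_pack(uint8_t *dst, size_t dst_len, uint16_t opt_class,
			   uint8_t type, const uint8_t *data, size_t data_len)
{
	int total = geneve_opt_encoded_len(data_len);

	if (total < 0 || (size_t)total > dst_len)
		return -1;

	dst[0] = (uint8_t)(opt_class >> 8);	/* class, big endian */
	dst[1] = (uint8_t)(opt_class & 0xff);
	dst[2] = type;
	dst[3] = (uint8_t)(data_len / 4);	/* reserved bits 0, length in 4-byte units */
	memcpy(dst + GENEVE_OPT_HDR_LEN, data, data_len);
	return total;
}
```

With this sketch, an option written as 0102:80:deadbeef would occupy
8 bytes: 4 bytes of header and 4 bytes of data.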

v2:
 - fix sparse warnings in patches 3 and 4 (first one reported by
   build bot).

Pieter Jansen van Vuuren (1):
  net: check tunnel option type in tunnel flags

Simon Horman (3):
  net/sched: act_tunnel_key: disambiguate metadata dst error cases
  net/sched: act_tunnel_key: add extended ack support
  net/sched: add tunnel option support to act_tunnel_key

 drivers/net/geneve.c                      |   6 +-
 drivers/net/vxlan.c                       |   3 +-
 include/net/ip_tunnels.h                  |   8 +-
 include/uapi/linux/tc_act/tc_tunnel_key.h |  26 +++
 net/core/filter.c                         |   2 +-
 net/ipv4/ip_gre.c                         |   2 +
 net/ipv6/ip6_gre.c                        |   2 +
 net/openvswitch/flow_netlink.c            |   7 +-
 net/sched/act_tunnel_key.c                | 246 +++++++++++++++++++++-
 9 files changed, 284 insertions(+), 18 deletions(-)

-- 
2.17.1

^ permalink raw reply

* Re: [PATCH net-next] tcp: force cwnd at least 2 in tcp_cwnd_reduction
From: kbuild test robot @ 2018-06-27  4:22 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: kbuild-all, netdev, Kernel Team, Blake Matheny,
	Alexei Starovoitov, Eric Dumazet
In-Reply-To: <20180627015222.3269067-1-brakmo@fb.com>

Hi Lawrence,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Lawrence-Brakmo/tcp-force-cwnd-at-least-2-in-tcp_cwnd_reduction/20180627-095533
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   net/ipv4/tcp_input.c:168:42: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:168:42: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:213:21: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:213:21: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:329:19: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:329:19: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:336:19: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:337:19: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:337:19: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:347:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:347:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:412:32: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:412:32: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:413:44: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:413:44: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:436:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:436:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:464:44: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:464:44: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:473:36: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:473:36: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:475:28: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:475:28: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:492:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:492:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:496:36: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:496:36: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:509:29: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:509:29: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:511:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:511:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:512:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:513:16: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:652:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:652:26: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:790:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:790:33: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:794:23: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:818:17: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:818:17: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:827:9: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:827:9: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:867:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:867:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:902:34: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:902:34: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1654:25: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1848:17: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1849:17: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1849:17: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1869:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1869:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1896:34: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1995:34: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:1995:34: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2024:39: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2024:39: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2400:44: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2472:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2476:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2476:26: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2479:18: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2479:18: sparse: expression using sizeof(void)
>> net/ipv4/tcp_input.c:2480:24: sparse: incompatible types in comparison expression (different signedness)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:1138:24: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:2990:48: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:3195:46: sparse: expression using sizeof(void)
   net/ipv4/tcp_input.c:3195:46: sparse: expression using sizeof(void)
   include/net/tcp.h:739:16: sparse: expression using sizeof(void)
   include/net/tcp.h:1206:16: sparse: expression using sizeof(void)
   include/net/tcp.h:1215:31: sparse: expression using sizeof(void)
   include/net/tcp.h:1215:31: sparse: too many warnings
   In file included from include/asm-generic/bug.h:18:0,
                    from arch/x86/include/asm/bug.h:83,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:9,
                    from net/ipv4/tcp_input.c:67:
   net/ipv4/tcp_input.c: In function 'tcp_cwnd_reduction':
   include/linux/kernel.h:812:29: warning: comparison of distinct pointer types lacks a cast
      (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                                ^
   include/linux/kernel.h:826:4: note: in expansion of macro '__typecheck'
      (__typecheck(x, y) && __no_side_effects(x, y))
       ^~~~~~~~~~~
   include/linux/kernel.h:836:24: note: in expansion of macro '__safe_cmp'
     __builtin_choose_expr(__safe_cmp(x, y), \
                           ^~~~~~~~~~
   include/linux/kernel.h:852:19: note: in expansion of macro '__careful_cmp'
    #define max(x, y) __careful_cmp(x, y, >)
                      ^~~~~~~~~~~~~
   net/ipv4/tcp_input.c:2480:17: note: in expansion of macro 'max'
     tp->snd_cwnd = max(tcp_packets_in_flight(tp) + sndcnt, 2);
                    ^~~

vim +2480 net/ipv4/tcp_input.c

  2455	
  2456	void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked, int flag)
  2457	{
  2458		struct tcp_sock *tp = tcp_sk(sk);
  2459		int sndcnt = 0;
  2460		int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
  2461	
  2462		if (newly_acked_sacked <= 0 || WARN_ON_ONCE(!tp->prior_cwnd))
  2463			return;
  2464	
  2465		tp->prr_delivered += newly_acked_sacked;
  2466		if (delta < 0) {
  2467			u64 dividend = (u64)tp->snd_ssthresh * tp->prr_delivered +
  2468				       tp->prior_cwnd - 1;
  2469			sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
  2470		} else if ((flag & FLAG_RETRANS_DATA_ACKED) &&
  2471			   !(flag & FLAG_LOST_RETRANS)) {
> 2472			sndcnt = min_t(int, delta,
  2473				       max_t(int, tp->prr_delivered - tp->prr_out,
  2474					     newly_acked_sacked) + 1);
  2475		} else {
  2476			sndcnt = min(delta, newly_acked_sacked);
  2477		}
  2478		/* Force a fast retransmit upon entering fast recovery */
  2479		sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
> 2480		tp->snd_cwnd = max(tcp_packets_in_flight(tp) + sndcnt, 2);
  2481	}
  2482	
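
The warning on line 2480 comes from max() being given operands of different
signedness: the sum is unsigned while the constant 2 is a signed int. A
minimal user-space sketch of one possible way to keep both operands the same
type is shown below. max_u32() and cwnd_floor() are hypothetical names used
only for this illustration (in the kernel, max_t(u32, ...) or an unsigned
constant would play the same role); this is not the fix from any later
revision of the patch.

```c
#include <stdint.h>

/* Hypothetical helper standing in for the kernel's max_t(u32, a, b). */
static inline uint32_t max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/*
 * Never let the congestion window drop below 2.  sndcnt is assumed to be
 * non-negative here, as it is after line 2479 in the listing above.
 */
static uint32_t cwnd_floor(uint32_t packets_in_flight, int sndcnt)
{
	/* Both operands are unsigned, so the comparison types match. */
	return max_u32(packets_in_flight + (uint32_t)sndcnt, 2U);
}
```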

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* [PATCH 1/1] esp6: fix memleak on error path in esp6_input
From: Zhen Lei @ 2018-06-27  3:49 UTC (permalink / raw)
  To: Steffen Klassert, Herbert Xu, David S. Miller, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, netdev, linux-kernel
  Cc: Zhen Lei, Hanjun Guo, Libin, YueHaibing

This ought to be an omission in e6194923237 ("esp: Fix memleaks on error
paths."). The memleak on error path in esp6_input is similar to esp_input
of esp4.
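
The pattern being fixed can be sketched in user space as below. All names
are hypothetical: tracked_alloc() and tracked_free() only exist so the leak
is observable, and map_buffers() stands in for a call such as
skb_to_sgvec() that may fail after the temporary buffer has been allocated.

```c
#include <stdlib.h>

static int live_allocs;	/* number of buffers currently outstanding */

static void *tracked_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		live_allocs++;
	return p;
}

static void tracked_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/* Hypothetical stand-in for a mapping step that can fail. */
static int map_buffers(int fail)
{
	return fail ? -1 : 0;
}

static int process_packet(int fail_mapping)
{
	void *tmp = tracked_alloc(64);
	int ret;

	if (!tmp)
		return -1;

	ret = map_buffers(fail_mapping);
	if (ret < 0) {
		tracked_free(tmp);	/* without this, the early exit leaks tmp */
		goto out;
	}

	/* Normal path: the buffer is used and released later. */
	tracked_free(tmp);
	ret = 0;
out:
	return ret;
}
```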

Fixes: e6194923237 ("esp: Fix memleaks on error paths.")
Fixes: 3f29770723f ("ipsec: check return value of skb_to_sgvec always")

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 net/ipv6/esp6.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 97513f3..88a7579 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -669,8 +669,10 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)

 	sg_init_table(sg, nfrags);
 	ret = skb_to_sgvec(skb, sg, 0, skb->len);
-	if (unlikely(ret < 0))
+	if (unlikely(ret < 0)) {
+		kfree(tmp);
 		goto out;
+	}

 	skb->ip_summed = CHECKSUM_NONE;

--
1.8.3

^ permalink raw reply related

* [PATCH net] bpfilter: include bpfilter_umh in assembly instead of using objcopy
From: Alexei Starovoitov @ 2018-06-27  3:13 UTC (permalink / raw)
  To: David S . Miller
  Cc: daniel, torvalds, mcroce, yamada.masahiro, linux, netdev,
	kernel-team

From: Masahiro Yamada <yamada.masahiro@socionext.com>

What we want here is to embed a user-space program into the kernel.
Instead of the complex ELF magic, let's simply wrap it in the assembly
with the '.incbin' directive.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
I think this patch should 'fix' the bpfilter build issue on all archs.
The cflags for a cross CC may still be incorrect, and the embedded blob
may fail to execute via fork_usermode_blob()
(e.g. with 'make ARCH=i386 net/bpfilter/', CC will build and link a 64-bit
binary that will be included into bpfilter.o or vmlinux, and that binary
will fail to run on a 32-bit kernel),
but that is a separate issue that will be addressed in the net-next time frame.
Long term, we've discussed switching to something like klibc and keeping it
as part of the kernel to avoid relying on glibc and cc-can-link.sh.

 net/bpfilter/Makefile            | 17 ++---------------
 net/bpfilter/bpfilter_kern.c     | 11 +++++------
 net/bpfilter/bpfilter_umh_blob.S |  7 +++++++
 3 files changed, 14 insertions(+), 21 deletions(-)
 create mode 100644 net/bpfilter/bpfilter_umh_blob.S

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index 051dc18b8ccb..39c6980b5d99 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -15,20 +15,7 @@ ifeq ($(CONFIG_BPFILTER_UMH), y)
 HOSTLDFLAGS += -static
 endif
 
-# a bit of elf magic to convert bpfilter_umh binary into a binary blob
-# inside bpfilter_umh.o elf file referenced by
-# _binary_net_bpfilter_bpfilter_umh_start symbol
-# which bpfilter_kern.c passes further into umh blob loader at run-time
-quiet_cmd_copy_umh = GEN $@
-      cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
-      $(OBJCOPY) -I binary \
-          `LC_ALL=C $(OBJDUMP) -f net/bpfilter/bpfilter_umh \
-          |awk -F' |,' '/file format/{print "-O",$$NF} \
-          /^architecture:/{print "-B",$$2}'` \
-      --rename-section .data=.init.rodata $< $@
-
-$(obj)/bpfilter_umh.o: $(obj)/bpfilter_umh
-	$(call cmd,copy_umh)
+$(obj)/bpfilter_umh_blob.o: $(obj)/bpfilter_umh
 
 obj-$(CONFIG_BPFILTER_UMH) += bpfilter.o
-bpfilter-objs += bpfilter_kern.o bpfilter_umh.o
+bpfilter-objs += bpfilter_kern.o bpfilter_umh_blob.o
diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
index 09522573f611..f0fc182d3db7 100644
--- a/net/bpfilter/bpfilter_kern.c
+++ b/net/bpfilter/bpfilter_kern.c
@@ -10,11 +10,8 @@
 #include <linux/file.h>
 #include "msgfmt.h"
 
-#define UMH_start _binary_net_bpfilter_bpfilter_umh_start
-#define UMH_end _binary_net_bpfilter_bpfilter_umh_end
-
-extern char UMH_start;
-extern char UMH_end;
+extern char bpfilter_umh_start;
+extern char bpfilter_umh_end;
 
 static struct umh_info info;
 /* since ip_getsockopt() can run in parallel, serialize access to umh */
@@ -93,7 +90,9 @@ static int __init load_umh(void)
 	int err;
 
 	/* fork usermode process */
-	err = fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
+	err = fork_usermode_blob(&bpfilter_umh_start,
+				 &bpfilter_umh_end - &bpfilter_umh_start,
+				 &info);
 	if (err)
 		return err;
 	pr_info("Loaded bpfilter_umh pid %d\n", info.pid);
diff --git a/net/bpfilter/bpfilter_umh_blob.S b/net/bpfilter/bpfilter_umh_blob.S
new file mode 100644
index 000000000000..40311d10d2f2
--- /dev/null
+++ b/net/bpfilter/bpfilter_umh_blob.S
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+	.section .init.rodata, "a"
+	.global bpfilter_umh_start
+bpfilter_umh_start:
+	.incbin "net/bpfilter/bpfilter_umh"
+	.global bpfilter_umh_end
+bpfilter_umh_end:
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH net-next 3/4] net: check tunnel option type in tunnel flags
From: kbuild test robot @ 2018-06-27  3:08 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kbuild-all, davem, jbenc, Roopa Prabhu, jiri, jhs, xiyou.wangcong,
	daniel, oss-drivers, netdev, Pieter Jansen van Vuuren,
	Jakub Kicinski
In-Reply-To: <20180626185308.3605-4-jakub.kicinski@netronome.com>

Hi Pieter,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Jakub-Kicinski/net-Geneve-options-support-for-TC-act_tunnel_key/20180627-030036
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> net/openvswitch/flow_netlink.c:2532:38: sparse: incorrect type in assignment (different base types)
   net/openvswitch/flow_netlink.c:2532:38:    expected int [signed] [assigned] dst_opt_type
   net/openvswitch/flow_netlink.c:2532:38:    got restricted __be16 [usertype] <noident>
   net/openvswitch/flow_netlink.c:2535:38: sparse: incorrect type in assignment (different base types)
   net/openvswitch/flow_netlink.c:2535:38:    expected int [signed] [assigned] dst_opt_type
   net/openvswitch/flow_netlink.c:2535:38:    got restricted __be16 [usertype] <noident>
   net/openvswitch/flow_netlink.c:2538:38: sparse: incorrect type in assignment (different base types)
   net/openvswitch/flow_netlink.c:2538:38:    expected int [signed] [assigned] dst_opt_type
   net/openvswitch/flow_netlink.c:2538:38:    got restricted __be16 [usertype] <noident>
>> net/openvswitch/flow_netlink.c:2581:51: sparse: incorrect type in argument 4 (different base types)
   net/openvswitch/flow_netlink.c:2581:51:    expected restricted __be16 [usertype] flags
   net/openvswitch/flow_netlink.c:2581:51:    got int [signed] [assigned] dst_opt_type
   net/openvswitch/flow_netlink.c:3064:39: sparse: expression using sizeof(void)

vim +2532 net/openvswitch/flow_netlink.c

  2508	
  2509	static int validate_and_copy_set_tun(const struct nlattr *attr,
  2510					     struct sw_flow_actions **sfa, bool log)
  2511	{
  2512		struct sw_flow_match match;
  2513		struct sw_flow_key key;
  2514		struct metadata_dst *tun_dst;
  2515		struct ip_tunnel_info *tun_info;
  2516		struct ovs_tunnel_info *ovs_tun;
  2517		struct nlattr *a;
  2518		int err = 0, start, opts_type, dst_opt_type;
  2519	
  2520		dst_opt_type = 0;
  2521		ovs_match_init(&match, &key, true, NULL);
  2522		opts_type = ip_tun_from_nlattr(nla_data(attr), &match, false, log);
  2523		if (opts_type < 0)
  2524			return opts_type;
  2525	
  2526		if (key.tun_opts_len) {
  2527			switch (opts_type) {
  2528			case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS:
  2529				err = validate_geneve_opts(&key);
  2530				if (err < 0)
  2531					return err;
> 2532				dst_opt_type = TUNNEL_GENEVE_OPT;
  2533				break;
  2534			case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
  2535				dst_opt_type = TUNNEL_VXLAN_OPT;
  2536				break;
  2537			case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS:
  2538				dst_opt_type = TUNNEL_ERSPAN_OPT;
  2539				break;
  2540			}
  2541		}
  2542	
  2543		start = add_nested_action_start(sfa, OVS_ACTION_ATTR_SET, log);
  2544		if (start < 0)
  2545			return start;
  2546	
  2547		tun_dst = metadata_dst_alloc(key.tun_opts_len, METADATA_IP_TUNNEL,
  2548					     GFP_KERNEL);
  2549	
  2550		if (!tun_dst)
  2551			return -ENOMEM;
  2552	
  2553		err = dst_cache_init(&tun_dst->u.tun_info.dst_cache, GFP_KERNEL);
  2554		if (err) {
  2555			dst_release((struct dst_entry *)tun_dst);
  2556			return err;
  2557		}
  2558	
  2559		a = __add_action(sfa, OVS_KEY_ATTR_TUNNEL_INFO, NULL,
  2560				 sizeof(*ovs_tun), log);
  2561		if (IS_ERR(a)) {
  2562			dst_release((struct dst_entry *)tun_dst);
  2563			return PTR_ERR(a);
  2564		}
  2565	
  2566		ovs_tun = nla_data(a);
  2567		ovs_tun->tun_dst = tun_dst;
  2568	
  2569		tun_info = &tun_dst->u.tun_info;
  2570		tun_info->mode = IP_TUNNEL_INFO_TX;
  2571		if (key.tun_proto == AF_INET6)
  2572			tun_info->mode |= IP_TUNNEL_INFO_IPV6;
  2573		tun_info->key = key.tun_key;
  2574	
  2575		/* We need to store the options in the action itself since
  2576		 * everything else will go away after flow setup. We can append
  2577		 * it to tun_info and then point there.
  2578		 */
  2579		ip_tunnel_info_opts_set(tun_info,
  2580					TUN_METADATA_OPTS(&key, key.tun_opts_len),
> 2581					key.tun_opts_len, dst_opt_type);
  2582		add_nested_action_end(*sfa, start);
  2583	
  2584		return err;
  2585	}
  2586	
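
The warnings above come from storing big-endian flag constants in a
plain int and then passing that int where a __be16 is expected. One
plausible way to silence them is to give dst_opt_type the big-endian
type from the start. The userspace sketch below illustrates the idea
only: be16_t stands in for the kernel's sparse-checked __be16, and the
enum, the function name and the flag values are assumptions made for
this example rather than text from the patch.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Stand-in for the kernel's __be16; sparse is what enforces the real one. */
typedef uint16_t be16_t;

/* Hypothetical option kinds, mirroring the three cases in the switch. */
enum demo_opts_kind {
	DEMO_OPTS_NONE,
	DEMO_OPTS_GENEVE,
	DEMO_OPTS_VXLAN,
	DEMO_OPTS_ERSPAN,
};

/* Keep the flag in big-endian form end to end instead of using an int. */
static be16_t demo_dst_opt_type(enum demo_opts_kind kind)
{
	be16_t dst_opt_type = 0;

	switch (kind) {
	case DEMO_OPTS_GENEVE:
		dst_opt_type = htons(0x0800); /* illustrative flag value */
		break;
	case DEMO_OPTS_VXLAN:
		dst_opt_type = htons(0x1000); /* illustrative flag value */
		break;
	case DEMO_OPTS_ERSPAN:
		dst_opt_type = htons(0x4000); /* illustrative flag value */
		break;
	default:
		break;
	}
	return dst_opt_type;
}
```

With the variable typed this way, both the assignments and the call
that consumes the value would agree on the base type.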

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* [PATCH bpf-next] nfp: bpf: allow source ptr type be map ptr in memcpy optimization
From: Jakub Kicinski @ 2018-06-27  2:48 UTC (permalink / raw)
  To: alexei.starovoitov, daniel; +Cc: netdev, oss-drivers, Jiong Wang

From: Jiong Wang <jiong.wang@netronome.com>

Map reads are already supported on NFP; this patch enables the memcpy
optimization for copies from map to packet.

This patch also fixes a latent bug that would cause copying from an
unexpected address once memcpy for map pointers is enabled.  The fixed
code path was not exercised before.

Reported-by: Mary Pham <mary.pham@netronome.com>
Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
---
Reposting separately from the mul/div patches.

 drivers/net/ethernet/netronome/nfp/bpf/jit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 8a92088df0d7..33111739b210 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -670,7 +670,7 @@ static int nfp_cpp_memcpy(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 	xfer_num = round_up(len, 4) / 4;
 
 	if (src_40bit_addr)
-		addr40_offset(nfp_prog, meta->insn.src_reg, off, &src_base,
+		addr40_offset(nfp_prog, meta->insn.src_reg * 2, off, &src_base,
 			      &off);
 
 	/* Setup PREV_ALU fields to override memory read length. */
@@ -3299,7 +3299,8 @@ curr_pair_is_memcpy(struct nfp_insn_meta *ld_meta,
 	if (!is_mbpf_load(ld_meta) || !is_mbpf_store(st_meta))
 		return false;
 
-	if (ld_meta->ptr.type != PTR_TO_PACKET)
+	if (ld_meta->ptr.type != PTR_TO_PACKET &&
+	    ld_meta->ptr.type != PTR_TO_MAP_VALUE)
 		return false;
 
 	if (st_meta->ptr.type != PTR_TO_PACKET)
-- 
2.17.1

^ permalink raw reply related

* [RFC bpf-next 6/6] samples/bpf: Add meta data hash example to xdp_redirect_cpu
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed
In-Reply-To: <20180627024615.17856-1-saeedm@mellanox.com>

Add a new program (prog_num = 4) that will not parse packets and will
use the meta data hash to spread/redirect traffic into different cpus.

For the new program we set on bpf_set_link_xdp_fd:
	xdp_flags |= XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

On mlx5 it will succeed since mlx5 already supports these flags.

The new program will read the value of the hash from the data_meta
pointer from the xdp_md and will use it to compute the destination cpu.
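
As a small illustration of the hash-to-cpu step, the sketch below
scales a 32-bit hash into the range [0, n) with a multiply and shift,
which is how the kernel's reciprocal_scale() helper is commonly
written. The demo_ prefix marks it as a standalone copy for this
example; how the sample itself obtains the helper is not shown here.

```c
#include <stdint.h>

/* Map a 32-bit value into [0, ep_ro) without a division:
 * take the high 32 bits of the 64-bit product val * ep_ro.
 */
static uint32_t demo_reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}
```

For example, with 64 cpu slots a hash of 0x80000000 lands on index 32,
and the largest possible hash lands on index 63.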

Note: I didn't test this patch to show redirect works with the hash!
I only used it to see that the hash and vlan values are set correctly
by the driver and can be seen by the xdp program.

* I had some difficulty reading the hash value using the helper
functions defined in the previous patches, but once I used the same logic
without these functions it worked! Will have to figure this out later.
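
The open-coded read mentioned above boils down to a bounds check of
the hash record against the start of packet data. A minimal userspace
sketch follows; struct demo_md_hash and demo_meta_hash() are
hypothetical stand-ins, since the real struct xdp_md_hash layout comes
from an earlier patch in this series and is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the hash record at the start of the meta area. */
struct demo_md_hash {
	uint32_t hash;
	uint32_t type;
};

/* Return the hash record at data_meta, or NULL if the record would
 * extend past the start of packet data.
 */
static const struct demo_md_hash *
demo_meta_hash(const void *data_meta, const void *data)
{
	const struct demo_md_hash *hash = data_meta;

	if ((const void *)(hash + 1) > data)
		return NULL;
	return hash;
}
```

The check has the same shape as the "hash + 1 > data" test in the
program: the record is only dereferenced once it is known to fit
entirely between data_meta and data.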

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 samples/bpf/xdp_redirect_cpu_kern.c | 67 +++++++++++++++++++++++++++++
 samples/bpf/xdp_redirect_cpu_user.c |  7 +++
 2 files changed, 74 insertions(+)

diff --git a/samples/bpf/xdp_redirect_cpu_kern.c b/samples/bpf/xdp_redirect_cpu_kern.c
index 303e9e7161f3..d6b3f55f342a 100644
--- a/samples/bpf/xdp_redirect_cpu_kern.c
+++ b/samples/bpf/xdp_redirect_cpu_kern.c
@@ -376,6 +376,73 @@ int  xdp_prognum3_proto_separate(struct xdp_md *ctx)
 	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
 }
 
+#if 0
+xdp_md_info_arr mdi = {
+	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
+	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
+};
+#endif
+
+SEC("xdp_cpu_map4_hash_separate")
+int  xdp_prognum4_hash_separate(struct xdp_md *ctx)
+{
+	void *data_meta = (void *)(long)ctx->data_meta;
+	void *data_end  = (void *)(long)ctx->data_end;
+	void *data      = (void *)(long)ctx->data;
+	struct xdp_md_hash *hash;
+	struct xdp_md_vlan *vlan;
+	struct datarec *rec;
+	u32 cpu_dest = 0;
+	u32 cpu_idx = 0;
+	u32 *cpu_lookup;
+	u32 key = 0;
+
+	/* Count RX packet in map */
+	rec = bpf_map_lookup_elem(&rx_cnt, &key);
+	if (!rec)
+		return XDP_ABORTED;
+	rec->processed++;
+
+	/* for some reason this code fails to be verified */
+#if 0
+	hash = xdp_data_meta_get_hash(mdi, data_meta);
+	if (hash + 1 > data)
+		return XDP_ABORTED;
+
+	vlan = xdp_data_meta_get_vlan(mdi, data_meta);
+	if (vlan + 1 > data)
+		return XDP_ABORTED;
+#endif
+
+	/* Work around for the above code */
+	hash = data_meta; /* since we know hash will appear first */
+        if (hash + 1 > data)
+		return XDP_ABORTED;
+
+#if 0
+	// Just for testing
+	/* We know that vlan will appear after the hash */
+	vlan = (void *)((char *)data_meta + sizeof(*hash));
+	if (vlan + 1 > data) {
+		return XDP_ABORTED;
+	}
+#endif
+
+	cpu_idx = reciprocal_scale(hash->hash, MAX_CPUS);
+
+	cpu_lookup = bpf_map_lookup_elem(&cpus_available, &cpu_idx);
+	if (!cpu_lookup)
+		return XDP_ABORTED;
+	cpu_dest = *cpu_lookup;
+
+	if (cpu_dest >= MAX_CPUS) {
+		rec->issue++;
+		return XDP_ABORTED;
+	}
+
+	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
+}
+
 SEC("xdp_cpu_map4_ddos_filter_pktgen")
 int  xdp_prognum4_ddos_filter_pktgen(struct xdp_md *ctx)
 {
diff --git a/samples/bpf/xdp_redirect_cpu_user.c b/samples/bpf/xdp_redirect_cpu_user.c
index f6efaefd485b..3429215d5a7b 100644
--- a/samples/bpf/xdp_redirect_cpu_user.c
+++ b/samples/bpf/xdp_redirect_cpu_user.c
@@ -679,6 +679,13 @@ int main(int argc, char **argv)
 		return EXIT_FAIL_OPTION;
 	}
 
+	/*
+	 * prog_num 4 requires xdp meta data hash
+	 * Vlan is not required but added just for testing..
+	 */
+	if (prog_num == 4)
+		xdp_flags |= XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
+
 	/* Remove XDP program when program is interrupted */
 	signal(SIGINT, int_exit);
 
-- 
2.17.0

^ permalink raw reply related

* [RFC bpf-next 5/6] net/mlx5e: Add XDP RX meta data support
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed
In-Reply-To: <20180627024615.17856-1-saeedm@mellanox.com>

Implement XDP meta data hash and vlan support.

1. In the xdp setup ndo: add support for XDP_QUERY_META_FLAGS and return
the two supported flags.
2. Use the xdp_data_meta_set_hash and xdp_data_meta_set_vlan helpers to
fill in the meta data fields.
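
To make step 2 concrete, here is a standalone sketch of how the two
meta data values could be derived from completion fields: the hash
type is picked from RSS type bits with L4 taking priority over L3, and
the vlan field is the tag ORed with a "present" bit. All DEMO_ names
and bit values are placeholders chosen for this example and should not
be read as the driver's actual definitions.

```c
#include <stdint.h>

/* Placeholder bit masks; the real values live in the mlx5 headers. */
#define DEMO_RSS_HTYPE_IP	0x0c
#define DEMO_RSS_HTYPE_L4	0xc0
#define DEMO_VLAN_TAG_PRESENT	0x1000

enum demo_hash_type {
	DEMO_HASH_TYPE_NONE,
	DEMO_HASH_TYPE_L3,
	DEMO_HASH_TYPE_L4,
};

/* Prefer L4 when any L4 bit is set, then L3, else none. */
static enum demo_hash_type demo_pick_hash_type(uint8_t cht)
{
	if (cht & DEMO_RSS_HTYPE_L4)
		return DEMO_HASH_TYPE_L4;
	if (cht & DEMO_RSS_HTYPE_IP)
		return DEMO_HASH_TYPE_L3;
	return DEMO_HASH_TYPE_NONE;
}

/* Combine the (already cpu-endian) tag with a "present" marker bit. */
static uint16_t demo_vlan_field(int has_vlan, uint16_t vlan_info)
{
	return (uint16_t)((!!has_vlan * DEMO_VLAN_TAG_PRESENT) | vlan_info);
}
```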

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  6 ++++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 30 ++++++++++++++++++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8debae6b9cab..3d1066a953cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4189,6 +4189,9 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 	return prog_id;
 }
 
+#define MLX5E_SUPPORTED_XDP_META_FLAGS  \
+             (XDP_FLAGS_META_HASH  | XDP_FLAGS_META_VLAN)
+
 static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	struct mlx5e_xdp_info xdp_info;
@@ -4204,6 +4207,9 @@ static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 		xdp->prog_id = mlx5e_xdp_query(dev);
 		xdp->prog_attached = !!xdp->prog_id;
 		return 0;
+	case XDP_QUERY_META_FLAGS:
+		xdp->meta_flags = MLX5E_SUPPORTED_XDP_META_FLAGS;
+		return 0;
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index e37f9747a0e3..1f3e934d0dd8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -920,6 +920,29 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_rq *rq,
 	return true;
 }
 
+static void
+mlx5e_xdp_fill_data_meta(xdp_md_info_arr mdi,  void *data_meta, struct mlx5_cqe64 *cqe)
+{
+	if (xdp_data_meta_present(mdi, XDP_DATA_META_HASH))
+	{
+		u8 cht = cqe->rss_hash_type;
+		int ht = (cht & CQE_RSS_HTYPE_L4) ? PKT_HASH_TYPE_L4 :
+			 (cht & CQE_RSS_HTYPE_IP) ? PKT_HASH_TYPE_L3 :
+						    PKT_HASH_TYPE_NONE;
+		u32 hash = be32_to_cpu(cqe->rss_hash_result);
+
+		xdp_data_meta_set_hash(mdi, data_meta, hash, ht);
+	}
+
+	if (xdp_data_meta_present(mdi, XDP_DATA_META_VLAN))
+	{
+		u16 vlan = (!!cqe_has_vlan(cqe) * VLAN_TAG_PRESENT) |
+			   be16_to_cpu(cqe->vlan_info);
+
+		xdp_data_meta_set_vlan(mdi, data_meta, vlan);
+	}
+}
+
 /* returns true if packet was consumed by xdp */
 static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    struct mlx5e_dma_info *di,
@@ -935,11 +958,16 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 		return false;
 
 	xdp.data = va + *rx_headroom;
-	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = xdp.data + *len;
 	xdp.data_hard_start = va;
 	xdp.rxq = &rq->xdp_rxq;
 
+	if (rq->xdp.flags & XDP_FLAGS_META_ALL) {
+		xdp_reset_data_meta(&xdp);
+		mlx5e_xdp_fill_data_meta(rq->xdp.md_info, xdp.data_meta, cqe);
+	} else
+		xdp_set_data_meta_invalid(&xdp);
+
 	act = bpf_prog_run_xdp(prog, &xdp);
 	switch (act) {
 	case XDP_PASS:
-- 
2.17.0

^ permalink raw reply related

* [RFC bpf-next 4/6] net/mlx5e: Pass CQE to RX handlers
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed
In-Reply-To: <20180627024615.17856-1-saeedm@mellanox.com>

CQE has all the meta data information from HW.
Make it available to the driver xdp handlers.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++++++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 15 +++++++++------
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 5893acfae307..98bb315fc8a8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -505,7 +505,8 @@ struct mlx5e_rq;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-			       u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+			       u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+			       struct mlx5_cqe64 *cqe);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 			 struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt);
@@ -901,10 +902,12 @@ void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
 void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				struct mlx5_cqe64 *cqe);
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				   u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				   struct mlx5_cqe64 *cqe);
 struct sk_buff *
 mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 			  struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index d12577c17011..e37f9747a0e3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -923,7 +923,8 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_rq *rq,
 /* returns true if packet was consumed by xdp */
 static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    struct mlx5e_dma_info *di,
-				    void *va, u16 *rx_headroom, u32 *len)
+				    void *va, u16 *rx_headroom,
+				    u32 *len, struct mlx5_cqe64 *cqe)
 {
 	struct bpf_prog *prog = READ_ONCE(rq->xdp.prog);
 	struct xdp_buff xdp;
@@ -1012,7 +1013,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 	}
 
 	rcu_read_lock();
-	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt);
+	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt, cqe);
 	rcu_read_unlock();
 	if (consumed)
 		return NULL; /* page/packet was consumed by XDP */
@@ -1155,7 +1156,8 @@ void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				   u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				   struct mlx5_cqe64 *cqe)
 {
 	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
@@ -1202,7 +1204,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				struct mlx5_cqe64 *cqe)
 {
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
 	u16 rx_headroom = rq->buff.headroom;
@@ -1221,7 +1224,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 	prefetch(data);
 
 	rcu_read_lock();
-	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32);
+	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32, cqe);
 	rcu_read_unlock();
 	if (consumed) {
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
@@ -1268,7 +1271,7 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
 
 	skb = rq->mpwqe.skb_from_cqe_mpwrq(rq, wi, cqe_bcnt, head_offset,
-					   page_idx);
+					   page_idx, cqe);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
-- 
2.17.0

^ permalink raw reply related
