Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 3/5] net: thunderx: Fix configuration of L3/L4 length checking
From: Sunil Kovvuri @ 2016-11-14 17:26 UTC (permalink / raw)
  To: Corentin Labbe; +Cc: Linux Netdev List, Sunil Goutham, LKML, LAKML
In-Reply-To: <20161114123350.GA2449@Red>

>>You could use the BIT() macro here
Thanks, will change and resubmit.

Sunil.

^ permalink raw reply

* Re: [PATCH 2/6] dt-bindings: phy: Add documentation for NSP USB3 PHY
From: Rob Herring @ 2016-11-14 17:23 UTC (permalink / raw)
  To: Yendapally Reddy Dhananjaya Reddy
  Cc: Mark Rutland, Russell King, Ray Jui, Scott Branden, Jon Mason,
	Florian Fainelli, Kishon Vijay Abraham I,
	bcm-kernel-feedback-list, netdev, devicetree, linux-kernel,
	linux-arm-kernel
In-Reply-To: <1478683994-12008-3-git-send-email-yendapally.reddy@broadcom.com>

On Wed, Nov 09, 2016 at 04:33:10AM -0500, Yendapally Reddy Dhananjaya Reddy wrote:
> Add documentation for USB3 PHY available in Northstar plus SoC
> 
> Signed-off-by: Yendapally Reddy Dhananjaya Reddy <yendapally.reddy@broadcom.com>
> ---
>  .../devicetree/bindings/phy/brcm,nsp-usb3-phy.txt  | 39 ++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/phy/brcm,nsp-usb3-phy.txt
> 
> diff --git a/Documentation/devicetree/bindings/phy/brcm,nsp-usb3-phy.txt b/Documentation/devicetree/bindings/phy/brcm,nsp-usb3-phy.txt
> new file mode 100644
> index 0000000..30cf4b9
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/phy/brcm,nsp-usb3-phy.txt
> @@ -0,0 +1,39 @@
> +Broadcom USB3 phy binding northstar plus SoC
> +This is a child bus node of "brcm,mdio-mux-nsp" node.
> +
> +Required mdio bus properties:
> +- reg: MDIO Bus number for the MDIO interface
> +- #address-cells: must be 1
> +- #size-cells: must be 0
> +
> +Required PHY properties:
> +- compatible: should be "brcm,nsp-usb3-phy"
> +- reg: Phy address in the MDIO interface
> +- usb3-ctrl-syscon: handler of syscon node defining physical address
> +  of usb3 control register.
> +- #phy-cells: must be 0
> +
> +Required usb3 control properties:
> +- compatible: should be "brcm,nsp-usb3-ctrl"
> +- reg: offset and length of the control registers
> +
> +Example:
> +
> +	mdio@0 {
> +		reg = <0x0>;
> +		#address-cells = <1>;
> +		#size-cells = <0>;
> +
> +		usb3_phy: usb3-phy@10 {

Just 'usb-phy@10'. With that,

Acked-by: Rob Herring <robh@kernel.org>

^ permalink raw reply

* Re: [PATCH 1/6] dt-bindings: mdio-mux: Add documentation for mdio mux for NSP SoC
From: Rob Herring @ 2016-11-14 17:22 UTC (permalink / raw)
  To: Yendapally Reddy Dhananjaya Reddy
  Cc: Mark Rutland, Russell King, Ray Jui, Scott Branden, Jon Mason,
	Florian Fainelli, Kishon Vijay Abraham I,
	bcm-kernel-feedback-list-dY08KVG/lbpWk0Htik3J/w,
	netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
In-Reply-To: <1478683994-12008-2-git-send-email-yendapally.reddy-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>

On Wed, Nov 09, 2016 at 04:33:09AM -0500, Yendapally Reddy Dhananjaya Reddy wrote:
> Add documentation for mdio mux available in Broadcom NSP SoC
> 
> Signed-off-by: Yendapally Reddy Dhananjaya Reddy <yendapally.reddy-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
> ---
>  .../devicetree/bindings/net/brcm,mdio-mux-nsp.txt  | 57 ++++++++++++++++++++++
>  1 file changed, 57 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/brcm,mdio-mux-nsp.txt

Acked-by: Rob Herring <robh-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Kernel 4.8.7 crashing
From: Cong Wang @ 2016-11-14 17:18 UTC (permalink / raw)
  To: Ashton Holmes; +Cc: LKML, Linux Kernel Network Developers
In-Reply-To: <CAM_iQpUiBE93MsYsMWKcSRBTCNMPvWRqiqAkH3DDqFP1cCSdtw@mail.gmail.com>

(Really Cc'ing netdev.. ;-p )

On Mon, Nov 14, 2016 at 9:18 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> Cc'ing netdev.
>
> On Sat, Nov 12, 2016 at 2:17 PM, Ashton Holmes <scoopta@gmail.com> wrote:
>> I upgraded to 4.8.7 and the system boots and my root partition gets
>> decrypted but right after that both of my monitors turn off and
>> looking at syslog from 4.8.6 shows the following:
>>
>> Nov 12 11:54:24 user-desktop kernel: [   19.853197] ------------[ cut
>> here ]------------
>> Nov 12 11:54:24 user-desktop kernel: [   19.853206] WARNING: CPU: 2
>> PID: 177 at fs/proc/proc_sysctl.c:1607 ops_exit_list.isra.4+0x33/0x60
>
> Probably this warning in source code:
>
> void retire_sysctl_set(struct ctl_table_set *set)
> {
>         WARN_ON(!RB_EMPTY_ROOT(&set->dir.root));
> }
>
>
>> Nov 12 11:54:24 user-desktop kernel: [   19.853210] Modules linked in:
>> amdgpu eeepc_wmi i2c_algo_bit asus_wmi drm_kms_helper video rfkill
>> snd_hda_codec_realtek ttm snd_hda_codec_generic snd_hda_codec_hdmi drm
>> snd_hda_intel snd_usb_audio sparse_keymap snd_hda_codec mxm_wmi
>> snd_usbmidi_lib snd_hda_core efi_pstore snd_hwdep snd_rawmidi evdev
>> snd_seq_device joydev snd_pcm serio_raw pcspkr efivars fam15h_power
>> k10temp snd_timer snd sp5100_tco i2c_piix4 sg soundcore i2c_core
>> shpchp tpm_infineon tpm_tis tpm_tis_core tpm wmi button it87 hwmon_vid
>> autofs4 ext4 crc16 jbd2 mbcache algif_skcipher af_alg dm_crypt dm_mod
>> hid_generic usbhid hid sr_mod cdrom sd_mod ohci_pci crct10dif_pclmul
>> crc32_pclmul crc32c_intel aesni_intel aes_x86_64 glue_helper lrw
>> gf128mul ablk_helper cryptd psmouse ohci_hcd ahci libahci ehci_pci
>> ehci_hcd libata xhci_pci xhci_hcd e1000e usbcore scsi_mod ptp
>> usb_common pps_core
>> Nov 12 11:54:24 user-desktop kernel: [   19.853264] CPU: 2 PID: 177
>> Comm: kworker/u16:2 Not tainted 4.8.7 #1
>> Nov 12 11:54:24 user-desktop kernel: [   19.853267] Hardware name: To
>> be filled by O.E.M. To be filled by O.E.M./CROSSHAIR V FORMULA-Z, BIOS
>> 2201 03/23/2015
>> Nov 12 11:54:24 user-desktop kernel: [   19.853273] Workqueue: netns cleanup_net
>> Nov 12 11:54:24 user-desktop kernel: [   19.853276]  0000000000000286
>> 00000000b98071d2 ffffffff81302a8f 0000000000000000
>> Nov 12 11:54:24 user-desktop kernel: [   19.853281]  0000000000000000
>> ffffffff81076f54 ffff880814f88040 ffff8808140efdf0
>> Nov 12 11:54:24 user-desktop kernel: [   19.853285]  ffffffff818f11d8
>> ffffffff818f11e0 ffff8808140efde0 0000000000000200
>> Nov 12 11:54:24 user-desktop kernel: [   19.853290] Call Trace:
>> Nov 12 11:54:24 user-desktop kernel: [   19.853295]
>> [<ffffffff81302a8f>] ? dump_stack+0x5c/0x7d
>> Nov 12 11:54:24 user-desktop kernel: [   19.853299]
>> [<ffffffff81076f54>] ? __warn+0xc4/0xe0
>> Nov 12 11:54:24 user-desktop kernel: [   19.853302]
>> [<ffffffff8149e243>] ? ops_exit_list.isra.4+0x33/0x60
>> Nov 12 11:54:24 user-desktop kernel: [   19.853305]
>> [<ffffffff8149f230>] ? cleanup_net+0x1b0/0x290
>> Nov 12 11:54:24 user-desktop kernel: [   19.853309]
>> [<ffffffff8108fefd>] ? process_one_work+0x14d/0x410
>> Nov 12 11:54:24 user-desktop kernel: [   19.853312]
>> [<ffffffff81090cc2>] ? worker_thread+0x62/0x490
>> Nov 12 11:54:24 user-desktop kernel: [   19.853315]
>> [<ffffffff81090c60>] ? rescuer_thread+0x340/0x340
>> Nov 12 11:54:24 user-desktop kernel: [   19.853318]
>> [<ffffffff81095f3f>] ? kthread+0xdf/0x100
>> Nov 12 11:54:24 user-desktop kernel: [   19.853322]
>> [<ffffffff8102b78b>] ? __switch_to+0x2bb/0x710
>> Nov 12 11:54:24 user-desktop kernel: [   19.853325]
>> [<ffffffff815a371f>] ? ret_from_fork+0x1f/0x40
>> Nov 12 11:54:24 user-desktop kernel: [   19.853328]
>> [<ffffffff81095e60>] ? kthread_park+0x50/0x50
>> Nov 12 11:54:24 user-desktop kernel: [   19.853331] ---[ end trace
>> c4840b46b58dbe12 ]---
>
> Looks like net->sysctls is not empty when unregistering,
> perhaps unregister_net_sysctl_table() is missing somewhere.

^ permalink raw reply

* Re: [PATCH v3] ip6_output: ensure flow saddr actually belongs to device
From: David Ahern @ 2016-11-14 17:17 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Jason A. Donenfeld, Netdev,
	WireGuard mailing list, LKML, YOSHIFUJI Hideaki
In-Reply-To: <7d8c0210-9132-c755-9053-6ec19409e343@stressinduktion.org>

On 11/14/16 10:04 AM, Hannes Frederic Sowa wrote:
> On 14.11.2016 17:55, David Ahern wrote:
>> On 11/14/16 9:44 AM, Hannes Frederic Sowa wrote:
>>> On Mon, Nov 14, 2016, at 00:28, Jason A. Donenfeld wrote:
>>>> This puts the IPv6 routing functions in parity with the IPv4 routing
>>>> functions. Namely, we now check in v6 that if a flowi6 requests an
>>>> saddr, the returned dst actually corresponds to a net device that has
>>>> that saddr. This mirrors the v4 logic with __ip_dev_find in
>>>> __ip_route_output_key_hash. In the event that the returned dst is not
>>>> for a dst with a dev that has the saddr, we return -EINVAL, just like
>>>> v4; this makes it easy to use the same error handlers for both cases.
>>>>
>>>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>>>> Cc: David Ahern <dsa@cumulusnetworks.com>
>>>> ---
>>>> Changes from v2:
>>>>     It turns out ipv6_chk_addr already has the device enumeration
>>>>     logic that we need by simply passing NULL.
>>>>
>>>>  net/ipv6/ip6_output.c | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>>>> index 6001e78..b3b5cb6 100644
>>>> --- a/net/ipv6/ip6_output.c
>>>> +++ b/net/ipv6/ip6_output.c
>>>> @@ -926,6 +926,10 @@ static int ip6_dst_lookup_tail(struct net *net,
>>>> const struct sock *sk,
>>>>  	int err;
>>>>  	int flags = 0;
>>>>  
>>>> +       if (!ipv6_addr_any(&fl6->saddr) &&
>>>> +           !ipv6_chk_addr(net, &fl6->saddr, NULL, 1))
>>>> +               return -EINVAL;
>>>
>>> Hmm, this check is too permissive, no?
>>>
>>> E.g. what happens if you move a link local address from one interface to
>>> another? In this case this code would still allow the saddr to be used.
>>
>> This check -- like the ipv4 variant -- only verifies the saddr is locally assigned. If the address moves interfaces it should be fine.
> 
> But in this case we should actually bail out, no?
> 
> Let's say, user assumes we are on ifindex eth0 with LL address from
> eth0. Suddenly the LL address from eth0 is moved to eth1, we can't
> accept this source address anymore and need to return -EINVAL, too.

so you mean if rt6_need_strict(&fl6->saddr) then the dev needs to be considered.


> 
>>> I just also quickly read up on the history (sorry was travelling last
>>> week) and wonder if you ever saw a user space facing bug or if this is
>>> basically some difference you saw while writing out of tree code?
>>
>> I checked the userspace API this morning. bind and cmsg for example check that the address is valid with calls to ipv6_chk_addr.
> 
> Hmm, so it fixes no real bug.
> 
> Because of translations of flowi6_oif we actually can't do a correct
> check of source address for cases like the one I outlined above? Hmm,
> maybe we should simply depend on user space checks.

I believe Jason's case is forwarding path and the ipv6_stub->ipv6_dst_lookup API.

^ permalink raw reply

* Assalamu`Alaikum.
From: mohammad ouattara @ 2016-11-14 17:00 UTC (permalink / raw)

In-Reply-To: <1854474938.4647011.1479142846322.ref@mail.yahoo.com>




Dear Sir/Madam.

Assalamu`Alaikum.

I am Dr mohammad ouattara, I have  ($10.6 Million us dollars) to transfer into your account,

I will send you more details about this deal and the procedures to follow when I receive a positive response from you, 

Have a great day,
Dr mohammad ouattara.

^ permalink raw reply

* Re: [PATCH v3] ip6_output: ensure flow saddr actually belongs to device
From: Hannes Frederic Sowa @ 2016-11-14 17:04 UTC (permalink / raw)
  To: David Ahern, Jason A. Donenfeld, Netdev, WireGuard mailing list,
	LKML, YOSHIFUJI Hideaki
In-Reply-To: <fd1c804d-3e44-0321-8a3e-67d6ff7357fa@cumulusnetworks.com>

On 14.11.2016 17:55, David Ahern wrote:
> On 11/14/16 9:44 AM, Hannes Frederic Sowa wrote:
>> On Mon, Nov 14, 2016, at 00:28, Jason A. Donenfeld wrote:
>>> This puts the IPv6 routing functions in parity with the IPv4 routing
>>> functions. Namely, we now check in v6 that if a flowi6 requests an
>>> saddr, the returned dst actually corresponds to a net device that has
>>> that saddr. This mirrors the v4 logic with __ip_dev_find in
>>> __ip_route_output_key_hash. In the event that the returned dst is not
>>> for a dst with a dev that has the saddr, we return -EINVAL, just like
>>> v4; this makes it easy to use the same error handlers for both cases.
>>>
>>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>>> Cc: David Ahern <dsa@cumulusnetworks.com>
>>> ---
>>> Changes from v2:
>>>     It turns out ipv6_chk_addr already has the device enumeration
>>>     logic that we need by simply passing NULL.
>>>
>>>  net/ipv6/ip6_output.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>>> index 6001e78..b3b5cb6 100644
>>> --- a/net/ipv6/ip6_output.c
>>> +++ b/net/ipv6/ip6_output.c
>>> @@ -926,6 +926,10 @@ static int ip6_dst_lookup_tail(struct net *net,
>>> const struct sock *sk,
>>>  	int err;
>>>  	int flags = 0;
>>>  
>>> +       if (!ipv6_addr_any(&fl6->saddr) &&
>>> +           !ipv6_chk_addr(net, &fl6->saddr, NULL, 1))
>>> +               return -EINVAL;
>>
>> Hmm, this check is too permissive, no?
>>
>> E.g. what happens if you move a link local address from one interface to
>> another? In this case this code would still allow the saddr to be used.
> 
> This check -- like the ipv4 variant -- only verifies the saddr is locally assigned. If the address moves interfaces it should be fine.

But in this case we should actually bail out, no?

Let's say, user assumes we are on ifindex eth0 with LL address from
eth0. Suddenly the LL address from eth0 is moved to eth1, we can't
accept this source address anymore and need to return -EINVAL, too.

>> I just also quickly read up on the history (sorry was travelling last
>> week) and wonder if you ever saw a user space facing bug or if this is
>> basically some difference you saw while writing out of tree code?
> 
> I checked the userspace API this morning. bind and cmsg for example check that the address is valid with calls to ipv6_chk_addr.

Hmm, so it fixes no real bug.

Because of translations of flowi6_oif we actually can't do a correct
check of source address for cases like the one I outlined above? Hmm,
maybe we should simply depend on user space checks.

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH v3] ip6_output: ensure flow saddr actually belongs to device
From: David Ahern @ 2016-11-14 16:55 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Jason A. Donenfeld, Netdev,
	WireGuard mailing list, LKML, YOSHIFUJI Hideaki
In-Reply-To: <1479141867.3723362.787321689.4A3DCFD6@webmail.messagingengine.com>

On 11/14/16 9:44 AM, Hannes Frederic Sowa wrote:
> On Mon, Nov 14, 2016, at 00:28, Jason A. Donenfeld wrote:
>> This puts the IPv6 routing functions in parity with the IPv4 routing
>> functions. Namely, we now check in v6 that if a flowi6 requests an
>> saddr, the returned dst actually corresponds to a net device that has
>> that saddr. This mirrors the v4 logic with __ip_dev_find in
>> __ip_route_output_key_hash. In the event that the returned dst is not
>> for a dst with a dev that has the saddr, we return -EINVAL, just like
>> v4; this makes it easy to use the same error handlers for both cases.
>>
>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>> Cc: David Ahern <dsa@cumulusnetworks.com>
>> ---
>> Changes from v2:
>>     It turns out ipv6_chk_addr already has the device enumeration
>>     logic that we need by simply passing NULL.
>>
>>  net/ipv6/ip6_output.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>> index 6001e78..b3b5cb6 100644
>> --- a/net/ipv6/ip6_output.c
>> +++ b/net/ipv6/ip6_output.c
>> @@ -926,6 +926,10 @@ static int ip6_dst_lookup_tail(struct net *net,
>> const struct sock *sk,
>>  	int err;
>>  	int flags = 0;
>>  
>> +       if (!ipv6_addr_any(&fl6->saddr) &&
>> +           !ipv6_chk_addr(net, &fl6->saddr, NULL, 1))
>> +               return -EINVAL;
> 
> Hmm, this check is too permissive, no?
> 
> E.g. what happens if you move a link local address from one interface to
> another? In this case this code would still allow the saddr to be used.

This check -- like the ipv4 variant -- only verifies the saddr is locally assigned. If the address moves interfaces it should be fine.

> 
> I just also quickly read up on the history (sorry was travelling last
> week) and wonder if you ever saw a user space facing bug or if this is
> basically some difference you saw while writing out of tree code?

I checked the userspace API this morning. bind and cmsg for example check that the address is valid with calls to ipv6_chk_addr.

^ permalink raw reply

* Re: [PATCH v3] ip6_output: ensure flow saddr actually belongs to device
From: Jason A. Donenfeld @ 2016-11-14 16:54 UTC (permalink / raw)
  To: David Ahern, David Miller
  Cc: Netdev, Hannes Frederic Sowa, YOSHIFUJI Hideaki,
	WireGuard mailing list, LKML
In-Reply-To: <CAHmME9qC4xqGOwJnauXrJBDkAtmmuJ+kJKL6ufuU9_XWKNFdSA@mail.gmail.com>


[-- Attachment #1.1: Type: text/plain, Size: 201 bytes --]

On Nov 14, 2016 17:19, "David Ahern" <dsa@cumulusnetworks.com> wrote:
>
> LGTM
>
> Acked-by: David Ahern <dsa@cumulusnetworks.com>

Great.

@DaveM: can we get this in 4.9 and in stable?

Thanks,
Jason

[-- Attachment #1.2: Type: text/html, Size: 410 bytes --]

[-- Attachment #2: Type: text/plain, Size: 147 bytes --]

_______________________________________________
WireGuard mailing list
WireGuard@lists.zx2c4.com
http://lists.zx2c4.com/mailman/listinfo/wireguard

^ permalink raw reply

* Re: [PATCH v3] ip6_output: ensure flow saddr actually belongs to device
From: Hannes Frederic Sowa @ 2016-11-14 16:44 UTC (permalink / raw)
  To: Jason A. Donenfeld, David Ahern, Netdev, WireGuard mailing list,
	LKML, YOSHIFUJI Hideaki
In-Reply-To: <20161113232813.28926-1-Jason@zx2c4.com>

On Mon, Nov 14, 2016, at 00:28, Jason A. Donenfeld wrote:
> This puts the IPv6 routing functions in parity with the IPv4 routing
> functions. Namely, we now check in v6 that if a flowi6 requests an
> saddr, the returned dst actually corresponds to a net device that has
> that saddr. This mirrors the v4 logic with __ip_dev_find in
> __ip_route_output_key_hash. In the event that the returned dst is not
> for a dst with a dev that has the saddr, we return -EINVAL, just like
> v4; this makes it easy to use the same error handlers for both cases.
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: David Ahern <dsa@cumulusnetworks.com>
> ---
> Changes from v2:
>     It turns out ipv6_chk_addr already has the device enumeration
>     logic that we need by simply passing NULL.
> 
>  net/ipv6/ip6_output.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 6001e78..b3b5cb6 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -926,6 +926,10 @@ static int ip6_dst_lookup_tail(struct net *net,
> const struct sock *sk,
>  	int err;
>  	int flags = 0;
>  
> +       if (!ipv6_addr_any(&fl6->saddr) &&
> +           !ipv6_chk_addr(net, &fl6->saddr, NULL, 1))
> +               return -EINVAL;

Hmm, this check is too permissive, no?

E.g. what happens if you move a link local address from one interface to
another? In this case this code would still allow the saddr to be used.

I just also quickly read up on the history (sorry was travelling last
week) and wonder if you ever saw a user space facing bug or if this is
basically some difference you saw while writing out of tree code?

Thanks,
Hannes

^ permalink raw reply

* Re: [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related
From: Stefan Hajnoczi @ 2016-11-14 16:25 UTC (permalink / raw)
  To: Russell King - ARM Linux, David Woodhouse
  Cc: jasowang, vyasevic, stefanha, netdev, qemu-devel,
	igor.v.kovalenko, dgilbert
In-Reply-To: <20161114092947.GB2031@work-vm>

[-- Attachment #1: Type: text/plain, Size: 3792 bytes --]

On Mon, Nov 14, 2016 at 09:29:48AM +0000, Dr. David Alan Gilbert wrote:
> * Russell King - ARM Linux (linux@armlinux.org.uk) wrote:
> > On Fri, Nov 11, 2016 at 09:23:43PM +0000, David Woodhouse wrote:
> > > It's also *fairly* unlikely that the kernel in the guest has developed
> > > a bug and isn't setting gso_size sanely. I'm more inclined to suspect
> > > that qemu isn't properly emulating those bits. But at first glance at
> > > the code, it looks like *that's* been there for the last decade too...
> > 
> > I take issue with that, having looked at the qemu rtl8139 code:
> > 
> >                 if ((txdw0 & CP_TX_LGSEN) && ip_protocol == IP_PROTO_TCP)
> >                 {
> >                     int large_send_mss = (txdw0 >> 16) & CP_TC_LGSEN_MSS_MASK;
> > 
> >                     DPRINTF("+++ C+ mode offloaded task TSO MTU=%d IP data %d "
> >                         "frame data %d specified MSS=%d\n", ETH_MTU,
> >                         ip_data_len, saved_size - ETH_HLEN, large_send_mss);
> > 
> > That's the only reference to "large_send_mss" there, other than that,
> > the MSS value that gets stuck into the field by 8139cp.c is completely
> > unused.  Instead, qemu does this:
> > 
> >                 eth_payload_data = saved_buffer + ETH_HLEN;
> >                 eth_payload_len  = saved_size   - ETH_HLEN;
> > 
> >                 ip = (ip_header*)eth_payload_data;
> > 
> >                     hlen = IP_HEADER_LENGTH(ip);
> >                     ip_data_len = be16_to_cpu(ip->ip_len) - hlen;
> > 
> >                     tcp_header *p_tcp_hdr = (tcp_header*)(eth_payload_data + hlen);
> >                     int tcp_hlen = TCP_HEADER_DATA_OFFSET(p_tcp_hdr);
> > 
> >                     /* ETH_MTU = ip header len + tcp header len + payload */
> >                     int tcp_data_len = ip_data_len - tcp_hlen;
> >                     int tcp_chunk_size = ETH_MTU - hlen - tcp_hlen;
> > 
> >                     for (tcp_send_offset = 0; tcp_send_offset < tcp_data_len; tcp_send_offset += tcp_chunk_size)
> >                     {
> > 
> > It uses a fixed value of ETH_MTU to calculate the size of the TCP
> > data chunks, and this is not surprisingly the well known:
> > 
> > #define ETH_MTU     1500
> > 
> > Qemu seems to be buggy - it ignores the MSS value, and always tries to
> > send 1500 byte frames.
> 
> cc'ing in Stefan who last touched that code and Jason and Vlad who
> know the net code.

CCing Igor Kovalenko who implemented "fixed for TCP segmentation
offloading - removed dependency on slirp.h" in 2006.  I don't actually
expect him to remember this from 10 years ago though :).

Looking at the history the large_send_mss variable was never used for
anything beyond the debug printf.

The datasheet for this NIC is here:
http://realtek.info/pdf/rtl8139cp.pdf.  See 9.2.1 Transmit.

Does this untested patch work for you?

diff --git a/hw/net/rtl8139.c b/hw/net/rtl8139.c
index f05e59c..a3f1af5 100644
--- a/hw/net/rtl8139.c
+++ b/hw/net/rtl8139.c
@@ -2167,9 +2167,13 @@ static int rtl8139_cplus_transmit_one(RTL8139State *s)
                     goto skip_offload;
                 }
 
-                /* ETH_MTU = ip header len + tcp header len + payload */
+                /* MSS too small */
+                if (tcp_hlen + hlen >= large_send_mss) {
+                    goto skip_offload;
+                }
+
                 int tcp_data_len = ip_data_len - tcp_hlen;
-                int tcp_chunk_size = ETH_MTU - hlen - tcp_hlen;
+                int tcp_chunk_size = large_send_mss - hlen - tcp_hlen;
 
                 DPRINTF("+++ C+ mode TSO IP data len %d TCP hlen %d TCP "
                     "data len %d TCP chunk size %d\n", ip_data_len,

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply related

* Re: Long delays creating a netns after deleting one (possibly RCU related)
From: Paul E. McKenney @ 2016-11-14 16:24 UTC (permalink / raw)
  To: Cong Wang
  Cc: Rolf Neugebauer, LKML, Linux Kernel Network Developers,
	Justin Cormack, Ian Campbell
In-Reply-To: <CAM_iQpUq14GsvGyz2xA1PkDq5YO743T68Zh2zm=NJZQ9S2Ahqg@mail.gmail.com>

On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote:
> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> >>
> >> Ah!  This net_mutex is different than RTNL.  Should synchronize_net() be
> >> modified to check for net_mutex being held in addition to the current
> >> checks for RTNL being held?
> >>
> >
> > Good point!
> >
> > Like commit be3fc413da9eb17cce0991f214ab0, checking
> > for net_mutex for this case seems to be an optimization, I assume
> > synchronize_rcu_expedited() and synchronize_rcu() have the same
> > behavior...
> 
> Thinking a bit more, I think commit be3fc413da9eb17cce0991f
> gets wrong on rtnl_is_locked(), the lock could be locked by other
> process not by the current one, therefore it should be
> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
> is enabled... Sigh.
> 
> I don't see any better way than letting callers decide if they want the
> expedited version or not, but this requires changes of all callers of
> synchronize_net(). Hm.

I must confess that I don't understand how it would help to use an
expedited grace period when some other process is holding RTNL.
In contrast, I do well understand how it helps when the current process
is holding RTNL.

So what am I missing here?

							Thanx, Paul

^ permalink raw reply

* Re: [PATCH v3] ip6_output: ensure flow saddr actually belongs to device
From: David Ahern @ 2016-11-14 16:19 UTC (permalink / raw)
  To: Jason A. Donenfeld, Netdev, WireGuard mailing list, LKML,
	YOSHIFUJI Hideaki, Hannes Frederic Sowa
In-Reply-To: <20161113232813.28926-1-Jason@zx2c4.com>

On 11/13/16 4:28 PM, Jason A. Donenfeld wrote:
> This puts the IPv6 routing functions in parity with the IPv4 routing
> functions. Namely, we now check in v6 that if a flowi6 requests an
> saddr, the returned dst actually corresponds to a net device that has
> that saddr. This mirrors the v4 logic with __ip_dev_find in
> __ip_route_output_key_hash. In the event that the returned dst is not
> for a dst with a dev that has the saddr, we return -EINVAL, just like
> v4; this makes it easy to use the same error handlers for both cases.
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: David Ahern <dsa@cumulusnetworks.com>
> ---
> Changes from v2:
>     It turns out ipv6_chk_addr already has the device enumeration
>     logic that we need by simply passing NULL.
> 
>  net/ipv6/ip6_output.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 6001e78..b3b5cb6 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -926,6 +926,10 @@ static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk,
>  	int err;
>  	int flags = 0;
>  
> +	if (!ipv6_addr_any(&fl6->saddr) &&
> +	    !ipv6_chk_addr(net, &fl6->saddr, NULL, 1))
> +		return -EINVAL;
> +
>  	/* The correct way to handle this would be to do
>  	 * ip6_route_get_saddr, and then ip6_route_output; however,
>  	 * the route-specific preferred source forces the
> 

LGTM

Acked-by: David Ahern <dsa@cumulusnetworks.com>

^ permalink raw reply

* Re: [PATCH net] sctp: change sk state only when it has assocs in sctp_shutdown
From: Neil Horman @ 2016-11-14 15:52 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, linux-sctp, davem, Marcelo Ricardo Leitner,
	Vlad Yasevich, andreyknvl
In-Reply-To: <d260f8b59f52d7ef00b83a554fc19d4aa91766a2.1479044677.git.lucien.xin@gmail.com>

On Sun, Nov 13, 2016 at 09:44:37PM +0800, Xin Long wrote:
> Now when users shutdown a sock with SEND_SHUTDOWN in sctp, even if
> this sock has no connection (assoc), sk state would be changed to
> SCTP_SS_CLOSING, which is not as we expect.
> 
> Besides, after that if users try to listen on this sock, kernel
> could even panic when it dereference sctp_sk(sk)->bind_hash in
> sctp_inet_listen, as bind_hash is null when sock has no assoc.
> 
> This patch is to move sk state change after checking sk assocs
> is not empty, and also merge these two if() conditions and reduce
> indent level.
> 
> Fixes: d46e416c11c8 ("sctp: sctp should change socket state when shutdown is received")
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Tested-by: Andrey Konovalov <andreyknvl@google.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/sctp/socket.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index faa48ff..f23ad91 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -4285,19 +4285,18 @@ static void sctp_shutdown(struct sock *sk, int how)
>  {
>  	struct net *net = sock_net(sk);
>  	struct sctp_endpoint *ep;
> -	struct sctp_association *asoc;
>  
>  	if (!sctp_style(sk, TCP))
>  		return;
>  
> -	if (how & SEND_SHUTDOWN) {
> +	ep = sctp_sk(sk)->ep;
> +	if (how & SEND_SHUTDOWN && !list_empty(&ep->asocs)) {
> +		struct sctp_association *asoc;
> +
>  		sk->sk_state = SCTP_SS_CLOSING;
> -		ep = sctp_sk(sk)->ep;
> -		if (!list_empty(&ep->asocs)) {
> -			asoc = list_entry(ep->asocs.next,
> -					  struct sctp_association, asocs);
> -			sctp_primitive_SHUTDOWN(net, asoc, NULL);
> -		}
> +		asoc = list_entry(ep->asocs.next,
> +				  struct sctp_association, asocs);
> +		sctp_primitive_SHUTDOWN(net, asoc, NULL);
>  	}
>  }
>  
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply

* [PATCH next] dctcp: update cwnd on congestion event
From: Florian Westphal @ 2016-11-14 15:42 UTC (permalink / raw)
  To: netdev
  Cc: Florian Westphal, Lawrence Brakmo, Andrew Shewmaker, Glenn Judd,
	Daniel Borkmann

draft-ietf-tcpm-dctcp-02 says:

... when the sender receives an indication of congestion
(ECE), the sender SHOULD update cwnd as follows:

         cwnd = cwnd * (1 - DCTCP.Alpha / 2)

So, lets do this and reduce cwnd more smoothly (and faster), as per
current congestion estimate.

Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index ab37c6775630..51139175bf61 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -188,8 +188,8 @@ static void dctcp_ce_state_1_to_0(struct sock *sk)
 
 static void dctcp_update_alpha(struct sock *sk, u32 flags)
 {
-	const struct tcp_sock *tp = tcp_sk(sk);
 	struct dctcp *ca = inet_csk_ca(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
 	u32 acked_bytes = tp->snd_una - ca->prior_snd_una;
 
 	/* If ack did not advance snd_una, count dupack as MSS size.
@@ -229,6 +229,13 @@ static void dctcp_update_alpha(struct sock *sk, u32 flags)
 		WRITE_ONCE(ca->dctcp_alpha, alpha);
 		dctcp_reset(tp, ca);
 	}
+
+	if (flags & CA_ACK_ECE) {
+		unsigned int cwnd = dctcp_ssthresh(sk);
+
+		if (cwnd != tp->snd_cwnd)
+			tp->snd_cwnd = cwnd;
+	}
 }
 
 static void dctcp_state(struct sock *sk, u8 new_state)
-- 
2.7.3

^ permalink raw reply related

* Re: Debugging Ethernet issues
From: Mason @ 2016-11-14 15:33 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Florian Fainelli, netdev, Mans Rullgard, Sergei Shtylyov,
	Tom Lendacky, Zach Brown, Shaohui Xie, Tim Beale, Brian Hill,
	Vince Bridgers, Balakumaran Kannan, David S. Miller,
	Sebastian Frias, Kirill Kapranov
In-Reply-To: <5829D10F.30206@free.fr>

On 14/11/2016 15:58, Mason wrote:

> nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
> vs
> nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control off
> 
> I'm not sure whether "flow control" is relevant...

Based on phy_print_status()
phydev->pause ? "rx/tx" : "off"
I added the following patch.

diff --git a/drivers/net/ethernet/aurora/nb8800.c b/drivers/net/ethernet/aurora/nb8800.c
index defc22a15f67..4e758c1cfa4e 100644
--- a/drivers/net/ethernet/aurora/nb8800.c
+++ b/drivers/net/ethernet/aurora/nb8800.c
@@ -667,6 +667,8 @@ static void nb8800_link_reconfigure(struct net_device *dev)
        struct phy_device *phydev = priv->phydev;
        int change = 0;
 
+       printk("%s from %pf\n", __func__, __builtin_return_address(0));
+
        if (phydev->link) {
                if (phydev->speed != priv->speed) {
                        priv->speed = phydev->speed;
@@ -1274,9 +1276,9 @@ static int nb8800_hw_init(struct net_device *dev)
        nb8800_writeb(priv, NB8800_PQ2, val & 0xff);
 
        /* Auto-negotiate by default */
-       priv->pause_aneg = true;
-       priv->pause_rx = true;
-       priv->pause_tx = true;
+       priv->pause_aneg = false;
+       priv->pause_rx = false;
+       priv->pause_tx = false;
 
        nb8800_mc_init(dev, 0);
 

Connected to 1000 Mbps switch:

# time udhcpc | while read LINE; do date; echo $LINE; done
Thu Jan  1 00:00:22 UTC 1970
udhcpc (v1.22.1) started
Thu Jan  1 00:00:22 UTC 1970
Sending discover...
[   24.565346] nb8800_link_reconfigure from phy_state_machine
Thu Jan  1 00:00:25 UTC 1970
Sending discover...
[   26.575402] nb8800_link_reconfigure from phy_state_machine
[   26.580972] nb8800 26000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
Thu Jan  1 00:00:28 UTC 1970
Sending discover...
Thu Jan  1 00:00:29 UTC 1970
Sending select for 172.27.64.58...
Thu Jan  1 00:00:29 UTC 1970
Lease of 172.27.64.58 obtained, lease time 604800
Thu Jan  1 00:00:29 UTC 1970
deleting routers
Thu Jan  1 00:00:29 UTC 1970
adding dns 172.27.0.17

real    0m7.388s
user    0m0.040s
sys     0m0.090s



Connected to 100 Mbps switch:

# time udhcpc | while read LINE; do date; echo $LINE; done
Thu Jan  1 00:00:14 UTC 1970
udhcpc (v1.22.1) started
Thu Jan  1 00:00:15 UTC 1970
Sending discover...
[   16.968621] nb8800_link_reconfigure from phy_state_machine
[   17.975359] nb8800_link_reconfigure from phy_state_machine
[   17.980923] nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
Thu Jan  1 00:00:18 UTC 1970
Sending discover...
Thu Jan  1 00:00:19 UTC 1970
Sending select for 172.27.64.58...
Thu Jan  1 00:00:19 UTC 1970
Lease of 172.27.64.58 obtained, lease time 604800
Thu Jan  1 00:00:19 UTC 1970
deleting routers
Thu Jan  1 00:00:19 UTC 1970
adding dns 172.27.0.17

real    0m4.355s
user    0m0.043s
sys     0m0.083s



OK, so now it works (by accident?) even on 100 Mbps switch, but it still
prints "flow control rx/tx"...

# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate:  off
RX:             off
TX:             off

These values make sense considering my changes in the driver.

Are 100 Mbps switches supposed to support these pause features,
and are they supposed to be able to auto-negotiate them?

Regards.

^ permalink raw reply related

* Re: [PATCH] net/phy: add trace events for mdio accesses
From: Steven Rostedt @ 2016-11-14 15:22 UTC (permalink / raw)
  To: Uwe Kleine-König; +Cc: Florian Fainelli, Ingo Molnar, netdev
In-Reply-To: <20161114110335.27862-1-uwe@kleine-koenig.org>

On Mon, 14 Nov 2016 12:03:35 +0100
Uwe Kleine-König <uwe@kleine-koenig.org> wrote:

> Make it possible to generate trace events for mdio read and write accesses.
> 
> Signed-off-by: Uwe Kleine-König <uwe@kleine-koenig.org>
> ---
>  drivers/net/phy/mdio_bus.c  | 15 +++++++++++++++
>  include/trace/events/mdio.h | 40 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 55 insertions(+)
>  create mode 100644 include/trace/events/mdio.h
> 
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index 09deef4bed09..0f3f207419f6 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -38,6 +38,9 @@
>  
>  #include <asm/irq.h>
>  
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/mdio.h>
> +
>  int mdiobus_register_device(struct mdio_device *mdiodev)
>  {
>  	if (mdiodev->bus->mdio_map[mdiodev->addr])
> @@ -461,6 +464,9 @@ int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 regnum)
>  	retval = bus->read(bus, addr, regnum);
>  	mutex_unlock(&bus->mdio_lock);
>  
> +	if (retval >= 0)
> +		trace_mdio_access(bus, 1, addr, regnum, retval);

These cause branches to be taken when tracing is disabled, breaking the
zero overhead for disabled tracing guideline. As retval is passed to
the tracepoint, please look at TRACE_EVENT_CONDITION() and use that.
It will move the if statement into the enabling of the trace event and
keep the overhead to a minimum when the tracepoint is disabled.

Do the same for the below as well.

-- Steve

> +
>  	return retval;
>  }
>  EXPORT_SYMBOL(mdiobus_read_nested);
> @@ -485,6 +491,9 @@ int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum)
>  	retval = bus->read(bus, addr, regnum);
>  	mutex_unlock(&bus->mdio_lock);
>  
> +	if (retval >= 0)
> +		trace_mdio_access(bus, 1, addr, regnum, retval);
> +
>  	return retval;
>  }
>  EXPORT_SYMBOL(mdiobus_read);
> @@ -513,6 +522,9 @@ int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val)
>  	err = bus->write(bus, addr, regnum, val);
>  	mutex_unlock(&bus->mdio_lock);
>  
> +	if (!err)
> +		trace_mdio_access(bus, 0, addr, regnum, val);
> +
>  	return err;
>  }
>  EXPORT_SYMBOL(mdiobus_write_nested);
> @@ -538,6 +550,9 @@ int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val)
>  	err = bus->write(bus, addr, regnum, val);
>  	mutex_unlock(&bus->mdio_lock);
>  
> +	if (!err)
> +		trace_mdio_access(bus, 0, addr, regnum, val);
> +
>  	return err;
>  }
>  EXPORT_SYMBOL(mdiobus_write);
> diff --git a/include/trace/events/mdio.h b/include/trace/events/mdio.h
> new file mode 100644
> index 000000000000..dcb2d456a346
> --- /dev/null
> +++ b/include/trace/events/mdio.h
> @@ -0,0 +1,40 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM mdio
> +
> +#if !defined(_TRACE_MDIO_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_MDIO_H
> +
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(mdio_access,
> +
> +	TP_PROTO(struct mii_bus *bus, int read,
> +		 unsigned addr, unsigned regnum, u16 val),
> +
> +	TP_ARGS(bus, read, addr, regnum, val),
> +
> +	TP_STRUCT__entry(
> +		__array(char, busid, MII_BUS_ID_SIZE)
> +		__field(int, read)
> +		__field(unsigned, addr)
> +		__field(unsigned, regnum)
> +		__field(u16, val)
> +	),
> +
> +	TP_fast_assign(
> +		strncpy(__entry->busid, bus->id, MII_BUS_ID_SIZE);
> +		__entry->read = read;
> +		__entry->addr = addr;
> +		__entry->regnum = regnum;
> +		__entry->val = val;
> +	),
> +
> +	TP_printk("%s %-5s phy:0x%02x reg:0x%02x val:0x%04hx",
> +		  __entry->busid, __entry->read ? "read" : "write",
> +		  __entry->addr, __entry->regnum, __entry->val)
> +);
> +
> +#endif /* if !defined(_TRACE_MDIO_H) || defined(TRACE_HEADER_MULTI_READ) */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>

^ permalink raw reply

* Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
From: Andreas Färber @ 2016-11-14 15:00 UTC (permalink / raw)
  To: Martin Blumenstingl
  Cc: Giuseppe CAVALLARO, Alexandre Torgue, netdev, André Roth,
	Johnson Leung, linux-amlogic, Jerome Brunet
In-Reply-To: <CAFBinCDq0neyidQBO13F4Yz-GZPorGyRQ2t5Vb-PL0Ls582+4Q@mail.gmail.com>

Hi,

Am 07.11.2016 um 18:37 schrieb Martin Blumenstingl:
> The same problem still appears on my Tronsmart Vega S95 Meta even with
> the patched PHY driver.
> 
> Unfortunately I don't have a second device to rule out that my
> Tronsmart Vega S95 Meta could be broken (not unlikely, I get DDR
> errors from time to time in u-boot). Maybe Andreas Faerber can test
> ethernet with and without Jerome's patch on one of his Tronsmart
> devices.

So far I only ran into the stall issue on my Odroid-C2, not on my Vega
S95 Telos, both on v4.10/integ branch with a rebased MMC patch of yours
on top.

I'm assuming you mean realtek8211f-disable-eee-1000.patch from Nov 3.

Regards,
Andreas

-- 
SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* [patch net-next v2 8/8] mlxsw: packet sample: Add packet sample offloading support
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux, roopa,
	john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>

From: Yotam Gigi <yotamg@mellanox.com>

Using the MPSC regiter, add the functions that configure port packets
sampling in hardware and the necessary datatypes in the mlxsw_sp_port
struct. In addition, add the necessary trap for sampled packets and
integrate with matchall offloading to allow offloading of the sample tc
action.

The current offload support is for the tc command:

tc filter add dev <DEV> parent ffff:   \
	  matchall   \
	  action sample rate <RATE> mark <MARK> [trunc <SIZE>]   \
	  		[src <SADDR>] [dst <DADDR>] [type <TYPE>]

Where only ingress qdiscs are supported, and only a combination of
matchall classifier and sample action will lead to activating hardware
packet sampling.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 120 +++++++++++++++++++++++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  12 +++
 drivers/net/ethernet/mellanox/mlxsw/trap.h     |   1 +
 3 files changed, 125 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index a5433e4..6cbf67c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -57,6 +57,8 @@
 #include <net/pkt_cls.h>
 #include <net/tc_act/tc_mirred.h>
 #include <net/netevent.h>
+#include <net/tc_act/tc_sample.h>
+#include <net/ife.h>
 
 #include "spectrum.h"
 #include "pci.h"
@@ -467,6 +469,16 @@ static void mlxsw_sp_span_mirror_remove(struct mlxsw_sp_port *from,
 	mlxsw_sp_span_inspected_port_unbind(from, span_entry, type);
 }
 
+static int mlxsw_sp_port_sample_set(struct mlxsw_sp_port *mlxsw_sp_port,
+				    bool enable, u32 rate)
+{
+	struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+	char mpsc_pl[MLXSW_REG_MPSC_LEN];
+
+	mlxsw_reg_mpsc_pack(mpsc_pl, mlxsw_sp_port->local_port, enable, rate);
+	return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(mpsc), mpsc_pl);
+}
+
 static int mlxsw_sp_port_admin_status_set(struct mlxsw_sp_port *mlxsw_sp_port,
 					  bool is_up)
 {
@@ -1221,6 +1233,49 @@ mlxsw_sp_port_add_cls_matchall_mirror(struct mlxsw_sp_port *mlxsw_sp_port,
 	return err;
 }
 
+static int
+mlxsw_sp_port_add_cls_matchall_sample(struct mlxsw_sp_port *mlxsw_sp_port,
+				      struct tc_cls_matchall_offload *cls,
+				      const struct tc_action *a,
+				      bool ingress)
+{
+	struct mlxsw_sp_port_mall_tc_entry *mall_tc_entry;
+	int sampler_id;
+	int err;
+
+	if (mlxsw_sp_port->sample.enable) {
+		netdev_err(mlxsw_sp_port->dev, "Sample already active\n");
+		return -EEXIST;
+	}
+
+	err = mlxsw_sp_port_sample_set(mlxsw_sp_port, true, tcf_sample_rate(a));
+	if (err)
+		return err;
+
+	sampler_id = tcf_sample_sampler_id(a);
+
+	mlxsw_sp_port->sample.enable = true;
+	mlxsw_sp_port->sample.mark = tcf_sample_mark(a);
+	mlxsw_sp_port->sample.truncate = tcf_sample_truncate(a);
+	mlxsw_sp_port->sample.trunc_size = tcf_sample_trunc_size(a);
+	mlxsw_sp_port->sample.eth_type = tcf_sample_eth_type(a);
+	mlxsw_sp_port->sample.sampler_id = TC_SAMPLE_HW_ID(sampler_id);
+	tcf_sample_eth_dst_addr(a, mlxsw_sp_port->sample.eth_dst);
+	tcf_sample_eth_src_addr(a, mlxsw_sp_port->sample.eth_src);
+
+	netdev_dbg(mlxsw_sp_port->dev, "Activate hardware sample\n");
+
+	mall_tc_entry = kzalloc(sizeof(*mall_tc_entry), GFP_KERNEL);
+	if (!mall_tc_entry)
+		return -ENOMEM;
+
+	mall_tc_entry->cookie = cls->cookie;
+	mall_tc_entry->type = MLXSW_SP_PORT_MALL_SAMPLE;
+	list_add_tail(&mall_tc_entry->list, &mlxsw_sp_port->mall_tc_list);
+
+	return 0;
+}
+
 static int mlxsw_sp_port_add_cls_matchall(struct mlxsw_sp_port *mlxsw_sp_port,
 					  __be16 protocol,
 					  struct tc_cls_matchall_offload *cls,
@@ -1236,17 +1291,19 @@ static int mlxsw_sp_port_add_cls_matchall(struct mlxsw_sp_port *mlxsw_sp_port,
 	}
 
 	tcf_exts_to_list(cls->exts, &actions);
-	list_for_each_entry(a, &actions, list) {
-		if (!is_tcf_mirred_egress_mirror(a) ||
-		    protocol != htons(ETH_P_ALL)) {
-			return -ENOTSUPP;
-		}
+	a = list_first_entry(&actions, struct tc_action, list);
 
+	if (is_tcf_mirred_egress_mirror(a) && protocol == htons(ETH_P_ALL))
 		err = mlxsw_sp_port_add_cls_matchall_mirror(mlxsw_sp_port, cls,
 							    a, ingress);
-		if (err)
-			return err;
-	}
+	else if (is_tcf_sample(a) && protocol == htons(ETH_P_ALL))
+		err = mlxsw_sp_port_add_cls_matchall_sample(mlxsw_sp_port, cls,
+							    a, ingress);
+	else
+		return -ENOTSUPP;
+
+	if (err)
+		return err;
 
 	return 0;
 }
@@ -1274,6 +1331,10 @@ static void mlxsw_sp_port_del_cls_matchall(struct mlxsw_sp_port *mlxsw_sp_port,
 
 		mlxsw_sp_span_mirror_remove(mlxsw_sp_port, to_port, span_type);
 		break;
+	case MLXSW_SP_PORT_MALL_SAMPLE:
+		mlxsw_sp_port->sample.enable = false;
+		mlxsw_sp_port_sample_set(mlxsw_sp_port, false, 1);
+		break;
 	default:
 		WARN_ON(1);
 	}
@@ -2774,6 +2835,46 @@ static void mlxsw_sp_rx_listener_mark_func(struct sk_buff *skb, u8 local_port,
 	return mlxsw_sp_rx_listener_func(skb, local_port, priv);
 }
 
+static void mlxsw_sp_rx_listener_sample_func(struct sk_buff *skb, u8 local_port,
+					     void *priv)
+{
+	struct mlxsw_sp *mlxsw_sp = priv;
+	struct mlxsw_sp_port *mlxsw_sp_port = mlxsw_sp->ports[local_port];
+	static struct ethhdr *ethhdr;
+	int orig_size;
+
+	if (unlikely(!mlxsw_sp_port)) {
+		dev_warn_ratelimited(mlxsw_sp->bus_info->dev, "Port %d: sample skb received for non-existent port\n",
+				     local_port);
+		return;
+	}
+
+	skb->dev = mlxsw_sp_port->dev;
+	skb->mac_len = skb->dev->hard_header_len;
+	skb->mark = mlxsw_sp_port->sample.mark;
+
+	orig_size = skb->len;
+
+	if (mlxsw_sp_port->sample.truncate)
+		skb_trim(skb, mlxsw_sp_port->sample.trunc_size);
+
+	ethhdr = ife_packet_info_pack(skb, orig_size, skb->dev->ifindex, 0,
+				      mlxsw_sp_port->sample.sampler_id,
+				      mlxsw_sp_port->sample.seq++);
+	if (!ethhdr)
+		return;
+
+	if (!is_zero_ether_addr(mlxsw_sp_port->sample.eth_src))
+		ether_addr_copy(ethhdr->h_source,
+				mlxsw_sp_port->sample.eth_src);
+	if (!is_zero_ether_addr(mlxsw_sp_port->sample.eth_dst))
+		ether_addr_copy(ethhdr->h_dest, mlxsw_sp_port->sample.eth_dst);
+	ethhdr->h_proto = htons(mlxsw_sp_port->sample.eth_type);
+
+	skb->protocol = eth_type_trans(skb, skb->dev);
+	netif_receive_skb(skb);
+}
+
 #define MLXSW_SP_RXL(_func, _trap_id, _action)			\
 	{							\
 		.func = _func,					\
@@ -2784,6 +2885,9 @@ static void mlxsw_sp_rx_listener_mark_func(struct sk_buff *skb, u8 local_port,
 
 static const struct mlxsw_rx_listener mlxsw_sp_rx_listener[] = {
 	MLXSW_SP_RXL(mlxsw_sp_rx_listener_func, FDB_MC, TRAP_TO_CPU),
+	MLXSW_SP_RXL(mlxsw_sp_rx_listener_sample_func, PKT_SAMPLE,
+		     MIRROR_TO_CPU),
+
 	/* Traps for specific L2 packet types, not trapped as FDB MC */
 	MLXSW_SP_RXL(mlxsw_sp_rx_listener_func, STP, TRAP_TO_CPU),
 	MLXSW_SP_RXL(mlxsw_sp_rx_listener_func, LACP, TRAP_TO_CPU),
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index 04a2bc7..7f2e6fc 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -229,6 +229,7 @@ struct mlxsw_sp_span_entry {
 
 enum mlxsw_sp_port_mall_action_type {
 	MLXSW_SP_PORT_MALL_MIRROR,
+	MLXSW_SP_PORT_MALL_SAMPLE,
 };
 
 struct mlxsw_sp_port_mall_mirror_tc_entry {
@@ -361,6 +362,17 @@ struct mlxsw_sp_port {
 		struct rtnl_link_stats64 *cache;
 		struct delayed_work update_dw;
 	} hw_stats;
+	struct {
+		bool enable;
+		u32 mark;
+		bool truncate;
+		u32 trunc_size;
+		u8 eth_src[ETH_ALEN];
+		u8 eth_dst[ETH_ALEN];
+		u16 eth_type;
+		u32 sampler_id;
+		u32 seq;
+	} sample;
 };
 
 struct mlxsw_sp_port *mlxsw_sp_port_lower_dev_hold(struct net_device *dev);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/trap.h b/drivers/net/ethernet/mellanox/mlxsw/trap.h
index ed8e301..eebd135 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/trap.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/trap.h
@@ -54,6 +54,7 @@ enum {
 	MLXSW_TRAP_ID_IGMP_V2_REPORT = 0x32,
 	MLXSW_TRAP_ID_IGMP_V2_LEAVE = 0x33,
 	MLXSW_TRAP_ID_IGMP_V3_REPORT = 0x34,
+	MLXSW_TRAP_ID_PKT_SAMPLE = 0x38,
 	MLXSW_TRAP_ID_ARPBC = 0x50,
 	MLXSW_TRAP_ID_ARPUC = 0x51,
 	MLXSW_TRAP_ID_MTUERROR = 0x52,
-- 
2.7.4

^ permalink raw reply related

* [patch net-next v2 7/8] mlxsw: reg: add the Monitoring Packet Sampling Configuration Register
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux, roopa,
	john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>

From: Yotam Gigi <yotamg@mellanox.com>

The MPSC register allows to configure ingress packet sampling on specific
port of the mlxsw device. The sampled packets are then trapped via
PKT_SAMPLE trap.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 38 +++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index edad7cb..b5ab81a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -4771,6 +4771,44 @@ static inline void mlxsw_reg_mlcr_pack(char *payload, u8 local_port,
 					   MLXSW_REG_MLCR_DURATION_MAX : 0);
 }
 
+/* MPSC - Monitoring Packet Sampling Configuration Register
+ * --------------------------------------------------------
+ * MPSC Register is used to configure the Packet Sampling mechanism.
+ */
+#define MLXSW_REG_MPSC_ID 0x9080
+#define MLXSW_REG_MPSC_LEN 0x14
+
+MLXSW_REG_DEFINE(mpsc, MLXSW_REG_MPSC_ID, MLXSW_REG_MPSC_LEN);
+
+/* reg_mpsc_local_port
+ * Local port number
+ * Not supported for CPU port
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, mpsc, local_port, 0x00, 16, 8);
+
+/* reg_mpsc_e
+ * Enable sampling on port local_port
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, mpsc, e, 0x04, 30, 1);
+
+/* reg_mpsc_rate
+ * Sampling rate = 1 out of rate packets (with randomization around
+ * the point). Valid values are: 1 to 3.5*10^9
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, mpsc, rate, 0x08, 0, 32);
+
+static inline void mlxsw_reg_mpsc_pack(char *payload, u8 local_port, bool e,
+				       u32 rate)
+{
+	MLXSW_REG_ZERO(mpsc, payload);
+	mlxsw_reg_mpsc_local_port_set(payload, local_port);
+	mlxsw_reg_mpsc_e_set(payload, e);
+	mlxsw_reg_mpsc_rate_set(payload, rate);
+}
+
 /* SBPR - Shared Buffer Pools Register
  * -----------------------------------
  * The SBPR configures and retrieves the shared buffer pools and configuration.
-- 
2.7.4

^ permalink raw reply related

* [patch net-next v2 6/8] tc: sample: Add sequence number and sampler_id fields
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux, roopa,
	john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>

From: Yotam Gigi <yotamg@mellanox.com>

Add support for sequence numbering to the sampled packets. The sequence
number counts the sampled packets and enables to check whether sampled
packets have been dropped. The sequence state is kept per sample action
instance.

Add unique id (sampler_id) for each sample action instance, and add that
sampler_id to the packet. This way, a user application can verify sample
drops, by tracking the sequence numbers from each sampler id.

Because of the fact that hw and sw can be triggered by the same tc action
(if the user does not specify both skip_sw and skip_hw), each should
have a different sampler_id, as each generates a different sequence
numbering. To allow that, the macros TC_SAMPLE_HW_ID and TC_SAMPLE_SW_ID,
which, given a sampler_id, calculates the hw sample_id and sw sample_id.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/ife.h              |  5 +++--
 include/net/tc_act/tc_sample.h |  9 +++++++++
 include/uapi/linux/ife.h       |  2 ++
 net/ife/ife.c                  | 14 ++++++++++++--
 net/sched/act_sample.c         | 10 +++++++++-
 5 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/net/ife.h b/include/net/ife.h
index 5fbcfb3..3ae4f0d 100644
--- a/include/net/ife.h
+++ b/include/net/ife.h
@@ -19,7 +19,8 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
 void *ife_tlv_meta_next(void *skbdata);
 
 struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
-				    int in_ifindex, int out_ifindex);
+				    int in_ifindex, int out_ifindex,
+				    u32 sampler_id, u32 seq);
 
 #else
 
@@ -52,7 +53,7 @@ static inline void *ife_tlv_meta_next(void *skbdata)
 
 static inline struct ethhdr *
 ife_packet_info_pack(struct sk_buff *skb, int orig_size, int in_ifindex,
-		     int out_ifindex)
+		     int out_ifindex, u32 sampler_id, u32 seq)
 {
 	return NULL;
 }
diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
index af99eb8..07638b7 100644
--- a/include/net/tc_act/tc_sample.h
+++ b/include/net/tc_act/tc_sample.h
@@ -13,9 +13,13 @@ struct tcf_sample {
 	u8			eth_dst[ETH_ALEN];
 	u8			eth_src[ETH_ALEN];
 	u16			eth_type;
+	u32			sampler_id;
+	u32			seq;
 	struct list_head	tcfm_list;
 };
 #define to_sample(a) ((struct tcf_sample *)a)
+#define TC_SAMPLE_SW_ID(id) ((id << 1) + 0)
+#define TC_SAMPLE_HW_ID(id) ((id << 1) + 1)
 
 static inline bool is_tcf_sample(const struct tc_action *a)
 {
@@ -61,4 +65,9 @@ static inline void tcf_sample_eth_src_addr(const struct tc_action *a, u8 *src)
 	ether_addr_copy(src, to_sample(a)->eth_src);
 }
 
+static inline u32 tcf_sample_sampler_id(const struct tc_action *a)
+{
+	return to_sample(a)->sampler_id;
+}
+
 #endif /* __NET_TC_SAMPLE_H */
diff --git a/include/uapi/linux/ife.h b/include/uapi/linux/ife.h
index ebdd785..2007818 100644
--- a/include/uapi/linux/ife.h
+++ b/include/uapi/linux/ife.h
@@ -13,6 +13,8 @@ enum {
 	IFE_META_OUT_IFINDEX,
 	IFE_META_ORIGSIZE,
 	IFE_META_SIZE,
+	IFE_META_SAMPLER_ID,
+	IFE_META_SEQ,
 	__IFE_META_MAX
 };
 
diff --git a/net/ife/ife.c b/net/ife/ife.c
index f0b683e..e1b7452d 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -139,7 +139,8 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void *dval)
 EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
 
 struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
-				    int in_ifindex, int out_ifindex)
+				    int in_ifindex, int out_ifindex,
+				    u32 sampler_id, u32 seq)
 {
 	int curr_size;
 	void *ifetlv;
@@ -148,7 +149,9 @@ struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
 	curr_size = skb->len;
 
 	metalen = nla_total_size(sizeof(orig_size)) +
-		  nla_total_size(sizeof(curr_size));
+		  nla_total_size(sizeof(curr_size)) +
+		  nla_total_size(sizeof(sampler_id)) +
+		  nla_total_size(sizeof(seq));
 
 	if (in_ifindex)
 		metalen += nla_total_size(sizeof(in_ifindex));
@@ -159,6 +162,8 @@ struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
 	out_ifindex = htonl(out_ifindex);
 	orig_size = htonl(orig_size);
 	curr_size = htonl(curr_size);
+	sampler_id = htonl(sampler_id);
+	seq = htonl(seq);
 
 	ifetlv = ife_encode(skb, metalen);
 	if (!ifetlv)
@@ -179,6 +184,11 @@ struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
 	ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_SIZE,
 				      sizeof(curr_size), &curr_size);
 
+	ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_SAMPLER_ID,
+				      sizeof(sampler_id), &sampler_id);
+
+	ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_SEQ, sizeof(seq), &seq);
+
 	return (struct ethhdr *) skb->data;
 }
 EXPORT_SYMBOL(ife_packet_info_pack);
diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index 39f1d94c..3fe9489 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -27,6 +27,7 @@
 
 #define SAMPLE_TAB_MASK     7
 static int sample_net_id;
+static u32 samplers;
 static struct tc_action_ops act_sample_ops;
 
 static const struct nla_policy sample_policy[TCA_SAMPLE_MAX + 1] = {
@@ -77,6 +78,7 @@ static int tcf_sample_init(struct net *net, struct nlattr *nla,
 	s->rate = nla_get_u32(tb[TCA_SAMPLE_RATE]);
 	s->mark = nla_get_u32(tb[TCA_SAMPLE_MARK]);
 	s->eth_type = nla_get_u16(tb[TCA_SAMPLE_ETH_TYPE]);
+	s->sampler_id = samplers++;
 
 	s->truncate = tb[TCA_SAMPLE_TRUNC_SIZE];
 	if (tb[TCA_SAMPLE_TRUNC_SIZE])
@@ -117,6 +119,7 @@ static int tcf_sample(struct sk_buff *skb, const struct tc_action *a,
 	struct tcf_sample *s = to_sample(a);
 	static struct ethhdr *ethhdr;
 	struct sk_buff *skb2;
+	u32 sampler_id;
 	int orig_size;
 	int retval;
 	u32 at;
@@ -144,8 +147,12 @@ static int tcf_sample(struct sk_buff *skb, const struct tc_action *a,
 			skb_push(skb2, skb2->mac_len);
 
 		orig_size = skb->len + skb->dev->hard_header_len;
+		sampler_id = TC_SAMPLE_SW_ID(s->sampler_id);
+
 		ethhdr = ife_packet_info_pack(skb2, orig_size,
-					      skb->dev->ifindex, 0);
+					      skb->dev->ifindex, 0,
+					      sampler_id, s->seq++);
+
 		if (!ethhdr)
 			goto out;
 
@@ -266,6 +273,7 @@ static struct pernet_operations sample_net_ops = {
 
 static int __init sample_init_module(void)
 {
+	samplers = 0;
 	return tcf_register_action(&act_sample_ops, &sample_net_ops);
 }
 
-- 
2.7.4

^ permalink raw reply related

* [patch net-next v2 5/8] Introduce sample tc action
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux, roopa,
	john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>

From: Yotam Gigi <yotamg@mellanox.com>

This action allow the user to sample traffic matched by tc classifier.
The sampling consists of choosing packets randomly, truncating them,
adding some informative metadata regarding the interface and the original
packet size and mark them with specific mark, to allow further tc rules to
match and process. The marked sample packets are then injected into the
device ingress qdisc using netif_receive_skb.

The packets metadata is packed using the ife encapsulation protocol, and
the outer packet's ethernet dest, source and eth_type, along with the
rate, mark and the optional truncation size can be configured from
userspace.

Example:
To sample ingress traffic from interface eth1, and redirect the sampled
the sampled packets to interface dummy0, one may use the commands:

tc qdisc add dev eth1 handle ffff: ingress

tc filter add dev eth1 parent ffff: \
	   matchall action sample rate 12 mark 17

tc filter add parent ffff: dev eth1 protocol all \
	   u32 match mark 17 0xff \
	   action mirred egress redirect dev dummy0

Where the first command adds an ingress qdisc and the second starts
sampling randomly with an average of one sampled packet per 12 packets on
dev eth1 and marks the sampled packets with 17. The third command catches
the sampled packets, which are marked with 17, and redirects them to dev
dummy0.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/tc_act/tc_sample.h        |  64 ++++++++
 include/uapi/linux/tc_act/Kbuild      |   1 +
 include/uapi/linux/tc_act/tc_sample.h |  29 ++++
 net/sched/Kconfig                     |  13 ++
 net/sched/Makefile                    |   1 +
 net/sched/act_sample.c                | 282 ++++++++++++++++++++++++++++++++++
 6 files changed, 390 insertions(+)
 create mode 100644 include/net/tc_act/tc_sample.h
 create mode 100644 include/uapi/linux/tc_act/tc_sample.h
 create mode 100644 net/sched/act_sample.c

diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
new file mode 100644
index 0000000..af99eb8
--- /dev/null
+++ b/include/net/tc_act/tc_sample.h
@@ -0,0 +1,64 @@
+#ifndef __NET_TC_SAMPLE_H
+#define __NET_TC_SAMPLE_H
+
+#include <net/act_api.h>
+#include <linux/tc_act/tc_sample.h>
+
+struct tcf_sample {
+	struct tc_action	common;
+	u32			rate;
+	u32			mark;
+	bool			truncate;
+	u32			trunc_size;
+	u8			eth_dst[ETH_ALEN];
+	u8			eth_src[ETH_ALEN];
+	u16			eth_type;
+	struct list_head	tcfm_list;
+};
+#define to_sample(a) ((struct tcf_sample *)a)
+
+static inline bool is_tcf_sample(const struct tc_action *a)
+{
+#ifdef CONFIG_NET_CLS_ACT
+	return a->ops && a->ops->type == TCA_ACT_SAMPLE;
+#else
+	return false;
+#endif
+}
+
+static inline __u32 tcf_sample_mark(const struct tc_action *a)
+{
+	return to_sample(a)->mark;
+}
+
+static inline __u32 tcf_sample_rate(const struct tc_action *a)
+{
+	return to_sample(a)->rate;
+}
+
+static inline bool tcf_sample_truncate(const struct tc_action *a)
+{
+	return to_sample(a)->truncate;
+}
+
+static inline int tcf_sample_trunc_size(const struct tc_action *a)
+{
+	return to_sample(a)->trunc_size;
+}
+
+static inline u16 tcf_sample_eth_type(const struct tc_action *a)
+{
+	return to_sample(a)->eth_type;
+}
+
+static inline void tcf_sample_eth_dst_addr(const struct tc_action *a, u8 *dst)
+{
+	ether_addr_copy(dst, to_sample(a)->eth_dst);
+}
+
+static inline void tcf_sample_eth_src_addr(const struct tc_action *a, u8 *src)
+{
+	ether_addr_copy(src, to_sample(a)->eth_src);
+}
+
+#endif /* __NET_TC_SAMPLE_H */
diff --git a/include/uapi/linux/tc_act/Kbuild b/include/uapi/linux/tc_act/Kbuild
index e3969bd..6c6b8d6 100644
--- a/include/uapi/linux/tc_act/Kbuild
+++ b/include/uapi/linux/tc_act/Kbuild
@@ -4,6 +4,7 @@ header-y += tc_defact.h
 header-y += tc_gact.h
 header-y += tc_ipt.h
 header-y += tc_mirred.h
+header-y += tc_sample.h
 header-y += tc_nat.h
 header-y += tc_pedit.h
 header-y += tc_skbedit.h
diff --git a/include/uapi/linux/tc_act/tc_sample.h b/include/uapi/linux/tc_act/tc_sample.h
new file mode 100644
index 0000000..44ee9d0
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_sample.h
@@ -0,0 +1,29 @@
+#ifndef __LINUX_TC_SAMPLE_H
+#define __LINUX_TC_SAMPLE_H
+
+#include <linux/types.h>
+#include <linux/pkt_cls.h>
+#include <linux/if_ether.h>
+
+#define TCA_ACT_SAMPLE 26
+
+struct tc_sample {
+	tc_gen;
+};
+
+enum {
+	TCA_SAMPLE_UNSPEC,
+	TCA_SAMPLE_PARMS,
+	TCA_SAMPLE_TM,
+	TCA_SAMPLE_RATE,
+	TCA_SAMPLE_MARK,
+	TCA_SAMPLE_TRUNC_SIZE,
+	TCA_SAMPLE_ETH_DST,
+	TCA_SAMPLE_ETH_SRC,
+	TCA_SAMPLE_ETH_TYPE,
+	TCA_SAMPLE_PAD,
+	__TCA_SAMPLE_MAX
+};
+#define TCA_SAMPLE_MAX (__TCA_SAMPLE_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 24f7cac..c54ea6b 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -650,6 +650,19 @@ config NET_ACT_MIRRED
 	  To compile this code as a module, choose M here: the
 	  module will be called act_mirred.
 
+config NET_ACT_SAMPLE
+        tristate "Traffic Sampling"
+        depends on NET_CLS_ACT
+        select NET_IFE
+        ---help---
+	  Say Y here to allow packet sampling tc action. The packet sample
+	  action consists of statistically duplicating packets, truncating them
+	  and adding descriptive metadata to them. The duplicated packets are
+	  then marked to allow further processing using tc.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called act_sample.
+
 config NET_ACT_IPT
         tristate "IPtables targets"
         depends on NET_CLS_ACT && NETFILTER && IP_NF_IPTABLES
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 4bdda36..7b915d2 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_NET_CLS_ACT)	+= act_api.o
 obj-$(CONFIG_NET_ACT_POLICE)	+= act_police.o
 obj-$(CONFIG_NET_ACT_GACT)	+= act_gact.o
 obj-$(CONFIG_NET_ACT_MIRRED)	+= act_mirred.o
+obj-$(CONFIG_NET_ACT_SAMPLE)	+= act_sample.o
 obj-$(CONFIG_NET_ACT_IPT)	+= act_ipt.o
 obj-$(CONFIG_NET_ACT_NAT)	+= act_nat.o
 obj-$(CONFIG_NET_ACT_PEDIT)	+= act_pedit.o
diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
new file mode 100644
index 0000000..39f1d94c
--- /dev/null
+++ b/net/sched/act_sample.c
@@ -0,0 +1,282 @@
+/*
+ * net/sched/act_sample.c - Packet samplig tc action
+ * Copyright (c) 2016 Yotam Gigi <yotamg@mellanox.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/gfp.h>
+#include <net/net_namespace.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <linux/tc_act/tc_sample.h>
+#include <net/tc_act/tc_sample.h>
+#include <net/ife.h>
+
+#include <linux/if_arp.h>
+
+#define SAMPLE_TAB_MASK     7
+static int sample_net_id;
+static struct tc_action_ops act_sample_ops;
+
+static const struct nla_policy sample_policy[TCA_SAMPLE_MAX + 1] = {
+	[TCA_SAMPLE_PARMS]	= { .len = sizeof(struct tc_sample) },
+};
+
+static int tcf_sample_init(struct net *net, struct nlattr *nla,
+			   struct nlattr *est, struct tc_action **a, int ovr,
+			   int bind)
+{
+	struct tc_action_net *tn = net_generic(net, sample_net_id);
+	struct nlattr *tb[TCA_SAMPLE_MAX + 1];
+	struct tc_sample *parm;
+	struct tcf_sample *s;
+	int ret;
+	bool exists = false;
+
+	if (!nla)
+		return -EINVAL;
+	ret = nla_parse_nested(tb, TCA_SAMPLE_MAX, nla, sample_policy);
+	if (ret < 0)
+		return ret;
+	if (!tb[TCA_SAMPLE_PARMS] || !tb[TCA_SAMPLE_RATE] ||
+	    !tb[TCA_SAMPLE_MARK] || !tb[TCA_SAMPLE_ETH_TYPE])
+		return -EINVAL;
+
+	parm = nla_data(tb[TCA_SAMPLE_PARMS]);
+
+	exists = tcf_hash_check(tn, parm->index, a, bind);
+	if (exists && bind)
+		return 0;
+
+	if (!exists) {
+		ret = tcf_hash_create(tn, parm->index, est, a,
+				      &act_sample_ops, bind, false);
+		if (ret)
+			return ret;
+		ret = ACT_P_CREATED;
+	} else {
+		tcf_hash_release(*a, bind);
+		if (!ovr)
+			return -EEXIST;
+	}
+	s = to_sample(*a);
+
+	ASSERT_RTNL();
+	s->tcf_action = parm->action;
+	s->rate = nla_get_u32(tb[TCA_SAMPLE_RATE]);
+	s->mark = nla_get_u32(tb[TCA_SAMPLE_MARK]);
+	s->eth_type = nla_get_u16(tb[TCA_SAMPLE_ETH_TYPE]);
+
+	s->truncate = tb[TCA_SAMPLE_TRUNC_SIZE];
+	if (tb[TCA_SAMPLE_TRUNC_SIZE])
+		s->trunc_size = nla_get_u32(tb[TCA_SAMPLE_TRUNC_SIZE]);
+
+	if (tb[TCA_SAMPLE_ETH_SRC])
+		ether_addr_copy(s->eth_src, nla_data(tb[TCA_SAMPLE_ETH_SRC]));
+	else
+		eth_zero_addr(s->eth_src);
+	if (tb[TCA_SAMPLE_ETH_DST])
+		ether_addr_copy(s->eth_dst, nla_data(tb[TCA_SAMPLE_ETH_DST]));
+	else
+		eth_zero_addr(s->eth_dst);
+
+	if (ret == ACT_P_CREATED)
+		tcf_hash_insert(tn, *a);
+	return ret;
+}
+
+static bool dev_ok_push(struct net_device *dev)
+{
+	switch (dev->type) {
+	case ARPHRD_TUNNEL:
+	case ARPHRD_TUNNEL6:
+	case ARPHRD_SIT:
+	case ARPHRD_IPGRE:
+	case ARPHRD_VOID:
+	case ARPHRD_NONE:
+		return false;
+	default:
+		return true;
+	}
+}
+
+static int tcf_sample(struct sk_buff *skb, const struct tc_action *a,
+		      struct tcf_result *res)
+{
+	struct tcf_sample *s = to_sample(a);
+	static struct ethhdr *ethhdr;
+	struct sk_buff *skb2;
+	int orig_size;
+	int retval;
+	u32 at;
+
+	tcf_lastuse_update(&s->tcf_tm);
+	bstats_cpu_update(this_cpu_ptr(s->common.cpu_bstats), skb);
+
+	rcu_read_lock();
+	retval = READ_ONCE(s->tcf_action);
+
+	/* randomly sample packets according to rate */
+	if (prandom_u32() % s->rate == 0) {
+		skb2 = skb_copy(skb, GFP_ATOMIC);
+		if (!skb2)
+			goto out;
+
+		if (s->truncate)
+			skb_trim(skb2, s->trunc_size);
+
+		at = G_TC_AT(skb->tc_verd);
+		skb2->mac_len = skb->mac_len;
+
+		/* on ingress, the mac header gets poped, so push it back */
+		if (!(at & AT_EGRESS) && dev_ok_push(skb->dev))
+			skb_push(skb2, skb2->mac_len);
+
+		orig_size = skb->len + skb->dev->hard_header_len;
+		ethhdr = ife_packet_info_pack(skb2, orig_size,
+					      skb->dev->ifindex, 0);
+		if (!ethhdr)
+			goto out;
+
+		ethhdr->h_proto = htons(s->eth_type);
+		if (!is_zero_ether_addr(s->eth_src))
+			ether_addr_copy(ethhdr->h_source, s->eth_src);
+		if (!is_zero_ether_addr(s->eth_dst))
+			ether_addr_copy(ethhdr->h_dest, s->eth_dst);
+
+		skb2->mark = s->mark;
+		netif_receive_skb(skb2);
+
+		/* mirror is always swallowed */
+		skb2->tc_verd = SET_TC_FROM(skb2->tc_verd, at);
+	}
+out:
+	rcu_read_unlock();
+
+	return retval;
+}
+
+static int tcf_sample_dump(struct sk_buff *skb, struct tc_action *a,
+			   int bind, int ref)
+{
+	unsigned char *b = skb_tail_pointer(skb);
+	struct tcf_sample *s = to_sample(a);
+	struct tc_sample opt = {
+		.index      = s->tcf_index,
+		.action     = s->tcf_action,
+		.refcnt     = s->tcf_refcnt - ref,
+		.bindcnt    = s->tcf_bindcnt - bind,
+	};
+	struct tcf_t t;
+
+	if (nla_put(skb, TCA_SAMPLE_PARMS, sizeof(opt), &opt))
+		goto nla_put_failure;
+
+	tcf_tm_dump(&t, &s->tcf_tm);
+	if (nla_put_64bit(skb, TCA_SAMPLE_TM, sizeof(t), &t, TCA_SAMPLE_PAD))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_SAMPLE_RATE, s->rate))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_SAMPLE_MARK, s->mark))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_SAMPLE_ETH_TYPE, s->eth_type))
+		goto nla_put_failure;
+
+	if (s->truncate)
+		if (nla_put_u32(skb, TCA_SAMPLE_TRUNC_SIZE, s->trunc_size))
+			goto nla_put_failure;
+
+	if (!is_zero_ether_addr(s->eth_src))
+		if (nla_put(skb, TCA_SAMPLE_ETH_SRC, ETH_ALEN, s->eth_src))
+			goto nla_put_failure;
+
+	if (!is_zero_ether_addr(s->eth_dst))
+		if (nla_put(skb, TCA_SAMPLE_ETH_DST, ETH_ALEN, s->eth_dst))
+			goto nla_put_failure;
+
+	return skb->len;
+
+nla_put_failure:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static int tcf_sample_walker(struct net *net, struct sk_buff *skb,
+			     struct netlink_callback *cb, int type,
+			     const struct tc_action_ops *ops)
+{
+	struct tc_action_net *tn = net_generic(net, sample_net_id);
+
+	return tcf_generic_walker(tn, skb, cb, type, ops);
+}
+
+static int tcf_sample_search(struct net *net, struct tc_action **a, u32 index)
+{
+	struct tc_action_net *tn = net_generic(net, sample_net_id);
+
+	return tcf_hash_search(tn, a, index);
+}
+
+static struct tc_action_ops act_sample_ops = {
+	.kind	= "sample",
+	.type	= TCA_ACT_SAMPLE,
+	.owner	= THIS_MODULE,
+	.act	= tcf_sample,
+	.dump	= tcf_sample_dump,
+	.init	= tcf_sample_init,
+	.walk	= tcf_sample_walker,
+	.lookup	= tcf_sample_search,
+	.size	= sizeof(struct tcf_sample),
+};
+
+static __net_init int sample_init_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, sample_net_id);
+
+	return tc_action_net_init(tn, &act_sample_ops, SAMPLE_TAB_MASK);
+}
+
+static void __net_exit sample_exit_net(struct net *net)
+{
+	struct tc_action_net *tn = net_generic(net, sample_net_id);
+
+	tc_action_net_exit(tn);
+}
+
+static struct pernet_operations sample_net_ops = {
+	.init = sample_init_net,
+	.exit = sample_exit_net,
+	.id   = &sample_net_id,
+	.size = sizeof(struct tc_action_net),
+};
+
+static int __init sample_init_module(void)
+{
+	return tcf_register_action(&act_sample_ops, &sample_net_ops);
+}
+
+static void __exit sample_cleanup_module(void)
+{
+	tcf_unregister_action(&act_sample_ops, &sample_net_ops);
+}
+
+module_init(sample_init_module);
+module_exit(sample_cleanup_module);
+
+MODULE_AUTHOR("Yotam Gigi <yotamg@mellanox.com>");
+MODULE_DESCRIPTION("Packet sampling action");
+MODULE_LICENSE("GPL v2");
-- 
2.7.4

^ permalink raw reply related

* [patch net-next v2 4/8] net: ife: Introduce packet info packing method
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux, roopa,
	john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>

From: Yotam Gigi <yotamg@mellanox.com>

Add the ife_packet_info_pack function which encodes some informative
metadata into a packet using the ife encapsulation. The informative
metadata consists of:
 - The packet input/output ifindices
 - The packet original size, in case it was truncated

This method is useful for anyone that wants to pass informative metadata
on a packet alongside with the packet itself. A good usage can be packet
sampling.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/ife.h | 10 ++++++++++
 net/ife/ife.c     | 45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/include/net/ife.h b/include/net/ife.h
index a75d4e0..5fbcfb3 100644
--- a/include/net/ife.h
+++ b/include/net/ife.h
@@ -18,6 +18,9 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
 
 void *ife_tlv_meta_next(void *skbdata);
 
+struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
+				    int in_ifindex, int out_ifindex);
+
 #else
 
 static inline void *ife_encode(struct sk_buff *skb, u16 metalen)
@@ -47,6 +50,13 @@ static inline void *ife_tlv_meta_next(void *skbdata)
 	return NULL;
 }
 
+static inline struct ethhdr *
+ife_packet_info_pack(struct sk_buff *skb, int orig_size, int in_ifindex,
+		     int out_ifindex)
+{
+	return NULL;
+}
+
 #endif
 
 #endif /* __NET_IFE_H */
diff --git a/net/ife/ife.c b/net/ife/ife.c
index 3642f43..f0b683e 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -138,6 +138,51 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void *dval)
 }
 EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
 
+struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
+				    int in_ifindex, int out_ifindex)
+{
+	int curr_size;
+	void *ifetlv;
+	u16 metalen;
+
+	curr_size = skb->len;
+
+	metalen = nla_total_size(sizeof(orig_size)) +
+		  nla_total_size(sizeof(curr_size));
+
+	if (in_ifindex)
+		metalen += nla_total_size(sizeof(in_ifindex));
+	if (out_ifindex)
+		metalen += nla_total_size(sizeof(out_ifindex));
+
+	in_ifindex = htonl(in_ifindex);
+	out_ifindex = htonl(out_ifindex);
+	orig_size = htonl(orig_size);
+	curr_size = htonl(curr_size);
+
+	ifetlv = ife_encode(skb, metalen);
+	if (!ifetlv)
+		return NULL;
+
+	if (in_ifindex)
+		ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_IN_IFINDEX,
+					      sizeof(in_ifindex), &in_ifindex);
+
+	if (out_ifindex)
+		ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_OUT_IFINDEX,
+					      sizeof(out_ifindex),
+					      &out_ifindex);
+
+	ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_ORIGSIZE,
+				      sizeof(orig_size), &orig_size);
+
+	ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_SIZE,
+				      sizeof(curr_size), &curr_size);
+
+	return (struct ethhdr *) skb->data;
+}
+EXPORT_SYMBOL(ife_packet_info_pack);
+
 MODULE_AUTHOR("Jamal Hadi Salim <jhs@mojatatu.com>");
 MODULE_AUTHOR("Yotam Gigi <yotamg@mellanox.com>");
 MODULE_DESCRIPTION("Inter-FE LFB action");
-- 
2.7.4

^ permalink raw reply related

* [patch net-next v2 3/8] net: ife: Introduce new metadata tlv types
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux, roopa,
	john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>

From: Yotam Gigi <yotamg@mellanox.com>

IFE_META_IN_IFINDEX: Allow to pass input ifindex value as part of the
ife metadata
IFE_META_OUT_IFINDEX: Allow to pass output ifindex value as part of the
ife metadata
IFE_META_ORIG_SIZE: Allow to pass the original packet size as part of
the ife metadata. Can be used in case that the packet is truncated
IFE_META_SIZE: Allow to pass the size of the encapsulated packet as
part of the ife metadata

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/uapi/linux/ife.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/ife.h b/include/uapi/linux/ife.h
index 2954da3..ebdd785 100644
--- a/include/uapi/linux/ife.h
+++ b/include/uapi/linux/ife.h
@@ -9,6 +9,10 @@ enum {
 	IFE_META_PRIO,
 	IFE_META_QMAP,
 	IFE_META_TCINDEX,
+	IFE_META_IN_IFINDEX,
+	IFE_META_OUT_IFINDEX,
+	IFE_META_ORIGSIZE,
+	IFE_META_SIZE,
 	__IFE_META_MAX
 };
 
-- 
2.7.4

^ permalink raw reply related

* [patch net-next v2 2/8] act_ife: Change to use ife module
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
  To: netdev
  Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
	geert+renesas, stephen, xiyou.wangcong, linux, roopa,
	john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>

From: Yotam Gigi <yotamg@mellanox.com>

Use the encode/decode functionality from the ife module instead of using
implementation inside the act_ife.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/tc_act/tc_ife.h        |   3 -
 include/uapi/linux/tc_act/tc_ife.h |  10 +---
 net/sched/Kconfig                  |   1 +
 net/sched/act_ife.c                | 109 +++++++++++--------------------------
 4 files changed, 34 insertions(+), 89 deletions(-)

diff --git a/include/net/tc_act/tc_ife.h b/include/net/tc_act/tc_ife.h
index 9fd2bea0..30ba459 100644
--- a/include/net/tc_act/tc_ife.h
+++ b/include/net/tc_act/tc_ife.h
@@ -6,7 +6,6 @@
 #include <linux/rtnetlink.h>
 #include <linux/module.h>
 
-#define IFE_METAHDRLEN 2
 struct tcf_ife_info {
 	struct tc_action common;
 	u8 eth_dst[ETH_ALEN];
@@ -45,8 +44,6 @@ struct tcf_meta_ops {
 
 int ife_get_meta_u32(struct sk_buff *skb, struct tcf_meta_info *mi);
 int ife_get_meta_u16(struct sk_buff *skb, struct tcf_meta_info *mi);
-int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
-			const void *dval);
 int ife_alloc_meta_u32(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_alloc_meta_u16(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
 int ife_check_meta_u32(u32 metaval, struct tcf_meta_info *mi);
diff --git a/include/uapi/linux/tc_act/tc_ife.h b/include/uapi/linux/tc_act/tc_ife.h
index cd18360..7c28178 100644
--- a/include/uapi/linux/tc_act/tc_ife.h
+++ b/include/uapi/linux/tc_act/tc_ife.h
@@ -3,6 +3,7 @@
 
 #include <linux/types.h>
 #include <linux/pkt_cls.h>
+#include <linux/ife.h>
 
 #define TCA_ACT_IFE 25
 /* Flag bits for now just encoding/decoding; mutually exclusive */
@@ -28,13 +29,4 @@ enum {
 };
 #define TCA_IFE_MAX (__TCA_IFE_MAX - 1)
 
-#define IFE_META_SKBMARK 1
-#define IFE_META_HASHID 2
-#define	IFE_META_PRIO 3
-#define	IFE_META_QMAP 4
-#define	IFE_META_TCINDEX 5
-/*Can be overridden at runtime by module option*/
-#define	__IFE_META_MAX 6
-#define IFE_META_MAX (__IFE_META_MAX - 1)
-
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 87956a7..24f7cac 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -763,6 +763,7 @@ config NET_ACT_SKBMOD
 config NET_ACT_IFE
         tristate "Inter-FE action based on IETF ForCES InterFE LFB"
         depends on NET_CLS_ACT
+        select NET_IFE
         ---help---
 	  Say Y here to allow for sourcing and terminating metadata
 	  For details refer to netdev01 paper:
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 95c463c..5c2478a 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -32,6 +32,7 @@
 #include <uapi/linux/tc_act/tc_ife.h>
 #include <net/tc_act/tc_ife.h>
 #include <linux/etherdevice.h>
+#include <net/ife.h>
 
 #define IFE_TAB_MASK 15
 
@@ -46,23 +47,6 @@ static const struct nla_policy ife_policy[TCA_IFE_MAX + 1] = {
 	[TCA_IFE_TYPE] = { .type = NLA_U16},
 };
 
-/* Caller takes care of presenting data in network order
-*/
-int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void *dval)
-{
-	u32 *tlv = (u32 *)(skbdata);
-	u16 totlen = nla_total_size(dlen);	/*alignment + hdr */
-	char *dptr = (char *)tlv + NLA_HDRLEN;
-	u32 htlv = attrtype << 16 | (dlen + NLA_HDRLEN);
-
-	*tlv = htonl(htlv);
-	memset(dptr, 0, totlen - NLA_HDRLEN);
-	memcpy(dptr, dval, dlen);
-
-	return totlen;
-}
-EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
-
 int ife_encode_meta_u16(u16 metaval, void *skbdata, struct tcf_meta_info *mi)
 {
 	u16 edata = 0;
@@ -637,69 +621,60 @@ int find_decode_metaid(struct sk_buff *skb, struct tcf_ife_info *ife,
 	return 0;
 }
 
-struct ifeheadr {
-	__be16 metalen;
-	u8 tlv_data[];
-};
-
-struct meta_tlvhdr {
-	__be16 type;
-	__be16 len;
-};
-
 static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
 			  struct tcf_result *res)
 {
 	struct tcf_ife_info *ife = to_ife(a);
+	u32 at = G_TC_AT(skb->tc_verd);
 	int action = ife->tcf_action;
-	struct ifeheadr *ifehdr = (struct ifeheadr *)skb->data;
-	int ifehdrln = (int)ifehdr->metalen;
-	struct meta_tlvhdr *tlv = (struct meta_tlvhdr *)(ifehdr->tlv_data);
+	u8 *ifehdr_end;
+	u8 *tlv_data;
+	u16 metalen;
 
 	spin_lock(&ife->tcf_lock);
 	bstats_update(&ife->tcf_bstats, skb);
 	tcf_lastuse_update(&ife->tcf_tm);
 	spin_unlock(&ife->tcf_lock);
 
-	ifehdrln = ntohs(ifehdrln);
-	if (unlikely(!pskb_may_pull(skb, ifehdrln))) {
+	if (!(at & AT_EGRESS))
+		skb_push(skb, skb->dev->hard_header_len);
+
+	tlv_data = ife_decode(skb, &metalen);
+	if (unlikely(!tlv_data)) {
 		spin_lock(&ife->tcf_lock);
 		ife->tcf_qstats.drops++;
 		spin_unlock(&ife->tcf_lock);
 		return TC_ACT_SHOT;
 	}
 
-	skb_set_mac_header(skb, ifehdrln);
-	__skb_pull(skb, ifehdrln);
-	skb->protocol = eth_type_trans(skb, skb->dev);
-	ifehdrln -= IFE_METAHDRLEN;
-
-	while (ifehdrln > 0) {
-		u8 *tlvdata = (u8 *)tlv;
-		u16 mtype = tlv->type;
-		u16 mlen = tlv->len;
-		u16 alen;
+	ifehdr_end = tlv_data + metalen;
+	for (; tlv_data < ifehdr_end; tlv_data = ife_tlv_meta_next(tlv_data)) {
+		u8 *curr_data;
+		u16 mtype;
+		u16 dlen;
 
-		mtype = ntohs(mtype);
-		mlen = ntohs(mlen);
-		alen = NLA_ALIGN(mlen);
+		curr_data = ife_tlv_meta_decode(tlv_data, &mtype, &dlen, NULL);
 
-		if (find_decode_metaid(skb, ife, mtype, (mlen - NLA_HDRLEN),
-				       (void *)(tlvdata + NLA_HDRLEN))) {
+		if (find_decode_metaid(skb, ife, mtype, dlen, curr_data)) {
 			/* abuse overlimits to count when we receive metadata
 			 * but dont have an ops for it
 			 */
-			pr_info_ratelimited("Unknown metaid %d alnlen %d\n",
-					    mtype, mlen);
+			pr_info_ratelimited("Unknown metaid %d dlen %d\n",
+					    mtype, dlen);
 			ife->tcf_qstats.overlimits++;
 		}
+	}
 
-		tlvdata += alen;
-		ifehdrln -= alen;
-		tlv = (struct meta_tlvhdr *)tlvdata;
+	if (WARN_ON(tlv_data != ifehdr_end)) {
+		spin_lock(&ife->tcf_lock);
+		ife->tcf_qstats.drops++;
+		spin_unlock(&ife->tcf_lock);
+		return TC_ACT_SHOT;
 	}
 
+	skb->protocol = eth_type_trans(skb, skb->dev);
 	skb_reset_network_header(skb);
+
 	return action;
 }
 
@@ -727,7 +702,6 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 	struct tcf_ife_info *ife = to_ife(a);
 	int action = ife->tcf_action;
 	struct ethhdr *oethh;	/* outer ether header */
-	struct ethhdr *iethh;	/* inner eth header */
 	struct tcf_meta_info *e;
 	/*
 	   OUTERHDR:TOTMETALEN:{TLVHDR:Metadatum:TLVHDR..}:ORIGDATA
@@ -735,10 +709,11 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 	 */
 	u16 metalen = ife_get_sz(skb, ife);
 	int hdrm = metalen + skb->dev->hard_header_len + IFE_METAHDRLEN;
-	unsigned int skboff = skb->dev->hard_header_len;
+	unsigned int skboff = 0;
 	u32 at = G_TC_AT(skb->tc_verd);
 	int new_len = skb->len + hdrm;
 	bool exceed_mtu = false;
+	void *ife_meta;
 	int err;
 
 	if (at & AT_EGRESS) {
@@ -766,27 +741,10 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 		return TC_ACT_SHOT;
 	}
 
-	err = skb_cow_head(skb, hdrm);
-	if (unlikely(err)) {
-		ife->tcf_qstats.drops++;
-		spin_unlock(&ife->tcf_lock);
-		return TC_ACT_SHOT;
-	}
-
 	if (!(at & AT_EGRESS))
 		skb_push(skb, skb->dev->hard_header_len);
 
-	iethh = (struct ethhdr *)skb->data;
-	__skb_push(skb, hdrm);
-	memcpy(skb->data, iethh, skb->mac_len);
-	skb_reset_mac_header(skb);
-	oethh = eth_hdr(skb);
-
-	/*total metadata length */
-	metalen += IFE_METAHDRLEN;
-	metalen = htons(metalen);
-	memcpy((skb->data + skboff), &metalen, IFE_METAHDRLEN);
-	skboff += IFE_METAHDRLEN;
+	ife_meta = ife_encode(skb, metalen);
 
 	/* XXX: we dont have a clever way of telling encode to
 	 * not repeat some of the computations that are done by
@@ -794,7 +752,7 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 	 */
 	list_for_each_entry(e, &ife->metalist, metalist) {
 		if (e->ops->encode) {
-			err = e->ops->encode(skb, (void *)(skb->data + skboff),
+			err = e->ops->encode(skb, (void *)(ife_meta + skboff),
 					     e);
 		}
 		if (err < 0) {
@@ -805,15 +763,12 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
 		}
 		skboff += err;
 	}
+	oethh = (struct ethhdr *)skb->data;
 
 	if (!is_zero_ether_addr(ife->eth_src))
 		ether_addr_copy(oethh->h_source, ife->eth_src);
-	else
-		ether_addr_copy(oethh->h_source, iethh->h_source);
 	if (!is_zero_ether_addr(ife->eth_dst))
 		ether_addr_copy(oethh->h_dest, ife->eth_dst);
-	else
-		ether_addr_copy(oethh->h_dest, iethh->h_dest);
 	oethh->h_proto = htons(ife->eth_type);
 
 	if (!(at & AT_EGRESS))
-- 
2.7.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox