* Re: [PATCH net-next v5] cadence: Add LSO support.
From: Eric Dumazet @ 2016-11-14 14:35 UTC (permalink / raw)
To: Rafal Ozieblo
Cc: David Miller, nicolas.ferre@atmel.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <BN3PR07MB2516C30913D2FB8D4788071AC9BC0@BN3PR07MB2516.namprd07.prod.outlook.com>
On Mon, 2016-11-14 at 09:32 +0000, Rafal Ozieblo wrote:
> From: David Miller [mailto:davem@davemloft.net]
> Sent: 10 listopada 2016 18:01
> To: Rafal Ozieblo
> Cc: nicolas.ferre@atmel.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH net-next v5]] cadence: Add LSO support.
>
> From: Rafal Ozieblo <rafalo@cadence.com>
> Date: Wed, 9 Nov 2016 13:41:02 +0000
>
> > First, please remove the spurious closing bracket in your Subject line in future submittions.
> >
> >> + if (is_udp) /* is_udp is only set when (is_lso) is checked */
> >> + /* zero UDP checksum, not calculated by h/w for UFO */
> >> + udp_hdr(skb)->check = 0;
> >
> > This is really not ok.
> >
> > If UFO is in use it should not silently disable UDP checksums.
> >
> > If you cannot support UFO with proper checksumming, then you cannot enable support for that feature.
>
> According Cadence Gigabit Ethernet MAC documentation:
>
> "Hardware will not calculate the UDP checksum or modify the UDP checksum field. Therefore software
> must set a value of zero in the checksum field in the UDP header (in the first payload buffer)
> to indicate to the receiver that the UDP datagram does not include a checksum."
>
> It is hardware requirement.
Then do not claim NETIF_F_UFO suport in the driver, if hardware is
absolutely not capable of handling this.
Linux stack will provide proper udp checksum.
Almost no driver sets NETIF_F_UFO.
^ permalink raw reply
* Re: [PATCH ethtool v2 2/2] Ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE and PHY downshift
From: Andrew Lunn @ 2016-11-14 14:03 UTC (permalink / raw)
To: Allan W. Nielsen; +Cc: netdev, raju.lakkaraju
In-Reply-To: <1479115787-1265-3-git-send-email-allan.nielsen@microsemi.com>
> +.B \-\-get\-phy\-tunable
> +Gets the PHY tunable parameters.
> +.RS 4
> +.TP
> +.B downshift
> +For operation in cabling environments that are incompatible with 1000BAST-T,
BASE
> +PHY device provides an automatic link speed downshift operation.
> +Link speed downshift after N failed 1000BASE-T auto-negotiation attempts.
> +
> +Gets the PHY downshift count/status.
> +.RE
> .SH BUGS
> Not supported (in part or whole) on all network drivers.
> .SH AUTHOR
> diff --git a/ethtool.c b/ethtool.c
> index 49ac94e..3b7ea4e 100644
> --- a/ethtool.c
> +++ b/ethtool.c
> @@ -4520,6 +4520,146 @@ static int do_seee(struct cmd_context *ctx)
> return 0;
> }
>
> +static int do_get_phy_tunable(struct cmd_context *ctx)
> +{
> + int argc = ctx->argc;
> + char **argp = ctx->argp;
> + int err, i;
> + u8 downshift_changed = 0;
> +
> + if (argc < 1)
> + exit_bad_args();
> + for (i = 0; i < argc; i++) {
> + if (!strcmp(argp[i], "downshift")) {
> + downshift_changed = 1;
> + i += 1;
> + } else {
> + exit_bad_args();
> + }
> + }
> +
> + if (downshift_changed) {
> + struct ethtool_tunable ds;
> + u8 count = 0;
> +
> + ds.cmd = ETHTOOL_PHY_GTUNABLE;
> + ds.id = ETHTOOL_PHY_DOWNSHIFT;
> + ds.type_id = ETHTOOL_TUNABLE_U8;
> + ds.len = 1;
> + ds.data[0] = &count;
> + err = send_ioctl(ctx, &ds);
> + if (err < 0) {
> + perror("Cannot Get PHY downshift count");
> + err = 87;
> + }
> + count = *((u8 *)&ds.data[0]);
> + if (count)
> + fprintf(stdout, " Downshift count: %d\n",
> + count);
> + else
> + fprintf(stdout, " Downshift disabled\n");
Please be consistent with the spaces.
Also, if send_ioctl() returns an error, you print the error message,
and then still print the value of count. You look to be missing a
return or an else.
Andrew
^ permalink raw reply
* [PATCH v2] net: ioctl SIOCSIFADDR minor cleanup
From: yuan linyu @ 2016-11-14 14:48 UTC (permalink / raw)
To: netdev; +Cc: davem
From: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
1. set interface address label to ioctl request device name is enough
2. when address pass inet_abc_len check, prefixlen < 31 is always true
Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
---
net/ipv4/devinet.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 062a67c..d491a7a 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1063,10 +1063,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
if (!ifa)
break;
INIT_HLIST_NODE(&ifa->hash);
- if (colon)
- memcpy(ifa->ifa_label, ifr.ifr_name, IFNAMSIZ);
- else
- memcpy(ifa->ifa_label, dev->name, IFNAMSIZ);
+ memcpy(ifa->ifa_label, ifr.ifr_name, IFNAMSIZ);
} else {
ret = 0;
if (ifa->ifa_local == sin->sin_addr.s_addr)
@@ -1081,8 +1078,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
if (!(dev->flags & IFF_POINTOPOINT)) {
ifa->ifa_prefixlen = inet_abc_len(ifa->ifa_address);
ifa->ifa_mask = inet_make_mask(ifa->ifa_prefixlen);
- if ((dev->flags & IFF_BROADCAST) &&
- ifa->ifa_prefixlen < 31)
+ if (dev->flags & IFF_BROADCAST)
ifa->ifa_broadcast = ifa->ifa_address |
~ifa->ifa_mask;
} else {
--
2.7.4
--
^ permalink raw reply related
* Re: [net-next:master 664/670] drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c:1421:7-9: WARNING: possible condition with no effect (if == else)
From: Tom Lendacky @ 2016-11-14 14:49 UTC (permalink / raw)
To: Julia Lawall; +Cc: netdev, kbuild-all
In-Reply-To: <alpine.DEB.2.20.1611131910380.1946@hadrien>
On 11/13/2016 12:12 PM, Julia Lawall wrote:
> It looks like the code on lines 1422 and 1424 is the same, so either the
> test may be unnecessary, or one of the branches is wrong.
I was trying to show that for a cable length of less than 5 meters I was
using a specific sub-command and that for anything longer I was falling
back to the 5 meter value. I can see how it doesn't look right and I'll
clean it up with a patch.
>
> Coccinelle is also comparing about a comparison on an unsigned value in
> line 1900, but the code for that is not listed in the report below.
Ok, good catch. I'll fix this up with a patch also.
Thanks,
Tom
>
> julia
>
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
> head: f7ad3d4b83e694347cddc96d956143068bef32c9
> commit: abf0a1c2b26ad964d19b143ce46735e1b0667f29 [664/670] amd-xgbe: Add support for SFP+ modules
> :::::: branch date: 16 minutes ago
> :::::: commit date: 12 hours ago
>
>>> drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c:1421:7-9: WARNING: possible condition with no effect (if == else)
> --
>>> drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c:1900:6-9: WARNING: Unsigned expression compared with zero: ret < 0
>
> git remote add net-next https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
> git remote update net-next
> git checkout abf0a1c2b26ad964d19b143ce46735e1b0667f29
> vim +1421 drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
>
> abf0a1c2 Lendacky, Thomas 2016-11-10 1405 {
> abf0a1c2 Lendacky, Thomas 2016-11-10 1406 struct xgbe_phy_data *phy_data = pdata->phy_data;
> abf0a1c2 Lendacky, Thomas 2016-11-10 1407 unsigned int s0;
> abf0a1c2 Lendacky, Thomas 2016-11-10 1408
> abf0a1c2 Lendacky, Thomas 2016-11-10 1409 xgbe_phy_start_ratechange(pdata);
> abf0a1c2 Lendacky, Thomas 2016-11-10 1410
> abf0a1c2 Lendacky, Thomas 2016-11-10 1411 /* 10G/SFI */
> abf0a1c2 Lendacky, Thomas 2016-11-10 1412 s0 = 0;
> abf0a1c2 Lendacky, Thomas 2016-11-10 1413 XP_SET_BITS(s0, XP_DRIVER_SCRATCH_0, COMMAND, 3);
> abf0a1c2 Lendacky, Thomas 2016-11-10 1414 if (phy_data->sfp_cable != XGBE_SFP_CABLE_PASSIVE) {
> abf0a1c2 Lendacky, Thomas 2016-11-10 1415 XP_SET_BITS(s0, XP_DRIVER_SCRATCH_0, SUB_COMMAND, 0);
> abf0a1c2 Lendacky, Thomas 2016-11-10 1416 } else {
> abf0a1c2 Lendacky, Thomas 2016-11-10 1417 if (phy_data->sfp_cable_len <= 1)
> abf0a1c2 Lendacky, Thomas 2016-11-10 1418 XP_SET_BITS(s0, XP_DRIVER_SCRATCH_0, SUB_COMMAND, 1);
> abf0a1c2 Lendacky, Thomas 2016-11-10 1419 else if (phy_data->sfp_cable_len <= 3)
> abf0a1c2 Lendacky, Thomas 2016-11-10 1420 XP_SET_BITS(s0, XP_DRIVER_SCRATCH_0, SUB_COMMAND, 2);
> abf0a1c2 Lendacky, Thomas 2016-11-10 @1421 else if (phy_data->sfp_cable_len <= 5)
> abf0a1c2 Lendacky, Thomas 2016-11-10 1422 XP_SET_BITS(s0, XP_DRIVER_SCRATCH_0, SUB_COMMAND, 3);
> abf0a1c2 Lendacky, Thomas 2016-11-10 1423 else
> abf0a1c2 Lendacky, Thomas 2016-11-10 1424 XP_SET_BITS(s0, XP_DRIVER_SCRATCH_0, SUB_COMMAND, 3);
> abf0a1c2 Lendacky, Thomas 2016-11-10 1425 }
> abf0a1c2 Lendacky, Thomas 2016-11-10 1426
> abf0a1c2 Lendacky, Thomas 2016-11-10 1427 /* Call FW to make the change */
> abf0a1c2 Lendacky, Thomas 2016-11-10 1428 XP_IOWRITE(pdata, XP_DRIVER_SCRATCH_0, s0);
> abf0a1c2 Lendacky, Thomas 2016-11-10 1429 XP_IOWRITE(pdata, XP_DRIVER_SCRATCH_1, 0);
>
> ---
> 0-DAY kernel test infrastructure Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all Intel Corporation
>
^ permalink raw reply
* Re: Debugging Ethernet issues
From: Mason @ 2016-11-14 14:58 UTC (permalink / raw)
To: Andrew Lunn
Cc: Florian Fainelli, netdev, Mans Rullgard, Sergei Shtylyov,
Tom Lendacky, Zach Brown, Shaohui Xie, Tim Beale, Brian Hill,
Vince Bridgers, Balakumaran Kannan, David S. Miller,
Sebastian Frias, Kirill Kapranov
In-Reply-To: <20161114133428.GF26710@lunn.ch>
On 14/11/2016 14:34, Andrew Lunn wrote:
> Mason wrote:
>
>> How exactly does one use the generic PHY driver?
>
> If there is no specific driver available for the PHY, linux will fall
> back to the generic driver. So just don't compile the specific driver.
> You can see under /sys/bus/mdio_bus/devices which driver is being
> used.
I've removed the Atheros driver from the kernel config.
# ls /sys/bus/mdio_bus/drivers
Generic 10G PHY/ Generic PHY/
# ls -l "/sys/bus/mdio_bus/devices/26000.nb8800-mii:04/"
lrwxrwxrwx 1 root root 0 Jan 1 00:00 of_node -> ../../../../../../../firmware/devicetree/base/soc/ethernet@26000/ethernet-phy@4/
-r--r--r-- 1 root root 4096 Jan 1 00:00 phy_has_fixups
-r--r--r-- 1 root root 4096 Jan 1 00:00 phy_id
-r--r--r-- 1 root root 4096 Jan 1 00:00 phy_interface
lrwxrwxrwx 1 root root 0 Jan 1 00:00 subsystem -> ../../../../../../../bus/mdio_bus/
-rw-r--r-- 1 root root 4096 Jan 1 00:00 uevent
(I don't know why/how it finds so many ".." in the symlink path)
[ 1.170994] libphy: Fixed MDIO Bus: probed
[ 1.181794] libphy: nb8800-mii: probed
[ 1.192759] nb8800 26000.ethernet eth0: MAC address 00:16:e8:4b:b0:7d
# time udhcpc | while read LINE; do date; echo $LINE; done
Mon Nov 14 15:39:50 UTC 2016
udhcpc (v1.22.1) started
Mon Nov 14 15:39:50 UTC 2016
Sending discover...
[ 125.468634] nb8800_link_reconfigure from phy_state_machine
[ 126.475300] nb8800_link_reconfigure from phy_state_machine
Mon Nov 14 15:39:53 UTC 2016
Sending discover...
[ 126.682096] nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
Mon Nov 14 15:39:57 UTC 2016
Sending discover...
Mon Nov 14 15:40:20 UTC 2016
Sending discover...
Mon Nov 14 15:40:23 UTC 2016
Sending discover...
Mon Nov 14 15:40:26 UTC 2016
Sending discover...
Mon Nov 14 15:40:49 UTC 2016
Sending discover...
Mon Nov 14 15:40:52 UTC 2016
Sending discover...
Mon Nov 14 15:40:55 UTC 2016
Sending discover...
\x03^C
real 1m12.040s
user 0m0.067s
sys 0m0.057s
# tcpdump -n -i eth1-boards ether host 00:16:e8:4b:b0:7d
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1-boards, link-type EN10MB (Ethernet), capture size 262144 bytes
15:39:54.214574 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
15:39:55.215830 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
15:39:57.247862 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
15:39:57.248027 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
15:40:20.307556 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
15:40:20.307732 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
15:40:23.334155 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
15:40:23.334313 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
15:40:26.360807 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
15:40:26.360976 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
15:40:49.420473 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
15:40:49.420645 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
15:40:52.447121 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
15:40:52.447284 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
15:40:55.473721 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
15:40:55.473891 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
# ethtool -S eth0
NIC statistics:
rx_bytes_ok: 975
rx_frames_ok: 8
rx_undersize_frames: 0
rx_fragment_frames: 0
rx_64_byte_frames: 6
rx_127_byte_frames: 0
rx_255_byte_frames: 1
rx_511_byte_frames: 1
rx_1023_byte_frames: 0
rx_max_size_frames: 0
rx_oversize_frames: 0
rx_bad_fcs_frames: 0
rx_broadcast_frames: 4
rx_multicast_frames: 0
rx_control_frames: 0
rx_pause_frames: 0
rx_unsup_control_frames: 0
rx_align_error_frames: 0
rx_overrun_frames: 0
rx_jabber_frames: 0
rx_bytes: 975
rx_frames: 8
tx_bytes_ok: 2960
tx_frames_ok: 11
tx_64_byte_frames: 3
tx_127_byte_frames: 0
tx_255_byte_frames: 0
tx_511_byte_frames: 8
tx_1023_byte_frames: 0
tx_max_size_frames: 0
tx_oversize_frames: 0
tx_broadcast_frames: 8
tx_multicast_frames: 0
tx_control_frames: 0
tx_pause_frames: 0
tx_underrun_frames: 0
tx_single_collision_frames: 0
tx_multi_collision_frames: 0
tx_deferred_collision_frames: 0
tx_late_collision_frames: 0
tx_excessive_collision_frames: 0
tx_bytes: 2960
tx_frames: 11
tx_collisions: 0
EXPERIMENT #2
Now running the same test with fixed-phy (instead of generic phy driver)
# ls -l "/sys/bus/mdio_bus/devices/26000.nb8800-mii:04/"
lrwxrwxrwx 1 root root 0 Jan 1 00:01 of_node -> ../../../../../../../firmware/devicetree/base/soc/ethernet@26000/fixed-link/
-r--r--r-- 1 root root 4096 Jan 1 00:01 phy_has_fixups
-r--r--r-- 1 root root 4096 Jan 1 00:01 phy_id
-r--r--r-- 1 root root 4096 Jan 1 00:01 phy_interface
lrwxrwxrwx 1 root root 0 Jan 1 00:01 subsystem -> ../../../../../../../bus/mdio_bus/
-rw-r--r-- 1 root root 4096 Jan 1 00:00 uevent
[ 1.194170] libphy: Fixed MDIO Bus: probed
[ 1.208618] libphy: nb8800-mii: probed
[ 1.212422] mdio_bus 26000.nb8800-mii: /soc/ethernet@26000/fixed-link has invalid PHY address
[ 1.221025] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 0
[ 1.227788] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 1
[ 1.234551] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 2
[ 1.241310] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 3
[ 1.248071] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 4
[ 1.261077] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 5
[ 1.267857] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 6
[ 1.274623] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 7
[ 1.281386] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 8
[ 1.288144] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 9
[ 1.294906] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 10
[ 1.301750] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 11
[ 1.308599] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 12
[ 1.315444] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 13
[ 1.322291] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 14
[ 1.329142] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 15
[ 1.335992] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 16
[ 1.342839] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 17
[ 1.349752] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 18
[ 1.356652] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 19
[ 1.363503] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 20
[ 1.370352] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 21
[ 1.377200] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 22
[ 1.384048] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 23
[ 1.390895] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 24
[ 1.397742] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 25
[ 1.404590] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 26
[ 1.411437] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 27
[ 1.418284] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 28
[ 1.425132] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 29
[ 1.431979] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 30
[ 1.438826] mdio_bus 26000.nb8800-mii: scan phy fixed-link at address 31
[ 1.450432] nb8800 26000.ethernet eth0: MAC address 00:16:e8:4b:b0:7d
# time udhcpc | while read LINE; do date; echo $LINE; done
Mon Nov 14 15:49:05 UTC 2016
udhcpc (v1.22.1) started
Mon Nov 14 15:49:05 UTC 2016
Sending discover...
[ 179.748319] nb8800_link_reconfigure from phy_state_machine
[ 179.753888] nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control off
Mon Nov 14 15:49:08 UTC 2016
Sending discover...
Mon Nov 14 15:49:09 UTC 2016
Sending select for 172.27.64.58...
Mon Nov 14 15:49:10 UTC 2016
Lease of 172.27.64.58 obtained, lease time 604800
Mon Nov 14 15:49:10 UTC 2016
deleting routers
Mon Nov 14 15:49:10 UTC 2016
adding dns 172.27.0.17
real 0m4.359s
user 0m0.043s
sys 0m0.080s
# tcpdump -n -i eth1-boards ether host 00:16:e8:4b:b0:7d
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1-boards, link-type EN10MB (Ethernet), capture size 262144 bytes
15:49:09.054779 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
15:49:10.056055 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
15:49:10.071415 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
15:49:10.172012 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
15:49:11.054105 ARP, Reply 172.27.64.58 is-at 00:16:e8:4b:b0:7d, length 46
15:49:11.054116 IP 172.27.64.1 > 172.27.64.58: ICMP echo request, id 29883, seq 0, length 28
15:49:11.054304 IP 172.27.64.58 > 172.27.64.1: ICMP echo reply, id 29883, seq 0, length 28
15:49:16.057970 ARP, Request who-has 172.27.64.1 tell 172.27.64.58, length 46
15:49:16.057983 ARP, Reply 172.27.64.1 is-at 00:15:17:24:e0:81, length 28
# ethtool -S eth0
NIC statistics:
rx_bytes_ok: 1196
rx_frames_ok: 7
rx_undersize_frames: 0
rx_fragment_frames: 0
rx_64_byte_frames: 4
rx_127_byte_frames: 1
rx_255_byte_frames: 0
rx_511_byte_frames: 2
rx_1023_byte_frames: 0
rx_max_size_frames: 0
rx_oversize_frames: 0
rx_bad_fcs_frames: 0
rx_broadcast_frames: 3
rx_multicast_frames: 0
rx_control_frames: 0
rx_pause_frames: 0
rx_unsup_control_frames: 0
rx_align_error_frames: 0
rx_overrun_frames: 0
rx_jabber_frames: 0
rx_bytes: 1196
rx_frames: 7
tx_bytes_ok: 886
tx_frames_ok: 5
tx_64_byte_frames: 2
tx_127_byte_frames: 1
tx_255_byte_frames: 0
tx_511_byte_frames: 2
tx_1023_byte_frames: 0
tx_max_size_frames: 0
tx_oversize_frames: 0
tx_broadcast_frames: 2
tx_multicast_frames: 0
tx_control_frames: 0
tx_pause_frames: 0
tx_underrun_frames: 0
tx_single_collision_frames: 0
tx_multi_collision_frames: 0
tx_deferred_collision_frames: 0
tx_late_collision_frames: 0
tx_excessive_collision_frames: 0
tx_bytes: 886
tx_frames: 5
tx_collisions: 0
Hmmm...
[ 126.682096] nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
vs
[ 179.753888] nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control off
I'm not sure whether "flow control" is relevant...
Regards.
^ permalink raw reply
* [patch net-next v2 0/8] Add support for offloading packet-sampling
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
To: netdev
Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
geert+renesas, stephen, xiyou.wangcong, linux, roopa,
john.fastabend, simon.horman
From: Jiri Pirko <jiri@mellanox.com>
Add the sample tc action, which allows to sample packet matching
a classifier. The sample action peeks randomly packets, duplicates them,
truncates them and adds informative metadata on the packet, for example,
the input interface and the original packet length. The sampled packets
are marked to allow matching them and redirecting them to a specific
collector device.
The sampled packets metadata is packed using ife encapsulation. To do
that, this patch-set extracts ife logics from the tc_ife action into an
independent ife module, and uses that functionality to pack the metadata.
To include all the needed metadata, this patch-set introduces some new
IFE_META tlv types.
In addition, Add the support for offloading the macthall-sample tc command
in the Mellanox mlxsw driver, for ingress qdiscs.
Userspace examples for that code can be found in:
- https://github.com/yotamgi/host-sflow: sflow client that uses the sample
with combination of tap device to sample packets and send to a
centralized sflow collector.
- https://github.com/yotamgi/libife: a library for manipulating ife
packets in userspace. The library is used in the host-sflow program to
parse the sampled packets.
---
v1->v2:
- Change the sampling to be random by default, other than sampling exactly
every n'th packet
- Change to use __be types in the ife module
rfc->v1:
- Change ifindex sampled packets metadatum to in_ifindex and out_ifindex
- Add sequence number metadatum to sampled packets
- Add sampler_id metadatum to sampled packets
- Make the user kernel interface extensible
- Move the sampling helper function to the ife module
- Fix ife header to be safe when CONFIG_NET_IFE is not set
- Made the sampled packets eth_type field mandatory other then optional
- Change the IFE_META* fields to be enum
- Remove the ife_packet_info struct and pass the parameters directly to
the ife_packet_ifo_pack function
- Couple of more styling and cosmetic issues
Yotam Gigi (8):
Introduce ife encapsulation module
act_ife: Change to use ife module
net: ife: Introduce new metadata tlv types
net: ife: Introduce packet info packing method
Introduce sample tc action
tc: sample: Add sequence number and sampler_id fields
mlxsw: reg: add the Monitoring Packet Sampling Configuration Register
mlxsw: packet sample: Add packet sample offloading support
MAINTAINERS | 7 +
drivers/net/ethernet/mellanox/mlxsw/reg.h | 38 ++++
drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 120 +++++++++-
drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 12 +
drivers/net/ethernet/mellanox/mlxsw/trap.h | 1 +
include/net/ife.h | 63 ++++++
include/net/tc_act/tc_ife.h | 3 -
include/net/tc_act/tc_sample.h | 73 +++++++
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/ife.h | 24 ++
include/uapi/linux/tc_act/Kbuild | 1 +
include/uapi/linux/tc_act/tc_ife.h | 10 +-
include/uapi/linux/tc_act/tc_sample.h | 29 +++
net/Kconfig | 1 +
net/Makefile | 1 +
net/ife/Kconfig | 16 ++
net/ife/Makefile | 5 +
net/ife/ife.c | 199 +++++++++++++++++
net/sched/Kconfig | 14 ++
net/sched/Makefile | 1 +
net/sched/act_ife.c | 109 +++-------
net/sched/act_sample.c | 290 +++++++++++++++++++++++++
22 files changed, 921 insertions(+), 97 deletions(-)
create mode 100644 include/net/ife.h
create mode 100644 include/net/tc_act/tc_sample.h
create mode 100644 include/uapi/linux/ife.h
create mode 100644 include/uapi/linux/tc_act/tc_sample.h
create mode 100644 net/ife/Kconfig
create mode 100644 net/ife/Makefile
create mode 100644 net/ife/ife.c
create mode 100644 net/sched/act_sample.c
--
2.7.4
^ permalink raw reply
* [patch net-next v2 1/8] Introduce ife encapsulation module
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
To: netdev
Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
geert+renesas, stephen, xiyou.wangcong, linux, roopa,
john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>
From: Yotam Gigi <yotamg@mellanox.com>
This module is responsible for the ife encapsulation protocol
encode/decode logics. That module can:
- ife_encode: encode skb and reserve space for the ife meta header
- ife_decode: decode skb and extract the meta header size
- ife_tlv_meta_encode - encodes one tlv entry into the reserved ife
header space.
- ife_tlv_meta_decode - decodes one tlv entry from the packet
- ife_tlv_meta_next - advance to the next tlv
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
MAINTAINERS | 7 +++
include/net/ife.h | 52 +++++++++++++++++
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/ife.h | 18 ++++++
net/Kconfig | 1 +
net/Makefile | 1 +
net/ife/Kconfig | 16 ++++++
net/ife/Makefile | 5 ++
net/ife/ife.c | 144 ++++++++++++++++++++++++++++++++++++++++++++++
9 files changed, 245 insertions(+)
create mode 100644 include/net/ife.h
create mode 100644 include/uapi/linux/ife.h
create mode 100644 net/ife/Kconfig
create mode 100644 net/ife/Makefile
create mode 100644 net/ife/ife.c
diff --git a/MAINTAINERS b/MAINTAINERS
index e5c17a9..cc3e640 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6114,6 +6114,13 @@ F: include/net/cfg802154.h
F: include/net/ieee802154_netdev.h
F: Documentation/networking/ieee802154.txt
+IFE PROTOCOL
+M: Yotam Gigi <yotamg@mellanox.com>
+M: Jamal Hadi Salim <jhs@mojatatu.com>
+F: net/ife
+F: include/net/ife.h
+F: include/uapi/linux/ife.h
+
IGORPLUG-USB IR RECEIVER
M: Sean Young <sean@mess.org>
L: linux-media@vger.kernel.org
diff --git a/include/net/ife.h b/include/net/ife.h
new file mode 100644
index 0000000..a75d4e0
--- /dev/null
+++ b/include/net/ife.h
@@ -0,0 +1,52 @@
+#ifndef __NET_IFE_H
+#define __NET_IFE_H
+
+#include <uapi/linux/ife.h>
+#include <linux/etherdevice.h>
+#include <linux/rtnetlink.h>
+#include <linux/module.h>
+#include <uapi/linux/ife.h>
+
+#if IS_ENABLED(CONFIG_NET_IFE)
+
+void *ife_encode(struct sk_buff *skb, u16 metalen);
+void *ife_decode(struct sk_buff *skb, u16 *metalen);
+
+void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen);
+int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
+ const void *dval);
+
+void *ife_tlv_meta_next(void *skbdata);
+
+#else
+
+static inline void *ife_encode(struct sk_buff *skb, u16 metalen)
+{
+ return NULL;
+}
+
+static inline void *ife_decode(struct sk_buff *skb, u16 *metalen)
+{
+ return NULL;
+}
+
+static inline void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen,
+ u16 *totlen)
+{
+ return NULL;
+}
+
+static inline int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
+ const void *dval)
+{
+ return 0;
+}
+
+static inline void *ife_tlv_meta_next(void *skbdata)
+{
+ return NULL;
+}
+
+#endif
+
+#endif /* __NET_IFE_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index cd2be1c..eabf838 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -191,6 +191,7 @@ header-y += if_tun.h
header-y += if_tunnel.h
header-y += if_vlan.h
header-y += if_x25.h
+header-y += ife.h
header-y += igmp.h
header-y += ila.h
header-y += in6.h
diff --git a/include/uapi/linux/ife.h b/include/uapi/linux/ife.h
new file mode 100644
index 0000000..2954da3
--- /dev/null
+++ b/include/uapi/linux/ife.h
@@ -0,0 +1,18 @@
+#ifndef __UAPI_IFE_H
+#define __UAPI_IFE_H
+
+#define IFE_METAHDRLEN 2
+
+enum {
+ IFE_META_SKBMARK = 1,
+ IFE_META_HASHID,
+ IFE_META_PRIO,
+ IFE_META_QMAP,
+ IFE_META_TCINDEX,
+ __IFE_META_MAX
+};
+
+/*Can be overridden at runtime by module option*/
+#define IFE_META_MAX (__IFE_META_MAX - 1)
+
+#endif
diff --git a/net/Kconfig b/net/Kconfig
index 7b6cd34..3cf29b1 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -393,6 +393,7 @@ source "net/9p/Kconfig"
source "net/caif/Kconfig"
source "net/ceph/Kconfig"
source "net/nfc/Kconfig"
+source "net/ife/Kconfig"
config LWTUNNEL
bool "Network light weight tunnels"
diff --git a/net/Makefile b/net/Makefile
index 4cafaa2..4ddc67e 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -69,6 +69,7 @@ obj-$(CONFIG_DNS_RESOLVER) += dns_resolver/
obj-$(CONFIG_CEPH_LIB) += ceph/
obj-$(CONFIG_BATMAN_ADV) += batman-adv/
obj-$(CONFIG_NFC) += nfc/
+obj-$(CONFIG_NET_IFE) += ife/
obj-$(CONFIG_OPENVSWITCH) += openvswitch/
obj-$(CONFIG_VSOCKETS) += vmw_vsock/
obj-$(CONFIG_MPLS) += mpls/
diff --git a/net/ife/Kconfig b/net/ife/Kconfig
new file mode 100644
index 0000000..31e48b6
--- /dev/null
+++ b/net/ife/Kconfig
@@ -0,0 +1,16 @@
+#
+# IFE subsystem configuration
+#
+
+menuconfig NET_IFE
+ depends on NET
+ tristate "Inter-FE based on IETF ForCES InterFE LFB"
+ default n
+ help
+ Say Y here to add support of IFE encapsulation protocol
+ For details refer to netdev01 paper:
+ "Distributing Linux Traffic Control Classifier-Action Subsystem"
+ Authors: Jamal Hadi Salim and Damascene M. Joachimpillai
+
+ To compile this support as a module, choose M here: the module will
+ be called ife.
diff --git a/net/ife/Makefile b/net/ife/Makefile
new file mode 100644
index 0000000..2a90d97
--- /dev/null
+++ b/net/ife/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the IFE encapsulation protocol
+#
+
+obj-$(CONFIG_NET_IFE) += ife.o
diff --git a/net/ife/ife.c b/net/ife/ife.c
new file mode 100644
index 0000000..3642f43
--- /dev/null
+++ b/net/ife/ife.c
@@ -0,0 +1,144 @@
+/*
+ * net/ife/ife.c - Inter-FE protocol based on ForCES WG InterFE LFB
+ * Copyright (c) 2015 Jamal Hadi Salim <jhs@mojatatu.com>
+ * Copyright (c) 2016 Yotam Gigi <yotamg@mellanox.com>
+ *
+ * Refer to: draft-ietf-forces-interfelfb-03 and netdev01 paper:
+ * "Distributing Linux Traffic Control Classifier-Action Subsystem"
+ * Authors: Jamal Hadi Salim and Damascene M. Joachimpillai
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <net/net_namespace.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <linux/etherdevice.h>
+#include <net/ife.h>
+
+struct ifeheadr {
+ __be16 metalen;
+ u8 tlv_data[];
+};
+
+void *ife_encode(struct sk_buff *skb, u16 metalen)
+{
+ /* OUTERHDR:TOTMETALEN:{TLVHDR:Metadatum:TLVHDR..}:ORIGDATA
+ * where ORIGDATA = original ethernet header ...
+ */
+ int hdrm = metalen + IFE_METAHDRLEN;
+ int total_push = hdrm + skb->dev->hard_header_len;
+ struct ifeheadr *ifehdr;
+ struct ethhdr *iethh; /* inner ether header */
+ int skboff = 0;
+ int err;
+
+ err = skb_cow_head(skb, total_push);
+ if (unlikely(err))
+ return NULL;
+
+ iethh = (struct ethhdr *) skb->data;
+
+ __skb_push(skb, total_push);
+ memcpy(skb->data, iethh, skb->dev->hard_header_len);
+ skb_reset_mac_header(skb);
+ skboff += skb->dev->hard_header_len;
+
+ /* total metadata length */
+ ifehdr = (struct ifeheadr *) (skb->data + skboff);
+ metalen += IFE_METAHDRLEN;
+ ifehdr->metalen = htons(metalen);
+
+ return ifehdr->tlv_data;
+}
+EXPORT_SYMBOL_GPL(ife_encode);
+
+void *ife_decode(struct sk_buff *skb, u16 *metalen)
+{
+ struct ifeheadr *ifehdr;
+ int total_pull;
+ u16 ifehdrln;
+
+ ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);
+ ifehdrln = ifehdr->metalen;
+ ifehdrln = ntohs(ifehdrln);
+ total_pull = skb->dev->hard_header_len + ifehdrln;
+
+ if (unlikely(ifehdrln < 2))
+ return NULL;
+
+ if (unlikely(!pskb_may_pull(skb, total_pull)))
+ return NULL;
+
+ skb_set_mac_header(skb, total_pull);
+ __skb_pull(skb, total_pull);
+ *metalen = ifehdrln - IFE_METAHDRLEN;
+
+ return &ifehdr->tlv_data;
+}
+EXPORT_SYMBOL_GPL(ife_decode);
+
+struct meta_tlvhdr {
+ __be16 type;
+ __be16 len;
+};
+
+/* Caller takes care of presenting data in network order
+ */
+void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen)
+{
+ struct meta_tlvhdr *tlv = (struct meta_tlvhdr *) skbdata;
+
+ *dlen = ntohs(tlv->len) - NLA_HDRLEN;
+ *attrtype = ntohs(tlv->type);
+
+ if (totlen)
+ *totlen = nla_total_size(*dlen);
+
+ return skbdata + sizeof(struct meta_tlvhdr);
+}
+EXPORT_SYMBOL_GPL(ife_tlv_meta_decode);
+
+void *ife_tlv_meta_next(void *skbdata)
+{
+ struct meta_tlvhdr *tlv = (struct meta_tlvhdr *) skbdata;
+ u16 tlvlen = tlv->len;
+
+ tlvlen = ntohs(tlvlen);
+ tlvlen = NLA_ALIGN(tlvlen);
+
+ return skbdata + tlvlen;
+}
+EXPORT_SYMBOL_GPL(ife_tlv_meta_next);
+
+/* Caller takes care of presenting data in network order
+ */
+int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void *dval)
+{
+ __be32 *tlv = (u32 *) (skbdata);
+ u16 totlen = nla_total_size(dlen); /*alignment + hdr */
+ char *dptr = (char *) tlv + NLA_HDRLEN;
+ u32 htlv = attrtype << 16 | (dlen + NLA_HDRLEN);
+
+ *tlv = htonl(htlv);
+ memset(dptr, 0, totlen - NLA_HDRLEN);
+ memcpy(dptr, dval, dlen);
+
+ return totlen;
+}
+EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
+
+MODULE_AUTHOR("Jamal Hadi Salim <jhs@mojatatu.com>");
+MODULE_AUTHOR("Yotam Gigi <yotamg@mellanox.com>");
+MODULE_DESCRIPTION("Inter-FE LFB action");
+MODULE_LICENSE("GPL");
--
2.7.4
^ permalink raw reply related
* [patch net-next v2 2/8] act_ife: Change to use ife module
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
To: netdev
Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
geert+renesas, stephen, xiyou.wangcong, linux, roopa,
john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>
From: Yotam Gigi <yotamg@mellanox.com>
Use the encode/decode functionality from the ife module instead of using
implementation inside the act_ife.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
include/net/tc_act/tc_ife.h | 3 -
include/uapi/linux/tc_act/tc_ife.h | 10 +---
net/sched/Kconfig | 1 +
net/sched/act_ife.c | 109 +++++++++++--------------------------
4 files changed, 34 insertions(+), 89 deletions(-)
diff --git a/include/net/tc_act/tc_ife.h b/include/net/tc_act/tc_ife.h
index 9fd2bea0..30ba459 100644
--- a/include/net/tc_act/tc_ife.h
+++ b/include/net/tc_act/tc_ife.h
@@ -6,7 +6,6 @@
#include <linux/rtnetlink.h>
#include <linux/module.h>
-#define IFE_METAHDRLEN 2
struct tcf_ife_info {
struct tc_action common;
u8 eth_dst[ETH_ALEN];
@@ -45,8 +44,6 @@ struct tcf_meta_ops {
int ife_get_meta_u32(struct sk_buff *skb, struct tcf_meta_info *mi);
int ife_get_meta_u16(struct sk_buff *skb, struct tcf_meta_info *mi);
-int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
- const void *dval);
int ife_alloc_meta_u32(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
int ife_alloc_meta_u16(struct tcf_meta_info *mi, void *metaval, gfp_t gfp);
int ife_check_meta_u32(u32 metaval, struct tcf_meta_info *mi);
diff --git a/include/uapi/linux/tc_act/tc_ife.h b/include/uapi/linux/tc_act/tc_ife.h
index cd18360..7c28178 100644
--- a/include/uapi/linux/tc_act/tc_ife.h
+++ b/include/uapi/linux/tc_act/tc_ife.h
@@ -3,6 +3,7 @@
#include <linux/types.h>
#include <linux/pkt_cls.h>
+#include <linux/ife.h>
#define TCA_ACT_IFE 25
/* Flag bits for now just encoding/decoding; mutually exclusive */
@@ -28,13 +29,4 @@ enum {
};
#define TCA_IFE_MAX (__TCA_IFE_MAX - 1)
-#define IFE_META_SKBMARK 1
-#define IFE_META_HASHID 2
-#define IFE_META_PRIO 3
-#define IFE_META_QMAP 4
-#define IFE_META_TCINDEX 5
-/*Can be overridden at runtime by module option*/
-#define __IFE_META_MAX 6
-#define IFE_META_MAX (__IFE_META_MAX - 1)
-
#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 87956a7..24f7cac 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -763,6 +763,7 @@ config NET_ACT_SKBMOD
config NET_ACT_IFE
tristate "Inter-FE action based on IETF ForCES InterFE LFB"
depends on NET_CLS_ACT
+ select NET_IFE
---help---
Say Y here to allow for sourcing and terminating metadata
For details refer to netdev01 paper:
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 95c463c..5c2478a 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -32,6 +32,7 @@
#include <uapi/linux/tc_act/tc_ife.h>
#include <net/tc_act/tc_ife.h>
#include <linux/etherdevice.h>
+#include <net/ife.h>
#define IFE_TAB_MASK 15
@@ -46,23 +47,6 @@ static const struct nla_policy ife_policy[TCA_IFE_MAX + 1] = {
[TCA_IFE_TYPE] = { .type = NLA_U16},
};
-/* Caller takes care of presenting data in network order
-*/
-int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void *dval)
-{
- u32 *tlv = (u32 *)(skbdata);
- u16 totlen = nla_total_size(dlen); /*alignment + hdr */
- char *dptr = (char *)tlv + NLA_HDRLEN;
- u32 htlv = attrtype << 16 | (dlen + NLA_HDRLEN);
-
- *tlv = htonl(htlv);
- memset(dptr, 0, totlen - NLA_HDRLEN);
- memcpy(dptr, dval, dlen);
-
- return totlen;
-}
-EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
-
int ife_encode_meta_u16(u16 metaval, void *skbdata, struct tcf_meta_info *mi)
{
u16 edata = 0;
@@ -637,69 +621,60 @@ int find_decode_metaid(struct sk_buff *skb, struct tcf_ife_info *ife,
return 0;
}
-struct ifeheadr {
- __be16 metalen;
- u8 tlv_data[];
-};
-
-struct meta_tlvhdr {
- __be16 type;
- __be16 len;
-};
-
static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
struct tcf_result *res)
{
struct tcf_ife_info *ife = to_ife(a);
+ u32 at = G_TC_AT(skb->tc_verd);
int action = ife->tcf_action;
- struct ifeheadr *ifehdr = (struct ifeheadr *)skb->data;
- int ifehdrln = (int)ifehdr->metalen;
- struct meta_tlvhdr *tlv = (struct meta_tlvhdr *)(ifehdr->tlv_data);
+ u8 *ifehdr_end;
+ u8 *tlv_data;
+ u16 metalen;
spin_lock(&ife->tcf_lock);
bstats_update(&ife->tcf_bstats, skb);
tcf_lastuse_update(&ife->tcf_tm);
spin_unlock(&ife->tcf_lock);
- ifehdrln = ntohs(ifehdrln);
- if (unlikely(!pskb_may_pull(skb, ifehdrln))) {
+ if (!(at & AT_EGRESS))
+ skb_push(skb, skb->dev->hard_header_len);
+
+ tlv_data = ife_decode(skb, &metalen);
+ if (unlikely(!tlv_data)) {
spin_lock(&ife->tcf_lock);
ife->tcf_qstats.drops++;
spin_unlock(&ife->tcf_lock);
return TC_ACT_SHOT;
}
- skb_set_mac_header(skb, ifehdrln);
- __skb_pull(skb, ifehdrln);
- skb->protocol = eth_type_trans(skb, skb->dev);
- ifehdrln -= IFE_METAHDRLEN;
-
- while (ifehdrln > 0) {
- u8 *tlvdata = (u8 *)tlv;
- u16 mtype = tlv->type;
- u16 mlen = tlv->len;
- u16 alen;
+ ifehdr_end = tlv_data + metalen;
+ for (; tlv_data < ifehdr_end; tlv_data = ife_tlv_meta_next(tlv_data)) {
+ u8 *curr_data;
+ u16 mtype;
+ u16 dlen;
- mtype = ntohs(mtype);
- mlen = ntohs(mlen);
- alen = NLA_ALIGN(mlen);
+ curr_data = ife_tlv_meta_decode(tlv_data, &mtype, &dlen, NULL);
- if (find_decode_metaid(skb, ife, mtype, (mlen - NLA_HDRLEN),
- (void *)(tlvdata + NLA_HDRLEN))) {
+ if (find_decode_metaid(skb, ife, mtype, dlen, curr_data)) {
/* abuse overlimits to count when we receive metadata
* but dont have an ops for it
*/
- pr_info_ratelimited("Unknown metaid %d alnlen %d\n",
- mtype, mlen);
+ pr_info_ratelimited("Unknown metaid %d dlen %d\n",
+ mtype, dlen);
ife->tcf_qstats.overlimits++;
}
+ }
- tlvdata += alen;
- ifehdrln -= alen;
- tlv = (struct meta_tlvhdr *)tlvdata;
+ if (WARN_ON(tlv_data != ifehdr_end)) {
+ spin_lock(&ife->tcf_lock);
+ ife->tcf_qstats.drops++;
+ spin_unlock(&ife->tcf_lock);
+ return TC_ACT_SHOT;
}
+ skb->protocol = eth_type_trans(skb, skb->dev);
skb_reset_network_header(skb);
+
return action;
}
@@ -727,7 +702,6 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
struct tcf_ife_info *ife = to_ife(a);
int action = ife->tcf_action;
struct ethhdr *oethh; /* outer ether header */
- struct ethhdr *iethh; /* inner eth header */
struct tcf_meta_info *e;
/*
OUTERHDR:TOTMETALEN:{TLVHDR:Metadatum:TLVHDR..}:ORIGDATA
@@ -735,10 +709,11 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
*/
u16 metalen = ife_get_sz(skb, ife);
int hdrm = metalen + skb->dev->hard_header_len + IFE_METAHDRLEN;
- unsigned int skboff = skb->dev->hard_header_len;
+ unsigned int skboff = 0;
u32 at = G_TC_AT(skb->tc_verd);
int new_len = skb->len + hdrm;
bool exceed_mtu = false;
+ void *ife_meta;
int err;
if (at & AT_EGRESS) {
@@ -766,27 +741,10 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
return TC_ACT_SHOT;
}
- err = skb_cow_head(skb, hdrm);
- if (unlikely(err)) {
- ife->tcf_qstats.drops++;
- spin_unlock(&ife->tcf_lock);
- return TC_ACT_SHOT;
- }
-
if (!(at & AT_EGRESS))
skb_push(skb, skb->dev->hard_header_len);
- iethh = (struct ethhdr *)skb->data;
- __skb_push(skb, hdrm);
- memcpy(skb->data, iethh, skb->mac_len);
- skb_reset_mac_header(skb);
- oethh = eth_hdr(skb);
-
- /*total metadata length */
- metalen += IFE_METAHDRLEN;
- metalen = htons(metalen);
- memcpy((skb->data + skboff), &metalen, IFE_METAHDRLEN);
- skboff += IFE_METAHDRLEN;
+ ife_meta = ife_encode(skb, metalen);
/* XXX: we dont have a clever way of telling encode to
* not repeat some of the computations that are done by
@@ -794,7 +752,7 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
*/
list_for_each_entry(e, &ife->metalist, metalist) {
if (e->ops->encode) {
- err = e->ops->encode(skb, (void *)(skb->data + skboff),
+ err = e->ops->encode(skb, (void *)(ife_meta + skboff),
e);
}
if (err < 0) {
@@ -805,15 +763,12 @@ static int tcf_ife_encode(struct sk_buff *skb, const struct tc_action *a,
}
skboff += err;
}
+ oethh = (struct ethhdr *)skb->data;
if (!is_zero_ether_addr(ife->eth_src))
ether_addr_copy(oethh->h_source, ife->eth_src);
- else
- ether_addr_copy(oethh->h_source, iethh->h_source);
if (!is_zero_ether_addr(ife->eth_dst))
ether_addr_copy(oethh->h_dest, ife->eth_dst);
- else
- ether_addr_copy(oethh->h_dest, iethh->h_dest);
oethh->h_proto = htons(ife->eth_type);
if (!(at & AT_EGRESS))
--
2.7.4
^ permalink raw reply related
* [patch net-next v2 3/8] net: ife: Introduce new metadata tlv types
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
To: netdev
Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
geert+renesas, stephen, xiyou.wangcong, linux, roopa,
john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>
From: Yotam Gigi <yotamg@mellanox.com>
IFE_META_IN_IFINDEX: Allow to pass input ifindex value as part of the
ife metadata
IFE_META_OUT_IFINDEX: Allow to pass output ifindex value as part of the
ife metadata
IFE_META_ORIG_SIZE: Allow to pass the original packet size as part of
the ife metadata. Can be used in case that the packet is truncated
IFE_META_SIZE: Allow to pass the size of the encapsulated packet as
part of the ife metadata
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
include/uapi/linux/ife.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/uapi/linux/ife.h b/include/uapi/linux/ife.h
index 2954da3..ebdd785 100644
--- a/include/uapi/linux/ife.h
+++ b/include/uapi/linux/ife.h
@@ -9,6 +9,10 @@ enum {
IFE_META_PRIO,
IFE_META_QMAP,
IFE_META_TCINDEX,
+ IFE_META_IN_IFINDEX,
+ IFE_META_OUT_IFINDEX,
+ IFE_META_ORIGSIZE,
+ IFE_META_SIZE,
__IFE_META_MAX
};
--
2.7.4
^ permalink raw reply related
* [patch net-next v2 4/8] net: ife: Introduce packet info packing method
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
To: netdev
Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
geert+renesas, stephen, xiyou.wangcong, linux, roopa,
john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>
From: Yotam Gigi <yotamg@mellanox.com>
Add the ife_packet_info_pack function which encodes some informative
metadata into a packet using the ife encapsulation. The informative
metadata consists of:
- The packet input/output ifindices
- The packet original size, in case it was truncated
This method is useful for anyone that wants to pass informative metadata
on a packet alongside with the packet itself. A good usage can be packet
sampling.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
include/net/ife.h | 10 ++++++++++
net/ife/ife.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+)
diff --git a/include/net/ife.h b/include/net/ife.h
index a75d4e0..5fbcfb3 100644
--- a/include/net/ife.h
+++ b/include/net/ife.h
@@ -18,6 +18,9 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
void *ife_tlv_meta_next(void *skbdata);
+struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
+ int in_ifindex, int out_ifindex);
+
#else
static inline void *ife_encode(struct sk_buff *skb, u16 metalen)
@@ -47,6 +50,13 @@ static inline void *ife_tlv_meta_next(void *skbdata)
return NULL;
}
+static inline struct ethhdr *
+ife_packet_info_pack(struct sk_buff *skb, int orig_size, int in_ifindex,
+ int out_ifindex)
+{
+ return NULL;
+}
+
#endif
#endif /* __NET_IFE_H */
diff --git a/net/ife/ife.c b/net/ife/ife.c
index 3642f43..f0b683e 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -138,6 +138,51 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void *dval)
}
EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
+struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
+ int in_ifindex, int out_ifindex)
+{
+ int curr_size;
+ void *ifetlv;
+ u16 metalen;
+
+ curr_size = skb->len;
+
+ metalen = nla_total_size(sizeof(orig_size)) +
+ nla_total_size(sizeof(curr_size));
+
+ if (in_ifindex)
+ metalen += nla_total_size(sizeof(in_ifindex));
+ if (out_ifindex)
+ metalen += nla_total_size(sizeof(out_ifindex));
+
+ in_ifindex = htonl(in_ifindex);
+ out_ifindex = htonl(out_ifindex);
+ orig_size = htonl(orig_size);
+ curr_size = htonl(curr_size);
+
+ ifetlv = ife_encode(skb, metalen);
+ if (!ifetlv)
+ return NULL;
+
+ if (in_ifindex)
+ ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_IN_IFINDEX,
+ sizeof(in_ifindex), &in_ifindex);
+
+ if (out_ifindex)
+ ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_OUT_IFINDEX,
+ sizeof(out_ifindex),
+ &out_ifindex);
+
+ ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_ORIGSIZE,
+ sizeof(orig_size), &orig_size);
+
+ ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_SIZE,
+ sizeof(curr_size), &curr_size);
+
+ return (struct ethhdr *) skb->data;
+}
+EXPORT_SYMBOL(ife_packet_info_pack);
+
MODULE_AUTHOR("Jamal Hadi Salim <jhs@mojatatu.com>");
MODULE_AUTHOR("Yotam Gigi <yotamg@mellanox.com>");
MODULE_DESCRIPTION("Inter-FE LFB action");
--
2.7.4
^ permalink raw reply related
* [patch net-next v2 5/8] Introduce sample tc action
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
To: netdev
Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
geert+renesas, stephen, xiyou.wangcong, linux, roopa,
john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>
From: Yotam Gigi <yotamg@mellanox.com>
This action allow the user to sample traffic matched by tc classifier.
The sampling consists of choosing packets randomly, truncating them,
adding some informative metadata regarding the interface and the original
packet size and mark them with specific mark, to allow further tc rules to
match and process. The marked sample packets are then injected into the
device ingress qdisc using netif_receive_skb.
The packets metadata is packed using the ife encapsulation protocol, and
the outer packet's ethernet dest, source and eth_type, along with the
rate, mark and the optional truncation size can be configured from
userspace.
Example:
To sample ingress traffic from interface eth1, and redirect the sampled
the sampled packets to interface dummy0, one may use the commands:
tc qdisc add dev eth1 handle ffff: ingress
tc filter add dev eth1 parent ffff: \
matchall action sample rate 12 mark 17
tc filter add parent ffff: dev eth1 protocol all \
u32 match mark 17 0xff \
action mirred egress redirect dev dummy0
Where the first command adds an ingress qdisc and the second starts
sampling randomly with an average of one sampled packet per 12 packets on
dev eth1 and marks the sampled packets with 17. The third command catches
the sampled packets, which are marked with 17, and redirects them to dev
dummy0.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
include/net/tc_act/tc_sample.h | 64 ++++++++
include/uapi/linux/tc_act/Kbuild | 1 +
include/uapi/linux/tc_act/tc_sample.h | 29 ++++
net/sched/Kconfig | 13 ++
net/sched/Makefile | 1 +
net/sched/act_sample.c | 282 ++++++++++++++++++++++++++++++++++
6 files changed, 390 insertions(+)
create mode 100644 include/net/tc_act/tc_sample.h
create mode 100644 include/uapi/linux/tc_act/tc_sample.h
create mode 100644 net/sched/act_sample.c
diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
new file mode 100644
index 0000000..af99eb8
--- /dev/null
+++ b/include/net/tc_act/tc_sample.h
@@ -0,0 +1,64 @@
+#ifndef __NET_TC_SAMPLE_H
+#define __NET_TC_SAMPLE_H
+
+#include <net/act_api.h>
+#include <linux/tc_act/tc_sample.h>
+
+struct tcf_sample {
+ struct tc_action common;
+ u32 rate;
+ u32 mark;
+ bool truncate;
+ u32 trunc_size;
+ u8 eth_dst[ETH_ALEN];
+ u8 eth_src[ETH_ALEN];
+ u16 eth_type;
+ struct list_head tcfm_list;
+};
+#define to_sample(a) ((struct tcf_sample *)a)
+
+static inline bool is_tcf_sample(const struct tc_action *a)
+{
+#ifdef CONFIG_NET_CLS_ACT
+ return a->ops && a->ops->type == TCA_ACT_SAMPLE;
+#else
+ return false;
+#endif
+}
+
+static inline __u32 tcf_sample_mark(const struct tc_action *a)
+{
+ return to_sample(a)->mark;
+}
+
+static inline __u32 tcf_sample_rate(const struct tc_action *a)
+{
+ return to_sample(a)->rate;
+}
+
+static inline bool tcf_sample_truncate(const struct tc_action *a)
+{
+ return to_sample(a)->truncate;
+}
+
+static inline int tcf_sample_trunc_size(const struct tc_action *a)
+{
+ return to_sample(a)->trunc_size;
+}
+
+static inline u16 tcf_sample_eth_type(const struct tc_action *a)
+{
+ return to_sample(a)->eth_type;
+}
+
+static inline void tcf_sample_eth_dst_addr(const struct tc_action *a, u8 *dst)
+{
+ ether_addr_copy(dst, to_sample(a)->eth_dst);
+}
+
+static inline void tcf_sample_eth_src_addr(const struct tc_action *a, u8 *src)
+{
+ ether_addr_copy(src, to_sample(a)->eth_src);
+}
+
+#endif /* __NET_TC_SAMPLE_H */
diff --git a/include/uapi/linux/tc_act/Kbuild b/include/uapi/linux/tc_act/Kbuild
index e3969bd..6c6b8d6 100644
--- a/include/uapi/linux/tc_act/Kbuild
+++ b/include/uapi/linux/tc_act/Kbuild
@@ -4,6 +4,7 @@ header-y += tc_defact.h
header-y += tc_gact.h
header-y += tc_ipt.h
header-y += tc_mirred.h
+header-y += tc_sample.h
header-y += tc_nat.h
header-y += tc_pedit.h
header-y += tc_skbedit.h
diff --git a/include/uapi/linux/tc_act/tc_sample.h b/include/uapi/linux/tc_act/tc_sample.h
new file mode 100644
index 0000000..44ee9d0
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_sample.h
@@ -0,0 +1,29 @@
+#ifndef __LINUX_TC_SAMPLE_H
+#define __LINUX_TC_SAMPLE_H
+
+#include <linux/types.h>
+#include <linux/pkt_cls.h>
+#include <linux/if_ether.h>
+
+#define TCA_ACT_SAMPLE 26
+
+struct tc_sample {
+ tc_gen;
+};
+
+enum {
+ TCA_SAMPLE_UNSPEC,
+ TCA_SAMPLE_PARMS,
+ TCA_SAMPLE_TM,
+ TCA_SAMPLE_RATE,
+ TCA_SAMPLE_MARK,
+ TCA_SAMPLE_TRUNC_SIZE,
+ TCA_SAMPLE_ETH_DST,
+ TCA_SAMPLE_ETH_SRC,
+ TCA_SAMPLE_ETH_TYPE,
+ TCA_SAMPLE_PAD,
+ __TCA_SAMPLE_MAX
+};
+#define TCA_SAMPLE_MAX (__TCA_SAMPLE_MAX - 1)
+
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 24f7cac..c54ea6b 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -650,6 +650,19 @@ config NET_ACT_MIRRED
To compile this code as a module, choose M here: the
module will be called act_mirred.
+config NET_ACT_SAMPLE
+ tristate "Traffic Sampling"
+ depends on NET_CLS_ACT
+ select NET_IFE
+ ---help---
+ Say Y here to allow packet sampling tc action. The packet sample
+ action consists of statistically duplicating packets, truncating them
+ and adding descriptive metadata to them. The duplicated packets are
+ then marked to allow further processing using tc.
+
+ To compile this code as a module, choose M here: the
+ module will be called act_sample.
+
config NET_ACT_IPT
tristate "IPtables targets"
depends on NET_CLS_ACT && NETFILTER && IP_NF_IPTABLES
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 4bdda36..7b915d2 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_NET_CLS_ACT) += act_api.o
obj-$(CONFIG_NET_ACT_POLICE) += act_police.o
obj-$(CONFIG_NET_ACT_GACT) += act_gact.o
obj-$(CONFIG_NET_ACT_MIRRED) += act_mirred.o
+obj-$(CONFIG_NET_ACT_SAMPLE) += act_sample.o
obj-$(CONFIG_NET_ACT_IPT) += act_ipt.o
obj-$(CONFIG_NET_ACT_NAT) += act_nat.o
obj-$(CONFIG_NET_ACT_PEDIT) += act_pedit.o
diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
new file mode 100644
index 0000000..39f1d94c
--- /dev/null
+++ b/net/sched/act_sample.c
@@ -0,0 +1,282 @@
+/*
+ * net/sched/act_sample.c - Packet samplig tc action
+ * Copyright (c) 2016 Yotam Gigi <yotamg@mellanox.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <linux/rtnetlink.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/gfp.h>
+#include <net/net_namespace.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <linux/tc_act/tc_sample.h>
+#include <net/tc_act/tc_sample.h>
+#include <net/ife.h>
+
+#include <linux/if_arp.h>
+
+#define SAMPLE_TAB_MASK 7
+static int sample_net_id;
+static struct tc_action_ops act_sample_ops;
+
+static const struct nla_policy sample_policy[TCA_SAMPLE_MAX + 1] = {
+ [TCA_SAMPLE_PARMS] = { .len = sizeof(struct tc_sample) },
+};
+
+static int tcf_sample_init(struct net *net, struct nlattr *nla,
+ struct nlattr *est, struct tc_action **a, int ovr,
+ int bind)
+{
+ struct tc_action_net *tn = net_generic(net, sample_net_id);
+ struct nlattr *tb[TCA_SAMPLE_MAX + 1];
+ struct tc_sample *parm;
+ struct tcf_sample *s;
+ int ret;
+ bool exists = false;
+
+ if (!nla)
+ return -EINVAL;
+ ret = nla_parse_nested(tb, TCA_SAMPLE_MAX, nla, sample_policy);
+ if (ret < 0)
+ return ret;
+ if (!tb[TCA_SAMPLE_PARMS] || !tb[TCA_SAMPLE_RATE] ||
+ !tb[TCA_SAMPLE_MARK] || !tb[TCA_SAMPLE_ETH_TYPE])
+ return -EINVAL;
+
+ parm = nla_data(tb[TCA_SAMPLE_PARMS]);
+
+ exists = tcf_hash_check(tn, parm->index, a, bind);
+ if (exists && bind)
+ return 0;
+
+ if (!exists) {
+ ret = tcf_hash_create(tn, parm->index, est, a,
+ &act_sample_ops, bind, false);
+ if (ret)
+ return ret;
+ ret = ACT_P_CREATED;
+ } else {
+ tcf_hash_release(*a, bind);
+ if (!ovr)
+ return -EEXIST;
+ }
+ s = to_sample(*a);
+
+ ASSERT_RTNL();
+ s->tcf_action = parm->action;
+ s->rate = nla_get_u32(tb[TCA_SAMPLE_RATE]);
+ s->mark = nla_get_u32(tb[TCA_SAMPLE_MARK]);
+ s->eth_type = nla_get_u16(tb[TCA_SAMPLE_ETH_TYPE]);
+
+ s->truncate = tb[TCA_SAMPLE_TRUNC_SIZE];
+ if (tb[TCA_SAMPLE_TRUNC_SIZE])
+ s->trunc_size = nla_get_u32(tb[TCA_SAMPLE_TRUNC_SIZE]);
+
+ if (tb[TCA_SAMPLE_ETH_SRC])
+ ether_addr_copy(s->eth_src, nla_data(tb[TCA_SAMPLE_ETH_SRC]));
+ else
+ eth_zero_addr(s->eth_src);
+ if (tb[TCA_SAMPLE_ETH_DST])
+ ether_addr_copy(s->eth_dst, nla_data(tb[TCA_SAMPLE_ETH_DST]));
+ else
+ eth_zero_addr(s->eth_dst);
+
+ if (ret == ACT_P_CREATED)
+ tcf_hash_insert(tn, *a);
+ return ret;
+}
+
+static bool dev_ok_push(struct net_device *dev)
+{
+ switch (dev->type) {
+ case ARPHRD_TUNNEL:
+ case ARPHRD_TUNNEL6:
+ case ARPHRD_SIT:
+ case ARPHRD_IPGRE:
+ case ARPHRD_VOID:
+ case ARPHRD_NONE:
+ return false;
+ default:
+ return true;
+ }
+}
+
+static int tcf_sample(struct sk_buff *skb, const struct tc_action *a,
+ struct tcf_result *res)
+{
+ struct tcf_sample *s = to_sample(a);
+ static struct ethhdr *ethhdr;
+ struct sk_buff *skb2;
+ int orig_size;
+ int retval;
+ u32 at;
+
+ tcf_lastuse_update(&s->tcf_tm);
+ bstats_cpu_update(this_cpu_ptr(s->common.cpu_bstats), skb);
+
+ rcu_read_lock();
+ retval = READ_ONCE(s->tcf_action);
+
+ /* randomly sample packets according to rate */
+ if (prandom_u32() % s->rate == 0) {
+ skb2 = skb_copy(skb, GFP_ATOMIC);
+ if (!skb2)
+ goto out;
+
+ if (s->truncate)
+ skb_trim(skb2, s->trunc_size);
+
+ at = G_TC_AT(skb->tc_verd);
+ skb2->mac_len = skb->mac_len;
+
+ /* on ingress, the mac header gets poped, so push it back */
+ if (!(at & AT_EGRESS) && dev_ok_push(skb->dev))
+ skb_push(skb2, skb2->mac_len);
+
+ orig_size = skb->len + skb->dev->hard_header_len;
+ ethhdr = ife_packet_info_pack(skb2, orig_size,
+ skb->dev->ifindex, 0);
+ if (!ethhdr)
+ goto out;
+
+ ethhdr->h_proto = htons(s->eth_type);
+ if (!is_zero_ether_addr(s->eth_src))
+ ether_addr_copy(ethhdr->h_source, s->eth_src);
+ if (!is_zero_ether_addr(s->eth_dst))
+ ether_addr_copy(ethhdr->h_dest, s->eth_dst);
+
+ skb2->mark = s->mark;
+ netif_receive_skb(skb2);
+
+ /* mirror is always swallowed */
+ skb2->tc_verd = SET_TC_FROM(skb2->tc_verd, at);
+ }
+out:
+ rcu_read_unlock();
+
+ return retval;
+}
+
+static int tcf_sample_dump(struct sk_buff *skb, struct tc_action *a,
+ int bind, int ref)
+{
+ unsigned char *b = skb_tail_pointer(skb);
+ struct tcf_sample *s = to_sample(a);
+ struct tc_sample opt = {
+ .index = s->tcf_index,
+ .action = s->tcf_action,
+ .refcnt = s->tcf_refcnt - ref,
+ .bindcnt = s->tcf_bindcnt - bind,
+ };
+ struct tcf_t t;
+
+ if (nla_put(skb, TCA_SAMPLE_PARMS, sizeof(opt), &opt))
+ goto nla_put_failure;
+
+ tcf_tm_dump(&t, &s->tcf_tm);
+ if (nla_put_64bit(skb, TCA_SAMPLE_TM, sizeof(t), &t, TCA_SAMPLE_PAD))
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, TCA_SAMPLE_RATE, s->rate))
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, TCA_SAMPLE_MARK, s->mark))
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, TCA_SAMPLE_ETH_TYPE, s->eth_type))
+ goto nla_put_failure;
+
+ if (s->truncate)
+ if (nla_put_u32(skb, TCA_SAMPLE_TRUNC_SIZE, s->trunc_size))
+ goto nla_put_failure;
+
+ if (!is_zero_ether_addr(s->eth_src))
+ if (nla_put(skb, TCA_SAMPLE_ETH_SRC, ETH_ALEN, s->eth_src))
+ goto nla_put_failure;
+
+ if (!is_zero_ether_addr(s->eth_dst))
+ if (nla_put(skb, TCA_SAMPLE_ETH_DST, ETH_ALEN, s->eth_dst))
+ goto nla_put_failure;
+
+ return skb->len;
+
+nla_put_failure:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static int tcf_sample_walker(struct net *net, struct sk_buff *skb,
+ struct netlink_callback *cb, int type,
+ const struct tc_action_ops *ops)
+{
+ struct tc_action_net *tn = net_generic(net, sample_net_id);
+
+ return tcf_generic_walker(tn, skb, cb, type, ops);
+}
+
+static int tcf_sample_search(struct net *net, struct tc_action **a, u32 index)
+{
+ struct tc_action_net *tn = net_generic(net, sample_net_id);
+
+ return tcf_hash_search(tn, a, index);
+}
+
+static struct tc_action_ops act_sample_ops = {
+ .kind = "sample",
+ .type = TCA_ACT_SAMPLE,
+ .owner = THIS_MODULE,
+ .act = tcf_sample,
+ .dump = tcf_sample_dump,
+ .init = tcf_sample_init,
+ .walk = tcf_sample_walker,
+ .lookup = tcf_sample_search,
+ .size = sizeof(struct tcf_sample),
+};
+
+static __net_init int sample_init_net(struct net *net)
+{
+ struct tc_action_net *tn = net_generic(net, sample_net_id);
+
+ return tc_action_net_init(tn, &act_sample_ops, SAMPLE_TAB_MASK);
+}
+
+static void __net_exit sample_exit_net(struct net *net)
+{
+ struct tc_action_net *tn = net_generic(net, sample_net_id);
+
+ tc_action_net_exit(tn);
+}
+
+static struct pernet_operations sample_net_ops = {
+ .init = sample_init_net,
+ .exit = sample_exit_net,
+ .id = &sample_net_id,
+ .size = sizeof(struct tc_action_net),
+};
+
+static int __init sample_init_module(void)
+{
+ return tcf_register_action(&act_sample_ops, &sample_net_ops);
+}
+
+static void __exit sample_cleanup_module(void)
+{
+ tcf_unregister_action(&act_sample_ops, &sample_net_ops);
+}
+
+module_init(sample_init_module);
+module_exit(sample_cleanup_module);
+
+MODULE_AUTHOR("Yotam Gigi <yotamg@mellanox.com>");
+MODULE_DESCRIPTION("Packet sampling action");
+MODULE_LICENSE("GPL v2");
--
2.7.4
^ permalink raw reply related
* [patch net-next v2 6/8] tc: sample: Add sequence number and sampler_id fields
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
To: netdev
Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
geert+renesas, stephen, xiyou.wangcong, linux, roopa,
john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>
From: Yotam Gigi <yotamg@mellanox.com>
Add support for sequence numbering to the sampled packets. The sequence
number counts the sampled packets and enables to check whether sampled
packets have been dropped. The sequence state is kept per sample action
instance.
Add unique id (sampler_id) for each sample action instance, and add that
sampler_id to the packet. This way, a user application can verify sample
drops, by tracking the sequence numbers from each sampler id.
Because of the fact that hw and sw can be triggered by the same tc action
(if the user does not specify both skip_sw and skip_hw), each should
have a different sampler_id, as each generates a different sequence
numbering. To allow that, the macros TC_SAMPLE_HW_ID and TC_SAMPLE_SW_ID,
which, given a sampler_id, calculates the hw sample_id and sw sample_id.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
include/net/ife.h | 5 +++--
include/net/tc_act/tc_sample.h | 9 +++++++++
include/uapi/linux/ife.h | 2 ++
net/ife/ife.c | 14 ++++++++++++--
net/sched/act_sample.c | 10 +++++++++-
5 files changed, 35 insertions(+), 5 deletions(-)
diff --git a/include/net/ife.h b/include/net/ife.h
index 5fbcfb3..3ae4f0d 100644
--- a/include/net/ife.h
+++ b/include/net/ife.h
@@ -19,7 +19,8 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
void *ife_tlv_meta_next(void *skbdata);
struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
- int in_ifindex, int out_ifindex);
+ int in_ifindex, int out_ifindex,
+ u32 sampler_id, u32 seq);
#else
@@ -52,7 +53,7 @@ static inline void *ife_tlv_meta_next(void *skbdata)
static inline struct ethhdr *
ife_packet_info_pack(struct sk_buff *skb, int orig_size, int in_ifindex,
- int out_ifindex)
+ int out_ifindex, u32 sampler_id, u32 seq)
{
return NULL;
}
diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
index af99eb8..07638b7 100644
--- a/include/net/tc_act/tc_sample.h
+++ b/include/net/tc_act/tc_sample.h
@@ -13,9 +13,13 @@ struct tcf_sample {
u8 eth_dst[ETH_ALEN];
u8 eth_src[ETH_ALEN];
u16 eth_type;
+ u32 sampler_id;
+ u32 seq;
struct list_head tcfm_list;
};
#define to_sample(a) ((struct tcf_sample *)a)
+#define TC_SAMPLE_SW_ID(id) ((id << 1) + 0)
+#define TC_SAMPLE_HW_ID(id) ((id << 1) + 1)
static inline bool is_tcf_sample(const struct tc_action *a)
{
@@ -61,4 +65,9 @@ static inline void tcf_sample_eth_src_addr(const struct tc_action *a, u8 *src)
ether_addr_copy(src, to_sample(a)->eth_src);
}
+static inline u32 tcf_sample_sampler_id(const struct tc_action *a)
+{
+ return to_sample(a)->sampler_id;
+}
+
#endif /* __NET_TC_SAMPLE_H */
diff --git a/include/uapi/linux/ife.h b/include/uapi/linux/ife.h
index ebdd785..2007818 100644
--- a/include/uapi/linux/ife.h
+++ b/include/uapi/linux/ife.h
@@ -13,6 +13,8 @@ enum {
IFE_META_OUT_IFINDEX,
IFE_META_ORIGSIZE,
IFE_META_SIZE,
+ IFE_META_SAMPLER_ID,
+ IFE_META_SEQ,
__IFE_META_MAX
};
diff --git a/net/ife/ife.c b/net/ife/ife.c
index f0b683e..e1b7452d 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -139,7 +139,8 @@ int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen, const void *dval)
EXPORT_SYMBOL_GPL(ife_tlv_meta_encode);
struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
- int in_ifindex, int out_ifindex)
+ int in_ifindex, int out_ifindex,
+ u32 sampler_id, u32 seq)
{
int curr_size;
void *ifetlv;
@@ -148,7 +149,9 @@ struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
curr_size = skb->len;
metalen = nla_total_size(sizeof(orig_size)) +
- nla_total_size(sizeof(curr_size));
+ nla_total_size(sizeof(curr_size)) +
+ nla_total_size(sizeof(sampler_id)) +
+ nla_total_size(sizeof(seq));
if (in_ifindex)
metalen += nla_total_size(sizeof(in_ifindex));
@@ -159,6 +162,8 @@ struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
out_ifindex = htonl(out_ifindex);
orig_size = htonl(orig_size);
curr_size = htonl(curr_size);
+ sampler_id = htonl(sampler_id);
+ seq = htonl(seq);
ifetlv = ife_encode(skb, metalen);
if (!ifetlv)
@@ -179,6 +184,11 @@ struct ethhdr *ife_packet_info_pack(struct sk_buff *skb, int orig_size,
ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_SIZE,
sizeof(curr_size), &curr_size);
+ ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_SAMPLER_ID,
+ sizeof(sampler_id), &sampler_id);
+
+ ifetlv += ife_tlv_meta_encode(ifetlv, IFE_META_SEQ, sizeof(seq), &seq);
+
return (struct ethhdr *) skb->data;
}
EXPORT_SYMBOL(ife_packet_info_pack);
diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index 39f1d94c..3fe9489 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -27,6 +27,7 @@
#define SAMPLE_TAB_MASK 7
static int sample_net_id;
+static u32 samplers;
static struct tc_action_ops act_sample_ops;
static const struct nla_policy sample_policy[TCA_SAMPLE_MAX + 1] = {
@@ -77,6 +78,7 @@ static int tcf_sample_init(struct net *net, struct nlattr *nla,
s->rate = nla_get_u32(tb[TCA_SAMPLE_RATE]);
s->mark = nla_get_u32(tb[TCA_SAMPLE_MARK]);
s->eth_type = nla_get_u16(tb[TCA_SAMPLE_ETH_TYPE]);
+ s->sampler_id = samplers++;
s->truncate = tb[TCA_SAMPLE_TRUNC_SIZE];
if (tb[TCA_SAMPLE_TRUNC_SIZE])
@@ -117,6 +119,7 @@ static int tcf_sample(struct sk_buff *skb, const struct tc_action *a,
struct tcf_sample *s = to_sample(a);
static struct ethhdr *ethhdr;
struct sk_buff *skb2;
+ u32 sampler_id;
int orig_size;
int retval;
u32 at;
@@ -144,8 +147,12 @@ static int tcf_sample(struct sk_buff *skb, const struct tc_action *a,
skb_push(skb2, skb2->mac_len);
orig_size = skb->len + skb->dev->hard_header_len;
+ sampler_id = TC_SAMPLE_SW_ID(s->sampler_id);
+
ethhdr = ife_packet_info_pack(skb2, orig_size,
- skb->dev->ifindex, 0);
+ skb->dev->ifindex, 0,
+ sampler_id, s->seq++);
+
if (!ethhdr)
goto out;
@@ -266,6 +273,7 @@ static struct pernet_operations sample_net_ops = {
static int __init sample_init_module(void)
{
+ samplers = 0;
return tcf_register_action(&act_sample_ops, &sample_net_ops);
}
--
2.7.4
^ permalink raw reply related
* [patch net-next v2 7/8] mlxsw: reg: add the Monitoring Packet Sampling Configuration Register
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
To: netdev
Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
geert+renesas, stephen, xiyou.wangcong, linux, roopa,
john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>
From: Yotam Gigi <yotamg@mellanox.com>
The MPSC register allows to configure ingress packet sampling on specific
port of the mlxsw device. The sampled packets are then trapped via
PKT_SAMPLE trap.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
drivers/net/ethernet/mellanox/mlxsw/reg.h | 38 +++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index edad7cb..b5ab81a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -4771,6 +4771,44 @@ static inline void mlxsw_reg_mlcr_pack(char *payload, u8 local_port,
MLXSW_REG_MLCR_DURATION_MAX : 0);
}
+/* MPSC - Monitoring Packet Sampling Configuration Register
+ * --------------------------------------------------------
+ * MPSC Register is used to configure the Packet Sampling mechanism.
+ */
+#define MLXSW_REG_MPSC_ID 0x9080
+#define MLXSW_REG_MPSC_LEN 0x14
+
+MLXSW_REG_DEFINE(mpsc, MLXSW_REG_MPSC_ID, MLXSW_REG_MPSC_LEN);
+
+/* reg_mpsc_local_port
+ * Local port number
+ * Not supported for CPU port
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, mpsc, local_port, 0x00, 16, 8);
+
+/* reg_mpsc_e
+ * Enable sampling on port local_port
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, mpsc, e, 0x04, 30, 1);
+
+/* reg_mpsc_rate
+ * Sampling rate = 1 out of rate packets (with randomization around
+ * the point). Valid values are: 1 to 3.5*10^9
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, mpsc, rate, 0x08, 0, 32);
+
+static inline void mlxsw_reg_mpsc_pack(char *payload, u8 local_port, bool e,
+ u32 rate)
+{
+ MLXSW_REG_ZERO(mpsc, payload);
+ mlxsw_reg_mpsc_local_port_set(payload, local_port);
+ mlxsw_reg_mpsc_e_set(payload, e);
+ mlxsw_reg_mpsc_rate_set(payload, rate);
+}
+
/* SBPR - Shared Buffer Pools Register
* -----------------------------------
* The SBPR configures and retrieves the shared buffer pools and configuration.
--
2.7.4
^ permalink raw reply related
* [patch net-next v2 8/8] mlxsw: packet sample: Add packet sample offloading support
From: Jiri Pirko @ 2016-11-14 15:00 UTC (permalink / raw)
To: netdev
Cc: davem, yotamg, idosch, eladr, nogahf, ogerlitz, jhs,
geert+renesas, stephen, xiyou.wangcong, linux, roopa,
john.fastabend, simon.horman
In-Reply-To: <1479135638-3580-1-git-send-email-jiri@resnulli.us>
From: Yotam Gigi <yotamg@mellanox.com>
Using the MPSC regiter, add the functions that configure port packets
sampling in hardware and the necessary datatypes in the mlxsw_sp_port
struct. In addition, add the necessary trap for sampled packets and
integrate with matchall offloading to allow offloading of the sample tc
action.
The current offload support is for the tc command:
tc filter add dev <DEV> parent ffff: \
matchall \
action sample rate <RATE> mark <MARK> [trunc <SIZE>] \
[src <SADDR>] [dst <DADDR>] [type <TYPE>]
Where only ingress qdiscs are supported, and only a combination of
matchall classifier and sample action will lead to activating hardware
packet sampling.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 120 +++++++++++++++++++++++--
drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 12 +++
drivers/net/ethernet/mellanox/mlxsw/trap.h | 1 +
3 files changed, 125 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index a5433e4..6cbf67c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -57,6 +57,8 @@
#include <net/pkt_cls.h>
#include <net/tc_act/tc_mirred.h>
#include <net/netevent.h>
+#include <net/tc_act/tc_sample.h>
+#include <net/ife.h>
#include "spectrum.h"
#include "pci.h"
@@ -467,6 +469,16 @@ static void mlxsw_sp_span_mirror_remove(struct mlxsw_sp_port *from,
mlxsw_sp_span_inspected_port_unbind(from, span_entry, type);
}
+static int mlxsw_sp_port_sample_set(struct mlxsw_sp_port *mlxsw_sp_port,
+ bool enable, u32 rate)
+{
+ struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+ char mpsc_pl[MLXSW_REG_MPSC_LEN];
+
+ mlxsw_reg_mpsc_pack(mpsc_pl, mlxsw_sp_port->local_port, enable, rate);
+ return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(mpsc), mpsc_pl);
+}
+
static int mlxsw_sp_port_admin_status_set(struct mlxsw_sp_port *mlxsw_sp_port,
bool is_up)
{
@@ -1221,6 +1233,49 @@ mlxsw_sp_port_add_cls_matchall_mirror(struct mlxsw_sp_port *mlxsw_sp_port,
return err;
}
+static int
+mlxsw_sp_port_add_cls_matchall_sample(struct mlxsw_sp_port *mlxsw_sp_port,
+ struct tc_cls_matchall_offload *cls,
+ const struct tc_action *a,
+ bool ingress)
+{
+ struct mlxsw_sp_port_mall_tc_entry *mall_tc_entry;
+ int sampler_id;
+ int err;
+
+ if (mlxsw_sp_port->sample.enable) {
+ netdev_err(mlxsw_sp_port->dev, "Sample already active\n");
+ return -EEXIST;
+ }
+
+ err = mlxsw_sp_port_sample_set(mlxsw_sp_port, true, tcf_sample_rate(a));
+ if (err)
+ return err;
+
+ sampler_id = tcf_sample_sampler_id(a);
+
+ mlxsw_sp_port->sample.enable = true;
+ mlxsw_sp_port->sample.mark = tcf_sample_mark(a);
+ mlxsw_sp_port->sample.truncate = tcf_sample_truncate(a);
+ mlxsw_sp_port->sample.trunc_size = tcf_sample_trunc_size(a);
+ mlxsw_sp_port->sample.eth_type = tcf_sample_eth_type(a);
+ mlxsw_sp_port->sample.sampler_id = TC_SAMPLE_HW_ID(sampler_id);
+ tcf_sample_eth_dst_addr(a, mlxsw_sp_port->sample.eth_dst);
+ tcf_sample_eth_src_addr(a, mlxsw_sp_port->sample.eth_src);
+
+ netdev_dbg(mlxsw_sp_port->dev, "Activate hardware sample\n");
+
+ mall_tc_entry = kzalloc(sizeof(*mall_tc_entry), GFP_KERNEL);
+ if (!mall_tc_entry)
+ return -ENOMEM;
+
+ mall_tc_entry->cookie = cls->cookie;
+ mall_tc_entry->type = MLXSW_SP_PORT_MALL_SAMPLE;
+ list_add_tail(&mall_tc_entry->list, &mlxsw_sp_port->mall_tc_list);
+
+ return 0;
+}
+
static int mlxsw_sp_port_add_cls_matchall(struct mlxsw_sp_port *mlxsw_sp_port,
__be16 protocol,
struct tc_cls_matchall_offload *cls,
@@ -1236,17 +1291,19 @@ static int mlxsw_sp_port_add_cls_matchall(struct mlxsw_sp_port *mlxsw_sp_port,
}
tcf_exts_to_list(cls->exts, &actions);
- list_for_each_entry(a, &actions, list) {
- if (!is_tcf_mirred_egress_mirror(a) ||
- protocol != htons(ETH_P_ALL)) {
- return -ENOTSUPP;
- }
+ a = list_first_entry(&actions, struct tc_action, list);
+ if (is_tcf_mirred_egress_mirror(a) && protocol == htons(ETH_P_ALL))
err = mlxsw_sp_port_add_cls_matchall_mirror(mlxsw_sp_port, cls,
a, ingress);
- if (err)
- return err;
- }
+ else if (is_tcf_sample(a) && protocol == htons(ETH_P_ALL))
+ err = mlxsw_sp_port_add_cls_matchall_sample(mlxsw_sp_port, cls,
+ a, ingress);
+ else
+ return -ENOTSUPP;
+
+ if (err)
+ return err;
return 0;
}
@@ -1274,6 +1331,10 @@ static void mlxsw_sp_port_del_cls_matchall(struct mlxsw_sp_port *mlxsw_sp_port,
mlxsw_sp_span_mirror_remove(mlxsw_sp_port, to_port, span_type);
break;
+ case MLXSW_SP_PORT_MALL_SAMPLE:
+ mlxsw_sp_port->sample.enable = false;
+ mlxsw_sp_port_sample_set(mlxsw_sp_port, false, 1);
+ break;
default:
WARN_ON(1);
}
@@ -2774,6 +2835,46 @@ static void mlxsw_sp_rx_listener_mark_func(struct sk_buff *skb, u8 local_port,
return mlxsw_sp_rx_listener_func(skb, local_port, priv);
}
+static void mlxsw_sp_rx_listener_sample_func(struct sk_buff *skb, u8 local_port,
+ void *priv)
+{
+ struct mlxsw_sp *mlxsw_sp = priv;
+ struct mlxsw_sp_port *mlxsw_sp_port = mlxsw_sp->ports[local_port];
+ static struct ethhdr *ethhdr;
+ int orig_size;
+
+ if (unlikely(!mlxsw_sp_port)) {
+ dev_warn_ratelimited(mlxsw_sp->bus_info->dev, "Port %d: sample skb received for non-existent port\n",
+ local_port);
+ return;
+ }
+
+ skb->dev = mlxsw_sp_port->dev;
+ skb->mac_len = skb->dev->hard_header_len;
+ skb->mark = mlxsw_sp_port->sample.mark;
+
+ orig_size = skb->len;
+
+ if (mlxsw_sp_port->sample.truncate)
+ skb_trim(skb, mlxsw_sp_port->sample.trunc_size);
+
+ ethhdr = ife_packet_info_pack(skb, orig_size, skb->dev->ifindex, 0,
+ mlxsw_sp_port->sample.sampler_id,
+ mlxsw_sp_port->sample.seq++);
+ if (!ethhdr)
+ return;
+
+ if (!is_zero_ether_addr(mlxsw_sp_port->sample.eth_src))
+ ether_addr_copy(ethhdr->h_source,
+ mlxsw_sp_port->sample.eth_src);
+ if (!is_zero_ether_addr(mlxsw_sp_port->sample.eth_dst))
+ ether_addr_copy(ethhdr->h_dest, mlxsw_sp_port->sample.eth_dst);
+ ethhdr->h_proto = htons(mlxsw_sp_port->sample.eth_type);
+
+ skb->protocol = eth_type_trans(skb, skb->dev);
+ netif_receive_skb(skb);
+}
+
#define MLXSW_SP_RXL(_func, _trap_id, _action) \
{ \
.func = _func, \
@@ -2784,6 +2885,9 @@ static void mlxsw_sp_rx_listener_mark_func(struct sk_buff *skb, u8 local_port,
static const struct mlxsw_rx_listener mlxsw_sp_rx_listener[] = {
MLXSW_SP_RXL(mlxsw_sp_rx_listener_func, FDB_MC, TRAP_TO_CPU),
+ MLXSW_SP_RXL(mlxsw_sp_rx_listener_sample_func, PKT_SAMPLE,
+ MIRROR_TO_CPU),
+
/* Traps for specific L2 packet types, not trapped as FDB MC */
MLXSW_SP_RXL(mlxsw_sp_rx_listener_func, STP, TRAP_TO_CPU),
MLXSW_SP_RXL(mlxsw_sp_rx_listener_func, LACP, TRAP_TO_CPU),
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index 04a2bc7..7f2e6fc 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -229,6 +229,7 @@ struct mlxsw_sp_span_entry {
enum mlxsw_sp_port_mall_action_type {
MLXSW_SP_PORT_MALL_MIRROR,
+ MLXSW_SP_PORT_MALL_SAMPLE,
};
struct mlxsw_sp_port_mall_mirror_tc_entry {
@@ -361,6 +362,17 @@ struct mlxsw_sp_port {
struct rtnl_link_stats64 *cache;
struct delayed_work update_dw;
} hw_stats;
+ struct {
+ bool enable;
+ u32 mark;
+ bool truncate;
+ u32 trunc_size;
+ u8 eth_src[ETH_ALEN];
+ u8 eth_dst[ETH_ALEN];
+ u16 eth_type;
+ u32 sampler_id;
+ u32 seq;
+ } sample;
};
struct mlxsw_sp_port *mlxsw_sp_port_lower_dev_hold(struct net_device *dev);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/trap.h b/drivers/net/ethernet/mellanox/mlxsw/trap.h
index ed8e301..eebd135 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/trap.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/trap.h
@@ -54,6 +54,7 @@ enum {
MLXSW_TRAP_ID_IGMP_V2_REPORT = 0x32,
MLXSW_TRAP_ID_IGMP_V2_LEAVE = 0x33,
MLXSW_TRAP_ID_IGMP_V3_REPORT = 0x34,
+ MLXSW_TRAP_ID_PKT_SAMPLE = 0x38,
MLXSW_TRAP_ID_ARPBC = 0x50,
MLXSW_TRAP_ID_ARPUC = 0x51,
MLXSW_TRAP_ID_MTUERROR = 0x52,
--
2.7.4
^ permalink raw reply related
* Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
From: Andreas Färber @ 2016-11-14 15:00 UTC (permalink / raw)
To: Martin Blumenstingl
Cc: Giuseppe CAVALLARO, Alexandre Torgue, netdev, André Roth,
Johnson Leung, linux-amlogic, Jerome Brunet
In-Reply-To: <CAFBinCDq0neyidQBO13F4Yz-GZPorGyRQ2t5Vb-PL0Ls582+4Q@mail.gmail.com>
Hi,
Am 07.11.2016 um 18:37 schrieb Martin Blumenstingl:
> The same problem still appears on my Tronsmart Vega S95 Meta even with
> the patched PHY driver.
>
> Unfortunately I don't have a second device to rule out that my
> Tronsmart Vega S95 Meta could be broken (not unlikely, I get DDR
> errors from time to time in u-boot). Maybe Andreas Faerber can test
> ethernet with and without Jerome's patch on one of his Tronsmart
> devices.
So far I only ran into the stall issue on my Odroid-C2, not on my Vega
S95 Telos, both on v4.10/integ branch with a rebased MMC patch of yours
on top.
I'm assuming you mean realtek8211f-disable-eee-1000.patch from Nov 3.
Regards,
Andreas
--
SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
^ permalink raw reply
* Re: [PATCH] net/phy: add trace events for mdio accesses
From: Steven Rostedt @ 2016-11-14 15:22 UTC (permalink / raw)
To: Uwe Kleine-König; +Cc: Florian Fainelli, Ingo Molnar, netdev
In-Reply-To: <20161114110335.27862-1-uwe@kleine-koenig.org>
On Mon, 14 Nov 2016 12:03:35 +0100
Uwe Kleine-König <uwe@kleine-koenig.org> wrote:
> Make it possible to generate trace events for mdio read and write accesses.
>
> Signed-off-by: Uwe Kleine-König <uwe@kleine-koenig.org>
> ---
> drivers/net/phy/mdio_bus.c | 15 +++++++++++++++
> include/trace/events/mdio.h | 40 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 55 insertions(+)
> create mode 100644 include/trace/events/mdio.h
>
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index 09deef4bed09..0f3f207419f6 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -38,6 +38,9 @@
>
> #include <asm/irq.h>
>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/mdio.h>
> +
> int mdiobus_register_device(struct mdio_device *mdiodev)
> {
> if (mdiodev->bus->mdio_map[mdiodev->addr])
> @@ -461,6 +464,9 @@ int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 regnum)
> retval = bus->read(bus, addr, regnum);
> mutex_unlock(&bus->mdio_lock);
>
> + if (retval >= 0)
> + trace_mdio_access(bus, 1, addr, regnum, retval);
These cause branches to be taken when tracing is disabled, breaking the
zero overhead for disabled tracing guideline. As retval is passed to
the tracepoint, please look at TRACE_EVENT_CONDITION() and use that.
It will move the if statement into the enabling of the trace event and
keep the overhead to a minimum when the tracepoint is disabled.
Do the same for the below as well.
-- Steve
> +
> return retval;
> }
> EXPORT_SYMBOL(mdiobus_read_nested);
> @@ -485,6 +491,9 @@ int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum)
> retval = bus->read(bus, addr, regnum);
> mutex_unlock(&bus->mdio_lock);
>
> + if (retval >= 0)
> + trace_mdio_access(bus, 1, addr, regnum, retval);
> +
> return retval;
> }
> EXPORT_SYMBOL(mdiobus_read);
> @@ -513,6 +522,9 @@ int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val)
> err = bus->write(bus, addr, regnum, val);
> mutex_unlock(&bus->mdio_lock);
>
> + if (!err)
> + trace_mdio_access(bus, 0, addr, regnum, val);
> +
> return err;
> }
> EXPORT_SYMBOL(mdiobus_write_nested);
> @@ -538,6 +550,9 @@ int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val)
> err = bus->write(bus, addr, regnum, val);
> mutex_unlock(&bus->mdio_lock);
>
> + if (!err)
> + trace_mdio_access(bus, 0, addr, regnum, val);
> +
> return err;
> }
> EXPORT_SYMBOL(mdiobus_write);
> diff --git a/include/trace/events/mdio.h b/include/trace/events/mdio.h
> new file mode 100644
> index 000000000000..dcb2d456a346
> --- /dev/null
> +++ b/include/trace/events/mdio.h
> @@ -0,0 +1,40 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM mdio
> +
> +#if !defined(_TRACE_MDIO_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_MDIO_H
> +
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(mdio_access,
> +
> + TP_PROTO(struct mii_bus *bus, int read,
> + unsigned addr, unsigned regnum, u16 val),
> +
> + TP_ARGS(bus, read, addr, regnum, val),
> +
> + TP_STRUCT__entry(
> + __array(char, busid, MII_BUS_ID_SIZE)
> + __field(int, read)
> + __field(unsigned, addr)
> + __field(unsigned, regnum)
> + __field(u16, val)
> + ),
> +
> + TP_fast_assign(
> + strncpy(__entry->busid, bus->id, MII_BUS_ID_SIZE);
> + __entry->read = read;
> + __entry->addr = addr;
> + __entry->regnum = regnum;
> + __entry->val = val;
> + ),
> +
> + TP_printk("%s %-5s phy:0x%02x reg:0x%02x val:0x%04hx",
> + __entry->busid, __entry->read ? "read" : "write",
> + __entry->addr, __entry->regnum, __entry->val)
> +);
> +
> +#endif /* if !defined(_TRACE_MDIO_H) || defined(TRACE_HEADER_MULTI_READ) */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
^ permalink raw reply
* Re: Debugging Ethernet issues
From: Mason @ 2016-11-14 15:33 UTC (permalink / raw)
To: Andrew Lunn
Cc: Florian Fainelli, netdev, Mans Rullgard, Sergei Shtylyov,
Tom Lendacky, Zach Brown, Shaohui Xie, Tim Beale, Brian Hill,
Vince Bridgers, Balakumaran Kannan, David S. Miller,
Sebastian Frias, Kirill Kapranov
In-Reply-To: <5829D10F.30206@free.fr>
On 14/11/2016 15:58, Mason wrote:
> nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
> vs
> nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control off
>
> I'm not sure whether "flow control" is relevant...
Based on phy_print_status()
phydev->pause ? "rx/tx" : "off"
I added the following patch.
diff --git a/drivers/net/ethernet/aurora/nb8800.c b/drivers/net/ethernet/aurora/nb8800.c
index defc22a15f67..4e758c1cfa4e 100644
--- a/drivers/net/ethernet/aurora/nb8800.c
+++ b/drivers/net/ethernet/aurora/nb8800.c
@@ -667,6 +667,8 @@ static void nb8800_link_reconfigure(struct net_device *dev)
struct phy_device *phydev = priv->phydev;
int change = 0;
+ printk("%s from %pf\n", __func__, __builtin_return_address(0));
+
if (phydev->link) {
if (phydev->speed != priv->speed) {
priv->speed = phydev->speed;
@@ -1274,9 +1276,9 @@ static int nb8800_hw_init(struct net_device *dev)
nb8800_writeb(priv, NB8800_PQ2, val & 0xff);
/* Auto-negotiate by default */
- priv->pause_aneg = true;
- priv->pause_rx = true;
- priv->pause_tx = true;
+ priv->pause_aneg = false;
+ priv->pause_rx = false;
+ priv->pause_tx = false;
nb8800_mc_init(dev, 0);
Connected to 1000 Mbps switch:
# time udhcpc | while read LINE; do date; echo $LINE; done
Thu Jan 1 00:00:22 UTC 1970
udhcpc (v1.22.1) started
Thu Jan 1 00:00:22 UTC 1970
Sending discover...
[ 24.565346] nb8800_link_reconfigure from phy_state_machine
Thu Jan 1 00:00:25 UTC 1970
Sending discover...
[ 26.575402] nb8800_link_reconfigure from phy_state_machine
[ 26.580972] nb8800 26000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
Thu Jan 1 00:00:28 UTC 1970
Sending discover...
Thu Jan 1 00:00:29 UTC 1970
Sending select for 172.27.64.58...
Thu Jan 1 00:00:29 UTC 1970
Lease of 172.27.64.58 obtained, lease time 604800
Thu Jan 1 00:00:29 UTC 1970
deleting routers
Thu Jan 1 00:00:29 UTC 1970
adding dns 172.27.0.17
real 0m7.388s
user 0m0.040s
sys 0m0.090s
Connected to 100 Mbps switch:
# time udhcpc | while read LINE; do date; echo $LINE; done
Thu Jan 1 00:00:14 UTC 1970
udhcpc (v1.22.1) started
Thu Jan 1 00:00:15 UTC 1970
Sending discover...
[ 16.968621] nb8800_link_reconfigure from phy_state_machine
[ 17.975359] nb8800_link_reconfigure from phy_state_machine
[ 17.980923] nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
Thu Jan 1 00:00:18 UTC 1970
Sending discover...
Thu Jan 1 00:00:19 UTC 1970
Sending select for 172.27.64.58...
Thu Jan 1 00:00:19 UTC 1970
Lease of 172.27.64.58 obtained, lease time 604800
Thu Jan 1 00:00:19 UTC 1970
deleting routers
Thu Jan 1 00:00:19 UTC 1970
adding dns 172.27.0.17
real 0m4.355s
user 0m0.043s
sys 0m0.083s
OK, so now it works (by accident?) even on 100 Mbps switch, but it still
prints "flow control rx/tx"...
# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate: off
RX: off
TX: off
These values make sense considering my changes in the driver.
Are 100 Mbps switches supposed to support these pause features,
and are they supposed to be able to auto-negotiate them?
Regards.
^ permalink raw reply related
* [PATCH next] dctcp: update cwnd on congestion event
From: Florian Westphal @ 2016-11-14 15:42 UTC (permalink / raw)
To: netdev
Cc: Florian Westphal, Lawrence Brakmo, Andrew Shewmaker, Glenn Judd,
Daniel Borkmann
draft-ietf-tcpm-dctcp-02 says:
... when the sender receives an indication of congestion
(ECE), the sender SHOULD update cwnd as follows:
cwnd = cwnd * (1 - DCTCP.Alpha / 2)
So, lets do this and reduce cwnd more smoothly (and faster), as per
current congestion estimate.
Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index ab37c6775630..51139175bf61 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -188,8 +188,8 @@ static void dctcp_ce_state_1_to_0(struct sock *sk)
static void dctcp_update_alpha(struct sock *sk, u32 flags)
{
- const struct tcp_sock *tp = tcp_sk(sk);
struct dctcp *ca = inet_csk_ca(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
u32 acked_bytes = tp->snd_una - ca->prior_snd_una;
/* If ack did not advance snd_una, count dupack as MSS size.
@@ -229,6 +229,13 @@ static void dctcp_update_alpha(struct sock *sk, u32 flags)
WRITE_ONCE(ca->dctcp_alpha, alpha);
dctcp_reset(tp, ca);
}
+
+ if (flags & CA_ACK_ECE) {
+ unsigned int cwnd = dctcp_ssthresh(sk);
+
+ if (cwnd != tp->snd_cwnd)
+ tp->snd_cwnd = cwnd;
+ }
}
static void dctcp_state(struct sock *sk, u8 new_state)
--
2.7.3
^ permalink raw reply related
* Re: [PATCH net] sctp: change sk state only when it has assocs in sctp_shutdown
From: Neil Horman @ 2016-11-14 15:52 UTC (permalink / raw)
To: Xin Long
Cc: network dev, linux-sctp, davem, Marcelo Ricardo Leitner,
Vlad Yasevich, andreyknvl
In-Reply-To: <d260f8b59f52d7ef00b83a554fc19d4aa91766a2.1479044677.git.lucien.xin@gmail.com>
On Sun, Nov 13, 2016 at 09:44:37PM +0800, Xin Long wrote:
> Now when users shutdown a sock with SEND_SHUTDOWN in sctp, even if
> this sock has no connection (assoc), sk state would be changed to
> SCTP_SS_CLOSING, which is not as we expect.
>
> Besides, after that if users try to listen on this sock, kernel
> could even panic when it dereference sctp_sk(sk)->bind_hash in
> sctp_inet_listen, as bind_hash is null when sock has no assoc.
>
> This patch is to move sk state change after checking sk assocs
> is not empty, and also merge these two if() conditions and reduce
> indent level.
>
> Fixes: d46e416c11c8 ("sctp: sctp should change socket state when shutdown is received")
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Tested-by: Andrey Konovalov <andreyknvl@google.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
> net/sctp/socket.c | 15 +++++++--------
> 1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index faa48ff..f23ad91 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -4285,19 +4285,18 @@ static void sctp_shutdown(struct sock *sk, int how)
> {
> struct net *net = sock_net(sk);
> struct sctp_endpoint *ep;
> - struct sctp_association *asoc;
>
> if (!sctp_style(sk, TCP))
> return;
>
> - if (how & SEND_SHUTDOWN) {
> + ep = sctp_sk(sk)->ep;
> + if (how & SEND_SHUTDOWN && !list_empty(&ep->asocs)) {
> + struct sctp_association *asoc;
> +
> sk->sk_state = SCTP_SS_CLOSING;
> - ep = sctp_sk(sk)->ep;
> - if (!list_empty(&ep->asocs)) {
> - asoc = list_entry(ep->asocs.next,
> - struct sctp_association, asocs);
> - sctp_primitive_SHUTDOWN(net, asoc, NULL);
> - }
> + asoc = list_entry(ep->asocs.next,
> + struct sctp_association, asocs);
> + sctp_primitive_SHUTDOWN(net, asoc, NULL);
> }
> }
>
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
^ permalink raw reply
* Re: [PATCH v3] ip6_output: ensure flow saddr actually belongs to device
From: David Ahern @ 2016-11-14 16:19 UTC (permalink / raw)
To: Jason A. Donenfeld, Netdev, WireGuard mailing list, LKML,
YOSHIFUJI Hideaki, Hannes Frederic Sowa
In-Reply-To: <20161113232813.28926-1-Jason@zx2c4.com>
On 11/13/16 4:28 PM, Jason A. Donenfeld wrote:
> This puts the IPv6 routing functions in parity with the IPv4 routing
> functions. Namely, we now check in v6 that if a flowi6 requests an
> saddr, the returned dst actually corresponds to a net device that has
> that saddr. This mirrors the v4 logic with __ip_dev_find in
> __ip_route_output_key_hash. In the event that the returned dst is not
> for a dst with a dev that has the saddr, we return -EINVAL, just like
> v4; this makes it easy to use the same error handlers for both cases.
>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: David Ahern <dsa@cumulusnetworks.com>
> ---
> Changes from v2:
> It turns out ipv6_chk_addr already has the device enumeration
> logic that we need by simply passing NULL.
>
> net/ipv6/ip6_output.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 6001e78..b3b5cb6 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -926,6 +926,10 @@ static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk,
> int err;
> int flags = 0;
>
> + if (!ipv6_addr_any(&fl6->saddr) &&
> + !ipv6_chk_addr(net, &fl6->saddr, NULL, 1))
> + return -EINVAL;
> +
> /* The correct way to handle this would be to do
> * ip6_route_get_saddr, and then ip6_route_output; however,
> * the route-specific preferred source forces the
>
LGTM
Acked-by: David Ahern <dsa@cumulusnetworks.com>
^ permalink raw reply
* Re: Long delays creating a netns after deleting one (possibly RCU related)
From: Paul E. McKenney @ 2016-11-14 16:24 UTC (permalink / raw)
To: Cong Wang
Cc: Rolf Neugebauer, LKML, Linux Kernel Network Developers,
Justin Cormack, Ian Campbell
In-Reply-To: <CAM_iQpUq14GsvGyz2xA1PkDq5YO743T68Zh2zm=NJZQ9S2Ahqg@mail.gmail.com>
On Sun, Nov 13, 2016 at 10:47:01PM -0800, Cong Wang wrote:
> On Fri, Nov 11, 2016 at 4:55 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Fri, Nov 11, 2016 at 4:23 PM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> >>
> >> Ah! This net_mutex is different than RTNL. Should synchronize_net() be
> >> modified to check for net_mutex being held in addition to the current
> >> checks for RTNL being held?
> >>
> >
> > Good point!
> >
> > Like commit be3fc413da9eb17cce0991f214ab0, checking
> > for net_mutex for this case seems to be an optimization, I assume
> > synchronize_rcu_expedited() and synchronize_rcu() have the same
> > behavior...
>
> Thinking a bit more, I think commit be3fc413da9eb17cce0991f
> gets wrong on rtnl_is_locked(), the lock could be locked by other
> process not by the current one, therefore it should be
> lockdep_rtnl_is_held() which, however, is defined only when LOCKDEP
> is enabled... Sigh.
>
> I don't see any better way than letting callers decide if they want the
> expedited version or not, but this requires changes of all callers of
> synchronize_net(). Hm.
I must confess that I don't understand how it would help to use an
expedited grace period when some other process is holding RTNL.
In contrast, I do well understand how it helps when the current process
is holding RTNL.
So what am I missing here?
Thanx, Paul
^ permalink raw reply
* Re: [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related
From: Stefan Hajnoczi @ 2016-11-14 16:25 UTC (permalink / raw)
To: Russell King - ARM Linux, David Woodhouse
Cc: jasowang, vyasevic, stefanha, netdev, qemu-devel,
igor.v.kovalenko, dgilbert
In-Reply-To: <20161114092947.GB2031@work-vm>
[-- Attachment #1: Type: text/plain, Size: 3792 bytes --]
On Mon, Nov 14, 2016 at 09:29:48AM +0000, Dr. David Alan Gilbert wrote:
> * Russell King - ARM Linux (linux@armlinux.org.uk) wrote:
> > On Fri, Nov 11, 2016 at 09:23:43PM +0000, David Woodhouse wrote:
> > > It's also *fairly* unlikely that the kernel in the guest has developed
> > > a bug and isn't setting gso_size sanely. I'm more inclined to suspect
> > > that qemu isn't properly emulating those bits. But at first glance at
> > > the code, it looks like *that's* been there for the last decade too...
> >
> > I take issue with that, having looked at the qemu rtl8139 code:
> >
> > if ((txdw0 & CP_TX_LGSEN) && ip_protocol == IP_PROTO_TCP)
> > {
> > int large_send_mss = (txdw0 >> 16) & CP_TC_LGSEN_MSS_MASK;
> >
> > DPRINTF("+++ C+ mode offloaded task TSO MTU=%d IP data %d "
> > "frame data %d specified MSS=%d\n", ETH_MTU,
> > ip_data_len, saved_size - ETH_HLEN, large_send_mss);
> >
> > That's the only reference to "large_send_mss" there, other than that,
> > the MSS value that gets stuck into the field by 8139cp.c is completely
> > unused. Instead, qemu does this:
> >
> > eth_payload_data = saved_buffer + ETH_HLEN;
> > eth_payload_len = saved_size - ETH_HLEN;
> >
> > ip = (ip_header*)eth_payload_data;
> >
> > hlen = IP_HEADER_LENGTH(ip);
> > ip_data_len = be16_to_cpu(ip->ip_len) - hlen;
> >
> > tcp_header *p_tcp_hdr = (tcp_header*)(eth_payload_data + hlen);
> > int tcp_hlen = TCP_HEADER_DATA_OFFSET(p_tcp_hdr);
> >
> > /* ETH_MTU = ip header len + tcp header len + payload */
> > int tcp_data_len = ip_data_len - tcp_hlen;
> > int tcp_chunk_size = ETH_MTU - hlen - tcp_hlen;
> >
> > for (tcp_send_offset = 0; tcp_send_offset < tcp_data_len; tcp_send_offset += tcp_chunk_size)
> > {
> >
> > It uses a fixed value of ETH_MTU to calculate the size of the TCP
> > data chunks, and this is not surprisingly the well known:
> >
> > #define ETH_MTU 1500
> >
> > Qemu seems to be buggy - it ignores the MSS value, and always tries to
> > send 1500 byte frames.
>
> cc'ing in Stefan who last touched that code and Jason and Vlad who
> know the net code.
CCing Igor Kovalenko who implemented "fixed for TCP segmentation
offloading - removed dependency on slirp.h" in 2006. I don't actually
expect him to remember this from 10 years ago though :).
Looking at the history the large_send_mss variable was never used for
anything beyond the debug printf.
The datasheet for this NIC is here:
http://realtek.info/pdf/rtl8139cp.pdf. See 9.2.1 Transmit.
Does this untested patch work for you?
diff --git a/hw/net/rtl8139.c b/hw/net/rtl8139.c
index f05e59c..a3f1af5 100644
--- a/hw/net/rtl8139.c
+++ b/hw/net/rtl8139.c
@@ -2167,9 +2167,13 @@ static int rtl8139_cplus_transmit_one(RTL8139State *s)
goto skip_offload;
}
- /* ETH_MTU = ip header len + tcp header len + payload */
+ /* MSS too small */
+ if (tcp_hlen + hlen >= large_send_mss) {
+ goto skip_offload;
+ }
+
int tcp_data_len = ip_data_len - tcp_hlen;
- int tcp_chunk_size = ETH_MTU - hlen - tcp_hlen;
+ int tcp_chunk_size = large_send_mss - hlen - tcp_hlen;
DPRINTF("+++ C+ mode TSO IP data len %d TCP hlen %d TCP "
"data len %d TCP chunk size %d\n", ip_data_len,
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
^ permalink raw reply related
* Re: [PATCH v3] ip6_output: ensure flow saddr actually belongs to device
From: Hannes Frederic Sowa @ 2016-11-14 16:44 UTC (permalink / raw)
To: Jason A. Donenfeld, David Ahern, Netdev, WireGuard mailing list,
LKML, YOSHIFUJI Hideaki
In-Reply-To: <20161113232813.28926-1-Jason@zx2c4.com>
On Mon, Nov 14, 2016, at 00:28, Jason A. Donenfeld wrote:
> This puts the IPv6 routing functions in parity with the IPv4 routing
> functions. Namely, we now check in v6 that if a flowi6 requests an
> saddr, the returned dst actually corresponds to a net device that has
> that saddr. This mirrors the v4 logic with __ip_dev_find in
> __ip_route_output_key_hash. In the event that the returned dst is not
> for a dst with a dev that has the saddr, we return -EINVAL, just like
> v4; this makes it easy to use the same error handlers for both cases.
>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: David Ahern <dsa@cumulusnetworks.com>
> ---
> Changes from v2:
> It turns out ipv6_chk_addr already has the device enumeration
> logic that we need by simply passing NULL.
>
> net/ipv6/ip6_output.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 6001e78..b3b5cb6 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -926,6 +926,10 @@ static int ip6_dst_lookup_tail(struct net *net,
> const struct sock *sk,
> int err;
> int flags = 0;
>
> + if (!ipv6_addr_any(&fl6->saddr) &&
> + !ipv6_chk_addr(net, &fl6->saddr, NULL, 1))
> + return -EINVAL;
Hmm, this check is too permissive, no?
E.g. what happens if you move a link local address from one interface to
another? In this case this code would still allow the saddr to be used.
I just also quickly read up on the history (sorry was travelling last
week) and wonder if you ever saw a user space facing bug or if this is
basically some difference you saw while writing out of tree code?
Thanks,
Hannes
^ permalink raw reply
* Re: [PATCH v3] ip6_output: ensure flow saddr actually belongs to device
From: Jason A. Donenfeld @ 2016-11-14 16:54 UTC (permalink / raw)
To: David Ahern, David Miller
Cc: Netdev, Hannes Frederic Sowa, YOSHIFUJI Hideaki,
WireGuard mailing list, LKML
In-Reply-To: <CAHmME9qC4xqGOwJnauXrJBDkAtmmuJ+kJKL6ufuU9_XWKNFdSA@mail.gmail.com>
[-- Attachment #1.1: Type: text/plain, Size: 201 bytes --]
On Nov 14, 2016 17:19, "David Ahern" <dsa@cumulusnetworks.com> wrote:
>
> LGTM
>
> Acked-by: David Ahern <dsa@cumulusnetworks.com>
Great.
@DaveM: can we get this in 4.9 and in stable?
Thanks,
Jason
[-- Attachment #1.2: Type: text/html, Size: 410 bytes --]
[-- Attachment #2: Type: text/plain, Size: 147 bytes --]
_______________________________________________
WireGuard mailing list
WireGuard@lists.zx2c4.com
http://lists.zx2c4.com/mailman/listinfo/wireguard
^ permalink raw reply
* Re: [PATCH v3] ip6_output: ensure flow saddr actually belongs to device
From: David Ahern @ 2016-11-14 16:55 UTC (permalink / raw)
To: Hannes Frederic Sowa, Jason A. Donenfeld, Netdev,
WireGuard mailing list, LKML, YOSHIFUJI Hideaki
In-Reply-To: <1479141867.3723362.787321689.4A3DCFD6@webmail.messagingengine.com>
On 11/14/16 9:44 AM, Hannes Frederic Sowa wrote:
> On Mon, Nov 14, 2016, at 00:28, Jason A. Donenfeld wrote:
>> This puts the IPv6 routing functions in parity with the IPv4 routing
>> functions. Namely, we now check in v6 that if a flowi6 requests an
>> saddr, the returned dst actually corresponds to a net device that has
>> that saddr. This mirrors the v4 logic with __ip_dev_find in
>> __ip_route_output_key_hash. In the event that the returned dst is not
>> for a dst with a dev that has the saddr, we return -EINVAL, just like
>> v4; this makes it easy to use the same error handlers for both cases.
>>
>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>> Cc: David Ahern <dsa@cumulusnetworks.com>
>> ---
>> Changes from v2:
>> It turns out ipv6_chk_addr already has the device enumeration
>> logic that we need by simply passing NULL.
>>
>> net/ipv6/ip6_output.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>> index 6001e78..b3b5cb6 100644
>> --- a/net/ipv6/ip6_output.c
>> +++ b/net/ipv6/ip6_output.c
>> @@ -926,6 +926,10 @@ static int ip6_dst_lookup_tail(struct net *net,
>> const struct sock *sk,
>> int err;
>> int flags = 0;
>>
>> + if (!ipv6_addr_any(&fl6->saddr) &&
>> + !ipv6_chk_addr(net, &fl6->saddr, NULL, 1))
>> + return -EINVAL;
>
> Hmm, this check is too permissive, no?
>
> E.g. what happens if you move a link local address from one interface to
> another? In this case this code would still allow the saddr to be used.
This check -- like the ipv4 variant -- only verifies the saddr is locally assigned. If the address moves interfaces it should be fine.
>
> I just also quickly read up on the history (sorry was travelling last
> week) and wonder if you ever saw a user space facing bug or if this is
> basically some difference you saw while writing out of tree code?
I checked the userspace API this morning. bind and cmsg for example check that the address is valid with calls to ipv6_chk_addr.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox