* Re: Debugging Ethernet issues
From: Sebastian Frias @ 2016-11-14 13:03 UTC (permalink / raw)
To: Florian Fainelli, Mason, Andrew Lunn
Cc: netdev, Mans Rullgard, Sergei Shtylyov, Tom Lendacky, Zach Brown,
Shaohui Xie, Tim Beale, Brian Hill, Vince Bridgers,
Balakumaran Kannan, David S. Miller, Kirill Kapranov
In-Reply-To: <9d1f28a7-959b-fdde-3403-f6da5f521125@gmail.com>
On 11/13/2016 08:55 PM, Florian Fainelli wrote:
> Le 13/11/2016 à 11:51, Mason a écrit :
>> On 13/11/2016 04:09, Andrew Lunn wrote:
>>
>>> Mason wrote:
>>>
>>>> When connected to a Gigabit switch
>>>> 3.4 negotiates a LAN DHCP setup instantly
>>>> 4.7 requires over 5 seconds to do so
>>>
>>> When you run tcpdump on the DHCP server, are you noticing the first
>>> request is missing?
>>>
>>> What can happen is the dhclient gets started immediately and sends out
>>> its first request before auto-negotiation has finished. So this first packet
>>> gets lost. The retransmit after a few seconds is then successful.
>>
>> I will run tcpdump on the server as I run udhcpc on the client
>> for Linux 3.4 vs 4.7
>>
>> Do you know what would make auto-negotiation fail at 100 Mbps
>> on 4.7? (whereas it succeeds on 3.4)
>>
>> (Thinking out loud) If the problem were in auto-negotiation,
>> then if should work if I hard-code speed and duplex using
>> ethtool, right? (IIRC, hard-coding doesn't help.)
>
> I would start with checking basic things:
>
> - does your Ethernet driver get a link UP being reported correctly
> (netif_carrier_ok returns 1)?
> - if you let the bootloader configure the PHY and utilize the Generic
> PHY driver instead of the Atheros PHY driver, does the problem appear as
> well?
Would using a "fixed-link" serve the same?
It appears that using a fixed-link
ð0 {
#address-cells = <1>;
#size-cells = <0>;
#ifdef WITH_FIXED_LINK
phy-connection-type = "rgmii";
fixed-link {
speed = <100>;
full-duplex;
};
#else
phy-connection-type = "rgmii";
phy-handle = <ð0_phy>;
/* Atheros AR8035 */
eth0_phy: ethernet-phy@4 {
interrupt-parent = <&irq0>;
compatible = "ethernet-phy-id004d.d072",
"ethernet-phy-ieee802.3-c22";
interrupts = <37 IRQ_TYPE_EDGE_RISING>;
reg = <4>;
};
#endif
};
works.
----
For what is worth, the patch that Mason was talking about earlier
in the thread:
"...After much hair-pulling, it turned out that *some* of the breakage
was caused by a local patch..."
was setting changing the following delay in 'drivers/net/phy/phy.c:phy_state_machine()'
/* Only re-schedule a PHY state machine change if we are polling the
* PHY, if PHY_IGNORE_INTERRUPT is set, then we will be moving
* between states from phy_mac_interrupt()
*/
if (phydev->irq == PHY_POLL)
queue_delayed_work(system_power_efficient_wq, &phydev->state_queue,
PHY_STATE_TIME * HZ);
from "PHY_STATE_TIME * HZ" to "0".
That caused 2 of 3 types of boards to fail, while one of them always worked
regardless of the delay.
In a nutshell:
- Board A, chip X: works with delay "PHY_STATE_TIME * HZ" or "0".
- Board B, chip X: does not work with delay "0"
- Board C, chip Y: does not work with delay "0"
Does board A works by "luck" when this delay is "0"?
(this delay has always been there, but it is not clear why)
> - what do transmit/receive counters on the Ethernet driver/MAC return?
>
^ permalink raw reply
* [net-next PATCH v2] net: dummy: Introduce dummy virtual functions
From: Phil Sutter @ 2016-11-14 13:02 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Sabrina Dubroca
The idea for this was born when testing VF support in iproute2 which was
impeded by hardware requirements. In fact, not every VF-capable hardware
driver implements all netdev ops, so testing the interface is still hard
to do even with a well-sorted hardware shelf.
To overcome this and allow for testing the user-kernel interface, this
patch allows to turn dummy into a PF with a configurable amount of VFs.
Due to the assumption that all PFs are PCI devices, this implementation
is not completely straightforward: In order to allow for
rtnl_fill_ifinfo() to see the dummy VFs, a fake PCI parent device is
attached to the dummy netdev. This has to happen at the right spot so
register_netdevice() does not get confused. This patch abuses
ndo_fix_features callback for that. In ndo_uninit callback, the fake
parent is removed again for the same purpose.
Joint work with Sabrina Dubroca.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Phil Sutter <phil@nwl.cc>
---
Changes since v1:
- Fixed issues reported by kbuild test robot:
- pci_dev->sriov is only present if CONFIG_PCI_ATS is active.
- pci_bus_type does not exist if CONFIG_PCI is not defined.
---
drivers/net/dummy.c | 199 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 197 insertions(+), 2 deletions(-)
diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index 69fc8409a9733..a831537145bd9 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -34,6 +34,8 @@
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
+#include <linux/pci.h>
+#include "../pci/pci.h" /* for struct pci_sriov */
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>
@@ -42,6 +44,37 @@
#define DRV_VERSION "1.0"
static int numdummies = 1;
+static int num_vfs;
+
+static struct pci_sriov pdev_sriov;
+
+static struct pci_dev pci_pdev = {
+ .is_physfn = 1,
+#ifdef CONFIG_PCI_ATS
+ .sriov = &pdev_sriov,
+#endif
+#ifdef CONFIG_PCI
+ .dev.bus = &pci_bus_type,
+#endif
+};
+
+struct vf_data_storage {
+ unsigned char vf_mac[ETH_ALEN];
+ u16 pf_vlan; /* When set, guest VLAN config not allowed. */
+ u16 pf_qos;
+ __be16 vlan_proto;
+ u16 min_tx_rate;
+ u16 max_tx_rate;
+ u8 spoofchk_enabled;
+ bool rss_query_enabled;
+ u8 trusted;
+ int link_state;
+};
+
+struct dummy_priv {
+ int num_vfs;
+ struct vf_data_storage *vfinfo;
+};
/* fake multicast ability */
static void set_multicast_list(struct net_device *dev)
@@ -91,15 +124,29 @@ static netdev_tx_t dummy_xmit(struct sk_buff *skb, struct net_device *dev)
static int dummy_dev_init(struct net_device *dev)
{
+ struct dummy_priv *priv = netdev_priv(dev);
+
dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
if (!dev->dstats)
return -ENOMEM;
+ priv->num_vfs = num_vfs;
+ priv->vfinfo = NULL;
+
+ if (!num_vfs)
+ return 0;
+
+ priv->vfinfo = kcalloc(num_vfs, sizeof(struct vf_data_storage),
+ GFP_KERNEL);
+ if (!priv->vfinfo)
+ return -ENOMEM;
+
return 0;
}
static void dummy_dev_uninit(struct net_device *dev)
{
+ dev->dev.parent = NULL;
free_percpu(dev->dstats);
}
@@ -112,6 +159,129 @@ static int dummy_change_carrier(struct net_device *dev, bool new_carrier)
return 0;
}
+/* fake, just to set fake PCI parent after netdev_register_kobject() */
+static netdev_features_t dummy_fix_features(struct net_device *dev,
+ netdev_features_t features)
+{
+ struct dummy_priv *priv = netdev_priv(dev);
+
+ if (priv->num_vfs)
+ dev->dev.parent = &pci_pdev.dev;
+
+ return features;
+}
+
+static int dummy_set_vf_mac(struct net_device *dev, int vf, u8 *mac)
+{
+ struct dummy_priv *priv = netdev_priv(dev);
+
+ if (!is_valid_ether_addr(mac) || (vf >= priv->num_vfs))
+ return -EINVAL;
+
+ memcpy(priv->vfinfo[vf].vf_mac, mac, ETH_ALEN);
+
+ return 0;
+}
+
+static int dummy_set_vf_vlan(struct net_device *dev, int vf,
+ u16 vlan, u8 qos, __be16 vlan_proto)
+{
+ struct dummy_priv *priv = netdev_priv(dev);
+
+ if ((vf >= priv->num_vfs) || (vlan > 4095) || (qos > 7))
+ return -EINVAL;
+
+ priv->vfinfo[vf].pf_vlan = vlan;
+ priv->vfinfo[vf].pf_qos = qos;
+ priv->vfinfo[vf].vlan_proto = vlan_proto;
+
+ return 0;
+}
+
+static int dummy_set_vf_rate(struct net_device *dev, int vf, int min, int max)
+{
+ struct dummy_priv *priv = netdev_priv(dev);
+
+ if (vf >= priv->num_vfs)
+ return -EINVAL;
+
+ priv->vfinfo[vf].min_tx_rate = min;
+ priv->vfinfo[vf].max_tx_rate = max;
+
+ return 0;
+}
+
+static int dummy_set_vf_spoofchk(struct net_device *dev, int vf, bool val)
+{
+ struct dummy_priv *priv = netdev_priv(dev);
+
+ if (vf >= priv->num_vfs)
+ return -EINVAL;
+
+ priv->vfinfo[vf].spoofchk_enabled = val;
+
+ return 0;
+}
+
+static int dummy_set_vf_rss_query_en(struct net_device *dev, int vf, bool val)
+{
+ struct dummy_priv *priv = netdev_priv(dev);
+
+ if (vf >= priv->num_vfs)
+ return -EINVAL;
+
+ priv->vfinfo[vf].rss_query_enabled = val;
+
+ return 0;
+}
+
+static int dummy_set_vf_trust(struct net_device *dev, int vf, bool val)
+{
+ struct dummy_priv *priv = netdev_priv(dev);
+
+ if (vf >= priv->num_vfs)
+ return -EINVAL;
+
+ priv->vfinfo[vf].trusted = val;
+
+ return 0;
+}
+
+static int dummy_get_vf_config(struct net_device *dev,
+ int vf, struct ifla_vf_info *ivi)
+{
+ struct dummy_priv *priv = netdev_priv(dev);
+
+ if (vf >= priv->num_vfs)
+ return -EINVAL;
+
+ ivi->vf = vf;
+ memcpy(&ivi->mac, priv->vfinfo[vf].vf_mac, ETH_ALEN);
+ ivi->vlan = priv->vfinfo[vf].pf_vlan;
+ ivi->qos = priv->vfinfo[vf].pf_qos;
+ ivi->spoofchk = priv->vfinfo[vf].spoofchk_enabled;
+ ivi->linkstate = priv->vfinfo[vf].link_state;
+ ivi->min_tx_rate = priv->vfinfo[vf].min_tx_rate;
+ ivi->max_tx_rate = priv->vfinfo[vf].max_tx_rate;
+ ivi->rss_query_en = priv->vfinfo[vf].rss_query_enabled;
+ ivi->trusted = priv->vfinfo[vf].trusted;
+ ivi->vlan_proto = priv->vfinfo[vf].vlan_proto;
+
+ return 0;
+}
+
+static int dummy_set_vf_link_state(struct net_device *dev, int vf, int state)
+{
+ struct dummy_priv *priv = netdev_priv(dev);
+
+ if (vf >= priv->num_vfs)
+ return -EINVAL;
+
+ priv->vfinfo[vf].link_state = state;
+
+ return 0;
+}
+
static const struct net_device_ops dummy_netdev_ops = {
.ndo_init = dummy_dev_init,
.ndo_uninit = dummy_dev_uninit,
@@ -121,6 +291,15 @@ static const struct net_device_ops dummy_netdev_ops = {
.ndo_set_mac_address = eth_mac_addr,
.ndo_get_stats64 = dummy_get_stats64,
.ndo_change_carrier = dummy_change_carrier,
+ .ndo_fix_features = dummy_fix_features,
+ .ndo_set_vf_mac = dummy_set_vf_mac,
+ .ndo_set_vf_vlan = dummy_set_vf_vlan,
+ .ndo_set_vf_rate = dummy_set_vf_rate,
+ .ndo_set_vf_spoofchk = dummy_set_vf_spoofchk,
+ .ndo_set_vf_trust = dummy_set_vf_trust,
+ .ndo_get_vf_config = dummy_get_vf_config,
+ .ndo_set_vf_link_state = dummy_set_vf_link_state,
+ .ndo_set_vf_rss_query_en = dummy_set_vf_rss_query_en,
};
static void dummy_get_drvinfo(struct net_device *dev,
@@ -134,6 +313,14 @@ static const struct ethtool_ops dummy_ethtool_ops = {
.get_drvinfo = dummy_get_drvinfo,
};
+static void dummy_free_netdev(struct net_device *dev)
+{
+ struct dummy_priv *priv = netdev_priv(dev);
+
+ kfree(priv->vfinfo);
+ free_netdev(dev);
+}
+
static void dummy_setup(struct net_device *dev)
{
ether_setup(dev);
@@ -141,7 +328,7 @@ static void dummy_setup(struct net_device *dev)
/* Initialize the device structure. */
dev->netdev_ops = &dummy_netdev_ops;
dev->ethtool_ops = &dummy_ethtool_ops;
- dev->destructor = free_netdev;
+ dev->destructor = dummy_free_netdev;
/* Fill in device structure with ethernet-generic values. */
dev->flags |= IFF_NOARP;
@@ -169,6 +356,7 @@ static int dummy_validate(struct nlattr *tb[], struct nlattr *data[])
static struct rtnl_link_ops dummy_link_ops __read_mostly = {
.kind = DRV_NAME,
+ .priv_size = sizeof(struct dummy_priv),
.setup = dummy_setup,
.validate = dummy_validate,
};
@@ -177,12 +365,16 @@ static struct rtnl_link_ops dummy_link_ops __read_mostly = {
module_param(numdummies, int, 0);
MODULE_PARM_DESC(numdummies, "Number of dummy pseudo devices");
+module_param(num_vfs, int, 0);
+MODULE_PARM_DESC(num_vfs, "Number of dummy VFs per dummy device");
+
static int __init dummy_init_one(void)
{
struct net_device *dev_dummy;
int err;
- dev_dummy = alloc_netdev(0, "dummy%d", NET_NAME_UNKNOWN, dummy_setup);
+ dev_dummy = alloc_netdev(sizeof(struct dummy_priv),
+ "dummy%d", NET_NAME_UNKNOWN, dummy_setup);
if (!dev_dummy)
return -ENOMEM;
@@ -190,6 +382,7 @@ static int __init dummy_init_one(void)
err = register_netdevice(dev_dummy);
if (err < 0)
goto err;
+
return 0;
err:
@@ -201,6 +394,8 @@ static int __init dummy_init_module(void)
{
int i, err = 0;
+ pdev_sriov.num_VFs = num_vfs;
+
rtnl_lock();
err = __rtnl_link_register(&dummy_link_ops);
if (err < 0)
--
2.10.0
^ permalink raw reply related
* Re: Debugging Ethernet issues
From: Mason @ 2016-11-14 12:45 UTC (permalink / raw)
To: netdev
Cc: Andrew Lunn, Florian Fainelli, Mans Rullgard, Sergei Shtylyov,
Tom Lendacky, Zach Brown, Shaohui Xie, Tim Beale, Brian Hill,
Vince Bridgers, Balakumaran Kannan, David S. Miller,
Sebastian Frias, Kirill Kapranov
In-Reply-To: <5829AA80.2090102@free.fr>
On 14/11/2016 13:13, Mason wrote:
> This is a different log which I got earlier, but can no longer reproduce:
>
> # tcpdump -n -i eth1-boards ether host 00:16:e8:4b:b0:7d
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth1-boards, link-type EN10MB (Ethernet), capture size 262144 bytes
> 11:08:09.610662 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
> 11:08:10.642852 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
> 11:08:10.643276 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
> 11:08:10.790526 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
> 11:08:11.638146 ARP, Reply 172.27.64.58 is-at 00:16:e8:4b:b0:7d, length 46
> 11:08:11.638156 IP 172.27.64.1 > 172.27.64.58: ICMP echo request, id 29883, seq 0, length 28
> 11:08:11.638345 IP 172.27.64.58 > 172.27.64.1: ICMP echo reply, id 29883, seq 0, length 28
> 11:08:16.642811 ARP, Request who-has 172.27.64.1 tell 172.27.64.58, length 46
> 11:08:16.642822 ARP, Reply 172.27.64.1 is-at 00:15:17:24:e0:81, length 28
Additional tests on v4.7
If I set the link up *BEFORE* running the DHCP client, then I get:
# ip link set eth0 up /* Wait 4-5 seconds */
[ 69.815303] nb8800 26000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
# time udhcpc | while read LINE; do date; echo $LINE; done
Mon Nov 14 13:27:30 UTC 2016
udhcpc (v1.22.1) started
Mon Nov 14 13:27:30 UTC 2016
Sending discover...
Mon Nov 14 13:27:31 UTC 2016
Sending select for 172.27.64.58...
Mon Nov 14 13:27:32 UTC 2016
Lease of 172.27.64.58 obtained, lease time 604800
Mon Nov 14 13:27:32 UTC 2016
deleting routers
Mon Nov 14 13:27:32 UTC 2016
adding dns 172.27.0.17
real 0m1.292s
user 0m0.037s
sys 0m0.087s
# tcpdump -n -i eth1-boards ether host 00:16:e8:4b:b0:7d
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1-boards, link-type EN10MB (Ethernet), capture size 262144 bytes
13:27:30.922880 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:27:30.923055 IP 172.27.64.1 > 172.27.64.58: ICMP echo request, id 29883, seq 0, length 28
13:27:31.924151 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:27:31.936221 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:27:32.061869 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:27:35.933946 ARP, Request who-has 172.27.64.58 tell 172.27.64.1, length 28
13:27:35.934079 ARP, Reply 172.27.64.58 is-at 00:16:e8:4b:b0:7d, length 46
I did see (once) the 9-packet trace (with the ping echo/reply)
# tcpdump -n -i eth1-boards ether host 00:16:e8:4b:b0:7d
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1-boards, link-type EN10MB (Ethernet), capture size 262144 bytes
13:17:31.494117 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:17:32.495374 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:17:32.510753 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:17:32.591262 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:17:33.494093 ARP, Reply 172.27.64.58 is-at 00:16:e8:4b:b0:7d, length 46
13:17:33.494107 IP 172.27.64.1 > 172.27.64.58: ICMP echo request, id 29883, seq 0, length 28
13:17:33.494242 IP 172.27.64.58 > 172.27.64.1: ICMP echo reply, id 29883, seq 0, length 28
13:17:38.500651 ARP, Request who-has 172.27.64.1 tell 172.27.64.58, length 46
13:17:38.500663 ARP, Reply 172.27.64.1 is-at 00:15:17:24:e0:81, length 28
Since the ping request is sent from the DHCP server, perhaps the server
is checking its ARP table. I think it is not an important difference.
Experiment #2
Set link up, then down, then up. Then send DHCP request.
# ip link set eth0 up
[ 39.185326] nb8800 26000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[root@buildroot ~]# ip link set eth0 down
/* WAIT A LONG TIME FOR "Link is Down" MESSAGE */
# ip link set eth0 up
[ 102.818598] nb8800 26000.ethernet eth0: Link is Down
[ 104.828632] nb8800 26000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
Note: isn't it weird that I have to set link up before "link down" message appears?
# time udhcpc | while read LINE; do date; echo $LINE; done
Mon Nov 14 13:40:08 UTC 2016
udhcpc (v1.22.1) started
Mon Nov 14 13:40:08 UTC 2016
Sending discover...
Mon Nov 14 13:40:11 UTC 2016
Sending discover...
Mon Nov 14 13:40:14 UTC 2016
Sending discover...
Mon Nov 14 13:40:37 UTC 2016
Sending discover...
Mon Nov 14 13:40:40 UTC 2016
Sending discover...
Mon Nov 14 13:40:43 UTC 2016
Sending discover...
Mon Nov 14 13:41:06 UTC 2016
Sending discover...
Mon Nov 14 13:41:09 UTC 2016
Sending discover...
Mon Nov 14 13:41:12 UTC 2016
Sending discover...
Mon Nov 14 13:41:35 UTC 2016
Sending discover...
Mon Nov 14 13:41:38 UTC 2016
Sending discover...
Mon Nov 14 13:41:42 UTC 2016
Sending discover...
\x03^C
real 1m37.623s
user 0m0.100s
sys 0m0.053s
# tcpdump -n -i eth1-boards ether host 00:16:e8:4b:b0:7d
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1-boards, link-type EN10MB (Ethernet), capture size 262144 bytes
13:40:08.593122 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:40:09.594365 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:40:11.619772 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:40:11.619925 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:40:14.646372 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:40:14.646535 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:40:37.706093 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:40:37.706257 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:40:40.732693 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:40:40.732827 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:40:43.759342 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:40:43.759475 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:41:06.819024 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:41:06.819201 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:41:09.845671 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:41:09.845807 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:41:12.872271 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:41:12.872396 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:41:35.931994 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:41:35.932162 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:41:38.958593 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:41:38.958742 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
13:41:41.985194 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
13:41:41.985359 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
^C
24 packets captured
24 packets received by filter
0 packets dropped by kernel
Link appears to be broken (not receiving) after up/down/up sequence.
Despite the rx_frames_ok: 8 ???
And why 8, server sent 12 replies...
# ethtool -S eth0
NIC statistics:
rx_bytes_ok: 975
rx_frames_ok: 8
rx_undersize_frames: 0
rx_fragment_frames: 0
rx_64_byte_frames: 6
rx_127_byte_frames: 0
rx_255_byte_frames: 1
rx_511_byte_frames: 1
rx_1023_byte_frames: 0
rx_max_size_frames: 0
rx_oversize_frames: 0
rx_bad_fcs_frames: 0
rx_broadcast_frames: 3
rx_multicast_frames: 1
rx_control_frames: 0
rx_pause_frames: 0
rx_unsup_control_frames: 0
rx_align_error_frames: 0
rx_overrun_frames: 0
rx_jabber_frames: 0
rx_bytes: 975
rx_frames: 8
tx_bytes_ok: 4344
tx_frames_ok: 15
tx_64_byte_frames: 3
tx_127_byte_frames: 0
tx_255_byte_frames: 0
tx_511_byte_frames: 12
tx_1023_byte_frames: 0
tx_max_size_frames: 0
tx_oversize_frames: 0
tx_broadcast_frames: 12
tx_multicast_frames: 0
tx_control_frames: 0
tx_pause_frames: 0
tx_underrun_frames: 0
tx_single_collision_frames: 0
tx_multi_collision_frames: 0
tx_deferred_collision_frames: 0
tx_late_collision_frames: 0
tx_excessive_collision_frames: 0
tx_bytes: 4344
tx_frames: 15
tx_collisions: 0
Will add a few traces, as suggested by Florian.
Regards.
^ permalink raw reply
* Re: [PATCH 3/5] net: thunderx: Fix configuration of L3/L4 length checking
From: Corentin Labbe @ 2016-11-14 12:33 UTC (permalink / raw)
To: sunil.kovvuri; +Cc: netdev, Sunil Goutham, linux-kernel, linux-arm-kernel
In-Reply-To: <1479120886-13425-4-git-send-email-sunil.kovvuri@gmail.com>
On Mon, Nov 14, 2016 at 04:24:44PM +0530, sunil.kovvuri@gmail.com wrote:
> From: Sunil Goutham <sgoutham@cavium.com>
>
> This patch fixes enabling of HW verification of L3/L4 length and
> TCP/UDP checksum which is currently being cleared. Also fixed VLAN
> stripping config which is being cleared when multiqset is enabled.
>
> Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
> ---
> drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> index f0e0ca6..3050177 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> @@ -538,9 +538,12 @@ static void nicvf_rcv_queue_config(struct nicvf *nic, struct queue_set *qs,
> mbx.rq.cfg = (1ULL << 62) | (RQ_CQ_DROP << 8);
> nicvf_send_msg_to_pf(nic, &mbx);
>
> - nicvf_queue_reg_write(nic, NIC_QSET_RQ_GEN_CFG, 0, 0x00);
> - if (!nic->sqs_mode)
> + if (!nic->sqs_mode && (qidx == 0)) {
> + /* Enable checking L3/L4 length and TCP/UDP checksums */
> + nicvf_queue_reg_write(nic, NIC_QSET_RQ_GEN_CFG, 0,
> + ((1 << 24) | (1 << 23) | (1 << 21)));
Hello
You could use the BIT() macro here
Regards
^ permalink raw reply
* Re: Debugging Ethernet issues
From: Mason @ 2016-11-14 12:13 UTC (permalink / raw)
To: Andrew Lunn
Cc: netdev, Florian Fainelli, Mans Rullgard, Sergei Shtylyov,
Tom Lendacky, Zach Brown, Shaohui Xie, Tim Beale, Brian Hill,
Vince Bridgers, Balakumaran Kannan, David S. Miller,
Sebastian Frias, Kirill Kapranov
In-Reply-To: <20161113030919.GA2892@lunn.ch>
On 13/11/2016 04:09, Andrew Lunn wrote:
> Mason wrote:
>
>> When connected to a Gigabit switch
>> 3.4 negotiates a LAN DHCP setup instantly
>> 4.7 requires over 5 seconds to do so
>
> When you run tcpdump on the DHCP server, are you noticing the first
> request is missing?
>
> What can happen is the dhclient gets started immediately and sends out
> its first request before auto-negotiation has finished. So this first packet
> gets lost. The retransmit after a few seconds is then successful.
This is what happens on 3.4
# time udhcpc | while read LINE; do date; echo $LINE; done
Mon Nov 14 11:57:12 UTC 2016
udhcpc (v1.22.1) started
Mon Nov 14 11:57:12 UTC 2016
Sending discover...
[ 50.150000] tangox-enet.0: link up (1000 Mbps - Full Duplex)
Mon Nov 14 11:57:15 UTC 2016
Sending discover...
Mon Nov 14 11:57:16 UTC 2016
Sending select for 172.27.64.58...
Mon Nov 14 11:57:17 UTC 2016
Lease of 172.27.64.58 obtained, lease time 604800
Mon Nov 14 11:57:17 UTC 2016
deleting routers
Mon Nov 14 11:57:17 UTC 2016
adding dns 172.27.0.17
real 0m4.704s
user 0m0.030s
sys 0m0.550s
The corresponding log on the DHCP server was
# tcpdump -n -i eth1-boards ether host 00:16:e8:4b:b0:7d
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1-boards, link-type EN10MB (Ethernet), capture size 262144 bytes
11:57:16.095474 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
11:57:16.095638 IP 172.27.64.1 > 172.27.64.58: ICMP echo request, id 29883, seq 0, length 28
11:57:17.096740 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
11:57:17.097182 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
11:57:17.202842 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
11:57:21.101946 ARP, Request who-has 172.27.64.58 tell 172.27.64.1, length 28
11:57:21.102182 ARP, Reply 172.27.64.58 is-at 00:16:e8:4b:b0:7d, length 46
This is a different log which I got earlier, but can no longer reproduce:
# tcpdump -n -i eth1-boards ether host 00:16:e8:4b:b0:7d
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1-boards, link-type EN10MB (Ethernet), capture size 262144 bytes
11:08:09.610662 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
11:08:10.642852 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
11:08:10.643276 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
11:08:10.790526 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
11:08:11.638146 ARP, Reply 172.27.64.58 is-at 00:16:e8:4b:b0:7d, length 46
11:08:11.638156 IP 172.27.64.1 > 172.27.64.58: ICMP echo request, id 29883, seq 0, length 28
11:08:11.638345 IP 172.27.64.58 > 172.27.64.1: ICMP echo reply, id 29883, seq 0, length 28
11:08:16.642811 ARP, Request who-has 172.27.64.1 tell 172.27.64.58, length 46
11:08:16.642822 ARP, Reply 172.27.64.1 is-at 00:15:17:24:e0:81, length 28
This is what happens on v4.7
# time udhcpc | while read LINE; do date; echo $LINE; done
Mon Nov 14 11:51:25 UTC 2016
udhcpc (v1.22.1) started
Mon Nov 14 11:51:25 UTC 2016
Sending discover...
Mon Nov 14 11:51:28 UTC 2016
Sending discover...
[ 342.658572] nb8800 26000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
Mon Nov 14 11:51:32 UTC 2016
Sending discover...
Mon Nov 14 11:51:33 UTC 2016
Sending select for 172.27.64.58...
Mon Nov 14 11:51:33 UTC 2016
Lease of 172.27.64.58 obtained, lease time 604800
Mon Nov 14 11:51:33 UTC 2016
deleting routers
Mon Nov 14 11:51:33 UTC 2016
adding dns 172.27.0.17
real 0m7.348s
user 0m0.053s
sys 0m0.077s
The corresponding log on the DHCP server was
# tcpdump -n -i eth1-boards ether host 00:16:e8:4b:b0:7d
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1-boards, link-type EN10MB (Ethernet), capture size 262144 bytes
11:51:31.957245 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
11:51:31.957409 IP 172.27.64.1 > 172.27.64.58: ICMP echo request, id 29883, seq 0, length 28
11:51:32.958514 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
11:51:32.970538 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:16:e8:4b:b0:7d, length 300
11:51:33.038205 IP 172.27.200.1.67 > 172.27.64.58.68: BOOTP/DHCP, Reply, length 391
11:51:36.957949 ARP, Request who-has 172.27.64.58 tell 172.27.64.1, length 28
11:51:36.958112 ARP, Reply 172.27.64.58 is-at 00:16:e8:4b:b0:7d, length 46
For reference, here are the ethtool statistics on 4.7 after the DHCP setup:
# ethtool -S eth0
NIC statistics:
rx_bytes_ok: 1747
rx_frames_ok: 9
rx_undersize_frames: 0
rx_fragment_frames: 0
rx_64_byte_frames: 3
rx_127_byte_frames: 2
rx_255_byte_frames: 1
rx_511_byte_frames: 3
rx_1023_byte_frames: 0
rx_max_size_frames: 0
rx_oversize_frames: 0
rx_bad_fcs_frames: 0
rx_broadcast_frames: 4
rx_multicast_frames: 1
rx_control_frames: 0
rx_pause_frames: 0
rx_unsup_control_frames: 0
rx_align_error_frames: 0
rx_overrun_frames: 0
rx_jabber_frames: 0
rx_bytes: 1747
rx_frames: 9
tx_bytes_ok: 756
tx_frames_ok: 3
tx_64_byte_frames: 1
tx_127_byte_frames: 0
tx_255_byte_frames: 0
tx_511_byte_frames: 2
tx_1023_byte_frames: 0
tx_max_size_frames: 0
tx_oversize_frames: 0
tx_broadcast_frames: 2
tx_multicast_frames: 0
tx_control_frames: 0
tx_pause_frames: 0
tx_underrun_frames: 0
tx_single_collision_frames: 0
tx_multi_collision_frames: 0
tx_deferred_collision_frames: 0
tx_late_collision_frames: 0
tx_excessive_collision_frames: 0
tx_bytes: 756
tx_frames: 3
tx_collisions: 0
Regards.
^ permalink raw reply
* Re: [PATCH 0/5] net: thunderx: Miscellaneous fixes
From: Matthias Brugger @ 2016-11-14 12:01 UTC (permalink / raw)
To: sunil.kovvuri, netdev; +Cc: Sunil Goutham, linux-kernel, linux-arm-kernel
In-Reply-To: <1479120886-13425-1-git-send-email-sunil.kovvuri@gmail.com>
On 14/11/16 11:54, sunil.kovvuri@gmail.com wrote:
> From: Sunil Goutham <sgoutham@cavium.com>
>
> This patchset includes fixes for incorrect LMAC credits,
> unreliable driver statistics, memory leak upon interface
> down e.t.c
>
Are these fixes relevant to for older kernels as well?
If so, please add "Cc: stable@vger.kernel.org" to the Sigend-off list.
Thanks,
Matthias
> Radha Mohan Chintakuntla (1):
> net: thunderx: Introduce BGX_ID_MASK macro to extract bgx_id
>
> Sunil Goutham (4):
> net: thunderx: Program LMAC credits based on MTU
> net: thunderx: Fix configuration of L3/L4 length checking
> net: thunderx: Fix VF driver's interface statistics
> net: thunderx: Fix memory leak and other issues upon interface toggle
>
> drivers/net/ethernet/cavium/thunder/nic.h | 64 +++++----
> drivers/net/ethernet/cavium/thunder/nic_main.c | 37 +++--
> drivers/net/ethernet/cavium/thunder/nic_reg.h | 1 +
> .../net/ethernet/cavium/thunder/nicvf_ethtool.c | 105 +++++++-------
> drivers/net/ethernet/cavium/thunder/nicvf_main.c | 153 +++++++++++----------
> drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 118 +++++++++-------
> drivers/net/ethernet/cavium/thunder/nicvf_queues.h | 24 +---
> drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 4 +-
> drivers/net/ethernet/cavium/thunder/thunder_bgx.h | 2 +
> 9 files changed, 274 insertions(+), 234 deletions(-)
>
^ permalink raw reply
* Re: [PATCH] Fixup packets with incorrect ethertype sent by ZTE MF821D
From: Bjørn Mork @ 2016-11-14 11:37 UTC (permalink / raw)
To: Jussi Peltola; +Cc: netdev
In-Reply-To: <20161114002218.GW2745@pokute.pelzi.net>
Jussi Peltola <plz@plz.fi> writes:
> So here's another stab. The comments and the current implementation are
> not in sync: any non-multicast address starting with a null octet gets
> rewritten, while the comment specifically mentions 00:a0:c6:00:00:00. It
> is certainly not elegant but re-writing all unicast destinations with
> our address does come to mind instead of special cases.
The known bug is related to 00:a0:c6:00:00:00 only. But the workaround
catches anything starting with 00 for simplicity. It's a deliberate
trade-off. Could probably be clearer from the comments, yes.
> This patch fails to handle the invalid destinations in either way so I
> will send another one if you think it's worthwhile to go on. And it
> seems I forgot htons but I need this device for work now so a better
> patch must wait :)
>
> commit 35d3a46b7f1ece70e24386acbdd16af4507cb5f3
> Author: Jussi Peltola <plz@plz.fi>
> Date: Mon Nov 14 01:45:32 2016 +0200
>
> Attempt to fix up packets with a broken ethernet header
>
> Signed-off-by: Jussi Peltola <plz@plz.fi>
>
> diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
> index 3ff76c6..7308d6b 100644
> --- a/drivers/net/usb/qmi_wwan.c
> +++ b/drivers/net/usb/qmi_wwan.c
> @@ -153,25 +153,57 @@ static const u8 default_modem_addr[ETH_ALEN] = {0x02, 0x50, 0xf3};
>
> static const u8 buggy_fw_addr[ETH_ALEN] = {0x00, 0xa0, 0xc6, 0x00, 0x00, 0x00};
>
> -/* Make up an ethernet header if the packet doesn't have one.
> +/* Check if the ethernet header has an unknown ethertype, and return a
> + * guess of the correct one based on the L3 header, or zero if the type was
> + * known or detection failed.
> + */
> +static __be16 detect_bogus_header(struct sk_buff *skb) {
> + struct ethhdr *eth_hdr = (struct ethhdr*) skb->data;
> +
> + switch (eth_hdr->h_proto) {
> + case ETH_P_IP:
> + case ETH_P_IPV6:
> + case ETH_P_ARP:
> + return 0;
> + default:
> + switch (skb->data[14] & 0xf0) {
> + case 0x40:
> + return htons(ETH_P_IP);
> + case 0x60:
> + return htons(ETH_P_IPV6);
> + default:
> + /* pass on undetectable packets */
> + return 0;
> + }
> + }
> + /*NOTREACHED*/
> + return 0;
> +}
> +
> +/* Make up an ethernet header if the packet doesn't have a correct one.
> *
> * A firmware bug common among several devices cause them to send raw
> * IP packets under some circumstances. There is no way for the
> * driver/host to know when this will happen. And even when the bug
> * hits, some packets will still arrive with an intact header.
> *
> - * The supported devices are only capably of sending IPv4, IPv6 and
> + * The supported devices are only capable of sending IPv4, IPv6 and
> * ARP packets on a point-to-point link. Any packet with an ethernet
> * header will have either our address or a broadcast/multicast
> - * address as destination. ARP packets will always have a header.
> + * address as destination. ARP packets will always have a header.
> *
> * This means that this function will reliably add the appropriate
> - * header iff necessary, provided our hardware address does not start
> + * header if necessary, provided our hardware address does not start
> * with 4 or 6.
> *
> * Another common firmware bug results in all packets being addressed
> * to 00:a0:c6:00:00:00 despite the host address being different.
> - * This function will also fixup such packets.
> + *
> + * Some devices will send packets with garbage source/destination MACs and
> + * ethertypes.
> + *
> + * This function will try to fix up all such packets.
> + *
> */
> static int qmi_wwan_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
> {
> @@ -179,8 +211,8 @@ static int qmi_wwan_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
> bool rawip = info->flags & QMI_WWAN_FLAG_RAWIP;
> __be16 proto;
>
> - /* This check is no longer done by usbnet */
> - if (skb->len < dev->net->hard_header_len)
> + /* Shorter is definitely invalid and breaks subsequent tests */
> + if (skb->len < 15)
> return 0;
>
> switch (skb->data[0] & 0xf0) {
Makes sense, but could we please use some reasonable macro or something
instead of an arbitrary magic number? Anything shorter than an IP
header will be bogus, for example.
> @@ -190,17 +222,17 @@ static int qmi_wwan_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
> case 0x60:
> proto = htons(ETH_P_IPV6);
> break;
> - case 0x00:
> + default:
> if (rawip)
> return 0;
> if (is_multicast_ether_addr(skb->data))
> return 1;
> - /* possibly bogus destination - rewrite just in case */
> - skb_reset_mac_header(skb);
> - goto fix_dest;
> - default:
> - if (rawip)
> - return 0;
> + proto = detect_bogus_header(skb);
> + if (proto) {
> + /* remove terminally broken header */
> + skb_pull(skb, ETH_HLEN);
> + break;
> + }
> /* pass along other packets without modifications */
> return 1;
> }
Am I missing somehting, or is this removing the bogus destination
address fixup? In any case, I don't have any device with that bug
available or the time to go and test it. So I want you to either leave
that part of the code alone, or verify the workaround on a device with
the "00:a0:c6:00:00:00" bug.
> @@ -208,17 +240,17 @@ static int qmi_wwan_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
> skb->dev = dev->net; /* normally set by eth_type_trans */
> skb->protocol = proto;
> return 1;
> + } else {
Completely unnecessary "else". We return from the other branch.
In any case, this change is not related to the rest of the patch and is
just making review more confusing. And why would you add more indenting
levels than strictly necessary?
But please: Try to make the device work with raw-ip first. Qualcomm
never managed to fix their fake ethernet mode, and have given up on it.
Newer devices don't have that mode at all. The device you have does
still have the mode. But as is pretty obvious from the observed
behaviour: It's completely untested and not really working. If you look
at the Windows driver, I'm pretty sure you'll find that the only reason
it works is because they use raw-ip mode.
FWIW, I regret having chosen the fake ethernet mode as the default for
qmi_wwan. I did not anticipate the number of firmware issues Qualcomm
could manage to create simply adding a made-up ethernet header. And I
did not anticipate them finally dropping it on the floor, forcing us to
support raw-ip mode anyway.
But with raw-ip support, there is absolutely no reason to play around
with fixing up the ethernet header bugs anymore. If you have an older
device where it sort of works, then by all means use it. But otherwise:
Use raw-ip mode. I don't think I would have wanted any of the existing
header fixups either if the driver had supported raw-ip when they were
proposed.
Bjørn
^ permalink raw reply
* Re: [RFC PATCH 1/2] net: use cmpxchg instead of spinlock in ptr rings
From: Jesper Dangaard Brouer @ 2016-11-14 11:09 UTC (permalink / raw)
To: John Fastabend
Cc: jasowang, netdev, linux-kernel, Michael S. Tsirkin, Jason Wang,
Mathieu Desnoyers
In-Reply-To: <20161111044408.1547.92737.stgit@john-Precision-Tower-5810>
On Thu, 10 Nov 2016 20:44:08 -0800 John Fastabend <john.fastabend@gmail.com> wrote:
> ---
> include/linux/ptr_ring_ll.h | 136 +++++++++++++++++++++++++++++++++++++++++++
> include/linux/skb_array.h | 25 ++++++++
> 2 files changed, 161 insertions(+)
> create mode 100644 include/linux/ptr_ring_ll.h
>
> diff --git a/include/linux/ptr_ring_ll.h b/include/linux/ptr_ring_ll.h
> new file mode 100644
> index 0000000..bcb11f3
> --- /dev/null
> +++ b/include/linux/ptr_ring_ll.h
> @@ -0,0 +1,136 @@
> +/*
> + * Definitions for the 'struct ptr_ring_ll' datastructure.
> + *
> + * Author:
> + * John Fastabend <john.r.fastabend@intel.com>
[...]
> + *
> + * This is a limited-size FIFO maintaining pointers in FIFO order, with
> + * one CPU producing entries and another consuming entries from a FIFO.
> + * extended from ptr_ring_ll to use cmpxchg over spin lock.
It sounds like this is Single Producer Single Consumer (SPSC)
implementation, but your implementation actually is Multi Producer
Multi Consumer (MPMC) capable.
The implementation looks a lot like my alf_queue[1] implementation:
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h
If the primary use-case is one CPU producing and another consuming,
then the normal ptr_ring (skb_array) will actually be faster!
The reason is ptr_ring avoids bouncing a cache-line between the CPUs on
every ring access. This is achieved by having the checks for full
(__ptr_ring_full) and empty (__ptr_ring_empty) use the contents of the
array (NULL value).
I actually implemented two micro-benchmarks to measure the difference
between skb_array[2] and alf_queue[3]:
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
[3] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/alf_queue_parallel01.c
> + */
> +
> +#ifndef _LINUX_PTR_RING_LL_H
> +#define _LINUX_PTR_RING_LL_H 1
> +
[...]
> +
> +struct ptr_ring_ll {
> + u32 prod_size;
> + u32 prod_mask;
> + u32 prod_head;
> + u32 prod_tail;
> + u32 cons_size;
> + u32 cons_mask;
> + u32 cons_head;
> + u32 cons_tail;
> +
> + void **queue;
> +};
Your implementation doesn't even split the consumer and producer into
different cachelines (which in practice doesn't help much due to how
the empty/full checks are performed).
> +
> +/* Note: callers invoking this in a loop must use a compiler barrier,
> + * for example cpu_relax(). Callers must hold producer_lock.
> + */
> +static inline int __ptr_ring_ll_produce(struct ptr_ring_ll *r, void *ptr)
> +{
> + u32 ret, head, tail, next, slots, mask;
> +
> + do {
> + head = READ_ONCE(r->prod_head);
> + mask = READ_ONCE(r->prod_mask);
> + tail = READ_ONCE(r->cons_tail);
Problem occur here, as the producer need to access/read the consumers
tail, to determine if the queue is not already full (slots avail).
Thus, the next "consumer-CPU" will see the cacheline in wrong state
(Modified/Invalid or Shared).
> +
> + slots = mask + tail - head;
> + if (slots < 1)
> + return -ENOMEM;
> +
> + next = head + 1;
> + ret = cmpxchg(&r->prod_head, head, next);
> + } while (ret != head);
> +
> + r->queue[head & mask] = ptr;
> + smp_wmb();
> +
> + while (r->prod_tail != head)
> + cpu_relax();
> +
> + r->prod_tail = next;
> + return 0;
> +}
> +
> +static inline void *__ptr_ring_ll_consume(struct ptr_ring_ll *r)
> +{
> + u32 ret, head, tail, next, slots, mask;
> + void *ptr;
> +
> + do {
> + head = READ_ONCE(r->cons_head);
> + mask = READ_ONCE(r->cons_mask);
> + tail = READ_ONCE(r->prod_tail);
Like wise the consumer is reading the producer tail (for the empty check).
> +
> + slots = tail - head;
> + if (slots < 1)
> + return ERR_PTR(-ENOMEM);
> +
> + next = head + 1;
> + ret = cmpxchg(&r->cons_head, head, next);
> + } while (ret != head);
> +
> + ptr = r->queue[head & mask];
> + smp_rmb();
> +
> + while (r->cons_tail != head)
> + cpu_relax();
> +
> + r->cons_tail = next;
> + return ptr;
> +}
> +
> +static inline void **__ptr_ring_ll_init_queue_alloc(int size, gfp_t gfp)
> +{
> + return kzalloc(ALIGN(size * sizeof(void *), SMP_CACHE_BYTES), gfp);
> +}
> +
> +static inline int ptr_ring_ll_init(struct ptr_ring_ll *r, int size, gfp_t gfp)
> +{
> + r->queue = __ptr_ring_init_queue_alloc(size, gfp);
> + if (!r->queue)
> + return -ENOMEM;
> +
> + r->prod_size = r->cons_size = size;
> + r->prod_mask = r->cons_mask = size - 1;
Shouldn't we have some check like is_power_of_2(size), as this code
looks like it depend on this.
> + r->prod_tail = r->prod_head = 0;
> + r->cons_tail = r->prod_tail = 0;
> +
> + return 0;
> +}
> +
[...]
> +#endif /* _LINUX_PTR_RING_LL_H */
> diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
> index f4dfade..9b43dfd 100644
> --- a/include/linux/skb_array.h
> +++ b/include/linux/skb_array.h
[...]
>
> +static inline int skb_array_ll_produce(struct skb_array_ll *a, struct sk_buff *skb)
> +{
> + return __ptr_ring_ll_produce(&a->ring, skb);
> +}
> +
[...]
>
> +static inline struct sk_buff *skb_array_ll_consume(struct skb_array_ll *a)
> +{
> + return __ptr_ring_ll_consume(&a->ring);
> +}
> +
Note in the Multi Producer Multi Consumer (MPMC) use-case this type of
queue can be faster than normal ptr_ring. And in patch2 you implement
bulking, which is where the real benefit shows (in the MPMC case) for
this kind of queue.
What I would really like to see is a lock-free (locked cmpxchg) queue
implementation, what like ptr_ring use the array as empty/full check,
and still (somehow) support bulking.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply
* [PATCH] net/phy: add trace events for mdio accesses
From: Uwe Kleine-König @ 2016-11-14 11:03 UTC (permalink / raw)
To: Florian Fainelli, Steven Rostedt, Ingo Molnar; +Cc: netdev
Make it possible to generate trace events for mdio read and write accesses.
Signed-off-by: Uwe Kleine-König <uwe@kleine-koenig.org>
---
drivers/net/phy/mdio_bus.c | 15 +++++++++++++++
include/trace/events/mdio.h | 40 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+)
create mode 100644 include/trace/events/mdio.h
diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 09deef4bed09..0f3f207419f6 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -38,6 +38,9 @@
#include <asm/irq.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/mdio.h>
+
int mdiobus_register_device(struct mdio_device *mdiodev)
{
if (mdiodev->bus->mdio_map[mdiodev->addr])
@@ -461,6 +464,9 @@ int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 regnum)
retval = bus->read(bus, addr, regnum);
mutex_unlock(&bus->mdio_lock);
+ if (retval >= 0)
+ trace_mdio_access(bus, 1, addr, regnum, retval);
+
return retval;
}
EXPORT_SYMBOL(mdiobus_read_nested);
@@ -485,6 +491,9 @@ int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum)
retval = bus->read(bus, addr, regnum);
mutex_unlock(&bus->mdio_lock);
+ if (retval >= 0)
+ trace_mdio_access(bus, 1, addr, regnum, retval);
+
return retval;
}
EXPORT_SYMBOL(mdiobus_read);
@@ -513,6 +522,9 @@ int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val)
err = bus->write(bus, addr, regnum, val);
mutex_unlock(&bus->mdio_lock);
+ if (!err)
+ trace_mdio_access(bus, 0, addr, regnum, val);
+
return err;
}
EXPORT_SYMBOL(mdiobus_write_nested);
@@ -538,6 +550,9 @@ int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val)
err = bus->write(bus, addr, regnum, val);
mutex_unlock(&bus->mdio_lock);
+ if (!err)
+ trace_mdio_access(bus, 0, addr, regnum, val);
+
return err;
}
EXPORT_SYMBOL(mdiobus_write);
diff --git a/include/trace/events/mdio.h b/include/trace/events/mdio.h
new file mode 100644
index 000000000000..dcb2d456a346
--- /dev/null
+++ b/include/trace/events/mdio.h
@@ -0,0 +1,40 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mdio
+
+#if !defined(_TRACE_MDIO_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MDIO_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(mdio_access,
+
+ TP_PROTO(struct mii_bus *bus, int read,
+ unsigned addr, unsigned regnum, u16 val),
+
+ TP_ARGS(bus, read, addr, regnum, val),
+
+ TP_STRUCT__entry(
+ __array(char, busid, MII_BUS_ID_SIZE)
+ __field(int, read)
+ __field(unsigned, addr)
+ __field(unsigned, regnum)
+ __field(u16, val)
+ ),
+
+ TP_fast_assign(
+ strncpy(__entry->busid, bus->id, MII_BUS_ID_SIZE);
+ __entry->read = read;
+ __entry->addr = addr;
+ __entry->regnum = regnum;
+ __entry->val = val;
+ ),
+
+ TP_printk("%s %-5s phy:0x%02x reg:0x%02x val:0x%04hx",
+ __entry->busid, __entry->read ? "read" : "write",
+ __entry->addr, __entry->regnum, __entry->val)
+);
+
+#endif /* if !defined(_TRACE_MDIO_H) || defined(TRACE_HEADER_MULTI_READ) */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--
2.10.2
^ permalink raw reply related
* Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
From: Jerome Brunet @ 2016-11-14 11:02 UTC (permalink / raw)
To: Giuseppe CAVALLARO, Martin Blumenstingl
Cc: André Roth, Alexandre Torgue, Johnson Leung, linux-amlogic,
netdev, afaerber
In-Reply-To: <68696e92-d056-5f29-9e71-72066f0c7673@st.com>
On Mon, 2016-11-14 at 08:47 +0100, Giuseppe CAVALLARO wrote:
> Hello Martin
>
> On 11/7/2016 6:37 PM, Martin Blumenstingl wrote:
> >
> > Hi Peppe,
> >
> > On Mon, Nov 7, 2016 at 11:59 AM, Giuseppe CAVALLARO
> > <peppe.cavallaro@st.com> wrote:
> > >
> > > In the meantime, I will read again the thread just to see if
> > > there is something I am missing.
> > if you are re-reading this thread: please note that there are two
> > devices in discussion here!
>
> many thx for the sum :-)
>
> >
> > Both are using the Amlogic S905 (GXBB) SoC and both are
> > experiencing
> > the same issue (Gbit TX issues, RX with Gbit speeds and RX/TX with
> > 100Mbit speed are NOT affected):
> > - Odroid-C2 (used by Jerome and André Roth)
> > - Tronsmart Vega S95 Meta (my device)
> >
> > The (Gbit TX) problem seems to be gone on the Odroid-C2 with
> > Jerome's
> > patch which disables EEE in drivers/net/phy/realtek.c (at least in
> > his
> > tests, I don't have that device so I can't verify).
> > The same problem still appears on my Tronsmart Vega S95 Meta even
> > with
> > the patched PHY driver.
>
> just an doubt, maybe useful, in the past, on GiGa setup I saw similar
> problems and it was due to retiming so maybe 2ns could be necessary
> (or better granularity via PAD logic if available).
>
> Regards
> Peppe
Peppe, Martin,
With Andre's feedback, I think we can confirm that disabling EEE solve
the problem for the OdroidC2 design.
We do have the same results as Martin on MXQ-Pro based designs. For
these particular boards, disabling EEE does not seems to enough to get
a stable Tx path in 1000Base-T.
I will submit the patch for the Odroidc2 later today.
For the Vega, you should probably check the Tx delay as Peppe suggests.
To do these tests, It would probably be better to disable EEE as well.
Do you want me to include this change for the vega in the patch ?
Cheers
Jerome
>
> >
> > Unfortunately I don't have a second device to rule out that my
> > Tronsmart Vega S95 Meta could be broken (not unlikely, I get DDR
> > errors from time to time in u-boot). Maybe Andreas Faerber can test
> > ethernet with and without Jerome's patch on one of his Tronsmart
> > devices.
> >
> >
> > Regards,
> > Martin
> >
>
^ permalink raw reply
* [PATCH 1/5] net: thunderx: Introduce BGX_ID_MASK macro to extract bgx_id
From: sunil.kovvuri @ 2016-11-14 10:54 UTC (permalink / raw)
To: netdev
Cc: linux-kernel, linux-arm-kernel, Radha Mohan Chintakuntla,
Sunil Goutham
In-Reply-To: <1479120886-13425-1-git-send-email-sunil.kovvuri@gmail.com>
From: Radha Mohan Chintakuntla <rchintakuntla@cavium.com>
This patch fixes the 'bgx_id' determination on 83xx where there are
4 BGX blocks instead of 2 on other platforms.
Signed-off-by: Radha Mohan Chintakuntla <rchintakuntla@cavium.com>
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
---
drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 4 ++--
drivers/net/ethernet/cavium/thunder/thunder_bgx.h | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
index 8bbaedb..050e21f 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
@@ -1242,8 +1242,8 @@ static int bgx_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
pci_read_config_word(pdev, PCI_DEVICE_ID, &sdevid);
if (sdevid != PCI_DEVICE_ID_THUNDER_RGX) {
- bgx->bgx_id =
- (pci_resource_start(pdev, PCI_CFG_REG_BAR_NUM) >> 24) & 1;
+ bgx->bgx_id = (pci_resource_start(pdev,
+ PCI_CFG_REG_BAR_NUM) >> 24) & BGX_ID_MASK;
bgx->bgx_id += nic_get_node_id(pdev) * MAX_BGX_PER_NODE;
bgx->max_lmac = MAX_LMAC_PER_BGX;
bgx_vnic[bgx->bgx_id] = bgx;
diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
index d59c71e..01cc7c8 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
@@ -28,6 +28,8 @@
#define MAX_DMAC_PER_LMAC 8
#define MAX_FRAME_SIZE 9216
+#define BGX_ID_MASK 0x3
+
#define MAX_DMAC_PER_LMAC_TNS_BYPASS_MODE 2
/* Registers */
--
2.7.4
^ permalink raw reply related
* [PATCH 0/5] net: thunderx: Miscellaneous fixes
From: sunil.kovvuri @ 2016-11-14 10:54 UTC (permalink / raw)
To: netdev; +Cc: linux-kernel, linux-arm-kernel, Sunil Goutham
From: Sunil Goutham <sgoutham@cavium.com>
This patchset includes fixes for incorrect LMAC credits,
unreliable driver statistics, memory leak upon interface
down e.t.c
Radha Mohan Chintakuntla (1):
net: thunderx: Introduce BGX_ID_MASK macro to extract bgx_id
Sunil Goutham (4):
net: thunderx: Program LMAC credits based on MTU
net: thunderx: Fix configuration of L3/L4 length checking
net: thunderx: Fix VF driver's interface statistics
net: thunderx: Fix memory leak and other issues upon interface toggle
drivers/net/ethernet/cavium/thunder/nic.h | 64 +++++----
drivers/net/ethernet/cavium/thunder/nic_main.c | 37 +++--
drivers/net/ethernet/cavium/thunder/nic_reg.h | 1 +
.../net/ethernet/cavium/thunder/nicvf_ethtool.c | 105 +++++++-------
drivers/net/ethernet/cavium/thunder/nicvf_main.c | 153 +++++++++++----------
drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 118 +++++++++-------
drivers/net/ethernet/cavium/thunder/nicvf_queues.h | 24 +---
drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 4 +-
drivers/net/ethernet/cavium/thunder/thunder_bgx.h | 2 +
9 files changed, 274 insertions(+), 234 deletions(-)
--
2.7.4
^ permalink raw reply
* [PATCH 5/5] net: thunderx: Fix memory leak and other issues upon interface toggle
From: sunil.kovvuri @ 2016-11-14 10:54 UTC (permalink / raw)
To: netdev; +Cc: Sunil Goutham, linux-kernel, linux-arm-kernel
In-Reply-To: <1479120886-13425-1-git-send-email-sunil.kovvuri@gmail.com>
From: Sunil Goutham <sgoutham@cavium.com>
This patch fixes the following
1. When interface is being teardown and queues are being cleaned up,
free pending SKBs that are in SQ which are either not transmitted
or freed as NAPI is disabled by that time.
2. While interface initialization, delay CFG_DONE notification till
the end to avoid corner cases where TXQs are enabled but CQ
interrupts are not which results blocking transmission and kicking
off watchdog.
3. Check for IFF_UP while re-enabling RBDR interrupts from tasklet.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
---
drivers/net/ethernet/cavium/thunder/nicvf_main.c | 11 +++++------
drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 14 +++++++++++++-
2 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 9dc79c05..8a37012 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -473,9 +473,6 @@ int nicvf_set_real_num_queues(struct net_device *netdev,
static int nicvf_init_resources(struct nicvf *nic)
{
int err;
- union nic_mbx mbx = {};
-
- mbx.msg.msg = NIC_MBOX_MSG_CFG_DONE;
/* Enable Qset */
nicvf_qset_config(nic, true);
@@ -488,9 +485,6 @@ static int nicvf_init_resources(struct nicvf *nic)
return err;
}
- /* Send VF config done msg to PF */
- nicvf_write_to_mbx(nic, &mbx);
-
return 0;
}
@@ -1184,6 +1178,7 @@ int nicvf_open(struct net_device *netdev)
struct nicvf *nic = netdev_priv(netdev);
struct queue_set *qs = nic->qs;
struct nicvf_cq_poll *cq_poll = NULL;
+ union nic_mbx mbx = {};
netif_carrier_off(netdev);
@@ -1271,6 +1266,10 @@ int nicvf_open(struct net_device *netdev)
for (qidx = 0; qidx < qs->rbdr_cnt; qidx++)
nicvf_enable_intr(nic, NICVF_INTR_RBDR, qidx);
+ /* Send VF config done msg to PF */
+ mbx.msg.msg = NIC_MBOX_MSG_CFG_DONE;
+ nicvf_write_to_mbx(nic, &mbx);
+
return 0;
cleanup:
nicvf_disable_intr(nic, NICVF_INTR_MBOX, 0);
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index 86726ab..43ccc9d 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -271,7 +271,8 @@ static void nicvf_refill_rbdr(struct nicvf *nic, gfp_t gfp)
rbdr_idx, new_rb);
next_rbdr:
/* Re-enable RBDR interrupts only if buffer allocation is success */
- if (!nic->rb_alloc_fail && rbdr->enable)
+ if (!nic->rb_alloc_fail && rbdr->enable &&
+ netif_running(nic->pnicvf->netdev))
nicvf_enable_intr(nic, NICVF_INTR_RBDR, rbdr_idx);
if (rbdr_idx)
@@ -362,6 +363,8 @@ static int nicvf_init_snd_queue(struct nicvf *nic,
static void nicvf_free_snd_queue(struct nicvf *nic, struct snd_queue *sq)
{
+ struct sk_buff *skb;
+
if (!sq)
return;
if (!sq->dmem.base)
@@ -372,6 +375,15 @@ static void nicvf_free_snd_queue(struct nicvf *nic, struct snd_queue *sq)
sq->dmem.q_len * TSO_HEADER_SIZE,
sq->tso_hdrs, sq->tso_hdrs_phys);
+ /* Free pending skbs in the queue */
+ smp_rmb();
+ while (sq->head != sq->tail) {
+ skb = (struct sk_buff *)sq->skbuff[sq->head];
+ if (skb)
+ dev_kfree_skb_any(skb);
+ sq->head++;
+ sq->head &= (sq->dmem.q_len - 1);
+ }
kfree(sq->skbuff);
nicvf_free_q_desc_mem(nic, &sq->dmem);
}
--
2.7.4
^ permalink raw reply related
* [PATCH 4/5] net: thunderx: Fix VF driver's interface statistics
From: sunil.kovvuri @ 2016-11-14 10:54 UTC (permalink / raw)
To: netdev; +Cc: linux-kernel, linux-arm-kernel, Sunil Goutham
In-Reply-To: <1479120886-13425-1-git-send-email-sunil.kovvuri@gmail.com>
From: Sunil Goutham <sgoutham@cavium.com>
This patch fixes multiple issues
1. Convert all driver statistics to percpu counters for accuracy.
2. To avoid multiple CQEs posted by a TSO packet appended to HW,
TSO pkt's SQE has 'post_cqe' not set but a dummy SQE is added
for getting HW transmit completion notification. This dummy
SQE has 'dont_send' set and HW drops the pkt pointed to in this
thus Tx drop counter increases. This patch fixes this by subtracting
SW tx tso counter from HW Tx drop counter for actual packet drop counter.
3. Reset all individual queue's and VNIC HW stats when interface is going down.
4. Getrid off unnecessary counters in hot path.
5. Bringout all CQE error stats i.e both Rx and Tx.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
---
drivers/net/ethernet/cavium/thunder/nic.h | 61 +++++++-----
drivers/net/ethernet/cavium/thunder/nic_main.c | 1 +
.../net/ethernet/cavium/thunder/nicvf_ethtool.c | 105 +++++++++++---------
drivers/net/ethernet/cavium/thunder/nicvf_main.c | 106 +++++++++++----------
drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 96 +++++++++----------
drivers/net/ethernet/cavium/thunder/nicvf_queues.h | 24 +----
6 files changed, 197 insertions(+), 196 deletions(-)
diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
index cd2d379..86bd93c 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -178,11 +178,11 @@ enum tx_stats_reg_offset {
struct nicvf_hw_stats {
u64 rx_bytes;
+ u64 rx_frames;
u64 rx_ucast_frames;
u64 rx_bcast_frames;
u64 rx_mcast_frames;
- u64 rx_fcs_errors;
- u64 rx_l2_errors;
+ u64 rx_drops;
u64 rx_drop_red;
u64 rx_drop_red_bytes;
u64 rx_drop_overrun;
@@ -191,6 +191,19 @@ struct nicvf_hw_stats {
u64 rx_drop_mcast;
u64 rx_drop_l3_bcast;
u64 rx_drop_l3_mcast;
+ u64 rx_fcs_errors;
+ u64 rx_l2_errors;
+
+ u64 tx_bytes;
+ u64 tx_frames;
+ u64 tx_ucast_frames;
+ u64 tx_bcast_frames;
+ u64 tx_mcast_frames;
+ u64 tx_drops;
+};
+
+struct nicvf_drv_stats {
+ /* CQE Rx errs */
u64 rx_bgx_truncated_pkts;
u64 rx_jabber_errs;
u64 rx_fcs_errs;
@@ -216,34 +229,30 @@ struct nicvf_hw_stats {
u64 rx_l4_pclp;
u64 rx_truncated_pkts;
- u64 tx_bytes_ok;
- u64 tx_ucast_frames_ok;
- u64 tx_bcast_frames_ok;
- u64 tx_mcast_frames_ok;
- u64 tx_drops;
-};
-
-struct nicvf_drv_stats {
- /* Rx */
- u64 rx_frames_ok;
- u64 rx_frames_64;
- u64 rx_frames_127;
- u64 rx_frames_255;
- u64 rx_frames_511;
- u64 rx_frames_1023;
- u64 rx_frames_1518;
- u64 rx_frames_jumbo;
- u64 rx_drops;
-
+ /* CQE Tx errs */
+ u64 tx_desc_fault;
+ u64 tx_hdr_cons_err;
+ u64 tx_subdesc_err;
+ u64 tx_max_size_exceeded;
+ u64 tx_imm_size_oflow;
+ u64 tx_data_seq_err;
+ u64 tx_mem_seq_err;
+ u64 tx_lock_viol;
+ u64 tx_data_fault;
+ u64 tx_tstmp_conflict;
+ u64 tx_tstmp_timeout;
+ u64 tx_mem_fault;
+ u64 tx_csum_overlap;
+ u64 tx_csum_overflow;
+
+ /* driver debug stats */
u64 rcv_buffer_alloc_failures;
-
- /* Tx */
- u64 tx_frames_ok;
- u64 tx_drops;
u64 tx_tso;
u64 tx_timeout;
u64 txq_stop;
u64 txq_wake;
+
+ struct u64_stats_sync syncp;
};
struct nicvf {
@@ -297,7 +306,7 @@ struct nicvf {
/* Stats */
struct nicvf_hw_stats hw_stats;
- struct nicvf_drv_stats drv_stats;
+ struct nicvf_drv_stats __percpu *drv_stats;
struct bgx_stats bgx_stats;
/* MSI-X */
diff --git a/drivers/net/ethernet/cavium/thunder/nic_main.c b/drivers/net/ethernet/cavium/thunder/nic_main.c
index 85c9e62..6677b96 100644
--- a/drivers/net/ethernet/cavium/thunder/nic_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nic_main.c
@@ -851,6 +851,7 @@ static int nic_reset_stat_counters(struct nicpf *nic,
nic_reg_write(nic, reg_addr, 0);
}
}
+
return 0;
}
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c b/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
index ad4fddb..432bf6b 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_ethtool.c
@@ -36,11 +36,11 @@ struct nicvf_stat {
static const struct nicvf_stat nicvf_hw_stats[] = {
NICVF_HW_STAT(rx_bytes),
+ NICVF_HW_STAT(rx_frames),
NICVF_HW_STAT(rx_ucast_frames),
NICVF_HW_STAT(rx_bcast_frames),
NICVF_HW_STAT(rx_mcast_frames),
- NICVF_HW_STAT(rx_fcs_errors),
- NICVF_HW_STAT(rx_l2_errors),
+ NICVF_HW_STAT(rx_drops),
NICVF_HW_STAT(rx_drop_red),
NICVF_HW_STAT(rx_drop_red_bytes),
NICVF_HW_STAT(rx_drop_overrun),
@@ -49,50 +49,59 @@ static const struct nicvf_stat nicvf_hw_stats[] = {
NICVF_HW_STAT(rx_drop_mcast),
NICVF_HW_STAT(rx_drop_l3_bcast),
NICVF_HW_STAT(rx_drop_l3_mcast),
- NICVF_HW_STAT(rx_bgx_truncated_pkts),
- NICVF_HW_STAT(rx_jabber_errs),
- NICVF_HW_STAT(rx_fcs_errs),
- NICVF_HW_STAT(rx_bgx_errs),
- NICVF_HW_STAT(rx_prel2_errs),
- NICVF_HW_STAT(rx_l2_hdr_malformed),
- NICVF_HW_STAT(rx_oversize),
- NICVF_HW_STAT(rx_undersize),
- NICVF_HW_STAT(rx_l2_len_mismatch),
- NICVF_HW_STAT(rx_l2_pclp),
- NICVF_HW_STAT(rx_ip_ver_errs),
- NICVF_HW_STAT(rx_ip_csum_errs),
- NICVF_HW_STAT(rx_ip_hdr_malformed),
- NICVF_HW_STAT(rx_ip_payload_malformed),
- NICVF_HW_STAT(rx_ip_ttl_errs),
- NICVF_HW_STAT(rx_l3_pclp),
- NICVF_HW_STAT(rx_l4_malformed),
- NICVF_HW_STAT(rx_l4_csum_errs),
- NICVF_HW_STAT(rx_udp_len_errs),
- NICVF_HW_STAT(rx_l4_port_errs),
- NICVF_HW_STAT(rx_tcp_flag_errs),
- NICVF_HW_STAT(rx_tcp_offset_errs),
- NICVF_HW_STAT(rx_l4_pclp),
- NICVF_HW_STAT(rx_truncated_pkts),
- NICVF_HW_STAT(tx_bytes_ok),
- NICVF_HW_STAT(tx_ucast_frames_ok),
- NICVF_HW_STAT(tx_bcast_frames_ok),
- NICVF_HW_STAT(tx_mcast_frames_ok),
+ NICVF_HW_STAT(rx_fcs_errors),
+ NICVF_HW_STAT(rx_l2_errors),
+ NICVF_HW_STAT(tx_bytes),
+ NICVF_HW_STAT(tx_frames),
+ NICVF_HW_STAT(tx_ucast_frames),
+ NICVF_HW_STAT(tx_bcast_frames),
+ NICVF_HW_STAT(tx_mcast_frames),
+ NICVF_HW_STAT(tx_drops),
};
static const struct nicvf_stat nicvf_drv_stats[] = {
- NICVF_DRV_STAT(rx_frames_ok),
- NICVF_DRV_STAT(rx_frames_64),
- NICVF_DRV_STAT(rx_frames_127),
- NICVF_DRV_STAT(rx_frames_255),
- NICVF_DRV_STAT(rx_frames_511),
- NICVF_DRV_STAT(rx_frames_1023),
- NICVF_DRV_STAT(rx_frames_1518),
- NICVF_DRV_STAT(rx_frames_jumbo),
- NICVF_DRV_STAT(rx_drops),
+ NICVF_DRV_STAT(rx_bgx_truncated_pkts),
+ NICVF_DRV_STAT(rx_jabber_errs),
+ NICVF_DRV_STAT(rx_fcs_errs),
+ NICVF_DRV_STAT(rx_bgx_errs),
+ NICVF_DRV_STAT(rx_prel2_errs),
+ NICVF_DRV_STAT(rx_l2_hdr_malformed),
+ NICVF_DRV_STAT(rx_oversize),
+ NICVF_DRV_STAT(rx_undersize),
+ NICVF_DRV_STAT(rx_l2_len_mismatch),
+ NICVF_DRV_STAT(rx_l2_pclp),
+ NICVF_DRV_STAT(rx_ip_ver_errs),
+ NICVF_DRV_STAT(rx_ip_csum_errs),
+ NICVF_DRV_STAT(rx_ip_hdr_malformed),
+ NICVF_DRV_STAT(rx_ip_payload_malformed),
+ NICVF_DRV_STAT(rx_ip_ttl_errs),
+ NICVF_DRV_STAT(rx_l3_pclp),
+ NICVF_DRV_STAT(rx_l4_malformed),
+ NICVF_DRV_STAT(rx_l4_csum_errs),
+ NICVF_DRV_STAT(rx_udp_len_errs),
+ NICVF_DRV_STAT(rx_l4_port_errs),
+ NICVF_DRV_STAT(rx_tcp_flag_errs),
+ NICVF_DRV_STAT(rx_tcp_offset_errs),
+ NICVF_DRV_STAT(rx_l4_pclp),
+ NICVF_DRV_STAT(rx_truncated_pkts),
+
+ NICVF_DRV_STAT(tx_desc_fault),
+ NICVF_DRV_STAT(tx_hdr_cons_err),
+ NICVF_DRV_STAT(tx_subdesc_err),
+ NICVF_DRV_STAT(tx_max_size_exceeded),
+ NICVF_DRV_STAT(tx_imm_size_oflow),
+ NICVF_DRV_STAT(tx_data_seq_err),
+ NICVF_DRV_STAT(tx_mem_seq_err),
+ NICVF_DRV_STAT(tx_lock_viol),
+ NICVF_DRV_STAT(tx_data_fault),
+ NICVF_DRV_STAT(tx_tstmp_conflict),
+ NICVF_DRV_STAT(tx_tstmp_timeout),
+ NICVF_DRV_STAT(tx_mem_fault),
+ NICVF_DRV_STAT(tx_csum_overlap),
+ NICVF_DRV_STAT(tx_csum_overflow),
+
NICVF_DRV_STAT(rcv_buffer_alloc_failures),
- NICVF_DRV_STAT(tx_frames_ok),
NICVF_DRV_STAT(tx_tso),
- NICVF_DRV_STAT(tx_drops),
NICVF_DRV_STAT(tx_timeout),
NICVF_DRV_STAT(txq_stop),
NICVF_DRV_STAT(txq_wake),
@@ -278,8 +287,8 @@ static void nicvf_get_ethtool_stats(struct net_device *netdev,
struct ethtool_stats *stats, u64 *data)
{
struct nicvf *nic = netdev_priv(netdev);
- int stat;
- int sqs;
+ int stat, tmp_stats;
+ int sqs, cpu;
nicvf_update_stats(nic);
@@ -289,9 +298,13 @@ static void nicvf_get_ethtool_stats(struct net_device *netdev,
for (stat = 0; stat < nicvf_n_hw_stats; stat++)
*(data++) = ((u64 *)&nic->hw_stats)
[nicvf_hw_stats[stat].index];
- for (stat = 0; stat < nicvf_n_drv_stats; stat++)
- *(data++) = ((u64 *)&nic->drv_stats)
- [nicvf_drv_stats[stat].index];
+ for (stat = 0; stat < nicvf_n_drv_stats; stat++) {
+ tmp_stats = 0;
+ for_each_possible_cpu(cpu)
+ tmp_stats += ((u64 *)per_cpu_ptr(nic->drv_stats, cpu))
+ [nicvf_drv_stats[stat].index];
+ *(data++) = tmp_stats;
+ }
nicvf_get_qset_stats(nic, stats, &data);
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 8f83361..9dc79c05 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -69,25 +69,6 @@ static inline u8 nicvf_netdev_qidx(struct nicvf *nic, u8 qidx)
return qidx;
}
-static inline void nicvf_set_rx_frame_cnt(struct nicvf *nic,
- struct sk_buff *skb)
-{
- if (skb->len <= 64)
- nic->drv_stats.rx_frames_64++;
- else if (skb->len <= 127)
- nic->drv_stats.rx_frames_127++;
- else if (skb->len <= 255)
- nic->drv_stats.rx_frames_255++;
- else if (skb->len <= 511)
- nic->drv_stats.rx_frames_511++;
- else if (skb->len <= 1023)
- nic->drv_stats.rx_frames_1023++;
- else if (skb->len <= 1518)
- nic->drv_stats.rx_frames_1518++;
- else
- nic->drv_stats.rx_frames_jumbo++;
-}
-
/* The Cavium ThunderX network controller can *only* be found in SoCs
* containing the ThunderX ARM64 CPU implementation. All accesses to the device
* registers on this platform are implicitly strongly ordered with respect
@@ -514,7 +495,6 @@ static int nicvf_init_resources(struct nicvf *nic)
}
static void nicvf_snd_pkt_handler(struct net_device *netdev,
- struct cmp_queue *cq,
struct cqe_send_t *cqe_tx,
int cqe_type, int budget,
unsigned int *tx_pkts, unsigned int *tx_bytes)
@@ -536,7 +516,7 @@ static void nicvf_snd_pkt_handler(struct net_device *netdev,
__func__, cqe_tx->sq_qs, cqe_tx->sq_idx,
cqe_tx->sqe_ptr, hdr->subdesc_cnt);
- nicvf_check_cqe_tx_errs(nic, cq, cqe_tx);
+ nicvf_check_cqe_tx_errs(nic, cqe_tx);
skb = (struct sk_buff *)sq->skbuff[cqe_tx->sqe_ptr];
if (skb) {
/* Check for dummy descriptor used for HW TSO offload on 88xx */
@@ -630,8 +610,6 @@ static void nicvf_rcv_pkt_handler(struct net_device *netdev,
return;
}
- nicvf_set_rx_frame_cnt(nic, skb);
-
nicvf_set_rxhash(netdev, cqe_rx, skb);
skb_record_rx_queue(skb, rq_idx);
@@ -703,7 +681,7 @@ static int nicvf_cq_intr_handler(struct net_device *netdev, u8 cq_idx,
work_done++;
break;
case CQE_TYPE_SEND:
- nicvf_snd_pkt_handler(netdev, cq,
+ nicvf_snd_pkt_handler(netdev,
(void *)cq_desc, CQE_TYPE_SEND,
budget, &tx_pkts, &tx_bytes);
tx_done++;
@@ -740,7 +718,7 @@ static int nicvf_cq_intr_handler(struct net_device *netdev, u8 cq_idx,
nic = nic->pnicvf;
if (netif_tx_queue_stopped(txq) && netif_carrier_ok(netdev)) {
netif_tx_start_queue(txq);
- nic->drv_stats.txq_wake++;
+ this_cpu_inc(nic->drv_stats->txq_wake);
if (netif_msg_tx_err(nic))
netdev_warn(netdev,
"%s: Transmit queue wakeup SQ%d\n",
@@ -1084,7 +1062,7 @@ static netdev_tx_t nicvf_xmit(struct sk_buff *skb, struct net_device *netdev)
if (!netif_tx_queue_stopped(txq) && !nicvf_sq_append_skb(nic, skb)) {
netif_tx_stop_queue(txq);
- nic->drv_stats.txq_stop++;
+ this_cpu_inc(nic->drv_stats->txq_stop);
if (netif_msg_tx_err(nic))
netdev_warn(netdev,
"%s: Transmit ring full, stopping SQ%d\n",
@@ -1202,7 +1180,7 @@ static int nicvf_update_hw_max_frs(struct nicvf *nic, int mtu)
int nicvf_open(struct net_device *netdev)
{
- int err, qidx;
+ int cpu, err, qidx;
struct nicvf *nic = netdev_priv(netdev);
struct queue_set *qs = nic->qs;
struct nicvf_cq_poll *cq_poll = NULL;
@@ -1262,6 +1240,11 @@ int nicvf_open(struct net_device *netdev)
nicvf_rss_init(nic);
if (nicvf_update_hw_max_frs(nic, netdev->mtu))
goto cleanup;
+
+ /* Clear percpu stats */
+ for_each_possible_cpu(cpu)
+ memset(per_cpu_ptr(nic->drv_stats, cpu), 0,
+ sizeof(struct nicvf_drv_stats));
}
err = nicvf_register_interrupts(nic);
@@ -1288,9 +1271,6 @@ int nicvf_open(struct net_device *netdev)
for (qidx = 0; qidx < qs->rbdr_cnt; qidx++)
nicvf_enable_intr(nic, NICVF_INTR_RBDR, qidx);
- nic->drv_stats.txq_stop = 0;
- nic->drv_stats.txq_wake = 0;
-
return 0;
cleanup:
nicvf_disable_intr(nic, NICVF_INTR_MBOX, 0);
@@ -1383,9 +1363,10 @@ void nicvf_update_lmac_stats(struct nicvf *nic)
void nicvf_update_stats(struct nicvf *nic)
{
- int qidx;
+ int qidx, cpu;
+ u64 tmp_stats = 0;
struct nicvf_hw_stats *stats = &nic->hw_stats;
- struct nicvf_drv_stats *drv_stats = &nic->drv_stats;
+ struct nicvf_drv_stats *drv_stats;
struct queue_set *qs = nic->qs;
#define GET_RX_STATS(reg) \
@@ -1408,21 +1389,33 @@ void nicvf_update_stats(struct nicvf *nic)
stats->rx_drop_l3_bcast = GET_RX_STATS(RX_DRP_L3BCAST);
stats->rx_drop_l3_mcast = GET_RX_STATS(RX_DRP_L3MCAST);
- stats->tx_bytes_ok = GET_TX_STATS(TX_OCTS);
- stats->tx_ucast_frames_ok = GET_TX_STATS(TX_UCAST);
- stats->tx_bcast_frames_ok = GET_TX_STATS(TX_BCAST);
- stats->tx_mcast_frames_ok = GET_TX_STATS(TX_MCAST);
+ stats->tx_bytes = GET_TX_STATS(TX_OCTS);
+ stats->tx_ucast_frames = GET_TX_STATS(TX_UCAST);
+ stats->tx_bcast_frames = GET_TX_STATS(TX_BCAST);
+ stats->tx_mcast_frames = GET_TX_STATS(TX_MCAST);
stats->tx_drops = GET_TX_STATS(TX_DROP);
- drv_stats->tx_frames_ok = stats->tx_ucast_frames_ok +
- stats->tx_bcast_frames_ok +
- stats->tx_mcast_frames_ok;
- drv_stats->rx_frames_ok = stats->rx_ucast_frames +
- stats->rx_bcast_frames +
- stats->rx_mcast_frames;
- drv_stats->rx_drops = stats->rx_drop_red +
- stats->rx_drop_overrun;
- drv_stats->tx_drops = stats->tx_drops;
+ /* On T88 pass 2.0, the dummy SQE added for TSO notification
+ * via CQE has 'dont_send' set. Hence HW drops the pkt pointed
+ * pointed by dummy SQE and results in tx_drops counter being
+ * incremented. Subtracting it from tx_tso counter will give
+ * exact tx_drops counter.
+ */
+ if (nic->t88 && nic->hw_tso) {
+ for_each_possible_cpu(cpu) {
+ drv_stats = per_cpu_ptr(nic->drv_stats, cpu);
+ tmp_stats += drv_stats->tx_tso;
+ }
+ stats->tx_drops = tmp_stats - stats->tx_drops;
+ }
+ stats->tx_frames = stats->tx_ucast_frames +
+ stats->tx_bcast_frames +
+ stats->tx_mcast_frames;
+ stats->rx_frames = stats->rx_ucast_frames +
+ stats->rx_bcast_frames +
+ stats->rx_mcast_frames;
+ stats->rx_drops = stats->rx_drop_red +
+ stats->rx_drop_overrun;
/* Update RQ and SQ stats */
for (qidx = 0; qidx < qs->rq_cnt; qidx++)
@@ -1436,18 +1429,17 @@ static struct rtnl_link_stats64 *nicvf_get_stats64(struct net_device *netdev,
{
struct nicvf *nic = netdev_priv(netdev);
struct nicvf_hw_stats *hw_stats = &nic->hw_stats;
- struct nicvf_drv_stats *drv_stats = &nic->drv_stats;
nicvf_update_stats(nic);
stats->rx_bytes = hw_stats->rx_bytes;
- stats->rx_packets = drv_stats->rx_frames_ok;
- stats->rx_dropped = drv_stats->rx_drops;
+ stats->rx_packets = hw_stats->rx_frames;
+ stats->rx_dropped = hw_stats->rx_drops;
stats->multicast = hw_stats->rx_mcast_frames;
- stats->tx_bytes = hw_stats->tx_bytes_ok;
- stats->tx_packets = drv_stats->tx_frames_ok;
- stats->tx_dropped = drv_stats->tx_drops;
+ stats->tx_bytes = hw_stats->tx_bytes;
+ stats->tx_packets = hw_stats->tx_frames;
+ stats->tx_dropped = hw_stats->tx_drops;
return stats;
}
@@ -1460,7 +1452,7 @@ static void nicvf_tx_timeout(struct net_device *dev)
netdev_warn(dev, "%s: Transmit timed out, resetting\n",
dev->name);
- nic->drv_stats.tx_timeout++;
+ this_cpu_inc(nic->drv_stats->tx_timeout);
schedule_work(&nic->reset_task);
}
@@ -1594,6 +1586,12 @@ static int nicvf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto err_free_netdev;
}
+ nic->drv_stats = netdev_alloc_pcpu_stats(struct nicvf_drv_stats);
+ if (!nic->drv_stats) {
+ err = -ENOMEM;
+ goto err_free_netdev;
+ }
+
err = nicvf_set_qset_resources(nic);
if (err)
goto err_free_netdev;
@@ -1652,6 +1650,8 @@ static int nicvf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
nicvf_unregister_interrupts(nic);
err_free_netdev:
pci_set_drvdata(pdev, NULL);
+ if (nic->drv_stats)
+ free_percpu(nic->drv_stats);
free_netdev(netdev);
err_release_regions:
pci_release_regions(pdev);
@@ -1679,6 +1679,8 @@ static void nicvf_remove(struct pci_dev *pdev)
unregister_netdev(pnetdev);
nicvf_unregister_interrupts(nic);
pci_set_drvdata(pdev, NULL);
+ if (nic->drv_stats)
+ free_percpu(nic->drv_stats);
free_netdev(netdev);
pci_release_regions(pdev);
pci_disable_device(pdev);
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index 3050177..86726ab 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -104,7 +104,8 @@ static inline int nicvf_alloc_rcv_buffer(struct nicvf *nic, gfp_t gfp,
nic->rb_page = alloc_pages(gfp | __GFP_COMP | __GFP_NOWARN,
order);
if (!nic->rb_page) {
- nic->drv_stats.rcv_buffer_alloc_failures++;
+ this_cpu_inc(nic->pnicvf->drv_stats->
+ rcv_buffer_alloc_failures);
return -ENOMEM;
}
nic->rb_page_offset = 0;
@@ -483,9 +484,12 @@ static void nicvf_reset_rcv_queue_stats(struct nicvf *nic)
{
union nic_mbx mbx = {};
- /* Reset all RXQ's stats */
+ /* Reset all RQ/SQ and VF stats */
mbx.reset_stat.msg = NIC_MBOX_MSG_RESET_STAT_COUNTER;
+ mbx.reset_stat.rx_stat_mask = 0x3FFF;
+ mbx.reset_stat.tx_stat_mask = 0x1F;
mbx.reset_stat.rq_stat_mask = 0xFFFF;
+ mbx.reset_stat.sq_stat_mask = 0xFFFF;
nicvf_send_msg_to_pf(nic, &mbx);
}
@@ -1032,7 +1036,7 @@ nicvf_sq_add_hdr_subdesc(struct nicvf *nic, struct snd_queue *sq, int qentry,
hdr->tso_max_paysize = skb_shinfo(skb)->gso_size;
/* For non-tunneled pkts, point this to L2 ethertype */
hdr->inner_l3_offset = skb_network_offset(skb) - 2;
- nic->drv_stats.tx_tso++;
+ this_cpu_inc(nic->pnicvf->drv_stats->tx_tso);
}
}
@@ -1164,7 +1168,7 @@ static int nicvf_sq_append_tso(struct nicvf *nic, struct snd_queue *sq,
nicvf_sq_doorbell(nic, skb, sq_num, desc_cnt);
- nic->drv_stats.tx_tso++;
+ this_cpu_inc(nic->pnicvf->drv_stats->tx_tso);
return 1;
}
@@ -1425,8 +1429,6 @@ void nicvf_update_sq_stats(struct nicvf *nic, int sq_idx)
/* Check for errors in the receive cmp.queue entry */
int nicvf_check_cqe_rx_errs(struct nicvf *nic, struct cqe_rx_t *cqe_rx)
{
- struct nicvf_hw_stats *stats = &nic->hw_stats;
-
if (!cqe_rx->err_level && !cqe_rx->err_opcode)
return 0;
@@ -1438,76 +1440,76 @@ int nicvf_check_cqe_rx_errs(struct nicvf *nic, struct cqe_rx_t *cqe_rx)
switch (cqe_rx->err_opcode) {
case CQ_RX_ERROP_RE_PARTIAL:
- stats->rx_bgx_truncated_pkts++;
+ this_cpu_inc(nic->drv_stats->rx_bgx_truncated_pkts);
break;
case CQ_RX_ERROP_RE_JABBER:
- stats->rx_jabber_errs++;
+ this_cpu_inc(nic->drv_stats->rx_jabber_errs);
break;
case CQ_RX_ERROP_RE_FCS:
- stats->rx_fcs_errs++;
+ this_cpu_inc(nic->drv_stats->rx_fcs_errs);
break;
case CQ_RX_ERROP_RE_RX_CTL:
- stats->rx_bgx_errs++;
+ this_cpu_inc(nic->drv_stats->rx_bgx_errs);
break;
case CQ_RX_ERROP_PREL2_ERR:
- stats->rx_prel2_errs++;
+ this_cpu_inc(nic->drv_stats->rx_prel2_errs);
break;
case CQ_RX_ERROP_L2_MAL:
- stats->rx_l2_hdr_malformed++;
+ this_cpu_inc(nic->drv_stats->rx_l2_hdr_malformed);
break;
case CQ_RX_ERROP_L2_OVERSIZE:
- stats->rx_oversize++;
+ this_cpu_inc(nic->drv_stats->rx_oversize);
break;
case CQ_RX_ERROP_L2_UNDERSIZE:
- stats->rx_undersize++;
+ this_cpu_inc(nic->drv_stats->rx_undersize);
break;
case CQ_RX_ERROP_L2_LENMISM:
- stats->rx_l2_len_mismatch++;
+ this_cpu_inc(nic->drv_stats->rx_l2_len_mismatch);
break;
case CQ_RX_ERROP_L2_PCLP:
- stats->rx_l2_pclp++;
+ this_cpu_inc(nic->drv_stats->rx_l2_pclp);
break;
case CQ_RX_ERROP_IP_NOT:
- stats->rx_ip_ver_errs++;
+ this_cpu_inc(nic->drv_stats->rx_ip_ver_errs);
break;
case CQ_RX_ERROP_IP_CSUM_ERR:
- stats->rx_ip_csum_errs++;
+ this_cpu_inc(nic->drv_stats->rx_ip_csum_errs);
break;
case CQ_RX_ERROP_IP_MAL:
- stats->rx_ip_hdr_malformed++;
+ this_cpu_inc(nic->drv_stats->rx_ip_hdr_malformed);
break;
case CQ_RX_ERROP_IP_MALD:
- stats->rx_ip_payload_malformed++;
+ this_cpu_inc(nic->drv_stats->rx_ip_payload_malformed);
break;
case CQ_RX_ERROP_IP_HOP:
- stats->rx_ip_ttl_errs++;
+ this_cpu_inc(nic->drv_stats->rx_ip_ttl_errs);
break;
case CQ_RX_ERROP_L3_PCLP:
- stats->rx_l3_pclp++;
+ this_cpu_inc(nic->drv_stats->rx_l3_pclp);
break;
case CQ_RX_ERROP_L4_MAL:
- stats->rx_l4_malformed++;
+ this_cpu_inc(nic->drv_stats->rx_l4_malformed);
break;
case CQ_RX_ERROP_L4_CHK:
- stats->rx_l4_csum_errs++;
+ this_cpu_inc(nic->drv_stats->rx_l4_csum_errs);
break;
case CQ_RX_ERROP_UDP_LEN:
- stats->rx_udp_len_errs++;
+ this_cpu_inc(nic->drv_stats->rx_udp_len_errs);
break;
case CQ_RX_ERROP_L4_PORT:
- stats->rx_l4_port_errs++;
+ this_cpu_inc(nic->drv_stats->rx_l4_port_errs);
break;
case CQ_RX_ERROP_TCP_FLAG:
- stats->rx_tcp_flag_errs++;
+ this_cpu_inc(nic->drv_stats->rx_tcp_flag_errs);
break;
case CQ_RX_ERROP_TCP_OFFSET:
- stats->rx_tcp_offset_errs++;
+ this_cpu_inc(nic->drv_stats->rx_tcp_offset_errs);
break;
case CQ_RX_ERROP_L4_PCLP:
- stats->rx_l4_pclp++;
+ this_cpu_inc(nic->drv_stats->rx_l4_pclp);
break;
case CQ_RX_ERROP_RBDR_TRUNC:
- stats->rx_truncated_pkts++;
+ this_cpu_inc(nic->drv_stats->rx_truncated_pkts);
break;
}
@@ -1515,56 +1517,52 @@ int nicvf_check_cqe_rx_errs(struct nicvf *nic, struct cqe_rx_t *cqe_rx)
}
/* Check for errors in the send cmp.queue entry */
-int nicvf_check_cqe_tx_errs(struct nicvf *nic,
- struct cmp_queue *cq, struct cqe_send_t *cqe_tx)
+int nicvf_check_cqe_tx_errs(struct nicvf *nic, struct cqe_send_t *cqe_tx)
{
- struct cmp_queue_stats *stats = &cq->stats;
-
switch (cqe_tx->send_status) {
case CQ_TX_ERROP_GOOD:
- stats->tx.good++;
return 0;
case CQ_TX_ERROP_DESC_FAULT:
- stats->tx.desc_fault++;
+ this_cpu_inc(nic->drv_stats->tx_desc_fault);
break;
case CQ_TX_ERROP_HDR_CONS_ERR:
- stats->tx.hdr_cons_err++;
+ this_cpu_inc(nic->drv_stats->tx_hdr_cons_err);
break;
case CQ_TX_ERROP_SUBDC_ERR:
- stats->tx.subdesc_err++;
+ this_cpu_inc(nic->drv_stats->tx_subdesc_err);
break;
case CQ_TX_ERROP_MAX_SIZE_VIOL:
- stats->tx.max_size_exceeded++;
+ this_cpu_inc(nic->drv_stats->tx_max_size_exceeded);
break;
case CQ_TX_ERROP_IMM_SIZE_OFLOW:
- stats->tx.imm_size_oflow++;
+ this_cpu_inc(nic->drv_stats->tx_imm_size_oflow);
break;
case CQ_TX_ERROP_DATA_SEQUENCE_ERR:
- stats->tx.data_seq_err++;
+ this_cpu_inc(nic->drv_stats->tx_data_seq_err);
break;
case CQ_TX_ERROP_MEM_SEQUENCE_ERR:
- stats->tx.mem_seq_err++;
+ this_cpu_inc(nic->drv_stats->tx_mem_seq_err);
break;
case CQ_TX_ERROP_LOCK_VIOL:
- stats->tx.lock_viol++;
+ this_cpu_inc(nic->drv_stats->tx_lock_viol);
break;
case CQ_TX_ERROP_DATA_FAULT:
- stats->tx.data_fault++;
+ this_cpu_inc(nic->drv_stats->tx_data_fault);
break;
case CQ_TX_ERROP_TSTMP_CONFLICT:
- stats->tx.tstmp_conflict++;
+ this_cpu_inc(nic->drv_stats->tx_tstmp_conflict);
break;
case CQ_TX_ERROP_TSTMP_TIMEOUT:
- stats->tx.tstmp_timeout++;
+ this_cpu_inc(nic->drv_stats->tx_tstmp_timeout);
break;
case CQ_TX_ERROP_MEM_FAULT:
- stats->tx.mem_fault++;
+ this_cpu_inc(nic->drv_stats->tx_mem_fault);
break;
case CQ_TX_ERROP_CK_OVERLAP:
- stats->tx.csum_overlap++;
+ this_cpu_inc(nic->drv_stats->tx_csum_overlap);
break;
case CQ_TX_ERROP_CK_OFLOW:
- stats->tx.csum_overflow++;
+ this_cpu_inc(nic->drv_stats->tx_csum_overflow);
break;
}
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
index 8f4718e..2e3c940 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
@@ -172,26 +172,6 @@ enum CQ_TX_ERROP_E {
CQ_TX_ERROP_ENUM_LAST = 0x8a,
};
-struct cmp_queue_stats {
- struct tx_stats {
- u64 good;
- u64 desc_fault;
- u64 hdr_cons_err;
- u64 subdesc_err;
- u64 max_size_exceeded;
- u64 imm_size_oflow;
- u64 data_seq_err;
- u64 mem_seq_err;
- u64 lock_viol;
- u64 data_fault;
- u64 tstmp_conflict;
- u64 tstmp_timeout;
- u64 mem_fault;
- u64 csum_overlap;
- u64 csum_overflow;
- } tx;
-} ____cacheline_aligned_in_smp;
-
enum RQ_SQ_STATS {
RQ_SQ_STATS_OCTS,
RQ_SQ_STATS_PKTS,
@@ -243,7 +223,6 @@ struct cmp_queue {
spinlock_t lock; /* lock to serialize processing CQEs */
void *desc;
struct q_desc_mem dmem;
- struct cmp_queue_stats stats;
int irq;
} ____cacheline_aligned_in_smp;
@@ -338,6 +317,5 @@ u64 nicvf_queue_reg_read(struct nicvf *nic,
void nicvf_update_rq_stats(struct nicvf *nic, int rq_idx);
void nicvf_update_sq_stats(struct nicvf *nic, int sq_idx);
int nicvf_check_cqe_rx_errs(struct nicvf *nic, struct cqe_rx_t *cqe_rx);
-int nicvf_check_cqe_tx_errs(struct nicvf *nic,
- struct cmp_queue *cq, struct cqe_send_t *cqe_tx);
+int nicvf_check_cqe_tx_errs(struct nicvf *nic, struct cqe_send_t *cqe_tx);
#endif /* NICVF_QUEUES_H */
--
2.7.4
^ permalink raw reply related
* [PATCH 3/5] net: thunderx: Fix configuration of L3/L4 length checking
From: sunil.kovvuri @ 2016-11-14 10:54 UTC (permalink / raw)
To: netdev; +Cc: Sunil Goutham, linux-kernel, linux-arm-kernel
In-Reply-To: <1479120886-13425-1-git-send-email-sunil.kovvuri@gmail.com>
From: Sunil Goutham <sgoutham@cavium.com>
This patch fixes enabling of HW verification of L3/L4 length and
TCP/UDP checksum which is currently being cleared. Also fixed VLAN
stripping config which is being cleared when multiqset is enabled.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
---
drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index f0e0ca6..3050177 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -538,9 +538,12 @@ static void nicvf_rcv_queue_config(struct nicvf *nic, struct queue_set *qs,
mbx.rq.cfg = (1ULL << 62) | (RQ_CQ_DROP << 8);
nicvf_send_msg_to_pf(nic, &mbx);
- nicvf_queue_reg_write(nic, NIC_QSET_RQ_GEN_CFG, 0, 0x00);
- if (!nic->sqs_mode)
+ if (!nic->sqs_mode && (qidx == 0)) {
+ /* Enable checking L3/L4 length and TCP/UDP checksums */
+ nicvf_queue_reg_write(nic, NIC_QSET_RQ_GEN_CFG, 0,
+ ((1 << 24) | (1 << 23) | (1 << 21)));
nicvf_config_vlan_stripping(nic, nic->netdev->features);
+ }
/* Enable Receive queue */
memset(&rq_cfg, 0, sizeof(struct rq_cfg));
--
2.7.4
^ permalink raw reply related
* [PATCH 2/5] net: thunderx: Program LMAC credits based on MTU
From: sunil.kovvuri @ 2016-11-14 10:54 UTC (permalink / raw)
To: netdev; +Cc: linux-kernel, linux-arm-kernel, Sunil Goutham
In-Reply-To: <1479120886-13425-1-git-send-email-sunil.kovvuri@gmail.com>
From: Sunil Goutham <sgoutham@cavium.com>
Programming LMAC credits taking 9K frame size by default is incorrect
as for an interface which is one of the many on the same BGX/QLM
no of credits available will be less as Tx FIFO will be divided
across all interfaces. So let's say a BGX with 40G interface and another
BGX with multiple 10G, bandwidth of 10G interfaces will be effected when
traffic is running on both 40G and 10G interfaces simultaneously.
This patch fixes this issue by programming credits based on netdev's MTU.
Also fixed configuring MTU to HW and added CQE counter for pkts which
exceed this value.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
---
drivers/net/ethernet/cavium/thunder/nic.h | 3 +-
drivers/net/ethernet/cavium/thunder/nic_main.c | 36 +++++++++++++-------
drivers/net/ethernet/cavium/thunder/nic_reg.h | 1 +
drivers/net/ethernet/cavium/thunder/nicvf_main.c | 38 ++++++++++++----------
drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 3 ++
drivers/net/ethernet/cavium/thunder/nicvf_queues.h | 2 ++
6 files changed, 53 insertions(+), 30 deletions(-)
diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
index 3042610..cd2d379 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -47,7 +47,7 @@
/* Min/Max packet size */
#define NIC_HW_MIN_FRS 64
-#define NIC_HW_MAX_FRS 9200 /* 9216 max packet including FCS */
+#define NIC_HW_MAX_FRS 9190 /* Excluding L2 header and FCS */
/* Max pkinds */
#define NIC_MAX_PKIND 16
@@ -282,7 +282,6 @@ struct nicvf {
u8 node;
u8 cpi_alg;
- u16 mtu;
bool link_up;
u8 duplex;
u32 speed;
diff --git a/drivers/net/ethernet/cavium/thunder/nic_main.c b/drivers/net/ethernet/cavium/thunder/nic_main.c
index 2bbf4cb..85c9e62 100644
--- a/drivers/net/ethernet/cavium/thunder/nic_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nic_main.c
@@ -11,6 +11,7 @@
#include <linux/pci.h>
#include <linux/etherdevice.h>
#include <linux/of.h>
+#include <linux/if_vlan.h>
#include "nic_reg.h"
#include "nic.h"
@@ -260,18 +261,31 @@ static void nic_get_bgx_stats(struct nicpf *nic, struct bgx_stats_msg *bgx)
/* Update hardware min/max frame size */
static int nic_update_hw_frs(struct nicpf *nic, int new_frs, int vf)
{
- if ((new_frs > NIC_HW_MAX_FRS) || (new_frs < NIC_HW_MIN_FRS)) {
- dev_err(&nic->pdev->dev,
- "Invalid MTU setting from VF%d rejected, should be between %d and %d\n",
- vf, NIC_HW_MIN_FRS, NIC_HW_MAX_FRS);
+ int bgx, lmac, lmac_cnt;
+ u64 lmac_credits;
+
+ if ((new_frs > NIC_HW_MAX_FRS) || (new_frs < NIC_HW_MIN_FRS))
return 1;
- }
- new_frs += ETH_HLEN;
- if (new_frs <= nic->pkind.maxlen)
- return 0;
- nic->pkind.maxlen = new_frs;
- nic_reg_write(nic, NIC_PF_PKIND_0_15_CFG, *(u64 *)&nic->pkind);
+ bgx = NIC_GET_BGX_FROM_VF_LMAC_MAP(nic->vf_lmac_map[vf]);
+ lmac = NIC_GET_LMAC_FROM_VF_LMAC_MAP(nic->vf_lmac_map[vf]);
+ lmac += bgx * MAX_LMAC_PER_BGX;
+
+ new_frs += VLAN_ETH_HLEN + ETH_FCS_LEN + 4;
+
+ /* Update corresponding LMAC credits */
+ lmac_cnt = bgx_get_lmac_count(nic->node, bgx);
+ lmac_credits = nic_reg_read(nic, NIC_PF_LMAC_0_7_CREDIT + (lmac * 8));
+ lmac_credits &= ~(0xFFFFFULL << 12);
+ lmac_credits |= (((((48 * 1024) / lmac_cnt) - new_frs) / 16) << 12);
+ nic_reg_write(nic, NIC_PF_LMAC_0_7_CREDIT + (lmac * 8), lmac_credits);
+
+ /* Enforce MTU in HW
+ * This config is supported only from 88xx pass 2.0 onwards.
+ */
+ if (!pass1_silicon(nic->pdev))
+ nic_reg_write(nic,
+ NIC_PF_LMAC_0_7_CFG2 + (lmac * 8), new_frs);
return 0;
}
@@ -464,7 +478,7 @@ static int nic_init_hw(struct nicpf *nic)
/* PKIND configuration */
nic->pkind.minlen = 0;
- nic->pkind.maxlen = NIC_HW_MAX_FRS + ETH_HLEN;
+ nic->pkind.maxlen = NIC_HW_MAX_FRS + VLAN_ETH_HLEN + ETH_FCS_LEN + 4;
nic->pkind.lenerr_en = 1;
nic->pkind.rx_hdr = 0;
nic->pkind.hdr_sl = 0;
diff --git a/drivers/net/ethernet/cavium/thunder/nic_reg.h b/drivers/net/ethernet/cavium/thunder/nic_reg.h
index edf779f..80d4633 100644
--- a/drivers/net/ethernet/cavium/thunder/nic_reg.h
+++ b/drivers/net/ethernet/cavium/thunder/nic_reg.h
@@ -106,6 +106,7 @@
#define NIC_PF_MPI_0_2047_CFG (0x210000)
#define NIC_PF_RSSI_0_4097_RQ (0x220000)
#define NIC_PF_LMAC_0_7_CFG (0x240000)
+#define NIC_PF_LMAC_0_7_CFG2 (0x240100)
#define NIC_PF_LMAC_0_7_SW_XOFF (0x242000)
#define NIC_PF_LMAC_0_7_CREDIT (0x244000)
#define NIC_PF_CHAN_0_255_TX_CFG (0x400000)
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 45a13f7..8f83361 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1189,6 +1189,17 @@ int nicvf_stop(struct net_device *netdev)
return 0;
}
+static int nicvf_update_hw_max_frs(struct nicvf *nic, int mtu)
+{
+ union nic_mbx mbx = {};
+
+ mbx.frs.msg = NIC_MBOX_MSG_SET_MAX_FRS;
+ mbx.frs.max_frs = mtu;
+ mbx.frs.vf_id = nic->vf_id;
+
+ return nicvf_send_msg_to_pf(nic, &mbx);
+}
+
int nicvf_open(struct net_device *netdev)
{
int err, qidx;
@@ -1196,8 +1207,6 @@ int nicvf_open(struct net_device *netdev)
struct queue_set *qs = nic->qs;
struct nicvf_cq_poll *cq_poll = NULL;
- nic->mtu = netdev->mtu;
-
netif_carrier_off(netdev);
err = nicvf_register_misc_interrupt(nic);
@@ -1248,9 +1257,12 @@ int nicvf_open(struct net_device *netdev)
if (nic->sqs_mode)
nicvf_get_primary_vf_struct(nic);
- /* Configure receive side scaling */
- if (!nic->sqs_mode)
+ /* Configure receive side scaling and MTU */
+ if (!nic->sqs_mode) {
nicvf_rss_init(nic);
+ if (nicvf_update_hw_max_frs(nic, netdev->mtu))
+ goto cleanup;
+ }
err = nicvf_register_interrupts(nic);
if (err)
@@ -1297,17 +1309,6 @@ int nicvf_open(struct net_device *netdev)
return err;
}
-static int nicvf_update_hw_max_frs(struct nicvf *nic, int mtu)
-{
- union nic_mbx mbx = {};
-
- mbx.frs.msg = NIC_MBOX_MSG_SET_MAX_FRS;
- mbx.frs.max_frs = mtu;
- mbx.frs.vf_id = nic->vf_id;
-
- return nicvf_send_msg_to_pf(nic, &mbx);
-}
-
static int nicvf_change_mtu(struct net_device *netdev, int new_mtu)
{
struct nicvf *nic = netdev_priv(netdev);
@@ -1318,10 +1319,13 @@ static int nicvf_change_mtu(struct net_device *netdev, int new_mtu)
if (new_mtu < NIC_HW_MIN_FRS)
return -EINVAL;
+ netdev->mtu = new_mtu;
+
+ if (!netif_running(netdev))
+ return 0;
+
if (nicvf_update_hw_max_frs(nic, new_mtu))
return -EINVAL;
- netdev->mtu = new_mtu;
- nic->mtu = new_mtu;
return 0;
}
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index a4fc501..f0e0ca6 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -1530,6 +1530,9 @@ int nicvf_check_cqe_tx_errs(struct nicvf *nic,
case CQ_TX_ERROP_SUBDC_ERR:
stats->tx.subdesc_err++;
break;
+ case CQ_TX_ERROP_MAX_SIZE_VIOL:
+ stats->tx.max_size_exceeded++;
+ break;
case CQ_TX_ERROP_IMM_SIZE_OFLOW:
stats->tx.imm_size_oflow++;
break;
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
index 869f338..8f4718e 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
@@ -158,6 +158,7 @@ enum CQ_TX_ERROP_E {
CQ_TX_ERROP_DESC_FAULT = 0x10,
CQ_TX_ERROP_HDR_CONS_ERR = 0x11,
CQ_TX_ERROP_SUBDC_ERR = 0x12,
+ CQ_TX_ERROP_MAX_SIZE_VIOL = 0x13,
CQ_TX_ERROP_IMM_SIZE_OFLOW = 0x80,
CQ_TX_ERROP_DATA_SEQUENCE_ERR = 0x81,
CQ_TX_ERROP_MEM_SEQUENCE_ERR = 0x82,
@@ -177,6 +178,7 @@ struct cmp_queue_stats {
u64 desc_fault;
u64 hdr_cons_err;
u64 subdesc_err;
+ u64 max_size_exceeded;
u64 imm_size_oflow;
u64 data_seq_err;
u64 mem_seq_err;
--
2.7.4
^ permalink raw reply related
* Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
From: Jerome Brunet @ 2016-11-14 10:49 UTC (permalink / raw)
To: André Roth
Cc: Martin Blumenstingl, Johnson Leung, Giuseppe CAVALLARO,
linux-amlogic, Alexandre Torgue, netdev
In-Reply-To: <20161113201339.667ac1f7@gmail.com>
On Sun, 2016-11-13 at 20:13 +0100, André Roth wrote:
> >
> > Andre, the 3.14 kernel you are talking, is it this one ? :
> > https://github.com/hardkernel/linux/tree/odroidc2-3.14.y
>
> yes
>
> >
> > Because in drivers/net/phy/realtek.c, they disable EEE, but
> > also 1000Base-T Full Duplex advertisement ?
> >
> > + /* disable 1000m adv*/
> > + val = phy_read(phydev, 0x9);
> > + phy_write(phydev, 0x9, val&(~(1<<9)));
> >
> > If this is the kernel you are running, you should not be able to
> > have
> > ethernet at 1000MB/s ? Or is it in half duplex mode ?
>
> ethtool shows 1000Mb/s Full-Duplex and the bandwith is around 300Mb/s
> (as measured by scp). kernel version: 3.14.65-73
Andre,
I checked again the kernel at https://github.com/hardkernel/linux/tree/
odroidc2-3.14.y. The version you mention (3.14.65-73) seems to be:
sha1: c75d5f4d1516cdd86d90a9d1c565bb0ed9251036
tag: jenkins-deb s905 kernel-73
In this particular version, both realtek drivers:
- drivers/net/phy/realtek.c
- drivers/amlogic/ethernet/phy/am_realtek.c
have the hack to disable 1000M advertisement. I don't understand how it
possible for you to have 1000Base-T Full Duplex with this, maybe I'm
missing something here ?
If you did compile the kernel yourself, could you check the 2 file
mentioned above ? Just to be sure there was no patch applied at the
last minute, which would not show up in the git history of hardkernel ?
Thx
Jerome
>
>
^ permalink raw reply
* Re: [RFC v4 00/18] Landlock LSM: Unprivileged sandboxing
From: Sargun Dhillon @ 2016-11-14 10:35 UTC (permalink / raw)
To: Mickaël Salaün
Cc: LKML, Alexei Starovoitov, Andy Lutomirski, Daniel Borkmann,
Daniel Mack, David Drysdale, David S . Miller, Eric W . Biederman,
James Morris, Jann Horn, Kees Cook, Paul Moore, Serge E . Hallyn,
Tejun Heo, Thomas Graf, Will Drewry,
kernel-hardening-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Linux API, LSM,
netdev, "open list:CONTROL GROUP (CGRO
In-Reply-To: <5828776A.1010104-WFhQfpSGs3bR7s880joybQ@public.gmane.org>
On Sun, Nov 13, 2016 at 6:23 AM, Mickaël Salaün <mic-WFhQfpSGs3bR7s880joybQ@public.gmane.org> wrote:
> Hi,
>
> After the BoF at LPC last week, we came to a multi-step roadmap to
> upstream Landlock.
>
> A first patch series containing the basic properties needed for a
> "minimum viable product", which means being able to test it, without
> full features. The idea is to set in place the main components which
> include the LSM part (some hooks with the manager logic) and the new
> eBPF type. To have a minimum amount of code, the first userland entry
> point will be the seccomp syscall. This doesn't imply non-upstream
> patches and should be more simple. For the sake of simplicity and to
> ease the review, this first series will only be dedicated to privileged
> processes (i.e. with CAP_SYS_ADMIN). We may want to only allow one level
> of rules at first, instead of dealing with more complex rule inheritance
> (like seccomp-bpf can do).
>
> The second series will focus on the cgroup manager. It will follow the
> same rules of inheritance as the Daniel Mack's patches does.
>
> The third series will try to bring a BPF map of handles for Landlock and
> the dedicated BPF helpers.
>
> Finally, the fourth series will bring back the unprivileged mode (with
> no_new_privs), at least for process hierarchies (via seccomp). This also
> imply to handle multi-level of rules.
>
> Right now, an important point of attention is the userland ABI. We don't
> want LSM hooks to be exposed "as is" to userland. This may have some
> future implications if their semantic and/or enforcement point(s)
> change. In the next series, I will propose a new abstraction over the
> currently used LSM hooks. I'll also propose a new way to deal with
> resource accountability. Finally, I plan to create a minimal (kernel)
> developer documentation and a test suite.
>
> Regards,
> Mickaël
>
>
> On 26/10/2016 08:56, Mickaël Salaün wrote:
>> Hi,
>>
>> This fourth RFC brings some improvements over the previous one [1]. An important
>> new point is the abstraction from the raw types of LSM hook arguments. It is
>> now possible to call a Landlock function the same way for LSM hooks with
>> different internal argument types. Some parts of the code are revamped with RCU
>> to properly deal with concurrency. From a userland point of view, the only
>> remaining link with seccomp-bpf is the ability to use the seccomp(2) syscall to
>> load and enforce a Landlock rule. Seccomp filters cannot trigger Landlock rules
>> anymore. For now, it is no more possible for an unprivileged user to enforce a
>> Landlock rule on a cgroup through delegation.
>>
>> As suggested, I plan to write documentation for userland and kernel developers
>> with some kind of guiding principles. A remaining question is how to enforce
>> limitations for the rule creation?
>>
>>
>> # Landlock LSM
>>
>> The goal of this new stackable Linux Security Module (LSM) called Landlock is
>> to allow any process, including unprivileged ones, to create powerful security
>> sandboxes comparable to the Seatbelt/XNU Sandbox or the OpenBSD Pledge. This
>> kind of sandbox is expected to help mitigate the security impact of bugs or
>> unexpected/malicious behaviors in userland applications.
>>
>> eBPF programs are used to create a security rule. They are very limited (i.e.
>> can only call a whitelist of functions) and cannot do a denial of service (i.e.
>> no loop). A new dedicated eBPF map allows to collect and compare Landlock
>> handles with system resources (e.g. files or network connections).
>>
>> The approach taken is to add the minimum amount of code while still allowing
>> the userland to create quite complex access rules. A dedicated security policy
>> language as the one used by SELinux, AppArmor and other major LSMs involves a
>> lot of code and is usually dedicated to a trusted user (i.e. root).
>>
>>
>> # eBPF
>>
>> To get an expressive language while still being safe and small, Landlock is
>> based on eBPF. Landlock should be usable by untrusted processes and must then
>> expose a minimal attack surface. The eBPF bytecode is minimal while powerful,
>> widely used and designed to be used by not so trusted application. Reusing this
>> code allows to not reproduce the same mistakes and minimize new code while
>> still taking a generic approach. Only a few additional features are added like
>> a new kind of arraymap and some dedicated eBPF functions.
>>
>> An eBPF program has access to an eBPF context which contains the LSM hook
>> arguments (as does seccomp-bpf with syscall arguments). They can be used
>> directly or passed to helper functions according to their types. It is then
>> possible to do complex access checks without race conditions nor inconsistent
>> evaluation (i.e. incorrect mirroring of the OS code and state [2]).
>>
>> There is one eBPF program subtype per LSM hook. This allows to statically check
>> which context access is performed by an eBPF program. This is needed to deny
>> kernel address leak and ensure the right use of LSM hook arguments with eBPF
>> functions. Moreover, this safe pointer handling removes the need for runtime
>> check or abstract data, which improves performances. Any user can add multiple
>> Landlock eBPF programs per LSM hook. They are stacked and evaluated one after
>> the other (cf. seccomp-bpf).
>>
>>
>> # LSM hooks
>>
>> Unlike syscalls, LSM hooks are security checkpoints and are not architecture
>> dependent. They are designed to match a security need associated with a
>> security policy (e.g. access to a file). Exposing parts of some LSM hooks
>> instead of using the syscall API for sandboxing should help to avoid bugs and
>> hacks as encountered by the first RFC. Instead of redoing the work of the LSM
>> hooks through syscalls, we should use and expose them as does policies of
>> access control LSM.
>>
>> Only a subset of the hooks are meaningful for an unprivileged sandbox mechanism
>> (e.g. file system or network access control). Landlock uses an abstraction of
>> raw LSM hooks, which allow to deal with possible future API changes of the LSM
>> hook API. Moreover, thanks to the ePBF program typing (per LSM hook) used by
>> Landlock, it should not be hard to make such evolutions backward compatible.
>>
>>
>> # Use case scenario
>>
>> First, a process needs to create a new dedicated eBPF map containing handles.
>> This handles are references to system resources (e.g. file or directory) and
>> grouped in one or multiple maps to be efficiently managed and checked in
>> batches. This kind of map can be passed to Landlock eBPF functions to compare,
>> for example, with a file access request. The handles are only accessible from
>> the eBPF programs created by the same thread.
>>
>> The loaded Landlock eBPF programs can be triggered by a seccomp filter
>> returning RET_LANDLOCK. In addition, a cookie (16-bit value) can be passed from
>> a seccomp filter to eBPF programs. This allow flexible security policies
>> between seccomp and Landlock.
>>
>> Another way to enforce a Landlock security policy is to attach Landlock
>> programs to a dedicated cgroup. All the processes in this cgroup will then be
>> subject to this policy. For unprivileged processes, this can be done thanks to
>> cgroup delegation.
>>
>> A triggered Landlock eBPF program can allow or deny an access, according to
>> its subtype (i.e. LSM hook), thanks to errno return values.
>>
>>
>> # Sandbox example with process hierarchy sandboxing (seccomp)
>>
>> $ ls /home
>> user1
>> $ LANDLOCK_ALLOWED='/bin:/lib:/usr:/tmp:/proc/self/fd/0' \
>> ./samples/landlock/sandbox /bin/sh -i
>> Launching a new sandboxed process.
>> $ ls /home
>> ls: cannot access '/home': No such file or directory
>>
>>
>> # Sandbox example with conditional access control depending on a cgroup
>>
>> $ mkdir /sys/fs/cgroup/sandboxed
>> $ ls /home
>> user1
>> $ LANDLOCK_CGROUPS='/sys/fs/cgroup/sandboxed' \
>> LANDLOCK_ALLOWED='/bin:/lib:/usr:/tmp:/proc/self/fd/0' \
>> ./samples/landlock/sandbox
>> Ready to sandbox with cgroups.
>> $ ls /home
>> user1
>> $ echo $$ > /sys/fs/cgroup/sandboxed/cgroup.procs
>> $ ls /home
>> ls: cannot access '/home': No such file or directory
>>
>>
>> # Current limitations and possible improvements
>>
>> For now, eBPF programs can only return an errno code. It may be interesting to
>> be able to do other actions like seccomp-bpf does (e.g. kill process). Such
>> features can easily be implemented but the main advantage of the current
>> approach is to be able to only execute eBPF programs until one returns an errno
>> code instead of executing all programs like seccomp-bpf does.
>>
>> It is quite easy to add new eBPF functions to extend Landlock. The main concern
>> should be about the possibility to leak information from current process to
>> another one (e.g. through maps) to not reproduce the same security sensitive
>> behavior as ptrace.
>>
>> This design does not seem too intrusive but is flexible enough to allow a
>> powerful sandbox mechanism accessible by any process on Linux. The use of
>> seccomp and Landlock is more suitable with the help of a userland library (e.g.
>> libseccomp) that could help to specify a high-level language to express a
>> security policy instead of raw eBPF programs. Moreover, thanks to LLVM, it is
>> possible to express an eBPF program with a subset of C.
>>
>>
>> # FAQ
>>
>> ## Why does seccomp-bpf is not enough?
>>
>> A seccomp filter can access to raw syscall arguments which means that it is not
>> possible to filter according to pointed such as a file path. As the first
>> version of this patch series demonstrated, filtering at the syscall level is
>> complicated (e.g. need to take care of race conditions). This is mainly because
>> the access control checkpoints of the kernel are not at this high-level but
>> more underneath, at LSM hooks level. The LSM hooks are designed to handle this
>> kind of checks. This series use this approach to leverage the ability of
>> unprivileged users to limit themselves.
>>
>> Cf. "What it isn't?" in Documentation/prctl/seccomp_filter.txt
>>
>>
>> ## Why using the seccomp(2) syscall?
>>
>> Landlock use the same semantic as seccomp to apply access rule restrictions. It
>> add a new layer of security for the current process which is inherited by its
>> childs. It makes sense to use an unique access-restricting syscall (that should
>> be allowed by seccomp-bpf rules) which can only drop privileges. Moreover, a
>> Landlock eBPF program could come from outside a process (e.g. passed through a
>> UNIX socket). It is then useful to differentiate the creation/load of Landlock
>> eBPF programs via bpf(2), from rule enforcing via seccomp(2).
>>
>>
>> ## Why using cgroups?
>>
>> cgroups are designed to handle groups of processes. One use case is to manage
>> containers. Sandboxing based on process hierarchy (seccomp) is design to handle
>> immutable security policies, which is a good security property but does not
>> match all use cases. A user can attach Landlock rules to a cgroup. Doing so,
>> all the processes in that cgroup will be subject to the security policy.
>> However, if the user is allowed to manage this cgroup, it could dynamically
>> move this group of processes to a cgroup with another security policy (or
>> none). Landlock rules can be applied either on a process hierarchy (e.g.
>> application with built-in sandboxing) or a group of processes (e.g. container
>> sandboxing). Both approaches can be combined for the same process.
>>
>>
>> ## Does Landlock can limit network access or other resources?
>>
>> Limiting network access is obviously in the scope of Landlock but it is not yet
>> implemented. The main goal now is to get feedback about the whole concept, the
>> API and the file access control part. More access control types could be
>> implemented in the future.
>>
>> Sargun Dhillon sent a RFC (Checmate) [4] to deal with network manipulation.
>> This could be implemented on top of the Landlock framework.
>>
>>
>> ## Why a new LSM? Are SELinux, AppArmor, Smack or Tomoyo not good enough?
>>
>> The current access control LSMs are fine for their purpose which is to give the
>> *root* the ability to enforce a security policy for the *system*. What is
>> missing is a way to enforce a security policy for any applications by its
>> developer and *unprivileged user* as seccomp can do for raw syscall filtering.
>> Moreover, Landlock handles stacked hook programs from different users. It must
>> then ensure there is no possible malicious interactions between these programs.
>>
>> Differences with other (access control) LSMs:
>> * not only dedicated to administrators (i.e. no_new_priv);
>> * limited kernel attack surface (e.g. policy parsing);
>> * helpers to compare complex objects (path/FD), no access to internal kernel
>> data (do not leak addresses);
>> * constrained policy rules/programs (no DoS: deterministic execution time);
>> * do not leak more information than the loader process can legitimately have
>> access to (minimize metadata inference): must compare from an already allowed
>> file (through a handle).
>>
>>
>> ## Why not use a policy language like used by SElinux or AppArmor?
>>
>> This kind of LSMs are dedicated to administrators. They already manage the
>> system and are not a threat to the system security. However, seccomp, and
>> Landlock too, should be available to anyone, which potentially include
>> untrusted users and processes. To reduce the attack surface, Landlock should
>> expose the minimum amount of code, hence minimal complexity. Moreover, another
>> threat is to make accessible to a malicious code a new way to gain more
>> information. For example, Landlock features should not allow a program to get
>> the file owner if the directory containing this file is not readable. This data
>> could then be exfiltrated thanks to the access result. Thus, we should limit
>> the expressiveness of the available checks. The current approach is to do the
>> checks in such a way that only a comparison with an already accessed resource
>> (e.g. file descriptor) is possible. This allow to have a reference to compare
>> with, without exposing much information.
>>
>>
>> ## As a developer, why do I need this feature?
>>
>> Landlock's goal is to help userland to limit its attack surface.
>> Security-conscious developers would like to protect users from a security bug
>> in their applications and the third-party dependencies they are using. Such a
>> bug can compromise all the user data and help an attacker to perform a
>> privilege escalation. Using an *unprivileged sandbox* feature such as Landlock
>> empowers the developer with the ability to properly compartmentalize its
>> software and limit the impact of vulnerabilities.
>>
>>
>> ## As a user, why do I need a this feature?
>>
>> Any user can already use seccomp-bpf to whitelist a set of syscalls to
>> reduce the kernel attack surface for a predefined set of processes. However an
>> unprivileged user can't create a security policy like the root user can thanks to
>> SELinux and other access control LSMs. Landlock allows any unprivileged user to
>> protect their data from being accessed by any process they run but only an
>> identified subset. User tools can be created to help create such a high-level
>> access control policy. This policy may not be powerful enough to express the
>> same policies as the current access control LSMs, because of the threat an
>> unprivileged user can be to the system, but it should be enough for most
>> use-cases (e.g. blacklist or whitelist a set of file hierarchies).
>>
>>
>> # Changes since RFC v3
>>
>> * use abstract LSM hook arguments with custom types (e.g. *_LANDLOCK_ARG_FS for
>> struct file, struct inode and struct path)
>> * add more LSM hooks to support full file system access control
>> * improve the sandbox example
>> * fix races and RCU issues:
>> * eBPF program execution and eBPF helpers
>> * revamp the arraymap of handles to cleanly deal with update/delete
>> * eBPF program subtype for Landlock:
>> * remove the "origin" field
>> * add an "option" field
>> * rebase onto Daniel Mack's patches v7 [3]
>> * remove merged commit 1955351da41c ("bpf: Set register type according to
>> is_valid_access()")
>> * fix spelling mistakes
>> * cleanup some type and variable names
>> * split patches
>> * for now, remove cgroup delegation handling for unprivileged user
>> * remove extra access check for cgroup_get_from_fd()
>> * remove unused example code dealing with skb
>> * remove seccomp-bpf link:
>> * no more seccomp cookie
>> * for now, it is no more possible to check the current syscall properties
>>
>>
>> # Changes since RFC v2
>>
>> * revamp cgroup handling:
>> * use Daniel Mack's patches "Add eBPF hooks for cgroups" v5
>> * remove bpf_landlock_cmp_cgroup_beneath()
>> * make BPF_PROG_ATTACH usable with delegated cgroups
>> * add a new CGRP_NO_NEW_PRIVS flag for safe cgroups
>> * handle Landlock sandboxing for cgroups hierarchy
>> * allow unprivileged processes to attach Landlock eBPF program to cgroups
>> * add subtype to eBPF programs:
>> * replace Landlock hook identification by custom eBPF program types with a
>> dedicated subtype field
>> * manage fine-grained privileged Landlock programs
>> * register Landlock programs for dedicated trigger origins (e.g. syscall,
>> return from seccomp filter and/or interruption)
>> * performance and memory optimizations: use an array to access Landlock hooks
>> directly but do not duplicated it for each thread (seccomp-based)
>> * allow running Landlock programs without seccomp filter
>> * fix seccomp-related issues
>> * remove extra errno bounding check for Landlock programs
>> * add some examples for optional eBPF functions or context access (network
>> related) according to security checks to allow more features for privileged
>> programs (e.g. Checmate)
>>
>>
>> # Changes since RFC v1
>>
>> * focus on the LSM hooks, not the syscalls:
>> * much more simple implementation
>> * does not need audit cache tricks to avoid race conditions
>> * more simple to use and more generic because using the LSM hook abstraction
>> directly
>> * more efficient because only checking in LSM hooks
>> * architecture agnostic
>> * switch from cBPF to eBPF:
>> * new eBPF program types dedicated to Landlock
>> * custom functions used by the eBPF program
>> * gain some new features (e.g. 10 registers, can load values of different
>> size, LLVM translator) but only a few functions allowed and a dedicated map
>> type
>> * new context: LSM hook ID, cookie and LSM hook arguments
>> * need to set the sysctl kernel.unprivileged_bpf_disable to 0 (default value)
>> to be able to load hook filters as unprivileged users
>> * smaller and simpler:
>> * no more checker groups but dedicated arraymap of handles
>> * simpler userland structs thanks to eBPF functions
>> * distinctive name: Landlock
>>
>>
>> This series can be applied on top of Daniel Mack's patches for BPF_PROG_ATTACH
>> v7 [3] on Linux v4.9-rc2. This can be tested with CONFIG_SECURITY_LANDLOCK,
>> CONFIG_SECCOMP_FILTER and CONFIG_CGROUP_BPF. I would really appreciate
>> constructive comments on the usability, architecture, code and userland API of
>> Landlock LSM.
>>
>> [1] https://lkml.kernel.org/r/20160914072415.26021-1-mic-WFhQfpSGs3bR7s880joybQ@public.gmane.org
>> [2] https://crypto.stanford.edu/cs155/papers/traps.pdf
>> [3] https://lkml.kernel.org/r/1477390454-12553-1-git-send-email-daniel@zonque.org
>> [4] https://lkml.kernel.org/r/20160829114542.GA20836-I4sfFR6g6EicJoAdRrHjTitQHAD/DGy2@public.gmane.orgbus-611.internal
>>
>> Regards,
>>
>> Mickaël Salaün (18):
>> landlock: Add Kconfig
>> bpf: Move u64_to_ptr() to BPF headers and inline it
>> bpf,landlock: Add a new arraymap type to deal with (Landlock) handles
>> bpf,landlock: Add eBPF program subtype and is_valid_subtype() verifier
>> bpf,landlock: Define an eBPF program type for Landlock
>> fs: Constify path_is_under()'s arguments
>> landlock: Add LSM hooks
>> landlock: Handle file comparisons
>> landlock: Add manager functions
>> seccomp: Split put_seccomp_filter() with put_seccomp()
>> seccomp,landlock: Handle Landlock hooks per process hierarchy
>> bpf: Cosmetic change for bpf_prog_attach()
>> bpf/cgroup: Replace struct bpf_prog with struct bpf_object
>> bpf/cgroup: Make cgroup_bpf_update() return an error code
>> bpf/cgroup: Move capability check
>> bpf/cgroup,landlock: Handle Landlock hooks per cgroup
>> landlock: Add update and debug access flags
>> samples/landlock: Add sandbox example
>>
>> fs/namespace.c | 2 +-
>> include/linux/bpf-cgroup.h | 19 +-
>> include/linux/bpf.h | 44 +++-
>> include/linux/cgroup-defs.h | 2 +
>> include/linux/filter.h | 1 +
>> include/linux/fs.h | 2 +-
>> include/linux/landlock.h | 95 +++++++++
>> include/linux/lsm_hooks.h | 5 +
>> include/linux/seccomp.h | 12 +-
>> include/uapi/linux/bpf.h | 105 ++++++++++
>> include/uapi/linux/seccomp.h | 1 +
>> kernel/bpf/arraymap.c | 270 +++++++++++++++++++++++++
>> kernel/bpf/cgroup.c | 139 ++++++++++---
>> kernel/bpf/syscall.c | 71 ++++---
>> kernel/bpf/verifier.c | 35 +++-
>> kernel/cgroup.c | 6 +-
>> kernel/fork.c | 15 +-
>> kernel/seccomp.c | 26 ++-
>> kernel/trace/bpf_trace.c | 12 +-
>> net/core/filter.c | 26 ++-
>> samples/Makefile | 2 +-
>> samples/bpf/bpf_helpers.h | 5 +
>> samples/landlock/.gitignore | 1 +
>> samples/landlock/Makefile | 16 ++
>> samples/landlock/sandbox.c | 405 +++++++++++++++++++++++++++++++++++++
>> security/Kconfig | 1 +
>> security/Makefile | 2 +
>> security/landlock/Kconfig | 23 +++
>> security/landlock/Makefile | 3 +
>> security/landlock/checker_fs.c | 152 ++++++++++++++
>> security/landlock/checker_fs.h | 20 ++
>> security/landlock/common.h | 58 ++++++
>> security/landlock/lsm.c | 449 +++++++++++++++++++++++++++++++++++++++++
>> security/landlock/manager.c | 379 ++++++++++++++++++++++++++++++++++
>> security/security.c | 1 +
>> 35 files changed, 2309 insertions(+), 96 deletions(-)
>> create mode 100644 include/linux/landlock.h
>> create mode 100644 samples/landlock/.gitignore
>> create mode 100644 samples/landlock/Makefile
>> create mode 100644 samples/landlock/sandbox.c
>> create mode 100644 security/landlock/Kconfig
>> create mode 100644 security/landlock/Makefile
>> create mode 100644 security/landlock/checker_fs.c
>> create mode 100644 security/landlock/checker_fs.h
>> create mode 100644 security/landlock/common.h
>> create mode 100644 security/landlock/lsm.c
>> create mode 100644 security/landlock/manager.c
>>
>
Was there a plan around getting Daniel's patches in as well? Also,
rather than making these handles landlock-specific, can they be
implemented in such a way where we can keep track of (some) of these
in other types of programs?
^ permalink raw reply
* Re: [PATCH net] sctp: change sk state only when it has assocs in sctp_shutdown
From: Marcelo Ricardo Leitner @ 2016-11-14 10:32 UTC (permalink / raw)
To: Xin Long
Cc: network dev, linux-sctp, davem, Neil Horman, Vlad Yasevich,
andreyknvl
In-Reply-To: <d260f8b59f52d7ef00b83a554fc19d4aa91766a2.1479044677.git.lucien.xin@gmail.com>
On Sun, Nov 13, 2016 at 09:44:37PM +0800, Xin Long wrote:
> Now when users shutdown a sock with SEND_SHUTDOWN in sctp, even if
> this sock has no connection (assoc), sk state would be changed to
> SCTP_SS_CLOSING, which is not as we expect.
>
> Besides, after that if users try to listen on this sock, kernel
> could even panic when it dereference sctp_sk(sk)->bind_hash in
> sctp_inet_listen, as bind_hash is null when sock has no assoc.
>
> This patch is to move sk state change after checking sk assocs
> is not empty, and also merge these two if() conditions and reduce
> indent level.
>
> Fixes: d46e416c11c8 ("sctp: sctp should change socket state when shutdown is received")
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Tested-by: Andrey Konovalov <andreyknvl@google.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
> net/sctp/socket.c | 15 +++++++--------
> 1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index faa48ff..f23ad91 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -4285,19 +4285,18 @@ static void sctp_shutdown(struct sock *sk, int how)
> {
> struct net *net = sock_net(sk);
> struct sctp_endpoint *ep;
> - struct sctp_association *asoc;
>
> if (!sctp_style(sk, TCP))
> return;
>
> - if (how & SEND_SHUTDOWN) {
> + ep = sctp_sk(sk)->ep;
> + if (how & SEND_SHUTDOWN && !list_empty(&ep->asocs)) {
> + struct sctp_association *asoc;
> +
> sk->sk_state = SCTP_SS_CLOSING;
> - ep = sctp_sk(sk)->ep;
> - if (!list_empty(&ep->asocs)) {
> - asoc = list_entry(ep->asocs.next,
> - struct sctp_association, asocs);
> - sctp_primitive_SHUTDOWN(net, asoc, NULL);
> - }
> + asoc = list_entry(ep->asocs.next,
> + struct sctp_association, asocs);
> + sctp_primitive_SHUTDOWN(net, asoc, NULL);
> }
> }
>
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply
* [PATCH net-next v2 4/5] ethtool: Core impl for ETHTOOL_PHY_DOWNSHIFT tunable
From: Allan W. Nielsen @ 2016-11-14 9:29 UTC (permalink / raw)
To: netdev; +Cc: andrew, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479115760-1182-1-git-send-email-allan.nielsen@microsemi.com>
From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Adding validation support for the ETHTOOL_PHY_DOWNSHIFT. Functional
implementation needs to be done in the individual PHY drivers.
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
---
net/core/ethtool.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 61aebdf..e9b45567 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -122,6 +122,7 @@ tunable_strings[__ETHTOOL_TUNABLE_COUNT][ETH_GSTRING_LEN] = {
static const char
phy_tunable_strings[__ETHTOOL_PHY_TUNABLE_COUNT][ETH_GSTRING_LEN] = {
[ETHTOOL_ID_UNSPEC] = "Unspec",
+ [ETHTOOL_PHY_DOWNSHIFT] = "phy-downshift",
};
static int ethtool_get_features(struct net_device *dev, void __user *useraddr)
@@ -2435,6 +2436,11 @@ static int ethtool_set_per_queue(struct net_device *dev, void __user *useraddr)
static int ethtool_phy_tunable_valid(const struct ethtool_tunable *tuna)
{
switch (tuna->id) {
+ case ETHTOOL_PHY_DOWNSHIFT:
+ if (tuna->len != sizeof(u8) ||
+ tuna->type_id != ETHTOOL_TUNABLE_U8)
+ return -EINVAL;
+ break;
default:
return -EINVAL;
}
--
2.7.3
^ permalink raw reply related
* [patch net] mlxsw: spectrum_router: Flush FIB tables during fini
From: Jiri Pirko @ 2016-11-14 10:26 UTC (permalink / raw)
To: netdev; +Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz
From: Ido Schimmel <idosch@mellanox.com>
Since commit b45f64d16d45 ("mlxsw: spectrum_router: Use FIB notifications
instead of switchdev calls") we reflect to the device the entire FIB
table and not only FIBs that point to netdevs created by the driver.
During module removal, FIBs of the second type are removed following
NETDEV_UNREGISTER events sent. The other FIBs are still present in both
the driver's cache and the device's table.
Fix this by iterating over all the FIB tables in the device and flush
them. There's no need to take locks, as we're the only writer.
Fixes: b45f64d16d45 ("mlxsw: spectrum_router: Use FIB notifications instead of switchdev calls")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index cbeeddd..e83072d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -594,8 +594,11 @@ static int mlxsw_sp_vrs_init(struct mlxsw_sp *mlxsw_sp)
return 0;
}
+static void mlxsw_sp_router_fib_flush(struct mlxsw_sp *mlxsw_sp);
+
static void mlxsw_sp_vrs_fini(struct mlxsw_sp *mlxsw_sp)
{
+ mlxsw_sp_router_fib_flush(mlxsw_sp);
kfree(mlxsw_sp->router.vrs);
}
@@ -1867,18 +1870,18 @@ static int mlxsw_sp_router_set_abort_trap(struct mlxsw_sp *mlxsw_sp)
return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(ralue), ralue_pl);
}
-static void mlxsw_sp_router_fib4_abort(struct mlxsw_sp *mlxsw_sp)
+static void mlxsw_sp_router_fib_flush(struct mlxsw_sp *mlxsw_sp)
{
struct mlxsw_resources *resources;
struct mlxsw_sp_fib_entry *fib_entry;
struct mlxsw_sp_fib_entry *tmp;
struct mlxsw_sp_vr *vr;
int i;
- int err;
resources = mlxsw_core_resources_get(mlxsw_sp->core);
for (i = 0; i < resources->max_virtual_routers; i++) {
vr = &mlxsw_sp->router.vrs[i];
+
if (!vr->used)
continue;
@@ -1894,6 +1897,13 @@ static void mlxsw_sp_router_fib4_abort(struct mlxsw_sp *mlxsw_sp)
break;
}
}
+}
+
+static void mlxsw_sp_router_fib4_abort(struct mlxsw_sp *mlxsw_sp)
+{
+ int err;
+
+ mlxsw_sp_router_fib_flush(mlxsw_sp);
mlxsw_sp->router.aborted = true;
err = mlxsw_sp_router_set_abort_trap(mlxsw_sp);
if (err)
--
2.7.4
^ permalink raw reply related
* RE: [PATCH net-next v7 03/10] dpaa_eth: add option to use one buffer pool set
From: Madalin-Cristian Bucur @ 2016-11-14 10:25 UTC (permalink / raw)
To: David Miller
Cc: pebolle@tiscali.nl, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, ppc@mindchasers.com,
oss@buserror.net, joe@perches.com, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20161113.124617.2176429700446266337.davem@davemloft.net>
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Sunday, November 13, 2016 7:46 PM
>
> From: Madalin Bucur <madalin.bucur@nxp.com>
> Date: Fri, 11 Nov 2016 10:20:00 +0200
>
> > @@ -8,3 +8,12 @@ menuconfig FSL_DPAA_ETH
> > supporting the Freescale QorIQ chips.
> > Depends on Freescale Buffer Manager and Queue Manager
> > driver and Frame Manager Driver.
> > +
> > +if FSL_DPAA_ETH
> > +config FSL_DPAA_ETH_COMMON_BPOOL
> > + bool "Use a common buffer pool set for all the interfaces"
> > + ---help---
> > + The DPAA Ethernet netdevices require buffer pools for storing the
> buffers
> > + used by the FMan hardware for reception. One can use a single
> buffer pool
> > + set for all interfaces or a dedicated buffer pool set for each
> interface.
> > +endif # FSL_DPAA_ETH
>
> This in no way belongs in Kconfig. If you want to support this,
> support it wit a run time configuration choice via ethtool flags
> or similar. Do not use debugfs, do not use sysfs, do not use
> module options.
>
> If you put it in Kconfig, distributions will have to pick one way or
> another which means that users who want the other choice lose. This
> never works.
I've introduced this Kconfig option as a backwards compatible option, to
be able to run comparative tests between the independent buffer pool setup
and the previous common buffer pool setup. There are not so many reasons
to use the same buffer pool besides "having the old setup", the memory
saving is marginal, in all other aspects the separate buffer pools setup
fares better.
I'll remove this patch from the next submission. Should anyone care for
this I can add an entry to the feature backlog to add runtime support but
it will be quite low in priority.
Thank you for your review.
Madalin
^ permalink raw reply
* Re: [PATCH nf-next,RFC] netfilter: nft_meta: add cgroup version 2 support
From: Daniel Mack @ 2016-11-14 10:10 UTC (permalink / raw)
To: Pablo Neira Ayuso, netfilter-devel-u79uwXL29TY76Z2rM5mHXA
Cc: htejun-b10kYP2dOMg, cgroups-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1479114761-19534-1-git-send-email-pablo-Cap9r6Oaw4JrovVCs/uTlw@public.gmane.org>
Hi Pablo,
On 11/14/2016 10:12 AM, Pablo Neira Ayuso wrote:
> Add cgroup version 2 support to nf_tables.
>
> This extension allows us to fetch the cgroup i-node number from the
> cgroup socket data, place it in a register, then match it against any
> value specified by user. This approach scales up nicely since it
> integrates well in the existing nf_tables map infrastructure.
>
> Contrary to what iptables cgroup v2 match does, this patch doesn't use
> cgroup_is_descendant() because this call cannot guarantee that the cgroup
> hierarchy is honored in anyway given that the cgroup v2 field becomes yet
> another packet selector that you can use to build your filtering policy.
>
> Actually, using the i-node approach, it should be easy to build a policy
> that honors the hierarchy if you need this, eg.
>
> meta cgroup2 vmap { "/A/B" : jump b-cgroup-chain,
> "/A/C" : jump c-cgroup-chain,
> "/A" : jump a-cgroup-chain }
>
> then, the b-cgroup-chain looks like:
>
> jump a-cgroup-chain
> ... # specific policy b-cgroup-chain goes here
>
> similarly, the c-cgroup-chain looks like:
>
> jump a-cgroup-chain
> ... # specific policy c-cgroup-chain goes here
>
> So both B and C would evaluate A's ruleset. Note that cgroup A would
> also jump to the root cgroup chain policy.
>
> Anyway, this cgroup i-node approach provides way more flexibility since
> it is up to the sysadmin to decide if he wants to honor the hierarchy or
> simply define a fast path to skip any further classification.
I don't think this can work. The problem is that inodes in cgroupfs are
dynamically allocated when a cgroup is created, so the sysadmin cannot
install the jump rules before that. Worse yet, inode numbers in pseudo
filesystems are recycled, so if a cgroup goes away and a new one is
created, the latter may well end up having the same inode than the old
one. As cgroupfs decoupled from netfilter tables, this will lead to
major chaos in the field.
Note that this was different with the netclass controller in v1 that
would assign a user-controlled numeric value to each cgroup, so both
sides were in the control of the sysadmin. It is also different with the
path matching logic for v2 which does a full path string comparison.
That's potentially expensive, it does lead to predictable runtime behavior.
One way forward here would be to assign a atomically increasing 64-bit
sequence number to each cgroup and expose that. I've recently talked to
Tejun about that. While that won't solve the predictability issue, it
would at least make it practically impossible to have re-used IDs.
Anyway - I think it would be great to have an alternative to the v2 path
matching here, but of course this patch does not solve the ingress issue
we've been discussing. It is still impossible to reliably determine the
cgroup of a local receiver at the time when the netfilter rules are
processed, even for unicast packets.
Thanks,
Daniel
>
> Signed-off-by: Pablo Neira Ayuso <pablo-Cap9r6Oaw4JrovVCs/uTlw@public.gmane.org>
> ---
> include/uapi/linux/netfilter/nf_tables.h | 2 ++
> net/netfilter/nft_meta.c | 15 +++++++++++++++
> 2 files changed, 17 insertions(+)
>
> diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
> index 0da7ccf65511..5d4d08367a87 100644
> --- a/include/uapi/linux/netfilter/nf_tables.h
> +++ b/include/uapi/linux/netfilter/nf_tables.h
> @@ -729,6 +729,7 @@ enum nft_exthdr_attributes {
> * @NFT_META_OIFGROUP: packet output interface group
> * @NFT_META_CGROUP: socket control group (skb->sk->sk_classid)
> * @NFT_META_PRANDOM: a 32bit pseudo-random number
> + * @NFT_META_CGROUP2: socket control group v2 (skb->sk->sk_cgrp_data)
> */
> enum nft_meta_keys {
> NFT_META_LEN,
> @@ -756,6 +757,7 @@ enum nft_meta_keys {
> NFT_META_OIFGROUP,
> NFT_META_CGROUP,
> NFT_META_PRANDOM,
> + NFT_META_CGROUP2,
> };
>
> /**
> diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
> index 6c1e0246706e..1e793e133903 100644
> --- a/net/netfilter/nft_meta.c
> +++ b/net/netfilter/nft_meta.c
> @@ -190,6 +190,18 @@ void nft_meta_get_eval(const struct nft_expr *expr,
> *dest = prandom_u32_state(state);
> break;
> }
> +#ifdef CONFIG_SOCK_CGROUP_DATA
> + case NFT_META_CGROUP2: {
> + struct cgroup *cgrp;
> +
> + if (!skb->sk || !sk_fullsock(skb->sk))
> + goto err;
> +
> + cgrp = sock_cgroup_ptr(&skb->sk->sk_cgrp_data);
> + *dest = cgrp->kn->ino;
> + break;
> + }
> +#endif
> default:
> WARN_ON(1);
> goto err;
> @@ -273,6 +285,9 @@ int nft_meta_get_init(const struct nft_ctx *ctx,
> #ifdef CONFIG_CGROUP_NET_CLASSID
> case NFT_META_CGROUP:
> #endif
> +#ifdef CONFIG_SOCK_CGROUP_DATA
> + case NFT_META_CGROUP2:
> +#endif
> len = sizeof(u32);
> break;
> case NFT_META_IIFNAME:
>
^ permalink raw reply
* RE: [PATCH v3 3/5] net: asix: Fix AX88772x resume failures
From: ASIX_Allan [Office] @ 2016-11-14 9:45 UTC (permalink / raw)
To: 'Jon Hunter', robert.foss-ZGY8ohtN/8qB+jHODAdFcQ,
freddy-knRN6Y/kmf1NUHwG+Fw1Kw,
Dean_Jenkins-nmGgyN9QBj3QT0dZR+AlfA,
Mark_Craske-nmGgyN9QBj3QT0dZR+AlfA, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
ivecera-H+wXaHxf7aLQT0dZR+AlfA,
john.stultz-QSEj5FYQhm4dnm+yROfE0A,
vpalatin-F7+t8E8rja9g9hUCZPvPmw,
stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
grundler-F7+t8E8rja9g9hUCZPvPmw,
changchias-Re5JQEeQqe8AvxtiuMwx3w, andrew-g2DYL2Zd6BY,
tremyfr-Re5JQEeQqe8AvxtiuMwx3w, colin.king-Z7WLFzj8eWMS+FvcfC7Uqw,
linux-usb-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
vpalatin-hpIqsD4AKlfQT0dZR+AlfA
In-Reply-To: <f0d21edc-060b-bd9b-fced-faf1f51e8467-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
Hi Jon,
Please help to double check if the USB host controller of your Terga
platform had been powered OFF while running the ax88772_suspend() routine or
not?
---
Best regards,
Allan Chou
-----Original Message-----
From: Jon Hunter [mailto:jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org]
Sent: Monday, November 14, 2016 5:34 PM
To: allan-knRN6Y/kmf1NUHwG+Fw1Kw@public.gmane.org; robert.foss-ZGY8ohtN/8qB+jHODAdFcQ@public.gmane.org; freddy-knRN6Y/kmf1NUHwG+Fw1Kw@public.gmane.org;
Dean_Jenkins-nmGgyN9QBj3QT0dZR+AlfA@public.gmane.org; Mark_Craske-nmGgyN9QBj3QT0dZR+AlfA@public.gmane.org; davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org;
ivecera-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; john.stultz-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org; vpalatin-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org;
stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org; grundler-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org; changchias-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org;
andrew-g2DYL2Zd6BY@public.gmane.org; tremyfr-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org; colin.king-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org;
linux-usb-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org;
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; vpalatin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org
Subject: Re: [PATCH v3 3/5] net: asix: Fix AX88772x resume failures
Hi Allan,
On 14/11/16 08:50, ASIX_Allan [Home] wrote:
> It seems the AX88772x dongle had been unexpectedly removed while
> running the
> ax88772_suspend() routine. If yes, you might see these error messages
> because the hardware had been absent.
In my case the hardware was never removed. The boards are in a test fixture
that are not touched. This is seen on more than one board. By reverting this
change I no longer see the error messages and appears to be 100%
reproducible.
Jon
--
nvpublic
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH ethtool v2 1/2] ethtool-copy.h:sync with net
From: Allan W. Nielsen @ 2016-11-14 9:29 UTC (permalink / raw)
To: netdev; +Cc: andrew, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479115787-1265-1-git-send-email-allan.nielsen@microsemi.com>
From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
This covers kernel changes upto:
commit f5a4732f85613b3fb43f8bc33a017e3db3b3605a
Author: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Date: Wed Nov 9 16:33:09 2016 +0530
ethtool: (uapi) Add ETHTOOL_PHY_GTUNABLE and ETHTOOL_PHY_STUNABLE
For operation in cabling environments that are incompatible with 1000BAST-T,
PHY device may provide an automatic link speed downshift operation. When
enabled, the device automatically changes its 1000BAST-T auto-negotiation
to the next slower speed after a configured number of failed attempts at
1000BAST-T.
This feature is useful in setting up in networks using older cable
installations that include only pairs A and B, and not pairs C and D.
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
---
ethtool-copy.h | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/ethtool-copy.h b/ethtool-copy.h
index 70748f5..2e2448f 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -247,6 +247,19 @@ struct ethtool_tunable {
void *data[0];
};
+#define DOWNSHIFT_DEV_DEFAULT_COUNT 0xff
+#define DOWNSHIFT_DEV_DISABLE 0
+
+enum phy_tunable_id {
+ ETHTOOL_PHY_ID_UNSPEC,
+ ETHTOOL_PHY_DOWNSHIFT,
+ /*
+ * Add your fresh new phy tunable attribute above and remember to update
+ * phy_tunable_strings[] in net/core/ethtool.c
+ */
+ __ETHTOOL_PHY_TUNABLE_COUNT,
+};
+
/**
* struct ethtool_regs - hardware register dump
* @cmd: Command number = %ETHTOOL_GREGS
@@ -547,6 +560,7 @@ struct ethtool_pauseparam {
* @ETH_SS_FEATURES: Device feature names
* @ETH_SS_RSS_HASH_FUNCS: RSS hush function names
* @ETH_SS_PHY_STATS: Statistic names, for use with %ETHTOOL_GPHYSTATS
+ * @ETH_SS_PHY_TUNABLES: PHY tunable names
*/
enum ethtool_stringset {
ETH_SS_TEST = 0,
@@ -557,6 +571,7 @@ enum ethtool_stringset {
ETH_SS_RSS_HASH_FUNCS,
ETH_SS_TUNABLES,
ETH_SS_PHY_STATS,
+ ETH_SS_PHY_TUNABLES,
};
/**
@@ -1312,7 +1327,8 @@ struct ethtool_per_queue_op {
#define ETHTOOL_GLINKSETTINGS 0x0000004c /* Get ethtool_link_settings */
#define ETHTOOL_SLINKSETTINGS 0x0000004d /* Set ethtool_link_settings */
-
+#define ETHTOOL_PHY_GTUNABLE 0x0000004e /* Get PHY tunable configuration */
+#define ETHTOOL_PHY_STUNABLE 0x0000004f /* Set PHY tunable configuration */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
--
2.7.3
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox