* [PATCH net-next v3 3/5] ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables
From: Allan W. Nielsen @ 2016-11-17 12:07 UTC
To: netdev; +Cc: andrew, f.fainelli, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479384444-31122-1-git-send-email-allan.nielsen@microsemi.com>
From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
For operation in cabling environments that are incompatible with
1000BASE-T, a PHY device may provide an automatic link speed downshift
operation. When enabled, the device automatically changes its 1000BASE-T
auto-negotiation to the next slower speed after a configured number of
failed attempts at 1000BASE-T. This feature is useful when setting up
networks on older cable installations that include only pairs A and B,
and not pairs C and D.
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
include/uapi/linux/ethtool.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 42f696f..f0db778 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -248,9 +248,12 @@ struct ethtool_tunable {
void *data[0];
};
+#define DOWNSHIFT_DEV_DEFAULT_COUNT 0xff
+#define DOWNSHIFT_DEV_DISABLE 0
+
enum phy_tunable_id {
ETHTOOL_PHY_ID_UNSPEC,
-
+ ETHTOOL_PHY_DOWNSHIFT,
/*
* Add your fresh new phy tunable attribute above and remember to update
* phy_tunable_strings[] in net/core/ethtool.c
--
2.7.3
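As a reading aid for the two constants added above, here is a minimal userspace sketch of how a single u8 downshift value could be interpreted: 0 means disabled, 0xff asks the device to pick its own default, and any other value is a retry count. The helper name describe_downshift() is hypothetical and not part of this patch; only the two macros are copied from the diff.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Values copied from the uapi patch above. */
#define DOWNSHIFT_DEV_DEFAULT_COUNT 0xff
#define DOWNSHIFT_DEV_DISABLE       0

/* Hypothetical helper: render a downshift tunable value as text in buf. */
static const char *describe_downshift(uint8_t value, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	if (value == DOWNSHIFT_DEV_DISABLE)
		snprintf(buf, len, "disabled");
	else if (value == DOWNSHIFT_DEV_DEFAULT_COUNT)
		snprintf(buf, len, "enabled, device default count");
	else
		snprintf(buf, len, "enabled, count %u", (unsigned int)value);

	return buf;
}
```

Which counts a given PHY accepts is left to the driver; this sketch only shows the meaning of the two reserved values.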
* [PATCH ethtool v3 1/2] ethtool-copy.h:sync with net
From: Allan W. Nielsen @ 2016-11-17 12:07 UTC
To: netdev; +Cc: andrew, f.fainelli, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479384480-31201-1-git-send-email-allan.nielsen@microsemi.com>
From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
This covers kernel changes up to:
commit f5a4732f85613b3fb43f8bc33a017e3db3b3605a
Author: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Date: Wed Nov 9 16:33:09 2016 +0530
ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables
For operation in cabling environments that are incompatible with
1000BASE-T, a PHY device may provide an automatic link speed downshift
operation. When enabled, the device automatically changes its 1000BASE-T
auto-negotiation to the next slower speed after a configured number of
failed attempts at 1000BASE-T. This feature is useful when setting up
networks on older cable installations that include only pairs A and B,
and not pairs C and D.
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
---
ethtool-copy.h | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/ethtool-copy.h b/ethtool-copy.h
index 70748f5..2e2448f 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -247,6 +247,19 @@ struct ethtool_tunable {
void *data[0];
};
+#define DOWNSHIFT_DEV_DEFAULT_COUNT 0xff
+#define DOWNSHIFT_DEV_DISABLE 0
+
+enum phy_tunable_id {
+ ETHTOOL_PHY_ID_UNSPEC,
+ ETHTOOL_PHY_DOWNSHIFT,
+ /*
+ * Add your fresh new phy tunable attribute above and remember to update
+ * phy_tunable_strings[] in net/core/ethtool.c
+ */
+ __ETHTOOL_PHY_TUNABLE_COUNT,
+};
+
/**
* struct ethtool_regs - hardware register dump
* @cmd: Command number = %ETHTOOL_GREGS
@@ -547,6 +560,7 @@ struct ethtool_pauseparam {
* @ETH_SS_FEATURES: Device feature names
* @ETH_SS_RSS_HASH_FUNCS: RSS hush function names
* @ETH_SS_PHY_STATS: Statistic names, for use with %ETHTOOL_GPHYSTATS
+ * @ETH_SS_PHY_TUNABLES: PHY tunable names
*/
enum ethtool_stringset {
ETH_SS_TEST = 0,
@@ -557,6 +571,7 @@ enum ethtool_stringset {
ETH_SS_RSS_HASH_FUNCS,
ETH_SS_TUNABLES,
ETH_SS_PHY_STATS,
+ ETH_SS_PHY_TUNABLES,
};
/**
@@ -1312,7 +1327,8 @@ struct ethtool_per_queue_op {
#define ETHTOOL_GLINKSETTINGS 0x0000004c /* Get ethtool_link_settings */
#define ETHTOOL_SLINKSETTINGS 0x0000004d /* Set ethtool_link_settings */
-
+#define ETHTOOL_PHY_GTUNABLE 0x0000004e /* Get PHY tunable configuration */
+#define ETHTOOL_PHY_STUNABLE 0x0000004f /* Set PHY tunable configuration */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
--
2.7.3
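The enum synced above ends in a sentinel, __ETHTOOL_PHY_TUNABLE_COUNT, which the kernel side uses to size its name table. The following standalone sketch illustrates that pattern. GSTRING_LEN stands in for ETH_GSTRING_LEN, the lookup function is hypothetical, and the two strings are the ones used by the net-next patches in this series.

```c
#include <assert.h>
#include <string.h>

#define GSTRING_LEN 32 /* stand-in for ETH_GSTRING_LEN in this sketch */

enum phy_tunable_id {
	ETHTOOL_PHY_ID_UNSPEC,
	ETHTOOL_PHY_DOWNSHIFT,
	/* new IDs go above the sentinel */
	__ETHTOOL_PHY_TUNABLE_COUNT,
};

/* One fixed-width name per tunable ID; the sentinel sizes the table. */
static const char
phy_tunable_names[__ETHTOOL_PHY_TUNABLE_COUNT][GSTRING_LEN] = {
	[ETHTOOL_PHY_ID_UNSPEC] = "Unspec",
	[ETHTOOL_PHY_DOWNSHIFT] = "phy-downshift",
};

/* The table grows automatically when an ID is added before the sentinel. */
static_assert(sizeof(phy_tunable_names) ==
	      __ETHTOOL_PHY_TUNABLE_COUNT * GSTRING_LEN,
	      "one row per tunable ID");

/* Hypothetical lookup: name for an ID, or NULL when out of range. */
static const char *phy_tunable_name(unsigned int id)
{
	if (id >= __ETHTOOL_PHY_TUNABLE_COUNT)
		return NULL;
	return phy_tunable_names[id];
}
```

An ID added without a matching designated initializer would get an empty string, which is why the header comment asks for both to be updated together.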
* [PATCH ethtool v3 0/2] Adding downshift support to ethtool
From: Allan W. Nielsen @ 2016-11-17 12:07 UTC
To: netdev; +Cc: andrew, f.fainelli, raju.lakkaraju, allan.nielsen
Hi All,
(This is a re-post of the v3 patch set with a new cover letter - I was not
aware that cover letters are used as commit comments in merge commits).
This patch set implements set/get support for downshifting.
Downshifting can either be turned on/off, or it can be configured to a
specific count.
"count" is optional.
Tested on Beaglebone Black with VSC 8531 PHY.
Change set:
v1:
- Initial version of set/get phy tunable with downshift feature.
v2:
- (ethtool) Syntax is changed from "--set-phy-tunable downshift on|off|%d"
to "--set-phy-tunable [downshift on|off [count N]]" - as requested by
Andrew.
v3:
- Fixed Spelling in "ethtool-copy.h:sync with net"
- Fixed "if send_ioctl() returns an error, print the error message and then
still print the value of count".
Raju Lakkaraju (2):
ethtool-copy.h:sync with net
Ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE and PHY
downshift
ethtool-copy.h | 18 +++++++-
ethtool.8.in | 39 ++++++++++++++++
ethtool.c | 144 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 200 insertions(+), 1 deletion(-)
--
2.7.4
* [PATCH net-next v3 2/5] ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE
From: Allan W. Nielsen @ 2016-11-17 12:07 UTC
To: netdev; +Cc: andrew, f.fainelli, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479384444-31122-1-git-send-email-allan.nielsen@microsemi.com>
From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Add get_tunable/set_tunable function pointers to the phy_driver
structure, and use these function pointers to implement the
ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE ioctls.
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
---
include/linux/phy.h | 7 +++++
net/core/ethtool.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 94 insertions(+)
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 9880d73..3d35c36 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -611,6 +611,13 @@ struct phy_driver {
void (*get_strings)(struct phy_device *dev, u8 *data);
void (*get_stats)(struct phy_device *dev,
struct ethtool_stats *stats, u64 *data);
+
+ /* Get and Set PHY tunables */
+ int (*get_tunable)(struct phy_device *dev,
+ struct ethtool_tunable *tuna, void *data);
+ int (*set_tunable)(struct phy_device *dev,
+ struct ethtool_tunable *tuna,
+ const void *data);
};
#define to_phy_driver(d) container_of(to_mdio_common_driver(d), \
struct phy_driver, mdiodrv)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 9774898..61aebdf 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -119,6 +119,11 @@ tunable_strings[__ETHTOOL_TUNABLE_COUNT][ETH_GSTRING_LEN] = {
[ETHTOOL_TX_COPYBREAK] = "tx-copybreak",
};
+static const char
+phy_tunable_strings[__ETHTOOL_PHY_TUNABLE_COUNT][ETH_GSTRING_LEN] = {
+ [ETHTOOL_ID_UNSPEC] = "Unspec",
+};
+
static int ethtool_get_features(struct net_device *dev, void __user *useraddr)
{
struct ethtool_gfeatures cmd = {
@@ -227,6 +232,9 @@ static int __ethtool_get_sset_count(struct net_device *dev, int sset)
if (sset == ETH_SS_TUNABLES)
return ARRAY_SIZE(tunable_strings);
+ if (sset == ETH_SS_PHY_TUNABLES)
+ return ARRAY_SIZE(phy_tunable_strings);
+
if (sset == ETH_SS_PHY_STATS) {
if (dev->phydev)
return phy_get_sset_count(dev->phydev);
@@ -253,6 +261,8 @@ static void __ethtool_get_strings(struct net_device *dev,
sizeof(rss_hash_func_strings));
else if (stringset == ETH_SS_TUNABLES)
memcpy(data, tunable_strings, sizeof(tunable_strings));
+ else if (stringset == ETH_SS_PHY_TUNABLES)
+ memcpy(data, phy_tunable_strings, sizeof(phy_tunable_strings));
else if (stringset == ETH_SS_PHY_STATS) {
struct phy_device *phydev = dev->phydev;
@@ -2422,6 +2432,76 @@ static int ethtool_set_per_queue(struct net_device *dev, void __user *useraddr)
};
}
+static int ethtool_phy_tunable_valid(const struct ethtool_tunable *tuna)
+{
+ switch (tuna->id) {
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int get_phy_tunable(struct net_device *dev, void __user *useraddr)
+{
+ int ret;
+ struct ethtool_tunable tuna;
+ struct phy_device *phydev = dev->phydev;
+ void *data;
+
+ if (!(phydev && phydev->drv && phydev->drv->get_tunable))
+ return -EOPNOTSUPP;
+
+ if (copy_from_user(&tuna, useraddr, sizeof(tuna)))
+ return -EFAULT;
+ ret = ethtool_phy_tunable_valid(&tuna);
+ if (ret)
+ return ret;
+ data = kmalloc(tuna.len, GFP_USER);
+ if (!data)
+ return -ENOMEM;
+ ret = phydev->drv->get_tunable(phydev, &tuna, data);
+ if (ret)
+ goto out;
+ useraddr += sizeof(tuna);
+ ret = -EFAULT;
+ if (copy_to_user(useraddr, data, tuna.len))
+ goto out;
+ ret = 0;
+
+out:
+ kfree(data);
+ return ret;
+}
+
+static int set_phy_tunable(struct net_device *dev, void __user *useraddr)
+{
+ int ret;
+ struct ethtool_tunable tuna;
+ struct phy_device *phydev = dev->phydev;
+ void *data;
+
+ if (!(phydev && phydev->drv && phydev->drv->set_tunable))
+ return -EOPNOTSUPP;
+ if (copy_from_user(&tuna, useraddr, sizeof(tuna)))
+ return -EFAULT;
+ ret = ethtool_phy_tunable_valid(&tuna);
+ if (ret)
+ return ret;
+ data = kmalloc(tuna.len, GFP_USER);
+ if (!data)
+ return -ENOMEM;
+ useraddr += sizeof(tuna);
+ ret = -EFAULT;
+ if (copy_from_user(data, useraddr, tuna.len))
+ goto out;
+ ret = phydev->drv->set_tunable(phydev, &tuna, data);
+
+out:
+ kfree(data);
+ return ret;
+}
+
/* The main entry point in this file. Called from net/core/dev_ioctl.c */
int dev_ethtool(struct net *net, struct ifreq *ifr)
@@ -2479,6 +2559,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
case ETHTOOL_GET_TS_INFO:
case ETHTOOL_GEEE:
case ETHTOOL_GTUNABLE:
+ case ETHTOOL_PHY_GTUNABLE:
break;
default:
if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
@@ -2684,6 +2765,12 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
case ETHTOOL_SLINKSETTINGS:
rc = ethtool_set_link_ksettings(dev, useraddr);
break;
+ case ETHTOOL_PHY_GTUNABLE:
+ rc = get_phy_tunable(dev, useraddr);
+ break;
+ case ETHTOOL_PHY_STUNABLE:
+ rc = set_phy_tunable(dev, useraddr);
+ break;
default:
rc = -EOPNOTSUPP;
}
--
2.7.3
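Both handlers above copy a fixed header from useraddr and then treat the tuna.len bytes that follow it (useraddr += sizeof(tuna)) as the payload. A minimal userspace sketch of building a buffer in that layout is shown below. struct tunable_hdr is a hypothetical mirror of the four u32 fields of struct ethtool_tunable, and both function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mirror of the fixed part of struct ethtool_tunable. */
struct tunable_hdr {
	uint32_t cmd;
	uint32_t id;
	uint32_t type_id;
	uint32_t len;	/* number of payload bytes following the header */
};

static_assert(sizeof(struct tunable_hdr) == 16, "four packed u32 fields");

/* Allocate one contiguous buffer: header first, then len payload bytes. */
static void *tunable_request_new(uint32_t cmd, uint32_t id, uint32_t type_id,
				 const void *payload, uint32_t len)
{
	struct tunable_hdr hdr = { cmd, id, type_id, len };
	unsigned char *buf = malloc(sizeof(hdr) + len);

	if (!buf)
		return NULL;
	memcpy(buf, &hdr, sizeof(hdr));
	if (len)
		memcpy(buf + sizeof(hdr), payload, len);
	return buf;
}

/* The payload starts right after the header, as the kernel side assumes. */
static const void *tunable_request_payload(const void *req)
{
	return (const unsigned char *)req + sizeof(struct tunable_hdr);
}
```

Keeping header and payload in one allocation means the value itself, not a pointer to it, sits where the kernel reads or writes the data.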
* [PATCH net-next v3 5/5] net: phy: Add downshift get/set support in Microsemi PHYs driver
From: Allan W. Nielsen @ 2016-11-17 12:07 UTC
To: netdev; +Cc: andrew, f.fainelli, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479384444-31122-1-git-send-email-allan.nielsen@microsemi.com>
From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Implement the PHY tunable function pointers and the downshift
functionality for MSCC PHYs.
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
---
drivers/net/phy/mscc.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 100 insertions(+)
diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
index d0026ab..92018ba 100644
--- a/drivers/net/phy/mscc.c
+++ b/drivers/net/phy/mscc.c
@@ -46,8 +46,15 @@ enum rgmii_rx_clock_delay {
#define MSCC_EXT_PAGE_ACCESS 31
#define MSCC_PHY_PAGE_STANDARD 0x0000 /* Standard registers */
+#define MSCC_PHY_PAGE_EXTENDED 0x0001 /* Extended registers */
#define MSCC_PHY_PAGE_EXTENDED_2 0x0002 /* Extended reg - page 2 */
+/* Extended Page 1 Registers */
+#define MSCC_PHY_ACTIPHY_CNTL 20
+#define DOWNSHIFT_CNTL_MASK 0x001C
+#define DOWNSHIFT_EN 0x0010
+#define DOWNSHIFT_CNTL_POS 2
+
/* Extended Page 2 Registers */
#define MSCC_PHY_RGMII_CNTL 20
#define RGMII_RX_CLK_DELAY_MASK 0x0070
@@ -75,6 +82,8 @@ enum rgmii_rx_clock_delay {
#define MSCC_VDDMAC_2500 2500
#define MSCC_VDDMAC_3300 3300
+#define DOWNSHIFT_COUNT_MAX 5
+
struct vsc8531_private {
int rate_magic;
};
@@ -101,6 +110,66 @@ static int vsc85xx_phy_page_set(struct phy_device *phydev, u8 page)
return rc;
}
+static int vsc85xx_downshift_get(struct phy_device *phydev, u8 *count)
+{
+ int rc;
+ u16 reg_val;
+
+ mutex_lock(&phydev->lock);
+ rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED);
+ if (rc != 0)
+ goto out_unlock;
+
+ reg_val = phy_read(phydev, MSCC_PHY_ACTIPHY_CNTL);
+ reg_val &= DOWNSHIFT_CNTL_MASK;
+ if (!(reg_val & DOWNSHIFT_EN))
+ *count = DOWNSHIFT_DEV_DISABLE;
+ else
+ *count = ((reg_val & ~DOWNSHIFT_EN) >> DOWNSHIFT_CNTL_POS) + 2;
+ rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD);
+
+out_unlock:
+ mutex_unlock(&phydev->lock);
+
+ return rc;
+}
+
+static int vsc85xx_downshift_set(struct phy_device *phydev, u8 count)
+{
+ int rc;
+ u16 reg_val;
+
+ if (count == DOWNSHIFT_DEV_DEFAULT_COUNT) {
+ /* Default downshift count 3 (i.e. Bit3:2 = 0b01) */
+ count = ((1 << DOWNSHIFT_CNTL_POS) | DOWNSHIFT_EN);
+ } else if (count > DOWNSHIFT_COUNT_MAX || count == 1) {
+ phydev_err(phydev, "Downshift count should be 2,3,4 or 5\n");
+ return -ERANGE;
+ } else if (count) {
+ /* Downshift count is either 2,3,4 or 5 */
+ count = (((count - 2) << DOWNSHIFT_CNTL_POS) | DOWNSHIFT_EN);
+ }
+
+ mutex_lock(&phydev->lock);
+ rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED);
+ if (rc != 0)
+ goto out_unlock;
+
+ reg_val = phy_read(phydev, MSCC_PHY_ACTIPHY_CNTL);
+ reg_val &= ~(DOWNSHIFT_CNTL_MASK);
+ reg_val |= count;
+ rc = phy_write(phydev, MSCC_PHY_ACTIPHY_CNTL, reg_val);
+ if (rc != 0)
+ goto out_unlock;
+
+ rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD);
+
+out_unlock:
+ mutex_unlock(&phydev->lock);
+
+ return rc;
+}
+
static int vsc85xx_wol_set(struct phy_device *phydev,
struct ethtool_wolinfo *wol)
{
@@ -329,6 +398,29 @@ static int vsc85xx_default_config(struct phy_device *phydev)
return rc;
}
+static int vsc85xx_get_tunable(struct phy_device *phydev,
+ struct ethtool_tunable *tuna, void *data)
+{
+ switch (tuna->id) {
+ case ETHTOOL_PHY_DOWNSHIFT:
+ return vsc85xx_downshift_get(phydev, (u8 *)data);
+ default:
+ return -EINVAL;
+ }
+}
+
+static int vsc85xx_set_tunable(struct phy_device *phydev,
+ struct ethtool_tunable *tuna,
+ const void *data)
+{
+ switch (tuna->id) {
+ case ETHTOOL_PHY_DOWNSHIFT:
+ return vsc85xx_downshift_set(phydev, *(u8 *)data);
+ default:
+ return -EINVAL;
+ }
+}
+
static int vsc85xx_config_init(struct phy_device *phydev)
{
int rc;
@@ -418,6 +510,8 @@ static struct phy_driver vsc85xx_driver[] = {
.probe = &vsc85xx_probe,
.set_wol = &vsc85xx_wol_set,
.get_wol = &vsc85xx_wol_get,
+ .get_tunable = &vsc85xx_get_tunable,
+ .set_tunable = &vsc85xx_set_tunable,
},
{
.phy_id = PHY_ID_VSC8531,
@@ -437,6 +531,8 @@ static struct phy_driver vsc85xx_driver[] = {
.probe = &vsc85xx_probe,
.set_wol = &vsc85xx_wol_set,
.get_wol = &vsc85xx_wol_get,
+ .get_tunable = &vsc85xx_get_tunable,
+ .set_tunable = &vsc85xx_set_tunable,
},
{
.phy_id = PHY_ID_VSC8540,
@@ -456,6 +552,8 @@ static struct phy_driver vsc85xx_driver[] = {
.probe = &vsc85xx_probe,
.set_wol = &vsc85xx_wol_set,
.get_wol = &vsc85xx_wol_get,
+ .get_tunable = &vsc85xx_get_tunable,
+ .set_tunable = &vsc85xx_set_tunable,
},
{
.phy_id = PHY_ID_VSC8541,
@@ -475,6 +573,8 @@ static struct phy_driver vsc85xx_driver[] = {
.probe = &vsc85xx_probe,
.set_wol = &vsc85xx_wol_set,
.get_wol = &vsc85xx_wol_get,
+ .get_tunable = &vsc85xx_get_tunable,
+ .set_tunable = &vsc85xx_set_tunable,
}
};
--
2.7.3
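The register arithmetic in vsc85xx_downshift_get()/vsc85xx_downshift_set() can be checked in isolation. The sketch below pulls it out into two pure helpers with hypothetical names; the constants are copied from this patch and from the uapi patch, and no MDIO access or locking is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Constants copied from the patches in this series. */
#define DOWNSHIFT_CNTL_MASK         0x001C
#define DOWNSHIFT_EN                0x0010
#define DOWNSHIFT_CNTL_POS          2
#define DOWNSHIFT_COUNT_MAX         5
#define DOWNSHIFT_DEV_DEFAULT_COUNT 0xff
#define DOWNSHIFT_DEV_DISABLE       0

static_assert((DOWNSHIFT_CNTL_MASK & DOWNSHIFT_EN) == DOWNSHIFT_EN,
	      "enable bit must lie inside the control mask");

/* Hypothetical helper: count -> register field bits, or -ERANGE. */
static int downshift_encode(uint8_t count)
{
	if (count == DOWNSHIFT_DEV_DISABLE)
		return 0;		/* enable bit cleared */
	if (count == DOWNSHIFT_DEV_DEFAULT_COUNT)
		count = 3;		/* device default, bits 3:2 = 0b01 */
	if (count < 2 || count > DOWNSHIFT_COUNT_MAX)
		return -ERANGE;
	return ((count - 2) << DOWNSHIFT_CNTL_POS) | DOWNSHIFT_EN;
}

/* Hypothetical helper: register value -> count (0 when disabled). */
static uint8_t downshift_decode(uint16_t reg_val)
{
	reg_val &= DOWNSHIFT_CNTL_MASK;
	if (!(reg_val & DOWNSHIFT_EN))
		return DOWNSHIFT_DEV_DISABLE;
	return (uint8_t)(((reg_val & ~DOWNSHIFT_EN) >> DOWNSHIFT_CNTL_POS) + 2);
}
```

For example, a count of 3 encodes to 0x14 (enable bit plus 0b01 in bits 3:2), and decoding 0x14 gives 3 again.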
* [PATCH] net: sky2: Fix shutdown crash
From: Jeremy Linton @ 2016-11-17 15:14 UTC
To: netdev; +Cc: mlindner, stephen, Sudeep.Holla
The sky2 frequently crashes during machine shutdown with:
sky2_get_stats+0x60/0x3d8 [sky2]
dev_get_stats+0x68/0xd8
rtnl_fill_stats+0x54/0x140
rtnl_fill_ifinfo+0x46c/0xc68
rtmsg_ifinfo_build_skb+0x7c/0xf0
rtmsg_ifinfo.part.22+0x3c/0x70
rtmsg_ifinfo+0x50/0x5c
netdev_state_change+0x4c/0x58
linkwatch_do_dev+0x50/0x88
__linkwatch_run_queue+0x104/0x1a4
linkwatch_event+0x30/0x3c
process_one_work+0x140/0x3e0
worker_thread+0x60/0x44c
kthread+0xdc/0xf0
ret_from_fork+0x10/0x50
This is caused by the sky2 being called after it has been shut down.
A previous thread about this can be found here:
https://lkml.org/lkml/2016/4/12/410
An alternative fix is to ensure that IFF_UP gets cleared by
calling dev_close() during shutdown. This is similar to what the
bnx2/tg3/xgene and maybe others are doing to ensure that the driver
isn't being called following _shutdown().
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
---
drivers/net/ethernet/marvell/sky2.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c
index f05ea56..941c8e2 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -5220,6 +5220,19 @@ static SIMPLE_DEV_PM_OPS(sky2_pm_ops, sky2_suspend, sky2_resume);
static void sky2_shutdown(struct pci_dev *pdev)
{
+ struct sky2_hw *hw = pci_get_drvdata(pdev);
+ int port;
+
+ for (port = 0; port < hw->ports; port++) {
+ struct net_device *ndev = hw->dev[port];
+
+ rtnl_lock();
+ if (netif_running(ndev)) {
+ dev_close(ndev);
+ netif_device_detach(ndev);
+ }
+ rtnl_unlock();
+ }
sky2_suspend(&pdev->dev);
pci_wake_from_d3(pdev, device_may_wakeup(&pdev->dev));
pci_set_power_state(pdev, PCI_D3hot);
--
2.5.5
* Re: Netperf UDP issue with connected sockets
From: Rick Jones @ 2016-11-17 17:42 UTC
To: Jesper Dangaard Brouer, Eric Dumazet; +Cc: netdev, brouer
In-Reply-To: <20161117091638.5fab8494@redhat.com>
On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:
>> time to try IP_MTU_DISCOVER ;)
>
> To Rick, maybe you can find a good solution or option with Eric's hint,
> to send appropriate sized UDP packets with Don't Fragment (DF).
Well, I suppose adding another setsockopt() to the data socket creation
wouldn't be too difficult, along with another command-line option to
cause it to happen.
Could we leave things as "make sure you don't need fragmentation when
you use this" or would netperf have to start processing ICMP messages?
happy benchmarking,
rick jones
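For reference, the extra setsockopt() discussed above could look roughly like the sketch below on Linux. This is not netperf code: the function names are hypothetical, and only the IP_MTU_DISCOVER option and the IP_PMTUDISC_DO value are real Linux socket API. With IP_PMTUDISC_DO set, the kernel sets DF on outgoing datagrams and rejects sends larger than the known path MTU with EMSGSIZE.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Ask the kernel to set DF and never fragment locally on this socket.
 * Linux-specific. Returns 0 on success, -1 with errno set on failure.
 */
static int udp_set_dont_fragment(int fd)
{
	int val = IP_PMTUDISC_DO;

	return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val));
}

/* Read back the current path-MTU discovery mode, or -1 on failure. */
static int udp_get_pmtudisc(int fd)
{
	int val = -1;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, &len) < 0)
		return -1;
	return val;
}
```

A benchmark using this would call udp_set_dont_fragment() once, right after creating the data socket.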
* [PATCH net-next v3 4/5] ethtool: Core impl for ETHTOOL_PHY_DOWNSHIFT tunable
From: Allan W. Nielsen @ 2016-11-17 12:07 UTC
To: netdev; +Cc: andrew, f.fainelli, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479384444-31122-1-git-send-email-allan.nielsen@microsemi.com>
From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Add validation support for ETHTOOL_PHY_DOWNSHIFT. The functional
implementation needs to be done in the individual PHY drivers.
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
---
net/core/ethtool.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 61aebdf..e9b45567 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -122,6 +122,7 @@ tunable_strings[__ETHTOOL_TUNABLE_COUNT][ETH_GSTRING_LEN] = {
static const char
phy_tunable_strings[__ETHTOOL_PHY_TUNABLE_COUNT][ETH_GSTRING_LEN] = {
[ETHTOOL_ID_UNSPEC] = "Unspec",
+ [ETHTOOL_PHY_DOWNSHIFT] = "phy-downshift",
};
static int ethtool_get_features(struct net_device *dev, void __user *useraddr)
@@ -2435,6 +2436,11 @@ static int ethtool_set_per_queue(struct net_device *dev, void __user *useraddr)
static int ethtool_phy_tunable_valid(const struct ethtool_tunable *tuna)
{
switch (tuna->id) {
+ case ETHTOOL_PHY_DOWNSHIFT:
+ if (tuna->len != sizeof(u8) ||
+ tuna->type_id != ETHTOOL_TUNABLE_U8)
+ return -EINVAL;
+ break;
default:
return -EINVAL;
}
--
2.7.3
* [PATCH ethtool v3 2/2] Ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE and PHY downshift
From: Allan W. Nielsen @ 2016-11-17 12:08 UTC
To: netdev; +Cc: andrew, f.fainelli, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479384480-31201-1-git-send-email-allan.nielsen@microsemi.com>
From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Add ethtool get and set tunable to access PHY drivers.
Ethtool Help: ethtool -h for PHY tunables
ethtool --set-phy-tunable DEVNAME Set PHY tunable
[ downshift on|off [count N] ]
ethtool --get-phy-tunable DEVNAME Get PHY tunable
[ downshift ]
Ethtool ex:
ethtool --set-phy-tunable eth0 downshift on
ethtool --set-phy-tunable eth0 downshift off
ethtool --set-phy-tunable eth0 downshift on count 2
ethtool --get-phy-tunable eth0 downshift
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
---
ethtool.8.in | 39 ++++++++++++++++
ethtool.c | 144 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 183 insertions(+)
diff --git a/ethtool.8.in b/ethtool.8.in
index 9631847..337d0cf 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -340,6 +340,18 @@ ethtool \- query or control network driver and hardware settings
.B2 tx-lpi on off
.BN tx-timer
.BN advertise
+.HP
+.B ethtool \-\-set\-phy\-tunable
+.I devname
+.RB [
+.B downshift
+.A1 on off
+.BN count
+.RB ]
+.HP
+.B ethtool \-\-get\-phy\-tunable
+.I devname
+.RB [ downshift ]
.
.\" Adjust lines (i.e. full justification) and hyphenate.
.ad
@@ -947,6 +959,33 @@ Values are as for
Sets the amount of time the device should stay in idle mode prior to asserting
its Tx LPI (in microseconds). This has meaning only when Tx LPI is enabled.
.RE
+.TP
+.B \-\-set\-phy\-tunable
+Sets the PHY tunable parameters.
+.RS 4
+.TP
+.A2 downshift on off
+Specifies whether downshift should be enabled
+.TS
+nokeep;
+lB l.
+.BI count \ N
+Sets the PHY downshift re-tries count.
+.TE
+.PD
+.RE
+.TP
+.B \-\-get\-phy\-tunable
+Gets the PHY tunable parameters.
+.RS 4
+.TP
+.B downshift
+For operation in cabling environments that are incompatible with 1000BASE-T,
+PHY device provides an automatic link speed downshift operation.
+Link speed downshift after N failed 1000BASE-T auto-negotiation attempts.
+
+Gets the PHY downshift count/status.
+.RE
.SH BUGS
Not supported (in part or whole) on all network drivers.
.SH AUTHOR
diff --git a/ethtool.c b/ethtool.c
index 49ac94e..7dcd005 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -4520,6 +4520,146 @@ static int do_seee(struct cmd_context *ctx)
return 0;
}
+static int do_get_phy_tunable(struct cmd_context *ctx)
+{
+ int argc = ctx->argc;
+ char **argp = ctx->argp;
+ int err, i;
+ u8 downshift_changed = 0;
+
+ if (argc < 1)
+ exit_bad_args();
+ for (i = 0; i < argc; i++) {
+ if (!strcmp(argp[i], "downshift")) {
+ downshift_changed = 1;
+ i += 1;
+ if (i < argc)
+ exit_bad_args();
+ } else {
+ exit_bad_args();
+ }
+ }
+
+ if (downshift_changed) {
+ struct ethtool_tunable ds;
+ u8 count = 0;
+
+ ds.cmd = ETHTOOL_PHY_GTUNABLE;
+ ds.id = ETHTOOL_PHY_DOWNSHIFT;
+ ds.type_id = ETHTOOL_TUNABLE_U8;
+ ds.len = 1;
+ ds.data[0] = &count;
+ err = send_ioctl(ctx, &ds);
+ if (err < 0) {
+ perror("Cannot Get PHY downshift count");
+ return 87;
+ }
+ count = *((u8 *)&ds.data[0]);
+ if (count)
+ fprintf(stdout, "Downshift count: %d\n", count);
+ else
+ fprintf(stdout, "Downshift disabled\n");
+ }
+
+ return err;
+}
+
+static int parse_named_bool(struct cmd_context *ctx, const char *name, u8 *on)
+{
+ if (ctx->argc < 2)
+ return 0;
+
+ if (strcmp(*ctx->argp, name))
+ return 0;
+
+ if (!strcmp(*(ctx->argp + 1), "on")) {
+ *on = 1;
+ } else if (!strcmp(*(ctx->argp + 1), "off")) {
+ *on = 0;
+ } else {
+ fprintf(stderr, "Invalid boolean\n");
+ exit_bad_args();
+ }
+
+ ctx->argc -= 2;
+ ctx->argp += 2;
+
+ return 1;
+}
+
+static int parse_named_u8(struct cmd_context *ctx, const char *name, u8 *val)
+{
+ if (ctx->argc < 2)
+ return 0;
+
+ if (strcmp(*ctx->argp, name))
+ return 0;
+
+ *val = get_uint_range(*(ctx->argp + 1), 0, 0xff);
+
+ ctx->argc -= 2;
+ ctx->argp += 2;
+
+ return 1;
+}
+
+static int do_set_phy_tunable(struct cmd_context *ctx)
+{
+ int err = 0;
+ u8 ds_cnt = DOWNSHIFT_DEV_DEFAULT_COUNT;
+ u8 ds_changed = 0, ds_has_cnt = 0, ds_enable = 0;
+
+ if (ctx->argc == 0)
+ exit_bad_args();
+
+ /* Parse arguments */
+ while (ctx->argc) {
+ if (parse_named_bool(ctx, "downshift", &ds_enable)) {
+ ds_changed = 1;
+ ds_has_cnt = parse_named_u8(ctx, "count", &ds_cnt);
+ } else {
+ exit_bad_args();
+ }
+ }
+
+ /* Validate parameters */
+ if (ds_changed) {
+ if (!ds_enable && ds_has_cnt) {
+ fprintf(stderr, "'count' may not be set when downshift "
+ "is off.\n");
+ exit_bad_args();
+ }
+
+ if (ds_enable && ds_has_cnt && ds_cnt == 0) {
+ fprintf(stderr, "'count' may not be zero.\n");
+ exit_bad_args();
+ }
+
+ if (!ds_enable)
+ ds_cnt = DOWNSHIFT_DEV_DISABLE;
+ }
+
+ /* Do it */
+ if (ds_changed) {
+ struct ethtool_tunable ds;
+ u8 count;
+
+ ds.cmd = ETHTOOL_PHY_STUNABLE;
+ ds.id = ETHTOOL_PHY_DOWNSHIFT;
+ ds.type_id = ETHTOOL_TUNABLE_U8;
+ ds.len = 1;
+ ds.data[0] = &count;
+ *((u8 *)&ds.data[0]) = ds_cnt;
+ err = send_ioctl(ctx, &ds);
+ if (err < 0) {
+ perror("Cannot Set PHY downshift count");
+ err = 87;
+ }
+ }
+
+ return err;
+}
+
#ifndef TEST_ETHTOOL
int send_ioctl(struct cmd_context *ctx, void *cmd)
{
@@ -4681,6 +4821,10 @@ static const struct option {
" [ advertise %x ]\n"
" [ tx-lpi on|off ]\n"
" [ tx-timer %d ]\n"},
+ { "--set-phy-tunable", 1, do_set_phy_tunable, "Set PHY tunable",
+ " [ downshift on|off [count N] ]\n"},
+ { "--get-phy-tunable", 1, do_get_phy_tunable, "Get PHY tunable",
+ " [ downshift ]\n"},
{ "-h|--help", 0, show_usage, "Show this help" },
{ "--version", 0, do_version, "Show version number" },
{}
--
2.7.3
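The rules enforced by do_set_phy_tunable() ("count" only with "on", count must not be zero, "off" maps to DOWNSHIFT_DEV_DISABLE, "on" without a count maps to DOWNSHIFT_DEV_DEFAULT_COUNT) can be summarised in one function. The sketch below is a hypothetical standalone parser, not the code from this patch: it returns an error instead of calling exit_bad_args() and handles only the single "downshift" tunable.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DOWNSHIFT_DEV_DEFAULT_COUNT 0xff
#define DOWNSHIFT_DEV_DISABLE       0

/*
 * Hypothetical parser for: downshift on|off [count N]
 * On success stores the u8 tunable value in *out and returns 0;
 * returns -1 on any syntax or range error.
 */
static int parse_downshift(int argc, char **argv, uint8_t *out)
{
	uint8_t value = DOWNSHIFT_DEV_DEFAULT_COUNT;
	int enable;

	assert(argv != NULL && out != NULL);

	if (argc < 2 || strcmp(argv[0], "downshift") != 0)
		return -1;
	if (strcmp(argv[1], "on") == 0)
		enable = 1;
	else if (strcmp(argv[1], "off") == 0)
		enable = 0;
	else
		return -1;

	if (argc == 4 && strcmp(argv[2], "count") == 0) {
		char *end;
		unsigned long n;

		errno = 0;
		n = strtoul(argv[3], &end, 0);
		/* count requires "on" and must fit in 1..255 */
		if (errno || end == argv[3] || *end != '\0' ||
		    !enable || n == 0 || n > 0xff)
			return -1;
		value = (uint8_t)n;
	} else if (argc != 2) {
		return -1;
	}

	*out = enable ? value : DOWNSHIFT_DEV_DISABLE;
	return 0;
}
```

As in the patch, a count of 255 is indistinguishable from "use the device default", because both are sent as 0xff.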
* Fwd:[Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of __E1000_DOWN
From: Neftin, Sasha @ 2016-11-17 13:17 UTC
To: intel-wired-lan, Tyler Baicar, netdev
In-Reply-To: <630A6B92B7EDEB45A87E20D3D286660153E3B481@hasmsx109.ger.corp.intel.com>
From: Baicar, Tyler [mailto:tbaicar@codeaurora.org] Sent: Tuesday,
November 15, 2016 11:50 PM
To: Neftin, Sasha <sasha.neftin@intel.com>; Kirsher, Jeffrey T
<jeffrey.t.kirsher@intel.com>; intel-wired-lan@lists.osuosl.org;
netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
okaya@codeaurora.org; timur@codeaurora.org
Subject: Re: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of
__E1000_DOWN
On 11/13/2016 2:25 AM, Neftin, Sasha wrote:
> On 11/13/2016 10:34 AM, Neftin, Sasha wrote:
>> On 11/11/2016 12:35 AM, Baicar, Tyler wrote:
>>> Hello Sasha,
>>>
>>> On 11/9/2016 11:19 PM, Neftin, Sasha wrote:
>>>> On 11/9/2016 11:41 PM, Tyler Baicar wrote:
>>>>> Move IRQ free code so that it will happen regardless of the
>>>>> __E1000_DOWN bit. Currently the e1000e driver only releases its
>>>>> IRQ if the __E1000_DOWN bit is cleared. This is not sufficient
>>>>> because it is possible for __E1000_DOWN to be set without releasing the IRQ.
>>>>> In such a situation, we will hit a kernel bug later in
>>>>> e1000_remove because the IRQ still has action since it was never
>>>>> freed. A secondary bus reset can cause this case to happen.
>>>>>
>>>>> Signed-off-by: Tyler Baicar <tbaicar@codeaurora.org>
>>>>> ---
>>>>> drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++-
>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
>>>>> b/drivers/net/ethernet/intel/e1000e/netdev.c
>>>>> index 7017281..36cfcb0 100644
>>>>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>>>>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>>>>> @@ -4679,12 +4679,13 @@ int e1000e_close(struct net_device *netdev)
>>>>> if (!test_bit(__E1000_DOWN, &adapter->state)) {
>>>>> e1000e_down(adapter, true);
>>>>> - e1000_free_irq(adapter);
>>>>> /* Link status message must follow this format */
>>>>> pr_info("%s NIC Link is Down\n", adapter->netdev->name);
>>>>> }
>>>>> + e1000_free_irq(adapter);
>>>>> +
>>>>> napi_disable(&adapter->napi);
>>>>> e1000e_free_tx_resources(adapter->tx_ring);
>>>>>
>>>> I would like not recommend insert this change. This change related
>>>> driver state machine, we afraid from lot of synchronization problem
>>>> and issues.
>>>> We need keep e1000_free_irq in loop and check for 'test_bit' ready.
>>> What do you mean here? There is no loop. If __E1000_DOWN is set then
>>> we will never free the IRQ.
>>>
>>>> Another point, does before execute secondary bus reset your SW back
>>>> up pcie configuration space as properly?
>>> After a secondary bus reset, the link needs to recover and go back
>>> to a working state after 1 second.
>>>
>>> From the callstack, the issue is happening while removing the
>>> endpoint from the system, before applying the secondary bus reset.
>>>
>>> The order of events is
>>> 1. remove the drivers
>>> 2. cause a secondary bus reset
>>> 3. wait 1 second
>> Actually, this is too much; usually the link is up in less than 100 ms.
>> You can check the Data Link Layer indication.
>>> 4. recover the link
>>>
>>> callstack:
>>> free_msi_irqs+0x6c/0x1a8
>>> pci_disable_msi+0xb0/0x148
>>> e1000e_reset_interrupt_capability+0x60/0x78
>>> e1000_remove+0xc8/0x180
>>> pci_device_remove+0x48/0x118
>>> __device_release_driver+0x80/0x108
>>> device_release_driver+0x2c/0x40
>>> pci_stop_bus_device+0xa0/0xb0
>>> pci_stop_bus_device+0x3c/0xb0
>>> pci_stop_root_bus+0x54/0x80
>>> acpi_pci_root_remove+0x28/0x64
>>> acpi_bus_trim+0x6c/0xa4
>>> acpi_device_hotplug+0x19c/0x3f4
>>> acpi_hotplug_work_fn+0x28/0x3c
>>> process_one_work+0x150/0x460
>>> worker_thread+0x50/0x4b8
>>> kthread+0xd4/0xe8
>>> ret_from_fork+0x10/0x50
>>>
>>> Thanks,
>>> Tyler
>>>
>> Hello Tyler,
>> Okay, we need consult more about this suggestion.
>> May I ask what is setup you run? Is there NIC or on board LAN? I
>> would like try reproduce this issue in our lab's too.
>> Also, is same issue observed with same scenario and others NIC's too?
>> Sasha
>>
> Please, specify what is device used.
Hello Sasha,
This was on a QDF2432 using an Intel PRO/1000 PT Dual Port server
adapter. I have not tried other e1000e PCIe cards, but have not seen any
similar issues with Mellanox cards. I'm able to reproduce it with just
pulling the card out. Here is the lspci -vvv output for this card:
0004:00:00.0 PCI bridge: Airgo Networks, Inc. Device 0400 (prog-if 00
[Normal decode])
Physical Slot: 5
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 297
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: 00002000-00002fff
Memory behind bridge: 00100000-002fffff
Prefetchable memory behind bridge:
00000c0400000000-00000c04001fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000397f0040 Data: 0000
Capabilities: [70] Express (v2) Root Port (Slot+), MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+
Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq-
AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1,
Exit Latency L0s <1us, L1 <16us
ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+
DLActive+ BWMgmt- ABWMgmt-
SltCap: AttnBtn+ PwrCtrl+ MRL+ AttnInd+ PwrInd+
HotPlug+ Surprise+
Slot #5, PowerLimit 75.000W; Interlock+ NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt-
HPIrq- LinkChg-
Control: AttnInd Off, PwrInd Off, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt-
PresDet- Interlock-
Changed: MRL- PresDet- LinkState-
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal-
PMEIntEna+ CRSVisible-
RootCap: CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+,
LTR+, OBFF Not Supported ARIFwd+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-,
LTR-, OBFF Disabled ARIFwd-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance-
SpeedDis-
Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB,
EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-,
LinkEqualizationRequest-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn+
ChkCap+ ChkEn+
Capabilities: [178 v1] #19
Kernel driver in use: pcieport
0004:01:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
Ethernet Controller (rev 06)
Subsystem: Intel Corporation PRO/1000 PT Dual Port Server Adapter
Physical Slot: 5-1
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 128 bytes
Interrupt: pin A routed to IRQ 299
Region 0: Memory at c0100100000 (32-bit, non-prefetchable)
[size=128K]
Region 1: Memory at c0100120000 (32-bit, non-prefetchable)
[size=128K]
Region 2: I/O ports at 1000 [size=32]
Expansion ROM at c0100140000 [disabled] [size=128K]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000397f0040 Data: 0000
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
<512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+
Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq-
AuxPwr+ TransPend-
LnkCap: Port #8, Speed 2.5GT/s, Width x4, ASPM L0s,
Exit Latency L0s <4us, L1 <64us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr-
AERCap: First Error Pointer: 14, GenCap- CGenEn-
ChkCap- ChkEn-
Capabilities: [140 v1] Device Serial Number 68-05-ca-ff-ff-3e-5b-7a
Kernel driver in use: e1000e
0004:01:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
Ethernet Controller (rev 06)
Subsystem: Intel Corporation PRO/1000 PT Dual Port Server Adapter
Physical Slot: 5-1
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 128 bytes
Interrupt: pin B routed to IRQ 301
Region 0: Memory at c0100160000 (32-bit, non-prefetchable)
[size=128K]
Region 1: Memory at c0100180000 (32-bit, non-prefetchable)
[size=128K]
Region 2: I/O ports at 1020 [size=32]
Expansion ROM at c01001a0000 [disabled] [size=128K]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000397f0040 Data: 0000
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
<512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+
Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq-
AuxPwr+ TransPend-
LnkCap: Port #8, Speed 2.5GT/s, Width x4, ASPM L0s,
Exit Latency L0s <4us, L1 <64us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr-
AERCap: First Error Pointer: 14, GenCap- CGenEn-
ChkCap- ChkEn-
Capabilities: [140 v1] Device Serial Number 68-05-ca-ff-ff-3e-5b-7a
Kernel driver in use: e1000e
Thanks,
Tyler
--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
Hello Tyler,
I see some inconsistent implementations of the __*_close methods in our
drivers. Do you have an igb NIC to test with, to check whether the same
problem persists there?
Thanks,
Sasha
^ permalink raw reply
* [PATCH net-next] net_sched: sch_fq: use hash_ptr()
From: Eric Dumazet @ 2016-11-17 17:48 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Hugh Dickins
From: Eric Dumazet <edumazet@google.com>
When I wrote sch_fq.c, hash_ptr() on 64bit arches was awful,
and I chose hash_32().
Linus Torvalds and George Spelvin fixed this issue, so we can
use hash_ptr() to get more entropy on 64bit arches with Terabytes
of memory, and avoid the cast games.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hugh Dickins <hughd@google.com>
---
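A minimal userspace sketch of why the cast loses entropy, not kernel code.
It assumes the multiplicative hashes follow the shape of the helpers in
include/linux/hash.h (golden-ratio multiply, keep the top bits); the
function names model_hash_32(), model_hash_64(), bucket_old() and
bucket_new() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Golden-ratio multipliers, assumed to match include/linux/hash.h. */
#define GOLDEN_RATIO_32 0x61C88647u
#define GOLDEN_RATIO_64 0x61C8864680B583EBull

/* Model of hash_32(): multiply, then keep the top 'bits' bits. */
static uint32_t model_hash_32(uint32_t val, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (uint32_t)(val * GOLDEN_RATIO_32) >> (32 - bits);
}

/* Model of the 64-bit hash that hash_ptr() is assumed to use on 64bit arches. */
static uint32_t model_hash_64(uint64_t val, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (uint32_t)((val * GOLDEN_RATIO_64) >> (64 - bits));
}

/* Old scheme: the pointer is truncated to 32 bits before hashing. */
uint32_t bucket_old(uint64_t sk_addr, unsigned int bits)
{
	return model_hash_32((uint32_t)sk_addr, bits);
}

/* New scheme: all 64 pointer bits take part in the hash. */
uint32_t bucket_new(uint64_t sk_addr, unsigned int bits)
{
	return model_hash_64(sk_addr, bits);
}
```

Two addresses that differ only above bit 31 always share a bucket under
bucket_old(), because the truncation discards the only bits that differ.
Under bucket_new() those upper bits contribute to the result.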
net/sched/sch_fq.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 18e752439f6f..86309a3156a5 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -245,7 +245,7 @@ static struct fq_flow *fq_classify(struct sk_buff *skb, struct fq_sched_data *q)
skb_orphan(skb);
}
- root = &q->fq_root[hash_32((u32)(long)sk, q->fq_trees_log)];
+ root = &q->fq_root[hash_ptr(sk, q->fq_trees_log)];
if (q->flows >= (2U << q->fq_trees_log) &&
q->inactive_flows > q->flows/2)
@@ -599,7 +599,7 @@ static void fq_rehash(struct fq_sched_data *q,
kmem_cache_free(fq_flow_cachep, of);
continue;
}
- nroot = &new_array[hash_32((u32)(long)of->sk, new_log)];
+ nroot = &new_array[hash_ptr(of->sk, new_log)];
np = &nroot->rb_node;
parent = NULL;
^ permalink raw reply related
* Re: [PATCH net 1/3] net: phy: realtek: add eee advertisement disable options
From: Anand Moon @ 2016-11-17 18:00 UTC (permalink / raw)
To: Jerome Brunet
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, devicetree, Florian Fainelli,
Alexandre TORGUE, Neil Armstrong, Martin Blumenstingl,
Kevin Hilman, Linux Kernel, Andre Roth,
linux-amlogic-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Carlo Caione,
Giuseppe Cavallaro, linux-arm-kernel
In-Reply-To: <1479378055.17538.57.camel-rdvid1DuHRBWk0Htik3J/w@public.gmane.org>
Hi Jerome,
On 17 November 2016 at 15:50, Jerome Brunet <jbrunet-rdvid1DuHRBWk0Htik3J/w@public.gmane.org> wrote:
> On Wed, 2016-11-16 at 22:36 +0530, Anand Moon wrote:
>> Hi Jerome.
>>
>> On 15 November 2016 at 19:59, Jerome Brunet <jbrunet-rdvid1DuHRBWk0Htik3J/w@public.gmane.org>
>> wrote:
>> >
>> > On some platforms, energy efficient ethernet with rtl8211 devices
>> > is
>> > causing issue, like throughput drop or broken link.
>> >
>> > This was reported on the OdroidC2 (DWMAC + RTL8211F). While the
>> > issue root
>> > cause is not fully understood yet, disabling EEE advertisement
>> > prevent auto
>> > negotiation from enabling EEE.
>> >
>> > This patch provides options to disable 1000T and 100TX EEE
>> > advertisement
>> > individually for the realtek phys supporting this feature.
>> >
>> > Reported-by: Martin Blumenstingl <martin.blumenstingl-gM/Ye1E23myhRSP0FMvGiw@public.gmane.org
>> > m>
>> > Cc: Giuseppe Cavallaro <peppe.cavallaro-qxv4g6HH51o@public.gmane.org>
>> > Cc: Alexandre TORGUE <alexandre.torgue-qxv4g6HH51o@public.gmane.org>
>> > Signed-off-by: Jerome Brunet <jbrunet-rdvid1DuHRBWk0Htik3J/w@public.gmane.org>
>> > Signed-off-by: Neil Armstrong <narmstrong-rdvid1DuHRBWk0Htik3J/w@public.gmane.org>
>> > Tested-by: Andre Roth <neolynx-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> > ---
>> > drivers/net/phy/realtek.c | 65
>> > ++++++++++++++++++++++++++++++++++++++++++++++-
>> > 1 file changed, 64 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
>> > index aadd6e9f54ad..77235fd5faaf 100644
>> > --- a/drivers/net/phy/realtek.c
>> > +++ b/drivers/net/phy/realtek.c
>> > @@ -15,6 +15,12 @@
>> > */
>> > #include <linux/phy.h>
>> > #include <linux/module.h>
>> > +#include <linux/of.h>
>> > +
>> > +struct rtl8211x_phy_priv {
>> > + bool eee_1000t_disable;
>> > + bool eee_100tx_disable;
>> > +};
>> >
>> > #define RTL821x_PHYSR 0x11
>> > #define RTL821x_PHYSR_DUPLEX 0x2000
>> > @@ -93,12 +99,44 @@ static int rtl8211f_config_intr(struct
>> > phy_device *phydev)
>> > return err;
>> > }
>> >
>> > +static void rtl8211x_clear_eee_adv(struct phy_device *phydev)
>> > +{
>> > + struct rtl8211x_phy_priv *priv = phydev->priv;
>> > + u16 val;
>> > +
>> > + if (priv->eee_1000t_disable || priv->eee_100tx_disable) {
>> > + val = phy_read_mmd_indirect(phydev,
>> > MDIO_AN_EEE_ADV,
>> > + MDIO_MMD_AN);
>> > +
>> > + if (priv->eee_1000t_disable)
>> > + val &= ~MDIO_AN_EEE_ADV_1000T;
>> > + if (priv->eee_100tx_disable)
>> > + val &= ~MDIO_AN_EEE_ADV_100TX;
>> > +
>> > + phy_write_mmd_indirect(phydev, MDIO_AN_EEE_ADV,
>> > + MDIO_MMD_AN, val);
>> > + }
>> > +}
>> > +
>> > +static int rtl8211x_config_init(struct phy_device *phydev)
>> > +{
>> > + int ret;
>> > +
>> > + ret = genphy_config_init(phydev);
>> > + if (ret < 0)
>> > + return ret;
>> > +
>> > + rtl8211x_clear_eee_adv(phydev);
>> > +
>> > + return 0;
>> > +}
>> > +
>> > static int rtl8211f_config_init(struct phy_device *phydev)
>> > {
>> > int ret;
>> > u16 reg;
>> >
>> > - ret = genphy_config_init(phydev);
>> > + ret = rtl8211x_config_init(phydev);
>> > if (ret < 0)
>> > return ret;
>> >
>> > @@ -115,6 +153,26 @@ static int rtl8211f_config_init(struct
>> > phy_device *phydev)
>> > return 0;
>> > }
>> >
>> > +static int rtl8211x_phy_probe(struct phy_device *phydev)
>> > +{
>> > + struct device *dev = &phydev->mdio.dev;
>> > + struct device_node *of_node = dev->of_node;
>> > + struct rtl8211x_phy_priv *priv;
>> > +
>> > + priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
>> > + if (!priv)
>> > + return -ENOMEM;
>> > +
>> > + priv->eee_1000t_disable =
>> > + of_property_read_bool(of_node, "realtek,disable-
>> > eee-1000t");
>> > + priv->eee_100tx_disable =
>> > + of_property_read_bool(of_node, "realtek,disable-
>> > eee-100tx");
>> > +
>> > + phydev->priv = priv;
>> > +
>> > + return 0;
>> > +}
>> > +
>> > static struct phy_driver realtek_drvs[] = {
>> > {
>> > .phy_id = 0x00008201,
>> > @@ -140,7 +198,9 @@ static struct phy_driver realtek_drvs[] = {
>> > .phy_id_mask = 0x001fffff,
>> > .features = PHY_GBIT_FEATURES,
>> > .flags = PHY_HAS_INTERRUPT,
>> > + .probe = &rtl8211x_phy_probe,
>> > .config_aneg = genphy_config_aneg,
>> > + .config_init = &rtl8211x_config_init,
>> > .read_status = genphy_read_status,
>> > .ack_interrupt = rtl821x_ack_interrupt,
>> > .config_intr = rtl8211e_config_intr,
>> > @@ -152,7 +212,9 @@ static struct phy_driver realtek_drvs[] = {
>> > .phy_id_mask = 0x001fffff,
>> > .features = PHY_GBIT_FEATURES,
>> > .flags = PHY_HAS_INTERRUPT,
>> > + .probe = &rtl8211x_phy_probe,
>> > .config_aneg = &genphy_config_aneg,
>> > + .config_init = &rtl8211x_config_init,
>> > .read_status = &genphy_read_status,
>> > .ack_interrupt = &rtl821x_ack_interrupt,
>> > .config_intr = &rtl8211e_config_intr,
>> > @@ -164,6 +226,7 @@ static struct phy_driver realtek_drvs[] = {
>> > .phy_id_mask = 0x001fffff,
>> > .features = PHY_GBIT_FEATURES,
>> > .flags = PHY_HAS_INTERRUPT,
>> > + .probe = &rtl8211x_phy_probe,
>> > .config_aneg = &genphy_config_aneg,
>> > .config_init = &rtl8211f_config_init,
>> > .read_status = &genphy_read_status,
>> > --
>> > 2.7.4
>> >
>>
>> How about adding callback functionality for .soft_reset to handle
>> BMCR
>> where we update the Auto-Negotiation for the phy,
>> as per the datasheet of the rtl8211f.
>
> I'm not sure I understand how this would help with our issue (and EEE).
> Am I missing something or is it something unrelated that you would like
> to see happening on this driver ?
>
[snip]
I was just trying other phy modules to understand the feature.
But in order to improve the throughput I tried to integrate the u-boot commit below.
commit 3d6af748ebd831524cb22a29433e9092af469ec7
Author: Shengzhou Liu <Shengzhou.Liu-KZfg59tc24xl57MIdRCFDg@public.gmane.org>
Date: Thu Mar 12 18:54:59 2015 +0800
net/phy: Add support for realtek RTL8211F
RTL8211F has different registers from RTL8211E.
This patch adds support for RTL8211F PHY which
can be found on Freescale's T1023 RDB board.
I added similar functionality via .config_aneg = &rtl8211f_config_aneg,
and I seem to get better throughput results, with a periodic drop
that it recovers from.
-----
odroid@odroid64:~$ iperf3 -c 10.0.0.102 -p 2006 -i 1 -t 100 -V
iperf 3.0.11
Linux odroid64 4.9.0-rc5-xc2ml #18 SMP PREEMPT Thu Nov 17 22:56:00 IST
2016 aarch64 aarch64 aarch64 GNU/Linux
Time: Thu, 17 Nov 2016 17:35:25 GMT
Connecting to host 10.0.0.102, port 2006
Cookie: odroid64.1479404125.404729.3b45146e7
TCP MSS: 1448 (default)
[ 4] local 10.0.0.105 port 40238 connected to 10.0.0.102 port 2006
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting
0 seconds, 100 second test
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 114 MBytes 952 Mbits/sec 0 368 KBytes
[ 4] 1.00-2.00 sec 112 MBytes 937 Mbits/sec 0 368 KBytes
[ 4] 2.00-3.00 sec 111 MBytes 935 Mbits/sec 0 368 KBytes
[ 4] 3.00-4.00 sec 112 MBytes 936 Mbits/sec 0 368 KBytes
[ 4] 4.00-5.00 sec 112 MBytes 939 Mbits/sec 0 368 KBytes
[ 4] 5.00-6.00 sec 112 MBytes 936 Mbits/sec 0 368 KBytes
[ 4] 6.00-7.00 sec 111 MBytes 933 Mbits/sec 0 368 KBytes
[ 4] 7.00-8.00 sec 112 MBytes 942 Mbits/sec 0 368 KBytes
[ 4] 8.00-9.00 sec 111 MBytes 935 Mbits/sec 0 368 KBytes
[ 4] 9.00-10.00 sec 111 MBytes 932 Mbits/sec 0 368 KBytes
[ 4] 10.00-11.00 sec 112 MBytes 937 Mbits/sec 0 368 KBytes
[ 4] 11.00-12.00 sec 111 MBytes 935 Mbits/sec 0 368 KBytes
[ 4] 12.00-13.00 sec 112 MBytes 938 Mbits/sec 0 368 KBytes
[ 4] 13.00-14.00 sec 112 MBytes 940 Mbits/sec 0 368 KBytes
[ 4] 14.00-15.00 sec 111 MBytes 934 Mbits/sec 0 368 KBytes
[ 4] 15.00-16.00 sec 111 MBytes 935 Mbits/sec 0 368 KBytes
[ 4] 16.00-17.00 sec 112 MBytes 939 Mbits/sec 0 368 KBytes
[ 4] 17.00-18.00 sec 112 MBytes 936 Mbits/sec 0 368 KBytes
[ 4] 18.00-19.00 sec 111 MBytes 934 Mbits/sec 0 368 KBytes
[ 4] 19.00-20.00 sec 112 MBytes 940 Mbits/sec 0 368 KBytes
[ 4] 20.00-21.00 sec 111 MBytes 933 Mbits/sec 0 368 KBytes
[ 4] 21.00-22.00 sec 112 MBytes 941 Mbits/sec 0 368 KBytes
[ 4] 22.00-23.00 sec 111 MBytes 931 Mbits/sec 0 368 KBytes
[ 4] 23.00-24.00 sec 112 MBytes 938 Mbits/sec 0 368 KBytes
[ 4] 24.00-25.00 sec 112 MBytes 938 Mbits/sec 0 368 KBytes
[ 4] 25.00-26.00 sec 111 MBytes 934 Mbits/sec 0 368 KBytes
[ 4] 26.00-27.00 sec 112 MBytes 940 Mbits/sec 0 368 KBytes
[ 4] 27.00-28.00 sec 112 MBytes 936 Mbits/sec 0 368 KBytes
[ 4] 28.00-29.00 sec 111 MBytes 934 Mbits/sec 0 368 KBytes
[ 4] 29.00-30.00 sec 112 MBytes 937 Mbits/sec 0 368 KBytes
[ 4] 30.00-31.00 sec 111 MBytes 934 Mbits/sec 0 368 KBytes
[ 4] 31.00-32.00 sec 112 MBytes 942 Mbits/sec 0 368 KBytes
[ 4] 32.00-33.00 sec 111 MBytes 933 Mbits/sec 0 368 KBytes
[ 4] 33.00-34.00 sec 111 MBytes 935 Mbits/sec 0 368 KBytes
[ 4] 34.00-35.00 sec 112 MBytes 941 Mbits/sec 0 368 KBytes
[ 4] 35.00-36.00 sec 107 MBytes 896 Mbits/sec 0 368 KBytes
[ 4] 36.00-37.00 sec 0.00 Bytes 0.00 bits/sec 2 1.41 KBytes
[ 4] 37.00-38.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes
[ 4] 38.00-39.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
[ 4] 39.00-40.00 sec 38.0 MBytes 319 Mbits/sec 1 385 KBytes
[ 4] 40.00-41.00 sec 112 MBytes 939 Mbits/sec 0 385 KBytes
[ 4] 41.00-42.00 sec 112 MBytes 939 Mbits/sec 0 385 KBytes
[ 4] 42.00-43.00 sec 112 MBytes 937 Mbits/sec 0 385 KBytes
[ 4] 43.00-44.00 sec 111 MBytes 935 Mbits/sec 0 385 KBytes
[ 4] 44.00-45.00 sec 112 MBytes 939 Mbits/sec 0 385 KBytes
[ 4] 45.00-46.00 sec 112 MBytes 939 Mbits/sec 0 385 KBytes
[ 4] 46.00-47.00 sec 111 MBytes 931 Mbits/sec 0 385 KBytes
[ 4] 47.00-48.00 sec 112 MBytes 936 Mbits/sec 0 385 KBytes
[ 4] 48.00-49.00 sec 112 MBytes 939 Mbits/sec 0 385 KBytes
[ 4] 49.00-50.00 sec 112 MBytes 936 Mbits/sec 0 385 KBytes
[ 4] 50.00-51.00 sec 111 MBytes 935 Mbits/sec 0 385 KBytes
[ 4] 51.00-52.00 sec 111 MBytes 934 Mbits/sec 0 385 KBytes
[ 4] 52.00-53.00 sec 112 MBytes 941 Mbits/sec 0 385 KBytes
[ 4] 53.00-54.00 sec 112 MBytes 937 Mbits/sec 0 385 KBytes
[ 4] 54.00-55.00 sec 111 MBytes 930 Mbits/sec 0 385 KBytes
[ 4] 55.00-56.00 sec 112 MBytes 941 Mbits/sec 0 385 KBytes
[ 4] 56.00-57.00 sec 112 MBytes 936 Mbits/sec 0 385 KBytes
[ 4] 57.00-58.00 sec 111 MBytes 933 Mbits/sec 0 385 KBytes
[ 4] 58.00-59.00 sec 111 MBytes 935 Mbits/sec 0 385 KBytes
[ 4] 59.00-60.00 sec 112 MBytes 940 Mbits/sec 0 385 KBytes
[ 4] 60.00-61.00 sec 112 MBytes 936 Mbits/sec 0 385 KBytes
[ 4] 61.00-62.00 sec 111 MBytes 935 Mbits/sec 0 385 KBytes
[ 4] 62.00-63.00 sec 111 MBytes 935 Mbits/sec 0 385 KBytes
[ 4] 63.00-64.00 sec 112 MBytes 938 Mbits/sec 0 385 KBytes
[ 4] 64.00-65.00 sec 111 MBytes 932 Mbits/sec 0 385 KBytes
[ 4] 65.00-66.00 sec 112 MBytes 940 Mbits/sec 0 385 KBytes
[ 4] 66.00-67.00 sec 112 MBytes 938 Mbits/sec 0 385 KBytes
[ 4] 67.00-68.00 sec 111 MBytes 934 Mbits/sec 0 385 KBytes
[ 4] 68.00-69.00 sec 111 MBytes 933 Mbits/sec 0 385 KBytes
[ 4] 69.00-70.00 sec 112 MBytes 937 Mbits/sec 0 385 KBytes
[ 4] 70.00-71.00 sec 111 MBytes 935 Mbits/sec 0 385 KBytes
[ 4] 71.00-72.00 sec 112 MBytes 941 Mbits/sec 0 385 KBytes
[ 4] 72.00-73.00 sec 111 MBytes 933 Mbits/sec 0 385 KBytes
[ 4] 73.00-74.00 sec 112 MBytes 939 Mbits/sec 0 385 KBytes
[ 4] 74.00-75.00 sec 111 MBytes 934 Mbits/sec 0 385 KBytes
[ 4] 75.00-76.00 sec 111 MBytes 934 Mbits/sec 0 385 KBytes
[ 4] 76.00-77.00 sec 112 MBytes 937 Mbits/sec 0 385 KBytes
[ 4] 77.00-78.00 sec 112 MBytes 938 Mbits/sec 0 385 KBytes
[ 4] 78.00-79.00 sec 111 MBytes 935 Mbits/sec 0 385 KBytes
[ 4] 79.00-80.00 sec 111 MBytes 934 Mbits/sec 0 385 KBytes
[ 4] 80.00-81.00 sec 112 MBytes 939 Mbits/sec 0 385 KBytes
[ 4] 81.00-82.00 sec 112 MBytes 936 Mbits/sec 0 385 KBytes
[ 4] 82.00-83.00 sec 111 MBytes 934 Mbits/sec 0 385 KBytes
[ 4] 83.00-84.00 sec 112 MBytes 937 Mbits/sec 0 385 KBytes
[ 4] 84.00-85.00 sec 111 MBytes 935 Mbits/sec 0 385 KBytes
[ 4] 85.00-86.00 sec 112 MBytes 937 Mbits/sec 0 385 KBytes
[ 4] 86.00-87.00 sec 112 MBytes 939 Mbits/sec 0 385 KBytes
[ 4] 87.00-88.00 sec 111 MBytes 935 Mbits/sec 0 385 KBytes
[ 4] 88.00-89.00 sec 112 MBytes 937 Mbits/sec 0 385 KBytes
[ 4] 89.00-90.00 sec 112 MBytes 936 Mbits/sec 0 385 KBytes
[ 4] 90.00-91.00 sec 112 MBytes 937 Mbits/sec 0 385 KBytes
[ 4] 91.00-92.00 sec 111 MBytes 934 Mbits/sec 0 385 KBytes
[ 4] 92.00-93.00 sec 112 MBytes 939 Mbits/sec 0 385 KBytes
[ 4] 93.00-94.00 sec 111 MBytes 935 Mbits/sec 0 385 KBytes
[ 4] 94.00-95.00 sec 112 MBytes 936 Mbits/sec 0 385 KBytes
[ 4] 95.00-96.00 sec 112 MBytes 936 Mbits/sec 0 385 KBytes
[ 4] 96.00-97.00 sec 112 MBytes 936 Mbits/sec 0 385 KBytes
[ 4] 97.00-98.00 sec 113 MBytes 945 Mbits/sec 0 559 KBytes
[ 4] 98.00-99.00 sec 112 MBytes 937 Mbits/sec 0 559 KBytes
[ 4] 99.00-100.00 sec 111 MBytes 928 Mbits/sec 0 559 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-100.00 sec 10.5 GBytes 902 Mbits/sec 4 sender
[ 4] 0.00-100.00 sec 10.5 GBytes 902 Mbits/sec receiver
CPU Utilization: local/sender 5.6% (0.2%u/5.4%s), remote/receiver
17.1% (1.2%u/15.9%s)
Can you confirm this at your end?
Once confirmed, I will try to send this as a fix for this issue.
-Best Regards
Anand Moon
^ permalink raw reply
* Re: net: BUG still has locks held in unix_stream_splice_read
From: Dmitry Vyukov @ 2016-11-17 18:02 UTC (permalink / raw)
To: Al Viro
Cc: David Miller, Hannes Frederic Sowa, Eric Dumazet, netdev, LKML,
syzkaller
In-Reply-To: <CACT4Y+ZgbobWZEkAFLVK9bLHB0oLpnxXrH1FQzTgVu_6JQQVDQ@mail.gmail.com>
On Mon, Oct 10, 2016 at 10:01 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Mon, Oct 10, 2016 at 5:14 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Mon, Oct 10, 2016 at 03:46:07AM +0100, Al Viro wrote:
>>> On Sun, Oct 09, 2016 at 12:06:14PM +0200, Dmitry Vyukov wrote:
>>> > I suspect this is:
>>> >
>>> > commit 25869262ef7af24ccde988867ac3eb1c3d4b88d4
>>> > Author: Al Viro <viro@zeniv.linux.org.uk>
>>> > Date: Sat Sep 17 21:02:10 2016 -0400
>>> > skb_splice_bits(): get rid of callback
>>> > since pipe_lock is the outermost now, we don't need to drop/regain
>>> > socket locks around the call of splice_to_pipe() from skb_splice_bits(),
>>> > which kills the need to have a socket-specific callback; we can just
>>> > call splice_to_pipe() and be done with that.
>>>
>>> Unlikely, since that particular commit removes unlocking/relocking ->iolock
>>> around the call of splice_to_pipe(). Original would've retaken the same
>>> lock on the way out; it's not as if we could leave the syscall there.
>>>
>>> It might be splice-related, but I don't believe that you've got the right
>>> commit here.
>>
>> It's not that commit
>
> It's highly likely. Sorry for falsely pointing to your commit.
>
>
>> , all right - it's "can't call unix_stream_read_generic()
>> with any locks held" stepped onto a couple of commits prior by
>> "splice: lift pipe_lock out of splice_to_pipe()". Could somebody explain
>> what is that about?
>>
>> E.g what will happen if some code does a read on AF_UNIX socket with
>> some local mutex held? AFAICS, there are exactly two callers of
>> freezable_schedule_timeout() - this one and one in XFS; the latter is
>> in a kernel thread where we do have good warranties about the locking
>> environment, but here it's in the bleeding ->recvmsg/->splice_read and
>> for those assumption that caller doesn't hold any locks is pretty
>> strong, especially since it's not documented anywhere.
>>
>> What's going on there?
>
> I never saw that warning before. There is some possibility that fuzzer
> has discovered some new paths, but it's much more likely that
> something has changed recently (the stack looks quite simple -- just a
> splice from unix socket). And my previous pull was like a week ago.
Ping. Just hit it again on 4.9-rc5
[ BUG: syz-executor/15922 still has locks held! ]
4.9.0-rc5+ #43 Not tainted
-------------------------------------
1 lock held by syz-executor/15922:
#0: [ 1441.143288] (
[< inline >] pipe_lock_nested fs/pipe.c:66
[<ffffffff81a8d72b>] pipe_lock+0x5b/0x70 fs/pipe.c:74
stack backtrace:
CPU: 3 PID: 15922 Comm: syz-executor Not tainted 4.9.0-rc5+ #43
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffff88004a98f568 ffffffff834c2a19 ffffffff00000003 1ffff10009531e40
ffffed0009531e38 0000000041b58ab3 ffffffff895758b0 ffffffff834c272b
0000000000000003 ffff880035256640 0000000000000003 ffff88006d122cd8
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff834c2a19>] dump_stack+0x2ee/0x3f5 lib/dump_stack.c:51
[< inline >] print_held_locks_bug kernel/locking/lockdep.c:4296
[<ffffffff81560245>] debug_check_no_locks_held+0x125/0x140
kernel/locking/lockdep.c:4302
[< inline >] try_to_freeze include/linux/freezer.h:65
[< inline >] freezer_count include/linux/freezer.h:127
[< inline >] freezable_schedule_timeout include/linux/freezer.h:192
[<ffffffff8716ed5d>] unix_stream_data_wait+0x4fd/0x910 net/unix/af_unix.c:2223
[<ffffffff871749a2>] unix_stream_read_generic+0x11e2/0x2240
net/unix/af_unix.c:2332
[<ffffffff87175c7f>] unix_stream_splice_read+0x27f/0x400
net/unix/af_unix.c:2506
[<ffffffff86a6578e>] sock_splice_read+0xbe/0x100 net/socket.c:772
[<ffffffff81b3f90f>] do_splice_to+0x10f/0x170 fs/splice.c:897
[< inline >] do_splice fs/splice.c:1185
[< inline >] SYSC_splice fs/splice.c:1409
[<ffffffff81b4683a>] SyS_splice+0xfaa/0x16a0 fs/splice.c:1392
[<ffffffff88147905>] entry_SYSCALL_64_fastpath+0x23/0xc6
^ permalink raw reply
* RE: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM
From: Rafal Ozieblo @ 2016-11-17 13:35 UTC (permalink / raw)
To: Nicolas Ferre, Harini Katakam
Cc: harini.katakam@xilinx.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <e6773b4e-a633-4754-893d-fd2351d07459@atmel.com>
-----Original Message-----
From: Nicolas Ferre [mailto:nicolas.ferre@atmel.com]
Sent: 17 listopada 2016 14:29
To: Harini Katakam; Rafal Ozieblo
Cc: harini.katakam@xilinx.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM
> Le 17/11/2016 à 13:21, Harini Katakam a écrit :
> > Hi Rafal,
> >
> > On Thu, Nov 17, 2016 at 5:20 PM, Rafal Ozieblo <rafalo@cadence.com> wrote:
> >> Hello,
> >> I think there could be a bug in your patch.
> >>
> >>> +
> >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> >>> + dmacfg |= GEM_BIT(ADDR64);
> >>> +#endif
> >>
> >> You always enable 64 bit addressing (64b dma bus width) when the appropriate architecture config option is enabled.
> >> But there are some legacy controllers which do not support that feature. According to the Cadence hardware team:
> >> "64 bit addressing was added in July 2013. Earlier version do not have it.
> >> This feature was enhanced in release August 2014 to have separate upper address values for transmit and receive."
> >>
> >>> /* Bitfields in NSR */
> >>> @@ -474,6 +479,10 @@
> >>> struct macb_dma_desc {
> >>> u32 addr;
> >>> u32 ctrl;
> >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> >>> + u32 addrh;
> >>> + u32 resvd;
> >>> +#endif
> >>> };
> >>
> >> It will not work for legacy hardware. Old descriptor is 2 words wide, the new one is 4 words wide.
> >> If you enable CONFIG_ARCH_DMA_ADDR_T_64BIT but hardware doesn't
> >> support it at all, you will miss every second descriptor.
> >>
> >
> > True, this feature is not available in all of Cadence IP versions.
> > In fact, the IP version Zynq does not support this. But the one in ZynqMP does.
> > So, we enable kernel config for 64 bit DMA addressing for this SoC and
> > hence the driver picks it up. My assumption was that if the legacy IP
> > does not support
> > 64 bit addressing, then this DMA option wouldn't be enabled.
> >
> > There is a design config register in Cadence IP which is being read to
> > check for 64 bit address support - DMA mask is set based on that.
> > But the addition of two descriptor words cannot be based on this runtime check.
> > For this reason, all the static changes were placed under this check.
>
> We have quite a bunch of options in this driver to determine what the real capacity of the underlying hardware is.
> If HW configuration registers are not appropriate, and it seems they are not, I would advise simply using the DT compatibility string.
>
> Best regards,
> --
> Nicolas Ferre
HW configuration registers are appropriate. The issue is that this code doesn’t use the capability bit to switch between different dma descriptors (2 words vs. 4 words).
DMA descriptor size is chosen based on kernel configuration, not based on hardware capabilities.
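To illustrate the mismatch, here is a minimal userspace sketch, not driver
code. The two struct layouts mirror the 2-word and 4-word descriptors
discussed above; the helper names desc_stride(), desc_offset() and
hw_slot_for_driver_index(), and the hw_has_addr64 flag standing in for the
capability bit, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Legacy descriptor: 2 words. */
struct desc_32 {
	uint32_t addr;
	uint32_t ctrl;
};

/* 64 bit capable descriptor: 4 words. */
struct desc_64 {
	uint32_t addr;
	uint32_t ctrl;
	uint32_t addrh;
	uint32_t resvd;
};

/* Pick the descriptor stride at runtime from a capability flag. */
size_t desc_stride(int hw_has_addr64)
{
	return hw_has_addr64 ? sizeof(struct desc_64) : sizeof(struct desc_32);
}

/* Byte offset of descriptor 'index' in a ring that uses the given layout. */
size_t desc_offset(size_t index, int hw_has_addr64)
{
	return index * desc_stride(hw_has_addr64);
}

/*
 * Which hardware slot does the driver's descriptor 'index' land on when the
 * driver assumes one layout and the hardware walks the ring with another?
 */
size_t hw_slot_for_driver_index(size_t index, int drv_addr64, int hw_addr64)
{
	/* The layouts above are expected to be exactly 2 and 4 words. */
	assert(sizeof(struct desc_32) == 8 && sizeof(struct desc_64) == 16);
	return desc_offset(index, drv_addr64) / desc_stride(hw_addr64);
}
```

With a 4-word layout in the driver and 2-word hardware, driver index 3 maps
to hardware slot 6, so the hardware sees every second slot as unused. When
both sides agree, the index maps to itself.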
Regards,
Rafal Ozieblo
^ permalink raw reply
* Re: [PATCH net-next v3 4/7] vxlan: improve vxlan route lookup checks.
From: David Miller @ 2016-11-17 18:11 UTC (permalink / raw)
To: jbenc; +Cc: David.Laight, pshelar, netdev
In-Reply-To: <20161117165950.6a8ed0d0@griffin>
From: Jiri Benc <jbenc@redhat.com>
Date: Thu, 17 Nov 2016 16:59:49 +0100
> On Thu, 17 Nov 2016 10:17:01 +0000, David Laight wrote:
>> Worse than arbitrary, it adds 4 bytes of pad on 64bit systems.
>
> It does not, this is not a struct.
He is talking about on the function stack.
^ permalink raw reply
* [PATCH ipsec] xfrm: unbreak xfrm_sk_policy_lookup
From: Florian Westphal @ 2016-11-17 12:21 UTC (permalink / raw)
To: netdev; +Cc: Florian Westphal
if we succeed grabbing the refcount, then
if (!err && !xfrm_pol_hold_rcu(pol))
will evaluate to false so this hits the last else branch which then
sets policy to ERR_PTR(0).
Fixes: ae33786f73a7ce ("xfrm: policy: only use rcu in xfrm_sk_policy_lookup")
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
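A minimal userspace model of the two branch structures, not kernel code.
The names old_branch(), new_branch() and model_err_ptr() are hypothetical;
hold_ok stands in for the return value of xfrm_pol_hold_rcu() and the
retry flag stands in for "goto again". It assumes that converting an
integer 0 to a pointer yields NULL, as ERR_PTR(0) does in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct policy {
	int id;
};

/* Stand-in for ERR_PTR(): encode an error code in a pointer value. */
static struct policy *model_err_ptr(long err)
{
	return (struct policy *)(intptr_t)err;
}

/* Branch structure before the fix. */
struct policy *old_branch(struct policy *pol, int err, int hold_ok, int *retry)
{
	assert(retry != NULL);
	*retry = 0;
	if (!err && !hold_ok)
		*retry = 1;
	else if (err == -ESRCH)
		pol = NULL;
	else
		pol = model_err_ptr(err);	/* also reached when err == 0 */
	return pol;
}

/* Branch structure after the fix: err == 0 never reaches the last else. */
struct policy *new_branch(struct policy *pol, int err, int hold_ok, int *retry)
{
	assert(retry != NULL);
	*retry = 0;
	if (!err) {
		if (!hold_ok)
			*retry = 1;
	} else if (err == -ESRCH) {
		pol = NULL;
	} else {
		pol = model_err_ptr(err);
	}
	return pol;
}
```

In the success case (err == 0, reference taken) old_branch() overwrites the
policy with model_err_ptr(0), while new_branch() returns the policy
unchanged.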
net/xfrm/xfrm_policy.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index fd6986634e6f..5bf7e1bfeac7 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1268,12 +1268,14 @@ static struct xfrm_policy *xfrm_sk_policy_lookup(const struct sock *sk, int dir,
err = security_xfrm_policy_lookup(pol->security,
fl->flowi_secid,
policy_to_flow_dir(dir));
- if (!err && !xfrm_pol_hold_rcu(pol))
- goto again;
- else if (err == -ESRCH)
+ if (!err) {
+ if (!xfrm_pol_hold_rcu(pol))
+ goto again;
+ } else if (err == -ESRCH) {
pol = NULL;
- else
+ } else {
pol = ERR_PTR(err);
+ }
} else
pol = NULL;
}
--
2.7.3
^ permalink raw reply related
* [PATCH -next] tcp: make undo_cwnd mandatory for congestion modules
From: Florian Westphal @ 2016-11-17 12:56 UTC (permalink / raw)
To: netdev; +Cc: Florian Westphal, Eric Dumazet, Yuchung Cheng, Neal Cardwell
The undo_cwnd fallback in the stack doubles cwnd based on ssthresh,
which un-does reno halving behaviour.
It seems more appropriate to let congctl algorithms pair .ssthresh
and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it
up for all congestion algorithms that used to rely on the fallback.
highspeed, illinois, scalable, veno and yeah use 'reno undo' while their
.ssthresh implementation doesn't halve the slowstart threshold; this
might point to a similar issue as the one fixed for dctcp in
ce6dd23329b1e ("dctcp: avoid bogus doubling of cwnd after loss").
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/tcp.h | 1 +
net/ipv4/tcp_cong.c | 14 ++++++++++++--
net/ipv4/tcp_dctcp.c | 1 +
net/ipv4/tcp_highspeed.c | 2 +-
net/ipv4/tcp_hybla.c | 1 +
net/ipv4/tcp_illinois.c | 1 +
net/ipv4/tcp_input.c | 5 +----
net/ipv4/tcp_lp.c | 1 +
net/ipv4/tcp_scalable.c | 1 +
net/ipv4/tcp_vegas.c | 1 +
net/ipv4/tcp_veno.c | 1 +
net/ipv4/tcp_westwood.c | 1 +
net/ipv4/tcp_yeah.c | 1 +
13 files changed, 24 insertions(+), 7 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 123979fe12bf..7de80739adab 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -958,6 +958,7 @@ u32 tcp_slow_start(struct tcp_sock *tp, u32 acked);
void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked);
u32 tcp_reno_ssthresh(struct sock *sk);
+u32 tcp_reno_undo_cwnd(struct sock *sk);
void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 acked);
extern struct tcp_congestion_ops tcp_reno;
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 1294af4e0127..38905ec5f508 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -68,8 +68,9 @@ int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
{
int ret = 0;
- /* all algorithms must implement ssthresh and cong_avoid ops */
- if (!ca->ssthresh || !(ca->cong_avoid || ca->cong_control)) {
+ /* all algorithms must implement these */
+ if (!ca->ssthresh || !ca->undo_cwnd ||
+ !(ca->cong_avoid || ca->cong_control)) {
pr_err("%s does not implement required ops\n", ca->name);
return -EINVAL;
}
@@ -441,10 +442,19 @@ u32 tcp_reno_ssthresh(struct sock *sk)
}
EXPORT_SYMBOL_GPL(tcp_reno_ssthresh);
+u32 tcp_reno_undo_cwnd(struct sock *sk)
+{
+ const struct tcp_sock *tp = tcp_sk(sk);
+
+ return max(tp->snd_cwnd, tp->snd_ssthresh << 1);
+}
+EXPORT_SYMBOL_GPL(tcp_reno_undo_cwnd);
+
struct tcp_congestion_ops tcp_reno = {
.flags = TCP_CONG_NON_RESTRICTED,
.name = "reno",
.owner = THIS_MODULE,
.ssthresh = tcp_reno_ssthresh,
.cong_avoid = tcp_reno_cong_avoid,
+ .undo_cwnd = tcp_reno_undo_cwnd,
};
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 51139175bf61..bde22ebb92a8 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -342,6 +342,7 @@ static struct tcp_congestion_ops dctcp __read_mostly = {
static struct tcp_congestion_ops dctcp_reno __read_mostly = {
.ssthresh = tcp_reno_ssthresh,
.cong_avoid = tcp_reno_cong_avoid,
+ .undo_cwnd = tcp_reno_undo_cwnd,
.get_info = dctcp_get_info,
.owner = THIS_MODULE,
.name = "dctcp-reno",
diff --git a/net/ipv4/tcp_highspeed.c b/net/ipv4/tcp_highspeed.c
index db7842495a64..1eb8fefd9bd0 100644
--- a/net/ipv4/tcp_highspeed.c
+++ b/net/ipv4/tcp_highspeed.c
@@ -161,7 +161,7 @@ static struct tcp_congestion_ops tcp_highspeed __read_mostly = {
.init = hstcp_init,
.ssthresh = hstcp_ssthresh,
.cong_avoid = hstcp_cong_avoid,
-
+ .undo_cwnd = tcp_reno_undo_cwnd,
.owner = THIS_MODULE,
.name = "highspeed"
};
diff --git a/net/ipv4/tcp_hybla.c b/net/ipv4/tcp_hybla.c
index 083831e359df..0f7175c3338e 100644
--- a/net/ipv4/tcp_hybla.c
+++ b/net/ipv4/tcp_hybla.c
@@ -166,6 +166,7 @@ static void hybla_cong_avoid(struct sock *sk, u32 ack, u32 acked)
static struct tcp_congestion_ops tcp_hybla __read_mostly = {
.init = hybla_init,
.ssthresh = tcp_reno_ssthresh,
+ .undo_cwnd = tcp_reno_undo_cwnd,
.cong_avoid = hybla_cong_avoid,
.set_state = hybla_state,
diff --git a/net/ipv4/tcp_illinois.c b/net/ipv4/tcp_illinois.c
index c8e6d86be114..7c843578f233 100644
--- a/net/ipv4/tcp_illinois.c
+++ b/net/ipv4/tcp_illinois.c
@@ -327,6 +327,7 @@ static size_t tcp_illinois_info(struct sock *sk, u32 ext, int *attr,
static struct tcp_congestion_ops tcp_illinois __read_mostly = {
.init = tcp_illinois_init,
.ssthresh = tcp_illinois_ssthresh,
+ .undo_cwnd = tcp_reno_undo_cwnd,
.cong_avoid = tcp_illinois_cong_avoid,
.set_state = tcp_illinois_state,
.get_info = tcp_illinois_info,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a70046fea0e8..22e6a2097ff6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2394,10 +2394,7 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
if (tp->prior_ssthresh) {
const struct inet_connection_sock *icsk = inet_csk(sk);
- if (icsk->icsk_ca_ops->undo_cwnd)
- tp->snd_cwnd = icsk->icsk_ca_ops->undo_cwnd(sk);
- else
- tp->snd_cwnd = max(tp->snd_cwnd, tp->snd_ssthresh << 1);
+ tp->snd_cwnd = icsk->icsk_ca_ops->undo_cwnd(sk);
if (tp->prior_ssthresh > tp->snd_ssthresh) {
tp->snd_ssthresh = tp->prior_ssthresh;
diff --git a/net/ipv4/tcp_lp.c b/net/ipv4/tcp_lp.c
index c67ece1390c2..046fd3910873 100644
--- a/net/ipv4/tcp_lp.c
+++ b/net/ipv4/tcp_lp.c
@@ -316,6 +316,7 @@ static void tcp_lp_pkts_acked(struct sock *sk, const struct ack_sample *sample)
static struct tcp_congestion_ops tcp_lp __read_mostly = {
.init = tcp_lp_init,
.ssthresh = tcp_reno_ssthresh,
+ .undo_cwnd = tcp_reno_undo_cwnd,
.cong_avoid = tcp_lp_cong_avoid,
.pkts_acked = tcp_lp_pkts_acked,
diff --git a/net/ipv4/tcp_scalable.c b/net/ipv4/tcp_scalable.c
index bf5ea9e9bbc1..addc122f8818 100644
--- a/net/ipv4/tcp_scalable.c
+++ b/net/ipv4/tcp_scalable.c
@@ -38,6 +38,7 @@ static u32 tcp_scalable_ssthresh(struct sock *sk)
static struct tcp_congestion_ops tcp_scalable __read_mostly = {
.ssthresh = tcp_scalable_ssthresh,
+ .undo_cwnd = tcp_reno_undo_cwnd,
.cong_avoid = tcp_scalable_cong_avoid,
.owner = THIS_MODULE,
diff --git a/net/ipv4/tcp_vegas.c b/net/ipv4/tcp_vegas.c
index 4c4bac1b5eab..218cfcc77650 100644
--- a/net/ipv4/tcp_vegas.c
+++ b/net/ipv4/tcp_vegas.c
@@ -307,6 +307,7 @@ EXPORT_SYMBOL_GPL(tcp_vegas_get_info);
static struct tcp_congestion_ops tcp_vegas __read_mostly = {
.init = tcp_vegas_init,
.ssthresh = tcp_reno_ssthresh,
+ .undo_cwnd = tcp_reno_undo_cwnd,
.cong_avoid = tcp_vegas_cong_avoid,
.pkts_acked = tcp_vegas_pkts_acked,
.set_state = tcp_vegas_state,
diff --git a/net/ipv4/tcp_veno.c b/net/ipv4/tcp_veno.c
index 40171e163cff..6fcf482d611b 100644
--- a/net/ipv4/tcp_veno.c
+++ b/net/ipv4/tcp_veno.c
@@ -204,6 +204,7 @@ static u32 tcp_veno_ssthresh(struct sock *sk)
static struct tcp_congestion_ops tcp_veno __read_mostly = {
.init = tcp_veno_init,
.ssthresh = tcp_veno_ssthresh,
+ .undo_cwnd = tcp_reno_undo_cwnd,
.cong_avoid = tcp_veno_cong_avoid,
.pkts_acked = tcp_veno_pkts_acked,
.set_state = tcp_veno_state,
diff --git a/net/ipv4/tcp_westwood.c b/net/ipv4/tcp_westwood.c
index 4b03a2e2a050..fed66dc0e0f5 100644
--- a/net/ipv4/tcp_westwood.c
+++ b/net/ipv4/tcp_westwood.c
@@ -278,6 +278,7 @@ static struct tcp_congestion_ops tcp_westwood __read_mostly = {
.init = tcp_westwood_init,
.ssthresh = tcp_reno_ssthresh,
.cong_avoid = tcp_reno_cong_avoid,
+ .undo_cwnd = tcp_reno_undo_cwnd,
.cwnd_event = tcp_westwood_event,
.in_ack_event = tcp_westwood_ack,
.get_info = tcp_westwood_info,
diff --git a/net/ipv4/tcp_yeah.c b/net/ipv4/tcp_yeah.c
index 9c5fc973267f..56ed4257c706 100644
--- a/net/ipv4/tcp_yeah.c
+++ b/net/ipv4/tcp_yeah.c
@@ -226,6 +226,7 @@ static u32 tcp_yeah_ssthresh(struct sock *sk)
static struct tcp_congestion_ops tcp_yeah __read_mostly = {
.init = tcp_yeah_init,
.ssthresh = tcp_yeah_ssthresh,
+ .undo_cwnd = tcp_reno_undo_cwnd,
.cong_avoid = tcp_yeah_cong_avoid,
.set_state = tcp_vegas_state,
.cwnd_event = tcp_vegas_cwnd_event,
--
2.7.3
* [PATCH net-next v3 1/5] ethtool: (uapi) Add ETHTOOL_PHY_GTUNABLE and ETHTOOL_PHY_STUNABLE
From: Allan W. Nielsen @ 2016-11-17 12:07 UTC
To: netdev; +Cc: andrew, f.fainelli, raju.lakkaraju, allan.nielsen, Raju Lakkaraju
In-Reply-To: <1479384444-31122-1-git-send-email-allan.nielsen@microsemi.com>
From: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Defines a generic API to get/set phy tunables. The API uses the
existing ethtool_tunable/tunable_type_id types, which are already used
for mac level tunables.
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
---
include/uapi/linux/ethtool.h | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 8e54723..42f696f 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -248,6 +248,16 @@ struct ethtool_tunable {
void *data[0];
};
+enum phy_tunable_id {
+ ETHTOOL_PHY_ID_UNSPEC,
+
+ /*
+ * Add your fresh new phy tunable attribute above and remember to update
+ * phy_tunable_strings[] in net/core/ethtool.c
+ */
+ __ETHTOOL_PHY_TUNABLE_COUNT,
+};
+
/**
* struct ethtool_regs - hardware register dump
* @cmd: Command number = %ETHTOOL_GREGS
@@ -548,6 +558,7 @@ struct ethtool_pauseparam {
* @ETH_SS_FEATURES: Device feature names
* @ETH_SS_RSS_HASH_FUNCS: RSS hush function names
* @ETH_SS_PHY_STATS: Statistic names, for use with %ETHTOOL_GPHYSTATS
+ * @ETH_SS_PHY_TUNABLES: PHY tunable names
*/
enum ethtool_stringset {
ETH_SS_TEST = 0,
@@ -558,6 +569,7 @@ enum ethtool_stringset {
ETH_SS_RSS_HASH_FUNCS,
ETH_SS_TUNABLES,
ETH_SS_PHY_STATS,
+ ETH_SS_PHY_TUNABLES,
};
/**
@@ -1313,7 +1325,8 @@ struct ethtool_per_queue_op {
#define ETHTOOL_GLINKSETTINGS 0x0000004c /* Get ethtool_link_settings */
#define ETHTOOL_SLINKSETTINGS 0x0000004d /* Set ethtool_link_settings */
-
+#define ETHTOOL_PHY_GTUNABLE 0x0000004e /* Get PHY tunable configuration */
+#define ETHTOOL_PHY_STUNABLE 0x0000004f /* Set PHY tunable configuration */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
--
2.7.3
* [PATCH v8 0/6] Add eBPF hooks for cgroups
From: Daniel Mack @ 2016-11-17 18:27 UTC
To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
ast-b10kYP2dOMg
Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
Daniel Mack
This is v8 of the patch set to allow eBPF programs for network
filtering and accounting to be attached to cgroups, so that they apply
to all sockets of all tasks placed in that cgroup. The mechanism can
also be extended to other cgroup-based eBPF logic.
Again, only minor details are updated in this version.
Thanks,
Daniel
Changes from v7:
* Replace the static inline function cgroup_bpf_run_filter() with
two specific macros for ingress and egress. This addresses David
Miller's concern regarding skb->sk vs. sk in the egress path.
Thanks a lot to Daniel Borkmann and Alexei Starovoitov for the
suggestions.
Changes from v6:
* Rebased to 4.9-rc2
* Add EXPORT_SYMBOL(__cgroup_bpf_run_filter). The kbuild test robot
now succeeds in building this version of the patch set.
* Switch from bpf_prog_run_save_cb() to bpf_prog_run_clear_cb() to not
tamper with the contents of skb->cb[]. Pointed out by Daniel
Borkmann.
* Use sk_to_full_sk() in the egress path, as suggested by Daniel
Borkmann.
* Renamed BPF_PROG_TYPE_CGROUP_SOCKET to BPF_PROG_TYPE_CGROUP_SKB, as
requested by David Ahern.
* Added Alexei's Acked-by tags.
Changes from v5:
* The eBPF programs now operate on L3 rather than on L2 of the packets,
and the egress hooks were moved from __dev_queue_xmit() to
ip*_output().
* For BPF_PROG_TYPE_CGROUP_SOCKET, disallow direct access to the skb
through BPF_LD_[ABS|IND] instructions, but hook up the
bpf_skb_load_bytes() access helper instead. Thanks to Daniel Borkmann
for the help.
Changes from v4:
* Plug an skb leak when dropping packets due to eBPF verdicts in
__dev_queue_xmit(). Spotted by Daniel Borkmann.
* Check for sk_fullsock(sk) in __cgroup_bpf_run_filter() so we don't
operate on timewait or request sockets. Suggested by Daniel Borkmann.
* Add missing @parent parameter in kerneldoc of __cgroup_bpf_update().
Spotted by Rami Rosen.
* Include linux/jump_label.h from bpf-cgroup.h to fix a kbuild error.
Changes from v3:
* Dropped the _FILTER suffix from BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
renamed BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS to
BPF_CGROUP_INET_{IN,E}GRESS and alias BPF_MAX_ATTACH_TYPE to
__BPF_MAX_ATTACH_TYPE, as suggested by Daniel Borkmann.
* Dropped the attach_flags member from the anonymous struct for BPF
attach operations in union bpf_attr. They can be added later on via
CHECK_ATTR. Requested by Daniel Borkmann and Alexei.
* Release old_prog at the end of __cgroup_bpf_update rather than at
the beginning to fix a race gap between program updates and their
users. Spotted by Daniel Borkmann.
* Plugged an skb leak when dropping packets on the egress path.
Spotted by Daniel Borkmann.
* Add cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org to the loop, as suggested by Rami Rosen.
* Some minor coding style adaptations not worth mentioning in particular.
Changes from v2:
* Fixed the RCU locking details Tejun pointed out.
* Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.
Changes from v1:
* Moved all bpf specific cgroup code into its own file, and stub
out related functions for !CONFIG_CGROUP_BPF as static inline nops.
This way, the call sites are not cluttered with #ifdef guards while
the feature remains compile-time configurable.
* Implemented the new scheme proposed by Tejun. Per cgroup, store one
set of pointers that are pinned to the cgroup, and one for the
programs that are effective. When a program is attached or detached,
the change is propagated to all the cgroup's descendants. If a
subcgroup has its own pinned program, skip the whole subbranch in
order to allow delegation models.
* The hookup for egress packets is now done from __dev_queue_xmit().
* A static key is now used in both the ingress and egress fast paths
to keep performance penalties close to zero if the feature is
not in use.
* Overall cleanup to make the accessors use the program arrays.
This should make it much easier to add new program types, which
will then automatically follow the pinned vs. effective logic.
* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
Starovoitov. Changes to the program array are now done with
xchg() and are protected by cgroup_mutex.
* eBPF programs are now expected to return 1 to let the packet pass,
not >= 0. Pointed out by Alexei.
* Operation is now limited to INET sockets, so local AF_UNIX sockets
are not affected. The enum members are renamed accordingly. In case
other socket families should be supported, this can be extended in
the future.
* The sample program learned to support both ingress and egress, and
can now optionally make the eBPF program drop packets by making it
return 0.
Daniel Mack (6):
bpf: add new prog type for cgroup socket filtering
cgroup: add support for eBPF programs
bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
net: filter: run cgroup eBPF ingress programs
net: ipv4, ipv6: run cgroup eBPF egress programs
samples: bpf: add userspace example for attaching eBPF programs to
cgroups
include/linux/bpf-cgroup.h | 79 +++++++++++++++++++
include/linux/cgroup-defs.h | 4 +
include/uapi/linux/bpf.h | 17 ++++
init/Kconfig | 12 +++
kernel/bpf/Makefile | 1 +
kernel/bpf/cgroup.c | 167 ++++++++++++++++++++++++++++++++++++++++
kernel/bpf/syscall.c | 81 +++++++++++++++++++
kernel/cgroup.c | 18 +++++
net/core/filter.c | 27 +++++++
net/ipv4/ip_output.c | 15 ++++
net/ipv6/ip6_output.c | 8 ++
samples/bpf/Makefile | 2 +
samples/bpf/libbpf.c | 21 +++++
samples/bpf/libbpf.h | 3 +
samples/bpf/test_cgrp2_attach.c | 147 +++++++++++++++++++++++++++++++++++
15 files changed, 602 insertions(+)
create mode 100644 include/linux/bpf-cgroup.h
create mode 100644 kernel/bpf/cgroup.c
create mode 100644 samples/bpf/test_cgrp2_attach.c
--
2.7.4
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
From: David Miller @ 2016-11-17 18:16 UTC
To: hannes
Cc: idosch, jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <f58c008f-c23a-4950-2975-3c1b4c3b8692@stressinduktion.org>
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Thu, 17 Nov 2016 18:20:39 +0100
> Hi,
>
> On 17.11.2016 17:45, David Miller wrote:
>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>>
>>> The other way is the journal idea I had, which uses an rb-tree with
>>> timestamps as keys (can be lamport timestamps). You insert into the tree
>>> until the dump is finished and use it as queue later to shuffle stuff
>>> into the hardware.
>>
>> If you have this "place" where pending inserts are stored, you have
>> a policy decision to make.
>>
>> First of all what do other lookups see when there are pending entries?
>
> I think this is a problem with the current approach already, as the
> delayed work queue already postpones the insert for an undecidable
> amount of time (and reorders depending on which CPU the entry was
> inserted and the fib notifier was called).
>
> For user space queries we would still query the in-kernel table.
Ok, I think I might misunderstand something.
What is going into this journal exactly? The actual full software and
hardware insert operation, or just the notification to the hardware
device driver notifiers?
The "lookup" I'm mostly concerned with is the fast path where the
packets being processed actually look up a route.
I do not think we can return success on the insert to the user yet
have the route lookup dataplane not return that route on a lookup.
* [PATCH v8 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
From: Daniel Mack @ 2016-11-17 18:27 UTC
To: htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
ast-b10kYP2dOMg
Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
Daniel Mack
In-Reply-To: <1479407229-14861-1-git-send-email-daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.
On the API level, the target could be anything that has an fd in
userspace, hence the name of the field in union bpf_attr is called
'target_fd'.
When called with BPF_CGROUP_INET_{IN,E}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.
If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.
For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.
The API is guarded by CAP_NET_ADMIN.
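For illustration, a userspace caller could wrap the two new commands roughly as follows. This is a sketch under assumptions: it needs uapi headers that already carry this patch, the helper names are invented here, and the file descriptors are expected to come from opening a cgroup v2 directory and from a prior BPF_PROG_LOAD.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Hypothetical helper: fill the anonymous attach/detach struct. */
static void fill_attach_attr(union bpf_attr *attr, int cgroup_fd,
			     int prog_fd, enum bpf_attach_type type)
{
	assert(attr);

	/* Unused fields of the union are left zeroed. */
	memset(attr, 0, sizeof(*attr));
	attr->target_fd = cgroup_fd;
	attr->attach_bpf_fd = prog_fd;
	attr->attach_type = type;
}

/* Hypothetical wrapper: attach prog_fd to the cgroup behind cgroup_fd. */
static int cgroup_attach_prog(int cgroup_fd, int prog_fd,
			      enum bpf_attach_type type)
{
	union bpf_attr attr;

	fill_attach_attr(&attr, cgroup_fd, prog_fd, type);
	return (int)syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}

/* Hypothetical wrapper: detach whatever program of 'type' is attached. */
static int cgroup_detach_prog(int cgroup_fd, enum bpf_attach_type type)
{
	union bpf_attr attr;

	/* No program fd is needed for detach, so it stays 0. */
	fill_attach_attr(&attr, cgroup_fd, 0, type);
	return (int)syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
}
```

Both wrappers return -1 with errno set on failure, for example EPERM when the caller lacks CAP_NET_ADMIN.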
Signed-off-by: Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>
Acked-by: Alexei Starovoitov <ast-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
include/uapi/linux/bpf.h | 8 +++++
kernel/bpf/syscall.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 89 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1f3e6f1..f31b655 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
+ BPF_PROG_ATTACH,
+ BPF_PROG_DETACH,
};
enum bpf_map_type {
@@ -150,6 +152,12 @@ union bpf_attr {
__aligned_u64 pathname;
__u32 bpf_fd;
};
+
+ struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+ __u32 target_fd; /* container object to attach to */
+ __u32 attach_bpf_fd; /* eBPF program to attach */
+ __u32 attach_type;
+ };
} __attribute__((aligned(8)));
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..1814c01 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,77 @@ static int bpf_obj_get(const union bpf_attr *attr)
return bpf_obj_get_user(u64_to_ptr(attr->pathname));
}
+#ifdef CONFIG_CGROUP_BPF
+
+#define BPF_PROG_ATTACH_LAST_FIELD attach_type
+
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+ struct bpf_prog *prog;
+ struct cgroup *cgrp;
+
+ if (!capable(CAP_NET_ADMIN))
+ return -EPERM;
+
+ if (CHECK_ATTR(BPF_PROG_ATTACH))
+ return -EINVAL;
+
+ switch (attr->attach_type) {
+ case BPF_CGROUP_INET_INGRESS:
+ case BPF_CGROUP_INET_EGRESS:
+ prog = bpf_prog_get_type(attr->attach_bpf_fd,
+ BPF_PROG_TYPE_CGROUP_SKB);
+ if (IS_ERR(prog))
+ return PTR_ERR(prog);
+
+ cgrp = cgroup_get_from_fd(attr->target_fd);
+ if (IS_ERR(cgrp)) {
+ bpf_prog_put(prog);
+ return PTR_ERR(cgrp);
+ }
+
+ cgroup_bpf_update(cgrp, prog, attr->attach_type);
+ cgroup_put(cgrp);
+ break;
+
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+#define BPF_PROG_DETACH_LAST_FIELD attach_type
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+ struct cgroup *cgrp;
+
+ if (!capable(CAP_NET_ADMIN))
+ return -EPERM;
+
+ if (CHECK_ATTR(BPF_PROG_DETACH))
+ return -EINVAL;
+
+ switch (attr->attach_type) {
+ case BPF_CGROUP_INET_INGRESS:
+ case BPF_CGROUP_INET_EGRESS:
+ cgrp = cgroup_get_from_fd(attr->target_fd);
+ if (IS_ERR(cgrp))
+ return PTR_ERR(cgrp);
+
+ cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+ cgroup_put(cgrp);
+ break;
+
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
{
union bpf_attr attr = {};
@@ -888,6 +959,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
case BPF_OBJ_GET:
err = bpf_obj_get(&attr);
break;
+
+#ifdef CONFIG_CGROUP_BPF
+ case BPF_PROG_ATTACH:
+ err = bpf_prog_attach(&attr);
+ break;
+ case BPF_PROG_DETACH:
+ err = bpf_prog_detach(&attr);
+ break;
+#endif
+
default:
err = -EINVAL;
break;
--
2.7.4
* [PATCH v8 1/6] bpf: add new prog type for cgroup socket filtering
From: Daniel Mack @ 2016-11-17 18:27 UTC
To: htejun, daniel, ast
Cc: davem, kafai, fw, pablo, harald, netdev, sargun, cgroups,
Daniel Mack
In-Reply-To: <1479407229-14861-1-git-send-email-daniel@zonque.org>
This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
bpf_skb_load_bytes() helper.
Programs of this type will be attached to cgroups for network filtering
and accounting.
Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
include/uapi/linux/bpf.h | 9 +++++++++
net/core/filter.c | 23 +++++++++++++++++++++++
2 files changed, 32 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f09c70b..1f3e6f1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,8 +96,17 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
+ BPF_PROG_TYPE_CGROUP_SKB,
};
+enum bpf_attach_type {
+ BPF_CGROUP_INET_INGRESS,
+ BPF_CGROUP_INET_EGRESS,
+ __MAX_BPF_ATTACH_TYPE
+};
+
+#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
+
#define BPF_PSEUDO_MAP_FD 1
/* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/net/core/filter.c b/net/core/filter.c
index 00351cd..e3813d6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2576,6 +2576,17 @@ xdp_func_proto(enum bpf_func_id func_id)
}
}
+static const struct bpf_func_proto *
+cg_skb_func_proto(enum bpf_func_id func_id)
+{
+ switch (func_id) {
+ case BPF_FUNC_skb_load_bytes:
+ return &bpf_skb_load_bytes_proto;
+ default:
+ return sk_filter_func_proto(func_id);
+ }
+}
+
static bool __is_valid_access(int off, int size, enum bpf_access_type type)
{
if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2938,6 +2949,12 @@ static const struct bpf_verifier_ops xdp_ops = {
.convert_ctx_access = xdp_convert_ctx_access,
};
+static const struct bpf_verifier_ops cg_skb_ops = {
+ .get_func_proto = cg_skb_func_proto,
+ .is_valid_access = sk_filter_is_valid_access,
+ .convert_ctx_access = sk_filter_convert_ctx_access,
+};
+
static struct bpf_prog_type_list sk_filter_type __read_mostly = {
.ops = &sk_filter_ops,
.type = BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2958,12 +2975,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly = {
.type = BPF_PROG_TYPE_XDP,
};
+static struct bpf_prog_type_list cg_skb_type __read_mostly = {
+ .ops = &cg_skb_ops,
+ .type = BPF_PROG_TYPE_CGROUP_SKB,
+};
+
static int __init register_sk_filter_ops(void)
{
bpf_register_prog_type(&sk_filter_type);
bpf_register_prog_type(&sched_cls_type);
bpf_register_prog_type(&sched_act_type);
bpf_register_prog_type(&xdp_type);
+ bpf_register_prog_type(&cg_skb_type);
return 0;
}
--
2.7.4
* [PATCH v8 2/6] cgroup: add support for eBPF programs
From: Daniel Mack @ 2016-11-17 18:27 UTC
To: htejun, daniel, ast
Cc: davem, kafai, fw, pablo, harald, netdev, sargun, cgroups,
Daniel Mack
In-Reply-To: <1479407229-14861-1-git-send-email-daniel@zonque.org>
This patch adds two sets of eBPF program pointers to struct cgroup.
One set holds the programs that are directly pinned to a cgroup, the
other holds the programs that are effective for it.
To illustrate the logic behind that, assume the following example
cgroup hierarchy.
A - B - C
\ D - E
If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.
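The pinned vs. effective behaviour of the example hierarchy can be modelled in a few lines of userspace C. This is a deliberately simplified sketch: struct toy_cgroup and its helpers are hypothetical, programs are represented by strings, only one attach type is modelled, and the whole tree is recomputed on every update instead of walking descendants as the kernel code does.

```c
#include <assert.h>
#include <stddef.h>

/* The example hierarchy: A is the root, B has children C and D, D has E. */
enum { A, B, C, D, E, NR_CGROUPS };

struct toy_cgroup {
	int parent;		/* index of the parent, -1 for the root */
	const char *pinned;	/* program attached directly, or NULL */
	const char *effective;	/* program that actually runs, or NULL */
};

static void toy_init(struct toy_cgroup *t)
{
	static const int parents[NR_CGROUPS] = { -1, A, B, B, D };
	int i;

	for (i = 0; i < NR_CGROUPS; i++) {
		t[i].parent = parents[i];
		t[i].pinned = NULL;
		t[i].effective = NULL;
	}
}

/*
 * Pin (or, with prog == NULL, unpin) a program on 'target' and recompute
 * the effective programs. Parents have lower indices than their children,
 * so a single forward pass is enough.
 */
static void toy_update(struct toy_cgroup *t, int target, const char *prog)
{
	int i;

	assert(target >= 0 && target < NR_CGROUPS);
	t[target].pinned = prog;

	for (i = 0; i < NR_CGROUPS; i++) {
		if (t[i].pinned)
			t[i].effective = t[i].pinned;
		else if (t[i].parent >= 0)
			t[i].effective = t[t[i].parent].effective;
		else
			t[i].effective = NULL;
	}
}
```

Pinning a program on B makes it effective for B, C, D and E; pinning another one on D afterwards changes only D and E, and unpinning it again lets D and E fall back to B's program.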
Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.
Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/bpf-cgroup.h | 79 +++++++++++++++++++++
include/linux/cgroup-defs.h | 4 ++
init/Kconfig | 12 ++++
kernel/bpf/Makefile | 1 +
kernel/bpf/cgroup.c | 167 ++++++++++++++++++++++++++++++++++++++++++++
kernel/cgroup.c | 18 +++++
6 files changed, 281 insertions(+)
create mode 100644 include/linux/bpf-cgroup.h
create mode 100644 kernel/bpf/cgroup.c
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
new file mode 100644
index 0000000..ec80d0c
--- /dev/null
+++ b/include/linux/bpf-cgroup.h
@@ -0,0 +1,79 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include <linux/bpf.h>
+#include <linux/jump_label.h>
+#include <uapi/linux/bpf.h>
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+ /*
+ * Store two sets of bpf_prog pointers, one for programs that are
+ * pinned directly to this cgroup, and one for those that are effective
+ * when this cgroup is accessed.
+ */
+ struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
+ struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_put(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+ struct cgroup *parent,
+ struct bpf_prog *prog,
+ enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+ struct bpf_prog *prog,
+ enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+ struct sk_buff *skb,
+ enum bpf_attach_type type);
+
+/* Wrappers for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled. */
+#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) \
+({ \
+ int __ret = 0; \
+ if (cgroup_bpf_enabled) \
+ __ret = __cgroup_bpf_run_filter(sk, skb, \
+ BPF_CGROUP_INET_INGRESS); \
+ \
+ __ret; \
+})
+
+#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) \
+({ \
+ int __ret = 0; \
+ if (cgroup_bpf_enabled && sk && sk == skb->sk) { \
+ typeof(sk) __sk = sk_to_full_sk(sk); \
+ if (sk_fullsock(__sk)) \
+ __ret = __cgroup_bpf_run_filter(__sk, skb, \
+ BPF_CGROUP_INET_EGRESS); \
+ } \
+ __ret; \
+})
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+ struct cgroup *parent) {}
+
+#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..861b467 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
#include <linux/percpu-refcount.h>
#include <linux/percpu-rwsem.h>
#include <linux/workqueue.h>
+#include <linux/bpf-cgroup.h>
#ifdef CONFIG_CGROUPS
@@ -300,6 +301,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
+ /* used to store eBPF programs */
+ struct cgroup_bpf bpf;
+
/* ids of the ancestors at each level including self */
int ancestor_ids[];
};
diff --git a/init/Kconfig b/init/Kconfig
index 34407f1..405120b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1154,6 +1154,18 @@ config CGROUP_PERF
Say N if unsure.
+config CGROUP_BPF
+ bool "Support for eBPF programs attached to cgroups"
+ depends on BPF_SYSCALL && SOCK_CGROUP_DATA
+ help
+ Allow attaching eBPF programs to a cgroup using the bpf(2)
+ syscall command BPF_PROG_ATTACH.
+
+ In which context these programs are accessed depends on the type
+ of attachment. For instance, programs that are attached using
+ BPF_CGROUP_INET_INGRESS will be executed on the ingress path of
+ inet sockets.
+
config CGROUP_DEBUG
bool "Example controller"
default n
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index eed911d..b22256b 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o
ifeq ($(CONFIG_PERF_EVENTS),y)
obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
endif
+obj-$(CONFIG_CGROUP_BPF) += cgroup.o
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
new file mode 100644
index 0000000..a0ab43f
--- /dev/null
+++ b/kernel/bpf/cgroup.c
@@ -0,0 +1,167 @@
+/*
+ * Functions to manage eBPF programs attached to cgroups
+ *
+ * Copyright (c) 2016 Daniel Mack
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License. See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/bpf-cgroup.h>
+#include <net/sock.h>
+
+DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
+EXPORT_SYMBOL(cgroup_bpf_enabled_key);
+
+/**
+ * cgroup_bpf_put() - put references of all bpf programs
+ * @cgrp: the cgroup to modify
+ */
+void cgroup_bpf_put(struct cgroup *cgrp)
+{
+ unsigned int type;
+
+ for (type = 0; type < ARRAY_SIZE(cgrp->bpf.prog); type++) {
+ struct bpf_prog *prog = cgrp->bpf.prog[type];
+
+ if (prog) {
+ bpf_prog_put(prog);
+ static_branch_dec(&cgroup_bpf_enabled_key);
+ }
+ }
+}
+
+/**
+ * cgroup_bpf_inherit() - inherit effective programs from parent
+ * @cgrp: the cgroup to modify
+ * @parent: the parent to inherit from
+ */
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
+{
+ unsigned int type;
+
+ for (type = 0; type < ARRAY_SIZE(cgrp->bpf.effective); type++) {
+ struct bpf_prog *e;
+
+ e = rcu_dereference_protected(parent->bpf.effective[type],
+ lockdep_is_held(&cgroup_mutex));
+ rcu_assign_pointer(cgrp->bpf.effective[type], e);
+ }
+}
+
+/**
+ * __cgroup_bpf_update() - Update the pinned program of a cgroup, and
+ * propagate the change to descendants
+ * @cgrp: The cgroup whose descendants to traverse
+ * @parent: The parent of @cgrp, or %NULL if @cgrp is the root
+ * @prog: A new program to pin
+ * @type: Type of pinning operation (ingress/egress)
+ *
+ * Each cgroup has a set of two pointers for bpf programs; one for eBPF
+ * programs it owns, and one for the program that is effective for execution.
+ *
+ * If @prog is not %NULL, this function attaches a new program to the cgroup
+ * and releases the one that is currently attached, if any. @prog is then made
+ * the effective program of type @type in that cgroup.
+ *
+ * If @prog is %NULL, the currently attached program of type @type is released,
+ * and the effective program of the parent cgroup (if any) is inherited by
+ * @cgrp.
+ *
+ * Then, the descendants of @cgrp are walked and the effective program for
+ * each of them is set to the effective program of @cgrp unless the
+ * descendant has its own program attached, in which case the subbranch is
+ * skipped. This ensures that delegated subcgroups with their own programs
+ * are left untouched.
+ *
+ * Must be called with cgroup_mutex held.
+ */
+void __cgroup_bpf_update(struct cgroup *cgrp,
+			 struct cgroup *parent,
+			 struct bpf_prog *prog,
+			 enum bpf_attach_type type)
+{
+	struct bpf_prog *old_prog, *effective;
+	struct cgroup_subsys_state *pos;
+
+	old_prog = xchg(cgrp->bpf.prog + type, prog);
+
+	effective = (!prog && parent) ?
+		rcu_dereference_protected(parent->bpf.effective[type],
+					  lockdep_is_held(&cgroup_mutex)) :
+		prog;
+
+	css_for_each_descendant_pre(pos, &cgrp->self) {
+		struct cgroup *desc = container_of(pos, struct cgroup, self);
+
+		/* skip the subtree if the descendant has its own program */
+		if (desc->bpf.prog[type] && desc != cgrp)
+			pos = css_rightmost_descendant(pos);
+		else
+			rcu_assign_pointer(desc->bpf.effective[type],
+					   effective);
+	}
+
+	if (prog)
+		static_branch_inc(&cgroup_bpf_enabled_key);
+
+	if (old_prog) {
+		bpf_prog_put(old_prog);
+		static_branch_dec(&cgroup_bpf_enabled_key);
+	}
+}
+
+/**
+ * __cgroup_bpf_run_filter() - Run a program for packet filtering
+ * @sk: The socket sending or receiving traffic
+ * @skb: The skb that is being sent or received
+ * @type: The type of program to be executed
+ *
+ * If no socket is passed, or the socket is not of type INET or INET6,
+ * this function does nothing and returns 0.
+ *
+ * The program type passed in via @type must be suitable for network
+ * filtering. No further check is performed to assert that.
+ *
+ * This function will return %-EPERM if an attached program was found
+ * and if it returned != 1 during execution. In all other cases, 0 is returned.
+ */
+int __cgroup_bpf_run_filter(struct sock *sk,
+			    struct sk_buff *skb,
+			    enum bpf_attach_type type)
+{
+	struct bpf_prog *prog;
+	struct cgroup *cgrp;
+	int ret = 0;
+
+	if (!sk || !sk_fullsock(sk))
+		return 0;
+
+	if (sk->sk_family != AF_INET &&
+	    sk->sk_family != AF_INET6)
+		return 0;
+
+	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+
+	rcu_read_lock();
+
+	prog = rcu_dereference(cgrp->bpf.effective[type]);
+	if (prog) {
+		unsigned int offset = skb->data - skb_network_header(skb);
+
+		__skb_push(skb, offset);
+		ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
+		__skb_pull(skb, offset);
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 85bc9be..2ee9ec3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5074,6 +5074,8 @@ static void css_release_work_fn(struct work_struct *work)
if (cgrp->kn)
RCU_INIT_POINTER(*(void __rcu __force **)&cgrp->kn->priv,
NULL);
+
+ cgroup_bpf_put(cgrp);
}
mutex_unlock(&cgroup_mutex);
@@ -5281,6 +5283,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
if (!cgroup_on_dfl(cgrp))
cgrp->subtree_control = cgroup_control(cgrp);
+ if (parent)
+ cgroup_bpf_inherit(cgrp, parent);
+
cgroup_propagate_control(cgrp);
/* @cgrp doesn't have dir yet so the following will only create csses */
@@ -6495,6 +6500,19 @@ static __init int cgroup_namespaces_init(void)
}
subsys_initcall(cgroup_namespaces_init);
+#ifdef CONFIG_CGROUP_BPF
+void cgroup_bpf_update(struct cgroup *cgrp,
+		       struct bpf_prog *prog,
+		       enum bpf_attach_type type)
+{
+	struct cgroup *parent = cgroup_parent(cgrp);
+
+	mutex_lock(&cgroup_mutex);
+	__cgroup_bpf_update(cgrp, parent, prog, type);
+	mutex_unlock(&cgroup_mutex);
+}
+#endif /* CONFIG_CGROUP_BPF */
+
#ifdef CONFIG_CGROUP_DEBUG
static struct cgroup_subsys_state *
debug_css_alloc(struct cgroup_subsys_state *parent_css)
--
2.7.4
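The propagation rule documented for __cgroup_bpf_update() in the patch
above can be illustrated with a small userspace model. This is only a
sketch under stated assumptions: struct node, add_child(), propagate()
and update() are hypothetical names that do not exist in the kernel,
plain integers stand in for struct bpf_prog pointers (0 meaning "no
program"), and locking, RCU and reference counting are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: struct node is a hypothetical stand-in for
 * struct cgroup; ints stand in for program pointers (0 = none). */
#define MAX_CHILDREN 4

struct node {
	struct node *parent;
	struct node *child[MAX_CHILDREN];
	size_t nr_children;
	int prog;	/* program attached directly, 0 if none */
	int effective;	/* program that actually runs, 0 if none */
};

/* Link a zero-initialized child; it starts with the parent's
 * effective program, as cgroup_bpf_inherit() does for new cgroups. */
void add_child(struct node *parent, struct node *child)
{
	assert(parent->nr_children < MAX_CHILDREN);
	child->parent = parent;
	child->effective = parent->effective;
	parent->child[parent->nr_children++] = child;
}

/* Set the effective program on n and on every descendant that does
 * not own a program; subtrees below an owner are skipped entirely. */
void propagate(struct node *n, int effective)
{
	size_t i;

	n->effective = effective;
	for (i = 0; i < n->nr_children; i++) {
		if (n->child[i]->prog)
			continue;
		propagate(n->child[i], effective);
	}
}

/* Attach (prog != 0) or detach (prog == 0) a program on n. */
void update(struct node *n, int prog)
{
	n->prog = prog;
	/* Detaching falls back to the parent's effective program. */
	propagate(n, (!prog && n->parent) ? n->parent->effective : prog);
}
```

In this model, attaching a program to a parent leaves a child that owns
a program (and everything below it) untouched, and detaching the child's
program makes that subtree pick up the parent's effective program again.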
* [PATCH v8 4/6] net: filter: run cgroup eBPF ingress programs
From: Daniel Mack @ 2016-11-17 18:27 UTC (permalink / raw)
To: htejun, daniel, ast
Cc: davem, kafai, fw, pablo, harald, netdev, sargun, cgroups,
Daniel Mack
In-Reply-To: <1479407229-14861-1-git-send-email-daniel@zonque.org>
If the cgroup associated with the receiving socket has eBPF programs
installed, run them from sk_filter_trim_cap().
eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop it. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).
Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.
Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
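The verdict convention described above (1 lets the packet pass, any
other value drops it) and the "guarded by a static key" behaviour can be
sketched in plain C. This is a userspace model under assumptions, not
the kernel code: struct pkt, prog_fn, hook_enabled and run_filter() are
hypothetical names, an int flag stands in for the static key, and two
offsets stand in for skb->data and the network header.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the skb fields that matter here. */
struct pkt {
	size_t data_off;	/* models skb->data */
	size_t network_off;	/* models the L3 header position */
};

/* Hypothetical stand-in for an attached program. */
typedef int (*prog_fn)(const struct pkt *p);

int hook_enabled;	/* models cgroup_bpf_enabled_key */

int run_filter(prog_fn prog, struct pkt *p)
{
	size_t offset;
	int ret;

	if (!hook_enabled || !prog)
		return 0;	/* nothing attached: packet passes */

	assert(p->data_off >= p->network_off);

	/* Expose the packet from the network header, then restore. */
	offset = p->data_off - p->network_off;
	p->data_off -= offset;			/* like __skb_push() */
	ret = prog(p) == 1 ? 0 : -EPERM;	/* only 1 means "pass" */
	p->data_off += offset;			/* like __skb_pull() */

	return ret;
}

/* Example programs for the model. */
int allow_all(const struct pkt *p) { (void)p; return 1; }
int drop_all(const struct pkt *p) { (void)p; return 0; }
int sees_l3(const struct pkt *p) { return p->data_off == p->network_off; }
```

Note that in this model, as in the description above, a return value
such as 2 is treated as a drop as well, since only 1 is accepted.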
net/core/filter.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index e3813d6..474b486 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,10 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;
+ err = BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb);
+ if (err)
+ return err;
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
--
2.7.4
* [PATCH v8 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
From: Daniel Mack @ 2016-11-17 18:27 UTC (permalink / raw)
To: htejun, daniel, ast
Cc: davem, kafai, fw, pablo, harald, netdev, sargun, cgroups,
Daniel Mack
In-Reply-To: <1479407229-14861-1-git-send-email-daniel@zonque.org>
If the cgroup associated with the sending socket has eBPF programs
installed, run them from ip_output(), ip6_output() and
ip_mc_output(). In these functions we have two socket contexts
as per 7026b1ddb6b8 ("netfilter: Pass socket pointer down through
okfn()."). We explicitly need to use sk instead of skb->sk here,
since otherwise the same program would run multiple times on egress
when encap devices are involved, which is not desired in our case.
eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop it. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).
Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.
Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
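Why sk rather than skb->sk matters with encap devices can be shown with
a toy model. The sketch below is a simplification under assumptions: it
supposes that a tunnelled packet passes through the output path twice,
first with the application's socket and then with the tunnel's own
socket, while skb->sk keeps pointing at the application's socket. All
names (struct sock_model, struct skb_model, egress_hook(),
output_pass(), runs_through_tunnel()) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct sock and struct sk_buff. */
struct sock_model {
	int has_prog;	/* nonzero if the socket's cgroup has a program */
	int runs;	/* how often that program was run */
};

struct skb_model {
	struct sock_model *sk;	/* originating socket, like skb->sk */
};

/* Models the egress hook: runs the program of the socket it is given. */
void egress_hook(struct sock_model *sk)
{
	if (sk && sk->has_prog)
		sk->runs++;
}

/* Models one pass through the output path; use_skb_sk picks the
 * variant that the commit message argues against. */
void output_pass(struct sock_model *sk, struct skb_model *skb,
		 int use_skb_sk)
{
	assert(skb != NULL);
	egress_hook(use_skb_sk ? skb->sk : sk);
}

/* Sends one packet through a tunnel and returns how often the
 * application's program ran. */
int runs_through_tunnel(int use_skb_sk)
{
	struct sock_model app = { 1, 0 };
	struct sock_model tunnel = { 0, 0 };
	struct skb_model skb = { &app };

	output_pass(&app, &skb, use_skb_sk);	/* inner packet */
	output_pass(&tunnel, &skb, use_skb_sk);	/* encapsulated packet */
	return app.runs;
}
```

With the sk argument the application's program runs once per packet in
this model; keyed on skb->sk it would run on both passes.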
net/ipv4/ip_output.c | 15 +++++++++++++++
net/ipv6/ip6_output.c | 8 ++++++++
2 files changed, 23 insertions(+)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 03e7f73..5914006 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -74,6 +74,7 @@
#include <net/checksum.h>
#include <net/inetpeer.h>
#include <net/lwtunnel.h>
+#include <linux/bpf-cgroup.h>
#include <linux/igmp.h>
#include <linux/netfilter_ipv4.h>
#include <linux/netfilter_bridge.h>
@@ -303,6 +304,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
struct rtable *rt = skb_rtable(skb);
struct net_device *dev = rt->dst.dev;
+ int ret;
/*
* If the indicated interface is up and running, send the packet.
@@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
+ ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+ if (ret) {
+ kfree_skb(skb);
+ return ret;
+ }
+
/*
* Multicasts are looped back for other local users
*/
@@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
struct net_device *dev = skb_dst(skb)->dev;
+ int ret;
IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
+ ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+ if (ret) {
+ kfree_skb(skb);
+ return ret;
+ }
+
return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
net, sk, skb, NULL, dev,
ip_finish_output,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6001e78..483f91b 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -39,6 +39,7 @@
#include <linux/module.h>
#include <linux/slab.h>
+#include <linux/bpf-cgroup.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv6.h>
@@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
struct net_device *dev = skb_dst(skb)->dev;
struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
+ int ret;
if (unlikely(idev->cnf.disable_ipv6)) {
IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
@@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
return 0;
}
+ ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+ if (ret) {
+ kfree_skb(skb);
+ return ret;
+ }
+
return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING,
net, sk, skb, NULL, dev,
ip6_finish_output,
--
2.7.4