Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH iproute2-next] tc-netem: fix limit description in man page
From: Marcelo Ricardo Leitner @ 2018-05-16  0:49 UTC (permalink / raw)
  To: netdev

As the kernel code says, limit is actually the amount of packets it can
hold queued at a time, as per:

static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                         struct sk_buff **to_free)
{
	...
        if (unlikely(sch->q.qlen >= sch->limit))
                return qdisc_drop_all(skb, sch, to_free);

So lets fix the description of the field in the man page.

Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
---
 man/man8/tc-netem.8 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man8/tc-netem.8 b/man/man8/tc-netem.8
index b31384f57a9b36769c0037c465cc6b5bbe8c8b6e..f2cd86b6ed8ae82b8cc2fbd2ecbe41d2fcbad507 100644
--- a/man/man8/tc-netem.8
+++ b/man/man8/tc-netem.8
@@ -65,7 +65,7 @@ netem has the following options:
 
 .SS limit packets
 
-limits the effect of selected options to the indicated number of next packets.
+maximum number of packets the qdisc may hold queued at a time.
 
 .SS delay
 adds the chosen delay to the packets outgoing to chosen network interface. The
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v2 0/3] net: qualcomm: rmnet: Updates 2018-05-14
From: Subash Abhinov Kasiviswanathan @ 2018-05-16  0:51 UTC (permalink / raw)
  To: davem, netdev, fengguang.wu; +Cc: Subash Abhinov Kasiviswanathan

Patch 1 adds tx_drops counter to more places.
Patch 2 adds ethtool private stats support to make it easy to debug
the checksum offload path.
Patch 3 is a cleanup in command packet processing path.

v1->v2: Fix the incorrect if / else statement in
rmnet_map_checksum_downlink_packet() and define rmnet_ethtool_ops
as static as mentioned by kbuild test robot.

Subash Abhinov Kasiviswanathan (3):
  net: qualcomm: rmnet: Capture all drops in transmit path
  net: qualcomm: rmnet: Add support for ethtool private stats
  net: qualcomm: rmnet: Remove redundant command check

 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 13 +++++
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   | 21 ++++---
 .../ethernet/qualcomm/rmnet/rmnet_map_command.c    | 14 +----
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   | 64 ++++++++++++++++------
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c    | 51 +++++++++++++++++
 5 files changed, 125 insertions(+), 38 deletions(-)

-- 
1.9.1

^ permalink raw reply

* [PATCH net-next v2 1/3] net: qualcomm: rmnet: Capture all drops in transmit path
From: Subash Abhinov Kasiviswanathan @ 2018-05-16  0:52 UTC (permalink / raw)
  To: davem, netdev, fengguang.wu; +Cc: Subash Abhinov Kasiviswanathan
In-Reply-To: <1526431922-4132-1-git-send-email-subashab@codeaurora.org>

Packets in transmit path could potentially be dropped if there were
errors while adding the MAP header or the checksum header.
Increment the tx_drops stats in these cases.

Additionally, refactor the code to free the packet and increment
the tx_drops stat under a single label.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c    | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 6fcd586..7fd86d4 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -148,7 +148,7 @@ static int rmnet_map_egress_handler(struct sk_buff *skb,
 
 	if (skb_headroom(skb) < required_headroom) {
 		if (pskb_expand_head(skb, required_headroom, 0, GFP_KERNEL))
-			goto fail;
+			return -ENOMEM;
 	}
 
 	if (port->data_format & RMNET_FLAGS_EGRESS_MAP_CKSUMV4)
@@ -156,17 +156,13 @@ static int rmnet_map_egress_handler(struct sk_buff *skb,
 
 	map_header = rmnet_map_add_map_header(skb, additional_header_len, 0);
 	if (!map_header)
-		goto fail;
+		return -ENOMEM;
 
 	map_header->mux_id = mux_id;
 
 	skb->protocol = htons(ETH_P_MAP);
 
 	return 0;
-
-fail:
-	kfree_skb(skb);
-	return -ENOMEM;
 }
 
 static void
@@ -228,15 +224,18 @@ void rmnet_egress_handler(struct sk_buff *skb)
 	mux_id = priv->mux_id;
 
 	port = rmnet_get_port(skb->dev);
-	if (!port) {
-		kfree_skb(skb);
-		return;
-	}
+	if (!port)
+		goto drop;
 
 	if (rmnet_map_egress_handler(skb, port, mux_id, orig_dev))
-		return;
+		goto drop;
 
 	rmnet_vnd_tx_fixup(skb, orig_dev);
 
 	dev_queue_xmit(skb);
+	return;
+
+drop:
+	this_cpu_inc(priv->pcpu_stats->stats.tx_drops);
+	kfree_skb(skb);
 }
-- 
1.9.1

^ permalink raw reply related

* [PATCH net-next v2 2/3] net: qualcomm: rmnet: Add support for ethtool private stats
From: Subash Abhinov Kasiviswanathan @ 2018-05-16  0:52 UTC (permalink / raw)
  To: davem, netdev, fengguang.wu; +Cc: Subash Abhinov Kasiviswanathan
In-Reply-To: <1526431922-4132-1-git-send-email-subashab@codeaurora.org>

Add ethtool private stats handler to debug the handling of packets
with checksum offload header / trailer. This allows to keep track of
the number of packets for which hardware computes the checksum and
counts and reasons where checksum computation was skipped in hardware
and was done in the network stack.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 13 +++++
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   | 64 ++++++++++++++++------
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c    | 51 +++++++++++++++++
 3 files changed, 112 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
index 0b5b5da..34ac45a 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
@@ -54,11 +54,24 @@ struct rmnet_pcpu_stats {
 	struct u64_stats_sync syncp;
 };
 
+struct rmnet_priv_stats {
+	u64 csum_ok;
+	u64 csum_valid_unset;
+	u64 csum_validation_failed;
+	u64 csum_err_bad_buffer;
+	u64 csum_err_invalid_ip_version;
+	u64 csum_err_invalid_transport;
+	u64 csum_fragmented_pkt;
+	u64 csum_skipped;
+	u64 csum_sw;
+};
+
 struct rmnet_priv {
 	u8 mux_id;
 	struct net_device *real_dev;
 	struct rmnet_pcpu_stats __percpu *pcpu_stats;
 	struct gro_cells gro_cells;
+	struct rmnet_priv_stats stats;
 };
 
 struct rmnet_port *rmnet_get_port(struct net_device *real_dev);
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
index a6ea094..57a9c31 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
@@ -48,7 +48,8 @@ static __sum16 *rmnet_map_get_csum_field(unsigned char protocol,
 
 static int
 rmnet_map_ipv4_dl_csum_trailer(struct sk_buff *skb,
-			       struct rmnet_map_dl_csum_trailer *csum_trailer)
+			       struct rmnet_map_dl_csum_trailer *csum_trailer,
+			       struct rmnet_priv *priv)
 {
 	__sum16 *csum_field, csum_temp, pseudo_csum, hdr_csum, ip_payload_csum;
 	u16 csum_value, csum_value_final;
@@ -58,19 +59,25 @@ static __sum16 *rmnet_map_get_csum_field(unsigned char protocol,
 
 	ip4h = (struct iphdr *)(skb->data);
 	if ((ntohs(ip4h->frag_off) & IP_MF) ||
-	    ((ntohs(ip4h->frag_off) & IP_OFFSET) > 0))
+	    ((ntohs(ip4h->frag_off) & IP_OFFSET) > 0)) {
+		priv->stats.csum_fragmented_pkt++;
 		return -EOPNOTSUPP;
+	}
 
 	txporthdr = skb->data + ip4h->ihl * 4;
 
 	csum_field = rmnet_map_get_csum_field(ip4h->protocol, txporthdr);
 
-	if (!csum_field)
+	if (!csum_field) {
+		priv->stats.csum_err_invalid_transport++;
 		return -EPROTONOSUPPORT;
+	}
 
 	/* RFC 768 - Skip IPv4 UDP packets where sender checksum field is 0 */
-	if (*csum_field == 0 && ip4h->protocol == IPPROTO_UDP)
+	if (*csum_field == 0 && ip4h->protocol == IPPROTO_UDP) {
+		priv->stats.csum_skipped++;
 		return 0;
+	}
 
 	csum_value = ~ntohs(csum_trailer->csum_value);
 	hdr_csum = ~ip_fast_csum(ip4h, (int)ip4h->ihl);
@@ -102,16 +109,20 @@ static __sum16 *rmnet_map_get_csum_field(unsigned char protocol,
 		}
 	}
 
-	if (csum_value_final == ntohs((__force __be16)*csum_field))
+	if (csum_value_final == ntohs((__force __be16)*csum_field)) {
+		priv->stats.csum_ok++;
 		return 0;
-	else
+	} else {
+		priv->stats.csum_validation_failed++;
 		return -EINVAL;
+	}
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
 static int
 rmnet_map_ipv6_dl_csum_trailer(struct sk_buff *skb,
-			       struct rmnet_map_dl_csum_trailer *csum_trailer)
+			       struct rmnet_map_dl_csum_trailer *csum_trailer,
+			       struct rmnet_priv *priv)
 {
 	__sum16 *csum_field, ip6_payload_csum, pseudo_csum, csum_temp;
 	u16 csum_value, csum_value_final;
@@ -125,8 +136,10 @@ static __sum16 *rmnet_map_get_csum_field(unsigned char protocol,
 	txporthdr = skb->data + sizeof(struct ipv6hdr);
 	csum_field = rmnet_map_get_csum_field(ip6h->nexthdr, txporthdr);
 
-	if (!csum_field)
+	if (!csum_field) {
+		priv->stats.csum_err_invalid_transport++;
 		return -EPROTONOSUPPORT;
+	}
 
 	csum_value = ~ntohs(csum_trailer->csum_value);
 	ip6_hdr_csum = (__force __be16)
@@ -164,10 +177,13 @@ static __sum16 *rmnet_map_get_csum_field(unsigned char protocol,
 		}
 	}
 
-	if (csum_value_final == ntohs((__force __be16)*csum_field))
+	if (csum_value_final == ntohs((__force __be16)*csum_field)) {
+		priv->stats.csum_ok++;
 		return 0;
-	else
+	} else {
+		priv->stats.csum_validation_failed++;
 		return -EINVAL;
+	}
 }
 #endif
 
@@ -339,24 +355,34 @@ struct sk_buff *rmnet_map_deaggregate(struct sk_buff *skb,
  */
 int rmnet_map_checksum_downlink_packet(struct sk_buff *skb, u16 len)
 {
+	struct rmnet_priv *priv = netdev_priv(skb->dev);
 	struct rmnet_map_dl_csum_trailer *csum_trailer;
 
-	if (unlikely(!(skb->dev->features & NETIF_F_RXCSUM)))
+	if (unlikely(!(skb->dev->features & NETIF_F_RXCSUM))) {
+		priv->stats.csum_sw++;
 		return -EOPNOTSUPP;
+	}
 
 	csum_trailer = (struct rmnet_map_dl_csum_trailer *)(skb->data + len);
 
-	if (!csum_trailer->valid)
+	if (!csum_trailer->valid) {
+		priv->stats.csum_valid_unset++;
 		return -EINVAL;
+	}
 
-	if (skb->protocol == htons(ETH_P_IP))
-		return rmnet_map_ipv4_dl_csum_trailer(skb, csum_trailer);
-	else if (skb->protocol == htons(ETH_P_IPV6))
+	if (skb->protocol == htons(ETH_P_IP)) {
+		return rmnet_map_ipv4_dl_csum_trailer(skb, csum_trailer, priv);
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
 #if IS_ENABLED(CONFIG_IPV6)
-		return rmnet_map_ipv6_dl_csum_trailer(skb, csum_trailer);
+		return rmnet_map_ipv6_dl_csum_trailer(skb, csum_trailer, priv);
 #else
+		priv->stats.csum_err_invalid_ip_version++;
 		return -EPROTONOSUPPORT;
 #endif
+	} else {
+		priv->stats.csum_err_invalid_ip_version++;
+		return -EPROTONOSUPPORT;
+	}
 
 	return 0;
 }
@@ -367,6 +393,7 @@ int rmnet_map_checksum_downlink_packet(struct sk_buff *skb, u16 len)
 void rmnet_map_checksum_uplink_packet(struct sk_buff *skb,
 				      struct net_device *orig_dev)
 {
+	struct rmnet_priv *priv = netdev_priv(orig_dev);
 	struct rmnet_map_ul_csum_header *ul_header;
 	void *iphdr;
 
@@ -389,8 +416,11 @@ void rmnet_map_checksum_uplink_packet(struct sk_buff *skb,
 			rmnet_map_ipv6_ul_csum_header(iphdr, ul_header, skb);
 			return;
 #else
+			priv->stats.csum_err_invalid_ip_version++;
 			goto sw_csum;
 #endif
+		} else {
+			priv->stats.csum_err_invalid_ip_version++;
 		}
 	}
 
@@ -399,4 +429,6 @@ void rmnet_map_checksum_uplink_packet(struct sk_buff *skb,
 	ul_header->csum_insert_offset = 0;
 	ul_header->csum_enabled = 0;
 	ul_header->udp_ip4_ind = 0;
+
+	priv->stats.csum_sw++;
 }
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
index 2ea16a0..cb02e1a 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
@@ -152,6 +152,56 @@ static void rmnet_get_stats64(struct net_device *dev,
 	.ndo_get_stats64 = rmnet_get_stats64,
 };
 
+static const char rmnet_gstrings_stats[][ETH_GSTRING_LEN] = {
+	"Checksum ok",
+	"Checksum valid bit not set",
+	"Checksum validation failed",
+	"Checksum error bad buffer",
+	"Checksum error bad ip version",
+	"Checksum error bad transport",
+	"Checksum skipped on ip fragment",
+	"Checksum skipped",
+	"Checksum computed in software",
+};
+
+static void rmnet_get_strings(struct net_device *dev, u32 stringset, u8 *buf)
+{
+	switch (stringset) {
+	case ETH_SS_STATS:
+		memcpy(buf, &rmnet_gstrings_stats,
+		       sizeof(rmnet_gstrings_stats));
+		break;
+	}
+}
+
+static int rmnet_get_sset_count(struct net_device *dev, int sset)
+{
+	switch (sset) {
+	case ETH_SS_STATS:
+		return ARRAY_SIZE(rmnet_gstrings_stats);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static void rmnet_get_ethtool_stats(struct net_device *dev,
+				    struct ethtool_stats *stats, u64 *data)
+{
+	struct rmnet_priv *priv = netdev_priv(dev);
+	struct rmnet_priv_stats *st = &priv->stats;
+
+	if (!data)
+		return;
+
+	memcpy(data, st, ARRAY_SIZE(rmnet_gstrings_stats) * sizeof(u64));
+}
+
+static const struct ethtool_ops rmnet_ethtool_ops = {
+	.get_ethtool_stats = rmnet_get_ethtool_stats,
+	.get_strings = rmnet_get_strings,
+	.get_sset_count = rmnet_get_sset_count,
+};
+
 /* Called by kernel whenever a new rmnet<n> device is created. Sets MTU,
  * flags, ARP type, needed headroom, etc...
  */
@@ -170,6 +220,7 @@ void rmnet_vnd_setup(struct net_device *rmnet_dev)
 	rmnet_dev->flags &= ~(IFF_BROADCAST | IFF_MULTICAST);
 
 	rmnet_dev->needs_free_netdev = true;
+	rmnet_dev->ethtool_ops = &rmnet_ethtool_ops;
 }
 
 /* Exposed API */
-- 
1.9.1

^ permalink raw reply related

* [PATCH net-next v2 3/3] net: qualcomm: rmnet: Remove redundant command check
From: Subash Abhinov Kasiviswanathan @ 2018-05-16  0:52 UTC (permalink / raw)
  To: davem, netdev, fengguang.wu; +Cc: Subash Abhinov Kasiviswanathan
In-Reply-To: <1526431922-4132-1-git-send-email-subashab@codeaurora.org>

The command packet size is already checked once in
rmnet_map_deaggregate() for the header, packet and trailer size, so
this additional check is not needed.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
index 78fdad0..56a93df 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
@@ -69,17 +69,9 @@ static void rmnet_map_send_ack(struct sk_buff *skb,
 	struct rmnet_map_control_command *cmd;
 	int xmit_status;
 
-	if (port->data_format & RMNET_FLAGS_INGRESS_MAP_CKSUMV4) {
-		if (skb->len < sizeof(struct rmnet_map_header) +
-		    RMNET_MAP_GET_LENGTH(skb) +
-		    sizeof(struct rmnet_map_dl_csum_trailer)) {
-			kfree_skb(skb);
-			return;
-		}
-
-		skb_trim(skb, skb->len -
-			 sizeof(struct rmnet_map_dl_csum_trailer));
-	}
+	if (port->data_format & RMNET_FLAGS_INGRESS_MAP_CKSUMV4)
+		skb_trim(skb,
+			 skb->len - sizeof(struct rmnet_map_dl_csum_trailer));
 
 	skb->protocol = htons(ETH_P_MAP);
 
-- 
1.9.1

^ permalink raw reply related

* [PATCH] net: qcom/emac: Encapsulate sgmii ops under one structure
From: Hemanth Puranik @ 2018-05-16  1:10 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: Timur Tabi, Hemanth Puranik

This patch introduces ops structure for sgmii, This by ensures that
we do not need dummy functions in case of emulation platforms.

Signed-off-by: Hemanth Puranik <hpuranik@codeaurora.org>
---
 drivers/net/ethernet/qualcomm/emac/emac-mac.c   |   5 +-
 drivers/net/ethernet/qualcomm/emac/emac-sgmii.c | 128 ++++++++++++++----------
 drivers/net/ethernet/qualcomm/emac/emac-sgmii.h |  32 +++---
 drivers/net/ethernet/qualcomm/emac/emac.c       |   9 +-
 4 files changed, 103 insertions(+), 71 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-mac.c b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
index d5a32b7..092718a 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-mac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
@@ -920,14 +920,13 @@ static void emac_mac_rx_descs_refill(struct emac_adapter *adpt,
 static void emac_adjust_link(struct net_device *netdev)
 {
 	struct emac_adapter *adpt = netdev_priv(netdev);
-	struct emac_sgmii *sgmii = &adpt->phy;
 	struct phy_device *phydev = netdev->phydev;
 
 	if (phydev->link) {
 		emac_mac_start(adpt);
-		sgmii->link_up(adpt);
+		emac_sgmii_link_change(adpt, true);
 	} else {
-		sgmii->link_down(adpt);
+		emac_sgmii_link_change(adpt, false);
 		emac_mac_stop(adpt);
 	}
 
diff --git a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
index e8ab512..562420b 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
@@ -53,6 +53,46 @@
 
 #define SERDES_START_WAIT_TIMES			100
 
+int emac_sgmii_init(struct emac_adapter *adpt)
+{
+	if (!(adpt->phy.sgmii_ops && adpt->phy.sgmii_ops->init))
+		return 0;
+
+	return adpt->phy.sgmii_ops->init(adpt);
+}
+
+int emac_sgmii_open(struct emac_adapter *adpt)
+{
+	if (!(adpt->phy.sgmii_ops && adpt->phy.sgmii_ops->open))
+		return 0;
+
+	return adpt->phy.sgmii_ops->open(adpt);
+}
+
+void emac_sgmii_close(struct emac_adapter *adpt)
+{
+	if (!(adpt->phy.sgmii_ops && adpt->phy.sgmii_ops->close))
+		return;
+
+	adpt->phy.sgmii_ops->close(adpt);
+}
+
+int emac_sgmii_link_change(struct emac_adapter *adpt, bool link_state)
+{
+	if (!(adpt->phy.sgmii_ops && adpt->phy.sgmii_ops->link_change))
+		return 0;
+
+	return adpt->phy.sgmii_ops->link_change(adpt, link_state);
+}
+
+void emac_sgmii_reset(struct emac_adapter *adpt)
+{
+	if (!(adpt->phy.sgmii_ops && adpt->phy.sgmii_ops->reset))
+		return;
+
+	adpt->phy.sgmii_ops->reset(adpt);
+}
+
 /* Initialize the SGMII link between the internal and external PHYs. */
 static void emac_sgmii_link_init(struct emac_adapter *adpt)
 {
@@ -163,21 +203,21 @@ static void emac_sgmii_reset_prepare(struct emac_adapter *adpt)
 	msleep(50);
 }
 
-void emac_sgmii_reset(struct emac_adapter *adpt)
+static void emac_sgmii_common_reset(struct emac_adapter *adpt)
 {
 	int ret;
 
 	emac_sgmii_reset_prepare(adpt);
 	emac_sgmii_link_init(adpt);
 
-	ret = adpt->phy.initialize(adpt);
+	ret = emac_sgmii_init(adpt);
 	if (ret)
 		netdev_err(adpt->netdev,
 			   "could not reinitialize internal PHY (error=%i)\n",
 			   ret);
 }
 
-static int emac_sgmii_open(struct emac_adapter *adpt)
+static int emac_sgmii_common_open(struct emac_adapter *adpt)
 {
 	struct emac_sgmii *sgmii = &adpt->phy;
 	int ret;
@@ -201,43 +241,53 @@ static int emac_sgmii_open(struct emac_adapter *adpt)
 	return 0;
 }
 
-static int emac_sgmii_close(struct emac_adapter *adpt)
+static void emac_sgmii_common_close(struct emac_adapter *adpt)
 {
 	struct emac_sgmii *sgmii = &adpt->phy;
 
 	/* Make sure interrupts are disabled */
 	writel(0, sgmii->base + EMAC_SGMII_PHY_INTERRUPT_MASK);
 	free_irq(sgmii->irq, adpt);
-
-	return 0;
 }
 
 /* The error interrupts are only valid after the link is up */
-static int emac_sgmii_link_up(struct emac_adapter *adpt)
+static int emac_sgmii_common_link_change(struct emac_adapter *adpt, bool linkup)
 {
 	struct emac_sgmii *sgmii = &adpt->phy;
 	int ret;
 
-	/* Clear and enable interrupts */
-	ret = emac_sgmii_irq_clear(adpt, 0xff);
-	if (ret)
-		return ret;
+	if (linkup) {
+		/* Clear and enable interrupts */
+		ret = emac_sgmii_irq_clear(adpt, 0xff);
+		if (ret)
+			return ret;
 
-	writel(SGMII_ISR_MASK, sgmii->base + EMAC_SGMII_PHY_INTERRUPT_MASK);
+		writel(SGMII_ISR_MASK,
+		       sgmii->base + EMAC_SGMII_PHY_INTERRUPT_MASK);
+	} else {
+		/* Disable interrupts */
+		writel(0, sgmii->base + EMAC_SGMII_PHY_INTERRUPT_MASK);
+		synchronize_irq(sgmii->irq);
+	}
 
 	return 0;
 }
 
-static int emac_sgmii_link_down(struct emac_adapter *adpt)
-{
-	struct emac_sgmii *sgmii = &adpt->phy;
-
-	/* Disable interrupts */
-	writel(0, sgmii->base + EMAC_SGMII_PHY_INTERRUPT_MASK);
-	synchronize_irq(sgmii->irq);
+static struct sgmii_ops qdf2432_ops = {
+	.init = emac_sgmii_init_qdf2432,
+	.open = emac_sgmii_common_open,
+	.close = emac_sgmii_common_close,
+	.link_change = emac_sgmii_common_link_change,
+	.reset = emac_sgmii_common_reset,
+};
 
-	return 0;
-}
+static struct sgmii_ops qdf2400_ops = {
+	.init = emac_sgmii_init_qdf2400,
+	.open = emac_sgmii_common_open,
+	.close = emac_sgmii_common_close,
+	.link_change = emac_sgmii_common_link_change,
+	.reset = emac_sgmii_common_reset,
+};
 
 static int emac_sgmii_acpi_match(struct device *dev, void *data)
 {
@@ -249,7 +299,7 @@ static int emac_sgmii_acpi_match(struct device *dev, void *data)
 		{}
 	};
 	const struct acpi_device_id *id = acpi_match_device(match_table, dev);
-	emac_sgmii_function *initialize = data;
+	struct sgmii_ops **ops = data;
 
 	if (id) {
 		acpi_handle handle = ACPI_HANDLE(dev);
@@ -270,10 +320,10 @@ static int emac_sgmii_acpi_match(struct device *dev, void *data)
 
 		switch (hrv) {
 		case 1:
-			*initialize = emac_sgmii_init_qdf2432;
+			*ops = &qdf2432_ops;
 			return 1;
 		case 2:
-			*initialize = emac_sgmii_init_qdf2400;
+			*ops = &qdf2400_ops;
 			return 1;
 		}
 	}
@@ -294,14 +344,6 @@ static int emac_sgmii_acpi_match(struct device *dev, void *data)
 	{}
 };
 
-/* Dummy function for systems without an internal PHY. This avoids having
- * to check for NULL pointers before calling the functions.
- */
-static int emac_sgmii_dummy(struct emac_adapter *adpt)
-{
-	return 0;
-}
-
 int emac_sgmii_config(struct platform_device *pdev, struct emac_adapter *adpt)
 {
 	struct platform_device *sgmii_pdev = NULL;
@@ -312,22 +354,11 @@ int emac_sgmii_config(struct platform_device *pdev, struct emac_adapter *adpt)
 	if (has_acpi_companion(&pdev->dev)) {
 		struct device *dev;
 
-		dev = device_find_child(&pdev->dev, &phy->initialize,
+		dev = device_find_child(&pdev->dev, &phy->sgmii_ops,
 					emac_sgmii_acpi_match);
 
 		if (!dev) {
 			dev_warn(&pdev->dev, "cannot find internal phy node\n");
-			/* There is typically no internal PHY on emulation
-			 * systems, so if we can't find the node, assume
-			 * we are on an emulation system and stub-out
-			 * support for the internal PHY.  These systems only
-			 * use ACPI.
-			 */
-			phy->open = emac_sgmii_dummy;
-			phy->close = emac_sgmii_dummy;
-			phy->link_up = emac_sgmii_dummy;
-			phy->link_down = emac_sgmii_dummy;
-
 			return 0;
 		}
 
@@ -355,14 +386,9 @@ int emac_sgmii_config(struct platform_device *pdev, struct emac_adapter *adpt)
 			goto error_put_device;
 		}
 
-		phy->initialize = (emac_sgmii_function)match->data;
+		phy->sgmii_ops->init = match->data;
 	}
 
-	phy->open = emac_sgmii_open;
-	phy->close = emac_sgmii_close;
-	phy->link_up = emac_sgmii_link_up;
-	phy->link_down = emac_sgmii_link_down;
-
 	/* Base address is the first address */
 	res = platform_get_resource(sgmii_pdev, IORESOURCE_MEM, 0);
 	if (!res) {
@@ -386,7 +412,7 @@ int emac_sgmii_config(struct platform_device *pdev, struct emac_adapter *adpt)
 		}
 	}
 
-	ret = phy->initialize(adpt);
+	ret = emac_sgmii_init(adpt);
 	if (ret)
 		goto error;
 
diff --git a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.h b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.h
index e7c0c3b..31ba21e 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.h
+++ b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.h
@@ -16,36 +16,44 @@
 struct emac_adapter;
 struct platform_device;
 
-typedef int (*emac_sgmii_function)(struct emac_adapter *adpt);
+/** emac_sgmii - internal emac phy
+ * @init initialization function
+ * @open called when the driver is opened
+ * @close called when the driver is closed
+ * @link_change called when the link state changes
+ */
+struct sgmii_ops {
+	int (*init)(struct emac_adapter *adpt);
+	int (*open)(struct emac_adapter *adpt);
+	void (*close)(struct emac_adapter *adpt);
+	int (*link_change)(struct emac_adapter *adpt, bool link_state);
+	void (*reset)(struct emac_adapter *adpt);
+};
 
 /** emac_sgmii - internal emac phy
  * @base base address
  * @digital per-lane digital block
  * @irq the interrupt number
  * @decode_error_count reference count of consecutive decode errors
- * @initialize initialization function
- * @open called when the driver is opened
- * @close called when the driver is closed
- * @link_up called when the link comes up
- * @link_down called when the link comes down
+ * @sgmii_ops sgmii ops
  */
 struct emac_sgmii {
 	void __iomem		*base;
 	void __iomem		*digital;
 	unsigned int		irq;
 	atomic_t		decode_error_count;
-	emac_sgmii_function	initialize;
-	emac_sgmii_function	open;
-	emac_sgmii_function	close;
-	emac_sgmii_function	link_up;
-	emac_sgmii_function	link_down;
+	struct	sgmii_ops	*sgmii_ops;
 };
 
 int emac_sgmii_config(struct platform_device *pdev, struct emac_adapter *adpt);
-void emac_sgmii_reset(struct emac_adapter *adpt);
 
 int emac_sgmii_init_fsm9900(struct emac_adapter *adpt);
 int emac_sgmii_init_qdf2432(struct emac_adapter *adpt);
 int emac_sgmii_init_qdf2400(struct emac_adapter *adpt);
 
+int emac_sgmii_init(struct emac_adapter *adpt);
+int emac_sgmii_open(struct emac_adapter *adpt);
+void emac_sgmii_close(struct emac_adapter *adpt);
+int emac_sgmii_link_change(struct emac_adapter *adpt, bool link_state);
+void emac_sgmii_reset(struct emac_adapter *adpt);
 #endif
diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c b/drivers/net/ethernet/qualcomm/emac/emac.c
index 13235ba..2a0cbc5 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -253,7 +253,7 @@ static int emac_open(struct net_device *netdev)
 		return ret;
 	}
 
-	ret = adpt->phy.open(adpt);
+	ret = emac_sgmii_open(adpt);
 	if (ret) {
 		emac_mac_rx_tx_rings_free_all(adpt);
 		free_irq(irq->irq, irq);
@@ -264,7 +264,7 @@ static int emac_open(struct net_device *netdev)
 	if (ret) {
 		emac_mac_rx_tx_rings_free_all(adpt);
 		free_irq(irq->irq, irq);
-		adpt->phy.close(adpt);
+		emac_sgmii_close(adpt);
 		return ret;
 	}
 
@@ -278,7 +278,7 @@ static int emac_close(struct net_device *netdev)
 
 	mutex_lock(&adpt->reset_lock);
 
-	adpt->phy.close(adpt);
+	emac_sgmii_close(adpt);
 	emac_mac_down(adpt);
 	emac_mac_rx_tx_rings_free_all(adpt);
 
@@ -761,11 +761,10 @@ static void emac_shutdown(struct platform_device *pdev)
 {
 	struct net_device *netdev = dev_get_drvdata(&pdev->dev);
 	struct emac_adapter *adpt = netdev_priv(netdev);
-	struct emac_sgmii *sgmii = &adpt->phy;
 
 	if (netdev->flags & IFF_UP) {
 		/* Closing the SGMII turns off its interrupts */
-		sgmii->close(adpt);
+		emac_sgmii_close(adpt);
 
 		/* Resetting the MAC turns off all DMA and its interrupts */
 		emac_mac_reset(adpt);
-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related

* RE: [PATCH net-next 2/3] net: ethernet: freescale: Allow FEC with COMPILE_TEST
From: Andy Duan @ 2018-05-16  1:19 UTC (permalink / raw)
  To: Florian Fainelli, netdev@vger.kernel.org
  Cc: David S. Miller, Andrew Lunn, open list
In-Reply-To: <20180515234825.11240-3-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com> Sent: 2018年5月16日 7:48
> The Freescale FEC driver builds fine with COMPILE_TEST, so make that
> possible.
> 
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Acked-by: Fugang Duan <fugang.duan@nxp.com>

> ---
>  drivers/net/ethernet/freescale/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/freescale/Kconfig
> b/drivers/net/ethernet/freescale/Kconfig
> index 6e490fd2345d..a580a3dcbe59 100644
> --- a/drivers/net/ethernet/freescale/Kconfig
> +++ b/drivers/net/ethernet/freescale/Kconfig
> @@ -22,7 +22,7 @@ if NET_VENDOR_FREESCALE  config FEC
>  	tristate "FEC ethernet controller (of ColdFire and some i.MX CPUs)"
>  	depends on (M523x || M527x || M5272 || M528x || M520x || M532x || \
> -		   ARCH_MXC || SOC_IMX28)
> +		   ARCH_MXC || SOC_IMX28 || COMPILE_TEST)
>  	default ARCH_MXC || SOC_IMX28 if ARM
>  	select PHYLIB
>  	imply PTP_1588_CLOCK
> --
> 2.14.1


^ permalink raw reply

* [net-next PATCH v2 0/4] Symmetric queue selection using XPS for Rx queues
From: Amritha Nambiar @ 2018-05-16  1:26 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, amritha.nambiar, sridhar.samudrala, edumazet,
	hannes, tom

This patch series implements support for Tx queue selection based on
Rx queue(s) map. This is done by configuring Rx queue(s) map per Tx-queue
using sysfs attribute. If the user configuration for Rx queues does
not apply, then the Tx queue selection falls back to XPS using CPUs and
finally to hashing.

XPS is refactored to support Tx queue selection based on either the
CPUs map or the Rx-queues map. The config option CONFIG_XPS needs to be
enabled. By default no receive queues are configured for the Tx queue.

- /sys/class/net/<dev>/queues/tx-*/xps_rxqs

This is to enable sending packets on the same Tx-Rx queue pair as this
is useful for busy polling multi-threaded workloads where it is not
possible to pin the threads to a CPU. This is a rework of Sridhar's
patch for symmetric queueing via socket option:
https://www.spinics.net/lists/netdev/msg453106.html

v2:
- Added documentation in networking/scaling.txt
- Added a simple routine to replace multiple ifdef blocks.

---

Amritha Nambiar (4):
      net: Refactor XPS for CPUs and Rx queues
      net: Enable Tx queue selection based on Rx queues
      net-sysfs: Add interface for Rx queue map per Tx queue
      Documentation: Add explanation for XPS using Rx-queue map

 Documentation/networking/scaling.txt |   58 +++++++-
 include/linux/cpumask.h              |   11 +-
 include/linux/netdevice.h            |   72 ++++++++++
 include/net/sock.h                   |   18 +++
 net/core/dev.c                       |  242 +++++++++++++++++++++++-----------
 net/core/net-sysfs.c                 |   85 ++++++++++++
 net/core/sock.c                      |    5 +
 net/ipv4/tcp_input.c                 |    7 +
 net/ipv4/tcp_ipv4.c                  |    1 
 net/ipv4/tcp_minisocks.c             |    1 
 10 files changed, 404 insertions(+), 96 deletions(-)

^ permalink raw reply

* [net-next PATCH v2 1/4] net: Refactor XPS for CPUs and Rx queues
From: Amritha Nambiar @ 2018-05-16  1:26 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, amritha.nambiar, sridhar.samudrala, edumazet,
	hannes, tom
In-Reply-To: <152643356116.4991.7215767041139726872.stgit@anamdev.jf.intel.com>

Refactor XPS code to support Tx queue selection based on
CPU map or Rx queue map.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 include/linux/cpumask.h   |   11 ++
 include/linux/netdevice.h |   72 +++++++++++++++-
 net/core/dev.c            |  208 +++++++++++++++++++++++++++++----------------
 net/core/net-sysfs.c      |    4 -
 4 files changed, 215 insertions(+), 80 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index bf53d89..57f20a0 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask;
 #define cpu_active(cpu)		((cpu) == 0)
 #endif
 
-/* verify cpu argument to cpumask_* operators */
-static inline unsigned int cpumask_check(unsigned int cpu)
+static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
 {
 #ifdef CONFIG_DEBUG_PER_CPU_MAPS
-	WARN_ON_ONCE(cpu >= nr_cpumask_bits);
+	WARN_ON_ONCE(cpu >= bits);
 #endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+}
+
+/* verify cpu argument to cpumask_* operators */
+static inline unsigned int cpumask_check(unsigned int cpu)
+{
+	cpu_max_bits_warn(cpu, nr_cpumask_bits);
 	return cpu;
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 03ed492..c2eeb36 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -730,10 +730,21 @@ struct xps_map {
  */
 struct xps_dev_maps {
 	struct rcu_head rcu;
-	struct xps_map __rcu *cpu_map[0];
+	struct xps_map __rcu *attr_map[0];
 };
-#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +		\
+
+#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +	\
 	(nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
+
+#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
+	(_rxqs * (_tcs) * sizeof(struct xps_map *)))
+
+enum xps_map_type {
+	XPS_MAP_RXQS,
+	XPS_MAP_CPUS,
+	__XPS_MAP_MAX
+};
+
 #endif /* CONFIG_XPS */
 
 #define TC_MAX_QUEUE	16
@@ -1891,7 +1902,7 @@ struct net_device {
 	int			watchdog_timeo;
 
 #ifdef CONFIG_XPS
-	struct xps_dev_maps __rcu *xps_maps;
+	struct xps_dev_maps __rcu *xps_maps[__XPS_MAP_MAX];
 #endif
 #ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc __rcu	*miniq_egress;
@@ -3229,6 +3240,61 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
 #ifdef CONFIG_XPS
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			u16 index);
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+			  u16 index, enum xps_map_type type);
+
+static inline bool attr_test_mask(unsigned long j, const unsigned long *mask,
+				  unsigned int nr_bits)
+{
+	cpu_max_bits_warn(j, nr_bits);
+	return test_bit(j, mask);
+}
+
+static inline bool attr_test_online(unsigned long j,
+				    const unsigned long *online_mask,
+				    unsigned int nr_bits)
+{
+	cpu_max_bits_warn(j, nr_bits);
+
+	if (online_mask)
+		return test_bit(j, online_mask);
+
+	if (j >= 0 && j < nr_bits)
+		return true;
+
+	return false;
+}
+
+static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
+					 unsigned int nr_bits)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpu_max_bits_warn(n, nr_bits);
+
+	if (srcp)
+		return find_next_bit(srcp, nr_bits, n + 1);
+
+	return n + 1;
+}
+
+static inline int attrmask_next_and(int n, const unsigned long *src1p,
+				    const unsigned long *src2p,
+				    unsigned int nr_bits)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpu_max_bits_warn(n, nr_bits);
+
+	if (src1p && src2p)
+		return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
+	else if (src1p)
+		return find_next_bit(src1p, nr_bits, n + 1);
+	else if (src2p)
+		return find_next_bit(src2p, nr_bits, n + 1);
+
+	return n + 1;
+}
 #else
 static inline int netif_set_xps_queue(struct net_device *dev,
 				      const struct cpumask *mask,
diff --git a/net/core/dev.c b/net/core/dev.c
index 9f43901..7e5dfdb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2092,7 +2092,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
 	int pos;
 
 	if (dev_maps)
-		map = xmap_dereference(dev_maps->cpu_map[tci]);
+		map = xmap_dereference(dev_maps->attr_map[tci]);
 	if (!map)
 		return false;
 
@@ -2105,7 +2105,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
 			break;
 		}
 
-		RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
+		RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
 		kfree_rcu(map, rcu);
 		return false;
 	}
@@ -2125,7 +2125,7 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 		int i, j;
 
 		for (i = count, j = offset; i--; j++) {
-			if (!remove_xps_queue(dev_maps, cpu, j))
+			if (!remove_xps_queue(dev_maps, tci, j))
 				break;
 		}
 
@@ -2138,30 +2138,47 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 				   u16 count)
 {
+	const unsigned long *possible_mask = NULL;
+	enum xps_map_type type = XPS_MAP_RXQS;
 	struct xps_dev_maps *dev_maps;
-	int cpu, i;
 	bool active = false;
+	unsigned int nr_ids;
+	int i, j;
 
 	mutex_lock(&xps_map_mutex);
-	dev_maps = xmap_dereference(dev->xps_maps);
 
-	if (!dev_maps)
-		goto out_no_maps;
+	while (type < __XPS_MAP_MAX) {
+		dev_maps = xmap_dereference(dev->xps_maps[type]);
+		if (!dev_maps)
+			goto out_no_maps;
+
+		if (type == XPS_MAP_CPUS) {
+			if (num_possible_cpus() > 1)
+				possible_mask = cpumask_bits(cpu_possible_mask);
+			nr_ids = nr_cpu_ids;
+		} else if (type == XPS_MAP_RXQS) {
+			nr_ids = dev->num_rx_queues;
+		}
 
-	for_each_possible_cpu(cpu)
-		active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
-					       offset, count);
+		for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+		     j < nr_ids;)
+			active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
+						       count);
+		if (!active) {
+			RCU_INIT_POINTER(dev->xps_maps[type], NULL);
+			kfree_rcu(dev_maps, rcu);
+		}
 
-	if (!active) {
-		RCU_INIT_POINTER(dev->xps_maps, NULL);
-		kfree_rcu(dev_maps, rcu);
+		if (type == XPS_MAP_CPUS) {
+			for (i = offset + (count - 1); count--; i--)
+				netdev_queue_numa_node_write(
+					netdev_get_tx_queue(dev, i),
+							    NUMA_NO_NODE);
+		}
+out_no_maps:
+		type++;
 	}
 
-	for (i = offset + (count - 1); count--; i--)
-		netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
-					     NUMA_NO_NODE);
-
-out_no_maps:
 	mutex_unlock(&xps_map_mutex);
 }
 
@@ -2170,11 +2187,11 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
 	netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
 }
 
-static struct xps_map *expand_xps_map(struct xps_map *map,
-				      int cpu, u16 index)
+static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
+				      u16 index, enum xps_map_type type)
 {
-	struct xps_map *new_map;
 	int alloc_len = XPS_MIN_MAP_ALLOC;
+	struct xps_map *new_map = NULL;
 	int i, pos;
 
 	for (pos = 0; map && pos < map->len; pos++) {
@@ -2183,7 +2200,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 		return map;
 	}
 
-	/* Need to add queue to this CPU's existing map */
+	/* Need to add tx-queue to this CPU's/rx-queue's existing map */
 	if (map) {
 		if (pos < map->alloc_len)
 			return map;
@@ -2191,9 +2208,14 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 		alloc_len = map->alloc_len * 2;
 	}
 
-	/* Need to allocate new map to store queue on this CPU's map */
-	new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
-			       cpu_to_node(cpu));
+	/* Need to allocate new map to store tx-queue on this CPU's/rx-queue's
+	 *  map
+	 */
+	if (type == XPS_MAP_RXQS)
+		new_map = kzalloc(XPS_MAP_SIZE(alloc_len), GFP_KERNEL);
+	else if (type == XPS_MAP_CPUS)
+		new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
+				       cpu_to_node(attr_index));
 	if (!new_map)
 		return NULL;
 
@@ -2205,14 +2227,16 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 	return new_map;
 }
 
-int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
-			u16 index)
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+			  u16 index, enum xps_map_type type)
 {
+	const unsigned long *online_mask = NULL, *possible_mask = NULL;
 	struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
-	int i, cpu, tci, numa_node_id = -2;
+	int i, j, tci, numa_node_id = -2;
 	int maps_sz, num_tc = 1, tc = 0;
 	struct xps_map *map, *new_map;
 	bool active = false;
+	unsigned int nr_ids;
 
 	if (dev->num_tc) {
 		num_tc = dev->num_tc;
@@ -2221,16 +2245,33 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			return -EINVAL;
 	}
 
-	maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
+	switch (type) {
+	case XPS_MAP_RXQS:
+		maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
+		dev_maps = xmap_dereference(dev->xps_maps[XPS_MAP_RXQS]);
+		nr_ids = dev->num_rx_queues;
+		break;
+	case XPS_MAP_CPUS:
+		maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
+		if (num_possible_cpus() > 1) {
+			online_mask = cpumask_bits(cpu_online_mask);
+			possible_mask = cpumask_bits(cpu_possible_mask);
+		}
+		dev_maps = xmap_dereference(dev->xps_maps[XPS_MAP_CPUS]);
+		nr_ids = nr_cpu_ids;
+		break;
+	default:
+		return -EINVAL;
+	}
+
 	if (maps_sz < L1_CACHE_BYTES)
 		maps_sz = L1_CACHE_BYTES;
 
 	mutex_lock(&xps_map_mutex);
 
-	dev_maps = xmap_dereference(dev->xps_maps);
-
 	/* allocate memory for queue storage */
-	for_each_cpu_and(cpu, cpu_online_mask, mask) {
+	for (j = -1; j = attrmask_next_and(j, online_mask, mask, nr_ids),
+	     j < nr_ids;) {
 		if (!new_dev_maps)
 			new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
 		if (!new_dev_maps) {
@@ -2238,73 +2279,81 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			return -ENOMEM;
 		}
 
-		tci = cpu * num_tc + tc;
-		map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
+		tci = j * num_tc + tc;
+		map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
 				 NULL;
 
-		map = expand_xps_map(map, cpu, index);
+		map = expand_xps_map(map, j, index, type);
 		if (!map)
 			goto error;
 
-		RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+		RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 	}
 
 	if (!new_dev_maps)
 		goto out_no_new_maps;
 
-	for_each_possible_cpu(cpu) {
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
 		/* copy maps belonging to foreign traffic classes */
-		for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
+		for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 
 		/* We need to explicitly update tci as prevous loop
 		 * could break out early if dev_maps is NULL.
 		 */
-		tci = cpu * num_tc + tc;
+		tci = j * num_tc + tc;
 
-		if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
-			/* add queue to CPU maps */
+		if (attr_test_mask(j, mask, nr_ids) &&
+		    attr_test_online(j, online_mask, nr_ids)) {
+			/* add tx-queue to CPU/rx-queue maps */
 			int pos = 0;
 
-			map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+			map = xmap_dereference(new_dev_maps->attr_map[tci]);
 			while ((pos < map->len) && (map->queues[pos] != index))
 				pos++;
 
 			if (pos == map->len)
 				map->queues[map->len++] = index;
 #ifdef CONFIG_NUMA
-			if (numa_node_id == -2)
-				numa_node_id = cpu_to_node(cpu);
-			else if (numa_node_id != cpu_to_node(cpu))
-				numa_node_id = -1;
+			if (type == XPS_MAP_CPUS) {
+				if (numa_node_id == -2)
+					numa_node_id = cpu_to_node(j);
+				else if (numa_node_id != cpu_to_node(j))
+					numa_node_id = -1;
+			}
 #endif
 		} else if (dev_maps) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 
 		/* copy maps belonging to foreign traffic classes */
 		for (i = num_tc - tc, tci++; dev_maps && --i; tci++) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 	}
 
-	rcu_assign_pointer(dev->xps_maps, new_dev_maps);
+	if (type == XPS_MAP_RXQS)
+		rcu_assign_pointer(dev->xps_maps[XPS_MAP_RXQS], new_dev_maps);
+	else if (type == XPS_MAP_CPUS)
+		rcu_assign_pointer(dev->xps_maps[XPS_MAP_CPUS], new_dev_maps);
 
 	/* Cleanup old maps */
 	if (!dev_maps)
 		goto out_no_old_maps;
 
-	for_each_possible_cpu(cpu) {
-		for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
-			new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = num_tc, tci = j * num_tc; i--; tci++) {
+			new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
 			if (map && map != new_map)
 				kfree_rcu(map, rcu);
 		}
@@ -2317,19 +2366,23 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	active = true;
 
 out_no_new_maps:
-	/* update Tx queue numa node */
-	netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
-				     (numa_node_id >= 0) ? numa_node_id :
-				     NUMA_NO_NODE);
+	if (type == XPS_MAP_CPUS) {
+		/* update Tx queue numa node */
+		netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
+					     (numa_node_id >= 0) ?
+					     numa_node_id : NUMA_NO_NODE);
+	}
 
 	if (!dev_maps)
 		goto out_no_maps;
 
-	/* removes queue from unused CPUs */
-	for_each_possible_cpu(cpu) {
-		for (i = tc, tci = cpu * num_tc; i--; tci++)
+	/* removes tx-queue from unused CPUs/rx-queues */
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = tc, tci = j * num_tc; i--; tci++)
 			active |= remove_xps_queue(dev_maps, tci, index);
-		if (!cpumask_test_cpu(cpu, mask) || !cpu_online(cpu))
+		if (!attr_test_mask(j, mask, nr_ids) ||
+		    !attr_test_online(j, online_mask, nr_ids))
 			active |= remove_xps_queue(dev_maps, tci, index);
 		for (i = num_tc - tc, tci++; --i; tci++)
 			active |= remove_xps_queue(dev_maps, tci, index);
@@ -2337,7 +2390,10 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 
 	/* free map if not active */
 	if (!active) {
-		RCU_INIT_POINTER(dev->xps_maps, NULL);
+		if (type == XPS_MAP_RXQS)
+			RCU_INIT_POINTER(dev->xps_maps[XPS_MAP_RXQS], NULL);
+		else if (type == XPS_MAP_CPUS)
+			RCU_INIT_POINTER(dev->xps_maps[XPS_MAP_CPUS], NULL);
 		kfree_rcu(dev_maps, rcu);
 	}
 
@@ -2347,11 +2403,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	return 0;
 error:
 	/* remove any maps that we added */
-	for_each_possible_cpu(cpu) {
-		for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
-			new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+	for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = num_tc, tci = j * num_tc; i--; tci++) {
+			new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
 			map = dev_maps ?
-			      xmap_dereference(dev_maps->cpu_map[tci]) :
+			      xmap_dereference(dev_maps->attr_map[tci]) :
 			      NULL;
 			if (new_map && new_map != map)
 				kfree(new_map);
@@ -2363,6 +2420,13 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	kfree(new_dev_maps);
 	return -ENOMEM;
 }
+
+int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
+			u16 index)
+{
+	return __netif_set_xps_queue(dev, cpumask_bits(mask), index,
+				     XPS_MAP_CPUS);
+}
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
@@ -3402,7 +3466,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	int queue_index = -1;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
 	if (dev_maps) {
 		unsigned int tci = skb->sender_cpu - 1;
 
@@ -3411,7 +3475,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 			tci += netdev_get_prio_tc_map(dev, skb->priority);
 		}
 
-		map = rcu_dereference(dev_maps->cpu_map[tci]);
+		map = rcu_dereference(dev_maps->attr_map[tci]);
 		if (map) {
 			if (map->len == 1)
 				queue_index = map->queues[0];
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index c476f07..d7abd33 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1227,13 +1227,13 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 	}
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
 	if (dev_maps) {
 		for_each_possible_cpu(cpu) {
 			int i, tci = cpu * num_tc + tc;
 			struct xps_map *map;
 
-			map = rcu_dereference(dev_maps->cpu_map[tci]);
+			map = rcu_dereference(dev_maps->attr_map[tci]);
 			if (!map)
 				continue;
 

^ permalink raw reply related

* [net-next PATCH v2 2/4] net: Enable Tx queue selection based on Rx queues
From: Amritha Nambiar @ 2018-05-16  1:26 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, amritha.nambiar, sridhar.samudrala, edumazet,
	hannes, tom
In-Reply-To: <152643356116.4991.7215767041139726872.stgit@anamdev.jf.intel.com>

This patch adds support to pick Tx queue based on the Rx queue map
configuration set by the admin through the sysfs attribute
for each Tx queue. If the user configuration for receive
queue map does not apply, then the Tx queue selection falls back
to CPU map based selection and finally to hashing.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/net/sock.h       |   18 ++++++++++++++++++
 net/core/dev.c           |   36 +++++++++++++++++++++++++++++-------
 net/core/sock.c          |    5 +++++
 net/ipv4/tcp_input.c     |    7 +++++++
 net/ipv4/tcp_ipv4.c      |    1 +
 net/ipv4/tcp_minisocks.c |    1 +
 6 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 4f7c584..0613f63 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -139,6 +139,8 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  *	@skc_tx_queue_mapping: tx queue number for this connection
+ *	@skc_rx_queue_mapping: rx queue number for this connection
+ *	@skc_rx_ifindex: rx ifindex for this connection
  *	@skc_flags: place holder for sk_flags
  *		%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  *		%SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -215,6 +217,10 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	int			skc_tx_queue_mapping;
+#ifdef CONFIG_XPS
+	int			skc_rx_queue_mapping;
+	int			skc_rx_ifindex;
+#endif
 	union {
 		int		skc_incoming_cpu;
 		u32		skc_rcv_wnd;
@@ -326,6 +332,10 @@ struct sock {
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
+#ifdef CONFIG_XPS
+#define sk_rx_queue_mapping	__sk_common.skc_rx_queue_mapping
+#define sk_rx_ifindex		__sk_common.skc_rx_ifindex
+#endif
 
 #define sk_dontcopy_begin	__sk_common.skc_dontcopy_begin
 #define sk_dontcopy_end		__sk_common.skc_dontcopy_end
@@ -1696,6 +1706,14 @@ static inline int sk_tx_queue_get(const struct sock *sk)
 	return sk ? sk->sk_tx_queue_mapping : -1;
 }
 
+static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+	sk->sk_rx_ifindex = skb->skb_iif;
+	sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
+#endif
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
 	sk_tx_queue_clear(sk);
diff --git a/net/core/dev.c b/net/core/dev.c
index 7e5dfdb..4030368 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3458,18 +3458,14 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 }
 #endif /* CONFIG_NET_EGRESS */
 
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
-{
 #ifdef CONFIG_XPS
-	struct xps_dev_maps *dev_maps;
+static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
+			       struct xps_dev_maps *dev_maps, unsigned int tci)
+{
 	struct xps_map *map;
 	int queue_index = -1;
 
-	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
 	if (dev_maps) {
-		unsigned int tci = skb->sender_cpu - 1;
-
 		if (dev->num_tc) {
 			tci *= dev->num_tc;
 			tci += netdev_get_prio_tc_map(dev, skb->priority);
@@ -3486,6 +3482,32 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 				queue_index = -1;
 		}
 	}
+	return queue_index;
+}
+#endif
+
+static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+	enum xps_map_type i = XPS_MAP_RXQS;
+	struct xps_dev_maps *dev_maps;
+	struct sock *sk = skb->sk;
+	int queue_index = -1;
+	unsigned int tci = 0;
+
+	if (sk && sk->sk_rx_queue_mapping <= dev->real_num_rx_queues &&
+	    dev->ifindex == sk->sk_rx_ifindex)
+		tci = sk->sk_rx_queue_mapping;
+
+	rcu_read_lock();
+	while (queue_index < 0 && i < __XPS_MAP_MAX) {
+		if (i == XPS_MAP_CPUS)
+			tci = skb->sender_cpu - 1;
+		dev_maps = rcu_dereference(dev->xps_maps[i]);
+		queue_index = __get_xps_queue_idx(dev, skb, dev_maps, tci);
+		i++;
+	}
+
 	rcu_read_unlock();
 
 	return queue_index;
diff --git a/net/core/sock.c b/net/core/sock.c
index 042cfc6..73d7fa8 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2824,6 +2824,11 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_pacing_rate = ~0U;
 	sk->sk_pacing_shift = 10;
 	sk->sk_incoming_cpu = -1;
+
+#ifdef CONFIG_XPS
+	sk->sk_rx_ifindex = -1;
+	sk->sk_rx_queue_mapping = -1;
+#endif
 	/*
 	 * Before updating sk_refcnt, we must commit prior changes to memory
 	 * (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b188e0d..d33911c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -78,6 +78,7 @@
 #include <linux/errqueue.h>
 #include <trace/events/tcp.h>
 #include <linux/static_key.h>
+#include <net/busy_poll.h>
 
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 
@@ -5559,6 +5560,11 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 		__tcp_fast_path_on(tp, tp->snd_wnd);
 	else
 		tp->pred_flags = 0;
+
+	if (skb) {
+		sk_mark_napi_id(sk, skb);
+		sk_mark_rx_queue(sk, skb);
+	}
 }
 
 static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
@@ -6371,6 +6377,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	tcp_rsk(req)->snt_isn = isn;
 	tcp_rsk(req)->txhash = net_tx_rndhash();
 	tcp_openreq_init_rwin(req, sk, dst);
+	sk_mark_rx_queue(req_to_sk(req), skb);
 	if (!want_cookie) {
 		tcp_reqsk_record_syn(sk, req, skb);
 		fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index caf23de..abdf02e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1479,6 +1479,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 
 		sock_rps_save_rxhash(sk, skb);
 		sk_mark_napi_id(sk, skb);
+		sk_mark_rx_queue(sk, skb);
 		if (dst) {
 			if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||
 			    !dst->ops->check(dst, 0)) {
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f867658..4939c28 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -836,6 +836,7 @@ int tcp_child_process(struct sock *parent, struct sock *child,
 
 	/* record NAPI ID of child */
 	sk_mark_napi_id(child, skb);
+	sk_mark_rx_queue(child, skb);
 
 	tcp_segs_in(tcp_sk(child), skb);
 	if (!sock_owned_by_user(child)) {

^ permalink raw reply related

* [net-next PATCH v2 3/4] net-sysfs: Add interface for Rx queue map per Tx queue
From: Amritha Nambiar @ 2018-05-16  1:26 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, amritha.nambiar, sridhar.samudrala, edumazet,
	hannes, tom
In-Reply-To: <152643356116.4991.7215767041139726872.stgit@anamdev.jf.intel.com>

Extend transmit queue sysfs attribute to configure Rx queue map
per Tx queue. By default no receive queues are configured for the
Tx queue.

- /sys/class/net/eth0/queues/tx-*/xps_rxqs

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 net/core/net-sysfs.c |   81 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index d7abd33..0654243 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1283,6 +1283,86 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
 
 static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init
 	= __ATTR_RW(xps_cpus);
+
+static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
+{
+	struct net_device *dev = queue->dev;
+	struct xps_dev_maps *dev_maps;
+	unsigned long *mask, index;
+	int j, len, num_tc = 1, tc = 0;
+
+	mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+		       GFP_KERNEL);
+	if (!mask)
+		return -ENOMEM;
+
+	index = get_netdev_queue_index(queue);
+
+	if (dev->num_tc) {
+		num_tc = dev->num_tc;
+		tc = netdev_txq_to_tc(dev, index);
+		if (tc < 0)
+			return -EINVAL;
+	}
+
+	rcu_read_lock();
+	dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_RXQS]);
+	if (dev_maps) {
+		for (j = -1; j = attrmask_next(j, NULL, dev->num_rx_queues),
+		     j < dev->num_rx_queues;) {
+			int i, tci = j * num_tc + tc;
+			struct xps_map *map;
+
+			map = rcu_dereference(dev_maps->attr_map[tci]);
+			if (!map)
+				continue;
+
+			for (i = map->len; i--;) {
+				if (map->queues[i] == index) {
+					set_bit(j, mask);
+					break;
+				}
+			}
+		}
+	}
+
+	len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues);
+	rcu_read_unlock();
+	kfree(mask);
+
+	return len < PAGE_SIZE ? len : -EINVAL;
+}
+
+static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
+			      size_t len)
+{
+	struct net_device *dev = queue->dev;
+	unsigned long *mask, index;
+	int err;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+		       GFP_KERNEL);
+	if (!mask)
+		return -ENOMEM;
+
+	index = get_netdev_queue_index(queue);
+
+	err = bitmap_parse(buf, len, mask, dev->num_rx_queues);
+	if (err) {
+		kfree(mask);
+		return err;
+	}
+
+	err = __netif_set_xps_queue(dev, mask, index, XPS_MAP_RXQS);
+	kfree(mask);
+	return err ? : len;
+}
+
+static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init
+	= __ATTR_RW(xps_rxqs);
 #endif /* CONFIG_XPS */
 
 static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
@@ -1290,6 +1370,7 @@ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
 	&queue_traffic_class.attr,
 #ifdef CONFIG_XPS
 	&xps_cpus_attribute.attr,
+	&xps_rxqs_attribute.attr,
 	&queue_tx_maxrate.attr,
 #endif
 	NULL

^ permalink raw reply related

* [net-next PATCH v2 4/4] Documentation: Add explanation for XPS using Rx-queue map
From: Amritha Nambiar @ 2018-05-16  1:27 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, amritha.nambiar, sridhar.samudrala, edumazet,
	hannes, tom
In-Reply-To: <152643356116.4991.7215767041139726872.stgit@anamdev.jf.intel.com>

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 Documentation/networking/scaling.txt |   58 ++++++++++++++++++++++++++++------
 1 file changed, 48 insertions(+), 10 deletions(-)

diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index f55639d..834147c 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -366,8 +366,13 @@ XPS: Transmit Packet Steering
 
 Transmit Packet Steering is a mechanism for intelligently selecting
 which transmit queue to use when transmitting a packet on a multi-queue
-device. To accomplish this, a mapping from CPU to hardware queue(s) is
-recorded. The goal of this mapping is usually to assign queues
+device. This can be accomplished by recording two kinds of maps, either
+a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
+to hardware transmit queue(s).
+
+1. XPS using CPUs map
+
+The goal of this mapping is usually to assign queues
 exclusively to a subset of CPUs, where the transmit completions for
 these queues are processed on a CPU within this set. This choice
 provides two benefits. First, contention on the device queue lock is
@@ -377,12 +382,36 @@ transmit queue). Secondly, cache miss rate on transmit completion is
 reduced, in particular for data cache lines that hold the sk_buff
 structures.
 
-XPS is configured per transmit queue by setting a bitmap of CPUs that
-may use that queue to transmit. The reverse mapping, from CPUs to
-transmit queues, is computed and maintained for each network device.
-When transmitting the first packet in a flow, the function
-get_xps_queue() is called to select a queue. This function uses the ID
-of the running CPU as a key into the CPU-to-queue lookup table. If the
+2. XPS using receive queues map
+
+This mapping is used to pick transmit queue based on the receive
+queue(s) map configuration set by the administrator. A set of receive
+queues can be mapped to a set of transmit queues (many:many), although
+the common use case is a 1:1 mapping. This will enable sending packets
+on the same queue pair for transmit and receive. This is useful for
+busy polling multi-threaded workloads where there are challenges in
+associating a given CPU to a given application thread. The application
+threads are not pinned to CPUs and each thread handles packets
+received on a single queue. The receive queue number is cached in the
+socket for the connection and there is no need for adding flow entries
+as in the case of aRFS or flow director. In this model, sending the
+packets on the same transmit queue corresponding to the queue-pair
+associated with the receive queue has benefits in keeping the CPU overhead
+low. Transmit completion work is locked into the same queue pair that
+a given application is polling on. This avoids the overhead of triggering
+an interrupt on another CPU. When the application cleans up the packets
+during the busy poll, transmit completion may be processed along with it
+in the same thread context and so result in reduced latency.
+
+XPS is configured per transmit queue by setting a bitmap of
+CPUs/receive-queues that may use that queue to transmit. The reverse
+mapping, from CPUs to transmit queues or from receive-queues to transmit
+queues, is computed and maintained for each network device. When
+transmitting the first packet in a flow, the function get_xps_queue() is
+called to select a queue. This function uses the ID of the receive queue
+for the socket connection for a match in the receive queue-to-transmit queue
+lookup table. Alternatively, this function can also use the ID of the
+running CPU as a key into the CPU-to-queue lookup table. If the
 ID matches a single queue, that is used for transmission. If multiple
 queues match, one is selected by using the flow hash to compute an index
 into the set.
@@ -404,11 +433,15 @@ acknowledged.
 
 XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
 default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs that may use a transmit
-queue is configured using the sysfs file entry:
+configured. To enable XPS, the bitmap of CPUs/receive-queues that may
+use a transmit queue is configured using the sysfs file entry:
 
+For selection based on CPUs map:
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
 
+For selection based on receive-queues map:
+/sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
+
 == Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration
@@ -421,6 +454,11 @@ best CPUs to share a given queue are probably those that share the cache
 with the CPU that processes transmit completions for that queue
 (transmit interrupts).
 
+For transmit queue selection based on receive queue(s), XPS has to be
+explicitly configured mapping receive-queue(s) to transmit queue(s). If the
+user configuration for receive-queue map does not apply, then the transmit
+queue is selected based on the CPUs map.
+
 Per TX Queue rate limitation:
 =============================
 

^ permalink raw reply related

* ECMP routing: problematic selection of outgoing interface
From: Hirotaka Yamamoto @ 2018-05-16  1:51 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi,

Recently I have built a highly-available network using an ECMP
route connected to two isolated L2 switches as follows.

Router-- ToR switch 1  ---- Linux
     |   192.168.11.1/24     |  eth0: 192.168.11.2/24
     |                       |  eth1: 192.168.12.2/24
     +-- ToR switch 2  ------+
         192.168.12.1/24

The (default) route has been configured with:

    $ sudo ip route add default \
           nexthop via 192.168.11.1 \
           nexthop via 192.168.12.1

Then I found that Linux chooses a wrong outgoing device for some
destination/source address pairs like this:

    $ ip route get 12.34.56.78 from 192.168.12.2:
    12.34.56.78 from 192.168.12.2 via 192.168.11.1 dev eth0 uid 0
                                                 # dev should be "eth1"

As a consequence, programs like SSH or curl do not work for such
destinations because routers drop packets having strange source
addresses.

Unbound sockets also suffer this problem.  My guess for this is that
Linux chooses a source address first, then a wrong outgoing device.

Although I believe this is a bug in Linux, I found a possibly relevant
comment in function ip_route_output_key_hash_rcu at net/ipv4/route.c:

    /* I removed check for oif == dev_out->oif here.
       It was wrong for two reasons:
       1. ip_dev_find(net, saddr) can return wrong iface, if saddr
          is assigned to multiple interfaces.
       2. Moreover, we are allowed to send packets with saddr
          of another iface. --ANK

According to the comment 2, I wonder this behavior might be intended.

So, my question is:

1. Is this intended or not?
2. If this is intended, how can I make programs work in this ECMP network?


I have created a simple script to reproduce the problem (attached below).
The script creates a dedicated network namespace "testns" and configures
ECMP route to reproduce the problem.

So far, I can reproduce the problem with these Linux versions:
    - 4.17-rc5          (Upstream)
    - 4.15.0-20-generic (Ubuntu 18.04)
    - 4.14.32-coreos    (CoreOS)
    - 4.13.0-37-generic (Ubuntu 16.04 HWE)
    - 4.4.0-116-generic (Ubuntu 16.04)

Note that the problem is not limited to the default route.
Any route configured as ECMP can cause the problem.

- ymmt

#!/bin/sh -e

NS=testns

BR1=testbr1
VETH1=testveth1
BR2=testbr2
VETH2=testveth2
LINKS="$VETH1 $VETH2 $BR1 $BR2"

NET1=192.168.11.xx/24
NET2=192.168.12.xx/24
IPNS="ip netns exec $NS ip"

clean() {
    for l in $LINKS; do
        if ip -o link show $l >/dev/null 2>&1; then
            ip link del $l
        fi
    done

    if ip netns list | grep -q $NS; then
        ip netns del $NS
    fi
}
trap clean INT QUIT TERM HUP PIPE 0

make_address() {
    local net addr
    net=$1
    addr=$2

    echo $net | sed "s/xx/$addr/"
}

cidr2ip() {
    echo $1 | cut -d / -f 1
}

GW1=$(make_address $NET1 1)
GW2=$(make_address $NET2 1)
ADDR1=$(make_address $NET1 2)
ADDR2=$(make_address $NET2 2)

setup_veth() {
    local br veth dest
    br=$1
    veth=$2
    dest=$3

    ip link add $br type bridge
    ip link add $veth type veth peer name ${veth}_
    ip link set $br up
    ip link set $veth master $br up
    ip link set ${veth}_ netns $NS name $dest up
}

setup() {
    ip netns add $NS
    $IPNS link set lo up

    setup_veth $BR1 $VETH1 eth0
    setup_veth $BR2 $VETH2 eth1

    local gw1 gw2
    ip addr add $GW1 dev $BR1
    ip addr add $GW2 dev $BR2
    $IPNS addr add $ADDR1 dev eth0
    $IPNS addr add $ADDR2 dev eth1

    $IPNS route add 0.0.0.0/0 nexthop via $(cidr2ip $GW1) nexthop via $(cidr2ip $GW2)
}

test_route_from() {
    local dest dev from r rdev
    dest=$1
    dev=$2
    from=$3
    r=$($IPNS -o route get $dest from $from)
    rdev=$(echo $r | sed -nr 's/^.*dev (eth[[:digit:]]+).*/\1/p')
    if [ "$dev" != "$rdev" ]; then
        echo "WRONG dev/from pair: ip -o route get $dest from $from:"
        printf "%s\n" "$r"
        return
    fi
}

test_route() {
    test_route_from "$1" eth0 $(cidr2ip $ADDR1)
    test_route_from "$1" eth1 $(cidr2ip $ADDR2)
}

run_tests() {
    test_route 12.34.56.78
    test_route 216.58.200.160
    test_route 216.58.200.161
    test_route 216.58.200.162
    test_route 216.58.200.163
    test_route 216.58.200.164
    test_route 52.85.149.10
    test_route 52.85.149.11
    test_route 52.85.149.12
    test_route 52.85.149.13
    test_route 52.85.149.14
}

# main
setup
run_tests
read -p "Press enter to finish" ret

^ permalink raw reply

* RE: [PATCH net-next v2 2/2] drivers: net: Remove device_node checks with of_mdiobus_register()
From: Andy Duan @ 2018-05-16  1:57 UTC (permalink / raw)
  To: Florian Fainelli, netdev@vger.kernel.org
  Cc: Andrew Lunn, Vivien Didelot, David S. Miller, Nicolas Ferre,
	Sergei Shtylyov, Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	Grygorii Strashko, Woojung Huh, Microchip Linux Driver Support,
	Rob Herring, Frank Rowand, Antoine Tenart, Tobias Jordan,
	Russell King, Geert Uytterhoeven, Th
In-Reply-To: <20180515235619.27773-3-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com> Sent: 2018年5月16日 7:56
> A number of drivers have the following pattern:
> 
> if (np)
> 	of_mdiobus_register()
> else
> 	mdiobus_register()
> 
> which the implementation of of_mdiobus_register() now takes care of.
> Remove that pattern in drivers that strictly adhere to it.
> 
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

For drivers/net/ethernet/freescale/fec_main.c:

Reviewed-by: Fugang Duan <fugang.duan@nxp.com>
> ---
>  drivers/net/dsa/bcm_sf2.c                         |  8 ++------
>  drivers/net/dsa/mv88e6xxx/chip.c                  |  5 +----
>  drivers/net/ethernet/cadence/macb_main.c          | 12 +++---------
>  drivers/net/ethernet/freescale/fec_main.c         |  8 ++------
>  drivers/net/ethernet/marvell/mvmdio.c             |  5 +----
>  drivers/net/ethernet/renesas/sh_eth.c             | 11 +++--------
>  drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c |  5 +----
>  drivers/net/ethernet/ti/davinci_mdio.c            |  8 +++-----
>  drivers/net/phy/mdio-gpio.c                       |  6 +-----
>  drivers/net/phy/mdio-mscc-miim.c                  |  6 +-----
>  drivers/net/usb/lan78xx.c                         |  7 ++-----
>  11 files changed, 20 insertions(+), 61 deletions(-)
> 
> diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c index
> ac621f44237a..02e8982519ce 100644
> --- a/drivers/net/dsa/bcm_sf2.c
> +++ b/drivers/net/dsa/bcm_sf2.c
> @@ -450,12 +450,8 @@ static int bcm_sf2_mdio_register(struct dsa_switch
> *ds)
>  	priv->slave_mii_bus->parent = ds->dev->parent;
>  	priv->slave_mii_bus->phy_mask = ~priv->indir_phy_mask;
> 
> -	if (dn)
> -		err = of_mdiobus_register(priv->slave_mii_bus, dn);
> -	else
> -		err = mdiobus_register(priv->slave_mii_bus);
> -
> -	if (err)
> +	err = of_mdiobus_register(priv->slave_mii_bus, dn);
> +	if (err && dn)
>  		of_node_put(dn);
> 
>  	return err;
> diff --git a/drivers/net/dsa/mv88e6xxx/chip.c
> b/drivers/net/dsa/mv88e6xxx/chip.c
> index b23c11d9f4b2..2bb3f03ee1cb 100644
> --- a/drivers/net/dsa/mv88e6xxx/chip.c
> +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> @@ -2454,10 +2454,7 @@ static int mv88e6xxx_mdio_register(struct
> mv88e6xxx_chip *chip,
>  			return err;
>  	}
> 
> -	if (np)
> -		err = of_mdiobus_register(bus, np);
> -	else
> -		err = mdiobus_register(bus);
> +	err = of_mdiobus_register(bus, np);
>  	if (err) {
>  		dev_err(chip->dev, "Cannot register MDIO bus (%d)\n", err);
>  		mv88e6xxx_g2_irq_mdio_free(chip, bus); diff --git
> a/drivers/net/ethernet/cadence/macb_main.c
> b/drivers/net/ethernet/cadence/macb_main.c
> index b4c9268100bb..3e93df5d4e3b 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -591,16 +591,10 @@ static int macb_mii_init(struct macb *bp)
>  	dev_set_drvdata(&bp->dev->dev, bp->mii_bus);
> 
>  	np = bp->pdev->dev.of_node;
> +	if (pdata)
> +		bp->mii_bus->phy_mask = pdata->phy_mask;
> 
> -	if (np) {
> -		err = of_mdiobus_register(bp->mii_bus, np);
> -	} else {
> -		if (pdata)
> -			bp->mii_bus->phy_mask = pdata->phy_mask;
> -
> -		err = mdiobus_register(bp->mii_bus);
> -	}
> -
> +	err = of_mdiobus_register(bp->mii_bus, np);
>  	if (err)
>  		goto err_out_free_mdiobus;
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index d4604bc8eb5b..f3e43db0d6cb 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -2052,13 +2052,9 @@ static int fec_enet_mii_init(struct platform_device
> *pdev)
>  	fep->mii_bus->parent = &pdev->dev;
> 
>  	node = of_get_child_by_name(pdev->dev.of_node, "mdio");
> -	if (node) {
> -		err = of_mdiobus_register(fep->mii_bus, node);
> +	err = of_mdiobus_register(fep->mii_bus, node);
> +	if (node)
>  		of_node_put(node);
> -	} else {
> -		err = mdiobus_register(fep->mii_bus);
> -	}
> -
>  	if (err)
>  		goto err_out_free_mdiobus;
> 
> diff --git a/drivers/net/ethernet/marvell/mvmdio.c
> b/drivers/net/ethernet/marvell/mvmdio.c
> index 0495487f7b42..c5dac6bd2be4 100644
> --- a/drivers/net/ethernet/marvell/mvmdio.c
> +++ b/drivers/net/ethernet/marvell/mvmdio.c
> @@ -348,10 +348,7 @@ static int orion_mdio_probe(struct platform_device
> *pdev)
>  		goto out_mdio;
>  	}
> 
> -	if (pdev->dev.of_node)
> -		ret = of_mdiobus_register(bus, pdev->dev.of_node);
> -	else
> -		ret = mdiobus_register(bus);
> +	ret = of_mdiobus_register(bus, pdev->dev.of_node);
>  	if (ret < 0) {
>  		dev_err(&pdev->dev, "Cannot register MDIO bus (%d)\n", ret);
>  		goto out_mdio;
> diff --git a/drivers/net/ethernet/renesas/sh_eth.c
> b/drivers/net/ethernet/renesas/sh_eth.c
> index 5970d9e5ddf1..8dd41e08a6c6 100644
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> @@ -3025,15 +3025,10 @@ static int sh_mdio_init(struct sh_eth_private
> *mdp,
>  		 pdev->name, pdev->id);
> 
>  	/* register MDIO bus */
> -	if (dev->of_node) {
> -		ret = of_mdiobus_register(mdp->mii_bus, dev->of_node);
> -	} else {
> -		if (pd->phy_irq > 0)
> -			mdp->mii_bus->irq[pd->phy] = pd->phy_irq;
> -
> -		ret = mdiobus_register(mdp->mii_bus);
> -	}
> +	if (pd->phy_irq > 0)
> +		mdp->mii_bus->irq[pd->phy] = pd->phy_irq;
> 
> +	ret = of_mdiobus_register(mdp->mii_bus, dev->of_node);
>  	if (ret)
>  		goto out_free_bus;
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
> b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
> index f5f37bfa1d58..5df1a608e566 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
> @@ -233,10 +233,7 @@ int stmmac_mdio_register(struct net_device *ndev)
>  	new_bus->phy_mask = mdio_bus_data->phy_mask;
>  	new_bus->parent = priv->device;
> 
> -	if (mdio_node)
> -		err = of_mdiobus_register(new_bus, mdio_node);
> -	else
> -		err = mdiobus_register(new_bus);
> +	err = of_mdiobus_register(new_bus, mdio_node);
>  	if (err != 0) {
>  		dev_err(dev, "Cannot register the MDIO bus\n");
>  		goto bus_register_fail;
> diff --git a/drivers/net/ethernet/ti/davinci_mdio.c
> b/drivers/net/ethernet/ti/davinci_mdio.c
> index 98a1c97fb95e..8ac72831af05 100644
> --- a/drivers/net/ethernet/ti/davinci_mdio.c
> +++ b/drivers/net/ethernet/ti/davinci_mdio.c
> @@ -429,12 +429,10 @@ static int davinci_mdio_probe(struct
> platform_device *pdev)
>  	 * defined to support backward compatibility with DTs which assume that
>  	 * Davinci MDIO will always scan the bus for PHYs detection.
>  	 */
> -	if (dev->of_node && of_get_child_count(dev->of_node)) {
> +	if (dev->of_node && of_get_child_count(dev->of_node))
>  		data->skip_scan = true;
> -		ret = of_mdiobus_register(data->bus, dev->of_node);
> -	} else {
> -		ret = mdiobus_register(data->bus);
> -	}
> +
> +	ret = of_mdiobus_register(data->bus, dev->of_node);
>  	if (ret)
>  		goto bail_out;
> 
> diff --git a/drivers/net/phy/mdio-gpio.c b/drivers/net/phy/mdio-gpio.c index
> b501221819e1..4e4c8daf44c3 100644
> --- a/drivers/net/phy/mdio-gpio.c
> +++ b/drivers/net/phy/mdio-gpio.c
> @@ -179,11 +179,7 @@ static int mdio_gpio_probe(struct platform_device
> *pdev)
>  	if (!new_bus)
>  		return -ENODEV;
> 
> -	if (pdev->dev.of_node)
> -		ret = of_mdiobus_register(new_bus, pdev->dev.of_node);
> -	else
> -		ret = mdiobus_register(new_bus);
> -
> +	ret = of_mdiobus_register(new_bus, pdev->dev.of_node);
>  	if (ret)
>  		mdio_gpio_bus_deinit(&pdev->dev);
> 
> diff --git a/drivers/net/phy/mdio-mscc-miim.c
> b/drivers/net/phy/mdio-mscc-miim.c
> index 8c689ccfdbca..badbc99bedd3 100644
> --- a/drivers/net/phy/mdio-mscc-miim.c
> +++ b/drivers/net/phy/mdio-mscc-miim.c
> @@ -151,11 +151,7 @@ static int mscc_miim_probe(struct platform_device
> *pdev)
>  		}
>  	}
> 
> -	if (pdev->dev.of_node)
> -		ret = of_mdiobus_register(bus, pdev->dev.of_node);
> -	else
> -		ret = mdiobus_register(bus);
> -
> +	ret = of_mdiobus_register(bus, pdev->dev.of_node);
>  	if (ret < 0) {
>  		dev_err(&pdev->dev, "Cannot register MDIO bus (%d)\n", ret);
>  		return ret;
> diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c index
> 91761436709a..8dff87ec6d99 100644
> --- a/drivers/net/usb/lan78xx.c
> +++ b/drivers/net/usb/lan78xx.c
> @@ -1843,12 +1843,9 @@ static int lan78xx_mdio_init(struct lan78xx_net
> *dev)
>  	}
> 
>  	node = of_get_child_by_name(dev->udev->dev.of_node, "mdio");
> -	if (node) {
> -		ret = of_mdiobus_register(dev->mdiobus, node);
> +	ret = of_mdiobus_register(dev->mdiobus, node);
> +	if (node)
>  		of_node_put(node);
> -	} else {
> -		ret = mdiobus_register(dev->mdiobus);
> -	}
>  	if (ret) {
>  		netdev_err(dev->net, "can't register MDIO bus\n");
>  		goto exit1;
> --
> 2.14.1


^ permalink raw reply

* Re: [PATCH] net: qcom/emac: Encapsulate sgmii ops under one structure
From: Timur Tabi @ 2018-05-16  2:32 UTC (permalink / raw)
  To: Hemanth Puranik, netdev, linux-kernel
In-Reply-To: <1526433053-16308-1-git-send-email-hpuranik@codeaurora.org>

On 5/15/18 8:10 PM, Hemanth Puranik wrote:
> This patch introduces ops structure for sgmii, This by ensures that
> we do not need dummy functions in case of emulation platforms.
> 
> Signed-off-by: Hemanth Puranik<hpuranik@codeaurora.org>

Acked-by: Timur Tabi <timur@codeaurora.org>

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply

* [PATCH v5 0/3] multi-threading device shutdown
From: Pavel Tatashin @ 2018-05-16  2:40 UTC (permalink / raw)
  To: pasha.tatashin, steven.sistare, daniel.m.jordan, linux-kernel,
	jeffrey.t.kirsher, intel-wired-lan, netdev, gregkh,
	alexander.duyck, tobin, andy.shevchenko

Changelog
v4 - v5
	- Addressed comments from Andy Shevchenko and Greg
	  Kroah-Hartman
	- Split the patch into a series of 3 patches in order to
	  provide a better bisecting, and facilitate with reviewing.
v3 - v4
	- Added device_shutdown_serial kernel parameter to disable
	  multi-threading as suggested by Greg Kroah-Hartman

v2 - v3
	- Fixed warning from kbuild test.
	- Moved device_lock/device_unlock inside device_shutdown_tree().

v1 - v2
	- It turns out we cannot lock more than MAX_LOCK_DEPTH by a single
	  thread. (By default this value is 48), and is used to detect
	  deadlocks. So, I re-wrote the code to only lock one devices per
	  thread instead of pre-locking all devices by the main thread.
	- Addressed comments from Tobin C. Harding.
	- As suggested by Alexander Duyck removed ixgbe changes. It can be
	  done as a separate work scaling RTNL mutex.

Do a faster shutdown by calling dev->*->shutdown(dev) in parallel.
device_shutdown() calls these functions for every single device but
only using one thread.

Since, nothing else is running on the machine by the time
device_shutdown() is called, there is no reason not to utilize all the
available CPU resources.

Pavel Tatashin (3):
  drivers core: refactor device_shutdown
  drivers core: prepare device_shutdown for multi-threading
  drivers core: multi-threading device shutdown

 drivers/base/core.c | 290 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 243 insertions(+), 47 deletions(-)

-- 
2.17.0

^ permalink raw reply

* [PATCH v5 1/3] drivers core: refactor device_shutdown
From: Pavel Tatashin @ 2018-05-16  2:40 UTC (permalink / raw)
  To: pasha.tatashin, steven.sistare, daniel.m.jordan, linux-kernel,
	jeffrey.t.kirsher, intel-wired-lan, netdev, gregkh,
	alexander.duyck, tobin, andy.shevchenko
In-Reply-To: <20180516024004.28977-1-pasha.tatashin@oracle.com>

device_shutdown() traverses through the list of devices, and calls
dev->{bug/driver}->shutdown() for each entry in the list.

Refactor the function by keeping device_shutdown() to do the logic of
traversing the list of devices, and device_shutdown_one() to perform the
actual shutdown operation on one device.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 drivers/base/core.c | 50 +++++++++++++++++++++++++++------------------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index b610816eb887..ed189f6d1a2f 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2765,6 +2765,35 @@ int device_move(struct device *dev, struct device *new_parent,
 }
 EXPORT_SYMBOL_GPL(device_move);
 
+/*
+ * device_shutdown_one - call ->shutdown() for the device passed as
+ * argument.
+ */
+static void device_shutdown_one(struct device *dev)
+{
+	/* Don't allow any more runtime suspends */
+	pm_runtime_get_noresume(dev);
+	pm_runtime_barrier(dev);
+
+	if (dev->class && dev->class->shutdown_pre) {
+		if (initcall_debug)
+			dev_info(dev, "shutdown_pre\n");
+		dev->class->shutdown_pre(dev);
+	}
+	if (dev->bus && dev->bus->shutdown) {
+		if (initcall_debug)
+			dev_info(dev, "shutdown\n");
+		dev->bus->shutdown(dev);
+	} else if (dev->driver && dev->driver->shutdown) {
+		if (initcall_debug)
+			dev_info(dev, "shutdown\n");
+		dev->driver->shutdown(dev);
+	}
+
+	/* decrement the reference counter */
+	put_device(dev);
+}
+
 /**
  * device_shutdown - call ->shutdown() on each device to shutdown.
  */
@@ -2801,30 +2830,11 @@ void device_shutdown(void)
 			device_lock(parent);
 		device_lock(dev);
 
-		/* Don't allow any more runtime suspends */
-		pm_runtime_get_noresume(dev);
-		pm_runtime_barrier(dev);
-
-		if (dev->class && dev->class->shutdown_pre) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown_pre\n");
-			dev->class->shutdown_pre(dev);
-		}
-		if (dev->bus && dev->bus->shutdown) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown\n");
-			dev->bus->shutdown(dev);
-		} else if (dev->driver && dev->driver->shutdown) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown\n");
-			dev->driver->shutdown(dev);
-		}
-
+		device_shutdown_one(dev);
 		device_unlock(dev);
 		if (parent)
 			device_unlock(parent);
 
-		put_device(dev);
 		put_device(parent);
 
 		spin_lock(&devices_kset->list_lock);
-- 
2.17.0

^ permalink raw reply related

* [PATCH v5 2/3] drivers core: prepare device_shutdown for multi-threading
From: Pavel Tatashin @ 2018-05-16  2:40 UTC (permalink / raw)
  To: pasha.tatashin, steven.sistare, daniel.m.jordan, linux-kernel,
	jeffrey.t.kirsher, intel-wired-lan, netdev, gregkh,
	alexander.duyck, tobin, andy.shevchenko
In-Reply-To: <20180516024004.28977-1-pasha.tatashin@oracle.com>

Do all the necessary refactoring to prepare device_shutdown() logic to
be multi-threaded.

Which includes:
1. Change the direction of traversing the list instead of going backward,
we now go forward.
2. Children are shutdown recursively for each root device from bottom-up.
3. Functions that can be multi-threaded have _task() in their name.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 drivers/base/core.c | 178 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 149 insertions(+), 29 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index ed189f6d1a2f..210b619931bc 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2102,6 +2102,59 @@ const char *device_get_devnode(struct device *dev,
 	return *tmp = s;
 }
 
+/*
+ * device_children_count - device children count
+ * @parent: parent struct device.
+ *
+ * Returns number of children for this device or 0 if none.
+ */
+static int device_children_count(struct device *parent)
+{
+	struct klist_iter i;
+	int children = 0;
+
+	if (!parent->p)
+		return 0;
+
+	klist_iter_init(&parent->p->klist_children, &i);
+	while (next_device(&i))
+		children++;
+	klist_iter_exit(&i);
+
+	return children;
+}
+
+/*
+ * device_get_child_by_index - Return child using the provided index.
+ * @parent: parent struct device.
+ * @index:  Index of the child, where 0 is the first child in the children list,
+ * and so on.
+ *
+ * Returns child or NULL if child with this index is not present.
+ */
+static struct device *
+device_get_child_by_index(struct device *parent, int index)
+{
+	struct klist_iter i;
+	struct device *dev = NULL, *d;
+	int child_index = 0;
+
+	if (!parent->p || index < 0)
+		return NULL;
+
+	klist_iter_init(&parent->p->klist_children, &i);
+	while ((d = next_device(&i))) {
+		if (child_index == index) {
+			dev = d;
+			break;
+		}
+		child_index++;
+	}
+	klist_iter_exit(&i);
+
+	return dev;
+}
+
 /**
  * device_for_each_child - device child iterator.
  * @parent: parent struct device.
@@ -2794,50 +2847,117 @@ static void device_shutdown_one(struct device *dev)
 	put_device(dev);
 }
 
+/*
+ * Passed as an argument to device_shutdown_child_task().
+ * child_next_index	the next available child index.
+ * parent		Parent device.
+ */
+struct device_shutdown_task_data {
+	atomic_t		child_next_index;
+	struct device		*parent;
+};
+
+static int device_shutdown_child_task(void *data);
+
+/*
+ * Shutdown device tree with root started in dev. If dev has no children
+ * simply shutdown only this device. If dev has children recursively shutdown
+ * children first, and only then the parent.
+ */
+static void device_shutdown_tree(struct device *dev)
+{
+	int children_count;
+
+	device_lock(dev);
+	children_count = device_children_count(dev);
+
+	if (children_count) {
+		struct device_shutdown_task_data tdata;
+		int i;
+
+		atomic_set(&tdata.child_next_index, 0);
+		tdata.parent = dev;
+
+		for (i = 0; i < children_count; i++) {
+			device_shutdown_child_task(&tdata);
+		}
+	}
+	device_shutdown_one(dev);
+	device_unlock(dev);
+}
+
+/*
+ * Only devices with parent are going through this function. The parent is
+ * locked and waits for all of its children to finish shutting down before
+ * calling shutdown function for itself.
+ */
+static int device_shutdown_child_task(void *data)
+{
+	struct device_shutdown_task_data *tdata = data;
+	int cidx = atomic_inc_return(&tdata->child_next_index) - 1;
+	struct device *dev = device_get_child_by_index(tdata->parent, cidx);
+
+	/* ref. counter is going to be decremented in device_shutdown_one() */
+	get_device(dev);
+	device_shutdown_tree(dev);
+	return 0;
+}
+
+/*
+ * On shutdown each root device (the one that does not have a parent) goes
+ * through this function.
+ */
+static int device_shutdown_root_task(void *data)
+{
+	struct device *dev = (struct device *)data;
+
+	device_shutdown_tree(dev);
+
+	return 0;
+}
+
 /**
  * device_shutdown - call ->shutdown() on each device to shutdown.
  */
 void device_shutdown(void)
 {
-	struct device *dev, *parent;
+	struct device *dev;
 
-	spin_lock(&devices_kset->list_lock);
-	/*
-	 * Walk the devices list backward, shutting down each in turn.
-	 * Beware that device unplug events may also start pulling
-	 * devices offline, even as the system is shutting down.
+	/* Shutdown the root devices. The children are going to be
+	 * shutdown first in device_shutdown_tree().
 	 */
+	spin_lock(&devices_kset->list_lock);
 	while (!list_empty(&devices_kset->list)) {
-		dev = list_entry(devices_kset->list.prev, struct device,
-				kobj.entry);
+		dev = list_entry(devices_kset->list.next, struct device,
+				 kobj.entry);
 
-		/*
-		 * hold reference count of device's parent to
-		 * prevent it from being freed because parent's
-		 * lock is to be held
-		 */
-		parent = get_device(dev->parent);
-		get_device(dev);
 		/*
 		 * Make sure the device is off the kset list, in the
 		 * event that dev->*->shutdown() doesn't remove it.
 		 */
 		list_del_init(&dev->kobj.entry);
-		spin_unlock(&devices_kset->list_lock);
 
-		/* hold lock to avoid race with probe/release */
-		if (parent)
-			device_lock(parent);
-		device_lock(dev);
-
-		device_shutdown_one(dev);
-		device_unlock(dev);
-		if (parent)
-			device_unlock(parent);
-
-		put_device(parent);
-
-		spin_lock(&devices_kset->list_lock);
+		/* Here we start tasks for root devices only */
+		if (!dev->parent) {
+			/* Prevents devices from being freed. The counter is
+			 * going to be decremented in device_shutdown_one() once
+			 * this root device is shutdown.
+			 */
+			get_device(dev);
+
+			/* We unlock list for performance reasons,
+			 * dev->*->shutdown(), may try to take this lock to
+			 * remove us from kset list. To avoid unlocking this
+			 * list we could replace spin lock in:
+			 * dev->kobj.kset->list_lock with a dummy one once
+			 * device is locked in device_shutdown_root_task() and
+			 * in device_shutdown_child_task().
+			 */
+			spin_unlock(&devices_kset->list_lock);
+
+			device_shutdown_root_task(dev);
+			spin_lock(&devices_kset->list_lock);
+		}
 	}
 	spin_unlock(&devices_kset->list_lock);
 }
-- 
2.17.0

^ permalink raw reply related

* [PATCH v5 3/3] drivers core: multi-threading device shutdown
From: Pavel Tatashin @ 2018-05-16  2:40 UTC (permalink / raw)
  To: pasha.tatashin, steven.sistare, daniel.m.jordan, linux-kernel,
	jeffrey.t.kirsher, intel-wired-lan, netdev, gregkh,
	alexander.duyck, tobin, andy.shevchenko
In-Reply-To: <20180516024004.28977-1-pasha.tatashin@oracle.com>

When system is rebooted, halted or kexeced device_shutdown() is
called.

This function shuts down every single device by calling either:

	dev->bus->shutdown(dev)
	dev->driver->shutdown(dev)

Even on a machine with just a moderate amount of devices, device_shutdown()
may take multiple seconds to complete. This is because many devices require
a specific delays to perform this operation.

Here is a sample analysis of time it takes to call device_shutdown() on a
two socket Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz machine.

device_shutdown		2.95s
-----------------------------
 mlx4_shutdown		1.14s
 megasas_shutdown	0.24s
 ixgbe_shutdown		0.37s x 4 (four ixgbe devices on this machine).
 the rest		0.09s

In mlx4 we spent the most time, but that is because there is a 1 second
sleep, which is defined by hardware specifications:
mlx4_shutdown
 mlx4_unload_one
  mlx4_free_ownership
   msleep(1000)

With megasas we spend a quarter of a second, but sometimes longer (up-to
0.5s) in this path:

    megasas_shutdown
      megasas_flush_cache
        megasas_issue_blocked_cmd
          wait_event_timeout

Finally, with ixgbe_shutdown() it takes 0.37 for each device, but that time
is spread all over the place, with bigger offenders:

    ixgbe_shutdown
      __ixgbe_shutdown
        ixgbe_close_suspend
          ixgbe_down
            ixgbe_init_hw_generic
              ixgbe_reset_hw_X540
                msleep(100);                        0.104483472
                ixgbe_get_san_mac_addr_generic      0.048414851
                ixgbe_get_wwn_prefix_generic        0.048409893
              ixgbe_start_hw_X540
                ixgbe_start_hw_generic
                  ixgbe_clear_hw_cntrs_generic      0.048581502
                  ixgbe_setup_fc_generic            0.024225800

    All the ixgbe_*generic functions end-up calling:
    ixgbe_read_eerd_X540()
      ixgbe_acquire_swfw_sync_X540
        usleep_range(5000, 6000);
      ixgbe_release_swfw_sync_X540
        usleep_range(5000, 6000);

While these are short sleeps, they end-up calling them over 24 times!
24 * 0.0055s = 0.132s. Adding-up to 0.528s for four devices. Also we have
four msleep(100). Totaling to:  0.928s

While we should keep optimizing the individual device drivers, in some
cases this is simply a hardware property that forces a specific delay, and
we must wait.

So, the solution for this problem is to shutdown devices in parallel.
However, we must shutdown children before shutting down parents, so parent
device must wait for its children to finish.

With this patch, on the same machine devices_shutdown() takes 1.142s, and
without mlx4 one second delay only 0.38s

This feature can be optionally disabled via kernel parameter:
device_shutdown_serial. When booted with this parameter, device_shutdown()
will shutdown devices one by one.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 drivers/base/core.c | 70 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 68 insertions(+), 2 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 210b619931bc..032a1922bcb7 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -22,6 +22,7 @@
 #include <linux/genhd.h>
 #include <linux/mutex.h>
 #include <linux/pm_runtime.h>
+#include <linux/kthread.h>
 #include <linux/netdevice.h>
 #include <linux/sched/signal.h>
 #include <linux/sysfs.h>
@@ -2850,19 +2851,39 @@ static void device_shutdown_one(struct device *dev)
 /*
  * Passed as an argument to device_shutdown_child_task().
  * child_next_index	the next available child index.
+ * tasks_running	number of tasks still running. Each tasks decrements it
+ *			when job is finished and the last task signals that the
+ *			job is complete.
+ * complete		Used to signal job competition.
  * parent		Parent device.
  */
 struct device_shutdown_task_data {
 	atomic_t		child_next_index;
+	atomic_t		tasks_running;
+	struct completion	complete;
 	struct device		*parent;
 };
 
 static int device_shutdown_child_task(void *data);
+static bool device_shutdown_serial;
+
+/*
+ * These globals are used by tasks that are started for root devices.
+ * device_root_tasks_finished	Number of root devices finished shutting down.
+ * device_root_tasks_started	Total number of root devices tasks started.
+ * device_root_tasks_done	The completion signal to the main thread.
+ */
+static atomic_t			device_root_tasks_finished;
+static atomic_t			device_root_tasks_started;
+static struct completion	device_root_tasks_done;
 
 /*
  * Shutdown device tree with root started in dev. If dev has no children
  * simply shutdown only this device. If dev has children recursively shutdown
  * children first, and only then the parent.
+ * For performance reasons children are shutdown in parallel using kernel
+ * threads. because we lock dev its children cannot be removed while this
+ * functions is running.
  */
 static void device_shutdown_tree(struct device *dev)
 {
@@ -2876,11 +2897,20 @@ static void device_shutdown_tree(struct device *dev)
 		int i;
 
 		atomic_set(&tdata.child_next_index, 0);
+		atomic_set(&tdata.tasks_running, children_count);
+		init_completion(&tdata.complete);
 		tdata.parent = dev;
 
 		for (i = 0; i < children_count; i++) {
-			device_shutdown_child_task(&tdata);
+			if (device_shutdown_serial) {
+				device_shutdown_child_task(&tdata);
+			} else {
+				kthread_run(device_shutdown_child_task,
+					    &tdata, "device_shutdown.%s",
+					    dev_name(dev));
+			}
 		}
+		wait_for_completion(&tdata.complete);
 	}
 	device_shutdown_one(dev);
 	device_unlock(dev);
@@ -2900,6 +2930,10 @@ static int device_shutdown_child_task(void *data)
 	/* ref. counter is going to be decremented in device_shutdown_one() */
 	get_device(dev);
 	device_shutdown_tree(dev);
+
+	/* If we are the last to exit, signal the completion */
+	if (atomic_dec_return(&tdata->tasks_running) == 0)
+		complete(&tdata->complete);
 	return 0;
 }
 
@@ -2910,9 +2944,14 @@ static int device_shutdown_child_task(void *data)
 static int device_shutdown_root_task(void *data)
 {
 	struct device *dev = (struct device *)data;
+	int root_devices;
 
 	device_shutdown_tree(dev);
 
+	/* If we are the last to exit, signal the completion */
+	root_devices = atomic_inc_return(&device_root_tasks_finished);
+	if (root_devices == atomic_read(&device_root_tasks_started))
+		complete(&device_root_tasks_done);
 	return 0;
 }
 
@@ -2921,10 +2960,17 @@ static int device_shutdown_root_task(void *data)
  */
 void device_shutdown(void)
 {
+	int root_devices = 0;
 	struct device *dev;
 
+	atomic_set(&device_root_tasks_finished, 0);
+	atomic_set(&device_root_tasks_started, 0);
+	init_completion(&device_root_tasks_done);
+
 	/* Shutdown the root devices. The children are going to be
 	 * shutdown first in device_shutdown_tree().
+	 * We shutdown root devices in parallel by starting thread
+	 * for each root device.
 	 */
 	spin_lock(&devices_kset->list_lock);
 	while (!list_empty(&devices_kset->list)) {
@@ -2955,13 +3001,33 @@ void device_shutdown(void)
 			 */
 			spin_unlock(&devices_kset->list_lock);
 
-			device_shutdown_root_task(dev);
+			root_devices++;
+			if (device_shutdown_serial) {
+				device_shutdown_root_task(dev);
+			} else {
+				kthread_run(device_shutdown_root_task,
+					    dev, "device_root_shutdown.%s",
+					    dev_name(dev));
+			}
 			spin_lock(&devices_kset->list_lock);
 		}
 	}
 	spin_unlock(&devices_kset->list_lock);
+
+	/* Set number of root tasks started, and waits for completion */
+	atomic_set(&device_root_tasks_started, root_devices);
+	if (root_devices != atomic_read(&device_root_tasks_finished))
+		wait_for_completion(&device_root_tasks_done);
+}
+
+static int __init _device_shutdown_serial(char *arg)
+{
+	device_shutdown_serial = true;
+	return 0;
 }
 
+early_param("device_shutdown_serial", _device_shutdown_serial);
+
 /*
  * Device logging functions
  */
-- 
2.17.0

^ permalink raw reply related

* RE: Hangs in r8152 connected to power management in kernels at least up v4.17-rc4
From: Hayes Wang @ 2018-05-16  3:37 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: netdev@vger.kernel.org
In-Reply-To: <1526388188.2724.5.camel@suse.com>

Oliver Neukum [mailto:oneukum@suse.com]
> 
> Hi,
> 
> I got reports about hangs with this trace:
> 
> May 13 01:36:55 neroon kernel: INFO: task kworker/0:0:4 blocked for more
> than 60 seconds.
> May 13 01:36:55 neroon kernel:       Tainted: G     U
> 4.17.0-rc4-1.g8257a00-vanilla #1
> May 13 01:36:55 neroon kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 13 01:36:55 neroon kernel: kworker/0:0     D    0     4      2
> 0x80000000
> May 13 01:36:55 neroon kernel: Workqueue: events rtl_work_func_t [r8152]
> May 13 01:36:55 neroon kernel: Call Trace:
> May 13 01:36:55 neroon kernel:  ? __schedule+0x289/0x880
> May 13 01:36:55 neroon kernel:  schedule+0x2f/0x90
> May 13 01:36:55 neroon kernel:  rpm_resume+0xf9/0x7a0
> May 13 01:36:55 neroon kernel:  ? wait_woken+0x80/0x80
> May 13 01:36:55 neroon kernel:  rpm_resume+0x547/0x7a0
> May 13 01:36:55 neroon kernel:  ? __switch_to_asm+0x40/0x70
> May 13 01:36:55 neroon kernel:  ? __switch_to_asm+0x34/0x70
> May 13 01:36:55 neroon kernel:  ? __switch_to_asm+0x40/0x70
> May 13 01:36:55 neroon kernel:  ? __switch_to_asm+0x34/0x70
> May 13 01:36:55 neroon kernel:  ? __switch_to_asm+0x40/0x70
> May 13 01:36:55 neroon kernel:  __pm_runtime_resume+0x3a/0x50
> May 13 01:36:55 neroon kernel:  usb_autopm_get_interface+0x1d/0x50 [usbcore]

Would usb_autopm_get_interface() take a long time?
The driver would wake the device if it has suspended.
I have no idea about how usb_autopm_get_interface() works, so I don't know how to help.

> May 13 01:36:55 neroon kernel:  rtl_work_func_t+0x3c/0x27c [r8152]
> May 13 01:36:55 neroon kernel:  process_one_work+0x1d4/0x3f0
> May 13 01:36:55 neroon kernel:  worker_thread+0x2b/0x3d0
> May 13 01:36:55 neroon kernel:  ? process_one_work+0x3f0/0x3f0
> May 13 01:36:55 neroon kernel:  kthread+0x113/0x130
> May 13 01:36:55 neroon kernel:  ?
> kthread_create_worker_on_cpu+0x50/0x50
> May 13 01:36:55 neroon kernel:  ret_from_fork+0x3a/0x50

Best Regards,
Hayes



^ permalink raw reply

* [PATCH] ethernet: stmmac: dwmac-rk: Add GMAC support for px30
From: David Wu @ 2018-05-16  3:38 UTC (permalink / raw)
  To: davem, heiko, robh+dt
  Cc: mark.rutland, huangtao, netdev, linux-arm-kernel, linux-rockchip,
	linux-kernel, David Wu

Add constants and callback functions for the dwmac on px30 soc.
The base structure is the same, but registers and the bits in
them moved slightly, and add the clk_mac_speed for the select
of mac speed.

Signed-off-by: David Wu <david.wu@rock-chips.com>
---
 .../devicetree/bindings/net/rockchip-dwmac.txt     |  1 +
 drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c     | 64 ++++++++++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/rockchip-dwmac.txt b/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
index 9c16ee2..3b71da7 100644
--- a/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
+++ b/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
@@ -4,6 +4,7 @@ The device node has following properties.
 
 Required properties:
  - compatible: should be "rockchip,<name>-gamc"
+   "rockchip,px30-gmac":   found on PX30 SoCs
    "rockchip,rk3128-gmac": found on RK312x SoCs
    "rockchip,rk3228-gmac": found on RK322x SoCs
    "rockchip,rk3288-gmac": found on RK3288 SoCs
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
index 13133b3..4b2ab71 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
@@ -61,6 +61,7 @@ struct rk_priv_data {
 	struct clk *mac_clk_tx;
 	struct clk *clk_mac_ref;
 	struct clk *clk_mac_refout;
+	struct clk *clk_mac_speed;
 	struct clk *aclk_mac;
 	struct clk *pclk_mac;
 	struct clk *clk_phy;
@@ -83,6 +84,64 @@ struct rk_priv_data {
 	(((tx) ? soc##_GMAC_TXCLK_DLY_ENABLE : soc##_GMAC_TXCLK_DLY_DISABLE) | \
 	 ((rx) ? soc##_GMAC_RXCLK_DLY_ENABLE : soc##_GMAC_RXCLK_DLY_DISABLE))
 
+#define PX30_GRF_GMAC_CON1		0X0904
+
+/* PX30_GRF_GMAC_CON1 */
+#define PX30_GMAC_PHY_INTF_SEL_RMII	(GRF_CLR_BIT(4) | GRF_CLR_BIT(5) | \
+					GRF_BIT(6))
+#define PX30_GMAC_SPEED_10M		GRF_CLR_BIT(2)
+#define PX30_GMAC_SPEED_100M		GRF_BIT(2)
+
+static void px30_set_to_rmii(struct rk_priv_data *bsp_priv)
+{
+	struct device *dev = &bsp_priv->pdev->dev;
+
+	if (IS_ERR(bsp_priv->grf)) {
+		dev_err(dev, "%s: Missing rockchip,grf property\n", __func__);
+		return;
+	}
+
+	regmap_write(bsp_priv->grf, PX30_GRF_GMAC_CON1,
+		     PX30_GMAC_PHY_INTF_SEL_RMII);
+}
+
+static void px30_set_rmii_speed(struct rk_priv_data *bsp_priv, int speed)
+{
+	struct device *dev = &bsp_priv->pdev->dev;
+	int ret;
+
+	if (IS_ERR(bsp_priv->clk_mac_speed)) {
+		dev_err(dev, "%s: Missing clk_mac_speed clock\n", __func__);
+		return;
+	}
+
+	if (speed == 10) {
+		regmap_write(bsp_priv->grf, PX30_GRF_GMAC_CON1,
+			     PX30_GMAC_SPEED_10M);
+
+		ret = clk_set_rate(bsp_priv->clk_mac_speed, 2500000);
+		if (ret)
+			dev_err(dev, "%s: set clk_mac_speed rate 2500000 failed: %d\n",
+				__func__, ret);
+	} else if (speed == 100) {
+		regmap_write(bsp_priv->grf, PX30_GRF_GMAC_CON1,
+			     PX30_GMAC_SPEED_100M);
+
+		ret = clk_set_rate(bsp_priv->clk_mac_speed, 25000000);
+		if (ret)
+			dev_err(dev, "%s: set clk_mac_speed rate 25000000 failed: %d\n",
+				__func__, ret);
+
+	} else {
+		dev_err(dev, "unknown speed value for RMII! speed=%d", speed);
+	}
+}
+
+static const struct rk_gmac_ops px30_ops = {
+	.set_to_rmii = px30_set_to_rmii,
+	.set_rmii_speed = px30_set_rmii_speed,
+};
+
 #define RK3128_GRF_MAC_CON0	0x0168
 #define RK3128_GRF_MAC_CON1	0x016c
 
@@ -1042,6 +1101,10 @@ static int rk_gmac_clk_init(struct plat_stmmacenet_data *plat)
 		}
 	}
 
+	bsp_priv->clk_mac_speed = devm_clk_get(dev, "clk_mac_speed");
+	if (IS_ERR(bsp_priv->clk_mac_speed))
+		dev_err(dev, "cannot get clock %s\n", "clk_mac_speed");
+
 	if (bsp_priv->clock_input) {
 		dev_info(dev, "clock input from PHY\n");
 	} else {
@@ -1424,6 +1487,7 @@ static int rk_gmac_resume(struct device *dev)
 static SIMPLE_DEV_PM_OPS(rk_gmac_pm_ops, rk_gmac_suspend, rk_gmac_resume);
 
 static const struct of_device_id rk_gmac_dwmac_match[] = {
+	{ .compatible = "rockchip,px30-gmac",	.data = &px30_ops   },
 	{ .compatible = "rockchip,rk3128-gmac", .data = &rk3128_ops },
 	{ .compatible = "rockchip,rk3228-gmac", .data = &rk3228_ops },
 	{ .compatible = "rockchip,rk3288-gmac", .data = &rk3288_ops },
-- 
2.7.4

^ permalink raw reply related

* xdp and fragments with virtio
From: David Ahern @ 2018-05-16  3:51 UTC (permalink / raw)
  To: Jason Wang, netdev@vger.kernel.org

Hi Jason:

I am trying to test MTU changes to the BPF fib_lookup helper and seeing
something odd. Hoping you can help.

I have a VM with multiple virtio based NICs and tap backends. I install
the xdp program on eth1 and eth2 to do forwarding. In the host I send a
large packet to eth1:

$ ping -s 1500 9.9.9.9

The tap device in the host sees 2 packets:

$ sudo tcpdump -nv -i vm02-eth1
20:44:33.943160 IP (tos 0x0, ttl 64, id 58746, offset 0, flags [+],
proto ICMP (1), length 1500)
    10.100.1.254 > 9.9.9.9: ICMP echo request, id 17917, seq 1, length 1480
20:44:33.943172 IP (tos 0x0, ttl 64, id 58746, offset 1480, flags
[none], proto ICMP (1), length 48)
    10.100.1.254 > 9.9.9.9: ip-proto-1

In the VM, the XDP program only sees the first packet, not the fragment.
I added a printk to the program (see diff below):

$ cat trace_pipe
          <idle>-0     [003] ..s2   254.436467: 0: packet length 1514

Anything come to mind in the virtio xdp implementation that affects
fragment packets? I see this with both IPv4 and v6.

Thanks,
David

[1] xdp program diff showing printk that dumps packet length:

diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
index 4a6be0f87505..f119b506e782 100644
--- a/samples/bpf/xdp_fwd_kern.c
+++ b/samples/bpf/xdp_fwd_kern.c
@@ -52,6 +52,11 @@ static __always_inline int xdp_fwd_flags(struct
xdp_md *ctx, u32 flags)
        u16 h_proto;
        u64 nh_off;

+       {
+               char fmt[] = "packet length %u\n";
+
+               bpf_trace_printk(fmt, sizeof(fmt), ctx->data_end-ctx->data);
+       }
        nh_off = sizeof(*eth);
        if (data + nh_off > data_end)
                return XDP_DROP;

^ permalink raw reply related

* [PATCH net-next v3 0/3] fib rule selftest
From: Roopa Prabhu @ 2018-05-16  3:55 UTC (permalink / raw)
  To: davem; +Cc: netdev, dsa, nikolay, idosch

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This series adds a new test to test fib rules.
ip route get is used to test fib rule matches.
This series also extends ip route get to match on
sport and dport to test recent support of sport
and dport fib rule match.

v2 - address ido's commemt to make sport dport
ip route get to work correctly for input route
get. I don't support ip route get on ip-proto match yet.
ip route get creates a udp packet and i have left
it at that. We could extend ip route get to support
a few ip proto matches in followup patches.

v3 - Support ip_proto (tcp, udp and icmp) match in getroute.
dropped printing of new match attrs in ip route get, 
because ipv6 does not print it. And ipv6 currrently shares
the dump api with ipv6 notify and its better to not add them
to the notify api. dropped it to keep the api consistent between
ipv4 and ipv6 (though uid is already printed in the ipv4 case).

I am signing up for fib_rule_get test next :). Will probably need
some fixes around that area for consistency between ipv4 and ipv6.

Roopa Prabhu (3):
  ipv4: support sport, dport and ip_proto in RTM_GETROUTE
  ipv6: support sport, dport and ip_proto in RTM_GETROUTE
  selftests: net: initial fib rule tests

 include/uapi/linux/rtnetlink.h                |   2 +
 net/ipv4/route.c                              | 152 ++++++++++++-----
 net/ipv6/route.c                              |  25 +++
 tools/testing/selftests/net/Makefile          |   2 +-
 tools/testing/selftests/net/fib_rule_tests.sh | 224 ++++++++++++++++++++++++++
 5 files changed, 366 insertions(+), 39 deletions(-)
 create mode 100644 tools/testing/selftests/net/fib_rule_tests.sh

-- 
2.1.4

^ permalink raw reply

* [PATCH net-next v3 1/3] ipv4: support sport, dport and ip_proto in RTM_GETROUTE
From: Roopa Prabhu @ 2018-05-16  3:55 UTC (permalink / raw)
  To: davem; +Cc: netdev, dsa, nikolay, idosch
In-Reply-To: <1526442908-13183-1-git-send-email-roopa@cumulusnetworks.com>

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This is a followup to fib rules sport, dport and ipproto
match support. Only supports tcp, udp and icmp for ipproto.
Used by fib rule self tests. Before this patch getroute used
same skb to pass through the route lookup and for the netlink
getroute reply msg. This patch allocates separate skb's to keep
flow dissector happy.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 include/net/ip.h               |   3 +
 include/uapi/linux/rtnetlink.h |   3 +
 net/ipv4/Makefile              |   2 +-
 net/ipv4/fib_frontend.c        |   4 +
 net/ipv4/netlink.c             |  23 +++++
 net/ipv4/route.c               | 190 ++++++++++++++++++++++++++++++-----------
 6 files changed, 173 insertions(+), 52 deletions(-)
 create mode 100644 net/ipv4/netlink.c

diff --git a/include/net/ip.h b/include/net/ip.h
index bada1f1..0d2281b 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -664,4 +664,7 @@ extern int sysctl_icmp_msgs_burst;
 int ip_misc_proc_init(void);
 #endif
 
+int rtm_getroute_parse_ip_proto(struct nlattr *attr, u8 *ip_proto,
+				struct netlink_ext_ack *extack);
+
 #endif	/* _IP_H */
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 9b15005..cabb210 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -327,6 +327,9 @@ enum rtattr_type_t {
 	RTA_PAD,
 	RTA_UID,
 	RTA_TTL_PROPAGATE,
+	RTA_IP_PROTO,
+	RTA_SPORT,
+	RTA_DPORT,
 	__RTA_MAX
 };
 
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index b379520..13f2ba9 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -14,7 +14,7 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o fib_notifier.o \
 	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
-	     metrics.o
+	     metrics.o netlink.o
 
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index f05afaf..d7092c5 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -643,6 +643,10 @@ const struct nla_policy rtm_ipv4_policy[RTA_MAX + 1] = {
 	[RTA_ENCAP]		= { .type = NLA_NESTED },
 	[RTA_UID]		= { .type = NLA_U32 },
 	[RTA_MARK]		= { .type = NLA_U32 },
+	[RTA_TABLE]		= { .type = NLA_U32 },
+	[RTA_IP_PROTO]		= { .type = NLA_U8 },
+	[RTA_SPORT]		= { .type = NLA_U16 },
+	[RTA_DPORT]		= { .type = NLA_U16 },
 };
 
 static int rtm_to_fib_config(struct net *net, struct sk_buff *skb,
diff --git a/net/ipv4/netlink.c b/net/ipv4/netlink.c
new file mode 100644
index 0000000..f86bb4f
--- /dev/null
+++ b/net/ipv4/netlink.c
@@ -0,0 +1,23 @@
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <linux/types.h>
+#include <net/net_namespace.h>
+#include <net/netlink.h>
+#include <net/ip.h>
+
+int rtm_getroute_parse_ip_proto(struct nlattr *attr, u8 *ip_proto,
+				struct netlink_ext_ack *extack)
+{
+	*ip_proto = nla_get_u8(attr);
+
+	switch (*ip_proto) {
+	case IPPROTO_TCP:
+	case IPPROTO_UDP:
+	case IPPROTO_ICMP:
+		return 0;
+	default:
+		NL_SET_ERR_MSG(extack, "Unsupported ip proto");
+		return -EOPNOTSUPP;
+	}
+}
+EXPORT_SYMBOL_GPL(rtm_getroute_parse_ip_proto);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 29268ef..073c849 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2569,11 +2569,10 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 EXPORT_SYMBOL_GPL(ip_route_output_flow);
 
 /* called with rcu_read_lock held */
-static int rt_fill_info(struct net *net,  __be32 dst, __be32 src, u32 table_id,
-			struct flowi4 *fl4, struct sk_buff *skb, u32 portid,
-			u32 seq)
+static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
+			struct rtable *rt, u32 table_id, struct flowi4 *fl4,
+			struct sk_buff *skb, u32 portid, u32 seq)
 {
-	struct rtable *rt = skb_rtable(skb);
 	struct rtmsg *r;
 	struct nlmsghdr *nlh;
 	unsigned long expires = 0;
@@ -2669,7 +2668,7 @@ static int rt_fill_info(struct net *net,  __be32 dst, __be32 src, u32 table_id,
 			}
 		} else
 #endif
-			if (nla_put_u32(skb, RTA_IIF, skb->dev->ifindex))
+			if (nla_put_u32(skb, RTA_IIF, fl4->flowi4_iif))
 				goto nla_put_failure;
 	}
 
@@ -2684,43 +2683,128 @@ static int rt_fill_info(struct net *net,  __be32 dst, __be32 src, u32 table_id,
 	return -EMSGSIZE;
 }
 
+static int inet_rtm_getroute_reply(struct sk_buff *in_skb, struct nlmsghdr *nlh,
+				   __be32 dst, __be32 src, struct flowi4 *fl4,
+				   struct rtable *rt, struct fib_result *res)
+{
+	struct net *net = sock_net(in_skb->sk);
+	struct rtmsg *rtm = nlmsg_data(nlh);
+	u32 table_id = RT_TABLE_MAIN;
+	struct sk_buff *skb;
+	int err = 0;
+
+	skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
+	if (!skb)
+		return -ENOMEM;
+
+	if (rtm->rtm_flags & RTM_F_LOOKUP_TABLE)
+		table_id = res->table ? res->table->tb_id : 0;
+
+	if (rtm->rtm_flags & RTM_F_FIB_MATCH)
+		err = fib_dump_info(skb, NETLINK_CB(in_skb).portid,
+				    nlh->nlmsg_seq, RTM_NEWROUTE, table_id,
+				    rt->rt_type, res->prefix, res->prefixlen,
+				    fl4->flowi4_tos, res->fi, 0);
+	else
+		err = rt_fill_info(net, dst, src, rt, table_id,
+				   fl4, skb, NETLINK_CB(in_skb).portid,
+				   nlh->nlmsg_seq);
+	if (err < 0)
+		goto errout;
+
+	return rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid);
+
+errout:
+	kfree_skb(skb);
+	return err;
+}
+
+static struct sk_buff *inet_rtm_getroute_build_skb(__be32 src, __be32 dst,
+						   u8 ip_proto, __be16 sport,
+						   __be16 dport)
+{
+	struct sk_buff *skb;
+	struct iphdr *iph;
+
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!skb)
+		return NULL;
+
+	/* Reserve room for dummy headers, this skb can pass
+	 * through good chunk of routing engine.
+	 */
+	skb_reset_mac_header(skb);
+	skb_reset_network_header(skb);
+	skb->protocol = htons(ETH_P_IP);
+	iph = skb_put(skb, sizeof(struct iphdr));
+	iph->protocol = ip_proto;
+	iph->saddr = src;
+	iph->daddr = dst;
+	iph->version = 0x4;
+	iph->frag_off = 0;
+	iph->ihl = 0x5;
+	skb_set_transport_header(skb, skb->len);
+
+	switch (iph->protocol) {
+	case IPPROTO_UDP: {
+		struct udphdr *udph;
+
+		udph = skb_put_zero(skb, sizeof(struct udphdr));
+		udph->source = sport;
+		udph->dest = dport;
+		udph->len = sizeof(struct udphdr);
+		udph->check = 0;
+		break;
+	}
+	case IPPROTO_TCP: {
+		struct tcphdr *tcph;
+
+		tcph = skb_put_zero(skb, sizeof(struct tcphdr));
+		tcph->source	= sport;
+		tcph->dest	= dport;
+		tcph->doff	= sizeof(struct tcphdr) / 4;
+		tcph->rst = 1;
+		tcph->check = ~tcp_v4_check(sizeof(struct tcphdr),
+					    src, dst, 0);
+		break;
+	}
+	case IPPROTO_ICMP: {
+		struct icmphdr *icmph;
+
+		icmph = skb_put_zero(skb, sizeof(struct icmphdr));
+		icmph->type = ICMP_ECHO;
+		icmph->code = 0;
+	}
+	}
+
+	return skb;
+}
+
 static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 			     struct netlink_ext_ack *extack)
 {
 	struct net *net = sock_net(in_skb->sk);
-	struct rtmsg *rtm;
 	struct nlattr *tb[RTA_MAX+1];
+	__be16 sport = 0, dport = 0;
 	struct fib_result res = {};
+	u8 ip_proto = IPPROTO_UDP;
 	struct rtable *rt = NULL;
+	struct sk_buff *skb;
+	struct rtmsg *rtm;
 	struct flowi4 fl4;
 	__be32 dst = 0;
 	__be32 src = 0;
+	kuid_t uid;
 	u32 iif;
 	int err;
 	int mark;
-	struct sk_buff *skb;
-	u32 table_id = RT_TABLE_MAIN;
-	kuid_t uid;
 
 	err = nlmsg_parse(nlh, sizeof(*rtm), tb, RTA_MAX, rtm_ipv4_policy,
 			  extack);
 	if (err < 0)
-		goto errout;
+		return err;
 
 	rtm = nlmsg_data(nlh);
-
-	skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
-	if (!skb) {
-		err = -ENOBUFS;
-		goto errout;
-	}
-
-	/* Reserve room for dummy headers, this skb can pass
-	   through good chunk of routing engine.
-	 */
-	skb_reset_mac_header(skb);
-	skb_reset_network_header(skb);
-
 	src = tb[RTA_SRC] ? nla_get_in_addr(tb[RTA_SRC]) : 0;
 	dst = tb[RTA_DST] ? nla_get_in_addr(tb[RTA_DST]) : 0;
 	iif = tb[RTA_IIF] ? nla_get_u32(tb[RTA_IIF]) : 0;
@@ -2730,14 +2814,23 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 	else
 		uid = (iif ? INVALID_UID : current_uid());
 
-	/* Bugfix: need to give ip_route_input enough of an IP header to
-	 * not gag.
-	 */
-	ip_hdr(skb)->protocol = IPPROTO_UDP;
-	ip_hdr(skb)->saddr = src;
-	ip_hdr(skb)->daddr = dst;
+	if (tb[RTA_IP_PROTO]) {
+		err = rtm_getroute_parse_ip_proto(tb[RTA_IP_PROTO],
+						  &ip_proto, extack);
+		if (err)
+			return err;
+	}
+
+	if (tb[RTA_SPORT])
+		sport = nla_get_be16(tb[RTA_SPORT]);
 
-	skb_reserve(skb, MAX_HEADER + sizeof(struct iphdr));
+	if (tb[RTA_DPORT])
+		dport = nla_get_be16(tb[RTA_DPORT]);
+
+
+	skb = inet_rtm_getroute_build_skb(src, dst, ip_proto, sport, dport);
+	if (!skb)
+		return -ENOBUFS;
 
 	memset(&fl4, 0, sizeof(fl4));
 	fl4.daddr = dst;
@@ -2746,6 +2839,11 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 	fl4.flowi4_oif = tb[RTA_OIF] ? nla_get_u32(tb[RTA_OIF]) : 0;
 	fl4.flowi4_mark = mark;
 	fl4.flowi4_uid = uid;
+	if (sport)
+		fl4.fl4_sport = sport;
+	if (dport)
+		fl4.fl4_dport = dport;
+	fl4.flowi4_proto = ip_proto;
 
 	rcu_read_lock();
 
@@ -2755,10 +2853,10 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 		dev = dev_get_by_index_rcu(net, iif);
 		if (!dev) {
 			err = -ENODEV;
-			goto errout_free;
+			goto errout_rcu;
 		}
 
-		skb->protocol	= htons(ETH_P_IP);
+		fl4.flowi4_iif = iif; /* for rt_fill_info */
 		skb->dev	= dev;
 		skb->mark	= mark;
 		err = ip_route_input_rcu(skb, dst, src, rtm->rtm_tos,
@@ -2778,42 +2876,32 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 	}
 
 	if (err)
-		goto errout_free;
+		goto errout_rcu;
 
 	if (rtm->rtm_flags & RTM_F_NOTIFY)
 		rt->rt_flags |= RTCF_NOTIFY;
 
-	if (rtm->rtm_flags & RTM_F_LOOKUP_TABLE)
-		table_id = res.table ? res.table->tb_id : 0;
-
 	if (rtm->rtm_flags & RTM_F_FIB_MATCH) {
 		if (!res.fi) {
 			err = fib_props[res.type].error;
 			if (!err)
 				err = -EHOSTUNREACH;
-			goto errout_free;
+			goto errout_rcu;
 		}
-		err = fib_dump_info(skb, NETLINK_CB(in_skb).portid,
-				    nlh->nlmsg_seq, RTM_NEWROUTE, table_id,
-				    rt->rt_type, res.prefix, res.prefixlen,
-				    fl4.flowi4_tos, res.fi, 0);
-	} else {
-		err = rt_fill_info(net, dst, src, table_id, &fl4, skb,
-				   NETLINK_CB(in_skb).portid, nlh->nlmsg_seq);
 	}
+
+	err = inet_rtm_getroute_reply(in_skb, nlh, dst, src, &fl4, rt, &res);
 	if (err < 0)
-		goto errout_free;
+		goto errout_rcu;
 
 	rcu_read_unlock();
 
-	err = rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid);
-errout:
-	return err;
-
 errout_free:
-	rcu_read_unlock();
 	kfree_skb(skb);
-	goto errout;
+	return err;
+errout_rcu:
+	rcu_read_unlock();
+	goto errout_free;
 }
 
 void ip_rt_multicast_event(struct in_device *in_dev)
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v3 2/3] ipv6: support sport, dport and ip_proto in RTM_GETROUTE
From: Roopa Prabhu @ 2018-05-16  3:55 UTC (permalink / raw)
  To: davem; +Cc: netdev, dsa, nikolay, idosch
In-Reply-To: <1526442908-13183-1-git-send-email-roopa@cumulusnetworks.com>

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This is a followup to fib6 rules sport, dport and ipproto
match support. Only supports tcp, udp and icmp for ipproto.
Used by fib rule self tests.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 net/ipv6/route.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index af04167..31fb5bc 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -63,6 +63,7 @@
 #include <net/lwtunnel.h>
 #include <net/ip_tunnels.h>
 #include <net/l3mdev.h>
+#include <net/ip.h>
 #include <trace/events/fib6.h>
 
 #include <linux/uaccess.h>
@@ -4073,6 +4074,9 @@ static const struct nla_policy rtm_ipv6_policy[RTA_MAX+1] = {
 	[RTA_UID]		= { .type = NLA_U32 },
 	[RTA_MARK]		= { .type = NLA_U32 },
 	[RTA_TABLE]		= { .type = NLA_U32 },
+	[RTA_IP_PROTO]		= { .type = NLA_U8 },
+	[RTA_SPORT]		= { .type = NLA_U16 },
+	[RTA_DPORT]		= { .type = NLA_U16 },
 };
 
 static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
@@ -4784,6 +4788,19 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 	else
 		fl6.flowi6_uid = iif ? INVALID_UID : current_uid();
 
+	if (tb[RTA_SPORT])
+		fl6.fl6_sport = nla_get_be16(tb[RTA_SPORT]);
+
+	if (tb[RTA_DPORT])
+		fl6.fl6_dport = nla_get_be16(tb[RTA_DPORT]);
+
+	if (tb[RTA_IP_PROTO]) {
+		err = rtm_getroute_parse_ip_proto(tb[RTA_IP_PROTO],
+						  &fl6.flowi6_proto, extack);
+		if (err)
+			goto errout;
+	}
+
 	if (iif) {
 		struct net_device *dev;
 		int flags = 0;
-- 
2.1.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox