Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net 3/6] s390/qeth: unregister netdevice only when registered
From: Julian Wiedmann @ 2018-11-02 18:04 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Stefan Raspl, Ursula Braun, Julian Wiedmann
In-Reply-To: <20181102180413.53020-1-jwi@linux.ibm.com>

qeth only registers its netdevice when the qeth device is first set
online. Thus a device that has never been set online will trigger
a WARN ("network todo 'hsi%d' but state 0") in unregister_netdev() when
removed.

Fix this by protecting the unregister step, just like we already protect
against repeated registering of the netdevice.

Fixes: d3d1b205e89f ("s390/qeth: allocate netdevice early")
Reported-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
---
 drivers/s390/net/qeth_core.h    | 5 +++++
 drivers/s390/net/qeth_l2_main.c | 5 +++--
 drivers/s390/net/qeth_l3_main.c | 5 +++--
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index 884ba9dfb341..b3a0b8838d2f 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -843,6 +843,11 @@ struct qeth_trap_id {
 /*some helper functions*/
 #define QETH_CARD_IFNAME(card) (((card)->dev)? (card)->dev->name : "")
 
+static inline bool qeth_netdev_is_registered(struct net_device *dev)
+{
+	return dev->netdev_ops != NULL;
+}
+
 static inline void qeth_scrub_qdio_buffer(struct qdio_buffer *buf,
 					  unsigned int elements)
 {
diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index 5b67fd1f2b77..2b978eba7e30 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -826,7 +826,8 @@ static void qeth_l2_remove_device(struct ccwgroup_device *cgdev)
 
 	if (cgdev->state == CCWGROUP_ONLINE)
 		qeth_l2_set_offline(cgdev);
-	unregister_netdev(card->dev);
+	if (qeth_netdev_is_registered(card->dev))
+		unregister_netdev(card->dev);
 }
 
 static const struct ethtool_ops qeth_l2_ethtool_ops = {
@@ -866,7 +867,7 @@ static int qeth_l2_setup_netdev(struct qeth_card *card)
 {
 	int rc;
 
-	if (card->dev->netdev_ops)
+	if (qeth_netdev_is_registered(card->dev))
 		return 0;
 
 	card->dev->priv_flags |= IFF_UNICAST_FLT;
diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c
index 968e344a240b..a719c5ec4171 100644
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -2356,7 +2356,7 @@ static int qeth_l3_setup_netdev(struct qeth_card *card)
 	unsigned int headroom;
 	int rc;
 
-	if (card->dev->netdev_ops)
+	if (qeth_netdev_is_registered(card->dev))
 		return 0;
 
 	if (card->info.type == QETH_CARD_TYPE_OSD ||
@@ -2465,7 +2465,8 @@ static void qeth_l3_remove_device(struct ccwgroup_device *cgdev)
 	if (cgdev->state == CCWGROUP_ONLINE)
 		qeth_l3_set_offline(cgdev);
 
-	unregister_netdev(card->dev);
+	if (qeth_netdev_is_registered(card->dev))
+		unregister_netdev(card->dev);
 	qeth_l3_clear_ip_htable(card, 0);
 	qeth_l3_clear_ipato_list(card);
 }
-- 
2.16.4

^ permalink raw reply related

* [PATCH net 5/6] s390/qeth: sanitize ARP requests
From: Julian Wiedmann @ 2018-11-02 18:04 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Stefan Raspl, Ursula Braun, Julian Wiedmann
In-Reply-To: <20181102180413.53020-1-jwi@linux.ibm.com>

The ARP_{ADD,REMOVE}_ENTRY cmd structs contain reserved fields.
Introduce a common helper that doesn't raw-copy the user-provided data
into the cmd, but only sets those fields that are strictly needed for
the command.

This also sets the correct command length for ARP_REMOVE_ENTRY.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
---
 drivers/s390/net/qeth_core.h      |  5 ---
 drivers/s390/net/qeth_core_main.c | 12 ++---
 drivers/s390/net/qeth_core_mpc.h  |  2 +-
 drivers/s390/net/qeth_l3_main.c   | 94 +++++++++++----------------------------
 4 files changed, 34 insertions(+), 79 deletions(-)

diff --git a/drivers/s390/net/qeth_core.h b/drivers/s390/net/qeth_core.h
index 90cb213b0d55..04e294d1d16d 100644
--- a/drivers/s390/net/qeth_core.h
+++ b/drivers/s390/net/qeth_core.h
@@ -1046,11 +1046,6 @@ int qeth_configure_cq(struct qeth_card *, enum qeth_cq);
 int qeth_hw_trap(struct qeth_card *, enum qeth_diags_trap_action);
 void qeth_trace_features(struct qeth_card *);
 void qeth_close_dev(struct qeth_card *);
-int qeth_send_setassparms(struct qeth_card *, struct qeth_cmd_buffer *, __u16,
-			  long,
-			  int (*reply_cb)(struct qeth_card *,
-					  struct qeth_reply *, unsigned long),
-			  void *);
 int qeth_setassparms_cb(struct qeth_card *, struct qeth_reply *, unsigned long);
 struct qeth_cmd_buffer *qeth_get_setassparms_cmd(struct qeth_card *,
 						 enum qeth_ipa_funcs,
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index aed1a7961553..82282b2092d8 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -5477,11 +5477,12 @@ struct qeth_cmd_buffer *qeth_get_setassparms_cmd(struct qeth_card *card,
 }
 EXPORT_SYMBOL_GPL(qeth_get_setassparms_cmd);
 
-int qeth_send_setassparms(struct qeth_card *card,
-			  struct qeth_cmd_buffer *iob, __u16 len, long data,
-			  int (*reply_cb)(struct qeth_card *,
-					  struct qeth_reply *, unsigned long),
-			  void *reply_param)
+static int qeth_send_setassparms(struct qeth_card *card,
+				 struct qeth_cmd_buffer *iob, u16 len,
+				 long data, int (*reply_cb)(struct qeth_card *,
+							    struct qeth_reply *,
+							    unsigned long),
+				 void *reply_param)
 {
 	int rc;
 	struct qeth_ipa_cmd *cmd;
@@ -5497,7 +5498,6 @@ int qeth_send_setassparms(struct qeth_card *card,
 	rc = qeth_send_ipa_cmd(card, iob, reply_cb, reply_param);
 	return rc;
 }
-EXPORT_SYMBOL_GPL(qeth_send_setassparms);
 
 int qeth_send_simple_setassparms_prot(struct qeth_card *card,
 				      enum qeth_ipa_funcs ipa_func,
diff --git a/drivers/s390/net/qeth_core_mpc.h b/drivers/s390/net/qeth_core_mpc.h
index e85090467afe..80c036acf563 100644
--- a/drivers/s390/net/qeth_core_mpc.h
+++ b/drivers/s390/net/qeth_core_mpc.h
@@ -436,7 +436,7 @@ struct qeth_ipacmd_setassparms {
 		__u32 flags_32bit;
 		struct qeth_ipa_caps caps;
 		struct qeth_checksum_cmd chksum;
-		struct qeth_arp_cache_entry add_arp_entry;
+		struct qeth_arp_cache_entry arp_entry;
 		struct qeth_arp_query_data query_arp;
 		struct qeth_tso_start_data tso;
 		__u8 ip[16];
diff --git a/drivers/s390/net/qeth_l3_main.c b/drivers/s390/net/qeth_l3_main.c
index b26f7d7a2ca0..f08b745c2007 100644
--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -1777,13 +1777,18 @@ static int qeth_l3_arp_query(struct qeth_card *card, char __user *udata)
 	return rc;
 }
 
-static int qeth_l3_arp_add_entry(struct qeth_card *card,
-				struct qeth_arp_cache_entry *entry)
+static int qeth_l3_arp_modify_entry(struct qeth_card *card,
+				    struct qeth_arp_cache_entry *entry,
+				    enum qeth_arp_process_subcmds arp_cmd)
 {
+	struct qeth_arp_cache_entry *cmd_entry;
 	struct qeth_cmd_buffer *iob;
 	int rc;
 
-	QETH_CARD_TEXT(card, 3, "arpadent");
+	if (arp_cmd == IPA_CMD_ASS_ARP_ADD_ENTRY)
+		QETH_CARD_TEXT(card, 3, "arpadd");
+	else
+		QETH_CARD_TEXT(card, 3, "arpdel");
 
 	/*
 	 * currently GuestLAN only supports the ARP assist function
@@ -1796,54 +1801,19 @@ static int qeth_l3_arp_add_entry(struct qeth_card *card,
 		return -EOPNOTSUPP;
 	}
 
-	iob = qeth_get_setassparms_cmd(card, IPA_ARP_PROCESSING,
-				       IPA_CMD_ASS_ARP_ADD_ENTRY,
-				       sizeof(struct qeth_arp_cache_entry),
-				       QETH_PROT_IPV4);
+	iob = qeth_get_setassparms_cmd(card, IPA_ARP_PROCESSING, arp_cmd,
+				       sizeof(*cmd_entry), QETH_PROT_IPV4);
 	if (!iob)
 		return -ENOMEM;
-	rc = qeth_send_setassparms(card, iob,
-				   sizeof(struct qeth_arp_cache_entry),
-				   (unsigned long) entry,
-				   qeth_setassparms_cb, NULL);
-	if (rc)
-		QETH_DBF_MESSAGE(2, "Could not add ARP entry on device %x: %#x\n",
-				 CARD_DEVID(card), rc);
-	return qeth_l3_arp_makerc(rc);
-}
-
-static int qeth_l3_arp_remove_entry(struct qeth_card *card,
-				struct qeth_arp_cache_entry *entry)
-{
-	struct qeth_cmd_buffer *iob;
-	char buf[16] = {0, };
-	int rc;
 
-	QETH_CARD_TEXT(card, 3, "arprment");
-
-	/*
-	 * currently GuestLAN only supports the ARP assist function
-	 * IPA_CMD_ASS_ARP_QUERY_INFO, but not IPA_CMD_ASS_ARP_REMOVE_ENTRY;
-	 * thus we say EOPNOTSUPP for this ARP function
-	 */
-	if (card->info.guestlan)
-		return -EOPNOTSUPP;
-	if (!qeth_is_supported(card, IPA_ARP_PROCESSING)) {
-		return -EOPNOTSUPP;
-	}
-	memcpy(buf, entry, 12);
-	iob = qeth_get_setassparms_cmd(card, IPA_ARP_PROCESSING,
-				       IPA_CMD_ASS_ARP_REMOVE_ENTRY,
-				       12,
-				       QETH_PROT_IPV4);
-	if (!iob)
-		return -ENOMEM;
-	rc = qeth_send_setassparms(card, iob,
-				   12, (unsigned long)buf,
-				   qeth_setassparms_cb, NULL);
+	cmd_entry = &__ipa_cmd(iob)->data.setassparms.data.arp_entry;
+	ether_addr_copy(cmd_entry->macaddr, entry->macaddr);
+	memcpy(cmd_entry->ipaddr, entry->ipaddr, 4);
+	rc = qeth_send_ipa_cmd(card, iob, qeth_setassparms_cb, NULL);
 	if (rc)
-		QETH_DBF_MESSAGE(2, "Could not delete ARP entry on device %x: %#x\n",
-				 CARD_DEVID(card), rc);
+		QETH_DBF_MESSAGE(2, "Could not modify (cmd: %#x) ARP entry on device %x: %#x\n",
+				 arp_cmd, CARD_DEVID(card), rc);
+
 	return qeth_l3_arp_makerc(rc);
 }
 
@@ -1875,6 +1845,7 @@ static int qeth_l3_do_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 {
 	struct qeth_card *card = dev->ml_priv;
 	struct qeth_arp_cache_entry arp_entry;
+	enum qeth_arp_process_subcmds arp_cmd;
 	int rc = 0;
 
 	switch (cmd) {
@@ -1893,27 +1864,16 @@ static int qeth_l3_do_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 		rc = qeth_l3_arp_query(card, rq->ifr_ifru.ifru_data);
 		break;
 	case SIOC_QETH_ARP_ADD_ENTRY:
-		if (!capable(CAP_NET_ADMIN)) {
-			rc = -EPERM;
-			break;
-		}
-		if (copy_from_user(&arp_entry, rq->ifr_ifru.ifru_data,
-				   sizeof(struct qeth_arp_cache_entry)))
-			rc = -EFAULT;
-		else
-			rc = qeth_l3_arp_add_entry(card, &arp_entry);
-		break;
 	case SIOC_QETH_ARP_REMOVE_ENTRY:
-		if (!capable(CAP_NET_ADMIN)) {
-			rc = -EPERM;
-			break;
-		}
-		if (copy_from_user(&arp_entry, rq->ifr_ifru.ifru_data,
-				   sizeof(struct qeth_arp_cache_entry)))
-			rc = -EFAULT;
-		else
-			rc = qeth_l3_arp_remove_entry(card, &arp_entry);
-		break;
+		if (!capable(CAP_NET_ADMIN))
+			return -EPERM;
+		if (copy_from_user(&arp_entry, rq->ifr_data, sizeof(arp_entry)))
+			return -EFAULT;
+
+		arp_cmd = (cmd == SIOC_QETH_ARP_ADD_ENTRY) ?
+				IPA_CMD_ASS_ARP_ADD_ENTRY :
+				IPA_CMD_ASS_ARP_REMOVE_ENTRY;
+		return qeth_l3_arp_modify_entry(card, &arp_entry, arp_cmd);
 	case SIOC_QETH_ARP_FLUSH_CACHE:
 		if (!capable(CAP_NET_ADMIN)) {
 			rc = -EPERM;
-- 
2.16.4

^ permalink raw reply related

* [PATCH net 6/6] s390/qeth: report 25Gbit link speed
From: Julian Wiedmann @ 2018-11-02 18:04 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-s390, Martin Schwidefsky, Heiko Carstens,
	Stefan Raspl, Ursula Braun, Julian Wiedmann
In-Reply-To: <20181102180413.53020-1-jwi@linux.ibm.com>

This adds the various identifiers for 25Gbit cards, and wires them up
into sysfs and ethtool.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
---
 drivers/s390/net/qeth_core_main.c | 20 ++++++++++++++++++--
 drivers/s390/net/qeth_core_mpc.h  |  2 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 82282b2092d8..4bce5ae65a55 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -167,6 +167,8 @@ const char *qeth_get_cardname_short(struct qeth_card *card)
 				return "OSD_1000";
 			case QETH_LINK_TYPE_10GBIT_ETH:
 				return "OSD_10GIG";
+			case QETH_LINK_TYPE_25GBIT_ETH:
+				return "OSD_25GIG";
 			case QETH_LINK_TYPE_LANE_ETH100:
 				return "OSD_FE_LANE";
 			case QETH_LINK_TYPE_LANE_TR:
@@ -4432,7 +4434,8 @@ static int qeth_mdio_read(struct net_device *dev, int phy_id, int regnum)
 		rc = BMCR_FULLDPLX;
 		if ((card->info.link_type != QETH_LINK_TYPE_GBIT_ETH) &&
 		    (card->info.link_type != QETH_LINK_TYPE_OSN) &&
-		    (card->info.link_type != QETH_LINK_TYPE_10GBIT_ETH))
+		    (card->info.link_type != QETH_LINK_TYPE_10GBIT_ETH) &&
+		    (card->info.link_type != QETH_LINK_TYPE_25GBIT_ETH))
 			rc |= BMCR_SPEED100;
 		break;
 	case MII_BMSR: /* Basic mode status register */
@@ -6166,8 +6169,14 @@ static void qeth_set_cmd_adv_sup(struct ethtool_link_ksettings *cmd,
 		WARN_ON_ONCE(1);
 	}
 
-	/* fallthrough from high to low, to select all legal speeds: */
+	/* partially does fall through, to also select lower speeds */
 	switch (maxspeed) {
+	case SPEED_25000:
+		ethtool_link_ksettings_add_link_mode(cmd, supported,
+						     25000baseSR_Full);
+		ethtool_link_ksettings_add_link_mode(cmd, advertising,
+						     25000baseSR_Full);
+		break;
 	case SPEED_10000:
 		ethtool_link_ksettings_add_link_mode(cmd, supported,
 						     10000baseT_Full);
@@ -6250,6 +6259,10 @@ int qeth_core_ethtool_get_link_ksettings(struct net_device *netdev,
 		cmd->base.speed = SPEED_10000;
 		cmd->base.port = PORT_FIBRE;
 		break;
+	case QETH_LINK_TYPE_25GBIT_ETH:
+		cmd->base.speed = SPEED_25000;
+		cmd->base.port = PORT_FIBRE;
+		break;
 	default:
 		cmd->base.speed = SPEED_10;
 		cmd->base.port = PORT_TP;
@@ -6316,6 +6329,9 @@ int qeth_core_ethtool_get_link_ksettings(struct net_device *netdev,
 	case CARD_INFO_PORTS_10G:
 		cmd->base.speed = SPEED_10000;
 		break;
+	case CARD_INFO_PORTS_25G:
+		cmd->base.speed = SPEED_25000;
+		break;
 	}
 
 	return 0;
diff --git a/drivers/s390/net/qeth_core_mpc.h b/drivers/s390/net/qeth_core_mpc.h
index 80c036acf563..3e54be201b27 100644
--- a/drivers/s390/net/qeth_core_mpc.h
+++ b/drivers/s390/net/qeth_core_mpc.h
@@ -90,6 +90,7 @@ enum qeth_link_types {
 	QETH_LINK_TYPE_GBIT_ETH     = 0x03,
 	QETH_LINK_TYPE_OSN          = 0x04,
 	QETH_LINK_TYPE_10GBIT_ETH   = 0x10,
+	QETH_LINK_TYPE_25GBIT_ETH   = 0x12,
 	QETH_LINK_TYPE_LANE_ETH100  = 0x81,
 	QETH_LINK_TYPE_LANE_TR      = 0x82,
 	QETH_LINK_TYPE_LANE_ETH1000 = 0x83,
@@ -347,6 +348,7 @@ enum qeth_card_info_port_speed {
 	CARD_INFO_PORTS_100M		= 0x00000006,
 	CARD_INFO_PORTS_1G		= 0x00000007,
 	CARD_INFO_PORTS_10G		= 0x00000008,
+	CARD_INFO_PORTS_25G		= 0x0000000A,
 };
 
 /* (SET)DELIP(M) IPA stuff ***************************************************/
-- 
2.16.4

^ permalink raw reply related

* [PATCH net] net-sysfs: fix formatting of tx_timeout attribute
From: Julian Wiedmann @ 2018-11-02 18:33 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Add a missing newline.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
---
 net/core/net-sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bd67c4d0fcfd..ef06409d768e 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1039,7 +1039,7 @@ static ssize_t tx_timeout_show(struct netdev_queue *queue, char *buf)
 	trans_timeout = queue->trans_timeout;
 	spin_unlock_irq(&queue->_xmit_lock);
 
-	return sprintf(buf, "%lu", trans_timeout);
+	return sprintf(buf, "%lu\n", trans_timeout);
 }
 
 static unsigned int get_netdev_queue_index(struct netdev_queue *queue)
-- 
2.16.4

^ permalink raw reply related

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-11-02 19:02 UTC (permalink / raw)
  To: Aaron Lu, Jesper Dangaard Brouer
  Cc: Saeed Mahameed, eric.dumazet@gmail.com, netdev@vger.kernel.org,
	Tariq Toukan, ilias.apalodimas@linaro.org, yoel@kviknet.dk,
	mgorman@techsingularity.net
In-Reply-To: <20181102142024.GA18343@intel.com>



W dniu 02.11.2018 o 15:20, Aaron Lu pisze:
> On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote:
>> On Fri, 2 Nov 2018 13:23:56 +0800
>> Aaron Lu <aaron.lu@intel.com> wrote:
>>
>>> On Thu, Nov 01, 2018 at 08:23:19PM +0000, Saeed Mahameed wrote:
>>>> On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:
>>>>> On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
>>>>> wrote:
>>>>> ... ...
>>>>>> Section copied out:
>>>>>>
>>>>>>    mlx5e_poll_tx_cq
>>>>>>    |
>>>>>>     --16.34%--napi_consume_skb
>>>>>>               |
>>>>>>               |--12.65%--__free_pages_ok
>>>>>>               |          |
>>>>>>               |           --11.86%--free_one_page
>>>>>>               |                     |
>>>>>>               |                     |--10.10%
>>>>>> --queued_spin_lock_slowpath
>>>>>>               |                     |
>>>>>>               |                      --0.65%--_raw_spin_lock
>>>>> This callchain looks like it is freeing higher order pages than order
>>>>> 0:
>>>>> __free_pages_ok is only called for pages whose order are bigger than
>>>>> 0.
>>>> mlx5 rx uses only order 0 pages, so i don't know where these high order
>>>> tx SKBs are coming from..
>>> Perhaps here:
>>> __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and
>>> __napi_alloc_frag() will all call page_frag_alloc(), which will use
>>> __page_frag_cache_refill() to get an order 3 page if possible, or fall
>>> back to an order 0 page if order 3 page is not available.
>>>
>>> I'm not sure if your workload will use the above code path though.
>> TL;DR: this is order-0 pages (code-walk trough proof below)
>>
>> To Aaron, the network stack *can* call __free_pages_ok() with order-0
>> pages, via:
>>
>> static void skb_free_head(struct sk_buff *skb)
>> {
>> 	unsigned char *head = skb->head;
>>
>> 	if (skb->head_frag)
>> 		skb_free_frag(head);
>> 	else
>> 		kfree(head);
>> }
>>
>> static inline void skb_free_frag(void *addr)
>> {
>> 	page_frag_free(addr);
>> }
>>
>> /*
>>   * Frees a page fragment allocated out of either a compound or order 0 page.
>>   */
>> void page_frag_free(void *addr)
>> {
>> 	struct page *page = virt_to_head_page(addr);
>>
>> 	if (unlikely(put_page_testzero(page)))
>> 		__free_pages_ok(page, compound_order(page));
>> }
>> EXPORT_SYMBOL(page_frag_free);
> I think here is a problem - order 0 pages are freed directly to buddy,
> bypassing per-cpu-pages. This might be the reason lock contention
> appeared on free path. Can someone apply below diff and see if lock
> contention is gone?
Will test it tonight



>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e2ef1c17942f..65c0ae13215a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr)
>   {
>   	struct page *page = virt_to_head_page(addr);
>   
> -	if (unlikely(put_page_testzero(page)))
> -		__free_pages_ok(page, compound_order(page));
> +	if (unlikely(put_page_testzero(page))) {
> +		unsigned int order = compound_order(page);
> +
> +		if (order == 0)
> +			free_unref_page(page);
> +		else
> +			__free_pages_ok(page, order);
> +	}
>   }
>   EXPORT_SYMBOL(page_frag_free);
>   
>> Notice for the mlx5 driver it support several RX-memory models, so it
>> can be hard to follow, but from the perf report output we can see that
>> is uses mlx5e_skb_from_cqe_linear, which use build_skb.
>>
>> --13.63%--mlx5e_skb_from_cqe_linear
>>            |
>>             --5.02%--build_skb
>>                       |
>>                        --1.85%--__build_skb
>>                                  |
>>                                   --1.00%--kmem_cache_alloc
>>
>> /* build_skb() is wrapper over __build_skb(), that specifically
>>   * takes care of skb->head and skb->pfmemalloc
>>   * This means that if @frag_size is not zero, then @data must be backed
>>   * by a page fragment, not kmalloc() or vmalloc()
>>   */
>> struct sk_buff *build_skb(void *data, unsigned int frag_size)
>> {
>> 	struct sk_buff *skb = __build_skb(data, frag_size);
>>
>> 	if (skb && frag_size) {
>> 		skb->head_frag = 1;
>> 		if (page_is_pfmemalloc(virt_to_head_page(data)))
>> 			skb->pfmemalloc = 1;
>> 	}
>> 	return skb;
>> }
>> EXPORT_SYMBOL(build_skb);
>>
>> It still doesn't prove, that the @data is backed by by a order-0 page.
>> For the mlx5 driver is uses mlx5e_page_alloc_mapped ->
>> page_pool_dev_alloc_pages(), and I can see perf report using
>> __page_pool_alloc_pages_slow().
>>
>> The setup for page_pool in mlx5 uses order=0.
>>
>> 	/* Create a page_pool and register it with rxq */
>> 	pp_params.order     = 0;
>> 	pp_params.flags     = 0; /* No-internal DMA mapping in page_pool */
>> 	pp_params.pool_size = pool_size;
>> 	pp_params.nid       = cpu_to_node(c->cpu);
>> 	pp_params.dev       = c->pdev;
>> 	pp_params.dma_dir   = rq->buff.map_dir;
>>
>> 	/* page_pool can be used even when there is no rq->xdp_prog,
>> 	 * given page_pool does not handle DMA mapping there is no
>> 	 * required state to clear. And page_pool gracefully handle
>> 	 * elevated refcnt.
>> 	 */
>> 	rq->page_pool = page_pool_create(&pp_params);
>> 	if (IS_ERR(rq->page_pool)) {
>> 		err = PTR_ERR(rq->page_pool);
>> 		rq->page_pool = NULL;
>> 		goto err_free;
>> 	}
>> 	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
>> 					 MEM_TYPE_PAGE_POOL, rq->page_pool);
> Thanks for the detailed analysis, I'll need more time to understand the
> whole picture :-)
>

^ permalink raw reply

* [PATCH net-next v4 1/9] net: allow binding socket in a VRF when there's an unbound socket
From: Mike Manning @ 2018-11-02 19:10 UTC (permalink / raw)
  To: netdev; +Cc: Robert Shearman
In-Reply-To: <20181102191020.14170-1-mmanning@vyatta.att-mail.com>

From: Robert Shearman <rshearma@vyatta.att-mail.com>

Change the inet socket lookup to avoid packets arriving on a device
enslaved to an l3mdev from matching unbound sockets by removing the
wildcard for non sk_bound_dev_if and instead relying on check against
the secondary device index, which will be 0 when the input device is
not enslaved to an l3mdev and so match against an unbound socket and
not match when the input device is enslaved.

Change the socket binding to take the l3mdev into account to allow an
unbound socket to not conflict sockets bound to an l3mdev given the
datapath isolation now guaranteed.

Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
---
 Documentation/networking/vrf.txt |  9 +++++----
 include/net/inet6_hashtables.h   |  5 ++---
 include/net/inet_hashtables.h    | 13 ++++++-------
 include/net/inet_sock.h          | 13 +++++++++++++
 net/ipv4/inet_connection_sock.c  | 13 ++++++++++---
 net/ipv4/inet_hashtables.c       | 20 +++++++++++++++-----
 6 files changed, 51 insertions(+), 22 deletions(-)

diff --git a/Documentation/networking/vrf.txt b/Documentation/networking/vrf.txt
index 8ff7b4c8f91b..d4b129402d57 100644
--- a/Documentation/networking/vrf.txt
+++ b/Documentation/networking/vrf.txt
@@ -103,6 +103,11 @@ VRF device:
 
 or to specify the output device using cmsg and IP_PKTINFO.
 
+By default the scope of the port bindings for unbound sockets is
+limited to the default VRF. That is, it will not be matched by packets
+arriving on interfaces enslaved to an l3mdev and processes may bind to
+the same port if they bind to an l3mdev.
+
 TCP & UDP services running in the default VRF context (ie., not bound
 to any VRF device) can work across all VRF domains by enabling the
 tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
@@ -112,10 +117,6 @@ tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
 netfilter rules on the VRF device can be used to limit access to services
 running in the default VRF context as well.
 
-The default VRF does not have limited scope with respect to port bindings.
-That is, if a process does a wildcard bind to a port in the default VRF it
-owns the port across all VRF domains within the network namespace.
-
 ################################################################################
 
 Using iproute2 for VRFs
diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 6e91e38a31da..9db98af46985 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -115,9 +115,8 @@ int inet6_hash(struct sock *sk);
 	 ((__sk)->sk_family == AF_INET6)			&&	\
 	 ipv6_addr_equal(&(__sk)->sk_v6_daddr, (__saddr))		&&	\
 	 ipv6_addr_equal(&(__sk)->sk_v6_rcv_saddr, (__daddr))	&&	\
-	 (!(__sk)->sk_bound_dev_if	||				\
-	   ((__sk)->sk_bound_dev_if == (__dif))	||			\
-	   ((__sk)->sk_bound_dev_if == (__sdif)))		&&	\
+	 (((__sk)->sk_bound_dev_if == (__dif))	||			\
+	  ((__sk)->sk_bound_dev_if == (__sdif)))		&&	\
 	 net_eq(sock_net(__sk), (__net)))
 
 #endif /* _INET6_HASHTABLES_H */
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 9141e95529e7..4ae060b4bac2 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -79,6 +79,7 @@ struct inet_ehash_bucket {
 
 struct inet_bind_bucket {
 	possible_net_t		ib_net;
+	int			l3mdev;
 	unsigned short		port;
 	signed char		fastreuse;
 	signed char		fastreuseport;
@@ -191,7 +192,7 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
 struct inet_bind_bucket *
 inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 			struct inet_bind_hashbucket *head,
-			const unsigned short snum);
+			const unsigned short snum, int l3mdev);
 void inet_bind_bucket_destroy(struct kmem_cache *cachep,
 			      struct inet_bind_bucket *tb);
 
@@ -282,9 +283,8 @@ static inline struct sock *inet_lookup_listener(struct net *net,
 #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
 	(((__sk)->sk_portpair == (__ports))			&&	\
 	 ((__sk)->sk_addrpair == (__cookie))			&&	\
-	 (!(__sk)->sk_bound_dev_if	||				\
-	   ((__sk)->sk_bound_dev_if == (__dif))			||	\
-	   ((__sk)->sk_bound_dev_if == (__sdif)))		&&	\
+	 (((__sk)->sk_bound_dev_if == (__dif))			||	\
+	  ((__sk)->sk_bound_dev_if == (__sdif)))		&&	\
 	 net_eq(sock_net(__sk), (__net)))
 #else /* 32-bit arch */
 #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
@@ -294,9 +294,8 @@ static inline struct sock *inet_lookup_listener(struct net *net,
 	(((__sk)->sk_portpair == (__ports))		&&		\
 	 ((__sk)->sk_daddr	== (__saddr))		&&		\
 	 ((__sk)->sk_rcv_saddr	== (__daddr))		&&		\
-	 (!(__sk)->sk_bound_dev_if	||				\
-	   ((__sk)->sk_bound_dev_if == (__dif))		||		\
-	   ((__sk)->sk_bound_dev_if == (__sdif)))	&&		\
+	 (((__sk)->sk_bound_dev_if == (__dif))		||		\
+	  ((__sk)->sk_bound_dev_if == (__sdif)))	&&		\
 	 net_eq(sock_net(__sk), (__net)))
 #endif /* 64-bit arch */
 
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index a80fd0ac4563..ed3f723af00b 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -130,6 +130,19 @@ static inline int inet_request_bound_dev_if(const struct sock *sk,
 	return sk->sk_bound_dev_if;
 }
 
+static inline int inet_sk_bound_l3mdev(const struct sock *sk)
+{
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	struct net *net = sock_net(sk);
+
+	if (!net->ipv4.sysctl_tcp_l3mdev_accept)
+		return l3mdev_master_ifindex_by_index(net,
+						      sk->sk_bound_dev_if);
+#endif
+
+	return 0;
+}
+
 struct inet_cork {
 	unsigned int		flags;
 	__be32			addr;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 15e7f7915a21..5c63449130d9 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -183,7 +183,9 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 	int i, low, high, attempt_half;
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
+	int l3mdev;
 
+	l3mdev = inet_sk_bound_l3mdev(sk);
 	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
 other_half_scan:
 	inet_get_local_port_range(net, &low, &high);
@@ -219,7 +221,8 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 						  hinfo->bhash_size)];
 		spin_lock_bh(&head->lock);
 		inet_bind_bucket_for_each(tb, &head->chain)
-			if (net_eq(ib_net(tb), net) && tb->port == port) {
+			if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
+			    tb->port == port) {
 				if (!inet_csk_bind_conflict(sk, tb, false, false))
 					goto success;
 				goto next_port;
@@ -293,6 +296,9 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	struct net *net = sock_net(sk);
 	struct inet_bind_bucket *tb = NULL;
 	kuid_t uid = sock_i_uid(sk);
+	int l3mdev;
+
+	l3mdev = inet_sk_bound_l3mdev(sk);
 
 	if (!port) {
 		head = inet_csk_find_open_port(sk, &tb, &port);
@@ -306,11 +312,12 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 					  hinfo->bhash_size)];
 	spin_lock_bh(&head->lock);
 	inet_bind_bucket_for_each(tb, &head->chain)
-		if (net_eq(ib_net(tb), net) && tb->port == port)
+		if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
+		    tb->port == port)
 			goto tb_found;
 tb_not_found:
 	tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
-				     net, head, port);
+				     net, head, port, l3mdev);
 	if (!tb)
 		goto fail_unlock;
 tb_found:
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 411dd7a90046..40d722ab1738 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -65,12 +65,14 @@ static u32 sk_ehashfn(const struct sock *sk)
 struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 						 struct net *net,
 						 struct inet_bind_hashbucket *head,
-						 const unsigned short snum)
+						 const unsigned short snum,
+						 int l3mdev)
 {
 	struct inet_bind_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);
 
 	if (tb) {
 		write_pnet(&tb->ib_net, net);
+		tb->l3mdev    = l3mdev;
 		tb->port      = snum;
 		tb->fastreuse = 0;
 		tb->fastreuseport = 0;
@@ -135,6 +137,7 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 			table->bhash_size);
 	struct inet_bind_hashbucket *head = &table->bhash[bhash];
 	struct inet_bind_bucket *tb;
+	int l3mdev;
 
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
@@ -143,6 +146,8 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 		return -ENOENT;
 	}
 	if (tb->port != port) {
+		l3mdev = inet_sk_bound_l3mdev(sk);
+
 		/* NOTE: using tproxy and redirecting skbs to a proxy
 		 * on a different listener port breaks the assumption
 		 * that the listener socket's icsk_bind_hash is the same
@@ -150,12 +155,13 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 		 * create a new bind bucket for the child here. */
 		inet_bind_bucket_for_each(tb, &head->chain) {
 			if (net_eq(ib_net(tb), sock_net(sk)) &&
-			    tb->port == port)
+			    tb->l3mdev == l3mdev && tb->port == port)
 				break;
 		}
 		if (!tb) {
 			tb = inet_bind_bucket_create(table->bind_bucket_cachep,
-						     sock_net(sk), head, port);
+						     sock_net(sk), head, port,
+						     l3mdev);
 			if (!tb) {
 				spin_unlock(&head->lock);
 				return -ENOMEM;
@@ -675,6 +681,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	u32 remaining, offset;
 	int ret, i, low, high;
 	static u32 hint;
+	int l3mdev;
 
 	if (port) {
 		head = &hinfo->bhash[inet_bhashfn(net, port,
@@ -693,6 +700,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		return ret;
 	}
 
+	l3mdev = inet_sk_bound_l3mdev(sk);
+
 	inet_get_local_port_range(net, &low, &high);
 	high++; /* [32768, 60999] -> [32768, 61000[ */
 	remaining = high - low;
@@ -719,7 +728,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		 * the established check is already unique enough.
 		 */
 		inet_bind_bucket_for_each(tb, &head->chain) {
-			if (net_eq(ib_net(tb), net) && tb->port == port) {
+			if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
+			    tb->port == port) {
 				if (tb->fastreuse >= 0 ||
 				    tb->fastreuseport >= 0)
 					goto next_port;
@@ -732,7 +742,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		}
 
 		tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
-					     net, head, port);
+					     net, head, port, l3mdev);
 		if (!tb) {
 			spin_unlock_bh(&head->lock);
 			return -ENOMEM;
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next v4 0/9] vrf: allow simultaneous service instances in default and other VRFs
From: Mike Manning @ 2018-11-02 19:10 UTC (permalink / raw)
  To: netdev

Services currently have to be VRF-aware if they are using an unbound
socket. One cannot have multiple service instances running in the
default and other VRFs for services that are not VRF-aware and listen
on an unbound socket. This is because there is no easy way of isolating
packets received in the default VRF from those arriving in other VRFs.

This series provides this isolation for stream sockets subject to the
existing kernel parameter net.ipv4.tcp_l3mdev_accept not being set,
given that this is documented as allowing a single service instance to
work across all VRF domains. Similarly, net.ipv4.udp_l3mdev_accept is
checked for datagram sockets, and net.ipv4.raw_l3mdev_accept is
introduced for raw sockets. The functionality applies to UDP & TCP
services as well as those using raw sockets, and is for IPv4 and IPv6.

Example of running ssh instances in default and blue VRF:

$ /usr/sbin/sshd -D
$ ip vrf exec vrf-blue /usr/sbin/sshd
$ ss -ta | egrep 'State|ssh'
State   Recv-Q   Send-Q           Local Address:Port       Peer Address:Port
LISTEN  0        128           0.0.0.0%vrf-blue:ssh             0.0.0.0:*
LISTEN  0        128                    0.0.0.0:ssh             0.0.0.0:*
ESTAB   0        0              192.168.122.220:ssh       192.168.122.1:50282
LISTEN  0        128              [::]%vrf-blue:ssh                [::]:*
LISTEN  0        128                       [::]:ssh                [::]:*
ESTAB   0        0           [3000::2]%vrf-blue:ssh           [3000::9]:45896
ESTAB   0        0                    [2000::2]:ssh           [2000::9]:46398

v1:
   - Address Paolo Abeni's comments (patch 4/5)
   - Fix build when CONFIG_NET_L3_MASTER_DEV not defined (patch 1/5)
v2:
   - Address David Aherns' comments (patches 4/5 and 5/5)
   - Remove patches 3/5 and 5/5 from series for individual submissions
   - Include a sysctl for raw sockets as recommended by David Ahern
   - Expand series into 10 patches and provide improved descriptions
v3:
   - Update description for patch 1/10 and remove patch 6/10
v4:
   - Set default to enabled for raw socket sysctl as recommended by David Ahern

Dewi Morgan (1):
  ipv6: do not drop vrf udp multicast packets

Duncan Eastoe (1):
  net: fix raw socket lookup device bind matching with VRFs

Mike Manning (6):
  net: ensure unbound stream socket to be chosen when not in a VRF
  net: ensure unbound datagram socket to be chosen when not in a VRF
  net: provide a sysctl raw_l3mdev_accept for raw socket lookup with
    VRFs
  vrf: mark skb for multicast or link-local as enslaved to VRF
  ipv6: allow ping to link-local address in VRF
  ipv6: handling of multicast packets received in VRF

Robert Shearman (1):
  net: allow binding socket in a VRF when there's an unbound socket

 Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
 Documentation/networking/vrf.txt       | 22 +++++++++++++++++----
 drivers/net/vrf.c                      | 19 +++++++++---------
 include/net/inet6_hashtables.h         |  5 ++---
 include/net/inet_hashtables.h          | 24 ++++++++++++++++-------
 include/net/inet_sock.h                | 21 ++++++++++++++++++++
 include/net/netns/ipv4.h               |  3 +++
 include/net/raw.h                      | 13 +++++++++++++
 include/net/udp.h                      | 11 +++++++++++
 net/core/sock.c                        |  2 ++
 net/ipv4/af_inet.c                     |  2 ++
 net/ipv4/inet_connection_sock.c        | 13 ++++++++++---
 net/ipv4/inet_hashtables.c             | 34 ++++++++++++++++++++-------------
 net/ipv4/raw.c                         | 19 ++++++++++++++----
 net/ipv4/sysctl_net_ipv4.c             | 11 +++++++++++
 net/ipv4/udp.c                         | 15 ++++++---------
 net/ipv6/datagram.c                    |  5 ++++-
 net/ipv6/inet6_hashtables.c            | 14 ++++++--------
 net/ipv6/ip6_input.c                   | 35 +++++++++++++++++++++++++++++++---
 net/ipv6/ipv6_sockglue.c               |  2 +-
 net/ipv6/raw.c                         |  5 ++---
 net/ipv6/udp.c                         | 22 ++++++++++-----------
 22 files changed, 228 insertions(+), 81 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH net-next v4 2/9] net: ensure unbound stream socket to be chosen when not in a VRF
From: Mike Manning @ 2018-11-02 19:10 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20181102191020.14170-1-mmanning@vyatta.att-mail.com>

The commit a04a480d4392 ("net: Require exact match for TCP socket
lookups if dif is l3mdev") only ensures that the correct socket is
selected for packets in a VRF. However, there is no guarantee that
the unbound socket will be selected for packets when not in a VRF.
By checking for a device match in compute_score() also for the case
when there is no bound device and attaching a score to this, the
unbound socket is selected. And if a failure is returned when there
is no device match, this ensures that bound sockets are never selected,
even if there is no unbound socket.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
---
 include/net/inet_hashtables.h | 11 +++++++++++
 include/net/inet_sock.h       |  8 ++++++++
 net/ipv4/inet_hashtables.c    | 14 ++++++--------
 net/ipv6/inet6_hashtables.c   | 14 ++++++--------
 4 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 4ae060b4bac2..5de2d9f24c05 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -189,6 +189,17 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
 	hashinfo->ehash_locks = NULL;
 }
 
+static inline bool inet_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+					int dif, int sdif)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+	return inet_bound_dev_eq(net->ipv4.sysctl_tcp_l3mdev_accept,
+				    bound_dev_if, dif, sdif);
+#else
+	return inet_bound_dev_eq(1, bound_dev_if, dif, sdif);
+#endif
+}
+
 struct inet_bind_bucket *
 inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 			struct inet_bind_hashbucket *head,
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index ed3f723af00b..e8eef85006aa 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -143,6 +143,14 @@ static inline int inet_sk_bound_l3mdev(const struct sock *sk)
 	return 0;
 }
 
+static inline bool inet_bound_dev_eq(bool l3mdev_accept, int bound_dev_if,
+				     int dif, int sdif)
+{
+	if (!bound_dev_if)
+		return !sdif || l3mdev_accept;
+	return bound_dev_if == dif || bound_dev_if == sdif;
+}
+
 struct inet_cork {
 	unsigned int		flags;
 	__be32			addr;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 40d722ab1738..13890d5bfc34 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -235,6 +235,7 @@ static inline int compute_score(struct sock *sk, struct net *net,
 {
 	int score = -1;
 	struct inet_sock *inet = inet_sk(sk);
+	bool dev_match;
 
 	if (net_eq(sock_net(sk), net) && inet->inet_num == hnum &&
 			!ipv6_only_sock(sk)) {
@@ -245,15 +246,12 @@ static inline int compute_score(struct sock *sk, struct net *net,
 				return -1;
 			score += 4;
 		}
-		if (sk->sk_bound_dev_if || exact_dif) {
-			bool dev_match = (sk->sk_bound_dev_if == dif ||
-					  sk->sk_bound_dev_if == sdif);
+		dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+						 dif, sdif);
+		if (!dev_match)
+			return -1;
+		score += 4;
 
-			if (!dev_match)
-				return -1;
-			if (sk->sk_bound_dev_if)
-				score += 4;
-		}
 		if (sk->sk_incoming_cpu == raw_smp_processor_id())
 			score++;
 	}
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 3d7c7460a0c5..5eeeba7181a1 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -99,6 +99,7 @@ static inline int compute_score(struct sock *sk, struct net *net,
 				const int dif, const int sdif, bool exact_dif)
 {
 	int score = -1;
+	bool dev_match;
 
 	if (net_eq(sock_net(sk), net) && inet_sk(sk)->inet_num == hnum &&
 	    sk->sk_family == PF_INET6) {
@@ -109,15 +110,12 @@ static inline int compute_score(struct sock *sk, struct net *net,
 				return -1;
 			score++;
 		}
-		if (sk->sk_bound_dev_if || exact_dif) {
-			bool dev_match = (sk->sk_bound_dev_if == dif ||
-					  sk->sk_bound_dev_if == sdif);
+		dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+						 dif, sdif);
+		if (!dev_match)
+			return -1;
+		score++;
 
-			if (!dev_match)
-				return -1;
-			if (sk->sk_bound_dev_if)
-				score++;
-		}
 		if (sk->sk_incoming_cpu == raw_smp_processor_id())
 			score++;
 	}
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next v4 4/9] net: provide a sysctl raw_l3mdev_accept for raw socket lookup with VRFs
From: Mike Manning @ 2018-11-02 19:10 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20181102191020.14170-1-mmanning@vyatta.att-mail.com>

Add a sysctl raw_l3mdev_accept to control raw socket lookup in a manner
similar to use of tcp_l3mdev_accept for stream and of udp_l3mdev_accept
for datagram sockets. Have this default to enabled for reasons of
backwards compatibility. This is so as to specify the output device
with cmsg and IP_PKTINFO, but using a socket not bound to the
corresponding VRF. This allows e.g. older ping implementations to be
run with specifying the device but without executing it in the VRF.
If the option is disabled, packets received in a VRF context are only
handled by a raw socket bound to the VRF, and correspondingly packets
in the default VRF are only handled by a socket not bound to any VRF.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
---
 Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
 Documentation/networking/vrf.txt       | 13 +++++++++++++
 include/net/netns/ipv4.h               |  3 +++
 include/net/raw.h                      |  1 +
 net/ipv4/af_inet.c                     |  2 ++
 net/ipv4/raw.c                         | 16 ++++++++++++++--
 net/ipv4/sysctl_net_ipv4.c             | 11 +++++++++++
 7 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 32b21571adfe..aa9e6a331679 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -370,6 +370,7 @@ tcp_l3mdev_accept - BOOLEAN
 	derived from the listen socket to be bound to the L3 domain in
 	which the packets originated. Only valid when the kernel was
 	compiled with CONFIG_NET_L3_MASTER_DEV.
+        Default: 0 (disabled)
 
 tcp_low_latency - BOOLEAN
 	This is a legacy option, it has no effect anymore.
@@ -773,6 +774,7 @@ udp_l3mdev_accept - BOOLEAN
 	being received regardless of the L3 domain in which they
 	originated. Only valid when the kernel was compiled with
 	CONFIG_NET_L3_MASTER_DEV.
+        Default: 0 (disabled)
 
 udp_mem - vector of 3 INTEGERs: min, pressure, max
 	Number of pages allowed for queueing by all UDP sockets.
@@ -799,6 +801,16 @@ udp_wmem_min - INTEGER
 	total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
 	Default: 4K
 
+RAW variables:
+
+raw_l3mdev_accept - BOOLEAN
+	Enabling this option allows a "global" bound socket to work
+	across L3 master domains (e.g., VRFs) with packets capable of
+	being received regardless of the L3 domain in which they
+	originated. Only valid when the kernel was compiled with
+	CONFIG_NET_L3_MASTER_DEV.
+	Default: 1 (enabled)
+
 CIPSOv4 Variables:
 
 cipso_cache_enable - BOOLEAN
diff --git a/Documentation/networking/vrf.txt b/Documentation/networking/vrf.txt
index d4b129402d57..d234c9750c72 100644
--- a/Documentation/networking/vrf.txt
+++ b/Documentation/networking/vrf.txt
@@ -111,9 +111,22 @@ the same port if they bind to an l3mdev.
 TCP & UDP services running in the default VRF context (ie., not bound
 to any VRF device) can work across all VRF domains by enabling the
 tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
+
     sysctl -w net.ipv4.tcp_l3mdev_accept=1
     sysctl -w net.ipv4.udp_l3mdev_accept=1
 
+These options are disabled by default so that a socket in a VRF is only
+selected for packets in that VRF. There is a similar option for RAW
+sockets, which is enabled by default for reasons of backwards compatibility.
+This is so as to specify the output device with cmsg and IP_PKTINFO, but
+using a socket not bound to the corresponding VRF. This allows e.g. older ping
+implementations to be run with specifying the device but without executing it
+in the VRF. This option can be disabled so that packets received in a VRF
+context are only handled by a raw socket bound to the VRF, and packets in the
+in the default VRF are only handled by a socket not bound to any VRF:
+
+    sysctl -w net.ipv4.raw_l3mdev_accept=0
+
 netfilter rules on the VRF device can be used to limit access to services
 running in the default VRF context as well.
 
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index e47503b4e4d1..104a6669e344 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -103,6 +103,9 @@ struct netns_ipv4 {
 	/* Shall we try to damage output packets if routing dev changes? */
 	int sysctl_ip_dynaddr;
 	int sysctl_ip_early_demux;
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	int sysctl_raw_l3mdev_accept;
+#endif
 	int sysctl_tcp_early_demux;
 	int sysctl_udp_early_demux;
 
diff --git a/include/net/raw.h b/include/net/raw.h
index 9c9fa98a91a4..20ebf0b3dfa8 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -61,6 +61,7 @@ void raw_seq_stop(struct seq_file *seq, void *v);
 
 int raw_hash_sk(struct sock *sk);
 void raw_unhash_sk(struct sock *sk);
+void raw_init(void);
 
 struct raw_sock {
 	/* inet_sock has to be the first member */
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1fbe2f815474..07749c5b0a50 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1964,6 +1964,8 @@ static int __init inet_init(void)
 	/* Add UDP-Lite (RFC 3828) */
 	udplite4_register();
 
+	raw_init();
+
 	ping_init();
 
 	/*
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 8ca3eb06ba04..da453c7dfb75 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -805,7 +805,7 @@ static int raw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	return copied;
 }
 
-static int raw_init(struct sock *sk)
+static int raw_sk_init(struct sock *sk)
 {
 	struct raw_sock *rp = raw_sk(sk);
 
@@ -970,7 +970,7 @@ struct proto raw_prot = {
 	.connect	   = ip4_datagram_connect,
 	.disconnect	   = __udp_disconnect,
 	.ioctl		   = raw_ioctl,
-	.init		   = raw_init,
+	.init		   = raw_sk_init,
 	.setsockopt	   = raw_setsockopt,
 	.getsockopt	   = raw_getsockopt,
 	.sendmsg	   = raw_sendmsg,
@@ -1133,4 +1133,16 @@ void __init raw_proc_exit(void)
 {
 	unregister_pernet_subsys(&raw_net_ops);
 }
+
+static void raw_sysctl_init(void)
+{
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	init_net.ipv4.sysctl_raw_l3mdev_accept = 1;
+#endif
+}
+
+void __init raw_init(void)
+{
+	raw_sysctl_init();
+}
 #endif /* CONFIG_PROC_FS */
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 891ed2f91467..ba0fc4b18465 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -602,6 +602,17 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= ipv4_ping_group_range,
 	},
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	{
+		.procname	= "raw_l3mdev_accept",
+		.data		= &init_net.ipv4.sysctl_raw_l3mdev_accept,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 	{
 		.procname	= "tcp_ecn",
 		.data		= &init_net.ipv4.sysctl_tcp_ecn,
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next v4 5/9] net: fix raw socket lookup device bind matching with VRFs
From: Mike Manning @ 2018-11-02 19:10 UTC (permalink / raw)
  To: netdev; +Cc: Duncan Eastoe
In-Reply-To: <20181102191020.14170-1-mmanning@vyatta.att-mail.com>

From: Duncan Eastoe <deastoe@vyatta.att-mail.com>

When there exist a pair of raw sockets one unbound and one bound
to a VRF but equal in all other respects, when a packet is received
in the VRF context, __raw_v4_lookup() matches on both sockets.

This results in the packet being delivered over both sockets,
instead of only the raw socket bound to the VRF. The bound device
checks in __raw_v4_lookup() are replaced with a call to
raw_sk_bound_dev_eq() which correctly handles whether the packet
should be delivered over the unbound socket in such cases.

In __raw_v6_lookup() the match on the device binding of the socket is
similarly updated to use raw_sk_bound_dev_eq() which matches the
handling in __raw_v4_lookup().

Importantly raw_sk_bound_dev_eq() takes the raw_l3mdev_accept sysctl
into account.

Signed-off-by: Duncan Eastoe <deastoe@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
---
 include/net/raw.h | 12 ++++++++++++
 net/ipv4/raw.c    |  3 +--
 net/ipv6/raw.c    |  5 ++---
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index 20ebf0b3dfa8..6ed2ae5b4a80 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -18,6 +18,7 @@
 #define _RAW_H
 
 
+#include <net/inet_sock.h>
 #include <net/protocol.h>
 #include <linux/icmp.h>
 
@@ -75,4 +76,15 @@ static inline struct raw_sock *raw_sk(const struct sock *sk)
 	return (struct raw_sock *)sk;
 }
 
+static inline bool raw_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+				       int dif, int sdif)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+	return inet_bound_dev_eq(net->ipv4.sysctl_raw_l3mdev_accept,
+				 bound_dev_if, dif, sdif);
+#else
+	return inet_bound_dev_eq(1, bound_dev_if, dif, sdif);
+#endif
+}
+
 #endif	/* _RAW_H */
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index da453c7dfb75..d42cdd018987 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -131,8 +131,7 @@ struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
 		if (net_eq(sock_net(sk), net) && inet->inet_num == num	&&
 		    !(inet->inet_daddr && inet->inet_daddr != raddr) 	&&
 		    !(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&
-		    !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif &&
-		      sk->sk_bound_dev_if != sdif))
+		    raw_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif))
 			goto found; /* gotcha */
 	}
 	sk = NULL;
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 5e0efd3954e9..aed7eb5c2123 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -86,9 +86,8 @@ struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
 			    !ipv6_addr_equal(&sk->sk_v6_daddr, rmt_addr))
 				continue;
 
-			if (sk->sk_bound_dev_if &&
-			    sk->sk_bound_dev_if != dif &&
-			    sk->sk_bound_dev_if != sdif)
+			if (!raw_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+						 dif, sdif))
 				continue;
 
 			if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next v4 6/9] vrf: mark skb for multicast or link-local as enslaved to VRF
From: Mike Manning @ 2018-11-02 19:10 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20181102191020.14170-1-mmanning@vyatta.att-mail.com>

The skb for packets that are multicast or to a link-local address are
not marked as being enslaved to a VRF, if they are received on a socket
bound to the VRF. This is needed for ND and it is preferable for the
kernel not to have to deal with the additional use-cases if ll or mcast
packets are handled as enslaved. However, this does not allow service
instances listening on unbound and bound to VRF sockets to distinguish
the VRF used, if packets are sent as multicast or to a link-local
address. The fix is for the VRF driver to also mark these skb as being
enslaved to the VRF.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
---
 drivers/net/vrf.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 69b7227c637e..21ad4b1d7f03 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -981,24 +981,23 @@ static struct sk_buff *vrf_ip6_rcv(struct net_device *vrf_dev,
 				   struct sk_buff *skb)
 {
 	int orig_iif = skb->skb_iif;
-	bool need_strict;
+	bool need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
+	bool is_ndisc = ipv6_ndisc_frame(skb);
 
-	/* loopback traffic; do not push through packet taps again.
-	 * Reset pkt_type for upper layers to process skb
+	/* loopback, multicast & non-ND link-local traffic; do not push through
+	 * packet taps again. Reset pkt_type for upper layers to process skb
 	 */
-	if (skb->pkt_type == PACKET_LOOPBACK) {
+	if (skb->pkt_type == PACKET_LOOPBACK || (need_strict && !is_ndisc)) {
 		skb->dev = vrf_dev;
 		skb->skb_iif = vrf_dev->ifindex;
 		IP6CB(skb)->flags |= IP6SKB_L3SLAVE;
-		skb->pkt_type = PACKET_HOST;
+		if (skb->pkt_type == PACKET_LOOPBACK)
+			skb->pkt_type = PACKET_HOST;
 		goto out;
 	}
 
-	/* if packet is NDISC or addressed to multicast or link-local
-	 * then keep the ingress interface
-	 */
-	need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
-	if (!ipv6_ndisc_frame(skb) && !need_strict) {
+	/* if packet is NDISC then keep the ingress interface */
+	if (!is_ndisc) {
 		vrf_rx_stats(vrf_dev, skb->len);
 		skb->dev = vrf_dev;
 		skb->skb_iif = vrf_dev->ifindex;
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next v4 7/9] ipv6: allow ping to link-local address in VRF
From: Mike Manning @ 2018-11-02 19:10 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20181102191020.14170-1-mmanning@vyatta.att-mail.com>

If link-local packets are marked as enslaved to a VRF, then to allow
ping to the link-local from a vrf, the error handling for IPV6_PKTINFO
needs to be relaxed to also allow the pkt ipi6_ifindex to be that of a
slave device to the vrf.

Note that the real device also needs to be retrieved in icmp6_iif()
to set the ipv6 flow oif to this for icmp echo reply handling. The
recent commit 24b711edfc34 ("net/ipv6: Fix linklocal to global address
with VRF") takes care of this, so the sdif does not need checking here.

This fix makes ping to link-local consistent with that to global
addresses, in that this can now be done from within the same VRF that
the address is in.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
---
 net/ipv6/ipv6_sockglue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 381ce38940ae..973e215c3114 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -486,7 +486,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 				retv = -EFAULT;
 				break;
 		}
-		if (sk->sk_bound_dev_if && pkt.ipi6_ifindex != sk->sk_bound_dev_if)
+		if (!sk_dev_equal_l3scope(sk, pkt.ipi6_ifindex))
 			goto e_inval;

 		np->sticky_pktinfo.ipi6_ifindex = pkt.ipi6_ifindex;
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next v4 8/9] ipv6: handling of multicast packets received in VRF
From: Mike Manning @ 2018-11-02 19:10 UTC (permalink / raw)
  To: netdev; +Cc: Dewi Morgan
In-Reply-To: <20181102191020.14170-1-mmanning@vyatta.att-mail.com>

If the skb for multicast packets marked as enslaved to a VRF are
received, then the secondary device index should be used to obtain
the real device. And verify the multicast address against the
enslaved rather than the l3mdev device.

Signed-off-by: Dewi Morgan <morgand@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
---
 net/ipv6/ip6_input.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 96577e742afd..df58e1100226 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -359,6 +359,8 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
 			}
 		} else if (ipprot->flags & INET6_PROTO_FINAL) {
 			const struct ipv6hdr *hdr;
+			int sdif = inet6_sdif(skb);
+			struct net_device *dev;
 
 			/* Only do this once for first final protocol */
 			have_final = true;
@@ -371,9 +373,19 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
 			skb_postpull_rcsum(skb, skb_network_header(skb),
 					   skb_network_header_len(skb));
 			hdr = ipv6_hdr(skb);
+
+			/* skb->dev passed may be master dev for vrfs. */
+			if (sdif) {
+				dev = dev_get_by_index_rcu(net, sdif);
+				if (!dev)
+					goto discard;
+			} else {
+				dev = skb->dev;
+			}
+
 			if (ipv6_addr_is_multicast(&hdr->daddr) &&
-			    !ipv6_chk_mcast_addr(skb->dev, &hdr->daddr,
-			    &hdr->saddr) &&
+			    !ipv6_chk_mcast_addr(dev, &hdr->daddr,
+						 &hdr->saddr) &&
 			    !ipv6_is_mld(skb, nexthdr, skb_network_header_len(skb)))
 				goto discard;
 		}
@@ -432,15 +444,32 @@ EXPORT_SYMBOL_GPL(ip6_input);
 
 int ip6_mc_input(struct sk_buff *skb)
 {
+	int sdif = inet6_sdif(skb);
 	const struct ipv6hdr *hdr;
+	struct net_device *dev;
 	bool deliver;
 
 	__IP6_UPD_PO_STATS(dev_net(skb_dst(skb)->dev),
 			 __in6_dev_get_safely(skb->dev), IPSTATS_MIB_INMCAST,
 			 skb->len);
 
+	/* skb->dev passed may be master dev for vrfs. */
+	if (sdif) {
+		rcu_read_lock();
+		dev = dev_get_by_index_rcu(dev_net(skb->dev), sdif);
+		if (!dev) {
+			rcu_read_unlock();
+			kfree_skb(skb);
+			return -ENODEV;
+		}
+	} else {
+		dev = skb->dev;
+	}
+
 	hdr = ipv6_hdr(skb);
-	deliver = ipv6_chk_mcast_addr(skb->dev, &hdr->daddr, NULL);
+	deliver = ipv6_chk_mcast_addr(dev, &hdr->daddr, NULL);
+	if (sdif)
+		rcu_read_unlock();
 
 #ifdef CONFIG_IPV6_MROUTE
 	/*
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next v4 9/9] ipv6: do not drop vrf udp multicast packets
From: Mike Manning @ 2018-11-02 19:10 UTC (permalink / raw)
  To: netdev; +Cc: Dewi Morgan
In-Reply-To: <20181102191020.14170-1-mmanning@vyatta.att-mail.com>

From: Dewi Morgan <morgand@vyatta.att-mail.com>

For bound udp sockets in a vrf, also check the sdif to get the index
for ingress devices enslaved to an l3mdev.

Signed-off-by: Dewi Morgan <morgand@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
---
 net/ipv6/udp.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 0559adc2f357..a25571c12a8a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -637,7 +637,7 @@ static int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 static bool __udp_v6_is_mcast_sock(struct net *net, struct sock *sk,
 				   __be16 loc_port, const struct in6_addr *loc_addr,
 				   __be16 rmt_port, const struct in6_addr *rmt_addr,
-				   int dif, unsigned short hnum)
+				   int dif, int sdif, unsigned short hnum)
 {
 	struct inet_sock *inet = inet_sk(sk);
 
@@ -649,7 +649,7 @@ static bool __udp_v6_is_mcast_sock(struct net *net, struct sock *sk,
 	    (inet->inet_dport && inet->inet_dport != rmt_port) ||
 	    (!ipv6_addr_any(&sk->sk_v6_daddr) &&
 		    !ipv6_addr_equal(&sk->sk_v6_daddr, rmt_addr)) ||
-	    (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif) ||
+	    !udp_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif) ||
 	    (!ipv6_addr_any(&sk->sk_v6_rcv_saddr) &&
 		    !ipv6_addr_equal(&sk->sk_v6_rcv_saddr, loc_addr)))
 		return false;
@@ -683,6 +683,7 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	unsigned int offset = offsetof(typeof(*sk), sk_node);
 	unsigned int hash2 = 0, hash2_any = 0, use_hash2 = (hslot->count > 10);
 	int dif = inet6_iif(skb);
+	int sdif = inet6_sdif(skb);
 	struct hlist_node *node;
 	struct sk_buff *nskb;
 
@@ -697,7 +698,8 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 
 	sk_for_each_entry_offset_rcu(sk, node, &hslot->head, offset) {
 		if (!__udp_v6_is_mcast_sock(net, sk, uh->dest, daddr,
-					    uh->source, saddr, dif, hnum))
+					    uh->source, saddr, dif, sdif,
+					    hnum))
 			continue;
 		/* If zero checksum and no_check is not on for
 		 * the socket then skip it.
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next v4 3/9] net: ensure unbound datagram socket to be chosen when not in a VRF
From: Mike Manning @ 2018-11-02 19:10 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20181102191020.14170-1-mmanning@vyatta.att-mail.com>

Ensure an unbound datagram skt is chosen when not in a VRF. The check
for a device match in compute_score() for UDP must be performed when
there is no device match. For this, a failure is returned when there is
no device match. This ensures that bound sockets are never selected,
even if there is no unbound socket.

Allow IPv6 packets to be sent over a datagram skt bound to a VRF. These
packets are currently blocked, as flowi6_oif was set to that of the
master vrf device, and the ipi6_ifindex is that of the slave device.
Allow these packets to be sent by checking the device with ipi6_ifindex
has the same L3 scope as that of the bound device of the skt, which is
the master vrf device. Note that this check always succeeds if the skt
is unbound.

Even though the right datagram skt is now selected by compute_score(),
a different skt is being returned that is bound to the wrong vrf. The
difference between these and stream sockets is the handling of the skt
option for SO_REUSEPORT. While the handling when adding a skt for reuse
correctly checks that the bound device of the skt is a match, the skts
in the hashslot are already incorrect. So for the same hash, a skt for
the wrong vrf may be selected for the required port. The root cause is
that the skt is immediately placed into a slot when it is created,
but when the skt is then bound using SO_BINDTODEVICE, it remains in the
same slot. The solution is to move the skt to the correct slot by
forcing a rehash.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
---
 include/net/udp.h   | 11 +++++++++++
 net/core/sock.c     |  2 ++
 net/ipv4/udp.c      | 15 ++++++---------
 net/ipv6/datagram.c |  5 ++++-
 net/ipv6/udp.c      | 14 +++++---------
 5 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 9e82cb391dea..057972d0eea5 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -252,6 +252,17 @@ static inline int udp_rqueue_get(struct sock *sk)
 	return sk_rmem_alloc_get(sk) - READ_ONCE(udp_sk(sk)->forward_deficit);
 }
 
+static inline bool udp_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+				       int dif, int sdif)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+	return inet_bound_dev_eq(net->ipv4.sysctl_udp_l3mdev_accept,
+				 bound_dev_if, dif, sdif);
+#else
+	return inet_bound_dev_eq(1, bound_dev_if, dif, sdif);
+#endif
+}
+
 /* net/ipv4/udp.c */
 void udp_destruct_sock(struct sock *sk);
 void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
diff --git a/net/core/sock.c b/net/core/sock.c
index 6fcc4bc07d19..6eda848192aa 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -567,6 +567,8 @@ static int sock_setbindtodevice(struct sock *sk, char __user *optval,
 
 	lock_sock(sk);
 	sk->sk_bound_dev_if = index;
+	if (sk->sk_prot->rehash)
+		sk->sk_prot->rehash(sk);
 	sk_dst_reset(sk);
 	release_sock(sk);
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1976fddb9e00..cf73c9194bb6 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -371,6 +371,7 @@ static int compute_score(struct sock *sk, struct net *net,
 {
 	int score;
 	struct inet_sock *inet;
+	bool dev_match;
 
 	if (!net_eq(sock_net(sk), net) ||
 	    udp_sk(sk)->udp_port_hash != hnum ||
@@ -398,15 +399,11 @@ static int compute_score(struct sock *sk, struct net *net,
 		score += 4;
 	}
 
-	if (sk->sk_bound_dev_if || exact_dif) {
-		bool dev_match = (sk->sk_bound_dev_if == dif ||
-				  sk->sk_bound_dev_if == sdif);
-
-		if (!dev_match)
-			return -1;
-		if (sk->sk_bound_dev_if)
-			score += 4;
-	}
+	dev_match = udp_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+					dif, sdif);
+	if (!dev_match)
+		return -1;
+	score += 4;
 
 	if (sk->sk_incoming_cpu == raw_smp_processor_id())
 		score++;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 1ede7a16a0be..4813293d4fad 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -782,7 +782,10 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 
 			if (src_info->ipi6_ifindex) {
 				if (fl6->flowi6_oif &&
-				    src_info->ipi6_ifindex != fl6->flowi6_oif)
+				    src_info->ipi6_ifindex != fl6->flowi6_oif &&
+				    (sk->sk_bound_dev_if != fl6->flowi6_oif ||
+				     !sk_dev_equal_l3scope(
+					     sk, src_info->ipi6_ifindex)))
 					return -EINVAL;
 				fl6->flowi6_oif = src_info->ipi6_ifindex;
 			}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index d2d97d07ef27..0559adc2f357 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -117,6 +117,7 @@ static int compute_score(struct sock *sk, struct net *net,
 {
 	int score;
 	struct inet_sock *inet;
+	bool dev_match;
 
 	if (!net_eq(sock_net(sk), net) ||
 	    udp_sk(sk)->udp_port_hash != hnum ||
@@ -144,15 +145,10 @@ static int compute_score(struct sock *sk, struct net *net,
 		score++;
 	}
 
-	if (sk->sk_bound_dev_if || exact_dif) {
-		bool dev_match = (sk->sk_bound_dev_if == dif ||
-				  sk->sk_bound_dev_if == sdif);
-
-		if (!dev_match)
-			return -1;
-		if (sk->sk_bound_dev_if)
-			score++;
-	}
+	dev_match = udp_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif);
+	if (!dev_match)
+		return -1;
+	score++;
 
 	if (sk->sk_incoming_cpu == raw_smp_processor_id())
 		score++;
-- 
2.11.0

^ permalink raw reply related

* [PATCH 0/6] Add Support for MCAN transceivers in AM65x-evm
From: Faiz Abbas @ 2018-11-02 19:26 UTC (permalink / raw)
  To: linux-kernel, devicetree, netdev, linux-can
  Cc: wg, mkl, robh+dt, mark.rutland, kishon, faiz_abbas

The following patches add support for CAN transceivers in the AM65x-evm.

The legacy transceiver implementation has a transceiver node as a child of the
m_can node with a max-bitrate property which is read directly by the 
of_can_transceiver() API. The transceivers on the present platform however,
require some configuration (pulling the gpio connected to the stb line of
the transceiver low) before they can start sending messages.

The new implementation models the transceiver as a phy and implements the
max-bitrate as a phy attribute.

patch 1 adds the max_bitrate attribute to the phy core. It also implements the API
to be used by the consumer to get the attribute.

patches 2 & 3 implement a generic phy driver for simple implementations.

patches 4,5 & 6 implement the transceiver as a phy to the m_can driver.

Note: Pinmux and GPIO support for am65x-evm are not yet implemented in upstream.
So I tested this implementation with some out of tree patches on top of linux-next.
dts patches will be posted as soon as the above frameworks are available.

Faiz Abbas (6):
  phy: Add max_bitrate attribute & phy_get_max_bitrate()
  dt-bindings: phy: phy-of-simple: Document new binding
  phy: phy-of-simple: Add support for simple generic phy driver
  dt-bindings: can: m_can: Document transceiver implementation as a phy
  dt-bindings: can: can-transceiver: Remove legacy binding documentation
  can: m_can: Add support for transceiver as phy

 .../bindings/net/can/can-transceiver.txt      | 24 -----
 .../devicetree/bindings/net/can/m_can.txt     | 24 +++--
 .../devicetree/bindings/phy/phy-of-simple.txt | 29 ++++++
 drivers/net/can/m_can/m_can.c                 | 23 ++++-
 drivers/phy/Kconfig                           |  7 ++
 drivers/phy/Makefile                          |  1 +
 drivers/phy/phy-of-simple.c                   | 90 +++++++++++++++++++
 include/linux/phy/phy.h                       |  5 ++
 8 files changed, 170 insertions(+), 33 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/net/can/can-transceiver.txt
 create mode 100644 Documentation/devicetree/bindings/phy/phy-of-simple.txt
 create mode 100644 drivers/phy/phy-of-simple.c

-- 
2.18.0

^ permalink raw reply

* [PATCH 5/6] dt-bindings: can: can-transceiver: Remove legacy binding documentation
From: Faiz Abbas @ 2018-11-02 19:26 UTC (permalink / raw)
  To: linux-kernel, devicetree, netdev, linux-can
  Cc: wg, mkl, robh+dt, mark.rutland, kishon, faiz_abbas
In-Reply-To: <20181102192616.28291-1-faiz_abbas@ti.com>

With the transceiver node being implemented as a phy, remove the legacy
dcoumentation. Don't remove the code implementing it to maintain dt
compatibility.

Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>
---
 .../bindings/net/can/can-transceiver.txt      | 24 -------------------
 1 file changed, 24 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/net/can/can-transceiver.txt

diff --git a/Documentation/devicetree/bindings/net/can/can-transceiver.txt b/Documentation/devicetree/bindings/net/can/can-transceiver.txt
deleted file mode 100644
index 0011f53ff159..000000000000
--- a/Documentation/devicetree/bindings/net/can/can-transceiver.txt
+++ /dev/null
@@ -1,24 +0,0 @@
-Generic CAN transceiver Device Tree binding
-------------------------------
-
-CAN transceiver typically limits the max speed in standard CAN and CAN FD
-modes. Typically these limitations are static and the transceivers themselves
-provide no way to detect this limitation at runtime. For this situation,
-the "can-transceiver" node can be used.
-
-Required Properties:
- max-bitrate:	a positive non 0 value that determines the max
-		speed that CAN/CAN-FD can run. Any other value
-		will be ignored.
-
-Examples:
-
-Based on Texas Instrument's TCAN1042HGV CAN Transceiver
-
-m_can0 {
-	....
-	can-transceiver {
-		max-bitrate = <5000000>;
-	};
-	...
-};
-- 
2.18.0

^ permalink raw reply related

* [PATCH net] mlxsw: spectrum: Fix IP2ME CPU policer configuration
From: Ido Schimmel @ 2018-11-02 19:49 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: davem@davemloft.net, Jiri Pirko, Shalom Toledo, mlxsw,
	Ido Schimmel

From: Shalom Toledo <shalomt@mellanox.com>

The CPU policer used to police packets being trapped via a local route
(IP2ME) was incorrectly configured to police based on bytes per second
instead of packets per second.

Change the policer to police based on packets per second and avoid
packet loss under certain circumstances.

Fixes: 9148e7cf73ce ("mlxsw: spectrum: Add policers for trap groups")
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index a2df12b79f8e..9bec940330a4 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -3568,7 +3568,6 @@ static int mlxsw_sp_cpu_policers_set(struct mlxsw_core *mlxsw_core)
 			burst_size = 7;
 			break;
 		case MLXSW_REG_HTGT_TRAP_GROUP_SP_IP2ME:
-			is_bytes = true;
 			rate = 4 * 1024;
 			burst_size = 4;
 			break;
-- 
2.17.2

^ permalink raw reply related

* Re: [PATCH 3/6] phy: phy-of-simple: Add support for simple generic phy driver
From: kbuild test robot @ 2018-11-03  5:04 UTC (permalink / raw)
  To: Faiz Abbas
  Cc: kbuild-all, linux-kernel, devicetree, netdev, linux-can, wg, mkl,
	robh+dt, mark.rutland, kishon, faiz_abbas
In-Reply-To: <20181102192616.28291-4-faiz_abbas@ti.com>

[-- Attachment #1: Type: text/plain, Size: 4620 bytes --]

Hi Faiz,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.19 next-20181102]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Faiz-Abbas/Add-Support-for-MCAN-transceivers-in-AM65x-evm/20181103-103548
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=sh 

All error/warnings (new ones prefixed by >>):

   In file included from drivers//phy/phy-of-simple.c:10:0:
>> drivers//phy/phy-of-simple.c:76:25: error: 'phy_simple_phy_dt_ids' undeclared here (not in a function); did you mean 'phy_simple_dt_ids'?
    MODULE_DEVICE_TABLE(of, phy_simple_phy_dt_ids);
                            ^
   include/linux/module.h:213:15: note: in definition of macro 'MODULE_DEVICE_TABLE'
    extern typeof(name) __mod_##type##__##name##_device_table  \
                  ^~~~
>> include/linux/module.h:213:21: error: '__mod_of__phy_simple_phy_dt_ids_device_table' aliased to undefined symbol 'phy_simple_phy_dt_ids'
    extern typeof(name) __mod_##type##__##name##_device_table  \
                        ^
>> drivers//phy/phy-of-simple.c:76:1: note: in expansion of macro 'MODULE_DEVICE_TABLE'
    MODULE_DEVICE_TABLE(of, phy_simple_phy_dt_ids);
    ^~~~~~~~~~~~~~~~~~~
--
   In file included from drivers/phy/phy-of-simple.c:10:0:
   drivers/phy/phy-of-simple.c:76:25: error: 'phy_simple_phy_dt_ids' undeclared here (not in a function); did you mean 'phy_simple_dt_ids'?
    MODULE_DEVICE_TABLE(of, phy_simple_phy_dt_ids);
                            ^
   include/linux/module.h:213:15: note: in definition of macro 'MODULE_DEVICE_TABLE'
    extern typeof(name) __mod_##type##__##name##_device_table  \
                  ^~~~
>> include/linux/module.h:213:21: error: '__mod_of__phy_simple_phy_dt_ids_device_table' aliased to undefined symbol 'phy_simple_phy_dt_ids'
    extern typeof(name) __mod_##type##__##name##_device_table  \
                        ^
   drivers/phy/phy-of-simple.c:76:1: note: in expansion of macro 'MODULE_DEVICE_TABLE'
    MODULE_DEVICE_TABLE(of, phy_simple_phy_dt_ids);
    ^~~~~~~~~~~~~~~~~~~

vim +76 drivers//phy/phy-of-simple.c

  > 10	#include <linux/module.h>
    11	#include <linux/regulator/consumer.h>
    12	
    13	static int phy_simple_power_on(struct phy *phy)
    14	{
    15		if (phy->pwr)
    16			return regulator_enable(phy->pwr);
    17	
    18		return 0;
    19	}
    20	
    21	static int phy_simple_power_off(struct phy *phy)
    22	{
    23		if (phy->pwr)
    24			return regulator_disable(phy->pwr);
    25	
    26		return 0;
    27	}
    28	
    29	static const struct phy_ops phy_simple_ops = {
    30		.power_on	= phy_simple_power_on,
    31		.power_off	= phy_simple_power_off,
    32		.owner		= THIS_MODULE,
    33	};
    34	
    35	int phy_simple_probe(struct platform_device *pdev)
    36	{
    37		struct phy_provider *phy_provider;
    38		struct device *dev = &pdev->dev;
    39		struct regulator *pwr = NULL;
    40		struct phy *phy;
    41		u32 bus_width = 0;
    42		u32 max_bitrate = 0;
    43		int ret;
    44	
    45		phy = devm_phy_create(dev, dev->of_node,
    46					      &phy_simple_ops);
    47	
    48		if (IS_ERR(phy)) {
    49			dev_err(dev, "Failed to create phy\n");
    50			return PTR_ERR(phy);
    51		}
    52	
    53		device_property_read_u32(dev, "bus-width", &bus_width);
    54		phy->attrs.bus_width = bus_width;
    55		device_property_read_u32(dev, "max-bitrate", &max_bitrate);
    56		phy->attrs.max_bitrate = max_bitrate;
    57	
    58		pwr = devm_regulator_get_optional(dev, "pwr");
    59		if (IS_ERR(pwr)) {
    60			ret = PTR_ERR(pwr);
    61			dev_err(dev, "Couldn't get regulator. ret=%d\n", ret);
    62			return ret;
    63		}
    64		phy->pwr = pwr;
    65	
    66		phy_provider = devm_of_phy_provider_register(dev, of_phy_simple_xlate);
    67	
    68		return PTR_ERR_OR_ZERO(phy_provider);
    69	}
    70	
    71	static const struct of_device_id phy_simple_dt_ids[] = {
    72		{ .compatible = "simple-phy"},
    73		{}
    74	};
    75	
  > 76	MODULE_DEVICE_TABLE(of, phy_simple_phy_dt_ids);
    77	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 50346 bytes --]

^ permalink raw reply

* [PATCH net v7] net/ipv6: Add anycast addresses to a global hashtable
From: Jeff Barnhill @ 2018-11-02 20:23 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuznet, yoshfuji, Jeff Barnhill
In-Reply-To: <20181031223420.36e03da9@xeon-e3>

icmp6_send() function is expensive on systems with a large number of
interfaces. Every time it’s called, it has to verify that the source
address does not correspond to an existing anycast address by looping
through every device and every anycast address on the device.  This can
result in significant delays for a CPU when there are a large number of
neighbors and ND timers are frequently timing out and calling
neigh_invalidate().

Add anycast addresses to a global hashtable to allow quick searching for
matching anycast addresses.  This is based on inet6_addr_lst in addrconf.c.

Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
---
 include/net/addrconf.h |  2 ++
 include/net/if_inet6.h |  2 ++
 net/ipv6/af_inet6.c    |  5 ++++
 net/ipv6/anycast.c     | 80 +++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 14b789a123e7..1656c5978498 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -317,6 +317,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 			 const struct in6_addr *addr);
 bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
 			     const struct in6_addr *addr);
+int ipv6_anycast_init(void);
+void ipv6_anycast_cleanup(void);
 
 /* Device notifier */
 int register_inet6addr_notifier(struct notifier_block *nb);
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index d7578cf49c3a..c9c78c15bce0 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -146,10 +146,12 @@ struct ifacaddr6 {
 	struct in6_addr		aca_addr;
 	struct fib6_info	*aca_rt;
 	struct ifacaddr6	*aca_next;
+	struct hlist_node	aca_addr_lst;
 	int			aca_users;
 	refcount_t		aca_refcnt;
 	unsigned long		aca_cstamp;
 	unsigned long		aca_tstamp;
+	struct rcu_head		rcu;
 };
 
 #define	IFA_HOST	IPV6_ADDR_LOOPBACK
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 3f4d61017a69..f0cd291034f0 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1001,6 +1001,9 @@ static int __init inet6_init(void)
 	err = ip6_flowlabel_init();
 	if (err)
 		goto ip6_flowlabel_fail;
+	err = ipv6_anycast_init();
+	if (err)
+		goto ipv6_anycast_fail;
 	err = addrconf_init();
 	if (err)
 		goto addrconf_fail;
@@ -1091,6 +1094,8 @@ static int __init inet6_init(void)
 ipv6_exthdrs_fail:
 	addrconf_cleanup();
 addrconf_fail:
+	ipv6_anycast_cleanup();
+ipv6_anycast_fail:
 	ip6_flowlabel_cleanup();
 ip6_flowlabel_fail:
 	ndisc_late_cleanup();
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 4e0ff7031edd..7698637cf827 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -44,8 +44,22 @@
 
 #include <net/checksum.h>
 
+#define IN6_ADDR_HSIZE_SHIFT	8
+#define IN6_ADDR_HSIZE		BIT(IN6_ADDR_HSIZE_SHIFT)
+/*	anycast address hash table
+ */
+static struct hlist_head inet6_acaddr_lst[IN6_ADDR_HSIZE];
+static DEFINE_SPINLOCK(acaddr_hash_lock);
+
 static int ipv6_dev_ac_dec(struct net_device *dev, const struct in6_addr *addr);
 
+static u32 inet6_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+	u32 val = ipv6_addr_hash(addr) ^ net_hash_mix(net);
+
+	return hash_32(val, IN6_ADDR_HSIZE_SHIFT);
+}
+
 /*
  *	socket join an anycast group
  */
@@ -204,16 +218,39 @@ void ipv6_sock_ac_close(struct sock *sk)
 	rtnl_unlock();
 }
 
+static void ipv6_add_acaddr_hash(struct net *net, struct ifacaddr6 *aca)
+{
+	unsigned int hash = inet6_acaddr_hash(net, &aca->aca_addr);
+
+	spin_lock(&acaddr_hash_lock);
+	hlist_add_head_rcu(&aca->aca_addr_lst, &inet6_acaddr_lst[hash]);
+	spin_unlock(&acaddr_hash_lock);
+}
+
+static void ipv6_del_acaddr_hash(struct ifacaddr6 *aca)
+{
+	spin_lock(&acaddr_hash_lock);
+	hlist_del_init_rcu(&aca->aca_addr_lst);
+	spin_unlock(&acaddr_hash_lock);
+}
+
 static void aca_get(struct ifacaddr6 *aca)
 {
 	refcount_inc(&aca->aca_refcnt);
 }
 
+static void aca_free_rcu(struct rcu_head *h)
+{
+	struct ifacaddr6 *aca = container_of(h, struct ifacaddr6, rcu);
+
+	fib6_info_release(aca->aca_rt);
+	kfree(aca);
+}
+
 static void aca_put(struct ifacaddr6 *ac)
 {
 	if (refcount_dec_and_test(&ac->aca_refcnt)) {
-		fib6_info_release(ac->aca_rt);
-		kfree(ac);
+		call_rcu(&ac->rcu, aca_free_rcu);
 	}
 }
 
@@ -229,6 +266,7 @@ static struct ifacaddr6 *aca_alloc(struct fib6_info *f6i,
 	aca->aca_addr = *addr;
 	fib6_info_hold(f6i);
 	aca->aca_rt = f6i;
+	INIT_HLIST_NODE(&aca->aca_addr_lst);
 	aca->aca_users = 1;
 	/* aca_tstamp should be updated upon changes */
 	aca->aca_cstamp = aca->aca_tstamp = jiffies;
@@ -285,6 +323,8 @@ int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr)
 	aca_get(aca);
 	write_unlock_bh(&idev->lock);
 
+	ipv6_add_acaddr_hash(net, aca);
+
 	ip6_ins_rt(net, f6i);
 
 	addrconf_join_solict(idev->dev, &aca->aca_addr);
@@ -325,6 +365,7 @@ int __ipv6_dev_ac_dec(struct inet6_dev *idev, const struct in6_addr *addr)
 	else
 		idev->ac_list = aca->aca_next;
 	write_unlock_bh(&idev->lock);
+	ipv6_del_acaddr_hash(aca);
 	addrconf_leave_solict(idev, &aca->aca_addr);
 
 	ip6_del_rt(dev_net(idev->dev), aca->aca_rt);
@@ -352,6 +393,8 @@ void ipv6_ac_destroy_dev(struct inet6_dev *idev)
 		idev->ac_list = aca->aca_next;
 		write_unlock_bh(&idev->lock);
 
+		ipv6_del_acaddr_hash(aca);
+
 		addrconf_leave_solict(idev, &aca->aca_addr);
 
 		ip6_del_rt(dev_net(idev->dev), aca->aca_rt);
@@ -390,17 +433,25 @@ static bool ipv6_chk_acast_dev(struct net_device *dev, const struct in6_addr *ad
 bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 			 const struct in6_addr *addr)
 {
+	unsigned int hash = inet6_acaddr_hash(net, addr);
+	struct net_device *nh_dev;
+	struct ifacaddr6 *aca;
 	bool found = false;
 
 	rcu_read_lock();
 	if (dev)
 		found = ipv6_chk_acast_dev(dev, addr);
 	else
-		for_each_netdev_rcu(net, dev)
-			if (ipv6_chk_acast_dev(dev, addr)) {
+		hlist_for_each_entry_rcu(aca, &inet6_acaddr_lst[hash],
+					 aca_addr_lst) {
+			nh_dev = fib6_info_nh_dev(aca->aca_rt);
+			if (!nh_dev || !net_eq(dev_net(nh_dev), net))
+				continue;
+			if (ipv6_addr_equal(&aca->aca_addr, addr)) {
 				found = true;
 				break;
 			}
+		}
 	rcu_read_unlock();
 	return found;
 }
@@ -539,4 +590,25 @@ void ac6_proc_exit(struct net *net)
 {
 	remove_proc_entry("anycast6", net->proc_net);
 }
+
+/*	Init / cleanup code
+ */
+int __init ipv6_anycast_init(void)
+{
+	int i;
+
+	for (i = 0; i < IN6_ADDR_HSIZE; i++)
+		INIT_HLIST_HEAD(&inet6_acaddr_lst[i]);
+	return 0;
+}
+
+void ipv6_anycast_cleanup(void)
+{
+	int i;
+
+	spin_lock(&acaddr_hash_lock);
+	for (i = 0; i < IN6_ADDR_HSIZE; i++)
+		WARN_ON(!hlist_empty(&inet6_acaddr_lst[i]));
+	spin_unlock(&acaddr_hash_lock);
+}
 #endif
-- 
2.14.1

^ permalink raw reply related

* Re: [PATCH v2 bpf 0/3] show more accurrate bpf program address
From: Daniel Borkmann @ 2018-11-02 20:45 UTC (permalink / raw)
  To: Song Liu, netdev; +Cc: kernel-team, ast, sandipan
In-Reply-To: <20181102171617.310178-1-songliubraving@fb.com>

On 11/02/2018 06:16 PM, Song Liu wrote:
> Changes v1 -> v2:
> 1. Added main program length to bpf_prog_info->jited_fun_lens (3/3).
> 2. Updated commit message of 1/3 and 2/3 with more background about the
>    address masking, and why it is still save after the changes.
> 3. Replace "ulong" with "unsigned long".
> 
> This set improves bpf program address showed in /proc/kallsyms and in
> bpf_prog_info. First, real program address is showed instead of page
> address. Second, when there is no subprogram, bpf_prog_info->jited_ksyms
> and bpf_prog_info->jited_fun_lens returns the main prog address and
> length.
> 
> Song Liu (3):
>   bpf: show real jited prog address in /proc/kallsyms
>   bpf: show real jited address in bpf_prog_info->jited_ksyms
>   bpf: show main program address and length in bpf_prog_info
> 
>  kernel/bpf/core.c    |  4 +---
>  kernel/bpf/syscall.c | 34 ++++++++++++++++++++++++----------
>  2 files changed, 25 insertions(+), 13 deletions(-)
> 
> --
> 2.17.1
> 

Applied to bpf, thanks!

^ permalink raw reply

* Re: [PATCH bpf] bpf: fix bpf_prog_get_info_by_fd to return 0 func_lens for unpriv
From: Alexei Starovoitov @ 2018-11-02 20:54 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: ast, netdev, Sandipan Das, Song Liu
In-Reply-To: <20181102103546.4499-1-daniel@iogearbox.net>

On Fri, Nov 02, 2018 at 11:35:46AM +0100, Daniel Borkmann wrote:
> While dbecd7388476 ("bpf: get kernel symbol addresses via syscall")
> zeroed info.nr_jited_ksyms in bpf_prog_get_info_by_fd() for queries
> from unprivileged users, commit 815581c11cc2 ("bpf: get JITed image
> lengths of functions via syscall") forgot about doing so and therefore
> returns the #elems of the user set up buffer which is incorrect. It
> also needs to indicate a info.nr_jited_func_lens of zero.
> 
> Fixes: 815581c11cc2 ("bpf: get JITed image lengths of functions via syscall")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Sandipan Das <sandipan@linux.vnet.ibm.com>
> Cc: Song Liu <songliubraving@fb.com>

Applied, Thanks

^ permalink raw reply

* Re: [PATCH] bonding:avoid repeated display of same link status change
From: David Miller @ 2018-11-03  6:31 UTC (permalink / raw)
  To: mk.singh
  Cc: netdev, eric.dumazet, mkubecek, j.vosburgh, vfalico, andy,
	linux-kernel
In-Reply-To: <20181031105729.7442-1-mk.singh@oracle.com>

From: mk.singh@oracle.com
Date: Wed, 31 Oct 2018 16:27:28 +0530

> -			if (slave->delay) {
> +			if (slave->delay &&
> +			    !atomic64_read(&bond->rtnl_needed)) {
 ...
> +			    !atomic64_read(&bond->rtnl_needed)) {
 ...
> +			atomic64_set(&bond->rtnl_needed, 1);
 ...
> +		atomic64_set(&bond->rtnl_needed, 0);
 ...
> @@ -229,6 +229,7 @@ struct bonding {
>  	struct	 dentry *debug_dir;
>  #endif /* CONFIG_DEBUG_FS */
>  	struct rtnl_link_stats64 bond_stats;
> +	atomic64_t rtnl_needed;

There is nothing "atomic" about a value that is only set and read.

And using a full 64-bit value for something taking on only '0' and
'1' is unnecessary as well.

^ permalink raw reply

* Re: [PATCH] net: move ‘__zerocopy_sg_from_iter’ prototype to header file <linux/skbuff.h>
From: David Miller @ 2018-11-03  6:33 UTC (permalink / raw)
  To: malat; +Cc: linux-kernel, netdev
In-Reply-To: <20181031113500.3763-1-malat@debian.org>

From: Mathieu Malaterre <malat@debian.org>
Date: Wed, 31 Oct 2018 12:34:59 +0100

> This makes it clear the function is part of the API. Also this will
> remove a warning triggered at W=1:
> 
>   net/core/datagram.c:581:5: warning: no previous prototype for ‘__zerocopy_sg_from_iter’ [-Wmissing-prototypes]
> 
> Signed-off-by: Mathieu Malaterre <malat@debian.org>

It's not part of the "API", and it shouldn't even be exported to
modules.

Only net/core/skbuff.c calls it, and that is never modular.

^ permalink raw reply

* Re: [PATCH] net: document skb parameter in function 'skb_gso_size_check'
From: David Miller @ 2018-11-03  6:34 UTC (permalink / raw)
  To: malat; +Cc: netdev, linux-kernel
In-Reply-To: <20181031121659.30764-1-malat@debian.org>

From: Mathieu Malaterre <malat@debian.org>
Date: Wed, 31 Oct 2018 13:16:58 +0100

> Remove kernel-doc warning:
> 
>   net/core/skbuff.c:4953: warning: Function parameter or member 'skb' not described in 'skb_gso_size_check'
> 
> Signed-off-by: Mathieu Malaterre <malat@debian.org>

Applied.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox