Netdev List
 help / color / mirror / Atom feed
* Re: [Xen-devel] Question on xen-netfront code to fix a potential ring buffer corruption
From: Dongli Zhang @ 2019-08-27  6:43 UTC (permalink / raw)
  To: Juergen Gross, xen-devel; +Cc: netdev, Joe Jin
In-Reply-To: <130ea0ab-4364-2b77-dc8d-b869e06d1768@suse.com>

Hi Juergen,

On 8/27/19 2:13 PM, Juergen Gross wrote:
> On 18.08.19 10:31, Dongli Zhang wrote:
>> Hi,
>>
>> Would you please help confirm why the condition at line 908 is ">="?
>>
>> In my opinion, we would only hit "skb_shinfo(skb)->nr_frag == MAX_SKB_FRAGS" at
>> line 908.
>>
>> 890 static RING_IDX xennet_fill_frags(struct netfront_queue *queue,
>> 891                                   struct sk_buff *skb,
>> 892                                   struct sk_buff_head *list)
>> 893 {
>> 894         RING_IDX cons = queue->rx.rsp_cons;
>> 895         struct sk_buff *nskb;
>> 896
>> 897         while ((nskb = __skb_dequeue(list))) {
>> 898                 struct xen_netif_rx_response *rx =
>> 899                         RING_GET_RESPONSE(&queue->rx, ++cons);
>> 900                 skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
>> 901
>> 902                 if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS) {
>> 903                         unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
>> 904
>> 905                         BUG_ON(pull_to < skb_headlen(skb));
>> 906                         __pskb_pull_tail(skb, pull_to - skb_headlen(skb));
>> 907                 }
>> 908                 if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)) {
>> 909                         queue->rx.rsp_cons = ++cons;
>> 910                         kfree_skb(nskb);
>> 911                         return ~0U;
>> 912                 }
>> 913
>> 914                 skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
>> 915                                 skb_frag_page(nfrag),
>> 916                                 rx->offset, rx->status, PAGE_SIZE);
>> 917
>> 918                 skb_shinfo(nskb)->nr_frags = 0;
>> 919                 kfree_skb(nskb);
>> 920         }
>> 921
>> 922         return cons;
>> 923 }
>>
>>
>> The reason that I ask about this is because I am considering below patch to
>> avoid a potential xen-netfront ring buffer corruption.
>>
>> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
>> index 8d33970..48a2162 100644
>> --- a/drivers/net/xen-netfront.c
>> +++ b/drivers/net/xen-netfront.c
>> @@ -906,7 +906,7 @@ static RING_IDX xennet_fill_frags(struct netfront_queue
>> *queue,
>>                          __pskb_pull_tail(skb, pull_to - skb_headlen(skb));
>>                  }
>>                  if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)) {
>> -                       queue->rx.rsp_cons = ++cons;
>> +                       queue->rx.rsp_cons = cons + skb_queue_len(list) + 1;
>>                          kfree_skb(nskb);
>>                          return ~0U;
>>                  }
>>
>>
>> If there is skb left in list when we return ~0U, queue->rx.rsp_cons may be set
>> incorrectly.
> 
> Sa basically you want to consume the responses for all outstanding skbs
> in the list?
> 

I think there would be bug if there is skb left in the list.

This is what is implanted in xennet_poll() when there is error of processing
extra info like below. As at line 1034, if there is error, all outstanding skb
are consumed.

 985 static int xennet_poll(struct napi_struct *napi, int budget)
... ...
1028                 if (extras[XEN_NETIF_EXTRA_TYPE_GSO - 1].type) {
1029                         struct xen_netif_extra_info *gso;
1030                         gso = &extras[XEN_NETIF_EXTRA_TYPE_GSO - 1];
1031
1032                         if (unlikely(xennet_set_skb_gso(skb, gso))) {
1033                                 __skb_queue_head(&tmpq, skb);
1034                                 queue->rx.rsp_cons += skb_queue_len(&tmpq);
1035                                 goto err;
1036                         }
1037                 }

The reason we need to consume all outstanding skb is because
xennet_get_responses() would reset both queue->rx_skbs[i] and
queue->grant_rx_ref[i] to NULL before enqueue all outstanding skb to the list
(e.g., &tmpq), by xennet_get_rx_skb() and xennet_get_rx_ref().

If we do not consume all of them, we would hit NULL in queue->rx_skbs[i] in next
iteration of while loop in xennet_poll().

That is, if we do not consume all outstanding skb, the queue->rx.rsp_cons may
refer to a response whose queue->rx_skbs[i] and queue->grant_rx_ref[i] are
already reset to NULL.

Dongli Zhang

>>
>> While I am trying to make up a case that would hit the corruption, I could not
>> explain why (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)), but not
>> just "==". Perhaps __pskb_pull_tail() may fail although pull_to is less than
>> skb_headlen(skb).
> 
> I don't think nr_frags can be larger than MAX_SKB_FRAGS. OTOH I don't
> think it hurts to have a safety net here in order to avoid problems.
> 
> Originally this was BUG_ON(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)
> so that might explain the ">=".
> 
> 
> Juergen
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply

* [PATCH v2] ath10k: Adjust skb length in ath10k_sdio_mbox_rx_packet
From: Nicolas Boichat @ 2019-08-27  6:58 UTC (permalink / raw)
  To: Kalle Valo
  Cc: David S . Miller, ath10k, linux-wireless, netdev, linux-kernel,
	wgong, Niklas Cassel, Alagu Sankar, briannorris, tientzu

When the FW bundles multiple packets, pkt->act_len may be incorrect
as it refers to the first packet only (however, the FW will only
bundle packets that fit into the same pkt->alloc_len).

Before this patch, the skb length would be set (incorrectly) to
pkt->act_len in ath10k_sdio_mbox_rx_packet, and then later manually
adjusted in ath10k_sdio_mbox_rx_process_packet.

The first problem is that ath10k_sdio_mbox_rx_process_packet does not
use proper skb_put commands to adjust the length (it directly changes
skb->len), so we end up with a mismatch between skb->head + skb->tail
and skb->data + skb->len. This is quite serious, and causes corruptions
in the TCP stack, as the stack tries to coalesce packets, and relies
on skb->tail being correct (that is, skb_tail_pointer must point to
the first byte_after_ the data).

Instead of re-adjusting the size in ath10k_sdio_mbox_rx_process_packet,
this moves the code to ath10k_sdio_mbox_rx_packet, and also add a
bounds check, as skb_put would crash the kernel if not enough space is
available.

Fixes: 8530b4e7b22bc3b ("ath10k: sdio: set skb len for all rx packets")
Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>

---

One simple way to reproduce this issue is this scriplet, that sends a
lot of small packets over SSH (it usually fails before reaching 300):
(for i in `seq 1 300`; do echo $i; sleep 0.1; done) | ssh $IP cat

I was not able to check the original use case why the code was added
(packets > 1500 bytes), as the FW on my board crashes when sending
such large packets.

 drivers/net/wireless/ath/ath10k/sdio.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index 8ed4fbd8d6c3888..0a3ac44a13698c1 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -381,16 +381,11 @@ static int ath10k_sdio_mbox_rx_process_packet(struct ath10k *ar,
 	struct ath10k_htc_hdr *htc_hdr = (struct ath10k_htc_hdr *)skb->data;
 	bool trailer_present = htc_hdr->flags & ATH10K_HTC_FLAG_TRAILER_PRESENT;
 	enum ath10k_htc_ep_id eid;
-	u16 payload_len;
 	u8 *trailer;
 	int ret;
 
-	payload_len = le16_to_cpu(htc_hdr->len);
-	skb->len = payload_len + sizeof(struct ath10k_htc_hdr);
-
 	if (trailer_present) {
-		trailer = skb->data + sizeof(*htc_hdr) +
-			  payload_len - htc_hdr->trailer_len;
+		trailer = skb->data + skb->len - htc_hdr->trailer_len;
 
 		eid = pipe_id_to_eid(htc_hdr->eid);
 
@@ -636,9 +631,23 @@ static int ath10k_sdio_mbox_rx_packet(struct ath10k *ar,
 
 	ret = ath10k_sdio_readsb(ar, ar_sdio->mbox_info.htc_addr,
 				 skb->data, pkt->alloc_len);
+	if (!ret) {
+		/* Update actual length. The original length may be incorrect,
+		 * as the FW will bundle multiple packets as long as their sizes
+		 * fit within the same aligned length (pkt->alloc_len).
+		 */
+		struct ath10k_htc_hdr *htc_hdr =
+			(struct ath10k_htc_hdr *)skb->data;
+		pkt->act_len = le16_to_cpu(htc_hdr->len) + sizeof(*htc_hdr);
+		if (pkt->act_len <= pkt->alloc_len) {
+			skb_put(skb, pkt->act_len);
+		} else {
+			ath10k_warn(ar, "rx_packet too large (%d > %d)\n",
+				    pkt->act_len, pkt->alloc_len);
+			ret = -EMSGSIZE;
+		}
+	}
 	pkt->status = ret;
-	if (!ret)
-		skb_put(skb, pkt->act_len);
 
 	return ret;
 }
-- 
2.23.0.187.g17f5b7556c-goog


^ permalink raw reply related

* Re: [PATCH net] xfrm: add NETDEV_UNREGISTER event process for xfrmi
From: Steffen Klassert @ 2019-08-27  6:58 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, davem
In-Reply-To: <fce85b872c03cee379cb30ae46100ff42bea8e0d.1566807905.git.lucien.xin@gmail.com>

On Mon, Aug 26, 2019 at 04:25:05PM +0800, Xin Long wrote:
> When creating a xfrmi dev, it always holds the phydev. The xfrmi dev should
> be deleted when the phydev is being unregistered, so that the phydev can be
> put on time. Otherwise the phydev's deleting will get stuck:
> 
>   # ip link add dummy10 type dummy
>   # ip link add xfrmi10 type xfrm dev dummy10
>   # ip link del dummy10
> 
> The last command blocks and dmesg shows:
> 
>   unregister_netdevice: waiting for dummy10 to become free. Usage count = 1
> 
> This patch fixes it by adding NETDEV_UNREGISTER event process for xfrmi.

There is already a patch in the ipsec tree that fixes this issue
in a different way. Please base xfrm patches on the ipsec/ipsec-next
tree to avoid double work.

Thanks anyway!

^ permalink raw reply

* Re: [PATCH net] xfrm: remove the unnecessary .net_exit for xfrmi
From: Steffen Klassert @ 2019-08-27  7:01 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, davem
In-Reply-To: <120f5509a5c9292b437041e8a4193653adb9a019.1566807951.git.lucien.xin@gmail.com>

On Mon, Aug 26, 2019 at 04:25:51PM +0800, Xin Long wrote:
> The xfrm_if(s) on each netns can be deleted when its xfrmi dev is
> deleted. xfrmi dev's removal can happen when:
> 
>   1. Its phydev is being deleted and NETDEV_UNREGISTER event is
>      processed in xfrmi_event() from my last patch.
>   2. netns is being removed and all xfrmi devs will be deleted.
>   3. rtnl_link_unregister(&xfrmi_link_ops) in xfrmi_fini() when
>      xfrm_interface.ko is being unloaded.
> 
> So there's no need to use xfrmi_exit_net() to clean any xfrm_if up.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

We already have a bunch of xfrm interface fixes in the ipsec tree.
Please base this patch onto the ipsec tree and adjust if needed.

Thanks!

^ permalink raw reply

* Re: [Xen-devel] Question on xen-netfront code to fix a potential ring buffer corruption
From: Juergen Gross @ 2019-08-27  7:04 UTC (permalink / raw)
  To: Dongli Zhang, xen-devel; +Cc: Joe Jin, netdev
In-Reply-To: <9284025e-1066-387e-a52f-c46d4c66d2d3@oracle.com>

On 27.08.19 08:43, Dongli Zhang wrote:
> Hi Juergen,
> 
> On 8/27/19 2:13 PM, Juergen Gross wrote:
>> On 18.08.19 10:31, Dongli Zhang wrote:
>>> Hi,
>>>
>>> Would you please help confirm why the condition at line 908 is ">="?
>>>
>>> In my opinion, we would only hit "skb_shinfo(skb)->nr_frag == MAX_SKB_FRAGS" at
>>> line 908.
>>>
>>> 890 static RING_IDX xennet_fill_frags(struct netfront_queue *queue,
>>> 891                                   struct sk_buff *skb,
>>> 892                                   struct sk_buff_head *list)
>>> 893 {
>>> 894         RING_IDX cons = queue->rx.rsp_cons;
>>> 895         struct sk_buff *nskb;
>>> 896
>>> 897         while ((nskb = __skb_dequeue(list))) {
>>> 898                 struct xen_netif_rx_response *rx =
>>> 899                         RING_GET_RESPONSE(&queue->rx, ++cons);
>>> 900                 skb_frag_t *nfrag = &skb_shinfo(nskb)->frags[0];
>>> 901
>>> 902                 if (skb_shinfo(skb)->nr_frags == MAX_SKB_FRAGS) {
>>> 903                         unsigned int pull_to = NETFRONT_SKB_CB(skb)->pull_to;
>>> 904
>>> 905                         BUG_ON(pull_to < skb_headlen(skb));
>>> 906                         __pskb_pull_tail(skb, pull_to - skb_headlen(skb));
>>> 907                 }
>>> 908                 if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)) {
>>> 909                         queue->rx.rsp_cons = ++cons;
>>> 910                         kfree_skb(nskb);
>>> 911                         return ~0U;
>>> 912                 }
>>> 913
>>> 914                 skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
>>> 915                                 skb_frag_page(nfrag),
>>> 916                                 rx->offset, rx->status, PAGE_SIZE);
>>> 917
>>> 918                 skb_shinfo(nskb)->nr_frags = 0;
>>> 919                 kfree_skb(nskb);
>>> 920         }
>>> 921
>>> 922         return cons;
>>> 923 }
>>>
>>>
>>> The reason that I ask about this is because I am considering below patch to
>>> avoid a potential xen-netfront ring buffer corruption.
>>>
>>> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
>>> index 8d33970..48a2162 100644
>>> --- a/drivers/net/xen-netfront.c
>>> +++ b/drivers/net/xen-netfront.c
>>> @@ -906,7 +906,7 @@ static RING_IDX xennet_fill_frags(struct netfront_queue
>>> *queue,
>>>                           __pskb_pull_tail(skb, pull_to - skb_headlen(skb));
>>>                   }
>>>                   if (unlikely(skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)) {
>>> -                       queue->rx.rsp_cons = ++cons;
>>> +                       queue->rx.rsp_cons = cons + skb_queue_len(list) + 1;
>>>                           kfree_skb(nskb);
>>>                           return ~0U;
>>>                   }
>>>
>>>
>>> If there is skb left in list when we return ~0U, queue->rx.rsp_cons may be set
>>> incorrectly.
>>
>> Sa basically you want to consume the responses for all outstanding skbs
>> in the list?
>>
> 
> I think there would be bug if there is skb left in the list.
> 
> This is what is implanted in xennet_poll() when there is error of processing
> extra info like below. As at line 1034, if there is error, all outstanding skb
> are consumed.
> 
>   985 static int xennet_poll(struct napi_struct *napi, int budget)
> ... ...
> 1028                 if (extras[XEN_NETIF_EXTRA_TYPE_GSO - 1].type) {
> 1029                         struct xen_netif_extra_info *gso;
> 1030                         gso = &extras[XEN_NETIF_EXTRA_TYPE_GSO - 1];
> 1031
> 1032                         if (unlikely(xennet_set_skb_gso(skb, gso))) {
> 1033                                 __skb_queue_head(&tmpq, skb);
> 1034                                 queue->rx.rsp_cons += skb_queue_len(&tmpq);
> 1035                                 goto err;
> 1036                         }
> 1037                 }
> 
> The reason we need to consume all outstanding skb is because
> xennet_get_responses() would reset both queue->rx_skbs[i] and
> queue->grant_rx_ref[i] to NULL before enqueue all outstanding skb to the list
> (e.g., &tmpq), by xennet_get_rx_skb() and xennet_get_rx_ref().
> 
> If we do not consume all of them, we would hit NULL in queue->rx_skbs[i] in next
> iteration of while loop in xennet_poll().
> 
> That is, if we do not consume all outstanding skb, the queue->rx.rsp_cons may
> refer to a response whose queue->rx_skbs[i] and queue->grant_rx_ref[i] are
> already reset to NULL.

Thanks for confirming. I just wanted to make sure I understand your
patch correctly. :-)


Juergen

^ permalink raw reply

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: Jiri Pirko @ 2019-08-27  7:08 UTC (permalink / raw)
  To: David Miller
  Cc: jakub.kicinski, dsahern, roopa, netdev, sthemmin, dcbw, mkubecek,
	andrew, parav, saeedm, mlxsw
In-Reply-To: <20190826.151819.804077961408964282.davem@davemloft.net>

Tue, Aug 27, 2019 at 12:18:19AM CEST, davem@davemloft.net wrote:
>From: Jakub Kicinski <jakub.kicinski@netronome.com>
>Date: Mon, 26 Aug 2019 15:15:52 -0700
>
>> Weren't there multiple problems with the size of the RTM_NEWLINK
>> notification already? Adding multiple sizeable strings to it won't
>> help there either :S
>
>Indeed.
>
>We even had situations where we had to make the information provided
>in a newlink dump opt-in when we added VF info because the buffers
>glibc was using at the time were too small and this broke so much
>stuff.
>
>I honestly think that the size of link dumps are out of hand as-is.

Okay, so if I understand correctly, on top of separate commands for
add/del of alternative names, you suggest also get/dump to be separate
command and don't fill this up in existing newling/getlink command.

So we'll have:
CMD to add:
	RTM_NEWALTIFNAME = 108
#define RTM_NEWALTIFNAME       RTM_NEWALTIFNAME

Example msg (user->kernel):
     IFLA_IFNAME eth0
     IFLA_ALT_IFNAME_MOD somereallyreallylongname

Example msg (user->kernel):
     IFLA_ALT_IFNAME somereallyreallylongname
     IFLA_ALT_IFNAME_MOD someotherreallyreallylongname


CMD to delete:
	RTM_DELALTIFNAME,
#define RTM_DELALTIFNAME       RTM_DELALTIFNAME

Example msg (user->kernel):
     IFLA_IFNAME eth0
     IFLA_ALT_IFNAME_MOD somereallyreallylongname

	
CMD to get/dump:
        RTM_GETALTIFNAME,
#define RTM_GETALTIFNAME       RTM_GETALTIFNAME

Example msg (kernel->user):
     hdr (with ifindex)
     IFLA_ALT_IFNAME_LIST (nest)
        IFLA_ALT_IFNAME somereallyreallylongname
        IFLA_ALT_IFNAME someotherreallyreallylongname

Correct?	

^ permalink raw reply

* [PATCH net-next] ipv6: shrink struct ipv6_mc_socklist
From: Eric Dumazet @ 2019-08-27  7:08 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet

Remove two holes on 64bit arches, to bring the size
to one cache line exactly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/if_inet6.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index 50037913c9b191cb3793e3f072104c10c4257ff2..a01981d7108f96075b6939f4a78f14e7afe93a4e 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -89,9 +89,9 @@ struct ip6_sf_socklist {
 struct ipv6_mc_socklist {
 	struct in6_addr		addr;
 	int			ifindex;
+	unsigned int		sfmode;		/* MCAST_{INCLUDE,EXCLUDE} */
 	struct ipv6_mc_socklist __rcu *next;
 	rwlock_t		sflock;
-	unsigned int		sfmode;		/* MCAST_{INCLUDE,EXCLUDE} */
 	struct ip6_sf_socklist	*sflist;
 	struct rcu_head		rcu;
 };
-- 
2.23.0.187.g17f5b7556c-goog


^ permalink raw reply related

* Re: [PATCH -next] net: mlx5: Kconfig: Fix MLX5_CORE_EN dependencies
From: walter harms @ 2019-08-27  7:24 UTC (permalink / raw)
  To: Mao Wenan
  Cc: saeedm, leon, davem, netdev, linux-rdma, linux-kernel,
	kernel-janitors
In-Reply-To: <20190827031251.98881-1-maowenan@huawei.com>



Am 27.08.2019 05:12, schrieb Mao Wenan:
> When MLX5_CORE_EN=y and PCI_HYPERV_INTERFACE is not set, below errors are found:
> drivers/net/ethernet/mellanox/mlx5/core/en_main.o: In function `mlx5e_nic_enable':
> en_main.c:(.text+0xb649): undefined reference to `mlx5e_hv_vhca_stats_create'
> drivers/net/ethernet/mellanox/mlx5/core/en_main.o: In function `mlx5e_nic_disable':
> en_main.c:(.text+0xb8c4): undefined reference to `mlx5e_hv_vhca_stats_destroy'
> 
> This because CONFIG_PCI_HYPERV_INTERFACE is newly introduced by 'commit 348dd93e40c1
> ("PCI: hv: Add a Hyper-V PCI interface driver for software backchannel interface"),
> Fix this by making MLX5_CORE_EN imply PCI_HYPERV_INTERFACE.
> 
> Fixes: cef35af34d6d ("net/mlx5e: Add mlx5e HV VHCA stats agent")
> Signed-off-by: Mao Wenan <maowenan@huawei.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> index 37fef8c..a6a70ce 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> @@ -35,6 +35,7 @@ config MLX5_CORE_EN
>  	depends on IPV6=y || IPV6=n || MLX5_CORE=m

OT but ...
is that IPV6 needed at all ? can there be something else that yes or no ?

re,
 wh

>  	select PAGE_POOL
>  	select DIMLIB
> +	imply PCI_HYPERV_INTERFACE
>  	default n
>  	---help---
>  	  Ethernet support in Mellanox Technologies ConnectX-4 NIC.

^ permalink raw reply

* Re: [PATCH bpf-next v5 03/11] xsk: add support to allow unaligned chunk placement
From: Maxim Mikityanskiy @ 2019-08-27  7:36 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	bjorn.topel@intel.com, magnus.karlsson@intel.com,
	jakub.kicinski@netronome.com, jonathan.lemon@gmail.com,
	Saeed Mahameed, stephen@networkplumber.org,
	bruce.richardson@intel.com, ciara.loftus@intel.com,
	bpf@vger.kernel.org, intel-wired-lan@lists.osuosl.org
In-Reply-To: <20190822014427.49800-4-kevin.laatz@intel.com>

On 2019-08-22 04:44, Kevin Laatz wrote:
> Currently, addresses are chunk size aligned. This means, we are very
> restricted in terms of where we can place chunk within the umem. For
> example, if we have a chunk size of 2k, then our chunks can only be placed
> at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).
> 
> This patch introduces the ability to use unaligned chunks. With these
> changes, we are no longer bound to having to place chunks at a 2k (or
> whatever your chunk size is) interval. Since we are no longer dealing with
> aligned chunks, they can now cross page boundaries. Checks for page
> contiguity have been added in order to keep track of which pages are
> followed by a physically contiguous page.
> 
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> 
> ---
> v2:
>    - Add checks for the flags coming from userspace
>    - Fix how we get chunk_size in xsk_diag.c
>    - Add defines for masking the new descriptor format
>    - Modified the rx functions to use new descriptor format
>    - Modified the tx functions to use new descriptor format
> 
> v3:
>    - Add helper function to do address/offset masking/addition
> 
> v4:
>    - fixed page_start calculation in __xsk_rcv_memcpy().
>    - move offset handling to the xdp_umem_get_* functions
>    - modified the len field in xdp_umem_reg struct. We now use 16 bits from
>      this for the flags field.
>    - removed next_pg_contig field from xdp_umem_page struct. Using low 12
>      bits of addr to store flags instead.
>    - other minor changes based on review comments
> 
> v5:
>    - Added accessors for getting addr and offset
>    - Added helper function to add offset to addr
>    - Fixed offset handling in xsk_rcv
>    - Removed bitfields from xdp_umem_reg
>    - Added struct size checking for xdp_umem_reg in xsk_setsockopt to handle
>      different versions of the struct.
>    - fix conflicts after 'bpf-af-xdp-wakeup' was merged.
> ---
>   include/net/xdp_sock.h      | 75 +++++++++++++++++++++++++++--
>   include/uapi/linux/if_xdp.h |  9 ++++
>   net/xdp/xdp_umem.c          | 19 ++++++--
>   net/xdp/xsk.c               | 96 +++++++++++++++++++++++++++++--------
>   net/xdp/xsk_diag.c          |  2 +-
>   net/xdp/xsk_queue.h         | 68 ++++++++++++++++++++++----
>   6 files changed, 232 insertions(+), 37 deletions(-)
> 
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index f023b9940d64..c9398ce7960f 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -16,6 +16,13 @@
>   struct net_device;
>   struct xsk_queue;
>   
> +/* Masks for xdp_umem_page flags.
> + * The low 12-bits of the addr will be 0 since this is the page address, so we
> + * can use them for flags.
> + */
> +#define XSK_NEXT_PG_CONTIG_SHIFT 0
> +#define XSK_NEXT_PG_CONTIG_MASK (1ULL << XSK_NEXT_PG_CONTIG_SHIFT)
> +
>   struct xdp_umem_page {
>   	void *addr;
>   	dma_addr_t dma;
> @@ -27,8 +34,12 @@ struct xdp_umem_fq_reuse {
>   	u64 handles[];
>   };
>   
> -/* Flags for the umem flags field. */
> -#define XDP_UMEM_USES_NEED_WAKEUP (1 << 0)
> +/* Flags for the umem flags field.
> + *
> + * The NEED_WAKEUP flag is 1 due to the reuse of the flags field for public
> + * flags. See inlude/uapi/include/linux/if_xdp.h.
> + */
> +#define XDP_UMEM_USES_NEED_WAKEUP (1 << 1)
>   
>   struct xdp_umem {
>   	struct xsk_queue *fq;
> @@ -124,14 +135,36 @@ void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
>   int xsk_map_inc(struct xsk_map *map);
>   void xsk_map_put(struct xsk_map *map);
>   
> +static inline u64 xsk_umem_extract_addr(u64 addr)
> +{
> +	return addr & XSK_UNALIGNED_BUF_ADDR_MASK;
> +}
> +
> +static inline u64 xsk_umem_extract_offset(u64 addr)
> +{
> +	return addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> +}
> +
> +static inline u64 xsk_umem_add_offset_to_addr(u64 addr)
> +{
> +	return xsk_umem_extract_addr(addr) + xsk_umem_extract_offset(addr);
> +}
> +
>   static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
>   {
> -	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
> +	unsigned long page_addr;
> +
> +	addr = xsk_umem_add_offset_to_addr(addr);
> +	page_addr = (unsigned long)umem->pages[addr >> PAGE_SHIFT].addr;
> +
> +	return (char *)(page_addr & PAGE_MASK) + (addr & ~PAGE_MASK);
>   }
>   
>   static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
>   {
> -	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
> +	addr = xsk_umem_add_offset_to_addr(addr);
> +
> +	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & ~PAGE_MASK);
>   }
>   
>   /* Reuse-queue aware version of FILL queue helpers */
> @@ -172,6 +205,19 @@ static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
>   
>   	rq->handles[rq->length++] = addr;
>   }
> +
> +/* Handle the offset appropriately depending on aligned or unaligned mode.
> + * For unaligned mode, we store the offset in the upper 16-bits of the address.
> + * For aligned mode, we simply add the offset to the address.
> + */
> +static inline u64 xsk_umem_adjust_offset(struct xdp_umem *umem, u64 address,
> +					 u64 offset)
> +{
> +	if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG)
> +		return address + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
> +	else
> +		return address + offset;
> +}
>   #else
>   static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>   {
> @@ -241,6 +287,21 @@ static inline struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev,
>   	return NULL;
>   }
>   
> +static inline u64 xsk_umem_extract_addr(u64 addr)
> +{
> +	return 0;
> +}
> +
> +static inline u64 xsk_umem_extract_offset(u64 addr)
> +{
> +	return 0;
> +}
> +
> +static inline u64 xsk_umem_add_offset_to_addr(u64 addr)
> +{
> +	return 0;
> +}
> +
>   static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
>   {
>   	return NULL;
> @@ -290,6 +351,12 @@ static inline bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem)
>   	return false;
>   }
>   
> +static inline u64 xsk_umem_adjust_offset(struct xdp_umem *umem, u64 handle,
> +					 u64 offset)
> +{
> +	return 0;
> +}
> +
>   #endif /* CONFIG_XDP_SOCKETS */
>   
>   #endif /* _LINUX_XDP_SOCK_H */
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 62b80d57b72a..be328c59389d 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -26,6 +26,9 @@
>    */
>   #define XDP_USE_NEED_WAKEUP (1 << 3)
>   
> +/* Flags for xsk_umem_config flags */
> +#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
> +
>   struct sockaddr_xdp {
>   	__u16 sxdp_family;
>   	__u16 sxdp_flags;
> @@ -66,6 +69,7 @@ struct xdp_umem_reg {
>   	__u64 len; /* Length of packet data area */
>   	__u32 chunk_size;
>   	__u32 headroom;
> +	__u32 flags;
>   };
>   
>   struct xdp_statistics {
> @@ -87,6 +91,11 @@ struct xdp_options {
>   #define XDP_UMEM_PGOFF_FILL_RING	0x100000000ULL
>   #define XDP_UMEM_PGOFF_COMPLETION_RING	0x180000000ULL
>   
> +/* Masks for unaligned chunks mode */
> +#define XSK_UNALIGNED_BUF_OFFSET_SHIFT 48
> +#define XSK_UNALIGNED_BUF_ADDR_MASK \
> +	((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
> +
>   /* Rx/Tx descriptor */
>   struct xdp_desc {
>   	__u64 addr;
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 2d65779282a1..e997b263a0dd 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -340,6 +340,7 @@ static int xdp_umem_account_pages(struct xdp_umem *umem)
>   
>   static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>   {
> +	bool unaligned_chunks = mr->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG;
>   	u32 chunk_size = mr->chunk_size, headroom = mr->headroom;
>   	unsigned int chunks, chunks_per_page;
>   	u64 addr = mr->addr, size = mr->len;
> @@ -355,7 +356,11 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>   		return -EINVAL;
>   	}
>   
> -	if (!is_power_of_2(chunk_size))
> +	if (mr->flags & ~(XDP_UMEM_UNALIGNED_CHUNK_FLAG |
> +			XDP_UMEM_USES_NEED_WAKEUP))
> +		return -EINVAL;
> +
> +	if (!unaligned_chunks && !is_power_of_2(chunk_size))
>   		return -EINVAL;
>   
>   	if (!PAGE_ALIGNED(addr)) {
> @@ -372,9 +377,11 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>   	if (chunks == 0)
>   		return -EINVAL;
>   
> -	chunks_per_page = PAGE_SIZE / chunk_size;
> -	if (chunks < chunks_per_page || chunks % chunks_per_page)
> -		return -EINVAL;
> +	if (!unaligned_chunks) {
> +		chunks_per_page = PAGE_SIZE / chunk_size;
> +		if (chunks < chunks_per_page || chunks % chunks_per_page)
> +			return -EINVAL;
> +	}
>   
>   	headroom = ALIGN(headroom, 64);
>   
> @@ -383,13 +390,15 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>   		return -EINVAL;
>   
>   	umem->address = (unsigned long)addr;
> -	umem->chunk_mask = ~((u64)chunk_size - 1);
> +	umem->chunk_mask = unaligned_chunks ? XSK_UNALIGNED_BUF_ADDR_MASK
> +					    : ~((u64)chunk_size - 1);
>   	umem->size = size;
>   	umem->headroom = headroom;
>   	umem->chunk_size_nohr = chunk_size - headroom;
>   	umem->npgs = size / PAGE_SIZE;
>   	umem->pgs = NULL;
>   	umem->user = NULL;
> +	umem->flags = mr->flags;
>   	INIT_LIST_HEAD(&umem->xsk_list);
>   	spin_lock_init(&umem->xsk_list_lock);
>   
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index ee4428a892fa..907e5f12338f 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -45,7 +45,7 @@ EXPORT_SYMBOL(xsk_umem_has_addrs);
>   
>   u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
>   {
> -	return xskq_peek_addr(umem->fq, addr);
> +	return xskq_peek_addr(umem->fq, addr, umem);
>   }
>   EXPORT_SYMBOL(xsk_umem_peek_addr);
>   
> @@ -115,21 +115,43 @@ bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem)
>   }
>   EXPORT_SYMBOL(xsk_umem_uses_need_wakeup);
>   
> +/* If a buffer crosses a page boundary, we need to do 2 memcpy's, one for
> + * each page. This is only required in copy mode.
> + */
> +static void __xsk_rcv_memcpy(struct xdp_umem *umem, u64 addr, void *from_buf,
> +			     u32 len, u32 metalen)
> +{
> +	void *to_buf = xdp_umem_get_data(umem, addr);
> +
> +	addr = xsk_umem_add_offset_to_addr(addr);
> +	if (xskq_crosses_non_contig_pg(umem, addr, len + metalen)) {
> +		void *next_pg_addr = umem->pages[(addr >> PAGE_SHIFT) + 1].addr;
> +		u64 page_start = addr & ~(PAGE_SIZE - 1);
> +		u64 first_len = PAGE_SIZE - (addr - page_start);
> +
> +		memcpy(to_buf, from_buf, first_len + metalen);
> +		memcpy(next_pg_addr, from_buf + first_len, len - first_len);
> +
> +		return;
> +	}
> +
> +	memcpy(to_buf, from_buf, len + metalen);
> +}
> +
>   static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>   {
> -	void *to_buf, *from_buf;
> +	u64 offset = xs->umem->headroom;
> +	u64 addr, memcpy_addr;
> +	void *from_buf;
>   	u32 metalen;
> -	u64 addr;
>   	int err;
>   
> -	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> +	if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
>   	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>   		xs->rx_dropped++;
>   		return -ENOSPC;
>   	}
>   
> -	addr += xs->umem->headroom;
> -
>   	if (unlikely(xdp_data_meta_unsupported(xdp))) {
>   		from_buf = xdp->data;
>   		metalen = 0;
> @@ -138,9 +160,11 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>   		metalen = xdp->data - xdp->data_meta;
>   	}
>   
> -	to_buf = xdp_umem_get_data(xs->umem, addr);
> -	memcpy(to_buf, from_buf, len + metalen);
> -	addr += metalen;
> +	memcpy_addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
> +	__xsk_rcv_memcpy(xs->umem, memcpy_addr, from_buf, len, metalen);
> +
> +	offset += metalen;
> +	addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
>   	err = xskq_produce_batch_desc(xs->rx, addr, len);
>   	if (!err) {
>   		xskq_discard_addr(xs->umem->fq);
> @@ -185,6 +209,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>   {
>   	u32 metalen = xdp->data - xdp->data_meta;
>   	u32 len = xdp->data_end - xdp->data;
> +	u64 offset = xs->umem->headroom;
>   	void *buffer;
>   	u64 addr;
>   	int err;
> @@ -196,17 +221,17 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>   		goto out_unlock;
>   	}
>   
> -	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> +	if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
>   	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>   		err = -ENOSPC;
>   		goto out_drop;
>   	}
>   
> -	addr += xs->umem->headroom;
> -
> -	buffer = xdp_umem_get_data(xs->umem, addr);
> +	buffer = xdp_umem_get_data(xs->umem, addr + offset);
>   	memcpy(buffer, xdp->data_meta, len + metalen);
> -	addr += metalen;
> +	offset += metalen;
> +
> +	addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
>   	err = xskq_produce_batch_desc(xs->rx, addr, len);
>   	if (err)
>   		goto out_drop;
> @@ -250,7 +275,7 @@ bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc)
>   
>   	rcu_read_lock();
>   	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
> -		if (!xskq_peek_desc(xs->tx, desc))
> +		if (!xskq_peek_desc(xs->tx, desc, umem))
>   			continue;
>   
>   		if (xskq_produce_addr_lazy(umem->cq, desc->addr))
> @@ -304,7 +329,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
>   	if (xs->queue_id >= xs->dev->real_num_tx_queues)
>   		goto out;
>   
> -	while (xskq_peek_desc(xs->tx, &desc)) {
> +	while (xskq_peek_desc(xs->tx, &desc, xs->umem)) {
>   		char *buffer;
>   		u64 addr;
>   		u32 len;
> @@ -333,7 +358,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
>   		skb->dev = xs->dev;
>   		skb->priority = sk->sk_priority;
>   		skb->mark = sk->sk_mark;
> -		skb_shinfo(skb)->destructor_arg = (void *)(long)addr;
> +		skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
>   		skb->destructor = xsk_destruct_skb;
>   
>   		err = dev_direct_xmit(skb, xs->queue_id);
> @@ -526,6 +551,24 @@ static struct socket *xsk_lookup_xsk_from_fd(int fd)
>   	return sock;
>   }
>   
> +/* Check if umem pages are contiguous.
> + * If zero-copy mode, use the DMA address to do the page contiguity check
> + * For all other modes we use addr (kernel virtual address)
> + * Store the result in the low bits of addr.
> + */
> +static void xsk_check_page_contiguity(struct xdp_umem *umem, u32 flags)
> +{
> +	struct xdp_umem_page *pgs = umem->pages;
> +	int i, is_contig;
> +
> +	for (i = 0; i < umem->npgs - 1; i++) {
> +		is_contig = (flags & XDP_ZEROCOPY) ?
> +			(pgs[i].dma + PAGE_SIZE == pgs[i + 1].dma) :
> +			(pgs[i].addr + PAGE_SIZE == pgs[i + 1].addr);
> +		pgs[i].addr += is_contig << XSK_NEXT_PG_CONTIG_SHIFT;
> +	}
> +}
> +
>   static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>   {
>   	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
> @@ -616,6 +659,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>   		err = xdp_umem_assign_dev(xs->umem, dev, qid, flags);
>   		if (err)
>   			goto out_unlock;
> +
> +		xsk_check_page_contiguity(xs->umem, flags);
>   	}
>   
>   	xs->dev = dev;
> @@ -636,6 +681,13 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>   	return err;
>   }
>   
> +struct xdp_umem_reg_v1 {
> +	__u64 addr; /* Start of packet data area */
> +	__u64 len; /* Length of packet data area */
> +	__u32 chunk_size;
> +	__u32 headroom;
> +};
> +
>   static int xsk_setsockopt(struct socket *sock, int level, int optname,
>   			  char __user *optval, unsigned int optlen)
>   {
> @@ -673,10 +725,16 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>   	}
>   	case XDP_UMEM_REG:
>   	{
> -		struct xdp_umem_reg mr;
> +		size_t mr_size = sizeof(struct xdp_umem_reg);
> +		struct xdp_umem_reg mr = {};
>   		struct xdp_umem *umem;
>   
> -		if (copy_from_user(&mr, optval, sizeof(mr)))
> +		if (optlen < sizeof(struct xdp_umem_reg_v1))
> +			return -EINVAL;
> +		else if (optlen < sizeof(mr))
> +			mr_size = sizeof(struct xdp_umem_reg_v1);
> +
> +		if (copy_from_user(&mr, optval, mr_size))
>   			return -EFAULT;
>   
>   		mutex_lock(&xs->mutex);
> diff --git a/net/xdp/xsk_diag.c b/net/xdp/xsk_diag.c
> index d5e06c8e0cbf..9986a759fe06 100644
> --- a/net/xdp/xsk_diag.c
> +++ b/net/xdp/xsk_diag.c
> @@ -56,7 +56,7 @@ static int xsk_diag_put_umem(const struct xdp_sock *xs, struct sk_buff *nlskb)
>   	du.id = umem->id;
>   	du.size = umem->size;
>   	du.num_pages = umem->npgs;
> -	du.chunk_size = (__u32)(~umem->chunk_mask + 1);
> +	du.chunk_size = umem->chunk_size_nohr + umem->headroom;
>   	du.headroom = umem->headroom;
>   	du.ifindex = umem->dev ? umem->dev->ifindex : 0;
>   	du.queue_id = umem->queue_id;
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index dd9e985c2461..6c67c9d0294f 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -134,6 +134,17 @@ static inline bool xskq_has_addrs(struct xsk_queue *q, u32 cnt)
>   
>   /* UMEM queue */
>   
> +static inline bool xskq_crosses_non_contig_pg(struct xdp_umem *umem, u64 addr,
> +					      u64 length)
> +{
> +	bool cross_pg = (addr & (PAGE_SIZE - 1)) + length > PAGE_SIZE;
> +	bool next_pg_contig =
> +		(unsigned long)umem->pages[(addr >> PAGE_SHIFT)].addr &
> +			XSK_NEXT_PG_CONTIG_MASK;
> +
> +	return cross_pg && !next_pg_contig;
> +}
> +
>   static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
>   {
>   	if (addr >= q->size) {
> @@ -144,23 +155,49 @@ static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
>   	return true;
>   }
>   
> -static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr)
> +static inline bool xskq_is_valid_addr_unaligned(struct xsk_queue *q, u64 addr,
> +						u64 length,
> +						struct xdp_umem *umem)
> +{
> +	addr = xsk_umem_add_offset_to_addr(addr);
> +	if (addr >= q->size ||

I know I already asked about it, but I feel we need to clarify things 
here further.

The `addr >= q->size` check is good, it checks that the address doesn't 
overflow the UMEM. However, the new encoding of UMEM handles consists of 
two parts: base address of the frame (low bits) and offset (high bits), 
that are added together by xsk_umem_add_offset_to_addr. It's possible to 
craft an address that has too big base address of the frame (all 
0xff...f), some non-zero offset, so after adding them together we pass 
`addr < q->size`, but the base address exceeds the UMEM. In my opinion, 
we should not allow such addresses, because the base address of the 
frame is out of bounds, so we need one more check here. What do you 
think about this point?

> +	    xskq_crosses_non_contig_pg(umem, addr, length)) {
> +		q->invalid_descs++;
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr,
> +				      struct xdp_umem *umem)
>   {
>   	while (q->cons_tail != q->cons_head) {
>   		struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
>   		unsigned int idx = q->cons_tail & q->ring_mask;
>   
>   		*addr = READ_ONCE(ring->desc[idx]) & q->chunk_mask;
> +
> +		if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG) {
> +			if (xskq_is_valid_addr_unaligned(q, *addr,
> +							 umem->chunk_size_nohr,
> +							 umem))
> +				return addr;
> +			goto out;
> +		}
> +
>   		if (xskq_is_valid_addr(q, *addr))
>   			return addr;
>   
> +out:
>   		q->cons_tail++;
>   	}
>   
>   	return NULL;
>   }
>   
> -static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr)
> +static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr,
> +				  struct xdp_umem *umem)
>   {
>   	if (q->cons_tail == q->cons_head) {
>   		smp_mb(); /* D, matches A */
> @@ -171,7 +208,7 @@ static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr)
>   		smp_rmb();
>   	}
>   
> -	return xskq_validate_addr(q, addr);
> +	return xskq_validate_addr(q, addr, umem);
>   }
>   
>   static inline void xskq_discard_addr(struct xsk_queue *q)
> @@ -230,8 +267,21 @@ static inline int xskq_reserve_addr(struct xsk_queue *q)
>   
>   /* Rx/Tx queue */
>   
> -static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d)
> +static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d,
> +				      struct xdp_umem *umem)
>   {
> +	if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG) {
> +		if (!xskq_is_valid_addr_unaligned(q, d->addr, d->len, umem))
> +			return false;
> +
> +		if (d->len > umem->chunk_size_nohr || d->options) {
> +			q->invalid_descs++;
> +			return false;
> +		}
> +
> +		return true;
> +	}
> +
>   	if (!xskq_is_valid_addr(q, d->addr))
>   		return false;
>   
> @@ -245,14 +295,15 @@ static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d)
>   }
>   
>   static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
> -						  struct xdp_desc *desc)
> +						  struct xdp_desc *desc,
> +						  struct xdp_umem *umem)
>   {
>   	while (q->cons_tail != q->cons_head) {
>   		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
>   		unsigned int idx = q->cons_tail & q->ring_mask;
>   
>   		*desc = READ_ONCE(ring->desc[idx]);
> -		if (xskq_is_valid_desc(q, desc))
> +		if (xskq_is_valid_desc(q, desc, umem))
>   			return desc;
>   
>   		q->cons_tail++;
> @@ -262,7 +313,8 @@ static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
>   }
>   
>   static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
> -					      struct xdp_desc *desc)
> +					      struct xdp_desc *desc,
> +					      struct xdp_umem *umem)
>   {
>   	if (q->cons_tail == q->cons_head) {
>   		smp_mb(); /* D, matches A */
> @@ -273,7 +325,7 @@ static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
>   		smp_rmb(); /* C, matches B */
>   	}
>   
> -	return xskq_validate_desc(q, desc);
> +	return xskq_validate_desc(q, desc, umem);
>   }
>   
>   static inline void xskq_discard_desc(struct xsk_queue *q)
> 


^ permalink raw reply

* Re: [PATCH 14/16] include/linux: prefer __section from compiler_attributes.h
From: Will Deacon @ 2019-08-27  8:21 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Nick Desaulniers, Andrew Morton, Sedat Dilek, Josh Poimboeuf,
	Yonghong Song, clang-built-linux, Luc Van Oostenryck,
	Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra (Intel), Nicholas Piggin, Jiri Kosina,
	Ard Biesheuvel, Michael Ellerman, Masahiro Yamada,
	Hans Liljestrand, Elena Reshetova, David Windsor, Marc Zyngier,
	Ming Lei, Dou Liyang, Julien Thierry, Mauro Carvalho Chehab,
	Jens Axboe, linux-kernel, linux-sparse, rcu, Network Development,
	bpf
In-Reply-To: <CANiq72mXvuVdO2i9gobmh9YeUW4bhe7YnqQc6PaZrbqua1DoYw@mail.gmail.com>

On Sat, Aug 24, 2019 at 02:51:46PM +0200, Miguel Ojeda wrote:
> On Tue, Aug 13, 2019 at 10:33 AM Will Deacon <will@kernel.org> wrote:
> >
> > -ENOCOMMITMESSAGE
> >
> > Otherwise, patch looks good to me.
> 
> Do you want Ack, Review or nothing?

You can add my Ack if a commit message appears.

Will

^ permalink raw reply

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: David Miller @ 2019-08-27  8:22 UTC (permalink / raw)
  To: jiri
  Cc: jakub.kicinski, dsahern, roopa, netdev, sthemmin, dcbw, mkubecek,
	andrew, parav, saeedm, mlxsw
In-Reply-To: <20190827070808.GA2250@nanopsycho>

From: Jiri Pirko <jiri@resnulli.us>
Date: Tue, 27 Aug 2019 09:08:08 +0200

> Okay, so if I understand correctly, on top of separate commands for
> add/del of alternative names, you suggest also get/dump to be separate
> command and don't fill this up in existing newling/getlink command.

I'm not sure what to do yet.

David has a point, because the only way these ifnames are useful is
as ways to specify and choose net devices.  So based upon that I'm
slightly learning towards not using separate commands.

^ permalink raw reply

* Re: [RFC PATCH net-next] net: phy: force phy suspend when calling phy_stop
From: shenjian (K) @ 2019-08-27  8:29 UTC (permalink / raw)
  To: Heiner Kallweit, andrew, f.fainelli, davem
  Cc: netdev, forest.zhouchang, linuxarm
In-Reply-To: <fc2a700a-9c24-b96c-df6b-c5414883d89e@gmail.com>



在 2019/8/27 13:51, Heiner Kallweit 写道:
> On 27.08.2019 04:47, Jian Shen wrote:
>> Some ethernet drivers may call phy_start() and phy_stop() from
>> ndo_open and ndo_close() respectively.
>>
>> When network cable is unconnected, and operate like below:
>> step 1: ifconfig ethX up -> ndo_open -> phy_start ->start
>> autoneg, and phy is no link.
>> step 2: ifconfig ethX down -> ndo_close -> phy_stop -> just stop
>> phy state machine.
>> step 3: plugin the network cable, and autoneg complete, then
>> LED for link status will be on.
>> step 4: ethtool ethX --> see the result of "Link detected" is no.
>>
> Step 3 and 4 seem to be unrelated to the actual issue.
> With which MAC + PHY driver did you observe this?
> 
Thanks Heiner,

I tested this on HNS3 driver, with two phy, Marvell 88E1512 and RTL8211.

Step 3 and Step 4 is just to describe that the LED of link shows link up,
but the port information shows no link.


>> This patch forces phy suspend even phydev->link is off.
>>
>> Signed-off-by: Jian Shen <shenjian15@huawei.com>
>> ---
>>  drivers/net/phy/phy.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>> index f3adea9..0acd5b4 100644
>> --- a/drivers/net/phy/phy.c
>> +++ b/drivers/net/phy/phy.c
>> @@ -911,8 +911,8 @@ void phy_state_machine(struct work_struct *work)
>>  		if (phydev->link) {
>>  			phydev->link = 0;
>>  			phy_link_down(phydev, true);
>> -			do_suspend = true;
>>  		}
>> +		do_suspend = true;
>>  		break;
>>  	}
>>  
>>
> Heiner
> 
> 


^ permalink raw reply

* [PATCH] ipv6: Not to probe neighbourless routes
From: Yi Wang @ 2019-08-27  9:08 UTC (permalink / raw)
  To: davem
  Cc: kuznet, yoshfuji, netdev, linux-kernel, xue.zhihong, wang.yi59,
	wang.liang82, Cheng Lin

From: Cheng Lin <cheng.lin130@zte.com.cn>

Originally, Router Reachability Probing require a neighbour entry
existed. Commit 2152caea7196 ("ipv6: Do not depend on rt->n in
rt6_probe().") removed the requirement for a neighbour entry. And
commit f547fac624be ("ipv6: rate-limit probes for neighbourless
routes") adds rate-limiting for neighbourless routes.

And, the Neighbor Discovery for IP version 6 (IPv6)(rfc4861) says,
"
7.2.5.  Receipt of Neighbor Advertisements

When a valid Neighbor Advertisement is received (either solicited or
unsolicited), the Neighbor Cache is searched for the target's entry.
If no entry exists, the advertisement SHOULD be silently discarded.
There is no need to create an entry if none exists, since the
recipient has apparently not initiated any communication with the
target.
".

In rt6_probe(), just a Neighbor Solicitation message are transmited.
When receiving a Neighbor Advertisement, the node does nothing in a
Neighborless condition.

Not sure it's needed to create a neighbor entry in Router
Reachability Probing. And the Original way may be the right way.

This patch recover the requirement for a neighbour entry.

Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
---
 include/net/ip6_fib.h | 5 -----
 net/ipv6/route.c      | 5 +----
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 4b5656c..8c2e022 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -124,11 +124,6 @@ struct rt6_exception {
 
 struct fib6_nh {
 	struct fib_nh_common	nh_common;
-
-#ifdef CONFIG_IPV6_ROUTER_PREF
-	unsigned long		last_probe;
-#endif
-
 	struct rt6_info * __percpu *rt6i_pcpu;
 	struct rt6_exception_bucket __rcu *rt6i_exception_bucket;
 };
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index fd059e0..c4bcffc 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -639,12 +639,12 @@ static void rt6_probe(struct fib6_nh *fib6_nh)
 	nh_gw = &fib6_nh->fib_nh_gw6;
 	dev = fib6_nh->fib_nh_dev;
 	rcu_read_lock_bh();
-	idev = __in6_dev_get(dev);
 	neigh = __ipv6_neigh_lookup_noref(dev, nh_gw);
 	if (neigh) {
 		if (neigh->nud_state & NUD_VALID)
 			goto out;
 
+		idev = __in6_dev_get(dev);
 		write_lock(&neigh->lock);
 		if (!(neigh->nud_state & NUD_VALID) &&
 		    time_after(jiffies,
@@ -654,9 +654,6 @@ static void rt6_probe(struct fib6_nh *fib6_nh)
 				__neigh_set_probe_once(neigh);
 		}
 		write_unlock(&neigh->lock);
-	} else if (time_after(jiffies, fib6_nh->last_probe +
-				       idev->cnf.rtr_probe_interval)) {
-		work = kmalloc(sizeof(*work), GFP_ATOMIC);
 	}
 
 	if (work) {
-- 
1.8.3.1


^ permalink raw reply related

* RE: [PATCH v4] rtw88: pci: Move a mass of jobs in hw IRQ to soft IRQ
From: Tony Chuang @ 2019-08-27  9:20 UTC (permalink / raw)
  To: Jian-Hong Pan, Kalle Valo, David S . Miller
  Cc: linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux@endlessm.com
In-Reply-To: <20190826070827.1436-1-jian-hong@endlessm.com>

> From: Jian-Hong Pan 
> Subject: [PATCH v4] rtw88: pci: Move a mass of jobs in hw IRQ to soft IRQ
> 
> There is a mass of jobs between spin lock and unlock in the hardware
> IRQ which will occupy much time originally. To make system work more
> efficiently, this patch moves the jobs to the soft IRQ (bottom half) to
> reduce the time in hardware IRQ.
> 
> Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
> ---
> v2:
>  Change the spin_lock_irqsave/unlock_irqrestore to spin_lock/unlock in
>  rtw_pci_interrupt_handler. Because the interrupts are already disabled
>  in the hardware interrupt handler.
> 
> v3:
>  Extend the spin lock protecting area for the TX path in
>  rtw_pci_interrupt_threadfn by Realtek's suggestion
> 
> v4:
>  Remove the WiFi running check in rtw_pci_interrupt_threadfn to avoid AP
>  connection failed by Realtek's suggestion.
> 
>  drivers/net/wireless/realtek/rtw88/pci.c | 32 +++++++++++++++++++-----
>  1 file changed, 26 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/wireless/realtek/rtw88/pci.c
> b/drivers/net/wireless/realtek/rtw88/pci.c
> index 00ef229552d5..955dd6c6fb57 100644
> --- a/drivers/net/wireless/realtek/rtw88/pci.c
> +++ b/drivers/net/wireless/realtek/rtw88/pci.c
> @@ -866,12 +866,29 @@ static irqreturn_t rtw_pci_interrupt_handler(int irq,
> void *dev)
>  {
>  	struct rtw_dev *rtwdev = dev;
>  	struct rtw_pci *rtwpci = (struct rtw_pci *)rtwdev->priv;
> -	u32 irq_status[4];
> 
>  	spin_lock(&rtwpci->irq_lock);
>  	if (!rtwpci->irq_enabled)
>  		goto out;
> 
> +	/* disable RTW PCI interrupt to avoid more interrupts before the end of
> +	 * thread function
> +	 */
> +	rtw_pci_disable_interrupt(rtwdev, rtwpci);
> +out:
> +	spin_unlock(&rtwpci->irq_lock);
> +
> +	return IRQ_WAKE_THREAD;
> +}
> +
> +static irqreturn_t rtw_pci_interrupt_threadfn(int irq, void *dev)
> +{
> +	struct rtw_dev *rtwdev = dev;
> +	struct rtw_pci *rtwpci = (struct rtw_pci *)rtwdev->priv;
> +	unsigned long flags;
> +	u32 irq_status[4];
> +
> +	spin_lock_irqsave(&rtwpci->irq_lock, flags);
>  	rtw_pci_irq_recognized(rtwdev, rtwpci, irq_status);
> 
>  	if (irq_status[0] & IMR_MGNTDOK)
> @@ -891,8 +908,9 @@ static irqreturn_t rtw_pci_interrupt_handler(int irq,
> void *dev)
>  	if (irq_status[0] & IMR_ROK)
>  		rtw_pci_rx_isr(rtwdev, rtwpci, RTW_RX_QUEUE_MPDU);
> 
> -out:
> -	spin_unlock(&rtwpci->irq_lock);
> +	/* all of the jobs for this interrupt have been done */
> +	rtw_pci_enable_interrupt(rtwdev, rtwpci);
> +	spin_unlock_irqrestore(&rtwpci->irq_lock, flags);
> 
>  	return IRQ_HANDLED;
>  }
> @@ -1152,8 +1170,10 @@ static int rtw_pci_probe(struct pci_dev *pdev,
>  		goto err_destroy_pci;
>  	}
> 
> -	ret = request_irq(pdev->irq, &rtw_pci_interrupt_handler,
> -			  IRQF_SHARED, KBUILD_MODNAME, rtwdev);
> +	ret = devm_request_threaded_irq(rtwdev->dev, pdev->irq,
> +					rtw_pci_interrupt_handler,
> +					rtw_pci_interrupt_threadfn,
> +					IRQF_SHARED, KBUILD_MODNAME, rtwdev);
>  	if (ret) {
>  		ieee80211_unregister_hw(hw);
>  		goto err_destroy_pci;
> @@ -1192,7 +1212,7 @@ static void rtw_pci_remove(struct pci_dev *pdev)
>  	rtw_pci_disable_interrupt(rtwdev, rtwpci);
>  	rtw_pci_destroy(rtwdev, pdev);
>  	rtw_pci_declaim(rtwdev, pdev);
> -	free_irq(rtwpci->pdev->irq, rtwdev);
> +	devm_free_irq(rtwdev->dev, rtwpci->pdev->irq, rtwdev);
>  	rtw_core_deinit(rtwdev);
>  	ieee80211_free_hw(hw);
>  }
> --
> 2.20.1


Now it works fine with MSI interrupt enabled.

But this patch is conflicting with MSI interrupt patch.
Is there a better way we can make Kalle apply them more smoothly?
I can rebase them and submit both if you're OK.

Yan-Hsuan

^ permalink raw reply

* [PATCH 0/4] net: dsa: microchip: add KSZ8563 support
From: Razvan Stefanescu @ 2019-08-27  9:31 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, David S . Miller
  Cc: netdev, linux-kernel, Razvan Stefanescu

This patchset adds compatiblity strings for the KSZ8563 switch and
also adds two small fixes to the ksz9477 driver.

Razvan Stefanescu (4):
  dt-bindings: net: dsa: document additional Microchip KSZ8563 switch
  net: dsa: microchip: add KSZ8563 compatibility string
  net: dsa: microchip: fix interrupt mask
  net: dsa: microchip: avoid hard-codded port count

 Documentation/devicetree/bindings/net/dsa/ksz.txt | 1 +
 drivers/net/dsa/microchip/ksz9477.c               | 2 +-
 drivers/net/dsa/microchip/ksz9477_reg.h           | 2 +-
 drivers/net/dsa/microchip/ksz9477_spi.c           | 1 +
 4 files changed, 4 insertions(+), 2 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [PATCH 4/4] net: dsa: microchip: avoid hard-codded port count
From: Razvan Stefanescu @ 2019-08-27  9:31 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, David S . Miller
  Cc: netdev, linux-kernel, Razvan Stefanescu
In-Reply-To: <20190827093110.14957-1-razvan.stefanescu@microchip.com>

Use port_cnt value to disable interrupts on switch reset.

Signed-off-by: Razvan Stefanescu <razvan.stefanescu@microchip.com>
---
 drivers/net/dsa/microchip/ksz9477.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/microchip/ksz9477.c b/drivers/net/dsa/microchip/ksz9477.c
index 187be42de5f1..54fc05595d48 100644
--- a/drivers/net/dsa/microchip/ksz9477.c
+++ b/drivers/net/dsa/microchip/ksz9477.c
@@ -213,7 +213,7 @@ static int ksz9477_reset_switch(struct ksz_device *dev)
 
 	/* disable interrupts */
 	ksz_write32(dev, REG_SW_INT_MASK__4, SWITCH_INT_MASK);
-	ksz_write32(dev, REG_SW_PORT_INT_MASK__4, 0x7F);
+	ksz_write32(dev, REG_SW_PORT_INT_MASK__4, dev->port_cnt);
 	ksz_read32(dev, REG_SW_PORT_INT_STATUS__4, &data32);
 
 	/* set broadcast storm protection 10% rate */
-- 
2.20.1


^ permalink raw reply related

* [PATCH 3/4] net: dsa: microchip: fix interrupt mask
From: Razvan Stefanescu @ 2019-08-27  9:31 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, David S . Miller
  Cc: netdev, linux-kernel, Razvan Stefanescu
In-Reply-To: <20190827093110.14957-1-razvan.stefanescu@microchip.com>

Global Interrupt Mask Register comprises of Lookup Engine (LUE) Interrupt
Mask (bit 31) and GPIO Pin Output Trigger and Timestamp Unit Interrupt
Mask (bit 29).

This corrects LUE bit.

Signed-off-by: Razvan Stefanescu <razvan.stefanescu@microchip.com>
---
 drivers/net/dsa/microchip/ksz9477_reg.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/microchip/ksz9477_reg.h b/drivers/net/dsa/microchip/ksz9477_reg.h
index 2938e892b631..f3949d7b9bbd 100644
--- a/drivers/net/dsa/microchip/ksz9477_reg.h
+++ b/drivers/net/dsa/microchip/ksz9477_reg.h
@@ -76,7 +76,7 @@
 #define TRIG_TS_INT			BIT(30)
 #define APB_TIMEOUT_INT			BIT(29)
 
-#define SWITCH_INT_MASK			(TRIG_TS_INT | APB_TIMEOUT_INT)
+#define SWITCH_INT_MASK			(LUE_INT | TRIG_TS_INT)
 
 #define REG_SW_PORT_INT_STATUS__4	0x0018
 #define REG_SW_PORT_INT_MASK__4		0x001C
-- 
2.20.1


^ permalink raw reply related

* [PATCH 2/4] net: dsa: microchip: add KSZ8563 compatibility string
From: Razvan Stefanescu @ 2019-08-27  9:31 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, David S . Miller
  Cc: netdev, linux-kernel, Razvan Stefanescu
In-Reply-To: <20190827093110.14957-1-razvan.stefanescu@microchip.com>

It is a 3-Port 10/100 Ethernet Switch with 1588v2 PTP.

Signed-off-by: Razvan Stefanescu <razvan.stefanescu@microchip.com>
---
 drivers/net/dsa/microchip/ksz9477_spi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/dsa/microchip/ksz9477_spi.c b/drivers/net/dsa/microchip/ksz9477_spi.c
index a226b389e12d..2e402e4d866f 100644
--- a/drivers/net/dsa/microchip/ksz9477_spi.c
+++ b/drivers/net/dsa/microchip/ksz9477_spi.c
@@ -80,6 +80,7 @@ static const struct of_device_id ksz9477_dt_ids[] = {
 	{ .compatible = "microchip,ksz9897" },
 	{ .compatible = "microchip,ksz9893" },
 	{ .compatible = "microchip,ksz9563" },
+	{ .compatible = "microchip,ksz8563" },
 	{},
 };
 MODULE_DEVICE_TABLE(of, ksz9477_dt_ids);
-- 
2.20.1


^ permalink raw reply related

* [PATCH 1/4] dt-bindings: net: dsa: document additional Microchip KSZ8563 switch
From: Razvan Stefanescu @ 2019-08-27  9:31 UTC (permalink / raw)
  To: Woojung Huh, Microchip Linux Driver Support, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, David S . Miller
  Cc: netdev, linux-kernel, Razvan Stefanescu
In-Reply-To: <20190827093110.14957-1-razvan.stefanescu@microchip.com>

It is a 3-Port 10/100 Ethernet Switch with 1588v2 PTP.

Signed-off-by: Razvan Stefanescu <razvan.stefanescu@microchip.com>
---
 Documentation/devicetree/bindings/net/dsa/ksz.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/net/dsa/ksz.txt b/Documentation/devicetree/bindings/net/dsa/ksz.txt
index 5e8429b6f9ca..95e91e84151c 100644
--- a/Documentation/devicetree/bindings/net/dsa/ksz.txt
+++ b/Documentation/devicetree/bindings/net/dsa/ksz.txt
@@ -15,6 +15,7 @@ Required properties:
   - "microchip,ksz8565"
   - "microchip,ksz9893"
   - "microchip,ksz9563"
+  - "microchip,ksz8563"
 
 Optional properties:
 
-- 
2.20.1


^ permalink raw reply related

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: Jiri Pirko @ 2019-08-27  9:35 UTC (permalink / raw)
  To: David Miller
  Cc: jakub.kicinski, dsahern, roopa, netdev, sthemmin, dcbw, mkubecek,
	andrew, parav, saeedm, mlxsw
In-Reply-To: <20190827.012242.418276717667374306.davem@davemloft.net>

Tue, Aug 27, 2019 at 10:22:42AM CEST, davem@davemloft.net wrote:
>From: Jiri Pirko <jiri@resnulli.us>
>Date: Tue, 27 Aug 2019 09:08:08 +0200
>
>> Okay, so if I understand correctly, on top of separate commands for
>> add/del of alternative names, you suggest also get/dump to be separate
>> command and don't fill this up in existing newling/getlink command.
>
>I'm not sure what to do yet.
>
>David has a point, because the only way these ifnames are useful is
>as ways to specify and choose net devices.  So based upon that I'm
>slightly learning towards not using separate commands.

Well yeah, one can use it to handle existing commands instead of
IFLA_NAME.

But why does it rule out separate commands? I think it is cleaner than
to put everything in poor setlink messages :/ The fact that we would
need to add "OP" to the setlink message just feels of. Other similar
needs may show up in the future and we may endup in ridiculous messages
like:

SETLINK
  IFLA_NAME eth0
  IFLA_ATLNAME_LIST (nest)
      IFLA_ALTNAME_OP add
      IFLA_ALTNAME somereallylongname 
      IFLA_ALTNAME_OP del
      IFLA_ALTNAME somereallyreallylongname 
      IFLA_ALTNAME_OP add
      IFLA_ALTNAME someotherreallylongname 
  IFLA_SOMETHING_ELSE_LIST (nest)
      IFLA_SOMETHING_ELSE_OP add
      ...
      IFLA_SOMETHING_ELSE_OP del
      ...
      IFLA_SOMETHING_ELSE_OP add
      ...

I don't know what to think about it. Rollbacks are going to be pure hell :/

^ permalink raw reply

* Re: [PATCH net] net/sched: pfifo_fast: fix wrong dereference when qdisc is reset
From: Stefano Brivio @ 2019-08-27  9:50 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Cong Wang, Jamal Hadi Salim, Jiri Pirko, David S. Miller, netdev,
	Paolo Abeni, Li Shuang
In-Reply-To: <0598164c6e32684e57c7656f0b8aca0813c51f42.1566861256.git.dcaratti@redhat.com>

On Tue, 27 Aug 2019 01:15:16 +0200
Davide Caratti <dcaratti@redhat.com> wrote:

> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 11c03cf4aa74..c89b787785a1 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -688,12 +688,14 @@ static void pfifo_fast_reset(struct Qdisc *qdisc)
>  			kfree_skb(skb);
>  	}
>  
> -	for_each_possible_cpu(i) {
> -		struct gnet_stats_queue *q = per_cpu_ptr(qdisc->cpu_qstats, i);
> +	if (qdisc_is_percpu_stats(qdisc))

This needs curly brackets, as the block has multiple lines (for coding
style only).

> +		for_each_possible_cpu(i) {
> +			struct gnet_stats_queue *q =
> +				per_cpu_ptr(qdisc->cpu_qstats, i);

And you could split declaration and assignment here, it takes two lines
anyway and becomes more readable.

-- 
Stefano

^ permalink raw reply

* Re: [PATCH -next] net: mlx5: Kconfig: Fix MLX5_CORE_EN dependencies
From: maowenan @ 2019-08-27  9:51 UTC (permalink / raw)
  To: wharms
  Cc: saeedm, leon, davem, netdev, linux-rdma, linux-kernel,
	kernel-janitors
In-Reply-To: <5D64DABF.4010601@bfs.de>



On 2019/8/27 15:24, walter harms wrote:
> 
> 
> Am 27.08.2019 05:12, schrieb Mao Wenan:
>> When MLX5_CORE_EN=y and PCI_HYPERV_INTERFACE is not set, below errors are found:
>> drivers/net/ethernet/mellanox/mlx5/core/en_main.o: In function `mlx5e_nic_enable':
>> en_main.c:(.text+0xb649): undefined reference to `mlx5e_hv_vhca_stats_create'
>> drivers/net/ethernet/mellanox/mlx5/core/en_main.o: In function `mlx5e_nic_disable':
>> en_main.c:(.text+0xb8c4): undefined reference to `mlx5e_hv_vhca_stats_destroy'
>>
>> This because CONFIG_PCI_HYPERV_INTERFACE is newly introduced by 'commit 348dd93e40c1
>> ("PCI: hv: Add a Hyper-V PCI interface driver for software backchannel interface"),
>> Fix this by making MLX5_CORE_EN imply PCI_HYPERV_INTERFACE.
>>
>> Fixes: cef35af34d6d ("net/mlx5e: Add mlx5e HV VHCA stats agent")
>> Signed-off-by: Mao Wenan <maowenan@huawei.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
>> index 37fef8c..a6a70ce 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
>> @@ -35,6 +35,7 @@ config MLX5_CORE_EN
>>  	depends on IPV6=y || IPV6=n || MLX5_CORE=m
> 
> OT but ...
> is that IPV6 needed at all ? can there be something else that yes or no ?

If I set IPV6=m, errors are found as below:
drivers/net/ethernet/mellanox/mlx5/core/main.o: In function `mlx5_unload':
main.c:(.text+0x275): undefined reference to `mlx5_hv_vhca_cleanup'
drivers/net/ethernet/mellanox/mlx5/core/main.o: In function `mlx5_cleanup_once':
main.c:(.text+0x2e8): undefined reference to `mlx5_hv_vhca_destroy'
drivers/net/ethernet/mellanox/mlx5/core/main.o: In function `mlx5_load_one':
main.c:(.text+0x23c1): undefined reference to `mlx5_hv_vhca_create'
main.c:(.text+0x248f): undefined reference to `mlx5_hv_vhca_init'
main.c:(.text+0x25e0): undefined reference to `mlx5_hv_vhca_cleanup
> 
> re,
>  wh
> 
>>  	select PAGE_POOL
>>  	select DIMLIB
>> +	imply PCI_HYPERV_INTERFACE
>>  	default n
>>  	---help---
>>  	  Ethernet support in Mellanox Technologies ConnectX-4 NIC.
> 
> .
> 


^ permalink raw reply

* Re: [PATCH v2 0/3] Add NETIF_F_HW_BR_CAP feature
From: Horatiu Vultur @ 2019-08-27 10:10 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: roopa, nikolay, davem, UNGLinuxDriver, alexandre.belloni,
	allan.nielsen, f.fainelli, netdev, linux-kernel, bridge
In-Reply-To: <20190826123811.GA13411@lunn.ch>

The 08/26/2019 14:38, Andrew Lunn wrote:
> External E-Mail
> 
> 
> On Mon, Aug 26, 2019 at 10:11:12AM +0200, Horatiu Vultur wrote:
> > When a network port is added to a bridge then the port is added in
> > promisc mode. Some HW that has bridge capabilities(can learn, forward,
> > flood etc the frames) they are disabling promisc mode in the network
> > driver when the port is added to the SW bridge.
> > 
> > This patch adds the feature NETIF_F_HW_BR_CAP so that the network ports
> > that have this feature will not be set in promisc mode when they are
> > added to a SW bridge.
> > 
> > In this way the HW that has bridge capabilities don't need to send all the
> > traffic to the CPU and can also implement the promisc mode and toggle it
> > using the command 'ip link set dev swp promisc on'
> 
> Hi Horatiu

Hi Andrew,
> 
> I'm still not convinced this is needed. The model is, the hardware is
> there to accelerate what Linux can do in software. Any peculiarities
> of the accelerator should be hidden in the driver.  If the accelerator
> can do its job without needing promisc mode, do that in the driver.
Thanks for the model description. I will keep in my mind for the next
patches that I will do.
> 
> So you are trying to differentiate between promisc mode because the
> interface is a member of a bridge, and promisc mode because some
> application, like pcap, has asked for promisc mode.
> 
> dev->promiscuity is a counter. So what you can do it look at its
> value, and how the interface is being used. If the interface is not a
> member of a bridge, and the count > 0, enable promisc mode in the
> accelerator. If the interface is a member of a bridge, and the count >
> 1, enable promisc mode in the accelerator.
That sounds like a great idea. I was expecting to add this logic in the
set_rx_mode function of the driver. But unfortunetly, I got the calls to
this function before the dev->promiscuity is updated or not to get the
call at all. For example in case the port is member of a bridge and I try
to enable the promisc mode.

> 
>    Andrew
> 
> 

-- 
/Horatiu

^ permalink raw reply

* Re: [RFC PATCH net-next] net: phy: force phy suspend when calling phy_stop
From: Sergei Shtylyov @ 2019-08-27 10:11 UTC (permalink / raw)
  To: Jian Shen, andrew, f.fainelli, hkallweit1, davem
  Cc: netdev, forest.zhouchang, linuxarm
In-Reply-To: <1566874020-14334-1-git-send-email-shenjian15@huawei.com>

On 27.08.2019 5:47, Jian Shen wrote:

> Some ethernet drivers may call phy_start() and phy_stop() from
> ndo_open and ndo_close() respectively.

    ndo_open() for consistency.

> When network cable is unconnected, and operate like below:
> step 1: ifconfig ethX up -> ndo_open -> phy_start ->start
> autoneg, and phy is no link.
> step 2: ifconfig ethX down -> ndo_close -> phy_stop -> just stop
> phy state machine.
> step 3: plugin the network cable, and autoneg complete, then
> LED for link status will be on.
> step 4: ethtool ethX --> see the result of "Link detected" is no.
> 
> This patch forces phy suspend even phydev->link is off.
> 
> Signed-off-by: Jian Shen <shenjian15@huawei.com>
[...]

MBR, Sergei


^ permalink raw reply

* Re: [PATCH net] net/sched: pfifo_fast: fix wrong dereference when qdisc is reset
From: Davide Caratti @ 2019-08-27 10:15 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Cong Wang, Jamal Hadi Salim, Jiri Pirko, David S. Miller, netdev,
	Paolo Abeni, Li Shuang
In-Reply-To: <20190827115031.43fcbac5@redhat.com>

hello Stefano,

thanks for looking at this.

On Tue, 2019-08-27 at 11:50 +0200, Stefano Brivio wrote:
> On Tue, 27 Aug 2019 01:15:16 +0200
> Davide Caratti <dcaratti@redhat.com> wrote:
[...]
> @@ -688,12 +688,14 @@ static void pfifo_fast_reset(struct Qdisc *qdisc)
> >  			kfree_skb(skb);
> >  	}
> >  
> > -	for_each_possible_cpu(i) {
> > -		struct gnet_stats_queue *q = per_cpu_ptr(qdisc->cpu_qstats, i);
> > +	if (qdisc_is_percpu_stats(qdisc))
> 
> This needs curly brackets, as the block has multiple lines (for coding
> style only).

If I well read Documentation/process/coding-style.rst, at the end of
section 3), this rule should apply to loops. But I'm fine with the curly
brace here, if checkpatch doesn't say anything.

I will add it in v2. 

> > +		for_each_possible_cpu(i) {
> > +			struct gnet_stats_queue *q =
> > +				per_cpu_ptr(qdisc->cpu_qstats, i);
> 
> And you could split declaration and assignment here, it takes two lines
> anyway and becomes more readable.

ok, I will do that in v2.

thanks!
-- 
davide


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox