Netdev List
* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
From: Vlastimil Babka @ 2018-06-04 12:40 UTC (permalink / raw)
  To: Michal Hocko, Qing Huang
  Cc: Eric Dumazet, David Miller, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim,
	santosh.shilimkar@oracle.com
In-Reply-To: <20180604062737.GA19202@dhcp22.suse.cz>

On 06/04/2018 08:27 AM, Michal Hocko wrote:
> On Fri 01-06-18 15:05:26, Qing Huang wrote:
>>
>>
>> On 6/1/2018 12:31 AM, Michal Hocko wrote:
>>> On Thu 31-05-18 19:04:46, Qing Huang wrote:
>>>>
>>>> On 5/31/2018 2:10 AM, Michal Hocko wrote:
>>>>> On Thu 31-05-18 10:55:32, Michal Hocko wrote:
>>>>>> On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
>>>>> [...]
>>>>>>> I merely copied/pasted from alloc_skb_with_frags() :/
>>>>>> I will have a look at it. Thanks!
>>>>> OK, so this is an example of an incremental development ;).
>>>>>
>>>>> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
>>>>> high order allocations") to prevent from OOM killer. Yet this was
>>>>> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
>>>>> allocation") didn't want an excessive reclaim for non-costly orders
>>>>> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
>>>>> place which is now redundant. Should I send a patch?
>>>>>
>>>> Just curious, how about the GFP_ATOMIC flag? Would it work in a similar
>>>> fashion? We experimented with it a bit in the past, but it seemed to cause
>>>> other issues in our tests. :-)
>>> GFP_ATOMIC is a non-sleeping (aka no reclaim) context with an access to
>>> memory reserves. So the risk is that you deplete those reserves and
>>> cause issues to other subsystems which need them as well.
>>>
>>>> By the way, we didn't encounter any OOM killer events. It seemed that the
>>>> mlx4_alloc_icm() triggered the slowpath.
>>>> We still had about 2GB free memory while it was highly fragmented.
>>> Compaction was able to make reasonable forward progress for you.
>>> But considering mlx4_alloc_icm is called with GFP_KERNEL resp. GFP_HIGHUSER
>>> then the OOM killer is clearly possible as long as the order is lower
>>> than 4.
>>
>> The allocation was 256KB, so the order was much higher than 4. Compaction
>> seemed to be the root cause of our problem. It took too long to finish its
>> work, putting mlx4_alloc_icm to sleep in a heavily fragmented memory
>> situation. Will the NORETRY flag avoid the compaction ops and fail the 256KB
>> allocation immediately, so that mlx4_alloc_icm can quickly enter the
>> adjustable lower-order allocation code path?
> 
> Costly orders should only perform a light compaction attempt unless
> __GFP_RETRY_MAYFAIL is used IIRC. CCing Vlastimil. So __GFP_NORETRY
> shouldn't make any difference.

It's a bit more complicated. Costly allocations will try the light
compaction attempt first, even before reclaim. This is followed by
reclaim and a more costly compaction attempt. With __GFP_NORETRY, the
second compaction attempt is also only the light one, so the flag does
make a difference here.
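
A simplified C model of the sequence described above follows. It is not kernel code. Every name is hypothetical, and the model assumes 4 KiB pages and an allocation that is allowed to sleep. It ignores deferred compaction, watermarks, OOM handling and __GFP_RETRY_MAYFAIL.

model_size_to_order() does get_order()-style arithmetic. It shows that the 256KB ICM chunk is order 6, which is above the costly threshold of 3. model_costly_slowpath() records which attempts such an allocation makes, with and without __GFP_NORETRY.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_COSTLY_ORDER 3	/* mirrors PAGE_ALLOC_COSTLY_ORDER */

/* Smallest order whose 2^order pages cover size, like get_order().
 * Assumes size is far below SIZE_MAX. */
unsigned int model_size_to_order(size_t size, size_t page_size)
{
	size_t pages = size / page_size + (size % page_size != 0);
	unsigned int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}

enum model_step {
	STEP_COMPACT_LIGHT,	/* cheap, light compaction attempt */
	STEP_RECLAIM,		/* direct reclaim */
	STEP_COMPACT_DEFAULT,	/* the "more costly" compaction attempt */
};

/*
 * Records, in order, the attempts a sleeping costly-order allocation
 * makes in this model and returns how many were recorded. Non-costly
 * orders are outside the model; 0 is returned for them.
 */
size_t model_costly_slowpath(unsigned int order, bool noretry,
			     enum model_step steps[3])
{
	size_t n = 0;

	if (order <= MODEL_COSTLY_ORDER)
		return 0;
	steps[n++] = STEP_COMPACT_LIGHT;	/* tried even before reclaim */
	steps[n++] = STEP_RECLAIM;
	/* With __GFP_NORETRY the second compaction attempt stays light. */
	steps[n++] = noretry ? STEP_COMPACT_LIGHT : STEP_COMPACT_DEFAULT;
	return n;
}
```

For example, model_costly_slowpath(6, true, steps) records light compaction, reclaim, light compaction. Passing false instead makes the last entry the more expensive attempt.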

* Re: [PATCH RFC ipsec-next 0/3] Virtual xfrm interfaces
From: David Miller @ 2018-06-04 12:58 UTC (permalink / raw)
  To: steffen.klassert
  Cc: netdev, eyal.birger, antony, benedictwong, lorenzo,
	shannon.nelson
In-Reply-To: <20180604060910.13896-1-steffen.klassert@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Mon, 4 Jun 2018 08:09:07 +0200

> This patchset introduces new virtual xfrm interfaces.
> The design of virtual xfrm interfaces was
> discussed at the Linux IPsec workshop 2018. This patchset
> implements these interfaces as the IPsec userspace and
> kernel developers agreed. The purpose of these interfaces
> is to overcome the design limitations that the existing
> VTI devices have.
> 
> We had two presentations about xfrm interfaces at
> the workshop. Slides with further information
> can be found at the workshop homepage:
> 
> https://workshop.linux-ipsec.org/2018/

First off, you will have to describe in detail what the VTI
limitations are and how these new devices overcome them in this commit
message.

You can't just say "we discussed this over there, go take a look".

The place people "take a look" is your text here.

Second, since you didn't explain things, I have to ask.  Why is a new
special ID even necessary?  It makes the flowi bigger, and adds all of
this new logic.

All netdevs have an ifindex and you should be able to find a way to
use the ifindex of these new devices in the key somehow.

Thanks.

* Re: Reply: ANNOUNCE: Enhanced IP v1.4
From: Eric Dumazet @ 2018-06-04 13:02 UTC (permalink / raw)
  To: PKU.孙斌, 'Willy Tarreau',
	'Eric Dumazet'
  Cc: 'Linux Kernel Network Developers'
In-Reply-To: <042801d3fbc9$02818fc0$0784af40$@pku.edu.cn>



On 06/03/2018 10:58 PM, PKU.孙斌 wrote:
> On Sun, Jun 03, 2018 at 03:41:08PM -0700, Eric Dumazet wrote:
>>
>>
>> On 06/03/2018 01:37 PM, Tom Herbert wrote:
>>
>>> This is not an inconsequential mechanism that is being proposed. It's
>>> a modification to IP protocol that is intended to work on the
>>> Internet, but it looks like the draft hasn't been updated for two
>>> years and it is not adopted by any IETF working group. I don't see how
>>> this can go anywhere without IETF support. Also, I suggest that you
>>> look at the IPv10 proposal since that was very similar in intent. One
>>> of the reasons that IPv10 was shot down was that protocol transition
>>> mechanisms were more interesting ten years ago than today. IPv6 has
>>> good traction now. In fact, it's probably the case that it's now
>>> easier to bring up IPv6 than to try to make IPv4 options work over the
>>> Internet.
>>
>> +1
>>
>> Many hosts do not use IPv4 anymore.
>>
>> We even have a project to make IPv4 support in Linux optional.
> 
> I guess then the Linux kernel wouldn't be able to boot without IPv4 built in, e.g., when we only have old L2 links (without the IPv6 frame type)...



*Optional* means that a CONFIG_IPV4 option would exist, and some people could build a kernel with CONFIG_IPV4=n,
just as IPv6 is optional today.

Of course, most distros will select CONFIG_IPV4=y (as they probably select CONFIG_IPV6=y today).

Do not worry, IPv4 is not dead, but I doubt Enhanced IP v1.4 has any chance,
it is at least 10 years too late.

* Re: [PATCH] samples/bpf: Add xdp_sample_pkts example
From: Toke Høiland-Jørgensen @ 2018-06-04 13:02 UTC (permalink / raw)
  To: Daniel Borkmann, Song Liu; +Cc: Networking
In-Reply-To: <672f2d99-f44d-7605-7c07-e9b6315f0bcd@iogearbox.net>

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 06/02/2018 06:22 AM, Daniel Borkmann wrote:
>> On 05/31/2018 11:44 AM, Toke Høiland-Jørgensen wrote:
>>> Song Liu <liu.song.a23@gmail.com> writes:
>>>
>>>> On Wed, May 30, 2018 at 9:45 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>>> This adds an example program showing how to sample packets from XDP using
>>>>> the perf event buffer. The example userspace program just prints the
>>>>> ethernet header for every packet sampled.
>>>>>
>>>>> Most of the userspace code is borrowed from other examples, most notably
>>>>> trace_output.
>>>>>
>>>>> Note that the example only works when everything runs on CPU0; so
>>>>> suitable smp_affinity needs to be set on the device. Some drivers seem
>>>>> to reset smp_affinity when loading an XDP program, so it may be
>>>>> necessary to change it after starting the example userspace program.
>>>>
>>>> Why does this only work when everything runs on CPU0? Is this
>>>> something we can improve?
>>>
>>> Yeah, good question. Basically, the call from XDP to
>>> bpf_perf_event_output() will fail with -EOPNOTSUPP. I tracked this down
>>> to this if statement in __bpf_perf_event_output() in bpf_trace.c:
>>>
>>>> 	if (unlikely(event->oncpu != cpu))
>>>> 		return -EOPNOTSUPP;
>>>
>>> I *think* that the way to fix this is for the userspace program to open
>>> a perf file descriptor for each CPU in the system and poll all of them,
>>> in which case the XDP program can pass the BPF_F_CURRENT_CPU flag to
>>> access the right one.
>> That is correct, you need one perf fd per cpu, and map them accordingly
>> into the map slots when you use BPF_F_CURRENT_CPU.
>
> Given this is a sample that users are likely to copy from, I think it would
> be great if you could fix this up so you can just pass in BPF_F_CURRENT_CPU
> eventually. Thanks for working on this, Toke!

You're welcome! And yup, I was planning to. I'll need to add a new
function to the trace helpers that can poll more than one fd; just
haven't gotten around to it yet. :)

-Toke
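
A small C model may make the CPU0 limitation easier to see. It is a simplified sketch, not the kernel's code. The model_* names are hypothetical, and only two things are modeled: the slot selection from the flags and the event->oncpu check quoted above.

- With a single perf event opened on CPU0 in slot 0, flags 0 always selects slot 0. Output from any other CPU therefore fails with -EOPNOTSUPP.
- With one event per CPU, each in its own slot, BPF_F_CURRENT_CPU succeeds on every CPU.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed to match BPF_F_CURRENT_CPU / BPF_F_INDEX_MASK in the uapi header. */
#define MODEL_F_CURRENT_CPU 0xffffffffULL

/* One perf event array slot: the CPU its event was opened on, or -1 if empty. */
struct model_slot {
	int oncpu;
};

/* Simplified model of the slot lookup in bpf_perf_event_output(). */
int model_perf_event_output(const struct model_slot *slots,
			    unsigned int nslots, int cpu, uint64_t flags)
{
	uint64_t index = flags & MODEL_F_CURRENT_CPU;

	if (index == MODEL_F_CURRENT_CPU)
		index = (uint64_t)cpu;		/* use the slot of this CPU */
	if (index >= nslots)
		return -E2BIG;
	if (slots[index].oncpu < 0)
		return -ENOENT;			/* nothing stored in the slot */
	if (slots[index].oncpu != cpu)
		return -EOPNOTSUPP;		/* the check quoted above */
	return 0;
}
```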

* [PATCH net-next] wan/fsl_ucc_hdlc: use dma_zalloc_coherent instead of allocator/memset
From: YueHaibing @ 2018-06-04 13:07 UTC (permalink / raw)
  To: davem, qiang.zhao; +Cc: netdev, linux-kernel, linuxppc-dev, YueHaibing

Use dma_zalloc_coherent instead of dma_alloc_coherent
followed by memset 0.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
 drivers/net/wan/fsl_ucc_hdlc.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/net/wan/fsl_ucc_hdlc.c b/drivers/net/wan/fsl_ucc_hdlc.c
index 33df764..4205dfd 100644
--- a/drivers/net/wan/fsl_ucc_hdlc.c
+++ b/drivers/net/wan/fsl_ucc_hdlc.c
@@ -270,10 +270,10 @@ static int uhdlc_init(struct ucc_hdlc_private *priv)
 	iowrite16be(DEFAULT_HDLC_ADDR, &priv->ucc_pram->haddr4);
 
 	/* Get BD buffer */
-	bd_buffer = dma_alloc_coherent(priv->dev,
-				       (RX_BD_RING_LEN + TX_BD_RING_LEN) *
-				       MAX_RX_BUF_LENGTH,
-				       &bd_dma_addr, GFP_KERNEL);
+	bd_buffer = dma_zalloc_coherent(priv->dev,
+					(RX_BD_RING_LEN + TX_BD_RING_LEN) *
+					MAX_RX_BUF_LENGTH,
+					&bd_dma_addr, GFP_KERNEL);
 
 	if (!bd_buffer) {
 		dev_err(priv->dev, "Could not allocate buffer descriptors\n");
@@ -281,9 +281,6 @@ static int uhdlc_init(struct ucc_hdlc_private *priv)
 		goto free_tiptr;
 	}
 
-	memset(bd_buffer, 0, (RX_BD_RING_LEN + TX_BD_RING_LEN)
-			* MAX_RX_BUF_LENGTH);
-
 	priv->rx_buffer = bd_buffer;
 	priv->tx_buffer = bd_buffer + RX_BD_RING_LEN * MAX_RX_BUF_LENGTH;
 
-- 
2.7.0
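
The same idea can be sketched as a userspace analogy. Here calloc() plays the role of dma_zalloc_coherent(), and malloc() plus memset() plays the role of the replaced pair. This is only an analogy: DMA-coherent memory has properties that plain heap memory lacks, and the function names are hypothetical. Besides removing the duplicated size expression, a single call means the zeroing cannot drift out of sync with the size actually allocated.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* "Before": allocate, then clear by hand, repeating the size expression.
 * The overflow check mirrors what calloc() does internally. */
void *alloc_then_clear(size_t nmemb, size_t size)
{
	void *p;

	if (size != 0 && nmemb > SIZE_MAX / size)
		return NULL;
	p = malloc(nmemb * size);
	if (!p)
		return NULL;
	memset(p, 0, nmemb * size);
	return p;
}

/* "After": a single call that hands back zeroed memory. */
void *alloc_zeroed(size_t nmemb, size_t size)
{
	return calloc(nmemb, size);
}
```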

* [PATCH net-next] qed: use dma_zalloc_coherent instead of allocator/memset
From: YueHaibing @ 2018-06-04 13:10 UTC (permalink / raw)
  To: davem, Ariel.Elior; +Cc: netdev, linux-kernel, everest-linux-l2, YueHaibing

Use dma_zalloc_coherent instead of dma_alloc_coherent
followed by memset 0.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
 drivers/net/ethernet/qlogic/qed/qed_cxt.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_cxt.c b/drivers/net/ethernet/qlogic/qed/qed_cxt.c
index 820b226..1835f00 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_cxt.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_cxt.c
@@ -936,14 +936,13 @@ static int qed_cxt_src_t2_alloc(struct qed_hwfn *p_hwfn)
 		u32 size = min_t(u32, total_size, psz);
 		void **p_virt = &p_mngr->t2[i].p_virt;
 
-		*p_virt = dma_alloc_coherent(&p_hwfn->cdev->pdev->dev,
-					     size,
-					     &p_mngr->t2[i].p_phys, GFP_KERNEL);
+		*p_virt = dma_zalloc_coherent(&p_hwfn->cdev->pdev->dev,
+					      size, &p_mngr->t2[i].p_phys,
+					      GFP_KERNEL);
 		if (!p_mngr->t2[i].p_virt) {
 			rc = -ENOMEM;
 			goto t2_fail;
 		}
-		memset(*p_virt, 0, size);
 		p_mngr->t2[i].size = size;
 		total_size -= size;
 	}
-- 
2.7.0

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
From: Michal Hocko @ 2018-06-04 13:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, qing.huang, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim
In-Reply-To: <20180531091022.GL15278@dhcp22.suse.cz>

On Thu 31-05-18 11:10:22, Michal Hocko wrote:
> On Thu 31-05-18 10:55:32, Michal Hocko wrote:
> > On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
> [...]
> > > I merely copied/pasted from alloc_skb_with_frags() :/
> > 
> > I will have a look at it. Thanks!
> 
> OK, so this is an example of an incremental development ;).
> 
> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
> high order allocations") to prevent from OOM killer. Yet this was
> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
> allocation") didn't want an excessive reclaim for non-costly orders
> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
> place which is now redundant. Should I send a patch?

Just in case you are interested
---
>From 5010543ed6f73e4c00367801486dca8d5c63b2ce Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 4 Jun 2018 15:07:37 +0200
Subject: [PATCH] net: cleanup gfp mask in alloc_skb_with_frags

alloc_skb_with_frags uses __GFP_NORETRY for non-sleeping allocations
which is just a noop and a little bit confusing.

__GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
high order allocations") to prevent from the OOM killer. Yet this was
not enough because fb05e7a89f50 ("net: don't wait for order-3 page
allocation") didn't want an excessive reclaim for non-costly orders
so it made it completely NOWAIT while it preserved __GFP_NORETRY in
place which is now redundant.

Drop the pointless __GFP_NORETRY because this function is used as
copy&paste source for other places.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 net/core/skbuff.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 857e4e6f751a..c1f22adc30de 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5239,8 +5239,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 			if (npages >= 1 << order) {
 				page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
 						   __GFP_COMP |
-						   __GFP_NOWARN |
-						   __GFP_NORETRY,
+						   __GFP_NOWARN,
 						   order);
 				if (page)
 					goto fill_page;
-- 
2.17.0

-- 
Michal Hocko
SUSE Labs
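
To illustrate why the flag is a no-op here, below is a toy C model of the mask that alloc_skb_with_frags() builds for its high-order attempt. It rests on these assumptions:

- The M_GFP_* names and bit values are hypothetical stand-ins for the real ___GFP_* constants.
- M_GFP_KERNEL is reduced to its two reclaim bits.
- The model encodes only what the commit message states: __GFP_NORETRY influences the direct reclaim/compaction retry logic, and a mask without __GFP_DIRECT_RECLAIM never reaches that logic.

```c
#include <stdbool.h>

/* Hypothetical bit values; the real ones live in include/linux/gfp.h. */
#define M_GFP_DIRECT_RECLAIM	(1u << 0)
#define M_GFP_KSWAPD_RECLAIM	(1u << 1)
#define M_GFP_NOWARN		(1u << 2)
#define M_GFP_COMP		(1u << 3)
#define M_GFP_NORETRY		(1u << 4)
#define M_GFP_KERNEL		(M_GFP_DIRECT_RECLAIM | M_GFP_KSWAPD_RECLAIM)

/* The high-order mask, before (with_noretry) and after the cleanup. */
unsigned int model_frag_mask(unsigned int gfp_mask, bool with_noretry)
{
	unsigned int m = (gfp_mask & ~M_GFP_DIRECT_RECLAIM) |
			 M_GFP_COMP | M_GFP_NOWARN;

	return with_noretry ? (m | M_GFP_NORETRY) : m;
}

/* In this model NORETRY only matters when direct reclaim is allowed. */
bool model_noretry_has_effect(unsigned int mask)
{
	return (mask & M_GFP_NORETRY) && (mask & M_GFP_DIRECT_RECLAIM);
}
```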

* Re: [bpf-next V2 PATCH 2/8] i40e: implement flush flag for ndo_xdp_xmit
From: Daniel Borkmann @ 2018-06-04 13:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: liu.song.a23, songliubraving, John Fastabend
In-Reply-To: <152775719291.24817.3098409990616007642.stgit@firesoul>

On 05/31/2018 10:59 AM, Jesper Dangaard Brouer wrote:
> When passed the XDP_XMIT_FLUSH flag i40e_xdp_xmit now performs the
> same kind of ring tail update as in i40e_xdp_flush.  The advantage is
> that all the necessary checks have been performed and xdp_ring can be
> updated, instead of having to perform the exact same steps/checks in
> i40e_xdp_flush
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c |   10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index c0451d6e0790..5f01e4ce9c92 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -3676,6 +3676,7 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>  	struct i40e_netdev_priv *np = netdev_priv(dev);
>  	unsigned int queue_index = smp_processor_id();
>  	struct i40e_vsi *vsi = np->vsi;
> +	struct i40e_ring *xdp_ring;
>  	int drops = 0;
>  	int i;
>  
> @@ -3685,20 +3686,25 @@ int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>  	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
>  		return -ENXIO;
>  
> -	if (unlikely(flags & ~XDP_XMIT_FLAGS_NONE))
> +	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>  		return -EINVAL;
>  
> +	xdp_ring = vsi->xdp_rings[queue_index];
> +
>  	for (i = 0; i < n; i++) {
>  		struct xdp_frame *xdpf = frames[i];
>  		int err;
>  
> -		err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
> +		err = i40e_xmit_xdp_ring(xdpf, xdp_ring);
>  		if (err != I40E_XDP_TX) {
>  			xdp_return_frame_rx_napi(xdpf);
>  			drops++;
>  		}
>  	}
>  
> +	if (unlikely(flags & XDP_XMIT_FLUSH))
> +		i40e_xdp_ring_update_tail(xdp_ring);

In addition to Alexei's feedback, I'd remove the unlikely() on the flush from here and from the
ixgbe one, as you did for the rest of the drivers in the series; just let the CPU decide.

For the invalid flags case it's totally fine. In fact, you could probably do this for all
three cases where you bail out at the beginning of i40e_xdp_xmit() and won't be able to
send anything anyway:

        if (test_bit(__I40E_VSI_DOWN, vsi->state))
                return -ENETDOWN;

        if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
                return -ENXIO;

        if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
                return -EINVAL;

Thanks,
Daniel
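
A minimal userspace sketch of the shape Daniel suggests follows. It assumes the XDP_XMIT_FLUSH and XDP_XMIT_FLAGS_MASK values used in this series: FLUSH is bit 0 and the mask equals FLUSH. likely()/unlikely() are defined the way the kernel defines them, via __builtin_expect() (GCC/Clang). The ring is a hypothetical stand-in, and the hardware tail write is reduced to a plain store. The error check keeps unlikely(), while the flush test is a plain if.

```c
#include <errno.h>

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Assumed to mirror the XDP_XMIT_* definitions used in this series. */
#define M_XDP_XMIT_FLUSH	(1U << 0)
#define M_XDP_XMIT_FLAGS_MASK	M_XDP_XMIT_FLUSH

struct model_ring {
	unsigned int next_to_use;	/* software producer index */
	unsigned int tail;		/* stand-in for the hardware tail register */
};

int model_xdp_xmit(struct model_ring *ring, int n, unsigned int flags)
{
	/* Bail-out paths are rare, so an unlikely() hint is reasonable. */
	if (unlikely(flags & ~M_XDP_XMIT_FLAGS_MASK))
		return -EINVAL;

	ring->next_to_use += (unsigned int)n;	/* "queue" n frames */

	/* How often callers flush depends on the workload: no hint. */
	if (flags & M_XDP_XMIT_FLUSH)
		ring->tail = ring->next_to_use;

	return n;
}
```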

* Re: [PATCH] samples/bpf: Add xdp_sample_pkts example
From: Daniel Borkmann @ 2018-06-04 13:12 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Song Liu; +Cc: Networking
In-Reply-To: <87r2lm1z87.fsf@toke.dk>

On 06/04/2018 03:02 PM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
>> On 06/02/2018 06:22 AM, Daniel Borkmann wrote:
>>> On 05/31/2018 11:44 AM, Toke Høiland-Jørgensen wrote:
>>>> Song Liu <liu.song.a23@gmail.com> writes:
>>>>> On Wed, May 30, 2018 at 9:45 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>>>> This adds an example program showing how to sample packets from XDP using
>>>>>> the perf event buffer. The example userspace program just prints the
>>>>>> ethernet header for every packet sampled.
>>>>>>
>>>>>> Most of the userspace code is borrowed from other examples, most notably
>>>>>> trace_output.
>>>>>>
>>>>>> Note that the example only works when everything runs on CPU0; so
>>>>>> suitable smp_affinity needs to be set on the device. Some drivers seem
>>>>>> to reset smp_affinity when loading an XDP program, so it may be
>>>>>> necessary to change it after starting the example userspace program.
>>>>>
>>>>> Why does this only work when everything runs on CPU0? Is this
>>>>> something we can improve?
>>>>
>>>> Yeah, good question. Basically, the call from XDP to
>>>> bpf_perf_event_output() will fail with -EOPNOTSUPP. I tracked this down
>>>> to this if statement in __bpf_perf_event_output() in bpf_trace.c:
>>>>
>>>>> 	if (unlikely(event->oncpu != cpu))
>>>>> 		return -EOPNOTSUPP;
>>>>
>>>> I *think* that the way to fix this is for the userspace program to open
>>>> a perf file descriptor for each CPU in the system and poll all of them,
>>>> in which case the XDP program can pass the BPF_F_CURRENT_CPU flag to
>>>> access the right one.
>>> That is correct, you need one perf fd per cpu, and map them accordingly
>>> into the map slots when you use BPF_F_CURRENT_CPU.
>>
>> Given this is a sample that users are likely to copy from, I think it would
>> be great if you could fix this up so you can just pass in BPF_F_CURRENT_CPU
>> eventually. Thanks for working on this, Toke!
> 
> You're welcome! And yup, I was planning to. I'll need to add a new
> function to the trace helpers that can poll more than one fd; just
> haven't gotten around to it yet. :)

Ok, great, looking forward!

Cheers,
Daniel
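
For readers following along, here is a rough sketch of the userspace side Daniel describes. It opens one BPF output perf event per CPU and stores each in the perf event array slot matching its CPU, so that BPF_F_CURRENT_CPU finds an event whose oncpu matches. Assumptions and limits:

- It uses the raw perf_event_open(2) and bpf(2) syscalls rather than any particular helper library, and the function names are hypothetical.
- map_fd is assumed to be an already-created BPF_MAP_TYPE_PERF_EVENT_ARRAY with at least ncpus entries.
- The mmap() ring buffers and the poll() loop over all fds are omitted. That polling is what the planned trace-helper function would handle.
- Opening CPU-wide events normally requires elevated privileges.

```c
#define _GNU_SOURCE
#include <linux/bpf.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Attributes for a BPF output event, which bpf_perf_event_output() fills. */
void init_bpf_output_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;
	attr->config = PERF_COUNT_SW_BPF_OUTPUT;
	attr->sample_type = PERF_SAMPLE_RAW;
	attr->sample_period = 1;
	attr->wakeup_events = 1;
}

/* Store perf_fd in slot key of the perf event array via bpf(2). */
int map_set_fd(int map_fd, __u32 key, int perf_fd)
{
	union bpf_attr attr;
	__u32 value = (__u32)perf_fd;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = (__u32)map_fd;
	attr.key = (__u64)(unsigned long)&key;
	attr.value = (__u64)(unsigned long)&value;
	attr.flags = BPF_ANY;
	return (int)syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

/* One event per CPU, stored in slot i; returns 0 or -1 (fds closed). */
int open_per_cpu_outputs(int map_fd, int ncpus, int *fds)
{
	struct perf_event_attr attr;
	int i;

	init_bpf_output_attr(&attr);
	for (i = 0; i < ncpus; i++) {
		/* pid = -1, cpu = i: all tasks, bound to CPU i. */
		fds[i] = (int)syscall(__NR_perf_event_open, &attr, -1, i, -1, 0UL);
		if (fds[i] < 0 ||
		    ioctl(fds[i], PERF_EVENT_IOC_ENABLE, 0) < 0 ||
		    map_set_fd(map_fd, (__u32)i, fds[i]) < 0)
			goto err;
	}
	return 0;
err:
	/* Close everything opened so far, including a failing fds[i]. */
	for (; i >= 0; i--)
		if (fds[i] >= 0)
			close(fds[i]);
	return -1;
}
```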

* Re: [bpf-next V2 PATCH 3/8] ixgbe: implement flush flag for ndo_xdp_xmit
From: Daniel Borkmann @ 2018-06-04 13:19 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: liu.song.a23, songliubraving, John Fastabend
In-Reply-To: <152775719796.24817.11035788244128769860.stgit@firesoul>

On 05/31/2018 10:59 AM, Jesper Dangaard Brouer wrote:
> When passed the XDP_XMIT_FLUSH flag ixgbe_xdp_xmit now performs the
> same kind of ring tail update as in ixgbe_xdp_flush.  The update tail
> code in ixgbe_xdp_flush is generalized and shared with ixgbe_xdp_xmit.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 87f088f4af52..4fd77c9067f2 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -10022,6 +10022,15 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>  	}
>  }
>  
> +static void ixgbe_xdp_ring_update_tail(struct ixgbe_ring *ring)
> +{
> +	/* Force memory writes to complete before letting h/w know there
> +	 * are new descriptors to fetch.
> +	 */
> +	wmb();
> +	writel(ring->next_to_use, ring->tail);
> +}

Did you double check that this doesn't become a function call? Should this
get an __always_inline attribute?

> +
>  static int ixgbe_xdp_xmit(struct net_device *dev, int n,
>  			  struct xdp_frame **frames, u32 flags)
>  {
> @@ -10033,7 +10042,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
>  	if (unlikely(test_bit(__IXGBE_DOWN, &adapter->state)))
>  		return -ENETDOWN;
>  
> -	if (unlikely(flags & ~XDP_XMIT_FLAGS_NONE))
> +	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>  		return -EINVAL;
>  
>  	/* During program transitions its possible adapter->xdp_prog is assigned
> @@ -10054,6 +10063,9 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
>  		}
>  	}
>  
> +	if (unlikely(flags & XDP_XMIT_FLUSH))
> +		ixgbe_xdp_ring_update_tail(ring);
> +
>  	return n - drops;
>  }
>  
> @@ -10072,11 +10084,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
>  	if (unlikely(!ring))
>  		return;
>  
> -	/* Force memory writes to complete before letting h/w know there
> -	 * are new descriptors to fetch.
> -	 */
> -	wmb();
> -	writel(ring->next_to_use, ring->tail);
> +	ixgbe_xdp_ring_update_tail(ring);
>  
>  	return;
>  }
> 

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
From: Eric Dumazet @ 2018-06-04 13:22 UTC (permalink / raw)
  To: Michal Hocko, Eric Dumazet
  Cc: David Miller, qing.huang, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim
In-Reply-To: <20180604131104.GS19202@dhcp22.suse.cz>



On 06/04/2018 06:11 AM, Michal Hocko wrote:
> On Thu 31-05-18 11:10:22, Michal Hocko wrote:

> Just in case you are interested
> ---
> From 5010543ed6f73e4c00367801486dca8d5c63b2ce Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 4 Jun 2018 15:07:37 +0200
> Subject: [PATCH] net: cleanup gfp mask in alloc_skb_with_frags
> 
> alloc_skb_with_frags uses __GFP_NORETRY for non-sleeping allocations
> which is just a noop and a little bit confusing.
> 
> __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
> high order allocations") to prevent from the OOM killer. Yet this was
> not enough because fb05e7a89f50 ("net: don't wait for order-3 page
> allocation") didn't want an excessive reclaim for non-costly orders
> so it made it completely NOWAIT while it preserved __GFP_NORETRY in
> place which is now redundant.
> 
> Drop the pointless __GFP_NORETRY because this function is used as
> copy&paste source for other places.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !

* Re: [PATCH 1/2 net-next] net_failover: fix net_failover_compute_features()
From: David Miller @ 2018-06-04 13:31 UTC (permalink / raw)
  To: dan.carpenter; +Cc: sridhar.samudrala, netdev, kernel-janitors
In-Reply-To: <20180531120124.pc4txiifxnrslbei@kili.mountain>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Thu, 31 May 2018 15:01:25 +0300

> @@ -380,7 +380,8 @@ static rx_handler_result_t net_failover_handle_frame(struct sk_buff **pskb)
>  
>  static void net_failover_compute_features(struct net_device *dev)
>  {
> -	u32 vlan_features = FAILOVER_VLAN_FEATURES & NETIF_F_ALL_FOR_ALL;
> +	netdev_features_t vlan_features = FAILOVER_VLAN_FEATURES |
> +					  NETIF_F_ALL_FOR_ALL;

The type does need to be corrected to netdev_features_t, but the
logical operation is correct.

It's a policy operation that was simply by-hand propagated all
over the place where these kinds of calculations are performed.

So vlan_features is starting with a value of 0 intentionally.
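
The type issue Dave agrees with can be shown in a few lines of C. netdev_features_t is a 64-bit type, so holding a feature mask in a u32 silently drops any feature bit at position 32 or above. The feature bits below are hypothetical, and the sketch says nothing about which real NETIF_F_* flags are affected.

```c
#include <stdint.h>

typedef uint64_t model_features_t;	/* netdev_features_t is 64 bits wide */

/* Hypothetical feature bits: one below bit 32, one above it. */
#define M_F_LOW_FEATURE		((model_features_t)1 << 3)
#define M_F_HIGH_FEATURE	((model_features_t)1 << 40)

/* What "u32 vlan_features = <64-bit expression>" does implicitly. */
uint32_t model_store_in_u32(model_features_t f)
{
	return (uint32_t)f;	/* bits 32..63 are discarded */
}
```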

* Re: [PATCH] net: virtio: simplify the virtnet_find_vqs
From: David Miller @ 2018-06-04 13:33 UTC (permalink / raw)
  To: xiangxia.m.yue; +Cc: netdev
In-Reply-To: <1527776192-26928-1-git-send-email-xiangxia.m.yue@gmail.com>

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Date: Thu, 31 May 2018 07:16:32 -0700

> Use the common free functions when returning successfully.
> 
> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This looks fine, applied, thanks.

* Re: [bpf PATCH v2] bpf: sockmap, fix crash when ipv6 sock is added
From: Daniel Borkmann @ 2018-06-04 13:39 UTC (permalink / raw)
  To: John Fastabend, Eric Dumazet, edumazet, ast, Dave Watson; +Cc: netdev
In-Reply-To: <81abd5f7-5343-a27a-6715-8b413f6c5a27@gmail.com>

Hey guys,

On 06/02/2018 11:39 PM, John Fastabend wrote:
> On 06/01/2018 12:58 PM, Eric Dumazet wrote:
>> On 06/01/2018 03:46 PM, John Fastabend wrote:
>>> This fixes a crash where we assign tcp_prot to IPv6 sockets instead
>>> of tcpv6_prot.
>>
>> ...
>>
>>> +	/* ULPs are currently supported only for TCP sockets in ESTABLISHED
>>> +	 * state. Supporting sockets in LISTEN state will require us to
>>> +	 * modify the accept implementation to clone rather than share the
>>> +	 * ulp context.
>>> +	 */
>>> +	if (sock->sk_state != TCP_ESTABLISHED)
>>> +		return -ENOTSUPP;
>>> +
>>>  	/* 1. If sock map has BPF programs those will be inherited by the
>>>  	 * sock being added. If the sock is already attached to BPF programs
>>>  	 * this results in an error.
>>
>> The next question will then be: what happens if syzbot uses tcp_disconnect() and then listen()?
> 
> Yep we need to fix that as well :( Looks like we can plumb the
> unhash callback and remove it from the sockmap when the socket
> goes through tcp_disconnect().
> 
> This patch should go in as-is though and we can fix the disconnect
> issue with a new patch.
> 
> Adding Dave Watson to the thread as well because I'm guessing
> the disconnect() case is also applicable to TLS. At least I see
> a hw handler for unhash but there does not appear to be a handler
> in the SW case, at least from a quick glance.
> 
> Thanks again!

Given that the discussion and fixes weren't resolved or ready in time for 4.17,
and the last bpf PR for it went out last week, we need to route this via -stable
once everything is hashed out.

This fix therefore needs to be rebased against the bpf-next tree, and as far
as I can see, another fix for the hash map is also needed to address the same issue.

After that, fixes for the disconnect + listen case are likely needed as well.

(I can use the one here later on for the -stable backport, but given that the merge
window is open, this needs a rebase and a resolution for the hash map.)

Thanks,
Daniel

* Re: [PATCH 09/10] dpaa_eth: add support for hardware timestamping
From: Richard Cochran @ 2018-06-04 13:49 UTC (permalink / raw)
  To: Yangbo Lu
  Cc: netdev, madalin.bucur, Rob Herring, Shawn Guo, David S . Miller,
	devicetree, linuxppc-dev, linux-arm-kernel, linux-kernel
In-Reply-To: <20180604070837.19265-10-yangbo.lu@nxp.com>

On Mon, Jun 04, 2018 at 03:08:36PM +0800, Yangbo Lu wrote:

> +if FSL_DPAA_ETH
> +config FSL_DPAA_ETH_TS
> +	bool "DPAA hardware timestamping support"
> +	select PTP_1588_CLOCK_QORIQ
> +	default n
> +	help
> +	  Enable DPAA hardware timestamping support.
> +	  This option is useful for applications to get
> +	  hardware time stamps on the Ethernet packets
> +	  using the SO_TIMESTAMPING API.
> +endif

You should drop this #ifdef.  In general, if a MAC supports time
stamping and PHC, then the driver support should simply be compiled
in.

[ When time stamping incurs a large run time performance penalty to
  non-PTP users, then it might make sense to have a Kconfig option to
  disable it, but that doesn't appear to be the case here. ]

> @@ -1615,6 +1635,24 @@ static int dpaa_eth_refill_bpools(struct dpaa_priv *priv)
>  	skbh = (struct sk_buff **)phys_to_virt(addr);
>  	skb = *skbh;
>  
> +#ifdef CONFIG_FSL_DPAA_ETH_TS
> +	if (priv->tx_tstamp &&
> +	    skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {

This condition fits on one line easily.

> +		struct skb_shared_hwtstamps shhwtstamps;
> +		u64 ns;

Local variables belong at the top of the function.

> +		memset(&shhwtstamps, 0, sizeof(shhwtstamps));
> +
> +		if (!dpaa_get_tstamp_ns(priv->net_dev, &ns,
> +					priv->mac_dev->port[TX],
> +					(void *)skbh)) {
> +			shhwtstamps.hwtstamp = ns_to_ktime(ns);
> +			skb_tstamp_tx(skb, &shhwtstamps);
> +		} else {
> +			dev_warn(dev, "dpaa_get_tstamp_ns failed!\n");
> +		}
> +	}
> +#endif
>  	if (unlikely(qm_fd_get_format(fd) == qm_fd_sg)) {
>  		nr_frags = skb_shinfo(skb)->nr_frags;
>  		dma_unmap_single(dev, addr, qm_fd_get_offset(fd) +
> @@ -2086,6 +2124,14 @@ static int dpaa_start_xmit(struct sk_buff *skb, struct net_device *net_dev)
>  	if (unlikely(err < 0))
>  		goto skb_to_fd_failed;
>  
> +#ifdef CONFIG_FSL_DPAA_ETH_TS
> +	if (priv->tx_tstamp &&
> +	    skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {

One line please.

> +		fd.cmd |= FM_FD_CMD_UPD;
> +		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> +	}
> +#endif
> +
>  	if (likely(dpaa_xmit(priv, percpu_stats, queue_mapping, &fd) == 0))
>  		return NETDEV_TX_OK;
>  

Thanks,
Richard
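
One point may reassure anyone joining the condition onto one line: in C, & binds more tightly than &&. The expression priv->tx_tstamp && skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP therefore already means a && (b & c).

The tiny userspace sketch below demonstrates this, with the logic written without any #ifdef. The flag values and function names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>

/* Stand-in bit values; only their distinctness matters here. */
#define M_SKBTX_HW_TSTAMP	(1u << 0)
#define M_SKBTX_IN_PROGRESS	(1u << 2)

/* Parses as tx_tstamp && (tx_flags & M_SKBTX_HW_TSTAMP). */
bool model_want_hw_tstamp(bool tx_tstamp, unsigned int tx_flags)
{
	return tx_tstamp && tx_flags & M_SKBTX_HW_TSTAMP;
}

/* Mirrors the xmit hunk: mark a requested timestamp as in progress. */
unsigned int model_mark_in_progress(bool tx_tstamp, unsigned int tx_flags)
{
	if (model_want_hw_tstamp(tx_tstamp, tx_flags))
		tx_flags |= M_SKBTX_IN_PROGRESS;
	return tx_flags;
}
```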

* [PATCH net-next] MAINTAINERS: TCP gets its first maintainer
From: Eric Dumazet @ 2018-06-04 13:50 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 MAINTAINERS | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 0ae0dbf0e15e74febca1b3469098a08704b59594..70d61c2b1be46c0927ae6648c644b8c7828cce48 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9862,6 +9862,19 @@ F:	net/ipv6/calipso.c
 F:	net/netfilter/xt_CONNSECMARK.c
 F:	net/netfilter/xt_SECMARK.c
 
+NETWORKING [TCP]
+M:	Eric Dumazet <edumazet@google.com>
+L:	netdev@vger.kernel.org
+S:	Maintained
+F:	net/ipv4/tcp*.c
+F:	net/ipv4/syncookies.c
+F:	net/ipv6/tcp*.c
+F:	net/ipv6/syncookies.c
+F:	include/uapi/linux/tcp.h
+F:	include/net/tcp.h
+F:	include/linux/tcp.h
+F:	include/trace/events/tcp.h
+
 NETWORKING [TLS]
 M:	Boris Pismenny <borisp@mellanox.com>
 M:	Aviad Yehezkel <aviadye@mellanox.com>
-- 
2.17.1.1185.g55be947832-goog
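
As a rough illustration of which files the new F: patterns cover, here is a sketch using POSIX fnmatch(3) with FNM_PATHNAME, so that * does not cross a /. This is only an approximation: get_maintainer.pl applies its own matching rules, which may differ in edge cases. covered_by_tcp_entry() is a hypothetical name.

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/* Patterns copied from the NETWORKING [TCP] entry above. */
static const char *const tcp_patterns[] = {
	"net/ipv4/tcp*.c",
	"net/ipv4/syncookies.c",
	"net/ipv6/tcp*.c",
	"net/ipv6/syncookies.c",
	"include/uapi/linux/tcp.h",
	"include/net/tcp.h",
	"include/linux/tcp.h",
	"include/trace/events/tcp.h",
};

/* True if any pattern matches; FNM_PATHNAME keeps '*' within one directory. */
bool covered_by_tcp_entry(const char *path)
{
	size_t i;

	for (i = 0; i < sizeof(tcp_patterns) / sizeof(tcp_patterns[0]); i++)
		if (fnmatch(tcp_patterns[i], path, FNM_PATHNAME) == 0)
			return true;
	return false;
}
```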

* Re: [PATCH net-next 0/3] selftests/net: various
From: David Miller @ 2018-06-04 13:50 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, willemb
In-Reply-To: <20180531161440.89709-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Thu, 31 May 2018 12:14:37 -0400

> From: Willem de Bruijn <willemb@google.com>
> 
> A few odds and ends to network tests:
> 
> - msg_zerocopy: run as part of kselftest
> - udp gso:      add missing bounds test for minimal sizes
> - psocket_snd:  initial basic conformance test

Always great to see new tests.

Series applied, thanks Willem.

* Re: [bpf-next V2 PATCH 3/8] ixgbe: implement flush flag for ndo_xdp_xmit
From: Jesper Dangaard Brouer @ 2018-06-04 13:53 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, liu.song.a23,
	songliubraving, John Fastabend, brouer
In-Reply-To: <156d6d45-8557-0303-edeb-10d04c2be474@iogearbox.net>

On Mon, 4 Jun 2018 15:19:05 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> > +static void ixgbe_xdp_ring_update_tail(struct ixgbe_ring *ring)
> > +{
> > +	/* Force memory writes to complete before letting h/w know there
> > +	 * are new descriptors to fetch.
> > +	 */
> > +	wmb();
> > +	writel(ring->next_to_use, ring->tail);
> > +}  
> 
> Did you double check that this doesn't become a function call? Should this
> get an __always_inline attribute?

I did check this doesn't become a function call.  The same kind of code
happens in other places in the driver, but I chose not to generalize
this, exactly to avoid this becoming a function call ;-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [bpf PATCH v2] bpf: sockmap, fix crash when ipv6 sock is added
From: John Fastabend @ 2018-06-04 13:57 UTC
  To: Daniel Borkmann, Eric Dumazet, edumazet, ast, Dave Watson; +Cc: netdev
In-Reply-To: <c9eab906-6793-1e98-b9d8-01d665ac1c3c@iogearbox.net>

On 06/04/2018 06:39 AM, Daniel Borkmann wrote:
> Hey guys,
> 
> On 06/02/2018 11:39 PM, John Fastabend wrote:
>> On 06/01/2018 12:58 PM, Eric Dumazet wrote:
>>> On 06/01/2018 03:46 PM, John Fastabend wrote:
>>>> This fixes a crash where we assign tcp_prot to IPv6 sockets instead
>>>> of tcpv6_prot.
>>>
>>> ...
>>>
>>>> +	/* ULPs are currently supported only for TCP sockets in ESTABLISHED
>>>> +	 * state. Supporting sockets in LISTEN state will require us to
>>>> +	 * modify the accept implementation to clone rather then share the
>>>> +	 * ulp context.
>>>> +	 */
>>>> +	if (sock->sk_state != TCP_ESTABLISHED)
>>>> +		return -ENOTSUPP;
>>>> +
>>>>  	/* 1. If sock map has BPF programs those will be inherited by the
>>>>  	 * sock being added. If the sock is already attached to BPF programs
>>>>  	 * this results in an error.
>>>
>>> Next question will be then : What happens if syzbot uses tcp_disconnect() and then listen() ?
>>
>> Yep we need to fix that as well :( Looks like we can plumb the
>> unhash callback and remove it from the sockmap when the socket
>> goes through tcp_disconnect().
>>
>> This patch should go in as-is though and we can fix the disconnect
>> issue with a new patch.
>>
>> Adding Dave Watson to the thread as well because I'm guessing
>> the disconnect() case is also applicable to TLS. At least I see
>> a hw handler for unhash but there does not appear to be a handler
>> in the SW case, at least from a quick glance.
>>
>> Thanks again!
> 
> Given the discussion and fixes weren't resolved resp. ready in time for 4.17,
> and last bpf pr for it went out last week, we need to route this via -stable
> once all is hashed out.
> 

OK.

> This fix here therefore needs to be rebased against bpf-next tree, and as far
> as I can see another fix for hash map is also needed to address the same issue.
> 

This fix works for both sockmap and sockhash because they use the same
ulp register and init paths. But, will rebase for net-next and send out
this morning.

> After that, likely also fixes for the disconnect + listen case are needed.
> 

Yep will have a fix today for this.

> (I can use the one here later on for -stable backport, but given merge window
> is open this needs a rebase and a resolution for hash map.)
> 

hash map is also resolved with the same patch but please do queue this
up for -stable.


> Thanks,
> Daniel
> 

* Re: [PATCH net-next 0/2] selftests: forwarding: mirror_vlan: Fixlets
From: David Miller @ 2018-06-04 14:09 UTC
  To: petrm; +Cc: netdev, linux-kselftest, shuah, idosch
In-Reply-To: <cover.1527805500.git.petrm@mellanox.com>

From: Petr Machata <petrm@mellanox.com>
Date: Fri, 01 Jun 2018 00:37:29 +0200

> This patchset includes two small fixes for the tests that were
> introduced in commit 1bb58d2d3cbe ("Merge branch
> 'Mirroring-tests-involving-VLAN'").
> 
> In patch #1, a "tc action trap" is uninstalled after the suite runs,
> instead of being installed again.
> 
> In patch #2, a test in the suite is renamed to differentiate it from another
> test of the same name.

Series applied, thank you.

* Re: [PATCH net-next v2 0/2] qed: Fix issues in UFP feature commit 'cac6f691'.
From: David Miller @ 2018-06-04 14:11 UTC
  To: sudarsana.kalluru; +Cc: netdev, Ariel.Elior, Michal.Kalderon
In-Reply-To: <20180601014737.6164-1-sudarsana.kalluru@cavium.com>

From: Sudarsana Reddy Kalluru <sudarsana.kalluru@cavium.com>
Date: Thu, 31 May 2018 18:47:35 -0700

> From: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
> 
> This patch series fixes a couple of issues in the UFP feature commit,
>    cac6f691: Add support for Unified Fabric Port.
> 
> Changes from previous version:
> ------------------------------
> v2: Added "Fixes:" tag.
> 
> Please consider applying it to "net-next".

Series applied, thank you.

* Re: [PATCH v3 net-next] net: stmmac: Add Flexible PPS support
From: David Miller @ 2018-06-04 14:13 UTC
  To: Jose.Abreu
  Cc: netdev, Joao.Pinto, Vitor.Soares, peppe.cavallaro,
	alexandre.torgue, richardcochran
In-Reply-To: <6f0f69081c8352845da413f2737f313d7904d3ee.1527785912.git.joabreu@synopsys.com>

From: Jose Abreu <Jose.Abreu@synopsys.com>
Date: Thu, 31 May 2018 18:01:27 +0100

> This adds support for Flexible PPS output (which is equivalent
> to per_out output of PTP subsystem).
> 
> Tested using an oscilloscope and the following commands:
> 
> 1) Start PTP4L:
> 	# ptp4l -A -4 -H -m -i eth0 &
> 2) Set Flexible PPS frequency:
> 	# echo <idx> <ts> <tns> <ps> <pns> > /sys/class/ptp/ptpX/period
> 
> where ts/tns is the start time, ps/pns is the period, and ptpX is the
> PTP clock of eth0.
> 
> Signed-off-by: Jose Abreu <joabreu@synopsys.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Joao Pinto <jpinto@synopsys.com>
> Cc: Vitor Soares <soares@synopsys.com>
> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> Cc: Alexandre Torgue <alexandre.torgue@st.com>
> Cc: Richard Cochran <richardcochran@gmail.com>
> ---
> Changes from v2:
> 	- Remove PPS support as we can't input the event to PTP
> 	subsystem
> Changes from v1:
> 	- Correct kbuild errors in some archs

Applied, thanks.

* Re: [PATCH] net: ethernet: mlx4: Remove unnecessary parentheses
From: David Miller @ 2018-06-04 14:15 UTC
  To: rvarsha016
  Cc: tariqt, der.herr, lukas.bulwahn, netdev, linux-rdma, linux-kernel
In-Reply-To: <20180601020049.3704-1-rvarsha016@gmail.com>

From: Varsha Rao <rvarsha016@gmail.com>
Date: Fri,  1 Jun 2018 07:30:49 +0530

> This patch fixes the clang warning of extraneous parentheses, with the
> following coccinelle script.
 ...
> Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> Signed-off-by: Varsha Rao <rvarsha016@gmail.com>

Applied to net-next, thanks.

* Re: [PATCH net-next] MAINTAINERS: TCP gets its first maintainer
From: Jiri Pirko @ 2018-06-04 14:31 UTC
  To: Eric Dumazet; +Cc: David S . Miller, netdev, Eric Dumazet
In-Reply-To: <20180604135029.241753-1-edumazet@google.com>

Mon, Jun 04, 2018 at 03:50:29PM CEST, edumazet@google.com wrote:
>Signed-off-by: Eric Dumazet <edumazet@google.com>
>---
> MAINTAINERS | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
>diff --git a/MAINTAINERS b/MAINTAINERS
>index 0ae0dbf0e15e74febca1b3469098a08704b59594..70d61c2b1be46c0927ae6648c644b8c7828cce48 100644
>--- a/MAINTAINERS
>+++ b/MAINTAINERS
>@@ -9862,6 +9862,19 @@ F:	net/ipv6/calipso.c
> F:	net/netfilter/xt_CONNSECMARK.c
> F:	net/netfilter/xt_SECMARK.c
> 
>+NETWORKING [TCP]
>+M:	Eric Dumazet <edumazet@google.com>

May the Force be with you...


>+L:	netdev@vger.kernel.org
>+S:	Maintained
>+F:	net/ipv4/tcp*.c
>+F:	net/ipv4/syncookies.c
>+F:	net/ipv6/tcp*.c
>+F:	net/ipv6/syncookies.c
>+F:	include/uapi/linux/tcp.h
>+F:	include/net/tcp.h
>+F:	include/linux/tcp.h
>+F:	include/trace/events/tcp.h
>+
> NETWORKING [TLS]
> M:	Boris Pismenny <borisp@mellanox.com>
> M:	Aviad Yehezkel <aviadye@mellanox.com>
>-- 
>2.17.1.1185.g55be947832-goog
>

* Re: [PATCH net-next] MAINTAINERS: TCP gets its first maintainer
From: David Miller @ 2018-06-04 14:32 UTC
  To: edumazet; +Cc: netdev, eric.dumazet
In-Reply-To: <20180604135029.241753-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Mon,  4 Jun 2018 06:50:29 -0700

> Signed-off-by: Eric Dumazet <edumazet@google.com>

Thanks a lot Eric, applied to net-next. :-)
