Netdev List
 help / color / mirror / Atom feed
* [PATCH] vif queue counters from int to long
From: Mart van Santen @ 2016-12-23 15:09 UTC (permalink / raw)
  To: netdev; +Cc: ian.campbell, wei.liu2


[-- Attachment #1.1: Type: text/plain, Size: 1388 bytes --]


Hello,

This patch fixes an issue where counters in the queue have type int,
while the counters of the vif itself are specified as long. This can
cause incorrect reporting of tx/rx values of the vif interface.
More extensively reported on xen-devel mailinglist.



Signed-off-by: Mart van Santen <mart@greenhost.nl>
--- a/drivers/net/xen-netback/common.h  2016-12-22 15:41:07.785535748 +0000
+++ b/drivers/net/xen-netback/common.h  2016-12-23 13:08:18.123080064 +0000
@@ -113,10 +113,10 @@ struct xenvif_stats {
         * A subset of struct net_device_stats that contains only the
         * fields that are updated in netback.c for each queue.
         */
-       unsigned int rx_bytes;
-       unsigned int rx_packets;
-       unsigned int tx_bytes;
-       unsigned int tx_packets;
+       unsigned long rx_bytes;
+       unsigned long rx_packets;
+       unsigned long tx_bytes;
+       unsigned long tx_packets;

        /* Additional stats used by xenvif */
        unsigned long rx_gso_checksum_fixup;

-- 
Mart van Santen
Greenhost
E: mart@greenhost.nl
T: +31 20 4890444
W: https://greenhost.nl

A PGP signature can be attached to this e-mail,
you need PGP software to verify it. 
My public key is available in keyserver(s)
see: http://tinyurl.com/openpgp-manual

PGP Fingerprint: CA85 EB11 2B70 042D AF66  B29A 6437 01A1 10A3 D3A5



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply

* Re: [PATCH net 2/9] virtio-net: correctly xmit linearized page on XDP_TX
From: John Fastabend @ 2016-12-23 15:47 UTC (permalink / raw)
  To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <1482503852-12438-3-git-send-email-jasowang@redhat.com>

On 16-12-23 06:37 AM, Jason Wang wrote:
> After we linearize page, we should xmit this page instead of the page
> of first buffer which may lead unexpected result. With this patch, we
> can see correct packet during XDP_TX.
> 
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/virtio_net.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 1067253..fe4562d 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -572,7 +572,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  		if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
>  			goto err_xdp;
>  
> -		act = do_xdp_prog(vi, rq, xdp_prog, page, offset, len);
> +		act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len);
>  		switch (act) {
>  		case XDP_PASS:
>  			if (unlikely(xdp_page != page))
> 

Thanks clumsy merge conflict resolution between v4 and v6 versions :/

Acked-by: John Fastabend <john.r.fastabend@intel.com>

^ permalink raw reply

* Re: [PATCH] ethtool: add one ethtool option to set relax ordering mode
From: Alexander Duyck @ 2016-12-23 15:42 UTC (permalink / raw)
  To: maowenan
  Cc: Jeff Kirsher, Stephen Hemminger, netdev@vger.kernel.org,
	weiyongjun (A), Dingtianhong
In-Reply-To: <F95AC9340317A84688A5F0DF0246F3F20151FE86@szxeml504-mbs.china.huawei.com>

On Thu, Dec 22, 2016 at 10:14 PM, maowenan <maowenan@huawei.com> wrote:
>
>
>> -----Original Message-----
>> From: Jeff Kirsher [mailto:jeffrey.t.kirsher@intel.com]
>> Sent: Friday, December 23, 2016 9:07 AM
>> To: maowenan; Alexander Duyck
>> Cc: Stephen Hemminger; netdev@vger.kernel.org; weiyongjun (A);
>> Dingtianhong
>> Subject: Re: [PATCH] ethtool: add one ethtool option to set relax ordering mode
>>
>> On Fri, 2016-12-23 at 00:40 +0000, maowenan wrote:
>> > > -----Original Message-----
>> > > From: Alexander Duyck [mailto:alexander.duyck@gmail.com]
>> > > Sent: Thursday, December 22, 2016 11:54 PM
>> > > To: maowenan
>> > > Cc: Stephen Hemminger; netdev@vger.kernel.org; jeffrey.t.kirsher@intel.
>> > > com;
>> > > weiyongjun (A); Dingtianhong
>> > > Subject: Re: [PATCH] ethtool: add one ethtool option to set relax
>> > > ordering mode
>> > >
>> > > On Wed, Dec 21, 2016 at 5:39 PM, maowenan <maowenan@huawei.com>
>> > > wrote:
>> > > >
>> > > >
>> > > > > -----Original Message-----
>> > > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>> > > > > Sent: Thursday, December 22, 2016 9:28 AM
>> > > > > To: maowenan
>> > > > > Cc: netdev@vger.kernel.org; jeffrey.t.kirsher@intel.com
>> > > > > Subject: Re: [PATCH] ethtool: add one ethtool option to set
>> > > > > relax ordering mode
>> > > > >
>> > > > > On Thu, 8 Dec 2016 14:51:38 +0800 Mao Wenan
>> > > > > <maowenan@huawei.com> wrote:
>> > > > >
>> > > > > > This patch provides one way to set/unset IXGBE NIC TX and RX
>> > > > > > relax ordering mode, which can be set by ethtool.
>> > > > > > Relax ordering is one mode of 82599 NIC, to enable this mode
>> > > > > > can enhance the performance for some cpu architecure.
>> > > > >
>> > > > > Then it should be done by CPU architecture specific quirks
>> > > > > (preferably in PCI
>> > > > > layer) so that all users get the option without having to do
>> > > > > manual
>> > >
>> > > intervention.
>> > > > >
>> > > > > > example:
>> > > > > > ethtool -s enp1s0f0 relaxorder off ethtool -s enp1s0f0
>> > > > > > relaxorder on
>> > > > >
>> > > > > Doing it via ethtool is a developer API (for testing) not
>> > > > > something that makes sense in production.
>> > > >
>> > > >
>> > > > This feature is not mandatory for all users, acturally relax
>> > > > ordering default configuration of 82599 is 'disable', So this
>> > > > patch gives one way to
>> > >
>> > > enable relax ordering to be selected in some performance condition.
>> > >
>> > > That isn't quite true.  The default for Sparc systems is to have it
>> > > enabled.
>> > >
>> > > Really this is something that is platform specific.  I agree with
>> > > Stephen that it would work better if this was handled as a series of
>> > > platform specific quirks handled at something like the PCI layer
>> > > rather than be a switch the user can toggle on and off.
>> > >
>> > > With that being said there are changes being made that should help
>> > > to improve the situation.  Specifically I am looking at adding
>> > > support for the DMA_ATTR_WEAK_ORDERING which may also allow us to
>> > > identify cases where you might be able to specify the DMA behavior
>> > > via the DMA mapping instead of having to make the final decision in
>> > > the device itself.
>> > >
>> > > - Alex
>> >
>> > Yes, Sparc is a special case. From the NIC driver point of view, It is
>> > no need for some ARCHs to do particular operation and compiling
>> > branch, ethtool is a flexible method for user to make decision whether
>> > on|off this feature.
>> > I think Jeff as maintainer of 82599 has some comments about this.
>>
>> My original comment/objection was that you attempted to do this change as a
>> module parameter to the ixgbe driver, where I directed you to use ethtool so
>> that other drivers could benefit from the ability to enable/disable relaxed
>> ordering.  As far as how it gets implemented in ethtool or PCI layer, makes
>> little difference to me, I only had issues with the driver specific module
>> parameter implementation, which is not acceptable.
>
>
> Thank you Jeff and Alex.
> And then I have gone through mail thread about "i40e: enable PCIe relax ordering for SPARC",
> It only works for SPARC, any other ARCH who wants to enable DMA_ATTR_WEAK_ORDERING
> feature, should define the new macro, recompile the driver module.
>
> Because of the above reasons, we implement in ethtool to give the final user a convenient
> way to on|off special feature, no need define new macro, easy to extend the new features,
> and also good benefit for other driver as Jeff referred.
>

I think the point is we shouldn't base the decision on user input.
The fact is the PCIe device control register should have a bit that
indicates if the device is allowed to enable relaxed ordering or not.
If we can guarantee that the bit is set in all the cases where it
should be set, and cleared in all the cases where it should not then
we could use something like that to determine if the device is
supposed to enable relaxed ordering instead of trying to make the
decision ourselves.

- Alex

^ permalink raw reply

* Re: [PATCH net 3/9] virtio-net: fix page miscount during XDP linearizing
From: John Fastabend @ 2016-12-23 15:54 UTC (permalink / raw)
  To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <1482503852-12438-4-git-send-email-jasowang@redhat.com>

On 16-12-23 06:37 AM, Jason Wang wrote:
> We don't put page during linearizing, the would cause leaking when
> xmit through XDP_TX or the packet exceeds PAGE_SIZE. Fix them by
> put page accordingly. Also decrease the number of buffers during
> linearizing to make sure caller can free buffers correctly when packet
> exceeds PAGE_SIZE. With this patch, we won't get OOM after linearize
> huge number of packets.
> 
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---

Thanks! looks good. By the way do you happen to have any actual
configuration where this path is hit? I obviously didn't test this
very long other than a quick test with my hacked vhost driver.

Acked-by: John Fastabend <john.r.fastabend@intel.com>

^ permalink raw reply

* Re: [PATCH net 4/9] virtio-net: correctly handle XDP_PASS for linearized packets
From: John Fastabend @ 2016-12-23 15:57 UTC (permalink / raw)
  To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <1482503852-12438-5-git-send-email-jasowang@redhat.com>

On 16-12-23 06:37 AM, Jason Wang wrote:
> When XDP_PASS were determined for linearized packets, we try to get
> new buffers in the virtqueue and build skbs from them. This is wrong,
> we should create skbs based on existed buffers instead. Fixing them by
> creating skb based on xdp_page.
> 
> With this patch "ping 192.168.100.4 -s 3900 -M do" works for XDP_PASS.
> 
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/virtio_net.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 58ad40e..470293e 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -578,8 +578,14 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  		act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len);
>  		switch (act) {
>  		case XDP_PASS:
> -			if (unlikely(xdp_page != page))
> -				__free_pages(xdp_page, 0);
> +			/* We can only create skb based on xdp_page. */
> +			if (unlikely(xdp_page != page)) {
> +				rcu_read_unlock();
> +				put_page(page);
> +				head_skb = page_to_skb(vi, rq, xdp_page,
> +						       0, len, PAGE_SIZE);
> +				return head_skb;
> +			}
>  			break;
>  		case XDP_TX:
>  			if (unlikely(xdp_page != page))
> 

Great thanks. This was likely working before because of the memory
leak fixed in 3/9.

Acked-by: John Fastabend <john.r.fastabend@intel.com>

^ permalink raw reply

* Re: [PATCH net 5/9] virtio-net: unbreak csumed packets for XDP_PASS
From: John Fastabend @ 2016-12-23 15:58 UTC (permalink / raw)
  To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <1482503852-12438-6-git-send-email-jasowang@redhat.com>

On 16-12-23 06:37 AM, Jason Wang wrote:
> We drop csumed packet when do XDP for packets. This breaks
> XDP_PASS when GUEST_CSUM is supported. Fix this by allowing csum flag
> to be set. With this patch, simple TCP works for XDP_PASS.
> 
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/virtio_net.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 470293e..0778dc8 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -440,7 +440,7 @@ static struct sk_buff *receive_big(struct net_device *dev,
>  		struct virtio_net_hdr_mrg_rxbuf *hdr = buf;
>  		u32 act;
>  
> -		if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
> +		if (unlikely(hdr->hdr.gso_type))
>  			goto err_xdp;
>  		act = do_xdp_prog(vi, rq, xdp_prog, page, 0, len);
>  		switch (act) {
> @@ -572,7 +572,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  		 * the receive path after XDP is loaded. In practice I
>  		 * was not able to create this condition.
>  		 */
> -		if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
> +		if (unlikely(hdr->hdr.gso_type))
>  			goto err_xdp;
>  
>  		act = do_xdp_prog(vi, rq, xdp_prog, xdp_page, offset, len);
> 

Acked-by: John Fastabend <john.r.fastabend@intel.com>

^ permalink raw reply

* Re: [PATCH net 6/9] virtio-net: make rx buf size estimation works for XDP
From: John Fastabend @ 2016-12-23 16:02 UTC (permalink / raw)
  To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <1482503852-12438-7-git-send-email-jasowang@redhat.com>

On 16-12-23 06:37 AM, Jason Wang wrote:
> We don't update ewma rx buf size in the case of XDP. This will lead
> underestimation of rx buf size which causes host to produce more than
> one buffers. This will greatly increase the possibility of XDP page
> linearization.
> 
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/virtio_net.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 0778dc8..77ae358 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -584,10 +584,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  				put_page(page);
>  				head_skb = page_to_skb(vi, rq, xdp_page,
>  						       0, len, PAGE_SIZE);
> +				ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
>  				return head_skb;
>  			}
>  			break;
>  		case XDP_TX:
> +			ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
>  			if (unlikely(xdp_page != page))
>  				goto err_xdp;
>  			rcu_read_unlock();
> @@ -596,6 +598,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  		default:
>  			if (unlikely(xdp_page != page))
>  				__free_pages(xdp_page, 0);
> +			ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
>  			goto err_xdp;
>  		}
>  	}
> 

Looks needed although I guess it will only be the case with
MTU > ETH_DATA_LEN because of the clamp in get_mergeable_buf_len().
Although XDP setup allows MTU up to page_size - hdr so certainly
will happen with ~MTU > 1500.

I need to add some various MTU tests to my setup.

Acked-by: John Fastabend <john.r.fastabend@intel.com>

^ permalink raw reply

* Re: [PATCH net 7/9] virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support
From: John Fastabend @ 2016-12-23 16:02 UTC (permalink / raw)
  To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <1482503852-12438-8-git-send-email-jasowang@redhat.com>

On 16-12-23 06:37 AM, Jason Wang wrote:
> When VIRTIO_NET_F_GUEST_UFO is negotiated, host could still send UFO
> packet that exceeds a single page which could not be handled
> correctly by XDP. So this patch forbids setting XDP when GUEST_UFO is
> supported. While at it, forbid XDP for ECN (which comes only from GRO)
> too to prevent user from misconfiguration.
> 
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/virtio_net.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 77ae358..c1f66d8 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1684,7 +1684,9 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>  	int i, err;
>  
>  	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> -	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6)) {
> +	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
> +	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
> +	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO)) {
>  		netdev_warn(dev, "can't set XDP while host is implementing LRO, disable LRO first\n");
>  		return -EOPNOTSUPP;
>  	}
> 

Acked-by: John Fastabend <john.r.fastabend@intel.com>

^ permalink raw reply

* Re: [PATCH net 7/9] virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support
From: John Fastabend @ 2016-12-23 16:10 UTC (permalink / raw)
  To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <585D4AAD.4030702@gmail.com>

On 16-12-23 08:02 AM, John Fastabend wrote:
> On 16-12-23 06:37 AM, Jason Wang wrote:
>> When VIRTIO_NET_F_GUEST_UFO is negotiated, host could still send UFO
>> packet that exceeds a single page which could not be handled
>> correctly by XDP. So this patch forbids setting XDP when GUEST_UFO is
>> supported. While at it, forbid XDP for ECN (which comes only from GRO)
>> too to prevent user from misconfiguration.
>>

Is sending packets greater than single page though normal in this case?
I don't have any need to support big packet mode other than MST asked
for it. And I wasn't seeing this in my tests. MTU is capped at 4k - hdr
when XDP is enabled.

.John

>> Cc: John Fastabend <john.r.fastabend@intel.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>  drivers/net/virtio_net.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 77ae358..c1f66d8 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1684,7 +1684,9 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>>  	int i, err;
>>  
>>  	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>> -	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6)) {
>> +	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
>> +	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
>> +	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO)) {
>>  		netdev_warn(dev, "can't set XDP while host is implementing LRO, disable LRO first\n");
>>  		return -EOPNOTSUPP;
>>  	}
>>
> 
> Acked-by: John Fastabend <john.r.fastabend@intel.com>
> 

^ permalink raw reply

* Re: BPF hash algo (Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5)
From: Andy Lutomirski @ 2016-12-23 16:23 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Hannes Frederic Sowa, Alexei Starovoitov, Jason A. Donenfeld,
	kernel-hardening@lists.openwall.com, Theodore Ts'o, Netdev,
	LKML, Linux Crypto Mailing List, David Laight, Eric Dumazet,
	Linus Torvalds, Eric Biggers, Tom Herbert, Andi Kleen,
	David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <585D11BF.60903@iogearbox.net>

On Fri, Dec 23, 2016 at 3:59 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 12/23/2016 11:59 AM, Hannes Frederic Sowa wrote:
>>
>> On Fri, 2016-12-23 at 11:04 +0100, Daniel Borkmann wrote:
>>>
>>> On 12/22/2016 05:59 PM, Hannes Frederic Sowa wrote:
>>>>
>>>> On Thu, 2016-12-22 at 08:07 -0800, Andy Lutomirski wrote:
>
> [...]
>
>>>> The hashing is not a proper sha1 neither, unfortunately. I think that
>>>> is why it will have a custom implementation in iproute2?
>>>
>>>
>>> Still trying to catch up on this admittedly bit confusing thread. I
>>> did run automated tests over couple of days comparing the data I got
>>> from fdinfo with the one from af_alg and found no mismatch on the test
>>> cases varying from min to max possible program sizes. In the process
>>> of testing, as you might have seen on netdev, I found couple of other
>>> bugs in bpf code along the way and fixed them up as well. So my question,
>>> do you or Andy or anyone participating in claiming this have any
>>> concrete data or test cases that suggests something different? If yes,
>>> I'm very curious to hear about it and willing fix it up, of course.
>>> When I'm back from pto I'll prep and cook up my test suite to be
>>> included into the selftests/bpf/, should have done this initially,
>>> sorry about that. I'll also post something to expose the alg, that
>>> sounds fine to me.
>>
>>
>> Looking into your code closer, I noticed that you indeed seem to do the
>> finalization of sha-1 by hand by aligning and padding the buffer
>> accordingly and also patching in the necessary payload length.
>>
>> Apologies for my side for claiming that this is not correct sha1
>> output, I was only looking at sha_transform and its implementation and
>> couldn't see the padding and finalization round with embedding the data
>> length in there and hadn't thought of it being done manually.
>>
>> Anyway, is it difficult to get the sha finalization into some common
>> code library? It is not very bpf specific and crypto code reviewers
>> won't find it there at all.
>
>
> Yes, sure, I'll rework it that way (early next year when I'm back if
> that's fine with you).

Can we make it SHA-256 before 4.10 comes out, though?  This really
looks like it will be used in situations where collisions matter and
it will be exposed to malicious programs, and SHA-1 should not be used
for new designs for this purpose because it simply isn't long enough.

Also, a SHA-1 digest isn't a pile of u32s, so u32 digest[...] is very
misleading.  That should be u8 or, at the very least, __be32.

I realize that there isn't a sha-256 implementation in lib, but would
it really be so bad to make the bpf digest only work (for now) when
crypto is enabled?  I would *love* to see the crypto core learn how to
export simple primitives for direct use without needing the whole
crypto core, and this doesn't seem particularly hard to do, but I
don't think that's 4.10 material.

--Andy

^ permalink raw reply

* [PATCH net] sctp: do not loose window information if in rwnd_over
From: Marcelo Ricardo Leitner @ 2016-12-23 16:29 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich

It's possible that we receive a packet that is larger than current
window. If it's the first packet in this way, it will cause it to
increase rwnd_over. Then, if we receive another data chunk (specially as
SCTP allows you to have one data chunk in flight even during 0 window),
rwnd_over will be overwritten instead of added to.

In the long run, this could cause the window to grow bigger than its
initial size, as rwnd_over would be charged only for the last received
data chunk while the code will try open the window for all packets that
were received and had its value in rwnd_over overwritten. This, then,
can lead to the worsening of payload/buffer ratio and cause rwnd_press
to kick in more often.

The fix is to sum it too, same as is done for rwnd_press, so that if we
receive 3 chunks after closing the window, we still have to release that
same amount before re-opening it.

Log snippet from sctp_test exhibiting the issue:
[  146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
rwnd decreased by 1 to (0, 1, 114221)
[  146.209232] sctp: sctp_assoc_rwnd_decrease:
association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
[  146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
rwnd decreased by 1 to (0, 1, 114221)
[  146.209232] sctp: sctp_assoc_rwnd_decrease:
association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
[  146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
rwnd decreased by 1 to (0, 1, 114221)
[  146.209232] sctp: sctp_assoc_rwnd_decrease:
association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
[  146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
rwnd decreased by 1 to (0, 1, 114221)

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 net/sctp/associola.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 68428e1f71810fbe65b7f86c750c3ad61f0266ec..56ddcfaeb4f64235b54b0daa915fabf0cc0170a9 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1539,7 +1539,7 @@ void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
 			asoc->rwnd = 0;
 		}
 	} else {
-		asoc->rwnd_over = len - asoc->rwnd;
+		asoc->rwnd_over += len - asoc->rwnd;
 		asoc->rwnd = 0;
 	}
 
-- 
2.9.3

^ permalink raw reply related

* [PATCH net] sctp: fix recovering from 0 win with small data chunks
From: Marcelo Ricardo Leitner @ 2016-12-23 16:29 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich

Currently if SCTP closes the receive window with window pressure, mostly
caused by excessive skb overhead on payload/overheads ratio, SCTP will
close the window abruptly while saving the delta on rwnd_press. It will
start recovering rwnd as the chunks are consumed by the application and
the rwnd_press will be only recovered after rwnd reach the same value as
of rwnd_press, mostly to prevent silly window syndrome.

Thing is, this is very inefficient with small data chunks, as with those
it will never reach back that value, and thus it will never recover from
such pressure. This means that we will not issue window updates when
recovering from 0 window and will rely on a sender retransmit to notice
it.

The fix here is to remove such threshold, as no value is good enough: it
depends on the (avg) chunk sizes being used.

Test with netperf -t SCTP_STREAM -- -m 1, and trigger 0 window by
sending SIGSTOP to netserver, sleep 1.2, and SIGCONT.
Rate limited to 845kbps, for visibility. Capture done at netserver side.

Previously:
01.500751 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632372996] [a_rwnd 99153] [
01.500752 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632372997] [SID: 0] [SS
01.517471 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
01.517483 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
01.517485 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373083] [SID: 0] [SS
01.517488 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
01.534168 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373096] [SID: 0] [SS
01.534180 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
01.534181 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373169] [SID: 0] [SS
01.534185 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
02.525978 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
02.526021 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
  (window update missed)
04.573807 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
04.779370 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373082] [a_rwnd 859] [#g
04.789162 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373083] [SID: 0] [SS
04.789323 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373156] [SID: 0] [SS
04.789372 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373228] [a_rwnd 786] [#g

After:
02.568957 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098728] [a_rwnd 99153]
02.568961 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098729] [SID: 0] [S
02.585631 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
02.585666 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
02.585671 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098815] [SID: 0] [S
02.585683 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
02.602330 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098828] [SID: 0] [S
02.602359 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
02.602363 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098901] [SID: 0] [S
02.602372 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
03.600788 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
03.600830 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
03.619455 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 13508]
03.619479 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 27017]
03.619497 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 40526]
03.619516 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 54035]
03.619533 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 67544]
03.619552 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 81053]
03.619570 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 94562]
  (following data transmission triggered by window updates above)
03.633504 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
03.836445 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098814] [a_rwnd 100000]
03.843125 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098815] [SID: 0] [S
03.843285 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098888] [SID: 0] [S
03.843345 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098960] [a_rwnd 99894]
03.856546 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098961] [SID: 0] [S
03.866450 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490099011] [SID: 0] [S

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 net/sctp/associola.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 56ddcfaeb4f64235b54b0daa915fabf0cc0170a9..d3cc30c25c41091c2bf18022506dff4145d29944 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1471,7 +1471,7 @@ void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
 	 * threshold.  The idea is to recover slowly, but up
 	 * to the initial advertised window.
 	 */
-	if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
+	if (asoc->rwnd_press) {
 		int change = min(asoc->pathmtu, asoc->rwnd_press);
 		asoc->rwnd += change;
 		asoc->rwnd_press -= change;
-- 
2.9.3

^ permalink raw reply related

* Re: BPF hash algo (Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5)
From: Andy Lutomirski @ 2016-12-23 16:42 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Hannes Frederic Sowa, Alexei Starovoitov, Jason A. Donenfeld,
	kernel-hardening@lists.openwall.com, Theodore Ts'o, Netdev,
	LKML, Linux Crypto Mailing List, David Laight, Eric Dumazet,
	Linus Torvalds, Eric Biggers, Tom Herbert, Andi Kleen,
	David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <CALCETrVgnnbstPoPf-0bZ_ySDydXNCp5dLAEvhAD9cApozFsng@mail.gmail.com>

On Fri, Dec 23, 2016 at 8:23 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Dec 23, 2016 at 3:59 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 12/23/2016 11:59 AM, Hannes Frederic Sowa wrote:
>>>
>>> On Fri, 2016-12-23 at 11:04 +0100, Daniel Borkmann wrote:
>>>>
>>>> On 12/22/2016 05:59 PM, Hannes Frederic Sowa wrote:
>>>>>
>>>>> On Thu, 2016-12-22 at 08:07 -0800, Andy Lutomirski wrote:
>>
>> [...]
>>
>>>>> The hashing is not a proper sha1 neither, unfortunately. I think that
>>>>> is why it will have a custom implementation in iproute2?
>>>>
>>>>
>>>> Still trying to catch up on this admittedly bit confusing thread. I
>>>> did run automated tests over couple of days comparing the data I got
>>>> from fdinfo with the one from af_alg and found no mismatch on the test
>>>> cases varying from min to max possible program sizes. In the process
>>>> of testing, as you might have seen on netdev, I found couple of other
>>>> bugs in bpf code along the way and fixed them up as well. So my question,
>>>> do you or Andy or anyone participating in claiming this have any
>>>> concrete data or test cases that suggests something different? If yes,
>>>> I'm very curious to hear about it and willing fix it up, of course.
>>>> When I'm back from pto I'll prep and cook up my test suite to be
>>>> included into the selftests/bpf/, should have done this initially,
>>>> sorry about that. I'll also post something to expose the alg, that
>>>> sounds fine to me.
>>>
>>>
>>> Looking into your code closer, I noticed that you indeed seem to do the
>>> finalization of sha-1 by hand by aligning and padding the buffer
>>> accordingly and also patching in the necessary payload length.
>>>
>>> Apologies for my side for claiming that this is not correct sha1
>>> output, I was only looking at sha_transform and its implementation and
>>> couldn't see the padding and finalization round with embedding the data
>>> length in there and hadn't thought of it being done manually.
>>>
>>> Anyway, is it difficult to get the sha finalization into some common
>>> code library? It is not very bpf specific and crypto code reviewers
>>> won't find it there at all.
>>
>>
>> Yes, sure, I'll rework it that way (early next year when I'm back if
>> that's fine with you).
>
> Can we make it SHA-256 before 4.10 comes out, though?  This really
> looks like it will be used in situations where collisions matter and
> it will be exposed to malicious programs, and SHA-1 should not be used
> for new designs for this purpose because it simply isn't long enough.
>
> Also, a SHA-1 digest isn't a pile of u32s, so u32 digest[...] is very
> misleading.  That should be u8 or, at the very least, __be32.
>
> I realize that there isn't a sha-256 implementation in lib, but would
> it really be so bad to make the bpf digest only work (for now) when
> crypto is enabled?  I would *love* to see the crypto core learn how to
> export simple primitives for direct use without needing the whole
> crypto core, and this doesn't seem particularly hard to do, but I
> don't think that's 4.10 material.

I'm going to try to send out RFC patches for all of this today or
tomorrow.  It doesn't look bad at all.

--Andy

^ permalink raw reply

* Re: [PATCH v3] stmmac: CSR clock configuration fix
From: David Miller @ 2016-12-23 16:46 UTC (permalink / raw)
  To: Joao.Pinto
  Cc: peppe.cavallaro, seraphin.bonnaffe, hock.leong.kweh,
	niklas.cassel, pavel, linux-kernel, preid, netdev
In-Reply-To: <f586b394955d543470b614ddc8c5bc766c450c40.1482487963.git.jpinto@synopsys.com>

From: Joao Pinto <Joao.Pinto@synopsys.com>
Date: Fri, 23 Dec 2016 10:15:59 +0000

> When testing stmmac with my QoS reference design I checked a problem in the
> CSR clock configuration that was impossibilitating the phy discovery, since
> every read operation returned 0x0000ffff. This patch fixes the issue.
> 
> Signed-off-by: Joao Pinto <jpinto@synopsys.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net 9/9] virtio-net: XDP support for small buffers
From: John Fastabend @ 2016-12-23 16:51 UTC (permalink / raw)
  To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <1482503852-12438-10-git-send-email-jasowang@redhat.com>

On 16-12-23 06:37 AM, Jason Wang wrote:
> Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of
> small receive buffer untouched. This will confuse the user who want to
> set XDP but use small buffers. Other than forbid XDP in small buffer
> mode, let's make it work. XDP then can only work at skb->data since
> virtio-net create skbs during refill, this is sub optimal which could
> be optimized in the future.
> 
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---


Looks good to me thanks!

Acked-by: John Fastabend <john.r.fastabend@intel.com>

^ permalink raw reply

* Re: [PATCH net 0/2] net/sched fixes for cls_flower and act_tunnel_key
From: David Miller @ 2016-12-23 17:00 UTC (permalink / raw)
  To: ogerlitz; +Cc: netdev, roid, hadarh
In-Reply-To: <1482409695-27956-1-git-send-email-ogerlitz@mellanox.com>

From: Or Gerlitz <ogerlitz@mellanox.com>
Date: Thu, 22 Dec 2016 14:28:13 +0200

> This small series contain a fix to the matching flags support 
> in flower and to the tunnel key action MD prep for IPv6.

Series applied, thanks.

> On a non related note, wishing everyone around here a happy new year, 
> celebrate and take a rest so we can do lots of good patch work(s) next.

Happy new year to you as well.

^ permalink raw reply

* Re: [PATCH net 0/9] several fixups for virtio-net XDP
From: John Fastabend @ 2016-12-23 17:10 UTC (permalink / raw)
  To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <1482503852-12438-1-git-send-email-jasowang@redhat.com>

On 16-12-23 06:37 AM, Jason Wang wrote:
> Merry Xmas and a Happy New year to all:
> 
> This series tries to fixes several issues for virtio-net XDP which
> could be categorized into several parts:
> 
> - fix several issues during XDP linearizing
> - allow csumed packet to work for XDP_PASS
> - make EWMA rxbuf size estimation works for XDP
> - forbid XDP when GUEST_UFO is support
> - remove big packet XDP support
> - add XDP support or small buffer
> 
> Please see individual patches for details.
> 
> Thanks
> 
> Jason Wang (9):
>   virtio-net: remove the warning before XDP linearizing
>   virtio-net: correctly xmit linearized page on XDP_TX
>   virtio-net: fix page miscount during XDP linearizing
>   virtio-net: correctly handle XDP_PASS for linearized packets
>   virtio-net: unbreak csumed packets for XDP_PASS
>   virtio-net: make rx buf size estimation works for XDP
>   virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support
>   virtio-net: remove big packet XDP codes
>   virtio-net: XDP support for small buffers
> 
>  drivers/net/virtio_net.c | 172 ++++++++++++++++++++++++++++-------------------
>  1 file changed, 102 insertions(+), 70 deletions(-)
> 

Thanks a lot Jason. The last piece that is needed is support to
complete XDP support is to get the adjust_head part correct. I'll
send out a patch in a bit but will need to merge it on top of this
set.

.John

^ permalink raw reply

* [net/mm PATCH v2 0/3] Page fragment updates
From: Alexander Duyck @ 2016-12-23 17:16 UTC (permalink / raw)
  To: linux-mm, akpm, davem, netdev; +Cc: linux-kernel, jeffrey.t.kirsher

This patch series takes care of a few cleanups for the page fragments API.

First we do some renames so that things are much more consistent.  First we
move the page_frag_ portion of the name to the front of the functions
names.  Secondly we split out the cache specific functions from the other
page fragment functions by adding the word "cache" to the name.

Second I did some minor clean-up on the function calls so that they are
more inline with the existing __free_pages calls in terms of how they
operate.

Finally I added a bit of documentation that will hopefully help to explain
some of this.  I plan to revisit this later as we get things more ironed
out in the near future with the changes planned for the DMA setup to
support eXpress Data Path.

---

v2: Fixed a comparison between a void* and 0 due to copy/paste from free_pages

I'm listing this as a patch for net or mm since I had originally submitted
it against mm as that was where the patches for the __page_frag functions
has previously resided.  However they are now also in net, and I wanted to
get this fixed before the merge window closed as I was hoping to make use
of these APIs in net-next and I already have about 20 patches that are
waiting on these patches to be accepted.

I tried to get in touch with Andrew about this fix but I haven't heard any
reply to the email I sent out on Tuesday.  The last comment I had from
Andrew against v1 was "Looks good to me.  I have it all queued for post-4.9
processing.", but I haven't received any notice they were applied.

Alexander Duyck (3):
      mm: Rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
      mm: Rename __page_frag functions to __page_frag_cache, drop order from drain
      mm: Add documentation for page fragment APIs


 Documentation/vm/page_frags               |   42 +++++++++++++++++++++++++++++
 drivers/net/ethernet/intel/igb/igb_main.c |    6 ++--
 include/linux/gfp.h                       |    9 +++---
 include/linux/skbuff.h                    |    2 +
 mm/page_alloc.c                           |   33 +++++++++++++----------
 net/core/skbuff.c                         |    8 +++---
 6 files changed, 73 insertions(+), 27 deletions(-)
 create mode 100644 Documentation/vm/page_frags

--

^ permalink raw reply

* [net/mm PATCH v2 1/3] mm: Rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
From: Alexander Duyck @ 2016-12-23 17:17 UTC (permalink / raw)
  To: linux-mm, akpm, davem, netdev; +Cc: linux-kernel, jeffrey.t.kirsher
In-Reply-To: <20161223170756.14573.74139.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This patch renames the page frag functions to be more consistent with other
APIs.  Specifically we place the name page_frag first in the name and then
have either an alloc or free call name that we append as the suffix.  This
makes it a bit clearer in terms of naming.

In addition we drop the leading double underscores since we are technically
no longer a backing interface and instead the front end that is called from
the networking APIs.

The last bit I changed is I rebased page_frag_free to actually function
very similar to the function free_pages, the only real difference now is
the fact that we have to get the page order by calling compound page
instead of having it passed as a part of the function call.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

v2: Fixed a comparison between a void* and 0 due to copy/paste from free_pages

 include/linux/gfp.h    |    6 +++---
 include/linux/skbuff.h |    2 +-
 mm/page_alloc.c        |   20 ++++++++++++--------
 net/core/skbuff.c      |    8 ++++----
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4175dca4ac39..6238c74e0a01 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -508,9 +508,9 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 struct page_frag_cache;
 extern void __page_frag_drain(struct page *page, unsigned int order,
 			      unsigned int count);
-extern void *__alloc_page_frag(struct page_frag_cache *nc,
-			       unsigned int fragsz, gfp_t gfp_mask);
-extern void __free_page_frag(void *addr);
+extern void *page_frag_alloc(struct page_frag_cache *nc,
+			     unsigned int fragsz, gfp_t gfp_mask);
+extern void page_frag_free(void *addr);
 
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ac7fa34db8a7..e0b7f492b628 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2480,7 +2480,7 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
 
 static inline void skb_free_frag(void *addr)
 {
-	__free_page_frag(addr);
+	page_frag_free(addr);
 }
 
 void *napi_alloc_frag(unsigned int fragsz);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c6d5f64feca..842df9d60ebc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3939,8 +3939,8 @@ void __page_frag_drain(struct page *page, unsigned int order,
 }
 EXPORT_SYMBOL(__page_frag_drain);
 
-void *__alloc_page_frag(struct page_frag_cache *nc,
-			unsigned int fragsz, gfp_t gfp_mask)
+void *page_frag_alloc(struct page_frag_cache *nc,
+		      unsigned int fragsz, gfp_t gfp_mask)
 {
 	unsigned int size = PAGE_SIZE;
 	struct page *page;
@@ -3991,19 +3991,23 @@ void *__alloc_page_frag(struct page_frag_cache *nc,
 
 	return nc->va + offset;
 }
-EXPORT_SYMBOL(__alloc_page_frag);
+EXPORT_SYMBOL(page_frag_alloc);
 
 /*
  * Frees a page fragment allocated out of either a compound or order 0 page.
  */
-void __free_page_frag(void *addr)
+void page_frag_free(void *addr)
 {
-	struct page *page = virt_to_head_page(addr);
+	struct page *page;
+
+	if (addr != NULL) {
+		VM_BUG_ON(!virt_addr_valid(addr));
+		page = virt_to_head_page(addr);
 
-	if (unlikely(put_page_testzero(page)))
-		__free_pages_ok(page, compound_order(page));
+		__free_pages(page, compound_order(page));
+	}
 }
-EXPORT_SYMBOL(__free_page_frag);
+EXPORT_SYMBOL(page_frag_free);
 
 static void *make_alloc_exact(unsigned long addr, unsigned int order,
 		size_t size)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 65a74e13c45b..df4598ad6473 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -369,7 +369,7 @@ static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 
 	local_irq_save(flags);
 	nc = this_cpu_ptr(&netdev_alloc_cache);
-	data = __alloc_page_frag(nc, fragsz, gfp_mask);
+	data = page_frag_alloc(nc, fragsz, gfp_mask);
 	local_irq_restore(flags);
 	return data;
 }
@@ -391,7 +391,7 @@ static void *__napi_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 
-	return __alloc_page_frag(&nc->page, fragsz, gfp_mask);
+	return page_frag_alloc(&nc->page, fragsz, gfp_mask);
 }
 
 void *napi_alloc_frag(unsigned int fragsz)
@@ -441,7 +441,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 	local_irq_save(flags);
 
 	nc = this_cpu_ptr(&netdev_alloc_cache);
-	data = __alloc_page_frag(nc, len, gfp_mask);
+	data = page_frag_alloc(nc, len, gfp_mask);
 	pfmemalloc = nc->pfmemalloc;
 
 	local_irq_restore(flags);
@@ -505,7 +505,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (sk_memalloc_socks())
 		gfp_mask |= __GFP_MEMALLOC;
 
-	data = __alloc_page_frag(&nc->page, len, gfp_mask);
+	data = page_frag_alloc(&nc->page, len, gfp_mask);
 	if (unlikely(!data))
 		return NULL;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [net/mm PATCH v2 2/3] mm: Rename __page_frag functions to __page_frag_cache, drop order from drain
From: Alexander Duyck @ 2016-12-23 17:17 UTC (permalink / raw)
  To: linux-mm, akpm, davem, netdev; +Cc: linux-kernel, jeffrey.t.kirsher
In-Reply-To: <20161223170756.14573.74139.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This patch does two things.

First it goes through and renames the __page_frag prefixed functions to
__page_frag_cache so that we can be clear that we are draining or refilling
the cache, not the frags themselves.

Second we drop the order parameter from __page_frag_cache_drain since we
don't actually need to pass it since all fragments are either order 0 or
must be a compound page.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

v2: No change

 drivers/net/ethernet/intel/igb/igb_main.c |    6 +++---
 include/linux/gfp.h                       |    3 +--
 mm/page_alloc.c                           |   13 +++++++------
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index a761001308dc..1515abaa5ac9 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3962,8 +3962,8 @@ static void igb_clean_rx_ring(struct igb_ring *rx_ring)
 				     PAGE_SIZE,
 				     DMA_FROM_DEVICE,
 				     DMA_ATTR_SKIP_CPU_SYNC);
-		__page_frag_drain(buffer_info->page, 0,
-				  buffer_info->pagecnt_bias);
+		__page_frag_cache_drain(buffer_info->page,
+					buffer_info->pagecnt_bias);
 
 		buffer_info->page = NULL;
 	}
@@ -6991,7 +6991,7 @@ static struct sk_buff *igb_fetch_rx_buffer(struct igb_ring *rx_ring,
 		dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
 				     PAGE_SIZE, DMA_FROM_DEVICE,
 				     DMA_ATTR_SKIP_CPU_SYNC);
-		__page_frag_drain(page, 0, rx_buffer->pagecnt_bias);
+		__page_frag_cache_drain(page, rx_buffer->pagecnt_bias);
 	}
 
 	/* clear contents of rx_buffer */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6238c74e0a01..884080404e24 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -506,8 +506,7 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 extern void free_hot_cold_page_list(struct list_head *list, bool cold);
 
 struct page_frag_cache;
-extern void __page_frag_drain(struct page *page, unsigned int order,
-			      unsigned int count);
+extern void __page_frag_cache_drain(struct page *page, unsigned int count);
 extern void *page_frag_alloc(struct page_frag_cache *nc,
 			     unsigned int fragsz, gfp_t gfp_mask);
 extern void page_frag_free(void *addr);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 842df9d60ebc..c11c359c46b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3904,8 +3904,8 @@ void free_pages(unsigned long addr, unsigned int order)
  * drivers to provide a backing region of memory for use as either an
  * sk_buff->head, or to be used in the "frags" portion of skb_shared_info.
  */
-static struct page *__page_frag_refill(struct page_frag_cache *nc,
-				       gfp_t gfp_mask)
+static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
+					     gfp_t gfp_mask)
 {
 	struct page *page = NULL;
 	gfp_t gfp = gfp_mask;
@@ -3925,19 +3925,20 @@ static struct page *__page_frag_refill(struct page_frag_cache *nc,
 	return page;
 }
 
-void __page_frag_drain(struct page *page, unsigned int order,
-		       unsigned int count)
+void __page_frag_cache_drain(struct page *page, unsigned int count)
 {
 	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
 
 	if (page_ref_sub_and_test(page, count)) {
+		unsigned int order = compound_order(page);
+
 		if (order == 0)
 			free_hot_cold_page(page, false);
 		else
 			__free_pages_ok(page, order);
 	}
 }
-EXPORT_SYMBOL(__page_frag_drain);
+EXPORT_SYMBOL(__page_frag_cache_drain);
 
 void *page_frag_alloc(struct page_frag_cache *nc,
 		      unsigned int fragsz, gfp_t gfp_mask)
@@ -3948,7 +3949,7 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 
 	if (unlikely(!nc->va)) {
 refill:
-		page = __page_frag_refill(nc, gfp_mask);
+		page = __page_frag_cache_refill(nc, gfp_mask);
 		if (!page)
 			return NULL;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [net/mm PATCH v2 3/3] mm: Add documentation for page fragment APIs
From: Alexander Duyck @ 2016-12-23 17:17 UTC (permalink / raw)
  To: linux-mm, akpm, davem, netdev; +Cc: linux-kernel, jeffrey.t.kirsher
In-Reply-To: <20161223170756.14573.74139.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.h.duyck@intel.com>

This is a first pass at trying to add documentation for the page_frag APIs.
They may still change over time but for now I thought I would try to get
these documented so that as more network drivers and stack calls make use
of them we have one central spot to document how they are meant to be used.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

v2: No change

 Documentation/vm/page_frags |   42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)
 create mode 100644 Documentation/vm/page_frags

diff --git a/Documentation/vm/page_frags b/Documentation/vm/page_frags
new file mode 100644
index 000000000000..a6714565dbf9
--- /dev/null
+++ b/Documentation/vm/page_frags
@@ -0,0 +1,42 @@
+Page fragments
+--------------
+
+A page fragment is an arbitrary-length arbitrary-offset area of memory
+which resides within a 0 or higher order compound page.  Multiple
+fragments within that page are individually refcounted, in the page's
+reference counter.
+
+The page_frag functions, page_frag_alloc and page_frag_free, provide a
+simple allocation framework for page fragments.  This is used by the
+network stack and network device drivers to provide a backing region of
+memory for use as either an sk_buff->head, or to be used in the "frags"
+portion of skb_shared_info.
+
+In order to make use of the page fragment APIs a backing page fragment
+cache is needed.  This provides a central point for the fragment allocation
+and tracks allows multiple calls to make use of a cached page.  The
+advantage to doing this is that multiple calls to get_page can be avoided
+which can be expensive at allocation time.  However due to the nature of
+this caching it is required that any calls to the cache be protected by
+either a per-cpu limitation, or a per-cpu limitation and forcing interrupts
+to be disabled when executing the fragment allocation.
+
+The network stack uses two separate caches per CPU to handle fragment
+allocation.  The netdev_alloc_cache is used by callers making use of the
+__netdev_alloc_frag and __netdev_alloc_skb calls.  The napi_alloc_cache is
+used by callers of the __napi_alloc_frag and __napi_alloc_skb calls.  The
+main difference between these two calls is the context in which they may be
+called.  The "netdev" prefixed functions are usable in any context as these
+functions will disable interrupts, while the "napi" prefixed functions are
+only usable within the softirq context.
+
+Many network device drivers use a similar methodology for allocating page
+fragments, but the page fragments are cached at the ring or descriptor
+level.  In order to enable these cases it is necessary to provide a generic
+way of tearing down a page cache.  For this reason __page_frag_cache_drain
+was implemented.  It allows for freeing multiple references from a single
+page via a single call.  The advantage to doing this is that it allows for
+cleaning up the multiple references that were added to a page in order to
+avoid calling get_page per allocation.
+
+Alexander Duyck, Nov 29, 2016.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [PATCH net] inet: fix IP(V6)_RECVORIGDSTADDR for udp sockets
From: David Miller @ 2016-12-23 17:19 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, njagabar, samanthakumar, willemb
In-Reply-To: <1482448756-83129-1-git-send-email-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Thu, 22 Dec 2016 18:19:16 -0500

> From: Willem de Bruijn <willemb@google.com>
> 
> Socket cmsg IP(V6)_RECVORIGDSTADDR checks that port range lies within
> the packet. For sockets that have transport headers pulled, transport
> offset can be negative. Use signed comparison to avoid overflow.
> 
> Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
> Reported-by: Nisar Jagabar <njagabar@cloudmark.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: ipv6: handle -EFAULT from skb_copy_bits
From: David Miller @ 2016-12-23 17:21 UTC (permalink / raw)
  To: davej; +Cc: netdev, hannes
In-Reply-To: <20161222161622.o6gtidv24znbfs7n@codemonkey.org.uk>

From: Dave Jones <davej@codemonkey.org.uk>
Date: Thu, 22 Dec 2016 11:16:22 -0500

> By setting certain socket options on ipv6 raw sockets, we can confuse the
> length calculation in rawv6_push_pending_frames triggering a BUG_ON.
 ...
> Signed-off-by: Dave Jones <davej@codemonkey.org.uk>

Applied and queued up for -stable, thanks Dave.

^ permalink raw reply

* Re: [patch net 0/3] mlxsw: Router fixes
From: David Miller @ 2016-12-23 17:31 UTC (permalink / raw)
  To: jiri; +Cc: netdev, idosch, eladr, yotamg, nogahf, arkadis
In-Reply-To: <1482481970-2327-1-git-send-email-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Fri, 23 Dec 2016 09:32:47 +0100

> From: Jiri Pirko <jiri@mellanox.com>
> 
> Ido says:
> 
> First two patches ensure we remove from the device's table neighbours
> that are considered to be dead by the neighbour core.
> 
> The last patch removes nexthop groups from the device when they are no
> longer valid.

Series applied, thank you.

^ permalink raw reply

* Re: [net/mm PATCH v2 0/3] Page fragment updates
From: David Miller @ 2016-12-23 17:50 UTC (permalink / raw)
  To: alexander.duyck; +Cc: linux-mm, akpm, netdev, linux-kernel, jeffrey.t.kirsher
In-Reply-To: <20161223170756.14573.74139.stgit@localhost.localdomain>

From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Fri, 23 Dec 2016 09:16:39 -0800

> I tried to get in touch with Andrew about this fix but I haven't heard any
> reply to the email I sent out on Tuesday.  The last comment I had from
> Andrew against v1 was "Looks good to me.  I have it all queued for post-4.9
> processing.", but I haven't received any notice they were applied.

Andrew, please follow up with Alex.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox