* Re: [PATCH 4/6 net-next] inet: don't check for bind conflicts twice when searching for a port
From: David Miller @ 2016-12-23 18:58 UTC (permalink / raw)
To: jbacik; +Cc: hannes, kraigatgoog, eric.dumazet, tom, netdev, kernel-team
In-Reply-To: <1482441998-28359-5-git-send-email-jbacik@fb.com>
From: Josef Bacik <jbacik@fb.com>
Date: Thu, 22 Dec 2016 16:26:36 -0500
> This is just wasted time, we've already found a tb that doesn't have a bind
> conflict, and we don't drop the head lock so scanning again isn't going to give
> us a different answer. Instead move the tb->reuse setting logic outside of the
> found_tb path and put it in the success: path. Then make it so that we don't
> goto again if we find a bind conflict in the found_tb path as we won't reach
> this anymore when we are scanning for an ephemeral port.
>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
...
> @@ -220,8 +219,10 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
> spin_lock_bh(&head->lock);
> inet_bind_bucket_for_each(tb, &head->chain)
> if (net_eq(ib_net(tb), net) && tb->port == port) {
> + if (hlist_empty(&tb->owners))
> + goto success;
I am not sure that this condition can ever trigger.
The only time we will see an alive 'tb' with an empty owners list
is when we allocate one in this function. And when that allocation
succeeds, we go straight through the rest of this function without
ever jumping out for errors or anything like that.
So we should never drop the allocated fresh 'tb', leave it with
an empty owners list, and branch back up to this part of the
function.
^ permalink raw reply
* Re: [PATCH net 0/9] several fixups for virtio-net XDP
From: David Miller @ 2016-12-23 18:49 UTC (permalink / raw)
To: jasowang; +Cc: john.r.fastabend, netdev, virtualization, linux-kernel, mst
In-Reply-To: <1482503852-12438-1-git-send-email-jasowang@redhat.com>
From: Jason Wang <jasowang@redhat.com>
Date: Fri, 23 Dec 2016 22:37:23 +0800
> Merry Xmas and a Happy New year to all:
>
> This series tries to fixes several issues for virtio-net XDP which
> could be categorized into several parts:
>
> - fix several issues during XDP linearizing
> - allow csumed packet to work for XDP_PASS
> - make EWMA rxbuf size estimation works for XDP
> - forbid XDP when GUEST_UFO is support
> - remove big packet XDP support
> - add XDP support or small buffer
>
> Please see individual patches for details.
Series applied, thanks Jason.
^ permalink raw reply
* Re: [PATCH net] virtio_net: reject XDP programs using header adjustment
From: John Fastabend @ 2016-12-23 18:44 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev, kafai, Daniel Borkmann, alexei.starovoitov, mst
In-Reply-To: <CAJpBn1wKkJ83GLznzQ+DCs7JZb-FkQ-zYu3d6i4p1RmfKy0mig@mail.gmail.com>
On 16-12-20 04:30 AM, Jakub Kicinski wrote:
> On Mon, Dec 19, 2016 at 5:50 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> On 16-12-19 07:05 AM, Jakub Kicinski wrote:
>>> commit 17bedab27231 ("bpf: xdp: Allow head adjustment in XDP prog")
>>> added a new XDP helper to prepend and remove data from a frame.
>>> Make virtio_net reject programs making use of this helper until
>>> proper support is added.
>>>
>>> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>>> ---
>>> drivers/net/virtio_net.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 08327e005ccc..db761f37783e 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1677,6 +1677,11 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>>> u16 xdp_qp = 0, curr_qp;
>>> int i, err;
>>>
>>> + if (prog && prog->xdp_adjust_head) {
>>> + netdev_warn(dev, "Does not support bpf_xdp_adjust_head()\n");
>>> + return -EOPNOTSUPP;
>>> + }
>>> +
>>> if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>>> virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6)) {
>>> netdev_warn(dev, "can't set XDP while host is implementing LRO, disable LRO first\n");
>>>
>>
>> Acked-by: John Fastabend <john.r.fastabend@intel.com>
>>
>> Thanks patch looks good. Alternatively we could push a "bug fix" to
>> support the adjust header feature depending on how DaveM and MST feel
>> about that. I don't have a strong opinion but I have the patch on my
>> queue it just needs some more testing.
>
> Cool! I thought to ask you what your plans are but then this patch is so
> trivial I decided to just post it :) I'm perfectly happy with dropping it
> for now and reposting after ~rc5 if needed.
>
Hi Jakub,
I just posted a RFC take a look if you get a chance. Notice though Jason
fixed up my linearize path a bunch so it will need to be pushed on top
of that series.
Thanks,
John
^ permalink raw reply
* [RFC PATCH] virtio_net: XDP support for adjust_head
From: John Fastabend @ 2016-12-23 18:43 UTC (permalink / raw)
To: jasowang, jakub.kicinski, mst
Cc: john.r.fastabend, netdev, alexei.starovoitov, daniel,
virtualization
Add support for XDP adjust head by allocating a 256B header region
that XDP programs can grow into. This is only enabled when a XDP
program is loaded.
In order to ensure that we do not have to unwind queue headroom push
queue setup below bpf_prog_add. It reads better to do a prog ref
unwind vs another queue setup call.
TBD merge with Jason Wang's fixes, do a bit more testing, note
big_packet support is removed by Jason as well.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
drivers/net/virtio_net.c | 53 +++++++++++++++++++++++++++++++---------------
1 file changed, 36 insertions(+), 17 deletions(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 08327e0..1a93158 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -159,6 +159,9 @@ struct virtnet_info {
/* Ethtool settings */
u8 duplex;
u32 speed;
+
+ /* Headroom allocated in RX Queue */
+ unsigned int headroom;
};
struct padded_vnet_hdr {
@@ -392,6 +395,7 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
else
hdr_padded_len = sizeof(struct padded_vnet_hdr);
+ xdp.data_hard_start = buf - vi->headroom;
xdp.data = buf + hdr_padded_len;
xdp.data_end = xdp.data + (len - vi->hdr_len);
@@ -430,9 +434,12 @@ static struct sk_buff *receive_big(struct net_device *dev,
void *buf,
unsigned int len)
{
- struct bpf_prog *xdp_prog;
struct page *page = buf;
+ struct bpf_prog *xdp_prog;
struct sk_buff *skb;
+ int offset;
+
+ offset = vi->headroom;
rcu_read_lock();
xdp_prog = rcu_dereference(rq->xdp_prog);
@@ -442,7 +449,7 @@ static struct sk_buff *receive_big(struct net_device *dev,
if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
goto err_xdp;
- act = do_xdp_prog(vi, rq, xdp_prog, page, 0, len);
+ act = do_xdp_prog(vi, rq, xdp_prog, page, offset, len);
switch (act) {
case XDP_PASS:
break;
@@ -456,7 +463,7 @@ static struct sk_buff *receive_big(struct net_device *dev,
}
rcu_read_unlock();
- skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
+ skb = page_to_skb(vi, rq, page, offset, len, PAGE_SIZE);
if (unlikely(!skb))
goto err;
@@ -797,10 +804,10 @@ static int add_recvbuf_big(struct virtnet_info *vi, struct receive_queue *rq,
/* rq->sg[0], rq->sg[1] share the same page */
/* a separated rq->sg[0] for header - required in case !any_header_sg */
- sg_set_buf(&rq->sg[0], p, vi->hdr_len);
+ sg_set_buf(&rq->sg[0], p, vi->hdr_len + vi->headroom);
/* rq->sg[1] for data packet, from offset */
- offset = sizeof(struct padded_vnet_hdr);
+ offset = vi->headroom + sizeof(struct padded_vnet_hdr);
sg_set_buf(&rq->sg[1], p + offset, PAGE_SIZE - offset);
/* chain first in list head */
@@ -823,24 +830,27 @@ static unsigned int get_mergeable_buf_len(struct ewma_pkt_len *avg_pkt_len)
return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
}
-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static int add_recvbuf_mergeable(struct virtnet_info *vi,
+ struct receive_queue *rq, gfp_t gfp)
{
struct page_frag *alloc_frag = &rq->alloc_frag;
+ unsigned int headroom = vi->headroom;
char *buf;
unsigned long ctx;
int err;
unsigned int len, hole;
len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
- if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
+ if (unlikely(!skb_page_frag_refill(len + headroom, alloc_frag, gfp)))
return -ENOMEM;
buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+ buf += headroom; /* advance address leaving hole at front of pkt */
ctx = mergeable_buf_to_ctx(buf, len);
get_page(alloc_frag->page);
- alloc_frag->offset += len;
+ alloc_frag->offset += len + headroom;
hole = alloc_frag->size - alloc_frag->offset;
- if (hole < len) {
+ if (hole < len + headroom) {
/* To avoid internal fragmentation, if there is very likely not
* enough space for another buffer, add the remaining space to
* the current buffer. This extra space is not included in
@@ -874,7 +884,7 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
gfp |= __GFP_COLD;
do {
if (vi->mergeable_rx_bufs)
- err = add_recvbuf_mergeable(rq, gfp);
+ err = add_recvbuf_mergeable(vi, rq, gfp);
else if (vi->big_packets)
err = add_recvbuf_big(vi, rq, gfp);
else
@@ -1669,12 +1679,15 @@ static void virtnet_init_settings(struct net_device *dev)
.set_settings = virtnet_set_settings,
};
+#define VIRTIO_XDP_HEADROOM 256
+
static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
{
unsigned long int max_sz = PAGE_SIZE - sizeof(struct padded_vnet_hdr);
struct virtnet_info *vi = netdev_priv(dev);
struct bpf_prog *old_prog;
u16 xdp_qp = 0, curr_qp;
+ unsigned int old_hr;
int i, err;
if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
@@ -1704,19 +1717,25 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
return -ENOMEM;
}
+ old_hr = vi->headroom;
+ if (prog) {
+ prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
+ if (IS_ERR(prog))
+ return PTR_ERR(prog);
+ vi->headroom = VIRTIO_XDP_HEADROOM;
+ } else {
+ vi->headroom = 0;
+ }
+
err = virtnet_set_queues(vi, curr_qp + xdp_qp);
if (err) {
dev_warn(&dev->dev, "XDP Device queue allocation failure.\n");
+ vi->headroom = old_hr;
+ if (prog)
+ bpf_prog_sub(prog, vi->max_queue_pairs - 1);
return err;
}
- if (prog) {
- prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
- if (IS_ERR(prog)) {
- virtnet_set_queues(vi, curr_qp);
- return PTR_ERR(prog);
- }
- }
vi->xdp_queue_pairs = xdp_qp;
netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp);
^ permalink raw reply related
* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
From: George Spelvin @ 2016-12-23 18:26 UTC (permalink / raw)
To: hannes, linux
Cc: ak, davem, David.Laight, ebiggers3, eric.dumazet, Jason,
kernel-hardening, linux-crypto, linux-kernel, luto, netdev, tom,
tytso, vegard.nossum
In-Reply-To: <1482494724.3353.7.camel@stressinduktion.org>
(Cc: list trimmed slightly as the topic is wandering a bit.)
Hannes Frederic Sowa wrote:
> On Thu, 2016-12-22 at 19:07 -0500, George Spelvin wrote:
>> Adding might_lock() annotations will improve coverage a lot.
>
> Might be hard to find the correct lock we take later down the code
> path, but if that is possible, certainly.
The point of might_lock() is that you don't have to. You find the
worst case (most global) lock that the code *might* take if all the
buffer-empty conditions are true, and tell lockdep "assume this lock is
taken all the time".
>> Hannes Frederic Sowa wrote:
>>> Yes, that does look nice indeed. Accounting for bits instead of bytes
>>> shouldn't be a huge problem either. Maybe it gets a bit more verbose in
>>> case you can't satisfy a request with one batched entropy block and have
>>> to consume randomness from two.
For example, here's a simple bit-buffer implementation that wraps around
a get_random_long. The bitbuffer is of the form "00001xxxx", where the
x bits are valid, and the position of the msbit indicates how many bits
are valid.
extern unsigned long get_random_long();
static unsigned long bitbuffer = 1; /* Holds 0..BITS_PER_LONG-1 bits */
unsigned long get_random_bits(unsigned char bits)
{
/* We handle bits == BITS_PER_LONG,and not bits == 0 */
unsigned long mask = -1ul >> (BITS_PER_LONG - bits);
unsigned long val;
if (bitbuffer > mask) {
/* Request can be satisfied out of the bit buffer */
val = bitbuffer;
bitbuffer >>= bits;
} else {
/*
* Not enough bits, but enough room in bitbuffer for the
* leftovers. avail < bits, so avail + 64 <= bits + 63.
*/
val = get_random_long();
bitbuffer = bitbuffer << (BITS_PER_LONG - bits)
| val >> 1 >> (bits-1);
}
return val & mask;
}
> When we hit the chacha20 without doing a reseed we only mutate the
> state of chacha, but being an invertible function in its own, a
> proposal would be to mix parts of the chacha20 output back into the
> state, which, as a result, would cause slowdown because we couldn't
> propagate the complete output of the cipher back to the caller (looking
> at the function _extract_crng).
Basically, yes. Half of the output goes to rekeying itself.
But, I just realized I've been overlooking something glaringly obvious...
there's no reason you can't compute multple blocks in advance.
The standard assumption in antibacktracking is that you'll *notice* the
state capture and stop trusting the random numbers afterward; you just
want the output *before* to be secure. In other words, cops busting
down the door can't find the session key used in the message you just sent.
So you can compute and store random numbers ahead of need.
This can amortize the antibacktracking as much as you'd like.
For example, suppose we gave each CPU a small pool to minimize locking.
When one runs out and calls the global refill, it could refill *all*
of the CPU pools. (You don't even need locking; there may be a race to
determine *which* random numbers the reader sees, but they're equally
good either way.)
> Or are you referring that the anti-backtrack protection should happen
> in every call from get_random_int?
If you ask for anti-backtracking without qualification, that's the
goal, since you don't know how long will elapse until the next call.
It's like fsync(). There are lots of more limited forms of "keep my
data safe in case of a crash", but the most basic one is "if we lost
power the very instant the call returned, the data would be safe."
^ permalink raw reply
* Re: BPF hash algo (Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5)
From: Hannes Frederic Sowa @ 2016-12-23 18:19 UTC (permalink / raw)
To: Andy Lutomirski, Daniel Borkmann
Cc: Alexei Starovoitov, Jason A. Donenfeld,
kernel-hardening@lists.openwall.com, Theodore Ts'o, Netdev,
LKML, Linux Crypto Mailing List, David Laight, Eric Dumazet,
Linus Torvalds, Eric Biggers, Tom Herbert, Andi Kleen,
David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <CALCETrXHoPBV1UpLwtLjjmqJXVpgEUTCnwePDLx4g6a1SFMZaw@mail.gmail.com>
On 23.12.2016 17:42, Andy Lutomirski wrote:
> On Fri, Dec 23, 2016 at 8:23 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Fri, Dec 23, 2016 at 3:59 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>> On 12/23/2016 11:59 AM, Hannes Frederic Sowa wrote:
>>>>
>>>> On Fri, 2016-12-23 at 11:04 +0100, Daniel Borkmann wrote:
>>>>>
>>>>> On 12/22/2016 05:59 PM, Hannes Frederic Sowa wrote:
>>>>>>
>>>>>> On Thu, 2016-12-22 at 08:07 -0800, Andy Lutomirski wrote:
>>>
>>> [...]
>>>
>>>>>> The hashing is not a proper sha1 neither, unfortunately. I think that
>>>>>> is why it will have a custom implementation in iproute2?
>>>>>
>>>>>
>>>>> Still trying to catch up on this admittedly bit confusing thread. I
>>>>> did run automated tests over couple of days comparing the data I got
>>>>> from fdinfo with the one from af_alg and found no mismatch on the test
>>>>> cases varying from min to max possible program sizes. In the process
>>>>> of testing, as you might have seen on netdev, I found couple of other
>>>>> bugs in bpf code along the way and fixed them up as well. So my question,
>>>>> do you or Andy or anyone participating in claiming this have any
>>>>> concrete data or test cases that suggests something different? If yes,
>>>>> I'm very curious to hear about it and willing fix it up, of course.
>>>>> When I'm back from pto I'll prep and cook up my test suite to be
>>>>> included into the selftests/bpf/, should have done this initially,
>>>>> sorry about that. I'll also post something to expose the alg, that
>>>>> sounds fine to me.
>>>>
>>>>
>>>> Looking into your code closer, I noticed that you indeed seem to do the
>>>> finalization of sha-1 by hand by aligning and padding the buffer
>>>> accordingly and also patching in the necessary payload length.
>>>>
>>>> Apologies for my side for claiming that this is not correct sha1
>>>> output, I was only looking at sha_transform and its implementation and
>>>> couldn't see the padding and finalization round with embedding the data
>>>> length in there and hadn't thought of it being done manually.
>>>>
>>>> Anyway, is it difficult to get the sha finalization into some common
>>>> code library? It is not very bpf specific and crypto code reviewers
>>>> won't find it there at all.
>>>
>>>
>>> Yes, sure, I'll rework it that way (early next year when I'm back if
>>> that's fine with you).
>>
>> Can we make it SHA-256 before 4.10 comes out, though? This really
>> looks like it will be used in situations where collisions matter and
>> it will be exposed to malicious programs, and SHA-1 should not be used
>> for new designs for this purpose because it simply isn't long enough.
>>
>> Also, a SHA-1 digest isn't a pile of u32s, so u32 digest[...] is very
>> misleading. That should be u8 or, at the very least, __be32.
>>
>> I realize that there isn't a sha-256 implementation in lib, but would
>> it really be so bad to make the bpf digest only work (for now) when
>> crypto is enabled? I would *love* to see the crypto core learn how to
>> export simple primitives for direct use without needing the whole
>> crypto core, and this doesn't seem particularly hard to do, but I
>> don't think that's 4.10 material.
>
> I'm going to try to send out RFC patches for all of this today or
> tomorrow. It doesn't look bad at all.
Factoring out sha3 to lib/ and use it as standalone and in crypto api
doesn't seem hard, yep. I also proposed this to Daniel offlist.
Bye,
Hannes
^ permalink raw reply
* Re: [net/mm PATCH v2 0/3] Page fragment updates
From: David Miller @ 2016-12-23 17:50 UTC (permalink / raw)
To: alexander.duyck; +Cc: linux-mm, akpm, netdev, linux-kernel, jeffrey.t.kirsher
In-Reply-To: <20161223170756.14573.74139.stgit@localhost.localdomain>
From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Fri, 23 Dec 2016 09:16:39 -0800
> I tried to get in touch with Andrew about this fix but I haven't heard any
> reply to the email I sent out on Tuesday. The last comment I had from
> Andrew against v1 was "Looks good to me. I have it all queued for post-4.9
> processing.", but I haven't received any notice they were applied.
Andrew, please follow up with Alex.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [patch net 0/3] mlxsw: Router fixes
From: David Miller @ 2016-12-23 17:31 UTC (permalink / raw)
To: jiri; +Cc: netdev, idosch, eladr, yotamg, nogahf, arkadis
In-Reply-To: <1482481970-2327-1-git-send-email-jiri@resnulli.us>
From: Jiri Pirko <jiri@resnulli.us>
Date: Fri, 23 Dec 2016 09:32:47 +0100
> From: Jiri Pirko <jiri@mellanox.com>
>
> Ido says:
>
> First two patches ensure we remove from the device's table neighbours
> that are considered to be dead by the neighbour core.
>
> The last patch removes nexthop groups from the device when they are no
> longer valid.
Series applied, thank you.
^ permalink raw reply
* Re: ipv6: handle -EFAULT from skb_copy_bits
From: David Miller @ 2016-12-23 17:21 UTC (permalink / raw)
To: davej; +Cc: netdev, hannes
In-Reply-To: <20161222161622.o6gtidv24znbfs7n@codemonkey.org.uk>
From: Dave Jones <davej@codemonkey.org.uk>
Date: Thu, 22 Dec 2016 11:16:22 -0500
> By setting certain socket options on ipv6 raw sockets, we can confuse the
> length calculation in rawv6_push_pending_frames triggering a BUG_ON.
...
> Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Applied and queued up for -stable, thanks Dave.
^ permalink raw reply
* Re: [PATCH net] inet: fix IP(V6)_RECVORIGDSTADDR for udp sockets
From: David Miller @ 2016-12-23 17:19 UTC (permalink / raw)
To: willemdebruijn.kernel; +Cc: netdev, njagabar, samanthakumar, willemb
In-Reply-To: <1482448756-83129-1-git-send-email-willemdebruijn.kernel@gmail.com>
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Thu, 22 Dec 2016 18:19:16 -0500
> From: Willem de Bruijn <willemb@google.com>
>
> Socket cmsg IP(V6)_RECVORIGDSTADDR checks that port range lies within
> the packet. For sockets that have transport headers pulled, transport
> offset can be negative. Use signed comparison to avoid overflow.
>
> Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
> Reported-by: Nisar Jagabar <njagabar@cloudmark.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
Applied and queued up for -stable.
^ permalink raw reply
* [net/mm PATCH v2 3/3] mm: Add documentation for page fragment APIs
From: Alexander Duyck @ 2016-12-23 17:17 UTC (permalink / raw)
To: linux-mm, akpm, davem, netdev; +Cc: linux-kernel, jeffrey.t.kirsher
In-Reply-To: <20161223170756.14573.74139.stgit@localhost.localdomain>
From: Alexander Duyck <alexander.h.duyck@intel.com>
This is a first pass at trying to add documentation for the page_frag APIs.
They may still change over time but for now I thought I would try to get
these documented so that as more network drivers and stack calls make use
of them we have one central spot to document how they are meant to be used.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
v2: No change
Documentation/vm/page_frags | 42 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
create mode 100644 Documentation/vm/page_frags
diff --git a/Documentation/vm/page_frags b/Documentation/vm/page_frags
new file mode 100644
index 000000000000..a6714565dbf9
--- /dev/null
+++ b/Documentation/vm/page_frags
@@ -0,0 +1,42 @@
+Page fragments
+--------------
+
+A page fragment is an arbitrary-length arbitrary-offset area of memory
+which resides within a 0 or higher order compound page. Multiple
+fragments within that page are individually refcounted, in the page's
+reference counter.
+
+The page_frag functions, page_frag_alloc and page_frag_free, provide a
+simple allocation framework for page fragments. This is used by the
+network stack and network device drivers to provide a backing region of
+memory for use as either an sk_buff->head, or to be used in the "frags"
+portion of skb_shared_info.
+
+In order to make use of the page fragment APIs a backing page fragment
+cache is needed. This provides a central point for the fragment allocation
+and tracks allows multiple calls to make use of a cached page. The
+advantage to doing this is that multiple calls to get_page can be avoided
+which can be expensive at allocation time. However due to the nature of
+this caching it is required that any calls to the cache be protected by
+either a per-cpu limitation, or a per-cpu limitation and forcing interrupts
+to be disabled when executing the fragment allocation.
+
+The network stack uses two separate caches per CPU to handle fragment
+allocation. The netdev_alloc_cache is used by callers making use of the
+__netdev_alloc_frag and __netdev_alloc_skb calls. The napi_alloc_cache is
+used by callers of the __napi_alloc_frag and __napi_alloc_skb calls. The
+main difference between these two calls is the context in which they may be
+called. The "netdev" prefixed functions are usable in any context as these
+functions will disable interrupts, while the "napi" prefixed functions are
+only usable within the softirq context.
+
+Many network device drivers use a similar methodology for allocating page
+fragments, but the page fragments are cached at the ring or descriptor
+level. In order to enable these cases it is necessary to provide a generic
+way of tearing down a page cache. For this reason __page_frag_cache_drain
+was implemented. It allows for freeing multiple references from a single
+page via a single call. The advantage to doing this is that it allows for
+cleaning up the multiple references that were added to a page in order to
+avoid calling get_page per allocation.
+
+Alexander Duyck, Nov 29, 2016.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related
* [net/mm PATCH v2 2/3] mm: Rename __page_frag functions to __page_frag_cache, drop order from drain
From: Alexander Duyck @ 2016-12-23 17:17 UTC (permalink / raw)
To: linux-mm, akpm, davem, netdev; +Cc: linux-kernel, jeffrey.t.kirsher
In-Reply-To: <20161223170756.14573.74139.stgit@localhost.localdomain>
From: Alexander Duyck <alexander.h.duyck@intel.com>
This patch does two things.
First it goes through and renames the __page_frag prefixed functions to
__page_frag_cache so that we can be clear that we are draining or refilling
the cache, not the frags themselves.
Second we drop the order parameter from __page_frag_cache_drain since we
don't actually need to pass it since all fragments are either order 0 or
must be a compound page.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
v2: No change
drivers/net/ethernet/intel/igb/igb_main.c | 6 +++---
include/linux/gfp.h | 3 +--
mm/page_alloc.c | 13 +++++++------
3 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index a761001308dc..1515abaa5ac9 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3962,8 +3962,8 @@ static void igb_clean_rx_ring(struct igb_ring *rx_ring)
PAGE_SIZE,
DMA_FROM_DEVICE,
DMA_ATTR_SKIP_CPU_SYNC);
- __page_frag_drain(buffer_info->page, 0,
- buffer_info->pagecnt_bias);
+ __page_frag_cache_drain(buffer_info->page,
+ buffer_info->pagecnt_bias);
buffer_info->page = NULL;
}
@@ -6991,7 +6991,7 @@ static struct sk_buff *igb_fetch_rx_buffer(struct igb_ring *rx_ring,
dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
PAGE_SIZE, DMA_FROM_DEVICE,
DMA_ATTR_SKIP_CPU_SYNC);
- __page_frag_drain(page, 0, rx_buffer->pagecnt_bias);
+ __page_frag_cache_drain(page, rx_buffer->pagecnt_bias);
}
/* clear contents of rx_buffer */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6238c74e0a01..884080404e24 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -506,8 +506,7 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
extern void free_hot_cold_page_list(struct list_head *list, bool cold);
struct page_frag_cache;
-extern void __page_frag_drain(struct page *page, unsigned int order,
- unsigned int count);
+extern void __page_frag_cache_drain(struct page *page, unsigned int count);
extern void *page_frag_alloc(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask);
extern void page_frag_free(void *addr);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 842df9d60ebc..c11c359c46b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3904,8 +3904,8 @@ void free_pages(unsigned long addr, unsigned int order)
* drivers to provide a backing region of memory for use as either an
* sk_buff->head, or to be used in the "frags" portion of skb_shared_info.
*/
-static struct page *__page_frag_refill(struct page_frag_cache *nc,
- gfp_t gfp_mask)
+static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
+ gfp_t gfp_mask)
{
struct page *page = NULL;
gfp_t gfp = gfp_mask;
@@ -3925,19 +3925,20 @@ static struct page *__page_frag_refill(struct page_frag_cache *nc,
return page;
}
-void __page_frag_drain(struct page *page, unsigned int order,
- unsigned int count)
+void __page_frag_cache_drain(struct page *page, unsigned int count)
{
VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
if (page_ref_sub_and_test(page, count)) {
+ unsigned int order = compound_order(page);
+
if (order == 0)
free_hot_cold_page(page, false);
else
__free_pages_ok(page, order);
}
}
-EXPORT_SYMBOL(__page_frag_drain);
+EXPORT_SYMBOL(__page_frag_cache_drain);
void *page_frag_alloc(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask)
@@ -3948,7 +3949,7 @@ void *page_frag_alloc(struct page_frag_cache *nc,
if (unlikely(!nc->va)) {
refill:
- page = __page_frag_refill(nc, gfp_mask);
+ page = __page_frag_cache_refill(nc, gfp_mask);
if (!page)
return NULL;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related
* [net/mm PATCH v2 1/3] mm: Rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
From: Alexander Duyck @ 2016-12-23 17:17 UTC (permalink / raw)
To: linux-mm, akpm, davem, netdev; +Cc: linux-kernel, jeffrey.t.kirsher
In-Reply-To: <20161223170756.14573.74139.stgit@localhost.localdomain>
From: Alexander Duyck <alexander.h.duyck@intel.com>
This patch renames the page frag functions to be more consistent with other
APIs. Specifically we place the name page_frag first in the name and then
have either an alloc or free call name that we append as the suffix. This
makes it a bit clearer in terms of naming.
In addition we drop the leading double underscores since we are technically
no longer a backing interface and instead the front end that is called from
the networking APIs.
The last bit I changed is I rebased page_frag_free to actually function
very similar to the function free_pages, the only real difference now is
the fact that we have to get the page order by calling compound page
instead of having it passed as a part of the function call.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
v2: Fixed a comparison between a void* and 0 due to copy/paste from free_pages
include/linux/gfp.h | 6 +++---
include/linux/skbuff.h | 2 +-
mm/page_alloc.c | 20 ++++++++++++--------
net/core/skbuff.c | 8 ++++----
4 files changed, 20 insertions(+), 16 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4175dca4ac39..6238c74e0a01 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -508,9 +508,9 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
struct page_frag_cache;
extern void __page_frag_drain(struct page *page, unsigned int order,
unsigned int count);
-extern void *__alloc_page_frag(struct page_frag_cache *nc,
- unsigned int fragsz, gfp_t gfp_mask);
-extern void __free_page_frag(void *addr);
+extern void *page_frag_alloc(struct page_frag_cache *nc,
+ unsigned int fragsz, gfp_t gfp_mask);
+extern void page_frag_free(void *addr);
#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr), 0)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ac7fa34db8a7..e0b7f492b628 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2480,7 +2480,7 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
static inline void skb_free_frag(void *addr)
{
- __free_page_frag(addr);
+ page_frag_free(addr);
}
void *napi_alloc_frag(unsigned int fragsz);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c6d5f64feca..842df9d60ebc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3939,8 +3939,8 @@ void __page_frag_drain(struct page *page, unsigned int order,
}
EXPORT_SYMBOL(__page_frag_drain);
-void *__alloc_page_frag(struct page_frag_cache *nc,
- unsigned int fragsz, gfp_t gfp_mask)
+void *page_frag_alloc(struct page_frag_cache *nc,
+ unsigned int fragsz, gfp_t gfp_mask)
{
unsigned int size = PAGE_SIZE;
struct page *page;
@@ -3991,19 +3991,23 @@ void *__alloc_page_frag(struct page_frag_cache *nc,
return nc->va + offset;
}
-EXPORT_SYMBOL(__alloc_page_frag);
+EXPORT_SYMBOL(page_frag_alloc);
/*
* Frees a page fragment allocated out of either a compound or order 0 page.
*/
-void __free_page_frag(void *addr)
+void page_frag_free(void *addr)
{
- struct page *page = virt_to_head_page(addr);
+ struct page *page;
+
+ if (addr != NULL) {
+ VM_BUG_ON(!virt_addr_valid(addr));
+ page = virt_to_head_page(addr);
- if (unlikely(put_page_testzero(page)))
- __free_pages_ok(page, compound_order(page));
+ __free_pages(page, compound_order(page));
+ }
}
-EXPORT_SYMBOL(__free_page_frag);
+EXPORT_SYMBOL(page_frag_free);
static void *make_alloc_exact(unsigned long addr, unsigned int order,
size_t size)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 65a74e13c45b..df4598ad6473 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -369,7 +369,7 @@ static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
local_irq_save(flags);
nc = this_cpu_ptr(&netdev_alloc_cache);
- data = __alloc_page_frag(nc, fragsz, gfp_mask);
+ data = page_frag_alloc(nc, fragsz, gfp_mask);
local_irq_restore(flags);
return data;
}
@@ -391,7 +391,7 @@ static void *__napi_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
{
struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
- return __alloc_page_frag(&nc->page, fragsz, gfp_mask);
+ return page_frag_alloc(&nc->page, fragsz, gfp_mask);
}
void *napi_alloc_frag(unsigned int fragsz)
@@ -441,7 +441,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
local_irq_save(flags);
nc = this_cpu_ptr(&netdev_alloc_cache);
- data = __alloc_page_frag(nc, len, gfp_mask);
+ data = page_frag_alloc(nc, len, gfp_mask);
pfmemalloc = nc->pfmemalloc;
local_irq_restore(flags);
@@ -505,7 +505,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
if (sk_memalloc_socks())
gfp_mask |= __GFP_MEMALLOC;
- data = __alloc_page_frag(&nc->page, len, gfp_mask);
+ data = page_frag_alloc(&nc->page, len, gfp_mask);
if (unlikely(!data))
return NULL;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related
* [net/mm PATCH v2 0/3] Page fragment updates
From: Alexander Duyck @ 2016-12-23 17:16 UTC (permalink / raw)
To: linux-mm, akpm, davem, netdev; +Cc: linux-kernel, jeffrey.t.kirsher
This patch series takes care of a few cleanups for the page fragments API.
First we do some renames so that things are much more consistent. First we
move the page_frag_ portion of the name to the front of the functions
names. Secondly we split out the cache specific functions from the other
page fragment functions by adding the word "cache" to the name.
Second I did some minor clean-up on the function calls so that they are
more inline with the existing __free_pages calls in terms of how they
operate.
Finally I added a bit of documentation that will hopefully help to explain
some of this. I plan to revisit this later as we get things more ironed
out in the near future with the changes planned for the DMA setup to
support eXpress Data Path.
---
v2: Fixed a comparison between a void* and 0 due to copy/paste from free_pages
I'm listing this as a patch for net or mm since I had originally submitted
it against mm as that was where the patches for the __page_frag functions
has previously resided. However they are now also in net, and I wanted to
get this fixed before the merge window closed as I was hoping to make use
of these APIs in net-next and I already have about 20 patches that are
waiting on these patches to be accepted.
I tried to get in touch with Andrew about this fix but I haven't heard any
reply to the email I sent out on Tuesday. The last comment I had from
Andrew against v1 was "Looks good to me. I have it all queued for post-4.9
processing.", but I haven't received any notice they were applied.
Alexander Duyck (3):
mm: Rename __alloc_page_frag to page_frag_alloc and __free_page_frag to page_frag_free
mm: Rename __page_frag functions to __page_frag_cache, drop order from drain
mm: Add documentation for page fragment APIs
Documentation/vm/page_frags | 42 +++++++++++++++++++++++++++++
drivers/net/ethernet/intel/igb/igb_main.c | 6 ++--
include/linux/gfp.h | 9 +++---
include/linux/skbuff.h | 2 +
mm/page_alloc.c | 33 +++++++++++++----------
net/core/skbuff.c | 8 +++---
6 files changed, 73 insertions(+), 27 deletions(-)
create mode 100644 Documentation/vm/page_frags
--
^ permalink raw reply
* Re: [PATCH net 0/9] several fixups for virtio-net XDP
From: John Fastabend @ 2016-12-23 17:10 UTC (permalink / raw)
To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <1482503852-12438-1-git-send-email-jasowang@redhat.com>
On 16-12-23 06:37 AM, Jason Wang wrote:
> Merry Xmas and a Happy New year to all:
>
> This series tries to fixes several issues for virtio-net XDP which
> could be categorized into several parts:
>
> - fix several issues during XDP linearizing
> - allow csumed packet to work for XDP_PASS
> - make EWMA rxbuf size estimation works for XDP
> - forbid XDP when GUEST_UFO is support
> - remove big packet XDP support
> - add XDP support or small buffer
>
> Please see individual patches for details.
>
> Thanks
>
> Jason Wang (9):
> virtio-net: remove the warning before XDP linearizing
> virtio-net: correctly xmit linearized page on XDP_TX
> virtio-net: fix page miscount during XDP linearizing
> virtio-net: correctly handle XDP_PASS for linearized packets
> virtio-net: unbreak csumed packets for XDP_PASS
> virtio-net: make rx buf size estimation works for XDP
> virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support
> virtio-net: remove big packet XDP codes
> virtio-net: XDP support for small buffers
>
> drivers/net/virtio_net.c | 172 ++++++++++++++++++++++++++++-------------------
> 1 file changed, 102 insertions(+), 70 deletions(-)
>
Thanks a lot Jason. The last piece that is needed is support to
complete XDP support is to get the adjust_head part correct. I'll
send out a patch in a bit but will need to merge it on top of this
set.
.John
^ permalink raw reply
* Re: [PATCH net 0/2] net/sched fixes for cls_flower and act_tunnel_key
From: David Miller @ 2016-12-23 17:00 UTC (permalink / raw)
To: ogerlitz; +Cc: netdev, roid, hadarh
In-Reply-To: <1482409695-27956-1-git-send-email-ogerlitz@mellanox.com>
From: Or Gerlitz <ogerlitz@mellanox.com>
Date: Thu, 22 Dec 2016 14:28:13 +0200
> This small series contain a fix to the matching flags support
> in flower and to the tunnel key action MD prep for IPv6.
Series applied, thanks.
> On a non related note, wishing everyone around here a happy new year,
> celebrate and take a rest so we can do lots of good patch work(s) next.
Happy new year to you as well.
^ permalink raw reply
* Re: [PATCH net 9/9] virtio-net: XDP support for small buffers
From: John Fastabend @ 2016-12-23 16:51 UTC (permalink / raw)
To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <1482503852-12438-10-git-send-email-jasowang@redhat.com>
On 16-12-23 06:37 AM, Jason Wang wrote:
> Commit f600b6905015 ("virtio_net: Add XDP support") leaves the case of
> small receive buffer untouched. This will confuse the user who want to
> set XDP but use small buffers. Other than forbid XDP in small buffer
> mode, let's make it work. XDP then can only work at skb->data since
> virtio-net create skbs during refill, this is sub optimal which could
> be optimized in the future.
>
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
Looks good to me thanks!
Acked-by: John Fastabend <john.r.fastabend@intel.com>
^ permalink raw reply
* Re: [PATCH v3] stmmac: CSR clock configuration fix
From: David Miller @ 2016-12-23 16:46 UTC (permalink / raw)
To: Joao.Pinto
Cc: peppe.cavallaro, seraphin.bonnaffe, hock.leong.kweh,
niklas.cassel, pavel, linux-kernel, preid, netdev
In-Reply-To: <f586b394955d543470b614ddc8c5bc766c450c40.1482487963.git.jpinto@synopsys.com>
From: Joao Pinto <Joao.Pinto@synopsys.com>
Date: Fri, 23 Dec 2016 10:15:59 +0000
> When testing stmmac with my QoS reference design I checked a problem in the
> CSR clock configuration that was impossibilitating the phy discovery, since
> every read operation returned 0x0000ffff. This patch fixes the issue.
>
> Signed-off-by: Joao Pinto <jpinto@synopsys.com>
Applied, thank you.
^ permalink raw reply
* Re: BPF hash algo (Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5)
From: Andy Lutomirski @ 2016-12-23 16:42 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Hannes Frederic Sowa, Alexei Starovoitov, Jason A. Donenfeld,
kernel-hardening@lists.openwall.com, Theodore Ts'o, Netdev,
LKML, Linux Crypto Mailing List, David Laight, Eric Dumazet,
Linus Torvalds, Eric Biggers, Tom Herbert, Andi Kleen,
David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <CALCETrVgnnbstPoPf-0bZ_ySDydXNCp5dLAEvhAD9cApozFsng@mail.gmail.com>
On Fri, Dec 23, 2016 at 8:23 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Dec 23, 2016 at 3:59 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 12/23/2016 11:59 AM, Hannes Frederic Sowa wrote:
>>>
>>> On Fri, 2016-12-23 at 11:04 +0100, Daniel Borkmann wrote:
>>>>
>>>> On 12/22/2016 05:59 PM, Hannes Frederic Sowa wrote:
>>>>>
>>>>> On Thu, 2016-12-22 at 08:07 -0800, Andy Lutomirski wrote:
>>
>> [...]
>>
>>>>> The hashing is not a proper sha1 neither, unfortunately. I think that
>>>>> is why it will have a custom implementation in iproute2?
>>>>
>>>>
>>>> Still trying to catch up on this admittedly bit confusing thread. I
>>>> did run automated tests over couple of days comparing the data I got
>>>> from fdinfo with the one from af_alg and found no mismatch on the test
>>>> cases varying from min to max possible program sizes. In the process
>>>> of testing, as you might have seen on netdev, I found couple of other
>>>> bugs in bpf code along the way and fixed them up as well. So my question,
>>>> do you or Andy or anyone participating in claiming this have any
>>>> concrete data or test cases that suggests something different? If yes,
>>>> I'm very curious to hear about it and willing fix it up, of course.
>>>> When I'm back from pto I'll prep and cook up my test suite to be
>>>> included into the selftests/bpf/, should have done this initially,
>>>> sorry about that. I'll also post something to expose the alg, that
>>>> sounds fine to me.
>>>
>>>
>>> Looking into your code closer, I noticed that you indeed seem to do the
>>> finalization of sha-1 by hand by aligning and padding the buffer
>>> accordingly and also patching in the necessary payload length.
>>>
>>> Apologies for my side for claiming that this is not correct sha1
>>> output, I was only looking at sha_transform and its implementation and
>>> couldn't see the padding and finalization round with embedding the data
>>> length in there and hadn't thought of it being done manually.
>>>
>>> Anyway, is it difficult to get the sha finalization into some common
>>> code library? It is not very bpf specific and crypto code reviewers
>>> won't find it there at all.
>>
>>
>> Yes, sure, I'll rework it that way (early next year when I'm back if
>> that's fine with you).
>
> Can we make it SHA-256 before 4.10 comes out, though? This really
> looks like it will be used in situations where collisions matter and
> it will be exposed to malicious programs, and SHA-1 should not be used
> for new designs for this purpose because it simply isn't long enough.
>
> Also, a SHA-1 digest isn't a pile of u32s, so u32 digest[...] is very
> misleading. That should be u8 or, at the very least, __be32.
>
> I realize that there isn't a sha-256 implementation in lib, but would
> it really be so bad to make the bpf digest only work (for now) when
> crypto is enabled? I would *love* to see the crypto core learn how to
> export simple primitives for direct use without needing the whole
> crypto core, and this doesn't seem particularly hard to do, but I
> don't think that's 4.10 material.
I'm going to try to send out RFC patches for all of this today or
tomorrow. It doesn't look bad at all.
--Andy
^ permalink raw reply
* [PATCH net] sctp: fix recovering from 0 win with small data chunks
From: Marcelo Ricardo Leitner @ 2016-12-23 16:29 UTC (permalink / raw)
To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich
Currently if SCTP closes the receive window with window pressure, mostly
caused by excessive skb overhead on payload/overheads ratio, SCTP will
close the window abruptly while saving the delta on rwnd_press. It will
start recovering rwnd as the chunks are consumed by the application and
the rwnd_press will be only recovered after rwnd reach the same value as
of rwnd_press, mostly to prevent silly window syndrome.
Thing is, this is very inefficient with small data chunks, as with those
it will never reach back that value, and thus it will never recover from
such pressure. This means that we will not issue window updates when
recovering from 0 window and will rely on a sender retransmit to notice
it.
The fix here is to remove such threshold, as no value is good enough: it
depends on the (avg) chunk sizes being used.
Test with netperf -t SCTP_STREAM -- -m 1, and trigger 0 window by
sending SIGSTOP to netserver, sleep 1.2, and SIGCONT.
Rate limited to 845kbps, for visibility. Capture done at netserver side.
Previously:
01.500751 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632372996] [a_rwnd 99153] [
01.500752 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632372997] [SID: 0] [SS
01.517471 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
01.517483 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
01.517485 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373083] [SID: 0] [SS
01.517488 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
01.534168 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373096] [SID: 0] [SS
01.534180 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
01.534181 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373169] [SID: 0] [SS
01.534185 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
02.525978 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
02.526021 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373009] [a_rwnd 0] [#gap
(window update missed)
04.573807 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373010] [SID: 0] [SS
04.779370 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373082] [a_rwnd 859] [#g
04.789162 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373083] [SID: 0] [SS
04.789323 IP A.36925 > B.48277: sctp (1) [DATA] (B)(E) [TSN: 632373156] [SID: 0] [SS
04.789372 IP B.48277 > A.36925: sctp (1) [SACK] [cum ack 632373228] [a_rwnd 786] [#g
After:
02.568957 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098728] [a_rwnd 99153]
02.568961 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098729] [SID: 0] [S
02.585631 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
02.585666 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
02.585671 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098815] [SID: 0] [S
02.585683 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
02.602330 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098828] [SID: 0] [S
02.602359 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
02.602363 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098901] [SID: 0] [S
02.602372 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
03.600788 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
03.600830 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 0] [#ga
03.619455 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 13508]
03.619479 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 27017]
03.619497 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 40526]
03.619516 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 54035]
03.619533 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 67544]
03.619552 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 81053]
03.619570 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098741] [a_rwnd 94562]
(following data transmission triggered by window updates above)
03.633504 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098742] [SID: 0] [S
03.836445 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098814] [a_rwnd 100000]
03.843125 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098815] [SID: 0] [S
03.843285 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098888] [SID: 0] [S
03.843345 IP B.50536 > A.55173: sctp (1) [SACK] [cum ack 2490098960] [a_rwnd 99894]
03.856546 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490098961] [SID: 0] [S
03.866450 IP A.55173 > B.50536: sctp (1) [DATA] (B)(E) [TSN: 2490099011] [SID: 0] [S
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
net/sctp/associola.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 56ddcfaeb4f64235b54b0daa915fabf0cc0170a9..d3cc30c25c41091c2bf18022506dff4145d29944 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1471,7 +1471,7 @@ void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
* threshold. The idea is to recover slowly, but up
* to the initial advertised window.
*/
- if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
+ if (asoc->rwnd_press) {
int change = min(asoc->pathmtu, asoc->rwnd_press);
asoc->rwnd += change;
asoc->rwnd_press -= change;
--
2.9.3
^ permalink raw reply related
* [PATCH net] sctp: do not loose window information if in rwnd_over
From: Marcelo Ricardo Leitner @ 2016-12-23 16:29 UTC (permalink / raw)
To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich
It's possible that we receive a packet that is larger than current
window. If it's the first packet in this way, it will cause it to
increase rwnd_over. Then, if we receive another data chunk (specially as
SCTP allows you to have one data chunk in flight even during 0 window),
rwnd_over will be overwritten instead of added to.
In the long run, this could cause the window to grow bigger than its
initial size, as rwnd_over would be charged only for the last received
data chunk while the code will try open the window for all packets that
were received and had its value in rwnd_over overwritten. This, then,
can lead to the worsening of payload/buffer ratio and cause rwnd_press
to kick in more often.
The fix is to sum it too, same as is done for rwnd_press, so that if we
receive 3 chunks after closing the window, we still have to release that
same amount before re-opening it.
Log snippet from sctp_test exhibiting the issue:
[ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
rwnd decreased by 1 to (0, 1, 114221)
[ 146.209232] sctp: sctp_assoc_rwnd_decrease:
association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
[ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
rwnd decreased by 1 to (0, 1, 114221)
[ 146.209232] sctp: sctp_assoc_rwnd_decrease:
association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
[ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
rwnd decreased by 1 to (0, 1, 114221)
[ 146.209232] sctp: sctp_assoc_rwnd_decrease:
association:ffff88013928e000 has asoc->rwnd:0, asoc->rwnd_over:1!
[ 146.209232] sctp: sctp_assoc_rwnd_decrease: asoc:ffff88013928e000
rwnd decreased by 1 to (0, 1, 114221)
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
net/sctp/associola.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 68428e1f71810fbe65b7f86c750c3ad61f0266ec..56ddcfaeb4f64235b54b0daa915fabf0cc0170a9 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1539,7 +1539,7 @@ void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
asoc->rwnd = 0;
}
} else {
- asoc->rwnd_over = len - asoc->rwnd;
+ asoc->rwnd_over += len - asoc->rwnd;
asoc->rwnd = 0;
}
--
2.9.3
^ permalink raw reply related
* Re: BPF hash algo (Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5)
From: Andy Lutomirski @ 2016-12-23 16:23 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Hannes Frederic Sowa, Alexei Starovoitov, Jason A. Donenfeld,
kernel-hardening@lists.openwall.com, Theodore Ts'o, Netdev,
LKML, Linux Crypto Mailing List, David Laight, Eric Dumazet,
Linus Torvalds, Eric Biggers, Tom Herbert, Andi Kleen,
David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <585D11BF.60903@iogearbox.net>
On Fri, Dec 23, 2016 at 3:59 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 12/23/2016 11:59 AM, Hannes Frederic Sowa wrote:
>>
>> On Fri, 2016-12-23 at 11:04 +0100, Daniel Borkmann wrote:
>>>
>>> On 12/22/2016 05:59 PM, Hannes Frederic Sowa wrote:
>>>>
>>>> On Thu, 2016-12-22 at 08:07 -0800, Andy Lutomirski wrote:
>
> [...]
>
>>>> The hashing is not a proper sha1 neither, unfortunately. I think that
>>>> is why it will have a custom implementation in iproute2?
>>>
>>>
>>> Still trying to catch up on this admittedly bit confusing thread. I
>>> did run automated tests over couple of days comparing the data I got
>>> from fdinfo with the one from af_alg and found no mismatch on the test
>>> cases varying from min to max possible program sizes. In the process
>>> of testing, as you might have seen on netdev, I found couple of other
>>> bugs in bpf code along the way and fixed them up as well. So my question,
>>> do you or Andy or anyone participating in claiming this have any
>>> concrete data or test cases that suggests something different? If yes,
>>> I'm very curious to hear about it and willing fix it up, of course.
>>> When I'm back from pto I'll prep and cook up my test suite to be
>>> included into the selftests/bpf/, should have done this initially,
>>> sorry about that. I'll also post something to expose the alg, that
>>> sounds fine to me.
>>
>>
>> Looking into your code closer, I noticed that you indeed seem to do the
>> finalization of sha-1 by hand by aligning and padding the buffer
>> accordingly and also patching in the necessary payload length.
>>
>> Apologies for my side for claiming that this is not correct sha1
>> output, I was only looking at sha_transform and its implementation and
>> couldn't see the padding and finalization round with embedding the data
>> length in there and hadn't thought of it being done manually.
>>
>> Anyway, is it difficult to get the sha finalization into some common
>> code library? It is not very bpf specific and crypto code reviewers
>> won't find it there at all.
>
>
> Yes, sure, I'll rework it that way (early next year when I'm back if
> that's fine with you).
Can we make it SHA-256 before 4.10 comes out, though? This really
looks like it will be used in situations where collisions matter and
it will be exposed to malicious programs, and SHA-1 should not be used
for new designs for this purpose because it simply isn't long enough.
Also, a SHA-1 digest isn't a pile of u32s, so u32 digest[...] is very
misleading. That should be u8 or, at the very least, __be32.
I realize that there isn't a sha-256 implementation in lib, but would
it really be so bad to make the bpf digest only work (for now) when
crypto is enabled? I would *love* to see the crypto core learn how to
export simple primitives for direct use without needing the whole
crypto core, and this doesn't seem particularly hard to do, but I
don't think that's 4.10 material.
--Andy
^ permalink raw reply
* Re: [PATCH net 7/9] virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support
From: John Fastabend @ 2016-12-23 16:10 UTC (permalink / raw)
To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <585D4AAD.4030702@gmail.com>
On 16-12-23 08:02 AM, John Fastabend wrote:
> On 16-12-23 06:37 AM, Jason Wang wrote:
>> When VIRTIO_NET_F_GUEST_UFO is negotiated, host could still send UFO
>> packet that exceeds a single page which could not be handled
>> correctly by XDP. So this patch forbids setting XDP when GUEST_UFO is
>> supported. While at it, forbid XDP for ECN (which comes only from GRO)
>> too to prevent user from misconfiguration.
>>
Is sending packets greater than single page though normal in this case?
I don't have any need to support big packet mode other than MST asked
for it. And I wasn't seeing this in my tests. MTU is capped at 4k - hdr
when XDP is enabled.
.John
>> Cc: John Fastabend <john.r.fastabend@intel.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>> drivers/net/virtio_net.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 77ae358..c1f66d8 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1684,7 +1684,9 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>> int i, err;
>>
>> if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>> - virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6)) {
>> + virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
>> + virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
>> + virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO)) {
>> netdev_warn(dev, "can't set XDP while host is implementing LRO, disable LRO first\n");
>> return -EOPNOTSUPP;
>> }
>>
>
> Acked-by: John Fastabend <john.r.fastabend@intel.com>
>
^ permalink raw reply
* Re: [PATCH net 7/9] virtio-net: forbid XDP when VIRTIO_NET_F_GUEST_UFO is support
From: John Fastabend @ 2016-12-23 16:02 UTC (permalink / raw)
To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <1482503852-12438-8-git-send-email-jasowang@redhat.com>
On 16-12-23 06:37 AM, Jason Wang wrote:
> When VIRTIO_NET_F_GUEST_UFO is negotiated, host could still send UFO
> packet that exceeds a single page which could not be handled
> correctly by XDP. So this patch forbids setting XDP when GUEST_UFO is
> supported. While at it, forbid XDP for ECN (which comes only from GRO)
> too to prevent user from misconfiguration.
>
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> drivers/net/virtio_net.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 77ae358..c1f66d8 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1684,7 +1684,9 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
> int i, err;
>
> if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> - virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6)) {
> + virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
> + virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
> + virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO)) {
> netdev_warn(dev, "can't set XDP while host is implementing LRO, disable LRO first\n");
> return -EOPNOTSUPP;
> }
>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
^ permalink raw reply
* Re: [PATCH net 6/9] virtio-net: make rx buf size estimation works for XDP
From: John Fastabend @ 2016-12-23 16:02 UTC (permalink / raw)
To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: john.r.fastabend
In-Reply-To: <1482503852-12438-7-git-send-email-jasowang@redhat.com>
On 16-12-23 06:37 AM, Jason Wang wrote:
> We don't update ewma rx buf size in the case of XDP. This will lead
> underestimation of rx buf size which causes host to produce more than
> one buffers. This will greatly increase the possibility of XDP page
> linearization.
>
> Cc: John Fastabend <john.r.fastabend@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> drivers/net/virtio_net.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 0778dc8..77ae358 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -584,10 +584,12 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> put_page(page);
> head_skb = page_to_skb(vi, rq, xdp_page,
> 0, len, PAGE_SIZE);
> + ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
> return head_skb;
> }
> break;
> case XDP_TX:
> + ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
> if (unlikely(xdp_page != page))
> goto err_xdp;
> rcu_read_unlock();
> @@ -596,6 +598,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> default:
> if (unlikely(xdp_page != page))
> __free_pages(xdp_page, 0);
> + ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
> goto err_xdp;
> }
> }
>
Looks needed although I guess it will only be the case with
MTU > ETH_DATA_LEN because of the clamp in get_mergeable_buf_len().
Although XDP setup allows MTU up to page_size - hdr so certainly
will happen with ~MTU > 1500.
I need to add some various MTU tests to my setup.
Acked-by: John Fastabend <john.r.fastabend@intel.com>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox