* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Jason Wang @ 2016-12-01 3:26 UTC (permalink / raw)
To: Michael S. Tsirkin, wangyunjian
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, caihe
In-Reply-To: <20161201051207-mutt-send-email-mst@kernel.org>
On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
> On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
>>> -----Original Message-----
>>> From: Michael S. Tsirkin [mailto:mst@redhat.com]
>>> Sent: Wednesday, November 30, 2016 9:41 PM
>>> To: wangyunjian
>>> Cc: jasowang@redhat.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
>>> Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>>>
>>> On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>>>> When we meet an error(err=-EBADFD) recvmsg,
>>> How do you get EBADFD? Won't vhost_net_rx_peek_head_len
>>> return 0 in this case, breaking the loop?
>> We started many guest VMs while attaching/detaching some virtio-net nics for loop.
>> The soft lockup might happened. The err is -EBADFD.
>>
> OK, I'd like to figure out what happened here. why don't
> we get 0 when we peek at the head?
>
> EBADFD is from here:
> struct tun_struct *tun = __tun_get(tfile);
> ...
> if (!tun)
> return -EBADFD;
>
> but then:
> static int tun_peek_len(struct socket *sock)
> {
>
> ...
>
> struct tun_struct *tun;
> ...
>
> tun = __tun_get(tfile);
> if (!tun)
> return 0;
>
>
> so peek len should return 0.
>
> then while will exit:
> while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
> ...
>
Consider this case: user do ip link del link tap0 before recvmsg() but
after tun_peek_len() ?
>> meesage log:
>> kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
>> call trace:
>> [<fffffffa0132967>]vhost_get_vq_desc+0x1e7/0x984 [vhost]
>> [<fffffffa02037e6>]handle_rx+0x226/0x810 [vhost_net]
>> [<fffffffa0203de5>]handle_rx_net+0x15/0x20 [vhost_net]
>> [<fffffffa013160b>]vhost_worker+0xfb/0x1e0 [vhost]
>> [<fffffffa0131510>]? vhost_dev_reset_owner+0x50/0x50 [vhost]
>> [<fffffff810a5c7f>]kthread+0xcf/0xe0
>> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
>> [<fffffff81648898>]ret_from_fork+0x58/0x90
>> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> So somehow you keep seeing something in tun when we peek.
> IMO this is the problem we should try to fix.
> Could you try debugging the root cause here?
>
>>>> the error handling in vhost
>>>> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
>>>>
>>>> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
>>>> ---
>>>> drivers/vhost/net.c | 3 +++
>>>> 1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>>>> index 5dc128a..edc470b 100644
>>>> --- a/drivers/vhost/net.c
>>>> +++ b/drivers/vhost/net.c
>>>> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>>>> pr_debug("Discarded rx packet: "
>>>> " len %d, expected %zd\n", err, sock_len);
>>>> vhost_discard_vq_desc(vq, headcount);
>>>> + /* Don't continue to do, when meet errors. */
>>>> + if (err < 0)
>>>> + goto out;
>>> You might get e.g. EAGAIN and I think you need to retry
>>> in this case.
>>>
>>>> continue;
>>>> }
>>>> /* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
>>>> --
>>>> 1.9.5.msysgit.1
>>>>
^ permalink raw reply
* Re: DSA vs envelope frames
From: Florian Fainelli @ 2016-12-01 3:26 UTC (permalink / raw)
To: Andrew Lunn, Nikita Yushchenko
Cc: Toshiaki Makita, Andy Duan, David S. Miller, Troy Kisky,
Eric Nelson, Philippe Reynes, Johannes Berg,
netdev@vger.kernel.org, Chris Healy, Fabio Estevam,
linux-kernel@vger.kernel.org, Vivien Didelot
In-Reply-To: <20161130151028.GD21645@lunn.ch>
On 11/30/2016 07:10 AM, Andrew Lunn wrote:
>> What is not really clear - what if several tagging protocols are used
>> together. AFAIU, things may be more complex that simple appending of
>> tags, e.g. EDSA tag can carry VLAN id inside.
>
> Hi Nikita
>
> At least for all current tagging protocols, the size of the tag is
> constant. And you cannot run different tagging protocols on the same
> master interface at the same time.
I am not sure if using envelope frames is entirely appropriate here,
because there are existing switch tagging protocols that:
- don't have a specific Ethernet type allocated (Broadcom tags, DSA)
- could be appended at the end of the frame instead of pre-pended
Alexander Duyck suggested a while ago that we may be able to use the
headers_ops to implement the DSA tag pop/push, as well as get an
appropriate MTU adjustment, can you see if that would work?
>
> However, i think Florian tried something funky with the SF2 and B53
> driver. He has a b53 hanging off a sf2. So i think he used nested
> tagging protocols!
Well, this actually did not work, because the SF2 and B53 switches
essentially terminate switch tags when they ingress their switch ports,
so the tag inserted by B53 ingressing into the SF2 port does not get
sent all the way to the CPU hanging off the SF2... With the B53 hanging
off the SF2, we essentially have to disable Broadcom tags in the B53
device, because the tag and switches were never designed to be cascaded
in the first place (at least not these specific cores).
--
Florian
^ permalink raw reply
* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Michael S. Tsirkin @ 2016-12-01 3:27 UTC (permalink / raw)
To: Jason Wang
Cc: wangyunjian, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
caihe
In-Reply-To: <ee70cb92-3fd9-4148-77ba-739af30b04b6@redhat.com>
On Thu, Dec 01, 2016 at 11:26:21AM +0800, Jason Wang wrote:
>
>
> On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
> > On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
> > > > -----Original Message-----
> > > > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > > > Sent: Wednesday, November 30, 2016 9:41 PM
> > > > To: wangyunjian
> > > > Cc: jasowang@redhat.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
> > > > Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
> > > >
> > > > On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
> > > > > When we meet an error(err=-EBADFD) recvmsg,
> > > > How do you get EBADFD? Won't vhost_net_rx_peek_head_len
> > > > return 0 in this case, breaking the loop?
> > > We started many guest VMs while attaching/detaching some virtio-net nics for loop.
> > > The soft lockup might happened. The err is -EBADFD.
> > >
> > OK, I'd like to figure out what happened here. why don't
> > we get 0 when we peek at the head?
> >
> > EBADFD is from here:
> > struct tun_struct *tun = __tun_get(tfile);
> > ...
> > if (!tun)
> > return -EBADFD;
> >
> > but then:
> > static int tun_peek_len(struct socket *sock)
> > {
> >
> > ...
> >
> > struct tun_struct *tun;
> > ...
> > tun = __tun_get(tfile);
> > if (!tun)
> > return 0;
> >
> >
> > so peek len should return 0.
> >
> > then while will exit:
> > while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
> > ...
> >
>
> Consider this case: user do ip link del link tap0 before recvmsg() but after
> tun_peek_len() ?
Sure, this can happen, but I think we'll just exit on the next loop,
won't we?
> > > meesage log:
> > > kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
> > > call trace:
> > > [<fffffffa0132967>]vhost_get_vq_desc+0x1e7/0x984 [vhost]
> > > [<fffffffa02037e6>]handle_rx+0x226/0x810 [vhost_net]
> > > [<fffffffa0203de5>]handle_rx_net+0x15/0x20 [vhost_net]
> > > [<fffffffa013160b>]vhost_worker+0xfb/0x1e0 [vhost]
> > > [<fffffffa0131510>]? vhost_dev_reset_owner+0x50/0x50 [vhost]
> > > [<fffffff810a5c7f>]kthread+0xcf/0xe0
> > > [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> > > [<fffffff81648898>]ret_from_fork+0x58/0x90
> > > [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> > So somehow you keep seeing something in tun when we peek.
> > IMO this is the problem we should try to fix.
> > Could you try debugging the root cause here?
> >
> > > > > the error handling in vhost
> > > > > handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
> > > > >
> > > > > Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
> > > > > ---
> > > > > drivers/vhost/net.c | 3 +++
> > > > > 1 file changed, 3 insertions(+)
> > > > >
> > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > > > index 5dc128a..edc470b 100644
> > > > > --- a/drivers/vhost/net.c
> > > > > +++ b/drivers/vhost/net.c
> > > > > @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
> > > > > pr_debug("Discarded rx packet: "
> > > > > " len %d, expected %zd\n", err, sock_len);
> > > > > vhost_discard_vq_desc(vq, headcount);
> > > > > + /* Don't continue to do, when meet errors. */
> > > > > + if (err < 0)
> > > > > + goto out;
> > > > You might get e.g. EAGAIN and I think you need to retry
> > > > in this case.
> > > >
> > > > > continue;
> > > > > }
> > > > > /* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
> > > > > --
> > > > > 1.9.5.msysgit.1
> > > > >
^ permalink raw reply
* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Jason Wang @ 2016-12-01 3:37 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: wangyunjian, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
caihe
In-Reply-To: <20161201052657-mutt-send-email-mst@kernel.org>
On 2016年12月01日 11:27, Michael S. Tsirkin wrote:
> On Thu, Dec 01, 2016 at 11:26:21AM +0800, Jason Wang wrote:
>> >
>> >
>> >On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
>>> > >On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
>>>>> > > > >-----Original Message-----
>>>>> > > > >From: Michael S. Tsirkin [mailto:mst@redhat.com]
>>>>> > > > >Sent: Wednesday, November 30, 2016 9:41 PM
>>>>> > > > >To: wangyunjian
>>>>> > > > >Cc:jasowang@redhat.com;netdev@vger.kernel.org;linux-kernel@vger.kernel.org; caihe
>>>>> > > > >Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>>>>> > > > >
>>>>> > > > >On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>>>>>> > > > > >When we meet an error(err=-EBADFD) recvmsg,
>>>>> > > > >How do you get EBADFD? Won't vhost_net_rx_peek_head_len
>>>>> > > > >return 0 in this case, breaking the loop?
>>>> > > >We started many guest VMs while attaching/detaching some virtio-net nics for loop.
>>>> > > >The soft lockup might happened. The err is -EBADFD.
>>>> > > >
>>> > >OK, I'd like to figure out what happened here. why don't
>>> > >we get 0 when we peek at the head?
>>> > >
>>> > >EBADFD is from here:
>>> > > struct tun_struct *tun = __tun_get(tfile);
>>> > >...
>>> > > if (!tun)
>>> > > return -EBADFD;
>>> > >
>>> > >but then:
>>> > >static int tun_peek_len(struct socket *sock)
>>> > >{
>>> > >
>>> > >...
>>> > >
>>> > > struct tun_struct *tun;
>>> > >...
>>> > > tun = __tun_get(tfile);
>>> > > if (!tun)
>>> > > return 0;
>>> > >
>>> > >
>>> > >so peek len should return 0.
>>> > >
>>> > >then while will exit:
>>> > > while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
>>> > >...
>>> > >
>> >
>> >Consider this case: user do ip link del link tap0 before recvmsg() but after
>> >tun_peek_len() ?
> Sure, this can happen, but I think we'll just exit on the next loop,
> won't we?
>
Right, this is the only case I can image for -EBADFD, let's wait for the
author to the steps.
^ permalink raw reply
* Re: [PATCH] tcp_bbr: fix Kconfig to be able to make BBR the default congestion algorithm
From: David Miller @ 2016-12-01 3:43 UTC (permalink / raw)
To: berny156; +Cc: linux-kernel, netdev, ncardwell
In-Reply-To: <b37e4220-b857-4853-f2e2-b77bb08115e5@gmx.de>
From: Bernhard Held <berny156@gmx.de>
Date: Wed, 30 Nov 2016 23:31:29 +0100
> Add missing line to be able to make BBR the default
> congestion algorithm.
>
> Signed-off-by: Bernhard Held <berny156@gmx.de>
The current tree already has this fix.
^ permalink raw reply
* linux-next: manual merge of the tip tree with the net-next tree
From: Stephen Rothwell @ 2016-12-01 3:46 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Peter Zijlstra,
David Miller, Networking
Cc: linux-next, linux-kernel, Edward Cree, Nicolas Pitre
Hi all,
Today's linux-next merge of the tip tree got a conflict in:
drivers/net/ethernet/sfc/Kconfig
between commit:
5a6681e22c14 ("sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver")
from the net-next tree and commit:
d1cbfd771ce8 ("ptp_clock: Allow for it to be optional")
from the tip tree.
I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging. You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.
--
Cheers,
Stephen Rothwell
diff --cc drivers/net/ethernet/sfc/Kconfig
index 0e16197e4f18,83f4766a1da0..000000000000
--- a/drivers/net/ethernet/sfc/Kconfig
+++ b/drivers/net/ethernet/sfc/Kconfig
@@@ -5,11 -5,11 +5,11 @@@ config SF
select CRC32
select I2C
select I2C_ALGOBIT
- select PTP_1588_CLOCK
+ select SFC_FALCON
+ imply PTP_1588_CLOCK
---help---
This driver supports 10/40-gigabit Ethernet cards based on
- the Solarflare SFC4000, SFC9000-family and SFC9100-family
- controllers.
+ the Solarflare SFC9000-family and SFC9100-family controllers.
To compile this driver as a module, choose M here. The module
will be called sfc.
^ permalink raw reply
* Re: [net-next PATCH v3 3/6] virtio_net: Add XDP support
From: John Fastabend @ 2016-12-01 4:24 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: eric.dumazet, daniel, shm, davem, tgraf, alexei.starovoitov,
john.r.fastabend, netdev, bblanco, brouer
In-Reply-To: <20161130203851-mutt-send-email-mst@kernel.org>
On 16-11-30 10:54 AM, Michael S. Tsirkin wrote:
> On Tue, Nov 29, 2016 at 12:10:21PM -0800, John Fastabend wrote:
>> From: John Fastabend <john.fastabend@gmail.com>
>>
>> This adds XDP support to virtio_net. Some requirements must be
>> met for XDP to be enabled depending on the mode. First it will
>> only be supported with LRO disabled so that data is not pushed
>> across multiple buffers. Second the MTU must be less than a page
>> size to avoid having to handle XDP across multiple pages.
>>
>> If mergeable receive is enabled this patch only supports the case
>> where header and data are in the same buf which we can check when
>> a packet is received by looking at num_buf. If the num_buf is
>> greater than 1 and a XDP program is loaded the packet is dropped
>> and a warning is thrown. When any_header_sg is set this does not
>> happen and both header and data is put in a single buffer as expected
>> so we check this when XDP programs are loaded. Subsequent patches
>> will process the packet in a degraded mode to ensure connectivity
>> and correctness is not lost even if backend pushes packets into
>> multiple buffers.
>>
>> If big packets mode is enabled and MTU/LRO conditions above are
>> met then XDP is allowed.
>>
>> This patch was tested with qemu with vhost=on and vhost=off where
>> mergable and big_packet modes were forced via hard coding feature
>> negotiation. Multiple buffers per packet was forced via a small
>> test patch to vhost.c in the vhost=on qemu mode.
>>
>> Suggested-by: Shrijeet Mukherjee <shrijeet@gmail.com>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
[...]
>>
>> +static u32 do_xdp_prog(struct virtnet_info *vi,
>> + struct bpf_prog *xdp_prog,
>> + struct page *page, int offset, int len)
>> +{
>> + int hdr_padded_len;
>> + struct xdp_buff xdp;
>> + u32 act;
>> + u8 *buf;
>> +
>> + buf = page_address(page) + offset;
>> +
>> + if (vi->mergeable_rx_bufs)
>> + hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>> + else
>> + hdr_padded_len = sizeof(struct padded_vnet_hdr);
>> +
>> + xdp.data = buf + hdr_padded_len;
>> + xdp.data_end = xdp.data + (len - vi->hdr_len);
>
> so header seems to be ignored completely.
> but the packet could be from the time when
> e.g. checksum offloading was on, and
> so it might gave DATA_VALID (from CHECKSUM_UNNECESSARY
> in host).
>
> I think you want to verify that flags and gso type
> are 0.
Ah yes agree. I think this might be possible in all the driver
implementations as well I wonder if there are bugs there as well.
The race between setting MTU, LRO, etc. and pushing the XDP prog
into the datapath seems unlikely but possible.
>
>
>> +
>> + act = bpf_prog_run_xdp(xdp_prog, &xdp);
>> + switch (act) {
>> + case XDP_PASS:
>> + return XDP_PASS;
>> + default:
>> + bpf_warn_invalid_xdp_action(act);
>> + case XDP_TX:
>> + case XDP_ABORTED:
>> + case XDP_DROP:
>> + return XDP_DROP;
>> + }
>> +}
>
> do we really want this switch just to warn?
> How about doing != XDP_PASS in the caller?
Well in later patches XDP_TX case gets handled. So its really
three cases, PASS, XDP_TX, {default|ABORT|DROP}. I don't have
a strong preference if its case statement or if/else it should
not matter except for style.
>
>> +
>> static struct sk_buff *receive_small(struct virtnet_info *vi, void *buf, unsigned int len)
>> {
>> struct sk_buff * skb = buf;
>> @@ -340,14 +375,28 @@ static struct sk_buff *receive_big(struct net_device *dev,
>> void *buf,
>> unsigned int len)
>> {
>> + struct bpf_prog *xdp_prog;
>> struct page *page = buf;
>> - struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
>> + struct sk_buff *skb;
>>
>> + rcu_read_lock();
>> + xdp_prog = rcu_dereference(rq->xdp_prog);
>> + if (xdp_prog) {
>> + u32 act = do_xdp_prog(vi, xdp_prog, page, 0, len);
>> +
>> + if (act == XDP_DROP)
>> + goto err_xdp;
>> + }
>> + rcu_read_unlock();
>> +
>> + skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
>> if (unlikely(!skb))
>> goto err;
>>
>> return skb;
>>
>> +err_xdp:
>> + rcu_read_unlock();
>> err:
>> dev->stats.rx_dropped++;
>> give_pages(rq, page);
>> @@ -366,10 +415,27 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>> struct page *page = virt_to_head_page(buf);
>> int offset = buf - page_address(page);
>> unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
>
> This is some useless computation when XDP is used, isn't it?
>
Seems to be nice catch.
>> + struct sk_buff *head_skb, *curr_skb;
>> + struct bpf_prog *xdp_prog;
>>
>> - struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
>> - truesize);
>> - struct sk_buff *curr_skb = head_skb;
>> + rcu_read_lock();
>> + xdp_prog = rcu_dereference(rq->xdp_prog);
>> + if (xdp_prog) {
>> + u32 act;
>> +
>> + if (num_buf > 1) {
>> + bpf_warn_invalid_xdp_buffer();
>> + goto err_xdp;
>> + }
>> +
>> + act = do_xdp_prog(vi, xdp_prog, page, offset, len);
>> + if (act == XDP_DROP)
>> + goto err_xdp;
>> + }
>> + rcu_read_unlock();
>> +
>> + head_skb = page_to_skb(vi, rq, page, offset, len, truesize);
>> + curr_skb = head_skb;
>>
>> if (unlikely(!curr_skb))
>> goto err_skb;
>
> I'm confused. Did the requirement to have a page per packet go away?
> I don't think this mode is doing it here.
>
Correct its not a page per packet but there seems no reason to enforce
that here. XDP works just fine without contiguous buffers. I believe
the page per packet argument came out of the mlx driver implementation.
iirc DaveM had the most push back on the multiple packets per page
thing. Maybe he can comment again. (sorry Dave I'm dragging this up again).
>
>> @@ -423,6 +489,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>> ewma_pkt_len_add(&rq->mrg_avg_pkt_len, head_skb->len);
>> return head_skb;
>>
>> +err_xdp:
>> + rcu_read_unlock();
>> err_skb:
>> put_page(page);
>> while (--num_buf) {
>> @@ -1328,6 +1396,13 @@ static int virtnet_set_channels(struct net_device *dev,
>> if (queue_pairs > vi->max_queue_pairs || queue_pairs == 0)
>> return -EINVAL;
>>
>> + /* For now we don't support modifying channels while XDP is loaded
>> + * also when XDP is loaded all RX queues have XDP programs so we only
>> + * need to check a single RX queue.
>> + */
>> + if (vi->rq[0].xdp_prog)
>> + return -EINVAL;
>> +
>> get_online_cpus();
>> err = virtnet_set_queues(vi, queue_pairs);
>> if (!err) {
>> @@ -1454,6 +1529,68 @@ static int virtnet_set_features(struct net_device *netdev,
>> return 0;
>> }
>>
>> +static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
>> +{
>> + struct virtnet_info *vi = netdev_priv(dev);
>> + struct bpf_prog *old_prog;
>> + int i;
>> +
>> + if ((dev->features & NETIF_F_LRO) && prog) {
>> + netdev_warn(dev, "can't set XDP while LRO is on, disable LRO first\n");
>> + return -EINVAL;
>> + }
>> +
>> + if (vi->mergeable_rx_bufs && !vi->any_header_sg) {
>> + netdev_warn(dev, "XDP expects header/data in single page\n");
>> + return -EINVAL;
>> + }
>> +
>> + if (dev->mtu > PAGE_SIZE) {
>> + netdev_warn(dev, "XDP requires MTU less than %lu\n", PAGE_SIZE);
>> + return -EINVAL;
>> + }
>> +
>> + if (prog) {
>> + prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
>> + if (IS_ERR(prog))
>> + return PTR_ERR(prog);
>> + }
>> +
>> + for (i = 0; i < vi->max_queue_pairs; i++) {
>> + old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
>> + rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
>> + if (old_prog)
>> + bpf_prog_put(old_prog);
>
> don't we need to sync before put?
>
No should be handled by core infrastructure in bpf_prog_put(). Also this
aligns with other driver implementations.
>
>> + }
>> +
>> + return 0;
>> +}
>> +
Thanks,
John
^ permalink raw reply
* Re: [net-next PATCH v3 0/6] XDP for virtio_net
From: John Fastabend @ 2016-12-01 4:26 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: tgraf, shm, alexei.starovoitov, daniel, davem, john.r.fastabend,
netdev, bblanco, brouer
In-Reply-To: <20161130205527-mutt-send-email-mst@kernel.org>
On 16-11-30 10:56 AM, Michael S. Tsirkin wrote:
> On Wed, Nov 30, 2016 at 10:41:04AM -0800, John Fastabend wrote:
>> On 16-11-30 10:35 AM, Michael S. Tsirkin wrote:
>>> On Tue, Nov 29, 2016 at 12:05:20PM -0800, John Fastabend wrote:
>>>> This implements virtio_net for the mergeable buffers and big_packet
>>>> modes. I tested this with vhost_net running on qemu and did not see
>>>> any issues. For testing num_buf > 1 I added a hack to vhost driver
>>>> to only but 100 bytes per buffer.
>>>>
>>>> There are some restrictions for XDP to be enabled and work well
>>>> (see patch 3) for more details.
>>>>
>>>> 1. LRO must be off
>>>> 2. MTU must be less than PAGE_SIZE
>>>> 3. queues must be available to dedicate to XDP
>>>> 4. num_bufs received in mergeable buffers must be 1
>>>> 5. big_packet mode must have all data on single page
>>>>
>>>> Please review any comments/feedback welcome as always.
>>>>
>>>> v2, fixes rcu usage throughout thanks to Eric and the use of
>>>> num_online_cpus() usage thanks to Jakub.
>>>>
>>>> v3, add slowpath patch to handle num_bufs > 1
>>>>
>>>> Thanks,
>>>> John
>>>
>>> BTW this is threaded incorrectly: patch 1/6 isn't a reply to 0/6,
>>> patches 2 and on are replies to patch 1.
>>>
>>
>> Ah yep, if you mangle the command line git will send the
>> cover letter even if you have mangled 'to' email addresses but when
>> it hits a real patch it aborts. At least on my version of git.
>>
>>> I'm busy until end of week, I'll review Monday. Sorry about the delay.
>>
>> In the meantime I'll post a v4 with better commit message (Alexei) and
>> address a corner cases Jakub pointed out.
>
> I did a quick look and found some too, but a detailed review will
> have to wait till next week.
Thanks! I'll address all the comments and have a v4 by then.
^ permalink raw reply
* RE: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: wangyunjian @ 2016-12-01 4:41 UTC (permalink / raw)
To: Jason Wang, Michael S. Tsirkin
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, caihe
In-Reply-To: <936954dd-c8a5-c0f6-c3b0-84a9d67329f5@redhat.com>
>-----Original Message-----
>From: Jason Wang [mailto:jasowang@redhat.com]
>Sent: Thursday, December 01, 2016 11:37 AM
>To: Michael S. Tsirkin
>Cc: wangyunjian; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
>Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
>
>
>
>On 2016年12月01日 11:27, Michael S. Tsirkin wrote:
>> On Thu, Dec 01, 2016 at 11:26:21AM +0800, Jason Wang wrote:
>>> >
>>> >
>>> >On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
>>>> > >On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
>>>>>> > > > >-----Original Message-----
>>>>>> > > > >From: Michael S. Tsirkin [mailto:mst@redhat.com]
>>>>>> > > > >Sent: Wednesday, November 30, 2016 9:41 PM
>>>>>> > > > >To: wangyunjian
>>>>>> > > > >Cc:jasowang@redhat.com;netdev@vger.kernel.org;linux-kernel@
>>>>>> > > > >vger.kernel.org; caihe
>>>>>> > > > >Subject: Re: [PATCH net] vhost_net: don't continue to call
>>>>>> > > > >the recvmsg when meet errors
>>>>>> > > > >
>>>>>> > > > >On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>>>>>>> > > > > >When we meet an error(err=-EBADFD) recvmsg,
>>>>>> > > > >How do you get EBADFD? Won't vhost_net_rx_peek_head_len
>>>>>> > > > >return 0 in this case, breaking the loop?
>>>>> > > >We started many guest VMs while attaching/detaching some virtio-net nics for loop.
>>>>> > > >The soft lockup might happened. The err is -EBADFD.
>>>>> > > >
>>>> > >OK, I'd like to figure out what happened here. why don't we get 0
>>>> > >when we peek at the head?
>>>> > >
>>>> > >EBADFD is from here:
>>>> > > struct tun_struct *tun = __tun_get(tfile); ...
>>>> > > if (!tun)
>>>> > > return -EBADFD;
>>>> > >
>>>> > >but then:
>>>> > >static int tun_peek_len(struct socket *sock) {
>>>> > >
>>>> > >...
>>>> > >
>>>> > > struct tun_struct *tun; ...
>>>> > > tun = __tun_get(tfile);
>>>> > > if (!tun)
>>>> > > return 0;
>>>> > >
>>>> > >
>>>> > >so peek len should return 0.
>>>> > >
>>>> > >then while will exit:
>>>> > > while ((sock_len = vhost_net_rx_peek_head_len(net,
>>>> > >sock->sk))) ...
>>>> > >
>>> >
>>> >Consider this case: user do ip link del link tap0 before recvmsg()
>>> >but after
>>> >tun_peek_len() ?
>> Sure, this can happen, but I think we'll just exit on the next loop,
>> won't we?
>>
>
>Right, this is the only case I can image for -EBADFD, let's wait for the author to the steps.
>
Thanks, I understand it don't happen in the latest kernel version.
My problem happened using kernel version 3.10.0-xx
The peek len willn't return 0.
static int peek_head_len(struct sock *sk)
{
struct sk_buff *head;
int len = 0;
unsigned long flags;
spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
head = skb_peek(&sk->sk_receive_queue);
if (likely(head)) {
len = head->len;
if (skb_vlan_tag_present(head))
len += VLAN_HLEN;
}
spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
return len;
}
^ permalink raw reply
* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Michael S. Tsirkin @ 2016-12-01 4:56 UTC (permalink / raw)
To: wangyunjian
Cc: Jason Wang, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
caihe
In-Reply-To: <34EFBCA9F01B0748BEB6B629CE643AE60B0A7C38@szxeml561-mbx.china.huawei.com>
On Thu, Dec 01, 2016 at 04:41:40AM +0000, wangyunjian wrote:
> >-----Original Message-----
> >From: Jason Wang [mailto:jasowang@redhat.com]
> >Sent: Thursday, December 01, 2016 11:37 AM
> >To: Michael S. Tsirkin
> >Cc: wangyunjian; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
> >Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
> >
> >
> >
> >On 2016年12月01日 11:27, Michael S. Tsirkin wrote:
> >> On Thu, Dec 01, 2016 at 11:26:21AM +0800, Jason Wang wrote:
> >>> >
> >>> >
> >>> >On 2016年12月01日 11:21, Michael S. Tsirkin wrote:
> >>>> > >On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
> >>>>>> > > > >-----Original Message-----
> >>>>>> > > > >From: Michael S. Tsirkin [mailto:mst@redhat.com]
> >>>>>> > > > >Sent: Wednesday, November 30, 2016 9:41 PM
> >>>>>> > > > >To: wangyunjian
> >>>>>> > > > >Cc:jasowang@redhat.com;netdev@vger.kernel.org;linux-kernel@
> >>>>>> > > > >vger.kernel.org; caihe
> >>>>>> > > > >Subject: Re: [PATCH net] vhost_net: don't continue to call
> >>>>>> > > > >the recvmsg when meet errors
> >>>>>> > > > >
> >>>>>> > > > >On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
> >>>>>>> > > > > >When we meet an error(err=-EBADFD) recvmsg,
> >>>>>> > > > >How do you get EBADFD? Won't vhost_net_rx_peek_head_len
> >>>>>> > > > >return 0 in this case, breaking the loop?
> >>>>> > > >We started many guest VMs while attaching/detaching some virtio-net nics for loop.
> >>>>> > > >The soft lockup might happened. The err is -EBADFD.
> >>>>> > > >
> >>>> > >OK, I'd like to figure out what happened here. why don't we get 0
> >>>> > >when we peek at the head?
> >>>> > >
> >>>> > >EBADFD is from here:
> >>>> > > struct tun_struct *tun = __tun_get(tfile); ...
> >>>> > > if (!tun)
> >>>> > > return -EBADFD;
> >>>> > >
> >>>> > >but then:
> >>>> > >static int tun_peek_len(struct socket *sock) {
> >>>> > >
> >>>> > >...
> >>>> > >
> >>>> > > struct tun_struct *tun; ...
> >>>> > > tun = __tun_get(tfile);
> >>>> > > if (!tun)
> >>>> > > return 0;
> >>>> > >
> >>>> > >
> >>>> > >so peek len should return 0.
> >>>> > >
> >>>> > >then while will exit:
> >>>> > > while ((sock_len = vhost_net_rx_peek_head_len(net,
> >>>> > >sock->sk))) ...
> >>>> > >
> >>> >
> >>> >Consider this case: user do ip link del link tap0 before recvmsg()
> >>> >but after
> >>> >tun_peek_len() ?
> >> Sure, this can happen, but I think we'll just exit on the next loop,
> >> won't we?
> >>
> >
> >Right, this is the only case I can image for -EBADFD, let's wait for the author to the steps.
> >
>
> Thanks, I understand it don't happen in the latest kernel version.
> My problem happened using kernel version 3.10.0-xx
> The peek len willn't return 0.
>
> static int peek_head_len(struct sock *sk)
> {
> struct sk_buff *head;
> int len = 0;
> unsigned long flags;
>
> spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
> head = skb_peek(&sk->sk_receive_queue);
Should return NULL, should it not?
Maybe sk_receive_queue was not purged on detach back then.
> if (likely(head)) {
> len = head->len;
> if (skb_vlan_tag_present(head))
> len += VLAN_HLEN;
> }
>
> spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
> return len;
> }
^ permalink raw reply
* Re: [WIP] net+mlx4: auto doorbell
From: Tom Herbert @ 2016-12-01 5:03 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, Willem de Bruijn, Rick Jones,
Linux Kernel Network Developers, Saeed Mahameed, Tariq Toukan,
Achiad Shochat
In-Reply-To: <1480559566.18162.253.camel@edumazet-glaptop3.roam.corp.google.com>
On Wed, Nov 30, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-11-30 at 17:16 -0800, Tom Herbert wrote:
>> On Wed, Nov 30, 2016 at 4:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > Another issue I found during my tests last days, is a problem with BQL,
>> > and more generally when driver stops/starts the queue.
>> >
>> > When under stress and BQL stops the queue, driver TX completion does a
>> > lot of work, and servicing CPU also takes over further qdisc_run().
>> >
>> Hmm, hard to say if this is problem or a feature I think ;-). One way
>> to "solve" this would be to use IRQ round robin, that would spread the
>> load of networking across CPUs, but that gives us no additional
>> parallelism and reduces locality-- it's generally considered a bad
>> idea. The question might be: is it better to continuously ding one CPU
>> with all the networking work or try to artificially spread it out?
>> Note this is orthogonal to MQ also, so we should already have multiple
>> CPUs doing netif_schedule_queue for queues they manage.
>>
>> Do you have a test or application that shows this is causing pain?
>
> Yes, just launch enough TCP senders (more than 10,000) to fully utilize
> the NIC, with small messages.
>
> super_netperf is not good for that, because you would need 10,000
> processes and would spend too much cycles just dealing with an enormous
> working set, you would not activate BQL.
>
>
>>
>> > The work-flow is :
>> >
>> > 1) collect up to 64 (or 256 packets for mlx4) packets from TX ring, and
>> > unmap things, queue skbs for freeing.
>> >
>> > 2) Calls netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
>> >
>> > if (test_and_clear_bit(__QUEUE_STATE_STACK_XOFF, &dev_queue->state))
>> > netif_schedule_queue(dev_queue);
>> >
>> > This leaves a very tiny window where other cpus could grab __QDISC_STATE_SCHED
>> > (They absolutely have no chance to grab it)
>> >
>> > So we end up with one cpu doing the ndo_start_xmit() and TX completions,
>> > and RX work.
>> >
>> > This problem is magnified when XPS is used, if one mono-threaded application deals with
>> > thousands of TCP sockets.
>> >
>> Do you know of an application doing this? The typical way XPS and most
>> of the other steering mechanisms are configured assume that workloads
>> tend towards a normal distribution. Such mechanisms can become
>> problematic under asymmetric loads, but then I would expect these are
>> likely dedicated servers so that the mechanisms can be tuned
>> accordingly. For instance, XPS can be configured to map more than one
>> queue to a CPU. Alternatively, IIRC Windows has some functionality to
>> tune networking for the load (spin up queues, reconfigure things
>> similar to XPS/RPS, etc.)-- that's promising up the point that we need
>> a lot of heuristics and measurement; but then we lose determinism and
>> risk edge case where we get completely unsatisfied results (one of my
>> concerns with the recent attempt to put configuration in the kernel).
>
> We have thousands of applications, and some of them 'kind of multicast'
> events to a broad number of TCP sockets.
>
> Very often, applications writers use a single thread for doing this,
> when all they need is to send small packets to 10,000 sockets, and they
> do not really care of doing this very fast. They also do not want to
> hurt other applications sharing the NIC.
>
> Very often, process scheduler will also run this single thread in a
> single cpu, ie avoiding expensive migrations if they are not needed.
>
> Problem is this behavior locks one TX queue for the duration of the
> multicast, since XPS will force all the TX packets to go to one TX
> queue.
>
The fact that XPS is forcing TX packets to go over one CPU is a result
of how you've configured XPS. There are other potentially
configurations that present different tradeoffs. For instance, TX
queues are plentiful now days to the point that we could map a number
of queues to each CPU while respecting NUMA locality between the
sending CPU and where TX completions occur. If the set is sufficiently
large this would also helps to address device lock contention. The
other thing I'm wondering is if Willem's concepts distributing DOS
attacks in RPS might be applicable in XPS. If we could detect that a
TX queue is "under attack" maybe we can automatically backoff to
distributing the load to more queues in XPS somehow.
Tom
> Other flows that would share the locked CPU experience high P99
> latencies.
>
>
>>
>> > We should use an additional bit (__QDISC_STATE_PLEASE_GRAB_ME) or some way
>> > to allow another cpu to service the qdisc and spare us.
>> >
>> Wouldn't this need to be an active operation? That is to queue the
>> qdisc on another CPU's output_queue?
>
> I simply suggest we try to queue the qdisc for further servicing as we
> do today, from net_tx_action(), but we might use a different bit, so
> that we leave the opportunity for another cpu to get __QDISC_STATE_SCHED
> before we grab it from net_tx_action(), maybe 100 usec later (time to
> flush all skbs queued in napi_consume_skb() and maybe RX processing,
> since most NIC handle TX completion before doing RX processing from thei
> napi poll handler.
>
> Should be doable with few changes in __netif_schedule() and
> net_tx_action(), plus some control paths that will need to take care of
> the new bit at dismantle time, right ?
>
>
>
^ permalink raw reply
* Re: [v5,1/5] soc: qcom: smem_state: Fix include for ERR_PTR()
From: Valo, Kalle @ 2016-12-01 5:13 UTC (permalink / raw)
To: Bjorn Andersson
Cc: k.eugene.e@gmail.com, Andy Gross, wcn36xx@lists.infradead.org,
linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-arm-msm@vger.kernel.org
In-Reply-To: <878ts0u7z3.fsf@kamboji.qca.qualcomm.com>
Kalle Valo <kvalo@qca.qualcomm.com> writes:
> "Valo, Kalle" <kvalo@qca.qualcomm.com> writes:
>
>> Bjorn Andersson <bjorn.andersson@linaro.org> writes:
>>
>>> On Wed 16 Nov 10:49 PST 2016, Kalle Valo wrote:
>>>
>>>> Bjorn Andersson <bjorn.andersson@linaro.org> wrote:
>>>> > The correct include file for getting errno constants and ERR_PTR() is
>>>> > linux/err.h, rather than linux/errno.h, so fix the include.
>>>> >
>>>> > Fixes: e8b123e60084 ("soc: qcom: smem_state: Add stubs for disabled smem_state")
>>>> > Acked-by: Andy Gross <andy.gross@linaro.org>
>>>> > Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
>>>>
>>>> For some reason this fails to compile now. Can you take a look, please?
>>>>
>>>> ERROR: "qcom_wcnss_open_channel" [drivers/net/wireless/ath/wcn36xx/wcn36xx.ko] undefined!
>>>> make[1]: *** [__modpost] Error 1
>>>> make: *** [modules] Error 2
>>>>
>>>> 5 patches set to Changes Requested.
>>>>
>>>> 9429045 [v5,1/5] soc: qcom: smem_state: Fix include for ERR_PTR()
>>>> 9429047 [v5,2/5] wcn36xx: Transition driver to SMD client
>>>
>>> This patch was updated with the necessary depends in Kconfig to catch
>>> this exact issue and when I pull in your .config (which has QCOM_SMD=n,
>>> QCOM_WCNSS_CTRL=n and WCN36XX=y) I can build this just fine.
>>>
>>> I've tested the various combinations and it seems to work fine. Do you
>>> have any other patches in your tree?
>>
>> This was with the pending branch of my ath.git tree. There are other
>> wireless patches (ath10k etc) but I would guess they don't affect here.
>>
>>> Any stale objects?
>>
>> Not sure what you mean with this question, but I didn't run 'make clean'
>> if that's what you are asking.
>>
>>> Would you mind retesting this, before I invest more time in trying to
>>> reproduce the issue you're seeing?
>>
>> Sure, I'll take a look but that might take few days.
>
> I didn't find enough time to look at this in detail. I applied this to
> my ath.git pending branch, let's see what the kbuild bot finds.
It found the same problem. Interestingly I'm also building x86 with 32
bit, maybe it's related?
tree: https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git pending
head: 1ea16a1c457939b4564643f7637d5cc639a8d3b7
commit: 5eb09c672b01460804fd49b1c9cc7d1072a102f0 [96/99] wcn36xx: Transition driver to SMD client
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
git checkout 5eb09c672b01460804fd49b1c9cc7d1072a102f0
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
>> ERROR: "qcom_wcnss_open_channel" [drivers/net/wireless/ath/wcn36xx/wcn36xx.ko] undefined!
--
Kalle Valo
^ permalink raw reply
* [PATCH net] net: bcmgenet: Utilize correct struct device for all DMA operations
From: Florian Fainelli @ 2016-12-01 5:15 UTC (permalink / raw)
To: netdev; +Cc: davem, pgynther, jaedon.shin, opendmb, Florian Fainelli
__bcmgenet_tx_reclaim() is not using the struct device reference when
doing its unmap operation, which made the DMA-API debugging warn about
it. Fix this by always using &priv->pdev->dev throughout the driver,
using an identical device reference for all map/unmap calls.
Fixes: 1c1008c793fa ("net: bcmgenet: add main driver file")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
drivers/net/ethernet/broadcom/genet/bcmgenet.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 4464bc5db934..d7b553020d66 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -1172,6 +1172,7 @@ static unsigned int __bcmgenet_tx_reclaim(struct net_device *dev,
struct bcmgenet_tx_ring *ring)
{
struct bcmgenet_priv *priv = netdev_priv(dev);
+ struct device *kdev = &priv->pdev->dev;
struct enet_cb *tx_cb_ptr;
struct netdev_queue *txq;
unsigned int pkts_compl = 0;
@@ -1199,13 +1200,13 @@ static unsigned int __bcmgenet_tx_reclaim(struct net_device *dev,
if (tx_cb_ptr->skb) {
pkts_compl++;
bytes_compl += GENET_CB(tx_cb_ptr->skb)->bytes_sent;
- dma_unmap_single(&dev->dev,
+ dma_unmap_single(kdev,
dma_unmap_addr(tx_cb_ptr, dma_addr),
dma_unmap_len(tx_cb_ptr, dma_len),
DMA_TO_DEVICE);
bcmgenet_free_cb(tx_cb_ptr);
} else if (dma_unmap_addr(tx_cb_ptr, dma_addr)) {
- dma_unmap_page(&dev->dev,
+ dma_unmap_page(kdev,
dma_unmap_addr(tx_cb_ptr, dma_addr),
dma_unmap_len(tx_cb_ptr, dma_len),
DMA_TO_DEVICE);
--
1.9.1
^ permalink raw reply related
* Re: pull-request: wireless-drivers 2016-11-29
From: Kalle Valo @ 2016-12-01 5:19 UTC (permalink / raw)
To: David Miller; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20161130.143444.242053190479060640.davem@davemloft.net>
David Miller <davem@davemloft.net> writes:
> From: Kalle Valo <kvalo@codeaurora.org>
> Date: Tue, 29 Nov 2016 16:59:44 +0200
>
>> if there's still time here's one more patch to 3.9. I think this is good
>> to have in 3.9 as it fixes an issue where we were printing uninitialised
>> memory in mwifiex. I had this in wireless-drivers already for some time
>> as I was waiting for other fixes and nothing serious actually came up.
>>
>> If this doesn't make it to 3.9 that's not a problem, I'll just merge
>> this to wireless-drivers-next. Let me know what you prefer.
>
> Unless you are in a time-machine of some sort I assume you mean "4.9" not
> "3.9" :-)
>
> Pulled, thanks.
Hah, I have no idea where I came up with this "3.9" :) I was trying mean
"4.9" of course, thanks for pulling.
--
Kalle Valo
^ permalink raw reply
* Re: [RFC PATCH net-next v2] ipv6: implement consistent hashing for equal-cost multipath routing
From: Hannes Frederic Sowa @ 2016-12-01 5:30 UTC (permalink / raw)
To: David Miller, david.lebrun; +Cc: netdev
In-Reply-To: <20161130.144946.28181822717893867.davem@davemloft.net>
On Wed, Nov 30, 2016, at 20:49, David Miller wrote:
> From: David Lebrun <david.lebrun@uclouvain.be>
> Date: Tue, 29 Nov 2016 18:15:18 +0100
>
> > When multiple nexthops are available for a given route, the routing engine
> > chooses a nexthop by computing the flow hash through get_hash_from_flowi6
> > and by taking that value modulo the number of nexthops. The resulting value
> > indexes the nexthop to select. This method causes issues when a new nexthop
> > is added or one is removed (e.g. link failure). In that case, the number
> > of nexthops changes and potentially all the flows get re-routed to another
> > nexthop.
> >
> > This patch implements a consistent hash method to select the nexthop in
> > case of ECMP. The idea is to generate K slices (or intervals) for each
> > route with multiple nexthops. The nexthops are randomly assigned to those
> > slices, in a uniform manner. The number K is configurable through a sysctl
> > net.ipv6.route.ecmp_slices and is always an exponent of 2. To select the
> > nexthop, the algorithm takes the flow hash and computes an index which is
> > the flow hash modulo K. As K = 2^x, the modulo can be computed using a
> > simple binary AND operation (idx = hash & (K - 1)). The resulting index
> > references the selected nexthop. The lookup time complexity is thus O(1).
> >
> > When a nexthop is added, it steals K/N slices from the other nexthops,
> > where N is the new number of nexthops. The slices are stolen randomly and
> > uniformly from the other nexthops. When a nexthop is removed, the orphan
> > slices are randomly reassigned to the other nexthops.
> >
> > The number of slices for a route also fixes the maximum number of nexthops
> > possible for that route.
> >
> > Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
>
> Interesting approach, but like Hannes I worry about the memory
> consumption
> bounds.
>
> Limiting to 1<<16 is interesting, but if you can limit to 1<<8 (256
> nexthops) maybe the state requirement can be compressed even further?
>From what I have heard the user space DPDK-alike implementations of
consistent hashing actually use as much memory as possible (2^32) to
consistently steer the flow to the target, so every u32 flow hash value
has a unique corresponding mapping. The slice size doesn't affect the
number of nexthops (that is limited by the datatype in the array, which
is in this patch a u8), thus limiting us to 256 nexthops we could refer
to independently of the slice size. Lowering the slices means that the
algorithm becomes inaccurate as more flows map to one nexthop selector,
thus causing more inconsistent routing decisions to be made when changes
happen.
> We can always increase this if necessary in the future if someone
> reports a reasonable use case that really needs it. Let's start
> simple and small first.
If people really rely on consistent hashing for load balancers setups
where the backends don't synchronize their state across their server
(the IPv6 anycast use case), they would probably always want to go with
the full sized state table, as every update on this map would cause user
errors and inconsistent state on the backend.
Thus, I think that we should also allow for maximum allocations,
wherever they come from, as it is already out there and deployed.
Bye,
Hannes
^ permalink raw reply
* Re: [PATCH net-next v6 6/6] samples/bpf: add userspace example for prohibiting sockets
From: Alexei Starovoitov @ 2016-12-01 5:59 UTC (permalink / raw)
To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480529810-25850-7-git-send-email-dsa@cumulusnetworks.com>
On Wed, Nov 30, 2016 at 10:16:50AM -0800, David Ahern wrote:
> Add examples preventing a process in a cgroup from opening a socket
> based family, protocol and type.
>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
...
> +++ b/samples/bpf/sock_flags_kern.c
> @@ -0,0 +1,37 @@
> +#include <uapi/linux/bpf.h>
> +#include <linux/socket.h>
> +#include "bpf_helpers.h"
> +
> +SEC("cgroup/sock1")
> +int bpf_prog1(struct bpf_sock *sk)
> +{
> + char fmt[] = "socket: family %d type %d protocol %d\n";
> +
> + bpf_trace_printk(fmt, sizeof(fmt), sk->family, sk->type, sk->protocol);
> +
> + /* block PF_INET6, SOCK_RAW, IPPROTO_ICMPV6 sockets
> + * ie., make ping6 fail
> + */
> + if (sk->family == PF_INET6 && sk->type == 3 && sk->protocol == 58)
> + return 0;
why not to use SOCK_RAW and IPPROTO_ICMPV6 instead of constants?
Thanks for the test!
^ permalink raw reply
* Re: [PATCH net-next v6 4/6] bpf: Add support for reading socket family, type, protocol
From: Alexei Starovoitov @ 2016-12-01 5:53 UTC (permalink / raw)
To: David Ahern; +Cc: netdev, daniel, ast, daniel, maheshb, tgraf
In-Reply-To: <1480529810-25850-5-git-send-email-dsa@cumulusnetworks.com>
On Wed, Nov 30, 2016 at 10:16:48AM -0800, David Ahern wrote:
> Add socket family, type and protocol to bpf_sock allowing bpf programs
> read-only access.
>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
...
> +static u32 convert_sock_access(int sock_field, int dst_reg, int src_reg,
> + struct bpf_insn *insn_buf)
> +{
> + struct bpf_insn *insn = insn_buf;
> +
> + switch (sock_field) {
> + case SOCKF_AD_TYPE:
> + *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
> + offsetof(struct sock, __sk_flags_offset));
> + *insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, SK_FL_TYPE_MASK);
> + *insn++ = BPF_ALU32_IMM(BPF_RSH, dst_reg, SK_FL_TYPE_SHIFT);
> + break;
Wasn't that complex after all? ;)
This part looks good.
> + case offsetof(struct bpf_sock, type):
> + return convert_sock_access(SOCKF_AD_TYPE, dst_reg, src_reg,
> + insn_buf);
my only request-for-change is to place the above 3 lines of *insn++ here
instead of adding new helper function for it.
In case of skb it was a helper, because it was called from different places.
Here this is the only place where it ever be needed, so extra helper
only confuses the reader.
The rest of the patches look good to me.
I'm not Ack-ing them yet, since you didn't carry over my Ack
from the first patch and I had to re-review it again this time.
Thanks.
^ permalink raw reply
* Re: Re: [PATCH] ipv6:ip6_xmit remove unnecessary np NULL check
From: Eric Dumazet @ 2016-12-01 6:29 UTC (permalink / raw)
To: r.thapliyal, Lorenzo Colitti
Cc: Manjeet Pawar, davem@davemloft.net, kuznet@ms2.inr.ac.ru,
jmorris@namei.org, yoshfuji@linux-ipv6.org, kaber@trash.net,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
PANKAJ MISHRA, Ajeet Kumar Yadav
In-Reply-To: <20161201061422epcms5p83ef5b41a335c4556dd2bc725eed38d8d@epcms5p8>
On Thu, 2016-12-01 at 06:14 +0000, Rohit Thapliyal wrote:
> Hi,
>
Hi,
Do not top post on netdev, and do not use HTML format : it wont reach
netdev.
>
> Found at just one place in ping_v6_sendmsg, where np is checked for
> NULL.
>
> And I am not sure, if it is really required there also.
>
> So, I am not sure if sk_fullsock will ever return NULL in inet6_sk.
>
Well I am sure. You can trust me. You can read git history for the
details.
>
> sk_fullsock(__sk) ? inet_sk(__sk)->pinet6 : NULL;
>
Yes, Lorenzo mentioned this ping_v6_sendmsg() useless code, but
apparently never sent the cleanup. This is really minor you know...
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index 66e2d9dfc43a87ebed092d024e5bf2752b755d0e..775a9365725dd400c36e2510c685d08bf4a7d4b0 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -125,12 +125,6 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
return PTR_ERR(dst);
rt = (struct rt6_info *) dst;
- np = inet6_sk(sk);
- if (!np) {
- err = -EBADF;
- goto dst_err_out;
- }
-
if (!fl6.flowi6_oif && ipv6_addr_is_multicast(&fl6.daddr))
fl6.flowi6_oif = np->mcast_oif;
else if (!fl6.flowi6_oif)
@@ -165,7 +159,6 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
}
release_sock(sk);
-dst_err_out:
dst_release(dst);
if (err)
>
> Thanks,
>
> Rohit Thapliyal
>
>
>
> --------- Original Message ---------
>
> Sender : Eric Dumazet <eric.dumazet@gmail.com>
>
> Date : 2016-11-30 00:26 (GMT+9)
>
> Title : Re: [PATCH] ipv6:ip6_xmit remove unnecessary np NULL check
>
>
>
> On Tue, 2016-11-29 at 12:02 +0530, Manjeet Pawar wrote:
> > From: Rohit Thapliyal <r.thapliyal@samsung.com>
> >
> > np NULL check doesn't seem required here as it shall never
> > be NULL anyways in inet6_sk(sk).
> >
> > Signed-off-by: Rohit Thapliyal <r.thapliyal@samsung.com>
> > Signed-off-by: Manjeet Pawar <manjeet.p@samsung.com>
> > Signed-off-by: David Miller <davem@davemloft.net>
> > Reviewed-by: Akhilesh Kumar <akhilesh.k@samsung.com>
> >
> > ---
> > v2->v3: Modified as per the suggestion from David Miller
> > ip6_xmit calls are made without checking NULL np
> > pointer, so no need to explicitly check NULL np in
> > ip6_xmit.
> >
> > include/linux/ipv6.h | 2 +-
> > net/ipv6/ip6_output.c | 3 +--
> > 2 files changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> > index a064997..6c9c604 100644
> > --- a/include/linux/ipv6.h
> > +++ b/include/linux/ipv6.h
> > @@ -299,7 +299,7 @@ struct tcp6_timewait_sock {
> >
> > static inline struct ipv6_pinfo *inet6_sk(const struct sock *__sk)
> > {
> > - return sk_fullsock(__sk) ? inet_sk(__sk)->pinet6 : NULL;
> > + return inet_sk(__sk)->pinet6;
>
>
> David suggestion was about np being NULL or not in ip6_xmit()
>
> But have you checked inet6_sk() was never called for a TCPv6 TIMEWAIT or
> SYN_RECV request ?
>
> > }
> >
> > static inline struct raw6_sock *raw6_sk(const struct sock *sk)
> > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> > index 59eb4ed..f8c63ec 100644
> > --- a/net/ipv6/ip6_output.c
> > +++ b/net/ipv6/ip6_output.c
> > @@ -213,8 +213,7 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
> > /*
> > * Fill in the IPv6 header
> > */
> > - if (np)
> > - hlimit = np->hop_limit;
> > + hlimit = np->hop_limit;
> > if (hlimit < 0)
> > hlimit = ip6_dst_hoplimit(dst);
> >
>
> This part is fine.
>
>
>
>
> --
>
> Thanks and Regards,
>
> Rohit Thapliyal
>
> Lead Engineer
>
> Samsung SRI-D
> Phone: +91-9717627969
>
>
>
>
>
>
^ permalink raw reply related
* Re: [PATCH net-next v2 1/2] tcp: randomize tcp timestamp offsets for each connection
From: Yuchung Cheng @ 2016-12-01 7:05 UTC (permalink / raw)
To: Florian Westphal; +Cc: netdev
In-Reply-To: <1480508930-24406-1-git-send-email-fw@strlen.de>
On Wed, Nov 30, 2016 at 4:28 AM, Florian Westphal <fw@strlen.de> wrote:
>
> jiffies based timestamps allow for easy inference of number of devices
> behind NAT translators and also makes tracking of hosts simpler.
>
> commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset")
> added the main infrastructure that is needed for per-connection ts
> randomization, in particular writing/reading the on-wire tcp header
> format takes the offset into account so rest of stack can use normal
> tcp_time_stamp (jiffies).
>
> So only two items are left:
> - add a tsoffset for request sockets
> - extend the tcp isn generator to also return another 32bit number
> in addition to the ISN.
>
> Re-use of ISN generator also means timestamps are still monotonically
> increasing for same connection quadruple, i.e. PAWS will still work.
>
> Includes fixes from Eric Dumazet.
>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> Acked-by: Eric Dumazet <edumazet@google.com>
> ---
Acked-by: Yuchung Cheng <ycheng@google.com>
Nice feature!
>
> No changes since v1, preserved Erics ack.
>
> include/linux/tcp.h | 1 +
> include/net/secure_seq.h | 8 ++++----
> include/net/tcp.h | 2 +-
> net/core/secure_seq.c | 10 ++++++----
> net/ipv4/syncookies.c | 1 +
> net/ipv4/tcp_input.c | 7 ++++++-
> net/ipv4/tcp_ipv4.c | 9 +++++----
> net/ipv4/tcp_minisocks.c | 4 +++-
> net/ipv4/tcp_output.c | 2 +-
> net/ipv6/syncookies.c | 1 +
> net/ipv6/tcp_ipv6.c | 10 ++++++----
> 11 files changed, 35 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 32a7c7e35b71..2408bcc579f1 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -123,6 +123,7 @@ struct tcp_request_sock {
> u32 txhash;
> u32 rcv_isn;
> u32 snt_isn;
> + u32 ts_off;
> u32 last_oow_ack_time; /* last SYNACK */
> u32 rcv_nxt; /* the ack # by SYNACK. For
> * FastOpen it's the seq#
> diff --git a/include/net/secure_seq.h b/include/net/secure_seq.h
> index 3f36d45b714a..0caee631a836 100644
> --- a/include/net/secure_seq.h
> +++ b/include/net/secure_seq.h
> @@ -6,10 +6,10 @@
> u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport);
> u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
> __be16 dport);
> -__u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
> - __be16 sport, __be16 dport);
> -__u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
> - __be16 sport, __be16 dport);
> +u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
> + __be16 sport, __be16 dport, u32 *tsoff);
> +u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
> + __be16 sport, __be16 dport, u32 *tsoff);
> u64 secure_dccp_sequence_number(__be32 saddr, __be32 daddr,
> __be16 sport, __be16 dport);
> u64 secure_dccpv6_sequence_number(__be32 *saddr, __be32 *daddr,
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 7de80739adab..1c09d909bd43 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1809,7 +1809,7 @@ struct tcp_request_sock_ops {
> struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
> const struct request_sock *req,
> bool *strict);
> - __u32 (*init_seq)(const struct sk_buff *skb);
> + __u32 (*init_seq)(const struct sk_buff *skb, u32 *tsoff);
> int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
> struct flowi *fl, struct request_sock *req,
> struct tcp_fastopen_cookie *foc,
> diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
> index fd3ce461fbe6..a8d6062cbb4a 100644
> --- a/net/core/secure_seq.c
> +++ b/net/core/secure_seq.c
> @@ -40,8 +40,8 @@ static u32 seq_scale(u32 seq)
> #endif
>
> #if IS_ENABLED(CONFIG_IPV6)
> -__u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
> - __be16 sport, __be16 dport)
> +u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
> + __be16 sport, __be16 dport, u32 *tsoff)
> {
> u32 secret[MD5_MESSAGE_BYTES / 4];
> u32 hash[MD5_DIGEST_WORDS];
> @@ -58,6 +58,7 @@ __u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
>
> md5_transform(hash, secret);
>
> + *tsoff = hash[1];
> return seq_scale(hash[0]);
> }
> EXPORT_SYMBOL(secure_tcpv6_sequence_number);
> @@ -86,8 +87,8 @@ EXPORT_SYMBOL(secure_ipv6_port_ephemeral);
>
> #ifdef CONFIG_INET
>
> -__u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
> - __be16 sport, __be16 dport)
> +u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
> + __be16 sport, __be16 dport, u32 *tsoff)
> {
> u32 hash[MD5_DIGEST_WORDS];
>
> @@ -99,6 +100,7 @@ __u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
>
> md5_transform(hash, net_secret);
>
> + *tsoff = hash[1];
> return seq_scale(hash[0]);
> }
>
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 0dc6286272aa..3e88467d70ee 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -334,6 +334,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
> treq = tcp_rsk(req);
> treq->rcv_isn = ntohl(th->seq) - 1;
> treq->snt_isn = cookie;
> + treq->ts_off = 0;
> req->mss = mss;
> ireq->ir_num = ntohs(th->dest);
> ireq->ir_rmt_port = th->source;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 22e6a2097ff6..1b1921c71f7c 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6301,6 +6301,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
> goto drop;
>
> tcp_rsk(req)->af_specific = af_ops;
> + tcp_rsk(req)->ts_off = 0;
>
> tcp_clear_options(&tmp_opt);
> tmp_opt.mss_clamp = af_ops->mss_clamp;
> @@ -6322,6 +6323,9 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
> if (security_inet_conn_request(sk, skb, req))
> goto drop_and_free;
>
> + if (isn && tmp_opt.tstamp_ok)
> + af_ops->init_seq(skb, &tcp_rsk(req)->ts_off);
> +
> if (!want_cookie && !isn) {
> /* VJ's idea. We save last timestamp seen
> * from the destination in peer table, when entering
> @@ -6362,7 +6366,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
> goto drop_and_release;
> }
>
> - isn = af_ops->init_seq(skb);
> + isn = af_ops->init_seq(skb, &tcp_rsk(req)->ts_off);
> }
> if (!dst) {
> dst = af_ops->route_req(sk, &fl, req, NULL);
> @@ -6374,6 +6378,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
>
> if (want_cookie) {
> isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
> + tcp_rsk(req)->ts_off = 0;
> req->cookie_ts = tmp_opt.tstamp_ok;
> if (!tmp_opt.tstamp_ok)
> inet_rsk(req)->ecn_ok = 0;
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 5555eb86e549..b50f05905ced 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -95,12 +95,12 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
> struct inet_hashinfo tcp_hashinfo;
> EXPORT_SYMBOL(tcp_hashinfo);
>
> -static __u32 tcp_v4_init_sequence(const struct sk_buff *skb)
> +static u32 tcp_v4_init_sequence(const struct sk_buff *skb, u32 *tsoff)
> {
> return secure_tcp_sequence_number(ip_hdr(skb)->daddr,
> ip_hdr(skb)->saddr,
> tcp_hdr(skb)->dest,
> - tcp_hdr(skb)->source);
> + tcp_hdr(skb)->source, tsoff);
> }
>
> int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
> @@ -237,7 +237,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
> tp->write_seq = secure_tcp_sequence_number(inet->inet_saddr,
> inet->inet_daddr,
> inet->inet_sport,
> - usin->sin_port);
> + usin->sin_port,
> + &tp->tsoffset);
>
> inet->inet_id = tp->write_seq ^ jiffies;
>
> @@ -824,7 +825,7 @@ static void tcp_v4_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
> tcp_v4_send_ack(sk, skb, seq,
> tcp_rsk(req)->rcv_nxt,
> req->rsk_rcv_wnd >> inet_rsk(req)->rcv_wscale,
> - tcp_time_stamp,
> + tcp_time_stamp + tcp_rsk(req)->ts_off,
> req->ts_recent,
> 0,
> tcp_md5_do_lookup(sk, (union tcp_md5_addr *)&ip_hdr(skb)->daddr,
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 6234ebaa7db1..28ce5ee831f5 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -532,7 +532,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
> newtp->rx_opt.ts_recent_stamp = 0;
> newtp->tcp_header_len = sizeof(struct tcphdr);
> }
> - newtp->tsoffset = 0;
> + newtp->tsoffset = treq->ts_off;
> #ifdef CONFIG_TCP_MD5SIG
> newtp->md5sig_info = NULL; /*XXX*/
> if (newtp->af_specific->md5_lookup(sk, newsk))
> @@ -581,6 +581,8 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
>
> if (tmp_opt.saw_tstamp) {
> tmp_opt.ts_recent = req->ts_recent;
> + if (tmp_opt.rcv_tsecr)
> + tmp_opt.rcv_tsecr -= tcp_rsk(req)->ts_off;
> /* We do not store true stamp, but it is not required,
> * it can be estimated (approximately)
> * from another data.
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 19105b46a304..1b6d5f34bf45 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -640,7 +640,7 @@ static unsigned int tcp_synack_options(struct request_sock *req,
> }
> if (likely(ireq->tstamp_ok)) {
> opts->options |= OPTION_TS;
> - opts->tsval = tcp_skb_timestamp(skb);
> + opts->tsval = tcp_skb_timestamp(skb) + tcp_rsk(req)->ts_off;
> opts->tsecr = req->ts_recent;
> remaining -= TCPOLEN_TSTAMP_ALIGNED;
> }
> diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
> index 97830a6a9cbb..a4d49760bf43 100644
> --- a/net/ipv6/syncookies.c
> +++ b/net/ipv6/syncookies.c
> @@ -209,6 +209,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
> treq->snt_synack.v64 = 0;
> treq->rcv_isn = ntohl(th->seq) - 1;
> treq->snt_isn = cookie;
> + treq->ts_off = 0;
>
> /*
> * We need to lookup the dst_entry to get the correct window size.
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 28ec0a2e7b72..a2185a214abc 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -101,12 +101,12 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
> }
> }
>
> -static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
> +static u32 tcp_v6_init_sequence(const struct sk_buff *skb, u32 *tsoff)
> {
> return secure_tcpv6_sequence_number(ipv6_hdr(skb)->daddr.s6_addr32,
> ipv6_hdr(skb)->saddr.s6_addr32,
> tcp_hdr(skb)->dest,
> - tcp_hdr(skb)->source);
> + tcp_hdr(skb)->source, tsoff);
> }
>
> static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> @@ -283,7 +283,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
> tp->write_seq = secure_tcpv6_sequence_number(np->saddr.s6_addr32,
> sk->sk_v6_daddr.s6_addr32,
> inet->inet_sport,
> - inet->inet_dport);
> + inet->inet_dport,
> + &tp->tsoffset);
>
> err = tcp_connect(sk);
> if (err)
> @@ -956,7 +957,8 @@ static void tcp_v6_reqsk_send_ack(const struct sock *sk, struct sk_buff *skb,
> tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
> tcp_rsk(req)->rcv_nxt,
> req->rsk_rcv_wnd >> inet_rsk(req)->rcv_wscale,
> - tcp_time_stamp, req->ts_recent, sk->sk_bound_dev_if,
> + tcp_time_stamp + tcp_rsk(req)->ts_off,
> + req->ts_recent, sk->sk_bound_dev_if,
> tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
> 0, 0);
> }
> --
> 2.7.3
>
^ permalink raw reply
* Re: [PATCH net-next v2 2/2] tcp: allow to turn tcp timestamp randomization off
From: Yuchung Cheng @ 2016-12-01 7:12 UTC (permalink / raw)
To: Florian Westphal; +Cc: netdev
In-Reply-To: <1480508930-24406-2-git-send-email-fw@strlen.de>
On Wed, Nov 30, 2016 at 4:28 AM, Florian Westphal <fw@strlen.de> wrote:
> Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple
> flows, we can deduct the relative qdisc delays (think of fq pacing).
> This should work even if we have one flow per remote peer."
>
> Having random per flow (or host) offsets doesn't allow that anymore so add
> a way to turn this off.
>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
> change since v2: do check in secure_tcpv4/6_sequence_number so outgoing
> syn packets won't have a random offset either in if randomization is off.
>
> Tested:
> sysctl_tcp_timestamps==1, tcpdump on lo, both ends have same values.
>
> Documentation/networking/ip-sysctl.txt | 9 +++++++--
> net/core/secure_seq.c | 5 +++--
> net/ipv4/tcp_input.c | 3 ++-
> 3 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 5af48dd7c5fc..de2448313799 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -610,8 +610,13 @@ tcp_syn_retries - INTEGER
> with the current initial RTO of 1second. With this the final timeout
> for an active TCP connection attempt will happen after 127seconds.
>
> -tcp_timestamps - BOOLEAN
> - Enable timestamps as defined in RFC1323.
> +tcp_timestamps - INTEGER
> +Enable timestamps as defined in RFC1323.
> + 0: Disabled.
> + 1: Enable timestamps as defined in RFC1323.
> + 2: Like 1, but also use a random offset for each connection
> + rather than only using the current time.
> + Default: 2
Small suggestion: I suspect host/server configs manually set the knob
to 1. Perhaps swap 1 and 2 to maximize the coverage of this new
feature?
>
> tcp_min_tso_segs - INTEGER
> Minimal number of segments per TSO frame.
> diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
> index a8d6062cbb4a..36addd3d9633 100644
> --- a/net/core/secure_seq.c
> +++ b/net/core/secure_seq.c
> @@ -12,6 +12,7 @@
> #include <net/secure_seq.h>
>
> #if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_INET)
> +#include <net/tcp.h>
> #define NET_SECRET_SIZE (MD5_MESSAGE_BYTES / 4)
>
> static u32 net_secret[NET_SECRET_SIZE] ____cacheline_aligned;
> @@ -58,7 +59,7 @@ u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
>
> md5_transform(hash, secret);
>
> - *tsoff = hash[1];
> + *tsoff = sysctl_tcp_timestamps == 2 ? hash[1] : 0;
> return seq_scale(hash[0]);
> }
> EXPORT_SYMBOL(secure_tcpv6_sequence_number);
> @@ -100,7 +101,7 @@ u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
>
> md5_transform(hash, net_secret);
>
> - *tsoff = hash[1];
> + *tsoff = sysctl_tcp_timestamps == 2 ? hash[1] : 0;
> return seq_scale(hash[0]);
> }
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 1b1921c71f7c..5f6d4efd2551 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -76,7 +76,7 @@
> #include <asm/unaligned.h>
> #include <linux/errqueue.h>
>
> -int sysctl_tcp_timestamps __read_mostly = 1;
> +int sysctl_tcp_timestamps __read_mostly = 2;
> int sysctl_tcp_window_scaling __read_mostly = 1;
> int sysctl_tcp_sack __read_mostly = 1;
> int sysctl_tcp_fack __read_mostly = 1;
> @@ -85,6 +85,7 @@ int sysctl_tcp_dsack __read_mostly = 1;
> int sysctl_tcp_app_win __read_mostly = 31;
> int sysctl_tcp_adv_win_scale __read_mostly = 1;
> EXPORT_SYMBOL(sysctl_tcp_adv_win_scale);
> +EXPORT_SYMBOL(sysctl_tcp_timestamps);
>
> /* rfc5961 challenge ack rate limiting */
> int sysctl_tcp_challenge_ack_limit = 1000;
> --
> 2.7.3
>
^ permalink raw reply
* Re: [PATCH net] tipc: check minimum bearer MTU
From: kbuild test robot @ 2016-12-01 7:21 UTC (permalink / raw)
To: Michal Kubecek
Cc: kbuild-all, Jon Maloy, Ying Xue, David S. Miller, tipc-discussion,
netdev, linux-kernel, Ben Hutchings, Qian Zhang
In-Reply-To: <20161130095702.DD033A0F14@unicorn.suse.cz>
[-- Attachment #1: Type: text/plain, Size: 3127 bytes --]
Hi Michal,
[auto build test ERROR on net/master]
url: https://github.com/0day-ci/linux/commits/Michal-Kubecek/tipc-check-minimum-bearer-MTU/20161201-140555
config: i386-randconfig-s0-201648 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
net/tipc/bearer.o: In function `tipc_check_mtu':
>> bearer.c:(.text+0xadd): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
net/tipc/core.o: In function `tipc_check_mtu':
core.c:(.text+0x239): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
net/tipc/link.o: In function `tipc_check_mtu':
link.c:(.text+0x110e): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
net/tipc/discover.o: In function `tipc_check_mtu':
discover.c:(.text+0x2c7): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
net/tipc/name_distr.o: In function `tipc_check_mtu':
name_distr.c:(.text+0x536): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
net/tipc/monitor.o: In function `tipc_check_mtu':
monitor.c:(.text+0x906): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
net/tipc/name_table.o: In function `tipc_check_mtu':
name_table.c:(.text+0x875): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
net/tipc/net.o: In function `tipc_check_mtu':
net.c:(.text+0xc4): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
net/tipc/netlink.o: In function `tipc_check_mtu':
netlink.c:(.text+0x0): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
net/tipc/netlink_compat.o: In function `tipc_check_mtu':
netlink_compat.c:(.text+0x2a9e): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
net/tipc/node.o: In function `tipc_check_mtu':
node.c:(.text+0x1dd8): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
net/tipc/socket.o: In function `tipc_check_mtu':
socket.c:(.text+0x62bb): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
net/tipc/eth_media.o: In function `tipc_check_mtu':
eth_media.c:(.text+0x117): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
net/tipc/udp_media.o: In function `tipc_check_mtu':
udp_media.c:(.text+0x10b9): multiple definition of `tipc_check_mtu'
net/tipc/bcast.o:bcast.c:(.text+0x598): first defined here
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 25827 bytes --]
^ permalink raw reply
* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Vishwanathapura, Niranjana @ 2016-12-01 7:39 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Hefty, Sean, Weiny, Ira, Doug Ledford,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Dalessandro, Dennis
In-Reply-To: <20161129165009.GA3167-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
On Tue, Nov 29, 2016 at 09:50:09AM -0700, Jason Gunthorpe wrote:
>On Tue, Nov 29, 2016 at 04:44:37PM +0000, Hefty, Sean wrote:
>> > You are not making a subsystem. Don't overcomplicate things. A
>> > multi-part device device can just directly link.
>>
>> The VNIC may be usable over multiple generations of HFIs, but I
>> don't know if that is the intent.
>
>If Intel wants to build a HFI subystem within RDMA with multiple
>drivers then sure, but they are not there yet, and we don't even know
>what that could look like. So it is better to leave it simple for now.
>
>Jason
Sorry for the delay, I was weighing in couple options.
We envisioned vnic as a pure software construct and hence should be independent
(like ipoib). ie., both hfi_vnic and hfi1 should be independently loadable
(like ipoib) despite hfi_vnic being dependent on hfi1 here for HW access.
There doesn't seem to be much value of hfi_vnic being a 'ib client', if it
still has compilation and module dependency on hfi1 module.
The more I think of it, having vnic ops added to ib device structure (option
(b)) makes it cleaner (no dependency). We can probably consider extending 'ib
device' in hfi1 in order for hfi_vnic to get to the vnic ops. But (b) makes it
simpler.
Though Jason's suggestion could be a temporary measure for this patch series,
the above approach is what I would like to target here.
Niranjana
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Vishwanathapura, Niranjana @ 2016-12-01 7:41 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: ira.weiny, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161129162113.GC742-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
On Tue, Nov 29, 2016 at 09:21:13AM -0700, Jason Gunthorpe wrote:
>On Mon, Nov 28, 2016 at 10:29:38PM -0800, Vishwanathapura, Niranjana wrote:
>> On Thu, Nov 24, 2016 at 09:15:45AM -0700, Jason Gunthorpe wrote:
>> >>And will move the hfi_vnic module under
>> >>???drivers/infiniband/ulp/hfi_vnic???.
>> >
>> >I would prefer drivers/net/ethernet
>> >
>> >This is clearly not a ULP since it doesn't use verbs.
>> >
>>
>> I understand it is not using verbs, but the control path (ib_device client)
>> is using verbs (IB MAD).
>> Our prefernce is to keep it somewhere under drivers/infiniband. Summarizing
>> reasons again here,
>>
>> - VNIC control driver (ib_device client) is an IB MAD agent.
>> - It is purly a software construct, encapsualtes ethernet packets in
>> Omni-path packet and depends on hfi1 driver here for HW access.
>
>Is the majority of the code MAD focused or net stack focused?
>
>I'm not sure it matters, it isn't like we can review Intel's
>proprietary mad stuff anyhow. :\
>
>Jason
That is an intersting measure. In hfi_vnic driver, I would say, >60% of the
code is MAD focused, mainly interfacing with the IB MAD agent.
It also includes populating/parsing those MAD packets. At the least it is not
supporting the driver to be put under net folder.
Even in the remaining <40%, half of it is involved with encapsulating ethernet
frames with Omni-path header (does this makes it belong under
drivers/infiniband/hw?).
The net stack interface part is pretty standard, hence is not much of code.
I do see the reason to put it under net folder, but I am seeing more reason for
it to be somewhere under drivers/infiniband.
Niranjana
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: linux-next: manual merge of the net-next tree with the net tree
From: Jiri Pirko @ 2016-12-01 7:56 UTC (permalink / raw)
To: Stephen Rothwell
Cc: David Miller, Networking, linux-next, linux-kernel, Jiri Pirko,
Roi Dayan
In-Reply-To: <20161201124159.79f2e4f9@canb.auug.org.au>
Thu, Dec 01, 2016 at 02:41:59AM CET, sfr@canb.auug.org.au wrote:
>Hi all,
>
>Today's linux-next merge of the net-next tree got a conflict in:
>
> net/sched/cls_flower.c
>
>between commit:
>
> 725cbb62e7ad ("sched: cls_flower: remove from hashtable only in case skip sw flag is not set")
>
>from the net tree and commit:
>
> 13fa876ebd03 ("net/sched: cls_flower: merge filter delete/destroy common code")
>
>from the net-next tree.
>
>I fixed it up (see below) and can carry the fix as necessary. This
>is now fixed as far as linux-next is concerned, but any non trivial
>conflicts should be mentioned to your upstream maintainer when your tree
>is submitted for merging. You may also want to consider cooperating
>with the maintainer of the conflicting tree to minimise any particularly
>complex conflicts.
Looks fine to me. Thanks.
>
>--
>Cheers,
>Stephen Rothwell
>
>diff --cc net/sched/cls_flower.c
>index 904442421db3,e8dd09af0d0c..000000000000
>--- a/net/sched/cls_flower.c
>+++ b/net/sched/cls_flower.c
>@@@ -273,24 -272,14 +276,32 @@@ static void fl_hw_update_stats(struct t
> dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle, tp->protocol, &tc);
> }
>
> +static void fl_destroy_sleepable(struct work_struct *work)
> +{
> + struct cls_fl_head *head = container_of(work, struct cls_fl_head,
> + work);
> + if (head->mask_assigned)
> + rhashtable_destroy(&head->ht);
> + kfree(head);
> + module_put(THIS_MODULE);
> +}
> +
> +static void fl_destroy_rcu(struct rcu_head *rcu)
> +{
> + struct cls_fl_head *head = container_of(rcu, struct cls_fl_head, rcu);
> +
> + INIT_WORK(&head->work, fl_destroy_sleepable);
> + schedule_work(&head->work);
> +}
> +
>+ static void __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f)
>+ {
>+ list_del_rcu(&f->list);
>+ fl_hw_destroy_filter(tp, (unsigned long)f);
>+ tcf_unbind_filter(tp, &f->res);
>+ call_rcu(&f->rcu, fl_destroy_filter);
>+ }
>+
> static bool fl_destroy(struct tcf_proto *tp, bool force)
> {
> struct cls_fl_head *head = rtnl_dereference(tp->root);
>@@@ -299,14 -288,12 +310,11 @@@
> if (!force && !list_empty(&head->filters))
> return false;
>
>- list_for_each_entry_safe(f, next, &head->filters, list) {
>- fl_hw_destroy_filter(tp, (unsigned long)f);
>- list_del_rcu(&f->list);
>- call_rcu(&f->rcu, fl_destroy_filter);
>- }
>+ list_for_each_entry_safe(f, next, &head->filters, list)
>+ __fl_delete(tp, f);
> - RCU_INIT_POINTER(tp->root, NULL);
> - if (head->mask_assigned)
> - rhashtable_destroy(&head->ht);
> - kfree_rcu(head, rcu);
> +
> + __module_get(THIS_MODULE);
> + call_rcu(&head->rcu, fl_destroy_rcu);
> return true;
> }
>
>@@@ -761,13 -782,9 +804,10 @@@ static int fl_delete(struct tcf_proto *
> struct cls_fl_head *head = rtnl_dereference(tp->root);
> struct cls_fl_filter *f = (struct cls_fl_filter *) arg;
>
> - rhashtable_remove_fast(&head->ht, &f->ht_node,
> - head->ht_params);
> + if (!tc_skip_sw(f->flags))
> + rhashtable_remove_fast(&head->ht, &f->ht_node,
> + head->ht_params);
>- list_del_rcu(&f->list);
>- fl_hw_destroy_filter(tp, (unsigned long)f);
>- tcf_unbind_filter(tp, &f->res);
>- call_rcu(&f->rcu, fl_destroy_filter);
>+ __fl_delete(tp, f);
> return 0;
> }
>
^ permalink raw reply
* Re: [PATCH net-next 5/8] net/sched: cls_flower: Add offload support using egress Hardware device
From: kbuild test robot @ 2016-12-01 8:11 UTC (permalink / raw)
To: Hadar Hen Zion
Cc: kbuild-all, David S. Miller, netdev, Saeed Mahameed, Jiri Pirko,
Amir Vadai, Or Gerlitz, Roi Dayan, Hadar Hen Zion
In-Reply-To: <1480516895-29545-6-git-send-email-hadarh@mellanox.com>
[-- Attachment #1: Type: text/plain, Size: 1105 bytes --]
Hi Hadar,
[auto build test ERROR on net-next/master]
url: https://github.com/0day-ci/linux/commits/Hadar-Hen-Zion/Offloading-tc-rules-using-underline-Hardware-device/20161201-155727
config: i386-randconfig-x004-201648 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
net/sched/cls_api.c: In function 'tcf_exts_get_dev':
>> net/sched/cls_api.c:696:13: error: dereferencing pointer to incomplete type 'const struct tc_action_ops'
if (a->ops->get_dev) {
^~
vim +696 net/sched/cls_api.c
690
691 if (tc_no_actions(exts))
692 return -EINVAL;
693
694 tcf_exts_to_list(exts, &actions);
695 list_for_each_entry(a, &actions, list) {
> 696 if (a->ops->get_dev) {
697 a->ops->get_dev(a, dev_net(dev), hw_dev);
698 break;
699 }
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26047 bytes --]
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox