Netdev List
 help / color / mirror / Atom feed
* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-06-12  1:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: KVM list, Network Development, Linux Kernel Mailing List,
	Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <20180612042600-mutt-send-email-mst@kernel.org>

On Mon, Jun 11, 2018 at 6:36 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> Maybe it will help to have GFP_NONE which will make any allocation
> fail if attempted. Linus, would this address your comment?

It would definitely have helped me initially overlook that call chain.

But then when I started looking at the whole dma_map_page() thing, it
just raised my hackles again.

I would seriously suggest having a much simpler version for the "no
allocation, no dma mapping" case, so that it's *obvious* that that
never happens.

So instead of having virtio_balloon_send_free_pages() call a really
generic complex chain of functions that in _some_ cases can do memory
allocation, why isn't there a short-circuited "vitruque_add_datum()"
that is guaranteed to never do anything like that?

Honestly, I look at "add_one_sg()" and it really doesn't make me
happy. It looks hacky as hell. If I read the code right, you're really
trying to just queue up a simple tuple of <pfn,len>, except you encode
it as a page pointer in order to play games with the SG logic, and
then you hmap that to the ring, except in this case it's all a fake
ring that just adds the cpu-physical address instead.

And to figuer that out, it's like five layers of indirection through
different helper functions that *can* do more generic things but in
this case don't.

And you do all of this from a core VM callback function with some
_really_ core VM locks held.

That makes no sense to me.

How about this:

 - get rid of all that code

 - make the core VM callback save the "these are the free memory
regions" in a fixed and limited array. One that DOES JUST THAT. No
crazy "SG IO dma-mapping function crap". Just a plain array of a fixed
size, pre-allocated for that virtio instance.

 - make it obvious that what you do in that sequence is ten
instructions and no allocations ("Look ma, I wrote a value to an array
and incremented the array idex, and I'M DONE")

 - then in that workqueue entry that you start *anyway*, you empty the
array and do all the crazy virtio stuff.

In fact, while at it, just simplify the VM interface too. Instead of
traversing a random number of buddy lists, just trraverse *one* - the
top-level one. Are you seriously ever going to shrink or mark
read-only anythin *but* something big enough to be in the maximum
order?

MAX_ORDER is what, 11? So we're talking 8MB blocks. Do you *really*
want the balloon code to work on smaller things, particularly since
the whole interface is fundamentally racy and opportunistic to begin
with?

The whole sequence of events really looks "this is too much
complexity, and way too fragile" to me at so many levels.

                 Linus

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-06-12  1:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: nilal, KVM list, Network Development, Linux Kernel Mailing List,
	Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <CA+55aFzrPgnd7hRPrkeV+jX-MSwOZf7T4wKxz66Lk4oub3PZsw@mail.gmail.com>

On Mon, Jun 11, 2018 at 11:32:41AM -0700, Linus Torvalds wrote:
> On Mon, Jun 11, 2018 at 9:24 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> >       virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
> 
> Is this really a good idea?

Well knowing which pages are unused does seem to be useful.  Do you
refer to this generally or to the idea of scanning the free list,
or to the specific implementation issues you listed below?


> Plus it seems entirely broken.
> 
> The report_pfn_range() callback is done under the zone lock, but
> virtio_balloon_send_free_pages() (which is the only callback used that
> I can find) does add_one_sg(), which does virtqueue_add_inbuf(vq, &sg,
> 1, vq, GFP_KERNEL);
> 
> So now we apparently do a GFP_KERNEL allocation insider the mm zone
> lock, which is broken on just _so_ many levels.
> 
> Pulled and then unpulled again.
> 
> Either somebody needs to explain why I'm wrong and you can re-submit
> this, or this kind of garbage needs to go away.
> 
> I do *not* want to be in the situation where I pull stuff from the
> virtio people that adds completely broken core VM functionality.
> 
> Because if I'm in that situation, I will stop pulling from you guys.
> Seriously. You have *no* place sending me broken shit that is outside
> the virtio layer.
> 
> Subsystems that break code MM will get shunned. You just aren't
> important enough to allow you breaking code VM.
> 
>                 Linus

So no, it doesn't do any allocations - GFP_KERNEL or otherwise, but I
agree how it might be confusing.

So I'll drop this for now, but it would be nice if there is a bit more
direction on this feature since I know several people are trying to work
on this.


-- 
MST

^ permalink raw reply

* Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Jason Wang @ 2018-06-12  1:54 UTC (permalink / raw)
  To: Michael S. Tsirkin, Sridhar Samudrala
  Cc: alexander.h.duyck, virtio-dev, qemu-devel, jiri, kubakici, netdev,
	jesse.brandeburg, virtualization, loseweigh, aaron.f.brown
In-Reply-To: <20180611202207-mutt-send-email-mst@kernel.org>



On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>> This feature bit can be used by hypervisor to indicate virtio_net device to
>> act as a standby for another device with the same MAC address.
>>
>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>> by default as i am using libvirt to start the VMs.
>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>> XML file?
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> So I do not think we can commit to this interface: we
> really need to control visibility of the primary device.

The problem is legacy guest won't use primary device at all if we do this.

How about control the visibility of standby device?

Thanks

> However just for testing purposes, we could add a non-stable
> interface "x-standby" with the understanding that as any
> x- prefix it's unstable and will be changed down the road,
> likely in the next release.
>
>
>> ---
>>   hw/net/virtio-net.c                         | 2 ++
>>   include/standard-headers/linux/virtio_net.h | 3 +++
>>   2 files changed, 5 insertions(+)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 90502fca7c..38b3140670 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>                        true),
>>       DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>       DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>> +                      false),
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>>   
>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>> index e9f255ea3f..01ec09684c 100644
>> --- a/include/standard-headers/linux/virtio_net.h
>> +++ b/include/standard-headers/linux/virtio_net.h
>> @@ -57,6 +57,9 @@
>>   					 * Steering */
>>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>   
>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
>> +                                         * with the same MAC.
>> +                                         */
>>   #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>>   
>>   #ifndef VIRTIO_NET_NO_LEGACY
>> -- 
>> 2.14.3

^ permalink raw reply

* Re: [PATCH bpf] tools/bpftool: fix a bug in bpftool perf
From: Yonghong Song @ 2018-06-12  1:51 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180611184902.23c13235@cakuba.netronome.com>



On 6/11/18 6:49 PM, Jakub Kicinski wrote:
> On Mon, 11 Jun 2018 18:15:16 -0700, Yonghong Song wrote:
>> Commit b04df400c302 ("tools/bpftool: add perf subcommand")
>> introduced bpftool subcommand perf to query bpf program
>> kuprobe and tracepoint attachments.
>>
>> The perf subcommand will first test whether bpf subcommand
>> BPF_TASK_FD_QUERY is supported in kernel or not. It does it
>> by opening a file with argv[0] and feeds the file descriptor
>> and current task pid to the kernel for querying.
>>
>> Such an approach won't work if the argv[0] cannot be opened
>> successfully in the current directory. This is especially
>> true when bpftool is accessible through PATH env variable.
>> The error below reflects the open failure for file argv[0]
>> at home directory.
>>
>>    [yhs@localhost ~]$ which bpftool
>>    /usr/local/sbin/bpftool
>>    [yhs@localhost ~]$ bpftool perf
>>    Error: perf_query_support: No such file or directory
>>
>> To fix the issue, let us open root directory ("/")
>> which exists in every linux system. With the fix, the
>> error message will correctly reflect the permission issue.
>>
>>    [yhs@localhost ~]$ which bpftool
>>    /usr/local/sbin/bpftool
>>    [yhs@localhost ~]$ bpftool perf
>>    Error: perf_query_support: Operation not permitted
>>    HINT: non root or kernel doesn't support TASK_FD_QUERY
>>
>> Fixes: b04df400c302 ("tools/bpftool: add perf subcommand")
>> Reported-by: Alexei Starovoitov <ast@kernel.org>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>>   tools/bpf/bpftool/perf.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/bpf/bpftool/perf.c b/tools/bpf/bpftool/perf.c
>> index ac6b1a12c9b7..f1e4c9b270e2 100644
>> --- a/tools/bpf/bpftool/perf.c
>> +++ b/tools/bpf/bpftool/perf.c
>> @@ -29,9 +29,10 @@ static bool has_perf_query_support(void)
>>   	if (perf_query_supported)
>>   		goto out;
>>   
>> -	fd = open(bin_name, O_RDONLY);
>> -	if (fd < 0) {
>> -		p_err("perf_query_support: %s", strerror(errno));
>> +	fd = open("/", O_RDONLY);
>> +	if (fd > 0) {
>> +		p_err("perf_query_support: cannot open directory \"/\" (%s)\n",
> 
> nit: no \n at the end of p_err() format, because it breaks JSON :(

Thanks for letting me know. Will send v2 to fix this!

> 
>> +		      strerror(errno));
>>   		goto out;
>>   	}
>>   
> 

^ permalink raw reply

* Re: [PATCH bpf] tools/bpftool: fix a bug in bpftool perf
From: Jakub Kicinski @ 2018-06-12  1:49 UTC (permalink / raw)
  To: Yonghong Song; +Cc: ast, daniel, netdev, kernel-team
In-Reply-To: <20180612011516.4171096-1-yhs@fb.com>

On Mon, 11 Jun 2018 18:15:16 -0700, Yonghong Song wrote:
> Commit b04df400c302 ("tools/bpftool: add perf subcommand")
> introduced bpftool subcommand perf to query bpf program
> kuprobe and tracepoint attachments.
> 
> The perf subcommand will first test whether bpf subcommand
> BPF_TASK_FD_QUERY is supported in kernel or not. It does it
> by opening a file with argv[0] and feeds the file descriptor
> and current task pid to the kernel for querying.
> 
> Such an approach won't work if the argv[0] cannot be opened
> successfully in the current directory. This is especially
> true when bpftool is accessible through PATH env variable.
> The error below reflects the open failure for file argv[0]
> at home directory.
> 
>   [yhs@localhost ~]$ which bpftool
>   /usr/local/sbin/bpftool
>   [yhs@localhost ~]$ bpftool perf
>   Error: perf_query_support: No such file or directory
> 
> To fix the issue, let us open root directory ("/")
> which exists in every linux system. With the fix, the
> error message will correctly reflect the permission issue.
> 
>   [yhs@localhost ~]$ which bpftool
>   /usr/local/sbin/bpftool
>   [yhs@localhost ~]$ bpftool perf
>   Error: perf_query_support: Operation not permitted
>   HINT: non root or kernel doesn't support TASK_FD_QUERY
> 
> Fixes: b04df400c302 ("tools/bpftool: add perf subcommand")
> Reported-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  tools/bpf/bpftool/perf.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/bpf/bpftool/perf.c b/tools/bpf/bpftool/perf.c
> index ac6b1a12c9b7..f1e4c9b270e2 100644
> --- a/tools/bpf/bpftool/perf.c
> +++ b/tools/bpf/bpftool/perf.c
> @@ -29,9 +29,10 @@ static bool has_perf_query_support(void)
>  	if (perf_query_supported)
>  		goto out;
>  
> -	fd = open(bin_name, O_RDONLY);
> -	if (fd < 0) {
> -		p_err("perf_query_support: %s", strerror(errno));
> +	fd = open("/", O_RDONLY);
> +	if (fd > 0) {
> +		p_err("perf_query_support: cannot open directory \"/\" (%s)\n",

nit: no \n at the end of p_err() format, because it breaks JSON :(

> +		      strerror(errno));
>  		goto out;
>  	}
>  

^ permalink raw reply

* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-06-12  1:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KVM list, Network Development, Linux Kernel Mailing List,
	Bjorn Andersson, Andrew Morton, virtualization
In-Reply-To: <CA+55aFz4uBHaWnBarVUEG90s2ucyVtoLmwYpYVwDV+XQESNRqw@mail.gmail.com>

On Mon, Jun 11, 2018 at 11:44:05AM -0700, Linus Torvalds wrote:
> On Mon, Jun 11, 2018 at 11:32 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So now we apparently do a GFP_KERNEL allocation insider the mm zone
> > lock, which is broken on just _so_ many levels.
> 
> Oh, I see the comment about how it doesn't actually do an allocation
> at all because it's a single-entry.
> 
> Still too damn ugly to live, and much too fragile. No way in hell do
> we even _hint_ at a GFP_KERNEL when we're inside a context that can't
> do any memory allocation at all.
> 
> Plus I'm not convinced it's a "no allocation" path even despite that
> comment, because it also does a "dma_map_page()" etc, which can cause
> allocations to do the dma mapping thing afaik. No?

Well no because DMA is triggered by the IOMMU flag and
that is always off for the balloon. But I hear what you
are saying about it being fragile.

> Maybe there's some reason why that doesn't happen either, but
> basically this whole callchain looks *way* to complicated to be used
> under a core VM spinlock.
> 
>                 Linus

Maybe it will help to have GFP_NONE which will make any allocation
fail if attempted. Linus, would this address your comment?
-- 
MST

^ permalink raw reply

* Re: [PATCH 1/2] Convert target drivers to use sbitmap
From: Jens Axboe @ 2018-06-12  1:18 UTC (permalink / raw)
  To: Matthew Wilcox, linux-kernel, linux-scsi, target-devel,
	linux1394-devel, linux-usb, kvm, virtualization, netdev,
	Juergen Gross, qla2xxx-upstream, Kent Overstreet
  Cc: Matthew Wilcox
In-Reply-To: <3a56027b-47bc-dcb8-a465-3670031572f1@kernel.dk>

On 5/15/18 10:11 AM, Jens Axboe wrote:
> On 5/15/18 10:00 AM, Matthew Wilcox wrote:
>> From: Matthew Wilcox <mawilcox@microsoft.com>
>>
>> The sbitmap and the percpu_ida perform essentially the same task,
>> allocating tags for commands.  Since the sbitmap is more used than
>> the percpu_ida, convert the percpu_ida users to the sbitmap API.
> 
> It should also be the same performance as percpu_ida in light use, and
> performs much better at > 50% utilization of the tag space. I think
> that's better justification than "more used than".
> 
>> diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
>> index 4435bf374d2d..28bcffae609f 100644
>> --- a/drivers/target/iscsi/iscsi_target_util.c
>> +++ b/drivers/target/iscsi/iscsi_target_util.c
>> @@ -17,7 +17,7 @@
>>   ******************************************************************************/
>>  
>>  #include <linux/list.h>
>> -#include <linux/percpu_ida.h>
>> +#include <linux/sched/signal.h>
>>  #include <net/ipv6.h>         /* ipv6_addr_equal() */
>>  #include <scsi/scsi_tcq.h>
>>  #include <scsi/iscsi_proto.h>
>> @@ -147,6 +147,28 @@ void iscsit_free_r2ts_from_list(struct iscsi_cmd *cmd)
>>  	spin_unlock_bh(&cmd->r2t_lock);
>>  }
>>  
>> +int iscsit_wait_for_tag(struct se_session *se_sess, int state, int *cpup)
>> +{
>> +	int tag = -1;
>> +	DEFINE_WAIT(wait);
>> +	struct sbq_wait_state *ws;
>> +
>> +	if (state == TASK_RUNNING)
>> +		return tag;
>> +
>> +	ws = &se_sess->sess_tag_pool.ws[0];
>> +	for (;;) {
>> +		prepare_to_wait_exclusive(&ws->wait, &wait, state);
>> +		if (signal_pending_state(state, current))
>> +			break;
>> +		schedule();
>> +		tag = sbitmap_queue_get(&se_sess->sess_tag_pool, cpup);
>> +	}
>> +
>> +	finish_wait(&ws->wait, &wait);
>> +	return tag;
>> +}
> 
> Seems like that should be:
> 
> 
> 	ws = &se_sess->sess_tag_pool.ws[0];
> 	for (;;) {
> 		prepare_to_wait_exclusive(&ws->wait, &wait, state);
> 		if (signal_pending_state(state, current))
> 			break;
> 		tag = sbitmap_queue_get(&se_sess->sess_tag_pool, cpup);
> 		if (tag != -1)
> 			break;
> 		schedule();
> 	}
> 
> 	finish_wait(&ws->wait, &wait);
> 	return tag;
> 
>>  /*
>>   * May be called from software interrupt (timer) context for allocating
>>   * iSCSI NopINs.
>> @@ -155,9 +177,11 @@ struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *conn, int state)
>>  {
>>  	struct iscsi_cmd *cmd;
>>  	struct se_session *se_sess = conn->sess->se_sess;
>> -	int size, tag;
>> +	int size, tag, cpu;
>>  
>> -	tag = percpu_ida_alloc(&se_sess->sess_tag_pool, state);
>> +	tag = sbitmap_queue_get(&se_sess->sess_tag_pool, &cpu);
>> +	if (tag < 0)
>> +		tag = iscsit_wait_for_tag(se_sess, state, &cpu);
>>  	if (tag < 0)
>>  		return NULL;
> 
> Might make sense to just roll the whole thing into iscsi_get_tag(), that
> would be cleaner.
> 
> sbitmap should provide a helper for that, but we can do that cleanup
> later. That would encapsulate things like the per-cpu caching hint too,
> for instance.
> 
> Rest looks fine to me.

Are you going to push this further? I really think we should.

-- 
Jens Axboe

^ permalink raw reply

* [PATCH bpf] tools/bpftool: fix a bug in bpftool perf
From: Yonghong Song @ 2018-06-12  1:15 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team

Commit b04df400c302 ("tools/bpftool: add perf subcommand")
introduced bpftool subcommand perf to query bpf program
kuprobe and tracepoint attachments.

The perf subcommand will first test whether bpf subcommand
BPF_TASK_FD_QUERY is supported in kernel or not. It does it
by opening a file with argv[0] and feeds the file descriptor
and current task pid to the kernel for querying.

Such an approach won't work if the argv[0] cannot be opened
successfully in the current directory. This is especially
true when bpftool is accessible through PATH env variable.
The error below reflects the open failure for file argv[0]
at home directory.

  [yhs@localhost ~]$ which bpftool
  /usr/local/sbin/bpftool
  [yhs@localhost ~]$ bpftool perf
  Error: perf_query_support: No such file or directory

To fix the issue, let us open root directory ("/")
which exists in every linux system. With the fix, the
error message will correctly reflect the permission issue.

  [yhs@localhost ~]$ which bpftool
  /usr/local/sbin/bpftool
  [yhs@localhost ~]$ bpftool perf
  Error: perf_query_support: Operation not permitted
  HINT: non root or kernel doesn't support TASK_FD_QUERY

Fixes: b04df400c302 ("tools/bpftool: add perf subcommand")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/bpf/bpftool/perf.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/perf.c b/tools/bpf/bpftool/perf.c
index ac6b1a12c9b7..f1e4c9b270e2 100644
--- a/tools/bpf/bpftool/perf.c
+++ b/tools/bpf/bpftool/perf.c
@@ -29,9 +29,10 @@ static bool has_perf_query_support(void)
 	if (perf_query_supported)
 		goto out;
 
-	fd = open(bin_name, O_RDONLY);
-	if (fd < 0) {
-		p_err("perf_query_support: %s", strerror(errno));
+	fd = open("/", O_RDONLY);
+	if (fd > 0) {
+		p_err("perf_query_support: cannot open directory \"/\" (%s)\n",
+		      strerror(errno));
 		goto out;
 	}
 
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC] tcp: Do not reload skb pointer after skb_gro_receive().
From: David Miller @ 2018-06-12  1:00 UTC (permalink / raw)
  To: netdev; +Cc: edumazet


This is not necessary.  skb_gro_receive() will never change what
'head' points to.

In it's original implementation (see commit 71d93b39e52e ("net: Add
skb_gro_receive")), it did:

====================
+	*head = nskb;
+	nskb->next = p->next;
+	p->next = NULL;
====================

This sequence was removed in commit 58025e46ea2d ("net: gro: remove
obsolete code from skb_gro_receive()")

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 4d58e2ce0b5b..8cc7c3487330 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -268,8 +268,6 @@ struct sk_buff **tcp_gro_receive(struct sk_buff **head, struct sk_buff *skb)
 		goto out_check_final;
 	}
 
-	p = *head;
-	th2 = tcp_hdr(p);
 	tcp_flag_word(th2) |= flags & (TCP_FLAG_FIN | TCP_FLAG_PSH);
 
 out_check_final:

^ permalink raw reply related

* Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 5/7] net: Add generic ndo_select_queue functions
From: Alexander Duyck @ 2018-06-12  0:52 UTC (permalink / raw)
  To: kbuild test robot, Jeff Kirsher
  Cc: Alexander Duyck, Netdev, intel-wired-lan, kbuild-all
In-Reply-To: <201806120507.3D5zIr3A%fengguang.wu@intel.com>

Jeff,

Looks like I have some bugs on non-x86 architecture. I need to address
these in a v2 of this set so please hold off on applying until I can
get that submitted either tonight or tomorrow morning.

Thanks.

- Alex

On Mon, Jun 11, 2018 at 4:10 PM, kbuild test robot <lkp@intel.com> wrote:
> Hi Alexander,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on net-next/master]
> [also build test ERROR on next-20180608]
> [cannot apply to v4.17]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Alexander-Duyck/Add-support-for-L2-Fwd-Offload-w-o-ndo_select_queue/20180612-015220
> config: arm-allmodconfig (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         GCC_VERSION=7.2.0 make.cross ARCH=arm
>
> All errors (new ones prefixed by >>):
>
>>> drivers/net//ethernet/ti/netcp_core.c:1968:22: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
>      .ndo_select_queue = dev_pick_tx_zero,
>                          ^~~~~~~~~~~~~~~~
>    drivers/net//ethernet/ti/netcp_core.c:1968:22: note: (near initialization for 'netcp_netdev_ops.ndo_select_queue')
>    cc1: some warnings being treated as errors
>
> vim +1968 drivers/net//ethernet/ti/netcp_core.c
>
>   1955
>   1956  static const struct net_device_ops netcp_netdev_ops = {
>   1957          .ndo_open               = netcp_ndo_open,
>   1958          .ndo_stop               = netcp_ndo_stop,
>   1959          .ndo_start_xmit         = netcp_ndo_start_xmit,
>   1960          .ndo_set_rx_mode        = netcp_set_rx_mode,
>   1961          .ndo_do_ioctl           = netcp_ndo_ioctl,
>   1962          .ndo_get_stats64        = netcp_get_stats,
>   1963          .ndo_set_mac_address    = eth_mac_addr,
>   1964          .ndo_validate_addr      = eth_validate_addr,
>   1965          .ndo_vlan_rx_add_vid    = netcp_rx_add_vid,
>   1966          .ndo_vlan_rx_kill_vid   = netcp_rx_kill_vid,
>   1967          .ndo_tx_timeout         = netcp_ndo_tx_timeout,
>> 1968          .ndo_select_queue       = dev_pick_tx_zero,
>   1969          .ndo_setup_tc           = netcp_setup_tc,
>   1970  };
>   1971
>
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* Re: pull-request: bpf 2018-06-12
From: David Miller @ 2018-06-12  0:39 UTC (permalink / raw)
  To: daniel; +Cc: ast, netdev
In-Reply-To: <20180611235956.6040-1-daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Tue, 12 Jun 2018 01:59:56 +0200

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) Avoid an allocation warning in AF_XDP by adding __GFP_NOWARN for the
>    umem setup, from Björn.
> 
> 2) Silence a warning in bpf fs when an application tries to open(2) a
>    pinned bpf obj due to missing fops. Add a dummy open fop that continues
>    to just bail out in such case, from Daniel.
> 
> 3) Fix a BPF selftest urandom_read build issue where gcc complains that
>    it gets built twice, from Anders.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.

^ permalink raw reply

* Re: [net 0/5][pull request] Intel Wired LAN Driver Updates 2018-06-11
From: David Miller @ 2018-06-12  0:30 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20180611161630.21338-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 11 Jun 2018 09:16:25 -0700

> This series contains fixes to ixgbe IPsec and MACVLAN.
> 
> Alex provides the 5 fixes in this series, starting with fixing an issue
> where num_rx_pools was not being populated until after the queues and
> interrupts were reinitialized when enabling MACVLAN interfaces.  Updated
> to use CONFIG_XFRM_OFFLOAD instead of CONFIG_XFRM, since the code
> requires CONFIG_XFRM_OFFLOAD to be enabled.  Moved the IPsec
> initialization function to be more consistent with the placement of
> similar initialization functions and before the call to reset the
> hardware, which will clean up any link issues that may have been
> introduced.  Fixed the boolean logic that was testing for transmit OR
> receive ready bits, when it should have been testing for transmit AND
> receive ready bits.  Fixed the bit definitions for SECTXSTAT and SECRXSTAT
> registers and ensure that if IPsec is disabled on the part, do not
> enable it.
> 
> The following are changes since commit f0dc7f9c6dd99891611fca5849cbc4c6965b690e:
>   Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 10GbE

Pulled, thanks Jeff.

^ permalink raw reply

* Re: [PATCH net] failover: eliminate callback hell
From: Samudrala, Sridhar @ 2018-06-12  0:08 UTC (permalink / raw)
  To: Michael S. Tsirkin, Stephen Hemminger
  Cc: kys, haiyangz, davem, netdev, Stephen Hemminger
In-Reply-To: <04210bba-f67a-dd10-cdce-c7d140dcecbe@intel.com>

On 6/11/2018 12:34 PM, Samudrala, Sridhar wrote:
>
> On 6/11/2018 11:10 AM, Michael S. Tsirkin wrote:
>> On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
>>>    * Set permanent and current address of net_failover device
>>>      to match the primary.

We copy the dev_addr of standby dev to failover_dev in net_failover_create()
before calling register_netdev().
register_netdev() does a copy of dev_addr to perm_addr.

So i don't think this is an issue.

>>>
>>>    * Carrier should be marked off before registering device
>>>      the net_failover device.

Will fix this and also a couple of places dev_err() needs to be replaced with netdev_err()

>> Sridhar, do we want to address this?
>> If yes, could you please take a look at addressing these
>> meanwhile, while we keep arguing about making API changes?
>
> Sure. I will submit patches to address these issues raised by Stephen.
>
>

^ permalink raw reply

* pull-request: bpf 2018-06-12
From: Daniel Borkmann @ 2018-06-11 23:59 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Avoid an allocation warning in AF_XDP by adding __GFP_NOWARN for the
   umem setup, from Björn.

2) Silence a warning in bpf fs when an application tries to open(2) a
   pinned bpf obj due to missing fops. Add a dummy open fop that continues
   to just bail out in such case, from Daniel.

3) Fix a BPF selftest urandom_read build issue where gcc complains that
   it gets built twice, from Anders.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit 66e58e0ef80a56a1d7857b6ce121141563cdd93e:

  bpfilter: fix race in pipe access (2018-06-07 20:07:28 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to a343993c518ce252b62ec00ac06bccfb1d17129d:

  xsk: silence warning on memory allocation failure (2018-06-11 23:49:07 +0200)

----------------------------------------------------------------
Anders Roxell (1):
      selftests: bpf: fix urandom_read build issue

Björn Töpel (1):
      xsk: silence warning on memory allocation failure

Daniel Borkmann (1):
      bpf: implement dummy fops for bpf objects

 kernel/bpf/inode.c                   | 14 ++++++++++++--
 net/xdp/xdp_umem.c                   |  3 ++-
 tools/testing/selftests/bpf/Makefile |  4 +---
 3 files changed, 15 insertions(+), 6 deletions(-)

^ permalink raw reply

* [PATCH v2 net-next] vlan: implement vlan id and protocol changes
From: Chas Williams @ 2018-06-11 23:44 UTC (permalink / raw)
  To: davem; +Cc: netdev, Chas Williams
In-Reply-To: <20180610231912.1543-1-3chas3@gmail.com>

vlan_changelink silently ignores attempts to change the vlan id
or protocol id of an existing vlan interface.  Implement by adding
the new vlan id and protocol to the interface's vlan group and then
removing the old vlan id and protocol from the vlan group.

Signed-off-by: Chas Williams <3chas3@gmail.com>
---
 include/linux/netdevice.h |  1 +
 net/8021q/vlan.c          |  4 ++--
 net/8021q/vlan.h          |  2 ++
 net/8021q/vlan_netlink.c  | 38 ++++++++++++++++++++++++++++++++++++++
 net/core/dev.c            |  1 +
 5 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850c7936..a95ae238addf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2409,6 +2409,7 @@ enum netdev_cmd {
 	NETDEV_CVLAN_FILTER_DROP_INFO,
 	NETDEV_SVLAN_FILTER_PUSH_INFO,
 	NETDEV_SVLAN_FILTER_DROP_INFO,
+	NETDEV_CHANGEVLAN,
 };
 const char *netdev_cmd_to_name(enum netdev_cmd cmd);
 
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 73a65789271b..b5e0ad1a581a 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -51,8 +51,8 @@ const char vlan_version[] = DRV_VERSION;
 
 /* End of global variables definitions. */
 
-static int vlan_group_prealloc_vid(struct vlan_group *vg,
-				   __be16 vlan_proto, u16 vlan_id)
+int vlan_group_prealloc_vid(struct vlan_group *vg,
+			    __be16 vlan_proto, u16 vlan_id)
 {
 	struct net_device **array;
 	unsigned int pidx, vidx;
diff --git a/net/8021q/vlan.h b/net/8021q/vlan.h
index 44df1c3df02d..c734dd21d70d 100644
--- a/net/8021q/vlan.h
+++ b/net/8021q/vlan.h
@@ -116,6 +116,8 @@ int register_vlan_dev(struct net_device *dev, struct netlink_ext_ack *extack);
 void unregister_vlan_dev(struct net_device *dev, struct list_head *head);
 bool vlan_dev_inherit_address(struct net_device *dev,
 			      struct net_device *real_dev);
+int vlan_group_prealloc_vid(struct vlan_group *vg,
+			    __be16 vlan_proto, u16 vlan_id);
 
 static inline u32 vlan_get_ingress_priority(struct net_device *dev,
 					    u16 vlan_tci)
diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index 9b60c1e399e2..0e59babe6651 100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -107,10 +107,48 @@ static int vlan_changelink(struct net_device *dev, struct nlattr *tb[],
 			   struct nlattr *data[],
 			   struct netlink_ext_ack *extack)
 {
+	struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
 	struct ifla_vlan_flags *flags;
 	struct ifla_vlan_qos_mapping *m;
 	struct nlattr *attr;
 	int rem;
+	int err;
+	__be16 vlan_proto = vlan->vlan_proto;
+	u16 vlan_id = vlan->vlan_id;
+
+	if (data[IFLA_VLAN_ID])
+		vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
+
+	if (data[IFLA_VLAN_PROTOCOL])
+		vlan_proto = nla_get_be16(data[IFLA_VLAN_PROTOCOL]);
+
+	if (vlan->vlan_id != vlan_id || vlan->vlan_proto != vlan_proto) {
+		struct net_device *real_dev = vlan->real_dev;
+		struct vlan_info *vlan_info;
+		struct vlan_group *grp;
+		__be16 old_vlan_proto = vlan->vlan_proto;
+		u16 old_vlan_id = vlan->vlan_id;
+
+		err = vlan_vid_add(real_dev, vlan_proto, vlan_id);
+		if (err)
+			return err;
+		vlan_info = rtnl_dereference(real_dev->vlan_info);
+		grp = &vlan_info->grp;
+		err = vlan_group_prealloc_vid(grp, vlan_proto, vlan_id);
+		if (err < 0) {
+			vlan_vid_del(real_dev, vlan_proto, vlan_id);
+			return err;
+		}
+		vlan_group_set_device(grp, vlan_proto, vlan_id, dev);
+		vlan->vlan_proto = vlan_proto;
+		vlan->vlan_id = vlan_id;
+
+		vlan_group_set_device(grp, old_vlan_proto, old_vlan_id, NULL);
+		vlan_vid_del(real_dev, old_vlan_proto, old_vlan_id);
+
+		err = call_netdevice_notifiers(NETDEV_CHANGEVLAN, dev);
+		notifier_to_errno(err);
+	}
 
 	if (data[IFLA_VLAN_FLAGS]) {
 		flags = nla_data(data[IFLA_VLAN_FLAGS]);
diff --git a/net/core/dev.c b/net/core/dev.c
index 6e18242a1cae..849fdb60fd21 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1587,6 +1587,7 @@ const char *netdev_cmd_to_name(enum netdev_cmd cmd)
 	N(UDP_TUNNEL_DROP_INFO) N(CHANGE_TX_QUEUE_LEN)
 	N(CVLAN_FILTER_PUSH_INFO) N(CVLAN_FILTER_DROP_INFO)
 	N(SVLAN_FILTER_PUSH_INFO) N(SVLAN_FILTER_DROP_INFO)
+	N(CHANGEVLAN)
 	}
 #undef N
 	return "UNKNOWN_NETDEV_EVENT";
-- 
2.14.3

^ permalink raw reply related

* Re: [PATCH] tcp: verify the checksum of the first data segment in a new connection
From: van der Linden, Frank @ 2018-06-11 23:43 UTC (permalink / raw)
  To: Eric Dumazet, edumazet@google.com, netdev@vger.kernel.org
In-Reply-To: <20599722-9a2f-38c8-e4b8-d2eaf40197ab@gmail.com>

Yeah, true, it's missing INERRS in this case. I'll fix it up a bit.

Frank

On 6/11/18, 4:38 PM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:

    
    
    On 06/11/2018 04:25 PM, van der Linden, Frank wrote:
    > A few comments on this one:
    > 
    > - obviously this is fairly serious, as it can let corrupted data all the way up to the application
    
    Sure, although anyone relying on CRC checksum for ensuring TCP data integrity
    has big troubles ;)
    
    I would rather have a refined version of this patch doing a "goto csum_error" 
    so that we properly increment TCP_MIB_CSUMERRORS and TCP_MIB_INERRS 
    
    Thanks !
    
    


^ permalink raw reply

* Re: [PATCH net] net/ipv6: Ensure cfg is properly initialized in ipv6_create_tempaddr
From: David Miller @ 2018-06-11 23:39 UTC (permalink / raw)
  To: dsahern; +Cc: netdev, dsahern
In-Reply-To: <20180611141212.26371-1-dsahern@kernel.org>

From: dsahern@kernel.org
Date: Mon, 11 Jun 2018 07:12:12 -0700

> From: David Ahern <dsahern@gmail.com>
> 
> Valdis reported a BUG in ipv6_add_addr:
 ...
> Looking at the code I found 1 element (peer_pfx) of the newly introduced
> ifa6_config struct that is not initialized. Use a memset rather than hard
> coding an init for each struct element.
> 
> Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
> Fixes: e6464b8c63619 ("net/ipv6: Convert ipv6_add_addr to struct ifa6_config")
> Signed-off-by: David Ahern <dsahern@gmail.com>

Applied, thanks David.

^ permalink raw reply

* Re: [PATCH] tcp: verify the checksum of the first data segment in a new connection
From: Eric Dumazet @ 2018-06-11 23:37 UTC (permalink / raw)
  To: van der Linden, Frank, edumazet@google.com,
	netdev@vger.kernel.org
In-Reply-To: <631FD61F-40EF-4B6A-BD30-6C208DD450B7@amazon.com>



On 06/11/2018 04:25 PM, van der Linden, Frank wrote:
> A few comments on this one:
> 
> - obviously this is fairly serious, as it can let corrupted data all the way up to the application

Sure, although anyone relying on CRC checksum for ensuring TCP data integrity
has big troubles ;)

I would rather have a refined version of this patch doing a "goto csum_error" 
so that we properly increment TCP_MIB_CSUMERRORS and TCP_MIB_INERRS 

Thanks !

^ permalink raw reply

* Re: [PATCH 00/15] Netfilter/IPVS fixes for net
From: David Miller @ 2018-06-11 23:31 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <20180611092233.3219-1-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Mon, 11 Jun 2018 11:22:18 +0200

> The following patchset contains Netfilter/IPVS fixes for your net tree:
 ...
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks Pablo.

^ permalink raw reply

* Re: [PATCH net] tls: fix NULL pointer dereference on poll
From: David Miller @ 2018-06-11 23:30 UTC (permalink / raw)
  To: daniel; +Cc: davejwatson, netdev, ast, hch
In-Reply-To: <20180611212204.24002-1-daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Mon, 11 Jun 2018 23:22:04 +0200

> While hacking on kTLS, I ran into the following panic from an
> unprivileged netserver / netperf TCP session:
 ...
> Debugging further, it turns out that calling into ctx->sk_poll() is
> invalid since sk_poll itself is NULL which was saved from the original
> TCP socket in order for tls_sw_poll() to invoke it.
> 
> Looks like the recent conversion from poll to poll_mask callback started
> in 152524231023 ("net: add support for ->poll_mask in proto_ops") missed
> to eventually convert kTLS, too: TCP's ->poll was converted over to the
> ->poll_mask in commit 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
> and therefore kTLS wrongly saved the ->poll old one which is now NULL.
> 
> Convert kTLS over to use ->poll_mask instead. Also instead of POLLIN |
> POLLRDNORM use the proper EPOLLIN | EPOLLRDNORM bits as the case in
> tcp_poll_mask() as well that is mangled here.
> 
> Fixes: 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Applied, thanks Daniel.

^ permalink raw reply

* Re: [bpf PATCH] bpf: selftest fix for sockmap
From: Y Song @ 2018-06-11 23:28 UTC (permalink / raw)
  To: John Fastabend; +Cc: Alexei Starovoitov, Daniel Borkmann, netdev
In-Reply-To: <20180611184735.31255.51105.stgit@john-Precision-Tower-5810>

On Mon, Jun 11, 2018 at 11:47 AM, John Fastabend
<john.fastabend@gmail.com> wrote:
> In selftest test_maps the sockmap test case attempts to add a socket
> in listening state to the sockmap. This is no longer a valid operation
> so it fails as expected. However, the test wrongly reports this as an
> error now. Fix the test to avoid adding sockets in listening state.
>
> Fixes: 945ae430aa44 ("bpf: sockmap only allow ESTABLISHED sock state")

I cannot reproduce the failure and I cannot find the above "Fixes" commit.
I checked latest bpf/bpf-next/net-next trees. Maybe I missed something?

> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  tools/testing/selftests/bpf/test_maps.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
> index 6c25334..9fed5f0 100644
> --- a/tools/testing/selftests/bpf/test_maps.c
> +++ b/tools/testing/selftests/bpf/test_maps.c
> @@ -564,7 +564,7 @@ static void test_sockmap(int tasks, void *data)
>         }
>
>         /* Test update without programs */
> -       for (i = 0; i < 6; i++) {
> +       for (i = 2; i < 6; i++) {
>                 err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
>                 if (err) {
>                         printf("Failed noprog update sockmap '%i:%i'\n",
> @@ -727,7 +727,7 @@ static void test_sockmap(int tasks, void *data)
>         }
>
>         /* Test map update elem afterwards fd lives in fd and map_fd */
> -       for (i = 0; i < 6; i++) {
> +       for (i = 2; i < 6; i++) {
>                 err = bpf_map_update_elem(map_fd_rx, &i, &sfd[i], BPF_ANY);
>                 if (err) {
>                         printf("Failed map_fd_rx update sockmap %i '%i:%i'\n",
>

^ permalink raw reply

* Re: [PATCH] tcp: verify the checksum of the first data segment in a new connection
From: van der Linden, Frank @ 2018-06-11 23:25 UTC (permalink / raw)
  To: edumazet@google.com, netdev@vger.kernel.org
In-Reply-To: <5b1f0292.IdMQh83ac/EN53Sl%fllinden@amazon.com>

A few comments on this one:

- obviously this is fairly serious, as it can let corrupted data all the way up to the application
- I am not nuts about the patch itself, the code feels a bit cluttered, but it's the least invasive way
  I could think of. Probably some refactoring is needed at some point.
- here's how to easily reproduce it:

On the sender, set up artificial packet corruption:

#!/bin/sh
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 netem corrupt 50.0%
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst $DSTADDR flowid 10:3


Then, run the following on the sender:

while :; do dd if=/dev/zero of=/dev/stdout bs=4096 count=4 | nc $DSTADDR 10000; sleep 1; done

..and this on the receiver:

uname -r; grep ^Tcp /proc/net/snmp; ttl=$((${SECONDS} + 300)); while [[ ${SECONDS} -lt ${ttl} ]]; do out="foo.$(date +%s)"; nc -l 10000 > "${out}"; md5=$(md5sum "${out}"|cut -d\  -f 1); echo -n "${md5}"; if [[ "${md5}" != "ce338fe6899778aacfc28414f2d9498b" ]]; then echo " corrupted"; else echo; fi; done; grep ^Tcp /proc/net/snmp

[obviously, if you change the size / content, the md5 sum has to be changed here]

This reproduces it fairly quickly (within 20 seconds) for us, on 4.14.x kernels. 4.17 kernels were verified to still have the issue.

Frank

On 6/11/18, 4:15 PM, "Frank van der Linden" <fllinden@amazon.com> wrote:

    commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash
    table") introduced an optimization for the handling of child sockets
    created for a new TCP connection.
    
    But this optimization passes any data associated with the last ACK of the
    connection handshake up the stack without verifying its checksum, because it
    calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
    directly.  These lower-level processing functions do not do any checksum
    verification.
    
    Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
    fix this.
    
    Signed-off-by: Frank van der Linden <fllinden@amazon.com>
    ---
     net/ipv4/tcp_ipv4.c | 8 +++++++-
     net/ipv6/tcp_ipv6.c | 8 +++++++-
     2 files changed, 14 insertions(+), 2 deletions(-)
    
    diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
    index f70586b50838..1ec4c0d4aba5 100644
    --- a/net/ipv4/tcp_ipv4.c
    +++ b/net/ipv4/tcp_ipv4.c
    @@ -1703,7 +1703,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
     			th = (const struct tcphdr *)skb->data;
     			iph = ip_hdr(skb);
     			tcp_v4_fill_cb(skb, iph, th);
    -			nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
    +
    +			if (tcp_checksum_complete(skb)) {
    +				__TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
    +			} else {
    +				nsk = tcp_check_req(sk, skb, req, false,
    +						    &req_stolen);
    +			}
     		}
     		if (!nsk) {
     			reqsk_put(req);
    diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
    index 6d664d83cd16..a12b694d3d1e 100644
    --- a/net/ipv6/tcp_ipv6.c
    +++ b/net/ipv6/tcp_ipv6.c
    @@ -1486,7 +1486,13 @@ static int tcp_v6_rcv(struct sk_buff *skb)
     			th = (const struct tcphdr *)skb->data;
     			hdr = ipv6_hdr(skb);
     			tcp_v6_fill_cb(skb, hdr, th);
    -			nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
    +
    +			if (tcp_checksum_complete(skb)) {
    +				__TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
    +			} else {
    +				nsk = tcp_check_req(sk, skb, req, false,
    +						    &req_stolen);
    +			}
     		}
     		if (!nsk) {
     			reqsk_put(req);
    -- 
    2.14.4
    
    


^ permalink raw reply

* [PATCH] tcp: verify the checksum of the first data segment in a new connection
From: Frank van der Linden @ 2018-06-11 23:15 UTC (permalink / raw)
  To: edumazet, netdev; +Cc: fllinden

commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash
table") introduced an optimization for the handling of child sockets
created for a new TCP connection.

But this optimization passes any data associated with the last ACK of the
connection handshake up the stack without verifying its checksum, because it
calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
directly.  These lower-level processing functions do not do any checksum
verification.

Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
fix this.

Signed-off-by: Frank van der Linden <fllinden@amazon.com>
---
 net/ipv4/tcp_ipv4.c | 8 +++++++-
 net/ipv6/tcp_ipv6.c | 8 +++++++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b50838..1ec4c0d4aba5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1703,7 +1703,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
 			th = (const struct tcphdr *)skb->data;
 			iph = ip_hdr(skb);
 			tcp_v4_fill_cb(skb, iph, th);
-			nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
+
+			if (tcp_checksum_complete(skb)) {
+				__TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
+			} else {
+				nsk = tcp_check_req(sk, skb, req, false,
+						    &req_stolen);
+			}
 		}
 		if (!nsk) {
 			reqsk_put(req);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6d664d83cd16..a12b694d3d1e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1486,7 +1486,13 @@ static int tcp_v6_rcv(struct sk_buff *skb)
 			th = (const struct tcphdr *)skb->data;
 			hdr = ipv6_hdr(skb);
 			tcp_v6_fill_cb(skb, hdr, th);
-			nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
+
+			if (tcp_checksum_complete(skb)) {
+				__TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
+			} else {
+				nsk = tcp_check_req(sk, skb, req, false,
+						    &req_stolen);
+			}
 		}
 		if (!nsk) {
 			reqsk_put(req);
-- 
2.14.4

^ permalink raw reply related

* Re: [bpf PATCH v2 1/2] bpf: sockmap, fix crash when ipv6 sock is added
From: Daniel Borkmann @ 2018-06-11 23:14 UTC (permalink / raw)
  To: John Fastabend, edumazet, weiwan, ast; +Cc: netdev
In-Reply-To: <20180608150639.15153.91342.stgit@john-Precision-Tower-5810>

Hi John,

On 06/08/2018 05:06 PM, John Fastabend wrote:
> This fixes a crash where we assign tcp_prot to IPv6 sockets instead
> of tcpv6_prot.
> 
> Previously we overwrote the sk->prot field with tcp_prot even in the
> AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
> are used. Further, only allow ESTABLISHED connections to join the
> map per note in TLS ULP,
> 
>    /* The TLS ulp is currently supported only for TCP sockets
>     * in ESTABLISHED state.
>     * Supporting sockets in LISTEN state will require us
>     * to modify the accept implementation to clone rather then
>     * share the ulp context.
>     */
> 
> Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
> 'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
> crashing case here.
> 
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Wei Wang <weiwan@google.com>
[...]

Still one question for some more clarification below that popped up while
review:

> @@ -162,6 +164,8 @@ static bool bpf_tcp_stream_read(const struct sock *sk)
>  }
>  
>  static struct proto tcp_bpf_proto;
> +static struct proto tcpv6_bpf_proto;

These two are global, w/o locking.

>  static int bpf_tcp_init(struct sock *sk)
>  {
>  	struct smap_psock *psock;
> @@ -181,14 +185,30 @@ static int bpf_tcp_init(struct sock *sk)
>  	psock->save_close = sk->sk_prot->close;
>  	psock->sk_proto = sk->sk_prot;
>  
> +	if (sk->sk_family == AF_INET6) {
> +		tcpv6_bpf_proto = *sk->sk_prot;
> +		tcpv6_bpf_proto.close = bpf_tcp_close;
> +	} else {
> +		tcp_bpf_proto = *sk->sk_prot;
> +		tcp_bpf_proto.close = bpf_tcp_close;
> +	}

And each time we add a BPF ULP to a v4/v6 socket, we override tcp{,v6}_bpf_proto
from scratch.

>  	if (psock->bpf_tx_msg) {
> +		tcpv6_bpf_proto.sendmsg = bpf_tcp_sendmsg;
> +		tcpv6_bpf_proto.sendpage = bpf_tcp_sendpage;
> +		tcpv6_bpf_proto.recvmsg = bpf_tcp_recvmsg;
> +		tcpv6_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
>  		tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
>  		tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
>  		tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg;
>  		tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
>  	}
>  
> -	sk->sk_prot = &tcp_bpf_proto;
> +	if (sk->sk_family == AF_INET6)
> +		sk->sk_prot = &tcpv6_bpf_proto;
> +	else
> +		sk->sk_prot = &tcp_bpf_proto;

Where every active socket would be affected from it as well. Isn't that
generally racy? E.g. existing ones where tcpv6_bpf_proto.sendmsg points
to bpf_tcp_sendmsg would get overridden with earlier assignment on the
tcpv6_bpf_proto = *sk->sk_prot during their lifetime after bpf_tcp_init().

In the kTLS case, the v4 protos are built up in module init via tls_register()
and never change from there. The v6 ones are only reloaded when their addr
changes e.g. module reload would come to mind, which should only be possible
once no active v6 socket is present. What speaks against adapting similar
scheme resp. what am I missing that the above would work? (Would be nice if
there was some discussion in commit log related to it on 'why' this approach
was done differently.)

Thanks,
Daniel

>  	rcu_read_unlock();
>  	return 0;
>  }
> @@ -1111,8 +1131,6 @@ static void bpf_tcp_msg_add(struct smap_psock *psock,
>  
>  static int bpf_tcp_ulp_register(void)
>  {
> -	tcp_bpf_proto = tcp_prot;
> -	tcp_bpf_proto.close = bpf_tcp_close;
>  	/* Once BPF TX ULP is registered it is never unregistered. It
>  	 * will be in the ULP list for the lifetime of the system. Doing
>  	 * duplicate registers is not a problem.
> 

^ permalink raw reply

* Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 5/7] net: Add generic ndo_select_queue functions
From: kbuild test robot @ 2018-06-11 23:10 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: kbuild-all, netdev, intel-wired-lan, jeffrey.t.kirsher
In-Reply-To: <20180611174119.41352.28975.stgit@ahduyck-green-test.jf.intel.com>

[-- Attachment #1: Type: text/plain, Size: 2094 bytes --]

Hi Alexander,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on next-20180608]
[cannot apply to v4.17]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Alexander-Duyck/Add-support-for-L2-Fwd-Offload-w-o-ndo_select_queue/20180612-015220
config: arm-allmodconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=arm 

All errors (new ones prefixed by >>):

>> drivers/net//ethernet/ti/netcp_core.c:1968:22: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
     .ndo_select_queue = dev_pick_tx_zero,
                         ^~~~~~~~~~~~~~~~
   drivers/net//ethernet/ti/netcp_core.c:1968:22: note: (near initialization for 'netcp_netdev_ops.ndo_select_queue')
   cc1: some warnings being treated as errors

vim +1968 drivers/net//ethernet/ti/netcp_core.c

  1955	
  1956	static const struct net_device_ops netcp_netdev_ops = {
  1957		.ndo_open		= netcp_ndo_open,
  1958		.ndo_stop		= netcp_ndo_stop,
  1959		.ndo_start_xmit		= netcp_ndo_start_xmit,
  1960		.ndo_set_rx_mode	= netcp_set_rx_mode,
  1961		.ndo_do_ioctl           = netcp_ndo_ioctl,
  1962		.ndo_get_stats64        = netcp_get_stats,
  1963		.ndo_set_mac_address	= eth_mac_addr,
  1964		.ndo_validate_addr	= eth_validate_addr,
  1965		.ndo_vlan_rx_add_vid	= netcp_rx_add_vid,
  1966		.ndo_vlan_rx_kill_vid	= netcp_rx_kill_vid,
  1967		.ndo_tx_timeout		= netcp_ndo_tx_timeout,
> 1968		.ndo_select_queue	= dev_pick_tx_zero,
  1969		.ndo_setup_tc		= netcp_setup_tc,
  1970	};
  1971	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 65910 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox