Netdev List
 help / color / mirror / Atom feed
* Re: linux-next: Tree for Apr 26 (net/can/bcm.c)
From: Oliver Hartkopp @ 2017-04-26 18:18 UTC (permalink / raw)
  To: Randy Dunlap, Stephen Rothwell, Linux-Next Mailing List
  Cc: Linux Kernel Mailing List, netdev@vger.kernel.org, linux-can
In-Reply-To: <05bd24d8-6840-3551-a529-c464f3b26d0a@infradead.org>

Hi Randy,

thanks for the report!

Some fallout of my namespace support integration %-)

I posted a patch for it:

http://marc.info/?l=linux-can&m=149323049630039&w=2

Many thanks & best regards,
Oliver

On 04/26/2017 04:53 PM, Randy Dunlap wrote:
> On 04/26/17 01:03, Stephen Rothwell wrote:
>> Hi all,
>>
>> Changes since 20170424:
>>
>
> on x86_64:
>
> when CONFIG_PROC_FS is not enabled:
>
> ../net/can/bcm.c:1541:14: error: 'struct netns_can' has no member named 'bcmproc_dir'
> ../net/can/bcm.c:1601:14: error: 'struct netns_can' has no member named 'bcmproc_dir'
> ../net/can/bcm.c:1696:11: error: 'struct netns_can' has no member named 'bcmproc_dir'
> ../net/can/bcm.c:1707:15: error: 'struct netns_can' has no member named 'bcmproc_dir'
>
> 2 of those are "protected" by
> 	if (IS_ENABLED(CONFIG_PROC_FS)) {
> but that doesn't seem to help/work here.
>
> gcc v4.8.5
>
>
>

^ permalink raw reply

* Re: Bug and configuration MPLS error?
From: David Ahern @ 2017-04-26 18:18 UTC (permalink / raw)
  To: Алексей Болдырев,
	netdev
In-Reply-To: <1728581493210432@web45j.yandex.ru>

On 4/26/17 6:40 AM, Алексей Болдырев wrote:
> 
> 
> 26.04.2017, 05:23, "David Ahern" <dsahern@gmail.com>:
>> On 4/25/17 11:28 AM, Алексей Болдырев wrote:
>>>  226 sysctl -w net.mpls.conf.lo.input=1
>>>  227 sysctl -w net.mpls.platform_labels=1048575
>>>  228 ip link add veth0 type veth peer name veth1
>>>  229 ip link add veth2 type veth peer name veth3
>>>  230 sysctl -w net.mpls.conf.veth0.input=1
>>>  231 sysctl -w net.mpls.conf.veth2.input=1
>>>  232 ifconfig veth0 10.3.3.1 netmask 255.255.255.0
>>>  233 ifconfig veth2 10.4.4.1 netmask 255.255.255.0
>>>  234 ip netns add host1
>>>  235 ip netns add host2
>>>  236 ip link set veth1 netns host1
>>>  237 ip link set veth3 netns host2
>>>  238 ip netns exec host1 ifconfig veth1 10.3.3.2 netmask 255.255.255.0 up
>>>  239 ip netns exec host2 ifconfig veth3 10.4.4.2 netmask 255.255.255.0 up
>>>  240 ip netns exec host1 ip route add 10.10.10.2/32 encap mpls 112 via inet 10.3.3.1
>>>  241 ip netns exec host2 ip route add 10.10.10.1/32 encap mpls 111 via inet 10.4.4.1
>>>  242 ip -f mpls route add 111 via inet 10.3.3.2
>>>  243 ip -f mpls route add 112 via inet 10.4.4.2
>>
>> your setup is incomplete.
>>
>> # ip netns exec host2 ping 10.10.10.1
>> PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.
>> ^C
>> --- 10.10.10.1 ping statistics ---
>> 2 packets transmitted, 0 received, 100% packet loss, time 1038ms
>>
>> If you run tcpdump on veth1 in host1 you see the packets come in but no
>> response:
>>
>> # ip netns exec host1 tcpdump -n -i veth1
>> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
>> listening on veth1, link-type EN10MB (Ethernet), capture size 262144 bytes
>> 19:20:24.599529 IP6 fe80::347d:e3ff:fe93:944b > ff02::2: ICMP6, router
>> solicitation, length 16
>> 19:20:27.413901 IP 10.4.4.2 > 10.10.10.1: ICMP echo request, id 978, seq
>> 1, length 64
>> 19:20:28.439574 IP 10.4.4.2 > 10.10.10.1: ICMP echo request, id 978, seq
>> 2, length 64
>>
>> and the lack of response is b/c:
>> 1. host1 has no address for 10.10.10.1 and
>> 2. even if it did, there is no return route to 10.4.4.2:
>>
>> # ip -netns host1 ro ls
>> 10.3.3.0/24 dev veth1 proto kernel scope link src 10.3.3.2
>> 10.10.10.2 encap mpls 112 via 10.3.3.1 dev veth1
> 
> As for ping, you need to enter this:
> Ip netns exec host2 ping 10.10.10.1 -A 10.10.10.2
> Here I published the results of testing on new (>4.9) cores. (in Russian):
> http://forum.nag.ru/forum/index.php?s=d09f0e5186fda59b3099eb81ad07ee63&showtopic=128927
> But on the old kernels:
> http://forum.nag.ru/forum/index.php?showtopic=128927&view=findpost&p=1396067
> 

host1 does not have 10.10.10.1 as a local address.
host2 does not have 10.10.10.2 as a local address.

Given that, host1 has no business replying to a ping destined to
10.10.10.1, and host2 will not use 10.10.10.2 as a source address.

I don't have time right now to build and test on older kernels, but
based on the network config I do not see how it can work.

If you add:
  ip -netns host1 addr add dev lo 10.10.10.1/32
  ip -netns host2 addr add dev lo 10.10.10.2/32

Then it works.

^ permalink raw reply

* [PATCH net-next] can: fix CAN BCM build with CONFIG_PROC_FS disabled
From: Oliver Hartkopp @ 2017-04-26 18:14 UTC (permalink / raw)
  To: linux-can, davem, rdunlap; +Cc: Oliver Hartkopp, mkl, netdev

The introduced namespace support moved the BCM variables for procfs into a
per-net data structure. This leads to a build failure with disabled procfs:

on x86_64:

when CONFIG_PROC_FS is not enabled:

../net/can/bcm.c:1541:14: error: 'struct netns_can' has no member named 'bcmproc_dir'
../net/can/bcm.c:1601:14: error: 'struct netns_can' has no member named 'bcmproc_dir'
../net/can/bcm.c:1696:11: error: 'struct netns_can' has no member named 'bcmproc_dir'
../net/can/bcm.c:1707:15: error: 'struct netns_can' has no member named 'bcmproc_dir'

http://marc.info/?l=linux-can&m=149321842526524&w=2

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
---
 net/can/bcm.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/net/can/bcm.c b/net/can/bcm.c
index 0e855917b7e1..78409841b74c 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -147,6 +147,7 @@ static inline ktime_t bcm_timeval_to_ktime(struct bcm_timeval tv)
 /*
  * procfs functions
  */
+#if IS_ENABLED(CONFIG_PROC_FS)
 static char *bcm_proc_getifname(struct net *net, char *result, int ifindex)
 {
 	struct net_device *dev;
@@ -251,6 +252,7 @@ static const struct file_operations bcm_proc_fops = {
 	.llseek		= seq_lseek,
 	.release	= single_release,
 };
+#endif /* CONFIG_PROC_FS */
 
 /*
  * bcm_can_tx - send the (next) CAN frame to the appropriate CAN interface
@@ -1537,9 +1539,11 @@ static int bcm_release(struct socket *sock)
 		bcm_remove_op(op);
 	}
 
+#if IS_ENABLED(CONFIG_PROC_FS)
 	/* remove procfs entry */
 	if (net->can.bcmproc_dir && bo->bcm_proc_read)
 		remove_proc_entry(bo->procname, net->can.bcmproc_dir);
+#endif /* CONFIG_PROC_FS */
 
 	/* remove device reference */
 	if (bo->bound) {
@@ -1598,6 +1602,7 @@ static int bcm_connect(struct socket *sock, struct sockaddr *uaddr, int len,
 		bo->ifindex = 0;
 	}
 
+#if IS_ENABLED(CONFIG_PROC_FS)
 	if (net->can.bcmproc_dir) {
 		/* unique socket address as filename */
 		sprintf(bo->procname, "%lu", sock_i_ino(sk));
@@ -1609,6 +1614,7 @@ static int bcm_connect(struct socket *sock, struct sockaddr *uaddr, int len,
 			goto fail;
 		}
 	}
+#endif /* CONFIG_PROC_FS */
 
 	bo->bound = 1;
 
@@ -1691,22 +1697,22 @@ static const struct can_proto bcm_can_proto = {
 
 static int canbcm_pernet_init(struct net *net)
 {
+#if IS_ENABLED(CONFIG_PROC_FS)
 	/* create /proc/net/can-bcm directory */
-	if (IS_ENABLED(CONFIG_PROC_FS)) {
-		net->can.bcmproc_dir =
-			proc_net_mkdir(net, "can-bcm", net->proc_net);
-	}
+	net->can.bcmproc_dir =
+		proc_net_mkdir(net, "can-bcm", net->proc_net);
+#endif /* CONFIG_PROC_FS */
 
 	return 0;
 }
 
 static void canbcm_pernet_exit(struct net *net)
 {
+#if IS_ENABLED(CONFIG_PROC_FS)
 	/* remove /proc/net/can-bcm directory */
-	if (IS_ENABLED(CONFIG_PROC_FS)) {
-		if (net->can.bcmproc_dir)
-			remove_proc_entry("can-bcm", net->proc_net);
-	}
+	if (net->can.bcmproc_dir)
+		remove_proc_entry("can-bcm", net->proc_net);
+#endif /* CONFIG_PROC_FS */
 }
 
 static struct pernet_operations canbcm_pernet_ops __read_mostly = {
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH v2 01/21] scatterlist: Introduce sg_map helper functions
From: Logan Gunthorpe @ 2017-04-26 18:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dri-devel, Stephen Bates, dm-devel, target-devel, Sumit Semwal,
	devel, James E.J. Bottomley, linux-scsi, linux-nvdimm, linux-rdma,
	Ross Zwisler, open-iscsi, linux-media, intel-gfx, sparmaintainer,
	linux-raid, Dan Williams, megaraidlinux.pdl, Jens Axboe,
	Martin K. Petersen, netdev, Matthew Wilcox, linux-mmc,
	linux-kernel, linux-crypto, Greg Kroah-Hartman
In-Reply-To: <20170426074416.GA7936@lst.de>



On 26/04/17 01:44 AM, Christoph Hellwig wrote:
> I think we'll at least need a draft of those to make sense of these
> patches.  Otherwise they just look very clumsy.

Ok, I'll work up a draft proposal and send it in a couple days. But
without a lot of cleanup such as this series it's not going to even be
able to compile.

> I'm sorry but this API is just a trainwreck.  Right now we have the
> nice little kmap_atomic API, which never fails and has a very nice
> calling convention where we just pass back the return address, but does
> not support sleeping inside the critical section.
> 
> And kmap, whіch may fail and requires the original page to be passed
> back.  Anything that mixes these two concepts up is simply a non-starter.

Ok, well for starters I think you are mistaken about kmap being able to
fail. I'm having a hard time finding many users of that function that
bother to check for an error when calling it. The main difficulty we
have now is that neither of those functions are expected to fail and we
need them to be able to in cases where the page doesn't map to system
RAM. This patch series is trying to address it for users of scatterlist.
I'm certainly open to other suggestions.

I also have to disagree that kmap and kmap_atomic are all that "nice".
Except for the sleeping restriction and performance, they effectively do
the same thing. And it was necessary to write a macro wrapper around
kunmap_atomic to ensure that users of that function don't screw it up.
(See 597781f3e5.) I'd say the kmap/kmap_atomic functions are the
trainwreck and I'm trying to do my best to cleanup a few cases.

There are a fair number of cases in the kernel that do something like:

if (something)
    x = kmap(page);
else
    x = kmap_atomic(page);
...
if (something)
    kunmap(page)
else
    kunmap_atomic(x)

Which just seems cumbersome to me.

In any case, if you can accept an sg_kmap and sg_kmap_atomic api just
say so and I'll make the change. But I'll still need a flags variable
for SG_MAP_MUST_NOT_FAIL to support legacy cases that have no fail path
and both of those functions will need to be pretty nearly replicas of
each other.

Logan


_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply

* Re: [PATCH net-next] ip6_tunnel: Fix missing tunnel encapsulation limit option
From: Craig Gallek @ 2017-04-26 17:59 UTC (permalink / raw)
  To: Hideaki YOSHIFUJI, Alexey Kuznetsov, David S . Miller; +Cc: netdev
In-Reply-To: <20170426170707.165201-1-kraigatgoog@gmail.com>

On Wed, Apr 26, 2017 at 1:07 PM, Craig Gallek <kraigatgoog@gmail.com> wrote:
> From: Craig Gallek <kraig@google.com>
>
> The IPv6 tunneling code tries to insert IPV6_TLV_TNL_ENCAP_LIMIT and
> IPV6_TLV_PADN options when an encapsulation limit is defined (the
> default is a limit of 4).  An MTU adjustment is done to account for
> these options as well.  However, the options are never present in the
> generated packets.
>
> ipv6_push_nfrag_opts requires that IPV6_RTHDR be present in order to
> include any IPV6_DSTOPTS options.  The v6 tunnel code does not
> use routing options, so the encap limit options are not included.
>
> A brief reading of RFC 3542 section 9.2 (specifically the 4th paragraph)
> makes me believe that this requirement in the kernel is incorrect.
Looking more closely, I think I'm wrong here.  Specifically, the cmsg
parser puts IPV6_RTHDRDSTOPTS in dst0opt and IPV6_DSTOPTS in dst1opt.
The tunnel code is currently building dst0opt and using
ipv6_push_nfrag_opts.  Perhaps it should be building dst1opt and
calling ipv6_push_frag_opts?

^ permalink raw reply

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
From: Alexei Starovoitov @ 2017-04-26 17:58 UTC (permalink / raw)
  To: John Fastabend, Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Daniel Borkmann, Andy Gospodarek,
	Daniel Borkmann, netdev@vger.kernel.org,
	xdp-newbies@vger.kernel.org
In-Reply-To: <5900CC41.2070502@gmail.com>

On 4/26/17 9:35 AM, John Fastabend wrote:
>
>> As Alexei also mentioned before, ifindex vs port makes no real
>> difference seen from the bpf program side.  It is userspace's
>> responsibility to add ifindex/port's to the bpf-maps, according to how
>> the bpf program "policy" want to "connect" these ports.  The
>> port-table system add one extra step, of also adding this port to the
>> port-table (which lives inside the kernel).
>>
>
> I'm not sure I understand the "lives inside the kernel" bit. I assumed
> the 'map' should be a bpf map and behave like any other bpf map.
>
> I wanted a new map to be defined, something like this from the bpf programmer
> side.
>
> struct bpf_map_def SEC("maps") port_table =
> 	.type = BPF_MAP_TYPE_PORT_CONNECTION,
> 	.key_size = sizeof(u32),
> 	.value_size = BPF_PORT_CONNECTION_SIZE,
> 	.max_entries = 256,
> };

I like the idea.
We have prog_array, perf_event_array, cgroup_array map specializations.
This one can be new netdev_array with some new bpf_redirect-like helper
accessing it.

>> When loading the XDP program, we also need to pass along a port table
>> "id" this XDP program is associated with (and if it doesn't exists you
>> create it).  And your userspace "control-plane" application also need
>> to know this port table "id", when adding a new port.
>
> So the user space application that is loading the program also needs
> to handle this map. This seems correct to me. But I don't see the
> value in making some new port table when we already have well understood
> framework for maps.

+1

>>
>> The concept of having multiple port tables is key.  As this implies we
>> can have several simultaneous "data-planes" that is *isolated* from
>> each-other.  Think about how network-namespaces/containers want
>> isolation. A subtle thing I'm afraid to mention, is that oppose to the
>> ifindex model, a port table with mapping to a net_device pointer, would
>> allow (faster) delivery into the container's inner net_device, which
>> sort of violates the isolation, but I would argue it is not a problem
>> as this net_device pointer could only be added from a process within the
>> namespace.  I like this feature, but it could easily be disallowed via
>> port insertion-time validation.
>>
>
> I think the above optimization should be allowed. And agree multiple port
> tables (maps?) is needed. Again all this points to using standard maps
> logic in my mind. For permissions and different domains, which I think
> you were starting to touch on, it looks like we could extend the pinning API.
> At the moment it does an inode_permission(inode, MAY_WRITE) check but I
> presume this could be extended. None of this would be needed in v1 and
> could be added subsequently. read-only maps seems doable.

this is great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated
the user space can make it readonly to prevent further changes.

 From user space it can be done similar to perf_events/cgroups as well.
bpf_map_update_elem(&netdev_array, &port_num, &ifindex)
should work.
For bpf_map_lookup_elem() from such netdev_array we can return
ifindex back.
The bpf_map_show_fdinfo() can be customized as well to pretty print
ifindexes of netdevs stored in there.

^ permalink raw reply

* Re: [Patch net 3/3] team: use a larger struct for mac address
From: Jarod Wilson @ 2017-04-26 17:59 UTC (permalink / raw)
  To: Cong Wang; +Cc: Jiri Pirko, Linux Kernel Network Developers, Andrey Konovalov
In-Reply-To: <CAM_iQpUGSy7My+hLXhsreq-8wPEdAKzNkQRzu-aiP5nkk+wSdw@mail.gmail.com>

On 2017-04-26 1:28 PM, Cong Wang wrote:
> On Wed, Apr 26, 2017 at 9:46 AM, Jarod Wilson <jarod@redhat.com> wrote:
>> On 2017-04-26 12:11 PM, Cong Wang wrote:
>>>
>>> On Wed, Apr 26, 2017 at 8:55 AM, Jarod Wilson <jarod@redhat.com> wrote:
>>>>
>>>>
>>>> We already have struct sockaddr_storage that could be used throughout
>>>> this
>>>> set as well. We just converted a few pieces of the bonding driver over to
>>>> using it for better support of ipoib bonds, via commit
>>>> faeeb317a5615076dff1ff44b51e862e6064dbd0. Might be better to just use
>>>> that
>>>> in both bonding and team, rather than having different per-driver
>>>> structs,
>>>> or Yet Another Address Storage implementation.
>>>
>>>
>>> Technically, struct sockaddr_storage is not enough either, given the
>>> max is MAX_ADDR_LEN. This is why I gave up on sockaddr_storage.
>>
>>
>> Wait, what? Am I missing something? MAX_ADDR_LEN is 32, and sockaddr_storage
>> is a #define for __kernel_sockaddr_storage, which has it's __data member
>> defined as being of size 128 - sizeof(unsigned short).
> 
> My bad, I thought it is same with sizeof(in6addr) without looking into it.
> The question is, why do we waste 126 - 32 = 94 bytes on stack to just
> use struct sockaddr_storage?

That's a fair point.

> I totally understand we want a unified struct, but we already redefine
> it in multiple places in tree...

Something unified and centralized with a data storage of MAX_ADDR_LEN 
does seem reasonable to get both consistency and minimized waste, and I 
could certainly do a follow-up patch for the bonding driver to switch 
the bits now using sockaddr_storage over to whatever new struct gets added.

-- 
Jarod Wilson
jarod@redhat.com

^ permalink raw reply

* RE: qed*: debug infrastructures
From: Elior, Ariel @ 2017-04-26 17:57 UTC (permalink / raw)
  To: Florian Fainelli, David Miller, Jakub Kicinski, Jiri Pirko
  Cc: netdev@vger.kernel.org, Mintz, Yuval, Tayar, Tomer, Dupuis, Chad
In-Reply-To: <c8751a31-f493-45c1-5bf2-596b367ea034@gmail.com>

Jiri, Florian, Jakub,
Thanks all for you suggestions.

Some answers to questions posted: The signal tracing in our device can be used 
for tracing things like load/store/program_counter from our fastpath processors
(which handle every packet) which can then be re-run in a simulative environment
(recreating the recorded scenario). Other interesting uses for this feature can
be partial pci recording or partial network recording (poor man's analyzer)
which can also be very effective where full blown lab equipment is unavailable.

I reviewed the code of the drivers under hw_tracing (thanks Florian) and I think
we might be a good fit.

Jiri indicated dpipe was not intended for this sort of thing and suggested an
additional dev link object, although it seems to me that this will have to be
either a very generic object which would be susceptible to abuse similar to
debugfs, or it would be tailored to our device so much that no one else would
use it, so I am somewhat less inclined to go down this path (the code
abstracting our debug feature is accessed via ~20 api functions accepting ~10
params each, i.e. quite a handful of configuraion to generalize).

The ethtool debug dump presets (thanks Jakub) are far too narrow to encompass
the full flexibility required here.

Dave, I think my next step would be to send an RFC adding to our core module
(qed) the necessary APIs (mostly to provide some details to this rather abstract
discussion). I will plan to connect those to a new hwtracing driver I'll create
for this purpose, unless a different direction is suggested.

Thanks,
Ariel

^ permalink raw reply

* Re: Corrupted SKB
From: Cong Wang @ 2017-04-26 17:38 UTC (permalink / raw)
  To: Michael Ma; +Cc: Linux Kernel Network Developers, jin.oyj
In-Reply-To: <CAAmHdhwncVcbeKN+K4EnmWqwBRXPwuMOSFwtOfuwmgc6gw-5=g@mail.gmail.com>

On Tue, Apr 25, 2017 at 9:42 PM, Michael Ma <make0818@gmail.com> wrote:
> 2017-04-18 21:46 GMT-07:00 Michael Ma <make0818@gmail.com>:
>> 2017-04-18 16:12 GMT-07:00 Cong Wang <xiyou.wangcong@gmail.com>:
>>> On Mon, Apr 17, 2017 at 5:39 PM, Michael Ma <make0818@gmail.com> wrote:
>>>> Hi -
>>>>
>>>> We've implemented a "glue" qdisc similar to mqprio which can associate
>>>> one qdisc to multiple txqs as the root qdisc. Reference count of the
>>>> child qdiscs have been adjusted properly in this case so that it
>>>> represents the number of txqs it has been attached to. However when
>>>> sending packets we saw the skb from dequeue_skb() corrupted with the
>>>> following call stack:
>>>>
>>>>     [exception RIP: netif_skb_features+51]
>>>>     RIP: ffffffff815292b3  RSP: ffff8817f6987940  RFLAGS: 00010246
>>>>
>>>>  #9 [ffff8817f6987968] validate_xmit_skb at ffffffff815294aa
>>>> #10 [ffff8817f69879a0] validate_xmit_skb at ffffffff8152a0d9
>>>> #11 [ffff8817f69879b0] __qdisc_run at ffffffff8154a193
>>>> #12 [ffff8817f6987a00] dev_queue_xmit at ffffffff81529e03
>>>>
>>>> It looks like the skb has already been released since its dev pointer
>>>> field is invalid.
>>>>
>>>> Any clue on how this can be investigated further? My current thought
>>>> is to add some instrumentation to the place where skb is released and
>>>> analyze whether there is any race condition happening there. However
>>>
>>> Either dropwatch or perf could do the work to instrument kfree_skb().
>>
>> Thanks - will try it out.
>
> I'm using perf to collect the callstack for kfree_skb and trying to
> correlate that with the corrupted SKB address however when system
> crashes the perf.data file is also corrupted - how can I view this
> file in case the system crashes before perf exits?

Hmm, KASAN is pretty good at detecting use-after-free,
its report can nicely shows where we allocate/free it and the
use after free.

https://01.org/linuxgraphics/gfx-docs/drm/dev-tools/kasan.html

^ permalink raw reply

* Re: [Patch net 3/3] team: use a larger struct for mac address
From: Cong Wang @ 2017-04-26 17:28 UTC (permalink / raw)
  To: Jarod Wilson
  Cc: Jiri Pirko, Linux Kernel Network Developers, Andrey Konovalov
In-Reply-To: <3f48c69c-3ff4-f895-3c0f-2e8c9c5f063a@redhat.com>

On Wed, Apr 26, 2017 at 9:46 AM, Jarod Wilson <jarod@redhat.com> wrote:
> On 2017-04-26 12:11 PM, Cong Wang wrote:
>>
>> On Wed, Apr 26, 2017 at 8:55 AM, Jarod Wilson <jarod@redhat.com> wrote:
>>>
>>>
>>> We already have struct sockaddr_storage that could be used throughout
>>> this
>>> set as well. We just converted a few pieces of the bonding driver over to
>>> using it for better support of ipoib bonds, via commit
>>> faeeb317a5615076dff1ff44b51e862e6064dbd0. Might be better to just use
>>> that
>>> in both bonding and team, rather than having different per-driver
>>> structs,
>>> or Yet Another Address Storage implementation.
>>
>>
>> Technically, struct sockaddr_storage is not enough either, given the
>> max is MAX_ADDR_LEN. This is why I gave up on sockaddr_storage.
>
>
> Wait, what? Am I missing something? MAX_ADDR_LEN is 32, and sockaddr_storage
> is a #define for __kernel_sockaddr_storage, which has it's __data member
> defined as being of size 128 - sizeof(unsigned short).

My bad, I thought it is same with sizeof(in6addr) without looking into it.
The question is, why do we waste 126 - 32 = 94 bytes on stack to just
use struct sockaddr_storage?

I totally understand we want a unified struct, but we already redefine
it in multiple places in tree...

^ permalink raw reply

* [PATCH] net: hso: register netdev later to avoid a race condition
From: Andreas Kemnade @ 2017-04-26 17:26 UTC (permalink / raw)
  To: davem, joe, gregkh, peter, hns, linux-usb, netdev, linux-kernel
  Cc: Andreas Kemnade

If the netdev is accessed before the urbs are initialized,
there will be NULL pointer dereferences. That is avoided by
registering it when it is fully initialized.

This case occurs e.g. if dhcpcd is running in the background
and the device is probed, either after insmod hso or
when the device appears on the usb bus.

A backtrace is the following:

[ 1357.356048] usb 1-2: new high-speed USB device number 12 using ehci-omap
[ 1357.551177] usb 1-2: New USB device found, idVendor=0af0, idProduct=8800
[ 1357.558654] usb 1-2: New USB device strings: Mfr=3, Product=2, SerialNumber=0
[ 1357.568572] usb 1-2: Product: Globetrotter HSUPA Modem
[ 1357.574096] usb 1-2: Manufacturer: Option N.V.
[ 1357.685882] hso 1-2:1.5: Not our interface
[ 1460.886352] hso: unloaded
[ 1460.889984] usbcore: deregistering interface driver hso
[ 1513.769134] hso: ../drivers/net/usb/hso.c: Option Wireless
[ 1513.846771] Unable to handle kernel NULL pointer dereference at virtual address 00000030
[ 1513.887664] hso 1-2:1.5: Not our interface
[ 1513.906890] usbcore: registered new interface driver hso
[ 1513.937988] pgd = ecdec000
[ 1513.949890] [00000030] *pgd=acd15831, *pte=00000000, *ppte=00000000
[ 1513.956573] Internal error: Oops: 817 [#1] PREEMPT SMP ARM
[ 1513.962371] Modules linked in: hso usb_f_ecm omap2430 bnep bluetooth g_ether usb_f_rndis u_ether libcomposite configfs ipv6 arc4 wl18xx wlcore mac80211 cfg80211 bq27xxx_battery panel_tpo_td028ttec1 omapdrm drm_kms_helper cfbfillrect snd_soc_simple_card syscopyarea cfbimgblt snd_soc_simple_card_utils sysfillrect sysimgblt fb_sys_fops snd_soc_omap_twl4030 cfbcopyarea encoder_opa362 drm twl4030_madc_hwmon wwan_on_off snd_soc_gtm601 pwm_omap_dmtimer generic_adc_battery connector_analog_tv pwm_bl extcon_gpio omap3_isp wlcore_sdio videobuf2_dma_contig videobuf2_memops w1_bq27000 videobuf2_v4l2 videobuf2_core omap_hdq snd_soc_omap_mcbsp ov9650 snd_soc_omap bmp280_i2c bmg160_i2c v4l2_common snd_pcm_dmaengine bmp280 bmg160_core at24 bmc150_magn_i2c nvmem_core videodev phy_twl4030_usb bmc150_acce
 l_i2c tsc2007
[ 1514.037384]  bmc150_magn bmc150_accel_core media leds_tca6507 bno055 industrialio_triggered_buffer kfifo_buf gpio_twl4030 musb_hdrc snd_soc_twl4030 twl4030_vibra twl4030_madc twl4030_pwrbutton twl4030_charger industrialio w2sg0004 ehci_omap omapdss [last unloaded: hso]
[ 1514.062622] CPU: 0 PID: 3433 Comm: dhcpcd Tainted: G        W       4.11.0-rc8-letux+ #1
[ 1514.071136] Hardware name: Generic OMAP36xx (Flattened Device Tree)
[ 1514.077758] task: ee748240 task.stack: ecdd6000
[ 1514.082580] PC is at hso_start_net_device+0x50/0xc0 [hso]
[ 1514.088287] LR is at hso_net_open+0x68/0x84 [hso]
[ 1514.093231] pc : [<bf79c304>]    lr : [<bf79ced8>]    psr: a00f0013
sp : ecdd7e20  ip : 00000000  fp : ffffffff
[ 1514.105316] r10: 00000000  r9 : ed0e080c  r8 : ecd8fe2c
[ 1514.110839] r7 : bf79cef4  r6 : ecd8fe00  r5 : 00000000  r4 : ed0dbd80
[ 1514.117706] r3 : 00000000  r2 : c0020c80  r1 : 00000000  r0 : ecdb7800
[ 1514.124572] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[ 1514.132110] Control: 10c5387d  Table: acdec019  DAC: 00000051
[ 1514.138153] Process dhcpcd (pid: 3433, stack limit = 0xecdd6218)
[ 1514.144470] Stack: (0xecdd7e20 to 0xecdd8000)
[ 1514.149078] 7e20: ed0dbd80 ecd8fe98 00000001 00000000 ecd8f800 ecd8fe00 ecd8fe60 00000000
[ 1514.157714] 7e40: ed0e080c bf79ced8 bf79ce70 ecd8f800 00000001 bf7a0258 ecd8f830 c068d958
[ 1514.166320] 7e60: c068d8b8 ecd8f800 00000001 00001091 00001090 c068dba4 ecd8f800 00001090
[ 1514.174926] 7e80: ecd8f940 ecd8f800 00000000 c068dc60 00000000 00000001 ed0e0800 ecd8f800
[ 1514.183563] 7ea0: 00000000 c06feaa8 c0ca39c2 beea57dc 00000020 00000000 306f7368 00000000
[ 1514.192169] 7ec0: 00000000 00000000 00001091 00000000 00000000 00000000 00000000 00008914
[ 1514.200805] 7ee0: eaa9ab60 beea57dc c0c9bfc0 eaa9ab40 00000006 00000000 00046858 c066a948
[ 1514.209411] 7f00: beea57dc eaa9ab60 ecc6b0c0 c02837b0 00000006 c0282c90 0000c000 c0283654
[ 1514.218017] 7f20: c09b0c00 c098bc31 00000001 c0c5e513 c0c5e513 00000000 c0151354 c01a20c0
[ 1514.226654] 7f40: c0c5e513 c01a3134 ecdd6000 c01a3160 ee7487f0 600f0013 00000000 ee748240
[ 1514.235260] 7f60: ee748734 00000000 ecc6b0c0 ecc6b0c0 beea57dc 00008914 00000006 00000000
[ 1514.243896] 7f80: 00046858 c02837b0 00001091 0003a1f0 00046608 0003a248 00000036 c01071e4
[ 1514.252502] 7fa0: ecdd6000 c0107040 0003a1f0 00046608 00000006 00008914 beea57dc 00001091
[ 1514.261108] 7fc0: 0003a1f0 00046608 0003a248 00000036 0003ac0c 00046608 00046610 00046858
[ 1514.269744] 7fe0: 0003a0ac beea57d4 000167eb b6f23106 400f0030 00000006 00000000 00000000
[ 1514.278411] [<bf79c304>] (hso_start_net_device [hso]) from [<bf79ced8>] (hso_net_open+0x68/0x84 [hso])
[ 1514.288238] [<bf79ced8>] (hso_net_open [hso]) from [<c068d958>] (__dev_open+0xa0/0xf4)
[ 1514.296600] [<c068d958>] (__dev_open) from [<c068dba4>] (__dev_change_flags+0x8c/0x130)
[ 1514.305023] [<c068dba4>] (__dev_change_flags) from [<c068dc60>] (dev_change_flags+0x18/0x48)
[ 1514.313934] [<c068dc60>] (dev_change_flags) from [<c06feaa8>] (devinet_ioctl+0x348/0x714)
[ 1514.322540] [<c06feaa8>] (devinet_ioctl) from [<c066a948>] (sock_ioctl+0x2b0/0x308)
[ 1514.330627] [<c066a948>] (sock_ioctl) from [<c0282c90>] (vfs_ioctl+0x20/0x34)
[ 1514.338165] [<c0282c90>] (vfs_ioctl) from [<c0283654>] (do_vfs_ioctl+0x82c/0x93c)
[ 1514.346038] [<c0283654>] (do_vfs_ioctl) from [<c02837b0>] (SyS_ioctl+0x4c/0x74)
[ 1514.353759] [<c02837b0>] (SyS_ioctl) from [<c0107040>] (ret_fast_syscall+0x0/0x1c)
[ 1514.361755] Code: e3822103 e3822080 e1822781 e5981014 (e5832030)
[ 1514.510833] ---[ end trace dfb3e53c657f34a0 ]---

Reported-by: H. Nikolaus Schaller <hns@goldelico.com>
Signed-off-by: Andreas Kemnade <andreas@kemnade.info>
---
 drivers/net/usb/hso.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/usb/hso.c b/drivers/net/usb/hso.c
index 93411a3..00067a0 100644
--- a/drivers/net/usb/hso.c
+++ b/drivers/net/usb/hso.c
@@ -2534,13 +2534,6 @@ static struct hso_device *hso_create_net_device(struct usb_interface *interface,
 	SET_NETDEV_DEV(net, &interface->dev);
 	SET_NETDEV_DEVTYPE(net, &hso_type);
 
-	/* registering our net device */
-	result = register_netdev(net);
-	if (result) {
-		dev_err(&interface->dev, "Failed to register device\n");
-		goto exit;
-	}
-
 	/* start allocating */
 	for (i = 0; i < MUX_BULK_RX_BUF_COUNT; i++) {
 		hso_net->mux_bulk_rx_urb_pool[i] = usb_alloc_urb(0, GFP_KERNEL);
@@ -2560,6 +2553,13 @@ static struct hso_device *hso_create_net_device(struct usb_interface *interface,
 
 	add_net_device(hso_dev);
 
+	/* registering our net device */
+	result = register_netdev(net);
+	if (result) {
+		dev_err(&interface->dev, "Failed to register device\n");
+		goto exit;
+	}
+
 	hso_log_port(hso_dev);
 
 	hso_create_rfkill(hso_dev, interface);
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH v2 2/2] ARM: dts: Add the ethernet and ethernet PHY to the cygnus core DT.
From: Eric Anholt @ 2017-04-26 17:26 UTC (permalink / raw)
  To: Florian Fainelli, Vivien Didelot, Andrew Lunn,
	netdev-u79uwXL29TY76Z2rM5mHXA, Rob Herring, Mark Rutland,
	devicetree-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	bcm-kernel-feedback-list-dY08KVG/lbpWk0Htik3J/w, Ray Jui,
	Scott Branden, Jon Mason
In-Reply-To: <0cb00eb7-41d0-0390-4687-d966499ed9f4-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 1564 bytes --]

Florian Fainelli <f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> On 04/25/2017 04:53 PM, Eric Anholt wrote:
>> Cygnus has a single AMAC controller connected to the B53 switch with 2
>> PHYs.  On the BCM911360_EP platform, those two PHYs are connected to
>> the external ethernet jacks.
>> 
>> Signed-off-by: Eric Anholt <eric-WhKQ6XTQaPysTnJN9+BGXg@public.gmane.org>
>> Reviewed-by: Florian Fainelli <f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> ---
>> 
>> v2: Call the node "switch", just call the ports "port" (suggestions by
>>     Florian), drop max-speed on the phys (suggestion by Andrew Lunn),
>>     call the other nodes "ethernet" and "ethernet-phy" (suggestions by
>>     Sergei Shtylyov)
>> 
>>  arch/arm/boot/dts/bcm-cygnus.dtsi      | 58 ++++++++++++++++++++++++++++++++++
>>  arch/arm/boot/dts/bcm911360_entphn.dts |  8 +++++
>>  2 files changed, 66 insertions(+)
>> 
>> diff --git a/arch/arm/boot/dts/bcm-cygnus.dtsi b/arch/arm/boot/dts/bcm-cygnus.dtsi
>> index 009f1346b817..9fd89be0f5e0 100644
>> --- a/arch/arm/boot/dts/bcm-cygnus.dtsi
>> +++ b/arch/arm/boot/dts/bcm-cygnus.dtsi
>> @@ -142,6 +142,54 @@
>>  			interrupts = <0>;
>>  		};
>>  
>> +		mdio: mdio@18002000 {
>> +			compatible = "brcm,iproc-mdio";
>> +			reg = <0x18002000 0x8>;
>> +			#size-cells = <1>;
>> +			#address-cells = <0>;
>
> Sorry for not noticing earlier, since you override this correctly in the
> board-level DTS file can you put a:
>
> 			status = "disabled"
>
> property in there by default?

Done.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH v2 2/2] ARM: dts: Add the ethernet and ethernet PHY to the cygnus core DT.
From: Eric Anholt @ 2017-04-26 17:26 UTC (permalink / raw)
  To: Florian Fainelli, Vivien Didelot, Andrew Lunn,
	netdev-u79uwXL29TY76Z2rM5mHXA, Rob Herring, Mark Rutland,
	devicetree-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	bcm-kernel-feedback-list-dY08KVG/lbpWk0Htik3J/w, Ray Jui,
	Scott Branden, Jon Mason
In-Reply-To: <0cb00eb7-41d0-0390-4687-d966499ed9f4-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 1646 bytes --]

Florian Fainelli <f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> On 04/25/2017 04:53 PM, Eric Anholt wrote:
>> Cygnus has a single AMAC controller connected to the B53 switch with 2
>> PHYs.  On the BCM911360_EP platform, those two PHYs are connected to
>> the external ethernet jacks.
>> 
>> Signed-off-by: Eric Anholt <eric-WhKQ6XTQaPysTnJN9+BGXg@public.gmane.org>
>> Reviewed-by: Florian Fainelli <f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> ---
>> 
>> v2: Call the node "switch", just call the ports "port" (suggestions by
>>     Florian), drop max-speed on the phys (suggestion by Andrew Lunn),
>>     call the other nodes "ethernet" and "ethernet-phy" (suggestions by
>>     Sergei Shtylyov)
>> 
>>  arch/arm/boot/dts/bcm-cygnus.dtsi      | 58 ++++++++++++++++++++++++++++++++++
>>  arch/arm/boot/dts/bcm911360_entphn.dts |  8 +++++
>>  2 files changed, 66 insertions(+)
>> 
>> diff --git a/arch/arm/boot/dts/bcm-cygnus.dtsi b/arch/arm/boot/dts/bcm-cygnus.dtsi
>> index 009f1346b817..9fd89be0f5e0 100644
>> --- a/arch/arm/boot/dts/bcm-cygnus.dtsi
>> +++ b/arch/arm/boot/dts/bcm-cygnus.dtsi
>> @@ -142,6 +142,54 @@
>>  			interrupts = <0>;
>>  		};
>>  
>> +		mdio: mdio@18002000 {
>> +			compatible = "brcm,iproc-mdio";
>> +			reg = <0x18002000 0x8>;
>> +			#size-cells = <1>;
>> +			#address-cells = <0>;
>
> Sorry for not noticing earlier, since you override this correctly in the
> board-level DTS file can you put a:
>
> 			status = "disabled"
>
> property in there by default?

I didn't have the override in the board file either, just switch and
ethernet.  Fixed.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: hmmm...
From: David Miller @ 2017-04-26 17:25 UTC (permalink / raw)
  To: ast; +Cc: netdev
In-Reply-To: <133d8e46-c5c1-1e0e-86a1-a7b5a2737bff@fb.com>

From: Alexei Starovoitov <ast@fb.com>
Date: Tue, 25 Apr 2017 22:31:06 -0700

> On 4/25/17 8:38 PM, David Miller wrote:
> jgt/jge/jsgt/sge was a stumbling block for me as well,
> since it still takes me longer than necessary to disambiguate
> into > vs >= and signed/unsigned

I had this problem while writing Sparc JIT :)

> Though I think Daniel still prefers old classic bpf asm ;)

I do too.

> Anyway, back to the question...
> since BFD and GCC are so much entrenched into canonical style
> of asm code, I don't mind that gnu toolchain will be using it.

Ok.  All data flows from right to left in the instructions so it will
be familiar for x86 assembler hackers.

> I like that you used 'dw' in 'ldxdw' instead of just 'd'
> though 'x' can probably be dropped.

Ok, dropped.

> 'x' should be added here instead:
>  { "stb", BPF_OPC_ST    | BPF_OPC_MEM  | BPF_OPC_B, "[1+O],i" },
>  { "stb", BPF_OPC_STX   | BPF_OPC_MEM  | BPF_OPC_B, "[1+O],2" },

The 'x' really isn't necessary, I would say.  Assembler can tell from
context whether immediate or register variant is wanted and thus:

	stb	[r1+8], 2
	stb	[r1+8], r4

are both assembled correctly.

^ permalink raw reply

* Re: [PATCH v2 2/2] ARM: dts: Add the ethernet and ethernet PHY to the cygnus core DT.
From: Eric Anholt @ 2017-04-26 17:25 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Florian Fainelli, Vivien Didelot, netdev, Rob Herring,
	Mark Rutland, devicetree, linux-arm-kernel, linux-kernel,
	bcm-kernel-feedback-list, Ray Jui, Scott Branden, Jon Mason
In-Reply-To: <20170426004907.GA9453@lunn.ch>

[-- Attachment #1: Type: text/plain, Size: 469 bytes --]

Andrew Lunn <andrew@lunn.ch> writes:

>> +		eth0: ethernet@18042000 {
>> +			compatible = "brcm,amac";
>> +			reg = <0x18042000 0x1000>,
>> +			      <0x18110000 0x1000>;
>> +			reg-names = "amac_base", "idm_base";
>> +			interrupts = <GIC_SPI 110 IRQ_TYPE_LEVEL_HIGH>;
>> +			max-speed = <1000>;
>
> Hi Eric
>
> Sorry i missed this the first time. Does this Ethernet controller do >
> 1Gbps? Does this max-speed do anything useful?

It doesn't look like it.  Dropped.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH net] tcp: fix access to sk->sk_state in tcp_poll()
From: Wei Wang @ 2017-04-26 17:24 UTC (permalink / raw)
  To: Davide Caratti
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy,
	Linux Kernel Network Developers
In-Reply-To: <0a1710138c3e55c388a52dba817b2d1b1996c899.1493226034.git.dcaratti@redhat.com>

On Wed, Apr 26, 2017 at 10:07 AM, Davide Caratti <dcaratti@redhat.com> wrote:
> avoid direct access to sk->sk_state when tcp_poll() is called on a socket
> using active TCP fastopen with deferred connect. Use local variable
> 'state', which stores the result of sk_state_load(), like it was done in
> commit 00fd38d938db ("tcp: ensure proper barriers in lockless contexts").
>
> Fixes: 19f6d3f3c842 ("net/tcp-fastopen: Add new API support")
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Acked-by: Wei Wang <weiwan@google.com>
Thanks for the fix.

^ permalink raw reply

* Re: blocking ops when !TASK_RUNNING in vsock_stream_sendmsg() (again)
From: Cong Wang @ 2017-04-26 17:18 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: Claudio Imbrenda, Linux Kernel Network Developers, Andy King,
	George Zhang
In-Reply-To: <20170421081458.GI13789@unicorn.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 532 bytes --]

Hi,

On Fri, Apr 21, 2017 at 1:14 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> I tried to think about a solution but there doesn't seem to be an easy
> way to fix this in vmw_stream_sendmsg() as moving prepare_to_wait()
> inside the loop would result in missed wake-ups (that was the problem
> with the original fix); IMHO the right way to resolve the issue would be
> rewriting the vmci queue pair code to allow performing the has_space()
> check without taking a mutex.


Can you try the attached patch (compile only)?

Thanks.

[-- Attachment #2: vsock-wait.diff --]
[-- Type: text/plain, Size: 2095 bytes --]

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 6f7f675..dfc8c51e 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1540,8 +1540,7 @@ static int vsock_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 	long timeout;
 	int err;
 	struct vsock_transport_send_notify_data send_data;
-
-	DEFINE_WAIT(wait);
+	DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
 	sk = sock->sk;
 	vsk = vsock_sk(sk);
@@ -1584,11 +1583,10 @@ static int vsock_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 	if (err < 0)
 		goto out;
 
-
 	while (total_written < len) {
 		ssize_t written;
 
-		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		add_wait_queue(sk_sleep(sk), &wait);
 		while (vsock_stream_has_space(vsk) == 0 &&
 		       sk->sk_err == 0 &&
 		       !(sk->sk_shutdown & SEND_SHUTDOWN) &&
@@ -1597,33 +1595,30 @@ static int vsock_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 			/* Don't wait for non-blocking sockets. */
 			if (timeout == 0) {
 				err = -EAGAIN;
-				finish_wait(sk_sleep(sk), &wait);
+				remove_wait_queue(sk_sleep(sk), &wait);
 				goto out_err;
 			}
 
 			err = transport->notify_send_pre_block(vsk, &send_data);
 			if (err < 0) {
-				finish_wait(sk_sleep(sk), &wait);
+				remove_wait_queue(sk_sleep(sk), &wait);
 				goto out_err;
 			}
 
 			release_sock(sk);
-			timeout = schedule_timeout(timeout);
+			timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout);
 			lock_sock(sk);
 			if (signal_pending(current)) {
 				err = sock_intr_errno(timeout);
-				finish_wait(sk_sleep(sk), &wait);
+				remove_wait_queue(sk_sleep(sk), &wait);
 				goto out_err;
 			} else if (timeout == 0) {
 				err = -EAGAIN;
-				finish_wait(sk_sleep(sk), &wait);
+				remove_wait_queue(sk_sleep(sk), &wait);
 				goto out_err;
 			}
-
-			prepare_to_wait(sk_sleep(sk), &wait,
-					TASK_INTERRUPTIBLE);
 		}
-		finish_wait(sk_sleep(sk), &wait);
+		remove_wait_queue(sk_sleep(sk), &wait);
 
 		/* These checks occur both as part of and after the loop
 		 * conditional since we need to check before and after

^ permalink raw reply related

* Re: [PATCH net] net: adjust skb->truesize in ___pskb_trim()
From: Andrey Konovalov @ 2017-04-26 17:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn
In-Reply-To: <1493222866.6453.75.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Apr 26, 2017 at 6:07 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Andrey found a way to trigger the WARN_ON_ONCE(delta < len) in
> skb_try_coalesce() using syzkaller and a filter attached to a TCP
> socket.
>
> As we did recently in commit 158f323b9868 ("net: adjust skb->truesize in
> pskb_expand_head()") we can adjust skb->truesize from ___pskb_trim(),
> via a call to skb_condense().
>
> If all frags were freed, then skb->truesize can be recomputed.
>
> This call can be done if skb is not yet owned, or destructor is
> sock_edemux().

Hi Eric,

I still see the warning even with your patch.

Thanks!

>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> ---
>  net/core/skbuff.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f86bf69cfb8d8bc17262cdba5d9f57a4726cd476..f1d04592ace02f32efa6e05df89c9a5e0023157f 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1576,6 +1576,8 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
>                 skb_set_tail_pointer(skb, len);
>         }
>
> +       if (!skb->sk || skb->destructor == sock_edemux)
> +               skb_condense(skb);
>         return 0;
>  }
>  EXPORT_SYMBOL(___pskb_trim);
>
>

^ permalink raw reply

* [PATCH net] tcp: fix access to sk->sk_state in tcp_poll()
From: Davide Caratti @ 2017-04-26 17:07 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Wei Wang
  Cc: netdev

avoid direct access to sk->sk_state when tcp_poll() is called on a socket
using active TCP fastopen with deferred connect. Use local variable
'state', which stores the result of sk_state_load(), like it was done in
commit 00fd38d938db ("tcp: ensure proper barriers in lockless contexts").

Fixes: 19f6d3f3c842 ("net/tcp-fastopen: Add new API support")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/ipv4/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 40ba424..2dc7fcf 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -533,7 +533,7 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 
 		if (tp->urg_data & TCP_URG_VALID)
 			mask |= POLLPRI;
-	} else if (sk->sk_state == TCP_SYN_SENT && inet_sk(sk)->defer_connect) {
+	} else if (state == TCP_SYN_SENT && inet_sk(sk)->defer_connect) {
 		/* Active TCP fastopen socket with defer_connect
 		 * Return POLLOUT so application can call write()
 		 * in order for kernel to generate SYN+data
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next] ip6_tunnel: Fix missing tunnel encapsulation limit option
From: Craig Gallek @ 2017-04-26 17:07 UTC (permalink / raw)
  To: Hideaki YOSHIFUJI, Alexey Kuznetsov, David S . Miller; +Cc: netdev

From: Craig Gallek <kraig@google.com>

The IPv6 tunneling code tries to insert IPV6_TLV_TNL_ENCAP_LIMIT and
IPV6_TLV_PADN options when an encapsulation limit is defined (the
default is a limit of 4).  An MTU adjustment is done to account for
these options as well.  However, the options are never present in the
generated packets.

ipv6_push_nfrag_opts requires that IPV6_RTHDR be present in order to
include any IPV6_DSTOPTS options.  The v6 tunnel code does not
use routing options, so the encap limit options are not included.

A brief reading of RFC 3542 section 9.2 (specifically the 4th paragraph)
makes me believe that this requirement in the kernel is incorrect.

Fixes: 333fad5364d6: ("[IPV6]: Support several new sockopt / ancillary data in Advanced API (RFC3542)")
Signed-off-by: Craig Gallek <kraig@google.com>
---
 net/ipv6/exthdrs.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
index 25192a3b0cd7..224a89e68a42 100644
--- a/net/ipv6/exthdrs.c
+++ b/net/ipv6/exthdrs.c
@@ -932,15 +932,12 @@ void ipv6_push_nfrag_opts(struct sk_buff *skb, struct ipv6_txoptions *opt,
 			  u8 *proto,
 			  struct in6_addr **daddr, struct in6_addr *saddr)
 {
-	if (opt->srcrt) {
+	if (opt->srcrt)
 		ipv6_push_rthdr(skb, proto, opt->srcrt, daddr, saddr);
-		/*
-		 * IPV6_RTHDRDSTOPTS is ignored
-		 * unless IPV6_RTHDR is set (RFC3542).
-		 */
-		if (opt->dst0opt)
-			ipv6_push_exthdr(skb, proto, NEXTHDR_DEST, opt->dst0opt);
-	}
+
+	if (opt->dst0opt)
+		ipv6_push_exthdr(skb, proto, NEXTHDR_DEST, opt->dst0opt);
+
 	if (opt->hopopt)
 		ipv6_push_exthdr(skb, proto, NEXTHDR_HOP, opt->hopopt);
 }
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* Re: [PATCH v1 net-next 0/6] Extend socket timestamping API
From: Richard Cochran @ 2017-04-26 16:54 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: netdev, Willem de Bruijn, Soheil Hassas Yeganeh, Keller, Jacob E,
	Denny Page, Jiri Benc
In-Reply-To: <20170426145035.25846-1-mlichvar@redhat.com>

On Wed, Apr 26, 2017 at 04:50:29PM +0200, Miroslav Lichvar wrote:
> This patchset adds new options to the timestamping API that will be
> useful for NTP implementations and possibly other applications.

Are there any userland ntp patches floating around to exercise the new
HW time stamping option?

Thanks,
Richard

^ permalink raw reply

* Re: bluetooth 6lowpan interfaces are not virtual anymore
From: Jukka Rissanen @ 2017-04-26 16:52 UTC (permalink / raw)
  To: Michael Richardson, Alexander Aring
  Cc: Network Development, Luiz Augusto von Dentz,
	linux-wpan-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Linux Bluetooth
In-Reply-To: <383.1493218546-VaGMqW6d0iXFptTlUKWvmrDks+cytr/Z@public.gmane.org>

Hi Michael,

On Wed, 2017-04-26 at 10:55 -0400, Michael Richardson wrote:
> Alexander Aring <aar-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org> wrote:
>     >> In a classic SVR4 STREAMS works, it would have been just
> another
>     >> module.  (No, I'm not a fan of *STREAMS* or of SVR4 in
> general,
>     >> although I liked some of the ideas).
>     >>
> 
>     > ok, I see you complain about "having a virtual on top of wpan
>     > interface", or?
> 
>     > I wanted to talk at first about the queue handling which is
> introduced
>     > when 6LoWPAN is not a virtual interface anymore. Or do you want
> to have
>     > a queue in front of 6lowpan adaptation (see other mail reply
> with ASCII
>     > graphics).
> 
> I would like to have a single queue, as close to the hardware as
> possible,
> such that BQL can do it's thing easily.  Should we rethink outgoing
> fragment
> handling for 6lowpan?  Clearly the BT people had a need.
> I don't think they've had a chance to respond to your complaints.

Note that the BT fragmentation (or actually it is called segmentation
in BT) is totally different what 802.15.4 is doing. I do not think
there is any need to add fragmentation handling into 6lo.

Actually the 6lowpan kernel module could probably be simplified to be a
library. We did this in Zephyr where we have compress() and
uncompress() functions that do all the magic.  

> 
>     > We can change that you can run multiple interfaces on one
>     > PHY. Currently we just allow one, because address filtering.
> Disable
>     > address filtering
>     > we will loose ACK handling on hardware.
> 
> Yes, that's a limitation of some hardware, and if you enable multiple
> PANIDs,
> that might be the consequence....
> 
>     > I can try to implement all stuff in software "for fun, maybe
> see what
>     > we can do to handle ACK in software, etc" Then you can run
> multiple
> 
> I'm not asking you to do it, I'm asking, now that we've gotten to a
> certain
> point, we have a better idea what the various requirements are, and
> can we
> re-evaluate things and maybe tweak some things.
> 
> --
> ]               Never tell me the odds!                 | ipv6 mesh
> networks [
> ]   Michael Richardson, Sandelman Software Works        | network
> architect  [
> ]     mcr-SWp7JaYWvAQV+D8aMU/kSg@public.gmane.org  http://www.sandelman.ca/        |   ruby on
> rails    [
> 


Cheers,
Jukka

^ permalink raw reply

* Re: [Patch net 3/3] team: use a larger struct for mac address
From: Jarod Wilson @ 2017-04-26 16:46 UTC (permalink / raw)
  To: Cong Wang; +Cc: Jiri Pirko, Linux Kernel Network Developers, Andrey Konovalov
In-Reply-To: <CAM_iQpVXBuXNfVM_h894HEF_9hv2rWgDJcMbvEJG25xfm7DGUA@mail.gmail.com>

On 2017-04-26 12:11 PM, Cong Wang wrote:
> On Wed, Apr 26, 2017 at 8:55 AM, Jarod Wilson <jarod@redhat.com> wrote:
>>
>> We already have struct sockaddr_storage that could be used throughout this
>> set as well. We just converted a few pieces of the bonding driver over to
>> using it for better support of ipoib bonds, via commit
>> faeeb317a5615076dff1ff44b51e862e6064dbd0. Might be better to just use that
>> in both bonding and team, rather than having different per-driver structs,
>> or Yet Another Address Storage implementation.
> 
> Technically, struct sockaddr_storage is not enough either, given the
> max is MAX_ADDR_LEN. This is why I gave up on sockaddr_storage.

Wait, what? Am I missing something? MAX_ADDR_LEN is 32, and 
sockaddr_storage is a #define for __kernel_sockaddr_storage, which has 
it's __data member defined as being of size 128 - sizeof(unsigned short).

-- 
Jarod Wilson
jarod@redhat.com

^ permalink raw reply

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
From: John Fastabend @ 2017-04-26 16:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Daniel Borkmann, Andy Gospodarek,
	Daniel Borkmann, Alexei Starovoitov, netdev@vger.kernel.org,
	xdp-newbies@vger.kernel.org
In-Reply-To: <20170426111158.578b925e@redhat.com>

[...]

>> Jesper, I was working up the code for the redirect piece for ixgbe and
>> virtio, please use this as a base for your virtual port number table. I'll
>> push an update onto github tomorrow. I think the table should drop in fairly
>> nicely.
> 
> Cool, I will do that. Then, I'll also have a redirect method to shape
> this around, and I would have to benchmark/test your ixgbe redirect.
> 
> (John please let me know, what github tree we are talking about, and
> what branch)
> 
> 
>> One piece that isn't clear to me is how do you plan to instantiate and
>> program this table. Is it a new static bpf map that is created any
>> time we see the redirect command? I think this would be preferred.
> 
> (This is difficult to explain without us misunderstanding each-other)
> 

Yep and I'm not sure I follow :)

> As Alexei also mentioned before, ifindex vs port makes no real
> difference seen from the bpf program side.  It is userspace's
> responsibility to add ifindex/port's to the bpf-maps, according to how
> the bpf program "policy" want to "connect" these ports.  The
> port-table system add one extra step, of also adding this port to the
> port-table (which lives inside the kernel). 
> 

I'm not sure I understand the "lives inside the kernel" bit. I assumed
the 'map' should be a bpf map and behave like any other bpf map.

I wanted a new map to be defined, something like this from the bpf programmer
side.

struct bpf_map_def SEC("maps") port_table =
	.type = BPF_MAP_TYPE_PORT_CONNECTION,
	.key_size = sizeof(u32),
	.value_size = BPF_PORT_CONNECTION_SIZE,
	.max_entries = 256,
};

> When loading the XDP program, we also need to pass along a port table
> "id" this XDP program is associated with (and if it doesn't exists you
> create it).  And your userspace "control-plane" application also need
> to know this port table "id", when adding a new port.

So the user space application that is loading the program also needs
to handle this map. This seems correct to me. But I don't see the
value in making some new port table when we already have well understood
framework for maps.

> 
> The concept of having multiple port tables is key.  As this implies we
> can have several simultaneous "data-planes" that is *isolated* from
> each-other.  Think about how network-namespaces/containers want
> isolation. A subtle thing I'm afraid to mention, is that oppose to the
> ifindex model, a port table with mapping to a net_device pointer, would
> allow (faster) delivery into the container's inner net_device, which
> sort of violates the isolation, but I would argue it is not a problem
> as this net_device pointer could only be added from a process within the
> namespace.  I like this feature, but it could easily be disallowed via
> port insertion-time validation.
> 

I think the above optimization should be allowed. And agree multiple port
tables (maps?) is needed. Again all this points to using standard maps
logic in my mind. For permissions and different domains, which I think
you were starting to touch on, it looks like we could extend the pinning API.
At the moment it does an inode_permission(inode, MAY_WRITE) check but I
presume this could be extended. None of this would be needed in v1 and
could be added subsequently. read-only maps seems doable.

>    
>>>> I'm not worried about the DROP case, I agree that is fine (as you
>>>> also say).  The problem is unintentionally sending a packet to a
>>>> wrong ifindex.  This is clearly an eBPF program error, BUT with
>>>> XDP this becomes a very hard to debug program error.  With
>>>> TC-redirect/cls_bpf we can tcpdump the packets, with XDP there is
>>>> no visibility into this happening (the NSA is going to love this
>>>> "feature").  Maybe we could add yet-another tracepoint to allow
>>>> debugging this.  My proposal that we simply remove the possibility
>>>> for such program errors, by as you say move the validation from
>>>> run-time into static insertion-time, via a port table.  
>>>
>>> I think lack of tcpdump-like debugging in xdp is a separate issue.
>>> As I was saying in the other thread we have trivial 'xdpdump'
>>> kern+user app that emits pcap file, but it's too specific to how we
>>> use tail_calls+prog_array in our xdp setup. I'm working on the
>>> program chaining that will be generic and allow us transparently
>>> add multiple xdp or tc progs to the same attachment point and will
>>> allow us to do 'xdpdump' at any point of this pipeline, so
>>> debugging of what happened to the packet will be easier and done in
>>> the same way for both tc and xdp.
>>> btw in our experience working with both tc and xdp the tc+bpf was
>>> actually harder to use and more bug prone.
>>>   
>>
>> Nice, the tcpdump-like debugging looks interesting.
> 
> Yes, this xdpdump sound like a very useful tool.
> 

^ permalink raw reply

* Re: [PATCH net-next] tcp: memset ca_priv data to 0 properly
From: Wei Wang @ 2017-04-26 16:25 UTC (permalink / raw)
  To: Linux Kernel Network Developers, David Miller
  Cc: Eric Dumazet, Yuchung Cheng, Neal Cardwell, Wei Wang
In-Reply-To: <20170426003802.40091-1-tracywwnj@gmail.com>

This fix should target for net tree instead of net-next.
Sorry for the wrong title.

On Tue, Apr 25, 2017 at 5:38 PM, Wei Wang <weiwan@google.com> wrote:
> From: Wei Wang <weiwan@google.com>
>
> Always zero out ca_priv data in tcp_assign_congestion_control() so that
> ca_priv data is cleared out during socket creation.
> Also always zero out ca_priv data in tcp_reinit_congestion_control() so
> that when cc algorithm is changed, ca_priv data is cleared out as well.
> We should still zero out ca_priv data even in TCP_CLOSE state because
> user could call connect() on AF_UNSPEC to disconnect the socket and
> leave it in TCP_CLOSE state and later call setsockopt() to switch cc
> algorithm on this socket.
>
> Fixes: 2b0a8c9ee ("tcp: add CDG congestion control")
> Reported-by: Andrey Konovalov  <andreyknvl@google.com>
> Signed-off-by: Wei Wang <weiwan@google.com>
> Acked-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> ---
>  net/ipv4/tcp_cong.c | 11 +++--------
>  1 file changed, 3 insertions(+), 8 deletions(-)
>
> diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
> index 79c4817abc94..6e3c512054a6 100644
> --- a/net/ipv4/tcp_cong.c
> +++ b/net/ipv4/tcp_cong.c
> @@ -168,12 +168,8 @@ void tcp_assign_congestion_control(struct sock *sk)
>         }
>  out:
>         rcu_read_unlock();
> +       memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
>
> -       /* Clear out private data before diag gets it and
> -        * the ca has not been initialized.
> -        */
> -       if (ca->get_info)
> -               memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
>         if (ca->flags & TCP_CONG_NEEDS_ECN)
>                 INET_ECN_xmit(sk);
>         else
> @@ -200,11 +196,10 @@ static void tcp_reinit_congestion_control(struct sock *sk,
>         tcp_cleanup_congestion_control(sk);
>         icsk->icsk_ca_ops = ca;
>         icsk->icsk_ca_setsockopt = 1;
> +       memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
>
> -       if (sk->sk_state != TCP_CLOSE) {
> -               memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
> +       if (sk->sk_state != TCP_CLOSE)
>                 tcp_init_congestion_control(sk);
> -       }
>  }
>
>  /* Manage refcounts on socket close. */
> --
> 2.13.0.rc0.306.g87b477812d-goog
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox