Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next RFC 0/5] batched tx processing in vhost_net
From: Michael S. Tsirkin @ 2017-09-26 13:45 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <1506067355-5771-1-git-send-email-jasowang@redhat.com>

On Fri, Sep 22, 2017 at 04:02:30PM +0800, Jason Wang wrote:
> Hi:
> 
> This series tries to implement basic tx batched processing. This is
> done by prefetching descriptor indices and update used ring in a
> batch. This intends to speed up used ring updating and improve the
> cache utilization.

Interesting, thanks for the patches. So IIUC most of the gain is really
overcoming some of the shortcomings of virtio 1.0 wrt cache utilization?

Which is fair enough (1.0 is already deployed) but I would like to avoid
making 1.1 support harder, and this patchset does this unfortunately,
see comments on individual patches. I'm sure it can be addressed though.

> Test shows about ~22% improvement in tx pss.

Is this with or without tx napi in guest?

> Please review.
> 
> Jason Wang (5):
>   vhost: split out ring head fetching logic
>   vhost: introduce helper to prefetch desc index
>   vhost: introduce vhost_add_used_idx()
>   vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
>   vhost_net: basic tx virtqueue batched processing
> 
>  drivers/vhost/net.c   | 221 ++++++++++++++++++++++++++++----------------------
>  drivers/vhost/vhost.c | 165 +++++++++++++++++++++++++++++++------
>  drivers/vhost/vhost.h |   9 ++
>  3 files changed, 270 insertions(+), 125 deletions(-)
> 
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH net-next v10] openvswitch: enable NSH support
From: Yang, Yi @ 2017-09-26 13:52 UTC (permalink / raw)
  To: Jiri Benc
  Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, e@erig.me,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org
In-Reply-To: <20170926130534.170270e3@griffin>

On Tue, Sep 26, 2017 at 07:05:34PM +0800, Jiri Benc wrote:
> On Tue, 26 Sep 2017 12:47:16 +0800, Yi Yang wrote:
> > +	return ((ret != 0) ? false : true);

Jiri, I appriciate your review very carefully and professionally from my
heart for 10 versions, that is really very very big effort.

I carefully thought this comment:

"""
> +	return ((ret != 0) ? false : true);

Too little coffee or too late in the night, right? ;-)
"""

But I don't think this is a problematic line from my understanding,
because validate_nsh is declared to return bool. I had thought you were
saying "it was late, it was time for you to leave office, so no more
comments", this is my readout :-)

So the best comment you can give out here is one line you prefer :-)

I'm not a kernel developer full time, I'm adapting to several coding
style at the time in kernel, OVS and Opendaylight.

> 
> I'm not going to review this version but this caught my eye - I pointed
> out this silly construct in my review of v9. I can understand that
> working late in the night and rewriting the code back and forth, one
> could end up with such construct and overlook it at the final
> self-review before submission. Happens to everyone.
> 
> But leaving this after a review pointed it out means you're not paying
> enough attention to your work. Even the fact that you sent v10 so
> shortly after v9 means you did not spend the needed time on reflecting
> on the review and that you did not properly test the new version. And
> I told you exactly this before.
> 
> Honestly, I'm starting to be tired with reviewing your patch again and
> again and pointing out silly mistakes like this one and repeating basic
> things to you again and again.

I did test it in my sfc test environment, I don't see any warning, error
during build and runtime, it can work well.

Sorry if missing your comments, that is understanding issue in most
cases but not I don't take your comments seriously. You know English
isn't my native language, it is a big challenge for me to understand
your comments very well. Neverthless, code is our common language, I can
understand code better than your descriptive words :-)

> 
>  Jiri

^ permalink raw reply

* Re: [PATCH] net: stmmac: Meet alignment requirements for DMA
From: Matt Redfearn @ 2017-09-26 13:57 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, alexandre.torgue, peppe.cavallaro, linux-kernel,
	Linux MIPS Mailing List, Paul Burton, James Hogan
In-Reply-To: <20170922.182639.272534775457081015.davem@davemloft.net>

Hi David,

Thanks for your feedback.


On 23/09/17 02:26, David Miller wrote:
> From: Matt Redfearn <matt.redfearn@imgtec.com>
> Date: Fri, 22 Sep 2017 12:13:53 +0100
>
>> According to Documentation/DMA-API.txt:
>>   Warnings:  Memory coherency operates at a granularity called the cache
>>   line width.  In order for memory mapped by this API to operate
>>   correctly, the mapped region must begin exactly on a cache line
>>   boundary and end exactly on one (to prevent two separately mapped
>>   regions from sharing a single cache line).  Since the cache line size
>>   may not be known at compile time, the API will not enforce this
>>   requirement.  Therefore, it is recommended that driver writers who
>>   don't take special care to determine the cache line size at run time
>>   only map virtual regions that begin and end on page boundaries (which
>>   are guaranteed also to be cache line boundaries).
> This is rediculious.  You're misreading what this document is trying
> to explain.

To be clear, is your interpretation of the documentation that drivers do 
not have to ensure cacheline alignment?

As I read that documentation of the DMA API, it puts the onus on device 
drivers to ensure that they operate on cacheline aligned & sized blocks 
of memory. It states "the mapped region must begin exactly on a cache 
line boundary". We know that skb->data is not cacheline aligned, since 
it is skb_headroom() bytes into the skb buffer, and the value of 
skb_headroom is not a multiple of cachelines. To give an example, on the 
Creator Ci40 platform (a 32bit MIPS platform), I have the following values:
skb_headroom(skb) = 130 bytes
sbb->head = 0x8ed9b800 (32byte cacheline aligned)
skb->data = 0x8ed9b882 (not cacheline aligned)

Since the MIPS architecture requires software cache management, 
performing a dma_map_single(TO_DEVICE) will writeback and invalidate the 
cachelines for the required region. To comply with (our interpretation 
of) the DMA API documentation, the MIPS implementation expects a 
cacheline aligned & sized buffer, so that it can writeback a set of 
complete cachelines. Indeed a recent patch 
(https://patchwork.linux-mips.org/patch/14624/) causes the MIPS dma 
mapping code to emit warnings if the buffer it is requested to sync is 
not cachline aligned. This driver, as is, fails this test and causes 
thousands of warnings to be emitted.

The situation for dma_map_single(FROM_DEVICE) is potentially worse, 
since it will perform a cache invalidate of all lines for the buffer. If 
the buffer is not cacheline aligned, this will throw away any adjacent 
data in the same cacheline. The dma_map implementation has no way to 
know if data adjacent to the buffer it is requested to sync is valuable 
or not, and it will simply be thrown away by the cache invalidate.

If I am somehow misinterpreting the DMA API documentation, and drivers 
are NOT required to ensure that buffers to be synced are cacheline 
aligned, then there is only one way that the MIPS implementation can 
ensure that it writeback / invalidates cachelines containing only the 
requested data buffer, and that would be to use bounce buffers. Clearly 
that will be devastating for performance as every outgoing packet will 
need to be copied from it's unaligned location to a guaranteed aligned 
location, and written back from there. Similarly incoming packets would 
have to somehow be initially located in a cacheline aligned location 
before being copied into the non-cacheline aligned location supplied by 
the driver.

> As long as you use the dma_{map,unamp}_single() and sync to/from
> deivce interfaces properly, the cacheline issues will be handled properly
> and the cpu and the device will see proper uptodate memory contents.

I interpret the DMA API document (and the MIPS arch dma code operates 
this way) as stating that the driver must ensure that buffers passed to 
it are cacheline aligned, and cacheline sized, to prevent cache 
management throwing away adjacent data in the same cacheline.

That is what this patch is for, to change the buffer address passed to 
the DMA API to skb->head, which is (as far as I know) guaranteed to be 
cacheline aligned.

>
> It is completely rediculious to require every driver to stash away two
> sets of pointer for every packet, and to DMA map the headroom of the SKB
> which is wasteful.
>
> I'm not applying this, fix this problem properly, thanks.

An alternative, slightly less invasive change, may be to subtract 
NET_IP_ALIGN from the dma buffer address, so that the buffer for which a 
sync is requested is cacheline aligned (is that guaranteed?). Would that 
change be more acceptable?

Thanks,
Matt

^ permalink raw reply

* Re: [PATCH net-next 2/7] nfp: compile flower vxlan tunnel metadata match fields
From: John Hurley @ 2017-09-26 13:58 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Simon Horman, David Miller, Jakub Kicinski, Linux Netdev List,
	oss-drivers
In-Reply-To: <CAJ3xEMgs2d7dcktHfvx_iry8tBy8XK8n6E5W7J0pYPer9Xz_Kg@mail.gmail.com>

On Mon, Sep 25, 2017 at 7:35 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Mon, Sep 25, 2017 at 1:23 PM, Simon Horman
> <simon.horman@netronome.com> wrote:
>> From: John Hurley <john.hurley@netronome.com>
>>
>> Compile ovs-tc flower vxlan metadata match fields for offloading. Only
>
> anything in the npf kernel bits has direct relation to ovs? what?
>

Sorry, this is a typo  and should refer to TC.

>> +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>> @@ -52,8 +52,25 @@
>>          BIT(FLOW_DISSECTOR_KEY_PORTS) | \
>>          BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) | \
>>          BIT(FLOW_DISSECTOR_KEY_VLAN) | \
>> +        BIT(FLOW_DISSECTOR_KEY_ENC_KEYID) | \
>> +        BIT(FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS) | \
>> +        BIT(FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS) | \
>
> this series takes care of IPv6 tunnels too?

IPv6 is not included in this set.
The reason the IPv6 bit is included here is to account for behavior we
have noticed in TC flower.
If, for example, I add a filter with the following match fields:
'protocol ip flower enc_src_ip 10.0.0.1 enc_dst_ip 10.0.0.2
enc_dst_port 4789 enc_key_id 123'
The 'used_keys' value in the dissector marks both IPv4 and IPv6 encap
addresses as 'used'.
I am not sure if this is a bug in TC or that we are expected to check
the enc_control fields to determine if IPv4 or v6 addresses are used.
Including the IPv6 used_keys bit in our whitelist approach allows us
to accept legitimate IPv4 tunnel rules in these situations.
If it is found to be IPv6 when the rule is parsed, it will be rejected here.

^ permalink raw reply

* Re: [PATCH net v2] l2tp: fix race condition in l2tp_tunnel_delete
From: Sabrina Dubroca @ 2017-09-26 13:59 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: netdev, Xin Long, Tom Parkin
In-Reply-To: <20170925123355.a6qof4ovgr6ksx3s@alphalink.fr>

2017-09-25, 14:33:55 +0200, Guillaume Nault wrote:
[...]
> So what about using your v2 patch, but without the ->dead flag test in
> l2tp_tunnel_get() and l2tp_tunnel_find*()?

Ok, I think that's reasonable. I'll post a v3, thanks.

-- 
Sabrina

^ permalink raw reply

* Re: [PATCH 2/6] ath9k: add a quirk to set use_msi automatically
From: Christoph Hellwig @ 2017-09-26 14:04 UTC (permalink / raw)
  To: AceLan Kao
  Cc: Christoph Hellwig, QCA ath9k Development, Kalle Valo,
	linux-wireless, netdev, Linux-Kernel@Vger. Kernel. Org
In-Reply-To: <CAFv23Qm9kZLSqJ_BP1D7mWCxekq0nt3S8onT1ckH-AoHDBUDzw@mail.gmail.com>

On Tue, Sep 26, 2017 at 03:26:51PM +0800, AceLan Kao wrote:
> Ath9k is an old driver for old chips, and they work fine with legacy INTx.
> But some new platforms are using it, so I think we should list those
> new platforms which blocks INTx in the driver.

And unless they are broken and need a quirk they should work even better
with MSI if the device advertises MSI support in the PCI capabilities.

^ permalink raw reply

* Re: [PATCH net-next 2/7] nfp: compile flower vxlan tunnel metadata match fields
From: Or Gerlitz @ 2017-09-26 14:12 UTC (permalink / raw)
  To: John Hurley
  Cc: Simon Horman, David Miller, Jakub Kicinski, Linux Netdev List,
	oss-drivers
In-Reply-To: <CAK+XE=nSHn6FGTi0kcqi+mcaV3Lo8QANpqQ4z6uNcWE+M3Gngw@mail.gmail.com>

On Tue, Sep 26, 2017 at 4:58 PM, John Hurley <john.hurley@netronome.com> wrote:
> On Mon, Sep 25, 2017 at 7:35 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> On Mon, Sep 25, 2017 at 1:23 PM, Simon Horman
>> <simon.horman@netronome.com> wrote:
>>> From: John Hurley <john.hurley@netronome.com>
>>>
>>> Compile ovs-tc flower vxlan metadata match fields for offloading. Only
>>
>> anything in the npf kernel bits has direct relation to ovs? what?
>>
>
> Sorry, this is a typo  and should refer to TC.
>
>>> +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>> @@ -52,8 +52,25 @@
>>>          BIT(FLOW_DISSECTOR_KEY_PORTS) | \
>>>          BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) | \
>>>          BIT(FLOW_DISSECTOR_KEY_VLAN) | \
>>> +        BIT(FLOW_DISSECTOR_KEY_ENC_KEYID) | \
>>> +        BIT(FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS) | \
>>> +        BIT(FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS) | \
>>
>> this series takes care of IPv6 tunnels too?
>
> IPv6 is not included in this set.
> The reason the IPv6 bit is included here is to account for behavior we
> have noticed in TC flower.
> If, for example, I add a filter with the following match fields:
> 'protocol ip flower enc_src_ip 10.0.0.1 enc_dst_ip 10.0.0.2
> enc_dst_port 4789 enc_key_id 123'
> The 'used_keys' value in the dissector marks both IPv4 and IPv6 encap
> addresses as 'used'.
> I am not sure if this is a bug in TC or that we are expected to check
> the enc_control fields to determine if IPv4 or v6 addresses are used.

you should have your code to check enc_control->addr_type to be
FLOW_DISSECTOR_KEY_IPV4_ADDRS or IPV6_ADDRS


> Including the IPv6 used_keys bit in our whitelist approach allows us
> to accept legitimate IPv4 tunnel rules in these situations.

mmm can please take a look on fl_init_dissector() and tell me if you
see why FLOW_DISSECTOR_KEY_IPV6_ADDRS is set for ipv4 tunnels,
I am not sure.

> If it is found to be IPv6 when the rule is parsed, it will be rejected here.

^ permalink raw reply

* Re: [PATCH 2/6] ath9k: add a quirk to set use_msi automatically
From: Kalle Valo @ 2017-09-26 14:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: AceLan Kao, QCA ath9k Development, linux-wireless, netdev,
	Linux-Kernel@Vger. Kernel. Org, Daniel Drake
In-Reply-To: <20170926140413.GA19977@infradead.org>

Christoph Hellwig <hch@infradead.org> writes:

> On Tue, Sep 26, 2017 at 03:26:51PM +0800, AceLan Kao wrote:
>> Ath9k is an old driver for old chips, and they work fine with legacy INTx.
>> But some new platforms are using it, so I think we should list those
>> new platforms which blocks INTx in the driver.
>
> And unless they are broken and need a quirk they should work even better
> with MSI if the device advertises MSI support in the PCI capabilities.

Daniel Drake (CCed) already found problems with the MSI implementation:

https://lkml.kernel.org/r/20170925041153.26574-1-drake@endlessm.com

-- 
Kalle Valo

^ permalink raw reply

* Re: [PATCH net-next 0/7] nfp: flower vxlan tunnel offload
From: Or Gerlitz @ 2017-09-26 14:17 UTC (permalink / raw)
  To: Jiri Benc
  Cc: Simon Horman, David Miller, Jakub Kicinski, Linux Netdev List,
	oss-drivers, John Hurley, Paolo Abeni, Paul Blakey, Jiri Pirko,
	Roi Dayan
In-Reply-To: <20170926145143.28bf52bd@griffin>

On Tue, Sep 26, 2017 at 3:51 PM, Jiri Benc <jbenc@redhat.com> wrote:
> On Tue, 26 Sep 2017 15:41:37 +0300, Or Gerlitz wrote:
>> Please note that the way the rule is being set to the HW driver is by delegation
>> done in flower, see these commits (specifically "Add offload support
>> using egress Hardware device")
>
> It's very well possible the bug is somewhere in net/sched.

maybe before/instead you call it a bug, take a look on the design
there and maybe
tell us how to possibly do that otherwise?

^ permalink raw reply

* [PATCH net v3] l2tp: fix race condition in l2tp_tunnel_delete
From: Sabrina Dubroca @ 2017-09-26 14:16 UTC (permalink / raw)
  To: netdev; +Cc: Sabrina Dubroca, Guillaume Nault, Xin Long, Tom Parkin

If we try to delete the same tunnel twice, the first delete operation
does a lookup (l2tp_tunnel_get), finds the tunnel, calls
l2tp_tunnel_delete, which queues it for deletion by
l2tp_tunnel_del_work.

The second delete operation also finds the tunnel and calls
l2tp_tunnel_delete. If the workqueue has already fired and started
running l2tp_tunnel_del_work, then l2tp_tunnel_delete will queue the
same tunnel a second time, and try to free the socket again.

Add a dead flag to prevent firing the workqueue twice. Then we can
remove the check of queue_work's result that was meant to prevent that
race but doesn't.

Reproducer:

    ip l2tp add tunnel tunnel_id 3000 peer_tunnel_id 4000 local 192.168.0.2 remote 192.168.0.1 encap udp udp_sport 5000 udp_dport 6000
    ip l2tp add session name l2tp1 tunnel_id 3000 session_id 1000 peer_session_id 2000
    ip link set l2tp1 up
    ip l2tp del tunnel tunnel_id 3000
    ip l2tp del tunnel tunnel_id 3000

Fixes: f8ccac0e4493 ("l2tp: put tunnel socket release on a workqueue")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
v3: only prevent double deletion, drop checks during lookups

v2: as Tom Parkin explained, we can't remove the tunnel from the
    per-net list from netlink. v2 uses only a dead flag, and adds
    corresponding checks during lookups

 net/l2tp/l2tp_core.c | 10 ++++------
 net/l2tp/l2tp_core.h |  5 ++++-
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index d8c2a89a76e1..02d61101b108 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1688,14 +1688,12 @@ EXPORT_SYMBOL_GPL(l2tp_tunnel_create);
 
 /* This function is used by the netlink TUNNEL_DELETE command.
  */
-int l2tp_tunnel_delete(struct l2tp_tunnel *tunnel)
+void l2tp_tunnel_delete(struct l2tp_tunnel *tunnel)
 {
-	l2tp_tunnel_inc_refcount(tunnel);
-	if (false == queue_work(l2tp_wq, &tunnel->del_work)) {
-		l2tp_tunnel_dec_refcount(tunnel);
-		return 1;
+	if (!test_and_set_bit(0, &tunnel->dead)) {
+		l2tp_tunnel_inc_refcount(tunnel);
+		queue_work(l2tp_wq, &tunnel->del_work);
 	}
-	return 0;
 }
 EXPORT_SYMBOL_GPL(l2tp_tunnel_delete);
 
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index 70a12df40a5f..67c79d9b5c6c 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -161,6 +161,9 @@ struct l2tp_tunnel_cfg {
 
 struct l2tp_tunnel {
 	int			magic;		/* Should be L2TP_TUNNEL_MAGIC */
+
+	unsigned long		dead;
+
 	struct rcu_head rcu;
 	rwlock_t		hlist_lock;	/* protect session_hlist */
 	bool			acpt_newsess;	/* Indicates whether this
@@ -255,7 +258,7 @@ int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id,
 		       u32 peer_tunnel_id, struct l2tp_tunnel_cfg *cfg,
 		       struct l2tp_tunnel **tunnelp);
 void l2tp_tunnel_closeall(struct l2tp_tunnel *tunnel);
-int l2tp_tunnel_delete(struct l2tp_tunnel *tunnel);
+void l2tp_tunnel_delete(struct l2tp_tunnel *tunnel);
 struct l2tp_session *l2tp_session_create(int priv_size,
 					 struct l2tp_tunnel *tunnel,
 					 u32 session_id, u32 peer_session_id,
-- 
2.14.1

^ permalink raw reply related

* RE: [PATCH] net: stmmac: Meet alignment requirements for DMA
From: David Laight @ 2017-09-26 14:23 UTC (permalink / raw)
  To: 'Matt Redfearn', David Miller
  Cc: netdev@vger.kernel.org, alexandre.torgue@st.com,
	peppe.cavallaro@st.com, linux-kernel@vger.kernel.org,
	Linux MIPS Mailing List, Paul Burton, James Hogan
In-Reply-To: <587dc9a8-b974-e222-95b4-93c2a8f2aba2@imgtec.com>

From: Matt Redfearn
> Sent: 26 September 2017 14:58
...
> > As long as you use the dma_{map,unamp}_single() and sync to/from
> > deivce interfaces properly, the cacheline issues will be handled properly
> > and the cpu and the device will see proper uptodate memory contents.
> 
> I interpret the DMA API document (and the MIPS arch dma code operates
> this way) as stating that the driver must ensure that buffers passed to
> it are cacheline aligned, and cacheline sized, to prevent cache
> management throwing away adjacent data in the same cacheline.

The important thing is that the driver must not dirty any cache lines
that are mapped for DMA (from the device).

Typically this is not a problem because the driver doesn't look
at skb (etc) that contain receive buffers once the dma is setup.

	David



^ permalink raw reply

* Re: [PATCH net-next 0/7] nfp: flower vxlan tunnel offload
From: Jiri Benc @ 2017-09-26 14:31 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Simon Horman, David Miller, Jakub Kicinski, Linux Netdev List,
	oss-drivers, John Hurley, Paolo Abeni, Paul Blakey, Jiri Pirko,
	Roi Dayan
In-Reply-To: <CAJ3xEMjOkF7vKHY=+q1dHXGdsyEXFPUa6dsGD_-M3+TX=DNouA@mail.gmail.com>

On Tue, 26 Sep 2017 17:17:02 +0300, Or Gerlitz wrote:
> maybe before/instead you call it a bug,

But it is a bug. When offloaded, the rules must not behave differently.
That's the fundamental thing about offloading. Here, the rules behave
differently when offloaded and when not. That's a bug.

> take a look on the design there and maybe
> tell us how to possibly do that otherwise?

I don't know the design. It's the responsibility of those who implement
the offloading to do it in the way that it's consistent with the
software path. That has always been the case.

This needs to be fixed. If it can't be fixed, the feature needs to be
reverted. It's not that Linux has to make use of every single offload
supported by hardware. If the offloading cannot be fit into how Linux
works, then the offload can't be supported. There are in fact many
precedents.

 Jiri

^ permalink raw reply

* Re: [PATCH] ebtables: fix race condition in frame_filter_net_init()
From: Artem Savkov @ 2017-09-26 14:34 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netdev, linux-kernel, netfilter-devel
In-Reply-To: <20170926124211.GA14971@breakpoint.cc>

On Tue, Sep 26, 2017 at 02:42:11PM +0200, Florian Westphal wrote:
> Artem Savkov <asavkov@redhat.com> wrote:
> > It is possible for ebt_in_hook to be triggered before ebt_table is assigned
> > resulting in a NULL-pointer dereference. Make sure hooks are
> > registered as the last step.
> 
> Right, thanks for the patch.
> 
> > --- a/net/bridge/netfilter/ebtable_broute.c
> > +++ b/net/bridge/netfilter/ebtable_broute.c
> > @@ -65,7 +65,7 @@ static int ebt_broute(struct sk_buff *skb)
> >  
> >  static int __net_init broute_net_init(struct net *net)
> >  {
> > -	net->xt.broute_table = ebt_register_table(net, &broute_table, NULL);
> > +	net->xt.broute_table = ebt_register_table(net, &broute_table);
> 
> I wonder if it makes more sense to model this like the iptables version,
> i.e. pass net->xt.table_name as last arg to ebt_register_table ...
> 
> > +int ebt_register_hooks(struct net *net, struct ebt_table *table,
> > +		      const struct nf_hook_ops *ops)
> > +{
> > +	int ret = nf_register_net_hooks(net, ops, hweight32(table->valid_hooks));
> > +
> > +	if (ret)
> > +		__ebt_unregister_table(net, table);
> > +
> > +	return ret;
> > +}
> 
> ... because this looks strange (unregister of table/not-so-obvious error
> unwinding ...)
> 
> > @@ -1252,15 +1262,6 @@ ebt_register_table(struct net *net, const struct ebt_table *input_table,
> >  	list_add(&table->list, &net->xt.tables[NFPROTO_BRIDGE]);
> >  	mutex_unlock(&ebt_mutex);
> 
> ... here one could then assign the net->xt.table_X pointer, and then do
> the hook registration right after.
> 
> However i have no strong opinion here.

Agreed, that does look better and requires less changes. I'll send a v2.

-- 
Regards,
  Artem

^ permalink raw reply

* Re: [PATCH net-next] bpf/verifier: improve disassembly of BPF_END instructions
From: Edward Cree @ 2017-09-26 14:37 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Y Song, David Miller, netdev, Jiong Wang, Jakub Kicinski
In-Reply-To: <20170926013259.cwnpnshay4rxnmyr@ast-mbp>

On 26/09/17 02:33, Alexei Starovoitov wrote:
> On Mon, Sep 25, 2017 at 11:44:02PM +0200, Daniel Borkmann wrote:
>> But above cast to be16 also doesn't seem quite C-like in terms
>> of what we're actually doing... 3rd option would be my personal
>> preference even if it doesn't look C-like, but otoh we also have
>> 'call' etc which is neither.

<snip>

> In that sense (be16) cast is pretty much self explanatory.
> So I'd like to continue bikesheding in hopes to convince you
> to accept either 1 or 2 above ;)
I agree with Daniel.  3rd option `r1 = be16 r1` is best, as it's an
 actual ALU operation, not just a cast.  And since it looks like we're
 drifting vaguely near a consensus on that (even if Alexei still isn't
 convinced ;-) I'll spin v2 patches with that and `r1 = (u32) -r1`, so
 we have something concrete to argue about...

-Ed

^ permalink raw reply

* Re: [PATCH net-next v10] openvswitch: enable NSH support
From: Jiri Benc @ 2017-09-26 14:42 UTC (permalink / raw)
  To: Yang, Yi
  Cc: netdev@vger.kernel.org, dev@openvswitch.org, e@erig.me,
	davem@davemloft.net
In-Reply-To: <20170926135240.GA88616@cran64.bj.intel.com>

On Tue, 26 Sep 2017 21:52:41 +0800, Yang, Yi wrote:
> > +	return ((ret != 0) ? false : true);
> 
> But I don't think this is a problematic line from my understanding,

Why not:

	return ((ret != 0 == true) ? false : true) == true;

?

Sigh. This is equal to:

	return !ret;

which you should use.

 Jiri

^ permalink raw reply

* Re: [PATCH net-next 0/7] nfp: flower vxlan tunnel offload
From: Paolo Abeni @ 2017-09-26 14:50 UTC (permalink / raw)
  To: Or Gerlitz, Jiri Benc
  Cc: Simon Horman, David Miller, Jakub Kicinski, Linux Netdev List,
	oss-drivers, John Hurley, Paul Blakey, Jiri Pirko, Roi Dayan
In-Reply-To: <CAJ3xEMjOkF7vKHY=+q1dHXGdsyEXFPUa6dsGD_-M3+TX=DNouA@mail.gmail.com>

On Tue, 2017-09-26 at 17:17 +0300, Or Gerlitz wrote:
> On Tue, Sep 26, 2017 at 3:51 PM, Jiri Benc <jbenc@redhat.com> wrote:
> > On Tue, 26 Sep 2017 15:41:37 +0300, Or Gerlitz wrote:
> > > Please note that the way the rule is being set to the HW driver is by delegation
> > > done in flower, see these commits (specifically "Add offload support
> > > using egress Hardware device")
> > 
> > It's very well possible the bug is somewhere in net/sched.
> 
> maybe before/instead you call it a bug, take a look on the design
> there and maybe
> tell us how to possibly do that otherwise?

The problem, AFAICT, is in the API between flower and NIC implementing
the offload, because in the above example the kernel will call the
offload hook with exactly the same arguments with the 'bad' rule and
the 'good' one - but the 'bad' rule should never match any packets.

I think that can be fixed changing the flower code to invoke the
offload hook for filters with tunnel-based match only if the device
specified in such match has the appropriate type, e.g. given that
currently only vxlan is supported with something like the code below
(very rough and untested, just to give the idea):

Cheers,

Paolo

---
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index d230cb4c8094..ff8476e56d4e 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -243,10 +243,11 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
                                struct fl_flow_key *mask,
                                struct cls_fl_filter *f)
 {
-       struct net_device *dev = tp->q->dev_queue->dev;
+       struct net_device *ingress_dev, *dev = tp->q->dev_queue->dev;
        struct tc_cls_flower_offload cls_flower = {};
        int err;
 
+       ingress_dev = dev;
        if (!tc_can_offload(dev)) {
                if (tcf_exts_get_dev(dev, &f->exts, &f->hw_dev) ||
                    (f->hw_dev && !tc_can_offload(f->hw_dev))) {
@@ -259,6 +260,12 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
                f->hw_dev = dev;
        }
 
+       if ((dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ENC_KEYID) ||
+            dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ENC_PORTS) ||
+             // ... list all the others tunnel based keys ...
+             ) && strcmp(ingress_dev->rtnl_link_ops->kind, "vxlan"))
+               return tc_skip_sw(f->flags) ? -EINVAL : 0;
+

^ permalink raw reply related

* [PATCH] tun: bail out from tun_get_user() if the skb is empty
From: Alexander Potapenko @ 2017-09-26 14:53 UTC (permalink / raw)
  To: davem, edumazet; +Cc: dvyukov, syzkaller, netdev, linux-kernel

KMSAN (https://github.com/google/kmsan) reported accessing uninitialized
skb->data[0] in the case the skb is empty (i.e. skb->len is 0):

================================================
BUG: KMSAN: use of uninitialized memory in tun_get_user+0x19ba/0x3770
CPU: 0 PID: 3051 Comm: probe Not tainted 4.13.0+ #3140
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
...
 __msan_warning_32+0x66/0xb0 mm/kmsan/kmsan_instr.c:477
 tun_get_user+0x19ba/0x3770 drivers/net/tun.c:1301
 tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
 call_write_iter ./include/linux/fs.h:1743
 new_sync_write fs/read_write.c:457
 __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
 vfs_write+0x3e4/0x770 fs/read_write.c:518
 SYSC_write+0x12f/0x2b0 fs/read_write.c:565
 SyS_write+0x55/0x80 fs/read_write.c:557
 do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
 entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:245
...
origin:
...
 kmsan_poison_shadow+0x6e/0xc0 mm/kmsan/kmsan.c:211
 slab_alloc_node mm/slub.c:2732
 __kmalloc_node_track_caller+0x351/0x370 mm/slub.c:4351
 __kmalloc_reserve net/core/skbuff.c:138
 __alloc_skb+0x26a/0x810 net/core/skbuff.c:231
 alloc_skb ./include/linux/skbuff.h:903
 alloc_skb_with_frags+0x1d7/0xc80 net/core/skbuff.c:4756
 sock_alloc_send_pskb+0xabf/0xfe0 net/core/sock.c:2037
 tun_alloc_skb drivers/net/tun.c:1144
 tun_get_user+0x9a8/0x3770 drivers/net/tun.c:1274
 tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
 call_write_iter ./include/linux/fs.h:1743
 new_sync_write fs/read_write.c:457
 __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
 vfs_write+0x3e4/0x770 fs/read_write.c:518
 SYSC_write+0x12f/0x2b0 fs/read_write.c:565
 SyS_write+0x55/0x80 fs/read_write.c:557
 do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
 return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:245
================================================

Make sure tun_get_user() doesn't touch skb->data[0] unless there is
actual data.

C reproducer below:
==========================
    // autogenerated by syzkaller (http://github.com/google/syzkaller)

    #define _GNU_SOURCE

    #include <fcntl.h>
    #include <linux/if_tun.h>
    #include <netinet/ip.h>
    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>

    int main()
    {
      int sock = socket(PF_INET, SOCK_STREAM, IPPROTO_IP);
      int tun_fd = open("/dev/net/tun", O_RDWR);
      struct ifreq req;
      memset(&req, 0, sizeof(struct ifreq));
      strcpy((char*)&req.ifr_name, "gre0");
      req.ifr_flags = IFF_UP | IFF_MULTICAST;
      ioctl(tun_fd, TUNSETIFF, &req);
      ioctl(sock, SIOCSIFFLAGS, "gre0");
      write(tun_fd, "hi", 0);
      return 0;
    }
==========================

Signed-off-by: Alexander Potapenko <glider@google.com>
---
 drivers/net/tun.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3c9985f29950..25d96f54a729 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1496,6 +1496,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 	switch (tun->flags & TUN_TYPE_MASK) {
 	case IFF_TUN:
 		if (tun->flags & IFF_NO_PI) {
+			if (!skb->len)
+				return -EINVAL;
 			switch (skb->data[0] & 0xf0) {
 			case 0x40:
 				pi.proto = htons(ETH_P_IP);
-- 
2.14.1.821.g8fa685d3b7-goog

^ permalink raw reply related

* Re: [PATCH net v3] l2tp: fix race condition in l2tp_tunnel_delete
From: Guillaume Nault @ 2017-09-26 14:56 UTC (permalink / raw)
  To: Sabrina Dubroca; +Cc: netdev, Xin Long, Tom Parkin
In-Reply-To: <d6011f741f7ea38a61006b9f2cde8e544b4d206b.1506432922.git.sd@queasysnail.net>

On Tue, Sep 26, 2017 at 04:16:43PM +0200, Sabrina Dubroca wrote:
> Add a dead flag to prevent firing the workqueue twice. Then we can
> remove the check of queue_work's result that was meant to prevent that
> race but doesn't.
> 
Acked-by: Guillaume Nault <g.nault@alphalink.fr>

^ permalink raw reply

* Re: [PATCH] tun: bail out from tun_get_user() if the skb is empty
From: Eric Dumazet @ 2017-09-26 15:02 UTC (permalink / raw)
  To: Alexander Potapenko; +Cc: David Miller, Dmitry Vyukov, syzkaller, netdev, LKML
In-Reply-To: <20170926145359.28344-1-glider@google.com>

On Tue, Sep 26, 2017 at 7:53 AM, Alexander Potapenko <glider@google.com> wrote:
> KMSAN (https://github.com/google/kmsan) reported accessing uninitialized
> skb->data[0] in the case the skb is empty (i.e. skb->len is 0):
>
> ================================================
> BUG: KMSAN: use of uninitialized memory in tun_get_user+0x19ba/0x3770
> CPU: 0 PID: 3051 Comm: probe Not tainted 4.13.0+ #3140
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
> ...
>  __msan_warning_32+0x66/0xb0 mm/kmsan/kmsan_instr.c:477
>  tun_get_user+0x19ba/0x3770 drivers/net/tun.c:1301
>  tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
>  call_write_iter ./include/linux/fs.h:1743
>  new_sync_write fs/read_write.c:457
>  __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
>  vfs_write+0x3e4/0x770 fs/read_write.c:518
>  SYSC_write+0x12f/0x2b0 fs/read_write.c:565
>  SyS_write+0x55/0x80 fs/read_write.c:557
>  do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
>  entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:245
> ...
> origin:
> ...
>  kmsan_poison_shadow+0x6e/0xc0 mm/kmsan/kmsan.c:211
>  slab_alloc_node mm/slub.c:2732
>  __kmalloc_node_track_caller+0x351/0x370 mm/slub.c:4351
>  __kmalloc_reserve net/core/skbuff.c:138
>  __alloc_skb+0x26a/0x810 net/core/skbuff.c:231
>  alloc_skb ./include/linux/skbuff.h:903
>  alloc_skb_with_frags+0x1d7/0xc80 net/core/skbuff.c:4756
>  sock_alloc_send_pskb+0xabf/0xfe0 net/core/sock.c:2037
>  tun_alloc_skb drivers/net/tun.c:1144
>  tun_get_user+0x9a8/0x3770 drivers/net/tun.c:1274
>  tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
>  call_write_iter ./include/linux/fs.h:1743
>  new_sync_write fs/read_write.c:457
>  __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
>  vfs_write+0x3e4/0x770 fs/read_write.c:518
>  SYSC_write+0x12f/0x2b0 fs/read_write.c:565
>  SyS_write+0x55/0x80 fs/read_write.c:557
>  do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
>  return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:245
> ================================================
>
> Make sure tun_get_user() doesn't touch skb->data[0] unless there is
> actual data.
>
> C reproducer below:
> ==========================
>     // autogenerated by syzkaller (http://github.com/google/syzkaller)
>
>     #define _GNU_SOURCE
>
>     #include <fcntl.h>
>     #include <linux/if_tun.h>
>     #include <netinet/ip.h>
>     #include <net/if.h>
>     #include <string.h>
>     #include <sys/ioctl.h>
>
>     int main()
>     {
>       int sock = socket(PF_INET, SOCK_STREAM, IPPROTO_IP);
>       int tun_fd = open("/dev/net/tun", O_RDWR);
>       struct ifreq req;
>       memset(&req, 0, sizeof(struct ifreq));
>       strcpy((char*)&req.ifr_name, "gre0");
>       req.ifr_flags = IFF_UP | IFF_MULTICAST;
>       ioctl(tun_fd, TUNSETIFF, &req);
>       ioctl(sock, SIOCSIFFLAGS, "gre0");
>       write(tun_fd, "hi", 0);
>       return 0;
>     }
> ==========================
>
> Signed-off-by: Alexander Potapenko <glider@google.com>
> ---
>  drivers/net/tun.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 3c9985f29950..25d96f54a729 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1496,6 +1496,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>         switch (tun->flags & TUN_TYPE_MASK) {
>         case IFF_TUN:
>                 if (tun->flags & IFF_NO_PI) {
> +                       if (!skb->len)
> +                               return -EINVAL;
>                         switch (skb->data[0] & 0xf0) {
>                         case 0x40:
>                                 pi.proto = htons(ETH_P_IP);
> --
> 2.14.1.821.g8fa685d3b7-goog
>

Good catch, but...

You are replacing an harmless read by a memory leak, which is far more
problematic :/

^ permalink raw reply

* Re: [PATCH v2 net-next 0/7] net: speedup netns create/delete time
From: Tariq Toukan @ 2017-09-26 15:04 UTC (permalink / raw)
  To: Eric Dumazet, Dmitry Torokhov
  Cc: David S . Miller, netdev, Eric W . Biederman, Eric Dumazet,
	Majd Dibbiny, Yonatan Cohen, Eran Ben Elisha
In-Reply-To: <CANn89i+LE6He_E3VHwUTpLAVW+uUGo+FPs_zg1bs4P3CVcCL6Q@mail.gmail.com>



On 26/09/2017 3:51 PM, Eric Dumazet wrote:
> On Tue, Sep 26, 2017 at 4:21 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>
>> Hi Eric,
>>
>> We see a regression introduced in this series, specifically in the patches
>> touching lib/kobject_uevent.c.
>> We tried to figure out what is wrong there, but couldn't point it out.
>>
>> Bug is that mlx4 driver restart fails, because mlx4_core is still in use.
>> According to module dependencies, both mlx4_en and mlx4_ib should have been
>> unloaded at this point
>> Please see log below.
>>
>> This looks to be some kind of a race, as the repro is not deterministic.
>> Probably the en/ib modules are now mistakenly reloaded.
>>
>> Any idea what could this be?
>>
>> Regards,
>> Tariq
>>
>>
>> [root@reg-l-vrt-41016-009 ~]# /etc/init.d/openibd stop
>> Unloading HCA driver:                                      [  OK  ]
>> [root@reg-l-vrt-41016-009 ~]# /etc/init.d/openibd start
>> Loading HCA driver and Access Layer:                       [  OK  ]
>> [root@reg-l-vrt-41016-009 ~]# /etc/init.d/openibd stop
>> Unloading mlx4_core                                        [FAILED]
>> rmmod: ERROR: Module mlx4_core is in use
> I have absolutely no idea. Please bisect.
We previously saw a similar issue, that was reported in mailing list.
Dmitry Torokhov suggested the following fix:
https://lkml.org/lkml/2017/9/12/523

And indeed, it solved the issue.

We kept the suggested patch in our internal branch, and rebased.
Issue appeared again once your series was accepted.

By bisecting, we see that the issue re-appears in this patch:
4a336a23d619 kobject: copy env blob in one go

>
> Are you really using netns in the first place ?
No. But seems like it still affects the modules load/unload.

Regards,
Tariq

^ permalink raw reply

* [PATCH] p54: don't unregister leds when they are inited
From: Andrey Konovalov @ 2017-09-26 15:05 UTC (permalink / raw)
  To: Christian Lamparter, Kalle Valo, linux-wireless, netdev,
	linux-kernel
  Cc: Dmitry Vyukov, Kostya Serebryany, Andrey Konovalov

ieee80211_register_hw() in p54_register_common() may fail and leds won't
get initialized. Currently p54_unregister_common() doesn't check that and
always calls p54_unregister_leds(). The fix is to check priv->registered
flag before calling p54_unregister_leds().

Found by syzkaller.

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 1404 Comm: kworker/1:1 Not tainted
4.14.0-rc1-42251-gebb2c2437d80-dirty #205
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: usb_hub_wq hub_event
Call Trace:
 __dump_stack lib/dump_stack.c:16
 dump_stack+0x292/0x395 lib/dump_stack.c:52
 register_lock_class+0x6c4/0x1a00 kernel/locking/lockdep.c:769
 __lock_acquire+0x27e/0x4550 kernel/locking/lockdep.c:3385
 lock_acquire+0x259/0x620 kernel/locking/lockdep.c:4002
 flush_work+0xf0/0x8c0 kernel/workqueue.c:2886
 __cancel_work_timer+0x51d/0x870 kernel/workqueue.c:2961
 cancel_delayed_work_sync+0x1f/0x30 kernel/workqueue.c:3081
 p54_unregister_leds+0x6c/0xc0 drivers/net/wireless/intersil/p54/led.c:160
 p54_unregister_common+0x3d/0xb0 drivers/net/wireless/intersil/p54/main.c:856
 p54u_disconnect+0x86/0x120 drivers/net/wireless/intersil/p54/p54usb.c:1073
 usb_unbind_interface+0x21c/0xa90 drivers/usb/core/driver.c:423
 __device_release_driver drivers/base/dd.c:861
 device_release_driver_internal+0x4f4/0x5c0 drivers/base/dd.c:893
 device_release_driver+0x1e/0x30 drivers/base/dd.c:918
 bus_remove_device+0x2f4/0x4b0 drivers/base/bus.c:565
 device_del+0x5c4/0xab0 drivers/base/core.c:1985
 usb_disable_device+0x1e9/0x680 drivers/usb/core/message.c:1170
 usb_disconnect+0x260/0x7a0 drivers/usb/core/hub.c:2124
 hub_port_connect drivers/usb/core/hub.c:4754
 hub_port_connect_change drivers/usb/core/hub.c:5009
 port_event drivers/usb/core/hub.c:5115
 hub_event+0x1318/0x3740 drivers/usb/core/hub.c:5195
 process_one_work+0xc7f/0x1db0 kernel/workqueue.c:2119
 process_scheduled_works kernel/workqueue.c:2179
 worker_thread+0xb2b/0x1850 kernel/workqueue.c:2255
 kthread+0x3a1/0x470 kernel/kthread.c:231
 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
---
 drivers/net/wireless/intersil/p54/main.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/intersil/p54/main.c b/drivers/net/wireless/intersil/p54/main.c
index d5a3bf91a03e..ab6d39e12069 100644
--- a/drivers/net/wireless/intersil/p54/main.c
+++ b/drivers/net/wireless/intersil/p54/main.c
@@ -852,12 +852,11 @@ void p54_unregister_common(struct ieee80211_hw *dev)
 {
 	struct p54_common *priv = dev->priv;
 
-#ifdef CONFIG_P54_LEDS
-	p54_unregister_leds(priv);
-#endif /* CONFIG_P54_LEDS */
-
 	if (priv->registered) {
 		priv->registered = false;
+#ifdef CONFIG_P54_LEDS
+		p54_unregister_leds(priv);
+#endif /* CONFIG_P54_LEDS */
 		ieee80211_unregister_hw(dev);
 	}
 
-- 
2.14.1.821.g8fa685d3b7-goog

^ permalink raw reply related

* Re: [RESEND] Re: usb/net/p54: trying to register non-static key in p54_unregister_leds
From: Andrey Konovalov @ 2017-09-26 15:06 UTC (permalink / raw)
  To: Christian Lamparter
  Cc: Johannes Berg, Kalle Valo, linux-wireless, netdev, LKML,
	Dmitry Vyukov, Kostya Serebryany, syzkaller, Stephen Boyd,
	Tejun Heo, Yong Zhang
In-Reply-To: <2589427.Vd4nrgaY4N@debian64>

On Sat, Sep 23, 2017 at 9:37 PM, 'Christian Lamparter' via syzkaller
<syzkaller@googlegroups.com> wrote:
> This got rejected by gmail once. Let's see if it works now.
>
> On Thursday, September 21, 2017 8:22:45 PM CEST Andrey Konovalov wrote:
>> On Wed, Sep 20, 2017 at 9:55 PM, Johannes Berg
>> <johannes@sipsolutions.net> wrote:
>> > On Wed, 2017-09-20 at 21:27 +0200, Christian Lamparter wrote:
>> >
>> >> It seems this is caused as a result of:
>> >>     -> lock_map_acquire(&work->lockdep_map);
>> >>           lock_map_release(&work->lockdep_map);
>> >>
>> >>     in flush_work() [0]
>> >
>> > Agree.
>> >
>> >> This was added by:
>> >>
>> >>       commit 0976dfc1d0cd80a4e9dfaf87bd8744612bde475a
>> >>       Author: Stephen Boyd <sboyd@codeaurora.org>
>> >>       Date:   Fri Apr 20 17:28:50 2012 -0700
>> >>
>> >>       workqueue: Catch more locking problems with flush_work()
>> >
>> > Yes, but that doesn't matter.
>> >
>> >> Looking at the Stephen's patch, it's clear that it was made
>> >> with "static DECLARE_WORK(work, my_work)" in mind. However
>> >> p54's led_work is "per-device", hence it is stored in the
>> >> devices context p54_common, which is dynamically allocated.
>> >> So, maybe revert Stephen's patch?
>> >
>> > I disagree - as the lockdep warning says:
>> >
>> >> > INFO: trying to register non-static key.
>> >> > the code is fine but needs lockdep annotation.
>> >> > turning off the locking correctness validator.
>> >
>> > What it needs is to actually correctly go through initializing the work
>> > at least once.
>> >
>> > Without more information, I can't really say what's going on, but I
>> > assume that something is failing and p54_unregister_leds() is getting
>> > invoked without p54_init_leds() having been invoked, so essentially
>> > it's trying to flush a work that was never initialized?
>> >
>> > INIT_DELAYED_WORK() does, after all, initialize the lockdep map
>> > properly via __INIT_WORK().
>
> Ok, thanks. This does indeed explain it.
>
> But this also begs the question: Is this really working then?
> From what I can tell, if CONFIG_LOCKDEP is not set then there's no BUG
> no WARN, no other splat or any other odd system behaviour. Does
> [cancel | flush]_[delayed_]work[_sync] really "just work" by *accident*,
> as long the delayed_work | work_struct is zeroed out?
> And should it work in the future as well?
>
>> Since I'm able to reproduce this, please let me know if you need me to
>> collect some debug traces to help with the triage.
>
> Do you want to take a shot at making a patch too? At a quick glance, it
> should be enough to move the [#ifdef CONFIG_P54_LEDS ... #endif] block
> in p54_unregister_common() into the if (priv->registered) { block
> (preferably before the ieee80211_unregister_hw(dev).

Just mailed a patch.

>
> Regards,
> Christian
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+unsubscribe@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: [PATCH] tun: bail out from tun_get_user() if the skb is empty
From: Alexander Potapenko @ 2017-09-26 15:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Dmitry Vyukov, syzkaller, netdev, LKML
In-Reply-To: <CANn89i+0uj=N2aWc6wUq4EK1jHmEVoxw0UcCPiFoggGTea++Dw@mail.gmail.com>

On Tue, Sep 26, 2017 at 5:02 PM, 'Eric Dumazet' via syzkaller
<syzkaller@googlegroups.com> wrote:
> On Tue, Sep 26, 2017 at 7:53 AM, Alexander Potapenko <glider@google.com> wrote:
>> KMSAN (https://github.com/google/kmsan) reported accessing uninitialized
>> skb->data[0] in the case the skb is empty (i.e. skb->len is 0):
>>
>> ================================================
>> BUG: KMSAN: use of uninitialized memory in tun_get_user+0x19ba/0x3770
>> CPU: 0 PID: 3051 Comm: probe Not tainted 4.13.0+ #3140
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> Call Trace:
>> ...
>>  __msan_warning_32+0x66/0xb0 mm/kmsan/kmsan_instr.c:477
>>  tun_get_user+0x19ba/0x3770 drivers/net/tun.c:1301
>>  tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
>>  call_write_iter ./include/linux/fs.h:1743
>>  new_sync_write fs/read_write.c:457
>>  __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
>>  vfs_write+0x3e4/0x770 fs/read_write.c:518
>>  SYSC_write+0x12f/0x2b0 fs/read_write.c:565
>>  SyS_write+0x55/0x80 fs/read_write.c:557
>>  do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
>>  entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:245
>> ...
>> origin:
>> ...
>>  kmsan_poison_shadow+0x6e/0xc0 mm/kmsan/kmsan.c:211
>>  slab_alloc_node mm/slub.c:2732
>>  __kmalloc_node_track_caller+0x351/0x370 mm/slub.c:4351
>>  __kmalloc_reserve net/core/skbuff.c:138
>>  __alloc_skb+0x26a/0x810 net/core/skbuff.c:231
>>  alloc_skb ./include/linux/skbuff.h:903
>>  alloc_skb_with_frags+0x1d7/0xc80 net/core/skbuff.c:4756
>>  sock_alloc_send_pskb+0xabf/0xfe0 net/core/sock.c:2037
>>  tun_alloc_skb drivers/net/tun.c:1144
>>  tun_get_user+0x9a8/0x3770 drivers/net/tun.c:1274
>>  tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
>>  call_write_iter ./include/linux/fs.h:1743
>>  new_sync_write fs/read_write.c:457
>>  __vfs_write+0x6c3/0x7f0 fs/read_write.c:470
>>  vfs_write+0x3e4/0x770 fs/read_write.c:518
>>  SYSC_write+0x12f/0x2b0 fs/read_write.c:565
>>  SyS_write+0x55/0x80 fs/read_write.c:557
>>  do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
>>  return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:245
>> ================================================
>>
>> Make sure tun_get_user() doesn't touch skb->data[0] unless there is
>> actual data.
>>
>> C reproducer below:
>> ==========================
>>     // autogenerated by syzkaller (http://github.com/google/syzkaller)
>>
>>     #define _GNU_SOURCE
>>
>>     #include <fcntl.h>
>>     #include <linux/if_tun.h>
>>     #include <netinet/ip.h>
>>     #include <net/if.h>
>>     #include <string.h>
>>     #include <sys/ioctl.h>
>>
>>     int main()
>>     {
>>       int sock = socket(PF_INET, SOCK_STREAM, IPPROTO_IP);
>>       int tun_fd = open("/dev/net/tun", O_RDWR);
>>       struct ifreq req;
>>       memset(&req, 0, sizeof(struct ifreq));
>>       strcpy((char*)&req.ifr_name, "gre0");
>>       req.ifr_flags = IFF_UP | IFF_MULTICAST;
>>       ioctl(tun_fd, TUNSETIFF, &req);
>>       ioctl(sock, SIOCSIFFLAGS, "gre0");
>>       write(tun_fd, "hi", 0);
>>       return 0;
>>     }
>> ==========================
>>
>> Signed-off-by: Alexander Potapenko <glider@google.com>
>> ---
>>  drivers/net/tun.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 3c9985f29950..25d96f54a729 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1496,6 +1496,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>>         switch (tun->flags & TUN_TYPE_MASK) {
>>         case IFF_TUN:
>>                 if (tun->flags & IFF_NO_PI) {
>> +                       if (!skb->len)
>> +                               return -EINVAL;
>>                         switch (skb->data[0] & 0xf0) {
>>                         case 0x40:
>>                                 pi.proto = htons(ETH_P_IP);
>> --
>> 2.14.1.821.g8fa685d3b7-goog
>>
>
> Good catch, but...
>
> You are replacing an harmless read by a memory leak, which is far more
> problematic :/
Ah, you're right, I should have done kfree_skb() and incremented the
stats. That's also rx_dropped, correct?
> --
> You received this message because you are subscribed to the Google Groups "syzkaller" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+unsubscribe@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

^ permalink raw reply

* Re: [PATCH] p54: don't unregister leds when they are inited
From: Johannes Berg @ 2017-09-26 15:08 UTC (permalink / raw)
  To: Andrey Konovalov, Christian Lamparter, Kalle Valo, linux-wireless,
	netdev, linux-kernel
  Cc: Dmitry Vyukov, Kostya Serebryany
In-Reply-To: <384966f3a79915b4617d808c40e63072aa8b868d.1506438160.git.andreyknvl@google.com>

Subject should say *not* initialized?

johannes

^ permalink raw reply

* Re: [PATCH net-next 2/7] nfp: compile flower vxlan tunnel metadata match fields
From: John Hurley @ 2017-09-26 15:11 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Simon Horman, David Miller, Jakub Kicinski, Linux Netdev List,
	oss-drivers
In-Reply-To: <CAJ3xEMgmC+4V+Yr9PeO1wJ=ACBoPMTrasd7Joyx--F9PjV1yvg@mail.gmail.com>

On Tue, Sep 26, 2017 at 3:12 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Tue, Sep 26, 2017 at 4:58 PM, John Hurley <john.hurley@netronome.com> wrote:
>> On Mon, Sep 25, 2017 at 7:35 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>> On Mon, Sep 25, 2017 at 1:23 PM, Simon Horman
>>> <simon.horman@netronome.com> wrote:
>>>> From: John Hurley <john.hurley@netronome.com>
>>>>
>>>> Compile ovs-tc flower vxlan metadata match fields for offloading. Only
>>>
>>> anything in the npf kernel bits has direct relation to ovs? what?
>>>
>>
>> Sorry, this is a typo  and should refer to TC.
>>
>>>> +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>>> @@ -52,8 +52,25 @@
>>>>          BIT(FLOW_DISSECTOR_KEY_PORTS) | \
>>>>          BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) | \
>>>>          BIT(FLOW_DISSECTOR_KEY_VLAN) | \
>>>> +        BIT(FLOW_DISSECTOR_KEY_ENC_KEYID) | \
>>>> +        BIT(FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS) | \
>>>> +        BIT(FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS) | \
>>>
>>> this series takes care of IPv6 tunnels too?
>>
>> IPv6 is not included in this set.
>> The reason the IPv6 bit is included here is to account for behavior we
>> have noticed in TC flower.
>> If, for example, I add a filter with the following match fields:
>> 'protocol ip flower enc_src_ip 10.0.0.1 enc_dst_ip 10.0.0.2
>> enc_dst_port 4789 enc_key_id 123'
>> The 'used_keys' value in the dissector marks both IPv4 and IPv6 encap
>> addresses as 'used'.
>> I am not sure if this is a bug in TC or that we are expected to check
>> the enc_control fields to determine if IPv4 or v6 addresses are used.
>
> you should have your code to check enc_control->addr_type to be
> FLOW_DISSECTOR_KEY_IPV4_ADDRS or IPV6_ADDRS
>
>
>> Including the IPv6 used_keys bit in our whitelist approach allows us
>> to accept legitimate IPv4 tunnel rules in these situations.
>
> mmm can please take a look on fl_init_dissector() and tell me if you
> see why FLOW_DISSECTOR_KEY_IPV6_ADDRS is set for ipv4 tunnels,
> I am not sure.


The fl_init_dissector uses the FL_KEY_SET_IF_MASKED macro to set an
array of keys which are then translated to the used_keys values.
The FL_KEY_SET_IF_MASKED takes a 'struct fl_flow_key' as input and
checks if any mask bits are set in a particular field - if so it
eventually marks it as used.
In struct fl_flow_key, the encap ipv4 and ipv6 addresses are
represented as a union of the 2.
Therefore, if we have masked bits set for IPv4, they are also being
set for the IPv6 field.


>
>> If it is found to be IPv6 when the rule is parsed, it will be rejected here.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox