Netdev List
 help / color / mirror / Atom feed
* [PATCH 4.14] tcp: fix tcp_rtx_queue_tail in case of empty retransmit queue
From: Tim Froidcoeur @ 2019-08-24  6:03 UTC (permalink / raw)
  To: matthieu.baerts
  Cc: aprout, cpaasch, davem, edumazet, gregkh, jonathan.lemon, jtl,
	linux-kernel, mkubecek, ncardwell, sashal, stable, tim.froidcoeur,
	ycheng, netdev
In-Reply-To: <529376a4-cf63-f225-ce7c-4747e9966938@tessares.net>

Commit 8c3088f895a0 ("tcp: be more careful in tcp_fragment()")
triggers following stack trace:

[25244.848046] kernel BUG at ./include/linux/skbuff.h:1406!
[25244.859335] RIP: 0010:skb_queue_prev+0x9/0xc
[25244.888167] Call Trace:
[25244.889182]  <IRQ>
[25244.890001]  tcp_fragment+0x9c/0x2cf
[25244.891295]  tcp_write_xmit+0x68f/0x988
[25244.892732]  __tcp_push_pending_frames+0x3b/0xa0
[25244.894347]  tcp_data_snd_check+0x2a/0xc8
[25244.895775]  tcp_rcv_established+0x2a8/0x30d
[25244.897282]  tcp_v4_do_rcv+0xb2/0x158
[25244.898666]  tcp_v4_rcv+0x692/0x956
[25244.899959]  ip_local_deliver_finish+0xeb/0x169
[25244.901547]  __netif_receive_skb_core+0x51c/0x582
[25244.903193]  ? inet_gro_receive+0x239/0x247
[25244.904756]  netif_receive_skb_internal+0xab/0xc6
[25244.906395]  napi_gro_receive+0x8a/0xc0
[25244.907760]  receive_buf+0x9a1/0x9cd
[25244.909160]  ? load_balance+0x17a/0x7b7
[25244.910536]  ? vring_unmap_one+0x18/0x61
[25244.911932]  ? detach_buf+0x60/0xfa
[25244.913234]  virtnet_poll+0x128/0x1e1
[25244.914607]  net_rx_action+0x12a/0x2b1
[25244.915953]  __do_softirq+0x11c/0x26b
[25244.917269]  ? handle_irq_event+0x44/0x56
[25244.918695]  irq_exit+0x61/0xa0
[25244.919947]  do_IRQ+0x9d/0xbb
[25244.921065]  common_interrupt+0x85/0x85
[25244.922479]  </IRQ>

tcp_rtx_queue_tail() (called by tcp_fragment()) can call
tcp_write_queue_prev() on the first packet in the queue, which will trigger
the BUG in tcp_write_queue_prev(), because there is no previous packet.

This happens when the retransmit queue is empty, for example in case of a
zero window.

Patch is needed for 4.4, 4.9 and 4.14 stable branches.

Fixes: 8c3088f895a0 ("tcp: be more careful in tcp_fragment()")
Change-Id: I839bde7167ae59e2f7d916c913507372445765c5
Signed-off-by: Tim Froidcoeur <tim.froidcoeur@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
---
 include/net/tcp.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9de2c8cdcc51..1e70ca75c8bf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1705,6 +1705,10 @@ static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
 {
 	struct sk_buff *skb = tcp_send_head(sk);

+	/* empty retransmit queue, for example due to zero window */
+	if (skb == tcp_write_queue_head(sk))
+		return NULL;
+
 	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
 }

--
2.23.0


-- 


Disclaimer: https://www.tessares.net/mail-disclaimer/ 
<https://www.tessares.net/mail-disclaimer/>



^ permalink raw reply related

* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-24  5:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jiri Pirko, Jiri Pirko, David S . Miller, Kirti Wankhede,
	Cornelia Huck, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia, netdev@vger.kernel.org
In-Reply-To: <20190823225929.38fd86f5@x1.home>



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, August 24, 2019 10:29 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Jiri Pirko <jiri@resnulli.us>; Jiri Pirko <jiri@mellanox.com>; David S .
> Miller <davem@davemloft.net>; Kirti Wankhede <kwankhede@nvidia.com>;
> Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; cjia <cjia@nvidia.com>; netdev@vger.kernel.org
> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> 
> On Sat, 24 Aug 2019 03:56:08 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > > -----Original Message-----
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Saturday, August 24, 2019 1:14 AM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: Jiri Pirko <jiri@resnulli.us>; Jiri Pirko <jiri@mellanox.com>;
> > > David S . Miller <davem@davemloft.net>; Kirti Wankhede
> > > <kwankhede@nvidia.com>; Cornelia Huck <cohuck@redhat.com>;
> > > kvm@vger.kernel.org; linux- kernel@vger.kernel.org; cjia
> > > <cjia@nvidia.com>; netdev@vger.kernel.org
> > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > >
> > > On Fri, 23 Aug 2019 18:00:30 +0000
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Friday, August 23, 2019 10:47 PM
> > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > Cc: Jiri Pirko <jiri@resnulli.us>; Jiri Pirko
> > > > > <jiri@mellanox.com>; David S . Miller <davem@davemloft.net>;
> > > > > Kirti Wankhede <kwankhede@nvidia.com>; Cornelia Huck
> > > > > <cohuck@redhat.com>; kvm@vger.kernel.org; linux-
> > > > > kernel@vger.kernel.org; cjia <cjia@nvidia.com>;
> > > > > netdev@vger.kernel.org
> > > > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > > >
> > > > > On Fri, 23 Aug 2019 16:14:04 +0000 Parav Pandit
> > > > > <parav@mellanox.com> wrote:
> > > > >
> > > > > > > > Idea is to have mdev alias as optional.
> > > > > > > > Each mdev_parent says whether it wants mdev_core to
> > > > > > > > generate an alias or not. So only networking device drivers
> would set it to true.
> > > > > > > > For rest, alias won't be generated, and won't be compared
> > > > > > > > either during creation time. User continue to provide only uuid.
> > > > > > >
> > > > > > > Ok
> > > > > > >
> > > > > > > > I am tempted to have alias collision detection only within
> > > > > > > > children mdevs of the same parent, but doing so will
> > > > > > > > always mandate to prefix in netdev name. And currently we
> > > > > > > > are left with only 3 characters to prefix it, so that may not be
> good either.
> > > > > > > > Hence, I think mdev core wide alias is better with 12 characters.
> > > > > > >
> > > > > > > I suppose it depends on the API, if the vendor driver can
> > > > > > > ask the mdev core for an alias as part of the device
> > > > > > > creation process, then it could manage the netdev namespace
> > > > > > > for all its devices, choosing how many characters to use,
> > > > > > > and fail the creation if it can't meet a uniqueness
> > > > > > > requirement.  IOW, mdev-core would always provide a full
> > > > > > > sha1 and therefore gets itself out of the uniqueness/collision
> aspects.
> > > > > > >
> > > > > > This doesn't work. At mdev core level 20 bytes sha1 are
> > > > > > unique, so mdev core allowed to create a mdev.
> > > > >
> > > > > The mdev vendor driver has the opportunity to fail the device
> > > > > creation in mdev_parent_ops.create().
> > > > >
> > > > That is not helpful for below reasons.
> > > > 1. vendor driver doesn't have visibility in other vendor's alias.
> > > > 2. Even for single vendor, it needs to maintain global list of
> > > > devices to see
> > > collision.
> > > > 3. multiple vendors needs to implement same scheme.
> > > >
> > > > Mdev core should be the owner. Shifting ownership from one layer
> > > > to a lower layer in vendor driver doesn't solve the problem (if
> > > > there is one, which I think doesn't exist).
> > > >
> > > > > > And then devlink core chooses
> > > > > > only 6 bytes (12 characters) and there is collision. Things
> > > > > > fall apart. Since mdev provides unique uuid based scheme, it's
> > > > > > the mdev core's ownership to provide unique aliases.
> > > > >
> > > > > You're suggesting/contemplating multiple solutions here, 3-char
> > > > > prefix + 12- char sha1 vs <parent netdev> + ?-char sha1.  Also,
> > > > > the 15-char total limit is imposed by an external subsystem,
> > > > > where the vendor driver is the gateway between that subsystem
> > > > > and mdev.  How would mdev integrate with another subsystem that
> > > > > maybe only has 9-chars available?  Would the vendor driver API
> > > > > specify "I need an alias" or would it specify "I need an X-char length
> alias"?
> > > > Yes, Vendor driver should say how long the alias it wants.
> > > > However before we implement that, I suggest let such
> > > > vendor/user/driver arrive which needs that. Such variable length
> > > > alias can be added at that time and even with that alias collision
> > > > can be detected by single mdev module.
> > >
> > > If we agree that different alias lengths are possible, then I would
> > > request that minimally an mdev sample driver be modified to request
> > > an alias with a length that can be adjusted without recompiling in order
> to exercise the collision path.
> > >
> > Yes. this can be done. But I fail to understand the need to do so.
> > It is not the responsibility of the mdev core to show case sha1
> > collision efficiency/deficiency. So why do you insist exercise it?
> 
> I don't understand what you're trying to imply with "show case sha1 collision
> efficiency/deficiency".  Are you suggesting that I'm asking for this feature to
> experimentally test the probability of collisions at different character
> lengths?  We can use shell scripts for that.
> I'm simply observing that collisions are possible based on user input, but
> they're not practical to test for at the character lengths we're using.
> Therefore, how do I tell QA to develop a tests to make sure the kernel and
> userspace tools that might be involved behave correctly when this rare event
> occurs?
>
Ok. so you want to have code coverage and want to add a knob for that.
That is fine. I will have the mdev_parent->ops.alias_len as API instead of bool.
And extend mtty module parameter to set the alias length.

Unfortunately similar code coverage doesn't exist for API like mdev_get/set_iommu_device() in sample of real vendor driver.
And QA is not able to test this functionality without tainting the kernel.

^ permalink raw reply

* Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Alex Williamson @ 2019-08-24  4:59 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jiri Pirko, Jiri Pirko, David S . Miller, Kirti Wankhede,
	Cornelia Huck, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB4866008B0571B90DAFFADA97D1A70@AM0PR05MB4866.eurprd05.prod.outlook.com>

On Sat, 24 Aug 2019 03:56:08 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Saturday, August 24, 2019 1:14 AM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Jiri Pirko <jiri@resnulli.us>; Jiri Pirko <jiri@mellanox.com>; David S . Miller
> > <davem@davemloft.net>; Kirti Wankhede <kwankhede@nvidia.com>; Cornelia
> > Huck <cohuck@redhat.com>; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org; cjia <cjia@nvidia.com>; netdev@vger.kernel.org
> > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > 
> > On Fri, 23 Aug 2019 18:00:30 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > > -----Original Message-----
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Friday, August 23, 2019 10:47 PM
> > > > To: Parav Pandit <parav@mellanox.com>
> > > > Cc: Jiri Pirko <jiri@resnulli.us>; Jiri Pirko <jiri@mellanox.com>;
> > > > David S . Miller <davem@davemloft.net>; Kirti Wankhede
> > > > <kwankhede@nvidia.com>; Cornelia Huck <cohuck@redhat.com>;
> > > > kvm@vger.kernel.org; linux- kernel@vger.kernel.org; cjia
> > > > <cjia@nvidia.com>; netdev@vger.kernel.org
> > > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > > >
> > > > On Fri, 23 Aug 2019 16:14:04 +0000
> > > > Parav Pandit <parav@mellanox.com> wrote:
> > > >  
> > > > > > > Idea is to have mdev alias as optional.
> > > > > > > Each mdev_parent says whether it wants mdev_core to generate
> > > > > > > an alias or not. So only networking device drivers would set it to true.
> > > > > > > For rest, alias won't be generated, and won't be compared
> > > > > > > either during creation time. User continue to provide only uuid.  
> > > > > >
> > > > > > Ok
> > > > > >  
> > > > > > > I am tempted to have alias collision detection only within
> > > > > > > children mdevs of the same parent, but doing so will always
> > > > > > > mandate to prefix in netdev name. And currently we are left
> > > > > > > with only 3 characters to prefix it, so that may not be good either.
> > > > > > > Hence, I think mdev core wide alias is better with 12 characters.  
> > > > > >
> > > > > > I suppose it depends on the API, if the vendor driver can ask
> > > > > > the mdev core for an alias as part of the device creation
> > > > > > process, then it could manage the netdev namespace for all its
> > > > > > devices, choosing how many characters to use, and fail the
> > > > > > creation if it can't meet a uniqueness requirement.  IOW,
> > > > > > mdev-core would always provide a full
> > > > > > sha1 and therefore gets itself out of the uniqueness/collision aspects.
> > > > > >  
> > > > > This doesn't work. At mdev core level 20 bytes sha1 are unique, so
> > > > > mdev core allowed to create a mdev.  
> > > >
> > > > The mdev vendor driver has the opportunity to fail the device
> > > > creation in mdev_parent_ops.create().
> > > >  
> > > That is not helpful for below reasons.
> > > 1. vendor driver doesn't have visibility in other vendor's alias.
> > > 2. Even for single vendor, it needs to maintain global list of devices to see  
> > collision.  
> > > 3. multiple vendors needs to implement same scheme.
> > >
> > > Mdev core should be the owner. Shifting ownership from one layer to a
> > > lower layer in vendor driver doesn't solve the problem (if there is
> > > one, which I think doesn't exist).
> > >  
> > > > > And then devlink core chooses
> > > > > only 6 bytes (12 characters) and there is collision. Things fall
> > > > > apart. Since mdev provides unique uuid based scheme, it's the mdev
> > > > > core's ownership to provide unique aliases.  
> > > >
> > > > You're suggesting/contemplating multiple solutions here, 3-char
> > > > prefix + 12- char sha1 vs <parent netdev> + ?-char sha1.  Also, the
> > > > 15-char total limit is imposed by an external subsystem, where the
> > > > vendor driver is the gateway between that subsystem and mdev.  How
> > > > would mdev integrate with another subsystem that maybe only has
> > > > 9-chars available?  Would the vendor driver API specify "I need an
> > > > alias" or would it specify "I need an X-char length alias"?  
> > > Yes, Vendor driver should say how long the alias it wants.
> > > However before we implement that, I suggest let such
> > > vendor/user/driver arrive which needs that. Such variable length alias
> > > can be added at that time and even with that alias collision can be
> > > detected by single mdev module.  
> > 
> > If we agree that different alias lengths are possible, then I would request that
> > minimally an mdev sample driver be modified to request an alias with a length
> > that can be adjusted without recompiling in order to exercise the collision path.
> >   
> Yes. this can be done. But I fail to understand the need to do so.
> It is not the responsibility of the mdev core to show case sha1
> collision efficiency/deficiency. So why do you insist exercise it?

I don't understand what you're trying to imply with "show case sha1
collision efficiency/deficiency".  Are you suggesting that I'm asking
for this feature to experimentally test the probability of collisions
at different character lengths?  We can use shell scripts for that.
I'm simply observing that collisions are possible based on user input,
but they're not practical to test for at the character lengths we're
using.  Therefore, how do I tell QA to develop a tests to make sure the
kernel and userspace tools that might be involved behave correctly when
this rare event occurs?

As I mentioned previously, we can burn the cpu cyles to find some uuids
which will collide with our aliases, but the more accessible approach
seems to be to have a tune-able to reduce the alias address space such
that we can simply throw enough random uuids into the test to guarantee
a collision.  Simply generating 10,000 devices with a 12-character
alias, as you suggested previously, has effectively a 0% probability of
generating a collision.

If we accept that different vendor drivers might have different alias
requirements, and therefore the vendor driver should have the ability
to specify an alias length, then this all fits very nicely into
modifying a sample driver to request a sufficiently short alias such
that we can use it to test the behavior of mdev-core and surrounding
code when an alias collision occurs.

> > If mdev-core is guaranteeing uniqueness, does this indicate that
> > each alias length constitutes a separate namespace?  ie. strictly a
> > strcmp(), not a strncmp() to the shorter alias.
> >   
> Yes.
> 
> 
> > > > Does it make sense that mdev-core would fail creation of a
> > > > device if there's a collision in the 12-char address space
> > > > between different subsystems?  For example, does
> > > > enm0123456789ab really collide with xyz0123456789ab?  
> > > I think so, because at mdev level its 12-char alias matters.
> > > Choosing the prefix not adding prefix is really a user space
> > > choice. 
> > > >  So if
> > > > mdev were to provided a 40-char sha1, is it possible that the
> > > > vendor driver could consume this in its create callback,
> > > > truncate it to the number of chars required by the vendor
> > > > driver's subsystem, and determine whether a collision exists?  
> > > We shouldn't shift the problem from mdev to multiple vendor
> > > drivers to detect collision.
> > >
> > > I still think that user providing alias is better because it
> > > knows the use-case system in use, and eliminates these collision
> > > issue.  
> > 
> > How is a user provided alias immune from collisions?  The burden is
> > on the user to provide both a unique uuid and a unique alias.  That
> > makes it trivial to create a collision.
> >   
> Than such collision should have occurred for other subsystem such as
> netdev while creating vlan, macvlan, ipvlan, vxlan and more devices
> who are named by the user. But that isn't the case.
> 
> > > > > > > I do not understand how an extra character reduces
> > > > > > > collision, if that's what you meant.  
> > > > > >
> > > > > > If the default were for example 3-chars, we might already
> > > > > > have device 'abc'.  A collision would expose one more char
> > > > > > of the new device, so we might add device with alias
> > > > > > 'abcd'.  I mentioned previously that this leaves an issue
> > > > > > for userspace that we can't change the alias of device abc,
> > > > > > so without additional information, userspace can only
> > > > > > determine via elimination the mapping of alias to device,
> > > > > > but userspace has more information available to it in the
> > > > > > form of sysfs links.  
> > > > > > > Module options are almost not encouraged anymore with
> > > > > > > other subsystems/drivers.  
> > > > > >
> > > > > > We don't live in a world of absolutes.  I agree that the
> > > > > > defaults should work in the vast majority of cases.
> > > > > > Requiring a user to twiddle module options to make things
> > > > > > work is undesirable, verging on a bug.  A module option to
> > > > > > enable some specific feature, unsafe condition, or test
> > > > > > that is outside of the typical use case is reasonable,
> > > > > > imo.  
> > > > > > > For testing collision rate, a sample user space script and
> > > > > > > sample mtty is easy and get us collision count too. We
> > > > > > > shouldn't put that using module option in production
> > > > > > > kernel. I practically have the code ready to play with;
> > > > > > > Changing 12 to smaller value is easy with module reload.
> > > > > > >
> > > > > > > #define MDEV_ALIAS_LEN 12  
> > > > > >
> > > > > > If it can't be tested with a shipping binary, it probably
> > > > > > won't be tested.  Thanks,  
> > > > > It is not the role of mdev core to expose collision
> > > > > efficiency/deficiency of the sha1. It can be tested outside
> > > > > before mdev choose to use it.  
> > > >
> > > > The testing I'm considering is the user and kernel response to a
> > > > collision.  
> > > > > I am saying we should test with 12 characters with 10,000 or
> > > > > more devices and see how collision occurs. Even if collision
> > > > > occurs, mdev returns EEXIST status indicating user to pick a
> > > > > different UUID for those rare conditions.  
> > > >
> > > > The only way we're going to see collision with a 12-char sha1
> > > > is if we burn the CPU cycles to find uuids that collide in that
> > > > space. 10,000 devices is not remotely enough to generate a
> > > > collision in that address space.  That puts a prerequisite in
> > > > place that in order to test collision, someone needs to know
> > > > certain magic inputs. OTOH, if we could use a shorter
> > > > abbreviation, collisions are trivial to test experimentally.
> > > > Thanks,  
> > > Yes, and therefore a sane user who wants to create more mdevs,
> > > wouldn't intentionally stress it to see failures.  
> > 
> > I don't understand this logic.  I'm simply asking that we have a
> > way to test the collision behavior without changing the binary.
> > The path we're driving towards seems to be making this easier and
> > easier.  If the vendor can request an alias of a specific length,
> > then a sample driver with a module option to set the desired alias
> > length to 1-char makes it trivially easy to induce a collision.    
> Sure it is easy to test collision, but my point is - mdev core is not
> sha1 test module. Hence adding functionality of variable alias length
> to test collision doesn't make sense. When the actual user arrives
> who needs small alias, we will be able to add additional pieces very
> easily.
> 
> > It doesn't
> > even need to be exposed in a real driver.  Besides, when do we ever
> > get to design interfaces that only worry about sane users???
> > Thanks, 
> I intent to say that a sane user who wants to create mdev's will just
> work fine with less collision. If there is collision EEXIST is
> returns and sane user picks different UUID. If user is intentionally
> picking UUIDs in such a way that triggers sha1 collision, his
> intention is likely to not create mdevs for actual use. And if
> interface returns error code it is still fine.

This is exactly the scenarios that I'm asking "how do we test that it
works as we expect".  I can test that passing identical uuids into the
mdev create interface only allows the first to succeed.  With a 12-char
sha1 alias, it's not practical to construct a test to validate the
alias collision behavior.  Do you suggest we rely only on code
inspection instead?  Thanks,

Alex

^ permalink raw reply

* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-24  4:45 UTC (permalink / raw)
  To: Parav Pandit, Alex Williamson
  Cc: Jiri Pirko, Jiri Pirko, David S . Miller, Kirti Wankhede,
	Cornelia Huck, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB4866008B0571B90DAFFADA97D1A70@AM0PR05MB4866.eurprd05.prod.outlook.com>

Hi Alex,

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org <linux-kernel-
> owner@vger.kernel.org> On Behalf Of Parav Pandit
> Sent: Saturday, August 24, 2019 9:26 AM
> To: Alex Williamson <alex.williamson@redhat.com>
> Cc: Jiri Pirko <jiri@resnulli.us>; Jiri Pirko <jiri@mellanox.com>; David S .
> Miller <davem@davemloft.net>; Kirti Wankhede <kwankhede@nvidia.com>;
> Cornelia Huck <cohuck@redhat.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; cjia <cjia@nvidia.com>; netdev@vger.kernel.org
> Subject: RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > I don't understand this logic.  I'm simply asking that we have a way
> > to test the collision behavior without changing the binary.  The path
> > we're driving towards seems to be making this easier and easier.  If
> > the vendor can request an alias of a specific length, then a sample
> > driver with a module option to set the desired alias length to 1-char makes
> it trivially easy to induce a collision.
> Sure it is easy to test collision, but my point is - mdev core is not sha1 test
> module.
> Hence adding functionality of variable alias length to test collision doesn't
> make sense.
> When the actual user arrives who needs small alias, we will be able to add
> additional pieces very easily.

My initial thoughts to add parent_ops to have bool flag to generate alias or not.
However, instead of bool, keeping it unsigned int to say, zero to skip alias and non-zero length to convey generate alias.
This will serve both the purpose with trivial handling.



^ permalink raw reply

* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-24  3:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jiri Pirko, Jiri Pirko, David S . Miller, Kirti Wankhede,
	Cornelia Huck, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	cjia, netdev@vger.kernel.org
In-Reply-To: <20190823134337.37e4b215@x1.home>



> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, August 24, 2019 1:14 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Jiri Pirko <jiri@resnulli.us>; Jiri Pirko <jiri@mellanox.com>; David S . Miller
> <davem@davemloft.net>; Kirti Wankhede <kwankhede@nvidia.com>; Cornelia
> Huck <cohuck@redhat.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; cjia <cjia@nvidia.com>; netdev@vger.kernel.org
> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> 
> On Fri, 23 Aug 2019 18:00:30 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > > -----Original Message-----
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, August 23, 2019 10:47 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: Jiri Pirko <jiri@resnulli.us>; Jiri Pirko <jiri@mellanox.com>;
> > > David S . Miller <davem@davemloft.net>; Kirti Wankhede
> > > <kwankhede@nvidia.com>; Cornelia Huck <cohuck@redhat.com>;
> > > kvm@vger.kernel.org; linux- kernel@vger.kernel.org; cjia
> > > <cjia@nvidia.com>; netdev@vger.kernel.org
> > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > >
> > > On Fri, 23 Aug 2019 16:14:04 +0000
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > > > Idea is to have mdev alias as optional.
> > > > > > Each mdev_parent says whether it wants mdev_core to generate
> > > > > > an alias or not. So only networking device drivers would set it to true.
> > > > > > For rest, alias won't be generated, and won't be compared
> > > > > > either during creation time. User continue to provide only uuid.
> > > > >
> > > > > Ok
> > > > >
> > > > > > I am tempted to have alias collision detection only within
> > > > > > children mdevs of the same parent, but doing so will always
> > > > > > mandate to prefix in netdev name. And currently we are left
> > > > > > with only 3 characters to prefix it, so that may not be good either.
> > > > > > Hence, I think mdev core wide alias is better with 12 characters.
> > > > >
> > > > > I suppose it depends on the API, if the vendor driver can ask
> > > > > the mdev core for an alias as part of the device creation
> > > > > process, then it could manage the netdev namespace for all its
> > > > > devices, choosing how many characters to use, and fail the
> > > > > creation if it can't meet a uniqueness requirement.  IOW,
> > > > > mdev-core would always provide a full
> > > > > sha1 and therefore gets itself out of the uniqueness/collision aspects.
> > > > >
> > > > This doesn't work. At mdev core level 20 bytes sha1 are unique, so
> > > > mdev core allowed to create a mdev.
> > >
> > > The mdev vendor driver has the opportunity to fail the device
> > > creation in mdev_parent_ops.create().
> > >
> > That is not helpful for below reasons.
> > 1. vendor driver doesn't have visibility in other vendor's alias.
> > 2. Even for single vendor, it needs to maintain global list of devices to see
> collision.
> > 3. multiple vendors needs to implement same scheme.
> >
> > Mdev core should be the owner. Shifting ownership from one layer to a
> > lower layer in vendor driver doesn't solve the problem (if there is
> > one, which I think doesn't exist).
> >
> > > > And then devlink core chooses
> > > > only 6 bytes (12 characters) and there is collision. Things fall
> > > > apart. Since mdev provides unique uuid based scheme, it's the mdev
> > > > core's ownership to provide unique aliases.
> > >
> > > You're suggesting/contemplating multiple solutions here, 3-char
> > > prefix + 12- char sha1 vs <parent netdev> + ?-char sha1.  Also, the
> > > 15-char total limit is imposed by an external subsystem, where the
> > > vendor driver is the gateway between that subsystem and mdev.  How
> > > would mdev integrate with another subsystem that maybe only has
> > > 9-chars available?  Would the vendor driver API specify "I need an
> > > alias" or would it specify "I need an X-char length alias"?
> > Yes, Vendor driver should say how long the alias it wants.
> > However before we implement that, I suggest let such
> > vendor/user/driver arrive which needs that. Such variable length alias
> > can be added at that time and even with that alias collision can be
> > detected by single mdev module.
> 
> If we agree that different alias lengths are possible, then I would request that
> minimally an mdev sample driver be modified to request an alias with a length
> that can be adjusted without recompiling in order to exercise the collision path.
> 
Yes. this can be done. But I fail to understand the need to do so.
It is not the responsibility of the mdev core to show case sha1 collision efficiency/deficiency.
So why do you insist exercise it?

> If mdev-core is guaranteeing uniqueness, does this indicate that each alias
> length constitutes a separate namespace?  ie. strictly a strcmp(), not a
> strncmp() to the shorter alias.
> 
Yes.


> > > Does it make sense that mdev-core would fail creation of a device if
> > > there's a collision in the 12-char address space between different
> > > subsystems?  For example, does enm0123456789ab really
> > > collide with xyz0123456789ab?
> > I think so, because at mdev level its 12-char alias matters.
> > Choosing the prefix not adding prefix is really a user space choice.
> >
> > >  So if
> > > mdev were to provided a 40-char sha1, is it possible that the vendor
> > > driver could consume this in its create callback, truncate it to the
> > > number of chars required by the vendor driver's subsystem, and
> > > determine whether a collision exists?
> > We shouldn't shift the problem from mdev to multiple vendor drivers to
> > detect collision.
> >
> > I still think that user providing alias is better because it knows the
> > use-case system in use, and eliminates these collision issue.
> 
> How is a user provided alias immune from collisions?  The burden is on the user
> to provide both a unique uuid and a unique alias.  That makes it trivial to create
> a collision.
> 
Than such collision should have occurred for other subsystem such as netdev while creating vlan, macvlan, ipvlan, vxlan and more devices who are named by the user.
But that isn't the case.

> > > > > > I do not understand how an extra character reduces collision,
> > > > > > if that's what you meant.
> > > > >
> > > > > If the default were for example 3-chars, we might already have
> > > > > device 'abc'.  A collision would expose one more char of the new
> > > > > device, so we might add device with alias 'abcd'.  I mentioned
> > > > > previously that this leaves an issue for userspace that we can't
> > > > > change the alias of device abc, so without additional
> > > > > information, userspace can only determine via elimination the
> > > > > mapping of alias to device, but userspace has more information
> > > > > available to it in the form of sysfs links.
> > > > > > Module options are almost not encouraged anymore with other
> > > > > > subsystems/drivers.
> > > > >
> > > > > We don't live in a world of absolutes.  I agree that the
> > > > > defaults should work in the vast majority of cases.  Requiring a
> > > > > user to twiddle module options to make things work is
> > > > > undesirable, verging on a bug.  A module option to enable some
> > > > > specific feature, unsafe condition, or test that is outside of
> > > > > the typical use case is reasonable, imo.
> > > > > > For testing collision rate, a sample user space script and
> > > > > > sample mtty is easy and get us collision count too. We
> > > > > > shouldn't put that using module option in production kernel.
> > > > > > I practically have the code ready to play with; Changing 12 to
> > > > > > smaller value is easy with module reload.
> > > > > >
> > > > > > #define MDEV_ALIAS_LEN 12
> > > > >
> > > > > If it can't be tested with a shipping binary, it probably won't
> > > > > be tested.  Thanks,
> > > > It is not the role of mdev core to expose collision
> > > > efficiency/deficiency of the sha1. It can be tested outside before
> > > > mdev choose to use it.
> > >
> > > The testing I'm considering is the user and kernel response to a
> > > collision.
> > > > I am saying we should test with 12 characters with 10,000 or more
> > > > devices and see how collision occurs. Even if collision occurs,
> > > > mdev returns EEXIST status indicating user to pick a different
> > > > UUID for those rare conditions.
> > >
> > > The only way we're going to see collision with a 12-char sha1 is if
> > > we burn the CPU cycles to find uuids that collide in that space.
> > > 10,000 devices is not remotely enough to generate a collision in
> > > that address space.  That puts a prerequisite in place that in order
> > > to test collision, someone needs to know certain magic inputs.
> > > OTOH, if we could use a shorter abbreviation, collisions are trivial
> > > to test experimentally.  Thanks,
> > Yes, and therefore a sane user who wants to create more mdevs,
> > wouldn't intentionally stress it to see failures.
> 
> I don't understand this logic.  I'm simply asking that we have a way to test the
> collision behavior without changing the binary.  The path we're driving towards
> seems to be making this easier and easier.  If the vendor can request an alias of
> a specific length, then a sample driver with a module option to set the desired
> alias length to 1-char makes it trivially easy to induce a collision.  
Sure it is easy to test collision, but my point is - mdev core is not sha1 test module.
Hence adding functionality of variable alias length to test collision doesn't make sense.
When the actual user arrives who needs small alias, we will be able to add additional pieces very easily.

> It doesn't
> even need to be exposed in a real driver.  Besides, when do we ever get to
> design interfaces that only worry about sane users???  Thanks,
> 
I intent to say that a sane user who wants to create mdev's will just work fine with less collision.
If there is collision EEXIST is returns and sane user picks different UUID.
If user is intentionally picking UUIDs in such a way that triggers sha1 collision, his intention is likely to not create mdevs for actual use.
And if interface returns error code it is still fine.

> Alex

^ permalink raw reply

* [PATCH RFC net-next 3/3] net: dsa: implement ndo_set_netlink for chaning port's CPU port
From: Marek Behún @ 2019-08-24  2:42 UTC (permalink / raw)
  To: netdev
  Cc: Andrew Lunn, Vivien Didelot, Florian Fainelli, David Ahern,
	Stephen Hemminger, Marek Behún
In-Reply-To: <20190824024251.4542-1-marek.behun@nic.cz>

Implement ndo_set_iflink for DSA slave device. In multi-CPU port setup
this should be used to change to which CPU destination port a given port
should be connected.

This adds a new operation into the DSA switch operations structure,
port_change_cpu_port. A driver implementing this function has the
ability to change CPU destination port of a given port.

Signed-off-by: Marek Behún <marek.behun@nic.cz>
---
 include/net/dsa.h |  6 ++++++
 net/dsa/slave.c   | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 64bd70608f2f..4f3f0032b886 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -545,6 +545,12 @@ struct dsa_switch_ops {
 	 */
 	netdev_tx_t (*port_deferred_xmit)(struct dsa_switch *ds, int port,
 					  struct sk_buff *skb);
+
+	/*
+	 * Multi-CPU port support
+	 */
+	int	(*port_change_cpu_port)(struct dsa_switch *ds, int port,
+					struct dsa_port *new_cpu_dp);
 };
 
 struct dsa_switch_driver {
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 33f41178afcc..bafaadeca912 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -64,6 +64,40 @@ static int dsa_slave_get_iflink(const struct net_device *dev)
 	return dsa_slave_to_master(dev)->ifindex;
 }
 
+static int dsa_slave_set_iflink(struct net_device *dev, int iflink)
+{
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_slave_priv *p = netdev_priv(dev);
+	struct net_device *new_cpu_dev;
+	struct dsa_port *new_cpu_dp;
+	int err;
+
+	if (!dp->ds->ops->port_change_cpu_port)
+		return -EOPNOTSUPP;
+
+	new_cpu_dev = dev_get_by_index(dev_net(dev), iflink);
+	if (!new_cpu_dev)
+		return -ENODEV;
+
+	new_cpu_dp = new_cpu_dev->dsa_ptr;
+	if (!new_cpu_dp)
+		return -EINVAL;
+
+	/* new CPU port has to be on the same switch tree */
+	if (new_cpu_dp->dst != dp->dst)
+		return -EINVAL;
+
+	err = dp->ds->ops->port_change_cpu_port(dp->ds, dp->index, new_cpu_dp);
+	if (err)
+		return err;
+
+	/* should this be done atomically? */
+	dp->cpu_dp = new_cpu_dp;
+	p->xmit = new_cpu_dp->tag_ops->xmit;
+
+	return 0;
+}
+
 static int dsa_slave_open(struct net_device *dev)
 {
 	struct net_device *master = dsa_slave_to_master(dev);
@@ -1176,6 +1210,7 @@ static const struct net_device_ops dsa_slave_netdev_ops = {
 	.ndo_fdb_dump		= dsa_slave_fdb_dump,
 	.ndo_do_ioctl		= dsa_slave_ioctl,
 	.ndo_get_iflink		= dsa_slave_get_iflink,
+	.ndo_set_iflink		= dsa_slave_set_iflink,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_netpoll_setup	= dsa_slave_netpoll_setup,
 	.ndo_netpoll_cleanup	= dsa_slave_netpoll_cleanup,
-- 
2.21.0


^ permalink raw reply related

* [PATCH RFC iproute2-next] iplink: allow to change iplink value
From: Marek Behún @ 2019-08-24  2:42 UTC (permalink / raw)
  To: netdev
  Cc: Andrew Lunn, Vivien Didelot, Florian Fainelli, David Ahern,
	Stephen Hemminger, Marek Behún
In-Reply-To: <20190824024251.4542-1-marek.behun@nic.cz>

Allow to change the interface to which a given interface is linked to.
This is useful in the case of multi-CPU port DSA, for changing the CPU
port of a given user port.

Signed-off-by: Marek Behún <marek.behun@nic.cz>
Cc: David Ahern <dsahern@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
---
 ip/iplink.c           | 16 +++++-----------
 man/man8/ip-link.8.in |  7 +++++++
 2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/ip/iplink.c b/ip/iplink.c
index 212a0885..d52c0aaf 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -579,7 +579,6 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
 {
 	char *name = NULL;
 	char *dev = NULL;
-	char *link = NULL;
 	int ret, len;
 	char abuf[32];
 	int qlen = -1;
@@ -590,6 +589,7 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
 	int numrxqueues = -1;
 	int link_netnsid = -1;
 	int index = 0;
+	int link = -1;
 	int group = -1;
 	int addr_len = 0;
 
@@ -620,7 +620,10 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
 				invarg("Invalid \"index\" value", *argv);
 		} else if (matches(*argv, "link") == 0) {
 			NEXT_ARG();
-			link = *argv;
+			link = ll_name_to_index(*argv);
+			if (!link)
+				return nodev(*argv);
+			addattr32(&req->n, sizeof(*req), IFLA_LINK, link);
 		} else if (matches(*argv, "address") == 0) {
 			NEXT_ARG();
 			addr_len = ll_addr_a2n(abuf, sizeof(abuf), *argv);
@@ -1004,15 +1007,6 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
 			exit(-1);
 		}
 
-		if (link) {
-			int ifindex;
-
-			ifindex = ll_name_to_index(link);
-			if (!ifindex)
-				return nodev(link);
-			addattr32(&req->n, sizeof(*req), IFLA_LINK, ifindex);
-		}
-
 		req->i.ifi_index = index;
 	}
 
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index a8ae72d2..800aed05 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -149,6 +149,9 @@ ip-link \- network device configuration
 .br
 .RB "[ " nomaster " ]"
 .br
+.RB "[ " link
+.IR DEVICE " ]"
+.br
 .RB "[ " vrf
 .IR NAME " ]"
 .br
@@ -2131,6 +2134,10 @@ set master device of the device (enslave device).
 .BI nomaster
 unset master device of the device (release device).
 
+.TP
+.BI link " DEVICE"
+set device to which this device is linked to.
+
 .TP
 .BI addrgenmode " eui64|none|stable_secret|random"
 set the IPv6 address generation mode
-- 
2.21.0


^ permalink raw reply related

* [PATCH RFC net-next 1/3] net: dsa: allow for multiple CPU ports
From: Marek Behún @ 2019-08-24  2:42 UTC (permalink / raw)
  To: netdev
  Cc: Andrew Lunn, Vivien Didelot, Florian Fainelli, David Ahern,
	Stephen Hemminger, Marek Behún
In-Reply-To: <20190824024251.4542-1-marek.behun@nic.cz>

Allow for multiple CPU ports in a DSA switch tree. By default assign the
CPU ports to user ports in a round robin way, ie. if there are two CPU
ports connected to eth0 and eth1, and five user ports (lan1..lan5),
assign them as:
  lan1 <-> eth0
  lan2 <-> eth1
  lan3 <-> eth0
  lan4 <-> eth1
  lan5 <-> eth0

Signed-off-by: Marek Behún <marek.behun@nic.cz>
---
 include/net/dsa.h |  5 +--
 net/dsa/dsa2.c    | 84 +++++++++++++++++++++++++++++++----------------
 2 files changed, 58 insertions(+), 31 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 147b757ef8ea..64bd70608f2f 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -123,9 +123,10 @@ struct dsa_switch_tree {
 	struct dsa_platform_data	*pd;
 
 	/*
-	 * The switch port to which the CPU is attached.
+	 * The switch ports to which the CPU is attached.
 	 */
-	struct dsa_port		*cpu_dp;
+	size_t			num_cpu_dps;
+	struct dsa_port		*cpu_dps[DSA_MAX_PORTS];
 
 	/*
 	 * Data for the individual switch chips.
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 8c4eccb0cfe6..c5af89079a6b 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -194,11 +194,12 @@ static bool dsa_tree_setup_routing_table(struct dsa_switch_tree *dst)
 	return complete;
 }
 
-static struct dsa_port *dsa_tree_find_first_cpu(struct dsa_switch_tree *dst)
+static void dsa_tree_fill_cpu_ports(struct dsa_switch_tree *dst)
 {
 	struct dsa_switch *ds;
 	struct dsa_port *dp;
 	int device, port;
+	int count = 0;
 
 	for (device = 0; device < DSA_MAX_SWITCHES; device++) {
 		ds = dst->ds[device];
@@ -208,28 +209,38 @@ static struct dsa_port *dsa_tree_find_first_cpu(struct dsa_switch_tree *dst)
 		for (port = 0; port < ds->num_ports; port++) {
 			dp = &ds->ports[port];
 
-			if (dsa_port_is_cpu(dp))
-				return dp;
+			if (dsa_port_is_cpu(dp)) {
+				if (count == ARRAY_SIZE(dst->cpu_dps)) {
+					pr_warn("Tree has too many CPU ports\n");
+					dst->num_cpu_dps = count;
+					return;
+				}
+
+				dst->cpu_dps[count++] = dp;
+			}
 		}
 	}
 
-	return NULL;
+	dst->num_cpu_dps = count;
 }
 
-static int dsa_tree_setup_default_cpu(struct dsa_switch_tree *dst)
+static int dsa_tree_setup_default_cpus(struct dsa_switch_tree *dst)
 {
 	struct dsa_switch *ds;
 	struct dsa_port *dp;
-	int device, port;
+	int device, port, i;
 
-	/* DSA currently only supports a single CPU port */
-	dst->cpu_dp = dsa_tree_find_first_cpu(dst);
-	if (!dst->cpu_dp) {
+	dsa_tree_fill_cpu_ports(dst);
+	if (!dst->num_cpu_dps) {
 		pr_warn("Tree has no master device\n");
 		return -EINVAL;
 	}
 
-	/* Assign the default CPU port to all ports of the fabric */
+	/* Assign the default CPU port to all ports of the fabric in a round
+	 * robin way. This should work nicely for all sane switch tree designs.
+	 */
+	i = 0;
+
 	for (device = 0; device < DSA_MAX_SWITCHES; device++) {
 		ds = dst->ds[device];
 		if (!ds)
@@ -238,18 +249,20 @@ static int dsa_tree_setup_default_cpu(struct dsa_switch_tree *dst)
 		for (port = 0; port < ds->num_ports; port++) {
 			dp = &ds->ports[port];
 
-			if (dsa_port_is_user(dp) || dsa_port_is_dsa(dp))
-				dp->cpu_dp = dst->cpu_dp;
+			if (dsa_port_is_user(dp) || dsa_port_is_dsa(dp)) {
+				dp->cpu_dp = dst->cpu_dps[i++];
+				if (i == dst->num_cpu_dps)
+					i = 0;
+			}
 		}
 	}
 
 	return 0;
 }
 
-static void dsa_tree_teardown_default_cpu(struct dsa_switch_tree *dst)
+static void dsa_tree_teardown_default_cpus(struct dsa_switch_tree *dst)
 {
-	/* DSA currently only supports a single CPU port */
-	dst->cpu_dp = NULL;
+	dst->num_cpu_dps = 0;
 }
 
 static int dsa_port_setup(struct dsa_port *dp)
@@ -493,21 +506,34 @@ static void dsa_tree_teardown_switches(struct dsa_switch_tree *dst)
 	}
 }
 
-static int dsa_tree_setup_master(struct dsa_switch_tree *dst)
+static int dsa_tree_setup_masters(struct dsa_switch_tree *dst)
 {
-	struct dsa_port *cpu_dp = dst->cpu_dp;
-	struct net_device *master = cpu_dp->master;
+	int i;
+	int err;
 
-	/* DSA currently supports a single pair of CPU port and master device */
-	return dsa_master_setup(master, cpu_dp);
+	for (i = 0; i < dst->num_cpu_dps; ++i) {
+		struct dsa_port *cpu_dp = dst->cpu_dps[i];
+		struct net_device *master = cpu_dp->master;
+
+		err = dsa_master_setup(master, cpu_dp);
+		if (err)
+			goto teardown;
+	}
+
+	return 0;
+teardown:
+	for (--i; i >= 0; --i)
+		dsa_master_teardown(dst->cpu_dps[i]->master);
+
+	return err;
 }
 
-static void dsa_tree_teardown_master(struct dsa_switch_tree *dst)
+static void dsa_tree_teardown_masters(struct dsa_switch_tree *dst)
 {
-	struct dsa_port *cpu_dp = dst->cpu_dp;
-	struct net_device *master = cpu_dp->master;
+	int i;
 
-	return dsa_master_teardown(master);
+	for (i = 0; i < dst->num_cpu_dps; ++i)
+		dsa_master_teardown(dst->cpu_dps[i]->master);
 }
 
 static int dsa_tree_setup(struct dsa_switch_tree *dst)
@@ -525,7 +551,7 @@ static int dsa_tree_setup(struct dsa_switch_tree *dst)
 	if (!complete)
 		return 0;
 
-	err = dsa_tree_setup_default_cpu(dst);
+	err = dsa_tree_setup_default_cpus(dst);
 	if (err)
 		return err;
 
@@ -533,7 +559,7 @@ static int dsa_tree_setup(struct dsa_switch_tree *dst)
 	if (err)
 		goto teardown_default_cpu;
 
-	err = dsa_tree_setup_master(dst);
+	err = dsa_tree_setup_masters(dst);
 	if (err)
 		goto teardown_switches;
 
@@ -546,7 +572,7 @@ static int dsa_tree_setup(struct dsa_switch_tree *dst)
 teardown_switches:
 	dsa_tree_teardown_switches(dst);
 teardown_default_cpu:
-	dsa_tree_teardown_default_cpu(dst);
+	dsa_tree_teardown_default_cpus(dst);
 
 	return err;
 }
@@ -556,11 +582,11 @@ static void dsa_tree_teardown(struct dsa_switch_tree *dst)
 	if (!dst->setup)
 		return;
 
-	dsa_tree_teardown_master(dst);
+	dsa_tree_teardown_masters(dst);
 
 	dsa_tree_teardown_switches(dst);
 
-	dsa_tree_teardown_default_cpu(dst);
+	dsa_tree_teardown_default_cpus(dst);
 
 	pr_info("DSA: tree %d torn down\n", dst->index);
 
-- 
2.21.0


^ permalink raw reply related

* [PATCH RFC net-next 2/3] net: add ndo for setting the iflink property
From: Marek Behún @ 2019-08-24  2:42 UTC (permalink / raw)
  To: netdev
  Cc: Andrew Lunn, Vivien Didelot, Florian Fainelli, David Ahern,
	Stephen Hemminger, Marek Behún
In-Reply-To: <20190824024251.4542-1-marek.behun@nic.cz>

In DSA the iflink value is used to report to which CPU port a given
switch port is connected to. Since we want to support multi-CPU DSA, we
want the user to be able to change this value.

Add ndo_set_iflink method into the ndo strucutre to be a pair to
ndo_get_iflink. Also create dev_set_iflink and call this from the
netlink code, so that userspace can change the iflink value.

Signed-off-by: Marek Behún <marek.behun@nic.cz>
---
 include/linux/netdevice.h |  5 +++++
 net/core/dev.c            | 15 +++++++++++++++
 net/core/rtnetlink.c      |  7 +++++++
 3 files changed, 27 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 55ac223553f8..45eeb6da8583 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1201,6 +1201,8 @@ struct tlsdev_ops;
  *	TX queue.
  * int (*ndo_get_iflink)(const struct net_device *dev);
  *	Called to get the iflink value of this device.
+ * int (*ndo_set_iflink)(struct net_device *dev, int iflink);
+ *	Called to set the iflink value of this device.
  * void (*ndo_change_proto_down)(struct net_device *dev,
  *				 bool proto_down);
  *	This function is used to pass protocol port error state information
@@ -1415,6 +1417,8 @@ struct net_device_ops {
 						      int queue_index,
 						      u32 maxrate);
 	int			(*ndo_get_iflink)(const struct net_device *dev);
+	int			(*ndo_set_iflink)(struct net_device *dev,
+						  int iflink);
 	int			(*ndo_change_proto_down)(struct net_device *dev,
 							 bool proto_down);
 	int			(*ndo_fill_metadata_dst)(struct net_device *dev,
@@ -2606,6 +2610,7 @@ void dev_add_offload(struct packet_offload *po);
 void dev_remove_offload(struct packet_offload *po);
 
 int dev_get_iflink(const struct net_device *dev);
+int dev_set_iflink(struct net_device *dev, int iflink);
 int dev_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb);
 struct net_device *__dev_get_by_flags(struct net *net, unsigned short flags,
 				      unsigned short mask);
diff --git a/net/core/dev.c b/net/core/dev.c
index 49589ed2018d..966bab196694 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -693,6 +693,21 @@ int dev_get_iflink(const struct net_device *dev)
 }
 EXPORT_SYMBOL(dev_get_iflink);
 
+/**
+ *	dev_set_iflink - set 'iflink' value of an interface
+ *	@dev: target interface
+ *	@iflink: new value
+ *
+ *	Change the interface to which this interface is linked to.
+ */
+int dev_set_iflink(struct net_device *dev, int iflink)
+{
+	if (dev->netdev_ops && dev->netdev_ops->ndo_set_iflink)
+		return dev->netdev_ops->ndo_set_iflink(dev, iflink);
+
+	return -EOPNOTSUPP;
+}
+
 /**
  *	dev_fill_metadata_dst - Retrieve tunnel egress information.
  *	@dev: targeted interface
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1ee6460f8275..106d5e23ae6f 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2507,6 +2507,13 @@ static int do_setlink(const struct sk_buff *skb,
 		status |= DO_SETLINK_MODIFIED;
 	}
 
+	if (tb[IFLA_LINK]) {
+		err = dev_set_iflink(dev, nla_get_u32(tb[IFLA_LINK]));
+		if (err)
+			goto errout;
+		status |= DO_SETLINK_MODIFIED;
+	}
+
 	if (tb[IFLA_CARRIER]) {
 		err = dev_change_carrier(dev, nla_get_u8(tb[IFLA_CARRIER]));
 		if (err)
-- 
2.21.0


^ permalink raw reply related

* [PATCH RFC net-next 0/3] Multi-CPU DSA support
From: Marek Behún @ 2019-08-24  2:42 UTC (permalink / raw)
  To: netdev
  Cc: Andrew Lunn, Vivien Didelot, Florian Fainelli, David Ahern,
	Stephen Hemminger, Marek Behún

Hi,
this is my attempt to solve the multi-CPU port issue for DSA.

Patch 1 adds code for handling multiple CPU ports in a DSA switch tree.
If more than one CPU port is found in a tree, the code assigns CPU ports
to user/DSA ports in a round robin way. So for the simplest case where
we have one switch with N ports, 2 of them of type CPU connected to eth0
and eth1, and the other ports labels being lan1, lan2, ..., the code
assigns them to CPU ports this way:
  lan1 <-> eth0
  lan2 <-> eth1
  lan3 <-> eth0
  lan4 <-> eth1
  lan5 <-> eth0
  ...

Patch 2 adds a new operation to the net device operations structure.
Currently we use the iflink property of a net device to report to which
CPU port a given switch port si connected to. The ip link utility from
iproute2 reports this as "lan1@eth0". We add a new net device operation,
ndo_set_iflink, which can be used to set this property. We call this
function from the netlink handlers.

Patch 3 implements this new ndo_set_iflink operation for DSA slave
device. Thus the userspace can request a change of CPU port of a given
port.

I am also sending patch for iproute2-next, to add support for setting
this iflink value.

Marek

Marek Behún (3):
  net: dsa: allow for multiple CPU ports
  net: add ndo for setting the iflink property
  net: dsa: implement ndo_set_netlink for chaning port's CPU port

 include/linux/netdevice.h |  5 +++
 include/net/dsa.h         | 11 ++++-
 net/core/dev.c            | 15 +++++++
 net/core/rtnetlink.c      |  7 ++++
 net/dsa/dsa2.c            | 84 +++++++++++++++++++++++++--------------
 net/dsa/slave.c           | 35 ++++++++++++++++
 6 files changed, 126 insertions(+), 31 deletions(-)

-- 
2.21.0


^ permalink raw reply

* [PATCH bpf] nfp: bpf: fix latency bug when updating stack index register
From: Jakub Kicinski @ 2019-08-24  2:00 UTC (permalink / raw)
  To: alexei.starovoitov, daniel
  Cc: bpf, netdev, oss-drivers, Jiong Wang, Jakub Kicinski

From: Jiong Wang <jiong.wang@netronome.com>

NFP is using Local Memory to model stack. LM_addr could be used as base of
a 16 32-bit word region of Local Memory. Then, if the stack offset is
beyond the current region, the local index needs to be updated. The update
needs at least three cycles to take effect, therefore the sequence normally
looks like:

  local_csr_wr[ActLMAddr3, gprB_5]
  nop
  nop
  nop

If the local index switch happens on a narrow loads, then the instruction
preparing value to zero high 32-bit of the destination register could be
counted as one cycle, the sequence then could be something like:

  local_csr_wr[ActLMAddr3, gprB_5]
  nop
  nop
  immed[gprB_5, 0]

However, we have zero extension optimization that zeroing high 32-bit could
be eliminated, therefore above IMMED insn won't be available for which case
the first sequence needs to be generated.

Fixes: 0b4de1ff19bf ("nfp: bpf: eliminate zero extension code-gen")
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 4054b70d7719..5afcb3c4c2ef 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -1163,7 +1163,7 @@ mem_op_stack(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
 	     bool clr_gpr, lmem_step step)
 {
 	s32 off = nfp_prog->stack_frame_depth + meta->insn.off + ptr_off;
-	bool first = true, last;
+	bool first = true, narrow_ld, last;
 	bool needs_inc = false;
 	swreg stack_off_reg;
 	u8 prev_gpr = 255;
@@ -1209,13 +1209,22 @@ mem_op_stack(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
 
 		needs_inc = true;
 	}
+
+	narrow_ld = clr_gpr && size < 8;
+
 	if (lm3) {
+		unsigned int nop_cnt;
+
 		emit_csr_wr(nfp_prog, imm_b(nfp_prog), NFP_CSR_ACT_LM_ADDR3);
-		/* For size < 4 one slot will be filled by zeroing of upper. */
-		wrp_nops(nfp_prog, clr_gpr && size < 8 ? 2 : 3);
+		/* For size < 4 one slot will be filled by zeroing of upper,
+		 * but be careful, that zeroing could be eliminated by zext
+		 * optimization.
+		 */
+		nop_cnt = narrow_ld && meta->flags & FLAG_INSN_DO_ZEXT ? 2 : 3;
+		wrp_nops(nfp_prog, nop_cnt);
 	}
 
-	if (clr_gpr && size < 8)
+	if (narrow_ld)
 		wrp_zext(nfp_prog, meta, gpr);
 
 	while (size) {
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCHv2 1/1] net: rds: add service level support in rds-info
From: Zhu Yanjun @ 2019-08-24  1:36 UTC (permalink / raw)
  To: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel,
	gerd.rausch
In-Reply-To: <6e5bc371-d613-e8f7-7b57-0b1bc2e10e9d@oracle.com>


On 2019/8/24 9:25, santosh.shilimkar@oracle.com wrote:
> On 8/23/19 6:04 PM, Zhu Yanjun wrote:
>>  From IB specific 7.6.5 SERVICE LEVEL, Service Level (SL)
>> is used to identify different flows within an IBA subnet.
>> It is carried in the local route header of the packet.
>>
>> Before this commit, run "rds-info -I". The outputs are as
>> below:
>> "
>> RDS IB Connections:
>>   LocalAddr  RemoteAddr Tos SL  LocalDev               RemoteDev
>> 192.2.95.3  192.2.95.1  2   0  fe80::21:28:1a:39 fe80::21:28:10:b9
>> 192.2.95.3  192.2.95.1  1   0  fe80::21:28:1a:39 fe80::21:28:10:b9
>> 192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39 fe80::21:28:10:b9
>> "
>> After this commit, the output is as below:
>> "
>> RDS IB Connections:
>>   LocalAddr  RemoteAddr Tos SL  LocalDev               RemoteDev
>> 192.2.95.3  192.2.95.1  2   2  fe80::21:28:1a:39 fe80::21:28:10:b9
>> 192.2.95.3  192.2.95.1  1   1  fe80::21:28:1a:39 fe80::21:28:10:b9
>> 192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39 fe80::21:28:10:b9
>> "
>>
>> The commit fe3475af3bdf ("net: rds: add per rds connection cache
>> statistics") adds cache_allocs in struct rds_info_rdma_connection
>> as below:
>> struct rds_info_rdma_connection {
>> ...
>>          __u32           rdma_mr_max;
>>          __u32           rdma_mr_size;
>>          __u8            tos;
>>          __u32           cache_allocs;
>>   };
>> The peer struct in rds-tools of struct rds_info_rdma_connection is as
>> below:
>> struct rds_info_rdma_connection {
>> ...
>>          uint32_t        rdma_mr_max;
>>          uint32_t        rdma_mr_size;
>>          uint8_t         tos;
>>          uint8_t         sl;
>>          uint32_t        cache_allocs;
>> };
>> The difference between userspace and kernel is the member variable sl.
>> In the kernel struct, the member variable sl is missing. This will
>> introduce risks. So it is necessary to use this commit to avoid this 
>> risk.
>>
>> Fixes: fe3475af3bdf ("net: rds: add per rds connection cache 
>> statistics")
>> CC: Joe Jin <joe.jin@oracle.com>
>> CC: JUNXIAO_BI <junxiao.bi@oracle.com>
>> Suggested-by: Gerd Rausch <gerd.rausch@oracle.com>
>> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
>> ---
>> V1->V2: fix typos in commit logs.
>> ---
> I did ask you when ypu posted the patch about whether you did
> backward compatibility tests for which you said, you did all the
> tests and said "So do not worry about backward compatibility. This
> commit will work well with older rds-tools2.0.5 and 2.0.6."
>
> https://www.spinics.net/lists/netdev/msg574691.html
>
> I was worried about exactly such issue as described in commit.

Sorry. My bad. I will make more work to let rds robust.

Thanks a lot for your Ack.

Zhu Yanjun

>
> Anyways thanks for the fixup patch. Should be applied to stable
> as well.
>
> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
>
> Regards,
> Santosh
>
>

^ permalink raw reply

* Re: [net-next 07/14] ice: Rename ethtool private flag for lldp
From: Jakub Kicinski @ 2019-08-24  1:31 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, Dave Ertman, netdev, nhorman, sassmann, Andrew Bowers
In-Reply-To: <20190823233750.7997-8-jeffrey.t.kirsher@intel.com>

On Fri, 23 Aug 2019 16:37:43 -0700, Jeff Kirsher wrote:
> From: Dave Ertman <david.m.ertman@intel.com>
> 
> The current flag name of "enable-fw-lldp" is a bit cumbersome.
> 
> Change priv-flag name to "fw-lldp-agent" with a value of on or
> off.  This is more straight-forward in meaning.
> 
> Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
> Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Just flagging this for Dave, it was introduced in v5.2 by:

commit 3a257a1404f8bf751a258ab92262dcb2cce39eef
Author: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Date:   Thu Feb 28 15:24:31 2019 -0800

    ice: Add code to control FW LLDP and DCBX
    
    This patch adds code to start or stop LLDP and DCBX in firmware through
    use of ethtool private flags.
    
    Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
    Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

And changed once already in v5.3-rc by:

commit 31eafa403b9945997cf5b321ae3560f072b74efe
Author: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Date:   Tue Apr 16 10:24:25 2019 -0700

    ice: Implement LLDP persistence
    
    Implement LLDP persistence across reboots, start and stop of LLDP agent.
    Add additional parameter to ice_aq_start_lldp and ice_aq_stop_lldp.
    
    Also change the ethtool private flag from "disable-fw-lldp" to
    "enable-fw-lldp". This change will flip the boolean logic of the
    functionality of the flag (on = enable, off = disable). The change
    in name and functionality is to differentiate between the
    pre-persistence and post-persistence states.
    
    Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
    Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Perhaps the rename should target net? IDK how much driver flag renaming
is okay otherwise, I guess this will only affect Intel users.

^ permalink raw reply

* Re: [PATCHv2 1/1] net: rds: add service level support in rds-info
From: santosh.shilimkar @ 2019-08-24  1:25 UTC (permalink / raw)
  To: Zhu Yanjun, davem, netdev, linux-rdma, rds-devel, gerd.rausch
In-Reply-To: <1566608656-30836-1-git-send-email-yanjun.zhu@oracle.com>

On 8/23/19 6:04 PM, Zhu Yanjun wrote:
>  From IB specific 7.6.5 SERVICE LEVEL, Service Level (SL)
> is used to identify different flows within an IBA subnet.
> It is carried in the local route header of the packet.
> 
> Before this commit, run "rds-info -I". The outputs are as
> below:
> "
> RDS IB Connections:
>   LocalAddr  RemoteAddr Tos SL  LocalDev               RemoteDev
> 192.2.95.3  192.2.95.1  2   0  fe80::21:28:1a:39  fe80::21:28:10:b9
> 192.2.95.3  192.2.95.1  1   0  fe80::21:28:1a:39  fe80::21:28:10:b9
> 192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39  fe80::21:28:10:b9
> "
> After this commit, the output is as below:
> "
> RDS IB Connections:
>   LocalAddr  RemoteAddr Tos SL  LocalDev               RemoteDev
> 192.2.95.3  192.2.95.1  2   2  fe80::21:28:1a:39  fe80::21:28:10:b9
> 192.2.95.3  192.2.95.1  1   1  fe80::21:28:1a:39  fe80::21:28:10:b9
> 192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39  fe80::21:28:10:b9
> "
> 
> The commit fe3475af3bdf ("net: rds: add per rds connection cache
> statistics") adds cache_allocs in struct rds_info_rdma_connection
> as below:
> struct rds_info_rdma_connection {
> ...
>          __u32           rdma_mr_max;
>          __u32           rdma_mr_size;
>          __u8            tos;
>          __u32           cache_allocs;
>   };
> The peer struct in rds-tools of struct rds_info_rdma_connection is as
> below:
> struct rds_info_rdma_connection {
> ...
>          uint32_t        rdma_mr_max;
>          uint32_t        rdma_mr_size;
>          uint8_t         tos;
>          uint8_t         sl;
>          uint32_t        cache_allocs;
> };
> The difference between userspace and kernel is the member variable sl.
> In the kernel struct, the member variable sl is missing. This will
> introduce risks. So it is necessary to use this commit to avoid this risk.
> 
> Fixes: fe3475af3bdf ("net: rds: add per rds connection cache statistics")
> CC: Joe Jin <joe.jin@oracle.com>
> CC: JUNXIAO_BI <junxiao.bi@oracle.com>
> Suggested-by: Gerd Rausch <gerd.rausch@oracle.com>
> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
> ---
> V1->V2: fix typos in commit logs.
> ---
I did ask you when ypu posted the patch about whether you did
backward compatibility tests for which you said, you did all the
tests and said "So do not worry about backward compatibility.  This
commit will work well with older rds-tools2.0.5 and 2.0.6."

https://www.spinics.net/lists/netdev/msg574691.html

I was worried about exactly such issue as described in commit.

Anyways thanks for the fixup patch. Should be applied to stable
as well.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

Regards,
Santosh


^ permalink raw reply

* Re: [PATCH v2] riscv: add support for SECCOMP and SECCOMP_FILTER
From: Paul Walmsley @ 2019-08-24  1:18 UTC (permalink / raw)
  To: David Abdurachmanov
  Cc: Tycho Andersen, Palmer Dabbelt, Albert Ou, Oleg Nesterov,
	Kees Cook, Andy Lutomirski, Will Drewry, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David Abdurachmanov, Thomas Gleixner,
	Allison Randal, Alexios Zavras, Anup Patel, Vincent Chen,
	Alan Kao, linux-riscv, linux-kernel, linux-kselftest, netdev, bpf,
	me
In-Reply-To: <CAEn-LTp=ss0Dfv6J00=rCAy+N78U2AmhqJNjfqjr2FDpPYjxEQ@mail.gmail.com>

On Fri, 23 Aug 2019, David Abdurachmanov wrote:

> On Fri, Aug 23, 2019 at 5:30 PM Paul Walmsley <paul.walmsley@sifive.com> wrote:
> >
> > On Thu, 22 Aug 2019, David Abdurachmanov wrote:
> >
> > > There is one failing kernel selftest: global.user_notification_signal
> >
> > Is this the only failing test?  Or are the rest of the selftests skipped
> > when this test fails, and no further tests are run, as seems to be shown
> > here:
> >
> >   https://lore.kernel.org/linux-riscv/CADnnUqcmDMRe1f+3jG8SPR6jRrnBsY8VVD70VbKEm0NqYeoicA@mail.gmail.com/
> 
> Yes, it's a single test failing. After removing global.user_notification_signal
> test everything else pass and you get the results printed.

OK.

> Well the code states ".. and hope that it doesn't break when there
> is actually a signal :)". Maybe we are just unlucky. I don't have results
> from other architectures to compare.
> 
> I found that Linaro is running selftests, but SECCOMP is disabled
> and thus it's failing. Is there another CI which tracks selftests?

0day runs the kselftests, and at least on some architectures/Kconfigs, 
it's succeeding:

https://lore.kernel.org/lkml/20190726083740.GG22106@shao2-debian/

https://lore.kernel.org/lkml/20190712064850.GC20848@shao2-debian/

https://lore.kernel.org/lkml/20190311074115.GC10839@shao2-debian/

etc.


- Paul

^ permalink raw reply

* Re: [PATCH v2] riscv: add support for SECCOMP and SECCOMP_FILTER
From: David Abdurachmanov @ 2019-08-24  1:10 UTC (permalink / raw)
  To: Paul Walmsley
  Cc: Tycho Andersen, Palmer Dabbelt, Albert Ou, Oleg Nesterov,
	Kees Cook, Andy Lutomirski, Will Drewry, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David Abdurachmanov, Thomas Gleixner,
	Allison Randal, Alexios Zavras, Anup Patel, Vincent Chen,
	Alan Kao, linux-riscv, linux-kernel, linux-kselftest, netdev, bpf,
	me
In-Reply-To: <CAEn-LTp=ss0Dfv6J00=rCAy+N78U2AmhqJNjfqjr2FDpPYjxEQ@mail.gmail.com>

On Fri, Aug 23, 2019 at 6:04 PM David Abdurachmanov
<david.abdurachmanov@gmail.com> wrote:
>
> On Fri, Aug 23, 2019 at 5:30 PM Paul Walmsley <paul.walmsley@sifive.com> wrote:
> >
> > On Thu, 22 Aug 2019, David Abdurachmanov wrote:
> >
> > > There is one failing kernel selftest: global.user_notification_signal
> >
> > Is this the only failing test?  Or are the rest of the selftests skipped
> > when this test fails, and no further tests are run, as seems to be shown
> > here:
> >
> >   https://lore.kernel.org/linux-riscv/CADnnUqcmDMRe1f+3jG8SPR6jRrnBsY8VVD70VbKEm0NqYeoicA@mail.gmail.com/
>
> Yes, it's a single test failing. After removing global.user_notification_signal
> test everything else pass and you get the results printed.
>
> >
> > For example, looking at the source, I'd naively expect to see the
> > user_notification_closed_listener test result -- which follows right
> > after the failing test in the selftest source.  But there aren't any
> > results?
>
> Yes, it hangs at this point. You have to manually terminate it.
>
> >
> > Also - could you follow up with the author of this failing test to see if
> > we can get some more clarity about what might be going wrong here?  It
> > appears that the failing test was added in commit 6a21cc50f0c7f ("seccomp:
> > add a return code to trap to userspace") by Tycho Andersen
> > <tycho@tycho.ws>.
>
> Well the code states ".. and hope that it doesn't break when there
> is actually a signal :)". Maybe we are just unlucky. I don't have results
> from other architectures to compare.
>
> I found that Linaro is running selftests, but SECCOMP is disabled
> and thus it's failing. Is there another CI which tracks selftests?
>
> https://qa-reports.linaro.org/lkft/linux-next-oe/tests/kselftest/seccomp_seccomp_bpf?top=next-20190823

Actually it seems that seccomp is enabled in kernel, but not in
systemd, and somehow seccomp_bpf is missing on all arches thus
causing automatic failure.

> >
> >
> > - Paul

^ permalink raw reply

* Re: [net PATCH] net: route dump netlink NLM_F_MULTI flag missing
From: Stefano Brivio @ 2019-08-24  1:06 UTC (permalink / raw)
  To: John Fastabend; +Cc: davem, dsahern, netdev
In-Reply-To: <156660549861.5753.7912871726096518275.stgit@john-XPS-13-9370>

On Fri, 23 Aug 2019 17:11:38 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> An excerpt from netlink(7) man page,
> 
>   In multipart messages (multiple nlmsghdr headers with associated payload
>   in one byte stream) the first and all following headers have the
>   NLM_F_MULTI flag set, except for the last  header  which  has the type
>   NLMSG_DONE.
> 
> but, after (ee28906) there is a missing NLM_F_MULTI flag in the middle of a
> FIB dump.

In your case (see below), it can be zero or more, depending on how many
exception routes you have.

> The result is user space applications following above man page
> excerpt may get confused and may stop parsing msg believing something went
> wrong.

Worse yet, also RFC 3459 says:

	[...] For multipart
	messages, the first and all following headers have the NLM_F_MULTI
	Netlink header flag set, except for the last header which has the
	Netlink header type NLMSG_DONE.

But iproute2 doesn't check for this, so the selftests I added didn't
notice. Thanks for fixing this!

> In the golang netlink lib [0] the library logic stops parsing believing the
> message is not a multipart message. Found this running Cilium[1] against
> net-next while adding a feature to auto-detect routes. I noticed with
> multiple route tables we no longer could detect the default routes on net
> tree kernels because the library logic was not returning them.

However, note that if strict netlink checking is requested (I think the
library should be updated), and RTM_F_CLONED is not set (which should
be the case if you are just looking for "regular" routes), you won't
hit this.

> Fix this by handling the fib_dump_info_fnhe() case the same way the
> fib_dump_info() handles it by passing the flags argument through the
> call chain and adding a flags argument to rt_fill_info().
> 
> Tested with Cilium stack and auto-detection of routes works again. Also
> annotated libs to dump netlink msgs and inspected NLM_F_MULTI and
> NLMSG_DONE flags look correct after this.
> 
> Note: In inet_rtm_getroute() pass rt_fill_info() '0' for flags the same
> as is done for fib_dump_info() so this looks correct to me.

Yes, that's correct, because if the buffer is too small for a single
route dumped by a single rt_fill_info() call, we'll just fail, so that
will never be a multipart message.

> [0] https://github.com/vishvananda/netlink/
> [1] https://github.com/cilium/
> 
> Fixes: ee28906fd7a14 ("ipv4: Dump route exceptions if requested")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>

-- 
Stefano

^ permalink raw reply

* Re: [PATCH v2] riscv: add support for SECCOMP and SECCOMP_FILTER
From: David Abdurachmanov @ 2019-08-24  1:04 UTC (permalink / raw)
  To: Paul Walmsley
  Cc: Tycho Andersen, Palmer Dabbelt, Albert Ou, Oleg Nesterov,
	Kees Cook, Andy Lutomirski, Will Drewry, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David Abdurachmanov, Thomas Gleixner,
	Allison Randal, Alexios Zavras, Anup Patel, Vincent Chen,
	Alan Kao, linux-riscv, linux-kernel, linux-kselftest, netdev, bpf,
	me
In-Reply-To: <alpine.DEB.2.21.9999.1908231717550.25649@viisi.sifive.com>

On Fri, Aug 23, 2019 at 5:30 PM Paul Walmsley <paul.walmsley@sifive.com> wrote:
>
> On Thu, 22 Aug 2019, David Abdurachmanov wrote:
>
> > There is one failing kernel selftest: global.user_notification_signal
>
> Is this the only failing test?  Or are the rest of the selftests skipped
> when this test fails, and no further tests are run, as seems to be shown
> here:
>
>   https://lore.kernel.org/linux-riscv/CADnnUqcmDMRe1f+3jG8SPR6jRrnBsY8VVD70VbKEm0NqYeoicA@mail.gmail.com/

Yes, it's a single test failing. After removing global.user_notification_signal
test everything else pass and you get the results printed.

>
> For example, looking at the source, I'd naively expect to see the
> user_notification_closed_listener test result -- which follows right
> after the failing test in the selftest source.  But there aren't any
> results?

Yes, it hangs at this point. You have to manually terminate it.

>
> Also - could you follow up with the author of this failing test to see if
> we can get some more clarity about what might be going wrong here?  It
> appears that the failing test was added in commit 6a21cc50f0c7f ("seccomp:
> add a return code to trap to userspace") by Tycho Andersen
> <tycho@tycho.ws>.

Well the code states ".. and hope that it doesn't break when there
is actually a signal :)". Maybe we are just unlucky. I don't have results
from other architectures to compare.

I found that Linaro is running selftests, but SECCOMP is disabled
and thus it's failing. Is there another CI which tracks selftests?

https://qa-reports.linaro.org/lkft/linux-next-oe/tests/kselftest/seccomp_seccomp_bpf?top=next-20190823

>
>
> - Paul

^ permalink raw reply

* [PATCHv2 1/1] net: rds: add service level support in rds-info
From: Zhu Yanjun @ 2019-08-24  1:04 UTC (permalink / raw)
  To: santosh.shilimkar, davem, netdev, linux-rdma, rds-devel,
	gerd.rausch

From IB specific 7.6.5 SERVICE LEVEL, Service Level (SL)
is used to identify different flows within an IBA subnet.
It is carried in the local route header of the packet.

Before this commit, run "rds-info -I". The outputs are as
below:
"
RDS IB Connections:
 LocalAddr  RemoteAddr Tos SL  LocalDev               RemoteDev
192.2.95.3  192.2.95.1  2   0  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  1   0  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39  fe80::21:28:10:b9
"
After this commit, the output is as below:
"
RDS IB Connections:
 LocalAddr  RemoteAddr Tos SL  LocalDev               RemoteDev
192.2.95.3  192.2.95.1  2   2  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  1   1  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39  fe80::21:28:10:b9
"

The commit fe3475af3bdf ("net: rds: add per rds connection cache
statistics") adds cache_allocs in struct rds_info_rdma_connection
as below:
struct rds_info_rdma_connection {
...
        __u32           rdma_mr_max;
        __u32           rdma_mr_size;
        __u8            tos;
        __u32           cache_allocs;
 };
The peer struct in rds-tools of struct rds_info_rdma_connection is as
below:
struct rds_info_rdma_connection {
...
        uint32_t        rdma_mr_max;
        uint32_t        rdma_mr_size;
        uint8_t         tos;
        uint8_t         sl;
        uint32_t        cache_allocs;
};
The difference between userspace and kernel is the member variable sl.
In the kernel struct, the member variable sl is missing. This will
introduce risks. So it is necessary to use this commit to avoid this risk.

Fixes: fe3475af3bdf ("net: rds: add per rds connection cache statistics")
CC: Joe Jin <joe.jin@oracle.com>
CC: JUNXIAO_BI <junxiao.bi@oracle.com>
Suggested-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
V1->V2: fix typos in commit logs.
---
 include/uapi/linux/rds.h |    2 ++
 net/rds/ib.c             |   16 ++++++++++------
 net/rds/ib.h             |    1 +
 net/rds/ib_cm.c          |    3 +++
 net/rds/rdma_transport.c |   10 ++++++++--
 5 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index fd6b5f6..cba368e 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -250,6 +250,7 @@ struct rds_info_rdma_connection {
 	__u32		rdma_mr_max;
 	__u32		rdma_mr_size;
 	__u8		tos;
+	__u8		sl;
 	__u32		cache_allocs;
 };
 
@@ -265,6 +266,7 @@ struct rds6_info_rdma_connection {
 	__u32		rdma_mr_max;
 	__u32		rdma_mr_size;
 	__u8		tos;
+	__u8		sl;
 	__u32		cache_allocs;
 };
 
diff --git a/net/rds/ib.c b/net/rds/ib.c
index ec05d91..45acab2 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -291,7 +291,7 @@ static int rds_ib_conn_info_visitor(struct rds_connection *conn,
 				    void *buffer)
 {
 	struct rds_info_rdma_connection *iinfo = buffer;
-	struct rds_ib_connection *ic;
+	struct rds_ib_connection *ic = conn->c_transport_data;
 
 	/* We will only ever look at IB transports */
 	if (conn->c_trans != &rds_ib_transport)
@@ -301,15 +301,16 @@ static int rds_ib_conn_info_visitor(struct rds_connection *conn,
 
 	iinfo->src_addr = conn->c_laddr.s6_addr32[3];
 	iinfo->dst_addr = conn->c_faddr.s6_addr32[3];
-	iinfo->tos = conn->c_tos;
+	if (ic) {
+		iinfo->tos = conn->c_tos;
+		iinfo->sl = ic->i_sl;
+	}
 
 	memset(&iinfo->src_gid, 0, sizeof(iinfo->src_gid));
 	memset(&iinfo->dst_gid, 0, sizeof(iinfo->dst_gid));
 	if (rds_conn_state(conn) == RDS_CONN_UP) {
 		struct rds_ib_device *rds_ibdev;
 
-		ic = conn->c_transport_data;
-
 		rdma_read_gids(ic->i_cm_id, (union ib_gid *)&iinfo->src_gid,
 			       (union ib_gid *)&iinfo->dst_gid);
 
@@ -329,7 +330,7 @@ static int rds6_ib_conn_info_visitor(struct rds_connection *conn,
 				     void *buffer)
 {
 	struct rds6_info_rdma_connection *iinfo6 = buffer;
-	struct rds_ib_connection *ic;
+	struct rds_ib_connection *ic = conn->c_transport_data;
 
 	/* We will only ever look at IB transports */
 	if (conn->c_trans != &rds_ib_transport)
@@ -337,6 +338,10 @@ static int rds6_ib_conn_info_visitor(struct rds_connection *conn,
 
 	iinfo6->src_addr = conn->c_laddr;
 	iinfo6->dst_addr = conn->c_faddr;
+	if (ic) {
+		iinfo6->tos = conn->c_tos;
+		iinfo6->sl = ic->i_sl;
+	}
 
 	memset(&iinfo6->src_gid, 0, sizeof(iinfo6->src_gid));
 	memset(&iinfo6->dst_gid, 0, sizeof(iinfo6->dst_gid));
@@ -344,7 +349,6 @@ static int rds6_ib_conn_info_visitor(struct rds_connection *conn,
 	if (rds_conn_state(conn) == RDS_CONN_UP) {
 		struct rds_ib_device *rds_ibdev;
 
-		ic = conn->c_transport_data;
 		rdma_read_gids(ic->i_cm_id, (union ib_gid *)&iinfo6->src_gid,
 			       (union ib_gid *)&iinfo6->dst_gid);
 		rds_ibdev = ic->rds_ibdev;
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 303c6ee..f2b558e 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -220,6 +220,7 @@ struct rds_ib_connection {
 	/* Send/Recv vectors */
 	int			i_scq_vector;
 	int			i_rcq_vector;
+	u8			i_sl;
 };
 
 /* This assumes that atomic_t is at least 32 bits */
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index fddaa09..233f136 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -152,6 +152,9 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
 		  RDS_PROTOCOL_MINOR(conn->c_version),
 		  ic->i_flowctl ? ", flow control" : "");
 
+	/* receive sl from the peer */
+	ic->i_sl = ic->i_cm_id->route.path_rec->sl;
+
 	atomic_set(&ic->i_cq_quiesce, 0);
 
 	/* Init rings and fill recv. this needs to wait until protocol
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index ff74c4b..28668ad 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -43,6 +43,9 @@
 static struct rdma_cm_id *rds6_rdma_listen_id;
 #endif
 
+/* Per IB specification 7.7.3, service level is a 4-bit field. */
+#define TOS_TO_SL(tos)		((tos) & 0xF)
+
 static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
 					 struct rdma_cm_event *event,
 					 bool isv6)
@@ -97,10 +100,13 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
 			struct rds_ib_connection *ibic;
 
 			ibic = conn->c_transport_data;
-			if (ibic && ibic->i_cm_id == cm_id)
+			if (ibic && ibic->i_cm_id == cm_id) {
+				cm_id->route.path_rec[0].sl =
+					TOS_TO_SL(conn->c_tos);
 				ret = trans->cm_initiate_connect(cm_id, isv6);
-			else
+			} else {
 				rds_conn_drop(conn);
+			}
 		}
 		break;
 
-- 
1.7.1


^ permalink raw reply related

* Re: pull-request: bpf 2019-08-24
From: David Miller @ 2019-08-24  0:34 UTC (permalink / raw)
  To: daniel; +Cc: jakub.kicinski, ast, andrii.nakryiko, netdev, bpf
In-Reply-To: <20190824001157.16043-1-daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Sat, 24 Aug 2019 02:11:57 +0200

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) Fix verifier precision tracking with BPF-to-BPF calls, from Alexei.
> 
> 2) Fix a use-after-free in prog symbol exposure, from Daniel.
> 
> 3) Several s390x JIT fixes plus BE related fixes in BPF kselftests, from Ilya.
> 
> 4) Fix memory leak by unpinning XDP umem pages in error path, from Ivan.
> 
> 5) Fix a potential use-after-free on flow dissector detach, from Jakub.
> 
> 6) Fix bpftool to close prog fd after showing metadata, from Quentin.
> 
> 7) BPF kselftest config and TEST_PROGS_EXTENDED fixes, from Anders.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.

^ permalink raw reply

* Re: [PATCH v2] riscv: add support for SECCOMP and SECCOMP_FILTER
From: Paul Walmsley @ 2019-08-24  0:30 UTC (permalink / raw)
  To: David Abdurachmanov, Tycho Andersen
  Cc: Palmer Dabbelt, Albert Ou, Oleg Nesterov, Kees Cook,
	Andy Lutomirski, Will Drewry, Shuah Khan, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David Abdurachmanov, Thomas Gleixner, Allison Randal,
	Alexios Zavras, Anup Patel, Vincent Chen, Alan Kao, linux-riscv,
	linux-kernel, linux-kselftest, netdev, bpf, me
In-Reply-To: <20190822205533.4877-1-david.abdurachmanov@sifive.com>

On Thu, 22 Aug 2019, David Abdurachmanov wrote:

> There is one failing kernel selftest: global.user_notification_signal

Is this the only failing test?  Or are the rest of the selftests skipped 
when this test fails, and no further tests are run, as seems to be shown 
here:

  https://lore.kernel.org/linux-riscv/CADnnUqcmDMRe1f+3jG8SPR6jRrnBsY8VVD70VbKEm0NqYeoicA@mail.gmail.com/

For example, looking at the source, I'd naively expect to see the 
user_notification_closed_listener test result -- which follows right 
after the failing test in the selftest source.  But there aren't any 
results?

Also - could you follow up with the author of this failing test to see if 
we can get some more clarity about what might be going wrong here?  It 
appears that the failing test was added in commit 6a21cc50f0c7f ("seccomp: 
add a return code to trap to userspace") by Tycho Andersen 
<tycho@tycho.ws>.


- Paul

^ permalink raw reply

* Re: [PATCH net-next v2 03/10] net: sched: refactor block offloads counter usage
From: Jakub Kicinski @ 2019-08-24  0:26 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, jiri, davem, pablo
In-Reply-To: <20190823185056.12536-4-vladbu@mellanox.com>

On Fri, 23 Aug 2019 21:50:49 +0300, Vlad Buslov wrote:
> @@ -1201,14 +1199,11 @@ static int u32_reoffload_knode(struct tcf_proto *tp, struct tc_u_knode *n,
>  			cls_u32.knode.link_handle = ht->handle;
>  	}
>  
> -	err = cb(TC_SETUP_CLSU32, &cls_u32, cb_priv);
> -	if (err) {
> -		if (add && tc_skip_sw(n->flags))
> -			return err;
> -		return 0;
> -	}
> -
> -	tc_cls_offload_cnt_update(block, &n->in_hw_count, &n->flags, add);
> +	err = tc_setup_cb_reoffload(block, tp, add, cb, TC_SETUP_CLSU32,
> +				    &cls_u32, cb_priv, &n->flags,
> +				    &n->in_hw_count);
> +	if (err && add && tc_skip_sw(n->flags))
> +		return err;

Could this be further simplified by adding something along the lines of:

	if (!add || !tc_skip_sw(*flags))
		err = 0;

to tc_setup_cb_reoffload() ?

>  
>  	return 0;
>  }

^ permalink raw reply

* Re: [PATCHv4 net 1/2] ipv4/icmp: fix rt dst dev null pointer dereference
From: Jonathan Lemon @ 2019-08-23 22:32 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: netdev, Stefano Brivio, wenxu, Alexei Starovoitov,
	David S . Miller, Eric Dumazet, Julian Anastasov
In-Reply-To: <20190822141949.29561-2-liuhangbin@gmail.com>

On 22 Aug 2019, at 7:19, Hangbin Liu wrote:

> In __icmp_send() there is a possibility that the rt->dst.dev is NULL,
> e,g, with tunnel collect_md mode, which will cause kernel crash.
> Here is what the code path looks like, for GRE:
>
> - ip6gre_tunnel_xmit
>   - ip6gre_xmit_ipv4
>     - __gre6_xmit
>       - ip6_tnl_xmit
>         - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
>     - icmp_send
>       - net = dev_net(rt->dst.dev); <-- here
>
> The reason is __metadata_dst_init() init dst->dev to NULL by default.
> We could not fix it in __metadata_dst_init() as there is no dev supplied.
> On the other hand, the reason we need rt->dst.dev is to get the net.
> So we can just try get it from skb->dev when rt->dst.dev is NULL.
>
> v4: Julian Anastasov remind skb->dev also could be NULL. We'd better
> still use dst.dev and do a check to avoid crash.
>
> v3: No changes.
>
> v2: fix the issue in __icmp_send() instead of updating shared dst dev
> in {ip_md, ip6}_tunnel_xmit.
>
> Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply

* pull-request: bpf 2019-08-24
From: Daniel Borkmann @ 2019-08-24  0:11 UTC (permalink / raw)
  To: davem, jakub.kicinski; +Cc: daniel, ast, andrii.nakryiko, netdev, bpf

Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix verifier precision tracking with BPF-to-BPF calls, from Alexei.

2) Fix a use-after-free in prog symbol exposure, from Daniel.

3) Several s390x JIT fixes plus BE related fixes in BPF kselftests, from Ilya.

4) Fix memory leak by unpinning XDP umem pages in error path, from Ivan.

5) Fix a potential use-after-free on flow dissector detach, from Jakub.

6) Fix bpftool to close prog fd after showing metadata, from Quentin.

7) BPF kselftest config and TEST_PROGS_EXTENDED fixes, from Anders.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit 125b7e0949d4e72b15c2b1a1590f8cece985a918:

  net: tc35815: Explicitly check NET_IP_ALIGN is not zero in tc35815_rx (2019-08-11 21:41:48 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to 2c238177bd7f4b14bdf7447cc1cd9bb791f147e6:

  bpf: allow narrow loads of some sk_reuseport_md fields with offset > 0 (2019-08-24 01:25:41 +0200)

----------------------------------------------------------------
Alexei Starovoitov (1):
      bpf: fix precision tracking in presence of bpf2bpf calls

Anders Roxell (2):
      selftests/bpf: add config fragment BPF_JIT
      selftests/bpf: install files test_xdp_vlan.sh

Daniel Borkmann (1):
      bpf: fix use after free in prog symbol exposure

Ilya Leoshkevich (6):
      s390/bpf: fix lcgr instruction encoding
      s390/bpf: use 32-bit index for tail calls
      selftests/bpf: fix "bind{4, 6} deny specific IP & port" on s390
      selftests/bpf: fix test_cgroup_storage on s390
      selftests/bpf: fix test_btf_dump with O=
      bpf: allow narrow loads of some sk_reuseport_md fields with offset > 0

Ivan Khoronzhuk (1):
      xdp: unpin xdp umem pages in error path

Jakub Sitnicki (1):
      flow_dissector: Fix potential use-after-free on BPF_PROG_DETACH

Quentin Monnet (1):
      tools: bpftool: close prog FD before exit on showing a single program

 arch/s390/net/bpf_jit_comp.c                      | 12 +++++----
 kernel/bpf/syscall.c                              | 30 ++++++++++++++---------
 kernel/bpf/verifier.c                             |  9 ++++---
 net/core/filter.c                                 |  8 +++---
 net/core/flow_dissector.c                         |  2 +-
 net/xdp/xdp_umem.c                                |  4 ++-
 tools/bpf/bpftool/prog.c                          |  4 ++-
 tools/testing/selftests/bpf/Makefile              |  6 ++++-
 tools/testing/selftests/bpf/config                |  1 +
 tools/testing/selftests/bpf/test_btf_dump.c       |  7 ++++++
 tools/testing/selftests/bpf/test_cgroup_storage.c |  6 ++---
 tools/testing/selftests/bpf/test_sock.c           |  7 ++++--
 12 files changed, 62 insertions(+), 34 deletions(-)

^ permalink raw reply

* [net PATCH] net: route dump netlink NLM_F_MULTI flag missing
From: John Fastabend @ 2019-08-24  0:11 UTC (permalink / raw)
  To: sbrivio, davem, dsahern; +Cc: netdev, john.fastabend

An excerpt from netlink(7) man page,

  In multipart messages (multiple nlmsghdr headers with associated payload
  in one byte stream) the first and all following headers have the
  NLM_F_MULTI flag set, except for the last  header  which  has the type
  NLMSG_DONE.

but, after (ee28906) there is a missing NLM_F_MULTI flag in the middle of a
FIB dump. The result is user space applications following above man page
excerpt may get confused and may stop parsing msg believing something went
wrong.

In the golang netlink lib [0] the library logic stops parsing believing the
message is not a multipart message. Found this running Cilium[1] against
net-next while adding a feature to auto-detect routes. I noticed with
multiple route tables we no longer could detect the default routes on net
tree kernels because the library logic was not returning them.

Fix this by handling the fib_dump_info_fnhe() case the same way the
fib_dump_info() handles it by passing the flags argument through the
call chain and adding a flags argument to rt_fill_info().

Tested with Cilium stack and auto-detection of routes works again. Also
annotated libs to dump netlink msgs and inspected NLM_F_MULTI and
NLMSG_DONE flags look correct after this.

Note: In inet_rtm_getroute() pass rt_fill_info() '0' for flags the same
as is done for fib_dump_info() so this looks correct to me.

[0] https://github.com/vishvananda/netlink/
[1] https://github.com/cilium/

Fixes: ee28906fd7a14 ("ipv4: Dump route exceptions if requested")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/net/route.h |    2 +-
 net/ipv4/fib_trie.c |    2 +-
 net/ipv4/route.c    |   17 ++++++++++-------
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 630a0493f1f3..dfce19c9fa96 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -233,7 +233,7 @@ void rt_del_uncached_list(struct rtable *rt);
 
 int fib_dump_info_fnhe(struct sk_buff *skb, struct netlink_callback *cb,
 		       u32 table_id, struct fib_info *fi,
-		       int *fa_index, int fa_start);
+		       int *fa_index, int fa_start, unsigned int flags);
 
 static inline void ip_rt_put(struct rtable *rt)
 {
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 2b2b3d291ab0..1ab2fb6bb37d 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2145,7 +2145,7 @@ static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
 
 		if (filter->dump_exceptions) {
 			err = fib_dump_info_fnhe(skb, cb, tb->tb_id, fi,
-						 &i_fa, s_fa);
+						 &i_fa, s_fa, flags);
 			if (err < 0)
 				goto stop;
 		}
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 517300d587a7..b6a6f18c3dd1 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2728,7 +2728,8 @@ EXPORT_SYMBOL_GPL(ip_route_output_flow);
 /* called with rcu_read_lock held */
 static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
 			struct rtable *rt, u32 table_id, struct flowi4 *fl4,
-			struct sk_buff *skb, u32 portid, u32 seq)
+			struct sk_buff *skb, u32 portid, u32 seq,
+			unsigned int flags)
 {
 	struct rtmsg *r;
 	struct nlmsghdr *nlh;
@@ -2736,7 +2737,7 @@ static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
 	u32 error;
 	u32 metrics[RTAX_MAX];
 
-	nlh = nlmsg_put(skb, portid, seq, RTM_NEWROUTE, sizeof(*r), 0);
+	nlh = nlmsg_put(skb, portid, seq, RTM_NEWROUTE, sizeof(*r), flags);
 	if (!nlh)
 		return -EMSGSIZE;
 
@@ -2860,7 +2861,7 @@ static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
 static int fnhe_dump_bucket(struct net *net, struct sk_buff *skb,
 			    struct netlink_callback *cb, u32 table_id,
 			    struct fnhe_hash_bucket *bucket, int genid,
-			    int *fa_index, int fa_start)
+			    int *fa_index, int fa_start, unsigned int flags)
 {
 	int i;
 
@@ -2891,7 +2892,7 @@ static int fnhe_dump_bucket(struct net *net, struct sk_buff *skb,
 			err = rt_fill_info(net, fnhe->fnhe_daddr, 0, rt,
 					   table_id, NULL, skb,
 					   NETLINK_CB(cb->skb).portid,
-					   cb->nlh->nlmsg_seq);
+					   cb->nlh->nlmsg_seq, flags);
 			if (err)
 				return err;
 next:
@@ -2904,7 +2905,7 @@ static int fnhe_dump_bucket(struct net *net, struct sk_buff *skb,
 
 int fib_dump_info_fnhe(struct sk_buff *skb, struct netlink_callback *cb,
 		       u32 table_id, struct fib_info *fi,
-		       int *fa_index, int fa_start)
+		       int *fa_index, int fa_start, unsigned int flags)
 {
 	struct net *net = sock_net(cb->skb->sk);
 	int nhsel, genid = fnhe_genid(net);
@@ -2922,7 +2923,8 @@ int fib_dump_info_fnhe(struct sk_buff *skb, struct netlink_callback *cb,
 		err = 0;
 		if (bucket)
 			err = fnhe_dump_bucket(net, skb, cb, table_id, bucket,
-					       genid, fa_index, fa_start);
+					       genid, fa_index, fa_start,
+					       flags);
 		rcu_read_unlock();
 		if (err)
 			return err;
@@ -3183,7 +3185,8 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 				    fl4.flowi4_tos, res.fi, 0);
 	} else {
 		err = rt_fill_info(net, dst, src, rt, table_id, &fl4, skb,
-				   NETLINK_CB(in_skb).portid, nlh->nlmsg_seq);
+				   NETLINK_CB(in_skb).portid,
+				   nlh->nlmsg_seq, 0);
 	}
 	if (err < 0)
 		goto errout_rcu;


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox