Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [GIT PULL] Keys: Set 4 - Key ACLs for 5.3
From: David Howells @ 2019-08-16 13:36 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: dhowells, Linus Torvalds, James Morris, keyrings, Netdev,
	linux-nfs, CIFS, linux-afs, linux-fsdevel, linux-integrity,
	LSM List, Linux List Kernel Mailing
In-Reply-To: <1562814435.4014.11.camel@linux.ibm.com>

Mimi Zohar <zohar@linux.ibm.com> wrote:

> Sorry for the delay.  An exception is needed for loading builtin keys
> "KEY_ALLOC_BUILT_IN" onto a keyring that is not writable by userspace.
>  The following works, but probably is not how David would handle the
> exception.

I think the attached is the right way to fix it.

load_system_certificate_list(), for example, when it creates keys does this:

	key = key_create_or_update(make_key_ref(builtin_trusted_keys, 1),

marking the keyring as "possessed" in make_key_ref().  This allows the
possessor permits to be used - and that's the *only* way to use them for
internal keyrings like this because you can't link to them and you can't join
them.

David
---
diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 57be78b5fdfc..1f8f26f7bb05 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -99,7 +99,7 @@ static __init int system_trusted_keyring_init(void)
 	builtin_trusted_keys =
 		keyring_alloc(".builtin_trusted_keys",
 			      KUIDT_INIT(0), KGIDT_INIT(0), current_cred(),
-			      &internal_key_acl, KEY_ALLOC_NOT_IN_QUOTA,
+			      &internal_keyring_acl, KEY_ALLOC_NOT_IN_QUOTA,
 			      NULL, NULL);
 	if (IS_ERR(builtin_trusted_keys))
 		panic("Can't allocate builtin trusted keyring\n");
diff --git a/security/keys/permission.c b/security/keys/permission.c
index fc84d9ef6239..86efd3eaf083 100644
--- a/security/keys/permission.c
+++ b/security/keys/permission.c
@@ -47,7 +47,7 @@ struct key_acl internal_keyring_acl = {
 	.usage	= REFCOUNT_INIT(1),
 	.nr_ace	= 2,
 	.aces = {
-		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH),
+		KEY_POSSESSOR_ACE(KEY_ACE_SEARCH | KEY_ACE_WRITE),
 		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ | KEY_ACE_SEARCH),
 	}
 };

^ permalink raw reply related

* linux-next: Fixes tag needs some work in the bpf-next tree
From: Stephen Rothwell @ 2019-08-16 13:46 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov, Networking
  Cc: Linux Next Mailing List, Linux Kernel Mailing List,
	Quentin Monnet, Alexei Starovoitov

[-- Attachment #1: Type: text/plain, Size: 535 bytes --]

Hi all,

In commit

  ed4a3983cd3e ("tools: bpftool: fix argument for p_err() in BTF do_dump()")

Fixes tag

  Fixes: c93cc69004dt ("bpftool: add ability to dump BTF types")

has these problem(s):

  - missing space between the SHA1 and the subject

This is dues to the trailing 't' on the SHA1 :-(

  - SHA1 should be at least 12 digits long
    Can be fixed by setting core.abbrev to 12 (or more) or (for git v2.11
    or later) just making sure it is not set (or set to "auto").

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: r8169: Performance regression and latency instability
From: Holger Hoffstätte @ 2019-08-16 13:59 UTC (permalink / raw)
  To: Eric Dumazet, Juliana Rodrigueiro, netdev; +Cc: hkallweit1
In-Reply-To: <217e3fa9-7782-08c7-1f2b-8dabacaa83f9@gmail.com>

On 8/16/19 2:35 PM, Eric Dumazet wrote:
..snip..
> I also see this relevant commit : I have no idea why SG would have any relation with TSO.
> 
> commit a7eb6a4f2560d5ae64bfac98d79d11378ca2de6c
> Author: Holger Hoffstätte <holger@applied-asynchrony.com>
> Date:   Fri Aug 9 00:02:40 2019 +0200
> 
>      r8169: fix performance issue on RTL8168evl
>      
>      Disabling TSO but leaving SG active results is a significant
>      performance drop. Therefore disable also SG on RTL8168evl.
>      This restores the original performance.
>      
>      Fixes: 93681cd7d94f ("r8169: enable HW csum and TSO")
>      Signed-off-by: Holger Hoffstätte <holger@applied-asynchrony.com>
>      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>      Signed-off-by: David S. Miller <davem@davemloft.net>

It does not - and admittedly none of this makes sense, but stay with me here.

The commit 93681cd7d94f to net-next enabled rx/tx HW checksumming and TSO
by default, but disabled TSO for one specific chip revision - the most popular
one, of course. Enabling rx/tx checksums by default while leaving SG on turned
out to be the performance issue (~780 MBit max) that I found & fixed in the
quoted commit. SG *can* be enabled when rx/tx checkusmming is *dis*abled
(I just verified again), we just had to sanitize the new default.

An alternative strategy could still be to (again?) disable everything by default
and just let people manually enable whatever settings work for their random
chip revision + BIOS combination. I'll let Heiner chime in here.

Basically these chips are dumpster fires and should not be used for anything
ever, which of course means they are everywhere.

AFAICT none of this has anything to do with Juliana's problem..

-h

^ permalink raw reply

* Re: [RFC PATCH net-next 03/11] spi: Add a PTP system timestamp to the transfer structure
From: Vladimir Oltean @ 2019-08-16 14:05 UTC (permalink / raw)
  To: Mark Brown
  Cc: Hubert Feurstein, mlichvar, Richard Cochran, Andrew Lunn,
	Florian Fainelli, linux-spi, netdev
In-Reply-To: <20190816125820.GF4039@sirena.co.uk>

On Fri, 16 Aug 2019 at 15:58, Mark Brown <broonie@kernel.org> wrote:
>
> On Fri, Aug 16, 2019 at 03:35:30PM +0300, Vladimir Oltean wrote:
> > On Fri, 16 Aug 2019 at 15:18, Mark Brown <broonie@kernel.org> wrote:
> > > On Fri, Aug 16, 2019 at 03:44:41AM +0300, Vladimir Oltean wrote:
>
> > > > @@ -842,6 +843,9 @@ struct spi_transfer {
> > > >
> > > >       u32             effective_speed_hz;
> > > >
> > > > +     struct ptp_system_timestamp *ptp_sts;
> > > > +     unsigned int    ptp_sts_word_offset;
> > > > +
>
> > > You've not documented these fields at all so it's not clear what the
> > > intended usage is.
>
> > Thanks for looking into this.
> > Indeed I didn't document them as the patch is part of a RFC and I
> > thought the purpose was more clear from the context (cover letter
> > etc).
> > If I do ever send a patchset for submission I will document the newly
> > introduced fields properly.
>
> The issue I'm having is that I have zero idea about the PTP API so I've
> got nothing to go on when thinking about if this approach makes any
> sense unless I go do some research.
>
> > So let me clarify:
> > The SPI slave device driver is populating these fields to indicate to
> > the controller driver that it wants word number @ptp_sts_word_offset
> > from the tx buffer snapshotted. The controller driver is supposed to
> > put the snapshot into the @ptp_sts field, which is a pointer to a
> > memory location under the control of the SPI slave device driver.
>
> Snapshot here basically meaning recording a timestamp?  This interface
> does seem like it basically precludes DMA based controllers from using
> it unless someone happened to implement some very specific stuff in
> hardware which seems implausible.  I'd be inclined to just require that
> users can only snapshot the first (and possibly also the last, though
> DMA completions make that fun) word of a transfer, we could then pull
> this out into the core a bit by providing a wrapper function drivers
> should call at the appropriate moment.
>

I'm not sure how to respond to this, because I don't know anything
about the timing of DMA transfers.
Maybe snapshotting DMA transfers the same way is not possible (if at
all). Maybe they are not exactly adequate for this sort of application
anyway. Maybe it depends.
But the switch I'm working on is issuing an internal read transaction
of the PTP timer exactly at the 4th-to-last bit of the 3rd byte. This
is so that it has time (4 SPI clock cycles, to be precise) for the
result of the read transaction to become available again to the SPI
block, for output. It is impossible to know exactly when the switch
will snapshot the time internally (because there are several clock
domain crossings from the SPI interface towards its core) but for
certain it takes place during the latter part of the 3rd SPI byte. I
believe other devices are similar in this regard.
In other words, from a purely performance perspective, I am against
limiting the API to just snapshotting the first and last byte. At this
level of "zoom", if I change the offset of the byte to anything other
than 3, the synchronization offset refuses to converge towards zero,
because the snapshot is incurring a constant offset that the servo
loop from userspace (phc2sys) can't compensate for.
Maybe the SPI master driver should just report what sort of
snapshotting capability it can offer, ranging from none (default
unless otherwise specified), to transfer-level (DMA style) or
byte-level.
I'm afraid more actual experimentation is needed with DMA-based
controllers to understand what can be expected from them, and as a
result, how the API should map around them.
MDIO bus controllers are in a similar situation (with Hubert's patch)
but at least there the frame size is fixed and I haven't heard of an
MDIO controller to use DMA.
I'm not really sure what the next step would be. In the other thread,
Richard Cochran mentioned something about a two-part write API,
although I didn't quite understand the idea behind it.

> > It is ok if the ptp_sts pointer is NULL (no need to check), because
> > the API for taking snapshots already checks for that.
> > At the moment there is yet no proposed mechanism for the SPI slave
> > driver to ensure that the controller will really act upon this
> > request. That would be really nice to have, since some SPI slave
> > devices are time-sensitive and warning early is a good way to prevent
> > unnecessary troubleshooting.
>
> Yes, that's one of the things I was thinking about looking at the series
> - we should at least be able to warn if we can't timestamp so nobody
> gets confused, possibly error out if the calling code particularly
> depends on it.

Regards,
-Vladimir

^ permalink raw reply

* [PATCH] rtlwifi: remove unused variables 'RTL8712_SDIO_EFUSE_TABLE' and 'MAX_PGPKT_SIZE'
From: YueHaibing @ 2019-08-16 14:05 UTC (permalink / raw)
  To: pkshih, kvalo, davem; +Cc: linux-kernel, netdev, linux-wireless, YueHaibing

drivers/net/wireless/realtek/rtlwifi/efuse.c:16:31:
 warning: RTL8712_SDIO_EFUSE_TABLE defined but not used [-Wunused-const-variable=]
drivers/net/wireless/realtek/rtlwifi/efuse.c:9:17:
 warning: MAX_PGPKT_SIZE defined but not used [-Wunused-const-variable=]

They are never used, so can be removed.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
 drivers/net/wireless/realtek/rtlwifi/efuse.c | 17 -----------------
 1 file changed, 17 deletions(-)

diff --git a/drivers/net/wireless/realtek/rtlwifi/efuse.c b/drivers/net/wireless/realtek/rtlwifi/efuse.c
index ea4fc53..2646672 100644
--- a/drivers/net/wireless/realtek/rtlwifi/efuse.c
+++ b/drivers/net/wireless/realtek/rtlwifi/efuse.c
@@ -6,29 +6,12 @@
 #include "pci.h"
 #include <linux/export.h>
 
-static const u8 MAX_PGPKT_SIZE = 9;
 static const u8 PGPKT_DATA_SIZE = 8;
 static const int EFUSE_MAX_SIZE = 512;
 
 #define START_ADDRESS		0x1000
 #define REG_MCUFWDL		0x0080
 
-static const struct efuse_map RTL8712_SDIO_EFUSE_TABLE[] = {
-	{0, 0, 0, 2},
-	{0, 1, 0, 2},
-	{0, 2, 0, 2},
-	{1, 0, 0, 1},
-	{1, 0, 1, 1},
-	{1, 1, 0, 1},
-	{1, 1, 1, 3},
-	{1, 3, 0, 17},
-	{3, 3, 1, 48},
-	{10, 0, 0, 6},
-	{10, 3, 0, 1},
-	{10, 3, 1, 1},
-	{11, 0, 0, 28}
-};
-
 static const struct rtl_efuse_ops efuse_ops = {
 	.efuse_onebyte_read = efuse_one_byte_read,
 	.efuse_logical_map_read = efuse_shadow_read,
-- 
2.7.4



^ permalink raw reply related

* Re: linux-next: Signed-off-by missing for commits in the net-next tree
From: Gerd Rausch @ 2019-08-16 14:10 UTC (permalink / raw)
  To: Andy Grover, Stephen Rothwell, David Miller, Networking,
	Chris Mason
  Cc: Linux Next Mailing List, Linux Kernel Mailing List, Andy Grover,
	Chris Mason
In-Reply-To: <e85146f3-93a0-b23f-6a6e-11e42815946d@groveronline.com>

Hi,

On 16/08/2019 02.15, Andy Grover wrote:
> On 8/16/19 3:06 PM, Gerd Rausch wrote:
>> Hi,
>>
>> Just added the e-mail addresses I found using a simple "google search",
>> in order to reach out to the original authors of these commits:
>> Chris Mason and Andy Grover.
>>
>> I'm hoping they still remember their work from 7-8 years ago.
> 
> Yes looks like what I was working on. What did you need from me? It's
> too late to amend the commitlogs...
> 

I'll let Stephen or David respond to what (if any) action is necessary.

The missing Signed-off-by was pointed out to me by Stephen yesterday.

Hence I tried to locate you guys to pull you into the loop in order to
not leave his concern unanswered.

Thanks,

  Gerd





^ permalink raw reply

* Re: [PATCH RFC ipsec-next 0/7] ipsec: add TCP encapsulation support (RFC 8229)
From: Sabrina Dubroca @ 2019-08-16 14:18 UTC (permalink / raw)
  To: netdev, Steffen Klassert; +Cc: Herbert Xu
In-Reply-To: <cover.1561457281.git.sd@queasysnail.net>

Hi Steffen,

2019-06-25, 12:11:33 +0200, Sabrina Dubroca wrote:
> This patchset introduces support for TCP encapsulation of IKE and ESP
> messages, as defined by RFC 8229 [0]. It is an evolution of what
> Herbert Xu proposed in January 2018 [1] that addresses the main
> criticism against it, by not interfering with the TCP implementation
> at all. The networking stack now has infrastructure for this: TCP ULPs
> and Stream Parsers.

Have you had a chance to look at this?  I was going to rebase and
resend, but the patches still apply to ipsec-next and net-next (patch
2 is already in net-next as commit bd95e678e0f6).

Thanks,

-- 
Sabrina

^ permalink raw reply

* Re: linux-next: Signed-off-by missing for commits in the net-next tree
From: Stephen Rothwell @ 2019-08-16 14:31 UTC (permalink / raw)
  To: Gerd Rausch
  Cc: Andy Grover, David Miller, Networking, Chris Mason,
	Linux Next Mailing List, Linux Kernel Mailing List, Andy Grover,
	Chris Mason
In-Reply-To: <15078f1f-a036-2a54-1a07-9197f81bd58f@oracle.com>

[-- Attachment #1: Type: text/plain, Size: 1099 bytes --]

Hi all,

On Fri, 16 Aug 2019 07:10:34 -0700 Gerd Rausch <gerd.rausch@oracle.com> wrote:
>
> On 16/08/2019 02.15, Andy Grover wrote:
> > On 8/16/19 3:06 PM, Gerd Rausch wrote:  
> >>
> >> Just added the e-mail addresses I found using a simple "google search",
> >> in order to reach out to the original authors of these commits:
> >> Chris Mason and Andy Grover.
> >>
> >> I'm hoping they still remember their work from 7-8 years ago.  
> > 
> > Yes looks like what I was working on. What did you need from me? It's
> > too late to amend the commitlogs...

Yeah, Dave doesn't rebase his trees.

> I'll let Stephen or David respond to what (if any) action is necessary.
> 
> The missing Signed-off-by was pointed out to me by Stephen yesterday.
> 
> Hence I tried to locate you guys to pull you into the loop in order to
> not leave his concern unanswered.

It is OK for SOBs to be missing, I just wanted to make sure that it was
OK in this instance.  (Its better that I ask when its OK then not to
ask and find something has gone wrong.)

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: linux-next: Fixes tag needs some work in the bpf-next tree
From: Quentin Monnet @ 2019-08-16 14:35 UTC (permalink / raw)
  To: Stephen Rothwell, Daniel Borkmann, Alexei Starovoitov, Networking
  Cc: Linux Next Mailing List, Linux Kernel Mailing List
In-Reply-To: <20190816234613.351ddf07@canb.auug.org.au>

2019-08-16 23:46 UTC+1000 ~ Stephen Rothwell <sfr@canb.auug.org.au>
> Hi all,
> 
> In commit
> 
>   ed4a3983cd3e ("tools: bpftool: fix argument for p_err() in BTF do_dump()")
> 
> Fixes tag
> 
>   Fixes: c93cc69004dt ("bpftool: add ability to dump BTF types")
> 
> has these problem(s):
> 
>   - missing space between the SHA1 and the subject
> 
> This is dues to the trailing 't' on the SHA1 :-(
> 
>   - SHA1 should be at least 12 digits long
>     Can be fixed by setting core.abbrev to 12 (or more) or (for git v2.11
>     or later) just making sure it is not set (or set to "auto").
> 

Hi Stephen,

I made that typo, please accept my apologies :(.

The correct tag should be:

  Fixes: c93cc69004df ("bpftool: add ability to dump BTF types")

Regards,
Quentin

^ permalink raw reply

* RE: [PATCH net-next, 2/6] PCI: hv: Add a Hyper-V PCI mini driver for software backchannel interface
From: Haiyang Zhang @ 2019-08-16 14:48 UTC (permalink / raw)
  To: vkuznets, sashal@kernel.org, davem@davemloft.net,
	saeedm@mellanox.com, leon@kernel.org, eranbe@mellanox.com,
	lorenzo.pieralisi@arm.com, bhelgaas@google.com,
	linux-pci@vger.kernel.org, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org
  Cc: KY Srinivasan, Stephen Hemminger, linux-kernel@vger.kernel.org
In-Reply-To: <878srt8fd8.fsf@vitty.brq.redhat.com>



> -----Original Message-----
> From: Vitaly Kuznetsov <vkuznets@redhat.com>
> Sent: Friday, August 16, 2019 8:28 AM
> To: Haiyang Zhang <haiyangz@microsoft.com>; sashal@kernel.org;
> davem@davemloft.net; saeedm@mellanox.com; leon@kernel.org;
> eranbe@mellanox.com; lorenzo.pieralisi@arm.com; bhelgaas@google.com;
> linux-pci@vger.kernel.org; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org
> Cc: Haiyang Zhang <haiyangz@microsoft.com>; KY Srinivasan
> <kys@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH net-next, 2/6] PCI: hv: Add a Hyper-V PCI mini driver for
> software backchannel interface
> 
> Haiyang Zhang <haiyangz@microsoft.com> writes:
> 
> > This mini driver is a helper driver allows other drivers to have a
> > common interface with the Hyper-V PCI frontend driver.
> >
> > Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > ---
> >  MAINTAINERS                              |  1 +
> >  drivers/pci/Kconfig                      |  1 +
> >  drivers/pci/controller/Kconfig           |  7 ++++
> >  drivers/pci/controller/Makefile          |  1 +
> >  drivers/pci/controller/pci-hyperv-mini.c | 70
> ++++++++++++++++++++++++++++++++
> >  drivers/pci/controller/pci-hyperv.c      | 12 ++++--
> >  include/linux/hyperv.h                   | 30 ++++++++++----
> >  7 files changed, 111 insertions(+), 11 deletions(-)  create mode
> > 100644 drivers/pci/controller/pci-hyperv-mini.c
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS index e352550..c4962b9 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -7453,6 +7453,7 @@ F:	drivers/hid/hid-hyperv.c
> >  F:	drivers/hv/
> >  F:	drivers/input/serio/hyperv-keyboard.c
> >  F:	drivers/pci/controller/pci-hyperv.c
> > +F:	drivers/pci/controller/pci-hyperv-mini.c
> >  F:	drivers/net/hyperv/
> >  F:	drivers/scsi/storvsc_drv.c
> >  F:	drivers/uio/uio_hv_generic.c
> > diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index
> > 2ab9240..bb852f5 100644
> > --- a/drivers/pci/Kconfig
> > +++ b/drivers/pci/Kconfig
> > @@ -182,6 +182,7 @@ config PCI_LABEL
> >  config PCI_HYPERV
> >          tristate "Hyper-V PCI Frontend"
> >          depends on X86 && HYPERV && PCI_MSI && PCI_MSI_IRQ_DOMAIN
> &&
> > X86_64
> > +	select PCI_HYPERV_MINI
> >          help
> >            The PCI device frontend driver allows the kernel to import arbitrary
> >            PCI devices from a PCI backend to support PCI driver domains.
> > diff --git a/drivers/pci/controller/Kconfig
> > b/drivers/pci/controller/Kconfig index fe9f9f1..8e31cba 100644
> > --- a/drivers/pci/controller/Kconfig
> > +++ b/drivers/pci/controller/Kconfig
> > @@ -281,5 +281,12 @@ config VMD
> >  	  To compile this driver as a module, choose M here: the
> >  	  module will be called vmd.
> >
> > +config PCI_HYPERV_MINI
> > +	tristate "Hyper-V PCI Mini"
> > +	depends on X86 && HYPERV && PCI_MSI && PCI_MSI_IRQ_DOMAIN
> && X86_64
> > +	help
> > +	  The Hyper-V PCI Mini is a helper driver allows other drivers to
> > +	  have a common interface with the Hyper-V PCI frontend driver.
> > +
> 
> Out of pure curiosity, why not just export this interface from PCI_HYPERV
> directly? Why do we need this stub?

The pci_hyperv can only be loaded on VMs on Hyper-V and Azure. Other 
drivers like MLX5e will have symbolic dependency of pci_hyperv if they 
use functions exported by pci_hyperv. This dependency will cause other 
drivers fail to load on other platforms, like VMs on KVM. So we created 
this mini driver, which can be loaded on any platforms to provide the 
symbolic dependency.

Thanks,
- Haiyang

^ permalink raw reply

* Re: [PATCH] arm64: do_csum: implement accelerated scalar version
From: Robin Murphy @ 2019-08-16 14:55 UTC (permalink / raw)
  To: Shaokun Zhang, Will Deacon
  Cc: Ard Biesheuvel, linux-arm-kernel, netdev, ilias.apalodimas,
	huanglingyan (A), steve.capper
In-Reply-To: <37fbc2a3-069d-9f75-f3d0-3eda2efa5c9b@hisilicon.com>

On 16/08/2019 09:15, Shaokun Zhang wrote:
> Hi Will,
> 
> On 2019/8/16 0:46, Will Deacon wrote:
>> On Thu, May 16, 2019 at 11:14:35AM +0800, Zhangshaokun wrote:
>>> On 2019/5/15 17:47, Will Deacon wrote:
>>>> On Mon, Apr 15, 2019 at 07:18:22PM +0100, Robin Murphy wrote:
>>>>> On 12/04/2019 10:52, Will Deacon wrote:
>>>>>> I'm waiting for Robin to come back with numbers for a C implementation.
>>>>>>
>>>>>> Robin -- did you get anywhere with that?
>>>>>
>>>>> Still not what I would call finished, but where I've got so far (besides an
>>>>> increasingly elaborate test rig) is as below - it still wants some unrolling
>>>>> in the middle to really fly (and actual testing on BE), but the worst-case
>>>>> performance already equals or just beats this asm version on Cortex-A53 with
>>>>> GCC 7 (by virtue of being alignment-insensitive and branchless except for
>>>>> the loop). Unfortunately, the advantage of C code being instrumentable does
>>>>> also come around to bite me...
>>>>
>>>> Is there any interest from anybody in spinning a proper patch out of this?
>>>> Shaokun?
>>>
>>> HiSilicon's Kunpeng920(Hi1620) benefits from do_csum optimization, if Ard and
>>> Robin are ok, Lingyan or I can try to do it.
>>> Of course, if any guy posts the patch, we are happy to test it.
>>> Any will be ok.
>>
>> I don't mind who posts it, but Robin is super busy with SMMU stuff at the
>> moment so it probably makes more sense for you or Lingyan to do it.
> 
> Thanks for restarting this topic, I or Lingyan will do it soon.

FWIW, I've rolled up what I had so far and dumped it up into a quick 
semi-realistic patch here:

http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=859c5566510c32ae72039aa5072e932a771a3596

So far I'd put most of the effort into the aforementioned benchmarking 
harness to compare performance and correctness for all the proposed 
implementations over all reasonable alignment/length combinations - I 
think that got pretty much finished, but as Will says I'm unlikely to 
find time to properly look at this again for several weeks.

Robin.

^ permalink raw reply

* Re: [PATCH net-next v7 5/6] flow_offload: support get multi-subsystem block
From: Vlad Buslov @ 2019-08-16 15:04 UTC (permalink / raw)
  To: wenxu
  Cc: Vlad Buslov, Jakub Kicinski, David Miller, Jiri Pirko,
	pablo@netfilter.org, netfilter-devel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <f28ddefe-a7d8-e5ad-e03e-08cfee4db147@ucloud.cn>

On Wed 14 Aug 2019 at 05:50, wenxu <wenxu@ucloud.cn> wrote:
> On 8/12/2019 10:11 PM, Vlad Buslov wrote:
>>
>>> +static void flow_block_ing_cmd(struct net_device *dev,
>>> +			       flow_indr_block_bind_cb_t *cb,
>>> +			       void *cb_priv,
>>> +			       enum flow_block_command command)
>>> +{
>>> +	struct flow_indr_block_ing_entry *entry;
>>> +
>>> +	rcu_read_lock();
>>> +	list_for_each_entry_rcu(entry, &block_ing_cb_list, list) {
>>> +		entry->cb(dev, cb, cb_priv, command);
>>> +	}
>>> +	rcu_read_unlock();
>>> +}
>> Hi,
>>
>> I'm getting following incorrect rcu usage warnings with this patch
>> caused by rcu_read_lock in flow_block_ing_cmd:
>>
>> [  401.510948] =============================
>> [  401.510952] WARNING: suspicious RCU usage
>> [  401.510993] 5.3.0-rc3+ #589 Not tainted
>> [  401.510996] -----------------------------
>> [  401.511001] include/linux/rcupdate.h:265 Illegal context switch in RCU read-side critical section!
>> [  401.511004]
>>                other info that might help us debug this:
>>
>> [  401.511008]
>>                rcu_scheduler_active = 2, debug_locks = 1
>> [  401.511012] 7 locks held by test-ecmp-add-v/7576:
>> [  401.511015]  #0: 00000000081d71a5 (sb_writers#4){.+.+}, at: vfs_write+0x166/0x1d0
>> [  401.511037]  #1: 000000002bd338c3 (&of->mutex){+.+.}, at: kernfs_fop_write+0xef/0x1b0
>> [  401.511051]  #2: 00000000c921c634 (kn->count#317){.+.+}, at: kernfs_fop_write+0xf7/0x1b0
>> [  401.511062]  #3: 00000000a19cdd56 (&dev->mutex){....}, at: sriov_numvfs_store+0x6b/0x130
>> [  401.511079]  #4: 000000005425fa52 (pernet_ops_rwsem){++++}, at: unregister_netdevice_notifier+0x30/0x140
>> [  401.511092]  #5: 00000000c5822793 (rtnl_mutex){+.+.}, at: unregister_netdevice_notifier+0x35/0x140
>> [  401.511101]  #6: 00000000c2f3507e (rcu_read_lock){....}, at: flow_block_ing_cmd+0x5/0x130
>> [  401.511115]
>>                stack backtrace:
>> [  401.511121] CPU: 21 PID: 7576 Comm: test-ecmp-add-v Not tainted 5.3.0-rc3+ #589
>> [  401.511124] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
>> [  401.511127] Call Trace:
>> [  401.511138]  dump_stack+0x85/0xc0
>> [  401.511146]  ___might_sleep+0x100/0x180
>> [  401.511154]  __mutex_lock+0x5b/0x960
>> [  401.511162]  ? find_held_lock+0x2b/0x80
>> [  401.511173]  ? __tcf_get_next_chain+0x1d/0xb0
>> [  401.511179]  ? mark_held_locks+0x49/0x70
>> [  401.511194]  ? __tcf_get_next_chain+0x1d/0xb0
>> [  401.511198]  __tcf_get_next_chain+0x1d/0xb0
>> [  401.511251]  ? uplink_rep_async_event+0x70/0x70 [mlx5_core]
>> [  401.511261]  tcf_block_playback_offloads+0x39/0x160
>> [  401.511276]  tcf_block_setup+0x1b0/0x240
>> [  401.511312]  ? mlx5e_rep_indr_setup_tc_cb+0xca/0x290 [mlx5_core]
>> [  401.511347]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
>> [  401.511359]  tc_indr_block_get_and_ing_cmd+0x11b/0x1e0
>> [  401.511404]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
>> [  401.511414]  flow_block_ing_cmd+0x7e/0x130
>> [  401.511453]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
>> [  401.511462]  __flow_indr_block_cb_unregister+0x7f/0xf0
>> [  401.511502]  mlx5e_nic_rep_netdevice_event+0x75/0xb0 [mlx5_core]
>> [  401.511513]  unregister_netdevice_notifier+0xe9/0x140
>> [  401.511554]  mlx5e_cleanup_rep_tx+0x6f/0xe0 [mlx5_core]
>> [  401.511597]  mlx5e_detach_netdev+0x4b/0x60 [mlx5_core]
>> [  401.511637]  mlx5e_vport_rep_unload+0x71/0xc0 [mlx5_core]
>> [  401.511679]  esw_offloads_disable+0x5b/0x90 [mlx5_core]
>> [  401.511724]  mlx5_eswitch_disable.cold+0xdf/0x176 [mlx5_core]
>> [  401.511759]  mlx5_device_disable_sriov+0xab/0xb0 [mlx5_core]
>> [  401.511794]  mlx5_core_sriov_configure+0xaf/0xd0 [mlx5_core]
>> [  401.511805]  sriov_numvfs_store+0xf8/0x130
>> [  401.511817]  kernfs_fop_write+0x122/0x1b0
>> [  401.511826]  vfs_write+0xdb/0x1d0
>> [  401.511835]  ksys_write+0x65/0xe0
>> [  401.511847]  do_syscall_64+0x5c/0xb0
>> [  401.511857]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [  401.511862] RIP: 0033:0x7fad892d30f8
>> [  401.511868] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 96 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 60 c3 0f 1f 80 00 00 00 00 48 83
>>  ec 28 48 89
>> [  401.511871] RSP: 002b:00007ffca2a9fad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> [  401.511875] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fad892d30f8
>> [  401.511878] RDX: 0000000000000002 RSI: 000055afeb072a90 RDI: 0000000000000001
>> [  401.511881] RBP: 000055afeb072a90 R08: 00000000ffffffff R09: 000000000000000a
>> [  401.511884] R10: 000055afeb058710 R11: 0000000000000246 R12: 0000000000000002
>> [  401.511887] R13: 00007fad893a8780 R14: 0000000000000002 R15: 00007fad893a3740
>>
>> I don't think it is correct approach to try to call these callbacks with
>> rcu protection because:
>>
>> - Cls API uses sleeping locks that cannot be used in rcu read section
>>   (hence the included trace).
>>
>> - It assumes that all implementation of classifier ops reoffload() don't
>>   sleep.
>>
>> - And that all driver offload callbacks (both block and classifier
>>   setup) don't sleep, which is not the case.
>>
>> I don't see any straightforward way to fix this, besides using some
>> other locking mechanism to protect block_ing_cb_list.
>>
>> Regards,
>> Vlad
>
> Maybe get the  mutex flow_indr_block_ing_cb_lock for both lookup, add, delete? 
>
> the callbacks_lists. the add and delete is work only on modules init case. So the
>
> lookup is also not frequently(ony [un]register) and can protect with the locks.

That should do the job. I'll send the patch.

^ permalink raw reply

* [PATCH net-next] net: flow_offload: convert block_ing_cb_list to regular list type
From: Vlad Buslov @ 2019-08-16 15:06 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, wenxu, pablo, Vlad Buslov

RCU list block_ing_cb_list is protected by rcu read lock in
flow_block_ing_cmd() and with flow_indr_block_ing_cb_lock mutex in all
functions that use it. However, flow_block_ing_cmd() needs to call blocking
functions while iterating block_ing_cb_list which leads to following
suspicious RCU usage warning:

[  401.510948] =============================
[  401.510952] WARNING: suspicious RCU usage
[  401.510993] 5.3.0-rc3+ #589 Not tainted
[  401.510996] -----------------------------
[  401.511001] include/linux/rcupdate.h:265 Illegal context switch in RCU read-side critical section!
[  401.511004]
               other info that might help us debug this:

[  401.511008]
               rcu_scheduler_active = 2, debug_locks = 1
[  401.511012] 7 locks held by test-ecmp-add-v/7576:
[  401.511015]  #0: 00000000081d71a5 (sb_writers#4){.+.+}, at: vfs_write+0x166/0x1d0
[  401.511037]  #1: 000000002bd338c3 (&of->mutex){+.+.}, at: kernfs_fop_write+0xef/0x1b0
[  401.511051]  #2: 00000000c921c634 (kn->count#317){.+.+}, at: kernfs_fop_write+0xf7/0x1b0
[  401.511062]  #3: 00000000a19cdd56 (&dev->mutex){....}, at: sriov_numvfs_store+0x6b/0x130
[  401.511079]  #4: 000000005425fa52 (pernet_ops_rwsem){++++}, at: unregister_netdevice_notifier+0x30/0x140
[  401.511092]  #5: 00000000c5822793 (rtnl_mutex){+.+.}, at: unregister_netdevice_notifier+0x35/0x140
[  401.511101]  #6: 00000000c2f3507e (rcu_read_lock){....}, at: flow_block_ing_cmd+0x5/0x130
[  401.511115]
               stack backtrace:
[  401.511121] CPU: 21 PID: 7576 Comm: test-ecmp-add-v Not tainted 5.3.0-rc3+ #589
[  401.511124] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[  401.511127] Call Trace:
[  401.511138]  dump_stack+0x85/0xc0
[  401.511146]  ___might_sleep+0x100/0x180
[  401.511154]  __mutex_lock+0x5b/0x960
[  401.511162]  ? find_held_lock+0x2b/0x80
[  401.511173]  ? __tcf_get_next_chain+0x1d/0xb0
[  401.511179]  ? mark_held_locks+0x49/0x70
[  401.511194]  ? __tcf_get_next_chain+0x1d/0xb0
[  401.511198]  __tcf_get_next_chain+0x1d/0xb0
[  401.511251]  ? uplink_rep_async_event+0x70/0x70 [mlx5_core]
[  401.511261]  tcf_block_playback_offloads+0x39/0x160
[  401.511276]  tcf_block_setup+0x1b0/0x240
[  401.511312]  ? mlx5e_rep_indr_setup_tc_cb+0xca/0x290 [mlx5_core]
[  401.511347]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
[  401.511359]  tc_indr_block_get_and_ing_cmd+0x11b/0x1e0
[  401.511404]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
[  401.511414]  flow_block_ing_cmd+0x7e/0x130
[  401.511453]  ? mlx5e_rep_indr_tc_block_unbind+0x50/0x50 [mlx5_core]
[  401.511462]  __flow_indr_block_cb_unregister+0x7f/0xf0
[  401.511502]  mlx5e_nic_rep_netdevice_event+0x75/0xb0 [mlx5_core]
[  401.511513]  unregister_netdevice_notifier+0xe9/0x140
[  401.511554]  mlx5e_cleanup_rep_tx+0x6f/0xe0 [mlx5_core]
[  401.511597]  mlx5e_detach_netdev+0x4b/0x60 [mlx5_core]
[  401.511637]  mlx5e_vport_rep_unload+0x71/0xc0 [mlx5_core]
[  401.511679]  esw_offloads_disable+0x5b/0x90 [mlx5_core]
[  401.511724]  mlx5_eswitch_disable.cold+0xdf/0x176 [mlx5_core]
[  401.511759]  mlx5_device_disable_sriov+0xab/0xb0 [mlx5_core]
[  401.511794]  mlx5_core_sriov_configure+0xaf/0xd0 [mlx5_core]
[  401.511805]  sriov_numvfs_store+0xf8/0x130
[  401.511817]  kernfs_fop_write+0x122/0x1b0
[  401.511826]  vfs_write+0xdb/0x1d0
[  401.511835]  ksys_write+0x65/0xe0
[  401.511847]  do_syscall_64+0x5c/0xb0
[  401.511857]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  401.511862] RIP: 0033:0x7fad892d30f8
[  401.511868] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 96 0d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 60 c3 0f 1f 80 00 00 00 00 48 83
 ec 28 48 89
[  401.511871] RSP: 002b:00007ffca2a9fad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  401.511875] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fad892d30f8
[  401.511878] RDX: 0000000000000002 RSI: 000055afeb072a90 RDI: 0000000000000001
[  401.511881] RBP: 000055afeb072a90 R08: 00000000ffffffff R09: 000000000000000a
[  401.511884] R10: 000055afeb058710 R11: 0000000000000246 R12: 0000000000000002
[  401.511887] R13: 00007fad893a8780 R14: 0000000000000002 R15: 00007fad893a3740

To fix the described incorrect RCU usage, convert block_ing_cb_list from
RCU list to regular list and protect it with flow_indr_block_ing_cb_lock
mutex in flow_block_ing_cmd().

Fixes: 1150ab0f1b33 ("flow_offload: support get multi-subsystem block")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---
 net/core/flow_offload.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index 64c3d4d72b9c..cf52d9c422fa 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -391,6 +391,8 @@ static void flow_indr_block_cb_del(struct flow_indr_block_cb *indr_block_cb)
 	kfree(indr_block_cb);
 }
 
+static DEFINE_MUTEX(flow_indr_block_ing_cb_lock);
+
 static void flow_block_ing_cmd(struct net_device *dev,
 			       flow_indr_block_bind_cb_t *cb,
 			       void *cb_priv,
@@ -398,11 +400,11 @@ static void flow_block_ing_cmd(struct net_device *dev,
 {
 	struct flow_indr_block_ing_entry *entry;
 
-	rcu_read_lock();
-	list_for_each_entry_rcu(entry, &block_ing_cb_list, list) {
+	mutex_lock(&flow_indr_block_ing_cb_lock);
+	list_for_each_entry(entry, &block_ing_cb_list, list) {
 		entry->cb(dev, cb, cb_priv, command);
 	}
-	rcu_read_unlock();
+	mutex_unlock(&flow_indr_block_ing_cb_lock);
 }
 
 int __flow_indr_block_cb_register(struct net_device *dev, void *cb_priv,
@@ -497,11 +499,10 @@ void flow_indr_block_call(struct net_device *dev,
 }
 EXPORT_SYMBOL_GPL(flow_indr_block_call);
 
-static DEFINE_MUTEX(flow_indr_block_ing_cb_lock);
 void flow_indr_add_block_ing_cb(struct flow_indr_block_ing_entry *entry)
 {
 	mutex_lock(&flow_indr_block_ing_cb_lock);
-	list_add_tail_rcu(&entry->list, &block_ing_cb_list);
+	list_add_tail(&entry->list, &block_ing_cb_list);
 	mutex_unlock(&flow_indr_block_ing_cb_lock);
 }
 EXPORT_SYMBOL_GPL(flow_indr_add_block_ing_cb);
@@ -509,7 +510,7 @@ EXPORT_SYMBOL_GPL(flow_indr_add_block_ing_cb);
 void flow_indr_del_block_ing_cb(struct flow_indr_block_ing_entry *entry)
 {
 	mutex_lock(&flow_indr_block_ing_cb_lock);
-	list_del_rcu(&entry->list);
+	list_del(&entry->list);
 	mutex_unlock(&flow_indr_block_ing_cb_lock);
 }
 EXPORT_SYMBOL_GPL(flow_indr_del_block_ing_cb);
-- 
2.21.0


^ permalink raw reply related

* [PATCH RFC net-next 2/3] net: dsa: add port_setup/port_teardown methods to DSA ops
From: Marek Behún @ 2019-08-16 15:08 UTC (permalink / raw)
  To: netdev
  Cc: Andrew Lunn, Vivien Didelot, Vladimir Oltean, Florian Fainelli,
	Marek Behún
In-Reply-To: <20190816150834.26939-1-marek.behun@nic.cz>

Add two new methods into the DSA operations structure:
  - port_setup(), called from dsa_port_setup() after the DSA port
    already registered
  - port_teardown(), called from dsa_port_teardown() before the port is
    unregistered

Signed-off-by: Marek Behún <marek.behun@nic.cz>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
---
 include/net/dsa.h |  2 ++
 net/dsa/dsa2.c    | 10 +++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 147b757ef8ea..848898e5d7c5 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -360,6 +360,8 @@ struct dsa_switch_ops {
 
 	int	(*setup)(struct dsa_switch *ds);
 	void	(*teardown)(struct dsa_switch *ds);
+	int	(*port_setup)(struct dsa_switch *ds, int port);
+	void	(*port_teardown)(struct dsa_switch *ds, int port);
 	u32	(*get_phy_flags)(struct dsa_switch *ds, int port);
 
 	/*
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 3abd173ebacb..c891300a6d2c 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -315,6 +315,9 @@ static int dsa_port_setup(struct dsa_port *dp)
 		break;
 	}
 
+	if (!err && ds->ops->port_setup)
+		err = ds->ops->port_setup(ds, dp->index);
+
 	if (err)
 		devlink_port_unregister(&dp->devlink_port);
 
@@ -323,8 +326,13 @@ static int dsa_port_setup(struct dsa_port *dp)
 
 static void dsa_port_teardown(struct dsa_port *dp)
 {
-	if (dp->type != DSA_PORT_TYPE_UNUSED)
+	struct dsa_switch *ds = dp->ds;
+
+	if (dp->type != DSA_PORT_TYPE_UNUSED) {
 		devlink_port_unregister(&dp->devlink_port);
+		if (ds->ops->port_teardown)
+			ds->ops->port_teardown(ds, dp->index);
+	}
 
 	switch (dp->type) {
 	case DSA_PORT_TYPE_UNUSED:
-- 
2.21.0


^ permalink raw reply related

* [PATCH RFC net-next 1/3] net: dsa: mv88e6xxx: support 2500base-x in SGMII IRQ handler
From: Marek Behún @ 2019-08-16 15:08 UTC (permalink / raw)
  To: netdev
  Cc: Andrew Lunn, Vivien Didelot, Vladimir Oltean, Florian Fainelli,
	Marek Behún
In-Reply-To: <20190816150834.26939-1-marek.behun@nic.cz>

The mv88e6390_serdes_irq_link_sgmii IRQ handler reads the SERDES PHY
status register to determine speed, among other things. If cmode of the
port is set to 2500base-x, though, the PHY still reports 1000 Mbps (the
PHY register itself does not differentiate between 1000 Mbps and 2500
Mbps - it thinks it is running at 1000 Mbps, although clock is 2.5x
faster).
Look at the cmode and set SPEED_2500 if cmode is set to 2500base-x.

Signed-off-by: Marek Behún <marek.behun@nic.cz>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
---
 drivers/net/dsa/mv88e6xxx/serdes.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx/serdes.c b/drivers/net/dsa/mv88e6xxx/serdes.c
index 20c526c2a9ee..17bb4a705d44 100644
--- a/drivers/net/dsa/mv88e6xxx/serdes.c
+++ b/drivers/net/dsa/mv88e6xxx/serdes.c
@@ -506,6 +506,7 @@ static void mv88e6390_serdes_irq_link_sgmii(struct mv88e6xxx_chip *chip,
 					    int port, int lane)
 {
 	struct dsa_switch *ds = chip->ds;
+	u8 cmode = chip->ports[port].cmode;
 	int duplex = DUPLEX_UNKNOWN;
 	int speed = SPEED_UNKNOWN;
 	int link, err;
@@ -527,7 +528,10 @@ static void mv88e6390_serdes_irq_link_sgmii(struct mv88e6xxx_chip *chip,
 
 		switch (status & MV88E6390_SGMII_PHY_STATUS_SPEED_MASK) {
 		case MV88E6390_SGMII_PHY_STATUS_SPEED_1000:
-			speed = SPEED_1000;
+			if (cmode == MV88E6XXX_PORT_STS_CMODE_2500BASEX)
+				speed = SPEED_2500;
+			else
+				speed = SPEED_1000;
 			break;
 		case MV88E6390_SGMII_PHY_STATUS_SPEED_100:
 			speed = SPEED_100;
-- 
2.21.0


^ permalink raw reply related

* [PATCH RFC net-next 0/3] mv88e6xxx: setting 2500base-x mode for CPU/DSA port in dts
From: Marek Behún @ 2019-08-16 15:08 UTC (permalink / raw)
  To: netdev
  Cc: Andrew Lunn, Vivien Didelot, Vladimir Oltean, Florian Fainelli,
	Marek Behún

Hello,

I am preparing device tree for Turris Mox, and am encountering the
question of how to properly tell in DTS that a specific port is
connected to the cpu via 2500base-x mode.

CPU port is connected to eth1. In eth1, I simply have
  &eth1 {
    phy-mode = "2500base-x";
    managed = "in-band-status";
  };

This does not work for the CPU/DSA ports though, because of how phylink
and mv88e6xxx operate. CPU/DSA ports SERDES is enabled from
mv88e6xxx_setup(). SERDES irq is not enabled for there ports. Enabling
SERDES irq for there ports cannot be done from mv88e6xxx_setup(),
because the IRQ may fire immediately after enablement, and the phylink
structures do not exist yet for there ports (they are create by DSA only
after the .setup() method is called).

One way to make it work is to use fixed-link for there ports, with
speed = <2500>. But looking at the mv88e6xxx driver I discovered that
this works only because we are lucky (or because of my commit
65b034cf5c176, whatever you prefer. But when I was sending that
patch, fixed-link was not needed for the switch to work):
  - when first the mv88e6xxx_port_setup_mac is called from
    mv88e6xxx_setup_port, it is called with SPEED_MAX. The
    ->port_max_speed_mode() method is called to determine cmode for
    this port, which is 2500basex
  - afterwards, mv88e6xxx_port_setup_mac is called with parameters
    speed=2500 and mode=PHY_INTERFACE_MODE_NA. Because of commit
    65b034cf5c176, cmode is not set to something else

I think that the correct way to do this would be for CPU/DSA port nodes
to have the same setting in dts as the eth node (eg 2500base-x/inband).

For this to work, the mv88e6390_serdes_irq_link_sgmii has to be able
to determine between 2500base-x vs 1000base-x modes. This is done by
the first patch.

The second patch adds two new methods into the DSA operations structure,
.port_setup() and .port_teardown(). The .port_setup is called from
dsa_port_setup() after the port is registered and phylink structure
already exists, and .port_teardown() is called from dsa_port_teardown()
before the port is unregistered.

The third patch then utilizes these new methods to enable/disable
SERDESes and ther IRQs for CPU/DSA ports.

Please comment on this new code, whether it is acceptable to have these
new methods, if they should be called differently, and so on.

Thank you.

Marek

Marek Behún (3):
  net: dsa: mv88e6xxx: support 2500base-x in SGMII IRQ handler
  net: dsa: add port_setup/port_teardown methods to DSA ops
  net: dsa: mv88e6xxx: setup SERDES irq also for CPU/DSA ports

 drivers/net/dsa/mv88e6xxx/chip.c   | 54 ++++++++++++++++++++++++------
 drivers/net/dsa/mv88e6xxx/serdes.c |  6 +++-
 include/net/dsa.h                  |  2 ++
 net/dsa/dsa2.c                     | 10 +++++-
 4 files changed, 60 insertions(+), 12 deletions(-)

-- 
2.21.0

^ permalink raw reply

* [PATCH RFC net-next 3/3] net: dsa: mv88e6xxx: setup SERDES irq also for CPU/DSA ports
From: Marek Behún @ 2019-08-16 15:08 UTC (permalink / raw)
  To: netdev
  Cc: Andrew Lunn, Vivien Didelot, Vladimir Oltean, Florian Fainelli,
	Marek Behún
In-Reply-To: <20190816150834.26939-1-marek.behun@nic.cz>

When CPU/DSA port is put into for example into 2500base-x mode, the
SERDES irq has to be enabled so that port's MAC is configured properly
after autonegotiation.

When SERDES irq is being enabled, the port's phylink structure already
has to exist. Otherwise if the IRQ fires immediately, the IRQ routine's
access to the nonexistent phylink structure results in an exception.

We therefore enable SERDES irqs for CPU/DSA ports in the .port_setup()
method, which is called by DSA from dsa_setup_port after the port is
registered and phylink structures exist.

We also move SERDES powering on for CPU/DSA ports to this method.

We also free the IRQ and power off SERDESes for these ports in the
.port_teardown() method.

Signed-off-by: Marek Behún <marek.behun@nic.cz>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 54 ++++++++++++++++++++++++++------
 1 file changed, 44 insertions(+), 10 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 9b3ad22a5b98..23d3e39d2b9c 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2151,16 +2151,6 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port)
 	if (err)
 		return err;
 
-	/* Enable the SERDES interface for DSA and CPU ports. Normal
-	 * ports SERDES are enabled when the port is enabled, thus
-	 * saving a bit of power.
-	 */
-	if ((dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))) {
-		err = mv88e6xxx_serdes_power(chip, port, true);
-		if (err)
-			return err;
-	}
-
 	/* Port Control 2: don't force a good FCS, set the maximum frame size to
 	 * 10240 bytes, disable 802.1q tags checking, don't discard tagged or
 	 * untagged frames on this port, do a destination address lookup on all
@@ -2557,6 +2547,48 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
 	return err;
 }
 
+static int mv88e6xxx_port_setup(struct dsa_switch *ds, int port)
+{
+	struct mv88e6xxx_chip *chip = ds->priv;
+	int err;
+
+	/* Enable the SERDES interface for DSA and CPU ports. Normal
+	 * ports SERDES are enabled when the port is enabled, thus
+	 * saving a bit of power.
+	 */
+	if ((dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))) {
+		mv88e6xxx_reg_lock(chip);
+
+		err = mv88e6xxx_serdes_power(chip, port, true);
+
+		if (!err && chip->info->ops->serdes_irq_setup)
+			err = chip->info->ops->serdes_irq_setup(chip, port);
+
+		mv88e6xxx_reg_unlock(chip);
+
+		return err;
+	}
+
+	return 0;
+}
+
+static void mv88e6xxx_port_teardown(struct dsa_switch *ds, int port)
+{
+	struct mv88e6xxx_chip *chip = ds->priv;
+
+	if ((dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))) {
+		mv88e6xxx_reg_lock(chip);
+
+		if (chip->info->ops->serdes_irq_free)
+			chip->info->ops->serdes_irq_free(chip, port);
+
+		if (mv88e6xxx_serdes_power(chip, port, false))
+			dev_err(chip->dev, "failed to power off SERDES\n");
+
+		mv88e6xxx_reg_unlock(chip);
+	}
+}
+
 static int mv88e6xxx_mdio_read(struct mii_bus *bus, int phy, int reg)
 {
 	struct mv88e6xxx_mdio_bus *mdio_bus = bus->priv;
@@ -4692,6 +4724,8 @@ static int mv88e6xxx_port_egress_floods(struct dsa_switch *ds, int port,
 static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
 	.get_tag_protocol	= mv88e6xxx_get_tag_protocol,
 	.setup			= mv88e6xxx_setup,
+	.port_setup		= mv88e6xxx_port_setup,
+	.port_teardown		= mv88e6xxx_port_teardown,
 	.phylink_validate	= mv88e6xxx_validate,
 	.phylink_mac_link_state	= mv88e6xxx_link_state,
 	.phylink_mac_config	= mv88e6xxx_mac_config,
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH v4 3/9] nvmem: core: add nvmem_device_find
From: Thomas Bogendoerfer @ 2019-08-16 14:09 UTC (permalink / raw)
  To: Srinivas Kandagatla
  Cc: Ralf Baechle, Paul Burton, James Hogan, Dmitry Torokhov,
	Lee Jones, David S. Miller, Alessandro Zummo, Alexandre Belloni,
	Greg Kroah-Hartman, Jiri Slaby, Evgeniy Polyakov, linux-mips,
	linux-kernel, linux-input, netdev, linux-rtc, linux-serial
In-Reply-To: <31d680ee-ddb3-8536-c915-576222d263e1@linaro.org>

On Wed, Aug 14, 2019 at 01:52:49PM +0100, Srinivas Kandagatla wrote:
> On 14/08/2019 12:46, Thomas Bogendoerfer wrote:
> >On Tue, 13 Aug 2019 10:40:34 +0100
> >Srinivas Kandagatla <srinivas.kandagatla@linaro.org> wrote:
> >>On 09/08/2019 11:32, Thomas Bogendoerfer wrote:
> >>>nvmem_device_find provides a way to search for nvmem devices with
> >>>the help of a match function simlair to bus_find_device.
> >>>
> >>>Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
> >>>---
> >>>   drivers/nvmem/core.c           | 62 ++++++++++++++++++++++--------------------
> >>>   include/linux/nvmem-consumer.h |  9 ++++++
> >>>   2 files changed, 41 insertions(+), 30 deletions(-)
> >>
> >>Have you considered using nvmem_register_notifier() ?
> >
> >yes, that was the first idea. But then I realized I need to build up
> >a private database of information already present in nvmem bus. So I
> >looked for a way to retrieve it from there. Unfortunately I couldn't
> >use bus_find_device directly, because nvmem_bus_type and struct nvmem_device
> >is hidden. So I refactured the lookup code and added a more universal
> >lookup function, which fits my needs and should be usable for more.
> I see your point.
> 
> overall the patch as it is look good, but recently we added more generic
> lookups for DT node, looks like part of your patch is un-doing generic
> device name lookup.
> 
> DT node match lookup is in https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git/log/?h=generic_lookup_helpers

these patches are not in Linus tree, yet. I guess they will show up
in 5.4. No idea how to deal with it right now, do you ?

> Other missing bit is adding this api to documentation in
> ./Documentation/driver-api/nvmem.rst

ok, will do.

Thomas.

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.                                                [ RFC1925, 2.3 ]

^ permalink raw reply

* [PATCH] ravb: Fix use-after-free ravb_tstamp_skb
From: Simon Horman @ 2019-08-16 15:17 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Tho Vu <tho.vu.wh@rvc.renesas.com>

When a Tx timestamp is requested, a pointer to the skb is stored in the
ravb_tstamp_skb struct. This was done without an skb_get. There exists
the possibility that the skb could be freed by ravb_tx_free (when
ravb_tx_free is called from ravb_start_xmit) before the timestamp was
processed, leading to a use-after-free bug.

Use skb_get when filling a ravb_tstamp_skb struct, and add appropriate
frees/consumes when a ravb_tstamp_skb struct is freed.

Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
Signed-off-by: Tho Vu <tho.vu.wh@rvc.renesas.com>
Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
---
As this is an old bug I am submitting a fix against net-next rather than
net: I do not see any urgency here. I am however, happy for it to be
applied against net instead.
---
 drivers/net/ethernet/renesas/ravb_main.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index ef8f08931fe8..6cacd5e893ac 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Renesas Ethernet AVB device driver
  *
- * Copyright (C) 2014-2015 Renesas Electronics Corporation
+ * Copyright (C) 2014-2019 Renesas Electronics Corporation
  * Copyright (C) 2015 Renesas Solutions Corp.
  * Copyright (C) 2015-2016 Cogent Embedded, Inc. <source@cogentembedded.com>
  *
@@ -513,7 +513,10 @@ static void ravb_get_tx_tstamp(struct net_device *ndev)
 			kfree(ts_skb);
 			if (tag == tfa_tag) {
 				skb_tstamp_tx(skb, &shhwtstamps);
+				dev_consume_skb_any(skb);
 				break;
+			} else {
+				dev_kfree_skb_any(skb);
 			}
 		}
 		ravb_modify(ndev, TCCR, TCCR_TFR, TCCR_TFR);
@@ -1564,7 +1567,7 @@ static netdev_tx_t ravb_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 			}
 			goto unmap;
 		}
-		ts_skb->skb = skb;
+		ts_skb->skb = skb_get(skb);
 		ts_skb->tag = priv->ts_skb_tag++;
 		priv->ts_skb_tag &= 0x3ff;
 		list_add_tail(&ts_skb->list, &priv->ts_skb_list);
@@ -1693,6 +1696,7 @@ static int ravb_close(struct net_device *ndev)
 	/* Clear the timestamp list */
 	list_for_each_entry_safe(ts_skb, ts_skb2, &priv->ts_skb_list, list) {
 		list_del(&ts_skb->list);
+		kfree_skb(ts_skb->skb);
 		kfree(ts_skb);
 	}
 
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH bpf-next] libbpf: relicense bpf_helpers.h and bpf_endian.h
From: Daniel Borkmann @ 2019-08-16 15:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Andrii Nakryiko
  Cc: bpf, netdev, andrii.nakryiko, kernel-team, Michael Holzheu,
	Naveen N . Rao, David S . Miller, Michal Rostecki, John Fastabend,
	Sargun Dhillon
In-Reply-To: <20190816141001.4a879101@carbon>

On 8/16/19 2:10 PM, Jesper Dangaard Brouer wrote:
> On Thu, 15 Aug 2019 22:45:43 -0700
> Andrii Nakryiko <andriin@fb.com> wrote:
> 
>> bpf_helpers.h and bpf_endian.h contain useful macros and BPF helper
>> definitions essential to almost every BPF program. Which makes them
>> useful not just for selftests. To be able to expose them as part of
>> libbpf, though, we need them to be dual-licensed as LGPL-2.1 OR
>> BSD-2-Clause. This patch updates licensing of those two files.
> 
> I've already ACKed this, and is fine with (LGPL-2.1 OR BSD-2-Clause).
> 
> I just want to understand, why "BSD-2-Clause" and not "Apache-2.0" ?
> 
> The original argument was that this needed to be compatible with
> "Apache-2.0", then why not simply add this in the "OR" ?

It's use is discouraged in the kernel tree, see also LICENSES/dual/Apache-2.0 (below) and
statement wrt compatibility from https://www.apache.org/licenses/GPL-compatibility.html:

   Valid-License-Identifier: Apache-2.0
   SPDX-URL: https://spdx.org/licenses/Apache-2.0.html
   Usage-Guide:
     Do NOT use. The Apache-2.0 is not GPL2 compatible. [...]

>> Acked-by: Alexei Starovoitov <ast@kernel.org>
>> Acked-by: Hechao Li <hechaol@fb.com>
>> Acked-by: Martin KaFai Lau <kafai@fb.com>
>> Acked-by: Andrey Ignatov <rdna@fb.com>
>> Acked-by: Yonghong Song <yhs@fb.com>
>> Acked-by: Lawrence Brakmo <brakmo@fb.com>
>> Acked-by: Adam Barth <arb@fb.com>
>> Acked-by: Roman Gushchin <guro@fb.com>
>> Acked-by: Josef Bacik <jbacik@fb.com>
>> Acked-by: Joe Stringer <joe@wand.net.nz>
>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>> Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> Acked-by: David Ahern <dsahern@gmail.com>
>> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> Confirming I acked this.
> 
>> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
>> Acked-by: Lorenz Bauer <lmb@cloudflare.com>
>> Acked-by: Adrian Ratiu <adrian.ratiu@collabora.com>
>> Acked-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
>> Acked-by: Willem de Bruijn <willemb@google.com>
>> Acked-by: Petar Penkov <ppenkov@google.com>
>> Acked-by: Teng Qin <palmtenor@gmail.com>
>> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
>> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
>> Cc: David S. Miller <davem@davemloft.net>
>> Cc: Michal Rostecki <mrostecki@opensuse.org>
>> Cc: John Fastabend <john.fastabend@gmail.com>
>> Cc: Sargun Dhillon <sargun@sargun.me>
>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>> ---
>>   tools/testing/selftests/bpf/bpf_endian.h  | 2 +-
>>   tools/testing/selftests/bpf/bpf_helpers.h | 2 +-
>>   2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/testing/selftests/bpf/bpf_endian.h b/tools/testing/selftests/bpf/bpf_endian.h
>> index 05f036df8a4c..ff3593b0ae03 100644
>> --- a/tools/testing/selftests/bpf/bpf_endian.h
>> +++ b/tools/testing/selftests/bpf/bpf_endian.h
>> @@ -1,4 +1,4 @@
>> -/* SPDX-License-Identifier: GPL-2.0 */
>> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
>>   #ifndef __BPF_ENDIAN__
>>   #define __BPF_ENDIAN__
>>   
>> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
>> index 8b503ea142f0..6c4930bc6e2e 100644
>> --- a/tools/testing/selftests/bpf/bpf_helpers.h
>> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
>> @@ -1,4 +1,4 @@
>> -/* SPDX-License-Identifier: GPL-2.0 */
>> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
>>   #ifndef __BPF_HELPERS_H
>>   #define __BPF_HELPERS_H
>>   
> 
> 
> 


^ permalink raw reply

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
From: Stanislav Fomichev @ 2019-08-16 15:35 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu
In-Reply-To: <4614fefc-fc43-8cf7-d064-7dc1947acc6c@gmail.com>

On 08/16, Toshiaki Makita wrote:
> On 2019/08/16 0:21, Stanislav Fomichev wrote:
> > On 08/15, Toshiaki Makita wrote:
> > > On 2019/08/15 2:07, Stanislav Fomichev wrote:
> > > > On 08/13, Toshiaki Makita wrote:
> > > > > * Implementation
> > > > > 
> > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF
> > > > > program dynamically but a prebuilt program is embedded in UMH. This is
> > > > > mainly because flow insertion is considerably frequent. If we generate
> > > > > and load an eBPF program on each insertion of a flow, the latency of the
> > > > > first packet of ping in above test will incease, which I want to avoid.
> > > > Can this be instead implemented with a new hook that will be called
> > > > for TC events? This hook can write to perf event buffer and control
> > > > plane will insert/remove/modify flow tables in the BPF maps (contol
> > > > plane will also install xdp program).
> > > > 
> > > > Why do we need UMH? What am I missing?
> > > 
> > > So you suggest doing everything in xdp_flow kmod?
> > You probably don't even need xdp_flow kmod. Add new tc "offload" mode
> > (bypass) that dumps every command via netlink (or calls the BPF hook
> > where you can dump it into perf event buffer) and then read that info
> > from userspace and install xdp programs and modify flow tables.
> > I don't think you need any kernel changes besides that stream
> > of data from the kernel about qdisc/tc flow creation/removal/etc.
> 
> My intention is to make more people who want high speed network easily use XDP,
> so making transparent XDP offload with current TC interface.
> 
> What userspace program would monitor TC events with your suggestion?
Have a new system daemon (xdpflowerd) that is independently
packaged/shipped/installed. Anybody who wants accelerated TC can
download/install it. OVS can be completely unaware of this.

> ovs-vswitchd? If so, it even does not need to monitor TC. It can
> implement XDP offload directly.
> (However I prefer kernel solution. Please refer to "About alternative
> userland (ovs-vswitchd etc.) implementation" section in the cover letter.)
> 
> Also such a TC monitoring solution easily can be out-of-sync with real TC
> behavior as TC filter/flower is being heavily developed and changed,
> e.g. introduction of TC block, support multiple masks with the same pref, etc.
> I'm not sure such an unreliable solution have much value.
This same issue applies to the in-kernel implementation, isn't it?
What happens if somebody sends patches for a new flower feature but
doesn't add appropriate xdp support? Do we reject them?

That's why I'm suggesting to move this problem to the userspace :-)

> > But, I haven't looked at the series deeply, so I might be missing
> > something :-)
> > 
> > > I also thought about that. There are two phases so let's think about them separately.
> > > 
> > > 1) TC block (qdisc) creation / eBPF load
> > > 
> > > I saw eBPF maintainers repeatedly saying eBPF program loading needs to be
> > > done from userland, not from kernel, to run the verifier for safety.
> > > However xdp_flow eBPF program is prebuilt and embedded in kernel so we may
> > > allow such programs to be loaded from kernel? I currently don't have the will
> > > to make such an API as loading can be done with current UMH mechanism.
> > > 
> > > 2) flow insertion / eBPF map update
> > > 
> > > Not sure if this needs to be done from userland. One concern is that eBPF maps can
> > > be modified by unrelated processes and we need to handle all unexpected state of maps.
> > > Such handling tends to be difficult and may cause unexpected kernel behavior.
> > > OTOH updating maps from kmod may reduces the latency of flow insertion drastically.
> > Latency from the moment I type 'tc filter add ...' to the moment the rule
> > is installed into the maps? Does it really matter?
> 
> Yes it matters. Flow insertion is kind of data path in OVS.
> Please see how ping latency is affected in the cover letter.
Ok, but what I'm suggesting shouldn't be less performant.
We are talking about UMH writing into a pipe vs writing TC events into
a netlink.

> > Do I understand correctly that both of those events (qdisc creation and
> > flow insertion) are triggered from tcf_block_offload_cmd (or similar)?
> 
> Both of eBPF load and map update are triggered from tcf_block_offload_cmd.
> I think you understand it correctly.
> 
> Toshiaki Makita

^ permalink raw reply

* Re: [PATCH bpf-next] libbpf: remove zc variable as it is not used
From: Yonghong Song @ 2019-08-16 15:37 UTC (permalink / raw)
  To: Magnus Karlsson, bjorn.topel@intel.com, ast@kernel.org,
	daniel@iogearbox.net, netdev@vger.kernel.org
  Cc: bpf@vger.kernel.org
In-Reply-To: <1565951171-14439-1-git-send-email-magnus.karlsson@intel.com>



On 8/16/19 3:26 AM, Magnus Karlsson wrote:
> The zc is not used in the xsk part of libbpf, so let us remove it. Not
> good to have dead code lying around.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> Reported-by: Yonghong Song <yhs@fb.com> > ---
>   tools/lib/bpf/xsk.c | 3 ---
>   1 file changed, 3 deletions(-)
> 
> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> index 680e630..9687da9 100644
> --- a/tools/lib/bpf/xsk.c
> +++ b/tools/lib/bpf/xsk.c
> @@ -65,7 +65,6 @@ struct xsk_socket {
>   	int xsks_map_fd;
>   	__u32 queue_id;
>   	char ifname[IFNAMSIZ];
> -	bool zc;
>   };
>   
>   struct xsk_nl_info {
> @@ -608,8 +607,6 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
>   		goto out_mmap_tx;
>   	}
>   
> -	xsk->zc = opts.flags & XDP_OPTIONS_ZEROCOPY;

Since opts.flags usage is removed. Do you think it makes sense to
remove
         optlen = sizeof(opts);
         err = getsockopt(xsk->fd, SOL_XDP, XDP_OPTIONS, &opts, &optlen);
         if (err) {
                 err = -errno;
                 goto out_mmap_tx;
         }
as well since nobody then uses opts?

> -
>   	if (!(xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)) {
>   		err = xsk_setup_xdp_prog(xsk);
>   		if (err)
> 

^ permalink raw reply

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
From: Stanislav Fomichev @ 2019-08-16 15:59 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toshiaki Makita, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu
In-Reply-To: <20190815122232.4b1fa01c@cakuba.netronome.com>

On 08/15, Jakub Kicinski wrote:
> On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote:
> > On 08/15, Toshiaki Makita wrote:
> > > On 2019/08/15 2:07, Stanislav Fomichev wrote:  
> > > > On 08/13, Toshiaki Makita wrote:  
> > > > > * Implementation
> > > > > 
> > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF
> > > > > program dynamically but a prebuilt program is embedded in UMH. This is
> > > > > mainly because flow insertion is considerably frequent. If we generate
> > > > > and load an eBPF program on each insertion of a flow, the latency of the
> > > > > first packet of ping in above test will incease, which I want to avoid.  
> > > > Can this be instead implemented with a new hook that will be called
> > > > for TC events? This hook can write to perf event buffer and control
> > > > plane will insert/remove/modify flow tables in the BPF maps (contol
> > > > plane will also install xdp program).
> > > > 
> > > > Why do we need UMH? What am I missing?  
> > > 
> > > So you suggest doing everything in xdp_flow kmod?  
> > You probably don't even need xdp_flow kmod. Add new tc "offload" mode
> > (bypass) that dumps every command via netlink (or calls the BPF hook
> > where you can dump it into perf event buffer) and then read that info
> > from userspace and install xdp programs and modify flow tables.
> > I don't think you need any kernel changes besides that stream
> > of data from the kernel about qdisc/tc flow creation/removal/etc.
> 
> There's a certain allure in bringing the in-kernel BPF translation
> infrastructure forward. OTOH from system architecture perspective IMHO
> it does seem like a task best handed in user space. bpfilter can replace
> iptables completely, here we're looking at an acceleration relatively
> loosely coupled with flower.
Even for bpfilter I would've solved it using something similar:
iptables bypass + redirect iptables netlink requests to some
userspace helper that was registered to be iptables compatibility
manager. And then, again, it becomes a purely userspace problem.

The issue with UMH is that the helper has to be statically compiled
from the kernel tree, which means we can't bring in any dependencies
(stuff like libkefir you mentioned below).

But I digress :-)

> FWIW Quentin spent some time working on a universal flow rule to BPF
> translation library:
> 
> https://github.com/Netronome/libkefir
> 
> A lot remains to be done there, but flower front end is one of the
> targets. A library can be tuned for any application, without a
> dependency on flower uAPI.
> 
> > But, I haven't looked at the series deeply, so I might be missing
> > something :-)
> 
> I don't think you are :)

^ permalink raw reply

* [bpf-next] selftests/bpf: fix race in test_tcp_rtt test
From: Petar Penkov @ 2019-08-16 16:03 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, sdf, Petar Penkov

From: Petar Penkov <ppenkov@google.com>

There is a race in this test between receiving the ACK for the
single-byte packet sent in the test, and reading the values from the
map.

This patch fixes this by having the client wait until there are no more
unacknowledged packets.

Before:
for i in {1..1000}; do ../net/in_netns.sh ./test_tcp_rtt; \
done | grep -c PASSED
< trimmed error messages >
993

After:
for i in {1..10000}; do ../net/in_netns.sh ./test_tcp_rtt; \
done | grep -c PASSED
10000

Fixes: b55873984dab ("selftests/bpf: test BPF_SOCK_OPS_RTT_CB")
Signed-off-by: Petar Penkov <ppenkov@google.com>
---
 tools/testing/selftests/bpf/test_tcp_rtt.c | 31 ++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_tcp_rtt.c b/tools/testing/selftests/bpf/test_tcp_rtt.c
index 90c3862f74a8..2b4754473956 100644
--- a/tools/testing/selftests/bpf/test_tcp_rtt.c
+++ b/tools/testing/selftests/bpf/test_tcp_rtt.c
@@ -6,6 +6,7 @@
 #include <sys/types.h>
 #include <sys/socket.h>
 #include <netinet/in.h>
+#include <netinet/tcp.h>
 #include <pthread.h>
 
 #include <linux/filter.h>
@@ -34,6 +35,30 @@ static void send_byte(int fd)
 		error(1, errno, "Failed to send single byte");
 }
 
+static int wait_for_ack(int fd, int retries)
+{
+	struct tcp_info info;
+	socklen_t optlen;
+	int i, err;
+
+	for (i = 0; i < retries; i++) {
+		optlen = sizeof(info);
+		err = getsockopt(fd, SOL_TCP, TCP_INFO, &info, &optlen);
+		if (err < 0) {
+			log_err("Failed to lookup TCP stats");
+			return err;
+		}
+
+		if (info.tcpi_unacked == 0)
+			return 0;
+
+		sleep(1);
+	}
+
+	log_err("Did not receive ACK");
+	return -1;
+}
+
 static int verify_sk(int map_fd, int client_fd, const char *msg, __u32 invoked,
 		     __u32 dsack_dups, __u32 delivered, __u32 delivered_ce,
 		     __u32 icsk_retransmits)
@@ -149,6 +174,11 @@ static int run_test(int cgroup_fd, int server_fd)
 			 /*icsk_retransmits=*/0);
 
 	send_byte(client_fd);
+	if (wait_for_ack(client_fd, 5) < 0) {
+		err = -1;
+		goto close_client_fd;
+	}
+
 
 	err += verify_sk(map_fd, client_fd, "first payload byte",
 			 /*invoked=*/2,
@@ -157,6 +187,7 @@ static int run_test(int cgroup_fd, int server_fd)
 			 /*delivered_ce=*/0,
 			 /*icsk_retransmits=*/0);
 
+close_client_fd:
 	close(client_fd);
 
 close_bpf_object:
-- 
2.23.0.rc1.153.gdeed80330f-goog


^ permalink raw reply related

* Re: [bpf-next] selftests/bpf: fix race in test_tcp_rtt test
From: Stanislav Fomichev @ 2019-08-16 16:13 UTC (permalink / raw)
  To: Petar Penkov; +Cc: netdev, bpf, davem, ast, daniel, sdf, Petar Penkov
In-Reply-To: <20190816160339.249832-1-ppenkov.kernel@gmail.com>

On 08/16, Petar Penkov wrote:
> From: Petar Penkov <ppenkov@google.com>
> 
> There is a race in this test between receiving the ACK for the
> single-byte packet sent in the test, and reading the values from the
> map.
> 
> This patch fixes this by having the client wait until there are no more
> unacknowledged packets.
> 
> Before:
> for i in {1..1000}; do ../net/in_netns.sh ./test_tcp_rtt; \
> done | grep -c PASSED
> < trimmed error messages >
> 993
> 
> After:
> for i in {1..10000}; do ../net/in_netns.sh ./test_tcp_rtt; \
> done | grep -c PASSED
> 10000
> 
> Fixes: b55873984dab ("selftests/bpf: test BPF_SOCK_OPS_RTT_CB")
> Signed-off-by: Petar Penkov <ppenkov@google.com>
> ---
>  tools/testing/selftests/bpf/test_tcp_rtt.c | 31 ++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/test_tcp_rtt.c b/tools/testing/selftests/bpf/test_tcp_rtt.c
> index 90c3862f74a8..2b4754473956 100644
> --- a/tools/testing/selftests/bpf/test_tcp_rtt.c
> +++ b/tools/testing/selftests/bpf/test_tcp_rtt.c
> @@ -6,6 +6,7 @@
>  #include <sys/types.h>
>  #include <sys/socket.h>
>  #include <netinet/in.h>
> +#include <netinet/tcp.h>
>  #include <pthread.h>
>  
>  #include <linux/filter.h>
> @@ -34,6 +35,30 @@ static void send_byte(int fd)
>  		error(1, errno, "Failed to send single byte");
>  }
>  
> +static int wait_for_ack(int fd, int retries)
> +{
> +	struct tcp_info info;
> +	socklen_t optlen;
> +	int i, err;
> +
> +	for (i = 0; i < retries; i++) {
> +		optlen = sizeof(info);
> +		err = getsockopt(fd, SOL_TCP, TCP_INFO, &info, &optlen);
> +		if (err < 0) {
> +			log_err("Failed to lookup TCP stats");
> +			return err;
> +		}
> +
> +		if (info.tcpi_unacked == 0)
> +			return 0;
> +
> +		sleep(1);
Isn't it too big of a hammer? Maybe usleep(10) here and do x100 retries
instead?

> +	}
> +
> +	log_err("Did not receive ACK");
> +	return -1;
> +}
> +
>  static int verify_sk(int map_fd, int client_fd, const char *msg, __u32 invoked,
>  		     __u32 dsack_dups, __u32 delivered, __u32 delivered_ce,
>  		     __u32 icsk_retransmits)
> @@ -149,6 +174,11 @@ static int run_test(int cgroup_fd, int server_fd)
>  			 /*icsk_retransmits=*/0);
>  
>  	send_byte(client_fd);
> +	if (wait_for_ack(client_fd, 5) < 0) {
> +		err = -1;
> +		goto close_client_fd;
> +	}
> +
>  
>  	err += verify_sk(map_fd, client_fd, "first payload byte",
>  			 /*invoked=*/2,
> @@ -157,6 +187,7 @@ static int run_test(int cgroup_fd, int server_fd)
>  			 /*delivered_ce=*/0,
>  			 /*icsk_retransmits=*/0);
>  
> +close_client_fd:
>  	close(client_fd);
>  
>  close_bpf_object:
> -- 
> 2.23.0.rc1.153.gdeed80330f-goog
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox