Netdev List


* Re: [PATCH net 0/3] ipv6: fix error path of inet6_init()
From: Xin Long @ 2018-08-28 15:30 UTC (permalink / raw)
  To: Sabrina Dubroca; +Cc: netdev
In-Reply-To: <cover.1535451234.git.sd@queasysnail.net>



----- Original Message -----
> The error path of inet6_init() can trigger multiple kernel panics,
> mostly due to wrong ordering of cleanups. This series fixes those
> issues.
> 
> Sabrina Dubroca (3):
>   ipv6: fix cleanup ordering for ip6_mr failure
>   ipv6: fix cleanup ordering for pingv6 registration
>   net: rtnl: return early from rtnl_unregister_all when protocol isn't
>     registered
> 
>  net/core/rtnetlink.c |  4 ++++
>  net/ipv6/af_inet6.c  | 10 +++++-----
>  2 files changed, 9 insertions(+), 5 deletions(-)
> 
> --
> 2.18.0
> 
> 
Series Reviewed-by: Xin Long <lucien.xin@gmail.com>

^ permalink raw reply

* Re: [PATCH] iprule: Fix destination prefix output
From: Luca Boccassi @ 2018-08-28 15:38 UTC (permalink / raw)
  To: Stefan Bader, netdev; +Cc: Stephen Hemminger, Christian Ehrhardt
In-Reply-To: <1535466449-24190-1-git-send-email-stefan.bader@canonical.com>

[-- Attachment #1: Type: text/plain, Size: 1174 bytes --]

On Tue, 2018-08-28 at 16:27 +0200, Stefan Bader wrote:
> When adding support for JSON output the new code for printing
> the destination prefix adds a stray blank character before
> the bitmask. This causes some user-space parsing to fail.
> 
> Current output:
>   ...: from x.x.x.x/l to y.y.y.y /l
> Previous output:
>   ...: from x.x.x.x/l to y.y.y.y/l
> 
> Fixes: 0dd4ccc5 "iprule: add json support"
> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
> ---
>  ip/iprule.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/ip/iprule.c b/ip/iprule.c
> index 8b94214..744d6d8 100644
> --- a/ip/iprule.c
> +++ b/ip/iprule.c
> @@ -239,7 +239,7 @@ int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
>  
>  		print_string(PRINT_FP, NULL, "to ", NULL);
>  		print_color_string(PRINT_ANY, ifa_family_color(frh->family),
> -				   "dst", "%s ", dst);
> +				   "dst", "%s", dst);
>  		if (frh->dst_len != host_len)
>  			print_uint(PRINT_ANY, "dstlen", "/%u ", frh->dst_len);
>  		else

Acked-by: Luca Boccassi <bluca@debian.org>

-- 
Kind regards,
Luca Boccassi

^ permalink raw reply

* [PATCH net] net: bcmgenet: use MAC link status for fixed phy
From: Doug Berger @ 2018-08-28 19:33 UTC (permalink / raw)
  To: David S. Miller; +Cc: Florian Fainelli, netdev, linux-kernel, Doug Berger

When using the fixed PHY with GENET (e.g. MOCA) the PHY link
status can be determined from the internal link status captured
by the MAC. This allows the PHY state machine to use the correct
link state with the fixed PHY even if MAC link event interrupts
are missed when the net device is opened.

Fixes: 8d88c6e ("net: bcmgenet: enable MoCA link state change detection")
Signed-off-by: Doug Berger <opendmb@gmail.com>
---
 drivers/net/ethernet/broadcom/genet/bcmgenet.h |  3 +++
 drivers/net/ethernet/broadcom/genet/bcmmii.c   | 10 ++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.h b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
index b773bc0..14b49612a 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.h
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
@@ -186,6 +186,9 @@ struct bcmgenet_mib_counters {
 #define UMAC_MAC1			0x010
 #define UMAC_MAX_FRAME_LEN		0x014
 
+#define UMAC_MODE			0x44
+#define  MODE_LINK_STATUS		(1 << 5)
+
 #define UMAC_EEE_CTRL			0x064
 #define  EN_LPI_RX_PAUSE		(1 << 0)
 #define  EN_LPI_TX_PFC			(1 << 1)
diff --git a/drivers/net/ethernet/broadcom/genet/bcmmii.c b/drivers/net/ethernet/broadcom/genet/bcmmii.c
index 5333274..4241ae9 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmmii.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmmii.c
@@ -115,8 +115,14 @@ void bcmgenet_mii_setup(struct net_device *dev)
 static int bcmgenet_fixed_phy_link_update(struct net_device *dev,
 					  struct fixed_phy_status *status)
 {
-	if (dev && dev->phydev && status)
-		status->link = dev->phydev->link;
+	struct bcmgenet_priv *priv;
+	u32 reg;
+
+	if (dev && dev->phydev && status) {
+		priv = netdev_priv(dev);
+		reg = bcmgenet_umac_readl(priv, UMAC_MODE);
+		status->link = !!(reg & MODE_LINK_STATUS);
+	}
 
 	return 0;
 }
-- 
2.7.4

^ permalink raw reply related
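
A minimal user-space sketch of the bit test this patch relies on, assuming
only the UMAC_MODE offset and MODE_LINK_STATUS mask shown in the diff. The
helper name umac_mode_link_up() is hypothetical, and the register value is
passed in directly because bcmgenet_umac_readl() exists only inside the
driver.

```c
#include <stdint.h>

/* Values copied from the patch above. */
#define UMAC_MODE		0x44
#define MODE_LINK_STATUS	(1u << 5)

/*
 * Hypothetical helper: reduce the link bit of a UMAC_MODE register
 * value to exactly 0 (link down) or 1 (link up).
 */
static int umac_mode_link_up(uint32_t reg)
{
	return !!(reg & MODE_LINK_STATUS);
}
```

The double negation collapses any non-zero masked value to 1, so the result
is always exactly 0 or 1, which is the form the patch assigns to
status->link.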

* Re: [PATCH 1/2] net/ibm/emac: wrong emac_calc_base call was used by typo
From: Christian Lamparter @ 2018-08-28 19:35 UTC (permalink / raw)
  To: Ivan Mikhaylov; +Cc: netdev, David S . Miller, linux-kernel
In-Reply-To: <20180827164336.8815-1-ivan@de.ibm.com>

On Monday, August 27, 2018 6:43:35 PM CEST Ivan Mikhaylov wrote:
> __emac_calc_base_mr1 was used instead of __emac4_calc_base_mr1
> by copy-paste mistake for emac4syn.
> 
> Fixes: 45d6e545505fd32edb812f085be7de45b6a5c0af ("net/ibm/emac: add 8192 rx/tx fifo size")
> Signed-off-by: Ivan Mikhaylov <ivan@de.ibm.com>

Always nice to see :) .

I do have an off-topic question though:
Since you are working for IBM and probably have access to all the
wonderful docs and the working hardware: Would you consider being
the emac/emac4/emac-sync maintainer? From what I can parse from
the /MAINTAINERS file, there isn't currently anyone listed.

Thanks,
Christian

^ permalink raw reply

* net-next is OPEN...
From: David Miller @ 2018-08-28 15:43 UTC (permalink / raw)
  To: netdev


You know the drill...

http://vger.kernel.org/~davem/net-next.html

^ permalink raw reply

* Re: [PATCH net] net: bcmgenet: use MAC link status for fixed phy
From: Florian Fainelli @ 2018-08-28 19:40 UTC (permalink / raw)
  To: Doug Berger, David S. Miller; +Cc: netdev, linux-kernel
In-Reply-To: <1535484795-31871-1-git-send-email-opendmb@gmail.com>

On 08/28/2018 12:33 PM, Doug Berger wrote:
> When using the fixed PHY with GENET (e.g. MOCA) the PHY link
> status can be determined from the internal link status captured
> by the MAC. This allows the PHY state machine to use the correct
> link state with the fixed PHY even if MAC link event interrupts
> are missed when the net device is opened.
> 
> Fixes: 8d88c6e ("net: bcmgenet: enable MoCA link state change detection")

The 12-digit sha1 for that commit would be 8d88c6ebb34c

> Signed-off-by: Doug Berger <opendmb@gmail.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

Thanks Doug!

> ---
>  drivers/net/ethernet/broadcom/genet/bcmgenet.h |  3 +++
>  drivers/net/ethernet/broadcom/genet/bcmmii.c   | 10 ++++++++--
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.h b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
> index b773bc0..14b49612a 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.h
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
> @@ -186,6 +186,9 @@ struct bcmgenet_mib_counters {
>  #define UMAC_MAC1			0x010
>  #define UMAC_MAX_FRAME_LEN		0x014
>  
> +#define UMAC_MODE			0x44
> +#define  MODE_LINK_STATUS		(1 << 5)
> +
>  #define UMAC_EEE_CTRL			0x064
>  #define  EN_LPI_RX_PAUSE		(1 << 0)
>  #define  EN_LPI_TX_PFC			(1 << 1)
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmmii.c b/drivers/net/ethernet/broadcom/genet/bcmmii.c
> index 5333274..4241ae9 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmmii.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmmii.c
> @@ -115,8 +115,14 @@ void bcmgenet_mii_setup(struct net_device *dev)
>  static int bcmgenet_fixed_phy_link_update(struct net_device *dev,
>  					  struct fixed_phy_status *status)
>  {
> -	if (dev && dev->phydev && status)
> -		status->link = dev->phydev->link;
> +	struct bcmgenet_priv *priv;
> +	u32 reg;
> +
> +	if (dev && dev->phydev && status) {
> +		priv = netdev_priv(dev);
> +		reg = bcmgenet_umac_readl(priv, UMAC_MODE);
> +		status->link = !!(reg & MODE_LINK_STATUS);
> +	}
>  
>  	return 0;
>  }
> 


-- 
Florian

^ permalink raw reply

* Re: [PATCH net] sctp: hold transport before accessing its asoc in sctp_transport_get_next
From: Xin Long @ 2018-08-28 16:08 UTC (permalink / raw)
  To: Neil Horman; +Cc: network dev, linux-sctp, davem, Marcelo Ricardo Leitner
In-Reply-To: <20180827130803.GA12418@hmswarspite.think-freely.org>

On Mon, Aug 27, 2018 at 9:08 PM Neil Horman <nhorman@tuxdriver.com> wrote:
>
> On Mon, Aug 27, 2018 at 06:38:31PM +0800, Xin Long wrote:
> > As Marcelo noticed, in sctp_transport_get_next, it is iterating over
> > transports but then also accessing the association directly, without
> > checking any refcnts before that, which can cause a use-after-free
> > read.
> >
> > So fix it by holding transport before accessing the association. With
> > that, sctp_transport_hold calls can be removed in the later places.
> >
> > Fixes: 626d16f50f39 ("sctp: export some apis or variables for sctp_diag and reuse some for proc")
> > Reported-by: syzbot+fe62a0c9aa6a85c6de16@syzkaller.appspotmail.com
> > Signed-off-by: Xin Long <lucien.xin@gmail.com>
> > ---
> >  net/sctp/proc.c   |  4 ----
> >  net/sctp/socket.c | 22 +++++++++++++++-------
> >  2 files changed, 15 insertions(+), 11 deletions(-)
> >
> > diff --git a/net/sctp/proc.c b/net/sctp/proc.c
> > index ef5c9a8..4d6f1c8 100644
> > --- a/net/sctp/proc.c
> > +++ b/net/sctp/proc.c
> > @@ -264,8 +264,6 @@ static int sctp_assocs_seq_show(struct seq_file *seq, void *v)
> >       }
> >
> >       transport = (struct sctp_transport *)v;
> > -     if (!sctp_transport_hold(transport))
> > -             return 0;
> >       assoc = transport->asoc;
> >       epb = &assoc->base;
> >       sk = epb->sk;
> > @@ -322,8 +320,6 @@ static int sctp_remaddr_seq_show(struct seq_file *seq, void *v)
> >       }
> >
> >       transport = (struct sctp_transport *)v;
> > -     if (!sctp_transport_hold(transport))
> > -             return 0;
> >       assoc = transport->asoc;
> >
> >       list_for_each_entry_rcu(tsp, &assoc->peer.transport_addr_list,
> > diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> > index e96b15a..aa76586 100644
> > --- a/net/sctp/socket.c
> > +++ b/net/sctp/socket.c
> > @@ -5005,9 +5005,14 @@ struct sctp_transport *sctp_transport_get_next(struct net *net,
> >                       break;
> >               }
> >
> > +             if (!sctp_transport_hold(t))
> > +                     continue;
> > +
> >               if (net_eq(sock_net(t->asoc->base.sk), net) &&
> >                   t->asoc->peer.primary_path == t)
> >                       break;
> > +
> > +             sctp_transport_put(t);
> >       }
> >
> >       return t;
> > @@ -5017,13 +5022,18 @@ struct sctp_transport *sctp_transport_get_idx(struct net *net,
> >                                             struct rhashtable_iter *iter,
> >                                             int pos)
> >  {
> > -     void *obj = SEQ_START_TOKEN;
> > +     struct sctp_transport *t;
> >
> > -     while (pos && (obj = sctp_transport_get_next(net, iter)) &&
> > -            !IS_ERR(obj))
> > -             pos--;
> > +     if (!pos)
> > +             return SEQ_START_TOKEN;
> >
> > -     return obj;
> > +     while ((t = sctp_transport_get_next(net, iter)) && !IS_ERR(t)) {
> > +             if (!--pos)
> > +                     break;
> > +             sctp_transport_put(t);
> > +     }
> > +
> > +     return t;
> >  }
> >
> >  int sctp_for_each_endpoint(int (*cb)(struct sctp_endpoint *, void *),
> > @@ -5082,8 +5092,6 @@ int sctp_for_each_transport(int (*cb)(struct sctp_transport *, void *),
> >
> >       tsp = sctp_transport_get_idx(net, &hti, *pos + 1);
> >       for (; !IS_ERR_OR_NULL(tsp); tsp = sctp_transport_get_next(net, &hti)) {
> > -             if (!sctp_transport_hold(tsp))
> > -                     continue;
> >               ret = cb(tsp, p);
> >               if (ret)
> >                       break;
> > --
> > 2.1.0
> >
> >
> Acked-by: Neil Horman <nhorman@tuxdriver.com>
>
> Additionally, it's not germane to this particular fix, but why are we still
> using that pos variable in sctp_transport_get_idx?  With the conversion to
> rhashtables, it doesn't seem particularly useful anymore.
For proc it seems so, since hti is saved in seq->private.
But for diag, "hti" in sctp_for_each_transport() is a local variable.
Where do you think we could save it?

^ permalink raw reply
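
The hold-before-access ordering discussed in this thread can be sketched in
user space as follows. This is an illustration under stated assumptions, not
the SCTP code: the struct layouts, transport_hold(), transport_put() and
find_transport() are hypothetical stand-ins built on C11 atomics, and a
plain array replaces the rhashtable walk.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the association and transport objects. */
struct assoc {
	int id;
};

struct transport {
	atomic_int refcnt;	/* 0 means the object is being torn down */
	struct assoc *asoc;
};

/* Take a reference only if the count has not already dropped to zero. */
static bool transport_hold(struct transport *t)
{
	int old = atomic_load(&t->refcnt);

	while (old != 0) {
		/* On failure, old is reloaded and the zero check repeats. */
		if (atomic_compare_exchange_weak(&t->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

static void transport_put(struct transport *t)
{
	atomic_fetch_sub(&t->refcnt, 1);
}

/*
 * Return the first live transport whose association id matches, with a
 * reference held for the caller, or NULL if there is none.
 */
static struct transport *find_transport(struct transport *tab, size_t n, int id)
{
	for (size_t i = 0; i < n; i++) {
		struct transport *t = &tab[i];

		if (!transport_hold(t))
			continue;	/* dying entry: never touch t->asoc */
		if (t->asoc->id == id)
			return t;	/* caller must transport_put() */
		transport_put(t);	/* not a match: drop the reference */
	}
	return NULL;
}
```

As in the patch, the reference is taken before t->asoc is dereferenced, a
non-matching entry is released immediately, and a matching entry is returned
with the reference still held, so callers no longer take their own.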

* Re: Oops running iptables -F OUTPUT
From: Ard Biesheuvel @ 2018-08-28 16:09 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Andreas Schwab, <netdev@vger.kernel.org>, linuxppc-dev,
	Jessica Yu, Michael Ellerman, Will Deacon, Ingo Molnar,
	Andrew Morton, linux-arch
In-Reply-To: <CAKv+Gu8ROLamoxFZzbqrzSHq1cUQg5hn02HrKGB_0AT=EcBJpg@mail.gmail.com>

On 28 August 2018 at 15:56, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> Hello Andreas, Nick,
>
> On 28 August 2018 at 06:06, Nicholas Piggin <nicholas.piggin@gmail.com> wrote:
>> On Mon, 27 Aug 2018 19:11:01 +0200
>> Andreas Schwab <schwab@linux-m68k.org> wrote:
>>
>>> I'm getting this Oops when running iptables -F OUTPUT:
>>>
>>> [   91.139409] Unable to handle kernel paging request for data at address 0xd0000001fff12f34
>>> [   91.139414] Faulting instruction address: 0xd0000000016a5718
>>> [   91.139419] Oops: Kernel access of bad area, sig: 11 [#1]
>>> [   91.139426] BE SMP NR_CPUS=2 PowerMac
>>> [   91.139434] Modules linked in: iptable_filter ip_tables x_tables bpfilter nfsd auth_rpcgss lockd grace nfs_acl sunrpc tun af_packet snd_aoa_codec_tas snd_aoa_fabric_layout snd_aoa snd_aoa_i2sbus snd_aoa_soundbus snd_pcm_oss snd_pcm snd_seq snd_timer snd_seq_device snd_mixer_oss snd sungem sr_mod firewire_ohci cdrom sungem_phy soundcore firewire_core pata_macio crc_itu_t sg hid_generic usbhid linear md_mod ohci_pci ohci_hcd ehci_pci ehci_hcd usbcore usb_common dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod sata_svw
>>> [   91.139522] CPU: 1 PID: 3620 Comm: iptables Not tainted 4.19.0-rc1 #1
>>> [   91.139526] NIP:  d0000000016a5718 LR: d0000000016a569c CTR: c0000000006f560c
>>> [   91.139531] REGS: c0000001fa577670 TRAP: 0300   Not tainted  (4.19.0-rc1)
>>> [   91.139534] MSR:  900000000200b032 <SF,HV,VEC,EE,FP,ME,IR,DR,RI>  CR: 84002484  XER: 20000000
>>> [   91.139553] DAR: d0000001fff12f34 DSISR: 40000000 IRQMASK: 0
>>> GPR00: d0000000016a569c c0000001fa5778f0 d0000000016b0400 0000000000000000
>>> GPR04: 0000000000000002 0000000000000000 80000001fa46418e c0000001fa0d05c8
>>> GPR08: d0000000016b0400 d00037fffff13000 00000001ff3e7000 d0000000016a6fb8
>>> GPR12: c0000000006f560c c00000000ffff780 0000000000000000 0000000000000000
>>> GPR16: 0000000011635010 00003fffa1b7aa68 0000000000000000 0000000000000000
>>> GPR20: 0000000000000003 0000000010013918 00000000116350c0 c000000000b88990
>>> GPR24: c000000000b88ba4 0000000000000000 d0000001fff12f34 0000000000000000
>>> GPR28: d0000000016b8000 c0000001fa20f400 c0000001fa20f440 0000000000000000
>>> [   91.139627] NIP [d0000000016a5718] .alloc_counters.isra.10+0xbc/0x140 [ip_tables]
>>> [   91.139634] LR [d0000000016a569c] .alloc_counters.isra.10+0x40/0x140 [ip_tables]
>>> [   91.139638] Call Trace:
>>> [   91.139645] [c0000001fa5778f0] [d0000000016a569c] .alloc_counters.isra.10+0x40/0x140 [ip_tables] (unreliable)
>>> [   91.139655] [c0000001fa5779b0] [d0000000016a5b54] .do_ipt_get_ctl+0x110/0x2ec [ip_tables]
>>> [   91.139666] [c0000001fa577aa0] [c0000000006233e0] .nf_getsockopt+0x68/0x88
>>> [   91.139674] [c0000001fa577b40] [c000000000631608] .ip_getsockopt+0xbc/0x128
>>> [   91.139682] [c0000001fa577bf0] [c00000000065adf4] .raw_getsockopt+0x18/0x5c
>>> [   91.139690] [c0000001fa577c60] [c0000000005b5f60] .sock_common_getsockopt+0x2c/0x40
>>> [   91.139697] [c0000001fa577cd0] [c0000000005b3394] .__sys_getsockopt+0xa4/0xd0
>>> [   91.139704] [c0000001fa577d80] [c0000000005b5ab0] .__se_sys_socketcall+0x238/0x2b4
>>> [   91.139712] [c0000001fa577e30] [c00000000000a31c] system_call+0x5c/0x70
>>> [   91.139716] Instruction dump:
>>> [   91.139721] 39290040 7d3d4a14 7fbe4840 409cff98 81380000 2b890001 419d000c 393e0060
>>> [   91.139736] 48000010 7d57c82a e93e0060 7d295214 <815a0000> 794807e1 41e20010 7c210b78
>>> [   91.139752] ---[ end trace f5d1d5431651845d ]---
>>
>> This is due to 7290d58095 ("module: use relative references for
>> __ksymtab entries"). This part of kernel/module.c -
>>
>>    /* Divert to percpu allocation if a percpu var. */
>>    if (sym[i].st_shndx == info->index.pcpu)
>>        secbase = (unsigned long)mod_percpu(mod);
>>    else
>>        secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
>>    sym[i].st_value += secbase;
>>
>> causes the distance to the target to exceed 32 bits on powerpc, so
>> it doesn't fit in a rel32 reloc. Not sure how other archs cope.
>>
>
> Apologies for the breakage. It does indeed appear to affect all
> architectures, and I'm a bit puzzled why you are the first one to spot
> it.
>
> I will try to find a clean way to special case the per-CPU variable
> __ksymtab references in the generic module code, and if that is too
> cumbersome, we can switch to 64-bit relative references (or rather,
> native word size relative references) instead. Or revert the whole
> thing ...

OK, after a bit of digging, and confirming that the arm64
implementation works as expected (its module loader actually detects
overflows of the 32-bit place relative relocations, so the problem
definitely does not occur there), I think I found the explanation why
this occurs on powerpc and not on x86 or arm64.

Could you please check whether this change makes the issue go away?
(whitespace damage courtesy of Gmail)

diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 6a501b25dd85..57d09d5ceb1a 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -779,7 +779,6 @@ EXPORT_SYMBOL(__per_cpu_offset);

 void __init setup_per_cpu_areas(void)
 {
-       const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;
        size_t atom_size;
        unsigned long delta;
        unsigned int cpu;
@@ -795,7 +794,9 @@ void __init setup_per_cpu_areas(void)
        else
                atom_size = 1 << 20;

-       rc = pcpu_embed_first_chunk(0, dyn_size, atom_size, pcpu_cpu_distance,
+       rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
+                                   PERCPU_DYNAMIC_RESERVE,
+                                   atom_size, pcpu_cpu_distance,
                                    pcpu_fc_alloc, pcpu_fc_free);
        if (rc < 0)
                panic("cannot initialize percpu area (err=%d)", rc);

The git log does not explain why powerpc deviates from x86 and arm64 in
the way it initializes the percpu areas.

^ permalink raw reply related
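
The overflow condition described in this thread, a place-relative reference
whose distance no longer fits in 32 bits, can be sketched as a simple range
check. The function name fits_rel32() is hypothetical, and this is not the
module loader's code; it only shows the arithmetic, assuming the usual
two's-complement conversion from uint64_t to int64_t.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a reference stored at `place` can reach `target` using a
 * signed 32-bit offset (target - place).
 */
static bool fits_rel32(uint64_t place, uint64_t target)
{
	int64_t delta = (int64_t)(target - place);

	return delta >= INT32_MIN && delta <= INT32_MAX;
}
```

As sample values only, take two addresses from the register dump above:
fits_rel32(0xd0000000016b0400, 0xd0000001fff12f34) is false, because the
distance is 0x1fe862b34, which exceeds INT32_MAX. The thread does not state
which pair of addresses the loader actually relocated, so these merely show
the order of magnitude involved.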

* [bpf-next PATCH 0/2] bpf: test_sockmap updates
From: John Fastabend @ 2018-08-28 16:10 UTC (permalink / raw)
  To: alexei.starovoitov, daniel; +Cc: netdev

Two small test_sockmap updates for bpf-next. These help me run some
additional tests with test_sockmap.

---

John Fastabend (2):
      bpf: sockmap test remove shutdown() calls
      bpf: use --cgroup in test_suite if supplied


 tools/testing/selftests/bpf/test_sockmap.c |   56 ++++++++++++++++------------
 1 file changed, 31 insertions(+), 25 deletions(-)

^ permalink raw reply

* [bpf-next PATCH 1/2] bpf: sockmap test remove shutdown() calls
From: John Fastabend @ 2018-08-28 16:10 UTC (permalink / raw)
  To: alexei.starovoitov, daniel; +Cc: netdev
In-Reply-To: <20180828160921.24004.71893.stgit@john-Precision-Tower-5810>

Currently, we do a shutdown(sk, SHUT_RDWR) on both peer sockets and
a shutdown on the sender as well. However, this is incorrect and can
occasionally cause issues if you happen to have bad timing. First,
peer1 or peer2 may still be in use depending on the test and timing.
Second, we really should only be closing the read side and/or write
side depending on whether the test is receiving or sending.

But really, none of this is needed, so just remove the shutdown calls.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 tools/testing/selftests/bpf/test_sockmap.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 0c7d9e5..a0e77c6 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -469,8 +469,6 @@ static int sendmsg_test(struct sockmap_options *opt)
 			fprintf(stderr,
 				"msg_loop_rx: iov_count %i iov_buf %i cnt %i err %i\n",
 				iov_count, iov_buf, cnt, err);
-		shutdown(p2, SHUT_RDWR);
-		shutdown(p1, SHUT_RDWR);
 		if (s.end.tv_sec - s.start.tv_sec) {
 			sent_Bps = sentBps(s);
 			recvd_Bps = recvdBps(s);
@@ -500,7 +498,6 @@ static int sendmsg_test(struct sockmap_options *opt)
 			fprintf(stderr,
 				"msg_loop_tx: iov_count %i iov_buf %i cnt %i err %i\n",
 				iov_count, iov_buf, cnt, err);
-		shutdown(c1, SHUT_RDWR);
 		if (s.end.tv_sec - s.start.tv_sec) {
 			sent_Bps = sentBps(s);
 			recvd_Bps = recvdBps(s);

^ permalink raw reply related
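
The difference between shutting down one direction and SHUT_RDWR, which the
commit message above alludes to, can be sketched with a small self-contained
check. It assumes an AF_UNIX socketpair as a stand-in for the test's TCP
sockets, and half_close_demo() is a hypothetical name, not part of
test_sockmap.

```c
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns 0 if, after shutdown(sv[0], SHUT_WR), the peer sees EOF on its
 * read side while data can still flow from sv[1] to sv[0]; -1 otherwise.
 */
static int half_close_demo(void)
{
	int sv[2];
	char buf[8];
	int ret = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	/* Close only the write side of sv[0]. */
	if (shutdown(sv[0], SHUT_WR) < 0)
		goto out;

	/* The peer now reads EOF (0 bytes) ... */
	if (read(sv[1], buf, sizeof(buf)) != 0)
		goto out;

	/* ... but the reverse direction is still usable. */
	if (write(sv[1], "ok", 2) != 2)
		goto out;
	if (read(sv[0], buf, sizeof(buf)) != 2)
		goto out;

	ret = 0;
out:
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

With SHUT_RDWR in place of SHUT_WR both directions would be closed at once,
which is why a peer that is still in use can be disturbed by it.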

* [bpf-next PATCH 2/2] bpf: use --cgroup in test_suite if supplied
From: John Fastabend @ 2018-08-28 16:10 UTC (permalink / raw)
  To: alexei.starovoitov, daniel; +Cc: netdev
In-Reply-To: <20180828160921.24004.71893.stgit@john-Precision-Tower-5810>

If the user supplies a --cgroup value in the arguments when running
the test_suite, go ahead and run the self tests there. I use this
to test with multiple cgroup users.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 tools/testing/selftests/bpf/test_sockmap.c |   53 ++++++++++++++++------------
 1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index a0e77c6..ac7de38 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -1345,9 +1345,9 @@ static int populate_progs(char *bpf_file)
 	return 0;
 }
 
-static int __test_suite(char *bpf_file)
+static int __test_suite(int cg_fd, char *bpf_file)
 {
-	int cg_fd, err;
+	int err, cleanup = cg_fd;
 
 	err = populate_progs(bpf_file);
 	if (err < 0) {
@@ -1355,22 +1355,24 @@ static int __test_suite(char *bpf_file)
 		return err;
 	}
 
-	if (setup_cgroup_environment()) {
-		fprintf(stderr, "ERROR: cgroup env failed\n");
-		return -EINVAL;
-	}
-
-	cg_fd = create_and_get_cgroup(CG_PATH);
 	if (cg_fd < 0) {
-		fprintf(stderr,
-			"ERROR: (%i) open cg path failed: %s\n",
-			cg_fd, optarg);
-		return cg_fd;
-	}
+		if (setup_cgroup_environment()) {
+			fprintf(stderr, "ERROR: cgroup env failed\n");
+			return -EINVAL;
+		}
+
+		cg_fd = create_and_get_cgroup(CG_PATH);
+		if (cg_fd < 0) {
+			fprintf(stderr,
+				"ERROR: (%i) open cg path failed: %s\n",
+				cg_fd, optarg);
+			return cg_fd;
+		}
 
-	if (join_cgroup(CG_PATH)) {
-		fprintf(stderr, "ERROR: failed to join cgroup\n");
-		return -EINVAL;
+		if (join_cgroup(CG_PATH)) {
+			fprintf(stderr, "ERROR: failed to join cgroup\n");
+			return -EINVAL;
+		}
 	}
 
 	/* Tests basic commands and APIs with range of iov values */
@@ -1391,20 +1393,24 @@ static int __test_suite(char *bpf_file)
 
 out:
 	printf("Summary: %i PASSED %i FAILED\n", passed, failed);
-	cleanup_cgroup_environment();
-	close(cg_fd);
+	if (cleanup < 0) {
+		cleanup_cgroup_environment();
+		close(cg_fd);
+	}
 	return err;
 }
 
-static int test_suite(void)
+static int test_suite(int cg_fd)
 {
 	int err;
 
-	err = __test_suite(BPF_SOCKMAP_FILENAME);
+	err = __test_suite(cg_fd, BPF_SOCKMAP_FILENAME);
 	if (err)
 		goto out;
-	err = __test_suite(BPF_SOCKHASH_FILENAME);
+	err = __test_suite(cg_fd, BPF_SOCKHASH_FILENAME);
 out:
+	if (cg_fd > -1)
+		close(cg_fd);
 	return err;
 }
 
@@ -1417,7 +1423,7 @@ int main(int argc, char **argv)
 	int test = PING_PONG;
 
 	if (argc < 2)
-		return test_suite();
+		return test_suite(-1);
 
 	while ((opt = getopt_long(argc, argv, ":dhvc:r:i:l:t:",
 				  long_options, &longindex)) != -1) {
@@ -1483,6 +1489,9 @@ int main(int argc, char **argv)
 		}
 	}
 
+	if (argc <= 3 && cg_fd)
+		return test_suite(cg_fd);
+
 	if (!cg_fd) {
 		fprintf(stderr, "%s requires cgroup option: --cgroup <path>\n",
 			argv[0]);

^ permalink raw reply related

* Re: [PATCH net] net/sched: act_pedit: fix dump of extended layered op
From: Cong Wang @ 2018-08-28 16:48 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Jamal Hadi Salim, David Miller, Linux Kernel Network Developers,
	Amir Vadai
In-Reply-To: <69b7819408d62dc49aa242c239fec558fa9acd8d.1535403029.git.dcaratti@redhat.com>

On Mon, Aug 27, 2018 at 1:56 PM Davide Caratti <dcaratti@redhat.com> wrote:
>
> in the (rare) case of failure in nla_nest_start(), missing NULL checks in
> tcf_pedit_key_ex_dump() can make the following command
>
>  # tc action add action pedit ex munge ip ttl set 64
>
> dereference a NULL pointer:
>
>  BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
>  PGD 800000007d1cd067 P4D 800000007d1cd067 PUD 7acd3067 PMD 0
>  Oops: 0002 [#1] SMP PTI
>  CPU: 0 PID: 3336 Comm: tc Tainted: G            E     4.18.0.pedit+ #425
>  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>  RIP: 0010:tcf_pedit_dump+0x19d/0x358 [act_pedit]
>  Code: be 02 00 00 00 48 89 df 66 89 44 24 20 e8 9b b1 fd e0 85 c0 75 46 8b 83 c8 00 00 00 49 83 c5 08 48 03 83 d0 00 00 00 4d 39 f5 <66> 89 04 25 00 00 00 00 0f 84 81 01 00 00 41 8b 45 00 48 8d 4c 24
>  RSP: 0018:ffffb5d4004478a8 EFLAGS: 00010246
>  RAX: ffff8880fcda2070 RBX: ffff8880fadd2900 RCX: 0000000000000000
>  RDX: 0000000000000002 RSI: ffffb5d4004478ca RDI: ffff8880fcda206e
>  RBP: ffff8880fb9cb900 R08: 0000000000000008 R09: ffff8880fcda206e
>  R10: ffff8880fadd2900 R11: 0000000000000000 R12: ffff8880fd26cf40
>  R13: ffff8880fc957430 R14: ffff8880fc957430 R15: ffff8880fb9cb988
>  FS:  00007f75a537a740(0000) GS:ffff8880fda00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000000 CR3: 000000007a2fa005 CR4: 00000000001606f0
>  Call Trace:
>   ? __nla_reserve+0x38/0x50
>   tcf_action_dump_1+0xd2/0x130
>   tcf_action_dump+0x6a/0xf0
>   tca_get_fill.constprop.31+0xa3/0x120
>   tcf_action_add+0xd1/0x170
>   tc_ctl_action+0x137/0x150
>   rtnetlink_rcv_msg+0x263/0x2d0
>   ? _cond_resched+0x15/0x40
>   ? rtnl_calcit.isra.30+0x110/0x110
>   netlink_rcv_skb+0x4d/0x130
>   netlink_unicast+0x1a3/0x250
>   netlink_sendmsg+0x2ae/0x3a0
>   sock_sendmsg+0x36/0x40
>   ___sys_sendmsg+0x26f/0x2d0
>   ? do_wp_page+0x8e/0x5f0
>   ? handle_pte_fault+0x6c3/0xf50
>   ? __handle_mm_fault+0x38e/0x520
>   ? __sys_sendmsg+0x5e/0xa0
>   __sys_sendmsg+0x5e/0xa0
>   do_syscall_64+0x5b/0x180
>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
>  RIP: 0033:0x7f75a4583ba0
>  Code: c3 48 8b 05 f2 62 2c 00 f7 db 64 89 18 48 83 cb ff eb dd 0f 1f 80 00 00 00 00 83 3d fd c3 2c 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae cc 00 00 48 89 04 24
>  RSP: 002b:00007fff60ee7418 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
>  RAX: ffffffffffffffda RBX: 00007fff60ee7540 RCX: 00007f75a4583ba0
>  RDX: 0000000000000000 RSI: 00007fff60ee7490 RDI: 0000000000000003
>  RBP: 000000005b842d3e R08: 0000000000000002 R09: 0000000000000000
>  R10: 00007fff60ee6ea0 R11: 0000000000000246 R12: 0000000000000000
>  R13: 00007fff60ee7554 R14: 0000000000000001 R15: 000000000066c100
>  Modules linked in: act_pedit(E) ip6table_filter ip6_tables iptable_filter binfmt_misc crct10dif_pclmul ext4 crc32_pclmul mbcache ghash_clmulni_intel jbd2 pcbc snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd snd_timer cryptd glue_helper snd joydev pcspkr soundcore virtio_balloon i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net net_failover virtio_blk virtio_console failover qxl crc32c_intel drm_kms_helper syscopyarea serio_raw sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix virtio_pci libata virtio_ring i2c_core virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_pedit]
>  CR2: 0000000000000000
>
> Like it's done for other TC actions, give up dumping pedit rules and return
> an error if nla_nest_start() returns NULL.

Looks good to me,

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

While you are at it, please fix act_tunnel_key too.

Thanks.

^ permalink raw reply
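
The missing check described in the quoted commit message follows a common
shape: a nest-start call can return NULL, and the dump routine has to give
up instead of writing through the pointer. A minimal user-space sketch,
assuming a toy 32-byte buffer: struct msg, nest_start() and dump_key() are
hypothetical stand-ins, not the netlink API, and the -EMSGSIZE return value
is this sketch's choice rather than something stated in the thread.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a message buffer; not the kernel's sk_buff. */
struct msg {
	unsigned char data[32];
	size_t len;
};

/*
 * Hypothetical stand-in for a nest-start call: reserves a 4-byte header,
 * or returns NULL when the buffer has no room left.
 */
static unsigned char *nest_start(struct msg *m)
{
	unsigned char *hdr;

	if (sizeof(m->data) - m->len < 4)
		return NULL;
	hdr = m->data + m->len;
	memset(hdr, 0, 4);
	m->len += 4;
	return hdr;
}

/* Dump one key: return an error rather than write through a NULL nest. */
static int dump_key(struct msg *m, unsigned char value)
{
	size_t start = m->len;
	unsigned char *nest = nest_start(m);

	if (!nest)
		return -EMSGSIZE;	/* the kind of check that was missing */
	if (m->len == sizeof(m->data)) {
		m->len = start;		/* roll back the half-written nest */
		return -EMSGSIZE;
	}
	m->data[m->len++] = value;
	nest[0] = (unsigned char)(m->len - start);	/* nest length */
	return 0;
}
```

Each key takes five bytes here, so a caller looping over dump_key() gets
six successful calls on an empty buffer and an error on the seventh,
without any store through a NULL pointer.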

* [PATCH net-next] net: thunderbolt: Convert to use SPDX identifier
From: Mika Westerberg @ 2018-08-28 16:58 UTC (permalink / raw)
  To: David S . Miller; +Cc: Michael Jamet, Yehezkel Bernat, Mika Westerberg, netdev

This gets rid of the licence boilerplate in favor of an SPDX identifier,
which only takes a single-line comment.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
---
 drivers/net/thunderbolt.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/thunderbolt.c b/drivers/net/thunderbolt.c
index e0d6760f3219..c48c3a1eb1f8 100644
--- a/drivers/net/thunderbolt.c
+++ b/drivers/net/thunderbolt.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Networking over Thunderbolt cable using Apple ThunderboltIP protocol
  *
@@ -5,10 +6,6 @@
  * Authors: Amir Levy <amir.jer.levy@intel.com>
  *          Michael Jamet <michael.jamet@intel.com>
  *          Mika Westerberg <mika.westerberg@linux.intel.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
  */
 
 #include <linux/atomic.h>
-- 
2.18.0

^ permalink raw reply related

* Re: [PATCH net 1/2] net_sched: reject unknown tcfa_action values
From: Cong Wang @ 2018-08-28 16:59 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Davide Caratti
In-Reply-To: <0bfdb4f6816e8eae8e86e97b78eb86d3adba0a01.1535465984.git.pabeni@redhat.com>

On Tue, Aug 28, 2018 at 7:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> +int tcf_action_destroy_one(struct tc_action *a, int bind)
> +{
> +       struct tc_action *actions[] = { a, NULL };
> +
> +       return tcf_action_destroy(actions, bind);
> +}

Make it static.


> +
>  static int tcf_action_put(struct tc_action *p)
>  {
>         return __tcf_action_put(p, false);
> @@ -881,17 +888,16 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
>         if (TC_ACT_EXT_CMP(a->tcfa_action, TC_ACT_GOTO_CHAIN)) {
>                 err = tcf_action_goto_chain_init(a, tp);
>                 if (err) {
> -                       struct tc_action *actions[] = { a, NULL };
> -
> -                       tcf_action_destroy(actions, bind);
>                         NL_SET_ERR_MSG(extack, "Failed to init TC action chain");
> +                       tcf_action_destroy_one(a, bind);
>                         return ERR_PTR(err);
>                 }
>         }
>
>         if (!tcf_action_valid(a->tcfa_action)) {
>                 NL_SET_ERR_MSG(extack, "invalid action value, using TC_ACT_UNSPEC instead");


You need to adjust this extack too.



> -               a->tcfa_action = TC_ACT_UNSPEC;
> +               tcf_action_destroy_one(a, bind);
> +               return ERR_PTR(-EINVAL);
>         }

Thanks.

^ permalink raw reply
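
The helper under review wraps a single action in a NULL-terminated array so that the array-based destroy routine can be reused. A minimal userspace sketch of that pattern follows, with the wrapper kept file-local as requested in the review. The names struct action, destroy_all() and destroy_one() are hypothetical stand-ins, not the kernel's types or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct tc_action; not the kernel type. */
struct action {
	int destroyed;
};

/* Stand-in for the array-based destroy: walks a NULL-terminated array. */
static int destroy_all(struct action *actions[], int bind)
{
	int n = 0;

	assert(actions);
	(void)bind;	/* the real routine uses this; the sketch ignores it */

	for (size_t i = 0; actions[i]; i++) {
		actions[i]->destroyed = 1;
		n++;
	}
	return n;	/* number of actions torn down */
}

/* File-local wrapper for the single-action case. */
static int destroy_one(struct action *a, int bind)
{
	struct action *actions[] = { a, NULL };

	return destroy_all(actions, bind);
}
```

Keeping the wrapper static limits it to the one file that needs it, so no header declaration is required.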

* Re: [Intel-wired-lan] [PATCH] i40e: report correct statistics when XDP is enabled
From: Paul Menzel @ 2018-08-28 17:00 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Jesper Dangaard Brouer, netdev, intel-wired-lan, Magnus Karlsson,
	Magnus Karlsson
In-Reply-To: <20180824160016.245b08d8@redhat.com>

Dear Björn,


On 08/24/18 16:00, Jesper Dangaard Brouer wrote:
> On Fri, 24 Aug 2018 13:21:59 +0200
> Björn Töpel <bjorn.topel@intel.com> wrote:
> 
>> When XDP is enabled, the driver will report incorrect
>> statistics. Received frames will be reported as transmitted frames.
>>
>> This commits fixes the i40e implementation of ndo_get_stats64 (struct

Should you send a v2, then please use singular for *commit*:

This commit ….

>> net_device_ops), so that iproute2 will report correct statistics
>> (e.g. when running "ip -stats link show dev eth0") even when XDP is
>> enabled.

In the future, it’d be great if you could describe your fix in the
commit message too. For example, explain why the if statement needs to
move up.

>> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> Fixes: 74608d17fe29 ("i40e: add support for XDP_TX action")
> 
> Stable candidate:
>  $ git describe --contains 74608d17fe29
>  v4.13-rc1~157^2~128^2~13
> 
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> 
> It works for me:
> 
> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> I'm explicitly _not_ ACK'ing the patch, as I think your code changes
> below make it harder to follow whether a TX or RX ring is getting
> updated. But it is 100% up to the driver maintainers to say if this is
> acceptable from a maintenance PoV.
> 
>> ---
>>  drivers/net/ethernet/intel/i40e/i40e_main.c | 24 +++++++++++----------
>>  1 file changed, 13 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> index e40c023cc7b6..7c122dd3faa1 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> @@ -425,9 +425,9 @@ static void i40e_get_netdev_stats_struct(struct net_device *netdev,
>>  				  struct rtnl_link_stats64 *stats)
>>  {
>>  	struct i40e_netdev_priv *np = netdev_priv(netdev);
>> -	struct i40e_ring *tx_ring, *rx_ring;
>>  	struct i40e_vsi *vsi = np->vsi;
>>  	struct rtnl_link_stats64 *vsi_stats = i40e_get_vsi_stats_struct(vsi);
>> +	struct i40e_ring *ring;
>>  	int i;
>>  
>>  	if (test_bit(__I40E_VSI_DOWN, vsi->state))
>> @@ -441,24 +441,26 @@ static void i40e_get_netdev_stats_struct(struct net_device *netdev,
>>  		u64 bytes, packets;
>>  		unsigned int start;
>>  
>> -		tx_ring = READ_ONCE(vsi->tx_rings[i]);
>> -		if (!tx_ring)
>> +		ring = READ_ONCE(vsi->tx_rings[i]);
>> +		if (!ring)
>>  			continue;
>> -		i40e_get_netdev_stats_struct_tx(tx_ring, stats);
>> +		i40e_get_netdev_stats_struct_tx(ring, stats);
>>  
>> -		rx_ring = &tx_ring[1];
>> +		if (i40e_enabled_xdp_vsi(vsi)) {
>> +			ring++;
>> +			i40e_get_netdev_stats_struct_tx(ring, stats);
>> +		}
>>  
>> +		ring++;
>>  		do {
>> -			start = u64_stats_fetch_begin_irq(&rx_ring->syncp);
>> -			packets = rx_ring->stats.packets;
>> -			bytes   = rx_ring->stats.bytes;
>> -		} while (u64_stats_fetch_retry_irq(&rx_ring->syncp, start));
>> +			start   = u64_stats_fetch_begin_irq(&ring->syncp);
>> +			packets = ring->stats.packets;
>> +			bytes   = ring->stats.bytes;
>> +		} while (u64_stats_fetch_retry_irq(&ring->syncp, start));
>>  
>>  		stats->rx_packets += packets;
>>  		stats->rx_bytes   += bytes;
>>  
>> -		if (i40e_enabled_xdp_vsi(vsi))
>> -			i40e_get_netdev_stats_struct_tx(&rx_ring[1], stats);
>>  	}
>>  	rcu_read_unlock();
>>  


Kind regards,

Paul


^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH] i40e: report correct statistics when XDP is enabled
From: Björn Töpel @ 2018-08-28 17:10 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Jesper Dangaard Brouer, netdev, intel-wired-lan, Magnus Karlsson,
	Magnus Karlsson
In-Reply-To: <a20ab03d-25e0-126a-d996-5c1750de0639@molgen.mpg.de>

On 2018-08-28 19:00, Paul Menzel wrote:
> Dear Björn,
> 
> 
> On 08/24/18 16:00, Jesper Dangaard Brouer wrote:
>> On Fri, 24 Aug 2018 13:21:59 +0200
>> Björn Töpel <bjorn.topel@intel.com> wrote:
>>
>>> When XDP is enabled, the driver will report incorrect
>>> statistics. Received frames will be reported as transmitted frames.
>>>
>>> This commits fixes the i40e implementation of ndo_get_stats64 (struct
> 
> Should you send a v2, then please use singular for *commit*:
> 
> This commit ….
> 
>>> net_device_ops), so that iproute2 will report correct statistics
>>> (e.g. when running "ip -stats link show dev eth0") even when XDP is
>>> enabled.
> 
> In the future, it’d be great if you could describe your fix in the
> commit message too. For example, explain why the if statement needs to
> move up.
>

Thanks for the review, Paul. I'll address your comments if we end up
with a v2.


Björn

>>> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>> Fixes: 74608d17fe29 ("i40e: add support for XDP_TX action")
>>
>> Stable candidate:
>>   $ git describe --contains 74608d17fe29
>>   v4.13-rc1~157^2~128^2~13
>>
>>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>>
>> It works for me:
>>
>> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>
>> I'm explicitly _not_ ACK'ing the patch, as I think your code changes
>> below make it harder to follow whether a TX or RX ring is getting
>> updated. But it is 100% up to the driver maintainers to say if this is
>> acceptable from a maintenance PoV.
>>
>>> ---
>>>   drivers/net/ethernet/intel/i40e/i40e_main.c | 24 +++++++++++----------
>>>   1 file changed, 13 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>> index e40c023cc7b6..7c122dd3faa1 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>> @@ -425,9 +425,9 @@ static void i40e_get_netdev_stats_struct(struct net_device *netdev,
>>>   				  struct rtnl_link_stats64 *stats)
>>>   {
>>>   	struct i40e_netdev_priv *np = netdev_priv(netdev);
>>> -	struct i40e_ring *tx_ring, *rx_ring;
>>>   	struct i40e_vsi *vsi = np->vsi;
>>>   	struct rtnl_link_stats64 *vsi_stats = i40e_get_vsi_stats_struct(vsi);
>>> +	struct i40e_ring *ring;
>>>   	int i;
>>>   
>>>   	if (test_bit(__I40E_VSI_DOWN, vsi->state))
>>> @@ -441,24 +441,26 @@ static void i40e_get_netdev_stats_struct(struct net_device *netdev,
>>>   		u64 bytes, packets;
>>>   		unsigned int start;
>>>   
>>> -		tx_ring = READ_ONCE(vsi->tx_rings[i]);
>>> -		if (!tx_ring)
>>> +		ring = READ_ONCE(vsi->tx_rings[i]);
>>> +		if (!ring)
>>>   			continue;
>>> -		i40e_get_netdev_stats_struct_tx(tx_ring, stats);
>>> +		i40e_get_netdev_stats_struct_tx(ring, stats);
>>>   
>>> -		rx_ring = &tx_ring[1];
>>> +		if (i40e_enabled_xdp_vsi(vsi)) {
>>> +			ring++;
>>> +			i40e_get_netdev_stats_struct_tx(ring, stats);
>>> +		}
>>>   
>>> +		ring++;
>>>   		do {
>>> -			start = u64_stats_fetch_begin_irq(&rx_ring->syncp);
>>> -			packets = rx_ring->stats.packets;
>>> -			bytes   = rx_ring->stats.bytes;
>>> -		} while (u64_stats_fetch_retry_irq(&rx_ring->syncp, start));
>>> +			start   = u64_stats_fetch_begin_irq(&ring->syncp);
>>> +			packets = ring->stats.packets;
>>> +			bytes   = ring->stats.bytes;
>>> +		} while (u64_stats_fetch_retry_irq(&ring->syncp, start));
>>>   
>>>   		stats->rx_packets += packets;
>>>   		stats->rx_bytes   += bytes;
>>>   
>>> -		if (i40e_enabled_xdp_vsi(vsi))
>>> -			i40e_get_netdev_stats_struct_tx(&rx_ring[1], stats);
>>>   	}
>>>   	rcu_read_unlock();
>>>   
> 
> 
> Kind regards,
> 
> Paul
> 

^ permalink raw reply
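
The fix in this thread depends on how the rings of one queue pair sit next to each other in memory. The sketch below assumes the layout implied by the pointer arithmetic in the diff: the TX ring first, an XDP TX ring only when XDP is enabled, and the RX ring last. The types struct ring and struct totals and the function account_queue_pair() are hypothetical and much simpler than the driver's own structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified ring; the driver's struct i40e_ring is larger. */
struct ring {
	uint64_t packets;
};

struct totals {
	uint64_t tx_packets;
	uint64_t rx_packets;
};

/*
 * Walk one queue pair. Assumed layout, inferred from the diff:
 *   XDP off: [ TX ][ RX ]
 *   XDP on:  [ TX ][ XDP TX ][ RX ]
 */
static void account_queue_pair(const struct ring *ring, bool xdp_enabled,
			       struct totals *t)
{
	assert(ring && t);

	t->tx_packets += ring->packets;		/* regular TX ring */

	if (xdp_enabled) {
		ring++;				/* XDP TX ring */
		t->tx_packets += ring->packets;
	}

	ring++;					/* RX ring is always last */
	t->rx_packets += ring->packets;
}
```

With a single cursor the role of each ring is carried only by the comments, which is the readability concern raised above; separate local names for the TX, XDP and RX rings would be one way to make the roles explicit.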

* Re: Oops running iptables -F OUTPUT
From: Andreas Schwab @ 2018-08-28 17:28 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Nicholas Piggin, <netdev@vger.kernel.org>, linuxppc-dev,
	Jessica Yu, Michael Ellerman, Will Deacon, Ingo Molnar,
	Andrew Morton, linux-arch
In-Reply-To: <CAKv+Gu_hPaxrVtsBOoviRraYk4FWnT9zQVCVF=i27xd_nGHryw@mail.gmail.com>

On Aug 28 2018, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index 6a501b25dd85..57d09d5ceb1a 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -779,7 +779,6 @@ EXPORT_SYMBOL(__per_cpu_offset);
>
>  void __init setup_per_cpu_areas(void)
>  {
> -       const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;
>         size_t atom_size;
>         unsigned long delta;
>         unsigned int cpu;
> @@ -795,7 +794,9 @@ void __init setup_per_cpu_areas(void)
>         else
>                 atom_size = 1 << 20;
>
> -       rc = pcpu_embed_first_chunk(0, dyn_size, atom_size, pcpu_cpu_distance,
> +       rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
> +                                   PERCPU_DYNAMIC_RESERVE,
> +                                   atom_size, pcpu_cpu_distance,
>                                     pcpu_fc_alloc, pcpu_fc_free);
>         if (rc < 0)
>                 panic("cannot initialize percpu area (err=%d)", rc);

That didn't help.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply

* Re: [PATCH bpf-next 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
From: Björn Töpel @ 2018-08-28 17:42 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
	Alexander Duyck, ast, Daniel Borkmann, Netdev, Brandeburg, Jesse,
	Singhai, Anjali, peter.waskiewicz.jr, Björn Töpel,
	michael.lundkvist, Willem de Bruijn, John Fastabend,
	Jakub Kicinski, neerav.parikh, MykytaI Iziumtsev, Francois Ozog
In-Reply-To: <20180828161102.45a00204@redhat.com>

Den tis 28 aug. 2018 kl 16:11 skrev Jesper Dangaard Brouer <brouer@redhat.com>:
>
> On Tue, 28 Aug 2018 14:44:25 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > This commit adds proper MEM_TYPE_ZERO_COPY support for
> > convert_to_xdp_frame. Converting a MEM_TYPE_ZERO_COPY xdp_buff to an
> > xdp_frame is done by transforming the MEM_TYPE_ZERO_COPY buffer into a
> > MEM_TYPE_PAGE_ORDER0 frame. This is costly, and in the future it might
> > make sense to implement a more sophisticated thread-safe alloc/free
> > scheme for MEM_TYPE_ZERO_COPY, so that no allocation and copy is
> > required in the fast-path.
>
> This is going to be slow. Especially the dev_alloc_page() call, which
> for small frames is likely going to be slower than the data copy.
> I guess this is a good first step, but I do hope we will circle back and
> optimize this later.  (It would also be quite easy to use
> MEM_TYPE_PAGE_POOL instead to get page recycling in devmap redirect case).
>

Yes, slow. :-( Still, I think this is a good starting point; a page pool
can then be introduced in a later performance-oriented series to make XDP
faster for the AF_XDP scenario.

But I'm definitely on your side here; this needs to be addressed -- but
not now IMO.


And thanks for spending time on the series!
Björn

> I would have liked the MEM_TYPE_ZERO_COPY frame to travel one level
> deeper into the redirect-core code.  Allowing devmap to send these
> frames without a copy, and allowing cpumap to do the dev_alloc_page() call
> (+copy) on the remote CPU.
>
>
> > Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> > ---
> >  include/net/xdp.h |  5 +++--
> >  net/core/xdp.c    | 39 +++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 42 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 76b95256c266..0d5c6fb4b2e2 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -91,6 +91,8 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
> >       frame->dev_rx = NULL;
> >  }
> >
> > +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
> > +
> >  /* Convert xdp_buff to xdp_frame */
> >  static inline
> >  struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
> > @@ -99,9 +101,8 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
> >       int metasize;
> >       int headroom;
> >
> > -     /* TODO: implement clone, copy, use "native" MEM_TYPE */
> >       if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
> > -             return NULL;
> > +             return xdp_convert_zc_to_xdp_frame(xdp);
> >
> >       /* Assure headroom is available for storing info */
> >       headroom = xdp->data - xdp->data_hard_start;
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index 89b6785cef2a..be6cb2f0e722 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -398,3 +398,42 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
> >       info->flags = bpf->flags;
> >  }
> >  EXPORT_SYMBOL_GPL(xdp_attachment_setup);
> > +
> > +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
> > +{
> > +     unsigned int metasize, headroom, totsize;
> > +     void *addr, *data_to_copy;
> > +     struct xdp_frame *xdpf;
> > +     struct page *page;
> > +
> > +     /* Clone into a MEM_TYPE_PAGE_ORDER0 xdp_frame. */
> > +     metasize = xdp_data_meta_unsupported(xdp) ? 0 :
> > +                xdp->data - xdp->data_meta;
> > +     headroom = xdp->data - xdp->data_hard_start;
> > +     totsize = xdp->data_end - xdp->data + metasize;
> > +
> > +     if (sizeof(*xdpf) + totsize > PAGE_SIZE)
> > +             return NULL;
> > +
> > +     page = dev_alloc_page();
> > +     if (!page)
> > +             return NULL;
> > +
> > +     addr = page_to_virt(page);
> > +     xdpf = addr;
> > +     memset(xdpf, 0, sizeof(*xdpf));
> > +
> > +     addr += sizeof(*xdpf);
> > +     data_to_copy = metasize ? xdp->data_meta : xdp->data;
> > +     memcpy(addr, data_to_copy, totsize);
> > +
> > +     xdpf->data = addr + metasize;
> > +     xdpf->len = totsize - metasize;
> > +     xdpf->headroom = 0;
> > +     xdpf->metasize = metasize;
> > +     xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
> > +
> > +     xdp_return_buff(xdp);
> > +     return xdpf;
> > +}
> > +EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
>
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply
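
The size arithmetic and copy layout of the conversion discussed above can be sketched in userspace C. This is a simplified model under stated assumptions: struct buf and struct frame are hypothetical stand-ins for the kernel structures, calloc() stands in for dev_alloc_page(), a NULL data_meta stands in for the "metadata unsupported" case, and the final release of the source buffer is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/* Hypothetical stand-ins for struct xdp_buff and struct xdp_frame. */
struct buf {
	uint8_t *data_meta;	/* NULL here means "no metadata" */
	uint8_t *data;
	uint8_t *data_end;
};

struct frame {
	uint8_t *data;
	uint16_t len;
	uint16_t headroom;
	uint16_t metasize;
};

/* Copy metadata and payload into a fresh buffer, right behind the header. */
static struct frame *clone_to_frame(const struct buf *b)
{
	size_t metasize, totsize;
	uint8_t *addr;
	struct frame *f;

	assert(b && b->data && b->data <= b->data_end);
	assert(!b->data_meta || b->data_meta <= b->data);

	metasize = b->data_meta ? (size_t)(b->data - b->data_meta) : 0;
	totsize = (size_t)(b->data_end - b->data) + metasize;

	if (sizeof(*f) + totsize > SKETCH_PAGE_SIZE)
		return NULL;	/* header plus data would not fit in one page */

	f = calloc(1, SKETCH_PAGE_SIZE);	/* stands in for dev_alloc_page() */
	if (!f)
		return NULL;

	addr = (uint8_t *)f + sizeof(*f);
	memcpy(addr, metasize ? b->data_meta : b->data, totsize);

	f->data = addr + metasize;		/* payload starts after metadata */
	f->len = (uint16_t)(totsize - metasize);
	f->headroom = 0;			/* nothing sits in front of the copy */
	f->metasize = (uint16_t)metasize;
	return f;
}
```

The allocation and the memcpy() are the two per-packet costs discussed in the review; a recycling allocator would only remove the first.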

* [PATCH net-next] liquidio: remove unnecessary delay when processing IQ responses
From: Felix Manlunas @ 2018-08-28 18:19 UTC (permalink / raw)
  To: davem
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	felix.manlunas, ricardo.farrington

From: Rick Farrington <ricardo.farrington@cavium.com>

Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
---
 drivers/net/ethernet/cavium/liquidio/request_manager.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/request_manager.c b/drivers/net/ethernet/cavium/liquidio/request_manager.c
index 8f746e1..0a06fbb 100644
--- a/drivers/net/ethernet/cavium/liquidio/request_manager.c
+++ b/drivers/net/ethernet/cavium/liquidio/request_manager.c
@@ -459,7 +459,7 @@ static inline void __copy_cmd_into_iq(struct octeon_instr_queue *iq,
 
 	if (atomic_read(&oct->response_list
 			[OCTEON_ORDERED_SC_LIST].pending_req_count))
-		queue_delayed_work(cwq->wq, &cwq->wk.work, msecs_to_jiffies(1));
+		queue_work(cwq->wq, &cwq->wk.work.work);
 
 	return inst_count;
 }
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next] net: nixge: Add support for 64-bit platforms
From: Moritz Fischer @ 2018-08-28 22:16 UTC (permalink / raw)
  To: davem
  Cc: keescook, netdev, linux-kernel, alex.williams, Moritz Fischer,
	Florian Fainelli

Add support for 64-bit platforms to the driver.

The hardware only supports 32-bit register accesses
so the accesses need to be split up into two writes
when setting the current and tail descriptor values.

Signed-off-by: Moritz Fischer <mdf@kernel.org>
Cc: Florian Fainelli <f.fainelli@gmail.com>
---
Hi Dave,

I'll submit the other changes for fixed_phy / non MDIO support
as a separate series since I'm ironing out some issues on ARM
with devicetree overlays.

Thanks,
Moritz
---
 drivers/net/ethernet/ni/Kconfig |   3 +-
 drivers/net/ethernet/ni/nixge.c | 168 ++++++++++++++++++++++----------
 2 files changed, 116 insertions(+), 55 deletions(-)

diff --git a/drivers/net/ethernet/ni/Kconfig b/drivers/net/ethernet/ni/Kconfig
index aa41e5f6e437..04e315704f71 100644
--- a/drivers/net/ethernet/ni/Kconfig
+++ b/drivers/net/ethernet/ni/Kconfig
@@ -18,8 +18,9 @@ if NET_VENDOR_NI
 
 config NI_XGE_MANAGEMENT_ENET
 	tristate "National Instruments XGE management enet support"
-	depends on ARCH_ZYNQ
+	depends on HAS_IOMEM && HAS_DMA
 	select PHYLIB
+	select OF_MDIO
 	help
 	  Simple LAN device for debug or management purposes. Can
 	  support either 10G or 1G PHYs via SFP+ ports.
diff --git a/drivers/net/ethernet/ni/nixge.c b/drivers/net/ethernet/ni/nixge.c
index 76efed058f33..74cf52e3fb09 100644
--- a/drivers/net/ethernet/ni/nixge.c
+++ b/drivers/net/ethernet/ni/nixge.c
@@ -106,10 +106,10 @@
 	(NIXGE_JUMBO_MTU + NIXGE_HDR_SIZE + NIXGE_TRL_SIZE)
 
 struct nixge_hw_dma_bd {
-	u32 next;
-	u32 reserved1;
-	u32 phys;
-	u32 reserved2;
+	u32 next_lo;
+	u32 next_hi;
+	u32 phys_lo;
+	u32 phys_hi;
 	u32 reserved3;
 	u32 reserved4;
 	u32 cntrl;
@@ -119,11 +119,39 @@ struct nixge_hw_dma_bd {
 	u32 app2;
 	u32 app3;
 	u32 app4;
-	u32 sw_id_offset;
-	u32 reserved5;
+	u32 sw_id_offset_lo;
+	u32 sw_id_offset_hi;
 	u32 reserved6;
 };
 
+#ifdef CONFIG_PHYS_ADDR_T_64BIT
+#define nixge_hw_dma_bd_set_addr(bd, field, addr) \
+	do { \
+		(bd)->field##_lo = lower_32_bits(((u64)addr)); \
+		(bd)->field##_hi = upper_32_bits(((u64)addr)); \
+	} while (0)
+#else
+#define nixge_hw_dma_bd_set_addr(bd, field, addr) \
+	((bd)->field##_lo = lower_32_bits((addr)))
+#endif
+
+#define nixge_hw_dma_bd_set_phys(bd, addr) \
+	nixge_hw_dma_bd_set_addr((bd), phys, (addr))
+
+#define nixge_hw_dma_bd_set_next(bd, addr) \
+	nixge_hw_dma_bd_set_addr((bd), next, (addr))
+
+#define nixge_hw_dma_bd_set_offset(bd, addr) \
+	nixge_hw_dma_bd_set_addr((bd), sw_id_offset, (addr))
+
+#ifdef CONFIG_PHYS_ADDR_T_64BIT
+#define nixge_hw_dma_bd_get_addr(bd, field) \
+	(dma_addr_t)((((u64)(bd)->field##_hi) << 32) | ((bd)->field##_lo))
+#else
+#define nixge_hw_dma_bd_get_addr(bd, field) \
+	(dma_addr_t)((bd)->field##_lo)
+#endif
+
 struct nixge_tx_skb {
 	struct sk_buff *skb;
 	dma_addr_t mapping;
@@ -176,6 +204,15 @@ static void nixge_dma_write_reg(struct nixge_priv *priv, off_t offset, u32 val)
 	writel(val, priv->dma_regs + offset);
 }
 
+static void nixge_dma_write_desc_reg(struct nixge_priv *priv, off_t offset,
+				     dma_addr_t addr)
+{
+	writel(lower_32_bits(addr), priv->dma_regs + offset);
+#ifdef CONFIG_PHYS_ADDR_T_64BIT
+	writel(upper_32_bits(addr), priv->dma_regs + offset + 4);
+#endif
+}
+
 static u32 nixge_dma_read_reg(const struct nixge_priv *priv, off_t offset)
 {
 	return readl(priv->dma_regs + offset);
@@ -202,13 +239,22 @@ static u32 nixge_ctrl_read_reg(struct nixge_priv *priv, off_t offset)
 static void nixge_hw_dma_bd_release(struct net_device *ndev)
 {
 	struct nixge_priv *priv = netdev_priv(ndev);
+	dma_addr_t phys_addr;
+	struct sk_buff *skb;
 	int i;
 
 	for (i = 0; i < RX_BD_NUM; i++) {
-		dma_unmap_single(ndev->dev.parent, priv->rx_bd_v[i].phys,
-				 NIXGE_MAX_JUMBO_FRAME_SIZE, DMA_FROM_DEVICE);
-		dev_kfree_skb((struct sk_buff *)
-			      (priv->rx_bd_v[i].sw_id_offset));
+		phys_addr = nixge_hw_dma_bd_get_addr(&priv->rx_bd_v[i],
+						     phys);
+
+		dma_unmap_single(ndev->dev.parent, phys_addr,
+				 NIXGE_MAX_JUMBO_FRAME_SIZE,
+				 DMA_FROM_DEVICE);
+
+		skb = (struct sk_buff *)
+			nixge_hw_dma_bd_get_addr(&priv->rx_bd_v[i],
+						 sw_id_offset);
+		dev_kfree_skb(skb);
 	}
 
 	if (priv->rx_bd_v)
@@ -231,6 +277,7 @@ static int nixge_hw_dma_bd_init(struct net_device *ndev)
 {
 	struct nixge_priv *priv = netdev_priv(ndev);
 	struct sk_buff *skb;
+	dma_addr_t phys;
 	u32 cr;
 	int i;
 
@@ -259,27 +306,30 @@ static int nixge_hw_dma_bd_init(struct net_device *ndev)
 		goto out;
 
 	for (i = 0; i < TX_BD_NUM; i++) {
-		priv->tx_bd_v[i].next = priv->tx_bd_p +
-				      sizeof(*priv->tx_bd_v) *
-				      ((i + 1) % TX_BD_NUM);
+		nixge_hw_dma_bd_set_next(&priv->tx_bd_v[i],
+					 priv->tx_bd_p +
+					 sizeof(*priv->tx_bd_v) *
+					 ((i + 1) % TX_BD_NUM));
 	}
 
 	for (i = 0; i < RX_BD_NUM; i++) {
-		priv->rx_bd_v[i].next = priv->rx_bd_p +
-				      sizeof(*priv->rx_bd_v) *
-				      ((i + 1) % RX_BD_NUM);
+		nixge_hw_dma_bd_set_next(&priv->rx_bd_v[i],
+					 priv->rx_bd_p
+					 + sizeof(*priv->rx_bd_v) *
+					 ((i + 1) % RX_BD_NUM));
 
 		skb = netdev_alloc_skb_ip_align(ndev,
 						NIXGE_MAX_JUMBO_FRAME_SIZE);
 		if (!skb)
 			goto out;
 
-		priv->rx_bd_v[i].sw_id_offset = (u32)skb;
-		priv->rx_bd_v[i].phys =
-			dma_map_single(ndev->dev.parent,
-				       skb->data,
-				       NIXGE_MAX_JUMBO_FRAME_SIZE,
-				       DMA_FROM_DEVICE);
+		nixge_hw_dma_bd_set_offset(&priv->rx_bd_v[i], skb);
+		phys = dma_map_single(ndev->dev.parent, skb->data,
+				      NIXGE_MAX_JUMBO_FRAME_SIZE,
+				      DMA_FROM_DEVICE);
+
+		nixge_hw_dma_bd_set_phys(&priv->rx_bd_v[i], phys);
+
 		priv->rx_bd_v[i].cntrl = NIXGE_MAX_JUMBO_FRAME_SIZE;
 	}
 
@@ -312,18 +362,18 @@ static int nixge_hw_dma_bd_init(struct net_device *ndev)
 	/* Populate the tail pointer and bring the Rx Axi DMA engine out of
 	 * halted state. This will make the Rx side ready for reception.
 	 */
-	nixge_dma_write_reg(priv, XAXIDMA_RX_CDESC_OFFSET, priv->rx_bd_p);
+	nixge_dma_write_desc_reg(priv, XAXIDMA_RX_CDESC_OFFSET, priv->rx_bd_p);
 	cr = nixge_dma_read_reg(priv, XAXIDMA_RX_CR_OFFSET);
 	nixge_dma_write_reg(priv, XAXIDMA_RX_CR_OFFSET,
 			    cr | XAXIDMA_CR_RUNSTOP_MASK);
-	nixge_dma_write_reg(priv, XAXIDMA_RX_TDESC_OFFSET, priv->rx_bd_p +
+	nixge_dma_write_desc_reg(priv, XAXIDMA_RX_TDESC_OFFSET, priv->rx_bd_p +
 			    (sizeof(*priv->rx_bd_v) * (RX_BD_NUM - 1)));
 
 	/* Write to the RS (Run-stop) bit in the Tx channel control register.
 	 * Tx channel is now ready to run. But only after we write to the
 	 * tail pointer register that the Tx channel will start transmitting.
 	 */
-	nixge_dma_write_reg(priv, XAXIDMA_TX_CDESC_OFFSET, priv->tx_bd_p);
+	nixge_dma_write_desc_reg(priv, XAXIDMA_TX_CDESC_OFFSET, priv->tx_bd_p);
 	cr = nixge_dma_read_reg(priv, XAXIDMA_TX_CR_OFFSET);
 	nixge_dma_write_reg(priv, XAXIDMA_TX_CR_OFFSET,
 			    cr | XAXIDMA_CR_RUNSTOP_MASK);
@@ -451,7 +501,7 @@ static int nixge_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	struct nixge_priv *priv = netdev_priv(ndev);
 	struct nixge_hw_dma_bd *cur_p;
 	struct nixge_tx_skb *tx_skb;
-	dma_addr_t tail_p;
+	dma_addr_t tail_p, cur_phys;
 	skb_frag_t *frag;
 	u32 num_frag;
 	u32 ii;
@@ -466,15 +516,16 @@ static int nixge_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 		return NETDEV_TX_OK;
 	}
 
-	cur_p->phys = dma_map_single(ndev->dev.parent, skb->data,
-				     skb_headlen(skb), DMA_TO_DEVICE);
-	if (dma_mapping_error(ndev->dev.parent, cur_p->phys))
+	cur_phys = dma_map_single(ndev->dev.parent, skb->data,
+				  skb_headlen(skb), DMA_TO_DEVICE);
+	if (dma_mapping_error(ndev->dev.parent, cur_phys))
 		goto drop;
+	nixge_hw_dma_bd_set_phys(cur_p, cur_phys);
 
 	cur_p->cntrl = skb_headlen(skb) | XAXIDMA_BD_CTRL_TXSOF_MASK;
 
 	tx_skb->skb = NULL;
-	tx_skb->mapping = cur_p->phys;
+	tx_skb->mapping = cur_phys;
 	tx_skb->size = skb_headlen(skb);
 	tx_skb->mapped_as_page = false;
 
@@ -485,16 +536,17 @@ static int nixge_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 		tx_skb = &priv->tx_skb[priv->tx_bd_tail];
 		frag = &skb_shinfo(skb)->frags[ii];
 
-		cur_p->phys = skb_frag_dma_map(ndev->dev.parent, frag, 0,
-					       skb_frag_size(frag),
-					       DMA_TO_DEVICE);
-		if (dma_mapping_error(ndev->dev.parent, cur_p->phys))
+		cur_phys = skb_frag_dma_map(ndev->dev.parent, frag, 0,
+					    skb_frag_size(frag),
+					    DMA_TO_DEVICE);
+		if (dma_mapping_error(ndev->dev.parent, cur_phys))
 			goto frag_err;
+		nixge_hw_dma_bd_set_phys(cur_p, cur_phys);
 
 		cur_p->cntrl = skb_frag_size(frag);
 
 		tx_skb->skb = NULL;
-		tx_skb->mapping = cur_p->phys;
+		tx_skb->mapping = cur_phys;
 		tx_skb->size = skb_frag_size(frag);
 		tx_skb->mapped_as_page = true;
 	}
@@ -506,7 +558,7 @@ static int nixge_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 
 	tail_p = priv->tx_bd_p + sizeof(*priv->tx_bd_v) * priv->tx_bd_tail;
 	/* Start the transfer */
-	nixge_dma_write_reg(priv, XAXIDMA_TX_TDESC_OFFSET, tail_p);
+	nixge_dma_write_desc_reg(priv, XAXIDMA_TX_TDESC_OFFSET, tail_p);
 	++priv->tx_bd_tail;
 	priv->tx_bd_tail %= TX_BD_NUM;
 
@@ -537,7 +589,7 @@ static int nixge_recv(struct net_device *ndev, int budget)
 	struct nixge_priv *priv = netdev_priv(ndev);
 	struct sk_buff *skb, *new_skb;
 	struct nixge_hw_dma_bd *cur_p;
-	dma_addr_t tail_p = 0;
+	dma_addr_t tail_p = 0, cur_phys = 0;
 	u32 packets = 0;
 	u32 length = 0;
 	u32 size = 0;
@@ -549,13 +601,15 @@ static int nixge_recv(struct net_device *ndev, int budget)
 		tail_p = priv->rx_bd_p + sizeof(*priv->rx_bd_v) *
 			 priv->rx_bd_ci;
 
-		skb = (struct sk_buff *)(cur_p->sw_id_offset);
+		skb = (struct sk_buff *)nixge_hw_dma_bd_get_addr(cur_p,
+								 sw_id_offset);
 
 		length = cur_p->status & XAXIDMA_BD_STS_ACTUAL_LEN_MASK;
 		if (length > NIXGE_MAX_JUMBO_FRAME_SIZE)
 			length = NIXGE_MAX_JUMBO_FRAME_SIZE;
 
-		dma_unmap_single(ndev->dev.parent, cur_p->phys,
+		dma_unmap_single(ndev->dev.parent,
+				 nixge_hw_dma_bd_get_addr(cur_p, phys),
 				 NIXGE_MAX_JUMBO_FRAME_SIZE,
 				 DMA_FROM_DEVICE);
 
@@ -579,16 +633,17 @@ static int nixge_recv(struct net_device *ndev, int budget)
 		if (!new_skb)
 			return packets;
 
-		cur_p->phys = dma_map_single(ndev->dev.parent, new_skb->data,
-					     NIXGE_MAX_JUMBO_FRAME_SIZE,
-					     DMA_FROM_DEVICE);
-		if (dma_mapping_error(ndev->dev.parent, cur_p->phys)) {
+		cur_phys = dma_map_single(ndev->dev.parent, new_skb->data,
+					  NIXGE_MAX_JUMBO_FRAME_SIZE,
+					  DMA_FROM_DEVICE);
+		if (dma_mapping_error(ndev->dev.parent, cur_phys)) {
 			/* FIXME: bail out and clean up */
 			netdev_err(ndev, "Failed to map ...\n");
 		}
+		nixge_hw_dma_bd_set_phys(cur_p, cur_phys);
 		cur_p->cntrl = NIXGE_MAX_JUMBO_FRAME_SIZE;
 		cur_p->status = 0;
-		cur_p->sw_id_offset = (u32)new_skb;
+		nixge_hw_dma_bd_set_offset(cur_p, new_skb);
 
 		++priv->rx_bd_ci;
 		priv->rx_bd_ci %= RX_BD_NUM;
@@ -599,7 +654,7 @@ static int nixge_recv(struct net_device *ndev, int budget)
 	ndev->stats.rx_bytes += size;
 
 	if (tail_p)
-		nixge_dma_write_reg(priv, XAXIDMA_RX_TDESC_OFFSET, tail_p);
+		nixge_dma_write_desc_reg(priv, XAXIDMA_RX_TDESC_OFFSET, tail_p);
 
 	return packets;
 }
@@ -637,6 +692,7 @@ static irqreturn_t nixge_tx_irq(int irq, void *_ndev)
 	struct nixge_priv *priv = netdev_priv(_ndev);
 	struct net_device *ndev = _ndev;
 	unsigned int status;
+	dma_addr_t phys;
 	u32 cr;
 
 	status = nixge_dma_read_reg(priv, XAXIDMA_TX_SR_OFFSET);
@@ -650,9 +706,11 @@ static irqreturn_t nixge_tx_irq(int irq, void *_ndev)
 		return IRQ_NONE;
 	}
 	if (status & XAXIDMA_IRQ_ERROR_MASK) {
+		phys = nixge_hw_dma_bd_get_addr(&priv->tx_bd_v[priv->tx_bd_ci],
+						phys);
+
 		netdev_err(ndev, "DMA Tx error 0x%x\n", status);
-		netdev_err(ndev, "Current BD is at: 0x%x\n",
-			   (priv->tx_bd_v[priv->tx_bd_ci]).phys);
+		netdev_err(ndev, "Current BD is at: 0x%llx\n", (u64)phys);
 
 		cr = nixge_dma_read_reg(priv, XAXIDMA_TX_CR_OFFSET);
 		/* Disable coalesce, delay timer and error interrupts */
@@ -678,6 +736,7 @@ static irqreturn_t nixge_rx_irq(int irq, void *_ndev)
 	struct nixge_priv *priv = netdev_priv(_ndev);
 	struct net_device *ndev = _ndev;
 	unsigned int status;
+	dma_addr_t phys;
 	u32 cr;
 
 	status = nixge_dma_read_reg(priv, XAXIDMA_RX_SR_OFFSET);
@@ -697,9 +756,10 @@ static irqreturn_t nixge_rx_irq(int irq, void *_ndev)
 		return IRQ_NONE;
 	}
 	if (status & XAXIDMA_IRQ_ERROR_MASK) {
+		phys = nixge_hw_dma_bd_get_addr(&priv->rx_bd_v[priv->rx_bd_ci],
+						phys);
 		netdev_err(ndev, "DMA Rx error 0x%x\n", status);
-		netdev_err(ndev, "Current BD is at: 0x%x\n",
-			   (priv->rx_bd_v[priv->rx_bd_ci]).phys);
+		netdev_err(ndev, "Current BD is at: 0x%llx\n", (u64)phys);
 
 		cr = nixge_dma_read_reg(priv, XAXIDMA_TX_CR_OFFSET);
 		/* Disable coalesce, delay timer and error interrupts */
@@ -735,10 +795,10 @@ static void nixge_dma_err_handler(unsigned long data)
 		tx_skb = &lp->tx_skb[i];
 		nixge_tx_skb_unmap(lp, tx_skb);
 
-		cur_p->phys = 0;
+		nixge_hw_dma_bd_set_phys(cur_p, 0);
 		cur_p->cntrl = 0;
 		cur_p->status = 0;
-		cur_p->sw_id_offset = 0;
+		nixge_hw_dma_bd_set_offset(cur_p, 0);
 	}
 
 	for (i = 0; i < RX_BD_NUM; i++) {
@@ -779,18 +839,18 @@ static void nixge_dma_err_handler(unsigned long data)
 	/* Populate the tail pointer and bring the Rx Axi DMA engine out of
 	 * halted state. This will make the Rx side ready for reception.
 	 */
-	nixge_dma_write_reg(lp, XAXIDMA_RX_CDESC_OFFSET, lp->rx_bd_p);
+	nixge_dma_write_desc_reg(lp, XAXIDMA_RX_CDESC_OFFSET, lp->rx_bd_p);
 	cr = nixge_dma_read_reg(lp, XAXIDMA_RX_CR_OFFSET);
 	nixge_dma_write_reg(lp, XAXIDMA_RX_CR_OFFSET,
 			    cr | XAXIDMA_CR_RUNSTOP_MASK);
-	nixge_dma_write_reg(lp, XAXIDMA_RX_TDESC_OFFSET, lp->rx_bd_p +
+	nixge_dma_write_desc_reg(lp, XAXIDMA_RX_TDESC_OFFSET, lp->rx_bd_p +
 			    (sizeof(*lp->rx_bd_v) * (RX_BD_NUM - 1)));
 
 	/* Write to the RS (Run-stop) bit in the Tx channel control register.
 	 * Tx channel is now ready to run. But only after we write to the
 	 * tail pointer register that the Tx channel will start transmitting
 	 */
-	nixge_dma_write_reg(lp, XAXIDMA_TX_CDESC_OFFSET, lp->tx_bd_p);
+	nixge_dma_write_desc_reg(lp, XAXIDMA_TX_CDESC_OFFSET, lp->tx_bd_p);
 	cr = nixge_dma_read_reg(lp, XAXIDMA_TX_CR_OFFSET);
 	nixge_dma_write_reg(lp, XAXIDMA_TX_CR_OFFSET,
 			    cr | XAXIDMA_CR_RUNSTOP_MASK);
-- 
2.18.0

^ permalink raw reply related
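
The core idea of the patch above, storing a 64-bit address as two 32-bit words and reassembling it later, can be shown in a few lines of portable C. The struct bd_addr and the two helpers are hypothetical names for illustration; they mirror the lo/hi field pairs in the diff but are not the driver's macros.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor slice: one address stored as two 32-bit words. */
struct bd_addr {
	uint32_t lo;
	uint32_t hi;
};

/* Split a 64-bit address into the two 32-bit halves. */
static void bd_set_addr(struct bd_addr *bd, uint64_t addr)
{
	assert(bd);
	bd->lo = (uint32_t)(addr & 0xffffffffu);
	bd->hi = (uint32_t)(addr >> 32);
}

/* Reassemble the 64-bit address from its halves. */
static uint64_t bd_get_addr(const struct bd_addr *bd)
{
	assert(bd);
	return ((uint64_t)bd->hi << 32) | bd->lo;
}
```

Widening hi to 64 bits before the shift matters: shifting a 32-bit value by 32 is undefined behaviour in C.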

* [PATCH net-next] liquidio: fix race condition in instruction completion processing
From: Felix Manlunas @ 2018-08-28 18:32 UTC (permalink / raw)
  To: davem
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	felix.manlunas, ricardo.farrington

From: Rick Farrington <ricardo.farrington@cavium.com>

In lio_enable_irq, the pkt_in_done count register was being cleared to
zero.  However, there could be some completed instructions which were not
yet processed due to budget and limit constraints.
So, only write this register with the number of actual completions
that were processed.

Signed-off-by: Rick Farrington <ricardo.farrington@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
---
 drivers/net/ethernet/cavium/liquidio/octeon_device.c   | 5 +++--
 drivers/net/ethernet/cavium/liquidio/octeon_iq.h       | 2 ++
 drivers/net/ethernet/cavium/liquidio/request_manager.c | 2 ++
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_device.c b/drivers/net/ethernet/cavium/liquidio/octeon_device.c
index f878a55..d0ed6c4 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_device.c
@@ -1450,8 +1450,9 @@ void lio_enable_irq(struct octeon_droq *droq, struct octeon_instr_queue *iq)
 	}
 	if (iq) {
 		spin_lock_bh(&iq->lock);
-		writel(iq->pkt_in_done, iq->inst_cnt_reg);
-		iq->pkt_in_done = 0;
+		writel(iq->pkts_processed, iq->inst_cnt_reg);
+		iq->pkt_in_done -= iq->pkts_processed;
+		iq->pkts_processed = 0;
 		/* this write needs to be flushed before we release the lock */
 		mmiowb();
 		spin_unlock_bh(&iq->lock);
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_iq.h b/drivers/net/ethernet/cavium/liquidio/octeon_iq.h
index 2327062..aecd0d3 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_iq.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_iq.h
@@ -94,6 +94,8 @@ struct octeon_instr_queue {
 
 	u32 pkt_in_done;
 
+	u32 pkts_processed;
+
 	/** A spinlock to protect access to the input ring.*/
 	spinlock_t iq_flush_running_lock;
 
diff --git a/drivers/net/ethernet/cavium/liquidio/request_manager.c b/drivers/net/ethernet/cavium/liquidio/request_manager.c
index 8f746e1..f943aa7 100644
--- a/drivers/net/ethernet/cavium/liquidio/request_manager.c
+++ b/drivers/net/ethernet/cavium/liquidio/request_manager.c
@@ -123,6 +123,7 @@ int octeon_init_instr_queue(struct octeon_device *oct,
 	iq->do_auto_flush = 1;
 	iq->db_timeout = (u32)conf->db_timeout;
 	atomic_set(&iq->instr_pending, 0);
+	iq->pkts_processed = 0;
 
 	/* Initialize the spinlock for this instruction queue */
 	spin_lock_init(&iq->lock);
@@ -495,6 +496,7 @@ static inline void __copy_cmd_into_iq(struct octeon_instr_queue *iq,
 				lio_process_iq_request_list(oct, iq, 0);
 
 		if (inst_processed) {
+			iq->pkts_processed += inst_processed;
 			atomic_sub(inst_processed, &iq->instr_pending);
 			iq->stats.instr_processed += inst_processed;
 		}
-- 
1.8.3.1


* [PATCH net-next 1/2] ip: fail fast on IP defrag errors
From: Peter Oskolkov @ 2018-08-28 18:36 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Peter Oskolkov

The current behavior of IP defragmentation is inconsistent:
- some overlapping/wrong length fragments are dropped without
  affecting the queue;
- most overlapping fragments cause the whole frag queue to be dropped.

This patch brings consistency: if a bad fragment is detected,
the whole frag queue is dropped. Two major benefits:
- fail fast: corrupted frag queues are cleared immediately, instead of
  by timeout;
- testing of overlapping fragments is now much easier: any kind of
  random fragment length mutation now leads to the frag queue being
  discarded (IP packet dropped); before this patch, some overlaps were
  "corrected", with tests not seeing expected packet drops.

Note that in one case (see "if (end&7)" conditional) the current
behavior is preserved as there are concerns that this could be
legitimate padding.
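
The fail-fast policy can be illustrated with a small user-space model.
It is a sketch under assumptions, not the kernel code: struct
frag_queue, fq_add() and the fixed-size array are hypothetical (the
kernel keeps fragments in an rbtree), and treating a full array as an
error is a limitation of the model only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FRAGS 64

struct frag {
	int offset;	/* first byte covered */
	int end;	/* one past the last byte covered */
};

struct frag_queue {
	struct frag frags[MAX_FRAGS];
	size_t n;
	bool dead;	/* set once the whole queue has been discarded */
};

/*
 * Queue the fragment [offset, end). Returns true if it was queued.
 * Any bad fragment (empty, or overlapping a queued one) discards the
 * whole queue instead of dropping just that fragment.
 */
static bool fq_add(struct frag_queue *q, int offset, int end)
{
	size_t i;

	if (q->dead)
		return false;
	if (end <= offset || q->n == MAX_FRAGS)
		goto discard;
	for (i = 0; i < q->n; i++)
		if (offset < q->frags[i].end && q->frags[i].offset < end)
			goto discard;	/* ranges intersect */

	q->frags[q->n].offset = offset;
	q->frags[q->n].end = end;
	q->n++;
	return true;

discard:
	q->n = 0;
	q->dead = true;
	return false;
}
```

For example, after queuing [0, 8) and [16, 24), a fragment covering
[4, 12) overlaps the first range, so the model empties the queue and
marks it dead rather than trying to trim the new fragment.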

Signed-off-by: Peter Oskolkov <posk@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/ip_fragment.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 88281fbce88c..330f62353b11 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -382,7 +382,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		 */
 		if (end < qp->q.len ||
 		    ((qp->q.flags & INET_FRAG_LAST_IN) && end != qp->q.len))
-			goto err;
+			goto discard_qp;
 		qp->q.flags |= INET_FRAG_LAST_IN;
 		qp->q.len = end;
 	} else {
@@ -394,20 +394,20 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		if (end > qp->q.len) {
 			/* Some bits beyond end -> corruption. */
 			if (qp->q.flags & INET_FRAG_LAST_IN)
-				goto err;
+				goto discard_qp;
 			qp->q.len = end;
 		}
 	}
 	if (end == offset)
-		goto err;
+		goto discard_qp;
 
 	err = -ENOMEM;
 	if (!pskb_pull(skb, skb_network_offset(skb) + ihl))
-		goto err;
+		goto discard_qp;
 
 	err = pskb_trim_rcsum(skb, end - offset);
 	if (err)
-		goto err;
+		goto discard_qp;
 
 	/* Note : skb->rbnode and skb->dev share the same location. */
 	dev = skb->dev;
@@ -423,6 +423,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	 * We do the same here for IPv4 (and increment an snmp counter).
 	 */
 
+	err = -EINVAL;
 	/* Find out where to put this fragment.  */
 	prev_tail = qp->q.fragments_tail;
 	if (!prev_tail)
@@ -431,7 +432,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		/* This is the common case: skb goes to the end. */
 		/* Detect and discard overlaps. */
 		if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
-			goto discard_qp;
+			goto overlap;
 		if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
 			ip4_frag_append_to_last_run(&qp->q, skb);
 		else
@@ -450,7 +451,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 						FRAG_CB(skb1)->frag_run_len)
 				rbn = &parent->rb_right;
 			else /* Found an overlap with skb1. */
-				goto discard_qp;
+				goto overlap;
 		} while (*rbn);
 		/* Here we have parent properly set, and rbn pointing to
 		 * one of its NULL left/right children. Insert skb.
@@ -487,16 +488,18 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		skb->_skb_refdst = 0UL;
 		err = ip_frag_reasm(qp, skb, prev_tail, dev);
 		skb->_skb_refdst = orefdst;
+		if (err)
+			inet_frag_kill(&qp->q);
 		return err;
 	}
 
 	skb_dst_drop(skb);
 	return -EINPROGRESS;
 
+overlap:
+	__IP_INC_STATS(net, IPSTATS_MIB_REASM_OVERLAPS);
 discard_qp:
 	inet_frag_kill(&qp->q);
-	err = -EINVAL;
-	__IP_INC_STATS(net, IPSTATS_MIB_REASM_OVERLAPS);
 err:
 	kfree_skb(skb);
 	return err;
-- 
2.19.0.rc0.228.g281dcd1b4d0-goog


* [PATCH net-next 2/2] selftests/net: add ip_defrag selftest
From: Peter Oskolkov @ 2018-08-28 18:36 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Peter Oskolkov
In-Reply-To: <20180828183620.101597-1-posk@google.com>

This test creates a raw IPv4 socket, fragments a largish UDP
datagram and sends the fragments out of order.

Then repeats in a loop with different message and fragment lengths.

Then does the same with overlapping fragments (with overlapping
fragments the expectation is that the recv times out).
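
As a rough illustration of the out-of-order sending, the sketch below
mirrors the arithmetic of ip_defrag.c in this patch: payload chunks at
offsets 0, 2*max_frag_len, ... go out first, then those at
max_frag_len, 3*max_frag_len, ... (the UDP header fragment is sent
between the two passes and is not modelled here). The helper names
frag_send_order() and frag_ip_off() are hypothetical, and frag_ip_off()
returns the field in host byte order, before htons().

```c
#include <assert.h>
#include <stddef.h>

#define UDP_HLEN	8		/* UDP header length in bytes */
#define IP4_MF		(1u << 13)	/* IPv4 "more fragments" flag */

/*
 * Fill out[] with payload offsets in the order they would be sent.
 * Returns the number of entries written (at most cap).
 */
static size_t frag_send_order(int msg_len, int max_frag_len,
			      int *out, size_t cap)
{
	size_t n = 0;
	int offset;

	assert(msg_len > 0 && max_frag_len > 0);

	/* First pass: chunks 0, 2, 4, ... of the payload. */
	for (offset = 0; offset < msg_len && n < cap;
	     offset += 2 * max_frag_len)
		out[n++] = offset;

	/* Second pass: chunks 1, 3, 5, ... of the payload. */
	for (offset = max_frag_len; offset < msg_len && n < cap;
	     offset += 2 * max_frag_len)
		out[n++] = offset;

	return n;
}

/*
 * Flags-and-offset field for the payload chunk at 'offset'. The
 * offset is counted in 8-byte units and skips the UDP header; MF is
 * set on every chunk except the last.
 */
static unsigned int frag_ip_off(int offset, int msg_len, int max_frag_len)
{
	unsigned int off = (unsigned int)(offset + UDP_HLEN) / 8;

	if (msg_len - offset > max_frag_len)
		off |= IP4_MF;
	return off;
}
```

Because the offset field counts 8-byte units, the chunk size has to be
a multiple of 8, which is why the test steps max_frag_len by 8.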

Tested:

root@<host># time ./ip_defrag.sh
ipv4 defrag
PASS
ipv4 defrag with overlaps
PASS

real    1m7.679s
user    0m0.628s
sys     0m2.242s

A similar test for IPv6 is to follow.

Signed-off-by: Peter Oskolkov <posk@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/net/.gitignore   |   2 +
 tools/testing/selftests/net/Makefile     |   4 +-
 tools/testing/selftests/net/ip_defrag.c  | 313 +++++++++++++++++++++++
 tools/testing/selftests/net/ip_defrag.sh |  29 +++
 4 files changed, 346 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/net/ip_defrag.c
 create mode 100755 tools/testing/selftests/net/ip_defrag.sh

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 78b24cf76f40..2836e0cf2d81 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -14,3 +14,5 @@ udpgso_bench_rx
 udpgso_bench_tx
 tcp_inq
 tls
+ip_defrag
+
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 9cca68e440a0..cccdb2295567 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -5,13 +5,13 @@ CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/
 
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh rtnetlink.sh
-TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh
+TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
 TEST_GEN_FILES += tcp_mmap tcp_inq psock_snd
-TEST_GEN_FILES += udpgso udpgso_bench_tx udpgso_bench_rx
+TEST_GEN_FILES += udpgso udpgso_bench_tx udpgso_bench_rx ip_defrag
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict tls
 
diff --git a/tools/testing/selftests/net/ip_defrag.c b/tools/testing/selftests/net/ip_defrag.c
new file mode 100644
index 000000000000..55fdcdc78eef
--- /dev/null
+++ b/tools/testing/selftests/net/ip_defrag.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <errno.h>
+#include <error.h>
+#include <linux/in.h>
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+#include <netinet/udp.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+
+static bool		cfg_do_ipv4;
+static bool		cfg_do_ipv6;
+static bool		cfg_verbose;
+static bool		cfg_overlap;
+static unsigned short	cfg_port = 9000;
+
+const struct in_addr addr4 = { .s_addr = __constant_htonl(INADDR_LOOPBACK + 2) };
+
+#define IP4_HLEN	(sizeof(struct iphdr))
+#define IP6_HLEN	(sizeof(struct ip6_hdr))
+#define UDP_HLEN	(sizeof(struct udphdr))
+
+static int msg_len;
+static int max_frag_len;
+
+#define MSG_LEN_MAX	60000	/* Max UDP payload length. */
+
+#define IP4_MF		(1u << 13)  /* IPv4 MF flag. */
+
+static uint8_t udp_payload[MSG_LEN_MAX];
+static uint8_t ip_frame[IP_MAXPACKET];
+static uint16_t ip_id = 0xabcd;
+static int msg_counter;
+static int frag_counter;
+static unsigned int seed;
+
+/* Receive a UDP packet. Validate it matches udp_payload. */
+static void recv_validate_udp(int fd_udp)
+{
+	ssize_t ret;
+	static uint8_t recv_buff[MSG_LEN_MAX];
+
+	ret = recv(fd_udp, recv_buff, msg_len, 0);
+	msg_counter++;
+
+	if (cfg_overlap) {
+		if (ret != -1)
+			error(1, 0, "recv: expected timeout; got %d; seed = %u",
+				(int)ret, seed);
+		if (errno != ETIMEDOUT && errno != EAGAIN)
+			error(1, errno, "recv: expected timeout: %d; seed = %u",
+				 errno, seed);
+		return;  /* OK */
+	}
+
+	if (ret == -1)
+		error(1, errno, "recv: msg_len = %d max_frag_len = %d",
+			msg_len, max_frag_len);
+	if (ret != msg_len)
+		error(1, 0, "recv: wrong size: %d vs %d", (int)ret, msg_len);
+	if (memcmp(udp_payload, recv_buff, msg_len))
+		error(1, 0, "recv: wrong data");
+}
+
+static uint32_t raw_checksum(uint8_t *buf, int len, uint32_t sum)
+{
+	int i;
+
+	for (i = 0; i < (len & ~1U); i += 2) {
+		sum += (u_int16_t)ntohs(*((u_int16_t *)(buf + i)));
+		if (sum > 0xffff)
+			sum -= 0xffff;
+	}
+
+	if (i < len) {
+		sum += buf[i] << 8;
+		if (sum > 0xffff)
+			sum -= 0xffff;
+	}
+
+	return sum;
+}
+
+static uint16_t udp_checksum(struct ip *iphdr, struct udphdr *udphdr)
+{
+	uint32_t sum = 0;
+
+	sum = raw_checksum((uint8_t *)&iphdr->ip_src, 2 * sizeof(iphdr->ip_src),
+				IPPROTO_UDP + (uint32_t)(UDP_HLEN + msg_len));
+	sum = raw_checksum((uint8_t *)udp_payload, msg_len, sum);
+	sum = raw_checksum((uint8_t *)udphdr, UDP_HLEN, sum);
+	return htons(0xffff & ~sum);
+}
+
+static void send_fragment(int fd_raw, struct sockaddr *addr, socklen_t alen,
+				struct ip *iphdr, int offset)
+{
+	int frag_len;
+	int res;
+
+	if (msg_len - offset <= max_frag_len) {
+		/* This is the last fragment. */
+		frag_len = IP4_HLEN + msg_len - offset;
+		iphdr->ip_off = htons((offset + UDP_HLEN) / 8);
+	} else {
+		frag_len = IP4_HLEN + max_frag_len;
+		iphdr->ip_off = htons((offset + UDP_HLEN) / 8 | IP4_MF);
+	}
+	iphdr->ip_len = htons(frag_len);
+	memcpy(ip_frame + IP4_HLEN, udp_payload + offset,
+		 frag_len - IP4_HLEN);
+
+	res = sendto(fd_raw, ip_frame, frag_len, 0, addr, alen);
+	if (res < 0)
+		error(1, errno, "send_fragment");
+	if (res != frag_len)
+		error(1, 0, "send_fragment: %d vs %d", res, frag_len);
+
+	frag_counter++;
+}
+
+static void send_udp_frags_v4(int fd_raw, struct sockaddr *addr, socklen_t alen)
+{
+	struct ip *iphdr = (struct ip *)ip_frame;
+	struct udphdr udphdr;
+	int res;
+	int offset;
+	int frag_len;
+
+	/* Send the UDP datagram using raw IP fragments: the 0th fragment
+	 * has the UDP header; other fragments are pieces of udp_payload
+	 * split in chunks of frag_len size.
+	 *
+	 * Odd fragments (1st, 3rd, 5th, etc.) are sent out first, then
+	 * even fragments (0th, 2nd, etc.) are sent out.
+	 */
+	memset(iphdr, 0, sizeof(*iphdr));
+	iphdr->ip_hl = 5;
+	iphdr->ip_v = 4;
+	iphdr->ip_tos = 0;
+	iphdr->ip_id = htons(ip_id++);
+	iphdr->ip_ttl = 0x40;
+	iphdr->ip_p = IPPROTO_UDP;
+	iphdr->ip_src.s_addr = htonl(INADDR_LOOPBACK);
+	iphdr->ip_dst = addr4;
+	iphdr->ip_sum = 0;
+
+	/* Odd fragments. */
+	offset = 0;
+	while (offset < msg_len) {
+		send_fragment(fd_raw, addr, alen, iphdr, offset);
+		offset += 2 * max_frag_len;
+	}
+
+	if (cfg_overlap) {
+		/* Send an extra random fragment. */
+		offset = rand() % (UDP_HLEN + msg_len - 1);
+		/* sendto() returns EINVAL if offset + frag_len is too small. */
+		frag_len = IP4_HLEN + UDP_HLEN + rand() % 256;
+		iphdr->ip_off = htons(offset / 8 | IP4_MF);
+		iphdr->ip_len = htons(frag_len);
+		res = sendto(fd_raw, ip_frame, frag_len, 0, addr, alen);
+		if (res < 0)
+			error(1, errno, "sendto overlap");
+		if (res != frag_len)
+			error(1, 0, "sendto overlap: %d vs %d", (int)res, frag_len);
+		frag_counter++;
+	}
+
+	/* Zeroth fragment (UDP header). */
+	frag_len = IP4_HLEN + UDP_HLEN;
+	iphdr->ip_len = htons(frag_len);
+	iphdr->ip_off = htons(IP4_MF);
+
+	udphdr.source = htons(cfg_port + 1);
+	udphdr.dest = htons(cfg_port);
+	udphdr.len = htons(UDP_HLEN + msg_len);
+	udphdr.check = 0;
+	udphdr.check = udp_checksum(iphdr, &udphdr);
+
+	memcpy(ip_frame + IP4_HLEN, &udphdr, UDP_HLEN);
+	res = sendto(fd_raw, ip_frame, frag_len, 0, addr, alen);
+	if (res < 0)
+		error(1, errno, "sendto UDP header");
+	if (res != frag_len)
+		error(1, 0, "sendto UDP header: %d vs %d", (int)res, frag_len);
+	frag_counter++;
+
+	/* Even fragments. */
+	offset = max_frag_len;
+	while (offset < msg_len) {
+		send_fragment(fd_raw, addr, alen, iphdr, offset);
+		offset += 2 * max_frag_len;
+	}
+}
+
+static void run_test(struct sockaddr *addr, socklen_t alen)
+{
+	int fd_tx_udp, fd_tx_raw, fd_rx_udp;
+	struct timeval tv = { .tv_sec = 0, .tv_usec = 10 * 1000 };
+	int idx;
+
+	/* Initialize the payload. */
+	for (idx = 0; idx < MSG_LEN_MAX; ++idx)
+		udp_payload[idx] = idx % 256;
+
+	/* Open sockets. */
+	fd_tx_udp = socket(addr->sa_family, SOCK_DGRAM, 0);
+	if (fd_tx_udp == -1)
+		error(1, errno, "socket tx_udp");
+
+	fd_tx_raw = socket(addr->sa_family, SOCK_RAW, IPPROTO_RAW);
+	if (fd_tx_raw == -1)
+		error(1, errno, "socket tx_raw");
+
+	fd_rx_udp = socket(addr->sa_family, SOCK_DGRAM, 0);
+	if (fd_rx_udp == -1)
+		error(1, errno, "socket rx_udp");
+	if (bind(fd_rx_udp, addr, alen))
+		error(1, errno, "bind");
+	/* Fail fast. */
+	if (setsockopt(fd_rx_udp, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)))
+		error(1, errno, "setsockopt rcv timeout");
+
+	for (msg_len = 1; msg_len < MSG_LEN_MAX; msg_len += (rand() % 4096)) {
+		if (cfg_verbose)
+			printf("msg_len: %d\n", msg_len);
+		max_frag_len = addr->sa_family == AF_INET ? 8 : 1280;
+		for (; max_frag_len < 1500 && max_frag_len <= msg_len;
+				max_frag_len += 8) {
+			send_udp_frags_v4(fd_tx_raw, addr, alen);
+			recv_validate_udp(fd_rx_udp);
+		}
+	}
+
+	/* Cleanup. */
+	if (close(fd_tx_raw))
+		error(1, errno, "close tx_raw");
+	if (close(fd_tx_udp))
+		error(1, errno, "close tx_udp");
+	if (close(fd_rx_udp))
+		error(1, errno, "close rx_udp");
+
+	if (cfg_verbose)
+		printf("processed %d messages, %d fragments\n",
+			msg_counter, frag_counter);
+
+	fprintf(stderr, "PASS\n");
+}
+
+
+static void run_test_v4(void)
+{
+	struct sockaddr_in addr = {0};
+
+	addr.sin_family = AF_INET;
+	addr.sin_port = htons(cfg_port);
+	addr.sin_addr = addr4;
+
+	run_test((void *)&addr, sizeof(addr));
+}
+
+static void run_test_v6(void)
+{
+	fprintf(stderr, "NOT IMPL.\n");
+	exit(1);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	int c;
+
+	while ((c = getopt(argc, argv, "46ov")) != -1) {
+		switch (c) {
+		case '4':
+			cfg_do_ipv4 = true;
+			break;
+		case '6':
+			cfg_do_ipv6 = true;
+			break;
+		case 'o':
+			cfg_overlap = true;
+			break;
+		case 'v':
+			cfg_verbose = true;
+			break;
+		default:
+			error(1, 0, "%s: parse error", argv[0]);
+		}
+	}
+}
+
+int main(int argc, char **argv)
+{
+	parse_opts(argc, argv);
+	seed = time(NULL);
+	srand(seed);
+
+	if (cfg_do_ipv4)
+		run_test_v4();
+	if (cfg_do_ipv6)
+		run_test_v6();
+
+	return 0;
+}
diff --git a/tools/testing/selftests/net/ip_defrag.sh b/tools/testing/selftests/net/ip_defrag.sh
new file mode 100755
index 000000000000..53a1ed46790d
--- /dev/null
+++ b/tools/testing/selftests/net/ip_defrag.sh
@@ -0,0 +1,29 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Run a couple of IP defragmentation tests.
+
+set +x
+set -e
+
+echo "ipv4 defrag"
+
+run_v4() {
+sysctl -w net.ipv4.ipfrag_high_thresh=9000000 &> /dev/null
+sysctl -w net.ipv4.ipfrag_low_thresh=7000000 &> /dev/null
+./ip_defrag -4
+}
+export -f run_v4
+
+./in_netns.sh "run_v4"
+
+echo "ipv4 defrag with overlaps"
+run_v4o() {
+sysctl -w net.ipv4.ipfrag_high_thresh=9000000 &> /dev/null
+sysctl -w net.ipv4.ipfrag_low_thresh=7000000 &> /dev/null
+./ip_defrag -4o
+}
+export -f run_v4o
+
+./in_netns.sh "run_v4o"
+
-- 
2.19.0.rc0.228.g281dcd1b4d0-goog


* Re: phys_port_id in switchdev mode?
From: Jakub Kicinski @ 2018-08-28 18:43 UTC (permalink / raw)
  To: Florian Fainelli, Or Gerlitz, Simon Horman, Andy Gospodarek,
	mchan@broadcom.com, Jiri Pirko, Alexander Duyck, Frederick Botha
  Cc: nick viljoen, netdev@vger.kernel.org
In-Reply-To: <20180828200539.1c0fe607@cakuba.netronome.com>

Ugh, CC: netdev..

On Tue, 28 Aug 2018 20:05:39 +0200, Jakub Kicinski wrote:
> Hi!
> 
> I wonder if we can use phys_port_id in switchdev to group together
> interfaces of a single PCI PF?  Here is the problem:
> 
> With a mix of PF and VF interfaces it gets increasingly difficult to
> figure out which one corresponds to which PF.  We can identify which
> *representor* is which, by means of phys_port_name and devlink
> flavours.  But if the actual VF/PF interfaces are also present on the
> same host, it gets confusing when one tries to identify the PF they
> came from.  Generally one has to resort to matching between PCI DBDF of
> the PF and VFs or read relevant info out of ethtool -i.
> 
> In a multi-host scenario this is particularly painful, as there seems to
> be no immediately obvious way to match PCI interface ID of a card (0,
> 1, 2, 3, 4...) to the DBDF we have connected.
> 
> Another angle to this is legacy SR-IOV NDOs.  User space picks a netdev
> from /sys/bus/pci/devices/$VF_DBDF/physfn/net/ to run the NDOs on in a
> somewhat random manner, which means we have to provide those for all
> devices with a link to the PF (all reprs).  And we have to link them (a)
> because it's
> right (tm) and (b) to get correct naming.  The only reliable way to make
> user space (libvirt) choose the repr it should run the NDOs on (which is
> IMHO the corresponding PF repr) is to set phys_port_id on actual VFs,
> VF reprs, PFs and PF reprs to a value corresponding to the *PCI PF*,
> not the external/Ethernet port when in switchdev mode.  User space
> should understand phys_port_id in this context, given it was originally
> introduced for matching VFs to ports.
> 
> I hope this explanation makes sense, and is correct.  Please point out
> errors in my understanding, any comments would be appreciated! :)
> 
> Jiri?  Or?  Others?
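
For reference, a user-space tool could compare the phys_port_id of two
netdevs through sysfs roughly as sketched below. This is a minimal
sketch under assumptions, not what libvirt does: read_attr() and
same_phys_port() are hypothetical helpers, and the only interface
relied on is the /sys/class/net/<ifname>/phys_port_id attribute. A
driver that does not implement the attribute makes the read fail,
which the sketch treats as "no match".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read a one-line attribute file into buf and strip the trailing
 * newline. Returns 0 on success, -1 if the file cannot be opened or
 * read.
 */
static int read_attr(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");
	char *ok;

	assert(len > 0);
	if (!f)
		return -1;
	ok = fgets(buf, (int)len, f);
	fclose(f);
	if (!ok)
		return -1;
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Return 1 if both interfaces report the same non-empty phys_port_id. */
static int same_phys_port(const char *ifa, const char *ifb)
{
	char pa[128], pb[128], ida[64], idb[64];

	snprintf(pa, sizeof(pa), "/sys/class/net/%s/phys_port_id", ifa);
	snprintf(pb, sizeof(pb), "/sys/class/net/%s/phys_port_id", ifb);

	if (read_attr(pa, ida, sizeof(ida)) || read_attr(pb, idb, sizeof(idb)))
		return 0;
	return ida[0] != '\0' && strcmp(ida, idb) == 0;
}
```

Under the proposal quoted above, such a comparison would group VFs, VF
reprs, PFs and PF reprs by PCI PF rather than by external port.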

