Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH 1/2] netfilter: nf_nat_snmp_basic: add missing length checks in ASN.1 cbs
From: Pablo Neira Ayuso @ 2019-02-11 16:53 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20190211165319.17965-1-pablo@netfilter.org>

From: Jann Horn <jannh@google.com>

The generic ASN.1 decoder infrastructure doesn't guarantee that callbacks
will get as much data as they expect; callbacks have to check the `datalen`
parameter before looking at `data`. Make sure that snmp_version() and
snmp_helper() don't read/write beyond the end of the packet data.

(Also move the assignment to `pdata` down below the check to make it clear
that it isn't necessarily a pointer we can use before the `datalen` check.)

Fixes: cc2d58634e0f ("netfilter: nf_nat_snmp_basic: use asn1 decoder library")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/nf_nat_snmp_basic_main.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/netfilter/nf_nat_snmp_basic_main.c b/net/ipv4/netfilter/nf_nat_snmp_basic_main.c
index a0aa13bcabda..0a8a60c1bf9a 100644
--- a/net/ipv4/netfilter/nf_nat_snmp_basic_main.c
+++ b/net/ipv4/netfilter/nf_nat_snmp_basic_main.c
@@ -105,6 +105,8 @@ static void fast_csum(struct snmp_ctx *ctx, unsigned char offset)
 int snmp_version(void *context, size_t hdrlen, unsigned char tag,
 		 const void *data, size_t datalen)
 {
+	if (datalen != 1)
+		return -EINVAL;
 	if (*(unsigned char *)data > 1)
 		return -ENOTSUPP;
 	return 1;
@@ -114,8 +116,11 @@ int snmp_helper(void *context, size_t hdrlen, unsigned char tag,
 		const void *data, size_t datalen)
 {
 	struct snmp_ctx *ctx = (struct snmp_ctx *)context;
-	__be32 *pdata = (__be32 *)data;
+	__be32 *pdata;
 
+	if (datalen != 4)
+		return -EINVAL;
+	pdata = (__be32 *)data;
 	if (*pdata == ctx->from) {
 		pr_debug("%s: %pI4 to %pI4\n", __func__,
 			 (void *)&ctx->from, (void *)&ctx->to);
-- 
2.11.0


^ permalink raw reply related

* [PATCH 0/2] Netfilter fixes for net
From: Pablo Neira Ayuso @ 2019-02-11 16:53 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

Hi David,

The following patchset contains Netfilter fixes for net:

1) Out-of-bound access to packet data from the snmp nat helper,
   from Jann Horn.

2) ICMP(v6) error packets are set as related traffic by conntrack,
   update protocol number before calling nf_nat_ipv4_manip_pkt()
   to use ICMP(v6) rather than the original protocol number,
   from Florian Westphal.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Thanks!

----------------------------------------------------------------

The following changes since commit 31b58ad0c3279817cd246eab27eaf53b626dfcde:

  Merge branch 'r8169-revert-two-commits-due-to-a-regression' (2019-02-10 12:54:49 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD

for you to fetch changes up to 8303b7e8f018724a2cd7752eb29c2801fa8c4067:

  netfilter: nat: fix spurious connection timeouts (2019-02-11 17:43:17 +0100)

----------------------------------------------------------------
Florian Westphal (1):
      netfilter: nat: fix spurious connection timeouts

Jann Horn (1):
      netfilter: nf_nat_snmp_basic: add missing length checks in ASN.1 cbs

 net/ipv4/netfilter/nf_nat_l3proto_ipv4.c    | 1 +
 net/ipv4/netfilter/nf_nat_snmp_basic_main.c | 7 ++++++-
 net/ipv6/netfilter/nf_nat_l3proto_ipv6.c    | 1 +
 3 files changed, 8 insertions(+), 1 deletion(-)

^ permalink raw reply

* Re: bpf: test_tunnel.sh: BUG: unable to handle kernel NULL pointer dereference
From: Alan Maguire @ 2019-02-11 16:39 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Netdev, open list, open list:KERNEL SELFTEST FRAMEWORK,
	Linux-Next Mailing List, David S. Miller, Valdis Kletnieks,
	Song Liu, Rafael Tinoco, ast, Daniel Borkmann
In-Reply-To: <CA+G9fYu-RXRUPyTeAfSQjXXbtGQeTkbhns9-L5ZVhm12G3xhmQ@mail.gmail.com>

On Fri, 1 Feb 2019, Naresh Kamboju wrote:

> Kernel panic while running bpf: test_tunnel.sh on linux -next
> 5.0.0-rc4-next-20190129.
> 
> selftests: bpf: test_tunnel.sh
> [  273.997647] IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
> [  274.054068] ip (11458) used greatest stack depth: 11448 bytes left
> [  274.120445] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000000
> [  274.128285] #PF error: [INSTR]
> [  274.131351] PGD 8000000414a0e067 P4D 8000000414a0e067 PUD 3b6334067 PMD 0
> [  274.138241] Oops: 0010 [#1] SMP PTI
> [  274.141734] CPU: 1 PID: 11464 Comm: ping Not tainted
> 5.0.0-rc4-next-20190129 #1
> [  274.149046] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.0b 07/27/2017
> [  274.156526] RIP: 0010:          (null)
> [  274.160280] Code: Bad RIP value.
> [  274.163509] RSP: 0018:ffffbc9681f83540 EFLAGS: 00010286
> [  274.168726] RAX: 0000000000000000 RBX: ffffdc967fa80a18 RCX: 0000000000000000
> [  274.175851] RDX: ffff9db2ee08b540 RSI: 000000000000000e RDI: ffffdc967fa809a0
> [  274.182974] RBP: ffffbc9681f83580 R08: ffff9db2c4d62690 R09: 000000000000000c
> [  274.190098] R10: 0000000000000000 R11: ffff9db2ee08b540 R12: ffff9db31ce7c000
> [  274.197222] R13: 0000000000000001 R14: 000000000000000c R15: ffff9db3179cf400
> [  274.204346] FS:  00007ff4ae7c5740(0000) GS:ffff9db31fa80000(0000)
> knlGS:0000000000000000
> [  274.212424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  274.218162] CR2: ffffffffffffffd6 CR3: 00000004574da004 CR4: 00000000003606e0
> [  274.225292] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  274.232416] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  274.239541] Call Trace:
> [  274.241988]  ? tnl_update_pmtu+0x296/0x3b0
> [  274.246085]  ip_md_tunnel_xmit+0x1bc/0x520
> [  274.250176]  gre_fb_xmit+0x330/0x390
> [  274.253754]  gre_tap_xmit+0x128/0x180
> [  274.257414]  dev_hard_start_xmit+0xb7/0x300
> [  274.261598]  sch_direct_xmit+0xf6/0x290
> [  274.265430]  __qdisc_run+0x15d/0x5e0
> [  274.269007]  __dev_queue_xmit+0x2c5/0xc00
> [  274.273011]  ? dev_queue_xmit+0x10/0x20
> [  274.276842]  ? eth_header+0x2b/0xc0
> [  274.280326]  dev_queue_xmit+0x10/0x20
> [  274.283984]  ? dev_queue_xmit+0x10/0x20
> [  274.287813]  arp_xmit+0x1a/0xf0
> [  274.290952]  arp_send_dst.part.19+0x46/0x60
> [  274.295138]  arp_solicit+0x177/0x6b0
> [  274.298708]  ? mod_timer+0x18e/0x440
> [  274.302281]  neigh_probe+0x57/0x70
> [  274.305684]  __neigh_event_send+0x197/0x2d0
> [  274.309862]  neigh_resolve_output+0x18c/0x210
> [  274.314212]  ip_finish_output2+0x257/0x690
> [  274.318304]  ip_finish_output+0x219/0x340
> [  274.322314]  ? ip_finish_output+0x219/0x340
> [  274.326493]  ip_output+0x76/0x240
> [  274.329805]  ? ip_fragment.constprop.53+0x80/0x80
> [  274.334510]  ip_local_out+0x3f/0x70
> [  274.337992]  ip_send_skb+0x19/0x40
> [  274.341391]  ip_push_pending_frames+0x33/0x40
> [  274.345740]  raw_sendmsg+0xc15/0x11d0
> [  274.349403]  ? __might_fault+0x85/0x90
> [  274.353151]  ? _copy_from_user+0x6b/0xa0
> [  274.357070]  ? rw_copy_check_uvector+0x54/0x130
> [  274.361604]  inet_sendmsg+0x42/0x1c0
> [  274.365179]  ? inet_sendmsg+0x42/0x1c0
> [  274.368937]  sock_sendmsg+0x3e/0x50
> [  274.372460]  ___sys_sendmsg+0x26f/0x2d0
> [  274.376293]  ? lock_acquire+0x95/0x190
> [  274.380043]  ? __handle_mm_fault+0x7ce/0xb70
> [  274.384307]  ? lock_acquire+0x95/0x190
> [  274.388053]  ? __audit_syscall_entry+0xdd/0x130
> [  274.392586]  ? ktime_get_coarse_real_ts64+0x64/0xc0
> [  274.397461]  ? __audit_syscall_entry+0xdd/0x130
> [  274.401989]  ? trace_hardirqs_on+0x4c/0x100
> [  274.406173]  __sys_sendmsg+0x63/0xa0
> [  274.409744]  ? __sys_sendmsg+0x63/0xa0
> [  274.413488]  __x64_sys_sendmsg+0x1f/0x30
> [  274.417405]  do_syscall_64+0x55/0x190
> [  274.421064]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  274.426113] RIP: 0033:0x7ff4ae0e6e87
> [  274.429686] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00
> 00 00 00 8b 05 ca d9 2b 00 48 63 d2 48 63 ff 85 c0 75 10 b8 2e 00 00
> 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 53 48 89 f3 48 83 ec 10 48 89 7c
> 24 08
> [  274.448422] RSP: 002b:00007ffcd9b76db8 EFLAGS: 00000246 ORIG_RAX:
> 000000000000002e
> [  274.455978] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007ff4ae0e6e87
> [  274.463104] RDX: 0000000000000000 RSI: 00000000006092e0 RDI: 0000000000000003
> [  274.470228] RBP: 0000000000000000 R08: 00007ffcd9bc40a0 R09: 00007ffcd9bc4080
> [  274.477349] R10: 000000000000060a R11: 0000000000000246 R12: 0000000000000003
> [  274.484475] R13: 0000000000000016 R14: 00007ffcd9b77fa0 R15: 00007ffcd9b78da4
> [  274.491602] Modules linked in: cls_bpf sch_ingress iptable_filter
> ip_tables algif_hash af_alg x86_pkg_temp_thermal fuse [last unloaded:
> test_bpf]
> [  274.504634] CR2: 0000000000000000
> [  274.507976] ---[ end trace 196d18386545eae1 ]---
> [  274.512588] RIP: 0010:          (null)
> [  274.516334] Code: Bad RIP value.
> [  274.519557] RSP: 0018:ffffbc9681f83540 EFLAGS: 00010286
> [  274.524775] RAX: 0000000000000000 RBX: ffffdc967fa80a18 RCX: 0000000000000000
> [  274.531921] RDX: ffff9db2ee08b540 RSI: 000000000000000e RDI: ffffdc967fa809a0
> [  274.539082] RBP: ffffbc9681f83580 R08: ffff9db2c4d62690 R09: 000000000000000c
> [  274.546205] R10: 0000000000000000 R11: ffff9db2ee08b540 R12: ffff9db31ce7c000
> [  274.553329] R13: 0000000000000001 R14: 000000000000000c R15: ffff9db3179cf400
> [  274.560456] FS:  00007ff4ae7c5740(0000) GS:ffff9db31fa80000(0000)
> knlGS:0000000000000000
> [  274.568541] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  274.574277] CR2: ffffffffffffffd6 CR3: 00000004574da004 CR4: 00000000003606e0
> [  274.581403] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  274.588535] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  274.595658] Kernel panic - not syncing: Fatal exception in interrupt
> [  274.602046] Kernel Offset: 0x14400000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [  274.612827] ---[ end Kernel panic - not syncing: Fatal exception in
> interrupt ]---
> [  274.620387] ------------[ cut here ]------------
> 
> The above log is from x86_64 device.
> 

 
I'm also seeing the same panic during test_tunnel.sh execution
on x86_64.

From poking around it looks like the skb's dst entry is being used
to calculate the mtu in:

mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;

...but because that dst_entry  has an "ops" value set to md_dst_ops,
the various ops (including mtu) are not set:

crash> struct sk_buff._skb_refdst ffff928f87447700 -x
      _skb_refdst = 0xffffcd6fbf5ea590
crash> struct dst_entry.ops 0xffffcd6fbf5ea590
  ops = 0xffffffffa0193800
crash> struct dst_ops.mtu 0xffffffffa0193800
  mtu = 0x0
crash>

I confirmed that the dst entry also has dst->input set to
dst_md_discard, so it looks like it's an entry that's been
initialized via __metadata_dst_init alright.

I think the fix here is to use skb_valid_dst(skb) - it checks
for  DST_METADATA also, and with that fix in place, the
problem - which was previously 100% reproducible - disappears.

However what we should do in terms of path MTU setting for
such cases (use ip*_update_pmtu() perhaps?), I'm not sure.

The below patch resolves the panic and all tunnel tests pass
without incident.

Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
---
 net/ipv4/ip_tunnel.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 893f013..5dcf50c 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -515,9 +515,10 @@ static int tnl_update_pmtu(struct net_device *dev, struct sk_buff *skb,
 		mtu = dst_mtu(&rt->dst) - dev->hard_header_len
 					- sizeof(struct iphdr) - tunnel_hlen;
 	else
-		mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
+		mtu = skb_valid_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
 
-	skb_dst_update_pmtu(skb, mtu);
+	if (skb_valid_dst(skb))
+		skb_dst_update_pmtu(skb, mtu);
 
 	if (skb->protocol == htons(ETH_P_IP)) {
 		if (!skb_is_gso(skb) &&
@@ -530,9 +531,11 @@ static int tnl_update_pmtu(struct net_device *dev, struct sk_buff *skb,
 	}
 #if IS_ENABLED(CONFIG_IPV6)
 	else if (skb->protocol == htons(ETH_P_IPV6)) {
-		struct rt6_info *rt6 = (struct rt6_info *)skb_dst(skb);
+		struct rt6_info *rt6;
 		__be32 daddr;
 
+		rt6 = skb_valid_dst(skb) ? (struct rt6_info *)skb_dst(skb) :
+					   NULL;
 		daddr = md ? dst : tunnel->parms.iph.daddr;
 
 		if (rt6 && mtu < dst_mtu(skb_dst(skb)) &&
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH 1/2] xsk: do not use mmap_sem
From: Dan Williams @ 2019-02-11 16:37 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Björn Töpel, Davidlohr Bueso, Andrew Morton, Linux MM,
	LKML, David S . Miller, Bjorn Topel, Magnus Karlsson, Netdev,
	Davidlohr Bueso, Weiny, Ira
In-Reply-To: <d92b7b49-81e6-1ac5-4ae4-4909f87bbea8@iogearbox.net>

[ add Ira ]

On Mon, Feb 11, 2019 at 7:33 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> [ +Dan ]
>
> On 02/07/2019 08:43 AM, Björn Töpel wrote:
> > Den tors 7 feb. 2019 kl 06:38 skrev Davidlohr Bueso <dave@stgolabs.net>:
> >>
> >> Holding mmap_sem exclusively for a gup() is an overkill.
> >> Lets replace the call for gup_fast() and let the mm take
> >> it if necessary.
> >>
> >> Cc: David S. Miller <davem@davemloft.net>
> >> Cc: Bjorn Topel <bjorn.topel@intel.com>
> >> Cc: Magnus Karlsson <magnus.karlsson@intel.com>
> >> CC: netdev@vger.kernel.org
> >> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> >> ---
> >>  net/xdp/xdp_umem.c | 6 ++----
> >>  1 file changed, 2 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> >> index 5ab236c5c9a5..25e1e76654a8 100644
> >> --- a/net/xdp/xdp_umem.c
> >> +++ b/net/xdp/xdp_umem.c
> >> @@ -265,10 +265,8 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
> >>         if (!umem->pgs)
> >>                 return -ENOMEM;
> >>
> >> -       down_write(&current->mm->mmap_sem);
> >> -       npgs = get_user_pages(umem->address, umem->npgs,
> >> -                             gup_flags, &umem->pgs[0], NULL);
> >> -       up_write(&current->mm->mmap_sem);
> >> +       npgs = get_user_pages_fast(umem->address, umem->npgs,
> >> +                                  gup_flags, &umem->pgs[0]);
> >>
> >
> > Thanks for the patch!
> >
> > The lifetime of the pinning is similar to RDMA umem mapping, so isn't
> > gup_longterm preferred?
>
> Seems reasonable from reading what gup_longterm seems to do. Davidlohr
> or Dan, any thoughts on the above?

Yes, any gup where the lifetime of the corresponding put_page() is
under direct control of userspace should be using the _longterm flavor
to coordinate be careful in the presence of dax. In the dax case there
is no page cache indirection to coordinate truncate vs page pins. Ira
is looking to add a "fast + longterm" flavor for cases like this.

^ permalink raw reply

* Re: [PATCH v2 bpf] bpf: fix lockdep false positive in stackmap
From: Eric Dumazet @ 2019-02-11 16:35 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, David Miller, Peter Zijlstra, longman,
	Jann Horn, netdev, kernel-team
In-Reply-To: <d6d3bf01-4c8f-8b4d-073c-a8222f5ba677@iogearbox.net>

On Mon, Feb 11, 2019 at 7:39 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 02/10/2019 09:52 PM, Alexei Starovoitov wrote:
> > Lockdep warns about false positive:
> > [   11.211460] ------------[ cut here ]------------
> > [   11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
> > [   11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280
> > [   11.213134] Modules linked in:
> > [   11.214954] RIP: 0010:lock_release+0x1ad/0x280
> > [   11.223508] Call Trace:
> > [   11.223705]  <IRQ>
> > [   11.223874]  ? __local_bh_enable+0x7a/0x80
> > [   11.224199]  up_read+0x1c/0xa0
> > [   11.224446]  do_up_read+0x12/0x20
> > [   11.224713]  irq_work_run_list+0x43/0x70
> > [   11.225030]  irq_work_run+0x26/0x50
> > [   11.225310]  smp_irq_work_interrupt+0x57/0x1f0
> > [   11.225662]  irq_work_interrupt+0xf/0x20
> >
> > since rw_semaphore is released in a different task vs task that locked the sema.
> > It is expected behavior.
> > Fix the warning with up_read_non_owner() and rwsem_release() annotation.
> >
> > Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context")
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Applied, thanks!

Thanks everyone !

^ permalink raw reply

* Re: [PATCH v2] xsk: share the mmap_sem for page pinning
From: Björn Töpel @ 2019-02-11 16:21 UTC (permalink / raw)
  To: akpm, linux-mm, LKML, David S . Miller, Bjorn Topel,
	Magnus Karlsson, Netdev, Davidlohr Bueso
In-Reply-To: <20190211161529.uskq5ca7y3j5522i@linux-r8p5>

Den mån 11 feb. 2019 kl 17:15 skrev Davidlohr Bueso <dave@stgolabs.net>:
>
> Holding mmap_sem exclusively for a gup() is an overkill.
> Lets share the lock and replace the gup call for gup_longterm(),
> as it is better suited for the lifetime of the pinning.
>

Thanks for the cleanup!

Acked-by: Björn Töpel <bjorn.topel@intel.com>

> Cc: David S. Miller <davem@davemloft.net>
> Cc: Bjorn Topel <bjorn.topel@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@intel.com>
> CC: netdev@vger.kernel.org
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>  net/xdp/xdp_umem.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 5ab236c5c9a5..e7fa8d0d7090 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -265,10 +265,10 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
>         if (!umem->pgs)
>                 return -ENOMEM;
>
> -       down_write(&current->mm->mmap_sem);
> -       npgs = get_user_pages(umem->address, umem->npgs,
> +       down_read(&current->mm->mmap_sem);
> +       npgs = get_user_pages_longterm(umem->address, umem->npgs,
>                               gup_flags, &umem->pgs[0], NULL);
> -       up_write(&current->mm->mmap_sem);
> +       up_read(&current->mm->mmap_sem);
>
>         if (npgs != umem->npgs) {
>                 if (npgs >= 0) {
> --
> 2.16.4
>

^ permalink raw reply

* [PATCH v2] xsk: share the mmap_sem for page pinning
From: Davidlohr Bueso @ 2019-02-11 16:15 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-kernel, David S . Miller, Bjorn Topel,
	Magnus Karlsson, netdev, Davidlohr Bueso
In-Reply-To: <20190207053740.26915-2-dave@stgolabs.net>

Holding mmap_sem exclusively for a gup() is an overkill.
Lets share the lock and replace the gup call for gup_longterm(),
as it is better suited for the lifetime of the pinning.

Cc: David S. Miller <davem@davemloft.net>
Cc: Bjorn Topel <bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
CC: netdev@vger.kernel.org
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 net/xdp/xdp_umem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 5ab236c5c9a5..e7fa8d0d7090 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -265,10 +265,10 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
 	if (!umem->pgs)
 		return -ENOMEM;
 
-	down_write(&current->mm->mmap_sem);
-	npgs = get_user_pages(umem->address, umem->npgs,
+	down_read(&current->mm->mmap_sem);
+	npgs = get_user_pages_longterm(umem->address, umem->npgs,
 			      gup_flags, &umem->pgs[0], NULL);
-	up_write(&current->mm->mmap_sem);
+	up_read(&current->mm->mmap_sem);
 
 	if (npgs != umem->npgs) {
 		if (npgs >= 0) {
-- 
2.16.4


^ permalink raw reply related

* [PATCH net-next 2/4] net: phy: Move of_set_phy_eee_broken to phy-core.c
From: Maxime Chevallier @ 2019-02-11 14:25 UTC (permalink / raw)
  To: davem
  Cc: Maxime Chevallier, netdev, linux-kernel, Andrew Lunn,
	Florian Fainelli, Heiner Kallweit, Russell King, linux-arm-kernel,
	Antoine Tenart, thomas.petazzoni, gregory.clement, miquel.raynal,
	nadavh, stefanc, mw
In-Reply-To: <20190211142529.22885-1-maxime.chevallier@bootlin.com>

Since of_set_phy_supported was moved to phy-core.c, we can also move
of_set_phy_eee_broken to the same location, so that we have all OF
functions in the same place.

This patch doesn't intend to introduce any change in behaviour.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/phy/phy-core.c   | 27 +++++++++++++++++++++++++++
 drivers/net/phy/phy_device.c | 28 ----------------------------
 include/linux/phy.h          |  1 +
 3 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/drivers/net/phy/phy-core.c b/drivers/net/phy/phy-core.c
index 855abf487279..de58a59815d5 100644
--- a/drivers/net/phy/phy-core.c
+++ b/drivers/net/phy/phy-core.c
@@ -383,6 +383,33 @@ void of_set_phy_supported(struct phy_device *phydev)
 		__set_phy_supported(phydev, max_speed);
 }
 
+void of_set_phy_eee_broken(struct phy_device *phydev)
+{
+	struct device_node *node = phydev->mdio.dev.of_node;
+	u32 broken = 0;
+
+	if (!IS_ENABLED(CONFIG_OF_MDIO))
+		return;
+
+	if (!node)
+		return;
+
+	if (of_property_read_bool(node, "eee-broken-100tx"))
+		broken |= MDIO_EEE_100TX;
+	if (of_property_read_bool(node, "eee-broken-1000t"))
+		broken |= MDIO_EEE_1000T;
+	if (of_property_read_bool(node, "eee-broken-10gt"))
+		broken |= MDIO_EEE_10GT;
+	if (of_property_read_bool(node, "eee-broken-1000kx"))
+		broken |= MDIO_EEE_1000KX;
+	if (of_property_read_bool(node, "eee-broken-10gkx4"))
+		broken |= MDIO_EEE_10GKX4;
+	if (of_property_read_bool(node, "eee-broken-10gkr"))
+		broken |= MDIO_EEE_10GKR;
+
+	phydev->eee_broken_modes = broken;
+}
+
 /**
  * phy_resolve_aneg_linkmode - resolve the advertisements into phy settings
  * @phydev: The phy_device struct
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 65e18293d8e5..626f0527f36a 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -30,7 +30,6 @@
 #include <linux/mdio.h>
 #include <linux/io.h>
 #include <linux/uaccess.h>
-#include <linux/of.h>
 
 MODULE_DESCRIPTION("PHY library");
 MODULE_AUTHOR("Andy Fleming");
@@ -2094,33 +2093,6 @@ bool phy_validate_pause(struct phy_device *phydev,
 }
 EXPORT_SYMBOL(phy_validate_pause);
 
-static void of_set_phy_eee_broken(struct phy_device *phydev)
-{
-	struct device_node *node = phydev->mdio.dev.of_node;
-	u32 broken = 0;
-
-	if (!IS_ENABLED(CONFIG_OF_MDIO))
-		return;
-
-	if (!node)
-		return;
-
-	if (of_property_read_bool(node, "eee-broken-100tx"))
-		broken |= MDIO_EEE_100TX;
-	if (of_property_read_bool(node, "eee-broken-1000t"))
-		broken |= MDIO_EEE_1000T;
-	if (of_property_read_bool(node, "eee-broken-10gt"))
-		broken |= MDIO_EEE_10GT;
-	if (of_property_read_bool(node, "eee-broken-1000kx"))
-		broken |= MDIO_EEE_1000KX;
-	if (of_property_read_bool(node, "eee-broken-10gkx4"))
-		broken |= MDIO_EEE_10GKX4;
-	if (of_property_read_bool(node, "eee-broken-10gkr"))
-		broken |= MDIO_EEE_10GKR;
-
-	phydev->eee_broken_modes = broken;
-}
-
 static bool phy_drv_supports_irq(struct phy_driver *phydrv)
 {
 	return phydrv->config_intr && phydrv->ack_interrupt;
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 20344c7744d8..1a1d93a2a906 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -674,6 +674,7 @@ phy_lookup_setting(int speed, int duplex, const unsigned long *mask,
 size_t phy_speeds(unsigned int *speeds, size_t size,
 		  unsigned long *mask);
 void of_set_phy_supported(struct phy_device *phydev);
+void of_set_phy_eee_broken(struct phy_device *phydev);
 
 static inline bool __phy_is_started(struct phy_device *phydev)
 {
-- 
2.19.2


^ permalink raw reply related

* [PATCH net-next 1/4] net: phy: Mask-out non-compatible modes when setting the max-speed
From: Maxime Chevallier @ 2019-02-11 14:25 UTC (permalink / raw)
  To: davem
  Cc: Maxime Chevallier, netdev, linux-kernel, Andrew Lunn,
	Florian Fainelli, Heiner Kallweit, Russell King, linux-arm-kernel,
	Antoine Tenart, thomas.petazzoni, gregory.clement, miquel.raynal,
	nadavh, stefanc, mw
In-Reply-To: <20190211142529.22885-1-maxime.chevallier@bootlin.com>

When setting a PHY's max speed using either the max-speed DT property
or ethtool, we should mask-out all non-compatible modes according to the
settings table, instead of just the 10/100BASET modes.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Suggested-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/phy/phy-core.c   | 45 ++++++++++++++++++++++++++++++
 drivers/net/phy/phy_device.c | 53 ------------------------------------
 include/linux/phy.h          |  1 +
 3 files changed, 46 insertions(+), 53 deletions(-)

diff --git a/drivers/net/phy/phy-core.c b/drivers/net/phy/phy-core.c
index cdea028d1328..855abf487279 100644
--- a/drivers/net/phy/phy-core.c
+++ b/drivers/net/phy/phy-core.c
@@ -4,6 +4,7 @@
  */
 #include <linux/export.h>
 #include <linux/phy.h>
+#include <linux/of.h>
 
 const char *phy_speed_to_str(int speed)
 {
@@ -338,6 +339,50 @@ size_t phy_speeds(unsigned int *speeds, size_t size,
 	return count;
 }
 
+static int __set_phy_supported(struct phy_device *phydev, u32 max_speed)
+{
+	const struct phy_setting *p;
+	int i;
+
+	for (i = 0, p = settings; i < ARRAY_SIZE(settings); i++, p++) {
+		if (p->speed > max_speed)
+			linkmode_clear_bit(p->bit, phydev->supported);
+		else
+			break;
+	}
+
+	return 0;
+}
+
+int phy_set_max_speed(struct phy_device *phydev, u32 max_speed)
+{
+	int err;
+
+	err = __set_phy_supported(phydev, max_speed);
+	if (err)
+		return err;
+
+	linkmode_copy(phydev->advertising, phydev->supported);
+
+	return 0;
+}
+EXPORT_SYMBOL(phy_set_max_speed);
+
+void of_set_phy_supported(struct phy_device *phydev)
+{
+	struct device_node *node = phydev->mdio.dev.of_node;
+	u32 max_speed;
+
+	if (!IS_ENABLED(CONFIG_OF_MDIO))
+		return;
+
+	if (!node)
+		return;
+
+	if (!of_property_read_u32(node, "max-speed", &max_speed))
+		__set_phy_supported(phydev, max_speed);
+}
+
 /**
  * phy_resolve_aneg_linkmode - resolve the advertisements into phy settings
  * @phydev: The phy_device struct
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 3d14e48aebc5..65e18293d8e5 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1964,44 +1964,6 @@ int genphy_loopback(struct phy_device *phydev, bool enable)
 }
 EXPORT_SYMBOL(genphy_loopback);
 
-static int __set_phy_supported(struct phy_device *phydev, u32 max_speed)
-{
-	switch (max_speed) {
-	case SPEED_10:
-		linkmode_clear_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT,
-				   phydev->supported);
-		linkmode_clear_bit(ETHTOOL_LINK_MODE_100baseT_Full_BIT,
-				   phydev->supported);
-		/* fall through */
-	case SPEED_100:
-		linkmode_clear_bit(ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
-				   phydev->supported);
-		linkmode_clear_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
-				   phydev->supported);
-		break;
-	case SPEED_1000:
-		break;
-	default:
-		return -ENOTSUPP;
-	}
-
-	return 0;
-}
-
-int phy_set_max_speed(struct phy_device *phydev, u32 max_speed)
-{
-	int err;
-
-	err = __set_phy_supported(phydev, max_speed);
-	if (err)
-		return err;
-
-	linkmode_copy(phydev->advertising, phydev->supported);
-
-	return 0;
-}
-EXPORT_SYMBOL(phy_set_max_speed);
-
 /**
  * phy_remove_link_mode - Remove a supported link mode
  * @phydev: phy_device structure to remove link mode from
@@ -2132,21 +2094,6 @@ bool phy_validate_pause(struct phy_device *phydev,
 }
 EXPORT_SYMBOL(phy_validate_pause);
 
-static void of_set_phy_supported(struct phy_device *phydev)
-{
-	struct device_node *node = phydev->mdio.dev.of_node;
-	u32 max_speed;
-
-	if (!IS_ENABLED(CONFIG_OF_MDIO))
-		return;
-
-	if (!node)
-		return;
-
-	if (!of_property_read_u32(node, "max-speed", &max_speed))
-		__set_phy_supported(phydev, max_speed);
-}
-
 static void of_set_phy_eee_broken(struct phy_device *phydev)
 {
 	struct device_node *node = phydev->mdio.dev.of_node;
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 378da9a6165e..20344c7744d8 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -673,6 +673,7 @@ phy_lookup_setting(int speed, int duplex, const unsigned long *mask,
 		   bool exact);
 size_t phy_speeds(unsigned int *speeds, size_t size,
 		  unsigned long *mask);
+void of_set_phy_supported(struct phy_device *phydev);
 
 static inline bool __phy_is_started(struct phy_device *phydev)
 {
-- 
2.19.2


^ permalink raw reply related

* Re: [PATCH net-next v4 01/17] net: sched: protect block state with mutex
From: Jiri Pirko @ 2019-02-11 14:15 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, davem, ast, daniel
In-Reply-To: <20190211085548.7190-2-vladbu@mellanox.com>

Mon, Feb 11, 2019 at 09:55:32AM CET, vladbu@mellanox.com wrote:
>Currently, tcf_block doesn't use any synchronization mechanisms to protect
>critical sections that manage lifetime of its chains. block->chain_list and
>multiple variables in tcf_chain that control its lifetime assume external
>synchronization provided by global rtnl lock. Converting chain reference
>counting to atomic reference counters is not possible because cls API uses
>multiple counters and flags to control chain lifetime, so all of them must
>be synchronized in chain get/put code.
>
>Use single per-block lock to protect block data and manage lifetime of all
>chains on the block. Always take block->lock when accessing chain_list.
>Chain get and put modify chain lifetime-management data and parent block's
>chain_list, so take the lock in these functions. Verify block->lock state
>with assertions in functions that expect to be called with the lock taken
>and are called from multiple places. Take block->lock when accessing
>filter_chain_list.
>
>In order to allow parallel update of rules on single block, move all calls
>to classifiers outside of critical sections protected by new block->lock.
>Rearrange chain get and put functions code to only access protected chain
>data while holding block lock:
>- Rearrange code to only access chain reference counter and chain action
>  reference counter while holding block lock.
>- Extract code that requires block->lock from tcf_chain_destroy() into
>  standalone tcf_chain_destroy() function that is called by
>  __tcf_chain_put() in same critical section that changes chain reference
>  counters.
>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH net-next v4 02/17] net: sched: protect chain->explicitly_created with block->lock
From: Jiri Pirko @ 2019-02-11 14:15 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, davem, ast, daniel
In-Reply-To: <20190211085548.7190-3-vladbu@mellanox.com>

Mon, Feb 11, 2019 at 09:55:33AM CET, vladbu@mellanox.com wrote:
>In order to remove dependency on rtnl lock, protect
>tcf_chain->explicitly_created flag with block->lock. Consolidate code that
>checks and resets 'explicitly_created' flag into __tcf_chain_put() to
>execute it atomically with rest of code that puts chain reference.
>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* [PATCH net-next 0/4] net: phy: Add 2.5G/5GBASET PHYs support
From: Maxime Chevallier @ 2019-02-11 14:25 UTC (permalink / raw)
  To: davem
  Cc: Maxime Chevallier, netdev, linux-kernel, Andrew Lunn,
	Florian Fainelli, Heiner Kallweit, Russell King, linux-arm-kernel,
	Antoine Tenart, thomas.petazzoni, gregory.clement, miquel.raynal,
	nadavh, stefanc, mw

The 802.3bz standard defines 2 modes based on the NBASET alliance work
that allow to use 2.5Gbps and 5Gbps speeds on Cat 5e, 6 and 7 cables.

This series adds the necessary infrastructure to handle these modes with
C45 PHYs. This series was originally part of a bigger one, that has
seen 2 iterations [1] [2] that added support for these modes on Marvell
Alaska PHYs.

Following some discussions with Heiner and Andrew [3], we decided to
split-out the generic parts so that we can work together on the
following steps to get these mode fully working with Aquantia and
Marvell PHYS.

The first 3 patches are reworking some of the internal network phy
infrastructure to handle the new modes in a more generic way.

The 4th patch adds all the C45 register definition and accesses that
follows the 802.3bz standard to support 2.5GBASET and 5GBASET.

[1] : https://lore.kernel.org/netdev/20190118152352.26417-1-maxime.chevallier@bootlin.com/
[2] : https://lore.kernel.org/netdev/20190207094939.27369-1-maxime.chevallier@bootlin.com/
[3] : https://lore.kernel.org/netdev/81c340ea-54b0-1abf-94af-b8dc4ee83e3a@gmail.com/

Maxime Chevallier (4):
  net: phy: Mask-out non-compatible modes when setting the max-speed
  net: phy: Move of_set_phy_eee_broken to phy-core.c
  net: phy: Extract genphy_c45_pma_read_abilities from marvell10g
  net: phy: Add generic support for 2.5GBaseT and 5GBaseT

 drivers/net/phy/marvell10g.c |  78 +++---------------------
 drivers/net/phy/phy-c45.c    | 111 +++++++++++++++++++++++++++++++++++
 drivers/net/phy/phy-core.c   |  72 +++++++++++++++++++++++
 drivers/net/phy/phy_device.c |  81 -------------------------
 include/linux/phy.h          |   3 +
 include/uapi/linux/mdio.h    |  16 +++++
 6 files changed, 210 insertions(+), 151 deletions(-)

-- 
2.19.2

^ permalink raw reply

* [PATCH net-next 3/4] net: phy: Extract genphy_c45_pma_read_abilities from marvell10g
From: Maxime Chevallier @ 2019-02-11 14:25 UTC (permalink / raw)
  To: davem
  Cc: Maxime Chevallier, netdev, linux-kernel, Andrew Lunn,
	Florian Fainelli, Heiner Kallweit, Russell King, linux-arm-kernel,
	Antoine Tenart, thomas.petazzoni, gregory.clement, miquel.raynal,
	nadavh, stefanc, mw
In-Reply-To: <20190211142529.22885-1-maxime.chevallier@bootlin.com>

Marvell 10G PHY driver has a generic way of initializing the supported
link modes by reading the PHY's C45 PMA abilities. This can be made
generic, since these registers are part of the 802.3 specifications.

This commit extracts the config_init link_mode initialization code from
marvell10g and uses it to introduce the genphy_c45_pma_read_abilities
function.

Only PMA modes are read, it's still up to the caller to set the Pause
parameters.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
---
 drivers/net/phy/marvell10g.c | 78 ++++--------------------------------
 drivers/net/phy/phy-c45.c    | 74 ++++++++++++++++++++++++++++++++++
 include/linux/phy.h          |  1 +
 3 files changed, 83 insertions(+), 70 deletions(-)

diff --git a/drivers/net/phy/marvell10g.c b/drivers/net/phy/marvell10g.c
index 08362dc657cd..496805c0ddfe 100644
--- a/drivers/net/phy/marvell10g.c
+++ b/drivers/net/phy/marvell10g.c
@@ -233,8 +233,7 @@ static int mv3310_resume(struct phy_device *phydev)
 
 static int mv3310_config_init(struct phy_device *phydev)
 {
-	__ETHTOOL_DECLARE_LINK_MODE_MASK(supported) = { 0, };
-	int val;
+	int ret, val;
 
 	/* Check that the PHY interface type is compatible */
 	if (phydev->interface != PHY_INTERFACE_MODE_SGMII &&
@@ -243,8 +242,8 @@ static int mv3310_config_init(struct phy_device *phydev)
 	    phydev->interface != PHY_INTERFACE_MODE_10GKR)
 		return -ENODEV;
 
-	__set_bit(ETHTOOL_LINK_MODE_Pause_BIT, supported);
-	__set_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT, supported);
+	__set_bit(ETHTOOL_LINK_MODE_Pause_BIT, phydev->supported);
+	__set_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT, phydev->supported);
 
 	if (phydev->c45_ids.devices_in_package & MDIO_DEVS_AN) {
 		val = phy_read_mmd(phydev, MDIO_MMD_AN, MDIO_STAT1);
@@ -252,74 +251,13 @@ static int mv3310_config_init(struct phy_device *phydev)
 			return val;
 
 		if (val & MDIO_AN_STAT1_ABLE)
-			__set_bit(ETHTOOL_LINK_MODE_Autoneg_BIT, supported);
+			__set_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
+				  phydev->supported);
 	}
 
-	val = phy_read_mmd(phydev, MDIO_MMD_PMAPMD, MDIO_STAT2);
-	if (val < 0)
-		return val;
-
-	/* Ethtool does not support the WAN mode bits */
-	if (val & (MDIO_PMA_STAT2_10GBSR | MDIO_PMA_STAT2_10GBLR |
-		   MDIO_PMA_STAT2_10GBER | MDIO_PMA_STAT2_10GBLX4 |
-		   MDIO_PMA_STAT2_10GBSW | MDIO_PMA_STAT2_10GBLW |
-		   MDIO_PMA_STAT2_10GBEW))
-		__set_bit(ETHTOOL_LINK_MODE_FIBRE_BIT, supported);
-	if (val & MDIO_PMA_STAT2_10GBSR)
-		__set_bit(ETHTOOL_LINK_MODE_10000baseSR_Full_BIT, supported);
-	if (val & MDIO_PMA_STAT2_10GBLR)
-		__set_bit(ETHTOOL_LINK_MODE_10000baseLR_Full_BIT, supported);
-	if (val & MDIO_PMA_STAT2_10GBER)
-		__set_bit(ETHTOOL_LINK_MODE_10000baseER_Full_BIT, supported);
-
-	if (val & MDIO_PMA_STAT2_EXTABLE) {
-		val = phy_read_mmd(phydev, MDIO_MMD_PMAPMD, MDIO_PMA_EXTABLE);
-		if (val < 0)
-			return val;
-
-		if (val & (MDIO_PMA_EXTABLE_10GBT | MDIO_PMA_EXTABLE_1000BT |
-			   MDIO_PMA_EXTABLE_100BTX | MDIO_PMA_EXTABLE_10BT))
-			__set_bit(ETHTOOL_LINK_MODE_TP_BIT, supported);
-		if (val & MDIO_PMA_EXTABLE_10GBLRM)
-			__set_bit(ETHTOOL_LINK_MODE_FIBRE_BIT, supported);
-		if (val & (MDIO_PMA_EXTABLE_10GBKX4 | MDIO_PMA_EXTABLE_10GBKR |
-			   MDIO_PMA_EXTABLE_1000BKX))
-			__set_bit(ETHTOOL_LINK_MODE_Backplane_BIT, supported);
-		if (val & MDIO_PMA_EXTABLE_10GBLRM)
-			__set_bit(ETHTOOL_LINK_MODE_10000baseLRM_Full_BIT,
-				  supported);
-		if (val & MDIO_PMA_EXTABLE_10GBT)
-			__set_bit(ETHTOOL_LINK_MODE_10000baseT_Full_BIT,
-				  supported);
-		if (val & MDIO_PMA_EXTABLE_10GBKX4)
-			__set_bit(ETHTOOL_LINK_MODE_10000baseKX4_Full_BIT,
-				  supported);
-		if (val & MDIO_PMA_EXTABLE_10GBKR)
-			__set_bit(ETHTOOL_LINK_MODE_10000baseKR_Full_BIT,
-				  supported);
-		if (val & MDIO_PMA_EXTABLE_1000BT)
-			__set_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
-				  supported);
-		if (val & MDIO_PMA_EXTABLE_1000BKX)
-			__set_bit(ETHTOOL_LINK_MODE_1000baseKX_Full_BIT,
-				  supported);
-		if (val & MDIO_PMA_EXTABLE_100BTX) {
-			__set_bit(ETHTOOL_LINK_MODE_100baseT_Full_BIT,
-				  supported);
-			__set_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT,
-				  supported);
-		}
-		if (val & MDIO_PMA_EXTABLE_10BT) {
-			__set_bit(ETHTOOL_LINK_MODE_10baseT_Full_BIT,
-				  supported);
-			__set_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT,
-				  supported);
-		}
-	}
-
-	linkmode_copy(phydev->supported, supported);
-	linkmode_and(phydev->advertising, phydev->advertising,
-		     phydev->supported);
+	ret = genphy_c45_pma_read_abilities(phydev);
+	if (ret)
+		return ret;
 
 	return 0;
 }
diff --git a/drivers/net/phy/phy-c45.c b/drivers/net/phy/phy-c45.c
index eff9e5a4d831..6f028de4dae1 100644
--- a/drivers/net/phy/phy-c45.c
+++ b/drivers/net/phy/phy-c45.c
@@ -271,6 +271,80 @@ int genphy_c45_read_mdix(struct phy_device *phydev)
 }
 EXPORT_SYMBOL_GPL(genphy_c45_read_mdix);
 
+/**
+ * genphy_c45_pma_read_abilities - read supported link modes from PMA
+ * @phydev: target phy_device struct
+ *
+ * Read the supported link modes from the PMA Status 2 (1.8) register. If bit
+ * 1.8.9 is set, the list of supported modes is build using the values in the
+ * PMA Extended Abilities (1.11) register, indicating 1000BASET an 10G related
+ * modes. If bit 1.11.14 is set, then the list is also extended with the modes
+ * in the 2.5G/5G PMA Extended register (1.21), indicating if 2.5GBASET and
+ * 5GBASET are supported.
+ */
+int genphy_c45_pma_read_abilities(struct phy_device *phydev)
+{
+	int val;
+
+	val = phy_read_mmd(phydev, MDIO_MMD_PMAPMD, MDIO_STAT2);
+	if (val < 0)
+		return val;
+
+	linkmode_mod_bit(ETHTOOL_LINK_MODE_10000baseSR_Full_BIT,
+			 phydev->supported,
+			 val & MDIO_PMA_STAT2_10GBSR);
+
+	linkmode_mod_bit(ETHTOOL_LINK_MODE_10000baseLR_Full_BIT,
+			 phydev->supported,
+			 val & MDIO_PMA_STAT2_10GBLR);
+
+	linkmode_mod_bit(ETHTOOL_LINK_MODE_10000baseER_Full_BIT,
+			 phydev->supported,
+			 val & MDIO_PMA_STAT2_10GBER);
+
+	if (val & MDIO_PMA_STAT2_EXTABLE) {
+		val = phy_read_mmd(phydev, MDIO_MMD_PMAPMD, MDIO_PMA_EXTABLE);
+		if (val < 0)
+			return val;
+
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_10000baseLRM_Full_BIT,
+				 phydev->supported,
+				 val & MDIO_PMA_EXTABLE_10GBLRM);
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_10000baseT_Full_BIT,
+				 phydev->supported,
+				 val & MDIO_PMA_EXTABLE_10GBT);
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_10000baseKX4_Full_BIT,
+				 phydev->supported,
+				 val & MDIO_PMA_EXTABLE_10GBKX4);
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_10000baseKR_Full_BIT,
+				 phydev->supported,
+				 val & MDIO_PMA_EXTABLE_10GBKR);
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
+				 phydev->supported,
+				 val & MDIO_PMA_EXTABLE_1000BT);
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_1000baseKX_Full_BIT,
+				 phydev->supported,
+				 val & MDIO_PMA_EXTABLE_1000BKX);
+
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_100baseT_Full_BIT,
+				 phydev->supported,
+				 val & MDIO_PMA_EXTABLE_100BTX);
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT,
+				 phydev->supported,
+				 val & MDIO_PMA_EXTABLE_100BTX);
+
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_10baseT_Full_BIT,
+				 phydev->supported,
+				 val & MDIO_PMA_EXTABLE_10BT);
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT,
+				 phydev->supported,
+				 val & MDIO_PMA_EXTABLE_10BT);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(genphy_c45_pma_read_abilities);
+
 /* The gen10g_* functions are the old Clause 45 stub */
 
 int gen10g_config_aneg(struct phy_device *phydev)
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 1a1d93a2a906..177a330d84e5 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -1116,6 +1116,7 @@ int genphy_c45_read_pma(struct phy_device *phydev);
 int genphy_c45_pma_setup_forced(struct phy_device *phydev);
 int genphy_c45_an_disable_aneg(struct phy_device *phydev);
 int genphy_c45_read_mdix(struct phy_device *phydev);
+int genphy_c45_pma_read_abilities(struct phy_device *phydev);
 
 /* The gen10g_* functions are the old Clause 45 stub */
 int gen10g_config_aneg(struct phy_device *phydev);
-- 
2.19.2


^ permalink raw reply related

* [PATCH net-next 4/4] net: phy: Add generic support for 2.5GBaseT and 5GBaseT
From: Maxime Chevallier @ 2019-02-11 14:25 UTC (permalink / raw)
  To: davem
  Cc: Maxime Chevallier, netdev, linux-kernel, Andrew Lunn,
	Florian Fainelli, Heiner Kallweit, Russell King, linux-arm-kernel,
	Antoine Tenart, thomas.petazzoni, gregory.clement, miquel.raynal,
	nadavh, stefanc, mw
In-Reply-To: <20190211142529.22885-1-maxime.chevallier@bootlin.com>

The 802.3bz specification, based on previous by the NBASET alliance,
defines the 2.5GBaseT and 5GBaseT link modes for ethernet traffic on
cat5e, cat6 and cat7 cables.

These mode integrate with the already defined C45 MDIO PMA/PMD registers
set that added 10G support, by defining some previously reserved bits,
and adding a new register (2.5G/5G Extended abilities).

This commit adds the required definitions in include/uapi/linux/mdio.h
to support these modes, and detect when a link-partner advertises them.

It also adds support for these mode in the generic C45 PHY
infrastructure.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
---
 drivers/net/phy/phy-c45.c | 37 +++++++++++++++++++++++++++++++++++++
 include/uapi/linux/mdio.h | 16 ++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/drivers/net/phy/phy-c45.c b/drivers/net/phy/phy-c45.c
index 6f028de4dae1..7af5fa81daf6 100644
--- a/drivers/net/phy/phy-c45.c
+++ b/drivers/net/phy/phy-c45.c
@@ -47,6 +47,16 @@ int genphy_c45_pma_setup_forced(struct phy_device *phydev)
 		/* Assume 1000base-T */
 		ctrl2 |= MDIO_PMA_CTRL2_1000BT;
 		break;
+	case SPEED_2500:
+		ctrl1 |= MDIO_CTRL1_SPEED2_5G;
+		/* Assume 2.5Gbase-T */
+		ctrl2 |= MDIO_PMA_CTRL2_2_5GBT;
+		break;
+	case SPEED_5000:
+		ctrl1 |= MDIO_CTRL1_SPEED5G;
+		/* Assume 5Gbase-T */
+		ctrl2 |= MDIO_PMA_CTRL2_5GBT;
+		break;
 	case SPEED_10000:
 		ctrl1 |= MDIO_CTRL1_SPEED10G;
 		/* Assume 10Gbase-T */
@@ -194,6 +204,12 @@ int genphy_c45_read_lpa(struct phy_device *phydev)
 	if (val < 0)
 		return val;
 
+	if (val & MDIO_AN_10GBT_STAT_LP2_5G)
+		linkmode_set_bit(ETHTOOL_LINK_MODE_2500baseT_Full_BIT,
+				 phydev->lp_advertising);
+	if (val & MDIO_AN_10GBT_STAT_LP5G)
+		linkmode_set_bit(ETHTOOL_LINK_MODE_5000baseT_Full_BIT,
+				 phydev->lp_advertising);
 	if (val & MDIO_AN_10GBT_STAT_LP10G)
 		linkmode_set_bit(ETHTOOL_LINK_MODE_10000baseT_Full_BIT,
 				 phydev->lp_advertising);
@@ -224,6 +240,12 @@ int genphy_c45_read_pma(struct phy_device *phydev)
 	case MDIO_PMA_CTRL1_SPEED1000:
 		phydev->speed = SPEED_1000;
 		break;
+	case MDIO_CTRL1_SPEED2_5G:
+		phydev->speed = SPEED_2500;
+		break;
+	case MDIO_CTRL1_SPEED5G:
+		phydev->speed = SPEED_5000;
+		break;
 	case MDIO_CTRL1_SPEED10G:
 		phydev->speed = SPEED_10000;
 		break;
@@ -339,6 +361,21 @@ int genphy_c45_pma_read_abilities(struct phy_device *phydev)
 		linkmode_mod_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT,
 				 phydev->supported,
 				 val & MDIO_PMA_EXTABLE_10BT);
+
+		if (val & MDIO_PMA_EXTABLE_NBT) {
+			val = phy_read_mmd(phydev, MDIO_MMD_PMAPMD,
+					   MDIO_PMA_NG_EXTABLE);
+			if (val < 0)
+				return val;
+
+			linkmode_mod_bit(ETHTOOL_LINK_MODE_2500baseT_Full_BIT,
+					 phydev->supported,
+					 val & MDIO_PMA_NG_EXTABLE_2_5GBT);
+
+			linkmode_mod_bit(ETHTOOL_LINK_MODE_5000baseT_Full_BIT,
+					 phydev->supported,
+					 val & MDIO_PMA_NG_EXTABLE_5GBT);
+		}
 	}
 
 	return 0;
diff --git a/include/uapi/linux/mdio.h b/include/uapi/linux/mdio.h
index 0e012b168e4d..0a552061ff1c 100644
--- a/include/uapi/linux/mdio.h
+++ b/include/uapi/linux/mdio.h
@@ -45,6 +45,7 @@
 #define MDIO_AN_ADVERTISE	16	/* AN advertising (base page) */
 #define MDIO_AN_LPA		19	/* AN LP abilities (base page) */
 #define MDIO_PCS_EEE_ABLE	20	/* EEE Capability register */
+#define MDIO_PMA_NG_EXTABLE	21	/* 2.5G/5G PMA/PMD extended ability */
 #define MDIO_PCS_EEE_WK_ERR	22	/* EEE wake error counter */
 #define MDIO_PHYXS_LNSTAT	24	/* PHY XGXS lane state */
 #define MDIO_AN_EEE_ADV		60	/* EEE advertisement */
@@ -92,6 +93,10 @@
 #define MDIO_CTRL1_SPEED10G		(MDIO_CTRL1_SPEEDSELEXT | 0x00)
 /* 10PASS-TS/2BASE-TL */
 #define MDIO_CTRL1_SPEED10P2B		(MDIO_CTRL1_SPEEDSELEXT | 0x04)
+/* 2.5 Gb/s */
+#define MDIO_CTRL1_SPEED2_5G		(MDIO_CTRL1_SPEEDSELEXT | 0x18)
+/* 5 Gb/s */
+#define MDIO_CTRL1_SPEED5G		(MDIO_CTRL1_SPEEDSELEXT | 0x1c)
 
 /* Status register 1. */
 #define MDIO_STAT1_LPOWERABLE		0x0002	/* Low-power ability */
@@ -145,6 +150,8 @@
 #define MDIO_PMA_CTRL2_1000BKX		0x000d	/* 1000BASE-KX type */
 #define MDIO_PMA_CTRL2_100BTX		0x000e	/* 100BASE-TX type */
 #define MDIO_PMA_CTRL2_10BT		0x000f	/* 10BASE-T type */
+#define MDIO_PMA_CTRL2_2_5GBT		0x0030  /* 2.5GBaseT type */
+#define MDIO_PMA_CTRL2_5GBT		0x0031  /* 5GBaseT type */
 #define MDIO_PCS_CTRL2_TYPE		0x0003	/* PCS type selection */
 #define MDIO_PCS_CTRL2_10GBR		0x0000	/* 10GBASE-R type */
 #define MDIO_PCS_CTRL2_10GBX		0x0001	/* 10GBASE-X type */
@@ -198,6 +205,7 @@
 #define MDIO_PMA_EXTABLE_1000BKX	0x0040	/* 1000BASE-KX ability */
 #define MDIO_PMA_EXTABLE_100BTX		0x0080	/* 100BASE-TX ability */
 #define MDIO_PMA_EXTABLE_10BT		0x0100	/* 10BASE-T ability */
+#define MDIO_PMA_EXTABLE_NBT		0x4000  /* 2.5/5GBASE-T ability */
 
 /* PHY XGXS lane state register. */
 #define MDIO_PHYXS_LNSTAT_SYNC0		0x0001
@@ -234,9 +242,13 @@
 #define MDIO_PCS_10GBRT_STAT2_BER	0x3f00
 
 /* AN 10GBASE-T control register. */
+#define MDIO_AN_10GBT_CTRL_ADV2_5G	0x0080	/* Advertise 2.5GBASE-T */
+#define MDIO_AN_10GBT_CTRL_ADV5G	0x0100	/* Advertise 5GBASE-T */
 #define MDIO_AN_10GBT_CTRL_ADV10G	0x1000	/* Advertise 10GBASE-T */
 
 /* AN 10GBASE-T status register. */
+#define MDIO_AN_10GBT_STAT_LP2_5G	0x0020  /* LP is 2.5GBT capable */
+#define MDIO_AN_10GBT_STAT_LP5G		0x0040  /* LP is 5GBT capable */
 #define MDIO_AN_10GBT_STAT_LPTRR	0x0200	/* LP training reset req. */
 #define MDIO_AN_10GBT_STAT_LPLTABLE	0x0400	/* LP loop timing ability */
 #define MDIO_AN_10GBT_STAT_LP10G	0x0800	/* LP is 10GBT capable */
@@ -265,6 +277,10 @@
 #define MDIO_EEE_10GKX4		0x0020	/* 10G KX4 EEE cap */
 #define MDIO_EEE_10GKR		0x0040	/* 10G KR EEE cap */
 
+/* 2.5G/5G Extended abilities register. */
+#define MDIO_PMA_NG_EXTABLE_2_5GBT	0x0001	/* 2.5GBASET ability */
+#define MDIO_PMA_NG_EXTABLE_5GBT	0x0002	/* 5GBASET ability */
+
 /* LASI RX_ALARM control/status registers. */
 #define MDIO_PMA_LASI_RX_PHYXSLFLT	0x0001	/* PHY XS RX local fault */
 #define MDIO_PMA_LASI_RX_PCSLFLT	0x0008	/* PCS RX local fault */
-- 
2.19.2


^ permalink raw reply related

* [PATCH net V2] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames
From: Tariq Toukan @ 2019-02-11 16:04 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Eric Dumazet,
	Tariq Toukan

From: Saeed Mahameed <saeedm@mellanox.com>

When an ethernet frame is padded to meet the minimum ethernet frame
size, the padding octets are not covered by the hardware checksum.
Fortunately the padding octets are usually zero's, which don't affect
checksum. However, it is not guaranteed. For example, switches might
choose to make other use of these octets.
This repeatedly causes kernel hardware checksum fault.

Prior to the cited commit below, skb checksum was forced to be
CHECKSUM_NONE when padding is detected. After it, we need to keep
skb->csum updated. However, fixing up CHECKSUM_COMPLETE requires to
verify and parse IP headers, it does not worth the effort as the packets
are so small that CHECKSUM_COMPLETE has no significant advantage.

Future work: when reporting checksum complete is not an option for
IP non-TCP/UDP packets, we can actually fallback to report checksum
unnecessary, by looking at cqe IPOK bit.

Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

Please queue for -stable >= v4.18.

v2: Changed the comment per the mailing list discussion.

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 9a0881cb7f51..6c01314e87b0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -617,6 +617,8 @@ static int get_fixed_ipv6_csum(__wsum hw_checksum, struct sk_buff *skb,
 }
 #endif
 
+#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
+
 /* We reach this function only after checking that any of
  * the (IPv4 | IPv6) bits are set in cqe->status.
  */
@@ -624,9 +626,20 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 		      netdev_features_t dev_features)
 {
 	__wsum hw_checksum = 0;
+	void *hdr;
+
+	/* CQE csum doesn't cover padding octets in short ethernet
+	 * frames. And the pad field is appended prior to calculating
+	 * and appending the FCS field.
+	 *
+	 * Detecting these padded frames requires to verify and parse
+	 * IP headers, so we simply force all those small frames to skip
+	 * checksum complete.
+	 */
+	if (short_frame(skb->len))
+		return -EINVAL;
 
-	void *hdr = (u8 *)va + sizeof(struct ethhdr);
-
+	hdr = (u8 *)va + sizeof(struct ethhdr);
 	hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
 
 	if (cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_CVLAN_PRESENT_MASK) &&
@@ -819,6 +832,11 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		skb_record_rx_queue(skb, cq_ring);
 
 		if (likely(dev->features & NETIF_F_RXCSUM)) {
+			/* TODO: For IP non TCP/UDP packets when csum complete is
+			 * not an option (not supported or any other reason) we can
+			 * actually check cqe IPOK status bit and report
+			 * CHECKSUM_UNNECESSARY rather than CHECKSUM_NONE
+			 */
 			if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
 						       MLX4_CQE_STATUS_UDP)) &&
 			    (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
-- 
1.8.3.1


^ permalink raw reply related

* [net-next PATCH 2/2] net: page_pool: don't use page->private to store dma_addr_t
From: Jesper Dangaard Brouer @ 2019-02-11 16:06 UTC (permalink / raw)
  To: netdev, linux-mm
  Cc: Toke Høiland-Jørgensen, Ilias Apalodimas, willy,
	Saeed Mahameed, Jesper Dangaard Brouer, Andrew Morton, mgorman,
	David S. Miller, Tariq Toukan
In-Reply-To: <154990116432.24530.10541030990995303432.stgit@firesoul>

From: Ilias Apalodimas <ilias.apalodimas@linaro.org>

As pointed out by David Miller the current page_pool implementation
stores dma_addr_t in page->private.
This won't work on 32-bit platforms with 64-bit DMA addresses since the
page->private is an unsigned long and the dma_addr_t a u64.

A previous patch is adding dma_addr_t on struct page to accommodate this.
This patch adapts the page_pool related functions to use the newly added
struct for storing and retrieving DMA addresses from network drivers.

Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/page_pool.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 43a932cb609b..897a69a1477e 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -136,7 +136,9 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
 		goto skip_dma_map;
 
-	/* Setup DMA mapping: use page->private for DMA-addr
+	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
+	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
+	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
 	 * This mapping is kept for lifetime of page, until leaving pool.
 	 */
 	dma = dma_map_page(pool->p.dev, page, 0,
@@ -146,7 +148,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 		put_page(page);
 		return NULL;
 	}
-	set_page_private(page, dma); /* page->private = dma; */
+	page->dma_addr = dma;
 
 skip_dma_map:
 	/* When page just alloc'ed is should/must have refcnt 1. */
@@ -175,13 +177,16 @@ EXPORT_SYMBOL(page_pool_alloc_pages);
 static void __page_pool_clean_page(struct page_pool *pool,
 				   struct page *page)
 {
+	dma_addr_t dma;
+
 	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
 		return;
 
+	dma = page->dma_addr;
 	/* DMA unmap */
-	dma_unmap_page(pool->p.dev, page_private(page),
+	dma_unmap_page(pool->p.dev, dma,
 		       PAGE_SIZE << pool->p.order, pool->p.dma_dir);
-	set_page_private(page, 0);
+	page->dma_addr = 0;
 }
 
 /* Return a page to the page allocator, cleaning up our state */


^ permalink raw reply related

* [net-next PATCH 1/2] mm: add dma_addr_t to struct page
From: Jesper Dangaard Brouer @ 2019-02-11 16:06 UTC (permalink / raw)
  To: netdev, linux-mm
  Cc: Toke Høiland-Jørgensen, Ilias Apalodimas, willy,
	Saeed Mahameed, Jesper Dangaard Brouer, Andrew Morton, mgorman,
	David S. Miller, Tariq Toukan
In-Reply-To: <154990116432.24530.10541030990995303432.stgit@firesoul>

The page_pool API is using page->private to store DMA addresses.
As pointed out by David Miller we can't use that on 32-bit architectures
with 64-bit DMA

This patch adds a new dma_addr_t struct to allow storing DMA addresses

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
---
 include/linux/mm_types.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c471a2c43fa..3060700752cc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -95,6 +95,14 @@ struct page {
 			 */
 			unsigned long private;
 		};
+		struct {	/* page_pool used by netstack */
+			/**
+			 * @dma_addr: Page_pool need to store DMA-addr, and
+			 * cannot use @private, as DMA-mappings can be 64-bit
+			 * even on 32-bit Architectures.
+			 */
+			dma_addr_t dma_addr; /* Shares area with @lru */
+		};
 		struct {	/* slab, slob and slub */
 			union {
 				struct list_head slab_list;	/* uses lru */


^ permalink raw reply related

* [net-next PATCH 0/2] Fix page_pool API and dma address storage
From: Jesper Dangaard Brouer @ 2019-02-11 16:06 UTC (permalink / raw)
  To: netdev, linux-mm
  Cc: Toke Høiland-Jørgensen, Ilias Apalodimas, willy,
	Saeed Mahameed, Jesper Dangaard Brouer, Andrew Morton, mgorman,
	David S. Miller, Tariq Toukan

As pointed out by David Miller in [1] the current page_pool implementation
stores dma_addr_t in page->private. This won't work on 32-bit platforms with
64-bit DMA addresses since the page->private is an unsigned long and the
dma_addr_t a u64.

Since no driver is yet using the DMA mapping capabilities of the API let's
fix this by storing the information in 'struct page' and use that to store
and retrieve DMA addresses from network drivers.

As long as the addresses returned from dma_map_page() are aligned the first
bit, used by the compound pages code should not be set.

Ilias tested this on Espressobin driver mvneta, for which we have patches
for using the DMA API of page_pool.

[1]: https://lore.kernel.org/netdev/20181207.230655.1261252486319967024.davem@davemloft.net/

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>

---
Question: Who of the maintainers MM or netdev will take these changes?

---

Ilias Apalodimas (1):
      net: page_pool: don't use page->private to store dma_addr_t

Jesper Dangaard Brouer (1):
      mm: add dma_addr_t to struct page

 include/linux/mm_types.h |    8 ++++++++
 net/core/page_pool.c     |   13 +++++++++----
 2 files changed, 17 insertions(+), 4 deletions(-)

--

^ permalink raw reply

* [PATCH 4.20 294/352] lib/test_rhashtable: Make test_insert_dup() allocate its hash table dynamically
From: Greg Kroah-Hartman @ 2019-02-11 14:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Thomas Graf, Herbert Xu, netdev,
	Bart Van Assche, David S. Miller
In-Reply-To: <20190211141846.543045703@linuxfoundation.org>

4.20-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Bart Van Assche <bvanassche@acm.org>

[ Upstream commit fc42a689c4c097859e5bd37b5ea11b60dc426df6 ]

The test_insert_dup() function from lib/test_rhashtable.c passes a
pointer to a stack object to rhltable_init(). Allocate the hash table
dynamically to avoid that the following is reported with object
debugging enabled:

ODEBUG: object (ptrval) is on stack (ptrval), but NOT annotated.
WARNING: CPU: 0 PID: 1 at lib/debugobjects.c:368 __debug_object_init+0x312/0x480
Modules linked in:
EIP: __debug_object_init+0x312/0x480
Call Trace:
 ? debug_object_init+0x1a/0x20
 ? __init_work+0x16/0x30
 ? rhashtable_init+0x1e1/0x460
 ? sched_clock_cpu+0x57/0xe0
 ? rhltable_init+0xb/0x20
 ? test_insert_dup+0x32/0x20f
 ? trace_hardirqs_on+0x38/0xf0
 ? ida_dump+0x10/0x10
 ? jhash+0x130/0x130
 ? my_hashfn+0x30/0x30
 ? test_rht_init+0x6aa/0xab4
 ? ida_dump+0x10/0x10
 ? test_rhltable+0xc5c/0xc5c
 ? do_one_initcall+0x67/0x28e
 ? trace_hardirqs_off+0x22/0xe0
 ? restore_all_kernel+0xf/0x70
 ? trace_hardirqs_on_thunk+0xc/0x10
 ? restore_all_kernel+0xf/0x70
 ? kernel_init_freeable+0x142/0x213
 ? rest_init+0x230/0x230
 ? kernel_init+0x10/0x110
 ? schedule_tail_wrapper+0x9/0xc
 ? ret_from_fork+0x19/0x24

Cc: Thomas Graf <tgraf@suug.ch>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 lib/test_rhashtable.c |   23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -541,38 +541,45 @@ static unsigned int __init print_ht(stru
 static int __init test_insert_dup(struct test_obj_rhl *rhl_test_objects,
 				  int cnt, bool slow)
 {
-	struct rhltable rhlt;
+	struct rhltable *rhlt;
 	unsigned int i, ret;
 	const char *key;
 	int err = 0;
 
-	err = rhltable_init(&rhlt, &test_rht_params_dup);
-	if (WARN_ON(err))
+	rhlt = kmalloc(sizeof(*rhlt), GFP_KERNEL);
+	if (WARN_ON(!rhlt))
+		return -EINVAL;
+
+	err = rhltable_init(rhlt, &test_rht_params_dup);
+	if (WARN_ON(err)) {
+		kfree(rhlt);
 		return err;
+	}
 
 	for (i = 0; i < cnt; i++) {
 		rhl_test_objects[i].value.tid = i;
-		key = rht_obj(&rhlt.ht, &rhl_test_objects[i].list_node.rhead);
+		key = rht_obj(&rhlt->ht, &rhl_test_objects[i].list_node.rhead);
 		key += test_rht_params_dup.key_offset;
 
 		if (slow) {
-			err = PTR_ERR(rhashtable_insert_slow(&rhlt.ht, key,
+			err = PTR_ERR(rhashtable_insert_slow(&rhlt->ht, key,
 							     &rhl_test_objects[i].list_node.rhead));
 			if (err == -EAGAIN)
 				err = 0;
 		} else
-			err = rhltable_insert(&rhlt,
+			err = rhltable_insert(rhlt,
 					      &rhl_test_objects[i].list_node,
 					      test_rht_params_dup);
 		if (WARN(err, "error %d on element %d/%d (%s)\n", err, i, cnt, slow? "slow" : "fast"))
 			goto skip_print;
 	}
 
-	ret = print_ht(&rhlt);
+	ret = print_ht(rhlt);
 	WARN(ret != cnt, "missing rhltable elements (%d != %d, %s)\n", ret, cnt, slow? "slow" : "fast");
 
 skip_print:
-	rhltable_destroy(&rhlt);
+	rhltable_destroy(rhlt);
+	kfree(rhlt);
 
 	return 0;
 }



^ permalink raw reply

* Re: [PATCH net-next 1/1] flow_offload: Fix flow action infrastructure
From: Pablo Neira Ayuso @ 2019-02-11 14:37 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Eli Britstein, netdev, Roi Dayan, Jiri Pirko, Saeed Mahameed
In-Reply-To: <20190211133038.GQ2251@nanopsycho>

On Mon, Feb 11, 2019 at 02:30:38PM +0100, Jiri Pirko wrote:
> Mon, Feb 11, 2019 at 08:52:59AM CET, elibr@mellanox.com wrote:
> >Implementation of macro "flow_action_for_each" introduced in
> >commit e3ab786b42535 ("flow_offload: add flow action infrastructure")
> >and used in commit 738678817573c ("drivers: net: use flow action
> >infrastructure") iterated the first item twice and did not reach the
> >last one. Fix it.
> >
> >Fixes: e3ab786b42535 ("flow_offload: add flow action infrastructure")
> >Fixes: 738678817573c ("drivers: net: use flow action infrastructure")
> >Signed-off-by: Eli Britstein <elibr@mellanox.com>
> >Reviewed-by: Roi Dayan <roid@mellanox.com>
> 
> Acked-by: Jiri Pirko <jiri@mellanox.com>

Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>

^ permalink raw reply

* Re: [PATCH] net: phy: mdio_bus: add missing device_del() in mdiobus_register() error handling
From: Thomas Petazzoni @ 2019-02-11 14:31 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Florian Fainelli, David S . Miller, netdev, linux-kernel,
	Paul Kocialkowski
In-Reply-To: <20190116154439.GA29244@lunn.ch>

Hello Andrew,

On Wed, 16 Jan 2019 16:44:39 +0100
Andrew Lunn <andrew@lunn.ch> wrote:

> > On Wed, 16 Jan 2019 15:48:29 +0100, Andrew Lunn wrote:
> >   
> > > Reviewed-by: Andrew Lunn <andrew@lunn.ch>
> > > 
> > > However, i wounder if it makes sense to add a label before the
> > > existing device_del() at the end of the function, and convert this,
> > > and the case above into a goto? That might scale better, avoiding the
> > > same issue in the future?  
> > 
> > That's another option indeed.
> > 
> > Hmm, now that I looked at it, I think we should use device_unregister()
> > instead. device_unregister() does both device_del() and put_device().  
> 
> Hi Thomas
> 
> device_unregister() does seem symmetrical with device_register() which
> is what we are trying to undo.

Even if DaveM already merged my simple fix, I had a further look at
whether we should be using device_unregister(), and in fact we should
not, but not really for a good reason: because the mdio API is not very
symmetrical.

The typical flow is:

	probe() {
		bus = mdiobus_alloc();
		if (!bus)
			return -ENOMEM;

		ret = mdiobus_register(&bus);
		if (ret) {
			mdiobus_free(bus);

		...
	}

	remove() {
		mdiobus_unregister();
		mdiobus_free();
	}

mdiobus_alloc() only does memory allocation, i.e it has no side effects
on the device model data structures.

mdiobus_register() does a device_register(). If it fails, it only
cleans up with a device_del(), i.e it doesn't do the put_device() that
it should do to fully "undo" its effect.

mdiobus_unregister() does a device_del(), i.e it also doesn't do the
opposite of mdiobus_register(), which should be device_del() +
put_device() (device_unregister() is a shortcut for both).

mdiobus_free() does the put_device()

So:

 * mdiobus_alloc() / mdiobus_free() are not symmetrical in terms of
   their interaction with the device model data structures

 * On error, mdiobus_register() leaves a non-zero reference count to the
   bus->dev structure, which will be freed up by mdiobus_free()

 * mdiobus_unregister() leaves a non-zero reference count to the
   bus->dev structure, which will be freed up by mdiobus_free()

So, if we were to use device_unregister() in the error path of
mdiobus_register() and in mdiobus_unregister(), it would break how
mdiobus_free() works.

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply

* Re: [PATCH v2 bpf] bpf: fix lockdep false positive in stackmap
From: Daniel Borkmann @ 2019-02-11 15:39 UTC (permalink / raw)
  To: Alexei Starovoitov, davem
  Cc: peterz, edumazet, longman, jannh, netdev, kernel-team
In-Reply-To: <20190210205235.392478-1-ast@kernel.org>

On 02/10/2019 09:52 PM, Alexei Starovoitov wrote:
> Lockdep warns about false positive:
> [   11.211460] ------------[ cut here ]------------
> [   11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
> [   11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280
> [   11.213134] Modules linked in:
> [   11.214954] RIP: 0010:lock_release+0x1ad/0x280
> [   11.223508] Call Trace:
> [   11.223705]  <IRQ>
> [   11.223874]  ? __local_bh_enable+0x7a/0x80
> [   11.224199]  up_read+0x1c/0xa0
> [   11.224446]  do_up_read+0x12/0x20
> [   11.224713]  irq_work_run_list+0x43/0x70
> [   11.225030]  irq_work_run+0x26/0x50
> [   11.225310]  smp_irq_work_interrupt+0x57/0x1f0
> [   11.225662]  irq_work_interrupt+0xf/0x20
> 
> since rw_semaphore is released in a different task vs task that locked the sema.
> It is expected behavior.
> Fix the warning with up_read_non_owner() and rwsem_release() annotation.
> 
> Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context")
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Applied, thanks!

^ permalink raw reply

* Re: [RFC, PATCH] net: page_pool: Don't use page->private to store dma_addr_t
From: Tariq Toukan @ 2019-02-11 15:38 UTC (permalink / raw)
  To: Matthew Wilcox, Tariq Toukan
  Cc: Ilias Apalodimas, David Miller, brouer@redhat.com,
	toke@redhat.com, netdev@vger.kernel.org,
	mgorman@techsingularity.net, linux-mm@kvack.org
In-Reply-To: <20190211121208.GB12668@bombadil.infradead.org>



On 2/11/2019 2:12 PM, Matthew Wilcox wrote:
> On Mon, Feb 11, 2019 at 08:53:19AM +0000, Tariq Toukan wrote:
>> It's great to use the struct page to store its dma mapping, but I am
>> worried about extensibility.
>> page_pool is evolving, and it would need several more per-page fields.
>> One of them would be pageref_bias, a planned optimization to reduce the
>> number of the costly atomic pageref operations (and replace existing
>> code in several drivers).
> 
> There's space for five words (20 or 40 bytes on 32/64 bit).
> 

OK so this is good for now, and for the near future.
Fine by me.

Regards,
Tariq

^ permalink raw reply

* [PATCH 4.19 261/313] lib/test_rhashtable: Make test_insert_dup() allocate its hash table dynamically
From: Greg Kroah-Hartman @ 2019-02-11 14:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Thomas Graf, Herbert Xu, netdev,
	Bart Van Assche, David S. Miller
In-Reply-To: <20190211141852.749630980@linuxfoundation.org>

4.19-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Bart Van Assche <bvanassche@acm.org>

[ Upstream commit fc42a689c4c097859e5bd37b5ea11b60dc426df6 ]

The test_insert_dup() function from lib/test_rhashtable.c passes a
pointer to a stack object to rhltable_init(). Allocate the hash table
dynamically to avoid that the following is reported with object
debugging enabled:

ODEBUG: object (ptrval) is on stack (ptrval), but NOT annotated.
WARNING: CPU: 0 PID: 1 at lib/debugobjects.c:368 __debug_object_init+0x312/0x480
Modules linked in:
EIP: __debug_object_init+0x312/0x480
Call Trace:
 ? debug_object_init+0x1a/0x20
 ? __init_work+0x16/0x30
 ? rhashtable_init+0x1e1/0x460
 ? sched_clock_cpu+0x57/0xe0
 ? rhltable_init+0xb/0x20
 ? test_insert_dup+0x32/0x20f
 ? trace_hardirqs_on+0x38/0xf0
 ? ida_dump+0x10/0x10
 ? jhash+0x130/0x130
 ? my_hashfn+0x30/0x30
 ? test_rht_init+0x6aa/0xab4
 ? ida_dump+0x10/0x10
 ? test_rhltable+0xc5c/0xc5c
 ? do_one_initcall+0x67/0x28e
 ? trace_hardirqs_off+0x22/0xe0
 ? restore_all_kernel+0xf/0x70
 ? trace_hardirqs_on_thunk+0xc/0x10
 ? restore_all_kernel+0xf/0x70
 ? kernel_init_freeable+0x142/0x213
 ? rest_init+0x230/0x230
 ? kernel_init+0x10/0x110
 ? schedule_tail_wrapper+0x9/0xc
 ? ret_from_fork+0x19/0x24

Cc: Thomas Graf <tgraf@suug.ch>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 lib/test_rhashtable.c |   23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -541,38 +541,45 @@ static unsigned int __init print_ht(stru
 static int __init test_insert_dup(struct test_obj_rhl *rhl_test_objects,
 				  int cnt, bool slow)
 {
-	struct rhltable rhlt;
+	struct rhltable *rhlt;
 	unsigned int i, ret;
 	const char *key;
 	int err = 0;
 
-	err = rhltable_init(&rhlt, &test_rht_params_dup);
-	if (WARN_ON(err))
+	rhlt = kmalloc(sizeof(*rhlt), GFP_KERNEL);
+	if (WARN_ON(!rhlt))
+		return -EINVAL;
+
+	err = rhltable_init(rhlt, &test_rht_params_dup);
+	if (WARN_ON(err)) {
+		kfree(rhlt);
 		return err;
+	}
 
 	for (i = 0; i < cnt; i++) {
 		rhl_test_objects[i].value.tid = i;
-		key = rht_obj(&rhlt.ht, &rhl_test_objects[i].list_node.rhead);
+		key = rht_obj(&rhlt->ht, &rhl_test_objects[i].list_node.rhead);
 		key += test_rht_params_dup.key_offset;
 
 		if (slow) {
-			err = PTR_ERR(rhashtable_insert_slow(&rhlt.ht, key,
+			err = PTR_ERR(rhashtable_insert_slow(&rhlt->ht, key,
 							     &rhl_test_objects[i].list_node.rhead));
 			if (err == -EAGAIN)
 				err = 0;
 		} else
-			err = rhltable_insert(&rhlt,
+			err = rhltable_insert(rhlt,
 					      &rhl_test_objects[i].list_node,
 					      test_rht_params_dup);
 		if (WARN(err, "error %d on element %d/%d (%s)\n", err, i, cnt, slow? "slow" : "fast"))
 			goto skip_print;
 	}
 
-	ret = print_ht(&rhlt);
+	ret = print_ht(rhlt);
 	WARN(ret != cnt, "missing rhltable elements (%d != %d, %s)\n", ret, cnt, slow? "slow" : "fast");
 
 skip_print:
-	rhltable_destroy(&rhlt);
+	rhltable_destroy(rhlt);
+	kfree(rhlt);
 
 	return 0;
 }



^ permalink raw reply

* Re: [PATCH 1/2] xsk: do not use mmap_sem
From: Daniel Borkmann @ 2019-02-11 15:33 UTC (permalink / raw)
  To: Björn Töpel, Davidlohr Bueso
  Cc: akpm, linux-mm, LKML, David S . Miller, Bjorn Topel,
	Magnus Karlsson, Netdev, Davidlohr Bueso, dan.j.williams
In-Reply-To: <CAJ+HfNg=Wikc_uY9W1QiVCONq3c1GyS44-xbrq-J4gqfth2kwQ@mail.gmail.com>

[ +Dan ]

On 02/07/2019 08:43 AM, Björn Töpel wrote:
> Den tors 7 feb. 2019 kl 06:38 skrev Davidlohr Bueso <dave@stgolabs.net>:
>>
>> Holding mmap_sem exclusively for a gup() is an overkill.
>> Lets replace the call for gup_fast() and let the mm take
>> it if necessary.
>>
>> Cc: David S. Miller <davem@davemloft.net>
>> Cc: Bjorn Topel <bjorn.topel@intel.com>
>> Cc: Magnus Karlsson <magnus.karlsson@intel.com>
>> CC: netdev@vger.kernel.org
>> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
>> ---
>>  net/xdp/xdp_umem.c | 6 ++----
>>  1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
>> index 5ab236c5c9a5..25e1e76654a8 100644
>> --- a/net/xdp/xdp_umem.c
>> +++ b/net/xdp/xdp_umem.c
>> @@ -265,10 +265,8 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
>>         if (!umem->pgs)
>>                 return -ENOMEM;
>>
>> -       down_write(&current->mm->mmap_sem);
>> -       npgs = get_user_pages(umem->address, umem->npgs,
>> -                             gup_flags, &umem->pgs[0], NULL);
>> -       up_write(&current->mm->mmap_sem);
>> +       npgs = get_user_pages_fast(umem->address, umem->npgs,
>> +                                  gup_flags, &umem->pgs[0]);
>>
> 
> Thanks for the patch!
> 
> The lifetime of the pinning is similar to RDMA umem mapping, so isn't
> gup_longterm preferred?

Seems reasonable from reading what gup_longterm seems to do. Davidlohr
or Dan, any thoughts on the above?

Thanks,
Daniel

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox