Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: KMSAN: uninit-value in strnlen
From: Guillaume Nault @ 2018-04-23 15:01 UTC (permalink / raw)
  To: syzbot; +Cc: linux-kernel, mostrows, netdev, syzkaller-bugs
In-Reply-To: <00000000000003f2d5056a7fbf92@google.com>

On Mon, Apr 23, 2018 at 01:23:01AM -0700, syzbot wrote:
> Hello,
> 
> syzbot hit the following crash on https://github.com/google/kmsan.git/master
> commit
> a7f95e9c8a95e9fbb388c3999b61a17667cd3bbe (Sat Apr 21 13:50:22 2018 +0000)
> kmsan: disable assembly checksums
> syzbot dashboard link:
> https://syzkaller.appspot.com/bug?extid=cd06c321e7147d03a65e
> 
> So far this crash happened 5 times on
> https://github.com/google/kmsan.git/master.
> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5785171018121216
> syzkaller reproducer:
> https://syzkaller.appspot.com/x/repro.syz?id=5117671628603392
> Raw console output:
> https://syzkaller.appspot.com/x/log.txt?id=6310764688179200
> Kernel config: https://syzkaller.appspot.com/x/.config?id=328654897048964367
> compiler: clang version 7.0.0 (trunk 329391)
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+cd06c321e7147d03a65e@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
> 
> ==================================================================
> BUG: KMSAN: uninit-value in strnlen+0xc4/0x110 lib/string.c:499
> CPU: 1 PID: 4507 Comm: syzkaller579712 Not tainted 4.16.0+ #85
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
>  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
>  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
>  strnlen+0xc4/0x110 lib/string.c:499
>  dev_name_hash net/core/dev.c:209 [inline]
>  dev_get_by_name_rcu net/core/dev.c:764 [inline]
>  dev_get_by_name+0x6e/0x350 net/core/dev.c:791
>  pppoe_connect+0xcb7/0x2360 drivers/net/ppp/pppoe.c:665
>  SYSC_connect+0x41a/0x510 net/socket.c:1639
>  SyS_connect+0x54/0x80 net/socket.c:1620
>  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> RIP: 0033:0x43fcf9
> RSP: 002b:00007ffca4bd4978 EFLAGS: 00000213 ORIG_RAX: 000000000000002a
> RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fcf9
> RDX: 0000000000000007 RSI: 0000000020000040 RDI: 0000000000000003
> RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
> R10: 00000000004002c8 R11: 0000000000000213 R12: 0000000000401620
> R13: 00000000004016b0 R14: 0000000000000000 R15: 0000000000000000
> 
> Local variable description: ----address@SYSC_connect
> Variable was created at:
>  SYSC_connect+0x6f/0x510 net/socket.c:1622
>  SyS_connect+0x54/0x80 net/socket.c:1620
> ==================================================================
> 
That's a consequence of not validating sockaddr_len. The sockaddr_pppox
structure was incomplete.

#syz dup: KMSAN: uninit-value in pppoe_connect

^ permalink raw reply

* Re: WARNING: suspicious RCU usage in rt6_check_expired
From: David Ahern @ 2018-04-23 15:10 UTC (permalink / raw)
  To: Eric Dumazet, syzbot, davem, kuznet, linux-kernel, netdev,
	syzkaller-bugs
In-Reply-To: <a0eb8d30-0f76-fad2-e9fa-5de92f59b2ff@gmail.com>

On 4/23/18 7:31 AM, Eric Dumazet wrote:
>> stack backtrace:
>> CPU: 1 PID: 25958 Comm: syz-executor7 Not tainted 4.16.0+ #11
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:77 [inline]
>>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>>  lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
>>  rt6_check_expired+0x38b/0x3e0 net/ipv6/route.c:410
>>  ip6_negative_advice+0x67/0xc0 net/ipv6/route.c:2204
>>  dst_negative_advice include/net/sock.h:1786 [inline]
>>  sock_setsockopt+0x138f/0x1fe0 net/core/sock.c:1051
>>  __sys_setsockopt+0x2df/0x390 net/socket.c:1899
>>  SYSC_setsockopt net/socket.c:1914 [inline]
>>  SyS_setsockopt+0x34/0x50 net/socket.c:1911
>>  do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
>>  entry_SYSCALL_64_after_hwframe+0x42/0xb7

...

> 
> Added in commit a68886a691804d3f6d479ebf6825480fbafb6a00
> ("net/ipv6: Make from in rt6_info rcu protected")
> 

I missed the rcu_read_lock for this path. Will send a patch.

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-23 15:10 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Matthew Wilcox, David Miller, Andrew Morton, linux-mm,
	eric.dumazet, edumazet, bhutchings, netdev, linux-kernel, mst,
	jasowang, virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804231006440.22488@file01.intranet.prod.int.rdu2.redhat.com>

On Mon 23-04-18 10:24:02, Mikulas Patocka wrote:
[...]

I am not going to comment on your continuous accusations. We can discuss
patches but general rants make very limited sense.
 
> > > > I already said that we can change it from CONFIG_DEBUG_VM to 
> > > > CONFIG_DEBUG_SG - or to whatever other option you may want, just to make 
> > > > sure that it is enabled in distro debug kernels by default.
> > > 
> > > Yes, and I think that's the right idea.  So send a v2 and ignore the
> > > replies that are clearly relating to an earlier version of the patch.
> > > Not everybody reads every mail in the thread before responding to one they
> > > find interesting.  Yes, ideally, one would, but sometimes one doesn't.
> > 
> > And look here. This is yet another ad-hoc idea. We have many users of
> > kvmalloc which have no relation to SG, yet you are going to control
> > their behavior by CONFIG_DEBUG_SG? No way! (yeah evil again)
> 
> Why aren't you constructive and pick up pick up the CONFIG flag?

Because config doesn't make that much of a sense. You do not want a
permanent vmalloc fallback unconditionally. Debugging option which
changes the behavior completely is not useful IMNHO. Besides that who is
going to enable it?

> > Really, we have a fault injection framework and this sounds like
> > something to hook in there.
> 
> The testing people won't set it up. They install the "kernel-debug" 
> package and run the tests in it.
> 
> If you introduce a hidden option that no one knows about, no one will use 
> it.

then make sure people know about it. Fuzzers already do test fault
injections.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH for-rc] uapi: Fix SPDX tags for files referring to the 'OpenIB.org' license
From: Doug Ledford @ 2018-04-23 15:11 UTC (permalink / raw)
  To: David Miller, jgg
  Cc: linux-rdma, kstewart, pombredanne, gregkh, tglx, swinslow,
	santosh.shilimkar, netdev, linux-kernel, davejwatson
In-Reply-To: <20180422.210823.1454088204642743739.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 942 bytes --]

On Sun, 2018-04-22 at 21:08 -0400, David Miller wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> Date: Fri, 20 Apr 2018 09:49:10 -0600
> 
> > Based on discussion with Kate Stewart this license is not a
> > BSD-2-Clause, but is now formally identified as Linux-OpenIB
> > by SPDX.
> > 
> > The key difference between the licenses is in the 'warranty'
> > paragraph.
> > 
> > if_infiniband.h refers to the 'OpenIB.org' license, but
> > does not include the text, instead it links to an obsolete
> > web site that contains a license that matches the BSD-2-Clause
> > SPX. There is no 'three clause' version of the OpenIB.org
> > license.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> 
> Acked-by: David S. Miller <davem@davemloft.net>
> 

Thanks, applied.

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michal Hocko @ 2018-04-23 15:15 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Matthew Wilcox, David Miller, Andrew Morton, linux-mm,
	eric.dumazet, edumazet, netdev, linux-kernel, mst, jasowang,
	virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804221733520.7995@file01.intranet.prod.int.rdu2.redhat.com>

On Mon 23-04-18 10:06:08, Mikulas Patocka wrote:
> 
> 
> On Sat, 21 Apr 2018, Matthew Wilcox wrote:
> 
> > On Fri, Apr 20, 2018 at 05:21:26PM -0400, Mikulas Patocka wrote:
> > > On Fri, 20 Apr 2018, Matthew Wilcox wrote:
> > > > On Fri, Apr 20, 2018 at 04:54:53PM -0400, Mikulas Patocka wrote:
> > > > > On Fri, 20 Apr 2018, Michal Hocko wrote:
> > > > > > No way. This is just wrong! First of all, you will explode most likely
> > > > > > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > > > > > enabled quite often.
> > > > > 
> > > > > You're an evil person who doesn't want to fix bugs.
> > > > 
> > > > Steady on.  There's no need for that.  Michal isn't evil.  Please
> > > > apologise.
> > > 
> > > I see this attitude from Michal again and again.
> > 
> > Fine; then *say that*.  I also see Michal saying "No" a lot.  Sometimes
> > I agree with him, sometimes I don't.  I think he genuinely wants the best
> > code in the kernel, and saying "No" is part of it.
> > 
> > > He didn't want to fix vmalloc(GFP_NOIO)
> > 
> > I don't remember that conversation, so I don't know whether I agree with
> > his reasoning or not.  But we are supposed to be moving away from GFP_NOIO
> > towards marking regions with memalloc_noio_save() / restore.  If you do
> > that, you won't need vmalloc(GFP_NOIO).
> 
> He said the same thing a year ago. And there was small progress. 6 out of 
> 27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in 
> infiniband and 1 in btrfs. (the whole discussion is here 
> http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html )

Well this is not that easy. It requires a cooperation from maintainers.
I can only do as much. I've posted patches in the past and actively
bringing up this topic at LSFMM last two years...

> He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4 
> years, the kernel will be refactored and GFP_NOIO will be eliminated. Why 
> does he have veto over this part of the code? I'd much rather argue with 
> people who have constructive comments about fixing bugs than with him.

I didn't NACK the patch AFAIR. I've said it is not a good idea longterm.
I would be much more willing to change my mind if you would back your
patch by a real bug report. Hacks are acceptable when we have a real
issue in hands. But if we want to fix potential issue then better make
it properly.

[...]

> I sent the CONFIG_DEBUG_SG patch before (I wonder why he didn't repond to 
> it). I'll send a third version of the patch that actually randomly chooses 
> between kmalloc and vmalloc, because some abuses can only be detected with 
> kmalloc and we should test both.
> 
> For bisecting, it is better to always fallback to vmalloc, but for general 
> testing, it is better to test both branches.

Agreed!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Fw: [Bug 199469] New: Regression in 32-bit-compat dev_ioctl due to bf4405737f9f85a06db2b0ce5d76a818b61992e2
From: Stephen Hemminger @ 2018-04-23 15:19 UTC (permalink / raw)
  To: netdev

I think this has already been addressed. But forwarding for sure.

Begin forwarded message:

Date: Mon, 23 Apr 2018 01:26:46 +0000
From: bugzilla-daemon@bugzilla.kernel.org
To: stephen@networkplumber.org
Subject: [Bug 199469] New: Regression in 32-bit-compat dev_ioctl due to bf4405737f9f85a06db2b0ce5d76a818b61992e2


https://bugzilla.kernel.org/show_bug.cgi?id=199469

            Bug ID: 199469
           Summary: Regression in 32-bit-compat dev_ioctl due to
                    bf4405737f9f85a06db2b0ce5d76a818b61992e2
           Product: Networking
           Version: 2.5
    Kernel Version: 4.16.0-0.rc4.git0.1.fc28.x86_64
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
          Assignee: stephen@networkplumber.org
          Reporter: robert@ocallahan.org
        Regression: No

Created attachment 275501
  --> https://bugzilla.kernel.org/attachment.cgi?id=275501&action=edit  
Test program

The commit says

    Once upon a time net/socket.c:dev_ifsioc() used to handle SIOCSHWTSTAMP and
    SIOCSIFMAP.  These have different native and compat layout, so the format
    conversion had been needed.  In 2009 these two cases had been taken out,
    turning the rest into a convoluted way to calling sock_do_ioctl().  We copy
    compat structure into native one, call sock_do_ioctl() on that and copy
    the result back for the in/out ioctls.  No layout transformation anywhere,
    so we might as well just call sock_do_ioctl() and skip all the headache
with
    copying.

However there is one problem: 32-bit 'struct ifreq' and 64-bit 'struct ifreq'
are not the same size. The former is 32 bytes and the latter is 40 bytes. Thus,
if you place a 32-bit 'struct ifreq' immediately before an unmapped page and
try to pass it to one of these ioctls, the syscall fails with EFAULT due to
this commit.

Steps to reproduce:
Copy attached file to /tmp/test.c, then:
[roc@localhost-live ~]$ gcc -o /tmp/test /tmp/test.c && /tmp/test
Index: 1
[roc@localhost-live ~]$ gcc -m32 -o /tmp/test /tmp/test.c && /tmp/test
Failed SIOCGIFINDEX: Bad address

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply

* Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV
From: Don Dutile @ 2018-04-23 15:21 UTC (permalink / raw)
  To: Bjorn Helgaas, Alexander Duyck
  Cc: bhelgaas, linux-pci, virtio-dev, kvm, netdev, dan.daly,
	linux-kernel, linux-nvme, keith.busch, netanel, mheyne,
	liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180421203437.GW28657@bhelgaas-glaptop.roam.corp.google.com>

On 04/21/2018 04:34 PM, Bjorn Helgaas wrote:
> On Fri, Apr 20, 2018 at 12:28:08PM -0400, Alexander Duyck wrote:
>> This series is meant to add support for SR-IOV on devices when the VFs are
>> not managed by the kernel. Examples of recent patches attempting to do this
>> include:
>> virto - https://patchwork.kernel.org/patch/10241225/
>> pci-stub - https://patchwork.kernel.org/patch/10109935/
>> vfio - https://patchwork.kernel.org/patch/10103353/
>> uio - https://patchwork.kernel.org/patch/9974031/
>>
>> Since this is quickly blowing up into a multi-driver problem it is probably
>> best to implement this solution as generically as possible.
>>
>> This series is an attempt to do that. What we do with this patch set is
>> provide a generic framework to enable SR-IOV in the case that the PF driver
>> doesn't support managing the VFs itself.
>>
>> I based my patch set originally on the patch by Mark Rustad but there isn't
>> much left after going through and cleaning out the bits that were no longer
>> needed, and after incorporating the feedback from David Miller. At this point
>> the only items to be fully reused was his patch description which is now
>> present in patch 3 of the set.
>>
>> This solution is limited in scope to just adding support for devices that
>> provide no functionality for SR-IOV other than allocating the VFs by
>> calling pci_enable_sriov. Previous sets had included patches for VFIO, but
>> for now I am dropping that as the scope of that work is larger then I
>> think I can take on at this time.
>>
>> v2: Reduced scope back to just virtio_pci and vfio-pci
>>      Broke into 3 patch set from single patch
>>      Changed autoprobe behavior to always set when num_vfs is set non-zero
>> v3: Updated Documentation to clarify when sriov_unmanaged_autoprobe is used
>>      Wrapped vfio_pci_sriov_configure to fix build errors w/o SR-IOV in kernel
>> v4: Dropped vfio-pci patch
>>      Added ena and nvme to drivers now using pci_sriov_configure_unmanaged
>>      Dropped pci_disable_sriov call in virtio_pci to be consistent with ena
>> v5: Dropped sriov_unmanaged_autoprobe and pci_sriov_conifgure_unmanaged
>>      Added new patch that enables pci_sriov_configure_simple
>>      Updated drivers to use pci_sriov_configure_simple
>> v6: Defined pci_sriov_configure_simple as NULL when SR-IOV is not enabled
>>      Updated drivers to drop "#ifdef" checks for IOV
>>      Added pci-pf-stub as place for PF-only drivers to add support
>> v7: Dropped pci_id table explanation from pci-pf-stub driver
>>      Updated pci_sriov_configure_simple to drop need for err value
>>      Fixed comment explaining why pci_sriov_configure_simple is NULL
>> v8: Dropped virtio from the set, support to be added later after TC approval
>>
>> Cc: Mark Rustad <mark.d.rustad@intel.com>
>> Cc: Maximilian Heyne <mheyne@amazon.de>
>> Cc: Liang-Min Wang <liang-min.wang@intel.com>
>> Cc: David Woodhouse <dwmw@amazon.co.uk>
>>
>> ---
>>
>> Alexander Duyck (4):
>>        pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources
>>        ena: Migrate over to unmanaged SR-IOV support
>>        nvme: Migrate over to unmanaged SR-IOV support
>>        pci-pf-stub: Add PF driver stub for PFs that function only to enable VFs
>>
>>
>>   drivers/net/ethernet/amazon/ena/ena_netdev.c |   28 -------------
>>   drivers/nvme/host/pci.c                      |   20 ----------
>>   drivers/pci/Kconfig                          |   12 ++++++
>>   drivers/pci/Makefile                         |    2 +
>>   drivers/pci/iov.c                            |   31 +++++++++++++++
>>   drivers/pci/pci-pf-stub.c                    |   54 ++++++++++++++++++++++++++
>>   include/linux/pci.h                          |    3 +
>>   include/linux/pci_ids.h                      |    2 +
>>   8 files changed, 106 insertions(+), 46 deletions(-)
>>   create mode 100644 drivers/pci/pci-pf-stub.c
> 
> I tentatively applied these to pci/virtualization-review.
> 
> The code changes look fine, but I want to flesh out the changelogs a
> little bit before merging them.
> 
> For example, I'm not sure what you mean by "devices where the PF is
> not capable of managing VF resources."
> 
I agree w/Bjorn's assessment of the changelog.
The VF's are (minimally) assigned via the pf-stub driver, so they are 'managed by the kernel'.
The security model is the same as the existing one, which was the issue we resolved in the previous set(s) of patches.

I am hoping that something like vfio will be used to deal with the VF ownership
and the reset mechanisms during assignement & de-assignment to 'guests' (qemu-kvm, DPDK, or whatever user-process),
so the known, existing security model(s) is(are) maintained as well.
If so, it'd be good to add such verbage somewhere (as 0/n is not kept in anything but possibly Bjorn's patchwork, or whatever patch mgmt tool he uses, and future reference would be good to have) say, an update to Documentation/PCI/pci-iov-howto.txt.

So... the 'unmanaged SR-IOV' Subject, IMO, is not a valid Subject for the patch series any longer.

No objections to the patch series, as Bjorn noted, just the commit log(s)/nomenclature of what is really being done.
The expectation of VF enablement via the PF was born out of the fairly complicated, and unique PF vs VF drivers of the first implementations, which AlexD knows so well.  This "VFs act just like PFs without SRIOV capabilities" support is what this patch set enables with a much lighter configuration mechanism.
So, maybe the patch set ought to be 'lightweight SRIOV enablement'.

--dd

> It *sounds* like you're saying the hardware works differently on some
> devices, but I don't think that's what you mean.  I think you're
> saying something about which drivers are used for the PF and the VF.
> 
> I think a trivial example of how this will be used might help.  I
> assume this involves a virtualization scenario where the host uses the
> PF to enable several VFs, but the host doesn't use the PF for much
> else.  Then you assign the VFs to guests, and drivers in the guest
> OSes use the VFs.
> 
> Since .sriov_configure() is only used by sriov_numvfs_store(), I
> assume the usage model involves writing to the sysfs sriov_numvfs
> attribute to enable the VFs, then assigning them to guests?
> 
> Bjorn
> 

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Michael S. Tsirkin @ 2018-04-23 15:25 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Matthew Wilcox, David Miller, Andrew Morton, linux-mm,
	eric.dumazet, edumazet, netdev, linux-kernel, jasowang,
	virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804200817230.22382@file01.intranet.prod.int.rdu2.redhat.com>

On Fri, Apr 20, 2018 at 08:20:23AM -0400, Mikulas Patocka wrote:
> 
> 
> On Fri, 20 Apr 2018, Matthew Wilcox wrote:
> 
> > On Thu, Apr 19, 2018 at 12:12:38PM -0400, Mikulas Patocka wrote:
> > > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> > > uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> > > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> > > code.
> > 
> > Maybe it's time to have the SG code handle vmalloced pages?  This is
> > becoming more and more common with vmapped stacks (and some of our
> > workarounds are hideous -- allocate 4 bytes with kmalloc because we can't
> > DMA onto the stack any more?).  We already have a few places which do
> > handle sgs of vmalloced addresses, such as the nx crypto driver:
> > 
> >         if (is_vmalloc_addr(start_addr))
> >                 sg_addr = page_to_phys(vmalloc_to_page(start_addr))
> >                           + offset_in_page(sg_addr);
> >         else
> >                 sg_addr = __pa(sg_addr);
> > 
> > and videobuf:
> > 
> >                 pg = vmalloc_to_page(virt);
> >                 if (NULL == pg)
> >                         goto err;
> >                 BUG_ON(page_to_pfn(pg) >= (1 << (32 - PAGE_SHIFT)));
> >                 sg_set_page(&sglist[i], pg, PAGE_SIZE, 0);
> > 
> > Yes, there's the potential that we have to produce two SG entries for a
> > virtually contiguous region if it crosses a page boundary, and our APIs
> > aren't set up right to make it happen.  But this is something we should
> > consider fixing ... otherwise we'll end up with dozens of driver hacks.
> > The videobuf implementation was already copy-and-pasted into the saa7146
> > driver, for example.
> 
> What if the device requires physically contiguous area and the vmalloc 
> area crosses a page? Will you use a bounce buffer? Where do you allocate 
> the bounce buffer from? What if you run out of bounce buffers?
> 
> Mikulkas

I agree with Matthew here.

4 byte variables are typically size aligned so won't cross a boundary.

That's enough for virtio at least. People using structs can force
alignment.

We could wrap access in a macro (sizeof(x) >= alignof(x)) to help
guarantee that.

-- 
MST

^ permalink raw reply

* Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
From: Ben Greear @ 2018-04-23 15:38 UTC (permalink / raw)
  To: David Miller, johannes-cdvu00un1VgdHxzADdlk8Q
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	ath10k-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
In-Reply-To: <20180422.145420.1197041027922699603.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On 04/22/2018 11:54 AM, David Miller wrote:
> From: Johannes Berg <johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>
> Date: Thu, 19 Apr 2018 17:26:57 +0200
>
>> On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote:
>>>
>>> Maybe this could be in followup patches?  It's going to touch a lot of files,
>>> and might be hell to get merged all at once, and I've never used spatch, so
>>> just maybe someone else will volunteer that part :)
>>
>> I guess you'll have to ask davem. :)
>
> Well, first of all, I really don't like this.
>
> The first reason is that every time I see interface foo become foo2,
> foo3 is never far behind it.
>
> If foo was not extensible enough such that we needed foo2, we beter
> design the new thing with explicitly better extensibility in mind.
>
> Furthermore, what you want here is a specific filter.  Someone else
> will want to filter on another criteria, and the next person will
> want yet another.
>
> This needs to be properly generalized.
>
> And frankly if we had moved to ethtool netlink/devlink by now, we
> could just add a netlink attribute for filtering and not even be
> having this conversation.

Well, since there are un-defined flags, it would be simple enough to
extend the API further in the future (flag (1<<31) could mean expect
more input members, etc.  And, adding up to 30 more flags to filter on different
things won't change the API and should be backwards compatible.

But, if you don't want it, that is OK by me, I agree it is a fairly
obscure feature.  It would have saved me time if you had said you didn't
want it at the first RFC patch though...

Thanks,
Ben

-- 
Ben Greear <greearb-my8/4N5VtI7c+919tysfdA@public.gmane.org>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
From: Ben Greear @ 2018-04-23 15:41 UTC (permalink / raw)
  To: Roopa Prabhu, David Miller; +Cc: netdev, Johannes Berg, linux-wireless, ath10k
In-Reply-To: <CAJieiUh8Pnu5GVSWz04WSZ5L0YVpVBmeY+H_MjhL0k6SWwbcqw@mail.gmail.com>

On 04/22/2018 02:15 PM, Roopa Prabhu wrote:
> On Sun, Apr 22, 2018 at 11:54 AM, David Miller <davem@davemloft.net> wrote:
>> From: Johannes Berg <johannes@sipsolutions.net>
>> Date: Thu, 19 Apr 2018 17:26:57 +0200
>>
>>> On Thu, 2018-04-19 at 08:25 -0700, Ben Greear wrote:
>>>>
>>>> Maybe this could be in followup patches?  It's going to touch a lot of files,
>>>> and might be hell to get merged all at once, and I've never used spatch, so
>>>> just maybe someone else will volunteer that part :)
>>>
>>> I guess you'll have to ask davem. :)
>>
>> Well, first of all, I really don't like this.
>>
>> The first reason is that every time I see interface foo become foo2,
>> foo3 is never far behind it.
>>
>> If foo was not extensible enough such that we needed foo2, we beter
>> design the new thing with explicitly better extensibility in mind.
>>
>> Furthermore, what you want here is a specific filter.  Someone else
>> will want to filter on another criteria, and the next person will
>> want yet another.
>>
>> This needs to be properly generalized.
>>
>> And frankly if we had moved to ethtool netlink/devlink by now, we
>> could just add a netlink attribute for filtering and not even be
>> having this conversation.
>
>
> +1.
>
> Also, the RTM_GETSTATS api was added to improve stats query efficiency
> (with filters).
>  we should look at it  to see if this fits there. Keeping all stats
> queries in one place will help.

I like the ethtool API, so I'll be sticking with that for now.

Thanks,
Ben



-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV
From: Alexander Duyck @ 2018-04-23 15:47 UTC (permalink / raw)
  To: Don Dutile
  Cc: Bjorn Helgaas, Alexander Duyck, Bjorn Helgaas, linux-pci,
	virtio-dev, kvm, Netdev, Daly, Dan, LKML, linux-nvme, Keith Busch,
	netanel, Maximilian Heyne, Wang, Liang-min, Rustad, Mark D,
	David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <84adb7c2-bcf7-5e17-a9d0-482416e2d26c@redhat.com>

On Mon, Apr 23, 2018 at 8:21 AM, Don Dutile <ddutile@redhat.com> wrote:
> On 04/21/2018 04:34 PM, Bjorn Helgaas wrote:
>>
>> On Fri, Apr 20, 2018 at 12:28:08PM -0400, Alexander Duyck wrote:
>>>
>>> This series is meant to add support for SR-IOV on devices when the VFs
>>> are
>>> not managed by the kernel. Examples of recent patches attempting to do
>>> this
>>> include:
>>> virto - https://patchwork.kernel.org/patch/10241225/
>>> pci-stub - https://patchwork.kernel.org/patch/10109935/
>>> vfio - https://patchwork.kernel.org/patch/10103353/
>>> uio - https://patchwork.kernel.org/patch/9974031/
>>>
>>> Since this is quickly blowing up into a multi-driver problem it is
>>> probably
>>> best to implement this solution as generically as possible.
>>>
>>> This series is an attempt to do that. What we do with this patch set is
>>> provide a generic framework to enable SR-IOV in the case that the PF
>>> driver
>>> doesn't support managing the VFs itself.
>>>
>>> I based my patch set originally on the patch by Mark Rustad but there
>>> isn't
>>> much left after going through and cleaning out the bits that were no
>>> longer
>>> needed, and after incorporating the feedback from David Miller. At this
>>> point
>>> the only items to be fully reused was his patch description which is now
>>> present in patch 3 of the set.
>>>
>>> This solution is limited in scope to just adding support for devices that
>>> provide no functionality for SR-IOV other than allocating the VFs by
>>> calling pci_enable_sriov. Previous sets had included patches for VFIO,
>>> but
>>> for now I am dropping that as the scope of that work is larger then I
>>> think I can take on at this time.
>>>
>>> v2: Reduced scope back to just virtio_pci and vfio-pci
>>>      Broke into 3 patch set from single patch
>>>      Changed autoprobe behavior to always set when num_vfs is set
>>> non-zero
>>> v3: Updated Documentation to clarify when sriov_unmanaged_autoprobe is
>>> used
>>>      Wrapped vfio_pci_sriov_configure to fix build errors w/o SR-IOV in
>>> kernel
>>> v4: Dropped vfio-pci patch
>>>      Added ena and nvme to drivers now using
>>> pci_sriov_configure_unmanaged
>>>      Dropped pci_disable_sriov call in virtio_pci to be consistent with
>>> ena
>>> v5: Dropped sriov_unmanaged_autoprobe and pci_sriov_conifgure_unmanaged
>>>      Added new patch that enables pci_sriov_configure_simple
>>>      Updated drivers to use pci_sriov_configure_simple
>>> v6: Defined pci_sriov_configure_simple as NULL when SR-IOV is not enabled
>>>      Updated drivers to drop "#ifdef" checks for IOV
>>>      Added pci-pf-stub as place for PF-only drivers to add support
>>> v7: Dropped pci_id table explanation from pci-pf-stub driver
>>>      Updated pci_sriov_configure_simple to drop need for err value
>>>      Fixed comment explaining why pci_sriov_configure_simple is NULL
>>> v8: Dropped virtio from the set, support to be added later after TC
>>> approval
>>>
>>> Cc: Mark Rustad <mark.d.rustad@intel.com>
>>> Cc: Maximilian Heyne <mheyne@amazon.de>
>>> Cc: Liang-Min Wang <liang-min.wang@intel.com>
>>> Cc: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> ---
>>>
>>> Alexander Duyck (4):
>>>        pci: Add pci_sriov_configure_simple for PFs that don't manage VF
>>> resources
>>>        ena: Migrate over to unmanaged SR-IOV support
>>>        nvme: Migrate over to unmanaged SR-IOV support
>>>        pci-pf-stub: Add PF driver stub for PFs that function only to
>>> enable VFs
>>>
>>>
>>>   drivers/net/ethernet/amazon/ena/ena_netdev.c |   28 -------------
>>>   drivers/nvme/host/pci.c                      |   20 ----------
>>>   drivers/pci/Kconfig                          |   12 ++++++
>>>   drivers/pci/Makefile                         |    2 +
>>>   drivers/pci/iov.c                            |   31 +++++++++++++++
>>>   drivers/pci/pci-pf-stub.c                    |   54
>>> ++++++++++++++++++++++++++
>>>   include/linux/pci.h                          |    3 +
>>>   include/linux/pci_ids.h                      |    2 +
>>>   8 files changed, 106 insertions(+), 46 deletions(-)
>>>   create mode 100644 drivers/pci/pci-pf-stub.c
>>
>>
>> I tentatively applied these to pci/virtualization-review.
>>
>> The code changes look fine, but I want to flesh out the changelogs a
>> little bit before merging them.
>>
>> For example, I'm not sure what you mean by "devices where the PF is
>> not capable of managing VF resources."
>>
> I agree w/Bjorn's assessment of the changelog.
> The VF's are (minimally) assigned via the pf-stub driver, so they are
> 'managed by the kernel'.
> The security model is the same as the existing one, which was the issue we
> resolved in the previous set(s) of patches.
>
> I am hoping that something like vfio will be used to deal with the VF
> ownership
> and the reset mechanisms during assignement & de-assignment to 'guests'
> (qemu-kvm, DPDK, or whatever user-process),
> so the known, existing security model(s) is(are) maintained as well.
> If so, it'd be good to add such verbage somewhere (as 0/n is not kept in
> anything but possibly Bjorn's patchwork, or whatever patch mgmt tool he
> uses, and future reference would be good to have) say, an update to
> Documentation/PCI/pci-iov-howto.txt.
>
> So... the 'unmanaged SR-IOV' Subject, IMO, is not a valid Subject for the
> patch series any longer.
>
> No objections to the patch series, as Bjorn noted, just the commit
> log(s)/nomenclature of what is really being done.
> The expectation of VF enablement via the PF was born out of the fairly
> complicated, and unique PF vs VF drivers of the first implementations, which
> AlexD knows so well.  This "VFs act just like PFs without SRIOV
> capabilities" support is what this patch set enables with a much lighter
> configuration mechanism.
> So, maybe the patch set ought to be 'lightweight SRIOV enablement'.
>
> --dd

I'd be good with this being referred to as "lightweight SRIOV enablement".

The only reason why I was referring to it as "unmanaged" was because I
am used to drivers that use the PF MMIO registers to manage VF
resources and that doesn't exist in this model. Obviously this is all
still managed via the extended PCIe configuration though so there is
still some management taking place by the PCI subsystem in the kernel.

Thanks.

- Alex

^ permalink raw reply

* Re: [PATCH net] ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy
From: David Ahern @ 2018-04-23 15:48 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <20180423012923.121147-1-edumazet@google.com>

On 4/22/18 7:29 PM, Eric Dumazet wrote:
> KMSAN reported use of uninit-value that I tracked to lack
> of proper size check on RTA_TABLE attribute.
> 
> I also believe RTA_PREFSRC lacks a similar check.
> 
> Fixes: 86872cb57925 ("[IPv6] route: FIB6 configuration using struct fib6_config")
> Fixes: c3968a857a6b ("ipv6: RTA_PREFSRC support for ipv6 route source address selection")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>
> ---
>  net/ipv6/route.c | 2 ++
>  1 file changed, 2 insertions(+)
> 

Acked-by: David Ahern <dsahern@gmail.com>

^ permalink raw reply

* Re: [PATCH net] bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave
From: David Miller @ 2018-04-23 15:55 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, andy, jiri, xiyou.wangcong
In-Reply-To: <d1defa141d6daef661da14b42db108a0e913314d.1524395510.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Sun, 22 Apr 2018 19:11:50 +0800

> After Commit 8a8efa22f51b ("bonding: sync netpoll code with bridge"), it
> would set slave_dev npinfo in slave_enable_netpoll when enslaving a dev
> if bond->dev->npinfo was set.
> 
> However now slave_dev npinfo is set with bond->dev->npinfo before calling
> slave_enable_netpoll. With slave_dev npinfo set, __netpoll_setup called
> in slave_enable_netpoll will not call slave dev's .ndo_netpoll_setup().
> It causes that the lower dev of this slave dev can't set its npinfo.
> 
> One way to reproduce it:
> 
>   # modprobe bonding
>   # brctl addbr br0
>   # brctl addif br0 eth1
>   # ifconfig bond0 192.168.122.1/24 up
>   # ifenslave bond0 eth2
>   # systemctl restart netconsole
>   # ifenslave bond0 br0
>   # ifconfig eth2 down
>   # systemctl restart netconsole
> 
> The netpoll won't really work.
> 
> This patch is to remove that slave_dev npinfo setting in bond_enslave().
> 
> Fixes: 8a8efa22f51b ("bonding: sync netpoll code with bridge")
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: init sk_cookie for inet socket
From: David Miller @ 2018-04-23 15:58 UTC (permalink / raw)
  To: laoar.shao; +Cc: eric.dumazet, alexei.starovoitov, netdev, linux-kernel
In-Reply-To: <1524405004-10960-1-git-send-email-laoar.shao@gmail.com>

From: Yafang Shao <laoar.shao@gmail.com>
Date: Sun, 22 Apr 2018 21:50:04 +0800

> With sk_cookie we can identify a socket, that is very helpful for
> traceing and statistic, i.e. tcp tracepiont and ebpf.
> So we'd better init it by default for inet socket.
> When using it, we just need call atomic64_read(&sk->sk_cookie).
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH] r8169: don't use netif_info et al before net_device has been registered
From: David Miller @ 2018-04-23 15:59 UTC (permalink / raw)
  To: hkallweit1; +Cc: nic_swsd, netdev
In-Reply-To: <94b7d3c8-9634-5b95-ff92-3682f7894fca@gmail.com>

From: Heiner Kallweit <hkallweit1@gmail.com>
Date: Sun, 22 Apr 2018 17:15:15 +0200

> There's no benefit in using netif_info et al before the net_device has
> been registered. We get messages like
> r8169 0000:03:00.0 (unnamed net_device) (uninitialized): [message]
> Therefore use dev_info/dev_err instead.
> 
> As a side effect we don't need parameter dev for function
> rtl8169_get_mac_version() any longer.
> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>

Please properly tag your Subject lines with the tree you are targetting,
in this case it should have been "[PATCH net-next]"

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net] ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy
From: David Miller @ 2018-04-23 16:02 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet
In-Reply-To: <20180423012923.121147-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Sun, 22 Apr 2018 18:29:23 -0700

> KMSAN reported use of uninit-value that I tracked to lack
> of proper size check on RTA_TABLE attribute.
> 
> I also believe RTA_PREFSRC lacks a similar check.
> 
> Fixes: 86872cb57925 ("[IPv6] route: FIB6 configuration using struct fib6_config")
> Fixes: c3968a857a6b ("ipv6: RTA_PREFSRC support for ipv6 route source address selection")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>

Applied and queued up for -stable, thanks Eric.

^ permalink raw reply

* Re: [PATCH v2 net-next] net: stmmac: Implement logic to automatically select HW Interface
From: David Miller @ 2018-04-23 16:04 UTC (permalink / raw)
  To: Jose.Abreu
  Cc: netdev, Joao.Pinto, Vitor.Soares, peppe.cavallaro,
	alexandre.torgue
In-Reply-To: <0f13e7542a0b701da6446411318a4e14023cddb7.1524470547.git.joabreu@synopsys.com>

From: Jose Abreu <Jose.Abreu@synopsys.com>
Date: Mon, 23 Apr 2018 09:05:15 +0100

> Move all the core version detection to a common place ("hwif.c") and
> implement a table which can be used to lookup the correct callbacks for
> each IP version.
> 
> This simplifies the initialization flow of each IP version and eases
> future implementation of new IP versions.
> 
> Signed-off-by: Jose Abreu <joabreu@synopsys.com>

Applied, thanks Jose.

^ permalink raw reply

* Re: [PATCH net-next 0/2] Add configuration information to register dump and debug data
From: David Miller @ 2018-04-23 16:06 UTC (permalink / raw)
  To: denis.bolotin; +Cc: netdev, ariel.elior
In-Reply-To: <20180423115605.8531-1-denis.bolotin@cavium.com>

From: Denis Bolotin <denis.bolotin@cavium.com>
Date: Mon, 23 Apr 2018 14:56:03 +0300

> The purpose of this patchset is to add configuration information to the
> debug data collection, which already contains register dump.
> The first patch (removing the ptt) is essential because it prevents the
> unnecessary ptt acquirement when calling mcp APIs.

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next] net: init sk_cookie for inet socket
From: Eric Dumazet @ 2018-04-23 16:09 UTC (permalink / raw)
  To: David Miller, laoar.shao
  Cc: eric.dumazet, alexei.starovoitov, netdev, linux-kernel
In-Reply-To: <20180423.115821.640630949143585629.davem@davemloft.net>



On 04/23/2018 08:58 AM, David Miller wrote:
> From: Yafang Shao <laoar.shao@gmail.com>
> Date: Sun, 22 Apr 2018 21:50:04 +0800
> 
>> With sk_cookie we can identify a socket, that is very helpful for
>> traceing and statistic, i.e. tcp tracepiont and ebpf.
>> So we'd better init it by default for inet socket.
>> When using it, we just need call atomic64_read(&sk->sk_cookie).
>>
>> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> 
> Applied, thank you.
> 

This is adding yet another atomic_inc on a global cache line.

Most applications do not need the cookie being ever set.

The existing mechanism was fine. Set it on demand.

^ permalink raw reply

* Re: [PATCH net] sfc: ARFS filter IDs
From: David Miller @ 2018-04-23 16:16 UTC (permalink / raw)
  To: ecree; +Cc: linux-net-drivers, netdev
In-Reply-To: <2ee1ef47-886d-278d-4a8d-234d74e26ad7@solarflare.com>

From: Edward Cree <ecree@solarflare.com>
Date: Mon, 23 Apr 2018 14:08:51 +0100

> Associate an arbitrary ID with each ARFS filter, allowing to properly query
>  for expiry.  The association is maintained in a hash table, which is
>  protected by a spinlock.
> 
> Fixes: 3af0f34290f6 ("sfc: replace asynchronous filter operations")
> Signed-off-by: Edward Cree <ecree@solarflare.com>

Hmmm, this adds a warning:

drivers/net/ethernet/sfc/farch.c: In function ‘efx_farch_filter_rfs_expire_one’:
drivers/net/ethernet/sfc/farch.c:2938:7: warning: ‘rule’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    if (rule)
       ^

^ permalink raw reply

* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: Michael S. Tsirkin @ 2018-04-23 16:18 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel,
	netdev, Björn Töpel, michael.lundkvist,
	jesse.brandeburg, anjali.singhai, qi.z.zhang
In-Reply-To: <20180423135619.7179-3-bjorn.topel@gmail.com>

On Mon, Apr 23, 2018 at 03:56:06PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> In this commit the base structure of the AF_XDP address family is set
> up. Further, we introduce the abilty register a window of user memory
> to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
> window is viewed by an AF_XDP socket as a set of equally large
> frames. After a user memory registration all frames are "owned" by the
> user application, and not the kernel.
> 
> Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  include/uapi/linux/if_xdp.h |  34 +++++++
>  net/Makefile                |   1 +
>  net/xdp/Makefile            |   2 +
>  net/xdp/xdp_umem.c          | 237 ++++++++++++++++++++++++++++++++++++++++++++
>  net/xdp/xdp_umem.h          |  42 ++++++++
>  net/xdp/xdp_umem_props.h    |  23 +++++
>  net/xdp/xsk.c               | 223 +++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 562 insertions(+)
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xdp_umem.c
>  create mode 100644 net/xdp/xdp_umem.h
>  create mode 100644 net/xdp/xdp_umem_props.h
>  create mode 100644 net/xdp/xsk.c
> 
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> new file mode 100644
> index 000000000000..41252135a0fe
> --- /dev/null
> +++ b/include/uapi/linux/if_xdp.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
> + *
> + * if_xdp: XDP socket user-space interface
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * Author(s): Björn Töpel <bjorn.topel@intel.com>
> + *	      Magnus Karlsson <magnus.karlsson@intel.com>
> + */
> +
> +#ifndef _LINUX_IF_XDP_H
> +#define _LINUX_IF_XDP_H
> +
> +#include <linux/types.h>
> +
> +/* XDP socket options */
> +#define XDP_UMEM_REG			3
> +
> +struct xdp_umem_reg {
> +	__u64 addr; /* Start of packet data area */
> +	__u64 len; /* Length of packet data area */
> +	__u32 frame_size; /* Frame size */
> +	__u32 frame_headroom; /* Frame head room */
> +};
> +
> +#endif /* _LINUX_IF_XDP_H */
> diff --git a/net/Makefile b/net/Makefile
> index a6147c61b174..77aaddedbd29 100644
> --- a/net/Makefile
> +++ b/net/Makefile
> @@ -85,3 +85,4 @@ obj-y				+= l3mdev/
>  endif
>  obj-$(CONFIG_QRTR)		+= qrtr/
>  obj-$(CONFIG_NET_NCSI)		+= ncsi/
> +obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
> diff --git a/net/xdp/Makefile b/net/xdp/Makefile
> new file mode 100644
> index 000000000000..a5d736640a0f
> --- /dev/null
> +++ b/net/xdp/Makefile
> @@ -0,0 +1,2 @@
> +obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
> +
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> new file mode 100644
> index 000000000000..bff058f5a769
> --- /dev/null
> +++ b/net/xdp/xdp_umem.c
> @@ -0,0 +1,237 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* XDP user-space packet buffer
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/sched/mm.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/task.h>
> +#include <linux/uaccess.h>
> +#include <linux/slab.h>
> +#include <linux/bpf.h>
> +#include <linux/mm.h>
> +
> +#include "xdp_umem.h"
> +
> +#define XDP_UMEM_MIN_FRAME_SIZE 2048
> +
> +int xdp_umem_create(struct xdp_umem **umem)
> +{
> +	*umem = kzalloc(sizeof(**umem), GFP_KERNEL);
> +
> +	if (!(*umem))
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +static void xdp_umem_unpin_pages(struct xdp_umem *umem)
> +{
> +	unsigned int i;
> +
> +	if (umem->pgs) {
> +		for (i = 0; i < umem->npgs; i++)

Since you pin them with FOLL_WRITE, I assume these pages
are written to.
Don't you need set_page_dirty_lock here?

> +			put_page(umem->pgs[i]);
> +
> +		kfree(umem->pgs);
> +		umem->pgs = NULL;
> +	}
> +}
> +
> +static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
> +{
> +	if (umem->user) {
> +		atomic_long_sub(umem->npgs, &umem->user->locked_vm);
> +		free_uid(umem->user);
> +	}
> +}
> +
> +static void xdp_umem_release(struct xdp_umem *umem)
> +{
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	unsigned long diff;
> +
> +	if (umem->pgs) {
> +		xdp_umem_unpin_pages(umem);
> +
> +		task = get_pid_task(umem->pid, PIDTYPE_PID);
> +		put_pid(umem->pid);
> +		if (!task)
> +			goto out;
> +		mm = get_task_mm(task);
> +		put_task_struct(task);
> +		if (!mm)
> +			goto out;
> +
> +		diff = umem->size >> PAGE_SHIFT;
> +
> +		down_write(&mm->mmap_sem);
> +		mm->pinned_vm -= diff;
> +		up_write(&mm->mmap_sem);
> +		mmput(mm);
> +		umem->pgs = NULL;
> +	}
> +
> +	xdp_umem_unaccount_pages(umem);
> +out:
> +	kfree(umem);
> +}
> +
> +void xdp_put_umem(struct xdp_umem *umem)
> +{
> +	if (!umem)
> +		return;
> +
> +	if (atomic_dec_and_test(&umem->users))
> +		xdp_umem_release(umem);
> +}
> +
> +static int xdp_umem_pin_pages(struct xdp_umem *umem)
> +{
> +	unsigned int gup_flags = FOLL_WRITE;
> +	long npgs;
> +	int err;
> +
> +	umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs), GFP_KERNEL);
> +	if (!umem->pgs)
> +		return -ENOMEM;
> +
> +	npgs = get_user_pages(umem->address, umem->npgs,
> +			      gup_flags, &umem->pgs[0], NULL);
> +	if (npgs != umem->npgs) {
> +		if (npgs >= 0) {
> +			umem->npgs = npgs;
> +			err = -ENOMEM;
> +			goto out_pin;
> +		}
> +		err = npgs;
> +		goto out_pgs;
> +	}
> +	return 0;
> +
> +out_pin:
> +	xdp_umem_unpin_pages(umem);
> +out_pgs:
> +	kfree(umem->pgs);
> +	umem->pgs = NULL;
> +	return err;
> +}
> +
> +static int xdp_umem_account_pages(struct xdp_umem *umem)
> +{
> +	unsigned long lock_limit, new_npgs, old_npgs;
> +
> +	if (capable(CAP_IPC_LOCK))
> +		return 0;
> +
> +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	umem->user = get_uid(current_user());
> +
> +	do {
> +		old_npgs = atomic_long_read(&umem->user->locked_vm);
> +		new_npgs = old_npgs + umem->npgs;
> +		if (new_npgs > lock_limit) {
> +			free_uid(umem->user);
> +			umem->user = NULL;
> +			return -ENOBUFS;
> +		}
> +	} while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
> +				     new_npgs) != old_npgs);
> +	return 0;
> +}
> +
> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> +{
> +	u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
> +	u64 addr = mr->addr, size = mr->len;
> +	unsigned int nframes;
> +	int size_chk, err;
> +
> +	if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> +		/* Strictly speaking we could support this, if:
> +		 * - huge pages, or*

what does "or*" here mean?

> +		 * - using an IOMMU, or
> +		 * - making sure the memory area is consecutive
> +		 * but for now, we simply say "computer says no".
> +		 */
> +		return -EINVAL;
> +	}
> +
> +	if (!is_power_of_2(frame_size))
> +		return -EINVAL;
> +
> +	if (!PAGE_ALIGNED(addr)) {
> +		/* Memory area has to be page size aligned. For
> +		 * simplicity, this might change.
> +		 */
> +		return -EINVAL;
> +	}
> +
> +	if ((addr + size) < addr)
> +		return -EINVAL;
> +
> +	nframes = size / frame_size;
> +	if (nframes == 0 || nframes > UINT_MAX)
> +		return -EINVAL;
> +
> +	frame_headroom = ALIGN(frame_headroom, 64);
> +
> +	size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
> +	if (size_chk < 0)
> +		return -EINVAL;
> +
> +	umem->pid = get_task_pid(current, PIDTYPE_PID);
> +	umem->size = (size_t)size;
> +	umem->address = (unsigned long)addr;
> +	umem->props.frame_size = frame_size;
> +	umem->props.nframes = nframes;
> +	umem->frame_headroom = frame_headroom;
> +	umem->npgs = size / PAGE_SIZE;
> +	umem->pgs = NULL;
> +	umem->user = NULL;
> +
> +	umem->frame_size_log2 = ilog2(frame_size);
> +	umem->nfpp_mask = (PAGE_SIZE / frame_size) - 1;
> +	umem->nfpplog2 = ilog2(PAGE_SIZE / frame_size);
> +	atomic_set(&umem->users, 1);
> +
> +	err = xdp_umem_account_pages(umem);
> +	if (err)
> +		goto out;
> +
> +	err = xdp_umem_pin_pages(umem);
> +	if (err)
> +		goto out;
> +	return 0;
> +
> +out:
> +	put_pid(umem->pid);
> +	return err;
> +}
> +
> +int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> +{
> +	int err;
> +
> +	if (!umem)
> +		return -EINVAL;
> +
> +	down_write(&current->mm->mmap_sem);
> +
> +	err = __xdp_umem_reg(umem, mr);
> +
> +	up_write(&current->mm->mmap_sem);
> +	return err;
> +}
> +
> diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
> new file mode 100644
> index 000000000000..58714f4f7f25
> --- /dev/null
> +++ b/net/xdp/xdp_umem.h
> @@ -0,0 +1,42 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + * XDP user-space packet buffer
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#ifndef XDP_UMEM_H_
> +#define XDP_UMEM_H_
> +
> +#include <linux/mm.h>
> +#include <linux/if_xdp.h>
> +
> +#include "xdp_umem_props.h"
> +
> +struct xdp_umem {
> +	struct page **pgs;
> +	struct xdp_umem_props props;
> +	u32 npgs;
> +	u32 frame_headroom;
> +	u32 nfpp_mask;
> +	u32 nfpplog2;
> +	u32 frame_size_log2;
> +	struct user_struct *user;
> +	struct pid *pid;
> +	unsigned long address;
> +	size_t size;
> +	atomic_t users;
> +};
> +
> +int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
> +void xdp_put_umem(struct xdp_umem *umem);
> +int xdp_umem_create(struct xdp_umem **umem);
> +
> +#endif /* XDP_UMEM_H_ */
> diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h
> new file mode 100644
> index 000000000000..77fb5daf29f3
> --- /dev/null
> +++ b/net/xdp/xdp_umem_props.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + * XDP user-space packet buffer
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#ifndef XDP_UMEM_PROPS_H_
> +#define XDP_UMEM_PROPS_H_
> +
> +struct xdp_umem_props {
> +	u32 frame_size;
> +	u32 nframes;
> +};
> +
> +#endif /* XDP_UMEM_PROPS_H_ */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> new file mode 100644
> index 000000000000..19fc719cbe0d
> --- /dev/null
> +++ b/net/xdp/xsk.c
> @@ -0,0 +1,223 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* XDP sockets
> + *
> + * AF_XDP sockets allows a channel between XDP programs and userspace
> + * applications.
> + * Copyright(c) 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * Author(s): Björn Töpel <bjorn.topel@intel.com>
> + *	      Magnus Karlsson <magnus.karlsson@intel.com>
> + */
> +
> +#define pr_fmt(fmt) "AF_XDP: %s: " fmt, __func__
> +
> +#include <linux/if_xdp.h>
> +#include <linux/init.h>
> +#include <linux/sched/mm.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/task.h>
> +#include <linux/socket.h>
> +#include <linux/file.h>
> +#include <linux/uaccess.h>
> +#include <linux/net.h>
> +#include <linux/netdevice.h>
> +#include <net/sock.h>
> +
> +#include "xdp_umem.h"
> +
> +struct xdp_sock {
> +	/* struct sock must be the first member of struct xdp_sock */
> +	struct sock sk;
> +	struct xdp_umem *umem;
> +	/* Protects multiple processes in the control path */
> +	struct mutex mutex;
> +};
> +
> +static struct xdp_sock *xdp_sk(struct sock *sk)
> +{
> +	return (struct xdp_sock *)sk;
> +}
> +
> +static int xsk_release(struct socket *sock)
> +{
> +	struct sock *sk = sock->sk;
> +	struct net *net;
> +
> +	if (!sk)
> +		return 0;
> +
> +	net = sock_net(sk);
> +
> +	local_bh_disable();
> +	sock_prot_inuse_add(net, sk->sk_prot, -1);
> +	local_bh_enable();
> +
> +	sock_orphan(sk);
> +	sock->sk = NULL;
> +
> +	sk_refcnt_debug_release(sk);
> +	sock_put(sk);
> +
> +	return 0;
> +}
> +
> +static int xsk_setsockopt(struct socket *sock, int level, int optname,
> +			  char __user *optval, unsigned int optlen)
> +{
> +	struct sock *sk = sock->sk;
> +	struct xdp_sock *xs = xdp_sk(sk);
> +	int err;
> +
> +	if (level != SOL_XDP)
> +		return -ENOPROTOOPT;
> +
> +	switch (optname) {
> +	case XDP_UMEM_REG:
> +	{
> +		struct xdp_umem_reg mr;
> +		struct xdp_umem *umem;
> +
> +		if (xs->umem)
> +			return -EBUSY;
> +
> +		if (copy_from_user(&mr, optval, sizeof(mr)))
> +			return -EFAULT;
> +
> +		mutex_lock(&xs->mutex);
> +		err = xdp_umem_create(&umem);
> +
> +		err = xdp_umem_reg(umem, &mr);
> +		if (err) {
> +			kfree(umem);
> +			mutex_unlock(&xs->mutex);
> +			return err;
> +		}
> +
> +		/* Make sure umem is ready before it can be seen by others */
> +		smp_wmb();
> +
> +		xs->umem = umem;
> +		mutex_unlock(&xs->mutex);
> +		return 0;
> +	}
> +	default:
> +		break;
> +	}
> +
> +	return -ENOPROTOOPT;
> +}
> +
> +static struct proto xsk_proto = {
> +	.name =		"XDP",
> +	.owner =	THIS_MODULE,
> +	.obj_size =	sizeof(struct xdp_sock),
> +};
> +
> +static const struct proto_ops xsk_proto_ops = {
> +	.family =	PF_XDP,
> +	.owner =	THIS_MODULE,
> +	.release =	xsk_release,
> +	.bind =		sock_no_bind,
> +	.connect =	sock_no_connect,
> +	.socketpair =	sock_no_socketpair,
> +	.accept =	sock_no_accept,
> +	.getname =	sock_no_getname,
> +	.poll =		sock_no_poll,
> +	.ioctl =	sock_no_ioctl,
> +	.listen =	sock_no_listen,
> +	.shutdown =	sock_no_shutdown,
> +	.setsockopt =	xsk_setsockopt,
> +	.getsockopt =	sock_no_getsockopt,
> +	.sendmsg =	sock_no_sendmsg,
> +	.recvmsg =	sock_no_recvmsg,
> +	.mmap =		sock_no_mmap,
> +	.sendpage =	sock_no_sendpage,
> +};
> +
> +static void xsk_destruct(struct sock *sk)
> +{
> +	struct xdp_sock *xs = xdp_sk(sk);
> +
> +	if (!sock_flag(sk, SOCK_DEAD))
> +		return;
> +
> +	xdp_put_umem(xs->umem);
> +
> +	sk_refcnt_debug_dec(sk);
> +}
> +
> +static int xsk_create(struct net *net, struct socket *sock, int protocol,
> +		      int kern)
> +{
> +	struct sock *sk;
> +	struct xdp_sock *xs;
> +
> +	if (!ns_capable(net->user_ns, CAP_NET_RAW))
> +		return -EPERM;
> +	if (sock->type != SOCK_RAW)
> +		return -ESOCKTNOSUPPORT;
> +
> +	if (protocol)
> +		return -EPROTONOSUPPORT;
> +
> +	sock->state = SS_UNCONNECTED;
> +
> +	sk = sk_alloc(net, PF_XDP, GFP_KERNEL, &xsk_proto, kern);
> +	if (!sk)
> +		return -ENOBUFS;
> +
> +	sock->ops = &xsk_proto_ops;
> +
> +	sock_init_data(sock, sk);
> +
> +	sk->sk_family = PF_XDP;
> +
> +	sk->sk_destruct = xsk_destruct;
> +	sk_refcnt_debug_inc(sk);
> +
> +	xs = xdp_sk(sk);
> +	mutex_init(&xs->mutex);
> +
> +	local_bh_disable();
> +	sock_prot_inuse_add(net, &xsk_proto, 1);
> +	local_bh_enable();
> +
> +	return 0;
> +}
> +
> +static const struct net_proto_family xsk_family_ops = {
> +	.family = PF_XDP,
> +	.create = xsk_create,
> +	.owner	= THIS_MODULE,
> +};
> +
> +static int __init xsk_init(void)
> +{
> +	int err;
> +
> +	err = proto_register(&xsk_proto, 0 /* no slab */);
> +	if (err)
> +		goto out;
> +
> +	err = sock_register(&xsk_family_ops);
> +	if (err)
> +		goto out_proto;
> +
> +	return 0;
> +
> +out_proto:
> +	proto_unregister(&xsk_proto);
> +out:
> +	return err;
> +}
> +
> +fs_initcall(xsk_init);
> -- 
> 2.14.1

^ permalink raw reply

* Re: [PATCH bpf-next v3 4/9] bpf/verifier: improve register value range tracking with ARSH
From: Yonghong Song @ 2018-04-23 16:19 UTC (permalink / raw)
  To: Edward Cree, ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <adef0d25-578c-6eb1-7dea-8c81ba6647e0@solarflare.com>



On 4/23/18 5:25 AM, Edward Cree wrote:
> On 20/04/18 23:18, Yonghong Song wrote:
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 3c8bb92..01c215d 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -2975,6 +2975,32 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
>>   		/* We may learn something more from the var_off */
>>   		__update_reg_bounds(dst_reg);
>>   		break;
>> +	case BPF_ARSH:
>> +		if (umax_val >= insn_bitness) {
>> +			/* Shifts greater than 31 or 63 are undefined.
>> +			 * This includes shifts by a negative number.
>> +			 */
>> +			mark_reg_unknown(env, regs, insn->dst_reg);
>> +			break;
>> +		}
>> +		if (dst_reg->smin_value < 0)
>> +			dst_reg->smin_value >>= umin_val;
>> +		else
>> +			dst_reg->smin_value >>= umax_val;
>> +		if (dst_reg->smax_value < 0)
>> +			dst_reg->smax_value >>= umax_val;
>> +		else
>> +			dst_reg->smax_value >>= umin_val;
>> +		if (src_known)
>> +			dst_reg->var_off = tnum_rshift(dst_reg->var_off,
>> +						       umin_val);
> tnum_rshift is an unsigned shift, it won't do what you want here.
> I think you could write a tnum_arshift that looks something like this
>   (UNTESTED!):
> 
>      struct tnum tnum_arshift(struct tnum a, u8 shift)
>      {
>          return TNUM(((s64)a.value) >> shift, ((s64)a.mask) >> shift);
>      }
> Theory: if value sign bit is 1 then number is known negative so populate
>   upper bits with known 1s.  If mask sign bit is 1 then number might be
>   negative so populate upper bits with unknown.  Otherwise, number is
>   known positive so populate upper bits with known 0s.

Right, my last version just used this tnum_arshift :-).


>> +		else
>> +			dst_reg->var_off = tnum_rshift(tnum_unknown, umin_val);
> Applying the above here, tnum_arshift(tnum_unknown, ...) would always just
>   return tnum_unknown, so just do "dst_reg->var_off = tnum_unknown;".
> The reason for the corresponding logic in the BPF_RSH case is that a right
>   logical shift _always_ populates upper bits with zeroes.
> In any case these 'else' branches are currently never taken because they
>   fall foul of the check Alexei added just before the switch,
>      if (!src_known &&
>          opcode != BPF_ADD && opcode != BPF_SUB && opcode != BPF_AND) {
>          __mark_reg_unknown(dst_reg);
>          return 0;
>      }
> So I can guarantee you haven't tested this code :-)

I just noticed this last night and removed the !src_known branch all 
together here and from LSH/RSH.

> 
>> +		dst_reg->umin_value >>= umax_val;
>> +		dst_reg->umax_value >>= umin_val;
> FWIW I think the way to handle umin/umax here is to blow them away and
>   just rely on inferring new ubounds from the sbounds (i.e. the inverse of
>   what we do just above in case BPF_RSH) since BPF_ARSH is essentially an
>   operation on the signed value.  I don't think there is a need to support
>   cases where the unsigned bounds of a signed shift of a value that may
>   cross the sign boundary at (1<<63) are needed to verify a program.
> (Unlike in the unsigned shift case, it is at least _possible_ for there to
>   be information from the ubounds that we can't get from the sbounds - but
>   it's a contrived case that isn't likely to be useful in real programs.)

This makes sense and will make code simpler and easy to understand.
Will make the change.

Thanks!

> 
> -Ed
>> +		/* We may learn something more from the var_off */
>> +		__update_reg_bounds(dst_reg);
>> +		break;
>>   	default:
>>   		mark_reg_unknown(env, regs, insn->dst_reg);
>>   		break;

^ permalink raw reply

* [PATCH net 0/3] amd-xgbe: AMD XGBE driver fixes 2018-04-23
From: Tom Lendacky @ 2018-04-23 16:42 UTC (permalink / raw)
  To: netdev; +Cc: David Miller

This patch series addresses some issues in the AMD XGBE driver.

The following fixes are included in this driver update series:

- Improve KR auto-negotiation and training (2 patches)
  - Add pre and post auto-negotiation hooks
  - Use the pre and post auto-negotiation hooks to disable CDR tracking
    during auto-negotiation page exchange in KR mode
- Check for SFP tranceiver signal support and only use the signal if the
  SFP indicates that it is supported

This patch series is based on net.

---

Please queue this patch series up to stable releases 4.14 and above.

Tom Lendacky (3):
      amd-xgbe: Add pre/post auto-negotiation phy hooks
      amd-xgbe: Improve KR auto-negotiation and training
      amd-xgbe: Only use the SFP supported transceiver signals


 drivers/net/ethernet/amd/xgbe/xgbe-common.h  |    8 +
 drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c |   16 ++
 drivers/net/ethernet/amd/xgbe/xgbe-main.c    |    1 
 drivers/net/ethernet/amd/xgbe/xgbe-mdio.c    |   24 +++
 drivers/net/ethernet/amd/xgbe/xgbe-pci.c     |    2 
 drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c  |  196 ++++++++++++++++++++++++--
 drivers/net/ethernet/amd/xgbe/xgbe.h         |    9 +
 7 files changed, 233 insertions(+), 23 deletions(-)

-- 
Tom Lendacky

^ permalink raw reply

* [PATCH net 1/3] amd-xgbe: Add pre/post auto-negotiation phy hooks
From: Tom Lendacky @ 2018-04-23 16:43 UTC (permalink / raw)
  To: netdev; +Cc: David Miller
In-Reply-To: <20180423164258.18740.98574.stgit@tlendack-t1.amdoffice.net>

Add hooks to the driver auto-negotiation (AN) flow to allow the different
phy implementations to perform any steps necessary to improve AN.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 drivers/net/ethernet/amd/xgbe/xgbe-mdio.c |   16 ++++++++++++++--
 drivers/net/ethernet/amd/xgbe/xgbe.h      |    5 +++++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c b/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
index 072b9f6..e3d361e 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
@@ -437,6 +437,9 @@ static void xgbe_an73_disable(struct xgbe_prv_data *pdata)
 
 static void xgbe_an_restart(struct xgbe_prv_data *pdata)
 {
+	if (pdata->phy_if.phy_impl.an_pre)
+		pdata->phy_if.phy_impl.an_pre(pdata);
+
 	switch (pdata->an_mode) {
 	case XGBE_AN_MODE_CL73:
 	case XGBE_AN_MODE_CL73_REDRV:
@@ -453,6 +456,9 @@ static void xgbe_an_restart(struct xgbe_prv_data *pdata)
 
 static void xgbe_an_disable(struct xgbe_prv_data *pdata)
 {
+	if (pdata->phy_if.phy_impl.an_post)
+		pdata->phy_if.phy_impl.an_post(pdata);
+
 	switch (pdata->an_mode) {
 	case XGBE_AN_MODE_CL73:
 	case XGBE_AN_MODE_CL73_REDRV:
@@ -637,11 +643,11 @@ static enum xgbe_an xgbe_an73_incompat_link(struct xgbe_prv_data *pdata)
 			return XGBE_AN_NO_LINK;
 	}
 
-	xgbe_an73_disable(pdata);
+	xgbe_an_disable(pdata);
 
 	xgbe_switch_mode(pdata);
 
-	xgbe_an73_restart(pdata);
+	xgbe_an_restart(pdata);
 
 	return XGBE_AN_INCOMPAT_LINK;
 }
@@ -820,6 +826,9 @@ static void xgbe_an37_state_machine(struct xgbe_prv_data *pdata)
 		pdata->an_result = pdata->an_state;
 		pdata->an_state = XGBE_AN_READY;
 
+		if (pdata->phy_if.phy_impl.an_post)
+			pdata->phy_if.phy_impl.an_post(pdata);
+
 		netif_dbg(pdata, link, pdata->netdev, "CL37 AN result: %s\n",
 			  xgbe_state_as_string(pdata->an_result));
 	}
@@ -903,6 +912,9 @@ static void xgbe_an73_state_machine(struct xgbe_prv_data *pdata)
 		pdata->kx_state = XGBE_RX_BPA;
 		pdata->an_start = 0;
 
+		if (pdata->phy_if.phy_impl.an_post)
+			pdata->phy_if.phy_impl.an_post(pdata);
+
 		netif_dbg(pdata, link, pdata->netdev, "CL73 AN result: %s\n",
 			  xgbe_state_as_string(pdata->an_result));
 	}
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe.h b/drivers/net/ethernet/amd/xgbe/xgbe.h
index ad102c8..fa0b51e 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe.h
+++ b/drivers/net/ethernet/amd/xgbe/xgbe.h
@@ -833,6 +833,7 @@ struct xgbe_hw_if {
 /* This structure represents implementation specific routines for an
  * implementation of a PHY. All routines are required unless noted below.
  *   Optional routines:
+ *     an_pre, an_post
  *     kr_training_pre, kr_training_post
  */
 struct xgbe_phy_impl_if {
@@ -875,6 +876,10 @@ struct xgbe_phy_impl_if {
 	/* Process results of auto-negotiation */
 	enum xgbe_mode (*an_outcome)(struct xgbe_prv_data *);
 
+	/* Pre/Post auto-negotiation support */
+	void (*an_pre)(struct xgbe_prv_data *);
+	void (*an_post)(struct xgbe_prv_data *);
+
 	/* Pre/Post KR training enablement support */
 	void (*kr_training_pre)(struct xgbe_prv_data *);
 	void (*kr_training_post)(struct xgbe_prv_data *);

^ permalink raw reply related

* [PATCH net 2/3] amd-xgbe: Improve KR auto-negotiation and training
From: Tom Lendacky @ 2018-04-23 16:43 UTC (permalink / raw)
  To: netdev; +Cc: David Miller
In-Reply-To: <20180423164258.18740.98574.stgit@tlendack-t1.amdoffice.net>

Update xgbe-phy-v2.c to make use of the auto-negotiation (AN) phy hooks
to improve the ability to successfully complete Clause 73 AN when running
at 10gbps.  Hardware can sometimes have issues with CDR lock when the
AN DME page exchange is being performed.

The AN and KR training hooks are used as follows:
- The pre AN hook is used to disable CDR tracking in the PHY so that the
  DME page exchange can be successfully and consistently completed.
- The post KR training hook is used to re-enable the CDR tracking so that
  KR training can successfully complete.
- The post AN hook is used to check for an unsuccessful AN which will
  increase a CDR tracking enablement delay (up to a maximum value).

Add two debugfs entries to allow control over use of the CDR tracking
workaround.  The debugfs entries allow the CDR tracking workaround to
be disabled and determine whether to re-enable CDR tracking before or
after link training has been initiated.

Also, with these changes the receiver reset cycle that is performed during
the link status check can be performed less often.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
---
 drivers/net/ethernet/amd/xgbe/xgbe-common.h  |    8 ++
 drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c |   16 +++
 drivers/net/ethernet/amd/xgbe/xgbe-main.c    |    1 
 drivers/net/ethernet/amd/xgbe/xgbe-mdio.c    |    8 +-
 drivers/net/ethernet/amd/xgbe/xgbe-pci.c     |    2 
 drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c  |  125 ++++++++++++++++++++++++++
 drivers/net/ethernet/amd/xgbe/xgbe.h         |    4 +
 7 files changed, 160 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-common.h b/drivers/net/ethernet/amd/xgbe/xgbe-common.h
index 7ea72ef..d272dc6 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-common.h
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-common.h
@@ -1321,6 +1321,10 @@
 #define MDIO_VEND2_AN_STAT		0x8002
 #endif
 
+#ifndef MDIO_VEND2_PMA_CDR_CONTROL
+#define MDIO_VEND2_PMA_CDR_CONTROL	0x8056
+#endif
+
 #ifndef MDIO_CTRL1_SPEED1G
 #define MDIO_CTRL1_SPEED1G		(MDIO_CTRL1_SPEED10G & ~BMCR_SPEED100)
 #endif
@@ -1369,6 +1373,10 @@
 #define XGBE_AN_CL37_TX_CONFIG_MASK	0x08
 #define XGBE_AN_CL37_MII_CTRL_8BIT	0x0100
 
+#define XGBE_PMA_CDR_TRACK_EN_MASK	0x01
+#define XGBE_PMA_CDR_TRACK_EN_OFF	0x00
+#define XGBE_PMA_CDR_TRACK_EN_ON	0x01
+
 /* Bit setting and getting macros
  *  The get macro will extract the current bit field value from within
  *  the variable
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c b/drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c
index 7d128be..b911439 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-debugfs.c
@@ -519,6 +519,22 @@ void xgbe_debugfs_init(struct xgbe_prv_data *pdata)
 				   "debugfs_create_file failed\n");
 	}
 
+	if (pdata->vdata->an_cdr_workaround) {
+		pfile = debugfs_create_bool("an_cdr_workaround", 0600,
+					    pdata->xgbe_debugfs,
+					    &pdata->debugfs_an_cdr_workaround);
+		if (!pfile)
+			netdev_err(pdata->netdev,
+				   "debugfs_create_bool failed\n");
+
+		pfile = debugfs_create_bool("an_cdr_track_early", 0600,
+					    pdata->xgbe_debugfs,
+					    &pdata->debugfs_an_cdr_track_early);
+		if (!pfile)
+			netdev_err(pdata->netdev,
+				   "debugfs_create_bool failed\n");
+	}
+
 	kfree(buf);
 }
 
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-main.c b/drivers/net/ethernet/amd/xgbe/xgbe-main.c
index 795e556..441d0973 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-main.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-main.c
@@ -349,6 +349,7 @@ int xgbe_config_netdev(struct xgbe_prv_data *pdata)
 	XGMAC_SET_BITS(pdata->rss_options, MAC_RSSCR, UDP4TE, 1);
 
 	/* Call MDIO/PHY initialization routine */
+	pdata->debugfs_an_cdr_workaround = pdata->vdata->an_cdr_workaround;
 	ret = pdata->phy_if.phy_init(pdata);
 	if (ret)
 		return ret;
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c b/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
index e3d361e..1b45cd7 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-mdio.c
@@ -432,6 +432,8 @@ static void xgbe_an73_disable(struct xgbe_prv_data *pdata)
 	xgbe_an73_set(pdata, false, false);
 	xgbe_an73_disable_interrupts(pdata);
 
+	pdata->an_start = 0;
+
 	netif_dbg(pdata, link, pdata->netdev, "CL73 AN disabled\n");
 }
 
@@ -511,11 +513,11 @@ static enum xgbe_an xgbe_an73_tx_training(struct xgbe_prv_data *pdata,
 		XMDIO_WRITE(pdata, MDIO_MMD_PMAPMD, MDIO_PMA_10GBR_PMD_CTRL,
 			    reg);
 
-		if (pdata->phy_if.phy_impl.kr_training_post)
-			pdata->phy_if.phy_impl.kr_training_post(pdata);
-
 		netif_dbg(pdata, link, pdata->netdev,
 			  "KR training initiated\n");
+
+		if (pdata->phy_if.phy_impl.kr_training_post)
+			pdata->phy_if.phy_impl.kr_training_post(pdata);
 	}
 
 	return XGBE_AN_PAGE_RECEIVED;
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-pci.c b/drivers/net/ethernet/amd/xgbe/xgbe-pci.c
index eb23f9b..82d1f41 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-pci.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-pci.c
@@ -456,6 +456,7 @@ static int xgbe_pci_resume(struct pci_dev *pdev)
 	.irq_reissue_support		= 1,
 	.tx_desc_prefetch		= 5,
 	.rx_desc_prefetch		= 5,
+	.an_cdr_workaround		= 1,
 };
 
 static const struct xgbe_version_data xgbe_v2b = {
@@ -470,6 +471,7 @@ static int xgbe_pci_resume(struct pci_dev *pdev)
 	.irq_reissue_support		= 1,
 	.tx_desc_prefetch		= 5,
 	.rx_desc_prefetch		= 5,
+	.an_cdr_workaround		= 1,
 };
 
 static const struct pci_device_id xgbe_pci_table[] = {
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c b/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
index 3304a29..b48efc0 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-phy-v2.c
@@ -147,6 +147,14 @@
 /* Rate-change complete wait/retry count */
 #define XGBE_RATECHANGE_COUNT		500
 
+/* CDR delay values for KR support (in usec) */
+#define XGBE_CDR_DELAY_INIT		10000
+#define XGBE_CDR_DELAY_INC		10000
+#define XGBE_CDR_DELAY_MAX		100000
+
+/* RRC frequency during link status check */
+#define XGBE_RRC_FREQUENCY		10
+
 enum xgbe_port_mode {
 	XGBE_PORT_MODE_RSVD = 0,
 	XGBE_PORT_MODE_BACKPLANE,
@@ -355,6 +363,10 @@ struct xgbe_phy_data {
 	unsigned int redrv_addr;
 	unsigned int redrv_lane;
 	unsigned int redrv_model;
+
+	/* KR AN support */
+	unsigned int phy_cdr_notrack;
+	unsigned int phy_cdr_delay;
 };
 
 /* I2C, MDIO and GPIO lines are muxed, so only one device at a time */
@@ -2361,7 +2373,7 @@ static int xgbe_phy_link_status(struct xgbe_prv_data *pdata, int *an_restart)
 		return 1;
 
 	/* No link, attempt a receiver reset cycle */
-	if (phy_data->rrc_count++) {
+	if (phy_data->rrc_count++ > XGBE_RRC_FREQUENCY) {
 		phy_data->rrc_count = 0;
 		xgbe_phy_rrc(pdata);
 	}
@@ -2669,6 +2681,103 @@ static bool xgbe_phy_port_enabled(struct xgbe_prv_data *pdata)
 	return true;
 }
 
+static void xgbe_phy_cdr_track(struct xgbe_prv_data *pdata)
+{
+	struct xgbe_phy_data *phy_data = pdata->phy_data;
+
+	if (!pdata->debugfs_an_cdr_workaround)
+		return;
+
+	if (!phy_data->phy_cdr_notrack)
+		return;
+
+	usleep_range(phy_data->phy_cdr_delay,
+		     phy_data->phy_cdr_delay + 500);
+
+	XMDIO_WRITE_BITS(pdata, MDIO_MMD_PMAPMD, MDIO_VEND2_PMA_CDR_CONTROL,
+			 XGBE_PMA_CDR_TRACK_EN_MASK,
+			 XGBE_PMA_CDR_TRACK_EN_ON);
+
+	phy_data->phy_cdr_notrack = 0;
+}
+
+static void xgbe_phy_cdr_notrack(struct xgbe_prv_data *pdata)
+{
+	struct xgbe_phy_data *phy_data = pdata->phy_data;
+
+	if (!pdata->debugfs_an_cdr_workaround)
+		return;
+
+	if (phy_data->phy_cdr_notrack)
+		return;
+
+	XMDIO_WRITE_BITS(pdata, MDIO_MMD_PMAPMD, MDIO_VEND2_PMA_CDR_CONTROL,
+			 XGBE_PMA_CDR_TRACK_EN_MASK,
+			 XGBE_PMA_CDR_TRACK_EN_OFF);
+
+	xgbe_phy_rrc(pdata);
+
+	phy_data->phy_cdr_notrack = 1;
+}
+
+static void xgbe_phy_kr_training_post(struct xgbe_prv_data *pdata)
+{
+	if (!pdata->debugfs_an_cdr_track_early)
+		xgbe_phy_cdr_track(pdata);
+}
+
+static void xgbe_phy_kr_training_pre(struct xgbe_prv_data *pdata)
+{
+	if (pdata->debugfs_an_cdr_track_early)
+		xgbe_phy_cdr_track(pdata);
+}
+
+static void xgbe_phy_an_post(struct xgbe_prv_data *pdata)
+{
+	struct xgbe_phy_data *phy_data = pdata->phy_data;
+
+	switch (pdata->an_mode) {
+	case XGBE_AN_MODE_CL73:
+	case XGBE_AN_MODE_CL73_REDRV:
+		if (phy_data->cur_mode != XGBE_MODE_KR)
+			break;
+
+		xgbe_phy_cdr_track(pdata);
+
+		switch (pdata->an_result) {
+		case XGBE_AN_READY:
+		case XGBE_AN_COMPLETE:
+			break;
+		default:
+			if (phy_data->phy_cdr_delay < XGBE_CDR_DELAY_MAX)
+				phy_data->phy_cdr_delay += XGBE_CDR_DELAY_INC;
+			else
+				phy_data->phy_cdr_delay = XGBE_CDR_DELAY_INIT;
+			break;
+		}
+		break;
+	default:
+		break;
+	}
+}
+
+static void xgbe_phy_an_pre(struct xgbe_prv_data *pdata)
+{
+	struct xgbe_phy_data *phy_data = pdata->phy_data;
+
+	switch (pdata->an_mode) {
+	case XGBE_AN_MODE_CL73:
+	case XGBE_AN_MODE_CL73_REDRV:
+		if (phy_data->cur_mode != XGBE_MODE_KR)
+			break;
+
+		xgbe_phy_cdr_notrack(pdata);
+		break;
+	default:
+		break;
+	}
+}
+
 static void xgbe_phy_stop(struct xgbe_prv_data *pdata)
 {
 	struct xgbe_phy_data *phy_data = pdata->phy_data;
@@ -2680,6 +2789,9 @@ static void xgbe_phy_stop(struct xgbe_prv_data *pdata)
 	xgbe_phy_sfp_reset(phy_data);
 	xgbe_phy_sfp_mod_absent(pdata);
 
+	/* Reset CDR support */
+	xgbe_phy_cdr_track(pdata);
+
 	/* Power off the PHY */
 	xgbe_phy_power_off(pdata);
 
@@ -2712,6 +2824,9 @@ static int xgbe_phy_start(struct xgbe_prv_data *pdata)
 	/* Start in highest supported mode */
 	xgbe_phy_set_mode(pdata, phy_data->start_mode);
 
+	/* Reset CDR support */
+	xgbe_phy_cdr_track(pdata);
+
 	/* After starting the I2C controller, we can check for an SFP */
 	switch (phy_data->port_mode) {
 	case XGBE_PORT_MODE_SFP:
@@ -3019,6 +3134,8 @@ static int xgbe_phy_init(struct xgbe_prv_data *pdata)
 		}
 	}
 
+	phy_data->phy_cdr_delay = XGBE_CDR_DELAY_INIT;
+
 	/* Register for driving external PHYs */
 	mii = devm_mdiobus_alloc(pdata->dev);
 	if (!mii) {
@@ -3071,4 +3188,10 @@ void xgbe_init_function_ptrs_phy_v2(struct xgbe_phy_if *phy_if)
 	phy_impl->an_advertising	= xgbe_phy_an_advertising;
 
 	phy_impl->an_outcome		= xgbe_phy_an_outcome;
+
+	phy_impl->an_pre		= xgbe_phy_an_pre;
+	phy_impl->an_post		= xgbe_phy_an_post;
+
+	phy_impl->kr_training_pre	= xgbe_phy_kr_training_pre;
+	phy_impl->kr_training_post	= xgbe_phy_kr_training_post;
 }
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe.h b/drivers/net/ethernet/amd/xgbe/xgbe.h
index fa0b51e..95d4b56 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe.h
+++ b/drivers/net/ethernet/amd/xgbe/xgbe.h
@@ -994,6 +994,7 @@ struct xgbe_version_data {
 	unsigned int irq_reissue_support;
 	unsigned int tx_desc_prefetch;
 	unsigned int rx_desc_prefetch;
+	unsigned int an_cdr_workaround;
 };
 
 struct xgbe_vxlan_data {
@@ -1262,6 +1263,9 @@ struct xgbe_prv_data {
 	unsigned int debugfs_xprop_reg;
 
 	unsigned int debugfs_xi2c_reg;
+
+	bool debugfs_an_cdr_workaround;
+	bool debugfs_an_cdr_track_early;
 };
 
 /* Function prototypes*/

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox