Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [virtio-dev] [pci PATCH v7 2/5] virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices
From: Christoph Hellwig @ 2018-04-21  7:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Daly, Dan, Bjorn Helgaas, Duyck, Alexander H,
	linux-pci, virtio-dev, kvm, Netdev, LKML, linux-nvme, Keith Busch,
	netanel, Don Dutile, Maximilian Heyne, Wang, Liang-min,
	Rustad, Mark D, David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <20180420180839-mutt-send-email-mst@kernel.org>

On Fri, Apr 20, 2018 at 06:28:50PM +0300, Michael S. Tsirkin wrote:
> But maybe it's not needed here.  I am not making the decisions myself.
> Not too late: post to the TC list and let's see what the response is.
> Without a feature bit you are making a change affecting all future
> implementations without exception so the bar is a bit higher: you need
> to actually post a spec text proposal not just a patch showing how to
> use the feature, and TC needs to vote on it. Voting takes a week,
> review a week or two depending on change complexity.

Also IFF the hardware already is out we can quirk it in the PCI ID
table to manually set the feature in the driver as a workaround.

^ permalink raw reply

* Re: [pci PATCH v8 1/4] pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources
From: Christoph Hellwig @ 2018-04-21  7:05 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: bhelgaas, linux-pci, virtio-dev, kvm, netdev, dan.daly,
	linux-kernel, linux-nvme, keith.busch, netanel, ddutile, mheyne,
	liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180420162814.46077.11939.stgit@ahduyck-green-test.jf.intel.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [pci PATCH v8 2/4] ena: Migrate over to unmanaged SR-IOV support
From: Christoph Hellwig @ 2018-04-21  7:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: bhelgaas, linux-pci, virtio-dev, kvm, netdev, dan.daly,
	linux-kernel, linux-nvme, keith.busch, netanel, ddutile, mheyne,
	liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180420162841.46077.17967.stgit@ahduyck-green-test.jf.intel.com>

On Fri, Apr 20, 2018 at 12:30:21PM -0400, Alexander Duyck wrote:
> Instead of implementing our own version of a SR-IOV configuration stub in
> the ena driver we can just reuse the existing
> pci_sriov_configure_simple function.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [pci PATCH v8 3/4] nvme: Migrate over to unmanaged SR-IOV support
From: Christoph Hellwig @ 2018-04-21  7:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: bhelgaas, linux-pci, virtio-dev, kvm, netdev, dan.daly,
	linux-kernel, linux-nvme, keith.busch, netanel, ddutile, mheyne,
	liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180420163027.46077.93925.stgit@ahduyck-green-test.jf.intel.com>

On Fri, Apr 20, 2018 at 12:31:03PM -0400, Alexander Duyck wrote:
> Instead of implementing our own version of a SR-IOV configuration stub in
> the nvme driver we can just reuse the existing
> pci_sriov_configure_simple function.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [pci PATCH v8 4/4] pci-pf-stub: Add PF driver stub for PFs that function only to enable VFs
From: Christoph Hellwig @ 2018-04-21  7:07 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: bhelgaas, linux-pci, virtio-dev, kvm, netdev, dan.daly,
	linux-kernel, linux-nvme, keith.busch, netanel, ddutile, mheyne,
	liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180420163109.46077.60334.stgit@ahduyck-green-test.jf.intel.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH 3/8] [media] v4l: rcar_fdp1: Change platform dependency to ARCH_RENESAS
From: Laurent Pinchart @ 2018-04-21  8:07 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Simon Horman, Magnus Damm, Russell King, Catalin Marinas,
	Will Deacon, Dan Williams, Vinod Koul, Mauro Carvalho Chehab,
	Sergei Shtylyov, David S . Miller, Greg Kroah-Hartman,
	Liam Girdwood, Mark Brown, Jaroslav Kysela, Takashi Iwai,
	Arnd Bergmann, Kuninori Morimoto, linux-renesas-soc,
	linux-arm-kernel, dmaengine
In-Reply-To: <1524230914-10175-4-git-send-email-geert+renesas@glider.be>

Hi Geert,

Thank you for the patch.

On Friday, 20 April 2018 16:28:29 EEST Geert Uytterhoeven wrote:
> The Renesas Fine Display Processor driver is used on Renesas R-Car SoCs
> only.  Since commit 9b5ba0df4ea4f940 ("ARM: shmobile: Introduce
> ARCH_RENESAS") is ARCH_RENESAS a more appropriate platform dependency
> than the legacy ARCH_SHMOBILE, hence use the former.
> 
> This will allow to drop ARCH_SHMOBILE on ARM and ARM64 in the near
> future.
> 
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>

Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>

How would you like to get this merged ?

> ---
>  drivers/media/platform/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/media/platform/Kconfig b/drivers/media/platform/Kconfig
> index f9235e8f8e962d2e..7ad4725f9d1f9627 100644
> --- a/drivers/media/platform/Kconfig
> +++ b/drivers/media/platform/Kconfig
> @@ -396,7 +396,7 @@ config VIDEO_SH_VEU
>  config VIDEO_RENESAS_FDP1
>  	tristate "Renesas Fine Display Processor"
>  	depends on VIDEO_DEV && VIDEO_V4L2 && HAS_DMA
> -	depends on ARCH_SHMOBILE || COMPILE_TEST
> +	depends on ARCH_RENESAS || COMPILE_TEST
>  	depends on (!ARCH_RENESAS && !VIDEO_RENESAS_FCP) || VIDEO_RENESAS_FCP
>  	select VIDEOBUF2_DMA_CONTIG
>  	select V4L2_MEM2MEM_DEV

-- 
Regards,

Laurent Pinchart

^ permalink raw reply

* Re: Page allocator bottleneck
From: Aaron Lu @ 2018-04-21  8:15 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Linux Kernel Network Developers, linux-mm, Mel Gorman,
	David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko
In-Reply-To: <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com>

Sorry to bring up an old thread...

On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote:
> 
> 
> On 18/09/2017 12:16 PM, Tariq Toukan wrote:
> > 
> > 
> > On 15/09/2017 1:23 PM, Mel Gorman wrote:
> > > On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
> > > > Insights: Major degradation between #1 and #2, not getting any
> > > > close to linerate! Degradation is fixed between #2 and #3. This is
> > > > because page allocator cannot stand the higher allocation rate. In
> > > > #2, we also see that the addition of rings (cores) reduces BW (!!),
> > > > as result of increasing congestion over shared resources.
> > > > 
> > > 
> > > Unfortunately, no surprises there.
> > > 
> > > > Congestion in this case is very clear. When monitored in perf
> > > > top: 85.58% [kernel] [k] queued_spin_lock_slowpath
> > > > 
> > > 
> > > While it's not proven, the most likely candidate is the zone lock
> > > and that should be confirmed using a call-graph profile. If so, then
> > > the suggestion to tune to the size of the per-cpu allocator would
> > > mitigate the problem.
> > > 
> > Indeed, I tuned the per-cpu allocator and bottleneck is released.
> > 
> 
> Hi all,
> 
> After leaving this task for a while doing other tasks, I got back to it now
> and see that the good behavior I observed earlier was not stable.

I posted a patchset to improve zone->lock contention for order-0 pages
recently, it can almost eliminate 80% zone->lock contention for
will-it-scale/page_fault1 testcase when tested on a 2 sockets Intel
Skylake server and it doesn't require PCP size tune, so should have
some effects on your workload where one CPU does allocation while
another does free.

It did this by some disruptive changes:
1 on free path, it skipped doing merge(so could be bad for mixed
  workloads where both 4K and high order pages are needed);
2 on allocation path, it avoided touching multiple cachelines.

RFC v2 patchset:
https://lkml.org/lkml/2018/3/20/171

repo:
https://github.com/aaronlu/linux zone_lock_rfc_v2

 
> Recall: I work with a modified driver that allocates a page (4K) per packet
> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
> NICs.
> 
> Performance is good as long as pages are available in the allocating cores's
> PCP.
> Issue is that pages are allocated in one core, then free'd in another,
> making it's hard for the PCP to work efficiently, and both the allocator
> core and the freeing core need to access the buddy allocator very often.
> 
> I'd like to share with you some testing numbers:
> 
> Test: ./super_netperf 128 -H 24.134.0.51 -l 1000
> 
> 100% cpu on all cores, top func in perf:
>    84.98%  [kernel]             [k] queued_spin_lock_slowpath
> 
> system wide (all cores)
>            1135941      kmem:mm_page_alloc
> 
>            2606629      kmem:mm_page_free
> 
>                  0      kmem:mm_page_alloc_extfrag
>            4784616      kmem:mm_page_alloc_zone_locked
> 
>               1337      kmem:mm_page_free_batched
> 
>            6488213      kmem:mm_page_pcpu_drain
> 
>            8925503      net:napi_gro_receive_entry
> 
> 
> Two types of cores:
> A core mostly running napi (8 such cores):
>             221875      kmem:mm_page_alloc
> 
>              17100      kmem:mm_page_free
> 
>                  0      kmem:mm_page_alloc_extfrag
>             766584      kmem:mm_page_alloc_zone_locked
> 
>                 16      kmem:mm_page_free_batched
> 
>                 35      kmem:mm_page_pcpu_drain
> 
>            1340139      net:napi_gro_receive_entry
> 
> 
> Other core, mostly running user application (40 such):
>                  2      kmem:mm_page_alloc
> 
>              38922      kmem:mm_page_free
> 
>                  0      kmem:mm_page_alloc_extfrag
>                  1      kmem:mm_page_alloc_zone_locked
> 
>                  8      kmem:mm_page_free_batched
> 
>             107289      kmem:mm_page_pcpu_drain
> 
>                 34      net:napi_gro_receive_entry
> 
> 
> As you can see, sync overhead is enormous.
> 
> PCP-wise, a key improvement in such scenarios would be reached if we could
> (1) keep and handle the allocated page on same cpu, or (2) somehow get the
> page back to the allocating core's PCP in a fast-path, without going through
> the regular buddy allocator paths.

^ permalink raw reply

* Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
From: Christoph Hellwig @ 2018-04-21  9:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, linux-kernel, Soheil Hassas Yeganeh,
	Eric Dumazet, linux-mm, linux-fsdevel
In-Reply-To: <20180420155542.122183-1-edumazet@google.com>

On Fri, Apr 20, 2018 at 08:55:38AM -0700, Eric Dumazet wrote:
> This patch series provide a new mmap_hook to fs willing to grab
> a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
> 
> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
> is held), and improve multi-threading scalability. 

Missing CC to linu-fsdevel and linux-mm that will have to decide.

We've rejected this approach multiple times before, so you better
make a really good argument for it.

introducing a multiplexer that overloads a single method certainly
doesn't help making that case.

^ permalink raw reply

* GREETING  FROM MISS QADESA,,F
From: Miss Qadesa AbdulAziz @ 2018-04-21  9:22 UTC (permalink / raw)
  To: netdev

Friend,
My name is Miss Qadesa AbdulAziz and I am 17 years old girl from Syria.  There is serious war crisis here in Syria, and I have lost my parents and
my two brothers in this war. I want you to help me and receive ($7.md)  which my late father deposited with my name in a bank in London. I want
to come to your country and start a new life  and invest with you, because am the only survival
in my family.

I wait to hear from you, Please do not let me die here, I begging you the name of Almighty. please respond here my pirate email    qadesa@protonmail.com

Regards
Miss Qadesa AbdulAziz

^ permalink raw reply

* Re: Grant
From: M. M. Fridman @ 2018-04-21  5:41 UTC (permalink / raw)
  To: Recipients

I Mikhail Fridman. has selected you specially as one of my beneficiaries
for my Charitable Donation, Just as I have declared on May 23, 2016 to give
my fortune as charity.

Check the link below for confirmation:

http://www.ibtimes.co.uk/russias-second-wealthiest-man-mikhail-fridman-plans-leaving-14-2bn-fortune-charity-1561604

Reply as soon as possible with further directives.

Best Regards,
Mikhail Fridman.

^ permalink raw reply

* Re: WARNING: refcount bug in should_fail
From: Tetsuo Handa @ 2018-04-21 10:26 UTC (permalink / raw)
  To: ebiederm, dvyukov
  Cc: syzbot+84371b6062cb639d797e, syzkaller-bugs, viro, linux-fsdevel,
	linux-mm, netdev
In-Reply-To: <877epnj7bq.fsf@xmission.com>

Eric W. Biederman wrote:
> Al Viro <viro@ZenIV.linux.org.uk> writes:
> 
> > On Mon, Apr 02, 2018 at 10:59:34PM +0100, Al Viro wrote:
> >
> >> FWIW, I'm going through the ->kill_sb() instances, fixing that sort
> >> of bugs (most of them preexisting, but I should've checked instead
> >> of assuming that everything's fine).  Will push out later tonight.
> >
> > OK, see vfs.git#for-linus.  Caught: 4 old bugs (allocation failure
> > in fill_super oopses ->kill_sb() in hypfs, jffs2 and orangefs resp.
> > and double-dput in late failure exit in rpc_fill_super())
> > and 5 regressions from register_shrinker() failure recovery.
> 
> One issue with your vfs.git#for-linus branch.
> 
> It is missing Fixes tags and  Cc: stable on those patches.
> As the bug came in v4.15 those tags would really help the stable
> maintainers get the recent regression fixes applied.

OK. The patch was sent to linux.git as commit 8e04944f0ea8b838.

#syz fix: mm,vmscan: Allow preallocating memory for register_shrinker().

^ permalink raw reply

* Re: general protection fault in smc_getname
From: syzbot @ 2018-04-21 11:55 UTC (permalink / raw)
  To: davem, linux-kernel, linux-s390, netdev, syzkaller-bugs, ubraun,
	ubraun
In-Reply-To: <001a1145ecc48d1438056655dcd9@google.com>

syzbot has found reproducer for the following crash on upstream commit
83beed7b2b26f232d782127792dd0cd4362fdc41 (Fri Apr 20 17:56:32 2018 +0000)
Merge branch 'fixes' of  
git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=9605e6cace1b5efd4a0a

So far this crash happened 6 times on net-next, upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=4803108223844352
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=6277384739225600
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5836548759093248
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+9605e6cace1b5efd4a0a@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4548 Comm: syzkaller769662 Not tainted 4.17.0-rc1+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:smc_getname+0x124/0x1c0 net/smc/af_smc.c:1089
RSP: 0018:ffff8801ad1d7c20 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff873e3a58
RDX: 0000000000000005 RSI: ffffffff873e3af6 RDI: 0000000000000028
RBP: ffff8801ad1d7c48 R08: ffff8801b10ae280 R09: ffffed0036321370
R10: ffffed0036321370 R11: ffff8801b1909b83 R12: 0000000000000000
R13: ffff8801ad1d7d10 R14: ffff8801a866e0c0 R15: dffffc0000000000
FS:  0000000001a42880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000080 CR3: 00000001ac8b3000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  __sys_getsockname+0x184/0x380 net/socket.c:1699
  __do_sys_getsockname net/socket.c:1714 [inline]
  __se_sys_getsockname net/socket.c:1711 [inline]
  __x64_sys_getsockname+0x73/0xb0 net/socket.c:1711
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43fce9
RSP: 002b:00007fff2eba2418 EFLAGS: 00000217 ORIG_RAX: 0000000000000033
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fce9
RDX: 0000000020000080 RSI: 0000000020000000 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
R10: 00000000004002c8 R11: 0000000000000217 R12: 0000000000401610
R13: 00000000004016a0 R14: 0000000000000000 R15: 0000000000000000
Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 99 00 00 00 48 8b 9b 50 04 00 00 48  
b8 00 00 00 00 00 fc ff df 48 8d 7b 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00  
75 70 48 b8 00 00 00 00 00 fc ff df 4c 8b 73 28 49
RIP: smc_getname+0x124/0x1c0 net/smc/af_smc.c:1089 RSP: ffff8801ad1d7c20
---[ end trace 9f5c3169466d9443 ]---

^ permalink raw reply

* Re: [RFC PATCH ghak32 V2 11/13] audit: add support for containerid to network namespaces
From: Paul Moore @ 2018-04-21 12:10 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, luto-DgEjT+Ai2ygdnm+yROfE0A,
	jlayton-H+wXaHxf7aLQT0dZR+AlfA, carlos-H+wXaHxf7aLQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, LKML,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, Linux-Audit Mailing List,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, simo-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eric Paris
In-Reply-To: <20180420204225.iik2lgtj6gx2ep4w-bcJWsdo4jJjeVoXN4CMphl7TgLCtbB0G@public.gmane.org>

On April 20, 2018 4:48:34 PM Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
On 2018-04-20 16:22, Paul Moore wrote:
On Fri, Apr 20, 2018 at 4:02 PM, Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
On 2018-04-18 21:46, Paul Moore wrote:
On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
Audit events could happen in a network namespace outside of a task
context due to packets received from the net that trigger an auditing
rule prior to being associated with a running task.  The network
namespace could in use by multiple containers by association to the
tasks in that network namespace.  We still want a way to attribute
these events to any potential containers.  Keep a list per network
namespace to track these container identifiiers.

Add/increment the container identifier on:
- initial setting of the container id via /proc
- clone/fork call that inherits a container identifier
- unshare call that inherits a container identifier
- setns call that inherits a container identifier
Delete/decrement the container identifier on:
- an inherited container id dropped when child set
- process exit
- unshare call that drops a net namespace
- setns call that drops a net namespace

See: https://github.com/linux-audit/audit-kernel/issues/32
See: https://github.com/linux-audit/audit-testsuite/issues/64
Signed-off-by: Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
include/linux/audit.h       |  7 +++++++
include/net/net_namespace.h | 12 ++++++++++++
kernel/auditsc.c            |  9 ++++++---
kernel/nsproxy.c            |  6 ++++++
net/core/net_namespace.c    | 45 +++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 76 insertions(+), 3 deletions(-)

...

diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f6c5d33..d9f1090 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -140,6 +140,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
struct nsproxy *old_ns = tsk->nsproxy;
struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
struct nsproxy *new_ns;
+       u64 containerid = audit_get_containerid(tsk);

if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
@@ -167,6 +168,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
return  PTR_ERR(new_ns);

tsk->nsproxy = new_ns;
+       net_add_audit_containerid(new_ns->net_ns, containerid);
return 0;
}

Hopefully we can handle this in audit_net_init(), we just need to
figure out where we can get the correct task_struct for the audit
container ID (some backpointer in the net struct?).

I don't follow.  This needs to happen on every task startup.
audit_net_init() is only called when a new network namespace starts up.

Yep, sorry, my mistake.  I must have confused myself when I was
looking at the code.

I'm thinking out loud here, bear with me ...

Assuming we move the netns/audit-container-ID tracking to audit_net,
and considering we already have an audit hook in copy_process() (it
calls audit_alloc()), would this be better handled by the
copy_process() hook?  This ignores naming, audit_alloc() reuse, etc.;
those can be easily fixed.  I'm just thinking of ways to limit our
impact on the core kernel and leverage our existing interaction
points.

The new namespace hasn't been cloned yet and this is the only function
where we have access to both namespaces, so I don't see how that could
work...

I'll take another, closer look, with v3.

paul moore

- RGB

--
Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

--
paul moore
www.paul-moore.com

^ permalink raw reply related

* Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
From: Richard Guy Briggs @ 2018-04-21 14:34 UTC (permalink / raw)
  To: Paul Moore
  Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
	dhowells, viro, simo, Eric Paris, serge
In-Reply-To: <CAHC9VhTyvxxj2e2Gn+iyW6iLLeYB7hp8a+JvfeMmJ2nUPqtEaw@mail.gmail.com>

On 2018-04-18 19:47, Paul Moore wrote:
> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > Implement the proc fs write to set the audit container ID of a process,
> > emitting an AUDIT_CONTAINER record to document the event.
> >
> > This is a write from the container orchestrator task to a proc entry of
> > the form /proc/PID/containerid where PID is the process ID of the newly
> > created task that is to become the first task in a container, or an
> > additional task added to a container.
> >
> > The write expects up to a u64 value (unset: 18446744073709551615).
> >
> > This will produce a record such as this:
> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
> >
> > The "op" field indicates an initial set.  The "pid" to "ses" fields are
> > the orchestrator while the "opid" field is the object's PID, the process
> > being "contained".  Old and new container ID values are given in the
> > "contid" fields, while res indicates its success.
> >
> > It is not permitted to self-set, unset or re-set the container ID.  A
> > child inherits its parent's container ID, but then can be set only once
> > after.
> >
> > See: https://github.com/linux-audit/audit-kernel/issues/32
> >
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > ---
> >  fs/proc/base.c             | 37 ++++++++++++++++++++
> >  include/linux/audit.h      | 16 +++++++++
> >  include/linux/init_task.h  |  4 ++-
> >  include/linux/sched.h      |  1 +
> >  include/uapi/linux/audit.h |  2 ++
> >  kernel/auditsc.c           | 84 ++++++++++++++++++++++++++++++++++++++++++++++
> >  6 files changed, 143 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 60316b5..6ce4fbe 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -1299,6 +1299,41 @@ static ssize_t proc_sessionid_read(struct file * file, char __user * buf,
> >         .read           = proc_sessionid_read,
> >         .llseek         = generic_file_llseek,
> >  };
> > +
> > +static ssize_t proc_containerid_write(struct file *file, const char __user *buf,
> > +                                  size_t count, loff_t *ppos)
> > +{
> > +       struct inode *inode = file_inode(file);
> > +       u64 containerid;
> > +       int rv;
> > +       struct task_struct *task = get_proc_task(inode);
> > +
> > +       if (!task)
> > +               return -ESRCH;
> > +       if (*ppos != 0) {
> > +               /* No partial writes. */
> > +               put_task_struct(task);
> > +               return -EINVAL;
> > +       }
> > +
> > +       rv = kstrtou64_from_user(buf, count, 10, &containerid);
> > +       if (rv < 0) {
> > +               put_task_struct(task);
> > +               return rv;
> > +       }
> > +
> > +       rv = audit_set_containerid(task, containerid);
> > +       put_task_struct(task);
> > +       if (rv < 0)
> > +               return rv;
> > +       return count;
> > +}
> > +
> > +static const struct file_operations proc_containerid_operations = {
> > +       .write          = proc_containerid_write,
> > +       .llseek         = generic_file_llseek,
> > +};
> > +
> >  #endif
> >
> >  #ifdef CONFIG_FAULT_INJECTION
> > @@ -2961,6 +2996,7 @@ static int proc_pid_patch_state(struct seq_file *m, struct pid_namespace *ns,
> >  #ifdef CONFIG_AUDITSYSCALL
> >         REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
> >         REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> > +       REG("containerid", S_IWUSR, proc_containerid_operations),
> >  #endif
> >  #ifdef CONFIG_FAULT_INJECTION
> >         REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> > @@ -3355,6 +3391,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
> >  #ifdef CONFIG_AUDITSYSCALL
> >         REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
> >         REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> > +       REG("containerid", S_IWUSR, proc_containerid_operations),
> >  #endif
> >  #ifdef CONFIG_FAULT_INJECTION
> >         REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> > diff --git a/include/linux/audit.h b/include/linux/audit.h
> > index af410d9..fe4ba3f 100644
> > --- a/include/linux/audit.h
> > +++ b/include/linux/audit.h
> > @@ -29,6 +29,7 @@
> >
> >  #define AUDIT_INO_UNSET ((unsigned long)-1)
> >  #define AUDIT_DEV_UNSET ((dev_t)-1)
> > +#define INVALID_CID AUDIT_CID_UNSET
> 
> Why can't we just use AUDIT_CID_UNSET?  Is there an important
> distinction?  If so, they shouldn't they have different values?

One was intended as user-facing and the other was intended for kernel
internal.  As you point out, this does not appear to be necessary since
they are both the same type.  This was to mirror loginuid due to UID
namespace practice to seperate the two to make things very clear that a
userspace view of a UID needed to be translated from the user's user
namespace to the kernel's absolute view of UIDs from the init user
namespace.  Since container ID meanings do not depend on any namespace
context, I agree we can use just one and I'd go with AUDIT_CID_UNSET.

> If we do need to keep INVALID_CID, let's rename it to
> AUDIT_CID_INVALID so we have some consistency to the naming patterns
> and we stress that it is an *audit* container ID.
> 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d258826..1b82191 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -796,6 +796,7 @@ struct task_struct {
> >  #ifdef CONFIG_AUDITSYSCALL
> >         kuid_t                          loginuid;
> >         unsigned int                    sessionid;
> > +       u64                             containerid;
> 
> This one line addition to the task_struct scares me the most of
> anything in this patchset.  Why?  It's a field named "containerid" in
> a perhaps one of the most widely used core kernel structures; the
> possibilities for abuse are endless, and it's foolish to think we
> would ever be able to adequately police this.

Fair enough.

> Unfortunately, we can't add the field to audit_context as things
> currently stand because we don't always allocate an audit_context,
> it's dependent on the system's configuration, and we need to track the
> audit container ID for a given process, regardless of the audit
> configuration.  Pretty much the same reason why loginuid and sessionid
> are located directly in task_struct now.  As I stressed during the
> design phase, I really want to keep this as an *audit* container ID
> and not a general purpose kernel wide container ID.  If the kernel
> ever grows a general purpose container ID token, I'll be the first in
> line to convert the audit code, but I don't want audit to be that
> general purpose mechanism ... audit is hated enough as-is ;)

When would we need an audit container ID when audit is not enabled
enough to have an audit_context?

If it is only used for audit, and audit is the only consumer, and audit
can only use it when it is enabled, then we can just return success to
any write to the proc filehandle, or not even present it.  Nothing will
be able to know that value wasn't used.

When are loginuid and sessionid used now when audit is not enabled (or
should I say, explicitly disabled)?

> I think the right solution to this is to create another new struct,
> audit_task_info (or similar, the name really isn't that important),
> which would be stored as a pointer in task_struct and would replace
> the audit_context pointer, loginuid, sessionid, and the newly proposed
> containerid.  The new audit_task_info would always be allocated in the
> audit_alloc() function (please use kmem_cache), and the audit_context
> pointer included inside would continue to be allocated based on the
> existing conditions.  By keeping audit_task_info as a pointer inside
> task_struct we could hide the structure definition inside
> kernel/audit*.c and make it much more difficult for other subsystems
> to abuse it.[1]
> 
>   struct audit_task_info {
>     kuid_t loginuid;
>     unsigned int sessionid;
>     u64 containerid;
>     struct audit_context *ctx;
>   }

I agree this looks like a good change.

> Actually, we might even want to consider storing audit_context in
> audit_task_info (no pointer), or making it a zero length array
> (ctx[0]) and going with a variable sized allocation of audit_task_info
> ... but all that could be done as a follow up optimization once we get
> the basic idea sorted.
> 
> [1] If for some reason allocating audit_task_info becomes too much
> overhead to bear (somewhat doubtful since we would only do it at task
> creation), we could do some ugly tricks to directly include an
> audit_task_struct chunk in task_struct but I'd like to avoid that if
> possible (and I think we can).
> 
> >  #endif
> >         struct seccomp                  seccomp;
> 
> ...
> 
> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> > index 4e61a9e..921a71f 100644
> > --- a/include/uapi/linux/audit.h
> > +++ b/include/uapi/linux/audit.h
> > @@ -71,6 +71,7 @@
> >  #define AUDIT_TTY_SET          1017    /* Set TTY auditing status */
> >  #define AUDIT_SET_FEATURE      1018    /* Turn an audit feature on or off */
> >  #define AUDIT_GET_FEATURE      1019    /* Get which features are enabled */
> > +#define AUDIT_CONTAINER                1020    /* Define the container id and information */
> >
> >  #define AUDIT_FIRST_USER_MSG   1100    /* Userspace messages mostly uninteresting to kernel */
> >  #define AUDIT_USER_AVC         1107    /* We filter this differently */
> > @@ -465,6 +466,7 @@ struct audit_tty_status {
> >  };
> >
> >  #define AUDIT_UID_UNSET (unsigned int)-1
> > +#define AUDIT_CID_UNSET ((u64)-1)
> 
> I think we need to decide if we want to distinguish between the "host"
> (e.g. init ns) and "unset".  Looking at this patch (I've only quickly
> skimmed the others so far) it would appear that you don't think we
> need to worry about this distinction; that's fine, but let's make it
> explicit with a comment in the code that AUDIT_CID_UNSET means "unset"
> as well as "host".

I don't see any reason to distinguish between "host" and "unset".  Since
a container doesn't have a concrete definition based in namespaces, the
initial namespace set is meaningless here.

Is there value in having a container orchestrator process have a
reserved container ID that has a policy distinct from any other
container?  If so, then I could see the value in making the distinction.
For example, I've heard of interest in systemd acting as a container
orchestrator, so if it took on that role as PID 1, then every process in
the system would inherit that ID and none would be unset.

I can't picture how having seperate "host" and "unset" values helps us.

> If we do need to make a distinction, let's add a constant/macro for "host".

Currently "unset" is -1 which fits the convention used for sessionid and
loginuid and a number of others, so I think it makes sense to stick with
that.  If we decide we need a "host" flag, would it make sense to use 0
or (u64)-2?

> >  /* audit_rule_data supports filter rules with both integer and string
> >   * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index 4e0a4ac..29c8482 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid)
> >         return rc;
> >  }
> >
> > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
> > +{
> > +       struct task_struct *parent;
> > +       u64 pcontainerid, ccontainerid;
> > +
> > +       /* Don't allow to set our own containerid */
> > +       if (current == task)
> > +               return -EPERM;
> 
> Why not?  Is there some obvious security concern that I missing?

We then lose the distinction in the AUDIT_CONTAINER record between the
initiating PID and the target PID.  This was outlined in the proposal.

Having said that, I'm still not sure we have protected sufficiently from
a child turning around and setting it's parent's as yet unset or
inherited audit container ID.

> I ask because I suppose it might be possible for some container
> runtime to do a fork, setup some of the environment and them exec the
> container (before you answer the obvious "namespaces!" please remember
> we're not trying to define containers).

I don't think namespaces have any bearing on this concern since none are
required.

> > +       /* Don't allow the containerid to be unset */
> > +       if (!cid_valid(containerid))
> > +               return -EINVAL;
> > +       /* if we don't have caps, reject */
> > +       if (!capable(CAP_AUDIT_CONTROL))
> > +               return -EPERM;
> > +       /* if containerid is unset, allow */
> > +       if (!audit_containerid_set(task))
> > +               return 0;
> > +       /* it is already set, and not inherited from the parent, reject */
> > +       ccontainerid = audit_get_containerid(task);
> > +       rcu_read_lock();
> > +       parent = rcu_dereference(task->real_parent);
> > +       rcu_read_unlock();
> > +       task_lock(parent);
> > +       pcontainerid = audit_get_containerid(parent);
> > +       task_unlock(parent);
> > +       if (ccontainerid != pcontainerid)
> > +               return -EPERM;
> > +       return 0;
> > +}
> > +
> > +static void audit_log_set_containerid(struct task_struct *task, u64 oldcontainerid,
> > +                                     u64 containerid, int rc)
> > +{
> > +       struct audit_buffer *ab;
> > +       uid_t uid;
> > +       struct tty_struct *tty;
> > +
> > +       if (!audit_enabled)
> > +               return;
> > +
> > +       ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONTAINER);
> > +       if (!ab)
> > +               return;
> > +
> > +       uid = from_kuid(&init_user_ns, task_uid(current));
> > +       tty = audit_get_tty(current);
> > +
> > +       audit_log_format(ab, "op=set pid=%d uid=%u", task_tgid_nr(current), uid);
> > +       audit_log_task_context(ab);
> > +       audit_log_format(ab, " auid=%u tty=%s ses=%u opid=%d old-contid=%llu contid=%llu res=%d",
> > +                        from_kuid(&init_user_ns, audit_get_loginuid(current)),
> > +                        tty ? tty_name(tty) : "(none)", audit_get_sessionid(current),
> > +                        task_tgid_nr(task), oldcontainerid, containerid, !rc);
> > +
> > +       audit_put_tty(tty);
> > +       audit_log_end(ab);
> > +}
> > +
> > +/**
> > + * audit_set_containerid - set current task's audit_context containerid
> > + * @containerid: containerid value
> > + *
> > + * Returns 0 on success, -EPERM on permission failure.
> > + *
> > + * Called (set) from fs/proc/base.c::proc_containerid_write().
> > + */
> > +int audit_set_containerid(struct task_struct *task, u64 containerid)
> > +{
> > +       u64 oldcontainerid;
> > +       int rc;
> > +
> > +       oldcontainerid = audit_get_containerid(task);
> > +
> > +       rc = audit_set_containerid_perm(task, containerid);
> > +       if (!rc) {
> > +               task_lock(task);
> > +               task->containerid = containerid;
> > +               task_unlock(task);
> > +       }
> > +
> > +       audit_log_set_containerid(task, oldcontainerid, containerid, rc);
> > +       return rc;
> 
> Why are audit_set_containerid_perm() and audit_log_containerid()
> separate functions?

(I assume you mean audit_log_set_containerid()?)
It seemed clearer that all the permission checking was in one function
and its return code could be used to report the outcome when logging the
(attempted) action.  This is the same structure as audit_set_loginuid()
and it made sense.

This would be the time to connect it to a syscall if that seems like a
good idea and remove pid, uid, auid, tty, ses fields.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
From: Matthew Wilcox @ 2018-04-21 14:47 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Michal Hocko, David Miller, Andrew Morton, linux-mm, eric.dumazet,
	edumazet, bhutchings, netdev, linux-kernel, mst, jasowang,
	virtualization, dm-devel, Vlastimil Babka
In-Reply-To: <alpine.LRH.2.02.1804201704580.25408@file01.intranet.prod.int.rdu2.redhat.com>

On Fri, Apr 20, 2018 at 05:21:26PM -0400, Mikulas Patocka wrote:
> On Fri, 20 Apr 2018, Matthew Wilcox wrote:
> > On Fri, Apr 20, 2018 at 04:54:53PM -0400, Mikulas Patocka wrote:
> > > On Fri, 20 Apr 2018, Michal Hocko wrote:
> > > > No way. This is just wrong! First of all, you will explode most likely
> > > > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > > > enabled quite often.
> > > 
> > > You're an evil person who doesn't want to fix bugs.
> > 
> > Steady on.  There's no need for that.  Michal isn't evil.  Please
> > apologise.
> 
> I see this attitude from Michal again and again.

Fine; then *say that*.  I also see Michal saying "No" a lot.  Sometimes
I agree with him, sometimes I don't.  I think he genuinely wants the best
code in the kernel, and saying "No" is part of it.

> He didn't want to fix vmalloc(GFP_NOIO)

I don't remember that conversation, so I don't know whether I agree with
his reasoning or not.  But we are supposed to be moving away from GFP_NOIO
towards marking regions with memalloc_noio_save() / restore.  If you do
that, you won't need vmalloc(GFP_NOIO).

> he didn't want to fix alloc_pages sleeping when __GFP_NORETRY is used.

The GFP flags are a mess, still.

> So what should I say? Fix them and 
> you won't be evil :-)

No, you should reserve calling somebody evil for truly evil things.

> (he could also fix the oom killer, so that it is triggered when 
> free_memory+cache+free_swap goes beyond a threshold and not when you loop 
> too long in the allocator)

... that also doesn't make somebody evil.

> I already said that we can change it from CONFIG_DEBUG_VM to 
> CONFIG_DEBUG_SG - or to whatever other option you may want, just to make 
> sure that it is enabled in distro debug kernels by default.

Yes, and I think that's the right idea.  So send a v2 and ignore the
replies that are clearly relating to an earlier version of the patch.
Not everybody reads every mail in the thread before responding to one they
find interesting.  Yes, ideally, one would, but sometimes one doesn't.

^ permalink raw reply

* Re: [PATCH net-next 2/2] netns: isolate seqnums to use per-netns locks
From: Christian Brauner @ 2018-04-21 15:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: davem, netdev, linux-kernel, avagin, ktkhai, serge, gregkh
In-Reply-To: <20180420161643.GA15182@gmail.com>

On Fri, Apr 20, 2018 at 06:16:44PM +0200, Christian Brauner wrote:
> On Fri, Apr 20, 2018 at 03:56:28PM +0200, Christian Brauner wrote:
> > On Wed, Apr 18, 2018 at 11:52:47PM +0200, Christian Brauner wrote:
> > > On Wed, Apr 18, 2018 at 11:55:52AM -0500, Eric W. Biederman wrote:
> > > > Christian Brauner <christian.brauner@ubuntu.com> writes:
> > > > 
> > > > > Now that it's possible to have a different set of uevents in different
> > > > > network namespaces, per-network namespace uevent sequence numbers are
> > > > > introduced. This increases performance as locking is now restricted to the
> > > > > network namespace affected by the uevent rather than locking
> > > > > everything.
> > > > 
> > > > Numbers please.  I personally expect that the netlink mc_list issues
> > > > will swamp any benefit you get from this.
> > > 
> > > I wouldn't see how this would be the case. The gist of this is:
> > > Everytime you send a uevent into a network namespace *not* owned by
> > > init_user_ns you currently *have* to take mutex_lock(uevent_sock_list)
> > > effectively blocking the host from processing uevents even though
> > > - the uevent you're receiving might be totally different from the
> > >   uevent that you're sending
> > > - the uevent socket of the non-init_user_ns owned network namespace
> > >   isn't even recorded in the list.
> > > 
> > > The other argument is that we now have properly isolated network
> > > namespaces wrt to uevents such that each netns can have its own set of
> > > uevents. This can either happen by a sufficiently privileged userspace
> > > process sending it uevents that are only dedicated to that specific
> > > netns. Or - and this *has been true for a long time* - because network
> > > devices are *properly namespaced*. Meaning a uevent for that network
> > > device is *tied to a network namespace*. For both cases the uevent
> > > sequence numbering will be absolutely misleading. For example, whenever
> > > you create e.g. a new veth device in a new network namespace it
> > > shouldn't be accounted against the initial network namespace but *only*
> > > against the network namespace that has that device added to it.
> > 
> > Eric, I did the testing. Here's what I did:
> > 
> > I compiled two 4.17-rc1 Kernels:
> > - one with per netns uevent seqnums with decoupled locking
> > - one without per netns uevent seqnums with decoupled locking
> > 
> > # Testcase 1:
> > Only Injecting Uevents into network namespaces not owned by the initial user
> > namespace.
> > - created 1000 new user namespace + network namespace pairs
> > - opened a uevent listener in each of those namespace pairs
> > - injected uevents into each of those network namespaces 10,000 times meaning
> >   10,000,000 (10 million) uevents were injected. (The high number of
> >   uevent injections should get rid of a lot of jitter.)
> > - Calculated the mean transaction time.
> > - *without* uevent sequence number namespacing:
> >   67 μs
> > - *with* uevent sequence number namespacing:
> >   55 μs
> > - makes a difference of 12 μs
> > 
> > # Testcase 2:
> > Injecting Uevents into network namespaces not owned by the initial user
> > namespace and network namespaces owned by the initial user namespace.
> > - created 500 new user namespace + network namespace pairs
> > - created 500 new network namespace pairs
> > - opened a uevent listener in each of those namespace pairs
> > - injected uevents into each of those network namespaces 10,000 times meaning
> >   10,000,000 (10 million) uevents were injected. (The high number of
> >   uevent injections should get rid of a lot of jitter.)
> > - Calculated the mean transaction time.
> > - *without* uevent sequence number namespacing:
> >   572 μs
> > - *with* uevent sequence number namespacing:
> >   514 μs
> > - makes a difference of 58 μs
> > 
> > So there's performance gain. The third case would be to create a bunch
> > of hanging processes that send SIGSTOP to themselves but do not actually
> > open a uevent socket in their respective namespaces and then inject
> > uevents into them. I expect there to be an even more performance
> > benefits since the rtnl_table_lock() isn't hit in this case because
> > there are no listeners.
> 
> I did the third test-case as well so:
> - created 500 new user namespace + network namespace pairs *without
>   uevent listeners*
> - created 500 new network namespace pairs *without uevent listeners*
> - injected uevents into each of those network namespaces 10,000 times meaning
>   10,000,000 (10 million) uevents were injected. (The high number of
>   uevent injections should get rid of a lot of jitter.)
> - Calculated the mean transaction time.
> - *without* uevent sequence number namespacing:
>   206 μs
> - *with* uevent sequence number namespacing:
>   163 μs
> - makes a difference of 43 μs
> 
> So this test-case shows performance improvement as well.

Just for fun, I did a simple statistical anlysis using t-tests and they
all show significant differences at alpha-level 0.001 (Which I chose
because it seemed 0.05 is a bit too lax.).

Testcase 1:
        Welch Two Sample t-test

data:  x1 and y1
t = 405.16, df = 18883000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 12.14949 12.26761
sample estimates:
mean of x mean of y
 68.48594  56.27739

Testcase 2:
        Welch Two Sample t-test

data:  x2 and y2
t = 38.685, df = 19682000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 55.10630 60.98815
sample estimates:
mean of x mean of y
 572.9684  514.9211

Testcase 3:
        Welch Two Sample t-test

data:  x3 and y3
t = 58.37, df = 17711000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 41.77860 44.68178
sample estimates:
mean of x mean of y
 207.2632  164.0330

Thanks!
Christian

^ permalink raw reply

* Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache
From: Willem de Bruijn @ 2018-04-21 15:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Paolo Abeni, Network Development, David S. Miller,
	Tariq Toukan
In-Reply-To: <20180420154836.3690a39e@redhat.com>

On Fri, Apr 20, 2018 at 9:48 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On 04/19/2018 12:40 AM, Paolo Abeni wrote:
>> > On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
>> >> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
> [...]
>> >
>> > Any suggestions for better results are more than welcome!
>>
>> Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
>> Seoul (netdev conference). Not tied to UDP, but a generic solution.
>
> Yes, I remember.  I think... was it the idea, where you basically
> wanted to queue back SKBs to the CPU that allocated them, right?
>
> Freeing an SKB on the same CPU that allocated it, have multiple
> advantages. (1) the SLUB allocator can use a non-atomic
> "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
> the SKB stay local.  (3) the atomic SKB refcnt/users stay local.
>
> We just have to avoid that queue back SKB's mechanism, doesn't cost
> more than the operations we expect to save.  Bulk transfer is an
> obvious approach.  For storing SKBs until they are returned, we already
> have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
> which SLUB/SLAB-bulk free to amortize cost (1).
>
> I guess, the missing information is that we don't know what CPU the SKB
> were created on...

For connected sockets, sk->sk_incoming_cpu has this data. It
records BH cpu on enqueue to udp socket, so one caveat is that
it may be wrong with rps/rfs.

Another option is to associate not with source cpu but napi struct
and have the device driver free in the context of its napi processing.
This has the additional benefit that skb->napi_id is already stored
per skb, so this also works for unconnected sockets.

Third, the skb->napi_id field is unused after setting sk->sk_napi_id
on sk enqueue, so the BH cpu could be stored here after that,
essentially extending sk_incoming_cpu to unconnected sockets.

^ permalink raw reply

* [PATCH net-next v2 0/2] fib rules extack support
From: Roopa Prabhu @ 2018-04-21 16:41 UTC (permalink / raw)
  To: davem; +Cc: netdev, dsa, idosch

From: Roopa Prabhu <roopa@cumulusnetworks.com>

First patch refactors code to move fib rule netlink handling
into a common function. This became obvious when adding
duplicate extack msgs in add and del paths. Second patch
adds extack msgs.

v2 - Dropped the ip route get support and selftests from
     the series to look at the input path some more (as pointed
     out by ido). Will come back to that next week when i have
     some time. resending just the extack part for now.

Roopa Prabhu (2):
  fib_rules: move common handling of newrule delrule msgs into
    fib_nl2rule
  net: fib_rules: add extack support

 include/net/fib_rules.h |   3 +-
 net/core/fib_rules.c    | 471 +++++++++++++++++++++++-------------------------
 net/decnet/dn_rules.c   |   7 +-
 net/ipv4/fib_rules.c    |   7 +-
 net/ipv4/ipmr.c         |   3 +-
 net/ipv6/fib6_rules.c   |   7 +-
 net/ipv6/ip6mr.c        |   3 +-
 7 files changed, 245 insertions(+), 256 deletions(-)

-- 
2.1.4

^ permalink raw reply

* [PATCH net-next v2 1/2] fib_rules: move common handling of newrule delrule msgs into fib_nl2rule
From: Roopa Prabhu @ 2018-04-21 16:41 UTC (permalink / raw)
  To: davem; +Cc: netdev, dsa, idosch
In-Reply-To: <1524328891-3647-1-git-send-email-roopa@cumulusnetworks.com>

From: Roopa Prabhu <roopa@cumulusnetworks.com>

This reduces code duplication in the fib rule add and del paths.
Get rid of validate_rulemsg. This became obvious when adding duplicate
extack support in fib newrule/delrule error paths.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 net/core/fib_rules.c | 436 +++++++++++++++++++++++----------------------------
 1 file changed, 192 insertions(+), 244 deletions(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 33958f8..6780110 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -387,206 +387,185 @@ unsigned int fib_rules_seq_read(struct net *net, int family)
 }
 EXPORT_SYMBOL_GPL(fib_rules_seq_read);
 
-static int validate_rulemsg(struct fib_rule_hdr *frh, struct nlattr **tb,
-			    struct fib_rules_ops *ops)
-{
-	int err = -EINVAL;
-
-	if (frh->src_len)
-		if (tb[FRA_SRC] == NULL ||
-		    frh->src_len > (ops->addr_size * 8) ||
-		    nla_len(tb[FRA_SRC]) != ops->addr_size)
-			goto errout;
-
-	if (frh->dst_len)
-		if (tb[FRA_DST] == NULL ||
-		    frh->dst_len > (ops->addr_size * 8) ||
-		    nla_len(tb[FRA_DST]) != ops->addr_size)
-			goto errout;
-
-	err = 0;
-errout:
-	return err;
-}
-
-static int rule_exists(struct fib_rules_ops *ops, struct fib_rule_hdr *frh,
-		       struct nlattr **tb, struct fib_rule *rule)
+static struct fib_rule *rule_find(struct fib_rules_ops *ops,
+				  struct fib_rule_hdr *frh,
+				  struct nlattr **tb,
+				  struct fib_rule *rule,
+				  bool user_priority)
 {
 	struct fib_rule *r;
 
 	list_for_each_entry(r, &ops->rules_list, list) {
-		if (r->action != rule->action)
+		if (rule->action && r->action != rule->action)
 			continue;
 
-		if (r->table != rule->table)
+		if (rule->table && r->table != rule->table)
 			continue;
 
-		if (r->pref != rule->pref)
+		if (user_priority && r->pref != rule->pref)
 			continue;
 
-		if (memcmp(r->iifname, rule->iifname, IFNAMSIZ))
+		if (rule->iifname[0] &&
+		    memcmp(r->iifname, rule->iifname, IFNAMSIZ))
 			continue;
 
-		if (memcmp(r->oifname, rule->oifname, IFNAMSIZ))
+		if (rule->oifname[0] &&
+		    memcmp(r->oifname, rule->oifname, IFNAMSIZ))
 			continue;
 
-		if (r->mark != rule->mark)
+		if (rule->mark && r->mark != rule->mark)
 			continue;
 
-		if (r->mark_mask != rule->mark_mask)
+		if (rule->mark_mask && r->mark_mask != rule->mark_mask)
 			continue;
 
-		if (r->tun_id != rule->tun_id)
+		if (rule->tun_id && r->tun_id != rule->tun_id)
 			continue;
 
 		if (r->fr_net != rule->fr_net)
 			continue;
 
-		if (r->l3mdev != rule->l3mdev)
+		if (rule->l3mdev && r->l3mdev != rule->l3mdev)
 			continue;
 
-		if (!uid_eq(r->uid_range.start, rule->uid_range.start) ||
-		    !uid_eq(r->uid_range.end, rule->uid_range.end))
+		if (uid_range_set(&rule->uid_range) &&
+		    (!uid_eq(r->uid_range.start, rule->uid_range.start) ||
+		    !uid_eq(r->uid_range.end, rule->uid_range.end)))
 			continue;
 
-		if (r->ip_proto != rule->ip_proto)
+		if (rule->ip_proto && r->ip_proto != rule->ip_proto)
 			continue;
 
-		if (!fib_rule_port_range_compare(&r->sport_range,
+		if (fib_rule_port_range_set(&rule->sport_range) &&
+		    !fib_rule_port_range_compare(&r->sport_range,
 						 &rule->sport_range))
 			continue;
 
-		if (!fib_rule_port_range_compare(&r->dport_range,
+		if (fib_rule_port_range_set(&rule->dport_range) &&
+		    !fib_rule_port_range_compare(&r->dport_range,
 						 &rule->dport_range))
 			continue;
 
 		if (!ops->compare(r, frh, tb))
 			continue;
-		return 1;
+		return r;
 	}
-	return 0;
+
+	return NULL;
 }
 
-int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh,
-		   struct netlink_ext_ack *extack)
+static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
+		       struct netlink_ext_ack *extack,
+		       struct fib_rules_ops *ops,
+		       struct nlattr *tb[],
+		       struct fib_rule **rule,
+		       bool *user_priority)
 {
 	struct net *net = sock_net(skb->sk);
 	struct fib_rule_hdr *frh = nlmsg_data(nlh);
-	struct fib_rules_ops *ops = NULL;
-	struct fib_rule *rule, *r, *last = NULL;
-	struct nlattr *tb[FRA_MAX+1];
-	int err = -EINVAL, unresolved = 0;
-
-	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
-		goto errout;
-
-	ops = lookup_rules_ops(net, frh->family);
-	if (ops == NULL) {
-		err = -EAFNOSUPPORT;
-		goto errout;
-	}
+	struct fib_rule *nlrule = NULL;
+	int err = -EINVAL;
 
-	err = nlmsg_parse(nlh, sizeof(*frh), tb, FRA_MAX, ops->policy, extack);
-	if (err < 0)
-		goto errout;
+	if (frh->src_len)
+		if (!tb[FRA_SRC] ||
+		    frh->src_len > (ops->addr_size * 8) ||
+		    nla_len(tb[FRA_SRC]) != ops->addr_size)
+			goto errout;
 
-	err = validate_rulemsg(frh, tb, ops);
-	if (err < 0)
-		goto errout;
+	if (frh->dst_len)
+		if (!tb[FRA_DST] ||
+		    frh->dst_len > (ops->addr_size * 8) ||
+		    nla_len(tb[FRA_DST]) != ops->addr_size)
+			goto errout;
 
-	rule = kzalloc(ops->rule_size, GFP_KERNEL);
-	if (rule == NULL) {
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
 	}
-	refcount_set(&rule->refcnt, 1);
-	rule->fr_net = net;
+	refcount_set(&nlrule->refcnt, 1);
+	nlrule->fr_net = net;
 
-	rule->pref = tb[FRA_PRIORITY] ? nla_get_u32(tb[FRA_PRIORITY])
-	                              : fib_default_rule_pref(ops);
+	if (tb[FRA_PRIORITY]) {
+		nlrule->pref = nla_get_u32(tb[FRA_PRIORITY]);
+		*user_priority = true;
+	} else {
+		nlrule->pref = fib_default_rule_pref(ops);
+	}
 
-	rule->proto = tb[FRA_PROTOCOL] ?
+	nlrule->proto = tb[FRA_PROTOCOL] ?
 		nla_get_u8(tb[FRA_PROTOCOL]) : RTPROT_UNSPEC;
 
 	if (tb[FRA_IIFNAME]) {
 		struct net_device *dev;
 
-		rule->iifindex = -1;
-		nla_strlcpy(rule->iifname, tb[FRA_IIFNAME], IFNAMSIZ);
-		dev = __dev_get_by_name(net, rule->iifname);
+		nlrule->iifindex = -1;
+		nla_strlcpy(nlrule->iifname, tb[FRA_IIFNAME], IFNAMSIZ);
+		dev = __dev_get_by_name(net, nlrule->iifname);
 		if (dev)
-			rule->iifindex = dev->ifindex;
+			nlrule->iifindex = dev->ifindex;
 	}
 
 	if (tb[FRA_OIFNAME]) {
 		struct net_device *dev;
 
-		rule->oifindex = -1;
-		nla_strlcpy(rule->oifname, tb[FRA_OIFNAME], IFNAMSIZ);
-		dev = __dev_get_by_name(net, rule->oifname);
+		nlrule->oifindex = -1;
+		nla_strlcpy(nlrule->oifname, tb[FRA_OIFNAME], IFNAMSIZ);
+		dev = __dev_get_by_name(net, nlrule->oifname);
 		if (dev)
-			rule->oifindex = dev->ifindex;
+			nlrule->oifindex = dev->ifindex;
 	}
 
 	if (tb[FRA_FWMARK]) {
-		rule->mark = nla_get_u32(tb[FRA_FWMARK]);
-		if (rule->mark)
+		nlrule->mark = nla_get_u32(tb[FRA_FWMARK]);
+		if (nlrule->mark)
 			/* compatibility: if the mark value is non-zero all bits
 			 * are compared unless a mask is explicitly specified.
 			 */
-			rule->mark_mask = 0xFFFFFFFF;
+			nlrule->mark_mask = 0xFFFFFFFF;
 	}
 
 	if (tb[FRA_FWMASK])
-		rule->mark_mask = nla_get_u32(tb[FRA_FWMASK]);
+		nlrule->mark_mask = nla_get_u32(tb[FRA_FWMASK]);
 
 	if (tb[FRA_TUN_ID])
-		rule->tun_id = nla_get_be64(tb[FRA_TUN_ID]);
+		nlrule->tun_id = nla_get_be64(tb[FRA_TUN_ID]);
 
 	err = -EINVAL;
 	if (tb[FRA_L3MDEV]) {
 #ifdef CONFIG_NET_L3_MASTER_DEV
-		rule->l3mdev = nla_get_u8(tb[FRA_L3MDEV]);
-		if (rule->l3mdev != 1)
+		nlrule->l3mdev = nla_get_u8(tb[FRA_L3MDEV]);
+		if (nlrule->l3mdev != 1)
 #endif
 			goto errout_free;
 	}
 
-	rule->action = frh->action;
-	rule->flags = frh->flags;
-	rule->table = frh_get_table(frh, tb);
+	nlrule->action = frh->action;
+	nlrule->flags = frh->flags;
+	nlrule->table = frh_get_table(frh, tb);
 	if (tb[FRA_SUPPRESS_PREFIXLEN])
-		rule->suppress_prefixlen = nla_get_u32(tb[FRA_SUPPRESS_PREFIXLEN]);
+		nlrule->suppress_prefixlen = nla_get_u32(tb[FRA_SUPPRESS_PREFIXLEN]);
 	else
-		rule->suppress_prefixlen = -1;
+		nlrule->suppress_prefixlen = -1;
 
 	if (tb[FRA_SUPPRESS_IFGROUP])
-		rule->suppress_ifgroup = nla_get_u32(tb[FRA_SUPPRESS_IFGROUP]);
+		nlrule->suppress_ifgroup = nla_get_u32(tb[FRA_SUPPRESS_IFGROUP]);
 	else
-		rule->suppress_ifgroup = -1;
+		nlrule->suppress_ifgroup = -1;
 
 	if (tb[FRA_GOTO]) {
-		if (rule->action != FR_ACT_GOTO)
+		if (nlrule->action != FR_ACT_GOTO)
 			goto errout_free;
 
-		rule->target = nla_get_u32(tb[FRA_GOTO]);
+		nlrule->target = nla_get_u32(tb[FRA_GOTO]);
 		/* Backward jumps are prohibited to avoid endless loops */
-		if (rule->target <= rule->pref)
+		if (nlrule->target <= nlrule->pref)
 			goto errout_free;
-
-		list_for_each_entry(r, &ops->rules_list, list) {
-			if (r->pref == rule->target) {
-				RCU_INIT_POINTER(rule->ctarget, r);
-				break;
-			}
-		}
-
-		if (rcu_dereference_protected(rule->ctarget, 1) == NULL)
-			unresolved = 1;
-	} else if (rule->action == FR_ACT_GOTO)
+	} else if (nlrule->action == FR_ACT_GOTO) {
 		goto errout_free;
+	}
 
-	if (rule->l3mdev && rule->table)
+	if (nlrule->l3mdev && nlrule->table)
 		goto errout_free;
 
 	if (tb[FRA_UID_RANGE]) {
@@ -595,34 +574,72 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout_free;
 		}
 
-		rule->uid_range = nla_get_kuid_range(tb);
+		nlrule->uid_range = nla_get_kuid_range(tb);
 
-		if (!uid_range_set(&rule->uid_range) ||
-		    !uid_lte(rule->uid_range.start, rule->uid_range.end))
+		if (!uid_range_set(&nlrule->uid_range) ||
+		    !uid_lte(nlrule->uid_range.start, nlrule->uid_range.end))
 			goto errout_free;
 	} else {
-		rule->uid_range = fib_kuid_range_unset;
+		nlrule->uid_range = fib_kuid_range_unset;
 	}
 
 	if (tb[FRA_IP_PROTO])
-		rule->ip_proto = nla_get_u8(tb[FRA_IP_PROTO]);
+		nlrule->ip_proto = nla_get_u8(tb[FRA_IP_PROTO]);
 
 	if (tb[FRA_SPORT_RANGE]) {
 		err = nla_get_port_range(tb[FRA_SPORT_RANGE],
-					 &rule->sport_range);
+					 &nlrule->sport_range);
 		if (err)
 			goto errout_free;
 	}
 
 	if (tb[FRA_DPORT_RANGE]) {
 		err = nla_get_port_range(tb[FRA_DPORT_RANGE],
-					 &rule->dport_range);
+					 &nlrule->dport_range);
 		if (err)
 			goto errout_free;
 	}
 
+	*rule = nlrule;
+
+	return 0;
+
+errout_free:
+	kfree(nlrule);
+errout:
+	return err;
+}
+
+int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh,
+		   struct netlink_ext_ack *extack)
+{
+	struct net *net = sock_net(skb->sk);
+	struct fib_rule_hdr *frh = nlmsg_data(nlh);
+	struct fib_rules_ops *ops = NULL;
+	struct fib_rule *rule = NULL, *r, *last = NULL;
+	struct nlattr *tb[FRA_MAX + 1];
+	int err = -EINVAL, unresolved = 0;
+	bool user_priority = false;
+
+	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
+		goto errout;
+
+	ops = lookup_rules_ops(net, frh->family);
+	if (!ops) {
+		err = -EAFNOSUPPORT;
+		goto errout;
+	}
+
+	err = nlmsg_parse(nlh, sizeof(*frh), tb, FRA_MAX, ops->policy, extack);
+	if (err < 0)
+		goto errout;
+
+	err = fib_nl2rule(skb, nlh, extack, ops, tb, &rule, &user_priority);
+	if (err)
+		goto errout;
+
 	if ((nlh->nlmsg_flags & NLM_F_EXCL) &&
-	    rule_exists(ops, frh, tb, rule)) {
+	    rule_find(ops, frh, tb, rule, user_priority)) {
 		err = -EEXIST;
 		goto errout_free;
 	}
@@ -637,6 +654,16 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh,
 		goto errout_free;
 
 	list_for_each_entry(r, &ops->rules_list, list) {
+		if (r->pref == rule->target) {
+			RCU_INIT_POINTER(rule->ctarget, r);
+			break;
+		}
+	}
+
+	if (rcu_dereference_protected(rule->ctarget, 1) == NULL)
+		unresolved = 1;
+
+	list_for_each_entry(r, &ops->rules_list, list) {
 		if (r->pref > rule->pref)
 			break;
 		last = r;
@@ -690,13 +717,11 @@ int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr *nlh,
 {
 	struct net *net = sock_net(skb->sk);
 	struct fib_rule_hdr *frh = nlmsg_data(nlh);
-	struct fib_rule_port_range sprange = {0, 0};
-	struct fib_rule_port_range dprange = {0, 0};
 	struct fib_rules_ops *ops = NULL;
-	struct fib_rule *rule, *r;
+	struct fib_rule *rule = NULL, *r, *nlrule = NULL;
 	struct nlattr *tb[FRA_MAX+1];
-	struct fib_kuid_range range;
 	int err = -EINVAL;
+	bool user_priority = false;
 
 	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
 		goto errout;
@@ -711,150 +736,73 @@ int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (err < 0)
 		goto errout;
 
-	err = validate_rulemsg(frh, tb, ops);
-	if (err < 0)
+	err = fib_nl2rule(skb, nlh, extack, ops, tb, &nlrule, &user_priority);
+	if (err)
 		goto errout;
 
-	if (tb[FRA_UID_RANGE]) {
-		range = nla_get_kuid_range(tb);
-		if (!uid_range_set(&range)) {
-			err = -EINVAL;
-			goto errout;
-		}
-	} else {
-		range = fib_kuid_range_unset;
+	rule = rule_find(ops, frh, tb, nlrule, user_priority);
+	if (!rule) {
+		err = -ENOENT;
+		goto errout;
 	}
 
-	if (tb[FRA_SPORT_RANGE]) {
-		err = nla_get_port_range(tb[FRA_SPORT_RANGE],
-					 &sprange);
-		if (err)
-			goto errout;
+	if (rule->flags & FIB_RULE_PERMANENT) {
+		err = -EPERM;
+		goto errout;
 	}
 
-	if (tb[FRA_DPORT_RANGE]) {
-		err = nla_get_port_range(tb[FRA_DPORT_RANGE],
-					 &dprange);
+	if (ops->delete) {
+		err = ops->delete(rule);
 		if (err)
 			goto errout;
 	}
 
-	list_for_each_entry(rule, &ops->rules_list, list) {
-		if (tb[FRA_PROTOCOL] &&
-		    (rule->proto != nla_get_u8(tb[FRA_PROTOCOL])))
-			continue;
-
-		if (frh->action && (frh->action != rule->action))
-			continue;
-
-		if (frh_get_table(frh, tb) &&
-		    (frh_get_table(frh, tb) != rule->table))
-			continue;
-
-		if (tb[FRA_PRIORITY] &&
-		    (rule->pref != nla_get_u32(tb[FRA_PRIORITY])))
-			continue;
-
-		if (tb[FRA_IIFNAME] &&
-		    nla_strcmp(tb[FRA_IIFNAME], rule->iifname))
-			continue;
-
-		if (tb[FRA_OIFNAME] &&
-		    nla_strcmp(tb[FRA_OIFNAME], rule->oifname))
-			continue;
-
-		if (tb[FRA_FWMARK] &&
-		    (rule->mark != nla_get_u32(tb[FRA_FWMARK])))
-			continue;
-
-		if (tb[FRA_FWMASK] &&
-		    (rule->mark_mask != nla_get_u32(tb[FRA_FWMASK])))
-			continue;
-
-		if (tb[FRA_TUN_ID] &&
-		    (rule->tun_id != nla_get_be64(tb[FRA_TUN_ID])))
-			continue;
-
-		if (tb[FRA_L3MDEV] &&
-		    (rule->l3mdev != nla_get_u8(tb[FRA_L3MDEV])))
-			continue;
-
-		if (uid_range_set(&range) &&
-		    (!uid_eq(rule->uid_range.start, range.start) ||
-		     !uid_eq(rule->uid_range.end, range.end)))
-			continue;
-
-		if (tb[FRA_IP_PROTO] &&
-		    (rule->ip_proto != nla_get_u8(tb[FRA_IP_PROTO])))
-			continue;
-
-		if (fib_rule_port_range_set(&sprange) &&
-		    !fib_rule_port_range_compare(&rule->sport_range, &sprange))
-			continue;
-
-		if (fib_rule_port_range_set(&dprange) &&
-		    !fib_rule_port_range_compare(&rule->dport_range, &dprange))
-			continue;
-
-		if (!ops->compare(rule, frh, tb))
-			continue;
-
-		if (rule->flags & FIB_RULE_PERMANENT) {
-			err = -EPERM;
-			goto errout;
-		}
-
-		if (ops->delete) {
-			err = ops->delete(rule);
-			if (err)
-				goto errout;
-		}
-
-		if (rule->tun_id)
-			ip_tunnel_unneed_metadata();
+	if (rule->tun_id)
+		ip_tunnel_unneed_metadata();
 
-		list_del_rcu(&rule->list);
+	list_del_rcu(&rule->list);
 
-		if (rule->action == FR_ACT_GOTO) {
-			ops->nr_goto_rules--;
-			if (rtnl_dereference(rule->ctarget) == NULL)
-				ops->unresolved_rules--;
-		}
+	if (rule->action == FR_ACT_GOTO) {
+		ops->nr_goto_rules--;
+		if (rtnl_dereference(rule->ctarget) == NULL)
+			ops->unresolved_rules--;
+	}
 
-		/*
-		 * Check if this rule is a target to any of them. If so,
-		 * adjust to the next one with the same preference or
-		 * disable them. As this operation is eventually very
-		 * expensive, it is only performed if goto rules, except
-		 * current if it is goto rule, have actually been added.
-		 */
-		if (ops->nr_goto_rules > 0) {
-			struct fib_rule *n;
-
-			n = list_next_entry(rule, list);
-			if (&n->list == &ops->rules_list || n->pref != rule->pref)
-				n = NULL;
-			list_for_each_entry(r, &ops->rules_list, list) {
-				if (rtnl_dereference(r->ctarget) != rule)
-					continue;
-				rcu_assign_pointer(r->ctarget, n);
-				if (!n)
-					ops->unresolved_rules++;
-			}
+	/*
+	 * Check if this rule is a target to any of them. If so,
+	 * adjust to the next one with the same preference or
+	 * disable them. As this operation is eventually very
+	 * expensive, it is only performed if goto rules, except
+	 * current if it is goto rule, have actually been added.
+	 */
+	if (ops->nr_goto_rules > 0) {
+		struct fib_rule *n;
+
+		n = list_next_entry(rule, list);
+		if (&n->list == &ops->rules_list || n->pref != rule->pref)
+			n = NULL;
+		list_for_each_entry(r, &ops->rules_list, list) {
+			if (rtnl_dereference(r->ctarget) != rule)
+				continue;
+			rcu_assign_pointer(r->ctarget, n);
+			if (!n)
+				ops->unresolved_rules++;
 		}
-
-		call_fib_rule_notifiers(net, FIB_EVENT_RULE_DEL, rule, ops,
-					NULL);
-		notify_rule_change(RTM_DELRULE, rule, ops, nlh,
-				   NETLINK_CB(skb).portid);
-		fib_rule_put(rule);
-		flush_route_cache(ops);
-		rules_ops_put(ops);
-		return 0;
 	}
 
-	err = -ENOENT;
+	call_fib_rule_notifiers(net, FIB_EVENT_RULE_DEL, rule, ops,
+				NULL);
+	notify_rule_change(RTM_DELRULE, rule, ops, nlh,
+			   NETLINK_CB(skb).portid);
+	fib_rule_put(rule);
+	flush_route_cache(ops);
+	rules_ops_put(ops);
+	kfree(nlrule);
+	return 0;
+
 errout:
+	if (nlrule)
+		kfree(nlrule);
 	rules_ops_put(ops);
 	return err;
 }
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v2 2/2] net: fib_rules: add extack support
From: Roopa Prabhu @ 2018-04-21 16:41 UTC (permalink / raw)
  To: davem; +Cc: netdev, dsa, idosch
In-Reply-To: <1524328891-3647-1-git-send-email-roopa@cumulusnetworks.com>

From: Roopa Prabhu <roopa@cumulusnetworks.com>

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
 include/net/fib_rules.h |  3 ++-
 net/core/fib_rules.c    | 55 +++++++++++++++++++++++++++++++++++++------------
 net/decnet/dn_rules.c   |  7 +++++--
 net/ipv4/fib_rules.c    |  7 +++++--
 net/ipv4/ipmr.c         |  3 ++-
 net/ipv6/fib6_rules.c   |  7 +++++--
 net/ipv6/ip6mr.c        |  3 ++-
 7 files changed, 63 insertions(+), 22 deletions(-)

diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index e5cfcfc..b473df5 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -75,7 +75,8 @@ struct fib_rules_ops {
 	int			(*configure)(struct fib_rule *,
 					     struct sk_buff *,
 					     struct fib_rule_hdr *,
-					     struct nlattr **);
+					     struct nlattr **,
+					     struct netlink_ext_ack *);
 	int			(*delete)(struct fib_rule *);
 	int			(*compare)(struct fib_rule *,
 					   struct fib_rule_hdr *,
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 6780110..ebd9351 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -469,14 +469,18 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (frh->src_len)
 		if (!tb[FRA_SRC] ||
 		    frh->src_len > (ops->addr_size * 8) ||
-		    nla_len(tb[FRA_SRC]) != ops->addr_size)
+		    nla_len(tb[FRA_SRC]) != ops->addr_size) {
+			NL_SET_ERR_MSG(extack, "Invalid source address");
 			goto errout;
+	}
 
 	if (frh->dst_len)
 		if (!tb[FRA_DST] ||
 		    frh->dst_len > (ops->addr_size * 8) ||
-		    nla_len(tb[FRA_DST]) != ops->addr_size)
+		    nla_len(tb[FRA_DST]) != ops->addr_size) {
+			NL_SET_ERR_MSG(extack, "Invalid dst address");
 			goto errout;
+	}
 
 	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
 	if (!nlrule) {
@@ -537,6 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 		nlrule->l3mdev = nla_get_u8(tb[FRA_L3MDEV]);
 		if (nlrule->l3mdev != 1)
 #endif
+			NL_SET_ERR_MSG(extack, "Invalid l3mdev");
 			goto errout_free;
 	}
 
@@ -554,31 +559,41 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 		nlrule->suppress_ifgroup = -1;
 
 	if (tb[FRA_GOTO]) {
-		if (nlrule->action != FR_ACT_GOTO)
+		if (nlrule->action != FR_ACT_GOTO) {
+			NL_SET_ERR_MSG(extack, "Unexpected goto");
 			goto errout_free;
+		}
 
 		nlrule->target = nla_get_u32(tb[FRA_GOTO]);
 		/* Backward jumps are prohibited to avoid endless loops */
-		if (nlrule->target <= nlrule->pref)
+		if (nlrule->target <= nlrule->pref) {
+			NL_SET_ERR_MSG(extack, "Backward goto not supported");
 			goto errout_free;
+		}
 	} else if (nlrule->action == FR_ACT_GOTO) {
+		NL_SET_ERR_MSG(extack, "Missing goto target for action goto");
 		goto errout_free;
 	}
 
-	if (nlrule->l3mdev && nlrule->table)
+	if (nlrule->l3mdev && nlrule->table) {
+		NL_SET_ERR_MSG(extack, "l3mdev and table are mutually exclusive");
 		goto errout_free;
+	}
 
 	if (tb[FRA_UID_RANGE]) {
 		if (current_user_ns() != net->user_ns) {
 			err = -EPERM;
+			NL_SET_ERR_MSG(extack, "No permission to set uid");
 			goto errout_free;
 		}
 
 		nlrule->uid_range = nla_get_kuid_range(tb);
 
 		if (!uid_range_set(&nlrule->uid_range) ||
-		    !uid_lte(nlrule->uid_range.start, nlrule->uid_range.end))
+		    !uid_lte(nlrule->uid_range.start, nlrule->uid_range.end)) {
+			NL_SET_ERR_MSG(extack, "Invalid uid range");
 			goto errout_free;
+		}
 	} else {
 		nlrule->uid_range = fib_kuid_range_unset;
 	}
@@ -589,15 +604,19 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (tb[FRA_SPORT_RANGE]) {
 		err = nla_get_port_range(tb[FRA_SPORT_RANGE],
 					 &nlrule->sport_range);
-		if (err)
+		if (err) {
+			NL_SET_ERR_MSG(extack, "Invalid sport range");
 			goto errout_free;
+		}
 	}
 
 	if (tb[FRA_DPORT_RANGE]) {
 		err = nla_get_port_range(tb[FRA_DPORT_RANGE],
 					 &nlrule->dport_range);
-		if (err)
+		if (err) {
+			NL_SET_ERR_MSG(extack, "Invalid dport range");
 			goto errout_free;
+		}
 	}
 
 	*rule = nlrule;
@@ -621,18 +640,23 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh,
 	int err = -EINVAL, unresolved = 0;
 	bool user_priority = false;
 
-	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
+	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh))) {
+		NL_SET_ERR_MSG(extack, "Invalid msg length");
 		goto errout;
+	}
 
 	ops = lookup_rules_ops(net, frh->family);
 	if (!ops) {
 		err = -EAFNOSUPPORT;
+		NL_SET_ERR_MSG(extack, "Rule family not supported");
 		goto errout;
 	}
 
 	err = nlmsg_parse(nlh, sizeof(*frh), tb, FRA_MAX, ops->policy, extack);
-	if (err < 0)
+	if (err < 0) {
+		NL_SET_ERR_MSG(extack, "Error parsing msg");
 		goto errout;
+	}
 
 	err = fib_nl2rule(skb, nlh, extack, ops, tb, &rule, &user_priority);
 	if (err)
@@ -644,7 +668,7 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh,
 		goto errout_free;
 	}
 
-	err = ops->configure(rule, skb, frh, tb);
+	err = ops->configure(rule, skb, frh, tb, extack);
 	if (err < 0)
 		goto errout_free;
 
@@ -723,18 +747,23 @@ int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr *nlh,
 	int err = -EINVAL;
 	bool user_priority = false;
 
-	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh)))
+	if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*frh))) {
+		NL_SET_ERR_MSG(extack, "Invalid msg length");
 		goto errout;
+	}
 
 	ops = lookup_rules_ops(net, frh->family);
 	if (ops == NULL) {
 		err = -EAFNOSUPPORT;
+		NL_SET_ERR_MSG(extack, "Rule family not supported");
 		goto errout;
 	}
 
 	err = nlmsg_parse(nlh, sizeof(*frh), tb, FRA_MAX, ops->policy, extack);
-	if (err < 0)
+	if (err < 0) {
+		NL_SET_ERR_MSG(extack, "Error parsing msg");
 		goto errout;
+	}
 
 	err = fib_nl2rule(skb, nlh, extack, ops, tb, &nlrule, &user_priority);
 	if (err)
diff --git a/net/decnet/dn_rules.c b/net/decnet/dn_rules.c
index c795c3f..7223669 100644
--- a/net/decnet/dn_rules.c
+++ b/net/decnet/dn_rules.c
@@ -121,13 +121,16 @@ static int dn_fib_rule_match(struct fib_rule *rule, struct flowi *fl, int flags)
 
 static int dn_fib_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
 				 struct fib_rule_hdr *frh,
-				 struct nlattr **tb)
+				 struct nlattr **tb,
+				 struct netlink_ext_ack *extack)
 {
 	int err = -EINVAL;
 	struct dn_fib_rule *r = (struct dn_fib_rule *)rule;
 
-	if (frh->tos)
+	if (frh->tos) {
+		NL_SET_ERR_MSG(extack, "Invalid tos value");
 		goto  errout;
+	}
 
 	if (rule->table == RT_TABLE_UNSPEC) {
 		if (rule->action == FR_ACT_TO_TBL) {
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 737d11b..f8eb78d 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -213,14 +213,17 @@ static const struct nla_policy fib4_rule_policy[FRA_MAX+1] = {
 
 static int fib4_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
 			       struct fib_rule_hdr *frh,
-			       struct nlattr **tb)
+			       struct nlattr **tb,
+			       struct netlink_ext_ack *extack)
 {
 	struct net *net = sock_net(skb->sk);
 	int err = -EINVAL;
 	struct fib4_rule *rule4 = (struct fib4_rule *) rule;
 
-	if (frh->tos & ~IPTOS_TOS_MASK)
+	if (frh->tos & ~IPTOS_TOS_MASK) {
+		NL_SET_ERR_MSG(extack, "Invalid tos");
 		goto errout;
+	}
 
 	/* split local/main if they are not already split */
 	err = fib_unmerge(net);
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 2fb4de3..38e092e 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -201,7 +201,8 @@ static const struct nla_policy ipmr_rule_policy[FRA_MAX + 1] = {
 };
 
 static int ipmr_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
-			       struct fib_rule_hdr *frh, struct nlattr **tb)
+			       struct fib_rule_hdr *frh, struct nlattr **tb,
+			       struct netlink_ext_ack *extack)
 {
 	return 0;
 }
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index df113c7..6547fc6 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -245,15 +245,18 @@ static const struct nla_policy fib6_rule_policy[FRA_MAX+1] = {
 
 static int fib6_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
 			       struct fib_rule_hdr *frh,
-			       struct nlattr **tb)
+			       struct nlattr **tb,
+			       struct netlink_ext_ack *extack)
 {
 	int err = -EINVAL;
 	struct net *net = sock_net(skb->sk);
 	struct fib6_rule *rule6 = (struct fib6_rule *) rule;
 
 	if (rule->action == FR_ACT_TO_TBL && !rule->l3mdev) {
-		if (rule->table == RT6_TABLE_UNSPEC)
+		if (rule->table == RT6_TABLE_UNSPEC) {
+			NL_SET_ERR_MSG(extack, "Invalid table");
 			goto errout;
+		}
 
 		if (fib6_new_table(net, rule->table) == NULL) {
 			err = -ENOBUFS;
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 298fd8b..20a419e 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -180,7 +180,8 @@ static const struct nla_policy ip6mr_rule_policy[FRA_MAX + 1] = {
 };
 
 static int ip6mr_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
-				struct fib_rule_hdr *frh, struct nlattr **tb)
+				struct fib_rule_hdr *frh, struct nlattr **tb,
+				struct netlink_ext_ack *extack)
 {
 	return 0;
 }
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache
From: Eric Dumazet @ 2018-04-21 16:45 UTC (permalink / raw)
  To: Willem de Bruijn, Jesper Dangaard Brouer
  Cc: Paolo Abeni, Network Development, David S. Miller, Tariq Toukan
In-Reply-To: <CAF=yD-LkM3hzcY3B1P_5fW1t+QNtPz6=2YRr4P79t4hZW=0wTA@mail.gmail.com>



On 04/21/2018 08:54 AM, Willem de Bruijn wrote:
> On Fri, Apr 20, 2018 at 9:48 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>>
>> On Thu, 19 Apr 2018 06:47:10 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> On 04/19/2018 12:40 AM, Paolo Abeni wrote:
>>>> On Wed, 2018-04-18 at 12:21 -0700, Eric Dumazet wrote:
>>>>> On 04/18/2018 10:15 AM, Paolo Abeni wrote:
>> [...]
>>>>
>>>> Any suggestions for better results are more than welcome!
>>>
>>> Yes, remote skb freeing. I mentioned this idea to Jesper and Tariq in
>>> Seoul (netdev conference). Not tied to UDP, but a generic solution.
>>
>> Yes, I remember.  I think... was it the idea, where you basically
>> wanted to queue back SKBs to the CPU that allocated them, right?
>>
>> Freeing an SKB on the same CPU that allocated it, have multiple
>> advantages. (1) the SLUB allocator can use a non-atomic
>> "cpu-local" (double)cmpxchg. (2) the 4 cache-lines memset cleared of
>> the SKB stay local.  (3) the atomic SKB refcnt/users stay local.
>>
>> We just have to avoid that queue back SKB's mechanism, doesn't cost
>> more than the operations we expect to save.  Bulk transfer is an
>> obvious approach.  For storing SKBs until they are returned, we already
>> have a fast mechanism see napi_consume_skb calling _kfree_skb_defer,
>> which SLUB/SLAB-bulk free to amortize cost (1).
>>
>> I guess, the missing information is that we don't know what CPU the SKB
>> were created on...
> 
> For connected sockets, sk->sk_incoming_cpu has this data. It
> records BH cpu on enqueue to udp socket, so one caveat is that
> it may be wrong with rps/rfs.
> 
> Another option is to associate not with source cpu but napi struct
> and have the device driver free in the context of its napi processing.
> This has the additional benefit that skb->napi_id is already stored
> per skb, so this also works for unconnected sockets.
> 
> Third, the skb->napi_id field is unused after setting sk->sk_napi_id
> on sk enqueue, so the BH cpu could be stored here after that,
> essentially extending sk_incoming_cpu to unconnected sockets.

We use at Google something named TXCS, which is what I mentioned to Jesper and Tariq.

(In our case, we wanted to not perform skb destructor/freeing on the cpu handling the TX queue,
but on cpus that originally cooked the skb (running TCP stack))

To accommodate generic needs (both RX and TX), I do not believe we can union any existing fields,
without a lot of pain/bugs.

^ permalink raw reply

* [PATCH bpf-next] bpf: btf: Clean up btf.h in uapi
From: Martin KaFai Lau @ 2018-04-21 16:48 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

This patch cleans up btf.h in uapi:
1) Rename "name" to "name_off" to better reflect it is an offset to the
   string section instead of a char array.
2) Remove unused value BTF_FLAGS_COMPR and BTF_MAGIC_SWAP

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/uapi/linux/btf.h       |  8 +++-----
 kernel/bpf/btf.c               | 20 ++++++++++----------
 tools/include/uapi/linux/btf.h |  8 +++-----
 tools/lib/bpf/btf.c            |  2 +-
 4 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/include/uapi/linux/btf.h b/include/uapi/linux/btf.h
index 74a30b1090df..bcb56ee47014 100644
--- a/include/uapi/linux/btf.h
+++ b/include/uapi/linux/btf.h
@@ -6,9 +6,7 @@
 #include <linux/types.h>
 
 #define BTF_MAGIC	0xeB9F
-#define BTF_MAGIC_SWAP	0x9FeB
 #define BTF_VERSION	1
-#define BTF_FLAGS_COMPR	0x01
 
 struct btf_header {
 	__u16	magic;
@@ -43,7 +41,7 @@ struct btf_header {
 #define BTF_STR_OFFSET(ref)	((ref) & BTF_MAX_NAME_OFFSET)
 
 struct btf_type {
-	__u32 name;
+	__u32 name_off;
 	/* "info" bits arrangement
 	 * bits  0-15: vlen (e.g. # of struct's members)
 	 * bits 16-23: unused
@@ -105,7 +103,7 @@ struct btf_type {
  * info in "struct btf_type").
  */
 struct btf_enum {
-	__u32	name;
+	__u32	name_off;
 	__s32	val;
 };
 
@@ -122,7 +120,7 @@ struct btf_array {
  * "struct btf_type").
  */
 struct btf_member {
-	__u32	name;
+	__u32	name_off;
 	__u32	type;
 	__u32	offset;	/* offset in bits */
 };
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index eb56ac760547..22e1046a1a86 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -473,7 +473,7 @@ __printf(4, 5) static void __btf_verifier_log_type(struct btf_verifier_env *env,
 	__btf_verifier_log(log, "[%u] %s %s%s",
 			   env->log_type_id,
 			   btf_kind_str[kind],
-			   btf_name_by_offset(btf, t->name),
+			   btf_name_by_offset(btf, t->name_off),
 			   log_details ? " " : "");
 
 	if (log_details)
@@ -517,7 +517,7 @@ static void btf_verifier_log_member(struct btf_verifier_env *env,
 		btf_verifier_log_type(env, struct_type, NULL);
 
 	__btf_verifier_log(log, "\t%s type_id=%u bits_offset=%u",
-			   btf_name_by_offset(btf, member->name),
+			   btf_name_by_offset(btf, member->name_off),
 			   member->type, member->offset);
 
 	if (fmt && *fmt) {
@@ -1419,10 +1419,10 @@ static s32 btf_struct_check_meta(struct btf_verifier_env *env,
 	btf_verifier_log_type(env, t, NULL);
 
 	for_each_member(i, t, member) {
-		if (!btf_name_offset_valid(btf, member->name)) {
+		if (!btf_name_offset_valid(btf, member->name_off)) {
 			btf_verifier_log_member(env, t, member,
 						"Invalid member name_offset:%u",
-						member->name);
+						member->name_off);
 			return -EINVAL;
 		}
 
@@ -1605,14 +1605,14 @@ static s32 btf_enum_check_meta(struct btf_verifier_env *env,
 	btf_verifier_log_type(env, t, NULL);
 
 	for (i = 0; i < nr_enums; i++) {
-		if (!btf_name_offset_valid(btf, enums[i].name)) {
+		if (!btf_name_offset_valid(btf, enums[i].name_off)) {
 			btf_verifier_log(env, "\tInvalid name_offset:%u",
-					 enums[i].name);
+					 enums[i].name_off);
 			return -EINVAL;
 		}
 
 		btf_verifier_log(env, "\t%s val=%d\n",
-				 btf_name_by_offset(btf, enums[i].name),
+				 btf_name_by_offset(btf, enums[i].name_off),
 				 enums[i].val);
 	}
 
@@ -1636,7 +1636,7 @@ static void btf_enum_seq_show(const struct btf *btf, const struct btf_type *t,
 	for (i = 0; i < nr_enums; i++) {
 		if (v == enums[i].val) {
 			seq_printf(m, "%s",
-				   btf_name_by_offset(btf, enums[i].name));
+				   btf_name_by_offset(btf, enums[i].name_off));
 			return;
 		}
 	}
@@ -1687,9 +1687,9 @@ static s32 btf_check_meta(struct btf_verifier_env *env,
 		return -EINVAL;
 	}
 
-	if (!btf_name_offset_valid(env->btf, t->name)) {
+	if (!btf_name_offset_valid(env->btf, t->name_off)) {
 		btf_verifier_log(env, "[%u] Invalid name_offset:%u",
-				 env->log_type_id, t->name);
+				 env->log_type_id, t->name_off);
 		return -EINVAL;
 	}
 
diff --git a/tools/include/uapi/linux/btf.h b/tools/include/uapi/linux/btf.h
index 74a30b1090df..bcb56ee47014 100644
--- a/tools/include/uapi/linux/btf.h
+++ b/tools/include/uapi/linux/btf.h
@@ -6,9 +6,7 @@
 #include <linux/types.h>
 
 #define BTF_MAGIC	0xeB9F
-#define BTF_MAGIC_SWAP	0x9FeB
 #define BTF_VERSION	1
-#define BTF_FLAGS_COMPR	0x01
 
 struct btf_header {
 	__u16	magic;
@@ -43,7 +41,7 @@ struct btf_header {
 #define BTF_STR_OFFSET(ref)	((ref) & BTF_MAX_NAME_OFFSET)
 
 struct btf_type {
-	__u32 name;
+	__u32 name_off;
 	/* "info" bits arrangement
 	 * bits  0-15: vlen (e.g. # of struct's members)
 	 * bits 16-23: unused
@@ -105,7 +103,7 @@ struct btf_type {
  * info in "struct btf_type").
  */
 struct btf_enum {
-	__u32	name;
+	__u32	name_off;
 	__s32	val;
 };
 
@@ -122,7 +120,7 @@ struct btf_array {
  * "struct btf_type").
  */
 struct btf_member {
-	__u32	name;
+	__u32	name_off;
 	__u32	type;
 	__u32	offset;	/* offset in bits */
 };
diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 58b6255abc7a..2bac710e3194 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -281,7 +281,7 @@ int32_t btf__find_by_name(const struct btf *btf, const char *type_name)
 
 	for (i = 1; i <= btf->nr_types; i++) {
 		const struct btf_type *t = btf->types[i];
-		const char *name = btf_name_by_offset(btf, t->name);
+		const char *name = btf_name_by_offset(btf, t->name_off);
 
 		if (name && !strcmp(type_name, name))
 			return i;
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
From: Eric Dumazet @ 2018-04-21 16:55 UTC (permalink / raw)
  To: Christoph Hellwig, Eric Dumazet
  Cc: David S . Miller, netdev, linux-kernel, Soheil Hassas Yeganeh,
	linux-mm, linux-fsdevel
In-Reply-To: <20180421090722.GA11998@infradead.org>



On 04/21/2018 02:07 AM, Christoph Hellwig wrote:
> On Fri, Apr 20, 2018 at 08:55:38AM -0700, Eric Dumazet wrote:
>> This patch series provide a new mmap_hook to fs willing to grab
>> a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
>>
>> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
>> is held), and improve multi-threading scalability. 
> 
> Missing CC to linu-fsdevel and linux-mm that will have to decide.
> 
> We've rejected this approach multiple times before, so you better
> make a really good argument for it.
> 

Well, tcp code needs to hold socket lock before mm->mmap_sem, so current
mmap hook can not fit. Or we need to revisit all code doing copyin/copyout while
holding a socket lock. (Not feasible really)


> introducing a multiplexer that overloads a single method certainly
> doesn't help making that case.

Well, if you refer to multiple hooks instead of a single one, I basically
thought that since only TCP needs this hook at the moment,
it was not worth adding extra 8-bytes loads for all other mmap() users.

I have no issue adding more hooks and more memory pressure if this is the blocking factor.

We need two actions at this moment, (to lock the socket or release it)
and a third one would allow us to build the array of pages
before grabbing mmap_sem (as I mentioned in the last patch changelog)

^ permalink raw reply

* [PATCH net-next v5 0/3] kernel: add support to collect hardware logs in crash recovery kernel
From: Rahul Lakkireddy @ 2018-04-21 17:05 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: indranil-ut6Up61K2wZBDgjK7y7TUQ, nirranjan-ut6Up61K2wZBDgjK7y7TUQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ganeshgr-ut6Up61K2wZBDgjK7y7TUQ, Rahul Lakkireddy,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn

On production servers running variety of workloads over time, kernel
panic can happen sporadically after days or even months. It is
important to collect as much debug logs as possible to root cause
and fix the problem, that may not be easy to reproduce. Snapshot of
underlying hardware/firmware state (like register dump, firmware
logs, adapter memory, etc.), at the time of kernel panic will be very
helpful while debugging the culprit device driver.

This series of patches add new generic framework that enable device
drivers to collect device specific snapshot of the hardware/firmware
state of the underlying device in the crash recovery kernel. In crash
recovery kernel, the collected logs are added as elf notes to
/proc/vmcore, which is copied by user space scripts for post-analysis.

The sequence of actions done by device drivers to append their device
specific hardware/firmware logs to /proc/vmcore are as follows:

1. During probe (before hardware is initialized), device drivers
register to the vmcore module (via vmcore_add_device_dump()), with
callback function, along with buffer size and log name needed for
firmware/hardware log collection.

2. vmcore module allocates the buffer with requested size. It adds
an elf note and invokes the device driver's registered callback
function.

3. Device driver collects all hardware/firmware logs into the buffer
and returns control back to vmcore module.

The device specific hardware/firmware logs can be seen as elf notes:

# readelf -n /proc/vmcore

Displaying notes found at file offset 0x00001000 with length 0x04003288:
  Owner                 Data size	Description
  VMCOREDD_cxgb4_0000:02:00.4 0x02000fd8	Unknown note type: (0x00000700)
  VMCOREDD_cxgb4_0000:04:00.4 0x02000fd8	Unknown note type: (0x00000700)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  VMCOREINFO           0x0000074f	Unknown note type: (0x00000000)

Patch 1 adds API to vmcore module to allow drivers to register callback
to collect the device specific hardware/firmware logs.  The logs will
be added to /proc/vmcore as elf notes.

Patch 2 updates read and mmap logic to append device specific hardware/
firmware logs as elf notes.

Patch 3 shows a cxgb4 driver example using the API to collect
hardware/firmware logs in crash recovery kernel, before hardware is
initialized.

Thanks,
Rahul

---
v5:
- Removed enabling CONFIG_PROC_VMCORE_DEVICE_DUMP by default and
  updated help message.

v4:
- Made __vmcore_add_device_dump() static.
- Moved compile check to define vmcore_add_device_dump() to
  crash_dump.h to fix compilation when vmcore.c is not compiled in.
- Convert ---help--- to help in Kconfig as indicated by checkpatch.
- Rebased to tip.

v3:
- Dropped sysfs crashdd module.
- Exported dumps as elf notes. Suggested by Eric Biederman
  <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>.  Added as patch 2 in this version.
- Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
  dump support.
- Moved logic related to adding dumps from crashdd to vmcore module.
- Rename all crashdd* to vmcoredd*.
- Updated comments.

v2:
- Added ABI Documentation for crashdd.
- Directly use octal permission instead of macro.

Changes since rfc v2:
- Moved exporting crashdd from procfs to sysfs. Suggested by
  Stephen Hemminger <stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org>
- Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
- Replaced all proc API with sysfs API and updated comments.
- Calling driver callback before creating the binary file under
  crashdd sysfs.
- Changed binary dump file permission from S_IRUSR to S_IRUGO.
- Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.

rfc v2:
- Collecting logs in 2nd kernel instead of during kernel panic.
  Suggested by Eric Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>.
- Added new crashdd module that exports /proc/crashdd/ containing
  driver's registered hardware/firmware logs in patch 1.
- Replaced the API to allow drivers to register their hardware/firmware
  log collect routine in crash recovery kernel in patch 1.
- Updated patch 2 to use the new API in patch 1.

Rahul Lakkireddy (3):
  vmcore: add API to collect hardware dump in second kernel
  vmcore: append device dumps to vmcore as elf notes
  cxgb4: collect hardware dump in second kernel

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |   4 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c |  25 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |   3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  |  10 +
 fs/proc/Kconfig                                  |  15 +
 fs/proc/vmcore.c                                 | 399 ++++++++++++++++++++++-
 include/linux/crash_core.h                       |   4 +
 include/linux/crash_dump.h                       |  17 +
 include/linux/kcore.h                            |   6 +
 include/uapi/linux/elf.h                         |   1 +
 10 files changed, 471 insertions(+), 13 deletions(-)

-- 
2.14.1

^ permalink raw reply

* [PATCH net-next v5 1/3] vmcore: add API to collect hardware dump in second kernel
From: Rahul Lakkireddy @ 2018-04-21 17:05 UTC (permalink / raw)
  To: netdev, kexec, linux-fsdevel, linux-kernel
  Cc: davem, viro, ebiederm, stephen, akpm, torvalds, ganeshgr,
	nirranjan, indranil, Rahul Lakkireddy
In-Reply-To: <cover.1524329561.git.rahul.lakkireddy@chelsio.com>

The sequence of actions done by device drivers to append their device
specific hardware/firmware logs to /proc/vmcore are as follows:

1. During probe (before hardware is initialized), device drivers
register to the vmcore module (via vmcore_add_device_dump()), with
callback function, along with buffer size and log name needed for
firmware/hardware log collection.

2. vmcore module allocates the buffer with requested size. It adds
an Elf note and invokes the device driver's registered callback
function.

3. Device driver collects all hardware/firmware logs into the buffer
and returns control back to vmcore module.

Ensure that the device dump buffer size is always aligned to page size
so that it can be mmaped.

Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more
generic and reserve NT_VMCOREDD note type to indicate vmcore device
dump.

Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
v5:
- Removed enabling CONFIG_PROC_VMCORE_DEVICE_DUMP by default and
  updated help message to indicate that the driver must be present
  in second kernel's initramfs to collect the underlying device
  snapshot.

v4:
- Made __vmcore_add_device_dump() static.
- Moved compile check to define vmcore_add_device_dump() to
  crash_dump.h to fix compilation when vmcore.c is not compiled in.
- Convert ---help--- to help in Kconfig as indicated by checkpatch.
- Rebased to tip.

v3:
- Dropped sysfs crashdd module.
- Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
  dump support.
- Moved logic related to adding dumps from crashdd to vmcore module.
- Rename all crashdd* to vmcoredd*.

v2:
- Added ABI Documentation for crashdd.
- Directly use octal permission instead of macro.

Changes since rfc v2:
- Moved exporting crashdd from procfs to sysfs.  Suggested by
  Stephen Hemminger <stephen@networkplumber.org>
- Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
- Replaced all proc API with sysfs API and updated comments.
- Calling driver callback before creating the binary file under
  crashdd sysfs.
- Changed binary dump file permission from S_IRUSR to S_IRUGO.
- Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.

rfc v2:
- Collecting logs in 2nd kernel instead of during kernel panic.
  Suggested by Eric Biederman <ebiederm@xmission.com>.
- Patch added in this series.

 fs/proc/Kconfig            |  15 +++++
 fs/proc/vmcore.c           | 152 ++++++++++++++++++++++++++++++++++++++++++---
 include/linux/crash_core.h |   4 ++
 include/linux/crash_dump.h |  17 +++++
 include/linux/kcore.h      |   6 ++
 include/uapi/linux/elf.h   |   1 +
 6 files changed, 186 insertions(+), 9 deletions(-)

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 1ade1206bb89..0eaeb41453f5 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -43,6 +43,21 @@ config PROC_VMCORE
         help
         Exports the dump image of crashed kernel in ELF format.
 
+config PROC_VMCORE_DEVICE_DUMP
+	bool "Device Hardware/Firmware Log Collection"
+	depends on PROC_VMCORE
+	default n
+	help
+	  After kernel panic, device drivers can collect the device
+	  specific snapshot of their hardware or firmware before the
+	  underlying devices are initialized in crash recovery kernel.
+	  Note that the device driver must be present in the crash
+	  recovery kernel's initramfs to collect its underlying device
+	  snapshot.
+
+	  If you say Y here, the collected device dumps will be added
+	  as ELF notes to /proc/vmcore.
+
 config PROC_SYSCTL
 	bool "Sysctl support (/proc/sys)" if EXPERT
 	depends on PROC_FS
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a45f0af22a60..7395462d2f86 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -20,6 +20,7 @@
 #include <linux/init.h>
 #include <linux/crash_dump.h>
 #include <linux/list.h>
+#include <linux/mutex.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
 #include <linux/uaccess.h>
@@ -44,6 +45,12 @@ static u64 vmcore_size;
 
 static struct proc_dir_entry *proc_vmcore;
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+/* Device Dump list and mutex to synchronize access to list */
+static LIST_HEAD(vmcoredd_list);
+static DEFINE_MUTEX(vmcoredd_mutex);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
 /*
  * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
  * The called function has to take care of module refcounting.
@@ -302,10 +309,8 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
 };
 
 /**
- * alloc_elfnotes_buf - allocate buffer for ELF note segment in
- *                      vmalloc memory
- *
- * @notes_sz: size of buffer
+ * vmcore_alloc_buf - allocate buffer in vmalloc memory
+ * @sizez: size of buffer
  *
  * If CONFIG_MMU is defined, use vmalloc_user() to allow users to mmap
  * the buffer to user-space by means of remap_vmalloc_range().
@@ -313,12 +318,12 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
  * If CONFIG_MMU is not defined, use vzalloc() since mmap_vmcore() is
  * disabled and there's no need to allow users to mmap the buffer.
  */
-static inline char *alloc_elfnotes_buf(size_t notes_sz)
+static inline char *vmcore_alloc_buf(size_t size)
 {
 #ifdef CONFIG_MMU
-	return vmalloc_user(notes_sz);
+	return vmalloc_user(size);
 #else
-	return vzalloc(notes_sz);
+	return vzalloc(size);
 #endif
 }
 
@@ -665,7 +670,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = vmcore_alloc_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -851,7 +856,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = vmcore_alloc_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -1145,6 +1150,132 @@ static int __init parse_crash_elf_headers(void)
 	return 0;
 }
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+/**
+ * vmcoredd_get_note_size - Get size of the note that will be inserted at
+ * beginning of the dump's buffer.
+ * @name: Note's name
+ *
+ * Gets the overall size of the note that will be inserted at the beginning
+ * of the dump's buffer.  It also adds padding, if necessary to meet
+ * alignment requirements.
+ */
+static inline size_t vmcoredd_get_note_size(const char *name)
+{
+	return CRASH_CORE_NOTE_HEAD_BYTES +
+	       ALIGN(VMCOREDD_NOTE_NAME_BYTES + strlen(name), sizeof(Elf_Word));
+}
+
+/**
+ * vmcoredd_write_note - Write note at the beginning of the dump's buffer
+ * @name: Dump's name
+ * @buf: Output buffer where the note is written
+ * @size: Size of the dump
+ *
+ * Fills beginning of the dump's data with elf note.
+ */
+static void vmcoredd_write_note(const char *name, void *buf, size_t size)
+{
+	struct elf_note *note = (struct elf_note *)buf;
+	Elf_Word *word = (Elf_Word *)note;
+
+	note->n_namesz = ALIGN(VMCOREDD_NOTE_NAME_BYTES + strlen(name),
+			       sizeof(Elf_Word));
+	note->n_descsz = size;
+	note->n_type = NT_VMCOREDD;
+	word += DIV_ROUND_UP(sizeof(*note), sizeof(Elf_Word));
+	snprintf((char *)word, note->n_namesz, "%s_%s", VMCOREDD_NOTE_NAME,
+		 name);
+}
+
+/**
+ * vmcore_add_device_dump - Add a buffer containing device dump to vmcore
+ * @data: dump info.
+ *
+ * Allocate a buffer and invoke the calling driver's dump collect routine.
+ * Write Elf note at the beginning of the buffer to indicate vmcore device
+ * dump and add the dump to global list.
+ */
+static int __vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	size_t note_size, data_size;
+	struct vmcoredd_node *dump;
+	void *buf = NULL;
+	int ret;
+
+	if (!data || !strlen(data->name) ||
+	    !data->vmcoredd_callback || !data->size)
+		return -EINVAL;
+
+	dump = vzalloc(sizeof(*dump));
+	if (!dump) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+	note_size = vmcoredd_get_note_size(data->name);
+	/* Keep size of the buffer page aligned so that it can be mmaped */
+	data_size = roundup(note_size + data->size, PAGE_SIZE);
+
+	/* Allocate buffer for driver's to write their dumps */
+	buf = vmcore_alloc_buf(data_size);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+	vmcoredd_write_note(data->name, buf, data_size - note_size);
+
+	/* Invoke the driver's dump collection routing */
+	ret = data->vmcoredd_callback(data, buf + note_size);
+	if (ret)
+		goto out_err;
+
+	dump->buf = buf;
+	dump->size = data_size;
+
+	/* Add the dump to driver sysfs list */
+	mutex_lock(&vmcoredd_mutex);
+	list_add_tail(&dump->list, &vmcoredd_list);
+	mutex_unlock(&vmcoredd_mutex);
+
+	return 0;
+
+out_err:
+	if (buf)
+		vfree(buf);
+
+	if (dump)
+		vfree(dump);
+
+	return ret;
+}
+
+int vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	return __vmcore_add_device_dump(data);
+}
+EXPORT_SYMBOL(vmcore_add_device_dump);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+/* Free all dumps in vmcore device dump list */
+static void vmcore_free_device_dumps(void)
+{
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+	mutex_lock(&vmcoredd_mutex);
+	while (!list_empty(&vmcoredd_list)) {
+		struct vmcoredd_node *dump;
+
+		dump = list_first_entry(&vmcoredd_list, struct vmcoredd_node,
+					list);
+		list_del(&dump->list);
+		vfree(dump->buf);
+		vfree(dump);
+	}
+	mutex_unlock(&vmcoredd_mutex);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+}
+
 /* Init function for vmcore module. */
 static int __init vmcore_init(void)
 {
@@ -1192,4 +1323,7 @@ void vmcore_cleanup(void)
 		kfree(m);
 	}
 	free_elfcorebuf();
+
+	/* clear vmcore device dump list */
+	vmcore_free_device_dumps();
 }
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index b511f6d24b42..3b6b041d84c8 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -27,6 +27,10 @@
 				     VMCOREINFO_NOTE_NAME_BYTES +	\
 				     VMCOREINFO_BYTES)
 
+#define VMCOREDD_MAX_NAME_BYTES  32
+#define VMCOREDD_NOTE_NAME	 "VMCOREDD"
+#define VMCOREDD_NOTE_NAME_BYTES sizeof(VMCOREDD_NOTE_NAME)
+
 typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
 
 void crash_update_vmcoreinfo_safecopy(void *ptr);
diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
index f7ac2aa93269..658d508d1ec5 100644
--- a/include/linux/crash_dump.h
+++ b/include/linux/crash_dump.h
@@ -93,4 +93,21 @@ static inline bool is_kdump_kernel(void) { return 0; }
 #endif /* CONFIG_CRASH_DUMP */
 
 extern unsigned long saved_max_pfn;
+
+/* Device Dump information to be filled by drivers */
+struct vmcoredd_data {
+	char name[VMCOREDD_MAX_NAME_BYTES]; /* Unique name of the dump */
+	unsigned long size;                 /* Size of the dump */
+	/* Driver's registered callback to be invoked to collect dump */
+	int (*vmcoredd_callback)(struct vmcoredd_data *data, void *buf);
+};
+
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+int vmcore_add_device_dump(struct vmcoredd_data *data);
+#else
+static inline int vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	return -EOPNOTSUPP;
+}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 #endif /* LINUX_CRASHDUMP_H */
diff --git a/include/linux/kcore.h b/include/linux/kcore.h
index 80db19d3a505..aa26e7199060 100644
--- a/include/linux/kcore.h
+++ b/include/linux/kcore.h
@@ -28,6 +28,12 @@ struct vmcore {
 	loff_t offset;
 };
 
+struct vmcoredd_node {
+	struct list_head list;	/* List of dumps */
+	void *buf;		/* Buffer containing device's dump */
+	unsigned long size;	/* Size of the buffer */
+};
+
 #ifdef CONFIG_PROC_KCORE
 extern void kclist_add(struct kcore_list *, void *, size_t, int type);
 #else
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index e2535d6dcec7..4e12c423b9fe 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -421,6 +421,7 @@ typedef struct elf64_shdr {
 #define NT_ARM_SYSTEM_CALL	0x404	/* ARM system call number */
 #define NT_ARM_SVE	0x405		/* ARM Scalable Vector Extension registers */
 #define NT_ARC_V2	0x600		/* ARCv2 accumulator/extra registers */
+#define NT_VMCOREDD	0x700		/* Vmcore Device Dump Note */
 
 /* Note header in a PT_NOTE section */
 typedef struct elf32_note {
-- 
2.14.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox