Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, Christoph Lameter <cl@linux.com>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	Pekka Enberg <penberg@kernel.org>,
	netdev@vger.kernel.org, Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	David Rientjes <rientjes@google.com>,
	Hannes Frederic Sowa <hannes@redhat.com>,
	brouer@redhat.com, Andi Kleen <ak@linux.intel.com>,
	Arnaldo Carvalho de Melo <acme@redhat.com>
Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
Date: Mon, 5 Oct 2015 21:26:39 +0200	[thread overview]
Message-ID: <20151005212639.35932b6c@redhat.com> (raw)
In-Reply-To: <20151002145044.781c911ea98e3ea74ae5cf3b@linux-foundation.org>

On Fri, 2 Oct 2015 14:50:44 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > > Thus, I need introducing new code like this patch and at the same time
> > > have to reduce the number of instruction-cache misses/usage.  In this
> > > case we solve the problem by kmem_cache_free_bulk() not getting called
> > > too often. Thus, +17 bytes will hopefully not matter too much... but on
> > > the other hand we sort-of know that calling kmem_cache_free_bulk() will
> > > cause icache misses.
> > 
> > I just tested this change on top of my net-use-case patchset... and for
> > some strange reason the code with this WARN_ON is faster and have much
> > less icache-misses (1,278,276 vs 2,719,158 L1-icache-load-misses).
> > 
> > Thus, I think we should keep your fix.
> > 
> > I cannot explain why using WARN_ON() is better and cause less icache
> > misses.  And I hate when I don't understand every detail.
> > 
> >  My theory is, after reading the assembler code, that the UD2
> > instruction (from BUG_ON) cause some kind of icache decoder stall
> > (Intel experts???).  Now that should not be a problem, as UD2 is
> > obviously placed as an unlikely branch and left at the end of the asm
> > function call.  But the call to __slab_free() is also placed at the end
> > of the asm function (gets inlined from slab_free() as unlikely).  And
> > it is actually fairly likely that bulking is calling __slab_free (slub
> > slowpath call).
> 
> Yes, I was looking at the asm code and the difference is pretty small:
> a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly.

Actually managed to find documentation that the UD2 instruction do
stops the instruction prefetcher.

From "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
 Section 3.4.1.6 "Branch Type Selection":
 Cite:
  "[...] follow the indirect branch with a UD2 instruction, which will
   stop the processor from decoding down the fall-through path."

And from LWN article: https://lwn.net/Articles/255364/
 Cite: "[...] like using the ud2 instruction [...]. This instruction,
 which cannot be executed itself, is used after an indirect jump
 instruction; it is used as a signal to the instruction fetcher that
 the processor should not waste efforts decoding the following memory
 since the execution will continue at a different location."

Thus, we were hitting a very unfortunately combination of code, there
the UD2 instruction was located before the call to __slab_free() in the
ASM code, causing instruction prefetching to fail/stop.

Now I understand all the details again :-)

My only problem left, is I want a perf measurement that pinpoint these
kind of spots.  The difference in L1-icache-load-misses were significant
(1,278,276 vs 2,719,158).  I tried to somehow perf record this with
different perf events without being able to pinpoint the location (even
though I know the spot now).  Even tried Andi's ocperf.py... maybe he
will know what event I should try?


> But I wouldn't assume that the microbenchmarking is meaningful.  I've
> seen shockingly large (and quite repeatable) microbenchmarking
> differences from small changes in code which isn't even executed (and
> this is one such case, actually).  You add or remove just one byte of
> text and half the kernel (or half the .o file?) gets a different
> alignment and this seems to change everything.

I do know alignment of code is important, but this performance impact
was just too large.  And I think I found documentation to back my theory.

 
> Deleting the BUG altogether sounds the best solution.  As long as the
> kernel crashes in some manner, we'll be able to work out what happened.
> And it's cant-happen anyway, isn't it?

To me WARN_ON() seems like a good "documentation" if it does not hurt
performance.  I don't think removing the WARN_ON() will improve
performance, but I'm willing to actually test if it matters.

Below is how it will crash, it is fairly obvious (if you know x86_64
call convention). 

API is: void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)

Below it clearly shows second arg "size" in %RSI == 0.

[ 2986.977708] BUG: unable to handle kernel paging request at ffffffffa03c4d58
[ 2986.977760] IP: [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.977805] PGD 19a4067 PUD 19a5063 PMD 408a66067 PTE 3f6407161
[ 2986.977879] Oops: 0003 [#1] SMP 
[ 2986.977936] Modules linked in: slab_bulk_test03(O+) time_bench(O) netconsole tun coretemp kvm_intel kvm mxm_wmi microcode sg i2c_i801 i2c_core pcspkr wmi video shpchp acpi_pad nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc serio_raw sd_mod hid_generic ixgbe mdio vxlan ip6_udp_tunnel e1000e udp_tunnel ptp pps_core mlx5_core [last unloaded: slab_bulk_test03]
[ 2986.978190] CPU: 1 PID: 12495 Comm: modprobe Tainted: G           O    4.3.0-rc2-net-next-mm+ #442
[ 2986.978265] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
[ 2986.978341] task: ffff88040a89d200 ti: ffff8803e3c50000 task.ti: ffff8803e3c50000
[ 2986.978414] RIP: 0010:[<ffffffff8116d044>]  [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.978490] RSP: 0018:ffff8803e3c53aa8  EFLAGS: 00010282
[ 2986.978528] RAX: 0000000000000000 RBX: ffff8803e3c53c30 RCX: 0000000100200016
[ 2986.978571] RDX: ffff8803e3c53ad8 RSI: 0000000000000000 RDI: 000077ff80000000
[ 2986.978614] RBP: ffff8803e3c53ad0 R08: ffff8803e3db3600 R09: 00000000e3db3b01
[ 2986.978655] R10: ffffffffa03c4d58 R11: 0000000000000001 R12: ffffffffffffffff
[ 2986.978698] R13: ffff8803e3c53ae0 R14: 000077ff80000000 R15: ffff88040d774c00
[ 2986.978740] FS:  00007fbf31190740(0000) GS:ffff88041fa40000(0000) knlGS:0000000000000000
[ 2986.978814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2986.978853] CR2: ffffffffa03c4d58 CR3: 00000003f6406000 CR4: 00000000001406e0
[ 2986.978895] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2986.978938] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2986.978979] Stack:
[ 2986.979014]  ffff8803e3c53c30 0000000000000009 0000000000000000 ffffffffa002e000
[ 2986.979099]  ffff8803fc40db40 ffff8803e3c53cf0 ffffffffa03c4d58 0000000000000000
[ 2986.979184]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2986.979259] Call Trace:
[ 2986.979296]  [<ffffffffa002e000>] ? 0xffffffffa002e000
[ 2986.979335]  [<ffffffffa03c4d58>] run_try_crash_tests+0x378/0x3a6 [slab_bulk_test03]
[ 2986.979409]  [<ffffffffa002e12d>] slab_bulk_test03_module_init+0x12d/0x1000 [slab_bulk_test03]
[ 2986.979485]  [<ffffffff810002ed>] do_one_initcall+0xad/0x1d0
[ 2986.979526]  [<ffffffff8111c117>] do_init_module+0x60/0x1e8
[ 2986.979567]  [<ffffffff810c2778>] load_module+0x1bb8/0x2420
[ 2986.979608]  [<ffffffff810bf1b0>] ? __symbol_put+0x40/0x40
[ 2986.979647]  [<ffffffff810c31b0>] SyS_finit_module+0x80/0xb0
[ 2986.979690]  [<ffffffff8166b697>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 2986.979732] Code: f8 eb 05 4d 85 e4 74 13 4c 8b 10 49 83 ec 01 48 89 c2 48 83 e8 08 4d 85 d2 74 e8 4d 85 d2 0f 84 2e 01 00 00 49 63 47 20 4c 89 f7 <49> c7 04 02 00 00 00 00 b8 00 00 00 80 4c 01 d0 48 0f 42 3d b4 
[ 2986.979904] RIP  [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.979945]  RSP <ffff8803e3c53aa8>
[ 2986.979982] CR2: ffffffffa03c4d58
[ 2986.980130] ---[ end trace 64f42b02f4347220 ]---


(gdb) list *(kmem_cache_free_bulk)+0x54
0xffffffff8116d044 is in kmem_cache_free_bulk (mm/slub.c:269).
264		return p;
265	}
266	
267	static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
268	{
269		*(void **)(object + s->offset) = fp;
270	}
271	
272	/* Loop over all objects in a slab */
273	#define for_each_object(__p, __s, __addr, __objects) \


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2015-10-05 19:26 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-28 12:26 [PATCH 0/7] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
2015-09-28 12:26 ` [PATCH 1/7] slub: create new ___slab_alloc function that can be called with irqs disabled Jesper Dangaard Brouer
2015-09-28 12:26   ` Jesper Dangaard Brouer
2015-09-28 12:26 ` [PATCH 2/7] slub: Avoid irqoff/on in bulk allocation Jesper Dangaard Brouer
2015-09-28 12:26 ` [PATCH 3/7] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG Jesper Dangaard Brouer
2015-09-28 12:26   ` Jesper Dangaard Brouer
2015-09-28 13:49   ` Christoph Lameter
2015-09-28 12:26 ` [PATCH 4/7] slab: implement bulking for SLAB allocator Jesper Dangaard Brouer
2015-09-28 12:26   ` Jesper Dangaard Brouer
2015-09-28 15:11   ` Christoph Lameter
2015-09-28 12:26 ` [PATCH 5/7] slub: support for bulk free with SLUB freelists Jesper Dangaard Brouer
2015-09-28 12:26   ` Jesper Dangaard Brouer
2015-09-28 15:16   ` Christoph Lameter
2015-09-28 15:51     ` Jesper Dangaard Brouer
2015-09-28 15:51       ` Jesper Dangaard Brouer
2015-09-28 16:28       ` Christoph Lameter
2015-09-29  7:32         ` Jesper Dangaard Brouer
2015-09-29  7:32           ` Jesper Dangaard Brouer
2015-09-28 16:30       ` Christoph Lameter
2015-09-29  7:12         ` Jesper Dangaard Brouer
2015-09-29  7:12           ` Jesper Dangaard Brouer
2015-09-28 12:26 ` [PATCH 6/7] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
2015-09-28 12:26   ` Jesper Dangaard Brouer
2015-09-28 15:22   ` Christoph Lameter
2015-09-28 15:22     ` Christoph Lameter
2015-09-28 12:26 ` [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk() Jesper Dangaard Brouer
2015-09-28 12:26   ` Jesper Dangaard Brouer
2015-09-28 14:53   ` Alexander Duyck
2015-09-28 15:59     ` Jesper Dangaard Brouer
2015-09-28 15:59       ` Jesper Dangaard Brouer
2015-09-29 15:46 ` [MM PATCH V4 0/6] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
2015-09-29 15:47   ` [MM PATCH V4 1/6] slub: create new ___slab_alloc function that can be called with irqs disabled Jesper Dangaard Brouer
2015-09-29 15:47   ` [MM PATCH V4 2/6] slub: Avoid irqoff/on in bulk allocation Jesper Dangaard Brouer
2015-09-29 15:47   ` [MM PATCH V4 3/6] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG Jesper Dangaard Brouer
2015-09-29 15:48   ` [MM PATCH V4 4/6] slab: implement bulking for SLAB allocator Jesper Dangaard Brouer
2015-09-29 15:48   ` [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists Jesper Dangaard Brouer
2015-09-29 16:38     ` Alexander Duyck
2015-09-29 17:00       ` Jesper Dangaard Brouer
2015-09-29 17:20         ` Alexander Duyck
2015-09-29 17:20           ` Alexander Duyck
2015-09-29 18:16           ` Jesper Dangaard Brouer
2015-09-30 11:44       ` [MM PATCH V4.1 " Jesper Dangaard Brouer
2015-09-30 16:03         ` Christoph Lameter
2015-10-01 22:10         ` Andrew Morton
2015-10-01 22:10           ` Andrew Morton
2015-10-02  9:41           ` Jesper Dangaard Brouer
2015-10-02 10:10             ` Christoph Lameter
2015-10-02 10:40               ` Jesper Dangaard Brouer
2015-10-02 13:40             ` Jesper Dangaard Brouer
2015-10-02 21:50               ` Andrew Morton
2015-10-02 21:50                 ` Andrew Morton
2015-10-05 19:26                 ` Jesper Dangaard Brouer [this message]
2015-10-05 21:20                   ` Andi Kleen
2015-10-05 21:20                     ` Andi Kleen
2015-10-05 23:07                     ` Jesper Dangaard Brouer
2015-10-07 12:31                       ` Jesper Dangaard Brouer
2015-10-07 13:36                         ` Arnaldo Carvalho de Melo
2015-10-07 15:44                           ` Andi Kleen
2015-10-07 15:44                             ` Andi Kleen
2015-10-07 16:06                         ` Andi Kleen
2015-10-05 23:53                   ` Jesper Dangaard Brouer
2015-10-05 23:53                     ` Jesper Dangaard Brouer
2015-10-07 10:39                   ` Jesper Dangaard Brouer
2015-10-07 10:39                     ` Jesper Dangaard Brouer
2015-09-29 15:48   ` [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
2015-10-14  5:15     ` Joonsoo Kim
2015-10-14  5:15       ` Joonsoo Kim
2015-10-21  7:57       ` Jesper Dangaard Brouer
2015-11-05  5:09         ` Joonsoo Kim
2015-11-05  5:09           ` Joonsoo Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20151005212639.35932b6c@redhat.com \
    --to=brouer@redhat.com \
    --cc=acme@redhat.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.duyck@gmail.com \
    --cc=cl@linux.com \
    --cc=hannes@redhat.com \
    --cc=iamjoonsoo.kim@lge.com \
    --cc=linux-mm@kvack.org \
    --cc=netdev@vger.kernel.org \
    --cc=penberg@kernel.org \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.