From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Christoph Lameter <cl@linux.com>
Cc: linux-mm@kvack.org, Vladimir Davydov <vdavydov@virtuozzo.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
brouer@redhat.com
Subject: Re: [RFC PATCH V2 8/9] slab: implement bulk free in SLAB allocator
Date: Tue, 15 Dec 2015 13:02:35 +0100
Message-ID: <20151215130235.10e92367@redhat.com>
In-Reply-To: <20151214161958.1a8edf79@redhat.com>
On Mon, 14 Dec 2015 16:19:58 +0100 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> On Wed, 9 Dec 2015 10:06:39 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
>
> > From: Christoph Lameter <cl@linux.com>
> > Subject: slab bulk api: Remove the kmem_cache parameter from kmem_cache_bulk_free()
> >
> > It is desirable and necessary to free objects from different kmem_caches.
> > It is required in order to support memcg object freeing across different
> > cgroups.
> >
> > So drop the pointless parameter and allow freeing of arbitrary lists
> > of slab allocated objects.
> >
> > This patch also does the proper compound page handling so that
> > arbitrary objects allocated via kmalloc() can be handled by
> > kmem_cache_bulk_free().
> >
> > Signed-off-by: Christoph Lameter <cl@linux.com>
>
> I've modified this patch to instead introduce a kfree_bulk() and keep
> the old behavior of kmem_cache_free_bulk(). This allows us to more
> easily compare the two implementation approaches.
>
> [...]
> > Index: linux/mm/slub.c
> > ===================================================================
> > --- linux.orig/mm/slub.c
> > +++ linux/mm/slub.c
> > @@ -2887,23 +2887,30 @@ static int build_detached_freelist(struc
> >
> >
> > /* Note that interrupts must be enabled when calling this function. */
> > -void kmem_cache_free_bulk(struct kmem_cache *orig_s, size_t size, void **p)
> > +void kmem_cache_free_bulk(size_t size, void **p)
>
> Renamed to kfree_bulk(size_t size, void **p)
>
> > {
> > if (WARN_ON(!size))
> > return;
> >
> > do {
> > struct detached_freelist df;
> > - struct kmem_cache *s;
> > + struct page *page;
> >
> > - /* Support for memcg */
> > - s = cache_from_obj(orig_s, p[size - 1]);
> > + page = virt_to_head_page(p[size - 1]);
> >
> > - size = build_detached_freelist(s, size, p, &df);
> > + if (unlikely(!PageSlab(page))) {
> > + BUG_ON(!PageCompound(page));
> > + kfree_hook(p[size - 1]);
> > + __free_kmem_pages(page, compound_order(page));
> > + p[--size] = NULL;
> > + continue;
> > + }
> > +
> > + size = build_detached_freelist(page->slab_cache, size, p, &df);
> > if (unlikely(!df.page))
> > continue;
> >
> > - slab_free(s, df.page, df.freelist, df.tail, df.cnt, _RET_IP_);
> > + slab_free(page->slab_cache, df.page, df.freelist,
> > + df.tail, df.cnt, _RET_IP_);
> > } while (likely(size));
> > }
> > EXPORT_SYMBOL(kmem_cache_free_bulk);
>
> This specific implementation was too slow, mostly because we call
> virt_to_head_page() both in this function and inside build_detached_freelist().
>
> After integrating this directly into build_detached_freelist(), I'm
> getting more comparable results.
>
> Results with CONFIG_MEMCG_KMEM disabled and SLUB (plus slab_nomerge for
> more accurate results between runs):
>
> bulk- fallback - kmem_cache_free_bulk - kfree_bulk
> 1 - 58 cycles 14.735 ns - 55 cycles 13.880 ns - 59 cycles 14.843 ns
> 2 - 53 cycles 13.298 ns - 32 cycles 8.037 ns - 34 cycles 8.592 ns
> 3 - 51 cycles 12.837 ns - 25 cycles 6.442 ns - 27 cycles 6.794 ns
> 4 - 50 cycles 12.514 ns - 23 cycles 5.952 ns - 23 cycles 5.958 ns
> 8 - 48 cycles 12.097 ns - 20 cycles 5.160 ns - 22 cycles 5.505 ns
> 16 - 47 cycles 11.888 ns - 19 cycles 4.900 ns - 19 cycles 4.969 ns
> 30 - 47 cycles 11.793 ns - 18 cycles 4.688 ns - 18 cycles 4.682 ns
> 32 - 47 cycles 11.926 ns - 18 cycles 4.674 ns - 18 cycles 4.702 ns
> 34 - 95 cycles 23.823 ns - 24 cycles 6.068 ns - 24 cycles 6.058 ns
> 48 - 81 cycles 20.258 ns - 21 cycles 5.360 ns - 21 cycles 5.338 ns
> 64 - 73 cycles 18.414 ns - 20 cycles 5.160 ns - 20 cycles 5.140 ns
> 128 - 90 cycles 22.563 ns - 27 cycles 6.765 ns - 27 cycles 6.801 ns
> 158 - 99 cycles 24.831 ns - 30 cycles 7.625 ns - 30 cycles 7.720 ns
> 250 - 104 cycles 26.173 ns - 37 cycles 9.271 ns - 37 cycles 9.371 ns
>
> As can be seen, the old kmem_cache_free_bulk() is faster than the new
> kfree_bulk() (which omits the kmem_cache pointer and needs to derive it
> from page->slab_cache). The base (bulk=1) extra cost is 4 cycles, which
> then gets amortized as build_detached_freelist() combines objects
> belonging to the same page.
>
> This is likely because, with CONFIG_MEMCG_KMEM=n, the compiler can
> optimize away the lookup of the kmem_cache structure entirely.
>
> I'll start doing testing with CONFIG_MEMCG_KMEM enabled...
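
For reference, the lookup in question is cache_from_obj() in mm/slab.h.
Roughly (a simplified sketch from memory, not a verbatim copy), it does:

/* Sketch of cache_from_obj(); simplified, not copied verbatim. */
static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
{
	struct page *page;

	/*
	 * With CONFIG_MEMCG_KMEM=n memcg_kmem_enabled() is constant false,
	 * so (absent debugging) the compiler can reduce the whole function
	 * to "return s" and the page lookup disappears.
	 */
	if (!memcg_kmem_enabled() && !(s->flags & SLAB_DEBUG_FREE))
		return s;

	page = virt_to_head_page(x);
	return page->slab_cache;	/* per-memcg clone (or the root cache) */
}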
Enabling CONFIG_MEMCG_KMEM didn't change the performance, likely
because memcg_kmem_enabled() uses a static_key_false() construct.
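
That construct is essentially (roughly, from include/linux/memcontrol.h):

/* Sketch: why CONFIG_MEMCG_KMEM=y alone is nearly free on the fastpath. */
extern struct static_key memcg_kmem_enabled_key;

static inline bool memcg_kmem_enabled(void)
{
	/*
	 * static_key_false() compiles to a no-op jump that is only patched
	 * over to the slow path once a kmemcg user flips the key at runtime,
	 * so the branch costs ~nothing until then.
	 */
	return static_key_false(&memcg_kmem_enabled_key);
}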
I implemented kfree_bulk() as an inline function that simply calls
kmem_cache_free_bulk(NULL, size, p), and ran with CONFIG_MEMCG_KMEM=y
(but no "user" of memcg):
bulk- fallback - kmem_cache_free_bulk - kmem_cache_free_bulk(s=NULL)
1 - 60 cycles 15.232 ns - 52 cycles 13.124 ns - 58 cycles 14.562 ns
2 - 54 cycles 13.678 ns - 31 cycles 7.875 ns - 34 cycles 8.509 ns
3 - 52 cycles 13.112 ns - 25 cycles 6.406 ns - 28 cycles 7.045 ns
4 - 51 cycles 12.833 ns - 24 cycles 6.036 ns - 24 cycles 6.076 ns
8 - 49 cycles 12.367 ns - 21 cycles 5.404 ns - 22 cycles 5.522 ns
16 - 48 cycles 12.144 ns - 19 cycles 4.934 ns - 19 cycles 4.967 ns
30 - 50 cycles 12.542 ns - 18 cycles 4.685 ns - 18 cycles 4.687 ns
32 - 48 cycles 12.234 ns - 18 cycles 4.661 ns - 18 cycles 4.662 ns
34 - 96 cycles 24.204 ns - 24 cycles 6.068 ns - 24 cycles 6.067 ns
48 - 82 cycles 20.553 ns - 21 cycles 5.410 ns - 21 cycles 5.433 ns
64 - 74 cycles 18.743 ns - 20 cycles 5.171 ns - 20 cycles 5.179 ns
128 - 91 cycles 22.791 ns - 26 cycles 6.601 ns - 26 cycles 6.600 ns
158 - 100 cycles 25.191 ns - 30 cycles 7.569 ns - 30 cycles 7.617 ns
250 - 106 cycles 26.535 ns - 37 cycles 9.426 ns - 37 cycles 9.427 ns

As can be seen, it is still slightly more costly to derive the kmem_cache
pointer via the object's page->slab_cache, even when compiled with
CONFIG_MEMCG_KMEM=y.

But what happens when enabling a "user" of kmemcg, which flips the
static_key_false() branch inside memcg_kmem_enabled()?

First, notice that the normal kmem fastpath cost increased: 70 cycles(tsc) 17.640 ns

bulk- fallback - kmem_cache_free_bulk - kmem_cache_free_bulk(s=NULL)
1 - 85 cycles 21.351 ns - 76 cycles 19.086 ns - 74 cycles 18.701 ns
2 - 79 cycles 19.907 ns - 43 cycles 10.848 ns - 41 cycles 10.467 ns
3 - 77 cycles 19.310 ns - 32 cycles 8.132 ns - 31 cycles 7.880 ns
4 - 76 cycles 19.017 ns - 28 cycles 7.195 ns - 28 cycles 7.050 ns
8 - 77 cycles 19.443 ns - 23 cycles 5.838 ns - 23 cycles 5.782 ns
16 - 75 cycles 18.815 ns - 20 cycles 5.174 ns - 20 cycles 5.127 ns
30 - 74 cycles 18.719 ns - 19 cycles 4.870 ns - 19 cycles 4.816 ns
32 - 75 cycles 18.917 ns - 19 cycles 4.850 ns - 19 cycles 4.800 ns
34 - 119 cycles 29.878 ns - 25 cycles 6.312 ns - 25 cycles 6.282 ns
48 - 106 cycles 26.536 ns - 21 cycles 5.471 ns - 21 cycles 5.448 ns
64 - 98 cycles 24.635 ns - 20 cycles 5.239 ns - 20 cycles 5.224 ns
128 - 114 cycles 28.586 ns - 27 cycles 6.768 ns - 26 cycles 6.736 ns
158 - 124 cycles 31.173 ns - 30 cycles 7.668 ns - 30 cycles 7.611 ns
250 - 129 cycles 32.336 ns - 37 cycles 9.460 ns - 37 cycles 9.468 ns

With kmemcg enabled at runtime, I was expecting to see a bigger win for
the kmem_cache_free_bulk(s=NULL) case.

I implemented a function kfree_bulk_export(), which inlines
build_detached_freelist() and avoids the memcg code path altogether. But
the result is almost the same:
bulk- fallback - kmem_cache_free_bulk - kfree_bulk_export
1 - 84 cycles 21.221 ns - 77 cycles 19.324 ns - 74 cycles 18.515 ns
2 - 79 cycles 19.963 ns - 44 cycles 11.120 ns - 41 cycles 10.329 ns
3 - 77 cycles 19.471 ns - 33 cycles 8.291 ns - 31 cycles 7.771 ns
4 - 76 cycles 19.191 ns - 28 cycles 7.100 ns - 27 cycles 6.765 ns
8 - 78 cycles 19.525 ns - 23 cycles 5.781 ns - 22 cycles 5.737 ns
16 - 75 cycles 18.922 ns - 20 cycles 5.152 ns - 20 cycles 5.092 ns
30 - 75 cycles 18.812 ns - 19 cycles 4.864 ns - 19 cycles 4.833 ns
32 - 75 cycles 18.807 ns - 19 cycles 4.852 ns - 23 cycles 5.896 ns
34 - 119 cycles 29.989 ns - 25 cycles 6.258 ns - 25 cycles 6.411 ns
48 - 106 cycles 26.677 ns - 28 cycles 7.182 ns - 22 cycles 5.651 ns
64 - 98 cycles 24.716 ns - 20 cycles 5.245 ns - 21 cycles 5.425 ns
128 - 115 cycles 28.905 ns - 26 cycles 6.748 ns - 27 cycles 6.787 ns
158 - 124 cycles 31.039 ns - 30 cycles 7.604 ns - 30 cycles 7.629 ns
250 - 129 cycles 32.372 ns - 37 cycles 9.475 ns - 37 cycles 9.325 ns

I took a closer look with perf, and it looks like the extra cost from
enabling kmemcg originates from the alloc code path. Specifically, the
call to get_mem_cgroup_from_mm() (9.95%) is the costly part, plus
__memcg_kmem_get_cache() (4.84%) and __memcg_kmem_put_cache() (1.35%).
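
Roughly (a sketch, not the literal kernel code) the alloc-side hook behaves
like the following, which is why the cost only appears once a kmemcg user
flips the static key:

/*
 * Sketch of the alloc-side kmemcg hook; the function name is made up for
 * illustration, the calls named in the comment are the real ones from the
 * perf profile above.
 */
static __always_inline struct kmem_cache *
memcg_alloc_hook_sketch(struct kmem_cache *s, gfp_t flags)
{
	if (!memcg_kmem_enabled())	/* static key: free until a user exists */
		return s;
	/*
	 * Slow path: get_mem_cgroup_from_mm(current->mm) resolves the task's
	 * memcg, __memcg_kmem_get_cache() selects that memcg's clone of 's',
	 * and __memcg_kmem_put_cache() runs after the allocation.  This is
	 * roughly where the ~20 extra cycles in the bulk=1 rows above go.
	 */
	return s;	/* placeholder; the real hook returns the per-memcg cache */
}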
(The current code is included as a patch below my signature.)
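
For completeness, a (hypothetical) caller of the proposed API would look
something like this:

#include <linux/slab.h>

/* Hypothetical usage sketch of kfree_bulk(); error handling kept minimal. */
static void kfree_bulk_usage_example(void)
{
	void *objs[16];
	size_t i, n = 0;

	for (i = 0; i < ARRAY_SIZE(objs); i++) {
		void *obj = kmalloc(64, GFP_KERNEL);

		if (!obj)
			break;
		objs[n++] = obj;
	}

	/* ... use the objects ... */

	/*
	 * Free the whole array in one call; each object's kmem_cache (or
	 * compound page) is derived from its page, no cache pointer needed.
	 */
	if (n)
		kfree_bulk(n, objs);
}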
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
[PATCH] New API kfree_bulk()
---
include/linux/slab.h | 14 +++++++++++++
mm/slab_common.c | 8 ++++++--
mm/slub.c | 52 +++++++++++++++++++++++++++++++++++++++++++++-----
3 files changed, 67 insertions(+), 7 deletions(-)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 2037a861e367..9dd353215c2b 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -317,6 +317,20 @@ void kmem_cache_free(struct kmem_cache *, void *);
*/
void kmem_cache_free_bulk(struct kmem_cache *, size_t, void **);
int kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
+void kfree_bulk_export(size_t, void **);
+
+static __always_inline void kfree_bulk(size_t size, void **p)
+{
+ /* Reusing the call to kmem_cache_free_bulk() allows kfree_bulk() to
+ * share the same code icache
+ */
+ kmem_cache_free_bulk(NULL, size, p);
+ /*
+ * Inlining in kfree_bulk_export() generates smaller code size
+ * than the more generic kmem_cache_free_bulk().
+ */
+ // kfree_bulk_export(size, p);
+}
#ifdef CONFIG_NUMA
void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 3c6a86b4ec25..963c25589949 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -108,8 +108,12 @@ void __kmem_cache_free_bulk(struct kmem_cache *s, size_t nr, void **p)
{
size_t i;
- for (i = 0; i < nr; i++)
- kmem_cache_free(s, p[i]);
+ for (i = 0; i < nr; i++) {
+ if (s)
+ kmem_cache_free(s, p[i]);
+ else
+ kfree(p[i]);
+ }
}
int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
diff --git a/mm/slub.c b/mm/slub.c
index 4e4dfc4cdd0c..1675f42f17b5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2833,29 +2833,46 @@ struct detached_freelist {
* synchronization primitive. Look ahead in the array is limited due
* to performance reasons.
*/
-static int build_detached_freelist(struct kmem_cache **s, size_t size,
- void **p, struct detached_freelist *df)
+static inline
+int build_detached_freelist(struct kmem_cache **s, size_t size,
+ void **p, struct detached_freelist *df)
{
size_t first_skipped_index = 0;
int lookahead = 3;
void *object;
+ struct page *page;
/* Always re-init detached_freelist */
df->page = NULL;
do {
object = p[--size];
+ /* Do we need !ZERO_OR_NULL_PTR(object) here? (for kfree) */
} while (!object && size);
if (!object)
return 0;
- /* Support for memcg, compiler can optimize this out */
- *s = cache_from_obj(*s, object);
+ page = virt_to_head_page(object);
+ if (!*s) {
+ /* Handle kalloc'ed objects */
+ if (unlikely(!PageSlab(page))) {
+ BUG_ON(!PageCompound(page));
+ kfree_hook(object);
+ __free_kmem_pages(page, compound_order(page));
+ p[size] = NULL; /* mark object processed */
+ return size;
+ }
+ /* Derive kmem_cache from object */
+ *s = page->slab_cache;
+ } else {
+ /* Support for memcg, compiler can optimize this out */
+ *s = cache_from_obj(*s, object);
+ }
/* Start new detached freelist */
+ df->page = page;
set_freepointer(*s, object, NULL);
- df->page = virt_to_head_page(object);
df->tail = object;
df->freelist = object;
p[size] = NULL; /* mark object processed */
@@ -2906,6 +2923,31 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
}
EXPORT_SYMBOL(kmem_cache_free_bulk);
+/* Note that interrupts must be enabled when calling this function.
+ * This function can also handle arrays of kmem_cache allocated objects.
+ */
+// Will likely remove this function
+void kfree_bulk_export(size_t size, void **p)
+{
+ if (WARN_ON(!size))
+ return;
+
+ do {
+ struct detached_freelist df;
+ /* NULL here removes code in inlined build_detached_freelist */
+ struct kmem_cache *s = NULL;
+
+ size = build_detached_freelist(&s, size, p, &df);
+ if (unlikely(!df.page))
+ continue;
+
+ slab_free(s, df.page, df.freelist, df.tail, df.cnt, _RET_IP_);
+// slab_free(df.page->slab_cache, df.page, df.freelist, df.tail,
+// df.cnt, _RET_IP_);
+ } while (likely(size));
+}
+EXPORT_SYMBOL(kfree_bulk_export);
+
/* Note that interrupts must be enabled when calling this function. */
int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
--