* [PATCH -mm v2 0/3] slub: make dead caches discard free slabs immediately @ 2015-01-28 16:22 Vladimir Davydov 2015-01-28 16:22 ` [PATCH -mm v2 1/3] slub: never fail to shrink cache Vladimir Davydov ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Vladimir Davydov @ 2015-01-28 16:22 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel Hi, The kmem extension of the memory cgroup is almost usable now. In fact, only one serious issue is left: per-memcg kmem caches may pin the owner cgroup indefinitely. This is because a slab cache may keep empty slab pages in its private structures to optimize performance, while we take a css reference for each charged kmem page. The issue is only relevant to SLUB, because SLAB periodically reaps empty slabs. This patch set fixes this issue for SLUB. For details, please see patch 3. Changes in v2: - address Christoph's concerns regarding kmem_cache_shrink - fix race between put_cpu_partial reading ->cpu_partial and kmem_cache_shrink updating it as proposed by Joonsoo v1: https://lkml.org/lkml/2015/1/26/317 Thanks, Vladimir Davydov (3): slub: never fail to shrink cache slub: fix kmem_cache_shrink return value slub: make dead caches discard free slabs immediately mm/slab.c | 4 +-- mm/slab.h | 2 +- mm/slab_common.c | 15 +++++++-- mm/slob.c | 2 +- mm/slub.c | 94 +++++++++++++++++++++++++++++++++++------------------- 5 files changed, 78 insertions(+), 39 deletions(-) -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-28 16:22 [PATCH -mm v2 0/3] slub: make dead caches discard free slabs immediately Vladimir Davydov @ 2015-01-28 16:22 ` Vladimir Davydov 2015-01-28 16:31 ` Christoph Lameter ` (3 more replies) 2015-01-28 16:22 ` [PATCH -mm v2 2/3] slub: fix kmem_cache_shrink return value Vladimir Davydov 2015-01-28 16:22 ` [PATCH -mm v2 3/3] slub: make dead caches discard free slabs immediately Vladimir Davydov 2 siblings, 4 replies; 26+ messages in thread From: Vladimir Davydov @ 2015-01-28 16:22 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel SLUB's version of __kmem_cache_shrink() not only removes empty slabs, but also tries to rearrange the partial lists to place slabs filled up most to the head to cope with fragmentation. To achieve that, it allocates a temporary array of lists used to sort slabs by the number of objects in use. If the allocation fails, the whole procedure is aborted. This is unacceptable for the kernel memory accounting extension of the memory cgroup, where we want to make sure that kmem_cache_shrink() successfully discarded empty slabs. Although the allocation failure is utterly unlikely with the current page allocator implementation, which retries GFP_KERNEL allocations of order <= 2 infinitely, it is better not to rely on that. This patch therefore makes __kmem_cache_shrink() allocate the array on stack instead of calling kmalloc, which may fail. The array size is chosen to be equal to 32, because most SLUB caches store not more than 32 objects per slab page. Slab pages with <= 32 free objects are sorted using the array by the number of objects in use and promoted to the head of the partial list, while slab pages with > 32 free objects are left in the end of the list without any ordering imposed on them. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> --- mm/slub.c | 57 ++++++++++++++++++++++++++++++--------------------------- 1 file changed, 30 insertions(+), 27 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 1562955fe099..dbf9334b6a5c 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3358,11 +3358,12 @@ void kfree(const void *x) } EXPORT_SYMBOL(kfree); +#define SHRINK_PROMOTE_MAX 32 + /* - * kmem_cache_shrink removes empty slabs from the partial lists and sorts - * the remaining slabs by the number of items in use. The slabs with the - * most items in use come first. New allocations will then fill those up - * and thus they can be removed from the partial lists. + * kmem_cache_shrink discards empty slabs and promotes the slabs filled + * up most to the head of the partial lists. New allocations will then + * fill those up and thus they can be removed from the partial lists. * * The slabs with the least items are placed last. 
This results in them * being allocated from last increasing the chance that the last objects @@ -3375,51 +3376,56 @@ int __kmem_cache_shrink(struct kmem_cache *s) struct kmem_cache_node *n; struct page *page; struct page *t; - int objects = oo_objects(s->max); - struct list_head *slabs_by_inuse = - kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL); + LIST_HEAD(discard); + struct list_head promote[SHRINK_PROMOTE_MAX]; unsigned long flags; - if (!slabs_by_inuse) - return -ENOMEM; + for (i = 0; i < SHRINK_PROMOTE_MAX; i++) + INIT_LIST_HEAD(promote + i); flush_all(s); for_each_kmem_cache_node(s, node, n) { if (!n->nr_partial) continue; - for (i = 0; i < objects; i++) - INIT_LIST_HEAD(slabs_by_inuse + i); - spin_lock_irqsave(&n->list_lock, flags); /* - * Build lists indexed by the items in use in each slab. + * Build lists of slabs to discard or promote. * * Note that concurrent frees may occur while we hold the * list_lock. page->inuse here is the upper limit. */ list_for_each_entry_safe(page, t, &n->partial, lru) { - list_move(&page->lru, slabs_by_inuse + page->inuse); - if (!page->inuse) + int free = page->objects - page->inuse; + + /* Do not reread page->inuse */ + barrier(); + + /* We do not keep full slabs on the list */ + BUG_ON(free <= 0); + + if (free == page->objects) { + list_move(&page->lru, &discard); n->nr_partial--; + } else if (free <= SHRINK_PROMOTE_MAX) + list_move(&page->lru, promote + free - 1); } /* - * Rebuild the partial list with the slabs filled up most - * first and the least used slabs at the end. + * Promote the slabs filled up most to the head of the + * partial list. */ - for (i = objects - 1; i > 0; i--) - list_splice(slabs_by_inuse + i, n->partial.prev); + for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--) + list_splice_init(promote + i, &n->partial); spin_unlock_irqrestore(&n->list_lock, flags); /* Release empty slabs */ - list_for_each_entry_safe(page, t, slabs_by_inuse, lru) + list_for_each_entry_safe(page, t, &discard, lru) discard_slab(s, page); } - kfree(slabs_by_inuse); return 0; } @@ -4682,12 +4688,9 @@ static ssize_t shrink_show(struct kmem_cache *s, char *buf) static ssize_t shrink_store(struct kmem_cache *s, const char *buf, size_t length) { - if (buf[0] == '1') { - int rc = kmem_cache_shrink(s); - - if (rc) - return rc; - } else + if (buf[0] == '1') + kmem_cache_shrink(s); + else return -EINVAL; return length; } -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 26+ messages in thread
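For readers skimming the diff above, the promotion scheme it introduces boils down to the following condensed sketch (illustrative C only, not the kernel code itself: the list_lock, the barrier() against rereading page->inuse, the nr_partial bookkeeping and the flush_all() call are all omitted, and shrink_node_sketch is a made-up name):

#define SHRINK_PROMOTE_MAX 32

/*
 * Sketch: bucket the partial slabs of one node by their number of free
 * objects, using a fixed on-stack array instead of a kmalloc'ed one.
 */
static void shrink_node_sketch(struct kmem_cache *s, struct kmem_cache_node *n)
{
	LIST_HEAD(discard);				/* completely empty slabs */
	struct list_head promote[SHRINK_PROMOTE_MAX];	/* 1..32 free objects  */
	struct page *page, *t;
	int i;

	for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
		INIT_LIST_HEAD(promote + i);

	list_for_each_entry_safe(page, t, &n->partial, lru) {
		int free = page->objects - page->inuse;

		if (free == page->objects)		/* empty: release it */
			list_move(&page->lru, &discard);
		else if (free <= SHRINK_PROMOTE_MAX)	/* nearly full: sort it */
			list_move(&page->lru, promote + free - 1);
		/* slabs with more than 32 free objects stay where they are */
	}

	/*
	 * Splice the buckets back, most-free first, so the fullest slabs
	 * (fewest free objects) end up at the head of the partial list.
	 */
	for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--)
		list_splice_init(promote + i, &n->partial);

	list_for_each_entry_safe(page, t, &discard, lru)
		discard_slab(s, page);
}

Unlike the old code, which needed oo_objects(s->max) list heads and therefore a kmalloc() that could fail, the bucket count here is fixed, so the shrink operation can no longer fail for lack of memory.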
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-28 16:22 ` [PATCH -mm v2 1/3] slub: never fail to shrink cache Vladimir Davydov @ 2015-01-28 16:31 ` Christoph Lameter 2015-01-28 18:29 ` Pekka Enberg 2015-01-28 16:37 ` Christoph Lameter ` (2 subsequent siblings) 3 siblings, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2015-01-28 16:31 UTC (permalink / raw) To: Vladimir Davydov Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Wed, 28 Jan 2015, Vladimir Davydov wrote: > This patch therefore makes __kmem_cache_shrink() allocate the array on > stack instead of calling kmalloc, which may fail. The array size is > chosen to be equal to 32, because most SLUB caches store not more than > 32 objects per slab page. Slab pages with <= 32 free objects are sorted > using the array by the number of objects in use and promoted to the head > of the partial list, while slab pages with > 32 free objects are left in > the end of the list without any ordering imposed on them. Acked-by: Christoph Lameter <cl@linux.com> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-28 16:31 ` Christoph Lameter @ 2015-01-28 18:29 ` Pekka Enberg 0 siblings, 0 replies; 26+ messages in thread From: Pekka Enberg @ 2015-01-28 18:29 UTC (permalink / raw) To: Christoph Lameter, Vladimir Davydov Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On 1/28/15 6:31 PM, Christoph Lameter wrote: > On Wed, 28 Jan 2015, Vladimir Davydov wrote: > >> This patch therefore makes __kmem_cache_shrink() allocate the array on >> stack instead of calling kmalloc, which may fail. The array size is >> chosen to be equal to 32, because most SLUB caches store not more than >> 32 objects per slab page. Slab pages with <= 32 free objects are sorted >> using the array by the number of objects in use and promoted to the head >> of the partial list, while slab pages with > 32 free objects are left in >> the end of the list without any ordering imposed on them. > Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Pekka Enberg <penberg@kernel.org> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-28 16:22 ` [PATCH -mm v2 1/3] slub: never fail to shrink cache Vladimir Davydov 2015-01-28 16:31 ` Christoph Lameter @ 2015-01-28 16:37 ` Christoph Lameter 2015-01-28 17:32 ` Vladimir Davydov 2015-01-28 21:57 ` Andrew Morton 2015-02-15 3:55 ` Sasha Levin 3 siblings, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2015-01-28 16:37 UTC (permalink / raw) To: Vladimir Davydov Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Wed, 28 Jan 2015, Vladimir Davydov wrote: > + /* We do not keep full slabs on the list */ > + BUG_ON(free <= 0); Well sorry we do actually keep a number of empty slabs on the partial lists. See the min_partial field in struct kmem_cache. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-28 16:37 ` Christoph Lameter @ 2015-01-28 17:32 ` Vladimir Davydov 2015-01-28 19:20 ` Christoph Lameter 0 siblings, 1 reply; 26+ messages in thread From: Vladimir Davydov @ 2015-01-28 17:32 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Wed, Jan 28, 2015 at 10:37:09AM -0600, Christoph Lameter wrote: > On Wed, 28 Jan 2015, Vladimir Davydov wrote: > > > + /* We do not keep full slabs on the list */ > > + BUG_ON(free <= 0); > > Well sorry we do actually keep a number of empty slabs on the partial > lists. See the min_partial field in struct kmem_cache. It's not about empty slabs, it's about full slabs: free == 0 means slab is full. Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-28 17:32 ` Vladimir Davydov @ 2015-01-28 19:20 ` Christoph Lameter 0 siblings, 0 replies; 26+ messages in thread From: Christoph Lameter @ 2015-01-28 19:20 UTC (permalink / raw) To: Vladimir Davydov Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Wed, 28 Jan 2015, Vladimir Davydov wrote: > On Wed, Jan 28, 2015 at 10:37:09AM -0600, Christoph Lameter wrote: > > On Wed, 28 Jan 2015, Vladimir Davydov wrote: > > > > > + /* We do not keep full slabs on the list */ > > > + BUG_ON(free <= 0); > > > > Well sorry we do actually keep a number of empty slabs on the partial > > lists. See the min_partial field in struct kmem_cache. > > It's not about empty slabs, it's about full slabs: free == 0 means slab > is full. Correct. I already acked the patch. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-28 16:22 ` [PATCH -mm v2 1/3] slub: never fail to shrink cache Vladimir Davydov 2015-01-28 16:31 ` Christoph Lameter 2015-01-28 16:37 ` Christoph Lameter @ 2015-01-28 21:57 ` Andrew Morton 2015-01-28 22:56 ` Christoph Lameter ` (2 more replies) 2015-02-15 3:55 ` Sasha Levin 3 siblings, 3 replies; 26+ messages in thread From: Andrew Morton @ 2015-01-28 21:57 UTC (permalink / raw) To: Vladimir Davydov Cc: Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Wed, 28 Jan 2015 19:22:49 +0300 Vladimir Davydov <vdavydov@parallels.com> wrote: > SLUB's version of __kmem_cache_shrink() not only removes empty slabs, > but also tries to rearrange the partial lists to place slabs filled up > most to the head to cope with fragmentation. To achieve that, it > allocates a temporary array of lists used to sort slabs by the number of > objects in use. If the allocation fails, the whole procedure is aborted. > > This is unacceptable for the kernel memory accounting extension of the > memory cgroup, where we want to make sure that kmem_cache_shrink() > successfully discarded empty slabs. Although the allocation failure is > utterly unlikely with the current page allocator implementation, which > retries GFP_KERNEL allocations of order <= 2 infinitely, it is better > not to rely on that. > > This patch therefore makes __kmem_cache_shrink() allocate the array on > stack instead of calling kmalloc, which may fail. The array size is > chosen to be equal to 32, because most SLUB caches store not more than > 32 objects per slab page. Slab pages with <= 32 free objects are sorted > using the array by the number of objects in use and promoted to the head > of the partial list, while slab pages with > 32 free objects are left in > the end of the list without any ordering imposed on them. > > ... > > @@ -3375,51 +3376,56 @@ int __kmem_cache_shrink(struct kmem_cache *s) > struct kmem_cache_node *n; > struct page *page; > struct page *t; > - int objects = oo_objects(s->max); > - struct list_head *slabs_by_inuse = > - kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL); > + LIST_HEAD(discard); > + struct list_head promote[SHRINK_PROMOTE_MAX]; 512 bytes of stack. The call paths leading to __kmem_cache_shrink() are many and twisty. How do we know this isn't a problem? The logic behind choosing "32" sounds rather rubbery. What goes wrong if we use, say, "4"? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
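For reference, the 512-byte figure is just the size of the fixed promote array: a struct list_head is two pointers, i.e. 16 bytes on a 64-bit kernel, times 32 buckets. A trivial userspace illustration (the struct definition below is a stand-in assumed only for the size arithmetic):

#include <stdio.h>

struct list_head { void *next, *prev; };	/* 16 bytes with 8-byte pointers */
#define SHRINK_PROMOTE_MAX 32

int main(void)
{
	/* mirrors: struct list_head promote[SHRINK_PROMOTE_MAX]; */
	printf("%zu bytes\n", SHRINK_PROMOTE_MAX * sizeof(struct list_head));
	return 0;				/* prints "512 bytes" on LP64 */
}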
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-28 21:57 ` Andrew Morton @ 2015-01-28 22:56 ` Christoph Lameter 2015-01-29 8:07 ` Vladimir Davydov 2015-01-29 8:32 ` Balbir Singh 2 siblings, 0 replies; 26+ messages in thread From: Christoph Lameter @ 2015-01-28 22:56 UTC (permalink / raw) To: Andrew Morton Cc: Vladimir Davydov, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Wed, 28 Jan 2015, Andrew Morton wrote: > The logic behind choosing "32" sounds rather rubbery. What goes wrong > if we use, say, "4"? Like more fragmentation of slab pages. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-28 21:57 ` Andrew Morton 2015-01-28 22:56 ` Christoph Lameter @ 2015-01-29 8:07 ` Vladimir Davydov 2015-01-29 15:55 ` Christoph Lameter 2015-01-29 8:32 ` Balbir Singh 2 siblings, 1 reply; 26+ messages in thread From: Vladimir Davydov @ 2015-01-29 8:07 UTC (permalink / raw) To: Andrew Morton, Christoph Lameter Cc: Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Wed, Jan 28, 2015 at 01:57:52PM -0800, Andrew Morton wrote: > On Wed, 28 Jan 2015 19:22:49 +0300 Vladimir Davydov <vdavydov@parallels.com> wrote: > > @@ -3375,51 +3376,56 @@ int __kmem_cache_shrink(struct kmem_cache *s) > > struct kmem_cache_node *n; > > struct page *page; > > struct page *t; > > - int objects = oo_objects(s->max); > > - struct list_head *slabs_by_inuse = > > - kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL); > > + LIST_HEAD(discard); > > + struct list_head promote[SHRINK_PROMOTE_MAX]; > > 512 bytes of stack. The call paths leading to __kmem_cache_shrink() > are many and twisty. How do we know this isn't a problem? Because currently __kmem_cache_shrink is only called just from a couple of places, each of which isn't supposed to have a great stack depth AFAIU, namely: - slab_mem_going_offline_callback - MEM_GOING_OFFLINE handler - shrink_store - invoked upon write to /sys/kernel/slab/cache/shrink - acpi_os_purge_cache - only called on acpi init - memcg_deactivate_kmem_caches - called from cgroup_destroy_wq > The logic behind choosing "32" sounds rather rubbery. What goes wrong > if we use, say, "4"? We could, but kmem_cache_shrink would cope with fragmentation less efficiently. Come to think of it, do we really need to optimize slab placement in kmem_cache_shrink? None of its users except shrink_store expects it - they just want to purge the cache before destruction, that's it. May be, we'd better move slab placement optimization to a separate SLUB's private function that would be called only by shrink_store, where we can put up with kmalloc failures? Christoph, what do you think? Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-29 8:07 ` Vladimir Davydov @ 2015-01-29 15:55 ` Christoph Lameter 2015-01-29 16:17 ` Vladimir Davydov 0 siblings, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2015-01-29 15:55 UTC (permalink / raw) To: Vladimir Davydov Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Thu, 29 Jan 2015, Vladimir Davydov wrote: > Come to think of it, do we really need to optimize slab placement in > kmem_cache_shrink? None of its users except shrink_store expects it - > they just want to purge the cache before destruction, that's it. May be, > we'd better move slab placement optimization to a separate SLUB's > private function that would be called only by shrink_store, where we can > put up with kmalloc failures? Christoph, what do you think? The slabinfo tool invokes kmem_cache_shrink to optimize placement. Run slabinfo -s which can then be used to reduce the fragmentation. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-29 15:55 ` Christoph Lameter @ 2015-01-29 16:17 ` Vladimir Davydov 2015-01-29 16:22 ` Christoph Lameter 0 siblings, 1 reply; 26+ messages in thread From: Vladimir Davydov @ 2015-01-29 16:17 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Thu, Jan 29, 2015 at 09:55:56AM -0600, Christoph Lameter wrote: > On Thu, 29 Jan 2015, Vladimir Davydov wrote: > > > Come to think of it, do we really need to optimize slab placement in > > kmem_cache_shrink? None of its users except shrink_store expects it - > > they just want to purge the cache before destruction, that's it. May be, > > we'd better move slab placement optimization to a separate SLUB's > > private function that would be called only by shrink_store, where we can > > put up with kmalloc failures? Christoph, what do you think? > > The slabinfo tool invokes kmem_cache_shrink to optimize placement. > > Run > > slabinfo -s > > which can then be used to reduce the fragmentation. Yeah, but the tool just writes 1 to /sys/kernel/slab/cache/shrink, i.e. invokes shrink_store(), and I don't propose to remove slab placement optimization from there. What I propose is to move slab placement optimization from kmem_cache_shrink() to shrink_store(), because other users of kmem_cache_shrink() don't seem to need it at all - they just want to release empty slabs. Such a change wouldn't affect the behavior of `slabinfo -s` at all. Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-29 16:17 ` Vladimir Davydov @ 2015-01-29 16:22 ` Christoph Lameter 2015-01-29 18:21 ` Vladimir Davydov 0 siblings, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2015-01-29 16:22 UTC (permalink / raw) To: Vladimir Davydov Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Thu, 29 Jan 2015, Vladimir Davydov wrote: > Yeah, but the tool just writes 1 to /sys/kernel/slab/cache/shrink, i.e. > invokes shrink_store(), and I don't propose to remove slab placement > optimization from there. What I propose is to move slab placement > optimization from kmem_cache_shrink() to shrink_store(), because other > users of kmem_cache_shrink() don't seem to need it at all - they just > want to release empty slabs. Such a change wouldn't affect the behavior > of `slabinfo -s` at all. Well we have to go through the chain of partial slabs anyways so its easy to do the optimization at that point. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-29 16:22 ` Christoph Lameter @ 2015-01-29 18:21 ` Vladimir Davydov 2015-01-29 19:10 ` Christoph Lameter 0 siblings, 1 reply; 26+ messages in thread From: Vladimir Davydov @ 2015-01-29 18:21 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Thu, Jan 29, 2015 at 10:22:16AM -0600, Christoph Lameter wrote: > On Thu, 29 Jan 2015, Vladimir Davydov wrote: > > > Yeah, but the tool just writes 1 to /sys/kernel/slab/cache/shrink, i.e. > > invokes shrink_store(), and I don't propose to remove slab placement > > optimization from there. What I propose is to move slab placement > > optimization from kmem_cache_shrink() to shrink_store(), because other > > users of kmem_cache_shrink() don't seem to need it at all - they just > > want to release empty slabs. Such a change wouldn't affect the behavior > > of `slabinfo -s` at all. > > Well we have to go through the chain of partial slabs anyways so its easy > to do the optimization at that point. That's true, but we can introduce a separate function that would both release empty slabs and optimize slab placement, like the patch below does. It would increase the code size a bit though, so I don't insist. diff --git a/mm/slub.c b/mm/slub.c index 1562955fe099..2cd401d82a41 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3359,7 +3359,7 @@ void kfree(const void *x) EXPORT_SYMBOL(kfree); /* - * kmem_cache_shrink removes empty slabs from the partial lists and sorts + * shrink_slab_cache removes empty slabs from the partial lists and sorts * the remaining slabs by the number of items in use. The slabs with the * most items in use come first. New allocations will then fill those up * and thus they can be removed from the partial lists. @@ -3368,7 +3368,7 @@ EXPORT_SYMBOL(kfree); * being allocated from last increasing the chance that the last objects * are freed in them. */ -int __kmem_cache_shrink(struct kmem_cache *s) +static int shrink_slab_cache(struct kmem_cache *s) { int node; int i; @@ -3423,6 +3423,32 @@ int __kmem_cache_shrink(struct kmem_cache *s) return 0; } +static int __kmem_cache_shrink(struct kmem_cache *s) +{ + int node; + struct kmem_cache_node *n; + struct page *page, *t; + LIST_HEAD(discard); + unsigned long flags; + int ret = 0; + + flush_all(s); + for_each_kmem_cache_node(s, node, n) { + spin_lock_irqsave(&n->list_lock, flags); + list_for_each_entry_safe(page, t, &n->partial, lru) + if (!page->inuse) + list_move(&page->lru, &discard); + spin_unlock_irqrestore(&n->list_lock, flags); + + list_for_each_entry_safe(page, t, &discard, lru) + discard_slab(s, page); + + if (slabs_node(s, node)) + ret = 1; + } + return ret; +} + static int slab_mem_going_offline_callback(void *arg) { struct kmem_cache *s; @@ -4683,7 +4709,7 @@ static ssize_t shrink_store(struct kmem_cache *s, const char *buf, size_t length) { if (buf[0] == '1') { - int rc = kmem_cache_shrink(s); + int rc = shrink_slab_cache(s); if (rc) return rc; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-29 18:21 ` Vladimir Davydov @ 2015-01-29 19:10 ` Christoph Lameter 0 siblings, 0 replies; 26+ messages in thread From: Christoph Lameter @ 2015-01-29 19:10 UTC (permalink / raw) To: Vladimir Davydov Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Thu, 29 Jan 2015, Vladimir Davydov wrote: > > Well we have to go through the chain of partial slabs anyways so its easy > > to do the optimization at that point. > > That's true, but we can introduce a separate function that would both > release empty slabs and optimize slab placement, like the patch below > does. It would increase the code size a bit though, so I don't insist. It would also change what slabinfo -s does. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-28 21:57 ` Andrew Morton 2015-01-28 22:56 ` Christoph Lameter 2015-01-29 8:07 ` Vladimir Davydov @ 2015-01-29 8:32 ` Balbir Singh 2 siblings, 0 replies; 26+ messages in thread From: Balbir Singh @ 2015-01-29 8:32 UTC (permalink / raw) To: Andrew Morton Cc: Vladimir Davydov, Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel@vger.kernel.org On Thu, Jan 29, 2015 at 3:27 AM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 28 Jan 2015 19:22:49 +0300 Vladimir Davydov <vdavydov@parallels.com> wrote: > >> SLUB's version of __kmem_cache_shrink() not only removes empty slabs, >> but also tries to rearrange the partial lists to place slabs filled up >> most to the head to cope with fragmentation. To achieve that, it >> allocates a temporary array of lists used to sort slabs by the number of >> objects in use. If the allocation fails, the whole procedure is aborted. >> >> This is unacceptable for the kernel memory accounting extension of the >> memory cgroup, where we want to make sure that kmem_cache_shrink() >> successfully discarded empty slabs. Although the allocation failure is >> utterly unlikely with the current page allocator implementation, which >> retries GFP_KERNEL allocations of order <= 2 infinitely, it is better >> not to rely on that. >> >> This patch therefore makes __kmem_cache_shrink() allocate the array on >> stack instead of calling kmalloc, which may fail. The array size is >> chosen to be equal to 32, because most SLUB caches store not more than >> 32 objects per slab page. Slab pages with <= 32 free objects are sorted >> using the array by the number of objects in use and promoted to the head >> of the partial list, while slab pages with > 32 free objects are left in >> the end of the list without any ordering imposed on them. >> >> ... >> >> @@ -3375,51 +3376,56 @@ int __kmem_cache_shrink(struct kmem_cache *s) >> struct kmem_cache_node *n; >> struct page *page; >> struct page *t; >> - int objects = oo_objects(s->max); >> - struct list_head *slabs_by_inuse = >> - kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL); >> + LIST_HEAD(discard); >> + struct list_head promote[SHRINK_PROMOTE_MAX]; > > 512 bytes of stack. The call paths leading to __kmem_cache_shrink() > are many and twisty. How do we know this isn't a problem? > > The logic behind choosing "32" sounds rather rubbery. What goes wrong > if we use, say, "4"? This much space in the stack may be fertile grounds for kernel stack overflow code execution :) Another perspective could be that there should be allocations that are not penalized to a particular cgroup (from an accounting perspective), should come from the reserved pool. Balbir Singh. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-01-28 16:22 ` [PATCH -mm v2 1/3] slub: never fail to shrink cache Vladimir Davydov ` (2 preceding siblings ...) 2015-01-28 21:57 ` Andrew Morton @ 2015-02-15 3:55 ` Sasha Levin 2015-02-15 9:47 ` Vladimir Davydov 3 siblings, 1 reply; 26+ messages in thread From: Sasha Levin @ 2015-02-15 3:55 UTC (permalink / raw) To: Vladimir Davydov, Andrew Morton Cc: Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel Hi Vladimir, On 01/28/2015 11:22 AM, Vladimir Davydov wrote: > SLUB's version of __kmem_cache_shrink() not only removes empty slabs, > but also tries to rearrange the partial lists to place slabs filled up > most to the head to cope with fragmentation. To achieve that, it > allocates a temporary array of lists used to sort slabs by the number of > objects in use. If the allocation fails, the whole procedure is aborted. > > This is unacceptable for the kernel memory accounting extension of the > memory cgroup, where we want to make sure that kmem_cache_shrink() > successfully discarded empty slabs. Although the allocation failure is > utterly unlikely with the current page allocator implementation, which > retries GFP_KERNEL allocations of order <= 2 infinitely, it is better > not to rely on that. > > This patch therefore makes __kmem_cache_shrink() allocate the array on > stack instead of calling kmalloc, which may fail. The array size is > chosen to be equal to 32, because most SLUB caches store not more than > 32 objects per slab page. Slab pages with <= 32 free objects are sorted > using the array by the number of objects in use and promoted to the head > of the partial list, while slab pages with > 32 free objects are left in > the end of the list without any ordering imposed on them. > > Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> It seems that this patch causes shrink to corrupt memory: # echo 1 > /sys/kernel/slab/kmalloc-64/shrink # [ 60.331433] ============================================================================= [ 60.333052] BUG kmalloc-64 (Not tainted): Padding overwritten. 0xffff880051018f50-0xffff880051018fff [ 60.335714] ----------------------------------------------------------------------------- [ 60.335714] [ 60.338530] Disabling lock debugging due to kernel taint [ 60.340140] INFO: Slab 0xffffea0001440600 objects=32767 used=65408 fp=0x (null) flags=0x5fffff80000000 [ 60.343095] CPU: 0 PID: 8634 Comm: sh Tainted: G B 3.19.0-next-20150213-sasha-00045-g897c679-dirty #1920 [ 60.345315] ffff88001fc0ee40 00000000542c5de6 ffff88001114f908 ffffffffabb74948 [ 60.346454] 0000000000000000 ffffea0001440600 ffff88001114f9e8 ffffffffa1764b4f [ 60.347582] 0000000000000007 ffff880000000028 ffff88001114f9f8 ffff88001114f9a8 [ 60.349842] Call Trace: [ 60.350243] [<ffffffffabb74948>] dump_stack+0x4f/0x7b [ 60.350975] [<ffffffffa1764b4f>] slab_err+0xaf/0xd0 [ 60.351714] [<ffffffffa307cce0>] ? memchr_inv+0x2c0/0x360 [ 60.352592] [<ffffffffa17671b0>] slab_pad_check+0x120/0x1c0 [ 60.353418] [<ffffffffa17673a4>] __free_slab+0x154/0x1f0 [ 60.354209] [<ffffffffa141c365>] ? trace_hardirqs_on_caller+0x475/0x610 [ 60.355168] [<ffffffffa1767478>] discard_slab+0x38/0x60 [ 60.355909] [<ffffffffa176d478>] __kmem_cache_shrink+0x258/0x300 [ 60.356801] [<ffffffffa1764770>] ? print_tracking+0x70/0x70 [ 60.357621] [<ffffffffa1764770>] ? 
print_tracking+0x70/0x70 [ 60.358448] [<ffffffffa16c8460>] kmem_cache_shrink+0x20/0x30 [ 60.359279] [<ffffffffa17665db>] shrink_store+0x1b/0x30 [ 60.360048] [<ffffffffa17647af>] slab_attr_store+0x3f/0xf0 [ 60.360951] [<ffffffffa1764770>] ? print_tracking+0x70/0x70 [ 60.361778] [<ffffffffa193c56a>] sysfs_kf_write+0x11a/0x180 [ 60.362601] [<ffffffffa193c450>] ? sysfs_file_ops+0x170/0x170 [ 60.363447] [<ffffffffa1939ea1>] kernfs_fop_write+0x271/0x3b0 [ 60.364348] [<ffffffffa17ba386>] vfs_write+0x186/0x5d0 [ 60.365112] [<ffffffffa17bd146>] SyS_write+0x126/0x270 [ 60.365837] [<ffffffffa17bd020>] ? SyS_read+0x270/0x270 [ 60.366608] [<ffffffffa141c365>] ? trace_hardirqs_on_caller+0x475/0x610 [ 60.367580] [<ffffffffa3091d7b>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 60.368566] [<ffffffffabbea3ad>] system_call_fastpath+0x16/0x1b [ 60.369435] Padding ffff880051018f50: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk [ 60.370750] Padding ffff880051018f60: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk [ 60.372077] Padding ffff880051018f70: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk [ 60.373450] Padding ffff880051018f80: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk. [ 60.374763] Padding ffff880051018f90: bb bb bb bb bb bb bb bb c8 8d 01 51 00 88 ff ff ...........Q.... [ 60.376083] Padding ffff880051018fa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 60.377457] Padding ffff880051018fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 60.378768] Padding ffff880051018fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 60.380084] Padding ffff880051018fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 60.381475] Padding ffff880051018fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 60.382798] Padding ffff880051018ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 60.384121] FIX kmalloc-64: Restoring 0xffff880051018f50-0xffff880051018fff=0x5a [...] And basically a lot more of the above. Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache 2015-02-15 3:55 ` Sasha Levin @ 2015-02-15 9:47 ` Vladimir Davydov 0 siblings, 0 replies; 26+ messages in thread From: Vladimir Davydov @ 2015-02-15 9:47 UTC (permalink / raw) To: Sasha Levin Cc: Andrew Morton, Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel Hi, On Sat, Feb 14, 2015 at 10:55:15PM -0500, Sasha Levin wrote: > It seems that this patch causes shrink to corrupt memory: Yes, it does :-( The fix can be found here: https://lkml.org/lkml/2015/2/11/347 It must have already been merged to the -mm tree: On Thu, Feb 12, 2015 at 02:14:54PM -0800, akpm@linux-foundation.org wrote: > > The patch titled > Subject: slub: kmem_cache_shrink: fix crash due to uninitialized discard list > has been removed from the -mm tree. Its filename was > slub-never-fail-to-shrink-cache-init-discard-list-after-freeing-slabs.patch > > This patch was dropped because it was folded into slub-never-fail-to-shrink-cache.patch Thanks, Vladimir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
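The folded fix itself is not quoted in this thread. Judging from its title and from Sasha's report above, the problem is that the single discard list is reused for every node while discard_slab() frees the pages without unlinking them, so later node iterations can follow stale pointers into freed memory. The shape of the fix is therefore roughly the following (a sketch inferred from the patch title; the actual folded change may differ in detail):

	for_each_kmem_cache_node(s, node, n) {
		...
		/* Release empty slabs */
		list_for_each_entry_safe(page, t, &discard, lru)
			discard_slab(s, page);

		/*
		 * discard_slab() frees the pages but leaves them linked,
		 * so reinitialize the list before the next node reuses it.
		 */
		INIT_LIST_HEAD(&discard);
	}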
* [PATCH -mm v2 2/3] slub: fix kmem_cache_shrink return value 2015-01-28 16:22 [PATCH -mm v2 0/3] slub: make dead caches discard free slabs immediately Vladimir Davydov 2015-01-28 16:22 ` [PATCH -mm v2 1/3] slub: never fail to shrink cache Vladimir Davydov @ 2015-01-28 16:22 ` Vladimir Davydov 2015-01-28 16:33 ` Christoph Lameter 2015-01-28 16:22 ` [PATCH -mm v2 3/3] slub: make dead caches discard free slabs immediately Vladimir Davydov 2 siblings, 1 reply; 26+ messages in thread From: Vladimir Davydov @ 2015-01-28 16:22 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel It is supposed to return 0 if the cache has no remaining objects and 1 otherwise, while currently it always returns 0. Fix it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> --- mm/slub.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index dbf9334b6a5c..63abe52c2951 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3379,6 +3379,7 @@ int __kmem_cache_shrink(struct kmem_cache *s) LIST_HEAD(discard); struct list_head promote[SHRINK_PROMOTE_MAX]; unsigned long flags; + int ret = 0; for (i = 0; i < SHRINK_PROMOTE_MAX; i++) INIT_LIST_HEAD(promote + i); @@ -3419,6 +3420,9 @@ int __kmem_cache_shrink(struct kmem_cache *s) for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--) list_splice_init(promote + i, &n->partial); + if (n->nr_partial || slabs_node(s, node)) + ret = 1; + spin_unlock_irqrestore(&n->list_lock, flags); /* Release empty slabs */ @@ -3426,7 +3430,7 @@ int __kmem_cache_shrink(struct kmem_cache *s) discard_slab(s, page); } - return 0; + return ret; } static int slab_mem_going_offline_callback(void *arg) -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 2/3] slub: fix kmem_cache_shrink return value 2015-01-28 16:22 ` [PATCH -mm v2 2/3] slub: fix kmem_cache_shrink return value Vladimir Davydov @ 2015-01-28 16:33 ` Christoph Lameter 2015-01-28 17:46 ` Vladimir Davydov 0 siblings, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2015-01-28 16:33 UTC (permalink / raw) To: Vladimir Davydov Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Wed, 28 Jan 2015, Vladimir Davydov wrote: > @@ -3419,6 +3420,9 @@ int __kmem_cache_shrink(struct kmem_cache *s) > for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--) > list_splice_init(promote + i, &n->partial); > > + if (n->nr_partial || slabs_node(s, node)) The total number of slabs obtained via slabs_node always contains the number of partial ones. So no need to check n->nr_partial. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 2/3] slub: fix kmem_cache_shrink return value 2015-01-28 16:33 ` Christoph Lameter @ 2015-01-28 17:46 ` Vladimir Davydov 2015-01-28 19:19 ` Christoph Lameter 0 siblings, 1 reply; 26+ messages in thread From: Vladimir Davydov @ 2015-01-28 17:46 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Wed, Jan 28, 2015 at 10:33:50AM -0600, Christoph Lameter wrote: > On Wed, 28 Jan 2015, Vladimir Davydov wrote: > > > @@ -3419,6 +3420,9 @@ int __kmem_cache_shrink(struct kmem_cache *s) > > for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--) > > list_splice_init(promote + i, &n->partial); > > > > + if (n->nr_partial || slabs_node(s, node)) > > The total number of slabs obtained via slabs_node always contains the > number of partial ones. So no need to check n->nr_partial. Yeah, right. In addition to that I misplaced the check - it should go after discard_slab, where we decrement nr_slabs. Here goes the updated patch: From: Vladimir Davydov <vdavydov@parallels.com> Subject: [PATCH] slub: fix kmem_cache_shrink return value It is supposed to return 0 if the cache has no remaining objects and 1 otherwise, while currently it always returns 0. Fix it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> --- mm/slub.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index dbf9334b6a5c..5626588db884 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3379,6 +3379,7 @@ int __kmem_cache_shrink(struct kmem_cache *s) LIST_HEAD(discard); struct list_head promote[SHRINK_PROMOTE_MAX]; unsigned long flags; + int ret = 0; for (i = 0; i < SHRINK_PROMOTE_MAX; i++) INIT_LIST_HEAD(promote + i); @@ -3424,9 +3425,12 @@ int __kmem_cache_shrink(struct kmem_cache *s) /* Release empty slabs */ list_for_each_entry_safe(page, t, &discard, lru) discard_slab(s, page); + + if (slabs_node(s, node)) + ret = 1; } - return 0; + return ret; } static int slab_mem_going_offline_callback(void *arg) -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 2/3] slub: fix kmem_cache_shrink return value 2015-01-28 17:46 ` Vladimir Davydov @ 2015-01-28 19:19 ` Christoph Lameter 0 siblings, 0 replies; 26+ messages in thread From: Christoph Lameter @ 2015-01-28 19:19 UTC (permalink / raw) To: Vladimir Davydov Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Wed, 28 Jan 2015, Vladimir Davydov wrote: > Yeah, right. In addition to that I misplaced the check - it should go > after discard_slab, where we decrement nr_slabs. Here goes the updated > patch: > > From: Vladimir Davydov <vdavydov@parallels.com> > Subject: [PATCH] slub: fix kmem_cache_shrink return value > > It is supposed to return 0 if the cache has no remaining objects and 1 > otherwise, while currently it always returns 0. Fix it. Acked-by: Christoph Lameter <cl@linux.com> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH -mm v2 3/3] slub: make dead caches discard free slabs immediately 2015-01-28 16:22 [PATCH -mm v2 0/3] slub: make dead caches discard free slabs immediately Vladimir Davydov 2015-01-28 16:22 ` [PATCH -mm v2 1/3] slub: never fail to shrink cache Vladimir Davydov 2015-01-28 16:22 ` [PATCH -mm v2 2/3] slub: fix kmem_cache_shrink return value Vladimir Davydov @ 2015-01-28 16:22 ` Vladimir Davydov 2016-04-01 9:04 ` Peter Zijlstra 2 siblings, 1 reply; 26+ messages in thread From: Vladimir Davydov @ 2015-01-28 16:22 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel To speed up further allocations SLUB may store empty slabs in per cpu/node partial lists instead of freeing them immediately. This prevents per memcg caches destruction, because kmem caches created for a memory cgroup are only destroyed after the last page charged to the cgroup is freed. To fix this issue, this patch resurrects approach first proposed in [1]. It forbids SLUB to cache empty slabs after the memory cgroup that the cache belongs to was destroyed. It is achieved by setting kmem_cache's cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so that it would drop frozen empty slabs immediately if cpu_partial = 0. The runtime overhead is minimal. From all the hot functions, we only touch relatively cold put_cpu_partial(): we make it call unfreeze_partials() after freezing a slab that belongs to an offline memory cgroup. Since slab freezing exists to avoid moving slabs from/to a partial list on free/alloc, and there can't be allocations from dead caches, it shouldn't cause any overhead. We do have to disable preemption for put_cpu_partial() to achieve that though. The original patch was accepted well and even merged to the mm tree. However, I decided to withdraw it due to changes happening to the memcg core at that time. I had an idea of introducing per-memcg shrinkers for kmem caches, but now, as memcg has finally settled down, I do not see it as an option, because SLUB shrinker would be too costly to call since SLUB does not keep free slabs on a separate list. Besides, we currently do not even call per-memcg shrinkers for offline memcgs. Overall, it would introduce much more complexity to both SLUB and memcg than this small patch. Regarding to SLAB, there's no problem with it, because it shrinks per-cpu/node caches periodically. Thanks to list_lru reparenting, we no longer keep entries for offline cgroups in per-memcg arrays (such as memcg_cache_params->memcg_caches), so we do not have to bother if a per-memcg cache will be shrunk a bit later than it could be. 
[1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650 Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> --- mm/slab.c | 4 ++-- mm/slab.h | 2 +- mm/slab_common.c | 15 +++++++++++++-- mm/slob.c | 2 +- mm/slub.c | 31 ++++++++++++++++++++++++++----- 5 files changed, 43 insertions(+), 11 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 7894017bc160..c4b89eaf4c96 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2382,7 +2382,7 @@ out: return nr_freed; } -int __kmem_cache_shrink(struct kmem_cache *cachep) +int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate) { int ret = 0; int node; @@ -2404,7 +2404,7 @@ int __kmem_cache_shutdown(struct kmem_cache *cachep) { int i; struct kmem_cache_node *n; - int rc = __kmem_cache_shrink(cachep); + int rc = __kmem_cache_shrink(cachep, false); if (rc) return rc; diff --git a/mm/slab.h b/mm/slab.h index 0a56d76ac0e9..4c3ac12dd644 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -138,7 +138,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size, #define CACHE_CREATE_MASK (SLAB_CORE_FLAGS | SLAB_DEBUG_FLAGS | SLAB_CACHE_FLAGS) int __kmem_cache_shutdown(struct kmem_cache *); -int __kmem_cache_shrink(struct kmem_cache *); +int __kmem_cache_shrink(struct kmem_cache *, bool); void slab_kmem_cache_release(struct kmem_cache *); struct seq_file; diff --git a/mm/slab_common.c b/mm/slab_common.c index 0dd9eb4e0f87..9395842cfc6b 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -549,10 +549,13 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg) { int idx; struct memcg_cache_array *arr; - struct kmem_cache *s; + struct kmem_cache *s, *c; idx = memcg_cache_id(memcg); + get_online_cpus(); + get_online_mems(); + mutex_lock(&slab_mutex); list_for_each_entry(s, &slab_caches, list) { if (!is_root_cache(s)) @@ -560,9 +563,17 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg) arr = rcu_dereference_protected(s->memcg_params.memcg_caches, lockdep_is_held(&slab_mutex)); + c = arr->entries[idx]; + if (!c) + continue; + + __kmem_cache_shrink(c, true); arr->entries[idx] = NULL; } mutex_unlock(&slab_mutex); + + put_online_mems(); + put_online_cpus(); } void memcg_destroy_kmem_caches(struct mem_cgroup *memcg) @@ -649,7 +660,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep) get_online_cpus(); get_online_mems(); - ret = __kmem_cache_shrink(cachep); + ret = __kmem_cache_shrink(cachep, false); put_online_mems(); put_online_cpus(); return ret; diff --git a/mm/slob.c b/mm/slob.c index 96a86206a26b..94a7fede6d48 100644 --- a/mm/slob.c +++ b/mm/slob.c @@ -618,7 +618,7 @@ int __kmem_cache_shutdown(struct kmem_cache *c) return 0; } -int __kmem_cache_shrink(struct kmem_cache *d) +int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate) { return 0; } diff --git a/mm/slub.c b/mm/slub.c index 63abe52c2951..a91789f59ab6 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2007,6 +2007,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) int pages; int pobjects; + preempt_disable(); do { pages = 0; pobjects = 0; @@ -2040,6 +2041,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) != oldpage); + if (unlikely(!s->cpu_partial)) { + unsigned long flags; + + local_irq_save(flags); + unfreeze_partials(s, this_cpu_ptr(s->cpu_slab)); + local_irq_restore(flags); + } + preempt_enable(); #endif } @@ -3369,7 +3378,7 @@ EXPORT_SYMBOL(kfree); * being allocated from last increasing the chance that the last objects * 
are freed in them. */ -int __kmem_cache_shrink(struct kmem_cache *s) +int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate) { int node; int i; @@ -3381,14 +3390,26 @@ int __kmem_cache_shrink(struct kmem_cache *s) unsigned long flags; int ret = 0; + if (deactivate) { + /* + * Disable empty slabs caching. Used to avoid pinning offline + * memory cgroups by kmem pages that can be freed. + */ + s->cpu_partial = 0; + s->min_partial = 0; + + /* + * s->cpu_partial is checked locklessly (see put_cpu_partial), + * so we have to make sure the change is visible. + */ + kick_all_cpus_sync(); + } + for (i = 0; i < SHRINK_PROMOTE_MAX; i++) INIT_LIST_HEAD(promote + i); flush_all(s); for_each_kmem_cache_node(s, node, n) { - if (!n->nr_partial) - continue; - spin_lock_irqsave(&n->list_lock, flags); /* @@ -3439,7 +3460,7 @@ static int slab_mem_going_offline_callback(void *arg) mutex_lock(&slab_mutex); list_for_each_entry(s, &slab_caches, list) - __kmem_cache_shrink(s); + __kmem_cache_shrink(s, false); mutex_unlock(&slab_mutex); return 0; -- 1.7.10.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 3/3] slub: make dead caches discard free slabs immediately 2015-01-28 16:22 ` [PATCH -mm v2 3/3] slub: make dead caches discard free slabs immediately Vladimir Davydov @ 2016-04-01 9:04 ` Peter Zijlstra 2016-04-01 10:55 ` Vladimir Davydov 0 siblings, 1 reply; 26+ messages in thread From: Peter Zijlstra @ 2016-04-01 9:04 UTC (permalink / raw) To: Vladimir Davydov Cc: Andrew Morton, Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel On Wed, Jan 28, 2015 at 07:22:51PM +0300, Vladimir Davydov wrote: > +++ b/mm/slub.c > @@ -2007,6 +2007,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) > int pages; > int pobjects; > > + preempt_disable(); > do { > pages = 0; > pobjects = 0; > @@ -2040,6 +2041,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain) > > } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) > != oldpage); > + if (unlikely(!s->cpu_partial)) { > + unsigned long flags; > + > + local_irq_save(flags); > + unfreeze_partials(s, this_cpu_ptr(s->cpu_slab)); > + local_irq_restore(flags); > + } > + preempt_enable(); > #endif > } > > @@ -3369,7 +3378,7 @@ EXPORT_SYMBOL(kfree); > * being allocated from last increasing the chance that the last objects > * are freed in them. > */ > -int __kmem_cache_shrink(struct kmem_cache *s) > +int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate) > { > int node; > int i; > @@ -3381,14 +3390,26 @@ int __kmem_cache_shrink(struct kmem_cache *s) > unsigned long flags; > int ret = 0; > > + if (deactivate) { > + /* > + * Disable empty slabs caching. Used to avoid pinning offline > + * memory cgroups by kmem pages that can be freed. > + */ > + s->cpu_partial = 0; > + s->min_partial = 0; > + > + /* > + * s->cpu_partial is checked locklessly (see put_cpu_partial), > + * so we have to make sure the change is visible. > + */ > + kick_all_cpus_sync(); > + } Argh! what the heck! and without a single mention in the changelog. Why are you spraying IPIs across the entire machine? Why isn't synchronize_sched() good enough, that would allow you to get rid of the local_irq_save/restore as well. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH -mm v2 3/3] slub: make dead caches discard free slabs immediately

From: Vladimir Davydov @ 2016-04-01 10:55 UTC
To: Peter Zijlstra
Cc: Andrew Morton, Christoph Lameter, Joonsoo Kim, Pekka Enberg,
    David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Fri, Apr 01, 2016 at 11:04:41AM +0200, Peter Zijlstra wrote:
> On Wed, Jan 28, 2015 at 07:22:51PM +0300, Vladimir Davydov wrote:
> > +++ b/mm/slub.c
> > @@ -2007,6 +2007,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> >  	int pages;
> >  	int pobjects;
> >  
> > +	preempt_disable();
> >  	do {
> >  		pages = 0;
> >  		pobjects = 0;
> > @@ -2040,6 +2041,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> >  
> >  	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
> >  								!= oldpage);
> > +	if (unlikely(!s->cpu_partial)) {
> > +		unsigned long flags;
> > +
> > +		local_irq_save(flags);
> > +		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
> > +		local_irq_restore(flags);
> > +	}
> > +	preempt_enable();
> >  #endif
> >  }
> >  
> > @@ -3369,7 +3378,7 @@ EXPORT_SYMBOL(kfree);
> >   * being allocated from last increasing the chance that the last objects
> >   * are freed in them.
> >   */
> > -int __kmem_cache_shrink(struct kmem_cache *s)
> > +int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
> >  {
> >  	int node;
> >  	int i;
> > @@ -3381,14 +3390,26 @@ int __kmem_cache_shrink(struct kmem_cache *s)
> >  	unsigned long flags;
> >  	int ret = 0;
> >  
> > +	if (deactivate) {
> > +		/*
> > +		 * Disable empty slabs caching. Used to avoid pinning offline
> > +		 * memory cgroups by kmem pages that can be freed.
> > +		 */
> > +		s->cpu_partial = 0;
> > +		s->min_partial = 0;
> > +
> > +		/*
> > +		 * s->cpu_partial is checked locklessly (see put_cpu_partial),
> > +		 * so we have to make sure the change is visible.
> > +		 */
> > +		kick_all_cpus_sync();
> > +	}
>
> Argh! what the heck! and without a single mention in the changelog.

This function is only called when a memory cgroup is removed, which is
rather a rare event. I didn't think it would cause any pain. Sorry.

> Why are you spraying IPIs across the entire machine? Why isn't
> synchronize_sched() good enough, that would allow you to get rid of the
> local_irq_save/restore as well.

synchronize_sched() is slower. Calling it for every per memcg kmem cache
would slow down cleanup on cgroup removal. The latter is async, so I'm
not sure if it would be a problem though. I think we can try to replace
kick_all_cpus_sync() with synchronize_sched() here.

Regarding local_irq_save/restore - synchronize_sched() wouldn't allow us
to get rid of them, because unfreeze_partials() must be called with irqs
disabled.

Come to think of it, kick_all_cpus_sync() is used as a memory barrier
here, so as to make sure that after it's finished all cpus will use the
new ->cpu_partial value, which makes me wonder if we could replace it
with a simple smp_mb. I mean, this_cpu_cmpxchg(), which is used by
put_cpu_partial to add a page to per-cpu partial list, must issue a full
memory barrier (am I correct?), so we have two possibilities here:

Case 1: smp_mb is called before this_cpu_cmpxchg is called on another
cpu executing put_cpu_partial. In this case, put_cpu_partial will see
cpu_partial == 0 and hence call the second unfreeze_partials, flushing
the per-cpu partial list.

Case 2: smp_mb is called after this_cpu_cmpxchg. Then __kmem_cache_shrink
-> flush_all -> has_cpu_slab should see (thanks to the barriers) that
there's a slab on a per-cpu list and so flush it (provided it hasn't
already been flushed by put_cpu_partial).

In any case, after __kmem_cache_shrink has finished, we are guaranteed
not to have any slabs on per-cpu partial lists.

Does it make sense?

Thanks,
Vladimir
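Spelled out as code, the two cases reduce to a store/load pairing between
the shrink path and the free path. The sketch below is illustrative only:
the function names are invented, the mm/slub.c context is assumed, and the
comment marks the assumption (full-barrier semantics for this_cpu_cmpxchg())
that the next message rejects.

/* CPU A: deactivation side, with the smp_mb() Vladimir is proposing. */
static void shrink_side_sketch(struct kmem_cache *s)
{
	s->cpu_partial = 0;
	s->min_partial = 0;
	smp_mb();	/* hypothetical replacement for kick_all_cpus_sync() */
	flush_all(s);	/* has_cpu_slab() sees an already-published page: case 2 */
}

/* CPU B: freeing side, reduced to the tail of put_cpu_partial(). */
static void free_side_sketch(struct kmem_cache *s, struct page *oldpage,
			     struct page *page)
{
	/* retry loop elided; publish the page on the per-cpu partial list */
	this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page);
	/* ^ assumed here to act as a full barrier */

	if (unlikely(!s->cpu_partial)) {	/* sees 0: case 1 */
		unsigned long flags;

		local_irq_save(flags);
		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
		local_irq_restore(flags);
	}
}

With full barriers on both sides this is the classic store-buffering
pattern: CPU A cannot miss the freshly published partial page while CPU B
simultaneously misses ->cpu_partial == 0, so at least one of the two paths
drains the list.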
* Re: [PATCH -mm v2 3/3] slub: make dead caches discard free slabs immediately

From: Peter Zijlstra @ 2016-04-01 11:41 UTC
To: Vladimir Davydov
Cc: Andrew Morton, Christoph Lameter, Joonsoo Kim, Pekka Enberg,
    David Rientjes, Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Fri, Apr 01, 2016 at 01:55:40PM +0300, Vladimir Davydov wrote:
> > > +	if (deactivate) {
> > > +		/*
> > > +		 * Disable empty slabs caching. Used to avoid pinning offline
> > > +		 * memory cgroups by kmem pages that can be freed.
> > > +		 */
> > > +		s->cpu_partial = 0;
> > > +		s->min_partial = 0;
> > > +
> > > +		/*
> > > +		 * s->cpu_partial is checked locklessly (see put_cpu_partial),
> > > +		 * so we have to make sure the change is visible.
> > > +		 */
> > > +		kick_all_cpus_sync();
> > > +	}
> >
> > Argh! what the heck! and without a single mention in the changelog.
>
> This function is only called when a memory cgroup is removed, which is
> rather a rare event. I didn't think it would cause any pain. Sorry.

Suppose you have a bunch of CPUs running HPC/RT code and someone causes
the admin CPUs to create/destroy a few cgroups.

> > Why are you spraying IPIs across the entire machine? Why isn't
> > synchronize_sched() good enough, that would allow you to get rid of the
> > local_irq_save/restore as well.
>
> synchronize_sched() is slower. Calling it for every per memcg kmem cache
> would slow down cleanup on cgroup removal.

Right, but who cares? cgroup removal isn't a fast path by any standard.

> Regarding local_irq_save/restore - synchronize_sched() wouldn't allow us
> to get rid of them, because unfreeze_partials() must be called with irqs
> disabled.

OK, I figured it was because it needed to be serialized against this
kick_all_cpus_sync() IPI.

> Come to think of it, kick_all_cpus_sync() is used as a memory barrier
> here, so as to make sure that after it's finished all cpus will use the
> new ->cpu_partial value, which makes me wonder if we could replace it
> with a simple smp_mb. I mean, this_cpu_cmpxchg(), which is used by
> put_cpu_partial to add a page to per-cpu partial list, must issue a full
> memory barrier (am I correct?), so we have two possibilities here:

Nope, this_cpu_cmpxchg() does not imply a memory barrier.
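this_cpu operations are only guaranteed to be atomic with respect to
interrupts and preemption on the local CPU; they carry no SMP ordering. So
for the case analysis above to work, the free path itself would need an
explicit barrier, roughly like the fragment below (a sketch of the tail of
put_cpu_partial(), not code from the thread). Avoiding that per-free cost
is what the IPI or grace-period approaches buy.

	/* retry loop of put_cpu_partial() elided; publish the page */
	this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page);

	/*
	 * Not implied by this_cpu_cmpxchg(): an explicit barrier would be
	 * needed to order the publish above against the lockless
	 * ->cpu_partial check below, pairing with a barrier on the
	 * shrink side.
	 */
	smp_mb();

	if (unlikely(!s->cpu_partial)) {
		unsigned long flags;

		local_irq_save(flags);
		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
		local_irq_restore(flags);
	}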
Thread overview: 26+ messages

2015-01-28 16:22 [PATCH -mm v2 0/3] slub: make dead caches discard free slabs immediately Vladimir Davydov
2015-01-28 16:22 ` [PATCH -mm v2 1/3] slub: never fail to shrink cache Vladimir Davydov
2015-01-28 16:31 ` Christoph Lameter
2015-01-28 18:29 ` Pekka Enberg
2015-01-28 16:37 ` Christoph Lameter
2015-01-28 17:32 ` Vladimir Davydov
2015-01-28 19:20 ` Christoph Lameter
2015-01-28 21:57 ` Andrew Morton
2015-01-28 22:56 ` Christoph Lameter
2015-01-29  8:07 ` Vladimir Davydov
2015-01-29 15:55 ` Christoph Lameter
2015-01-29 16:17 ` Vladimir Davydov
2015-01-29 16:22 ` Christoph Lameter
2015-01-29 18:21 ` Vladimir Davydov
2015-01-29 19:10 ` Christoph Lameter
2015-01-29  8:32 ` Balbir Singh
2015-02-15  3:55 ` Sasha Levin
2015-02-15  9:47 ` Vladimir Davydov
2015-01-28 16:22 ` [PATCH -mm v2 2/3] slub: fix kmem_cache_shrink return value Vladimir Davydov
2015-01-28 16:33 ` Christoph Lameter
2015-01-28 17:46 ` Vladimir Davydov
2015-01-28 19:19 ` Christoph Lameter
2015-01-28 16:22 ` [PATCH -mm v2 3/3] slub: make dead caches discard free slabs immediately Vladimir Davydov
2016-04-01  9:04 ` Peter Zijlstra
2016-04-01 10:55 ` Vladimir Davydov
2016-04-01 11:41 ` Peter Zijlstra