* [git pull] vfs.git bits and pieces
@ 2013-11-20 17:42 Al Viro
2013-11-20 17:47 ` Al Viro
0 siblings, 1 reply; 8+ messages in thread
From: Al Viro @ 2013-11-20 17:42 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel
[Apologies for delay; I'd spent the last day hunting down something that
turned out to be a VM leak completely unrelated to this stuff - it's
present in mainline, for starters. Unreliable reproducers make for fun
bisects ;-/ Anyway, by now I'm absolutely sure that this is a VM bug and
not something I had somehow managed to break, so...]
Assorted bits that got missed in the first pull request + fixes for
a couple of coredump regressions. Please, pull from the usual place -
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus
Shortlog:
Al Viro (7):
dump_align(): fix the dumb braino
dump_emit(): use __kernel_write(), not vfs_write()
gfs2: endianness misannotations
consolidate simple ->d_delete() instances
take read_seqbegin_or_lock() and friends to seqlock.h
dcache.c: get rid of pointless macros
fold try_to_ascend() into the sole remaining caller
Diffstat:
arch/ia64/kernel/perfmon.c | 8 +----
fs/9p/vfs_dentry.c | 19 +---------
fs/configfs/dir.c | 12 +------
fs/coredump.c | 6 ++--
fs/dcache.c | 84 ++++++++++---------------------------------
fs/efivarfs/super.c | 11 +-----
fs/gfs2/lock_dlm.c | 8 ++--
fs/gfs2/quota.c | 23 +++++-------
fs/gfs2/rgrp.c | 4 +-
fs/hostfs/hostfs_kern.c | 11 +-----
fs/libfs.c | 12 ++++---
fs/proc/generic.c | 18 +---------
fs/proc/namespaces.c | 8 +----
include/linux/fs.h | 2 +
include/linux/seqlock.h | 29 +++++++++++++++
kernel/cgroup.c | 7 +---
net/sunrpc/rpc_pipe.c | 11 +-----
17 files changed, 86 insertions(+), 187 deletions(-)
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [git pull] vfs.git bits and pieces
2013-11-20 17:42 [git pull] vfs.git bits and pieces Al Viro
@ 2013-11-20 17:47 ` Al Viro
2013-11-20 22:16 ` Al Viro
2013-11-20 22:33 ` Linus Torvalds
0 siblings, 2 replies; 8+ messages in thread
From: Al Viro @ 2013-11-20 17:47 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel
On Wed, Nov 20, 2013 at 05:42:11PM +0000, Al Viro wrote:
> [Apologies for delay; I'd spent the last day hunting down something that
> turned out to be a VM leak completely unrelated to this stuff - it's
> present in mainline, for starters. Unreliable reproducers make for fun
> bisects ;-/ Anyway, by now I'm absolutely sure that this is a VM bug and
> not something I had somehow managed to break, so...]
BTW, something odd happened to mm/memory.c - either a mangled patch
or a lost followup. Take a look at the last commit in there:
commit ea1e7ed33708c7a760419ff9ded0a6cb90586a50
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date: Thu Nov 14 14:31:53 2013 -0800
mm: create a separate slab for page->ptl allocation
If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled spinlock_t on x86_64
is 72 bytes. For page->ptl they will be allocated from kmalloc-96 slab,
so we loose 24 on each. An average system can easily allocate few tens
thousands of page->ptl and overhead is significant.
Let's create a separate slab for page->ptl allocation to solve this.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fair enough, and yes, it does create that separate slab. The problem is,
it's still using kmalloc/kfree for those beasts - page_ptl_cachep isn't
used at all...
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [git pull] vfs.git bits and pieces
2013-11-20 17:47 ` Al Viro
@ 2013-11-20 22:16 ` Al Viro
2013-11-20 22:24 ` Joe Perches
2013-11-20 22:33 ` Linus Torvalds
1 sibling, 1 reply; 8+ messages in thread
From: Al Viro @ 2013-11-20 22:16 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel, Kirill A. Shutemov
On Wed, Nov 20, 2013 at 05:47:12PM +0000, Al Viro wrote:
> On Wed, Nov 20, 2013 at 05:42:11PM +0000, Al Viro wrote:
> > [Apologies for delay; I'd spent the last day hunting down something that
> > turned out to be a VM leak completely unrelated to this stuff - it's
> > present in mainline, for starters. Unreliable reproducers make for fun
> > bisects ;-/ Anyway, by now I'm absolutely sure that this is a VM bug and
> > not something I had somehow managed to break, so...]
>
> BTW, something odd happened to mm/memory.c - either a mangled patch
> or a lost followup. Take a look at the last commit in there:
> commit ea1e7ed33708c7a760419ff9ded0a6cb90586a50
> Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Date: Thu Nov 14 14:31:53 2013 -0800
>
> mm: create a separate slab for page->ptl allocation
>
> If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled spinlock_t on x86_64
> is 72 bytes. For page->ptl they will be allocated from kmalloc-96 slab,
> so we loose 24 on each. An average system can easily allocate few tens
> thousands of page->ptl and overhead is significant.
>
> Let's create a separate slab for page->ptl allocation to solve this.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> Fair enough, and yes, it does create that separate slab. The problem is,
> it's still using kmalloc/kfree for those beasts - page_ptl_cachep isn't
> used at all...
While digging in the same area:
Wrong page freed on preallocate_pmds() failure exit
Note that pmds[i] is simply uninitialized at that point...
Granted, it's very hard to hit (you need split page locks
*and* kmalloc(sizeof(spinlock_t), GFP_KERNEL) failing),
but the code is obviously bogus.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index a7cccb6d..36aa999 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -209,7 +209,7 @@ static int preallocate_pmds(pmd_t *pmds[])
if (!pmd)
failed = true;
if (pmd && !pgtable_pmd_page_ctor(virt_to_page(pmd))) {
- free_page((unsigned long)pmds[i]);
+ free_page((unsigned long)pmd);
pmd = NULL;
failed = true;
}
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [git pull] vfs.git bits and pieces
2013-11-20 22:16 ` Al Viro
@ 2013-11-20 22:24 ` Joe Perches
0 siblings, 0 replies; 8+ messages in thread
From: Joe Perches @ 2013-11-20 22:24 UTC (permalink / raw)
To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel, Kirill A. Shutemov
On Wed, 2013-11-20 at 22:16 +0000, Al Viro wrote:
> While digging in the same area:
>
> Wrong page freed on preallocate_pmds() failure exit
>
> Note that pmds[i] is simply uninitialized at that point...
> Granted, it's very hard to hit (you need split page locks
> *and* kmalloc(sizeof(spinlock_t), GFP_KERNEL) failing),
> but the code is obviously bogus.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index a7cccb6d..36aa999 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -209,7 +209,7 @@ static int preallocate_pmds(pmd_t *pmds[])
> if (!pmd)
> failed = true;
> if (pmd && !pgtable_pmd_page_ctor(virt_to_page(pmd))) {
> - free_page((unsigned long)pmds[i]);
> + free_page((unsigned long)pmd);
> pmd = NULL;
> failed = true;
> }
trivia: It'd probably read better with an else too
if (!pmd) {
failed = true;
} else if (!pgtable_pmd_page_ctor(virt_to_page(pmd))) {
etc...
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [git pull] vfs.git bits and pieces
2013-11-20 17:47 ` Al Viro
2013-11-20 22:16 ` Al Viro
@ 2013-11-20 22:33 ` Linus Torvalds
2013-11-20 22:40 ` Andrew Morton
2013-11-20 22:42 ` Damien Wyart
1 sibling, 2 replies; 8+ messages in thread
From: Linus Torvalds @ 2013-11-20 22:33 UTC (permalink / raw)
To: Al Viro, Kirill A. Shutemov, Andrew Morton
Cc: Linux Kernel Mailing List, linux-fsdevel
On Wed, Nov 20, 2013 at 9:47 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> BTW, something odd happened to mm/memory.c - either a mangled patch
> or a lost followup:
>
> commit ea1e7ed33708
> mm: create a separate slab for page->ptl allocation
>
> Fair enough, and yes, it does create that separate slab. The problem is,
> it's still using kmalloc/kfree for those beasts - page_ptl_cachep isn't
> used at all...
Ok, it looks straightforward enough to just replace the kmalloc/kfree
with using a slab allocation using the page_ptl_cachep pointer. I'd do
it myself, but I would like to know how it got lost? Also, much
testing to make sure the cachep is initialized early enough.
Or should we just revert the commit that added the pointless/unused
slab pointer?
Andrew, Kirill, comments?
Also note the other issue Al found: see commit 2a46eed54a28 ("Wrong
page freed on preallocate_pmds() failure exit") that I just pushed
out.
Linus
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [git pull] vfs.git bits and pieces
2013-11-20 22:33 ` Linus Torvalds
@ 2013-11-20 22:40 ` Andrew Morton
2013-11-21 11:19 ` Kirill A. Shutemov
2013-11-20 22:42 ` Damien Wyart
1 sibling, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2013-11-20 22:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Al Viro, Kirill A. Shutemov, Linux Kernel Mailing List,
linux-fsdevel
On Wed, 20 Nov 2013 14:33:35 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, Nov 20, 2013 at 9:47 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > BTW, something odd happened to mm/memory.c - either a mangled patch
> > or a lost followup:
> >
> > commit ea1e7ed33708
> > mm: create a separate slab for page->ptl allocation
> >
> > Fair enough, and yes, it does create that separate slab. The problem is,
> > it's still using kmalloc/kfree for those beasts - page_ptl_cachep isn't
> > used at all...
>
> Ok, it looks straightforward enough to just replace the kmalloc/kfree
> with using a slab allocation using the page_ptl_cachep pointer. I'd do
> it myself, but I would like to know how it got lost? Also, much
> testing to make sure the cachep is initialized early enough.
agh, I went through hell keeping that patch alive and it appears I lost
some of it.
> Or should we just revert the commit that added the pointless/unused
> slab pointer?
>
> Andrew, Kirill, comments?
Let's just kill it please. We can try again for 3.14.
> Also note the other issue Al found: see commit 2a46eed54a28 ("Wrong
> page freed on preallocate_pmds() failure exit") that I just pushed
> out.
lgtm, thanks.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [git pull] vfs.git bits and pieces
2013-11-20 22:33 ` Linus Torvalds
2013-11-20 22:40 ` Andrew Morton
@ 2013-11-20 22:42 ` Damien Wyart
1 sibling, 0 replies; 8+ messages in thread
From: Damien Wyart @ 2013-11-20 22:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Al Viro, Kirill A. Shutemov, Andrew Morton,
Linux Kernel Mailing List, linux-fsdevel
> > BTW, something odd happened to mm/memory.c - either a mangled patch
> > or a lost followup:
> > commit ea1e7ed33708
> > mm: create a separate slab for page->ptl allocation
> > Fair enough, and yes, it does create that separate slab. The problem
> > is, it's still using kmalloc/kfree for those beasts -
> > page_ptl_cachep isn't used at all...
* Linus Torvalds <torvalds@linux-foundation.org> [2013-11-20 14:33]:
> Ok, it looks straightforward enough to just replace the kmalloc/kfree
> with using a slab allocation using the page_ptl_cachep pointer. I'd do
> it myself, but I would like to know how it got lost? Also, much
> testing to make sure the cachep is initialized early enough.
The initial sending had the proper hunks at the end, so something really
got lost afterwards...
https://lkml.org/lkml/2013/10/22/129
--
Damien Wyart
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [git pull] vfs.git bits and pieces
2013-11-20 22:40 ` Andrew Morton
@ 2013-11-21 11:19 ` Kirill A. Shutemov
0 siblings, 0 replies; 8+ messages in thread
From: Kirill A. Shutemov @ 2013-11-21 11:19 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton
Cc: Al Viro, Kirill A. Shutemov, Linux Kernel Mailing List,
linux-fsdevel, Ingo Molnar, Peter Zijlstra, linux-mm
Andrew Morton wrote:
> On Wed, 20 Nov 2013 14:33:35 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Wed, Nov 20, 2013 at 9:47 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > BTW, something odd happened to mm/memory.c - either a mangled patch
> > > or a lost followup:
> > >
> > > commit ea1e7ed33708
> > > mm: create a separate slab for page->ptl allocation
> > >
> > > Fair enough, and yes, it does create that separate slab. The problem is,
> > > it's still using kmalloc/kfree for those beasts - page_ptl_cachep isn't
> > > used at all...
> >
> > Ok, it looks straightforward enough to just replace the kmalloc/kfree
> > with using a slab allocation using the page_ptl_cachep pointer. I'd do
> > it myself, but I would like to know how it got lost? Also, much
> > testing to make sure the cachep is initialized early enough.
>
> agh, I went through hell keeping that patch alive and it appears I lost
> some of it.
Actually, I've lost it while adding BLOATED_SPINLOCKS :(
> > Or should we just revert the commit that added the pointless/unused
> > slab pointer?
> >
> > Andrew, Kirill, comments?
>
> Let's just kill it please. We can try again for 3.14.
I'm okay with that.
Only side note: it's useful not only for debug case, but also for
PREEMPT_RT where spinlock_t is always bloated.
Fixed patch:
>From e624075b47caa2a15998225df7cec953d271b9ac Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Thu, 14 Nov 2013 14:31:53 -0800
Subject: [PATCH] mm: create a separate slab for page->ptl allocation, try two
If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled spinlock_t on x86_64
is 72 bytes. For page->ptl they will be allocated from kmalloc-96 slab,
so we loose 24 on each. An average system can easily allocate few tens
thousands of page->ptl and overhead is significant.
Let's create a separate slab for page->ptl allocation to solve this.
To make sure that it really works this time, some numbers from my test
machine (just booted, no load):
Before:
# grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
kmalloc-96 31987 32190 128 30 1 : tunables 120 60 8 : slabdata 1073 1073 92
After:
# grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
page->ptl 27516 28143 72 53 1 : tunables 120 60 8 : slabdata 531 531 9
kmalloc-96 3853 5280 128 30 1 : tunables 120 60 8 : slabdata 176 176 0
Note that the patch is useful not only for debug case, but also for
PREEMPT_RT, where spinlock_t is always bloated.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
include/linux/mm.h | 9 +++++++++
init/main.c | 2 +-
mm/memory.c | 11 +++++++++--
3 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1cedd000cf29..0548eb201e05 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1318,6 +1318,7 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
#if USE_SPLIT_PTE_PTLOCKS
#if BLOATED_SPINLOCKS
+void __init ptlock_cache_init(void);
extern bool ptlock_alloc(struct page *page);
extern void ptlock_free(struct page *page);
@@ -1326,6 +1327,7 @@ static inline spinlock_t *ptlock_ptr(struct page *page)
return page->ptl;
}
#else /* BLOATED_SPINLOCKS */
+static inline void ptlock_cache_init(void) {}
static inline bool ptlock_alloc(struct page *page)
{
return true;
@@ -1378,10 +1380,17 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
return &mm->page_table_lock;
}
+static inline void ptlock_cache_init(void) {}
static inline bool ptlock_init(struct page *page) { return true; }
static inline void pte_lock_deinit(struct page *page) {}
#endif /* USE_SPLIT_PTE_PTLOCKS */
+static inline void pgtable_init(void)
+{
+ ptlock_cache_init();
+ pgtable_cache_init();
+}
+
static inline bool pgtable_page_ctor(struct page *page)
{
inc_zone_page_state(page, NR_PAGETABLE);
diff --git a/init/main.c b/init/main.c
index febc511e078a..01573fdfa186 100644
--- a/init/main.c
+++ b/init/main.c
@@ -476,7 +476,7 @@ static void __init mm_init(void)
mem_init();
kmem_cache_init();
percpu_init_late();
- pgtable_cache_init();
+ pgtable_init();
vmalloc_init();
}
diff --git a/mm/memory.c b/mm/memory.c
index 5d9025f3b3e1..cf6098c10084 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4272,11 +4272,18 @@ void copy_user_huge_page(struct page *dst, struct page *src,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
#if USE_SPLIT_PTE_PTLOCKS && BLOATED_SPINLOCKS
+static struct kmem_cache *page_ptl_cachep;
+void __init ptlock_cache_init(void)
+{
+ page_ptl_cachep = kmem_cache_create("page->ptl", sizeof(spinlock_t), 0,
+ SLAB_PANIC, NULL);
+}
+
bool ptlock_alloc(struct page *page)
{
spinlock_t *ptl;
- ptl = kmalloc(sizeof(spinlock_t), GFP_KERNEL);
+ ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);
if (!ptl)
return false;
page->ptl = ptl;
@@ -4285,6 +4292,6 @@ bool ptlock_alloc(struct page *page)
void ptlock_free(struct page *page)
{
- kfree(page->ptl);
+ kmem_cache_free(page_ptl_cachep, page->ptl);
}
#endif
--
Kirill A. Shutemov
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 8+ messages in thread
end of thread, other threads:[~2013-11-21 11:19 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-11-20 17:42 [git pull] vfs.git bits and pieces Al Viro
2013-11-20 17:47 ` Al Viro
2013-11-20 22:16 ` Al Viro
2013-11-20 22:24 ` Joe Perches
2013-11-20 22:33 ` Linus Torvalds
2013-11-20 22:40 ` Andrew Morton
2013-11-21 11:19 ` Kirill A. Shutemov
2013-11-20 22:42 ` Damien Wyart
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).